Technical Assessment of Open Data Platforms for National Statistical Organisations

18 October 2014

World Bank Group

© 2014 International Bank for Reconstruction and Development / The World Bank
1818 H Street NW, Washington DC 20433
Telephone: 202-473-1000; Internet: www.worldbank.org

The findings, interpretations, and conclusions expressed in this work do not necessarily reflect the views of The World Bank, its Board of Executive Directors, or the governments they represent. The World Bank does not guarantee the accuracy of the data included in this work. The boundaries, colors, denominations, and other information shown on any map in this work do not imply any judgment on the part of The World Bank concerning the legal status of any territory or the endorsement or acceptance of such boundaries.

Rights and Permissions

The material in this work is subject to copyright. Because The World Bank encourages dissemination of its knowledge, this work may be reproduced, in whole or in part, for noncommercial purposes as long as full attribution to this work is given. Any queries on rights and licenses, including subsidiary rights, should be addressed to World Bank Publications, The World Bank Group, 1818 H Street NW, Washington, DC 20433, USA; fax: 202-522-2625; e-mail: pubrights@worldbank.org.

For questions or comments concerning this working paper, please contact Timothy Herzog (therzog1@worldbank.org) or Amparo Ballivian (aballivian@worldbank.org).

Contents

1 Executive Summary
2 Introduction to the Technical Assessment
  2.1 Objectives of this report
  2.2 Who should read this report
  2.3 What this report does not cover
3 Overview of data publication and management
  3.1 Open Data
  3.2 Metadata
  3.3 Microdata and generalised data
  3.4 Proprietary file formats
  3.5 Data structure and linked data
  3.6 Software development and deployment
4 Requirements and components for data publication systems
  4.1 Criteria for Open Data
    4.1.1 Descriptive metadata
    4.1.2 Machine-readable datasets
    4.1.3 Anonymous access
    4.1.4 Data reuse and release licenses
    4.1.5 Data attribution to source
    4.1.6 Search for data discovery
    4.1.7 Application Programming Interfaces (APIs) are public
    4.1.8 Datasets are reachable via persistent URI
    4.1.9 Automated data harvesting
    4.1.10 Federation of multiple data sites
    4.1.11 Public documentation
    4.1.12 Compliance with generally accepted standards
  4.2 Criteria for National Statistics Offices data publication
    4.2.1 Structural metadata
    4.2.2 OLAP hypercubes
    4.2.3 Data endpoints
    4.2.4 Online analysis and visualisation
    4.2.5 User-experience and software customisation
5 Review of Open Data Publication Systems
  5.1 CKAN
  5.2 DevInfo
  5.3 DKAN
  5.4 Junar
  5.5 NADA
  5.6 Nesstar
  5.7 OpenDataSoft
  5.8 PC-Axis and PX-Web
  5.9 Prognoz
  5.10 Semantic MediaWiki
  5.11 Socrata
  5.12 Swirrl
6 Conclusions and recommendations
  6.1 Improve technical documentation
  6.2 Ensure public APIs and endpoints are interoperable
  6.3 Presentation of metadata and URIs must conform to W3C standards
  6.4 Natural language search and metadata faceting should be standard
  6.5 Structural metadata and hypercube support are core NSO requirements
  6.6 Dashboards and visualisations are necessary for user engagement
  6.7 Develop data engagement tools for improving data-quality and reuse
7 Acknowledgements and Research Methodology
8 Glossary
9 References

1 Executive Summary

National Statistics Offices (NSOs) have the potential to play a pivotal role in the implementation of open data initiatives. As producers and curators of data, the objective of making high quality data more accessible and usable is consistent with their guiding principles. In research conducted in support of this report, NSOs indicate that one of the difficulties they encounter is that the technology they use to publish - or electronically distribute - data for public use is not compatible with open formats. They also indicate that common software packages used for open data portals do not accommodate the data formats and metadata they produce.

This research report is intended to provide a better understanding and assessment of the technical issues related to data dissemination tools that NSOs use (or could use) to distribute data to the public under an open data initiative. The report defines a list of key criteria and evaluates relevant technology products against those criteria. Two key concerns related to data dissemination products are addressed:

1. Can products designed primarily for NSOs satisfy the requirements of an open data initiative?
2. Can products designed primarily for open data satisfy the requirements of NSOs?

The main audiences for this report are NSO staff, particularly managers and directors, seeking to better understand technology issues relevant to the dissemination of open data. Software developers, donor agencies and consultants wishing to strengthen open data systems and their producers will also find it useful.

This report is limited to a technical discussion of data dissemination platforms; it does not cover up-stream data production and curation issues. Nor does it cover a host of non-technology issues that are nonetheless essential for NSOs to address, such as user engagement, privacy protections, resource constraints and data quality. Many of these issues are explored in depth in Open Data Challenges and Opportunities for National Statistical Offices [1].
The term "open data" is generally understood to mean data that are made available to the public free of charge, without registration or restrictive licenses, for any purpose whatsoever (including commercial purposes), in electronic, machine-readable formats that ensure data are easy to find, download and use. Data reuse, both by data experts and the public at large, is key to creating new opportunities and benefits from government data.

A visual representation of the various data and software components which form part of an overall open data platform is presented in Figure 2 (Section 4). These components are often served by different software systems. Deployments should ensure system interoperability: different software systems should not result in duplicated functionality, or in needless requirements for manual data conversion, where such systems share common data resources.

Open data distribution software systems were not originally designed to serve NSOs, nor were NSO systems originally designed for open data. While some data publication platforms offer Extract, Transform and Load, Business Intelligence and Content Management functionality, this report concentrates on Data Discovery and Publication as the core requirements for any open data distribution software.

Besides the specific components of an open data system, NSOs must also consider the metadata standards they need to support, as well as their procedures for disseminating microdata. Dissemination of microdata is of particular concern to NSOs because of the need to protect the privacy of respondents, among other concerns. It is common for NSOs to have a separate set of dissemination policies - and separate dissemination platforms - for microdata, and there are already platforms and approaches that are well suited for this purpose. Data distributed in proprietary file formats - SAS, STATA, SPSS and so forth - must also be made available in more generally accessible machine-readable formats to meet open data best practices. Data dissemination in proprietary formats does not preclude dissemination in open formats, and vice versa.

Below is a complete list of assessment criteria for the software platforms described in this report, in the order in which they are presented:

• Descriptive metadata: corresponds to external metadata, typically used for discovery and identification; information used to search for and locate an object, such as title, author, subjects, keywords and publisher.
• Machine-readable: data available as machine-readable structured data in a non-proprietary format can be easily read and used by software systems without human interpretation.
• Anonymous access: users can search for and access data and metadata without having to identify themselves, create a user account, or receive advance permission.
• Data licences: data licenses (terms of use) associated with each dataset are clearly presented to the user and permit reuse and republication of that data in any alternative form.
• Data attribution: users can cite, attribute, and link to datasets, and contact data owners if they have questions.
• Search: search results should return focused summaries of datasets, along with keywords which aid classification, and the option of reviewing the data online to assess its content.
• Application Programming Interface (API): platforms make their contents available to external systems by supporting programmatic queries and access to metadata and resources.
• Uniform Resource Identifiers (URI): platforms make datasets available at persistent URIs that never change, allowing them to be referenced externally and reliably.
• Harvesting: an automated and autonomous mechanism for the ETL of known data from known web-addressable locations into a single database or datastore.
• Federation: a meta-database management system which transparently maps multiple autonomous database systems into a single federated database, allowing discrete data publishing systems to be integrated yet operate independently.
• Public documentation: data platforms provide comprehensive information for developers and the general public on how the platform works; such documentation should be updated with each new software release.
• Standards-based: platforms are consistent with emerging standards recognised by the W3C, especially as regards metadata, RDF, and hypercubes.
• Structural metadata: corresponds to internal metadata about the structure of database objects such as tables, columns, keys and indexes.
• Online Analytical Processing (OLAP) hypercube, or cube: supports the ability to analyse multidimensional data interactively from multiple perspectives, including the ability to consolidate, drill down, or slice and dice data.
• Data endpoints: structured data endpoints return data in predictable ways; these can be as simple as a known serialisation format, while more complex implementations permit the data to be queried, filtered or refined prior to download.
• Visualisation: provides tools to present data as common charts or maps, or to perform more complex statistical analysis.
• UX & S/W extensibility: permits sufficient template and layout customisation to provide a consistent user experience and a common look and feel across all NSO online services.

During the research stage for this report, custom software implementations were also assessed along with commonly-used commercial software. Some organizations, for instance the US Census Bureau, have developed their own systems. In large part, such work was initiated long before there was an open data movement, let alone software to support it. Such custom software can become more generally accepted amongst closely-related NSOs and be released more formally; this is true in the case of Nesstar and PX-Web.

A representative sample of the most commonly used open data and statistical data publication software platforms was evaluated against these criteria.

[Table: each of the twelve platforms - CKAN, DevInfo, DKAN, Junar, NADA, Nesstar, OpenDataSoft, PC-Axis and PX-Web, Prognoz, Semantic MediaWiki, Socrata and Swirrl - scored against the open data-specific and NSO-specific criteria above, indicating for each criterion whether the platform offers a complete solution or a partial, incomplete one.]

Open data software has led to a number of different approaches to resolving the requirements for data publication, engagement and reuse.
Many of these approaches do not promote interoperability between platforms, and new problems are emerging. This report makes the following recommendations to improve the overall utility of data publication platforms to NSOs and the open data community:

• Improve technical documentation: too little of the available documentation for developing custom components or using the software APIs offers much support to developers. It is essential that these APIs be sufficiently well documented that they can be of use.
• Ensure public APIs and endpoints are interoperable: if vendors are able to agree on a common API - or are able to connect to a common standard - then harvesting from different systems, as well as developing applications that integrate a number of unrelated platforms, becomes far more straightforward.
• Presentation of metadata and URIs must conform to W3C standards: many software platforms fail to present descriptive metadata which aids discovery and reuse. It needs to be clear what the licensing and reuse policies are, what the data are about, and who is responsible for them. Similarly, data discovery is time-consuming, and discrete URIs for each step of the process permit these states to be shared and saved.
• Natural language search and metadata faceting should be standard: free-text search along with metadata faceting speeds up data discovery and improves system performance.
• Structural metadata and hypercube support are core NSO requirements: individual data tables become more useful when they can be aggregated together and sliced into numerous views for analysis. Offering support for both descriptive and structural metadata (such as adopting the DDI metadata standard), as well as hypercubes (such as the W3C's RDF Data Cube Vocabulary), will improve interoperability and wider utility.
• Dashboards and visualisations are necessary for user engagement: user engagement in the face of vast and complex data is more likely where those data are presented in a tactile and engaging way.
• Develop data engagement tools for improving data quality and reuse: approaches which promote data quality and reuse include using the published data in visualisations and analysis, showcasing applications developed by users, registering and tracking issues users experience with data quality, and offering users a mechanism to make data requests.

The software platforms considered in this report are not a comprehensive list of those available to NSOs and open data publishers. Even so, many of them come close to meeting the criteria for both sets of requirements. Both software developers and NSOs can take heart from this and work together towards building the remaining components. Meeting the recommendations in this report will support system interoperability, ease data migration, and foster community engagement for data reuse.

2 Introduction to the Technical Assessment

Open data initiatives ensure that public data are freely available in open, electronic, and reusable formats. National Statistical Offices (NSOs) are responsible for maintaining and disseminating a country's official statistics, many of which typically comprise the foundation of any open data program. NSOs have the potential to play a pivotal role in the implementation of open data initiatives, since the objective of making high quality data more accessible and usable is consistent with their guiding principles.
NSOs may also have existing relationships with other agencies that provide raw data to the statistical system, endowing them with an important role in the local data community. NSOs have expertise in dealing with the many technical issues attendant on publishing public datasets, making them valuable knowledge resources. Despite these advantages, NSOs do not always feature prominently in government-sponsored open data initiatives.

NSOs indicate that one of the difficulties they encounter is that the technology they use to publish - or electronically distribute - data for public use is not compatible with open formats. They also indicate that common software packages used for open data portals do not accommodate the types of data and metadata they produce.

Myriad technical solutions are available to enable NSOs and other organisations to publish data. From the perspective of NSOs, these products generally fall into one of two groups:

1. Platforms designed specifically for use by NSOs, to satisfy the requirements of NSOs and their traditional users (who are typically data professionals with specialised technical experience);
2. Platforms designed to help organisations - particularly government ministries - publish their data under a government-sponsored open data initiative.

These two categories are not necessarily mutually exclusive. However, many current technology solutions were designed with only one of these requirements in mind, and the interaction between the two is unclear. NSOs thus face a challenge in developing a strategy, and in identifying technology solutions that allow them to disseminate their data under an open data initiative.

2.1 Objectives of this report

This research report is intended to provide a better understanding and assessment of the technical issues related to data dissemination tools that NSOs use (or could use) to distribute data to the public under an open data initiative. The report begins by defining a list of key criteria and the most relevant technology products to assess. The assessment seeks to address two data dissemination product concerns:

1. Can products designed primarily for NSOs satisfy the requirements of an open data initiative?
2. Can products designed primarily for open data satisfy the requirements of NSOs?

Specific recommendations are also made for features which would contribute to a common use-case.

This report presents research on a selection of commonly-used open data and NSO publishing software platforms, seeking to understand the design characteristics, core functionality and business model of each product, and the implications for use by NSOs in disseminating data. A product matrix, organised as a set of individual assessments, presents a detailed description of each product's features, along with a list of current use cases. We also investigated cases where NSOs developed a custom or proprietary platform, and the reasons for doing so (i.e., why commercial or open source platforms were deemed insufficient or less desirable). Research included interviews with users and vendors to obtain product information and demonstrations, where possible.
2.2 Who should read this report

• NSO staff, particularly managers and directors, seeking to better understand technology issues relevant to the dissemination of open data;
• Developers seeking to better understand the needs of NSOs and other government agencies;
• Donor agencies that support statistical strengthening and capacity building in developing countries;
• Consultants wishing to support NSOs in developing and deploying integrated statistical and open data publication platforms.

2.3 What this report does not cover

A wide range of issues and components form part of an integrated open data initiative. Only those which are technology-based and core to the success and implementation of open data software are covered in this report. Up-stream data lifecycle issues, such as data production, curation, management, or any other actions which precede data dissemination, are not included.

While not necessarily true in practice, this report assumes that an NSO's data management systems and its data dissemination platform are discrete systems. Interoperability between the publishing platform and public data discovery systems is an important criterion for NSOs and is discussed.

This report does not cover a host of non-technology issues that are nonetheless essential for NSOs to address. Many of these issues are explored in depth in Open Data Challenges and Opportunities for National Statistical Offices [2].

Several products have specific features which may be particularly useful in certain contexts, but are not broadly implemented. A comparative analysis of such features is out of scope, including:

• Version control: iterative versions of datasets available for comparative purposes;
• Data collections: organising multiple data resources together as a single set;
• Social media: integration with popular social media services is not unique to data publishing software; furthermore, open data initiatives require much more comprehensive user-engagement practices than social media implementations typically provide;
• Organisation sub-sites.

This report specifically does not make product recommendations or offer a ranking of the software discussed. It makes suggestions for specific improvements but is not a comprehensive review of all the available software.

3 Overview of data publication and management

"The first rule of data storage: don't store the same data in two different places: you will have problems keeping it consistent." - Tim Berners-Lee

Data publication aimed at the public, rather than internal work-streams and data life-cycle management, may seem remote from the day-to-day requirements of an NSO. However, if these public systems are to encourage publication and data reuse, then they must integrate well with in-house data process management systems. There are three generalised systems used to manage online and public versions of research reports, content and research data:

1. Content Management Systems (CMS): permit publishing, editing and modifying of qualitative content, as well as providing mechanisms to manage workflows and individual users in a collaborative environment;
2. Data Discovery Systems (DDS): are similar to CMS but provide mechanisms to manage the semi-structured quantitative and qualitative data in documents and spreadsheets, and offer methods for data publication, discovery and reuse;
3. Business Intelligence Systems (BIS): provide a platform for engaging with structured quantitative data to produce custom slices of that data, charts, tables and geospatial representations.

While it is certainly possible for a single, integrated software system to serve all of these requirements, this is not a common use-case. Data managers usually operate a number of different systems, which often require manual data restructuring to make data available in each system.

The following sections provide context for the characteristics of systems which support the fulfilment of "data dissemination, user engagement and transparency within an open data initiative, as opposed to up-stream management of data production." [3] Common terms used in the industry are only briefly defined, with references for those requiring a deeper understanding.

3.1 Open Data

The term "open data" is generally understood to mean data that are made available to the public free of charge, without registration or restrictive licenses, for any purpose whatsoever (including commercial purposes), in electronic, machine-readable formats that ensure data are easy to find, download and use. Open data initiatives by public institutions, such as governments and intergovernmental organisations, recognise that such data are produced with public funds and so, with few exceptions, should be treated as public goods. Data reuse, both by data experts and the public at large, is key to creating new opportunities and benefits from government data. Open data reuse rests on two basic criteria:

• Data must be legally open, meaning that they are placed in the public domain or under liberal terms of use with minimal restrictions. This ensures that government policies do not create barriers or ambiguities concerning how the data may be used.
• Data must be technically open, meaning that they are published in electronic formats that are machine-readable and non-proprietary. This ensures that ordinary citizens can access and use the data with little or no cost, using common software tools.

Open data is of particular interest to NSOs for several reasons. NSOs typically manage many of the data products considered high value in government-wide open data initiatives. As governments increasingly develop open data policies, these products (and hence NSOs themselves) may receive greater prominence. The underlying principles of open data are clearly linked to one of the NSOs' fundamental purposes: to make relevant statistics available in ways that are easy for users to access and use. Open data can also create opportunities to increase the efficiency of dissemination, improve data quality, lead to the modernisation of administrative records, and raise the public profile of the NSO.

Many NSO products are relatively straightforward to release as open data, including:

• Statistical products that are already publicly available without restriction, perhaps through printed publications, the NSO's website, or upon request;
• Other vital census and economic statistics at the national and sub-national levels;
• Price and trade data;
• Registers used for drawing statistical samples, for example lists of businesses;
• Official maps of political boundaries, voting districts, infrastructure, and the location of public facilities (schools, government offices, police stations, libraries, etc.);
• Classification systems, such as those for household consumption or types of industry.

Open data presents opportunities and challenges for NSOs.
3.2 Metadata

The creator of a dataset is best placed to know what it is about, and should assign keywords as descriptors. These data about the data are called metadata. The term is ambiguous, as it is used for two fundamentally different concepts:

• Structural metadata correspond to internal metadata (i.e. metadata about the structure of database objects such as tables, columns, keys and indexes);
• Descriptive metadata correspond to external metadata (i.e. metadata typically used for discovery and identification, such as the title, author, subjects, keywords and publisher used to search for and locate an object).

Descriptive metadata permit discovery of the object. Structural metadata permit the data to be applied, interpreted, analysed, restructured, and linked to other, similar, datasets.

Metadata can permit interoperability between different systems. An agreed-upon structure for querying the 'aboutness' of a data series can permit unrelated software systems to find and use remote data. Beyond metadata, there are also mechanisms for structuring relationships between hierarchies of keywords. These are known as ontologies and, along with metadata, can be used to accurately define data and permit their discovery.

Adding metadata to existing data resources can be a labour-intensive and expensive process. This may become a barrier to implementing a comprehensive knowledge management system. Where there are millions of users conducting a high volume of data interactions (in the millions per day), algorithmic systems can assign commonly-used search terms as metadata by drawing conclusions from user behaviour (e.g. a particular search always results in a particular data choice). For the low-frequency user activity typical of specialist research data, such metadata and ontologies are instead usually developed in advance and assigned manually.

3.3 Microdata and generalised data

Data are often presented as a single, undifferentiated mass. In fact, statistical data encompass a range of forms, from questionnaires and individual - personally-identifying - responses, through to aggregated tables of numerical values, analysis and text-driven reports. In this report we differentiate between:

1. Microdata: information at the level of individual respondents, households and businesses, typically collected through surveys; for example, a national census may collect age, address, education, employment status, etc. from individuals;
2. Generalised data: aggregations derived from microdata; for example, the total number of people in a particular education category in the general population.

Dissemination of microdata is an issue of particular concern to NSOs. There is almost always the need, and sometimes a legal obligation, to protect the confidentiality and privacy of the providers of microdata. In some cases, provenance of microdata may lie with external partners (such as academic institutions) with their own policies or restrictions on dissemination. Accordingly, decisions concerning how to manage and disseminate microdata, either as open data or under other policies, are typically made on a case-by-case basis according to the policies and professional judgment of the NSO. However, where provenance concerns do not exist and where privacy issues have been addressed (for instance, through anonymisation techniques), there is not necessarily any reason why microdata should not be released as open data.

NSOs have used a variety of approaches to making microdata publicly available [2].
Some are quite consistent with open data principles; however, many policies place restrictions on who may access the data and how the data may be used. It is common for NSOs to have a separate set of dissemination policies - and separate dissemination platforms - for particularly sensitive microdata, distinct from those for their other data products. This dual approach may be beneficial for users, since it highlights that certain microdata are provided with additional conditions with respect to data access.

There are already platforms and approaches that are well suited to distributing microdata online. For example, the Integrated Public Use Microdata Series (IPUMS) [4] requires researchers to implement security measures, avoid redistribution of microdata, use microdata only for non-commercial research and education purposes, and make no attempt to identify the individuals recorded. The International Household Survey Network (IHSN) has developed tools and guidelines to help interested statistical agencies improve their microdata management practices, including a Microdata Cataloging Tool (NADA) [5], which is assessed in this report. NADA allows administrators to specify an access policy for each dataset. Policies can range from "Open access" (similar to "open data") to "Data not available" (metadata only) for each microdata file.

3.4 Proprietary file formats

One of the criteria for open data is the use of machine-readable, non-proprietary electronic data files for data distribution. These formats reduce the technical barriers to data access to an absolute minimum for broad categories of users. NSOs commonly distribute data in a variety of formats. Some are considered "open" (such as CSV, XML and text) and some are proprietary formats used in data analysis software products (SAS, STATA, SPSS, etc.). The latter are legitimate even in an open data initiative, since these are the software systems used by many professional data users. However, since these formats are not interoperable, the potential for data reuse is limited unless open formats are also supported. Data dissemination in proprietary formats does not preclude dissemination in open formats, and vice versa. If an NSO already distributes data in a proprietary format, it should also distribute in one or more open formats.

3.5 Data structure and linked data

Data are usually stored in a relational database. The design of such systems is beyond the scope of this report; however, a brief overview is necessary to understand an increasingly popular approach to managing open data on the web. Structural metadata permit data contained in relational databases to be aligned and joined. This, however, can be slow and inefficient for large and complex datasets.

The World Wide Web Consortium (W3C) develops web standards. Its approach for data interchange on the web is known as the Resource Description Framework (RDF) [6]. RDF has come to be used as a general method for the conceptual description or modelling of information implemented in web resources, using a variety of syntax formats. RDF breaks away from the standard relational database and can be thought of as a graph of entity-relationships of the form: subject, predicate and object. The subject (e.g. John) is linked to the object (e.g. Carol) by a predicate (e.g. 'is a friend'), giving rise to the terms Triple Store (the three entities being the 'triple') and Linked Data.
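As a concrete illustration, the sketch below builds the John-Carol triple with the Python rdflib library and serialises it as Turtle. The use of the FOAF vocabulary to express the 'is a friend' predicate, and the example.org namespace, are assumptions made for the example; any suitable vocabulary could serve.

```python
# A minimal RDF triple, built with the rdflib library (assumed available
# via 'pip install rdflib'). FOAF's "knows" stands in for the 'is a friend'
# predicate; the example.org namespace is a placeholder.
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import FOAF, RDF

EX = Namespace("http://example.org/people/")

g = Graph()
g.add((EX.John, RDF.type, FOAF.Person))        # John is a person
g.add((EX.John, FOAF.name, Literal("John")))   # descriptive metadata
g.add((EX.John, FOAF.knows, EX.Carol))         # subject, predicate, object

# Serialise the graph as Turtle, a common RDF syntax.
print(g.serialize(format="turtle"))
```

Each `g.add()` call stores one triple; together the triples form the graph depicted in Figure 1.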
Figure 1: RDF subject, predicate, object model (A) combining to form an RDF graph (B) [7]

The subject and object are also known as nodes, while the predicate is an edge. A network of nodes linked by edges is called a graph. Numerous implementations of this approach have resulted in interoperable, structured, machine-readable metadata systems. There are also, however, numerous legacy approaches to categorising data which have arisen in individual research institutions across the world.

3.6 Software development and deployment

There is tension between the needs of NSOs for standards-compliant but customisable systems, and the proprietary implementations with limited customisability preferred by many vendors. NSOs have a government mandate to maintain national statistics indefinitely. They legitimately fear that proprietary systems expose them to future data migration costs as such companies go out of business or are acquired. Numerous custom systems built by NSOs were developed when their existing vendors became insolvent or discontinued software support. Vendors, for their part, are concerned that they will be unable to amortise the costs of research and development over the long term, and often look for mechanisms to ensure lock-in, reducing the ability of clients to migrate to alternative platforms. This has resulted in two different solutions:

• Open Source Software (OSS): the source code is available in an online public repository under a liberal reuse license (such as the General Public License [8] and its affiliates); sometimes known as Free and Open Source Software (FOSS), although not all open source software is free of charge, and not all free software is open source; full customisation and extensibility are guaranteed;
• Software-as-a-Service (SaaS): software is available online on a centralised hosted server via a subscription service, instead of as a deployable software system with a single, static price; upgrades, bug fixes and patches are consistently and regularly applied; custom extensions of functionality can be achieved via an API, but customisation of the user interface is more limited.

Note that software can be both OSS and SaaS, or neither. Open source is also only useful if a developer community remains engaged and continues improving the software; the lifespan of open source software is often no different from that of proprietary solutions. A combination of standards compliance and open APIs can ensure the ability to migrate to a different service provider even where the software is not open source. If you must choose between standards compliance (i.e. a straightforward mechanism to migrate your data to another service) and open source, choose standards compliance.

The total cost of ownership for online software systems is often difficult to assess. Open source products, where the software itself is effectively free, still come with requirements for deployment, customisation and maintenance. A license fee for proprietary software is often a small part of the total lifetime cost. Lifetime cost comparisons between vendors are difficult, and NSOs are advised to present a clear brief on their needs to ensure that pricing is well presented. This includes very specific guidance on data storage volumes and the expected rate of growth.

4 Requirements and components for data publication systems

Open data systems were not originally designed to serve NSOs, nor were NSO systems originally designed for open data.
While some data publication platforms offer Extract, Transform and Load (ETL), Business Intelligence and Content Management functionality, this report concentrates on Data Discovery and Publication. A visual representation of the various data and software components is presented in Figure 2:

Figure 2: Overview of the technical components and processes in a data publication platform

The "Datastore & Filestore" component is the core of the service, providing both the data catalogue and presenting data in the raw file formats required for download. These components are often served by different software systems. Deployments should ensure system interoperability: different software systems should not result in duplicated functionality or needless requirements for manual data conversion.

Table 1 presents the assessment criteria used in this report and the relevance of each to open data and to NSOs' traditional needs:

[Table 1: Elements and components provided by data publishing software. Each criterion - descriptive metadata, machine-readable, anonymous access, data licenses, data attribution, search, open API, static URI, harvesting, federating, documentation, standards-based, structural metadata, OLAP hypercube, data endpoints, visualisation, and UX & S/W extensibility - is marked as being of primary importance, or of secondary or partial importance, to open data systems and to NSO systems respectively.]

4.1 Criteria for Open Data

4.1.1 Descriptive metadata

Descriptive metadata are used for discovery and identification, as well as for data life-cycle management. There is a large number of metadata vocabularies and ontologies used in open data systems; this is by no means an exhaustive list, but these are the most common:

1. Data Catalog Vocabulary (DCAT) [9]: an RDF vocabulary designed to facilitate interoperability of data catalogues published online, making extensive use of Dublin Core;
2. Data Documentation Initiative (DDI) [10]: an international standard for describing the complete research data life-cycle for the social, behavioural and economic sciences; the DDI-RDF Discovery Vocabulary is designed to support RDF;
3. Dublin Core Metadata Element Set [11]: an RDF vocabulary of fifteen "core" properties for use in resource description; elements include title, creator, subject, description, publisher, date, format and source.

Similarly, there are metadata systems for describing geospatial data, such as the Infrastructure for Spatial Information in the European Community (INSPIRE) [12] system, which is derived from ISO 19115 "Geographic Information - Metadata". The revised version for 2014, "ISO 19115-1:2014 defines the schema required for describing geographic information and services by means of metadata. It provides information about the identification, the extent, the quality, the spatial and temporal aspects, the content, the spatial reference, the portrayal, distribution, and other properties of digital geographic data and services" [13].

As well as permitting data and metadata upload, user-friendly interfaces are required in which system administrators can associate metadata with each data resource.
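To illustrate how these vocabularies combine in practice, the sketch below describes a single dataset using DCAT and Dublin Core terms via the Python rdflib library. The dataset, its title and its URL are invented for the example.

```python
# Sketch: a DCAT catalogue entry for one invented dataset, expressed with
# rdflib. DCAT and DCTERMS are the vocabularies described above; the
# dataset URI and field values are placeholders.
from rdflib import Graph, Literal, URIRef
from rdflib.namespace import DCAT, DCTERMS, RDF

g = Graph()
dataset = URIRef("https://data.example.gov/dataset/census-2014")  # placeholder

g.add((dataset, RDF.type, DCAT.Dataset))
g.add((dataset, DCTERMS.title, Literal("Population Census 2014")))
g.add((dataset, DCTERMS.publisher, Literal("National Statistics Office")))
g.add((dataset, DCAT.keyword, Literal("population")))
g.add((dataset, DCAT.keyword, Literal("census")))

print(g.serialize(format="turtle"))
```

A catalogue published this way can be read by any DCAT-aware system, which is precisely the interoperability these vocabularies are designed to provide.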
4.1.2 Machine-readable datasets

Tim Berners-Lee summarises a recommended hierarchy of data availability [14]:

★ Available on the web (in any format) but with an open licence;
★★ Available as machine-readable structured data in a proprietary format (e.g. Excel instead of an image scan of a table);
★★★ Available as machine-readable structured data in a non-proprietary format (e.g. CSV instead of Excel);
★★★★ All the above, plus standards from the W3C (RDF and SPARQL) to identify data uniquely so that others can access and reference the live data;
★★★★★ All the above, plus cross-linking the data to other data to provide context.

Spreadsheets and distributed data systems often lack an agreed data structure. A researcher who wishes to combine these with other data first needs to normalise them and then decide on standardised terms. Converting semi-structured tabular data into a machine-readable format results in tabular files with a header row defining the data in the columns and rows below. Such tabular files (e.g. Excel) can easily be converted to comma-separated-value (CSV) files. Ignoring any further standards compliance, CSV files can be arranged so that they can be "joined" on a common column. For example, a set of standardised geospatial reference codes (e.g. ISO 3166 country codes) can be used to connect similar files covering different data series.

The process of converting data into a machine-readable format is one of Extract, Transform and Load (ETL). A common requirement is parsing PDF files to extract tables into Excel or CSV, rendering the data machine-readable. The process describes extracting data from its source, transforming it into the required machine-readable format, and loading it into a database or Datastore. There is often a requirement to maintain a connection to the original data in case of a query or concern, and so data are often kept in their original format in a Filestore alongside (and directly connected to) the Datastore.

4.1.3 Anonymous access

A key expectation of open data systems is that users can search for and access data and metadata without having to identify themselves, create a user account, or receive advance permission. Experience to date strongly indicates that user registration schemes are significant deterrents to data access and use by broad audiences. When The Times (UK) introduced a paywall, it lost 66% of its readers overnight [15]. This happened despite the fact that users could register for free and access a limited number of articles each month.

Data managers often want to know two things about data reuse: who is using their data, and in what way? The former does not require user registration, and the latter would not be answered by it. In the case of analytical data, ensuring ease of access is critical to expanding the user base beyond academics. The more approachable and user-friendly a data resource, the more likely it is that people will experiment. Certainly, site managers need to understand data engagement: downloads, search terms, site referrals, and so on. None of this requires that users give up their anonymity. Similarly, no amount of user registration will tell site administrators how the data are used unless users volunteer this themselves. Create opportunities for engagement, and make the process pleasant, and users will tell you what they are doing with the data.

Note that this does not imply that each dataset requires a full social media experience. Datasets are not opinion pieces or blog posts, and comment threads attached to data usually relate to data quality; this requirement is best served by an issue tracker. Data.gov.uk has an entire section dedicated to data-driven apps submitted by users. Some open data software systems permit users not only to create and save their own visualisations, but also to feed back information on where those visualisations are embedded.
That is useful feedback to the publisher and can still permit relatively anonymous engagement.

4.1.4 Data reuse and release licenses

Data publication software must offer a mechanism by which the license associated with each dataset is clearly presented to the user. Licenses which permit data discovery but not liberal reuse are all but useless. If people are not permitted to restructure or republish the data as they require, then they are unlikely to want to use it at all. Peter Desmet, a researcher at the Canadian Research Institute for Nature and Forest, describes how non-standard open access data licenses made it illegal for him to aggregate 13,297 georeferenced American bullfrog records and place them on a single map [16], despite the data being released as open access on the Global Biodiversity Information Facility (GBIF).

If an NSO is to meet its mandate for public dissemination, it must also ensure that the public has full rights to use the data in any form it may choose, subject to due reference to the data publisher concerned. Datasets should be clearly labelled as released under standard open data licenses, such as:

• Creative Commons (CC-BY, CC0) [17];
• Open Government Licence (OGL) [18];
• Open Database Licence (ODbL) [19];
• Open Intergovernmental Organisation License (IGO) [20], or similar.

Each of these licenses permits users to use the data they have downloaded, combine it with other data to create novel insight, and then release or sell that data and insight as they wish, subject - if required - to attribution of the original source.

4.1.5 Data attribution to source

Any dataset should present a clear set of information which permits a user to:

• link directly to the data;
• cite the data in their reuse of that data;
• attribute the data creator, either as an individual or as an organisation;
• contact the data owner should they have any queries.

Platforms must offer management and presentation of such information in a clear and readily presented format, as well as an interface to associate such attribution with source data. Where source data are not available, it is impossible for data users to offer appropriate attribution, or for other users to verify the bona fides of the relevant data.

4.1.6 Search for data discovery

The default standard for online discovery is the natural language search box; Bing, Google and Yahoo are familiar examples. The process of data discovery offers stakeholders the ability to find relevant data quickly and easily, and then to access that data in a useable format for the wide range of research activities they may wish to perform. Navigating an impenetrable branching tree structure to find data - structured according to the needs of the NSO and data publishers - is another barrier to data dissemination. Such structured systems can be useful for the expert user who has experience of that data structure but, for everyone else, make discovery extremely difficult. The user needs the minimum number of steps to find appropriate data, verify that the data are what they are looking for, and then access the data in a format which permits them to use it.

Search results presented by data portals should return focused summaries of datasets, along with keywords which aid classification, and the option of reviewing the data online to assess its content. This may also include visualisation tools to produce basic charts or maps.
Faceting, in which metadata are used as additional selection criteria to filter lengthy search results, is encouraged. A simple mechanism is for metadata to be listed alongside search results with check-boxes that automatically apply an "and" to filter the results.

4.1.7 Application Programming Interfaces (APIs) are public

Interoperability requires an Application Programming Interface (API) through which standardised commands are available to an external system and used to query the data, metadata and other attributes in a database. Such interoperability permits a range of actions by other software systems, for example:

• importing data into another application (such as Tableau, R, or Excel) for analysis and merging with other data sources;
• development of free-standing applications, such as transport apps on mobile phones;
• automation of repetitive processes, such as setting a routine to regularly download data released monthly.

APIs can also be used by site administrators to automate data harvesting, uploading or similar bulk processes. Such APIs are often connected to ETL systems for data transformation prior to loading. The most common approach to implementing an API is via a Representational State Transfer (REST) system. The standard commands generally used in REST for creating, reading, updating and deleting data are POST, GET, PUT and DELETE. The more sophisticated software platforms often have a query-builder utility which permits users to experiment live on the server and see the results of different API queries. Such interfaces should provide unique URLs so that particular queries can be bookmarked and shared.

4.1.8 Datasets are reachable via persistent URI

Permanent and persistent availability means not just that data are available, but also that they are always in the same place. For online systems, this implies a requirement for Uniform Resource Identifiers (URIs), which ensure that resources, content or data are always located through one discrete address for any and all users or software-driven applications. The most familiar of these are the Uniform Resource Locators (URLs) that you see as links to websites. These can also be described as endpoints. It is essential that these endpoints never change and that their behaviour is predictable. A person who bookmarks a link, or who includes such a link in an article, assumes that it will still be there when they need it.

More generally, a URI can also define a persistent link to a point within a dataset. This permits the interlinking of different datasets, the creation of more complex data aggregations, and better insight into the data. Software should be capable of providing a clear, easy-to-find permanent URL for every dataset served by the platform. Some systems are capable of providing URIs to subsets of data, or points, within datasets.

4.1.9 Automated data harvesting

The ETL process of capturing data for publication can often become a significant barrier to data migration or to implementing a new data system. The manual process of creating datasets, entering metadata, and uploading data is slow and labour-intensive. It is also difficult to automate, since identifying and allocating appropriate metadata is a specialist task, and metadata are often difficult to extract from the data files themselves. Some data, however, are updated regularly as part of a data release cycle: the datasets already exist and merely need to be updated. Where software has an API which permits data editing and uploading, custom scripts can be written to automate such processes.
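As an illustration, the sketch below automates a monthly update of an existing dataset resource through a CKAN-style Action API (CKAN is one of the platforms reviewed in Section 5). The portal URL, API key and resource identifier are hypothetical placeholders for values an NSO would supply.

```python
# Sketch: a custom script automating a monthly data update through a
# CKAN-style REST API. The portal URL, API key and resource ID are
# hypothetical placeholders.
import requests

PORTAL = "https://data.example.gov"  # hypothetical portal address
API_KEY = "0000-0000-0000"           # key issued to the publishing account
RESOURCE_ID = "cpi-monthly"          # identifier of the existing resource

def update_resource(csv_path: str) -> None:
    """Replace the file attached to an existing dataset resource."""
    with open(csv_path, "rb") as f:
        response = requests.post(
            f"{PORTAL}/api/3/action/resource_update",
            headers={"Authorization": API_KEY},
            data={"id": RESOURCE_ID},
            files={"upload": f},
            timeout=60,
        )
    response.raise_for_status()  # fail loudly if the portal rejects the update

if __name__ == "__main__":
    update_resource("cpi_2014_10.csv")
```

Scheduled with a system timer such as cron, a script of this kind removes the manual upload step from each release cycle.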
Some software systems go further, offering system administrators a dashboard for setting up and managing a large number of automated data upload processes. Such a service is known as data harvesting. Harvesting is the receiving end of an ETL process which is usually delivered by separate software systems. The more such routine tasks can be automated, the easier data release becomes, and the more likely it is that both users and providers of data will adopt the system.

4.1.10 Federation of multiple data sites

Federation is the mechanism by which dataset metadata are polled from different platforms and copied to a centralised software service or database. The original data usually continue to be hosted on the original platform, but the metadata, and links to the data resources, become accessible and discoverable via the federating platform's search engine.

There are numerous reasons why data may be published from a variety of different software platforms. Different departments may wish to manage their own data life-cycle. Federal agencies and ministries often enjoy significant autonomy and will be used to operating their own systems. Open data, however, is more useful when users do not need to visit numerous different web services in order to discover data.

Note that federation differs from harvesting. Harvested data are maintained and managed from the harvesting system. Federation captures only the metadata, while permitting live exploration or visualisation of the data on the remote system. Not all software which permits harvesting will support federation. As with harvesting, an API which permits search and discovery also permits federation: a system can traverse the data on a site and build a metadata structure for local search. More sophisticated software systems offer dashboards for automated federation.

Federating between unrelated software platforms poses challenges as a result of differing database schemas. Federation between instances of the same software is often straightforward, while specific software adapters (transformations) are required to federate across unrelated systems. Federated systems need to poll their data sources regularly to remain aligned in the case of deletions, updates or additions.

4.1.11 Public documentation

APIs, faceting and data reuse are - by their very nature - technical topics. Different platforms behave differently, and even experienced analysts will struggle in the absence of clear documentation. Software documentation is the responsibility of the platform vendor, and it is essential that they provide comprehensive information for developers and the general public on how their platform works. Such documentation should be updated with each new software release. For developers, demonstration services which encourage experimentation with the APIs are extremely helpful. It is also helpful if such documentation is available online instead of as PDF documents; this permits search, cross-referencing and persistent URIs. Machine-readability is as important for documentation as for data.

4.1.12 Compliance with generally accepted standards

The definitions of open data systems are still emerging as best practice is agreed. The leading open data software systems tend to have similar approaches to metadata and to data presentation. Many of these approaches are becoming standards recognised by the W3C, especially as regards metadata, RDF, and hypercubes. The list of components presented in this section (4.1) reflects the current generally accepted standards. Software which deviates from these in any significant way is more likely to struggle with interoperability and general compliance.
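As an example of what such standards convergence enables, the sketch below queries a SPARQL endpoint for DCAT dataset descriptions. The endpoint URL is a hypothetical placeholder; the query itself uses only the W3C-recognised SPARQL protocol and the DCAT and Dublin Core vocabularies described in 4.1.1, so it would work unchanged against any compliant endpoint.

```python
# Sketch: querying a hypothetical standards-compliant SPARQL endpoint for
# dataset metadata expressed in DCAT and Dublin Core (see 4.1.1).
import requests

ENDPOINT = "https://data.example.gov/sparql"  # hypothetical endpoint

QUERY = """
PREFIX dcat: <http://www.w3.org/ns/dcat#>
PREFIX dct:  <http://purl.org/dc/terms/>
SELECT ?dataset ?title WHERE {
  ?dataset a dcat:Dataset ;
           dct:title ?title .
} LIMIT 10
"""

response = requests.get(
    ENDPOINT,
    params={"query": QUERY},
    headers={"Accept": "application/sparql-results+json"},
    timeout=30,
)
response.raise_for_status()

# The W3C SPARQL 1.1 JSON results format nests values under "bindings".
for row in response.json()["results"]["bindings"]:
    print(row["dataset"]["value"], "-", row["title"]["value"])
```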
The list of components presented in this section, 4.1, is a reflection of the current generally accepted standards. Software which deviates from these in any significant way is more likely to struggle with interoperability and general compliance.

4.2 Criteria for National Statistics Offices data publication

4.2.1 Structural metadata

Even in the unlikely situation where an NSO has implemented an integrated software platform for all their data publishing needs, there is still a requirement for researchers to access those data and use them in their own systems. The internet offers the ability to connect and mash together a wide variety of data in different formats to produce new insight. Writing in 2001, Tim Berners-Lee pointed out that the majority of information on the web is designed for people, rather than computers, to read 21. Gareth McGuinness, of the International Monetary Fund, describes the challenge:

“For each dataset, the IMF must go through the laborious work of matching each dimension in the source data to the equivalent IMF dimension, and then matching each item in each code list to the equivalent item in the IMF code list. In the best case scenario, there will be some instance where code lists match. However, even for one of the simplest geographic dimensions – country – there are several different “standards” in common use.” 22

Semantic interoperability is the ability of computer systems to exchange data unambiguously. Structural metadata permits alignment of the data itself, whether to perform restructuring or to link to other, similar, datasets. A number of metadata formats are used by NSOs to structure data or promote data interoperability; the leading vocabularies are:

1. PC-Axis 23: a software suite developed by Statistics Sweden – and in use by more than 50 NSOs around the world – which provides a set of structured keywords defining the file format for loading data as a cube; it was initially developed in the 1980s for use with the Axis database system but has since been extended;
2. Statistical Data and Metadata eXchange (SDMX) 24: a mechanism for the exchange of statistical information. The initiative is sponsored by EUROSTAT, IMF, OECD, UN and World Bank, amongst others. This is an extremely detailed approach which offers a language in which different statistical data can be integrated across different software systems. SDMX creates a mechanism for mapping existing NSO metadata to SDMX and forming a hypercube from which custom slices can be extracted;
3. RDF Data Cube Vocabulary 25: provides a means to publish multi-dimensional data using RDF in a way that is compatible with SDMX; the RDF Data Cube vocabulary is a core foundation which supports extension vocabularies to enable publication of other aspects of statistical data flows or other multi-dimensional data sets.

DDI, described in 4.1.1, is also used by NSOs to define structural metadata. The RDF approach to interoperability is the OWL Web Ontology Language 26.

4.2.2 OLAP hypercubes

Conforming machine-readable data into a relational database results in a multi-dimensional array of data: an online analytical processing cube, or OLAP cube. Such data are then available for integrated analysis and for straightforward software-driven conversion into multiple formats. The term “cube” can be misleading, as an OLAP cube is not limited to only three dimensions. The UK government financial data from the Combined Online Information System (COINS) was converted into a linked data system in June 2010 27.
Each datum in COINS is uniquely identified by a combination of seven indices in a structure called a hypercube. A cube consists of as many dimensions as are required to define its data uniquely. The terms OLAP, datacube and hypercube will be used interchangeably in this report. Getting from individual spreadsheets to a neatly aligned, comprehensively analytical datacube requires a process of Data Structure Governance (agreeing on data structures across entire organisations) as well as ETL. Hypercubes permit filtering and faceting of the data itself. Users can select particular series, for a range of geographies, and over specific dates, to create a custom data slice. Statistical data are used to develop research insight. Individual tables become more useful when they can be aggregated together and sliced into numerous views for analysis.

4.2.3 Data endpoints

Structured data endpoints return data in predictable ways. These can be as simple as a known type of serialisation format, while more complex implementations permit the data to be queried, filtered or refined prior to download. What they have in common is that there is a fixed URL at which to reach the endpoint. A number of commonly-used endpoints are:

1. JavaScript Object Notation (JSON) 28: an open standard presenting data as a set of key-value pairs; as an endpoint, it is accessible via RESTful APIs;
2. OData 29: While RDF has achieved a high degree of traction for linking diverse data together, it is still not that straightforward to connect it back to the tools researchers use most frequently to work with data. OData offers a standardised protocol for creating and consuming data. The format is extremely popular, and software as diverse as Tableau (for analysis and visualisation), Drupal (for content management), and Microsoft Excel are all able to accept OData as an input. OECD.stat, for example, is currently offering a test interface to trial OData 30;
3. Extensible Markup Language (XML) 31: a markup language, more difficult for humans to read, found in uses as diverse as encoding Microsoft Office documents, websites and data interchange; it too can be presented via RESTful APIs;
4. SPARQL Protocol and RDF Query Language (SPARQL) 32: an RDF database query language able to retrieve and manipulate RDF data, returning it as RDF or, with appropriate interpreters, converting it for use by SQL or other query languages.

4.2.4 Online analysis and visualisation

Business intelligence systems offer meaningful tools to stakeholders who wish to perform statistical analysis across the complete data produced or managed by an NSO. Such tools may be as simple as selecting a data slice and displaying it as a chart, or may perform complex statistical analysis on multiple data series, as well as presenting results on maps and charts. User engagement in the face of vast and complex data is more likely where those data are presented in a tactile and engaging way. The growth of data journalism has depended on the growing availability of data resources and the presentation of exciting views on that data. Certainly, NSOs should not editorialise or lose their neutrality, but they can provide tools for users to explore the data. Technical researchers and analysts will want to import the data into their favourite systems, and data endpoints will permit them to do so. Whether a software platform provides its own native visualisation and business intelligence tools is often a design decision.
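Where a platform leaves analysis to external tools, a structured endpoint is what makes that workable. The following is a minimal sketch of consuming such an endpoint, assuming a hypothetical URL, query parameters and field names (none drawn from any platform reviewed in this report), and aligning the returned slice for the kind of view a native business intelligence tool would otherwise provide.

```python
import pandas as pd
import requests

# Query a hypothetical observations endpoint for a custom slice:
# one indicator, three geographies, a fixed date range (section 4.2.3).
resp = requests.get(
    "https://data.example-nso.gov/api/v1/observations",
    params={"series": "CPI", "geo": "KE,TZ,UG", "from": 2005, "to": 2014},
)
resp.raise_for_status()
records = resp.json()["observations"]  # e.g. [{"geo": "KE", "year": 2005, "value": 87.1}, ...]

# Align the slice into a year-by-geography table for analysis or charting.
df = pd.DataFrame(records)
table = df.pivot(index="year", columns="geo", values="value")
print(table.describe())
```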
Some systems prefer an integrated approach, while others prefer to focus only on data publication. Which approach will serve better depends on any existing infrastructure in the deployment environment, and on the needs and expectations of the NSO.

4.2.5 User-experience and software customisation

Data publication platforms are only one software system amongst many deployed by NSOs. Software should permit sufficient template and layout customisation to provide a consistent user-experience and a common look and feel across all NSO online services. Sub-domains, which are the URLs to different software in a suite of platforms, are part of providing a consistent user-experience. If an NSO has chosen a particular addressing format (e.g. reports.nso.gov, data.nso.gov, events.nso.gov, etc.), it is important to verify that this is possible where software is provided as SaaS, should the NSO require it. Similarly to user-experience customisation, NSOs often require custom software extensions for integration with existing platforms, or to perform particular tasks. NSOs may choose to develop custom software extensions and enhancements in-house or via third parties. Where software is licensed as open source, such development is relatively straightforward (with due consideration to the documentation requirements mentioned in 4.1.11). If the software is not open source, it is essential that a well-documented API permit software extension, or that licensees have access to the source code in the absence of APIs. In the case of open source, or licensed code access, the programming language (Python, Java, Ruby, C++, etc.) becomes a consideration. APIs, as mentioned before, are software agnostic and permit development in the programming language of your choice.

5 Review of Open Data Publication Systems

This section presents a review of a selection of software identified as in use by NSOs or currently considered to be viable open data platforms. During the research stage for this report, custom software implementations were also assessed. Organisations like the US Census Bureau have developed their own systems. In large part such work was initiated long before there was an open data movement, let alone software to support it. Such custom software can often become more generally accepted amongst closely-related NSOs and be released more formally, as happened with Nesstar and PC-Axis. This is by no means an exhaustive list, and it is likely that this comparison will be updated from time to time.

Table 2: Software commonly found supporting online data publication. [The original is a grid scoring each platform (CKAN, DevInfo, DKAN, Junar, NADA, Nesstar, OpenDataSoft, PC-Axis and PX-Web, Prognoz, Semantic MediaWiki, Socrata, Swirrl) against the criteria of section 4: descriptive metadata, machine-readable, anonymous access, data licences, data attribution, search, open API, static URI, harvesting, federating, public documentation, standards-based, structural metadata, OLAP hypercubes, data endpoints, visualisation, and UX & S/W extensibility. Each cell marks whether the platform offers a complete solution or a partial, incomplete, solution; the cell markers were lost in extraction and are not reproduced here.] Rows highlighted in blue are for software commonly regarded as open data platforms.
Note that implementations of various elements may differ even as they offer similar functionality.

5.1 CKAN

URL: http://www.ckan.org/
S/W Licence: Affero GPL, open source
Language: Python, Javascript
SaaS: http://ckanexpress.com/
Demo: http://demo.ckan.org/
Examples: data.gov.uk; data.surrey.ca; publicdata.eu

Overview
CKAN is an open-source data discovery system making data accessible and usable by streamlining publishing, sharing, finding and using data. As well as harvesting, cataloguing, and advanced searching, it can store data and provides rich data APIs, and simple visualisation and exploration tools. Most CKAN instances are self-deployed and self-hosted by the organisations concerned, although a number of companies now offer CKAN as SaaS. There is also a globally distributed range of consulting and software services available to deploy CKAN.

Open data suitability

Descriptive metadata: CKAN has the concept of the dataset being a folder in which files, known as resources, are stored. Metadata are applied at the dataset level and not at the data structure level. Metadata are served as RDF and support Dublin Core and DCAT, with the ability to harvest documents in the geospatial INSPIRE format.
Machine-readable: CKAN is able to import and interpret CSV, XLS, GeoJSON and text files in a machine-readable format. It is further able to interpret and serve PDFs and other file-types even where it cannot import them into the database. Geospatial support includes ArcGIS through extensions, and the ability to harvest INSPIRE and other ISO 19139-based geospatial metadata.
Anonymous access: Anonymous access is permitted, although CKAN does have the ability to offer private datasets to subsets of registered users. An API key is required for users wishing to modify or upload data via the API.
Data licenses: CKAN offers a range of common license types to the user during the data upload process and these are presented clearly in the dataset view.
Data attribution: CKAN presents clear attribution where such data exist.
Search: CKAN’s search is clearly presented, and returns results permitting filtering by metadata. Search results are presented with title and description, and a list of file-types present in the dataset.
Open API: CKAN has a clear, although complex, API that is documented thoroughly online as part of the main documentation. Data resources can be downloaded via the API but are only available as JSON output from the database.
Static URI: CKAN presents all datasets and data resources as static URIs.
Harvesting: CKAN is able to harvest existing data resources, as well as regularly changing data, via the API. There is also a data harvester extension which provides a limited user interface for setting up individual harvesting processes.
Federating: CKAN is able to federate other CKAN sites in order to consolidate metadata. The harvester extension provides a common framework for developing custom harvesters for different metadata sources, which means that federation from non-CKAN sources is also possible. Some examples include generic spatial metadata sources like CSW and WAF, and ArcGIS Server portals.
Documentation: CKAN’s documentation is comprehensive and available at http://docs.ckan.org/. This presents information for users as well as for developers.
Standards-based: CKAN is an entirely open source software platform, although the learning curve for the original Pylons framework underlying CKAN can be steep.
Extending CKAN is performed either via the software library (in which case the extensions run on the server with CKAN) or via the API (where they can run remotely).

NSO suitability

Structural metadata: CKAN offers no support for metadata describing data structure.
OLAP hypercube: CKAN places greater emphasis on the source data than on the database. There is no support for hypercubes. Data can be filtered or faceted via the API, although such support will require custom software development to be generally useful.
Data endpoints: CKAN produces JSON output and, through an extension, also supports OData. CKAN only presents data in the format in which it was uploaded and does not offer data transformations. CKAN endpoints do not support any proprietary formats like Stata or SPSS, although extensions can be developed.
Visualisation: CKAN’s included visualisation library, Recline.js, is very limited in terms of what it can support. It does not permit the user to save or share specific visualisations either, although this is planned for a future version. A new Dashboard feature offers a range of persistent visualisations to site administrators but not to visitors.
UX & S/W extensibility: CKAN uses Bootstrap, a popular UX library, as its CSS system, with Jinja2 for templating, and so can be easily customised. It should be noted, though, that CKAN is not responsive in design and does not support mobile screen form factors. While CKAN predominates in serving national open data portals, its real strength is in the community of developers producing custom extensions and enhancing the software. The UK, US, Canadian, Australian and Mexican governments have each supported local developers. Universities have also acted to extend CKAN to host academic research data.

Observations: CKAN is a powerful and extensible platform for open data systems, but will need hypercube, and data structure metadata, support before it can be deployed for integrated NSO statistical portals.

5.2 DevInfo

URL: http://www.devinfo.org/
S/W Licence: Free but license unclear
Language: Unknown
SaaS: http://www.devinfo.org/
Demo: http://www.devinfo.org/
Examples: All sites hosted on http://www.devinfo.org/

Overview
DevInfo was designed to integrate the data generated through monitoring the Millennium Development Goals and is developed and supported by UNICEF. Numerous UN agencies use variations of DevInfo and governments around the world are encouraged to publish their MDG indicators in it. The software is available free and can be deployed on your own servers. This gives you the opportunity to use what is quite a powerful business intelligence tool to manage and present your own data.

Open data suitability

Descriptive metadata: Metadata are entered via a specific module in the software and are compliant with DDI and Dublin Core, as well as ISO 19115 for geospatial data.
Machine-readable: DevInfo’s objective is to align all data into a datacube. Only machine-readable data, formatted according to the specific platform requirements, can be read by the system. It does not host a variety of data objects.
Anonymous access: Anonymous access is the default, and logins are available for site administration. Registration and login are required to download data or save visualisations.
Data licenses: There is no mechanism for data contributors to declare data reuse licenses, and it is unclear what those licenses may be.
Data attribution: Links and references are provided for each data series for appropriate attribution, including – where available – a reference email for direct contact.
Search: Search is limited to exact matching of location and data series. Natural language and metadata search are extremely limited.
Open API: An API is provided, including an online tool for experimenting with API calls. The request format supports both REST and SOAP, with output in XML and JSON, as well as SDMX.
Static URI: There are no static URIs for data series. DevInfo appears to be structured as a “one-page website” with no URL changes despite different views being created. Once data have been selected and saved it is possible to generate a static URL, or even to embed the data, but the search-and-link approach is absent.
Harvesting: Bulk uploading is possible through additional DevInfo modules, but establishing regular, automated data harvesting is not available via an interface. It could, however, be implemented as a script.
Federating: While it is possible to search all sites hosted by DevInfo on the main site, there is no automated mechanism for federating remote sites hosted independently.
Public Documentation: DevInfo is not open source and the documentation is only that provided by the UNICEF team. All the documentation is available only in zipped and PDF form, making its use somewhat tedious. Documentation is aimed at end-users rather than at developers. All content is available from the main DevInfo website.
Standards-based: The software stack runs only on Windows Server and Microsoft SQL Server. This means that it will need entirely separate architecture from any open source software the NSO may choose to deploy.

NSO suitability

Structural metadata: DevInfo supports SDMX for data output and via the API, but structures data via its own format internally.
OLAP hypercube: Data are stored in a hypercube, but the interface for creating custom slices of those data is limited. Once the user has selected data, with the ability to select multiple data series, the user then has the option to slice those data by time-period as part of the data visualisation component.
Data endpoints: Endpoints are available via the API or once a user logs in. This does, however, limit a user’s ability to integrate data with their own applications.
Visualisation: The software offers a clear, user-friendly data browser and business intelligence tool which should suit most online users. The user is able to create a range of standard charts, including plotting geospatially, and to name axes and change the colour presentation of the data.
UX & S/W extensibility: The software is not open source and there is limited documentation. Beyond changing a few logos, there is almost no way to customise the user interface, and software extensibility is extremely limited. The lack of a development community also means that any NSO would have limited support for customisation.

Observations: DevInfo offers a comprehensive data management and publication platform. It does not meet all the criteria required for open data publication, especially regarding anonymous access and data licensing, but it does appear to serve the immediate needs of NSOs. That said, DevInfo is not very customisable, and NSOs with requirements which vary from these will find it difficult to modify the system. It was designed to serve Millennium Development Goal data; less specific data series may fare less well.
5.3 DKAN

URL: http://www.nucivic.com/
S/W Licence: Open source, GNU GPL
Language: PHP, JavaScript
SaaS: http://nucivic.com/data
Demo: http://demo.getdkan.com/
Examples: whitehouse.gov/raise-the-wage; abrepr.org; www.offenedaten-koeln.de

Overview
DKAN is an open source solution built on Drupal, a leading content management system for thousands of governments worldwide, and aligned with the data standards and best practices of the CKAN data portal software. Like CKAN, DKAN is a data discovery system making data accessible and usable by streamlining publishing, sharing, finding and using data. As well as harvesting, cataloguing, and advanced searching, it can store data and provides rich data APIs, visualisation and exploration tools. Additionally, unlike CKAN, DKAN is a distribution (pre-configuration) of Drupal and as such is also a complete CMS offering comprehensive tools to manage content, documents, and community, in addition to datasets. This also gives DKAN access to the tens of thousands of Drupal developers and extensions already developed for the platform. Key CMS features include blogs, groups, taxonomies, WYSIWYG editing, faceted search, form building, calendars, and a full graphical user interface for administering all content, workflows, user roles and permissions.

Open data suitability

Descriptive metadata: DKAN, similarly to CKAN, has the concept of the dataset being a folder in which files, known as resources, are stored. Metadata are applied at the dataset level and not at the data structure level. Metadata are served as RDF and support Dublin Core, DCAT and the INSPIRE geospatial format. Drupal supports the creation of custom metadata as well.
Machine-readable: DKAN is able to import and interpret CSV, XLS, XLSX, and text files in a machine-readable format. It is further able to interpret and serve PDFs and other file-types even where it cannot import them into the database.
Anonymous access: DKAN is able to use the full Drupal authorisation system, including permitting anonymous access to public datasets for search and download.
Data licenses: DKAN offers a range of common license types to the user during the data upload process and these are presented clearly in the dataset view.
Data attribution: DKAN presents clear attribution where such data exist.
Search: DKAN’s search is clearly presented, and returns results permitting filtering by metadata. Search results are presented with title and description, and a list of file-types present in the dataset.
Open API: DKAN has a clear, although complex, API that is documented thoroughly online as part of the main documentation. Data resources can be downloaded via the API and are available as JSON or XML output. An API key is required for users wishing to modify or upload data via the API; no API key is required for search or download.
Static URI: DKAN presents all datasets and data resources as static URIs.
Harvesting: DKAN is able to harvest existing data resources, as well as regularly changing data, via the API. There is currently no user interface for setting up automated harvesting tasks; however, it should be possible to use the CKAN harvester for this.
Federating: Drupal is able to federate with multiple Drupal sites and so, intrinsically, this is possible with DKAN. However, this has not been tested to any large degree.
Public Documentation: DKAN’s documentation is comprehensive and available at http://docs.getdkan.com/. This presents information for users as well as for developers.
Standards-based: DKAN is aligned with best practice in the open data industry.

NSO suitability

Structural metadata: DKAN offers no support for metadata describing data structure.
OLAP hypercube: DKAN places greater emphasis on the source data than on the database. There is no support for hypercubes. Data cannot be filtered or faceted via the API, although support for OData may permit this in future.
Data endpoints: DKAN produces JSON and XML output. DKAN presents data in the format in which it was uploaded and does not offer data transformations.
Visualisation: DKAN’s included public-facing visualisation library, Recline.js, is very limited in terms of what it can support, and does not permit the user to save or share specific visualisations. Recently, DKAN has developed a more sophisticated visualisation system for embedding and saving charts, including geospatial data, as part of data-driven story-telling. While not a sophisticated business intelligence tool, this does provide entry-level data presentation services. Overall, rather than trying to be a robust data visualisation tool itself, DKAN has developed a toolkit to facilitate integration with third-party data visualisation web services such as CartoDB.
UX & S/W extensibility: As a set of Drupal components, DKAN has the advantage of being part of one of the most active open source projects in the world. The range of services and software available to Drupal is extensive, including thousands of skilled developers available internationally. Drupal is one of the leading content management systems and is used by many of the world’s most popular websites. The user interface is based on leading best practice, and the software is extremely extensible. DKAN has developed flexible UX tools to map its schema (and that of any Drupal content type) to any other schema, such as CKAN or US Project Open Data.

Observations: DKAN is a powerful and extensible open source platform for open data systems, including CMS support. There is also the option of enterprise-level SaaS. However, it will need hypercube, and data structure metadata, support before it can be deployed for integrated NSO statistical portals.

5.4 Junar

URL: http://www.junar.com/
S/W Licence: Proprietary
Language: Java, Django / Python
SaaS: http://www.junar.com/
Demo: http://www.junar.com/
Examples: data.sanjoseca.gov; datosabiertos.gob.go.cr; data.cityofsacramento.org

Overview
Junar is a specifically software-as-a-service platform offering one of the leading open data platforms. The system is able to import and use a wide variety of data formats and, as with all SaaS offerings, is useful to users looking for rapid deployment and the ability to develop and present insight from their data very rapidly. Junar has also developed a proposition for academic publishing and so is extending into new markets.

Open data suitability

Descriptive metadata: Junar uses RDF metadata to describe the datasets, presented in Dublin Core and DCAT.
Machine-readable: Junar offers a wide range of support for different machine-readable formats, including CSV, XLS and XLSX, JSON and SOAP/XML 2.0, as well as the KML, KMZ, GeoJSON, and Shapefile geospatial formats.
Anonymous access: Users do not need to create accounts in order to access data, but they will need to do so in order to access higher services.
Data licenses: Licenses are not clearly presented for individual datasets.
The next software release will include custom licences for datasets using the template provided by http://project-open-data.github.io/license-examples/.
Data attribution: Additional information for each dataset provides links to the source data.
Search: Junar offers search, but this is not, visually, a priority on the site. One weakness is that Junar has limited focus on faceting data. Data may be structured with metadata, but that is not exposed through the interface to the user to permit them to filter search results in a more accessible way.
Open API: Junar provides an interactive API for each of their sites. This permits developers to experiment live on the database to see what results they can achieve.
Static URI: Each dataset generates a unique URI. Visualisations and dashboards created by the system administrator also get a unique URI.
Harvesting: Junar has an additional component called Publishing Workflow which permits some automated collection and management of data from different locations, including assigning metadata to those data. The Junar Uploader is a set of scripts for automatically uploading a range of file-types, including CSV, XLS, XLSX, KML and KMZ. Junar has integrated capabilities to collect data from REST/JSON or SOAP/XML web services linked directly to source databases for real-time, or near-real-time, data collection. Integrated REDATAM+SP software permits harvesting from HTML forms to collect data directly.
Federating: All Junar sites run as SaaS from the same servers but, at this stage, data are not federated across the different platforms. Junar has produced an extension for CKAN so that CKAN is able to read metadata from Junar and present it in search results.
Public Documentation: The documentation for Junar is available on a set of wiki pages, many of which are customised for particular clients: http://en.wiki.junar.com/index.php/Main_Page. This is not particularly easy to read and is of limited value to developers. Given that this is a critical part of the service, it needs to be much more visible, both from the Junar main site (which would benefit from an entire developers subsite) and from the client sites. There is a knowledge base available at http://support.junar.com which is not regularly updated but contains a basic subset of information regarding the use of the platform, the Publishing Workflow, an FAQ section and a feature request section.
Standards-based: Junar has focused on supporting the leading standards in the open data community.

NSO suitability

Structural metadata: Junar does not currently support structural metadata.
OLAP hypercube: There is no hypercube support.
Data endpoints: Junar supports a wide range of data endpoints, including CSV, JSON, PDF, RDF, RSS, XLS, XLSX and XML, all of which are also available via the API. Junar has also recently added OData, giving the opportunity to query individual datasets as well, and provides integration with Google Docs and Dropbox.
Visualisation: Junar’s main focus is on providing support for the development of comprehensive data-driven visual dashboards. While discovery is important, Junar has realised that many of their clients also want to use their data to tell stories which offer stakeholders an easy way to digest complex data. The standard range of charts is available, including geospatial plotting, and the ability to drag and drop a wide variety of visual types to create an integrated dashboard.
UX & S/W extensibility: Junar is proprietary software and the published public APIs are only about downloading data, rather than extending functionality or uploading data. While the user interface can be customised, and new functionality written, the NSO is reliant on Junar for these services.

Observations: While Junar is one of the leading open data publication services, it has no support for the structural metadata or hypercubes required by NSOs. Junar, as with the other proprietary SaaS services, concentrates on ease of deployment and on providing visual tools with plenty of hooks for downloading data and developing custom applications. From a client perspective, this is a straightforward approach to getting open data to the public quickly.

5.5 NADA

URL: ihsn.org/home/software/nada
S/W Licence: Open Source, BSD
Language: PHP
Examples: microdata.statistics.gov.rw; statistics.knbs.or.ke; nigerianstat.gov.ng/nada

Overview
NADA is a web-based cataloguing system that serves as a portal for researchers to browse, search, compare, apply for access, and download relevant census or survey information. It was originally developed to support the establishment of national survey data archives. The application is used by a diverse and growing number of national, regional, and international organisations. While the platform is open source, the additional DDI Metadata Editor is proprietary and provided by Nesstar Publisher as freeware. The International Household Survey Network, coordinated by the World Bank and responsible for maintaining NADA, has discussed a thorough rearchitecture of the platform onto the Symfony developer framework. This is what Drupal, the CMS, is developed on, and it promises a future in which NADA integrates well and is more readily deployed. It is currently on the CodeIgniter framework.

Open data suitability

Descriptive metadata: All resources are associated with DDI metadata, which is not produced from NADA itself. The DDI Metadata Editor produces DDI-compliant XML files for upload into NADA. It is not obligatory as part of the platform (there are other editors, and advanced users can even use a text editor) but this one happens to be free. NADA also presents the metadata in RDF.
Machine-readable: NADA is specifically designed for managing and presenting microdata. It does not have mechanisms for interpreting data resources as machine-readable. Any and all data resources are stored and presented as-is for download.
Anonymous access: Anonymous access for search is the default; however, as the system is designed for microdata, data download requires a login and appropriate authorisation. If the direct access, or recently added open data, license types are chosen then no login is required: users simply agree to terms appropriate to the license and go directly to download the data files.
Data licenses: Each dataset contains comprehensive details on licensing and reuse.
Data attribution: Each dataset contains comprehensive details for data attribution. Importantly, NADA offers clear citation references as part of an international drive to encourage better citation of data and to recognise it as citable work.
Search: Full-text search is provided, with filtering and faceting, including the ability to limit the search space by the date range of research publication.
Open API: NADA does not have a published API, which makes extension more difficult. There is a private and undocumented API which offers access to a few of the DDI metadata fields.
The lack of an API or other mechanism to manipulate data means that aggregations cannot be derived from the microdata programmatically.
Static URI: Every view presents its own URL, including the resources for download.
Harvesting: NADA does not support harvesting, although, given that DDI can be captured in an XML file for import, a mechanism for automated import should be feasible.
Federating: NADA does not support federation.
Documentation: Current documentation is available in PDF and at documentation.ihsn.org/nada/4.2/. Overall developer documentation is limited and, until the new Symfony framework is adopted (potentially more than 12 months away), this will continue to be somewhat forbidding for custom extension development.
Standards-based: NADA is designed to support the exacting requirements for DDI metadata and document lifecycle management. It is standards-based.

NSO suitability

Structural metadata: NADA supports DDI structural metadata requirements.
OLAP hypercube: There is no hypercube support.
Data endpoints: Metadata are available as XML and the descriptions of the external resources are available in Dublin Core, but the data themselves are only available as the originally uploaded source document.
Visualisation: NADA provides no support for data visualisation.
UX & S/W extensibility: The developer documentation is limited, but NADA is open source and so customisation and extension are possible with some trial and error. Once the system is ported to Symfony, customisation should be far easier.

Observations: NADA is the only platform in this survey which is designed expressly to support the needs of producers and archives publishing microdata. Most users are national data archives, some universities and some international organisations. It is not designed as an open data platform and does not provide an API or interpret machine-readable resources. It also does not support the integrated needs of NSOs.

5.6 Nesstar

URL: http://www.nesstar.com/
S/W Licence: Proprietary
Language: Java
Demo: http://nesstar-demo.nsd.uib.no/
Examples: nesstar.ess.nsd.uib.no; nesstar.ukdataservice.ac.uk; nesstar.ssc.wisc.edu

Overview
Nesstar offers a vertically integrated suite of tools for data publishing and management. Nesstar Publisher consists of data and metadata conversion and editing tools, enabling the user to prepare these materials for publication to a Nesstar Server. However, it can also be used as a stand-alone tool for the preparation of data and metadata. The Publisher enables users to enhance datasets by combining a wide range of catalogue and contextual information, which can then be viewed within the Nesstar web client, Nesstar WebView. Nesstar offers support for multilingual metadata, microdata, aggregate data, multi-layered maps, various visualisations, subscriptions/notifications, cell notes/missing data symbols, basic analysis and embedding of live data into regular web pages.

Open data suitability

Descriptive metadata: Nesstar supports both Dublin Core and DDI for descriptive metadata.
Machine-readable: A very large range of data files are acceptable as data input, including NSDstat, DDI, SPSS, Stata, Statistica, dBase, DIF, CSV, PC-Axis, Excel and Hierarchy Definition Files. This is amongst the most comprehensive of data format systems. Where non-machine-readable files are imported, such as PDF or Microsoft Word documents, Dublin Core and e-GMS are used to define the descriptive metadata. Geospatial data are supported through deployment of GeoServer.
Anonymous access: Anonymous access is available for aggregate (i.e. generalised) data, but not automatically for the microdata stored in the system. An authentication system controls access to such data unless they are specifically set as direct download.
Data licenses: Additional variable data and links to licenses can be supplied along with data descriptions or associated metadata files, but there is no standardised mechanism for listing and declaring data licenses.
Data attribution: A descriptive metadata file is provided with each data series, offering the complete reference for the data.
Search: Search is rudimentary, returning results as a branching tree with little guidance as to where the results may be found. The main mechanism for search is a tree list of all the data resources. There is a more comprehensive advanced search, but this will require some knowledge of the data the user is hoping to find.
Open API: A RESTful API is provided for the platform, including comprehensive documentation on a Git repository at gitlab.nsd.uib.no/nesstar/nesstar-rest-api/. There is no online interactive demonstration of the API.
Static URI: While not immediately obvious, clicking on the link icon in the data browser does provide a static link to each data item.
Harvesting: Nesstar now supports the Open Archives Initiative Protocol for Metadata Harvesting 33. A new standalone component allows server administrators to expose a server’s metadata for harvesting by others. OAI-PMH is a standard protocol designed to make it simpler for data providers to open up their repositories and for service providers to harvest metadata. The protocol uses XML over HTTP and supports Dublin Core and DDI.
Federating: Nesstar does not appear to be designed for federation.
Public Documentation: Documentation is fairly comprehensive and, since the software is designed for self-deployment, system administration and customisation documentation also exists. Note, however, that much of the documentation is only available as PDFs from the Nesstar site. Each of the products (Server, WebView, and Publisher) is presented there. Critically, the Server documentation is available online and is searchable.
Standards-based: While Nesstar is proprietary, the extent of support for metadata standards is comprehensive. Additionally, Nesstar is developed using the open source JBoss suite of middleware, and can be installed on Windows or Linux systems.

NSO suitability

Structural metadata: DDI is used for structural metadata.
OLAP hypercube: Nesstar has comprehensive datacube support, permitting files to be imported directly into the database. Filtering and faceting of data are supported.
Data endpoints: A comprehensive range of data endpoints is offered, including SPSS, Stata, Statistica, SAS, and dBase. Users can also download data in Excel, PDF or CSV formats. These are also available via the API.
Visualisation: While not striking, Nesstar offers a comprehensive range of visualisation and analytical functions which can be applied to datacubes. Charts can be saved and shared, or embedded into a CMS. Beyond charting, Nesstar offers users the option of adding new calculations into tables as well as performing correlation analysis.
UX & S/W extensibility: With a comprehensive API and many open source components, Nesstar would appear suitable for some customisation; however, little of this is reflected in the various independent sites running the software. Most appear identical.
Observations: Nesstar was originally designed to support microdata publication and does not meet all the criteria required for an open data portal, with a need for individual dataset licensing, improved search, and metadata faceting. Nesstar does provide comprehensive data support, including individual files and data series, as well as integrated datacubes, and is well suited to the immediate requirements of NSOs. Few other platforms are as well integrated.

5.7 OpenDataSoft

URL: http://www.opendatasoft.com/
S/W Licence: Proprietary
SaaS: http://www.opendatasoft.com/
Demo: http://public.opendatasoft.com/
Examples: opendata.brussels.be; opendata.paris.fr; data.sncf.com

Overview
OpenDataSoft offers a comprehensive suite of open data and visualisation tools. Their search functionality is straightforward, with faceting/filtering and well-structured results listings which include icons defining the alternative forms for the data (such as table, map, charts or export).

Open data suitability

Descriptive metadata: OpenDataSoft supports DCAT, and INSPIRE for geospatial data. Users are also able to create custom metadata templates.
Machine-readable: A wide range of data types is readable by the software, including Shapefiles, OSM, KML, WFS, ESRI and GTFS, as well as the more traditional XLS, CSV and XML types. There is also the potential to link to alternative data sources, such as web forms, other databases, and APIs.
Anonymous access: Users are able to access the site anonymously, including downloading data and creating visualisations. The API is similarly available for anonymous querying and downloading, although a key is required for modification of data.
Data licenses: All licenses are clearly referenced, including in search results.
Data attribution: Similarly to licenses, attribution is clearly referenced, including in search results.
Search: Natural language search is available, including filtering by a wide range of metadata and data types. The API similarly permits faceting during search.
Open API: The API is open and includes an interactive online dashboard, plus clear documentation, for testing and working with the API. The API permits HTTP/HTTPS/BasicAuth and presents data in JSON/P, CSV and RDF, as well as GeoJSON/P.
Static URI: Static URIs are plentiful. Anonymous users can create custom visualisations and share the URI for them. URIs are generated for every different view, ensuring easy sharing and referencing.
Harvesting: OpenDataSoft has a range of services for importing data from a wide range of sources, including setting processes for removing personal data and performing calculations based on formulae. Data collection can be via remote locations or web services.
Federating: The software is provided as SaaS, but each site is currently presented independently. No federation takes place at this time, but the API and metadata traversal mean that this should be possible in future.
Public Documentation: The public documentation on use of the API is fairly good (http://public.opendatasoft.com/api/doc/), but there is little on the software itself, whether for users or administrators. Most guidance is provided via videos on the main site: http://www.opendatasoft.com/ressources/.
Standards-based: With support for RDF and the leading open data metadata standards, OpenDataSoft is aligned with industry standards.

NSO suitability

Structural metadata: OpenDataSoft currently provides no support for structural metadata.
OLAP hypercube: Currently there is no hypercube support; however, an OLAP capability based on Microsoft’s MDX 34 standard is in development.
Data endpoints: Beyond the standard endpoints of CSV, XML and JSON, and geospatial formats like KML, WFS and GTFS, OpenDataSoft has also developed an OData endpoint.
Visualisation: There is very good visualisation support, including the ability to embed visualisations in other web services, although images cannot be exported as PDF or image files. Chart types include line, spline, column, area, bar and pie charts. OpenDataSoft has also developed a comprehensive geospatial visualisation platform called Cartograph which permits multiple geospatial datasets to be presented simultaneously. The map can then be shared or embedded.
UX & S/W extensibility: OpenDataSoft is a SaaS vendor and, as with all such vendors, deployment is straightforward. The template interface and customisation options appear fairly good, and the various client sites are quite different from each other. However, full customisation is limited to the OpenDataSoft team.

Observations: OpenDataSoft is one of the more sophisticated open data platforms and is well designed to serve that need. While it is not yet entirely suitable for NSO requirements, the company is currently preparing an OLAP extension which will also lead to support for structural metadata.

5.8 PC-Axis and PX-Web

URL: http://www.scb.se/pc-axis
S/W Licence: Proprietary
Language: Unknown
SaaS: No
Demo: statistikdatabasen.scb.se
Examples: www.bfs.admin.ch; www.cso.ie; www.stats.govt.nz

Overview
The PC-Axis family consists of a number of programs for the Windows and internet environment used to present statistical information. It is mostly used by the statistical offices in different countries to let their users retrieve statistics. PC-Axis is a software family with several programs all aimed at facilitating quick and easy dissemination of statistics. PX-Web is the online data publishing and presentation component of PC-Axis.

Open data suitability

Descriptive metadata: In PC-Axis, metadata about data objects is referred to as quality data, and this is supported as a set of additional views providing descriptions and definitions for the data.
Machine-readable: PC-Axis is designed to inform a hypercube, and the system only accepts PC-Axis format files in order to import the necessary data. General machine-readable files are not supported.
Anonymous access: Anonymous user access to data via the PX-Web browser is supported, as well as administrative permissions for managing the data itself.
Data licenses: Licensing is presented at site level, since the assumption is that all data are released from the same source. This is not necessarily always the case and does limit multi-organisation, multi-licence releases.
Data attribution: Once data are extracted from the hypercube, metadata are provided with clear attribution and contact details for the series selected.
Search: The mechanism for searching datasets is an interactive statistical browser in which dates, data series and geographical range are selected before data are presented. It is equivalent to performing a SQL data lookup, which means that users require specialist knowledge about the classification of statistical data in order to find data of interest. PX-Web permits individual pages of text, with tables and special views, to be created, but this does not amount to searching to find data of interest.
Open API: PX-Web has begun offering an API, although this is not offered across all deployments.
Statistics Sweden’s API and documentation are available at http://www.scb.se/en_/About-us/Open-data-API/API-for-the-Statistical-Database-/. Data output is available in a range of formats, including PX (PC-Axis), CSV, JSON, XLSX, JSON-STAT and SDMX.
Static URI: Static URIs to specific data series are not possible.
Harvesting: Data files and resources which are in PC-Axis format can be harvested automatically from remote folders.
Federating: The service is not designed for federation.
Documentation: Documentation on the public PC-Axis file structure is readily available via PDF documents from the main PC-Axis website, although there is no public developer documentation site. Licensees of PX-Web are permitted access to further documentation and are able to access the source code.
Standards-based: PC-Axis has become a widely-used standard for NSOs, and PX-Web similarly supports SDMX, with some support for DDI as well. The software itself is proprietary.

NSO suitability

Structural metadata: There is a range of support for data structure metadata, including PC-Axis for internal management, SDMX for data exchange, and DDI for some data formats.
OLAP hypercube: PC-Axis has comprehensive hypercube support. The hypercube supports full filtering and faceting of data series, and similar functionality is accessible via the API.
Data endpoints: The web interface and API offer a range of machine-readable file formats as endpoints, including PX (PC-Axis), CSV, JSON, XLSX, JSON-STAT and SDMX.
Visualisation: PX-Web does not provide full business intelligence functionality, but does offer a series of static options for data visualisation, including tables, line, bar and pie charts, and some geographic representation. This is not a fully interactive visual package, but it does provide quick and clear functionality. Charts are not shareable or persistent for end-users.
UX & S/W extensibility: Licensees of the platform have access to the source code and are able to customise the user interface. The API permits software extensibility, and licensees are similarly able to extend the software platform as required.

Observations: PC-Axis meets the needs of NSOs for data publication, hypercubes and metadata, but requires enhancements to service open data needs. It can only store PC-Axis-compliant data and import these into hypercubes, not generic data files nor metadata to support them. There is also no URI for data address persistence. PC-Axis requires a degree of expert knowledge to use, and this limits data discovery for lay users.

5.9 Prognoz

URL: http://www.prognoz.com/
S/W Licence: Proprietary
Language: Unknown
SaaS: http://www.prognoz.com/
Demo: http://dataportal.prognoz.com/
Examples: nigeria.prognoz.com; indicators.statistics.gov.rw; dataportal.afdb.org

Overview
Prognoz is a business intelligence platform which supports the development of software solutions on the desktop, web, and mobile devices for visualisation and OLAP, reporting, and modelling and forecasting of business processes. The Prognoz Platform provides collaboration with portal solutions such as MS SharePoint, SAP NetWeaver, and IBM WebSphere, and GIS services such as Google Maps, Microsoft Bing, OpenStreetMap, and Yandex Maps.

Open data suitability

Descriptive metadata: Prognoz uses its own metadata structure, but this is customisable and so could be set up to mimic standard approaches.
Machine-readable: An ETL module allows data import from databases such as Oracle, Microsoft SQL Server and IBM DB2, and a variety of file types: XML, EDIFACT, DBF, TXT, and XLS/XLSX.
At this stage there does not appear to be much support for geospatial data.
Anonymous access: Data access can be strictly controlled, but public datasets are available without a requirement for a login.
Data licenses: There is no clear metadata specifying the license for reuse of the underlying data.
Data attribution: A generic link to the data source is available.
Search: Search can be implemented in a number of different ways, including full-text search plus filtering based on metadata, or only as a branching-tree system for data traversal. Search results are sometimes presented with a degree of interactivity where it is not always clear how to download or access the data. Mostly, however, results take the user through to a data interaction page.
Open API: There is no public API for accessing the data or building new functionality.
Static URI: There are no static URIs to any individual pages within the portals; none of the search results, data pages, or custom views generates a shareable URI.
Harvesting: Prognoz has a visual ETL task manager which permits the creation of processes for transformation and loading of data. These can be set to run regularly as data change. Data can also be set to be loaded directly into datacubes.
Federating: Technically, since Prognoz has a sophisticated harvesting system, it can harvest from other Prognoz instances, importing only the metadata and so acting as a federated system.
Public Documentation: Public documentation is minimal. Given that there is no public API, this is expected.
Standards-based: Prognoz integrates well with Microsoft Office and related services, but complies with neither open data nor statistical data publication standards. Prognoz is mainly aimed at proprietary commercial business intelligence requirements rather than interoperability or standards.

NSO suitability

Structural metadata: As with descriptive metadata, Prognoz has its own approach to managing structural metadata.
OLAP hypercube: Full support for hypercubes is available, including the ability to facet and produce custom slices. As a business intelligence tool, cubes can also be subjected to numerous transformations, including analysis, forecasting, validation and so on.
Data endpoints: Data selections can be downloaded as any of XLS, XLSX, PDF, RTF, HTML, MHT, PPTX, ODS, EMF, and PPReport.
Visualisation: Prognoz is a comprehensive business intelligence platform with the ability to conduct analysis on the data, including modelling and forecasting, as well as producing complex visualisations and dashboards. While there does not appear to be support for shapefiles or other coordinate data, place names are recognised and plotted on maps. Searchable dashboards permit visual data exploration, and a wide range of endpoints allows the charts and graphics to be downloaded for presentation elsewhere.
UX & S/W extensibility: Prognoz comes with its own software development toolkit, which is compatible with .NET. This permits developing macros for data management and forms for online data capture, as well as creating custom visualisations and charts. External libraries for data representation using COM, ActiveX, Flash, .NET, and ASP.NET technologies can also be used. Note that none of the development tools or documentation is public or standard, and so third-party developers are unlikely to have experience with the software.

Observations: Prognoz is not suitable as an open data portal, offering few of the requirements for such data publication.
It also does not appear to be entirely suitable for NSOs, since it does not support standard metadata requirements. Prognoz is clearly aimed at the bespoke needs of the corporate environment and has little support for the standards taken for granted in more collaborative industries. While it has been used for data publication, it appears an unusual choice given the amount of data transformation required for interoperability with other statistical platforms.

5.10 Semantic MediaWiki

URL: http://semantic-mediawiki.org/
S/W Licence: GNU General Public License
Language: PHP
SaaS: http://www.referata.com/
Demo: http://semantic-mediawiki.org/
Examples: openei.org; floridalegalwiki.com; www.skybrary.aero

Overview
Semantic MediaWiki is an extension of MediaWiki, the wiki application best known for powering Wikipedia, that helps to search, organise, tag, browse, evaluate, and share the wiki’s content. While traditional wikis contain only text, SMW adds semantic annotations that allow a wiki to function as a collaborative database. Semantic MediaWiki is the only data publication platform evaluated in this report offering version control.

Open data suitability

Descriptive metadata: Semantic MediaWiki is an RDF implementation with templates to structure the metadata linked to imported data.
Machine-readable: Importing of data is performed via XML or CSV only, with additional extensions permitting JSON as well. The expectation, though, is that data are read using the CSV format for inline queries. There is also recognition of coordinate data for plotting on maps.
Anonymous access: Users may remain anonymous, but there is also a full authentication service for data and interaction management.
Data licenses: Metadata templates present data licenses on each page.
Data attribution: Similarly to licenses, each dataset is presented with its source.
Search: Free-text search is supported, although search results are limited to a title and an excerpt from the description. Filtering is limited to preset metadata, which does not guide the user towards refining the results to any great degree.
Open API: The platform is queryable via a SPARQL interface and is able to return JSON data serialisation. Note, though, that the API only queries the database; extending the software is done via independent modules that must be plugged into the software itself.
Static URI: Static links are available for all data and views.
Harvesting: There is limited support for automated importing of data.
Federating: As with harvesting, federation of data sites is limited.
Public Documentation: Semantic MediaWiki has been in continuous development since 2005 and has a large and enthusiastic developer community. As with all popular open source projects, documentation is comprehensive and widely available.
Standards-based: The software has a passionate community, and numerous research projects and extensions have aimed to ensure that the software is entirely standards-compliant. Many of the researchers involved are statisticians and have aimed to develop features of use to NSOs.

NSO suitability

Structural metadata: Semantic MediaWiki does not currently provide support for structural metadata.
OLAP hypercube: Papers offering proof-of-concept OLAP support for Semantic MediaWiki have been published, but formal implementations have not yet been completed. The likelihood is that support would be via the RDF Data Cube vocabulary.
Data endpoints: The system provides output as XML, JSON and via SPARQL.
Visualisation: Visualisation is extremely limited, mostly to tabular formats, but extensions can be developed for a variety of open source libraries.

UX & S/W extensibility: MediaWiki is a platform in its own right, and a vast number of software extensions have been written to enhance it. Similarly, the active developer community has written comprehensive documentation which is available to support any custom extension or UX work that may be required.

Observations: Semantic MediaWiki is suitable as an open data publication service. While there are initiatives underway to incorporate NSO requirements, it does not currently meet those needs. Semantic MediaWiki is still tied to MediaWiki, which means a fairly rigid template style and a service that is mostly about text. The Wikidata project is aimed mainly at data, but currently for internal MediaWiki use. The likelihood is that these two projects will start to merge. Despite an extremely large developer community, with numerous working NSO statisticians amongst the developers, there are few NSO portals built on this platform.

5.11 Socrata

URL: http://www.socrata.com/
S/W Licence: Mixed proprietary and OS
Language: Scala, Javascript, Ruby
SaaS: socrata.com/products/open-data-portal/
Demo: nycopendata.socrata.com
Examples: data.undp.org, data.cityofchicago.org, opendata.go.ke

Overview: Socrata's Open Data Portal SaaS provides one of the more comprehensive open data services, with a range of extensions for dashboards, live reports and the ability to manipulate and update existing data live in the portal. Their commitment to SaaS means you can deploy a new site and be serving data in a day. Beyond open data, they offer business intelligence and visualisation functionality permitting data visualisation, analysis and sharing via social media.

Open data suitability

Descriptive metadata: Socrata uses RDF metadata to describe the datasets, presented in Dublin Core and DCAT, as well as custom metadata fields.

Machine-readable: Socrata is able to read and produce the following data types: CSV, JSON, PDF, RDF, RSS, XLS, XLSX, XML, OData, Shapefile, KMZ, and KML.

Anonymous access: Users do not need to create accounts in order to access data, but they will need to do so in order to access higher services. These include following datasets, commenting, and saving visualisations that they produce from the data.

Data licenses: Each dataset is individually licensed and is clearly labelled.

Data attribution: Socrata does present clear attribution where such information exists.

Search: Searching is fast and user-friendly, with the ability to filter by view types as well as categories and topics. Information returned in the search offers not only the description of each dataset but also an abbreviated view of the first three matching rows of data, permitting rapid assessment of results. Datasets can be filtered and faceted via the web interface as well as the API.

Open API: Socrata produces a wide range of endpoints via their API, including REST JSON, CSV and RDF-XML. Their documentation is comprehensive (http://dev.socrata.com/docs/endpoints.html) and permits developers to experiment with the system live. Authentication is required for users wishing to push data to Socrata's servers.

Static URI: Each dataset and view gets its own URI, as well as a generated short URI to facilitate sharing via social media.
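As a sketch of how the API behaves in practice, the request below retrieves rows from a dataset via Socrata's JSON resource endpoints; the dataset identifier is a placeholder, and the query parameters shown are a small subset of what the API accepts.

```python
import requests

# The NYC demo portal cited above; the dataset identifier is a placeholder,
# so substitute the ID of a real dataset before running.
DOMAIN = "https://nycopendata.socrata.com"
DATASET_ID = "abcd-1234"  # hypothetical dataset identifier

response = requests.get(
    f"{DOMAIN}/resource/{DATASET_ID}.json",
    params={"$limit": 5},  # SoQL query parameter restricting the row count
    # An application token lifts throttling limits but is not required for
    # anonymous, read-only access:
    # headers={"X-App-Token": "YOUR_TOKEN"},
)
response.raise_for_status()

# Each row is returned as a JSON object keyed by column name.
for row in response.json():
    print(row)
```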
Harvesting: The API permits development of automated processes for uploading fast-changing datasets or importing existing resources, and Socrata provides a dashboard for managing such processes as well. Socrata supports the White House's /data.json URL extension specification.

Federating: Since all Socrata sites run on a single, centrally hosted platform, federating and sharing resources and datasets between Socrata sites is straightforward. Socrata has developed additional extensions to import metadata from alternative open data portals, such as CKAN.
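A harvester built on the /data.json convention needs very little portal-specific logic, as the sketch below suggests; the domain is illustrative, and the handling of the two catalogue shapes reflects an assumption about differences between versions of the schema.

```python
import requests

# Any portal supporting the /data.json convention publishes its catalogue at
# a predictable location; the domain below is illustrative only.
CATALOG_URL = "https://data.example.gov/data.json"

catalog = requests.get(CATALOG_URL).json()

# Some versions of the schema publish a bare list of datasets; others wrap
# the list in a catalogue object under the "dataset" key.
datasets = catalog["dataset"] if isinstance(catalog, dict) else catalog

for entry in datasets:
    # Title and licence are part of the common metadata schema.
    print(entry.get("title"), "|", entry.get("license", "no licence specified"))
```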
Public Documentation: Possibly Socrata's greatest strength is the well-developed and well-presented developer portal, including numerous libraries for working with software as diverse as the R statistical platform, Scala, Ruby and Java, amongst others. Their developer portal is available at http://dev.socrata.com/.

Standards-based: Socrata has adopted a mixed licensing approach: the core architecture for their centralised, scalable systems is proprietary, while the various tools available via their API are open source. Most of the individual software components for Socrata are open source, available on their Github repository (Socrata Open Data Server Community Edition), and data storage and presentation comply with emerging open data standards, such as RDF.

NSO suitability

Structural metadata: Socrata does not provide standardised metadata for dataset structure or format, though publishers can set custom metadata fields.

OLAP hypercube: Socrata does not support presentation of data as a hypercube. The software approach does facilitate eventual hypercube support, as all machine-readable data are imported into a database, permitting column alignment. Similarly, metadata can be edited, and this could be enhanced to support NSO requirements.

Data endpoints: Socrata provides a range of endpoints, including OData, JSON and XML. They do not provide any of the proprietary formats, such as Stata or SPSS.

Visualisation: While Socrata is not yet a full business intelligence service, the range of visualisations is extensive. Chart creation capabilities include various chart types such as Area, Bar, Column, Donut, Line, Pie, Time Line, Tree Map and Heat Map. Geospatial visualisations support location data and GIS files, such as Esri shapefiles and KML/KMZ files, using either Google Maps, Bing Maps or Esri. A range of additional interfaces make live report generation and charting straightforward, even for the layman.

UX & S/W extensibility: Socrata is designed to be easy to deploy and is managed as SaaS. This reduces complexity in management but also limits the degree to which sites can be customised. Landing pages can certainly be bespoke, but interactivity in search and visualisation remains quite consistently defined. Sites can have a degree of colour and branding changes, but the overall look-and-feel remains similar. Given the breadth of interactivity possible via the API, extending Socrata is straightforward. A library of existing extensions, released under various open source licenses (including the liberal MIT license), is available on their Github repository at https://github.com/socrata.

Observations: Socrata is a good choice for open data sites but will require development to support hypercubes and dataset-level structural metadata in order to meet the complete requirements for an integrated NSO portal.

5.12 Swirrl

URL: http://www.swirrl.com/
S/W Licence: Mixed proprietary and open source
Language: Ruby (Rails)
SaaS: swirrl.com/publishmydata
Examples: opendatacommunities.org, opendatascotland.org, linkeddata.hants.gov.uk

Overview: Swirrl's PublishMyData platform is probably the purest linked data service available for publishing open data. RDF and SPARQL are still sufficiently novel that many data publishers do not necessarily think about data architecture when deciding on their vendor. Fully-realised RDF offers the most future-proof mechanism for data publishing and is worth considering. Swirrl also offers a mixed proprietary and open source set of licenses: their hosted environment and configuration is proprietary, while they offer numerous Ruby-based libraries for developers under the GNU Affero General Public License or MIT license.

Open data suitability

Descriptive metadata: Swirrl offers RDF as the mechanism for metadata. This is extensible and underlies common metadata formats like DCAT or Dublin Core.

Machine-readable: Standard machine-readable formats like CSV and XLS/X (Excel) are supported. If the data are machine-readable then Swirrl will serve them; however, there is limited recognition of datatypes (e.g. geospatial and similar).

Anonymous access: Users have anonymous access, and there is a wealth of controls for authentication management.

Data licenses: Individual data are clearly licensed.

Data attribution: Attribution for every dataset is implemented.

Search: Swirrl does not have a user interface for search, although the data can be traversed and searched via the API. At present, data are simply listed via an interface. A text search facility is scheduled for the next version of the software.

Open API: The API is a SPARQL implementation offering a standards-based interface for interaction and application development.

Static URI: All data and endpoints are offered via static URIs, making sharing and referencing straightforward.

Harvesting: The API offers full interaction and data manipulation, and the SaaS platform offers dashboards for uploading and importing files. API keys are required for authenticated actions.

Federating: There is currently limited support for federating from non-RDF sites, but the API does permit importing metadata from other RDF-compatible services.

Public Documentation: There is fairly good documentation on using the API (http://opendatacommunities.org/docs), although there is not much public information on the interface for the SaaS or for the open source version of the software.

Standards-based: Swirrl is the purest implementation of RDF for open data currently available. It adheres closely to W3C standards.

NSO suitability

Structural metadata: PublishMyData incorporates a number of tools for data using the W3C Data Cube Vocabulary, and Swirrl is involved in a research project to extend RDF Data Cube support 35.

OLAP hypercube: The system offers tools for transforming CSV files and Excel spreadsheets to RDF Data Cube datasets (without the need for programming), managing a collection of concept schemes, selecting URIs from external reference data, and quality checks that make sure any generated RDF meets the required standards.
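For a sense of the target format such tools produce, here is a minimal sketch of a single RDF Data Cube observation built programmatically with the Python rdflib library. The stats.example.org namespaces and the dimension and measure properties are hypothetical, and a complete cube would also declare a qb:DataStructureDefinition.

```python
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF, XSD

QB = Namespace("http://purl.org/linked-data/cube#")
# Hypothetical namespaces for a publisher's definitions and data.
EX = Namespace("http://stats.example.org/def/")
DATA = Namespace("http://stats.example.org/data/")

g = Graph()
g.bind("qb", QB)

dataset = DATA["unemployment"]
g.add((dataset, RDF.type, QB.DataSet))

# A single observation: one cell of the hypercube, located by its
# dimension values (area and period) and carrying one measure.
obs = DATA["unemployment/obs1"]
g.add((obs, RDF.type, QB.Observation))
g.add((obs, QB.dataSet, dataset))
g.add((obs, EX.refArea, URIRef("http://stats.example.org/id/area/X01")))
g.add((obs, EX.refPeriod, Literal("2013", datatype=XSD.gYear)))
g.add((obs, EX.unemploymentRate, Literal(7.2, datatype=XSD.decimal)))

# Emit the observation as Turtle, the usual serialisation for such cubes.
print(g.serialize(format="turtle"))
```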
Data endpoints: The API offers JSON and SPARQL as endpoints.

Visualisation: There is no native visualisation system, but the API permits integration with various other visualisation systems. Basic native visualisation is scheduled for the next release.

UX & S/W extensibility: The community edition of Swirrl permits full customisation (http://github.com/swirrl/publish_my_data), while the online SaaS platform can also be customised or integrated into other systems.

Observations: Swirrl is designed for open data publication and meets the requirements of these services. It does not yet, however, meet all the requirements for NSO portals. They are committed to linked data with a focus on technical users of statistical data, and offer RDF Data Cube support. This offers a comprehensive data publication and management service. Recognise, though, that Swirrl is currently aimed at technical users and developers, although it is working on enhanced features for less technical users.

6 Conclusions and recommendations

The software platforms considered in this report are not a comprehensive list of those available to NSOs and open data publishers. Even so, many of them come close to meeting all the criteria for both sets of requirements. Open data software has led to a number of different approaches to resolving requirements for data publication, engagement and reuse. Many of these do not promote interoperability between platforms, and this raises new problems. What follows is a list of attributes where enhancement would improve the overall utility to NSOs and the open data community.

6.1 Improve technical documentation

Too little of the available documentation for developing custom components or using the software APIs offers much support to developers. Worse, many software services release no documentation at all to the public. Open data is not only about the release of content but also of the mechanisms to engage with that content. This is something the ostensible open data solutions do reasonably well, although some could be improved. However, releasing documentation in poorly-updated PDFs is unacceptable and of limited use. If software vendors persist in developing proprietary and incompatible APIs, then it is essential that these APIs be sufficiently well-documented to be of use.

6.2 Ensure public APIs and endpoints are interoperable

In addition to providing complete documentation, it is recommended that vendors adopt consistent and interoperable APIs. At present, a developer wishing to integrate two unrelated software platforms must manually craft a solution which reads two different APIs, as sketched below. If vendors are able to agree on a common API, or are able to connect to a common standard, then harvesting from different systems, as well as developing applications that integrate a number of unrelated platforms, becomes extremely straightforward. Adopting RDF is one mechanism which will permit eventual harmonisation. However, this also means agreeing on metadata standards and terms which permit different software platforms to communicate directly. Going from human-readable to machine-readable inevitably requires compromise, but it is essential. Critically, it is not helpful if RDF is used for data architecture (i.e. to find out what is in the platform's database of datasets) but there is still no standardised way to traverse that architecture and arrive at a data endpoint. The likelihood is that NSOs will not manage all the data in a particular country and that different data publishers will choose different software platforms. It is essential that harvesting and federation adopt common standards for interoperability. This will permit services like IPUMS to harvest census data directly without having to gather direct copies and restructure the data. It will also reduce the duplication of software systems serving the overlapping needs of funders and stakeholders. Such interoperability will permit NSOs to adopt a staged approach to updating and integrating their systems. Retrofitting DDS to existing systems will be more straightforward if those systems are able to pull data for visualisations and dashboards directly from an RDF endpoint. This does not imply that SDMX, DDI or OData are necessary solutions, but they may provide interim points towards SPARQL and RDF data cubes. Common endpoints and structural metadata, at least, permit third-party software to interpret data when it is eventually discovered.
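To make the integration burden concrete, the sketch below reads dataset listings from two portals with unrelated APIs and maps them onto a single minimal record. The domains are illustrative, and the response shapes are assumptions modelled loosely on the CKAN action API and the /data.json convention.

```python
import requests

def harvest_ckan_style(base_url):
    # CKAN-style action API: assumed to return {"result": ["name", ...]}.
    payload = requests.get(f"{base_url}/api/3/action/package_list").json()
    for name in payload["result"]:
        yield {"id": name, "source": base_url}

def harvest_datajson_style(base_url):
    # /data.json-style catalogue: a list of dataset objects with identifiers,
    # possibly wrapped in a catalogue object depending on schema version.
    catalog = requests.get(f"{base_url}/data.json").json()
    datasets = catalog["dataset"] if isinstance(catalog, dict) else catalog
    for entry in datasets:
        yield {"id": entry.get("identifier"), "source": base_url}

# Every new portal type requires another bespoke adapter; a shared API or a
# common RDF vocabulary would collapse these into a single harvester.
catalogue = list(harvest_ckan_style("https://portal-a.example.org"))
catalogue += list(harvest_datajson_style("https://portal-b.example.gov"))
print(len(catalogue), "datasets harvested")
```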
6.3 Presentation of metadata and URIs must conform to W3C standards

It is critical that data be released in machine-readable formats, but it is just as important that search and data discovery generate unique and persistent URIs. Metadata associated with datasets also needs to be easily discoverable and presented with the data it applies to. Many software platforms fail to present descriptive metadata which aids discovery and reuse. It needs to be clear what the licensing and reuse policies are, what the data are about, and who is responsible for them. Similarly, data discovery is time-consuming, and discrete URIs for each step of the process permit sharing and saving of these states. A user who spends half an hour finding data, only to be unable to simply send a link to a colleague (or bookmark it for later use), is less likely to engage with the data at all. This is simple compliance with W3C standards for acceptable web applications and is not specific to open data. Best practice, in this case, is simply good internet manners.

6.4 Natural language search and metadata faceting should be standard

Many of the solutions chosen for data publication favour visualisation and presentation over discovery and reuse. From a budget allocation perspective, plumbing usually takes second place to more attractive considerations. Sadly, this makes data discovery awkward and reduces the potential for data reuse. Google, Bing and Yahoo are the most common search experience for most users. Expert systems, which data software are, can enhance the search experience by permitting metadata to be used as additional context-sensitive filters. Data discovery should not be limited to professionals who are already familiar with the data. Free-text search along with metadata faceting speeds up data discovery and improves system performance, as the sketch below illustrates.
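A toy illustration of the combination, assuming a small in-memory catalogue with invented fields; a production system would delegate both free-text matching and facet counting to a search engine such as Solr or Elasticsearch.

```python
from collections import Counter

# Invented sample records; real catalogues hold full descriptive metadata.
CATALOGUE = [
    {"title": "Census 2011 microdata", "topic": "population", "format": "CSV"},
    {"title": "Consumer price index", "topic": "economy", "format": "SDMX"},
    {"title": "Population projections", "topic": "population", "format": "JSON"},
]

def search(text, **facets):
    """Free-text match over title and topic, then filter by metadata facets."""
    hits = [d for d in CATALOGUE
            if text.lower() in (d["title"] + " " + d["topic"]).lower()]
    for field, value in facets.items():
        hits = [d for d in hits if d.get(field) == value]
    return hits

def facet_counts(hits, field):
    """Counts displayed alongside results to guide the user's next refinement."""
    return Counter(d.get(field) for d in hits)

hits = search("population")
print(facet_counts(hits, "format"))    # Counter({'CSV': 1, 'JSON': 1})
print(search("population", format="CSV"))
```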
6.5 Structural metadata and hypercube support are core NSO requirements

Statistical data are used to develop research insight. Individual tables become more useful when they can be aggregated together and sliced into numerous views for analysis. Obviously, if the data do not conform to a structural metadata standard such as SDMX, PC-Axis or DDI, then offering support for hypercubes is not going to be useful. In the case of NSOs, however, support for commonly-used structural metadata is a core requirement. With the release of the W3C's RDF Data Cube Vocabulary, it is also becoming less necessary to build a complete implementation from scratch. For some open data software services this will prove an extremely difficult enhancement, but offering support for both descriptive and structural metadata, as well as hypercubes, is all part of ensuring the data are as widely available and as useful as possible.

6.6 Dashboards and visualisations are necessary for user engagement

User engagement in the face of vast and complex data is more likely where those data are presented in a tactile and engaging way. The growth of data journalism has depended on the growing availability of public data resources and the ability to present visually exciting views on those data. Numerous platforms already offer data visualisation, dashboard development and some level of business intelligence. For others, where such systems would require major investment, compliance with standards and interoperability will permit integration with existing software. Online visualisation systems were not evaluated for this report, but there are a vast number, ranging from simple solutions like Datawrapper.de to sophisticated business intelligence systems like Tableau. If software cannot offer everything, then it should offer integration.

6.7 Develop data engagement tools for improving data-quality and reuse

Visualisations permit users to interact with the data but do not necessarily support the enhancement of the data, or their reuse. There are a number of approaches which promote data quality and reuse:

• Use the data you publish: data publishers should have a mechanism in the software to derive visualisations and analysis from the data, and save or link these to published datasets;
• Showcase user applications: a mechanism for users to share the applications, research or content they have developed, with a workflow for site administrators to evaluate and present such content;
• Register issues: instead of comments, which require moderation, offer users the opportunity to raise issues with data quality for each dataset; provide a tracker for responses and a dashboard showing issue responses;
• Data requests: offer a direct mechanism for users to request datasets which may not yet be available; Data.gov.uk offers known, but unavailable, data in search results, flagged as requiring a formal request.

7 Acknowledgements and Research Methodology

This technical research assessment was commissioned and supported by the World Bank. The report was researched and written by Gavin Chait of Whythawk 36, open data software consultants. Both primary interviews and secondary research of existing literature contributed to the content of this report. The complete list of people interviewed as part of the study (in alphabetical order of organisation or software) is:

• Adam McGreggor and Adrià Mercader of the CKAN team at Open Knowledge;
• Andrew Hoppin of the DKAN team at NuCivic;
• Matthew Welch and Olivier Dupriez on behalf of the International Household Survey Network (IHSN);
• Robert McCaa at IPUMS;
• Diego May at Junar;
• Jean-Marc Lazard at OpenDataSoft;
• Rajiv Ranjan at Rwanda's National Statistics Office;
• Ben McInnis, Joe Pringle, Jessica Carsten and Jeff Kaplan at Socrata;
• Bill Joyce at Statistics Canada (Canada NSO);
• Lars Knudsen at Statistics Denmark (Denmark NSO);
• Bill Roberts at Swirrl;
• Marian Brady, Oliver Fischer and Jeffrey Sisson of the US Census Office;
• Tim Harris, Tim Herzog and Thomas Danielewitz at the World Bank.

All secondary sources are cited in the text and available in the References section. The analysis presented in this report derives from the interviews conducted during the primary research phase and offers insight into software and design choices and priorities. It cannot be considered a statistically relevant sample, but it does provide a sense-check as to how different organisations and entities across the open data industry respond to constraints and opportunities.

8 Glossary
Application Programming Interface (API): Specifies how some software components should interact with each other. Used to ease the work of programming graphical user interface components, to allow integration of new features into existing applications, or to share data between otherwise distinct applications. Presented as a library that includes specifications for routines, data structures, object classes, and variables. In some cases, notably for SOAP and REST services, an API comes as just a specification of remote calls exposed to API consumers.

Business Intelligence Systems (BIS): A platform for engaging with structured quantitative data to produce custom slices of that data, charts, tables and geospatial representations.

Content Management Systems (CMS): Permit publishing, editing and modifying of qualitative content, as well as providing mechanisms to manage workflows and individual users in a collaborative environment.

Data Discovery Systems (DDS): Similar to CMS, but provide mechanisms to manage the semi-structured quantitative and qualitative data in documents and spreadsheets, and offer methods for data publication, discovery and reuse.

Datastore: A data repository of a set of integrated objects modelled using classes defined in database schemas. A datastore includes not only data repositories like databases but, more generally, also flat files that can store data.

Descriptive metadata: Corresponds to external metadata, typically used for discovery and identification, as information used to search and locate an object, such as title, author, subjects, keywords and publisher.

Extract, Transform and Load (ETL): A process in database usage, and especially in data warehousing, that extracts data from outside sources; transforms it to fit operational needs, which can include quality levels; and loads it into the end target (database; more specifically, operational data store, data mart, or data warehouse).

Faceting: Also called faceted search, faceted navigation or faceted browsing; a technique for accessing information organised according to a faceted classification system, allowing users to explore a collection of information by applying multiple filters. Each information element is classified along multiple explicit dimensions, enabling classifications to be accessed and ordered in multiple ways rather than in a single, pre-determined, taxonomic order.

Federation: A meta-database management system which transparently maps multiple autonomous database systems into a single federated database. The constituent databases are interconnected via a computer network and may be geographically decentralised. Since the constituent database systems remain autonomous, a federated database system is a contrastable alternative to the (sometimes daunting) task of merging several disparate databases. A federated database, or virtual database, is a composite of all constituent databases in a federated database system. There is no actual data integration in the constituent disparate databases as a result of data federation.

Filestore: A collection of binary data stored as individual files and referenced in a database management system.

Generalised data: Aggregations derived from microdata; for example, the total number of people in a particular education category.

Harvesting: An automated and autonomous mechanism for ETL of known data from known web-addressable locations into a single database or datastore.

Linked Data: A method of publishing structured data so that it can be interlinked.
It builds upon standard web technologies such as HTTP, RDF and URIs, extending them to share information in a way that can be read automatically by computers. This enables data from different sources to be connected and queried.

Microdata: Information at the level of individual respondents, households and businesses, typically gathered through surveys; for example, a national census may collect age, address, education, employment status, etc. from individuals.

Online Analytical Processing (OLAP) hypercube or cube: An approach enabling users to analyse multidimensional data interactively from multiple perspectives. OLAP consists of three basic analytical operations: consolidation (roll-up), drill-down, and slicing and dicing. Consolidation involves the aggregation of data that can be accumulated and computed in one or more dimensions. Drill-down is a technique that allows users to navigate through the details. Slicing and dicing is a feature whereby users can take out (slice) a specific set of data from the OLAP cube and view (dice) the slices from different viewpoints.

Open Source Software (OS): The source code is available in an online and public repository under a liberal reuse license (such as the General Public License and its affiliates); sometimes known as Free and Open Source Software (FOSS). Not all open source software is free, and not all free software is open source.

Representational State Transfer (REST): An architectural style consisting of a coordinated set of architectural constraints applied to components, connectors, and data elements within a distributed hypermedia system. REST ignores the details of component implementation and protocol syntax in order to focus on the roles of components, the constraints upon their interaction with other components, and their interpretation of significant data elements.

Resource Description Framework (RDF): A general method for conceptual description or modelling of information implemented in web resources, using a variety of syntax notations and data serialisation formats.

Semantic interoperability: The ability for computer systems to exchange data unambiguously.

Software-as-a-Service (SaaS): Software is available online on a centralised hosted server via a subscription service instead of as a deployable software system with a single, static price; upgrades, bug fixes and patches are consistently and regularly applied; custom extensions of functionality can be achieved via an API, but customisation of the user interface is more limited.

Structural metadata: Corresponds to internal metadata about the structure of database objects such as tables, columns, keys and indexes.

Uniform Resource Identifiers (URI): A string of characters used to identify the name of a resource. Such identification enables interaction with representations of the resource over a network, typically the World Wide Web, using specific protocols. Schemes specifying a concrete syntax and associated protocols define each URI. URIs can consist of both namespaces and locators at the same time.

Uniform Resource Locators (URL): Also known as a web address, particularly when used with HTTP; a specific character string that constitutes a reference to a resource. In most web browsers, the URL of a web page is displayed in an address bar at the top. A URL is a type of URI.

9 References
1. Open data challenges and opportunities for national statistical offices. Washington, DC: World Bank Group. 2014. http://documents.worldbank.org/curated/en/2014/07/19791395/open-data-challenges-opportunities-national-statistical-offices-open-data-challenges-opportunities-national-statistical-offices
2. Ibid.
3. Technology Options for Open Government Data Platforms – Timothy Herzog, World Bank, 2014-01-31.
4. https://international.ipums.org/international/
5. http://www.ihsn.org/home/software/nada
6. http://www.w3.org/RDF/
7. Anwar and Hunt, BMC Bioinformatics 2009, 10(Suppl 10):S3. doi:10.1186/1471-2105-10-S10-S3
8. https://www.gnu.org/licenses/licenses.html
9. http://www.w3.org/TR/vocab-dcat/
10. http://www.ddialliance.org/
11. http://dublincore.org/documents/dces/
12. http://inspire.ec.europa.eu/index.cfm
13. ISO 19115-1:2014, Geographic information – Metadata. http://www.iso.org/iso/home/store/catalogue_ics/catalogue_detail_ics.htm?csnumber=53798
14. Linked Data – Tim Berners-Lee, 2006-07-27. http://www.w3.org/DesignIssues/LinkedData.html
15. New paywall costs the Times 66% of its internet readership. http://www.theguardian.com/media/2010/jul/18/times-paywall-readership
16. Showing you this map of aggregated bullfrog occurrences would be illegal. http://peterdesmet.com/posts/illegal-bullfrogs.html
17. http://creativecommons.org/licenses/
18. http://www.nationalarchives.gov.uk/doc/open-government-licence/version/2/
19. http://opendatacommons.org/licenses/odbl/1.0/
20. http://creativecommons.org/licenses/by/3.0/igo/
21. The Semantic Web – Tim Berners-Lee, James Hendler and Ora Lassila, Scientific American, 2001-05. http://www.scientificamerican.com/article/the-semantic-web/ [Subscription, annoyingly]
22. That's just semantics – Gareth McGuinness, International Monetary Fund, 2009-10. https://app.box.com/shared/9zms2mvsio
23. http://www.scb.se/sv_/PC-Axis/About-PC-Axis/
24. http://sdmx.org/
25. https://dvcs.w3.org/hg/gld/raw-file/default/data-cube/index.html
26. http://www.w3.org/TR/owl-features/
27. http://data.gov.uk/resources/coins
28. http://www.json.org/
29. http://www.odata.org/
30. http://stats.oecd.org/OpenDataAPI/Index.htm
31. http://www.w3.org/XML/
32. http://www.w3.org/TR/sparql11-overview/
33. http://www.openarchives.org/OAI/openarchivesprotocol.html
34. http://en.wikipedia.org/wiki/Multidimensional_Expressions
35. http://blog.swirrl.com/articles/open-cube/
36. http://www.whythawk.com/