Comprehensive Study of Open Data Platforms

Digvijay Mali
Published in Analytics Vidhya · Oct 14, 2020

PART 1: Comprehensive Knowledge Archive Network (CKAN)

CKAN (Comprehensive Knowledge Archive Network) is an open-source data management system designed to power data hubs and portals. It facilitates the worldwide publication, sharing and use of data. The world now stores roughly 5 GB of data for every person on the planet, which increases the demand for platforms like CKAN that enable people to access the data sets they need from anywhere. CKAN makes data easily accessible by providing tools for streamlining, publishing and sharing data across multiple users from different domains. Because it is open source, there are no license fees, and we retain all rights to the data and metadata we enter. CKAN is useful for product developers, technical consultants, data analysts, data journalists and researchers. It also makes finding, sharing and reusing data easy for people, whether they are research scientists or civil servants, data nerds or average citizens. CKAN also helps governments better manage their data, as it provides better data and knowledge discovery. CKAN offers smart ways to search data, such as keyword search (for example, tag browsing) and filters by location, license, group, publisher, etc.

One of CKAN’s most important features is storing metadata, including small details such as references to the original resources and the standards the data was converted to (such as ISO). Many sites expose open datasets, and CKAN acts as a gateway through which we can reach those endpoints and retrieve data from all over the world. We can compare it to a marketplace like Amazon, which connects shoppers to many different retail stores, in the same way that CKAN connects users to different datasets around the globe.

[1] Who uses CKAN?

[2] CKAN is not Just a Repository

Sometimes a repository is seen as a place to deposit your research and then forget it. CKAN is not a repository in that sense. While it can certainly do what is needed from a repository, it is also a place where data continues to work for the research community. One can also use CKAN as a datastore alongside an existing repository. CKAN can be, and already is, used to publish research outputs on the Datahub.

CKAN aims to provide a platform that is both simple and powerful: as easy to build on and extend as it is to use and interact with. CKAN’s core is a powerful machine-accessible registry/catalog system designed to automate tasks such as registering and acquiring data sets. This core can then be flexibly expanded into a full data hub in many ways; for example, by adding integrated storage, social tools, data quality checks, listings of apps and ideas, and integration with third-party tools and services. CKAN also has the essential features of an academic repository, such as rich configurable metadata, data sets to which resources can be added, a datastore with previews, fine-grained authorization options, curated dataset groups, versioned history, faceted search, and an easy, intuitive web interface.
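
As an illustration of this machine interaction, the sketch below registers a new dataset through CKAN's Action API. It is a minimal example, assuming a CKAN instance at https://demo.ckan.org, an API token with write access, and an organization named my-org; the dataset fields are purely illustrative.

```python
import requests

CKAN_URL = "https://demo.ckan.org"          # assumed CKAN instance
API_TOKEN = "replace-with-your-api-token"   # write access required

# Minimal dataset description; 'owner_org' must be an organization you belong to.
dataset = {
    "name": "air-quality-2020",             # becomes the dataset's URL slug
    "title": "City Air Quality Measurements 2020",
    "notes": "Hourly PM2.5 and NO2 readings collected by city sensors.",
    "owner_org": "my-org",                  # hypothetical organization id
    "license_id": "cc-by",
    "tags": [{"name": "air-quality"}, {"name": "environment"}],
}

resp = requests.post(
    f"{CKAN_URL}/api/3/action/package_create",
    json=dataset,
    headers={"Authorization": API_TOKEN},
)
resp.raise_for_status()
result = resp.json()
print("Created:", result["result"]["name"], "| success:", result["success"])
```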

[3] CKAN for Personal Publication

A repository need not be run by an institution to be useful. Got a piece of data or research you want to share via CKAN at a permanent address? You can do it right now at the Datahub, a CKAN repository where anyone can register and upload datasets. You can start a group for all of your own papers, giving your output a permanent address even when you move departments, or a group for your department's papers. You can also configure the permissions for each dataset, for example allowing all co-authors of a paper to update it.

[4] CKAN provides Rich Metadata management

Every resource has its own associated metadata, including external links, as does the dataset as a whole. The default setup includes standard fields like author, title, description, license, etc., and arbitrary additional fields can be added to each dataset. A research-specific CKAN site could, for example, include fields such as DOI and Journal, and these may vary by type of dataset (a thesis, for example, might have required fields such as ‘Supervisor’).
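
CKAN stores arbitrary fields of this kind in an 'extras' list of key-value pairs on each dataset. The snippet below is a sketch that patches an existing dataset with DOI and Journal fields via the Action API; the instance, token, dataset name and values are assumptions carried over from the previous example.

```python
import requests

CKAN_URL = "https://demo.ckan.org"           # assumed CKAN instance
API_TOKEN = "replace-with-your-api-token"

# Attach research-specific metadata as free-form key-value 'extras'.
payload = {
    "id": "air-quality-2020",                # dataset created earlier
    "extras": [
        {"key": "DOI", "value": "10.1234/example.doi"},           # illustrative value
        {"key": "Journal", "value": "Journal of Urban Air Quality"},
    ],
}

resp = requests.post(
    f"{CKAN_URL}/api/3/action/package_patch",
    json=payload,
    headers={"Authorization": API_TOKEN},
)
resp.raise_for_status()
print({e["key"]: e["value"] for e in resp.json()["result"]["extras"]})
```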

[5] Federation and Linking

CKAN’s ‘harvesting’ feature can federate datasets between different servers. For example, a research council could run its own repository and harvest metadata from institutions about research it has funded, and publidata.eu harvests data from 18 European data catalogues. CKAN’s metadata can also be exported in standard formats, including the W3C data catalogue standard DCAT, and RDF (Linked Data) output is built in. Because CKAN has not been widely used for academic repositories, there is no support yet for OAI-PMH; this would be an excellent area for a CKAN extension.

[6] Web, Command Line and API Interfaces of CKAN

CKAN has a user-friendly, intuitive web interface for uploading, editing and searching: a user can create a dataset in a few minutes. The search is heavily tested on portals such as data.gov.uk and supports free-text search with faceting by group (department), document type, etc. Heavy users can also make use of dpm, the open-source command-line data package manager from the Open Knowledge Foundation.
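
The same search that the web interface exposes is also available through the Action API. Below is a minimal sketch against the public demo instance; the query string, tag filter and facet fields are only illustrative.

```python
import json
import requests

CKAN_URL = "https://demo.ckan.org"   # assumed public CKAN instance

params = {
    "q": "air quality",                                  # free-text query
    "fq": "tags:environment",                            # filter query on a tag
    "rows": 5,
    "facet.field": json.dumps(["groups", "license_id"]), # facets to compute
}

resp = requests.get(f"{CKAN_URL}/api/3/action/package_search", params=params)
resp.raise_for_status()
result = resp.json()["result"]

print("Matches:", result["count"])
for pkg in result["results"]:
    print("-", pkg["name"], "|", pkg["title"])
print("Group facets:", result.get("search_facets", {}).get("groups", {}))
```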

[7] CKAN for Maximizing Re-use

CKAN’s DataStore can store structured data and expose it via an API. This means that a data file can be linked to or uploaded as a CSV or spreadsheet, and users can query it directly on the server as well as download it. This makes life easier for researchers who want to check and re-use data from previous research. CKAN uses the built-in Recline data viewer to create interactive data visualizations, which can be embedded elsewhere on the web, for example in a blog post about the research that produced the data. This includes map plots of geo-coded data, and image files of visualizations are displayed on their resource pages.
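
Once a CSV has been pushed into the DataStore, its rows can be queried on the server through the DataStore API instead of downloading the whole file. A minimal sketch follows, assuming a resource that has already been loaded into the DataStore; the resource id and column filter are placeholders.

```python
import requests

CKAN_URL = "https://demo.ckan.org"                    # assumed CKAN instance
RESOURCE_ID = "00000000-0000-0000-0000-000000000000"  # placeholder DataStore resource id

payload = {
    "resource_id": RESOURCE_ID,
    "filters": {"station": "Downtown"},   # hypothetical column filter
    "limit": 10,
}

resp = requests.post(f"{CKAN_URL}/api/3/action/datastore_search", json=payload)
resp.raise_for_status()
result = resp.json()["result"]

print("Fields:", [f["id"] for f in result["fields"]])
for row in result["records"]:
    print(row)
```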

[8] More on CKAN

CKAN is highly extensible, with a standard interface for writing extensions, which can also do background processing. Although the Datahub can be used to store research right now, it would be interesting to see how a widely used, research-specific CKAN instance would develop. To mention just one additional aspect, CKAN dataset metadata can include links to other datasets. This could be used to implement a reference system as external links, with inward links automatically displayed as citations.

[9] Advantages of CKAN

1. Better Visualization

When it comes to usability, being able to visualize imported .csv files within CKAN gives users more options to help them understand the data sets available. CKAN’s ‘resource view’ features allow users to quickly grasp the core trends in a dataset through a range of simple graph types.

2. Easy maintenance

CKAN is a simple tool to install and inexpensive to run. Additionally, handy CKAN extensions allow users to upload datasets from other sources with minimal cost and hassle.

3. Data speaks the Right Language

Making a portal fully accessible to users from different parts of the world requires smooth language localization. CKAN has great extensions that support multi-language datasets, making it a good fit for datasets used around the globe.

4. It keeps things light

There are always valid concerns about accessibility with platforms such as CKAN in areas with low internet connectivity, where speeds can vary dramatically depending on the local infrastructure. However, the current version of CKAN can run a ‘lite’ version of the platform to keep it accessible to users with poor connectivity.

5. Selective Update

We can select a part of our dataset that we want to make public. We can upload a dataset and make it public or private by setting various permissions.

6. Integration with Google Analytics

CKAN integrates with Google Analytics and provides easy analysis of top tags, most uploads, most views, time series, etc.

[10] Few Exciting CKAN Extensions

Pages: It is an extension for building custom pages. We can use it for building new ‘About Us’ and ‘Use Cases’ pages.

Fluent: It is an extension to store and return multilingual fields in CKAN datasets, resources, organizations and groups.

Scheming: This is the extension for custom-built metadata schema.

PDFview: This extension renders any PDF.

Custom Theme: CKAN makes it possible to create your own theme instead of changing the core files.

Bulk Import: Import datasets in bulk from a spreadsheet. This will save our team a lot of valuable time.

(i) Features for publishers (Local/National government or Data Provider)

Publish: Publish data through a guided process or import via API/harvesting from catalogs

Customize: Add your own metadata fields, themes and branding

Store: Store data within CKAN or on external departmental site

Manage: Full access control, version control history with rollback, INSPIRE/RDF support and user analytics

(ii) Features for Data Users (Researchers, Journalists, Programmers, NGOs and citizens)

Search: Search, add, edit, describe, tag, group datasets via web front-end or API

Collaborate: User profiles, dashboards, social network integration and comments

Use: Metadata and data API, data previews and visualizations

Extend: Full documentation for building extensions

[11] Data Movement using CKAN

It is possible to integrate data from CKAN into a Data Lake and to expose Data Lake data through CKAN. By using CKAN as the primary UI for data entry, upload, management, and access, we can leverage CKAN's existing capabilities for robust data analysis. We can move or copy data from CKAN to the Data Lake and make it available to visualization tools such as Tableau, or to other tools used for deeper analytics and machine learning. CKAN can also be used to provide a catalog of data sets by capturing each data set's metadata along with links to the data set in the data lake.
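
A minimal sketch of one such movement: list a dataset's resources through package_show, then copy each file into an S3-based data lake with boto3. The bucket name, dataset id and key prefix are assumptions, and a production pipeline would add error handling and incremental logic.

```python
import requests
import boto3

CKAN_URL = "https://demo.ckan.org"     # assumed CKAN instance
DATASET_ID = "air-quality-2020"        # dataset to copy
BUCKET = "my-data-lake-raw"            # hypothetical S3 bucket
PREFIX = "ckan/air-quality-2020/"      # landing zone prefix

s3 = boto3.client("s3")

# Fetch the dataset's metadata, including its list of resources (files).
meta = requests.get(
    f"{CKAN_URL}/api/3/action/package_show", params={"id": DATASET_ID}
).json()["result"]

for resource in meta["resources"]:
    url = resource.get("url")
    if not url:
        continue
    # Stream each resource file from CKAN and upload it to the data lake.
    with requests.get(url, stream=True) as download:
        download.raise_for_status()
        key = PREFIX + (resource.get("name") or resource["id"])
        s3.upload_fileobj(download.raw, BUCKET, key)
        print("Copied", url, "->", f"s3://{BUCKET}/{key}")
```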

Sample Architecture:

[12] Metadata Capabilities of CKAN

CKAN provides a rich set of metadata associated with your data. The metadata may include the following fields:

Title: Provides intuitive labelling, searching and linking.

Unique Identifier: Each dataset has a unique URL, which the publisher can customize, making the dataset easily accessible through a simple API call.

Groups: Groups gather people with similar interests or from the same domain; for example, a group may collect people interested in Science. This makes it easy for publishers and users to find the data and communities they are interested in.

Description: Additional information describing the dataset. This information can be edited by an admin or other publishers in the future.

Data Preview: We can view our dataset in CSV form before downloading it, to check whether it is the data we want for analysis or knowledge discovery.

Revision History: This feature is similar to the facility provided by git. We can store the history of revisions made to the dataset, which makes it easy to keep track of modifications.

Extra Fields: These fields may store additional information, such as the location of the publisher or the region specific to the dataset.

License: As discussed previously, we can publish data under an open license, or we can keep the data private to a publisher or a group. This field makes clear to users whether they have the right to use, change or amend the data.

Tags: This field indicates the topics to which the dataset relates. Tags are used to browse between similarly tagged datasets, making it easy to find additional data on the same topic and improving data discovery.

Multiple Formats: This provides information about the different formats in which the dataset is available.

API Key: This grants access to every metadata field of the dataset and also allows us to restrict access to metadata (see the sketch after this list).
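
A sketch of that access pattern: the dataset's unique name doubles as the identifier in an API call, and supplying an API token in the Authorization header lets the caller read metadata for private datasets they are authorized to see. The instance, token and dataset name below are placeholders.

```python
import requests

CKAN_URL = "https://demo.ckan.org"         # assumed CKAN instance
API_TOKEN = "replace-with-your-api-token"  # grants access to private metadata

# The dataset's unique name (the last part of its URL) identifies it in the API.
resp = requests.get(
    f"{CKAN_URL}/api/3/action/package_show",
    params={"id": "air-quality-2020"},
    headers={"Authorization": API_TOKEN},   # omit for public datasets
)
resp.raise_for_status()
dataset = resp.json()["result"]

print(dataset["title"], "| private:", dataset["private"], "| license:", dataset["license_id"])
```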

[Q] Why do we want Metadata?

We want a metadata inventory that gives analysts and other data consumers the details they need to analyze the data. Companies like Alation and Collibra not only keep tabs on the data, but also combine metadata management with machine learning and automation to make data more discoverable, interactive, and compliant with corporate, industry-wide, or even government regulations. Since metadata offers a common source of truth regarding a company's data sources, it makes managing the data in your pipelines much easier.

[Q] Why are users afraid to use CKAN?

  • It does not natively comply with the OAI-PMH (Open Archives Initiative Protocol for Metadata Harvesting) protocol.
  • CKAN requires a user to essentially bear all risks associated with installation, integration, platform service and ongoing updates, as well as the full lifecycle of data management.
  • Each function carries risks. For many, shifting much of the technical load to a vendor allows them to concentrate their limited resources on the harder and more important tasks of managing the data supply chain and promoting reuse of data, which is where the real value is.
  • CKAN requires technically savvy people to implement and maintain the solution.
  • There are no strict guidelines for data publishing.

[Q] Why do users like CKAN?

  • Better control over the recorded data, as CKAN is open source.
  • Permits customization, with improvements ranging from small interface modifications to the development of new data visualization plugins.
  • Able to export records that comply with established metadata schemas (XML, JSON, etc.).
  • Records do not have to follow any fixed schema.
  • The platform allows the inclusion of a dictionary of key-value pairs that can be used to record domain-specific metadata.
  • CKAN provides an audit trail for each dataset by showing all changes made to it since it was uploaded.
  • Can be installed on an institutional server instead of relying on external storage provided by contracted services.
  • Options for reserved storage let researchers control the data publication mode.

PART 2: A Comparative Study of Platforms for Data Management and Metadata Capabilities

Socrata

Socrata is a software-as-a-service platform built around the principle of ‘data in, API out’, offering a cloud-based solution for the publication and visualization of open data. All Socrata datasets are API-enabled, and developers around the world can use the Socrata Open Data API (SODA) to build applications, analyses, and complex visualizations on top of any Socrata dataset. The SODA server is distributed freely and can also be self-hosted. The open data portal run by New York City, NYC Open Data, is a good example of a Socrata deployment.
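
For comparison, the sketch below queries a Socrata dataset through SODA, which returns JSON from a simple resource URL with SoQL-style parameters. The NYC Open Data domain is real, but the dataset id, column names and app token are placeholders.

```python
import requests

DOMAIN = "data.cityofnewyork.us"           # NYC Open Data, a Socrata-hosted portal
DATASET_ID = "xxxx-xxxx"                   # placeholder Socrata dataset id
APP_TOKEN = "replace-with-your-app-token"  # optional, raises rate limits

params = {
    "$select": "borough, count(*)",  # hypothetical column, SoQL aggregation
    "$group": "borough",
    "$limit": 10,
}

resp = requests.get(
    f"https://{DOMAIN}/resource/{DATASET_ID}.json",
    params=params,
    headers={"X-App-Token": APP_TOKEN},
)
resp.raise_for_status()
for row in resp.json():
    print(row)
```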

[1] Comparison between CKAN and Socrata

The main difference between Socrata and CKAN is the distribution model: Socrata is offered as a hosted, cloud-based service, while CKAN is typically self-provisioned. Socrata and CKAN data sets can be federated and are interoperable, and both organizations share the goal of promoting the open data movement worldwide.

1) Both CKAN and Socrata are often integrated in various ways with a Content Management System such as WordPress (data.gov) or Drupal (data.gov.uk), in order to provide a more full-featured web portal, including use cases like telling stories around data, creating groups to collaborate around the data, etc.

2) Some people still consider open source and the lack of vendor lock-in a big advantage, and hence prefer a platform that leaves them free to change direction later; they see CKAN and DKAN as stronger in that regard. Both CKAN and DKAN are available in the cloud on IaaS platforms like AWS and Azure, as well as in SLA-supported SaaS versions (open source + SaaS = OpenSaaS).

3) Socrata has also recently released an open-source version of its software that can be self-hosted and effectively includes the API service at the core of Socrata's SaaS offering. Socrata made this open to the community so that if one of its customers ever wishes to stop working with Socrata, they can set up compatible API servers to support any apps they have built on top of the platform. This prevents lock-in at the API level.

4) Similar to CKAN, Socrata enables customers and users to download the data in a large number of open formats (CSV, JSON, etc.), so there is no risk of lock-in at the data level.

5) We can also see more and more cases where Socrata and CKAN are both part of a federated data publishing network rather than a monolithic repository that must serve everyone. The CDC, for example, has an immense public health and education mandate and may want to publish data and create APIs using Socrata (or something else) while federating with other catalogs (e.g. healthdata.gov, data.gov, etc.). Often the greatest value comes from publishing a single dataset extremely well and distributing it in a way that maximizes its usefulness to various audiences.

6) Consider the choice from the perspective of risk management. CKAN, especially the option of taking the code base without OKF support, requires a user to essentially bear all risks associated with installation, integration, platform service and ongoing updates, as well as the full lifecycle of data management. For some, this is fine if they have a technical team prepared to handle it, but each function carries risks. OKF's hosted offering shifts some of that risk (e.g. for setup and hosting) back to OKF, but at a price, and other risks remain with the customer. For solutions from firms like Socrata, a larger share of that risk is shifted to the solution provider. For many, shifting much of the technical load to a vendor allows them to concentrate their limited resources (and team) on the harder and more important tasks of managing the data supply chain and promoting reuse of data, which is where the real value is. Many governments, agencies and organizations simply do not have the bandwidth or team to do it all successfully, and as some governments have found, this has real implications for how much use a platform gets at the end of the day. In many ways, it comes down to which risks a user is best positioned or willing to bear and where they prefer to invest their limited resources. The issue of lock-in is also in many ways an issue of risk, but there is no generic answer; open standards, especially with respect to data and APIs, go a long way toward neutralizing the real lock-in risks of an open data platform.

7) We don't have to treat CKAN vs. Socrata as an either/or decision. Both the OKFN and Socrata have a similar vision for customers. Even on data.gov, CKAN is the architecture, but the White House and some federal agencies still use Socrata to accomplish their Open Government goals in ways that CKAN cannot. There are also some basic differences between the tools: Socrata is a turn-key product solution, while CKAN requires technically savvy people to implement and maintain the solution. One approach is not universally better than the other; they are just different.

8) Most people end up comparing CKAN against Socrata on the basis of open data portals. But Socrata also has two other products: “GovStat”, an open performance solution, and “API Foundry”, an automation tool for developers.

[2] Other Implementation Standards

A survey was done to identify currently implemented standards, requirements and features related to research data repositories; based on this, the following well-known platforms were chosen for the study. These tools are considered and evaluated based on architecture, metadata handling capabilities, interoperability, content dissemination, search features and community acceptance.

DSpace

  • Can comply with domain-level metadata schemas
  • Is open-source and has a wide supporting community
  • Has an extensive, community-maintained documentation
  • Can be fully under the institution's control
  • Structured metadata representation
  • Compliant with OAI-PMH
  • Supports Dublin Core, and MARCXML for metadata exporting

CKAN

  • Is open-source and widely supported by the developer community
  • Features extensive and comprehensive documentation
  • Allows deep customization of its features
  • Can be fully under the institution's control
  • Supports unrestricted (non standards-compliant) metadata
  • Has faceted search with fuzzy matching
  • Records datasets change logs and versioning information

Figshare

  • Gives credit to authors through citations and references
  • Can export reference to Mendeley, DataCite, RefWorks, Endnote, NLM and Reference Manager
  • Records statistics related to citations and shares
  • Does not require any maintenance

Zenodo

  • Allows creating communities to validate submissions
  • Supports Dublin Core, MARC and MARCXML for metadata exporting
  • Can export references to BibTeX, DataCite, DC, EndNote, NLM, RefWorks
  • Complies with OAI-PMH for data dissemination
  • Does not require any maintenance
  • Includes metadata records in the searchable fields

Dataverse

  • Is open-source and widely supported by the developer community
  • Data Citation automatically generated
  • Multiple Publishing Workflows
  • Faceted Search as well as tags can be used for searches
  • Comes with predefined roles; custom roles can also be designed and assigned to users
  • Branding, metadata-based facets, sub-dataverses and featured dataverses
  • Reformatting, summary statistics, and analysis for tabular files via integration with TwoRavens
  • Mapping of geospatial files and integration with WorldMap
  • Restricted files, as well as the ability to request access to restricted files
  • Three levels of metadata, i.e. description/citation, domain-specific or custom fields, and file metadata; search API, data deposit API, etc.
  • Notifications are generated for the user and also sent by email for access requests, role assignments, and when data is published
  • CC0 waiver by default, user-customisable terms of use, and download statistics
  • Can export reference to EndNote XML, RIS Format, or BibTeX Format

[3] Comparison Based on Architecture

[Comparison table source: https://slideplayer.com/slide/17580256/]

Most of the above-mentioned platforms are open source and give users some flexibility. Speedy and simple deployment of the software is a crucial part of implementation.

Open-source software can be installed in house, whereas platforms like Figshare and Zenodo are installed and operated by their providers. DSpace, Dataverse and CKAN offer better control over the recorded data because they are open source. Hosted, proprietary platforms like Figshare or Zenodo are less viable for researchers and institutions that want full control, as they have to rely on the provider. DSpace, CKAN, Dataverse and Zenodo permit customization ranging from small interface modifications to the development of new data visualization plugins to satisfy the needs of their users, while Zenodo additionally allows parametrization settings, such as those at the community level, to be further customized. DSpace, Zenodo and Dataverse permit users to stipulate an embargo period, whereas CKAN and Figshare have options for reserved storage to let researchers control the data publication mode. Zenodo and Figshare are able to export records that comply with established metadata schemas (XML, JSON, etc.).

[4] Comparison Based on Metadata Management

[Comparison table source: https://slideplayer.com/slide/17580256/]

CKAN and Dataverse metadata records do not follow any standard schema; instead, the platforms allow the inclusion of a dictionary of key-value pairs that can be used to record domain-specific metadata as a complement to generic metadata descriptions. Neither platform natively supports collaborative validation stages where curators and researchers enforce the correct data and metadata structure, but Zenodo allows users to create a highly curated area within communities, as highlighted by its "validation" feature: every deposit has to be validated by the community curator if the policy of a particular community specifies manual validation. Tracking content changes remains an important issue in data management with Zenodo; CKAN, by contrast, provides an auditing trail for each deposited dataset by showing all changes made to it since its deposit. All of the evaluated platforms allow the development of external clients and tools, as they already provide their own APIs for exposing metadata records to the outside community, but there are some differences regarding standards compliance. Zenodo and DSpace natively comply with the OAI-PMH (Open Archives Initiative Protocol for Metadata Harvesting) protocol, a widely used protocol that promotes interoperability between repositories while also streamlining data dissemination, and a valuable resource for harvesters to index the contents of a repository.

[5] Conclusion

Dataverse, CKAN and DSpace's open-source licenses are highlighted because they allow the platforms to be updated and customized while keeping the core functionalities intact. CKAN is mainly used by governmental institutions to disclose their data. DSpace enables system administrators to parametrize additional metadata schemas that can be used to describe resources.

[Summary table: https://www.researchgate.net/publication/303918099_A_comparison_of_research_data_management_platforms_architecture_flexible_metadata_and_interoperability]

In summary, it can be hard to select a platform without first performing a careful study of the requirements of all stakeholders. The main positive aspects of the platforms considered here are summarized in the table above. Both CKAN and DSpace are highlighted for their open-source licenses, which allow them to be updated and customized while keeping the core functionalities intact. Although CKAN is mainly used by governmental institutions to disclose their data, its features and extensive API also make it possible to use this repository to manage research data, using its key-value dictionary to store any domain-level descriptors. Curators may favor DSpace, though, since it enables system administrators to parametrize additional metadata schemas that can be used to describe resources; these in turn capture richer domain-specific features that may prove valuable for data reuse.

Researchers need to comply with funding agency requirements, so they may favor easy deposit combined with easy data citation. Zenodo and Figshare provide ways to assign a permanent link and a DOI, even if the actual dataset is under embargo at the time of first citation; this requires direct contact between the data creator and the potential re-user before access can be provided. Both these platforms aim at the direct involvement of researchers in the publication of their data, as they streamline the upload and description processes, though they do not provide support for domain-specific metadata descriptors.

A very important factor to consider is also control over where the data is stored. Some institutions may want the servers where data is stored under their control, so that they can directly manage their research assets. Platforms such as DSpace and CKAN, which can be installed on an institutional server instead of relying on external storage provided by contracted services, are appropriate for this. The evaluation of research data repositories can also take into account other features besides those considered in this analysis, namely their acceptance within specific research communities and their usability. For small institutions that struggle to contract a dedicated service for data management, having a wide community supporting the development of a stand-alone platform can be a valuable asset. In this regard, CKAN may have an advantage over the remaining alternatives, as several governmental institutions are already converging on this platform for data publishing.

Researchers, research sponsors, data libraries, the scientific community, and the public benefit from data sharing. It promotes more scientific engagement and cooperation, and better research leads to better decision-making.

“Your value will be not what you know; it will be what you share.”

— Ginni Rometty

References:

[1] A Comparative Study of Platforms for Research Data Management: Interoperability, Metadata Capabilities and Integration Potential

https://link.springer.com/chapter/10.1007%2F978-3-319-16486-1_10

[2] A comparison of research data management platforms: Architecture, flexible metadata and interoperability

https://repositorio-aberto.up.pt/bitstream/10216/111537/2/229906.pdf

[3] A Comparative Review of Various Data Repositories

https://dataverse.org/blog/comparative-review-various-data-repositories

[4] An Evaluation of CKAN for Research Data Management

https://rd-alliance.org/system/files/documents/CKANEvaluation.pdf

[5] A comparison of research data management platforms: architecture, flexible metadata and interoperability

https://www.researchgate.net/publication/303918099_A_comparison_of_research_data_management_platforms_architecture_flexible_metadata_and_interoperability
