The Summer of Open Data, which took place in 2020, brought together Open Data experts to delve into the Third Wave of Open Data.

In this blog post we will explore the definition of Open Data, its evolution towards the Third Wave, and how the Snowflake Data Sharing features can play a part in this particular puzzle.

What is Open Data?

While there are various definitions, which often focus on the Second Wave of Open Data, the most reliable definition can be found at the Open Data Policy Lab:

Open data comprises data made accessible for re-use along a spectrum of openness and conditions for re-use. It includes data collected by or on behalf of government institutions of all levels, and which has been reviewed as appropriate for public distribution and use by individuals and organizations of all types. Private-sector organizations, civil society organizations, scientific research institutions, and other parties also hold data that could benefit the public if made accessible to certain parties for re-use — though with additional constraints and challenges to be navigated.

Key points:

  • A spectrum of openness
  • Conditions for re-use
  • Spanning public and private sector

The Open Data waves

This section distinguishes between the various waves of Open Data. Note that a newer wave does not replace a previous iteration, but rather builds on the waves before it.

The first wave

  • Audiences such as journalists and lawyers use Freedom of Information laws to make data requests.
    • Freedom of Information (FOI) laws grant the right to obtain copies of records held by public bodies.
  • National level focus

The second wave

  • Primarily public sector focused
  • Data is proactively shared with a goal of creating public value from previously siloed assets

The third wave

Building on the previous waves, the Open Data Policy Lab describes the Third Wave as:

  1. Publishing with purpose
    • Data is made open in a way that focuses on impactful reuse, i.e. demand is prioritized as much as supply
  2. Fostering partnerships and data collaboration
    • Engage a diversity of data suppliers and consumers, the output of which would be the creation of public value based on the data made available
  3. Advancing open data at the subnational level
  4. Prioritizing data responsibility
    • A responsibility-by-design approach to Open Data activities
    • Emphasis on fairness, accountability, and transparency across all stages of the data lifecycle to manage risks and maximize value.

What is the Snowflake Data Cloud?

The Snowflake Data Cloud is a cloud data platform covering various use cases such as data warehousing, data engineering, and data sharing. Data can be stored, processed, and shared whether that data is:

  • Structured
  • Semi-structured e.g. JSON
  • Unstructured e.g. images or audio files.

The data can be queried directly using SQL, or via a variety of Snowflake APIs and connectors.
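As a brief sketch of what this looks like in practice, semi-structured JSON can be queried directly with Snowflake SQL via the VARIANT data type. The table and column names below are purely illustrative:

```sql
-- Hypothetical table holding raw JSON events in a VARIANT column
CREATE TABLE events (payload VARIANT);

-- Navigate into the JSON directly in SQL and cast to typed values
SELECT
    payload:user.name::STRING AS user_name,
    payload:city::STRING      AS city
FROM events
WHERE payload:event_type::STRING = 'signup';
```

The `payload:city::STRING` syntax drills into the JSON and casts the result, so semi-structured data can be filtered and joined with structured tables without defining a schema up front.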

Snowflake is also multi-cloud, meaning it can run on various cloud providers such as AWS, Azure, and Google Cloud. In addition, all data on the platform is encrypted, whether in transit or at rest.

Snowflake Data Sharing

Using Snowflake secure data sharing, data exchange, and data marketplace. Credit: Snowflake.

Snowflake Data Sharing allows organizations to share data across their ecosystems, whether internally or externally. It eliminates the need to move, copy, or transfer any data. As a result, we also reduce data governance risks.

Data consumers can immediately start processing the shared data, and use that to augment data that they already have. We can use out-of-the-box metrics to help determine which published datasets are actually being used, along with associated metadata.

In most cases, data must be on the Snowflake data platform before it can be shared. There are exceptions, however, an example being Snowflake External Tables. External Tables are a type of database table used to access third-party cloud storage such as AWS S3, Azure Blob Storage, or Google Cloud Storage, and they can be shared as part of the overall Snowflake data sharing capability.
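As an illustrative sketch (the stage, bucket, and table names are hypothetical, and credentials / storage integration details are omitted), an External Table is defined over files sitting in cloud storage so they can be queried, and shared, without loading them into Snowflake:

```sql
-- Hypothetical external stage pointing at an S3 bucket
-- (authentication via a storage integration is omitted for brevity)
CREATE STAGE my_s3_stage
  URL = 's3://my-open-data-bucket/datasets/'
  FILE_FORMAT = (TYPE = PARQUET);

-- External table reading the Parquet files in place; no data is loaded,
-- rows are exposed through the VALUE variant column at query time
CREATE EXTERNAL TABLE ext_open_data
  LOCATION = @my_s3_stage
  FILE_FORMAT = (TYPE = PARQUET);
```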

The following sections provide an overview of the data sharing options available.

Snowflake Direct Share

Used by a data publisher to share data directly with a data consumer. If that data consumer is not a Snowflake customer, the data provider can create a Snowflake Reader Account to share the data. In other words:

  • The data consumer does not have to become a Snowflake customer
  • The data provider:
    • Creates a dedicated Snowflake account for data consumption purposes
    • Governs the shared data
    • Determines the compute power the data consumer is permitted to use via a Snowflake feature called Virtual Warehouses (compute nodes)
      • That usage can be capped using Snowflake resource monitors
        • A very good reason for this is that the data provider pays for usage costs!
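The provider-side steps above can be sketched in Snowflake SQL roughly as follows. All object names are hypothetical, and note that pairing a warehouse with a resource monitor for a reader account is configured from within that reader account:

```sql
-- Create a share and grant it access to the objects being published
CREATE SHARE open_data_share;
GRANT USAGE ON DATABASE open_data_db TO SHARE open_data_share;
GRANT USAGE ON SCHEMA open_data_db.public TO SHARE open_data_share;
GRANT SELECT ON TABLE open_data_db.public.city_stats TO SHARE open_data_share;

-- Option A: share directly with an existing Snowflake account
ALTER SHARE open_data_share ADD ACCOUNTS = partner_org.partner_account;

-- Option B: create a reader account for a consumer who is not a
-- Snowflake customer (the provider administers and pays for it)
CREATE MANAGED ACCOUNT reader_acct
  ADMIN_NAME = 'reader_admin',
  ADMIN_PASSWORD = '<choose-a-strong-password>',
  TYPE = READER;

-- Cap provider-paid compute usage with a resource monitor
CREATE RESOURCE MONITOR reader_monitor
  WITH CREDIT_QUOTA = 100
  TRIGGERS ON 100 PERCENT DO SUSPEND;
ALTER WAREHOUSE reader_wh SET RESOURCE_MONITOR = reader_monitor;
```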

Snowflake Data Exchange

A Data Exchange is a secure data collaboration hub. Members are invited to the hub to publish and consume data. As such, data can be shared between units within an organization or with external partners such as vendors, suppliers, partners, and customers.

Data publishers on the exchange create listings of their published data which data subscribers can then search. Also, the Data Exchange provides various governance features such as:

  • Membership management
  • Granting / revoking access to data
  • Auditing of shared data usage
  • Applying security controls on data
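At the level of the underlying shares, granting and revoking access, and auditing what a share exposes, can be sketched as follows (share and account names are hypothetical):

```sql
-- Grant a new member's account access to a listed dataset's share
ALTER SHARE exchange_share ADD ACCOUNTS = member_org.member_account;

-- Revoke that access again
ALTER SHARE exchange_share REMOVE ACCOUNTS = member_org.member_account;

-- Inspect exactly which objects the share exposes
SHOW GRANTS TO SHARE exchange_share;
```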

Snowflake Data Marketplace

Browsing the Snowflake Data Cloud Data Marketplace listings. Credit: Dan Galavan.

Used to connect a variety of data providers and data consumers.  

A data consumer can discover and access a plethora of third-party data and have those datasets available directly in their Snowflake account. That data consumer can then readily query and join those shared datasets with their own data.
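On the consumer side, mounting a share and joining it with local data might look roughly like this; the provider, share, and table names are all hypothetical:

```sql
-- Mount the provider's share as a read-only database in our account
CREATE DATABASE weather FROM SHARE provider_org.provider_account.weather_share;

-- Immediately join the shared data with our own data
SELECT
    s.store_id,
    w.avg_temp,
    SUM(s.amount) AS revenue
FROM my_db.public.sales s
JOIN weather.public.daily_weather w
  ON s.sale_date = w.obs_date
 AND s.city      = w.city
GROUP BY s.store_id, w.avg_temp;
```

Because no data is copied, the consumer always sees the provider's latest published state of the shared tables.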

If a data consumer needs to use several different vendors for data sourcing, the Data Marketplace gives one single location from where to get the data.

As with the Data Exchange, data providers can create listings, and data consumers can browse those listings.

Snowflake and data governance

There are a variety of data governance options with Snowflake, examples being:

  • Data encryption while the data is in transit or at rest
  • Role-based access control, whereby access privileges are assigned to roles, which are in turn assigned to users
  • Object tagging which enables data stewards to track sensitive data for compliance, discovery, protection, and resource usage analysis purposes
    • Keep an eye out for some upcoming functionality – automatic data classification, i.e. automatic tagging of e.g. personal data. At the time of writing, this is limited to Snowflake private preview.
  • Query History which provides a log of each query that is run and associated metadata such as when the query ran, for how long, and who ran the query
  • Access History, which records which specific data is being read, along with associated metadata
  • Dynamic Data Masking is used to obfuscate data from an end user at time of data access depending on that user’s authorization.
    • Tip: there are some gotchas here in relation to data sharing so it’s important to have the right expertise
  • Security validations e.g. SOC 1 Type II & SOC 2 type II.
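A few of these governance features can be sketched in SQL. The roles, tables, and policies below are purely illustrative:

```sql
-- Role-based access control: privileges -> role -> user
CREATE ROLE analyst;
GRANT SELECT ON ALL TABLES IN SCHEMA open_data_db.public TO ROLE analyst;
GRANT ROLE analyst TO USER alice;

-- Object tagging: let data stewards track sensitive columns
CREATE TAG pii COMMENT = 'Personally identifiable information';
ALTER TABLE customers MODIFY COLUMN email SET TAG pii = 'email';

-- Dynamic data masking: obfuscate at query time based on the caller's role
CREATE MASKING POLICY mask_email AS (val STRING) RETURNS STRING ->
  CASE WHEN CURRENT_ROLE() = 'ANALYST' THEN val ELSE '*** masked ***' END;
ALTER TABLE customers MODIFY COLUMN email SET MASKING POLICY mask_email;
```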

Snowflake scalability

Snowflake is very scalable. When it comes to an increase in required processing power, there are plenty of options, e.g.:

  • Snowflake Virtual Warehouses
    • Compute resources
    • Can be scaled up or down immediately
    • Range from one up to 512 compute nodes
  • Multi-clustering
    • We can use this for additional concurrency (the number of queries being processed simultaneously)
    • Capped at 10 clusters per multi-cluster warehouse
  • Elastic storage
    • As data volume increases, so automatically does data storage
  • Separation of workloads
    • Multiple compute clusters can access the data simultaneously
    • E.g. the data publisher is adding new data to a shared dataset, and the data consumer can continue to access that shared dataset including changes made by the data publisher
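Scaling up and scaling out are each a single statement. The warehouse names below are hypothetical (and multi-cluster warehouses require the appropriate Snowflake edition):

```sql
-- Scale up: resize a warehouse for more processing power per query
ALTER WAREHOUSE transform_wh SET WAREHOUSE_SIZE = 'XLARGE';

-- Scale out: a multi-cluster warehouse adds clusters for concurrency,
-- up to the cap of 10 clusters
ALTER WAREHOUSE bi_wh SET
  MIN_CLUSTER_COUNT = 1,
  MAX_CLUSTER_COUNT = 10,
  SCALING_POLICY = 'STANDARD';
```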

Additional notes

  • In all cases, we only pay for what we use.
  • Scaling takes place with zero downtime.
  • Snowflake offers a variety of optimization options which are out of scope of this blog post.

In other words when sharing data using the Snowflake Data Cloud, scalability is not a bottleneck.

Riding the Open Data wave

An output of the Summer of Open Data was the identification of many investments and interventions required to ride the Open Data wave. The following are examples of those areas and where Snowflake Data Sharing can play a role:

  • Foster and distribute institutional data capacity
    • Breaking down data silos within organizations and supporting decision making across the institution – Snowflake Data Sharing and data processing with scalability
  • Articulating value and building an impact evidence base
    • Snowflake Data Sharing provides metrics which demonstrate whether the shared data is being used, when, how often, by whom etc.
  • Creating new data intermediaries
    • E.g. matching supply and demand actors using the Snowflake Data Exchange or Snowflake Data Marketplace
  • Establishing governance frameworks
    • Snowflake offers a plethora of data governance functionality
      • Please note that these are tools to help implement a data governance framework, as opposed to establishing what such a framework should entail!
  • Creating the technical infrastructure for reuse
    • The Snowflake Data Cloud can provide an input here with Secure Data Sharing out of the box, not least the ability to store, process, and share structured, semi-structured, and unstructured data, plus significant scalability and governance features
  • Empowering data stewards
    • In tandem with Snowflake Data Sharing, the Snowflake data governance capabilities can be used to great effect here

Conclusion

Surfing the Third Wave of Open Data. Image by Kanenori from Pixabay

“Publish with Purpose”.  

Ania Calderon, Open Data Charter

The Snowflake Data Cloud offers a technical solution to data sharing as an input into the Third Wave of Open Data. We have seen that we can take different approaches to connect data publishers and consumers. We can accomplish this using:

  • Snowflake Direct Share
  • Snowflake Data Exchange
  • Snowflake Data Marketplace
  • A combination of these approaches

A significant benefit with Snowflake Data Sharing is that we can remove the need to move, copy, or transfer any data. As such, data governance risk is also reduced.

Examples of Snowflake features which complement its Data Sharing capabilities include data governance and scalability. The Snowflake Data Cloud can therefore be an important technical piece of the Open Data puzzle.

It’s also important that we keep in mind the Third Wave of Open Data’s bigger picture – ensuring that the right data sharing approach is used. That approach must promote the re-use of data for public interest purposes while ensuring data rights and the growth of data partnerships and collaborations. We must also consider social aspects such as:

  • data ethics
  • regulatory clarity where required
  • developing common data sharing agreements and corresponding licensing regimes

Of equal importance is matching supply with demand. As Ania Calderon of the Open Data Charter advocates, “Publish with Purpose”.  

Most importantly of all, it’s time to roll with the Third Wave of Open Data.

Surf’s up!

 

Copyright ©2021, Dan Galavan.

Need Snowflake training? You’ve come to the right place.