Parquet Modular Encryption: The New Open Standard for Big Data Security Reaches a Milestone
1st April, 2022
Authors: CyberKit4SME and Maya Anderson
Parquet Modular Encryption Milestones
IBM Research initiated and led joint work with the Apache Parquet community to address critical issues in securing the confidentiality and integrity of sensitive data, without degrading the performance of analytic systems [1], [2].
Apache Parquet is the industry-leading standard for formatting, storing and efficiently processing big data. Parquet Modular Encryption (PME) encrypts Parquet files module by module: the footer, page headers, column indexes, offset indexes, pages, and so on. It therefore not only enables granular control of the data based on access to per-column encryption keys, but also preserves all the benefits of efficient analytics on Parquet. This includes column projection and predicate push-down, where entire file parts can be skipped if the metadata indicates that the part has no matching values.
PME has reached some major milestones. Both Java and C++ implementations of Parquet with PME have been released: the Apache Spark 3.2 release uses the Java implementation [3] after upgrading to Apache Parquet 1.12.1 [4], and the upcoming 8.0.0 release of Apache Arrow will enable its use in PyArrow [5]. While exposing PME in PyArrow as part of the CyberKit4SME project, we also verified interoperability between the Spark and PyArrow implementations.
Parquet encryption in CyberKit4SME use cases
Financial Use Case
An interesting financial use case that we encountered for PME is part of CyberKit4SME. Here, a small financial institution buys Foreign Exchange (ForEx) tick data that records every price change, roughly once per second for every currency pair. The institution issues orders to traders to buy or sell currencies based on the analytics models that run on the ForEx data. Confidentiality is clearly important, since this detailed data has been paid for, but integrity matters just as much, since financial decisions are made based on the data: any missing or erroneous data can affect a decision and possibly result in significant financial losses. Moreover, storing the data should be cheap and easy for the SME partner. Saving the data in encrypted Parquet files protects both the confidentiality and the integrity of the data; it is affordable because of the excellent compression of Apache Parquet, and the performance of analytics queries running on these Parquet files is very good.
Transportation Use Case
Another interesting use case for PME is a smart transportation use case from CyberKit4SME, where data is collected from cars (e.g., positions, acceleration and velocity). This data is used to build and train machine-learning models using TensorFlow, which are then used in smart cars to make real-time decisions.
The data collected from the cars contains sensitive information, so it must be stored in a way that is both compact and encrypted. At the same time, various personas should be able to run Python scripts on this data to analyze it and to train models. PME allows large amounts of data to be stored compactly, encrypted with different encryption keys (e.g., according to sensitivity levels), with access to the keys granted based on security clearance or some other enterprise policy. Access control is achieved by controlling access to the keys, without creating multiple replicas of the table: the physical data files remain accessible to a large set of people, but each person can only read the data for which they have access to keys.
For example, in the diagram below, two different users run queries on the same table, which has five columns encrypted with PME. The first user selects three columns out of the four available to them based on permissions granted with their access token, and the second user selects two columns out of the three available to them based on their access token. This can be achieved by using one key for the least sensitive columns 1, 3 and 5, another key for the more sensitive column 2, and yet another key for the most sensitive column 4.
For a full version of this blog, but with older milestone updates, see:
[1] Parquet Modular Encryption: Developing a new open standard for big data security
[2] Structured Data and Hybrid Clouds: Getting Value From Your Data While Remaining Secure and Compliant
[3] Data and AI Summit: Data Security at Scale through Spark and Parquet Encryption
[4] Spark 3.2.0 release notes: https://spark.apache.org/releases/spark-release-3-2-0.html
[5] Reading and Writing the Apache Parquet Format – Parquet Modular Encryption