Developed by: IBM
Secure Data Services
Analysis
Secure Data Services (SDS) are services that protect data across its lifecycle as it is stored, accessed and used. This includes facilities to govern data access and validation across the whole data lifecycle.
The usage of the Secure Data Service in the context of a use case is presented in the figure below:
Usage examples:
- Database engine: data insert / delete / get / query
- Getting data from standard sources (streams and file formats)
- Encrypted export/import of bulk data
- Anonymized export of bulk data
- Deleting individual records (e.g., for GDPR)
- Cloud backup and ransomware protection
- Use for data and/or metadata
- Applicability in a wide SME ecosystem
- Demo inside existing CyberKit4SME use cases
  - e.g. database, data import/export
- Demo as additions to existing CyberKit4SME use cases
  - e.g. cloud backup
- Demo in new use cases (modelled after existing ones, or other typical SME apps)
  - all features
Properties:
- Free open source technology
  - No download or maintenance charges
- Easy to use
  - (Almost) transparent data security support
  - Standard interfaces and formats
- Helps to leverage hybrid cloud
  - Offload SME IT tasks to the public cloud: less expensive, less headache
  - No cloud lock-in: use or switch to any public cloud
  - Keep extra-sensitive data on-premises
- Cutting-edge data security mechanisms
- Contribute to CyberKit4SME Beyond-SoTA and Standards goals
  - Contribute to leading open source repositories
  - Integrate in public clouds (leveraged by various customers, including SMEs)
Technical
The Secure Data Service (SDS) enables data to be stored and consumed in a secure and safe manner. The use cases use the SDS API to import and export data from hybrid cloud storage and to run analytic SQL queries and machine learning data processing on the data. The SDS exposes a REST API to the use cases and, in addition, sends data access events and security alerts to the SIEM (Keenaï).
At the heart of the Secure Data Service (SDS) is Parquet encryption, the new open standard for data encryption and integrity verification [1]. Data imported by the SDS can be persisted, even in untrusted storage, in Parquet format [2], an analytics-friendly columnar format for large-scale data. Parquet files containing sensitive information can be protected by the modular encryption mechanism, which encrypts and authenticates the file data and metadata while retaining all the benefits of the Parquet format, e.g. columnar projection, predicate pushdown, encoding and compression.
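To make this concrete, here is a minimal sketch using the Python (PyArrow) implementation of Parquet Modular Encryption. The toy in-memory KMS client, key names, file name and column names are illustrative stand-ins, not part of the SDS.

```python
import base64

import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.parquet.encryption as pe


class InMemoryKmsClient(pe.KmsClient):
    """Toy KMS client that 'wraps' keys by base64-encoding them.
    For illustration only; never use in production."""

    def __init__(self, kms_connection_config):
        super().__init__()

    def wrap_key(self, key_bytes, master_key_identifier):
        return base64.b64encode(key_bytes)

    def unwrap_key(self, wrapped_key, master_key_identifier):
        return base64.b64decode(wrapped_key)


crypto_factory = pe.CryptoFactory(lambda config: InMemoryKmsClient(config))
kms_config = pe.KmsConnectionConfig()

# Encrypt the sensitive 'ssn' column and the file footer with separate
# master keys, so that access to each can be governed independently.
encryption_config = pe.EncryptionConfiguration(
    footer_key="footer_key",
    column_keys={"column_key": ["ssn"]},
)

table = pa.table({"name": ["alice", "bob"], "ssn": ["123-45-6789", "987-65-4321"]})
encryption_properties = crypto_factory.file_encryption_properties(
    kms_config, encryption_config
)
with pq.ParquetWriter(
    "data.parquet.encrypted", table.schema, encryption_properties=encryption_properties
) as writer:
    writer.write_table(table)
```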
Master keys used for the encryption are stored in Hashicorp Vault [3]. In use cases where the partners are reluctant to store encryption keys in the public cloud, the Vault can run on customer premises together with the SDS, while the data can reside in the public cloud, since it is encrypted. Access to data is governed by controlling access to the encryption master keys.
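To sketch what the Vault integration could look like on the Python side, the toy client above can be replaced by a Vault-backed KMS client that wraps and unwraps data keys with Vault's transit secrets engine, here via the hvac client library. This is a hedged sketch under the assumption that the transit engine is enabled and the master keys exist in Vault; the connection details are placeholders.

```python
import base64

import hvac  # Python client for Hashicorp Vault
import pyarrow.parquet.encryption as pe


class VaultKmsClient(pe.KmsClient):
    """Sketch: wrap/unwrap Parquet data keys with Vault's transit engine."""

    def __init__(self, kms_connection_config):
        super().__init__()
        self.client = hvac.Client(
            url=kms_connection_config.kms_instance_url,  # e.g. "https://vault.local:8200"
            token=kms_connection_config.key_access_token,
        )

    def wrap_key(self, key_bytes, master_key_identifier):
        # Vault's transit engine expects base64-encoded plaintext.
        response = self.client.secrets.transit.encrypt_data(
            name=master_key_identifier,
            plaintext=base64.b64encode(key_bytes).decode("utf-8"),
        )
        return response["data"]["ciphertext"]

    def unwrap_key(self, wrapped_key, master_key_identifier):
        # Fails if the caller's Vault token lacks access to this master key,
        # which is exactly how access to the data is governed.
        response = self.client.secrets.transit.decrypt_data(
            name=master_key_identifier,
            ciphertext=wrapped_key,
        )
        return base64.b64decode(response["data"]["plaintext"])
```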
The general architecture of Secure Data Services and its interactions is shown in the figure below:
The SDS is driven by an engine, which can be Apache Spark [4], Apache Hudi [5] (for table formats) or similar. Spark makes it possible to import data from multiple formats, in particular from the CSV files that most use cases use, and to write it as Parquet and encrypted Parquet. Engines such as Hudi allow the SDS to work with table formats.
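A minimal PySpark sketch of this import flow, assuming Spark 3.2+ with its built-in Parquet Modular Encryption support: it reads a CSV file and writes it as encrypted Parquet. For brevity it uses Parquet's mock in-memory KMS; a real deployment would configure a Vault-backed KMS client instead, and the paths, key names and key material below are illustrative.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    # Activate Parquet Modular Encryption.
    .config(
        "spark.hadoop.parquet.crypto.factory.class",
        "org.apache.parquet.crypto.keytools.PropertiesDrivenCryptoFactory",
    )
    # Mock KMS for demonstration; a production setup would point to Vault.
    .config(
        "spark.hadoop.parquet.encryption.kms.client.class",
        "org.apache.parquet.crypto.keytools.mocks.InMemoryKMS",
    )
    # Two demo master keys (base64-encoded 16-byte values) for the mock KMS.
    .config(
        "spark.hadoop.parquet.encryption.key.list",
        "keyA:AAECAwQFBgcICQoLDA0ODw==,keyB:AAECAAECAAECAAECAAECAA==",
    )
    .getOrCreate()
)

# Import a CSV file, as most use cases do.
df = spark.read.option("header", "true").csv("/data/import/patients.csv")

# Persist as encrypted Parquet: sensitive columns and the footer get
# their own master keys.
(
    df.write
    .option("parquet.encryption.column.keys", "keyA:name,ssn")
    .option("parquet.encryption.footer.key", "keyB")
    .parquet("/data/secure/patients")
)
```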
As part of its data access flows, the SDS sends data access events and security alerts to the SIEM for further analysis.
There are different options for deploying and using Secure Data Services:
Secure Data Services with HTTP API
The first option is to deploy the SDS as a service on use case partner premises and to use its HTTP API, as shown in the figure below.
In such a deployment, an SDS Gateway, implemented using the Play Framework, exposes methods such as sql-query, insert-data and bulk-import in the HTTP API. The SDS Engine implements these methods using Apache Spark and Apache Hudi, with the help of an internal table catalogue in which different tables can be configured, e.g. Parquet files, CSV files and Hudi tables.
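The following is a hedged sketch of how a use case might call these methods over HTTP. The method names sql-query, insert-data and bulk-import come from the API described above; the host, port, request paths and payload shapes are assumptions made for illustration.

```python
import requests

SDS_URL = "https://sds-gateway.example.com:9000"  # hypothetical gateway address

# Bulk-import a CSV file into a table configured in the table catalogue.
requests.post(
    f"{SDS_URL}/bulk-import",
    json={"table": "patients", "source": "/data/import/patients.csv"},
).raise_for_status()

# Insert a single record.
requests.post(
    f"{SDS_URL}/insert-data",
    json={"table": "patients", "record": {"name": "alice", "ssn": "123-45-6789"}},
).raise_for_status()

# Run an analytic SQL query over the (encrypted) data.
response = requests.post(
    f"{SDS_URL}/sql-query",
    json={"query": "SELECT name, COUNT(*) FROM patients GROUP BY name"},
)
response.raise_for_status()
print(response.json())
```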
The Java implementation of Parquet Modular Encryption (PME) is used for writing and reading encrypted Parquet files with privacy and integrity protection, enabling secure and efficient work with encrypted data stored either in the cloud or on-premises. PME with key management tools can use either encryption keys managed by the application or keys managed by a Key Management Service such as Hashicorp Vault, which also provides role-based access control to the encryption keys. Access control to the encryption keys enables fine-grained access control to the data.
For example, a large Parquet file can contain data collected from different sources, but each persona accessing it will only get access to the columns whose master keys that persona is allowed to access in Vault, as sketched below.
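Continuing the PyArrow sketch from above (reusing its crypto_factory and kms_config), read-time behaviour might look as follows; column and key names remain illustrative.

```python
import pyarrow.parquet as pq
import pyarrow.parquet.encryption as pe

decryption_properties = crypto_factory.file_decryption_properties(
    kms_config, pe.DecryptionConfiguration()
)

# Columnar projection still works on encrypted files: only the requested
# columns are read and decrypted.
table = pq.read_table(
    "data.parquet.encrypted",
    columns=["name"],
    decryption_properties=decryption_properties,
)

# Requesting the 'ssn' column would additionally require unwrapping its
# master key; with a Vault-backed KMS client, a persona without access to
# that key in Vault would fail at unwrap_key, and the read would be denied.
```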
Secure Data Services as a Library
The second deployment option is as part of the use case tooling, as a library for securely reading and writing Parquet data, as shown in the figure below.
In this deployment option, the SDS is deployed as a library for reading and writing Parquet data. We are working with the open source Apache Arrow community on the design and implementation of exposing Parquet Modular Encryption with key management tools in PyArrow.
The use case will be able to read and write Parquet files and convert them to pandas, Modin, etc. Machine learning can then run efficiently on this encrypted data with common advanced tools such as TensorFlow, Ray, etc. Again, as in the previous deployment option, encryption keys are managed either by the application or by a Key Management Service such as Hashicorp Vault.
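A short sketch of this library flow, again reusing the crypto_factory and kms_config from the earlier PyArrow example; the file name is illustrative.

```python
import pyarrow.parquet as pq
import pyarrow.parquet.encryption as pe

decryption_properties = crypto_factory.file_decryption_properties(
    kms_config, pe.DecryptionConfiguration()
)

# Decryption happens transparently during the read; the result is a plain
# pandas DataFrame that TensorFlow, scikit-learn, etc. can consume as usual.
df = pq.read_table(
    "data.parquet.encrypted", decryption_properties=decryption_properties
).to_pandas()
print(df.head())
```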
References
[1] https://github.com/apache/parquet-format/blob/apache-parquet-format-2.7.0/Encryption.md
[2] https://parquet.apache.org/
[3] https://www.vaultproject.io/
[4] https://spark.apache.org/
[5] https://hudi.apache.org/