Data Encryption at Rest and In Motion for Cloudera Hadoop Cluster


Encryption is an important feature in Cloudera Hadoop: it allows you to securely protect data that you don't want anyone else to access. A typical real-world use case is protecting HDFS data, which could contain emails, chat histories, tax information, credit card numbers, or any other sensitive information.


HDFS Transparent Encryption


The word "transparent" emphasizes that the client is not aware of the encryption and decryption happening under the covers. This post covers how HDFS data encryption at rest and in transit works in a Cloudera Hadoop cluster; the solution may differ slightly in other Hadoop distributions, but the key features are largely similar.

The Architecture

Key Management Server (KMS): A proxy service that bridges the connection between Hadoop services and the enterprise key trustee or key store servers.

Navigator Key Trustee: Although a Java KeyStore can be leveraged to store the keys, it is highly recommended to rely on an enterprise-grade key store (e.g. Navigator Key Trustee or an HSM, hardware security module). The advantage of maintaining an external key store is that, in the event the entire HDFS is compromised, the keys remain secured in the external key store, without which a rogue user can never read the files from HDFS.
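For reference, HDFS clients and the NameNode locate the KMS through the key-provider setting in the Hadoop configuration. The host name below is a placeholder, and in a Cloudera cluster this property is normally managed for you by Cloudera Manager; this fragment is only a sketch of what the wiring looks like:

```xml
<!-- core-site.xml (illustrative; the KMS host/port are placeholders) -->
<property>
  <name>hadoop.security.key.provider.path</name>
  <value>kms://http@kms-host.example.com:16000/kms</value>
</property>
```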



The following picture depicts how different keys are formed and their purpose/usage



EZ Key: Encryption Zone Key, the key associated with an encryption zone (an encryption zone is simply a designated HDFS directory whose contents are encrypted). It lives in the key store, never in HDFS.
DEK: Data Encryption Key, the actual key used to encrypt/decrypt a file's contents; each file has its own DEK.
EDEK: Encrypted Data Encryption Key, the DEK encrypted with the EZ key (i.e. DEK + EZ key -> EDEK). It is stored in the file's metadata on the NameNode.
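The key hierarchy above can be illustrated with a small, self-contained sketch using OpenSSL. This is only an analogy: in HDFS the KMS performs the wrapping and unwrapping, and the file names and key sizes here are illustrative assumptions. The point is the envelope-encryption idea, where the per-file key is itself stored encrypted:

```shell
# Working directory for the demo
tmp=$(mktemp -d)

# The EZ key: one per encryption zone, held by the KMS/key store (never in HDFS)
openssl rand -hex 32 > "$tmp/ez.key"

# The DEK: one per file, used to actually encrypt the file contents
openssl rand -hex 32 > "$tmp/dek.key"

# The EDEK: the DEK encrypted ("wrapped") with the EZ key -- this is what
# the NameNode stores in the file's metadata
openssl enc -aes-256-cbc -pbkdf2 -pass file:"$tmp/ez.key" \
    -in "$tmp/dek.key" -out "$tmp/edek.bin"

# Unwrapping: given the EZ key, the EDEK yields the original DEK back
openssl enc -d -aes-256-cbc -pbkdf2 -pass file:"$tmp/ez.key" \
    -in "$tmp/edek.bin" -out "$tmp/dek.recovered"
```

Anyone holding only the EDEK (say, someone who copied the NameNode metadata) learns nothing about the DEK without the EZ key, which is exactly why keeping the EZ keys in an external key store matters.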

The Process Flow 



Step 1: The client reaches out to the NameNode to write a file to HDFS.
Step 2: The NameNode retrieves an EDEK from its cache, which in turn is refreshed periodically from the KMS in the background.
Step 3: The EDEK is added to the file's metadata.


Step 4: The EDEK is also returned to the client.
Step 5: The client now reaches out to the KMS with the EDEK to obtain the DEK; the KMS checks that the client is authorized for the zone, then decrypts the EDEK using the EZ key.
Step 6: The client uses the DEK to encrypt the file contents and writes them to HDFS.
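To make step 6 concrete, here is a minimal local sketch of what the client-side encryption amounts to: the file contents are encrypted with the per-file DEK before being written, so only ciphertext ever lands on the DataNode disks. HDFS does this internally through its crypto streams; plain OpenSSL AES-CTR is used here purely as an illustration, and all file names are made up for the demo:

```shell
tmp=$(mktemp -d)
echo "credit card: 4111-1111-1111-1111" > "$tmp/plain.txt"

# Step 5 (result): the client has obtained the per-file DEK from the KMS
openssl rand -hex 32 > "$tmp/dek.key"

# Step 6: the client encrypts the file with the DEK before writing it out;
# what is stored on disk is only the ciphertext
openssl enc -aes-256-ctr -pbkdf2 -pass file:"$tmp/dek.key" \
    -in "$tmp/plain.txt" -out "$tmp/stored.bin"

# The stored bytes reveal nothing without the DEK
grep -q "credit card" "$tmp/stored.bin" || echo "plaintext not visible in ciphertext"

# On read, the flow runs in reverse: fetch the EDEK, have the KMS unwrap it
# to the DEK, and decrypt the stream transparently
openssl enc -d -aes-256-ctr -pbkdf2 -pass file:"$tmp/dek.key" \
    -in "$tmp/stored.bin" -out "$tmp/read.txt"
```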


Encryption In Action


Let's create a test encryption key and list it out:

$ sudo -u <key_admin> hadoop key create keytrustee_test
$ hadoop key list

Let's create an encryption zone (sensitive_data) using the test key (keytrustee_test). An encryption zone is simply an HDFS directory in which all data is stored encrypted:
$ sudo -u hdfs hadoop fs -mkdir /sensitive_data
$ sudo -u hdfs hdfs crypto -createZone -keyName keytrustee_test -path /sensitive_data
Now, verify the zone
$ sudo -u hdfs hdfs crypto -listZones
/sensitive_data   keytrustee_test
Let us now add a file to the sensitive data folder, where it will be stored encrypted, for example:
$ sudo -u hdfs hadoop fs -put /tmp/testfile.txt /sensitive_data
$ sudo -u hdfs hadoop fs -cat /sensitive_data/testfile.txt


I hope you now have a better understanding of how data encryption at rest and in transit works.
