Make your S3 data lake more resilient
For your AI / BI to stay up, your data lake needs to stay up
AWS data lakes, now protected
Ensure continuity for business intelligence
An air-gapped and immutable mirror lake for your data
Instant recovery, even at exabyte scale
Always-on compliance for your regulated data lakes
What is an S3 data lake?
A data lake is a repository that stores large amounts of disparate and unstructured data in a way that makes it readily available for processing and analytics. Amazon S3, part of Amazon Web Services (AWS), is one of the world’s most popular storage technologies for building data lakes.
What is the difference between an S3 data lake and a data warehouse?
Data lakes differ from data warehouses in two main ways. First, data lakes store unstructured data in its native format, such as images, media, objects, and files, whereas data warehouses generally store structured data such as databases and tables. Second, data in a data lake generally stays in the customer’s custody: it resides in their account, and the customer is responsible for its protection, compliance, and security. With a data warehouse, customers usually have to migrate their data to the warehouse provider and pay to store this additional copy. The provider, however, takes care of encrypting the data and keeping it resilient.
AWS offers both data lake and data warehousing services: AWS Lake Formation and Amazon Redshift, respectively. Both approaches require data cleansing, transformation, and quality engineering before the data is useful for downstream applications such as data visualization, data integration, data modeling, and data science.
How do I set up an S3 data lake?
Log in to your AWS console, select an available data lake solution such as AWS Lake Formation, and start ingesting data into it. The data lake stores its data on S3, and you are responsible for securing and backing up this data.
What is the difference between data lakes and ETL?
While data lakes and ETL are both used in the context of data management and analytics, they have different use cases. ETL, or Extract, Transform, and Load, involves extracting and aggregating structured data from multiple sources and then loading it into a data warehouse. ETL is traditionally a ‘big data’ concept. Data lakes, on the other hand, can be read directly by data analytics and data science applications. The customer always retains control of the data, which significantly simplifies data architecture.
What security measures should I take to protect my S3 data lake?
Amazon S3 data lakes can be secured using access control methods, encryption, and backups. Access control helps reduce unauthorized access to the S3 data lake. This can be done by using AWS IAM roles and enabling multi-factor authentication. The next layer of security is encryption.
Encryption ensures that even if there is a data leak or data breach due to failed access control, the resident data cannot be read by a third party. AWS provides the AWS Key Management Service (KMS) for creating and managing encryption keys.
The final line of defense for data security in data lakes is backups. Backups ensure that even if a data lake is encrypted by ransomware, deleted, or wiped out, it can be recovered quickly to a last known good point in time. Backups are also essential for data compliance with various industry regulations. It is crucial that your backups are air-gapped, so that even if your entire account is compromised, your critical data remains unharmed and recoverable.
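As a rough illustration of the access control and encryption layers described above, the sketch below builds an S3 bucket policy with two Deny statements: one that rejects requests to a sensitive prefix when the caller has not authenticated with MFA, and one that rejects uploads that do not request KMS encryption. The bucket name `example-data-lake`, the prefix `regulated`, and the helper `build_bucket_policy` are hypothetical. The code only constructs the JSON document and does not call AWS, so treat it as a starting point to adapt and not as a complete security configuration.

```python
import json


def build_bucket_policy(bucket: str, sensitive_prefix: str) -> dict:
    """Return a bucket policy dict with two Deny statements.

    `bucket` and `sensitive_prefix` are placeholders chosen for this example.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Deny access to the sensitive prefix when the request
                # was not authenticated with MFA (the MFA age key is absent).
                "Sid": "DenySensitivePrefixWithoutMFA",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:*",
                "Resource": f"arn:aws:s3:::{bucket}/{sensitive_prefix}/*",
                "Condition": {"Null": {"aws:MultiFactorAuthAge": "true"}},
            },
            {
                # Deny uploads that do not ask for SSE-KMS encryption.
                # Requests that omit the encryption header are denied too.
                "Sid": "DenyUploadsWithoutKmsEncryption",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:PutObject",
                "Resource": f"arn:aws:s3:::{bucket}/*",
                "Condition": {
                    "StringNotEquals": {
                        "s3:x-amz-server-side-encryption": "aws:kms"
                    }
                },
            },
        ],
    }


if __name__ == "__main__":
    policy = build_bucket_policy("example-data-lake", "regulated")
    print(json.dumps(policy, indent=2))
```

A blanket MFA requirement would also block automated pipelines that use service roles, which is why this sketch scopes the MFA statement to a single prefix.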
What are the costs associated with an S3 data lake?
Data lake architectures usually incur costs for storage, data pipelines, and data processing.
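To make the three cost categories concrete, here is a minimal sketch that adds them up for one month. All rates are placeholder values supplied by the caller and are not actual AWS prices. The names `LakeUsage`, `Rates`, and `estimate_monthly_cost` are made up for this example, and real bills include items this sketch ignores, such as request and data transfer charges.

```python
from dataclasses import dataclass


@dataclass
class LakeUsage:
    stored_gb: float       # average data held in S3 over the month
    pipeline_hours: float  # hours of ingestion and transformation jobs
    scanned_tb: float      # data scanned by analytics queries


@dataclass
class Rates:
    # Placeholder rates supplied by the caller, NOT actual AWS prices.
    per_gb_month: float
    per_pipeline_hour: float
    per_tb_scanned: float


def estimate_monthly_cost(usage: LakeUsage, rates: Rates) -> dict:
    """Break a monthly estimate into storage, pipeline, and processing costs."""
    storage = usage.stored_gb * rates.per_gb_month
    pipelines = usage.pipeline_hours * rates.per_pipeline_hour
    processing = usage.scanned_tb * rates.per_tb_scanned
    return {
        "storage": storage,
        "pipelines": pipelines,
        "processing": processing,
        "total": storage + pipelines + processing,
    }


if __name__ == "__main__":
    usage = LakeUsage(stored_gb=1000, pipeline_hours=10, scanned_tb=2)
    rates = Rates(per_gb_month=0.25, per_pipeline_hour=0.5, per_tb_scanned=4.0)
    for item, cost in estimate_monthly_cost(usage, rates).items():
        print(f"{item}: {cost:.2f}")
```

Keeping the rates separate from the usage figures makes it easy to substitute current pricing for your region and storage class.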
Why should I back up an S3 data lake?
Backups are essential for the security, resilience, and compliance of your data lake. Here are some benefits:
Protection against accidental deletion or corruption
While S3 itself is a durable service, your data on S3 is still susceptible to accidental deletions, human errors, software overwrites, and faulty data migrations. Backing up your S3 data lake ensures that no matter which of these scenarios occurs, your data lake can be restored to a previous version.
In case of a mass encryption event like a ransomware attack, data lake backups can help you recover to a last known good point in time and resume operations without having to pay a ransom.
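Restoring to a last known good point in time comes down to choosing, for every object, the version that was current at that moment. The sketch below shows that selection logic in isolation. It assumes version history is available (for example from S3 Versioning or a backup catalog) as a flat list of records with `Key`, `VersionId`, `LastModified`, and `IsDeleteMarker` fields. That record shape and the function name `versions_to_restore` are simplifications made for this example, not the format of any particular tool.

```python
from datetime import datetime, timezone


def versions_to_restore(records, cutoff):
    """Map each key to the version ID that was current at `cutoff`.

    Keys that did not exist at `cutoff` (not yet written, or deleted
    at that time) are left out of the result.
    """
    latest = {}
    for rec in records:
        if rec["LastModified"] > cutoff:
            continue  # written after the last known good point
        current = latest.get(rec["Key"])
        if current is None or rec["LastModified"] > current["LastModified"]:
            latest[rec["Key"]] = rec
    return {
        key: rec["VersionId"]
        for key, rec in latest.items()
        if not rec["IsDeleteMarker"]
    }


def day(n):
    """Helper for the example: midnight UTC on January n, 2024."""
    return datetime(2024, 1, n, tzinfo=timezone.utc)


if __name__ == "__main__":
    history = [
        {"Key": "a.parquet", "VersionId": "v1", "LastModified": day(1), "IsDeleteMarker": False},
        {"Key": "a.parquet", "VersionId": "v2", "LastModified": day(5), "IsDeleteMarker": False},
        {"Key": "b.csv", "VersionId": "v1", "LastModified": day(2), "IsDeleteMarker": False},
        {"Key": "b.csv", "VersionId": "d1", "LastModified": day(5), "IsDeleteMarker": True},
        {"Key": "c.json", "VersionId": "v1", "LastModified": day(6), "IsDeleteMarker": False},
    ]
    # Treat January 4 as the last known good point in time.
    print(versions_to_restore(history, cutoff=day(4)))
```

In this example, the overwrite of `a.parquet` and the deletion of `b.csv` on January 5 are both rolled back, while `c.json`, created after the cutoff, is not part of the restore set.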
Compliance with industry regulations
If you have a data lake with sensitive data that needs to adhere to industry regulations, you may be required to maintain air-gapped backups of that data. Some important regulations that mandate backups include HIPAA in healthcare and PCI DSS in finance. Backing up your S3 data lake helps you stay compliant with these regulations.
In the event of a regional failure or outage, having a backup of your S3 data lake can be critical for disaster recovery. While most customers use a secondary site that replicates their data for DR scenarios, doing so for data lakes can incur exorbitant costs. Therefore, if your data lake use case can withstand a few hours of downtime, backups can be a cost-effective approach for recovering data lakes from outages.
Data lake backups also enable historical data analysis, trend lines, and metrics on security, performance, and usage across a wide range of time.