// 31 Mar 2023

World Backup Day 2023: AI puts spotlight on data lakes like S3

Poojan Kumar, Co-Founder & CEO

ChatGPT has gained a lot of attention recently, opening up a whole new world of possibilities across knowledge, research, and collaboration. The growing capabilities of AI and the proliferation of AI-based tools are giving rise to new modalities of working.

Use of AI/ML in regulated industries

While it’s been fun to play around with large language models, I’ve been following the discourse on their implications for higher education dissertations, creative copyrights, and building software. What I find most interesting, however, is their impending impact on businesses in highly regulated industries. Some examples:

  • In healthcare, large language models (LLMs) are being used to flag anomalies in patient data and identify potential health issues
  • In life sciences, AI/ML is used to analyze sequences for genetic research, processing massive amounts of data to find minute differences or non-obvious patterns
  • In financial services, machine learning models are being deployed widely to detect irregular activity and fraud
  • Manufacturing organizations are using machine learning to power smart technologies, from home lighting systems to municipal street grid systems

All these industries are subject to stringent data compliance requirements around retention, encryption, storage, and privacy, including but not limited to:

  • Healthcare and life sciences: HIPAA, OSHA, FDA, DEA
  • Financial services: FINRA, GLBA, SOX, PCI-DSS
  • Manufacturing & Energy: ISO 27001, FERC, EPA

Protecting the data that powers your ML models

The underlying data used to train machine learning models is typically stored in data lakes. The data lake also serves as the source of data for customer portals, support dashboards, development projects, and the like. Naturally, as AI and LLMs proliferate, it’s never been more important to ensure that your data lake is protected at the source level.

One cannot talk about data lakes without talking about Amazon S3. S3 is the underlying platform for all major data lakes operating on AWS – Delta Lake, Lake Formation, Iceberg, etc. And while the infrastructure behind S3 is supremely resilient, the resident data is your responsibility. This includes the resilience, uptime, availability, and integrity of all the data in your data lake. And that means discovering important data in your data lake and backing it up.

3 considerations for World Backup Day

On this World Backup Day, take a few minutes to review your data protection strategy for your critical data, especially as it relates to data lakes.

  • Make sure that your backup solution can scale to your AI/ML needs. With terabytes of new data being generated every day, your data resilience platform should be able to scale to petabytes of data, and be able to track millions of events and changes in your data with high fidelity.
  • Test your recoverability to ensure that it meets your SLAs. A backup is only useful if you are able to restore exactly the data you need in the time that you need it.
  • Review your TCO. Investigate cloud-native, cost-effective solutions for your long-term retention needs. Take stock of how much overhead is going into managing copies of data (versions, replicas, archives, vaults).
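
To make the second consideration concrete, here is a minimal sketch of how the result of a test restore might be scored against your SLAs. It assumes the SLAs are expressed as a recovery time objective (RTO) and a recovery point objective (RPO); those terms, the RestoreDrill record, the check_drill function, and all of the sample numbers are hypothetical and only for illustration. They are not part of any particular backup product.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class RestoreDrill:
    """Outcome of one test restore (hypothetical record, for illustration)."""
    backup_taken_at: datetime      # timestamp of the backup that was restored
    incident_at: datetime          # simulated moment of data loss
    restore_started_at: datetime
    restore_finished_at: datetime
    objects_expected: int          # objects that should have come back
    objects_restored: int          # objects that actually came back


def check_drill(drill: RestoreDrill, rto: timedelta, rpo: timedelta) -> dict:
    """Compare a restore drill against assumed RTO/RPO targets."""
    # How long the restore itself took.
    restore_time = drill.restore_finished_at - drill.restore_started_at
    # How much recent data would have been lost (incident minus last backup).
    data_loss_window = drill.incident_at - drill.backup_taken_at
    return {
        "meets_rto": restore_time <= rto,
        "meets_rpo": data_loss_window <= rpo,
        "complete": drill.objects_restored == drill.objects_expected,
    }


# Made-up drill: nightly backup, loss at 09:00, restore takes 3.5 hours.
drill = RestoreDrill(
    backup_taken_at=datetime(2023, 3, 30, 23, 0),
    incident_at=datetime(2023, 3, 31, 9, 0),
    restore_started_at=datetime(2023, 3, 31, 9, 15),
    restore_finished_at=datetime(2023, 3, 31, 12, 45),
    objects_expected=1_000_000,
    objects_restored=1_000_000,
)
result = check_drill(drill, rto=timedelta(hours=4), rpo=timedelta(hours=12))
print(result)
```

Running a check like this after every scheduled drill, rather than only after an incident, is one way to notice when data growth has quietly pushed restore times past what the business expects.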

Final thoughts

Forward-leaning companies today have two things in common: they are leveraging data lakes, and they are subject to regulation. Data resilience is essential not just to a company’s innovation, but to its survival. It’s the new metric by which business health will be measured. As much data growth as we’ve seen in the last few years, the next few years will bring orders of magnitude more. AI is just one of the technology trends intertwined with data at scale. I can’t wait to see what’s next.


About the author

Poojan Kumar is the CEO and co-founder at Clumio. Poojan brings 18 years of experience in cloud computing and storage and is known for seeing an opportunity for change, innovating and capitalizing on it.