Did you say data lakes?

01 Oct, 2024


This morning you almost dropped your cup of coffee when you learned that your data scientist was talking not about her holidays but about massive databases.

Yes, sad news. We won’t be discussing travel today either, but big data storage: a way to manage all the spreadsheets, user information and documents that fly through your office.

A data lake is nothing like a giant puddle of water, and it is far more useful to your company. If you have reached the limits of what approximate data can tell you, it deserves your attention. Your data scientists may be good, but without the appropriate tools, their work can lack precision.

What are data lakes?

A data lake works as a cloud repository where you offload your raw, semi-structured, and structured data. In a data lake, you can collect data from all sources. It is a single store for your CRM data, customers’ purchases, or even accounting spreadsheets.

A data lake is not in itself a work tool in which you process your data; it is closer to plain cloud storage. Big data needs to be simplified and reviewed, but a data lake is not the appropriate place to do so. Data warehouses are better suited to that task.

What does it look like?

Whether you need to store your accounting spreadsheets to set up your budget or want to generate new leads from your current customer database, you need to have that data within easy reach. A data lake combines security with accessibility.

A data lake is one of the best solutions for storing your data. It is the first step on the way to processed data. But why do you need it?

The answer depends on the size of your enterprise. A hard disk and a spreadsheet can be enough for a small company, but they may not cope once your business grows.

Data lake vs data warehouse: what's the difference?

Whether your data is raw or structured, the processing pattern is the same: it is stored, then classified and organized so that business professionals can access it. But how?

After you have loaded your data into a data lake, it needs to be cleaned so that you can use it for business purposes. Cloud storage often contains errors, useless information and inappropriate data. These need to be removed so that your analysis can be trusted.

To do so, you can use a processing tool such as Google BigQuery. BigQuery, Google's data warehouse, is serverless. It lets you clean and structure your data.

With its machine learning, BigQuery allows you to automate the classification of your data. If you are used to working with Google tools, it can be the best choice for you. Keep in mind: data lakes and data warehouses are complementary. They are not designed for the same purpose. A data lake will never replace a data warehouse.
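To make the cleaning step concrete, here is a minimal sketch in Python. The field names (`customer_id`, `email`, `amount`) and the three rules (required fields, numeric amount, no duplicate ids) are assumptions chosen for illustration; your own rules will depend on your data.

```python
def clean_records(records):
    """Drop rows with missing fields, malformed amounts, or duplicate ids."""
    seen_ids = set()
    cleaned = []
    for row in records:
        customer_id = row.get("customer_id")
        email = (row.get("email") or "").strip().lower()
        # Rule 1: an id and a plausible email are required.
        if not customer_id or "@" not in email:
            continue
        # Rule 2: the amount must be numeric.
        try:
            amount = float(row.get("amount", ""))
        except (TypeError, ValueError):
            continue
        # Rule 3: keep only the first row seen for each id.
        if customer_id in seen_ids:
            continue
        seen_ids.add(customer_id)
        cleaned.append({"customer_id": customer_id, "email": email, "amount": amount})
    return cleaned


# Hypothetical raw rows, as they might sit in the lake.
raw = [
    {"customer_id": "c1", "email": " Ada@Example.com ", "amount": "19.90"},
    {"customer_id": "c1", "email": "ada@example.com", "amount": "19.90"},  # duplicate id
    {"customer_id": "c2", "email": "not-an-email", "amount": "5"},         # invalid email
    {"customer_id": "c3", "email": "bob@example.com", "amount": "n/a"},    # invalid amount
    {"customer_id": "c4", "email": "eve@example.com", "amount": "42"},
]
cleaned = clean_records(raw)
print([row["customer_id"] for row in cleaned])
```

Only the rows that pass all three rules survive; everything else stays behind in the lake, untouched.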

3 main reasons to store your data in a data lake

  1. It helps you manage high volumes of data. When your enterprise grows, you may need more storage space. Data lakes are flexible and easy to adapt to your volume of data. Need one more terabyte of storage? That is never a problem.

  2. It stores different types of data in one standardized format. This has two main implications. If a data structure changes, your data will not be deleted or become unusable. And you can compare different data types on the same level, since they are all stored in the same structure.

  3. Storing is fast, since raw data does not need to be processed and classified beforehand. Note that if your data has to be structured first, it can take more time.
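The third point, storing raw data without processing it first, can be sketched in a few lines of Python. This is an illustration under assumptions: a local temporary folder stands in for cloud storage, and the `land_raw` helper, the source names and the `<source>/<date>/<file>` layout are hypothetical choices, not a standard.

```python
import json
import tempfile
from datetime import date
from pathlib import Path


def land_raw(lake_root, source, filename, payload, day=None):
    """Write payload bytes unchanged under <lake_root>/<source>/<YYYY-MM-DD>/<filename>."""
    day = day or date.today()
    target_dir = Path(lake_root) / source / day.isoformat()
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / filename
    target.write_bytes(payload)  # no parsing, no schema check: stored as received
    return target


# A temporary local folder plays the role of the data lake here.
lake = Path(tempfile.mkdtemp(prefix="lake_"))
day = date(2022, 1, 3)

crm = land_raw(lake, "crm", "contacts.json",
               json.dumps([{"id": 1, "name": "Ada"}]).encode(), day)
acc = land_raw(lake, "accounting", "budget.csv",
               b"month,amount\n2022-01,1200\n", day)

print(crm.relative_to(lake))
print(acc.relative_to(lake))
```

Both files are written exactly as received, whatever their type, which is why this step is quick.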

How much will you spend on data storage?

It depends. What a silly way to answer your question, you may think. Well, as a general rule, the more queries you make, the more you pay. Let’s dig into three data lake pricing examples.

  • To help estimate data lake pricing, we chose to dig into Microsoft's storage service, Azure Data Lake, as its pricing is relatively easy to understand.

You can see in this grid that prices vary depending on the storage option you choose. The calculator on Azure’s website estimates the cost per GB of your storage, with values adjusted according to your location. Depending on your subscription, you can access your data more or less often.

Monthly volume              | Premium      | Hot            | Cool         | Archive
First 50 TB/month           | $0.18 per GB | $0.0184 per GB | $0.01 per GB | $0.002 per GB
Next 450 TB/month           | $0.18 per GB | $0.0177 per GB | $0.01 per GB | $0.002 per GB
Additional 500 TB/month     | $0.18 per GB | $0.0169 per GB | $0.01 per GB | $0.002 per GB

Source: https://azure.microsoft.com/en-us/pricing/details/data-lake-storage-gen1/ 
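To see how the tiers add up, here is a rough Python sketch that applies the Hot column of the grid above. It rests on assumptions: 1 TB is counted as 1024 GB, the third price applies to everything beyond 500 TB, and only storage is counted (no query or transfer fees). Treat it as a back-of-the-envelope estimate, not as Azure's billing logic.

```python
# Hot prices copied from the grid above, in USD per GB per month.
# Assumption: 1 TB = 1024 GB.
GB_PER_TB = 1024
HOT_TIERS = [
    (50 * GB_PER_TB, 0.0184),   # first 50 TB
    (450 * GB_PER_TB, 0.0177),  # next 450 TB
    (None, 0.0169),             # assumed: everything beyond 500 TB
]


def monthly_storage_cost(stored_gb, tiers=HOT_TIERS):
    """Estimate the monthly storage cost in USD for stored_gb gigabytes."""
    remaining = stored_gb
    cost = 0.0
    for tier_size_gb, price_per_gb in tiers:
        if remaining <= 0:
            break
        # Bill as much as fits in this tier, then move on to the next one.
        billed = remaining if tier_size_gb is None else min(remaining, tier_size_gb)
        cost += billed * price_per_gb
        remaining -= billed
    return round(cost, 2)


print(monthly_storage_cost(10 * GB_PER_TB))  # 10 TB, entirely in the first tier
print(monthly_storage_cost(60 * GB_PER_TB))  # 60 TB, spread over two tiers
```

The same function works for the other columns: pass a different list of tiers as the second argument.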

  • You may also be interested in AWS, among many other providers. AWS services are tailored to your needs. If you want to compare Azure Data Lake with Amazon's data lake offering, you can estimate the budget needed to store your data with the AWS pricing calculator.

  • And last but not least: Google Cloud Storage. To estimate your spending, look at Google Cloud’s pricing calculator. To give you a general idea: 50 GB of standard storage costs about $1.00 per month.

3 steps of data processing

Data processing is not easy. Your IT team handles it so that you and your business teams receive data that is processed and ready to use.

  1. The storage of raw data in a data lake.

  2. BigQuery helps you classify your data. It scans for anomalies, removes them, and classifies your data in a personalized way. With an ETL (extract, transform, load) tool, you can pipe your data to the data warehouse.

  3. The storage and use of processed data in your data warehouse.
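The three steps above can be sketched end to end in Python. This is a toy illustration, not a BigQuery example: an in-memory SQLite database stands in for the data warehouse, a short CSV string stands in for a raw file in the lake, and the `transform` and `load` function names are hypothetical.

```python
import csv
import io
import sqlite3

# Step 1: a raw file as it sits in the lake (stand-in data).
RAW_CSV = "customer,amount\nada,19.90\nbob,n/a\nada,5.10\neve,42\n"


def transform(raw_text):
    """Step 2: parse the raw CSV and keep only rows with a numeric amount."""
    rows = []
    for row in csv.DictReader(io.StringIO(raw_text)):
        try:
            rows.append((row["customer"], float(row["amount"])))
        except ValueError:
            continue  # anomaly: skip the row
    return rows


def load(rows):
    """Step 3: load the clean rows into a warehouse table (in-memory SQLite here)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE purchases (customer TEXT, amount REAL)")
    conn.executemany("INSERT INTO purchases VALUES (?, ?)", rows)
    conn.commit()
    return conn


warehouse = load(transform(RAW_CSV))
totals = warehouse.execute(
    "SELECT customer, ROUND(SUM(amount), 2) FROM purchases "
    "GROUP BY customer ORDER BY customer"
).fetchall()
print(totals)
```

Once the data is in the warehouse, business questions such as "how much did each customer spend?" become a single query.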

How can data lakes improve your business efficiency?

A data lake is designed to store raw data. On its own, it does not help your board, because it holds all sorts of data from every source you can think of. It is of little use to you unless you process that data in a data warehouse. To sum things up:

  • Do you process big data? If so, a data lake is useful.

  • Why? Because it centralizes the storage of all your data so that you can keep it within easy reach. It facilitates the management of your data, doesn’t it?

  • For whom? Which companies? Companies with massive databases. So: you, if you have a lot to store.

  • How? In itself, it only stores, but if used wisely with a data warehouse, it will improve your data analysis.

Remember: your databases are worth analyzing, but first they need to be stored. Do you have other questions on this topic? Don't hesitate to contact our dedicated team; we'll be pleased to help you.

 

By Emma Jeanpierre

03 Jan, 2022