Ever wondered how tech giants manage their huge data?

Yashwanth Medisetti
4 min read · Sep 16, 2020

It is a known fact that the data a company holds is its most crucial asset. The data that Google, Facebook, Amazon and others have collected is what made them the tech giants of the world, and it is why they currently rule it. So the discussion here is all about data.

Where do we actually store data?

Well, it is none other than our storage devices: pen drives, hard disks and so on. For our daily storage needs we generally use an external device such as a pen drive, which usually holds a few gigabytes. And whenever we copy data onto such a device, there is no way around waiting for the whole transfer to finish.

According to figures published by the big firms, the data they collect is enormous. Facebook receives nearly 4 petabytes of content, and a single petabyte is a million gigabytes. Google, the biggest giant of all, holds petabytes to exabytes of data in its datacenters and currently processes over 20 petabytes of data per day. Handling this much data and serving it back to the user quickly and efficiently is what makes these companies the best.

Ever wondered how these companies store such a tremendous amount of data?

Well, the first answer that comes to mind would be, “Google and the other tech giants have appliances or devices that are capable of storing exabytes of data.” The answer to that is NO.

According to the world’s leading storage vendors, such as Dell EMC, they are capable of building appliances that can store that much data, but they never build them.

Ever wondered why?

Well, the answer they give is that devices of that kind would perform inefficiently. What is the factor that causes this inefficiency?

Let’s imagine that you are storing a 10 GB file on a pen drive. How much time would it take? Depending on the type of drive, it might take 30 to 45 minutes. Storing is only half of it: to display the file, the system has to read it completely and load it into RAM before it can produce any output, so let’s say that takes another 30 minutes. If a single 10 GB file needs nearly an hour for input (storing) and output (displaying), how long would it take for petabytes of data? It is hard to even imagine; it would take days at the very least.

Let’s say Google stored data in its datacenters using the same procedure. It might then take, for example, 10 days to store our request and another 10 days to display the output. In total, we would wait 20 days for the result of a single Google search, which would be frustrating, and no one would use Google if it worked this way.
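To put a rough number on that, here is a small back-of-the-envelope sketch. It assumes the illustrative figure from above (about one hour to store and read back 10 GB on a single device) and decimal units (1 PB = 1,000,000 GB); the function name `sequential_hours` is made up for this example.

```python
GB_PER_PB = 1_000_000  # decimal units: 1 PB = 1,000,000 GB


def sequential_hours(size_gb, hours_per_10_gb=1.0):
    """Hours to store and read back size_gb on one device.

    hours_per_10_gb is the illustrative rate used in this article
    (roughly one hour per 10 GB), not a measured value.
    """
    return size_gb / 10 * hours_per_10_gb


one_pb_hours = sequential_hours(GB_PER_PB)
print(f"10 GB: {sequential_hours(10):.0f} hour")
print(f"1 PB : {one_pb_hours:,.0f} hours, about {one_pb_hours / 24:,.0f} days")
```

At that assumed rate, a single petabyte on a single device works out to 100,000 hours, so the wait grows in direct proportion to the data size.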

So what procedure do these Fortune 500 companies follow to do the same thing almost instantly? It is Distributed Storage.

How does this work?

Assume we have to store a 40 GB file, which would normally take 40 minutes. We also have 40 systems. If we could somehow distribute this 40 GB across the 40 systems at 1 GB per system, each system would need only 1 minute to store its share. Since all 40 systems work in parallel, the whole 40 GB file is stored in about 1 minute instead of 40. Hence the problem is solved.

In this infrastructure, the main system is called the MASTER, and the systems that each store a part of the data are called the SLAVES. This topology is called a master-slave cluster, and it can be built using software called Hadoop.
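The idea can be sketched as a toy simulation in Python. This is not Hadoop’s API: the “slaves” here are just threads writing into an in-memory dictionary, one byte stands in for one GB, and all function names (`split_into_chunks`, `store_chunk`, `distributed_store`, `read_back`) are made up for the illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(data, n_slaves):
    """Split data into at most n_slaves contiguous chunks of near-equal size."""
    chunk_size = max(1, -(-len(data) // n_slaves))  # ceiling division
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


def store_chunk(slave_id, chunk, cluster):
    """Hypothetical slave: keeps its chunk in an in-memory dict instead of on disk."""
    cluster[slave_id] = chunk
    return len(chunk)


def distributed_store(data, n_slaves):
    """Master role: split the data and hand one chunk to each slave in parallel."""
    cluster = {}
    chunks = split_into_chunks(data, n_slaves)
    with ThreadPoolExecutor(max_workers=n_slaves) as pool:
        futures = [pool.submit(store_chunk, i, c, cluster)
                   for i, c in enumerate(chunks)]
        stored = sum(f.result() for f in futures)
    assert stored == len(data)  # every byte ended up on some slave
    return cluster


def read_back(cluster):
    """Reassemble the original data by joining the chunks in slave order."""
    return b"".join(cluster[i] for i in sorted(cluster))


data = bytes(range(40))  # stand-in for the 40 GB file, one byte per "GB"
cluster = distributed_store(data, n_slaves=40)
print(len(cluster), "slaves, each holding", len(cluster[0]), "byte")
print("round trip ok:", read_back(cluster) == data)
```

The master only decides who stores what; the actual writing happens on the slaves at the same time, which is where the speed-up comes from.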

Here are the problems that this kind of infrastructure solves:

  1. VELOCITY : The speed at which data is stored increases tremendously, since every system writes its own share in parallel. This makes it one of the fastest ways of storing big data.
  2. VOLUME : Since this setup distributes the data among different systems, the space the data consumes on each system (its volume) is minimized, which avoids I/O bottlenecks on any single system.

There are many other use cases for this master-slave topology. As the amount of data received each day grows, adding more slaves to the cluster gives it the efficiency it needs.
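As a rough illustration of how the number of slaves affects the time to store the 40 GB file from the earlier example, here is an idealised estimate. It assumes each system writes 1 GB per minute (the article’s illustrative rate) and ignores network transfer and coordination overhead, so real clusters will be slower than this; `ideal_store_minutes` is a made-up helper name.

```python
def ideal_store_minutes(size_gb, n_slaves, gb_per_minute=1.0):
    """Idealised store time: data is split evenly and all slaves write in parallel."""
    return size_gb / n_slaves / gb_per_minute


for n in (1, 10, 40):
    print(f"{n:>2} slaves: {ideal_store_minutes(40, n):.0f} min")
```

Under these assumptions the time falls in proportion to the number of slaves, which is why growing data volumes are met by adding machines to the cluster rather than by building one bigger device.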