How Big MNCs Like Google, Facebook, and Instagram Store, Manage, and Manipulate Thousands of Terabytes of Data with High Speed and High Efficiency…
Ever wondered how big MNCs like Google, Facebook, etc. handle that much data?
“Let’s Talk About Big Data…”
Your data is simply data unless it is processed. Big Data, in simple words, is a large amount of data that has been stored in databases, clouds, and other file storage systems.
Webopedia defines big data as
a large volume of both structured and unstructured data sets that is difficult to process using traditional database and software techniques.
This means that Big Data can be structured or unstructured: the collection of both kinds of data in digital format is Big Data.
Let us take the example of Google to understand how it handles this much data.
“How Google Applies Big Data”
Google is an undisputed champion when it comes to Big Data. It has developed several open-source tools and techniques that are used extensively across the Big Data ecosystem. With the help of these tools and techniques, Google can explore millions of websites and fetch you the right answer or information within milliseconds. The first question that comes to mind is: how can Google perform such complex operations so efficiently? The answer is Big Data analytics. Google uses Big Data tools and techniques to understand our requirements based on several parameters like search history, location, and trends. The data then goes through an algorithm that performs complex calculations, and Google effortlessly displays the search results sorted and ranked by relevancy and authority to match the user’s requirement.
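As a toy illustration only (Google’s actual ranking algorithm is proprietary and far more complex), here is a minimal Python sketch of the general idea of sorting results by a combined relevancy-and-authority score. The page names, scores, and weights below are all invented for demonstration.

```python
# Toy ranking sketch: NOT Google's real algorithm.
# Pages, scores, and weights below are entirely hypothetical.

pages = [
    # (url, relevance to the query, authority of the site), each in 0..1
    ("example.com/big-data-intro", 0.90, 0.60),
    ("example.org/hadoop-guide",   0.75, 0.85),
    ("example.net/random-post",    0.40, 0.30),
]

def score(page, w_relevance=0.7, w_authority=0.3):
    """Combine relevance and authority into a single ranking score."""
    _, relevance, authority = page
    return w_relevance * relevance + w_authority * authority

# Sort pages by descending score, the way a search engine ranks results.
for page in sorted(pages, key=score, reverse=True):
    print(f"{page[0]}: score={score(page):.2f}")
```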
Here are the 4 V’s by which Big Data is characterized:
• VOLUME — Scale of data
• VELOCITY — Analysis of streaming data
• VARIETY — The Different form of data
• VERACITY — Uncertainty of data
What is Distributed Storage?
A distributed storage system is infrastructure that can split data across multiple physical servers, and often across more than one data center. It typically takes the form of a cluster of storage units, with a mechanism for data synchronization and coordination between cluster nodes.
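To make this concrete, here is a minimal Python sketch of the core idea: split data into fixed-size blocks and place each block on several nodes so a server failure does not lose data. This is not any particular product’s API; the node names, block size, and replication factor are assumptions chosen for illustration.

```python
# Minimal sketch of distributed block storage: split data into blocks
# and place each block on multiple nodes for fault tolerance.
# Node names, block size, and replication factor are illustrative only.

BLOCK_SIZE = 4          # bytes per block (tiny, just for demonstration)
REPLICATION = 2         # number of copies kept of each block
NODES = ["node-a", "node-b", "node-c"]

def split_into_blocks(data: bytes, block_size: int = BLOCK_SIZE):
    """Cut the byte string into fixed-size blocks."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_blocks(blocks, nodes=NODES, replication=REPLICATION):
    """Assign each block to `replication` distinct nodes, round-robin."""
    placement = {}
    for i, block in enumerate(blocks):
        replicas = [nodes[(i + r) % len(nodes)] for r in range(replication)]
        placement[i] = {"data": block, "nodes": replicas}
    return placement

if __name__ == "__main__":
    blocks = split_into_blocks(b"hello distributed world")
    for block_id, info in place_blocks(blocks).items():
        print(block_id, info["data"], "->", info["nodes"])
```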
There are several tools for crunching Big Data; one of them is Hadoop.
Apache Hadoop is an open-source framework used to efficiently store and process large datasets, ranging in size from gigabytes to petabytes. Instead of using one large computer to store and process the data, Hadoop clusters multiple computers together to analyze massive datasets in parallel more quickly.
Hadoop consists of four main modules:
- Hadoop Distributed File System (HDFS) — A distributed file system that runs on standard or low-end hardware. HDFS provides better data throughput than traditional file systems, in addition to high fault tolerance and native support of large datasets.
- Yet Another Resource Negotiator (YARN) — Manages and monitors cluster nodes and resource usage. It schedules jobs and tasks.
- MapReduce — A framework that helps programs perform parallel computation on data. The map task takes input data and converts it into a dataset that can be computed as key-value pairs. The output of the map task is consumed by reduce tasks, which aggregate the output and provide the desired result (see the word-count sketch after this list).
- Hadoop Common — Provides common Java libraries that can be used across all modules.
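To see the map and reduce phases concretely, here is a minimal pure-Python word-count sketch that mimics the MapReduce data flow. It is not the Hadoop Java API itself; in a real cluster the map and reduce calls run in parallel across many nodes, while here they run sequentially just to show how the data moves.

```python
# Pure-Python sketch of the MapReduce word-count flow.
# In real Hadoop, map and reduce tasks run in parallel across the cluster;
# here they run sequentially only to illustrate the data movement.

from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) key-value pair for every word in the line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(mapped_pairs):
    """Shuffle: group all emitted values by key, as Hadoop does between phases."""
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: aggregate the values for one key into the final count."""
    return key, sum(values)

lines = ["big data is big", "hadoop crunches big data"]
mapped = [pair for line in lines for pair in map_phase(line)]
for word, count in sorted(reduce_phase(k, v) for k, v in shuffle(mapped).items()):
    print(word, count)
```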
SOME BENEFITS:
- Resource Sharing
- Openness
- Scalability
- Fault Tolerance