The term Big Data is very popular in the market and a common question to face in any Hadoop interview. After posting java/j2ee interview questions here at java, j2ee interview questions for experienced, I am now planning to start sharing real-time Hadoop interview questions. In my previous post I shared the very first Hadoop interview question, What is the difference between MapReduce and RDBMS, and in this post I am going to explain the concept of Big Data.

What is Big Data?
Nowadays we are surrounded by data, and not just data but a huge amount of it (structured, semi-structured and unstructured); it's not easy to measure how much data we live with. The growth rate of this data is also very high. The sources of this data are the web, mobile devices and IoT, and because of its sheer size it's very difficult to process these data sets using the traditional RDBMS model.

Anyway, Big Data can be characterized using the four Vs.
  1. V1- Volume: The amount of data generated every day/month/year by web sites, mobile devices, IoT, the cloud and other sources is increasing exponentially; it's not easy to measure, but it is sometimes referred to in petabytes/zettabytes of data.
  2. V2- Variety: The data is not only huge in volume but comes in an ever-growing variety of types: structured, semi-structured and unstructured. From key-value pairs to flat log files, sensor readings, geo-locations and social media, the data arrives in many different formats.
  3. V3- Velocity: The pace at which data is created, stored and analysed is very fast: every minute more than 100 hours of video are uploaded to YouTube, millions of pictures are uploaded to social media, stock exchanges generate streams of trade data, and so on.
  4. V4- Veracity: It's very difficult to analyse a huge volume of data if that data is incorrect, corrupted or full of noise. Working on such data sets is a big challenge, for example analysing movie reviews from a data set where many of the movie names are null or empty, which is unexpected given the volume of data.
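To make the veracity point concrete, here is a minimal sketch (the field names `movie` and `rating` and the sample records are hypothetical, just for illustration) of dropping noisy records whose movie name is null or empty before any analysis:

```python
# Hypothetical noisy movie-review data set: some records have a
# null or empty movie name, which is noise for our analysis.
reviews = [
    {"movie": "Inception", "rating": 5},
    {"movie": "", "rating": 3},         # empty name -> noise
    {"movie": None, "rating": 4},       # null name  -> noise
    {"movie": "Interstellar", "rating": 4},
]

# Keep only records with a usable movie name (non-null, non-empty).
clean = [r for r in reviews if r["movie"]]
print(len(clean))  # 2
```

At big-data scale this kind of cleaning step would run inside the processing framework itself rather than in a single Python list, but the idea is the same: filter out records that fail a veracity check before aggregating.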

Below are a few major sources of big data:
  • Social Media: Facebook, Twitter, YouTube etc. Every minute, millions of social network users commenting, tweeting and uploading videos produce petabytes of data in structured, semi-structured and unstructured formats.
  • The New York Stock Exchange generates around one TB (terabyte) of trade data per day.
  • Internet advertisement data.
  • Sensors used to collect climate data, and much more.
Processing big data is not easy using a traditional RDBMS; that's why Hadoop comes into the picture, although Grid computing is one possible approach to handling big data. Hadoop, however, is a much better fit than Grid computing; read the article Hadoop Vs Grid computing to get a better idea.

Hadoop uses HDFS (Hadoop Distributed File System) to store big data and the MapReduce model to process these data sets. You can start with Hadoop single node cluster setup and How MapReduce works?
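The MapReduce model mentioned above can be sketched in plain Python (this is just an illustration of the map/shuffle/reduce idea on the classic word-count example, not actual Hadoop API code):

```python
from collections import defaultdict

# Map phase: each input line is turned into (word, 1) pairs.
def map_phase(line):
    return [(word, 1) for word in line.split()]

# Shuffle phase: group all values by key (the framework does this
# between mappers and reducers in real Hadoop).
def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

# Reduce phase: aggregate the values for each key.
def reduce_phase(groups):
    return {key: sum(values) for key, values in groups.items()}

lines = ["big data is big", "data is everywhere"]
pairs = [p for line in lines for p in map_phase(line)]
counts = reduce_phase(shuffle(pairs))
print(counts)  # {'big': 2, 'data': 2, 'is': 2, 'everywhere': 1}
```

In real Hadoop the mappers and reducers run in parallel across the cluster, reading input splits from HDFS, but the three logical steps are exactly these.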

That's it, Happy Analytics :)


