Big Data is expected to be one of the most in-demand technologies over the next few years. It is already established in the tech market, but data scientists are still in short supply. So if you don't know anything about Big Data yet, it's not too late: go learn it and add it to your resume.

In this post, I'm going to list a few very popular Big Data interview questions asked in data scientist interviews. For an overview of Big Data, see my previous post, What is Big Data.

Top 10 Big Data Interview Questions

1. What Is Apache Hadoop?
Apache Hadoop is an open source framework, built on a distributed file system, for processing large volumes of data in a distributed (clustered) environment using a simple programming model and commodity hardware.
Key features of Apache Hadoop are -
  • Accessible - It runs on large clusters of commodity machines.
  • Robust - It is highly fault tolerant.
  • Scalable - It scales linearly to handle larger volumes of data simply by adding more nodes to the cluster.
  • Simple - You can quickly write efficient code that runs in parallel across machines.
At the time of writing, the current stable version of Apache Hadoop is 2.7.1. To install Hadoop, you can follow the Hadoop Single Node Cluster Setup.
2. What is MapReduce?
MapReduce is a programming model for processing large volumes of data in a distributed environment. A MapReduce job runs over data stored in HDFS (Hadoop Distributed File System) in two phases: the map (mapper) phase and the reduce (reducer) phase. You can see an example of a MapReduce program here.
3. What are the different steps involved in MapReduce?
MapReduce is a programming model whose algorithm carries out the following steps:
  1. Iteration over the input.
  2. Computation of key-value pairs from each piece of input.
  3. Grouping of all the intermediate values by key, known as shuffling and sorting.
  4. Iteration over the resulting groups.
  5. Reduction of each group.
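The five steps above can be sketched in plain Python as a word count. This is only an in-memory illustration of the model, not Hadoop code; the function names (map_phase, shuffle_phase, reduce_phase, word_count) are made up for this example.

```python
from collections import defaultdict


def map_phase(records):
    """Steps 1-2: iterate over the input and emit (key, value) pairs."""
    for line in records:
        for word in line.split():
            yield word.lower(), 1


def shuffle_phase(pairs):
    """Step 3: group all intermediate values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Steps 4-5: iterate over the groups in key order and reduce each one."""
    return {key: sum(values) for key, values in sorted(groups.items())}


def word_count(records):
    """Chain the three phases together on a list of text lines."""
    return reduce_phase(shuffle_phase(map_phase(records)))


if __name__ == "__main__":
    print(word_count(["big data is big", "data is everywhere"]))
```

In a real cluster the map and reduce phases run on many machines at once and the grouping step moves data over the network, but the flow of key-value pairs is the same.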
4. What is Secondary NameNode?
The Secondary NameNode is a helper node that usually runs on a separate machine with no other activity going on. It runs in the background, periodically reads the file system change log (the edit log) from the NameNode, and applies those changes to the fsimage file. This helps the NameNode start up faster next time. The NameNode keeps its metadata in memory, so that information is lost if the NameNode crashes. When it starts fresh after a crash, it rebuilds an up-to-date view of the file system from the fsimage file. Note that the Secondary NameNode is a checkpointing helper, not a hot standby for the NameNode.
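The periodic merge described above can be pictured with a small Python sketch. It is a heavily simplified assumption, not the real HDFS format: the fsimage is modelled as a dict from file path to block list, the change log as a list of tuples, and the operation names ("create", "delete") and the checkpoint function are hypothetical.

```python
def checkpoint(fsimage, edit_log):
    """Return a new fsimage with every logged change applied in order."""
    merged = dict(fsimage)  # copy, so the old snapshot is left untouched
    for op, path, blocks in edit_log:
        if op == "create":
            merged[path] = blocks
        elif op == "delete":
            merged.pop(path, None)
        else:
            raise ValueError(f"unknown operation: {op}")
    return merged


if __name__ == "__main__":
    fsimage = {"/logs/a.txt": ["blk_1", "blk_2"]}
    edit_log = [
        ("create", "/logs/b.txt", ["blk_3"]),
        ("delete", "/logs/a.txt", None),
    ]
    print(checkpoint(fsimage, edit_log))
```

The point of the sketch is the start-up cost: the longer the change log grows without being merged, the more entries have to be replayed before the NameNode is ready.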
5. What is NameNode?
The NameNode is the master node of HDFS. It keeps track of every DataNode and of the block details. In short, the NameNode holds the metadata: which files are in the system, how each file is broken down into blocks, the block sizes and so on. The NameNode keeps all of this information in memory for fast access rather than reading it from disk on every request. The NameNode is responsible for directing work to the DataNodes, and the DataNodes actually perform the I/O operations.
6. What is DataNode?
The DataNode is also known as a slave machine in a Hadoop cluster. DataNodes are where all the read and write operations on the Hadoop Distributed File System are performed. A DataNode receives its tasks from the NameNode.
7. What are the Java primitive data types supported by Hadoop?
Hadoop does not use Java primitives directly as keys and values. Instead it provides Writable wrapper classes for them, listed below (Text is the wrapper for String) -
  1. BooleanWritable
  2. ByteWritable
  3. DoubleWritable
  4. FloatWritable
  5. IntWritable
  6. LongWritable
  7. Text
You can create your own data type by implementing either the Writable or the WritableComparable interface. For all data types in Hadoop, you can read Data Types in Hadoop.
8. What is Shuffling?
Shuffling is a term from MapReduce programming. MapReduce works on key-value pairs, and the work is done in two phases called mapper and reducer.
During a MapReduce job, the output of a single mapper node may be sent to reducers on multiple nodes in the cluster, and this is known as shuffling. Shuffling means distributing the output of the mapper tasks to the reducer tasks.
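As a rough sketch of how mapper output can be routed, the Python below sends each key to a reducer chosen by hashing the key modulo the number of reducers, which is the idea behind Hadoop's default hash partitioning. The function names are invented for this example, and zlib.crc32 is only a stand-in for a stable hash, not the hash Hadoop itself uses.

```python
import zlib
from collections import defaultdict


def partition(key, num_reducers):
    """Pick a reducer for a key; the same key always maps to the same reducer."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def shuffle(mapper_outputs, num_reducers):
    """Route every (key, value) pair from every mapper to its reducer."""
    reducers = [defaultdict(list) for _ in range(num_reducers)]
    for pairs in mapper_outputs:
        for key, value in pairs:
            reducers[partition(key, num_reducers)][key].append(value)
    return [dict(r) for r in reducers]


if __name__ == "__main__":
    mapper_1 = [("big", 1), ("data", 1)]
    mapper_2 = [("big", 1), ("hadoop", 1)]
    for i, bucket in enumerate(shuffle([mapper_1, mapper_2], 2)):
        print("reducer", i, bucket)
```

Because the reducer is derived from the key alone, all values for one key end up on the same reducer no matter which mapper produced them.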
9. What is the default block size in Hadoop?
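It's 64 MB in Hadoop 1.x and 128 MB in Hadoop 2.x and later. In Hadoop 2.x the value can be changed through the dfs.blocksize property.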
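To see what the block size means in practice, here is a small Python sketch that counts how many blocks a file occupies. The helper name block_count is made up for this example; it assumes 64 MB as the default argument, and other values such as 128 MB can be passed in.

```python
MB = 1024 * 1024


def block_count(file_size_bytes, block_size_bytes=64 * MB):
    """Number of blocks a file occupies; the last block may be only partly full."""
    if file_size_bytes < 0 or block_size_bytes <= 0:
        raise ValueError("sizes must be non-negative and block size positive")
    # Ceiling division using integers only.
    return -(-file_size_bytes // block_size_bytes)


if __name__ == "__main__":
    print(block_count(1024 * MB))            # 1 GB file, 64 MB blocks
    print(block_count(1024 * MB, 128 * MB))  # 1 GB file, 128 MB blocks
```

A larger block size means fewer blocks per file, and therefore less metadata for the NameNode to hold in memory.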
10. Can you give a few examples of where Big Data is in use?
Social media and web companies like Facebook, Twitter, Yahoo and LinkedIn use Big Data.
Advertising and marketing platforms like Google AdWords.
The New York Stock Exchange generates large volumes of data every day.
Financial institutions like banking and insurance companies use Big Data analytics.
Retail companies like Sears Holdings use Big Data.
Hardware manufacturers use Big Data, and there are many more.
11. What is difference between MapReduce and RDBMS?
MapReduce suits applications where data is written once and read many times. For example, on your Facebook profile you post a photo once and your friends view it many times. An RDBMS, on the other hand, is good for data sets that are continuously updated. An RDBMS also suits applications where the data size is limited, say in gigabytes, whereas MapReduce suits applications where the data size runs into petabytes. I have listed the complete differences in a separate post, MapReduce vs RDBMS, which is a must read.
