Top 20 Big Data Technologies

1. Hadoop
Apache Hadoop is an open source framework, built on a distributed file system, for processing large volumes of data across a distributed (clustered) environment using a simple programming model and commodity hardware.

2. MapReduce
MapReduce is a programming model for processing large volumes of data in a distributed environment. In Hadoop, MapReduce jobs run over data stored in HDFS (Hadoop Distributed File System) in two phases: the map (mapper) phase and the reduce (reducer) phase.
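The two phases can be sketched in plain Python with a word count. This is only a single-process illustration of the model, not Hadoop's API: the function names (map_phase, shuffle, reduce_phase, word_count) are made up for this example, and the shuffle step stands in for the grouping that the framework performs between the two phases on a real cluster.

```python
from collections import defaultdict


def map_phase(line):
    """Mapper: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group mapper output by key (done by the framework on a real cluster)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reducer: combine all values emitted for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence over an in-memory input."""
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["big data big cluster", "data node"]))
```

In a real job the mappers and reducers run in parallel on different nodes, each working on its own slice of the input.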

3. Hive
The Apache Hive data warehouse software facilitates querying and managing large datasets residing in distributed storage. Hive provides a mechanism to project structure onto this data and query the data using a SQL-like language called HiveQL.

4. Spark
Apache Spark is a powerful open source data processing engine built around speed, ease of use, and sophisticated analytics. It was originally developed at UC Berkeley in 2009 and is now an Apache open source project.

5. Storm
Apache Storm is a free and open source distributed system for processing streaming data in real time. Storm makes it easy to reliably process unbounded streams of data, doing for real-time processing what Hadoop did for batch processing.

6. Kafka
Apache Kafka is a fast, scalable, durable, and fault-tolerant publish-subscribe messaging system. Kafka is often used in place of traditional message brokers, such as those based on JMS and AMQP, because of its higher throughput, reliability, and replication.

7. HBase
Apache HBase is an open source NoSQL database that runs on top of HDFS and provides real-time read/write access to large datasets.

8. Phoenix
Apache Phoenix is a relational database layer over HBase, delivered as a client-embedded JDBC driver that targets low-latency queries over HBase data. It is also known as SQL on HBase.

9. Pig
Apache Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs.

10. Cassandra
The Apache Cassandra database is the right choice when you need scalability and high availability without compromising performance. Linear scalability and proven fault-tolerance on commodity hardware or cloud infrastructure make it the perfect platform for mission-critical data.

11. Flume
Apache Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of streaming data into the Hadoop Distributed File System (HDFS).

12. ZooKeeper
ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services. Distributed applications use all of these kinds of services in some form or another, which is why ZooKeeper is also known as a coordination service.

13. Sqoop
Apache Sqoop is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured data stores such as relational databases.

14. Ambari
The Apache Ambari project is aimed at making Hadoop management simpler by developing software for provisioning, managing, and monitoring Apache Hadoop clusters. Ambari provides an intuitive, easy-to-use Hadoop management web UI backed by its RESTful APIs.

15. Avro
Apache Avro is a data serialization system that provides rich data structures, a binary data format, remote procedure calls (RPC), simple integration with dynamic languages, and more.

16. Mahout
The Apache Mahout project's goal is to build an environment for quickly creating scalable performant machine learning applications.

17. Oozie
Apache Oozie is a workflow scheduler system for managing Apache Hadoop jobs. Oozie Workflow jobs are Directed Acyclic Graphs (DAGs) of actions.
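To see what running a DAG of actions means, here is a small Python sketch that orders actions so that each one runs only after the actions it depends on. It does not use Oozie itself (real Oozie workflows are defined in XML); the action names and the execution_order helper are hypothetical, and the standard-library graphlib module it relies on requires Python 3.9 or later.

```python
from graphlib import TopologicalSorter

# Hypothetical workflow: each action maps to the set of actions it depends on.
workflow = {
    "import_data": set(),
    "clean_data": {"import_data"},
    "validate_data": {"import_data"},
    "build_report": {"clean_data", "validate_data"},
}


def execution_order(dag):
    """Return one valid order in which every action follows its dependencies."""
    return list(TopologicalSorter(dag).static_order())


if __name__ == "__main__":
    for step, action in enumerate(execution_order(workflow), start=1):
        print(step, action)
```

Because the graph has no cycles, such an order always exists. Actions with no dependency between them, like clean_data and validate_data here, could also run in parallel.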

18. Tez
Apache Tez is an extensible framework for building high performance batch and interactive data processing applications, coordinated by YARN in Apache Hadoop.

19. Chukwa
Chukwa is an open source data collection system for monitoring large distributed systems. Chukwa is built on top of the Hadoop Distributed File System (HDFS) and Map/Reduce framework and inherits Hadoop’s scalability and robustness.

20. Solr
Apache Solr is an open source search platform built on Apache Lucene. It is highly reliable, scalable, and fault tolerant, providing distributed indexing, replication and load-balanced querying, automated failover and recovery, centralized configuration, and more.

These are not the only Big Data technologies, but they are among the most popular ones.
