What is Hadoop?
Hadoop Ecosystem: Hadoop is a software framework designed to process big data. Traditional RDBMS and DBMS systems also access data, but comparatively small volumes of it, while Hadoop is designed for much larger amounts and a much wider variety of data. Data is distributed among the nodes of a cluster and can be accessed in parallel, which speeds up processing. Hadoop can run on a single computer or on a cluster of computers (possibly many thousands). All the modules in Hadoop are designed with the fundamental assumption that hardware failures are common occurrences and should be handled automatically by the framework.
Strictly speaking, the term Hadoop Ecosystem refers to Hadoop's own elements and modules, but it has become common to use the name for the entire ecosystem of supporting applications and technologies. We will review the complete Hadoop Ecosystem in this blog.
Elements of Hadoop:
- HDFS (Hadoop Distributed File System): a distributed file system that stores data across the nodes of a cluster so that it can be accessed in parallel.
- MapReduce: a programming model for accessing and processing data in parallel.
- YARN (Yet Another Resource Negotiator): handles cluster management, job scheduling and resource management.
- Common: contains the Java libraries and utilities required by the other Hadoop modules mentioned above.
Advantages of Hadoop:
- Robust: with so many organizations now working with big data, there is no easier way to store and evaluate it than with Hadoop.
- Cost-effective: as we have discussed, parallel processing on clusters of commodity hardware saves money, and replication adds to Hadoop's robustness.
- Flexible: Hadoop can be used with structured or unstructured data, encoded or not.
- Scalable: unlike the traditional systems mentioned earlier, Hadoop scales simply by adding nodes, and if a node is lost or stops functioning, its data can be retrieved from other locations.
- Cloud-ready: another great feature is that Hadoop is now being used together with the cloud to manage big data, so large organizations, including cloud-based ones, are looking at Hadoop for their data handling.
Who uses Hadoop?
It is common knowledge that media organisations were early adopters of Hadoop, as they receive large amounts of data every day from all over the world. Today, however, companies in most industries use Hadoop to improve their data handling.
The following organizations are among the largest users of Hadoop:
- Amazon Web Services – offers Elastic MapReduce (EMR)
- Cloudera – Hadoop vendor
- Hortonworks – Hadoop vendor
- Teradata – although Teradata is an RDBMS vendor, it also offers its clients a Hadoop platform integrated with SQL
- Facebook – handles large amounts of data daily with Hadoop; Hive was originally developed at Facebook
- LinkedIn – originally developed Kafka
- Twitter – acquired Storm and later open-sourced it
Why This training?
Hadoop Ecosystem – Hadoop is an ecosystem containing not only the four components discussed earlier, but also many applications for processing client data. Hadoop trainings usually focus on the data storage system, HDFS (Hadoop Distributed File System); the tools used to access data from this system; the monitoring of this data; the data ingestion tools required for moving the data (ETL – Extraction, Transformation and Loading); and the analytics performed on this data. We at Click4learning.com have segregated the complete course into the following phases:
Complete Duration: 30 Hours
Prerequisites of Hadoop Course: Basic Java, Basic Linux
The first phase of this tutorial is for beginners and covers the following content:
- HDFS (Hadoop Distributed File System): the storage system of Hadoop. Data is distributed among nodes, which together form a cluster. The data can be stored in a structured or unstructured manner. Files are written once and read sequentially; HDFS lacks the ability to read and write at random.
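To make the distribution concrete, here is a small back-of-the-envelope sketch. The 128 MB block size and replication factor of 3 are HDFS defaults; the 500 MB file size is an arbitrary example:

```python
# Sketch: how HDFS would split and replicate a file, using default settings.
BLOCK_SIZE_MB = 128        # default HDFS block size
REPLICATION_FACTOR = 3     # default replication factor

def hdfs_footprint(file_size_mb):
    """Return (number of blocks, total raw storage consumed in MB)."""
    # Ceiling division: a final partial block still occupies its own block.
    blocks = -(-file_size_mb // BLOCK_SIZE_MB)
    return blocks, file_size_mb * REPLICATION_FACTOR

blocks, storage = hdfs_footprint(500)  # a hypothetical 500 MB file
print(blocks)   # 4 blocks (three full 128 MB blocks plus one 116 MB block)
print(storage)  # 1500 MB of cluster storage for 500 MB of data
```

The blocks are spread across different nodes, which is what allows later MapReduce jobs to read them in parallel.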
- MapReduce: applications written with MapReduce can access vast amounts of data in parallel on large clusters of commodity hardware. It can easily be scaled to meet the requirements of processing extremely large amounts of data.
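The classic illustration is word count. The sketch below simulates the map, shuffle and reduce phases in plain Python on one machine; in a real cluster each phase would run in parallel across many nodes:

```python
from collections import defaultdict

def map_phase(line):
    # Mapper: emit a (word, 1) pair for every word in the input line.
    return [(word, 1) for word in line.split()]

def shuffle(pairs):
    # Shuffle: group all values by key, as the framework does between phases.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    # Reducer: sum the counts for each word.
    return {word: sum(counts) for word, counts in grouped.items()}

lines = ["big data big clusters", "big data"]
pairs = [pair for line in lines for pair in map_phase(line)]
counts = reduce_phase(shuffle(pairs))
print(counts)  # {'big': 3, 'data': 2, 'clusters': 1}
```

In Hadoop proper, the mapper and reducer would be written against the MapReduce API (typically in Java) and the framework would handle the shuffle, scheduling and failure recovery.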
- HBase: a NoSQL database on top of HDFS that stores data in a column-oriented fashion. HBase is designed to overcome the limitations of HDFS around random, real-time data access, and supports much faster data lookups than HDFS.
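HBase's data model can be pictured as a sorted map from row keys to column families to columns. A minimal in-memory sketch of that idea (the table contents, family and column names below are made up for illustration; this is not the HBase client API):

```python
# Sketch of HBase's logical model: row key -> column family -> qualifier -> value.
table = {
    "user#1001": {"info": {"name": "Alice", "city": "Pune"}},
    "user#1002": {"info": {"name": "Bob",   "city": "Delhi"}},
}

def get(row_key, family, qualifier):
    # Random read by row key: a direct lookup, unlike scanning a file in HDFS.
    return table[row_key][family][qualifier]

print(get("user#1002", "info", "city"))  # Delhi
```

This is why HBase suits real-time access patterns: a read touches one row key rather than a sequential scan of the data.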
- Hive: the major data-access tool used in Hadoop for creating reports. The data is accessed with Hive Query Language (HQL), a language similar to SQL but much simpler. Only structured data can be accessed using HQL.
- Pig: its language is called Pig Latin. Pig is well suited to complex, large applications over structured or unstructured data, and is mainly useful for streaming data, that is, live data.
- Sqoop: the tool used for two-way transfer of data between HDFS and relational databases such as Oracle, Teradata or MySQL. Sqoop is mostly used to move structured data between these databases and HDFS.
- Spark: Spark is a general-purpose data processing engine. The catch? Spark is all of the following:
- Fast: in-memory, real-time data processing
- Easy: applications can be written in Python, Scala, R…
- Runs on: Hadoop, Mesos, standalone, or even the cloud
- Can access data from: HBase, Cassandra, HDFS, S3…
- Flume: designed to transfer streaming data from any number of sources into HDFS. This data is mostly unstructured, which makes clear that Flume can deal with any type of data, structured or not.
- Impala: an open-source parallel-processing query engine that can deal with massive data. The only restriction with Impala is that the data must be stored in a cluster of computers running Apache Hadoop.
- ZooKeeper: responsible for distributed coordination across large sets of hosts – that is, synchronization, distributed configuration, and naming for distributed systems. For example, HBase uses ZooKeeper to track the status of distributed data.
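ZooKeeper exposes a small filesystem-like namespace of nodes ("znodes") that coordinated processes create, read and watch. A toy sketch of that namespace (the paths and data below are made up for illustration; this is not the ZooKeeper client API):

```python
# Toy znode store: path -> data, mimicking ZooKeeper's hierarchical namespace.
znodes = {}

def create(path, data):
    # Like real ZooKeeper, require the parent znode to exist before a child.
    parent = path.rsplit("/", 1)[0] or "/"
    if parent != "/" and parent not in znodes:
        raise KeyError(f"parent {parent} does not exist")
    znodes[path] = data

create("/hbase", b"")
create("/hbase/master", b"host-01:16000")
print(znodes["/hbase/master"])  # b'host-01:16000'
```

Because every participant sees the same small, consistent tree, services like HBase can use it to agree on which host currently holds a given role.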
- Cloudera: a Hadoop vendor that integrates the most popular Apache Hadoop open-source software, like the components mentioned above, in one place. Other vendors include MapR and Hortonworks.
Also, this phase includes introductions to Splunk and YARN.
Prerequisites: Phase I mentioned above
- Oozie: Hadoop's workflow scheduler system. The order in which the tools of the Hadoop ecosystem are executed is determined by Oozie.
Duration: 6 Hours
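An Oozie workflow is essentially a directed acyclic graph of actions. The sketch below computes a valid execution order for a hypothetical ingest-then-report workflow using a topological sort (the action names are invented for illustration; Oozie itself defines workflows in XML):

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical workflow: each action lists the actions it depends on.
workflow = {
    "sqoop_import": [],
    "pig_cleanup": ["sqoop_import"],
    "hive_report": ["pig_cleanup"],
    "email_notify": ["hive_report"],
}
order = list(TopologicalSorter(workflow).static_order())
print(order)  # ['sqoop_import', 'pig_cleanup', 'hive_report', 'email_notify']
```

The scheduler's job is exactly this: run each action only after everything it depends on has completed.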
- Kafka: used for streaming real-time data into other systems. Kafka has the capacity to deal with all the real-time data flowing into a system, providing quick, scalable and fault-tolerant handling of real-time data feeds.
Duration: 15 Hours
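Kafka's core abstraction is an append-only log per topic partition, from which consumers read by offset at their own pace. A minimal in-memory sketch of that idea (the class and messages are made up; this is not the Kafka client API):

```python
class TopicLog:
    """Toy append-only log mimicking a single Kafka topic partition."""

    def __init__(self):
        self.messages = []

    def produce(self, message):
        # Producers append; the offset is the message's position in the log.
        self.messages.append(message)
        return len(self.messages) - 1  # offset of the new message

    def consume(self, offset):
        # Consumers read from any offset; the log itself is never modified,
        # so many independent consumers can replay the same feed.
        return self.messages[offset:]

log = TopicLog()
log.produce("click:home")
log.produce("click:cart")
log.produce("click:checkout")
print(log.consume(1))  # ['click:cart', 'click:checkout']
```

Decoupling producers from consumers through this durable log is what makes Kafka's real-time feeds fault-tolerant: a slow or restarted consumer simply resumes from its last offset.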
- Scala: a programming language that combines object-oriented and functional programming concepts.
Duration: 15 Hours
- Ambari: Ambari is the component in the Hadoop Ecosystem that deals with provisioning, managing, and monitoring Apache Hadoop clusters.
Duration: 1 Hour
- Splunk: a platform to search, analyse and visualize machine-generated data gathered from websites, applications, sensors and devices.
Duration: 30 Hours
- Storm: a distributed stream-processing computation framework that works on live streaming data. Although Spark Streaming also works on live data, the two differ in fault tolerance, debugging, monitoring and processing.
Duration: 30 Hours
- Elasticsearch: a search engine that is scalable and fast and can search large amounts of data to support your data-discovery applications.
Duration: 6 Hours