Thursday, July 9, 2020

Introduction to Hadoop 2.0 and how it overcomes the Limitations of Hadoop 1.0 - Edureka

This post is a continuation of our previous blog post announcing the arrival of the stable release of Hadoop 2.0 for production deployments. Since then, Apache has published two more releases of Hadoop 2. The most recent, Release 2.4.0, now supports automatic failover of the YARN ResourceManager. Because of many such enterprise-ready features, Hadoop is making news and attracting positive predictions. This post explains the new features in detail and clarifies many prevalent doubts about Hadoop 2.0.
If you are new to Hadoop, review our previous blog posts on HDFS and MapReduce and HDFS Architecture.

Following are the four main improvements in Hadoop 2.0 over Hadoop 1.x:

- HDFS Federation: horizontal scalability of the NameNode
- NameNode High Availability: the NameNode is no longer a single point of failure
- YARN: the ability to process terabytes and petabytes of data available in HDFS using non-MapReduce applications such as MPI and Giraph
- Resource Manager: splits the two major functions of the overburdened JobTracker (resource management and job scheduling/monitoring) into two separate daemons, a global ResourceManager and a per-application ApplicationMaster

There are additional features such as the Capacity Scheduler (enabling multi-tenancy support in Hadoop), data snapshots, support for Windows, and NFS access, all of which increase Hadoop adoption in the industry for solving Big Data problems.

HDFS Federation

Even though a Hadoop cluster can scale up to hundreds of DataNodes, the NameNode keeps all its metadata in memory (RAM). This limits the maximum number of files a Hadoop cluster can store (typically 50-100M files). As your data size and cluster size grow, this becomes a bottleneck, because the size of your cluster is limited by the NameNode's memory.

The Hadoop 2.0 feature HDFS Federation allows horizontal scaling of the Hadoop Distributed File System (HDFS). This is one of the many features sought after by enterprise-class Hadoop users such as Amazon and eBay.

HDFS Federation supports multiple NameNodes and namespaces. To scale the name service horizontally, federation uses multiple independent NameNodes and namespaces. The NameNodes are federated, that is, they are independent and do not require coordination with each other. The DataNodes are used as common storage for blocks by all the NameNodes. Each DataNode registers with all the NameNodes in the cluster.
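
The federation idea above can be sketched as a toy model in Python. This is only an illustration, not Hadoop code: the `NameNode` and `DataNode` classes, the `route` helper, and the `/user` and `/logs` namespace prefixes are all hypothetical, and the prefix routing is a simplified stand-in for a client-side mount table. It shows each NameNode keeping only its own namespace's metadata in memory while every DataNode registers with all of them.

```python
from dataclasses import dataclass, field


@dataclass
class NameNode:
    """In-memory metadata for one independent namespace (toy model)."""
    namespace: str                                # hypothetical path prefix, e.g. "/user"
    files: dict = field(default_factory=dict)     # path -> list of block ids
    datanodes: set = field(default_factory=set)   # DataNodes registered with this NameNode

    def register(self, datanode_name):
        self.datanodes.add(datanode_name)

    def add_file(self, path, block_ids):
        self.files[path] = block_ids


@dataclass
class DataNode:
    """Common block storage shared by all NameNodes in the federation."""
    name: str

    def register_with_all(self, namenodes):
        # In a federated cluster each DataNode registers with every NameNode.
        for nn in namenodes:
            nn.register(self.name)


def route(path, namenodes):
    """Pick the NameNode that owns the path (longest matching prefix wins)."""
    matches = [nn for nn in namenodes
               if path == nn.namespace or path.startswith(nn.namespace.rstrip("/") + "/")]
    if not matches:
        raise KeyError(f"no namespace owns {path}")
    return max(matches, key=lambda nn: len(nn.namespace))


namenodes = [NameNode("/user"), NameNode("/logs")]
datanodes = [DataNode(f"dn{i}") for i in range(3)]
for dn in datanodes:
    dn.register_with_all(namenodes)

for path in ["/user/alice/a.txt", "/user/bob/b.txt", "/logs/2014/app.log"]:
    route(path, namenodes).add_file(path, block_ids=[path + "#0"])

# Metadata is partitioned: no single NameNode holds every file in RAM.
files_per_namenode = {nn.namespace: len(nn.files) for nn in namenodes}
print(files_per_namenode)
```

Adding a third namespace in this model means adding a third `NameNode` object; the existing ones need no coordination with it, which is the property that lets the name service scale horizontally.
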
DataNodes send periodic heartbeats and block reports to the NameNodes and handle commands from them.

NameNode High Availability

In Hadoop 1.x, the NameNode was a single point of failure. A NameNode failure makes the Hadoop cluster inaccessible. Usually this is a rare occurrence, because NameNode servers run on business-critical hardware with RAS (reliability, availability, serviceability) features. In case of NameNode failure, Hadoop administrators need to manually recover the NameNode using the Secondary NameNode.

The Hadoop 2.0 architecture supports multiple NameNodes to remove this bottleneck. The NameNode High Availability feature in Hadoop 2.0 comes with support for a passive standby NameNode. These active-passive NameNodes are configured for automatic failover.

All namespace edits are logged to shared NFS storage, and there is only a single writer (with fencing configuration) to this shared storage at any point in time. The passive NameNode reads from this storage and keeps up-to-date metadata information for the cluster. In case of active NameNode failure, the passive NameNode becomes the active NameNode and starts writing to the shared storage. The fencing mechanism ensures that there is only one writer to the shared storage at any point in time.

With Hadoop Release 2.4.0, High Availability support for the ResourceManager is also available.

YARN: Yet Another Resource Negotiator

Large amounts of data from multiple stores are kept in HDFS, but in Hadoop 1.x you can only run MapReduce framework jobs (including Pig and Hive) to process and analyse that data. To process it with other framework applications, such as graph or streaming, you need to take the data out of HDFS, for example into Cassandra or HBase.

Hadoop 2.0 provides YARN APIs to write other frameworks to run on top of HDFS. This enables running non-MapReduce Big Data applications on Hadoop.
Spark, MPI, Giraph, and HAMA are a few of the applications written or ported to run within YARN.

Image Credit: http://hortonworks.com/hadoop/yarn/

YARN provides the daemons and APIs necessary to develop generic distributed applications of any kind, handles and schedules resource requests (such as memory and CPU) from such applications, and supervises their execution.

YARN Resource Manager

In Hadoop 1.x, the JobTracker is the master daemon for both job resource management and the scheduling/monitoring of jobs. In a large Hadoop cluster with thousands of map and reduce tasks running with TaskTrackers on DataNodes, this results in CPU and network bottlenecks.

The JobTracker takes care of the entire life cycle of a job, from scheduling to successful completion (scheduling and monitoring). It also has to maintain resource information on each of the nodes, such as the number of map and reduce slots available on DataNodes (resource management).

The Next Generation MapReduce framework (MRv2) is an application framework that runs within YARN. The new MRv2 framework divides the two major functions of the JobTracker, resource management and job scheduling/monitoring, into separate components.

The new ResourceManager manages the global assignment of compute resources to applications, and the per-application ApplicationMaster manages the application's scheduling and coordination.

YARN provides better resource management in Hadoop, resulting in improved cluster efficiency and application performance. This feature not only improves MapReduce data processing but also enables Hadoop usage in other data processing applications.

YARN's execution model is more generic than the earlier MapReduce implementation in Hadoop 1.0. YARN can run applications that do not follow the MapReduce model, unlike the original Apache Hadoop MapReduce (also called MRv1).

It is important to understand that YARN and MRv2 are two different concepts and should not be used interchangeably.
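
The difference can be made concrete with a small Python sketch. It is a toy model under stated assumptions, not the YARN API: the class names mirror the daemons described above, but `allocate`, `plan_tasks`, `run`, the task names, and the memory figures are all hypothetical. The `ResourceManager` only hands out memory and knows nothing about MapReduce, while each application brings its own `ApplicationMaster` that decides what to run.

```python
class ResourceManager:
    """Global, application-agnostic scheduler (toy model): it only grants memory."""

    def __init__(self, total_memory_mb):
        self.free_memory_mb = total_memory_mb
        self.allocations = []              # (application name, MB granted)

    def allocate(self, app_name, memory_mb):
        """Grant a container of memory_mb if the cluster still has room."""
        if memory_mb > self.free_memory_mb:
            return False
        self.free_memory_mb -= memory_mb
        self.allocations.append((app_name, memory_mb))
        return True


class ApplicationMaster:
    """Per-application coordinator: plans its own tasks and asks the RM for resources."""

    name = "generic"

    def plan_tasks(self):
        raise NotImplementedError

    def run(self, rm):
        started = []
        for task, memory_mb in self.plan_tasks():
            if rm.allocate(self.name, memory_mb):
                started.append(task)
        return started


class MapReduceAppMaster(ApplicationMaster):
    """MRv2 is just one kind of application that runs on the resource layer."""
    name = "mrv2-job"

    def plan_tasks(self):
        return [("map-0", 1024), ("map-1", 1024), ("reduce-0", 2048)]


class GraphAppMaster(ApplicationMaster):
    """A non-MapReduce application shares the same cluster in the same way."""
    name = "graph-job"

    def plan_tasks(self):
        return [("worker-0", 1536), ("worker-1", 1536)]


rm = ResourceManager(total_memory_mb=8192)
mr_started = MapReduceAppMaster().run(rm)
graph_started = GraphAppMaster().run(rm)
print(mr_started, graph_started, rm.free_memory_mb)
```

Nothing in `ResourceManager` mentions maps or reduces, so supporting a new framework in this model means writing a new `ApplicationMaster` subclass only.
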
YARN is the resource management framework that provides infrastructure and APIs to facilitate the request for, allocation of, and scheduling of cluster resources. As explained earlier, MRv2 is an application framework that runs within YARN.

Capacity Scheduler: Multi-tenancy Support

In Hadoop 1.0, all DataNodes are dedicated to map and reduce tasks and cannot be used for other processing. The cluster's capacity is measured in MapReduce slots. Each node in the cluster has a pre-defined set of slots, and the scheduler ensures that a percentage of those slots is available to a set of users and groups. So if you are not running MapReduce jobs, you are wasting DataNode resources.

With Capacity Scheduler support in Hadoop 2.0, DataNode resources can be used for other applications too. The Capacity Scheduler (CS) ensures that groups of users and applications get a guaranteed share of the cluster, while maximizing overall utilization of the cluster. Through elastic resource allocation, if the cluster has available resources, users and applications can take up more of the cluster than their guaranteed minimum share.

In Hadoop 2.0 with YARN and MapReduce v2, the cluster capacity is measured as the physical resource (RAM now, and CPU as well in the future) that is available across the entire cluster.

The ResourceManager supports hierarchical application queues, and those queues can be guaranteed a percentage of the cluster resources. It performs no monitoring or tracking of status for the application and works as a pure scheduler.

The ResourceManager performs its scheduling function based on the resource requirements of the applications. Each application has multiple resource requests, for resources such as memory, CPU, disk, and network.
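
The guaranteed-share and elastic-allocation behaviour described above can be approximated in a few lines of Python. This is a minimal sketch and not the actual Capacity Scheduler algorithm: the `capacity_allocate` function, the queue names, and the two-pass strategy are assumptions made for illustration, and the sketch ignores queue hierarchies and per-queue limits.

```python
def capacity_allocate(total_mb, guarantees_pct, demands_mb):
    """Return {queue: MB granted} for a flat set of queues (toy model).

    guarantees_pct: {queue: guaranteed percentage of the cluster}, summing to at most 100.
    demands_mb:     {queue: MB currently requested}.
    """
    # Pass 1: each queue receives its demand, capped at its guaranteed share.
    granted = {
        q: min(demands_mb.get(q, 0), total_mb * pct // 100)
        for q, pct in guarantees_pct.items()
    }
    spare = total_mb - sum(granted.values())

    # Pass 2 (elasticity): idle capacity goes to queues that still want more.
    for q in guarantees_pct:
        extra = min(demands_mb.get(q, 0) - granted[q], spare)
        granted[q] += extra
        spare -= extra
    return granted


guarantees = {"analytics": 60, "adhoc": 40}

# The analytics queue is mostly idle, so adhoc may exceed its 40% guarantee.
quiet = capacity_allocate(10000, guarantees, {"analytics": 2000, "adhoc": 7000})

# Both queues are busy, so each one is held to its guaranteed share.
busy = capacity_allocate(10000, guarantees, {"analytics": 9000, "adhoc": 7000})

print(quiet)
print(busy)
```

In the first call the `adhoc` queue is granted more than its guaranteed 4000 MB because the other queue is not using its share; in the second call both queues fall back to their guarantees.
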
Scheduling on actual resource requirements is a significant change from the model of fixed-type slots in Hadoop 1.x MapReduce, which leads to a significant negative impact on cluster utilization.

You can check out our posts on HDFS and MapReduce, HDFS Architecture, 5 Reasons to Learn Hadoop, and How essential is Hadoop Training.
