CHAPTER 1 : INTRODUCTION
- What is Cloud Computing
- What is Grid Computing
- What is Virtualization
- How the above three are interrelated
- What is Big Data
- Introduction to Analytics and the need for big data analytics
- Hadoop Solutions - Big Picture
- Hadoop distributions
- Comparing Hadoop Vs. Traditional systems
- Volunteer Computing
- Data Retrieval - Random Access Vs. Sequential Access
- NoSQL Databases
CHAPTER 2 : THE MOTIVATION FOR HADOOP
- Problems with traditional large-scale systems
- Data Storage literature survey
- Data Processing literature survey
- Network Constraints
- Requirements for a new approach
CHAPTER 3 : HADOOP BASIC CONCEPTS
- What is Hadoop?
- The Hadoop Distributed File System
- How MapReduce Works
- Anatomy of a Hadoop Cluster
CHAPTER 4 : HADOOP DAEMONS
- Master Daemons
- Name node
- Job Tracker
- Secondary name node
- Slave Daemons
- Data node
- Task tracker
CHAPTER 5 : HDFS (HADOOP DISTRIBUTED FILE SYSTEM)
- Blocks and Splits
- Input Splits
- HDFS Splits
- Data Replication
- Hadoop Rack Awareness
- Data high availability
- Data Integrity
- Cluster architecture and block placement
- Accessing HDFS
- JAVA Approach
- CLI Approach
CHAPTER 6 : PROGRAMMING PRACTICES & PERFORMANCE TUNING
- Developing MapReduce Programs in:
- Local Mode
- Running without HDFS and MapReduce
- Pseudo-distributed Mode
- Running all daemons in a single node
- Fully distributed mode
- Running daemons on dedicated nodes
CHAPTER 7 : HADOOP ADMINISTRATIVE TASKS - Set up Hadoop clusters with Apache, Cloudera and Hortonworks
- Install and configure Apache Hadoop
- Make a fully distributed Hadoop cluster on a single laptop/desktop (Pseudo Mode)
- Install and configure Cloudera Hadoop distribution in fully distributed mode
- Install and configure Hortonworks Hadoop distribution in fully distributed mode
- Monitoring the cluster
- Getting used to the management consoles of Cloudera and Hortonworks
- Name Node in Safe mode
- Meta Data Backup
- Integrating Kerberos security in Hadoop
- Ganglia and Nagios Cluster monitoring
- Benchmarking the Cluster
- Commissioning/Decommissioning Nodes.
CHAPTER 8 : HADOOP DEVELOPER TASKS - Writing a Map Reduce Program
- Examining a Sample Map Reduce Program
- With Several Examples
- Basic API Concepts
- The Driver Code
- The Mapper
- The Reducer
- Hadoop's Streaming API
CHAPTER 9 : Performing several Hadoop Jobs
- The configure and close Methods
- Sequence Files
- Record Reader
- Record Writer
- Role of Reporter
- Output Collector
- Processing video files and audio files
- Processing image files
- Processing XML files
- Processing Zip files
- Counters
- Directly Accessing HDFS
- Tool Runner
- Using The Distributed Cache.
CHAPTER 10 : Common Map Reduce Algorithms
- Sorting and Searching
- Indexing
- Classification/Machine Learning
- Term Frequency - Inverse Document Frequency
- Word Co-Occurrence
- Hands-On Exercise: Creating an Inverted Index
- Identify Mapper
- Identify Reducer
- Exploring well-known problems using Map Reduce applications.
CHAPTER 11 : Debugging Map Reduce Programs
- Testing with MR Unit
- Logging
- Other Debugging Strategies.
CHAPTER 12 : Advanced Map Reduce Programming
- A Recap of the Map Reduce Flow
- Custom Writables and Writable Comparables
- The Secondary Sort
- Creating Input Formats and Output Formats
- Pipelining Jobs With Oozie
- Map-Side Joins
- Reduce-Side Joins.
CHAPTER 13 : Monitoring and debugging on a Production Cluster
- Counters
- Skipping Bad Records
- Rerunning failed tasks with Isolation Runner
CHAPTER 14 : Tuning for Performance
- Reducing network traffic with combiner
- Reducing the amount of input data
- Using Compression
- Running with speculative execution
- Refactoring code and rewriting algorithms
- Parameters affecting Performance
- Other Performance Aspects
CHAPTER 15 : Hadoop Ecosystem - Hive
- Hive concepts
- Hive architecture
- Install and configure Hive on cluster
- Create a database and access it from the console
- Buckets, Partitions
- Joins in Hive
- Inner joins
- Outer joins
- Hive UDF
- Hive UDAF
- Hive UDTF
- Develop and run sample applications in Java to access Hive
- Load Data into Hive and process it using Hive
CHAPTER 16 : PIG
- Pig basics
- Install and configure PIG on a cluster
- PIG Vs MapReduce and SQL
- PIG Vs Hive
- Write sample Pig Latin scripts
- Modes of running PIG
- Running in Grunt shell
- Programming in Eclipse
- Running as Java program
- PIG UDFs
- PIG Macros
- Load data into Pig and process it using Pig
CHAPTER 17 : SQOOP
- Install and configure Sqoop on cluster
- Connecting to RDBMS
- Installing MySQL
- Import data from Oracle/MySQL to Hive
- Export data to Oracle/MySQL
- Internal mechanism of import/export
- Import millions of records into HDFS from RDBMS using Sqoop
CHAPTER 18 : HBASE
- HBase concepts
- HBase architecture
- Region server architecture
- File storage architecture
- HBase basics
- Column access
- Scans
- HBase Use Cases
- Install and configure HBase on cluster
- Create a database, develop and run sample applications
- Access data stored in HBase using clients like Java
- Map Reduce client to access the HBase data
- HBase and Hive Integration
- HBase admin tasks
- Defining Schema and basic operation
CHAPTER 19 : CASSANDRA
- Cassandra core concepts
- Install and configure Cassandra on cluster
- Create a database and tables, and access them from the console
- Developing applications to access data in Cassandra through Java
- Install and Configure OpsCenter to access Cassandra data using browser
CHAPTER 20 : OOZIE
- Oozie architecture
- XML file specifications
- Install and configure Oozie on cluster
- Specifying Workflow
- Action nodes
- Control nodes
- Oozie job coordinator
- Accessing Oozie jobs from the command line and the web console
- Create sample workflows in Oozie and run them on cluster
CHAPTER 21 : Zookeeper, Flume, Chukwa, Avro, Scribe, Thrift, HCatalog
- Flume and Chukwa Concepts
- Use cases of Thrift, Avro and Scribe
- Install and configure Flume on cluster
- Create a sample application to capture logs from Apache using Flume
CHAPTER 22 : ANALYTICS BASICS
- Analytics and big data analytics
- Commonly used analytics algorithms
- Analytics tools like R and Weka
- R language basics
- Mahout
CHAPTER 23 : CDH4 ENHANCEMENTS
- Name Node High Availability
- Name Node federation
- Fencing
- YARN