Apache Hadoop
Set up Apache Hadoop on Windows, Linux & macOS.
Download the prebuilt Apache-Hadoop image from the Docker Hub repository.
The image is about 14 GB, so please be patient!
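The image can be pulled ahead of time using the name from the docker run step below (docker run will also pull it automatically if it is not already present locally):
docker pull jporeilly/apache-hadoop:amd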

Once the download completes, deploy the Apache-Hadoop container:
docker run -it -p 9870:9870 -p 8095:8088 -p 9864:9864 --name AHW jporeilly/apache-hadoop:amd
The YARN ResourceManager port is mapped to host port 8095 (instead of the default 8088) to prevent a conflict with the Cedalo Management Center (MQTT).
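If host port 8095 is also taken on your machine, map any other free host port to container port 8088; the host port 18088 below is just an example:
docker run -it -p 9870:9870 -p 18088:8088 -p 9864:9864 --name AHW jporeilly/apache-hadoop:amd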
Once the container starts, a shell will open:
root@955b8d17f170:/#
Enter: init
init
This stops any running processes, formats the HDFS NameNode, and starts all processes.
Enter: jps
jps
jps lists the running JVM processes; you should see entries such as NameNode, DataNode, ResourceManager, and NodeManager.


Now you can access the Hadoop services at:
NameNode Web UI: http://localhost:9870
YARN ResourceManager: http://localhost:8095 (instead of 8088)
DataNode Web UI: http://localhost:9864
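To quickly confirm each UI is reachable from the host, request just the response headers; curl is one way to do it:
curl -sI http://localhost:9870 | head -n 1    # expect an HTTP 200 response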
This completes the installation of all tools required for the Big Data course.
Just type exit
to leave the container - this also stops the container.
To start the container enter the following:
docker start -ai AHW
Once the Docker shell opens, just type restart
to restart all processes.

NameNode
The NameNode is the master node and central component of Hadoop's Distributed File System (HDFS). It acts as the "brain" of the file system.
Log into the NameNode:
You can upload files to the root directory:
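As a minimal sketch from the container shell (the file name is illustrative), the same can be done with the HDFS CLI:
echo "hello hdfs" > /tmp/hello.txt    # create a sample local file
hdfs dfs -put /tmp/hello.txt /        # upload it to the HDFS root directory
hdfs dfs -ls /                        # confirm the file is listed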

YARN
YARN acts as the operating system for Hadoop clusters by separating resource management from job scheduling and monitoring, allowing multiple data processing engines like MapReduce, Spark, Hive, and others to run simultaneously on the same cluster.
Log into YARN:
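You can also inspect the cluster from the standard YARN CLI inside the container:
yarn node -list           # NodeManagers registered with the ResourceManager
yarn application -list    # applications currently submitted or running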

DataNode
A DataNode in Hadoop is a worker node in the Hadoop Distributed File System (HDFS) that stores the actual data blocks and serves read/write requests from clients. DataNodes communicate regularly with the NameNode through heartbeat messages to report their health status and the blocks they're storing.
They handle data replication by creating multiple copies of blocks across different nodes to ensure fault tolerance, and they perform block verification to detect corruption. DataNodes also participate in data pipeline operations during file writes and coordinate with other DataNodes to maintain data integrity and availability across the distributed cluster.
Log into the DataNode:
Useful for troubleshooting the node.
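Two standard HDFS commands help with that:
hdfs dfsadmin -report                    # capacity, usage and status of each DataNode
hdfs fsck / -files -blocks -locations    # block placement and replication health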

Hadoop Cluster - Docker Compose
Before you begin, ensure Docker & Docker Compose have been installed & configured:
docker-compose --version
Run the Docker containers using docker-compose:
cd ~/Hadoop
docker-compose up -d
[+] Running 28/5
✔ datanode Pulled 32.7s
✔ namenode Pulled 32.6s
✔ nodemanager1 Pulled 32.5s
✔ resourcemanager Pulled 32.3s
✔ historyserver Pulled 32.5s
[+] Running 9/9
✔ Network hadoop_default Creat... 0.5s
✔ Volume "hadoop_hadoop_datanode" Created 0.0s
✔ Volume "hadoop_hadoop_historyserver" Created 0.0s
✔ Volume "hadoop_hadoop_namenode" Created 0.0s
✔ Container datanode Started 3.8s
✔ Container namenode Started 3.9s
✔ Container nodemanager Starte... 3.9s
✔ Container historyserver Star... 3.8s
✔ Container resourcemanager St... 3.9s
...
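To confirm that all five containers are up, list them with Compose:
docker-compose ps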
Access the Cluster
You can log into any node by specifying its container:
docker exec -it datanode /bin/bash
Navigate to the mapped data volume:
cd hadoop/dfs/
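This directory holds the DataNode's raw block storage; a quick look at its contents (directory layout varies by image):
ls -R . | head -n 20    # block pool directories and blk_* data files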
Accessing the UI
The NameNode UI can be accessed at:
ResourceManager UI can be accessed at:
History Server UI can be accessed at:
Shutdown Cluster
To shut down the cluster:
docker-compose down
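To also remove the named volumes created above - this deletes all HDFS data - add the -v flag:
docker-compose down -v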
Time to check that we can run some Hadoop jobs.
We're going to run a job that counts the number of times each word appears in the Canterbury Tales.
Test - Word Count Algorithm
List all the files in our HDFS system:
hdfs dfs -ls /
Create a /user/root/ directory.
hdfs dfs -mkdir -p /user/root
Verify directory.
hdfs dfs -ls /user/
Found 1 items
drwxr-xr-x - root supergroup 0 2024-08-10 13:59 /user/root
Download the hadoop-mapreduce-examples-3.2.1-sources.jar file
We will use a .jar file containing the classes needed to execute the MapReduce algorithm.
Save hadoop-mapreduce-examples-3.2.1-sources.jar to: ~/Hadoop/assets
Download & save the text file - Canterbury Tales or Ulysses.
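As one option, both files can be fetched from Maven Central and Project Gutenberg; these URLs follow each site's standard layout but should be verified before use:
wget -P ~/Hadoop/assets https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-mapreduce-examples/3.2.1/hadoop-mapreduce-examples-3.2.1-sources.jar
wget -P ~/Hadoop/assets https://www.gutenberg.org/cache/epub/2383/pg2383.txt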
Copy the files into the namenode container.
cd ~/Hadoop/assets
docker cp hadoop-mapreduce-examples-3.2.1-sources.jar namenode:/tmp
docker cp pg2383.txt namenode:/tmp
Create the Input folder.
docker exec -it namenode bash
hdfs dfs -mkdir /user/root/input
Copy over /tmp/pg2383.txt to /user/root/input.
cd /tmp
hdfs dfs -put pg2383.txt /user/root/input
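Optionally verify the upload before running the job:
hdfs dfs -ls /user/root/input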
Run MapReduce
hadoop jar hadoop-mapreduce-examples-3.2.1-sources.jar org.apache.hadoop.examples.WordCount input output
2024-08-10 14:10:15,533 INFO client.RMProxy: Connecting to ResourceManager at resourcemanager/172.18.0.6:8032
2024-08-10 14:10:15,702 INFO client.AHSProxy: Connecting to Application History server at historyserver/172.18.0.3:10200
2024-08-10 14:10:15,879 INFO mapreduce.JobResourceUploader: Disabling Erasure Coding for path: /tmp/hadoop-yarn/staging/root/.staging/job_1723287966223_0001
2024-08-10 14:10:15,969 INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false
2024-08-10 14:10:16,068 INFO input.FileInputFormat: Total input files to process : 1
2024-08-10 14:10:16,088 INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false
2024-08-10 14:10:16,101 INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false
2024-08-10 14:10:16,107 INFO mapreduce.JobSubmitter: number of splits:1
2024-08-10 14:10:16,189 INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false
2024-08-10 14:10:16,200 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1723287966223_0001
2024-08-10 14:10:16,200 INFO mapreduce.JobSubmitter: Executing with tokens: []
2024-08-10 14:10:16,345 INFO conf.Configuration: resource-types.xml not found
2024-08-10 14:10:16,346 INFO resource.ResourceUtils: Unable to find 'resource-types.xml'.
2024-08-10 14:10:16,813 INFO impl.YarnClientImpl: Submitted application application_1723287966223_0001
2024-08-10 14:10:16,867 INFO mapreduce.Job: The url to track the job: http://resourcemanager:8088/proxy/application_1723287966223_0001/
2024-08-10 14:10:16,868 INFO mapreduce.Job: Running job: job_1723287966223_0001
2024-08-10 14:10:23,970 INFO mapreduce.Job: Job job_1723287966223_0001 running in uber mode : false
2024-08-10 14:10:23,971 INFO mapreduce.Job: map 0% reduce 0%
2024-08-10 14:10:30,048 INFO mapreduce.Job: map 100% reduce 0%
2024-08-10 14:10:34,065 INFO mapreduce.Job: map 100% reduce 100%
2024-08-10 14:10:35,074 INFO mapreduce.Job: Job job_1723287966223_0001 completed successfully
2024-08-10 14:10:35,163 INFO mapreduce.Job: Counters: 54
    File System Counters
        FILE: Number of bytes read=187024
        FILE: Number of bytes written=832593
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
        HDFS: Number of bytes read=1692663
        HDFS: Number of bytes written=438623
        HDFS: Number of read operations=8
        HDFS: Number of large read operations=0
        HDFS: Number of write operations=2
        HDFS: Number of bytes read erasure-coded=0
    Job Counters
        Launched map tasks=1
        Launched reduce tasks=1
        Rack-local map tasks=1
        Total time spent by all maps in occupied slots (ms)=10968
        Total time spent by all reduces in occupied slots (ms)=16448
        Total time spent by all map tasks (ms)=2742
        Total time spent by all reduce tasks (ms)=2056
        Total vcore-milliseconds taken by all map tasks=2742
        Total vcore-milliseconds taken by all reduce tasks=2056
        Total megabyte-milliseconds taken by all map tasks=11231232
        Total megabyte-milliseconds taken by all reduce tasks=16842752
    Map-Reduce Framework
        Map input records=36758
        Map output records=282822
        Map output bytes=2691784
        Map output materialized bytes=187016
        Input split bytes=112
        Combine input records=282822
        Combine output records=41330
        Reduce input groups=41330
        Reduce shuffle bytes=187016
        Reduce input records=41330
        Reduce output records=41330
        Spilled Records=82660
        Shuffled Maps =1
        Failed Shuffles=0
        Merged Map outputs=1
        GC time elapsed (ms)=237
        CPU time spent (ms)=4180
        Physical memory (bytes) snapshot=862277632
        Virtual memory (bytes) snapshot=13577064448
        Total committed heap usage (bytes)=1277165568
        Peak Map Physical memory (bytes)=608587776
        Peak Map Virtual memory (bytes)=5115801600
        Peak Reduce Physical memory (bytes)=253689856
        Peak Reduce Virtual memory (bytes)=8461262848
    Shuffle Errors
        BAD_ID=0
        CONNECTION=0
        IO_ERROR=0
        WRONG_LENGTH=0
        WRONG_MAP=0
        WRONG_REDUCE=0
    File Input Format Counters
        Bytes Read=1692551
    File Output Format Counters
        Bytes Written=438623
View the output.
hdfs dfs -cat /user/root/output/*
...
reserved, 1
reserved. 1
reserves, 1
resided 1
residence, 1
resign 1
resign, 1
resign. 1
resignation, 1
resist 1
resisted; 1
resistence, 1
resolution 4
resolved 8
resolves 1
resolving 2
resort, 3
resort; 1
resounded 1
resources 1
respect 2
respect. 1
respective 3
...
Note: the stock WordCount example tokenizes on whitespace only, so punctuation stays attached to words - that is why reserved, and reserved. are counted separately.
Check the results by listing the output folder:
hdfs dfs -ls /user/root/output
Write the word counts to a text file:
hdfs dfs -cat /user/root/output/part-r-00000 > /tmp/pg2383_wc.txt
exit
docker cp namenode:/tmp/pg2383_wc.txt .