HDFS Introduction: Architecture, Components, and Best Practices
Overview
Hadoop Distributed File System (HDFS) is the primary storage layer used by Hadoop workloads. It is designed for high-throughput data access on commodity hardware and is most effective for write-once, read-many patterns.
HDFS is usually not the right fit for ultra-low-latency random reads.
Architecture
HDFS follows a master/worker model with three core roles:
- NameNode: manages the filesystem namespace and metadata.
- DataNode: stores data blocks and serves read/write requests.
- Client: requests metadata from the NameNode, then streams block data directly to and from DataNodes.
NameNode
The NameNode handles:
- File system tree metadata.
- Block mapping and placement.
- Access control and policy checks.
For production, run a high-availability setup to avoid single-point-of-failure risk.
DataNode
DataNodes handle:
- Block storage.
- Heartbeats to NameNode.
- Block checksum validation.
Client
Client responsibilities:
- File operations.
- Metadata lookup via NameNode.
- Direct block transfer with DataNodes.
Data Storage Model
Block Size
# Check block size
hdfs getconf -confKey dfs.blocksize
# Common defaults
# Hadoop 1.x: 64MB
# Hadoop 2.x: 128MB
# Hadoop 3.x: 128MB (can be larger)
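The effect of block size on file layout is simple arithmetic. A small sketch (the file sizes are illustrative, not measured from a cluster):

```python
import math

MiB = 1024 ** 2
GiB = 1024 ** 3

def block_count(file_size_bytes: int, block_size_bytes: int) -> int:
    """Number of HDFS blocks a non-empty file occupies.
    The last block may be partially filled."""
    return math.ceil(file_size_bytes / block_size_bytes)

# A 1 GiB file at the common 128 MiB default spans 8 blocks;
# halving the block size doubles the block (and metadata) count.
print(block_count(1 * GiB, 128 * MiB))  # 8
print(block_count(1 * GiB, 64 * MiB))   # 16
```

This is why larger block sizes suit large sequential files: fewer blocks means fewer NameNode metadata entries and fewer seeks per file.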
Replication
# Check replication factor
hdfs getconf -confKey dfs.replication
# Common default: 3
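Replication multiplies raw storage consumption. A quick sketch of the logical-to-raw conversion, using the default factor of 3 noted above:

```python
TiB = 1024 ** 4

def raw_storage_bytes(logical_bytes: int, replication: int = 3) -> int:
    """Raw cluster capacity a file consumes, ignoring small
    per-block overheads (checksums, metadata)."""
    return logical_bytes * replication

# 10 TiB of logical data at replication 3 occupies 30 TiB of raw disk
print(raw_storage_bytes(10 * TiB) // TiB)  # 30
```

Capacity planning should always use the raw figure, not the logical one.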
Write path summary:
- The client streams data to the first DataNode in the pipeline.
- Each DataNode forwards the data to the next replica in the pipeline.
- Acknowledgments flow back up the pipeline; the write succeeds once all replicas confirm.
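The pipeline above can be modeled as a toy simulation, with plain dicts standing in for DataNodes. This is purely illustrative of the forward-then-ack shape, not the real replication protocol:

```python
def pipeline_write(block: bytes, datanodes: list) -> bool:
    """Forward a block down the replica pipeline, then 'ack' only
    if every replica stored it. Each 'DataNode' is just a dict."""
    for node in datanodes:  # client -> DN1 -> DN2 -> DN3
        node.setdefault("blocks", []).append(block)
    # Acks flow back up the pipeline: success means all replicas wrote
    return all(block in node["blocks"] for node in datanodes)

replicas = [{"name": f"dn{i}"} for i in range(3)]
print(pipeline_write(b"block_0001", replicas))  # True
```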
Useful HDFS Commands
Basic Operations
# Create directory
hdfs dfs -mkdir /user/hadoop
# Upload file
hdfs dfs -put localfile.txt /user/hadoop/
# Download file
hdfs dfs -get /user/hadoop/remote.txt local.txt
# List files
hdfs dfs -ls /user/hadoop/
# Remove file
hdfs dfs -rm /user/hadoop/oldfile.txt
Administrative Operations
# Set replication recursively
hdfs dfs -setrep -R 3 /user/hadoop/
# Disk usage
hdfs dfs -du -h /user/hadoop/
# Cluster report
hdfs dfsadmin -report
# Safe mode
hdfs dfsadmin -safemode enter
hdfs dfsadmin -safemode leave
Configuration
core-site.xml
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://namenode:8020</value>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>/var/hadoop/tmp</value>
</property>
</configuration>
hdfs-site.xml
<configuration>
<property>
<name>dfs.namenode.name.dir</name>
<value>/var/hadoop/hdfs/namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>/var/hadoop/hdfs/datanode</value>
</property>
<property>
<name>dfs.replication</name>
<value>3</value>
</property>
<property>
<name>dfs.blocksize</name>
<value>134217728</value>
</property>
</configuration>
Performance Optimization
Block Size Strategy
- Large files: prefer larger block sizes.
- Small files: avoid large volumes of tiny files.
- Align block and processing split strategy for batch jobs.
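The small-files problem is ultimately a NameNode heap problem: every file, directory, and block is an object held in NameNode memory. A commonly cited rule of thumb is roughly 150 bytes per object; treat that figure as an estimate, not a guarantee:

```python
def namenode_heap_estimate(num_files: int, blocks_per_file: int = 1,
                           bytes_per_object: int = 150) -> int:
    """Rough NameNode heap needed for file + block objects,
    using the ~150 bytes/object rule of thumb."""
    objects = num_files * (1 + blocks_per_file)
    return objects * bytes_per_object

# 10 million one-block files: ~3 GB of heap just for metadata
print(namenode_heap_estimate(10_000_000))  # 3000000000
```

Ten million tiny files cost as much NameNode memory as ten million huge ones, which is why consolidating small files (e.g. into sequence files or larger partitions) matters.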
# Example custom block size on upload
hdfs dfs -D dfs.blocksize=268435456 -put largefile.dat /user/hadoop/
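The raw byte values used for dfs.blocksize above decode as follows (a quick sanity check):

```python
MiB = 1024 ** 2

# dfs.blocksize values are plain byte counts
print(134217728 // MiB)  # 128  (the hdfs-site.xml example)
print(268435456 // MiB)  # 256  (the -D override above)
```

Recent Hadoop versions also accept suffixed values such as "128m" for dfs.blocksize, which avoids this arithmetic entirely.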
Memory Tuning Example
The properties below are MapReduce container settings (configured in mapred-site.xml); they tune jobs that read from HDFS rather than HDFS itself:
<property>
<name>mapreduce.map.memory.mb</name>
<value>1536</value>
</property>
<property>
<name>mapreduce.reduce.memory.mb</name>
<value>3072</value>
</property>
Security Basics
Authentication
Kerberos authentication is enabled cluster-wide by setting hadoop.security.authentication=kerberos in core-site.xml; it is not something you toggle per command. With Kerberos enabled, obtain a ticket before running HDFS commands:
# Obtain a Kerberos ticket (example principal), then run commands as usual
kinit hadoop@EXAMPLE.COM
hdfs dfs -ls /
Permissions and Ownership
# Set permissions
hdfs dfs -chmod 750 /user/hadoop/protected
# Set ownership
hdfs dfs -chown hadoop:hadoop /user/hadoop/data
# Verify permissions (ls always shows the permission bits)
hdfs dfs -ls /user/hadoop/
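The 750 passed to chmod above is an octal triple (owner/group/other). A small decoder for reading such modes (an illustrative helper, not part of the HDFS CLI):

```python
def decode_mode(mode: str) -> str:
    """Render an octal permission string like '750' as rwx flags."""
    out = []
    for digit in mode:
        d = int(digit, 8)
        out.append("".join(flag if d & bit else "-"
                           for flag, bit in zip("rwx", (4, 2, 1))))
    return "".join(out)

print(decode_mode("750"))  # rwxr-x---
```

So 750 grants the owner full access, the group read and execute, and everyone else nothing, which is a sensible default for a protected dataset directory.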
Monitoring and Maintenance
Health Checks
# Cluster health
hdfs dfsadmin -report
# Live DataNodes
hdfs dfsadmin -report -live
# File/block checks
hdfs fsck /user/hadoop -locations -files -blocks
Maintenance Tasks
# Roll NameNode edits
hdfs dfsadmin -rollEdits
# Refresh DataNodes
hdfs dfsadmin -refreshNodes
# Decommission workflow: add the host to the excludes file that
# dfs.hosts.exclude points at in hdfs-site.xml, then refresh
echo "datanode3.example.com" >> /etc/hadoop/conf/dfs.exclude
hdfs dfsadmin -refreshNodes
Common Issues
NameNode Not Starting
tail -f /var/hadoop/hdfs/namenode.log
jps | grep NameNode
hdfs dfsadmin -safemode get
DataNode Connection Problems
hdfs dfsadmin -report -dead
tail -f /var/hadoop/hdfs/datanode.log
# Last resort for a clusterID mismatch (e.g. after a NameNode reformat):
# wipe the DataNode storage directory and restart the daemon.
# WARNING: this destroys all block replicas stored on this node.
rm -rf /var/hadoop/hdfs/datanode/*
hdfs --daemon start datanode
Disk Space Pressure
hdfs dfsadmin -report
hdfs dfs -du / | sort -nr | head -10
hdfs dfs -expunge
Best Practices
- Keep large analytical files in HDFS.
- Avoid too many small files.
- Use compression where possible.
- Organize datasets with clear partitioning strategy.
- Monitor cluster health continuously.
- Plan DataNode scaling in advance.
- Back up critical datasets and document restore procedures.