Saturday, April 13, 2013

Cloud Networking Practice, 2013/4/14

Hadoop Distributed File System (HDFS™): A distributed file system that provides high-throughput access to application data.

What is Hadoop? It began as an Apache.org subproject under Lucene, developed by Doug Cutting. As Google's success drew attention to search-engine technology, Cutting studied Google's published papers on MapReduce and GFS (the Google File System), the two techniques Google uses to run computation jointly across many machines, i.e. cloud computing. The Hadoop project then produced open-source implementations of both MapReduce and GFS, extending the fast computation once reserved for large machines to the joint computation of many machines: by combining a distributed architecture with increasingly capable commodity PCs, it achieves capabilities similar to those of a mainframe.
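
The MapReduce model described above can be illustrated with a short word-count sketch in plain Python (the function names map_phase and reduce_phase are illustrative; this shows the idea only, not Hadoop's actual API):

```python
from collections import defaultdict

# Minimal MapReduce sketch: map emits (key, value) pairs, the framework
# groups values by key (the "shuffle"), and reduce folds each group.
def map_phase(documents):
    for doc in documents:
        for word in doc.split():
            yield (word, 1)

def reduce_phase(pairs):
    groups = defaultdict(list)          # shuffle/sort: group values by key
    for key, value in pairs:
        groups[key].append(value)
    return {key: sum(values) for key, values in groups.items()}

counts = reduce_phase(map_phase(["hadoop maps hadoop", "reduces hadoop"]))
print(counts)  # {'hadoop': 3, 'maps': 1, 'reduces': 1}
```

Because each map call and each per-key reduce is independent, the framework can spread them across many machines, which is exactly the property Hadoop exploits.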

Building the Hadoop virtual machine template
Base software to install:
1. JDK (Oracle JDK or OpenJDK)
2. Hadoop


Building the Hadoop host: HDP119
The HDP119 directory under Lab301 holds the files required for direct kernel boot.
$ cd ~/iLab/Lab301/HDP119

$ sudo virsh define HDP119.xml
Domain HDP119 defined from HDP119.xml

$ sudo virsh start HDP119
Domain HDP119 started
 
[Important] For the HDP119 virtual machine to boot successfully, the three directories /home/student/iLab,
/home/student/iLab/Lab301, and /home/student/iLab/Lab301/HDP119 must all have their permissions set to 755.
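
A quick way to see why 755 matters: direct kernel boot requires the qemu process to traverse every ancestor directory, which needs the others-execute bit at each level. The helper below (an illustrative sketch; non_traversable is not part of the lab scripts) lists any level that would block traversal:

```python
import os
import stat

def non_traversable(path):
    """Return ancestors of `path` (up to the filesystem root) that lack
    the others-execute bit, i.e. directories an unprivileged process
    such as qemu cannot search into."""
    bad = []
    d = os.path.abspath(path)
    while True:
        try:
            mode = os.stat(d).st_mode
            if not mode & stat.S_IXOTH:
                bad.append(d)
        except FileNotFoundError:
            bad.append(d)                 # missing levels also block boot
        parent = os.path.dirname(d)
        if parent == d:                   # reached "/"
            break
        d = parent
    return bad

# Any directory listed here would need: chmod 755 <dir>
print(non_traversable(os.getcwd()))
```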
 
Configuring the JDK
1. Log in to the system
$ sudo virsh console HDP119
Connected to domain HDP119
Escape character is ^]

Core Linux
box login: tc
Password: student 
2. Inspect the mounted disk contents
$ ls -al /mnt/sda1/
total 40
drwxr-xr-x  6 root root   4096 Nov 27 14:35 .
drwxr-xr-x  4 root root     80 Apr 11 16:02 ..
drwxr-xr-x 15 tc   staff  4096 Nov  3 16:05 hadoop-1.0.4
drwxr-xr-x  8 root root   4096 Nov  2 11:14 jdk1.6.0_37
drwx------  2 root root  16384 Nov  2 02:24 lost+found
-rw-r--r--  1 root root   7746 Nov 27 14:35 mydata.tgz
drwxrwxr-x  4 tc   staff  4096 Nov  2 02:26 tce

3. Edit .ashrc
$ nano .ashrc
                             :
alias ping='ping -c 4'

export JAVA_HOME=/mnt/sda1/jdk1.6.0_37
# As of Hadoop 1.0, the HADOOP_HOME environment variable no longer needs to be set
# export HADOOP_HOME=/mnt/sda1/hadoop-1.0.4
export PATH="$PATH:$JAVA_HOME/bin"

4. Reboot
$ filetool.sh -b
$ sudo reboot
 
5. Log in again
Core Linux
box login: tc
Password: student

6. Test the JDK
Is it the 64-bit build?
# java -d64 -version
Running a 64-bit JVM is not supported on this platform.

Is it the 32-bit build?
# java -d32 -version
java version "1.6.0_37"
Java(TM) SE Runtime Environment (build 1.6.0_37-b03)
Java HotSpot(TM) Client VM (build 20.8-b03, mixed mode, sharing)
 
Installing the Hadoop core package: HDP119
1. Log in to the system
$ sudo virsh console HDP119
Core Linux
box login: tc
Password: student

2. Edit .ashrc
$ nano .ashrc
                             :
alias ping='ping -c 4'

export JAVA_HOME=/mnt/sda1/jdk1.6.0_37
# As of Hadoop 1.0, the HADOOP_HOME environment variable no longer needs to be set
# export HADOOP_HOME=/mnt/sda1/hadoop-1.0.4
export PATH="$PATH:$JAVA_HOME/bin:/mnt/sda1/hadoop-1.0.4/bin"
[Note] Hadoop package download URL:
$ wget ftp://ftp.twaren.net/Unix/Web/apache/hadoop/core/hadoop-1.0.4/hadoop-1.0.4-bin.tar.gz

3. Back up settings and log out
$ filetool.sh -b
$ exit 

4. Log in again
Core Linux
box login: tc
Password: student

5. Check the Hadoop version
$ hadoop version
Hadoop 1.0.4
Subversion https://svn.apache.org/repos/asf/hadoop/common/branches/branch-1.0 -r 1335192
Compiled by hortonfo on Tue May  8 20:31:25 UTC 2012
From source with checksum e6b0c1e23dcf76907c5fecb4b832f3be

Standalone Operation
The following steps are performed on the HDP119 virtual machine.
1. Prepare the input data
$ cd /mnt/sda1/hadoop-1.0.4/
$ mkdir input
$ cp conf/*.xml input

2. Run the MapReduce example
$ hadoop jar hadoop-examples-1.0.4.jar grep input output 'dfs[a-z.]+' 

[Important] For the command above to succeed, the glibc_apps.tcz package must be installed first, because its getconf command is used. If the output directory already exists, delete it first:
$ tce-load -wi glibc_apps.tcz
$ rm -rf output/

3. View the results
$ cat output/part-00000 
1 dfsadmin
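
The job's output can be sanity-checked against a plain-Python version of the same counting step (a sketch of what the Hadoop grep example computes over its input files, not the example's actual code):

```python
import re

# The Hadoop grep example counts every match of the given regular
# expression across the input files; here we reproduce just that step.
def grep_count(texts, pattern=r"dfs[a-z.]+"):
    counts = {}
    for text in texts:
        for match in re.findall(pattern, text):
            counts[match] = counts.get(match, 0) + 1
    return counts

sample = "<name>dfs.replication</name> run dfsadmin -report"
print(grep_count([sample]))  # {'dfs.replication': 1, 'dfsadmin': 1}
```

The conf/*.xml inputs copied in step 1 happen to contain exactly one such match, which is why the part-00000 output shows a single line, "1 dfsadmin".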
----------------------------------------------------------------------
Deploying the Hadoop lab system

1. Batch-create the Hadoop virtual machines and network
$ cd ~/iLab
$ sudo ./labcmd.sh create -f Lab301
 
2. Start the Lab301 lab system
$ sudo ./labcmd.sh start Lab301

HDP120 virtual machine - setting up passwordless ssh login
To make the ssh localhost command log in without prompting for a password, perform the following steps:

1. Log in to HDP120
$ sudo virsh console HDP120
Connected to domain HDP120
Escape character is ^]

root@HDP120:~# 

2. Edit /etc/hosts
# Return to the home directory
$ cd                   

# Edit the /etc/hosts configuration file (hostname resolution, same idea as DNS)
$ sudo nano /etc/hosts   
127.0.0.1 localhost
192.168.100.20 HDP120
192.168.100.21 HDP121

# The following lines are desirable for IPv6 capable hosts
# (added automatically by netbase upgrade)

::1     ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
ff02::3 ip6-allhosts
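
The comment above likens /etc/hosts to DNS; the sketch below (illustrative Python, with a made-up parse_hosts helper) shows how such a file maps names like HDP120 to addresses:

```python
def parse_hosts(text):
    """Map each hostname/alias in hosts-file text to its address."""
    table = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()   # drop comments and blanks
        if not line:
            continue
        addr, *names = line.split()
        for name in names:
            table[name] = addr
    return table

hosts = """127.0.0.1 localhost
192.168.100.20 HDP120
192.168.100.21 HDP121"""
print(parse_hosts(hosts)["HDP120"])  # 192.168.100.20
```

This is why the later Hadoop config files can refer to HDP120 by name: the resolver consults /etc/hosts before any DNS server.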

3. Generate the passwordless-login key pair (on HDP120)
# cd
# ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
# cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys

4. First passwordless ssh login
# ssh HDP120
The authenticity of host 'hdp120 (192.168.100.20)' can't be established.
RSA key fingerprint is be:4d:f0:a2:a4:c7:aa:4f:ff:f3:29:39:b7:b8:c7:4f.
Are you sure you want to continue connecting (yes/no)? yes                         # answer yes to accept the SSH server's host key
Warning: Permanently added 'hdp120,192.168.100.20' (RSA) to the list of known hosts.
Linux HDP120 2.6.32-33-generic-pae #72-Ubuntu SMP Fri Jul 29 22:06:29 UTC 2011 i686 GNU/Linux
Ubuntu 10.04.4 LTS

Welcome to Ubuntu!
 * Documentation:  https://help.ubuntu.com/
Last login: Sat Jun 30 23:12:49 2012
root@HDP120:~# 

5. Close the connection
root@HDP120:~#  exit
logout
Connection to HDP120 closed.

HDP120 virtual machine - configuring Hadoop in single-node (pseudo-distributed) mode
1. Create the data storage directory
$ cd /mnt/hda1/hadoop-1.0.3/
$ mkdir data

2. Configure conf/core-site.xml
$ nano conf/core-site.xml 
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
     <property>
         <name>fs.default.name</name>
         <value>hdfs://HDP120:9000</value>
     </property>
     <property>
         <name>hadoop.tmp.dir</name>
         <value>/mnt/hda1/hadoop-1.0.3/data</value>
     </property>
</configuration>
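
Every Hadoop site file uses this same name/value property layout, so it can be read back with any XML parser. A quick standard-library check (illustrative only, not part of the lab):

```python
import xml.etree.ElementTree as ET

core_site = """<configuration>
  <property><name>fs.default.name</name><value>hdfs://HDP120:9000</value></property>
  <property><name>hadoop.tmp.dir</name><value>/mnt/hda1/hadoop-1.0.3/data</value></property>
</configuration>"""

# Collect every <property> element into a plain dict of name -> value.
props = {p.findtext("name"): p.findtext("value")
         for p in ET.fromstring(core_site).findall("property")}
print(props["fs.default.name"])  # hdfs://HDP120:9000
```

fs.default.name is the URI clients use to reach the NameNode, and hadoop.tmp.dir is the base directory under which HDFS stores its data, which is why step 1 created the data directory first.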

3. Configure conf/hdfs-site.xml
$ nano conf/hdfs-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
     <property>
         <name>dfs.replication</name>
         <value>1</value>
     </property>
</configuration>

[Note] With dfs.safemode.threshold.pct set to 0, the NameNode does not enter safe mode (read-only) at startup:

<property>
     <name>dfs.safemode.threshold.pct</name>
     <value>0</value>
</property>
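
The threshold is the fraction of blocks DataNodes must report back before the NameNode leaves safe mode (Hadoop 1.x defaults to 0.999). A sketch of the rule, with a hypothetical function name:

```python
def exits_safe_mode(reported_blocks, total_blocks, threshold_pct=0.999):
    # With threshold 0 the check passes immediately, so the NameNode
    # never waits in read-only safe mode at startup.
    if threshold_pct <= 0 or total_blocks == 0:
        return True
    return reported_blocks / total_blocks >= threshold_pct

print(exits_safe_mode(998, 1000))                    # False: 0.998 < 0.999
print(exits_safe_mode(998, 1000, threshold_pct=0))   # True
```

Setting the threshold to 0 is convenient in a lab, but on a real cluster it means the filesystem becomes writable before all replicas are accounted for.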

4. Configure conf/mapred-site.xml
$ nano conf/mapred-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->
<configuration>
     <property>
         <name>mapred.job.tracker</name>
         <value>HDP120:9001</value>
     </property>
</configuration>

5. The masters/slaves configuration files
masters specifies where the SecondaryNameNode runs; if it is not set, the SecondaryNameNode service will not be started.
$ nano conf/masters
HDP120

slaves specifies the DataNode hosts
$ nano conf/slaves 
HDP120



Setting Hadoop environment variables (conf/hadoop-env.sh)
This is necessary because the HDFS daemons run as root, and root's environment defines neither JAVA_HOME nor HADOOP_HEAPSIZE. Set both variables in conf/hadoop-env.sh:
$ nano conf/hadoop-env.sh 
# Set Hadoop-specific environment variables here.
# The only required environment variable is JAVA_HOME.  All others are
# optional.  When running a distributed configuration it is best to
# set JAVA_HOME in this file, so that it is correctly defined on
# remote nodes.

# The java implementation to use.  Required.
export JAVA_HOME=/mnt/hda1/jdk1.6.0_33

# Extra Java CLASSPATH elements.  Optional.
# export HADOOP_CLASSPATH=

# The maximum amount of heap to use, in MB. Default is 1000.
export HADOOP_HEAPSIZE=384 

# Extra Java runtime options.  Empty by default.
# export HADOOP_OPTS=-server
                       :

Starting Hadoop
# start-all.sh 
starting namenode, logging to /mnt/hda1/hadoop-1.0.3/libexec/../logs/hadoop-root-namenode-HDP120.out
HDP120: starting datanode, logging to /mnt/hda1/hadoop-1.0.3/libexec/../logs/hadoop-root-datanode-HDP120.out
HDP120: starting secondarynamenode, logging to /mnt/hda1/hadoop-1.0.3/libexec/../logs/hadoop-root-secondarynamenode-HDP120.out
starting jobtracker, logging to /mnt/hda1/hadoop-1.0.3/libexec/../logs/hadoop-root-jobtracker-HDP120.out
HDP120: starting tasktracker, logging to /mnt/hda1/hadoop-1.0.3/libexec/../logs/hadoop-root-tasktracker-HDP120.out
 
* Whichever machine runs start-all.sh acts as the NameNode.
 
Check which Hadoop daemons started (jps, the JVM Process Status tool)
$ jps
2232 NameNode
2667 TaskTracker
2347 DataNode
2468 SecondaryNameNode
2774 Jps
2546 JobTracker
 
Stopping Hadoop
$  stop-all.sh
stopping jobtracker
HDP120: stopping tasktracker
stopping namenode
HDP120: stopping datanode
HDP120: stopping secondarynamenode
 
------------------------------------------------------------------------------
HDFS command operations

1. Create a directory
$ cd /mnt/hda1/hadoop-1.0.3
$ start-dfs.sh

$ hadoop dfs -ls  /    
Found 1 items
drwxr-xr-x   - root supergroup          0 2013-04-13 14:02 /mnt

$ hadoop dfs -mkdir  /user
$ hadoop dfs -ls  /   
Found 2 items
drwxr-xr-x   - root supergroup          0 2012-06-09 22:36 /mnt
drwxr-xr-x   - root supergroup          0 2012-06-09 22:40 /user
  
2. Upload files
$ hadoop dfs -copyFromLocal *.txt /user

$ hadoop dfs -ls /user
Found 4 items
-rw-r--r--   1 student supergroup          0 2011-06-12 00:32 /user/CHANGES.txt
-rw-r--r--   1 student supergroup          0 2011-06-12 00:32 /user/LICENSE.txt
-rw-r--r--   1 student supergroup          0 2011-06-12 00:32 /user/NOTICE.txt
-rw-r--r--   1 student supergroup          0 2011-06-12 00:32 /user/README.txt

3. Display file contents
# hadoop dfs -cat /user/NOTICE.txt
This product includes software developed by The Apache Software
Foundation (http://www.apache.org/).

4. Inspect file storage information
# hadoop fsck /user/NOTICE.txt -files -blocks -locations
FSCK started by root from /192.168.100.20 for path /user/NOTICE.txt at Thu Nov 03 17:26:50 UTC 2011
/user/NOTICE.txt 101 bytes, 1 block(s):  OK
0. blk_-6938191868505178697_1006 len=101 repl=1 [192.168.100.20:50010]

Status: HEALTHY
 Total size: 101 B
 Total dirs: 0
 Total files: 1
 Total blocks (validated): 1 (avg. block size 101 B)
 Minimally replicated blocks: 1 (100.0 %)
 Over-replicated blocks: 0 (0.0 %)
 Under-replicated blocks: 0 (0.0 %)
 Mis-replicated blocks:  0 (0.0 %)
 Default replication factor: 1
 Average block replication: 1.0
 Corrupt blocks:  0
 Missing replicas:  0 (0.0 %)
 Number of data-nodes:  1
 Number of racks:  1
FSCK ended at Thu Nov 03 17:26:50 UTC 2011 in 5 milliseconds

The filesystem under path '/user/NOTICE.txt' is HEALTHY
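
fsck reports one block because the 101-byte file fits within a single block; in general the block count is the file size divided by the block size (64 MB by default in Hadoop 1.x), rounded up. A worked sketch with an illustrative helper:

```python
import math

def hdfs_block_count(size_bytes, block_size=64 * 1024 * 1024):
    # Ceil-divide: the last block may be only partially filled but still
    # counts as a block; an empty file occupies no blocks at all.
    return math.ceil(size_bytes / block_size)

print(hdfs_block_count(101))                # 1
print(hdfs_block_count(200 * 1024 * 1024))  # 4 (3 full blocks + 1 partial)
```

The "repl=1 [192.168.100.20:50010]" line in the fsck output then says that this one block has a single replica, stored on the lone DataNode, which matches dfs.replication=1 set earlier.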

5. Retrieve files
The get command is the inverse of put: it copies a file or directory (recursively) from HDFS to a target of your choosing on the local file system. A synonymous operation is -copyToLocal.
# hadoop dfs -get /user/README.txt a.txt
# ll a.txt
-rw-r--r-- 1 root root 1366 2011-11-05 21:40 a.txt
 
6. Delete files
$ hadoop dfs -rm /user/*.txt
Deleted hdfs://HDP120:9000/user/CHANGES.txt
Deleted hdfs://HDP120:9000/user/LICENSE.txt
Deleted hdfs://HDP120:9000/user/NOTICE.txt
Deleted hdfs://HDP120:9000/user/README.txt
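
The *.txt wildcard is matched by the hadoop shell against HDFS paths (the local shell finds no matching local files and passes the pattern through unchanged). Python's fnmatch shows the same wildcard behavior; the file list below is illustrative:

```python
from fnmatch import fnmatch

# The HDFS client expands the wildcard against paths stored in HDFS,
# much like fnmatch-style glob matching on strings.
hdfs_paths = ["/user/CHANGES.txt", "/user/LICENSE.txt",
              "/user/NOTICE.txt", "/user/README.txt", "/user/data.bin"]
matched = [p for p in hdfs_paths if fnmatch(p, "/user/*.txt")]
print(matched)
```

This matches the four "Deleted hdfs://..." lines above: every .txt file under /user was removed, and anything not matching the pattern would be left in place.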

7. Stop HDFS
# stop-dfs.sh