What is Hadoop? It started as an Apache.org project under Lucene, created by Doug Cutting. As Google's success drew attention to search-engine technology, Cutting studied Google's published papers on MapReduce and GFS (the Google File System), the two technologies Google uses to coordinate computation across many machines (that is, cloud computing), and implemented both of them in the Hadoop project. Through this open-source effort, the fast computation once confined to large machines was extended to the combined computation of many machines: a distributed architecture built on increasingly capable commodity PCs can deliver capabilities similar to those of a mainframe.
Building the Hadoop virtual machine template
Basic software to install
1. JDK (Oracle JDK or OpenJDK)
2. Hadoop
Building the Hadoop host: HDP119
The HDP119 directory under Lab301 holds the files needed for direct kernel boot.
$ cd ~/iLab/Lab301/HDP119
$ sudo virsh define HDP119.xml
Domain HDP119 defined from HDP119.xml
$ sudo virsh start HDP119
Domain HDP119 started
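A quick way to confirm the domain is actually running (not part of the original transcript, but virsh list is a standard libvirt command):

# list all defined domains and their state; HDP119 should show as "running"
$ sudo virsh list --all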
[Important] For the HDP119 VM to boot successfully, the three directories /home/student/iLab,
/home/student/iLab/Lab301 and /home/student/iLab/Lab301/HDP119 must all have their permissions set to 755.
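One way to set those permissions, assuming the paths above (a minimal sketch, run on the host as the student user):

# set rwxr-xr-x on the three directories the VM needs to traverse
$ chmod 755 /home/student/iLab /home/student/iLab/Lab301 /home/student/iLab/Lab301/HDP119
# verify
$ ls -ld /home/student/iLab /home/student/iLab/Lab301 /home/student/iLab/Lab301/HDP119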
Configuring the JDK

1. Log in to the system
$ sudo virsh console HDP119
Connected to domain HDP119
Escape character is ^]

Core Linux
box login: tc
Password: student
2. Inspect the mounted disk
$ ls -al /mnt/sda1/
total 40
drwxr-xr-x   6 root root  4096 Nov 27 14:35 .
drwxr-xr-x   4 root root    80 Apr 11 16:02 ..
drwxr-xr-x  15 tc   staff  4096 Nov  3 16:05 hadoop-1.0.4
drwxr-xr-x   8 root root  4096 Nov  2 11:14 jdk1.6.0_37
drwx------   2 root root 16384 Nov  2 02:24 lost+found
-rw-r--r--   1 root root  7746 Nov 27 14:35 mydata.tgz
drwxrwxr-x   4 tc   staff  4096 Nov  2 02:26 tce

3. Edit .ashrc
$ nano .ashrc
:
alias ping='ping -c 4'
export JAVA_HOME=/mnt/sda1/jdk1.6.0_37
# Since Hadoop 1.0 the HADOOP_HOME environment variable no longer needs to be set
# export HADOOP_HOME=/mnt/sda1/hadoop-1.0.4
export PATH="$PATH:$JAVA_HOME/bin"

4. Back up the settings and reboot
$ filetool.sh -b
$ sudo reboot
5. Log in again
Core Linux
box login: tc
Password: student

6. Test the JDK
Is it a 64-bit build?
# java -d64 -version
Running a 64-bit JVM is not supported on this platform.
Is it a 32-bit build?
# java -d32 -version
java version "1.6.0_37"
Java(TM) SE Runtime Environment (build 1.6.0_37-b03)
Java HotSpot(TM) Client VM (build 20.8-b03, mixed mode, sharing)
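If the new settings do not seem to take effect after logging back in, a quick sanity check (not in the original transcript) could be:

# confirm the variables exported in .ashrc are visible in the current shell
$ echo $JAVA_HOME
$ which java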
Installing the Hadoop core package: HDP119

1. Log in to the system
$ sudo virsh console HDP119
Core Linux
box login: tc
Password: student

2. Edit .ashrc
$ nano .ashrc
:
alias ping='ping -c 4'
export JAVA_HOME=/mnt/sda1/jdk1.6.0_37
# Since Hadoop 1.0 the HADOOP_HOME environment variable no longer needs to be set
# export HADOOP_HOME=/mnt/sda1/hadoop-1.0.4
export PATH="$PATH:$JAVA_HOME/bin:/mnt/sda1/hadoop-1.0.4/bin"

[Note] The Hadoop package can be downloaded from:
$ wget ftp://ftp.twaren.net/Unix/Web/apache/hadoop/core/hadoop-1.0.4/hadoop-1.0.4-bin.tar.gz

3. Back up the settings and log out
$ filetool.sh -b
$ exit

4. Log in again
Core Linux
box login: tc
Password: student

5. Check the Hadoop version
$ hadoop version
Hadoop 1.0.4
Subversion https://svn.apache.org/repos/asf/hadoop/common/branches/branch-1.0 -r 1335192
Compiled by hortonfo on Tue May  8 20:31:25 UTC 2012
From source with checksum e6b0c1e23dcf76907c5fecb4b832f3be

Standalone Operation
The following steps are run on the HDP119 VM.

1. Prepare the data
$ cd /mnt/sda1/hadoop-1.0.4/
$ mkdir input
$ cp conf/*.xml input

2. Run the MapReduce example
$ hadoop jar hadoop-examples-1.0.4.jar grep input output 'dfs[a-z.]+'
[Important] For the command above to succeed, the glibc_apps.tcz package must be installed first,
because its getconf command is used. If output already exists, it must be deleted first.
$ tce-load -wi glibc_apps.tcz
$ rm -rf output/

3. View the result
$ cat output/part-00000
1       dfsadmin
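The same examples jar also ships a wordcount program. As an additional sketch (the output directory name output-wc is an arbitrary choice here, not part of the lab), the standalone run could be repeated as:

# count word occurrences in the copied XML files; the output directory must not exist yet
$ hadoop jar hadoop-examples-1.0.4.jar wordcount input output-wc
$ cat output-wc/part-* | head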
-------------------------------------------------------------------
Deploying the Hadoop lab system

1. Batch-create the Hadoop virtual machines and network
$ cd ~/iLab
$ sudo ./labcmd.sh create -f Lab301
2. Start the Lab301 lab system
$ sudo ./labcmd.sh start Lab301

HDP120 VM - set up automatic (passwordless) ssh login
So that running ssh localhost logs in without asking for a password, perform the following steps:

1. Log in to HDP120
$ sudo virsh console HDP120
Connected to domain HDP120
Escape character is ^]
root@HDP120:~#

2. Edit /etc/hosts
# go back to the home directory
$ cd
# edit /etc/hosts (host-name resolution, same idea as DNS)
$ sudo nano /etc/hosts
127.0.0.1       localhost
192.168.100.20  HDP120
192.168.100.21  HDP121

# The following lines are desirable for IPv6 capable hosts
# (added automatically by netbase upgrade)
::1     ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
ff02::3 ip6-allhosts

3. Generate the login key pair (on HDP120)
# cd
# ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
# cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys

4. First passwordless ssh login
# ssh HDP120
The authenticity of host 'hdp120 (192.168.100.20)' can't be established.
RSA key fingerprint is be:4d:f0:a2:a4:c7:aa:4f:ff:f3:29:39:b7:b8:c7:4f.
Are you sure you want to continue connecting (yes/no)? yes
# answer yes, which stores the SSH server's host key
Warning: Permanently added 'hdp120,192.168.100.20' (RSA) to the list of known hosts.
Linux HDP120 2.6.32-33-generic-pae #72-Ubuntu SMP Fri Jul 29 22:06:29 UTC 2011 i686 GNU/Linux
Ubuntu 10.04.4 LTS

Welcome to Ubuntu!
 * Documentation:  https://help.ubuntu.com/

Last login: Sat Jun 30 23:12:49 2012
root@HDP120:~#

5. Close the connection
root@HDP120:~# exit
logout
Connection to HDP120 closed.
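Two quick checks that are not part of the original transcript but are handy at this point (plain glibc/OpenSSH usage, no lab-specific commands assumed): confirm that /etc/hosts resolves both names, and that ssh now logs in without prompting for a password.

root@HDP120:~# getent hosts HDP120 HDP121      (name resolution from /etc/hosts)
root@HDP120:~# ssh HDP120 hostname             (should print HDP120 with no password prompt)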
HDP120 VM - configuring Hadoop in single-node (pseudo-distributed) mode
1. Create the data storage directory
$ cd /mnt/hda1/hadoop-1.0.3/
$ mkdir data
2. Configure conf/core-site.xml
$ nano conf/core-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://HDP120:9000</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/mnt/hda1/hadoop-1.0.3/data</value>
  </property>
</configuration>
3. Configure conf/hdfs-site.xml
$ nano conf/hdfs-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
[Note] Setting dfs.safemode.threshold.pct to 0 keeps the NameNode from entering safe mode (read-only) at startup:
<property>
<name>dfs.safemode.threshold.pct</name>
<value>0</value>
</property>
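Safe mode can also be inspected or left manually at runtime. These are standard Hadoop 1.x dfsadmin subcommands, shown only for reference and not part of the lab steps:

# check whether the NameNode is currently in safe mode
$ hadoop dfsadmin -safemode get
# force the NameNode out of safe mode if it is stuck there
$ hadoop dfsadmin -safemode leave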
4. Configure conf/mapred-site.xml
$ nano conf/mapred-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>HDP120:9001</value>
  </property>
</configuration>
5. The masters/slaves configuration files
The masters file specifies where the SecondaryNameNode runs; if it is not set, the SecondaryNameNode service will not be started.
$ nano conf/masters
HDP120
The slaves file specifies the DataNode hosts (the TaskTracker is also started on these hosts).
$ nano conf/slaves
HDP120
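A one-line check (not in the original notes) to confirm both files ended up with the expected host name:

# each file should contain a single line: HDP120
$ cat conf/masters conf/slaves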
Setting the Hadoop environment variables (conf/hadoop-env.sh)
This is necessary because the HDFS daemons run as root, and root's environment does not define JAVA_HOME or HADOOP_HEAPSIZE, so set both variables in conf/hadoop-env.sh.
$ nano conf/hadoop-env.sh
# Set Hadoop-specific environment variables here.

# The only required environment variable is JAVA_HOME.  All others are
# optional.  When running a distributed configuration it is best to
# set JAVA_HOME in this file, so that it is correctly defined on
# remote nodes.

# The java implementation to use.  Required.
export JAVA_HOME=/mnt/hda1/jdk1.6.0_33

# Extra Java CLASSPATH elements.  Optional.
# export HADOOP_CLASSPATH=

# The maximum amount of heap to use, in MB. Default is 1000.
export HADOOP_HEAPSIZE=384

# Extra Java runtime options.  Empty by default.
# export HADOOP_OPTS=-server
:
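One caveat before the first start: these notes never show HDFS being formatted, so the lab image is presumably shipped with the NameNode already initialized. On a brand-new installation the metadata directory under hadoop.tmp.dir would normally be initialized once with the standard command below (run it only on a fresh setup, since it wipes existing HDFS metadata):

root@HDP120:~# hadoop namenode -format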
Starting Hadoop
# start-all.sh
starting namenode, logging to /mnt/hda1/hadoop-1.0.3/libexec/../logs/hadoop-root-namenode-HDP120.out
HDP120: starting datanode, logging to /mnt/hda1/hadoop-1.0.3/libexec/../logs/hadoop-root-datanode-HDP120.out
HDP120: starting secondarynamenode, logging to /mnt/hda1/hadoop-1.0.3/libexec/../logs/hadoop-root-secondarynamenode-HDP120.out
starting jobtracker, logging to /mnt/hda1/hadoop-1.0.3/libexec/../logs/hadoop-root-jobtracker-HDP120.out
HDP120: starting tasktracker, logging to /mnt/hda1/hadoop-1.0.3/libexec/../logs/hadoop-root-tasktracker-HDP120.out
* The NameNode runs on whichever machine start-all.sh is executed on.
Check which Hadoop daemons were started (jps, the JDK's JVM process status tool)
$ jps
2232 NameNode
2667 TaskTracker
2347 DataNode
2468 SecondaryNameNode
2774 Jps
2546 JobTracker
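Besides jps, the daemons' built-in web interfaces are a convenient health check. The ports below are the Hadoop 1.x defaults (nothing in this lab changes them); they can be opened in a browser, or probed as sketched here assuming curl is available on HDP120:

# NameNode web UI (HDFS status)
$ curl -I http://HDP120:50070/
# JobTracker web UI (MapReduce jobs)
$ curl -I http://HDP120:50030/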
Stopping Hadoop
$ stop-all.sh
stopping jobtracker
HDP120: stopping tasktracker
stopping namenode
HDP120: stopping datanode
HDP120: stopping secondarynamenode
------------------------------------------------------------------------------
HDFS command-line operations

1. Create a directory
$ cd /mnt/hda1/hadoop-1.0.3
$ start-dfs.sh
$ hadoop dfs -ls /
Found 1 items
drwxr-xr-x   - root supergroup          0 2013-04-13 14:02 /mnt
$ hadoop dfs -mkdir /user
$ hadoop dfs -ls /
Found 2 items
drwxr-xr-x   - root supergroup          0 2012-06-09 22:36 /mnt
drwxr-xr-x   - root supergroup          0 2012-06-09 22:40 /user
2. Upload files
$ hadoop dfs -copyFromLocal *.txt /user
$ hadoop dfs -ls /user
Found 4 items
-rw-r--r--   1 student supergroup          0 2011-06-12 00:32 /user/CHANGES.txt
-rw-r--r--   1 student supergroup          0 2011-06-12 00:32 /user/LICENSE.txt
-rw-r--r--   1 student supergroup          0 2011-06-12 00:32 /user/NOTICE.txt
-rw-r--r--   1 student supergroup          0 2011-06-12 00:32 /user/README.txt

3. Display a file's contents
# hadoop dfs -cat /user/NOTICE.txt
This product includes software developed by The Apache Software Foundation (http://www.apache.org/).

4. Inspect how a file is stored
# hadoop fsck /user/NOTICE.txt -files -blocks -locations
FSCK started by root from /192.168.100.20 for path /user/NOTICE.txt at Thu Nov 03 17:26:50 UTC 2011
/user/NOTICE.txt 101 bytes, 1 block(s):  OK
0. blk_-6938191868505178697_1006 len=101 repl=1 [192.168.100.20:50010]

Status: HEALTHY
 Total size:    101 B
 Total dirs:    0
 Total files:   1
 Total blocks (validated):      1 (avg. block size 101 B)
 Minimally replicated blocks:   1 (100.0 %)
 Over-replicated blocks:        0 (0.0 %)
 Under-replicated blocks:       0 (0.0 %)
 Mis-replicated blocks:         0 (0.0 %)
 Default replication factor:    1
 Average block replication:     1.0
 Corrupt blocks:                0
 Missing replicas:              0 (0.0 %)
 Number of data-nodes:          1
 Number of racks:               1
FSCK ended at Thu Nov 03 17:26:50 UTC 2011 in 5 milliseconds

The filesystem under path '/user/NOTICE.txt' is HEALTHY

5. Retrieve a file
The get command is the inverse operation of put; it will copy a file or directory (recursively) from HDFS into the target of your choosing on the local file system. A synonymous operation is called -copyToLocal.
# hadoop dfs -get /user/README.txt a.txt
# ll a.txt
-rw-r--r-- 1 root root 1366 2011-11-05 21:40 a.txt
6. Delete files
$ hadoop dfs -rm /user/*.txt
Deleted hdfs://HDP120:9000/user/CHANGES.txt
Deleted hdfs://HDP120:9000/user/LICENSE.txt
Deleted hdfs://HDP120:9000/user/NOTICE.txt
Deleted hdfs://HDP120:9000/user/README.txt

7. Stop HDFS
# stop-dfs.sh
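Two more standard Hadoop 1.x commands that complement the steps above (for reference only, not part of the original notes): a space-usage listing and an overall HDFS status report.

# show the size of everything under /user
$ hadoop dfs -du /user
# summarize configured/used capacity and the live DataNodes
$ hadoop dfsadmin -report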