hadoop


Summary

The configuration files live under etc/hadoop. The main things to configure are:

core-site.xml

hdfs://master:9000
# the HDFS server (NameNode) listens on port 9000 of the master node
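
As a minimal sketch of where that URI goes, the snippet below writes a core-site.xml into a scratch directory. The property name fs.defaultFS is the standard Hadoop key for the default filesystem and is not spelled out in the notes above; the scratch directory is an assumption so that no real configuration is overwritten.

```shell
# Scratch directory (assumption); copy the file into $HADOOP_HOME/etc/hadoop after review.
CONF_DIR="$(mktemp -d)"

cat > "$CONF_DIR/core-site.xml" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <!-- default filesystem URI: NameNode on host "master", port 9000 -->
    <name>fs.defaultFS</name>
    <value>hdfs://master:9000</value>
  </property>
</configuration>
EOF

echo "wrote $CONF_DIR/core-site.xml"
```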

hdfs-site.xml

dfs.replication
# number of replicas kept for each file
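
A sketch of the matching hdfs-site.xml follows. The notes do not record which replica count was used, so the value 2 here is purely an illustrative assumption, as is the scratch directory.

```shell
# Scratch directory (assumption); copy the file into $HADOOP_HOME/etc/hadoop after review.
CONF_DIR="$(mktemp -d)"

cat > "$CONF_DIR/hdfs-site.xml" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <!-- replicas per block; 2 is an illustrative value, not taken from the notes -->
    <name>dfs.replication</name>
    <value>2</value>
  </property>
</configuration>
EOF

echo "wrote $CONF_DIR/hdfs-site.xml"
```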

mapred-site.xml

mapreduce.framework.name
yarn
# use YARN as the execution framework, for a truly distributed deployment
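
The name/value pair above would sit in mapred-site.xml roughly as sketched here; the scratch directory is an assumption to keep the example from touching a real installation.

```shell
# Scratch directory (assumption); copy the file into $HADOOP_HOME/etc/hadoop after review.
CONF_DIR="$(mktemp -d)"

cat > "$CONF_DIR/mapred-site.xml" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <!-- run MapReduce jobs on YARN instead of the local runner -->
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
EOF

echo "wrote $CONF_DIR/mapred-site.xml"
```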

yarn-site.xml

mapreduce_shuffle # MapReduce uses the shuffle service

yarn.resourcemanager.hostname
master # the ResourceManager runs on the master host
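
Put together, the two settings could look like the yarn-site.xml sketched below. The notes only give the value mapreduce_shuffle; pairing it with the key yarn.nodemanager.aux-services is the standard Hadoop convention and is an assumption about what the original file contained.

```shell
# Scratch directory (assumption); copy the file into $HADOOP_HOME/etc/hadoop after review.
CONF_DIR="$(mktemp -d)"

cat > "$CONF_DIR/yarn-site.xml" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <!-- auxiliary service the NodeManagers run so reducers can fetch map output -->
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <!-- host that runs the ResourceManager -->
    <name>yarn.resourcemanager.hostname</name>
    <value>master</value>
  </property>
</configuration>
EOF

echo "wrote $CONF_DIR/yarn-site.xml"
```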

workers

List the hostnames of the hosts that act as slaves, one per line.
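
For illustration, a workers file for two slaves might be generated as follows. The hostnames slave1 and slave2 are hypothetical; each must be resolvable from the master.

```shell
# Scratch directory (assumption); copy the file into $HADOOP_HOME/etc/hadoop after review.
CONF_DIR="$(mktemp -d)"

# One worker hostname per line; slave1 and slave2 are hypothetical names.
printf '%s\n' slave1 slave2 > "$CONF_DIR/workers"

cat "$CONF_DIR/workers"
```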

hadoop 💊

JAVA_HOME in etc/hadoop/hadoop-env.sh apparently needs to be set to an absolute path.
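
One way to obtain such an absolute path is to resolve the symlinks behind the java binary and strip the trailing bin/java. The helper name java_home_from_bin is hypothetical, and the sketch assumes a readlink that supports -f (GNU coreutils or a recent macOS).

```shell
# Resolve symlinks for a java binary and print the directory two levels up,
# i.e. the JDK root that contains bin/java. Helper name is hypothetical.
java_home_from_bin() {
  dirname "$(dirname "$(readlink -f "$1")")"
}

# If java is on PATH, print the line to put into etc/hadoop/hadoop-env.sh.
if command -v java >/dev/null 2>&1; then
  echo "export JAVA_HOME=$(java_home_from_bin "$(command -v java)")"
fi
```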

With Hadoop 3.2.1, OpenJDK 9 seems to cause problems. Running a job on YARN fails with:

2019-10-04 16:21:59,566 INFO mapreduce.Job: Job job_1570176818335_0003 failed with state FAILED due to: 
Application application_1570176818335_0003 failed 2 times due to Error launching appattempt_1570176818335_0003_000002.
Got exception: org.apache.hadoop.security.AccessControlException: Unable to find SASL server implementation for DIGEST-MD5

After switching to OpenJDK 8, a different error appeared:

[2019-10-04 17:17:02.073]Container exited with a non-zero exit code 127. Error file: prelaunch.err.
Last 4096 bytes of prelaunch.err :
Last 4096 bytes of stderr :
/bin/bash: /bin/java: No such file or directory

Switching to Oracle JDK 8 produced the same error as OpenJDK 8, so I had to force a symbolic link into /bin/:

ln -s /usr/bin/java /bin/java

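On another machine I had set up the hosts file so that the corresponding hostname pointed to 127.0.0.1, and surprisingly this error did not occur there.

After switching to the distributed configuration, neither the NameNode nor the ResourceManager came up. I pointed the hostname in /etc/hosts to 127.0.0.1, which resolved it; after a restart everything started normally.

The reason the regular IP could not be used turned out to be Tencent Cloud's networking: only the private address works, not the public address it provides for external access. 🍰 On top of that, the documentation stresses that the hostname must not point to 127.0.0.1.

Restarting produces an error: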

org.apache.hadoop.hdfs.server.common.InconsistentFSStateException: Directory /tmp/hadoop-ubuntu/dfs/name is in an inconsistent state

A Q&A answer points out that /tmp gets cleaned up, so I changed the temporary directory setting (hadoop.tmp.dir) in core-site.xml.
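
A sketch of that property is shown below, written to a scratch file so it can be pasted into the configuration element of core-site.xml. The directory /data/hadoop/tmp is a hypothetical choice; any persistent location writable by the Hadoop user should do. The notes do not say whether the NameNode was reformatted afterwards.

```shell
# Scratch file (assumption) holding only the property to paste into core-site.xml.
FRAGMENT="$(mktemp)"

cat > "$FRAGMENT" <<'EOF'
<property>
  <!-- base for HDFS data; /data/hadoop/tmp is a hypothetical persistent path -->
  <name>hadoop.tmp.dir</name>
  <value>/data/hadoop/tmp</value>
</property>
EOF

cat "$FRAGMENT"
```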

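A wrong hosts configuration makes startup hang at the ResourceManager. The entry mapping the local hostname to 127.0.0.1 has to come first; otherwise the following is logged: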

INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032

Running the command directly on the master node produced an error:

-bash: bin/dfs: No such file or directory

The slave node did not have $HADOOP_HOME and $PATH fully configured. After adding all of them, bin/hdfs could be found.
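
The environment setup on each node might look like the following lines in a shell profile. The install path /opt/hadoop-3.2.1 is an assumption and is only used when HADOOP_HOME is not already set.

```shell
# Assumed install location; adjust to wherever Hadoop was unpacked.
export HADOOP_HOME="${HADOOP_HOME:-/opt/hadoop-3.2.1}"

# Make the hdfs/yarn commands (bin) and the start/stop scripts (sbin) available.
export PATH="$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$PATH"

echo "HADOOP_HOME=$HADOOP_HOME"
```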

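After the distributed setup the nodes still could not connect. The DataNode logs showed that it could not reach master: on master, port 9000 was bound to 127.0.0.1:9000, which is obviously unreachable from other hosts. After several attempts, replacing the hostname in core-site.xml with the actual IP and changing the slave's hosts file to the same IP fixed it, and telnet could then connect to master as well.

The next problem was that Tencent Cloud provides two IPs. Master's public IP could not be reached at all: datanode.log showed endless retries and telnet failed too. Tencent Cloud seems to apply some kind of filtering.

With the private IP telnet succeeds, but a new error appears: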

Datanode denied communication with namenode because hostname cannot be resolved
(ip=172.21.101.4, hostname=172.21.101.4)

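I found two solutions, one of which comes with an excellent detailed walkthrough of the code. A related question has also been asked on Stack Overflow.

In the end I added the following to core-site.xml on both master and slave: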

<property>
    <name>dfs.namenode.datanode.registration.ip-hostname-check</name>
    <value>false</value>
</property>

The slave and master then communicated successfully.

Running sbin/start-yarn.sh produced an error similar to the earlier one:

 Connecting to ResourceManager at vm25/172.21.101.13:8031

2019-10-07 13:25:23,956 INFO org.apache.hadoop.util.JvmPauseMonitor: Starting JVM pause monitor
2019-10-07 13:25:23,961 INFO org.apache.hadoop.yarn.client.RMProxy: Connecting to ResourceManager at vm25/172.21.101.13:8031
2019-10-07 13:25:23,999 INFO org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Sending out 0 NM container statuses: []
2019-10-07 13:25:24,008 INFO org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Registering with RM using containers :[]
2019-10-07 13:25:25,043 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: vm25/172.21.101.13:8031. Already tried 0 time(s)

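Checking the ports on master with netstat showed that the service was once again listening on 127.0.0.1, so I directly changed the corresponding resourcemanager value in yarn-site.xml: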

<value>0.0.0.0</value>

The slave could now talk to the ResourceManager on master, but yet another error appeared; see the corresponding code.
Error on slave:

org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: couldn't find application application_1570428476013_0001 while processing FINISH_APPS event. The ResourceManager allocated resources 
for this application to the NodeManager but no active containers were found to process.

Error on master:

Application application_1570428476013_0001 failed 2 times due to Error launching appattempt_1570428476013_0001_000002. 
Got exception: java.net.ConnectException: Call From localhost.localdomain/127.0.0.1 to localhost.localdomain:38808 failed
on connection exception: java.net.ConnectException: Connection refused;

Listening ports on slave:

tcp        0      0 0.0.0.0:8042            0.0.0.0:*               LISTEN      12711/java
tcp        0      0 0.0.0.0:9866            0.0.0.0:*               LISTEN      5062/java
tcp        0      0 0.0.0.0:9867            0.0.0.0:*               LISTEN      5062/java
tcp        0      0 127.0.0.1:35314         0.0.0.0:*               LISTEN      5062/java
tcp        0      0 0.0.0.0:38808           0.0.0.0:*               LISTEN      12711/java
tcp        0      0 0.0.0.0:13562           0.0.0.0:*               LISTEN      12711/java
tcp        0      0 0.0.0.0:8040            0.0.0.0:*               LISTEN      12711/java
tcp        0      0 0.0.0.0:9864            0.0.0.0:*               LISTEN      5062/java

Listening ports on master:

tcp        0      0 0.0.0.0:9868            0.0.0.0:*               LISTEN      31184/java
tcp        0      0 0.0.0.0:9870            0.0.0.0:*               LISTEN      30918/java
tcp        0      0 0.0.0.0:19888           0.0.0.0:*               LISTEN      6694/java
tcp        0      0 0.0.0.0:10033           0.0.0.0:*               LISTEN      6694/java
tcp        0      0 0.0.0.0:8088            0.0.0.0:*               LISTEN      8121/java
tcp        0      0 0.0.0.0:8030            0.0.0.0:*               LISTEN      8121/java
tcp        0      0 0.0.0.0:8031            0.0.0.0:*               LISTEN      8121/java
tcp        0      0 0.0.0.0:8032            0.0.0.0:*               LISTEN      8121/java
tcp        0      0 0.0.0.0:8033            0.0.0.0:*               LISTEN      8121/java
tcp        0      0 0.0.0.0:10020           0.0.0.0:*               LISTEN      6694/java
tcp        0      0 0.0.0.0:9000            0.0.0.0:*               LISTEN      30918/java

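In the end it turned out to be a configuration problem: the linked page points to the hosts file. Because of Tencent Cloud's internal networking, only the private addresses can be used as the host IPs in the hosts file; with the public IPs the programs cannot run. All of the earlier problems came down to this IP issue.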
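
To make that conclusion concrete, here is a hedged sketch of a hosts file that uses only private addresses, plus a small check for cluster hostnames that still map to loopback. The two 172.21.101.x addresses are the private IPs that appear in the logs above; the hostnames master and slave1 and the helper name loopback_names are hypothetical.

```shell
# Scratch hosts file (assumption); compare it with /etc/hosts on every node.
HOSTS_FILE="$(mktemp)"
cat > "$HOSTS_FILE" <<'EOF'
127.0.0.1      localhost
172.21.101.13  master
172.21.101.4   slave1
EOF

# Print every hostname other than "localhost" that is mapped to a 127.x address.
loopback_names() {
  awk '$1 ~ /^127\./ { for (i = 2; i <= NF; i++) if ($i != "localhost") print $i }' "$1"
}

# Empty output means no cluster hostname resolves to loopback.
loopback_names "$HOSTS_FILE"
```

Running loopback_names against the real /etc/hosts on each node is a quick way to spot the kind of entry that makes a daemon bind to 127.0.0.1.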


Author: greatofdream
Copyright: Unless otherwise stated, all articles on this blog are licensed under CC BY 4.0. Please credit greatofdream when reposting.