Hello,
One of the nodes of our Analytics DC is dead, but ColumnFamilyInputFormat
(CFIF) still assigns Hadoop input splits to it. This leads to many failed
tasks and consequently a failed job.
* Tasks fail with: java.lang.RuntimeException:
org.apache.thrift.transport.TTransportException: Failed to open a transport
to XX.75:9160. (obviously, the node is dead)
* Job fails with: Job Failed: # of failed Map Tasks exceeded allowed limit.
FailedCount: 1. LastFailedTask: task_201404180250_4207_m_000079
We use RF=2 and CL=LOCAL_ONE for Hadoop jobs, on C* 1.2.16. Is this expected
behavior?
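
For context, this is roughly how we wire the job up with the stock
ConfigHelper (the contact address and the "analytics"/"events" keyspace and
column family names below are placeholders, not our real ones):

import org.apache.cassandra.hadoop.ColumnFamilyInputFormat;
import org.apache.cassandra.hadoop.ConfigHelper;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class JobSetup {
    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "analytics-job");
        job.setInputFormatClass(ColumnFamilyInputFormat.class);
        Configuration conf = job.getConfiguration();
        // Thrift contact point and rpc port (9160, as in the stack trace above).
        ConfigHelper.setInputInitialAddress(conf, "XX.74");
        ConfigHelper.setInputRpcPort(conf, "9160");
        ConfigHelper.setInputPartitioner(conf, "Murmur3Partitioner");
        ConfigHelper.setInputColumnFamily(conf, "analytics", "events");
        // Reads at LOCAL_ONE, as described above.
        ConfigHelper.setReadConsistencyLevel(conf, "LOCAL_ONE");
        // ... mapper/reducer setup elided ...
    }
}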
I checked the CFIF code, and it always assigns input splits to all ring
nodes, whether a node is dead or alive. Our current workaround is to patch
CFIF to blacklist the dead node (sketched below), but that is not a very
automatic procedure.
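
For reference, here is a minimal sketch of what our patch does when building
splits, assuming a hard-coded blacklist (our actual patch reads the dead
nodes from the job configuration); SplitBlacklist, DEAD_NODES and
filterEndpoints are our own names, not part of CFIF:

import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public final class SplitBlacklist {
    // Nodes Hadoop should skip; hard-coded here only for illustration.
    private static final Set<String> DEAD_NODES =
            new HashSet<String>(Arrays.asList("XX.75"));

    // Drop blacklisted replicas from a token range's endpoint list before
    // they become the locations of a split. With RF=2 every range should
    // still keep at least one live replica.
    public static String[] filterEndpoints(List<String> endpoints) {
        List<String> live = new ArrayList<String>();
        for (String endpoint : endpoints) {
            if (!DEAD_NODES.contains(endpoint)) {
                live.add(endpoint);
            }
        }
        // If every replica is blacklisted, keep the originals so the job
        // fails loudly instead of silently skipping the range.
        return (live.isEmpty() ? endpoints : live).toArray(new String[0]);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(
                filterEndpoints(Arrays.asList("XX.75", "XX.76")))); // [XX.76]
    }
}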
Am I missing something here?
Cheers,
Chaordic | Platform
www.chaordic.com.br
+55 48 3232.3200