amazon s3 - Reading SequenceFiles from S3 with PySpark on EMR causes RACK_LOCAL locality
I'm trying to use PySpark on EMR to analyze data stored as SequenceFiles on S3, but I'm running into performance issues due to data locality. Here is a simple sample that doesn't work well:
seqRDD = sc.sequenceFile("s3n://<access>:<secret>@<bucket>/<table>/day=2015-07-04/hour=*/*")
seqRDD.count()
The issue is with the count action: it works fine, but the distribution of tasks is very poor. For some reason, in the Spark logs I see only 2 IPs of the cluster doing the actual work while the rest sit idle. I tried with a 5-node cluster and with a 50-node cluster, and it's always only 2 IPs that appear in the logs.
What is also strange is that these 2 IPs have a locality of RACK_LOCAL. I'm presuming that's because the data is in S3, so it's not local, but how can I make Spark use the whole cluster instead of just 2 instances?
I didn't set any specific Spark configuration on EMR. I'm installing Spark on EMR via the native app, and I believe it takes care of optimizing the configs automatically.
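For context, one setting that may be worth ruling out: on YARN, Spark 1.x starts two executors by default unless `--num-executors` (or `spark.executor.instances`) is set, and whether the EMR install overrides that default is not confirmed here. The following is only a minimal sketch of passing an explicit executor count to `spark-submit`. The script name `count_seqfiles.py` and the values 10 and 4 are hypothetical placeholders, not taken from this cluster.

```python
def spark_submit_args(script, num_executors, executor_cores):
    """Build a spark-submit argument list for a YARN cluster.

    The flags are standard spark-submit options; the values passed in
    are assumptions that would need tuning for a real cluster.
    """
    return [
        "spark-submit",
        "--master", "yarn-client",               # Spark 1.x style YARN client mode
        "--num-executors", str(num_executors),   # explicit executor count
        "--executor-cores", str(executor_cores),
        script,
    ]


# Hypothetical example: 10 executors with 4 cores each.
args = spark_submit_args("count_seqfiles.py", 10, 4)
print(" ".join(args))
```

If the number of distinct hosts in the logs changes with `--num-executors`, the executor count rather than S3 locality is the more likely limit.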
I also saw allowLocal=false in the logs. I'm not sure whether that is an issue; I couldn't find any information on it:
15/07/17 23:55:27 info spark.sparkcontext: starting job: count @ :1
15/07/17 23:55:27 info scheduler.dagscheduler: got job 1 (count @ :1) 1354 output partitions (allowlocal=false)
15/07/17 23:55:27 info scheduler.dagscheduler: final stage: stage 1(count @ :1)
Here are some of the logs that follow when running count, showing only the 2 IPs:
15/07/17 23:55:28 info scheduler.dagscheduler: submitting 1354 missing tasks stage 1 (pythonrdd[3] @ count @ :1)
15/07/17 23:55:28 info cluster.yarnscheduler: adding task set 1.0 1354 tasks
15/07/17 23:55:28 info scheduler.tasksetmanager: starting task 0.0 in stage 1.0 (tid 1, ip-172-31-41-210.ec2.internal, rack_local, 1418 bytes)
15/07/17 23:55:28 info scheduler.tasksetmanager: starting task 1.0 in stage 1.0 (tid 2, ip-172-31-36-179.ec2.internal, rack_local, 1420 bytes)
15/07/17 23:55:28 info storage.blockmanagerinfo: added broadcast_3_piece0 in memory on ip-172-31-36-179.ec2.internal:39998 (size: 3.7 kb, free: 535.0 mb)
15/07/17 23:55:28 info storage.blockmanagerinfo: added broadcast_3_piece0 in memory on ip-172-31-41-210.ec2.internal:36847 (size: 3.7 kb, free: 535.0 mb)
15/07/17 23:55:29 info storage.blockmanagerinfo: added broadcast_0_piece0 in memory on ip-172-31-41-210.ec2.internal:36847 (size: 18.8 kb, free: 535.0 mb)
15/07/17 23:55:31 info scheduler.tasksetmanager: starting task 2.0 in stage 1.0 (tid 3, ip-172-31-41-210.ec2.internal, rack_local, 1421 bytes)
15/07/17 23:55:31 info scheduler.tasksetmanager: finished task 0.0 in stage 1.0 (tid 1) in 3501 ms on ip-172-31-41-210.ec2.internal (1/1354)
15/07/17 23:55:31 info scheduler.tasksetmanager: starting task 3.0 in stage 1.0 (tid 4, ip-172-31-41-210.ec2.internal, rack_local, 1420 bytes)
15/07/17 23:55:31 info scheduler.tasksetmanager: finished task 2.0 in stage 1.0 (tid 3) in 99 ms on ip-172-31-41-210.ec2.internal (2/1354)
15/07/17 23:55:33 info scheduler.tasksetmanager: starting task 4.0 in stage 1.0 (tid 5, ip-172-31-36-179.ec2.internal, rack_local, 1420 bytes)
15/07/17 23:55:33 info scheduler.tasksetmanager: finished task 1.0 in stage 1.0 (tid 2) in 5190 ms on ip-172-31-36-179.ec2.internal (3/1354)
15/07/17 23:55:36 info scheduler.tasksetmanager: starting task 5.0 in stage 1.0 (tid 6, ip-172-31-41-210.ec2.internal, rack_local, 1420 bytes)
15/07/17 23:55:36 info scheduler.tasksetmanager: finished task 3.0 in stage 1.0 (tid 4) in 4471 ms on ip-172-31-41-210.ec2.internal (4/1354)
15/07/17 23:55:37 info scheduler.tasksetmanager: starting task 6.0 in stage 1.0 (tid 7, ip-172-31-36-179.ec2.internal, rack_local, 1420 bytes)
15/07/17 23:55:37 info scheduler.tasksetmanager: finished task 4.0 in stage 1.0 (tid 5) in 3676 ms on ip-172-31-36-179.ec2.internal (5/1354)
15/07/17 23:55:40 info scheduler.tasksetmanager: starting task 7.0 in stage 1.0 (tid 8, ip-172-31-41-210.ec2.internal, rack_local, 1420 bytes)
15/07/17 23:55:40 info scheduler.tasksetmanager: finished task 5.0 in stage 1.0 (tid 6) in 3895 ms on ip-172-31-41-210.ec2.internal (6/1354)
15/07/17 23:55:40 info scheduler.tasksetmanager: starting task 8.0 in stage 1.0 (tid 9, ip-172-31-36-179.ec2.internal, rack_local, 1420 bytes)
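To quantify the skew instead of eyeballing it, the "starting task" lines can be tallied per host. This is a rough sketch that assumes the driver log keeps the line shape shown in the excerpt above. The helper name `tasks_per_host` is made up for illustration, and the sample lines are copied from that excerpt.

```python
import re
from collections import Counter

# Matches TaskSetManager "starting task" lines in the shape shown in the log above.
# Case-insensitive so it works on both the original and lowercased log text.
TASK_START = re.compile(
    r"starting task \S+ in stage \S+ \(tid \d+, ([^,\s]+), (\w+), \d+ bytes\)",
    re.IGNORECASE,
)


def tasks_per_host(log_text):
    """Count started tasks per (host, locality level) pair."""
    counts = Counter()
    for host, locality in TASK_START.findall(log_text):
        counts[(host, locality.upper())] += 1
    return counts


sample_log = """\
15/07/17 23:55:28 info scheduler.tasksetmanager: starting task 0.0 in stage 1.0 (tid 1, ip-172-31-41-210.ec2.internal, rack_local, 1418 bytes)
15/07/17 23:55:28 info scheduler.tasksetmanager: starting task 1.0 in stage 1.0 (tid 2, ip-172-31-36-179.ec2.internal, rack_local, 1420 bytes)
15/07/17 23:55:31 info scheduler.tasksetmanager: starting task 2.0 in stage 1.0 (tid 3, ip-172-31-41-210.ec2.internal, rack_local, 1421 bytes)
"""

# Print one line per (host, locality) pair with its task count.
for (host, locality), n in sorted(tasks_per_host(sample_log).items()):
    print(host, locality, n)
```

Run against the full driver log, this gives the number of distinct hosts that received tasks, which makes it easy to compare the 5-node and 50-node runs.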