Spark is slower than Java MapReduce -- gurus please advise
v*r
Post #1
Spark beginner here, trying out the buzz tech.
Input: a 200 GB uncompressed data file stored in HDFS.
Cluster: 37 worker nodes, each with 24 cores.
Java MapReduce: 6-8 minutes.
Spark: 37 minutes, as two stages of ~18 minutes each.
"Lightning fast cluster computing, 100x faster"???!!!
Gurus, please advise!
# sortMapper sorts the values for each key, then iterates over the grouped values
text = sc.textFile(input, 1776)  # 24 cores * 37 nodes * 2 = 1776 partitions
text.map(mapper) \
    .filter(lambda x: x is not None) \
    .groupByKey() \
    .map(sortMapper) \
    .filter(lambda x: x[1] != []) \
    .saveAsTextFile(output)
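mapper and sortMapper aren't shown in the post; purely for context, here is a minimal sketch of what they might look like (the tab-separated record format and the per-key step are assumptions, not the poster's actual code):

# Hypothetical stand-ins for the poster's mapper and sortMapper, only to
# make the pipeline shape concrete; the real record format and per-key
# logic are not shown in the post.
def mapper(line):
    # assume tab-separated "key<TAB>value" records; drop malformed lines
    parts = line.split("\t", 1)
    if len(parts) != 2:
        return None
    return (parts[0], parts[1])

def sortMapper(kv):
    # groupByKey yields (key, iterable-of-values); sort, then iterate
    key, values = kv
    sorted_values = sorted(values)
    kept = [v for v in sorted_values if v]  # placeholder for the real iteration
    return (key, kept)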
sc.textFile and saveAsTextFile are very slow.
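One rough way to check whether the time is going into HDFS I/O or into the groupByKey shuffle is to force each phase separately with an action (a sketch; the second count() re-reads the input since nothing is cached, so the shuffle cost is roughly the difference between the two timings):

import time

t0 = time.time()
raw = sc.textFile(input, 1776)
raw.count()                      # action: forces the HDFS read only
read_s = time.time() - t0
print("read: %.1fs" % read_s)

t0 = time.time()
raw.map(mapper).filter(lambda x: x is not None).groupByKey().count()
total_s = time.time() - t0
print("read + shuffle: %.1fs (shuffle ~%.1fs)" % (total_s, total_s - read_s))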
Configuration as follows:
conf = SparkConf() \
    .set("spark.executor.memory", "24g") \
    .set("spark.driver.memory", "16g") \
    .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
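For completeness, a sketch of how that conf would be wired into a context (the app name is a placeholder; the driver-memory caveat is in the comment):

from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("spark-vs-mapreduce")  # placeholder app name
        .set("spark.executor.memory", "24g")
        # spark.driver.memory generally must be set before the driver JVM
        # starts (e.g. spark-submit --driver-memory 16g); setting it here
        # from inside the application may be a no-op
        .set("spark.driver.memory", "16g")
        .set("spark.serializer",
             "org.apache.spark.serializer.KryoSerializer"))
sc = SparkContext(conf=conf)

Note that Kryo only affects JVM-side serialization; in PySpark the records themselves are pickled in the Python workers, so this setting may not change much for this job.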