
I am noticing a difference in behaviour after upgrading to Spark 3: the number of partitions changes on df.select, which causes my zip operations to fail on mismatch. With Spark 2.4.4 it works fine. This happens only with select, not with filter.

    spark = SparkSession.builder.appName("local"). \
    master("local[2]"). \
    config("spark.executor.memory", "2g"). \
    config("spark.driver.memory", "2g"). \
    config("spark.sql.shuffle.partitions",10). \
    config("spark.default.parallelism", 10). \
    getOrCreate()
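
As a sanity check (standard PySpark API, not part of the original post), the effective settings can be read back from the live session:

    # Sanity check: confirm the settings actually took effect.
    print(spark.conf.get("spark.sql.shuffle.partitions"))  # '10'
    print(spark.sparkContext.defaultParallelism)           # 10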

With Spark 2.4.4:

df = spark.table("tableA")
print(df.rdd.getNumPartitions()) #10
new_df = df.filter("id is not null")
print(new_df.rdd.getNumPartitions()) #10
new_2_df = df.select("id")
print(new_2_df.rdd.getNumPartitions()) #10

With Spark 3.0.0:

df = spark.table("tableA")
print(df.rdd.getNumPartitions()) #10
new_df = df.filter("id is not null")
print(new_df.rdd.getNumPartitions()) #10
new_2_df = df.select("id")
print(new_2_df.rdd.getNumPartitions()) #1

See the last line, where it changes to 1 partition from the initial 10. Any thoughts?
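
A possible workaround sketch (my addition, not from the original post, and assuming the failing operation is RDD.zip, which requires both the same number of partitions and the same number of elements per partition): pin the partition count explicitly with repartition, or fall back to zipWithIndex plus a join, which does not depend on physical partitioning at all.

    # Workaround sketch (assumes the failing op is RDD.zip on df.rdd).
    df = spark.table("tableA")
    new_2_df = df.select("id").repartition(10)   # pin the partition count explicitly
    print(new_2_df.rdd.getNumPartitions())       # 10

    # Safer alternative: zip by index + join, which works regardless of how
    # either side happens to be partitioned. repartition() alone matches the
    # partition count but not necessarily the per-partition element counts.
    left = df.rdd.zipWithIndex().map(lambda kv: (kv[1], kv[0]))
    right = new_2_df.rdd.zipWithIndex().map(lambda kv: (kv[1], kv[0]))
    zipped = left.join(right).map(lambda kv: kv[1])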


Dynamic partition pruning is new in 3.0.

  • Tried with the below config, disabling dynamicPartitionPruning; still the same results:

        spark = SparkSession.builder.appName("pytest-pyspark-local-testing") \
            .master("local[5]") \
            .config("spark.executor.memory", "2g") \
            .config("spark.driver.memory", "2g") \
            .config("spark.ui.showConsoleProgress", "false") \
            .config("spark.sql.shuffle.partitions", 10) \
            .config("spark.sql.optimizer.dynamicPartitionPruning.enabled", "false") \
            .config("spark.default.parallelism", 10) \
            .getOrCreate()

    – Ankush Oct 14 at 6:59
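
One way to narrow this down (a debugging sketch using the standard explain API, not something suggested in the thread) is to diff the physical plans that filter and select produce; whatever rule rewrites the scan for select but not for filter should show up there:

    # Compare physical plans: the rule that collapses select() to one partition
    # (but leaves filter() alone) should be visible in the plan diff.
    df = spark.table("tableA")
    df.filter("id is not null").explain(True)   # 10 partitions on both versions
    df.select("id").explain(True)               # 1 partition on 3.0.0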
