pyspark: using a lambda to create a dynamic join clause based on a list

In this example, we first define two input data providers, DP1 and DP2, each with five columns: id, value, col1, col2, and col3. We also define a list of join keys, join_keys (in this case just "id").

Next, we use a list comprehension to build a string join_clause containing the join condition based on the join keys. Because the string refers to the columns as DP1.id and DP2.id, we alias each DataFrame accordingly and wrap the string with the expr function from pyspark.sql.functions so Spark parses it as a Column expression.

Finally, we perform an inner join on DP1 and DP2 using join_clause as the join condition, and store the result in result. We then show the result using the show() method.

Note that in a real-world scenario, you may need to modify this code to fit the specific requirements of your data and use case.

from pyspark.sql.functions import expr

# Sample input data
DP1 = spark.createDataFrame([(1, "A", 10, 100, "X"), (2, "B", 20, 200, "Y"), (3, "C", 30, 300, "Z")], ["id", "value", "col1", "col2", "col3"])
DP2 = spark.createDataFrame([(1, "X", 1000, 10000, "A"), (2, "Y", 2000, 20000, "B"), (3, "Z", 3000, 30000, "C")], ["id", "value", "col1", "col2", "col3"])

join_keys = ["id"]

# Create join clause based on join keys
join_clause = " AND ".join([f"DP1.{key} = DP2.{key}" for key in join_keys])

# Perform inner join on the aliased DataFrames, using the SQL string as the condition
result = DP1.alias("DP1").join(DP2.alias("DP2"), expr(join_clause), "inner")

# Show result
result.show()
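
An alternative, closer to the lambda in the question title, is to skip the SQL string and build the condition from Column objects instead. This is a minimal sketch, not part of the original answer; it assumes the same DP1, DP2, and join_keys defined above:

from functools import reduce

# Build one equality condition per join key, then AND them together with a lambda
conditions = [DP1[key] == DP2[key] for key in join_keys]
join_condition = reduce(lambda left, right: left & right, conditions)

result = DP1.join(DP2, join_condition, "inner")
result.show()

Because the condition is built from the DataFrame objects themselves, no aliasing is needed and Spark knows which column belongs to which side.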

In this example, we use the zip function to pair up corresponding elements from the join_keys_1 and join_keys_2 lists.

Next, we use these pairs to create the join condition using a list comprehension. For each pair of keys, we create a string that compares the corresponding columns in DP1 and DP2 using the = operator. We join these strings using the AND operator to create the full join condition.

Finally, we perform an inner join on DP1 and DP2 using join_clause as the join condition, and display the result.

Note that zip stops at the shorter of the two lists, so join_keys_1 and join_keys_2 should have the same length; if they don't, the extra keys are silently dropped and you will need to adjust your code accordingly.

from pyspark.sql.functions import expr

# Sample input data
DP1 = spark.createDataFrame([(1, "A", 10), (2, "B", 20), (3, "C", 30)], ["id", "value", "col1"])
DP2 = spark.createDataFrame([(1, "X", 100), (2, "Y", 200), (3, "Z", 300)], ["id", "value", "col1"])
join_keys_1 = ["id", "value"]
join_keys_2 = ["id", "col1"]

# Zip the join keys into pairs
join_pairs = zip(join_keys_1, join_keys_2)

# Create join clause
join_clause = " AND ".join([f"DP1.{pair[0]} = DP2.{pair[1]}" for pair in join_pairs])

# Perform inner join on the aliased DataFrames
result = DP1.alias("DP1").join(DP2.alias("DP2"), expr(join_clause), "inner")

# Show result
result.show()
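
The same Column-based approach works here too. The sketch below is an assumption rather than part of the original answer: it zips the two key lists into per-pair equality conditions, folds them with a lambda, and raises early if the lists have different lengths (since zip would otherwise silently drop the extra keys):

from functools import reduce

# Guard against silently dropped keys: zip stops at the shorter list
if len(join_keys_1) != len(join_keys_2):
    raise ValueError("join_keys_1 and join_keys_2 must have the same length")

# One equality Column per key pair, combined with AND via a lambda
conditions = [DP1[k1] == DP2[k2] for k1, k2 in zip(join_keys_1, join_keys_2)]
result = DP1.join(DP2, reduce(lambda left, right: left & right, conditions), "inner")
result.show()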