{"id":18106,"date":"2023-04-16T22:46:51","date_gmt":"2023-04-17T04:46:51","guid":{"rendered":"http:\/\/www.designandexecute.com\/designs\/?p=18106"},"modified":"2023-04-16T22:48:50","modified_gmt":"2023-04-17T04:48:50","slug":"pyspark-using-lamba-to-create-dynamic-join-clause-based-on-a-list","status":"publish","type":"post","link":"https:\/\/www.designandexecute.com\/designs\/pyspark-using-lamba-to-create-dynamic-join-clause-based-on-a-list\/","title":{"rendered":"pyspark: using Lamba to create dynamic join clause based on a list"},"content":{"rendered":"\n<p>In this example, we first define two input data providers <code>DP1<\/code> and <code>DP2<\/code>, each with 5 columns <code>id<\/code> and <code>value<\/code>. We also define a list of join keys <code>join_keys<\/code> (in this case just &#8220;id&#8221;).<\/p>\n\n\n\n<p>Next, we use a list comprehension to create a string <code>join_clause<\/code> containing the join condition based on the join keys. We use the <code>col<\/code> function from <code>pyspark.sql.functions<\/code> to specify the column names.<\/p>\n\n\n\n<p>Finally, we perform an inner join on <code>DP1<\/code> and <code>DP2<\/code> using <code>join_clause<\/code> as the join condition, and store the result in <code>result<\/code>. We then show the result using the <code>show()<\/code> method.<\/p>\n\n\n\n<p>Note that in a real-world scenario, you may need to modify this code to fit the specific requirements of your data and use case.<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">from pyspark.sql.functions import col<\/pre>\n\n\n\n<h1 class=\"wp-block-heading\">Sample input data<\/h1>\n\n\n\n<pre class=\"wp-block-preformatted\">DP1 = spark.createDataFrame([(1, \"A\", 10, 100, \"X\"), (2, \"B\", 20, 200, \"Y\"), (3, \"C\", 30, 300, \"Z\")], [\"id\", \"value\", \"col1\", \"col2\", \"col3\"])\n\nDP2 = spark.createDataFrame([(1, \"X\", 1000, 10000, \"A\"), (2, \"Y\", 2000, 20000, \"B\"), (3, \"Z\", 3000, 30000, \"C\")], [\"id\", \"value\", \"col1\", \"col2\", \"col3\"])\n\njoin_keys = [\"id\"]<\/pre>\n\n\n\n<h1 class=\"wp-block-heading\">Create join clause based on join keys<\/h1>\n\n\n\n<pre class=\"wp-block-preformatted\">join_clause = \" AND \".join([f\"DP1.{key} = DP2.{key}\" for key in join_keys])<\/pre>\n\n\n\n<h1 class=\"wp-block-heading\">Perform inner join on DP1 and DP2<\/h1>\n\n\n\n<pre class=\"wp-block-preformatted\">result = DP1.join(DP2, col(join_clause), \"inner\")<\/pre>\n\n\n\n<h1 class=\"wp-block-heading\">Show result<\/h1>\n\n\n\n<pre class=\"wp-block-preformatted\">result.show() <\/pre>\n\n\n\n<p>In this example, we use the <code>zip<\/code> function to create a list of pairs of corresponding elements from the <code>join_keys_1<\/code> and <code>join_keys_2<\/code> lists.<\/p>\n\n\n\n<p>Next, we use these pairs to create the join condition using a list comprehension. For each pair of keys, we create a string that compares the corresponding columns in <code>DP1<\/code> and <code>DP2<\/code> using the <code>=<\/code> operator. We join these strings using the <code>AND<\/code> operator to create the full join condition.<\/p>\n\n\n\n<p>Finally, we perform an inner join on <code>DP1<\/code> and <code>DP2<\/code> using <code>join_clause<\/code> as the join condition, and display the result.<\/p>\n\n\n\n<p>Note that the <code>zip<\/code> function requires that the two lists have the same length. If your lists have different lengths, you will need to adjust your code accordingly.<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">from pyspark.sql.functions import col<\/pre>\n\n\n\n<h1 class=\"wp-block-heading\">Sample input data<\/h1>\n\n\n\n<pre class=\"wp-block-preformatted\">DP1 = spark.createDataFrame([(1, \"A\", 10), (2, \"B\", 20), (3, \"C\", 30)], [\"id\", \"value\", \"col1\"])<br>\nDP2 = spark.createDataFrame([(1, \"X\", 100), (2, \"Y\", 200), (3, \"Z\", 300)], [\"id\", \"value\", \"col1\"])<br>\njoin_keys_1 = [\"id\", \"value\"]<br>\njoin_keys_2 = [\"id\", \"col1\"]<\/pre>\n\n\n\n<h1 class=\"wp-block-heading\">Zip the join_keys<\/h1>\n\n\n\n<pre class=\"wp-block-preformatted\">join_pairs = zip(join_keys_1, join_keys_2)<\/pre>\n\n\n\n<h1 class=\"wp-block-heading\">Create join clause<\/h1>\n\n\n\n<pre class=\"wp-block-preformatted\">join_clause = \" AND \".join([f\"DP1.{pair[0]} = DP2.{pair[1]}\" for pair in join_pairs])<\/pre>\n\n\n\n<h1 class=\"wp-block-heading\">Perform inner join on DP1 and DP2<\/h1>\n\n\n\n<pre class=\"wp-block-preformatted\">result = DP1.join(DP2, col(join_clause), \"inner\")<\/pre>\n\n\n\n<h1 class=\"wp-block-heading\">Show result<\/h1>\n\n\n\n<pre class=\"wp-block-preformatted\">result.show()<\/pre>\n","protected":false},"excerpt":{"rendered":"<p>In this example, we first define two input data providers DP1 and DP2, each with 5 columns id and value. We also define a list of join keys join_keys (in this case just &#8220;id&#8221;). Next, we use a list comprehension to create a string join_clause containing the join condition based on the join keys. We [&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":18113,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[31],"tags":[],"class_list":["post-18106","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-bi-data-warehouse"],"jetpack_featured_media_url":"https:\/\/www.designandexecute.com\/designs\/wp-content\/uploads\/2023\/04\/pyspark.jpg","_links":{"self":[{"href":"https:\/\/www.designandexecute.com\/designs\/wp-json\/wp\/v2\/posts\/18106","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.designandexecute.com\/designs\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.designandexecute.com\/designs\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.designandexecute.com\/designs\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/www.designandexecute.com\/designs\/wp-json\/wp\/v2\/comments?post=18106"}],"version-history":[{"count":5,"href":"https:\/\/www.designandexecute.com\/designs\/wp-json\/wp\/v2\/posts\/18106\/revisions"}],"predecessor-version":[{"id":18112,"href":"https:\/\/www.designandexecute.com\/designs\/wp-json\/wp\/v2\/posts\/18106\/revisions\/18112"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.designandexecute.com\/designs\/wp-json\/wp\/v2\/media\/18113"}],"wp:attachment":[{"href":"https:\/\/www.designandexecute.com\/designs\/wp-json\/wp\/v2\/media?parent=18106"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.designandexecute.com\/designs\/wp-json\/wp\/v2\/categories?post=18106"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.designandexecute.com\/designs\/wp-json\/wp\/v2\/tags?post=18106"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}