Pyspark lazy evaluation in loops too slow - python
First of all, I want to let you know that I am still very new to Spark and getting used to the lazy-evaluation concept.
Here is my issue:
I have two Spark DataFrames that I load by reading CSV.GZ files.
What I am trying to do is merge both tables in order to split the first table according to keys that I have in the second one.
For example:
Table A
+----------+---------+--------+---------+------+
|      Date|     Zone|       X|     Type|Volume|
+----------+---------+--------+---------+------+
|2019-01-16|010010000|       B|        A|   684|
|2019-01-16|010020000|       B|        A| 21771|
|2019-01-16|010030000|       B|        A|  7497|
|2019-01-16|010040000|       B|        A| 74852|
Table B
+----+---------+
|Dept|     Zone|
+----+---------+
|  01|010010000|
|  02|010020000|
|  01|010030000|
|  02|010040000|
Then when I merge both tables I have:
+---------+----------+--------+---------+------+----+
|     Zone|      Date|       X|     Type|Volume|Dept|
+---------+----------+--------+---------+------+----+
|010010000|2019-01-16|       B|        A|   684|  01|
|010020000|2019-01-16|       B|        A| 21771|  02|
|010030000|2019-01-16|       B|        A|  7497|  01|
|010040000|2019-01-16|       B|        A| 74852|  02|
So what I want to do is split this table into Y disjoint tables, where Y is the number of distinct 'Dept' values found in the merged table.
So for example:
Result1:
+---------+----------+--------+---------+------+----+
|     Zone|      Date|       X|     Type|Volume|Dept|
+---------+----------+--------+---------+------+----+
|010010000|2019-01-16|       B|        A|   684|  01|
|010030000|2019-01-16|       B|        A|  7497|  01|
Result2:
+---------+----------+--------+---------+------+----+
|     Zone|      Date|       X|     Type|Volume|Dept|
+---------+----------+--------+---------+------+----+
|010020000|2019-01-16|       B|        A| 21771|  02|
|010040000|2019-01-16|       B|        A| 74852|  02|
My code looks like this:
import os

sp_df_A = spark.read.csv(file_path_A, header=True, sep=';', encoding='cp1252')
sp_df_B = spark.read.csv(file_path_B, header=True, sep=';', encoding='cp1252')

sp_merged_df = sp_df_A.join(sp_df_B, on=['Zone'], how='left')

# list of unique 'Dept' values on the merged DataFrame
unique_buckets = [x['Dept'] for x in sp_merged_df.select('Dept').distinct().collect()]

# Iterate over all 'Dept' found
for zone_bucket in unique_buckets:
    print(zone_bucket)
    bucket_dir = os.path.join(output_dir, 'Zone_%s' % zone_bucket)
    if not os.path.exists(bucket_dir):
        os.mkdir(bucket_dir)
    # Filter target 'Dept'
    tmp_df = sp_merged_df.filter(sp_merged_df['Dept'] == zone_bucket)
    # write result
    tmp_df.write.format('com.databricks.spark.csv') \
        .option('codec', 'org.apache.hadoop.io.compress.GzipCodec') \
        .save(bucket_dir, header='true')
The thing is that this very simple code takes far too long to write its results. So my guess is that the lazy evaluation is loading, merging and filtering the data on every iteration of the loop.
Can this be the case?
Your guess is correct. Your code reads, joins and filters all the data again for each of the buckets. This is indeed caused by the lazy evaluation of Spark.
Spark defers every data transformation until an action is performed. When an action is called, Spark looks at all the transformations and creates a plan for how to efficiently obtain the result of that action. While Spark executes this plan, the program waits. When Spark is done, the program continues, and Spark "forgets" everything it has done until the next action is called.
In your case, Spark "forgets" the joined DataFrame sp_merged_df, so each time .collect() or .save() is called it reconstructs it.
If you want Spark to "remember" an RDD or DataFrame, you can .cache() it (see the docs).
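For example, here is a minimal sketch of how the caching could be applied to the loop from the question. It reuses sp_df_A, sp_df_B and output_dir from the question's code; the .count() call just forces the cache to be populated, and switching to the built-in DataFrameWriter.csv writer (instead of the com.databricks.spark.csv format string) is only a simplification for the sketch, not something the fix requires:

import os

# Join once, then cache the result so every iteration of the loop reuses the
# same in-memory data instead of re-reading and re-joining the CSV files.
sp_merged_df = sp_df_A.join(sp_df_B, on=['Zone'], how='left').cache()
sp_merged_df.count()  # any action here forces the cache to be populated

unique_buckets = [row['Dept'] for row in
                  sp_merged_df.select('Dept').distinct().collect()]

for zone_bucket in unique_buckets:
    bucket_dir = os.path.join(output_dir, 'Zone_%s' % zone_bucket)
    (sp_merged_df
        .filter(sp_merged_df['Dept'] == zone_bucket)
        .write
        .csv(bucket_dir, header=True, compression='gzip'))

# Release the cached data once it is no longer needed.
sp_merged_df.unpersist()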
Related
PySpark - Transform multiple columns without using udf
I have a df like this one:

df = spark.createDataFrame(
    [("1", "Apple", "cat"), ("2", "2.", "house"), ("3", "<strong>text</strong>", "HeLlo 2.5")],
    ["id", "text1", "text2"])

+---+---------------------+---------+
| id|                text1|    text2|
+---+---------------------+---------+
|  1|                Apple|      cat|
|  2|                   2.|    house|
|  3|<strong>text</strong>|HeLlo 2.5|
+---+---------------------+---------+

multiple functions to clean text like

def remove_html_tags(text):
    document = html.fromstring(text)
    return " ".join(etree.XPath("//text()")(document))

def lowercase(text):
    return text.lower()

def remove_wrong_dot(text):
    return re.sub(r'(?<!\d)[.,;:]|[.,;:](?!\d)', ' ', text)

and a list of columns to clean

COLS = ["text1", "text2"]

I would like to apply the functions to the columns in the list and also keep the original text:

+---+---------------------+-----------+---------+-----------+
| id|                text1|text1_clean|    text2|text2_clean|
+---+---------------------+-----------+---------+-----------+
|  1|                Apple|      apple|      cat|        cat|
|  2|                   2.|          2|    house|      house|
|  3|<strong>text</strong>|       text|HeLlo 2.5|  hello 2.5|
+---+---------------------+-----------+---------+-----------+

I already have an approach using UDF but it is not very efficient. I've been trying something like:

rdds = []
for col in TEXT_COLS:
    rdd = df.rdd.map(lambda x: (x[col], lowercase(x[col])))
    rdds.append(rdd.collect())
return df

My idea would be to join all rdds in the list, but I don't know how efficient this would be or how to list more functions. I appreciate any ideas or suggestions.

EDIT: Not all transformations can be done with regexp_replace. For example, the text can include nested html labels and in that case a simple replace wouldn't work, or I don't want to replace all dots, only those at the end or beginning of substrings.
Spark built-in functions can do all the transformations you wanted:

from pyspark.sql import functions as F

cols = ["text1", "text2"]

for c in cols:
    df = (df
          .withColumn(f'{c}_clean', F.lower(c))
          .withColumn(f'{c}_clean', F.regexp_replace(f'{c}_clean', '<[^>]+>', ''))
          .withColumn(f'{c}_clean', F.regexp_replace(f'{c}_clean', '(?<!\d)[.,;:]|[.,;:](?!\d)', ''))
         )

+---+--------------------+---------+-----------+-----------+
| id|               text1|    text2|text1_clean|text2_clean|
+---+--------------------+---------+-----------+-----------+
|  1|               Apple|      cat|      apple|        cat|
|  2|                  2.|    house|          2|      house|
|  3|<strong>text</str...|HeLlo 2.5|       text|  hello 2.5|
+---+--------------------+---------+-----------+-----------+
Translating a SAS Ranking with Tie set to HIGH into PySpark
I'm trying to replicate the following SAS code in PySpark:

PROC RANK DATA = aud_baskets OUT = aud_baskets_ranks GROUPS=10 TIES=HIGH;
BY customer_id;
VAR expenditure;
RANKS basket_rank;
RUN;

The idea is to rank all expenditures under each customer_id block. The data would look like this:

+-----------+--------------+-----------+
|customer_id|transaction_id|expenditure|
+-----------+--------------+-----------+
|          A|             1|         34|
|          A|             2|         90|
|          B|             1|         89|
|          A|             3|          6|
|          B|             2|          8|
|          B|             3|          7|
|          C|             1|         96|
|          C|             2|          9|
+-----------+--------------+-----------+

In PySpark, I tried this:

spendWindow = Window.partitionBy('customer_id').orderBy(col('expenditure').asc())
aud_baskets = (aud_baskets_ranks.withColumn('basket_rank', ntile(10).over(spendWindow)))

The problem is that PySpark doesn't let the user change the way it handles ties, like SAS does (that I know of). I need to set this behavior in PySpark so that values are moved up to the next tier each time one of those edge cases occurs, as opposed to dropping them to the rank below. Or is there a way to custom-write this approach?
Use dense_rank: it will give the same rank in case of ties, and the next rank will not be skipped. The ntile function, by contrast, splits the group of records in each partition into n parts (in your case 10).

from pyspark.sql.functions import dense_rank, col
from pyspark.sql.window import Window

spendWindow = Window.partitionBy('customer_id').orderBy(col('expenditure').asc())
aud_baskets = aud_baskets_ranks.withColumn('basket_rank', dense_rank().over(spendWindow))
Try the following code. It is generated by an automated tool called SPROCKET. It should take care of ties.

df = (aud_baskets)
for (colToRank, rankedName) in zip(['expenditure'], ['basket_rank']):
    wA = Window.orderBy(asc(colToRank))
    df_w_rank = (df.withColumn('raw_rank', rank().over(wA)))
    ties = df_w_rank.groupBy('raw_rank').count().filter("""count > 1""")
    df_w_rank = (df_w_rank.join(ties, ['raw_rank'], 'left')
                 .withColumn(rankedName,
                             expr("""case when count is not null
                                          then (raw_rank + count - 1)
                                          else raw_rank end""")))
    rankedNameGroup = rankedName
    n = df_w_rank.count()
    df_with_rank_groups = (df_w_rank
                           .withColumn(rankedNameGroup,
                                       expr("""FLOOR({rankedName} * {k}/({n}+1))""".format(
                                           k=10, n=n, rankedName=rankedName))))
    df = df_with_rank_groups
aud_baskets_ranks = df_with_rank_groups.drop('raw_rank', 'count')
How can I access a specific column from Spark Data frame in python?
My DataFrame looks like this:

+------+-------+
|cat_id|counter|
+------+-------+
|    12|  61060|
|     1| 542118|
|    13| 164700|
|     3| 406622|
|     5|  54902|
|    10| 118281|
|    11|  13658|
|    14|  72229|
|     2| 131206|
+------+-------+

The query to get the above data frame is:

grouped_data = dataframe.groupBy("cat_id").agg(count("*").alias("counter"))

Now I need to read the values for the different cat_id to save them in another database. The way I can get it done is by using a for loop over my ids:

for cat_id in cat_ids_map:
    statsCount = grouped_data.select("counter").filter("cat_id = " + cat_id).collect()[0].counter

But I think there can be a better way to read the counter without a for loop. Any suggestions would be helpful! Thanks
If you're going to iterate through the entire dataframe, the way to do it is usually with a .foreach function. So you would do:

grouped_data.foreach(lambda x: f(x))

where f is your function that will do whatever you want with each element in the dataframe.
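As a rough sketch of what f can look like when each row has to be pushed to another database (as in the question) — save_row and save_partition are hypothetical helpers, not an existing API, while foreach and foreachPartition are real DataFrame methods:

def save_row(row):
    # Hypothetical helper: persist one (cat_id, counter) pair to the target
    # database; replace the print with the actual insert/update logic.
    print(row.cat_id, row.counter)

grouped_data.foreach(save_row)

# If opening a connection per row is too expensive, foreachPartition runs the
# function once per partition, so a single connection can be reused:
def save_partition(rows):
    # connection = open_db_connection()  # hypothetical connection setup
    for row in rows:
        print(row.cat_id, row.counter)
    # connection.close()

grouped_data.foreachPartition(save_partition)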
How to identify people's relationship based on name and address, and then assign the same ID, through a Linux command or PySpark
I have one csv file.

D,FNAME,MNAME,LNAME,GENDER,DOB,snapshot,Address
2,66M,J,Rock,F,1995,201211.0,J
3,David,HM,Lee,M,1991,201211.0,J
6,66M,,Rock,F,1990,201211.0,J
0,David,H M,Lee,M,1990,201211.0,B
3,Marc,H,Robert,M,2000,201211.0,C
6,Marc,M,Robert,M,1988,201211.0,C
6,Marc,MS,Robert,M,2000,201211.0,D

I want to assign persons with the same last name living at the same address the same ID or index. It's better that the ID is made up of only numbers. If persons have a different last name in the same place, then the ID should be different. Such an ID should be unique. Namely, for people who differ in either address or last name, the ID must be different. My expected output is:

D,FNAME,MNAME,LNAME,GENDER,DOB,snapshot,Address,ID
2,66M,J,Rock,F,1995,201211.0,J,11
3,David,HM,Lee,M,1991,201211.0,J,12
6,66M,,Rock,F,1990,201211.0,J,11
0,David,H M,Lee,M,1990,201211.0,B,13
3,Marc,H,Robert,M,2000,201211.0,C,14
6,Marc,M,Robert,M,1988,201211.0,C,14
6,Marc,MS,Robert,M,2000,201211.0,D,15

My data file size is around 30 GB. I am thinking of using the groupBy function in Spark, based on a key consisting of LNAME and Address, to group those observations together and then assign each group an ID by key. But I don't know how to do this. After that, maybe I can use flatMap to split the lines and return those observations with an ID, but I am not sure about it. In addition, can I also do it in a Linux environment? Thank you.
As discussed in the comments, the basic idea is to partition the data properly so that records with the same LNAME+Address stay in the same partition, run Python code to generate a separate idx on each partition and then merge them into the final id.

Note: I added some new rows to your sample records; see the result of df_new.show() shown below.

from pyspark.sql import Window, Row
from pyspark.sql.functions import coalesce, sum as fsum, col, max as fmax, lit, broadcast

# ...skip code to initialize the dataframe

# tweak the number of repartitioning N based on actual data size
N = 5

# Python function to iterate through the sorted list of elements in the same
# partition and assign an in-partition idx based on Address and LNAME.
def func(partition_id, it):
    idx, lname, address = (1, None, None)
    for row in sorted(it, key=lambda x: (x.LNAME, x.Address)):
        if lname and (row.LNAME != lname or row.Address != address):
            idx += 1
        yield Row(partition_id=partition_id, idx=idx, **row.asDict())
        lname = row.LNAME
        address = row.Address

# Repartition based on 'LNAME' and 'Address' and then run the mapPartitionsWithIndex()
# function to create the in-partition idx. Adjust N so that the records in each
# partition are small enough to be loaded into the executor memory:
df1 = df.repartition(N, 'LNAME', 'Address') \
        .rdd.mapPartitionsWithIndex(func) \
        .toDF()

Get the number of unique rows cnt (based on Address+LNAME) per partition, which is max('idx'), and then grab the running SUM of this as rcnt.

# idx:  calculated in-partition id
# cnt:  number of unique ids in the same partition: fmax('idx')
# rcnt: starting id for a partition (something like a running count): coalesce(fsum('cnt').over(w1), lit(0))
# w1:   WindowSpec to calculate the above rcnt
w1 = Window.partitionBy().orderBy('partition_id').rowsBetween(Window.unboundedPreceding, -1)

df2 = df1.groupby('partition_id') \
         .agg(fmax('idx').alias('cnt')) \
         .withColumn('rcnt', coalesce(fsum('cnt').over(w1), lit(0)))

df2.show()
+------------+---+----+
|partition_id|cnt|rcnt|
+------------+---+----+
|           0|  3|   0|
|           1|  1|   3|
|           2|  1|   4|
|           4|  1|   5|
+------------+---+----+

Join df1 with df2 and create the final id, which is idx + rcnt:

df_new = df1.join(broadcast(df2), on=['partition_id']).withColumn('id', col('idx') + col('rcnt'))

df_new.show()
#+------------+-------+---+----+-----+------+------+-----+---+--------+---+----+---+
#|partition_id|Address|  D| DOB|FNAME|GENDER| LNAME|MNAME|idx|snapshot|cnt|rcnt| id|
#+------------+-------+---+----+-----+------+------+-----+---+--------+---+----+---+
#|           0|      B|  0|1990|David|     M|   Lee|  H M|  1|201211.0|  3|   0|  1|
#|           0|      J|  3|1991|David|     M|   Lee|   HM|  2|201211.0|  3|   0|  2|
#|           0|      D|  6|2000| Marc|     M|Robert|   MS|  3|201211.0|  3|   0|  3|
#|           1|      C|  3|2000| Marc|     M|Robert|    H|  1|201211.0|  1|   3|  4|
#|           1|      C|  6|1988| Marc|     M|Robert|    M|  1|201211.0|  1|   3|  4|
#|           2|      J|  6|1991|  66M|     F|   Rek| null|  1|201211.0|  1|   4|  5|
#|           2|      J|  6|1992|  66M|     F|   Rek| null|  1|201211.0|  1|   4|  5|
#|           4|      J|  2|1995|  66M|     F|  Rock|    J|  1|201211.0|  1|   5|  6|
#|           4|      J|  6|1990|  66M|     F|  Rock| null|  1|201211.0|  1|   5|  6|
#|           4|      J|  6|1990|  66M|     F|  Rock| null|  1|201211.0|  1|   5|  6|
#+------------+-------+---+----+-----+------+------+-----+---+--------+---+----+---+

df_new = df_new.drop('partition_id', 'idx', 'rcnt', 'cnt')

Some notes:

Practically, you will need to clean out/normalize the columns LNAME and Address before using them as the uniqueness check. For example, use a separate column uniq_key which combines LNAME and Address as the unique key of the dataframe. See below for an example with some basic data cleansing procedures:

from pyspark.sql.functions import coalesce, lit, concat_ws, upper, regexp_replace, trim

#(1) convert NULL to '': coalesce(col, '')
#(2) concatenate LNAME and Address using the NULL char '\x00' or '\0'
#(3) convert to uppercase: upper(text)
#(4) remove all non-[word/whitespace/NULL_char]: regexp_replace(text, r'[^\x00\w\s]', '')
#(5) convert consecutive whitespaces to a SPACE: regexp_replace(text, r'\s+', ' ')
#(6) trim leading/trailing spaces: trim(text)
df = (df.withColumn('uniq_key',
          trim(
            regexp_replace(
              regexp_replace(
                upper(
                  concat_ws('\0', coalesce('LNAME', lit('')), coalesce('Address', lit('')))
                ),
                r'[^\x00\s\w]+',
                ''
              ),
              r'\s+',
              ' '
            )
          )
     ))

Then in the code, replace 'LNAME' and 'Address' with uniq_key to find the idx.

As mentioned by cronoik in the comments, you can also try one of the Window rank functions to calculate the in-partition idx. For example:

from pyspark.sql.functions import spark_partition_id, dense_rank

# use dense_rank to calculate the in-partition idx
w2 = Window.partitionBy('partition_id').orderBy('LNAME', 'Address')

df1 = df.repartition(N, 'LNAME', 'Address') \
        .withColumn('partition_id', spark_partition_id()) \
        .withColumn('idx', dense_rank().over(w2))

After you have df1, use the same methods as above to calculate df2 and df_new. This should be faster than using mapPartitionsWithIndex(), which is basically an RDD-based method.

For your real data, adjust N to fit your actual data size. This N only influences the initial partitions; after the dataframe join, the partitioning will be reset to the default (200). You can adjust this using spark.sql.shuffle.partitions, for example when you initialize the Spark session:

spark = SparkSession.builder \
    .... \
    .config("spark.sql.shuffle.partitions", 500) \
    .getOrCreate()
Since you have 30GB of input data, you probably don't want something that'll attempt to hold it all in in-memory data structures. Let's use disk space instead.

Here's one approach that loads all your data into a sqlite database, generates an id for each unique last name and address pair, and then joins everything back up together:

#!/bin/sh

csv="$1"

# Use an on-disk database instead of in-memory because source data is 30gb.
# This will take a while to run.
db=$(mktemp -p .)

sqlite3 -batch -csv -header "${db}" <<EOF
.import "${csv}" people
CREATE TABLE ids(id INTEGER PRIMARY KEY, lname, address, UNIQUE(lname, address));
INSERT OR IGNORE INTO ids(lname, address) SELECT lname, address FROM people;
SELECT p.*, i.id AS ID
FROM people AS p
JOIN ids AS i ON (p.lname, p.address) = (i.lname, i.address)
ORDER BY p.rowid;
EOF

rm -f "${db}"

Example:

$ ./makeids.sh data.csv
D,FNAME,MNAME,LNAME,GENDER,DOB,snapshot,Address,ID
2,66M,J,Rock,F,1995,201211.0,J,1
3,David,HM,Lee,M,1991,201211.0,J,2
6,66M,"",Rock,F,1990,201211.0,J,1
0,David,"H M",Lee,M,1990,201211.0,B,3
3,Marc,H,Robert,M,2000,201211.0,C,4
6,Marc,M,Robert,M,1988,201211.0,C,4
6,Marc,MS,Robert,M,2000,201211.0,D,5

It's better that ID is made up of only numbers.

If that restriction can be relaxed, you can do it in a single pass by using a cryptographic hash of the last name and address as the ID:

$ perl -MDigest::SHA=sha1_hex -F, -lane '
    BEGIN { $" = $, = "," }
    if ($. == 1) { print @F, "ID" }
    else { print @F, sha1_hex("@F[3,7]") }' data.csv
D,FNAME,MNAME,LNAME,GENDER,DOB,snapshot,Address,ID
2,66M,J,Rock,F,1995,201211.0,J,5c99211a841bd2b4c9cdcf72d7e95e46b2ae08b5
3,David,HM,Lee,M,1991,201211.0,J,c263f9d1feb4dc789de17a8aab8f2808aea2876a
6,66M,,Rock,F,1990,201211.0,J,5c99211a841bd2b4c9cdcf72d7e95e46b2ae08b5
0,David,H M,Lee,M,1990,201211.0,B,e86e81ab2715a8202e41b92ad979ca3a67743421
3,Marc,H,Robert,M,2000,201211.0,C,363ed8175fdf441ed59ac19cea3c37b6ce9df152
6,Marc,M,Robert,M,1988,201211.0,C,363ed8175fdf441ed59ac19cea3c37b6ce9df152
6,Marc,MS,Robert,M,2000,201211.0,D,cf5135dc402efe16cd170191b03b690d58ea5189

Or, if the number of unique lname, address pairs is small enough that they can reasonably be stored in a hash table on your system:

#!/usr/bin/gawk -f
BEGIN { FS = OFS = "," }
NR == 1 {
    print $0, "ID"
    next
}
! ($4, $8) in ids { ids[$4, $8] = ++counter }
{ print $0, ids[$4, $8] }
$ sort -t, -k8,8 -k4,4 <<EOD | awk -F, '
    $8","$4 != last { ++id; last = $8","$4 }
    { NR!=1 && $9=id; print }' id=9 OFS=,
D,FNAME,MNAME,LNAME,GENDER,DOB,snapshot,Address
2,66M,J,Rock,F,1995,201211.0,J
3,David,HM,Lee,M,1991,201211.0,J
6,66M,,Rock,F,1990,201211.0,J
0,David,H M,Lee,M,1990,201211.0,B
3,Marc,H,Robert,M,2000,201211.0,C
6,Marc,M,Robert,M,1988,201211.0,C
6,Marc,MS,Robert,M,2000,201211.0,D
EOD
D,FNAME,MNAME,LNAME,GENDER,DOB,snapshot,Address
0,David,H M,Lee,M,1990,201211.0,B,11
3,Marc,H,Robert,M,2000,201211.0,C,12
6,Marc,M,Robert,M,1988,201211.0,C,12
6,Marc,MS,Robert,M,2000,201211.0,D,13
3,David,HM,Lee,M,1991,201211.0,J,14
2,66M,J,Rock,F,1995,201211.0,J,15
6,66M,,Rock,F,1990,201211.0,J,15
$
Reading and grouping data to get count using python spark
I'm new to Spark with Python, and I'm trying to do some basic stuff to get an understanding of Python and Spark. I have a file like below:

empid||deptid||salary
1||10||500
2||10||200
3||20||300
4||20||400
5||20||100

I want to write a small Python Spark program to read and print the count of employees in each department. I've been working with databases and this is quite simple in SQL, but I'm trying to do it using Python and Spark. I don't have any code to share as I'm completely new to Python and Spark, but I wanted to understand how it works using a simple hands-on example.

I've installed pyspark and did some quick reading here: https://spark.apache.org/docs/latest/quick-start.html

From my understanding there are dataframes on which one can perform SQL-like operations such as group by, but I'm not sure how to write the proper code.
You can read the text file as a dataframe using:

# your sample file is delimited by '||' (the header line would still need to be filtered out)
df = spark.createDataFrame(
    sc.textFile("path/to/my/file").map(lambda l: l.split('||')),
    ["empid", "deptid", "salary"]
)

textFile loads the data sample as an RDD with only one column. Then we split each line through a map and convert it to a dataframe.

Starting from a python list of lists:

df = spark.createDataFrame(
    sc.parallelize([[1, 10, 500], [2, 10, 200], [3, 20, 300], [4, 20, 400], [5, 20, 100]]),
    ["empid", "deptid", "salary"]
)
df.show()

+-----+------+------+
|empid|deptid|salary|
+-----+------+------+
|    1|    10|   500|
|    2|    10|   200|
|    3|    20|   300|
|    4|    20|   400|
|    5|    20|   100|
+-----+------+------+

Now, to count the number of employees by department, we'll use a groupBy and then the count aggregation function:

df_agg = df.groupBy("deptid").count()
df_agg.show()

+------+-----+
|deptid|count|
+------+-----+
|    10|    2|
|    20|    3|
+------+-----+

For the max:

import pyspark.sql.functions as psf
df_agg.agg(psf.max("count")).show()