Getting empty dataframe after foreachPartition execution in Pyspark - python

I'm fairly new to PySpark. I'm trying to run a foreachPartition function on my dataframe and then perform another function with the same dataframe.
The problem is that after calling foreachPartition, my dataframe comes back empty, so I cannot do anything else with it. My code looks like the following:
def my_random_function(partition, parameters):
    # performs something with the rows of the partition
    # does not return anything

my_py_spark_dataframe.foreachPartition(
    lambda partition: my_random_function(partition, parameters))
Could someone tell me how I can perform this foreachPartition and still use the same dataframe for other operations?
I saw some users suggesting copying the dataframe with df.toPandas().copy(), but in my case this causes performance issues, so I would like to use the same dataframe instead of creating a new one.
Thank you in advance!

It is not clear which operation you are trying to perform, but here is a sample usage of foreachPartition:
The sample data is a list of countries from three continents:
+---------+-------+
|Continent|Country|
+---------+-------+
| NA| USA|
| NA| Canada|
| NA| Mexico|
| EU|England|
| EU| France|
| EU|Germany|
| ASIA| India|
| ASIA| China|
| ASIA| Japan|
+---------+-------+
The following code partitions the data by "Continent", iterates over each partition using foreachPartition, and writes the "Country" names to a file for that specific partition, i.e. continent.
from pyspark.sql import functions as F

df = spark.createDataFrame(
    data=[["NA", "USA"], ["NA", "Canada"], ["NA", "Mexico"], ["EU", "England"], ["EU", "France"],
          ["EU", "Germany"], ["ASIA", "India"], ["ASIA", "China"], ["ASIA", "Japan"]],
    schema=["Continent", "Country"])
df.withColumn("partition_id", F.spark_partition_id()).show()

df = df.repartition(F.col("Continent"))
df.withColumn("partition_id", F.spark_partition_id()).show()

def write_to_file(rows):
    for row in rows:
        with open(f"/content/sample_data/{row.Continent}.txt", "a+") as f:
            f.write(f"{row.Country}\n")

df.foreachPartition(write_to_file)
Output:
Three files: one for each partition.
!ls -1 /content/sample_data/
ASIA.txt
EU.txt
NA.txt
Each file has country names for that continent (partition):
!cat /content/sample_data/ASIA.txt
India
China
Japan
!cat /content/sample_data/EU.txt
England
France
Germany
!cat /content/sample_data/NA.txt
USA
Canada
Mexico
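Note that foreachPartition is an action that runs your function on the executors; it does not consume or modify the DataFrame itself, so the same df remains usable afterwards. A minimal check along those lines, using the df defined above:

# The same DataFrame can still be used after the foreachPartition action completes.
df.show()
print(df.count())  # 9, as before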

Related

How to update Spark DataFrame Column Values of a table from another table based on a condition using Pyspark

I would like to compare 2 dataframes in pyspark.
Below is my test case dataset (from Google).
So I have 2 dataframes:
Base DF
Secondary DF
baseDF
No,Name,Sal,Address,Dept,Join_Date
11,Sam,1000,ind,IT,2/11/2019
22,Tom,2000,usa,HR,2/11/2019
33,Kom,3500,uk,IT,2/11/2019
44,Nom,4000,can,HR,2/11/2019
55,Vom,5000,mex,IT,2/11/2019
66,XYZ,5000,mex,IT,2/11/2019
secDF
No,Name,Sal,Address,Dept,Join_Date
11,Sam,1000,ind,ITA,2/11/2019
22,Tom,2500,usa,HRA,2/11/2019
33,Kom,3000,uk,ITA,2/11/2019
44,Nom,4600,can,HRA,2/11/2019
55,Vom,8000,mex,ITA,2/11/2019
77,XYZ,5000,mex,ITA,2/11/2019
I have to compare secDF and baseDF using 2 keys (No and Name). If those fields match (I only need the matched records from secDF), then I have to update the Sal and Dept fields of baseDF with the values from secDF.
Expected output
No,Name,Sal,Address,Dept,Join_Date
11,Sam,1000,ind,ITA,2/11/2019
22,Tom,2500,usa,HRA,2/11/2019
33,Kom,3000,uk,ITA,2/11/2019
44,Nom,4600,can,HRA,2/11/2019
55,Vom,8000,mex,ITA,2/11/2019
66,XYZ,5000,mex,IT,2/11/2019
Using pyspark I could use subtract() to find the values of table1 not present in table2 and then unionAll the two tables, or should I use withColumn to overwrite the values that satisfy the condition?
Could someone suggest a good way of doing this?
You can do a left join and coalesce the resulting Sal and Dept columns, with secdf taking precedence over basedf:
import pyspark.sql.functions as F

result = basedf.alias('basedf').join(
    secdf.alias('secdf'),
    ['No', 'Name'],
    'left'
).select(
    [F.coalesce('secdf.Sal', 'basedf.Sal').alias('Sal')
     if c == 'Sal'
     else F.coalesce('secdf.Dept', 'basedf.Dept').alias('Dept')
     if c == 'Dept'
     else f'basedf.{c}'
     for c in basedf.columns]
)
result.show()
+---+----+----+-------+----+---------+
| No|Name| Sal|Address|Dept|Join_Date|
+---+----+----+-------+----+---------+
| 11| Sam|1000| ind| ITA|2/11/2019|
| 22| Tom|2500| usa| HRA|2/11/2019|
| 33| Kom|3000| uk| ITA|2/11/2019|
| 44| Nom|4600| can| HRA|2/11/2019|
| 55| Vom|8000| mex| ITA|2/11/2019|
| 66| XYZ|5000| mex| IT|2/11/2019|
+---+----+----+-------+----+---------+
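If the column list is fixed, a simpler equivalent select (assuming the same aliases and column names as above) spells out the pass-through columns and coalesces only Sal and Dept:

result = basedf.alias('basedf').join(
    secdf.alias('secdf'), ['No', 'Name'], 'left'
).select(
    'No',
    'Name',
    F.coalesce('secdf.Sal', 'basedf.Sal').alias('Sal'),
    'basedf.Address',
    F.coalesce('secdf.Dept', 'basedf.Dept').alias('Dept'),
    'basedf.Join_Date',
)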

Create one dataframe from multi csv files with different headers in Spark

In Spark, with PySpark, I want to create one dataframe from a path (which is actually a folder in S3) that contains multiple CSV files with some common columns and some different columns.
To say it more simply, I want a single dataframe built from multiple CSV files with different headers.
For example, I can have one file with the header "raw_id, title, civility", and another file with the header "raw_id, first_name, civility".
This is my code in Python 3:
df = spark.read.load(
    s3_bucket + 'data/contacts/normalized' + '/*/*/*/*',
    format='csv',
    delimiter='|',
    encoding='utf-8',
    header='true',
    quote=''
)
This is an example of file_1.csv:
|raw_id|title|civility|
|1     |M    |male    |
And an example of file2.csv:
|raw_id|first_name|civility|
|2     |Tom       |male    |
The result I expect in my dataframe is:
|raw_id|first_name|title|civility|
|1     |          |M    |male    |
|2     |Tom       |     |male    |
But what happens is that I get all the columns united, yet after the first file the data is no longer in the right place.
Do you know how to do this?
Thank you very much in advance.
You need to load each of them into a separate dataframe and join them together on the raw_id column; a sketch of that approach follows.
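A minimal sketch of that approach, assuming two hypothetical paths (path1 and path2) for the two kinds of files, and joining on raw_id plus any other columns the files share so those are not duplicated in the result:

df1 = spark.read.csv(path1, sep='|', header=True, encoding='utf-8')
df2 = spark.read.csv(path2, sep='|', header=True, encoding='utf-8')

# Join on the columns both files share (e.g. raw_id and civility);
# an outer join keeps the rows that appear in only one of the files.
common = [c for c in df1.columns if c in df2.columns]
result = df1.join(df2, on=common, how='outer')
result.show()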

Pandas: Why are my headers being inserted into the first row of my dataframe?

I have a script that collates sets of tags from other dataframes, converts them into a comma-separated string, and adds all of this to a new dataframe. If I use pd.read_csv to generate the dataframe, the first entry is what I expect it to be. However, if I use the df_empty script (below), then I get a copy of the headers in that first row instead of the data I want. The only difference I have made is generating a new dataframe instead of loading one.
The resultData = pd.read_csv() reads a .csv file with the following headers and no additional information:
Sheet, Cause, Initiator, Group, Effects
The df_empty script is as follows:
def df_empty(columns, dtypes, index=None):
    assert len(columns) == len(dtypes)
    df = pd.DataFrame(index=index)
    for c, d in zip(columns, dtypes):
        df[c] = pd.Series(dtype=d)
    return df
# https://stackoverflow.com/a/48374031
# Usage: df = df_empty(['a', 'b'], dtypes=[np.int64, np.int64])
My script contains the following line to create the dataframe:
resultData = df_empty(['Sheet','Cause','Initiator','Group','Effects'],[np.str,np.int64,np.str,np.str,np.str])
I've also used the following with no differences:
resultData = df_empty(['Sheet','Cause','Initiator','Group','Effects'],['object','int64','object','object','object'])
My script to collate the data and add it to my dataframe is as follows:
data = {'Sheet': sheetNum, 'Cause': causeNum, 'Initiator': initTag, 'Group': grp, 'Effects': effectStr}
count = len(resultData)
resultData.at[count,:] = data
When I run display(data), I get the following in Jupyter:
{'Sheet': '0001',
'Cause': 1,
'Initiator': 'Tag_I1',
'Group': 'DIG',
'Effects': 'Tag_O1, Tag_O2,...'}
What I want to see with both options / what I get when reading the csv:
+-------+-------+-----------+-------+--------------------+
| Sheet | Cause | Initiator | Group | Effects |
+-------+-------+-----------+-------+--------------------+
| 0001 | 1 | Tag_I1 | DIG | Tag_O1, Tag_O2,... |
| 0001 | 2 | Tag_I2 | DIG | Tag_O2, Tag_04,... |
+-------+-------+-----------+-------+--------------------+
What I see when generating a dataframe with df_empty:
+-------+-------+-----------+-------+--------------------+
| Sheet | Cause | Initiator | Group | Effects |
+-------+-------+-----------+-------+--------------------+
| Sheet | Cause | Initiator | Group | Effects |
| 0001 | 2 | Tag_I2 | DIG | Tag_O2, Tag_04,... |
+-------+-------+-----------+-------+--------------------+
Any ideas on what might be causing the generated dataframe to copy my headers into the first row, and whether it is possible for me to avoid having to read an otherwise empty csv?
Thanks!
Why? Because you've inserted the first row as data. The magic behaviour of treating the first row as a header lives in read_csv(); if you create your dataframe without read_csv, the first row is not treated specially.
Solution? Skip the first row when inserting into the data frame generated by df_empty, as in the sketch below.
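A minimal sketch of that fix, assuming the collation step yields each record as a dict like the data shown above and that its first record is just a copy of the headers (the records iterable is hypothetical):

import pandas as pd

columns = ['Sheet', 'Cause', 'Initiator', 'Group', 'Effects']
resultData = df_empty(columns, ['object', 'int64', 'object', 'object', 'object'])

for record in records:  # hypothetical: whatever the collation script produces
    # Skip the record that is only a copy of the header row.
    if list(record.values()) == columns:
        continue
    resultData.loc[len(resultData)] = pd.Series(record)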

Filtering Spark Dataframe

I've created a dataframe as:
ratings = imdb_data.sort('imdbRating').select('imdbRating').filter('imdbRating is NOT NULL')
Upon doing ratings.show(), as shown below, I can see that the imdbRating field has a mixed type of data, such as random strings, movie titles, movie URLs and actual ratings. So the dirty data looks like this:
+--------------------+
| imdbRating|
+--------------------+
|Mary (TV Episode...|
| Paranormal Activ...|
| Sons (TV Episode...|
| Spion (2011)|
| Winter... und Fr...|
| and Gays (TV Epi...|
| grAs - Die Serie...|
| hat die Wahl (2000)|
| 1.0|
| 1.3|
| 1.4|
| 1.5|
| 1.5|
| 1.5|
| 1.6|
| 1.6|
| 1.7|
| 1.9|
| 1.9|
| 1.9|
+--------------------+
only showing top 20 rows
Is there any way I can filter out the unwanted strings and just get the ratings? I tried using a UDF as:
ratings_udf = udf(lambda imdbRating: imdbRating if isinstance(imdbRating, float) else None)
and tried calling it as:
ratings = imdb_data.sort('imdbRating').select('imdbRating')
filtered = ratings.withColumn('imdbRating', ratings_udf(ratings.imdbRating))
The problem with the above is that, since the udf is called on each row, each row of the dataframe is mapped to a Row type, and hence None is returned for all the values.
Is there any straightforward way to filter out this data?
Any help will be much appreciated. Thank you.
Finally, I was able to resolve it. The problem was that some rows were corrupt and did not have all the fields present. First, I tried using pandas, reading the csv file as:
pd_frame = pd.read_csv('imdb.csv', error_bad_lines=False)
This skipped/dropped the corrupt rows, which had fewer columns than the rest. I then tried to load the above pandas dataframe, pd_frame, into Spark using:
imdb_data = spark.createDataFrame(pd_frame)
but got an error because of a mismatch while inferring the schema. It turns out the Spark csv reader has a similar option that drops the corrupt rows:
imdb_data = spark.read.csv('imdb.csv', header='true', mode='DROPMALFORMED')
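Not part of the original fix, but as an extra cleanup step any remaining non-numeric values could be removed by casting the column to double, which turns strings that are not valid numbers into nulls; a sketch assuming the same imdb_data dataframe:

from pyspark.sql import functions as F

ratings = (imdb_data
           .select(F.col('imdbRating').cast('double').alias('imdbRating'))
           .filter(F.col('imdbRating').isNotNull())
           .sort('imdbRating'))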

VLOOKUP/ETL with Python

I have data that comes from MS SQL Server. The query returns a list of names straight from a public database. For instance, if I wanted records with the name "Microwave", something like this would happen:
Microwave
Microwvae
Mycrowwave
Microwavee
Microwave would be spelt in hundreds of ways. I currently solve this with a VLOOKUP in Excel. It looks up the value in the left-hand cell and returns the value on its right, for example:
VLOOKUP(A1,$A$1:$B$4,2,FALSE)
Table:
A B
1 Microwave Microwave
2 Microwvae Microwave
3 Mycrowwave Microwave
4 Microwavee Microwave
I would just copy the VLOOKUP formula down the CSV or Excel file and then use that information for my analysis.
Is there a way to solve this issue in Python instead?
I could make a long if/elif list, or even a replace list, and apply it to each line of the csv, but that would save no more time than just using the VLOOKUP. There are thousands of company names spelt wrong and I do not have clearance to change the database.
So Stack, any ideas on how to leverage Python in this scenario?
If you had data like this:
+-------------+-----------+
| typo | word |
+-------------+-----------+
| microweeve | microwave |
| microweevil | microwave |
| macroworv | microwave |
| murkeywater | microwave |
+-------------+-----------+
Save it as typo_map.csv
Then run (in the same directory):
import csv

def OpenToDict(path, index):
    with open(path, newline='') as f:
        reader = csv.reader(f)
        headings = next(reader)
        heading_nums = {}
        for i, v in enumerate(headings):
            heading_nums[v] = i
        fields = [heading for heading in headings if heading != index]
        file_dictionary = {}
        for row in reader:
            file_dictionary[row[heading_nums[index]]] = {}
            for field in fields:
                file_dictionary[row[heading_nums[index]]][field] = row[heading_nums[field]]
    return file_dictionary

typo_map = OpenToDict('typo_map.csv', 'typo')
print(typo_map['microweevil']['word'])
The structure is slightly more complex than it needs to be for your situation, but that's because this function was originally written to look up more than one column. It will still work for you, and you can simplify it yourself if you want.
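As a usage sketch (not in the original answer), the resulting dictionary can be applied across a whole column of names, falling back to the original spelling when a name is not in the map:

names = ['microweeve', 'macroworv', 'microwave']  # hypothetical input list
cleaned = [typo_map.get(n, {}).get('word', n) for n in names]
# -> ['microwave', 'microwave', 'microwave']; unknown names fall back to themselves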
