Filtering Spark Dataframe - python

I've created a dataframe as:
ratings = imdb_data.sort('imdbRating').select('imdbRating').filter('imdbRating is NOT NULL')
Upon doing ratings.show(), as shown below, I can see that the imdbRating field contains a mix of data such as random strings, movie titles, movie URLs, and actual ratings. So the dirty data looks like this:
+--------------------+
| imdbRating|
+--------------------+
|Mary (TV Episode...|
| Paranormal Activ...|
| Sons (TV Episode...|
| Spion (2011)|
| Winter... und Fr...|
| and Gays (TV Epi...|
| grAs - Die Serie...|
| hat die Wahl (2000)|
| 1.0|
| 1.3|
| 1.4|
| 1.5|
| 1.5|
| 1.5|
| 1.6|
| 1.6|
| 1.7|
| 1.9|
| 1.9|
| 1.9|
+--------------------+
only showing top 20 rows
Is there any way I can filter out the unwanted strings and just get the ratings? I tried using a UDF:
ratings_udf = udf(lambda imdbRating: imdbRating if isinstance(imdbRating, float) else None)
and tried calling it as:
ratings = imdb_data.sort('imdbRating').select('imdbRating')
filtered = ratings.withColumn('imdbRating', ratings_udf(ratings.imdbRating))
The problem with the above is that the column is of string type, so the UDF receives every value as a string rather than a float, and therefore None is returned for all the values.
Is there any straightforward way to filter out that data?
Any help will be much appreciated. Thank you.

Finally, I was able to resolve it. The problem was that some corrupt rows did not have all fields present. First, I tried using pandas by reading the csv file as:
pd_frame = pd.read_csv('imdb.csv', error_bad_lines=False)
This skipped/dropped the corrupt rows which had fewer columns than expected. I then tried to load the above pandas dataframe, pd_frame, into Spark using:
imdb_data= spark.createDataFrame(pd_frame)
but got an error because of a mismatch while inferring the schema. It turns out Spark's csv reader has a similar option which drops the corrupt rows:
imdb_data = spark.read.csv('imdb.csv', header='true', mode='DROPMALFORMED')
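For completeness, another way to drop the non-numeric values (a minimal sketch, not part of the original solution) is to cast the column to double, which turns strings that are not numbers into null, and then filter those nulls out:
from pyspark.sql import functions as F

# cast imdbRating to double; non-numeric strings become null and are filtered out
ratings = (imdb_data
           .withColumn('imdbRating', F.col('imdbRating').cast('double'))
           .filter(F.col('imdbRating').isNotNull())
           .sort('imdbRating')
           .select('imdbRating'))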

Related

Getting empty dataframe after foreachPartition execution in Pyspark

I'm fairly new to PySpark and I'm trying to run a foreachPartition function on my dataframe and then perform another function with the same dataframe.
The problem is that after using the foreachPartition function, my dataframe appears to be empty, so I cannot do anything else with it. My code looks like the following:
def my_random_function(partition, parameters):
    # performs something with the partition
    # does not return anything
    pass

my_py_spark_dataframe.foreachPartition(
    lambda partition: my_random_function(partition, parameters))
Could someone tell me how I can perform this foreachPartition and also use the same dataframe to perform other functions?
I saw some users talking about copying the dataframe using df.toPandas().copy(), but in my case this causes some performance issues, so I would like to use the same dataframe instead of creating a new one.
Thank you in advance!
It is not clear which operation you are trying to perform, but here is a sample usage of foreachPartition.
The sample data is a list of countries from three continents:
+---------+-------+
|Continent|Country|
+---------+-------+
| NA| USA|
| NA| Canada|
| NA| Mexico|
| EU|England|
| EU| France|
| EU|Germany|
| ASIA| India|
| ASIA| China|
| ASIA| Japan|
+---------+-------+
The following code repartitions the data by "Continent", iterates over each partition using foreachPartition, and writes the "Country" names to a separate file for each partition, i.e. continent.
from pyspark.sql import functions as F

df = spark.createDataFrame(
    data=[["NA", "USA"], ["NA", "Canada"], ["NA", "Mexico"], ["EU", "England"], ["EU", "France"],
          ["EU", "Germany"], ["ASIA", "India"], ["ASIA", "China"], ["ASIA", "Japan"]],
    schema=["Continent", "Country"])
df.withColumn("partition_id", F.spark_partition_id()).show()
df = df.repartition(F.col("Continent"))
df.withColumn("partition_id", F.spark_partition_id()).show()

def write_to_file(rows):
    for row in rows:
        with open(f"/content/sample_data/{row.Continent}.txt", "a+") as f:
            f.write(f"{row.Country}\n")

df.foreachPartition(write_to_file)
Output:
Three files: one for each partition.
!ls -1 /content/sample_data/
ASIA.txt
EU.txt
NA.txt
Each file has country names for that continent (partition):
!cat /content/sample_data/ASIA.txt
India
China
Japan
!cat /content/sample_data/EU.txt
England
France
Germany
!cat /content/sample_data/NA.txt
USA
Canada
Mexico
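A usage note on the original problem (an assumption, not part of the answer above): foreachPartition is an action that runs only for its side effects and returns None, so it does not modify the dataframe it is called on. If the dataframe appears empty afterwards, one thing to try is persisting it before the action so that it can be reused without being recomputed:
# sketch, reusing the (assumed) names from the question
my_py_spark_dataframe.cache()
my_py_spark_dataframe.foreachPartition(
    lambda partition: my_random_function(partition, parameters))
my_py_spark_dataframe.show()  # the same dataframe is still usable here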

How to group column values into 'others'?

I have a huge list of website names in my dataframe, e.g. array(['google', 'facebook', 'yahoo', 'youtube', and many other small websites]).
The dataframe has around 40 more websites.
I want to group the other website names as 'others'.
My input table is something like
|Website |
|-------------|
|google.com |
|youtube.com |
|yahoo.com |
|nyu.com |
|something.com|
My desired output will be something like
|Website |
|-----------|
|google.com |
|youtube.com|
|yahoo.com |
|others |
|others |
I tried a few things but they didn't work. Should I be manually renaming them? Or is there any way I can create a new column and mark them as 'others', with a few exceptions as above?
Thanks in advance.
Try:
m = df['Website'].isin(['google.com', 'youtube.com', 'yahoo.com'])
# Finally:
df.loc[~m, 'Website'] = 'others'
Or:
m = df['Website'].str.contains('google|youtube|yahoo')
# Finally:
df.loc[~m, 'Website'] = 'others'
Try using str.contains (negated, so that anything not matching the sites you keep becomes 'others'):
df.loc[~df['Website'].str.contains('google|youtube|yahoo|facebook'), 'Website'] = 'others'
Maybe...
# maintain a list of sites you wish to keep
sitesToKeep = ['google.com', 'youtube.com', 'yahoo.com']
# for all rows where the value in the column 'Website' is not present in the list 'sitesToKeep', change the value to 'others'
df.loc[~df.Website.isin(sitesToKeep), 'Website'] = 'others'
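As a small variant (not from the original answers), the same replacement can be written in one line with Series.where, which keeps values where the condition is True and replaces the rest:
# keep the listed sites, replace everything else with 'others'
df['Website'] = df['Website'].where(df['Website'].isin(sitesToKeep), 'others')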

PySpark: Performance optimization during create_map for categorizing the page visit

I am working on optimizing the operation below, whose execution time is relatively high on the actual (large) dataset. I tried it on the two pyspark datasets 1 & 2 below to arrive at the "page_category" column of dataset-2.
pyspark dataset-1:
page_click | page_category
---------------------------
facebook   | Social_network
insta      | Social_network
coursera   | educational
Another dataset, on which I am applying the create_map operation, looks like:
pyspark dataset-2:
id | page_click
---------------
1  | facebook
2  | Coursera
I am creating a dictionary from dataset-1 and applying:
page_map = create_map([lit(x) for x in chain(*dict_dataset_1.items())])
dataset_2.withColumn('page_category', page_map[dataset_2['page_click']])
and then performing withColumn on the 'page_click' column of dataset-2 to arrive at another column called 'page_category'.
final dataset:
id | page_click | page_category
-------------------------------
1  | facebook   | Social_network
2  | Coursera   | educational
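(For context, a minimal end-to-end sketch of this approach; the post does not show how dict_dataset_1 is built, so that step and the dataset_1/dataset_2 variable names are assumptions:)
from itertools import chain
from pyspark.sql.functions import create_map, lit

# assumption: dataset_1 is small enough to collect to the driver,
# giving a page_click -> page_category lookup
dict_dataset_1 = {row['page_click']: row['page_category']
                  for row in dataset_1.select('page_click', 'page_category').collect()}

page_map = create_map([lit(x) for x in chain(*dict_dataset_1.items())])
dataset_2 = dataset_2.withColumn('page_category', page_map[dataset_2['page_click']])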
But this operation is taking too much time to complete, more than 4-5 minutes. Is there another way to speed up the operation?
Thank you
Implement a simple broadcast join:
from pyspark.sql.functions import broadcast

df2.join(broadcast(df1), df2.page_click == df1.page_click, 'left') \
   .select(df2.id, df2.page_click, df1.page_category).show()
+---+----------+--------------+
| id|page_click| page_category|
+---+----------+--------------+
| 1| facebook|Social_network|
| 2| coursera| educational|
+---+----------+--------------+

Best practice for double every column on the same DataFrame

I want to take a DF and double each column (with new column name).
I want to make "Stress Tests" on my ML Model (implemented using PySpark & Spark Pipeline) and see how well it performs if I double/triple the number of features in my input dataset.
For Example, take this DF:
+-------+-------+-----+------+
| _c0| _c1| _c2| _c3|
+-------+-------+-----+------+
| 1 |Testing| | true |
+-------+-------+-----+------+
and make it like this:
+-------+-------+-----+------+-------+-------+-----+------+
| _c0| _c1| _c2| _c3| _c4| _c5| _c6| _c7|
+-------+-------+-----+------+-------+-------+-----+------+
| 1 |Testing| | true | 1 |Testing| | true |
+-------+-------+-----+------+-------+-------+-----+------+
The easiest way I can do it is like this:
df = df
doubledDF = df
for col in df.columns:
    doubledDF = doubledDF.withColumn(col + "1dup", df[col])
However, it takes way too much time.
I would appreciate any solution, and even more an explanation of why that solution is better.
Thank you very much!
You can do this by using selectExpr(). The asterisk * will unpack a list.
For example, *['_c0', '_c1', '_c2', '_c3'] will return '_c0', '_c1', '_c2', '_c3'.
With the help of a list comprehension, this code can be fairly generalized.
df = sqlContext.createDataFrame([(1,'Testing','',True)],('_c0','_c1','_c2','_c3'))
df.show()
+---+-------+---+----+
|_c0| _c1|_c2| _c3|
+---+-------+---+----+
| 1|Testing| |true|
+---+-------+---+----+
col_names = df.columns
print(col_names)
['_c0', '_c1', '_c2', '_c3']
df = df.selectExpr(*[i for i in col_names],*[i+' as '+i+'_dup' for i in col_names])
df.show()
+---+-------+---+----+-------+-------+-------+-------+
|_c0| _c1|_c2| _c3|_c0_dup|_c1_dup|_c2_dup|_c3_dup|
+---+-------+---+----+-------+-------+-------+-------+
| 1|Testing| |true| 1|Testing| | true|
+---+-------+---+----+-------+-------+-------+-------+
Note: The following code will work too.
df = df.selectExpr('*',*[i+' as '+i+'_dup' for i in col_names])
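As a note on why this tends to be faster than the withColumn loop (general Spark behaviour, not stated in the original answer): each withColumn call adds another projection to the query plan, so looping over many columns builds a deeply nested plan that is expensive to analyse, whereas a single selectExpr (or select) adds all of the duplicated columns in one projection. An equivalent single-pass version using select and alias, as a sketch:
from pyspark.sql import functions as F

doubledDF = df.select('*', *[F.col(c).alias(c + '_dup') for c in df.columns])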

Pandas: Why are my headers being inserted into the first row of my dataframe?

I have a script that collates sets of tags from other dataframes, converts them into comma-separated strings, and adds all of this to a new dataframe. If I use pd.read_csv to generate the dataframe, the first entry is what I expect it to be. However, if I use the df_empty script (below), then I get a copy of the headers in that first row instead of the data I want. The only difference I have made is generating a new dataframe instead of loading one.
The resultData = pd.read_csv() reads a .csv file with the following headers and no additional information:
Sheet, Cause, Initiator, Group, Effects
The df_empty script is as follows:
def df_empty(columns, dtypes, index=None):
    assert len(columns) == len(dtypes)
    df = pd.DataFrame(index=index)
    for c, d in zip(columns, dtypes):
        df[c] = pd.Series(dtype=d)
    return df
# https://stackoverflow.com/a/48374031
# Usage: df = df_empty(['a', 'b'], dtypes=[np.int64, np.int64])
My script contains the following line to create the dataframe:
resultData = df_empty(['Sheet','Cause','Initiator','Group','Effects'],[np.str,np.int64,np.str,np.str,np.str])
I've also used the following with no differences:
resultData = df_empty(['Sheet','Cause','Initiator','Group','Effects'],['object','int64','object','object','object'])
My script to collate the data and add it to my dataframe is as follows:
data = {'Sheet': sheetNum, 'Cause': causeNum, 'Initiator': initTag, 'Group': grp, 'Effects': effectStr}
count = len(resultData)
resultData.at[count,:] = data
When I run display(data), I get the following in Jupyter:
{'Sheet': '0001',
'Cause': 1,
'Initiator': 'Tag_I1',
'Group': 'DIG',
'Effects': 'Tag_O1, Tag_O2,...'}
What I want to see with both options / what I get when reading the csv:
+-------+-------+-----------+-------+--------------------+
| Sheet | Cause | Initiator | Group | Effects |
+-------+-------+-----------+-------+--------------------+
| 0001 | 1 | Tag_I1 | DIG | Tag_O1, Tag_O2,... |
| 0001 | 2 | Tag_I2 | DIG | Tag_O2, Tag_04,... |
+-------+-------+-----------+-------+--------------------+
What I see when generating a dataframe with df_empty:
+-------+-------+-----------+-------+--------------------+
| Sheet | Cause | Initiator | Group | Effects |
+-------+-------+-----------+-------+--------------------+
| Sheet | Cause | Initiator | Group | Effects |
| 0001 | 2 | Tag_I2 | DIG | Tag_O2, Tag_04,... |
+-------+-------+-----------+-------+--------------------+
Any ideas on what might be causing the generated dataframe to copy my headers into the first row, and whether it is possible for me to avoid reading an otherwise empty csv?
Thanks!
Why? Because you've inserted the first row as data. The magic behaviour of using the first row as the header lives in read_csv(); if you create your dataframe without using read_csv, the first row is not treated specially.
Solution? Skip the first row when inserting into the data frame generated by df_empty.
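A minimal sketch of the suggested fix, assuming the collated rows come from an iterable whose first element is the header row (the full script is not shown, so the collated_rows name is hypothetical):
rows = iter(collated_rows)
next(rows)  # skip the header row before inserting
for data in rows:
    resultData.loc[len(resultData)] = pd.Series(data)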
