How to call a python function in PySpark?

How to call a python function in PySpark? - python

I have a multiple files (CSV and XML) and I want to do some filters.
I defined a functoin doing all those filters, and I want to knwo how can I call it to be applicable for my CSV file?
PS: The type of my dataframe is: pyspark.sql.dataframe.DataFrame
Thanks in advance

For example, if you read in your first CSV files as df1 = spark.read.csv(..) and your second CSV file as df2 = spark.read.csv(..)
Wrap up all the multiple pyspark.sql.dataframe.DataFrame that came from CSV files alone into a list..
csvList = [df1, df2, ...]
and then,
for i in csvList:
YourFilterOperation(i)
Basically, for every i which is pyspark.sql.dataframe.DataFrame that came from a CSV file stored in csvList, it should iterate one by one, go inside the loop and perform whatever filter operation that you've written.
Since you haven't provided any reproducible code, I can't see if this works on my Mac.

Related

displaying multiple pandas function created on python in the same csv file

How can i display multiple pandas function created on python in the same csv file
So I have multiple data tables saved as pandas dataframes, and I want to output all of them into the same CSV for ease of access. However, I am not really sure the best way to go about this, as I want to maintain each dataframes inherent structure (ie columns and index), so I can combine them all into 1 single dataframe.

You have 2 choices:
Either you combine them first (pd.concat()) with all the advantages and limitations of that appraoch, then you cann call .to_csv and it will print 1 file. If they are structurally the same, this is great because you will be able to read the file again.
Or, you call .to_csv() multiple times, and save the output in a "buffer", which you can then write (see here). Probably the only way if your DataFrames are very different from a structural perspective, but a mess to read them later.
Is .json output an option for what you want to do?

Thanks alot for the comment Kingotto, I used to first option added the this code and it was able to help me arrange my functions horizontally and exported the file to csv like this:
frames = pd.concat([file_1, file_2, file_3], axis = 1)
save the dataframe
frames.to_csv('Combined.csv', index = False)

Analysing multiple files using for loop

I have 10 files which I need to work on.
I need to import those files using pd.read_csv to turn them all into dataframes along with usecols as I only need the same two specific columns from each file.
I then need to search the two columns for a specific entry in the rows like ‘abcd’ and for python to return a new df with includes all the rows it appeared in for each file.
Is there a way I could do this using a for loop. For far I’ve only got a list of all the paths to the 10 files.
So far what I do for one file without the for loop is:
df = pd.read_csv(r'filepath', header=2, usecols=['Column1', 'Column2'])
search_df = df.loc[df['Column1'] == 'abcd']

Reading dataframe from multiple input paths and adding columns simultaneously

I am trying to read multiple input paths and based on the dates in the paths adding two columns to the data frame. Actually the files were stored as orc partitioned by these dates using hive so they have a structure like
s3n://bucket_name/folder_name/partition1=value1/partition2=value2
where partition2 = mg_load_date . So here I am trying to fetch multiple directories from multiple paths and based on the partitions I have to create two columns namely mg_load_date and event_date for each spark dataframe. I am reading these as input and combining them after I add these two columns finding the dates for each file respectively.
Is there any other way since I have many reads for each file, to read all the files at once while adding two columns for their specific rows. Or any other way where I can make the read operation fast since I have many reads.
I guess reading all the files like this sqlContext.read.format('orc').load(inputpaths) is faster than reading them individually and then merging them.
Any help is appreciated.
dfs = []
for i in input_paths:
df = sqlContext.read.format('orc').load(i)
date = re.search('mg_load_date=([^/]*)/$', i).group(1)
df = df.withColumn('event_date',F.lit(date)).withColumn('mg_load_date',F.lit(date))
dfs+=[df]
df = reduce(DataFrame.unionAll,dfs)

As #user8371915 says, you should load your data from the root path instead of passing a list of subdirectory:
sqlContext.read.format('orc').load("s3n://bucket_name/folder_name/")
Then you'll have access to your partitioning columns partition1 and partition2.
If for some reason you can't load from the root path you can try using pyspark.sql.functions input_file_name to get the name of the file for each row of your dataframe.

Spark 2.2.0+
to read from multiple folders using orc format.
df=spark.read.orc([path1,path2])
ref: https://issues.apache.org/jira/browse/SPARK-12334

Pyspark - write a dataframe into 2 different csv files

I want to save a single DataFrame into 2 different csv files (splitting the DataFrame) - one would include just the header and another would include the rest of the rows.
I want to save the 2 files under the same directory so Spark handling all the logic would be the best option if possible instead of splitting the csv file using pandas.
what would be the most efficient way to do this?
Thanks for your help!

Let's assume you've got Dataset called "df".
You can:
Option one: write twice:
df.write.(...).option("header", "false").csv(....)
df.take(1).option("header", "true").csv() // as far as I remember, someone had problems with saving DataFrame without rows -> you must write at least one row and then manually cut this row using normal Java or Python file API
Or you can write once with header = true and then manually cut the header and place it in new file using normal Java API

Data, without header:
df.to_csv("filename.csv", header=False)
Header, without data:
df_new = pd.DataFrame(data=None, columns=df_old.columns) # data=None makes sure no rows are copied to the new dataframe
df_new.to_csv("filename.csv")

Rebase csv file by merging a base csv file with another new csv file

I am currently working with two csv files, base.csv and another csv file, output_20170503.csv which will be produced everyday, so my aim here is to rebase every output so that they have the same data as the base.csv.
My base.csv:
ID,Name,Number,Shape,Sound
1,John,45,Round,Meow
2,Jimmy,78,Sphere,Woof
3,Marc,,Triangle,Quack
4,Yun,50,Triangle,Meow
5,Nyancat,,Round,Quack
My output_20170503.csv
ID,Name,Number,Shape,Sound
1,John,,Round,Meow
2,Jimmy,,Sphere,Woof
3,Marc,,Triangle,Quack
4,Yun,,Triangle,
5,Nyancat,,Round,Quack
6,Marc,,Square,Woof
7,Jonnn,,Hexagon,Chirp
The objective here is to rebase the data (ID from 1-5) from base.csv with the output_20170503.csv
What I want to achieve:
ID,Name,Number,Shape,Sound
1,John,45,Round,Meow
2,Jimmy,78,Sphere,Woof
3,Marc,,Triangle,Quack
4,Yun,50,Triangle,Meow
5,Nyancat,,Round,Quack
6,Marc,,Square,Woof
7,Jonnn,,Hexagon,Chirp
I already searched for the solution but what I got;
Merge two csv files (both of csv files have different columns, won't work for me)
Remove duplicates from a csv files (Appending base.csv with the output_20170503.csv and then remove the duplicates, won't work because they have different values for column Number)
Any help would be appreciated, thank you.

You can try this, I use first two item as key and generate a dict and then iterate the new dict update the base dict if the key not in base:
new = {"".join(i.split(',')[:2]): i[:-1].split(',') for i in open('output_20170503.csv')}
base = {"".join(i.split(',')[:2]): i[:-1].split(',') for i in open('base.csv')}
base.update({i: new[i] for i in new if i not in base})
f=open("out.csv","w")
for i in sorted(base.values(), key=lambda x: x[0]):
if i[0]!="ID":
f.write(",".join(i)+"\n")
Output:
1,John,45,Round,Meow
2,Jimmy,78,Sphere,Woof
3,Marc,,Triangle,Quack
4,Yun,50,Triangle,Meow
5,Nyancat,,Round,Quac
6,Marc,,Square,Woof
7,Jonnn,,Hexagon,Chir
Python2.7+ supports the syntactical extension called the "dictionary comprehension" or "dict comprehension", so if you're using Python2.6, you need to replace the first three lines with:
new = dict(("".join(i.split(',')[:2]),i[:-1].split(',')) for i in open('output_20170503.csv'))
base = dict(("".join(i.split(',')[:2]),i[:-1].split(',')) for i in open('base.csv'))
base.update(dict((i,new[i]) for i in new if i not in base))

You should try to use pandas library which is excellent for data manipulation. you can read easily csv files and do merge operation. Your solution might look like the following :
import pandas as pd
base_df = pd.read_csv('base.csv')
output_df = pd.read_csv('My output_20170503.csv')
output_df.update(base_df)
output_df.write_csv('My output_20170503.csv')
The missing values on output_df has now been updated with the one from base_df.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

How to call a python function in PySpark? - python

I have a multiple files (CSV and XML) and I want to do some filters. I defined a functoin doing all those filters, and I want to knwo how can I call it to be applicable for my CSV file? PS: The type of my dataframe is: pyspark.sql.dataframe.DataFrame Thanks in advance

Related

displaying multiple pandas function created on python in the same csv file

Analysing multiple files using for loop

Reading dataframe from multiple input paths and adding columns simultaneously

Pyspark - write a dataframe into 2 different csv files

Rebase csv file by merging a base csv file with another new csv file

Categories

Resources