I want to save a single DataFrame into 2 different CSV files (splitting the DataFrame): one would include just the header and the other would include the rest of the rows.
I want to save the 2 files under the same directory, so having Spark handle all the logic would be the best option if possible, instead of splitting the CSV file using pandas.
what would be the most efficient way to do this?
Thanks for your help!
Let's assume you've got a Dataset called "df".
You can:
Option one: write twice:
df.write.option("header", "false").csv(...)
df.limit(1).write.option("header", "true").csv(...) // as far as I remember, someone had problems saving a DataFrame with no rows -> you must write at least one row and then manually cut this row out using the normal Java or Python file API. Note that df.take(1) returns a plain list of rows, so it cannot be written directly; df.limit(1) keeps it a DataFrame.
Or you can write once with header = true and then manually cut the header out and place it in a new file using the normal Java file API.
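The second approach (write once with the header, then split the file manually) can be sketched in plain Python. The file name and contents below are hypothetical stand-ins for whatever part file Spark actually produces:

```python
# stand-in for the single part file Spark wrote (hypothetical content)
with open("part-00000.csv", "w") as f:
    f.write("name,age\nalice,30\nbob,25\n")

with open("part-00000.csv") as f:
    lines = f.readlines()

# header line only
with open("header.csv", "w") as f:
    f.write(lines[0])

# everything after the header
with open("data.csv", "w") as f:
    f.writelines(lines[1:])
```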
Data, without header:
df.to_csv("filename.csv", header=False)
Header, without data:
df_new = pd.DataFrame(data=None, columns=df_old.columns) # data=None makes sure no rows are copied to the new dataframe
df_new.to_csv("filename.csv")
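Put together, a minimal pandas sketch of the split (the file names are illustrative; index=False keeps the index column out of both files):

```python
import pandas as pd

df = pd.DataFrame({"name": ["alice", "bob"], "age": [30, 25]})

# data only, no header line
df.to_csv("rows.csv", header=False, index=False)

# header only, no data rows
pd.DataFrame(columns=df.columns).to_csv("header.csv", index=False)
```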
first post here.
I am very new to programming, sorry if this is confusing.
I made a database by collecting multiple different kinds of data online. All these data are in one xlsx file (one column per field), which I converted to CSV afterwards because my teacher only showed us how to use CSV files in Python.
I installed pandas and had it read my CSV file, but it seems it doesn't understand that I have multiple columns; it reads everything as one column. Thus, I can't get at each field (and so I can't transform the data).
I tried df.info() and df.info(verbose=True, show_counts=True) but they show the same thing:
len(df.columns) = 1, which shows it doesn't see that each field has its own column
len(df) = 1923, which is right
I was expecting that : https://imgur.com/a/UROKtxN (different project, not the same database)
database used: https://imgur.com/a/Wl1tsYb
And I have that instead : https://imgur.com/a/iV38YNe
database used: https://imgur.com/a/VefFrL4
I don't know, it looks pretty similar, so why doesn't it work? :((
Thanks.
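The screenshots aren't readable here, but a very common cause of this exact symptom (everything in one column) is an Excel export that uses semicolons instead of commas as the delimiter, which pd.read_csv does not expect by default. A sketch with hypothetical file contents:

```python
import io
import pandas as pd

# stand-in for a semicolon-delimited Excel export (hypothetical content)
raw = "name;age;city\nalice;30;paris\nbob;25;lyon\n"

# with the default sep="," everything lands in a single column
df_bad = pd.read_csv(io.StringIO(raw))

# passing the right separator (or sep=None with engine="python" to let
# pandas sniff it) recovers the columns
df_ok = pd.read_csv(io.StringIO(raw), sep=";")
```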
How can I write multiple pandas dataframes created in Python into the same CSV file?
So I have multiple data tables saved as pandas dataframes, and I want to output all of them into the same CSV for ease of access. However, I am not really sure of the best way to go about this, as I want to maintain each dataframe's inherent structure (i.e. columns and index) when I combine them all into 1 single file.
You have 2 choices:
Either you combine them first (pd.concat()) with all the advantages and limitations of that approach, then call .to_csv() and it will produce 1 file. If they are structurally the same, this is great because you will be able to read the file back again.
Or you call .to_csv() multiple times and collect the output in a buffer, which you then write to disk (see here). Probably the only way if your DataFrames are structurally very different, but a mess to read them back later.
Is .json output an option for what you want to do?
Thanks a lot for the comment Kingotto, I used the first option, added this code, and it helped me arrange my dataframes horizontally and export the file to CSV like this:
frames = pd.concat([file_1, file_2, file_3], axis = 1)
# save the dataframe
frames.to_csv('Combined.csv', index = False)
I am trying to read a csv file with some garbage at the top, but also garbage at the bottom of the interesting data. I need to read multiple files and the length of the interesting data varies. Is there a way to let the pd.read_csv command know that the dataframe ends at the first linebreak?
Example data (screenshot from excel):
I read the file with:
dataframe = pd.read_csv(file, skiprows=45)
Which nicely gives me a dataframe with 10 columns, with the headers on line 46 (see image). However, it continues past the #GARBAGE DATA row.
Important note: Neither the length of the data nor the length of the footer is of equal length in the different files I want to read.
Two ways you could implement this
1) Use the skipfooter parameter of read_csv; it tells the function the number of lines at the bottom of the file to skip (note it requires the Python parser engine):
pd.read_csv("in.csv", skiprows=45, skipfooter=2, engine="python")
2) Read the file as it is and afterwards use the dropna function; this should drop the garbage rows:
df.dropna(inplace=True)
After using this command:
dataframe = pd.read_csv(file, skiprows=45)
You can use this command:
dataframe= dataframe.dropna(how='any')
This deletes a row if any empty value is found in it, so it removes all the trailing garbage rows. (Be aware it also drops any valid rows that happen to contain missing values.)
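An end-to-end sketch of the skiprows + dropna approach on hypothetical file contents (two junk lines, a header, data, then a trailing garbage row whose second field is empty):

```python
import io
import pandas as pd

raw = "junk\nmore junk\na,b\n1,2\n3,4\n#GARBAGE,\n"

df = pd.read_csv(io.StringIO(raw), skiprows=2)  # header is on the 3rd line
df = df.dropna(how="any")                       # drops the garbage row
```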
I have multiple files (CSV and XML) and I want to apply some filters.
I defined a function that does all those filters, and I want to know how I can call it on my CSV files.
PS: The type of my dataframe is: pyspark.sql.dataframe.DataFrame
Thanks in advance
For example, if you read in your first CSV files as df1 = spark.read.csv(..) and your second CSV file as df2 = spark.read.csv(..)
Wrap up all the multiple pyspark.sql.dataframe.DataFrame that came from CSV files alone into a list..
csvList = [df1, df2, ...]
and then,
for i in csvList:
    YourFilterOperation(i)
Basically, for every i (a pyspark.sql.dataframe.DataFrame that came from a CSV file) stored in csvList, the loop performs whatever filter operation you've written. Note that Spark DataFrames are immutable, so if YourFilterOperation returns a filtered DataFrame you need to capture the return value, e.g. filtered = [YourFilterOperation(i) for i in csvList], rather than discard it.
Since you haven't provided any reproducible code, I can't see if this works on my Mac.
I have a pandas data frame with two columns:
years of experience and salary
I want to save a csv file with these two columns and also have some stats at the head of the file as in the image:
Is there any option to handle this with pandas or another library, or do I have to write the file line by line with a script, adding the commas between fields?
Pandas does not support what you want to do here. The problem is that your format is not valid CSV. The RFC for CSV states that "each record is located on a separate line", implying that a line corresponds to a record, with an optional header line. Your format adds the average and max values, which do not correspond to records.
As I see it, you have three paths from here:
i. You create two separate data frames and map them to CSV files (to be super precise that would be 3 files): one with your records, one with the additional values.
ii. You write your data frame to CSV first, then open that file and insert your additional values at the top.
iii. If your goal is an import into Excel, however, #gefero's suggestion is the right hint: try using the xlsxwriter package to write directly to cells in a spreadsheet.
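Path ii can be sketched like this (the column names, stats, and file name are illustrative): write the summary lines first, then append the frame's regular CSV text after them.

```python
import pandas as pd

df = pd.DataFrame({"experience": [1, 2, 3],
                   "salary": [30000, 45000, 60000]})

# summary lines first, then the normal CSV body
stats = f"average,{df['salary'].mean()}\nmax,{df['salary'].max()}\n"
with open("salaries.csv", "w") as f:
    f.write(stats)
    f.write(df.to_csv(index=False))
```

Keep in mind the resulting file is no longer plain CSV, so reading it back requires skipping the stats lines (as the next answer shows).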
You can read the file back as two separate parts (stats and records):
Reading stats:
number_of_stats_rows = 3
stats = pandas.read_csv(file_path, nrows=number_of_stats_rows, header=None).fillna('')
Reading remaining file:
other_data = pandas.read_csv(file_path, skiprows=number_of_stats_rows).fillna('')
Take a look at xlsxwriter. Perhaps it's what you are looking for.