Write multiple Avro files from pyspark to the same directory - python

I'm trying to write a PySpark dataframe out as Avro files to the HDFS path /my/path/, partitioned by the column 'partition', so under /my/path/ there should be the following subdirectory structure:
partition=20230101
partition=20230102
....
Under these subdirectories there should be the Avro files. I'm trying to use
df1.select("partition","name","id").write.partitionBy("partition").format("avro").save("/my/path/")
It succeeds the first time, but when I try to write another df with a new partition, it fails with the error: path /my/path/ already exists. How can I achieve this? Many thanks for your help. The df format is as below:
partition    name    id
20230101     aa      10    --- this row is the content of the first df
20230102     bb      20    --- this row is the content of the second df

You should change the SaveMode.
The default save mode is ErrorIfExists, which is why you are getting the error.
Change it to append mode.
df1.select("partition","name","id") \
.write.mode("append").format("avro")\
.partitionBy("partition").save("/my/path/")
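The directory effect of append mode can be sketched without Spark: the base path already existing is fine, and each write only adds new partition= subdirectories. A plain-Python sketch, with a temporary directory standing in for /my/path/:

```python
import os
import tempfile

base = tempfile.mkdtemp()  # stands in for /my/path/ on HDFS

# first write: creates the base dir and partition=20230101
os.makedirs(os.path.join(base, "partition=20230101"), exist_ok=True)

# second write in append mode: the base dir already exists,
# but only a new partition subdirectory is added -- no error
os.makedirs(os.path.join(base, "partition=20230102"), exist_ok=True)

print(sorted(os.listdir(base)))
# ['partition=20230101', 'partition=20230102']
```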

Related

Combining .csv Files in Python - Merged File Data Error - Jupyter Lab

I am trying to merge a large number of .csv files. They all have the same table format, with 60 columns each. My merged table results in the data coming out fine, except the first row consists of 640 columns instead of 60 columns. The remainder of the merged .csv consists of the desired 60 column format. Unsure where in the merge process it went wrong.
The first item in the problematic row is the first item in 20140308.export.CSV while the second (starting in column 61) is the first item in 20140313.export.CSV. The first .csv file is 20140301.export.CSV the last is 20140331.export.CSV (YYYYMMDD.export.csv), for a total of 31 .csv files. This means that the problematic row consists of the first item from different .csv files.
The Data comes from http://data.gdeltproject.org/events/index.html. In particular the dates of March 01 - March 31, 2014. Inspecting the download of each individual .csv file shows that each file is formatted the same way, with tab delimiters and comma separated values.
The code I used is below. If there is anything else I can post, please let me know. All of this was run through Jupyter Lab through Google Cloud Platform. Thanks for the help.
import glob
import pandas as pd
file_extension = '.export.CSV'
all_filenames = [i for i in glob.glob(f"*{file_extension}")]
combined_csv_data = pd.concat([pd.read_csv(f, delimiter='\t', encoding='UTF-8', low_memory= False) for f in all_filenames])
combined_csv_data.to_csv('2014DataCombinedMarch.csv')
I used the following bash code to download the data:
!curl -LO http://data.gdeltproject.org/events/[20140301-20140331].export.CSV.zip
I used the following code to unzip the data:
!unzip -a "********".export.CSV.zip
I used the following code to transfer to my storage bucket:
!gsutil cp 2014DataCombinedMarch.csv gs://ddeltdatabucket/2014DataCombinedMarch.csv
Looks like these CSV files have no header on them, so Pandas is trying to use the first row in the file as a header. Then, when Pandas tries to concat() the dataframes together, it's trying to match the column names which it has inferred for each file.
I figured out how to suppress that behavior:
import glob
import pandas as pd
def read_file(f):
    names = [f"col_{i}" for i in range(58)]
    return pd.read_csv(f, delimiter='\t', encoding='UTF-8', low_memory=False, names=names)
file_extension = '.export.CSV'
all_filenames = [i for i in glob.glob(f"*{file_extension}")]
combined_csv_data = pd.concat([read_file(f) for f in all_filenames])
combined_csv_data.to_csv('2014DataCombinedMarch.csv')
You can supply your own column names to Pandas through the names parameter. Here, I'm just supplying col_0, col_1, col_2, etc for the names, because I don't know what they should be. If you know what those columns should be, you should change that names = line.
I tested this script, but only with 2 data files as input, not all 31.
PS: Have you considered using Google BigQuery to get the data? I've worked with GDELT before through that interface and it's way easier.
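The effect of names= can be checked with in-memory data. This is a minimal sketch with 3 toy columns instead of GDELT's 58, using StringIO objects in place of the files:

```python
import io
import pandas as pd

# two headerless tab-delimited "files"
f1 = io.StringIO("1\ta\t10\n2\tb\t20\n")
f2 = io.StringIO("3\tc\t30\n")

names = [f"col_{i}" for i in range(3)]
frames = [pd.read_csv(f, delimiter="\t", names=names) for f in (f1, f2)]

# with explicit names, no data row is consumed as a header,
# and concat lines the columns up by name
combined = pd.concat(frames, ignore_index=True)
print(combined.shape)  # (3, 3)
```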

How to call a python function in PySpark?

I have multiple files (CSV and XML) and I want to apply some filters.
I defined a function that does all those filters, and I want to know how I can call it so it is applied to my CSV file.
PS: The type of my dataframe is: pyspark.sql.dataframe.DataFrame
Thanks in advance
For example, say you read in your first CSV file as df1 = spark.read.csv(..) and your second CSV file as df2 = spark.read.csv(..).
Wrap up all the pyspark.sql.dataframe.DataFrame objects that came from CSV files into a list:
csvList = [df1, df2, ...]
and then,
results = []
for i in csvList:
    results.append(YourFilterOperation(i))
Basically, for every i in csvList (each a pyspark.sql.dataframe.DataFrame that came from a CSV file), the loop performs whatever filter operation you've written. Note that DataFrames are immutable, so the filter returns a new dataframe; collect the results instead of discarding them.
Since you haven't provided any reproducible code, I can't see if this works on my Mac.
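The collect-the-results pattern can be illustrated with pandas (my_filter is a hypothetical stand-in for your filter function; the same shape applies to Spark dataframes, since a filter returns a new dataframe rather than modifying the one in the list):

```python
import pandas as pd

def my_filter(df):
    # hypothetical filter: keep rows with id > 10
    return df[df["id"] > 10]

df1 = pd.DataFrame({"id": [5, 15]})
df2 = pd.DataFrame({"id": [25]})
csv_list = [df1, df2]

# collect each returned dataframe instead of discarding it
filtered = [my_filter(df) for df in csv_list]
total_rows = sum(len(df) for df in filtered)
print(total_rows)  # 2
```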

Reading dataframe from multiple input paths and adding columns simultaneously

I am trying to read multiple input paths and, based on the dates in the paths, add two columns to the data frame. The files were stored as orc, partitioned by these dates using hive, so they have a structure like
s3n://bucket_name/folder_name/partition1=value1/partition2=value2
where partition2 = mg_load_date. So here I am trying to fetch multiple directories from multiple paths and, based on the partitions, create two columns, mg_load_date and event_date, for each spark dataframe. I am reading these as input and combining them after adding the two columns, finding the dates for each file respectively.
Is there any other way, since I have many reads (one per file), to read all the files at once while adding the two columns for their specific rows? Or any way to make the read operation faster, since I have many reads?
I guess reading all the files like this sqlContext.read.format('orc').load(inputpaths) is faster than reading them individually and then merging them.
Any help is appreciated.
import re
from functools import reduce
from pyspark.sql import DataFrame, functions as F

dfs = []
for i in input_paths:
    df = sqlContext.read.format('orc').load(i)
    date = re.search('mg_load_date=([^/]*)/$', i).group(1)
    df = df.withColumn('event_date', F.lit(date)).withColumn('mg_load_date', F.lit(date))
    dfs += [df]
df = reduce(DataFrame.unionAll, dfs)
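The date-extraction step in that loop is plain Python and can be checked on its own (hypothetical path, following the partition layout described in the question):

```python
import re

# a path shaped like the s3n layout above
path = "s3n://bucket_name/folder_name/partition1=value1/mg_load_date=2018-01-02/"

# capture everything between 'mg_load_date=' and the trailing slash
match = re.search(r"mg_load_date=([^/]*)/$", path)
date = match.group(1)
print(date)  # 2018-01-02
```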
As #user8371915 says, you should load your data from the root path instead of passing a list of subdirectories:
sqlContext.read.format('orc').load("s3n://bucket_name/folder_name/")
Then you'll have access to your partitioning columns partition1 and partition2.
If for some reason you can't load from the root path, you can try using pyspark.sql.functions.input_file_name to get the name of the file for each row of your dataframe.
Spark 2.2.0+
To read from multiple folders using the orc format:
df = spark.read.orc([path1, path2])
ref: https://issues.apache.org/jira/browse/SPARK-12334

Pyspark - write a dataframe into 2 different csv files

I want to save a single DataFrame into 2 different csv files (splitting the DataFrame) - one would include just the header and another would include the rest of the rows.
I want to save the 2 files under the same directory so Spark handling all the logic would be the best option if possible instead of splitting the csv file using pandas.
What would be the most efficient way to do this?
Thanks for your help!
Let's assume you've got Dataset called "df".
You can:
Option one: write twice:
df.write.(...).option("header", "false").csv(....)
df.limit(1).write.option("header", "true").csv(....) // as far as I remember, someone had problems with saving a DataFrame without rows -> you must write at least one row and then manually cut that data row out (keeping only the header) using the normal Java or Python file API. Note that take(1) returns a plain list, not a DataFrame, so limit(1) is what you want before .write
Or you can write once with header = true and then manually cut the header out and place it in a new file using the normal Java API.
Data, without header:
df.to_csv("filename.csv", header=False)
Header, without data:
df_new = pd.DataFrame(data=None, columns=df_old.columns) # data=None makes sure no rows are copied to the new dataframe
df_new.to_csv("filename.csv")
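The two pandas writes can be sketched end to end with in-memory buffers (toy dataframe; the header-only frame is built the same way as above, from the original frame's columns):

```python
import io
import pandas as pd

df = pd.DataFrame({"name": ["aa", "bb"], "id": [10, 20]})

# data only, no header row
data_buf = io.StringIO()
df.to_csv(data_buf, header=False, index=False)

# header only, no data rows
header_only = pd.DataFrame(data=None, columns=df.columns)
header_buf = io.StringIO()
header_only.to_csv(header_buf, index=False)

print(header_buf.getvalue())  # 'name,id\n'
print(data_buf.getvalue())    # 'aa,10\nbb,20\n'
```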

writing a csv with column names and reading a csv file which is being generated from a sparksql dataframe in Pyspark

I have started the shell with the databricks csv package:
#../spark-1.6.1-bin-hadoop2.6/bin/pyspark --packages com.databricks:spark-csv_2.11:1.3.0
Then I read a csv file, did some groupby ops, and dumped the result to a csv.
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
df = sqlContext.read.format('com.databricks.spark.csv').options(header='true').load('path.csv') #### it has columns and df.columns works fine
type(df) #<class 'pyspark.sql.dataframe.DataFrame'>
#now trying to dump a csv
df.write.format('com.databricks.spark.csv').save('path+my.csv')
#it creates a directory my.csv with 2 partitions
### To create single file i followed below line of code
#df.rdd.map(lambda x: ",".join(map(str, x))).coalesce(1).saveAsTextFile("path+file_satya.csv") ## this creates one partition in directory of csv name
# but in both cases there is no column information (how to add column names to that csv file?)
# again I am trying to read that csv by
df_new = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").load("the file i just created.csv")
# I am not getting any column names in that; the 1st data row becomes the column names
Please don't answer like "add a schema to the dataframe after read_csv" or "mention the column names while reading".
Question 1: when dumping the csv, is there any way I can add the column names to it?
Question 2: is there a way to create a single csv file (not a directory again) which can be opened by MS Office or Notepad++?
Note: I am currently not using a cluster, as it is too complex for a spark beginner like me. If anyone can provide a link for how to deal with to_csv into a single file in a clustered environment, that would be a great help.
Try
df.coalesce(1).write.format('com.databricks.spark.csv').save('path+my.csv',header = 'true')
Note that this may not be an issue on your current setup, but on extremely large datasets, you can run into memory problems on the driver. This will also take longer (in a cluster scenario) as everything has to push back to a single location.
Just in case,
on spark 2.1 you can create a single csv file with the following lines
# coalesce(1) so just a single part- file will be created
# marksuccessfuljobs=false avoids creating the _SUCCESS marker file
dataframe.coalesce(1) \
    .write.mode("overwrite") \
    .option("mapreduce.fileoutputcommitter.marksuccessfuljobs", "false") \
    .option("header", "true") \
    .csv("csvFullPath")
With spark >= 2.0, we can do something like
df = spark.read.csv('path+filename.csv', sep='ifany', header='true')
df.write.csv('path_filename of csv', header=True)  # yes, still in partitions
df.toPandas().to_csv('path_filename of csv', index=False)  # single csv (pandas style)
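The toPandas().to_csv variant yields one ordinary file rather than a directory of part-files. A minimal pandas-only sketch (toy data, temporary path):

```python
import os
import tempfile
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": ["x", "y"]})

out = os.path.join(tempfile.mkdtemp(), "single.csv")
df.to_csv(out, index=False)  # one plain file, header included

with open(out) as fh:
    content = fh.read()
print(content)  # 'a,b\n1,x\n2,y\n'
```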
The following should do the trick:
df \
.write \
.mode('overwrite') \
.option('header', 'true') \
.csv('output.csv')
Alternatively, if you want the results to be in a single partition, you can use coalesce(1):
df \
.coalesce(1) \
.write \
.mode('overwrite') \
.option('header', 'true') \
.csv('output.csv')
Note however that this is an expensive operation and might not be feasible with extremely large datasets.
Got the answer for the 1st question: it was a matter of passing one extra parameter, header='true', along with the csv statement
df.write.format('com.databricks.spark.csv').save('path+my.csv', header='true')
# Alternative for the 2nd question
Using toPandas().to_csv. But again, I don't want to use pandas here, so please suggest if there is any other way around.
