Overwrite specific CSV partitions pyspark - python

I have a Spark dataframe named df, which is partitioned on the column date. I need to save this dataframe to S3 in CSV format. When I write the dataframe, I need to delete on S3 the partitions (i.e. the dates) for which the dataframe has data to write. All the other partitions need to remain intact.
I saw here that this is exactly the job of the option spark.sql.sources.partitionOverwriteMode set to dynamic.
However, it does not seem to work for me with CSV files.
If I use it on parquet with the following command it works perfectly:
df.write \
    .option("partitionOverwriteMode", "dynamic") \
    .partitionBy("date") \
    .format("parquet") \
    .mode("overwrite") \
    .save(output_dir)
But if I use it on CSV with the following command it does not work:
df.write \
    .option("partitionOverwriteMode", "dynamic") \
    .partitionBy("date") \
    .format("csv") \
    .mode("overwrite") \
    .save(output_dir)
Why is this the case? Any idea of how this behaviour could be implemented with CSV outputs?

I need to delete the partitions (i.e. the dates) on S3 for which the dataframe has data to be written to
Assuming you have a convenient list of the dates you are processing you can use the replaceWhere option to determine the partitions to overwrite (delete and replace).
For example:
df.write \
    .partitionBy("date") \
    .option("replaceWhere", "date >= '2020-12-14' AND date <= '2020-12-15'") \
    .format("csv") \
    .mode("overwrite") \
    .save(output_dir)
A more dynamic way is to keep the start_date and end_date in variables:
start_date = "2022-01-01"
end_date = "2022-01-14"
condition = f"date >= '{start_date}' AND date <= '{end_date}'"
df.write \
    .partitionBy("date") \
    .option("replaceWhere", condition) \
    .format("csv") \
    .mode("overwrite") \
    .save(output_dir)

Which Spark version are you using? For Spark < 2.0.0 it may not be possible to use partitioning together with the csv format.

If you are not on EMR, and are using the s3a committers to safely commit work to s3, then the partitioned committer can be set to delete all data in the destination partitions before committing new work, leaving all other partitions alone.
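For reference, a minimal sketch of that setup, assuming the Hadoop S3A staging committers and the spark-hadoop-cloud committer bindings are available and configured (df and output_dir are the same placeholders as in the question; the save-mode choice is an assumption, since an overwrite at the Spark level would delete the whole output path):
from pyspark.sql import SparkSession

# Assumed config: the "partitioned" S3A staging committer with conflict-mode
# "replace" deletes existing data only in the partitions being written to,
# leaving all other partitions alone at commit time.
spark = (
    SparkSession.builder
    .config("spark.hadoop.fs.s3a.committer.name", "partitioned")
    .config("spark.hadoop.fs.s3a.committer.staging.conflict-mode", "replace")
    .getOrCreate()
)

(df.write
    .partitionBy("date")
    .format("csv")
    .mode("append")  # append at the Spark level; the committer replaces the touched partitions
    .save(output_dir))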

Related

Creating a spark Dataframe within foreach() while using autoloader with BinaryFile option in databricks

I am using Autoloader with the BinaryFile option to decode .proto-based files in Databricks. I am able to decode the proto file and write it in CSV format using foreach() and the pandas library, but I am having trouble writing it in Delta format. At the end of the day, I want to write in Delta format and I am trying to avoid one more hop in storage, i.e. storing in CSV first.
There are a few ways I could think of, but each has challenges:
Convert the pandas dataframe to a Spark dataframe. I would have to use the SparkContext to call createDataFrame, but I can't broadcast the SparkContext to worker nodes.
Avoid using a pandas DF; I still need to create a dataframe, which is not possible within foreach() (since the load is distributed across workers).
Other ways like a UDF, where I would decode and explode the string returned from the decoder. But that's not applicable here because we are getting a non-Spark-native file format, i.e. proto.
I also came across a few blogs, but they were not helpful for the foreach() and BinaryFile case:
https://github.com/delta-io/delta-rs/tree/main/python - This is not stable yet in python
https://spark.apache.org/docs/latest/api/python/reference/pyspark.pandas/api/pyspark.pandas.DataFrame.to_delta.html - This points us to the challenge 2 mentioned above.
Any leads on this are much appreciated.
Below is the skeleton code snippet for reference:
cloudfile_options = {
    "cloudFiles.subscriptionId": subscription_ID,
    "cloudFiles.connectionString": queue_connection_string,
    "cloudFiles.format": "BinaryFile",
    "cloudFiles.tenantId": tenant_ID,
    "cloudFiles.clientId": client_ID,
    "cloudFiles.clientSecret": client_secret,
    "cloudFiles.resourceGroup": storage_resource_group,
    "cloudFiles.useNotifications": "true"
}
reader_df = spark.readStream.format("cloudFiles") \
    .options(**cloudfile_options) \
    .load("some_storage_input_path")
def decode_proto(row):
    with open(row['path'], 'rb') as f:
        # Do decoding
        # convert decoded string to Json and write to storage using a pandas df
        pass
write_stream = reader_df.select("path") \
    .writeStream \
    .foreach(decode_proto) \
    .option("checkpointLocation", checkpoint_path) \
    .trigger(once=True) \
    .start()

spark.read parquet into a dataframe gives null values

I'm new to PySpark and, long story short:
I have a parquet file and I am trying to read it and use it with Spark SQL, but currently I can:
Read the file with a schema, but it gives NULL values - spark.read.format
Read the file without a schema (the header has first-row values as column names) - read_parquet
I have a parquet file "locations.parquet" and the following schema definition:
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

location_schema = StructType([
    StructField("loc_id", IntegerType()),
    StructField("descr", StringType())
])
I have been trying to read the parquet file:
location_df = spark.read.format('parquet') \
    .options(header='false') \
    .schema(location_schema) \
    .load("data/locations.parquet")
Then I put everything into a Temp table to run queries:
location_df.registerTempTable("location")
But after trying to run a query:
query = spark.sql("select * from location")
query.show(100)
It gives me NULL values:
The parquet file itself is correct, since I have been able to run this successfully:
great = pd.read_parquet('data/locations.parquet', engine='auto')
But the problem with read_parquet (from my understanding) is that I cannot set a schema like I did with spark.read.format.
If I use spark.read.format with csv, it also runs successfully and returns data.
Any advice is greatly appreciated, thanks.
location_df.schema gives me : StructType(List(StructField(862,LongType,true),StructField(Animation,StringType,true)))
The schema stored in the file does not match the one you are passing. That explains why your first example gives nulls (the names/types don't match), and it would also throw errors if you started selecting loc_id or descr individually.
It appears that the field names in the file are 862 and Animation, not the fields you're interested in.
You can download a parquet-tools utility separately to inspect the file data and print the file schema without Spark. Start debugging there...
If you don't give the Spark reader a schema, then that information is inferred from the file itself, since Parquet files embed their schema (in the footer of the file, not the header).
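A minimal sketch of that approach (the column names below are taken from the schema output shown above and are only assumed to be the file's real field names): let Spark infer the schema from the footer, then rename the columns instead of forcing a non-matching schema onto them.
# Read the parquet file without passing a schema; Spark takes it from the file footer.
location_df = spark.read.parquet("data/locations.parquet")
location_df.printSchema()  # shows the actual field names, here 862 and Animation

# Rename (and cast, if needed) to the names you want to query with.
renamed_df = location_df \
    .withColumnRenamed("862", "loc_id") \
    .withColumnRenamed("Animation", "descr")

renamed_df.createOrReplaceTempView("location")
spark.sql("select * from location").show(100)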

Parquet compatibility with Dask/Pandas and Pyspark

This is the same question as here, but the accepted answer does not work for me.
Attempt:
I try to save a dask dataframe in parquet format and read it with spark.
Issue: the timestamp column cannot be interpreted by pyspark.
What I have done:
I save the Dask dataframe to HDFS as parquet using
import dask.dataframe as dd
dd.to_parquet(ddf_param_logs, 'hdfs:///user/<myuser>/<filename>', engine='pyarrow', flavor='spark')
Then I read the file with pyspark:
sdf = spark.read.parquet('hdfs:///user/<myuser>/<filename>')
sdf.show()
>>> org.apache.spark.sql.execution.QueryExecutionException: Parquet column cannot be converted in file hdfs://nameservice1/user/<user>/<filename>/part.0.parquet. Column: [utc_timestamp], Expected: bigint, Found: INT96
but if I save the dataframe with
dd.to_parquet(ddf_param_logs, 'hdfs:///user/<myuser>/<filename>', engine='pyarrow', use_deprecated_int96_timestamps=True)
then the utc_timestamp column contains the timestamp information in unix format (1578642290403000).
This is my environment:
dask==2.9.0
dask-core==2.9.0
pandas==0.23.4
pyarrow==0.15.1
pyspark==2.4.3
The INT96 type was explicitly included in order to allow compatibility with Spark, which chose not to use the standard time type defined by the parquet spec. Unfortunately, it seems that they have changed again, and no longer use their own previous convention, nor the parquet one.
If you could find out what type spark wants here, and post an issue to the dask repo, it would be appreciated. You would want to output data from spark containing time columns, and see what format it ends up as.
Did you also try the fastparquet backend?
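A minimal sketch of the diagnostic suggested above, assuming a writable scratch path on HDFS (the path is a placeholder): write a timestamp column from Spark itself, then inspect the physical type of the file it produces.
from pyspark.sql import functions as F

# Write a single timestamp column from Spark and see what physical type it chooses.
spark.range(1) \
    .select(F.current_timestamp().alias("utc_timestamp")) \
    .write.mode("overwrite") \
    .parquet("hdfs:///tmp/ts_probe")  # placeholder scratch path

# Then inspect one of the part files, e.g. with the parquet-tools utility mentioned elsewhere on this page:
#   parquet-tools schema hdfs:///tmp/ts_probe/part-00000-*.parquet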

Drop partition columns when writing parquet in pyspark

I have a dataframe with a date column. I have parsed it into year, month, day columns. I want to partition on these columns, but I do not want the columns to persist in the parquet files.
Here is my approach to partitioning and writing the data:
from pyspark.sql import functions as f

df = df.withColumn('year', f.year(f.col('date_col'))) \
       .withColumn('month', f.month(f.col('date_col'))) \
       .withColumn('day', f.dayofmonth(f.col('date_col')))
df.write.partitionBy('year','month', 'day').parquet('/mnt/test/test.parquet')
This properly creates the parquet files, including the nested folder structure. However I do not want the year, month, or day columns in the parquet files.
Spark/Hive won't write the year, month, day columns into your parquet files because they are already in the partitionBy clause.
Example:
val df=Seq((1,"a"),(2,"b")).toDF("id","name")
df.coalesce(1).write.partitionBy("id").csv("/user/shu/temporary2") //write csv file.
Checking contents of csv file:
hadoop fs -cat /user/shu/temporary2/id=1/part-00000-dc55f08e-9143-4b60-a94e-e28b1d7d9285.c000.csv
Output:
a
As you can see, there is no id value included in the csv file; in the same way, if you write a parquet file, the partition columns are not included in the part-*.parquet files.
To check the schema of a parquet file:
parquet-tools schema <hdfs://nn:8020/parquet_file>
This lets you verify which columns are actually included in your parquet file.
If you use df.write.partitionBy('year', 'month', 'day'), these columns are not actually physically stored in the file data. They are simply rendered via the folder structure that partitionBy creates.
Ex. partitionBy('year').csv("/data") will create something like:
/data/year=2018/part1---.csv
/data/year=2019/part1---.csv
When you read the data back it uses the special path year=xxx to populate these columns.
You can prove it by reading in the data of a single partition directly.
Ex. year will not be a column in this case.
df = spark.read.csv("data/year=2019/")
df.printSchema()
Also, @Shu's answer could be used to investigate.
You can rest assured that these columns are not taking up storage space.
If you really don't even want to see the columns when querying, you could put a view on top of this table that excludes them.
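A minimal sketch of that view, using the output path from the question and a hypothetical view name:
# Read the partitioned output back; year/month/day reappear as columns from the folder structure.
df = spark.read.parquet('/mnt/test/test.parquet')

# Expose the data without the partition columns by dropping them in a temp view.
df.drop('year', 'month', 'day').createOrReplaceTempView('events_without_partition_cols')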

Writing a csv with column names and reading a csv file which is being generated from a Spark SQL dataframe in PySpark

I have started the shell with the Databricks CSV package:
#../spark-1.6.1-bin-hadoop2.6/bin/pyspark --packages com.databricks:spark-csv_2.11:1.3.0
Then I read a csv file, did some groupby ops, and dumped that to a csv.
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
df = sqlContext.read.format('com.databricks.spark.csv').options(header='true').load('path.csv') #### it has columns and df.columns works fine
type(df) #<class 'pyspark.sql.dataframe.DataFrame'>
#now trying to dump a csv
df.write.format('com.databricks.spark.csv').save('path+my.csv')
#it creates a directory my.csv with 2 partitions
### To create single file i followed below line of code
#df.rdd.map(lambda x: ",".join(map(str, x))).coalesce(1).saveAsTextFile("path+file_satya.csv") ## this creates one partition in directory of csv name
#but in both cases no columns information(How to add column names to that csv file???)
# again i am trying to read that csv by
df_new = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").load("the file i just created.csv")
# I am not getting any columns in that... the 1st row becomes the column names
Please don't answer by saying "add a schema to the dataframe after read_csv" or "mention the column names while reading".
Question 1 - while dumping the csv, is there any way I can add the column names to it?
Question 2 - is there a way to create a single csv file (not a directory again) which can be opened by MS Office or Notepad++?
Note: I am currently not using a cluster, as it is too complex for a Spark beginner like me. If anyone can provide a link on how to deal with to_csv into a single file in a clustered environment, that would be a great help.
Try
df.coalesce(1).write.format('com.databricks.spark.csv').save('path+my.csv',header = 'true')
Note that this may not be an issue on your current setup, but on extremely large datasets, you can run into memory problems on the driver. This will also take longer (in a cluster scenario) as everything has to push back to a single location.
Just in case,
on spark 2.1 you can create a single csv file with the following lines
dataframe.coalesce(1) //So just a single part- file will be created
.write.mode(SaveMode.Overwrite)
.option("mapreduce.fileoutputcommitter.marksuccessfuljobs","false") //Avoid creating of crc files
.option("header","true") //Write the header
.csv("csvFullPath")
With spark >= 2.0, we can do something like:
df = spark.read.csv('path+filename.csv', sep = 'ifany',header='true')
df.write.csv('path_filename of csv',header=True) ###yes still in partitions
df.toPandas().to_csv('path_filename of csv',index=False) ###single csv(Pandas Style)
The following should do the trick:
df \
.write \
.mode('overwrite') \
.option('header', 'true') \
.csv('output.csv')
Alternatively, if you want the results to be in a single partition, you can use coalesce(1):
df \
.coalesce(1) \
.write \
.mode('overwrite') \
.option('header', 'true') \
.csv('output.csv')
Note however that this is an expensive operation and might not be feasible with extremely large datasets.
Got the answer for the 1st question: it was a matter of passing one extra parameter, header='true', along with the csv statement:
df.write.format('com.databricks.spark.csv').save('path+my.csv', header='true')
Alternative for the 2nd question:
Using toPandas().to_csv(). But again, I don't want to use pandas here, so please suggest if there is any other way around.
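One possible sketch, not from the answers above, assuming Spark 2.x with a SparkSession named spark and placeholder paths: write a single part file with coalesce(1) and a header, then rename it to a real .csv file via the Hadoop FileSystem API so you are not left with a directory.
# Hedged sketch: coalesce to one partition, write with a header, then rename the
# single part-* file to the final .csv name using the Hadoop FileSystem API.
tmp_dir = "path/my_csv_tmp"   # placeholder temporary directory
final_path = "path/my.csv"    # placeholder final file name

df.coalesce(1).write.mode("overwrite").option("header", "true").csv(tmp_dir)

hadoop = spark._jvm.org.apache.hadoop
fs = hadoop.fs.FileSystem.get(spark._jsc.hadoopConfiguration())
part_file = fs.globStatus(hadoop.fs.Path(tmp_dir + "/part-*"))[0].getPath()
fs.rename(part_file, hadoop.fs.Path(final_path))
fs.delete(hadoop.fs.Path(tmp_dir), True)  # clean up the temporary directory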
