spark.read parquet into a dataframe gives null values - python

I'm new to PySpark and, long story short:
I have a parquet file and I am trying to read it and use it with Spark SQL, but currently I can only:
Read the file with a schema, but it gives NULL values - spark.read.format
Read the file without a schema (the first row's values become the column names) - read_parquet
I have a parquet file "locations.parquet" and this schema:
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

location_schema = StructType([
    StructField("loc_id", IntegerType()),
    StructField("descr", StringType())
])
I have been trying to read the parquet file:
location_df = spark.read.format('parquet') \
.options(header='false') \
.schema(location_schema) \
.load("data/locations.parquet")
Then I put everything into a Temp table to run queries:
location_df.registerTempTable("location")
But after trying to run a query:
query = spark.sql("select * from location")
query.show(100)
It gives me NULL values:
The parquet file is correct since I have been running successfully this:
great = pd.read_parquet('data/locations.parquet', engine='auto')
But the problem with read_parquet (from my understanding) is that I cannot set a schema like I did with spark.read.format.
If I use spark.read.format with csv, it also runs successfully and returns data.
Any advice is greatly appreciated, thanks.

location_df.schema gives me: StructType(List(StructField(862,LongType,true),StructField(Animation,StringType,true)))
You have a Struct with an array, not just two columns. That explains why your first example is null (the names/types don't match). It would also throw errors if you started selecting loc_id or descr individually.
It appears that the field names within the array of Structs are 862 and Animation, too, not the fields you're interested in.
You can download a parquet-tools utility separately to inspect the file data and print the file schema without Spark. Start debugging there...
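If installing parquet-tools is inconvenient, the footer schema can also be inspected from Python with pyarrow (a minimal sketch, assuming pyarrow is available, which the working pandas.read_parquet call above suggests):
import pyarrow.parquet as pq

# Print the schema stored in the Parquet footer, without involving Spark.
print(pq.read_schema('data/locations.parquet'))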
If you don't give the Spark reader a schema, that information is inferred from the file itself, since Parquet files include the schema (in the footer of the file, not the header).
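Given that the footer column names (862 and Animation) don't match loc_id/descr, one way forward is to let Spark use the embedded schema and then rename and cast the columns. This is only a sketch under that assumption:
from pyspark.sql import functions as F

# Read with the schema embedded in the Parquet footer, then rename positionally
# and cast the first column down from LongType to IntegerType if desired.
location_df = spark.read.parquet("data/locations.parquet")
location_df = location_df.toDF("loc_id", "descr") \
    .withColumn("loc_id", F.col("loc_id").cast("int"))

location_df.createOrReplaceTempView("location")
spark.sql("select * from location").show(100)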

Related

Is there a way to ignore entirely the first n rows of a comma separated csv when reading data in using Pyspark in Databricks?

I have a series of datasets (reports) as separate csv files (UTF-8), where the report metadata (parameters used) is made available at the top of each dataset. When trying to read these in, Spark only acknowledges the first two columns (for instance "Report Name" : "Sales Current Year"); it's picking up only the columns where the report details are defined and not the entirety of the dataset itself (187 columns). I've tried everything, but this was the closest I came to reading in the dataset while excluding the parameter information at the top of the report and fully capturing the dataset:
row_rdd = spark.sparkContext \
.textFile(load_path) \
.zipWithIndex() \
.filter(lambda row: row[1] >= n_skip_rows) \
.map(lambda row: row[0])
sparkDF = spark.read.option("multiLine", True) \
    .option("mode", "DROPMALFORMED") \
    .option("inferSchema", "true") \
    .option("header", False) \
    .csv(row_rdd)
To add to the complexity, when I download one of these csv files and open it in a text editor, I see there are many empty lines, which I presume is throwing off the structure of the data on read. To try to ignore those lines, I've added .option("mode", "DROPMALFORMED") as above, however now the Spark reader only reads in 489 of ~8000 records.
Any advice or thoughts on how to tackle this appreciated!
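One hedged way to tackle the empty lines is to filter them out of the RDD before handing it to the CSV reader, instead of relying on DROPMALFORMED. A sketch building on the snippet above (load_path and n_skip_rows are assumed to be defined as there):
row_rdd = spark.sparkContext \
    .textFile(load_path) \
    .zipWithIndex() \
    .filter(lambda row: row[1] >= n_skip_rows) \
    .map(lambda row: row[0]) \
    .filter(lambda line: line.strip() != "")  # drop blank lines before parsing

sparkDF = spark.read \
    .option("inferSchema", "true") \
    .option("header", False) \
    .csv(row_rdd)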

Overwrite specific CSV partitions pyspark

I have a spark dataframe named df, which is partitioned on the column date. I need to save on S3 this dataframe with the CSV format. When I write the dataframe, I need to delete the partitions (i.e. the dates) on S3 for which the dataframe has data to be written to. All the other partitions need to remain intact.
I saw here that this is exactly the job of the option spark.sql.sources.partitionOverwriteMode set to dynamic.
However, it does not seem to work for me with CSV files.
If I use it on parquet with the following command it works perfectly:
df.write \
    .option("partitionOverwriteMode", "dynamic") \
    .partitionBy("date") \
    .format("parquet") \
    .mode("overwrite") \
    .save(output_dir)
But if I use it on CSV with the following command it does not work:
df.write \
    .option("partitionOverwriteMode", "dynamic") \
    .partitionBy("date") \
    .format("csv") \
    .mode("overwrite") \
    .save(output_dir)
Why is this the case? Any idea of how this behaviour could be implemented with CSV outputs?
I need to delete the partitions (i.e. the dates) on S3 for which the dataframe has data to be written to
Assuming you have a convenient list of the dates you are processing you can use the replaceWhere option to determine the partitions to overwrite (delete and replace).
For example:
df.write \
    .partitionBy("date") \
    .option("replaceWhere", "date >= '2020-12-14' AND date <= '2020-12-15'") \
    .format("csv") \
    .mode("overwrite") \
    .save(output_dir)
A more dynamic way is to keep the start_date and end_date in variables:
start_date = "2022-01-01"
end_date = "2022-01-14"
condition = f"date >= '{start_date}' AND date <= '{end_date}'"
df.write \
    .partitionBy("date") \
    .option("replaceWhere", condition) \
    .format("csv") \
    .mode("overwrite") \
    .save(output_dir)
What Spark version do you use? For Spark < 2.0.0, it may not be possible to use partitioning along with the csv format.
If you are not on EMR, and are using the s3a committers to safely commit work to s3, then the partitioned committer can be set to delete all data in the destination partitions before committing new work, leaving all other partitions alone.
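If neither the dynamic overwrite setting nor the committer route is available, a manual approach is to delete only the affected date prefixes on S3 yourself and then write in append mode. A rough sketch using boto3; the bucket and prefix names are placeholders, and the path scheme (s3:// vs s3a://) depends on your setup:
import boto3

bucket = "my-bucket"      # placeholder
prefix = "output_dir"     # placeholder

# Partitions the dataframe is about to rewrite.
dates_to_replace = [row["date"] for row in df.select("date").distinct().collect()]

s3 = boto3.resource("s3")
for d in dates_to_replace:
    # Remove only the partitions that will be replaced.
    s3.Bucket(bucket).objects.filter(Prefix=f"{prefix}/date={d}/").delete()

# Append leaves every other partition untouched.
df.write.partitionBy("date").format("csv").mode("append").save(f"s3a://{bucket}/{prefix}")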

Drop partition columns when writing parquet in pyspark

I have a dataframe with a date column. I have parsed it into year, month, day columns. I want to partition on these columns, but I do not want the columns to persist in the parquet files.
Here is my approach to partitioning and writing the data:
df = df.withColumn('year', f.year(f.col('date_col'))) \
       .withColumn('month', f.month(f.col('date_col'))) \
       .withColumn('day', f.dayofmonth(f.col('date_col')))
df.write.partitionBy('year','month', 'day').parquet('/mnt/test/test.parquet')
This properly creates the parquet files, including the nested folder structure. However I do not want the year, month, or day columns in the parquet files.
Spark/Hive won't write the year, month, day columns in your parquet files, as they are already in the partitionBy clause.
Example:
val df = Seq((1,"a"),(2,"b")).toDF("id","name")
df.coalesce(1).write.partitionBy("id").csv("/user/shu/temporary2") // write csv file
Checking contents of csv file:
hadoop fs -cat /user/shu/temporary2/id=1/part-00000-dc55f08e-9143-4b60-a94e-e28b1d7d9285.c000.csv
Output:
a
As you can see, there is no id value included in the csv file. In the same way, if you write a parquet file, the partition columns are not included in the part-*.parquet files.
To check the schema of a parquet file:
parquet-tools schema <hdfs://nn:8020/parquet_file>
This also lets you verify which columns are included in your parquet file.
If you use df.write.partitionBy('year', 'month', 'day'), these columns are not actually physically stored in the file data. They are simply rendered via the folder structure that partitionBy creates.
Ex. partitionBy('year').csv("/data") will create something like:
/data/year=2018/part1---.csv
/data/year=2019/part1---.csv
When you read the data back, it uses the special path year=xxx to populate these columns.
You can prove it by reading in the data of a single partition directly.
Ex. year will not be a column in this case.
df = spark.read.csv("data/year=2019/")
df.printSchema()
Also @Shu's answer could be used to investigate.
You can rest assured that these columns are not taking up storage space.
If you simply don't want to see the columns, you could put a view on top of this table that excludes them.
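A sketch of that view idea, reusing the path from the question; the view name is made up and only the non-partition columns are exposed:
full_df = spark.read.parquet('/mnt/test/test.parquet')
full_df.drop('year', 'month', 'day').createOrReplaceTempView("events")

spark.table("events").printSchema()  # year/month/day no longer appear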

After df.toPandas().to_csv('mycsv.csv'), data is garbled when reread

I have a table named result_25. I use this code to successfully export data to csv on my disk.
result_25.toPandas().to_csv('mycsv.csv')
In order to check whether I save the file correctly, I read my table back in with this code:
rr = spark.read.csv('mycsv.csv', inferSchema=True, header=True)
I checked the data, and it looked fine.
[screenshot: rr and result_25 datasets]
But when I checked result_25 and rr with .describe().show(), they showed different results (I was expecting them to be the same).
[screenshot: result_25 and rr .describe() output]
And when I grouped by 'prediction', they were even more different.
[screenshots: rr and result_25 grouped by 'prediction']
What's wrong here? Anybody can help me? Thanks!!!
By default, pandas.to_csv adds an index to the CSV export (from the docs):
index: boolean, default True
Write row names (index)
You can export to CSV without the index:
result_25.toPandas().to_csv('mycsv.csv', index=False)
and you won't see the additional column _c0 (the column name _c0 is added by pyspark since pandas does not give any name to the index column).
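If the CSV with the index has already been written, a small workaround (a sketch only) is to drop the extra column after reading it back:
rr = spark.read.csv('mycsv.csv', inferSchema=True, header=True).drop('_c0')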
If you only use spark (and don't need the saved data frame in human-readable format), another way to avoid this is to write/read pyspark data frames in other formats such as JSON or parquet:
# JSON
result_25.write.json('mydataframe.json')
rr = spark.read.json('mydataframe.json')
# parquet
result_25.write.parquet('mydataframe.parquet')
rr = spark.read.parquet('mydataframe.parquet')

writing a csv with column names and reading a csv file which is being generated from a sparksql dataframe in Pyspark

I have started the shell with the databricks csv package
#../spark-1.6.1-bin-hadoop2.6/bin/pyspark --packages com.databricks:spark-csv_2.11:1.3.0
Then I read a csv file, did some groupby ops, and dumped that to a csv.
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
df = sqlContext.read.format('com.databricks.spark.csv').options(header='true').load('path.csv') #### it has columns and df.columns works fine
type(df) #<class 'pyspark.sql.dataframe.DataFrame'>
#now trying to dump a csv
df.write.format('com.databricks.spark.csv').save('path+my.csv')
#it creates a directory my.csv with 2 partitions
### To create single file i followed below line of code
#df.rdd.map(lambda x: ",".join(map(str, x))).coalesce(1).saveAsTextFile("path+file_satya.csv") ## this creates one partition in directory of csv name
#but in both cases there is no column information (how to add column names to that csv file???)
# again i am trying to read that csv by
df_new = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").load("the file i just created.csv")
#I am not getting any columns in that... the 1st row becomes the column names
Please don't answer with "add a schema to the dataframe after read_csv" or "specify the column names while reading".
Question 1: while writing the csv dump, is there any way I can add the column names to it?
Question 2: is there a way to create a single csv file (not a directory again) which can be opened by MS Office or Notepad++?
Note: I am currently not using a cluster, as it is too complex for a Spark beginner like me. If anyone can provide a link for how to handle to_csv into a single file in a clustered environment, that would be a great help.
Try
df.coalesce(1).write.format('com.databricks.spark.csv').save('path+my.csv', header='true')
Note that this may not be an issue on your current setup, but on extremely large datasets, you can run into memory problems on the driver. This will also take longer (in a cluster scenario) as everything has to push back to a single location.
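If a single file with a friendly name is needed (Question 2), one hedged follow-up is to copy the lone part file out of the output directory afterwards. The paths below are placeholders and a local filesystem is assumed (on HDFS/S3 you would use the corresponding filesystem API instead):
import glob
import shutil

# coalesce(1) ensures the output directory contains exactly one part file.
df.coalesce(1).write.format('com.databricks.spark.csv') \
    .save('path_to_output_dir', header='true')

# Copy that single part file to a csv that MS Office or Notepad++ can open.
part_file = glob.glob('path_to_output_dir/part-*')[0]
shutil.copy(part_file, 'my_single_file.csv')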
Just in case, on Spark 2.1 you can create a single csv file with the following lines (Scala):
dataframe.coalesce(1) // so that just a single part file is created
  .write.mode(SaveMode.Overwrite)
  .option("mapreduce.fileoutputcommitter.marksuccessfuljobs", "false") // avoid creating .crc files
  .option("header", "true") // write the header
  .csv("csvFullPath")
With Spark >= 2.0, we can do something like:
df = spark.read.csv('path+filename.csv', sep='ifany', header='true')
df.write.csv('path_filename of csv', header=True) ### yes, still in partitions
df.toPandas().to_csv('path_filename of csv', index=False) ### single csv (pandas style)
The following should do the trick:
df \
.write \
.mode('overwrite') \
.option('header', 'true') \
.csv('output.csv')
Alternatively, if you want the results to be in a single partition, you can use coalesce(1):
df \
.coalesce(1) \
.write \
.mode('overwrite') \
.option('header', 'true') \
.csv('output.csv')
Note however that this is an expensive operation and might not be feasible with extremely large datasets.
Got the answer for the 1st question: it was a matter of passing one extra parameter, header='true', along with the csv statement.
df.write.format('com.databricks.spark.csv').save('path+my.csv', header='true')
# Alternative for the 2nd question
Using toPandas().to_csv, but again I don't want to use pandas here, so please suggest if there is any other way around this.
