Drop partition columns when writing parquet in pyspark - python

I have a dataframe with a date column. I have parsed it into year, month, day columns. I want to partition on these columns, but I do not want the columns to persist in the parquet files.
Here is my approach to partitioning and writing the data:
from pyspark.sql import functions as f

df = (df.withColumn('year', f.year(f.col('date_col')))
        .withColumn('month', f.month(f.col('date_col')))
        .withColumn('day', f.dayofmonth(f.col('date_col'))))
df.write.partitionBy('year', 'month', 'day').parquet('/mnt/test/test.parquet')
This properly creates the parquet files, including the nested folder structure. However, I do not want the year, month, or day columns in the parquet files themselves.

Spark/Hive won't write the year, month, and day columns into your parquet files because they are already in the partitionBy clause.
Example:
val df = Seq((1, "a"), (2, "b")).toDF("id", "name")
df.coalesce(1).write.partitionBy("id").csv("/user/shu/temporary2") // write csv file
Checking contents of csv file:
hadoop fs -cat /user/shu/temporary2/id=1/part-00000-dc55f08e-9143-4b60-a94e-e28b1d7d9285.c000.csv
Output:
a
As you can see, there is no id value included in the csv file. In the same way, if you write a parquet file, the partition columns are not included in the part-*.parquet files.
To check the schema of a parquet file:
parquet-tools schema <hdfs://nn:8020/parquet_file>
This also lets you verify exactly which columns are included in your parquet file.
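If you prefer to check from Python instead, a minimal sketch using pyarrow (the part file path below is a placeholder for one of your actual part files):
import pyarrow.parquet as pq

# Placeholder path: point this at one of the part-*.parquet files under a partition folder.
schema = pq.read_schema('/mnt/test/test.parquet/year=2020/month=1/day=1/part-00000.parquet')
print(schema)  # the partition columns year, month, day will not appear here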

If you use df.write.partitionBy('year', 'month', 'day'), these columns are not actually stored in the file data. They are simply rendered via the folder structure that partitionBy creates.
Ex. partitionBy('year').csv("/data") will create something like:
/data/year=2018/part1---.csv
/data/year=2019/part1---.csv
When you read the data back, Spark uses the special path segment year=xxx to populate these columns.
You can prove it by reading the data of a single partition directly; year will not be a column in this case:
df = spark.read.csv("data/year=2019/")
df.printSchema()
@Shu's answer can also be used to investigate.
You can rest assured that these columns are not taking up storage space.
If you really don't want to see the columns at all, you could put a view on top of this table that excludes them.
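For example, a minimal sketch of that view approach (the view name is hypothetical):
# Read the partitioned data back; Spark reconstructs year/month/day from the paths.
df = spark.read.parquet('/mnt/test/test.parquet')

# Drop the reconstructed partition columns and expose the rest as a temp view.
df.drop('year', 'month', 'day').createOrReplaceTempView('data_without_partition_cols')
spark.table('data_without_partition_cols').printSchema()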

Related

Skipping rows and columns when reading csv with Pandas

I need help reading a CSV file with pandas.
I have a .csv file that recorded machine parameters and I want to read it with pandas and analyze it. The problem is that the file is not in a proper table format: there are a lot of empty rows and columns, and the parameter values start from the 301st line (for example).
How can I read this csv file properly?
You can use skiprows:
pd.read_csv(csv_file, skiprows=301)
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html
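As a rough sketch (the file name and the number of skipped rows are placeholders for your layout), you can combine skiprows with dropna to discard the empty rows and columns mentioned in the question:
import pandas as pd

# 'machine_log.csv' is a placeholder; adjust skiprows so parsing starts at your data.
df = pd.read_csv('machine_log.csv', skiprows=300)
df = df.dropna(axis=0, how='all')  # drop rows that are entirely empty
df = df.dropna(axis=1, how='all')  # drop columns that are entirely empty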

Pandas - Change the default date format when creating a dataframe from CSVs

I have a script that loops through a folder of CSVs, reads them, removes any empty rows (they all have 'empty' rows that Pandas reads as NaN) and appends them to a master dataframe. It then writes the dataframe to a new CSV. This is all working as expected:
# Inside the loop over the files in sourceLoc:
if pl.Path(file).suffix == '.csv':
    fullPath = os.path.join(sourceLoc, file)
    print(file)
    initDF = pd.read_csv(fullPath)
    cleanDF = initDF.dropna(subset=['Name'])
    masterDF = masterDF.append(cleanDF)

# After the loop:
masterDF.to_csv(destLoc, index=False)
My only issue is the input dates are displayed like this 25/05/21 but the output dates end up formatted like this 05/25/21. As I'm in the UK and using a UK version of Excel to analyse the output, it's confusing all my functions.
The only solutions I've found so far are to reformat the date columns individually or style them, which to my understanding only affects how they look in Jupyter and not in the actual data. As there are multiple date columns in the source data files I'd rather not have to reformat them all individually.
Is there any way of defining the date format when first creating the dataframe, or reformatting every date column once the dataframe is filled?
In the end this issue was caused by two different problems.
The first was Excel intermittently exporting my dates in US format despite the original format (and my Windows Region settings) being UK format. I've now added a short VBA loop in my export code to ensure those columns are formatted correctly every time the data is exported.
The second was the CSV date being imported with incorrect dtypes. I suspect this was again the fault of Excel (2010 is problematic) but I'm unsure. I'm now correcting this with an astype() method.
The end result is my dates are now imported into Pandas in the correct format and outputted to a new CSV in the correct format too.
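For reference, the date format can also be controlled on the pandas side when reading and writing; a minimal sketch reusing the variables from the snippet above (the date column names are placeholders):
# parse_dates + dayfirst=True reads 25/05/21 as day/month/year.
initDF = pd.read_csv(fullPath, parse_dates=['DateRaised', 'DateClosed'], dayfirst=True)

# date_format controls how datetime columns are written back out.
masterDF.to_csv(destLoc, index=False, date_format='%d/%m/%Y')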

Updating Parquet Partition files with new values/rows

For this example say I have a dataframe of 5000 rows and 5 columns. I have used the following line to create a parquet file partitioned on 'partition_key' (I have ~30 unique PKs):
dataframe.to_parquet(filename, engine='pyarrow', compression='snappy',
                     partition_cols=['partition_key'])
I am reading the parquet file back into my jupyter notebook as a subset dataframe using the filters parameter (say the result is 100 rows, 5 cols) successfully via:
new_df = pd.read_parquet(filename, engine='pyarrow',
                         filters=[('partition_key', '=', 'some_value')])  # 'some_value' is one of the partition keys
What I need to do is update values in the 'new_df' dataframe and then save it back, and/or replace the exact 100 entries/rows in the original parquet file. Is there a way to either update the original parquet file, or perhaps delete the partition folder that I grabbed via the filters parameter, make my edits to 'new_df', and append it back to the original parquet file?
Any suggestions are appreciated!
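One common workaround (not a true in-place update) is to rewrite just the affected partition directory; a rough sketch, assuming filename is the dataset root and 'some_value' is the key that was filtered on above:
import os
import shutil

# Placeholder name: 'some_value' is the partition key used in the read_parquet filter.
part_dir = os.path.join(filename, 'partition_key=some_value')
shutil.rmtree(part_dir)  # remove the stale partition folder

# Re-write the edited rows under the same root; they land back in that partition folder.
new_df.to_parquet(filename, engine='pyarrow', compression='snappy',
                  partition_cols=['partition_key'])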

How to read data from excel from a particular column in python

I have an excel sheet that I am reading using pandas in Python.
Now I want to read the excel file based on a column: if the column has some value then do not read that row; if the column is empty then read it and store the values in a list.
Here is a screenshot
Excel Example
In the image above, when the uniqueidentifier is 'yes' the row should not be read, but if it is empty then reading should start from that row.
How can I do that using Python, and how can I get the index so that, after I have performed some function, I can write back to the blank unique identifier column to record that the row has been read?
This is possible for csv files. There you could do:
import pandas as pd

iter_csv = pd.read_csv('file.csv', iterator=True, chunksize=100000)
df = pd.concat([chunk[chunk['UniqueIdentifier'] == 'True'] for chunk in iter_csv])
But pd.read_excel does not offer to return an iterator object; maybe some other excel readers do, but I don't know which ones. Nevertheless, you could export your excel file as csv and use the csv solution.
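If you want to stay with the excel file itself, a rough pandas sketch (the file and column names are assumptions based on the screenshot) could look like this:
import pandas as pd

# Placeholder file/column names; UniqueIdentifier is empty (NaN) for unread rows.
df = pd.read_excel('machines.xlsx')
unread = df['UniqueIdentifier'].isna()
values = df.loc[unread, 'Parameter'].tolist()  # collect the values of the unread rows
df.loc[unread, 'UniqueIdentifier'] = 'yes'     # mark those rows as read
df.to_excel('machines.xlsx', index=False)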

Reading dataframe from multiple input paths and adding columns simultaneously

I am trying to read multiple input paths and, based on the dates in the paths, add two columns to the data frame. The files were stored as orc, partitioned by these dates using hive, so they have a structure like
s3n://bucket_name/folder_name/partition1=value1/partition2=value2
where partition2 = mg_load_date. So I am trying to fetch multiple directories from multiple paths and, based on the partitions, I have to create two columns, namely mg_load_date and event_date, for each spark dataframe. I am reading these as input and combining them after adding the two columns, finding the dates for each file respectively.
Is there any other way, since I have many reads for each file, to read all the files at once while adding the two columns for their specific rows? Or any other way to make the read operation faster, given the many reads?
I guess reading all the files at once, like sqlContext.read.format('orc').load(inputpaths), is faster than reading them individually and then merging them.
Any help is appreciated.
import re
from functools import reduce
from pyspark.sql import DataFrame, functions as F

dfs = []
for i in input_paths:
    df = sqlContext.read.format('orc').load(i)
    date = re.search('mg_load_date=([^/]*)/$', i).group(1)
    df = df.withColumn('event_date', F.lit(date)).withColumn('mg_load_date', F.lit(date))
    dfs += [df]
df = reduce(DataFrame.unionAll, dfs)
As @user8371915 says, you should load your data from the root path instead of passing a list of subdirectories:
sqlContext.read.format('orc').load("s3n://bucket_name/folder_name/")
Then you'll have access to your partitioning columns partition1 and partition2.
If for some reason you can't load from the root path, you can try using pyspark.sql.functions.input_file_name to get the name of the file for each row of your dataframe.
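A rough sketch of that fallback, reusing the regex and column names from the loop above:
from pyspark.sql import functions as F

df = sqlContext.read.format('orc').load(input_paths)
date_col = F.regexp_extract(F.input_file_name(), 'mg_load_date=([^/]*)/', 1)
df = df.withColumn('mg_load_date', date_col).withColumn('event_date', date_col)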
In Spark 2.2.0+ you can read from multiple folders in orc format at once:
df = spark.read.orc([path1, path2])
ref: https://issues.apache.org/jira/browse/SPARK-12334
