Updating Parquet Partition files with new values/rows - python

For this example, say I have a dataframe of 5000 rows and 5 columns. I have used the following line to create a parquet dataset partitioned on 'partition_key' (I have ~30 unique partition keys):
dataframe.to_parquet(filename, engine='pyarrow', compression='snappy', partition_cols=['partition_key'])
I am reading the parquet dataset back into my Jupyter notebook and successfully getting a subset dataframe using the filters parameter (say the result is 100 rows, 5 cols) via:
new_df = pd.read_parquet(filename, engine='pyarrow', filters=[('partition_key', '==', 'some_key')])
What I need to do is update values in the 'new_df' dataframe and then save it back, and/or replace the exact 100 entries/rows in the original parquet dataset. Is there a way to either update the original parquet file in place, or perhaps delete the partition folder that I grabbed via the filters parameter, make my edits to 'new_df', and append it back to the original dataset?
Any suggestions are appreciated!
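Parquet files are immutable, so they can't be updated in place; a common workaround is the second idea in the question: drop the partition directory you read, then write the edited subset back. A minimal sketch under those assumptions (the key, dataset root, column names, and the edit itself below are placeholders, not the asker's actual data):
import shutil
import pandas as pd

filename = 'my_dataset'   # root directory of the partitioned dataset (placeholder)
key = 'some_key'          # the partition value you filtered on (placeholder)

# 1. Read only that partition
new_df = pd.read_parquet(filename, engine='pyarrow',
                         filters=[('partition_key', '==', key)])

# 2. Make the edits
new_df.loc[new_df['col_a'] > 0, 'col_b'] = 123   # hypothetical edit

# 3. Remove the old partition directory (hive-style layout: <root>/partition_key=<key>/)
shutil.rmtree('{}/partition_key={}'.format(filename, key))

# 4. Write the edited rows back; this recreates only that partition folder
new_df.to_parquet(filename, engine='pyarrow', compression='snappy',
                  partition_cols=['partition_key'])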

Related

How to create a dataframe by using column names and text/excel files using pyspark

I have text/excel files in my blob storage, and I want to fetch only certain columns from those files and create a dataframe.
For example:
a, b, c, d, e are the columns present in my file.
I have the path for that file: /mnt/reservoir/files/file1.txt
I want to create a dataframe from this file, fetching only columns b, c, d.
How can I achieve this using pyspark?
To work around this and create a dataframe with only specific columns, you have to read the whole text/excel file first and then create another dataframe by selecting the columns you need.
Code Example:
df = spark.read.format("csv").option("header", "true").option("delimter", ",").load("/mnt/reservoir/files/file1.txt")
df2 = df.select(df.b, df.c, df.d)
OR
df3 = df.select(df.columns[1:4])
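For completeness, a small variant (a sketch assuming the same file path and header row as above) selects the columns by name strings instead of attribute access:
# Same read as above, selecting columns by their name strings
df4 = (spark.read
       .option("header", "true")
       .option("delimiter", ",")
       .csv("/mnt/reservoir/files/file1.txt")
       .select("b", "c", "d"))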

Python script to read csv files every 15 minutes and discard the csv file that has been read and update dataframe every time a new csv file is read

I have a local directory which automatically receives multiple csv files every 15 minutes from a measured-values provider. I want to read these files into a pandas dataframe, creating a new column for each csv file or adding new rows if the column already exists, and then load the dataframe into a database. I am using a for loop to read the files and merging on timestamp. How can I read the csv files and discard the ones that have already been read, so that when the python script runs again after 15 minutes it reads only the new files and updates the existing dataframe instead of building a new one from scratch? Also, the timestamps in my csv files overlap. How can I merge the csv files with the same timestamp and keep only the last value received for a particular timestamp? Right now, if I merge on timestamp, I get duplicate rows.
The more time passes, the longer the for loop takes to iterate over all the csv files and the longer the dataframe takes to load, because the number of csv files keeps growing; and if I delete the csv files manually, the older values disappear from the dataframe.
import glob
import pandas as pd

extension = 'csv'
all_filenames = [i for i in glob.glob('*Wind*.{}'.format(extension))]

# seed dataframe from one known file
al = pd.read_csv('202216_14345.123_0000.Wind15min_202203161432.csv', sep=';')
al['#timestamp'] = pd.to_datetime(al['#timestamp'])

frames = []
for f in all_filenames:
    df = pd.read_csv(f, sep=';')
    df['#timestamp'] = pd.to_datetime(df['#timestamp'])
    # print(df.head(5))
    frames.append(df)
    # result = pd.concat(frames, axis=1, join='inner')
    result = pd.merge(al, df, how='outer')
    al = result
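A common pattern for both problems (a sketch, not the asker's code; the inbox/archive folder names and the empty starting dataframe are assumptions) is to move each csv out of the inbox once it has been read, and to deduplicate on the timestamp keeping only the last value received:
import glob
import os
import shutil
import pandas as pd

INBOX = '.'            # folder the provider drops files into (assumption)
ARCHIVE = 'processed'  # files are moved here once read, so they are not re-read (assumption)
os.makedirs(ARCHIVE, exist_ok=True)

al = pd.DataFrame()    # running result; in practice, load this from your database or previous run

frames = []
for f in glob.glob(os.path.join(INBOX, '*Wind*.csv')):
    df = pd.read_csv(f, sep=';')
    df['#timestamp'] = pd.to_datetime(df['#timestamp'])
    frames.append(df)
    shutil.move(f, os.path.join(ARCHIVE, os.path.basename(f)))  # discard from the inbox

if frames:
    new_data = pd.concat(frames, ignore_index=True)
    # keep only the last value received for each timestamp within this batch of files
    new_data = new_data.drop_duplicates(subset='#timestamp', keep='last')
    # update the running result across runs, again keeping the latest value per timestamp
    al = (pd.concat([al, new_data], ignore_index=True)
            .drop_duplicates(subset='#timestamp', keep='last')
            .sort_values('#timestamp')
            .reset_index(drop=True))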

Drop partition columns when writing parquet in pyspark

I have a dataframe with a date column. I have parsed it into year, month, day columns. I want to partition on these columns, but I do not want the columns to persist in the parquet files.
Here is my approach to partitioning and writing the data:
df = df.withColumn('year', f.year(f.col('date_col'))).withColumn('month',f.month(f.col('date_col'))).withColumn('day',f.dayofmonth(f.col('date_col')))
df.write.partitionBy('year','month', 'day').parquet('/mnt/test/test.parquet')
This properly creates the parquet files, including the nested folder structure. However I do not want the year, month, or day columns in the parquet files.
Spark/Hive won't write the year, month, day columns into your parquet files, as they are already in the partitionBy clause.
Example:
val df=Seq((1,"a"),(2,"b")).toDF("id","name")
df.coalesce(1).write.partitionBy("id").csv("/user/shu/temporary2") //write csv file.
Checking contents of csv file:
hadoop fs -cat /user/shu/temporary2/id=1/part-00000-dc55f08e-9143-4b60-a94e-e28b1d7d9285.c000.csv
Output:
a
As you can see, there is no id value included in the csv file; in the same way, if you write a parquet file, the partition columns are not included in the part-*.parquet files.
To check the schema of a parquet file:
parquet-tools schema <hdfs://nn:8020/parquet_file>
This lets you verify which columns are actually included in your parquet file.
If you use df.write.partitionBy('year', 'month', 'day'), these columns are not actually physically stored in the file data. They are simply rendered via the folder structure that partitionBy creates.
Ex. partitionBy('year').csv("/data") will create something like:
/data/year=2018/part1---.csv
/data/year=2019/part1---.csv
When you read the data back it uses the special path year=xxx to populate these columns.
You can prove it by reading in the data of a single partition directly.
Ex. year will not be a column in this case.
df = spark.read.csv("data/year=2019/")
df.printSchema()
@Shu's answer can also be used to investigate.
You can rest assured that these columns are not taking up storage space.
If you really don't want to see the columns at all, you could put a view on top of this table that excludes them.
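A minimal sketch of that last idea (assuming the dataset path from the question and an active spark session; the view name is made up):
# Read the partitioned dataset and expose it without the partition columns
df = spark.read.parquet('/mnt/test/test.parquet')
df.drop('year', 'month', 'day').createOrReplaceTempView('test_without_date_parts')

spark.sql('SELECT * FROM test_without_date_parts').show(5)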

Reading dataframe from multiple input paths and adding columns simultaneously

I am trying to read multiple input paths and, based on the dates in the paths, add two columns to the data frame. The files were stored as orc, partitioned by these dates using hive, so they have a structure like
s3n://bucket_name/folder_name/partition1=value1/partition2=value2
where partition2 = mg_load_date. So here I am fetching multiple directories from multiple paths, and based on the partitions I have to create two columns, namely mg_load_date and event_date, for each spark dataframe. I am reading these as input and combining them after adding the two columns, deriving the dates for each file respectively.
Since I have a separate read for each path, is there a way to read all the files at once while still adding the two columns for their specific rows? Or any other way to make the read operation faster, given the number of reads?
I guess reading all the files at once like sqlContext.read.format('orc').load(input_paths) would be faster than reading them individually and then merging them.
Any help is appreciated.
import re
from functools import reduce
from pyspark.sql import DataFrame, functions as F

dfs = []
for i in input_paths:
    df = sqlContext.read.format('orc').load(i)
    # pull the partition value out of the path, e.g. .../mg_load_date=2018-01-01/
    date = re.search('mg_load_date=([^/]*)/$', i).group(1)
    df = df.withColumn('event_date', F.lit(date)).withColumn('mg_load_date', F.lit(date))
    dfs += [df]
df = reduce(DataFrame.unionAll, dfs)
As @user8371915 says, you should load your data from the root path instead of passing a list of subdirectories:
sqlContext.read.format('orc').load("s3n://bucket_name/folder_name/")
Then you'll have access to your partitioning columns partition1 and partition2.
If for some reason you can't load from the root path, you can try using pyspark.sql.functions.input_file_name to get the name of the file for each row of your dataframe.
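A sketch of that fallback (assuming the same sqlContext, input_paths, and column names as in the question's loop; the regex mirrors the one used there):
from pyspark.sql import functions as F

df = sqlContext.read.format('orc').load(input_paths)
# derive the partition value from each row's source file path
df = df.withColumn('mg_load_date',
                   F.regexp_extract(F.input_file_name(), 'mg_load_date=([^/]*)/', 1))
df = df.withColumn('event_date', F.col('mg_load_date'))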
In Spark 2.2.0+ you can read from multiple folders directly using the orc reader:
df = spark.read.orc([path1, path2])
ref: https://issues.apache.org/jira/browse/SPARK-12334

python pandas HDFStore append with non-constant size data

I am using python 2.7 with pandas and HDFStore
I am trying to process a big data set which fits on disk but not in memory.
I store the data set in a .h5 file; the size of the data in each column is not constant, for instance one column may have a string of 5 chars in one row and a string of 20 chars in another.
So I had issues writing the data to the file in iterations, when the first iteration contained small values and the following batches contained larger ones.
I found that the issue was that min_itemsize was not set properly and the data did not fit into the columns. I used the following code to cache the database into the h5 file without errors:
colsLen = {}
for col in dbCols:
    curs.execute('SELECT MAX(CHAR_LENGTH(%s)) FROM table' % col)
    for a in curs:
        colsLen.update({col: a[0]})

# get the first row to create the hdfstore
rx = dbConAndQuery.dbTableToDf(con, table, limit=1, offset=0)  # this is my utility that is querying the db
hdf.put("table", rx, format="table", data_columns=True, min_itemsize=colsLen)

for i in range(rxRowCount / batchSize + 1):
    rx = dbConAndQuery.dbTableToDf(con, table, limit=batchSize, offset=i * batchSize + 1)
    hdf.append("table", rx, format="table", data_columns=True, min_itemsize=colsLen)

hdf.close()
The issue is: how can I use HDFStore in cases where I can't query the maximum size of each column's data in advance? E.g. getting or creating data in iterations due to memory constraints.
I found that I can process the data using dask with on-disk dataframes, but it lacks some functionality that I need from pandas, so the main idea is to process the data in batches, appending each batch to the existing HDFStore file.
Thanks!
I found that the issue was HDFStore optimizing the data storage and sizing each column based on its largest value.
I found two ways to solve this:
1. Pre-query the database to get the maximum character length for each column.
2. Insert each batch under a new key in the file; then it works, because each batch is stored in the hdf file using its own largest value as the column size (see the sketch below).
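A minimal sketch of option 2 (the file name, key prefix, and toy batches below are assumptions, not the author's code):
import pandas as pd

# toy batches standing in for the real per-batch DB query results (assumption)
batches = [
    pd.DataFrame({'name': ['ab', 'cd']}),
    pd.DataFrame({'name': ['a much longer string value', 'x']}),
]

# option 2: write each batch under its own key, so min_itemsize never has to be known up front
with pd.HDFStore('table.h5') as hdf:
    for i, batch_df in enumerate(batches):
        hdf.put('table_batch_{}'.format(i), batch_df, format='table', data_columns=True)

# read back by concatenating the per-batch keys
with pd.HDFStore('table.h5') as hdf:
    df = pd.concat([hdf[key] for key in hdf.keys()], ignore_index=True)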
