PySpark: Optimize read/load from Delta using selected columns or partitions

I am trying to load data from Delta into a pyspark dataframe.
path_to_data = 's3://mybucket/daily_data/'
df = spark.read.format("delta").load(path_to_data)
Now the underlying data is partitioned by date as
s3://mybucket/daily_data/
dt=2020-06-12
dt=2020-06-13
...
dt=2020-06-22
Is there a way to optimize the read into a DataFrame, given that:
Only a certain date range is needed
Only a subset of columns is needed
The current way I tried is:
df.registerTempTable("my_table")
new_df = spark.sql("select col1,col2 from my_table where dt_col > '2020-06-20' ")
# dt_col is a timestamp column in the dataframe.
In the above setup, does Spark need to load the whole dataset, filter it on the date range, and then filter down to the columns needed? Is there any optimization that can be done in the PySpark read to load only the relevant data, since it is already partitioned?
Something along the lines of:
df = spark.read.format("delta").load(path_to_data,cols_to_read=['col1','col2'])
or
df = spark.read.format("delta").load(path_to_data,partitions=[...])

In your case, there is no extra step needed; the optimizations are taken care of by Spark. Since you have already partitioned the dataset on the column dt, when you query the dataset with the partition column dt as a filter condition, Spark loads only the subset of the data from the source dataset that matches the filter condition, in your case dt > '2020-06-20'.
Spark performs this optimization internally via partition pruning.
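For example, a minimal sketch of reading only what is needed (assuming dt is the partition column and col1/col2 are the columns from the question):
df = (
    spark.read.format("delta")
         .load(path_to_data)
         .where("dt > '2020-06-20'")   # partition filter: only the matching dt=... folders are scanned
         .select("col1", "col2")       # column pruning: only these columns are read from the Parquet files
)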

To do this without SQL:
from pyspark.sql import functions as F
df = spark.read.format("delta").load(path_to_data).filter(F.col("dt_col") > F.lit('2020-06-20'))
Though for this example you may have some work to do with comparing dates.
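If you want to make the date comparison explicit rather than relying on Spark casting the string literal, a small variation of the snippet above should work (a sketch; dt_col, col1, and col2 are taken from the question):
from pyspark.sql import functions as F

new_df = (
    spark.read.format("delta")
         .load(path_to_data)
         .filter(F.col("dt_col") > F.to_timestamp(F.lit("2020-06-20")))  # explicit timestamp comparison
         .select("col1", "col2")
)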

Related

How to divide a spark dataframe into n different chunks and convert them to dataframe and append into one?

I have a Spark dataframe of 100,000 rows. Is there a way to loop through 1000 rows at a time, convert them to a pandas dataframe using toPandas(), and append them into a new dataframe?
Converting it directly using toPandas() takes a very long time. There is no column by which we can divide the dataframe into segments.
You can directly use limit:
import pandas as pd

pd_df = ...  # your existing pandas DataFrame (or an empty one)
sparkDF_1k = sparkDF.limit(1000)                    # take the first 1000 rows
pd_df = pd.concat([pd_df, sparkDF_1k.toPandas()])
Or collect the first thousand rows and build a new Spark DataFrame from them:
firstThousand = train.limit(1000).collect()
df = spark.createDataFrame(firstThousand)
df.toPandas()
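If you really need to walk the whole DataFrame in 1000-row chunks rather than just take the first 1000 rows, one option (a sketch, not from the answer above; spark_to_pandas_in_chunks is a hypothetical helper) is to stream rows to the driver with toLocalIterator and batch them:
from itertools import islice
import pandas as pd

def spark_to_pandas_in_chunks(spark_df, chunk_size=1000):
    rows = spark_df.toLocalIterator()           # streams rows to the driver, partition by partition
    chunks = []
    while True:
        batch = list(islice(rows, chunk_size))  # next chunk_size Row objects
        if not batch:
            break
        chunks.append(pd.DataFrame(batch, columns=spark_df.columns))
    return pd.concat(chunks, ignore_index=True)

pd_df = spark_to_pandas_in_chunks(sparkDF)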

Convert 100k row pyspark df to pandas df

I have a PySpark df with 100k rows. I am using Spark's toPandas():
df = pandas_df.toPandas()
which takes a lot of time to execute. Is there any other way to do this operation within seconds?
Also, saving the PySpark dataframe in .csv format takes a lot of time. Why is that?
Try to repartition your dataframe to a single partition first before converting it to a pandas df:
df = df.repartition(1)
df = df.toPandas()
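Depending on your Spark version and data types, enabling Arrow-based conversion may also speed up toPandas() considerably (the config key below is for Spark 3.x; Spark 2.x uses spark.sql.execution.arrow.enabled):
# Enable Arrow-based columnar data transfer for toPandas()
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")
pandas_df = df.toPandas()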

Pandas between_time equivalent for Dask DataFrame

I have a Dask dataframe created with dd.read_csv("./*/file.csv") where the * glob is a folder for each date. In the concatenated dataframe I want to filter out subsets of time like how I would with a pd.between_time("09:30", "16:00"), say.
Because Dask's internal representation of the index does not have the nice features of pandas's DatetimeIndex, I haven't had any success with filtering the way I normally would in pandas. Short of resorting to a naive mapping function/loop, I am unable to get this to work in Dask.
Since the partitions are by date, perhaps that could be exploited by converting to a Pandas dataframe and then back to a Dask partition, but it seems like there should be a better way.
Updating with the example used in Angus' answer.
I guess I don't understand the logic of the queries in the answers/comments. Is Pandas smart enough to not interpret the boolean mask literally as a string and do the correct datetime comparisons?
Filtering in Dask works just like it does in pandas, with a few convenience functions removed.
For example if you had the following data:
time,A,B
6/18/2020 09:00,29,0.330799201
6/18/2020 10:15,30,0.518081116
6/18/2020 18:25,31,0.790506469
The following code:
import dask.dataframe as dd
df = dd.read_csv('*.csv', parse_dates=['time']).set_index('time')
df.loc[(df.index > "09:30") & (df.index < "16:00")].compute()
(If run on 18 June 2020) this would return:
time,A,B
2020-06-18 10:15:00,30,0.518081
EDIT:
The above answer filters for the current date only; pandas interprets the time string as a datetime value with the current date. If you'd like to filter values for all days between specific times there's a workaround to strip the date from the datetime column:
import dask.dataframe as dd
df = dd.read_csv('*.csv',parse_dates=['time'])
df["time_of_day"] = dd.to_datetime(df["time"].dt.time.astype(str))
df.loc[(df.time_of_day > "09:30") & (df.time_of_day < "16:00")].compute()
Bear in mind there might be a speed penalty to this method, possibly a concern for larger datasets.
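Since each Dask partition is a regular pandas DataFrame, another option (a sketch, assuming the frame is indexed by time as in the first snippet) is to apply pandas' between_time per partition with map_partitions:
import dask.dataframe as dd

df = dd.read_csv('*.csv', parse_dates=['time']).set_index('time')
# between_time runs on each partition's DatetimeIndex
filtered = df.map_partitions(lambda part: part.between_time("09:30", "16:00")).compute()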

I want to select all records from one dataframe where its value exists/not exists in another dataframe. How to do this using pyspark dataframes?

I have the two pyspark dataframes.
I want to select all records from voutdf whose "hash" does not exist in vindf.tx_hash.
How can I do this using PySpark dataframes?
I tried a semi join, but I am ending up with out-of-memory errors.
voutdf = sqlContext.createDataFrame(voutRDD,["hash", "value","n","pubkey"])
vindf = sqlContext.createDataFrame(vinRDD,["txid", "tx_hash","vout"])
You can do it with a left anti join:
df = voutdf.join(vindf.withColumnRenamed("tx_hash", "hash"), "hash", 'left_anti')
Left anti join:
It takes all rows from the left dataset that don't have a match in the right dataset.
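For the opposite requirement ("select all records whose hash does exist in vindf.tx_hash"), the same pattern with a left semi join should work (a sketch using the dataframes from the question):
# keep only voutdf rows whose hash has a match in vindf.tx_hash
existing_df = voutdf.join(vindf.withColumnRenamed("tx_hash", "hash"), "hash", "left_semi")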

Implement MSSQL's partition by windowed clause in Pandas

I’m in the process of moving an MSSQL database to MySQL and have decided to move some stored procedures to Python rather than rewrite them in MySQL. I am using Pandas 0.23 on Python 3.5.4.
The old MSSQL database uses a number of windowed functions. So far I’ve had success converting them using pandas.DataFrame.rolling, as follows:
MSSQL
AVG([Close]) OVER (ORDER BY DateValue ROWS 13 PRECEDING) AS MA14
Python
df['MA14'] = df.Close.rolling(14).mean()
I'm stuck on the PARTITION BY part of the MSSQL windowed function in Python. Based on feedback since posting, I am working on a solution with pandas groupby...
https://pandas.pydata.org/pandas-docs/version/0.23.0/groupby.html
For example, let's say the MSSQL is:
AVG([Close]) OVER (PARTITION BY myCol ORDER BY DateValue ROWS 13 PRECEDING) AS MA14
What I have worked out so far:
Col1 contains my categorical data, which I wish to group by and apply a function to on a rolling basis. There is also a date column, so Col1 and the date column together identify a unique record in the df.
1. Delivers the mean per group of Col1, albeit fully aggregated:
grouped = df.groupby(['Col1']).mean()
print(grouped.tail(20))
2. Appears to apply the rolling mean per categorical group of Col1, which is what I am after:
grouped = df.groupby(['Col1']).Close.rolling(14).mean()
print(grouped.tail(20))
3. Assign to df as a new column RM:
df['RM'] = df.groupby(['Col1']).Close.rolling(14).mean()
print(df.tail(20))
It doesn't like this step, and I get the error:
TypeError: incompatible index of inserted column with frame index
I've worked up a simple example which may help. How do I get the results of #2 into the df from #1, or similar?
import numpy as np
import pandas as pd
dta = {'Colour': ['Red','Red','Blue','Blue','Red','Red','Blue','Red','Blue','Blue','Blue','Red'],
       'Year': [2014,2015,2014,2015,2016,2017,2018,2018,2016,2017,2013,2013],
       'Val': [87,78,863,673,74,81,756,78,694,701,804,69]}
df = pd.DataFrame(dta)
df = df.sort_values(by=['Colour','Year'], ascending=True)
print(df)
#1 add calculated columns to the df. This averages all of column Val
df['ValMA3'] = df.Val.rolling(3).mean().round(0)
print (df)
#2 Group by Colour. This calculates the rolling average per group correctly.
# Where are the other columns from my original dataframe?
# What if I have multiple calculated columns to add?
gf = df.groupby(['Colour'])
gf = gf.Val.rolling(3).mean().round(0)
print(gf)
I am pretty sure the transform function can help.
df.groupby('Col1')['Val'].transform(lambda x: x.rolling(3, 2).mean())
where, e.g., 3 is the size of the rolling window and 2 is the minimum number of periods.
(Just don't forget to sort your data frame before applying the running calculation)
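Applied to the worked example above, this puts the per-group rolling mean from #2 back into the df from #1 (the column name ValMA3ByColour is just an illustration):
# transform returns a Series aligned with df's index, so it can be assigned directly
df['ValMA3ByColour'] = (
    df.groupby('Colour')['Val']
      .transform(lambda x: x.rolling(3, 2).mean())
      .round(0)
)
print(df)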
