PySpark read.parquet vs Pandas read_parquet - python

I am not seeing significantly faster read times using PySpark read.parquet vs. Pandas read_parquet. I am trying to read 4 parquet files, each 2-3 MB, in a for loop and do some basic aggregation.
Here is my PySpark code:
import time
from functools import reduce
from pyspark.sql import DataFrame, functions as f

ts_dfs = []
# For loop to collect building time series and append to an empty list
start = time.time()
for id in ids[0:4]:
    # Make the path to the data
    timeseries_path = f'{dataset_path}/timeseries_individual_buildings/by_county/upgrade=0/county={id[0]}'
    # Read the data and select columns of interest
    ts_data_df = spark.read.parquet(timeseries_path).select('`bldg_id`', '`out.electricity.heating.energy_consumption`', '`timestamp`')
    # Aggregate by month
    ts_data_df = ts_data_df \
        .groupBy(f.month('timestamp').alias('month'), 'bldg_id') \
        .agg(f.sum('`out.electricity.heating.energy_consumption`').alias('kWh'))
    # Append to the list
    ts_dfs.append(ts_data_df)
# Combine all dfs
ts = reduce(DataFrame.unionAll, ts_dfs)
end = time.time()
print("Time elapsed:", end - start)
Time elapsed: 5.127371788024902
Here is my Pandas code:
# Download time series of all buildings with the ids we want
# (bldg_id must be included in the columns so the groupby below works)
cols = ['bldg_id', 'timestamp', 'out.electricity.heating.energy_consumption']
ts_dfs = []
start = time.time()
for id in ids[0:4]:
    # Make the path to the data
    timeseries_path = f'{dataset_path}/timeseries_individual_buildings/by_county/upgrade=0/county={id[0]}'
    # Read the data and columns of interest
    ts_data_df = pd.read_parquet(timeseries_path, columns=cols)
    # Some date processing
    ts_data_df['date'] = pd.to_datetime(ts_data_df['timestamp'])
    ts_data_df['month'] = ts_data_df['date'].dt.month
    # Aggregate by month
    ts_data_df = ts_data_df.groupby(['month', 'bldg_id'], as_index=False).agg(sum=('out.electricity.heating.energy_consumption', 'sum'))
    # Append to the list
    ts_dfs.append(ts_data_df)
# Concatenate all the dfs
all_ts_df = pd.concat(ts_dfs)
end = time.time()
print("Time elapsed:", end - start)
Time elapsed: 40.325382232666016
Frankly, I would expect Spark to finish this significantly faster given that the files are so small. I'm running this in a Jupyter notebook, and my Spark configuration is as follows:
from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession

# Configure prior to creating the context
conf = SparkConf() \
    .setAppName('appName') \
    .setMaster('local[*]') \
    .setAll([
        ("spark.sql.execution.arrow.pyspark.enabled", "true"),
        ("spark.sql.execution.arrow.enabled", "true")
    ])
sc = SparkContext(conf=conf)
My machine has 16 GB of RAM and 8 cores, so there should be some parallelization going on with this configuration, correct?
Additionally, when I try to convert the above PySpark DataFrame to pandas, the runtime doubles even though the resulting dataframe is only 48 rows and 3 columns. Is there any .persist() or .cache() I can do to speed things up? Any changes to my PySpark configuration that would better leverage my computing power?
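For comparison, here is a minimal sketch of reading all four county directories in a single spark.read.parquet call, aggregating once, and caching the result before reuse; it assumes ids, dataset_path, spark and the column names from the code above, and is only a sketch, not the original post's code:
import time
from pyspark.sql import functions as f

# Build the four county paths up front (same layout as above)
paths = [
    f'{dataset_path}/timeseries_individual_buildings/by_county/upgrade=0/county={id[0]}'
    for id in ids[0:4]
]

# One read over all paths, then a single monthly aggregation
ts = (
    spark.read.parquet(*paths)
    .select('`bldg_id`', '`out.electricity.heating.energy_consumption`', '`timestamp`')
    .groupBy(f.month('timestamp').alias('month'), 'bldg_id')
    .agg(f.sum('`out.electricity.heating.energy_consumption`').alias('kWh'))
    .cache()  # keep the small aggregate in memory for later actions
)

# Spark is lazy: nothing is read until an action runs
ts.count()             # materializes (and caches) the aggregate
ts_pd = ts.toPandas()  # cheap afterwards, since the cached result is only 48 rows
Note that the timing above for the loop version mostly measures plan construction, since no action is triggered before the end timestamp.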

Related

How to reduce the time complexity of KS test python code?

I am currently working on a project where I need to compare whether two distributions are the same or not. For that I have two data frames, both containing numeric values only:
1) db_df - which is from the DB
2) data - which is the user-uploaded dataframe
I have to compare every column of db_df with data, find the similar columns in data, and suggest them to the user as suggestions for the db column.
The dimensions of both data frames are 100 rows by 239 columns.
from scipy.stats import kstest

row_list = []
suggestions = dict()
s = time.time()
db_data_columns = db_df.columns
data_columns = data.columns
for i in db_data_columns:
    col_list = list()
    for j in data_columns:
        # perform Kolmogorov-Smirnov test
        col_list.append(kstest(db_df[i], data[j])[1])
    row_list.append(col_list)
print(f"=== AFTER FOR TIME {time.time()-s}")
df = pd.DataFrame(row_list).T
df.columns = db_df.columns
df.index = data.columns
for i in df.columns:
    sorted_df = df.sort_values(by=[i], ascending=False)
    sorted_df = sorted_df[sorted_df > 0.05]
    sorted_df = sorted_df[:3].loc[:, i:i]
    sorted_df = sorted_df.dropna()
    suggestions[sorted_df.columns[0]] = list(sorted_df.to_dict().values())[0]
After getting all the p-values for the columns of db_df against data, I need to select the top 3 columns from data for each column in db_df.
Overall time taken for this is 14 seconds, which is very long. Is there any chance of reducing the time to less than 5 seconds?
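As an aside, the top-3 selection step can be written more directly; a sketch, assuming df is the p-value frame built above (db_df columns as its columns, data columns as its index):
# For each db_df column, keep the three data columns with the highest
# KS p-value, considering only p-values above 0.05
suggestions = {}
for col in df.columns:
    top = df[col][df[col] > 0.05].nlargest(3)
    suggestions[col] = top.to_dict()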

Python, improving speed of multi values stresstest

I would like to improve the speed of the following code. The data set is a list of trades that I would like to stress test by simulating various parameters and storing all the results in a table.
The way I'm doing this is by defining the range of each parameter and then iterating over their values: I make a copy of the dataset, assign the parameter values to new columns, and concat everything into a huge dataframe.
Does anyone have a good idea for avoiding the three for loops used to build the dataframe?
# Defining the range of parameters to simulate
volchange = range(-1, 2)
spreadchange = range(-10, 11)
flatchange = range(-10, 11)
# The df where I store all the results
final_result = pd.DataFrame()
# Iterating over the range of parameters
for vol in volchange:
    for spread in spreadchange:
        for flat in flatchange:
            # Create a copy of the initial dataset, assign the simulated values to three
            # new columns and concat it with the rest, resulting in a dataframe which is
            # several times the initial dataset with all the possible triplets of parameters
            inter_pos = pos.copy()
            inter_pos['vol_change[pts]'] = vol
            inter_pos['spread_change[%]'] = spread
            inter_pos['spot_change[%]'] = flat
            final_result = pd.concat([final_result, inter_pos], axis=0)
# Performing computations at dataframe level
final_result['sim_vol'] = final_result['vol_change[pts]'] + final_result['ImpliedVolatility']
final_result['spread_change'] = final_result['spread'].multiply(final_result['spread_change[%]']) / 100
final_result['sim_spread'] = final_result['spread'] + final_result['spread_change']
final_result['spot_change'] = final_result['spot'] * final_result['spot_change[%]'] / 100
final_result['sim_spot'] = final_result['spot'] + final_result['spot_change']
final_result['sim_price'] = final_result['sim_spot'] - final_result['sim_spread']
Thanks a lot for your help !
Have a nice week ahead !
Concatenating pandas dataframes onto one another in a loop takes a long time, because each concat copies all the data accumulated so far. It's better to collect the dataframes in a list and then use pd.concat to concatenate them all at once.
You can test this yourself like this:
import pandas as pd
import numpy as np
from time import time

# Build the dataframes in a list and concatenate once
dfs = []
columns = [f"{i:02d}" for i in range(100)]
time_start = time()
for i in range(100):
    data = np.random.random((10000, 100))
    df = pd.DataFrame(columns=columns, data=data)
    dfs.append(df)
new_df = pd.concat(dfs)
time_end = time()
print(f"Time elapsed: {time_end-time_start}")
# Time elapsed: 1.851675271987915

# Concatenate inside the loop instead
new_df = pd.DataFrame(columns=columns)
time_start = time()
for i in range(100):
    data = np.random.random((10000, 100))
    df = pd.DataFrame(columns=columns, data=data)
    new_df = pd.concat([new_df, df])
time_end = time()
print(f"Time elapsed: {time_end-time_start}")
# Time elapsed: 12.258363008499146
You can also use itertools.product to get rid of your nested for loops, as sketched below.
Also, as suggested by @Ahmed AEK: you can pass data=itertools.product(volchange, spreadchange, flatchange) to pd.DataFrame directly and avoid creating the list altogether, which is a more memory-efficient and faster approach.
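A minimal sketch of that idea, assuming pos, volchange, spreadchange and flatchange are defined as in the question (how='cross' requires pandas 1.2 or newer):
import itertools
import pandas as pd

# Build the whole parameter grid in one DataFrame instead of three nested loops
grid = pd.DataFrame(
    itertools.product(volchange, spreadchange, flatchange),
    columns=['vol_change[pts]', 'spread_change[%]', 'spot_change[%]'],
)
# Cross join the grid with the trades instead of concatenating copies in a loop
final_result = pos.merge(grid, how='cross')
The simulated columns (sim_vol, sim_spread, sim_spot, sim_price) can then be computed on final_result exactly as in the question.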

Assemble a dataframe from two csv files

I wrote the following code to form a data frame containing the energy consumption and the temperature. The data for each of the variables is collected from a different csv file:
def match_data():
    pwr_data = pd.read_csv(r'C:\\Users\X\Energy consumption per hour-data-2022-03-16 17_50_56_Edited.csv')
    temp_data = pd.read_csv(r'C:\\Users\X\temp.csv')
    new_time = []
    new_pwr = []
    new_tmp = []
    for i in range(1, len(pwr_data)):
        for j in range(1, len(temp_data)):
            if pwr_data['time'][i] == temp_data['Date'][j]:
                time = pwr_data['time'][i]
                pwr = pwr_data['watt_hour'][i]
                tmp = temp_data['Temp'][j]
                new_time.append(time)
                new_pwr.append(pwr)
                new_tmp.append(tmp)
    return pd.DataFrame({'Time': new_time, 'watt_hour': new_pwr, 'Temp': new_tmp})
I was trying to collect data with matching time indices so that I can assemble them into a data frame.
The code works, but it takes a long time (43 seconds for around 1300 data points). At the moment I don't have much data, but I was wondering if there is a more efficient and faster way to do this.
Do the pwr_data['time'] and temp_data['Date'] columns have the same granularity?
If so, you can pd.merge() the two dataframes after reading them.
# read data
pwr_data = pd.read_csv(r'C:\\Users\X\Energy consumption per hour-data-2022-03-16 17_50_56_Edited.csv')
temp_data = pd.read_csv(r'C:\\Users\X\temp.csv')
# merge data on time and Date columns
# you can set the how to be 'inner' or 'right' depending on your needs
df = pd.merge(pwr_data, temp_data, how='left', left_on='time', right_on='Date')
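If the two columns do not line up exactly, a tolerance-based as-of merge is one possible alternative; this is only a sketch, assuming both columns can be parsed as datetimes and that a 30-minute tolerance is acceptable:
# Parse the timestamps, sort, and pair each power reading with the
# nearest temperature reading within 30 minutes
pwr_data['time'] = pd.to_datetime(pwr_data['time'])
temp_data['Date'] = pd.to_datetime(temp_data['Date'])
df = pd.merge_asof(
    pwr_data.sort_values('time'),
    temp_data.sort_values('Date'),
    left_on='time', right_on='Date',
    direction='nearest',
    tolerance=pd.Timedelta('30min'),
)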
Just like @greco recommended, this did the trick and in no time!
pd.merge(pwr_data, temp_data, how='inner', left_on='time', right_on='Date')
'time' and 'Date' are the columns on which you want to base the merge.

Window aggregation on many columns in Spark

Having trouble doing an aggregation across many columns in Pyspark. There are hundreds of boolean columns showing the current state of a system, with a row added every second. The goal is to transform this data to show the number of state changes for every 10 second window.
I planned to do this in two steps: first XOR each boolean value with the previous row's value, then sum over a 10-second window. Here's the rough code I came up with:
import pyspark
from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession, Window, Row
from pyspark.sql import types as T, functions as F
from datetime import datetime, timedelta
from random import random
import time

sc = pyspark.SparkContext(conf=pyspark.SparkConf().setMaster('local[*]'))
spark = SparkSession(sc)

# create dataframe
num_of_cols = 50
df = spark.createDataFrame(
    [(datetime.now() + timedelta(0, i), *[round(random()) for _ in range(num_of_cols)]) for i in range(10000)],
    ['Time', *[f"M{m+1}" for m in range(num_of_cols)]])
cols = set(df.columns) - set(['Time'])

# Generate changes
data_window = Window.partitionBy(F.minute('Time')).orderBy('Time')
# data_window = Window.orderBy('Time')
df = df.select('Time', *[F.col(m).bitwiseXOR(F.lag(m, 1).over(data_window)).alias(m) for m in cols])
df = df.groupBy(F.window('Time', '10 seconds')) \
    .agg(*[F.sum(m).alias(m) for m in cols]) \
    .withColumn('start_time', F.col('window')['start']) \
    .drop('window')
df.orderBy('start_time').show(20, False)

# Keep UI open
time.sleep(60*60)
With the data_window partitioned by minute, Spark generates 52 stages, each dependent on the last. Increasing num_of_cols increases the number of stages as well. It seems to me this should be an embarrassingly parallel problem: compare each row to the last, and then aggregate by 10 seconds. Removing the partitionBy from data_window allows it to run in a single stage, but it forces all the data onto a single partition to achieve it.
Why are the stages dependent on each other, and is there a better way to write this to improve parallelization? I'd think it'd be possible to do multiple aggregations over the same window at the same time. Eventually this will need to scale to hundreds of columns; are there any tricks to improve performance at that point?
Based on the helpful response from Georg, I came up with the following:
import pandas as pd
import pyspark
from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession, Window
from pyspark.sql import types as T, functions as F
from datetime import datetime, timedelta
from random import random
import time
import pprint

sc = pyspark.SparkContext(conf=pyspark.SparkConf().setMaster('local[*]'))
spark = SparkSession(sc)

@F.pandas_udf(T.ArrayType(T.IntegerType()), F.PandasUDFType.GROUPED_AGG)
def pandas_xor(v):
    values = v.values
    if len(values) == 1:
        return values[0] * False
    elif len(values) == 2:
        return values[0] ^ values[1]
    else:
        raise RuntimeError('Too many values given to pandas_xor: {}'.format(values))

# create dataframe
num_of_cols = 50
df = spark.createDataFrame(
    [(datetime.now() + timedelta(0, i), *[round(random()) for _ in range(num_of_cols)]) for i in range(100000)],
    ['Time', *[f"M{m+1}" for m in range(num_of_cols)]])
cols = set(df.columns) - set(['Time'])
df = df.select('Time', F.array(*cols).alias('data'))

# XOR with the next row inside each minute partition
data_window = Window.partitionBy(F.minute('Time')).orderBy('Time').rowsBetween(Window.currentRow, 1)
# data_window = Window.orderBy('Time')
df = df.select('Time', pandas_xor(df.data).over(data_window).alias('data'))
df = df.groupBy(F.window('Time', '10 seconds')) \
    .agg(*[F.sum(F.element_at('data', i + 1)).alias(m) for i, m in enumerate(cols)]) \
    .withColumn('start_time', F.col('window')['start']) \
    .drop('window')
df.orderBy('start_time').show(20, False)

# Keep UI open
time.sleep(60*60)
Here are the instructions to run it with Spark 3.0.0-preview2.
Download Spark 3.0.0
mkdir contrib
wget -O contrib/spark-3.0.0-preview2.tgz 'https://www.apache.org/dyn/mirrors/mirrors.cgi?action=download&filename=spark/spark-3.0.0-preview2/spark-3.0.0-preview2-bin-hadoop2.7.tgz'
tar -C contrib -xf contrib/spark-3.0.0-preview2.tgz
rm contrib/spark-3.0.0-preview2.tgz
In the first shell, configure the environment to use PySpark 3.0.0:
export SPARK_HOME="$(pwd)/contrib/spark-3.0.0-preview2-bin-hadoop2.7"
export PYTHONPATH="$SPARK_HOME/python/lib/pyspark.zip:$SPARK_HOME/python/lib/py4j-0.10.8.1-src.zip"
Kick off pyspark job
time python3 so-example.py
View local Spark run's Web UI at http://localhost:4040

Huge quantity of Dask computations causing memory issues

I am working on a task where I need to determine whether two geospatial points are within 250 meters of each other and occur within 20 minutes of each other. My data set is approximately 1.2M rows and 10 columns, so I need to compute a distance and a time difference, and check my criteria, for roughly 1.2M**2 pairs.
I have been able to run the code below, where I create 10,000 Dask objects to compute, without problem. However, when I attempt to test 100,000 objects, Dask runs up against memory limitations and I see significant CPU usage for swap. To be clear, I'm running this on a 32-core node with 125 GB of memory.
Admittedly, I'm quite new to Dask, so I'd like to know: is there a better way to solve this problem than processing in 10,000-row chunks?
#!/usr/bin/env python

import pandas as pd
import numpy as np
import dask.dataframe as dd
from dask.array import sqrt
import time
import multiprocessing as mp

df = pd.read_hdf(...)   # Used to select single item for comparison
ddf = dd.read_hdf(...)  # Used for Dask operations

def distCheck(item, df=ddf):
    '''
    Determine if any records in df are within 250m of item and within 20
    minutes of item. Return Dask object for calculation.
    '''
    dist = sqrt(((ddf.LCC_x1-item.LCC_x1)**2+(ddf.LCC_y1-item.LCC_y1)**2))
    distcrit = dist[dist < 250]
    delta = (ddf.Date - item.Date).abs()
    timecrit = delta[delta < np.timedelta64(20,'m')]
    res1 = ddf.copy()
    res1['dist'] = dist
    res1['delta'] = delta
    res1 = res1.loc[(distcrit.index) & (timecrit.index) & (idcrit.index)]
    res1['MatchMMSI'] = item.MMSI
    res1['MatchVoy'] = item.Voyage
    out = res1
    return out

def getDaskCalls(start, stop):
    '''
    Get Dask objects to assess temporal and spatial proximity for df
    indices from start to stop.
    '''
    # Kick off multiprocessing pool, submit, and close
    pool = mp.Pool(processes=32)
    daskers = []
    for i in range(start, stop):
        result = pool.apply_async(distCheck, args=(df.iloc[i,:], ddf,))
        daskers.append(result)
    dasky = [i.get() for i in daskers]
    pool.close()
    return dasky

def runDask(calls):
    result = pd.DataFrame([], columns=calls[0].columns)
    output = dd.compute(calls)
    result = pd.concat([result] + [i for i in output[0] if i.shape[0] != 0])
    return result

###
### Process
###
# Get initial timestamp
start = time.time()

# Create Dask calls & determine duration
dcalls = getDaskCalls(0, 10000)
callsCreated = time.time()
# Print time required to create calls
print("Dask Calls Created.")
print(callsCreated - start)

# Compute the calls with Dask
print("Computing...")
result = runDask(dcalls)

# Print the time for computation
computation = time.time()
print(" ...Done.")
print(computation - callsCreated)
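For what it's worth, the per-item check can also be expressed as a single lazy dask filter, which avoids intersecting separate index objects; a rough sketch, reusing the column names from the question (batches of these lazy frames can then be passed to dd.compute to keep memory bounded):
import numpy as np

def match_item(item, ddf):
    '''Rows of ddf within 250 m and 20 minutes of a single item (lazy result).'''
    dist = np.sqrt((ddf.LCC_x1 - item.LCC_x1) ** 2 + (ddf.LCC_y1 - item.LCC_y1) ** 2)
    delta = (ddf.Date - item.Date).abs()
    hits = ddf[(dist < 250) & (delta < np.timedelta64(20, 'm'))]
    # Tag the matches with the item they were compared against
    return hits.assign(MatchMMSI=item.MMSI, MatchVoy=item.Voyage)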
