I have stumbled across similar questions, but all of them used either .csv or .txt files, and the solution was to read the data in line by line or in chunks. I am not aware of that being possible with parquet files, as they were designed to be read by columns rather than rows.
My first attempt below works well with a smaller subset/test dataset created from the original full dataset.
def process_single_group(group_df):
    # Super simple version of my function
    group_df['E'] = group_df['C'] + group_df['D']
    group_df['F'] = group_df['C'] - group_df['D']
    group_df['G'] = group_df['C'] * group_df['D']
    return group_df

def group_of_groups_loop(group_of_group_df):
    df_agg = pd.DataFrame()
    for i, group_df in enumerate(group_of_group_df):
        t = process_single_group(group_df)
        df_agg = pd.concat([df_agg, t])
    return df_agg

num_processes = os.cpu_count()
pool = Pool(processes=num_processes)

df = pd.read_parquet('path/to/dataset.parquet')

# With small dataset, there are about 300 groups
# With full dataset, there are about ~800k groups
groupby = df.groupby(by=['A', 'B'])

# Split the groups into chunks of groups
groupby_split = np.array_split(groupby, num_processes)

# Create a list where each element is a chunk of groups
# The starmap expects this sort of input
input_list = [[gb] for gb in groupby_split]

x = pd.concat(pool.starmap(group_of_groups_loop, input_list))
pool.join()
pool.close()

x.to_parquet('path/to/save/file.parquet')
but when I switch to the full parquet file, I get the error:
struct.error: 'i' format requires -2147483648 <= number <= 2147483647
which I expected.
My solution to this was to break the very large number of groups into smaller chunks of groups (size similar to the subset) and loop over each one like with the subset earlier.
EDIT: I forgot to add the second np.array_split within the loop.
def process_single_group(group_df):
    # Super simple version of my function
    group_df['E'] = group_df['C'] + group_df['D']
    group_df['F'] = group_df['C'] - group_df['D']
    group_df['G'] = group_df['C'] * group_df['D']
    return group_df

def group_of_groups_loop(group_of_group_df):
    df_agg = pd.DataFrame()
    for i, group_df in enumerate(group_of_group_df):
        t = process_single_group(group_df)
        df_agg = pd.concat([df_agg, t])
    return df_agg

num_processes = os.cpu_count()
pool = Pool(processes=num_processes)

df = pd.read_parquet('path/to/dataset.parquet')

# With small dataset, there are about 300 groups
# With full dataset, there are about ~800k groups
groupby = df.groupby(by=['A', 'B'])

# Split the large number of groups into smaller chunks
N = 10
groupby_split = np.array_split(groupby, N)

final_agg_df = pd.DataFrame()

# iterate over each of the smaller chunks
for groupby_group in groupby_split:
    groupby_split_split = np.array_split(groupby_group, num_processes)
    # Create a list to use as argument for starmap
    input_list = [[gb] for gb in groupby_split_split]
    x = pd.concat(pool.starmap(group_of_groups_loop, input_list))
    pool.join()
    pool.close()
    final_agg_df = pd.concat([final_agg_df, x])

final_agg_df.to_parquet('path/to/save/file.parquet')
But this is still giving me the same error...
I thought that since the pool was created before reading in the large parquet file (I read a solution earlier that suggested doing this), each process would only be handed a small chunk of groups.
Is there something I missed? And is there a better way of doing this in general (a queue? Dask? different function logic?)
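For example, one "function logic" variant I have been wondering about (just an untested sketch with hypothetical output paths, not what I currently run) would have each worker write its own parquet part, so that no large result ever has to be pickled back to the parent process:

import os
from multiprocessing import Pool

import numpy as np
import pandas as pd

def process_single_group(group_df):
    group_df['E'] = group_df['C'] + group_df['D']
    group_df['F'] = group_df['C'] - group_df['D']
    group_df['G'] = group_df['C'] * group_df['D']
    return group_df

def process_chunk(chunk_id, groups):
    # groups is a list of (key, DataFrame) pairs for this chunk
    out = pd.concat([process_single_group(g) for _, g in groups])
    # hypothetical output directory; the parts could later be re-read with pd.read_parquet
    out.to_parquet(f'path/to/parts/part_{chunk_id}.parquet')
    return chunk_id

if __name__ == '__main__':
    df = pd.read_parquet('path/to/dataset.parquet')
    groups = list(df.groupby(by=['A', 'B']))
    num_processes = os.cpu_count()
    # split the group indices into many small chunks so no single pickled argument gets too big
    chunks = np.array_split(np.arange(len(groups)), num_processes * 4)
    input_list = [(i, [groups[j] for j in idx]) for i, idx in enumerate(chunks)]
    with Pool(processes=num_processes) as pool:
        pool.starmap(process_chunk, input_list)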
Thanks in advance!
I would like to improve the speed of the following code. The data set is a list of trades that I would like to stress test by simulating various parameters and storing all the results in a table.
The way I'm doing this is by defining the ranges of parameters; then I iterate over their values, make a copy of the dataset, assign the parameter values to new columns, and concat everything into one huge dataframe.
Does someone have a good idea for avoiding the three for loops used to build the dataframe?
# Defining the range of parameters to simulate
volchange = range(-1,2)
spreadchange = range(-10,11)
flatchange = range(-10,11)
# the df where I store all the results
final_result = pd.DataFrame()
# Iterating over the range of parameters
for vol in volchange:
    for spread in spreadchange:
        for flat in flatchange:
            # Creating a copy of the initial dataset, assigning the simulated values to three
            # new columns and concat it with the rest, resulting in a dataframe which is
            # several times the initial dataset with all the possible triplets of parameters
            inter_pos = pos.copy()
            inter_pos['vol_change[pts]'] = vol
            inter_pos['spread_change[%]'] = spread
            inter_pos['spot_change[%]'] = flat
            final_result = pd.concat([final_result, inter_pos], axis=0)
# Performing computation at dataframe level
final_result['sim_vol'] = final_result['vol_change[pts]'] + final_result['ImpliedVolatility']
final_result['spread_change'] = final_result['spread'].multiply(final_result['spread_change[%]'])/100
final_result['sim_spread'] = final_result['spread'] + final_result['spread_change']
final_result['spot_change'] = final_result['spot'] * final_result['spot_change[%]']/100
final_result['sim_spot'] = final_result['spot'] + final_result['spot_change']
final_result['sim_price'] = final_result['sim_spot'] - final_result['sim_spread']
Thanks a lot for your help!
Have a nice week ahead!
Concatenating pandas dataframes onto one another takes a long time. It's better to create a list of dataframes and then use pd.concat to concatenate them all at once.
You can test this yourself like this:
import pandas as pd
import numpy as np
from time import time

dfs = []
columns = [f"{i:02d}" for i in range(100)]
time_start = time()
for i in range(100):
    data = np.random.random((10000, 100))
    df = pd.DataFrame(columns=columns, data=data)
    dfs.append(df)
new_df = pd.concat(dfs)
time_end = time()
print(f"Time elapsed: {time_end-time_start}")
# Time elapsed: 1.851675271987915

new_df = pd.DataFrame(columns=columns)
time_start = time()
for i in range(100):
    data = np.random.random((10000, 100))
    df = pd.DataFrame(columns=columns, data=data)
    new_df = pd.concat([new_df, df])
time_end = time()
print(f"Time elapsed: {time_end-time_start}")
# Time elapsed: 12.258363008499146
You can also use itertools.product to get rid of your nested for loops.
Also, as suggested by @Ahmed AEK:
you can pass data=itertools.product(volchange, spreadchange, flatchange) to pd.DataFrame directly and avoid creating the list altogether, which is a more memory-efficient and faster approach
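Putting both suggestions together, a rough sketch could look like the following (it uses a dummy stand-in for your pos DataFrame with the column names from your post, and assumes pandas >= 1.2 for how='cross'):

import itertools
import pandas as pd

# Dummy stand-in for the `pos` trades DataFrame from the question.
pos = pd.DataFrame({'ImpliedVolatility': [0.20, 0.25],
                    'spread': [1.0, 1.5],
                    'spot': [100.0, 110.0]})

volchange = range(-1, 2)
spreadchange = range(-10, 11)
flatchange = range(-10, 11)

# Build the full parameter grid in one go instead of three nested loops.
params = pd.DataFrame(
    data=itertools.product(volchange, spreadchange, flatchange),
    columns=['vol_change[pts]', 'spread_change[%]', 'spot_change[%]'])

# Cross join: every trade paired with every parameter triplet.
final_result = pos.merge(params, how='cross')

From there, the column-level computations at the end of your post can be applied to final_result unchanged.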
I'm working with a relatively large dataset (approx 5m observations, made up of about 5.5k firms).
I needed to run OLS regressions with a 60-month rolling window for each firm. I noticed that performance was insanely slow when I ran the following code:
for idx, sub_df in master_df.groupby("firm_id"):
    # OLS code
However, when I first split my dataframe into about 5.5k dfs and then iterated over each of the dfs, the performance improved dramatically.
grouped_df = master_df.groupby("firm_id")
df_list = [group for group in grouped_df]

for df in df_list:
    my_df = df[1]
    # OLS code
I'm talking 1-2 weeks of run time (24/7) to complete with the first version, compared to 8-9 hours tops with the second.
Can anyone please explain why splitting the master df into N smaller dfs and then iterating over each smaller df performs better than iterating over the same number of groups within the master df?
Thanks ever so much!
I'm unable to reproduce your observation. Here's some code that generates data and then times the direct and indirect methods separately. The time taken is very similar in either case.
Is it possible that you accidentally sorted the dataframe by the group key between the runs? Sorting by group key results in a noticeable difference in run time.
Otherwise, I'm beginning to think that there might be some other differences in your code. It would be great if you could post the full code.
import numpy as np
import pandas as pd
from datetime import datetime

def generate_data():
    ''' returns a Pandas DF with columns 'firm_id' and 'score' '''

    # configuration
    np.random.seed(22)
    num_groups = 50000        # number of distinct groups in the DF
    mean_group_length = 200   # how many records per group?
    cov_group_length = 0.10   # throw in some variability in the num records per group

    # simulate group lengths
    stdv_group_length = mean_group_length * cov_group_length
    group_lengths = np.random.normal(
        loc=mean_group_length,
        scale=stdv_group_length,
        size=(num_groups,)).astype(int)
    group_lengths[group_lengths <= 0] = mean_group_length

    # final length of DF
    total_length = sum(group_lengths)

    # compute entries for group key column
    firm_id_list = []
    for i, l in enumerate(group_lengths):
        firm_id_list.extend([(i + 1)] * l)

    # construct the DF; data column is 'score' populated with Numpy's U[0, 1)
    result_df = pd.DataFrame(data={
        'firm_id': firm_id_list,
        'score': np.random.rand(total_length)
    })

    # Optionally, shuffle or sort the DF by group keys
    # ALTERNATIVE 1: (badly) unsorted df
    result_df = result_df.sample(frac=1, random_state=13).reset_index(drop=True)
    # ALTERNATIVE 2: sort by group key
    # result_df.sort_values(by='firm_id', inplace=True)

    return result_df

def time_method(df, method):
    ''' time 'method' with 'df' as its argument '''
    t_start = datetime.now()
    method(df)
    t_final = datetime.now()
    delta_t = t_final - t_start
    print(f"Method '{method.__name__}' took {delta_t}.")
    return

def process_direct(df):
    ''' direct for-loop over groupby object '''
    for group, df in df.groupby('firm_id'):
        m = df.score.mean()
        s = df.score.std()
    return

def process_indirect(df):
    ''' indirect method: generate groups first as list and then loop over list '''
    grouped_df = df.groupby('firm_id')
    group_list = [pair for pair in grouped_df]
    for pair in group_list:
        m = pair[1].score.mean()
        s = pair[1].score.std()

df = generate_data()
time_method(df, process_direct)
time_method(df, process_indirect)
I have a huge .csv file (over 1 million rows) that I am trying to parse using the pandas read_csv function. The file is very large because it is measurement data from a sensor with a very high sampling rate, and I want to take downsampled segments from it. I tried implementing this with a lambda function and the skiprows and nrows parameters. What my code currently does is just read the same segment over and over again.
segment_amt = 20          # How many segments we want from an individual measurement file
segment_length = 5        # Segment length in seconds
segment_length_idx = fs * segment_length      # Segment length in indices
segment_skip_length = 10  # How many seconds between segments
segment_skip_idx = fs * segment_skip_length   # The amount of indices to skip between each segment
downsampling = 2          # Factor of downsampling

idx = start_idx
for i in range(segment_amt):
    cond = lambda x: (x+idx) % downsampling != 0
    data = pd.read_csv(filename, skiprows=cond, nrows=segment_length_idx/downsampling,
                       usecols=[z_component_idx], names=["z"], engine='python')
    M1_df = M1_df.append(data.T)
    idx += segment_skip_idx
This results in something like this. I assume the behaviour is due to the lambda function, but I don't know how to fix it so that it changes the starting row each time based on idx (which is what I thought it would do currently).
Your cond lambda is wrong: you want to skip a row if x < idx or x % downsampling != 0, so just write it that way:
cond = lambda x: x < idx or x % downsampling != 0
You should also consider passing header=None so that the first line of each segment is not treated as a header.
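Putting both fixes together, a sketch of the corrected loop might look like this (I'm assuming fs, start_idx, filename, the segment variables and z_component_idx are defined as in your code; I'm also collecting the segments in a list and concatenating once at the end):

import pandas as pd

segments = []
idx = start_idx
for i in range(segment_amt):
    # skip everything before the segment start, then keep every `downsampling`-th row
    cond = lambda x, idx=idx: x < idx or x % downsampling != 0
    data = pd.read_csv(filename, skiprows=cond, header=None,
                       nrows=segment_length_idx // downsampling,
                       usecols=[z_component_idx], names=["z"],
                       engine='python')
    segments.append(data.T)
    idx += segment_skip_idx
M1_df = pd.concat(segments)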
I am trying to work with around 100 csv files to do a time series analysis.
To make this efficient, I've structured my data() function so that it reads all the files only once and I don't have to repeat the same process again and again. To explain further, following is my code:
import pandas as pd
from itertools import chain, combinations

start_date = '2016-06-01'
end_date = '2017-09-02'
allocation = 170000

# contains 100 symbols
usesymbols = ['']

cost_matrix = []

def data():
    dates = pd.date_range(start_date, end_date)
    df = pd.DataFrame(index=dates)
    for symbol in usesymbols:
        df_temp = pd.read_csv('/home/furqan/Desktop/python_data/{}.csv'.format(str(symbol)),
                              usecols=['Date', 'Close'],
                              parse_dates=True, index_col='Date', na_values=['nan'])
        df_temp = df_temp.rename(columns={'Close': symbol})
        df = df.join(df_temp)
    df = df.fillna(method='ffill')
    df = df.fillna(method='bfill')
    return df

def powerset(iterable):
    s = list(iterable)
    return chain.from_iterable(combinations(s, r) for r in range(1, len(s)+1))

power_set = list(powerset(usesymbols))
dataframe = data()
Problem is that if I run the above code with 15 symbols it works perfectly.
But that's not sufficient, I want to use 100 symbols.
If I run the code with 100 items in usesymbols, my RAM is used up completely and the machine freezes.
Is there anything that can be done to avoid this situation?
Edited Part:
1) I have 16 GB of RAM.
2) The issue is with the variable power_set; if I don't call the powerset function, the data gets retrieved easily.
DataFrame.memory_usage(index=False) returns a Series with the column names as index and the memory usage of each column in bytes.
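A small illustration on a dummy frame (the column names here are just placeholders, not from your data):

import numpy as np
import pandas as pd

df = pd.DataFrame({'Close': np.random.rand(1000),
                   'Volume': np.random.randint(0, 10_000, size=1000)})
print(df.memory_usage(index=False))        # bytes used by each column
print(df.memory_usage(index=False).sum())  # total bytes for the data columns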
The problem: I have data stored in csv files with the columns date/id/value. I have 15 files, each containing around 10-20 million rows. Each csv file covers a distinct period, so the time indexes are non-overlapping, but the columns are (new ids enter from time to time, old ones disappear).
What I originally did was running the script without the pivot call, but then I ran into memory issues on my local machine (only 8GB). Since there is lots of redundancy in each file, pivot seemed at first a nice way out (roughly 2/3 less data), but now performance kicks in. If I run the following script, the concat function will run "forever" (so far I have always interrupted it manually after some time, >2h). Do concat/append have limitations in terms of size (I have roughly 10000-20000 columns), or am I missing something here? Any suggestions?
import pandas as pd
path = 'D:\\'
data = pd.DataFrame()

# loop through list of raw file names
for file in raw_files:
    data_tmp = pd.read_csv(path + file, engine='c',
                           compression='gzip',
                           low_memory=False,
                           usecols=['date', 'Value', 'ID'])
    data_tmp = data_tmp.pivot(index='date', columns='ID',
                              values='Value')
    data = pd.concat([data, data_tmp])
    del data_tmp
EDIT I: To clarify, each csv file has about 10-20 million rows and three columns; after the pivot is applied, this reduces to about 2000 rows but leads to about 10000 columns.
I can solve the memory issue by simply splitting the full set of ids into subsets and running the needed calculations on each subset, as they are independent for each id. I know this makes me reload the same files n times, where n is the number of subsets used, but this is still reasonably fast. I still wonder why concat/append is not performing.
EDIT II: I have tried to recreate the file structure with a simulation, which is as close as possible to the actual data structure. I hope it is clear; I didn't spend too much time minimizing the simulation time, but it runs reasonably fast on my machine.
import string
import random
import pandas as pd
import numpy as np
import math

# Settings :-------------------------------
num_ids = 20000
start_ids = 4000
num_files = 10
id_interval = int((num_ids-start_ids)/num_files)
len_ids = 9
start_date = '1960-01-01'
end_date = '2014-12-31'
run_to_file = 2
# ------------------------------------------

# Simulate column IDs
id_list = []
# ensure unique elements are of size >num_ids
for x in range(num_ids + round(num_ids*0.1)):
    id_list.append(''.join(
        random.choice(string.ascii_uppercase + string.digits) for _
        in range(len_ids)))
id_list = set(id_list)
id_list = list(id_list)[:num_ids]

time_index = pd.bdate_range(start_date, end_date, freq='D')
chunk_size = math.ceil(len(time_index)/num_files)

data = []
# Simulate files
for file in range(0, run_to_file):
    tmp_time = time_index[file * chunk_size:(file + 1) * chunk_size]
    # TODO not all cases covered, make sure ints are obtained
    tmp_ids = id_list[file * id_interval:
                      start_ids + (file + 1) * id_interval]
    tmp_data = pd.DataFrame(np.random.standard_normal(
        (len(tmp_time), len(tmp_ids))), index=tmp_time,
        columns=tmp_ids)
    tmp_file = tmp_data.stack().sortlevel(1).reset_index()
    # final simulated data structure of the parsed csv file
    tmp_file = tmp_file.rename(columns={'level_0': 'Date', 'level_1':
                                        'ID', 0: 'Value'})
    # comment/uncomment if pivot takes place on aggregate level or not
    tmp_file = tmp_file.pivot(index='Date', columns='ID',
                              values='Value')
    data.append(tmp_file)
data = pd.concat(data)
# comment/uncomment if pivot takes place on aggregate level or not
# data = data.pivot(index='Date', columns='ID', values='Value')
Using your reproducible example code, I can indeed confirm that the concat of only two dataframes takes a very long time. However, if you first align them (make the column names equal), then concatting is very fast:
In [94]: df1, df2 = data[0], data[1]
In [95]: %timeit pd.concat([df1, df2])
1 loops, best of 3: 18min 8s per loop
In [99]: %%timeit
....: df1b, df2b = df1.align(df2, axis=1)
....: pd.concat([df1b, df2b])
....:
1 loops, best of 3: 686 ms per loop
The result of both approaches is the same.
The aligning is equivalent to:
common_columns = df1.columns.union(df2.columns)
df1b = df1.reindex(columns=common_columns)
df2b = df2.reindex(columns=common_columns)
So this is probably the easier approach when you have to deal with a full list of dataframes.
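For a whole list of dataframes, a sketch of that idea (assuming, as discussed here, that all frames share the same dtype) could look like:

import pandas as pd

def concat_aligned(frames):
    # union of all column names across the list
    common_columns = frames[0].columns
    for df in frames[1:]:
        common_columns = common_columns.union(df.columns)
    # reindex every frame to the common columns, then concat once
    aligned = [df.reindex(columns=common_columns) for df in frames]
    return pd.concat(aligned)

# e.g. result = concat_aligned(data), with `data` being the list built in the simulation above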
The reason that pd.concat is slower is because it does more. E.g. when the column names are not equal, it checks for every column if the dtype has to be upcasted or not to hold the NaN values (which get introduced by aligning the column names). By aligning yourself, you skip this. But in this case, where you are sure to have all the same dtype, this is no problem.
That it is so much slower surprises me as well, but I will raise an issue about that.
Summary: three key performance drivers, depending on the set-up:
1) Make sure the data types are the same when concatenating two dataframes
2) Use integer-based column names if possible
3) When using string-based column names, make sure to use the align method before concat is called, as suggested by joris
As @joris mentioned, you should append all of the pivot tables to a list and then concatenate them all in one go. Here is a proposed modification to your code:
dfs = []
for file in raw_files:
    data_tmp = pd.read_csv(path + file, engine='c',
                           compression='gzip',
                           low_memory=False,
                           usecols=['date', 'Value', 'ID'])
    data_tmp = data_tmp.pivot(index='date', columns='ID',
                              values='Value')
    dfs.append(data_tmp)
    del data_tmp
data = pd.concat(dfs)