Time complexity between appending to a Dict vs a Dataframe - python

My program currently creates a bunch of DataFrames with a specific structure. The total number of DataFrames is, for now, 88 (with up to 10k rows each); however, this is just a testing phase with a small amount of data.
This number might increase to several hundred DataFrames, each with up to a few hundred thousand rows.
I'm concerned about scalability. I have implemented two methods to produce the output, which is the concatenation of all these DataFrames. For now, they give approximately the same result; however, which of them will perform better as the data grows?
Append to a DataFrame:
- Create an empty DataFrame df1 (with the correct structure),
- Loop:
  - Create the DataFrame of results,
  - Append it to df1,
- Export to CSV.

Append to a dictionary:
- Create an empty dict,
- Loop:
  - Create the DataFrame of results,
  - Append it to the dict,
- Concatenate all values of the dict into one DataFrame,
- Export to CSV.
- Which of these will perform better as the amount of data grows?
- Does appending to a dict give better results than appending to a DataFrame even though there are more steps, or the other way around, or do they perform the same? (A minimal sketch of both approaches follows below.)
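For reference, a minimal sketch of the two approaches as described above. The per-iteration result DataFrame and its 'id'/'value' columns are stand-ins for the real structure, not the actual code:
import pandas as pd

def make_result(i):
    # Stand-in for the real per-iteration result DataFrame.
    return pd.DataFrame({"id": [i] * 3, "value": [1, 2, 3]})

# Approach 1: grow df1 inside the loop (every iteration copies all rows so far).
df1 = pd.DataFrame(columns=["id", "value"])
for i in range(88):
    df1 = pd.concat([df1, make_result(i)], ignore_index=True)
df1.to_csv("out_approach1.csv", index=False)

# Approach 2: collect the pieces in a dict, concatenate once at the end.
pieces = {}
for i in range(88):
    pieces[i] = make_result(i)
df2 = pd.concat(pieces.values(), ignore_index=True)
df2.to_csv("out_approach2.csv", index=False)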

Approach 2 is definitely faster. Pandas is quite a heavy library. If the data is large and memory-hungry, you might also consider inserting it into a database such as MySQL rather than keeping it in pandas; that way the data lives in the database instead of in memory.
import pandas as pd
from time import time

df = pd.DataFrame(columns=range(100))

# start to test approach 1: concatenate row by row inside the loop
approach1_start = time()
for i in range(1000):
    data_entry = ['test' for i in range(100)]
    new = pd.DataFrame([data_entry])
    df = pd.concat([df, new])
approach1_end = time()
approach1_time = approach1_end - approach1_start
print(approach1_time)
# 9.54729175567627

# start to test approach 2: collect rows in a list, build the DataFrame once
approach2_start = time()
data_entry_list = []
for i in range(1000):
    data_entry = ['test' for i in range(100)]
    data_entry_list.append(data_entry)
df = pd.DataFrame(data_entry_list)
approach2_end = time()
approach2_time = approach2_end - approach2_start
print(approach2_time)
# 0.021973371505737305
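A minimal sketch of the database idea, using SQLite from the standard library as a stand-in for MySQL so it runs without a server; the chunk contents and the table name are made up for illustration:
import sqlite3
import pandas as pd

con = sqlite3.connect("results.db")

for i in range(10):
    # Stand-in for one chunk of results produced inside the loop.
    chunk = pd.DataFrame([['test'] * 100 for _ in range(100)],
                         columns=[f"c{j}" for j in range(100)])
    # Append each chunk to the table as it is produced, so nothing
    # accumulates in memory; export or queries can then run against the DB.
    chunk.to_sql("results", con, if_exists="append", index=False)

con.close()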

I have done some testing to get an idea. Here is the testing code:
import time
import pandas as pd

def timing2(f):
    def wrap(*args):
        time1 = time.time()
        ret = f(*args)
        time2 = time.time()
        print('{:s} : {:.3f} ms'.format(f.__name__, (time2 - time1) * 1000.0))
        return ret
    return wrap

@timing2
def withList():
    # collect the DataFrames in a list, concatenate once at the end
    lst = []
    for i in range(100):
        df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [90, 53, 64]})
        lst.append(df)
    df_new = pd.concat(lst)
    return df_new

@timing2
def withDataFrame():
    # append to a DataFrame inside the loop
    # (DataFrame.append needs pandas < 2.0; it was removed in 2.0)
    col_lst = ['A', 'B', 'C']
    df = pd.DataFrame(columns=col_lst)
    for i in range(100):
        df_r = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [90, 53, 64]})
        df = df.append(df_r)
    return df

@timing2
def withDict():
    # collect the DataFrames in a dict, append all its values at once
    dic = {}
    col_lst = ['A', 'B', 'C']
    df = pd.DataFrame(columns=col_lst)
    for i in range(100):
        df_r = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [90, 53, 64]})
        dic[i] = df_r
    lst_result = list(dic.values())
    df = df.append(lst_result)
    return df

withList()
withDataFrame()
withDict()
Here are the results:
- withList : 76.801 ms
- withDataFrame : 101.746 ms
- withDict : 57.819 ms
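Note that DataFrame.append, used in withDataFrame and withDict above, was removed in pandas 2.0, so on a current install those variants need pd.concat instead. A minimal sketch of the dict variant rewritten that way:
import pandas as pd

def withDictConcat():
    # Same idea as withDict above, but with pd.concat instead of the
    # removed DataFrame.append (works on pandas >= 2.0).
    dic = {}
    for i in range(100):
        dic[i] = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [90, 53, 64]})
    return pd.concat(dic.values(), ignore_index=True)

df_new = withDictConcat()
print(df_new.shape)  # (300, 3)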

Related

How to reduce the time complexity of KS test python code?

I am currently working on a project where I need to compare whether two distributions are the same or not. For that I have two DataFrames, both containing numeric values only:
1) db_df - which comes from the database
2) data - which is the user-uploaded DataFrame
I have to compare every column of db_df with every column of data, find the similar columns in data, and suggest them to the user as candidates for the db column.
Both DataFrames have 100 rows and 239 columns.
import time
import pandas as pd
from scipy.stats import kstest

row_list = []
suggestions = dict()
s = time.time()
db_data_columns = db_df.columns
data_columns = data.columns
for i in db_data_columns:
    col_list = list()
    for j in data_columns:
        # perform Kolmogorov-Smirnov test
        col_list.append(kstest(db_df[i], data[j])[1])
    row_list.append(col_list)
print(f"=== AFTER FOR TIME {time.time()-s}")
df = pd.DataFrame(row_list).T
df.columns = db_df.columns
df.index = data.columns
for i in df.columns:
    sorted_df = df.sort_values(by=[i], ascending=False)
    sorted_df = sorted_df[sorted_df > 0.05]
    sorted_df = sorted_df[:3].loc[:, i:i]
    sorted_df = sorted_df.dropna()
    suggestions[sorted_df.columns[0]] = list(sorted_df.to_dict().values())[0]
After getting all the p-values for every column of db_df against the columns of data, I need to select the top 3 columns from data for each column in db_df.
Overall this takes about 14 seconds, which is very long. Is there any way to bring it under 5 seconds? (A compact sketch follows below.)
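Not the original code, but a sketch of one way to tighten it, using scipy's two-sample ks_2samp and Series.nlargest for the top-3 selection; the random frames here only stand in for db_df and data:
import numpy as np
import pandas as pd
from scipy.stats import ks_2samp

# Stand-in data with the question's row count (100 rows, a few numeric columns).
rng = np.random.default_rng(0)
db_df = pd.DataFrame(rng.normal(size=(100, 5)), columns=[f"db_{i}" for i in range(5)])
data = pd.DataFrame(rng.normal(size=(100, 5)), columns=[f"user_{i}" for i in range(5)])

# All pairwise p-values in one DataFrame: rows = user columns, columns = db columns.
pvals = pd.DataFrame(
    {db_col: [ks_2samp(db_df[db_col], data[c]).pvalue for c in data.columns]
     for db_col in db_df.columns},
    index=data.columns,
)

# Keep only p-values above 0.05 and take the three best matches per db column.
suggestions = {
    db_col: pvals.loc[pvals[db_col] > 0.05, db_col].nlargest(3).to_dict()
    for db_col in db_df.columns
}
print(suggestions)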

Python, improving the speed of a multi-value stress test

I would like to improve the speed of the following code. The dataset is a list of trades that I would like to stress test by simulating various parameters and storing all the results in a table.
The way I perform this is by defining the range of each parameter and then iterating over their values: I make a copy of the dataset, assign the parameter values to new columns, and concatenate everything into one huge DataFrame.
I would like to know if someone has a good idea to avoid the three for loops used to build the DataFrame.
# Defining the range of parameters to simulate
volchange = range(-1, 2)
spreadchange = range(-10, 11)
flatchange = range(-10, 11)

# the df where I store all the results
final_result = pd.DataFrame()

# Iterating over the range of parameters
for vol in volchange:
    for spread in spreadchange:
        for flat in flatchange:
            # Creating a copy of the initial dataset, assigning the simulated values to three
            # new columns and concatenating it with the rest, resulting in a dataframe which is
            # several times the initial dataset, with all the possible triplets of parameters
            inter_pos = pos.copy()
            inter_pos['vol_change[pts]'] = vol
            inter_pos['spread_change[%]'] = spread
            inter_pos['spot_change[%]'] = flat
            final_result = pd.concat([final_result, inter_pos], axis=0)

# Performing computations at dataframe level
final_result['sim_vol'] = final_result['vol_change[pts]'] + final_result['ImpliedVolatility']
final_result['spread_change'] = final_result['spread'].multiply(final_result['spread_change[%]']) / 100
final_result['sim_spread'] = final_result['spread'] + final_result['spread_change']
final_result['spot_change'] = final_result['spot'] * final_result['spot_change[%]'] / 100
final_result['sim_spot'] = final_result['spot'] + final_result['spot_change']
final_result['sim_price'] = final_result['sim_spot'] - final_result['sim_spread']
Thanks a lot for your help!
Have a nice week ahead!
Concatenating pandas DataFrames onto one another repeatedly takes a long time. It's better to collect the DataFrames in a list and then use pd.concat to concatenate them all at once.
You can test this yourself like this:
import pandas as pd
import numpy as np
from time import time

# Fast: build a list of DataFrames, concatenate once at the end
dfs = []
columns = [f"{i:02d}" for i in range(100)]
time_start = time()
for i in range(100):
    data = np.random.random((10000, 100))
    df = pd.DataFrame(columns=columns, data=data)
    dfs.append(df)
new_df = pd.concat(dfs)
time_end = time()
print(f"Time elapsed: {time_end-time_start}")
# Time elapsed: 1.851675271987915

# Slow: concatenate inside the loop
new_df = pd.DataFrame(columns=columns)
time_start = time()
for i in range(100):
    data = np.random.random((10000, 100))
    df = pd.DataFrame(columns=columns, data=data)
    new_df = pd.concat([new_df, df])
time_end = time()
print(f"Time elapsed: {time_end-time_start}")
# Time elapsed: 12.258363008499146
You can also use itertools.product to get rid of your nested for loops.
Also, as suggested by @Ahmed AEK: you can pass data=itertools.product(volchange, spreadchange, flatchange) to pd.DataFrame directly and avoid creating the list altogether, which is more memory-efficient and faster. A sketch of that idea follows.
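A sketch, not the poster's code: pos here is a stand-in trade table, and merge(how="cross") needs pandas 1.2 or newer:
import itertools
import pandas as pd

# Stand-in for the poster's trade DataFrame.
pos = pd.DataFrame({
    "ImpliedVolatility": [0.20, 0.35],
    "spread": [1.5, 2.0],
    "spot": [100.0, 250.0],
})

volchange = range(-1, 2)
spreadchange = range(-10, 11)
flatchange = range(-10, 11)

# Build every (vol, spread, flat) triplet in one DataFrame, no nested loops.
params = pd.DataFrame(
    itertools.product(volchange, spreadchange, flatchange),
    columns=["vol_change[pts]", "spread_change[%]", "spot_change[%]"],
)

# Cross-join the parameter grid with the positions in a single merge
# instead of concatenating one copy of pos per parameter triplet.
final_result = params.merge(pos, how="cross")
print(final_result.shape)  # 3 * 21 * 21 triplets * 2 positions -> (2646, 6)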

How to export data frame as CSV in a loop

I'm analyzing some data over a loop of 10 iterations; each iteration represents one of the datasets. I've managed to create a DataFrame with pandas at the end of each iteration, and now I need to export each one with a different name. Here is a piece of the code:
for t in range(len(myFiles)):
    DATA = np.array(importdata(t))
    data = DATA[:, 1:8]
    Numbers = data[:, 0:5]
    Stars = data[:, 5:7]
    [numbers, repetitions] = (Frequence(Numbers))
    rep_n, freq_n = (translate(repetitions, data))
    [stars, Rep_s] = (Frequence(Stars))
    rep_s, freq_s = (translate(Rep_s, data))
    DF1 = dataframe(numbers, rep_n, freq_n)
    DF2 = dataframe(stars, rep_s, freq_s)
DataFrames DF1 and DF2 must be stored separately with different names in each loop iteration.
You can create lists of DataFrames:
ListDF1, ListDF2 = [], []
for t in range(len(myFiles)):
    ...
    rep_s, freq_s = (translate(Rep_s, data))
    ListDF1.append(dataframe(numbers, rep_n, freq_n))
    ListDF2.append(dataframe(stars, rep_s, freq_s))
Then, to select a DataFrame, use indexing:
# get first DataFrame
print(ListDF1[0])
EDIT: If you need to export with different filenames, use the t variable for DF1_0.csv, DF2_0.csv, then DF1_1.csv, DF2_1.csv, ... filenames, because Python counts from 0:
for t in range(len(myFiles)):
    ...
    DF1.to_csv(f'DF1_{t}.csv')
    DF2.to_csv(f'DF2_{t}.csv')
You can use microseconds from datetime, since they will be different on each iteration:
from datetime import datetime

for t in range(len(myFiles)):
    DATA = np.array(importdata(t))
    data = DATA[:, 1:8]
    Numbers = data[:, 0:5]
    Stars = data[:, 5:7]
    [numbers, repetitions] = (Frequence(Numbers))
    rep_n, freq_n = (translate(repetitions, data))
    [stars, Rep_s] = (Frequence(Stars))
    rep_s, freq_s = (translate(Rep_s, data))
    DF1 = dataframe(numbers, rep_n, freq_n)
    DF2 = dataframe(stars, rep_s, freq_s)
    # use double quotes inside the f-string so the quotes do not clash
    DF1.to_csv(f'DF1_{datetime.now().strftime("%f")}.csv')
    DF2.to_csv(f'DF2_{datetime.now().strftime("%f")}.csv')

Why does performance decrease with the size of the data frame?

I'm working with a relatively large dataset (approx. 5M observations covering about 5.5k firms).
I needed to run OLS regressions with a 60-month rolling window for each firm. I noticed that performance was insanely slow when I ran the following code:
for idx, sub_df in master_df.groupby("firm_id"):
    # OLS code
However, when I first split my dataframe into about 5.5k DataFrames and then iterated over each of them, the performance improved dramatically.
grouped_df = master_df.groupby("firm_id")
df_list = [group for group in grouped_df]
for df in df_list:
    my_df = df[1]
    # OLS code
I'm talking 1-2 weeks of runtime (running 24/7) for the first version compared to 8-9 hours at most for the second.
Can anyone please explain why splitting the master df into N smaller dfs and then iterating over each smaller df performs better than iterating over the same number of groups within the master df?
Thanks ever so much!
I'm unable to reproduce your observation. Here's some code that generates data and then times the direct and indirect methods separately. The time taken is very similar in either case.
Is it possible that you accidentally sorted the dataframe by the group key between the runs? Sorting by group key results in a noticeable difference in run time.
Otherwise, I'm beginning to think that there might be some other differences in your code. It would be great if you could post the full code.
import numpy as np
import pandas as pd
from datetime import datetime

def generate_data():
    ''' returns a Pandas DF with columns 'firm_id' and 'score' '''
    # configuration
    np.random.seed(22)
    num_groups = 50000       # number of distinct groups in the DF
    mean_group_length = 200  # how many records per group?
    cov_group_length = 0.10  # throw in some variability in the num records per group

    # simulate group lengths
    stdv_group_length = mean_group_length * cov_group_length
    group_lengths = np.random.normal(
        loc=mean_group_length,
        scale=stdv_group_length,
        size=(num_groups,)).astype(int)
    group_lengths[group_lengths <= 0] = mean_group_length

    # final length of DF
    total_length = sum(group_lengths)

    # compute entries for group key column
    firm_id_list = []
    for i, l in enumerate(group_lengths):
        firm_id_list.extend([(i + 1)] * l)

    # construct the DF; data column is 'score' populated with Numpy's U[0, 1)
    result_df = pd.DataFrame(data={
        'firm_id': firm_id_list,
        'score': np.random.rand(total_length)
    })

    # Optionally, shuffle or sort the DF by group keys
    # ALTERNATIVE 1: (badly) unsorted df
    result_df = result_df.sample(frac=1, random_state=13).reset_index(drop=True)
    # ALTERNATIVE 2: sort by group key
    # result_df.sort_values(by='firm_id', inplace=True)
    return result_df

def time_method(df, method):
    ''' time 'method' with 'df' as its argument '''
    t_start = datetime.now()
    method(df)
    t_final = datetime.now()
    delta_t = t_final - t_start
    print(f"Method '{method.__name__}' took {delta_t}.")
    return

def process_direct(df):
    ''' direct for-loop over groupby object '''
    for group, df in df.groupby('firm_id'):
        m = df.score.mean()
        s = df.score.std()
    return

def process_indirect(df):
    ''' indirect method: generate groups first as list and then loop over list '''
    grouped_df = df.groupby('firm_id')
    group_list = [pair for pair in grouped_df]
    for pair in group_list:
        m = pair[1].score.mean()
        s = pair[1].score.std()

df = generate_data()
time_method(df, process_direct)
time_method(df, process_indirect)

Python looping and Pandas rank/index quirk

This question pertains to one posted here:
Sort dataframe rows independently by values in another dataframe
In the linked question, I use a Pandas DataFrame to sort each row independently using values in another Pandas DataFrame. The function presented there works perfectly every time it is called directly. For example:
import pandas as pd
import numpy as np

## Generate example dataset
d1 = {}
d2 = {}
d3 = {}
d4 = {}

## generate data:
np.random.seed(5)
for col in list("ABCDEF"):
    d1[col] = np.random.randn(12)
    # np.random.randint replaces the long-deprecated np.random.random_integers
    d2[col + '2'] = np.random.randint(0, 101, 12)
    d3[col + '3'] = np.random.randint(0, 101, 12)
    d4[col + '4'] = np.random.randint(0, 101, 12)
t_index = pd.date_range(start='2015-01-31', periods=12, freq="M")

# place data into dataframes
dat1 = pd.DataFrame(d1, index=t_index)
dat2 = pd.DataFrame(d2, index=t_index)
dat3 = pd.DataFrame(d3, index=t_index)
dat4 = pd.DataFrame(d4, index=t_index)

## Functions
def sortByAnthr(X, Y, Xindex, Reverse=False):
    # order the subset of X.index by Y
    ordrX = [x for (x, y) in sorted(zip(Xindex, Y), key=lambda pair: pair[1], reverse=Reverse)]
    return ordrX

def OrderRow(row, df):
    # .loc replaces the removed .ix indexer
    ordrd_row = df.loc[row.dropna().name, row.dropna().values].tolist()
    return ordrd_row

def r_selectr(dat2, dat1, n, Reverse=False):
    ordr_cols = dat1.apply(lambda x: sortByAnthr(x, dat2.loc[x.name, :], x.index, Reverse), axis=1).iloc[:, -n:]
    ordr_cols.columns = list(range(0, n))  # assign interpretable column names
    ordr_r = ordr_cols.apply(lambda x: OrderRow(x, dat1), axis=1)
    return [ordr_cols, ordr_r]

## Call functions
ordr_cols2, ordr_r = r_selectr(dat2, dat1, 5)

## print output:
print("Ordering set:\n", dat2.iloc[-2:, :])
print("Original set:\n", dat1.iloc[-2:, :])
print("Column ordr:\n", ordr_cols2.iloc[-2:, :])
As can be checked, the columns of dat1 are correctly ordered according to the values in dat2.
However, when it is called from a loop over dataframes, it no longer ranks/indexes correctly and produces completely dubious results. Although I am not quite able to recreate the problem with the reduced version presented here, the idea should be the same.
## Loop test:
out_list = []
data_dicts = {'dat2': dat2, 'dat3': dat3, 'dat4': dat4}
for i in range(3):
    # this outer for loop supplies different parameter values to a wrapper
    # function that calls r_selectr.
    for key in data_dicts.keys():
        ordr_cols, _ = r_selectr(data_dicts[key], dat1, 5)
        out_list.append(ordr_cols)
        # do stuff here

# print output:
print("Ordering set:\n", dat3.iloc[-2:, :])
print("Column ordr:\n", ordr_cols2.iloc[-2:, :])
In my code (almost completely analogous to the example given here), ordr_cols is no longer ordered correctly for any of the sorting dataframes.
I currently solve the issue by separating the ordering and indexing operations of r_selectr into two separate functions. That, for some reason, resolves the issue, though I have no idea why.
