I am trying to add new columns to a huge pandas dataframe. I wrote a function that adds the new columns and can loop over the dataframe with it. This works, but because the dataframe is so big it takes quite a while. So I tried to use the multiprocessing module to speed things up, but I was not able to make it run.
Below is a MWE. I guess pool.map() cannot change the dataframe directly and I need to save the new columns somewhere else first. Note: in the "real" code I will add more than 100 new columns, and those are also based on values in other dataframes (so I guess apply is not possible).
import pandas as pd
import numpy as np
from multiprocessing import Pool
df = pd.DataFrame({"Value1" : [1,2,3], "Value2" : [9,8,7]})
def make_new_columns(i):
    df.loc[i, 'mean'] = np.mean([df.loc[i, 'Value1'], df.loc[i, 'Value2']])
    df.loc[i, 'sd'] = np.std([df.loc[i, 'Value1'], df.loc[i, 'Value2']])
    df.loc[i, 'cv'] = df.loc[i, 'mean'] / df.loc[i, 'sd']
# With a for loop it is working
# for i in range(len(df)):
#     make_new_columns(i)
# With multiprocessing it isn't
pool = Pool()
pool.map(make_new_columns, range(len(df)))
Thanks for your input.
EDIT:
To give a bit more background: I have a dataframe containing tennis match data (Match_Table) which looks a bit like this:
Match_Table:
Date       Player_1   Player_2   Winner   Aces_1   Aces_2   [...]
------------------------------------------------------------------
20200528   Thomas     Peter      Thomas   6        2
20200526   Peter      Michael    Peter    8        3
20200524   Donald     Bill       Bill     3        12
...
Now I am interested in statistics for a specific matchup. For example: "What was the win rate of e.g. Peter in the last 100 games?", "How many aces did he score on average?", "How many aces did his opponents score?", "What was his win rate against e.g. Bill in the last 100 games?", ...
I also need these statistics for different dates in the past (e.g. what was Peter's win rate in January 2018?). Therefore, I build a second table with the required information (Statistic_Table):
Statistic_Table:
Date     Player1   Player2
----------------------------
202002   Thomas    Peter
202002   Peter     Michael
201905   Donald    Bill
...
Then I wrote a function which filters Match_Table and calculates all the missing columns of Statistic_Table. I can now loop over each row, which results in this:
Date     Player   Opponent   Winrate   Winrate_vs   avgAces   [...]
---------------------------------------------------------------------
202002   Thomas   Peter      0.47      0.45         4.5
202002   Peter    Michael    0.54      0.64         8.4
201905   Donald   Bill       0.63      0.78         6.5
...
Everything works fine. But since for every cell in my quite large Statistic_Table I have to subset another table and calculate statistics (not only means or rates but also weighted averages, etc.), it takes several hours. That would be acceptable, since I only need to create the table once. But still, if I could split the workload across several cores it would be faster, and also easier in case I have to adjust some parameters.
I also looked into the possibility of using some apply method or optimizing the code, but since I (hopefully) only need to generate the table once, I don't want to lose too much time on this. Thus, multiprocessing seemed like an easy solution, especially since I have access to powerful computers.
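As an illustration of the "save the new columns somewhere else first" idea mentioned above, here is a minimal sketch (based on the toy MWE columns, not the real statistics function) in which each worker returns its computed values and the parent process assembles the new columns afterwards:
import pandas as pd
import numpy as np
from multiprocessing import Pool

df = pd.DataFrame({"Value1": [1, 2, 3], "Value2": [9, 8, 7]})

def compute_row(i):
    # Return the new values for row i instead of mutating the shared dataframe
    vals = [df.loc[i, 'Value1'], df.loc[i, 'Value2']]
    mean, sd = np.mean(vals), np.std(vals)
    return {'mean': mean, 'sd': sd, 'cv': mean / sd}

if __name__ == '__main__':
    with Pool() as pool:
        new_cols = pool.map(compute_row, range(len(df)))
    # Assemble the workers' results into new columns in the parent process
    df = df.join(pd.DataFrame(new_cols, index=df.index))
    print(df)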
There is a better module for your use case than multiprocessing. Use Ray.
import pandas as pd
import numpy as np
import ray
ray.init()
@ray.remote
class DataFrameActor:
    def __init__(self, df):
        self.df = df.copy()

    def make_new_columns(self, i):
        self.df.loc[i, 'mean'] = np.mean([self.df.loc[i, 'Value1'], self.df.loc[i, 'Value2']])
        self.df.loc[i, 'sd'] = np.std([self.df.loc[i, 'Value1'], self.df.loc[i, 'Value2']])
        self.df.loc[i, 'cv'] = self.df.loc[i, 'mean'] / self.df.loc[i, 'sd']

    def to_df(self):
        return self.df

@ray.remote
def worker(_df_actor, value):
    # Block until the actor has applied this row's update
    ray.get(_df_actor.make_new_columns.remote(i=value))

df = pd.DataFrame({"Value1" : [1,2,3], "Value2" : [9,8,7]})
df_actor = DataFrameActor.remote(df)
ray.get([worker.remote(df_actor, j) for j in range(len(df))])  # wait for all workers to finish
print(ray.get(df_actor.to_df.remote()))
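As a variation on the same idea (not part of the original answer, just a sketch), the frame can also be split into chunks that are processed by stateless Ray tasks and concatenated afterwards, which avoids funnelling every row through a single actor:
import numpy as np
import pandas as pd
import ray

ray.init(ignore_reinit_error=True)

@ray.remote
def process_chunk(chunk):
    # Each task works on its own copy of a slice of the frame
    chunk = chunk.copy()
    chunk['mean'] = chunk[['Value1', 'Value2']].mean(axis=1)
    chunk['sd'] = chunk[['Value1', 'Value2']].std(axis=1, ddof=0)  # ddof=0 to match np.std
    chunk['cv'] = chunk['mean'] / chunk['sd']
    return chunk

df = pd.DataFrame({"Value1": [1, 2, 3], "Value2": [9, 8, 7]})
chunks = np.array_split(df, 2)  # in a real run, roughly one chunk per core
result = pd.concat(ray.get([process_chunk.remote(c) for c in chunks]))
print(result)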
Assuming the functions in your MWE represent what you want to do in your real frame, you should work column-wise.
df['mean'] = df[['Value1', 'Value2']].mean(axis=1)
df['sd'] = df[['Value1', 'Value2']].std(axis=1)
df['cv'] = df['mean'] / df['sd']
Below is timing code (where df is built with more rows and the values are randomly drawn integers):
import pandas as pd
import numpy as np
from multiprocessing import Pool
n_rows = 2000
df = pd.DataFrame({"Value1" : np.random.randint(1, high=100, size=n_rows),
"Value2" : np.random.randint(1, high=100, size=n_rows)})
# Function takes now df as input so no global variables is changed
def make_new_columns(df, i):
df.loc[i, 'mean'] = np.mean([df.loc[i, 'Value1'], df.loc[i, 'Value2']])
df.loc[i, 'sd'] = np.std([df.loc[i, 'Value1'], df.loc[i, 'Value2']])
df.loc[i, 'cv'] = df.loc[i, 'mean'] / df.loc[i, 'sd']
return df
# cell-wise construction: 1.98 s
%%timeit
df_2 = df.copy()
# With a for loop it is working
for i in range(len(df_2)):
    make_new_columns(df_2, i)

# column-wise construction: 2.4 ms
%%timeit
df_2 = df.copy()
df_2['mean'] = df_2[['Value1', 'Value2']].mean(axis=1)
df_2['sd'] = df_2[['Value1', 'Value2']].std(axis=1)
df_2['cv'] = df_2['mean'] / df_2['sd']
The code I'm running gives results that are space-delimited. This creates a problem with my sector column: a value like "Communication Services" ends up as one column for "Communication" and another for "Services", where I need one column saying "Communication Services". I have tried to concatenate the two columns into one, but I'm getting attribute and str errors and don't know how to achieve this. Can anyone show how this can be done? Thanks
Code
import yfinance as yf
import pandas as pd
from concurrent.futures import ThreadPoolExecutor
list_of_futures= []
def get_stats(ticker):
    info = yf.Tickers(ticker).tickers[ticker].info
    s = f"{ticker} {info['currentPrice']} {info['marketCap']} {info['sector']}"
    list_of_futures.append(s)
ticker_list = ['AAPL', 'ORCL', 'GTBIF', 'META']
with ThreadPoolExecutor() as executor:
    executor.map(get_stats, ticker_list)
(
    pd.DataFrame(list_of_futures)
    [0].str.split(expand=True)
    .rename(columns={0: "Ticker", 1: "Price", 2: "Market Cap", 3: "Sector", 4: "Sector1"})
    .to_excel("yahoo_futures.xlsx", index=False)
)
Current and desired results were shown as screenshots: currently the sector is split into two columns ("Communication" and "Services"); the desired output has a single Sector column ("Communication Services").
Let us reformulate the get_stats function to return a dictionary instead of a string. This way you can avoid the unnecessary step of splitting the strings to create a dataframe.
def get_stats(ticker):
    info = yf.Tickers(ticker).tickers[ticker].info
    cols = ['currentPrice', 'marketCap', 'sector']
    return {'ticker': ticker, **{c: info[c] for c in cols}}

tickers = ['AAPL', 'ORCL', 'GTBIF', 'META']
with ThreadPoolExecutor() as executor:
    result_iter = executor.map(get_stats, tickers)
df = pd.DataFrame(result_iter)
Result
ticker currentPrice marketCap sector
0 AAPL 148.11 2356148699136 Technology
1 ORCL 82.72 223027183616 Technology
2 GTBIF 13.25 3190864896 Healthcare
3 META 111.41 295409188864 Communication Services
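The resulting frame keeps "Communication Services" intact in a single sector column and can be written to Excel as before, e.g. with df.to_excel("yahoo_futures.xlsx", index=False).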
I am working with stock data coming from Yahoo Finance.
def load_y_finance_data(y_finance_tickers: list):
    df = pd.DataFrame()
    print("Loading Y-Finance data ...")
    for ticker in y_finance_tickers:
        df[ticker.replace("^", "")] = yf.download(
            ticker,
            auto_adjust=True,  # only download adjusted data
            progress=False,
        )["Close"]
    print("Done loading Y-Finance data!")
    return df
x = load_y_finance_data(["^VIX", "^GSPC"])
x
VIX GSPC
Date
1990-01-02 17.240000 359.690002
1990-01-03 18.190001 358.760010
1990-01-04 19.219999 355.670013
1990-01-05 20.110001 352.200012
1990-01-08 20.260000 353.790009
DataSize=(8301, 2)
Here I want to perform a sliding-window operation over every 50-day period: get the correlation (using corr()) for a 50-day slice (day_1 to day_50), then move the window by one day (day_2 to day_51), and so on.
I tried the naive way of using a for loop to do this, and it works, but it takes too much time. Code below:
data_size = len(x)
period = 50
df = pd.DataFrame()
for i in range(data_size - period):
    df.loc[i, "GSPC_VIX_corr"] = x[["GSPC", "VIX"]][i:i+period].corr().loc["GSPC", "VIX"]
df
GSPC_VIX_corr
0 -0.703156
1 -0.651513
2 -0.602876
3 -0.583256
4 -0.589086
How can I do this more efficiently? Is there any built-in way I can use?
Thanks :)
You can use the rolling-window functionality of Pandas with many different aggregations, including corr(). Instead of your for loop, do this:
x["VIX"].rolling(window=period).corr(x["GSPC"])
I have around 150 CSV files in the following format:
Product Name   Cost   Manufacturer   Country
---------------------------------------------
P_0            5      Pfizer         Finland
P_1            10     BioNTech       Sweden
P_2            12     Pfizer         Denmark
P_3            11     J&J            Finland
Each CSV represents daily data. So the file for the previous date would look like:
Product Name   Cost   Manufacturer   Country
---------------------------------------------
P_0            7      Pfizer         Finland
P_1            15     BioNTech       Sweden
P_2            17     Pfizer         Denmark
P_3            10     J&J            Finland
I would like to create a time series dataset where I can track the price of a product given a manufacturer in a given country over time.
So for example I want to be able to show the price development of product P_1 made by BioNTech in Sweden as:
Date         Price
--------------------
17/10/2022   15
18/10/2022   10
My attempt:
Each CSV has the date as part of its name (e.g., 'data_17-10-2022'). So I have created a list that contains the paths to all of the CSV files; I then iterate through this list, convert each CSV to a pandas dataframe, append each to a list, concatenate them, and finally perform a groupby operation.
def create_ts(data):
    df_list = []
    for file in data:
        match = re.search(r'\d{2}-\d{2}-\d{4}', file)  # get date from file name
        date = datetime.strptime(match.group(), '%d-%m-%Y').date()
        df = pd.read_csv(file, sep=";")
        df["date"] = date  # create a new column in each df that contains the date
        df_list.append(df)
    return df_list
df_concat = pd.concat(create_ts(my_files))
df_group = df_concat.groupby(["Manufacturer", "Country", "Product Name"])
This returns what I am after. However, it is very slow: when I tried it for a random country, manufacturer and product name, it took nearly 10 minutes to run.
The problem (I think) is that each CSV is approximately 40 MB (180,000 rows and 20 columns, of which I drop around 10 irrelevant columns).
Is there anything I can do to speed this up? I tried installing modin, but I got an error saying I need VS C++ v.14, and my work computer does not allow me to install programs without going through a very long process with the IT department.
Fundamentally your reading approach is fine: as far as I know, reading and then concatenating the dataframes is the best approach. There are some marginal improvements you can get if you use the usecols and dtype parameters in read_csv, but this is very dependent on what your data looks like:
Method                        Time                   Relative
---------------------------------------------------------------------
Original                      0.1512130000628531     1.5909069397118787
Only load columns you need    0.09676750004291534    1.0180876465175188
Use dtype parameter           0.09504829999059439    1.0
I think that to get a significant performance improvement you probably want to look at caching at some point in the process, as dankal444 mentions.
What you cache depends on how the data is changing, but assuming the files do not change once you have received them, I would probably cache the loaded dataframe together with the set of files it contains, something like:
import pickle

dst = './fastreading.pkl'
contained_files = set()

# Save the concatenated dataframe together with the set of files it was built from
with open(dst, 'wb') as f:
    pickle.dump((contained_files, df), f)

# Later: reload both and skip any file already in contained_files2
with open(dst, 'rb') as f:
    contained_files2, df2 = pickle.load(f)
You could then check whether a file is already in the set of contained files during your loading process. I am using pickle here, but there are other, faster ways of loading/saving dataframes; there is some benchmark data here.
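For example, one of those alternatives is the Parquet format (a sketch, assuming pyarrow or fastparquet is installed):
# Save and reload the concatenated dataframe in Parquet format
df.to_parquet('./fastreading.parquet')
df2 = pd.read_parquet('./fastreading.parquet')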
If you are worried that the files might change, you could include a timestamp or a checksum in your contained-files set.
The other thing I would recommend is running a profiler. This should give you a good idea of where the time is spent.
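A minimal sketch with the standard-library profiler, assuming the create_ts function and my_files list from the question are defined in the running script:
import cProfile
import pstats

# Profile the loading step and show the ten most expensive calls
cProfile.run('pd.concat(create_ts(my_files))', 'load_profile')
pstats.Stats('load_profile').sort_stats('cumulative').print_stats(10)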
read_csv test code:
import pandas as pd
import numpy as np
import timeit
iterations = 10
item_count = 5000
path = './fasterreading.csv'
data = {c: [i/2 for i in range(item_count)] for c in [chr(c) for c in range(ord('a'), ord('z') + 1)]}
dtypes = {c: np.float64 for c in data.keys()}
df = pd.DataFrame(data)
df.to_csv(path)
# attempt to negate file system caching effect
timeit.timeit(lambda: pd.read_csv(path), number=5)
t0 = timeit.timeit(lambda: pd.read_csv(path), number=iterations)
t1 = timeit.timeit(lambda: pd.read_csv(path, usecols=['a', 'b', 'c']), number=iterations)
t2 = timeit.timeit(lambda: pd.read_csv(path, usecols=['a', 'b', 'c'], dtype=dtypes), number=iterations)
tmin = min(t0, t1, t2)
print(f'| Method                      | Time | Relative |')
print(f'|-----------------------------|------|----------|')
print(f'| Original                    | {t0} | {t0 / tmin} |')
print(f'| Only load columns you need  | {t1} | {t1 / tmin} |')
print(f'| Use dtype parameter         | {t2} | {t2 / tmin} |')
I have two dataframes with the same columns:
Dataframe 1:
attr_1 attr_77 ... attr_8
userID
John 1.2501 2.4196 ... 1.7610
Charles 0.0000 1.0618 ... 1.4813
Genarito 2.7037 4.6707 ... 5.3583
Mark 9.2775 6.7638 ... 6.0071
Dataframe 2:
attr_1 attr_77 ... attr_8
petID
Firulais 1.2501 2.4196 ... 1.7610
Connie 0.0000 1.0618 ... 1.4813
PopCorn 2.7037 4.6707 ... 5.3583
I want to generate a correlation and p-value dataframe of all possible combinations; this would be the result:
userId petID Correlation p-value
0 John Firulais 0.091447 1.222927e-02
1 John Connie 0.101687 5.313359e-03
2 John PopCorn 0.178965 8.103919e-07
3 Charles Firulais -0.078460 3.167896e-02
The problem is that the cartesian product generates more than 3 million tuples, and it takes minutes to finish. This is my code; I've written two alternatives:
First of all, initial DataFrames:
df1 = pd.DataFrame({
    'userID': ['John', 'Charles', 'Genarito', 'Mark'],
    'attr_1': [1.2501, 0.0, 2.7037, 9.2775],
    'attr_77': [2.4196, 1.0618, 4.6707, 6.7638],
    'attr_8': [1.7610, 1.4813, 5.3583, 6.0071]
}).set_index('userID')

df2 = pd.DataFrame({
    'petID': ['Firulais', 'Connie', 'PopCorn'],
    'attr_1': [1.2501, 0.0, 2.7037],
    'attr_77': [2.4196, 1.0618, 4.6707],
    'attr_8': [1.7610, 1.4813, 5.3583]
}).set_index('petID')
Option 1:
from scipy.stats import pearsonr

# Pre-allocate space
df1_keys = df1.index
res_row_count = len(df1_keys) * df2.values.shape[0]
users = np.empty(res_row_count, dtype='object')
pets = np.empty(res_row_count, dtype='object')
coff = np.empty(res_row_count)
p_value = np.empty(res_row_count)

i = 0
for df1_key in df1_keys:
    df1_values = df1.loc[df1_key, :].values
    for df2_key in df2.index:
        df2_values = df2.loc[df2_key, :]
        pearson_res = pearsonr(df1_values, df2_values)
        users[i] = df1_key
        pets[i] = df2_key
        coff[i] = pearson_res[0]
        p_value[i] = pearson_res[1]
        i += 1

# After the loop, build the resulting DataFrame (this snippet lives inside a function)
return pd.DataFrame(data={
    'userID': users,
    'petID': pets,
    'Correlation': coff,
    'p-value': p_value
})
Option 2 (slower), from here:
# Makes a merge between all the tuples
def df_crossjoin(df1_file_path, df2_file_path):
    df1, df2 = prepare_df(df1_file_path, df2_file_path)
    df1['_tmpkey'] = 1
    df2['_tmpkey'] = 1
    res = pd.merge(df1, df2, on='_tmpkey').drop('_tmpkey', axis=1)
    res.index = pd.MultiIndex.from_product((df1.index, df2.index))
    df1.drop('_tmpkey', axis=1, inplace=True)
    df2.drop('_tmpkey', axis=1, inplace=True)
    return res
# Computes Pearson Coefficient for all the tuples
def compute_pearson(row):
    values = np.split(row.values, 2)
    return pearsonr(values[0], values[1])
result = df_crossjoin(mrna_file, mirna_file).apply(compute_pearson, axis=1)
Is there a faster way to solve such a problem with Pandas? Or will I have no option but to parallelize the iterations?
Edit:
As the size of the dataframes increases, the second option results in a better runtime, but it still takes seconds to finish.
Thanks in advance
Of all the alternatives tested, the one that gave me the best results was the following:
- An iteration product was made with itertools.product().
- All the iterations on both iterrows() were performed in a Pool of parallel processes (using a map function).
To give it a little more performance, the compute_row_cython function was compiled with Cython, as advised in this section of the Pandas documentation:
In the cython_modules.pyx file:
from scipy.stats import pearsonr
import numpy as np
def compute_row_cython(row):
    (df1_key, df1_values), (df2_key, df2_values) = row
    cdef (double, double) pearsonr_res = pearsonr(df1_values.values, df2_values.values)
    return df1_key, df2_key, pearsonr_res[0], pearsonr_res[1]
Then I set up the setup.py:
from distutils.core import setup
from Cython.Build import cythonize

setup(name='Compiled Pearson',
      ext_modules=cythonize("cython_modules.pyx"))
Finally I compiled it with: python setup.py build_ext --inplace
The final code then looked like this:
import itertools
import multiprocessing
from cython_modules import compute_row_cython
NUM_CORES = multiprocessing.cpu_count() - 1
pool = multiprocessing.Pool(NUM_CORES)
# Calls to Cython function defined in cython_modules.pyx
res = zip(*pool.map(compute_row_cython, itertools.product(df1.iterrows(), df2.iterrows())))
pool.close()
end_values = list(res)
pool.join()
Neither Dask nor the merge-plus-apply approach gave me better results, not even after optimizing the apply with Cython. In fact, those two approaches gave me memory errors, and when implementing the solution with Dask I had to generate several partitions, which degraded the performance because of the many I/O operations involved.
The solution with Dask can be found in my other question.
Here's another method using the same cross join but with the built-in pandas method DataFrame.corrwith and scipy.stats.ttest_ind. Since this uses a less "loopy" implementation, it should perform better.
from scipy.stats import ttest_ind
mrg = df1.reset_index().assign(key=1).merge(df2.reset_index().assign(key=1), on='key').drop(columns='key')  # reset_index so userID/petID become columns
x = mrg.filter(like='_x').rename(columns=lambda x: x.rsplit('_', 1)[0])
y = mrg.filter(like='_y').rename(columns=lambda x: x.rsplit('_', 1)[0])
df = mrg[['userID', 'petID']].join(x.corrwith(y, axis=1).rename('Correlation'))
df['p_value'] = ttest_ind(x, y, axis=1)[1]
userID petID Correlation p_value
0 John Firulais 1.000000 1.000000
1 John Connie 0.641240 0.158341
2 John PopCorn 0.661040 0.048041
3 Charles Firulais 0.641240 0.158341
4 Charles Connie 1.000000 1.000000
5 Charles PopCorn 0.999660 0.020211
6 Genarito Firulais 0.661040 0.048041
7 Genarito Connie 0.999660 0.020211
8 Genarito PopCorn 1.000000 1.000000
9 Mark Firulais -0.682794 0.006080
10 Mark Connie -0.998462 0.003865
11 Mark PopCorn -0.999569 0.070639
I am adding indicators to stock market data, but my solution for finding the ATR is too slow. Is there a way to vectorize this problem?
I am using the yfinance library to get the stock prices.
At first I wrote a simple for loop, but then I realized it would take too long, so now I am using the index to get the previous row's data. I tried using the .shift() function, but I need a slice of previous data, because if I used an average over 174 rows it would be hard to write out.
For example I have this table:
Date Open High Low Close
0 2017-10-23 156.839996 157.039993 155.949997 156.509995
1 2017-10-23 156.529999 156.550095 155.929901 156.339996
2 2017-10-23 156.320007 157.600006 156.119995 157.593994
3 2017-10-23 157.599899 157.679993 157.279999 157.279999
4 2017-10-23 157.297501 157.369995 156.559998 156.619995
And I would like to add an ATR column which is calculated by this formula:
Current ATR = (Prior ATR + Current TR) / 2, in which
Current TR = max[(high - low), abs(high - previous_close), abs(low - previous_close)]
import yfinance as yf
import numpy as np
import pandas as pd
from datetime import datetime, timedelta
def get_prices(stock):
    ul = yf.Ticker(stock)
    pdate = (datetime.now() - timedelta(days=729)).date()
    df = ul.history(start=pdate, interval="1H").drop(["Volume", "Dividends", "Stock Splits"], axis=1)
    df = df.reset_index()
    return df

def addATR_(row, df, av=13):
    if row.Index > av:
        lrid = row.Index - 1
        high = row.High
        low = row.Low
        prev_close = df.loc[df.index == lrid, "Close"]
        matr = df[row.Index - av:row.Index]["ATR"].sum()
        tr = [(high - low), np.abs(high - prev_close).values, np.abs(low - prev_close).values]
        return (matr + np.max(tr)) / (av + 1)

def add_ATR(df):
    df.insert(5, "ATR", 0.2)
    df['Index'] = df.index
    df["ATR"] = df.apply(lambda row: addATR_(row, df), axis=1)
    df = df.drop(range(14))
    df = df.reset_index().drop(["index", "Index"], axis=1)
    print("Added ATR")
    return df

print(add_ATR(get_prices("msft")))
I really want to speed things up, because it takes 9 seconds just to add a single indicator.
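For what it's worth, the non-recursive part of the formula (the True Range) can be computed column-wise rather than row by row; a minimal sketch, assuming the Open/High/Low/Close columns shown above (the recursive averaging into ATR would still need a separate step):
# Vectorized True Range: max of (high - low), |high - prev_close|, |low - prev_close|
prev_close = df["Close"].shift(1)
tr = pd.concat(
    [df["High"] - df["Low"],
     (df["High"] - prev_close).abs(),
     (df["Low"] - prev_close).abs()],
    axis=1,
).max(axis=1)
df["TR"] = tr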