Efficiently load and manipulate csv using dask DataFrame

I am trying to manipulate the csv-file from https://www.kaggle.com/raymondsunartio/6000-nasdaq-stocks-historical-daily-prices using dask.dataframe. The original dataframe has columns 'date', 'ticker', 'open', 'close', etc...
My goal is to create a new data frame with index 'date' and columns as the closing price of each unique ticker.
The following code does the trick, but is quite slow, using almost a minute for N = 6. I suspect that dask tries to read the CSV-file multiple times in the for-loop, but I don't know how I would go about making this faster. My initial guess is that using df.groupby('ticker') somewhere would help, but I am not familiar enough with pandas.
import dask.dataframe as dd
from functools import reduce
def load_and_fix_csv(path: str, N: int, tickers: list = None) -> dd.DataFrame:
    raw = dd.read_csv(path, parse_dates=["date"])
    if tickers is None:
        tickers = raw.ticker.unique().compute()[:N]  # Get unique tickers
    dfs = []
    for tick in tickers:
        tmp = raw[raw.ticker == tick][["date", "close"]]  # Temporary dataframe from specific ticker with columns date, close
        dfs.append(tmp)
    df = reduce(lambda x, y: dd.merge(x, y, how="outer", on="date"), dfs)  # Merge all dataframes on date
    df = df.set_index("date").compute()
    return df
Every kind of help is appreciated!
Thank you.

I'm pretty sure you're right that Dask is likely going "back to the well" for each loop; this is because Dask builds a graph of operations and attempts to defer computation until forced or necessary. One thing I like to do is to cut the reading operations of the graph with Client.persist:
import dask.dataframe as dd
from functools import reduce
from distributed import Client
client = Client()
def persist_load_and_fix_csv(path: str, N: int, tickers: list = None) -> dd.DataFrame:
    raw = dd.read_csv(path, parse_dates=["date"])
    # This "cuts the graph" prior operations (just the `read_csv` here)
    raw = client.persist(raw)
    if tickers is None:
        tickers = raw.ticker.unique().compute()[:N]  # Get unique tickers
    dfs = []
    for tick in tickers:
        tmp = raw[raw.ticker == tick][["date", "close"]]  # Temporary dataframe from specific ticker with columns date, close
        dfs.append(tmp)
    df = reduce(lambda x, y: dd.merge(x, y, how="outer", on="date"), dfs)  # Merge all dataframes on date
    df = df.set_index("date").compute()
    return df
In a Kaggle session I tested both functions with persist_load_and_fix_csv(csv_path, N=3) and managed to cut the time in half. You'll also get better performance by only keeping the columns you end up using.
(Note: I've found that, at least for me and my code, when I start seeing .compute() crop up inside functions, I should step back and reevaluate the code paths; I view it as a code smell.)
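As a complement to the advice about keeping only the columns you use, here is a minimal sketch of a single-pass alternative to the merge loop. It is not the method from the answer above: it uses plain pandas rather than dask, and it assumes the filtered rows fit in memory. The sample data and the function name closing_prices are made up for illustration. A dask DataFrame should allow the same isin filter before calling .compute(), but that variant is not shown or tested here.

```python
import io

import pandas as pd

# Hypothetical sample with the same column names as the Kaggle file
csv_text = """ticker,date,open,close
AAA,2020-01-02,1.0,1.1
BBB,2020-01-02,5.0,5.2
AAA,2020-01-03,1.1,1.3
CCC,2020-01-03,9.0,9.1
"""


def closing_prices(source, tickers):
    # Read only the three columns that are actually needed
    raw = pd.read_csv(source, usecols=["date", "ticker", "close"], parse_dates=["date"])
    # One filter for all tickers instead of one filter per ticker
    wanted = raw[raw["ticker"].isin(tickers)]
    # One row per date, one column per ticker, closing price as the value
    return wanted.pivot(index="date", columns="ticker", values="close")


wide = closing_prices(io.StringIO(csv_text), ["AAA", "BBB"])
print(wide)
```

Dates on which a ticker has no row show up as NaN in that ticker's column, which matches what the outer merge in the question produces.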

Related

trying to figure out a pythonic way of code that is taking time even after using list comprehension and pandas

I have two dataframes: one comprising a large data set, allprice_df, with time price series for all stocks; and the other, init_df, comprising selective stocks and trade entry dates. I am trying to find the highest price for each ticker symbol and its associated date.
The following code works but it is time consuming, and I am wondering if there is a better, more Pythonic way to accomplish this.
# Initial call
init_df = init_df.assign(HighestHigh = lambda x:
    highestHigh(x['DateIdentified'], x['Ticker'], allprice_df))
# HighestHigh function in lambda call
def highestHigh(date1,ticker,allp_df):
    if date1.size == ticker.size:
        temp_df = pd.DataFrame(columns = ['DateIdentified','Ticker'])
        temp_df['DateIdentified'] = date1
        temp_df['Ticker'] = ticker
    else:
        print("dates and tickers size mismatching")
        sys.exit(1)
    counter = itertools.count(0)
    high_list = [getHigh(x,y,allp_df, next(counter)) for x, y in zip(temp_df['DateIdentified'],temp_df['Ticker'])]
    return high_list
# Getting high for each ticker
def getHigh(dateidentified,ticker,allp_df, count):
    print("trade %s" % count)
    currDate = datetime.datetime.now().date()
    allpm_df = allp_df.loc[((allp_df['Ticker']==ticker)&(allp_df['date']>dateidentified)&(allp_df['date']<=currDate)),['high','date']]
    hh = allpm_df.iloc[:,0].max()
    hd = allpm_df.loc[(allpm_df['high']==hh),'date']
    hh = round(hh,2)
    h_list = [hh,hd]
    return h_list
# Split the list in to 2 columns one with price and the other with the corresponding date
init_df = split_columns(init_df,"HighestHigh")
# The function to split the list elements in to different columns
def split_columns(orig_df,col):
    split_df = pd.DataFrame(orig_df[col].tolist(),columns=[col+"Mod", col+"Date"])
    split_df[col+"Date"] = split_df[col+"Date"].apply(lambda x: x.squeeze())
    orig_df = pd.concat([orig_df,split_df], axis=1)
    orig_df = orig_df.drop(col,axis=1)
    orig_df = orig_df.rename(columns={col+"Mod": col})
    return orig_df

There are a couple of obvious solutions that would help reduce your runtime.
First, in your getHigh function, instead of using loc to get the date associated with the maximum value for high, use idxmax to get the index of the row associated with the high and then access that row:
hh, hd = allpm_df.loc[allpm_df['high'].idxmax()]
This will replace two O(N) operations (finding the maximum in a list, and doing a list lookup using a comparison) with one O(N) operation and one O(1) operation.
Edit
In light of your information on the size of your dataframes, my best guess is that this line is probably where most of your time is being consumed:
allpm_df = allp_df.loc[((allp_df['Ticker']==ticker)&(allp_df['date']>dateidentified)&(allp_df['date']<=currDate)),['high','date']]
In order to make this faster, I would set up your data frame with a multi-index when you first create it:
index = pd.MultiIndex.from_arrays(arrays = [ticker_symbols, dates], names = ['Symbol', 'Date'])
allp_df = pd.DataFrame(data, index = index)
allp_df = allp_df.sort_index(level = 0, sort_remaining = True)
This should create a dataframe with a sorted, multi-level index built from your ticker symbol and date. Doing this will reduce your search time tremendously. Once you do that, you should be able to access all the data associated with a ticker symbol and a given date range by doing this:
allp_df.loc[(ticker, slice(dateidentified, currDate)), :]
which should return your data much more quickly. For more information on multi-indexing, check out this helpful Pandas tutorial.
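To make the multi-index lookup and the idxmax suggestion concrete, here is a small sketch on made-up prices. The function name highest_high and the sample values are assumptions for illustration only. Note that a slice on a sorted index includes its start date, whereas the question uses a strict "greater than" comparison, so the start date may need shifting by one day.

```python
import pandas as pd

# Hypothetical price history using the column names from the question
prices = pd.DataFrame({
    "Ticker": ["AAA", "AAA", "AAA", "BBB", "BBB"],
    "date": pd.to_datetime(["2022-01-03", "2022-01-04", "2022-01-05",
                            "2022-01-03", "2022-01-04"]),
    "high": [10.0, 12.5, 11.0, 20.0, 19.0],
})

# Build the sorted (Ticker, date) index once, up front
indexed = prices.set_index(["Ticker", "date"]).sort_index()


def highest_high(indexed, ticker, start, end):
    # Select one ticker and an inclusive date range via the sorted index
    window = indexed.loc[(ticker, slice(start, end)), "high"]
    # idxmax returns the (ticker, date) label of the largest value
    best = window.idxmax()
    return window.loc[best], best[1]


hh, hd = highest_high(indexed, "AAA",
                      pd.Timestamp("2022-01-04"), pd.Timestamp("2022-01-05"))
print(hh, hd)
```

The index is built once and reused for every lookup, so each call avoids scanning the whole frame with boolean masks.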

Combining Successive Pandas Dataframes in One Master Dataframe via a Loop

I'm trying to loop through a series of tickers cleaning the associated dataframes then combining the individual ticker dataframes into one large dataframe with columns named for each ticker. The following code enables me to loop through unique tickers and name the columns of each ticker's dataframe after the specific ticker:
import pandas as pd
def clean_func(tkr,f1):
    f1['Date'] = pd.to_datetime(f1['Date'])
    f1.index = f1['Date']
    keep = ['Col1','Col2']
    f2 = f1[keep]
    f2.columns = [tkr+'Col1',tkr+'Col2']
    return f2
tkrs = ['tkr1','tkr2','tkr3']
for tkr in tkrs:
    df1 = pd.read_csv(f'C:\\path\\{tkr}.csv')
    df2 = clean_func(tkr,df1)
However, I don't know how to create a master dataframe where I add each new ticker to the master dataframe. With that in mind, I'd like to align each new ticker's data using the datetime index. So, if tkr1 has data for 6/25/22, 6/26/22, 6/27/22, and tkr2 has data for 6/26/22, and 6/27/22, the combined dataframe would show all three dates but would produce a NaN for ticker 2 on 6/25/22 since there is no data for that ticker on that date.
When not in a loop looking to append each successive ticker to a larger dataframe (as per above), the following code does what I'd like. But it doesn't work when looping and adding new ticker data for each successive loop (or I don't know how to make it work in the confines of a loop).
combined = pd.concat((df1, df2, df3,...,dfn), axis=1)
Many thanks in advance.

You should only create the master DataFrame after the loop. Appending to the master DataFrame in each iteration via pandas.concat is slow since you are creating a new DataFrame every time.
Instead, read each ticker DataFrame, clean it, and append it to a list that stores all the ticker DataFrames. After the loop, create the master DataFrame from that list using pandas.concat:
import pandas as pd
def clean_func(tkr,f1):
    f1['Date'] = pd.to_datetime(f1['Date'])
    f1.index = f1['Date']
    keep = ['Col1','Col2']
    f2 = f1[keep]
    f2.columns = [tkr+'Col1',tkr+'Col2']
    return f2
tkrs = ['tkr1','tkr2','tkr3']
dfs_list = []
for tkr in tkrs:
    df1 = pd.read_csv(f'C:\\path\\{tkr}.csv')
    df2 = clean_func(tkr,df1)
    dfs_list.append(df2)
master_df = pd.concat(dfs_list, axis=1)
As a suggestion here is a cleaner way of defining your clean_func using DataFrame.set_index and DataFrame.add_prefix.
def clean_func(tkr, f1):
    f1['Date'] = pd.to_datetime(f1['Date'])
    f2 = f1.set_index('Date')[['Col1','Col2']].add_prefix(tkr)
    return f2
Or, if you want, you can parse the Date column as datetime and set it as the index directly in the pd.read_csv call by specifying the index_col and parse_dates parameters; the two are commonly used together.
import pandas as pd
def clean_func(tkr,f1):
    f2 = f1[['Col1','Col2']].add_prefix(tkr)
    return f2
tkrs = ['tkr1','tkr2','tkr3']
dfs_list = []
for tkr in tkrs:
    df1 = pd.read_csv(f'C:\\path\\{tkr}.csv', index_col='Date', parse_dates=['Date'])
    df2 = clean_func(tkr,df1)
    dfs_list.append(df2)
master_df = pd.concat(dfs_list, axis=1)
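To check the date alignment described in the question, here is a self-contained sketch that swaps the CSV files for two small in-memory frames. The sample values are invented; the column names and the tkr1/tkr2 labels follow the question.

```python
import pandas as pd


def clean_func(tkr, f1):
    # Work on a copy so the caller's frame is left untouched
    f1 = f1.copy()
    f1["Date"] = pd.to_datetime(f1["Date"])
    return f1.set_index("Date")[["Col1", "Col2"]].add_prefix(tkr)


# Hypothetical stand-ins for the per-ticker CSV files
raw = {
    "tkr1": pd.DataFrame({"Date": ["2022-06-25", "2022-06-26", "2022-06-27"],
                          "Col1": [1.0, 2.0, 3.0],
                          "Col2": [10.0, 20.0, 30.0]}),
    "tkr2": pd.DataFrame({"Date": ["2022-06-26", "2022-06-27"],
                          "Col1": [5.0, 6.0],
                          "Col2": [50.0, 60.0]}),
}

# Clean every frame first, then concatenate once along the columns
master_df = pd.concat([clean_func(tkr, df) for tkr, df in raw.items()], axis=1)
print(master_df)
```

Because concat with axis=1 performs an outer join on the index by default, 2022-06-25 is kept and the tkr2 columns hold NaN for that date.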

Before the loop create an empty df with:
combined = pd.DataFrame()
Then within the loop (after loading df1 - see code above):
combined = pd.concat((combined, clean_func(tkr, df1)), axis=1)
If you get:
TypeError: concat() got multiple values for argument 'axis'
Make sure your parentheses are correct per above.
With the code above, you can skip the original step:
df2 = clean_func(tkr,df1)
Since it is embedded in the concat function. Alternatively, you could keep the df2 step and use:
combined = pd.concat((combined,df2), axis=1)
Just make sure the dataframes are encapsulated by parentheses within the concat function.

Same answer as GC123 but here is a full example which mimics reading from separate files and concatenating them
import pandas as pd
import io
fake_file_1 = io.StringIO("""
fruit,store,quantity,unit_price
apple,fancy-grocers,2,9.25
pear,fancy-grocers,3,100
banana,fancy-grocers,1,256
""")
fake_file_2 = io.StringIO("""
fruit,store,quantity,unit_price
banana,bargain-grocers,667,0.01
apple,bargain-grocers,170,0.15
pear,bargain-grocers,281,0.45
""")
fake_files = [fake_file_1,fake_file_2]
combined = pd.DataFrame()
for fake_file in fake_files:
    df = pd.read_csv(fake_file)
    df = df.set_index('fruit')
    combined = pd.concat((combined, df), axis=1)
print(combined)
Output

This method is slightly more efficient:
combined = []
for fake_file in fake_files:
    combined.append(pd.read_csv(fake_file).set_index('fruit'))
combined = pd.concat(combined, axis=1)
print(combined)
Output:
                store  quantity  unit_price            store  quantity  unit_price
fruit
apple   fancy-grocers         2        9.25  bargain-grocers       170        0.15
pear    fancy-grocers         3      100.00  bargain-grocers       281        0.45
banana  fancy-grocers         1      256.00  bargain-grocers       667        0.01

Assemble a dataframe from two csv files

I wrote the following code to form a data frame containing the energy consumption and the temperature. The data for each of the variables is collected from a different csv file:
def match_data():
    pwr_data = pd.read_csv(r'C:\\Users\X\Energy consumption per hour-data-2022-03-16 17_50_56_Edited.csv')
    temp_data = pd.read_csv(r'C:\\Users\X\temp.csv')
    new_time = []
    new_pwr = []
    new_tmp = []
    for i in range(1,len(pwr_data)):
        for j in range(1,len(temp_data)):
            if pwr_data['time'][i] == temp_data['Date'][j]:
                time = pwr_data['time'][i]
                pwr = pwr_data['watt_hour'][i]
                tmp = temp_data['Temp'][j]
                new_time.append(time)
                new_pwr.append(pwr)
                new_tmp.append(tmp)
    return pd.DataFrame({'Time' : new_time,'watt_hour' : new_pwr,'Temp':new_tmp})
I was trying to collect data with matching time indices so that I can assemble them in a data frame.
The code works well, but it takes time (43 seconds for around 1300 data points). At the moment I don't have much data, but I was wondering if there is a more efficient and faster way to do this.

Do the pwr_data['time'] and temp_data['Date'] columns have the same granularity?
If so, you can pd.merge() the two dataframes after reading them.
# read data
pwr_data = pd.read_csv(r'C:\\Users\X\Energy consumption per hour-data-2022-03-16 17_50_56_Edited.csv')
temp_data = pd.read_csv(r'C:\\Users\X\temp.csv')
# merge data on time and Date columns
# you can set the how to be 'inner' or 'right' depending on your needs
df = pd.merge(pwr_data, temp_data, how='left', left_on='time', right_on='Date')
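For a runnable illustration of the merge, here is a sketch with two small in-memory frames in place of the CSV files. The timestamps and readings are invented; the column names come from the question. It uses an inner join so that only timestamps present in both frames survive, which mirrors what the nested loops collect.

```python
import pandas as pd

# Hypothetical hourly readings standing in for the two CSV files
pwr_data = pd.DataFrame({
    "time": ["2022-03-16 00:00", "2022-03-16 01:00", "2022-03-16 02:00"],
    "watt_hour": [120, 95, 101],
})
temp_data = pd.DataFrame({
    "Date": ["2022-03-16 01:00", "2022-03-16 02:00", "2022-03-16 03:00"],
    "Temp": [4.5, 4.1, 3.8],
})

# Parse both key columns so they compare as timestamps, not as raw strings
pwr_data["time"] = pd.to_datetime(pwr_data["time"])
temp_data["Date"] = pd.to_datetime(temp_data["Date"])

# Inner join keeps only the timestamps that appear in both frames
matched = (
    pd.merge(pwr_data, temp_data, how="inner", left_on="time", right_on="Date")
    .drop(columns="Date")
    .rename(columns={"time": "Time"})
)
print(matched)
```

The merge matches keys with a hash join instead of comparing every pair of rows, which is why it scales far better than the double loop.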

Just as @greco recommended, this did the trick, and in no time!
pd.merge(pwr_data,temp_data,how='inner',left_on='time',right_on='Date')
'time' and Date are the columns on which you want to base the merge.

Pandas append performance concat/append using "larger" DataFrames

The problem: I have data stored in CSV files with the columns date/id/value. There are 15 files, each containing around 10-20 million rows. Each CSV file covers a distinct period, so the time indexes do not overlap, but the columns do (new ids enter from time to time, old ones disappear). I originally ran the script without the pivot call, but then I ran into memory issues on my local machine (only 8 GB). Since there is a lot of redundancy in each file, pivot at first seemed a nice way out (roughly 2/3 less data), but now performance is the problem. If I run the following script, the concat call runs "forever" (so far I have always interrupted it manually after more than 2 hours). Do concat/append have limitations in terms of size (I have roughly 10000-20000 columns), or am I missing something? Any suggestions?
import pandas as pd
path = 'D:\\'
data = pd.DataFrame()
#loop through list of raw file names
for file in raw_files:
    data_tmp = pd.read_csv(path + file, engine='c',
                           compression='gzip',
                           low_memory=False,
                           usecols=['date', 'Value', 'ID'])
    data_tmp = data_tmp.pivot(index='date', columns='ID',
                              values='Value')
    data = pd.concat([data,data_tmp])
    del data_tmp
EDIT I: To clarify, each CSV file has about 10-20 million rows and three columns; after the pivot is applied this reduces to about 2000 rows but leads to 10000 columns.
I can solve the memory issue by simply splitting the full set of ids into subsets and running the needed calculations on each subset, since they are independent for each id. I know this makes me reload the same files n times, where n is the number of subsets used, but it is still reasonably fast. I still wonder why append is not performing.
EDIT II: I have tried to recreate the file structure with a simulation that is as close as possible to the actual data structure. I hope it is clear; I didn't spend too much time minimizing simulation time, but it runs reasonably fast on my machine.
import string
import random
import pandas as pd
import numpy as np
import math
# Settings :-------------------------------
num_ids = 20000
start_ids = 4000
num_files = 10
id_interval = int((num_ids-start_ids)/num_files)
len_ids = 9
start_date = '1960-01-01'
end_date = '2014-12-31'
run_to_file = 2
# ------------------------------------------
# Simulation column IDs
id_list = []
# ensure unique elements are of size >num_ids
for x in range(num_ids + round(num_ids*0.1)):
    id_list.append(''.join(
        random.choice(string.ascii_uppercase + string.digits) for _
        in range(len_ids)))
id_list = set(id_list)
id_list = list(id_list)[:num_ids]
time_index = pd.bdate_range(start_date,end_date,freq='D')
chunk_size = math.ceil(len(time_index)/num_files)
data = []
# Simulate files
for file in range(0, run_to_file):
    tmp_time = time_index[file * chunk_size:(file + 1) * chunk_size]
    # TODO not all cases cover, make sure ints are obtained
    tmp_ids = id_list[file * id_interval:
                      start_ids + (file + 1) * id_interval]
    tmp_data = pd.DataFrame(np.random.standard_normal(
        (len(tmp_time), len(tmp_ids))), index=tmp_time,
        columns=tmp_ids)
    # sort_index(level=1) replaces sortlevel(1), which newer pandas no longer provides
    tmp_file = tmp_data.stack().sort_index(level=1).reset_index()
    # final simulated data structure of the parsed csv file
    tmp_file = tmp_file.rename(columns={'level_0': 'Date', 'level_1':
                                        'ID', 0: 'Value'})
    # comment/uncomment if pivot takes place on aggregate level or not
    tmp_file = tmp_file.pivot(index='Date', columns='ID',
                              values='Value')
    data.append(tmp_file)
data = pd.concat(data)
# comment/uncomment if pivot takes place on aggregate level or not
# data = data.pivot(index='Date', columns='ID', values='Value')

Using your reproducible example code, I can indeed confirm that the concat of only two dataframes takes a very long time. However, if you first align them (make the column names equal), then concatting is very fast:
In [94]: df1, df2 = data[0], data[1]
In [95]: %timeit pd.concat([df1, df2])
1 loops, best of 3: 18min 8s per loop
In [99]: %%timeit
....: df1b, df2b = df1.align(df2, axis=1)
....: pd.concat([df1b, df2b])
....:
1 loops, best of 3: 686 ms per loop
The result of both approaches is the same.
The aligning is equivalent to:
common_columns = df1.columns.union(df2.columns)
df1b = df1.reindex(columns=common_columns)
df2b = df2.reindex(columns=common_columns)
So this is probably the easier approach when dealing with a full list of dataframes.
The reason pd.concat is slower is that it does more work. For example, when the column names are not equal, it checks for every column whether the dtype has to be upcast to hold the NaN values (which get introduced by aligning the column names). By aligning yourself, you skip this check. In this case, where you are sure that all columns have the same dtype, skipping it is no problem.
That it is so much slower surprises me as well, but I will raise an issue about that.
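Extending the reindex idea to a whole list of dataframes could look like the sketch below. The helper name concat_aligned and the two tiny frames are assumptions for illustration; whether the speed-up still applies depends on the pandas version, since the timings above come from an older release.

```python
from functools import reduce

import numpy as np
import pandas as pd


def concat_aligned(frames):
    """Reindex every frame to the union of all column labels, then stack the rows."""
    all_columns = reduce(lambda acc, cols: acc.union(cols),
                         (frame.columns for frame in frames))
    # Columns missing from a frame are added and filled with NaN
    aligned = [frame.reindex(columns=all_columns) for frame in frames]
    return pd.concat(aligned)


# Hypothetical pivoted files: distinct dates, partly overlapping ids
df1 = pd.DataFrame(np.ones((2, 2)),
                   index=pd.to_datetime(["2014-01-01", "2014-01-02"]),
                   columns=["A", "B"])
df2 = pd.DataFrame(np.full((2, 2), 2.0),
                   index=pd.to_datetime(["2014-01-03", "2014-01-06"]),
                   columns=["B", "C"])

combined = concat_aligned([df1, df2])
print(combined)
```

After the reindex every frame carries the same column labels, so concat only has to stack rows.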

Summary, three key performance drivers depending on the set-up:
1) Make sure the data types are the same when concatenating two dataframes
2) Use integer-based column names if possible
3) When using string-based column names, call the align method before concat, as suggested by joris

As @joris mentioned, you should append all of the pivot tables to a list and then concatenate them all in one go. Here is a proposed modification to your code:
dfs = []
for file in raw_files:
    data_tmp = pd.read_csv(path + file, engine='c',
                           compression='gzip',
                           low_memory=False,
                           usecols=['date', 'Value', 'ID'])
    data_tmp = data_tmp.pivot(index='date', columns='ID',
                              values='Value')
    dfs.append(data_tmp)
    del data_tmp
data = pd.concat(dfs)

work with chunked data while groupby operations are needed

I have a dataset df with three columns: 'String_key_val', 'Float_other_val1', 'Int_other_val2'. I want to groupby on key_val, then extract the sum of val1 (resp. val2) with respect to these groups. Here is my code:
df = pandas.read_csv('test.csv')
grouped = df.groupby('String_key_val')
series_calculus1 = grouped['Float_other_val1'].sum()
series_calculus2 = grouped['Int_other_val2'].sum()
res = pandas.concat([series_calculus1, series_calculus2], axis=1)
res.to_csv('output_test.csv')
My problem is: my input dataset is 10 GB and I have 4 GB of RAM, so I need to chunk the calculation, but I can't see how. I thought of using HDFStore, but since I only have to build a numerical dataset, I see no point in storing a complete DataFrame, and I don't think HDFStore can store simple arrays.
What can I do?

I believe a simple approach would be something along these lines....
import pandas as pd
partial_sums = []
chunker = pd.read_csv('test.csv',iterator=True,chunksize=50000)
for chunk in chunker:
    group = chunk.groupby('String_key_val')
    out = group[['Float_other_val1','Int_other_val2']].sum()
    # DataFrame.append was removed in pandas 2.0, so collect the per-chunk sums in a list
    partial_sums.append(out)
# The same key can occur in several chunks, so group the partial sums once more
summary = pd.concat(partial_sums).reset_index()
group = summary.groupby('String_key_val')
summary = group[['Float_other_val1','Int_other_val2']].sum()
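As a variation on the chunked approach above, the per-chunk sums can be folded into a running total so that only one small frame is ever held in memory. This is a sketch under assumptions, not the answerer's code: the CSV content is made up and read from a string, and the chunk size of 2 is chosen only to force several chunks.

```python
import io

import pandas as pd

# Hypothetical miniature version of test.csv
csv_text = """String_key_val,Float_other_val1,Int_other_val2
a,1.5,1
b,2.0,2
a,0.5,3
c,4.0,4
b,1.0,5
"""

value_cols = ["Float_other_val1", "Int_other_val2"]
total = None
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    out = chunk.groupby("String_key_val")[value_cols].sum()
    # fill_value=0 treats a key missing from one side as contributing zero
    total = out if total is None else total.add(out, fill_value=0)

# Aligning on keys can turn the integer column into floats, so cast it back
total = total.astype({"Int_other_val2": "int64"})
print(total)
```

Summation is associative, so adding the per-chunk sums gives the same result as summing the whole file at once; the same trick does not carry over directly to statistics such as the median.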
