Preserving dask dataframe divisions when loading multiple parquet files

Preserving dask dataframe divisions when loading multiple parquet files - python

I have some time series data in data frames with time as index. The index is sorted and the data is stored in multiple parquet files with one day of data in each file. I use dask 2.9.1
When I load data from one parquet file the division are set correctly.
When I load data from multiple files I do not get the devisions in the resulting dask dataframe.
The example below illustrates the problem:
import pandas as pd
import pandas.util.testing as tm
import dask.dataframe as dd
df = tm.makeTimeDataFrame( 48, "H")
df1 = df[:24].sort_index()
df2 = df[24:].sort_index()
dd.from_pandas( df1, npartitions=1 ).to_parquet( "df1d.parq", engine="fastparquet" )
dd.from_pandas( df2, npartitions=1 ).to_parquet( "df2d.parq", engine="fastparquet" )
ddf = dd.read_parquet( "df*d.parq", infer_divisions=True, sorted_index=True, engine="fastparquet" )
print(ddf.npartitions, ddf.divisions)
Here I get 2 partitions and (None, None, None) as divisions
Can I get dd.read_parquet to set the partitions to actual values?
Update
In my actual data I have one parquet file pr day.
The files are created by saving the data from a dataframe where a timestamp is used as index. The index is sorted. The size of each file is 100-150MB and when loaded into memory it uses app 2.5GB of RAM, getting the index activated is important as recreating the index is really heavy.
I did not manage to find a combination of parameters or engine on read_parquet that make it create division on load.
The data files are named "yyyy-mm-dd.parquet", so I tied to create divisions from that info:
from pathlib import Path
files = list (Path("e:/data").glob("2019-06-*.parquet") )
divisions = [ pd.Timestamp( f.stem) for f in files ] + [ pd.Timestamp( files[-1].stem) + pd.Timedelta(1, unit='D' ) ]
ddf = dd.read_parquet( files )
ddf.divisions = divisions
This did not enable use of index and in some cases it failed with "TypeError: can only concatenate tuple (not "list") to tuple"
Then I tried to set divisions as a tuple
ddf.divisions = tuple(divisions) and then it worked. When the index setup is correct dask is impressively fast
Update 2
A better way is to read the dask dataframes individually and then concatenate them:
from pathlib import Path
import dask.dataframe as dd
files = list (Path("e:/data").glob("2019-06-*.parquet") )
ddfs = [ dd.read_parquet( f ) for f in files ]
ddf = dd.concat(ddfs, axis=0)
In this way the divisions are set and it also solves another problem of handling additions of columns over time.

Below I have rewritten the original question to use concat, which solved my problem
import pandas as pd
import pandas.util.testing as tm
import dask.dataframe as dd
# create two example parquet files
df = tm.makeTimeDataFrame( 48, "H")
df1 = df[:24].sort_index()
df2 = df[24:].sort_index()
dd.from_pandas( df1, npartitions=1 ).to_parquet( "df1d.parq" )
dd.from_pandas( df2, npartitions=1 ).to_parquet( "df2d.parq" )
# read the files and concatenate
ddf = dd.concat([dd.read_parquet( d ) for d in ["df1d.parq", "df2d.parq"] ], axis=0)
print(ddf.npartitions, ddf.divisions)
I still get the expected 2 partitions, but now the divisions are (Timestamp('2000-01-01 00:00:00'), Timestamp('2000-01-02 00:00:00'), Timestamp('2000-01-02 23:00:00'))

Related

Combining Successive Pandas Dataframes in One Master Dataframe via a Loop

I'm trying to loop through a series of tickers cleaning the associated dataframes then combining the individual ticker dataframes into one large dataframe with columns named for each ticker. The following code enables me to loop through unique tickers and name the columns of each ticker's dataframe after the specific ticker:
import pandas as pd
def clean_func(tkr,f1):
f1['Date'] = pd.to_datetime(f1['Date'])
f1.index = f1['Date']
keep = ['Col1','Col2']
f2 = f1[keep]
f2.columns = [tkr+'Col1',tkr+'Col2']
return f2
tkrs = ['tkr1','tkr2','tkr3']
for tkr in tkrs:
df1 = pd.read_csv(f'C:\\path\\{tkr}.csv')
df2 = clean_func(tkr,df1)
However, I don't know how to create a master dataframe where I add each new ticker to the master dataframe. With that in mind, I'd like to align each new ticker's data using the datetime index. So, if tkr1 has data for 6/25/22, 6/26/22, 6/27/22, and tkr2 has data for 6/26/22, and 6/27/22, the combined dataframe would show all three dates but would produce a NaN for ticker 2 on 6/25/22 since there is no data for that ticker on that date.
When not in a loop looking to append each successive ticker to a larger dataframe (as per above), the following code does what I'd like. But it doesn't work when looping and adding new ticker data for each successive loop (or I don't know how to make it work in the confines of a loop).
combined = pd.concat((df1, df2, df3,...,dfn), axis=1)
Many thanks in advance.

You should only create the master DataFrame after the loop. Appending to the master DataFrame in each iteration via pandas.concat is slow since you are creating a new DataFrame every time.
Instead, read each ticker DataFrame, clean it, and append it to a list which store every ticker DataFrames. After the loop create the master DataFrame with all the Dataframes using pandas.concat:
import pandas as pd
def clean_func(tkr,f1):
f1['Date'] = pd.to_datetime(f1['Date'])
f1.index = f1['Date']
keep = ['Col1','Col2']
f2 = f1[keep]
f2.columns = [tkr+'Col1',tkr+'Col2']
return f2
tkrs = ['tkr1','tkr2','tkr3']
dfs_list = []
for tkr in tkrs:
df1 = pd.read_csv(f'C:\\path\\{tkr}.csv')
df2 = clean_func(tkr,df1)
dfs_list.append(df2)
master_df = pd.concat(dfs_list, axis=1)
As a suggestion here is a cleaner way of defining your clean_func using DataFrame.set_index and DataFrame.add_prefix.
def clean_func(tkr, f1):
f1['Date'] = pd.to_datetime(f1['Date'])
f2 = f1.set_index('Date')[['Col1','Col2']].add_prefix(tkr)
return f2
Or if you want, you can parse the Date column as datetime and set it as index directly in the pd.read_csv call by specifying index_col and parse_dates parameters (honestly, I'm not sure if those two parameters will play well together, and I'm too lazy to test it, but you can try ;)).
import pandas as pd
def clean_func(tkr,f1):
f2 = f1[['Col1','Col2']].add_prefix(tkr)
return f2
tkrs = ['tkr1','tkr2','tkr3']
dfs_list = []
for tkr in tkrs:
df1 = pd.read_csv(f'C:\\path\\{tkr}.csv', index_col='Date', parse_dates=['Date'])
df2 = clean_func(tkr,df1)
dfs_list.append(df2)
master_df = pd.concat(dfs_list, axis=1)

Before the loop create an empty df with:
combined = pd.DataFrame()
Then within the loop (after loading df1 - see code above):
combined = pd.concat((combined, clean_func(tkr, df1)), axis=1)
If you get:
TypeError: concat() got multiple values for argument 'axis'
Make sure your parentheses are correct per above.
With the code above, you can skip the original step:
df2 = clean_func(tkr,df1)
Since it is embedded in the concat function. Alternatively, you could keep the df2 step and use:
combined = pd.concat((combined,df2), axis=1)
Just make sure the dataframes are encapsulated by parentheses within the concat function.

Same answer as GC123 but here is a full example which mimics reading from separate files and concatenating them
import pandas as pd
import io
fake_file_1 = io.StringIO("""
fruit,store,quantity,unit_price
apple,fancy-grocers,2,9.25
pear,fancy-grocers,3,100
banana,fancy-grocers,1,256
""")
fake_file_2 = io.StringIO("""
fruit,store,quantity,unit_price
banana,bargain-grocers,667,0.01
apple,bargain-grocers,170,0.15
pear,bargain-grocers,281,0.45
""")
fake_files = [fake_file_1,fake_file_2]
combined = pd.DataFrame()
for fake_file in fake_files:
df = pd.read_csv(fake_file)
df = df.set_index('fruit')
combined = pd.concat((combined, df), axis=1)
print(combined)
Output

This method is slightly more efficient:
combined = []
for fake_file in fake_files:
combined.append(pd.read_csv(fake_file).set_index('fruit'))
combined = pd.concat(combined, axis=1)
print(combined)
Output:
store quantity unit_price store quantity unit_price
fruit
apple fancy-grocers 2 9.25 bargain-grocers 170 0.15
pear fancy-grocers 3 100.00 bargain-grocers 281 0.45
banana fancy-grocers 1 256.00 bargain-grocers 667 0.01

Reading partitioned data (parquets) using dask with 'int64' vs 'int64 not null'

I have this annoying situation where some of my parquet files have:
x: int64
and others have
x: int64 not null
and ergo (in dask 2.8.0/numpy 1.15.1/pandas 0.25.3) I can't run the following:
test: Union[pd.Series, pd.DataFrame, np.ndarray] = dd.read_parquet(input_path).query(filter_string)[input_columns].compute()
Anyone know what I can do short of upgrading dask/numpy (as I know the latest dask/numpy seem to work)?
Thanks in advance!

If you know which files contain the different dtypes, then it's best to re-process them (load/convert dtype/save).
If that's not an option, then you can create a dask dataframe from delayed objects with something like this:
import pandas as pd
from dask import delayed
import dask.dataframe as dd
#delayed
def custom_load(fpath):
df = pd.read_parquet(fpath)
df = df.astype({'x': 'Int64'}) # the appropriate dtype
return df
delayed = [custom_load(f) for f in files] # where files is the list of files
ddf = dd.from_delayed(delayed) # can also provide meta option if known

Save and load correctly pandas dataframe in csv while preserving freq of datetimeindex

I was trying to save a DataFrame and load it. If I print the resulting df, I see they are (almost) identical. The freq attribute of the datetimeindex is not preserved though.
My code looks like this
import datetime
import os
import numpy as np
import pandas as pd
def test_load_pandas_dataframe():
idx = pd.date_range(start=datetime.datetime.now(),
end=(datetime.datetime.now()
+ datetime.timedelta(hours=3)),
freq='10min')
a = pd.DataFrame(np.arange(2*len(idx)).reshape((len(idx), 2)), index=idx,
columns=['first', 2])
a.to_csv('test_df')
b = load_pandas_dataframe('test_df')
os.remove('test_df')
assert np.all(b == a)
def load_pandas_dataframe(filename):
'''Correcty loads dataframe but freq is not maintained'''
df = pd.read_csv(filename, index_col=0,
parse_dates=True)
return df
if __name__ == '__main__':
test_load_pandas_dataframe()
And I get the following error:
ValueError: Can only compare identically-labeled DataFrame objects
It is not a big issue for my program, but it is still annoying.
Thanks!

The issue here is that the dataframe you save has columns
Index(['first', 2], dtype='object')
but the dataframe you load has columns
Index(['first', '2'], dtype='object').
In other words, the columns of your original dataframe had the integer 2, but upon saving it with to_csv and loading it back with read_csv, it is parsed as the string '2'.
The easiest fix that passes your assertion is to change line 13 to:
columns=['first', '2'])

To complemente #jfaccioni answer, freq attribute is not preserved, there are two options here
Fast a simple, use pickle which will preserver everything:
a.to_pickle('test_df')
b = pd.read_pickle('test_df')
a.equals(b) # True
Or you can use the inferred_freq attribute from a DatetimeIndex:
a.to_csv('test_df')
b.read_csv('test_df')
b.index.freq = b.index.inferred_freq
print(b.index.freq) #<10 * Minutes>

Pandas append perfomance concat/append using "larger" DataFrames

The problem: I have data stored in csv file with the following columns data/id/value. I have 15 files each containing around 10-20mio rows. Each csv file covers a distinct period so the time indexes are non overlapping, but the columns are (new ids enter from time to time, old ones disappear). What I originally did was running the script without the pivot call, but then I run into memory issues on my local machine (only 8GB). Since there is lots of redundancy in each file, pivot seemd at first a nice way out (roughly 2/3 less data) but now perfomance kicks in. If I run the following script the concat function will run "forever" (I always interrupted manually so far after some time (2h>)). Concat/append seem to have limitations in terms of size (I have roughly 10000-20000 columns), or do I miss something here? Any suggestions?
import pandas as pd
path = 'D:\\'
data = pd.DataFrame()
#loop through list of raw file names
for file in raw_files:
data_tmp = pd.read_csv(path + file, engine='c',
compression='gzip',
low_memory=False,
usecols=['date', 'Value', 'ID'])
data_tmp = data_tmp.pivot(index='date', columns='ID',
values='Value')
data = pd.concat([data,data_tmp])
del data_tmp
EDIT I:To clarify, each csv file has about 10-20mio rows and three columns, after pivot is applied this reduces to about 2000 rows but leads to 10000 columns.
I can solve the memory issue by simply splitting the full-set of ids into subsets and run the needed calculations based on each subset as they are independent for each id. I know it makes me reload the same files n-times, where n is the number of subsets used, but this is still reasonable fast. I still wonder why append is not performing.
EDIT II: I have tried to recreate the file structure with a simulation, which is as close as possible to the actual data structure. I hope it is clear, I didn't spend to much time minimizing simulation-time, but it runs reasonable fast on my machine.
import string
import random
import pandas as pd
import numpy as np
import math
# Settings :-------------------------------
num_ids = 20000
start_ids = 4000
num_files = 10
id_interval = int((num_ids-start_ids)/num_files)
len_ids = 9
start_date = '1960-01-01'
end_date = '2014-12-31'
run_to_file = 2
# ------------------------------------------
# Simulation column IDs
id_list = []
# ensure unique elements are of size >num_ids
for x in range(num_ids + round(num_ids*0.1)):
id_list.append(''.join(
random.choice(string.ascii_uppercase + string.digits) for _
in range(len_ids)))
id_list = set(id_list)
id_list = list(id_list)[:num_ids]
time_index = pd.bdate_range(start_date,end_date,freq='D')
chunk_size = math.ceil(len(time_index)/num_files)
data = []
# Simulate files
for file in range(0, run_to_file):
tmp_time = time_index[file * chunk_size:(file + 1) * chunk_size]
# TODO not all cases cover, make sure ints are obtained
tmp_ids = id_list[file * id_interval:
start_ids + (file + 1) * id_interval]
tmp_data = pd.DataFrame(np.random.standard_normal(
(len(tmp_time), len(tmp_ids))), index=tmp_time,
columns=tmp_ids)
tmp_file = tmp_data.stack().sortlevel(1).reset_index()
# final simulated data structure of the parsed csv file
tmp_file = tmp_file.rename(columns={'level_0': 'Date', 'level_1':
'ID', 0: 'Value'})
# comment/uncomment if pivot takes place on aggregate level or not
tmp_file = tmp_file.pivot(index='Date', columns='ID',
values='Value')
data.append(tmp_file)
data = pd.concat(data)
# comment/uncomment if pivot takes place on aggregate level or not
# data = data.pivot(index='Date', columns='ID', values='Value')

Using your reproducible example code, I can indeed confirm that the concat of only two dataframes takes a very long time. However, if you first align them (make the column names equal), then concatting is very fast:
In [94]: df1, df2 = data[0], data[1]
In [95]: %timeit pd.concat([df1, df2])
1 loops, best of 3: 18min 8s per loop
In [99]: %%timeit
....: df1b, df2b = df1.align(df2, axis=1)
....: pd.concat([df1b, df2b])
....:
1 loops, best of 3: 686 ms per loop
The result of both approaches is the same.
The aligning is equivalent to:
common_columns = df1.columns.union(df2.columns)
df1b = df1.reindex(columns=common_columns)
df2b = df2.reindex(columns=common_columns)
So this is probably the easier way to use when having to deal with a full list of dataframes.
The reason that pd.concat is slower is because it does more. E.g. when the column names are not equal, it checks for every column if the dtype has to be upcasted or not to hold the NaN values (which get introduced by aligning the column names). By aligning yourself, you skip this. But in this case, where you are sure to have all the same dtype, this is no problem.
That it is so much slower surprises me as well, but I will raise an issue about that.

Summary, three key performance drivers depending on the set-up:
1) Make sure datatype are the same when concatenating two dataframes
2) Use integer based column names if possible
3) When using string based columns, make sure to use the align method before concat is called as suggested by joris

As #joris mentioned, you should append all of the pivot tables to a list and then concatenate them all in one go. Here is a proposed modification to your code:
dfs = []
for file in raw_files:
data_tmp = pd.read_csv(path + file, engine='c',
compression='gzip',
low_memory=False,
usecols=['date', 'Value', 'ID'])
data_tmp = data_tmp.pivot(index='date', columns='ID',
values='Value')
dfs.append(data_tmp)
del data_tmp
data = pd.concat(dfs)

How does one append large amounts of data to a Pandas HDFStore and get a natural unique index?

I'm importing large amounts of http logs (80GB+) into a Pandas HDFStore for statistical processing. Even within a single import file I need to batch the content as I load it. My tactic thus far has been to read the parsed lines into a DataFrame then store the DataFrame into the HDFStore. My goal is to have the index key unique for a single key in the DataStore but each DataFrame restarts it's own index value again. I was anticipating HDFStore.append() would have some mechanism to tell it to ignore the DataFrame index values and just keep adding to my HDFStore key's existing index values but cannot seem to find it. How do I import DataFrames and ignore the index values contained therein while having the HDFStore increment its existing index values? Sample code below batches every 10 lines. Naturally the real thing would be larger.
if hd_file_name:
"""
HDF5 output file specified.
"""
hdf_output = pd.HDFStore(hd_file_name, complib='blosc')
print hdf_output
columns = ['source', 'ip', 'unknown', 'user', 'timestamp', 'http_verb', 'path', 'protocol', 'http_result',
'response_size', 'referrer', 'user_agent', 'response_time']
source_name = str(log_file.name.rsplit('/')[-1]) # HDF5 Tables don't play nice with unicode so explicit str(). :(
batch = []
for count, line in enumerate(log_file,1):
data = parse_line(line, rejected_output = reject_output)
# Add our source file name to the beginning.
data.insert(0, source_name )
batch.append(data)
if not (count % 10):
df = pd.DataFrame( batch, columns = columns )
hdf_output.append(KEY_NAME, df)
batch = []
if (count % 10):
df = pd.DataFrame( batch, columns = columns )
hdf_output.append(KEY_NAME, df)

You can do it like this. Only trick is that the first time the store table doesn't exist, so get_storer will raise.
import pandas as pd
import numpy as np
import os
files = ['test1.csv','test2.csv']
for f in files:
pd.DataFrame(np.random.randn(10,2),columns=list('AB')).to_csv(f)
path = 'test.h5'
if os.path.exists(path):
os.remove(path)
with pd.get_store(path) as store:
for f in files:
df = pd.read_csv(f,index_col=0)
try:
nrows = store.get_storer('foo').nrows
except:
nrows = 0
df.index = pd.Series(df.index) + nrows
store.append('foo',df)
In [10]: pd.read_hdf('test.h5','foo')
Out[10]:
A B
0 0.772017 0.153381
1 0.304131 0.368573
2 0.995465 0.799655
3 -0.326959 0.923280
4 -0.808376 0.449645
5 -1.336166 0.236968
6 -0.593523 -0.359080
7 -0.098482 0.037183
8 0.315627 -1.027162
9 -1.084545 -1.922288
10 0.412407 -0.270916
11 1.835381 -0.737411
12 -0.607571 0.507790
13 0.043509 -0.294086
14 -0.465210 0.880798
15 1.181344 0.354411
16 0.501892 -0.358361
17 0.633256 0.419397
18 0.932354 -0.603932
19 -0.341135 2.453220
You actually don't necessarily need a global unique index, (unless you want one) as HDFStore (through PyTables) provides one by uniquely numbering rows. You can always add these selection parameters.
In [11]: pd.read_hdf('test.h5','foo',start=12,stop=15)
Out[11]:
A B
12 -0.607571 0.507790
13 0.043509 -0.294086
14 -0.465210 0.880798

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Preserving dask dataframe divisions when loading multiple parquet files - python

Related

Combining Successive Pandas Dataframes in One Master Dataframe via a Loop

Reading partitioned data (parquets) using dask with 'int64' vs 'int64 not null'

Save and load correctly pandas dataframe in csv while preserving freq of datetimeindex

Pandas append perfomance concat/append using "larger" DataFrames

How does one append large amounts of data to a Pandas HDFStore and get a natural unique index?

Categories

Resources