Reading partitioned data (parquets) using dask with 'int64' vs 'int64 not null' - python

I have this annoying situation where some of my parquet files have:
x: int64
and others have
x: int64 not null
and as a result (in dask 2.8.0 / numpy 1.15.1 / pandas 0.25.3) I can't run the following:
test: Union[pd.Series, pd.DataFrame, np.ndarray] = dd.read_parquet(input_path).query(filter_string)[input_columns].compute()
Anyone know what I can do short of upgrading dask/numpy (as I know the latest dask/numpy seem to work)?
Thanks in advance!

If you know which files contain the different dtypes, then it's best to re-process them (load/convert dtype/save).
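A minimal sketch of that re-processing step (files_with_wrong_dtype is a hypothetical list of the affected paths, and x is the column from the question):
import pandas as pd

for fpath in files_with_wrong_dtype:
    df = pd.read_parquet(fpath)
    df = df.astype({'x': 'int64'})  # or 'Int64' if the column may contain nulls
    df.to_parquet(fpath)            # overwrite with the normalized dtype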
If that's not an option, then you can create a dask dataframe from delayed objects with something like this:
import pandas as pd
from dask import delayed
import dask.dataframe as dd

@delayed
def custom_load(fpath):
    df = pd.read_parquet(fpath)
    df = df.astype({'x': 'Int64'})  # the appropriate dtype
    return df

delayed_dfs = [custom_load(f) for f in files]  # where files is the list of files
ddf = dd.from_delayed(delayed_dfs)  # can also provide the meta option if known
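If the schema is known up front, a small sketch of passing it through the meta option mentioned in the comment (only the column x from the question is shown; add the rest of your columns):
meta = pd.DataFrame({'x': pd.Series(dtype='Int64')})  # extend with your remaining columns
ddf = dd.from_delayed(delayed_dfs, meta=meta)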

Related

Save and load correctly pandas dataframe in csv while preserving freq of datetimeindex

I was trying to save a DataFrame and load it back. If I print the original and the loaded df, they look (almost) identical, but the freq attribute of the DatetimeIndex is not preserved.
My code looks like this:
import datetime
import os
import numpy as np
import pandas as pd


def test_load_pandas_dataframe():
    idx = pd.date_range(start=datetime.datetime.now(),
                        end=(datetime.datetime.now()
                             + datetime.timedelta(hours=3)),
                        freq='10min')
    a = pd.DataFrame(np.arange(2*len(idx)).reshape((len(idx), 2)), index=idx,
                     columns=['first', 2])
    a.to_csv('test_df')
    b = load_pandas_dataframe('test_df')
    os.remove('test_df')
    assert np.all(b == a)


def load_pandas_dataframe(filename):
    '''Correctly loads dataframe but freq is not maintained'''
    df = pd.read_csv(filename, index_col=0,
                     parse_dates=True)
    return df


if __name__ == '__main__':
    test_load_pandas_dataframe()
And I get the following error:
ValueError: Can only compare identically-labeled DataFrame objects
It is not a big issue for my program, but it is still annoying.
Thanks!
The issue here is that the dataframe you save has columns
Index(['first', 2], dtype='object')
but the dataframe you load has columns
Index(['first', '2'], dtype='object').
In other words, the columns of your original dataframe had the integer 2, but upon saving it with to_csv and loading it back with read_csv, it is parsed as the string '2'.
The easiest fix that passes your assertion is to change the columns argument in the DataFrame construction to:
columns=['first', '2'])
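If you would rather keep the integer column label in a, another option (just a sketch, not part of the original answer) is to copy the original labels onto the loaded frame before comparing:
b = load_pandas_dataframe('test_df')
b.columns = a.columns  # restore the original labels, including the integer 2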
To complement @jfaccioni's answer: since the freq attribute is not preserved, there are two options here.
Fast and simple: use pickle, which will preserve everything:
a.to_pickle('test_df')
b = pd.read_pickle('test_df')
a.equals(b)  # True
Or you can use the inferred_freq attribute of a DatetimeIndex:
a.to_csv('test_df')
b = pd.read_csv('test_df', index_col=0, parse_dates=True)
b.index.freq = b.index.inferred_freq
print(b.index.freq)  # <10 * Minutes>

Preserving dask dataframe divisions when loading multiple parquet files

I have some time series data in dataframes with time as the index. The index is sorted and the data is stored in multiple parquet files, with one day of data in each file. I use dask 2.9.1.
When I load data from one parquet file, the divisions are set correctly.
When I load data from multiple files, I do not get the divisions in the resulting dask dataframe.
The example below illustrates the problem:
import pandas as pd
import pandas.util.testing as tm
import dask.dataframe as dd
df = tm.makeTimeDataFrame(48, "H")
df1 = df[:24].sort_index()
df2 = df[24:].sort_index()
dd.from_pandas(df1, npartitions=1).to_parquet("df1d.parq", engine="fastparquet")
dd.from_pandas(df2, npartitions=1).to_parquet("df2d.parq", engine="fastparquet")
ddf = dd.read_parquet("df*d.parq", infer_divisions=True, sorted_index=True, engine="fastparquet")
print(ddf.npartitions, ddf.divisions)
Here I get 2 partitions and (None, None, None) as divisions
Can I get dd.read_parquet to set the partitions to actual values?
Update
In my actual data I have one parquet file per day.
The files are created by saving data from a dataframe where a timestamp is used as the index. The index is sorted. Each file is 100-150 MB and uses approximately 2.5 GB of RAM when loaded into memory; getting the index set up on load is important, because recreating it afterwards is really expensive.
I did not manage to find a combination of parameters or engines on read_parquet that makes it create divisions on load.
The data files are named "yyyy-mm-dd.parquet", so I tried to create divisions from that information:
from pathlib import Path
files = list(Path("e:/data").glob("2019-06-*.parquet"))
divisions = [pd.Timestamp(f.stem) for f in files] + [pd.Timestamp(files[-1].stem) + pd.Timedelta(1, unit='D')]
ddf = dd.read_parquet(files)
ddf.divisions = divisions
This did not enable use of the index, and in some cases it failed with TypeError: can only concatenate tuple (not "list") to tuple.
Then I tried to set the divisions as a tuple instead:
ddf.divisions = tuple(divisions)
That worked. When the index setup is correct, dask is impressively fast.
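For reference, a consolidated sketch of the working variant (path and file naming taken from the update above):
from pathlib import Path
import pandas as pd
import dask.dataframe as dd

files = sorted(Path("e:/data").glob("2019-06-*.parquet"))  # sorted so the divisions are monotonic
divisions = [pd.Timestamp(f.stem) for f in files] + [pd.Timestamp(files[-1].stem) + pd.Timedelta(1, unit='D')]
ddf = dd.read_parquet(files)
ddf.divisions = tuple(divisions)  # must be a tuple, not a list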
Update 2
A better way is to read the dask dataframes individually and then concatenate them:
from pathlib import Path
import dask.dataframe as dd
files = list(Path("e:/data").glob("2019-06-*.parquet"))
ddfs = [dd.read_parquet(f) for f in files]
ddf = dd.concat(ddfs, axis=0)
In this way the divisions are set, and it also solves another problem: handling columns that are added over time.
Below I have rewritten the example from the original question to use concat, which solved my problem:
import pandas as pd
import pandas.util.testing as tm
import dask.dataframe as dd

# create two example parquet files
df = tm.makeTimeDataFrame(48, "H")
df1 = df[:24].sort_index()
df2 = df[24:].sort_index()
dd.from_pandas(df1, npartitions=1).to_parquet("df1d.parq")
dd.from_pandas(df2, npartitions=1).to_parquet("df2d.parq")

# read the files and concatenate
ddf = dd.concat([dd.read_parquet(d) for d in ["df1d.parq", "df2d.parq"]], axis=0)
print(ddf.npartitions, ddf.divisions)
I still get the expected 2 partitions, but now the divisions are (Timestamp('2000-01-01 00:00:00'), Timestamp('2000-01-02 00:00:00'), Timestamp('2000-01-02 23:00:00'))

Python PANDAS: Converting from pandas/numpy to dask dataframe/array

I am working to try to convert a program to be parallelizable/multithreaded with the excellent dask library. Here is the program I am working on converting:
Python PANDAS: Stack by Enumerated Date to Create Records Vectorized
import pandas as pd
import numpy as np
import dask.dataframe as dd
import dask.array as da
from io import StringIO
test_data = '''id,transaction_dt,units,measures
1,2018-01-01,4,30.5
1,2018-01-03,4,26.3
2,2018-01-01,3,12.7
2,2018-01-03,3,8.8'''
df_test = pd.read_csv(StringIO(test_data), sep=',')
df_test['transaction_dt'] = pd.to_datetime(df_test['transaction_dt'])
df_test = df_test.loc[np.repeat(df_test.index, df_test['units'])]
df_test['transaction_dt'] += pd.to_timedelta(df_test.groupby(level=0).cumcount(), unit='d')
df_test = df_test.reset_index(drop=True)
expected results:
id,transaction_dt,measures
1,2018-01-01,30.5
1,2018-01-02,30.5
1,2018-01-03,30.5
1,2018-01-04,30.5
1,2018-01-03,26.3
1,2018-01-04,26.3
1,2018-01-05,26.3
1,2018-01-06,26.3
2,2018-01-01,12.7
2,2018-01-02,12.7
2,2018-01-03,12.7
2,2018-01-03,8.8
2,2018-01-04,8.8
2,2018-01-05,8.8
It occurred to me that this might be a good candidate to try to parallelize because the separate dask partitions should not need to know anything about each other to accomplish the required operations. Here is a naive representation of how I thought it might work:
dd_test = dd.from_pandas(df_test, npartitions=3)
dd_test = dd_test.loc[da.repeat(dd_test.index, dd_test['units'])]
dd_test['transaction_dt'] += dd_test.to_timedelta(dd.groupby(level=0).cumcount(), unit='d')
dd_test = dd_test.reset_index(drop=True)
So far I have been trying to work through the following errors or idiomatic differences:
"NotImplementedError: Only integer valued repeats supported."
I have tried to convert the index into a int column/array to try as well but still run into the issue.
2. dask does not support the mutating operator: "+="
3. No dask .to_timedelta() argument
4. No dask .cumcount() (but I think .cumsum() is interchangable?!)
If there are any dask experts out there who might be able let me know if there are fundamental impediments to preclude me from trying this or any tips on implementation, that would be a great help!
Edit:
I think I have made a bit of progress on this since posting the question:
dd_test = dd.from_pandas(df_test, npartitions=3)
dd_test['helper'] = 1
dd_test = dd_test.loc[da.repeat(dd_test.index, dd_test['units'])]
dd_test['transaction_dt'] = dd_test['transaction_dt'] + (dd_test.groupby('id')['helper'].cumsum()).astype('timedelta64[D]')
dd_test = dd_test.reset_index(drop=True)
However, I am still stuck on the dask array repeats error. Any tips still welcome.
Not sure if this is exactly what you are looking for, but I replaced da.repeat with np.repeat, explicitly cast dd_test.index and dd_test['units'] to numpy arrays, and finally added dd_test['transaction_dt'].astype('M8[us]') to your timedelta calculation.
df_test = pd.read_csv(StringIO(test_data), sep=',')
dd_test = dd.from_pandas(df_test, npartitions=3)
dd_test['helper'] = 1
dd_test = dd_test.loc[np.repeat(np.array(dd_test.index),
                                np.array(dd_test['units']))]
dd_test['transaction_dt'] = dd_test['transaction_dt'].astype('M8[us]') + (dd_test.groupby('id')['helper'].cumsum()).astype('timedelta64[D]')
dd_test = dd_test.reset_index(drop=True)
df_expected = dd_test.compute()
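Note that the final .compute() materializes the dask result as a regular pandas DataFrame, so df_expected can be checked directly against the expected results listed above.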

Pythonic type hints with pandas?

Let's take a simple function that takes a str and returns a dataframe:
import pandas as pd
def csv_to_df(path):
    return pd.read_csv(path, skiprows=1, sep='\t', comment='#')
What is the recommended pythonic way of adding type hints to this function?
If I ask python for the type of a DataFrame it returns pandas.core.frame.DataFrame.
The following won't work though, as it'll tell me that pandas is not defined.
def csv_to_df(path: str) -> pandas.core.frame.DataFrame:
    return pd.read_csv(path, skiprows=1, sep='\t', comment='#')
Why not just use pd.DataFrame?
import pandas as pd
def csv_to_df(path: str) -> pd.DataFrame:
    return pd.read_csv(path, skiprows=1, sep='\t', comment='#')
Result is the same:
> help(csv_to_df)
Help on function csv_to_df in module __main__:
csv_to_df(path:str) -> pandas.core.frame.DataFrame
I'm currently doing the following:
from typing import TypeVar
PandasDataFrame = TypeVar('pandas.core.frame.DataFrame')
def csv_to_df(path: str) -> PandasDataFrame:
    return pd.read_csv(path, skiprows=1, sep='\t', comment='#')
Which gives:
> help(csv_to_df)
Help on function csv_to_df in module __main__:
csv_to_df(path:str) -> ~pandas.core.frame.DataFrame
Don't know how pythonic that is, but it's understandable enough as a type hint, I find.
Now there is a pip package that can help with this.
https://github.com/CedricFR/dataenforce
You can install it with pip install dataenforce and use very pythonic type hints like:
from dataenforce import Dataset

def preprocess(dataset: Dataset["id", "name", "location"]) -> Dataset["location", "count"]:
    pass
Check out the answer given here which explains the usage of the package data-science-types.
pip install data-science-types
Demo
# program.py
import pandas as pd
df: pd.DataFrame = pd.DataFrame({'col1': [1,2,3], 'col2': [4,5,6]}) # OK
df1: pd.DataFrame = pd.Series([1,2,3]) # error: Incompatible types in assignment
Run using mypy the same way:
$ mypy program.py
This is straying from the original question, but building off of @dangom's answer using TypeVar and @Georgy's comment that there is no way to specify datatypes for DataFrame columns in type hints, you could use a simple workaround like this to specify datatypes in a DataFrame:
from typing import TypeVar
DataFrameStr = TypeVar("pandas.core.frame.DataFrame(str)")
def csv_to_df(path: str) -> DataFrameStr:
    return pd.read_csv(path, skiprows=1, sep='\t', comment='#')
Take a look at pandera.
pandera provides a flexible and expressive API for performing data validation on dataframe-like objects to make data processing pipelines more readable and robust.
Dataframes contain information that pandera explicitly validates at runtime. This is useful in production-critical or reproducible research settings.
The advantage of pandera is that you can also specify dtypes for individual DataFrame columns. The following example uses pandera to enforce, at runtime, a DataFrame containing a single column of integers:
import pandas as pd
import pandera
from pandera.typing import DataFrame, Series
class Integers(pandera.SchemaModel):
    number: Series[int]

@pandera.check_types
def my_fn(a: DataFrame[Integers]) -> None:
    pass
# This works
df = pd.DataFrame({"number": [ 2002, 2003]})
my_fn(df)
# Raises an exception
df = pd.DataFrame({"number": [ 2002.0, 2003]})
my_fn(df)
# Raises an exception
df = pd.DataFrame({"number": [ '2002', 2003]})
my_fn(df)

PyTables ValueError on string column with newer pandas

Problem writing pandas dataframe (timeseries) to HDF5 using pytables/tstables:
import pandas
import tables
import tstables
# example dataframe
valfloat = [512.3, 918.8]
valstr = ['abc','cba']
tstamp = [1445464064, 1445464013]
df = pandas.DataFrame(data=list(zip(valfloat, valstr, tstamp)), columns=['colfloat', 'colstr', 'timestamp'])
df.set_index(pandas.to_datetime(df['timestamp'].astype(int), unit='s'), inplace=True)
df.index = df.index.tz_localize('UTC')
colsel = ['colfloat', 'colstr']
dftoadd = df[colsel].sort_index()
# try string conversion from object-type (no type mixing here ?)
##dftoadd.loc[:,'colstr'] = dftoadd['colstr'].map(str)
h5fname = 'df.h5'
# class to use as tstable description
class TsExample(tables.IsDescription):
    timestamp = tables.Int64Col(pos=0)
    colfloat = tables.Float64Col(pos=1)
    colstr = tables.StringCol(itemsize=8, pos=2)
# create new time series
h5f = tables.open_file(h5fname, 'a')
ts = h5f.create_ts('/','example',TsExample)
# append to HDF5
ts.append(dftoadd, convert_strings=True)
# save data and close file
h5f.flush()
h5f.close()
Exception:
ValueError: rows parameter cannot be converted into a recarray object
compliant with table tstables.tstable.TsTable instance at ...
The error was: cannot view Object as non-Object type
While this particular error happens with TsTables, the code chunk responsible for it is identical to PyTables try-section here.
The error is happening after I upgraded pandas to 0.17.0; the same code was running error-free with 0.16.2.
NOTE: if a string column is excluded then everything works fine, so this problem must be related to string-column type representation in the dataframe.
The issue could be related to this question. Is there some conversion required for 'colstr' column of the dataframe that I am missing?
This is not going to work with a newer pandas, as the index is timezone-aware; see here.
You can either:
1. convert to a type PyTables understands, which would require removing the timezone (localizing), or
2. use HDFStore to write the frame.
Note that what you are doing is the reason HDFStore exists in the first place: to make reading and writing of pandas objects PyTables-friendly. Doing this 'manually' is full of pitfalls.
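For reference, a minimal sketch of both suggestions (the store name df_store.h5 is arbitrary; dftoadd is the frame built in the question):
import pandas as pd

# option 1: drop the timezone so the index becomes a type plain PyTables/TsTables can handle
dftoadd_naive = dftoadd.copy()
dftoadd_naive.index = dftoadd_naive.index.tz_localize(None)

# option 2: let pandas' HDFStore handle the conversion
with pd.HDFStore('df_store.h5') as store:
    store.put('example', dftoadd, format='table')
    roundtrip = store['example']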
