Pythonic type hints with pandas?

Let's take a simple function that takes a str and returns a dataframe:
import pandas as pd
def csv_to_df(path):
    return pd.read_csv(path, skiprows=1, sep='\t', comment='#')
What is the recommended pythonic way of adding type hints to this function?
If I ask Python for the type of a DataFrame, it returns pandas.core.frame.DataFrame.
The following won't work though, as it'll tell me that pandas is not defined.
def csv_to_df(path: str) -> pandas.core.frame.DataFrame:
    return pd.read_csv(path, skiprows=1, sep='\t', comment='#')

Why not just use pd.DataFrame?
import pandas as pd
def csv_to_df(path: str) -> pd.DataFrame:
    return pd.read_csv(path, skiprows=1, sep='\t', comment='#')
Result is the same:
> help(csv_to_df)
Help on function csv_to_df in module __main__:
csv_to_df(path:str) -> pandas.core.frame.DataFrame
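If you want the pd.DataFrame annotation without importing pandas at runtime (say, in a package where pandas is an optional dependency), here is a minimal sketch using typing.TYPE_CHECKING and postponed annotations; the in-function import is only there to illustrate the pattern:
from __future__ import annotations  # annotations are stored as strings, not evaluated
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    import pandas as pd  # seen by type checkers only, skipped at runtime

def csv_to_df(path: str) -> pd.DataFrame:
    import pandas as pd  # deferred runtime import
    return pd.read_csv(path, skiprows=1, sep='\t', comment='#')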

I'm currently doing the following:
from typing import TypeVar
PandasDataFrame = TypeVar('pandas.core.frame.DataFrame')
def csv_to_df(path: str) -> PandasDataFrame:
    return pd.read_csv(path, skiprows=1, sep='\t', comment='#')
Which gives:
> help(csv_to_df)
Help on function csv_to_df in module __main__:
csv_to_df(path:str) -> ~pandas.core.frame.DataFrame
Don't know how pythonic that is, but it's understandable enough as a type hint, I find.

Now there is a pip package that can help with this.
https://github.com/CedricFR/dataenforce
You can install it with pip install dataenforce and use very pythonic type hints like:
from dataenforce import Dataset

def preprocess(dataset: Dataset["id", "name", "location"]) -> Dataset["location", "count"]:
    pass

Check out the answer given here, which explains the usage of the data-science-types package.
pip install data-science-types
Demo
# program.py
import pandas as pd
df: pd.DataFrame = pd.DataFrame({'col1': [1,2,3], 'col2': [4,5,6]}) # OK
df1: pd.DataFrame = pd.Series([1,2,3]) # error: Incompatible types in assignment
Run it through mypy as usual:
$ mypy program.py
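Tying this back to the original function: with the stubs installed, mypy knows that read_csv returns a DataFrame, so the plain pd.DataFrame annotation type-checks (the file name and call site below are just illustrative):
# csv_example.py
import pandas as pd

def csv_to_df(path: str) -> pd.DataFrame:
    return pd.read_csv(path, skiprows=1, sep='\t', comment='#')

df: pd.DataFrame = csv_to_df('data.tsv')  # OK
$ mypy csv_example.py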

This is straying from the original question, but building off of @dangom's answer using TypeVar and @Georgy's comment that there is no way to specify datatypes for DataFrame columns in type hints, you could use a simple workaround like this to specify datatypes in a DataFrame:
from typing import TypeVar
DataFrameStr = TypeVar("pandas.core.frame.DataFrame(str)")
def csv_to_df(path: str) -> DataFrameStr:
    return pd.read_csv(path, skiprows=1, sep='\t', comment='#')
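A related option, sketched here as a way to record the same intent rather than anything enforced: typing.Annotated (standard library since Python 3.9, typing_extensions before that) keeps the real pd.DataFrame type for checkers and attaches the column information as metadata:
from typing import Annotated
import pandas as pd

# the string metadata is informational only; type checkers treat this as pd.DataFrame
DataFrameStr = Annotated[pd.DataFrame, "columns hold str values"]

def csv_to_df(path: str) -> DataFrameStr:
    return pd.read_csv(path, skiprows=1, sep='\t', comment='#')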

Take a look at pandera.
pandera provides a flexible and expressive API for performing data validation on dataframe-like objects to make data processing pipelines more readable and robust.
Dataframes contain information that pandera explicitly validates at runtime. This is useful in production-critical or reproducible research settings.
The advantage of pandera is that you can also specify dtypes of individual DataFrame columns. The following example uses pandera to enforce, at run time, a DataFrame containing a single column of integers:
import pandas as pd
import pandera
from pandera.typing import DataFrame, Series
class Integers(pandera.SchemaModel):
    number: Series[int]

@pandera.check_types
def my_fn(a: DataFrame[Integers]) -> None:
    pass
# This works
df = pd.DataFrame({"number": [ 2002, 2003]})
my_fn(df)
# Raises an exception
df = pd.DataFrame({"number": [ 2002.0, 2003]})
my_fn(df)
# Raises an exception
df = pd.DataFrame({"number": [ '2002', 2003]})
my_fn(df)
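Building on that, a sketch of how value checks and dtype coercion can be layered on top of the column dtype, assuming pandera's Field arguments and the Config.coerce option behave as documented:
import pandas as pd
import pandera
from pandera.typing import DataFrame, Series

class PositiveIntegers(pandera.SchemaModel):
    number: Series[int] = pandera.Field(ge=0)  # dtype plus a value check

    class Config:
        coerce = True  # cast compatible values instead of raising

@pandera.check_types
def my_fn(a: DataFrame[PositiveIntegers]) -> None:
    pass

my_fn(pd.DataFrame({"number": ["2002", 2003]}))  # should pass once the values are coerced to int64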

Related

Reading partitioned data (parquets) using dask with 'int64' vs 'int64 not null'

I have this annoying situation where some of my parquet files have:
x: int64
and others have
x: int64 not null
and ergo (in dask 2.8.0/numpy 1.15.1/pandas 0.25.3) I can't run the following:
test: Union[pd.Series, pd.DataFrame, np.ndarray] = dd.read_parquet(input_path).query(filter_string)[input_columns].compute()
Anyone know what I can do short of upgrading dask/numpy (as I know the latest dask/numpy seem to work)?
Thanks in advance!
If you know which files contain the different dtypes, then it's best to re-process them (load/convert dtype/save).
If that's not an option, then you can create a dask dataframe from delayed objects with something like this:
import pandas as pd
from dask import delayed
import dask.dataframe as dd
@delayed
def custom_load(fpath):
    df = pd.read_parquet(fpath)
    df = df.astype({'x': 'Int64'})  # the appropriate dtype
    return df

delayed_dfs = [custom_load(f) for f in files]  # files is the list of paths; renamed so it doesn't shadow dask.delayed
ddf = dd.from_delayed(delayed_dfs)  # can also provide the meta option if known
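Going back to the first suggestion (re-processing the files so they all share one schema), a minimal sketch, assuming bad_files is the list of parquet files whose 'x' column differs:
import pandas as pd

bad_files = ['part.0.parquet', 'part.7.parquet']  # hypothetical paths

for fpath in bad_files:
    df = pd.read_parquet(fpath)
    df = df.astype({'x': 'Int64'})  # normalize the dtype
    df.to_parquet(fpath)            # overwrite with a consistent schema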

How to save a pandas DataFrame with custom types using pyarrow and parquet

I want to save a pandas DataFrame to parquet, but I have some unsupported types in it (for example bson ObjectIds).
Throughout the examples we use:
import pandas as pd
import pyarrow as pa
from bson import ObjectId
Here's a minimal example to show the situation:
df = pd.DataFrame(
    [
        {'name': 'alice', 'oid': ObjectId('5e9992543bfddb58073803e7')},
        {'name': 'bob', 'oid': ObjectId('5e9992543bfddb58073803e8')},
    ]
)
df.to_parquet('some_path')
And we get:
ArrowInvalid: ('Could not convert 5e9992543bfddb58073803e7 with type ObjectId: did not recognize Python value type when inferring an Arrow data type', 'Conversion failed for column oid with type object')
I tried to follow this reference: https://arrow.apache.org/docs/python/extending_types.html
Thus I wrote the following type extension:
class ObjectIdType(pa.ExtensionType):
    def __init__(self):
        pa.ExtensionType.__init__(self, pa.binary(12), "my_package.objectid")

    def __arrow_ext_serialize__(self):
        # since we don't have a parametrized type, we don't need extra
        # metadata to be deserialized
        return b''

    @classmethod
    def __arrow_ext_deserialize__(cls, storage_type, serialized):
        # return an instance of this subclass given the serialized metadata
        return cls()
And was able to get a working pyarray for my oid column:
values = df['oid']
storage_array = pa.array(values.map(lambda oid: oid.binary), type=pa.binary(12))
objectid_type = ObjectIdType()  # instance of the extension type defined above
pa.ExtensionArray.from_storage(objectid_type, storage_array)
Now where I’m stuck, and cannot find any good solution on the internet, is how to save my df to parquet, letting it interpret which column needs which Extension. I might change columns in the future, and I have several different types that need this treatment.
How can I simply create parquet files from dataframes and restore them while transparently converting the types?
I tried to create a pyarrow.Table object, and append columns to it after preprocessing, but it doesn’t work as table.append_column takes binary columns and not pyarrow.Arrays, plus the whole isinstance thing looks like a terrible solution.
table = pa.Table.from_pandas(pd.DataFrame())
for col, values in test_df.iteritems():
    if isinstance(values.iloc[0], ObjectId):
        arr = pa.array(
            values.map(lambda oid: oid.binary), type=pa.binary(12)
        )
    elif isinstance(values.iloc[0], ...):
        ...
    else:
        arr = pa.array(values)
    table.append_column(arr, col)  # FAILS (wrong type)
Pseudocode of the ideal solution:
parquetize(df, path, my_custom_types_conversions)
# ...
new_df = unparquetize(path, my_custom_types_conversions)
assert df.equals(new_df) # types have been correctly restored
I’m getting lost in pyarrow’s docs on whether I should use ExtensionType, serialization, or something else to write these functions. Any pointer would be appreciated.
Side note: I do not need parquet per se; the main issue is being able to save and restore dataframes with custom types quickly and space-efficiently. I tried a solution based on jsonifying and gzipping the dataframe, but it was too slow.
I think it is probably because ObjectId is not a type that pyarrow recognizes, hence it throws this exception during type conversion.
I tried the example you provided, casting the oid values to strings during DataFrame creation, and it worked.
Check below the steps:
df = pd.DataFrame(
    [
        {'name': 'alice', 'oid': "ObjectId('5e9992543bfddb58073803e7')"},
        {'name': 'bob', 'oid': "ObjectId('5e9992543bfddb58073803e8')"},
    ]
)
df.to_parquet('parquet_file.parquet')
df1 = pd.read_parquet('parquet_file.parquet', engine='pyarrow')
df1
output:
name oid
0 alice ObjectId('5e9992543bfddb58073803e7')
1 bob ObjectId('5e9992543bfddb58073803e8')
You could write a method that reads the column names and types and outputs a new DF with the columns converted to compatible types, using a switch-case pattern to choose what type to convert column to (or whether to leave it as is).
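A minimal sketch of that idea, assuming ObjectId is the only custom type to handle and storing it as its hex string (the helper names are made up for illustration):
import pandas as pd
from bson import ObjectId

def to_parquet_compatible(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with unsupported column types converted to storable ones."""
    out = df.copy()
    for col in out.columns:
        if isinstance(out[col].iloc[0], ObjectId):
            out[col] = out[col].map(str)  # ObjectId -> 24-char hex string
        # elif isinstance(out[col].iloc[0], SomeOtherCustomType): ...
    return out

def restore_objectids(df: pd.DataFrame, objectid_cols) -> pd.DataFrame:
    out = df.copy()
    for col in objectid_cols:
        out[col] = out[col].map(ObjectId)  # hex string -> ObjectId
    return out

to_parquet_compatible(df).to_parquet('some_path')
restored = restore_objectids(pd.read_parquet('some_path'), ['oid'])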

Type-checking Pandas DataFrames

I want to type-check Pandas DataFrames i.e. I want to specify which column labels a DataFrame must have and what kind of data type (dtype) is stored in them. A crude implementation (inspired by this question) would work like this:
from collections import namedtuple

Col = namedtuple('Col', 'label, type')

def dataframe_check(*specification):
    def check_accepts(f):
        assert len(specification) <= f.__code__.co_argcount
        def new_f(*args, **kwds):
            for (df, specs) in zip(args, specification):
                spec_columns = [spec.label for spec in specs]
                assert (df.columns == spec_columns).all(), \
                    "Columns don't match specs {}".format(spec_columns)
                spec_dtypes = [spec.type for spec in specs]
                assert (df.dtypes == spec_dtypes).all(), \
                    "Dtypes don't match specs {}".format(spec_dtypes)
            return f(*args, **kwds)
        new_f.__name__ = f.__name__
        return new_f
    return check_accepts
I don't mind the complexity of the checking function but it adds a lot of boilerplate code.
@dataframe_check([Col('a', int), Col('b', int)],   # df1
                 [Col('a', int), Col('b', float)]) # df2
def f(df1, df2):
    return df1 + df2

f(df, df)
Is there a more Pythonic way of type-checking DataFrames? Something that looks more like the new Python 3.6 static type-checking?
Is it possible to implement it in mypy?
Try pandera
A data validation library for scientists, engineers, and analysts seeking correctness.
Perhaps not the most pythonic way, but using a dict for your specs might do the trick (with keys as column names and values as data types):
import pandas as pd

df = pd.DataFrame(columns=['col1', 'col2'])
df['col1'] = df['col1'].astype('int')
df['col2'] = df['col2'].astype('str')

cols_dtypes_req = {'col1': 'int', 'col2': 'object'}  # 'str' dtype is 'object' in pandas

def check_df(dataframe, specs):
    for colname in specs:
        if colname not in dataframe:
            return 'Column missing.'
        elif dataframe[colname].dtype != specs[colname]:
            return 'Data type incorrect.'
    for dfcol in dataframe:
        if dfcol not in specs:
            return 'Unexpected dataframe column.'
    return 'Dataframe meets specifications.'

print(check_df(df, cols_dtypes_req))

PyTables ValueError on string column with newer pandas

Problem writing pandas dataframe (timeseries) to HDF5 using pytables/tstables:
import pandas
import tables
import tstables
# example dataframe
valfloat = [512.3, 918.8]
valstr = ['abc','cba']
tstamp = [1445464064, 1445464013]
df = pandas.DataFrame(data = zip(valfloat, valstr, tstamp), columns = ['colfloat', 'colstr', 'timestamp'])
df.set_index(pandas.to_datetime(df['timestamp'].astype(int), unit='s'), inplace=True)
df.index = df.index.tz_localize('UTC')
colsel = ['colfloat', 'colstr']
dftoadd = df[colsel].sort_index()
# try string conversion from object-type (no type mixing here ?)
##dftoadd.loc[:,'colstr'] = dftoadd['colstr'].map(str)
h5fname = 'df.h5'
# class to use as tstable description
class TsExample(tables.IsDescription):
    timestamp = tables.Int64Col(pos=0)
    colfloat = tables.Float64Col(pos=1)
    colstr = tables.StringCol(itemsize=8, pos=2)
# create new time series
h5f = tables.open_file(h5fname, 'a')
ts = h5f.create_ts('/','example',TsExample)
# append to HDF5
ts.append(dftoadd, convert_strings=True)
# save data and close file
h5f.flush()
h5f.close()
Exception:
ValueError: rows parameter cannot be converted into a recarray object
compliant with table tstables.tstable.TsTable instance at ...
The error was: cannot view Object as non-Object type
While this particular error happens with TsTables, the code chunk responsible for it is identical to PyTables try-section here.
The error is happening after I upgraded pandas to 0.17.0; the same code was running error-free with 0.16.2.
NOTE: if a string column is excluded then everything works fine, so this problem must be related to string-column type representation in the dataframe.
The issue could be related to this question. Is there some conversion required for 'colstr' column of the dataframe that I am missing?
This is not going to work with a newer pandas, as the index is timezone-aware; see here.
You can:
convert to a type PyTables understands (this would require localizing), or
use HDFStore to write the frame.
Note that what you are doing is the reason HDFStore exists in the first place, to make reading/writing pyTables friendly for pandas objects. Doing this 'manually' is full of pitfalls.
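A minimal sketch of the HDFStore route, reusing dftoadd from the question; to_hdf with format='table' goes through HDFStore and, as far as I know, handles the timezone-aware index:
import pandas

# write through pandas' HDFStore layer instead of appending to the tstables table directly
dftoadd.to_hdf('df_store.h5', key='example', format='table')

# read it back
roundtrip = pandas.read_hdf('df_store.h5', key='example')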

Get the name of a pandas DataFrame

How do I get the name of a DataFrame and print it as a string?
Example:
boston (var name assigned to a csv file)
import pandas as pd
boston = pd.read_csv('boston.csv')
print('The winner is team A based on the %s table.' % boston)
You can name the dataframe with the following, and then call the name wherever you like:
import numpy as np
import pandas as pd

df = pd.DataFrame(data=np.ones([4, 4]))
df.name = 'Ones'
print(df.name)
>>>
Ones
Sometimes df.name doesn't work.
You might get an error message:
'DataFrame' object has no attribute 'name'
Try the function below:
def get_df_name(df):
    name = [x for x in globals() if globals()[x] is df][0]
    return name
In many situations, a custom attribute attached to a pd.DataFrame object is not necessary. In addition, note that pandas-object attributes may not serialize. So pickling will lose this data.
Instead, consider creating a dictionary with appropriately named keys and access the dataframe via dfs['some_label'].
df = pd.DataFrame()
dfs = {'some_label': df}
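A small usage sketch of that approach (the labels and frames are illustrative):
import pandas as pd

dfs = {
    'boston': pd.DataFrame({'crime_rate': [0.1, 0.2]}),
    'ones': pd.DataFrame([[1, 1], [1, 1]]),
}

for name, frame in dfs.items():
    print('The winner is team A based on the %s table.' % name)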
From here, what I understand is that DataFrames are:
DataFrame is a 2-dimensional labeled data structure with columns of potentially different types. You can think of it like a spreadsheet or SQL table, or a dict of Series objects.
And Series are:
Series is a one-dimensional labeled array capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.).
Series have a name attribute which can be accessed like so:
In [27]: s = pd.Series(np.random.randn(5), name='something')
In [28]: s
Out[28]:
0 0.541
1 -1.175
2 0.129
3 0.043
4 -0.429
Name: something, dtype: float64
In [29]: s.name
Out[29]: 'something'
EDIT: Based on OP's comments, I think OP was looking for something like:
>>> df = pd.DataFrame(...)
>>> df.name = 'df' # making a custom attribute that DataFrame doesn't intrinsically have
>>> print(df.name)
'df'
DataFrames don't have names, but you have an (experimental) attribute dictionary you can use. For example:
df.attrs['name'] = "My name" # Can be retrieved later
Attributes are retained through some operations.
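A minimal sketch of setting and reading the attribute; which operations carry attrs over depends on the pandas version, so treat the propagation as best-effort:
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3]})
df.attrs['name'] = "My name"
print(df.attrs['name'])  # My name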
Here is a sample function:
`df.name = file` is the sixth line in the code below:
def df_list():
    filename_list = current_stage_files(PATH)
    df_list = []
    for file in filename_list:
        df = pd.read_csv(PATH + file)
        df.name = file
        df_list.append(df)
    return df_list
I am working on a module for feature analysis and had the same need, as I would like to generate a report with the name of the pandas.DataFrame being analyzed. To solve this, I used the same solution presented by @scohe001 and @LeopardShark, originally in https://stackoverflow.com/a/18425523/8508275, implemented with the inspect library:
import inspect
def aux_retrieve_name(var):
    callers_local_vars = inspect.currentframe().f_back.f_back.f_locals.items()
    return [var_name for var_name, var_val in callers_local_vars if var_val is var]
Note the additional .f_back term since I intend to call it from another function:
def header_generator(df):
    print('--------- Feature Analyzer ----------')
    print('Dataframe name: "{}"'.format(aux_retrieve_name(df)))
    print('Memory usage: {:03.2f} MB'.format(df.memory_usage(deep=True).sum() / 1024 ** 2))
    return
Running this code with a given dataframe, I get the following output:
header_generator(trial_dataframe)
--------- Feature Analyzer ----------
Dataframe name: "trial_dataframe"
Memory usage: 63.08 MB
