I want to type-check pandas DataFrames, i.e. specify which column labels a DataFrame must have and which data type (dtype) is stored in each of them. A crude implementation (inspired by this question) would work like this:
from collections import namedtuple

Col = namedtuple('Col', 'label, type')

def dataframe_check(*specification):
    def check_accepts(f):
        assert len(specification) <= f.__code__.co_argcount
        def new_f(*args, **kwds):
            for (df, specs) in zip(args, specification):
                spec_columns = [spec.label for spec in specs]
                assert (df.columns == spec_columns).all(), \
                    "Columns don't match specs {}".format(spec_columns)
                spec_dtypes = [spec.type for spec in specs]
                assert (df.dtypes == spec_dtypes).all(), \
                    "Dtypes don't match specs {}".format(spec_dtypes)
            return f(*args, **kwds)
        new_f.__name__ = f.__name__
        return new_f
    return check_accepts
I don't mind the complexity of the checking function, but it adds a lot of boilerplate code.
@dataframe_check([Col('a', int), Col('b', int)],    # df1
                 [Col('a', int), Col('b', float)],) # df2
def f(df1, df2):
    return df1 + df2

f(df, df)
Is there a more Pythonic way of type-checking DataFrames? Something that looks more like the new Python 3.6 static type-checking?
Is it possible to implement it in mypy?
Try pandera
A data validation library for scientists, engineers, and analysts seeking correctness.
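For example, a column-level schema for the question's first DataFrame might look roughly like this (a minimal sketch using pandera's DataFrameSchema; the column names simply mirror the question):

import pandas as pd
import pandera as pa

# Sketch: column 'a' must be int, column 'b' must be float
schema = pa.DataFrameSchema({
    'a': pa.Column(int),
    'b': pa.Column(float),
})

df = pd.DataFrame({'a': [1, 2], 'b': [1.0, 2.0]})
schema.validate(df)  # raises a SchemaError if columns or dtypes don't match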
Perhaps not the most pythonic way, but using a dict for your specs might do the trick (with keys as column names and values as data types):
import pandas as pd
df = pd.DataFrame(columns=['col1', 'col2'])
df['col1'] = df['col1'].astype('int')
df['col2'] = df['col2'].astype('str')
cols_dtypes_req = {'col1':'int', 'col2':'object'} #'str' dtype is 'object' in pandas
def check_df(dataframe, specs):
    for colname in specs:
        if colname not in dataframe:
            return 'Column missing.'
        elif dataframe[colname].dtype != specs[colname]:
            return 'Data type incorrect.'
    for dfcol in dataframe:
        if dfcol not in specs:
            return 'Unexpected dataframe column.'
    return 'Dataframe meets specifications.'
print(check_df(df, cols_dtypes_req))
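If you prefer the decorator style from the question, a hedged sketch reusing check_df might look like this (the names expect_df and process are purely illustrative, not part of any library):

import functools

def expect_df(specs):
    # Illustrative decorator: validate the first positional DataFrame argument.
    def decorator(f):
        @functools.wraps(f)
        def wrapper(df, *args, **kwargs):
            result = check_df(df, specs)
            assert result == 'Dataframe meets specifications.', result
            return f(df, *args, **kwargs)
        return wrapper
    return decorator

@expect_df(cols_dtypes_req)
def process(df):
    return df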
I have several functions that I need to put into a pipeline for an assignment, for example:
def Android_iOs_device_os_cange(df: pd.DataFrame) -> pd.DataFrame:
    import numpy as np
    df = df.copy()

    def foung_android_list(df):
        list_for_android = list(df[df['device_os'] == 'Android'].device_brand.unique())
        list_for_android.remove('(not set)')
        return list_for_android

    def foung_iOS_list(df):
        list_for_iOS = list(df[df['device_os'] == 'iOS'].device_brand.unique())
        list_for_iOS.remove('(not set)')
        return list_for_iOS

    df.loc[:, 'device_os'] = np.where(df['device_brand'].isin(foung_iOS_list(df)) & (df['device_os'].isnull()), 'iOS',
                                      df['device_os'])
    df.loc[:, 'device_os'] = np.where(df['device_brand'].isin(foung_android_list(df)) & (df['device_os'].isnull()), 'Android',
                                      df['device_os'])
    df.loc[:, 'device_os'] = np.where(df['device_os'].isnull(), '(not set)', df['device_os'])
    print(df)
    return df
This function fills every empty value in the device_os column with Android or iOS, depending on which phone brand the client specified, or leaves (not set) if device_brand is also empty. As a plain function in JupyterLab the code runs fine, but when I put it into the pipeline, the code gives me an error about 'device_brand', i.e. it does not find such a column in the DataFrame.
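For context, a minimal sketch of the kind of wiring I mean, assuming the functions are wrapped with sklearn's FunctionTransformer (the actual pipeline is larger; the step names here are illustrative):

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer

# Illustrative wiring only: each step receives the DataFrame returned by the
# previous one, so every function must find the columns it needs in that result.
preprocessing = Pipeline(steps=[
    ('fill_device_os', FunctionTransformer(Android_iOs_device_os_cange)),
    # ... further FunctionTransformer steps go here ...
])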
Because this is a data science task, I have a data preprocessing function that I run outside the pipeline, because the target variable for X and y comes from another DataFrame. In theory I could shove everything into the preprocessing function and run the code that way, but the task is the task. If I move Android_iOs_device_os_cange into the preprocessing function, then the from_float_to_int function comes next:
def from_float_to_int(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    for i in df.columns:
        if df[i].dtype == 'float64':
            df[i] = df[i].astype(int)
    print(df)
    return df
As you can easily guess, it changes the data type from float to int. I had the idea of using a ColumnSelector to replace the types instead, but I am not sure that strategy is correct, and at the end there are functions that cannot be replaced so easily.
Here I run into the problem that the pipeline does not seem to handle functions built around df.columns, and without that I cannot perform probably the three most important functions: prepare_for_ohe, make_standard_scatter and Labelencoder_select. The first drops, from every column except one, the rows whose values occur 80 times or fewer; the second applies StandardScaler to certain numeric columns; and the last applies LabelEncoder to the categorical columns. All of these functions rely on df.columns, and I do not know how to replace that: if I already hit this problem in from_float_to_int, it would be naive to think it won't appear in the functions that follow.
def make_standard_scatter(df: pd.DataFrame) -> pd.DataFrame:
    from sklearn.preprocessing import StandardScaler
    df = df.copy()
    data_1 = df[['count_of_action', 'hit_time']]
    std_scaler = StandardScaler()
    std_scaler.fit(data_1)
    std_scaled = std_scaler.transform(data_1)
    list1 = ['count_of_action', 'hit_time']
    list2 = []
    for name in list1:
        std_name = name + '_std'
        list2.append(std_name)
    df[list2] = std_scaled
    print(df)
    return df
def prepare_for_ohe(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    df[[i for i in df.columns if i != 'hit_time']] = df[[i for i in df.columns if i != 'hit_time']].apply(
        lambda x: x.where(x.map(x.value_counts()) > 80))
    df = df.dropna()
    return df
def Labelencoder_select(df: pd.DataFrame) -> pd.DataFrame:
    from sklearn.preprocessing import LabelEncoder
    df = df.copy()
    list1 = [i for i in df.columns if (i.split('_')[0] in 'utm') or (i.split('_')[0] in 'device') or (i.split('_')[0] in 'geo')]
    df[list1] = df[list1].apply(LabelEncoder().fit_transform)
    print(df)
    return df
What I am asking, then, is how to write these functions so that df.columns lookups and row-by-row data changes are handled correctly by the Pipeline.
I am able to write a function that merges several columns into a new column, but I fail to convert an int column into float before converting it to string for the merge.
I would like the integers in the new merged column to end up with a trailing ".00000".
In the end I am trying to use the merged column as the key for joining two vaex DataFrames on multiple keys/columns. Since vaex seems to accept only one column/key when joining two DataFrames, I need to build a combined column to use as the key.
The int-to-float conversion is for the case where a column is int in one vaex DataFrame and float in the other.
The code is below.
The function new_column_by_column_merging works, but new_column_by_column_merging2 does not. I am wondering if there is any way to make it work.
import vaex
import pandas as pd
import numpy as np

def new_column_by_column_merging(df, columns=None):
    if columns is None:
        columns = df.get_column_names()
    if type(columns) is str:
        df['merged_column_key'] = df[columns]
        return df
    df['merged_column_key'] = np.array([''] * len(df))
    for col in columns:
        df['merged_column_key'] = df['merged_column_key'] + '_' + df[col].astype('string')
    return df

def new_column_by_column_merging2(df, columns=None):
    if columns is None:
        columns = df.get_column_names()
    if type(columns) is str:
        df['merged_column_key'] = df[columns]
        return df
    df['merged_column_key'] = np.array([''] * len(df))
    for col in columns:
        try:
            df[col] = df[col].astype('float')
        except:
            print('fail to convert to float')
        df['merged_column_key'] = df['merged_column_key'] + '_' + df[col].astype('string')
    return df
pandas_df = pd.DataFrame({'Name': ['Tom', 'Joseph', 'Krish', 'John'], 'Last Name': ['Johnson', 'Cameron', 'Biden', 'Washington'], 'Age': [20, 21, 19, 18], 'Weight': [60.0, 61.0, 62.0, 63.0]})
print('pandas_df is')
print(pandas_df)
df = vaex.from_pandas(df=pandas_df, copy_index=False)
df1 = new_column_by_column_merging(df, ['Name', 'Age', 'Weight'])
print('new_column_by_column_merging returns')
print(df1)
df2 = new_column_by_column_merging2(df, ['Name', 'Age', 'Weight'])
print('new_column_by_column_merging2 returns')
print(df2)
It looks like the vaex expression system does not always play nicely with try / except checks, so you need to be careful with the dtypes. One way of handling this:
import vaex
import numpy as np

df = vaex.datasets.titanic()  # dataframe for testing

def new_column_by_column_merging2(df, columns=None):
    if columns is None:
        columns = df.get_column_names()
    if type(columns) is str:
        df['merged_column_key'] = df[columns]
        return df
    df['merged_column_key'] = np.array([''] * len(df))
    for col in columns:
        if df[col].is_string():
            pass
        else:
            df[col] = df[col].astype('float')
        df['merged_column_key'] = df['merged_column_key'] + '_' + df[col].astype('string')
    return df
new_column_by_column_merging2(df) # should work
Basically I modified the try/except statement to explicitly check for strings (since they can't be converted to floats). You might have to extend that check to cover other things like datetime etc. if needed. Hope this helps.
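For completeness, a rough sketch of the join the merged key is built for (an illustration only, assuming both vaex DataFrames have been prepared the same way; df_left and df_right are hypothetical names):

# Illustrative only: both frames are assumed to already have 'merged_column_key'
left = new_column_by_column_merging2(df_left, ['Name', 'Age', 'Weight'])
right = new_column_by_column_merging2(df_right, ['Name', 'Age', 'Weight'])
joined = left.join(right, on='merged_column_key', how='left', rsuffix='_right')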
I have a codebase where this pattern is very common:
df  # Some pandas dataframe with columns userId, sessionId

def add_session_statistics(df):
    df_statistics = get_session_statistics(df.sessionId.unique())
    return df.merge(df_statistics, on='sessionId', how='left')

def add_user_statistics(df):
    df_statistics = get_user_statistics(df.userId.unique())
    return df.merge(df_statistics, on='userId', how='left')

# etc.

df_enriched = (df
    .pipe(add_session_statistics)
    .pipe(add_user_statistics)
)
However, in another part of the codebase I have 'userId', 'sessionId' as the index of the dataframe. Something like:
X = df.set_index(['userId', 'sessionId'])
This means I can't use the add_{something}_statistics() functions on X without resetting the index each time.
Is there any decorator I can add to the add_{something}_statistics() functions to make them reset the index if they get a KeyError when attempting the merge on a column that is not there?
This seems to work:
def index_suspension_on_add(add_function):
    def _helper(df):
        try:
            return df.pipe(add_function)
        except Exception:
            index_names = df.index.names
            return (df
                .reset_index()
                .pipe(add_function)
                .set_index(index_names)
            )
    return _helper
@index_suspension_on_add
def add_user_statistics(df):
    ...
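A small optional refinement on top of this (a sketch, not part of the original answer): catch only the KeyError mentioned in the question and preserve the wrapped function's metadata with functools.wraps.

import functools

def index_suspension_on_add(add_function):
    @functools.wraps(add_function)
    def _helper(df):
        try:
            return df.pipe(add_function)
        except KeyError:
            # The merge key lives in the index: temporarily move it back to columns.
            index_names = df.index.names
            return (df
                .reset_index()
                .pipe(add_function)
                .set_index(index_names)
            )
    return _helper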
I have a JSON structure that I need to convert into a DataFrame. I have converted it with the pandas library, but I am having issues with two columns: one is an array and the other is a key-value pair.
Pito Value
{"pito-key": "Number"} [{"WRITESTAMP": "2018-06-28T16:30:36Z", "S":"41bbc22","VALUE":"2"}]
How do I break these columns out into the DataFrame?
As far as I understood your question, you can apply regular expressions to do that.
import pandas as pd
import re

data = {'pito': ['{"pito-key": "Number"}'], 'value': ['[{"WRITESTAMP": "2018-06-28T16:30:36Z", "S":"41bbc22","VALUE":"2"}]']}
df = pd.DataFrame(data)

def get_value(s):
    s = s[1]
    v = re.findall(r'VALUE\":\".*\"', s)
    return int(v[0][8:-1])

def get_pito(s):
    s = s[0]
    v = re.findall(r'key\": \".*\"', s)
    return v[0][7:-1]

df['value'] = df.apply(get_value, axis=1)
df['pito'] = df.apply(get_pito, axis=1)
df.head()
Here I create two functions that transform your scary strings into the values you want them to have.
Let me know if that's not what you meant.
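Since both fields look like valid JSON, an alternative sketch (assuming the strings always parse cleanly, which you would want to verify) could use json.loads instead of regular expressions:

import json
import pandas as pd

data = {'pito': ['{"pito-key": "Number"}'],
        'value': ['[{"WRITESTAMP": "2018-06-28T16:30:36Z", "S":"41bbc22","VALUE":"2"}]']}
df = pd.DataFrame(data)

# Parse the JSON strings and pull out the fields of interest.
df['pito'] = df['pito'].apply(lambda s: json.loads(s)['pito-key'])
df['value'] = df['value'].apply(lambda s: int(json.loads(s)[0]['VALUE']))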
Let's take a simple function that takes a str and returns a dataframe:
import pandas as pd

def csv_to_df(path):
    return pd.read_csv(path, skiprows=1, sep='\t', comment='#')
What is the recommended pythonic way of adding type hints to this function?
If I ask python for the type of a DataFrame it returns pandas.core.frame.DataFrame.
The following won't work though, as it'll tell me that pandas is not defined.
def csv_to_df(path: str) -> pandas.core.frame.DataFrame:
    return pd.read_csv(path, skiprows=1, sep='\t', comment='#')
Why not just use pd.DataFrame?
import pandas as pd

def csv_to_df(path: str) -> pd.DataFrame:
    return pd.read_csv(path, skiprows=1, sep='\t', comment='#')
Result is the same:
> help(csv_to_df)
Help on function csv_to_df in module __main__:
csv_to_df(path:str) -> pandas.core.frame.DataFrame
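Static checkers understand the annotation too; a quick sketch (assuming a pandas version that ships type information, or the pandas-stubs package installed):

# program.py (illustrative file name; for mypy only, reveal_type is not defined at runtime)
import pandas as pd

def csv_to_df(path: str) -> pd.DataFrame:
    return pd.read_csv(path, skiprows=1, sep='\t', comment='#')

reveal_type(csv_to_df('data.tsv'))  # mypy reveals: pandas.core.frame.DataFrame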
I'm currently doing the following:
from typing import TypeVar
PandasDataFrame = TypeVar('pandas.core.frame.DataFrame')
def csv_to_df(path: str) -> PandasDataFrame:
    return pd.read_csv(path, skiprows=1, sep='\t', comment='#')
Which gives:
> help(csv_to_df)
Help on function csv_to_df in module __main__:
csv_to_df(path:str) -> ~pandas.core.frame.DataFrame
Don't know how pythonic that is, but it's understandable enough as a type hint, I find.
Now there is a pip package that can help with this.
https://github.com/CedricFR/dataenforce
You can install it with pip install dataenforce and use very pythonic type hints like:
from dataenforce import Dataset

def preprocess(dataset: Dataset["id", "name", "location"]) -> Dataset["location", "count"]:
    pass
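If I remember the project's README correctly, it also ships a validate decorator to enforce the hint at runtime; a sketch worth double-checking against the linked repository:

from dataenforce import Dataset, validate

@validate
def preprocess(dataset: Dataset["id", "name", "location"]) -> Dataset["location", "count"]:
    ...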
Check out the answer given here which explains the usage of the package data-science-types.
pip install data-science-types
Demo
# program.py
import pandas as pd
df: pd.DataFrame = pd.DataFrame({'col1': [1,2,3], 'col2': [4,5,6]}) # OK
df1: pd.DataFrame = pd.Series([1,2,3]) # error: Incompatible types in assignment
Run using mypy the same way:
$ mypy program.py
This is straying from the original question, but building off of @dangom's answer using TypeVar and @Georgy's comment that there is no way to specify datatypes for DataFrame columns in type hints, you could use a simple work-around like this to specify datatypes in a DataFrame:
from typing import TypeVar

DataFrameStr = TypeVar("pandas.core.frame.DataFrame(str)")

def csv_to_df(path: str) -> DataFrameStr:
    return pd.read_csv(path, skiprows=1, sep='\t', comment='#')
Take a look at pandera.
pandera provides a flexible and expressive API for performing data validation on dataframe-like objects to make data processing pipelines more readable and robust.
Dataframes contain information that pandera explicitly validates at runtime. This is useful in production-critical or reproducible research settings.
The advantage of pandera is that you can also specify dtypes of individual DataFrame columns. The following example uses pandera to enforce, at run time, a DataFrame containing a single column of integers:
import pandas as pd
import pandera
from pandera.typing import DataFrame, Series

class Integers(pandera.SchemaModel):
    number: Series[int]

@pandera.check_types
def my_fn(a: DataFrame[Integers]) -> None:
    pass

# This works
df = pd.DataFrame({"number": [2002, 2003]})
my_fn(df)

# Raises an exception
df = pd.DataFrame({"number": [2002.0, 2003]})
my_fn(df)

# Raises an exception
df = pd.DataFrame({"number": ['2002', 2003]})
my_fn(df)