Pandas Dataframe - enforce data properties - python

I would like to enforce properties on pandas data tables.
Mostly a "uniqueness" of "primary keys" in the table would be interesting.
Is there a way to ensure such properties without having to call a validation function?
It would be preferable that pandas throws an error if ANY modification to the table breaches the defined rules.
I already found the package pandera, but it only checks when a validation function is called; the checks are not enforced on the table at all times.

You can use the set_index() function to set a column (or set of columns) as the index of a DataFrame.
Note that by default set_index does not check uniqueness; if you pass verify_integrity=True, Pandas will raise a ValueError when the column contains duplicates.
The question has no actual code to debug, so I'll post a boilerplate example of what it might look like:
import pandas as pd

df = pd.read_csv("data.csv")
df = df.set_index("id", verify_integrity=True)  # raises ValueError on duplicate ids
There are different variants on this.
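One caveat worth a minimal sketch: verify_integrity checks uniqueness only at the moment set_index is called, so after any later modification you would have to re-verify yourself (data.csv and the "id" column are assumptions carried over from the example above):

import pandas as pd

df = pd.read_csv("data.csv")
df = df.set_index("id", verify_integrity=True)  # uniqueness checked once, here

df = df.sort_index()  # some later modification; NOT re-checked automatically
assert df.index.is_unique, "primary key uniqueness violated"

This still isn't the automatic enforcement the question asks for, but it keeps the check down to a single line after each mutation.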

Related

Looking for names that somewhat match

I am creating an app that loads a DataFrame from a spreadsheet. Sometimes the names are not the same as before (someone changed a column name slightly) and are just barely different. What would be the best way to "look" for these closely related column names in new spreadsheets when loading the df? If I ask Python to look for "col_1" but in the user's spreadsheet the column is "col1", then Python won't find it.
Example:
import pandas as pd
df = pd.read_excel('data.xlsx')
Here are the column names I am looking for. The rest of the columns load just fine, but the column names that are just barely different get skipped and the data never gets loaded. How can I make sure that if a name is close to the name Python is looking for, it will get loaded in the df?
Names I'm looking for:
'Water Consumption'
'Female Weight'
'Uniformity'
data.xlsx, incoming data column names that are slightly different:
'Water Consumed Actual'
'Body Weight Actual'
'Unif %'
The built-in function difflib.get_close_matches can help you with column names that are slightly wrong. Using that with the usecols argument of pd.read_excel should get you most of the way there.
You can do something like:
import difflib
import pandas as pd

desired_columns = ['Water Consumption', 'Female Weight', 'Uniformity']

def column_checker(col_name):
    # True if col_name closely matches any desired column name
    return bool(difflib.get_close_matches(col_name, desired_columns))

df = pd.read_excel('data.xlsx', usecols=column_checker)
You can mess with the parameters of get_close_matches to make it more or less sensitive too.
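If you also need the loaded columns renamed back to their canonical names, here is a small sketch along the same lines (the cutoff=0.4 value is an assumption you would tune for your own data):

import difflib
import pandas as pd

desired_columns = ['Water Consumption', 'Female Weight', 'Uniformity']

def column_checker(col_name):
    return bool(difflib.get_close_matches(col_name, desired_columns, cutoff=0.4))

df = pd.read_excel('data.xlsx', usecols=column_checker)

# Every loaded column passed the cutoff above, so each has at least one match
rename_map = {col: difflib.get_close_matches(col, desired_columns, n=1, cutoff=0.4)[0]
              for col in df.columns}
df = df.rename(columns=rename_map)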

How to create a property of a subclassed pandas dataframe that acts like a column

This may not be possible, or even a good idea but I wanted to see what options are available.
I want to create a subclass of a DataFrame that has a column as a property. E.g.
import pandas as pd

class DFWithProperties(pd.DataFrame):
    """A DataFrame that definitely has an 'a' column."""

    @property
    def depends_on_a(self):
        return self["a"] + 1

df = DFWithProperties({"a": [1, 2, 3, 4]})
I can obviously access this property in a similar way to other columns with
df.depends_on_a
But I also want to be able to access this column with something like
df["depends_on_a"]
or
df[["a", "depends_on_a"]]
Are there any neat tricks I can use here, or am I being too ambitious?
Yes! You can store whatever you like as an attribute of a pandas DataFrame simply by assigning it. You can even save it as a column if need be.
For example:
df["depends_on_a"] = df.depends_on_a  # materialize the property as a real column (or whatever you need)
However, the stored column is not linked to the property, so any changes will have to be applied to both manually; it will not happen automatically.
Source: Adding meta-information/metadata to pandas DataFrame
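If you really do want df["depends_on_a"] to resolve the property on the fly, a minimal sketch is to override __getitem__ in the subclass and fall back to the property. The single-key check below is an assumption, and list-style access like df[["a", "depends_on_a"]] would still need extra handling:

import pandas as pd

class DFWithProperties(pd.DataFrame):
    """A DataFrame that definitely has an 'a' column."""

    @property
    def depends_on_a(self):
        return self["a"] + 1

    def __getitem__(self, key):
        # Fall back to the property when the key names it explicitly
        if isinstance(key, str) and key == "depends_on_a":
            return self.depends_on_a
        return super().__getitem__(key)

df = DFWithProperties({"a": [1, 2, 3, 4]})
print(df["depends_on_a"])  # resolves via the property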

Auxiliary data / description in pandas DataFrame

Imagine I have a pd.DataFrame containing a lot of rows, and I want to add some string data describing/summarizing it.
One way to do this would be to add a column with this value, but it would be a complete waste of memory (e.g. when the string is a long specification of an ML model).
I could also insert it in the file name, but this has limitations and is very impractical at saving/loading time.
I could store this data in a separate file, but that is not what I want.
I could make a class based on pd.DataFrame and add this field, but then I am unsure whether pickle save/load would still work properly.
So is there some really clean way to store something like a "description" in a pandas DataFrame? Preferably one that would withstand to_pickle/read_pickle operations.
[EDIT]: Of course, this generates further questions, such as what happens if we concatenate dataframes carrying such information. But let's stick to the saving/loading problem only.
Googled it a bit: you can name a DataFrame as follows:
import pandas as pd
import numpy as np

df = pd.DataFrame(data=np.ones([10, 10]))
df.name = 'My name'
print(df.name)
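Note that an ad-hoc attribute like df.name is not preserved by most pandas operations. Recent pandas versions also provide an attrs dict (still marked experimental) intended for exactly this kind of metadata, and since it is part of the object's state it survives to_pickle/read_pickle; a minimal sketch:

import pandas as pd
import numpy as np

df = pd.DataFrame(data=np.ones([10, 10]))
df.attrs["description"] = "long ML model specification ..."

df.to_pickle("df.pkl")
restored = pd.read_pickle("df.pkl")
print(restored.attrs["description"])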

Why does vaex change column names that contain a period?

When using vaex I came across an unexpected error NameError: name 'column_2_0' is not defined.
After some investigation I found that in my data source (an HDF5 file) the column causing problems is actually called column_2.0, and that vaex renames it to column_2_0; but when performing operations using the column names, I run into the error. Here is a simple example that reproduces it:
import pandas as pd
import vaex

cols = ['abc_1', 'abc1', 'abc.1']
vals = list(range(0, len(cols)))
df = pd.DataFrame([vals], columns=cols)
dfv = vaex.from_pandas(df)

for col in dfv.column_names:
    dfv = dfv[dfv[col].notna()]

dfv.count()
...
NameError: name 'abc_1_1' is not defined
In this case it appears that vaex tries to rename abc.1 to abc_1 which is already taken so instead it ends up using abc_1_1.
I know that I can rename the column like dfv.rename('abc_1_1', 'abc_dot_1'), but (a) I'd need to introduce special logic for naming conflicts like in this example where the column name that vaex comes up with is already taken and (b) I'd rather not have to do this manually each time I have a column that contains a period.
I could also enforce that my column names from source data never use a period, but this seems like a stretch given that pandas and other sources the data might come from generally don't have this restriction.
What are some ideas to deal with this problem other than the two I mentioned above?
In Vaex, columns are in fact "Expressions". Expressions allow you to build a sort of computational graph behind the scenes as you do your regular dataframe operations. However, that requires the column names to be as "clean" as possible.
So column names like '2' or '2.5' are not allowed, since the expression system could interpret them as numbers rather than column names. Likewise, a column name like 'first-name' could be interpreted by the expression system as df['first'] - df['name'].
To avoid this, vaex will smartly rename columns so that they can be used in the expression system. This is actually extremely complicated, and in your example above you've found a case that has not been covered yet (isna/notna).
Btw, you can always access the original names via df.get_column_names(alias=True).
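One workaround that avoids both of the ideas mentioned in the question: sanitize the column names on the pandas side before handing the frame to vaex, so vaex never has to rename anything. A minimal sketch (the sanitize helper is hypothetical, and the collision rule of appending underscores is an arbitrary choice):

import pandas as pd
import vaex

def sanitize(columns):
    # Replace periods and resolve name collisions deterministically
    seen, clean = set(), []
    for col in columns:
        name = col.replace('.', '_')
        while name in seen:
            name += '_'
        seen.add(name)
        clean.append(name)
    return clean

df = pd.DataFrame([[0, 1, 2]], columns=['abc_1', 'abc1', 'abc.1'])
df.columns = sanitize(df.columns)  # ['abc_1', 'abc1', 'abc_1_']
dfv = vaex.from_pandas(df)         # vaex now sees only clean, unique names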

Python bson: How to create list of ObjectIds in a New Column

I have a CSV for which I need to create a column of random, unique MongoDB ids in Python.
Here's my csv file:
import pandas as pd
df = pd.read_csv('file.csv', sep=';')
print(df)
Zone
Zone_1
Zone_2
UPDATE: I'm currently using this code to generate an ObjectId:
import bson

x = bson.objectid.ObjectId()
df['objectids'] = x  # the single ObjectId is broadcast to every row
print(df)
Zone; objectids
Zone_1; 5bce2e42f6738f20cc12518d
Zone_2; 5bce2e42f6738f20cc12518d
How can I get the ObjectId to be unique for each row?
Hate to see you down-voted... Stack Overflow goes nuts with the duplicate-question nonsense, refuses to provide useful assistance, then down-votes you for having the courage to ask about something you don't know.
The referenced question clearly has nothing to do with ObjectIds, let alone adding them (or any other object not internal to NumPy or Pandas) into data frames.
You probably need to use a map.
This assumes "objectids" is not already a Series in your frame and "Zone" is a Series in your frame:
df['objectids'] = df['Zone'].map(lambda x: bson.objectid.ObjectId())
Maps are a super helpful (though slow) way to poke every record in your Series, and particularly helpful as an initial method for connecting external functions.
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.map.html
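As an alternative to map, a plain list comprehension does the same thing and skips the unused lambda argument; a quick sketch with hypothetical data:

import bson
import pandas as pd

df = pd.DataFrame({'Zone': ['Zone_1', 'Zone_2']})

# One fresh ObjectId per row
df['objectids'] = [bson.objectid.ObjectId() for _ in range(len(df))]
print(df)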
