Validating pandas dataframe columns - python

I have a dataframe with columns as below -
u'wellthie_issuer_identifier', u'issuer_name', u'service_area_identifier', u'hios_plan_identifier', u'plan_year', u'type'
I need to validate values in each column and finally have a dataframe which is valid.
For example, I need to check if plan_year column satisfies below validation
presence: true, numericality: true, length: { is: 4 }
hios_plan_identifier column satisfies below regex.
format: /\A(\d{5}[A-Z]{2}[a-zA-Z0-9]{3,7}-TMP|\d{5}[A-Z]{2}\d{3,7}(\-?\d{2})*)\z/,
presence: true, length: { minimum: 10 },
type column contains,
in: ['MetalPlan', 'MedicarePlan', 'BasicHealthPlan', 'DualPlan', 'MedicaidPlan', 'ChipPlan']
There are a lot of columns which I need to validate. I have tried to give example data.
I am able to check regex with str.contains('\A(\d{5}[A-Z]{2}[a-zA-Z0-9]{3,7}-TMP|\d{5}[A-Z]{2}\d{3,7}(\-?\d{2})*)\Z', regex=True)
Similarly, I can check the other validations individually. I am confused as to how to put all the validations together. Should I put them all in an if block with and conditions? Is there an easy way to validate the dataframe columns? Need help here.

There are multiple pandas functions you could make use of. The basic syntax to filter your dataframe by content is:
df = df[(condition1) & (condition2) & ...] # filter the df and assign to the same df
Specifically for your case, you could replace each condition with one of the following expressions:
df[some_column] == some_value
df[some_column].isin(some_list_of_values) # This checks whether the value of the column is one of the values in the list
df[some_column].str.contains() # You can use it the same as str.contains()
df[some_column].str.isdigit() # Same usage as str.isdigit(); checks whether the string is all digits, but make sure the column type is string in advance
df[some_column].str.len() == 4 # Filter string with length of 4
Finally, if you want to reset the index, you could use df = df.reset_index(drop=True) to reset your output df index to 0,1,2,...
Edit: To check for NaN, NaT, or None values you could use
df[some_column].isnull()
For multiple columns, you could use
df[[col1, col2]].isin(valuelist).all(axis=1)
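Putting it together, here is a minimal sketch of how the individual checks from the question could be combined into a single boolean mask. The column names and rules come from the question; the sample data is made up.
import pandas as pd

df = pd.DataFrame({
    'plan_year': ['2019', '201', '2020'],
    'hios_plan_identifier': ['12345AB12345', 'bad', '54321CD6789012'],
    'type': ['MetalPlan', 'MedicarePlan', 'UnknownPlan'],
})

hios_regex = r'\A(\d{5}[A-Z]{2}[a-zA-Z0-9]{3,7}-TMP|\d{5}[A-Z]{2}\d{3,7}(-?\d{2})*)\Z'
allowed_types = ['MetalPlan', 'MedicarePlan', 'BasicHealthPlan',
                 'DualPlan', 'MedicaidPlan', 'ChipPlan']

mask = (
    df['plan_year'].notna()                                             # presence
    & df['plan_year'].astype(str).str.isdigit()                         # numericality
    & (df['plan_year'].astype(str).str.len() == 4)                      # length is 4
    & df['hios_plan_identifier'].notna()                                # presence
    & (df['hios_plan_identifier'].str.len() >= 10)                      # minimum length 10
    & df['hios_plan_identifier'].str.contains(hios_regex, regex=True, na=False)
    & df['type'].isin(allowed_types)                                    # allowed values
)

valid_df = df[mask].reset_index(drop=True)      # rows passing every validation
invalid_df = df[~mask].reset_index(drop=True)   # rows failing at least one check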

Related

Python Pandas - check if value exists in 1 dataframe and update value in another dataframe based on results

I have 2 dataframes.
Orders and Receipts.
Both Orders and Receipts have a column named Request Number.
Orders have an additional Column named Received.
I need to compare the ['Request Number'] columns and determine whether a matching value (e.g. 123456) exists in both dataframes.
If it exists, the Received value in Orders should change to True; otherwise it should stay False.
I tried the following with no luck - it outputs False even though I know there are matching values.
orders['Received'] = orders['Request Number'].apply(lambda x: True if x in receipts['Request Number'] else False)
Why is it not working?
Use Series.isin to check membership:
orders['Received'] = orders['Request Number'].isin(receipts['Request Number'])
As for why your approach is not working: using the Python in operator on a Series tests for membership in the index, not membership among the values.
So a possible solution is to convert the values to a numpy array:
orders['Received'] = (orders['Request Number']
                      .apply(lambda x: x in receipts['Request Number'].to_numpy()))
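A small illustration, with made-up numbers, of why the original lambda fails: in on a Series checks the index labels, not the values.
import pandas as pd

s = pd.Series([101, 205, 314])        # default index is 0, 1, 2
print(101 in s)                       # False: 101 is a value, not an index label
print(0 in s)                         # True: 0 is an index label
print(101 in s.to_numpy())            # True: membership among the values
print(s.isin([101]).tolist())         # [True, False, False]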

How to read specific lines that contain a specific string with Pandas read_csv()?

I would like to keep only the columns and rows I want to use when downloading a CSV.
with
df = pd.read_csv("https://data.org/data.csv",usecols = ['Lion','Tree'])
I can read only the columns I want, but how can I read only the rows whose column "Lion" contains the word "animal" for example?
If what you're asking for is to filter rows while reading the csv file, the answer is that there is no built-in way to do that.
But you can do what you want once the csv file has been loaded into a DataFrame, like this:
df = df.loc[df['Lion'] == 'animal']
Explanation:
DataFrame.loc allows you to access a group of rows and columns by label(s) or a boolean array.
And here, df['Lion'] == 'animal' will return a boolean array, for example:
0 True
3 True
This means that rows 0 and 3 match the condition where the values are equal to the string 'animal'.
So, loc will select these rows 0 and 3.
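For completeness, a minimal sketch of the load-then-filter approach; the URL and column names are taken from the question, and str.contains is shown as well since the question asks for rows that contain the word 'animal' rather than rows exactly equal to it.
import pandas as pd

df = pd.read_csv("https://data.org/data.csv", usecols=['Lion', 'Tree'])
exact = df.loc[df['Lion'] == 'animal']                        # exact match, as in the answer above
contains = df.loc[df['Lion'].str.contains('animal', na=False)]  # substring match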

How to check split values of a column without changing the dataframe?

I am trying to find rows hitting specific conditions and put a value in the column col.
My current implementation is:
df.loc[~(df['myCol'].isin(myInfo)), 'col'] = 'ok'
In the future, myCol will have multiple pieces of info. So I need to split the value in myCol without changing the dataframe and check if any of the split values are in myInfo. If one of them is, the current row should get the value 'ok' in the column col. Is there an elegant way without really splitting and saving in an extra variable?
Currently, I do not know how the multiple pieces of info will be represented (either separated by a character or just concatenated one after another, each consisting of 4 alphanumeric characters).
Let's say you need to split on "-" for your myCol column.
sep='-'
deconcat = df['myCol'].str.split(sep, expand=True)
new_df = df.join(deconcat)
The new_df DataFrame will have the same index as df, therefore you can do what you want with new_df and then join back to df to filter it how you want.
You can do the above .isin code for each of the new split columns to get your desired result.
Source:
Code taken from the pyjanitor documentation which has a built-in function, deconcatenate_column, that does this.
Source code for deconcatenate_column
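A hedged sketch of the whole idea without keeping an extra variable around: split myCol, check every piece against myInfo, and write 'ok' where no piece matches. The '-' separator and the sample values are assumptions.
import pandas as pd

df = pd.DataFrame({'myCol': ['abcd-efgh', 'wxyz', 'qrst-uvwx']})
myInfo = ['efgh', 'mnop']
sep = '-'

pieces = df['myCol'].str.split(sep, expand=True)   # one column per split piece
hit = pieces.isin(myInfo).any(axis=1)              # True if any piece is in myInfo
df.loc[~hit, 'col'] = 'ok'                         # mirrors the original .loc assignment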

I have missing data in my pandas dataframe. How can I tell python not to include it in a new dataframe?

I have a text file mart_export.txt full of two different types of keys that looks like this
Gene stable ID RefSeq match transcript
ENSG00000243959
ENSG00000206698
ENSG00000265684
ENSG00000251990
ENSG00000241552
ENSG00000050767 NM_173465.4
As you can see, most of the right column doesn't have any data, but I am trying to build a new pandas dataframe out of just the indices that have values for both columns. Here is my script so far
#Put the biomart export in a pandas dataframe
mart = pd.read_csv("mart_export.txt", delimiter="\t")
#Create new list of records with Gene Stable Id and RefSeq numbers
d = {'Gene Stable ID': [], 'RefSeq ID': []}
for i in mart:
    if mart['RefSeq match transcript'] != NaN:
        d['Gene Stable ID'].append(mart['Gene stable ID'])
        d['RefSeq ID'].append(mart['RefSeq match transcript'])
In Spyder, the values in the second column that are blank are labeled NaN, but when I try to use this value in my code, I get an error in python that says NaN is not defined. How can I specify to python what a blank looks like?
You can drop rows or columns using the dropna() method of a pandas DataFrame.
In your case, it would be:
mart.dropna(axis="rows", inplace=True)
You can also drop columns containing NaNs, specify the how argument, and so on; check the dropna docs.
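A short hedged example of the dropna call; the file and column names are taken from the question, and the subset argument restricts the check to the RefSeq column only.
import pandas as pd

mart = pd.read_csv("mart_export.txt", delimiter="\t")
clean = mart.dropna()                                      # drop every row with any missing value
clean = mart.dropna(subset=['RefSeq match transcript'])    # or only drop rows missing a RefSeq ID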
To detect NaN, you may use pd.isna or pd.isnull.
However, mart is a DataFrame, so mart['RefSeq match transcript'] is a column.
mart['RefSeq match transcript'] == something will return a Series.
Therefore, the condition if mart['RefSeq match transcript'] == something will always raise an error no matter what value you try to compare against.
You either need to dropna as shown in the other answer or filter out NaN values like this:
mart_noNaN = mart[~mart['RefSeq match transcript'].isna()]
Notice the ~ negation in front of mart.

delete string in a pandas dataframe

I have a pandas dataframe df with data.
I have a second pandas-dataframe (called df_outlier) with only some keys (that obviously also exist in df) and I want to remove them from df.
I was looking for something like the following function - but that might not be the right approach. The key contains alphanumeric values - so letters and numbers. So it is not an int.
clean_df = (df['ID'] - df_outlier['ID'])
Any ideas? Thanks.
To filter a df using multiple values from another df we can use isin; this returns a boolean mask for the rows where the values exist in the passed-in list/Series. In order to filter out these values we use the negation operator ~ to invert the mask:
clean_df = df[~df['ID'].isin(df_outlier['ID'])]
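A tiny made-up example of the isin/~ pattern:
import pandas as pd

df = pd.DataFrame({'ID': ['A1', 'B2', 'C3', 'D4']})
df_outlier = pd.DataFrame({'ID': ['B2', 'D4']})
clean_df = df[~df['ID'].isin(df_outlier['ID'])]    # keeps the rows with ID 'A1' and 'C3'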
