Pandera. Column validation based on a secondary column - python

Is it possible to validate a column based on another column using Pandera?
My dataframe looks like this:
df = pd.DataFrame({
    "Name": ["Thomas", "", ""],
    "Address": ["Address 1", "Address 1", "Address 3"],
    "Zip": ["65989", "65989", "65954"],
    "External": [False, True, False],
})
I would like to validate the "Name" column based on the "External" column: if External is True, then Name may be empty. In this example the third record should be invalid, because External is False and the name is missing.
A related question suggests using wide checks. However, that way all the columns of the third record (Name, Address, Zip, and External) are flagged as invalid, whereas I need only Name to be invalid and the rest to be ignored.
schema_ = pa.DataFrameSchema(
    {
        "Name": pa.Column(str),
        "Address": pa.Column(str),
        "Zip": pa.Column(str),
        "External": pa.Column(bool),
    },
    checks=[pa.Check.is_external()])

# @extensions.register_check_method(check_type="element_wise")
def is_external(pandas_obj: pd.Series):
    # a record is invalid when it is not external but the name is empty
    if (not pandas_obj["External"]) and len(pandas_obj["Name"]) < 1:
        return False
    else:
        return True
I also tried something like this in the schema:
"Name": pa.Column(str, checks=[
pa.Check.is_external(df["external"]),
]),
But in this case all the values are passed to the function "[False, True, False]" and I am not sure how to compare them to the corresponding values of the Name column.
Is it possible to make this kind of checks in Pandera ?
Thanks in advance!
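One way to keep the failure attached to Name alone (a minimal sketch, not from the original thread): pandera's Check accepts a groupby argument on column checks, which hands the check function a dict mapping each value of the grouping column to the corresponding slice of the column being validated. The error is then reported against the Name column, though at column level rather than per row:

import pandas as pd
import pandera as pa

schema = pa.DataFrameSchema({
    "Name": pa.Column(
        str,
        checks=pa.Check(
            # g maps each value of "External" to the "Name" values in that
            # group; names of non-external records must all be non-empty
            lambda g: bool((g.get(False, pd.Series(dtype=str)).str.len() > 0).all()),
            groupby="External",
        ),
    ),
    "Address": pa.Column(str),
    "Zip": pa.Column(str),
    "External": pa.Column(bool),
})

schema.validate(df)  # raises a SchemaError that names the failing "Name" check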

Related

Rename Pandas columns with inplace

I want to rename a few columns. When I set inplace=False I am able to view the result. Why does setting inplace=True result in None? (I refreshed the runtime before running with inplace=True.)
ckd_renamed = ckd_data.rename(
    columns={"id": "Id",
             "age": "Age",
             "bp": "blood_pressure"},
    inplace=True)
The dataset that I use is
"https://raw.githubusercontent.com/dphi-official/Datasets/master/Chronic%20Kidney%20Disease%20(CKD)%20Dataset/ChronicKidneyDisease.csv"
Because when you set inplace=True, there is no new copy created, and ckd_data is modified, well, in-place.
The convention of returning None for in-place modifications stems from how, for example, Python's list.sort() works.
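For comparison, list.sort() follows the same convention:

nums = [3, 1, 2]
print(nums.sort())  # None: the in-place method returns nothing
print(nums)         # [1, 2, 3]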
After
ckd_data.rename(columns = {"id": "Id", "age": "Age", "bp": "blood_pressure"}, inplace=True)
you'd continue to refer to the dataframe, now with modified column names, as ckd_data.
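As an aside, the usual pandas idiom avoids inplace altogether and simply rebinds the result:

ckd_data = ckd_data.rename(
    columns={"id": "Id", "age": "Age", "bp": "blood_pressure"})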
EDIT: If you're using Colab and you want the cell to also display the new value, you should be able to add an additional expression statement with just the name of the variable.
ckd_data.rename(columns = {"id": "Id", "age": "Age", "bp": "blood_pressure"}, inplace=True)
ckd_data

How to extract data from JSON file for each entry?

I am using Python's json package to extract data from a generated JSON file, which essentially holds data fetched from a Firebase database.
Within the given data set I want to extract all of the data corresponding to bills in each entry of the JSON file. For that I created a separate list to collect all of the elements corresponding to bills in the dataset.
When converted to CSV, the dataset looks like this (CSV screenshot for one entry omitted).
As I build the new list, certain entries have null values designated as [] (see the CSV). I only store the bills that actually contain data, essentially skipping all the null entries. But the required output only ends up in the first index of the new list or array. Please see the code below.
My code is as below:
import json
import numpy as np

filedata = open('requireddataset.json', 'r')
data = json.load(filedata)

listoffields = []  # to collect the 'bills' field of every entry
for dic in data:
    try:
        listoffields.append(dic['bills'])  # only non-essential bill categories
    except KeyError:
        pass
#print(listoffields[3])  # this would return the first payment entry within
                         # the JSON array of objects

for val in listoffields:
    if val != []:
        x = val[0]  # only val[0] would contain data
        #print(x)
        myarray = np.array(val)
        print(myarray[0])  # all of the data stored in only one index, any way to change this?
Essentially, the list listoffields contains all of the fields from the JSON file, bills being one of them, and within bills each entry again contains id, value, role and many other fields. Is there any way to extract only the values from this and produce a sum?
In the JSON file (requireddataset.json), one entry looks like this:
[{"goal_savings": 0.0, "social_id": "", "score": 0, "country": "BR", "photo": "http://graph.facebook", "id": "", "plates": 3, "rcu": null, "name": "", "email": ".", "provider": "facebook", "phone": "", "savings": [], "privacyPolicyAccepted": true, "currentRole": "RoleType.PERSONAL", "empty_lives_date": null, "userId": "", "authentication_token": "-------", "onboard_status": "ONBOARDING_WIZARD", "fcmToken": ----------", "level": 1, "dni": "", "social_token": "", "lives": 10, "bills": [{"date": "2020-12-10", "role": "RoleType.PERSONAL", "name": "Supermercado", "category": "feeding", "periodicity": "PeriodicityType.NONE", "value": 100.0"}], "payments": [], "goals": [], "goalTransactions": [], "incomes": [], "achievements": [{"created_at":", "name": ""}]}]

Getting index of a value inside a json file PYTHON

I have a sizable JSON file and I need to get the index of a certain value inside it. Here's what my JSON file looks like:
data.json
[{...many more elements here...
},
{
    "name": "SQUARED SOS",
    "unified": "1F198",
    "non_qualified": null,
    "docomo": null,
    "au": "E4E8",
    "softbank": null,
    "google": "FEB4F",
    "image": "1f198.png",
    "sheet_x": 0,
    "sheet_y": 28,
    "short_name": "sos",
    "short_names": [
        "sos"
    ],
    "text": null,
    "texts": null,
    "category": "Symbols",
    "sort_order": 167,
    "added_in": "0.6",
    "has_img_apple": true,
    "has_img_google": true,
    "has_img_twitter": true,
    "has_img_facebook": true
},
{...many more elements here...
}]
How can I get the index of the value "FEB4F", whose key is "google", for example?
My only idea was this, but it doesn't work:
print(data.index('FEB4F'))
Your basic data structure is a list, so there's no way to avoid looping over it.
Loop through all the items, keeping track of the current position. If the current item has the desired key/value, print the current position.
position = 0
for item in data:
    if item.get('google') == 'FEB4F':
        print('position is:', position)
        break
    position += 1
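The same loop reads a little more idiomatically with enumerate:

for position, item in enumerate(data):
    if item.get('google') == 'FEB4F':
        print('position is:', position)
        break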
Assuming your data can fit in a table, I recommend using pandas for that. Here is the summary:
Read the data using pandas.read_json
Identify which column to filter
Filter using pandas.DataFrame.loc
i.e.:
import pandas as pd
data = pd.read_json("path_to_json.json")
print(data)
# let's assume you want to filter using the 'unified' column
filtered = data.loc[data['unified'] == 'something']
print(filtered)
Of course, the steps would differ depending on the JSON structure.
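To recover the original question's index rather than the filtered rows, the matching row labels can be read off the frame (a sketch, assuming read_json kept the default RangeIndex, so labels coincide with positions):

positions = data.index[data['google'] == 'FEB4F'].tolist()
print(positions)  # positions of the matching entries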

Merge rows of a dataframe according to values in a specific column by concatenating the strings in the other columns into one

I have a dataframe that looks like this:
df1 = pd.DataFrame({
    "Business_Process_Activity": ["SendingReportToManager", "SendingReportToManager", "SendingReportToManager", "SendingReportToManager", "SendingReportToManager", "PreparingAndSendingAgenda", "PreparingAndSendingAgenda"],
    "Case": [1, 1, 2, 2, 2, 3, 4],
    "Application": ["MicrosoftWord", "MicrosoftOutlook", "MicrosoftWord", "MicrosoftOutlook", "MicrosoftOutlook", "MicrosoftWord", "MicrosoftWord"],
    "Activity_of_the_User": ["SavingADocument", "SendingAnEmail", "SavingADocument", "SendingAnEmail", "SendingAnEmail", "SavingADocument", "SavingADocument"],
    "Receiver_email_root": ["None", "idatta91 adarandall larryjacob", "None", "idatta91 larryjacob", "vanessaHudgens prithakaur", "None", "None"],
    "Receiever_email_domains": ["None", "gmail yahoo", "None", "gmail", "gmail yahoo", "None", "None"],
    "Receiver_email_count_Catg": ["None", "Few", "None", "Double", "Double", "None", "None"],
    "Subject": ["None", "Activity Report", "None", "Project Progress Report", "Project Progress Report 2", "None", "None"]
})
I want to merge the rows of the dataframe according to the Case column: if the number in Case is the same for two or more rows, the strings in the other columns of those rows are concatenated into one row.
The value of Business_Process_Activity is also the same for rows with the same case number. For that column I do not want to concatenate the values but keep only one of them, since that column needs to stay categorical. I want the final dataframe to look like this:
df2 = pd.DataFrame({"Case":[1,2,3,4],
"Business_Process_Activity" : ["SendingReportToManager", "SendingReportToManager", "PreparingAndSendingAgenda", "PreparingAndSendingAgenda"],
"Application":["MicrosoftWord MicrosoftOutlook", "MicrosoftWord MicrosoftOutlook MicrosoftOutlook", "MicrosoftWord", "MicrosoftWord"],
"Activity_of_the_User":["SavingADocument SendingAnEmail","SavingADocument SendingAnEmail SendingAnEmail", "SavingADocument", "SavingADocument"],
"Receiver_email_root":["idatta91 adarandall larryjacob", "idatta91 larryjacob vanessaHudgens prithakaur", "None", "None"],
"Receiever_email_domains":["gmail yahoo","gmail gmail yahoo", "None", "None"],
"Receiver_email_count_Catg":["Few", "Double Double", "None", "None"],
"Subject":["Activity Report", "Project Progress Report Project Progress Report 2", "None", "None"]
})
If real strings are merged with a "None" value, the "None" string should be dropped, since that cell is no longer empty. The duplicate case numbers should also be reduced to one as the rows are merged.
How do I do this? Thanks in advance!
The idea is to remove None values and "None" strings per group, join what is left, and finally replace empty strings with None:
df = (df1.groupby('Case')
         .agg(lambda x: ' '.join(x[x.ne('None') & x.notna()]))
         .where(lambda x: x.astype(bool), None)
         .reset_index())
Another solution, with a custom function:
def f(x):
    y = x[x.ne('None') & x.notna()]
    return None if y.empty else ' '.join(y)

df = df1.groupby('Case').agg(f).reset_index()
Use:
g = df1.groupby('Case')
df2 = g.agg(lambda s: ' '.join(s[s.ne('None')] if s.ne('None').any() else ['None']))
df2['Business_Process_Activity'] = g['Business_Process_Activity'].first()
df2 = df2.reset_index()
# print(df2)
Case Business_Process_Activity ... Receiver_email_count_Catg Subject
0 1 SendingReportToManager ... Few Activity Report
1 2 SendingReportToManager ... Double Double Project Progress Report Project Progress Report 2
2 3 PreparingAndSendingAgenda ... None None
3 4 PreparingAndSendingAgenda ... None None

"Like" condition instead of an "isin" in pandas

I have a parameter dictionary as below -
paramDict = {
    "DataFilter": {
        "tableField": [{
            "table": "GL_LEDGERS",
            "field": "NAME"
        }],
        "value": ["ABC."]
    }
}
Now I want to use a "like" instead of the "isin" condition, so that the data gets filtered for "ABC" as well as "ABC.":
DataFilter = df['NAME'].isin(
    pd.Series(paramDict['DataFilter']['value']))
df = df[DataFilter]
Can you please help me with this? I am using Python 2.7. Thanks.
I assume your Series is of string type.
If so, you can use .str.contains:
DataFilter = df['NAME'].str.contains('ABC')
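To drive the filter from paramDict instead of a hardcoded literal, one option is a sketch along these lines: strip the trailing dot so that "ABC." matches names containing either "ABC" or "ABC.", and escape each prefix so it is matched literally rather than as a regex:

import re

values = paramDict['DataFilter']['value']           # e.g. ["ABC."]
prefixes = [v.rstrip('.') for v in values]          # "ABC." -> "ABC"
pattern = '|'.join(re.escape(p) for p in prefixes)  # literal match, no regex surprises
DataFilter = df['NAME'].str.contains(pattern)
df = df[DataFilter]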
