I am using a conditional statement like the code below: if the address is NJ, the value of the name column is changed to 'N/A'.
df1.loc[df1.Address.isin(['NJ']), 'name'] = 'N/A'
How do I do the same, if I have 'nested if statements' like below?
# this is not code, just representing the logic
if address isin ('NJ', 'NY'):
    if name1 isin ('john', 'bob'):
        name1 = 'N/A'
    if name2 isin ('mayer', 'dylan'):
        name2 = 'N/A'
Can I achieve above logic using df.loc? Or is there any other way to do it?
Separate assignments, as shown by @MartijnPieters, are a good idea for a small number of conditions.
For a large number of conditions, consider using numpy.select to separate your conditions and choices. This should make your code more readable and easier to maintain.
For example:
import pandas as pd
import numpy as np

df = pd.DataFrame({'address': ['NY', 'CA', 'NJ', 'NY', 'WS'],
                   'name1': ['john', 'mayer', 'dylan', 'bob', 'mary'],
                   'name2': ['mayer', 'dylan', 'mayer', 'bob', 'bob']})
address_mask = df['address'].isin(('NJ', 'NY'))
conditions = [address_mask & df['name1'].isin(('john', 'bob')),
              address_mask & df['name2'].isin(('mayer', 'dylan'))]
choices = ['Option 1', 'Option 2']
df['result'] = np.select(conditions, choices)
print(df)
  address  name1  name2    result
0      NY   john  mayer  Option 1
1      CA  mayer  dylan         0
2      NJ  dylan  mayer  Option 2
3      NY    bob    bob  Option 1
4      WS   mary    bob         0
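Note that np.select fills rows where no condition matches with 0 by default, which is why the unmatched rows above show 0. If you would rather use your own placeholder, you can pass the default argument (a small sketch; 'No Match' is just an illustrative value):
df['result'] = np.select(conditions, choices, default='No Match')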
Use separate assignments. You have different conditions to filter on; you can combine the address filter and each name* filter with & (but put parentheses around each test):
df1.loc[(df1.Address.isin(['NJ', 'NY'])) & (df1.name1.isin(['john', 'bob'])), 'name1'] = 'N/A'
df1.loc[(df1.Address.isin(['NJ', 'NY'])) & (df1.name2.isin(['mayer', 'dylan'])), 'name2'] = 'N/A'
You can always store the boolean filters in variables first:
address_filter = df1.Address.isin(['NJ', 'NY'])
name1_filter = df1.name1.isin(['john', 'bob'])
name2_filter = df1.name2.isin(['mayer', 'dylan'])
df1.loc[address_filter & name1_filter, 'name1'] = 'N/A'
df1.loc[address_filter & name2_filter, 'name2'] = 'N/A'
I would like to go through every row (entry) in my df and remove every entry that has the value "" (which, yes, is an empty string).
So if my data set is:
Name  Gender  Age
Jack          5
Anna  F       6
Carl  M       7
Jake  M       7
Therefore Jack would be removed from the dataset.
On another note, I would also like to remove entries that have the value "Unspecified" or "Undetermined" as well.
Eg:
Name  Gender  Age  Address
Jack          5    *address*
Anna  F       6    *address*
Carl  M       7    Undetermined
Jake  M       7    Unspecified
Now,
Jack will be removed due to empty field.
Carl will be removed due to the value Undetermined present in a column.
Jake will be removed due to the value Unspecified present in a column.
For now, this has been my approach but I keep getting a TypeError.
list = []
for i in df.columns:
    if df[i] == "":
        # every time there is an empty string, add 1 to list
        list.append(1)

# count list to see how many entries there are with empty string
len(list)
Please help me with this. I would prefer a for loop, since there are about 22 columns and 9000+ rows in my actual dataset.
Note - I do understand that there are other questions like this; it's just that none of them apply to my situation, since most of them only work for a few columns and I do not wish to hardcode all 22 columns.
Edit - Thank you for all your feedbacks, you all have been incredibly helpful.
To delete a row based on a condition use the following:
df = df.drop(df[condition].index)
For example:
For example, df = df.drop(df[df.Age == 5].index) will drop the rows where Age is 5.
I've come across a post about the same problem dating back to 2017; it should help you understand it more clearly.
Regarding question 2, here's how to remove rows with the specified values in a given column:
df = df[~df["Address"].isin(("Undetermined", "Unspecified"))]
Let's assume we have a Pandas DataFrame object df.
To remove every row given your conditions, simply do:
df = df[~((df.Gender == "") | (df.Age == "") | df.Address.isin(["", "Undetermined", "Unspecified"]))]
If the unspecified fields are NaN, you can also do:
df = df.dropna(how="any", axis=0)
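If the blanks are empty strings rather than NaN, a small sketch of the same idea (assuming numpy is imported as np) is to convert them to NaN first and then drop:
import numpy as np

df = df.replace("", np.nan).dropna(how="any", axis=0)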
Answers from @ThatCSFresher or @Bence will help you out in removing rows based on a single column... which is great! However, I think your query has multiple conditions that need to be checked across multiple columns at once, so an apply with a lambda can do the job; try the following code:
df = pd.DataFrame({"Name": ["Jack", "Anna", "Carl", "Jake"],
                   "Gender": ["", "F", "M", "M"],
                   "Age": [5, 6, 7, 7],
                   "Address": ["address", "address", "Undetermined", "Unspecified"]})

df["Noise_Tag"] = df.apply(lambda x: "Noise" if ("" in list(x)) or ("Undetermined" in list(x)) or ("Unspecified" in list(x)) else "No Noise", axis=1)
df1 = df[df["Noise_Tag"] == "No Noise"]
del df1["Noise_Tag"]
# Output of df:
   Name Gender  Age       Address Noise_Tag
0  Jack            5       address     Noise
1  Anna      F    6       address  No Noise
2  Carl      M    7  Undetermined     Noise
3  Jake      M    7   Unspecified     Noise

# Output of df1:
   Name Gender  Age  Address
1  Anna      F    6  address
Well, the OP actually wants to delete any row with an "empty" string.
df = df[~(df=="").any(axis=1)] # deletes all rows that have empty string in any column.
If you want to delete based specifically on the Address column, then you can just delete using
df = df[~df["Address"].isin(("Undetermined", "Unspecified"))]
Or, for any column with "Undetermined" or "Unspecified", try something similar to the first solution in my post, just replacing the empty string with "Undetermined" or "Unspecified".
df = df[~((df=="Undetermined") | (df=="Unspecified")).any(axis=1)]
You can build masks and then filter the df according to it:
m1 = df.eq('').any(axis=1)
# m1 is True if any cell in a row has an empty string
m2 = df['Address'].isin(['Undetermined', 'Unspecified'])
# m2 is True if a row has one of the values in the list in column 'Address'
out = df[~m1 & ~m2] # invert both conditions to get the desired output
print(out)
Output:
   Name Gender  Age    Address
1  Anna      F    6  *address*
Used Input:
df = pd.DataFrame({'Name': ['Jack', 'Anna', 'Carl', 'Jake'],
                   'Gender': ['', 'F', 'M', 'M'],
                   'Age': [5, 6, 7, 7],
                   'Address': ['*address*', '*address*', 'Undetermined', 'Unspecified']})
Using a lambda function
Code:
df[df.apply(lambda x: False if (x.Address in ['Undetermined', 'Unspecified'] or '' in list(x)) else True, axis=1)]
Output:
   Name Gender  Age    Address
1  Anna      F    6  *address*
One of the columns I'm importing into my dataframe is structured as a list. I need to pick out certain values from said list, transform the value and add it to one of two new columns in the dataframe. Before:
Name   Listed_Items
Tom    ["dr_md_coca_cola", "dr_od_water", "potatoes", "grass", "ot_other_stuff"]
Steve  ["dr_od_orange_juice", "potatoes", "grass", "ot_other_stuff", "dr_md_pepsi"]
Phil   ["dr_md_dr_pepper", "potatoes", "grass", "dr_od_coffee", "ot_other_stuff"]
From what I've read I can turn the column into a list
df["listed_items"] = df["listed_items"].apply(eval)
But then I cannot see how to find the list item that starts with dr_md, extract it, remove the leading dr_md, replace the underscores with spaces, capitalize each word, and add the result to a new MD column in that row, and then do the same again for dr_od. There is only one item starting with dr_md and one starting with dr_od in each row. Desired output:
Name   MD         OD
Tom    Coca Cola  Water
Steve  Pepsi      Orange Juice
Phil   Dr Pepper  Coffee
What you need to do is write a function that does the processing for you and pass it into apply (or, in this case, map). Alternatively, you could expand your list column into multiple columns and then process them afterwards, but that will only work if your lists are always in the same order (see panda expand columns with list into multiple columns). Because you only have one input column, you can use map instead of apply.
def process_dr_md(l: list):
    for s in l:
        if s.startswith("dr_md_"):
            # You can process your string further here
            return s[6:]

def process_dr_od(l: list):
    for s in l:
        if s.startswith("dr_od_"):
            # You can process your string further here
            return s[6:]

df["listed_items"] = df["listed_items"].map(eval)
df["MD"] = df["listed_items"].map(process_dr_md)
df["OD"] = df["listed_items"].map(process_dr_od)
I hope that gets you on your way!
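If you also want the formatting described in the question (underscores to spaces, each word capitalized), the processing step could be filled in like this (a sketch; str.title capitalizes every word, which matches the desired output above):
def process_dr_md(l: list):
    for s in l:
        if s.startswith("dr_md_"):
            # 'dr_md_coca_cola' -> 'coca cola' -> 'Coca Cola'
            return s[6:].replace("_", " ").title()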
Use pivot_table
df = df.explode('Listed_Items')
df = df[df.Listed_Items.str.contains('dr_')]
df['Type'] = df['Listed_Items'].str.contains('dr_md').map({True: 'MD', False: 'OD'})

df.pivot_table(values='Listed_Items',
               columns='Type',
               index='Name',
               aggfunc='first')
Type                MD                  OD
Name
Phil   dr_md_dr_pepper        dr_od_coffee
Steve      dr_md_pepsi  dr_od_orange_juice
Tom    dr_md_coca_cola         dr_od_water
From here it's just a matter of beautifying your dataset as your wish.
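For instance, one possible beautification (a sketch, assuming the pivoted frame is stored in a variable named out) strips the prefixes and title-cases the values:
out = df.pivot_table(values='Listed_Items', columns='Type', index='Name', aggfunc='first')
# 'dr_md_coca_cola' -> 'coca_cola' -> 'Coca Cola'
out = out.apply(lambda col: col.str.replace(r'^dr_(md|od)_', '', regex=True)
                               .str.replace('_', ' ')
                               .str.title())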
I took a slightly different approach from the previous answers.
Given a df of the form:
Name Items
0 Tom [dr_md_coca_cola, dr_od_water, potatoes, grass...
1 Steve [dr_od_orange_juice, potatoes, grass, ot_other...
2 Phil [dr_md_dr_pepper, potatoes, grass, dr_od_coffe...
and making the following assumptions:
only one item in a list matches the target mask
the target mask always appears at the start of the entry string
I created the following function to parse the list:
import re

def parse_Items(tgt_mask: str, itmList: list) -> str:
    p = re.compile(tgt_mask)
    for itm in itmList:
        m = p.match(itm)
        if m:
            return itm[m.end():].replace('_', ' ')
Then you can modify your original data frame with the following:
df['MD'] = [parse_Items('dr_md_', x) for x in df['Items'].to_list()]
df['OD'] = [parse_Items('dr_od_', x) for x in df['Items'].to_list()]
df.pop('Items')
This produces the following:
    Name         MD            OD
0    Tom  coca cola         water
1  Steve      pepsi  orange juice
2   Phil  dr pepper        coffee
I would normalize the data before putting it in a DataFrame:
import pandas as pd
from typing import Dict, List, Tuple

def clean_stuff(text: str):
    clean_text = text[6:].replace('_', ' ')
    return " ".join([
        word.capitalize()
        for word in clean_text.split(" ")
    ])

def get_md_od(stuffs: List[str]) -> Tuple[str, str]:
    md_od = [s for s in stuffs if s.startswith(('dr_md', 'dr_od'))]
    md_od = sorted(md_od)  # 'dr_md...' sorts before 'dr_od...'
    print(md_od)
    return clean_stuff(md_od[0]), clean_stuff(md_od[1])
dirty_stuffs = [{'Name': 'Tom',
                 'Listed_Items': ["dr_md_coca_cola",
                                  "dr_od_water",
                                  "potatoes",
                                  "grass",
                                  "ot_other_stuff"]},
                {'Name': 'Tom',
                 'Listed_Items': ["dr_md_coca_cola",
                                  "dr_od_water",
                                  "potatoes",
                                  "grass",
                                  "ot_other_stuff"]}
                ]
normalized_stuff: List[Dict[str, str]] = []
for stuff in dirty_stuffs:
    md, od = get_md_od(stuff['Listed_Items'])
    normalized_stuff.append({
        'Name': stuff['Name'],
        'MD': md,
        'OD': od,
    })

df = pd.DataFrame(normalized_stuff)
print(df)
So, I have a pandas data frame where one column contains the description of the nationality of a user and I want to replace this whole description with the country he's from.
My inputs are the df and the list of countries:
Description                  ID
I am from Atlantis           1
My family comes from Narnia  2
["narnia","uzbekistan","Atlantis",...]
I know that:
I only have one country per description
the description either contains the name of a country or it does not; there is no need to infer the country from what the user says. I only want to map [phrase containing name of country] to [country].
If I had only one country to replace I could use something like
df.loc[df['description'].str.contains('Atlantis', case=False), 'description'] = 'Atlantis'
I know that, because the country names are organised in a list, I could cycle through it and apply this to all the elements, something like:
for country in country_list:
df.loc[df['description'].str.contains(country, case=False), 'description'] = country
but it seems quite unpythonic to me, so I was wondering if anyone could help me find a better way (which I'm sure exists).
The output should be:
Description  ID
Atlantis     1
Narnia       2
You can use pd.Series.str.extract:
import re

country_list = ["narnia", "uzbekistan", "Atlantis"]
df = pd.DataFrame({'Description': {0: 'I am from Atlantis',
                                   1: 'My family comes from Narnia'},
                   'ID': {0: 1, 1: 2}})

print(df["Description"].str.extract(f"({'|'.join(country_list)})", flags=re.I))
          0
0  Atlantis
1    Narnia
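Since the goal is to replace the description itself, you can assign the extracted values back to the column (a sketch; the [0] selects the first capture group's column from the frame str.extract returns):
df['Description'] = df['Description'].str.extract(f"({'|'.join(country_list)})", flags=re.I)[0]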
I have a dataframe like this:
   Title                Participants
0  ShowA            B. Smith,C. Ball
1  ShowB                   T. Smooth
2  ShowC  K. Dulls,L. Allen,B. Smith
I'm splitting on , in the Participants column and creating a list for each cell. Next, I check for specific participant(s) in each list. In this example, I'm checking for either B. Smith or K. Dulls
for item in df['Participants']:
    listX = item.split(',')
    if 'B. Smith' in listX or 'K. Dulls' in listX:
        print(listX)
This returns:
['B. Smith', 'C. Ball']
['K. Dulls', 'L. Allen', 'B. Smith']
1) I'm guessing there is a cleaner way to check for multiple participants, in my if statement. I'd love any suggestions.
2) This is where I've been spinning in circles: how do I return the Title associated with the list(s) I return?
In this example, i'd like to return:
ShowA
ShowC
Setup code:
import pandas as pd

df = pd.DataFrame(data={'Title': ['ShowA', 'ShowB', 'ShowC'],
                        'Participants': ['B. Smith,C. Ball', 'T. Smooth', 'K. Dulls,L. Allen,B. Smith']})
target_participants = ['B. Smith', 'K. Dulls']
get_dummies
You can use pandas.Series.str.get_dummies to create a dataframe where each column is a boolean indicator of whether a name is present.
dummies = df.Participants.str.get_dummies(',').astype(bool)
dummies
   B. Smith  C. Ball  K. Dulls  L. Allen  T. Smooth
0      True     True     False     False      False
1     False    False     False     False       True
2      True    False      True      True      False
Then we can find your result
df.loc[dummies['B. Smith'] | dummies['K. Dulls'], 'Title']
0 ShowA
2 ShowC
Name: Title, dtype: object
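If the names you are looking for are already collected in a list, such as target_participants from the setup code, a sketch of the same lookup without writing out each | term:
df.loc[dummies[target_participants].any(axis=1), 'Title']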
contains
Otherwise, you can use pandas.Series.str.contains. First we'll need to specify the people you are looking for in a list and then construct a string to use as a regular expression.
people_to_look_for = ['B. Smith', 'K. Dulls']
pattern = '|'.join(people_to_look_for)
mask = df.Participants.str.contains(pattern)
df.loc[mask, 'Title']
0 ShowA
2 ShowC
Name: Title, dtype: object
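One caveat: these names contain '.', which is a regex metacharacter, so the pattern 'B. Smith' would also match a hypothetical 'Bo Smith'. If that matters for your data, escaping the names first is safer (a sketch):
import re

pattern = '|'.join(map(re.escape, people_to_look_for))
mask = df.Participants.str.contains(pattern)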
I'm not sure how great the performance for this would be, although I imagine it's worth the investment if you keep the elements of the 'Participants' column as lists.
import pandas as pd

df = pd.DataFrame(data={'Title': ['ShowA', 'ShowB', 'ShowC'],
                        'Participants': ['B. Smith,C. Ball', 'T. Smooth', 'K. Dulls,L. Allen,B. Smith']})
target_participants = {'B. Smith', 'K. Dulls'}
df['Participants'] = df['Participants'].str.split(',')
print(df, end='\n\n')
contains_parts = ~df['Participants'].map(target_participants.isdisjoint)
print(contains_parts)
Output:
   Title                    Participants
0  ShowA             [B. Smith, C. Ball]
1  ShowB                     [T. Smooth]
2  ShowC  [K. Dulls, L. Allen, B. Smith]

0     True
1    False
2     True
Name: Participants, dtype: bool
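contains_parts is just a boolean mask, so to recover the titles the question asks for, index the frame with it:
print(df.loc[contains_parts, 'Title'])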
I have stumbled upon a seemingly trivial problem in pandas. I have two dataframes. The first one, df_1, is as follows:
vendor_name  date           company_name    state
PERTH        is june 2019   Abc enterprise  Kentucky
Megan Ent    25-april-2019  Xyz Fincorp     Texas
The second one, df_2, contains the correct values for each column in df_1:
Field         wrong value     correct value
vendor_name   PERTH           Perth Enterprise
date          is              15    ## this means that 'is' should be read as '15'
company_name  Abc enterprise  ABC International Enterprise Inc.
In order to replace the values with the correct ones in df_1 (except the date field), I am using the pandas .loc method. Below is the code snippet:
vend = df_1['vendor_name'].tolist()
comp = df_1['company_name'].tolist()
state = df_1['state'].tolist()

for i in vend:
    if df_2['wrong value'].str.contains(i):
        crct = df_2.loc[df_2['wrong value'] == i, 'correct value'].tolist()
Similarly, I have followed the same approach for company and state.
However, crct is returning a blank series. Ideally it should return
['Perth Enterprise','Abc International Enterprise Inc']
The next step would be to replace the respective field values by the above list.
With the above, I have three questions:
Why is the above code generating a blank list? What am I missing here?
How can I replace the respective fields using the df_1.replace method?
What would be a correct approach to replace the portion of the date in df_1 with the correct value from df_2?
Edit: when the data has looping replacements (i.e. overlapping keys and values), replacement on the whole dataframe will fail. In that case, do it column by column and concat the results together. Finally, use join to add back any missing columns from df1:
df_replace = pd.concat([df1[k].replace(val, regex=True) for k, val in d.items()], axis=1).join(df1.state)
Original:
I tried your code in my interactive session and it raises ValueError: The truth value of a Series is ambiguous on df_2['wrong value'].str.contains(i).
Assuming you have multiple vendor names, the simple way is to construct a dictionary from a groupby of df2 and use it with df.replace on df1:
d = {k: gp.set_index('wrong value')['correct value'].to_dict()
     for k, gp in df2.groupby('Field')}

Out[64]:
{'company_name': {'Abc enterprise': 'ABC International Enterprise Inc.'},
 'date': {'is': '15'},
 'vendor_name': {'PERTH': 'Perth Enterprise'}}
df_replace = df1.replace(d, regex=True)
print(df_replace)
        vendor_name           date                       company_name     state
0  Perth Enterprise   15 june 2019  ABC International Enterprise Inc.  Kentucky
1         Megan Ent  25-april-2019                        Xyz Fincorp     Texas
Note: your sample df2 only has a value for vendor PERTH, so only the first row is replaced. When you have all vendor_names in df2, it will replace them all in df1.
A simple way to do that is to iterate over the first dataframe and then replace the wrong values :
Result = pd.DataFrame()
for i in range(len(df1)):
vendor_name = df1.iloc[i]['vendor_name']
date = df1.iloc[i]['date']
company_name = df1.iloc[i]['company_name']
if vendor_name in df2['wrong value'].values:
vendor_name = df2.loc[df2['wrong value'] == vendor_name]['correct value'].values[0]
if company_name in df2['wrong value'].values:
company_name = df2.loc[df2['wrong value'] == company_name]['correct value'].values[0]
new_row = {'vendor_name':[vendor_name],'date':[date],'company_name':[company_name]}
new_row = pd.DataFrame(new_row,columns=['vendor_name','date','company_name'])
Result = Result.append(new_row,ignore_index=True)
Result:
        vendor_name           date                       company_name
0  Perth Enterprise   is june 2019  ABC International Enterprise Inc.
1         Megan Ent  25-april-2019                        Xyz Fincorp
Define the following replace function:
import re

def repl(row):
    fld = row.Field
    v1 = row['wrong value']
    v2 = row['correct value']
    updInd = df_1[df_1[fld].str.contains(v1)].index
    df_1.loc[updInd, fld] = df_1.loc[updInd, fld]\
        .str.replace(re.escape(v1), v2, regex=True)
Then call it for each row in df_2:
for _, row in df_2.iterrows():
    repl(row)
Note that str.replace alone does not require importing re (pandas imports it under the hood). But in the above function re.escape is called explicitly from our code, hence import re is required.
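Running the loop over the sample frames above should leave df_1 looking roughly like this (the 'is' in the date becomes '15', and the vendor and company names are corrected):
        vendor_name           date                       company_name     state
0  Perth Enterprise   15 june 2019  ABC International Enterprise Inc.  Kentucky
1         Megan Ent  25-april-2019                        Xyz Fincorp     Texas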