pandas replace under condition issue - python

I want to replace values in the Severity column with levels 1-4 when the string contains one of the phrases (dtd), (dnp), (out indefinitely), or (out for season).
I tried to replace them using a dictionary, but it doesn't change the elements in the column:
df['Severity'] = ""
df['Severity'] = df['Notes']
replace_dict = {'DTD':1,'DNP':2,'out indefinitely':3,'out for season':4}
df['Severity'] = df['Severity'].replace(replace_dict)
I am cleaning NBA injury data from the 2018-19 season.
The frame was shown as an image in the original post.

I would build a custom function and then apply it to the string column:
import re
def replace_severity(s):
    """Search string `s` for the keys in `replace_dict` and return the matching value."""
    # define the keys + matches
    replace_dict = {'DTD':1,'DNP':2,'out indefinitely':3,'out for season':4}
    for key, value in replace_dict.items():
        # re.I makes the search case-insensitive
        if re.search(key, s, re.I):
            return value
    # if no results
    return None
# Apply this function to your column
df['Severity'] = df['Notes'].apply(replace_severity)
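The dictionary form of replace did not work because, without regex=True, Series.replace only matches whole cell values, not substrings. As an alternative to apply, here is a minimal vectorized sketch using Series.str.contains and numpy.select. The Notes values below are made-up sample rows, not the asker's data, and unmatched rows are assumed to become NaN.

```python
import numpy as np
import pandas as pd

# Hypothetical sample rows standing in for the real injury notes.
df = pd.DataFrame({"Notes": [
    "placed on IL with sore knee (DTD)",
    "sprained ankle (DNP)",
    "torn ACL (out for season)",
    "back surgery (out indefinitely)",
    "activated from IL",
]})

severity_levels = {"DTD": 1, "DNP": 2, "out indefinitely": 3, "out for season": 4}

# One boolean mask per phrase: plain substring search, case-insensitive.
conditions = [
    df["Notes"].str.contains(phrase, case=False, regex=False, na=False)
    for phrase in severity_levels
]

# The first matching condition wins; rows with no match get NaN.
df["Severity"] = np.select(conditions, list(severity_levels.values()), default=np.nan)
print(df)
```

Because NaN forces a float column, the levels come back as 1.0 to 4.0; convert with astype("Int64") if nullable integers are preferred.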

Related

Counting combinations in Dataframe create new Dataframe

I have a dataframe called reactions_drugs,
and I want to create a table called new_r_d that keeps track of how often I see a symptom for a given medication.
Here is the code I have, but I am running into errors such as "Unable to coerce to Series, length must be 3 given 0":
new_r_d = pd.DataFrame(columns = ['drugname', 'reaction', 'count'])
for i in range(len(reactions_drugs)):
    name = reactions_drugs.drugname[i]
    drug_rec_act = reactions_drugs.drug_rec_act[i]
    for rec in drug_rec_act:
        row = new_r_d.loc[(new_r_d['drugname'] == name) & (new_r_d['reaction'] == rec)]
        if row == []:
            # create new row
            new_r_d.append({'drugname': name, 'reaction': rec, 'count': 1})
        else:
            new_r_d.at[row,'count'] += 1
Assuming the rows in your current reactions (drug_rec_act) column contain one string enclosed in a list, you can convert the values in that column to lists of strings (by splitting each string on the comma delimiter) and then utilize the explode() function and value_counts() to get your desired result:
df['drug_rec_act'] = df['drug_rec_act'].apply(lambda x: x[0].split(','))
df_long = df.explode('drug_rec_act')
result = df_long.groupby('drugname')['drug_rec_act'].value_counts().reset_index(name='count')
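To make the split-and-explode approach concrete, here is a small self-contained sketch. The drug names and reactions are invented sample data, and the counting step uses groupby(...).size() rather than value_counts(), which should give the same counts; the rename to 'reaction' is only there to match the column names the question asked for.

```python
import pandas as pd

# Hypothetical sample: each cell holds one comma-separated string inside a list.
reactions_drugs = pd.DataFrame({
    "drugname": ["aspirin", "aspirin", "ibuprofen"],
    "drug_rec_act": [["nausea,headache"], ["nausea"], ["rash,nausea"]],
})

# Split the single string into a list of reactions, then give each reaction its own row.
long_df = reactions_drugs.assign(
    drug_rec_act=reactions_drugs["drug_rec_act"].apply(lambda cell: cell[0].split(","))
).explode("drug_rec_act")

# Count each (drug, reaction) pair.
new_r_d = (
    long_df.groupby(["drugname", "drug_rec_act"])
    .size()
    .reset_index(name="count")
    .rename(columns={"drug_rec_act": "reaction"})
)
print(new_r_d)
```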

Change Column values in pandas applying another function

I have a data frame in pandas, one of the columns contains time intervals presented as strings like 'P1Y4M1D'.
The example of the whole CSV:
oci,citing,cited,creation,timespan,journal_sc,author_sc
0200100000236252421370109080537010700020300040001-020010000073609070863016304060103630305070563074902,"10.1002/pol.1985.170230401","10.1007/978-1-4613-3575-7_2",1985-04,P2Y,no,no
...
I created a parsing function that takes a string such as 'P1Y4M1D' and returns an integer.
How can I change all the column values to parsed values using that function?
def do_process_citation_data(f_path):
    global my_ocan
    my_ocan = pd.read_csv("citations.csv",
                          names=['oci', 'citing', 'cited', 'creation', 'timespan', 'journal_sc', 'author_sc'],
                          parse_dates=['creation', 'timespan'])
    my_ocan = my_ocan.iloc[1:] # to remove the first row iloc - to select data by row numbers
    my_ocan['creation'] = pd.to_datetime(my_ocan['creation'], format="%Y-%m-%d", yearfirst=True)
    return my_ocan
def parse():
    mydict = dict()
    mydict2 = dict()
    i = 1
    r = 1
    for x in my_ocan['oci']:
        mydict[x] = str(my_ocan['timespan'][i])
        i +=1
    print(mydict)
    for key, value in mydict.items():
        is_negative = value.startswith('-')
        if is_negative:
            date_info = re.findall(r"P(?:(\d+)Y)?(?:(\d+)M)?(?:(\d+)D)?$", value[1:])
        else:
            date_info = re.findall(r"P(?:(\d+)Y)?(?:(\d+)M)?(?:(\d+)D)?$", value)
        year, month, day = [int(num) if num else 0 for num in date_info[0]] if date_info else [0,0,0]
        daystotal = (year * 365) + (month * 30) + day
        if not is_negative:
            #mydict2[key] = daystotal
            return daystotal
        else:
            #mydict2[key] = -daystotal
            return -daystotal
    #print(mydict2)
    #return mydict2
I probably do not even need to replace the whole column with the parsed values. The final goal is to write a new function that returns the average ['timespan'] of documents created in a particular year. Since I need parsed values, I thought it would be easier to change the whole column and work with the new data frame.
I am also curious how to apply the parsing function to each ['timespan'] row without modifying the data frame. I assume it could be something like this, but I don't fully understand how to do it:
for x in my_ocan['timespan']:
    x = parse(str(my_ocan['timespan']))
How can I get a column with new values? Thank you! Peace :)
df['timespan'].apply(parse) (as mentioned by @Dan) should work. You only need to modify the parse function so that it receives the string as an argument and returns the parsed value at the end. Something like this:
import pandas as pd
def parse_postal_code(postal_code):
    # Splitting postal code and getting first letters
    letters = postal_code.split('_')[0]
    return letters
# Example dataframe with three columns and three rows
df = pd.DataFrame({'Age': [20, 21, 22], 'Name': ['John', 'Joe', 'Carla'], 'Postal Code': ['FF_222', 'AA_555', 'BB_111']})
# This returns a new pd.Series
print(df['Postal Code'].apply(parse_postal_code))
# Can also be assigned to another column
df['Postal Code Letter'] = df['Postal Code'].apply(parse_postal_code)
print(df['Postal Code Letter'])
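Applied to the timespan question itself, the same pattern might look like the sketch below. It reuses the question's regular expression and its 365-day year and 30-day month approximation. The function name parse_timespan, the timespan_days and year columns, and the three sample rows are assumptions for illustration, and timespan is kept as plain strings rather than parsed as dates.

```python
import re
import pandas as pd

# Optional leading '-', then 'P' followed by optional years, months and days.
DURATION_RE = re.compile(r"^(-)?P(?:(\d+)Y)?(?:(\d+)M)?(?:(\d+)D)?$")

def parse_timespan(value):
    """Convert a string such as 'P1Y4M1D' to an approximate number of days."""
    match = DURATION_RE.match(str(value))
    if match is None:
        return None  # unrecognised format
    sign, years, months, days = match.groups()
    total = int(years or 0) * 365 + int(months or 0) * 30 + int(days or 0)
    return -total if sign else total

# Hypothetical sample rows shaped like the CSV in the question.
my_ocan = pd.DataFrame({
    "creation": ["1985-04", "1985-06", "1990-01"],
    "timespan": ["P2Y", "P1Y4M1D", "-P3M"],
})

# New column with parsed values; the original column is left untouched.
my_ocan["timespan_days"] = my_ocan["timespan"].apply(parse_timespan)

# Average timespan (in days) of documents created in each year.
my_ocan["year"] = my_ocan["creation"].str[:4].astype(int)
average_by_year = my_ocan.groupby("year")["timespan_days"].mean()
print(average_by_year)
```

Writing the result to a separate column keeps the raw strings available, which helps when checking values the pattern does not recognise.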

How do I fix the For Loop to return a certain character from a DataFrame?

I have imported an Excel file into a DataFrame and iterated over a column called "Titles" to pick out titles containing certain keywords. I have the list of those titles as "match_titles". What I want to do now is create a for loop that returns the value in the column before "Titles" for each title in "match_titles". I'm not sure why the code is not working. Any help would be appreciated.
import pandas as pd
data = pd.read_excel(r'C:\Users\bryanmccormack\Downloads\asin_list.xlsx')
df = pd.DataFrame(data, columns=['Track','Asin','Title'])
excludes = ["Chainsaw", "Diaper pail", "Leaf Blower"]
my_excludes = [set(key_word.lower().split()) for key_word in excludes]
match_titles = [e for e in df.Title if
                any(keywords.issubset(e.lower().split()) for keywords in my_excludes)]
a = []
for i in match_titles:
    a.append(df['Asin'])
print(a)
In your for loop you are appending the whole, unfiltered column df['Asin'] to your list a once for every value in match_titles. The DataFrame itself is never filtered.
One solution is to add a boolean column that marks the matching titles, then filter on that column and return the Asin column:
# make a function to perform your match analysis.
def is_match(title, excludes=["Chainsaw", "Diaper pail", "Leaf Blower"]):
    my_excludes = [set(key_word.lower().split()) for key_word in excludes]
    if any(keywords.issubset(title.lower().split()) for keywords in my_excludes):
        return True
    return False
# Make a new boolean column for the matches. This applies your
# function to each value in df['Title'] and puts the output in
# the new column.
df['match_titles'] = df['Title'].apply(is_match)
# Filter the df to only matches and return the column you want.
# Because the match_titles column is boolean it can be used as
# an index.
result = df[df['match_titles']]['Asin']
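For a quick check of the boolean-mask idea without the Excel file, here is a self-contained sketch. The four rows are invented stand-ins for asin_list.xlsx, and the mask is kept in a variable instead of a new column, which is only a stylistic choice.

```python
import pandas as pd

# Hypothetical sample rows standing in for asin_list.xlsx.
df = pd.DataFrame({
    "Track": [1, 2, 3, 4],
    "Asin": ["B001", "B002", "B003", "B004"],
    "Title": [
        "Electric Chainsaw 16 inch",
        "Baby diaper changing pail",
        "Garden hose 50 ft",
        "Cordless Leaf Blower",
    ],
})

excludes = ["Chainsaw", "Diaper pail", "Leaf Blower"]
keyword_sets = [set(phrase.lower().split()) for phrase in excludes]

def is_match(title):
    """True if every word of at least one exclude phrase appears in the title."""
    words = set(title.lower().split())
    return any(keywords <= words for keywords in keyword_sets)

# Boolean Series aligned with df; .loc selects matching rows and the Asin column.
mask = df["Title"].apply(is_match)
matched_asins = df.loc[mask, "Asin"].tolist()
print(matched_asins)
```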

How to prevent multi value dictionary object from splitting each word into individual letter strings?

I have a dictionary object that looks like this:
my_dict = {123456789123: ('a', 'category'),
123456789456:('bc','subcategory'),123456789678:('c_d','subcategory')}
The code below extracts an integer from each column header in a df, looks it up as a key in the dictionary, and creates a new dataframe whose columns are the second element of each tuple and whose values are the first element.
Code:
names = df.columns.values
new_df = pd.DataFrame()
for name in names:
    if ('.value.' in name) and df[name][0]:
        last_number = int(name[-13:])
        print(last_number)
        key, value = my_dict[last_number]
        try:
            new_df[value][0] = list(new_df[value][0]) + [key]
        except:
            new_df[value] = [key]
new_df:
category subcategory
0 a [b, c, c_d]
I am not sure what is causing it in my code, but how do I prevent bc from being split up?
edit:
example df from above:
data.value.123456789123 data.value.123456789456 data.value.123456789678
TRUE TRUE TRUE
new_df should look like this:
category subcategory
0 a [bc, c_d]
list(new_df[value][0]) breaks a string into a list of characters, which is why you get the individual characters: list('bc') is ['b', 'c'].
Replace list(new_df[value][0]) with [new_df[value][0]]. Or, better, replace list(new_df[value][0]) + [key] with [new_df[value][0], key].
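The difference between converting a string with list() and wrapping it in a list can be seen in isolation. The sketch below also shows one way to collect the labels per kind with a plain dict before building any DataFrame; the names grouped, label, and kind are illustrative and not from the question.

```python
my_dict = {
    123456789123: ("a", "category"),
    123456789456: ("bc", "subcategory"),
    123456789678: ("c_d", "subcategory"),
}

# list() iterates over the string, so each character becomes an element.
split_chars = list("bc")       # ['b', 'c']
# Wrapping in brackets keeps the string whole.
wrapped = ["bc"] + ["c_d"]     # ['bc', 'c_d']

# Collect labels per kind without ever calling list() on a string.
grouped = {}
for number, (label, kind) in my_dict.items():
    grouped.setdefault(kind, []).append(label)

print(split_chars, wrapped, grouped)
```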
Alternatively, use the DataFrame constructor and groupby:
df=pd.DataFrame(list(my_dict.values()))
df.groupby(1)[0].apply(list).to_frame(0).T
1 category subcategory
0 [a] [bc, c_d]

Extract string from data frame in python

I am new to Python and I would like to extract a string from my data frame. The data frame was shown as an image in the original post. The task is:
Which state has the most counties in it?
Unfortunately, I could not extract a string. Here is my code:
import pandas as pd
census_df = pd.read_csv('census.csv')
def answer_five():
    return census_df[census_df['COUNTY']==census_df['COUNTY'].max()]['STATE']
answer_five()
How about this:
import pandas as pd
census_df = pd.read_csv('census.csv')
def answer_five():
    """
    Returns the 'STATE' corresponding to the max 'COUNTY' value
    """
    max_county = census_df['COUNTY'].max()
    s = census_df.loc[census_df['COUNTY']==max_county, 'STATE']
    return s
answer_five()
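This should output a pd.Series object holding the 'STATE' value(s) where 'COUNTY' is at its maximum. If you only want the value and not the Series (as your question stated, and since in your image there is only one max value for COUNTY), then return s.iloc[0] (instead of return s) should do. Note that s[0] looks up the index label 0, which is usually not present after filtering, so positional access with .iloc is safer.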
def answer_five():
    return census_df.groupby('STNAME')['COUNTY'].nunique().idxmax()
You can group the data by state name, count the unique counties in each group, and return the index label (the state name) of the largest count.
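A small sketch of the groupby approach on made-up data shows that idxmax returns the group label itself, so the result is a plain string rather than a Series. The five rows below are invented and only mimic the STNAME and COUNTY columns.

```python
import pandas as pd

# Hypothetical miniature of census.csv: one row per county.
census_df = pd.DataFrame({
    "STNAME": ["Alabama", "Alabama", "Texas", "Texas", "Texas"],
    "COUNTY": [1, 3, 1, 3, 5],
})

# Number of distinct county codes per state, indexed by state name.
counties_per_state = census_df.groupby("STNAME")["COUNTY"].nunique()

# idxmax returns the index label of the largest value, here a state name.
state_with_most_counties = counties_per_state.idxmax()
print(state_with_most_counties)
```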
I had the same issue. I tried using .item() and managed to extract the exact value I needed.
In your case it would look like:
return census_df[census_df['COUNTY'] == census_df['COUNTY'].max()]['STATE'].item()
