I have a dataframe from my "Big Year", and I would like to determine how many unique bird species each participant saw.
I've tried using a list comprehension and for loops to iterate over each row and determine if it's unique using .is_unique(), but that seems to be the source of much of my distress. I can get a list of all the unique species with .unique(), quite nicely, but I would like to somehow get the people associated with those birds.
df = pd.DataFrame({'Species':['woodpecker', 'woodpecker', 'dove', 'mockingbird'], 'Birder':['Steve', 'Ben','Ben','Greg']})
ben_unique_bird = [x for x in range(len(df['Species'])) if df['Birder'][x]=='Ben' and df['Species'][x].is_unique()]
# raises AttributeError: df['Species'][x] is a plain string, which has no .is_unique
Edit: I think I was unclear here. I want to get a list of birds that each person saw that no one else did. So the output would be something like (Steve, 0), (Ben, 1), (Greg, 1), in whatever format.
Thanks!
This can be done with list comprehension quite easily.
df = pd.DataFrame({'Species':['woodpecker', 'woodpecker', 'dove', 'mockingbird'], 'Birder':['Steve', 'Ben','Ben','Greg']})
pairs = list(zip(df['Birder'], df['Species']))
# Keep only the first occurrence of each (birder, species) pair
matches = [pair for i, pair in enumerate(pairs) if pair not in pairs[:i]]
This gives a list of tuples as output:
[('Steve', 'woodpecker'), ('Ben', 'woodpecker'), ('Ben', 'dove'), ('Greg', 'mockingbird')]
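To get from those pairs to the clarified goal (how many birds each person saw that no one else did), a Counter over the species works. A sketch building on the matches list above:
from collections import Counter

# How many times each species was recorded overall
species_counts = Counter(species for _, species in matches)
# Credit a birder once per species that only they recorded
unique_per_birder = Counter(birder for birder, species in matches if species_counts[species] == 1)
print(unique_per_birder)  # Counter({'Ben': 1, 'Greg': 1}); birders with none, like Steve, are absent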
# names of the unique birds they saw
ben_unique_bird = df[df['Birder'] == 'Ben']['Species'].unique()

# number of unique birds they saw
len(df[df['Birder'] == 'Ben']['Species'].unique())

# Recommended method to get a table
df.groupby(['Birder']).agg({"Species": lambda x: x.nunique()})

# same method broken down
for i in df['Birder'].unique():
    print("Name", i, "Distinct count", len(df[df['Birder'] == i]['Species'].unique()),
          "distinct bird names", df[df['Birder'] == i]['Species'].unique())
You can create a helper series by negating pd.Series.duplicated and then use GroupBy + sum:
counts = df.assign(unique_flag=~df['Species'].duplicated(keep=False))\
           .groupby('Birder')['unique_flag'].sum().astype(int)

for name, count in counts.items():
    print(f'{name} saw {count} bird(s) that no one else saw')
Result:
Ben saw 1 bird(s) that no one else saw
Greg saw 1 bird(s) that no one else saw
Steve saw 0 bird(s) that no one else saw
I figured out a terrible way of doing what I want, but it works. Please let me know if you have a more efficient way of doing this, because I know there has to be one.
data = pd.DataFrame({'Species':['woodpecker', 'woodpecker', 'dove', 'mockingbird'], 'Birder':['Steve', 'Ben','Ben','Greg']})
ben_birds = []
steve_birds = []
greg_birds = []
#get all the names of the birds that people saw and put them in a list
for index, row in data.iterrows():
    if row['Birder'] == 'Ben':
        ben_birds.append(row['Species'])
    elif row['Birder'] == 'Steve':
        steve_birds.append(row['Species'])
    else:
        greg_birds.append(row['Species'])
duplicates = []
#compare each of the lists to look for duplicates, and make a new list with those
for bird in ben_birds:
    if (bird in steve_birds) or (bird in greg_birds):
        duplicates.append(bird)
for bird in steve_birds:
    if (bird in greg_birds):
        duplicates.append(bird)
#if any of the duplicates are in a list, remove those birds
for bird in ben_birds[:]:  # iterate over a copy so .remove() can't skip items
    if bird in duplicates:
        ben_birds.remove(bird)
for bird in steve_birds[:]:
    if bird in duplicates:
        steve_birds.remove(bird)
for bird in greg_birds[:]:
    if bird in duplicates:
        greg_birds.remove(bird)
print(f'Ben saw {len(ben_birds)} Birds that no one else saw')
print(f'Steve saw {len(steve_birds)} Birds that no one else saw')
print(f'Greg saw {len(greg_birds)} Birds that no one else saw')
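For what it's worth, here is a shorter route (a sketch, assuming the data frame from the snippet above): drop every species recorded by more than one birder, then group what is left.
# Species that appear only once were seen by exactly one birder
solo = data[~data['Species'].duplicated(keep=False)]
print(solo.groupby('Birder')['Species'].apply(list))  # e.g. Ben -> ['dove'], Greg -> ['mockingbird']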
I would like to iterate through list of dictionaries in order to get a specific value, but I can't figure it out.
I've made a simplified version of what I've been working with. These lists are much longer, with more dictionaries in them, but for the sake of an example I hope this shortened dataset will be enough.
listOfResults = [{"29":2523,"30":626,"10":0,"32":128},{"29":2466,"30":914,"10":0,"32":69}]
For example, I need the values of the key "30" from the dictionaries above. I've managed to get those and stored them in a list of integers. ( [626, 914] )
These integers are basically IDs. After this, I need to get the value of these IDs from another list of dictionaries.
listOfTrack = [{"track_length": 1.26,"track_id": 626,"track_name": "Rainbow Road"},{"track_length": 6.21,"track_id": 914,"track_name": "Excalibur"}]
I would like to print/store the track_names and track_lengths of the IDs I've got from the listOfResults earlier. Unfortunately, I've ended up in a complete mess of for loops.
You want something like this:
ids = [626, 914]
result = { track for track in list_of_tracks if track.get("track_id") in ids }
I unfortunately can't comment on the answer given by Nathaniel Ford because I'm a new user so I just thought I'd share it here as an answer.
His answer is basically correct, but I believe you need to replace the curly braces with brackets or else you will get this error: TypeError: unhashable type: 'dict'
The answer should look like:
ids = [626, 914]
result = [track for track in listOfTrack if track.get("track_id") in ids]
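With the sample data both ids are present, so this returns both track dicts:
[{'track_length': 1.26, 'track_id': 626, 'track_name': 'Rainbow Road'}, {'track_length': 6.21, 'track_id': 914, 'track_name': 'Excalibur'}]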
listOfResults = [{"29":2523,"30":626,"10":0,"32":128},{"29":2466,"30":914,"10":0,"32":69}]
ids = [x.get('30') for x in listOfResults]
listOfTrack = [{"track_length": 1.26,"track_id": 626,"track_name": "Rainbow Road"},{"track_length": 6.21,"track_id": 914,"track_name": "Excalibur"}]
out = [x for x in listOfTrack if x.get('track_id') in ids]
Alternatively, it may be time to learn a new library if you're going to be doing a lot of this.
import pandas as pd
results_df = pd.DataFrame(listOfResults)
track_df = pd.DataFrame(listOfTrack)
These look like:
# results_df
29 30 10 32
0 2523 626 0 128
1 2466 914 0 69
# track_df
track_length track_id track_name
0 1.26 626 Rainbow Road
1 6.21 914 Excalibur
Now we can answer your question:
# Creates a mask of rows where this is True.
mask = track_df['track_id'].isin(results_df['30'])
# Specifies that we want just those two columns.
cols = ['track_length', 'track_name']
out = track_df.loc[mask, cols]
print(out)
# Or we can make it back into a dictionary:
print(out.to_dict('records'))
Output:
track_length track_name
0 1.26 Rainbow Road
1 6.21 Excalibur
[{'track_length': 1.26, 'track_name': 'Rainbow Road'}, {'track_length': 6.21, 'track_name': 'Excalibur'}]
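The same lookup can also be phrased as a join, which scales well if both tables grow. A sketch, assuming the '30' column holds the ids as above:
# Join the tracks against the ids pulled out of results_df
out = track_df.merge(results_df[['30']].rename(columns={'30': 'track_id'}), on='track_id')
print(out[['track_length', 'track_name']])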
This is a continuation of my previous thread: Removing Custom-Defined Words from List - Python
I have a df as such:
df = pd.DataFrame({'PageNumber': [175, 162, 576], 'new_tags': [['flower architecture people'], ['hair red bobbles'], ['sweets chocolate shop']]})
<OUT>
PageNumber new_tags
175 flower architecture people...
162 hair red bobbles...
576 sweets chocolate shop...
And another df (which will act as the reference df (see more below)):
top_words = pd.DataFrame({'ID': [1, 2, 3], 'tag': ['flower', 'people', 'chocolate']})
<OUT>
ID tag
1 flower
2 people
3 chocolate
I'm trying to remove values in a list in a df based on the values of another df. The output I wish to gain is:
<OUT> df
PageNumber new_tags
175 flower people
576 chocolate
I've tried the inner join method: Filtering the dataframe based on the column value of another dataframe, but unfortunately no luck.
So I have resorted to tokenizing all tags in both of the df columns and trying to loop through each and retaining only the values in the reference df. Currently, it returns empty lists...
df['tokenised_new_tags'] = df["new_tags"].astype(str).apply(nltk.word_tokenize)
top_words['tokenised_top_words'] = top_words['tag'].astype(str).apply(nltk.word_tokenize)
df['top_word_tokens'] = [[t for t in tok_sent if t in top_words['tokenised_top_words']] for tok_sent in df['tokenised_new_tags']]
Any help is much appreciated - thanks!
How about this:
def keep_top_words(phrase, words_to_keep_list):
    # Keep only the words that also appear in the reference list
    return [elem for elem in phrase.split(' ') if elem in words_to_keep_list]

df['new_tags'] = df['new_tags'].apply(lambda x: keep_top_words(x[0], top_words['tag'].to_list()))
Basically I am applying the keep_top_words function to each row of the dataset, retaining only the words contained in top_words['tag'].
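If top_words grows, precomputing a set makes each membership test O(1). A variant sketch of the same idea, assuming new_tags still holds the original one-element lists:
keep = set(top_words['tag'])
# Split each row's tag string and keep only the reference words
df['new_tags'] = df['new_tags'].apply(lambda tags: [w for w in tags[0].split() if w in keep])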
trv_last = []
for i in range(0, len(trv)):
    if (trv_name_split.iloc[i,3] != None):
        trv_last = trv_name_split.iloc[i,3]
    elif (trv_name_split.iloc[i,2] != None):
        trv_last = trv_name_split.iloc[i,2]
    else:
        trv_last = trv_name_split.iloc[i,1]
trv_last
This returns 'Campo', which is the last index in my range:
0 1 2 3
1 John Doe None None
2 Jane K. Martin None
: : : : :
972 Jino Campo None None
As you can see, all names were together in one column and I used str.split() to split them up. Since some names had first middle middle last, I am left with 4 columns. I am only interested in the last name.
My goal is to create a new DF with only the last name. The logic here is if the 4th column is not "None" then that is the last name and move backwards toward the 2nd column being last name if all else are "None".
Thank you for having a look and I appreciate the help!
Looping through pandas dataframes isn't a great idea. That's why they made apply. Best practice is to use apply and assign.
def build_last_name(row):
    # Work backwards from the 4th split column to the 2nd
    if pd.notna(row[3]):
        return row[3]
    if pd.notna(row[2]):
        return row[2]
    return row[1]
last_names = trv_name_split.apply(build_last_name, axis=1)
trv_name_split = trv_name_split.assign(last_name=last_names)
Familiarizing yourself with apply is going to save a lot of headaches. Here's the docs.
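If the gaps left by str.split are plain None/NaN, a vectorized alternative is to forward-fill across the columns and take the rightmost one. A sketch, assuming the integer column labels 0 through 3 shown above:
# Propagate names rightward; column 3 then holds each row's last non-null name
last_names = trv_name_split[[1, 2, 3]].ffill(axis=1)[3]
trv_name_split = trv_name_split.assign(last_name=last_names)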
Figured out the answer to my own question..
trv_last = []
for i in range(0, len(trv)):
    if (trv_name_split.iloc[i,3] != None):
        trv_last.append(trv_name_split.iloc[i,3])
    elif (trv_name_split.iloc[i,2] != None):
        trv_last.append(trv_name_split.iloc[i,2])
    else:
        trv_last.append(trv_name_split.iloc[i,1])
trv_last
trv_last
Suppose I have a dataframe like below:
>>> df = pd.DataFrame({'Category':['Personal Care', 'Home Care', 'Pharma', 'Pet'], 'SubCategory':['Shampoo', 'Floor Wipe', 'Veterinary', 'Animal Feed']})
>>> df
Category SubCategory
0 Personal Care Shampoo
1 Home Care Floor Wipe
2 Pharma Veterinary
3 Pet Animal Feed
I'd like to update the value in 'Category' column whenever the 'Subcategory' column's value contains either 'Veterinary' or 'Animal' (case-insensitive). To do that, I devised a method like below:
def update_col1_values_based_on_values_in_col2_using_regex_mappings(
        df,
        col1_name: str,
        col2_name: str,
        dictionary_of_regex_mappings: dict):
    for pattern, new_str_value in dictionary_of_regex_mappings.items():
        mask = df[col2_name].str.contains(pattern)
        df.loc[mask, col1_name] = new_str_value
    return df
This method works as expected as shown below:
>>> df1 = update_col1_values_based_on_values_in_col2_using_regex_mappings(df, 'Category', 'SubCategory', {"(?i).*Veterinary.*": "Pet Related", "(?i).*Animal.*": "Pet Related"})
>>> df1
Category SubCategory
0 Personal Care Shampoo
1 Home Care Floor Wipe
2 Pet Related Veterinary
3 Pet Related Animal Feed
In practice, there will be more than 'Veterinary' and 'Animal Feed' to map from, so some of the suggestions below, although elegant, are not going to be practical for the actual use case. In other words, please assume that the mapping is going to be more like this:
{
    "(?i).*Veterinary.*": "Pet Related",
    "(?i).*Animal.*": "Pet Related",
    "(?i).*Pharma.*": "Pharmaceutical",
    "(?i).*Dairy.*": "Other",
    ...  # lots and lots more mapping here
}
I'm wondering if there's a more elegant (Pandas-ish) way to accomplish this. Thank you in advance for your suggestions!
EDIT: I didn't clarify in the beginning that the mapping between 'Category' and 'Subcategory' columns wouldn't be restricted to just 'Veterinary' and 'Animal'.
You can use the following code, which is intuitive.
df['Category'] = df.apply(
    lambda r: "Pet Related" if "animal" in r['SubCategory'].lower()
    or "veterinary" in r['SubCategory'].lower() else r['Category'],
    axis=1)
You could do it with pd.DataFrame.where, and re to add the flag case-insensitive:
import re
df.Category.where(~df.SubCategory.str.contains('Veterinary|Animal',flags = re.IGNORECASE),'Pet Related',inplace=True)
Output:
Category SubCategory
0 Personal Care Shampoo
1 Home Care Floor Wipe
2 Pet Related Veterinary
3 Pet Related Animal Feed
Not sure if this is the best way, but you can do this:
df.loc[df.SubCategory.str.contains('Veterinary|Animal'), 'Category']='Pet Related'
If you need to use a regex, str.contains() also supports regex patterns:
pattern = r'(?i)veterinary|animal'
df.loc[df.SubCategory.str.contains(pattern, regex=True), 'Category']='Pet Related'
And this is the result
In [3]: df
Out[3]:
Category SubCategory
0 Personal Care Shampoo
1 Home Care Floor Wipe
2 Pet Related Veterinary
3 Pet Related Animal Feed
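For the long mapping described in the question, np.select can apply every pattern in one pass. A sketch, assuming the regex mapping dict from the question:
import numpy as np

mapping = {"(?i).*Veterinary.*": "Pet Related", "(?i).*Animal.*": "Pet Related"}
# One boolean mask per pattern; np.select picks the first match per row
conditions = [df['SubCategory'].str.contains(p) for p in mapping]
df['Category'] = np.select(conditions, list(mapping.values()), default=df['Category'])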
I am working with a Dataset that contains the information of every March Madness game since 1985. I want to know which teams have won it all and how many times each.
I masked the main dataset and created a new one containing only information about the championship game. Now I am trying to create a loop that compares the scores from both teams that played in the championship game, detects the winner and adds that team to a list. This is what the dataset looks like: https://imgur.com/tXhPYSm
tourney = pd.read_csv('ncaa.csv')
champions = tourney.loc[tourney['Region Name'] == "Championship", ['Year','Seed','Score','Team','Team.1','Score.1','Seed.1']]
list_champs = []
for i in champions:
    if champions['Score'] > champions['Score.1']:
        list_champs.append(i['Team'])
    else:
        list_champs.append(i['Team.1'])
Why do you need to loop through the DataFrame?
Basic filtering should work well. Something like this:
champs1 = champions.loc[champions['Score'] > champions['Score.1'], 'Team']
champs2 = champions.loc[champions['Score'] < champions['Score.1'], 'Team.1']
list_champs = list(champs1) + list(champs2)
A minimalist change (not the most efficient) to get your code working:
tourney = pd.read_csv('ncaa.csv')
champions = tourney.loc[tourney['Region Name'] == "Championship", ['Year','Seed','Score','Team','Team.1','Score.1','Seed.1']]
list_champs = []
for _, row in champions.iterrows():  # iterrows yields (index, row) pairs
    if row['Score'] > row['Score.1']:
        list_champs.append(row['Team'])
    else:
        list_champs.append(row['Team.1'])
Otherwise, you could simply do:
champions.apply(lambda row: row['Team'] if row['Score'] > row['Score.1'] else row['Team.1'], axis=1).values
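Either way, the original goal was "how many times each", and value_counts answers that from the winners. A sketch:
winners = champions.apply(lambda row: row['Team'] if row['Score'] > row['Score.1'] else row['Team.1'], axis=1)
print(winners.value_counts())  # titles per team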