Removing Custom-Defined Words from List (Part II) - Python

This is a continuation of my previous thread: Removing Custom-Defined Words from List - Python
I have a df as such:
df = pd.DataFrame({'PageNumber': [175, 162, 576], 'new_tags': [['flower architecture people'], ['hair red bobbles'], ['sweets chocolate shop']]})
<OUT>
PageNumber new_tags
175 flower architecture people...
162 hair red bobbles...
576 sweets chocolate shop...
And another df, which will act as the reference df:
top_words = pd.DataFrame({'ID': [1, 2, 3], 'tag': ['flower', 'people', 'chocolate']})
<OUT>
ID tag
1 flower
2 people
3 chocolate
I'm trying to remove values in a list in a df based on the values of another df. The output I wish to gain is:
<OUT> df
PageNumber new_tags
175 flower people
576 chocolate
I've tried the inner join method from Filtering the dataframe based on the column value of another dataframe, but had no luck unfortunately.
So I have resorted to tokenizing all tags in both of the df columns and trying to loop through each and retaining only the values in the reference df. Currently, it returns empty lists...
df['tokenised_new_tags'] = df['new_tags'].astype(str).apply(nltk.word_tokenize)
top_words['tokenised_top_words'] = top_words['tag'].astype(str).apply(nltk.word_tokenize)
df['top_word_tokens'] = [[t for t in tok_sent if t in top_words['tokenised_top_words']] for tok_sent in df['tokenised_new_tags']]
Any help is much appreciated - thanks!

How about this:
def keep_top_words(phrase, words_to_keep):
    # keep only the tags that also appear in the reference list
    return [elem for elem in phrase.split(' ') if elem in words_to_keep]

df['new_tags'] = df['new_tags'].apply(lambda x: keep_top_words(x[0], top_words['tag'].to_list()))
Basically I am applying the keep_top_words function to each row of the dataset, retaining only the words that appear in top_words['tag'] (the desired output keeps the reference words rather than removing them).
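As a quick check against the sample frames above (a sketch; note the 162 row ends up with an empty list, which can then be filtered out to match the desired output):
print(df)
#    PageNumber          new_tags
# 0         175  [flower, people]
# 1         162                []
# 2         576       [chocolate]

# drop rows whose tag list ended up empty
df = df[df['new_tags'].str.len() > 0]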

Iterate through multiple list of dictionaries

I would like to iterate through list of dictionaries in order to get a specific value, but I can't figure it out.
I've made a simplified version of what I've been working with. These lists are much longer, with more dictionaries in them, but for the sake of an example I hope this shortened dataset will be enough.
listOfResults = [{"29":2523,"30":626,"10":0,"32":128},{"29":2466,"30":914,"10":0,"32":69}]
For example, I need the values of the key "30" from the dictionaries above. I've managed to get those and stored them in a list of integers. ( [626, 914] )
These integers are basically IDs. After this, I need to get the value of these IDs from another list of dictionaries.
listOfTrack = [{"track_length": 1.26,"track_id": 626,"track_name": "Rainbow Road"},{"track_length": 6.21,"track_id": 914,"track_name": "Excalibur"}]
I would like to print/store the track_names and track_lengths of the IDs I've got from the listOfResults earlier. Unfortunately, I've ended up in a complete mess of for loops.
You want something like this:
ids = [626, 914]
result = { track for track in list_of_tracks if track.get("track_id") in ids }
I unfortunately can't comment on the answer given by Nathaniel Ford because I'm a new user, so I just thought I'd share it here as an answer.
His answer is basically correct, but I believe you need to replace the curly braces with square brackets, or else you will get this error: TypeError: unhashable type: 'dict'
The answer should look like:
ids = [626, 914]
result = [track for track in listOfTrack if track.get("track_id") in ids]
listOfResults = [{"29":2523,"30":626,"10":0,"32":128},{"29":2466,"30":914,"10":0,"32":69}]
ids = [x.get('30') for x in listOfResults]
listOfTrack = [{"track_length": 1.26,"track_id": 626,"track_name": "Rainbow Road"},{"track_length": 6.21,"track_id": 914,"track_name": "Excalibur"}]
out = [x for x in listOfTrack if x.get('track_id') in ids]
Alternatively, it may be time to learn a new library if you're going to be doing a lot of this.
import pandas as pd
results_df = pd.DataFrame(listOfResults)
track_df = pd.DataFrame(listOfTrack)
These look like:
# results_df
29 30 10 32
0 2523 626 0 128
1 2466 914 0 69
# track_df
track_length track_id track_name
0 1.26 626 Rainbow Road
1 6.21 914 Excalibur
Now we can answer your question:
# Creates a mask of rows where this is True.
mask = track_df['track_id'].isin(results_df['30'])
# Specifies that we want just those two columns.
cols = ['track_length', 'track_name']
out = track_df.loc[mask, cols]
print(out)
# Or we can make it back into a dictionary:
print(out.to_dict('records'))
Output:
track_length track_name
0 1.26 Rainbow Road
1 6.21 Excalibur
[{'track_length': 1.26, 'track_name': 'Rainbow Road'}, {'track_length': 6.21, 'track_name': 'Excalibur'}]
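If you do many of these lookups, a plain dictionary keyed by track_id is another option (a sketch, not part of the answers above); each lookup is then O(1) instead of rescanning the list:
# build an id -> track mapping once
tracks_by_id = {track["track_id"]: track for track in listOfTrack}
out = [tracks_by_id[i] for i in ids if i in tracks_by_id]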

How to do a nested loop with 2 columns for pandas dataframe with counter?

I am using python and pandas. I have a bunch of unstructured survey data.
I have a dataframe:
Type     Activity
Sport    rowing
Sport    Surfing
Sport    Basketball
Sport    Dancing
Sport    Dancing
Studies  science
Studies  Math
Studies  History
I have survey data that says:
"Sarah does Basketball and Math"
"Kilian does Math"
"Lorenzo does history"
"Robert does dancing"
"Rachel does basketball and dancing"
I want a table that says which students do one or the other and which students do both (the real data has 30 different subcategories).
I want to create a table like below:
Student                                 Sports  Studies
"Sarah does Basketball and Math"             1        1
"Kilian does Math"                           0        1
"Lorenzo does history"                       0        1
"Robert does dancing"                        1        0
"Rachel does basketball and dancing"         2        0
I think I need to say
Distinct_Activities = dataframe.Activity.nunique()
#split survey data to be a list of words.
counter = 0
Then say:
for sentence in Survey_data:
    for each Type (e.g. Sport, Studies):
        for each Activity of that Type:
            if an Activity word appears in the sentence's words:
                counter += 1
        # store the count for this Type in a dictionary/column
    # then move on to the next sentence
I am struggling to figure out how to loop through the dataframe using Type and Activity. I tried to create 30 different lists and dataframes, but that didn't go well. Can anyone help me create this inner loop strategy?
PROGRESS and ERROR
import pandas as pd
from collections import Counter

# read the Type_Activity and Student files
df1 = pd.read_csv('df1.csv')
df2 = pd.read_csv('df2.csv')

# create a dictionary with (activity, Type)
### activity_type = dict(zip(df1['Activity'].str.lower(), df1['Type'].str.lower()))
activity_type = df1.groupby('Type')['Activity'].apply(list).to_dict()

df2 = df2.join(  # join df2 with the new dataframe
    pd.json_normalize(  # convert the dictionaries into columns
        df2['Student'].apply(  # apply the following function on the "Student" column
            lambda x: Counter([  # count the types
                type_
                for activity in x.strip().lower().split()  # lowercase, then split the student text into words
                for type_ in [activity_type.get(activity)]  # just a hack to ignore the ordinary words
                if type_  # the whole purpose of the previous line is to enable this check
            ])
        )
    ).fillna(0)  # fill the NaN values with zeros, then convert to int for a better look
)
Traceback (most recent call last):
  File "/mnt/06082022_CreateFactorization.py", line 33, in <module>
    pd.json_normalize(
  File "/opt/conda/lib/python3.8/site-packages/pandas/core/frame.py", line 7841, in applymap
    return self.apply(infer).__finalize__(self, "applymap")
  File "/opt/conda/lib/python3.8/site-packages/pandas/core/frame.py", line 7765, in apply
    return op.get_result()
  File "/opt/conda/lib/python3.8/site-packages/pandas/core/apply.py", line 185, in get_result
    return self.apply_standard()
  File "/opt/conda/lib/python3.8/site-packages/pandas/core/apply.py", line 276, in apply_standard
    results, res_index = self.apply_series_generator()
  File "/opt/conda/lib/python3.8/site-packages/pandas/core/apply.py", line 290, in apply_series_generator
    results[i] = self.f(v)
  File "/opt/conda/lib/python3.8/site-packages/pandas/core/frame.py", line 7839, in infer
    return lib.map_infer(x.astype(object)._values, func, ignore_na=ignore_na)
  File "pandas/_libs/lib.pyx", line 2467, in pandas._libs.lib.map_infer
TypeError: int() argument must be a string, a bytes-like object or a number, not 'Counter'
It's best to avoid loops as much as possible; here we replaced the two loops with a dictionary, activity_type, that maps every activity to its type.
NOTE: the df1 is the Type, Activity DataFrame
import pandas as pd
from collections import Counter

# read the Type_Activity and Student files
df1 = pd.read_csv('df1.csv')
df2 = pd.read_csv('df2.csv')

# create a dictionary with (activity, Type)
activity_type = dict(zip(df1['Activity'].str.lower(), df1['Type'].str.lower()))

df2 = df2.join(  # join df2 with the new dataframe
    pd.json_normalize(  # convert the dictionaries into columns
        df2['Student'].apply(  # apply the following function on the "Student" column
            lambda x: Counter([  # count the types
                type_
                for activity in x.strip().lower().split()  # lowercase, then split the student text into words
                for type_ in [activity_type.get(activity)]  # just a hack to ignore the ordinary words
                if type_  # the whole purpose of the previous line is to enable this check
            ])
        )
    ).fillna(0)  # fill the NaN values with zeros for a better look
)
Update:
My activity_type dictionary was mapping each activity to a single string (its type), assuming that every activity has only one type.
Your activity_type dictionary was mapping each activity to a list of types, which is better if an activity can have more than one type.
NOTE: I changed it to a set instead of a list to avoid duplicates and for better performance.
import pandas as pd
from collections import Counter

# read the Type_Activity and Student files
df1 = pd.read_csv('df1.csv')
df2 = pd.read_csv('df2.csv')

# lowercase the columns, then create a dictionary with (activity, {Types})
df1[['Activity', 'Type']] = df1[['Activity', 'Type']].applymap(lambda x: x.lower())
activity_types = df1.groupby('Activity')['Type'].apply(set).to_dict()

# join df2 with the new dataframe
df2 = df2.join(
    # convert the dictionaries into columns
    pd.json_normalize(
        # apply the following function on the "Student" column
        df2['Student'].apply(
            # count the types
            lambda x: Counter([
                type_
                # lowercase, then split the student text into words
                for activity in x.strip().lower().split()
                # ignore the ordinary words
                for type_ in activity_types.get(activity, [])
            ])
        )
    ).fillna(0).applymap(int)  # fill the NaN values with zeros, then convert to int for a better look
)
output:
Student                              sport  studies
Sarah does Basketball and Math           1        1
Kilian does Math                         0        1
Lorenzo does history                     0        1
Robert does dancing                      1        0
Rachel does basketball and dancing       2        0
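For anyone reproducing this without the CSV files, the two sample frames can be built inline (a sketch based on the tables above; the column names are assumed to match the CSVs):
import pandas as pd

df1 = pd.DataFrame({
    'Type': ['Sport', 'Sport', 'Sport', 'Sport', 'Sport', 'Studies', 'Studies', 'Studies'],
    'Activity': ['rowing', 'Surfing', 'Basketball', 'Dancing', 'Dancing', 'science', 'Math', 'History'],
})
df2 = pd.DataFrame({
    'Student': [
        'Sarah does Basketball and Math',
        'Kilian does Math',
        'Lorenzo does history',
        'Robert does dancing',
        'Rachel does basketball and dancing',
    ],
})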

Dataframe df has a blank col. 'Key' & a col. 'News'. A list has keywords & one or more appear in 'News'; the matching keywords should be filled into col. 'Key' in df

sample data
df dataframe
Key   News
      Tata Steel results
      Oracle results and announce buyback
      Bhart Airtel results, dividend:
The keyword List is {'result', 'buyback', 'dividend'}
The dataframe df is already filtered from a larger data frame by me based on list and contains at least one keyword from the list in the 'News" column.
The command used was
df = df_large[df_large['News'].str.contains('|'.join(list))]
Desired Result:
I want 'Key' in df to be populated by the keyword(s) from the list, depending on how many of them appear in the 'News' column in df.
df should look like:
Key               News
result            Tata Steel results
result, buyback   Oracle results and announce buyback
result, dividend  Bhart Airtel results, dividend:
Is iteration the only way? Even if yes, what is the optimum way?
Let's set up our data:
df = pd.DataFrame(columns = ['News'], data = ['Tata Steel results', 'Oracle results and announce buyback', 'Bhart Airtel results, dividend:'])
keywords = ['result', 'buyback', 'dividend']
Now we define the function that filters the full keyword list to those that are in text:
def filter_key_list(text, keys=keywords):
    return ','.join([k for k in keys if k in text])
Now we can map this to the News column:
df['Keys'] = df['News'].map(filter_key_list)
df
output:
News Keys
0 Tata Steel results result
1 Oracle results and announce buyback result,buyback
2 Bhart Airtel results, dividend: result,dividend
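A vectorized alternative sketch using pandas string methods (same df and keywords as above; this assumes the keywords contain no regex metacharacters):
# str.findall returns the list of keyword matches per row; join them into one string
df['Keys'] = df['News'].str.findall('|'.join(keywords)).str.join(',')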

Python concatenate values in rows till empty cell and continue

I am struggling a little to do something like that:
to get this output:
The purpose of it, is to separate a sentence into 3 parts to make some manipulations after.
Any help is welcome
Select from the dataframe only the second line of each pair, which is the line containing the separator, then use astype(str).apply(''.join, ...) to reduce the separator word, which can sit in any value column of the original dataframe, to a single string. Then iterate over each row, splitting the title with the word[i] of the respective row; after the split, reinsert the separator back into the list, and build the desired dataframe from the resulting lists, as illustrated below.
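The core split-and-reinsert step on a single title looks like this (a minimal illustration using values from the sample input below):
word = "blue"
title = "Very nice blue car haha"
parts = title.split(word)   # ['Very nice ', ' car haha']
parts.insert(1, word)       # ['Very nice ', 'blue', ' car haha']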
Input used as data.csv
title,Value,Value,Value,Value,Value
Very nice blue car haha,Very,nice,,car,haha
Very nice blue car haha,,,blue,,
A beautiful green building,A,,green,building,lol
A beautiful green building,,beautiful,,,
import pandas as pd

df = pd.read_csv("data.csv")

# second line of each pair
d1 = df[1::2]
d1 = d1.fillna("").reset_index(drop=True)

# get separators
word = d1.iloc[:, 1:].astype(str).apply(''.join, axis=1)

strings = []
for i in range(len(d1.index)):
    word_split = d1.iloc[i, 0].split(word[i])
    word_split.insert(1, word[i])
    strings.append(word_split)

dn = pd.DataFrame(strings)
dn.insert(0, "title", d1["title"])
print(dn)
Output from dn
                        title          0          1               2
0     Very nice blue car haha  Very nice       blue        car haha
1  A beautiful green building          A  beautiful  green building

Conditionally selecting values from pandas dataframe

I have a dataframe in which I would like to determine how many unique bird species each person saw who participated in my "Big Year".
I've tried using a list comprehension and for loops to iterate over each row and determine if it's unique using .is_unique(), but that seems to be the source of much of my distress. I can get a list of all the unique species with .unique(), quite nicely, but I would like to somehow get the people associated with those birds.
df = pd.DataFrame({'Species':['woodpecker', 'woodpecker', 'dove', 'mockingbird'], 'Birder':['Steve', 'Ben','Ben','Greg']})
ben_unique_bird = [x for x in range(len(df['Species'])) if df['Birder'][x]=='Ben' and df['Species'][x].is_unique()]
Edit: I think I'm unclear in this- I want to get a list of birds that each person saw that no one else did. So the output would be something like (Steve, 0), (Ben, 1), (Greg, 1), in whatever format.
Thanks!
This can be done with a short loop quite easily (a list comprehension can't reference the list it is still building):
df = pd.DataFrame({'Species': ['woodpecker', 'woodpecker', 'dove', 'mockingbird'], 'Birder': ['Steve', 'Ben', 'Ben', 'Greg']})

matches = []
for row in df.itertuples():
    pair = (row.Birder, row.Species)
    if pair not in matches:  # only keep the first occurrence of each pair
        matches.append(pair)
This gives a list of tuples as output:
[('Steve', 'woodpecker'), ('Ben', 'woodpecker'), ('Ben', 'dove'), ('Greg', 'mockingbird')]
name of unique birds they saw
ben_unique_bird = df[df['Birder'] == 'Ben']['Species'].unique()
number of unique birds they saw
len(df[df['Birder'] == 'Ben']['Species'].unique())
Recommended method 1 to get a table
df.groupby(['Birder']).agg({"Species": lambda x: x.nunique()})
same method broken down
for i in df['Birder'].unique():
    print("Name", i, "Distinct count", len(df[df['Birder'] == i]['Species'].unique()),
          "distinct bird names", df[df['Birder'] == i]['Species'].unique())
You can flag the species that only one person saw via pd.Series.duplicated and then use GroupBy + sum:
counts = df.assign(uniq_flag=~df['Species'].duplicated(keep=False))\
           .groupby('Birder')['uniq_flag'].sum().astype(int)

for name, count in counts.items():
    print(f'{name} saw {count} bird(s) that no one else saw')
Result:
Ben saw 1 bird(s) that no one else saw
Greg saw 1 bird(s) that no one else saw
Steve saw 0 bird(s) that no one else saw
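An equivalent sketch with groupby/transform (not from the original answer): first keep the species recorded by exactly one birder, then count per birder:
# species recorded by exactly one birder
only_one = df.groupby('Species')['Birder'].transform('nunique') == 1
counts = (df[only_one].groupby('Birder')['Species'].nunique()
            .reindex(df['Birder'].unique(), fill_value=0))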
I figured out a terrible way of doing what I want, but it works. Please let me know if you have a more efficient way of doing this, because I know there has to be one.
data = pd.DataFrame({'Species': ['woodpecker', 'woodpecker', 'dove', 'mockingbird'], 'Birder': ['Steve', 'Ben', 'Ben', 'Greg']})

ben_birds = []
steve_birds = []
greg_birds = []

# get all the names of the birds that people saw and put them in a list
for index, row in data.iterrows():
    if row['Birder'] == 'Ben':
        ben_birds.append(row['Species'])
    elif row['Birder'] == 'Steve':
        steve_birds.append(row['Species'])
    else:
        greg_birds.append(row['Species'])

duplicates = []
# compare each of the lists to look for duplicates, and make a new list with those
for bird in ben_birds:
    if (bird in steve_birds) or (bird in greg_birds):
        duplicates.append(bird)
for bird in steve_birds:
    if bird in greg_birds:
        duplicates.append(bird)

# if any of the duplicates are in a list, remove those birds
# (rebuild each list instead of removing while iterating, which skips elements)
ben_birds = [bird for bird in ben_birds if bird not in duplicates]
steve_birds = [bird for bird in steve_birds if bird not in duplicates]
greg_birds = [bird for bird in greg_birds if bird not in duplicates]

print(f'Ben saw {len(ben_birds)} Birds that no one else saw')
print(f'Steve saw {len(steve_birds)} Birds that no one else saw')
print(f'Greg saw {len(greg_birds)} Birds that no one else saw')
