Modifying dataframe rows based on list of strings - python

Background
I have a dataset where I have the following:
product_title price
Women's Pant 20.00
Men's Shirt 30.00
Women's Dress 40.00
Blue 4" Shorts 30.00
Blue Shorts 35.00
Green 2" Shorts 30.00
I created a new column called gender which contains the values Women, Men, or Unisex based on the specified string in product_title.
The output looks like this:
product_title price gender
Women's Pant 20.00 women
Men's Shirt 30.00 men
Women's Dress 40.00 women
Blue 4" Shorts 30.00 women
Blue Shorts 35.00 unisex
Green 2" Shorts 30.00 women
Approach
I approached creating a new column by using if/else statements:
df['gender'] = ['women' if 'women' in word or 'blue 4"' in word or 'green 2"' in word
                else 'men' if 'men' in word
                else 'unisex'
                for word in df.product_title.str.lower()]
Although this approach works, it becomes very long when I have a lot of conditions for labeling women vs men vs unisex. Is there a cleaner way to do this? Is there a way I can pass a list of strings instead of having a long chain of or conditions?
I would really appreciate help as I am new to python and pandas library.

IIUC,
import numpy as np
s = df['product_title'].str.lower()
# check the 'women' patterns first: 'women' also contains the substring 'men'
df['gender'] = np.select([s.str.contains('women|blue 4"|green 2"'),
                          s.str.contains('men')],
                         ['women', 'men'],
                         default='unisex')
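To address the "list of strings" part of the question directly: the patterns above can be built from plain Python lists rather than hard-coded, so adding a keyword just means appending to a list. A sketch (re.escape guards against regex metacharacters in the keywords):
import re
import numpy as np

women_keywords = ['women', 'blue 4"', 'green 2"']   # extend as needed
men_keywords = ['men']

s = df['product_title'].str.lower()
df['gender'] = np.select(
    [s.str.contains('|'.join(re.escape(k) for k in women_keywords)),
     s.str.contains('|'.join(re.escape(k) for k in men_keywords))],
    ['women', 'men'],
    default='unisex')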

Here is another idea with str.extract and Series.map
import re

d = {'women': ['women', 'blue 4"', 'green 2"'], 'men': ['men']}
d1 = {val: k for k, v in d.items() for val in v}
pat = '|'.join(d1.keys())
df['gender'] = (df['product_title'].str.extract('(' + pat + ')', flags=re.I, expand=False)
                                   .str.lower().map(d1).fillna('unisex'))
print(df)
product_title price gender
0 Women's Pant 20.0 women
1 Men's Shirt 30.0 men
2 Women's Dress 40.0 women
3 Blue 4" Shorts 30.0 women
4 Blue Shorts 35.0 unisex
5 Green 2" Shorts 30.00 NaN women

You can define your own function and run it with an apply + lambda expression:
Create the function which you can change as you need:
def sex(title):
    '''
    look for specific keywords and return a gender label
    '''
    title = title.lower()
    for word in ['women', 'blue 4"', 'green 2"']:
        if word in title:
            return 'women'
    if 'men' in title:
        return 'men'
    return 'unisex'
and then apply it to the column you need to check:
df['gender'] = df['product_title'].apply(sex)
Cheers!
EDIT 3:
After looking around and checking the numpy approach from #ansev following #anky's comment, I found this may be faster only up to a certain point: tested with 5000 rows it was still faster, but the numpy approach started to catch up. So it really depends on how big your dataset is.
I will remove any comment on speed, since I was initially testing only on this small frame; still a learning process, as you can see from my level.
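To check speed on something bigger than the sample frame, here is a minimal benchmark sketch (assuming the sex function above and the question's df; big is just the sample rows repeated, and timings will vary with hardware and pandas version):
import time
import numpy as np
import pandas as pd

big = pd.concat([df] * 10_000, ignore_index=True)  # simulated larger frame

def select_gender(frame):
    s = frame['product_title'].str.lower()
    return np.select([s.str.contains('women|blue 4"|green 2"'),
                      s.str.contains('men')],
                     ['women', 'men'],
                     default='unisex')

for label, fn in [('apply', lambda: big['product_title'].apply(sex)),
                  ('np.select', lambda: select_gender(big))]:
    start = time.perf_counter()
    fn()
    print(f'{label}: {time.perf_counter() - start:.4f}s')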

Related

Find the most popular word order in a Pandas dataframe

I'm trying to find the most common word order in a pandas dataframe for strings which occur more than once.
Example Dataframe
title
0 Men's Nike Socks
1 Nike Socks Men's
2 Men's Black Nike Socks
3 Men's Nike Socks
4 Everyday 3 Pack Cotton Cushioned Crew Socks
Desired Output
Men's Nike Socks
This is because each word occurs more than once, arranged in the most common order.
What I've Tried
I thought one way to tackle this is to assign a score for each word position, e.g. first position = high score, later positions (further right in the sentence) = lower score.
I considered counting the maximum number of words which appear in the dataframe and then use that to incrementally score the words based on their frequency and position.
I'm a Python beginner and not sure how to progress further than that (a rough sketch of this scoring idea appears below, after the reproducible example).
It's worth mentioning that the word sizes will be random, and not constrained to the example above.
Minimum Reproducible Example
import pandas as pd
data = [
"Men's Nike Socks for sale",
"Nike Socks Men's",
"Men's Nike Socks in the UK",
"Men's Nike Socks to buy",
"Everyday 3 Pack Cotton Cushioned Crew Socks",
]
df = pd.DataFrame(data, columns=['title'])
print(df)
Edit: My original example was too simplified, as my desired output appeared verbatim twice in the dataframe.
I've updated the dataframe, but the desired output is still the same.
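For reference, the positional-scoring idea from "What I've Tried" can be sketched directly; the weighting scheme (leftmost word scores highest) and the occurs-more-than-once filter are assumptions about the intent:
from collections import defaultdict

scores = defaultdict(float)   # summed positional weight per word
counts = defaultdict(int)     # occurrences per word
for title in df['title']:
    words = title.split()
    for pos, word in enumerate(words):
        scores[word] += len(words) - pos   # earlier position -> higher weight
        counts[word] += 1

# keep words occurring more than once, ordered by average positional score
repeated = sorted((w for w in counts if counts[w] > 1),
                  key=lambda w: scores[w] / counts[w], reverse=True)
print(' '.join(repeated))  # Men's Nike Socks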
Use value_counts() and idxmax().
result = df['title'].value_counts().idxmax()
print(result)
Output: Men's Nike Socks
Explanation:
>>> df['title'].value_counts()
Men's Nike Socks 2
Nike Socks Men's 1
Men's Black Nike Socks 1
Everyday 3 Pack Cotton Cushioned Crew Socks 1
Name: title, dtype: int64
Update based on the new DataFrame:
max_split = df['title'].str.split().apply(len).max()
for i in range(1, max_split):
    try:
        result = (df['title'].str.split(' ', n=i, expand=True)
                             .iloc[:, :-1].apply(' '.join, axis=1).mode()[0])
    except TypeError:
        break
print(result)
Output: Men's Nike Socks
You can use pandas.Series.mode, which returns the most frequent value in a column/Series:
out = df['title'].mode()
# Output :
print(out)
0 Men's Nike Socks
Name: title, dtype: object
# Edit :
To find the most frequent phrases in a column, use nltk as shown in the code below (highly inspired by #jezarel):
from nltk import ngrams
vals = [y for x in df['title'] for y in x.split()]
n = [3, 4, 5] # Phrases between 3 and 5 words, To be adjusted !
out = pd.Series([' '.join(y) for x in n for y in ngrams(vals, x)]).value_counts().idxmax()
print(out)
Men's Nike Socks
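Note that vals flattens every title into one list, so the n-grams above can span row boundaries. If phrases should stay within a single title, a per-row variant (an assumption about the intent) is:
from nltk import ngrams
import pandas as pd

out = pd.Series([' '.join(gram)
                 for title in df['title']        # one title at a time
                 for n in (3, 4, 5)              # phrase lengths, adjust as needed
                 for gram in ngrams(title.split(), n)]).value_counts().idxmax()
print(out)  # Men's Nike Socks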

How to compare two data rows before concatenating them?

I have 2 datasets (in CSV format) of different sizes, as follows:
df_old:
index category text
0 spam you win much money
1 spam you are the winner of the game
2 not_spam the weather in Chicago is nice
3 not_spam pizza is an Italian food
4 neutral we have a party now
5 neutral they are driving to downtown
df_new:
index category text
0 spam you win much money
14 spam London is the capital of Canada
15 not_spam no more raining in winter
25 not_spam the soccer game plays on HBO
4 neutral we have a party now
31 neutral construction will be done
I am using code that concatenates df_new to df_old so that the df_new rows go on top of the df_old rows within each category.
The code is:
(pd.concat([df_new,df_old], sort=False).sort_values('category', ascending=False, kind='mergesort'))
Now, the problem is that some rows with the same index, category, and text (for example [0, spam, you win much money]) end up duplicated, which I want to avoid.
The expected output should be:
df_concat:
index category text
14 spam London is the capital of Canada
0 spam you win much money
1 spam you are the winner of the game
15 not_spam no more raining in winter
25 not_spam the soccer game plays on HBO
2 not_spam the weather in Chicago is nice
3 not_spam pizza is an Italian food
31 neutral construction will be done
4 neutral we have a party now
5 neutral they are driving to downtown
I tried this and this but these are removing either the category or text.
To remove duplicates on specific column(s), use subset in drop_duplicates:
df.drop_duplicates(subset=['index', 'category', 'text'], keep='first')
Try concat + sort_values:
res = pd.concat((df_new, df_old)).drop_duplicates()
res = res.sort_values(by=['category'], key=lambda x: x.map({'spam': 0, 'not_spam': 1, 'neutral': 2}))
print(res)
Output
index category text
0 0 spam you win much money
1 14 spam London is the capital of Canada
1 1 spam you are the winner of the game
2 15 not_spam no more raining in winter
3 25 not_spam the soccer game plays on HBO
2 2 not_spam the weather in Chicago is nice
3 3 not_spam pizza is an Italian food
4 31 neutral construction will be done
4 4 neutral we have a party now
5 5 neutral they are driving to downtown
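Note that sort_values(key=...) requires pandas 1.1 or newer; on older versions an ordered Categorical gives the same custom ordering. A sketch under that assumption:
res = pd.concat((df_new, df_old)).drop_duplicates()
# an ordered Categorical sorts by the declared category order, not alphabetically
res['category'] = pd.Categorical(res['category'],
                                 categories=['spam', 'not_spam', 'neutral'],
                                 ordered=True)
res = res.sort_values('category', kind='mergesort')  # stable sort keeps order within groups
print(res)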
Your code seems right; try adding this to the concat result, it will remove your duplicates:
# these first lines create a new column 'index' and help the rest of the code work correctly
df_new = df_new.reset_index()
df_old = df_old.reset_index()
df_concat = (pd.concat([df_new, df_old], sort=False)
               .sort_values('category', ascending=False, kind='mergesort'))
df_concat = df_concat.drop_duplicates()
If you want to reindex it as well (without, of course, changing the 'index' column):
df_concat = df_concat.drop_duplicates(ignore_index=True)
You can always use combine_first:
out = df_new.combine_first(df_old)
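combine_first aligns on the index and fills missing entries of df_new with values from df_old; a minimal toy illustration of that alignment (not the question's data):
import numpy as np
import pandas as pd

a = pd.DataFrame({'x': [1.0, np.nan]}, index=[0, 1])
b = pd.DataFrame({'x': [10.0, 20.0, 30.0]}, index=[0, 1, 2])
print(a.combine_first(b))
#       x
# 0   1.0  <- non-null value from a wins
# 1  20.0  <- NaN in a filled from b
# 2  30.0  <- index only present in b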

Filtering a column in a data frame to get only column entries that contain a specific word

print(data['PROD_NAME'])
0 Natural Chip Compny SeaSalt175g
1 CCs Nacho Cheese 175g
2 Smiths Crinkle Cut Chips Chicken 170g
3 Smiths Chip Thinly S/Cream&Onion 175g
4 Kettle Tortilla ChpsHny&Jlpno Chili 150g
...
264831 Kettle Sweet Chilli And Sour Cream 175g
264832 Tostitos Splash Of Lime 175g
264833 Doritos Mexicana 170g
264834 Doritos Corn Chip Mexican Jalapeno 150g
264835 Tostitos Splash Of Lime 175g
Name: PROD_NAME, Length: 264836, dtype: object
I only want product names that have the word 'chip' in it somewhere.
new_data = pd.DataFrame(data['PROD_NAME'].str.contains("Chip"))
print(pd.DataFrame(new_data))
PROD_NAME
0 True
1 False
2 True
3 True
4 False
... ...
264831 False
264832 False
264833 False
264834 True
264835 False
[264836 rows x 1 columns]
My question is: how do I remove the rows that are False and, instead of having True in the data frame above, get the product name that made the row True?
Btw, this is part of the Quantium data analytics virtual internship program.
Try using .loc with column names to select particular columns that meet the criteria you need. There is some documentation here; the part before the comma is the boolean series you want to use as a filter (in your case the str.contains('Chip')), and the part after the comma is the column or columns you want to return (in your case 'PROD_NAME', but it also works with other columns).
Example
import pandas as pd
example = {'PROD_NAME':['Chippy','ABC','A bag of Chips','MicroChip',"Product C"],'Weight':range(5)}
data = pd.DataFrame(example)
data.loc[data.PROD_NAME.str.contains('Chip'),'PROD_NAME']
#0 Chippy
#2 A bag of Chips
#3 MicroChip
You are almost there; try this:
res = data[data['PROD_NAME'].str.contains("Chip")]
Output:
PROD_NAME
0 Natural Chip Compny SeaSalt175g
2 Smiths Crinkle Cut Chips Chicken 170g
3 Smiths Chip Thinly S/Cream&Onion 175g
8 Doritos Corn Chip Mexican Jalapeno 150g
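One caveat: str.contains is case-sensitive by default and raises on missing values. If either could bite you, the standard case and na parameters handle it:
# matches 'Chip', 'chip', 'CHIP', ...; NaN product names count as non-matches
res = data[data['PROD_NAME'].str.contains('chip', case=False, na=False)]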

Populate value for data frame row based on condition

Background
I have a dataset that looks like the following:
product_name price
Women's pant 20.00
Men's Shirt 30.00
Women's Dress 40.00
Blue Shirt 30.00
...
I am looking to create a new column called gender which will contain the values Women, Men, or Unisex based on the string in product_name.
The desired result would look like this:
product_name price gender
Women's pant 20.00 women
Men's Shirt 30.00 men
Women's Dress 40.00 women
Blue Shirt 30.00 unisex
My Approach
I figured that first I should create a new column with a blank value for each row. Then I should loop through each row in the dataframe and check the string in df['product_name'] to see if it's a men's, women's, or unisex item, and fill out the respective gender row value.
Here is my code:
df['gender'] = ""
for product_name in df['product_name']:
if 'women' in product_name.lower():
df['gender'] = 'women'
elif 'men' in product_name.lower():
df['gender'] = 'men'
else:
df['gender'] = 'unisex'
However, I get the following result:
product_name price gender
Women's pant 20.00 men
Men's Shirt 30.00 men
Women's Dress 40.00 men
Blue Shirt 30.00 men
I would really appreciate some help here as I am new to python and pandas library.
You could use a list comprehension with if/else to get your output. (The reason your loop labels every row the same is that df['gender'] = 'women' assigns the entire column on each iteration, so the last assignment wins.)
df['gender'] = ['women' if 'women' in word
                else 'men' if 'men' in word
                else 'unisex'
                for word in df.product_name.str.lower()]
df
product_name price gender
0 Women's pant 20.0 women
1 Men's Shirt 30.0 men
2 Women's Dress 40.0 women
3 Blue Shirt 30.0 unisex
Alternatively, you could use numpy select to achieve the same results:
import numpy as np

cond1 = df.product_name.str.lower().str.contains("women")
cond2 = df.product_name.str.lower().str.contains("men")
condlist = [cond1, cond2]
choicelist = ["women", "men"]
df["gender"] = np.select(condlist, choicelist, default="unisex")
Usually, for strings, Python's iteration is much faster; you have to test that, though.
Try turning your for statement into a function and using apply. So something like -
def label_gender(product_name):
    '''product_name is a str'''
    if 'women' in product_name.lower():
        return 'women'
    elif 'men' in product_name.lower():
        return 'men'
    else:
        return 'unisex'

df['gender'] = df.apply(lambda x: label_gender(x['product_name']), axis=1)
A good breakdown of using apply/lambda can be found here: https://towardsdatascience.com/apply-and-lambda-usage-in-pandas-b13a1ea037f7
You can also use np.where + Series.str.contains,
import numpy as np
df['gender'] = (
np.where(df.product_name.str.contains("women", case=False), 'women',
np.where(df.product_name.str.contains("men", case=False), "men", 'unisex'))
)
product_name price gender
0 Women's pant 20.0 women
1 Men's Shirt 30.0 men
2 Women's Dress 40.0 women
3 Blue Shirt 30.0 unisex
Use np.where with .str.contains and a regex that captures the first word in the phrase:
# np.where(if product_name contains Women OR Men, first word of the phrase, otherwise 'Unisex')
df['Gender'] = np.where(df.product_name.str.contains('Women|Men'),
                        df.product_name.str.split(r'(^[\w]+)').str[1],
                        'Unisex')
product_name price Gender
0 Women's pant 20.0 Women
1 Men's Shirt 30.0 Men
2 Women's Dress 40.0 Women
3 Blue Shirt 30.0 Unisex

How to combine common rows in DataFrame

I'm running some analysis on bank statements (csv's). Some items like McDonalds each have their own row (due to having different addresses).
I'm trying to combine these rows by a common phrase. So for this example the obvious phrase, or string, would be "McDonalds". I think it'll be an if statement.
Also, the column has a dtype of "object". Will I have to convert it to string format?
Here is an example output of the result of printing totali = df.Item.value_counts() from my code.
Ideally I'd want that line to output McDonalds as just a single row.
In the csv they are 2 separate rows.
foo 14
Restaurant Boulder CO 8
McDonalds Boulder CO 5
McDonalds Denver CO 5
Here's what the column data consists of
'Sukiya Greenwood Vil CO' 'Sei 34179 Denver CO' 'Chambers Place Liquors 303-3731100 CO' "Mcdonald's F26593 Fort Collins CO" 'Suh Sushi Korean Bbq Fort Collins CO' 'Conoco - Sei 26927 Fort Collins CO'
OK. I think I ginned up something that can be helpful. Realize that the task of inferring categories or names from text strings can be huge, depending on how detailed you want to get. You can dive into regex or other learning models. People make careers of it! Obviously, your bank is doing some of this as they categorize things when you get a year-end summary.
Anyhow, here is a simple way to generate some categories and use them as a basis for the grouping that you want to do.
import pandas as pd
item=['McDonalds Denver', 'Sonoco', 'ATM Fee', 'Sonoco, Ft. Collins', 'McDonalds, Boulder', 'Arco Boulder']
txn = [12.44, 4.00, 3.00, 14.99, 19.10, 52.99]
df = pd.DataFrame([item, txn]).T
df.columns = ['item_orig', 'charge']
print(df)
# let's add an extra column to catch the conversions...
df['item'] = pd.Series(dtype=str)
# we'll use the "contains" function in pandas as a simple converter... quick demo
temp = df.loc[df['item_orig'].str.contains('McDonalds')]
print('\nitems that contain the string "McDonalds"')
print(temp)
# let's build a simple conversion table in a dictionary
conversions = { 'McDonalds': 'McDonalds - any',
'Sonoco': 'gas',
'Arco': 'gas'}
# let's loop over the orig items and put conversions into the new column
# (there is probably a faster way to do this, but for data with < 100K rows, who cares.)
for key in conversions:
    # .loc with (row mask, column) avoids chained-assignment warnings
    df.loc[df['item_orig'].str.contains(key), 'item'] = conversions[key]
# see how we did...
print('converted...')
print(df)
# now move over anything that was NOT converted
# in this example, this is just the ATM Fee item...
df.loc[df['item'].isnull(), 'item'] = df['item_orig']
# now we have decent labels to support grouping!
print('\n\n *** sum of charges by group ***')
print(df.groupby('item')['charge'].sum())
Yields:
item_orig charge
0 McDonalds Denver 12.44
1 Sonoco 4
2 ATM Fee 3
3 Sonoco, Ft. Collins 14.99
4 McDonalds, Boulder 19.1
5 Arco Boulder 52.99
items that contain the string "McDonalds"
item_orig charge item
0 McDonalds Denver 12.44 NaN
4 McDonalds, Boulder 19.1 NaN
converted...
item_orig charge item
0 McDonalds Denver 12.44 McDonalds - any
1 Sonoco 4 gas
2 ATM Fee 3 NaN
3 Sonoco, Ft. Collins 14.99 gas
4 McDonalds, Boulder 19.1 McDonalds - any
5 Arco Boulder 52.99 gas
*** sum of charges by group ***
item
ATM Fee 3.00
McDonalds - any 31.54
gas 71.98
Name: charge, dtype: float64
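As hinted above, the conversion loop can also be replaced with a single vectorized pass. One option, a sketch reusing the conversions dict rather than the only way to do it, combines str.extract with map:
import re

# one alternation pattern from the conversion keys; re.escape guards
# against regex metacharacters in future keys
pattern = '(' + '|'.join(re.escape(k) for k in conversions) + ')'
df['item'] = (df['item_orig'].str.extract(pattern, expand=False)
                             .map(conversions)
                             .fillna(df['item_orig']))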
