Find the most popular word order in a Pandas dataframe - python

I'm trying to find the most common word order in a pandas dataframe for strings which occur more than once.
Example Dataframe
title
0 Men's Nike Socks
1 Nike Socks Men's
2 Men's Black Nike Socks
3 Men's Nike Socks
4 Everyday 3 Pack Cotton Cushioned Crew Socks
Desired Output
Men's Nike Socks
This is because each of these words occurs more than once, and this is their most common order.
What I've Tried
I thought one way to tackle this is to assign a score to each word position, e.g. the first position gets a high score and positions further right in the sentence get lower scores.
I considered counting the maximum number of words which appear in the dataframe and then using that to incrementally score the words based on their frequency and position.
I'm a Python beginner and not sure how to progress further than that.
It's worth mentioning that the titles will vary in length, and are not constrained to the example above.
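Here is a minimal sketch of that scoring idea, using the average position of each repeated word as its score (all names here are my own; this is just one way to realise the approach described above):
from collections import defaultdict

titles = ["Men's Nike Socks", "Nike Socks Men's", "Men's Black Nike Socks"]

# Count each word's occurrences and sum up its positions.
counts = defaultdict(int)
positions = defaultdict(int)
for title in titles:
    for pos, word in enumerate(title.split()):
        counts[word] += 1
        positions[word] += pos

# Keep words seen more than once, ordered by their average position.
repeated = [w for w in counts if counts[w] > 1]
repeated.sort(key=lambda w: positions[w] / counts[w])
print(' '.join(repeated))  # "Men's Nike Socks"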
Minimum Reproducible Example
import pandas as pd
data = [
    "Men's Nike Socks for sale",
    "Nike Socks Men's",
    "Men's Nike Socks in the UK",
    "Men's Nike Socks to buy",
    "Everyday 3 Pack Cotton Cushioned Crew Socks",
]
df = pd.DataFrame(data, columns=['title'])
print(df)
Edit: My original example was too simplified, as my desired output appeared exactly twice in the dataframe.
I've updated the dataframe, but the desired output is still the same.

Use value_counts() and idxmax().
result = df['title'].value_counts().idxmax()
print(result)
Output: Men's Nike Socks
Explanation:
>>> df['title'].value_counts()
Men's Nike Socks 2
Nike Socks Men's 1
Men's Black Nike Socks 1
Everyday 3 Pack Cotton Cushioned Crew Socks 1
Name: title, dtype: int64
Update, based on the new DataFrame:
max_split = df['title'].str.split().apply(len).max()
for i in range(1, max_split):
    try:
        # Take the first i words of each title and find the most common prefix.
        result = (df['title'].str.split(' ', n=i, expand=True)
                  .iloc[:, :-1]
                  .apply(' '.join, axis=1)
                  .mode()[0])
    except TypeError:
        # Some title has fewer than i words (None cells), so stop and
        # keep the result from the previous iteration.
        break
print(result)
Output: Men's Nike Socks

You can use pandas.Series.mode, which returns the most frequent value in a column/Series:
out = df['title'].mode()
# Output :
print(out)
0 Men's Nike Socks
Name: title, dtype: object
Edit:
To find the most frequent phrases in a column, use nltk as shown in the code below (heavily inspired by @jezarel):
from nltk import ngrams

# Flatten every title into one list of words.
vals = [word for title in df['title'] for word in title.split()]
n = [3, 4, 5]  # phrase lengths between 3 and 5 words; adjust as needed
out = pd.Series([' '.join(g) for size in n for g in ngrams(vals, size)]).value_counts().idxmax()
print(out)
Men's Nike Socks
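Note that flattening all titles into a single word list lets an n-gram span the boundary between two neighbouring titles. If that matters for your data, a per-title variant (a sketch of the same idea, counting with collections.Counter) avoids it:
from collections import Counter
from nltk import ngrams

counts = Counter()
for title in df['title']:
    words = title.split()
    for size in (3, 4, 5):
        # Count n-grams within each title only.
        counts.update(' '.join(g) for g in ngrams(words, size))
print(counts.most_common(1)[0][0])  # "Men's Nike Socks"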

Related

How to compare two data row before concatenating them?

I have 2 datasets (in CSV format) with different sizes, as follows:
df_old:
index category text
0 spam you win much money
1 spam you are the winner of the game
2 not_spam the weather in Chicago is nice
3 not_spam pizza is an Italian food
4 neutral we have a party now
5 neutral they are driving to downtown
df_new:
index category text
0 spam you win much money
14 spam London is the capital of Canada
15 not_spam no more raining in winter
25 not_spam the soccer game plays on HBO
4 neutral we have a party now
31 neutral construction will be done
I am using code that concatenates df_new to df_old so that df_new's rows go on top within each of df_old's categories.
The code is:
(pd.concat([df_new,df_old], sort=False).sort_values('category', ascending=False, kind='mergesort'))
Now, the problem is that some rows with the same index, category, and text (all matching in the same row, like [0, spam, you win much money]) end up duplicated, and I want to avoid this.
The expected output should be:
df_concat:
index category text
14 spam London is the capital of Canada
0 spam you win much money
1 spam you are the winner of the game
15 not_spam no more raining in winter
25 not_spam the soccer game plays on HBO
2 not_spam the weather in Chicago is nice
3 not_spam pizza is an Italian food
31 neutral construction will be done
4 neutral we have a party now
5 neutral they are driving to downtown
I tried this and this, but they remove either the category or the text.
To remove duplicates on specific column(s), use subset in drop_duplicates:
df.drop_duplicates(subset=['index', 'category', 'text'], keep='first')
Try concat + sort_values:
res = pd.concat((new_df, old_df)).drop_duplicates()
res = res.sort_values(by=['category'], key=lambda x: x.map({'spam' : 0, 'not_spam' : 1, 'neutral': 2}))
print(res)
Output
index category text
0 0 spam you win much money
1 14 spam London is the capital of Canada
1 1 spam you are the winner of the game
2 15 not_spam no more raining in winter
3 25 not_spam the soccer game plays on HBO
2 2 not_spam the weather in Chicago is nice
3 3 not_spam pizza is an Italian food
4 31 neutral construction will be done
4 4 neutral we have a party now
5 5 neutral they are driving to downtown
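If you prefer not to hard-code the mapping inside the key function, an ordered Categorical gives the same custom sort (a sketch, reusing the res frame from above):
order = ['spam', 'not_spam', 'neutral']
res['category'] = pd.Categorical(res['category'], categories=order, ordered=True)
res = res.sort_values('category')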
Your code seems right; try adding this to the concat result and it will remove your duplicates:
# These first lines create a new 'index' column and help the rest of the code work correctly
df_new = df_new.reset_index()
df_old = df_old.reset_index()
df_concat = (pd.concat([df_new, df_old], sort=False)
             .sort_values('category', ascending=False, kind='mergesort'))
df_concat = df_concat.drop_duplicates()
If you want to reindex it as well (without changing the 'index' column, of course):
df_concat = df_concat.drop_duplicates(ignore_index=True)
You can always use combine_first:
out = df_new.combine_first(df_old)
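combine_first aligns on the index, keeping df_new's values and filling in rows that only exist in df_old, so it yields the union of both frames (though not the custom category ordering above). A sketch, assuming 'index' is a regular column as shown in the question:
# Align on the real ids first, then combine: df_new wins on shared ids.
out = df_new.set_index('index').combine_first(df_old.set_index('index'))
print(out.sort_index())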

Filtering a column in a data frame to get only column entries that contain a specific word

print(data['PROD_NAME'])
0 Natural Chip Compny SeaSalt175g
1 CCs Nacho Cheese 175g
2 Smiths Crinkle Cut Chips Chicken 170g
3 Smiths Chip Thinly S/Cream&Onion 175g
4 Kettle Tortilla ChpsHny&Jlpno Chili 150g
...
264831 Kettle Sweet Chilli And Sour Cream 175g
264832 Tostitos Splash Of Lime 175g
264833 Doritos Mexicana 170g
264834 Doritos Corn Chip Mexican Jalapeno 150g
264835 Tostitos Splash Of Lime 175g
Name: PROD_NAME, Length: 264836, dtype: object
I only want product names that have the word 'chip' in it somewhere.
new_data = pd.DataFrame(data['PROD_NAME'].str.contains("Chip"))
print(pd.DataFrame(new_data))
PROD_NAME
0 True
1 False
2 True
3 True
4 False
... ...
264831 False
264832 False
264833 False
264834 True
264835 False
[264836 rows x 1 columns]
My question is: how do I remove the rows that are False and, instead of seeing True in the data frame above, get the product name that caused it to be True?
Btw, this is part of the Quantium data analytics virtual internship program.
Try using .loc with column names to select particular columns that meet the criteria you need. There is some documentation here; the part before the comma is the boolean Series you want to use as a filter (in your case str.contains('Chip')), and the part after the comma is the column or columns you want returned (in your case 'PROD_NAME', but it also works with other columns).
Example
import pandas as pd
example = {'PROD_NAME':['Chippy','ABC','A bag of Chips','MicroChip',"Product C"],'Weight':range(5)}
data = pd.DataFrame(example)
data.loc[data.PROD_NAME.str.contains('Chip'),'PROD_NAME']
#0 Chippy
#2 A bag of Chips
#3 MicroChip
You are almost there; try this:
res = data[data['PROD_NAME'].str.contains("Chip")]
Output:
PROD_NAME
0 Natural Chip Compny SeaSalt175g
2 Smiths Crinkle Cut Chips Chicken 170g
3 Smiths Chip Thinly S/Cream&Onion 175g
8 Doritos Corn Chip Mexican Jalapeno 150g
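One caveat worth knowing: str.contains is case-sensitive by default, so this matches 'Chip' and 'Chips' but would miss a lowercase 'chip'. Passing case=False makes the match case-insensitive:
# Match 'Chip', 'chips', 'CHIP', etc.
res = data[data['PROD_NAME'].str.contains("chip", case=False)]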

Modifying dataframe rows based on list of strings

Background
I have a dataset where I have the following:
product_title price
Women's Pant 20.00
Men's Shirt 30.00
Women's Dress 40.00
Blue 4" Shorts 30.00
Blue Shorts 35.00
Green 2" Shorts 30.00
I created a new column called gender which contains the values Women, Men, or Unisex based on the specified string in product_title.
The output looks like this:
product_title price gender
Women's Pant 20.00 women
Men's Shirt 30.00 men
Women's Dress 40.00 women
Blue 4" Shorts 30.00 women
Blue Shorts 35.00 unisex
Green 2" Shorts 30.00 women
Approach
I approached creating a new column by using if/else statements:
df['gender'] = ['women' if 'women' in word or 'blue 4"' in word or 'green 2"' in word
                else 'men' if 'men' in word
                else 'unisex'
                for word in df.product_title.str.lower()]
Although this approach works, it becomes very long when I have a lot of conditions for labeling women vs men vs unisex. Is there a cleaner way to do this? Is there a way I can pass a list of strings instead of having a long chain of or conditions?
I would really appreciate help, as I am new to Python and the pandas library.
IIUC,
import numpy as np

s = df['product_title'].str.lower()
# Check the 'women' patterns first: 'men' is a substring of 'women',
# and np.select picks the first condition that matches.
df['gender'] = np.select([s.str.contains('women|blue 4"|green 2"'),
                          s.str.contains('men')],
                         ['women', 'men'],
                         default='unisex')
Here is another idea, with str.extract and Series.map:
import re

d = {'women': ['women', 'blue 4"', 'green 2"'], 'men': ['men']}
d1 = {val: k for k, v in d.items() for val in v}
pat = '|'.join(d1.keys())
df['gender'] = (df['product_title'].str.extract('(' + pat + ')', flags=re.I, expand=False)
                .str.lower().map(d1).fillna('unisex'))
print(df)
product_title price gender
0 Women's Pant 20.0 women
1 Men's Shirt 30.0 men
2 Women's Dress 40.0 women
3 Blue 4" Shorts 30.0 women
4 Blue Shorts 35.0 unisex
5 Green 2" Shorts 30.0 women
You can try to define your own function and run it with an apply + lambda expression:
Create the function which you can change as you need:
def sex(title):
    '''
    Look for specific values and return a gender label.
    '''
    lowered = title.lower()
    for word in ['women', 'blue 4"', 'green 2"']:
        if word in lowered:
            return 'women'
    if 'men' in lowered:
        return 'men'
    return 'unisex'
and then apply it to the column you need to check:
df['gender'] = df['product_title'].apply(sex)
Cheers!
EDIT 3:
After looking around and testing the NumPy approach from @ansev, following @anky's comment, I found that this approach may be faster only up to a certain point: with 5,000 rows it was still faster, but the NumPy approach started to catch up. So it really depends on how big your datasets are.
I will remove my earlier comments on speed, since I was initially testing only on this small frame; still a learning process, as you can see from my level.

Dealing with abbreviation and misspelled words in DataFrame Pandas

I have a dataframe that contains misspelled words and abbreviations, like this.
input:
df = pd.DataFrame(['swtch', 'cola', 'FBI',
'smsng', 'BCA', 'MIB'], columns=['misspelled'])
output:
misspelled
0 swtch
1 cola
2 FBI
3 smsng
4 BCA
5 MIB
I need to correct the misspelled words and the abbreviations.
I have tried creating a dictionary such as:
input:
dicts = pd.DataFrame(['coca cola', 'Federal Bureau of Investigation',
'samsung', 'Bank Central Asia', 'switch', 'Men In Black'], columns=['words'])
output:
words
0 coca cola
1 Federal Bureau of Investigation
2 samsung
3 Bank Central Asia
4 switch
5 Men In Black
and applying this code:
import difflib
import numpy as np

x = [next(iter(m), np.nan)
     for m in map(lambda w: difflib.get_close_matches(w, dicts.words), df.misspelled)]
df['fix'] = x
print(df)
The output shows that I have succeeded in correcting the misspellings, but not the abbreviations:
misspelled fix
0 swtch switch
1 cola coca cola
2 FBI NaN
3 smsng samsung
4 BCA NaN
5 MIB NaN
Please help.
How about a two-pronged approach: first correct the misspellings, then expand the abbreviations.
from spellchecker import SpellChecker

df = pd.DataFrame(['swtch', 'cola', 'FBI', 'smsng', 'BCA', 'MIB'], columns=['misspelled'])

abbreviations = {
    'FBI': 'Federal Bureau of Investigation',
    'BCA': 'Bank Central Asia',
    'MIB': 'Men In Black',
    'cola': 'Coca Cola'
}

spell = SpellChecker()
df['fixed'] = df['misspelled'].apply(spell.correction).replace(abbreviations)
Result:
misspelled fixed
0 swtch switch
1 cola Coca Cola
2 FBI Federal Bureau of Investigation
3 smsng among
4 BCA Bank Central Asia
5 MIB Men In Black
I use pyspellchecker, but you can go with any spell-checking library. It corrected smsng to among, but that is a caveat of automatic spelling correction; different libraries may give different results.
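If the spell checker ever mangles an abbreviation before replace gets to see it, one variation (a sketch only, reusing the abbreviations dict and spell object from above) is to expand known abbreviations first and fall back to spell correction for everything else:
def fix(word):
    # Known abbreviations win; otherwise fall back to spell correction.
    return abbreviations.get(word, spell.correction(word))

df['fixed'] = df['misspelled'].apply(fix)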

Creating a new column with last 2 values after a str.split operation

I came across this extremely well explained similar question (Get last "column" after .str.split() operation on column in pandas DataFrame) and used some of the code found there. However, it's not quite the output that I would like.
raw_data = {
    'category': ['sweet beverage, cola,sugared', 'healthy,salty snacks',
                 'juice,beverage,sweet', 'fruit juice,beverage',
                 'appetizer,salty crackers'],
    'product_name': ['coca-cola', 'salted pistachios', 'fruit juice',
                     'lemon tea', 'roasted peanuts']}
df = pd.DataFrame(raw_data)
The objective is to extract the various categories from each row and use only the last 2 categories to create a new column. I have this code, which works, and I have the categories of interest as a new column.
df['my_col'] = df.category.apply(lambda s: s.split(',')[-2:])
output
my_col
[cola,sugared]
[healthy,salty snacks]
[beverage,sweet]
...
However, it appears as a list. How can I have it not appear as a list? Can this be achieved? Thanks all!
I believe you need str.split, then select the last two items of each list, and finally str.join:
df['my_col'] = df.category.str.split(',').str[-2:].str.join(',')
print (df)
category product_name my_col
0 sweet beverage, cola,sugared coca-cola cola,sugared
1 healthy,salty snacks salted pistachios healthy,salty snacks
2 juice,beverage,sweet fruit juice beverage,sweet
3 fruit juice,beverage lemon tea fruit juice,beverage
4 appetizer,salty crackers roasted peanuts appetizer,salty crackers
EDIT:
In my opinion the pandas str text functions are preferable to apply with pure Python string functions, because they also work with NaN and None.
import numpy as np

raw_data = {
    'category': [np.nan, 'healthy,salty snacks'],
    'product_name': ['coca-cola', 'salted pistachios']}
df = pd.DataFrame(raw_data)
df['my_col'] = df.category.str.split(',').str[-2:].str.join(',')
print (df)
category product_name my_col
0 NaN coca-cola NaN
1 healthy,salty snacks salted pistachios healthy,salty snacks
The apply-based version, by contrast, fails on the NaN:
df['my_col'] = df.category.apply(lambda s: ','.join(s.split(',')[-2:]))
AttributeError: 'float' object has no attribute 'split'
You can also use join inside the lambda on the result of split:
df['my_col'] = df.category.apply(lambda s: ','.join(s.split(',')[-2:]))
df
Result:
category product_name my_col
0 sweet beverage, cola,sugared coca-cola cola,sugared
1 healthy,salty snacks salted pistachios healthy,salty snacks
2 juice,beverage,sweet fruit juice beverage,sweet
3 fruit juice,beverage lemon tea fruit juice,beverage
4 appetizer,salty crackers roasted peanuts appetizer,salty crackers
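If the column can contain NaN (the AttributeError shown earlier), one way to keep this apply version safe is to guard the lambda with a type check (a sketch; the str-based variant above handles this for you):
df['my_col'] = df.category.apply(
    lambda s: ','.join(s.split(',')[-2:]) if isinstance(s, str) else s)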
