I'm looking for the best way to categorise items on a clothing website based on keywords that may be found in the item's title.
The categories will be the gender of the clothing item: womens, mens, boys, girls. However, depending on the item, the titles may contain different keywords such as 'female', 'woman', 'women', 'lady's' and so on.
My thoughts are to put the keywords into a list and then cycle through the list looking for a match and then categorise accordingly.
If I follow this method though, is it possible to do this with a list within a list and cycle through that? So we could have:
gender = ['woman', [# keywords for women's clothes], 'men', [# keywords for men's clothes]]
Then cycle through this and, if we find a match, tag the item accordingly. Alternatively, it may be better to use a dictionary: have the key be the category, with a list of corresponding keywords as the value.
Or, there could be an altogether different solution that I've completely missed. I feel there is a pretty simple solution to this but for some reason I can't seem to get my head around it. Thanks in advance.
Try this:

import pandas as pd

d = {'men': ['men', 'boy'], 'women': ['women', 'girl', 'lady']}

def classify(text):
    gender = 'None of any'
    for i in d:
        if any(j in text for j in d[i]):
            gender = i
    return gender

df = pd.DataFrame({'text': ['this is a boy', 'a girl']})
df['cat'] = df['text'].apply(classify)
print(df)
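One caveat with the plain substring check: `'men' in text` also fires inside 'women'. In the answer above this is masked only because the 'women' entry comes later in the dict and overwrites the earlier match. A word-boundary regex variant avoids relying on that ordering; this is a sketch using the same `d` mapping and the same fallback label:

```python
import re

d = {'men': ['men', 'boy'], 'women': ['women', 'girl', 'lady']}

def classify_words(text):
    for category, keywords in d.items():
        # \b ensures 'men' matches only as a whole word, not inside 'women'
        pattern = r'\b(?:' + '|'.join(map(re.escape, keywords)) + r')\b'
        if re.search(pattern, text, flags=re.I):
            return category
    return 'None of any'
```

Here `classify_words('smart dress for women')` returns 'women' even though 'men' is a substring of 'women', and the function can return as soon as the first category matches.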
You can use flashtext to extract keywords from a given string:

from flashtext import KeywordProcessor

kp = KeywordProcessor()
dict_ = {'sport': ['cricket', 'football'], 'movie': ['horror', 'drama']}  # here you can add the lists of words for men's and women's clothes
kp.add_keywords_from_dict(dict_)

# now you can extract keywords from a given string
kp.extract_keywords('I love playing football')
# output:
['sport']

kp.extract_keywords("some people don't like to watch drama and horror movie, but love to watch cricket")
# output:
['movie', 'movie', 'sport']
Related
I currently have a list of words related to MMA.
I want to create a new column in my Pandas DataFrame called 'MMA Related Word Count'. I want to analyze the 'Speech' column for each row and sum up how often the words from the list below occur within the speech. Does anyone know the best way to do this? I'd love to hear it, thanks in advance!
Please take a look at my dataframe.
CODE EXAMPLE:

import pandas as pd

mma_related_words = ['mma', 'ju jitsu', 'boxing']

data = {
    "Name": ['Dana White', 'Triple H'],
    "Speech": ['mma is a fantastic sport. ju jitsu makes you better as a person.', 'Boxing sucks. Professional dancing is much better.']
}

# load data into a DataFrame object:
df = pd.DataFrame(data)
print(df)
CURRENT DATAFRAME:

Name         Speech
Dana White   mma is a fantastic sport. ju jitsu makes you better as a person.
Triple H     boxing sucks. Professional wrestling is much better.

--
EXPECTED OUTPUT:
Exactly the same as above, but with a new column 'MMA Related Word Count' on the right: for Dana White I want the value 2, for Triple H the value 1.
You can use a regex with str.count:

import re

regex = '|'.join(map(re.escape, mma_related_words))
# e.g. 'mma|ju\\ jitsu|boxing' (the exact escaping depends on your Python version)
df['Word Count'] = df['Speech'].str.count(regex, flags=re.I)

# or
# df['Word Count'] = df['Speech'].str.count(r'(?i)' + regex)
output:
Name Speech Word Count
0 Dana White mma is a fantastic sport. ju jitsu makes you b... 2
1 Triple H Boxing sucks. Professional dancing is much bet... 1
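If you also want to see which words matched, not just how many, `str.findall` with the same pattern returns the matches themselves, and `str.len` on that column reproduces the count. A self-contained sketch rebuilding the same frame:

```python
import re
import pandas as pd

mma_related_words = ['mma', 'ju jitsu', 'boxing']
df = pd.DataFrame({
    'Name': ['Dana White', 'Triple H'],
    'Speech': ['mma is a fantastic sport. ju jitsu makes you better as a person.',
               'Boxing sucks. Professional dancing is much better.'],
})

regex = '|'.join(map(re.escape, mma_related_words))
# findall returns every (case-insensitive) match, so its length equals str.count
df['Matches'] = df['Speech'].str.findall(regex, flags=re.I)
df['Word Count'] = df['Matches'].str.len()
```

Note that the matches keep their original casing ('Boxing'), since only the search is case-insensitive.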
Using a simple loop in an apply lambda function shall work; try this:

def fun(string):
    cnt = 0
    for w in mma_related_words:
        if w.lower() in string.lower():
            cnt = cnt + 1
    return cnt

df['MMA Related Word Count'] = df['Speech'].apply(lambda x: fun(string=x))

The same can also be written as:

df['MMA Related Word Count1'] = df['Speech'].apply(lambda x: sum(1 for w in mma_related_words if w.lower() in str(x).lower()))

Output of df:
I have difficulties in building a network from this dataset
Node Sentence
Mary I am here to help. What would you like to talk about?
Mary What's up? I hope everything is going well in NY. I have always loved NY, the Big Apple!
John There is the football match, tonight. Let's go to the pub!
Christopher It is a great news! I am so happy for y'all
Catherine Do not do that! It is extremely dangerous
Matt I read that news. I was so happy and grateful it was not you.
Matt Yes, I didn't know it. It is such a surprising news! Congratulations!
Sarah Nothing to add...
Catherine Finally a beautiful sunny day!!!
Mary Jane I do not think it will rain. There is the sun. It is a hot day. Very hot!
Names should be the nodes in a network. For each node, I should create a link to the frequent words within its sentences (excluding stopwords) to get the more meaningful relationships.
For removing stopwords from the sentences I am using nltk (it does not work perfectly, but it should be ok):
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

stop_words = set(stopwords.words('english'))
df['Sentence'] = df['Sentence'].str.lower().str.split()
df['Sentence'] = df['Sentence'].apply(lambda x: [item for item in x if item not in stop_words])
Then, for the frequency of words, I would first create a vocabulary with all the terms and their corresponding frequency, then I would go back to sentences to create a couple (word, freq) where the word is the 'target' node and the freq should be the size of the target node.
This is where my difficulties are revealed, since this:

word = df['Sentence'].tolist()
words = nltk.tokenize.word_tokenize(word)
word_dist = nltk.FreqDist(words)
result = pd.DataFrame(word_dist, columns=['Word', 'Frequency'])

does not show the words and their frequency. (I am creating a new dataframe to show them, instead of adding this information in two more columns of my original dataframe; the latter would be preferable.)
For building the network, once got the Node, Target and Weight, I would use networkx.
Example of word_dist result (not sorted):
Word Frequency
help 8
like 12
news 21
day 8
sunny 17
sun 23
football 12
pub 3
home 14
congratulations 3
The nltk.FreqDist() class returns a collections.Counter object, which is basically a dictionary. When pandas constructs a dataframe and the first argument is a dictionary, each key is considered a column, and each value is expected to be a list of column values. So, as in the example below, result will be an empty dataframe with two columns.
To construct a dataframe from a dictionary where every key is a row, you can simply split the dictionary apart into keys and values, as in the construction of result2. The next line sets the name of the index, if you so desire.
import pandas as pd

word_dict = {'help': '8',
             'like': '12',
             'news': '21',
             'day': '8',
             'sunny': '17',
             'sun': '23',
             'football': '12',
             'pub': '3',
             'home': '14',
             'congratulations': '3'}

result = pd.DataFrame(word_dict, columns=('a', 'b'))
result2 = pd.DataFrame(word_dict.values(), index=word_dict.keys(), columns=('Frequency',))
result2.index.rename('Word', inplace=True)
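Alternatively, since FreqDist is a Counter, its `.most_common()` method yields `(word, count)` pairs that pandas accepts directly as rows, which also sorts by descending frequency. A minimal sketch, with a plain `Counter` standing in for `nltk.FreqDist`:

```python
import pandas as pd
from collections import Counter

# Counter stands in here for the nltk.FreqDist built from the tokenized sentences
word_dist = Counter({'help': 8, 'like': 12, 'news': 21})

# most_common() returns [('news', 21), ('like', 12), ('help', 8)]
result = pd.DataFrame(word_dist.most_common(), columns=['Word', 'Frequency'])
```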
I just started to learn Python. I have a question about matching some of the words in my dataset in Excel.
words_list contains some of the words I would like to find in the dataset.
words_list = ('tried', 'mobile', 'abc')
df is an extract from the Excel file, picked up as a single column.
df =
0 to make it possible or easier for someone to do ...
1 unable to acquire a buffer item very likely ...
2 The organization has tried to make...
3 Broadway tried a variety of mobile Phone for the..
I would like to get the result like this:
'None',
'None',
'tried',
'tried','mobile'
I tried this in Jupyter:

list = []
for word in df:
    if any(aa in word for aa in words_list):
        list.append(word)
    else:
        list.append('None')
print(list)
But the result shows the whole sentence from df:
'None'
'None'
'The organization has tried to make...'
'Broadway tried a variety of mobile Phone for the..'
Can I show only the words from the words list in the result?
Sorry for my English, and thank you all.
I'd suggest a manipulation on the DataFrame (that should always be your first thought; use the power of pandas):

import pandas as pd

words_list = {'tried', 'mobile', 'abc'}
df = pd.DataFrame({'col': ['to make it possible or easier for someone to do',
                           'unable to acquire a buffer item very likely',
                           'The organization has tried to make',
                           'Broadway tried a variety of mobile Phone for the']})

df['matches'] = df['col'].str.split().apply(lambda x: set(x) & words_list)
print(df)
col matches
0 to make it possible or easier for someone to do {}
1 unable to acquire a buffer item very likely {}
2 The organization has tried to make {tried}
3 Broadway tried a variety of mobile Phone for the {mobile, tried}
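To get exactly the output shape asked for in the question ('None' when nothing matches), the matches column can be post-processed. A sketch rebuilding the same frame; the 'result' column name is just for illustration:

```python
import pandas as pd

words_list = {'tried', 'mobile', 'abc'}
df = pd.DataFrame({'col': ['to make it possible or easier for someone to do',
                           'unable to acquire a buffer item very likely',
                           'The organization has tried to make',
                           'Broadway tried a variety of mobile Phone for the']})

df['matches'] = df['col'].str.split().apply(lambda x: set(x) & words_list)
# render each row's matched words as a string, falling back to 'None'
df['result'] = df['matches'].apply(lambda s: ', '.join(sorted(s)) if s else 'None')
```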
The reason it's printing the whole line has to do with your:

for word in df:

Your "word" variable is actually taking the whole line. Then it's checking the whole line to see if it contains your search word. If it does find it, then it basically says, "yes, I found ____ in this line, so append the line to your list."
What it sounds like you want to do is first split the line into words, and THEN check.

list = []
found = False
for line in df:
    words = line.split(" ")
    for word in words_list:
        if word in words:
            found = True
            list.append(word)
    # this is just to append "None" if nothing was found on this line
    if found:
        found = False
    else:
        list.append("None")
print(list)
As a side note, you may want to use pprint instead of print when working with lists. It prints lists, dictionaries, etc. in easier-to-read layouts, and it is part of the standard library, so there is nothing extra to install. Usage would be something like:

from pprint import pprint

dictionary = {'firstkey': 'firstval', 'secondkey': 'secondval', 'thirdkey': 'thirdval'}
pprint(dictionary)
I have a list of text data which contains reviews, something likes this:
1. 'I have bought several of the Vitality canned dog food products and have found them all to be of good quality. The product looks more like a stew than a processed meat and it smells better. My Labrador is finicky and she appreciates this product better than most.'
2. 'Product arrived labeled as Jumbo Salted Peanuts...the peanuts were actually small sized unsalted. Not sure if this was an error or if the vendor intended to represent the product as "Jumbo".',
3. 'This is a confection that has been around a few centuries. It is a light, pillowy citrus gelatin with nuts - in this case Filberts. And it is cut into tiny squares and then liberally coated with powdered sugar. And it is a tiny mouthful of heaven. Not too chewy, and very flavorful. I highly recommend this yummy treat. If you are familiar with the story of C.S. Lewis\' "The Lion, The Witch, and The Wardrobe" - this is the treat that seduces Edmund into selling out his Brother and Sisters to the Witch.
I have a separate list of words which I want to check against these reviews:

['food', 'science', 'good', 'buy', 'feedback', ...]

I want to know which of these words are present in each review, and select reviews which contain a certain number of these words. For example, let's say I only select reviews which contain at least 3 of the words from this list: it should display all those reviews, but also show which of the words were encountered in each review while selecting it.
I have the code for selecting reviews containing at least 3 of the words, but how do I get the second part, which tells me which words exactly were encountered? Here is my initial code:
keywords = list(words)
text = list(df.summary.values)
sentences = []
for element in text:
    if len(set(keywords) & set(element.split(' '))) >= 3:
        sentences.append(element)
To answer the second part, allow me to revisit how to approach the first part. A handy approach here is to cast your review strings into sets of word strings, like this:

review_1 = "I have bought several of the Vitality canned dog food products and"
review_1 = set(review_1.split(" "))

Now the review_1 set contains one of every word. Then take your list of words, convert it to a set, and do an intersection:

words = ['food', 'science', 'good', 'buy', 'feedback', ...]
words = set(words)
matches = review_1.intersection(words)

The resulting set, matches, contains all the words that are common. Its length is the number of matches.
Now, this does not work if you cared about how many of each word matches. For example, if the word "food" is found twice in the review and "science" is found once, does that count as matching three words?
If so, let me know via comment and I can write some code to update the answer to include that scenario.
EDIT: Updating to include comment question
If you want to keep a count of how many times each word repeats, then hang onto the review's word list. Only cast it to a set when performing the intersection. Then use the list's 'count' method to count the number of times each match appears in the review. In the example below, I use a dictionary to store the results.

review_1 = "I have bought several of the Vitality canned dog food products and"
review_words = review_1.split(" ")

words = set(['food', 'science', 'good', 'buy', 'feedback', ...])

matches = set(review_words).intersection(words)

match_counts = dict()
for match in matches:
    match_counts[match] = review_words.count(match)
You can use set intersection for finding the common words:

def filter_reviews(data, *, trigger_words=frozenset({'food', 'science', 'good', 'buy', 'feedback'})):
    for review in data:
        words = review.split()  # use whatever method is appropriate to get the words
        common = trigger_words.intersection(words)
        if len(common) >= 3:
            yield review, common
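A usage sketch of the generator above, with two made-up reviews; the second yields nothing because fewer than 3 trigger words appear in it:

```python
def filter_reviews(data, *, trigger_words=frozenset({'food', 'science', 'good', 'buy', 'feedback'})):
    for review in data:
        words = review.split()  # use whatever method is appropriate to get the words
        common = trigger_words.intersection(words)
        if len(common) >= 3:
            yield review, common

# hypothetical reviews for illustration
reviews = ['I think this food is good value, a smart buy',
           'Nothing relevant here']

# each yielded item is a (review, matched_words) pair
selected = list(filter_reviews(reviews))
```

Yielding the pair keeps the "which words were encountered" information attached to each selected review.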
I am working on merging a few datasets regarding over 200 countries in the world. In cleaning the data I need to convert some three-letter codes for each country into the countries' full names.
The three-letter codes and country full names come from a separate CSV file, which shows a slightly different set of countries.
My question is: Is there a better way to write this?
str.replace("USA", "United States of America")
str.replace("CAN", "Canada")
str.replace("BHM", "Bahamas")
str.replace("CUB", "Cuba")
str.replace("HAI", "Haiti")
str.replace("DOM", "Dominican Republic")
str.replace("JAM", "Jamaica")
and so on. It goes on for another 200 rows. Thank you!
Since the number of substitutions is high, I would instead iterate over the words in the string and replace them based on a dictionary lookup.

mapofcodes = {'USA': 'United States of America', ...}

finalstr = ' '.join(mapofcodes.get(word, word) for word in mystring.split())

Words that are not in the dictionary are left unchanged by the .get(word, word) fallback.
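Since the codes ultimately live in a dataset, the same dictionary also plugs straight into pandas. A sketch with a hypothetical 'code' column; unknown codes fall back to themselves:

```python
import pandas as pd

mapofcodes = {'USA': 'United States of America', 'CAN': 'Canada', 'CUB': 'Cuba'}

df = pd.DataFrame({'code': ['USA', 'CUB', 'XYZ']})
# map known codes to full names; fillna keeps unrecognized codes as-is
df['country'] = df['code'].map(mapofcodes).fillna(df['code'])
```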
Try reading the CSV file into a dictionary or a 2D array; you can then access whichever entry you want.
That is, if I understand your question correctly.
Here's a regular-expressions solution:

import re

COUNTRIES = {'USA': 'United States of America', 'CAN': 'Canada'}

def repl(m):
    country_code = m.group(1)
    return COUNTRIES.get(country_code, country_code)

p = re.compile(r'([A-Z]{3})')
my_string = p.sub(repl, my_string)
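A quick run of that approach on a made-up sentence. Note that any run of three uppercase letters matches the pattern, so unknown triples like 'GDP' pass through unchanged via the .get fallback:

```python
import re

COUNTRIES = {'USA': 'United States of America', 'CAN': 'Canada'}

def repl(m):
    country_code = m.group(1)
    # unknown codes are returned unchanged
    return COUNTRIES.get(country_code, country_code)

p = re.compile(r'([A-Z]{3})')
result = p.sub(repl, 'USA and CAN trade; GDP is up')
```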