Picking phrases containing specific words in python

I have a list of 10 names and a list of many phrases. I only want to select the phrases that contain one of those names.
ArrayNames = ["Mark", "Alice", "Paul"]
ArrayPhrases = ["today is sunny", "Paul likes apples", "The cat is alive"]
In this example, is there any way to pick only the second phrase, given that it contains Paul, using these two arrays?
This is what I tried:
def foo(x,y):
    tmp = []
    for phrase in x:
        if any(y) in phrase:
            tmp.append(phrase)
    print(tmp)
x is the array of phrases, y is the array of names.
This is the output:
if any(y) in phrase:
TypeError: coercing to Unicode: need string or buffer, bool found
I'm very unsure about the syntax I used concerning the any() construct. Any suggestions?

Your usage of any is incorrect. Do the following instead:
ArrayNames = ['Mark', 'Alice', 'Paul']
ArrayPhrases = ["today is sunny", "Paul likes apples", "The cat is alive"]
result = []
for phrase in ArrayPhrases:
    if any(name in phrase for name in ArrayNames):
        result.append(phrase)
print(result)
Output
['Paul likes apples']
You are getting a TypeError because any returns a bool, and you're then searching for that bool inside a string (if any(y) in phrase:).
Note that any(y) itself runs without error because it only checks the truthiness of each string in y; any non-empty string is truthy, so it returns True.
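A minimal sketch of the same fix wrapped in a reusable function. The name phrases_with_names is illustrative and not from the original question. It also shows that any() applied to the bare list collapses it to a single bool, which is why the original test could not work.

```python
def phrases_with_names(phrases, names):
    """Return the phrases that contain at least one of the names."""
    # any() receives one True/False per name, so the test runs per phrase
    return [phrase for phrase in phrases
            if any(name in phrase for name in names)]

array_names = ["Mark", "Alice", "Paul"]
array_phrases = ["today is sunny", "Paul likes apples", "The cat is alive"]

# any() on the bare list only asks "is any element truthy?"
print(any(array_names))                                # True
print(phrases_with_names(array_phrases, array_names))  # ['Paul likes apples']
```

Note that name in phrase is a substring test, so "Paul" would also match a phrase containing "Paula".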

Related

how to know if a list of word are in the dictionary

I have a list of words like this
name
[primevère]
[one, federal, credit, union]
[nitroxal]
I also have a dictionary called words, which contains all the words in the English dictionary. Its format is like this:
word
data
name
english...
For one, federal, credit, union, I want to test whether each element is in the dictionary. I tried:
df['is_english'] = df['name'].isin(words)
df.head(3)
However, this does not cover a list like [one, federal, credit, union]. I want to check whether each element in the list is an English word: if any of them is not English, return False; if all of them are English, return True.

Let me give you a working example of using apply to check whether all items in a list in the dataframe's name column are in the list words:
import pandas as pd
words = ['data','name','english']
df = pd.DataFrame({'name': [['primevère'],['one', 'federal', 'credit', 'union'],['english'],['nitroxal']]})
df['is_english'] = df['name'].apply(lambda wds: all(x in words for x in wds))
Result:
                            name  is_english
0                    [primevère]       False
1  [one, federal, credit, union]       False
2                      [english]        True
3                     [nitroxal]       False
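If words is large, converting it to a set first should make each membership test faster. The sketch below shows the per-row check on its own, without pandas. The helper name all_in_dictionary and the tiny vocabulary are assumptions for illustration only.

```python
def all_in_dictionary(tokens, vocabulary):
    """Return True only if every token is in the vocabulary set."""
    return set(tokens).issubset(vocabulary)

vocabulary = {"data", "name", "english"}  # stand-in for the full word list

rows = [["primevère"], ["one", "federal", "credit", "union"], ["english"], ["nitroxal"]]
flags = [all_in_dictionary(row, vocabulary) for row in rows]
print(flags)  # [False, False, True, False]
```

With a dataframe, the same helper could be passed to apply in place of the lambda, for example df['name'].apply(lambda wds: all_in_dictionary(wds, vocabulary)).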

Return substring if present in a string and match with case insensitive python

I am currently trying to return a substring if it is present in a string, matching case-insensitively.
So an example would be, I want to return the string "apple" even when the sentence is "Apple is cool" or "I like APPLE" or "I like apples"
What I have so far is this:
df_word_list = pd.DataFrame({'word': ['apple','cool']})
df = pd.DataFrame({'sentence': ["Apple is cool", "I like APPLE", "I like apples"]})
words = [x for x in df_word_list['word'].tolist() if x in str(df['sentence'][i])]
This gives me the returned words, but it's case sensitive. Does anyone know how to make it case insensitive?
I would like the final output to be
apple, cool
apple
Row 3 is empty because it has an "s" ("apples" instead of "apple").
df_word_list is the dataframe of words that I want to identify. df is the dataframe that contains the sentences.

df.sentence.str.lower().str.split().apply(lambda l: ", ".join([x for x in l if x in df_word_list["word"].values]))
The result is a pandas.Series of strings:
0    apple, cool
1          apple
2
Name: sentence, dtype: object
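Because str.split() only splits on whitespace, a sentence such as "I like apple." would leave the token "apple." and miss the match. A minimal sketch of a variant that tokenizes with a regular expression instead, shown without pandas; find_words is a hypothetical helper name, not part of the answer above.

```python
import re

def find_words(sentence, words):
    """Return the words that occur as whole tokens in sentence, ignoring case."""
    # \w+ keeps runs of letters/digits and drops punctuation
    tokens = set(re.findall(r"\w+", sentence.lower()))
    return [w for w in words if w.lower() in tokens]

word_list = ["apple", "cool"]
for sentence in ["Apple is cool", "I like APPLE", "I like apples"]:
    print(", ".join(find_words(sentence, word_list)))
```

The third sentence still yields an empty string, because "apples" is a different token from "apple".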

Extract words in a paragraph that are similar to words in list

I have the following string:
"The boy went to twn and bought sausage and chicken. He then picked a tddy for his sister"
List of words to be extracted:
["town","teddy","chicken","boy went"]
NB: town and teddy are wrongly spelt in the given sentence.
I have tried the following but I get other words that are not part of the answer:
import difflib
sent = "The boy went to twn and bought sausage and chicken. He then picked a tddy for his sister"
list1 = ["town","teddy","chicken","boy went"]
[difflib.get_close_matches(x.lower().strip(), sent.split()) for x in list1 ]
I am getting the following result:
[['twn', 'to'], ['tddy'], ['chicken.', 'picked'], ['went']]
instead of:
'twn', 'tddy', 'chicken','boy went'

Notice in the documentation for difflib.get_close_matches():
difflib.get_close_matches(word, possibilities, n=3, cutoff=0.6)
Return a list of the best "good enough" matches. word is a sequence for which close matches are desired (typically a string), and possibilities is a list of sequences against which to match word (typically a list of strings).
Optional argument n (default 3) is the maximum number of close matches to return; n must be greater than 0.
Optional argument cutoff (default 0.6) is a float in the range [0, 1]. Possibilities that don't score at least that similar to word are ignored.
At the moment, you are using the default n and cutoff arguments.
You can specify either (or both), to narrow down the returned matches.
For example, you could use a cutoff score of 0.75:
result = [difflib.get_close_matches(x.lower().strip(), sent.split(), cutoff=0.75) for x in list1]
Or, you could specify that only at most 1 match should be returned:
result = [difflib.get_close_matches(x.lower().strip(), sent.split(), n=1) for x in list1]
In either case, you could use a list comprehension to flatten the list of lists (difflib.get_close_matches() always returns a list, but note that r[0] raises an IndexError if that list is empty, i.e. when a word has no close match):
matches = [r[0] for r in result]
Since you also want to check for close matches of bigrams, you can do so by extracting pairings of adjacent "words", and pass them to difflib.get_close_matches() as part of the possibilities argument.
Here is a full working example of this in action:
import difflib
import re
sent = "The boy went to twn and bought sausage and chicken. He then picked a tddy for his sister"
list1 = ["town", "teddy", "chicken", "boy went"]
# this extracts overlapping pairings of "words"
# i.e. ['The boy', 'boy went', 'went to', 'to twn', ...
pairs = re.findall(r'(?=(\b[^ ]+ [^ ]+\b))', sent)
# we pass the sent.split() list as before
# and concatenate the new pairs list to the end of it also
result = [difflib.get_close_matches(x.lower().strip(), sent.split() + pairs, n=1) for x in list1]
matches = [r[0] for r in result]
print(matches)
# ['twn', 'tddy', 'chicken.', 'boy went']
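Two details of the example above are worth guarding against: tokens from sent.split() keep trailing punctuation (hence 'chicken.'), and r[0] fails when a word has no close match. The following sketch handles both, under the assumption that stripping punctuation with \w+ is acceptable for your text. The function name best_matches is illustrative.

```python
import difflib
import re

def best_matches(targets, sentence, cutoff=0.6):
    """Return the closest token or adjacent token pair for each target."""
    tokens = re.findall(r"\w+", sentence)                    # words, no punctuation
    pairs = [" ".join(p) for p in zip(tokens, tokens[1:])]  # adjacent bigrams
    candidates = tokens + pairs
    found = []
    for target in targets:
        close = difflib.get_close_matches(target.lower().strip(), candidates,
                                          n=1, cutoff=cutoff)
        if close:  # skip targets that have no close match at all
            found.append(close[0])
    return found

sent = "The boy went to twn and bought sausage and chicken. He then picked a tddy for his sister"
print(best_matches(["town", "teddy", "chicken", "boy went"], sent))
```

Building the pairs from the cleaned tokens, rather than with a lookahead regex on the raw sentence, keeps the unigram and bigram candidates consistent with each other.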

If you read the Python documentation for difflib.get_close_matches():
https://docs.python.org/3/library/difflib.html
By default it returns up to three of the best matches, not just one.
Method signature:
difflib.get_close_matches(word, possibilities, n=3, cutoff=0.6)
Here n is the maximum number of close matches to return. So I think you can pass this as 1.
>>> [difflib.get_close_matches(x.lower().strip(), sent.split(),1)[0] for x in list1]
['twn', 'tddy', 'chicken.', 'went']

Why isn't there function to count Document Frequency (DF) in NLTK?

I am looking for a function to get the DF of a certain term (meaning how many documents in a corpus contain a certain word), but I can't seem to find the function here. The page only has functions to get the values of tf, idf, and tf_idf. I am looking specifically for DF only. I copied the code below from the documentation,
matches = len([True for text in self._texts if term in text])
but I don't like the result it gives. For example, if I have a list of strings and I am looking for the word Pete, it also matches the name Peter, which is not what I want. For example:
texts = [['the', 'boy', 'peter'],['pete','the', 'boy'],['peter','rabbit']]
So I am looking for pete which appears TWICE, but the code I showed above will tell you that there are THREE pete's because it also counts peter. How do I solve this? Thanks.

Your description is incorrect. The expression you posted does indeed give 1, not 3, when you search for pete in texts:
>>> texts = [['the', 'boy', 'peter'],['pete','the', 'boy'],['peter','rabbit']]
>>> len([True for text in texts if 'pete' in text])
1
The only way you could have matched partial words is if your texts were not tokenized (i.e. if texts is a list of strings, not a list of token lists).
But the above code is terrible: it builds a list for no reason at all. A better (and more conventional) way to count hits is this:
>>> sum(1 for text in texts if 'pete' in text)
1
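To make the difference between tokenized and untokenized input concrete, here is a small sketch. The function name document_frequency is an assumption for illustration and is not an NLTK API.

```python
def document_frequency(term, docs):
    """Count how many documents contain term at least once."""
    return sum(1 for doc in docs if term in doc)

tokenized = [['the', 'boy', 'peter'], ['pete', 'the', 'boy'], ['peter', 'rabbit']]
untokenized = [' '.join(doc) for doc in tokenized]

print(document_frequency('pete', tokenized))    # 1: list membership compares whole tokens
print(document_frequency('pete', untokenized))  # 3: str membership is a substring test
```

The second call reproduces the over-count described in the question, which suggests the original texts were plain strings rather than token lists.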

As for the question that you pose (Why (...)?): I don't know.
As a solution to your example (noting that peter occurs twice and pete just once):
texts = [['the', 'boy', 'peter'],['pete','the', 'boy'],['peter','rabbit']]
def flatten(l):
    out = []
    for item in l:
        if isinstance(item, (list, tuple)):
            out.extend(flatten(item))
        else:
            out.append(item)
    return out
flat = flatten(texts)
len([c for c in flat if c in ['pete']])
len([c for c in flat if c in ['peter']])
Compare the two results
Edit:
import collections
def counts(listr, word):
    total = []
    # iterate over the documents passed in, not the global texts
    for doc in listr:
        total.append(word in collections.Counter(doc))
    return sum(total)
counts(texts,'peter')
#2

What is efficient way to match words in string?

Example:
names = ['James John', 'Robert David', 'Paul' ... the list has 5K items]
text1 = 'I saw James today'
text2 = 'I saw James John today'
text3 = 'I met Paul'
is_name_in_text(text1,names) # this returns False; 'James' is not in the list
is_name_in_text(text2,names) # this returns 'James John'
is_name_in_text(text3,names) # this returns 'Paul'
is_name_in_text() checks whether any name from the list appears in the text.
The easy way is to check each name against the text with the in operator, but the list has 5,000 items, so that is not efficient. I could split the text into words and check whether each word is in the list, but that will not work when a name spans more than one word. Line number 7 will fail in this case.

Make names into a set and use the in-operator for fast O(1) lookup.
You can use a regex to parse out the possible names in a sentence:
>>> import re
>>> findnames = re.compile(r'([A-Z]\w*(?:\s[A-Z]\w*)?)')
>>> def is_name_in_text(text, names):
...     for possible_name in set(findnames.findall(text)):
...         if possible_name in names:
...             return possible_name
...     return False
>>> names = set(['James John', 'Robert David', 'Paul'])
>>> is_name_in_text('I saw James today', names)
False
>>> is_name_in_text('I saw James John today', names)
'James John'
>>> is_name_in_text('I met Paul', names)
'Paul'

Build a regular expression with all the alternatives. This way you don't have to worry about somehow pulling the names out of the phrases beforehand.
import re
names_re = re.compile(r'\b' +
                      r'\b|\b'.join(re.escape(name) for name in names) +
                      r'\b')
print(names_re.search('I saw James today'))
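A sketch of how such a pattern could back the is_name_in_text() function from the question, written for Python 3. Sorting the names longest-first is an added assumption so that a longer name wins over a shorter one that is its prefix; it is not part of the original answer, and build_names_re is an illustrative name.

```python
import re

def build_names_re(names):
    # Longest names first, so 'James John' is tried before a hypothetical 'James'
    ordered = sorted(names, key=len, reverse=True)
    return re.compile(r'\b(?:' + '|'.join(re.escape(n) for n in ordered) + r')\b')

def is_name_in_text(text, names_re):
    """Return the first name found in text, or False if none is present."""
    match = names_re.search(text)
    return match.group(0) if match else False

names_re = build_names_re(['James John', 'Robert David', 'Paul'])
print(is_name_in_text('I saw James today', names_re))       # False
print(is_name_in_text('I saw James John today', names_re))  # James John
print(is_name_in_text('I met Paul', names_re))              # Paul
```

Compiling the pattern once and reusing it avoids rebuilding a 5,000-alternative expression for every text.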

You may use Python's set in order to get good performance while using the in operator.

If you have a mechanism of pulling the names out of the phrases and don't need to worry about partial matches (the full name will always be in the string), you can use a set rather than a list.
Your code is exactly the same, with this addition at line 2:
names = set(names)
The in operation will now function much faster.
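One possible mechanism for pulling candidate names out of a phrase is to test every run of adjacent words against the set. This sketch assumes names are at most max_words words long and that words are separated by whitespace with no attached punctuation; the max_words parameter is introduced here for illustration only.

```python
def is_name_in_text(text, names, max_words=2):
    """Return the first name found among the word n-grams of text, else False."""
    name_set = set(names)            # O(1) average membership tests
    tokens = text.split()
    # Try longer runs first so 'James John' is preferred over a single word
    for size in range(max_words, 0, -1):
        for start in range(len(tokens) - size + 1):
            candidate = " ".join(tokens[start:start + size])
            if candidate in name_set:
                return candidate
    return False

names = ['James John', 'Robert David', 'Paul']
print(is_name_in_text('I saw James today', names))       # False
print(is_name_in_text('I saw James John today', names))  # James John
print(is_name_in_text('I met Paul', names))              # Paul
```

In real use the set would be built once outside the function rather than on every call.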
