How can I show a specific word in a data set? - python

I just started to learn Python. I have a question about matching some of the words in my dataset in Excel.
words_list contains some of the words I would like to find in the dataset:
words_list = ('tried', 'mobile', 'abc')
df is the extract from Excel, from which I picked a single column:
df =
0    to make it possible or easier for someone to do ...
1    unable to acquire a buffer item very likely ...
2    The organization has tried to make...
3    Broadway tried a variety of mobile Phone for the..
I would like to get the result like this:
'None',
'None',
'tried',
'tried','mobile'
I tried this in Jupyter:
result = []
for word in df:
    if any(aa in word for aa in words_list):
        result.append(word)
    else:
        result.append('None')
print(result)
But the result shows the whole sentence from df:
'None'
'None'
'The organization has tried to make...'
'Broadway tried a variety of mobile Phone for the..'
Can I show only the words from words_list in the result?
Sorry for my English, and thank you all.

I'd suggest a manipulation on the DataFrame (that should always be your first thought; use the power of pandas):
import pandas as pd

words_list = {'tried', 'mobile', 'abc'}
df = pd.DataFrame({'col': ['to make it possible or easier for someone to do',
                           'unable to acquire a buffer item very likely',
                           'The organization has tried to make',
                           'Broadway tried a variety of mobile Phone for the']})
df['matches'] = df['col'].str.split().apply(lambda x: set(x) & words_list)
print(df)
                                                 col          matches
0   to make it possible or easier for someone to do               {}
1       unable to acquire a buffer item very likely               {}
2                The organization has tried to make          {tried}
3  Broadway tried a variety of mobile Phone for the  {mobile, tried}
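If you want the literal string 'None' for rows without a match, as in your expected output, one possible follow-up (a sketch, not the only way) is to post-process the matches column:
# turn empty sets into the string 'None' to mirror the asker's expected output
df['matches'] = df['matches'].apply(lambda s: ', '.join(sorted(s)) if s else 'None')
print(df['matches'].tolist())
# ['None', 'None', 'tried', 'mobile, tried']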

The reason it's printing the whole line has to do with your:
for word in df:
Your "word" variable is actually taking the whole line. Then it's checking the whole line to see if it contains your search word. If it does find it, then it basically says, "yes, I found ____ in this line, so append the line to your list.
What it sounds like you want to do is first split the line into words, and THEN check.
result = []
found = False
for line in df:
    words = line.split(" ")
    for word in words_list:
        if word in words:
            found = True
            result.append(word)
    # this is just to append "None" if nothing was found
    if found:
        found = False
    else:
        result.append("None")
print(result)
As a side note, you may want to use pprint instead of print when working with lists. It prints lists, dictionaries, etc. in easier-to-read layouts. It is part of the Python standard library, so there is nothing to install. Usage would be something like:
from pprint import pprint
dictionary = {'firstkey': 'firstval', 'secondkey': 'secondval', 'thirdkey': 'thirdval'}
pprint(dictionary)

Related

Pandas isin() does not return anything even when the keywords exist in the dataframe

I'd like to search for a list of keywords in a text column and select all rows where the exact keywords exist. I know this question has many duplicates, but I can't understand why the solution is not working in my case.
keywords = ['fake', 'false', 'lie']
df1:
       text
19152  I think she is the Corona Virus....
19154  Boy you hate to see that. I mean seeing how it was contained and all.
19155  Tell her it’s just the fake flu, it will go away in a few days.
19235  Is this fake news?
...    ...
20540  She’ll believe it’s just alternative facts.
Expected results: I'd like to select rows that have the exact keywords in my list ('fake', 'false', 'lie'). For example, in the above df, it should return rows 19155 and 19235.
str.contains()
df1[df1['text'].str.contains("|".join(keywords))]
The problem with str.contains() is that the result is not limited to the exact keywords. For example, it returns sentences containing believe (e.g., row 20540) because lie is a substring of "believe"!
pandas.Series.isin
To find the rows including the exact keywords, I used pd.Series.isin:
df1[df1.text.isin(keywords)]
#df1[df1['text'].isin(keywords)]
Even though I see there are matches in df1, it doesn't return anything.
import re
df1[df1.text.apply(lambda x: any(i for i in re.findall(r'\w+', x) if i in keywords))]
Output:
                                                    text
19155  Tell her it’s just the fake flu, it will go aw...
19235                                 Is this fake news?
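If you'd rather stay with str.contains, a hedged alternative sketch is to build a regex with word boundaries, so lie no longer matches inside believe:
import re
# \b word boundaries restrict the match to whole words
pattern = r'\b(?:' + '|'.join(map(re.escape, keywords)) + r')\b'
df1[df1['text'].str.contains(pattern)]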
If text is as follows,
df1 = pd.DataFrame()
df1['text'] = [
    "Dear Kellyanne, Please seek the help of Paula White I believe ...",
    "trump saying it was under control was a lie, ...",
    "Her mouth should have been ... All the lies she has told ...",
    "she'll believe ...",
    "I do believe in ...",
    "This value is false ...",
    "This value is fake ...",
    "This song is fakelove ..."
]
keywords = ['misleading', 'fake', 'false', 'lie']
First,
a simple way is this:
df1[df1.text.apply(lambda x: True if pd.Series(x.split()).isin(keywords).sum() else False)]
                      text
5  This value is false ...
6   This value is fake ...
It won't falsely catch words like "believe", but it also can't catch "lie," because of the trailing punctuation.
Second,
if we remove the special characters from the text data like this:
new_text = df1.text.apply(lambda x: re.sub("[^0-9a-zA-Z]+", " ", x))
df1[new_text.apply(lambda x: True if pd.Series(x.split()).isin(keywords).sum() else False)]
now it can catch the word "lie,":
                                                text
1  trump saying it was under control was a lie, ...
5                           This value is false ...
6                            This value is fake ...
Third,
it still can't catch the word "lies". This can be solved by using a library that reduces different forms of a word to the same base form (lemmatization). You can find out how to tokenize here: tokenize-words-in-a-list-of-sentences-python.
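For example, here is a minimal sketch with NLTK's WordNetLemmatizer (my own assumption, not part of the original answer; it needs nltk installed plus a one-time nltk.download('wordnet'), and results can vary with part of speech):
import re
from nltk.stem import WordNetLemmatizer  # needs nltk.download('wordnet') once

lemmatizer = WordNetLemmatizer()

def contains_keyword(text):
    # strip punctuation, lowercase, then reduce each token to its base form
    tokens = re.findall(r'[a-zA-Z]+', text.lower())
    lemmas = {lemmatizer.lemmatize(token) for token in tokens}  # 'lies' -> 'lie'
    return bool(lemmas & set(keywords))

df1[df1.text.apply(contains_keyword)]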
I think splitting words and then matching is a better and more straightforward approach, e.g. if the df and keywords are
df = pd.DataFrame({'text': ['lama abc', 'cow def', 'foo bar', 'spam egg']})
keywords = ['foo', 'lama']
df
text
0 lama abc
1 cow def
2 foo bar
3 spam egg
This should return the correct result
df.loc[pd.Series(any(word in keywords for word in words) for words in df['text'].str.findall(r'\w+'))]
text
0 lama abc
2 foo bar
Explanation
First, do words splitting in df['text']
splits = df['text'].str.findall(r'\w+')
splits is
0 [lama, abc]
1 [cow, def]
2 [foo, bar]
3 [spam, egg]
Name: text, dtype: object
Then we need to find whether any word in a row appears in the keywords:
# this is answer for a single row, if words is the split list of that row
any(word in keywords for word in words)
# for the entire dataframe, use a Series, `splits` from above is word split lists for every line
rows = pd.Series(any(word in keywords for word in words) for words in splits)
rows
0 True
1 False
2 True
3 False
dtype: bool
Now we can find the correct rows with
df.loc[rows]
text
0 lama abc
2 foo bar
Be aware this approach could consume much more memory as it needs to generate the split list on each line. So if you have huge data sets, this might be a problem.
I believe it's because pd.Series.isin() checks whether the whole string equals one of the values in the list, not whether the string contains a specific word. I just tested this code snippet:
s = pd.Series(['lama abc', 'cow', 'lama', 'beetle', 'lama', 'hippo'],
              name='animal')
s.isin(['cow', 'lama'])
And as I thought, the first string returns False even though it contains the word 'lama'.
Maybe try using regex? See this: searching a word in the column pandas dataframe python

Python matching various keyword from dictionary issues

I have a complex text where I am categorizing different keywords stored in a dictionary:
text = 'data-ls-static="1">Making Bio Implants, Drug Delivery and 3D Printing in Medicine,MEDICINE</h3>'
sector = {"med tech": ['Drug Delivery' '3D printing', 'medicine', 'medical technology', 'bio cell']}
this can successfully find my keywords and categorize them with some limitations:
pattern = r'[a-zA-Z0-9]+'
[cat for cat in sector if any(x in re.findall(pattern,text) for x in sector[cat])]
The limitations that I cannot solve are:
For example, keywords like "Drug Delivery" that are separated by a space are not recognized and therefore categorized.
I was not able to make the pattern case insensitive, as words like MEDICINE are not recognized. I tried to add (?i) to the pattern but it doesn't work.
The categorized keywords go into a pandas df, but they are printed into []. I tried to loop again the script to take them out but they are still there.
Data to pandas df:
ind_list = []
for site in url_list:
    ind = [cat for cat in indication if any(x in re.findall(pattern, soup_string) for x in indication[cat])]
    ind_list.append(ind)
websites['Indication'] = ind_list
Current output:
        Website                Sector  Sub-sector             Therapeutical Area  Focus  URL status
0      url3.com            [med tech]          []                             []     []          []
1  www.url1.com  [med tech, services]          []  [oncology, gastroenterology]     []          []
2  www.url2.com  [med tech, services]          []                    [orthopedy]     []          []
In the output I get [] that I'd like to avoid.
Can you help me with these points?
Thanks!
Here are some hints on the problems that can readily be spotted:
Why can't it match keywords like "Drug Delivery" that are separated by a space? This is because the regex pattern r'[a-zA-Z0-9]+' does not match a space. You can change it to r'[a-zA-Z0-9 ]+' (a space added after the 9) if you also want to match spaces. However, if you want to support other kinds of whitespace (e.g. \t, \n), you will need to adjust this regex pattern further.
Why doesn't it support case-insensitive matching? Your code fragment any(x in re.findall(pattern, text) for x in sector[cat]) requires x to have the same upper/lower case BOTH in the result of re.findall and in sector[cat]. This constraint cannot even be bypassed by setting flags=re.I in the re.findall() call. I suggest converting everything to the same case before checking, for example all lower case: any(x.lower() in re.findall(pattern, text.lower()) for x in sector[cat]). Here .lower() is applied to both text and x.
With the above 2 changes, it should allow you to capture some categorized keywords.
Actually, for this particular case, you may not need to use regular expressions and re.findall at all. You may just check e.g. sector[cat][i].lower() in text.lower(). That is, change the list comprehension as follows:
[cat for cat in sector if any(x in text.lower() for x in [y.lower() for y in sector[cat]])]
Edit
Test Run with 2-word phrase:
text = 'drug delivery'
sector = {"med tech": ['Drug Delivery', '3D printing', 'medicine', 'medical technology', 'bio cell']}
[cat for cat in sector if any(x in text.lower() for x in [y.lower() for y in sector[cat]])]
Output: # Successfully got the categorizing keyword even with dictionary values of different upper/lower cases
['med tech']
text = 'Drug Store fast delivery'
[cat for cat in sector if any(x in text.lower() for x in [y.lower() for y in sector[cat]])]
Output: # Correctly doesn't match with extra words in between
[]
Can you try a different approach rather than regex? I would suggest difflib when you have two similar words to match.
findall is pretty wasteful here since you are repeatedly breaking up the string for each keyword.
If you want to test whether the keyword is in the string:
[cat for cat in sector if any(re.search(word, text, re.I) for word in sector[cat])]
# Output: ['med tech']
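If you also want to avoid partial-word hits (for example medicine matching inside a longer token like biomedicine), a small variant of the above, sketched under that assumption, escapes each keyword and adds word boundaries:
[cat for cat in sector
 if any(re.search(r'\b' + re.escape(word) + r'\b', text, re.I)
        for word in sector[cat])]
# Output: ['med tech']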

Is there a possibility in pySpark to search a string within two separate words?

I'm looking for a way in Python Spark to search a string made of two separate words, for example: "Iphone x" or "Samsung s10" ...
I want to pass in a text file and a composite string such as "Iphone x", and then get the result.
All I can find on the internet is single-word counting.
IIUC:
In Spark 2.0, if you are going to read it from a file, for example a .csv file:
df = spark.read.format("csv").option("header", "true").load("pathtoyourcsvfile.csv")
then you can filter it using regex like this:
pattern = "\s+(word1|word2)\s+"
filtered = df.filter(df['<thedesiredcolumnhere>'].rlike(pattern))
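For a fixed two-word phrase like "Iphone x", a minimal sketch (assuming your column is named "text"; swap in your real column name) can anchor the phrase with word boundaries and make it case-insensitive:
phrase = "Iphone x"
# (?i) makes the match case-insensitive; \b keeps it from matching inside longer tokens
pattern = r"(?i)\b" + phrase + r"\b"
filtered = df.filter(df["text"].rlike(pattern))
filtered.show()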
You can try to write your own UDF combined with wordsegment to segment your words; you can also add new words to the dictionary to help the library segment new words such as "Iphone x".
For example:
>>> from wordsegment import clean
>>> clean('She said, "Python rocks!"')
'shesaidpythonrocks'
>>> segment('She said, "Python rocks!"')
['she', 'said', 'python', 'rocks']
If you don't want to use a library, you can also see Word segmentation using dynamic programming.
This is the answer:
import re

# read the file
rdd = sc.textFile("/root/PycharmProjects/Spark/file")
# the composite string
string_ = "Iphone x"
# filter to lines containing the string
new_rdd = rdd.filter(lambda line: string_ in line)
# collect these lines
rt = str(new_rdd.collect())
# apply regex to find all occurrences and count them
count = len(re.findall(string_, rt))

Python - Identify certain keywords in a user's input, to then lead to an answer

Python - Identify certain keywords in a user's input, to then lead to an answer. For example, a user inputs "There is no display on my phone".
The keywords 'display' and 'phone' would link to a set of solutions.
I just need help finding a general idea on how to identify and then lead to a set of solutions. I would appreciate any help.
Use the NLTK library and import its stopwords.
Write code so that if a word in your text is in the stopwords, that word is removed. You will get the filtered output.
Also,
make a negative-list file containing all the words, apart from stopwords, that you want to remove; extend the stopwords with these words before the code above, and you will get fully filtered output.
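As a rough sketch of that idea (it assumes NLTK is installed along with its stopwords data; the words in negative_list are only an illustration):
from nltk.corpus import stopwords  # needs nltk.download('stopwords') once

user_input = "There is no display on my phone"
stop_words = set(stopwords.words('english'))
# hypothetical negative list: extra words to drop beyond the standard stopwords
negative_list = {'working', 'broken'}
stop_words |= negative_list

keywords = [w for w in user_input.lower().split() if w not in stop_words]
print(keywords)  # likely ['display', 'phone']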
A simple way, if you don't want to use any external libraries, would be the following:
def bool_to_int(bools):
    # encode a list of 0/1 flags as a single integer bitmask
    num = 0
    for k, v in enumerate(bools):
        if v == 1:
            num += 2 ** k
    return num

def take_action(code):
    if code == 1:
        pass  # do this
    elif code == 2:
        pass  # do this
    ...

keywords = ['display', 'phone', .....,]
list_of_words = data.split(" ")
code = [0] * len(keywords)
for i in list_of_words:
    if i in keywords:
        idx = keywords.index(i)
        code[idx] = 1
code = bool_to_int(code)
take_action(code)
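As a quick usage note on the sketch above: code is a bitmask over keywords, so every combination of matched keywords maps to a unique integer that take_action can branch on. For example:
# bit 0 = 'display', bit 1 = 'phone'
print(bool_to_int([1, 0]))  # 1 -> only 'display' matched
print(bool_to_int([0, 1]))  # 2 -> only 'phone' matched
print(bool_to_int([1, 1]))  # 3 -> both matched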

Convert string to dataframe, separated by colon

I have a string that comes from an article with a few hundred sentences. I want to convert the string to a dataframe, with each sentence as a row. For example,
data = 'This is a book, to which I found exciting. I bought it for my cousin. He likes it.'
I hope it becomes:
This is a book, to which I found exciting.
I bought it for my cousin.
He likes it.
As a Python newbie, this is what I tried:
import pandas as pd
from io import StringIO

data_csv = StringIO(data)
data_df = pd.read_csv(data_csv, sep=".")
With the code above, all sentences become column names. I actually want them in rows of a single column.
Don't use read_csv. Just split by '.' and use the standard pd.DataFrame:
data = 'This is a book, to which I found exciting. I bought it for my cousin. He likes it.'
data_df = pd.DataFrame([sentence for sentence in data.split('.') if sentence],
                       columns=['sentences'])
print(data_df)
# sentences
# 0 This is a book, to which I found exciting
# 1 I bought it for my cousin
# 2 He likes it
Keep in mind that this will break if there are floating point numbers in any of the sentences. In that case you will need to change the format of your string (e.g. use '\n' instead of '.' to separate sentences).
This is a quick solution, but it solves your issue (note that read_csv needs a path or file-like object, so wrap the string in StringIO):
data_df = pd.read_csv(StringIO(data), sep=".", header=None).T
You can achieve this via a list comprehension:
data = 'This is a book, to which I found exciting. I bought it for my cousin. He likes it.'
df = pd.DataFrame({'sentence': [i+'.' for i in data.split('. ')]})
print(df)
# sentence
# 0 This is a book, to which I found exciting.
# 1 I bought it for my cousin.
# 2 He likes it.
What you are trying to do is called tokenizing sentences. The easiest way would be to use a Text-Mining library such as NLTK for it:
from nltk.tokenize import sent_tokenize
pd.DataFrame(sent_tokenize(data))
Otherwise you could simply try something like:
pd.DataFrame(data.split('. '))
However, this will fail if you run into sentences like this:
problem = 'Tim likes to jump... but not always!'
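For comparison, a quick check (the exact output depends on NLTK's Punkt model, so treat it as indicative):
from nltk.tokenize import sent_tokenize  # needs nltk.download('punkt') once

problem = 'Tim likes to jump... but not always!'
print(problem.split('. '))     # ['Tim likes to jump..', 'but not always!']
print(sent_tokenize(problem))  # likely ['Tim likes to jump... but not always!']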
