I have been able to create a program that works for a small set of data; however, when I try to scale the amount of data I am working with, my program takes forever. So I was wondering if anyone has a better approach than what follows.
For my simple example, I have a data frame with various sentences people have written, where there is a lot of variation in the formatting of the text. Sometimes there are spaces to separate the words, but sometimes it may just be one long string without any spaces. In addition, there can be misspellings, but I am ignoring those.
Here are some examples of what I am talking about:

Name    Sentence
John    "This is my first sentence"
Jane    "tHisIsMyFirstsEntenceToo!"
Bob     "Canyoube1ievethis?"
Anna    "Why are we doing this first?"
This is a reduced set of the data I am working with, but what I am trying to do is find the largest word in each string from a list of words. So for this example, here is the list of words that I want to look for in these strings, identifying the longest one that was found: ["sentence", "this", "first", "believe"]
The output with this data and list should be:
Name    Sentence                          Longest Word
John    "This is my first sentence"       "sentence"
Jane    "tHisIsMyFirstsEntenceToo!"       "sentence"
Bob     "Canyoube1ievethis?"              "this"
Anna    "Why are we doing this first?"    "first"
Obviously the first thing to do is standardize the text, so I lowercase the sentences. The words in the list are already lowercase, but I will lowercase them too just for the sake of generalizing.
import pandas as pd
import numpy as np
words = ["sentence", "this", "first", "believe"]
df['sentence_lower'] = df['Sentence'].str.lower()
words = [word.lower() for word in words]
From here, I do not know what the best approach is to get the desired output. I just iterated over the words and checked to see if the word was in the sentence and then see if it is longer than the current match.
df['Longest Word'] = ''
for word in words:
    df['Longest Word'] = np.where(
        (df['sentence_lower'].str.contains(word))
        & (df['Longest Word'].str.len() < len(word)),
        word, df['Longest Word'])
This does work, but it is pretty slow. Is there a better way of doing this?
With the examples you provided:
import pandas as pd
df = pd.DataFrame(
    {
        "Name": ["John", "Jane", "Bob", "Anna"],
        "Sentence": [
            "This is my first sentence",
            "tHisIsMyFirstsEntenceToo!",
            "Canyoube1ievethis?",
            "Why are we doing this first?",
        ],
    }
)
words = ["sentence", "this", "first", "believe"]
Here is another way to do it:
words = sorted(set(words), key=len)
df["Longest Words"] = df["Sentence"].apply(
lambda x: max([word if word in x.lower() else "" for word in words], key=len)
)
print(df)
# Output
Name Sentence Longest Words
0 John This is my first sentence sentence
1 Jane tHisIsMyFirstsEntenceToo! sentence
2 Bob Canyoube1ievethis? this
3 Anna Why are we doing this first? first
On my machine, it takes 0.00011 seconds on average (50,000 runs), compared to 0.00210 seconds for your code, so it is nearly 20 times faster.
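For reference, here is a sketch of how one might reproduce such a timing with timeit (the wrapper function name is mine, and exact numbers will of course vary by machine):

import timeit

def longest_word_apply():
    df["Longest Words"] = df["Sentence"].apply(
        lambda x: max([word if word in x.lower() else "" for word in words], key=len)
    )

# Average time per run over 50,000 runs
print(timeit.timeit(longest_word_apply, number=50_000) / 50_000)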
Related
I'd like to search for a list of keywords in a text column and select all rows where the exact keywords exist. I know this question has many duplicates, but I can't understand why the solution is not working in my case.
keywords = ['fake', 'false', 'lie']
df1:
       text
19152  I think she is the Corona Virus....
19154  Boy you hate to see that. I mean seeing how it was contained and all.
19155  Tell her it’s just the fake flu, it will go away in a few days.
19235  Is this fake news?
...    ...
20540  She’ll believe it’s just alternative facts.
Expected results: I'd like to select rows that have the exact keywords in my list ('fake', 'false', 'lie'). For example, in the above df, it should return rows 19155 and 19235.
str.contains()
df1[df1['text'].str.contains("|".join(keywords))]
The problem with str.contains() is that the result is not limited to the exact keywords. For example, it returns sentences with believe (e.g., row 20540) because lie is a substring of "believe"!
pandas.Series.isin
To find the rows including the exact keywords, I used pd.Series.isin:
df1[df1.text.isin(keywords)]
#df1[df1['text'].isin(keywords)]
Even though I see there are matches in df1, it doesn't return anything.
import re
df[df.text.apply(lambda x: any(i for i in re.findall(r'\w+', x) if i in keywords))]
Output:
text
2 Tell her it’s just the fake flu, it will go aw...
3 Is this fake news?
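A vectorized alternative in the same spirit (a sketch, assuming the same df and keywords) is to wrap the keywords in word boundaries so that lie no longer matches inside "believe":

import re

# \b keeps the match limited to whole words
pattern = r"\b(?:{})\b".format("|".join(map(re.escape, keywords)))
df[df.text.str.contains(pattern, regex=True)]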
If text is as follows,
df1 = pd.DataFrame()
df1['text'] = [
    "Dear Kellyanne, Please seek the help of Paula White I believe ...",
    "trump saying it was under controll was a lie, ...",
    "Her mouth should hanve been ... All the lies she has told ...",
    "she'll believe ...",
    "I do believe in ...",
    "This value is false ...",
    "This value is fake ...",
    "This song is fakelove ..."
]
keywords = ['misleading', 'fake', 'false', 'lie']
First,
A simple way is this:
df1[df1.text.apply(lambda x: True if pd.Series(x.split()).isin(keywords).sum() else False)]
text
5 This value is false ...
6 This value is fake ...
It will not falsely match words like "believe", but it cannot catch "lie," because of the trailing punctuation.
Second,
So if we remove the special characters from the text data like this:
new_text = df1.text.apply(lambda x: re.sub("[^0-9a-zA-Z]+", " ", x))
df1[new_text.apply(lambda x: True if pd.Series(x.split()).isin(keywords).sum() else False)]
Now it can catch the word "lie,".
text
1 trump saying it was under controll was a lie, ...
5 This value is false ...
6 This value is fake ...
Third,
It still can't catch the word "lies". That can be solved by using a library that maps different forms of a word back to the same base form (lemmatization). You can find how to tokenize here: tokenize-words-in-a-list-of-sentences-python
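A rough sketch of that idea using NLTK's WordNetLemmatizer (the helper name has_keyword is mine, and it assumes the df1 and keywords defined above):

import re
from nltk.stem import WordNetLemmatizer

# Requires the WordNet data: nltk.download('wordnet') once
lemmatizer = WordNetLemmatizer()

def has_keyword(text):
    tokens = re.findall(r"[a-zA-Z]+", text.lower())
    # Lemmatize as verbs so "lies" becomes "lie" before the comparison
    return any(lemmatizer.lemmatize(tok, pos="v") in keywords for tok in tokens)

df1[df1.text.apply(has_keyword)]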
I think splitting the text into words and then matching is a better and more straightforward approach, e.g. if the df and keywords are
df = pd.DataFrame({'text': ['lama abc', 'cow def', 'foo bar', 'spam egg']})
keywords = ['foo', 'lama']
df
text
0 lama abc
1 cow def
2 foo bar
3 spam egg
This should return the correct result
df.loc[pd.Series(any(word in keywords for word in words) for words in df['text'].str.findall(r'\w+'))]
text
0 lama abc
2 foo bar
Explanation
First, split df['text'] into words:
splits = df['text'].str.findall(r'\w+')
splits is
0 [lama, abc]
1 [cow, def]
2 [foo, bar]
3 [spam, egg]
Name: text, dtype: object
Then we need to find whether any word in a row appears in the keywords:
# this is answer for a single row, if words is the split list of that row
any(word in keywords for word in words)
# for the entire dataframe, use a Series, `splits` from above is word split lists for every line
rows = pd.Series(any(word in keywords for word in words) for words in splits)
rows
0 True
1 False
2 True
3 False
dtype: bool
Now we can find the correct rows with
df.loc[rows]
text
0 lama abc
2 foo bar
Be aware this approach could consume much more memory, as it needs to generate the split list for each line. So if you have huge data sets, this might be a problem.
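If that becomes a problem, one way to soften it (a sketch of the same idea, processed row by row) is to split and test each line on the fly rather than keeping every split list around:

import re

keyword_set = set(keywords)

# Each row's split list is created, tested, and discarded immediately
mask = df['text'].apply(
    lambda text: not keyword_set.isdisjoint(re.findall(r'\w+', text))
)
df.loc[mask]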
I believe it's because pd.Series.isin() checks if the string is in the column, and not if the string in the column contains a specific word. I just tested this code snippet:
s = pd.Series(['lama abc', 'cow', 'lama', 'beetle', 'lama',
               'hippo'], name='animal')
s.isin(['cow', 'lama'])
And as I suspected, the first string, even though it contains the word 'lama', returns False.
Maybe try using regex? See this: searching a word in the column pandas dataframe python
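For example, something along these lines (a sketch on the same Series s) matches whole words only:

# \b anchors the keywords at word boundaries
s[s.str.contains(r'\b(?:cow|lama)\b')]

With this, 'lama abc' is matched even though isin misses it.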
I have a list of 3000 (mostly unique) words sorted by their frequency in English. I also have a list of 3000 unique sentences. Ideally I would like to use Python to generate a list of one example sentence for the use of each word. So each word would have a sentence, which contains that word, paired with it. But no sentence should be paired with more than one word and no word should have more than one sentence associated with it.
But here is the catch, this is a messy dataset, so many words are going to appear in more than one sentence, some words will only appear in one sentence, and many words will not appear in any of the sentences. So I'm not going to get my ideal result. Instead, what I would like is an optimal list with the greatest number of sentences matched with words. And then a list of sentences that were omitted. Also, ideally, the sorted list should prefer to find sentences for lower frequency words than for higher frequency ones. (Since it will be easier to go back and find replacement sentences for higher frequency words.)
Here is an abbreviated example to help clarify:
words = ["the", "cat", "dog", "fish", "runs"]
sentences = ["the dog and cat are friends", "the dog runs all the time", "the dog eats fish", "I love to eat fish", "Granola is yummy too"]
output = ["", "the dog and cat are friends", "the dog eats fish", "I love to eat fish", "the dog runs all the time"]
omitted = ["Granola is yummy too"]
As you can see:
"Granola is yummy too" was omitted because it doesn't contain any of the words.
"the dog and cat are friends" was matched with "cat" because it is the only sentence that contains "cat"
"the dog runs all the time" was matched with "runs" because it is the only sentence that contains "runs"
"the dog eats fish" was matched with "dog" because "dog" is less frequent than "the" in English
"I love to eat fish" was matched with "fish" because the only other sentence with "fish" was already used
"the" didn't have any sentences left that matched with it
I'm not sure where to even start writing the code for this. (I'm a linguist who dabbles in coding on the side, not a professional coder.) So any help would be greatly appreciated!
...where to even start...
Here is a kind of naive approach without any attempt to optimize.
make a dictionary with the words as the keys and a list for the value
    {'word1': [], 'word2': [], ...}
for each item in the dictionary
    iterate over the sentences and append a sentence to the item's list if the word is in the sentence
Or maybe:
make a set of the words
make an empty dictionary
for each sentence
    find the intersection of the words-in-the-sentence with the set of words
    add an item to the dictionary using the sentence for the key and the intersection for the value
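In code, those two sketches might look roughly like this (using the example words and sentences from the question; turning either structure into the final one-to-one pairing is the next step):

words = ["the", "cat", "dog", "fish", "runs"]
sentences = [
    "the dog and cat are friends",
    "the dog runs all the time",
    "the dog eats fish",
    "I love to eat fish",
    "Granola is yummy too",
]

# First idea: word -> list of sentences containing it
word_to_sentences = {word: [] for word in words}
for word in words:
    for sentence in sentences:
        if word in sentence.split():
            word_to_sentences[word].append(sentence)

# Second idea: sentence -> set of target words it contains
word_set = set(words)
sentence_to_words = {
    sentence: word_set & set(sentence.split()) for sentence in sentences
}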
I'm trying to extract top words by date as follows:
df.set_index('Publishing_Date').Quotes.str.lower().str.extractall(r'(\w+)')[0].groupby('Publishing_Date').value_counts().groupby('Publishing_Date')
in the following dataframe:
import pandas as pd
# initialize
data = [
    ['20/05', "So many books, so little time."],
    ['20/05', "The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid."],
    ['19/05', "Don't be pushed around by the fears in your mind. Be led by the dreams in your heart."],
    ['19/05', "Be the reason someone smiles. Be the reason someone feels loved and believes in the goodness in people."],
    ['19/05', "Do what is right, not what is easy nor what is popular."],
]
# Create the pandas DataFrame
df = pd.DataFrame(data, columns = ['Publishing_Date', 'Quotes'])
As you can see, there are many stop-words ("the", "an", "a", "be", ...) that I would like to remove in order to have a better selection. My aim is to find some key words, i.e. patterns, in common by date, so I am more interested in and focused on nouns rather than verbs.
Any idea on how I could remove stop-words AND keep only nouns?
Edit
Expected output (based on the results from Vaibhav Khandelwal's answer below):
Publishing_Date    Quotes    Nouns
20/05              ....      books, time, person, gentleman, lady, novel
19/05              ....      fears, mind, dreams, heart, reason, smiles
I would need to extract only nouns (reasons should be more frequent so it would be ordered based on frequency).
I think nltk.pos_tag should be useful here, keeping the tags that start with 'NN'.
This is how you can remove stopwords from your text:
import nltk
from nltk.corpus import stopwords
def remove_stopwords(text):
    stop_words = stopwords.words('english')
    fresh_text = []
    for i in text.lower().split():
        if i not in stop_words:
            fresh_text.append(i)
    return ' '.join(fresh_text)
df['text'] = df['Quotes'].apply(remove_stopwords)
NOTE: If you want to remove additional words, append them to the stopwords list explicitly.
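For instance (the extra words here are just hypothetical examples):

# Hypothetical extra words to drop, appended to NLTK's built-in list
stop_words = stopwords.words('english') + ['would', 'could']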
For your other half you can add another function to extract nouns:
def extract_noun(text):
    token = nltk.tokenize.word_tokenize(text)
    result = []
    for i in nltk.pos_tag(token):
        if i[1].startswith('NN'):
            result.append(i[0])
    return ', '.join(result)
df['NOUN'] = df['text'].apply(extract_noun)
The final output is the dataframe with an added NOUN column holding the comma-separated nouns for each quote.
I have a numpy array of words that I want to delete from strings in a Pandas dataframe.
For example: if the word 'the' is in that array and there's a string 'The cat' in a column, it should become ' cat'. I don't want to delete the whole string, just those words.
# This will iterate that numpy array
def iterate():
    for x in range(0, 52):
        for y in range(0, 8):
            return (np_array[x,y])

# The code below drops that row/record
filtered = df[~df.content.str.contains(iterate())]
Help will be highly appreciated.
Sample data:
numpy array = [a, about, and, across, after, afterwards, in, on, as]
One sample cell:
df['content'] = Be sure to tune in and watch Donald Trump on Late Night with David Letterman as he presents the Top Ten List tonight!
Sample Output:
Be sure to tune watch Donald Trump Late Night with David Letterman he presents the Top Ten List tonight!
If you can manage to get a flat list of stopwords to remove from that Numpy array, you can build a regexp that matches all of the stopwords you want to remove, then use df.replace.
import re

stopwords = [
    "a", "about", "and", "across", "after",
    "afterwards", "in", "on", "as",
]
# Compile a regular expression that will match all the words in one sweep
stopword_re = re.compile("|".join(r"\b%s\b" % re.escape(word) for word in stopwords))
# Replace and reassign into the column
df["content"].replace(stopword_re, "", inplace=True)
You can also add .replace(re.compile(r"\s+"), " ") to collapse the resulting multiple spaces into one space, if your application requires that.
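Chained together, that might look like this (a sketch of the non-inplace form):

# Strip the stopwords, then collapse leftover runs of whitespace
df["content"] = (
    df["content"]
    .replace(stopword_re, "")
    .replace(re.compile(r"\s+"), " ")
)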
I have two Python lists, one of which contains about 13000 disallowed phrases, and one of which contains about 10000 sentences.
phrases = [
    "phrase1",
    "phrase2",
    "phrase with spaces",
    # ...
]
sentences = [
    "sentence",
    "some sentences are longer",
    "some sentences can be really really ... really long, about 1000 characters.",
    # ...
]
I need to check every sentence in the sentences list to see if it contains any phrase from the phrases list; if it does, I want to put ** around the phrase and add the sentence to another list. I also need to do this in the fastest possible way.
This is what I have so far:
import re

newlist = []
for sentence in sentences:
    for phrase in phrases:
        if phrase in sentence.lower():
            iphrase = re.compile(re.escape(phrase), re.IGNORECASE)
            newsentence = iphrase.sub("**" + phrase + "**", sentence)
            newlist.append(newsentence)
So far this approach takes about 60 seconds to complete.
I tried using multiprocessing (each sentence's for loop was mapped separately) however this yielded even slower results. Given that each process was running at about 6% CPU usage, it appears the overhead makes mapping such a small task to multiple cores not worth it. I thought about separating the sentences list into smaller chunks and mapping those to separate processes, but haven't quite figured out how to implement this.
I've also considered using a binary search algorithm but haven't been able to figure out how to use this with strings.
So essentially, what would be the fastest possible way to perform this check?
Build your regex once, sorting the phrases by length (longest first) so the **s wrap the longest matching phrase rather than the shortest; then perform the substitution and filter out the sentences where no substitution was made, e.g.:
phrases = [
    "phrase1",
    "phrase2",
    "phrase with spaces",
    'can be really really',
    'characters',
    'some sentences',
    # ...
]
sentences = [
    "sentence",
    "some sentences are longer",
    "some sentences can be really really ... really long, about 1000 characters.",
    # ...
]
import re

# Build the regex string required
rx = '({})'.format('|'.join(re.escape(el) for el in sorted(phrases, key=len, reverse=True)))

# Generator to yield replaced sentences
it = (re.sub(rx, r'**\1**', sentence) for sentence in sentences)

# Build list of paired new sentences and old to filter out where not the same
results = [new_sentence for old_sentence, new_sentence in zip(sentences, it) if old_sentence != new_sentence]
This gives you a result of:
['**some sentences** are longer',
'**some sentences** **can be really really** ... really long, about 1000 **characters**.']
What about a set comprehension?
found = {'**' + p + '**' for s in sentences for p in phrases if p in s}
You could try updating (by reduction) the phrases list if you don't mind altering it:
found = []
p = phrases[:]  # shallow copy for modification
for s in sentences:
    for i in range(len(phrases)):
        phrase = phrases[i]
        if phrase in s:
            p.remove(phrase)
            found.append('**' + phrase + '**')
    phrases = p[:]
Basically each iteration reduces the phrases container. We iterate through the latest container until we find a phrase that is in at least one sentence.
We remove it from the copied list; then, once we have checked the latest phrases, we update the container with the reduced subset of phrases (those that haven't been seen yet). We do this since we only need to see a phrase at least once, so checking it again (although it may exist in another sentence) is unnecessary.