Removing stop-words and selecting only names in pandas - python

I'm trying to extract top words by date as follows:
df.set_index('Publishing_Date').Quotes.str.lower().str.extractall(r'(\w+)')[0].groupby('Publishing_Date').value_counts().groupby('Publishing_Date')
in the following dataframe:
import pandas as pd

# initialize
data = [['20/05', "So many books, so little time."],
        ['20/05', "The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid."],
        ['19/05', "Don't be pushed around by the fears in your mind. Be led by the dreams in your heart."],
        ['19/05', "Be the reason someone smiles. Be the reason someone feels loved and believes in the goodness in people."],
        ['19/05', "Do what is right, not what is easy nor what is popular."]]

# Create the pandas DataFrame
df = pd.DataFrame(data, columns=['Publishing_Date', 'Quotes'])
As you can see, there are many stop-words ("the", "an", "a", "be", ...) that I would like to remove in order to get a better selection. My aim is to find key words, i.e. patterns, in common by date, so I am more interested in and focused on nouns rather than verbs.
Any idea on how I could remove stop-words AND keep only nouns?
Edit
Expected output (based on the results from Vaibhav Khandelwal's answer below):
Publishing_Date    Quotes    Nouns
20/05              ....      books, time, person, gentleman, lady, novel
19/05              ....      fears, mind, dreams, heart, reason, smiles
I would need to extract only nouns ("reason" appears more often, so the nouns would be ordered by frequency).
I think nltk.pos_tag should be useful here, keeping the tokens whose tag is in ('NN').

This is how you can remove stopwords from your text:
import nltk
from nltk.corpus import stopwords

def remove_stopwords(text):
    stop_words = stopwords.words('english')
    fresh_text = []
    for i in text.lower().split():
        if i not in stop_words:
            fresh_text.append(i)
    return ' '.join(fresh_text)

df['text'] = df['Quotes'].apply(remove_stopwords)
NOTE: If you want to remove additional words, append them to the stopwords list explicitly.
For the other half, you can add another function to extract the nouns:
def extract_noun(text):
    token = nltk.tokenize.word_tokenize(text)
    result = []
    for i in nltk.pos_tag(token):
        if i[1].startswith('NN'):
            result.append(i[0])
    return ', '.join(result)

df['NOUN'] = df['text'].apply(extract_noun)
The final DataFrame will contain the cleaned text column and a NOUN column listing the extracted nouns, in line with the expected output shown in the question.
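One practical note (an assumption about the setup, not part of the original answer): the two functions above rely on NLTK data packages that may need to be downloaded once before running them, for example:

import nltk
nltk.download('stopwords')                   # stop word lists used by remove_stopwords
nltk.download('punkt')                       # tokenizer models used by word_tokenize
nltk.download('averaged_perceptron_tagger')  # tagger model used by pos_tag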

Related

Optimizing Pandas with Text Analysis

I have been able to create a program that works for a small set of data; however, when I try to scale the amount of data I am working with, my program takes forever. So I was wondering if anyone has a better approach than what follows.
For my simple example, I have a data frame with various sentences people have written, where there is a lot of variation in the formatting of the text. Sometimes there may be spaces to separate the words, but sometimes it may just be one long string without any spaces. In addition, there can be misspellings, but I am ignoring those.
Here are some examples of what I am talking about
Name    Sentence
John    "This is my first sentence"
Jane    "tHisIsMyFirstsEntenceToo!"
Bob     "Canyoube1ievethis?"
Anna    "Why are we doing this first?"
This is a reduced set of the data I am working with. What I am trying to do is find, for each string, the longest word from a given list of words that appears in it. For this example, here is the list of words I want to look for in these strings: ["sentence", "this", "first", "believe"].
The output with this data and list should be:
Name    Sentence                          Longest Word
John    "This is my first sentence"       "sentence"
Jane    "tHisIsMyFirstsEntenceToo!"       "sentence"
Bob     "Canyoube1ievethis?"              "this"
Anna    "Why are we doing this first?"    "first"
Obviously the first thing to do is standardize the text, so I lowercase the sentences. The words in the list are already lowercase, but I will lowercase them too for the sake of generality.
import pandas as pd
import numpy as np
words = ["sentence", "this", "first", "believe"]
df['sentence_lower'] = df['Sentence'].str.lower()
words = [word.lower() for word in words]
From here, I do not know what the best approach is to get the desired output. I just iterated over the words and checked to see if the word was in the sentence and then see if it is longer than the current match.
df['Longest Word'] = ''
for word in words:
    df['Longest Word'] = np.where(
        (df['sentence_lower'].str.contains(word))
        & (df['Longest Word'].str.len() < len(word)),
        word,
        df['Longest Word'])
This does work, but it is pretty slow. Is there a better way of doing this?
With the examples you provided:
import pandas as pd

df = pd.DataFrame(
    {
        "Name": ["John", "Jane", "Bob", "Anna"],
        "Sentence": [
            "This is my first sentence",
            "tHisIsMyFirstsEntenceToo!",
            "Canyoube1ievethis?",
            "Why are we doing this first?",
        ],
    }
)
words = ["sentence", "this", "first", "believe"]
Here is another way to do it:
words = sorted(set(words), key=len)
df["Longest Words"] = df["Sentence"].apply(
    lambda x: max([word if word in x.lower() else "" for word in words], key=len)
)
print(df)
# Output
   Name                      Sentence Longest Words
0  John     This is my first sentence      sentence
1  Jane     tHisIsMyFirstsEntenceToo!      sentence
2   Bob            Canyoube1ievethis?          this
3  Anna  Why are we doing this first?         first
On my machine, it takes 0.00011 seconds on average (50,000 runs), compared to 0.00210 seconds for your code, so it is nearly 20 times faster.
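For reference, here is a rough sketch of how such a timing could be reproduced with timeit (the helper name is purely illustrative, and the numbers will vary by machine):

import timeit

def longest_words():
    # Same expression as above; assumes df and words are already defined
    return df["Sentence"].apply(
        lambda x: max([word if word in x.lower() else "" for word in words], key=len)
    )

runs = 50_000
print(timeit.timeit(longest_words, number=runs) / runs)  # average seconds per run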

Removing specific words present in a numpy array from strings in a dataframe column? [Python]

I have a numpy array of words that I want to delete from strings in a Pandas dataframe.
For example: if the word 'the' is in that array and a column contains the string 'The cat', it should become ' cat'. I don't want to delete the whole string, just those words.
# This will iterate that numpy array
def iterate():
    for x in range(0, 52):
        for y in range(0, 8):
            return np_array[x, y]

# The code below drops that row/record
filtered = df[~df.content.str.contains(iterate())]
Help will be highly appreciated.
Sample data:
numpy array = [a, about, and, across, after, afterwards, in, on, as]
One sample cell:
df['content'] = Be sure to tune in and watch Donald Trump on Late Night with David Letterman as he presents the Top Ten List tonight!
Sample Output:
Be sure to tune watch Donald Trump Late Night with David Letterman he presents the Top Ten List tonight!
If you can manage to get a flat list of stopwords to remove from that Numpy array, you can build a regexp that matches all of the stopwords you want to remove, then use df.replace.
import re

stopwords = [
    "a", "about", "and", "across", "after",
    "afterwards", "in", "on", "as",
]

# Compile a regular expression that will match all the words in one sweep
stopword_re = re.compile("|".join(r"\b%s\b" % re.escape(word) for word in stopwords))

# Replace and reassign into the column
df["content"].replace(stopword_re, "", inplace=True)
You can also add .replace(re.compile(r"\s+"), " ") to collapse the resulting multiple spaces into one space, if your application requires that.
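Putting it together on the sample row from the question, a minimal sketch (using Series.str.replace here, which is simply another way of applying the same compiled patterns):

import re
import pandas as pd

stopwords = ["a", "about", "and", "across", "after", "afterwards", "in", "on", "as"]
stopword_re = re.compile("|".join(r"\b%s\b" % re.escape(word) for word in stopwords))

df = pd.DataFrame({"content": [
    "Be sure to tune in and watch Donald Trump on Late Night with "
    "David Letterman as he presents the Top Ten List tonight!"
]})

df["content"] = (
    df["content"]
    .str.replace(stopword_re, "", regex=True)  # drop the stopwords
    .str.replace(r"\s+", " ", regex=True)      # collapse the leftover spaces
    .str.strip()
)
print(df["content"].iloc[0])  # should match the sample output shown in the question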

Identifying list of regex expressions in Pandas column

I have a large pandas dataframe. A column contains text broken down into sentences, one sentence per row. I need to check the sentences for the presence of terms used in various ontologies. Some of the ontologies are fairly large and have more than 100,000 entries. In addition, some of the ontologies contain molecule names with hyphens, commas, and other characters that may or may not be present in the text to be examined, hence the need for regular expressions.
I came up with the code below, but it's not fast enough to deal with my data. Any suggestions are welcome.
Thank you!
import pandas as pd
import re
sentences = ["""There is no point in driving yourself mad trying to stop
yourself going mad""",
"The ships hung in the sky in much the same way that bricks don’t"]
sentence_number = list(range(0, len(sentences)))
d = {'sentence' : sentences, 'number' : sentence_number}
df = pd.DataFrame(d)
regexes = ['\\bt\\w+', '\\bs\\w+']
big_regex = '|'.join(regexes)
compiled_regex = re.compile(big_regex, re.I)
df['found_regexes'] = df.sentence.str.findall(compiled_regex)

How to un-stem a word in Python?

I want to know if there is any way that I can un-stem them to a normal form?
The problem is that I have thousands of words in different forms, e.g. eat, eaten, ate, eating and so on, and I need to count the frequency of each word. All of these (eat, eaten, ate, eating, etc.) will count towards eat, and hence I used stemming.
But the next part of the problem requires me to find similar words in the data, and I am using nltk's synsets to calculate the Wu-Palmer similarity among the words. The problem is that nltk's synsets won't work on stemmed words, or at least in this code they won't: check if two words are related to each other
How should I do it? Is there a way to un-stem a word?
I think an OK approach is something like the one described in https://stackoverflow.com/a/30670993/7127519.
A possible implementation could be something like this:
import re
import string
import nltk
import pandas as pd
stemmer = nltk.stem.porter.PorterStemmer()
A stemmer to use. And here is a text to work with:
complete_text = ''' cats catlike catty cat
stemmer stemming stemmed stem
fishing fished fisher fish
argue argued argues arguing argus argu
argument arguments argument '''
Create a list with the different words:
my_list = []
#for i in complete_text.decode().split():
try:
aux = complete_text.decode().split()
except:
aux = complete_text.split()
for i in aux:
if i not in my_list:
my_list.append(i.lower())
my_list
with output:
['cats',
'catlike',
'catty',
'cat',
'stemmer',
'stemming',
'stemmed',
'stem',
'fishing',
'fished',
'fisher',
'fish',
'argue',
'argued',
'argues',
'arguing',
'argus',
'argu',
'argument',
'arguments']
And now create the dictionary:
aux = pd.DataFrame(my_list, columns =['word'] )
aux['word_stemmed'] = aux['word'].apply(lambda x : stemmer.stem(x))
aux = aux.groupby('word_stemmed').transform(lambda x: ', '.join(x))
aux['word_stemmed'] = aux['word'].apply(lambda x : stemmer.stem(x.split(',')[0]))
aux.index = aux['word_stemmed']
del aux['word_stemmed']
my_dict = aux.to_dict('dict')['word']
my_dict
The output is:
{'argu': 'argue, argued, argues, arguing, argus, argu',
'argument': 'argument, arguments',
'cat': 'cats, cat',
'catlik': 'catlike',
'catti': 'catty',
'fish': 'fishing, fished, fish',
'fisher': 'fisher',
'stem': 'stemming, stemmed, stem',
'stemmer': 'stemmer'}
Companion notebook here.
No, there isn't. With stemming, you lose information, not only about the word form (as in eat vs. eats or eaten), but also about the word itself (as in tradition vs. traditional). Unless you're going to use a prediction method to try and predict this information on the basis of the context of the word, there's no way to get it back.
tl;dr: you could use any stemmer you want (e.g.: Snowball) and keep track of what word was the most popular before stemming for each stemmed word by counting occurrences.
You may like this open-source project which uses Stemming and contains an algorithm to do Inverse Stemming:
https://github.com/ArtificiAI/Multilingual-Latent-Dirichlet-Allocation-LDA
On this page of the project, there are explanations on how to do the Inverse Stemming. To sum things up, it works as follows.
First, you stem some documents; here, for example, short French-language strings with their stop words removed:
['sup chat march trottoir',
'sup chat aiment ronron',
'chat ronron',
'sup chien aboi',
'deux sup chien',
'combien chien train aboi']
Then the trick is to have kept, for each stemmed word, a count of how often each original word produced it:
{'aboi': {'aboie': 1, 'aboyer': 1},
'aiment': {'aiment': 1},
'chat': {'chat': 1, 'chats': 2},
'chien': {'chien': 1, 'chiens': 2},
'combien': {'Combien': 1},
'deux': {'Deux': 1},
'march': {'marche': 1},
'ronron': {'ronronner': 1, 'ronrons': 1},
'sup': {'super': 4},
'train': {'train': 1},
'trottoir': {'trottoir': 1}}
Finally, you may now guess how to implement this by yourself: for each stemmed word, simply take the original word that has the highest count. You can refer to the following implementation, which is available under the MIT License as part of the Multilingual-Latent-Dirichlet-Allocation-LDA project:
https://github.com/ArtificiAI/Multilingual-Latent-Dirichlet-Allocation-LDA/blob/master/lda_service/logic/stemmer.py
Improvements could be made by ditching the non-top reverse words (by using a heap for example) which would yield just one dict in the end instead of a dict of dicts.
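As a minimal sketch of this counting idea (assuming NLTK's SnowballStemmer; any stemmer would work, and the token list here is just an illustration):

from collections import Counter, defaultdict
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer("english")
tokens = ["eat", "eaten", "ate", "eating", "eats", "eating"]

# For each stem, count how often each original surface form occurred
counts = defaultdict(Counter)
for token in tokens:
    counts[stemmer.stem(token)][token] += 1

# "Un-stem" by picking the most frequent original form for each stem
unstem = {stem: forms.most_common(1)[0][0] for stem, forms in counts.items()}
print(unstem)  # maps each stem to its most frequent original form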
I suspect what you really mean by stem is "tense". As in, you want the different tenses of each verb to count towards its "base form".
Check out the pattern package:
pip install pattern
Then use the en.lemma function to return a verb's base form.
import pattern.en as en
base_form = en.lemma('ate') # base_form == "eat"
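If pattern is not an option (its Python 3 support has been patchy), NLTK's WordNetLemmatizer is a possible alternative for the same kind of lookup; note that it needs the part of speech to resolve irregular forms like "ate", and it requires the wordnet data package. A small sketch:

import nltk
from nltk.stem import WordNetLemmatizer

nltk.download('wordnet')  # one-time download of the WordNet data

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize('ate', pos='v'))     # should give 'eat'
print(lemmatizer.lemmatize('eating', pos='v'))  # should give 'eat'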
Theoretically, the only way to un-stem is if, prior to stemming, you kept a dictionary of terms or a mapping of some kind and carried that mapping through the rest of your computations. The mapping should capture the position of each unstemmed token, so that when you need to un-stem a token, knowing the original position of the stemmed token lets you trace back and recover the original unstemmed representation.
For the Bag of Words representation this seems computationally intensive and somewhat defeats the purpose of the statistical nature of the BoW approach.
But again, theoretically I believe it could work. I haven't seen it in any implementation, though.

Finding the surrounding sentence of a char/word in a string

I am trying to get sentences from a string that contain a given substring using python.
I have access to the string (an academic abstract) and a list of highlights with start and end indexes. For example:
{
  abstract: "...long abstract here...",
  highlights: [
    {
      concept: 'a word',
      start: 1,
      end: 10
    },
    {
      concept: 'cancer',
      start: 123,
      end: 135
    }
  ]
}
I am looping over each highlight, locating its start index in the abstract (the end doesn't really matter as I just need a location within a sentence), and then somehow need to identify the sentence that index occurs in.
I am able to tokenize the abstract into sentences using nltk.tokenize.sent_tokenize, but by doing that I render the index location useless.
How should I go about solving this problem? I suppose regexes are an option, but the nltk tokenizer seems such a nice way of doing it that it would be a shame not to make use of it. Or should I somehow reset the start index by finding the number of characters since the previous full stop/exclamation mark/question mark?
You are right, the NLTK tokenizer is really what you should be using in this situation, since it is robust enough to delimit nearly all sentences, including sentences that end with a "quotation." You can do something like this (paragraph from a random generator):
Start with,
from nltk.tokenize import sent_tokenize
paragraph = "How does chickens harden over the acceptance? Chickens comprises coffee. Chickens crushes a popular vet next to the eater. Will chickens sweep beneath a project? Coffee funds chickens. Chickens abides against an ineffective drill."
highlights = ["vet","funds"]
sentencesWithHighlights = []
Most intuitive way:
for sentence in sent_tokenize(paragraph):
    for highlight in highlights:
        if highlight in sentence:
            sentencesWithHighlights.append(sentence)
            break
But using this method we actually have what is effectively a 3x nested for loop. This is because we first check each sentence, then each highlight, then each subsequence in the sentence for the highlight.
We can get better performance since we know the start index for each highlight:
highlightIndices = [100, 169]
subtractFromIndex = 0
for sentence in sent_tokenize(paragraph):
    for index in highlightIndices:
        if 0 < index - subtractFromIndex < len(sentence):
            sentencesWithHighlights.append(sentence)
            break
    subtractFromIndex += len(sentence)
In either case we get:
sentencesWithHighlights = ['Chickens crushes a popular vet next to the eater.', 'Coffee funds chickens.']
I assume that all your sentences end with one of these three characters: !?.
What about looping over the list of highlights, creating a regexp group:
(?:list|of|your highlights)
Then matching your whole abstract against this regexp:
/(?:[\.!\?]|^)\s*([^\.!\?]*(?:list|of|your highlights)[^\.!\?]*?)(?=\s*[\.!\?])/ig
This way you would get the sentence containing at least one of your highlights in the first subgroup of each match (RegExr).
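For reference, a rough Python sketch of the same idea, reusing the sample paragraph and highlights from the earlier answer (the names here are illustrative, and the captured sentences come back without their trailing punctuation):

import re

paragraph = ("How does chickens harden over the acceptance? Chickens comprises coffee. "
             "Chickens crushes a popular vet next to the eater. Will chickens sweep beneath "
             "a project? Coffee funds chickens. Chickens abides against an ineffective drill.")
highlights = ["vet", "funds"]

# Build the alternation group from the highlights and embed it in the sentence pattern
group = "|".join(map(re.escape, highlights))
pattern = re.compile(r"(?:[.!?]|^)\s*([^.!?]*(?:%s)[^.!?]*?)(?=\s*[.!?])" % group, re.IGNORECASE)

print([m.group(1) for m in pattern.finditer(paragraph)])
# should give the two sentences containing 'vet' and 'funds'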
Another option (though it's tough to say how reliable it would be with variably formatted text) would be to split the text into a list of sentences and test against them:
re.split(r'(?<=\?|!|\.)\s{0,2}(?=[A-Z]|$)', text)
