Identifying list of regex expressions in Pandas column - python

I have a large pandas dataframe. A column contains text broken down into sentences, one sentence per row. I need to check the sentences for the presence of terms used in various ontologies. Some of the ontologies are fairly large, with more than 100,000 entries. In addition, some of the ontologies contain molecule names with hyphens, commas, and other characters that may or may not be present in the text to be examined; hence the need for regular expressions.
I came up with the code below, but it's not fast enough to deal with my data. Any suggestions are welcome.
Thank you!
import pandas as pd
import re

sentences = ["""There is no point in driving yourself mad trying to stop
yourself going mad""",
             "The ships hung in the sky in much the same way that bricks don’t"]
sentence_number = list(range(len(sentences)))
d = {'sentence': sentences, 'number': sentence_number}
df = pd.DataFrame(d)

# Join the individual patterns into one big alternation and compile it once.
regexes = [r'\bt\w+', r'\bs\w+']
big_regex = '|'.join(regexes)
compiled_regex = re.compile(big_regex, re.I)
df['found_regexes'] = df.sentence.str.findall(compiled_regex)
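Since the ontology entries are literal terms rather than hand-written patterns, one commonly suggested direction is to escape each term and sort the alternation longest-first, so hyphenated or multi-word names are matched before their prefixes. A minimal sketch, with a hypothetical terms list standing in for a real ontology:

import re
import pandas as pd

# Hypothetical ontology terms; a real list would have 100,000+ entries.
terms = ['alpha-ketoglutarate', 'alpha-keto', 'ships', 'sky']

# Escape the literal terms and sort longest-first so longer names
# win over their prefixes inside the alternation.
big_regex = r'\b(?:' + '|'.join(
    re.escape(t) for t in sorted(terms, key=len, reverse=True)) + r')\b'
compiled_regex = re.compile(big_regex, re.I)

df = pd.DataFrame({'sentence': ['The ships hung in the sky.']})
df['found_regexes'] = df.sentence.str.findall(compiled_regex)
print(df.found_regexes[0])  # ['ships', 'sky']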

Related

DataFrame Stemming

Hey guys, I have a quick question about dataframe stemming. I want to know how to iterate through a dataframe of text, stem each word, and then return the dataframe to the text it was originally in (with the stemmed word changes, of course).
I will give an example
import pandas as pd

dic = {0: ['He was caught eating by the tree'], 1: ['She was playing with her friends']}
dic = pd.DataFrame(dic)
dic = dic.T
dic.rename(columns={0: 'text'}, inplace=True)
When you run these lines, you will get a dataframe of text. I would like a method to iterate through it and stem each word, as I have a dataframe consisting of over 30k such sentences. Thank you very much.
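A minimal sketch of one common approach, assuming NLTK's PorterStemmer (any stemmer exposing a stem() method would slot in the same way):

import pandas as pd
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

def stem_text(text):
    # Stem each word, then join the results back into one string.
    return ' '.join(stemmer.stem(word) for word in text.split())

dic = pd.DataFrame({0: ['He was caught eating by the tree']}).T
dic.rename(columns={0: 'text'}, inplace=True)
dic['text'] = dic['text'].apply(stem_text)
print(dic)  # e.g. 'eating' -> 'eat' (Porter also lowercases its output)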

Highlight text in dataframe based on regex pattern

Problem: I have a use case wherein I'm required to highlight the word/words in a dataframe row with a red font color, based on a regex pattern. I landed on a regex pattern because it lets me ignore surrounding spaces, punctuation, and case.
Source: The original data comes from a csv file, so I'm looking to load it into a dataframe, do the pattern-match highlight formatting, and export the result to Excel.
Code: The code helps me with the count of words that match in the dataframe row.
import pandas as pd
import re

# Console output color (defined with the expected output below).
red_fmt = "\033[1;31m{}\033[0m"

df = pd.read_csv("C:/filepath/filename.csv", engine='python')
# Group the alternatives so the lookarounds apply to every word, not just the first and last.
p = r'(?i)(?<![^ .,?!-])(?:Crust|good|selection|fresh|rubber|warmer|fries|great)(?!-[^ .,?!;\r\n])'
df['Output'] = df['Output'].apply(lambda x: re.sub(p, red_fmt.format(r"\g<0>"), x))
Sample Data:
Input
Wow... Loved this place.
Crust is not good.
The selection on the menu was great and so were the prices.
Honeslty it didn't taste THAT fresh.
The potatoes were like rubber and you could tell they had been made up ahead of time being kept under a warmer.
The fries were great too.
Output: What I'm trying to achieve.
import re
# Console output color.
red_fmt = "\033[1;31m{}\033[0m"
s = """
Wow... Loved this place.
Crust is not good.
The selection on the menu was great and so were the prices.
Honeslty it didn't taste THAT fresh.
The potatoes were like rubber and you could tell they had been made up ahead of time being kept under a warmer.
The fries were great too.
"""
# Group the alternatives so the lookarounds apply to every word, not just the first and last.
p = r'(?i)(?<![^ \r\n.,?!-])(?:Crust|good|selection|fresh|rubber|warmer|fries|great)(?!-[^ .,?!;\r\n])'
print(re.sub(p, red_fmt.format(r"\g<0>"), s))
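The ANSI escape codes above only work in a console, not in Excel. For the Excel half of the question, a minimal sketch using pandas' Styler with openpyxl; note that Styler colors whole cells, so this flags any cell containing a match rather than the individual words:

import re
import pandas as pd

p = re.compile(r'(?i)crust|good|selection|fresh|rubber|warmer|fries|great')

df = pd.DataFrame({'Output': ["Crust is not good.", "Wow... Loved this place."]})

def flag_match(cell):
    # Red font for any cell containing one of the words.
    return 'color: red' if p.search(cell) else ''

# pandas >= 2.1 renames Styler.applymap to Styler.map.
styled = df.style.applymap(flag_match, subset=['Output'])
styled.to_excel('highlighted.xlsx', engine='openpyxl', index=False)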

Removing stop-words and selecting only names in pandas

I'm trying to extract top words by date as follows:
df.set_index('Publishing_Date').Quotes.str.lower().str.extractall(r'(\w+)')[0].groupby('Publishing_Date').value_counts().groupby('Publishing_Date')
in the following dataframe:
import pandas as pd
# initialize
data = [['20/05', "So many books, so little time."],
        ['20/05', "The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid."],
        ['19/05', "Don't be pushed around by the fears in your mind. Be led by the dreams in your heart."],
        ['19/05', "Be the reason someone smiles. Be the reason someone feels loved and believes in the goodness in people."],
        ['19/05', "Do what is right, not what is easy nor what is popular."]]
# Create the pandas DataFrame
df = pd.DataFrame(data, columns = ['Publishing_Date', 'Quotes'])
As you can see, there are many stop-words ("the", "an", "a", "be", ...) that I would like to remove in order to have a better selection. My aim is to find key words, i.e. patterns, in common by date, so I am more interested in nouns than in verbs.
Any idea on how I could remove stop-words AND keep only nouns?
Edit
Expected output (based on the results from Vaibhav Khandelwal's answer below):
Publishing_Date Quotes Nouns
20/05 .... books, time, person, gentleman, lady, novel
19/05 .... fears, mind, dreams, heart, reason, smiles
I would need to extract only the nouns ("reason" appears more often, so the output should be ordered by frequency).
I think nltk.pos_tag should be useful here, keeping the tokens whose tag starts with 'NN'.
This is how you can remove stopwords from your text:
import nltk
from nltk.corpus import stopwords
# nltk.download('stopwords')  # required once per environment

def remove_stopwords(text):
    stop_words = stopwords.words('english')
    fresh_text = []
    for i in text.lower().split():
        if i not in stop_words:
            fresh_text.append(i)
    return ' '.join(fresh_text)

df['text'] = df['Quotes'].apply(remove_stopwords)
NOTE: If you want to remove additional words, append them to the stopwords list explicitly.
For your other half you can add another function to extract nouns:
def extract_noun(text):
    # nltk.download('punkt') and nltk.download('averaged_perceptron_tagger') are required once.
    token = nltk.tokenize.word_tokenize(text)
    result = []
    for i in nltk.pos_tag(token):
        if i[1].startswith('NN'):  # NN, NNS, NNP, NNPS
            result.append(i[0])
    return ', '.join(result)

df['NOUN'] = df['text'].apply(extract_noun)
The final output will be the dataframe with the added NOUN column, matching the expected output sketched in the question.

Python Count Number of Phrases in Text

I have a list of product reviews/descriptions in excel and I am trying to classify them using Python based on words that appear in the reviews.
I import both the reviews, and a list of words that would indicate the product falling into a certain classification, into Python using Pandas and then count the number of occurrences of the classification words.
This all works fine for single classification words e.g. 'computer' but I am struggling to make it work for phrases e.g. 'laptop case'.
I have looked through a few answers, but none were successful for me, including:
using just text.count(['laptop case', 'laptop bag']) as per the answer here: Counting phrase frequency in Python 3.3.2, but because you need to split the text up that does not work (and I think text.count does not work for lists either?)
Other answers I have found only look at the occurrence of a single word. Is there something I can do to count words and phrases that does not involve splitting the body of text into individual words?
The code I currently have (that works for individual terms) is:
pool = []
for i in df1.index:
    descriptions = df1['detaileddescription'][i]
    if type(descriptions) is str:
        descriptions = descriptions.split()
        pool.append(sum(map(descriptions.count, df2['laptop_bag'])))
    else:
        pool.append(0)
print(pool)
You're on the right track! You're currently splitting into single words, which facilitates finding occurrences of single words as you pointed out. To find phrases of length n you should split the text into chunks of length n, which are called n-grams.
To do that, check out the NLTK package:
from nltk import ngrams

sentence = 'I have a laptop case and a laptop bag'
n = 2
bigrams = ngrams(sentence.split(), n)
for gram in bigrams:
    print(gram)
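Combining the n-grams with a counter gives phrase counts directly; a small sketch (the phrase list is made up for illustration):

from collections import Counter
from nltk import ngrams

phrases = ['laptop case', 'laptop bag']  # hypothetical classification phrases
text = 'I have a laptop case and a laptop bag'

# Count every bigram once, then look up the phrases of interest.
bigram_counts = Counter(' '.join(g) for g in ngrams(text.split(), 2))
print({p: bigram_counts[p] for p in phrases})  # {'laptop case': 1, 'laptop bag': 1}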
Sklearn's CountVectorizer is the standard way
from sklearn.feature_extraction import text

vectorizer = text.CountVectorizer()
vec = vectorizer.fit_transform(descriptions)  # descriptions: an iterable of strings
And if you want to see the counts as a dict:
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out().
count_dict = {k: v for k, v in zip(vectorizer.get_feature_names_out(), vec.toarray()[0]) if v > 0}
print(count_dict)
The default is unigrams; you can count bigrams or higher-order n-grams with the ngram_range parameter.
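For example (a small sketch; the sample description is made up):

from sklearn.feature_extraction.text import CountVectorizer

docs = ['I have a laptop case and a laptop bag']  # hypothetical description
vectorizer = CountVectorizer(ngram_range=(1, 2))  # unigrams and bigrams
vec = vectorizer.fit_transform(docs)
counts = dict(zip(vectorizer.get_feature_names_out(), vec.toarray()[0]))
print({k: v for k, v in counts.items() if k.startswith('laptop')})
# {'laptop': 2, 'laptop bag': 1, 'laptop case': 1}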

Extracting #mentions from tweets using findall python (Giving incorrect results)

I have a csv file something like this
text
RT #CritCareMed: New Article: Male-Predominant Plasma Transfusion Strategy for Preventing Transfusion-Related Acute Lung Injury... htp://…
#CRISPR Inversion of CTCF Sites Alters Genome Topology & Enhancer/Promoter Function in #CellCellPress htp://.co/HrjDwbm7NN
RT #gvwilson: Where's the theory for software engineering? Behind a paywall, that's where. htp://.co/1t3TymiF3M #semat #fail
RT #sciencemagazine: What’s killing off the sea stars? htp://.co/J19FnigwM9 #ecology
RT #MHendr1cks: Eve Marder describes a horror that is familiar to worm connectome gazers. htp://.co/AEqc7NOWoR via #nucAmbiguous htp://…
I want to extract all the mentions (starting with '#') from the tweet text. So far I have done this
import pandas as pd
import re

mydata = pd.read_csv("C:/Users/file.csv")
X = mydata.ix[:,:]  # .ix is deprecated; .loc/.iloc are the modern equivalents
X = X.iloc[:,:1]  # I have multiple columns, so I select only the first column, 'text'
for i in range(X.shape[0]):
    result = re.findall(r"(^|[^#\w])#(\w{1,25})", str(X.iloc[:i,:]))
    print(result)
There are two problems here:
First: at str(X.iloc[:1,:]) it gives me ['CritCareMed'], which is not OK, as it should give me ['CellCellPress']; at str(X.iloc[:2,:]) it again gives me ['CritCareMed'], which is of course not fine either. The final result I'm getting is
[(' ', 'CritCareMed'), (' ', 'gvwilson'), (' ', 'sciencemagazine')]
It doesn't include the mention in the 2nd row or either of the two mentions in the last row.
What I want should look something like the mention column in the answers below.
How can I achieve these results? This is just sample data; my original data has lots of tweets, so is the approach OK?
You can use the str.findall method to avoid the for loop, and use a negative lookbehind to replace (^|[^#\w]), which forms a capture group you don't need in your regex:
df['mention'] = df.text.str.findall(r'(?<![#\w])#(\w{1,25})').apply(','.join)
df
# text mention
#0 RT #CritCareMed: New Article: Male-Predominant... CritCareMed
#1 #CRISPR Inversion of CTCF Sites Alters Genome ... CellCellPress
#2 RT #gvwilson: Where's the theory for software ... gvwilson
#3 RT #sciencemagazine: What’s killing off the se... sciencemagazine
#4 RT #MHendr1cks: Eve Marder describes a horror ... MHendr1cks,nucAmbiguous
Also, X.iloc[:i,:] gives back a data frame, so str(X.iloc[:i,:]) gives you the string representation of a data frame, which is very different from the element in the cell. To extract the actual string from the text column, you can use X.text.iloc[0], or better, iterate through the column with iteritems:
import re

for index, s in df.text.iteritems():  # in pandas >= 2.0, use df.text.items()
    result = re.findall(r"(?<![#\w])#(\w{1,25})", s)
    print(','.join(result))
#CritCareMed
#CellCellPress
#gvwilson
#sciencemagazine
#MHendr1cks,nucAmbiguous
While you already have your answer, you could even try to optimize the whole import process like so:
import re
import pandas as pd

rx = re.compile(r'#([^:\s]+)')

with open("test.txt") as fp:
    dft = ([line, ",".join(rx.findall(line))] for line in fp.readlines())
    df = pd.DataFrame(dft, columns=['text', 'mention'])
print(df)
Which yields:
text mention
0 RT #CritCareMed: New Article: Male-Predominant... CritCareMed
1 #CRISPR Inversion of CTCF Sites Alters Genome ... CellCellPress
2 RT #gvwilson: Where's the theory for software ... gvwilson
3 RT #sciencemagazine: What’s killing off the se... sciencemagazine
4 RT #MHendr1cks: Eve Marder describes a horror ... MHendr1cks,nucAmbiguous
This might be a bit faster as you don't need to change the df once it's already constructed.
mydata['text'].str.findall(r'(?:(?<=\s)|(?<=^))#.*?(?=\s|$)')
Same as this: Extract hashtags from columns of a pandas dataframe, but for mentions.
#.*? carries out a non-greedy match for a word starting with a hashtag
(?=\s|$) looks ahead for the end of the word or the end of the sentence
(?:(?<=\s)|(?<=^)) looks behind to ensure there are no false positives if a # is used in the middle of a word
The regex lookbehind asserts that either a space or the start of the sentence must precede a # character.
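A quick check of this pattern's behavior (note that, unlike the capture-group answers above, it keeps trailing punctuation such as the colon after a mention):

import pandas as pd

mydata = pd.DataFrame({'text': ['RT #CritCareMed: New Article', 'mid#word is not a mention']})
print(mydata['text'].str.findall(r'(?:(?<=\s)|(?<=^))#.*?(?=\s|$)'))
# 0    [#CritCareMed:]
# 1    []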
