Counting Specific Phrases Using Python

So I am trying to get a count of specific phrases in Python from a string I created. I have been able to make a list of specific individual words, but never anything involving two-word phrases. I just want to be able to create a list of items where each item involves two words.
import pandas as pd
import numpy as np
import re
import collections
import plotly.express as px
df = pd.read_excel("Datasets/realDonaldTrumprecent2020.xlsx", sep='\t',
                   names=["Tweet_ID", "Date", "Text"])
df = pd.DataFrame(df)
df.head()
tweets = df["Text"]
raw_string = ''.join(tweets)
no_links = re.sub(r'http\S+', '', raw_string)
no_unicode = re.sub(r"\\[a-z][a-z]?[0-9]+", '', no_links)
no_special_characters = re.sub('[^A-Za-z ]+', '', no_unicode)
no_capital_letters = re.sub('[A-Z]+', lambda m: m.group(0).lower(), no_special_characters)
words_list = no_capital_letters.split(" ")
phrases = ['fake news', 'lamestream media', 'sleepy joe', 'radical left', 'rigged election']
I initially was able to get a list of just the individual words but I want to be able to get a list of instances where the phrases show up. Is there a way to do this?

Pandas provides some nice tools for doing these things.
For example, if your DataFrame was as follows:
import pandas as pd
df = pd.DataFrame({'text': [
    'Encyclopedia Britannica is FAKE NEWS!',
    'What does Sleepy Joe read? Webster\'s Dictionary? Fake News!',
    'Sesame Street is lamestream media by radical leftist Big Bird!!!',
    '1788 was a rigged election! Landslide for King George! Fake News',
]})
...you could select tweets containing the phrase 'fake news' like so:
selector = df.text.str.lower().str.contains('fake news')
This produces the following Series of booleans:
0     True
1     True
2    False
3     True
Name: text, dtype: bool
You can count how many are positive with sum:
sum(selector)
And you can use it to index the DataFrame to get an array of the matching tweets:
df.text[selector].values
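Extending the same idea to every phrase in the question's list (a minimal sketch, assuming the df above and the phrases list from the question):

phrases = ['fake news', 'lamestream media', 'sleepy joe', 'radical left', 'rigged election']

lowered = df.text.str.lower()
for phrase in phrases:
    # number of tweets containing this phrase at least once
    print(phrase, lowered.str.contains(phrase, regex=False).sum())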

If you are trying to count the number of times those phrases appear in the text, the following code should work. Note it counts against the cleaned string rather than words_list, since a two-word phrase can never appear in a list of single words:
for phrase in phrases:
    count = no_capital_letters.count(phrase)
    print(phrase, count)
In terms of "a list of instances where the phrases show up", you should be able to slightly modify the above for loop:
phrase_list = []
for tweet in tweets:
    for phrase in phrases:
        if phrase in tweet.lower():
            phrase_list.append(tweet)
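If you would rather keep the matches grouped by phrase, a small sketch under the same assumptions (matches_by_phrase is just an illustrative name):

matches_by_phrase = {
    phrase: [tweet for tweet in tweets if phrase in tweet.lower()]
    for phrase in phrases
}
for phrase, hits in matches_by_phrase.items():
    print(phrase, len(hits))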

Related

How can I exclude words from frequency word analysis in a list of articles in python?

I have a dataframe df with a column "Content" that contains a list of articles extracted from the internet. I already have the code for constructing a dataframe with the expected output (two columns, one for the word and the other for its frequency). However, I would like to exclude some words (connectors, for instance) from the analysis. Below you will find my code; what should I add to it?
Is it possible to use get_stop_words('fr') to do this more efficiently? (My articles are in French.)
Source Code
import csv
from collections import Counter
from collections import defaultdict
import pandas as pd

df = pd.read_excel('C:/.../df_clean.xlsx',
                   sheet_name='Articles Scraping')
df = df[df['Content'].notnull()]
d1 = dict()
for line in df[df.columns[6]]:
    words = line.split()
    # print(words)
    for word in words:
        if word in d1:
            d1[word] += 1
        else:
            d1[word] = 1
sort_words = sorted(d1.items(), key=lambda x: x[1], reverse=True)
There are a few ways you can achieve this. You can either use the isin() method,
import pandas as pd

data = {'test': ['x', 'NaN', 'y', 'z', 'gamma']}
df = pd.DataFrame(data)
words = ['x', 'y', 'NaN']
df = df[~df.test.isin(words)]
Or you can negate str.contains with a joined pattern:
df = df[~df.test.str.contains('|'.join(words))]
If you want to utilize a stop-words package for French, you can also do that, but you must preprocess all of your texts before you start doing any frequency analysis.
# assumes a package that exposes stopwords("fr"),
# e.g. import stopwordsiso as stopwords
french_stopwords = set(stopwords.stopwords("fr"))
STOPWORDS = list(french_stopwords)
STOPWORDS.extend(['add', 'new', 'words', 'here'])
I think the extend() will help you tremendously.
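Tying it back to the question's frequency loop, here is a minimal sketch that assumes the stop-words package the question mentions (from stop_words import get_stop_words) and the same Content column:

from collections import Counter
from stop_words import get_stop_words

french_stopwords = set(get_stop_words('fr'))

counts = Counter()
for line in df['Content'].dropna():
    for word in line.lower().split():
        if word not in french_stopwords:
            counts[word] += 1

# list of (word, frequency) pairs, most frequent first
sort_words = counts.most_common()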

How to gather lists and load them into a dataframe

The following code creates a dataframe, tokenizes, and filters stopwords. However, I am stuck trying to properly gather the results to load back into a column of the dataframe. Trying to put the results back into the dataframe (using the commented code) produces the following error: ValueError: Length of values does not match length of index. It seems like the issue is with how I'm loading the lists back into the df. I think it is treating them one at a time. I'm not clear on how to form a list of lists, which is what I think is needed. Neither append() nor extend() seems appropriate, or if they are, I'm not doing it properly. Any insight would be greatly appreciated.
Minimal example
# Load libraries
import numpy as np
import pandas as pd
import spacy

# Create dataframe and tokenize
df = pd.DataFrame({'Text': ['This is the first text. It is two sentences',
                            'This is the second text, with one sentence']})
nlp = spacy.load("en_core_web_sm")
df['Tokens'] = ''
doc = df['Text']
doc = doc.apply(lambda x: nlp(x))
df['Tokens'] = doc
# df # check dataframe

# Filter stopwords
df['No Stop'] = ''
def test_loc(df):
    for i in df.index:
        doc = df.loc[i, 'Tokens']
        tokens_no_stop = [token.text for token in doc if not token.is_stop]
        print(tokens_no_stop)
        # df['No Stop'] = tokens_no_stop # THIS PRODUCES AN ERROR
test_loc(df)
Result
['text', '.', 'sentences']
['second', 'text', ',', 'sentence']
As you mentioned, you need a list of lists in order for the assignment to work.
Another solution is to use pandas.apply, as you did at the beginning of your code.
import numpy as np
import pandas as pd
import spacy

df = pd.DataFrame({'Text': ['This is the first text. It is two sentences',
                            'This is the second text, with one sentence']})
nlp = spacy.load("en_core_web_sm")
df['Tokens'] = df['Text'].apply(lambda x: nlp(x))

def remove_stop_words(tokens):
    return [token.text for token in tokens if not token.is_stop]

df['No Stop'] = df['Tokens'].apply(remove_stop_words)
Notice you don't have to create the column before assigning to it.
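For the two example sentences, the new column then holds one list per row, matching the output printed in the question; a quick check (sketch):

print(df['No Stop'].tolist())
# [['text', '.', 'sentences'], ['second', 'text', ',', 'sentence']]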

How can I get unique words from a DataFrame column of strings?

I'm looking for a way to get a list of unique words in a column of strings in a DataFrame.
import pandas as pd
import numpy as np
df = pd.read_csv('FinalStemmedSentimentAnalysisDataset.csv', sep=';',
                 dtype={'tweetId': int, 'tweetText': str, 'tweetDate': str, 'sentimentLabel': int})
tweets = {}
tweets[0] = df[df['sentimentLabel'] == 0]
tweets[1] = df[df['sentimentLabel'] == 1]
The dataset I'm using is from this link: http://thinknook.com/twitter-sentiment-analysis-training-corpus-dataset-2012-09-22/
I have this column with strings of variable length, and I want to get the list of every unique word in the column and its count. How can I get it? I'm using pandas in Python, and the original database has more than 1M rows, so I also need an effective way to process this fast enough that the code doesn't run for too long.
An example column could be:
is so sad for my apl friend.
omg this is terrible.
what is this new song?
And the list could be something like:
[is,so,sad,for,my,apl,friend,omg,this,terrible,what,new,song]
If you have strings in the column, then you have to split every sentence into a list of words and then put all the lists into one list - you can use sum() for this - which should give you all the words. To get unique words you can convert it to a set() - and later you can convert it back to a list().
But first you have to clean the sentences to remove chars like ., ?, etc. I use a regex to keep only selected chars and spaces. You may also want to convert all words to lower or upper case.
import pandas as pd

df = pd.DataFrame({
    'sentences': [
        'is so sad for my apl friend.',
        'omg this is terrible.',
        'what is this new song?',
    ]
})

unique = set(df['sentences'].str.replace(r'[^a-zA-Z ]', '', regex=True).str.lower().str.split(' ').sum())

print(list(sorted(unique)))
Result
['apl', 'for', 'friend', 'is', 'my', 'new', 'omg', 'sad', 'so', 'song', 'terrible', 'this', 'what']
EDIT: as @HenryYik mentioned in a comment, findall('\w+') can be used instead of split(), and it also removes the need for replace():
unique = set(df['sentences'].str.lower().str.findall(r"\w+").sum())
EDIT: I tested it with the data from
http://thinknook.com/twitter-sentiment-analysis-training-corpus-dataset-2012-09-22/
Everything works fast except column.sum() or sum(column) - I measured the time for 1000 rows, and extrapolated to 1,500,000 rows it would need about 35 minutes.
Much faster is to use itertools.chain.from_iterable() - it needs about 8 seconds.
import itertools

words = df['sentences'].str.lower().str.findall(r"\w+")
words = list(itertools.chain.from_iterable(words))
unique = set(words)
But the set can also be built directly, without the intermediate list:
words = df['sentences'].str.lower().str.findall(r"\w+")
unique = set()
for x in words:
    unique.update(x)
and it takes about 5 seconds.
Full code:
import pandas as pd
import time
print(time.strftime('%H:%M:%S'), 'start')
print('-----')
#------------------------------------------------------------------------------
start = time.time()
# `read_csv()` can read directly from internet and compressed to zip
#url = 'http://thinknook.com/wp-content/uploads/2012/09/Sentiment-Analysis-Dataset.zip'
url = 'SentimentAnalysisDataset.csv'
# need to skip two rows which are incorrect
df = pd.read_csv(url, sep=',', dtype={'ItemID':int, 'Sentiment':int, 'SentimentSource':str, 'SentimentText':str}, skiprows=[8835, 535881])
end = time.time()
print(time.strftime('%H:%M:%S'), 'load:', end-start, 's')
print('-----')
#------------------------------------------------------------------------------
start = end
words = df['SentimentText'].str.lower().str.findall(r"\w+")
#df['words'] = words
end = time.time()
print(time.strftime('%H:%M:%S'), 'words:', end-start, 's')
print('-----')
#------------------------------------------------------------------------------
start = end
unique = set()
for x in words:
    unique.update(x)
end = time.time()
print(time.strftime('%H:%M:%S'), 'set:', end-start, 's')
print('-----')
#------------------------------------------------------------------------------
print(list(sorted(unique))[:10])
Result
00:27:04 start
-----
00:27:08 load: 4.10780930519104 s
-----
00:27:23 words: 14.803470849990845 s
-----
00:27:27 set: 4.338541269302368 s
-----
['0', '00', '000', '0000', '00000', '000000000000', '0000001', '000001', '000014', '00004873337e0033fea60']
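The question also asks for each word's count, not just the unique words. A minimal sketch on top of the same findall result, using collections.Counter (column name as in the full code above):

from collections import Counter
import itertools

words = df['SentimentText'].str.lower().str.findall(r"\w+")
counts = Counter(itertools.chain.from_iterable(words))

# the 10 most frequent words with their counts
print(counts.most_common(10))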

How can I Optimize a search in pandas dataframe

I need to search for the word 'mas' ('but' in Portuguese) in a DataFrame. The column with the sentence is Corpo, and the text in this column is split into a list, for example: I like birds ---> split [I, like, birds]. So I need to search for 'mas' in a Portuguese sentence and keep just the words after 'mas'. The code is taking too long to execute this.
df.Corpo.update(df.Corpo.str.split())  # tokenize sentence
df.Corpo = df.Corpo.fillna('')
for i in df.index:
    for j in range(len(df.Corpo[i])):
        lista_aux = []
        if df.Corpo[i][j] == 'mas' or df.Corpo[i][j] == 'porem' or df.Corpo[i][j] == 'contudo' or df.Corpo[i][j] == 'todavia':
            lista_aux = df.Corpo[i]
            df.Corpo[i] = lista_aux[j+1:]
            break
        if df.Corpo[i][j] == 'question':
            df.Corpo[i] = ['question']
            break
When working with pandas dataframes (or numpy arrays) you should always try to use vectorized operations instead of for-loops over individual dataframe elements. Vectorized operations are (nearly always) significantly faster than for-loops.
In your case you could use pandas built-in vectorized operation str.extract, which allows extraction of the string part that matches a regex search pattern. The regex search pattern mas (.+) should capture the part of a string that follows after 'mas'.
import pandas as pd
# Example dataframe with phrases
df = pd.DataFrame({'Corpo': ['I like birds', 'I mas like birds', 'I like mas birds']})
# Use regex search to extract phrase sections following 'mas'
df2 = df.Corpo.str.extract(r'mas (.+)')
# Fill gaps with full original phrase
df2 = df2.fillna(df.Corpo)
will give as result:
In [1]: df2
Out[1]:
              0
0  I like birds
1    like birds
2         birds
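The question's loop also checks 'porem', 'contudo', and 'todavia'; those fit into the same vectorized extraction with a regex alternation (a sketch, not part of the original answer):

# capture everything after any of the four conjunctions the question checks
after = df.Corpo.str.extract(r'(?:mas|porem|contudo|todavia) (.+)')[0]
# rows without any of the conjunctions keep the full original sentence
after = after.fillna(df.Corpo)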

python pandas dataframe words in context: get 3 words before and after

I am working in jupyter notebook and have a pandas dataframe "data":
Question_ID | Customer_ID | Answer
1 234 Data is very important to use because ...
2 234 We value data since we need it ...
I want to go through the text in column "Answer" and get the three words before and after the word "data".
So in this scenario I would have gotten "is very important"; "We value", "since we need".
Is there a good way to do this within a pandas dataframe? So far I have only found solutions where "Answer" would be its own file run through Python code (without a pandas dataframe). While I realize that I need to use the NLTK library, I haven't used it before, so I don't know what the best approach would be. (This was a great example: Extracting a word and its prior 10 word context to a dataframe in Python)
This may work:
import pandas as pd
import re

df = pd.read_csv('data.csv')

for value in df.Answer.values:
    non_data = re.split('Data|data', value)  # split the text on "data", removing it
    terms_list = [term for term in non_data if len(term) > 0]  # skip empty terms
    substrs = [term.split()[0:3] for term in terms_list]  # slice and grab the first three words
    result = [' '.join(term) for term in substrs]  # combine the words back into substrings
    print(result)
output:
['is very important']
['We value', 'since we need']
A solution using a generator expression with the re.findall and itertools.chain.from_iterable functions:
import pandas as pd, re, itertools

data = pd.read_csv('test.csv')  # change to your current file path
data_adjacents = ((i for sublist in (list(filter(None, t))
                   for t in re.findall(r'(\w*?\s*\w*?\s*\w*?\s+)(?=\bdata\b)|(?<=\bdata\b)(\s+\w*\s*\w*\s*\w*)', l, re.I))
                   for i in sublist)
                  for l in data.Answer.tolist())

print(list(itertools.chain.from_iterable(data_adjacents)))
The output:
[' is very important', 'We value ', ' since we need']
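To keep the results in a new column of the dataframe, as the question asks, one option is a plain Python helper applied row by row (a sketch; context_around, the window size, and the example data are illustrative, not from either answer):

import pandas as pd

# hypothetical example data mirroring the question's table
df = pd.DataFrame({'Answer': [
    'Data is very important to use because ...',
    'We value data since we need it ...',
]})

def context_around(text, keyword='data', window=3):
    # return (words before, words after) for every occurrence of keyword
    tokens = text.split()
    hits = []
    for i, tok in enumerate(tokens):
        if tok.lower().strip('.,?!') == keyword:
            before = ' '.join(tokens[max(0, i - window):i])
            after = ' '.join(tokens[i + 1:i + 1 + window])
            hits.append((before, after))
    return hits

df['Context'] = df['Answer'].apply(context_around)
print(df['Context'].tolist())
# [[('', 'is very important')], [('We value', 'since we need')]]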
