Removing all punctuation from string in dataframe - python

This is officially doing my head in. I am web scraping a collection of tweets for text analysis. The tweets have been scraped and put into a dataframe, where each row is a string containing the entire tweet. I can't for the life of me remove quotation marks or apostrophes, but removing all other punctuation is OK.
What I am trying to do is extract just the verbs, nouns and adjectives from each of the scraped tweets, which I have done, but anything in quotation marks is excluded.
The code that I have been using so far is below, but I can't get it to strip quotation marks or apostrophes. I have also tried every other method I can find on this site, but it either does nothing or produces errors.
tweets['Text_processed'] = tweets['Text'].map(lambda x: re.sub('[,\##.!?]', "", x))
The entire code base up until this point is:
import GetOldTweets3 as got
import pandas as pd
import re
from wordcloud import WordCloud  # Join the different processed titles together.
from sklearn.feature_extraction.text import CountVectorizer
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.simplefilter("ignore", DeprecationWarning)  # Load the LDA model from sk-learn
from sklearn.decomposition import LatentDirichletAllocation as LDA
import os
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
import plotly as py
import plotly.graph_objs as go
import gensim
from gensim import corpora, models, similarities
import logging
import tempfile
from nltk.corpus import stopwords
from string import punctuation
from collections import OrderedDict
import pyLDAvis.gensim
import tempfile
%matplotlib inline
init_notebook_mode(connected=True) #do not miss this line
warnings.filterwarnings("ignore")
# Function that pulls tweets based on a general search query and turns to csv file
# Parameters: (text query you want to search), (max number of most recent tweets to pull from)
def text_query_to_csv(text_query, count):
    # Creation of query object
    tweetCriteria = got.manager.TweetCriteria().setQuerySearch(text_query)\
                                               .setMaxTweets(count)
    # Creation of list that contains all tweets
    tweets = got.manager.TweetManager.getTweets(tweetCriteria)
    # Creating list of chosen tweet data
    text_tweets = [[tweet.date, tweet.text] for tweet in tweets]
    # Creation of dataframe from tweets
    tweets_df = pd.DataFrame(text_tweets, columns=['Datetime', 'Text'])
    # Converting tweets dataframe to csv file
    tweets_df.to_csv('{}-{}-tweets.csv'.format(text_query, int(count)), sep=',')
############################################
# Search word and number of tweets to scrape
############################################
text_query = '#barackobama'
count = 5
# Calling function to query X amount of relevant tweets and create a CSV file
text_query_to_csv(text_query, count)
filename = '#barackobama-5-tweets.csv'
tweets = pd.read_csv(filename)
# Convert tweets to strings and lower case
tweets['Text'] = tweets['Text'].astype(str)
tweets['Text'] = tweets['Text'].map(lambda x: x.lower())
tweets
This is the offending bit of code below...
# remove punctuation
tweets['Text_processed'] = tweets['Text'].map(lambda x: re.sub('[,\##.!?]', "", x))
tweets['Text_processed'].head()
#####################################
# Extract nouns, verbs and adjectives
#####################################
import nltk
from nltk import word_tokenize, pos_tag
from nltk.corpus import wordnet
from IPython.display import display
lemmatizer = nltk.WordNetLemmatizer()
def leaves(tree):
    """Finds NP (nounphrase) leaf nodes of a chunk tree."""
    for subtree in tree.subtrees(filter=lambda t: t.label() == 'NP'):
        yield subtree.leaves()

def get_word_postag(word):
    if pos_tag([word])[0][1].startswith('J'):
        return wordnet.ADJ
    if pos_tag([word])[0][1].startswith('V'):
        return wordnet.VERB
    if pos_tag([word])[0][1].startswith('N'):
        return wordnet.NOUN
    else:
        return wordnet.NOUN

def normalise(word):
    """Normalises words to lowercase and stems and lemmatizes it."""
    word = word.lower()
    postag = get_word_postag(word)
    word = lemmatizer.lemmatize(word, postag)
    return word

def get_terms(tree):
    for leaf in leaves(tree):
        terms = [normalise(w) for w, t in leaf]
        yield terms
tidied_tweets = []
for t in tweets['Text']:
    # word tokenizing and part-of-speech tagging
    document = t
    tokens = [nltk.word_tokenize(sent) for sent in [document]]
    postag = [nltk.pos_tag(sent) for sent in tokens][0]
    # Rule for NP chunk and VB Chunk
    grammar = r"""
        NBAR:
            {<NN.*|JJ>*<NN.*>}  # Nouns and Adjectives, terminated with Nouns
            {<RB.?>*<VB.?>*<JJ>*<VB.?>+<VB>?}  # Verbs and Verb Phrases
        NP:
            {<NBAR>}
            {<NBAR><IN><NBAR>}  # Above, connected with in/of/etc...
    """
    # Chunking
    cp = nltk.RegexpParser(grammar)
    # the result is a tree
    tree = cp.parse(postag)
    terms = get_terms(tree)
    features = []
    for term in terms:
        _term = ''
        for word in term:
            _term += ' ' + word
        features.append(_term.strip())
    tidied_tweets.append(features)
tidied_tweets
tidied_tweets
The code base I have after this works OK, but the inability to remove quoted text or anything with an apostrophe is causing real problems.
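One possibility worth checking (a guess, not a confirmed diagnosis): tweets often contain curly quotes and apostrophes as well as the straight ASCII ones, so a character class that only lists ASCII punctuation leaves them behind. A minimal sketch that also strips the curly variants (the strip_punct helper name is just for illustration):
import re

def strip_punct(text):
    # Straight quotes/apostrophes plus the curly variants common in tweets,
    # on top of the marks already handled above.
    return re.sub(r"[,\#.!?\"'\u2018\u2019\u201c\u201d]", "", text)

tweets['Text_processed'] = tweets['Text'].map(strip_punct)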
EDITED TO ADD
I've managed to solve the problem, but in doing so, created another. The latest bit of code to extract the words sans the punctuation is:
tweet_list = []
ind_tweet = []
for tweets in tidied_tweets:
    for words in tweets:
        a = re.findall(r"[\w']+", words)
        ind_tweet.append(a)
    tweet_list.append(ind_tweet)
re.findall(r"[\w']+", words) does the job of extracting the word, but I can't create the same structured list I started with. What I wanted is 'tweet_list' to act as the parent list, and 'ind_tweet' to act as a sucession of child lists (nested). When I print out the result of the code above, I'm not able to create the nested list I am looking for. ind_tweet produces the output but in a single list with no nesting, and tweet_list duplicates ind_tweet. It probably isn't helping that it's 2:30am on a Saturday, but this should be much easier than I am making it...

And the answer is...
tweet_list = [[] for i in range(len(tidied_tweets))]
for tweets, t in zip(tidied_tweets, range(len(tidied_tweets))):
    for words in tweets:
        a = re.findall(r"[\w']+", words)
        tweet_list[t].append(a)
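For reference, a nested list comprehension gives the same nested structure in one step (a sketch, assuming tidied_tweets is a list of lists of strings as above):
tweet_list = [[re.findall(r"[\w']+", words) for words in tweet]
              for tweet in tidied_tweets]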

Related

Exclude Japanese Stopwords from File

I am trying to remove Japanese stopwords from a text corpus from twitter.
Unfortunately the frequently used nltk does not contain Japanese, so I had to figure out a different way.
This is my MWE:
import urllib
from urllib.request import urlopen
import MeCab
import re
# slothlib
slothlib_path = "http://svn.sourceforge.jp/svnroot/slothlib/CSharp/Version1/SlothLib/NLP/Filter/StopWord/word/Japanese.txt"
sloth_file = urllib.request.urlopen(slothlib_path)
# stopwordsiso
iso_path = "https://raw.githubusercontent.com/stopwords-iso/stopwords-ja/master/stopwords-ja.txt"
iso_file = urllib.request.urlopen(iso_path)
stopwords = [line.decode("utf-8").strip() for line in iso_file]
stopwords = [ss for ss in stopwords if not ss==u'']
stopwords = list(set(stopwords))
text = '日本語の自然言語処理は本当にしんどい、と彼は十回言った。'
tagger = MeCab.Tagger("-Owakati")
tok_text = tagger.parse(text)
ws = re.compile(" ")
words = [word for word in ws.split(tok_text)]
if words[-1] == u"\n":
    words = words[:-1]
ws = [w for w in words if w not in stopwords]
print(words)
print(ws)
Successfully Completed: It does give out the original tokenized text as well as the one without stopwords
['日本語', 'の', '自然', '言語', '処理', 'は', '本当に', 'しんどい', '、', 'と', '彼', 'は', '十', '回', '言っ', 'た', '。']
['日本語', '自然', '言語', '処理', '本当に', 'しんどい', '、', '十', '回', '言っ', '。']
There are still 2 issues I am facing though:
a) Is it possible to use both stopword lists, namely iso_file and sloth_file, so that a word is removed if it is a stopword in either of them? (I tried to change line 14 to
stopwords = [line.decode("utf-8").strip() for line in zip('iso_file','sloth_file')]
but received an error because tuple attributes cannot be decoded.)
b) The ultimate goal would be to generate a new text file in which all stopwords are removed.
I had created this MWE
### first clean twitter csv
import pandas as pd
import re
import emoji
df = pd.read_csv("input.csv")
def cleaner(tweet):
    tweet = re.sub(r"#[^\s]+", "", tweet)  # Remove #username
    tweet = re.sub(r"(?:\#|http?\://|https?\://|www)\S+|\\n", "", tweet)  # Remove http links & \n
    tweet = " ".join(tweet.split())
    tweet = ''.join(c for c in tweet if c not in emoji.UNICODE_EMOJI)  # Remove Emojis
    tweet = tweet.replace("#", "").replace("_", " ")  # Remove hashtag sign but keep the text
    return tweet
df['text'] = df['text'].map(lambda x: cleaner(x))
df['text'].to_csv(r'cleaned.txt', header=None, index=None, sep='\t', mode='a')
### remove stopwords
import urllib
from urllib.request import urlopen
import MeCab
import re
# slothlib
slothlib_path = "http://svn.sourceforge.jp/svnroot/slothlib/CSharp/Version1/SlothLib/NLP/Filter/StopWord/word/Japanese.txt"
sloth_file = urllib.request.urlopen(slothlib_path)
#stopwordsiso
iso_path = "https://raw.githubusercontent.com/stopwords-iso/stopwords-ja/master/stopwords-ja.txt"
iso_file = urllib.request.urlopen(iso_path)
stopwords = [line.decode("utf-8").strip() for line in iso_file]
stopwords = [ss for ss in stopwords if not ss==u'']
stopwords = list(set(stopwords))
with open("cleaned.txt",encoding='utf8') as f:
cleanedlist = f.readlines()
cleanedlist = list(set(cleanedlist))
tagger = MeCab.Tagger("-Owakati")
tok_text = tagger.parse(cleanedlist)
ws = re.compile(" ")
words = [word for word in ws.split(tok_text)]
if words[-1] == u"\n":
    words = words[:-1]
ws = [w for w in words if w not in stopwords]
print(words)
print(ws)
While it works for the simple input text in the first MWE, for the MWE I just stated I get the error
in method 'Tagger_parse', argument 2 of type 'char const *'
Additional information:
Wrong number or type of arguments for overloaded function 'Tagger_parse'.
Possible C/C++ prototypes are:
MeCab::Tagger::parse(MeCab::Lattice *) const
MeCab::Tagger::parse(char const *)
for this line: tok_text = tagger.parse(cleanedlist)
So I assume I will need to make amendments to the cleanedlist?
I have uploaded the cleaned.txt on github for reproducing the issue:
[txt on github][1]
Also: how would I be able to get the tokenized list that excludes stopwords back into a text format like cleaned.txt? Would it be possible, for this purpose, to create a df of ws?
Or might there even be a more simple way?
Sorry for the long request, I tried a lot and tried to make it as easy as possible to understand what I'm driving at :-)
Thank you very much!
[1]: https://gist.github.com/yin-ori/1756f6236944e458fdbc4a4aa8f85a2c
It sounds like you want to:
1. combine two lists of stopwords
2. save text that has had stopwords removed
For problem 1, if you have two lists you can make them into one list with full_list = list1 + list2. You can then make them into a set after that.
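A short sketch of that combination, assuming sloth_file and iso_file are the urlopen responses created above:
# Decode both downloads, drop empty lines, and merge into one stopword set.
sloth_words = [line.decode("utf-8").strip() for line in sloth_file]
iso_words = [line.decode("utf-8").strip() for line in iso_file]
stopwords = set(w for w in sloth_words + iso_words if w)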
The reason you are getting the MeCab error is probably that you are passing a list to parse, which expects a string. (What MeCab wrapper are you using? I have never seen that particular error.) As a note, you should pass each individual tweet to MeCab, instead of the combined text of all tweets, something like:
tokenized = [tagger.parse(tweet) for tweet in cleanedlist]
That should resolve your problem.
Saving text with stopwords removed is just the same as any text file.
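For example, something along these lines (a sketch, assuming tokenized is the per-tweet list from the snippet above and stopwords is the combined set; the output filename is made up):
# Write one stopword-filtered tweet per line.
with open("cleaned_no_stopwords.txt", "w", encoding="utf-8") as f:
    for tweet in tokenized:
        kept = [w for w in tweet.split() if w and w not in stopwords]
        f.write(" ".join(kept) + "\n")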
As a separate point...
Stopword lists are not very useful in Japanese because if you're using something like MeCab you already have part of speech information. So you should use that instead to throw out verb endings, function words, and so on.
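For instance, a rough sketch of POS-based filtering with MeCab's parseToNode (the exact POS labels depend on the dictionary in use, so treat this as illustrative):
import MeCab

tagger = MeCab.Tagger()

def content_words(text):
    # Keep only nouns, verbs and adjectives based on the first POS field.
    node = tagger.parseToNode(text)
    words = []
    while node:
        pos = node.feature.split(",")[0]
        if pos in ("名詞", "動詞", "形容詞"):
            words.append(node.surface)
        node = node.next
    return words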
Also removing stopwords is probably actively unhelpful if you're using any modern NLP methods, see the spaCy preprocessing FAQ.

how to iterate pandas frame from csv and apply text summarisation

I'm working on extractive text summarization of long text data. I have multiple users' text data in the input csv file, but the current code appends all of the text column data into sentences and then applies the logic. How do I apply the code to each row instead of merging all the column values? Any help will be appreciated.
Input.csv (^ delimited)
uid^name^text
36d73f013aa7^Don Howard^The Irvine Foundation has entered into a partnership with College Futures Foundation that starts a new chapter in our support of postsecondary success in California.To achieve Irvine’s singular goal.
36d73f013aa8^Simon Haris^That’s why we have long provided funding to expand postsecondary success. Now with our focus on low-wage workers, we have decided to split our postsecondary funding into two parts:. Strengthening and expanding work-ready credentialing programs (which we will do directly, primarily as part of our Better Careers initiative).
36d73f013aa8^David^Accelerating and expanding the attainment of bachelor’s degrees (which we will fund through our partnership with College Futures). We believe that College Futures is in a stronger position than we are to make grants to support improvements in how the CSUs and the California Community Colleges can better serve students.
pseudo code
Loop each record
    apply below logic to text column to get summary
Code : Text Summarization Code
import numpy as np
import pandas as pd
import nltk
from nltk.tokenize import sent_tokenize
nltk.download('punkt') # one time execution
from nltk.corpus import stopwords
import re
# Read the CSV file
import io
df = pd.read_csv('/home/sshuser/textsummerisation/input.csv',sep='^')
# split the text in the articles into sentences
sentences = []
for s in df['text']:
    sentences.append(sent_tokenize(s))
# flatten the list
sentences = [y for x in sentences for y in x]
# remove punctuations, numbers and special characters
clean_sentences = pd.Series(sentences).str.replace("[^a-zA-Z]", " ")
# make alphabets lowercase
clean_sentences = [s.lower() for s in clean_sentences]
nltk.download('stopwords')# one time execution
stop_words = stopwords.words('english')
def remove_stopwords(sen):
    sen_new = " ".join([i for i in sen if i not in stop_words])
    return sen_new
# remove stopwords from the sentences
clean_sentences = [remove_stopwords(r.split()) for r in clean_sentences]
# Extract word vectors
word_embeddings = {}
fopen = open('/home/sshuser/textsummerisation/glove.6B.100d.txt', encoding='utf-8')
for line in fopen:
    values = line.split()
    word = values[0]
    print(values)
    print(word)
    coefs = np.asarray(values[1:], dtype='float32')
    word_embeddings[word] = coefs
fopen.close()
sentence_vectors = []
for i in clean_sentences:
    if len(i) != 0:
        v = sum([word_embeddings.get(w, np.zeros((100,))) for w in i.split()]) / (len(i.split()) + 0.001)
    else:
        v = np.zeros((100,))
    sentence_vectors.append(v)
len(sentence_vectors)
# similarity matrix
from sklearn.metrics.pairwise import cosine_similarity  # needed for the loop below
sim_mat = np.zeros([len(sentences), len(sentences)])
for i in range(len(sentences)):
    for j in range(len(sentences)):
        if i != j:
            sim_mat[i][j] = cosine_similarity(sentence_vectors[i].reshape(1,100), sentence_vectors[j].reshape(1,100))[0,0]
import networkx as nx
nx_graph = nx.from_numpy_array(sim_mat)
scores = nx.pagerank(nx_graph)
ranked_sentences = sorted(((scores[i],s) for i,s in enumerate(sentences)), reverse=True)
# Specify number of sentences to form the summary
sn = 10
# Generate summary
for i in range(sn):
    print(ranked_sentences[i][1])
Expected output: the summary produced by the above code should appear in a summary column for each record:
uid^name^text^summary
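A minimal sketch of the per-row pattern the pseudo code describes: wrap the summarisation steps in a function that takes one row's text, then apply it per row. The summarize_text name and the output filename are hypothetical, and the ranking step is left as a placeholder for the GloVe/PageRank logic above:
import pandas as pd
from nltk.tokenize import sent_tokenize

def summarize_text(text, sn=10):
    # The cleaning, sentence-vector and PageRank steps from the code above
    # would go here, operating only on this one row's sentences.
    sentences = sent_tokenize(text)
    ranked = sentences  # placeholder: replace with sentences sorted by score
    return " ".join(ranked[:sn])

df = pd.read_csv('/home/sshuser/textsummerisation/input.csv', sep='^')
df['summary'] = df['text'].apply(summarize_text)
df.to_csv('output.csv', sep='^', index=False)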

python remove punctuation email spam

I'm trying to remove punctuation from a list of words. I'm new to Python programming, so if someone could help that would be great. This is to be used for email spam classification. Previously I joined the words after checking whether punctuation was present, but this gave me single characters rather than whole words. After changing it to get whole words, this is what I have below; I'm now trying to remove the punctuation, since the previous approach no longer works the same way.
import os
import string
from collections import Counter
from os import listdir # return all files and folders in the directory
import nltk
from wordcloud import WordCloud
import matplotlib.pyplot as plt
import pandas as pd
from nltk import WordNetLemmatizer
from nltk.corpus import stopwords
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
# used for importing the lingspam dataset
def importLingspamDataset(dir):
    allEmails = []  # for storing the emails once read
    fileNames = []
    for file in listdir(dir):
        f = open((dir + '/' + file), "r")  # used for opening the file in read only format
        fileNames.append(file)
        allEmails.append(f.read())  # appends the read emails to the emails array
        f.close()
    return allEmails, fileNames

def importEnronDataset(dir):
    allEmails = []  # for storing the emails once read
    fileNames = []
    for file in listdir(dir):
        f = open((dir + '/' + file), "r")  # used for opening the file in read only format
        fileNames.append(file)
        allEmails.append(f.read())  # appends the read emails to the emails array
        f.close()
    return allEmails, fileNames
# used to remove punctuation from the emails as this is of no use for detecting spam
def removePunctuation(cleanedEmails):
    punc = set(string.punctuation)
    for word, line in enumerate(cleanedEmails):
        words = line.split()
        x = [''.join(c for c in words if c not in string.punctuation)]
        allWords = []
        allWords += x
    return allWords
# used to remove stopwords i.e. words of no use in detecting spam
def removeStopwords(cleanedEmails):
    removeWords = set(stopwords.words('english'))  # sets all the stopwords to be removed
    for stopw in removeWords:  # for each word in remove words
        if stopw not in removeWords:  # if the word is not in the stopwords to be removed
            cleanedEmails.append(stopw)  # add this word to the cleaned emails
    return cleanedEmails

# function to return words to their root form - allows simplicity
def lemmatizeEmails(cleanedEmails):
    lemma = WordNetLemmatizer()  # to be used for returning each word to its root form
    lemmaEmails = [lemma.lemmatize(i) for i in cleanedEmails]  # lemmatize each word in the cleaned emails
    return lemmaEmails

# function to allow a systematic process of eliminating the undesired elements within the emails
def cleanAllEmails(cleanedEmails):
    cleanPunc = removePunctuation(cleanedEmails)
    cleanStop = removeStopwords(cleanPunc)
    cleanLemma = lemmatizeEmails(cleanStop)
    return cleanLemma
def createDictionary(email):
    allWords = []
    allWords.extend(email)
    dictionary = Counter(allWords)
    dictionary.most_common(3000)
    word_cloud = WordCloud(width=400, height=400, background_color='white',
                           min_font_size=12).generate_from_frequencies(dictionary)
    plt.imshow(word_cloud)
    plt.axis("off")
    plt.margins(x=0, y=0)
    plt.show()
    word_cloud.to_file('test1.png')

def featureExtraction(email):
    emailFiles = []
    emailFiles.extend(email)
    featureMatrix = np.zeros((len(emailFiles), 3000))

def classifyLingspamDataset(email):
    classifications = []
    for name in email:
        classifications.append("spmsg" in name)
    return classifications
# Lingspam dataset
trainingDataLingspam, trainingLingspamFilename = importLingspamDataset("spam-non-spam-dataset/train-mails") # extract the training emails from the dataset
#testingDataLingspam, testingLingspamFilename = importLingspamDataset("spam-non-spam-dataset/test-mails") # extract the testing emails from the dataset
trainingDataLingspamClean = cleanAllEmails(trainingDataLingspam)
#testingDataLingspamClean = cleanAllEmails(testingDataLingspam)
#trainClassifyLingspam = classifyLingspamDataset(trainingDataLingspam)
#testClassifyLingspam = classifyLingspamDataset(testingDataLingspam)
trainDictionary = createDictionary(trainingDataLingspamClean)
#createDictionary(testingDataLingspamClean)
#trainingDataEnron, trainingEnronFilename = importEnronDataset("spam-non-spam-dataset-enron/bigEmailDump/training/")
Based on your question, I assume that you have a list of emails and that, for each email, you would like to remove the punctuation marks. This answer was based on the first revision of the code you posted.
import string

def removePunctuation(emails):
    # I am using a list comprehension here to iterate over the emails.
    # For each iteration, translate the email to remove the punctuation marks.
    # translate only accepts a translation table as an argument.
    # This is why str.maketrans is used to create the translation table.
    cleaned_emails = [email.translate(str.maketrans('', '', string.punctuation))
                      for email in emails]
    return cleaned_emails

if __name__ == '__main__':
    # Assuming cleanedEmails is a list of emails,
    # I am substituting cleanedEmails with emails.
    # I used cleaned_emails as the result.
    emails = ["This is a, test!", "This is another##! \ntest"]
    cleaned_emails = removePunctuation(emails)
    print(cleaned_emails)
input: ["This is a, test!", "This is another##! \ntest"]
output: ['This is a test', 'This is another \ntest']
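If whole words (rather than cleaned strings) are wanted afterwards, a simple split of each cleaned email gives a word list per email; a small sketch, assuming cleaned_emails comes from the function above:
word_lists = [email.split() for email in cleaned_emails]
# [['This', 'is', 'a', 'test'], ['This', 'is', 'another', 'test']]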
EDIT:
Issue is resolved after having a conversation with the OP. The OP was having an issue with WordCloud and the solution I provided is working. I managed to guide the OP through getting WordCloud working, and they are now fine-tuning the results of the WordCloud.

TypeError: string indices must be integers (Text Data Preprocessing in CSV files for Sentiment Analysis)

I'm kind of new to programming and NLP in general. I've found some code on this website (https://towardsdatascience.com/creating-the-twitter-sentiment-analysis-program-in-python-with-naive-bayes-classification-672e5589a7ed) to use for sentiment analysis on Twitter. I have the csv files I need, so instead of building them I just defined the variables from the files.
When I try to run the code, it gives me a TypeError on this line:
preprocessedTrainingSet = tweetProcessor.processTweets(trainingData)
which traces back to the line:
processedTweets.append((self._processTweet(tweet["text"]),tweet["label"]))
I don't know how to circumvent the issue and still keep the core functionality of the code intact.
import pandas as pd
import re
from nltk.tokenize import word_tokenize
from string import punctuation
from nltk.corpus import stopwords
import twitter
import csv
import time
import nltk
nltk.download('stopwords')
testDataSet = pd.read_csv("Twitter data.csv")
print(testDataSet[0:4])
trainingData = pd.read_csv("full-corpus.csv")
print(trainingData[0:4])
class PreProcessTweets:
    def __init__(self):
        self._stopwords = set(stopwords.words('english') + list(punctuation) + ['AT_USER','URL'])

    def processTweets(self, list_of_tweets):
        processedTweets = []
        for tweet in list_of_tweets:
            processedTweets.append((self._processTweet(tweet["text"]), tweet["label"]))
        return processedTweets

    def _processTweet(self, tweet):
        tweet = tweet.lower()  # convert text to lower-case
        tweet = re.sub('((www\.[^\s]+)|(https?://[^\s]+))', 'URL', tweet)  # remove URLs
        tweet = re.sub('#[^\s]+', 'AT_USER', tweet)  # remove usernames
        tweet = re.sub(r'#([^\s]+)', r'\1', tweet)  # remove the # in #hashtag
        tweet = word_tokenize(tweet)  # remove repeated characters (helloooooooo into hello)
        return [word for word in tweet if word not in self._stopwords]
tweetProcessor = PreProcessTweets()
preprocessedTrainingSet = tweetProcessor.processTweets(trainingData)
preprocessedTestSet = tweetProcessor.processTweets(testDataSet)
I expect it to start cleaning the data I've found before I can start using Naive Bayes
It's hard to tell without your actual data, but I think you are mixing up multiple types.
When loading the csv data you are making a pandas dataframe.
Then, in the processTweets method, you are trying to loop through this dataframe like a list.
Finally, in the for loop of processTweets you take each value, which you call 'tweet', and try to access it with the keys 'text' and 'label'. I doubt, however, that you have a dictionary in there: looping over a dataframe like that actually yields its column names as strings.
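A tiny illustration of that diagnosis with a hypothetical two-column frame: iterating a DataFrame directly yields its column names as strings, and indexing a string with "text" raises exactly this TypeError.
import pandas as pd

df = pd.DataFrame({"text": ["hello world"], "label": ["pos"]})
for tweet in df:
    print(tweet)       # prints 'text', then 'label' (the column names)
    # tweet["text"]    # TypeError: string indices must be integers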
I downloaded some tweets from this site.
With this data, I tested your code and made the following adjustments.
import pandas as pd
import re
from nltk.tokenize import word_tokenize
from string import punctuation
from nltk.corpus import stopwords
import nltk
#had to install 'punkt'
nltk.download('punkt')
nltk.download('stopwords')
testDataSet = pd.read_csv("data.csv")
# For testing if the code works I only used a TestDatasSet, and no trainingData.
class PreProcessTweets:
    def __init__(self):
        self._stopwords = set(stopwords.words('english') + list(punctuation) + ['AT_USER','URL'])

    # To make it clear I changed the parameter to df_of_tweets (df = dataframe)
    def processTweets(self, df_of_tweets):
        processedTweets = []
        # turning the dataframe into lists
        # in my data I did not have a label, so I used sentiment instead.
        list_of_tweets = df_of_tweets.text.tolist()
        list_of_sentiment = df_of_tweets.sentiment.tolist()
        # using enumerate to keep track of the index of the tweets so I can use it to index the list of sentiment
        for index, tweet in enumerate(list_of_tweets):
            # adjusted the code here so that it takes values of the lists straight away.
            processedTweets.append((self._processTweet(tweet), list_of_sentiment[index]))
        return processedTweets

    def _processTweet(self, tweet):
        tweet = tweet.lower()  # convert text to lower-case
        tweet = re.sub('((www\.[^\s]+)|(https?://[^\s]+))', 'URL', tweet)  # remove URLs
        tweet = re.sub('#[^\s]+', 'AT_USER', tweet)  # remove usernames
        tweet = re.sub(r'#([^\s]+)', r'\1', tweet)  # remove the # in #hashtag
        tweet = word_tokenize(tweet)  # remove repeated characters (helloooooooo into hello)
        return [word for word in tweet if word not in self._stopwords]
tweetProcessor = PreProcessTweets()
preprocessedTestSet = tweetProcessor.processTweets(testDataSet)
tweetProcessor = PreProcessTweets()
print(preprocessedTestSet)
Hope it helps!

How to filter out non-English data from csv using pandas

I'm currently writing code to extract frequently used words from my csv file, and it works just fine until it produces a bar plot of strange words. I don't know why, probably because there are some foreign words involved. However, I don't know how to fix this.
import numpy as np
import pandas as pd
from sklearn import preprocessing
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split, KFold
from nltk.corpus import stopwords
from nltk.stem.snowball import SnowballStemmer
import matplotlib
from matplotlib import pyplot as plt
import sys
sys.setrecursionlimit(100000)
# import seaborn as sns
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
data = pd.read_csv("C:\\Users\\Administrator\\Desktop\\nlp_dataset\\commitment.csv", encoding='cp1252',na_values=" NaN")
data.shape
data['text'] = data.fillna({'text':'none'})
def remove_punctuation(text):
    '''a function for removing punctuation'''
    import string
    # replacing the punctuations with no space,
    # which in effect deletes the punctuation marks
    translator = str.maketrans('', '', string.punctuation)
    # return the text stripped of punctuation marks
    return text.translate(translator)

# Apply the function to each example
data['text'] = data['text'].apply(remove_punctuation)
data.head(10)
#Removing stopwords -- extract the stopwords
#extracting the stopwords from nltk library
sw= stopwords.words('english')
#displaying the stopwords
np.array(sw)
# function to remove stopwords
def stopwords(text):
    '''a function for removing stopwords'''
    # removing the stop words and lowercasing the selected words
    text = [word.lower() for word in text.split() if word.lower() not in sw]
    # joining the list of words with space separator
    return " ".join(text)

# Apply the function to each example
data['text'] = data['text'].apply(stopwords)
data.head(10)
# Top words before stemming
# create a count vectorizer object
count_vectorizer = CountVectorizer()
# fit the count vectorizer using the text data
count_vectorizer.fit(data['text'])
# collect the vocabulary items used in the vectorizer
dictionary = count_vectorizer.vocabulary_.items()
# store the vocab and counts in lists
vocab = []
count = []
# iterate through each vocab item and append the key and value to the designated lists
for key, value in dictionary:
    vocab.append(key)
    count.append(value)
# store the count in a pandas Series with vocab as index
vocab_bef_stem = pd.Series(count, index=vocab)
#sort the dataframe
vocab_bef_stem = vocab_bef_stem.sort_values(ascending = False)
# Bar plot of top words before stemming
top_vocab = vocab_bef_stem.head(20)
top_vocab.plot(kind = 'barh', figsize=(5,10), xlim = (1000, 5000))
I want a list of frequent words ordered in a bar plot, but for now it just gives non-English words, all with the same frequency. Please help me out.
The problem is that you are not sorting your vocabulary by counts, but by the unique IDs that the count vectorizer assigns.
count_vectorizer.vocabulary_.items()
This doesn't contain the count of each feature; count_vectorizer doesn't save the count of each feature at all.
Hence, you end up seeing the rarest/misspelled words from your corpus in the plot, since these can end up with the larger unique-ID values. The way to get the counts of the words is to apply transform on your text data and sum the counts of each word over all documents.
By default, the vectorizer removes punctuation, and you can also feed it a list of stop words to remove. Your code can be reduced as follows.
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document ?',
]
sw = stopwords.words('english')
count_vectorizer = CountVectorizer(stop_words=sw)
X = count_vectorizer.fit_transform(corpus)
vocab = pd.Series(X.toarray().sum(axis=0), index=count_vectorizer.get_feature_names())
vocab.sort_values(ascending=False).plot.bar(figsize=(5,5), xlim=(0, 7))
Instead of corpus, plug in your text data column. The output of the above snippet is a bar plot of the word counts, sorted in descending order.
