I'm loading Excel sheets into Python in order to clean (tokenize, stem et cetera) rows of text. I'm using Pandas to clean each individual line and return a new, cleaned Excel file in the same format as the original. In order for the tokenizer and stemmer to be able to read the Excel file, the Pandas dataframe needs to be in string format.
It more or less works, but the below code splits the text in each row by individual words, resulting in each row only containing one (cleaned) word and not a sentence like the original file. How can I make sure it doesn't split each row of text?
(simplified) code below:
import pandas as pd
from nltk.corpus import stopwords
from nltk.tokenize import RegexpTokenizer
from nltk.stem import SnowballStemmer
tokenizer = RegexpTokenizer(r'\w+')
stemmer = SnowballStemmer('english')
stop_words = set(stopwords.words('english'))
excel = pd.read_excel(open('example.xls', 'rb'))
data_to_string = pd.DataFrame.to_string(excel)
for line in data_to_string:
tokens = tokenizer.tokenize(data_to_string)
stopped = [word for word in tokens if not word in stop_words] #removes stop words
trimmed = [ word for word in stopped if len(word) >= 3 ] #takes out all words of two characters or less.
stemmed = [stemmer.stem(word) for word in trimmed] #stems the words
return_to_dataframe = pd.DataFrame(stemmed) #resets back to pandas dataframe
I've thought about using this, but it doesn't work:
data_to_string = excel.astype(str).apply(' '.join, axis=1)
Edit: Maarten asked if I could upload an image of what my current and desired output would be. The format of the original input file (uncleaned) is on the left. The middle is the desired outcome (stemmed and stop words removed etc.), and the right image is the current output.
EDIT: I managed to solve it; the main problem was with the tokenization. First, I had to convert the pandas dataframe to a list of lists (see strdata in the code below'), and then tokenize each item in each list. The rest was solved with a simple for loop, appending the cleaned rows back to a list and converting the list back to a pandas dataframe. The remove_NaN is there because pandas saw each None-type element as a string of alphanumeric characters (namely the word "None") instead of an empty cell, so this string had to be removed. Also, pandas put each tokenized word into a separate column. mergeddf is there in order to merge all words back into the same column.
The working code looks like this:
from nltk.corpus import stopwords
from nltk.tokenize import RegexpTokenizer
from nltk.stem import PorterStemmer
import pandas as pd
import numpy as np
#load tokenizer, stemmer and stop words
tokenizer = RegexpTokenizer(r'\w+')
stemmer = PorterStemmer()
stop_words = set(stopwords.words('english'))
excel = pd.read_excel(open(inFilePath, 'rb')) #use pandas to read excel file
strdata = excel.values.tolist() #convert values to list of lists (each row becomes a separate list)
tokens = [tokenizer.tokenize(str(i)) for i in strdata] #tokenize words in lists
cleaned_list = []
for m in tokens:
stopped = [i for i in m if str(i).lower() not in stop_words] #remove stop words
stemmed = [stemmer.stem(i) for i in stopped] #stem words
cleaned_list.append(stemmed) #append stemmed words to list
backtodf = pd.DataFrame(cleaned_list) #convert list back to pandas dataframe
remove_NaN = backtodf.replace(np.nan, '', regex=True) #remove None (which return as words (str))
mergeddf = remove_NaN.astype(str).apply(lambda x: ' '.join(x), axis=1) #convert cells to strings, merge columns
Related
I'm new to Python and NLTK. I'm trying to prepare text for tokenization using NLTK in Python after I import the text from a csv. There's only one column in the file with free text. I want to isolate that specific column, which I did.... I think.
import spacy
import pandas as pd
import numpy as np
import nltk
from nltk.tokenize import word_tokenize
import re
import unicodedata
pd.set_option('display.max_colwidth',50)
oiw = pd.read_csv(r'C:\Users\tgray\Documents\PythonScripts\Worksheets.csv')
text = oiw.drop(oiw.columns[[1,2,3]],axis=1)
for row in text:
for text['value'] in row:
tokens = word_tokenize(row)
print(tokens)
When I run the code, the output it gives me is ['values'] which is the column name. How do I get the rest of the rows to show up in the output?
Sample data I have in the 'values' column:
The way was way too easy to order online.
Everything is great.
It's too easy for me to break.
The output I'm hoping to receive is:
['The','way','was','too','easy','to','order','online','Everything','is','great','It''s','for','me','break']
The correction you need to be made is in the segment.
oiw = pd.read_csv(r'C:\Users\tgray\Documents\PythonScripts\Worksheets.csv')
text = oiw.drop(columns=[1,2,3]) # correctly dropping columns named 1 2 and 3
for row in text['value']: # Correctly selecting the column
tokens = word_tokenize(row)
print(tokens) # Will print tokens in each row
print(tokens) # Will print the tokens of the last row
Hence you will be iterating over the correct column of the dataframe.
This is one of my first times using Python so apologies if this is a silly question. I have a 3 column CSV, first column called comment (which is the column that I manipulated into bigrams), second column which is called comment type, and third column which is called comment date. I am satisfied with my current output from this code, where I split column 1 (Comment) into bigrams, counted the frequencies, and exported into a CSV file. But now I also want to add the comment type and comment date columns from my original csv (without making any changes to them) to my exported CSV next to word and frequency columns. I'm not to sure how to go about that and tested some ideas but didn't work.
import csv
import string
import re
from nltk.util import everygrams
import pandas as pd
from collections import Counter
from itertools import combinations
df = pd.read_csv('modified.csv', 'r', encoding="utf8",
names=['comment'])
top_N = 1000
stopwords = nltk.corpus.stopwords.words('english')
RE_stopwords = r'\b(?:{})\b'.format('|'.join(stopwords))
txt = df.comment.str.lower().str.replace(r'\|', ' ').str.cat(sep=' ')
words = nltk.tokenize.word_tokenize(txt)
words = [w for w in words if not w in RE_stopwords]
bigrm = list(nltk.bigrams(words))
word_dist = nltk.FreqDist([' '.join(x) for x in bigrm])
rslt = pd.DataFrame(word_dist.most_common(top_N),
columns=['Word', 'Frequency'])
print(rslt)
rslt.to_csv('bigram3.csv')
The added lines at the end, create a new column in your rslt dataframe, and copy the data from your original dataframe to this one.
import csv
import string
import re
from nltk.util import everygrams
import pandas as pd
from collections import Counter
from itertools import combinations
df = pd.read_csv('modified.csv', 'r', encoding="utf8",
names=['comment'])
top_N = 1000
stopwords = nltk.corpus.stopwords.words('english')
RE_stopwords = r'\b(?:{})\b'.format('|'.join(stopwords))
txt = df.comment.str.lower().str.replace(r'\|', ' ').str.cat(sep=' ')
words = nltk.tokenize.word_tokenize(txt)
words = [w for w in words if not w in RE_stopwords]
bigrm = list(nltk.bigrams(words))
word_dist = nltk.FreqDist([' '.join(x) for x in bigrm])
rslt = pd.DataFrame(word_dist.most_common(top_N),
columns=['Word', 'Frequency'])
rslt['Column_Type'] = df['comment type']
rslt['Column_Date'] = df['comment date']
print(rslt)
rslt.to_csv('bigram3.csv')
Here is the CSV tableThere are two columns in a CSV table. One is summaries and the other one is texts. Both columns were typeOfList before I combined them together, converted to data frame and saved as a CSV file. BTW, the texts in the table have already been cleaned (removed all marks and converted to lower cases):
I want to loop through each cell in the table, split summaries and texts into words and tokenize each word. How can I do it?
I tried with python CSV reader and df.apply(word_tokenize). I tried also newList=set(summaries+texts), but then I could not tokenize them.
Any solutions to solve the problem, no matter of using CSV file, data frame or list. Thanks for your help in advance!
note: The real table has more than 50,000 rows.
===some update==
here is the code I have tried.
import pandas as pd
data= pd.read_csv('test.csv')
data.head()
newTry=data.apply(lambda x: " ".join(x), axis=1)
type(newTry)
print (newTry)
import nltk
for sentence in newTry:
new=sentence.split()
print(new)
print(set(new))
enter image description here
Please refer to the output in the screenshot. There are duplicate words in the list, and some square bracket. How should I removed them? I tried with set, but it gives only one sentence value.
You can use built-in csv pacakge to read csv file. And nltk to tokenize words:
from nltk.tokenize import word_tokenize
import csv
words = []
def get_data():
with open("sample_csv.csv", "r") as records:
for record in csv.reader(records):
yield record
data = get_data()
next(data) # skip header
for row in data:
for sent in row:
for word in word_tokenize(sent):
if word not in words:
words.append(word)
print(words)
I have a csv with msg column and it has the following text
muchloveandhugs
dudeseriously
onemorepersonforthewin
havefreebiewoohoothankgod
thisismybestcategory
yupbabe
didfreebee
heykidforget
hecomplainsaboutit
I know that nltk.corpus.words has a bunch of sensible words. My problem is how do I iterate it over the df[‘msg’] column so that I can get words such as
df[‘msg’]
much love and hugs
dude seriously
one more person for the win
From this question about splitting words in strings with no spaces and not quite knowing what your data looks like:
import pandas as pd
import wordninja
filename = 'mycsv.csv' # Put your filename here
df = pd.read_csv(filename)
for wordstring in df['msg']:
split = wordninja.split(wordstring)
# Do something with split
I am trying to extract keywords line by line from a csv file and create a keyword field. Right now I am able to get the full extraction. How do I get keywords for each row/field?
Data:
id,some_text
1,"What is the meaning of the word Himalaya?"
2,"Palindrome is a word, phrase, or sequence that reads the same backward as forward"
Code: This is search entire text but not row by row. Do I need to put something else besides replace(r'\|', ' ')?
import pandas as pd
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
df = pd.read_csv('test-data.csv')
# print(df.head(5))
text_context = df['some_text'].str.lower().str.replace(r'\|', ' ').str.cat(sep=' ') # not put lower case?
print(text_context)
print('')
tokens=nltk.tokenize.word_tokenize(text_context)
word_dist = nltk.FreqDist(tokens)
stop_words = stopwords.words('english')
punctuations = ['(',')',';',':','[',']',',','!','?']
keywords = [word for word in tokens if not word in stop_words and not word in punctuations]
print(keywords)
final output:
id,some_text,new_keyword_field
1,What is the meaning of the word Himalaya?,"meaning,word,himalaya"
2,"Palindrome is a word, phrase, or sequence that reads the same backward as forward","palindrome,word,phrase,sequence,reads,backward,forward"
Here is a clean way to add a new keywords column to your dataframe using pandas apply. Apply works by first defining a function (get_keywords in our case) that we can apply to each row or column.
import pandas as pd
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
# I define the stop_words here so I don't do it every time in the function below
stop_words = stopwords.words('english')
# I've added the index_col='id' here to set your 'id' column as the index. This assumes that the 'id' is unique.
df = pd.read_csv('test-data.csv', index_col='id')
Here we define our function that will be applied to each row using df.apply in the next cell. You can see that this function get_keywords takes a row as its argument and returns a string of comma separated keywords like you have in your desired output above ("meaning,word,himalaya"). Within this function we lower, tokenize, filter out punctuation with isalpha(), filter out our stop_words, and join our keywords together to form the desired output.
# This function will be applied to each row in our Pandas Dataframe
# See the docs for df.apply at:
# https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.apply.html
def get_keywords(row):
some_text = row['some_text']
lowered = some_text.lower()
tokens = nltk.tokenize.word_tokenize(lowered)
keywords = [keyword for keyword in tokens if keyword.isalpha() and not keyword in stop_words]
keywords_string = ','.join(keywords)
return keywords_string
Now that we have defined our function that will be applied we call df.apply(get_keywords, axis=1). This will return a Pandas Series (similar to a list). Since we want this series to be a part of our dataframe we add it as a new column using df['keywords'] = df.apply(get_keywords, axis=1)
# applying the get_keywords function to our dataframe and saving the results
# as a new column in our dataframe called 'keywords'
# axis=1 means that we will apply get_keywords to each row and not each column
df['keywords'] = df.apply(get_keywords, axis=1)
Output:
Dataframe after adding 'keywords' column