I'm new to Python and NLTK. I'm trying to prepare text for tokenization using NLTK in Python after I import the text from a csv. There's only one column in the file with free text. I want to isolate that specific column, which I did.... I think.
import spacy
import pandas as pd
import numpy as np
import nltk
from nltk.tokenize import word_tokenize
import re
import unicodedata
pd.set_option('display.max_colwidth',50)
oiw = pd.read_csv(r'C:\Users\tgray\Documents\PythonScripts\Worksheets.csv')
text = oiw.drop(oiw.columns[[1,2,3]],axis=1)
for row in text:
    for text['value'] in row:
        tokens = word_tokenize(row)
print(tokens)
When I run the code, the output it gives me is ['values'] which is the column name. How do I get the rest of the rows to show up in the output?
Sample data I have in the 'values' column:
The way was way too easy to order online.
Everything is great.
It's too easy for me to break.
The output I'm hoping to receive is:
['The','way','was','too','easy','to','order','online','Everything','is','great','It''s','for','me','break']
The correction you need to make is in the loop:
oiw = pd.read_csv(r'C:\Users\tgray\Documents\PythonScripts\Worksheets.csv')
text = oiw.drop(columns=oiw.columns[[1, 2, 3]]) # drop the columns at positions 1, 2 and 3, as in your code
for row in text['values']: # select the column by its name, 'values', and iterate over its cells
    tokens = word_tokenize(row)
    print(tokens) # prints the tokens of each row
print(tokens) # prints the tokens of the last row only
Hence you will be iterating over the correct column of the dataframe.
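If the goal is the single flat list of unique words shown in the question, the per-row tokens still have to be collected. Here is a minimal sketch of that step, assuming the three sample rows above. simple_tokenize is a hypothetical regex stand-in so the snippet runs without NLTK's tokenizer data; word_tokenize can be swapped in, but it splits "It's" into two tokens and keeps the periods.

```python
import re
import pandas as pd

# Sample rows from the question; the column name 'values' is taken from the post.
text = pd.DataFrame({"values": [
    "The way was way too easy to order online.",
    "Everything is great.",
    "It's too easy for me to break.",
]})

def simple_tokenize(sentence):
    # Hypothetical stand-in for word_tokenize: runs of letters and apostrophes.
    return re.findall(r"[A-Za-z']+", sentence)

unique_tokens = []
seen = set()
for sentence in text["values"].dropna():
    for token in simple_tokenize(sentence):
        if token not in seen:  # keep only the first occurrence, in order
            seen.add(token)
            unique_tokens.append(token)

print(unique_tokens)
```

The set is only there to make the membership test fast; the list keeps the order in which the words first appear.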
Related
I am working on sentiment analysis for a college project. I have an Excel file with a column named "comments" that has 1000 rows. The sentences in these rows contain spelling mistakes, and I need to correct them before the analysis. I don't know how to process this in Python so that I get a column with the corrected sentences.
All the methods I found correct the spelling of a single word, not of a sentence, and not at the column level with hundreds of rows.
You can use SpellChecker from the pyspellchecker package for this:
import pandas as pd
from spellchecker import SpellChecker
spell = SpellChecker()
df = pd.DataFrame(['hooww good mrning playing fotball studyiing hard'], columns = ['text'])
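def spell_check(x):
    # Correct each word of the sentence, then join the words back together
    correct_word = []
    mispelled_word = x.split()
    for word in mispelled_word:
        # correction() may return None in recent pyspellchecker versions; keep the original word then
        correct_word.append(spell.correction(word) or word)
    return ' '.join(correct_word)
df['spell_corrected_sentence'] = df['text'].apply(spell_check)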
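With hundreds or thousands of rows the same misspelled word tends to come up repeatedly, so it can help to cache the per-word result. The sketch below only illustrates that pattern. It does not use pyspellchecker: correct_word is a hypothetical stand-in built on the standard library's difflib and a toy vocabulary, so the snippet runs without extra installs. With pyspellchecker you would put the spell.correction(word) call inside it instead.

```python
import difflib
from functools import lru_cache

import pandas as pd

# Toy vocabulary, purely illustrative; a real spell checker ships its own dictionary.
VOCABULARY = ["how", "good", "morning", "playing", "football", "studying", "hard"]

@lru_cache(maxsize=None)
def correct_word(word):
    # Stand-in corrector: closest vocabulary entry, or the word itself
    # when nothing is similar enough. Results are cached per distinct word.
    matches = difflib.get_close_matches(word, VOCABULARY, n=1, cutoff=0.6)
    return matches[0] if matches else word

def correct_sentence(sentence):
    return " ".join(correct_word(word) for word in sentence.split())

# 'comments' is the column name mentioned in the question.
df = pd.DataFrame({"comments": ["hooww good mrning", "playing fotball studyiing hard"]})
df["corrected"] = df["comments"].apply(correct_sentence)
print(df["corrected"].tolist())
```

Each distinct word is corrected once and then served from the cache, which matters more as the column grows.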
There are two columns in a CSV table: one is summaries and the other is texts. Both columns were lists before I combined them, converted them to a data frame, and saved it as a CSV file. By the way, the texts in the table have already been cleaned (all punctuation marks removed and everything converted to lower case):
I want to loop through each cell in the table, split summaries and texts into words and tokenize each word. How can I do it?
I tried with python CSV reader and df.apply(word_tokenize). I tried also newList=set(summaries+texts), but then I could not tokenize them.
Any solution is welcome, whether it uses the CSV file, a data frame, or a list. Thanks in advance for your help!
note: The real table has more than 50,000 rows.
===some update==
here is the code I have tried.
import pandas as pd
data= pd.read_csv('test.csv')
data.head()
newTry=data.apply(lambda x: " ".join(x), axis=1)
type(newTry)
print (newTry)
import nltk
for sentence in newTry:
    new=sentence.split()
    print(new)
print(set(new))
Please refer to the output in the screenshot. There are duplicate words in the list, and some square brackets. How should I remove them? I tried set, but it gives only the value for one sentence.
You can use the built-in csv package to read the CSV file, and nltk to tokenize the words:
from nltk.tokenize import word_tokenize
import csv
words = []
def get_data():
    with open("sample_csv.csv", "r") as records:
        for record in csv.reader(records):
            yield record
data = get_data()
next(data) # skip header
for row in data:
    for sent in row:
        for word in word_tokenize(sent):
            if word not in words:
                words.append(word)
print(words)
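With more than 50,000 rows, the `if word not in words` check on a list gets slower as the list grows. A possible variation, sketched below, collects the words as dictionary keys, which keeps first-seen order and makes the duplicate check cheap. The in-memory sample and the regex tokenize function are assumptions for illustration; on the real file you would open the CSV as above and call word_tokenize.

```python
import csv
import io
import re

# Hypothetical sample standing in for the real CSV file.
sample_csv = io.StringIO(
    "summaries,texts\n"
    "cat sat,the cat sat on the mat\n"
    "dog ran,the dog ran home\n"
)

def tokenize(cell):
    # Stand-in for word_tokenize on already-cleaned, lowercase text.
    return re.findall(r"\w+", cell)

reader = csv.reader(sample_csv)
next(reader)  # skip the header row

vocabulary = {}  # dict keys are unique and keep insertion order
for row in reader:
    for cell in row:
        for word in tokenize(cell):
            vocabulary.setdefault(word, None)

words = list(vocabulary)
print(words)
```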
I am trying to extract keywords line by line from a csv file and create a keyword field. Right now I am able to get the full extraction. How do I get keywords for each row/field?
Data:
id,some_text
1,"What is the meaning of the word Himalaya?"
2,"Palindrome is a word, phrase, or sequence that reads the same backward as forward"
Code: This searches the entire text, not row by row. Do I need to put something else besides replace(r'\|', ' ')?
import pandas as pd
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
df = pd.read_csv('test-data.csv')
# print(df.head(5))
text_context = df['some_text'].str.lower().str.replace(r'\|', ' ').str.cat(sep=' ') # not put lower case?
print(text_context)
print('')
tokens=nltk.tokenize.word_tokenize(text_context)
word_dist = nltk.FreqDist(tokens)
stop_words = stopwords.words('english')
punctuations = ['(',')',';',':','[',']',',','!','?']
keywords = [word for word in tokens if not word in stop_words and not word in punctuations]
print(keywords)
final output:
id,some_text,new_keyword_field
1,What is the meaning of the word Himalaya?,"meaning,word,himalaya"
2,"Palindrome is a word, phrase, or sequence that reads the same backward as forward","palindrome,word,phrase,sequence,reads,backward,forward"
Here is a clean way to add a new keywords column to your dataframe using pandas apply. Apply works by first defining a function (get_keywords in our case) that we can apply to each row or column.
import pandas as pd
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
# I define the stop_words here so I don't do it every time in the function below
stop_words = stopwords.words('english')
# I've added the index_col='id' here to set your 'id' column as the index. This assumes that the 'id' is unique.
df = pd.read_csv('test-data.csv', index_col='id')
Here we define our function that will be applied to each row using df.apply in the next cell. You can see that this function get_keywords takes a row as its argument and returns a string of comma separated keywords like you have in your desired output above ("meaning,word,himalaya"). Within this function we lower, tokenize, filter out punctuation with isalpha(), filter out our stop_words, and join our keywords together to form the desired output.
# This function will be applied to each row in our Pandas Dataframe
# See the docs for df.apply at:
# https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.apply.html
def get_keywords(row):
    some_text = row['some_text']
    lowered = some_text.lower()
    tokens = nltk.tokenize.word_tokenize(lowered)
    keywords = [keyword for keyword in tokens if keyword.isalpha() and keyword not in stop_words]
    keywords_string = ','.join(keywords)
    return keywords_string
Now that we have defined our function that will be applied we call df.apply(get_keywords, axis=1). This will return a Pandas Series (similar to a list). Since we want this series to be a part of our dataframe we add it as a new column using df['keywords'] = df.apply(get_keywords, axis=1)
# applying the get_keywords function to our dataframe and saving the results
# as a new column in our dataframe called 'keywords'
# axis=1 means that we will apply get_keywords to each row and not each column
df['keywords'] = df.apply(get_keywords, axis=1)
Output:
Dataframe after adding 'keywords' column
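Since the function only reads one column, the same idea can also be written with Series.apply on that column, which avoids axis=1. This is a small sketch under assumptions: the two sample rows from the question, a tiny hand-written stop-word set in place of stopwords.words('english'), and a regex in place of word_tokenize plus isalpha(), so that it runs without downloading NLTK data.

```python
import re
import pandas as pd

# Tiny stop-word set for illustration only; stopwords.words('english') is the real list.
STOP_WORDS = {"what", "is", "the", "of", "a", "or", "that", "same", "as"}

def keywords_from_text(text):
    # Lowercase, keep alphabetic tokens, drop stop words, join with commas.
    tokens = re.findall(r"[a-z]+", text.lower())
    return ",".join(token for token in tokens if token not in STOP_WORDS)

df = pd.DataFrame(
    {"some_text": [
        "What is the meaning of the word Himalaya?",
        "Palindrome is a word, phrase, or sequence that reads the same backward as forward",
    ]},
    index=pd.Index([1, 2], name="id"),
)

# The function receives the cell value directly, not the whole row.
df["keywords"] = df["some_text"].apply(keywords_from_text)
print(df["keywords"].tolist())
```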
I'm loading Excel sheets into Python in order to clean (tokenize, stem et cetera) rows of text. I'm using Pandas to clean each individual line and return a new, cleaned Excel file in the same format as the original. In order for the tokenizer and stemmer to be able to read the Excel file, the Pandas dataframe needs to be in string format.
It more or less works, but the below code splits the text in each row by individual words, resulting in each row only containing one (cleaned) word and not a sentence like the original file. How can I make sure it doesn't split each row of text?
(simplified) code below:
import pandas as pd
from nltk.corpus import stopwords
from nltk.tokenize import RegexpTokenizer
from nltk.stem import SnowballStemmer
tokenizer = RegexpTokenizer(r'\w+')
stemmer = SnowballStemmer('english')
stop_words = set(stopwords.words('english'))
excel = pd.read_excel(open('example.xls', 'rb'))
data_to_string = pd.DataFrame.to_string(excel)
for line in data_to_string:
    tokens = tokenizer.tokenize(data_to_string)
    stopped = [word for word in tokens if not word in stop_words] #removes stop words
    trimmed = [ word for word in stopped if len(word) >= 3 ] #takes out all words of two characters or less.
    stemmed = [stemmer.stem(word) for word in trimmed] #stems the words
return_to_dataframe = pd.DataFrame(stemmed) #resets back to pandas dataframe
I've thought about using this, but it doesn't work:
data_to_string = excel.astype(str).apply(' '.join, axis=1)
Edit: Maarten asked if I could upload an image of what my current and desired output would be. The format of the original input file (uncleaned) is on the left. The middle is the desired outcome (stemmed and stop words removed etc.), and the right image is the current output.
EDIT: I managed to solve it; the main problem was with the tokenization. First, I had to convert the pandas dataframe to a list of lists (see strdata in the code below), and then tokenize each item in each list. The rest was solved with a simple for loop, appending the cleaned rows to a list and converting the list back to a pandas dataframe. The remove_NaN step is there because pandas saw each None-type element as a string of alphanumeric characters (namely the word "None") instead of an empty cell, so this string had to be removed. Also, pandas put each tokenized word into a separate column; mergeddf merges all words back into the same column.
The working code looks like this:
from nltk.corpus import stopwords
from nltk.tokenize import RegexpTokenizer
from nltk.stem import PorterStemmer
import pandas as pd
import numpy as np
#load tokenizer, stemmer and stop words
tokenizer = RegexpTokenizer(r'\w+')
stemmer = PorterStemmer()
stop_words = set(stopwords.words('english'))
excel = pd.read_excel(open(inFilePath, 'rb')) #use pandas to read excel file
strdata = excel.values.tolist() #convert values to list of lists (each row becomes a separate list)
tokens = [tokenizer.tokenize(str(i)) for i in strdata] #tokenize words in lists
cleaned_list = []
for m in tokens:
    stopped = [i for i in m if str(i).lower() not in stop_words] #remove stop words
    stemmed = [stemmer.stem(i) for i in stopped] #stem words
    cleaned_list.append(stemmed) #append stemmed words to list
backtodf = pd.DataFrame(cleaned_list) #convert list back to pandas dataframe
remove_NaN = backtodf.replace(np.nan, '', regex=True) #remove None (which return as words (str))
mergeddf = remove_NaN.astype(str).apply(lambda x: ' '.join(x), axis=1) #convert cells to strings, merge columns
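A shorter route that keeps one cleaned sentence per row is to write a function that cleans a single cell and apply it to the text column. The sketch below is only an outline under assumptions: the column name text, the sample rows, and the small stop-word set are made up, and stemming is left as an optional stem argument (for example SnowballStemmer('english').stem) so the snippet runs without NLTK.

```python
import re
import pandas as pd

# Tiny stand-in for set(stopwords.words('english')).
STOP_WORDS = {"the", "is", "a", "and", "of", "to"}

def clean_text(text, stem=lambda word: word):
    # Tokenize one cell, drop stop words and words shorter than 3 characters,
    # optionally stem, and join back into a single string.
    tokens = re.findall(r"\w+", str(text).lower())
    kept = [t for t in tokens if t not in STOP_WORDS and len(t) >= 3]
    return " ".join(stem(t) for t in kept)

# Hypothetical frame; with a real file this would come from pd.read_excel.
frame = pd.DataFrame({"text": ["The cat is on the mat", None, "Dogs and cats go to school"]})

# Fill empty cells first so they do not turn into the word "None" or "nan".
cleaned = frame["text"].fillna("").apply(clean_text)
print(cleaned.tolist())
```

Because the function returns a string per cell, the result has the same number of rows as the input and can be written back with to_excel.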
I have a pandas dataframe and I am trying to tokenize the contents of each row.
import pandas as pd
import nltk as nk
from nltk import word_tokenize
TextData = pd.read_csv('TextData.csv')
TextData['tokenized_summary'] = TextData.apply(lambda row: nk.word_tokenize(row['Summary']), axis=1)
When I run it, I get an error at line 67,
TypeError: ('expected string or buffer', u'occurred at index 67')
Which I think I am getting because the value for 'Summary' at iloc[67] is an NA value.
TextData.Summary.iloc[67]
Out[45]: nan
Assuming it is the na value which is causing this, is there a way to tell word_tokenize or pandas to ignore the NA values whenever it comes across them?
Else, what else might be causing this?
You can use fillna() to replace NaN with a specified value:
import pandas as pd
import nltk as nk
from nltk import word_tokenize
TextData = pd.read_csv('TextData.csv')
TextData['Summary'] = TextData['Summary'].fillna('some value') # fillna returns a new object, so assign the result back
TextData['tokenized_summary'] = TextData.apply(lambda row: nk.word_tokenize(row['Summary']), axis=1)
Previous Answer
You can simply "eliminate" the records where that value is null (filter on 'Summary', since 'tokenized_summary' does not exist yet at this point):
TextData = TextData[TextData['Summary'].notnull()]
Making the final product look like:
import pandas as pd
import nltk as nk
from nltk import word_tokenize
TextData = pd.read_csv('TextData.csv')
TextData = TextData[TextData['Summary'].notnull()] # keep only the rows that have a summary
TextData['tokenized_summary'] = TextData.apply(lambda row: nk.word_tokenize(row['Summary']), axis=1)
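A third option, if you want to keep every row and not invent a placeholder value, is to have the tokenizing function return an empty list for anything that is not a string. A minimal sketch, assuming a small made-up frame and a hypothetical regex tokenizer in place of word_tokenize so that it runs without NLTK's data:

```python
import re
import pandas as pd

def safe_tokenize(value):
    # NaN is a float, so anything that is not a string gets an empty token list
    # instead of raising a TypeError inside the tokenizer.
    if not isinstance(value, str):
        return []
    return re.findall(r"\w+|[^\w\s]", value)  # stand-in for word_tokenize

# Made-up data with one missing summary.
text_data = pd.DataFrame({"Summary": ["Good product.", float("nan"), "Works well"]})
text_data["tokenized_summary"] = text_data["Summary"].apply(safe_tokenize)
print(text_data["tokenized_summary"].tolist())
```

This keeps the row count unchanged, so the new column still lines up with the rest of the frame.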