Get each unique word in a csv file tokenized

There are two columns in a CSV table: one is summaries and the other is texts. Both columns were Python lists before I combined them, converted them to a data frame, and saved the result as a CSV file. BTW, the texts in the table have already been cleaned (all punctuation marks removed and everything converted to lower case).
I want to loop through each cell in the table, split the summaries and texts into words, and tokenize each word. How can I do it?
I tried the Python CSV reader and df.apply(word_tokenize). I also tried newList=set(summaries+texts), but then I could not tokenize them.
Any solution is welcome, whether it uses the CSV file, the data frame, or the lists. Thanks in advance for your help!
Note: the real table has more than 50,000 rows.
=== some update ===
Here is the code I have tried:
import pandas as pd
data= pd.read_csv('test.csv')
data.head()
newTry=data.apply(lambda x: " ".join(x), axis=1)
type(newTry)
print (newTry)
import nltk
for sentence in newTry:
    new=sentence.split()
    print(new)
print(set(new))
Please refer to the output in the screenshot. There are duplicate words in the list, and some square brackets. How should I remove them? I tried with set, but it gives only the words of one sentence.

You can use the built-in csv package to read the csv file, and nltk to tokenize the words:
from nltk.tokenize import word_tokenize
import csv
words = []
def get_data():
    # yield one row (a list of cell strings) at a time
    with open("sample_csv.csv", "r") as records:
        for record in csv.reader(records):
            yield record
data = get_data()
next(data)  # skip header
for row in data:
    for sent in row:
        for word in word_tokenize(sent):
            if word not in words:  # keep only the first occurrence of each word
                words.append(word)
print(words)
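With more than 50,000 rows, checking `word not in words` against a list gets slow, because every check scans the whole list. Here is a minimal sketch of the same idea using a dict for constant-time lookups. It assumes the text is already cleaned (as the question says), so a plain whitespace split stands in for word_tokenize. The sample rows and the name unique_words are made up for illustration.

```python
import csv
import io

# Hypothetical sample standing in for the real file; the cell contents are invented.
SAMPLE = (
    "summaries,texts\n"
    "good dog food,i bought this dog food\n"
    "great taffy,great taffy at a great price\n"
)

def unique_words(csv_file):
    """Return the unique whitespace-separated words in first-seen order."""
    reader = csv.reader(csv_file)
    next(reader, None)  # skip the header row
    seen = {}  # dict keys keep insertion order and give O(1) membership tests
    for row in reader:
        for cell in row:
            for word in cell.split():
                seen.setdefault(word, None)
    return list(seen)

words = unique_words(io.StringIO(SAMPLE))
print(words)
```

For a real file, pass `open("sample_csv.csv", newline="")` instead of the `io.StringIO` object. If the text still contains punctuation, swap `cell.split()` for `word_tokenize(cell)`.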

Related

In Pandas, how can I extract certain value using the key off of a dataframe imported from a csv file?

Using Pandas, I'm trying to extract value using the key but I keep failing to do so. Could you help me with this?
There's a csv file like below:
value
"{""id"":""1234"",""currency"":""USD""}"
"{""id"":""5678"",""currency"":""EUR""}"
I imported this file into pandas and made a DataFrame out of it.
However, when I try to extract the value using a key (e.g. df["id"]), I get an error message.
I'd like to see the value 1234 or 5678 using df["id"]. Which steps should I take to get this done? This may be a very basic question, but I need your help. Thanks.

The csv file is being read in, but not in the shape you expect.
Each line of the file is wrapped in double quotes, so pandas treats the whole line as one field and the commas inside the quotes are not used as delimiters. See the read_csv documentation for more on this. As a result, the pandas dataframe has a single column, value, whose cells each hold an entire line from your file; the first entry is {"id":"1234","currency":"USD"}, stored as a plain string. So the dataframe doesn't have a column id, and you can't select data by id.
The data aren't formatted the way pandas expects, with a header row and columns of data. One option for reading in this data is to manually process each row, though there may be slicker options.
file = 'test.dat'
id_vals = []
currency = []
with open(file, 'r') as f:
    for line in f.readlines()[1:]:  # skip the header line
        ## remove obfuscating characters
        for c in '"{}\n':
            line = line.replace(c, '')
        line = line.split(',')
        ## extract values to two lists
        id_vals.append(line[0][3:])   # drop the leading 'id:'
        currency.append(line[1][9:])  # drop the leading 'currency:'

You just need to clean up the CSV file a little and you are good. Here is every step:
import pandas as pd
# open your csv and read as a text string
with open('My_CSV.csv', 'r') as f:
    my_csv_text = f.read()
# remove problematic strings (plain string replacement, no regex needed)
find_str = ['{', '}', '"', 'id:', 'currency:', 'value']
replace_str = ''
for i in find_str:
    my_csv_text = my_csv_text.replace(i, replace_str)
# Create new csv file and save cleaned text
new_csv_path = './my_new_csv.csv'  # or whatever path and name you want
with open(new_csv_path, 'w') as f:
    f.write(my_csv_text)
# Create pandas dataframe
df = pd.read_csv(new_csv_path, sep=',', names=['ID', 'Currency'])
print(df)
Output df:
     ID Currency
0  1234      USD
1  5678      EUR

You need to parse each cell of the value column using json.loads() or eval().
Something like this:
import json
for value in df["value"]:  # each cell is a JSON string
    print(json.loads(value)["id"])
    # OR (only if you trust the file's contents, since eval runs arbitrary code)
    # print(eval(value)["id"])
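For reference, here is a small self-contained sketch of the json.loads route that uses only the standard library. The two data lines are the ones from the question, inlined so it runs without a file. The names SAMPLE, records and ids are just for illustration.

```python
import csv
import io
import json

# The sample file from the question, inlined so the sketch runs on its own.
SAMPLE = (
    'value\n'
    '"{""id"":""1234"",""currency"":""USD""}"\n'
    '"{""id"":""5678"",""currency"":""EUR""}"\n'
)

# csv un-escapes the doubled quotes, leaving a valid JSON string in each cell.
reader = csv.DictReader(io.StringIO(SAMPLE))
records = [json.loads(row["value"]) for row in reader]

ids = [record["id"] for record in records]
print(ids)  # ['1234', '5678']
```

If you want to stay in pandas, passing a list of dicts like `records` to `pd.DataFrame(...)` should give you a dataframe with id and currency columns, so that df["id"] works.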

Python NLTK Prepare Data from CSV for Tokenization

I'm new to Python and NLTK. I'm trying to prepare text for tokenization using NLTK in Python after I import the text from a csv. There's only one column in the file with free text. I want to isolate that specific column, which I did.... I think.
import spacy
import pandas as pd
import numpy as np
import nltk
from nltk.tokenize import word_tokenize
import re
import unicodedata
pd.set_option('display.max_colwidth',50)
oiw = pd.read_csv(r'C:\Users\tgray\Documents\PythonScripts\Worksheets.csv')
text = oiw.drop(oiw.columns[[1,2,3]],axis=1)
for row in text:
    for text['value'] in row:
        tokens = word_tokenize(row)
        print(tokens)
When I run the code, the output it gives me is ['values'] which is the column name. How do I get the rest of the rows to show up in the output?
Sample data I have in the 'values' column:
The way was way too easy to order online.
Everything is great.
It's too easy for me to break.
The output I'm hoping to receive is:
['The','way','was','too','easy','to','order','online','Everything','is','great','It''s','for','me','break']

The correction you need is in the loop. Iterating over a dataframe directly yields its column names, which is why you only saw the column name.
import pandas as pd
from nltk.tokenize import word_tokenize
oiw = pd.read_csv(r'C:\Users\tgray\Documents\PythonScripts\Worksheets.csv')
text = oiw.drop(columns=oiw.columns[[1, 2, 3]])  # drop the columns at positions 1, 2 and 3
for row in text['value']:  # iterate over the cells of the column, not the column names
    tokens = word_tokenize(row)
    print(tokens)  # prints the tokens of each row
print(tokens)  # prints the tokens of the last row only
This way you iterate over the values in the correct column of the dataframe.
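The desired output in the question is a single list of unique tokens across all rows, not one list per row. A rough sketch of that accumulation step follows, using the three sample sentences from the question. Note that it swaps in a simple regular expression for word_tokenize, which is an assumption on my part: word_tokenize would also emit the full stops as tokens and split "It's" into two tokens, which is not what the desired output shows.

```python
import re

# The sample rows from the question.
rows = [
    "The way was way too easy to order online.",
    "Everything is great.",
    "It's too easy for me to break.",
]

unique_tokens = []
seen = set()
for row in rows:
    # Simple stand-in for word_tokenize: runs of word characters and apostrophes
    for token in re.findall(r"[\w']+", row):
        if token not in seen:
            seen.add(token)
            unique_tokens.append(token)

print(unique_tokens)
```

With a dataframe, the outer loop would become `for row in text['value']:` and the rest stays the same.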

Is there any way in python to auto-correct spelling mistake in multiple rows of an excel files of a single column?

I am working on sentiment analysis for a college project. I have an Excel file with a column named "comments" that has 1000 rows. The sentences in these rows have spelling mistakes, and for the analysis I need to have them corrected. I don't know how to process this so that I get a column with corrected sentences using Python code.
All the methods I found correct the spelling of a single word, not of a sentence, and not at the column level with hundreds of rows.

You can use SpellChecker (from the pyspellchecker package) for this:
import pandas as pd
from spellchecker import SpellChecker
spell = SpellChecker()
df = pd.DataFrame(['hooww good mrning playing fotball studyiing hard'], columns = ['text'])
def spell_check(x):
    correct_word = []
    misspelled_word = x.split()
    for word in misspelled_word:
        # correction() can return None when no candidate is found (recent versions); keep the original word then
        correct_word.append(spell.correction(word) or word)
    return ' '.join(correct_word)
df['spell_corrected_sentence'] = df['text'].apply(spell_check)

Splitting words in a column

I have a csv with msg column and it has the following text
muchloveandhugs
dudeseriously
onemorepersonforthewin
havefreebiewoohoothankgod
thisismybestcategory
yupbabe
didfreebee
heykidforget
hecomplainsaboutit
I know that nltk.corpus.words has a bunch of sensible words. My problem is how to iterate over the df['msg'] column so that I can get words such as
df['msg']
much love and hugs
dude seriously
one more person for the win

Based on this question about splitting words in strings with no spaces (and not quite knowing what your data looks like):
import pandas as pd
import wordninja
filename = 'mycsv.csv'  # Put your filename here
df = pd.read_csv(filename)
for wordstring in df['msg']:
    split = wordninja.split(wordstring)  # list of the words found in the string
    # Do something with split
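If you would rather not depend on wordninja, the underlying idea can be sketched with a small dynamic program over a known vocabulary. This is only an illustration, not how wordninja works internally. The function name segment and the tiny vocab set are made up; in practice the vocabulary could come from something like nltk.corpus.words, as the question suggests.

```python
def segment(text, vocabulary):
    """Split text into vocabulary words, or return None if no full split exists."""
    # best[i] holds a list of words covering text[:i], or None if unreachable
    best = [None] * (len(text) + 1)
    best[0] = []
    for end in range(1, len(text) + 1):
        for start in range(end):
            if best[start] is not None and text[start:end] in vocabulary:
                candidate = best[start] + [text[start:end]]
                # prefer the split that uses the fewest words
                if best[end] is None or len(candidate) < len(best[end]):
                    best[end] = candidate
    return best[-1]

# Hypothetical toy vocabulary; a real one would be far larger.
vocab = {"much", "love", "and", "hugs", "dude", "seriously", "a", "an"}

print(" ".join(segment("muchloveandhugs", vocab)))  # much love and hugs
print(" ".join(segment("dudeseriously", vocab)))    # dude seriously
```

Preferring the fewest words is a crude tie-breaker. With a large dictionary you would probably want to weight words by frequency instead, otherwise rare dictionary entries can produce odd splits.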

Python - read and edit individual Excel file rows

I'm loading Excel sheets into Python in order to clean (tokenize, stem et cetera) rows of text. I'm using Pandas to clean each individual line and return a new, cleaned Excel file in the same format as the original. In order for the tokenizer and stemmer to be able to read the Excel file, the Pandas dataframe needs to be in string format.
It more or less works, but the below code splits the text in each row by individual words, resulting in each row only containing one (cleaned) word and not a sentence like the original file. How can I make sure it doesn't split each row of text?
(simplified) code below:
import pandas as pd
from nltk.corpus import stopwords
from nltk.tokenize import RegexpTokenizer
from nltk.stem import SnowballStemmer
tokenizer = RegexpTokenizer(r'\w+')
stemmer = SnowballStemmer('english')
stop_words = set(stopwords.words('english'))
excel = pd.read_excel(open('example.xls', 'rb'))
data_to_string = pd.DataFrame.to_string(excel)
for line in data_to_string:
    tokens = tokenizer.tokenize(data_to_string)
    stopped = [word for word in tokens if not word in stop_words] #removes stop words
    trimmed = [ word for word in stopped if len(word) >= 3 ] #takes out all words of two characters or less.
    stemmed = [stemmer.stem(word) for word in trimmed] #stems the words
return_to_dataframe = pd.DataFrame(stemmed) #resets back to pandas dataframe
I've thought about using this, but it doesn't work:
data_to_string = excel.astype(str).apply(' '.join, axis=1)
Edit: Maarten asked if I could upload an image of what my current and desired output would be. The format of the original input file (uncleaned) is on the left. The middle is the desired outcome (stemmed and stop words removed etc.), and the right image is the current output.
EDIT: I managed to solve it; the main problem was with the tokenization. First, I had to convert the pandas dataframe to a list of lists (see strdata in the code below), and then tokenize each item in each list. The rest was solved with a simple for loop, appending the cleaned rows back to a list and converting the list back to a pandas dataframe. The remove_NaN step is there because pandas saw each None-type element as a string of alphanumeric characters (namely the word "None") instead of an empty cell, so this string had to be removed. Also, pandas put each tokenized word into a separate column; mergeddf is there to merge all words back into the same column.
The working code looks like this:
from nltk.corpus import stopwords
from nltk.tokenize import RegexpTokenizer
from nltk.stem import PorterStemmer
import pandas as pd
import numpy as np
#load tokenizer, stemmer and stop words
tokenizer = RegexpTokenizer(r'\w+')
stemmer = PorterStemmer()
stop_words = set(stopwords.words('english'))
excel = pd.read_excel(open(inFilePath, 'rb')) #use pandas to read excel file
strdata = excel.values.tolist() #convert values to list of lists (each row becomes a separate list)
tokens = [tokenizer.tokenize(str(i)) for i in strdata] #tokenize words in lists
cleaned_list = []
for m in tokens:
    stopped = [i for i in m if str(i).lower() not in stop_words] #remove stop words
    stemmed = [stemmer.stem(i) for i in stopped] #stem words
    cleaned_list.append(stemmed) #append stemmed words to list
backtodf = pd.DataFrame(cleaned_list) #convert list back to pandas dataframe
remove_NaN = backtodf.replace(np.nan, '', regex=True) #remove None (which return as words (str))
mergeddf = remove_NaN.astype(str).apply(lambda x: ' '.join(x), axis=1) #convert cells to strings, merge columns
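A shorter way to keep one sentence per row is to put the cleaning steps into a function that takes a single row's text and returns a single string. Below is a minimal sketch of that shape using only the standard library. The stop-word set is a tiny hypothetical stand-in for stopwords.words('english'), the stemming step is left out, and the two sample rows are invented.

```python
import re

# Hypothetical stand-in for stopwords.words('english'); the real list is much longer.
STOP_WORDS = {"the", "is", "a", "an", "and", "to", "of", "in", "it"}

def clean_row(text, min_length=3):
    """Tokenize one row, drop stop words and short words, and rejoin into one string."""
    tokens = re.findall(r"\w+", text.lower())
    kept = [t for t in tokens if t not in STOP_WORDS and len(t) >= min_length]
    return " ".join(kept)

rows = ["The service is great and fast", "It is easy to order online"]
cleaned = [clean_row(row) for row in rows]
print(cleaned)  # ['service great fast', 'easy order online']
```

In pandas the same function could be applied per cell with something like `excel[col].astype(str).apply(clean_row)`, where col is whichever column holds the text. A stemmer call would go inside the list comprehension.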
