I am trying to extract keywords row by row from a csv file and create a new keyword field. Right now I am only able to extract keywords from the entire text at once. How do I get keywords for each row/field?
Data:
id,some_text
1,"What is the meaning of the word Himalaya?"
2,"Palindrome is a word, phrase, or sequence that reads the same backward as forward"
Code: This searches the entire text, but not row by row. Do I need to put something else besides replace(r'\|', ' ')?
import pandas as pd
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
df = pd.read_csv('test-data.csv')
# print(df.head(5))
text_context = df['some_text'].str.lower().str.replace(r'\|', ' ', regex=True).str.cat(sep=' ')  # joins every row into one string
print(text_context)
print('')
tokens=nltk.tokenize.word_tokenize(text_context)
word_dist = nltk.FreqDist(tokens)
stop_words = stopwords.words('english')
punctuations = ['(',')',';',':','[',']',',','!','?']
keywords = [word for word in tokens if word not in stop_words and word not in punctuations]
print(keywords)
final output:
id,some_text,new_keyword_field
1,What is the meaning of the word Himalaya?,"meaning,word,himalaya"
2,"Palindrome is a word, phrase, or sequence that reads the same backward as forward","palindrome,word,phrase,sequence,reads,backward,forward"
Here is a clean way to add a new keywords column to your dataframe using pandas apply. Apply works by first defining a function (get_keywords in our case) that we can apply to each row or column.
import pandas as pd
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
# I define the stop_words here so I don't do it every time in the function below
stop_words = stopwords.words('english')
# I've added the index_col='id' here to set your 'id' column as the index. This assumes that the 'id' is unique.
df = pd.read_csv('test-data.csv', index_col='id')
Here we define the function that will be applied to each row using df.apply in the next cell. get_keywords takes a row as its argument and returns a string of comma-separated keywords, like in your desired output above ("meaning,word,himalaya"). Within this function we lowercase the text, tokenize it, filter out punctuation with isalpha(), filter out stop words, and join the remaining keywords to form the desired output.
# This function will be applied to each row in our Pandas Dataframe
# See the docs for df.apply at:
# https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.apply.html
def get_keywords(row):
    some_text = row['some_text']
    lowered = some_text.lower()
    tokens = nltk.tokenize.word_tokenize(lowered)
    keywords = [keyword for keyword in tokens if keyword.isalpha() and keyword not in stop_words]
    keywords_string = ','.join(keywords)
    return keywords_string
Now that we have defined the function, we call df.apply(get_keywords, axis=1). This returns a Pandas Series (similar to a list). Since we want this Series to be part of our dataframe, we add it as a new column using df['keywords'] = df.apply(get_keywords, axis=1).
# applying the get_keywords function to our dataframe and saving the results
# as a new column in our dataframe called 'keywords'
# axis=1 means that we will apply get_keywords to each row and not each column
df['keywords'] = df.apply(get_keywords, axis=1)
Output:
Dataframe after adding 'keywords' column
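For reference, here is a minimal end-to-end sketch of the approach above. It builds the dataframe inline and uses a small hand-rolled stop-word set plus a regex in place of NLTK's downloadable stopwords and word_tokenize, so it runs standalone; the filtering logic mirrors the answer's isalpha()/stop-word steps.

```python
import re
import pandas as pd

# small stand-in stop-word set; in practice use stopwords.words('english')
stop_words = {'what', 'is', 'the', 'of', 'a', 'or', 'that', 'as', 'same'}

df = pd.DataFrame({
    'id': [1, 2],
    'some_text': [
        "What is the meaning of the word Himalaya?",
        "Palindrome is a word, phrase, or sequence that reads the same backward as forward",
    ],
}).set_index('id')

def get_keywords(row):
    # re.findall keeps alphabetic tokens only, mirroring the isalpha() filter
    tokens = re.findall(r'[a-z]+', row['some_text'].lower())
    return ','.join(t for t in tokens if t not in stop_words)

df['keywords'] = df.apply(get_keywords, axis=1)
print(df['keywords'].tolist())
# → ['meaning,word,himalaya', 'palindrome,word,phrase,sequence,reads,backward,forward']
```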
Related
I have a csv file with three columns, namely (cid, ccontent, value). I want to loop through each word in the ccontent column and translate the words individually.
I found this code for translating a row, but I want to translate each word, not the whole row.
How to write a function in Python that translates each row of a csv to another language?
from googletrans import Translator
import pandas as pd
headers = ['A','B','A_translation', 'B_translation']
data = pd.read_csv('./data.csv')
translator = Translator()
# Init empty dataframe with as many rows as `data`
df = pd.DataFrame(index=range(0,len(data)), columns=headers)
def translate_row(row):
    ''' Translate elements A and B within `row`. '''
    a = translator.translate(row[0], dest='Fr')
    b = translator.translate(row[1], dest='Fr')
    return pd.Series([a.origin, b.origin, a.text, b.text], headers)
for i, row in enumerate(data.values):
    # Fill empty dataframe with the returned series.
    df.loc[i] = translate_row(row)
print(df)
Thank you
You can try something along these lines, splitting each cell into words and translating them individually with a list comprehension:
def translate_row(row):
    # split into words first; iterating row[0] directly would yield characters
    row0bywords = [translator.translate(eachword, dest='Fr') for eachword in row[0].split()]
    row1bywords = [translator.translate(eachword, dest='Fr') for eachword in row[1].split()]
    return row0bywords, row1bywords
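To see the word-by-word pattern in isolation, here is a sketch with a stub translator, since googletrans requires network access; the StubTranslator class and its tiny vocabulary are made up purely for illustration.

```python
class StubTranslator:
    # made-up two-word vocabulary, standing in for a real translation service
    vocab = {'hello': 'bonjour', 'world': 'monde'}

    def translate(self, word, dest='fr'):
        # return the known translation, or the word unchanged
        return self.vocab.get(word.lower(), word)

translator = StubTranslator()

def translate_words(cell):
    # split the cell into words and translate each one individually
    return [translator.translate(w, dest='fr') for w in cell.split()]

print(translate_words("hello world"))  # → ['bonjour', 'monde']
```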
I'm new to Python and NLTK. I'm trying to prepare text for tokenization using NLTK in Python after I import the text from a csv. There's only one column in the file with free text. I want to isolate that specific column, which I did.... I think.
import spacy
import pandas as pd
import numpy as np
import nltk
from nltk.tokenize import word_tokenize
import re
import unicodedata
pd.set_option('display.max_colwidth',50)
oiw = pd.read_csv(r'C:\Users\tgray\Documents\PythonScripts\Worksheets.csv')
text = oiw.drop(oiw.columns[[1,2,3]],axis=1)
for row in text:
    for text['value'] in row:
        tokens = word_tokenize(row)
        print(tokens)
When I run the code, the output it gives me is ['values'] which is the column name. How do I get the rest of the rows to show up in the output?
Sample data I have in the 'values' column:
The way was way too easy to order online.
Everything is great.
It's too easy for me to break.
The output I'm hoping to receive is:
['The','way','was','too','easy','to','order','online','Everything','is','great','It''s','for','me','break']
The correction you need to make is in this segment:
oiw = pd.read_csv(r'C:\Users\tgray\Documents\PythonScripts\Worksheets.csv')
text = oiw.drop(oiw.columns[[1,2,3]], axis=1)  # drop the unwanted columns by position
for row in text['values']:  # correctly selecting the column
    tokens = word_tokenize(row)
    print(tokens)  # prints the tokens of each row
print(tokens)  # prints only the tokens of the last row
Hence you will be iterating over the correct column of the dataframe.
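To see the corrected iteration pattern on its own, here is a tiny self-contained sketch: a toy dataframe stands in for the CSV, and plain split() stands in for word_tokenize so it runs without NLTK's data files.

```python
import pandas as pd

# toy frame mirroring the 'values' column from the question
text = pd.DataFrame({'values': [
    'Everything is great.',
    "It's too easy for me to break.",
]})

all_tokens = []
for row in text['values']:  # iterate the column's cell values, not the column name
    tokens = row.split()
    all_tokens.append(tokens)

print(all_tokens[0])  # → ['Everything', 'is', 'great.']
```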
Here is the CSV table. There are two columns in it: one is summaries and the other is texts. Both columns were lists before I combined them, converted the result to a data frame, and saved it as a CSV file. BTW, the texts in the table have already been cleaned (all marks removed and converted to lower case):
I want to loop through each cell in the table, split summaries and texts into words and tokenize each word. How can I do it?
I tried with the python CSV reader and df.apply(word_tokenize). I also tried newList = set(summaries + texts), but then I could not tokenize them.
Any solution is welcome, whether it uses the CSV file, a data frame, or a list. Thanks for your help in advance!
note: The real table has more than 50,000 rows.
=== some update ===
Here is the code I have tried:
import pandas as pd
data= pd.read_csv('test.csv')
data.head()
newTry=data.apply(lambda x: " ".join(x), axis=1)
type(newTry)
print (newTry)
import nltk
for sentence in newTry:
    new = sentence.split()
    print(new)
    print(set(new))
Please refer to the output in the screenshot. There are duplicate words in the list, and some square brackets. How should I remove them? I tried with set, but it returns only one sentence's values.
You can use the built-in csv package to read the csv file, and nltk to tokenize the words:
from nltk.tokenize import word_tokenize
import csv
words = []
def get_data():
    with open("sample_csv.csv", "r") as records:
        for record in csv.reader(records):
            yield record
data = get_data()
next(data) # skip header
for row in data:
    for sent in row:
        for word in word_tokenize(sent):
            if word not in words:
                words.append(word)

print(words)
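As a side note, the `if word not in words` membership test scans the whole list each time; for the ~50,000-row table mentioned above, an order-preserving dict.fromkeys dedup is a cheap alternative. A small sketch, with plain split() standing in for word_tokenize so it runs without NLTK data:

```python
rows = [
    "palindrome is a word",
    "a word reads the same",
]

# collect every token, then deduplicate while preserving first-seen order;
# dict keys are unique and (since Python 3.7) keep insertion order
all_words = [word for row in rows for word in row.split()]
unique_words = list(dict.fromkeys(all_words))
print(unique_words)  # → ['palindrome', 'is', 'a', 'word', 'reads', 'the', 'same']
```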
I'm trying to use NLTK word_tokenize on an excel file I've opened as a data frame. The column I want to use word_tokenize on contains sentences. How can I pull out that specific column from my data frame to tokenize it? The name of the column I'm trying to access is called "Complaint / Query Detail".
import pandas as pd
from nltk import word_tokenize
file = "List of Complaints.xlsx"
df = pd.read_excel(file, sheet_name = "All Complaints" )
token = df["Complaint / Query Detail"].apply(word_tokenize)
I tried this method but I keep getting errors.
Try this:
df['Complaint / Query Detail'] = df.apply(
    lambda row: word_tokenize(row['Complaint / Query Detail']), axis=1)
This is a for loop for tokenizing every column in a dataframe. Pass in the dataframe you read from your CSV file:
def tokenize_text(df):
    for column in df.columns:
        df["tokenized_" + column] = df.apply(lambda row: nltk.word_tokenize(row[column]), axis=1)
    return df
print(df)
I hope it's helpful.
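A minimal sketch of that per-column loop on a toy dataframe, with str.split() standing in for nltk.word_tokenize so it runs without NLTK's data files (the column name here is made up):

```python
import pandas as pd

def tokenize_text(df):
    # add a "tokenized_<col>" column for every original column;
    # list() snapshots the column names before new ones are added
    for column in list(df.columns):
        df["tokenized_" + column] = df[column].apply(lambda cell: cell.split())
    return df

df = pd.DataFrame({'Detail': ['Everything is great', 'Too easy to break']})
df = tokenize_text(df)
print(df['tokenized_Detail'].tolist())
# → [['Everything', 'is', 'great'], ['Too', 'easy', 'to', 'break']]
```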
I'm loading Excel sheets into Python in order to clean (tokenize, stem et cetera) rows of text. I'm using Pandas to clean each individual line and return a new, cleaned Excel file in the same format as the original. In order for the tokenizer and stemmer to be able to read the Excel file, the Pandas dataframe needs to be in string format.
It more or less works, but the below code splits the text in each row by individual words, resulting in each row only containing one (cleaned) word and not a sentence like the original file. How can I make sure it doesn't split each row of text?
(simplified) code below:
import pandas as pd
from nltk.corpus import stopwords
from nltk.tokenize import RegexpTokenizer
from nltk.stem import SnowballStemmer
tokenizer = RegexpTokenizer(r'\w+')
stemmer = SnowballStemmer('english')
stop_words = set(stopwords.words('english'))
excel = pd.read_excel(open('example.xls', 'rb'))
data_to_string = pd.DataFrame.to_string(excel)
for line in data_to_string:
    tokens = tokenizer.tokenize(data_to_string)
    stopped = [word for word in tokens if word not in stop_words]  # removes stop words
    trimmed = [word for word in stopped if len(word) >= 3]  # takes out all words of two characters or less
    stemmed = [stemmer.stem(word) for word in trimmed]  # stems the words
    return_to_dataframe = pd.DataFrame(stemmed)  # resets back to pandas dataframe
I've thought about using this, but it doesn't work:
data_to_string = excel.astype(str).apply(' '.join, axis=1)
Edit: Maarten asked if I could upload an image of what my current and desired output would be. The format of the original input file (uncleaned) is on the left. The middle is the desired outcome (stemmed and stop words removed etc.), and the right image is the current output.
EDIT: I managed to solve it; the main problem was with the tokenization. First, I had to convert the pandas dataframe to a list of lists (see strdata in the code below), and then tokenize each item in each list. The rest was solved with a simple for loop, appending the cleaned rows to a list and converting that list back to a pandas dataframe. The remove_NaN step is there because pandas treated each None-type element as the literal string "None" instead of an empty cell, so that string had to be removed. Also, pandas put each tokenized word into a separate column; mergeddf merges all the words back into a single column.
The working code looks like this:
from nltk.corpus import stopwords
from nltk.tokenize import RegexpTokenizer
from nltk.stem import PorterStemmer
import pandas as pd
import numpy as np
#load tokenizer, stemmer and stop words
tokenizer = RegexpTokenizer(r'\w+')
stemmer = PorterStemmer()
stop_words = set(stopwords.words('english'))
excel = pd.read_excel(open(inFilePath, 'rb')) #use pandas to read excel file
strdata = excel.values.tolist() #convert values to list of lists (each row becomes a separate list)
tokens = [tokenizer.tokenize(str(i)) for i in strdata] #tokenize words in lists
cleaned_list = []
for m in tokens:
    stopped = [i for i in m if str(i).lower() not in stop_words]  # remove stop words
    stemmed = [stemmer.stem(i) for i in stopped]  # stem words
    cleaned_list.append(stemmed)  # append stemmed words to list
backtodf = pd.DataFrame(cleaned_list) #convert list back to pandas dataframe
remove_NaN = backtodf.replace(np.nan, '', regex=True) #remove None (which return as words (str))
mergeddf = remove_NaN.astype(str).apply(lambda x: ' '.join(x), axis=1) #convert cells to strings, merge columns
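As a compact illustration of that pipeline (list-of-lists rows → tokenize → remove stop words → stem → merge back into one column), here is a sketch that swaps NLTK's tokenizer and stemmer for a regex and a toy suffix-stripper, so it runs with no NLTK data; the stop-word set and toy_stem rules are simplified stand-ins, not PorterStemmer's actual behavior.

```python
import re
import pandas as pd

stop_words = {'the', 'is', 'a', 'to'}  # tiny stand-in stop-word set

def toy_stem(word):
    # crude suffix stripping, standing in for PorterStemmer
    for suffix in ('ing', 'ed', 's'):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

# mirrors strdata: each row of the sheet becomes a one-element list
rows = [["the cats were running"], ["a dog barked loudly"]]

cleaned_list = []
for row in rows:
    tokens = re.findall(r'\w+', str(row).lower())          # tokenize the stringified row
    stopped = [t for t in tokens if t not in stop_words]   # remove stop words
    cleaned_list.append([toy_stem(t) for t in stopped])    # stem and collect

backtodf = pd.DataFrame(cleaned_list)  # one word per column, like the original
# fill missing cells, then merge each row's words back into a single string
merged = backtodf.fillna('').astype(str).apply(lambda x: ' '.join(x), axis=1)
print(merged.tolist())  # → ['cat were runn', 'dog bark loudly']
```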