Text from data frame into a large corpus - python

I am following a few online tutorials on text processing. One tutorial uses the below code to read in a number of .txt files and put them into one large corpus.
import codecs

corpus_raw = u""
for file_name in file_names:
    with codecs.open(file_name, "r", "utf-8") as book_file:
        corpus_raw += book_file.read()
    print("Document is {0} characters long".format(len(corpus_raw)))
    print()
…
Then they go on to process the data:
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
raw_sentences = tokenizer.tokenize(corpus_raw)
However, the text data I have is in a pandas DataFrame. Each row is a book, and the text of that book is in a single cell. I have found this answer but I cannot seem to get it to work with my data.
My pandas DataFrame has an ID column called "IDLink" and a text column called "text". How can I put all my text data into one large corpus? It will be used to run a Word2Vec model.
EDIT:
This is not working as expected. I thought I would get a list of tokenized words for each row.
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
risk['tokenized_documents'] = risk['text'].apply(tokenizer.tokenize)
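For reference, a minimal sketch of one way this could be done, assuming a DataFrame named risk with the "text" column described above (the name risk comes from the EDIT; everything else, including the use of gensim for Word2Vec, is an assumption): join the column into one corpus string, split it into sentences with the punkt tokenizer, and word-tokenize each sentence before passing the result to Word2Vec. If instead you want a list of tokenized words per row, risk['text'].apply(nltk.word_tokenize) would do that; tokenizer.tokenize (punkt) splits text into sentences, not words, which is why the EDIT did not behave as expected.
import nltk
from gensim.models import Word2Vec

tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')

# Join every book's text into one large corpus string.
corpus_raw = u"\n".join(risk['text'].astype(str))
print("Document is {0} characters long".format(len(corpus_raw)))

# Split the corpus into sentences, then each sentence into words.
raw_sentences = tokenizer.tokenize(corpus_raw)
sentences = [nltk.word_tokenize(s) for s in raw_sentences if s]

# Word2Vec expects a list of tokenized sentences.
model = Word2Vec(sentences, min_count=5)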

Related

Replace csv column with for loop in python

I'm trying to replace values in my dataset in a somewhat unusual way. I know the code block below may seem illogical, but I have to do it this way. Is there a way to replace the 'Text' values in my CSV file with the tokenized and filtered lines produced by the for loop?
dataset = pandas.read_csv('/root/Desktop/%20/%1004.csv', encoding='cp1252')
counter = 0
for field in dataset['text']:
    tokens = word_tokenize(field.translate(table))
    tokens2 = [w for w in tokens if w not in stop_words]
    tokens3 = [token for token in tokens2 if not all(char.isdigit() or char == '.' or char == '-' for char in token)]
    lemmatized_word = [wordnet_lemmatizer.lemmatize(word) for word in tokens3]
    stemmed_word = [snowball_stemmer.stem(word) for word in lemmatized_word]
    ##### ANY CODE TO REPLACE ITEMS IN dataset['Text'] to stemmed_word
    ##### LIKE:
    # dataset['Text']'s first value = stemmed_word[counter]
    counter = counter + 1
Then I want to save the replaced CSV file, because I have features in other columns, like age, gender, and experience.
You can just leave the data you don't intend to modify as they are, and write them to the new file along with your modified column of lemmatized words. Then whether you write the new processed dataset to a new file or overwrite your old one is entirely up to you. Though I'd personally choose to write to a new file (it's unlikely that adding another CSV file will be a problem to your computer's storage nowadays).
Anyway, to write files, you can use the csv module.
import pandas
import csv

dataset = pandas.read_csv('/root/Desktop/%20/%1004.csv', encoding='cp1252')

# do your text processing on the desired column for your dataset
# ...
# ...
# ...

dataT = dataset.transpose()
# 'wb' is the Python 2 file mode; in Python 3 use open('new_dataset', 'w', newline='')
with open('new_dataset', 'wb') as csvfile:
    writer = csv.writer(csvfile)
    for r in dataT:
        writer.writerow(dataT[r])
I can't fully test it out, since I don't know the exact format of your dataset. But it should be something along this line (perhaps you should be writing the processed dataframe directly, and not its transpose; you should be able to figure that out yourself after playing around with it).
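As a side note, the same thing can be done entirely within pandas. The following is only a sketch under assumptions: Python 3, NLTK's English stop word list, a punctuation translation table (the question's table variable was not shown), and the column name 'Text' taken from the question's comments.
import string
import pandas
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer, SnowballStemmer

stop_words = set(stopwords.words('english'))
table = str.maketrans('', '', string.punctuation)
wordnet_lemmatizer = WordNetLemmatizer()
snowball_stemmer = SnowballStemmer('english')

def process_text(field):
    # Same steps as the question: tokenize, drop stop words and
    # number-like tokens, lemmatize, then stem.
    tokens = word_tokenize(field.translate(table))
    tokens = [w for w in tokens if w not in stop_words]
    tokens = [t for t in tokens if not all(c.isdigit() or c in '.-' for c in t)]
    tokens = [snowball_stemmer.stem(wordnet_lemmatizer.lemmatize(w)) for w in tokens]
    return ' '.join(tokens)

dataset = pandas.read_csv('/root/Desktop/%20/%1004.csv', encoding='cp1252')

# Overwrite the text column in place; age, gender, experience stay untouched.
dataset['Text'] = dataset['Text'].apply(process_text)

# Write everything to a new file rather than overwriting the original.
dataset.to_csv('new_dataset.csv', index=False, encoding='cp1252')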

How to cluster different texts from different files?

I would like to cluster texts from different files by their topics. I am using the 20 newsgroups dataset, so there are different categories, and I would like to cluster the texts into these categories with DBSCAN. My problem is how to do this.
At the moment I am saving each file's text in a dict as a string. Then I remove several characters and words and extract nouns from each dict entry. After that I would like to apply TF-IDF to each dict entry, which works, but how can I pass this to DBSCAN to cluster it into categories?
My text processing and data handling:
counter = 0
dic = {}
for i in range(len(categories)):
    path = Path('dataset/20news/%s/' % categories[i])
    print("Getting files from: %s" % path)
    files = os.listdir(path)
    for f in files:
        with open(path/f, 'r', encoding="latin1") as file:
            data = file.read()
            dic[counter] = data
            counter += 1

if preprocess == True:
    print("processing Data...")
    content = preprocessText(data)
    if get_nouns == True:
        content = nounExtractor(content)

tfidf_vectorizer = TfidfVectorizer(stop_words=stop_words, max_features=max_features)
for i in range(len(content)):
    content[i] = tfidf_vectorizer.fit_transform(content[i])
So I would like to pass each text to DBSCAN, and I think it would be wrong to put all texts into one string, because then there would be no way to assign labels to them. Am I right?
I hope my explanation is not too confusing.
Best regards!
EDIT:
for f in files:
    with open(path/f, 'r', encoding="latin1") as file:
        data = file.read()
        all_text.append(data)

tfidf_vectorizer = TfidfVectorizer(stop_words=stop_words, max_features=max_features)
tfidf_vectorizer.fit(all_text)

text_vectors = []
for text in all_text:
    # transform expects an iterable of documents, so wrap the single string in a list
    text_vectors.append(tfidf_vectorizer.transform([text]))
You should fit the TF-IDF vectorizer to the whole training text corpus, and then create a vector representation for each text/document on its own by transforming it with the fitted vectorizer. You should then apply clustering to those vector representations of the documents.
EDIT
A simple edit to your original code: instead of the following loop
for i in range(len(content)):
    content[i] = tfidf_vectorizer.fit_transform(content[i])
you could do this:
transformed_contents = tfidf_vectorizer.fit_transform(content)
transformed_contents will then contain the vectors that you should run your clustering algorithm against.
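To connect this to the clustering step, here is a minimal sketch (assuming all_text holds one string per document, as in the EDIT; the stop word list, max_features, eps, and min_samples values are placeholders you would need to tune):
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import DBSCAN

# One TF-IDF matrix for the whole corpus: one row per document.
tfidf_vectorizer = TfidfVectorizer(stop_words='english', max_features=5000)
doc_vectors = tfidf_vectorizer.fit_transform(all_text)

# DBSCAN accepts the sparse matrix directly; eps and min_samples need tuning.
clustering = DBSCAN(eps=0.7, min_samples=5, metric='cosine').fit(doc_vectors)

# One cluster label per document, -1 meaning "noise".
print(clustering.labels_)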

PYTHON: How to extract multiple regex patterns from list of text files and store as data frame?

I have a list of .txt files. Each file contains multiple newspaper articles, about 400 on average.
I want to define a function that maps over the list, extracting 1) publication date and 2) body text from each file, and returns a pandas data frame of date and text.
I have regex patterns that will match the relevant strings (they've worked for the same procedure in R), but I haven't been able to define a function that works.
Thanks in advance for the help with this beginner question!
If you don't know how to define a function:
import re

def split_date_body(data):
    p = re.compile(r'(Date),\s*(Body.*)')
    Date, Body = p.findall(data)[0]
    return Date, Body

data = 'Date, Body xxxx'
print(split_date_body(data))
Change r'(Date),\s*(Body.*)' to your regex.
If you don't know how to parse a file with multiple lines:
with open(your_file, 'r') as f:
    datas = f.readlines()

for data in datas:
    result = split_date_body(data)
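Putting this together for the original question, here is a hedged sketch of a function that maps over a list of files and returns a pandas data frame. The regex patterns are placeholders (the actual patterns that worked in R were not posted), and the file list is hypothetical.
import re
import pandas as pd

# Placeholder patterns; swap in the patterns that already worked in R.
date_pattern = re.compile(r'Date:\s*(.*)')
body_pattern = re.compile(r'Body:\s*(.*)')

def extract_articles(file_names):
    rows = []
    for file_name in file_names:
        with open(file_name, 'r', encoding='utf-8') as f:
            text = f.read()
        # findall returns one capture per matching article in the file
        dates = date_pattern.findall(text)
        bodies = body_pattern.findall(text)
        rows.extend({'date': d, 'text': b} for d, b in zip(dates, bodies))
    return pd.DataFrame(rows, columns=['date', 'text'])

# Hypothetical list of input files.
file_names = ['articles_01.txt', 'articles_02.txt']
df = extract_articles(file_names)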

Tensorflow text parsing

I am trying to parse text with TensorFlow. Let's say I have a CSV file that looks like this:
foo,bar
And I want the output to be an array of the values:
[foo,bar]
How would you do it? I was thinking about creating some kind of hash for each word (value) for the input, but I have no idea what to do for the output.
I am not parsing CSV files, I only used them as an example.
reader = tf.TextLineReader()
_, lines = reader.read(...)

# if your data is csv
list_of_tensors = tf.decode_csv(lines, ...)

# if your data is just a line of text
sparse_tensor = tf.string_split(lines, ...)
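For context, here is a minimal runnable sketch of the pipeline this answer hints at, written against the old TF 1.x queue-based input API; the file name data.csv and the assumption of two string columns are placeholders, not part of the original answer.
import tensorflow as tf  # TF 1.x API

# Hypothetical input file with lines like "foo,bar".
filename_queue = tf.train.string_input_producer(["data.csv"])

reader = tf.TextLineReader()
_, line = reader.read(filename_queue)

# decode_csv needs one default value per column; here, two string columns.
col1, col2 = tf.decode_csv(line, record_defaults=[[""], [""]])
values = tf.stack([col1, col2])  # dense string tensor, e.g. ["foo", "bar"]

with tf.Session() as sess:
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(sess=sess, coord=coord)
    print(sess.run(values))
    coord.request_stop()
    coord.join(threads)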

Split the hash-tags from a tweet and then store them and the remaining string in a CSV

I have a CSV file with thousands of tweets, with columns ID, created date, and tweet. I want the output as another CSV file with an extra column holding the hashtag words split out of each tweet. I need a Python script to do this.
For example,
If I have a tweet like
I love #stackoverflow coding #helpful
then I need the hashtags split out of that string and stored in another column, like below.
"I love coding","stackoverflow,helpful"
Sample input from CSV
"id","created_date","tweet"
"723456719","2015-12-03 15:16:47","I love #stackoverflow coding #helpful"
"723456720","2015-12-03 16:15:47","I love #github coding #useful"
The output CSV must look like
"id","created_date","tweet","hashtags"
"723456719","2015-12-03 15:16:47","I love coding","stackoverflow,helpful"
"723456720","2015-12-03 16:15:47","I love coding","github,useful"
I am new to Python, please help me out. Here's the code I tried; it's a piece of code I took from a GitHub page.
#import regex
import re

#start process_tweet
def processTweet(tweet):
    # process the tweets
    #Convert to lower case
    tweet = tweet.lower()
    #Convert www.* or https?://* to URL
    tweet = re.sub('((www\.[^\s]+)|(https?://[^\s]+))', 'URL', tweet)
    #Convert @username to AT_USER
    tweet = re.sub('@[^\s]+', 'AT_USER', tweet)
    #Remove additional white spaces
    tweet = re.sub('[\s]+', ' ', tweet)
    #Replace #word with word
    tweet = re.sub(r'#([^\s]+)', r'\1', tweet)
    #trim
    tweet = tweet.strip('\'"')
    return tweet
#end

#Read the tweets one by one and process them
fp = open('sample.csv', 'r')
line = fp.readline()
while line:
    processedTweet = processTweet(line)
    print processedTweet
    line = fp.readline()
#end loop
fp.close()
You can use DictReader and DictWriter from the csv module to treat each line as a dictionary. This script will work in Python 3.
import csv

# Open the input file for reading and the output file for writing.
# It's good practice to use the with open() context manager.
# In Python 3, newline='' is the recommended way to open files for the csv module.
with open('sample.csv', 'r', newline='') as csv_in, open('results.csv', 'w', newline='') as csv_out:
    # The reader will figure out the field names
    # based on the first line in the file.
    reader = csv.DictReader(csv_in)
    # We have to tell the writer the fields and their order
    # and which dialect of csv we want.
    writer = csv.DictWriter(
        csv_out,
        fieldnames=reader.fieldnames + ['hashtags'],
        dialect=reader.dialect,
        quoting=csv.QUOTE_ALL,
    )
    # Write the header line of the output csv.
    writer.writeheader()
    # Loop over each line in the csv. The header line is not
    # part of this loop when using csv.DictReader.
    for row in reader:
        # Split the tweet into words using str.split().
        words = row['tweet'].split()
        # If you need to modify this code, you should turn the
        # following lines into one or two separate functions.
        # This will make debugging and testing easier.
        # Keep and rejoin only the words that are not hashtags.
        row['tweet'] = ' '.join(
            w for w in words if not w.startswith('#'))
        # Extract the hashtags and remove the initial "#"
        # using string slicing.
        row['hashtags'] = ','.join(
            w[1:] for w in words if w.startswith('#'))
        # Write the modified row to the output csv.
        writer.writerow(row)
If you use Python 2.7, you might have to make some modifications to handle Unicode input, if that's relevant. It might be as simple as adding this line to the top of your Python file.
from __future__ import unicode_literals
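As a side note, the same split can be done with pandas in a few lines. This is only a sketch, not part of the original answer, and it assumes the input file matches the sample above (a header row with a "tweet" column).
import re
import pandas as pd

df = pd.read_csv('sample.csv')

# Collect the hashtag words (without the leading "#") into a new column...
df['hashtags'] = df['tweet'].apply(lambda t: ','.join(re.findall(r'#(\w+)', t)))

# ...and strip the hashtags out of the tweet text itself.
df['tweet'] = df['tweet'].apply(lambda t: re.sub(r'\s*#\w+', '', t).strip())

# quoting=1 corresponds to csv.QUOTE_ALL, matching the desired output format.
df.to_csv('results.csv', index=False, quoting=1)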
