I would like to cluster texts from different files by their topics. I am using the 20 Newsgroups dataset, so there are different categories, and I would like to cluster the texts into these categories with DBSCAN. My problem is how to do this.
At the moment I am saving the text of each file as a string in a dict. Then I remove several characters and words and extract the nouns from each dict entry. Next I would like to apply TF-IDF to each dict entry, which works, but how can I pass the result to DBSCAN to cluster it into categories?
My text processing and data handling:
counter = 0
dic = {}

for i in range(len(categories)):
    path = Path('dataset/20news/%s/' % categories[i])
    print("Getting files from: %s" % path)
    files = os.listdir(path)
    for f in files:
        with open(path/f, 'r', encoding="latin1") as file:
            data = file.read()
        dic[counter] = data
        counter += 1

if preprocess == True:
    print("processing Data...")
    content = preprocessText(data)

if get_nouns == True:
    content = nounExtractor(content)

tfidf_vectorizer = TfidfVectorizer(stop_words=stop_words, max_features=max_features)

for i in range(len(content)):
    content[i] = tfidf_vectorizer.fit_transform(content[i])
So I would like to pass each text to DBSCAN. I think it would be wrong to put all the texts into one string, because then there would be no way to assign labels to them, am I right?
I hope my explanation is not too confusing.
Best regards!
EDIT:
for f in files:
    with open(path/f, 'r', encoding="latin1") as file:
        data = file.read()
    all_text.append(data)

tfidf_vectorizer = TfidfVectorizer(stop_words=stop_words, max_features=max_features)
tfidf_vectorizer.fit(all_text)

text_vectors = []
for text in all_text:
    # transform expects an iterable of documents, so wrap the single string in a list
    text_vectors.append(tfidf_vectorizer.transform([text]))
You should fit the TF-IDF vectorizer to the whole training text corpus, and then create a vector representation for each text/document on its own by transforming it with the fitted vectorizer. You should then apply the clustering to those vector representations of the documents.
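As a rough sketch of that flow (assuming all_text is a list holding one raw string per document; the variable names are just placeholders):

from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vectorizer = TfidfVectorizer(stop_words='english')
tfidf_vectorizer.fit(all_text)                        # learn the vocabulary and IDF weights on the whole corpus
text_vectors = tfidf_vectorizer.transform(all_text)   # one row per document in a single sparse matrix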
EDIT
A simple edit to your original code would be, instead of the following loop:
for i in range(len(content)):
    content[i] = tfidf_vectorizer.fit_transform(content[i])
You could do this
transformed_contents = tfidf_vectorizer.fit_transform(content)
transformed_contents will then contain the vectors that you should run your clustering algorithm against.
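A possible clustering step on top of that, sketched with arbitrary parameter values that will need tuning for your data:

from sklearn.cluster import DBSCAN

# cosine distance tends to suit TF-IDF vectors better than Euclidean
clustering = DBSCAN(eps=0.7, min_samples=5, metric='cosine').fit(transformed_contents)
print(clustering.labels_)   # one label per document; -1 marks points treated as noise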
Related
As titled, I have several .txt files and first need to split each of them into batches of fewer than 10,000 words each. I then need to process each batch and print out the frequency table for the whole set of .txt files. My attempt:
histogram = {}
results2 = [os.path.basename(filename) for filename in glob.glob(path + '*.txt')]

for filename in results2:
    with open(path + filename, 'r', encoding='utf-8') as f:
        text = f.read()
    batches = split_into_batches(text, 10000)
    for single_batch in batches:
        parsed = nlp(single_batch)
        for token in parsed:
            original_token_text = token.orth_
            if original_token_text not in histogram:
                histogram[original_token_text] = 1
            else:
                histogram[original_token_text] += 1

print(histogram)
This code just keeps running without giving any output, even though it works fine for a single .txt file. I need an overall frequency table. Is there any way I can fix it?
Any help will be appreciated!
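For what it's worth, one common pattern for an overall table is to accumulate the counts in a collections.Counter and print only once, after the loop over the files has finished; a sketch along those lines (split_into_batches and nlp are assumed to be the same helpers as above):

from collections import Counter

histogram = Counter()
for filename in results2:
    with open(path + filename, 'r', encoding='utf-8') as f:
        text = f.read()
    for single_batch in split_into_batches(text, 10000):
        histogram.update(token.orth_ for token in nlp(single_batch))   # count every token in this batch

print(histogram.most_common(20))   # printed once, after all files have been processed

If the script is simply slow rather than stuck, disabling spaCy pipeline components you do not need for counting, e.g. spacy.load('en_core_web_sm', disable=['parser', 'ner']), usually speeds up plain tokenization considerably.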
I'm trying to replace values in my dataset in a somewhat unusual way. I know the code block below may seem illogical, but I have to do it this way. Is there any option to replace the 'Text' values in my CSV file with my tokenized and filtered lines inside the for loop?
dataset = pandas.read_csv('/root/Desktop/%20/%1004.csv', encoding='cp1252')

counter = 0
for field in dataset['text']:
    tokens = word_tokenize(field.translate(table))
    tokens2 = [w for w in tokens if not w in stop_words]
    tokens3 = [token for token in tokens2 if not all(char.isdigit() or char == '.' or char == '-' for char in token)]
    lemmatized_word = [wordnet_lemmatizer.lemmatize(word) for word in tokens3]
    stemmed_word = [snowball_stemmer.stem(word) for word in lemmatized_word]

    ##### ANY CODE TO REPLACE ITEMS IN dataset['Text'] to stemmed_word
    ##### LIKE:
    #####     dataset['Text']'s first value = stemmed_word[counter]

    counter = counter + 1
Then I want to save the replaced CSV file, because I have features in other columns, like age, gender, and experience.
You can just leave the data you don't intend to modify as it is, and write it to the new file along with your modified column of lemmatized words. Whether you write the processed dataset to a new file or overwrite the old one is entirely up to you, though I'd personally choose to write to a new file (it's unlikely that one extra CSV file will be a problem for your computer's storage nowadays).
Anyway, to write files, you can use the csv module.
import pandas
import csv

dataset = pandas.read_csv('/root/Desktop/%20/%1004.csv', encoding='cp1252')

# do your text processing on the desired column for your dataset
# ...
# ...
# ...

dataT = dataset.transpose()

with open('new_dataset', 'wb') as csvfile:
    writer = csv.writer(csvfile)
    for r in dataT:
        writer.writerow(dataT[r])
I can't fully test it out, since I don't know the exact format of your dataset. But it should be something along this line (perhaps you should be writing the processed dataframe directly, and not its transpose; you should be able to figure that out yourself after playing around with it).
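If you do end up replacing the column directly in the DataFrame, a simpler route is to let pandas write the file itself. A minimal sketch, assuming the processing from the question has been wrapped in a hypothetical process_text helper that returns the list of stemmed tokens for one row:

import pandas

dataset = pandas.read_csv('/root/Desktop/%20/%1004.csv', encoding='cp1252')

# process_text is a hypothetical stand-in for the tokenize/filter/lemmatize/stem steps
dataset['Text'] = [' '.join(process_text(field)) for field in dataset['Text']]

# the untouched columns (age, gender, experience, ...) are written out unchanged
dataset.to_csv('new_dataset.csv', index=False, encoding='cp1252')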
I am following a few online tutorials on text processing. One tutorial uses the code below to read in a number of .txt files and put them into one large corpus.
corpus_raw = u""
for file_name in file_names:
    with codecs.open(file_name, "r", "utf-8") as file_name:
        corpus_raw += file_name.read()

print("Document is {0} characters long".format(len(corpus_raw)))
print()
…
Then they go on to process the data:
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
raw_sentences = tokenizer.tokenize(corpus_raw)
However, the text data I have is in a pandas DataFrame. Each row is a book, and the text of each book is in a cell. I have found this answer but I cannot seem to get it to work with my data.
My pandas df has an ID column called "IDLink" and a text column called "text". How can I put all my text data into one large corpus? It will be used to run a Word2Vec model.
EDIT:
This is not working as expected. I thought I would get, for each row, a list of tokenized words.
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
risk['tokenized_documents'] = risk['text'].apply(tokenizer.tokenize)
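One possible explanation: the Punkt tokenizer loaded above splits text into sentences, not words, so each row would end up as a list of sentence strings rather than word tokens. A sketch of building the corpus and a per-row word list instead (risk and the column names are taken from the question; the rest is an assumption):

import nltk

# one big corpus string, analogous to corpus_raw in the tutorial
corpus_raw = " ".join(risk['text'].astype(str))

# a list of word tokens per row
risk['tokenized_documents'] = risk['text'].astype(str).apply(nltk.word_tokenize)

gensim's Word2Vec generally expects an iterable of token lists rather than one giant string, so something like risk['tokenized_documents'].tolist() would be the usual input there.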
I am working on a script to clean up a .txt file, create a list, count the frequencies of unique words, and output a .csv file with the frequencies. I would like to open multiple files and combine them while still outputting a single .csv file.
Would it be more efficient to write code that would combine the text across the .txt files first or to read/clean all the unique files and combine the lists/dictionaries afterwards? What would the syntax look like for the optimal scenario?
I have been trying to research it on my own but have very limited coding skills and can't seem to find an answer that fits my specific question. I appreciate any and all input. Thanks!
import re

filename = 'testtext.txt'
file = open(filename, 'rt')
text = file.read()
file.close()

words = re.split(r'\W+', text)
words = [word.lower() for word in words]

import string
table = str.maketrans('', '', string.punctuation)
stripped = [w.translate(table) for w in words]

from collections import Counter
countlist = Counter(stripped)

import csv
w = csv.writer(open("testtext.csv", "w"))
for key, val in countlist.items():
    w.writerow([key, val])
If you want to count the word frequencies for multiple files and output them into one CSV file, you won't need to change much in your code; just add a loop, e.g.:
import re
import string
from collections import Counter
import csv

files = ['testtext.txt', 'testtext2.txt', 'testtext3']

stripped = []
for filename in files:
    file = open(filename, 'rt')
    text = file.read()
    file.close()

    words = re.split(r'\W+', text)
    words = [word.lower() for word in words]

    table = str.maketrans('', '', string.punctuation)
    stripped += [w.translate(table) for w in words]  # concatenating parsed data

countlist = Counter(stripped)

w = csv.writer(open("testtext.csv", "w"))
for key, val in countlist.items():
    w.writerow([key, val])
I don't know if this is the most optimal way of doing it.
It will depend on factors like how big the files are, how many files you want to parse, and how frequently you want to parse x files of y size, etc.
When you have figured that out, you can start thinking of ways to optimize the process.
If you need to calculate the frequencies, it's better to combine the strings from the multiple .txt files first. To gauge the performance, you can record the time with the datetime module at the beginning and end of the processing.
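A minimal sketch of that timing idea (process_all_files is just a placeholder for whichever combining/counting approach you choose):

from datetime import datetime

start = datetime.now()
process_all_files()                      # placeholder for the actual combining and counting work
print("Processing took", datetime.now() - start)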
I am new to Python and I have been trying to do some NLP on various .json files inside a folder. I have managed to get and print each entry from the dictionary separately, using the 'article' key to get the description value. The thing is, every time the loop executes I save the new data value to the same variable, body1. What I find particularly difficult for some reason is saving each data entry (each article's description) in a two-dimensional matrix, or a table of dictionaries if you wish, so that all the entries are available for future use. Something like:
body1 = ['file_name', 'description',
'file_name', 'description',
'file_name', 'description']
So if I need to, I will be able to print the second file's description using body1[name][description]. Right now, the data from the previous iteration is lost on every iteration. I think my C-conditioned way of thinking does not let me see the answer to this. I would appreciate any ideas.
Thank you in advance,
George
import os
import glob
import json
import nltk
from nltk.corpus import stopwords
from nltk import PorterStemmer

stop = stopwords.words('english')
stemmer = PorterStemmer()

for name in glob.glob('/Users/jorjis/Desktop/test/*'):
    jfile = open(name, 'r')
    values = json.load(jfile)
    jfile.close()

    body1 = values['article']['description']
    tokens = nltk.wordpunct_tokenize(body1)
    tokens = [w.lower() for w in tokens]
    vocab = [word for word in tokens if word not in stop]
    print body1
You need to create a list outside the loop and append the values to it.
final = []  # add values you want saved to final
uniq_ident = 1

for name in glob.glob('/Users/jorjis/Desktop/test/*'):
    jfile = open(name, 'r')
    values = json.load(jfile)
    jfile.close()

    body1 = values['article']['description']
    tokens = nltk.wordpunct_tokenize(body1)
    tokens = [w.lower() for w in tokens]
    vocab = [word for word in tokens if word not in stop]
    final.append([uniq_ident, vocab])  # append vocab or whatever values you want to keep
    uniq_ident += 1
    print body1
You can also make final a dict with final = {} and use final[uniq_ident] = vocab.
If you want to keep final a list and append a dict each time use:
final.append({uniq_ident:vocab})
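Since the original goal was to look descriptions up by file name, another variation (just a sketch) is to key the dict by the file name itself instead of a counter:

import glob
import json
import os

final = {}
for name in glob.glob('/Users/jorjis/Desktop/test/*'):
    with open(name, 'r') as jfile:
        values = json.load(jfile)
    final[os.path.basename(name)] = values['article']['description']

# later: final['some_file.json'] gives that article's description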