Python NLP on a list of texts and a frequency table

As titled, I have several txt files. I first need to split each of them into batches of fewer than 10,000 words each, then process each batch and print a frequency table for the whole set of txt files. My attempt:
histogram = {}
results2 = [os.path.basename(filename) for filename in glob.glob(path + '*.txt')]
for filename in results2:
    with open(path + filename, 'r', encoding='utf-8') as f:
        text = f.read()
        batches = split_into_batches(text, 10000)
        for single_batch in batches:
            parsed = nlp(single_batch)
            for token in parsed:
                original_token_text = token.orth_
                if original_token_text not in histogram:
                    histogram[original_token_text] = 1
                else:
                    histogram[original_token_text] += 1
print(histogram)
This code just keeps running without giving any output, even though it works fine on each txt file individually. I need an overall frequency table. Is there any way I can fix it?
Any help will be appreciated!
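In case it helps, here is a minimal sketch of the same aggregation using collections.Counter, assuming the question's split_into_batches helper, a path variable defined as in the question, and a loaded spaCy pipeline (the model name below is only an example); the combined table is printed once, after all files have been processed:
import glob
import os
from collections import Counter

import spacy

nlp = spacy.load("en_core_web_sm")  # example model; use whatever pipeline you already load
histogram = Counter()

for filepath in glob.glob(os.path.join(path, "*.txt")):  # path is assumed to be defined as in the question
    with open(filepath, "r", encoding="utf-8") as f:
        text = f.read()
    # split_into_batches is the helper from the question (not shown here)
    for single_batch in split_into_batches(text, 10000):
        histogram.update(token.orth_ for token in nlp(single_batch))

# one combined frequency table across all files, most frequent tokens first
print(histogram.most_common(20))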

Related

How to cluster different texts from different files?

I would like to cluster texts from different files by their topics. I am using the 20 newsgroup dataset, so there are different categories, and I would like to cluster the texts into these categories with DBSCAN. My problem is how to do this.
At the moment I am saving each text of a file in a dict as a string. Then I remove several characters and words and extract nouns from each dict entry. Then I would like to apply Tf-idf to each dict entry, which works, but how can I pass this to DBSCAN to cluster it into categories?
My text processing and data handling:
counter = 0
dic = {}
for i in range(len(categories)):
    path = Path('dataset/20news/%s/' % categories[i])
    print("Getting files from: %s" % path)
    files = os.listdir(path)
    for f in files:
        with open(path/f, 'r', encoding="latin1") as file:
            data = file.read()
            dic[counter] = data
            counter += 1

if preprocess == True:
    print("processing Data...")
    content = preprocessText(data)

if get_nouns == True:
    content = nounExtractor(content)

tfidf_vectorizer = TfidfVectorizer(stop_words=stop_words, max_features=max_features)
for i in range(len(content)):
    content[i] = tfidf_vectorizer.fit_transform(content[i])
So I would like to pass each text to DBSCAN, and I think it would be wrong to put all the texts in one string because then there would be no way to assign labels to them. Am I right?
I hope my explanation is not too confusing.
Best regards!
EDIT:
for f in files:
    with open(path/f, 'r', encoding="latin1") as file:
        data = file.read()
        all_text.append(data)

tfidf_vectorizer = TfidfVectorizer(stop_words=stop_words, max_features=max_features)
tfidf_vectorizer.fit(all_text)
text_vectors = []
for text in all_text:
    text_vectors.append(tfidf_vectorizer.transform(text))
You should fit the TFIDF vectorizer on the whole training text corpus, then create a vector representation for each text/document on its own by transforming it with the fitted TFIDF vectorizer, and then apply clustering to those vector representations of the documents.
EDIT
A simple edit to your original code would be, instead of the following loop
for i in range(len(content)):
    content[i] = tfidf_vectorizer.fit_transform(content[i])
You could do this
transformed_contents = tfidf_vectorizer.fit_transform(content)
transformed_contents will then contain the vectors that you should run your clustering algorithm against.
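If it helps, here is a rough sketch of the clustering step on top of that. The eps and min_samples values below are only illustrative and need tuning for real data; cosine distance is a common choice for TF-IDF vectors:
from sklearn.cluster import DBSCAN

# transformed_contents is the sparse TF-IDF matrix from fit_transform above
clusterer = DBSCAN(eps=0.7, min_samples=5, metric="cosine")
labels = clusterer.fit_predict(transformed_contents)
# labels[i] is the cluster id assigned to document i; -1 marks noise points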

tensorflow - TFRecordWriter takes up too much memory when writing to a file?

I'm working on a large dataset, which has 306,400 images to be processed.
The thing I'm trying to do is really simple: resize each image and then write it to a .TFRecords file.
However, I get an out-of-memory error.
I can't run the script several times since a .TFRecord file cannot be appended to, so I have to write all the data in one run.
I've tried using several for loops because I thought the memory would be released after each for loop, but it seems I was wrong.
I then tried to use iter() to get iterators, since for dict objects dict.iteritems() can save memory compared to dict.items().
But no magic.
So now I have no idea how to solve the problem.
def gen_records(record_name, img_path_file, label_map):
    writer = tf.python_io.TFRecordWriter(record_name)
    classes = []
    with open(label_map, 'r') as f:
        for l in f.readlines():
            classes.append(l.split(',')[0])

    with open(img_path_file, 'r') as f:
        lines = f.readlines()
        num_images = len(lines)
    print 'total number to be written' + str(num_images)
    print 'start writing...'

    patches = []
    with open(img_path_file, 'r') as f:
        for patch in f.readlines():
            patches.append(patch[:-1])

    cnt = 0
    for patch in patches:
        cnt += 1
        # print '[' + str(cnt) + ' / ' + str(num_images) + ']' + 'writing ' + str()
        img = tf.image.resize_images(np.array(Image.open(patch)), (224, 224),
                                     method=tf.image.ResizeMethod.BILINEAR)
        img_raw = np.array(img).tostring()
        label = classes.index(patch.split('/')[1])
        example = tf.train.Example(features=tf.train.Features(feature={
            'label': _int64_feature(int(label)),
            'image': _bytes_feature(img_raw)
        }))
        writer.write(example.SerializeToString())
    writer.close()
How can I "release" the memory used after every iteration? Or how can I save the memory?
The first thing to try is to load each pic on demand. Delete the lines that read all the image paths into the patches list (lines 15 to 18 of your code) and define the following function outside gen_records:
def generate_patches():
    with open('testfile.txt', 'r') as f:
        for patch in f.readlines():
            yield patch[:-1]
Then replace the for loop definition with
for patch in generate_patches():
    ...
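A slightly more general variant of that generator (hypothetical: it takes the path file as a parameter instead of the hard-coded 'testfile.txt', and iterates over the file object so the full list of lines is never held in memory) could look like this:
def generate_patches(img_path_file):
    with open(img_path_file, 'r') as f:
        # iterating over the file object streams one line at a time
        for line in f:
            yield line.rstrip('\n')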

How to save each loop of a python function to a separate file?

I have a question regarding a small Python program I would like to finish. I have learned a bit about coding, but this is my first real program.
With this program I want to combine 2 text files and insert a Privnote link between them, then save the combined content to a new output file. This function should be looped a predefined number of times, and each loop's output should be saved to a separate file:
This is my code:
import pyPrivnote as pn
import sys

def Text():
    Teil_1 = open("Teil_1.txt", "r")
    Content_1 = Teil_1.read()
    print(Content_1)
    note_link = pn.create_note("Data")
    print(note_link)
    Teil_2 = open("Teil_2.txt", "r")
    Content_2 = Teil_2.read()
    print(Content_2)
The part above works. The next part is where I struggle.
i = 0 + 1
while i <= 3:
    filename = "C:\\Users\\Python\\Datei%d.txt" % i
    f = open(filename, "r")
    Text()
    f.close()
How can I save the output of each loop of the Text() function to a new file?
I would like to save the output to the relative path /output/, and the files should be named "file01", "file02", and so on.
I have searched for several hours now, but I can't find an answer to this problem.
Thanks in advance for your help!
Pass the file to Text():
def Text(out_file):
    Teil_1 = open("Teil_1.txt", "r")
    Content_1 = Teil_1.read()
    out_file.write(Content_1)
    Teil_1.close()
    note_link = pn.create_note("Data")
    out_file.write(note_link)
    Teil_2 = open("Teil_2.txt", "r")
    Content_2 = Teil_2.read()
    out_file.write(Content_2)
    Teil_2.close()
And:
i = 1
while i <= 3:
    filename = "C:\\Users\\Python\\Datei%d.txt" % i
    i += 1
    f = open(filename, "w")  # "rw" is not a valid mode; open the output file for writing
    Text(f)
    f.close()
But note that on every loop iteration you are opening the same 2 input files again. There are easier ways to achieve this if you want to write the content of these files into one single file.
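For the naming scheme asked about in the question (file01, file02, ... under a relative output/ folder), a minimal sketch building on the Text(out_file) version above could look like this; the three iterations mirror the while loop, and the folder is created if it does not exist:
import os

os.makedirs("output", exist_ok=True)
for i in range(1, 4):
    # zero-padded names: output/file01.txt, output/file02.txt, output/file03.txt
    out_path = os.path.join("output", "file%02d.txt" % i)
    with open(out_path, "w") as out_file:
        Text(out_file)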

Downloading Data From .txt file containing URLs with Python again

I am currently trying to fetch the raw data for each of the 10 URLs listed in a .txt file and save the raw data from each line (URL) to its own file. I then want to repeat the process with the processed data (the raw data from the same original .txt file, stripped of its HTML), using Python.
import commands
import os
import json

# RAW DATA
input = open('uri.txt', 'r')
t_1 = open('command', 'w')
counter_1 = 0
for line in input:
    counter_1 += 1
    if counter_1 < 11:
        filename = str(counter_1)
        print str(line)
        filename = str(count)
        command = 'curl ' + '"' + str(line).rstrip('\n') + '"' + ' > ./rawData/' + filename
        output_1 = commands.getoutput(command)
input.close()

# PROCESSED DATA
counter_2 = 0
input = open('uri.txt', 'r')
t_2 = open('command', 'w')
for line in input:
    counter_2 += 1
    if counter_2 < 11:
        filename = str(counter_2) + '-processed'
        command = 'lynx -dump -force_html ' + '"' + str(line).rstrip('\n') + '"' + ' > ./processedData/' + filename
        print command
        output_2 = commands.getoutput(command)
input.close()
I am attempting to do all of this with one script. Can anyone help me refine my code so I can run it? It should loop through the code completely once for each line in the .txt file. For example, I should end up with 1 raw and 1 processed file for every URL line in my .txt file.
Break your code up into functions. Currently the code is hard to read and debug. Make a function called get_raw() and a function called get_processed(). Then for your main loop, you can do
for line in file:
    get_raw(line)
    get_processed(line)
Or something similar. Also, you should avoid using 'magic numbers' like counter < 11. Why is it 11? Is it the number of lines in the file? If it is, you can read the lines into a list and get the count with len() instead of hard-coding it.
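For example, here is one possible shape of those two helpers (hypothetical signatures: they also take the line number to build the file name, and this sketch swaps the Python 2-only commands module for subprocess; the rawData and processedData directories from the question are assumed to exist):
import subprocess

def get_raw(url, index):
    # fetch the raw page with curl into ./rawData/<index>
    with open('./rawData/%d' % index, 'w') as out:
        subprocess.call(['curl', url], stdout=out)

def get_processed(url, index):
    # dump the rendered text with lynx into ./processedData/<index>-processed
    with open('./processedData/%d-processed' % index, 'w') as out:
        subprocess.call(['lynx', '-dump', '-force_html', url], stdout=out)

with open('uri.txt') as f:
    for index, line in enumerate(f, start=1):
        url = line.strip()
        get_raw(url, index)
        get_processed(url, index)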

Write series of strings (plus a number) to a line of csv

It's not pretty code, but I have some code that grabs a series of strings out of an HTML file and gives me a series of strings: author, title, date, length, text. I have 2000+ HTML files and I want to go through all of them and write this data to a single CSV file. I know all of this will have to be wrapped into a for loop eventually, but before then I am having a hard time understanding how to go from getting these values to writing them to a CSV file. My thinking was to create a list or a tuple first and then write that to a line in a CSV file:
the_file = "/Users/john/Code/tedtalks/test/transcript?language=en.0"
holding = soup(open(the_file).read(), "lxml")
at = holding.find("title").text
author = at[0:at.find(':')]
title = at[at.find(":")+1 : at.find("|")]
date = re.sub('[^a-zA-Z0-9]', ' ', holding.select_one("span.meta__val").text)
length_data = holding.find_all('data', {'class': 'talk-transcript__para__time'})
(m, s) = ([x.get_text().strip("\n\r")
           for x in length_data if re.search(r"(?s)\d{2}:\d{2}",
                                             x.get_text().strip("\n\r"))][-1]).split(':')
length = int(m) * 60 + int(s)
firstpass = re.sub(r'\([^)]*\)', '', holding.find('div', class_='talk-transcript__body').text)
text = re.sub('[^a-zA-Z\.\']', ' ', firstpass)
data = ([author].join() + [title] + [date] + [length] + [text])
with open("./output.csv", "w") as csv_file:
    writer = csv.writer(csv_file, delimiter=',')
    for line in data:
        writer.writerow(line)
I can't for the life of me figure out how to get Python to respect the fact that these are strings and should be stored as strings and not as lists of letters. (The .join() above is me trying to figure this out.)
Looking ahead: is it better/more efficient to handle 2000 files this way, stripping them down to what I want and writing one line of the CSV at a time, or is it better to build a data frame in pandas and then write that to CSV? (All 2000 files = 160MB, so stripped down the eventual data can't be more than 100MB. No great size here, but looking forward, size may eventually become an issue.)
This will grab all the files and put the data into a CSV; you just need to pass the path to the folder that contains the HTML files and the name of your output file:
import re
import csv
import os
from bs4 import BeautifulSoup
from glob import iglob

def parse(soup):
    # both title and author can be parsed from separate tags.
    author = soup.select_one("h4.h12.talk-link__speaker").text
    title = soup.select_one("h4.h9.m5").text
    # just need to strip the text from the date string, no regex needed.
    date = soup.select_one("span.meta__val").text.strip()
    # we want the last time, which is the talk-transcript__para__time previous to the footer.
    mn, sec = map(int, soup.select_one("footer.footer").find_previous("data", {
        "class": "talk-transcript__para__time"}).text.split(":"))
    length = (mn * 60 + sec)
    # to ignore times etc. we can just pull from the actual text fragments and remove noise, i.e. (Applause).
    text = re.sub(r'\([^)]*\)', "", " ".join(d.text for d in soup.select("span.talk-transcript__fragment")))
    return author.strip(), title.strip(), date, length, re.sub('[^a-zA-Z\.\']', ' ', text)

def to_csv(patt, out):
    # open file to write to.
    with open(out, "w") as out:
        # create csv.writer.
        wr = csv.writer(out)
        # write our headers.
        wr.writerow(["author", "title", "date", "length", "text"])
        # get all our html files.
        for html in iglob(patt):
            with open(html) as f:
                # parse the file and write the data to a row.
                wr.writerow(parse(BeautifulSoup(f, "lxml")))

to_csv("./test/*.html", "output.csv")
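If the pandas route mentioned in the question is preferred over writing row by row, a rough equivalent (reusing the same parse() function and glob pattern) could collect the tuples and write them in one go:
import pandas as pd

rows = []
for html in iglob("./test/*.html"):
    with open(html) as f:
        rows.append(parse(BeautifulSoup(f, "lxml")))

# build one data frame with the same columns as the CSV header above
df = pd.DataFrame(rows, columns=["author", "title", "date", "length", "text"])
df.to_csv("output.csv", index=False)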
