split() issues with pdf extractText() - python

I'm working on a small content analysis program that I was hoping to run over several PDF files and return the total frequency with which some specific words are mentioned in the text. The words to search for are specified in a separate text file (list.txt) and can be altered. The program runs just fine on files in .txt format, but the result is completely different when running it on a .pdf file. To illustrate, the test text that I have the program running through is the following:
"Hello
This is a product development notice
We’re working with innovative measures
A nice Innovation
The world that we live in is innovative
We are currently working on a new process
And in the fall, you will experience our new product development introduction"
The list of words, grouped into categories, is the following (categories are marked in the .txt file with ">>"):
innovation: innovat
product: Product, development, introduction
organization: Process
The output from running the code with a .txt file is the following:
Whereas the output from running it on a .pdf is the following:
As you can see, my issue pertains to the splitting of the words: in the .pdf output a string like "world" can be split into 'w', 'o', 'rld'. I have searched tirelessly for why this happens, without success. As I am rather new to Python programming, I would appreciate any answer, or a pointer to where I can find an answer to why this happens, should you know of any source.
Thanks
The code for the .txt is as follows:
import string, re, os
import PyPDF2

dictfile = open('list.txt')
lines = dictfile.readlines()
dictfile.close()

dic = {}
scores = {}

i = 2011
while i < 2012:
    f = 'annual_report_' + str(i) + '.txt'
    textfile = open(f)
    text = textfile.read().split() # lowercase the text
    print(text)
    textfile.close()
    i = i + 1

# a default category for simple word lists
current_category = "Default"
scores[current_category] = 0

# import the dictionary
for line in lines:
    if line[0:2] == '>>':
        current_category = line[2:].strip()
        scores[current_category] = 0
    else:
        line = line.strip()
        if len(line) > 0:
            pattern = re.compile(line, re.IGNORECASE)
            dic[pattern] = current_category

# examine the text
for token in text:
    for pattern in dic.keys():
        if pattern.match(token):
            categ = dic[pattern]
            scores[categ] = scores[categ] + 1

print(os.path.basename(f))
for key in scores.keys():
    print(key, ":", scores[key])
While the code for the .pdf is as follows:
import string, re, os
import PyPDF2

dictfile = open('list.txt')
lines = dictfile.readlines()
dictfile.close()

dic = {}
scores = {}

i = 2011
while i < 2012:
    f = 'annual_report_' + str(i) + '.pdf'
    textfile = open(f, 'rb')
    text = PyPDF2.PdfFileReader(textfile) # lowercase the text
    for pageNum in range(0, text.numPages):
        texts = text.getPage(pageNum)
        textfile = texts.extractText().split()
        print(textfile)
    i = i + 1

# a default category for simple word lists
current_category = "Default"
scores[current_category] = 0

# import the dictionary
for line in lines:
    if line[0:2] == '>>':
        current_category = line[2:].strip()
        scores[current_category] = 0
    else:
        line = line.strip()
        if len(line) > 0:
            pattern = re.compile(line, re.IGNORECASE)
            dic[pattern] = current_category

# examine the text
for token in textfile:
    for pattern in dic.keys():
        if pattern.match(token):
            categ = dic[pattern]
            scores[categ] = scores[categ] + 1

print(os.path.basename(f))
for key in scores.keys():
    print(key, ":", scores[key])
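One possible workaround, offered as a sketch rather than a fix from the question itself: extractText() in older PyPDF2 versions rebuilds the text from the PDF content stream and does not reliably preserve spacing, which is where fragments like 'w', 'o', 'rld' come from. Assuming pdfminer.six is installed (pip install pdfminer.six), the extraction could be swapped for that library while keeping the rest of the scoring loop unchanged:
from pdfminer.high_level import extract_text

# Sketch only: extract the whole document's text with pdfminer.six,
# then split it the same way the .txt version of the code does.
text = extract_text('annual_report_2011.pdf').split()
# 'text' can then go through the same "# examine the text" loop as above.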

Related

Is there a way to output a link to a file with Python?

I have some code to sort a text and output info on it.
How it works is that you copy a text, paste it into a text (.txt) file, and save that file where the Python file is saved. Then you go into the command prompt and type python3 the_name_of_the_python_file.py the_name_of_the_text_file.txt. When you run it, it outputs "All counted!". After that you have a new .txt file, saved where the Python file is, that tells you the number of words and unique words in the text file you attached. The new file will also list the words from most used to least used.
Is there a way to get my code to output "All counted!" and then something like a link that I can click on to open the new file?
Here is my code:
import sys

text_file = open(sys.argv[1], "r")
word_list = text_file.read().split(",")
word_list = "".join(word_list)
word_list = word_list.split(".")
word_list = "".join(word_list)
word_list = word_list.split(" ")
file_name = []
file_name = sys.argv[1].split(".")
text_file.close()

NumWords = 0
NumUniqueWords = 0
Words = {}
for i in word_list:
    if i not in Words.keys():
        NumWords += 1
        NumUniqueWords += 1
        Words[i.lower()] = 1
    else:
        NumWords += 1
        Words[i] += 1

def get_key(val):
    for key, value in Words.items():
        if value == val:
            return key

newfile = open(file_name[0] + "-count.txt", "w")
newfile.write("Total Words - {}\nUnique Words - {}\n\n".format(NumWords, NumUniqueWords))
for i in range(len(Words)):
    newfile.write("{} - {}\n".format(get_key(max(Words.values())), max(Words.values())))
    del(Words[get_key(max(Words.values()))])
newfile.close()
print("All counted!")
I do have things in my code to eliminate ","'s and "."'s and the same word capitalized or lowercase.
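A minimal sketch of one way to do this, assuming the terminal or IDE renders file:// URIs as clickable links (whether it does depends on the environment, not on Python):
import pathlib

# file_name comes from the code above: sys.argv[1].split(".")
out_file = pathlib.Path(file_name[0] + "-count.txt").resolve()
print("All counted!")
print(out_file.as_uri())   # e.g. file:///home/user/mytext-count.txt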

Python program that reads a text file for input?

So I'm currently trying to write some code that opens and reads a text file. The text file contains a short paragraph. Within the paragraph, there are some words with brackets around them, which could look like: "the boy [past_tense_verb] into the wall." I am trying to write code that looks for the brackets in the text file, and then displays to the user the words in the text file, for the user to then write some input that will replace the bracketed words. This is the code I have so far:
f = open('madlib.txt', 'r')
for line in f:
    start = line.find('[')+1
    end = line.find(']')+1
    word = line[start:end+1]
    inputword = input('Enter a ' + word + ': ')
    print(line[:start] + inputword + line[end:])
Any help is greatly appreciated - thanks!
import re

with open('madlib.txt', 'r') as f:
    data = f.read()

words_to_replace = re.findall(r"\[(\w+)\]", data)

replace_with = []
for idx, i in enumerate(words_to_replace):
    print(f"Type here replace \033[1;31m{i}\033[1;m with:", end=" ")
    a = input()
    replace_with.append(a)

for idx, i in enumerate(replace_with):
    # replace the whole bracketed placeholder, brackets included
    data = data.replace("[" + words_to_replace[idx] + "]", i)

with open('newmadlib.txt', 'w') as f:
    f.write(data)
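As a quick illustration of what the findall pattern captures, using the example sentence from the question:
import re

line = "the boy [past_tense_verb] into the wall."
print(re.findall(r"\[(\w+)\]", line))   # ['past_tense_verb']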

Speed up the write-to-different-files process

I am reading from a huge file (232MB) line by line.
First, I recognize each line according to a regular expression.
Then, for each line, I write to a different city.txt file under the 'report' directory, according to the city name in that line. However, this process takes a while. I am wondering if there is any way of speeding it up?
Example of input file: (each column split by a \t)
2015-02-03 19:20 Sane Diebgo Music 692.08 Cash
Actually, I have tested the code both with and without writing to the different files (simply processing the large file and building the two dicts), and the time difference is huge: 80% of the time is spent writing to the different files.
def processFile(file):
    pattern = re.compile(r"(\d{4}-\d{2}-\d{2})\t(\d{2}:\d{2})\t(.+)\t(.+)\t(\d+\.\d+|\d+)\t(\w+)\n")
    f = open(file)
    total_sale = 0
    city_dict = dict()
    categories_dict = dict()
    os.makedirs("report", exist_ok = True)
    for line in f:
        valid_entry = pattern.search(line)
        if valid_entry == None:
            print("Invalid entry: '{}'".format(line.strip()))
            continue
        else:
            entry_sale = float(valid_entry.group(5))
            total_sale += entry_sale
            city_dict.update({valid_entry.group(3) : city_dict.get(valid_entry.group(3), 0) + entry_sale})
            categories_dict.update({valid_entry.group(4) : categories_dict.get(valid_entry.group(4), 0) + entry_sale})
            filename = "report/" + valid_entry.group(3) + ".txt"
            if os.path.exists(filename):
                city_file = open(filename, "a")
                city_file.write(valid_entry.group(0))
                city_file.close()
            else:
                city_file = open(filename, "w")
                city_file.write(valid_entry.group(0))
                city_file.close()
    f.close()
    return (city_dict, categories_dict, total_sale)
The dictionary lookups and updates could be improved by using defaultdict:
from collections import defaultdict
city_dict = defaultdict(float)
categories_dict = defaultdict(float)
...
city = valid_entry.group(3)
category = valid_entry.group(4)
...
city_dict[city] += entry_sale
categories_dict[category] += entry_sale
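The defaultdict change only speeds up the in-memory totals. Since the question reports that about 80% of the time goes into opening, appending to, and closing a city file for every single line, a further sketch (an assumption, not part of the original answer) is to buffer the matched lines per city and write each city file exactly once after the loop:
import os
from collections import defaultdict

city_lines = defaultdict(list)   # buffer: city name -> list of matched lines

# inside the existing per-line loop, instead of open/write/close per line:
#     city_lines[valid_entry.group(3)].append(valid_entry.group(0))

# after the loop, write each city file once
os.makedirs("report", exist_ok=True)
for city, entries in city_lines.items():
    with open(os.path.join("report", city + ".txt"), "w") as city_file:
        city_file.writelines(entries)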

How to create table to find mean of document using python

I have a directory containing corpus text files. I want to create a table of word counts: one row per document and one column per unique word, where each cell holds the count of that word in that document. All of this should be done in Python. Please help, thank you.
The table should look like this:
        word1  word2  word3  ...
doc1       14      5     45
doc2        6      1      0
.
.
.
import nltk
import collections
import os.path

def cleanDoc(doc):
    stopset = set(nltk.corpus.stopwords.words('english'))
    stemmer = nltk.PorterStemmer()
    tokens = nltk.WordPunctTokenizer().tokenize(doc)
    clean = [token.lower() for token in tokens if token.lower() not in stopset and len(token) > 2]
    final = [stemmer.stem(word) for word in clean]
    return final

path = "c://Users/Desktop/corpus files"
i = 0
for file in os.listdir(path):
    f = open("c://Users/Desktop/corpus files/file%d.txt" % i, 'r')
    data = f.read()
    words = cleanDoc(data)
    fw = open("c://Users/Desktop/words/words%d.txt" % i, 'w')
    fd = collections.Counter(words)
    #fd = nltk.FreqDist(words)
    #plot(fd)
    row_format = "{:>15}" * (len(words) + 1)
    print row_format.format("document %d" % i, *words)
    #for
    fw.write(str(fd))
    fw.write(str(words))
    fw.close()
    i = i + 1
    f.close()
I think this is fairly close to, if not exactly, what you want. In case it isn't, I tried to make things easy to change.
To produce the desired table, processing is done in two phases. In the first, the unique words in each document file of the form file<document-number>.txt are found and saved in a corresponding words<document-number>.txt file, and they are also added to a set comprising all the unique words seen among all the document files. This set is needed to produce table columns that consist of all the unique words in all the files, and is why two phases of processing are required.
In the second phase, the word files are read back in and turned back into dictionaries, which are used to fill in the corresponding columns of the table being printed.
import ast
import collections
import nltk
import re
import os

user_name = "UserName"
path = "c://Users/%s/Desktop/corpus files" % user_name

def cleanDoc(doc):
    stopset = set(nltk.corpus.stopwords.words('english'))
    stemmer = nltk.PorterStemmer()
    tokens = nltk.WordPunctTokenizer().tokenize(doc)
    clean = [token.lower() for token in tokens
                 if token.lower() not in stopset and len(token) > 2]
    final = [stemmer.stem(word) for word in clean]
    return final

# phase 1 -- find unique words, create word files, update overall unique word set
corpus_file_pattern = re.compile(r"""file(\d+).txt""")
unique_words = set()
longest_filename = 0
document_nums = []

for filename in os.listdir(path):
    corpus_file_match = corpus_file_pattern.match(filename)
    if corpus_file_match:  # corpus text file?
        if len(filename) > longest_filename:
            longest_filename = len(filename)
        document_num = int(corpus_file_match.group(1))
        document_nums.append(document_num)
        with open(os.path.join(path, filename)) as file:
            data = file.read()
        words = cleanDoc(data)
        unique_words.update(words)
        fd = collections.Counter(words)
        words_filename = "words%d.txt" % document_num
        with open(os.path.join(path, words_filename), mode = 'wt') as fw:
            fw.write(repr(dict(fd)) + '\n')  # write representation as dict

# phase 2 -- create table using unique_words and data in word files
unique_words_list = sorted(unique_words)
unique_words_empty_counter = collections.Counter({word: 0 for word
                                                      in unique_words})
document_nums = sorted(document_nums)

padding = 2  # spaces between columns
min_col_width = 5
col_headings = ["Document"] + unique_words_list
col_widths = [max(min_col_width, len(word))+padding for word in col_headings]
col_widths[0] = longest_filename+padding  # first col is special case

# print table headings
for i, word in enumerate(col_headings):
    print "{:{align}{width}}".format(word, align='>' if i else '<',
                                     width=col_widths[i]),
print

for document_num in document_nums:
    # read word in document dictionary back in
    filename = "words%d.txt" % document_num
    file_words = unique_words_empty_counter.copy()
    with open(os.path.join(path, filename)) as file:
        data = file.read()
    # convert data read into dict and update with file word counts
    file_words.update(ast.literal_eval(data))
    # print row of data
    print "{:<{width}}".format(filename, width=col_widths[0]),
    for i, word in enumerate(col_headings[1:], 1):
        print "{:>{width}n}".format(file_words[word], width=col_widths[i]),
    print
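For clarity, the hand-off between the two phases is just a dict written out with repr() and read back with ast.literal_eval(); a tiny round-trip example (the file name here is only illustrative):
import ast
import collections

fd = collections.Counter(['cat', 'dog', 'dog'])
with open('words0.txt', 'w') as fw:
    fw.write(repr(dict(fd)) + '\n')      # writes e.g. {'dog': 2, 'cat': 1}

with open('words0.txt') as fr:
    restored = ast.literal_eval(fr.read())
print(restored['dog'])                   # 2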

Cutting character values according to value from file

This is what I am doing:
import csv

output = open('output.txt', 'wb')

# this function returns the min for num.txt
def get_min(num):
    return int(open('%s.txt' % num, 'r+').readlines()[0])

# temporary variables
last_line = ''
input_list = []

# iterate over input.txt and sort the input into a list of tuples
for i, line in enumerate(open('input.txt', 'r+').readlines()):
    if i % 2 == 0:
        last_line = line
    else:
        input_list.append((last_line, line))

filtered = [(header, data[:get_min(header[-2])] + '\n') for (header, data) in input_list]
[output.write(''.join(data)) for data in filtered]
output.close()
In this code input.txt is something like this
>012|013|0|3|M
AFDSFASDFASDFA
>005|5|67|0|6
ACCTCTGACC
>029|032|4|5|S
GGCAGGGAGCAGGCCTGTA
and num.txt is something like this
M 4
P 10
I want the script to look at the last column of each header line in input.txt, find the corresponding value in num.txt, and cut the characters of the following sequence according to that value.
I think the error in my code is that it only accepts a text file containing an integer, whereas it should also accept a file that contains letters.
The totally revised version, after a long chat with the OP;
import os
import re

# Fetch all hashes and counts
file_c = open('num.txt')
file_c = file_c.read()

lines = re.findall(r'\w+\.txt \d+', file_c)

numbers = {}
for line in lines:
    line_split = line.split('.txt ')
    hash_name = line_split[0]
    count = line_split[1]
    numbers[hash_name] = count

#print(numbers)

# The input file
file_i = open('input.txt')
file_i = file_i.read()

for hash_name, count in numbers.iteritems():
    regex = '(' + hash_name.strip() + ')'
    result = re.findall(r'>.*\|(' + regex + ')(.*?)>', file_i, re.S)
    if len(result) > 0:
        data_original = result[0][2]
        stripped_data = result[0][2][int(count):]
        file_i = file_i.replace(data_original, '\n' + stripped_data)
        #print(data_original)
        #print(stripped_data)

#print(file_i)

# Write the input file to new input_new.txt
f = open('input_new.txt', 'wt')
f.write(file_i)
You can do it like so;
import re

min_count = 4               # this variable will contain that count integer from where to start removing
str_to_match = 'EOG6CC67M'  # this variable will contain the filename you read
input = ''                  # The file input (input.txt) will go in here
counter = 0

def callback_f(e):
    global min_count
    global counter
    counter += 1
    # Check your input
    print(str(counter) + ' >>> ' + e.group())
    # Only replace the value with nothing (remove it) after a certain count
    if counter > min_count:
        return ''        # replace with nothing
    return e.group()     # otherwise keep the match unchanged

result = re.sub(r'' + str_to_match, callback_f, input)
With this tactic you can keep count with a global counter and there's no need to do hard line-loops with complex structures.
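A self-contained toy run of the same tactic (made-up data, just to show the counting behaviour):
import re

counter = 0
min_count = 2

def callback_f(e):
    global counter
    counter += 1
    # keep the first min_count matches, remove the rest
    return e.group() if counter <= min_count else ''

print(re.sub(r'AB', callback_f, 'ABxABxABxAB'))   # prints ABxABxx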
Update
More detailed version with file access;
import os
import re

def callback_f(e):
    global counter
    counter += 1
    # Check your input
    print(str(counter) + ' >>> ' + e.group())
    # Keep the match unchanged; replace/remove here based on what you want to do
    return e.group()

# Fetch all hash-file names and their content (count)
num_files = os.listdir('./num_files')
numbers = {}
for file in num_files:
    if file[0] != '.':
        file_c = open('./num_files/' + file)
        file_c = file_c.read()
        numbers[file.split('.')[0]] = file_c

# Now the CSV files
csv_files = os.listdir('./csv_files')
for file in csv_files:
    if file[0] != '.':
        for hash_name, min_count in numbers.iteritems():
            file_c = open('./csv_files/' + file)
            file_c = file_c.read()
            counter = 0
            result = re.sub(r'' + hash_name, callback_f, file_c)
            # Write the replaced content back to the file here
Considered directory/file structure;
+ Projects
+ Project_folder
+ csv_files
- input1.csv
- input2.csv
~ etc.
+ num_files
- EOG6CC67M.txt
- EOG62JQZP.txt
~ etc.
- python_file.py
The CSV files contain the big chunks of text you state in your original question.
The num files contain the hash files, each with an integer in them
What happens in this script;
Collect all hash files (in a dictionary) and their inner count numbers
Loop through all CSV files
Subloop through the collected numbers for each CSV file
Replace/remove (based on what you do in callback_f()) hashes after a certain count
Write the output back (it's the last comment in the script and would contain the file.write() functionality; see the sketch below)
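A minimal sketch of that write-back step, reusing the file and result names from the loop above (the output file naming is only an assumption):
# hypothetical write-back: save the processed content next to the original CSV under a new name
out_path = './csv_files/' + file.rsplit('.', 1)[0] + '_cut.csv'
with open(out_path, 'w') as out_file:
    out_file.write(result)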
