Adding rows and columns to a pandas DataFrame in multiple loops - python

I am trying to make a simple tool which can look for keywords (from multiple txt files) in multiple PDFs. In the end, I would like it to produce a report in the following form:
Name of the pdf document | Keyword document 1 | Keyword document ... | Keyword document x
PDF 1                    | 1                  | 546                  | 77
PDF ...                  | 3                  | 8                    | 8
PDF x                    | 324                | 23                   | 34
Where the numbers represent the total number of occurrences of all keywords from the keyword document in that particular file.
This is what I have so far; the function can successfully locate, count, and relate the summed keywords to the document:
import fitz
import glob

def keyword_finder():
    # access all PDFs from current directory
    for pdf_file in glob.glob('*.pdf'):
        # open files using PyMuPDF
        document = fitz.open(pdf_file)
        # count the number of pages in document
        document_pages = document.page_count
        # access all txt files (these contain the keywords)
        for text_file in glob.glob('*.txt'):
            # empty list to store the results
            occurrences_sdg = []
            # open keywords file
            inputs = open(text_file, 'r')
            # read txt file
            keywords_list = inputs.read()
            # split the words by an 'enter'
            keywords_list_separated = keywords_list.split('\n')
            for keyword in keywords_list_separated[1:-1]:  # omit first and last entry
                occurrences_keyword = []
                # read in page by page
                for page in range(0, document_pages):
                    # load in text from i page
                    text_per_page = document.load_page(page)
                    # search for keywords on the page, and sum all occurrences
                    keyword_sum = len(text_per_page.search_for(keyword))
                    # add occurrences from each page to list per keyword
                    occurrences_keyword.append(keyword_sum)
                # sum all occurrences of a keyword in the document
                occurrences_sdg.append(sum(occurrences_keyword))
            if sum(occurrences_sdg) > 0:
                print(f'{pdf_file} has {sum(occurrences_sdg)} keyword(s) from {text_file}\n')
I did try using pandas, and I believe it is still the best choice. The number of loops makes it difficult for me to decide at which point the "skeleton" DataFrame should be created and when the results should be added. The final goal is to have the produced report saved as a CSV.
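One way to sidestep that decision (a minimal sketch, my suggestion rather than a definitive answer) is to collect everything into a nested dictionary inside the loops and build the DataFrame only once, after all loops are done. The function name keyword_finder_report and the output file name keyword_report.csv are placeholders.

import fitz
import glob
import pandas as pd

def keyword_finder_report():
    results = {}  # {pdf filename: {keyword-document filename: summed occurrences}}
    for pdf_file in glob.glob('*.pdf'):
        document = fitz.open(pdf_file)
        results[pdf_file] = {}
        for text_file in glob.glob('*.txt'):
            with open(text_file, 'r') as inputs:
                keywords = inputs.read().split('\n')[1:-1]  # omit first and last entry
            # sum the occurrences of every keyword over every page
            results[pdf_file][text_file] = sum(
                len(page.search_for(keyword))
                for page in document
                for keyword in keywords
            )
    # rows = PDFs, columns = keyword documents, cells = summed occurrences
    report = pd.DataFrame.from_dict(results, orient='index')
    report.to_csv('keyword_report.csv')
    return report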

Related

Using Python and regex to find strings and put them together to replace the filename of a .pdf - the rename fails when using more than one group

I have several thousand pdfs which I need to re-name based on the content. The layouts of the pdfs are inconsistent. To re-name them I need to locate a specific string "MEMBER". I need the value after the string "MEMBER" and the values from the two lines above MEMBER, which are Time and Date values respectively.
So:
STUFF
STUFF
STUFF
DD/MM/YY
HH:MM:SS
MEMBER ######
STUFF
STUFF
STUFF
I have been using regex101.com and have ((.*(\n|\r|\r\n)){2})(MEMBER.\S+) which matches all of the values I need. But it puts them across four groups with group 3 just showing a carriage return.
What I have so far looks like this:
import fitz
from os import DirEntry, curdir, chdir, getcwd, rename
from glob import glob as glob
import re

failed_pdfs = []
count = 0
pdf_regex = r'((.*(\n|\r|\r\n)){2})(MEMBER.\S+)'
text = ""
get_curr = getcwd()
directory = 'PDF_FILES'
chdir(directory)
pdf_list = glob('*.pdf')
for pdf in pdf_list:
    with fitz.open(pdf) as pdf_obj:
        for page in pdf_obj:
            text += page.get_text()
    new_file_name = re.search(pdf_regex, text).group().strip().replace(":", "").replace("-", "") + '.pdf'
    text = ""
    #clean_new_file_name = new_file_name.translate({(":"): None})
    print(new_file_name)
    # Tries to rename a pdf. If the filename doesn't already exist
    # then rename. If it does exist then throw an error and add to failed list
    try:
        rename(pdf, new_file_name)
    except WindowsError:
        count += 1
        failed_pdfs.append(str(count) + ' - FAILED TO RENAME: [' + pdf + " ----> " + str(new_file_name) + "]")
If I specify a group in the re.search portion, for instance group 4, which contains the MEMBER ##### value, then the file renames successfully with just that value. Similarly, group 2 renames with the TIME value. I think the multiple lines are preventing it from using all of the values I need. When I run it with group(), the print value shows as
DATE
TIME
MEMBER ######.pdf
And the log count reflects the failures.
I am very new at this, and stumbling around trying to piece together how things work. Is the issue with how I built the regex or with the re.search portion? Or everything?
I have tried re-doing the Regular Expression, but I end up with multiple lines in the results, which seems connected to the rename failure.
The strategy is to read the page's text as words and sort them adequately.
If we then find "MEMBER", the word following it is the hashes ####, and the two words before it must be the time and the date, respectively.
found = False
for page in pdf_obj:
    words = page.get_text("words", sort=True)
    # all space-delimited strings on the page, sorted vertically,
    # then horizontally
    for i, word in enumerate(words):
        if word[4] == "MEMBER":
            hashes = words[i + 1][4]  # this must be the word after MEMBER!
            time_string = words[i - 1][4]  # the time
            date_string = words[i - 2][4]  # the date
            found = True
            break
    if found:  # no need to look in other pages
        break
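As a small follow-on (my addition, not part of the answer above): once date_string, time_string and hashes are set, the new file name can be assembled roughly the way the question's own code does, stripping the characters that are not allowed in Windows file names. The ordering of the pieces is an assumption.

# order of the pieces is an assumption; adjust to the naming scheme you need
new_file_name = f"{date_string} {time_string} {hashes}".replace(":", "").replace("-", "") + ".pdf"
rename(pdf, new_file_name)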

How to extract multiple instances of a word from PDF files on python?

I'm writing a script in Python to read a PDF file and record both the string that appears after every instance where "time" is mentioned, as well as the page number it's mentioned on.
I have gotten it to recognize when each page has the string "time" on it and send me the page number, however if the page has "time" more than once, it does not tell me. I'm assuming this is because it has already fulfilled the criteria of having the string "time" on it at least once, and therefore it skips to the next page to perform the check.
How would I go about finding multiple instances of the word "time"?
This is my code:
import PyPDF2

def pdf_read():
    pdfFile = "records\document.pdf"
    pdf = PyPDF2.PdfFileReader(pdfFile)
    pageCount = pdf.getNumPages()
    for pageNumber in range(pageCount):
        page = pdf.getPage(pageNumber)
        pageContent = page.extractText()
        if "Time" in pageContent or "time" in pageContent:
            print(pageNumber)
Also, as a side note, this pdf is a scanned document, and therefore when I read the text in Python (or copy and paste into Word) there are a lot of words which come out as random symbols and characters even though it's perfectly legible. Is this a limitation of computer programming without applying more complex concepts such as machine learning in order to read the files accurately?
A solution would be to create a list of strings off pageContent and count the frequency of the word 'time' in the list. It is also easier to select the word following 'time' - you can simply retrieve the next item in the list:
import PyPDF2
import string

pdfFile = "records\document.pdf"
pdf = PyPDF2.PdfFileReader(pdfFile)
pageCount = pdf.getNumPages()
for pageNumber in range(pageCount):
    page = pdf.getPage(pageNumber)
    pageContent = page.extractText()
    pageContent = ''.join(pageContent.splitlines()).split()  # words to list
    pageContent = ["".join(j.lower() for j in i if j not in string.punctuation) for i in pageContent]  # remove punctuation
    print(pageContent.count('time') + pageContent.count('Time'))  # count occurrences of time in list
    print([(j, pageContent[i+1] if i+1 < len(pageContent) else '') for i, j in enumerate(pageContent) if j == 'Time' or j == 'time'])  # list time and following word
Note that this example also strips all words from characters that are not letters or digits. Hopefully this sufficiently cleans up the bad OCR.

Splitting a docx by headings into separate files in Python

I want to write a program that grabs my docx files, iterates through them and splits each file into multiple separate files based on headings. Inside each of the docx there are a couple of articles, each with a 'Heading 1' and text underneath it.
So if my original file1.docx has 4 articles, I want it to be split into 4 separate files each with its heading and text.
I got to the part where it iterates through all of the files in the path where I hold the .docx files, and I can read the headings and text separately, but I can't seem to figure out how to merge it all and split it into separate files, each with its heading and text. I am using the python-docx library.
import glob
from docx import Document

headings = []
texts = []

def iter_headings(paragraphs):
    for paragraph in paragraphs:
        if paragraph.style.name.startswith('Heading'):
            yield paragraph

def iter_text(paragraphs):
    for paragraph in paragraphs:
        if paragraph.style.name.startswith('Normal'):
            yield paragraph

for name in glob.glob('/*.docx'):
    document = Document(name)
    for heading in iter_headings(document.paragraphs):
        headings.append(heading.text)
    for paragraph in iter_text(document.paragraphs):
        texts.append(paragraph.text)
    print(texts)
How do I extract the text and heading for each article?
This is the XML reading python-docx gives me. The red braces mark what I want to extract from each file.
https://user-images.githubusercontent.com/17858776/51575980-4dcd0200-1eac-11e9-95a8-f643f87b1f40.png
I am open for any alternative suggestions on how to achieve what I want with different methods, or if there is an easier way to do it with PDF files.
I think the approach of using iterators is a sound one, but I'd be inclined to parcel them differently. At the top level you could have:
for paragraphs in iterate_document_sections(document):
    create_document_from_paragraphs(paragraphs)
Then iterate_document_sections() would look something like:
def iterate_document_sections(document):
    """Generate a sequence of paragraphs for each headed section in document.

    Each generated sequence has a heading paragraph in its first position,
    followed by one or more body paragraphs.
    """
    paragraphs = [document.paragraphs[0]]
    for paragraph in document.paragraphs[1:]:
        if is_heading(paragraph):
            yield paragraphs
            paragraphs = [paragraph]
            continue
        paragraphs.append(paragraph)
    yield paragraphs
Something like this combined with portions of your other code should give you something workable to start with. You'll need an implementation of is_heading() and create_document_from_paragraphs().
Note that the term "section" here is used as in common publishing parlance to refer to a (section) heading and its subordinate paragraphs, and does not refer to a Word document section object (like document.sections).
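For completeness, a rough sketch of those two helpers (my illustration, not part of the original answer), assuming python-docx, built-in paragraph styles, and heading text that is safe to use as a file name:

from docx import Document

def is_heading(paragraph):
    # any built-in heading style ("Heading 1", "Heading 2", ...) starts a new section
    return paragraph.style.name.startswith('Heading')

def create_document_from_paragraphs(paragraphs):
    # copies paragraph text and style names only; inline formatting is not preserved
    new_document = Document()
    for paragraph in paragraphs:
        new_document.add_paragraph(paragraph.text, style=paragraph.style.name)
    # the first paragraph is the section heading, so use it for the file name
    new_document.save(paragraphs[0].text.strip() + '.docx')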
In fact, the provided solution works well only if the documents contain no elements other than paragraphs (tables, for example).
Another possible approach is to iterate not only over paragraphs but over all of the document body's child XML elements. Once you find a "sub-document's" start and end elements (paragraphs with headings, in your example), you delete the XML elements that are irrelevant to that part (in effect cutting off all other document content). This way you preserve all styles, text, tables, and other document elements and formatting.
It's not an elegant solution, and it means you have to keep a temporary copy of the full source document in memory.
This is my code:
import tempfile
from typing import Generator, Tuple, Union

from docx import Document
from docx.document import Document as DocType
from docx.oxml.table import CT_Tbl
from docx.oxml.text.paragraph import CT_P
from docx.oxml.xmlchemy import BaseOxmlElement
from docx.text.paragraph import Paragraph


def iterparts(doc_path: str, skip_first=True, bias: int = 0) -> Generator[Tuple[int, DocType], None, None]:
    """Iterate over sub-documents by splitting the source document into parts.

    Split into parts by copying the source document and cutting off irrelevant
    data.

    Args:
        doc_path (str): path to source *docx* file
        skip_first (bool, optional): skip first split point and wait for
            second occurrence. Defaults to True.
        bias (int, optional): split point bias. Defaults to 0.

    Yields:
        Generator[Tuple[int, DocType], None, None]: the first element of each
            tuple indicates the number of a sub-document; if the number is 0
            then there are no sub-documents
    """
    doc = Document(doc_path)
    counter = 0
    while doc:
        split_elem_idx = -1
        doc_body = doc.element.body
        cutted = [doc, None]
        for idx, elem in enumerate(doc_body.iterchildren()):
            if is_split_point(elem):
                if split_elem_idx == -1 and skip_first:
                    split_elem_idx = idx
                else:
                    cutted = split(doc, idx + bias)  # idx-1 to keep previous paragraph
                    counter += 1
                    break
        yield (counter, cutted[0])
        doc = cutted[1]


def is_split_point(element: BaseOxmlElement) -> bool:
    """Split criteria

    Args:
        element (BaseOxmlElement): oxml element

    Returns:
        bool: whether the *element* is the beginning of a new sub-document
    """
    if isinstance(element, CT_P):
        p = Paragraph(element, element.getparent())
        return p.text.startswith("Some text")
    return False


def split(doc: DocType, cut_idx: int) -> Tuple[DocType, DocType]:
    """Split into parts by copying the source document and cutting off
    irrelevant data.

    Args:
        doc (DocType): [description]
        cut_idx (int): [description]

    Returns:
        Tuple[DocType, DocType]: [description]
    """
    tmpdocfile = write_tmp_doc(doc)
    second_part = doc
    second_elems = list(second_part.element.body.iterchildren())
    for i in range(0, cut_idx):
        remove_element(second_elems[i])
    first_part = Document(tmpdocfile)
    first_elems = list(first_part.element.body.iterchildren())
    for i in range(cut_idx, len(first_elems)):
        remove_element(first_elems[i])
    tmpdocfile.close()
    return (first_part, second_part)


def remove_element(elem: Union[CT_P, CT_Tbl]):
    elem.getparent().remove(elem)


def write_tmp_doc(doc: DocType):
    tmp = tempfile.TemporaryFile()
    doc.save(tmp)
    return tmp
Note that you should define the is_split_point method according to your own split criteria.
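A short usage sketch (my addition, not from the original answer), assuming the functions above and a placeholder file name source.docx:

for i, (number, part) in enumerate(iterparts('source.docx'), start=1):
    # each yielded part is a regular python-docx Document and can be saved as usual
    part.save(f'part_{i}.docx')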

How to make searching a string in text files quicker

I want to search a list of strings (from 2k up to 10k strings in the list) in thousands of text files (there may be as many as 100k text files, each ranging in size from 1 KB to 100 MB) saved in a folder, and output a csv file of the matched text filenames.
I have developed code that does the required job, but it takes around 8-9 hours for 2000 strings searched across around 2000 text files of ~2.5 GB in total.
Also, this method consumes a lot of the system's memory, so I sometimes need to split the 2000 text files into smaller batches for the code to run.
The code is as below(Python 2.7).
# -*- coding: utf-8 -*-
import pandas as pd
import os

def match(searchterm):
    global result
    filenameText = ''
    matchrateText = ''
    for i, content in enumerate(TextContent):
        matchrate = search(searchterm, content)
        if matchrate:
            filenameText += str(listoftxtfiles[i]) + ";"
            matchrateText += str(matchrate) + ";"
    result.append([searchterm, filenameText, matchrateText])

def search(searchterm, content):
    if searchterm.lower() in content.lower():
        return 100
    else:
        return 0

listoftxtfiles = os.listdir("Txt/")
TextContent = []
for txt in listoftxtfiles:
    with open("Txt/" + txt, 'r') as txtfile:
        TextContent.append(txtfile.read())
result = []
# searchlist is the list of search strings (see the sample input below); it is defined elsewhere in the original script
for i, searchterm in enumerate(searchlist):
    print("Checking for " + str(i + 1) + " of " + str(len(searchlist)))
    match(searchterm)
df = pd.DataFrame(result, columns=["String", "Filename", "Hit%"])
Sample Input below.
List of strings -
["Blue Chip", "JP Morgan Global Healthcare","Maximum Horizon","1838 Large Cornerstone"]
Text file -
Usual text file containing different lines separated by \n
Sample Output below.
String,Filename,Hit%
JP Morgan Global Healthcare,000032.txt;000031.txt;000029.txt;000015.txt;,100;100;100;100;
Blue Chip,000116.txt;000126.txt;000114.txt;,100;100;100;
1838 Large Cornerstone,NA,NA
Maximum Horizon,000116.txt;000126.txt;000114.txt;,100;100;100;
As in the example above, the first string was matched in 4 files (separated by ;), the second string was matched in 3 files, and the third string was not matched in any of the files.
Is there a quicker way to search without any splitting of text files?
Your code pushes large amounts of data around in memory because you load all the files into memory and then search them.
Performance aside, your code could use some cleaning up. Try to write functions that are as autonomous as possible, without depending on global variables (for input or output).
I rewrote your code using list comprehensions and it became a lot more compact.
# -*- coding: utf-8 -*-
from os import listdir
from os.path import isfile

def search_strings_in_files(path_str, search_list):
    """ Returns a list of lists, where each inner list contains three fields:
    the filename (without path), a string in search_list and the
    frequency (number of occurrences) of that string in that file"""
    filelist = listdir(path_str)
    return [[filename, s, open(path_str + filename, 'r').read().lower().count(s)]
            for filename in filelist
            if isfile(path_str + filename)
            for s in [sl.lower() for sl in search_list]]

if __name__ == '__main__':
    print search_strings_in_files('/some/path/', ['some', 'strings', 'here'])
Mechanisms that I use in this code:
a list comprehension to loop through search_list and through the files.
compound statements to loop only through the files in a directory (and not through subdirectories).
method chaining to directly call a method on an object that is returned.
Tip for reading the list comprehension: try reading it from bottom to top, so:
I convert all items in search_list to lower case using a list comprehension.
Then I loop over that list (for s in...)
Then I filter out the directory entries that are not files using a compound statement (if isfile...)
Then I loop over all files (for filename...)
In the top line, I create the sublist containing three items:
filename
s, that is the lower-case search string
a method-chained call to open the file, read all its contents, convert it to lower case and count the number of occurrences of s.
This code uses all the power there is in "standard" Python functions. If you need more performance, you should look into specialised libraries for this task.
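One concrete option along those lines (my suggestion; the answer does not name a library) is the Aho-Corasick algorithm, which matches all search strings in a single pass over each file instead of scanning each file once per string. The pyahocorasick package provides it; a minimal sketch, assuming case-insensitive matching as above:

import ahocorasick

def build_automaton(search_list):
    automaton = ahocorasick.Automaton()
    for term in search_list:
        automaton.add_word(term.lower(), term)  # store the original term as the payload
    automaton.make_automaton()
    return automaton

def terms_in_text(automaton, text):
    # returns the set of search terms that occur anywhere in text
    return {term for _, term in automaton.iter(text.lower())}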

Python XML parse and count occurence of a string then output to Excel

So here is my conundrum!
I have 100+ XML files that I need to parse to find a string by its tag name (or by regular expression).
Once I find that string/tag value I need to count the times it occurs (or find the highest value of that string).
Example:
<content styleCode="Bold">Value 1</content>
<content styleCode="Bold">Value 2</content>
<content styleCode="Bold">Value 3</content>
<content styleCode="Bold">Another Value 1</content>
<content styleCode="Bold">Another Value 2</content>
<content styleCode="Bold">Another Value 3</content>
<content styleCode="Bold">Another Value 4</content>
So basically I would want to parse the XML, find the tag listed above and output to an Excel spreadsheet with the highest value found. The spreadsheet already has headers so just the numerical value is output to the Excel file.
So the output would be in Excel:
Value    Another Value
3        4
The each file would output onto another row.
I'm not sure how your XML files were named.
For the easy case, let's say they were named in this pattern:
file1.xml, file2.xml, ... and they are stored in the same folder as your python script.
Then you can use the following code to do the job:
import xml.etree.cElementTree as ElementTree
import re
from xlrd import open_workbook
from xlwt import Workbook
from xlutils.copy import copy

def process():
    for i in xrange(1, 100):  # loop from file1.xml to file99.xml
        resultDict = {}
        xml = ElementTree.parse('file%d.xml' % i)
        root = xml.getroot()
        for child in root:
            value = re.search(r'\d+', child.text).group()
            key = child.text[:-(1 + len(value))]
            try:
                # compare numerically so that e.g. '10' beats '9'
                if int(value) > int(resultDict[key]):
                    resultDict[key] = value
            except KeyError:
                resultDict[key] = value
        rb = open_workbook("names.xls")
        wb = copy(rb)
        s = wb.get_sheet(0)
        for index, value in enumerate(resultDict.values()):
            s.write(i, index, value)
        wb.save('names.xls')

if __name__ == '__main__':
    process()
So there are two main parts to the problem: (1) find the maximum value pair from each file, and (2) write these into an Excel workbook. One thing I always advocate is writing reusable code. Here you just have to put all your XML files in a folder, execute the main method, and get the results.
Now, there are several options for writing to Excel. The simplest is to create a tab- or comma-separated file (CSV) and import it into Excel manually. xlwt is one established library. openpyxl is another library, which makes creating Excel files much simpler and smaller in terms of lines of code.
Be sure to import required libraries and modules in the beginning of the file.
import re
import os
import openpyxl
While reading an XML file, we use regular expressions to extract the values you want.
regexPatternValue = ">Value\s+(\d+)</content>"
regexPatternAnotherValue = ">Another Value\s+(\d+)</content>"
To modularize it a little more, create a method that parses each line in the given XML file, looks for the regex patterns, extracts all the values and returns maximum of them. In the following method, I'm returning a tuple containing two elements, (Value, Another) which are maximum numbers of each type seen in that file.
def get_values(filepath):
    values = []
    another = []
    for line in open(filepath).readlines():
        matchValue = re.search(regexPatternValue, line)
        matchAnother = re.search(regexPatternAnotherValue, line)
        if matchValue:
            values.append(int(matchValue.group(1)))
        if matchAnother:
            another.append(int(matchAnother.group(1)))
    # Now we want to calculate the highest number in both lists.
    try:
        maxVal = max(values)
    except:
        maxVal = ''  # This case will handle if there are NO values at all
    try:
        maxAnother = max(another)
    except:
        maxAnother = ''
    return maxVal, maxAnother
Now keep your XML files in one folder, iterate over them, and extract the regex patterns in each. In the following code, I'm appending these extracted values in a list named as writable_lines. Then finally after parsing all the files, create a Workbook and add the extracted values in the format.
def process_folder(folder, output_xls_path):
    files = [folder + '/' + f for f in os.listdir(folder) if ".xml" in f]  # select the XML files
    writable_lines = []
    writable_lines.append(("Value", "Another Value"))  # Header in the excel
    for file in files:
        values = get_values(file)
        writable_lines.append((str(values[0]), str(values[1])))
    wb = openpyxl.Workbook()
    sheet = wb.active
    for i in range(len(writable_lines)):
        sheet['A' + str(i + 1)].value = writable_lines[i][0]
        sheet['B' + str(i + 1)].value = writable_lines[i][1]
    wb.save(output_xls_path)
In the lower for-loop, we're directing openpyxl to write the values into cells addressed in the typical Excel format, e.g. sheet["A3"], sheet["B3"], etc.
Ready to go...
if __name__ == '__main__':
    process_folder("xmls", "try.xlsx")  # openpyxl writes .xlsx files
