Removing specific section from thousands of pdfs (using python) - python

There is a case in my job where l have to remove a specific section (Glossary) from thousands of pdf documents.
The text l want to remove has a different font from the other parts:
Example:
"Floor" the lower surface of a room, on which one may walk.
"exchange" an act of giving one thing and receiving another (especially of the same type or value) in return.
Can you please suggest a way how to do it faster?

One of the possible ways to solve this problem is to find the section you want to delete using regex. Then using one of the libraries for pdf editing in python to delete this section.

Related

PyPDF2 Extract from field or location

I have a python script running fine, it scans a folder and collects data based on text line position which could work great but if any lines have missing data it throws my numbers off obviously.
I have looked in the pdf file using iText RUPS and I can find a reference to one set of the data I need
BT
582 -158.78 Td
(213447) Tj
ET
the information I want is in the brackets, can I somehow use the coordinates? if all fails, I might be able to get people to agree to start the info I need to collect with a flag XX12345 or YY12345 then I can easily pick out the data from the text extraction, but I'd rather find a better way.
Not added code examples as that works fine it's just the next step I'm struggling with, but I can if anyone wishes.
Many thanks
I tried to use just text extraction, but missing inputs throw my numbering scheme off.

Database searches in separate files

I am looking for a kind of database which can search in separate files eg. pdf, xls, doc that I get from different suppliers. My idea is something like this:
For example, I need to search for a part number and check different data about it. The file containing the part number must then be opened with the part number marked. If there are multiple hits, the database should display a list of the various files containing the searched item number. The list should act as links that open the file with the item number selected when selecting one from the list.
Does this already exist or how do I approach it?
Today, it's all assembled into a single PDF file of more than 1000 pages, and it's a time-consuming and laborious process to maintain.
I've only used vba in connection with Excel, so maybe it's too complicated for me. But is it possible for a programmer without spending 1000 hours on it?
Please help me :-)
Either Access or Excel could do this. I noticed the Python tag. I'm sure Python could handle this as well, although it seems more like a database solution would be best. It sounds like a one-to-many scenario. See the link below for some ideas of how this technique works.
https://www.tutorialspoint.com/ms_access/ms_access_one_to_many_relationship.htm
Also, below is a link with a whole bunch of MS Access templates. Take a look at that and hopefully that will give you some ideas of how to get started.
https://www.microsoftaccessexpert.com/Microsoft-Access-Templates.aspx
I agree, keeping this in a PDF with 1000 pages is NOT the way to go!!

Use Python to find groups of consecutive numbers in excel

SO i embarked on learning python a couple of weeks ago through codecademy and it has been a slow process for me. But unfortunately I'm at a point right now that i need a script to do a specific function for me right now.
I have an excel doc that is a list of ID numbers, example -
72100234
72100235
72100239
There are roughly 50,000 id numbers. I need a script that will do a couple of things for me.
- first is to show me how the job is done so i can learn from it
- second is its purpose, I need it to go through and find the consecutive numbers and group them together in either a separate file or another tab
Example -
72100234
72100235
72100236
72100239
72100240
I have to find these blocks of numbers so they can be assigned to a specific agency and going through the list manually... yeah, been there, done that, dont want to do it again.
Thanks for any help.
try to divide your problem into different steps.
The first step could be to get familiar with a library to access your excel file. The one i tend to use is "openpyxl". To find out how to get data from an excel file you can use this link: https://openpyxl.readthedocs.io/en/stable/optimized.html.
I am new here aswell and i found out that you should first make an affort to try yourself and afterwards people will be very helpful.
After you found out how to get the data from excel to python you could provide some of it. That would make it easier to understand your problem. I hope i could help you a bit.

Converting PDF to any parse-able format

I have a PDF file which consists of tables which can spread across various pages and may have text in between. An example of it can be found here.
I am able to convert the PDF to any format but the output files are not in any way parse-able i.e. I cannot extract data out of it as they are scattered. Here are the links to the output files which I created using pdftotext and pdftohtml.
Is there a way to extract data in a more suitable way?
Thanks in advance.
The general answer is no. pdf is a format intended for visual presentation and printing, and there is no guarantee that the contents will be in any particular order let alone structured as a table in any way other than what appears when the pdf is rendered onto paper or a screen. Sometimes there is even deliberate obfuscation to prevent anyone doing what you are attempting.
In this case it appears to be possible to cut and paste the contents of each table element. For a small number of similar files that is almost certainly the quickest thing to do. Open the pdf on the left hand of your screen, a spreadsheet or data-entry program on the right hand, then cut and paste. For a medium number - tens, hundreds? - it's probably cheapest to hire a temp to do the donkey-work. For a large number - thousands? - it would be possible to create a program to automate this process, but definitely not easy. I might think about using human input via the mouse to identify the corners of the table and the horizontal / vertical divisions, then generating cut and paste operations via control of the human interface devices. Don't ask me how. I'd have to find out if I had to do this, and I'd much rather not. It's a WOMBAT.
Whatever form of analysis you did on the pdf contents would certainly not generalize to other pdfs created by different organisations using different software, and possibly not even by the same organisation using the same process but merely a later release of the same software.
Following in the line of #nigel222, it really depends on the PDF how easily you can get the data out in some useful way.
It is best if the PDF is structured (has a document structure, created when the PDF was written). In this case, you can access the structure, and you are all set.
As structure is a fundamental necessity of an accessible PDF, you may try to "massage" the document by applying the various "make accessible" utilities floating around; definitely something to follow.

Removing heading numbers from docx

I need to use python to pre-process docx (Word) documents, so that pandoc can properly convert them into markdown. One of the key requirements is that the styles of the docx document should be "cleaned up", in particular that the numbering of headings (Heading 1, Heading 2, etc.) should be removed.
Restrictions: I know how to do that using VBA (and likely could do it from python using PyWin32 or such). But it is a requirement that it must be implemented without Microsoft Windows and without LibreOffice/UNO.
How can I use the python-docx package to do that? I have looked at the documentation and there does not seem to be any proper to do it (actually the heading numbering style does not seem to be implemented). Did I miss something?
Unless I should use another method, such as applying a different Word template to the docx document, with the main styles correctly predefined according to my requirements? Could that be done through an available python package?
Code in VBA
This is the code in VBA that got the job done:
Sub RemoveHeaderNos()
' Remove the header nos
Debug.Print "Removing header numbers and formatting..."
For Each s In ActiveDocument.Styles
s.LinkToListTemplate ListTemplate:=Nothing
Next
End Sub
On terminology, I'm understanding you to mean "the numbering of heading paragraphs" as opposed to something like page numbers in the page headers, have I got that right? The two terms "heading" and "header" are unfortunately close and mean quite different things, in Word parlance anyway :)
I'm assuming your paragraph headings are numbered, like 'Heading 1' style causes the next sequential integer to be prefixed to the heading paragraph text, like '9. Ninth section heading', (then likewise for Heading 2 -> 9.1, 9.2, etc.
You're correct that this hasn't been implemented in python-docx yet. You would need to get as close to the XML element in question (perhaps the <w:style> element for Heading 1 for example) as possible using the python-docx API, and then use lxml calls to manipulate the XML under that.
You'd need to start with a strategy for what XML changes you need to make. opc-diag is handy for that. You can change the .docx manually (preferably a radically stripped-down, super-short document) using Word to make it look the way you want, then compare the XML before and after to discover what changes you need to make to the XML.
Then you can validate your strategy by extracting a .docx (using opc-diag), manually updating the XML with the minimum required changes, repackage it (also using opc-diag), and load it in Word to make sure it behaves as expected.
I suspect there's a way to "disconnect" the "Heading 1" style from a numbering definition in the styles.xml part that would accomplish what you're after and be a fairly straightforward handful of element changes.
Anyway, that's where I would start.
This issue was solved in version 1.17 of pandoc, released on 20 March 2016 ("Don’t turn numbered headers into lists"). If other people meet the same issue, the best thing at this stage would be to upgrade to a that version or a later one.
Nevertheless, the exploration of the various solutions with the python-docx was interesting, because it indicated a point of possible improvement.

Categories