Python read XML from standard input - python

I'm trying to read XML input from the command line in Python 3. I have tried various methods so far; the following is my code for reading the XML:
import sys
import xml.dom.minidom
try:
    input = sys.stdin.buffer
except AttributeError:
    input = sys.stdin
xmlString = input.read()
But this keeps waiting for input. Can someone tell me how to stop reading once the XML file has been entered?
My XML file is,
<response>
<article>
<title>A Novel Approach to Image Classification, in a Cloud Computing Environment stability.</title>
<publicationtitle>IEEE Transactions on Cloud Computing</publicationtitle>
<abstract>Classification of items within PDF documents has always been challenging. This stability document will discuss a simple classification algorithm for indexing images within a PDF.</abstract>
</article>
<body>
<sec>
<label>I.</label>
<p>Should Haven't That is a bunch of text pattern these classification and cyrptography. These paragraphs are nothing but nonsense. What is the statbility of your program to find neural nets. Throw in some numbers to see if you get the word count correct this is a classification this in my nd and rd words. What the heck throw in cryptography.</p>
<p>I bet diseases you can't find probability twice. Here it is a again probability. Just to fool you I added it three times probability. Does this make any pattern classification? pattern classification! pattern classification.</p>
<p>
<fig>
<label>FIGURE.</label>
<caption>This is a figure representing convolutional neural nets.</caption>
</fig>
</p>
</sec>
</body>
</response>
Since this has a number of lines, I can't enter it the conventional way using input().

Reading from the console / command line is done with input(). Try:
import xml.dom.minidom
xmlString = input()
For more details on sys.stdin, take a look at this SO post.
Edit: If you want to read multiple lines from the console, try sys.stdin.readlines, like xmlString = sys.stdin.readlines() (this returns a list of lines; sys.stdin.read() returns a single string). The user terminates multi-line input with CTRL+D (on Windows, CTRL+Z followed by Enter). Or, you can just have the user write the XML to a file, and parse that file (easier, but maybe not desirable).
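To make the end-of-input behaviour concrete, here is a minimal sketch. read() only returns once the stream reaches end-of-file, which is why the original code appears to hang in a terminal until the end-of-input key is pressed; when the XML is redirected from a file (for example python script.py < input.xml, where both names are hypothetical) the end of the file ends the input. read_xml is an illustrative helper name, and the in-memory stream stands in for sys.stdin.buffer.

```python
import io
import xml.dom.minidom


def read_xml(stream):
    """Read the stream up to end-of-file and parse the result as XML."""
    data = stream.read()  # returns only when the stream reaches EOF
    return xml.dom.minidom.parseString(data)


# In a real script this would be: doc = read_xml(sys.stdin.buffer)
# Here an in-memory stream stands in for standard input.
sample = io.BytesIO(b"<response><article><title>T</title></article></response>")
doc = read_xml(sample)
print(doc.documentElement.tagName)
```

Passing the binary buffer rather than the text stream lets the XML parser handle any encoding declared in the document itself.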

Related

Why is Python writing the Unicode code instead of the character to a file

I'm making a Python program that (after doing a lot of things haha) creates an HTML file with some of the generated info.
I open an HTML template and then I replace some 'tokens' with the generated info.
The way I open and replace the info is the following:
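from tabulate import tabulate

def getPlantilla():
    # read the whole HTML template into a string
    with open('assets/plantillas/plantilla_proyecto3.html','r') as file:
        plantilla = file.read()
    return plantilla

def remplazarTokens(plantilla:str,PID,Pmap):
    # render Pmap as an HTML table, then fill the {PID} and {TABLA} tokens
    tabla_html = tabulate(Pmap,headers="firstrow",tablefmt='html')
    return plantilla.format(PID=PID,TABLA=tabla_html)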
But before replacing the tokens, I generate some HTML code from the generated info with this function:
def crearTrigger(uso,id):
    return f"{uso}"
And finally I create the file:
with open(filename,'w',encoding='UTF-8') as file:
    file.write(html)
The problem is that in the final .html, the code generated with crearTrigger() doesn't work well because some characters are replaced with the Unicode code.
Example:
Out: &lt;a href=&quot;#heap&quot;&gt;Heap&lt;/a&gt;
How it should be: <a href="#heap">Heap</a>
I think this is an encoding problem, but I have tried encoding it with .encode("utf-8") and still have the same problem.
Hope someone can help me. Thanks
Update: While writing the question, I realised that the tabulate library, which I am using to convert the info into an HTML table, is causing the problem (putting the Unicode code instead of the character), because the outputs of crearTrigger() are saved in a list that tabulate later converts into an HTML table. But I still don't know how to solve it.
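The symptom described in the update looks like HTML escaping rather than a file-encoding problem. The sketch below reproduces it with the standard library only, assuming the table renderer escapes cell contents the way html.escape does; the link string is taken from the example above. The tabulate documentation also describes an unsafehtml table format that skips escaping, which is worth checking against the installed version.

```python
import html

# A cell value that already contains markup, as crearTrigger() is meant to produce.
link = '<a href="#heap">Heap</a>'

# Escaping turns the markup characters into entities, so a browser shows the
# tag as literal text instead of rendering a link.
escaped = html.escape(link)
print(escaped)

# Reversing the escaping restores the original markup.
restored = html.unescape(escaped)
print(restored)
```

Unescaping a whole rendered table would also unescape any data that should stay escaped, so avoiding the escaping at the source is usually the cleaner option.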

How to access remote and encrypted PDF text without writing to local drive [closed]

I am very new to the coding world and have been stuck on this one problem for 3 days now, searching everywhere for an answer, so any help will be greatly appreciated. I need to extract a small amount of text from a PDF file located at a URL. I'm using sessions.get(chart_PDF) to fetch it, where chart_PDF is the example URL below.
Example url is https://www.airservicesaustralia.com/aip/pending/dap/PADGN01-166_09SEP2021.pdf
I know I am able to write it to my local drive but I don't want to do that, I want to be able to do it remotely, since I only need a couple of numbers from it.
I have tried finding the password for decrypting on the URL page, but couldn't find it. I've tried to use PyPDF2, pdfminer and pikepdf (probably not well).
I only need to retrieve two numbers near the bottom of the PDF that can be used for the rest of my code. Please help, even if it is a simple fix, I'm new to all this and need some help. Thanks.
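from io import BytesIO
import urllib.request
import requests
from pikepdf import Pdf as PDF
from pdfminer import high_level
from pdfminer.pdfpage import PDFPage
chart_PDF = 'https://www.airservicesaustralia.com/aip/pending/dap/PADGN01-166_09SEP2021.pdf'
s = requests.Session()  # assumption: s is a requests session, as "sessions.get" above suggests
retrieve = s.get(chart_PDF)
content = retrieve.content
response = urllib.request.urlopen(chart_PDF)
p = BytesIO(content)  # keep the downloaded PDF in memory rather than on disk
p.getbuffer()
check = PDFPage.get_pages(p, check_extractable=False)
extract = high_level.extract_text(p)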
I'm getting:
PDFTextExtractionNotAllowedWarning: The PDF <_io.BytesIO object at 0x000001B007ABEC20> contains a metadata field indicating that it should not allow text extraction. Ignoring this field and proceeding.
warnings.warn(warning_msg, PDFTextExtractionNotAllowedWarning)
Alternately, if I try this:
from pikepdf import Pdf as PDF
from pdfminer.pdfpage import PDFPage
from PyPDF2 import PdfFileReader
new_pdf = PDF.new()
with PDF.open(p) as pdf:
    print(len(pdf.pages))
    page1 = pdf.pages[0]
    if PdfFileReader.getIsEncrypted(pdf):
        print(True)
        PdfFileReader.decrypt(page1, password='')
pdf.close()
I get:
line 1987, in decrypt
return self._decrypt(password)
AttributeError: _decrypt
UPDATE 3/8/21
Thank you so much K J! You've seriously been a huge help!
import re
from io import BytesIO
from pdfminer.pdfpage import PDFPage
from pdfminer import high_level
retrieve = s.get(chart_PDF)
content = retrieve.content
bytes = BytesIO(content)
bytes.getbuffer()
PDFPage.get_pages(bytes, check_extractable=False)
extract = high_level.extract_text(bytes, password='') #THIS LINE THROWS ERROR
joined = ''.join(extract)
find_txt = re.findall(r'[(]\d*[-]\d[.]\d[)]', joined)
print(find_txt)
bytes.close()
This is now working well and I have been able to pull the numbers that I need (I have basically pulled all numbers from inside brackets off the PDF). I'll sort through that to find which one I need.
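As a small illustration of the bracket-matching step, here is a sketch that also splits each match into its two numbers. Only the value (1380-4.4) comes from this thread; the surrounding sample text is made up, and the pattern is a slightly stricter variant of the one above, not the exact original.

```python
import re

# Hypothetical extracted text; only "(1380-4.4)" is taken from the thread.
sample = "some text (1380-4.4) more text (not a match) end"

# Two capture groups: the integer before the dash and the decimal after it.
pattern = re.compile(r"\((\d+)-(\d+\.\d+)\)")

pairs = [(int(a), float(b)) for a, b in pattern.findall(sample)]
print(pairs)
```

With capture groups, findall returns tuples of the grouped parts, which saves a second parsing pass over the matched strings.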
Strangely enough, although it's giving me what I need, my extract = high_level.extract_text(bytes, password='') line still raises the warning (warning_msg, PDFTextExtractionNotAllowedWarning), which is rather annoying. I'm not sure how this process works, but it's still letting the info out.
I can't use try/except or it skips over it. What is the way around this? How can I stop that warning from coming up?
FINAL UPDATE
I got around the warning and it works well now.
import warnings
with warnings.catch_warnings():
    # silence every warning raised inside this block only
    warnings.simplefilter("ignore")
    extract = high_level.extract_text(bytes)
Cheers fellas for putting up with my ignorance, you've helped so much.
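A warning is not an exception, which is why try/except does not catch it by default. The sketch below shows a narrower variant of the same workaround that ignores one warning category instead of all of them. ExtractionNotAllowedWarning and noisy_extract are hypothetical stand-ins so the example runs without pdfminer; in real code the category would be the pdfminer warning class named in the message above.

```python
import warnings


class ExtractionNotAllowedWarning(UserWarning):
    """Hypothetical stand-in for the pdfminer warning category."""


def noisy_extract():
    # Stand-in for an extraction call that warns but still returns text.
    warnings.warn("text extraction not allowed", ExtractionNotAllowedWarning)
    return "(1380-4.4)"


def quiet_extract():
    # The filter is active only inside the with block and is undone on exit.
    with warnings.catch_warnings():
        warnings.filterwarnings("ignore", category=ExtractionNotAllowedWarning)
        return noisy_extract()


print(quiet_extract())
```

Filtering by category keeps unrelated warnings visible, which a blanket simplefilter("ignore") would hide as well.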
The whole file has to be downloaded to the device, at least into RAM, so that the blob can be parsed as a file from the very end, looking for one or more %%EOF markers and the location of page 0 (which gets converted to 1 or i); that page could be anywhere in the stream.
Then you can navigate to the other sequentially numbered pages in the random order in which they were built. Any complaints, please contact Adobe.
However, it is easiest if the file is cached as a physical file object. If you don't want that on disk, use a RAM drive for your browser.
Again, those two objects at the bottom of page one could be anywhere, mixed into the content of "page" 99's objects or otherwise. In the extreme, each letter in a PDF can be more than one object anywhere in the file, though a good authoring editor would try to keep them line by line. (There is no such thing in PDF as a word or a paragraph.)
We can print the file as plain text to see how it is composited; although the file is secured, that is allowed.
I tried printing from a browser with little success, but that can depend on the browser, system and OS print drivers. Here I have printed the page as text using Acrobat portable, so we can see the sequential offsets of each text block from the left-hand margin, just as a PDF viewer would need to rebuild them.
UPDATE
You said your target is (1380-4.4) to the right of ALTERNATE, but again, a PDF has no concept of left and right or before and after. In this file, the variable target is in 2 separate pieces prior to the known characters, which luckily form a complete single block (ALTERNATE). Thus proximity in the plain text could well work here if the capture is confined to that nearby locality. However, there is no guarantee that ALTERNATE will always be a single block.
It was perhaps not a good idea to show the way a printer would be given a stream of sequential data.
Here is the way one PDF viewer goes about decrypting the file:
As stated, on this occasion the word ALTERNATE is defined as text. However, the next item is the "3" under "B", which is text drawn as a vector path; it is not called a "character", although it looks like one, but a numbered glyph from a font table. We do see later that some of those numbers are stored as "text", and your target is mixed in with similar text in the same object.
Thus you need to call a PDF interpreter to give you a meaningful translation of all bits and pieces of objects so that you can extract the "right" text.
The easiest way for a "simple" one line target in a complex file is to use MuPDF to first tidy up the file
mutool clean -gggg -D infile.pdf outfile.pdf
combined with
pdftotext -layout outfile.pdf outfile.txt
or similar, to hopefully export that text line by line, so that you can consistently find your target immediately before (!) or after ALTERNATE.
N.B. mutool convert to HTML would place the target value in a table entry after the keyword, and if the lines are consistent in number, that would be a simpler way to find or grep it.
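Once the text has been exported line by line, confining the search to lines that contain the keyword could look like the sketch below. values_near is a hypothetical helper name and the sample text is invented; the real layout output will differ, and the value may land on a neighbouring line instead of the keyword's own line.

```python
import re

# Bracketed values of the form (digits-digits.digits), e.g. (1380-4.4).
BRACKETED = re.compile(r"\(\d+-\d+\.\d+\)")


def values_near(text, keyword="ALTERNATE"):
    """Return bracketed values found on lines that also contain the keyword."""
    found = []
    for line in text.splitlines():
        if keyword in line:
            found.extend(BRACKETED.findall(line))
    return found


# Hypothetical layout-preserving output, for illustration only.
sample = "CAT A   (560-2.1)\nALTERNATE   (1380-4.4)\n"
print(values_near(sample))
```

If the value turns out to sit on the line before or after the keyword, the same loop can be extended to look at a small window of adjacent lines.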

Extract only additions from diff in python

I am trying to solve a problem:
I receive auto-generated emails from the government with no tags in the HTML. It's one table nested inside another. An abomination of a template. I get one every few days and I want to extract some fields from it. My idea was this:
Use the HTML in the email as a template. Remove all fields that change with every mail, like the name of my client, their unique ID and the issue explained in the mail.
Use this HTML template with the missing fields and diff it against new emails. That will give me all the new info in one shot without having to parse the email.
The problem is that I can't find any way of loading only these additions. I am trying to use difflib in Python, and it returns byte streams of additions and subtractions for each line that I am not able to process properly. I want to find a way to return only the additions and nothing else. I am open to using other libraries or methods. I do not want to write a huge regex with tons of HTML.
When I captured the stdout from calling diff with Popen, it also returned bytes.
You can convert the bytes to chars, then continue with your processing.
You could do something similar to what I do below to convert your bytes to a string.
The script below calls diff on two files and prints only the lines beginning with the '>' symbol (new in the right-hand file):
#!/usr/bin/env python3
import os
import sys, subprocess
file1 = 'test1'
file2 = 'test2'
# optional command-line arguments override the default file names
if len(sys.argv)==3:
    file1=sys.argv[1]
    file2=sys.argv[2]
if not os.access(file1,os.R_OK):
    print(f'Unable to read: \'{file1}\'')
    sys.exit(1)
if not os.access(file2,os.R_OK):
    print(f'Unable to read: \'{file2}\'')
    sys.exit(1)
argv = ['diff',file1,file2]
runproc = subprocess.Popen(args=argv, stdout=subprocess.PIPE)
out, err = runproc.communicate()
# communicate() returns bytes; decode them to a str in one step
outstr = out.decode()
for line in outstr.split('\n'):
    if len(line)==0:
        continue
    # '>' marks lines that exist only in the right-hand file
    if line[0]=='>':
        print(line)
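Since the question mentions difflib, the same idea can be sketched without calling the external diff program. difflib.ndiff works on lists of strings and prefixes lines that exist only in the second input with "+ ", so filtering on that prefix leaves just the additions. additions is a hypothetical helper name, and the template and email strings are invented for illustration.

```python
import difflib


def additions(old_text, new_text):
    """Return the lines that appear in new_text but not in old_text."""
    diff = difflib.ndiff(old_text.splitlines(), new_text.splitlines())
    # ndiff marks added lines with "+ "; strip the two-character prefix.
    return [line[2:] for line in diff if line.startswith("+ ")]


template = "<table>\n<tr><td>Name</td></tr>\n</table>"
email = "<table>\n<tr><td>Name</td></tr>\n<tr><td>Jane Doe</td></tr>\n</table>"
print(additions(template, email))
```

If the emails arrive as bytes, decoding them to str before the comparison avoids the byte-stream output described in the question.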

How can I say a file is SVG without using a magic number?

An SVG file is basically an XML file, so I could use the string <?xml (or its hex representation, '3c 3f 78 6d 6c') as a magic number, but there are a few reasons not to do that; for example, extra whitespace could break the check.
The other images I need/expect to check are all binary and have magic numbers. How can I quickly check whether a file is SVG without relying on the extension, ideally using Python?
XML is not required to start with the <?xml preamble, so testing for that prefix is not a good detection technique — not to mention that it would identify every XML as SVG. A decent detection, and really easy to implement, is to use a real XML parser to test that the file is well-formed XML that contains the svg top-level element:
import xml.etree.ElementTree as et

def is_svg(filename):
    tag = None
    # binary mode lets the parser handle the document's declared encoding
    with open(filename, "rb") as f:
        try:
            # only the first 'start' event is needed, so stop right after it
            for event, el in et.iterparse(f, ('start',)):
                tag = el.tag
                break
        except et.ParseError:
            pass
    return tag == '{http://www.w3.org/2000/svg}svg'
xml.etree.ElementTree uses its C accelerator (built on expat) automatically when it is available, so the detection is efficient; the separate xml.etree.cElementTree module was deprecated in Python 3.3 and removed in 3.9. timeit shows that an SVG file was detected as such in ~200μs, and a non-SVG in 35μs. The iterparse API enables the parser to forgo creating the whole element tree (module name notwithstanding) and only read the initial portion of the document, regardless of total file size.
You could try reading the beginning of the file as binary - if you can't find any magic numbers, you read it as a text file and match to any textual patterns you wish. Or vice-versa.
This is from man file (here), for the unix file command:
The magic tests are used to check for files with data in particular fixed formats. The canonical example of this is a binary executable ... These files have a “magic number” stored in a particular place near the beginning of the file that tells the UNIX operating system that the file is a binary executable, and which of several types thereof. The concept of a “magic” has been applied by extension to data files. Any file with some invariant identifier at a small fixed offset into the file can usually be described in this way. ...
(my emphasis)
And here's one example of the "magic" that the file command uses to identify an svg file (see source for more):
...
0 string \<?xml\ version=
>14 regex ['"\ \t]*[0-9.]+['"\ \t]*
>>19 search/4096 \<svg SVG Scalable Vector Graphics image
...
0 string \<svg SVG Scalable Vector Graphics image
...
As described by man magic, each line follows the format <offset> <type> <test> <message>.
If I understand correctly, the code above looks for the literal "<?xml version=". If that is found, it looks for a version number, as described by the regular expression. If that is found, it searches the next 4096 bytes until it finds the literal "<svg". If any of this fails, it looks for the literal "<svg" at the start of the file, and so on.
Something similar could be implemented in Python.
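A rough Python imitation of the two quoted rules might look like the sketch below. It is not a port of libmagic: the offsets 14 and 19 and the 4096-byte range are copied from the rules above, looks_like_svg is a hypothetical name, and files that start with whitespace or a byte-order mark would need extra handling.

```python
import re

# Same character classes as the quoted regex rule: optional quotes or blanks,
# a version number, then optional quotes or blanks.
VERSION = re.compile(rb"""['" \t]*[0-9.]+['" \t]*""")


def looks_like_svg(head):
    """Apply the two quoted magic rules to the first bytes of a file."""
    # Second rule: the file starts directly with "<svg".
    if head.startswith(b"<svg"):
        return True
    # First rule: "<?xml version=", a version number at offset 14,
    # then "<svg" somewhere in the range searched from offset 19.
    if head.startswith(b"<?xml version=") and VERSION.match(head[14:]):
        return b"<svg" in head[19:19 + 4096]
    return False


print(looks_like_svg(b'<?xml version="1.0"?>\n<svg xmlns="http://www.w3.org/2000/svg"/>'))
```

To use it on a file, read the first few kilobytes in binary mode and pass them in, for example looks_like_svg(f.read(8192)).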
Note there's also python-magic, which provides an interface to libmagic, as used by the unix file command.

pyPdf unable to extract text from some pages in my PDF

I'm trying to use pyPdf to extract and print pages from a multipage PDF. Problem is, text is not extracted from some pages. I've put an example file here:
http://www.4shared.com/document/kmJF67E4/forms.html
If you run the following, the first 81 pages return no text, while the final 11 extract properly. Can anyone help?
from pyPdf import PdfFileReader
input1 = PdfFileReader(file("forms.pdf", "rb"))
for page in input1.pages:
    print page.extractText()
Note that extractText() still has problems extracting the text properly. From the documentation for extractText():
This works well for some PDF files, but poorly for others, depending on the generator used. This will be refined in the future. Do not rely on the order of text coming out of this function, as it will change if this function is made more sophisticated.
Since it is the text you want, you can use the Linux command pdftotext.
To invoke that using Python, you can do this:
>>> import subprocess
>>> subprocess.call(['pdftotext', 'forms.pdf', 'output'])
The text is extracted from forms.pdf and saved to output.
This works in the case of your PDF file and extracts the text you want.
This isn't really an answer, but the problem with pyPdf is this: it doesn't yet support CMaps. PDF allows fonts to use CMaps to map character IDs (bytes in the PDF) to Unicode character codes. When you have a PDF that contains non-ASCII characters, there's probably a CMap in use, and even sometimes when there's no non-ASCII characters. When pyPdf encounters strings that are not in standard Unicode encoding, it just sees a bunch of byte code; it can't convert those bytes to Unicode, so it just gives you empty strings. I actually had this same problem and I'm working on the source code at the moment. It's time consuming, but I hope to send a patch to the maintainer some time around mid-2011.
You could also try the pdfminer library (also in python), and see if it's better at extracting the text. For splitting however, you will have to stick with pyPdf as pdfminer doesn't support that.
I find it sometimes useful to convert it to ps (try with pdf2ps and pdftops for potential differences) then back to pdf (ps2pdf). Then try your original script again.
I had a similar problem with some PDFs; on Windows, this works excellently for me:
1.- Download Xpdf tools for Windows
2.- Copy pdftotext.exe from xpdf-tools-win-4.00\bin32 to C:\Windows\System32 and also to C:\Windows\SysWOW64
3.- Use subprocess to run the command from the console:
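import subprocess
try:
    # filePath is the path to the PDF; '-' tells pdftotext to write to stdout.
    # Passing the arguments as a list avoids shell quoting issues with spaces.
    extInfo = subprocess.check_output(['pdftotext.exe', filePath, '-'], stderr=subprocess.STDOUT).strip()
except Exception as e:
    print(e)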
I'm starting to think I should adopt a messy two-part solution. There are two sections to the PDF: pp 1-82, which have text page labels (pdftotext can extract them), and pp 83-end, which have no page labels but which pyPdf can extract, and it explicitly knows the pages.
I think I need to combine the two. Clunky, but I don't see any way round it. Sadly I'm having to do this on a Windows machine.
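One way to combine the two sources is sketched below, under the assumption that the pdftotext output separates pages with form feed characters (its usual behaviour unless -nopgbrk is given). split_pages and merge_page_text are hypothetical helper names, and the page strings are placeholders, not output from the real file.

```python
def split_pages(pdftotext_output):
    """Split pdftotext output into one string per page."""
    pages = pdftotext_output.split("\f")
    # A trailing form feed leaves an empty final entry; drop it.
    if pages and pages[-1].strip() == "":
        pages.pop()
    return pages


def merge_page_text(primary, fallback):
    """Per page, keep the primary text unless it is empty, then use the fallback."""
    merged = []
    for i in range(max(len(primary), len(fallback))):
        a = primary[i] if i < len(primary) else ""
        b = fallback[i] if i < len(fallback) else ""
        merged.append(a if a.strip() else b)
    return merged


# Placeholder data: the first extractor fails on page 1, the second on page 2.
from_first_tool = ["", "page two text"]
from_second_tool = split_pages("page one text\f\f")
print(merge_page_text(from_first_tool, from_second_tool))
```

This only works if both tools agree on page numbering, which is worth verifying on a few pages before relying on it.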
