Python, BeautifulSoup: Output XML not recognized by program? - python

I'm running into some strange problems with using BeautifulSoup for editing XML. After I read in the XML file and make all of my edits, I write it to a file using
output = open("output.xml", "r")
output.write(str(soup))
output.close
However, the program that needs to read in this XML file does not even recognize it exists (it's supposed to automatically load XML. For any reference this is ArcGis). I can open up the original XML and the Python-generated one and verify that they are completely identical--at least visually.
Another quirky behavior is that if I open the original XML file and make edits to it with a text editor and save, there is no error with the program and it loads correctly. If I make edits using Python but NOT with BeautifulSoup (i.e. string operations) it also works fine. it seems that writing the output string with BeautifulSoup causes the problem.
Is this some kind of encoding issue? What's the difference between a string obtained from the BeautifulSoup object and a normal string read straight from a file?

Related

JSONDecodeError reading a large JSON file

I would like to read a large JSON file that I have before created through some web-scraping. However, when I try to read in the file, I get the following error message:
JSONDecodeError: Expecting ',' delimiter: line 1364567 column 2 (char 1083603504)
However, 1364567 is the very last line and it seems to be correct right there. Therefore I expect that the error is somewhere else in the file before, for example that somewhere there are brackets that are opened but not closed. What do you suggest how I can track down the problem and fix it? I can also provide a link to the file, but it is quite large (1.05 GB).
I use the following code to read the json file
import json
with open("file.json") as f:
data = json.load(f)
Thank you very much!
Edit: The problem was solved as follows: The end of the JSON file looked normal, i.e. an additional line with fields and information and a closing bracket ]. json.load complained about a missing comma, i.e. not recognizing that the last bracket indicated indeed that the file ended. Therefore there must have been opening brackets [ before in the file, that were not closed. Luckily those were due to some hiccups with the scraping at the beginning of the file, such that some manual search with Sublime Text allowed me just to delete those opening brackets and read the file without problems. Anyways, thank you very much for your suggestions and I am sure I will use them the next time I have a problem with JSON!
You can Use any powerful IDE such as pycharm, Atom, sublime they each have plugins for json formatting
and you can always validate json using online tools but it would be heavy for them to process
Hope this information might help
You can use this to check your json format before running codes. Just to make sure where the problem is and fix it
https://jsonformatter.curiousconcept.com/
Since you want the last line, and you are using Python, one of the good solutions could be to actually read the last line(s) and print them, to see where the problem is.
For that, there is actually a module you can use, file_read_backwards, which does this efficiently.
For details see this SO answer: https://stackoverflow.com/a/41418158/50003

Reading data from a binary file in Python

I have a python code that I don't really understand since I am super new to python that reads data from a binary file and spits out a CSV file.
The code is in this site I can't post it on here because it is huge.
https://github.com/PX4/Firmware/blob/master/Tools/sdlog2/sdlog2_dump.py
Now I know you must have the key to a binary file to be able to parse it and use its data. What I don't understand is how that code that I linked above does this or where the key in the code is. I have tried reading the code and breaking it down, but due to my limited knowledge I have no idea where to go from here.
I am trying to extract the key so that I can write my own code using the key to parse the data in the binary file.

How to edit and save existing HTML using Python?

I'm trying to write a program that enables someone to edit html from python 'input()' questions. For example: change a paragraph from the command line in python. Is there some sort of library I can use to read html then edit and save it?
Since an HTML file is just a plain text file it can be opened by python without the need for any extra libraries and such. Just open the file, edit what you need and write it.
Check out the following link:
http://www.pythonforbeginners.com/files/reading-and-writing-files-in-python

txt file appears blank with .write() python

I am on a windows machine and am trying to write a couple thousand lines to a text file using ipython. To test this I am just trying to get some text to appear in the file.
my code is as follows:
path="\Users\\*****\Desktop"
with open(path+'newheaders.txt','wb') as f:
f.write('new text')
This question (.write not working in Python) is answered and seems like it should have solved my issue but when I open the text file it is still blank.
I tested the file using the code below and the text appears to be there.
with open(path+'newheaders.txt','r') as f:
print f.read()
any ideas?
This 'should' work as written. A few things to try (I would put this in a comment but I lack sufficient reputation):
Delete the file and make sure the program is creating the file
Try writing as 'wt' rather than binary to see if we can narrow down the problem that way.
Remove all the business with the path and just try to write the file in the current directory.
What text editor are you using? Is it possible it's not refreshing the blank file?

xsd validation, get the object that is invalid

I have a large XML file (3 MB+) and i have an XSD to validate it against.
I am using python and LXML. I started from this script <>. Which does the validation fine, including giving me the line number. But the problem the file is on one line, so when i validate all i get is the error showing on line 1. when i use pretty print to split the lines up for me it maxes out at line 65535.
Thanks!
Pretty-print your XML to add newlines to it. Then put it through your validator to get a more helpful line number.
EDIT: On re-reading your question, I see that you have used Notepad++ to add the newlines. But that LXML apparently has a size limitation when it comings to validating your XML.
For a general approach to this problem, please see Validating a HUGE XML file. In particular, the accepted answer begins with:
Instead of using a DOMParser, use a SAXParser. This reads from an
input stream or reader so you can keep the XML on disk instead of
loading it all into memory.
Basically, you need to use a streaming approach which SAX offers. So, if your requirement is that you must validate your file in Python, then you'll need to find validation approach based on streaming. (Perhap LXML offers validation in a streaming fashion?)
However, if your validation requirements are more flexible, then consider a specialized tool such as XMLStarlet.
For example, here's how to validate an XML file against an XSD from the XMLStarlet entry on Wikipedia:
xmlstarlet val -e -s my.xsd my.xml
And testimonial on using XMLStarlet on very large files.

Categories