Why does BeautifulSoup change the HTML? - python

I have an HTML file. I am trying to open it and read the contents as
with open("M_ALARM_102.HTML", "r") as f:
    contents = f.read()
print(contents)
when I print the contents in the above command it prints perfectly. But when I pass the contents to BeautifulSoup and print the soup it changes the HTML code
soup = BeautifulSoup(contents, "html.parser")
print(soup)
here is the output from BeautifulSoup
ÿþ<html>
<head>
<meta charset="UTF-8">
<title>ARRÊT SERVOS</title>
<style type="text/css">
I don't understand why it is doing this. I need to extract 3 tags from it, but it keeps giving None as output.
Can anyone help me, please?

&lt; is the < symbol and &gt; is the > symbol. These HTML entities exist for security, to protect a web site from XSS (Cross-Site Scripting) attacks.

It might be that the parser used by BeautifulSoup did not recognize that file as html.
I see two "strange" characters at the start of that output: ÿþ. Those are the bytes 0xFF 0xFE, the UTF-16 little-endian byte order mark (BOM), displayed as if they were Latin-1 text; something added a BOM to the file, while the parser expected valid UTF-8.
There is a good chance that this is the problem.
One way to fix the BOM problem is to open the file in Notepad and save it as UTF-8. Notepad is pretty good at this kind of thing.
You might also be able to fix it by opening the file in Python as UTF-16, using with open("M_ALARM_102.HTML", "r", encoding="utf-16") as f:. Note that here you specify the encoding directly (see the Python documentation about Unicode for more).
Note that I did not personally try the latter approach, so I am not sure it will actually remove the BOM -- the best option is still to not introduce it at all in your workflow.
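A standard-library sketch of this idea: check for a UTF-16 BOM before decoding, then hand the decoded string to BeautifulSoup. The file name comes from the question; the helper name is my own.

```python
import codecs

def read_html_text(path):
    """Read an HTML file as text, decoding UTF-16 when a BOM is present."""
    with open(path, "rb") as f:
        raw = f.read()
    # 0xFF 0xFE (shown as ÿþ when misread as Latin-1) is the UTF-16 LE BOM.
    if raw.startswith(codecs.BOM_UTF16_LE) or raw.startswith(codecs.BOM_UTF16_BE):
        return raw.decode("utf-16")  # the "utf-16" codec consumes the BOM
    return raw.decode("utf-8")

# contents = read_html_text("M_ALARM_102.HTML")
# soup = BeautifulSoup(contents, "html.parser")
```

Because the BOM is stripped during decoding, the ÿþ characters never reach the parser.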

Related

Why is Python writing the Unicode code instead of the character to a file

I'm making a Python program that (after doing a lot of things haha) creates an HTML file with some of the generated info.
I open an HTML template and then I replace some 'tokens' with the generated info.
The way I open and replace the info is the following:
def getPlantilla():
    with open('assets/plantillas/plantilla_proyecto3.html', 'r') as file:
        plantilla = file.read()
    return plantilla

def remplazarTokens(plantilla: str, PID, Pmap):
    tabla_html = tabulate(Pmap, headers="firstrow", tablefmt='html')
    return plantilla.format(PID=PID, TABLA=tabla_html)
But before 'replacing the tokens' I generate some HTML code from the generated info with this function:
def crearTrigger(uso, id):
    return f'<a href="#{id}">{uso}</a>'
And finally I create the file:
with open(filename,'w',encoding='UTF-8') as file:
file.write(html)
The problem is that in the final .html, the code that was generated with crearTrigger() doesn't work correctly, because some characters are replaced with their HTML entity codes.
Example:
Out: &lt;a href="#heap"&gt;Heap&lt;/a&gt;
How it should be: <a href="#heap">Heap</a>
I think this is an encoding problem, but I tried encoding it with .encode("utf-8") and I still have the same problem.
Hope someone can help me. Thanks
Update: While writing the question, I realised that the tabulate library I'm using to convert the info into an HTML table is creating the problem (putting the entity code instead of the character), because the outputs from crearTrigger() are saved in a list that tabulate later converts into an HTML table. But I still don't know how to solve it.
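One workaround, sketched here with only the standard library: tabulate's html format escapes cell contents as HTML entities, so you can unescape the finished table string before inserting it into the template. (Newer tabulate versions also offer tablefmt='unsafehtml', which skips the escaping entirely; check your version's documentation.)

```python
import html

# What the escaping does to a cell containing markup:
escaped_cell = html.escape('<a href="#heap">Heap</a>')
# escaped_cell == '&lt;a href=&quot;#heap&quot;&gt;Heap&lt;/a&gt;'

# Undo the escaping on the finished table string before writing the file.
# The <td> wrapper below is a stand-in for tabulate's actual output.
tabla_html = '<td>' + escaped_cell + '</td>'
fixed = html.unescape(tabla_html)
```

Caveat: unescaping the whole table also unescapes any cell that legitimately contained entity text, so this is only safe when every cell is trusted HTML you generated yourself.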

Cannot open files with the BeautifulSoup lxml parser

I am trying to read the contents of an HTML file with BeautifulSoup, but I'm receiving a UnicodeDecodeError.
I also tried changing the parser to html.parser instead of lxml, but it doesn't work.
However, if I use the requests library to request the URL it works; it fails only when I read the HTML file locally.
answer:
I needed to specify the encoding; it should look something like this: with open('lap.html', encoding="utf8") as html_file:
You are passing a file object to BeautifulSoup; instead you have to pass the content of the file.
Try:
content = html_file.read()
source = BeautifulSoup(content, 'lxml')
First of all, fix the typo soruce to source. Then put a space around the equals sign, and then find out what might not be encodable by the encoding standard you use, because that error refers to a character which can't be decoded/encoded.
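Putting the two fixes together, a minimal sketch (the helper name and file name are illustrative; the parse line is commented out since it needs bs4 installed):

```python
def read_text(path, encoding="utf8"):
    """Read a file with an explicit encoding, avoiding the UnicodeDecodeError
    raised when the platform default (e.g. cp1252 on Windows) can't decode it."""
    with open(path, encoding=encoding) as html_file:
        return html_file.read()

# content = read_text('lap.html')
# source = BeautifulSoup(content, 'lxml')  # pass the string, not the file object
```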

Write to an HTML file with Python

I have a couple of graphs I need to display in my browser offline. mpld3 outputs the HTML as a string, and I need to be able to make an HTML file containing that string. What I'm doing right now is:
tohtml = mpld3.fig_to_html(fig, mpld3_url='/home/pi/webpage/mpld3.js',
                           d3_url='/home/pi/webpage/d3.js')
print(tohtml)
Html_file = open("graph.html", "w")
Html_file.write(tohtml)
Html_file.close()
tohtml is the variable where the HTML string is stored. I've printed this string to the terminal and then pasted it into an empty HTML file and I get my desired result. However, when I run my code, I get an empty file named graph.html
It seems like you may be reinventing the wheel here. Have you tried something like:
mpld3_url = '/home/pi/webpage/mpld3.js'
d3_url = '/home/pi/webpage/d3.js'

with open('graph.html', 'w') as fileobj:
    mpld3.save_html(fig, fileobj, d3_url=d3_url, mpld3_url=mpld3_url)
Note: this is untested; I'm just going off the mpld3.save_html documentation and prior knowledge of Python IO streams.
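If save_html isn't an option, writing the string inside a context manager also guarantees the file is flushed and closed before you look at it. The HTML string below is just a placeholder standing in for the mpld3 output.

```python
tohtml = "<html><body><p>placeholder for mpld3.fig_to_html output</p></body></html>"

# The with-block closes (and therefore flushes) the file automatically,
# even if an exception is raised while writing.
with open("graph.html", "w") as html_file:
    html_file.write(tohtml)
```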

requests - Python command line behavior differs from behavior when script is run

I'm trying to write a script that will input data I supply into a web form at a url I supply.
To start with, I'm testing it out by simply getting the html of the page and outputting it as a text file. (I'm using Windows, hence .txt.)
import sys
import requests
sys.stdout = open('html.txt', 'a')
content = requests.get('http://www.york.ac.uk/teaching/cws/wws/webpage1.html')
content.text
When I do this (i.e., the last two lines) on the python command line (>>>), I get what I expect. When I do it in this script and run it from the normal command line, the resulting html.txt is blank. If I add print(content) then html.txt contains only: <Response [200]>.
Can anyone elucidate what's going on here? Also, as you can probably tell, I'm a beginner, and I can't for the life of me find a beginner-level tutorial that explains how to use requests (or urllib[2] or selenium or whatever) to send data to webpages and retrieve the results. Thanks!
You want:
import sys
import requests

result = requests.get('http://www.york.ac.uk/teaching/cws/wws/webpage1.html')
if result.status_code == requests.codes.ok:
    with open('html.txt', 'a') as sys.stdout:
        print(result.content)
Requests returns an instance of type requests.Response. When you tried to print that, the __repr__ method was called, which looks like this:
def __repr__(self):
    return '<Response [%s]>' % (self.status_code)
That is where the <Response [200]> came from.
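The behaviour is easy to reproduce with a minimal stand-in class (hypothetical, purely to illustrate why printing the object showed the repr rather than the body):

```python
class FakeResponse:
    """Mimics the repr behaviour of requests.Response described above."""
    def __init__(self, status_code, content):
        self.status_code = status_code
        self.content = content

    def __repr__(self):
        return '<Response [%s]>' % (self.status_code)

r = FakeResponse(200, b'<html>...</html>')
print(r)          # prints: <Response [200]>  (str falls back to __repr__)
print(r.content)  # prints the body itself
```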
The requests.Response object has a content attribute, which is an instance of str (bytes on Python 3) and contains your HTML.
The text attribute is of type unicode, which may or may not be what you want. You mention in the comments that you saw a UnicodeDecodeError when you tried to write it to a file. I was able to replace result.content above with result.text and did not get that error.
If you need help solving your unicode problems, I recommend reading this unicode presentation. It explains why and when to decode and encode unicode.
The interactive interpreter echoes the result of every expression that doesn't produce None. This doesn't happen in regular scripts.
Use print to explicitly echo values:
print(response.content)
I used the undecoded version here as you are redirecting stdout to a file with no further encoding information.
You'd be better off writing the output directly to a file, however:
with open('html.txt', 'ab') as outputfile:
    outputfile.write(response.content)
This writes the response body, undecoded, directly to the file.

Python parse xml file

I need to parse an XML file; which method would be best for my case: beautifulsoup4, ElementTree, etc.? It's a pretty big file.
I have Windows 10 64-bit running Python 2.7.11 32-bit.
xml file:
http://pastebin.com/jTDRwCZr
I'm trying to get this output from the XML file. It contains different languages, using div xml:lang="English" for English. Any help on how I can use BeautifulSoup with lxml to achieve this? Thanks for your time.
<tt xmlns="http://www.w3.org/2006/04/ttaf1" xmlns:tts="http://www.w3.org/2006/04/ttaf1#styling">
<head>
<styling>
<style id="1" tts:textOutline='#000000 2px 2px' tts:color="white"/>
</styling>
</head>
<body>
<div xml:lang="English">
<p begin="00:00:28.966" end="00:00:31.385" style="1">
text text text...
</p>
</div>
</body>
</tt>
The file that you link to is not that large that you need to worry about alternative methods of parsing and processing it.
Assuming that you are trying to remove all non-English language divs you can do it with BeautifulSoup:
from bs4 import BeautifulSoup

with open('input.xml') as infile:
    soup = BeautifulSoup(infile, 'lxml')

for e in soup.find_all('div', attrs={'xml:lang': lambda value: value != 'English'}):
    _ = e.extract()

with open('output.xml', 'w') as outfile:
    outfile.write(soup.prettify(soup.original_encoding))
In the code above, soup.find_all() finds all divs whose xml:lang attribute has a value other than 'English'. It then removes the matching elements with extract(). Finally, the resulting document is written out to a new file using the same encoding as the input (otherwise it would default to UTF-8).
Usually the DOM approach is quick and easy to use (up to ~10 MB). However, for a really big XML file (> 50 MB), the DOM approach cannot be used, since it parses the entire XML document into memory; it can take up to 3-4 GB of RAM to parse only about 100 MB of data, and gets significantly slower.
So the other option is iterative or event-based parsing of the XML file.
For iterative parsing, the ElementTree or lxml approaches can be used. Plain ElementTree used to be quite slow, so I would recommend cElementTree, which has a similar API but is implemented in C and is significantly faster (note that on Python 3.3+, ElementTree uses the C implementation automatically).
I recently used ElementTree to parse XML files larger than 100 MB, and it has been working really well for me so far! I'm not sure about lxml.
I would look online for more information on how to use the XML parsing APIs.
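The event-based approach described above can be sketched with the standard library's iterparse. The tiny inline document here stands in for a large file on disk, and I've used a plain lang attribute rather than the namespaced xml:lang to keep the sketch simple:

```python
import io
import xml.etree.ElementTree as ET

xml_data = io.BytesIO(b"""<root>
  <item lang="English">hello</item>
  <item lang="French">bonjour</item>
</root>""")

english = []
for event, elem in ET.iterparse(xml_data, events=("end",)):
    # Each element is handled as soon as it is complete, then discarded,
    # so memory use stays flat no matter how big the input is.
    if elem.tag == "item" and elem.get("lang") == "English":
        english.append(elem.text)
    elem.clear()  # free the element's children once processed
```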
