Cannot open files with BeautifulSoup lxml parser - python

I am trying to read the contents of an HTML file with BeautifulSoup, but I'm getting a UnicodeDecodeError.
I also tried changing the parser to html.parser instead of lxml, but it doesn't work.
However, if I use the requests library to request the URL it works; it only fails when I read the HTML file locally.
answer:
I needed to specify the encoding when opening the file; it should look something like this: with open('lap.html', encoding="utf8") as html_file:

You are passing a file object to BeautifulSoup; instead, you have to pass the content of the file.
Try:
content = html_file.read()
source = BeautifulSoup(content, 'lxml')

First of all, fix the typo soruce to source, then put a space on either side of the equals sign, and then find out what cannot be handled by the encoding you are using, because that error refers to a character which can't be decoded/encoded.
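Putting these suggestions together, a minimal sketch might look like this (the file name is the one from the question; utf-8 is an assumed encoding):
from bs4 import BeautifulSoup

# open the local HTML file with an explicit encoding (utf-8 assumed here),
# then pass the file's contents -- not the file object -- to BeautifulSoup
with open('lap.html', encoding='utf-8') as html_file:
    content = html_file.read()

source = BeautifulSoup(content, 'lxml')
print(source.title)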

Related

Why does Beautiful Soup change the HTML?

I have an HTML file. I am trying to open it and read the contents as follows:
with open("M_ALARM_102.HTML", "r") as f:
    contents = f.read()
    print(contents)
When I print the contents with the code above, it prints perfectly. But when I pass the contents to BeautifulSoup and print the soup, it changes the HTML code:
soup = BeautifulSoup(contents, "html.parser")
print(soup)
Here is the output from BeautifulSoup:
ÿþ<html>
<head>
<meta charset="UTF-8">
<title>ARRÊT SERVOS</title>
<style type="text/css">
I don't understand why it is doing this. I need to extract 3 tags from it, but it keeps giving None as output.
Can anyone help me, please?
&lt; is the < symbol and &gt; is the > symbol. This escaping is for security, to protect websites against XSS (Cross-Site Scripting) attacks.
It might be that the parser used by BeautifulSoup did not recognize that file as HTML.
I see two "strange" characters in that output: ÿþ. They look like a BOM (byte order mark) that something added to the file, while the parser expected valid UTF-8.
There is a good chance that this is the problem.
One way to fix the BOM problem is to open the file in notepad, and save it as UTF-8. Notepad is pretty good at doing this kind of stuff.
You might also be able to fix it by opening the file in python as utf-16, using with open("M_ALARM_102.HTML", "r", encoding="utf-16") as f:. Note that here you specify the encoding directly (see more from python documentation about unicode).
Note that I did not personally try the latter approach, so I am not sure it will actually remove the BOM -- the best option is still to not introduce it at all in your workflow.
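For reference, a minimal sketch of that second approach (the file name is the one from the question; the ÿþ prefix looks like a UTF-16 little-endian BOM, which the utf-16 codec should consume while decoding):
from bs4 import BeautifulSoup

# decode the file as UTF-16 instead of the default encoding
with open("M_ALARM_102.HTML", "r", encoding="utf-16") as f:
    contents = f.read()

soup = BeautifulSoup(contents, "html.parser")
print(soup.title)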

How do you correctly parse web links to avoid a 403 error when using Wget?

I just started learning Python yesterday and have VERY minimal coding skill. I am trying to write a Python script that will process a folder of PDFs. Each PDF contains at least 1, and maybe as many as 15 or more, web links to supplemental documents. I think I'm off to a good start, but I keep getting "HTTP Error 403: Forbidden" errors when trying to use the wget function. I believe I'm just not parsing the web links correctly. I think the main issue arises because the web links are mostly "s3.amazonaws.com" links that are SUPER long.
For reference:
Link copied directly from PDF (works to download): https://s3.amazonaws.com/os_uploads/2169504_DFA%20train%20pass.PNG?AWSAccessKeyId=AKIAIPCTK7BDMEW7SP4Q&Expires=1909634500&Signature=aQlQXVR8UuYLtkzjvcKJ5tiVrZQ=&response-content-disposition=attachment;%20filename*=utf-8''DFA%2520train%2520pass.PNG
Link as it appears after trying to parse it in my code (doesn't work, gives "unknown url type" when trying to download): https%3A//s3.amazonaws.com/os_uploads/2169504_DFA%2520train%2520pass.PNG%3FAWSAccessKeyId%3DAKIAIPCTK7BDMEW7SP4Q%26Expires%3D1909634500%26Signature%3DaQlQXVR8UuYLtkzjvcKJ5tiVrZQ%253D%26response-content-disposition%3Dattachment%253B%2520filename%252A%253Dutf-8%2527%2527DFA%252520train%252520pass.PNG
Additionally, feel free to weigh in if I'm going about this the wrong way. Each PDF starts with a string of 6 digits, and once I download the supplemental documents I want to automatically save and name them as XXXXXX_attachY.*, where X is the identifying string of digits and Y increases for each attachment. I haven't gotten my code to work well enough to test that, but I'm fairly certain I don't have it correct either.
Help!
#!/usr/bin/env python3
import os
import glob
import pdfx
import wget
import urllib.parse
## Accessing and Creating Six Digit File Code
pdf_dir = "/users/USERNAME/desktop/worky"
pdf_files = glob.glob("%s/*.pdf" % pdf_dir)
for file in pdf_files:
    ## Identify File Name and Limit to Digits
    filename = os.path.basename(file)
    newname = filename[0:6]
    ## Run PDFX to identify and download links
    pdf = pdfx.PDFx(filename)
    url_list = pdf.get_references_as_dict()
    attachment_counter = 1
    for x in url_list["url"]:
        if x[0:4] == "http":
            parsed_url = urllib.parse.quote(x, safe='://')
            print(parsed_url)
            wget.download(parsed_url, '/users/USERNAME/desktop/worky/(newname)_attach(attachment_counter).*')
            ##os.rename(r'/users/USERNAME/desktop/worky/(filename).*', r'/users/USERNAME/desktop/worky/(newname)_attach(attachment_counter).*')
            attachment_counter += 1
    for x in url_list["pdf"]:
        print(parsed_url + "\n")
I prefer to use requests (https://requests.readthedocs.io/en/master/) when trying to grab text or files online. I tried it quickly with wget and got the same error (it might be linked to the User-Agent HTTP header that wget uses).
wget and HTTP header issues: download image from url using python urllib but receiving HTTP Error 403: Forbidden
HTTP headers : https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/User-Agent
The good thing with requests is that it lets you modify HTTP headers the way you want (https://requests.readthedocs.io/en/master/user/quickstart/#custom-headers).
import requests

r = requests.get("https://s3.amazonaws.com/os_uploads/2169504_DFA%20train%20pass.PNG?AWSAccessKeyId=AKIAIPCTK7BDMEW7SP4Q&Expires=1909634500&Signature=aQlQXVR8UuYLtkzjvcKJ5tiVrZQ=&response-content-disposition=attachment;%20filename*=utf-8''DFA%2520train%2520pass.PNG")
with open("myfile.png", "wb") as file:
    file.write(r.content)
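If the 403 really is tied to the User-Agent, the same download with a custom header might look like this (a sketch; the header value is just an example and the URL is a placeholder for the pre-signed S3 link):
import requests

# placeholder: the pre-signed S3 link copied from the PDF goes here
url = "https://s3.amazonaws.com/os_uploads/..."
headers = {"User-Agent": "Mozilla/5.0"}

r = requests.get(url, headers=headers, timeout=10)
r.raise_for_status()
with open("myfile.png", "wb") as file:
    file.write(r.content)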
I'm not sure I understand what you're trying to do, but maybe you want to use formatted strings to build your URLs (https://docs.python.org/3/library/stdtypes.html?highlight=format#str.format) ?
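For example, the XXXXXX_attachY naming scheme from the question could be built roughly like this (a sketch reusing the variables from the posted script; the file extension would still need to be taken from the actual attachment):
# build the destination name from the six-digit prefix and a running counter
save_path = "/users/USERNAME/desktop/worky/{}_attach{}".format(newname, attachment_counter)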
Maybe checking string indices is fine in your case (if x[0:4] == "http":), but I think you should look at Python's re package and use regular expressions to catch the elements you want in a document (https://docs.python.org/3/library/re.html).
import re

regex = re.compile(r"^http://")
if re.match(regex, mydocument):
    <do something>
The reason for this behavior is inside the wget library: internally it encodes the URL with urllib.parse.quote() (https://docs.python.org/3/library/urllib.parse.html#urllib.parse.quote).
Basically, it replaces characters with their corresponding %xx escape sequences. Your URL is already escaped, but the library does not know that. When it sees the %20 it treats the % as a character that needs to be replaced, so the result is %2520 and a different URL - hence the 403 error.
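The double-encoding is easy to reproduce in isolation (a quick illustration with a made-up URL, not part of the original answer):
import urllib.parse

already_escaped = "https://s3.amazonaws.com/os_uploads/some%20file.png"
# the %20 comes back as %2520 because the % itself is escaped to %25
print(urllib.parse.quote(already_escaped, safe='://'))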
You could decode that URL first and then pass it, but then you would have another problem with this library, because your URL has the parameter filename*= while the library expects filename=.
I would recommend doing something like this:
import requests

# get the file
req = requests.get(parsed_url)

# parse your URL to get its GET parameters
get_parameters = [x for x in parsed_url.split('?')[1].split('&')]

filename = ''
# find the GET parameter holding the name
for get_parameter in get_parameters:
    if "filename*=" in get_parameter:
        # split it to get the name
        filename = get_parameter.split('filename*=')[1]

# save the file
with open(<path> + filename, 'wb') as file:
    file.write(req.content)
I would also recommend removing the utf-8'' in that filename because I don't think it is actually part of the filename. You could also use regular expressions for getting the filename, but this was easier for me.
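A quick sketch of that cleanup (the example value mirrors the filename*= parameter in the question's URL):
import urllib.parse

# example of what the extracted parameter value can look like
filename = "utf-8''DFA%2520train%2520pass.PNG"

# strip the leading "utf-8''" marker and decode the percent-escapes
filename = urllib.parse.unquote(filename.replace("utf-8''", ""))
print(filename)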

Failing to parse 1 MB XML File with BS4

I have two SVG maps of the world, downloaded here. My goal is to do some editing of these maps in Python, working with them via BeautifulSoup4. This works perfectly with the low-res file (132.5 KB). However, the BS4 parser (using lxml) fails entirely when I attempt to use it with the high-res file (1.2 MB).
The code is as follows:
import lxml
from bs4 import BeautifulSoup as Soup

with open('worldHigh.svg', 'r') as f:
    handler = f.read()
    soup = Soup(handler, 'xml')
    print(soup.prettify())
When I run that with the worldHigh.svg file, the only thing that is printed is
<?xml version="1.0" encoding="utf-8"?>
When I run the equivalent, but changing worldHigh.svg for worldLow.svg, it prints the XML correctly (as desired).
Both SVG files work fine when opened by themselves (i.e., they show the map). However, one fails when I try to parse it, the other succeeds. I am at a loss for what is going wrong. I would understand if the parser fails at large sizes, but 1.2 MB does not seem large.
The XML parser needs the raw, undecoded sequence of bytes. Use open(..., 'rb') when parsing XML.
The reason one file worked and the other didn't is that worldHigh.svg has a BOM at the beginning of the file.
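A minimal sketch of that fix, using the file name from the question (binary mode hands the parser the raw bytes, BOM included, so it can detect the encoding itself):
from bs4 import BeautifulSoup as Soup

# open in binary mode so the XML parser receives raw bytes rather than decoded text
with open('worldHigh.svg', 'rb') as f:
    soup = Soup(f.read(), 'xml')

print(soup.prettify()[:200])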

How to auto-close xml tags in truncated file?

I receive an email when a system at my company generates an error. This email contains XML, all crammed onto a single line.
I wrote a Notepad++ Python script that parses out everything except the XML and pretty-prints it. Unfortunately, some of the emails contain too much XML data and it gets truncated. In general, the truncated data isn't that important to me. I would like to be able to just auto-close any open tags so that my Python script works. It doesn't need to be smart or correct; it just needs to make the XML well-formed enough that the script runs. Is there a way to do this?
I am open to Python scripts, online apps, downloadable apps, etc.
I realize that the right solution is to get the non-truncated XML, but pulling the right levers to get that done would be far more work than just dealing with it.
Use Beautiful Soup
>>> import bs4
>>> s = bs4.BeautifulSoup("<asd><xyz>asd</xyz>")
>>> s
<html><head></head><body><asd><xyz>asd</xyz></asd></body></html>
>>> s.body.contents[0]
<asd><xyz>asd</xyz></asd>
Notice that it closed the "asd" tag automagically.
To create a Notepad++ script to handle this:
download the tarball and extract the files
copy the bs4 directory to your PythonScript/scripts folder
in Notepad++, add the following code to your Python script:
#import Beautiful Soup
import bs4
#get text in document
text = editor.getText()
#soupify it to fix XML
soup = bs4.BeautifulSoup(text)
#convert soup object to string again
text = str(soup)
#clear editor and replace bad xml with fixed xml
editor.clearAll()
editor.addText(text)
#change language to xml
notepad.menuCommand( MENUCOMMAND.LANG_XML )
#soup has its own prettify, but I like the XML tools version better
notepad.runMenuCommand('XML Tools', 'Pretty print (XML only - with line breaks)', 1)
If you have BeautifulSoup and lxml installed, it's straightforward:
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup("""
... <?xml version="1.0" encoding="utf-8"?>
... <a>
... <b>foo</b>
... <c>bar</""", "xml")
>>> soup
<?xml version="1.0" encoding="utf-8"?>
<a>
<b>foo</b>
<c>bar</c></a>
Note the second "xml" argument to the constructor to avoid the XML being interpreted as HTML.

How to unescape special characters while converting pyquery object to string

I am trying to fetch a remote page with the Python requests module, reconstruct a DOM tree, do some processing, and save the result to a file. When I fetch a page and then just write it to a file, everything works (I can open the HTML file later in the browser and it is rendered correctly).
However, if I create a pyquery object, do some processing, and then save it using str conversion, it fails. Specifically, special characters like && get escaped within the script tags of the saved source (caused by applying pyquery), and this prevents the page from rendering correctly.
Here is my code:
import requests
from lxml import etree
from pyquery import PyQuery as pq
user_agent = {'User-agent': 'Mozilla/5.0'}
r = requests.get('http://www.google.com',headers=user_agent, timeout=4)
DOM = pq(r.text)
#some optional processing
fTest = open("fTest.html","wb")
fTest.write(str(DOM))
fTest.close()
So, the question is: how do I make sure that special characters aren't escaped after applying pyquery? I suppose it might be related to lxml (the parent library of pyquery), but after a tedious search online and experiments with different ways of serializing the object, I still haven't figured it out. Maybe this is also related to Unicode handling?!
Many thanks in advance!
I have found an elegant solution to the problem, and the reason why the code didn't work before.
First, read the page http://lxml.de/lxmlhtml.html carefully.
It has a section "Creating HTML with the E-factory". After that section, they point out that the etree.tostring() method works for XML only; for HTML, which may additionally contain script or style tags, it will mess things up. So:
Second, the solution is to use the overloaded method html.tostring().
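As a rough illustration of the difference (a sketch with a made-up fragment, assuming lxml's default serialization behaviour):
from lxml import etree, html

# a made-up fragment with an unescaped && inside a <script> tag
doc = html.fromstring('<html><body><script>if (a && b) { go(); }</script></body></html>')

print(etree.tostring(doc))  # XML serializer: the && comes out as &amp;&amp;
print(html.tostring(doc))   # HTML serializer: the script content is left as-is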
The final working code is:
# for networking
import requests
# for parsing and serialization
from lxml import etree
from lxml.html import tostring as html2str # IMPORTANT!!!
from pyquery import PyQuery as pq
user_agent = {'User-agent': 'Mozilla/5.0'}
r = requests.get('http://www.google.com',headers=user_agent, timeout=4)
# construct DOM object
DOM = pq(r.text)
# do stuff with DOM
#
# save result to file
fTest = open("fTest.html","wb")
fTest.write(html2str(DOM.root)) # IMPORTANT!!!
fTest.close()
Hope it saves some of you time in the future! Have fun, guys! ;)
