Python BeautifulSoup output unusual spacing and characters

Python BeautifulSoup output unusual spacing and characters - python

Im new to python.
Im trying to parse data from a website using BeautifulSoup, I have successful used BeautifulSoup before. However for this particular website the data returned has spaces between every character and lots of "&gt" characters as well.
The weird thing is if copy the page source and add it to my local apache instance and make a request to my local copy, then the output is perfect. I should mention that the difference between my local and the website:
my local does not use https
my local does not require authentication however the website does require Active Directory auth and I using requests_ntlm
import requests
from requests_ntlm import HttpNtlmAuth
from bs4 import BeautifulSoup
r = requests.get("http://WEBSITE/CONTEXT/",auth=HttpNtlmAuth('DOMAIN\USER','PASS'))
content = r.text
soup = BeautifulSoup(content, 'lxml')
print(soup)

It looks like local server returns content encoded using UTF-8 and the main website use UTF-16. It's suggests the main website in not configured correctly. However it's possible to get around this issue with code.
Python defaults the requests to the encoding to UTF-8. (I believe) this is based on the response headers. The request has a method called apparent_encoding, which reads the stream and detects the correct encoding using chardet. However apparent_encoding does not get consumed, unless specified.
Therefore by setting r.encoding = r.apparent_encoding, the request should download the text correctly across both environments.
Code should look something like:
r = requests.get("http://WEBSITE/CONTEXT/",auth=HttpNtlmAuth('DOMAIN\USER','PASS'))
r.encoding = r.apparent_encoding # Override the default encoding
content = r.text
r.raise_for_status() # Always check for server errors before consuming the data.
soup = BeautifulSoup(content, 'lxml')
print(soup.prettify()) # Should match print(content) (minus indentation)

Related

how to work around encoding problems in redirects

A website I try to scrape seems to have an encoding problem. The pages state, that they are encoded in utf-8, but if I try to scrape them and fetch the html source using requests, the redirect adress contains an encoding, that is not utf-8.
Browsers seem to be tolerant, so they fix this automatically, but the python requests package runs into an exception.
My code looks like this:
res= rq.get(url, allow_redirects=True)
This runs into an exception when trying to decode the redirect string in the following code (hidden somewhere in the requests package):
string.decode(encoding)
where string is the redirect string and encoding is 'utf8':
string= b'/aktien/herm\xe8s-aktie'
I found out, that the encoding in fact is encoded in something like 'Windows-1252'. Actually the redirect should go on '/aktien/herm%C3%A8s-aktie'.
Now my question: how can I either get requests to be more tolerant about such encoding bugs (like the browsers), or how can I alternatively pass an encoding?
I searched for encoding settings, but what I saw so far, requests always does that automatically based on the result.
Btw. the result page of the redirect starts with (it really states to be utf-8)
<!DOCTYPE html><html lang="de" prefix="og: http://ogp.me/ns#"><head><meta charset="utf-8">

You can use hooks= parameter in requests.get() method and explicitly urlencode the Location HTTP header. For example:
import requests
import urllib.parse
url = "<YOUR URL FROM EXAMPLE>"
def response_hook(hook_data, **kwargs):
if "Location" in hook_data.headers:
hook_data.headers["Location"] = urllib.parse.quote(
hook_data.headers["Location"]
)
res = requests.get(url, allow_redirects=True, hooks={"response": response_hook})
print(res.url)
Prints:
https://.../herm%C3%A8s-aktie

Python requests.get content encoding problem

I am working on a scraping project using requests.get. The html file contains a relative path of the form href="../_static/new_page.html" for the css file.
I am using the below code to get the html file
import requests
url = "www.example.com"
req = requests.get(url)
req.content
All the href containing "../_static" become "_static/...". I tried req.text and changed the encoding to utf-8, which is the encoding of the page. However, I am always getting the same result. I also tried urllib.request.get, and I also got the same problem.
Any suggestions!

Adam.
Yes, it will be related to encoding format when you write the response content to Html file.
but you just have to consider the encoding type of response content itself.
Just check the encoding type of your requests library.
response = requests.get("url")
print(response.encoding)
You just need to choose the right encoding type like above.
response.encoding = "utf-8"
or
response.encoding = "ISO-8859-1"
or
response.encoding = "utf-8-sig"
...
Hope my answer helps you.
Regards

Error while scraping image with beautifulsoup

The original code is here : https://github.com/amitabhadey/Web-Scraping-Images-using-Python-via-BeautifulSoup-/blob/master/code.py
So i am trying to adapt a Python script to collect pictures from a website to get better at web scraping.
I tried to get images from "https://500px.com/editors"
The first error was
The code that caused this warning is on line 12 of the file/Bureau/scrapper.py. To get rid of this warning, pass the additional argument
'features="lxml"' to the BeautifulSoup constructor.
So I did :
soup = BeautifulSoup(plain_text, features="lxml")
I also adapted the class to reflect the tag in 500px.
But now the script stopped running and nothing happened.
In the end it looks like this :
import requests
from bs4 import BeautifulSoup
import urllib.request
import random
url = "https://500px.com/editors"
source_code = requests.get(url)
plain_text = source_code.text
soup = BeautifulSoup(plain_text, features="lxml")
for link in soup.find_all("a",{"class":"photo_link "}):
href = link.get('href')
print(href)
img_name = random.randrange(1,500)
full_name = str(img_name) + ".jpg"
urllib.request.urlretrieve(href, full_name)
print("loop break")
What did I do wrong?

Actually the website is loaded via JavaScript using XHR request to the following API
So you can reach it directly via API.
Note that you can increase parameter rpp=50 to any number as you want for getting more than 50 result.
import requests
r = requests.get("https://api.500px.com/v1/photos?rpp=50&feature=editors&image_size%5B%5D=1&image_size%5B%5D=2&image_size%5B%5D=32&image_size%5B%5D=31&image_size%5B%5D=33&image_size%5B%5D=34&image_size%5B%5D=35&image_size%5B%5D=36&image_size%5B%5D=2048&image_size%5B%5D=4&image_size%5B%5D=14&sort=&include_states=true&include_licensing=true&formats=jpeg%2Clytro&only=&exclude=&personalized_categories=&page=1&rpp=50").json()
for item in r['photos']:
print(item['url'])
also you can access the image url itself in order to write it directly!
import requests
r = requests.get("https://api.500px.com/v1/photos?rpp=50&feature=editors&image_size%5B%5D=1&image_size%5B%5D=2&image_size%5B%5D=32&image_size%5B%5D=31&image_size%5B%5D=33&image_size%5B%5D=34&image_size%5B%5D=35&image_size%5B%5D=36&image_size%5B%5D=2048&image_size%5B%5D=4&image_size%5B%5D=14&sort=&include_states=true&include_licensing=true&formats=jpeg%2Clytro&only=&exclude=&personalized_categories=&page=1&rpp=50").json()
for item in r['photos']:
print(item['image_url'][-1])
Note that image_url key hold different img size. so you can choose your preferred one and save it. here I've taken the big one.
Saving directly:
import requests
with requests.Session() as req:
r = req.get("https://api.500px.com/v1/photos?rpp=50&feature=editors&image_size%5B%5D=1&image_size%5B%5D=2&image_size%5B%5D=32&image_size%5B%5D=31&image_size%5B%5D=33&image_size%5B%5D=34&image_size%5B%5D=35&image_size%5B%5D=36&image_size%5B%5D=2048&image_size%5B%5D=4&image_size%5B%5D=14&sort=&include_states=true&include_licensing=true&formats=jpeg%2Clytro&only=&exclude=&personalized_categories=&page=1&rpp=50").json()
result = []
for item in r['photos']:
print(f"Downloading {item['name']}")
save = req.get(item['image_url'][-1])
name = save.headers.get("Content-Disposition")[9:]
with open(name, 'wb') as f:
f.write(save.content)

Looking at the page you're trying to scrape I noticed something. The data doesn't appear to load until a few moments after the page finishes loading. This tells me that they're using a JS framework to load the images after page load.
Your scraper will not work with this page due to the fact that it does not run JS on the pages it's pulling. Running your script and printing out what plain_text contains proves this:
<a class='photo_link {{#if hasDetailsTooltip}}px_tooltip{{/if}}' href='{{photoUrl}}'>
If you look at the href attribute on that tag you'll see it's actually a templating tag used by JS UI frameworks.
Your options now are to either see what APIs they're calling to get this data (check the inspector in your web browser for network calls, if you're lucky they may not require authentication) or to use a tool that runs JS on pages. One tool I've seen recommended for this is selenium, though I've never used it so I'm not fully aware of its capabilities; I imagine the tooling around this would drastically increase the complexity of what you're trying to do.

When I take html from a website using urllib2, the inner html is empty. Anyone know why?

I am working on a project and one of the steps includes getting a random word which I will use later. When I try to grab the random word, it gives me '<span id="result"></span>' but as you can see, there is no word inside.
Code:
import urllib2
from bs4 import BeautifulSoup
quote_page = 'http://watchout4snakes.com/wo4snakes/Random/RandomWord'
page = urllib2.urlopen(quote_page)
soup = BeautifulSoup(page, 'html.parser')
name_box = soup.find("span", {"id": "result"})
print name_box
name = name_box.text.strip()
print name
I am thinking that maybe it might need to wait for a word to appear, but I'm not sure how to do that.

This word is added to the page using JavaScript. We can verify this by looking at the actual HTML that is returned in the request and comparing it with what we see in the web browser DOM inspector. There are two options:
Use a library capable of executing JavaScript and giving you the resulting HTML
Try a different approach that doesn't require JavaScript support
For 1, we can use something like requests_html. This would look like:
from requests_html import HTMLSession
url = 'http://watchout4snakes.com/wo4snakes/Random/RandomWord'
session = HTMLSession()
r = session.get(url)
# Some sleep required since the default of 0.2 isn't long enough.
r.html.render(sleep=0.5)
print(r.html.find('#result', first=True).text)
For 2, if we look at the network requests that the page is making, then we can see that it retrieves random words by making a POST request to http://watchout4snakes.com/wo4snakes/Random/RandomWord. Making a direct request with a library like requests (recommended in the standard library documentation here) looks like:
import requests
url = 'http://watchout4snakes.com/wo4snakes/Random/RandomWord'
print(requests.post(url).text)

So the way that the site works is that it sends you the site with no word in the span box, and edits it in later through JavaScript; that's why you get a span box with nothing inside.
However, since you're trying to get the word I'd definitely suggest you use a different method to getting the word, rather than scraping the word off the page, you can simply send a POST request to http://watchout4snakes.com/wo4snakes/Random/RandomWord with no body and receive the word in response.
You're using Python 2 but in Python 3 (for example, so I can show this works) you can do:
>>> import requests
>>> r = requests.post('http://watchout4snakes.com/wo4snakes/Random/RandomWord')
>>> print(r.text)
doom
You can do something similar using urllib in Python 2 as well.

web scraping python <span> with id

I want to scrap data in the <span/> attribute for a given website using BeautifulSoup. You can see at the screenshot where it locates. However, the code that I'm using is just returning an empty list. I can't find the data in the list that I want. What am I doing wrong?
from bs4 import BeautifulSoup
from urllib import request
url = "http://144.122.167.229"
opener = urllib.request.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
data = opener.open(url).read()
soup = BeautifulSoup(data, 'html.parser')
your_data = list()
for line in soup.findAll('span', attrs={'id': 'mc1_legend_value'}):
your_data.append(line.text)
for line in soup.findAll('span'):
your_data.append(line.text)
ScreenShot : https://imgur.com/a/z0vNh
Thank you.

The dashboard from the screenshot looks to me like something javascript would generate. If you can't find the tag in the page source, that means it was later added by some javascript code or your browser tried to fix some html which it considered broken or out of place.
Keep in mind that right now you're sending a request to a server and it serves you the plain html back. A browser would parse the html and execute any javascript code if it finds any. In your case, beautiful soup or urllib doesn't execute any javascript code. urllib fetches the html and beautiful soup makes it easier to parse and extract relevant information.
If you want to get the value from that tag, I recommend using a headless browser to render your page and just after that parse it's html through beautiful soup or any other parser.
Give a try to selenium: http://selenium-python.readthedocs.io/.
You can control your own browser programmatically. You can make it request the page for you, render it, save the new html in a variable, parse it using beautifoul soup and extract the values you're interested in. I believe that it already has it's own parser implemented which you can use directly to search for that tag.
Or maybe even scrapinghub's splash: https://github.com/scrapinghub/splash
If the dashboard communicates with a server in real-time and that value is continuously received from the server, you could take a look at what requests are sent to the server in order to get that value. Take a look in developer console under the networks tab. Press F12 to open the developer console and click on Network. Refresh the page and you should get all the request send to the server along with the responses. Requests sent by the javascript are usually XMLHttpRequests. Click on XHR in the Network tab to filter out any other requests. (These are instructions for Google Chrome. Firefox might differ a bit).

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.