Weird / funny output after BeautifulSoup web scrape - Python

I thought this would be funny and interesting to share. I ran into a weird situation that I have never encountered before.
I was fooling around with Python's BeautifulSoup. After scraping https://www.amazon.ca I got the strangest output at the end of the HTML.
Can anyone tell me if this is intentional from the developers of Amazon? Or is it something else?
FYI, here is the code I used, to show it has nothing to do with me:
from bs4 import BeautifulSoup
import urllib.request as request  # aliasing this as "re" would collide with the regex module's name

# Below is the soup used to gather the HTML. The 'lxml' parser requires the
# lxml package to be installed, but it does not need to be imported.
url = "https://www.amazon.ca"
page = request.urlopen(url)
soup = BeautifulSoup(page, 'lxml')
print(soup)

So, Amazon doesn't allow web scraping on their websites, and they may serve different HTML content to scraping programs. For me, the HTML just said: "Forbidden".
If you want to get data from Amazon, you will probably need to use their API.
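A quick way to confirm this is to inspect the HTTP status code and the start of the response body. This is a minimal sketch, assuming the requests library is installed; a 403/503 status or a "Forbidden"/robot-check body is what an automated client typically gets back:
import requests

# Fetch the page as a default (non-browser) client and inspect what
# Amazon actually returns to automated requests.
resp = requests.get("https://www.amazon.ca")
print(resp.status_code)   # often 403 or 503 for non-browser clients
print(resp.text[:300])    # start of the body, e.g. "Forbidden" or a robot check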

Related

BeautifulSoup result: long random string

I am learning web scraping; however, I ran into an issue preparing the soup. It doesn't even look like the HTML I can see while inspecting the page.
import requests
from bs4 import BeautifulSoup
URL = "https://www.mediaexpert.pl/"
response = requests.get(URL).text
soup = BeautifulSoup(response,"html.parser")
print(soup)
The result (shown in a screenshot in the original post) is mostly a long random string rather than readable HTML.
I tried searching the whole internet, but I think I have too little knowledge, for now, to find a solution. This random string makes up about 85% of the output.
I would be glad for any bit of help.
BeautifulSoup does not deal with JavaScript-generated content; it only works with static HTML. To extract data generated by JavaScript, you would need to use a library like Selenium.
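For illustration, here is a minimal sketch of that approach, assuming the selenium package is installed (recent Selenium versions download a matching Chrome driver automatically):
from bs4 import BeautifulSoup
from selenium import webdriver

URL = "https://www.mediaexpert.pl/"

# Let a real browser execute the page's JavaScript, then hand the
# rendered HTML to BeautifulSoup.
driver = webdriver.Chrome()
driver.get(URL)
html = driver.page_source
driver.quit()

soup = BeautifulSoup(html, "html.parser")
print(soup.prettify()[:1000])  # rendered markup instead of the raw script bundle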

Scrape data from website with frames or flexbox using python requests and BeautifulSoup

I've been trying to figure this out but with no luck. I found a thread (How to scrape data from flexbox element/container with Python and Beautiful Soup) that I thought would help, but I can't seem to make any headway.
The site I'm trying to scrape is http://www.northwest.williams.com/NWP_Portal/. In particular, I want the data from the 'Storage Levels' tab/frame, but for the life of me I can't navigate to the right spot. I've tried various iterations of the code below with no success: I've changed 'lxml' to 'html.parser', looked for tables, looked for 'tr', etc., but the code always returns empty. I've also looked at the network info, but when I click on any of the tabs (System Status, PAL/System Balancing, etc.) I don't see any change in network activity. I'm sure it's something simple that I'm overlooking, but I just can't put my finger on it.
from bs4 import BeautifulSoup
import requests

url = 'http://www.northwest.williams.com/NWP_Portal/'
r = requests.get(url)
# Parse the outer page and look for the panels container.
soup = BeautifulSoup(r.content, 'lxml')
panels = soup.find_all('div', {'class': 'dailyOperations-panels'})
How can I 'navigate' to the 'Storage Levels' frame/tab? What HTML am I actually looking for? Can I do this with just requests and BeautifulSoup? I'm not opposed to using Selenium, but I haven't used it before and would prefer to stick with requests and BeautifulSoup if possible.
Thanks in advance!
Hey, so what I notice is that you're trying to get "dailyOperations-panels" from a div, which won't work.
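Since the question mentions frames: the content of a <frame> or <iframe> is a separate document that requests does not fetch for you. A minimal sketch of following the frames by hand (the tag names and URL handling here are assumptions; this is not tested against the site):
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

url = 'http://www.northwest.williams.com/NWP_Portal/'
outer = BeautifulSoup(requests.get(url).content, 'lxml')

# Each frame/iframe embeds another document; fetch each one separately.
for frame in outer.find_all(['frame', 'iframe']):
    src = frame.get('src')
    if not src:
        continue
    inner = BeautifulSoup(requests.get(urljoin(url, src)).content, 'lxml')
    print(src, len(inner.find_all('tr')))  # e.g. count table rows per frame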

Scraping YouTube to get dynamically loaded content

I'm trying to scrape YouTube, but most of the time it just gives back an empty result.
In this code snippet I'm trying to get the list of video titles on the page, but when I run it I just get an empty result back; not even one title shows up.
I've searched around, and some results point out that this is due to the website loading content dynamically with JavaScript. Is this the case? How do I go about solving it? Is it possible to do without Selenium?
from bs4 import BeautifulSoup
import requests
source = requests.get('https://www.youtube.com/user/schafer5/videos').text
soup = BeautifulSoup(source, 'lxml')
title = soup.findAll('a', attrs={'class': "yt-simple-endpoint style-scope ytd-grid-video-renderer"})
print(title)
Is it possible to do without Selenium?
Often services have APIs which allow easier automation than scraping the site. YouTube has an API, and there are official client libraries for various languages; for Python there is google-api-python-client. You would need an API key to use it. To get running, I suggest following the YouTube Data API quickstart; note that you can ignore the OAuth 2.0 parts as long as you only need access to public data.
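As a rough sketch of what that looks like with google-api-python-client (the API key and channel ID below are placeholders to fill in; search().list is one way to list a channel's public videos):
from googleapiclient.discovery import build

API_KEY = "YOUR_API_KEY"          # from the Google Cloud console
CHANNEL_ID = "YOUR_CHANNEL_ID"    # the channel's ID, not its /user/ name

youtube = build("youtube", "v3", developerKey=API_KEY)

# Request the channel's most recent videos; public data only, no OAuth.
response = youtube.search().list(
    part="snippet",
    channelId=CHANNEL_ID,
    type="video",
    order="date",
    maxResults=25,
).execute()

for item in response["items"]:
    print(item["snippet"]["title"])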
I totally agree with @Daweo, and that's the right way to scrape a website like YouTube. But if you want to use BeautifulSoup and not get an empty list at the end, your code should be changed as follows:
from bs4 import BeautifulSoup
import requests

source = requests.get('https://www.youtube.com/user/schafer5/videos').text
soup = BeautifulSoup(source, 'html.parser')
# Keep only the anchors that carry an aria-describedby attribute (the video links).
titles = [i.text for i in soup.find_all('a') if i.get('aria-describedby')]
print(titles)
I also suggest that you use the API.

Using Beautiful Soup on local content

I started a research project grabbing pages using wget with the local-links and mirror options. I did it this way at the time to capture the data, as I did not know how long the sites would stay up. So I have 60-70 sites fully mirrored, with localized links, sitting in a directory. I now need to glean what I can from them.
Is there a good example of parsing these pages using BeautifulSoup? I realize that BeautifulSoup is designed to take an HTTP response and parse from there. I'll be honest: I'm not savvy with BeautifulSoup yet, and my programming skills are not awesome. Now that I have some time to devote to it, I would like to do this the easy way rather than the manual way.
Can someone point me to a good example, resource, or tutorial for parsing the HTML I have stored? I really appreciate it. Am I over-thinking this?
Using BeautifulSoup on local content is just the same as using it on Internet content. For example, to read a local HTML file into bs4:
import urllib.request
import bs4

# file:// URLs work with urlopen just like http:// URLs do.
response = urllib.request.urlopen('file:///Users/Li/Desktop/test.html', timeout=1)
html = response.read()
soup = bs4.BeautifulSoup(html, 'html.parser')
In terms of how to use bs4 for processing HTML, the bs4 documentation is a pretty good tutorial; in most situations, spending a day reading it is enough for basic data processing.
https://www.crummy.com/software/BeautifulSoup/bs4/doc/
To parse a document, pass it into the BeautifulSoup constructor. You can pass in a string or an open filehandle:
from bs4 import BeautifulSoup
soup = BeautifulSoup(open("index.html"), "html.parser")  # explicit parser avoids a warning
soup = BeautifulSoup("<html>data</html>", "html.parser")

BeautifulSoup empty result

I'm currently running this code:
from urllib.request import urlopen  # Python 3; the original used Python 2's urllib.urlopen

from bs4 import BeautifulSoup

htmltext = urlopen("http://www.fifacoin.com/")
html = htmltext.read()
soup = BeautifulSoup(html, 'html.parser')  # explicit parser avoids a warning
for item in soup.find_all('tr', {'data-price': True}):
    print(item['data-price'])
When I run this code I don't get any output at all, even though I know there are HTML tags matching these search parameters on that website. I'm probably making an obvious mistake here; I'm new to Python and BeautifulSoup.
The problem is that the price list table is loaded through JavaScript, and urllib does not include a JavaScript engine as far as I know. So all of the JavaScript on that page, which runs in a normal browser, does not run in the page fetched by urllib.
The only way of doing this is to emulate a real browser.
Solutions that come to mind are PhantomJS and Node.js.
I recently did a similar thing with Node.js (although I am a Python fan as well) and was pleasantly surprised. I did it a little differently, but this page seems to explain quite well what you would want to do: http://liamkaufman.com/blog/2012/03/08/scraping-web-pages-with-jquery-nodejs-and-jsdom/
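If you would rather stay in Python, the same browser-emulation idea works with Selenium (named here as a stand-in for PhantomJS, which has since been discontinued; this sketch assumes selenium and Chrome are installed):
from bs4 import BeautifulSoup
from selenium import webdriver

# Render the page in a real browser so the JavaScript-built table exists,
# then reuse the original BeautifulSoup query on the rendered HTML.
driver = webdriver.Chrome()
driver.get("http://www.fifacoin.com/")
html = driver.page_source
driver.quit()

soup = BeautifulSoup(html, "html.parser")
for item in soup.find_all('tr', {'data-price': True}):
    print(item['data-price'])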
