I'm currently running this code:
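import urllib.request
from bs4 import BeautifulSoup

# Download the page and parse it
htmltext = urllib.request.urlopen("http://www.fifacoin.com/")
html = htmltext.read()
soup = BeautifulSoup(html, "html.parser")
# Print the data-price attribute of every <tr> that has one
for item in soup.find_all('tr', {'data-price': True}):
    print(item['data-price'])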
When I run this code I don't get any output at all, even though I know that page contains HTML tags matching these search parameters. I'm probably making an obvious mistake here; I'm new to Python and BeautifulSoup.
The problem is that the price list table is loaded through JavaScript, and urllib does not include any JavaScript engine as far as I know. So the JavaScript in that page, which a normal browser executes, is never run on the page fetched by urllib, and the table rows never get created.
To get the rendered table you need something that actually runs the page's scripts, which means emulating a real browser.
Solutions that come to mind are PhantomJS and Node.js.
I recently did a similar thing with Node.js (although I am a Python fan as well) and was pleasantly surprised. I did it a little differently, but this page seems to explain quite well what you would want to do: http://liamkaufman.com/blog/2012/03/08/scraping-web-pages-with-jquery-nodejs-and-jsdom/
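Before reaching for a browser, it can help to confirm the diagnosis by checking whether the rows exist in the raw HTML at all. Here is a minimal sketch using only the standard library; `AttributeCollector`, `static_attribute_values` and the sample markup are made up for illustration, and the real page's markup will differ.

```python
from html.parser import HTMLParser


class AttributeCollector(HTMLParser):
    """Collect the values of one attribute from every occurrence of one tag."""

    def __init__(self, tag, attribute):
        super().__init__()
        self.tag = tag
        self.attribute = attribute
        self.values = []

    def handle_starttag(self, tag, attrs):
        if tag != self.tag:
            return
        for name, value in attrs:
            if name == self.attribute:
                self.values.append(value)


def static_attribute_values(html_text, tag, attribute):
    """Return the attribute values found in the HTML as served (no scripts run)."""
    collector = AttributeCollector(tag, attribute)
    collector.feed(html_text)
    collector.close()
    return collector.values


# Hypothetical page: the table is empty until a script fills it in a browser.
SAMPLE_PAGE = """
<html><body>
<table id="prices"></table>
<script>
  // In a browser this would add <tr data-price="..."> rows to the table.
  loadPrices("#prices");
</script>
</body></html>
"""

print(static_attribute_values(SAMPLE_PAGE, "tr", "data-price"))  # []
```

If the list comes back empty for the downloaded page but the rows are visible in the browser's inspector, the rows are being created by JavaScript.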
Related
I am learning web scraping, but I have a problem preparing the soup. It doesn't even look like the HTML I can see when inspecting the page.
import requests
from bs4 import BeautifulSoup
URL = "https://www.mediaexpert.pl/"
response = requests.get(URL).text
soup = BeautifulSoup(response,"html.parser")
print(soup)
The result looks like this: [screenshot: "Result, soup"]
I tried to search the whole internet, but I think I have too little knowledge, for now, to find a solution. This random string is 85% of the result.
I will be glad for every bit of help.
BeautifulSoup does not deal with JavaScript-generated content. It only parses the static HTML you hand it. To extract data generated by JavaScript, you would need a tool that drives a real browser, such as Selenium.
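One rough way to see how much of a response is script rather than content is to measure it. The sketch below uses only the standard library; `ScriptShareParser`, `script_share` and the sample string are hypothetical, and the number is only a hint, not a reliable test.

```python
from html.parser import HTMLParser


class ScriptShareParser(HTMLParser):
    """Count characters inside <script> elements versus everywhere else."""

    def __init__(self):
        super().__init__()
        self._in_script = False
        self.script_chars = 0
        self.other_chars = 0

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self._in_script = True

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_script = False

    def handle_data(self, data):
        # Ignore whitespace around each chunk of text
        length = len(data.strip())
        if self._in_script:
            self.script_chars += length
        else:
            self.other_chars += length


def script_share(html_text):
    """Return the fraction of text characters that sit inside <script> tags."""
    parser = ScriptShareParser()
    parser.feed(html_text)
    parser.close()
    total = parser.script_chars + parser.other_chars
    return parser.script_chars / total if total else 0.0


# Made-up sample: 4 characters of visible text, 12 characters of script
SAMPLE = "<html><body><p>Shop</p><script>var data=12;</script></body></html>"
print(script_share(SAMPLE))  # 0.75
```

A share close to 1.0 for the text returned by requests suggests the visible page is built in the browser.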
I've been trying to figure this out but with no luck. I found a thread (How to scrape data from flexbox element/container with Python and Beautiful Soup) that I thought would help but I can't seem to make any headway.
The site I'm trying to scrape is http://www.northwest.williams.com/NWP_Portal/. In particular, I want to get the data from the 'Storage Levels' tab/frame, but for the life of me I can't seem to navigate to the right spot to get the data. I've tried various iterations of the code below with no success. I've changed 'lxml' to 'html.parser', looked for tables, looked for 'tr' etc., but the code always returns an empty result. I've also tried looking at the network info, but when I click on any of the tabs (System Status, PAL/System Balancing etc.) I don't see any change in network activity. I'm sure it's something simple that I'm overlooking, but I just can't put my finger on it.
from bs4 import BeautifulSoup as soup
import requests
url = 'http://www.northwest.williams.com/NWP_Portal/'
r = requests.get(url)
html = soup(r.content,'lxml')
page = html.findAll('div',{'class':'dailyOperations-panels'})
How can I 'navigate' to the 'Storage Levels' frame/tab? What is the html that I'm actually looking for? Can I do this with just requests and beautiful soup? I'm not opposed to using Selenium but I haven't used it before and would prefer to just use requests and BeautifulSoup if possible.
Thanks in advance!
Hey, so what I notice is that you are trying to get "dailyOperations-panels" from a div, which won't work.
This is the code that I wrote. I watched a lot of tutorials, and they get the expected output with exactly the same code.
import requests
from bs4 import BeautifulSoup as bs
url="https://shop.punamflutes.com/pages/5150194068881408"
page=requests.get(url).text
soup=bs(page,'lxml')
#print(soup)
tag=soup.find('div',class_="flex xs12")
print(tag)
I always get None. Also, the class name seems strange. The "view source" output is different from what "inspect element" shows.
Bs4 is weird. Sometimes it returns different code than what is on the page; it alters it depending on the source. Try using Selenium. It works great and has many more uses than bs4. Most of all, it makes it super easy to find elements on a site.
It's not a bs4 problem; bs4 is correctly parsing what requests returns. It depends on the webpage itself.
If you inspect the "soup", you will see that the source of the page is a set of links to scripts that render the content on the page. In order for these scripts to be executed, you need to have a browser - requests will only get you what the webserver returns, but won't execute the javascript for you. You can verify this yourself by deactivating javascript in the developer tools of your browser.
The solution is to use a web browser (e.g. headless chrome + chromedriver) and Selenium to control it. There are plenty of good tutorials out there on how to do this.
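As a rough sketch of that approach, assuming the selenium package and a local Chrome are installed: the function names `make_headless_chrome` and `fetch_rendered_html` and the `driver_factory` parameter are made up for this example, not part of Selenium.

```python
def make_headless_chrome():
    """Start headless Chrome (assumes selenium and a local Chrome install)."""
    # Imported here so the rest of the module loads without selenium
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    options.add_argument("--headless=new")
    return webdriver.Chrome(options=options)


def fetch_rendered_html(url, driver_factory=make_headless_chrome):
    """Load url in a browser and return the HTML after scripts have run."""
    driver = driver_factory()
    try:
        driver.get(url)
        # page_source reflects the DOM at this moment; content that loads
        # later may need an explicit wait before reading it
        return driver.page_source
    finally:
        # Always close the browser, even if loading the page failed
        driver.quit()
```

The returned string can then be passed to BeautifulSoup just like the text from requests, for example `bs(fetch_rendered_html(url), 'lxml')`.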
So I am making a python project where I've decided to make a supermarket comparison thing. I've decided to leech the prices from an existing supermarket comparison website.
I used this website to learn:
https://docs.python-guide.org/scenarios/scrape/
To start I've attempted to fetch the price of apples (at Tesco) from this website:
http://www.mysupermarket.co.uk/tesco-price-comparison/Fruit/Tesco_Gala_Apple_Approx_160g.html
using an edited version of the docs code which is:
import requests
from lxml import html
page = requests.get('http://www.mysupermarket.co.uk/tesco-price-comparison/Fruit/Tesco_Gala_Apple_Approx_160g.html')
tree = html.fromstring(page.content)
price_tesco = tree.xpath('//*[@id="PriceWrp"]/div[2]/span')
print(price_tesco)
I've tried the xpath code for the price but when I print the price, it returns nothing (an empty list)
So how would I fix this?
Note - I am new to HTML Scraping and have a basic knowledge of python but decided to have a bit of a challenge.
Thanks in advance.
I can't view the site in question (I'm behind a firewall), but you should know that a lot of websites nowadays build their content dynamically with JavaScript, and that content can't be scraped with a basic HTTP library. I'm assuming that is the case here if your XPath is indeed correct but returns nothing.
Your best bet is to use a library that can render and scrape this type of dynamic content, such as Selenium or Requests-HTML (my preference since it's headless).
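To rule out the XPath itself, it can be tried on a small static fragment first. XPath tests an attribute with `@`, so an id lookup is written `[@id="PriceWrp"]`. The sketch below uses the standard library's ElementTree, which supports a subset of XPath and needs a relative path starting with `.`; lxml accepts the same expression with a leading `//`. The fragment and the price in it are invented, since the real page's markup isn't known here.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed fragment standing in for the real page
SAMPLE = (
    "<html><body>"
    '<div id="PriceWrp">'
    "<div>Tesco</div>"
    "<div><span>0.45</span></div>"
    "</div>"
    "</body></html>"
)


def select_price_spans(markup):
    """Return the <span> elements in the second <div> under id="PriceWrp"."""
    root = ET.fromstring(markup)
    # '@id' tests the id attribute; 'div[2]' is the second div child
    return root.findall(".//*[@id='PriceWrp']/div[2]/span")


spans = select_price_spans(SAMPLE)
print([span.text for span in spans])  # ['0.45']
```

If the expression works on a fragment like this but returns an empty list for the downloaded page, the element is most likely not in the static HTML.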
Because it's a JavaScript-rendered page, use requests_html with render(), like this:
from requests_html import HTMLSession
session = HTMLSession()
r = session.get('http://www.mysupermarket.co.uk/tesco-price-comparison/Fruit/Tesco_Gala_Apple_Approx_160g.html')
# Run the page's JavaScript in a headless browser before querying it
r.html.render()
price = r.html.xpath('//*[@id="PriceWrp"]/div[2]/span')[0]
print(price.text)
This site is probably dynamic, so a plain request does not give you the full HTML. You can use the Selenium library for this case; it is a little slower, but it handles pages that are rendered by JavaScript.
I thought this would be funny and interesting to share. I ran into a weird situation which I have never encountered before.
I was fooling around with Python's BeautifulSoup. After scraping https://www.amazon.ca I got the strangest output at the end of the HTML.
Can anyone tell me if this is intentional on the part of Amazon's developers, or is this something else?
FYI here is the code I used to show it has nothing to do with me
from bs4 import BeautifulSoup
from urllib.request import urlopen

# Fetch the page and build the soup from the raw HTML (needs lxml installed)
url = "https://www.amazon.ca"
page = urlopen(url)
soup = BeautifulSoup(page, 'lxml')
print(soup)
So, Amazon doesn't allow web scraping on their websites. They may serve different HTML content to web scraping programs. For me, the HTML just said: "Forbidden".
If you want to get data from Amazon, you will probably need to use their API.
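To tell a refusal apart from a real page, it helps to look at the HTTP status before parsing anything. A minimal sketch with the standard library follows; `fetch_status` and its `opener` parameter are made-up names, and which status a given site returns to a script is an assumption that may differ from run to run.

```python
import urllib.error
import urllib.request


def fetch_status(url, opener=urllib.request.urlopen):
    """Return (status_code, body_bytes); the body is b"" on an HTTP error."""
    try:
        with opener(url) as response:
            return response.status, response.read()
    except urllib.error.HTTPError as error:
        # urlopen raises HTTPError for 4xx/5xx responses such as 403 Forbidden
        return error.code, b""
```

Calling `fetch_status(url)` and checking that the first value is 200 before building the soup avoids parsing an error page by mistake.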