I am learning web scraping, but I ran into a problem preparing the soup. It doesn't even look like the HTML I can see when inspecting the page.
import requests
from bs4 import BeautifulSoup

URL = "https://www.mediaexpert.pl/"
response = requests.get(URL).text
soup = BeautifulSoup(response, "html.parser")
print(soup)
The result is mostly one long random-looking string, about 85% of the output (screenshot omitted). I have searched all over, but I think I still know too little to find a solution.
I would be grateful for any help.
BeautifulSoup does not execute JavaScript; it only parses the static HTML the server returns. To extract content generated by JavaScript, you need a tool that drives a real browser, such as Selenium.
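A common pattern is to let the browser render the page and then hand the rendered DOM to BeautifulSoup as usual. A minimal sketch, assuming Selenium 4+ and a local Chrome install (the helper names are mine, and nothing here is taken from the real page structure):

```python
from bs4 import BeautifulSoup

def extract_title(html):
    """Pull the <title> text out of (rendered) HTML, or None if absent."""
    soup = BeautifulSoup(html, "html.parser")
    return soup.title.get_text(strip=True) if soup.title else None

def fetch_rendered(url):
    """Fetch a page with a real browser so JavaScript runs first.

    Requires Selenium 4+ and a local Chrome; shown as a sketch only.
    """
    from selenium import webdriver
    driver = webdriver.Chrome()
    try:
        driver.get(url)
        # page_source holds the DOM *after* JavaScript has run
        return driver.page_source
    finally:
        driver.quit()

# Usage sketch (needs a browser):
# print(extract_title(fetch_rendered("https://www.mediaexpert.pl/")))
```

The parsing step is unchanged; only the fetching step swaps requests for a browser.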
I've been trying to figure this out with no luck. I found a thread (How to scrape data from flexbox element/container with Python and Beautiful Soup) that I thought would help, but I can't seem to make any headway.
The site I'm trying to scrape is http://www.northwest.williams.com/NWP_Portal/. In particular, I want the data from the 'Storage Levels' tab/frame, but for the life of me I can't navigate to the right spot. I've tried various iterations of the code below with no success: I've changed 'lxml' to 'html.parser', looked for tables, looked for 'tr', etc., but the code always returns empty. I've also watched the network info, but clicking any of the tabs (System Status, PAL/System Balancing, etc.) shows no change in network activity. I'm sure it's something simple that I'm overlooking, but I just can't put my finger on it.
from bs4 import BeautifulSoup as soup
import requests

url = 'http://www.northwest.williams.com/NWP_Portal/'
r = requests.get(url)
html = soup(r.content, 'lxml')
page = html.find_all('div', {'class': 'dailyOperations-panels'})
How can I 'navigate' to the 'Storage Levels' frame/tab? What is the html that I'm actually looking for? Can I do this with just requests and beautiful soup? I'm not opposed to using Selenium but I haven't used it before and would prefer to just use requests and BeautifulSoup if possible.
Thanks in advance!
Hey, so what I notice is that you are trying to get "dailyOperations-panels" from a div, which won't work.
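One thing worth checking before reaching for Selenium is whether the 'Storage Levels' content lives in a separate frame: if it does, the frame's src can be fetched directly with requests and souped on its own. A hedged sketch (the helper name is mine, not from the thread):

```python
import requests
from bs4 import BeautifulSoup

def find_frames(html):
    """Return the src of every <iframe>/<frame>, so each can be fetched directly."""
    soup = BeautifulSoup(html, "html.parser")
    return [f.get("src") for f in soup.find_all(["iframe", "frame"]) if f.get("src")]

# Usage sketch (needs network access):
# r = requests.get("http://www.northwest.williams.com/NWP_Portal/")
# for src in find_frames(r.text):
#     print(src)  # fetch each src with requests.get() and parse it separately
```

If the list comes back empty and the tabs trigger no network activity, the content is most likely built by JavaScript, and a browser-driving tool is the remaining option.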
I thought this would be funny and interesting to share. I ran into a weird situation that I had never encountered before.
I was fooling around with Python's BeautifulSoup. After scraping https://www.amazon.ca I got the strangest output at the end of the HTML.
Can anyone tell me if this is intentional from the developers of Amazon, or is it something else?
FYI, here is the code I used, to show it has nothing to do with me:
from bs4 import BeautifulSoup
import urllib.request

# Fetch the page and parse it with the lxml parser
url = "https://www.amazon.ca"
page = urllib.request.urlopen(url)
soup = BeautifulSoup(page, 'lxml')
print(soup)
So, Amazon doesn't allow web scraping of its websites. They may serve different HTML content to scraping programs; for me, the HTML just said: "Forbidden".
If you want to get data from Amazon, you will probably need to use their API.
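As a sketch of how you might detect such a block before parsing (the 'Forbidden' marker and the User-Agent string below are illustrative assumptions, not Amazon specifics):

```python
import requests

def looks_blocked(status_code, body):
    """Heuristic check for an anti-bot response.

    Assumption: a 403 status or a 'Forbidden' marker in the body means blocked.
    """
    return status_code == 403 or "Forbidden" in body

# Usage sketch (needs network access; the UA string is illustrative):
# r = requests.get("https://www.amazon.ca",
#                  headers={"User-Agent": "Mozilla/5.0"})
# print(looks_blocked(r.status_code, r.text))
```

Checking the status code first saves you from parsing an error page as if it were real content.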
I started a research project grabbing pages using wget with the local-links and mirror options. I did it this way at the time because I did not know how long the sites would stay active. So I have 60-70 sites fully mirrored with localized links sitting in a directory. I now need to glean what I can from them.
Is there a good example of parsing these pages using beautifulsoup? I realize that beautifulsoup is designed to take the http request and parse from there. I will be honest, I'm not savvy on beautifulsoup yet and my programming skills are not awesome. Now that I have some time to devote to it, I would like to do this the easy way versus the manual way.
Can someone point me to a good example, resource, or tutorial for parsing the html I have stored? I really appreciate it. Am I over-thinking this?
Using BeautifulSoup with local content is exactly the same as with content fetched from the Internet. For example, to read a local HTML file into bs4:
import urllib.request
import bs4

response = urllib.request.urlopen('file:///Users/Li/Desktop/test.html', timeout=1)
html = response.read()
soup = bs4.BeautifulSoup(html, 'html.parser')
In terms of how to use bs4 to process HTML, the bs4 documentation is a pretty good tutorial. In most situations, spending a day reading it is enough for basic data processing.
https://www.crummy.com/software/BeautifulSoup/bs4/doc/
To parse a document, pass it into the BeautifulSoup constructor. You can pass in a string or an open filehandle:
from bs4 import BeautifulSoup

soup = BeautifulSoup(open("index.html"), "html.parser")
soup = BeautifulSoup("<html>data</html>", "html.parser")
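For a whole directory of wget mirrors, the same idea extends to walking the tree and souping each file. A sketch, under the assumption that the mirrored pages end in .html (wget may also produce other extensions, so adjust the glob as needed):

```python
from pathlib import Path
from bs4 import BeautifulSoup

def parse_mirror(root):
    """Yield (path, soup) for every HTML file under a wget mirror directory."""
    for path in Path(root).rglob("*.html"):
        with open(path, encoding="utf-8", errors="replace") as fh:
            yield path, BeautifulSoup(fh, "html.parser")

# Usage sketch (the directory name is an assumption):
# for path, soup in parse_mirror("mirrors/"):
#     title = soup.title.get_text(strip=True) if soup.title else "(no title)"
#     print(path, title)
```

The errors="replace" keeps the walk from crashing on the odd mis-encoded page, which is common in bulk mirrors.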
from bs4 import BeautifulSoup
import requests

url = "http://www.csgolounge.com/api/mathes"
page = requests.get(url)
data = page.text
soup = BeautifulSoup(data, "html.parser")
print(data)
I am trying to use this code to get the text from this page, but every time I try, I am redirected to the homepage, and my code outputs the homepage's HTML. The page I am trying to scrape is a .php file, not an HTML or text file. I would like to get the text from the page, extract the data, and do what I want with it.
I have tried changing the headers of my request so that the website would think I am a Chrome browser rather than a bot, but I still get redirected to the homepage. I have tried different Python HTML parsers, such as BeautifulSoup's built-in options and many other popular parsers, but they all give the same result.
Is there a way to stop this and get the text from this link? Is it a mistake in my code, or something else?
First of all, try it without the "www" part.
Rewrite http://www.csgolounge.com/api/mathes as https://csgolounge.com/api/mathes
If it doesn't work, try Selenium.
It may be getting stuck because it can't process the JavaScript on the page; Selenium can handle that better.
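If you want to see where the redirect happens, requests records every hop in response.history, which makes the behaviour visible before you reach for a browser. A small sketch (the summarising helper is mine; the URL is the one from the question):

```python
import requests

def describe_history(history, final_url):
    """Summarise redirect hops as (status, Location) pairs plus the final URL."""
    return [(h.status_code, h.headers.get("Location")) for h in history], final_url

# Usage sketch (needs network access):
# r = requests.get("https://csgolounge.com/api/mathes", timeout=10)
# hops, final = describe_history(r.history, r.url)
# print(hops, final)  # each 301/302 hop, then where you actually landed
```

If the final URL is the homepage, the server (not your parser) is doing the redirecting, so swapping parsers cannot help.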
I'm currently running this code:
import urllib.request
from bs4 import BeautifulSoup

htmltext = urllib.request.urlopen("http://www.fifacoin.com/")
html = htmltext.read()
soup = BeautifulSoup(html, "html.parser")

for item in soup.find_all('tr', {'data-price': True}):
    print(item['data-price'])
When I run this code I don't get any output at all, even though I know there are HTML tags matching these search parameters on that site. I'm probably making an obvious mistake here; I'm new to Python and BeautifulSoup.
The problem is that the price list table is loaded through JavaScript, and urllib does not include a JavaScript engine as far as I know. So all of the JavaScript on that page, which runs in a normal browser, never runs in the page fetched by urllib.
The only way of doing this is emulating a real browser.
Solutions that come to mind are PhantomJS and Node.js.
I recently did a similar thing with Node.js (although I am a Python fan as well) and was pleasantly surprised. I did it a little differently, but this page seems to explain quite well what you would want to do: http://liamkaufman.com/blog/2012/03/08/scraping-web-pages-with-jquery-nodejs-and-jsdom/
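Whichever browser-based tool renders the page, the extraction step itself stays plain BeautifulSoup. A sketch of the data-price extraction, assuming html is the rendered page source handed over by such a tool:

```python
from bs4 import BeautifulSoup

def data_prices(html):
    """Collect the data-price attribute from every <tr> that carries one."""
    soup = BeautifulSoup(html, "html.parser")
    return [tr["data-price"] for tr in soup.find_all("tr", attrs={"data-price": True})]

# Usage sketch: with Selenium (or any browser tool) the rendered DOM is, e.g.,
# driver.page_source, and the call is simply:
# data_prices(driver.page_source)
```

Passing True for an attribute in find_all matches any tag where the attribute is present, regardless of its value, which is exactly what the original loop intended.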