Problem with web scraping with BeautifulSoup - python

I am new to using BeautifulSoup and have a question; I'd appreciate your help:
from bs4 import BeautifulSoup as soup
import requests
URL = 'https://www.kbb.com/car-values/'
page = requests.get(URL)
soup1 = soup(page.content, 'html-parser')
print(soup1.prettify())
In parallel, I went to the URL in a separate browser and inspected the page to get the HTML version of the page to establish patterns.
I found two independent patterns that meet my need
yyyy1
and
yyyy2
P.S. xxxx1, xxxx2, yyyy1 and yyyy2 are just strings
I went back to the prettify() output and searched for the pattern xxxx1 and found it, but when I searched for the pattern xxxx2 I could not find it.
It seems like the soup object does not contain all the info in the HTML page, or I am not looking at the right HTML page.
I cannot figure out what I did wrong or how to do it right.
Thanks

Initially a modification was needed to run your code: I changed 'html-parser' to 'html.parser'. This fixed the error bs4.FeatureNotFound: Couldn't find a tree builder with the features you requested: html-parser. Do you need to install a parser library?
Locally when I try your code I get:
Access Denied
You don't have permission to access "http://www.kbb.com/" on this server.
Reference #18.afe17b5c.1587328194.c07350f
Are there restrictions on some countries?
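For what it's worth, an Access Denied page like this usually means the site is blocking clients that don't look like a browser, rather than restricting countries. A minimal sketch of a workaround, assuming the block is keyed on the User-Agent header (not guaranteed to work for kbb.com):
from bs4 import BeautifulSoup
import requests

URL = 'https://www.kbb.com/car-values/'
# Many sites reject the default python-requests User-Agent with a 403
# Access Denied page, so pretend to be a regular browser.
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}
page = requests.get(URL, headers=headers)
soup1 = BeautifulSoup(page.content, 'html.parser')  # 'html.parser', not 'html-parser'
print(soup1.prettify())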

Related

BeautifulSoup: result is a long random string

I am learning web scraping; however, I have an issue preparing the soup. It doesn't even look like the HTML code I can see while inspecting the page.
import requests
from bs4 import BeautifulSoup
URL = "https://www.mediaexpert.pl/"
response = requests.get(URL).text
soup = BeautifulSoup(response,"html.parser")
print(soup)
The result looks like this (screenshot omitted): a long, random-looking string that makes up about 85% of the output.
I tried searching the whole internet, but I think I have too little knowledge, for now, to find a solution.
I will be glad for every bit of help.
BeautifulSoup does not deal with JavaScript-generated content; it only works with static HTML. To extract data generated by JavaScript, you would need to use a library like Selenium.
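A minimal sketch of that approach, assuming Selenium 4.x and a local Chrome install; the browser renders the page first, then the rendered HTML is handed to BeautifulSoup:
import time

from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome()  # Selenium 4 can locate the driver itself
driver.get("https://www.mediaexpert.pl/")
time.sleep(3)  # crude wait for JavaScript to finish; a WebDriverWait is more robust
# page_source holds the DOM after JavaScript has run,
# unlike the raw response body that requests returns.
soup = BeautifulSoup(driver.page_source, "html.parser")
driver.quit()
print(soup.prettify())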

Scrape data from website with frames or flexbox using python requests and BeautifulSoup

I've been trying to figure this out but with no luck. I found a thread (How to scrape data from flexbox element/container with Python and Beautiful Soup) that I thought would help but I can't seem to make any headway.
The site I'm trying to scrape is http://www.northwest.williams.com/NWP_Portal/. In particular I want to get the data from the 'Storage Levels' tab/frame, but for the life of me I can't seem to navigate to the right spot to get it. I've tried various iterations of the code below with no success: I've changed 'lxml' to 'html.parser', looked for tables, looked for 'tr', etc., but the code always returns empty.
I've also tried looking at the network info, but when I click on any of the tabs (System Status, PAL/System Balancing, etc.) I don't see any change in network activity. I'm sure it's something simple that I'm overlooking, but I just can't put my finger on it.
from bs4 import BeautifulSoup as soup
import requests
url = 'http://www.northwest.williams.com/NWP_Portal/'
r = requests.get(url)
html = soup(r.content,'lxml')
page = html.findAll('div',{'class':'dailyOperations-panels'})
How can I 'navigate' to the 'Storage Levels' frame/tab? What is the HTML that I'm actually looking for? Can I do this with just requests and BeautifulSoup? I'm not opposed to using Selenium, but I haven't used it before and would prefer to just use requests and BeautifulSoup if possible.
Thanks in advance!
Hey, so what I notice is that you are trying to get "dailyOperations-panels" from a div, which won't work.
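If the data lives inside a frame, the frame's document is a separate URL that your request never fetched. A sketch of how to locate and follow frames, assuming the content sits in a plain <frame>/<iframe> rather than being built by JavaScript:
from urllib.parse import urljoin

from bs4 import BeautifulSoup
import requests

url = 'http://www.northwest.williams.com/NWP_Portal/'
outer = BeautifulSoup(requests.get(url).content, 'lxml')

# Each frame/iframe loads its own document from its src attribute;
# fetch and parse those documents separately.
for frame in outer.find_all(['frame', 'iframe']):
    src = frame.get('src')
    if src:
        inner = BeautifulSoup(requests.get(urljoin(url, src)).content, 'lxml')
        print(src, inner.title)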

Why can't I find anything in BeautifulSoup documentation about .text or content method?

At the moment I am following a Python course on Udemy and I am learning the concept of web scraping. The way this is done is as follows:
import requests
import bs4
url = requests.get("http://example.com/")
soup = bs4.BeautifulSoup(url.text, "lxml")
Now, I cannot find anything about the .text method of BeautifulSoup in the documentation. I only know about it because it is clearly explained in the course I am following.
Is this usual? I am asking more from a general point of view, for when I search for relevant information in documentation in the future.
You have to use the .text attribute because, if you pass just url (the Response object) in your case, you only get the status code of your request, which cannot be a parameter for your soup object. Note that .text belongs to the requests Response object, not to BeautifulSoup, which is why you won't find it in the BeautifulSoup documentation.
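To illustrate, a small sketch of what each object holds (keeping the same names as the course code):
import requests
import bs4

url = requests.get("http://example.com/")
print(url)              # <Response [200]>, a Response object, not HTML
print(url.status_code)  # 200
# url.text is the decoded HTML body; that string is what BeautifulSoup parses.
soup = bs4.BeautifulSoup(url.text, "lxml")
print(soup.title)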

using beautiful soup on local content

I started a research project grabbing pages using wget with the local-links and mirror options. I did it this way at the time to get the data, as I did not know how long the sites would be active. So I have 60-70 sites fully mirrored, with localized links, sitting in a dir. I now need to glean what I can from them.
Is there a good example of parsing these pages using BeautifulSoup? I realize that BeautifulSoup is usually shown taking an HTTP response and parsing from there. I will be honest: I'm not savvy on BeautifulSoup yet, and my programming skills are not awesome. Now that I have some time to devote to it, I would like to do this the easy way versus the manual way.
Can someone point me to a good example, resource, or tutorial for parsing the HTML I have stored? I really appreciate it. Am I over-thinking this?
Using BeautifulSoup with local content is just the same as with Internet content. For example, to read a local HTML file into bs4:
import urllib.request
import bs4

# A file:// URL works with urlopen just like an http:// URL does.
response = urllib.request.urlopen('file:///Users/Li/Desktop/test.html', timeout=1)
html = response.read()
soup = bs4.BeautifulSoup(html, 'html.parser')
In terms of how to use bs4 for processing HTML, the documentation of bs4 is a pretty good tutorial. In most situations, spending a day reading it is enough for basic data processing.
https://www.crummy.com/software/BeautifulSoup/bs4/doc/
To parse a document, pass it into the BeautifulSoup constructor. You can pass in a string or an open filehandle:
from bs4 import BeautifulSoup
soup = BeautifulSoup(open("index.html"))
soup = BeautifulSoup("<html>data</html>")

Web scraping for divs inserted by scripts

Sorry if this is a silly question.
I am trying to use BeautifulSoup and urllib2 in Python to look at a URL and extract all divs with a particular class. However, the result is always empty, even though I can see the divs when I "inspect element" in Chrome's developer tools.
I looked at the page source and those divs were not there, which means they were inserted by a script. So my question is: how can I look for those divs (using their class name) with BeautifulSoup? I want to eventually read and follow hrefs under those divs.
Thanks.
[Edit]
I am currently looking at the H&M website, http://www.hm.com/sg/products/ladies, and I am interested in getting all the divs with class 'product-list-item'.
Try using Selenium to run the JavaScript:
from selenium import webdriver

driver = webdriver.Firefox()
driver.get("http://www.python.org")
html = driver.page_source  # the HTML after JavaScript has executed
driver.quit()
Alternatively, you can get all the info by changing the URL of the underlying data request; that link can be found in Chrome dev tools > Network.
The reason you got nothing from that specific URL is simply that the info you need is not there.
So first let me explain a little about how that page is loaded in a browser: when you request that page (http://www.hm.com/sg/products/ladies), the literal content is returned in the very first phase (which is what you got from your urllib2 request). The browser then starts to read/parse that content, which basically tells it where to find everything it needs to render the whole page (e.g. CSS to control layout, additional JavaScript/URLs/pages to populate certain areas, etc.), and the browser does all of that behind the scenes. When you "inspect element" in Chrome, the page is already fully loaded, and the info you want is not in the original URL, so you need to find out which URL is used to populate those areas and go after that specific URL instead.
So now we need to find out what happens behind the scenes, and a tool is needed to capture all traffic when that page loads (I would recommend Fiddler).
As you can see from the captured traffic, lots of things happen when you open that page in a browser! (And that's only part of the whole page-loading process.) So, by educated guess, the info you need should be in one of those three "api.hm.com" requests, and the best part is that they are already JSON formatted, which means you might not even need BeautifulSoup; the built-in json module could do the job!
OK, now what? Use urllib2 to simulate those requests and get what you want.
P.S. The requests library is a great tool for this kind of job.
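A sketch of that last step using requests; the endpoint below is hypothetical, and the real one has to be copied from the captured network traffic:
import requests

# Hypothetical endpoint: replace it with the actual api.hm.com URL
# captured from the browser's (or Fiddler's) network traffic.
api_url = 'https://api.hm.com/example-endpoint'
resp = requests.get(api_url)
data = resp.json()  # the response is already JSON, so no BeautifulSoup needed
print(data)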
Try this one:
from bs4 import BeautifulSoup
import urllib2

page = urllib2.urlopen("http://www.hm.com/sg/products/ladies")
soup = BeautifulSoup(page.read(), 'lxml')
scrapdiv = open('scrapdiv.txt', 'w')
product_lists = soup.findAll("div", {"class": "o-product-list"})
print product_lists
for product_list in product_lists:
    print product_list
    scrapdiv.write(str(product_list))
    scrapdiv.write("\n\n")
scrapdiv.close()
