I'm trying to get products for a project I'm working on from this page:
Belk.com
I originally tried going very specific using
soup.find("ul", {"class" : "product_results"})
Nothing was happening, so I went very broad and just started searching all divs.
contentDiv = soup.find_all("div")
for div in contentDiv:
    print(div.get("class"))
When I do this I am only getting the divs for the top half of the page, which led me to believe that there is an iframe I wasn't getting into, but upon closer inspection I couldn't find the frame. Any thoughts on this?
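One quick way to test the iframe theory is to list every iframe BeautifulSoup actually sees; if the list is empty, the missing content is more likely loaded by JavaScript after the initial request. A minimal sketch (the inline HTML here is made up and stands in for the fetched page source):

```python
from bs4 import BeautifulSoup

# Made-up HTML standing in for the downloaded page source
html = """
<html><body>
  <div class="top">visible content</div>
  <iframe src="/embedded-products"></iframe>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")
frames = soup.find_all("iframe")
for frame in frames:
    print(frame.get("src"))  # source URL of each embedded frame
```

If this prints nothing against the real page, the product list is probably injected client-side rather than hidden in a frame.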
This works for me:

import httplib2
from bs4 import BeautifulSoup

http = httplib2.Http()
status, response = http.request('http://www.belk.com/AST/Main/Belk_Primary/Women/Shop/Accessories.jsp')
soup = BeautifulSoup(response, 'html.parser')
res = soup.find('ul', {"class": "product_results"})
lis = res.find_all('li')
for li in lis:
    # your code here
    pass
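To illustrate what that loop receives, here is the same find-and-iterate pattern run against a small inline snippet shaped like the product list (the class name mirrors the one above; the product names are invented):

```python
from bs4 import BeautifulSoup

# Invented markup mimicking the product list's structure
html = """
<ul class="product_results">
  <li>Handbag</li>
  <li>Scarf</li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")
res = soup.find("ul", {"class": "product_results"})

# Each <li> is a Tag; get_text(strip=True) pulls out the clean text
names = [li.get_text(strip=True) for li in res.find_all("li")]
print(names)  # ['Handbag', 'Scarf']
```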
Related
I am new to web scraping and this is one of my first web scraping projects; I can't find the right selector for my soup.select("").
I want to get the "data-phone" attribute (see the picture below to understand). But it is inside a div class and then inside an <a href>, which makes it a little complicated for me!
I searched online and found that I have to use soup.find_all, but this is not very helpful. Can anyone help me or give me a quick tip? Thank you!
my code:
import webbrowser, requests, bs4, os
url = "https://www.pagesjaunes.ca/search/si/1/electricien/Montreal+QC"
res = requests.get(url)
res.raise_for_status()
soup = bs4.BeautifulSoup(res.text)
result = soup.find('a', {'class', 'mlr__item__cta jsMlrMenu'})
Phone = result['data-phone']
print(Phone)
I think one of the simplest ways is to use soup.select, which allows normal CSS selectors.
https://www.crummy.com/software/BeautifulSoup/bs4/doc/#css-selectors
soup.select('a.mlr__item_cta.jsMlrMenu')
This should return the entire list of anchors from which you can pick the data attribute.
Note I just tried it in the terminal:
from bs4 import BeautifulSoup
import requests
url = 'https://en.wikipedia.org/wiki/Web_scraping'
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')
result = soup.select('a.mw-jump-link') # or any other selector
print(result)
print(result[0].get("href"))
You will have to loop over the result of soup.select and collect the data-phone value from each anchor's attributes.
UPDATE
OK, I have searched the DOM myself, and here is how I managed to retrieve all the phone data:

anchors = soup.select('a[data-phone]')
for a in anchors:
    print(a.get('data-phone'))

It also works with only the attribute selector, like this: soup.select('[data-phone]')
Surprisingly, for me this one with classes also works:

for a in soup.select('a.mlr__item__cta.jsMlrMenu'):
    print(a.get('data-phone'))
There is no surprise; we just had a typo in our first selector...
Find the difference :)
GOOD: a.mlr__item__cta.jsMlrMenu
BAD : a.mlr__item_cta.jsMlrMenu
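Both selector styles are easy to verify offline. A self-contained sketch, using made-up markup that imitates the page's structure (phone numbers and listing text invented):

```python
from bs4 import BeautifulSoup

# Invented markup mimicking the directory page
html = """
<div class="listing">
  <a class="mlr__item__cta jsMlrMenu" data-phone="514-555-0001" href="#">Call</a>
  <a class="mlr__item__cta jsMlrMenu" data-phone="514-555-0002" href="#">Call</a>
  <a class="other" href="#">no phone here</a>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Attribute selector: matches any <a> that carries a data-phone attribute
phones = [a.get("data-phone") for a in soup.select("a[data-phone]")]
print(phones)

# Class selector: same result, as long as both class names are spelled right
phones_by_class = [a.get("data-phone")
                   for a in soup.select("a.mlr__item__cta.jsMlrMenu")]
print(phones_by_class)
```

The attribute selector is the more robust of the two here, since it keeps working even if the site renames its CSS classes.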
I am trying to extract information about flight ticket prices with a Python script. Please take a look at the picture:
I would like to parse all the prices (such as the "121" at the bottom of the tree). I have constructed a simple script, and my problem is that I am not sure how to get the right parts from the code behind the page's "inspect element". My code is below:
import urllib3
from bs4 import BeautifulSoup as BS
http = urllib3.PoolManager()
URL = "https://greatescape.co/?datesType=oneway&dateRangeType=exact&departDate=2019-08-19&origin=EAP&originType=city&continent=europe&flightType=3&city=WAW"
response = http.request('GET', URL)
soup = BS(response.data, "html.parser")
body = soup.find('body')
__next = body.find('div', {'id':'__next'})
ui_container = __next.find('div', {'class':'ui-container'})
bottom_container_root = ui_container.find('div', {'class':'bottom-container-root'})
print(bottom_container_root)
The problem is that I am stuck at the level of ui-container: bottom_container_root comes back empty, even though it is a direct child of ui-container. Could someone please let me know how to parse this tree properly?
I have no experience in web scraping, but as it happens it is one step in a bigger workflow I am building.
.find_next_siblings and .next_element can be useful in navigating through containers.
Here is some example usage below.
from bs4 import BeautifulSoup

html = open("small.html").read()
soup = BeautifulSoup(html, 'html.parser')
print(soup.head.next_element)
print(soup.head.next_element.next_element)
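To make the navigation concrete, here is a small self-contained example (the HTML is made up for the demo):

```python
from bs4 import BeautifulSoup

# Tiny made-up document for demonstration
html = "<html><head><title>Demo</title></head><body><p>one</p><p>two</p></body></html>"
soup = BeautifulSoup(html, "html.parser")

# .next_element walks the parse tree in document order
print(soup.head.next_element)               # the <title> tag
print(soup.head.next_element.next_element)  # the string inside <title>

# .find_next_siblings collects later siblings at the same nesting level
first_p = soup.find("p")
print(first_p.find_next_siblings("p"))      # [<p>two</p>]
```

Note that .next_element descends into children (head → title → "Demo"), while .find_next_siblings stays at the same level, which is usually what you want when stepping between adjacent containers.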
I don't know what to try because I'm not getting any error messages; it just comes up blank when I run it. I was following along with a guy on YouTube and his version worked.
import requests
import bs4
import sys
import webbrowser
search = 'savage'
res = requests.get(f'https://google.com/search?q={search}')
res.raise_for_status()
soup = bs4.BeautifulSoup(res.text, 'html.parser')
linkelem = soup.select('.r a')
linkstoopen = min(5, len(linkelem))
for i in range(linkstoopen):
    webbrowser.open('https://google.com', linkelem[i].get('href'))
it is supposed to open up the top 5 results for "savage" on google
Just as furas suggested, I think the result HTML might have changed since the tutorial was made. I got it to work with two changes to your code:
Change the line:
linkelem = soup.select('.r a')
to
linkelem = []
for div in soup.find_all("div", {"class": "jfp3ef"}):
    for link in div.select("a"):
        linkelem.append(link)
Looking at the response we get from Google, we see that the result links are in a div with the class name jfp3ef.
Then you should append the href to the Google URL in the last loop:

webbrowser.open('https://google.com' + linkelem[i].get('href'))
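The collection-and-concatenation part can be checked against a fixed snippet. A sketch, assuming made-up markup that imitates the structure described above (Google's real class names like "jfp3ef" change frequently, so treat them as a moving target):

```python
from bs4 import BeautifulSoup

# Invented markup imitating the result-page structure described above
html = """
<div class="jfp3ef"><a href="/url?q=https://example.com/a">Result A</a></div>
<div class="jfp3ef"><a href="/url?q=https://example.com/b">Result B</a></div>
<div class="ads"><a href="/sponsored">Ad</a></div>
"""

soup = BeautifulSoup(html, "html.parser")
linkelem = []
for div in soup.find_all("div", {"class": "jfp3ef"}):
    for link in div.select("a"):
        linkelem.append(link)

# Relative hrefs must be joined onto the site's base URL before opening
urls = ["https://google.com" + a.get("href") for a in linkelem]
print(urls)
```

The ad link is skipped because only divs with the target class are scanned, which is the filtering the fix above relies on.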
I was messing around with BeautifulSoup and found that it occasionally just takes an awful long time to parse a page despite no changes in the code or connection whatsoever. Any ideas?
from bs4 import BeautifulSoup
from urllib.request import urlopen

# The particular state website:
site = "http://sfbay.craigslist.org/rea/"
html = urlopen(site)
print("Done")
soup = BeautifulSoup(html, 'html.parser')
print("Done")

# Get first 100 list of postings:
postings = soup('p')
If for some reason you want to read the text within the <a> tags, you can do something like this:
postings = [x.text for x in soup.find("div", {"class":"content"}).findAll("a", {"class":"hdrlnk"})]
print(str(postings).encode('utf-8'))
This will return a list with the length of 100.
postings = soup('p')

This code is inefficient: it makes BeautifulSoup walk the entire document and check every tag to see whether it is a p, one by one. Querying directly for the anchors you actually want is more precise:

aTag = soup.findAll('a', class_='result_title hdrlnk')
for link in aTag:
    print(link.text)
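The difference is easy to see on a small inline snippet (the craigslist-style class names are copied from above; the listing titles are invented):

```python
from bs4 import BeautifulSoup

# Invented markup shaped like a craigslist listings page
html = """
<div class="content">
  <p class="row"><a class="result_title hdrlnk" href="#">2br apartment</a></p>
  <p class="row"><a class="result_title hdrlnk" href="#">studio downtown</a></p>
  <p class="footer">unrelated paragraph</p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Broad: soup('p') grabs every <p>, relevant or not
all_p = soup("p")
print(len(all_p))  # 3

# Targeted: only the anchors that actually carry listing titles
titles = [a.text for a in soup.find_all("a", class_="result_title hdrlnk")]
print(titles)  # ['2br apartment', 'studio downtown']
```

Note that passing a multi-word string to class_ matches the exact string value of the class attribute, which is why both class names must appear in the same order as in the markup.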
I am extracting data for a research project and I have successfully used findAll('div', attrs={'class':'someClassName'}) on many websites, but this particular website,
WebSite Link
doesn't return any values when I use the attrs option. When I don't use the attrs option, I get the entire HTML DOM.
Here is the simple code that I started with to test it out:
soup = bs(urlopen(url))
for div in soup.findAll('div', attrs={'class':'data'}):
    print div
My code is working fine with requests:
import requests
from BeautifulSoup import BeautifulSoup as bs
#grab HTML
r = requests.get(r'http://www.amazon.com/s/ref=sr_pg_1?rh=n:172282,k%3adigital%20camera&keywords=digital%20camera&ie=UTF8&qid=1343600585')
html = r.text
#parse the HTML
soup = bs(html)
results= soup.findAll('div', attrs={'class': 'data'})
print results
If you or anyone reading this question would like to know the reason the code wasn't able to find the attrs value using the code given (copied below):

soup = bs(urlopen(url))
for div in soup.findAll('div', attrs={'class':'data'}):
    print div

The issue is in how the BeautifulSoup object was created: soup = bs(urlopen(url)) passes in the response object returned by urlopen(url), not the page markup itself.
I'm sure any issues you encountered could have been more easily resolved by using bs(urlopen(url).read()) instead.
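The distinction can be reproduced without a network call by standing in for the response object with an in-memory bytes buffer (a sketch using BeautifulSoup 4; the markup is made up):

```python
import io
from bs4 import BeautifulSoup

# io.BytesIO stands in for the response object returned by urlopen
fake_response = io.BytesIO(b'<div class="data">payload</div>')

# Calling .read() first guarantees BeautifulSoup receives the raw markup
soup = BeautifulSoup(fake_response.read(), "html.parser")
divs = soup.find_all("div", attrs={"class": "data"})
print(divs[0].text)  # payload
```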