Can't find the Web scraping selector for div class - python

I am new to Web scraping and this is one of my first web scraping project, I cant find the right selector for my soup.select("")
I want to get the "data-phone" (See picture bellow to undersdtand) But it In a div class and after it in a <a href>, who make that a little complicate for me!
I searched online and I foud that I have to use soup.find_all but this is not very helpfull Can anyone help me or give me a quick tip ?Thanks you!
my code:
import webbrowser, requests, bs4, os
url = "https://www.pagesjaunes.ca/search/si/1/electricien/Montreal+QC"
res = requests.get(url)
res.raise_for_status()
soup = bs4.BeautifulSoup(res.text)
result = soup.find('a', {'class', 'mlr__item__cta jsMlrMenu'})
Phone = result['data-phone']
print(Phone)

I think one of the simplest way is to use the soup.select which allows the normal css selectors.
https://www.crummy.com/software/BeautifulSoup/bs4/doc/#css-selectors
soup.select('a.mlr__item_cta.jsMlrMenu')
This should return the entire list of anchors from which you can pick the data attribute.
Note I just tried it in the terminal:
from bs4 import BeautifulSoup
import requests
url = 'https://en.wikipedia.org/wiki/Web_scraping'
r = requests.get(url)
soup = BeautifulSoup(r.text)
result = soup.select('a.mw-jump-link') # or any other selector
print(result)
print(result[0].get("href"))
You will have to loop over the result of soup.select and just collect the data-phone value from the attribute.
UPDATE
Ok I have searched in the DOM myself, and here is how I managed to retrieve all the phone data:
anchores = soup.select('a[data-phone]')
for a in anchores:
print(a.get('data-phone'))
It works also with only data selector like this: soup.select('[data-phone]')
Here real proof:
Surprisingly, for me it works also this one with classes:
for a in soup.select('a.mlr__item__cta.jsMlrMenu'):
print(a.get('data-phone'))
There is no surprise, we just had a typo in our first selector...
Find the difference :)
GOOD: a.mlr__item__cta.jsMlrMenu
BAD : a.mlr__item_cta.jsMlrMenu

Related

Web Scraping masslottery.com using beautiful soup

I am trying to use beautiful soup to get all the keno numbers that came in ever. I mostly know how to do this but I am running into an issue. I need to get the game numbers and the numbers that came in for each game. However, using beautiful soup, I cannot figure out how to access these. I'll include a screenshot of the html, as well as a screenshot of what ive tried, as well as the link to the page I am inspecting.
html code
I am trying to access <div class="winning-number-ball-circle solid"> as you can see in that picture but all the html is nested.
I've tried
soup.find_all('div',{'class':'winning-number-ball-circle solid'})
and that does not work. Does anyone know how to access the inner elements?
Heres my code:
from bs4 import BeautifulSoup
import urllib.request
mass = 'https://www.masslottery.com/tools/past-results/keno?draw_date=2021-06-17'
page = urllib.request.urlopen(mass)
soup = BeautifulSoup(page,'lxml')
div = soup.find('div',{'class','winning-number-ball-circle solid'})
print(div)
Thanks in advance!
The data comes from an REST API call the browser makes by running Javascript. You need to make a request to that and then use the json returned
import requests
r = requests.get('https://www.masslottery.com/rest/keno/getDrawsByDateRange?startDate=2021-06-17&endDate=2021-06-17').json()
Thanks to #MendelG for suggestion to use pandas for formatting:
import requests
import pandas as pd
r = requests.get('https://www.masslottery.com/rest/keno/getDrawsByDateRange?startDate=2021-06-17&endDate=2021-06-17').json()
pd.json_normalize(r['draws'])
divs = soup.findAll('div',{'class':['winning-number-ball-circle','solid']})
for div in divs:
print(div.text)
with soup.findAll you find all the divs with class names 'winning-number-ball-circle' and 'solid'
this returns a list.
With a for you display the text of these divs

Scraping with Python. Can't get wanted data

I am trying to scrape website, but I encountered a problem. When I try to scrape data, it looks like the html differs from what I see on google inspect and from what I get from python. I get this with http://edition.cnn.com/election/results/states/arizona/house/01 I tried to scrape election results. I used this script to check HTML part of the webpage, and I noticed that they different. There is no classes that I need, like section-wrapper.
page =requests.get('http://edition.cnn.com/election/results/states/arizona/house/01')
soup = BeautifulSoup(page.content, "lxml")
print(soup)
Anyone knows what is the problem ?
http://data.cnn.com/ELECTION/2016/AZ/county/H_d1_county.json
This site use JavaScript fetch data, you can check the url above.
You can find this url in chrome dev-tools, there are many links, check it out
Chrome >>F12>> network tab>>F5(refresh page)>>double click the .josn url>> open new tab
import requests
from bs4 import BeautifulSoup
page=requests.get('http://edition.cnn.com/election/results/states/arizona/house/01')
soup = BeautifulSoup(page.content)
#you can try all sorts of tags here I used class: "ad" and class:"ec-placeholder"
g_data = soup.find_all("div", {"class":"ec-placeholder"})
h_data = soup.find_all("div"),{"class":"ad"}
for item in g_data:print item
#print '\n'
#for item in h_data:print item

How to solve, finding two of each link (Beautifulsoup, python)

Im using beautifulsoup4 to parse a webpage and collect all the href values using this code
#Collect links from 'new' page
pageRequest = requests.get('http://www.supremenewyork.com/shop/all/shirts')
soup = BeautifulSoup(pageRequest.content, "html.parser")
links = soup.select("div.turbolink_scroller a")
allProductInfo = soup.find_all("a", class_="name-link")
print allProductInfo
linksList1 = []
for href in allProductInfo:
linksList1.append(href.get('href'))
print(linksList1)
linksList1 prints two of each link. I believe this is happening as its taking the link from the title as well as the item colour. I have tried a few things but cannot get BS to only parse the title link, and have a list of one of each link instead of two. I imagine its something real simple but im missing it. Thanks in advance
This code will give you the result without getting duplicate results
(also using set() may be a good idea as #Tarum Gupta)
But I changed the way you crawl
import requests
from bs4 import BeautifulSoup
#Collect links from 'new' page
pageRequest = requests.get('http://www.supremenewyork.com/shop/all/shirts')
soup = BeautifulSoup(pageRequest.content, "html.parser")
links = soup.select("div.turbolink_scroller a")
# Gets all divs with class of inner-article then search for a with name-link class
that is inside an h1 tag
allProductInfo = soup.select("div.inner-article h1 a.name-link")
# print (allProductInfo)
linksList1 = []
for href in allProductInfo:
linksList1.append(href.get('href'))
print(linksList1)
alldiv = soup.findAll("div", {"class":"inner-article"})
for div in alldiv:
linkList1.append(div.h1.a['href'])
set(linksList1) # use set() to remove duplicate link
list(set(linksList1)) # use list() convert set to list if you need

Extract Link URL After Specified Element with Python and Beautifulsoup4

I'm trying to extract a link from a page with python and the beautifulsoup library, but I'm stuck. The link is on the following page, on the sidebar area, directly underneath the h4 subtitle "Original Source:
http://www.eurekalert.org/pub_releases/2016-06/uonc-euc062016.php
I've managed to isolate the link (mostly), but I'm unsure of how to further advance my targeting to actually extract the link. Here's my code so far:
import requests
from bs4 import BeautifulSoup
url = "http://www.eurekalert.org/pub_releases/2016-06/uonc-euc062016.php"
data = requests.get(url)
soup = BeautifulSoup(data.text, 'lxml')
source_url = soup.find('section', class_='widget hidden-print').find('div', class_='widget-content').findAll('a')[-1]
print(source_url)
I am currently getting the full html of the last element in which I've isolated, where I'm trying to simply get the link. Of note, this is the only link on the page I'm trying to get.
You're looking for the link which is the href html attribute. source_url is a bs4.element.Tag which has the get method like:
source_url.get('href')
You almost got it!!
SOLUTION 1:
You just have to run the .text method on the soup you've assigned to source_url.
So instead of:
print(source_url)
You should use:
print(source_url.text)
Output:
http://news.unchealthcare.org/news/2016/june/e-cigarette-use-can-alter-hundreds-of-genes-involved-in-airway-immune-defense
SOLUTION 2:
You should call source_url.get('href') to get only the specific href tag related to your soup.findall element.
print source_url.get('href')
Output:
http://news.unchealthcare.org/news/2016/june/e-cigarette-use-can-alter-hundreds-of-genes-involved-in-airway-immune-defense

Beautiful Soup not finding Div

I'm trying to get products for a project i'm working on from this page:
Belk.com
I originally tried going very specific using
soup.find("ul", {"class" : "product_results"})
Nothing was happening, so I went very broad and just started searching all divs.
contentDiv = soup.find_all("div")
for div in contentDiv:
print(div.get("class"))
When I do this I am getting only Div's for the top half of the page, which lead me to believe that there is an iframe that I wasn't getting into, but upon closer inspection I couldn't find the frame. Any thoughts on this?
This works for me
import httplib2
from bs4 import BeautifulSoup
http = httplib2.Http()
status, response = http.request('http://www.belk.com/AST/Main/Belk_Primary/Women/Shop/Accessories.jsp')
soup = BeautifulSoup(response)
res = soup.find('ul',{"class":"product_results"})
lis = res.findAll('li')
for j in lis:
#your code
pass

Categories