I am looking to scrape a certain number from a website.
When inspecting in chrome I see the following div I want to pull:
<div class="sc-18nh1jk-0 bTfoun css-1p6fq9y">2472.38</div>
This class name looks weird to me. Here is the code simple code I use to try and pull the '2472.38' number:
from lxml import html
import requests
r = requests.get('MYWEBSITE')
tree = html.fromstring(r.content)
CurrentPrice = tree.xpath('//div[#class="sc-18nh1jk-0 bTfoun css-1p6fq9y"]')
print(CurrentPrice)
output is: []
Any suggestions? Thanks ahead of time!
If you provided the websites url it would have been nice, but I think that this webpage you are trying to scrape is using generated class names which means that the class will be dynamic.
Related
I am trying to use beautiful soup to get all the keno numbers that came in ever. I mostly know how to do this but I am running into an issue. I need to get the game numbers and the numbers that came in for each game. However, using beautiful soup, I cannot figure out how to access these. I'll include a screenshot of the html, as well as a screenshot of what ive tried, as well as the link to the page I am inspecting.
html code
I am trying to access <div class="winning-number-ball-circle solid"> as you can see in that picture but all the html is nested.
I've tried
soup.find_all('div',{'class':'winning-number-ball-circle solid'})
and that does not work. Does anyone know how to access the inner elements?
Heres my code:
from bs4 import BeautifulSoup
import urllib.request
mass = 'https://www.masslottery.com/tools/past-results/keno?draw_date=2021-06-17'
page = urllib.request.urlopen(mass)
soup = BeautifulSoup(page,'lxml')
div = soup.find('div',{'class','winning-number-ball-circle solid'})
print(div)
Thanks in advance!
The data comes from an REST API call the browser makes by running Javascript. You need to make a request to that and then use the json returned
import requests
r = requests.get('https://www.masslottery.com/rest/keno/getDrawsByDateRange?startDate=2021-06-17&endDate=2021-06-17').json()
Thanks to #MendelG for suggestion to use pandas for formatting:
import requests
import pandas as pd
r = requests.get('https://www.masslottery.com/rest/keno/getDrawsByDateRange?startDate=2021-06-17&endDate=2021-06-17').json()
pd.json_normalize(r['draws'])
divs = soup.findAll('div',{'class':['winning-number-ball-circle','solid']})
for div in divs:
print(div.text)
with soup.findAll you find all the divs with class names 'winning-number-ball-circle' and 'solid'
this returns a list.
With a for you display the text of these divs
I currently working on the HTML scraping the baka-update.
However, the name of Div Class is duplicated.
As my goal is as csv or json, I would like to use information in [sCat] as column name and [sContent] as to be get stored.....
Is their are way to scrape with this kinds of website?
Thanks,
Sample
https://www.mangaupdates.com/series.html?id=75363
Image 1
Image 2
from lxml import html
import requests
page = requests.get('http://www.mangaupdates.com/series.html?id=153558?')
tree = html.fromstring(page.content)
#Get the name of the columns.... I hope
sCat = tree.xpath('//div[#class="sCat"]/text()')
#Get the actual data
sContent = tree.xpath('//div[#class="sContent"]/text()')
print('sCat: ', sCat)
print('sContent: ', sContent)
I tried but nothing I could find of
#Jasper Nichol M Fabella
I tried to edit your code and got the following output. Maybe it will Help.
from lxml import html
import requests
page = requests.get('http://www.mangaupdates.com/series.html?id=153558?')
tree = html.fromstring(page.content)
# print(page.content)
#Get the name of the columns.... I hope
sCat = tree.xpath('//div[#class="sCat"]')
#Get the actual data
sContent = tree.xpath('//div[#class="sContent"]')
print('sCat: ', len(sCat))
print('sContent: ', len(sContent))
json_dict={}
for i in range(0,len(sCat)):
# print(''.join(i.itertext()))
sCat_text=(''.join(sCat[i].itertext()))
sContent_text=(''.join(sContent[i].itertext()))
json_dict[sCat_text]=sContent_text
print(json_dict)
I got the following output
Hope it Helps
you can use xpath expressions and create an absolute path on what you want to scrape
Here is an example with requests and lxml library:
from lxml import html
import requests
r = requests.get('https://www.mangaupdates.com/series.html?id=75363')
tree = html.fromstring(r.content)
sCat = [i.text_content().strip() for i in tree.xpath('//div[#class="sCat"]')]
sContent = [i.text_content().strip() for i in tree.xpath('//div[#class="sContent"]')]
What are you using to scrape?
If you are using BeautifulSoup? Then you can search for all content on the page with FindAll method with a class identifier and iterate thru that. You can the special "_class" deginator
Something like
import bs4
soup = bs4.BeautifulSoup(html.source)
soup.find_all('div', class_='sCat')
# do rest of your logic work here
Edit: I was typing on my mobile on cached page before you made the edits. So didnt see the changes. Though i see you are using raw lxml library to parse. Yes that's faster but I am not to familiar, as Ive only used raw lxml library for one project but I think you can chain two search methods to distill to something equivalent.
I am trying to learn to use the python library BeautifulSoup, I would like to, for example, scrape a price of a flight on Google Flights.
So I connected to Google Flights, for example at this link, and I want to get the cheapest flight price.
So I would get the value inside the div with this class "gws-flights-results__itinerary-price" (as in the figure).
Here is the simple code I wrote:
from bs4 import BeautifulSoup
import urllib.request
url = 'https://www.google.com/flights?hl=it#flt=/m/07_pf./m/05qtj.2019-04-27;c:EUR;e:1;sd:1;t:f;tt:o'
page = urllib.request.urlopen(url)
soup = BeautifulSoup(page, 'html.parser')
div = soup.find('div', attrs={'class': 'gws-flights-results__itinerary-price'})
But the resulting div has class NoneType.
I also try with
find_all('div')
but within all the div I found in this way, there was not the div I was interested in.
Can someone help me?
Looks like javascript needs to run so use a method like selenium
from selenium import webdriver
url = 'https://www.google.com/flights?hl=it#flt=/m/07_pf./m/05qtj.2019-04-27;c:EUR;e:1;sd:1;t:f;tt:o'
driver = webdriver.Chrome()
driver.get(url)
print(driver.find_element_by_css_selector('.gws-flights-results__cheapest-price').text)
driver.quit()
Its great that you are learning web scraping! The reason you are getting NoneType as a result is because the website that you are scraping loads content dynamically. When requests library fetches the url it only contains javascript. and the div with this class "gws-flights-results__itinerary-price" isn't rendered yet! So it won't be possible by the scraping approach you are using to scrape this website.
However you can use other methods such as fetching the page using tools such as selenium or splash to render the javascript and then parse the content.
BeautifulSoup is a great tool for extracting part of HTML or XML, but here it looks like you only need to get the url to another GET-request for a JSON object.
(I am not by a computer now, can update with an example tomorrow.)
i'm working on web crawler right now, and it seems that I couldn't get the class that is inside div from a particular website. Below is my code. I use BeautifulSoup in Python3
import requests
from bs4 import BeautifulSoup as bs
response = requests.get('https://e27.co/startup/flipkart').text
soup = bs(response, 'html.parser')
content_div = soup.findAll('h1',class_ = 'profile-startup')
print(content_div)
I want to extract the text inside the h1 that has class "profile-startup". the above code returns nothing. can you guys help me?
This website is populating data using Javascript. If you take a look at the contents in response you will see that there is no h1. You have to see if they have an API you can use to retrieve the information you need or consider using a browser automation technology like Selenium: http://selenium-python.readthedocs.io/installation.html#introduction
i want to be able to pull all urls from the following webpage using python https://yeezysupply.com/pages/all i tried using some other suggestions i found but they didn't seem to work with this particular website. i would end up not finding any urls at all.
import urllib
import lxml.html
connection = urllib.urlopen('https://yeezysupply.com/pages/all')
dom = lxml.html.fromstring(connection.read())
for link in dom.xpath('//a/#href'):
print link
perhaps it would be useful for you to make use of modules specifically designed for this. heres a quick and dirty script that gets the relative links on the page
#!/usr/bin/python3
import requests, bs4
res = requests.get('https://yeezysupply.com/pages/all')
soup = bs4.BeautifulSoup(res.text,'html.parser')
links = soup.find_all('a')
for link in links:
print(link.attrs['href'])
it generates output like this:
/pages/jewelry
/pages/clothing
/pages/footwear
/pages/all
/cart
/products/womens-boucle-dress-bleach/?back=%2Fpages%2Fall
/products/double-sleeve-sweatshirt-bleach/?back=%2Fpages%2Fall
/products/boxy-fit-zip-up-hoodie-light-sand/?back=%2Fpages%2Fall
/products/womens-boucle-skirt-cream/?back=%2Fpages%2Fall
etc...
is this what you are looking for? requests and beautiful soup are amazing tools for scraping.
There are no links in the page source; they are inserted using Javascript after the page is loaded int the browser.