I'm trying to parse the page https://www.petshop.ru/catalog/cats/veterinary_feed/dlya_koshek_pri_zapore_fibre_response_fr31_5789/, but it doesn't work.
import requests
from bs4 import BeautifulSoup as BS
r = requests.get("https://www.petshop.ru/catalog/cats/veterinary_feed/dlya_koshek_pri_zapore_fibre_response_fr31_5789/")
html = BS(r.content, 'html.parser')
for el in html.select(".style_product_head__ufClP > .style_tablet__bK5he style_desktop__3Zkvu"):
    title = el.select('.style_list__3V0_P > .style_price_wrapper__1HT8P')
    print(title[0].text)
I'm working from this example, since I'm not familiar with Python:
import requests
from bs4 import BeautifulSoup as BS
r = requests.get("https://stopgame.ru/review/new/izumitelno/p1")
html = BS(r.content, 'html.parser')
for el in html.select(".items > .article-summary"):
    title = el.select('.caption > a')
    print(title[0].text)
I expect to see the following result: Обычная цена
Ideally, it would also be interesting to know how to display a result of this kind: petshop.ru: Обычная цена 3 125 ₽, because I plan to implement parsing of several more sites to track prices for this cat food :)
All the data you want is in the HTML source and it comes as a JSON object in a <script> tag. You can easily target that and parse it.
Here's how:
import json
import re
import requests
url = "https://www.petshop.ru/catalog/cats/veterinary_feed/dlya_koshek_pri_zapore_fibre_response_fr31_5789/"
headers = {
"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:99.0) Gecko/20100101 Firefox/99.0",
}
data = json.loads(
    re.search(
        r'"onProductView",\s(.*)\);',
        requests.get(url, headers=headers).text,
    ).group(1)
)
print(f"{data['product']['name']} - {data['product']['price']} руб.")
Output:
Для кошек при запоре - 3125 руб.
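For the "petshop.ru: Обычная цена 3 125 ₽" style of output across several shops, one possible layout (a sketch, not a definitive design; parse_petshop and the sites dict are my own names) is one parser function per site plus a loop:

import json
import re
import requests

HEADERS = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:99.0) Gecko/20100101 Firefox/99.0",
}

def parse_petshop(url):
    # Same extraction as above: the product JSON is inlined into an
    # "onProductView" call inside a <script> tag.
    html = requests.get(url, headers=HEADERS).text
    data = json.loads(re.search(r'"onProductView",\s(.*)\);', html).group(1))
    # The price may arrive as a string in the JSON, so coerce it to a number.
    return float(data["product"]["price"])

# One (url, parser) entry per shop; add parsers for the other sites as you write them.
sites = {
    "petshop.ru": (
        "https://www.petshop.ru/catalog/cats/veterinary_feed/dlya_koshek_pri_zapore_fibre_response_fr31_5789/",
        parse_petshop,
    ),
}

for shop, (url, parser) in sites.items():
    price = parser(url)
    # Render 3125 as "3 125": format with a comma thousands separator, then swap it for a space.
    print(f"{shop}: Обычная цена {price:,.0f} ₽".replace(",", " "))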
I am trying to get the product image from the page below, using Python and BeautifulSoup. The image URL is inside JavaScript. I am using lxml. I have created a simplified version of my code to focus on the image only.
The image url I am after is https://lapa.co.za/pub/media/catalog/product/cache/image/700x700/e9c3970ab036de70892d86c6d221abfe/l/e/learn_to_read_l3_b05_tippie_fish_cover.jpg
import json
from bs4 import BeautifulSoup
import requests
headers = {
    'User-Agent': 'Mozilla/5.0 (iPad; CPU OS 12_2 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Mobile/15E148'
}
testlink = 'https://lapa.co.za/kinder-en-tienerboeke/leer-my-lees-vlak-1-grootboek-9-tippie-en-die-vis'
r = requests.get(testlink, headers=headers)
soup = BeautifulSoup(r.content, 'lxml')
title = soup.find('h1', class_='page-title').text.strip()
images = soup.find('div', class_='product-img-column')
# html_data = requests.get(testlink).text
# data = json.loads(re.search(r'window.INITIAL_REDUX_STATE=(\{.*?\});', html_data))
print(images)
The json is in the <script> tags. Just need to pull that out.
import json
from bs4 import BeautifulSoup
import requests
import re
headers = {
    'User-Agent': 'Mozilla/5.0 (iPad; CPU OS 12_2 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Mobile/15E148'
}
testlink = 'https://lapa.co.za/kinder-en-tienerboeke/leer-my-lees-vlak-1-grootboek-9-tippie-en-die-vis'
r = requests.get(testlink, headers=headers)
soup = BeautifulSoup(r.content, 'lxml')
title = soup.find('h1', class_='page-title').text.strip()
images = soup.find('div', class_='product-img-column')
script = images.find('script', {'type':'text/x-magento-init'})
jsonStr = re.search(r'<script type=\"text/x-magento-init\">(.*)</script>', str(script), re.IGNORECASE | re.DOTALL).group(1)
data = json.loads(jsonStr)
image_data = data['[data-gallery-role=gallery-placeholder]']['mage/gallery/gallery']['data'][0]
image_url = image_data['full']
# OR
#image_url = image_data['img']
print(image_url)
Output:
https://lapa.co.za/pub/media/catalog/product/cache/image/e9c3970ab036de70892d86c6d221abfe/9/7/9780799377347_1.jpg
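If the goal is to save the image file itself (not covered in the question, so this is just a sketch reusing the variables above):

# Stream the image to disk; deriving the filename from the URL is an
# assumption -- use any naming scheme you like.
img = requests.get(image_url, headers=headers, stream=True)
filename = image_url.rsplit('/', 1)[-1]
with open(filename, 'wb') as f:
    for chunk in img.iter_content(chunk_size=8192):
        f.write(chunk)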
Whenever I try to extract the text between tags using .text, all I get is [] as output.
import requests
from bs4 import BeautifulSoup
page = requests.get("https://www.amazon.in/s?k=ssd&ref=nb_sb_noss")
soup = BeautifulSoup(page.content, "html.parser")
product = soup.find_all("h2", class_="a-link-normal a-text-normal")
results = soup.find_all("span", class_="a-offscreen")
print(product)
This is the output that I got:
C:\Users\Kushal\Desktop\requests-tutorial>C:/Users/Kushal/AppData/Local/Programs/Python/Python37/python.exe c:/Users/Kushal/Desktop/requests-tutorial/scraper.py
[]
When I try listing everything with a for loop, nothing shows up, not even the empty square brackets.
Based on your comment below, I've modified the code to fetch all the product titles on the said page along with the price details.
Mark as answer if it works; otherwise comment for further analysis.
import requests
from bs4 import BeautifulSoup

headers = {
    "user-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_5)",
    "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "accept-charset": "cp1254,ISO-8859-9,utf-8;q=0.7,*;q=0.3",
    "accept-encoding": "gzip,deflate,sdch",
    "accept-language": "tr,tr-TR,en-US,en;q=0.8",
}
page = requests.get('https://www.amazon.in/s?k=ssd&ref=nb_sb_noss', headers=headers)
soup = BeautifulSoup(page.content, 'lxml')  # the 'lxml' parser needs the lxml package installed
title = soup.find_all('span', attrs={'class': 'a-size-medium a-color-base a-text-normal'})
price = soup.find_all('span', attrs={'class': 'a-offscreen'})
for title_tag, price_tag in zip(title, price):
    print(title_tag.text.strip(), '-', price_tag.text.strip())
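One caveat with zipping two independent find_all() lists: if any result card lacks a price, every later title pairs with the wrong price. A more defensive sketch is to walk each result container and read both fields from inside it (the data-component-type attribute is an assumption about Amazon's markup, which changes often):

# Sketch: pair title and price inside each result card so a missing
# price cannot shift the pairing for the remaining products.
for card in soup.find_all('div', attrs={'data-component-type': 's-search-result'}):
    title_tag = card.find('span', class_='a-size-medium a-color-base a-text-normal')
    price_tag = card.find('span', class_='a-offscreen')
    if title_tag and price_tag:
        print(title_tag.text.strip(), '-', price_tag.text.strip())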
I am trying to use Python to scrape the US News Ranking for universities, and I'm struggling. I normally use Python "requests" and "BeautifulSoup".
The data is here:
https://www.usnews.com/education/best-global-universities/rankings
Right-clicking and inspecting shows a bunch of links, and I don't even know which one to pick. I followed an example I found on the web, but it just gives me empty data:
import re
import requests
import urllib.request
import time
from bs4 import BeautifulSoup
import pandas as pd
import math
from lxml.html import parse
from io import StringIO

url = 'https://www.usnews.com/education/best-global-universities/rankings'
urltmplt = 'https://www.usnews.com/education/best-global-universities/rankings?page='
css = '#resultsMain :nth-child(1)'
npage = 20
urlst = [url] + [urltmplt + str(r) for r in range(2, npage + 1)]

def scrapevec(url, css):
    doc = parse(StringIO(url)).getroot()
    return [link.text_content() for link in doc.cssselect(css)]

usng = []
for u in urlst:
    print(u)
    ts = [re.sub("\n *", " ", t) for t in scrapevec(u, css) if t != ""]
This doesn't work as t is an empty array.
I'd really appreciate any help.
The MWE you posted does not run as-is. I strongly suggest you look for basic scraping tutorials (with Python, Java, etc.): there are plenty, and in general they are a good starting point.
Below you can find a snippet of code that prints the universities' names listed on page 1 - you'll be able to extend the code to all 150 pages through a for loop.
import requests
from bs4 import BeautifulSoup

newheaders = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux i686 on x86_64)'
}
baseurl = 'https://www.usnews.com/education/best-global-universities/rankings'
page1 = requests.get(baseurl, headers=newheaders)  # change headers or get blocked
soup = BeautifulSoup(page1.text, 'lxml')
res_tab = soup.find('div', {'id': 'resultsMain'})  # find the results' table
for a, univ in enumerate(res_tab.findAll('a', href=True)):  # parse universities' names
    if a < 10:  # there are 10 listed universities per page
        print(univ.text)
Edit: now the example works, but as you say in your question, it only returns empty lists. Below is an edited version of the code that returns a list of all universities (pp. 1-150):
import requests
from bs4 import BeautifulSoup

def parse_univ(url):
    newheaders = {
        'User-Agent': 'Mozilla/5.0 (X11; Linux i686 on x86_64)'
    }
    page1 = requests.get(url, headers=newheaders)  # change headers or get blocked
    soup = BeautifulSoup(page1.text, 'lxml')
    res_tab = soup.find('div', {'id': 'resultsMain'})  # find the results' table
    res = []
    for a, univ in enumerate(res_tab.findAll('a', href=True)):  # parse universities' names
        if a < 10:  # there are 10 listed universities per page
            res.append(univ.text)
    return res

baseurl = 'https://www.usnews.com/education/best-global-universities/rankings?page='
ll = [parse_univ(baseurl + str(p)) for p in range(1, 151)]  # this is a list of lists
univs = [item for sublist in ll for item in sublist]  # unfold the list of lists
Re-edit following QHarr's suggestion (thanks!) - same output, with a shorter and more "pythonic" solution:
import requests
from bs4 import BeautifulSoup

def parse_univ(url):
    newheaders = {
        'User-Agent': 'Mozilla/5.0 (X11; Linux i686 on x86_64)'
    }
    page1 = requests.get(url, headers=newheaders)  # change headers or get blocked
    soup = BeautifulSoup(page1.text, 'lxml')
    res_tab = soup.find('div', {'id': 'resultsMain'})  # find the results' table
    return [univ.text for univ in res_tab.select('[href]', limit=10)]

baseurl = 'https://www.usnews.com/education/best-global-universities/rankings?page='
ll = [parse_univ(baseurl + str(p)) for p in range(1, 151)]  # this is a list of lists
univs = [item for sublist in ll for item in sublist]
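The question already imports pandas, so as a follow-up you could dump the flattened list to CSV; adding a pause between the 150 requests is also polite and lowers the chance of being blocked (the one-second interval is arbitrary, and the output filename is my own):

import time
import pandas as pd

ll = []
for p in range(1, 151):
    ll.append(parse_univ(baseurl + str(p)))
    time.sleep(1)  # arbitrary pause so we don't hammer the server

univs = [item for sublist in ll for item in sublist]
pd.DataFrame({'university': univs}).to_csv('usnews_rankings.csv', index=False)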
This is my first post here, so please be patient.
I'm trying to scrape all of the links that contain a particular word (the name of a city - Gdańsk) from my local news site.
The problem is that I'm receiving some links which don't have the name of the city.
import re
import requests
from fake_useragent import UserAgent
from bs4 import BeautifulSoup

url = 'http://www.trojmiasto.pl'
nazwa_pliku = 'testowyplik.txt'

user_agent = UserAgent()
strona = requests.get(url, headers={'user-agent': user_agent.chrome})

with open(nazwa_pliku, 'w') as plik:
    plik.write(strona.content.decode('utf-8') if type(strona.content) == bytes else strona.content)

def czytaj():
    plikk = open('testowyplik.txt')
    data = plikk.read()
    plikk.close()
    return data

soup = BeautifulSoup(czytaj(), 'lxml')

linki = [div.a for div in soup.find_all('div', class_='entry-letter')]
for lin in linki:
    print(lin)

rezultaty = soup.find_all('a', string=re.compile("Gdańsk"))
print(rezultaty)

l = []
s = []
for tag in rezultaty:
    l.append(tag.get('href'))
    s.append(tag.text)

for i in range(len(s)):
    print('url = ' + l[i])
    print('\n')
Here is a complete and simpler example in Python 3:
import requests
from bs4 import BeautifulSoup

city_name = 'Gdańsk'
url = 'http://www.trojmiasto.pl'
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36'
}

with requests.get(url, headers=headers) as html:
    if html.ok:
        soup = BeautifulSoup(html.content, 'html.parser')
        links = soup('a')
        for link in links:
            if city_name in link.text:
                print('\t- [%s](%s)' % (link.text, link.get('href')))
Here is the output of the code above (formatted as Markdown just for clarity):
- [ZTM Gdańsk](//ztm.trojmiasto.pl/)
- [ZTM Gdańsk](https://ztm.trojmiasto.pl/)
- [Polepsz Gdańsk i złóż projekt do BO](https://www.trojmiasto.pl/wiadomosci/Polepsz-Gdansk-i-zloz-projekt-do-BO-n132827.html)
- [Pomnik Pileckiego stanie w Gdańsku](https://www.trojmiasto.pl/wiadomosci/Pomnik-rotmistrza-Witolda-Pileckiego-jednak-stanie-w-Gdansku-n132806.html)
- [O Włochu, który pokochał Gdańsk](https://rozrywka.trojmiasto.pl/Roberto-M-Polce-Polacy-maja-w-sobie-cos-srodziemnomorskiego-n132686.html)
- [Plakaty z poezją na ulicach Gdańska](https://kultura.trojmiasto.pl/Plakaty-z-poezja-na-ulicach-Gdanska-n132696.html)
- [Uniwersytet Gdański skończył 49 lat](https://nauka.trojmiasto.pl/Uniwersytet-Gdanski-skonczyl-50-lat-n132797.html)
- [Zapisz się na Półmaraton Gdańsk](https://aktywne.trojmiasto.pl/Zapisz-sie-na-AmberExpo-Polmaraton-Gdansk-2019-n132785.html)
- [Groźby na witrynach barów w Gdańsku](https://www.trojmiasto.pl/wiadomosci/Celtyckie-krzyze-i-grozby-na-witrynach-barow-w-Gdansku-n132712.html)
- [Stadion Energa Gdańsk](https://www.trojmiasto.pl/Stadion-Energa-Gdansk-o25320.html)
- [Gdańsk Big Beat Day 2019 ](https://www.trojmiasto.pl/rd/?t=p&id_polecamy=59233&url=https%3A%2F%2Fimprezy.trojmiasto.pl%2FGdansk-Big-Beat-Day-2019-imp475899.html&hash=150ce9c9)
- [ZTM Gdańsk](https://ztm.trojmiasto.pl/)
You could try an attribute = value CSS selector with the contains operator (*):
rezultaty = [item['href'] for item in soup.select("[href*='Gdansk']")]
Full script
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('http://www.trojmiasto.pl')
soup = bs(r.content, 'lxml')
rezultaty = [item['href'] for item in soup.select("[href*='Gdansk']")]
print(rezultaty)
Without the list comprehension:
for item in soup.select("[href*='Gdansk']"):
    print(item['href'])
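One subtlety on this site: the hrefs use the unaccented slug "Gdansk", while the visible link text uses "Gdańsk", so the two approaches match different sets of links. If you want either form, a sketch checking both fields (assuming that's the behavior you're after):

# Collect links where either the href slug ("Gdansk") or the visible
# text ("Gdańsk") mentions the city.
rezultaty = [
    a['href'] for a in soup.find_all('a', href=True)
    if 'Gdansk' in a['href'] or 'Gdańsk' in a.get_text()
]
print(rezultaty)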
I am trying to extract some information about an App on Google Play and BeautifulSoup doesn't seem to work.
The link is this(say):
https://play.google.com/store/apps/details?id=com.cimaxapp.weirdfacts
My code:
url = "https://play.google.com/store/apps/details?id=com.cimaxapp.weirdfacts"
r = requests.get(url)
html = r.content
soup = BeautifulSoup(html)
l = soup.find_all("div", { "class" : "document-subtitles"})
print len(l)
0 #How is this 0?! There is clearly a div with that class
I decided to go all in; that didn't work either:
i = soup.select('html body.no-focus-outline.sidebar-visible.user-has-no-subscription div#wrapper.wrapper.wrapper-with-footer div#body-content.body-content div.outer-container div.inner-container div.main-content div div.details-wrapper.apps.square-cover.id-track-partial-impression.id-deep-link-item div.details-info div.info-container div.info-box-top')
print i
What am I doing wrong?
You need to pretend to be a real browser by supplying the User-Agent header:
import requests
from bs4 import BeautifulSoup
url = "https://play.google.com/store/apps/details?id=com.cimaxapp.weirdfacts"
r = requests.get(url, headers={
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.94 Safari/537.36"
})
html = r.content
soup = BeautifulSoup(html, "html.parser")
title = soup.find(class_="id-app-title").get_text()
rating = soup.select_one(".document-subtitle .star-rating-non-editable-container")["aria-label"].strip()
print(title)
print(rating)
Prints the title and the current rating:
Weird Facts
Rated 4.3 stars out of five stars
To get the additional information field values, you can use the following generic function:
def get_info(soup, text):
    return soup.find("div", class_="title", text=lambda t: t and t.strip() == text).\
        find_next_sibling("div", class_="content").get_text(strip=True)
Then, if you do:
print(get_info(soup, "Size"))
print(get_info(soup, "Developer"))
You will see printed:
1.4M
Email email#here.com
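One caveat: Google Play's class names change over time, so get_info() will raise AttributeError once the markup shifts. A defensive wrapper (a sketch, not part of the original answer; get_info_safe is my own name) keeps the scraper running:

def get_info_safe(soup, text):
    # Return None instead of raising when the field or markup is missing.
    try:
        return get_info(soup, text)
    except AttributeError:
        return None

print(get_info_safe(soup, "Size"))       # 1.4M
print(get_info_safe(soup, "Developer"))  # the developer field, or None if absent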