Web scraping content of ::before using BeautifulSoup? - python

I am quite new to Python and have tried scraping some websites. A few of them worked well, but I have now stumbled upon one that is giving me a hard time. The URL I'm using is: https://www.drankdozijn.nl/groep/rum. I'm trying to get all product titles and URLs from this page, but since there is a ::before in the HTML code I am unable to scrape it. Any help would be very appreciated! This is the code I have so far:
try:
    source = requests.get(url)
    source.raise_for_status()
    soup = BeautifulSoup(source.text, 'html.parser')
    wachttijd = random.randint(2, 4)
    print("Succes! URL:", url, "Wachttijd is:", wachttijd, "seconden")
    productlist = soup.find('div', {'id': 'app'})
    for productinfo in productlist:
        productnaam = getTextFromHTMLItem(productinfo.find('h3', {'class': 'card-title lvl3'}))
        product_url = getHREFFromHTMLItem(productinfo.find('a', {'class': 'ptile-v2_link'}))
        # print info
        print(productnaam)
        # place the information in a sheet row
        print("Sheet append")
        sheet.append([productnaam])
    # time.sleep(1)
    time.sleep(wachttijd)
    print("Sheet opslaan")
    excel.save('C:/Python/Files/RumUrlsDrankdozijn.xlsx')
    return soup
except Exception as e:
    print(e)

The product details for that site are returned as JSON from a different URL; the HTML you are requesting does not contain them. The JSON endpoint can easily be accessed as follows:
import requests
import openpyxl

url = "https://es-api.drankdozijn.nl/products"

params = {
    "country": "NL",
    "language": "nl",
    "page_template": "groep",
    "group": "rum",
    "page": "1",
    "listLength": "20",
    "clientFilters": "{}",
    "response": "paginated",
    "sorteerOp": "relevance",
    "ascdesc": "asc",
    "onlyAvail": "false",
    "cacheKey": "1",
    "premiumMember": "N",
}

wb = openpyxl.Workbook()
ws = wb.active
ws.append(['Description', 'Price', 'URL', 'Land', 'AlcoholPercentage'])

for page in range(1, 11):
    params['page'] = page
    req = requests.get(url, params=params)
    req.raise_for_status()
    data = req.json()

    for product in data['data']:
        features = {feature["alias"]: feature["value"]["description"] for feature in product['features']}
        ws.append([
            product["description"],
            product["pricePerLiterFormatted"],
            product["structuredData"]["offers"]["url"],
            features.get("land", "unknown"),
            features.get("alcoholpercentage", "unknown"),
        ])

wb.save('output.xlsx')
This gets the first 10 pages of details.
I recommend you print(data) to have a look at all of the information that is available.
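For example, a quick way to pretty-print one page's payload using only the standard library (data here is the decoded JSON from the loop above):

import json

# dump the decoded payload with indentation to inspect every available field
print(json.dumps(data, indent=2, ensure_ascii=False))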
The URL was found using the browser's network tools to watch the request it made whilst loading the page. An alternative approach would be to use something like Selenium to fully render the HTML, but this will be slower and more resource intensive.
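For reference, a minimal Selenium sketch of that alternative. This is an assumption-laden example: it assumes Chrome plus the selenium package are installed, and the card-title selector is taken from the question's code, so it may have changed on the live site:

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://www.drankdozijn.nl/groep/rum")
# once the page is fully rendered, the product tiles exist in the DOM
for title in driver.find_elements(By.CSS_SELECTOR, "h3.card-title"):
    print(title.text)
driver.quit()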
openpyxl is used to create the output spreadsheet. You could modify the column widths and appearance of the Excel output if needed.
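For instance, a minimal sketch of widening columns before saving (the column letters and widths here are illustrative, not taken from the answer):

# openpyxl column widths are measured in characters
ws.column_dimensions["A"].width = 40  # Description
ws.column_dimensions["C"].width = 60  # URL
wb.save('output.xlsx')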

Related

How to scrape stock data incorporating pagination next tag using python bs4?

The code cannot get to the next page; it only repeats in an infinite loop. I am using the example from oxylabs.
Could you tell me what I'm doing wrong? Thank you.
import requests
from bs4 import BeautifulSoup as bs
from urllib.parse import urljoin

url = 'https://hnx.vn/en-gb/cophieu-etfs/chung-khoan-ny.html'

while True:
    response = requests.get(url)
    soup = bs(response.content, "lxml")
    symbols = soup.find_all('td', class_='STOCK_CODE')
    for s in symbols:
        symbol = s.find('a').text
        print(symbol)
    next_page = soup.select_one('span', id='next')
    if next_page:
        next_url = next_page.get('href')
        url = urljoin(url, next_url)
    else:
        break
    print(url)
The information you want for the other pages is returned via another call; you need to recreate that call (use your browser's network tools to see what is happening).
The request requires a token that is returned when the homepage is requested. This token needs to be provided when requesting the other pages.
For example:
from bs4 import BeautifulSoup as bs
import requests

session = requests.Session()
req_homepage = session.get('https://hnx.vn/en-gb/cophieu-etfs/chung-khoan-ny.html')
soup_homepage = bs(req_homepage.content, "lxml")

for meta in soup_homepage.find_all('meta'):
    if meta.get('name', None) == '__RequestVerificationToken':
        token = meta['content']

data = {
    "p_issearch": 0,
    "p_keysearch": "",
    "p_market_code": "",
    "p_orderby": "STOCK_CODE",
    "p_ordertype": "ASC",
    "p_currentpage": 2,
    "p_record_on_page": 10,
}

headers = {
    "Referer": "https://hnx.vn/en-gb/cophieu-etfs/chung-khoan-ny.html",
    "__RequestVerificationToken": token,
    "X-Requested-With": "XMLHttpRequest",
}

for page in range(1, 4):
    print(f"Page {page}")
    data['p_currentpage'] = page
    req = session.post('https://hnx.vn/ModuleIssuer/List/ListSearch_Datas', data=data, headers=headers)
    json_content = req.json()['Content']
    soup = bs(json_content, "lxml")
    for td in soup.find_all('td', class_='STOCK_CODE'):
        symbol = td.find('a').text
        print(' ', symbol)
This would give you the following output:
Page 1
  AAV
  ACM
  ADC
  ALT
  AMC
  AME
  AMV
  API
  APP
  APS
Page 2
  ARM
  ART
  ATS
  BAB
  BAX
  BBS
  BCC
  BCF
  BDB
  BED
Page 3
  BII
  BKC
  BLF
  BNA
  BPC
  BSC
  BST
  BTS
  BTW
  BVS

Web-scrape. BeautifulSoup. Multiple Pages. How on earth would you do that?

Hi, I am a newbie to programming, so I spent 4 days trying to learn Python. I invented some new swear words too.
I was particularly interested in trying some web-scraping as an exercise, to learn something new and get some exposure to how it all works.
This is what I came up with (see the code at the end). It works, to a degree.
But what's missing?
This website has pagination on it, in this case 11 pages' worth. How would you go about adding to this script so that Python goes and looks at those other pages too and carries out the same scrape, i.e. scrape page 1, scrape pages 2, 3 ... 11, and post the results to a CSV?
https://www.organicwine.com.au/vegan/?pgnum=1
https://www.organicwine.com.au/vegan/?pgnum=2
https://www.organicwine.com.au/vegan/?pgnum=3
https://www.organicwine.com.au/vegan/?pgnum=4
https://www.organicwine.com.au/vegan/?pgnum=5
https://www.organicwine.com.au/vegan/?pgnum=6
https://www.organicwine.com.au/vegan/?pgnum=7
8, 9,10, and 11
On these pages the images are actually thumbnails, something like 251px by 251px.
How would you go about adding to this script to say: while you are at it, follow the links to the detailed product page and capture the image link from there, where the images are 1600px by 1600px, and post those links to the CSV too?
https://www.organicwine.com.au/mercer-wines-preservative-free-shiraz-2020
Once we have identified those links, let's also download those larger images to a folder.
About the CSV writer: I also don't understand line 58,
for i in range(23)
How would I know how many products there were without counting them? (There are 24 products on page one.)
So this is what I want to learn how to do. Not asking for much (he says sarcastically). I could pay someone on Upwork to do it, but where's the fun in that? And that doesn't teach me how to 'fish'.
Where is a good place to learn Python? A master class on web-scraping? It seems to be trial and error, blog posts, and wherever you can pick up bits of information to piece it all together.
Maybe I need a mentor.
I wish there had been someone I could have reached out to, to tell me what BeautifulSoup was all about. I worked it out by trial and error and mostly guessing; no real understanding of it, but it just works.
Anyway, any help in pulling this all together to produce a decent script would be greatly appreciated.
Hopefully there is someone out there who would not mind helping me.
Apologies to organicwine for using their website as a learning tool. I do not wish to cause any harm or be a nuisance to the site.
Thank you in advance
John
code:
import requests
import csv
from bs4 import BeautifulSoup

URL = "https://www.organicwine.com.au/vegan/?pgnum=1"
response = requests.get(URL)
website_html = response.text
soup = BeautifulSoup(website_html, "html.parser")

product_title = soup.find_all('div', class_="caption")
# print(product_title)

winename = []
for wine in product_title:
    winetext = wine.a.text
    winename.append(winetext)
    print(f'''Wine Name: {winetext}''')
# print(f'''\nWine Name: {winename}\n''')

product_price = soup.find_all('div', class_='wrap-thumb-mob')
# print(product_price.text)
price = []
for wine in product_price:
    wineprice = wine.span.text
    price.append(wineprice)
    print(f'''Wine Price: {wineprice}''')
# print(f'''\nWine Price: {price}\n''')

image = []
product_image_link = soup.find_all('div', class_='thumbnail-image')
# print(product_image_link)
for imagelink in product_image_link:
    wineimagelink = imagelink.a['href']
    image.append(wineimagelink)
    # image.append(imagelink)
    print(f'''Wine Image Link: {wineimagelink}''')
# print(f'''\nWine Image: {image}\n''')

# """ writing data to CSV """
# open OrganicWine2.csv file in "write" mode
# newline stops a blank line appearing in csv
with open('OrganicWine2.csv', 'w', newline='') as file:
    # create a "writer" object
    writer = csv.writer(file, delimiter=',')
    # use "writer" obj to write
    # you should give a "list"
    writer.writerow(["Wine Name", "Wine Price", "Wine Image Link"])
    for i in range(23):
        writer.writerow([
            winename[i],
            price[i],
            image[i],
        ])
In this case, for pagination, instead of for i in range(1, 100), which is a hardcoded way of paging, it's better to use a while loop to dynamically paginate through all available pages.
while True is an infinite loop; it runs until the transition to the next page is no longer possible. In this case, it checks for the presence of the next-page button, which the CSS selector ".fa-chevron-right" matches:
if soup.select_one(".fa-chevron-right"):
    params["pgnum"] += 1  # go to the next page
else:
    break
To extract the full-size image, an additional request is required; the CSS selector ".main-image a" matches the full-size images:
full_image_html = requests.get(link, headers=headers, timeout=30)
image_soup = BeautifulSoup(full_image_html.text, "lxml")
try:
    original_image = f'https://www.organicwine.com.au{image_soup.select_one(".main-image a")["href"]}'
except:
    original_image = None
An additional step to avoid being blocked is to rotate user agents. Ideally, it would be better to use residential proxies together with random user agents.
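A minimal sketch of user-agent rotation with random.choice (the strings below are illustrative examples, not a maintained list):

import random
import requests

user_agents = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.60 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/15.4 Safari/605.1.15",
]

# pick a different user agent per request
headers = {"User-Agent": random.choice(user_agents)}
page = requests.get("https://www.organicwine.com.au/vegan/", params={"pgnum": 1}, headers=headers, timeout=30)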
pandas can be used to write the data out in CSV format:
pd.DataFrame(data=data).to_csv("<csv_file_name>.csv", index=False)
For a quick and easy search for CSS selectors, you can use the SelectorGadget Chrome extension (it doesn't always work perfectly if the website is rendered via JavaScript).
Check the code with pagination and saving of information to CSV in the online IDE.
from bs4 import BeautifulSoup
import requests, json, lxml
import pandas as pd

# https://requests.readthedocs.io/en/latest/user/quickstart/#custom-headers
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.60 Safari/537.36",
}

params = {
    'pgnum': 1  # page number, 1 by default
}

data = []

while True:
    page = requests.get(
        "https://www.organicwine.com.au/vegan/?",
        params=params,
        headers=headers,
        timeout=30,
    )
    soup = BeautifulSoup(page.text, "lxml")

    print(f"Extracting page: {params['pgnum']}")

    for products in soup.select(".price-btn-conts"):
        try:
            title = products.select_one(".new-h3").text
        except:
            title = None
        try:
            price = products.select_one(".price").text.strip()
        except:
            price = None
        try:
            snippet = products.select_one(".price-btn-conts p a").text
        except:
            snippet = None
        try:
            link = products.select_one(".new-h3 a")["href"]
        except:
            link = None

        # an additional request is needed to extract the full-size image
        full_image_html = requests.get(link, headers=headers, timeout=30)
        image_soup = BeautifulSoup(full_image_html.text, "lxml")
        try:
            original_image = f'https://www.organicwine.com.au{image_soup.select_one(".main-image a")["href"]}'
        except:
            original_image = None

        data.append(
            {
                "title": title,
                "price": price,
                "snippet": snippet,
                "link": link,
                "original_image": original_image
            }
        )

    if soup.select_one(".fa-chevron-right"):
        params["pgnum"] += 1
    else:
        break

# save to CSV (install and import pandas as pd)
pd.DataFrame(data=data).to_csv("<csv_file_name>.csv", index=False)

print(json.dumps(data, indent=2, ensure_ascii=False))
Example output:
[
  {
    "title": "Yangarra McLaren Vale GSM 2016",
    "price": "$29.78 in a straight 12\nor $34.99 each",
    "snippet": "The Yangarra GSM is a careful blending of Grenache, Shiraz and Mourvèdre in which the composition varies from year to year, conveying the traditional estate blends of the southern Rhône. The backbone of the wine comes fr...",
    "link": "https://www.organicwine.com.au/yangarra-mclaren-vale-gsm-2016",
    "original_image": "https://www.organicwine.com.au/assets/full/YG_GSM_16.png?20211110083637"
  },
  {
    "title": "Yangarra Old Vine Grenache 2020",
    "price": "$37.64 in a straight 12\nor $41.99 each",
    "snippet": "Produced from the fruit of dry grown bush vines planted high up in the Estate's elevated vineyards in deep sandy soils. These venerated vines date from 1946 and produce a wine that is complex, perfumed and elegant with a...",
    "link": "https://www.organicwine.com.au/yangarra-old-vine-grenache-2020",
    "original_image": "https://www.organicwine.com.au/assets/full/YG_GRE_20.jpg?20210710165951"
  },
  # ...
]
Create the URL by putting the page number in it, then put the rest of your code into a for loop, and you can use len(winename) to count how many results you have. You should do the writing outside the for loop. Here's your code with those changes:
import requests
import csv
from bs4 import BeautifulSoup

num_pages = 11
result = []
for pgnum in range(num_pages):
    url = f"https://www.organicwine.com.au/vegan/?pgnum={pgnum + 1}"
    response = requests.get(url)
    website_html = response.text
    soup = BeautifulSoup(website_html, "html.parser")

    product_title = soup.find_all("div", class_="caption")
    winename = []
    for wine in product_title:
        winetext = wine.a.text
        winename.append(winetext)

    product_price = soup.find_all("div", class_="wrap-thumb-mob")
    price = []
    for wine in product_price:
        wineprice = wine.span.text
        price.append(wineprice)

    image = []
    product_image_link = soup.find_all("div", class_="thumbnail-image")
    for imagelink in product_image_link:
        winelink = imagelink.a["href"]
        response = requests.get(winelink)
        wine_page_soup = BeautifulSoup(response.text, "html.parser")
        main_image = wine_page_soup.find("a", class_="fancybox")
        image.append(main_image['href'])

    for i in range(len(winename)):
        result.append([winename[i], price[i], image[i]])

with open("/tmp/OrganicWine2.csv", "w", newline="") as file:
    writer = csv.writer(file, delimiter=",")
    writer.writerow(["Wine Name", "Wine Price", "Wine Image Link"])
    writer.writerows(result)
And here's how I would rewrite your code to accomplish this task. It's more Pythonic (you should basically never write range(len(something)); there's always a cleaner way), and it doesn't require knowing how many pages of results there are:
import csv
import time

import requests
from bs4 import BeautifulSoup

data = []
# Try opening 100 pages at most, in case the scraping code is broken,
# which can happen because websites change.
for pgnum in range(1, 100):
    url = f"https://www.organicwine.com.au/vegan/?pgnum={pgnum}"
    response = requests.get(url)
    website_html = response.text
    soup = BeautifulSoup(website_html, "html.parser")

    search_results = soup.find_all("div", class_="thumbnail")
    for search_result in search_results:
        name = search_result.find("div", class_="caption").a.text
        price = search_result.find("p", class_="price").span.text
        # link to the product's page
        link = search_result.find("div", class_="thumbnail-image").a["href"]

        # get the full-resolution product image
        response = requests.get(link)
        time.sleep(1)  # rate limit
        wine_page_soup = BeautifulSoup(response.text, "html.parser")
        main_image = wine_page_soup.find("a", class_="fancybox")
        image_url = main_image["href"]

        # or you can just "guess" it from the thumbnail's URL
        # thumbnail = search_result.find("div", class_="thumbnail-image").a.img['src']
        # image_url = thumbnail.replace('/thumbL/', '/full/')

        data.append([name, price, link, image_url])

    # if there's no "next page" button or no search results on the current page,
    # stop scraping
    if not soup.find("i", class_="fa-chevron-right") or not search_results:
        break

    # rate limit
    time.sleep(1)

with open("/tmp/OrganicWine3.csv", "w", newline="") as file:
    writer = csv.writer(file, delimiter=",")
    writer.writerow(["Wine Name", "Wine Price", "Wine Link", "Wine Image Link"])
    writer.writerows(data)
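The question also asked about downloading the larger images to a folder. Neither answer covers that step, so here is a minimal sketch that reuses the data rows built above; it assumes each image URL ends in a usable filename:

import os
import requests

os.makedirs("wine_images", exist_ok=True)
for name, price, link, image_url in data:
    # derive a filename from the URL, dropping any query string
    filename = image_url.split("/")[-1].split("?")[0]
    img = requests.get(image_url, timeout=30)
    img.raise_for_status()
    with open(os.path.join("wine_images", filename), "wb") as f:
        f.write(img.content)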

Web Scraping with Python BS

Trying to scrape some weather data off of Weather Underground. I haven't had any difficulty getting the data of interest until I came to getting the day/date, hi/lo temps, and forecast (e.g. "Partly Cloudy"). Each is in a div without a class. The parent of each is a div with class="obs-date" (see image below).
(Screenshot of the Weather Underground HTML structure omitted.)
Attempted code is below, with other options commented out. Each returns an empty list.
def get_wx(city, state):
    city = city.lower()
    state = state.lower()

    # get current conditions; 'weather' in url
    current_dict = get_current(city, state)

    # get forecast; 'forecast' in url
    f_url = f'https://www.wunderground.com/forecast/us/{state}/{city}'
    f_response = req.get(f_url)
    f_soup = BeautifulSoup(f_response.text, 'html.parser')
    cast_dates = f_soup.find_all('div', class_="obs-date")
    # cast_dates = f_soup.find_all('div', attrs={"class": "obs-date"})
    # cast_dates = f_soup.select('div.obs-date')
    print(cast_dates)

get_wx("Portland", "ME")
Any help with what I'm missing is appreciated.
As far as I can see, the whole block you're trying to parse is rendered by JavaScript, which is why you're getting empty results with BeautifulSoup.
The ADDITIONAL CONDITIONS part could be parsed completely using bs4, as could everything below it. The table at the end could be parsed using pandas (see the sketch after the example output below).
To scrape JavaScript-rendered content, you can use the requests-html or selenium libraries.
from requests_html import HTMLSession
import json

session = HTMLSession()
url = "https://www.wunderground.com/weather/us/me/portland"

response = session.get(url)
response.html.render(sleep=1)

data = []

current_date = response.html.find('.timestamp strong', first=True).text
weather_conditions = response.html.find('.condition-icon p', first=True).text
gusts = response.html.find('.medium-uncentered span', first=True).text
current_temp = response.html.find('.current-temp .is-degree-visible', first=True).text

data.append({
    "Last update": current_date,
    "Current weather": weather_conditions,
    "Temperature": current_temp,
    "Gusts": gusts,
})

print(json.dumps(data, indent=2, ensure_ascii=False))
Output:
[
  {
    "Last update": "1:27 PM EDT on April 14, 2021",
    "Current weather": "Fair",
    "Temperature": "49 F",
    "Gusts": "13 mph"
  }
]
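As for the table at the end of the page mentioned above, pandas can read it straight out of the rendered markup. A minimal sketch, assuming response.html.html holds the rendered page source from the requests-html session above:

import io
import pandas as pd

# read_html returns a list of DataFrames, one per <table> element found
tables = pd.read_html(io.StringIO(response.html.html))
for table in tables:
    print(table.head())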

scraping with python using bs4

I am trying to scrape from the URL in the code below.
I am trying to get the data into a dictionary of the form {key (name of game): value (list of links)}.
When I try, I can't find the div tag with id="accordion", and because of this I'm stuck now.
My code:
def findLiveGamesInBetman():
    dic = {}
    links = []
    url = 'https://live.batstream.tv/?sport=football&sp=1,2,3,4,5,6,7,8,9,10,20,25&fs=13px&fw=700&tt=none&fc=405115&tc=333333&bc=FFFFFF&bhc=FDFDFD&pd=4px&mr=1px&tm=817503&tmb=FFFFFF&wb=e5e5e5&bsh=0px&rdb=FFFFFF&rdc=C74300&l=https://sport-play.tv/register/&lt=1&lsp=1&lcy=1&lda=1&l2=https://sport-play.tv/register/&l2t=1&l2sp=1&l2co=1&l2cy=1&l2da=1'
    # fetch the html
    request = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
    content = urlopen(request).read()
    parse = BeautifulSoup(content, 'html.parser')
    body = parse.find('body')
    div_con = body.find('div', {"class": "container"})
    div_row = div_con.find('div', {"class": "row"})
    div_col = div_row.find('div', {"class": "col-lg-12"})
    div_accor = div_col.find('div', {'id': 'accordion_t'})
    div_ac = div_accor.find('div', {'id': 'accordion'})  # --> this returns empty
Note: the URL in this code shows just the games.
I have been looking here to find something that might help, but unfortunately I didn't find anything.
How can I fix it?
Thanks

Problems with web scraping (William Hill-UFC Odds)

I'm creating a web scraper that will let me get the odds of upcoming UFC fights on William Hill. I'm using Beautiful Soup, but I have not yet been able to successfully scrape the needed data. (https://sports.williamhill.com/betting/en-gb/ufc)
I need the fighters' names and their odds.
I've attempted a variety of methods to get the data, trying to scrape different tags etc., but nothing happens.
def scrape_data():
    data = requests.get("https://sports.williamhill.com/betting/en-gb/ufc")
    soup = BeautifulSoup(data.text, 'html.parser')
    links = soup.find_all('a', {'class': 'btmarket__name btmarket__name--featured'}, href=True)
    for link in links:
        links.append(link.get('href'))
    for link in links:
        print(f"Now currently scraping link: {link}")
        data = requests.get(link)
        soup = BeautifulSoup(data.text, 'html.parser')
        time.sleep(1)
        fighters = soup.find_all('p', {'class': "btmarket__name"})
        c = fighters[0].text.strip()
        d = fighters[1].text.strip()
        f1.append(c)
        f2.append(d)
        odds = soup.find_all('span', {'class': "betbutton_odds"})
        a = odds[0].text.strip()
        b = odds[1].text.strip()
        f1_odds.append(a)
        f2_odds.append(b)
    return None
I would expect it to be exported to a CSV file. I'm currently using Morph.io to host and run the scraper, but it returns nothing.
If correct, it would output:
Fighter1Name:
Fighter2Name:
F1Odds:
F2Odds:
For every available fight.
Any help would be greatly appreciated.
The HTML returned has different attributes and values; you need to inspect the response.
For writing out to CSV, you will want to prepend "'" to the odds to prevent them from being treated as fractions or dates. See the commented-out alternatives in the code below.
import requests
from bs4 import BeautifulSoup as bs
import pandas as pd

r = requests.get('https://sports.williamhill.com/betting/en-gb/ufc')
soup = bs(r.content, 'lxml')
results = []

for item in soup.select('.btmarket:has([data-odds])'):
    match_name = item.select_one('.btmarket__name[title]')['title']
    odds = [i['data-odds'] for i in item.select('[data-odds]')]
    row = {'event-starttime': item.select_one('[datetime]')['datetime'],
           'match_name': match_name,
           'home_name': match_name.split(' vs ')[0],
           # 'home_odds': "'" + str(odds[0]),
           'home_odds': odds[0],
           'away_name': match_name.split(' vs ')[1],
           'away_odds': odds[1],
           # 'away_odds': "'" + str(odds[1]),
           }
    results.append(row)

df = pd.DataFrame(results, columns=['event-starttime', 'match_name', 'home_name', 'home_odds', 'away_name', 'away_odds'])
print(df.head())

# write to csv
df.to_csv(r'C:\Users\User\Desktop\Data.csv', sep=',', encoding='utf-8-sig', index=False)
