BeautifulSoup and requests do not produce expected results with .findAll()

I have been writing a piece of code that will retrieve a list of items and their corresponding prices from the Steam Marketplace (for the game Unturned). I am using the BeautifulSoup (bs4) and requests libraries. This is my code so far:
import requests
from bs4 import BeautifulSoup

items = []
prices = []

for page_num in range(1, 10):
    website = 'http://steamcommunity.com/market/search?appid=304930#p' + str(page_num) + '_popular_desc'
    r = requests.get(website)
    soup = BeautifulSoup(r.text, "html.parser")
    names = soup.findAll("span", {"class": "market_listing_item_name"})
    for item in range(len(names)):
        items.append(names[item].contents[0])
    costs = soup.findAll("span", {"class": "normal_price"})
    for cost in range(len(costs)):
        prices.append(costs[cost].contents[0])
Expected Output:
Festive Gift Present : $0.32 USD
Halloween Gift Present : $0.26 USD
Carbon Fiber Mystery Box : $0.47 USD
Festive Hat : $1.67 USD
Nuclear Matamorez : $0.39 USD
... and so on
The problem with this code is that it only gets the names from the first page. If I type the URL manually with different numbers in place of page_num, the page changes in the browser and so does the HTML document. However, the code never picks up the results from the second page onwards. requests fetches the correct URL each time, yet the returned HTML document is always the same?

The #p2_popular_desc part of the URL is a fragment: it is handled entirely by the browser and never sent to the server, which is why every request returns the same HTML. Pages 2, 3, etc. are requested via ajax (or similar), so their markup isn't present in the initial page source. To get around this we can sniff the ajax URL (visible in the browser's network tab) and parse its response directly, which in this case is JSON encoded, i.e.:
import json
from bs4 import BeautifulSoup
from urllib2 import urlopen

output = ""
items = []
prices = []

for page_num in range(0, 100, 10):  # first 10 pages, 10 results each
    start = page_num
    count = 10  # count is the number of results per request, not an end index
    url = urlopen("http://steamcommunity.com/market/search/render/?query=&start={}&count={}&search_descriptions=0&sort_column=popular&sort_dir=desc&appid=304930".format(start, count))
    jsonCode = json.loads(url.read())
    output += jsonCode['results_html']

soup = BeautifulSoup(output, "html.parser")
names = soup.findAll("span", {"class": "market_listing_item_name"})
for item in range(len(names)):
    items.append(names[item].contents[0])
costs = soup.findAll("span", {"class": "normal_price"})
for cost in range(len(costs)):
    if "Starting at" not in costs[cost].contents[0]:  # skip the "Starting at" label, keep only the normal price
        prices.append(costs[cost].contents[0])
print items
[u'Festive Gift Present', u'Halloween Gift Present', u'Hypertech Timberwolf', u'Holiday Scarf', u'Chill Honeybadger', etc...]
print prices
[u'$0.34 USD', u'$0.28 USD', u'$1.77 USD', u'$0.31 USD', u'$0.65 USD', etc...]
PS: Steam will temporarily ban your IP after ~50 requests
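For reference, a Python 3 port of the same idea might look like the sketch below, with a small delay between requests to stay clear of that limit. This is untested against the live endpoint, and the 3-second delay is an arbitrary guess, not a documented threshold:

import json
import time
import requests
from bs4 import BeautifulSoup

BASE = ("http://steamcommunity.com/market/search/render/"
        "?query=&start={}&count={}&search_descriptions=0"
        "&sort_column=popular&sort_dir=desc&appid=304930")

output = ""
for start in range(0, 100, 10):
    # request one 10-item page at a time and collect the rendered HTML
    data = requests.get(BASE.format(start, 10)).json()
    output += data['results_html']
    time.sleep(3)  # arbitrary delay to avoid the rate limit mentioned above

soup = BeautifulSoup(output, "html.parser")
items = [n.contents[0] for n in soup.find_all("span", class_="market_listing_item_name")]
prices = [c.contents[0] for c in soup.find_all("span", class_="normal_price")
          if "Starting at" not in c.contents[0]]
print(items)
print(prices)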

Related

How to scrape all match / ticket info while iterating a list?

Thanks in advance, guys - I'm trying to compile ticket sale information into one easy-to-read list, or possibly a filtered table, but one step at a time.
Successfully managed to write a short script to list the pages for each event:
import requests
from bs4 import BeautifulSoup

url = "https://www.liverpoolfc.com/tickets/tickets-availability"
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")

pages = []
for link in soup.find_all("a", class_="ticket-card fixture"):
    href = link.get("href")
    if href:
        pages.append(href)

print("Pages:")
for page in set(pages):
    print("- " + page)
Which returns
Pages:
- /tickets/tickets-availability/wolverhampton-wanderers-v-liverpool-fc-4-feb-2023-0300pm-245
- /tickets/tickets-availability/liverpool-fc-v-arsenal-8-apr-2023-0300pm-236
- /tickets/tickets-availability/liverpool-fc-v-manchester-united-4-mar-2023-0300pm-235
- /tickets/tickets-availability/liverpool-fc-v-real-madrid-21-feb-2023-0800pm-238
- /tickets/tickets-availability/liverpool-fc-v-tottenham-hotspur-29-apr-2023-0300pm-232
- /tickets/tickets-availability/liverpool-fc-v-nottingham-forest-22-apr-2023-0300pm-234
- /tickets/tickets-availability/liverpool-fc-v-fulham-18-mar-2023-0300pm-237
- /tickets/tickets-availability/newcastle-united-v-liverpool-fc-18-feb-2023-0530pm-246
- /tickets/tickets-availability/liverpool-fc-v-brentford-6-may-2023-0300pm-231
- /tickets/tickets-availability/liverpool-fc-v-aston-villa-20-may-2023-0300pm-230
- /tickets/tickets-availability/liverpool-fc-v-everton-13-feb-2023-0800pm-233
- /tickets/tickets-availability/crystal-palace-v-liverpool-fc-25-feb-2023-0745pm-247
So far so good.
But with the following code I'm only getting the first result, when I'm expecting about 4 sets, and find_all just doesn't seem to work (this is just for a single page at the moment):
import requests
from bs4 import BeautifulSoup

url = "https://www.liverpoolfc.com/tickets/tickets-availability/liverpool-fc-v-everton-13-feb-2023-0800pm-233"
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")

# Find all the elements with the desired class
ticket_sales = soup.find_all(class_="accorMenu")

# Create a list to store the extracted information
sales_list = []

# Check if any ticket sales were found
if ticket_sales:
    # Iterate over each ticket sale
    for accorMenuList in ticket_sales:
        # Extract the desired information from the ticket sale
        saletype = soup.find("span", class_="saletype").text.strip()
        salename = soup.find("span", class_="salename").text.strip()
        prereqs = soup.find("span", class_="prereqs").text.strip()
        status = soup.find("span", class_="status").text.strip()
        whenavailable = soup.find("span", class_="whenavailable").text.strip()
        # Store the extracted information in a dictionary
        sale_info = {
            "saletype": saletype,
            "salename": salename,
            "prereqs": prereqs,
            "status": status,
            "whenavailable": whenavailable
        }
        # Add the dictionary to the list of sales
        sales_list.append(sale_info)
    # Print the list of sales
    for sale in sales_list:
        print("Saletype:", sale["saletype"])
        print("Salename:", sale["salename"])
        print("Prereqs:", sale["prereqs"])
        print("Status:", sale["status"])
        print("Whenavailable:", sale["whenavailable"])
        print("---")
else:
    # If no ticket sales were found, print a message
    print("No ticket sales found.")
returns:
Saletype: match ticket -
Salename: Hospitality
Prereqs:
Status: available
Whenavailable: Mon 6 Feb 2023, 11:00am
---
Your approach is already the right one, but you are working under the following misconceptions:
ticket_sales = soup.find_all(class_="accorMenu") does not reference the individual list elements but the list container itself, so the ResultSet holds only one element that can be iterated over.
Instead use soup.select('.accorMenu li') or
soup.select('.accorMenu h3') to select the individual containers.
CSS selectors are used here because they make chaining a bit easier than several nested find() / find_all() calls.
When iterating, do not call find() on the global soup object (saletype = soup.find("span", class_="saletype").text.strip()) but on the current element of the iteration. Otherwise you will still only get the information of the first matching element in soup.
Furthermore, you should always check whether an element has been found at all before calling a method on it; this can be implemented with a simple if/else statement (or a conditional expression, as in the example).
Example
import requests
from bs4 import BeautifulSoup

url = 'https://www.liverpoolfc.com/tickets/tickets-availability/liverpool-fc-v-everton-13-feb-2023-0800pm-233'
soup = BeautifulSoup(requests.get(url).text, 'html.parser')

sales_list = []
for e in soup.select('.accorMenu h3'):
    # Store the extracted information in a dictionary
    sales_list.append({
        "saletype": e.find("span", class_="saletype").text.strip(),
        "salename": e.find("span", class_="salename").text.strip(),
        "prereqs": e.find("span", class_="prereqs").text.strip(),
        "status": e.find("span", class_="status").text.strip(),
        "whenavailable": e.find("span", class_="whenavailable").text.strip() if e.find("span", class_="whenavailable") else None
    })
sales_list
Output
[{'saletype': 'match ticket -',
  'salename': 'Hospitality',
  'prereqs': '',
  'status': 'available',
  'whenavailable': None},
 {'saletype': 'match ticket -',
  'salename': 'Local Members Sale',
  'prereqs': 'Members with an ‘L’ Postcode',
  'status': 'sold out',
  'whenavailable': None},
 {'saletype': 'match ticket -',
  'salename': 'Local General Sale',
  'prereqs': 'Supporters with an ‘L’ Postcode',
  'status': 'sold out',
  'whenavailable': None},
 {'saletype': 'match ticket -',
  'salename': 'Additional Members Sale',
  'prereqs': 'Members who have recorded 4+ Premier League home games from either season 2018/19 or 2019/20',
  'status': 'on sale soon',
  'whenavailable': 'Mon 6 Feb 2023, 11:00am'}]
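If the None checks get repetitive, a small hypothetical helper (not part of the answer above, just a convenience wrapper around the same find() calls) keeps the dictionary construction tidy:

def text_or_none(parent, name, cls):
    # Return the stripped text of the first matching tag, or None if absent
    tag = parent.find(name, class_=cls)
    return tag.text.strip() if tag else None

sales_list = [{
    "saletype": text_or_none(e, "span", "saletype"),
    "salename": text_or_none(e, "span", "salename"),
    "prereqs": text_or_none(e, "span", "prereqs"),
    "status": text_or_none(e, "span", "status"),
    "whenavailable": text_or_none(e, "span", "whenavailable"),
} for e in soup.select('.accorMenu h3')]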

bs4 findAll not collecting all of the data from the other pages on the website

I'm trying to scrape a real estate website using BeautifulSoup.
I'm trying to get a list of rental prices for London. This works but only for the first page on the website. There are over 150 of them so I'm missing out on a lot of data. I would like to be able to collect all the prices from all the pages. Here is the code I'm using:
import requests
from bs4 import BeautifulSoup as soup

url = 'https://www.zoopla.co.uk/to-rent/property/central-london/?beds_max=5&price_frequency=per_month&q=Central%20London&results_sort=newest_listings&search_source=home'
response = requests.get(url)
response.status_code

data = soup(response.content, 'lxml')
prices = []
for line in data.findAll('div', {'class': 'css-1e28vvi-PriceContainer e2uk8e7'}):
    price = str(line).split('>')[2].split(' ')[0].replace('£', '').replace(',', '')
    price = int(price)
    prices.append(price)
Any idea as to why I can't collect the prices from all the pages using this script?
Extra question: is there a way to access the price using soup, i.e. without doing any list/string manipulation? When I call data.find('div', {'class': 'css-1e28vvi-PriceContainer e2uk8e7'}) I get a tag of the following form: <div class="css-1e28vvi-PriceContainer e2uk8e7" data-testid="listing-price"><p class="css-1o565rw-Text eczcs4p0" size="6">£3,012 pcm</p></div>
Any help would be much appreciated!
You can append the &pn=<page number> parameter to the URL to get the next pages:
import re
import requests
from bs4 import BeautifulSoup as soup

url = "https://www.zoopla.co.uk/to-rent/property/central-london/?beds_max=5&price_frequency=per_month&q=Central%20London&results_sort=newest_listings&search_source=home&pn="

prices = []
for page in range(1, 3):  # <-- increase number of pages here
    data = soup(requests.get(url + str(page)).content, "lxml")
    for line in data.findAll("div", {"class": "css-1e28vvi-PriceContainer e2uk8e7"}):
        price = line.get_text(strip=True)
        price = int(re.sub(r"[^\d]", "", price))
        prices.append(price)
        print(price)

print("-" * 80)
print(len(prices))
Prints:
...
1993
1993
--------------------------------------------------------------------------------
50
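As for the extra question: line.get_text(strip=True) used above is exactly the "no string manipulation" accessor you were after. And if you don't want to hard-code the page count, a sketch like the following would stop by itself, assuming an out-of-range page simply renders no price containers (an assumption worth verifying against the site):

import re
import requests
from bs4 import BeautifulSoup as soup

url = "https://www.zoopla.co.uk/to-rent/property/central-london/?beds_max=5&price_frequency=per_month&q=Central%20London&results_sort=newest_listings&search_source=home&pn="

prices = []
page = 1
while True:
    data = soup(requests.get(url + str(page)).content, "lxml")
    containers = data.find_all("div", {"class": "css-1e28vvi-PriceContainer e2uk8e7"})
    if not containers:  # assumption: a page past the last one has no listings
        break
    for line in containers:
        prices.append(int(re.sub(r"[^\d]", "", line.get_text(strip=True))))
    page += 1

print(len(prices))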

Scraping website with BS4 // accessing class

I am trying to extract different pieces of information from websites with BeautifulSoup, such as the title of the product and the price.
I do that for different URLs, looping through them with for ... in .... Here I'll just provide a snippet without the loop.
from bs4 import BeautifulSoup
import requests
import csv

url = 'https://www.mediamarkt.ch/fr/product/_lg-oled65gx6la-1991479.html'
html_content = requests.get(url).text
soup = BeautifulSoup(html_content, "lxml")

price = soup.find('meta', property="product:price:amount")
title = soup.find("div", {"class": "flix-model-name"})
title2 = soup.find('div', class_="flix-model-name")
title3 = soup.find("div", attrs={"class": "flix-model-name"})

print(price['content'])
print(title)
print(title2)
print(title3)
So from this URL https://www.mediamarkt.ch/fr/product/_lg-oled65gx6la-1991479.html I want to extract the product number. The only place I can find it is in the div class="flix-model-name". However, I am totally unable to reach it. I tried different ways to access it with title, title2 and title3, but the output is always None.
I am a bit of a beginner, so I guess I am probably missing something basic... If so, please pardon me for that.
Any help is welcome! Many thanks in advance!
Just for info: for each URL I thought of appending the data and writing it to a CSV file like this:
for url in urls:
    html_content = requests.get(url).text
    soup = BeautifulSoup(html_content, "lxml")
    row = []
    try:
        # title = YOUR VERY WELCOMED ANSWER
        prices = soup.find('meta', property="product:price:amount")
        row = (title.text + ',' + prices['content'] + '\n')
        data.append(row)
    except:
        pass

file = open('database.csv', 'w')
i = 0
while i < (len(data)):
    file.write(data[i])
    i += 1
file.close()
Many thanks in advance for your help!
David
Try the below approach using python requests: it is simple, straightforward, reliable, fast, and requires less code. I fetched the API URL from the website itself after inspecting the network section of the Google Chrome browser.
What exactly the below script is doing:
First it takes the API URL, builds the URL from 2 dynamic parameters (product and category), and then makes a GET request to fetch the data.
After getting the data, the script parses the JSON using json.loads.
Finally, it iterates over the list of products one by one and prints the details, which are divided into the 2 categories 'box1_ProductToProduct' and 'box2_KategorieTopseller': Brand, Name, Product number and Unit price. In the same way you can add more details by looking into the API call.
import json
import requests
from urllib3.exceptions import InsecureRequestWarning
requests.packages.urllib3.disable_warnings(InsecureRequestWarning)

def scrap_product_details():
    PRODUCT = 'MMCH1991479'  # Product number
    CATEGORY = '680942'  # Category number
    URL = 'https://www.mediamarkt.ch/rde_server/res/MMCH/recomm/product_detail/sid/WACXyEbIf3khlu6FcHlh1B1?product=' + PRODUCT + '&category=' + CATEGORY  # dynamic URL
    response = requests.get(URL, verify=False)  # GET request to fetch the data
    result = json.loads(response.text)  # Parse JSON data using json.loads
    box1_ProductToProduct = result[0]['box1_ProductToProduct']  # Extracted data from API
    box2_KategorieTopseller = result[1]['box2_KategorieTopseller']

    for item in box1_ProductToProduct:  # loop over extracted data
        print('-' * 100)
        print('Brand : ', item['brand'])
        print('Name : ', item['name'])
        print('Net Unit Price : ', item['netUnitPrice'])
        print('Product Number : ', item['product_nr'])
        print('-' * 100)
    for item in box2_KategorieTopseller:  # loop over extracted data
        print('-' * 100)
        print('Brand : ', item['brand'])
        print('Name : ', item['name'])
        print('Net Unit Price : ', item['netUnitPrice'])
        print('Product Number : ', item['product_nr'])
        print('-' * 100)

scrap_product_details()
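Since the question also mentioned writing the results to CSV, a minimal sketch using the csv module could replace the manual file.write loop. The field names are assumed from the API keys used above:

import csv

def write_products_csv(products, path='database.csv'):
    # products: a list of dicts shaped like the items in the API boxes above
    with open(path, 'w', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(['brand', 'name', 'netUnitPrice', 'product_nr'])  # header row
        for item in products:
            writer.writerow([item['brand'], item['name'],
                             item['netUnitPrice'], item['product_nr']])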

Extracting string data from an HTML source

from bs4 import BeautifulSoup
import urllib.request

page = urllib.request.urlopen('https://www.applied.com/categories/bearings/accessories/adapter-sleeves/c/1580?q=%3Arelevance&page=1')
html = page.read()
soup = BeautifulSoup(html, "html.parser")

items = soup.find_all(class_='product product--list ')
for i in items[0:1]:
    product_name = i.find(class_="product__name").a.string.strip()
    print(product_name)
    product_url = i.find(class_="product__name").a['href']
    print(product_url)
    price = i.find(itemprop="price").string
    print(price)
Using the above code I tried to get the price for each product on that page.
But when I run it, the output for the price variable is None.
When I inspect the HTML source for the price in a browser, the price shows up as normal text, just like what I got for the product_name variable.
Can someone guide me on how to get the price for the products on that page?
The price is loaded by Ajax (https://www.applied.com/getprices) after the page has loaded, which is why it is not in the HTML.
Use https://www.applied.com/getprices to get the price of an item.
You have to send a POST request with the following params to get the price of the product (a sketch follows the params below):
{
    "productCodes": "100731658",
    "page": "PLP",
    "productCode": "100731658",
    "CSRFToken": "172c7073-742f-4d7d-9c97-358e0d9e631e"
}
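A minimal sketch of that request with requests might look like this. The CSRF token above is session-specific, so it would first have to be read from the listing page; how the token is embedded in the page, and the format of the response, are assumptions to verify in the browser's network tab:

import requests

session = requests.Session()

# load the listing page first so the session picks up cookies and a CSRF token
listing = session.get("https://www.applied.com/categories/bearings/accessories/"
                      "adapter-sleeves/c/1580?q=%3Arelevance&page=1")

# hypothetical step: extract the CSRFToken from listing.text
# (the real page may embed it in a hidden input or inline script)

payload = {
    "productCodes": "100731658",
    "page": "PLP",
    "productCode": "100731658",
    "CSRFToken": "<token from the page>",  # session-specific, do not hard-code
}
response = session.post("https://www.applied.com/getprices", data=payload)
print(response.text)  # assumption: the price data comes back in the response body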

Findall to div tag using beautiful soup yields blank return

<div class="columns small-5 medium-4 cell header">Ref No.</div>
<div class="columns small-7 medium-8 cell">110B60329</div>
Website is https://www.saa.gov.uk/search/?SEARCHED=1&ST=&SEARCH_TERM=city+of+edinburgh%2C+BOSWALL+PARKWAY%2C+EDINBURGH&ASSESSOR_ID=&SEARCH_TABLE=valuation_roll_cpsplit&DISPLAY_COUNT=10&TYPE_FLAG=CP&ORDER_BY=PROPERTY_ADDRESS&H_ORDER_BY=SET+DESC&DRILL_SEARCH_TERM=BOSWALL+PARKWAY%2C+EDINBURGH&DD_TOWN=EDINBURGH&DD_STREET=BOSWALL+PARKWAY&UARN=110B60329&PPRN=000000000001745&ASSESSOR_IDX=10&DISPLAY_MODE=FULL#results
I would like to run a loop and return '110B60329'. I have run Beautiful Soup and done a find_all('div'); I then define the 2 different tags as head and data based on their class. I then iterate through the 'head' tags, hoping that would return the info in the div tag I have defined as data.
Python returns a blank (the cmd prompt just reprinted the file path).
Would anyone kindly know how I might fix this? My full code is below, thanks.
import requests
from bs4 import BeautifulSoup as soup
import csv

url = 'https://www.saa.gov.uk/search/?SEARCHED=1&ST=&SEARCH_TERM=city+of+edinburgh%2C+BOSWALL+PARKWAY%2C+EDINBURGH&ASSESSOR_ID=&SEARCH_TABLE=valuation_roll_cpsplit&DISPLAY_COUNT=10&TYPE_FLAG=CP&ORDER_BY=PROPERTY_ADDRESS&H_ORDER_BY=SET+DESC&DRILL_SEARCH_TERM=BOSWALL+PARKWAY%2C+EDINBURGH&DD_TOWN=EDINBURGH&DD_STREET=BOSWALL+PARKWAY&UARN=110B60329&PPRN=000000000001745&ASSESSOR_IDX=10&DISPLAY_MODE=FULL#results'
baseurl = 'https://www.saa.gov.uk'

session = requests.session()
response = session.get(url)

# content of search page in soup
html = soup(response.content, "lxml")
properties_col = html.find_all('div')
for col in properties_col:
    ref = 'n/a'
    des = 'n/a'
    head = col.find_all("div", {"class": "columns small-5 medium-4 cell header"})
    data = col.find_all("div", {"class": "columns small-7 medium-8 cell"})
    for i, elem in enumerate(head):
        if head[i].text == "Ref No.":
            ref = data[i].text
            print(ref)
You can do this in two ways.
1) If you are sure that the website you are scraping won't change its content, you can find all divs with that class and get the content by index.
2) Find all left-side divs (the titles) and, if one of them matches what you want, get the next sibling to get the text.
Example:
import requests
from bs4 import BeautifulSoup as soup

url = 'https://www.saa.gov.uk/search/?SEARCHED=1&ST=&SEARCH_TERM=city+of+edinburgh%2C+BOSWALL+PARKWAY%2C+EDINBURGH&ASSESSOR_ID=&SEARCH_TABLE=valuation_roll_cpsplit&DISPLAY_COUNT=10&TYPE_FLAG=CP&ORDER_BY=PROPERTY_ADDRESS&H_ORDER_BY=SET+DESC&DRILL_SEARCH_TERM=BOSWALL+PARKWAY%2C+EDINBURGH&DD_TOWN=EDINBURGH&DD_STREET=BOSWALL+PARKWAY&UARN=110B60329&PPRN=000000000001745&ASSESSOR_IDX=10&DISPLAY_MODE=FULL#results'
baseurl = 'https://www.saa.gov.uk'

session = requests.session()
response = session.get(url)

# content of search page in soup
html = soup(response.content, "lxml")

# Method 1
LeftBlockData = html.find_all("div", class_="columns small-7 medium-8 cell")
Reference = LeftBlockData[0].get_text().strip()
Description = LeftBlockData[2].get_text().strip()
print(Reference)
print(Description)

# Method 2
for column in html.find_all("div", class_="columns small-5 medium-4 cell header"):
    RightColumn = column.next_sibling.next_sibling.get_text().strip()
    if "Ref No." in column.get_text().strip():
        print(RightColumn)
    if "Description" in column.get_text().strip():
        print(RightColumn)
The prints will output (in order):
110B60329
STORE
110B60329
STORE
Your problem is that you are trying to match node text that contains a lot of tabs and newlines against a plain string.
For example, your head[i].text variable contains something like '\n\t\tRef No.\n\t', so comparing it with 'Ref No.' gives a false result. Stripping it solves this, as shown below.
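That is, the minimal fix to the original loop is just the stripped comparison (a sketch of the changed lines only):

for i, elem in enumerate(head):
    # strip the surrounding tabs/newlines before comparing
    if head[i].text.strip() == "Ref No.":
        ref = data[i].text.strip()
        print(ref)

Alternatively, get_text() with a separator avoids the comparison entirely: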
import requests
from bs4 import BeautifulSoup

r = requests.get("https://www.saa.gov.uk/search/?SEARCHED=1&ST=&SEARCH_TERM=city+of+edinburgh%2C+BOSWALL+PARKWAY%2C+EDINBURGH&ASSESSOR_ID=&SEARCH_TABLE=valuation_roll_cpsplit&DISPLAY_COUNT=10&TYPE_FLAG=CP&ORDER_BY=PROPERTY_ADDRESS&H_ORDER_BY=SET+DESC&DRILL_SEARCH_TERM=BOSWALL+PARKWAY%2C+EDINBURGH&DD_TOWN=EDINBURGH&DD_STREET=BOSWALL+PARKWAY&UARN=110B60329&PPRN=000000000001745&ASSESSOR_IDX=10&DISPLAY_MODE=FULL#results")
soup = BeautifulSoup(r.text, 'lxml')

for row in soup.find_all(class_='table-row'):
    print(row.get_text(strip=True, separator='|').split('|'))
out:
['Ref No.', '110B60329']
['Office', 'LOTHIAN VJB']
['Description', 'STORE']
['Property Address', '29 BOSWALL PARKWAY', 'EDINBURGH', 'EH5 2BR']
['Proprietor', 'SCOTTISH MIDLAND CO-OP SOCIETY LTD.']
['Tenant', 'PROPRIETOR']
['Occupier']
['Net Annual Value', '£1,750']
['Marker']
['Rateable Value', '£1,750']
['Effective Date', '01-APR-10']
['Other Appeal', 'NO']
['Reval Appeal', 'NO']
get_text() is a very powerful tool: you can strip the whitespace and put a separator between the text fragments.
You can use this method to get clean data and filter it.
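A tiny self-contained demo of those two keyword arguments (the HTML snippet is made up to mirror the site's row structure):

from bs4 import BeautifulSoup

# made-up snippet mirroring the site's table rows
html = '<div class="table-row"><div>\n\t\tRef No.\n\t</div><div>110B60329</div></div>'
row = BeautifulSoup(html, 'html.parser').find(class_='table-row')

print(repr(row.get_text()))                     # '\n\t\tRef No.\n\t110B60329'
print(row.get_text(strip=True, separator='|'))  # Ref No.|110B60329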
