BeautifulSoup find_all not finding all containers of a class - python

I'm trying to write a script to scrape some data off a Zillow page (https://www.zillow.com/homes/for_rent/38.358484,-77.27869,38.218627,-77.498417_rect/X1-SSc26hnbsm6l2u1000000000_8e08t_sse/). Obviously I'm just trying to gather data from every listing. However I cannot grab the data from every listing as it only finds 9 instances of the class I'm searching for ('list-card-addr') even though I've checked the html from listings it does not find and the class exists. Anyone have any ideas for why this is? Here's my simple code
from bs4 import BeautifulSoup
import requests
req_headers = {
'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
'accept-encoding': 'gzip, deflate, br',
'accept-language': 'en-US,en;q=0.8',
'upgrade-insecure-requests': '1',
'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'
}
url="https://www.zillow.com/homes/for_rent/38.358484,-77.27869,38.218627,-77.498417_rect/X1-SSc26hnbsm6l2u1000000000_8e08t_sse/"
response = requests.get(url, headers=req_headers)
data = response.text
soup = BeautifulSoup(data,'html.parser')
address = soup.find_all(class_='list-card-addr')
print(len(address))

Data is stored within a comment. You can regex it out easily as string defining JavaScript object you can handle with json
import requests, re, json
r = requests.get('https://www.zillow.com/homes/for_rent/?searchQueryState=%7B%22pagination%22%3A%7B%7D%2C%22savedSearchEnrollmentId%22%3A%22X1-SSc26hnbsm6l2u1000000000_8e08t%22%2C%22mapBounds%22%3A%7B%22west%22%3A-77.65840518457031%2C%22east%22%3A-77.11870181542969%2C%22south%22%3A38.13250414385234%2C%22north%22%3A38.444339281260426%7D%2C%22isMapVisible%22%3Afalse%2C%22filterState%22%3A%7B%22sort%22%3A%7B%22value%22%3A%22mostrecentchange%22%7D%2C%22fsba%22%3A%7B%22value%22%3Afalse%7D%2C%22fsbo%22%3A%7B%22value%22%3Afalse%7D%2C%22nc%22%3A%7B%22value%22%3Afalse%7D%2C%22fore%22%3A%7B%22value%22%3Afalse%7D%2C%22cmsn%22%3A%7B%22value%22%3Afalse%7D%2C%22auc%22%3A%7B%22value%22%3Afalse%7D%2C%22fr%22%3A%7B%22value%22%3Atrue%7D%2C%22ah%22%3A%7B%22value%22%3Atrue%7D%7D%2C%22isListVisible%22%3Atrue%2C%22mapZoom%22%3A11%7D',
headers = {'User-Agent':'Mozilla/5.0'})
data = json.loads(re.search(r'!--(\{"queryState".*?)-->', r.text).group(1))
print(data['cat1'])
print(data['cat1']['searchList'].keys())
Within this are details on pagination and the next url, if applicable, to get all results. You have only asked for page 1 here.
For example, print addresses
for i in data['cat1']['searchResults']['listResults']:
print(i['address'])

Related

Data Scraping of the Numbers and details from the website

I want to scrape the contact numbers from the website with the respective details of the Courier Services. I am not able to scrape the Contact numbers and other details like name address and rating from all the Courier services. I analyzed the data is in the script tag. Please suggest a fix for this
import requests
import pandas as pd
import json
import csv
from lxml import html
import re
headers ={'authority': 'www.justdial.com',
'accept' : 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9 ',
'accept-encoding': 'gzip, deflate, br',
'accept-language':'en-US,en;q=0.9',
'user-agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.0.0 Safari/537.36" }
produrl = 'https://www.justdial.com/Mumbai/Courier-Services-in-Mumbai-Bazar-Nalasopara-East/nct-10142628'
prodresp = requests.get(produrl, headers=headers, timeout=30)
prodResphtml = html.fromstring(prodresp.text)
partjson = prodResphtml.xpath('/html/head/script[9]/text()')
print(partjson)
That's data are coming with ajax api call from there;
https://www.justdial.com/api/india_api_write/20march2020/searchziva.php?city=Mumbai&area=Mumbai-Bazar-Nalasopara-East&lat=&long=&darea_flg=0&case=spcall&stype=category_list&search=Courier-Services&national_catid=10142628&nextdocid=&attribute_values=&basedon=&sortby=&nearme=0&max=100&pg_no=1

Python: Scraping Amazon webpage with bs4, BeautifulSoup

I'm trying to read specific information(name, price, etc. ...) from an Amazon webpage.
For that I'm using "BeautifulSoup" & "requests" as suggested in most tutorials. My code can load the page and find the item I'm looking for but fails to actually get it. I checked the webpage the item definetly exists.
Here is my code:
#import time
import requests
#import urllib.request
from bs4 import BeautifulSoup
URL = ('https://www.amazon.de/dp/B008JCUXNK/?coliid=I9G2T92PZXG06&colid=3ESRXLK53S0NY&psc=1&ref_=lv_ov_lig_dp_it')
# user agent = browser information (get via google search "my user agent")
headers = {"User-Agent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:80.0) Gecko/20100101 Firefox/80.0'}
page = requests.get(URL, headers=headers)# webpage
soup = BeautifulSoup(page.content, 'html.parser')# webpage as html
title = soup.find(id="productTitle")
print(title)
title is always "NONE" so calling get_Text will cause an error.
Can anybody tell me what's wrong?
Found a way to get past the captcha.
The request needs to contain a better header.
Example:
import datetime
import requests
KEY = "YOUR_KEY_HERE"
date = datetime.datetime.now().strftime("%Y%m%d")
BASE_REQUEST = ('https://www.amazon.de/Philips-Haartrockner-ThermoProtect-Technologie-HP8230/dp/B00BCQIIMS?pf_rd_r=T1T8Z7QTQTGYM8F7KRN5&pf_rd_p=c832d309-197e-4c59-8cad-735a8deab917&pd_rd_r=20c6ed33-d548-47d7-a262-c53afe32df96&pd_rd_w=63hR3&pd_rd_wg=TYwZH&ref_=pd_gw_crs_zg_bs_84230031')
headers = {
'dnt': '1',
'upgrade-insecure-requests': '1',
'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.61 Safari/537.36',
'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
'sec-fetch-site': 'same-origin',
'sec-fetch-mode': 'navigate',
'sec-fetch-user': '?1',
'sec-fetch-dest': 'document',
'referer': 'https://www.amazon.com/',
'accept-language': 'en-GB,en-US;q=0.9,en;q=0.8',
}
payload = {
"api-key": KEY,
"begin_date": date,
"end_date": date,
"q": "Donald Trump"
}
r = requests.get(BASE_REQUEST, headers=headers)
print(r.status_code)
if r.status_code == 200:
print('success')
For information on status codes just google html status codes.
Hope this helps anyone with similar problems
Cheers!
Your code is 100% correct, but I've tried your code and checked value of page.content. It contains captcha. Looks like Amazon don't want you to scrape their site.
You can read about your case here: https://www.reddit.com/r/learnpython/comments/bf21fn/how_to_prevent_captcha_while_scraping_amazon/.
But I also recommend to read Amazon's Terms And Conditions https://www.amazon.com/gp/help/customer/display.html/ref=hp_551434_conditions to be sure if you can legally scrape it.

Workaround for blocked GET requests in Python

I'm trying to retrieve and process the results of a web search using requests and beautifulsoup.
I've written some simple code to do the job, and it returns successfully (status = 200), but the content of the request is just an error message "We're sorry for any inconvenience, but the site is currently unavailable.", and has been the same for the last several days. Searching within Firefox returns results without issue, however. I've run the code using a URL for the UK-based site and it works without issue so I wonder if the US site is set up to block attempts to scrape web searches.
Are there ways to mask the fact I'm attempting to retrieve search results from within Python (eg, masquerading as a standard search within Firefox) or some other work around to allow access to the search results?
Code included for reference below:
import pandas as pd
from requests import get
import bs4 as bs
import re
# works
# baseURL = 'https://www.autotrader.co.uk/car-search?sort=sponsored&radius=1500&postcode=ky119sb&onesearchad=Used&onesearchad=Nearly%20New&onesearchad=New&make=TOYOTA&model=VERSO&year-from=1990&year-to=2017&minimum-mileage=0&maximum-mileage=200000&body-type=MPV&fuel-type=Diesel&minimum-badge-engine-size=1.6&maximum-badge-engine-size=4.5&maximum-seats=8'
# doesn't work
baseURL = 'https://www.autotrader.com/cars-for-sale/Certified+Cars/cars+under+50000/Jeep/Grand+Cherokee/Seattle+WA-98101?extColorsSimple=BURGUNDY%2CRED%2CWHITE&maxMileage=45000&makeCodeList=JEEP&listingTypes=CERTIFIED%2CUSED&interiorColorsSimple=BEIGE%2CBROWN%2CBURGUNDY%2CTAN&searchRadius=0&modelCodeList=JEEPGRAND&trimCodeList=JEEPGRAND%7CSRT%2CJEEPGRAND%7CSRT8&zip=98101&maxPrice=50000&startYear=2015&marketExtension=true&sortBy=derivedpriceDESC&numRecords=25&firstRecord=0'
a = get(baseURL)
soup = bs.BeautifulSoup(a.content,'html.parser')
info = soup.find_all('div', class_ = 'information-container')
price = soup.find_all('div', class_ = 'vehicle-price')
d = []
for idx, i in enumerate(info):
ii = i.find_next('ul').find_all('li')
year_ = ii[0].text
miles = re.sub("[^0-9\.]", "", ii[2].text)
engine = ii[3].text
hp = re.sub("[^\d\.]", "", ii[4].text)
p = re.sub("[^\d\.]", "", price[idx].text)
d.append([year_, miles, engine, hp, p])
df = pd.DataFrame(d, columns=['year','miles','engine','hp','price'])
By default, Requests sends a unique user agent when making requests.
>>> r = requests.get('https://google.com')
>>> r.request.headers
{'User-Agent': 'python-requests/2.22.0', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive'}
It is possible that the website you are using is trying to avoid scrapers by denying any request with a user agent of python-requests.
To get around this, you can change your user agent when sending a request. Since it's working on your browser, simply copy your browser user agent (you can Google it, or record a request to a webpage and copy your user agent like that). For me, it's Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36 (what a mouthful), so I'd set my user agent like this:
>>> headers = {
... 'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36'
... }
and then send the request with the new headers (the new headers are added to the default headers, they don't replace them unless they have the same name):
>>> r = requests.get('https://google.com', headers=headers) # Using the custom headers we defined above
>>> r.request.headers
{'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive'}
Now we can see that the request was sent with our preferred headers, and hopefully the site won't be able to tell the difference between Requests and a browser.

How to scrape a hidden table with beautifulsoup

It's about scraping a hidden table with beautifulsoup.
As you can see in this website, there is a button "choisissez votre séance" and when we click on it a table will be shown.
When I click on inspect the table element i can see the tag that contains attributes like price. However, when I view the website's source code, I can't find this information.
There is something in the code of the table 'display : none' which I think affects this, but I can't find a solution.
It would appear the page is using AJAX and loading the data for pricing in the background. Using Chrome I pressed F12 and had a look under the network tab. When I clicked the "choisissez votre séance" button I noticed a POST to this address:
'https://www.ticketmaster.fr/fr/manifestation/holiday-on-ice-billet/idmanif/446304'
This is great news for you as you do not need to scrape the HTML data, you simply need to provide the ID (in page source) to the API.
In the below code I am
Requesting the initial page
Collecting the cookie
Posting the ID (data) and the cookie we collected
Returning the JSON data you require to further process (variable J)
Hope the below helps out!
Cheers,
Adam
import requests
from bs4 import BeautifulSoup
h = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36'}
s = requests.session()
initial_page_request = s.get('https://www.ticketmaster.fr/fr/manifestation/holiday-on-ice-billet/idmanif/446304',headers=h)
soup = BeautifulSoup(initial_page_request.text,'html.parser')
idseanc = soup.find("select",{"id":"sessionsSelect"})("option")[0]['value'].split("_")[1]
cookies = initial_page_request.cookies.get_dict()
headers = {
'Origin': 'https://www.ticketmaster.fr',
'Accept-Encoding': 'gzip, deflate, br',
'Accept-Language': 'en-GB,en-US;q=0.9,en;q=0.8',
'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36',
'Content-Type': 'application/json; charset=UTF-8',
'Accept': '*/*',
'Referer': 'https://www.ticketmaster.fr/fr/manifestation/holiday-on-ice-billet/idmanif/446304',
'X-Requested-With': 'XMLHttpRequest',
'Connection': 'keep-alive',
}
data = {'idseanc':str(idseanc)}
response = s.post('https://www.ticketmaster.fr/planPlacement/FindPrices/connected/false/idseance/2870471', headers=headers, cookies=cookies, data=data)
j = response.json()

Scraping data from website graph with Python

As an exercise, I am trying to scrape data from a dynamic graph using Python. The graph can be found at this link (let's say I want the data from the first one).
Now, I was thinking of doing something like:
src = 'https://marketchameleon.com/Overview/WFT/IV/#_ABSTRACT_RENDERER_ID_11'
import json
import urllib.request
with urllib.request.urlopen(src) as url:
data = url.read()
reply = json.loads(data)
However, I receive an error message on the last line of the code, saying:
JSONDecodeError: Expecting value
"data" is not empty, so I believe there is a problem with the format of the information within it. Does someone have an idea to solve this issue? Thanks!
I opened that link and see that the site loads data from another URL - https://marketchameleon.com/charts/histStockChartData?p=747&m=12&_=1534060722519
You can use json.loads() function twice and do some hacks with headers (urllib2.Request is your friend in case of Python 2) since server returns HTTP 500 when you don't imitate browser
src = 'https://marketchameleon.com/charts/histStockChartData?p=747&m=12'
import json
import urllib.request
user_agent = {
'Host': 'marketchameleon.com',
'Connection': 'keep-alive',
'Pragma': 'no-cache',
'Cache-Control': 'no-cache',
'Upgrade-Insecure-Requests': 1,
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36',
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
'Accept-Language': 'ru-RU,ru;q=0.9,en-US;q=0.8,en;q=0.7,kk;q=0.6'
}
request = urllib.request.Request(src, headers=user_agent)
data = urllib.request.urlopen(request).read()
print(data)
reply = json.loads(data)
table = json.loads(reply['GTable'])
print(table)

Categories