How to scrape a hidden table with beautifulsoup

How to scrape a hidden table with beautifulsoup - python

It's about scraping a hidden table with beautifulsoup.
As you can see in this website, there is a button "choisissez votre séance" and when we click on it a table will be shown.
When I click on inspect the table element i can see the tag that contains attributes like price. However, when I view the website's source code, I can't find this information.
There is something in the code of the table 'display : none' which I think affects this, but I can't find a solution.

It would appear the page is using AJAX and loading the data for pricing in the background. Using Chrome I pressed F12 and had a look under the network tab. When I clicked the "choisissez votre séance" button I noticed a POST to this address:
'https://www.ticketmaster.fr/fr/manifestation/holiday-on-ice-billet/idmanif/446304'
This is great news for you as you do not need to scrape the HTML data, you simply need to provide the ID (in page source) to the API.
In the below code I am
Requesting the initial page
Collecting the cookie
Posting the ID (data) and the cookie we collected
Returning the JSON data you require to further process (variable J)
Hope the below helps out!
Cheers,
Adam
import requests
from bs4 import BeautifulSoup
h = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36'}
s = requests.session()
initial_page_request = s.get('https://www.ticketmaster.fr/fr/manifestation/holiday-on-ice-billet/idmanif/446304',headers=h)
soup = BeautifulSoup(initial_page_request.text,'html.parser')
idseanc = soup.find("select",{"id":"sessionsSelect"})("option")[0]['value'].split("_")[1]
cookies = initial_page_request.cookies.get_dict()
headers = {
'Origin': 'https://www.ticketmaster.fr',
'Accept-Encoding': 'gzip, deflate, br',
'Accept-Language': 'en-GB,en-US;q=0.9,en;q=0.8',
'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36',
'Content-Type': 'application/json; charset=UTF-8',
'Accept': '*/*',
'Referer': 'https://www.ticketmaster.fr/fr/manifestation/holiday-on-ice-billet/idmanif/446304',
'X-Requested-With': 'XMLHttpRequest',
'Connection': 'keep-alive',
}
data = {'idseanc':str(idseanc)}
response = s.post('https://www.ticketmaster.fr/planPlacement/FindPrices/connected/false/idseance/2870471', headers=headers, cookies=cookies, data=data)
j = response.json()

Related

Issue about getting a 404 error page when scraping a website

I wanted to get Fear and greed index from 'https://alternative.me/crypto/ and I found a url where the XHR data is by using Chrome network tool but when trying to scrape the website, I got 404 or 500 error page.
url = 'https://alternative.me/api/crypto/fear-and-greed-index/history'
headers = {
'user-agent': ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.212 Safari/537.36'),
'Referer': ('https://alternative.me/crypto/fear-and-greed-index/'),
'Accept-Language': ('ko-KR,ko;q=0.9,en-US;q=0.8,en;q=0.7'),
'accept-encoding': ('gzip, deflate, br'),
'accept': ('application/json, text/plain, */*')
}
res = requests.post(url, headers=headers)

WEB SCRAPING - python requests session not able to gather data

I've seen some similar threads but neither gave me the answer. I simply need to get html content from one website. I'm sending the POST request with data for particular case and then using GET requests I want to scrape the text from html. The problem is that I always receive the first page's content. Not sure what I am doing wrong.
import requests
headers = {
'Accept':'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3',
'Accept-Encoding':'gzip, deflate, br',
'Accept-Language':'pl-PL,pl;q=0.9,en-US;q=0.8,en;q=0.7',
'Connection':'keep-alive',
'Content-Type':'application/x-www-form-urlencoded',
'Origin':'https://przegladarka-ekw.ms.gov.pl',
'Referer':'https://przegladarka-ekw.ms.gov.pl/eukw_prz/KsiegiWieczyste/wyszukiwanieKW',
'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36',
}
data = {
'kodWydzialu':'PT1R',
'nrKw':'00037314',
'cyfraK':'9',
}
url = 'https://przegladarka-ekw.ms.gov.pl/eukw_prz/KsiegiWieczyste/wyszukiwanieKW'
r = requests.session()
r.post(url, data=data, headers=headers)
final_content = r.get(url, headers=headers)
print(final_content.text)
The GET requests come from ("https://przegladarka-ekw.ms.gov.pl/eukw_prz/eukw201906070952/js/jquery-1.11.0_min.js
") but it returns a wall of code. My goal is to scrape the page which appears after providing the data from above to search menu.

try this
import json
import urllib.request
headers = {
'Accept':'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3',
'Accept-Encoding':'gzip, deflate, br',
'Accept-Language':'pl-PL,pl;q=0.9,en-US;q=0.8,en;q=0.7',
'Connection':'keep-alive',
'Content-Type':'application/x-www-form-urlencoded',
'Origin':'https://przegladarka-ekw.ms.gov.pl',
'Referer':'https://przegladarka-ekw.ms.gov.pl/eukw_prz/KsiegiWieczyste/wyszukiwanieKW',
'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36',
}
data = {
'kodWydzialu':'PT1R',
'nrKw':'00037314',
'cyfraK':'9',
}
url = 'https://przegladarka-ekw.ms.gov.pl/eukw_prz/KsiegiWieczyste/wyszukiwanieKW'
r=urllib.request.urlopen(url, data=bytes(json.dumps(data), encoding="utf-8"))
final_content = r
for i in r:
print(i)

Workaround for blocked GET requests in Python

I'm trying to retrieve and process the results of a web search using requests and beautifulsoup.
I've written some simple code to do the job, and it returns successfully (status = 200), but the content of the request is just an error message "We're sorry for any inconvenience, but the site is currently unavailable.", and has been the same for the last several days. Searching within Firefox returns results without issue, however. I've run the code using a URL for the UK-based site and it works without issue so I wonder if the US site is set up to block attempts to scrape web searches.
Are there ways to mask the fact I'm attempting to retrieve search results from within Python (eg, masquerading as a standard search within Firefox) or some other work around to allow access to the search results?
Code included for reference below:
import pandas as pd
from requests import get
import bs4 as bs
import re
# works
# baseURL = 'https://www.autotrader.co.uk/car-search?sort=sponsored&radius=1500&postcode=ky119sb&onesearchad=Used&onesearchad=Nearly%20New&onesearchad=New&make=TOYOTA&model=VERSO&year-from=1990&year-to=2017&minimum-mileage=0&maximum-mileage=200000&body-type=MPV&fuel-type=Diesel&minimum-badge-engine-size=1.6&maximum-badge-engine-size=4.5&maximum-seats=8'
# doesn't work
baseURL = 'https://www.autotrader.com/cars-for-sale/Certified+Cars/cars+under+50000/Jeep/Grand+Cherokee/Seattle+WA-98101?extColorsSimple=BURGUNDY%2CRED%2CWHITE&maxMileage=45000&makeCodeList=JEEP&listingTypes=CERTIFIED%2CUSED&interiorColorsSimple=BEIGE%2CBROWN%2CBURGUNDY%2CTAN&searchRadius=0&modelCodeList=JEEPGRAND&trimCodeList=JEEPGRAND%7CSRT%2CJEEPGRAND%7CSRT8&zip=98101&maxPrice=50000&startYear=2015&marketExtension=true&sortBy=derivedpriceDESC&numRecords=25&firstRecord=0'
a = get(baseURL)
soup = bs.BeautifulSoup(a.content,'html.parser')
info = soup.find_all('div', class_ = 'information-container')
price = soup.find_all('div', class_ = 'vehicle-price')
d = []
for idx, i in enumerate(info):
ii = i.find_next('ul').find_all('li')
year_ = ii[0].text
miles = re.sub("[^0-9\.]", "", ii[2].text)
engine = ii[3].text
hp = re.sub("[^\d\.]", "", ii[4].text)
p = re.sub("[^\d\.]", "", price[idx].text)
d.append([year_, miles, engine, hp, p])
df = pd.DataFrame(d, columns=['year','miles','engine','hp','price'])

By default, Requests sends a unique user agent when making requests.
>>> r = requests.get('https://google.com')
>>> r.request.headers
{'User-Agent': 'python-requests/2.22.0', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive'}
It is possible that the website you are using is trying to avoid scrapers by denying any request with a user agent of python-requests.
To get around this, you can change your user agent when sending a request. Since it's working on your browser, simply copy your browser user agent (you can Google it, or record a request to a webpage and copy your user agent like that). For me, it's Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36 (what a mouthful), so I'd set my user agent like this:
>>> headers = {
... 'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36'
... }
and then send the request with the new headers (the new headers are added to the default headers, they don't replace them unless they have the same name):
>>> r = requests.get('https://google.com', headers=headers) # Using the custom headers we defined above
>>> r.request.headers
{'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive'}
Now we can see that the request was sent with our preferred headers, and hopefully the site won't be able to tell the difference between Requests and a browser.

How to obtain a JSON response from the stats.nba.com API?

I'm just trying to simply use a Python get request to access JSON data from stats.nba.com. It seems pretty straight-forward as I can enter the URL into your browser and get the results I'm looking for. However, whenever I run this the program just runs to no end. I'm wondering if I have to include some type of headers information in my get request.
The code is below:
import requests
url = 'http://stats.nba.com/stats/commonteamroster?LeagueID=00&Season=2017-18&TeamID=1610612756'
response=requests.get(url)
print response.text

I have tried to visit the url you given, you can add header to your request to avoid this problem (the minimum information you need to provide is User-Agent, I think you can use more header information as you can):
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'}
response = requests.get(url, headers=headers)
The stats.nba.com website need your 'User-Agent' header information.
You can get your request header information from Network tab in the browser.
Take chrome as example, when you press F12, and visit url you given, you can find the relative request information, the most useful information is request headers.

You need to use headers. Try copying from your browser's network tab. Here's what worked for me:
request_headers = {
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
'Accept-Encoding': 'gzip, deflate',
'Accept-Language': 'en-US,en;q=0.8',
'Connection': 'keep-alive',
'Host': 'stats.nba.com',
'Upgrade-Insecure-Requests': '1',
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'
}
And here's the modified get:
response = requests.get(url, headers = request_headers)

Retrieve the Request Headers from a page

Is there a possible way to retrieve the content of request header from a page such as this:
Accept:*/*
Accept-Encoding:gzip, deflate, br
Accept-Language:en-US,en;q=0.8,id;q=0.6
Connection:keep-alive
Cookie:SPC_IA=-1; SPC_EC=-; SPC_F=oTIBQWRUjdH7sWSqJu1UBBA1o3zy5j1C; REC_T_ID=9af9f032-8d77-11e7-b124-1866da5681e2; SPC_T_ID="VkC0m8L3ZwixZk8y836Lhq4XucdTWJQtxOT+CCjn+u7HhYZ0zEcK/BI8L3dT2/em76AgwKj3p9ysfh7yUnOrq9CqS4lRPFaqLTpEuecgX8U="; SPC_U=-; SPC_T_IV="PA3yjLFENXXf8Tzq685zSg=="; csrftoken=SiNmh7GZo00aZ3a0gxIqEaNjB38zhCQI; bannerShown=true; django_language=id; sessionid=15fgkr8ohrult2zkmgu2xyiwwnm4ejcx; SPC_SC_TK=; UYOMAPJWEMDGJ=; SPC_SC_UD=; SPC_SI=i94582s7ffe99b47y3qomp1siqy4adz5
Host:shopee.co.id
If-None-Match:"75c23fc0e3e55d18c21158ab8a335ab4;gzip"
If-None-Match-:55b03-c56af1c195a559f1680c15f63d56f07a
Referer:https://shopee.co.id/SHARP-LED-TV-24INCH-LC24LE175ITT_sby-Area-Surabaya-i.24413460.298360054
User-Agent:Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36
X-API-SOURCE:pc
X-Requested-With:XMLHttpRequest
I want to store the value of "If-None-Match-: xxxx" in my code without having to manually do it, so for every page I open using the for loop that I have in my code, every page's "If-None-Match-: xxxx" value is saved as a variable.
I tried using
r = requests.get(url)
r.headers
But it only prints out the HTTP response header.
I was wondering if there's a way.

For getting the header of client side request header you need to get re.request.headers
import requests
res = requests.request('GET', "https://www.google.com")
print res.request.headers
Output
{'Connection': 'keep-alive', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'User-Agent': 'python-requests/2.13.0'}

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

How to scrape a hidden table with beautifulsoup - python

Related

Issue about getting a 404 error page when scraping a website

WEB SCRAPING - python requests session not able to gather data

Workaround for blocked GET requests in Python

How to obtain a JSON response from the stats.nba.com API?

Retrieve the Request Headers from a page

Categories

Resources