I am trying to get all of the url links of restaurants in Singapore but my code is not working
data = requests.get("https://www.tripadvisor.com.sg/Restaurants-g294265-Singapore.html").text
soup = BeautifulSoup(data, "html.parser")
for link in soup.find_all('a', {'property_title'}):
print('https://www.tripadvisor.com/Restaurant_Review-g294265-' + link.get('href'))
print(link.string)
It keeps on loading and loading again in the code soup = BeautifulSoup(data, "html.parser")
I don't know why this happens even though this works well for other sites.
Is this because trip advisor block crawling or code is wrong?
It keeps on loading and loading again
To get a response, add the user-agent header:
headers = {
"user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
}
data = requests.get(
"https://www.tripadvisor.com.sg/Restaurants-g294265-Singapore.html", headers=headers
).text
But the data is loaded dynamically, and requests doesn't support dynamically loaded pages. However, the is available in JSON format on the website, (It's not clear what you want to scrape). To get all the data you can use the json/re modules:
import json
...
data = requests.get(
"https://www.tripadvisor.com.sg/Restaurants-g294265-Singapore.html", headers=headers
).text
json_data = re.search(r"window\.__WEB_CONTEXT__=({.*});", data, flags=re.MULTILINE).group(1)
print(
# Prints all the data, you can use `json.loads` instead to access the data instead
json.dumps(json_data, indent=4)
)
To get all the links:
import re
import requests
headers = {
"user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
}
data = requests.get(
"https://www.tripadvisor.com.sg/Restaurants-g294265-Singapore.html", headers=headers
).text
for link in re.findall(r'"detailPageUrl":"(.*?)"', data):
print("https://www.tripadvisor.com.sg/" + link)
Output (truncated):
https://www.tripadvisor.com.sg//Restaurant_Review-g294265-d1145149-Reviews-Grand_Shanghai_Restaurant-Singapore.html
https://www.tripadvisor.com.sg//Restaurant_Review-g294265-d1193730-Reviews-Entre_Nous_creperie-Singapore.html
https://www.tripadvisor.com.sg//Restaurant_Review-g294265-d1173583-Reviews-The_Courtyard-Singapore.html
https://www.tripadvisor.com.sg//Restaurant_Review-g294265-d4611806-Reviews-NOX_Dine_in_the_Dark-Singapore.html
https://www.tripadvisor.com.sg//Restaurant_Review-g294265-d13152787-Reviews-Positano_Risto-Singapore.html
Related
I want to scrap a website, when I reach any tag the link is "job/undifined" , I used post request to fetch data from the page :
post request with postdata in this code :
from bs4 import BeautifulSoup
import requests
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36"}
postData = {
'search': 'search',
'facets[camp_type]':'day_camp',
'open[choices-made-content]': 'true'}
url = 'https://www.trustme.work/en'
html_1 = requests.post(url, headers=headers, data=postData)
soup1 = BeautifulSoup(html_1.text, 'lxml')
a = soup1.select('div.MuiGrid-root MuiGrid-grid-xs-12 ')
b = soup1.select('span[class="MuiTypography-root MuiTypography-h2"]')
print('soup:',b)
sample from the output :
<span class="MuiTypography-root MuiTypography-h2" style="cursor:pointer">
<a href="job/undefined" style="color:#413E52;text-decoration:none">
Network and Security engineer
</a>
</span>
EDIT
Part of content is served dynamically so, you have to fetch the jobs hashid via api and then create the link yourself or use the data from JSON response:
import requests
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36"}
url = 'https://api.trustme.work/api/job_offers?include=technologies%2Cjob%2Ccompany%2Ccontract_type%2Clevel'
jobs = requests.get(url, headers=headers).json()['included']['jobs']
['https://www.trustme.work/job/' + v['hashid'] for k,v in jobs.items()]
To get the links from each job post change your css selector to select your elements more specific, also try to use static identifiers or HTML structure over classes:
.select('h2 a')
To get a list of all links use a list comprehension:
['https://www.trustme.work' + a.get('href') for a in soup1.select('h2 a')]
Example
from bs4 import BeautifulSoup
import requests
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36"}
postData = {
'search': 'search',
'facets[camp_type]':'day_camp',
'open[choices-made-content]': 'true'}
url = 'https://www.trustme.work/en'
html_1 = requests.post(url, headers=headers, data=postData)
soup1 = BeautifulSoup(html_1.text, 'lxml')
['https://www.trustme.work' + a.get('href') for a in soup1.select('h2 a')]
I would to scrape the last odds in archive from this page https://www.betexplorer.com/soccer/estonia/esiliiga/elva-flora-tallinn/Q9KlbwaJ/ but I can't get it with requests. How can I get it without interact with Selenium?
To trigger the archive odds page in the Developer Tools I need to hover on the odd.
Code
url = "https://www.betexplorer.com/archive-odds/4l4ubxv464x0xc78lr/14/"
headers = {
"Referer": "https://www.betexplorer.com",
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.107 Safari/537.36'
}
Json = requests.get(url, headers=headers).json()
As the site is being loaded by JavaScript, requests doesn't work. I have used selenium to load the page, extract the complete source code after everything is loaded.
Then used beautifulsoup to create a soup object to get required data.
From the source code you can see that the data-bid of the <tr> are what are being passed to get the odds data.
I extracted all the data-bid and passed them to the URL you've provided at the very end of your question one by one.
This code will get all the odds data in JSON format
import time
from bs4 import BeautifulSoup
import requests
from selenium import webdriver
base_url = 'https://www.betexplorer.com/soccer/estonia/esiliiga/elva-flora-tallinn/Q9KlbwaJ/'
driver = webdriver.Chrome()
driver.get(base_url)
time.sleep(5)
soup = BeautifulSoup(driver.page_source, 'html.parser')
t = soup.find('table', attrs= {'id': 'sortable-1'})
trs = t.find('tbody').findAll('tr')
for i in trs:
data_bid = i['data-bid']
url = f"https://www.betexplorer.com/archive-odds/4l4ubxv464x0xc78lr/{data_bid}/"
headers = {"Referer": "https://www.betexplorer.com",'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.107 Safari/537.36'}
Json = requests.get(url, headers=headers).json()
# Do what you wish to do withe JSON data here....
I have a code to collect all of the URLs from the "oddsportal" website for a page:
from bs4 import BeautifulSoup
import requests
headers = {
'User-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36'}
source = requests.get("https://www.oddsportal.com/soccer/africa/africa-cup-of-nations/results/",headers=headers)
soup = BeautifulSoup(source.text, 'html.parser')
main_div=soup.find("div",class_="main-menu2 main-menu-gray")
a_tag=main_div.find_all("a")
for i in a_tag:
print(i['href'])
which returns these results:
/soccer/africa/africa-cup-of-nations/results/
/soccer/africa/africa-cup-of-nations-2019/results/
/soccer/africa/africa-cup-of-nations-2017/results/
/soccer/africa/africa-cup-of-nations-2015/results/
/soccer/africa/africa-cup-of-nations-2013/results/
/soccer/africa/africa-cup-of-nations-2012/results/
/soccer/africa/africa-cup-of-nations-2010/results/
/soccer/africa/africa-cup-of-nations-2008/results/
I would like the URLs to be returned as:
https://www.oddsportal.com/soccer/africa/africa-cup-of-nations/results/
https://www.oddsportal.com/soccer/africa/africa-cup-of-nations/results/#/page/2/
https://www.oddsportal.com/soccer/africa/africa-cup-of-nations/results/#/page/3/
for all the parent urls generated for results.
I can see that the urls can be appended as seen from inspect element as below for div id = "pagination"
The data under id="pagination" is loaded dynamically, so requests won't support it.
However, you can get the table of all those pages (1-3) via sending a GET request to:
https://fb.oddsportal.com/ajax-sport-country-tournament-archive/1/MN8PaiBs/X0/1/0/{page}/?_={timestampe}"
where {page} is corresponding to the page number (1-3) and {timestampe} is the current time
You'll also need to add:
"Referer": "https://www.oddsportal.com/"
to your headers.
also, use the lxml parser instead of html.parser to avoid a RecursionError.
import re
import requests
from datetime import datetime
from bs4 import BeautifulSoup
headers = {
"User-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36",
"Referer": "https://www.oddsportal.com/",
}
with requests.Session() as session:
session.headers.update(headers)
for page in range(1, 4):
response = session.get(
f"https://fb.oddsportal.com/ajax-sport-country-tournament-archive/1/MN8PaiBs/X0/1/0/{page}/?_={datetime.now().timestamp()}"
)
table_data = re.search(r'{"html":"(.*)"}', response.text).group(1)
soup = BeautifulSoup(table_data, "lxml")
print(soup.prettify())
I am trying to get the value marked in the picture extracted to be a variable, but it seems that when it is within Vue Components, bs4 is not doing the searching like i am expecting. Can anyone point me in the general direction as to how i would be able to extract the value from this document in Python?
Code is found below picture, thanks in advance.
import requests
from bs4 import BeautifulSoup
URL = 'https://api.tracker.gg/api/v2/rocket-league/standard/profile/steam/76561198060134880'
headers = {"User-Agent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.83 Safari/537.36'}
page = requests.get(URL, headers = headers)
soup = BeautifulSoup(page.content, 'html.parser')
#print(soup.prettify())
div_list = soup.findAll({"class":'value'})
print(div_list)
Since the page is returning a json response you don't need beautifulsoup to parse it.
import requests
import json
URL = 'https://api.tracker.gg/api/v2/rocket-league/standard/profile/steam/76561198060134880'
headers = {"User-Agent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.83 Safari/537.36'}
response = requests.get(URL, headers = headers)
dict_of_response = json.loads(response.text)
obj_list = dict_of_response['data']['segments']
print(obj_list)
The obj_list variable now contains a list of dicts. Those dicts contain the data you want and now you only need to loop trough the list and do what you want with the data.
I am trying to scrape the SEC report page to pull some basic info on a number of tickers.
Here is an example URL for Apple - https://sec.report/CIK/0000320193
Within the page is a 'Company Details' table which includes basic info. I am essentially trying to just scrape the IRS Number, State of Incorp and Address.
I am cool with just scraping this chart and saving it into a PD Df. I am very new to web scraping so looking for some tips to make this work! Below is my code, but I don't know where to go once I extract the panel body. Thanks guys!
session = requests.Session()
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.76 Safari/537.36'}
page = requests.get('https://sec.report/CIK/0000051143.html', headers = headers)
page.content
from bs4 import BeautifulSoup
soup = BeautifulSoup(page.content, 'html.parser')
soup.find_all(class_='panel-body')
Instead of BeautifoulSoup try with lxml package, for me it's easier to find elements with xpath sentences:
import requests
from lxml import html
session = requests.Session()
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.76 Safari/537.36'}
page = requests.get('https://sec.report/CIK/0000051143', headers=headers)
raw_html = html.fromstring(page.text)
irs = raw_html.xpath('//tr[./td[contains(text(),"IRS Number")]]/td[2]/text()')[0]
state_incorp = raw_html.xpath('//tr[./td[contains(text(),"State of Incorporation")]]/td[2]/text()')
address = raw_html.xpath('//tr[./td[contains(text(),"Business Address")]]/td[2]/text()')[0]