BeautifulSoup: href link is undefined - Python

I want to scrape a website, but whenever I reach an anchor tag the link is "job/undefined". I used a POST request with post data to fetch the page:
from bs4 import BeautifulSoup
import requests

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36"}
postData = {
    'search': 'search',
    'facets[camp_type]': 'day_camp',
    'open[choices-made-content]': 'true'}
url = 'https://www.trustme.work/en'
html_1 = requests.post(url, headers=headers, data=postData)
soup1 = BeautifulSoup(html_1.text, 'lxml')
a = soup1.select('div.MuiGrid-root.MuiGrid-grid-xs-12')  # chained classes need dots, not a space
b = soup1.select('span[class="MuiTypography-root MuiTypography-h2"]')
print('soup:', b)
Sample from the output:
<span class="MuiTypography-root MuiTypography-h2" style="cursor:pointer">
<a href="job/undefined" style="color:#413E52;text-decoration:none">
Network and Security engineer
</a>
</span>

EDIT
Part of the content is served dynamically: the href attributes are filled in client-side by JavaScript, which requests can't execute, so the raw HTML still shows "job/undefined". You have to fetch each job's hashid via the API and then build the link yourself, or use the data from the JSON response:
import requests

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36"}
url = 'https://api.trustme.work/api/job_offers?include=technologies%2Cjob%2Ccompany%2Ccontract_type%2Clevel'
jobs = requests.get(url, headers=headers).json()['included']['jobs']
links = ['https://www.trustme.work/job/' + v['hashid'] for v in jobs.values()]
print(links)
To get the link from each job post, change your CSS selector to select your elements more specifically; also, prefer static identifiers or HTML structure over classes:
.select('h2 a')
To get a list of all links, use a list comprehension:
['https://www.trustme.work' + a.get('href') for a in soup1.select('h2 a')]
Example
from bs4 import BeautifulSoup
import requests

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36"}
postData = {
    'search': 'search',
    'facets[camp_type]': 'day_camp',
    'open[choices-made-content]': 'true'}
url = 'https://www.trustme.work/en'
html_1 = requests.post(url, headers=headers, data=postData)
soup1 = BeautifulSoup(html_1.text, 'lxml')
links = ['https://www.trustme.work' + a.get('href') for a in soup1.select('h2 a')]
print(links)

Related

requests.get not returning anything

I am trying to retrieve a URL using requests.get:
import requests
from bs4 import BeautifulSoup

baseurl = "https://www.olx.com.eg/"
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.63 Safari/537.36'
}
r = requests.get('https://www.olx.com.eg/jobs/')
soup = BeautifulSoup(r.content, 'lxml')
product_list = soup.findAll('div', class_='ads__item')
print(product_list)
but it returns an empty list, as if it does not even open the URL.
What is the issue here?
Add the headers= parameter to requests.get:
import requests
from bs4 import BeautifulSoup

baseurl = "https://www.olx.com.eg/"
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.63 Safari/537.36"
}
r = requests.get("https://www.olx.com.eg/jobs/", headers=headers)
soup = BeautifulSoup(r.content, "lxml")
product_list = soup.findAll("div", class_="ads__item")
print(len(product_list))
Prints:
45
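From there you can drill into each result element; a minimal sketch, assuming every ads__item wraps its ad in at least one anchor tag:
for item in product_list:
    a = item.find("a")  # first anchor inside the ad card, if any
    if a and a.get("href"):
        print(a["href"])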

Want to know how to crawl TripAdvisor

I am trying to get all of the URLs of restaurants in Singapore, but my code is not working:
data = requests.get("https://www.tripadvisor.com.sg/Restaurants-g294265-Singapore.html").text
soup = BeautifulSoup(data, "html.parser")
for link in soup.find_all('a', {'property_title'}):
    print('https://www.tripadvisor.com/Restaurant_Review-g294265-' + link.get('href'))
    print(link.string)
It keeps loading forever at the line soup = BeautifulSoup(data, "html.parser"). I don't know why this happens even though this works well for other sites.
Is this because TripAdvisor blocks crawling, or is my code wrong?
It keeps loading forever
To get a response, add the user-agent header:
headers = {
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
}
data = requests.get(
    "https://www.tripadvisor.com.sg/Restaurants-g294265-Singapore.html", headers=headers
).text
But the data is loaded dynamically, and requests doesn't execute JavaScript. However, the data is available in JSON format embedded in the page (it's not clear exactly what you want to scrape). To get all the data, you can use the json and re modules:
import json
import re
...
data = requests.get(
    "https://www.tripadvisor.com.sg/Restaurants-g294265-Singapore.html", headers=headers
).text
# extract the object assigned to window.__WEB_CONTEXT__ and parse it into a dict
raw = re.search(r"window\.__WEB_CONTEXT__=({.*});", data, flags=re.MULTILINE).group(1)
json_data = json.loads(raw)  # if this fails on bare JS keys, quote them first, e.g. raw.replace("pageManifest", '"pageManifest"')
print(json.dumps(json_data, indent=4))  # pretty-print all the data
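The layout of __WEB_CONTEXT__ isn't documented anywhere, so it's worth inspecting the top-level keys before hard-coding a path to the part you need; a minimal sketch (the "pageManifest" key is an assumption, check the real output first):
print(list(json_data.keys()))
page_manifest = json_data.get("pageManifest", {})  # hypothetical key; drill down with .get() so missing keys don't raise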
To get all the links:
import re
import requests

headers = {
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
}
data = requests.get(
    "https://www.tripadvisor.com.sg/Restaurants-g294265-Singapore.html", headers=headers
).text

for link in re.findall(r'"detailPageUrl":"(.*?)"', data):
    print("https://www.tripadvisor.com.sg/" + link)
Output (truncated):
https://www.tripadvisor.com.sg//Restaurant_Review-g294265-d1145149-Reviews-Grand_Shanghai_Restaurant-Singapore.html
https://www.tripadvisor.com.sg//Restaurant_Review-g294265-d1193730-Reviews-Entre_Nous_creperie-Singapore.html
https://www.tripadvisor.com.sg//Restaurant_Review-g294265-d1173583-Reviews-The_Courtyard-Singapore.html
https://www.tripadvisor.com.sg//Restaurant_Review-g294265-d4611806-Reviews-NOX_Dine_in_the_Dark-Singapore.html
https://www.tripadvisor.com.sg//Restaurant_Review-g294265-d13152787-Reviews-Positano_Risto-Singapore.html

How do I get the URLs for all the pages?

I have code to collect all of the URLs from the "oddsportal" website for a page:
from bs4 import BeautifulSoup
import requests

headers = {
    'User-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36'}
source = requests.get("https://www.oddsportal.com/soccer/africa/africa-cup-of-nations/results/", headers=headers)
soup = BeautifulSoup(source.text, 'html.parser')
main_div = soup.find("div", class_="main-menu2 main-menu-gray")
a_tag = main_div.find_all("a")
for i in a_tag:
    print(i['href'])
which returns these results:
/soccer/africa/africa-cup-of-nations/results/
/soccer/africa/africa-cup-of-nations-2019/results/
/soccer/africa/africa-cup-of-nations-2017/results/
/soccer/africa/africa-cup-of-nations-2015/results/
/soccer/africa/africa-cup-of-nations-2013/results/
/soccer/africa/africa-cup-of-nations-2012/results/
/soccer/africa/africa-cup-of-nations-2010/results/
/soccer/africa/africa-cup-of-nations-2008/results/
I would like the URLs to be returned as:
https://www.oddsportal.com/soccer/africa/africa-cup-of-nations/results/
https://www.oddsportal.com/soccer/africa/africa-cup-of-nations/results/#/page/2/
https://www.oddsportal.com/soccer/africa/africa-cup-of-nations/results/#/page/3/
for all the parent URLs generated for results.
I can see from inspect element that the page URLs can be appended, under the div with id="pagination".
The data under id="pagination" is loaded dynamically, so requests won't see it.
However, you can get the table for all those pages (1-3) by sending a GET request to:
https://fb.oddsportal.com/ajax-sport-country-tournament-archive/1/MN8PaiBs/X0/1/0/{page}/?_={timestamp}
where {page} corresponds to the page number (1-3) and {timestamp} is the current Unix timestamp.
You'll also need to add:
"Referer": "https://www.oddsportal.com/"
to your headers.
Also, use the lxml parser instead of html.parser to avoid a RecursionError.
import re
import requests
from datetime import datetime
from bs4 import BeautifulSoup

headers = {
    "User-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36",
    "Referer": "https://www.oddsportal.com/",
}

with requests.Session() as session:
    session.headers.update(headers)
    for page in range(1, 4):
        response = session.get(
            f"https://fb.oddsportal.com/ajax-sport-country-tournament-archive/1/MN8PaiBs/X0/1/0/{page}/?_={datetime.now().timestamp()}"
        )
        # the endpoint answers with {"html":"..."}; pull the html payload out
        table_data = re.search(r'{"html":"(.*)"}', response.text).group(1)
        soup = BeautifulSoup(table_data, "lxml")
        print(soup.prettify())
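Note that table_data extracted this way is still a JSON-escaped string (quotes and slashes are backslash-escaped). If the endpoint returns plain JSON, parsing the whole response first avoids that; a minimal sketch to use inside the loop above:
import json

payload = json.loads(response.text)  # {"html": "..."} -> dict, with the escaping undone
soup = BeautifulSoup(payload["html"], "lxml")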

Python Webscraping Vue Components

I am trying to extract the value marked in the picture into a variable, but it seems that when it is inside Vue components, bs4 does not search the way I expect. Can anyone point me in the general direction as to how I can extract the value from this document in Python?
Code is found below the picture; thanks in advance.
import requests
from bs4 import BeautifulSoup
URL = 'https://api.tracker.gg/api/v2/rocket-league/standard/profile/steam/76561198060134880'
headers = {"User-Agent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.83 Safari/537.36'}
page = requests.get(URL, headers = headers)
soup = BeautifulSoup(page.content, 'html.parser')
#print(soup.prettify())
div_list = soup.findAll({"class":'value'})
print(div_list)
Since the page returns a JSON response, you don't need BeautifulSoup to parse it:
import requests
import json

URL = 'https://api.tracker.gg/api/v2/rocket-league/standard/profile/steam/76561198060134880'
headers = {"User-Agent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.83 Safari/537.36'}
response = requests.get(URL, headers=headers)
dict_of_response = json.loads(response.text)  # or simply response.json()
obj_list = dict_of_response['data']['segments']
print(obj_list)
The obj_list variable now contains a list of dicts. Those dicts hold the data you want; now you only need to loop through the list and do what you want with it.
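For example, a minimal sketch of such a loop (the 'metadata' and 'stats' key names are an assumption here; inspect the JSON to confirm them):
for segment in obj_list:
    # each segment is a dict; .get() avoids a KeyError if a field is missing
    name = segment.get('metadata', {}).get('name')
    stats = segment.get('stats', {})
    print(name, stats)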

soup.find does not return anything

I would like to pull some information from an Amazon page. I've written these few basic lines, but they are not working.
import requests
from bs4 import BeautifulSoup
url = 'https://www.amazon.com/Cooler-Master-SickleFlow-120-Radiators/dp/B0046U6DWO/ref=sr_1_3?keywords=green+case+fan&qid=1578069342&sr=8-3'
headers = {"User-Agent":'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.88 Safari/537.36'}
page = requests.get(url, headers=headers)
soup = BeautifulSoup(page.content, 'html.parser')
price = soup.find(id='priceblock_ourprice')
print(price)
Your code is all right, but html.parser parses this page's content badly. Use html5lib or lxml instead:
import requests
from bs4 import BeautifulSoup
url = 'https://www.amazon.com/Cooler-Master-SickleFlow-120-Radiators/dp/B0046U6DWO/ref=sr_1_3?keywords=green+case+fan&qid=1578069342&sr=8-3'
headers = {"User-Agent":'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.88 Safari/537.36'}
page = requests.get(url, headers=headers)
soup = BeautifulSoup(page.content, 'lxml') # <-- use `html5lib` or `lxml`
price = soup.find(id='priceblock_ourprice')
print(price)
Prints:
<span class="a-size-medium a-color-price priceBlockBuyingPriceString" id="priceblock_ourprice">$10.50</span>
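To get just the price text out of that element, use get_text(); guard against None in case the id is missing from a variant of the page:
if price:
    print(price.get_text(strip=True))  # $10.50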
