I am trying to retrieve a URL using requests.get:
import requests
from bs4 import BeautifulSoup
baseurl = "https://www.olx.com.eg/"
headers = {
'User-Agent' : 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.63 Safari/537.36'
}
r = requests.get('https://www.olx.com.eg/jobs/')
soup = BeautifulSoup(r.content, 'lxml')
product_list = soup.findAll('div',class_ = 'ads__item')
print(product_list)
but it returns an empty list because it does not even open the URL.
What is the issue here?
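Add the headers= parameter to requests.get. Your code defines headers but never passes it, so the request goes out with the default python-requests User-Agent: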
import requests
from bs4 import BeautifulSoup
baseurl = "https://www.olx.com.eg/"
headers = {
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.63 Safari/537.36"
}
r = requests.get("https://www.olx.com.eg/jobs/", headers=headers)
soup = BeautifulSoup(r.content, "lxml")
product_list = soup.findAll("div", class_="ads__item")
print(len(product_list))
Prints:
45
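To confirm which User-Agent is actually sent, one option is to prepare the request through a session without sending it. This is only a minimal sketch: `user_agent_sent` is a hypothetical helper name, and nothing here contacts the site.

```python
import requests

URL = "https://www.olx.com.eg/jobs/"
BROWSER_UA = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/93.0.4577.63 Safari/537.36"
)


def user_agent_sent(headers=None):
    """Return the User-Agent header requests would send (no network call)."""
    with requests.Session() as session:
        # prepare_request merges the session's default headers with ours,
        # the same step requests.get performs before sending.
        prepared = session.prepare_request(
            requests.Request("GET", URL, headers=headers)
        )
    return prepared.headers["User-Agent"]


print(user_agent_sent())                            # default python-requests/<version> string
print(user_agent_sent({"User-Agent": BROWSER_UA}))  # the browser string
```

Without the headers argument the site sees a python-requests User-Agent, which many sites reject or answer with different markup.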
I want to scrape a website, but the link inside every tag I reach is "job/undefined". I used a POST request with post data to fetch the page, as in this code:
from bs4 import BeautifulSoup
import requests
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36"}
postData = {
'search': 'search',
'facets[camp_type]':'day_camp',
'open[choices-made-content]': 'true'}
url = 'https://www.trustme.work/en'
html_1 = requests.post(url, headers=headers, data=postData)
soup1 = BeautifulSoup(html_1.text, 'lxml')
a = soup1.select('div.MuiGrid-root MuiGrid-grid-xs-12 ')
b = soup1.select('span[class="MuiTypography-root MuiTypography-h2"]')
print('soup:',b)
Sample from the output:
<span class="MuiTypography-root MuiTypography-h2" style="cursor:pointer">
<a href="job/undefined" style="color:#413E52;text-decoration:none">
Network and Security engineer
</a>
</span>
EDIT
Part of the content is served dynamically, so you have to fetch each job's hashid via the API and then build the link yourself, or use the data from the JSON response directly:
import requests
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36"}
url = 'https://api.trustme.work/api/job_offers?include=technologies%2Cjob%2Ccompany%2Ccontract_type%2Clevel'
jobs = requests.get(url, headers=headers).json()['included']['jobs']
['https://www.trustme.work/job/' + v['hashid'] for k,v in jobs.items()]
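The list comprehension above can be tried offline as follows. The `sample` payload is made up to mirror the shape that code reads (an `included` key holding `jobs`, each with a `hashid`); the real API response may contain other fields, and `job_links` is a hypothetical helper name.

```python
BASE_JOB_URL = "https://www.trustme.work/job/"


def job_links(payload):
    """Build one job URL per entry in payload['included']['jobs']."""
    jobs = payload["included"]["jobs"]
    return [BASE_JOB_URL + job["hashid"] for job in jobs.values()]


# Made-up data, only to show the expected shape.
sample = {
    "included": {
        "jobs": {
            "1": {"hashid": "abc123"},
            "2": {"hashid": "def456"},
        }
    }
}

print(job_links(sample))
```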
To get the links from each job post, make your CSS selector more specific. Also prefer static identifiers or the HTML structure over class names:
.select('h2 a')
To get a list of all links use a list comprehension:
['https://www.trustme.work' + a.get('href') for a in soup1.select('h2 a')]
Example
from bs4 import BeautifulSoup
import requests
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36"}
postData = {
'search': 'search',
'facets[camp_type]':'day_camp',
'open[choices-made-content]': 'true'}
url = 'https://www.trustme.work/en'
html_1 = requests.post(url, headers=headers, data=postData)
soup1 = BeautifulSoup(html_1.text, 'lxml')
['https://www.trustme.work' + a.get('href') for a in soup1.select('h2 a')]
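One caveat with plain string concatenation: the sample output in the question shows an href without a leading slash (`job/undefined`), so `'https://www.trustme.work' + href` would produce a malformed URL. A small sketch using the standard library's `urllib.parse.urljoin`, which handles both forms (the `absolute_links` helper and the sample hrefs are illustrative only):

```python
from urllib.parse import urljoin

BASE_URL = "https://www.trustme.work/"


def absolute_links(hrefs, base=BASE_URL):
    """Resolve relative hrefs against the site root."""
    return [urljoin(base, href) for href in hrefs]


# Illustrative hrefs, with and without a leading slash.
print(absolute_links(["job/abc123", "/job/def456"]))
```

With the soup from the example, the hrefs could be passed in as `[a.get('href') for a in soup1.select('h2 a')]`.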
Hey everyone, I am trying to scrape this website, but for some reason it's not scraping. I'd really appreciate it if someone could give me a hand with this problem. I have tried different user agents, but it's still not working. For the page content it prints b'', and the soup is empty.
Thanks in advance. Here's my code:
import requests
from bs4 import BeautifulSoup
url = "https://www.carrefourjordan.com/mafjor/en/c/deals?currentPage=1&filter=&nextPageOffset=0&pageSize=60&sortBy=relevance"
headers = {'User-Agent':'test'}
page = requests.get(url,headers=headers)
print(page.content)
soup = BeautifulSoup(page.content, "html.parser")
print(soup)
These are the 3 different headers I used:
```
headers = {'User-Agent':'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.143 Safari/537.36'}
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
headers = {'User-Agent':'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.76 Safari/537.36', "Upgrade-Insecure-Requests": "1","DNT": "1","Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8","Accept-Language": "en-US,en;q=0.5","Accept-Encoding": "gzip, deflate"}
```
You need to get the right cookies first, so you'll need to use a session:
import requests
from bs4 import BeautifulSoup
headers={'User-Agent': 'Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 6.2; Trident/5.1)'}
url = "https://www.carrefourjordan.com/mafjor/en/c/deals?currentPage=1&filter=&nextPageOffset=0&pageSize=60&sortBy=relevance"
with requests.session() as s:
    s.headers.update(headers)
    # get the cookies first
    s.get("https://www.carrefourjordan.com")
    page = s.get(url)
    soup = BeautifulSoup(page.text, "html.parser")
    print(soup)
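The same idea can be wrapped in a small helper that also fails loudly when the body comes back empty, as it did in the question. This is a sketch, not a guaranteed fix for this site: `fetch_after_warmup` is a hypothetical name, and it accepts any object with a `get` method that behaves like `requests.Session`.

```python
def fetch_after_warmup(session, home_url, target_url):
    """Visit home_url first so the session stores its cookies, then fetch target_url."""
    session.get(home_url)  # response ignored; only the cookies it sets matter
    response = session.get(target_url)
    response.raise_for_status()  # surface 4xx/5xx instead of parsing an error page
    if not response.text:
        raise ValueError(f"empty body returned for {target_url}")
    return response.text
```

Inside the `with requests.session() as s:` block it could be called as `fetch_after_warmup(s, "https://www.carrefourjordan.com", url)`.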
I have code that collects all of the URLs from the "oddsportal" website for a page:
from bs4 import BeautifulSoup
import requests
headers = {
'User-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36'}
source = requests.get("https://www.oddsportal.com/soccer/africa/africa-cup-of-nations/results/",headers=headers)
soup = BeautifulSoup(source.text, 'html.parser')
main_div=soup.find("div",class_="main-menu2 main-menu-gray")
a_tag=main_div.find_all("a")
for i in a_tag:
    print(i['href'])
which returns these results:
/soccer/africa/africa-cup-of-nations/results/
/soccer/africa/africa-cup-of-nations-2019/results/
/soccer/africa/africa-cup-of-nations-2017/results/
/soccer/africa/africa-cup-of-nations-2015/results/
/soccer/africa/africa-cup-of-nations-2013/results/
/soccer/africa/africa-cup-of-nations-2012/results/
/soccer/africa/africa-cup-of-nations-2010/results/
/soccer/africa/africa-cup-of-nations-2008/results/
I would like the URLs to be returned as:
https://www.oddsportal.com/soccer/africa/africa-cup-of-nations/results/
https://www.oddsportal.com/soccer/africa/africa-cup-of-nations/results/#/page/2/
https://www.oddsportal.com/soccer/africa/africa-cup-of-nations/results/#/page/3/
and likewise for every parent URL generated for the results.
Using Inspect Element, I can see that the paginated URLs appear under the div with id="pagination".
The data under id="pagination" is loaded dynamically, so it is not part of the HTML that requests receives.
However, you can get the table for each of those pages (1-3) by sending a GET request to:
https://fb.oddsportal.com/ajax-sport-country-tournament-archive/1/MN8PaiBs/X0/1/0/{page}/?_={timestamp}
where {page} is the page number (1-3) and {timestamp} is the current time.
You'll also need to add:
"Referer": "https://www.oddsportal.com/"
to your headers.
Also, use the lxml parser instead of html.parser to avoid a RecursionError.
import re
import requests
from datetime import datetime
from bs4 import BeautifulSoup
headers = {
"User-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36",
"Referer": "https://www.oddsportal.com/",
}
with requests.Session() as session:
    session.headers.update(headers)
    for page in range(1, 4):
        response = session.get(
            f"https://fb.oddsportal.com/ajax-sport-country-tournament-archive/1/MN8PaiBs/X0/1/0/{page}/?_={datetime.now().timestamp()}"
        )
        table_data = re.search(r'{"html":"(.*)"}', response.text).group(1)
        soup = BeautifulSoup(table_data, "lxml")
        print(soup.prettify())
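One detail worth noting: the regex captures the `html` value while it is still JSON-escaped (for example `\"` and `\/`). Below is a sketch of decoding it with `json.loads` before parsing. The `sample` string is invented to resemble an escaped payload and is not a real oddsportal response, and `extract_html` is a hypothetical helper.

```python
import json
import re

HTML_FIELD = re.compile(r'\{"html":"(.*)"\}')


def extract_html(response_text):
    """Return the unescaped HTML from a {"html":"..."} fragment, or None."""
    match = HTML_FIELD.search(response_text)
    if match is None:
        return None
    # Re-wrap the captured text in quotes so json.loads decodes the escapes.
    return json.loads('"' + match.group(1) + '"')


# Invented sample, shaped like an escaped payload.
sample = r'callback({"d":{"html":"<table class=\"t\"><tr><td>1<\/td><\/tr><\/table>"},"refresh":20});'
print(extract_html(sample))
```

Returning None instead of calling `.group(1)` on a failed match also avoids an AttributeError when the response has a different shape.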
I was following an online tutorial at the following webpage, https://www.youtube.com/watch?v=nCuPv3tf2Hg&list=PLRzwgpycm-Fio7EyivRKOBN4D3tfQ_rpu&index=1. I have no idea what I am doing wrong. I have tried the code in both Visual Studio and Jupyter notebooks to no avail.
Code:
import requests
from bs4 import BeautifulSoup as bs
bURL = 'https://www.thewhiskyexchange.com/c/540/taiwanese-whisky'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36'}
r = requests.get('https://www.thewhiskyexchange.com/c/540/taiwanese-whisky')
soup = bs(r.content, 'lxml')
productlist = soup.find_all('div', class_='item')
productlinks = []
for item in productlist:
    for link in item.find_all('a', href=True):
        print(link['href'])
The structure of that website has changed since the video was posted.
I've fixed your code below:
import requests
from bs4 import BeautifulSoup as bs
bURL = 'https://www.thewhiskyexchange.com/c/540/taiwanese-whisky'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36'}
r = requests.get(bURL, headers=headers)
soup = bs(r.text, 'html.parser')
for x in soup.find_all('li', {'class':'product-grid__item'}):
    link = x.find('a')
    print(x.text, 'https://www.thewhiskyexchange.com'+link['href'])
I would like to pull some information from an Amazon page. I've written these few basic lines, but they are not working.
import requests
from bs4 import BeautifulSoup
url = 'https://www.amazon.com/Cooler-Master-SickleFlow-120-Radiators/dp/B0046U6DWO/ref=sr_1_3?keywords=green+case+fan&qid=1578069342&sr=8-3'
headers = {"User-Agent":'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.88 Safari/537.36'}
page = requests.get(url, headers=headers)
soup = BeautifulSoup(page.content, 'html.parser')
price = soup.find(id='priceblock_ourprice')
print(price)
Your code is all right, but html.parser does not parse this page's content well. Use html5lib or lxml instead:
import requests
from bs4 import BeautifulSoup
url = 'https://www.amazon.com/Cooler-Master-SickleFlow-120-Radiators/dp/B0046U6DWO/ref=sr_1_3?keywords=green+case+fan&qid=1578069342&sr=8-3'
headers = {"User-Agent":'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.88 Safari/537.36'}
page = requests.get(url, headers=headers)
soup = BeautifulSoup(page.content, 'lxml') # <-- use `html5lib` or `lxml`
price = soup.find(id='priceblock_ourprice')
print(price)
Prints:
<span class="a-size-medium a-color-price priceBlockBuyingPriceString" id="priceblock_ourprice">$10.50</span>
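If it is unclear which parsers are installed, a fallback chain is one option. A minimal sketch, assuming `bs4` is installed: `make_soup` is a hypothetical helper, and the sample markup is a one-line stand-in for the Amazon page. Different parsers can build different trees from invalid markup, so results may vary between them.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, parsers=("lxml", "html5lib", "html.parser")):
    """Parse markup with the first parser in `parsers` that is installed."""
    for parser in parsers:
        try:
            return BeautifulSoup(markup, parser)
        except FeatureNotFound:
            continue  # parser library missing; try the next one
    raise RuntimeError("none of the requested parsers is available")


soup = make_soup('<span id="priceblock_ourprice">$10.50</span>')
price = soup.find(id="priceblock_ourprice")
# find() returns None when the element is missing, so guard before get_text()
print(price.get_text() if price is not None else "price not found")
```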