Scraping with BS4 but HTML gets messed up when parsed - python

I'm having trouble scraping a website using BeautifulSoup4 and Python 3. I'm using dryscrape to get the HTML, since the page requires JavaScript to be enabled in order to be shown (though as far as I know it's never actually used on the page itself).
This is my code:
from bs4 import BeautifulSoup
import dryscrape
productUrl = "https://www.mercadona.es/detall_producte.php?id=32009"
session = dryscrape.Session()
session.visit(productUrl)
response = session.body()
soup = BeautifulSoup(response, "lxml")
container1 = soup.find("div","contenido").find("dl").find_all("dt")
container3 = soup.find("div","contenido").find_all("td")
Now I want to read container3 content, but:
type(container3)
Returns:
bs4.element.ResultSet
which is the same as type(container1), but its length is 0!
So I wanted to see what I was actually getting into container3 before searching for my <td> tags, so I wrote it to a file.
container3 = soup.find("div", "contenido")
with open("container3.html", "w") as soup_file:  # output file name is arbitrary
    soup_file.write(container3.prettify())
And, here is the link to that file: https://pastebin.com/xc22fefJ
It gets all messed up just before the table I want to scrape. I can't understand why; looking at the page source in Firefox, everything looks fine.

Here's a working solution using requests instead of dryscrape:
import re

import js2py
import requests
from bs4 import BeautifulSoup

url = 'https://www.mercadona.es/detall_producte.php?id=32009'
rh = {
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
'Accept-Encoding': 'gzip, deflate, br',
'Accept-Language': 'en-US,en;q=0.9',
'Connection': 'keep-alive',
'Host': 'www.mercadona.es',
'Upgrade-Insecure-Requests': '1',
'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36'
}
s = requests.session()
r = s.get(url, headers = rh)
The response to this gives you the "Please enable JavaScript to view the page content." message. However, it also contains the necessary hidden data that the browser would send using JavaScript, which can be seen in the network tab of the developer tools:
TS015fc057_id: 3
TS015fc057_cr: a57705c08e49ba7d51954bea1cc9bfce:jlnk:l8MH0eul:1700810263
TS015fc057_76: 0
TS015fc057_86: 0
TS015fc057_md: 1
TS015fc057_rf: 0
TS015fc057_ct: 0
TS015fc057_pd: 0
Of these, the second one (the long string) is generated by javascript. We can use a library like js2py to run the code, which will return the required string to be passed in the request.
soup = BeautifulSoup(r.content, 'lxml')
script = soup.find_all('script')[1].text
js_code = re.search(r'.*(function challenge.*crc;).*', script, re.DOTALL).groups()[0] + '} challenge();'
js_code = js_code.replace('document.forms[0].elements[1].value=', 'return ')
hidden_inputs = soup.find_all('input')
hidden_inputs[1]['value'] = js2py.eval_js(js_code)
fd = {i['name']: i['value'] for i in hidden_inputs}
rh = {
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
'Referer': 'https://www.mercadona.es/detall_producte.php?id=32009',
'Accept-Encoding': 'gzip, deflate, br',
'Accept-Language': 'en-US,en;q=0.9',
'Connection': 'keep-alive',
'Content-Length': '188',
'Content-Type': 'application/x-www-form-urlencoded',
'Cache-Control': 'max-age=0',
'Host': 'www.mercadona.es',
'Origin': 'https://www.mercadona.es',
'Upgrade-Insecure-Requests': '1',
'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36'
}
# NOTE: the next one is a POST request, as opposed to the GET request sent before
r = s.post(url, headers = rh, data = fd)
soup = BeautifulSoup(r.content, 'lxml')
And here's the result:
>>> len(soup.find('div', 'contenido').find_all('td'))
70
>>> len(soup.find('div', 'contenido').find('dl').find_all('dt'))
8
EDIT
Apparently, the javascript code needs to be run only once. The resulting data can be used for more than one request, like this:
for i in range(32007, 32011):
    r = s.post(url[:-5] + str(i), headers=rh, data=fd)
    soup = BeautifulSoup(r.content, 'lxml')
    print(soup.find_all('dd')[1].text)
Result:
Manzana y plátano 120 g
Manzana y plátano 720g (6x120) g
Fresa Plátano 120 g
Fresa Plátano 720g (6x120g)

Related

Python requests POST gives different response content than browser

I'm writing a Python script to scrape a table from this site (it contains public information about ocean tide levels).
One of the stations I'd like to scrape is Punta del Este, code 83.0, on any given day. But my script returns a different table than the browser, even though the POST request seems to have the same input.
When I fill in the form in my browser, I can see the headers and data sent to the server in the developer tools.
So I wrote my script to make the POST request as follows:
import requests
from bs4 import BeautifulSoup
from datetime import datetime, timedelta

url = 'https://www.ambiente.gub.uy/SIH-JSF/paginas/sdh/consultaHDMCApublic.xhtml'
s = requests.Session()
r = s.get(url, verify=False)
soupGet = BeautifulSoup(r.content, 'lxml')
#JSESSIONID = s.cookies['JSESSIONID']
javax_faces_ViewState = soupGet.find("input", {"type": "hidden", "name":"javax.faces.ViewState"})['value']
headersSih = {
'Accept': 'application/xml, text/xml, */*; q=0.01',
'Accept-Encoding': 'gzip, deflate, br',
'Accept-Language': 'es-ES,es;q=0.6',
'Connection': 'keep-alive',
'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
# 'Cookie': 'JSESSIONID=FBE5ZdMQVFrgQ-P6K_yTc1bw.dinaguasihappproduccion',
'Faces-Request': 'partial/ajax',
'Origin': 'https://www.ambiente.gub.uy',
'Referer': url,
'Sec-Fetch-Dest': 'empty',
'Sec-Fetch-Mode': 'cors',
'Sec-Fetch-Site': 'same-origin',
'Sec-GPC': '1',
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36',
'X-Requested-With': 'XMLHttpRequest',
}
fecha0 = datetime(2022, 12, 20)  # example query date; fecha0 is defined earlier in the full script
ini_date = datetime.strftime(fecha0, '%d/%m/%Y %H:%M')
end_date = datetime.strftime(fecha0 + timedelta(days=1), '%d/%m/%Y %H:%M')
codigo = 830
dataSih = {
'javax.faces.partial.ajax': 'true',
'javax.faces.source': 'formConsultaHorario:j_idt64',
'javax.faces.partial.execute': '#all',
'javax.faces.partial.render': 'formConsultaHorario:pnlhorarioConsulta',
'formConsultaHorario:j_idt64': 'formConsultaHorario:j_idt64',
'formConsultaHorario': 'formConsultaHorario',
'formConsultaHorario:estacion_focus': '',
'formConsultaHorario:estacion_input': codigo,
'formConsultaHorario:fechaDesde_input': ini_date,
'formConsultaHorario:fechaHasta_input': end_date,
'formConsultaHorario:variables_focus': '',
'formConsultaHorario:variables_input': '26', # Variable: H,Nivel
'formConsultaHorario:fcal_focus': '',
'formConsultaHorario:fcal_input': '7', # Tipo calculo: Ingresado
'formConsultaHorario:ptiempo_focus': '',
'formConsultaHorario:ptiempo_input': '2', #Paso de tiempo: Escala horaria
'javax.faces.ViewState': javax_faces_ViewState,
}
page = s.post(url, headers=headersSih, data=dataSih)
However, when I do it via the browser I get a table full of data, while the Python request returns (in page.text) a table saying "No data was found".
Is there something I'm missing? I've tried changing lots of stuff but nothing seems to do the trick.
Maybe on this website JavaScript loads the data; requests doesn't execute it. If you want to get data from there, use Selenium.
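For completeness, a minimal sketch of that Selenium approach (assuming selenium and a matching ChromeDriver are installed; the field id below is taken from the form data in the question and may need adjusting):
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get('https://www.ambiente.gub.uy/SIH-JSF/paginas/sdh/consultaHDMCApublic.xhtml')

# Fill in the station code the same way a user would; the page's own
# JavaScript then handles the JSF plumbing when the form is submitted.
driver.find_element(By.ID, 'formConsultaHorario:estacion_input').send_keys('830')
# ... fill the date fields and click the consult button the same way ...

soup = BeautifulSoup(driver.page_source, 'lxml')  # JavaScript has already run at this point
driver.quit()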

How do I Webscrape a website that uses iframes?

I am trying to scrape this website 'https://swimming.org.nz/results.html'. In the form that comes up, I am filling in only the Age column as 8 to 8. I am using the following code to scrape the table, as suggested elsewhere on Stack Overflow, but I am unable to get the table. How do I get all the tables for this age group, 8 to 8?
import requests
from bs4 import BeautifulSoup
s = requests.Session()
r = s.get("https://swimming.org.nz/results.html")
soup = BeautifulSoup(r.content, "html.parser")
iframe_src = soup.select_one("x-MS_FIELD_AGE.FROM.L").attrs["src"]
r = s.get(f"https:{iframe_src}")
soup = BeautifulSoup(r.content, "html.parser")
for row in soup.select("x-form-text x-form-field"):
    print("\t".join([e.text for e in row.select("th, td")]))
You will see that BeautifulSoup isn't necessary if you look at the developer tools in your browser. The request to send is below, and the response type is XML, so you don't need any scraping tool. You can get all of the data by changing StartRowIndex and MaximumRowCount.
import requests
url = "https://connect.swimming.org.nz/snz-wrap-public/pages/pageRequestHandler?tunnelTarget=tableData%2F%3F&data_file=MS.COMP.RESULTS&dict_file=MS.COMP.RESULTS&doGet=true"
payload="StartRowIndex=0&MaximumRowCount=100&sort=BY-DSND%20COMP.DATE%20BY-DSND%20STAGE&dir=ASC&tid=extTable1620108707767_4352538&selectCriteria=GET-LIST%20CMS_TABLE_19483_65507_076_184811_&extraColumns=%3CColumns%20DynamicLinkRoot%3D%22https%3A%2F%2Fconnect.swimming.org.nz%3A443%2Fsnz-wrap-public%2Fworkflows%2F%22%3E%3CColumn%3E%3CColumnName%3EExpander%3C%2FColumnName%3E%3CField%3EFRAGMENT_DISPLAY.SPLITS%3C%2FField%3E%3CShowInExpander%3Etrue%3C%2FShowInExpander%3E%3C%2FColumn%3E%3CColumn%3E%3CFieldExpression%3E%7BMEMBER.FORE1%7D%20%7BMEMBER.SURNAME%7D%3C%2FFieldExpression%3E%3CField%3EEXPRESSION_FIELD_1%3C%2FField%3E%3CColumnName%3EName%2520%3C%2FColumnName%3E%3CWidth%3E130%3C%2FWidth%3E%3C%2FColumn%3E%3CColumn%3E%3CField%3EXGENDER%3C%2FField%3E%3CColumnName%3EGender%3C%2FColumnName%3E%3CWidth%3E50%3C%2FWidth%3E%3C%2FColumn%3E%3CColumn%3E%3CField%3EENTRANT.AGE%3C%2FField%3E%3CColumnName%3EAge%3C%2FColumnName%3E%3CWidth%3E35%3C%2FWidth%3E%3C%2FColumn%3E%3CColumn%3E%3CFieldExpression%3E%7BXCATEGORY2%7D%3C%2FFieldExpression%3E%3CField%3ECATEGORY2.NUM%24%24SNZ%3C%2FField%3E%3CColumnName%3EDistance%3C%2FColumnName%3E%3CWidth%3E70%3C%2FWidth%3E%3C%2FColumn%3E%3CColumn%3E%3CField%3EXCATEGORY1%3C%2FField%3E%3CColumnName%3EStroke%3C%2FColumnName%3E%3CWidth%3E70%3C%2FWidth%3E%3C%2FColumn%3E%3CColumn%3E%3CFieldExpression%3E%7BTIME%24%24SNZ%7D%3C%2FFieldExpression%3E%3CField%3ERESULT.TIME.MILLISECONDS%3C%2FField%3E%3CColumnName%3ETime%2520%3C%2FColumnName%3E%3CWidth%3E70%3C%2FWidth%3E%3CAlign%3Eright%3C%2FAlign%3E%3C%2FColumn%3E%3CColumn%3E%3CField%3EFINA.POINTS%24%24SNZ%3C%2FField%3E%3CColumnName%3EFINA%2520Points%3C%2FColumnName%3E%3CWidth%3E85%3C%2FWidth%3E%3CAlign%3Eright%3C%2FAlign%3E%3C%2FColumn%3E%3CColumn%3E%3CField%3EFINA.YEAR%24%24SNZ%3C%2FField%3E%3CColumnName%3EPoints%2520Year%3C%2FColumnName%3E%3CWidth%3E80%3C%2FWidth%3E%3CAlign%3Eright%3C%2FAlign%3E%3C%2FColumn%3E%3CColumn%3E%3CField%3E%24DATE%24COMP.DATE%3C%2FField%3E%3CColumnName%3EDate%3C%2FColumnName%3E%3CWidth%3E70%3C%2FWidth%3E%3CAlign%3Eright%3C%2FAlign%3E%3C%2FColumn%3E%3CColumn%3E%3CField%3EXEVENT.CODE%3C%2FField%3E%3CColumnName%3EMeet%3C%2FColumnName%3E%3CWidth%3E190%3C%2FWidth%3E%3C%2FColumn%3E%3CColumn%3E%3CField%3EPARAMETER1%3C%2FField%3E%3CColumnName%3ECourse%3C%2FColumnName%3E%3CWidth%3E50%3C%2FWidth%3E%3C%2FColumn%3E%3C%2FColumns%3E&extraColumnsDownload=%3CDownloadColumns%20DynamicLinkRoot%3D%22https%3A%2F%2Fconnect.swimming.org.nz%3A443%2Fsnz-wrap-public%2Fworkflows%2F%22%3E%3CColumn%3E%3CField%3EXGENDER%3C%2FField%3E%3CColumnName%3EGender%3C%2FColumnName%3E%3C%2FColumn%3E%3CColumn%3E%3CField%3EENTRANT.AGE%3C%2FField%3E%3CColumnName%3EAge%3C%2FColumnName%3E%3C%2FColumn%3E%3CColumn%3E%3CField%3EXCATEGORY2%3C%2FField%3E%3CColumnName%3EDistance%3C%2FColumnName%3E%3C%2FColumn%3E%3CColumn%3E%3CField%3EXCATEGORY1%3C%2FField%3E%3CColumnName%3EStroke%3C%2FColumnName%3E%3C%2FColumn%3E%3CColumn%3E%3CField%3ETIME%24%24SNZ%3C%2FField%3E%3CColumnName%3ETime%2520%3C%2FColumnName%3E%3C%2FColumn%3E%3CColumn%3E%3CField%3EFINA.POINTS%24%24SNZ%3C%2FField%3E%3CColumnName%3EFINA%2520Points%3C%2FColumnName%3E%3C%2FColumn%3E%3CColumn%3E%3CField%3EFINA.YEAR%24%24SNZ%3C%2FField%3E%3CColumnName%3EPoints%2520Year%3C%2FColumnName%3E%3C%2FColumn%3E%3CColumn%3E%3CField%3E%24DATE%24COMP.DATE%3C%2FField%3E%3CColumnName%3EDate%3C%2FColumnName%3E%3C%2FColumn%3E%3CColumn%3E%3CField%3EXEVENT.CODE%3C%2FField%3E%3CColumnName%3EMeet%3C%2FColumnName%3E%3C%2FColumn%3E%3CColumn%3E%3CField%3EPARAMETER1%3C%2FField%3E%3CColumnName%3ECourse%3C%2FColumnName%3E%3C%2FC
olumn%3E%3C%2FDownloadColumns%3E"
headers = {
'Connection': 'keep-alive',
'sec-ch-ua': '" Not A;Brand";v="99", "Chromium";v="90", "Google Chrome";v="90"',
'accept': '*/*',
'x-requested-with': 'XMLHttpRequest',
'accept-language': 'tr-TR,tr;q=0.9,en-US;q=0.8,en;q=0.7,ru;q=0.6',
'sec-ch-ua-mobile': '?0',
'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36',
'content-type': 'application/x-www-form-urlencoded; charset=UTF-8',
'Origin': 'https://connect.swimming.org.nz',
'Sec-Fetch-Site': 'same-origin',
'Sec-Fetch-Mode': 'cors',
'Sec-Fetch-Dest': 'empty',
'Referer': 'https://connect.swimming.org.nz/snz-wrap-public/workflows/COMP.RESULTS.FIND',
'Cookie': 'JSESSIONID=93F2FEA63BA41ECB2505E2D1CD76374D; _ga=GA1.3.1735786808.1620106921; _gid=GA1.3.1806138988.1620106921'
}
response = requests.request("POST", url, headers=headers, data=payload)
print(response.text)
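For example, a hedged sketch of paging through more rows by rewriting StartRowIndex in the payload above (the 100-row page size matches the request; the upper bound of 500 is arbitrary and only for illustration):
# Page through results by bumping StartRowIndex in the payload string above.
for start in range(0, 500, 100):
    paged_payload = payload.replace("StartRowIndex=0", f"StartRowIndex={start}", 1)
    page = requests.request("POST", url, headers=headers, data=paged_payload)
    print(page.text[:200])  # first part of the XML returned for this page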

Scraping from specific website has stopped working

So a couple of weeks ago I wrote this program which successfully scraped some info from an online store, but now it has stopped working even though I haven't changed the code.
Could something have changed on the website itself, or is there something wrong with my code?
import requests
from bs4 import BeautifulSoup
url = 'https://www.continente.pt/stores/continente/pt-pt/public/Pages/ProductDetail.aspx?ProductId=7104665(eCsf_RetekProductCatalog_MegastoreContinenteOnline_Continente)'
res = requests.get(url)
html_page = res.content
soup = BeautifulSoup(html_page, 'html.parser')
priceInfo = soup.find('div', class_='pricePerUnit').text
priceInfo = priceInfo.replace('\n', '').replace('\r', '').replace(' ', '')
productName = soup.find('div', class_='productTitle').text.replace('\n', ' ')
productInfo = (soup.find('div', class_='productSubtitle').text
+ ', ' + soup.find('div', class_='productSubsubtitle').text)
print('Nome do produto: ' + productName)
print('Detalhes: ' + productInfo)
print('Custo: ' + priceInfo)
I know for a fact that what I'm searching for does exist and the URL is still valid, so what could be the issue?
I separated the priceInfo assignment into two lines because the error occurs in the first declaration: the find call returns None, which has no text attribute.
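For illustration, a minimal guard around that None result looks like this (a sketch; it only surfaces the problem, it doesn't fix the underlying blocking):
# Check the element exists before reading .text, so the script prints a
# useful message instead of raising an AttributeError.
price_div = soup.find('div', class_='pricePerUnit')
if price_div is None:
    print('pricePerUnit not found - the site is probably returning a different page to requests')
else:
    priceInfo = price_div.text.replace('\n', '').replace('\r', '').replace(' ', '')
    print(priceInfo)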
The solution is a bit multi-step:
Call the page you want to scrape in Firefox once.
Use the browser_cookie3 library to extract the cookies, and make sure they are not expired.
Pass the cookies with requests.get(url, cookies=browser_cookie3.firefox()).
Use the headers below (a combined sketch follows after them).
I have tried it myself and it works. Happy scraping!
headers = {
'Connection': 'keep-alive',
'Upgrade-Insecure-Requests': '1',
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.121 Safari/537.36',
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
'Sec-Fetch-Site': 'none',
'Sec-Fetch-Mode': 'navigate',
'Sec-Fetch-User': '?1',
'Sec-Fetch-Dest': 'document',
'Accept-Language': 'en-US,en;q=0.9,de;q=0.8',
}
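Here is a minimal sketch putting the steps together, assuming the headers dict above and that Firefox has visited the page recently (browser_cookie3.firefox() reads that profile's cookie store):
import browser_cookie3
import requests
from bs4 import BeautifulSoup

url = 'https://www.continente.pt/stores/continente/pt-pt/public/Pages/ProductDetail.aspx?ProductId=7104665(eCsf_RetekProductCatalog_MegastoreContinenteOnline_Continente)'

# Reuse the cookies Firefox already holds for this site (visit the page in
# Firefox first so they exist and are not expired).
cookies = browser_cookie3.firefox()
res = requests.get(url, cookies=cookies, headers=headers)  # headers = the dict above
soup = BeautifulSoup(res.content, 'html.parser')
print(soup.find('div', class_='productTitle').text.strip())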

Python POST to check a box on a webpage

I am trying to scrape a webpage that posts prices for the Mexico power market. The webpage has checkboxes that need to be checked for the file with prices to show up. Once I get the relevant box checked, I want to pull the links on the page and check if the particular file I am looking for is posted. I am getting stuck in the first part where I get the checkbox selected using requests.post. I used fiddler to track the changes when I post and passed those arguments in through requests.post.
I was expecting to be able to parse out all the 'href' links in the response but I didn't get any. Any help in redirecting me toward a solution would be greatly appreciated.
Below is the relevant portion of the code I am using:
data = {
"ctl00$ContentPlaceHolder1$toolkit":"ctl00$ContentPlaceHolder1$UpdatePanel1|ctl00$ContentPlaceHolder1$treePrincipal",
"__EVENTTARGET": "ctl00$ContentPlaceHolder1$treePrincipal",
"__EVENTARGUMENT":{"commandName":"Check","index":"0:0:0:0"},
"__VIEWSTATE": "/verylongstringhere",
"__VIEWSTATEGENERATOR":"6B88769A",
"__EVENTVALIDATION":"/wEdAAPhpIpHlL5kdIfX6MRCtKcRwfFVx5pEsE3np13JV2opXVEvSNmVO1vU+umjph0Dtwe41EcPKcg0qvxOp6m6pWTIV4q0ZOXSBrDwJTrxjo3dZg==",
"ctl00_ContentPlaceHolder1_treePrincipal_ClientState":{"expandedNodes":[],"collapsedNodes":
[],"logEntries":[],"selectedNodes":[],"checkedNodes":["0","0:0","0:0:0","0:0:0:0"],"scrollPosition":0},
"ctl00_ContentPlaceHolder1_ListViewNodos_ClientState":"",
"ctl00_ContentPlaceHolder1_NotifAvisos_ClientState":"",
"ctl00$ContentPlaceHolder1$NotifAvisos$hiddenState":"",
"ctl00_ContentPlaceHolder1_NotifAvisos_XmlPanel_ClientState":"",
"ctl00_ContentPlaceHolder1_NotifAvisos_TitleMenu_ClientState":"",
"__ASYNCPOST":"true"
}
headers = {
'Accept': '*/*',
'Accept-Encoding': 'gzip, deflate, br',
'Accept-Language': 'en-US,en;q=0.9',
'Cache-Control': 'no-cache',
'Connection': 'keep-alive',
'Content-Length': '26255',
'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
'Cookie': '_ga=GA1.3.1966843891.1571403663; _gid=GA1.3.1095695800.1571665852',
'Host': 'www.cenace.gob.mx',
'Origin': 'https://www.cenace.gob.mx',
'Referer': 'https://www.cenace.gob.mx/SIM/VISTA/REPORTES/PreEnergiaSisMEM.aspx',
'Sec-Fetch-Mode': 'cors',
'Sec-Fetch-Site': 'same-origin',
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.120 Safari/537.36',
'X-MicrosoftAjax': 'Delta=true',
'X-Requested-With': 'XMLHttpRequest'
}
url ="https://www.cenace.gob.mx/SIM/VISTA/REPORTES/PreEnergiaSisMEM.aspx"
r= requests.post(url,data=data, headers=headers, verify=False)
This is what Fiddler showed on the POST request (screenshot not included).
Maybe you have incorrect __EVENTVALIDATION or __VIEWSTATE fields. You can fetch the initial page and scrape all the inputs with their initial values.
The following code grabs the inputs from the first request, edits them as you did, and then sends the POST request, scraping all the href values:
import requests
import json
from bs4 import BeautifulSoup
base_url = "https://www.cenace.gob.mx"
url = "{}/SIM/VISTA/REPORTES/PreEnergiaSisMEM.aspx".format(base_url)
r = requests.get(url)
soup = BeautifulSoup(r.text, "html.parser")
payload = dict([
(t['name'],t.get('value',''))
for t in soup.select("input")
if t.has_attr('name')
])
payload['ctl00$ContentPlaceHolder1$toolkit'] = 'ctl00$ContentPlaceHolder1$UpdatePanel1|ctl00$ContentPlaceHolder1$treePrincipal'
payload['__EVENTTARGET'] = 'ctl00$ContentPlaceHolder1$treePrincipal'
payload['__ASYNCPOST'] = 'true'
payload['__EVENTARGUMENT']= json.dumps({
"commandName":"Check",
"index":"0:1:1:0"
})
payload['ctl00_ContentPlaceHolder1_treePrincipal_ClientState'] = json.dumps({
"expandedNodes":[], "collapsedNodes":[],
"logEntries":[], "selectedNodes":[],
"checkedNodes":["0","0:1","0:1:1","0:1:1:0"],
"scrollPosition":0
})
r = requests.post(url, data = payload, headers= {
"User-Agent": "Mozilla/5.0 (X11; Linux x86_64)"
})
soup = BeautifulSoup(r.text, "html.parser")
print([
"{}/{}".format(base_url, t["href"])
for t in soup.findAll('a')
if not t["href"].startswith('javascript')
])

Can't parse some names and their corresponding URLs from a webpage

I've created a Python script using requests and BeautifulSoup to parse the profile names and the links to their profiles from a webpage. The content seems to be generated dynamically, but it is present in the page source. So, I tried the following, but unfortunately I get nothing.
Site link: https://www.century21.com/real-estate-agents/Dallas,TX
My attempt so far:
import requests
from bs4 import BeautifulSoup
URL = 'https://www.century21.com/real-estate-agents/Dallas,TX'
headers = {
'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3',
'accept-encoding': 'gzip, deflate, br',
'accept-language': 'en-US,en;q=0.9,bn;q=0.8',
'cache-control': 'max-age=0',
'cookie': 'JSESSIONID=8BF2F6FB5603A416DCFBAB8A3BB5A79E.app09-c21-id8; website_user_id=1255553501;',
'user-agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36'
}
def get_info(link):
    res = requests.get(link, headers=headers)
    soup = BeautifulSoup(res.text, "lxml")
    for item in soup.select(".media__content"):
        profileUrl = item.get("href")
        profileName = item.select_one("[itemprop='name']").get_text()
        print(profileUrl, profileName)

if __name__ == '__main__':
    get_info(URL)
How can I fetch the content from that page?
The required content is available in the page source. The site is very good at discarding requests made repeatedly with the same user agent. So, I used fake_useragent to supply a random one with each request. It works as long as you don't use it incessantly.
Working solution:
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin
from fake_useragent import UserAgent
URL = 'https://www.century21.com/real-estate-agents/Dallas,TX'
def get_info(s, link):
    s.headers["User-Agent"] = ua.random
    res = s.get(link)
    soup = BeautifulSoup(res.text, "lxml")
    for item in soup.select(".media__content a[itemprop='url']"):
        profileUrl = urljoin(link, item.get("href"))
        profileName = item.select_one("span[itemprop='name']").get_text()
        print(profileUrl, profileName)

if __name__ == '__main__':
    ua = UserAgent()
    with requests.Session() as s:
        get_info(s, URL)
Partial output:
https://www.century21.com/CENTURY-21-Judge-Fite-Company-14501c/Stewart-Kipness-2657107a Stewart Kipness
https://www.century21.com/CENTURY-21-Judge-Fite-Company-14501c/Andrea-Anglin-Bulin-2631495a Andrea Anglin Bulin
https://www.century21.com/CENTURY-21-Judge-Fite-Company-14501c/Betty-DeVinney-2631507a Betty DeVinney
https://www.century21.com/CENTURY-21-Judge-Fite-Company-14501c/Sabra-Waldman-2657945a Sabra Waldman
https://www.century21.com/CENTURY-21-Judge-Fite-Company-14501c/Russell-Berry-2631447a Russell Berry
The page content is NOT rendered via JavaScript. Your code is fine in my case.
You just have an issue finding the profileUrl and handling the NoneType exception. You have to focus on the a tag to get the data.
You should try this:
import requests
from bs4 import BeautifulSoup
URL = 'https://www.century21.com/real-estate-agents/Dallas,TX'
headers = {
'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3',
'accept-encoding': 'gzip, deflate, br',
'accept-language': 'en-US,en;q=0.9,bn;q=0.8',
'cache-control': 'max-age=0',
'cookie': 'JSESSIONID=8BF2F6FB5603A416DCFBAB8A3BB5A79E.app09-c21-id8; website_user_id=1255553501;',
'user-agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36'
}
def get_info(link):
    res = requests.get(link, headers=headers)
    soup = BeautifulSoup(res.text, "lxml")
    results = []
    for item in soup.select(".media__content"):
        a_link = item.find('a')
        if a_link:
            result = {
                'profileUrl': a_link.get('href'),
                'profileName': a_link.get_text()
            }
            results.append(result)
    return results

if __name__ == '__main__':
    info = get_info(URL)
    print(info)
    print(len(info))
OUTPUT:
[{'profileName': 'Stewart Kipness',
'profileUrl': '/CENTURY-21-Judge-Fite-Company-14501c/Stewart-Kipness-2657107a'},
....,
{'profileName': 'Courtney Melkus',
'profileUrl': '/CENTURY-21-Realty-Advisors-47551c/Courtney-Melkus-7389925a'}]
941
It looks like you can construct the URL as well (though it does seem easier to just grab it):
import requests
from bs4 import BeautifulSoup as bs
URL = 'https://www.century21.com/real-estate-agents/Dallas,TX'
headers = {
'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3',
'accept-encoding': 'gzip, deflate, br',
'accept-language': 'en-US,en;q=0.9,bn;q=0.8',
'cache-control': 'max-age=0',
'cookie': 'JSESSIONID=8BF2F6FB5603A416DCFBAB8A3BB5A79E.app09-c21-id8; website_user_id=1255553501;',
'user-agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36'
}
r = requests.get(URL, headers = headers)
soup = bs(r.content, 'lxml')
items = soup.select('.media')
ids = []
names = []
urls = []
for item in items:
    if item.select_one('[data-agent-id]') is not None:
        anId = item.select_one('[data-agent-id]')['data-agent-id']
        ids.append(anId)
        name = item.select_one('[itemprop=name]').text.replace(' ', '-')
        names.append(name)
        url = 'https://www.century21.com/CENTURY-21-Judge-Fite-Company-14501c/' + name + '-' + anId + 'a'
        urls.append(url)

results = list(zip(names, urls))
print(results)
Please try:
profileUrl = "https://www.century21.com/" + item.select("a")[0].get("href")
