Web Scraping - Get to Page 2 - python

How do I get to page two of the data sets? No matter what I do, it only returns page 1.
import bs4
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
myURL = 'https://jobs.collinsaerospace.com/search-jobs/'
uClient = uReq(myURL)
page_html = uClient.read()
uClient.close()
page_soup = soup(page_html, "html.parser")
container = page_soup.findAll("section", {"id":"search-results"}, {"data-current-page":"4"})
for child in container:
    for heading in child.find_all('h2'):
        print(heading.text)

The site actually returns JSON containing the HTML for all of the entries. The API lets you specify a page number and also the number of records to return per page; increasing the latter reduces the number of requests and speeds things up.
The JSON that is returned contains three keys: the filter information, the results HTML, and a flag indicating whether any jobs were returned. This last entry can be used to signal when you have reached the end of the pages.
You might want to look at the very popular Python requests library, which builds the correct URLs for you from a dict of parameters and is also fast.
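For example, a quick way to confirm the shape of the response before writing the full scraper (a sketch; the endpoint and parameter names are the ones used in the script below) is:
import requests

probe = requests.get(
    "https://jobs.collinsaerospace.com/search-jobs/results",
    params={"CurrentPage": 1, "RecordsPerPage": 15, "SearchType": 5},
)
print(probe.json().keys())  # expect the three keys described above (including 'results' and 'hasJobs')
The full script below then pages through the results using the same parameters.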
import bs4
import requests
from bs4 import BeautifulSoup as soup
params = {
    "CurrentPage": 1,
    "RecordsPerPage": 100,
    "SearchResultsModuleName": "Search Results",
    "SearchFiltersModuleName": "Search Filters",
    "SearchType": 5,
}
myURL = 'https://jobs.collinsaerospace.com/search-jobs/results'
page = 1
more_jobs = True
while more_jobs:
    print(f"\nPage {page}")
    params['CurrentPage'] = page
    req = requests.get(myURL, params=params)
    json = req.json()
    page_soup = soup(json['results'], "html.parser")
    container = page_soup.findAll("section", {"id": "search-results"}, {"data-current-page": "4"})
    for child in container:
        for heading in child.find_all('h2'):
            print(heading.text)
    more_jobs = json['hasJobs']  # Did this return any jobs?
    page += 1

Try the following script to get the results from whichever pages you are interested in. All you need to do is change the range to suit your requirement. I could have used a while loop to exhaust the whole content, but that is not what you asked; a sketch of that variant follows the script.
import requests
from bs4 import BeautifulSoup
link = 'https://jobs.collinsaerospace.com/search-jobs/results?'
params = {
    'CurrentPage': '',
    'RecordsPerPage': 15,
    'Distance': 50,
    'SearchResultsModuleName': 'Search Results',
    'SearchFiltersModuleName': 'Search Filters',
    'SearchType': 5
}
for page in range(1, 5):  # This is where you change the range to get the results from whatever pages you want
    params['CurrentPage'] = page
    res = requests.get(link, params=params)
    soup = BeautifulSoup(res.json()['results'], "lxml")
    for name in soup.select("h2"):
        print(name.text)
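For completeness, here is a sketch of the exhaustive while-loop variant mentioned above; it combines the pagination parameters from this script with the hasJobs flag described in the first answer, so treat it as an outline rather than tested code:
import requests
from bs4 import BeautifulSoup

link = 'https://jobs.collinsaerospace.com/search-jobs/results'
params = {'CurrentPage': 1, 'RecordsPerPage': 100, 'SearchType': 5}
page = 1
while True:
    params['CurrentPage'] = page
    data = requests.get(link, params=params).json()
    if not data.get('hasJobs'):  # stop when the API reports no more jobs
        break
    for heading in BeautifulSoup(data['results'], "lxml").select("h2"):
        print(heading.text)
    page += 1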

Try this:
import bs4
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
for letter in range(10):
    myURL = 'https://jobs.collinsaerospace.com/search-jobs/' + str(letter) + ' '
    uClient = uReq(myURL)
    page_html = uClient.read()
    uClient.close()
    page_soup = soup(page_html, "html.parser")
    container = page_soup.findAll("section", {"id": "search-results"}, {"data-current-page": "4"})
    for child in container:
        for heading in child.find_all('h2'):
            print(heading.text)
Output of the first 3 pages:
0
SYSTEMS / APPLICATIONS ENGINEER
Data Scientist
Sr Engineer, Drafter/Product Definition
Finance and Accounting Intern
Senior Software Engineer - CT3
Intern Manufacturing Engineer
Staff Eng., Reliability Engineering
Software Developer
Configuration Management Specialist
Disassembler I--2nd Shift
Disassembler I--3rd Shift
Manager, Supplier Performance
Manager, Supplier Performance
Assoc Eng, Mfg Engrg-Ops, ME P1
Manager, Supplier Performance
1
Assembly Operator (UK7014) 1 1 1 1
Senior Administrator (DF1040) 1 1 1
Tester 1
Assembler 1
Assembler 1
Finisher 1
Painter 1
Technician 1 Manufacturing/Operations
Assembler 1 - 1st Shift
Supply Chain Analyst 1
Assembler (W7006) 1
Assembler (W7006) 1
Supplier Quality Engineer 1
Supplier Inspection Engineer 1
Assembler 1 - 1st Shift
2
Assembler I-FAA-2
Senior/Business Analyst-2
Operational Technical Support Level 2
Project Engineer - 2 – EMU Program
Line & Surface Plate Inspector Class 2
Software Engineer (LVL 2) - Embedded UAV Controls
Software Engineer (LVL 2 / JAVA) - Air Combat Training
Software Engineer (Level 2) - Mission Simulation & Training
Electrical Engineer (LVL 2) - Mission Systems Design Tools
Quality Inspector II
GET/PGET
GET/PGET
Production Supervisor - 2nd shift
Software Developer
Trainee Operator/ Operator

Related

How to get all products from a beautifulsoup page

I want to get all the products on this page:
nike.com.br/snkrs#estoque
My python code is this:
produtos = []

def aviso():
    print("Started!")
    request = requests.get("https://www.nike.com.br/snkrs#estoque")
    soup = bs4(request.text, "html.parser")
    links = soup.find_all("a", class_="btn", text="Comprar")
    links_filtred = list(set(links))
    for link in links_filtred:
        if(produto not in produtos):
            request = requests.get(f"{link['href']}")
            soup = bs4(request.text, "html.parser")
            produto = soup.find("div", class_="nome-preco-produto").get_text()
            if(code_formated == ""):
                code_formated = "\u200b"
            print(f"Nome: {produto} Link: {link['href']}\n")
            produtos.append(link["href"])

aviso()
This code gets the products from the page, but not all of them. I suspect that the content is dynamic. How can I get them all with requests and BeautifulSoup? I don't want to use Selenium or another automation library, and I don't want to have to change my code a lot because it's almost done. How do I do that?
DO NOT USE requests.get directly if you are making repeated requests to the same HOST.
Reason: a requests.Session reuses the underlying connection (keep-alive), which is significantly faster and also keeps cookies between calls.
import requests
from bs4 import BeautifulSoup
import pandas as pd
def main(url):
    allin = []
    with requests.Session() as req:
        for page in range(1, 6):
            params = {
                'p': page,
                'demanda': 'true'
            }
            r = req.get(url, params=params)
            soup = BeautifulSoup(r.text, 'lxml')
            goal = [(x.find_next('h2').get_text(strip=True, separator=" "), x['href'])
                    for x in soup.select('.aspect-radio-box')]
            allin.extend(goal)
    df = pd.DataFrame(allin, columns=['Title', 'Url'])
    print(df)

main('https://www.nike.com.br/Snkrs/Feed')
Output:
Title Url
0 Dunk High x Fragment design Black https://www.nike.com.br/dunk-high-x-fragment-d...
1 Dunk Low Infantil (16-26) City Market https://www.nike.com.br/dunk-low-infantil-16-2...
2 ISPA Flow 2020 Desert Sand https://www.nike.com.br/ispa-flow-2020-153-169...
3 ISPA Flow 2020 Pure Platinum https://www.nike.com.br/ispa-flow-2020-153-169...
4 Nike iSPA Men's Lightweight Packable Jacket https://www.nike.com.br/nike-ispa-153-169-211-...
.. ... ...
115 Air Jordan 1 Mid Hyper Royal https://www.nike.com.br/air-jordan-1-mid-153-1...
116 Dunk High Orange Blaze https://www.nike.com.br/dunk-high-153-169-211-...
117 Air Jordan 5 Stealth https://www.nike.com.br/air-jordan-5-153-169-2...
118 Air Jordan 3 Midnight Navy https://www.nike.com.br/air-jordan-3-153-169-2...
119 Air Max 90 Bacon https://www.nike.com.br/air-max-90-153-169-211...
[120 rows x 2 columns]
To get the data you can send a request to:
https://www.nike.com.br/Snkrs/Estoque?p=<PAGE>&demanda=true
providing a page number between 1 and 5 for the p= parameter in the URL.
For example, to print the links, you can try:
import requests
from bs4 import BeautifulSoup
url = "https://www.nike.com.br/Snkrs/Estoque?p={page}&demanda=true"
for page in range(1, 6):
    response = requests.get(url.format(page=page))
    soup = BeautifulSoup(response.content, "html.parser")
    print(soup.find_all("a", class_="btn", text="Comprar"))

Inconsistent results while web scraping using beautiful soup

I am having an inconsistent issue that is driving me crazy. I am trying to scrape data about rental units. Let's say we have a webpage with 42 ads; the code works just fine for only 19 ads and then it returns:
Traceback (most recent call last):
File "main.py", line 53, in <module>
title = real_state_title.div.h1.text.strip()
AttributeError: 'NoneType' object has no attribute 'div'
If I start the code processing ads from a different ad number, say 5, it still processes only 19 ads and then raises the same error!
Here is a minimal example to show the issue I am having. Please note that this code prints the HTML for a functioning ad and also for the one with the error; what is printed is very different.
Run the code, then change the value of i to see the results.
from bs4 import BeautifulSoup as soup # HTML data structure
from urllib.request import urlopen as uReq # Web client
import traceback
page_url = "https://www.kijiji.ca/b-apartments-condos/saint-john/c37l80017?ll=45.273315%2C-66.063308&address=Saint+John%2C+NB&ad=offering&radius=20.0"
# opens the connection and downloads html page from url
uClient = uReq(page_url)
# parses html into a soup data structure to traverse html
page_soup = soup(uClient.read(), "html.parser")
uClient.close()
# finds each ad from Kijiji web page
containers = page_soup.findAll('div', {'class': 'clearfix'})
# Print the number of ads in this web page
print(f'Number of ads in this web page is {len(containers)}')
print_functioning_ad = True
# Loop through ads
i = 1  # change to start from a different ad (don't put zero)
for container in containers[i:]:
    print(f'Ad No.: {i}\n')
    i += 1
    # Get the link for this specific ad
    ad_link_container = container.find('div', {'class': 'title'})
    ad_link = 'https://kijiji.ca' + ad_link_container.a['href']
    print(ad_link)
    single_ad = uReq(ad_link)
    # parses html into a soup data structure to traverse html
    page_soup2 = soup(single_ad.read(), "html.parser")
    single_ad.close()
    # Title
    real_state_title = page_soup2.find('div', {'class': 'realEstateTitle-1440881021'})
    # Print one functioning ad html
    if print_functioning_ad:
        print_functioning_ad = False
        print(page_soup2)
    print('real state title type', type(real_state_title))
    try:
        title = real_state_title.div.h1.text.strip()
        print(title)
    except Exception:
        print(traceback.format_exc())
        print(page_soup2)
        break
    print('____________________________________________________________')
Edit 1:
In my simple example I want to loop through each ad in the provided link, open it, and get the title. In my actual code I am not only getting the title but also every other piece of info about the ad, so I need to load the data from the link associated with every ad. My code actually does that, but for an unknown reason this works for ONLY 19 ads, regardless of which one I start with. This is driving me nuts!
To get all pages from the URL you can use the following example:
import requests
from bs4 import BeautifulSoup
page_url = "https://www.kijiji.ca/b-apartments-condos/saint-john/c37l80017?ll=45.273315%2C-66.063308&address=Saint+John%2C+NB&ad=offering&radius=20.0"
page = 1
while True:
    print("Page {}...".format(page))
    print("-" * 80)
    soup = BeautifulSoup(requests.get(page_url).content, "html.parser")
    for i, a in enumerate(soup.select("a.title"), 1):
        print(i, a.get_text(strip=True))
    next_url = soup.select_one('a[title="Next"]')
    if not next_url:
        break
    print()
    page += 1
    page_url = "https://www.kijiji.ca" + next_url["href"]
Prints:
Page 1...
--------------------------------------------------------------------------------
1 Spacious One Bedroom Apartment
2 3 Bedroom Quispamsis
3 Uptown-two-bedroom apartment for rent - all-inclusive
4 New Construction!! Large 2 Bedroom Executive Apt
5 LARGE 1 BEDROOM UPTOWN $850 HEAT INCLUDED AVAIABLE JULY 1
6 84 Wright St Apt 2
7 310 Woodward Ave (Brentwood Tower) Condo #1502
...
Page 5...
--------------------------------------------------------------------------------
1 U02 - CHFR - Cozy 1 Bedroom + Den - WEST SAINT JOHN
2 2+ Bedroom Historic Renovated Stainless Kitchen
3 2 Bedroom Apartment - 343 Prince Street West
4 2 Bedroom 5th Floor Loft Apartment in South End Saint John
5 Bay of Fundy view from luxury 5th floor 1 bedroom + den suite
6 Suites of The Atlantic - Renting for Fall 2021: 2 bedrooms
7 WOODWARD GARDENS//2 BR/$945 + LIGHTS//MAY//MILLIDGEVILLE//JULY
8 HEATED & SMOKE FREE - Bach & 1Bd Apt - 50% off 1st month's rent
9 Beautiful 2 bedroom apartment in Millidgeville
10 Spacious 2 bedroom in Uptown Saint John
11 3 bedroom apartment at Millidge Ave close to university ave
12 Big Beautiful 3 bedroom apt. in King Square
13 NEWER HARBOURVIEW SUITES UNFURNISHED OR FURNISHED /BLUE ROCK
14 Rented
15 Completely Renovated - 1 Bedroom Condo w/ small den Brentwood
16 1+1 Bedroom Apartment for rent for 2 persons
17 3 large bedroom apt. in King Street East Saint John,NB
18 Looking for a house
19 Harbour View 2 Bedroom Apartment
20 Newer Harbourview suites unfurnished or furnished /Blue Rock Ct
21 LOVELY 2 BEDROOM APARTMENT FOR LEASE 5 WOODHOLLOW PARK EAST SJ
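If, as in the edit above, you also need to open each ad and pull its details, a sketch of one way to combine the pagination above with a per-ad request is shown below. The a.title and a[title="Next"] selectors come from the answer; treating the page's first <h1> as the ad title is an assumption, and Kijiji's class names change over time:
import requests
from bs4 import BeautifulSoup

page_url = "https://www.kijiji.ca/b-apartments-condos/saint-john/c37l80017?ll=45.273315%2C-66.063308&address=Saint+John%2C+NB&ad=offering&radius=20.0"
while page_url:
    listing = BeautifulSoup(requests.get(page_url).content, "html.parser")
    for a in listing.select("a.title"):
        ad_url = "https://www.kijiji.ca" + a["href"]
        ad_soup = BeautifulSoup(requests.get(ad_url).content, "html.parser")
        h1 = ad_soup.find("h1")  # assumption: the ad title is the first <h1> on the ad page
        print(h1.get_text(strip=True) if h1 else "<no title found>", ad_url)
    nxt = listing.select_one('a[title="Next"]')
    page_url = "https://www.kijiji.ca" + nxt["href"] if nxt else None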
I think I figured out the problem here. It seems like you can't make a lot of requests in a short period of time, so I added a try/except block that sleeps for 80 seconds when this error occurs, and that fixed my problem!
You may want to change the sleep period to a different value depending on the website you are trying to scrape.
Here is the modified code:
from bs4 import BeautifulSoup as soup # HTML data structure
from urllib.request import urlopen as uReq # Web client
import traceback
import time
page_url = "https://www.kijiji.ca/b-apartments-condos/saint-john/c37l80017?ll=45.273315%2C-66.063308&address=Saint+John%2C+NB&ad=offering&radius=20.0"
# opens the connection and downloads html page from url
uClient = uReq(page_url)
# parses html into a soup data structure to traverse html
page_soup = soup(uClient.read(), "html.parser")
uClient.close()
# finds each ad from Kijiji web page
containers = page_soup.findAll('div', {'class': 'clearfix'})
# Print the number of ads in this web page
print(f'Number of ads in this web page is {len(containers)}')
print_functioning_ad = True
# Loop through ads
i = 1  # change to start from a different ad (don't put zero)
for container in containers[i:]:
    print(f'Ad No.: {i}\n')
    i = i + 1
    # Get the link for this specific ad
    ad_link_container = container.find('div', {'class': 'title'})
    ad_link = 'https://kijiji.ca' + ad_link_container.a['href']
    print(ad_link)
    single_ad = uReq(ad_link)
    # parses html into a soup data structure to traverse html
    page_soup2 = soup(single_ad.read(), "html.parser")
    single_ad.close()
    # Title
    real_state_title = page_soup2.find('div', {'class': 'realEstateTitle-1440881021'})
    try:
        title = real_state_title.div.h1.text.strip()
        print(title)
    except AttributeError:
        print(traceback.format_exc())
        i = i - 1
        t = 80
        print(f'----------------------------Sleep for {t} seconds!')
        time.sleep(t)
        continue
    print('____________________________________________________________')
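A small helper along these lines (a sketch, not part of the original answer) retries the same ad with a configurable pause instead of decrementing the loop counter; the class name is the one used in the question and may change:
import time
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup

def fetch_ad_title(ad_link, retries=3, pause=80):
    """Fetch one ad page and return its title, sleeping and retrying if the title is missing."""
    for attempt in range(retries):
        ad_soup = soup(uReq(ad_link).read(), "html.parser")
        title_div = ad_soup.find('div', {'class': 'realEstateTitle-1440881021'})
        if title_div is not None:
            return title_div.div.h1.text.strip()
        print(f'Title missing (possibly rate limited), sleeping {pause}s (attempt {attempt + 1})')
        time.sleep(pause)
    return None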

How to pull text from webpage from paragraph element in specific header inside a div using Beautifulsoup-python

Basically the title. I am trying to pull the paragraph text from the area underneath "GeneCards Summary for name_of_gene Gene" on https://www.genecards.org/cgi-bin/carddisp.pl?gene=IL6&keywords=il6, using the IL-6 gene as an example. What I would like to pull is just "IL6 (Interleukin 6) is a Protein Coding gene. Diseases associated with IL6 include Kaposi Sarcoma and Rheumatoid Arthritis, Systemic Juvenile. Among its related pathways are IL-1 Family Signaling Pathways and Immune response IFN alpha/beta signaling pathway. Gene Ontology (GO) annotations related to this gene include signaling receptor binding and growth factor activity."
I have been trying to use BeautifulSoup 4 with Python. The issue I am having specifically is that I just don't know how to specify what text I want to pull from the website.
from bs4 import BeautifulSoup
from urllib.request import Request, urlopen
baseURL = "https://www.genecards.org/cgi-bin/carddisp.pl?gene="
GeneToSearch = input("Gene of Interest: ")
updatedURL = baseURL + GeneToSearch
print(updatedURL)
req = Request(updatedURL, headers={'User-Agent': 'Mozilla/5.0'})
response = urlopen(req).read()
soup = BeautifulSoup(response, 'lxml')
for tag in soup.find_all(['script', 'style']):
    tag.decompose()
soup.get_text(strip=True)
VALID_TAGS = ['div', 'p']
for tag in soup.findAll('GeneCards Summary for ' + GeneToSearch + 'Gene'):
    if tag.name not in VALID_TAGS:
        tag.replaceWith(tag.renderContents())
print(soup.text)
This just ends up giving me every element from the website.
Using a recent version of BeautifulSoup you can use the :contains CSS pseudo-selector to search for a tag containing specific text. You can then navigate to the next p tag and extract the corresponding text:
from bs4 import BeautifulSoup
from urllib.request import Request, urlopen
baseURL = "https://www.genecards.org/cgi-bin/carddisp.pl?gene="
GeneToSearch = input("Gene of Interest: ")
updatedURL = baseURL + GeneToSearch
print(updatedURL)
req = Request(updatedURL, headers={'User-Agent': 'Mozilla/5.0'})
response = urlopen(req).read()
soup = BeautifulSoup(response, 'lxml')
text_find = 'GeneCards Summary for ' + GeneToSearch + ' Gene'
el = soup.select_one('h3:contains("' + text_find + '")')
summary = el.parent.find_next('p').text.strip()
print(summary)
Outputs:
IL6 (Interleukin 6) is a Protein Coding gene.
Diseases associated with IL6 include Kaposi Sarcoma and Rheumatoid Arthritis, Systemic Juvenile.
Among its related pathways are IL-1 Family Signaling Pathways and Immune response IFN alpha/beta signaling pathway.
Gene Ontology (GO) annotations related to this gene include signaling receptor binding and growth factor activity.
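As a side note, in newer Beautiful Soup / Soup Sieve releases :contains is deprecated in favour of :-soup-contains, so depending on your installed version the selector above may need to be written as:
el = soup.select_one('h3:-soup-contains("' + text_find + '")')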
Try to navigate between tags, something like this:
soup.select('.gc-subsection-header')[1].next_sibling.next_sibling.text
Ref.: Beautiful Soup
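For context, here is a sketch of how that one-liner fits into a full request; the .gc-subsection-header class and the index 1 come from the snippet above and assume the GeneCards layout has not changed:
import requests
from bs4 import BeautifulSoup

url = "https://www.genecards.org/cgi-bin/carddisp.pl?gene=IL6"
resp = requests.get(url, headers={"User-Agent": "Mozilla/5.0"})
soup = BeautifulSoup(resp.text, "lxml")

# the second .gc-subsection-header is assumed to be "GeneCards Summary for IL6 Gene"
header = soup.select(".gc-subsection-header")[1]
print(header.next_sibling.next_sibling.text.strip())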

How to select value from dropdown item using requests in Python 3?

I want to scrape data from the website https://xlnindia.gov.in/frm_G_Cold_S_Query.aspx. I have to select the State as Delhi and the District as Adarsh Nagar (4), click on the Search button, and scrape all the information.
So far I have tried the code given below:
import requests
from bs4 import BeautifulSoup
An 'HTTPS 443 SSL' error was coming up, which I resolved using verify=False:
resp = requests.get('https://xlnindia.gov.in/frm_G_Cold_S_Query.aspx',verify=False)
soup = BeautifulSoup(resp.text,"lxml")
dictinfo = {i['name']:i.get('value','') for i in soup.select('input[name]')}
dictinfo['ddlState']='Delhi'
dictinfo['ddldistrict']='Adarsh Nagar (4)'
dictinfo['__EVENTTARGET']='btnSearch'
dictinfo = {k:(None,str(v)) for k,v in dictinfo.items()}
r=requests.post('https://xlnindia.gov.in/frm_G_Cold_S_Query.aspx',verify=False,files=dictinfo)
r
Error: Response [500]
soup2
Error:
Invalid postback or callback argument. Event validation is enabled using <pages enableEventValidation="true"/> in configuration or <%@ Page EnableEventValidation="true" %> in a page. For security purposes, this feature verifies that arguments to postback or callback events originate from the server control that originally rendered them. If the data is valid and expected, use the ClientScriptManager.RegisterForEventValidation method in order to register the postback or callback data for validation.
Can someone please help me scrape it or get it done?
(I can only use the requests & BeautifulSoup libraries, no Selenium, Mechanize, etc.)
Try the script below to get the tabular results that are populated by choosing the two dropdown items you stated above on that webpage. It turns out that you have to make two consecutive POST requests to populate the results.
import requests
from bs4 import BeautifulSoup
import urllib3
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)
url = 'https://xlnindia.gov.in/frm_G_Cold_S_Query.aspx'
with requests.Session() as s:
    s.headers['User-Agent'] = 'Mozilla/5.0'
    resp = s.get(url, verify=False)
    soup = BeautifulSoup(resp.text, "lxml")
    dictinfo = {i['name']: i.get('value', '') for i in soup.select('input[name]')}
    dictinfo['ddlState'] = 'DL'
    res = s.post(url, data=dictinfo)
    soup_obj = BeautifulSoup(res.text, "lxml")
    payload = {i['name']: i.get('value', '') for i in soup_obj.select('input[name]')}
    payload['ddldistrict'] = 'ADN'
    r = s.post(url, data=payload)
    sauce = BeautifulSoup(r.text, "lxml")
    for items in sauce.select("#dgDisplay tr"):
        data = [item.get_text(strip=True) for item in items.select("td")]
        print(data)
Output you may see in the console:
['Firm Name', 'City', 'Licences', 'Reg. Pharmacists / Comp. Person']
['A ONE MEDICOS', 'DELHI-251/1, GALI NO.1, KH, NO, 739/251/1, NEAR HIMACHAL BHAWAN,SARAI PIPAL THALA, VILLAGE AZAD PUR,', 'R - 2', 'virender kumar, DPH, [22295-17/10/2013]']
['AAROGYAM', 'DELHI-PVT. SHOP NO. 1, GF, 121,VILLAGE BHAROLA', 'R - 2', 'avinesh bhadoriya, DPH, [27033-]']
['ABCO INDIA', 'DELHI-SHOP NO-452/22,BHUSHAN BHAWAN RING ROAD,FLYOVER AZAD PUR', 'W - 2', 'sanjay dubey , SSC, [C-P-03/01/1997]']
['ADARSH MEDICOS', 'DELHI-NORTHERN SIDE B-107, GALI NO. 1,,MAJLIS PARK, VILLAGE BHAROLA,', 'R - 2', 'dilip kumar, BPH, [28036-11/01/2018]']

Scraping content from AJAX onclick pop-up

I'm attempting to scrape information from this page using Python: https://j2c-com.com/Euronaval14/catalogueWeb/catalogue.php?lang=gb. I'm specifically interested in the pop-up that occurs when you click on an individual exhibitor's name. The challenging part is that it uses a lot of JavaScript to make AJAX calls to load the data.
I've examined the network calls when clicking on an exhibitor and it appears that the AJAX call goes to this URL (for the first exhibitor in the list, "A.I.A.D. and MOD ITALY"): https://j2c-com.com/Euronaval14/catalogueWeb/ajaxSociete.php?cle=D000365D000365&rnd=0.005115277832373977
I understand where the cle parameter comes from (the id on the <span> tag); however, what I don't quite get is where the rnd parameter comes from. Is it simply just a random number? I tried supplying a random number with each request, but the HTML returned is missing the actual content of the pop-up.
This leads me to believe that either the rnd attribute isn't a random number, or I need some type of cookie present in order for the actual data to come through in the request.
Here's my code so far, I'm using Requests and BeautifulSoup to parse the html:
import random
import decimal
import requests
from bs4 import BeautifulSoup
#base_url = 'https://j2c-com.com/Euronaval14/catalogueWeb/catalogue.php?lang=gb'
base_url = 'https://j2c-com.com/Euronaval14/catalogueWeb/cataloguerecherche.php?listeFavoris=&typeRecherche=1&typeRechSociete=&typeSociete=&typeMarque=&typeDescriptif=&typeActivite=&choixSociete=&choixPays=&choixActivite=&choixAgent=&choixPavillon=&choixZoneExpo=&langue=gb&rnd=0.1410133063327521'
def generate_random_number(i,d):
    "Produce a random between 0 and 1, with 16 decimal digits"
    return str(decimal.Decimal('%d.%d' % (random.randint(0,i),random.randint(0,d))))

r = requests.get(base_url)
soup = BeautifulSoup(r.text)
table = soup.find('table', {'id':'tableResultat'})
trs = table.findAll('tr')
for tr in trs:
    span = tr.find('span')
    cle = span.get('id')
    url = 'https://j2c-com.com/Euronaval14/catalogueWeb/ajaxSociete.php?cle=' + cle + '&rnd=' + generate_random_number(0,9999999999999999)
    pop = requests.post(url)
    print url
    print pop.text
    break
Can you help me understand how I can successfully capture the pop-up data, or what I'm doing wrong? Thanks in advance!
It is not about the rnd parameter; it is completely random, filled in by the Math.random() JS function.
As you suspected, it is about cookies. The PHPSESSID cookie has to be sent with every following request. Just start a requests.Session() and use it for every request you make:
The Session object allows you to persist certain parameters across requests. It also persists cookies across all requests made from the Session instance.
...
# start session
session = requests.Session()
r = session.get(base_url)
soup = BeautifulSoup(r.text)
table = soup.find('table', {'id':'tableResultat'})
trs = table.findAll('tr')
for tr in trs:
    span = tr.find('span')
    cle = span.get('id')
    url = 'https://j2c-com.com/Euronaval14/catalogueWeb/ajaxSociete.php?cle=' + cle + '&rnd=' + generate_random_number(0,9999999999999999)
    pop = session.post(url)  # <-- the POST request here carries the cookies returned by the first GET call
    print url
    print pop.text
    break
It prints (note that the HTML is now filled with the required data):
https://j2c-com.com/Euronaval14/catalogueWeb/ajaxSociete.php?cle=D000365D000365&rnd=0.1625497943120751
<table class='divAdresse'>
<tr>
<td class='ficheAdresse' valign='top'>Via Nazionale 54<br>IT-00184 - Roma<br><img
src='../../intranetJ2C/images/flags/IT.gif' style='margin-right:5px;'>ITALY<br><br>Phone: +39 06 488
0247 | Fax: +39 06 482 74 76<br><br>Website: <a href='http://www.aiad.it' target='_new'>www.aiad.it</a></td>
</tr>
</table>
<br>
<b class="divMarque">Contact:</b><br>
<font class="ficheAdresse"> Carlo Festucci - Secretary General<br>
c.festucci#aiad.it</font>
<br><br>
<div id='divTexte' class='ficheTexte'></div>
UPD.
The reason you were not getting the results for the other exhibitors in the table is difficult to explain, but the main point is to simulate all of the consecutive AJAX requests that are fired under the hood when you click on a row in the browser:
import random
import decimal
import requests
from bs4 import BeautifulSoup
base_url = 'https://j2c-com.com/Euronaval14/catalogueWeb/cataloguerecherche.php?listeFavoris=&typeRecherche=1&typeRechSociete=&typeSociete=&typeMarque=&typeDescriptif=&typeActivite=&choixSociete=&choixPays=&choixActivite=&choixAgent=&choixPavillon=&choixZoneExpo=&langue=gb&rnd=0.1410133063327521'
fiche_url = 'https://j2c-com.com/Euronaval14/catalogueWeb/fiche.php'
reload_url = 'https://j2c-com.com/Euronaval14/catalogueWeb/reload.php'
data_url = 'https://j2c-com.com/Euronaval14/catalogueWeb/ajaxSociete.php'
def generate_random_number(i,d):
    "Produce a random between 0 and 1, with 16 decimal digits"
    return str(decimal.Decimal('%d.%d' % (random.randint(0, i),random.randint(0, d))))

# start session
session = requests.Session()
r = session.get(base_url)
soup = BeautifulSoup(r.content)
for span in soup.select('table#tableResultat tr span'):
    cle = span.get('id')
    session.post(reload_url)
    session.post(fiche_url, data={'page': 'page:catalogue',
                                  'pasFavori': '1',
                                  'listeFavoris': '',
                                  'cle': cle,
                                  'stand': '',
                                  'rnd': generate_random_number(0, 9999999999999999)})
    session.post(reload_url)
    pop = session.post(data_url, data={'cle': cle,
                                       'rnd': generate_random_number(0, 9999999999999999)})
    print pop.text
Prints:
<table class='divAdresse'><tr><td class='ficheAdresse' valign='top'>Via Nazionale 54<br>IT-00184 - Roma<br><img src='../../intranetJ2C/images/flags/IT.gif' style='margin-right:5px;'>ITALY<br><br>Phone: +39 06 488 0247 | Fax: +39 06 482 74 76<br><br>Website: Contact:</b><br><font class="ficheAdresse"> Carlo Festucci - Secretary General<br><a href="mailto:c.festucci#aiad.it">c.festucci#aiad.it</font><br><br><div id='divTexte' class='ficheTexte'></div>
<table class='divAdresse'><tr><td class='ficheAdresse' valign='top'>An der Faehre 2<br>27809 - Lemwerder<br><img src='../../intranetJ2C/images/flags/DE.gif' style='margin-right:5px;'>GERMANY<br><br>Phone: +49 421 673 30 | Fax: +49 421 673 3115<br><br>Website: <a href='http://www.abeking.com' target='_new'>www.abeking.com</a></td></tr></table><br><b class="divMarque">Contact:</b><br><font class="ficheAdresse"> Thomas Haake - Sales Director Navy</font><br><br><div id='divTexte' class='ficheTexte'></div>
<table class='divAdresse'><tr><td class='ficheAdresse' valign='top'>Mohamed Bin Khalifa Street (street 15)<br>PO Box 107241<br>107241 - Abu Dhabi<br><img src='../../intranetJ2C/images/flags/AE.gif' style='margin-right:5px;'>UNITED ARAB EMIRATES<br><br>Phone: +971 2 445 5551 | Fax: +971 2 445 0644</td></tr></table><br><b class="divMarque">Contact:</b><br><font class="ficheAdresse"> Pierre Baz - Business Development<br>pierre.baz#abudhabimar.com</font><br><br><div id='divTexte' class='ficheTexte'></div>
...
