Why am I not able to scrape all pages from a website with BeautifulSoup? - python

I'm trying to get the data from all pages. I used a counter, cast it to a string to build the page number into the URL, and looped over it, but I always get the same result.
This is my code:
# Scraping job offers from the hellowork website
# import libraries
import random
import requests
import csv
from bs4 import BeautifulSoup
from datetime import date

# configure a user agent for a Mozilla (Firefox) browser
user_agents = [
    "Mozilla/5.0 (Windows NT 10.0; rv:91.0) Gecko/20100101 Firefox/91.0",
    "Mozilla/5.0 (Windows NT 10.0; rv:78.0) Gecko/20100101 Firefox/78.0",
    "Mozilla/5.0 (X11; Linux x86_64; rv:95.0) Gecko/20100101 Firefox/95.0"
]
random_user_agent = random.choice(user_agents)
headers = {'User-Agent': random_user_agent}

Here is where I used my counter:

i = 0
for i in range(1, 15):
    url = 'https://www.hellowork.com/fr-fr/emploi/recherche.html?p=' + str(i)
    print(url)
    page = requests.get(url, headers=headers)
    if page.status_code == 200:
        soup = BeautifulSoup(page.text, 'html.parser')
        jobs = soup.findAll('div', class_=' new action crushed hoverable !tw-p-4 md:!tw-p-6 !tw-rounded-2xl')
        # configure the csv
        csvfile = open('jobList.csv', 'w+', newline='')
        row_list = []  # to append the list of jobs
        try:
            writer = csv.writer(csvfile)
            writer.writerow(["ID", "Job Title", "Company Name", "Contract type", "Location", "Publish time", "Extract Date"])
            for job in jobs:
                id = job.get('id')
                jobtitle = job.find('h3', class_='!tw-mb-0').a.get_text()
                companyname = job.find('span', class_='tw-mr-2').get_text()
                contracttype = job.find('span', class_='tw-w-max').get_text()
                location = job.find('span', class_='tw-text-ellipsis tw-whitespace-nowrap tw-block tw-overflow-hidden 2xsOld:tw-max-w-[20ch]').get_text()
                publishtime = job.find('span', class_='md:tw-mt-0 tw-text-xsOld').get_text()
                extractdate = date.today()
                row_list = [[id, jobtitle, companyname, contracttype, location, publishtime, extractdate]]
                writer.writerows(row_list)
        finally:
            csvfile.close()
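A side note on the "always the same result" behaviour: jobList.csv is re-opened in 'w+' mode inside the page loop, so every page overwrites the one written before it. A minimal sketch that opens the file and writes the header once, before the loop, while keeping the per-job parsing from the question untouched:

import csv
import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:95.0) Gecko/20100101 Firefox/95.0'}

# Sketch: open the CSV once and write the header once, then append one row
# per job across all 14 pages instead of recreating the file on every page.
with open('jobList.csv', 'w', newline='') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(["ID", "Job Title", "Company Name", "Contract type", "Location", "Publish time", "Extract Date"])
    for i in range(1, 15):
        url = 'https://www.hellowork.com/fr-fr/emploi/recherche.html?p=' + str(i)
        page = requests.get(url, headers=headers)
        if page.status_code != 200:
            continue
        soup = BeautifulSoup(page.text, 'html.parser')
        for job in soup.find_all('div', class_=' new action crushed hoverable !tw-p-4 md:!tw-p-6 !tw-rounded-2xl'):
            # ...the same per-job parsing as in the question goes here, then:
            # writer.writerow([id, jobtitle, companyname, contracttype, location, publishtime, extractdate])
            pass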

In newer code, avoid the old findAll() syntax; use find_all() or select() with CSS selectors instead. For more, take a minute to check the docs.
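For example, a minimal sketch showing the two equivalent modern calls side by side:

from bs4 import BeautifulSoup

html = '<div class="job"><h3>Title</h3></div><div class="job"><h3>Other</h3></div>'
soup = BeautifulSoup(html, 'html.parser')

jobs = soup.find_all('div', class_='job')   # modern replacement for findAll()
jobs_css = soup.select('div.job')           # the same result via a CSS selector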
BeautifulSoup isn't strictly needed here. You could get all of this information, and more, directly from the API using a mix of requests and pandas. Check all the available information here:
https://www.hellowork.com/searchoffers/getsearchfacets?p=1
Example
import requests
import pandas as pd
from datetime import datetime

df = pd.concat(
    [
        pd.json_normalize(
            requests.get(
                f'https://www.hellowork.com/searchoffers/getsearchfacets?p={i}',
                headers={'user-agent': 'bond'}
            ).json(),
            record_path=['Results']
        )[['ContractType', 'Localisation', 'OfferTitle', 'PublishDate', 'CompanyName']]
        for i in range(1, 15)
    ],
    ignore_index=True
)
df['extractdate'] = datetime.today().strftime('%Y-%m-%d')
df.to_csv('jobList.csv', index=False)
Output

     ContractType  Localisation            OfferTitle                                                                 PublishDate              CompanyName                      extractdate
0    CDI           Beaurepaire - 85        Chef Gérant H/F                                                            2023-01-24T16:35:15.867  Armonys Restauration - Morbihan  2023-01-24
1    CDI           Saumur - 49             Dessinateur Métallerie Débutant H/F                                        2023-01-24T16:35:14.677  G2RH                             2023-01-24
2    Franchise     Villenave-d'Ornon - 33  Courtier en Travaux de l'Habitat pour Particuliers et Professionnels H/F  2023-01-24T16:35:13.707  Elysée Concept                   2023-01-24
3    Franchise     Montpellier - 34        Courtier en Travaux de l'Habitat pour Particuliers et Professionnels H/F  2023-01-24T16:35:12.61   Elysée Concept                   2023-01-24
4    CDD           Monaco                  Spécialiste Senior Développement Matières Premières Cosmétique H/F        2023-01-24T16:35:06.64   Expectra Monaco                  2023-01-24
...
275  CDI           Brétigny-sur-Orge - 91  Magasinier - Cariste H/F                                                   2023-01-24T16:20:16.377  DELPHARM                         2023-01-24
276  CDI           Lille - 59              Technicien Helpdesk Français - Italien H/F                                 2023-01-24T16:20:16.01   Akkodis                          2023-01-24
277  CDI           Tours - 37              Conducteur PL H/F                                                          2023-01-24T16:20:15.197  Groupe Berto                     2023-01-24
278  Franchise     Nogent-le-Rotrou - 28   Courtier en Travaux de l'Habitat pour Particuliers et Professionnels H/F  2023-01-24T16:20:12.29   Elysée Concept                   2023-01-24
279  CDI           Cholet - 49             Ingénieur Assurance Qualité H/F                                            2023-01-24T16:20:10.837  Akkodis                          2023-01-24

Related

Can't find hrefs of interest with BeautifulSoup

I am trying to collect a list of hrefs from the Netflix careers site: https://jobs.netflix.com/search. Each job listing on this site has an anchor and a class: <a class=css-2y5mtm essqqm81>. To be thorough here, the entire anchor is:
<a class="css-2y5mtm essqqm81" role="link" href="/jobs/244837014" aria-label="Manager, Written Communications"\>\
<span tabindex="-1" class="css-1vbg17 essqqm80"\>\<h4 class="css-hl3xbb e1rpdjew0"\>Manager, Written Communications\</h4\>\</span\>\</a\>
Again, the information of interest here is the hrefs of the form href="/jobs/244837014". However, when I perform the standard BS commands to read the HTML:
import urllib.request
from bs4 import BeautifulSoup

html_page = urllib.request.urlopen("https://jobs.netflix.com/search")
soup = BeautifulSoup(html_page)
I don't see any of the hrefs that I'm interested in inside of soup.
Running the following loop does not show the hrefs of interest:
for link in soup.findAll('a'):
    print(link.get('href'))
What am I doing wrong?
That information is fed into the page dynamically, via XHR calls, so you need to scrape the API endpoint to get the job info. The following code will give you a dataframe with all jobs currently listed by Netflix:
import requests
from bs4 import BeautifulSoup as bs
import pandas as pd
from tqdm import tqdm  # if in Jupyter: from tqdm.notebook import tqdm

pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)

headers = {
    'referer': 'https://jobs.netflix.com/search',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36'
}

big_df = pd.DataFrame()
s = requests.Session()
s.headers.update(headers)

for x in tqdm(range(1, 20)):
    url = f'https://jobs.netflix.com/api/search?page={x}'
    r = s.get(url)
    df = pd.json_normalize(r.json()['records']['postings'])
    big_df = pd.concat([big_df, df], axis=0, ignore_index=True)

print(big_df[['text', 'team', 'external_id', 'updated_at', 'created_at', 'location', 'organization']])
Result:
100%
19/19 [00:29<00:00, 1.42s/it]
text team external_id updated_at created_at location organization
0 Events Manager - SEA [Publicity] 244936062 2022-11-23T07:20:16+00:00 2022-11-23T04:47:29Z Bangkok, Thailand [Marketing and PR]
1 Manager, Written Communications [Publicity] 244837014 2022-11-23T07:20:16+00:00 2022-11-22T17:30:06Z Los Angeles, California [Marketing and Publicity]
2 Manager, Creative Marketing - Korea [Marketing] 244740829 2022-11-23T07:20:16+00:00 2022-11-22T07:39:56Z Seoul, South Korea [Marketing and PR]
3 Administrative Assistant - Philippines [Netflix Technology Services] 244683946 2022-11-23T07:20:16+00:00 2022-11-22T01:26:08Z Manila, Philippines [Corporate Functions]
4 Associate, Studio FP&A - APAC [Finance] 244680097 2022-11-23T07:20:16+00:00 2022-11-22T01:01:17Z Seoul, South Korea [Corporate Functions]
... ... ... ... ... ... ... ...
365 Software Engineer (L4/L5) - Content Engineering [Core Engineering, Studio Technologies] 77239837 2022-11-23T07:20:31+00:00 2021-04-22T07:46:29Z Mexico City, Mexico [Product]
366 Distributed Systems Engineer (L5) - Data Platform [Data Platform] 201740355 2022-11-23T07:20:31+00:00 2021-03-12T22:18:57Z Remote, United States [Product]
367 Senior Research Scientist, Computer Graphics / Computer Vision / Machine Learning [Data Science and Engineering] 227665988 2022-11-23T07:20:31+00:00 2021-02-04T18:54:10Z Los Gatos, California [Product]
368 Counsel, Content - Japan [Legal and Public Policy] 228338138 2022-11-23T07:20:31+00:00 2020-11-12T03:08:04Z Tokyo, Japan [Corporate Functions]
369 Associate, FP&A [Financial Planning and Analysis] 46317422 2022-11-23T07:20:31+00:00 2017-12-26T19:38:32Z Los Angeles, California [Corporate Functions]
370 rows × 7 columns
For each job, the url would be https://jobs.netflix.com/jobs/{external_id}
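If you also want a ready-made link column in the dataframe, a one-line sketch (assuming external_id is present for every posting):

big_df['url'] = 'https://jobs.netflix.com/jobs/' + big_df['external_id'].astype(str)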

Using beautifulsoup4, avoid an AttributeError on a non-existing element

Good time of day!
While working on a scraping project, I have run into some issues.
Currently I am working on this draft:
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
import requests
from bs4 import BeautifulSoup
import time
import random

# driver path
PATH = "C:\Program Files (x86)\chromedriver"
BASE_URL = "https://www.immoweb.be/en/classified/house/for-sale/ottignies/1340/9308167"
driver = webdriver.Chrome(PATH)
driver.implicitly_wait(30)
driver.get(BASE_URL)
time.sleep(random.uniform(3.0, 5.0))
btn = driver.find_elements_by_xpath('//*[@id="uc-btn-accept-banner"]')[0]
btn.click()

r = requests.get(BASE_URL)
soup = BeautifulSoup(r.content, "html.parser")

def reader(url):
    ls = list()
    ImmoWebCode = url.find(class_="classified__information--immoweb-code").text.strip()
    Price = url.find("p", class_="classified__price").find("span", class_="sr-only").text.strip()
    Locality = url.find(class_="classified__information--address-row").find("span").text.strip()
    HouseType = url.find(class_="classified__title").text.strip()
    LivingArea = url.find("th", text="Living area").find_next(class_="classified-table__data").next_element.strip()
    RoomsNumber = url.find("th", text="Bedrooms").find_next(class_="classified-table__data").next_element.strip()
    Kitchen = url.find("th", text="Kitchen type").find_next(class_="classified-table__data").next_element.strip()
    TerraceOrientation = url.find("th", text="Terrace orientation").find_next(class_="classified-table__data").next_element.strip()
    TerraceArea = url.find("th", text="Terrace").find_next(class_="classified-table__data").next_element.strip()
    Furnished = url.find("th", text="Furnished").find_next(class_="classified-table__data").next_element.strip()
    ls.append(Furnished)
    OpenFire = url.find("th", text="How many fireplaces?").find_next(class_="classified-table__data").next_element.strip()
    GardenOrientation = url.find("th", text="Garden orientation").find_next(class_="classified-table__data").next_element.strip()
    ls.append(GardenOrientation)
    GardenArea = url.find("th", text="Garden surface").find_next(class_="classified-table__data").next_element.strip()
    PlotSurface = url.find("th", text="Surface of the plot").find_next(class_="classified-table__data").next_element.strip()
    ls.append(PlotSurface)
    FacadeNumber = url.find("th", text="Number of frontages").find_next(class_="classified-table__data").next_element.strip()
    SwimmingPoool = url.find("th", text="Swimming pool").find_next(class_="classified-table__data").next_element.strip()
    StateOfTheBuilding = url.find("th", text="Building condition").find_next(class_="classified-table__data").next_element.strip()
    return ls

print(reader(soup))
I start facing issues when the code reaches "Locality": I receive Exception has occurred: AttributeError: 'NoneType' object has no attribute 'find', even though the mentioned element is clearly present in the HTML code. I am adamant that it is a syntax issue, but I cannot put my finger on it.
That brings me to my second question:
Since this code will be running on multiple pages, and some of those pages might not have the requested elements, how can I fall back to a None value when that happens?
Thank you very much in advance!
Source Code:
<div class="classified__header-secondary-info classified__informations"><p class="classified__information--property">
3 bedrooms
<span aria-hidden="true">|</span>
199
<span class="abbreviation"><span aria-hidden="true">
m² </span> <span class="sr-only">
square meters </span></span></p> <div class="classified__information--financial"><!----> <!----> <span class="mortgage-banner__text">Request your mortgage loan</span></div> <div class="classified__information--address"><p><span class="classified__information--address-row"><span>
1340
</span> <span aria-hidden="true">—</span> <span>
Ottignies
</span>
|
</span> <button class="button button--text button--size-small classified__information--address-button">
Ask for the exact address
</button></p></div> <div class="classified__information--immoweb-code">
Immoweb code : 9308167
</div></div>
Isn't that data all within the <table> tags of the site? You can just use pandas:
import requests
import pandas as pd
url = 'https://www.immoweb.be/en/classified/house/for-sale/ottignies/1340/9308167'
headers= {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36'}
response = requests.get(url,headers=headers)
dfs = pd.read_html(response.text)
df = pd.concat(dfs).dropna().reset_index(drop=True)
df = df.pivot(index=None, columns=0,values=1).bfill().iloc[[0],:]
Output:
print(df.to_string())
0 Address As built plan Available as of Available date Basement Bathrooms Bedrooms CO₂ emission Construction year Double glazing Elevator Energy class External reference Furnished Garden Garden orientation Gas, water & electricity Heating type Investment property Kitchen type Land is facing street Living area Living room surface Number of frontages Outdoor parking spaces Price Primary energy consumption Reference number of the EPC report Shower rooms Surface of the plot Toilets Website Yearly theoretical total energy consumption
0 Grand' Route 69 A 1435 - Corbais No To be defined December 31 2022 - 12:00 AM Yes 1 3 Not specified 2020 Yes Yes Not specified 8566 - 4443 No Yes East Yes Gas No Installed No 199 m² square meters 48 m² square meters 2 1 € 410,000 410000 € Not specified Not specified 1 150 m² square meters 3 http://www.gilmont.be Not specified
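On the second part of the question above (falling back to None instead of raising AttributeError when an element is missing), here is a minimal sketch of a guard helper if you stay with the BeautifulSoup approach; safe_text is a made-up name, and soup is the BeautifulSoup object from the question:

def safe_text(tag):
    # Return the stripped text of a BeautifulSoup result, or None if the element was not found.
    return tag.text.strip() if tag is not None else None

# Each lookup now degrades to None instead of raising AttributeError.
locality_row = soup.find(class_="classified__information--address-row")
Locality = safe_text(locality_row.find("span")) if locality_row is not None else None

kitchen_th = soup.find("th", text="Kitchen type")
Kitchen = safe_text(kitchen_th.find_next(class_="classified-table__data")) if kitchen_th is not None else None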

Can't parse all the next page links from a website using requests

I'm trying to get all the links by traversing the next pages of this website. My script below can parse the next-page links up to 10. However, I can't go past the link shown as 10 at the bottom of that page.
I've tried with:
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

base = 'https://www.icab.es'
link = 'https://www.icab.es/?go=eaf9d1a0ec5f1dc58757ad6cffdacedb1a58854a600312cc82c494d2c55856f1e25c06b4b6fcee5ddabebfe2d30057589a86e9750b459e9d60598cc6e5c52a4697030b2b8921f29f'

with requests.Session() as s:
    s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1; ) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36'
    p = 1
    while True:
        r = s.get(link)
        soup = BeautifulSoup(r.text, "lxml")
        """some data I can fetch myself from current pages, so ignore this portion"""
        p += 1
        next_page = soup.select_one(f"a[title='{p}']")
        if next_page:
            link = urljoin(base, next_page.get("href"))
            print("next page:", link)
        else:
            break
How can I get all the next page links from the website above?
PS: Selenium is not an option I would like to resort to.
You only need to get the href of the "»" (Següent) link when (p-1) % 10 == 0; otherwise you can keep selecting the numbered page links.
Code:
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

base = 'https://www.icab.es'
link = 'https://www.icab.es/?go=eaf9d1a0ec5f1dc58757ad6cffdacedb1a58854a600312cc82c494d2c55856f1e25c06b4b6fcee5ddabebfe2d30057589a86e9750b459e9d60598cc6e5c52a4697030b2b8921f29f'

with requests.Session() as s:
    s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1; ) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36'
    p = 1
    while True:
        r = s.get(link)
        soup = BeautifulSoup(r.text, "lxml")
        """some data I can fetch myself from current pages, so ignore this portion"""
        p += 1
        if not ((p - 1) % 10):
            next_page = soup.select_one("a[title='Següent']")
        else:
            next_page = soup.select_one(f"a[title='{p}']")
        if next_page:
            link = urljoin(base, next_page.get("href"))
            print("page", next_page.text, link)
        else:
            break  # stop once neither a numbered link nor the 'Següent' link is found
Result (the page printed as "»" can be considered page 11):
D:\python37\python.exe E:/work/Compile/python/python_project/try.py
page 2 https://www.icab.es/?go=eaf9d1a0ec5f1dc58757ad6cffdacedb1a58854a600312cc7aa5373444141c64ddd1e9bee4bbc2e02ab2de1dfe88d7a8623a04a8617c3a28f3f17b03d0448cd1d399689a629d00f53e570c691ad10f5cfba28f6f9ee8d48ddb9b701d116d7c2a6d4ea403cef5d996fcb28a12b9f7778cd7521cfdf3d243cb2b1f3de9dfe304a10437e417f6c68df79efddd721f2ab8167085132c5e745958a3a859b9d9f04b63e402ec6e8ae29bee9f4791fed51e5758ae33460e9a12b6d73f791fd118c0c95180539f1db11c86a7ab97b31f94fb84334dce6867d519873cc3b80e182ff0b778
page 3 https://www.icab.es/?go=eaf9d1a0ec5f1dc58757ad6cffdacedb1a58854a600312cc7aa5373444141c64cc111efb5ef5c42ecde14350e1f5a2e0e8db36a9258d5c95496c2450f44ccd8c4074edb392b09c03e136988ff15e707fa0e01d1ee6c3198a166e2677b4b418e0b07cafd4d98a19364077e7ed2ea0341001481d8b9622a969a524a487e7d69f6b571f2cb03c2277ecd858c68a7848a0995c1c0e873d705a72661b69ab39b253bb775bc6f7f6ae3df2028114735a04dcb8043775e73420cb40c4a5eccb727438ea225b582830ce84eb959753ded1b3eb57a14b283c282caa7ad04626be8320b4ab
page 4 https://www.icab.es/?go=eaf9d1a0ec5f1dc58757ad6cffdacedb1a58854a600312cc7aa5373444141c64d8e9a9d04523d43bfb106098548163bfec74e190632187d135f2a0949b334acad719ad7c326481a43dfc6f966eb038e0a5a178968601ad0681d586ffc8ec21e414628f96755116e65b7962dfcf3a227fc1053d17701937d4f747b94c273ce8b9ccec178386585075c17a4cb483c45b85c1209329d1251767b8a0b4fa29969cf6ad42c7b04fcc1e64b9defd528753677f56e081e75c1cbc81d1f4cc93adbde29d06388474671abbab246160d0b3f03a17d1db2c6cd6c6d7a243d872e353200a35
page 5 https://www.icab.es/?go=eaf9d1a0ec5f1dc58757ad6cffdacedb1a58854a600312cc7aa5373444141c643ba4bcf6634af323cf239c7ccf7eca827c3a245352a03532a91c0ced15db81dcfc52b6dfa69853a68cb320e29ca25e25fac3da0e85667145375c3fa1541d80b1b056c03c02400220223ad5766bd1a4824171188fd85a5412b59bd48fe604451cbd56d763be67b50e474befa78340d625d222f1bb6b337d8d2b335d1aa7d0374b1be2372e77948f22a073e5e8153c32202a219ed2ef3f695b5b0040ded1ca9c4a03462b5937182c004a1a425725d3d20a10b41fd215d551abf10ef5e8a76ace4f
page 6 https://www.icab.es/?go=eaf9d1a0ec5f1dc58757ad6cffdacedb1a58854a600312cc7aa5373444141c64418cdf5c38c01a1ac019cc46242eb9ba25f012f2e4bee18a2864a19dde58d6ee2ae93254aff239c70b7019526af1a435e0e89a7c81dc4842e365163d8f9e571ae4fc8b0fc7455f573abee020e21207a604f3d6b7c2015c300a7b1dbc75980b435bb1904535bed2610771fee5e3338a79fad6d024ec2684561c3376463b2cacc00a99659918b41a12c92233bca3eaa1e003dbb0a094b787244ef3c33688b4382f89ad64a92fa8b738dd810b6e32a087564a8db2422c5b2013e9103b1b57b4248d
page 7 https://www.icab.es/?go=eaf9d1a0ec5f1dc58757ad6cffdacedb1a58854a600312cc7aa5373444141c64f96d66b04d442c09e3b891f2a5f3fb235c1aa2786face046665066db9a63e7ca4523e5cf28f4f17898204642a7d5ef3f8474ecd5bf58b252944d148467b495ad2450ea157ce8f606d4b9a6bc2ac04bec3a666757eac42cbea0737e8191b0d375425e11c76990c82246cfb9cbe94daa46942d824ff9f144f6b51c768b50c3e35acfa81e5ebf95bcb5200f5b505595908907c99b8d59893724eb16d5694d36cd30d8a15744af616675b2d6a447c10b79ca04814aece8c7ab4d878b7955cd5cd9ef
page 8 https://www.icab.es/?go=eaf9d1a0ec5f1dc58757ad6cffdacedb1a58854a600312cc7aa5373444141c64d1c210208efbd4630d44a5a92d37e5eabccba6abf83c2584404a24d08e0ad738be3598a03bbec8975275b06378cc5f7c55a9b700eb5bd4ee243a3c30f781512c0ebd23800890cb150621caab21a7a879639331b369d92bb9668815465f5d3b6c061daa011784909fc09af75ab705612ba504b4c268b43f8a029e840b8c69531423e8b5e8fe91d7cc628c309ffb633e233932b7c1b57c5cf0a2f2f47618bca4837ce355f34ae154565b447cfffcecb66458d19e5e5f3547f6916cd1c30baec1a7
page 9 https://www.icab.es/?go=eaf9d1a0ec5f1dc58757ad6cffdacedb1a58854a600312cc7aa5373444141c6415c187c4ac2cf9d4c982984e1b56faf34a31272a90b11557d1651ad92b01a2ecd3c719cfe78863f99e31b0fc1b6bc7b09e1e0e585ebdc0b04fc9dca8744bb66e8af86d65b39827f1265a82aea0286376456ccfa9cce638d72c494db2391127979eed3d349d725f2e60e2629512c388738fc26b1c9f16a2b478862469835474b305f1300c0aa53c2c4033e4b0967a542079915e30bb18418eb79a47a292ed835dd54689c1fd9ceda898678e7114fa95d559b55367e6f7f9d1ce3fb5ebb5d479c5
page 10 https://www.icab.es/?go=eaf9d1a0ec5f1dc58757ad6cffdacedb1a58854a600312cc7aa5373444141c644ab59a0b943deffee8845c2521afef0ea3ff2d96cc2b65a071da1201603b54b15b5c4363e92285c60dffd0e893ba6a58ff528fb3278db8e746697dc8712936a560a3da8085e3dcab05949afecddaced326332986240624575c6b7f104182a8c57718ec62e728d8eaa886a611ad55e0d3dd0c1ba59b47cf89d1bd5b000f9fbc5bd7d6310742a53eedfa44383d62145c28ebcf9f180ca49a3616fcfaf7ecaaa0b2f7183fc1d10d18e0062613e73f9077d11a1dfaf044990c200ac10aac4f7cb332
page » https://www.icab.es/?go=eaf9d1a0ec5f1dc58757ad6cffdacedb1a58854a600312cc7aa5373444141c64ff2c69157ff5cf4b8ccbc2674205f4fb3048dc10db0c7cb36c42fbc59aaa972b9fab70578ff58757fae7a1f1ca17076dfddb919cf92389ba66c8de7f6ea9ec08277b0228f8bd14ea82409ff7e5a051ea58940736b475c6f75c7eba096b711812ed5b6b8454ec11145b0ce10191a38068c6ca7e7c64a86b4c71819d55b3ab34233e9887c7bfa05f9f8bc488cb0986fb2680b8cb9278a437e7c91c7b9d15426e159c30c6c2351ed300925ef1b24bbf2dbf60cf9dea935d179235ed46640d2b0b54
page 12 https://www.icab.es/?go=eaf9d1a0ec5f1dc58757ad6cffdacedb1a58854a600312cc7aa5373444141c64346907383e54eae9d772c10d3600822205ff9b81665ff0f58fd876b4e0d9aeb6e0271904c5251d9cf6eb1fdd1ea16f8ea3f42ad3db66678bc538c444e0e5e4064946826aaf85746b3f87fb436d83a8eb6d6590c25dc7f208a16c1db7307921d79269591e036fed1ec78ec7351227f925a32d4d08442b9fd65b02f6ef247ca5f713e4faffe994bf26a14c2cb21268737bc2bc92bb41b3e3aaa05de10da4e38de3ab725adb5560eee7575cdf6d51d59870efacc1b9553609ae1e16ea25e6d6e9e6
page 13 https://www.icab.es/?go=eaf9d1a0ec5f1dc58757ad6cffdacedb1a58854a600312cc7aa5373444141c64afc9149ba3dadd6054f6d8629d1c750431a15f9c4048195cfc2823f61f6cfd1f2e4f78eb835829db8e7c88279bf3a38788d8feaf5327f1b42d863bba24d893ea5e033510dc2e0579474ac7efcc1915438eacb83f2a3b5416e64e3beb726d721eb79f55082be0371414ccd132e95cd53339cf7a8d6ec15b72595bf87107d082c9db7bba6cf45b8cfe7a9352abe2f289ae8591afcfd78e17486c25e94ea57c00e290613a18a8b991def7e1cd4cae517a4ee1b744036336fbc68b657cd33cc4c949
I had problems with SSL, so I changed the default ssl_context for this site:
import ssl
import requests
import requests.adapters
from bs4 import BeautifulSoup

# adapted from https://stackoverflow.com/questions/42981429/ssl-failure-on-windows-using-python-requests/50215614
class SSLContextAdapter(requests.adapters.HTTPAdapter):
    def init_poolmanager(self, *args, **kwargs):
        ssl_context = ssl.create_default_context()
        # Sets up old and insecure TLSv1.
        ssl_context.options &= ~ssl.OP_NO_TLSv1_3 & ~ssl.OP_NO_TLSv1_2 & ~ssl.OP_NO_TLSv1_1
        ssl_context.minimum_version = ssl.TLSVersion.TLSv1
        kwargs['ssl_context'] = ssl_context
        return super(SSLContextAdapter, self).init_poolmanager(*args, **kwargs)

base = 'https://www.icab.es'
link = 'https://www.icab.es/?go=eaf9d1a0ec5f1dc58757ad6cffdacedb1a58854a600312cc82c494d2c55856f1e25c06b4b6fcee5ddabebfe2d30057589a86e9750b459e9d60598cc6e5c52a4697030b2b8921f29f'

with requests.session() as s:
    s.mount('https://www.icab.es', SSLContextAdapter())
    p = 1
    while True:
        print('Page {}..'.format(p))
        # r = urllib.request.urlopen(link, context=ssl_context)
        r = s.get(link)
        soup = BeautifulSoup(r.content, "lxml")
        for li in soup.select('li.principal'):
            print(li.get_text(strip=True))
        p += 1
        link = soup.select_one('a[title="{}"]'.format(p))
        if not link:
            link = soup.select_one('a[title="Següent"]')
        if not link:
            break
        link = base + link['href']
Prints:
Page 1..
Sr./Sra. Martínez Gòmez, Marc
Sr./Sra. Eguinoa de San Roman, Roman
Sr./Sra. Morales Santiago, Maria Victoria
Sr./Sra. Bengoa Tortajada, Javier
Sr./Sra. Moralo Rodríguez, Xavier
Sr./Sra. Romagosa Huerta, Marta
Sr./Sra. Peña Moncho, Juan
Sr./Sra. Piñana Morera, Roman
Sr./Sra. Millán Sánchez, Antonio
Sr./Sra. Martínez Mira, Manel
Sr./Sra. Montserrat Rincón, Anna
Sr./Sra. Fernández Paricio, Maria Teresa
Sr./Sra. Ruiz Macián- Dagnino, Claudia
Sr./Sra. Barba Ausejo, Pablo
Sr./Sra. Bruna de Quixano, Jose Luis
Sr./Sra. Folch Estrada, Fernando
Sr./Sra. Gracia Castellón, Sonia
Sr./Sra. Sales Valls, Gemma Elena
Sr./Sra. Pastor Giménez-Salinas, Adolfo
Sr./Sra. Font Jané, Àlvar
Sr./Sra. García González, Susana
Sr./Sra. Garcia-Tornel Florensa, Xavier
Sr./Sra. Marín Granados, Alejandra
Sr./Sra. Albero Jové, José María
Sr./Sra. Galcerà Margalef, Montserrat
Page 2..
Sr./Sra. Chimenos Minguella, Sergi
Sr./Sra. Lacasta Casado, Ramón
Sr./Sra. Alcay Morandeira, Carlos
Sr./Sra. Ribó Massó, Ignacio
Sr./Sra. Fitó Baucells, Antoni
Sr./Sra. Paredes Batalla, Patricia
Sr./Sra. Prats Viñas, Francesc
Sr./Sra. Correig Ferré, Gerard
Sr./Sra. Subirana Freixas, Alba
Sr./Sra. Álvarez Crexells, Juan
Sr./Sra. Glaser Woloschin, Joan Nicolás
Sr./Sra. Nel-lo Padro, Francesc Xavier
Sr./Sra. Oliveras Dalmau, Rosa Maria
Sr./Sra. Badia Piqué, Montserrat
Sr./Sra. Fuentes-Lojo Rius, Alejandro
Sr./Sra. Argemí Delpuy, Marc
Sr./Sra. Espinoza Carrizosa, Pina
Sr./Sra. Ges Clot, Carla
Sr./Sra. Antón Tuneu, Beatriz
Sr./Sra. Schroder Vilalta, Andrea
Sr./Sra. Belibov, Mariana
Sr./Sra. Sole Lopez, Silvia
Sr./Sra. Reina Pardo, Luis
Sr./Sra. Cardenal Lagos, Manel Josep
Sr./Sra. Bru Galiana, David
...and so on.

Scraping Yahoo Finance with Python3

I'm a complete newbie at scraping, I'm trying to scrape https://fr.finance.yahoo.com, and I can't figure out what I'm doing wrong.
My goal is to scrape the index name, current level and the change (both in value and in %).
Here is the code I have used:
import urllib.request
from bs4 import BeautifulSoup
url = 'https://fr.finance.yahoo.com'
request = urllib.request.Request(url)
html = urllib.request.urlopen(request).read()
soup = BeautifulSoup(html,'html.parser')
main_table = soup.find("div",attrs={'data-reactid':'12'})
print(main_table)
links = main_table.find_all("li", class_=' D(ib) Bxz(bb) Bdc($seperatorColor) Mend(16px) BdEnd ')
print(links)
However, the print(links) comes out empty. Could someone please assist? Any help would be highly appreciated as I have been trying to figure this out for a few days now.
Although the better way to get all the fields is to parse and process the relevant script tag, this is one of the ways you can get all of them.
import requests
import pandas as pd
from bs4 import BeautifulSoup

url = 'https://fr.finance.yahoo.com/'
r = requests.get(url, headers={"User-Agent": "Mozilla/5.0"})
soup = BeautifulSoup(r.text, 'html.parser')

df = pd.DataFrame(columns=['Index Name', 'Current Level', 'Value', 'Percentage Change'])
for item in soup.select("[id='market-summary'] li"):
    index_name = item.select_one("a").contents[1]
    current_level = ''.join(item.select_one("a > span").text.split())
    value = ''.join(item.select_one("a")['aria-label'].split("ou")[1].split("points")[0].split())
    percentage_change = ''.join(item.select_one("a > span + span").text.split())
    # note: DataFrame.append was removed in pandas 2.0; use pd.concat on newer versions
    df = df.append({'Index Name': index_name, 'Current Level': current_level, 'Value': value, 'Percentage Change': percentage_change}, ignore_index=True)
print(df)
The output looks like:
Index Name Current Level Value Percentage Change
0 CAC 40 4444,56 -0,88 -0,02%
1 Euro Stoxx 50 2905,47 0,49 +0,02%
2 Dow Jones 24438,63 -35,49 -0,15%
3 EUR/USD 1,0906 -0,0044 -0,40%
4 Gold future 1734,10 12,20 +0,71%
5 BTC-EUR 8443,23 161,79 +1,95%
6 CMC Crypto 200 185,66 4,42 +2,44%
7 Pétrole WTI 33,28 -0,64 -1,89%
8 DAX 11073,87 7,94 +0,07%
9 FTSE 100 5993,28 -21,97 -0,37%
10 Nasdaq 9315,26 30,38 +0,33%
11 S&P 500 2951,75 3,24 +0,11%
12 Nikkei 225 20388,16 -164,15 -0,80%
13 HANG SENG 22930,14 -1349,89 -5,56%
14 GBP/USD 1,2177 -0,0051 -0,41%
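The script-tag route mentioned at the top of this answer can look roughly like the sketch below; the root.App.main variable name and the regex are assumptions about how the page embedded its state at the time, so expect to adjust them:

import json
import re
import requests

r = requests.get('https://fr.finance.yahoo.com/', headers={"User-Agent": "Mozilla/5.0"})
# Try to pull the JSON blob the page ships inside a <script> tag.
m = re.search(r'root\.App\.main\s*=\s*(\{.*\});', r.text)
if m:
    try:
        data = json.loads(m.group(1))
        print(list(data.keys()))  # inspect the top-level keys to locate the market summary
    except ValueError:
        print("matched, but the JSON did not parse cleanly; refine the regex")
else:
    print("pattern not found; the page layout has probably changed")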
I think you need to fix your element selection.
For example, the following code:
import urllib.request
from bs4 import BeautifulSoup
url = 'https://fr.finance.yahoo.com'
request = urllib.request.Request(url)
html = urllib.request.urlopen(request).read()
soup = BeautifulSoup(html,'html.parser')
main_table = soup.find(id="market-summary")
links = main_table.find_all("a")
for i in links:
    print(i.attrs["aria-label"])
This gives output text containing the index name, % change, change and value:
CAC 40 a augmenté de 0,37 % ou 16,55 points pour atteindre 4 461,99 points
Euro Stoxx 50 a augmenté de 0,28 % ou 8,16 points pour atteindre 2 913,14 points
Dow Jones a diminué de -0,63 % ou -153,98 points pour atteindre 24 320,14 points
EUR/USD a diminué de -0,49 % ou -0,0054 points pour atteindre 1,0897 points
Gold future a augmenté de 0,88 % ou 15,10 points pour atteindre 1 737,00 points
a augmenté de 1,46 % ou 121,30 points pour atteindre 8 402,74 points
CMC Crypto 200 a augmenté de 1,60 % ou 2,90 points pour atteindre 184,14 points
Pétrole WTI a diminué de -3,95 % ou -1,34 points pour atteindre 32,58 points
DAX a augmenté de 0,29 % ou 32,27 points pour atteindre 11 098,20 points
FTSE 100 a diminué de -0,39 % ou -23,18 points pour atteindre 5 992,07 points
Nasdaq a diminué de -0,30 % ou -28,25 points pour atteindre 9 256,63 points
S&P 500 a diminué de -0,43 % ou -12,62 points pour atteindre 2 935,89 points
Nikkei 225 a diminué de -0,80 % ou -164,15 points pour atteindre 20 388,16 points
HANG SENG a diminué de -5,56 % ou -1 349,89 points pour atteindre 22 930,14 points
GBP/USD a diminué de -0,34 % ou -0,0041 points pour atteindre 1,2186 points
Try the following CSS selector to get all the links.
import urllib.request
from bs4 import BeautifulSoup

url = 'https://fr.finance.yahoo.com'
request = urllib.request.Request(url)
html = urllib.request.urlopen(request).read()
soup = BeautifulSoup(html, 'html.parser')
links = [link['href'] for link in soup.select("ul#market-summary a")]
print(links)
Output:
['/quote/^FCHI?p=^FCHI', '/quote/^STOXX50E?p=^STOXX50E', '/quote/^DJI?p=^DJI', '/quote/EURUSD=X?p=EURUSD=X', '/quote/GC=F?p=GC=F', '/quote/BTC-EUR?p=BTC-EUR', '/quote/^CMC200?p=^CMC200', '/quote/CL=F?p=CL=F', '/quote/^GDAXI?p=^GDAXI', '/quote/^FTSE?p=^FTSE', '/quote/^IXIC?p=^IXIC', '/quote/^GSPC?p=^GSPC', '/quote/^N225?p=^N225', '/quote/^HSI?p=^HSI', '/quote/GBPUSD=X?p=GBPUSD=X']
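These hrefs are relative, so a small follow-up sketch (reusing links from the snippet above) to turn them into absolute URLs:

from urllib.parse import urljoin

absolute_links = [urljoin('https://fr.finance.yahoo.com', href) for href in links]
print(absolute_links)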

Python understand accents like ( ^ ' ç º)

I'm creating a Python script; this is the part I'm having problems with. It simply takes the titles of the posts on a webpage.
Python does not understand the accents, and I've tried everything I know:
1 - putting this code on the first line: # -*- coding: utf-8 -*-
2 - putting .encode("utf-8")
code:
# -*- coding: utf-8 -*-
import re
import requests

def opena(url):
    headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36'}
    lexdan1 = requests.get(url, headers=headers)
    lexdan2 = lexdan1.text
    lexdan1.close
    return lexdan2

dan = []
a = opena('http://www.megafilmesonlinehd.com/filmes-lancamentos')
d = re.compile('<strong class="tt-filme">(.+?)</strong>').findall(a)
for name in d:
    name = name.encode("utf-8")
    dan.append(name)
print dan
This is what I got:
['Porta dos Fundos: Contrato Vital\xc3\xadcio HD 720p', 'Os 28 Homens de Panfilov Legendado HD', 'Estrelas Al\xc3\xa9m do Tempo Dublado', 'A Volta do Ju\xc3\xadzo Final Dublado Full HD 1080p', 'The Love Witch Legendado HD', 'Manchester \xc3\x80 Beira-Mar Legendado', 'Semana do P\xc3\xa2nico Dublado HD 720p', 'At\xc3\xa9 o \xc3\x9altimo Homem Legendado HD 720p', 'Arbor Demon Legendado HD 720p', 'Esquadr\xc3\xa3o de Elite Dublado Full HD 1080p', 'Ouija Origem do Mal Dublado Full HD 1080p', 'As Muitas Mulheres da Minha Vida Dublado HD 720p', 'Um Novo Desafio para Callan e sua Equipe Dublado Full HD 1080p', 'Terror Herdado Dublado DVDrip', 'Officer Downe Legendado HD', 'N\xc3\xa3o Bata Duas Vezes Legendado HD', 'Eu, Daniel Blake Legendado HD', 'Sangue Pela Gl\xc3\xb3ria Legendado', 'Quase 18 Legendado HD 720p', 'As Aventuras de Robinson Cruso\xc3\xa9 Dublado Full HD 1080p', 'Indigna\xc3\xa7\xc3\xa3o Dublado HD 720p']
Because you're telling the interpreter to print a list, the interpreter calls the list class's __str__ method. When you call a container's __str__ method, it uses the __repr__ method for each of the contained objects (in this case, str objects). The str type's __repr__ method doesn't convert the unicode characters, but its __str__ method (which gets called when you print an individual str object) does.
Here's a great question to help explain the difference:
Difference between __str__ and __repr__ in Python
If you print each string individually, you should get the results you want.
import re
import requests

def opena(url):
    headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36'}
    lexdan1 = requests.get(url, headers=headers)
    lexdan2 = lexdan1.text
    lexdan1.close
    return lexdan2

dan = []
a = opena('http://www.megafilmesonlinehd.com/filmes-lancamentos')
d = re.compile('<strong class="tt-filme">(.+?)</strong>').findall(a)
for name in d:
    dan.append(name)
for item in dan:
    print item
When printing a list, whatever is inside it is represented (its __repr__ method is called), not printed (its __str__ method is called):
class test():
    def __repr__(self):
        print '__repr__'
        return ''
    def __str__(self):
        print '__str__'
        return ''
will get you:
>>> a = [test()]
>>> a
[__repr__
]
>>> print a
[__repr__
]
>>> print a[0]
__str__
And the __repr__ method of a string does not convert special characters (not even \t or \n).
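A tiny Python 2 sketch of that last point, using one of the byte strings from the question's output:

# -*- coding: utf-8 -*-
s = 'Al\xc3\xa9m'  # the UTF-8 bytes for "Além"
print repr(s)      # the repr shows the escaped bytes: 'Al\xc3\xa9m'
print s            # printing writes the raw bytes; a UTF-8 terminal shows: Além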
