I am trying to collect a list of hrefs from the Netflix careers site: https://jobs.netflix.com/search. Each job listing on this site has an anchor and a class: <a class=css-2y5mtm essqqm81>. To be thorough here, the entire anchor is:
<a class="css-2y5mtm essqqm81" role="link" href="/jobs/244837014" aria-label="Manager, Written Communications"\>\
<span tabindex="-1" class="css-1vbg17 essqqm80"\>\<h4 class="css-hl3xbb e1rpdjew0"\>Manager, Written Communications\</h4\>\</span\>\</a\>
Again, the information of interest here is the hrefs of the form href="/jobs/244837014". However, when I perform the standard BS commands to read the HTML:
import urllib.request
from bs4 import BeautifulSoup

html_page = urllib.request.urlopen("https://jobs.netflix.com/search")
soup = BeautifulSoup(html_page)
I don't see any of the hrefs that I'm interested in inside of soup.
Running the following loop does not show the hrefs of interest:
for link in soup.findAll('a'):
    print(link.get('href'))
What am I doing wrong?
That information is fed into the page dynamically via XHR calls, so it is not in the initial HTML that urllib downloads. You need to scrape the API endpoint to get the job info. The following code will give you a dataframe with all jobs currently listed by Netflix:
import requests
import pandas as pd
from tqdm import tqdm  # if in Jupyter: from tqdm.notebook import tqdm

pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)
headers = {
    'referer': 'https://jobs.netflix.com/search',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36'
}
big_df = pd.DataFrame()
s = requests.Session()
s.headers.update(headers)
# the API paginates results; pull the first 19 pages of postings
for x in tqdm(range(1, 20)):
    url = f'https://jobs.netflix.com/api/search?page={x}'
    r = s.get(url)
    df = pd.json_normalize(r.json()['records']['postings'])
    big_df = pd.concat([big_df, df], axis=0, ignore_index=True)

print(big_df[['text', 'team', 'external_id', 'updated_at', 'created_at', 'location', 'organization']])
Result:
100% 19/19 [00:29<00:00, 1.42s/it]
text team external_id updated_at created_at location organization
0 Events Manager - SEA [Publicity] 244936062 2022-11-23T07:20:16+00:00 2022-11-23T04:47:29Z Bangkok, Thailand [Marketing and PR]
1 Manager, Written Communications [Publicity] 244837014 2022-11-23T07:20:16+00:00 2022-11-22T17:30:06Z Los Angeles, California [Marketing and Publicity]
2 Manager, Creative Marketing - Korea [Marketing] 244740829 2022-11-23T07:20:16+00:00 2022-11-22T07:39:56Z Seoul, South Korea [Marketing and PR]
3 Administrative Assistant - Philippines [Netflix Technology Services] 244683946 2022-11-23T07:20:16+00:00 2022-11-22T01:26:08Z Manila, Philippines [Corporate Functions]
4 Associate, Studio FP&A - APAC [Finance] 244680097 2022-11-23T07:20:16+00:00 2022-11-22T01:01:17Z Seoul, South Korea [Corporate Functions]
... ... ... ... ... ... ... ...
365 Software Engineer (L4/L5) - Content Engineering [Core Engineering, Studio Technologies] 77239837 2022-11-23T07:20:31+00:00 2021-04-22T07:46:29Z Mexico City, Mexico [Product]
366 Distributed Systems Engineer (L5) - Data Platform [Data Platform] 201740355 2022-11-23T07:20:31+00:00 2021-03-12T22:18:57Z Remote, United States [Product]
367 Senior Research Scientist, Computer Graphics / Computer Vision / Machine Learning [Data Science and Engineering] 227665988 2022-11-23T07:20:31+00:00 2021-02-04T18:54:10Z Los Gatos, California [Product]
368 Counsel, Content - Japan [Legal and Public Policy] 228338138 2022-11-23T07:20:31+00:00 2020-11-12T03:08:04Z Tokyo, Japan [Corporate Functions]
369 Associate, FP&A [Financial Planning and Analysis] 46317422 2022-11-23T07:20:31+00:00 2017-12-26T19:38:32Z Los Angeles, California [Corporate Functions]
370 rows × 7 columns
For each job, the url would be https://jobs.netflix.com/jobs/{external_id}
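Since the question was specifically after the hrefs, here is a minimal follow-up sketch that builds them from the external_id column of the dataframe above:
# build relative hrefs and absolute URLs from the API's external_id field
big_df['href'] = '/jobs/' + big_df['external_id'].astype(str)
big_df['url'] = 'https://jobs.netflix.com' + big_df['href']
print(big_df[['text', 'href', 'url']].head())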
I'm trying to get all the data from all pages. I used a counter, cast it to a string to put the page number into the URL, and then looped over that counter, but I always get the same result.
This is my code:
# Scraping job offers from the Hello Work website
# import libraries
import random
import requests
import csv
from bs4 import BeautifulSoup
from datetime import date
# configure a random user agent for a Mozilla/Firefox browser
user_agents = [
    "Mozilla/5.0 (Windows NT 10.0; rv:91.0) Gecko/20100101 Firefox/91.0",
    "Mozilla/5.0 (Windows NT 10.0; rv:78.0) Gecko/20100101 Firefox/78.0",
    "Mozilla/5.0 (X11; Linux x86_64; rv:95.0) Gecko/20100101 Firefox/95.0"
]
random_user_agent = random.choice(user_agents)
headers = {'User-Agent': random_user_agent}
Here is where I use my counter:
i = 0
for i in range(1, 15):
    url = 'https://www.hellowork.com/fr-fr/emploi/recherche.html?p=' + str(i)
    print(url)
    page = requests.get(url, headers=headers)
    if (page.status_code == 200):
        soup = BeautifulSoup(page.text, 'html.parser')
        jobs = soup.findAll('div', class_=' new action crushed hoverable !tw-p-4 md:!tw-p-6 !tw-rounded-2xl')
        # config csv
        csvfile = open('jobList.csv', 'w+', newline='')
        row_list = []  # to append list of jobs
        try:
            writer = csv.writer(csvfile)
            writer.writerow(["ID", "Job Title", "Company Name", "Contract type", "Location", "Publish time", "Extract Date"])
            for job in jobs:
                id = job.get('id')
                jobtitle = job.find('h3', class_='!tw-mb-0').a.get_text()
                companyname = job.find('span', class_='tw-mr-2').get_text()
                contracttype = job.find('span', class_='tw-w-max').get_text()
                location = job.find('span', class_='tw-text-ellipsis tw-whitespace-nowrap tw-block tw-overflow-hidden 2xsOld:tw-max-w-[20ch]').get_text()
                publishtime = job.find('span', class_='md:tw-mt-0 tw-text-xsOld').get_text()
                extractdate = date.today()
                row_list = [[id, jobtitle, companyname, contracttype, location, publishtime, extractdate]]
                writer.writerows(row_list)
        finally:
            csvfile.close()
In newer code, avoid the old findAll() syntax; use find_all() or select() with CSS selectors instead. For more, take a minute to check the docs.
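For example, the class-based lookup from the question could be written with a CSS selector. This is just a sketch, assuming a couple of those utility classes (crushed, hoverable) are enough to identify the job cards, which can change at any time:
# select() equivalent of the findAll() call, using a subset of the classes from the question
jobs = soup.select('div.crushed.hoverable')
for job in jobs:
    print(job.get('id'))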
BeautifulSoup is not really needed here - you could get all of this information (and more) directly via the API, using a mix of requests and pandas. Check all the available information here:
https://www.hellowork.com/searchoffers/getsearchfacets?p=1
Example
import requests
import pandas as pd
from datetime import datetime

df = pd.concat(
    [
        pd.json_normalize(
            requests.get(f'https://www.hellowork.com/searchoffers/getsearchfacets?p={i}', headers={'user-agent': 'bond'}).json(),
            record_path=['Results']
        )[['ContractType', 'Localisation', 'OfferTitle', 'PublishDate', 'CompanyName']]
        for i in range(1, 15)
    ],
    ignore_index=True
)
df['extractdate'] = datetime.today().strftime('%Y-%m-%d')
df.to_csv('jobList.csv', index=False)
Output

     ContractType  Localisation             OfferTitle                                                                  PublishDate              CompanyName                      extractdate
0    CDI           Beaurepaire - 85         Chef Gérant H/F                                                             2023-01-24T16:35:15.867  Armonys Restauration - Morbihan  2023-01-24
1    CDI           Saumur - 49              Dessinateur Métallerie Débutant H/F                                         2023-01-24T16:35:14.677  G2RH                             2023-01-24
2    Franchise     Villenave-d'Ornon - 33   Courtier en Travaux de l'Habitat pour Particuliers et Professionnels H/F    2023-01-24T16:35:13.707  Elysée Concept                   2023-01-24
3    Franchise     Montpellier - 34         Courtier en Travaux de l'Habitat pour Particuliers et Professionnels H/F    2023-01-24T16:35:12.61   Elysée Concept                   2023-01-24
4    CDD           Monaco                   Spécialiste Senior Développement Matières Premières Cosmétique H/F         2023-01-24T16:35:06.64   Expectra Monaco                  2023-01-24
...
275  CDI           Brétigny-sur-Orge - 91   Magasinier - Cariste H/F                                                    2023-01-24T16:20:16.377  DELPHARM                         2023-01-24
276  CDI           Lille - 59               Technicien Helpdesk Français - Italien H/F                                  2023-01-24T16:20:16.01   Akkodis                          2023-01-24
277  CDI           Tours - 37               Conducteur PL H/F                                                           2023-01-24T16:20:15.197  Groupe Berto                     2023-01-24
278  Franchise     Nogent-le-Rotrou - 28    Courtier en Travaux de l'Habitat pour Particuliers et Professionnels H/F    2023-01-24T16:20:12.29   Elysée Concept                   2023-01-24
279  CDI           Cholet - 49              Ingénieur Assurance Qualité H/F                                             2023-01-24T16:20:10.837  Akkodis                          2023-01-24
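If you don't want to hard-code range(1, 15), here is a minimal sketch that keeps requesting pages until nothing comes back. It assumes the endpoint simply returns an empty Results list once you run past the last page, which is an assumption rather than documented behaviour:
import requests
import pandas as pd

frames = []
i = 1
while True:
    data = requests.get(
        f'https://www.hellowork.com/searchoffers/getsearchfacets?p={i}',
        headers={'user-agent': 'bond'}
    ).json()
    results = data.get('Results') or []
    if not results:  # assumption: no more pages once Results comes back empty
        break
    frames.append(pd.json_normalize(results))
    i += 1

df = pd.concat(frames, ignore_index=True)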
I am learning Python and BeautifulSoup and I am trying to do some web scraping.
Let me first describe what I am trying to do.
The wiki page: https://en.m.wikipedia.org/wiki/List_of_largest_banks
I am trying to print out this heading element:
<span class="mw-headline" id="By_market_capitalization" tabindex="0" role="button" aria-controls="content-collapsible-block-1" aria-expanded="true">By market capitalization</span>
I want to print out the text: By market capitalization
Then the text of the table of the banks. Example:

By market capitalization

Rank  Bank           Cap Rate
1     JP Morgan      466.1
2     Bank of China  300
...
all the way to 50
My code starts out like this:
from bs4 import BeautifulSoup
import requests
html_text = requests.get('https://en.wikipedia.org/wiki/List_of_largest_banks').text
soup = BeautifulSoup(html_text, 'lxml')
# text = soup.find('span', class_='mw-headline', id='By_market_capitalization').text
Ak_soup = soup.find_all('section', class_='mf-section-2 collapsible-block open-block', id='content-collapsible-block-1')
print(Ak_soup)
I believe my problem is more on the HTML side of things, but I am completely lost. I inspected the element, and the tag I believe I should be looking for is:
section class_='mf-section-2 collapsible-block open-block'
Close to your goal - find the heading, then its next table, and transform that via pandas.read_html() into a dataframe:
header = soup.select_one('h2:has(>#By_market_capitalization)')
pd.read_html(str(header.find_next('table')))[0]
or, letting pandas pick the table directly with a match string (no heading lookup needed):
pd.read_html(html_text, match='Market cap')[0]
Example
from bs4 import BeautifulSoup
import requests
import pandas as pd
html_text = requests.get('https://en.wikipedia.org/wiki/List_of_largest_banks').text
soup = BeautifulSoup(html_text, 'lxml')
header = soup.select_one('h2:has(>#By_market_capitalization)')
print(header.span.text)
print(pd.read_html(str(header.find_next('table')))[0].to_markdown(index=False))
Output

By market capitalization
|   Rank | Bank name                               | Market cap(US$ billion) |
|-------:|:----------------------------------------|:------------------------|
|      1 | JPMorgan Chase                          | 466.21[5]               |
|      2 | Industrial and Commercial Bank of China | 295.65                  |
|      3 | Bank of America                         | 279.73                  |
|      4 | Wells Fargo                             | 214.34                  |
|      5 | China Construction Bank                 | 207.98                  |
|      6 | Agricultural Bank of China              | 181.49                  |
|      7 | HSBC Holdings PLC                       | 169.47                  |
|      8 | Citigroup Inc.                          | 163.58                  |
|      9 | Bank of China                           | 151.15                  |
|     10 | China Merchants Bank                    | 133.37                  |
|     11 | Royal Bank of Canada                    | 113.80                  |
|     12 | Toronto-Dominion Bank                   | 106.61                  |
...
As you know the desired header, you can just print it directly. Then, with pandas, you can use a unique search term from the target table as a more direct selection method:
import pandas as pd
df = pd.read_html('https://en.m.wikipedia.org/wiki/List_of_largest_banks', match = 'Market cap')[0].reset_index(level = 0, drop = True)
print('By market capitalization')
print()
print(df.to_markdown(index = False))
Good time of the day!
While working on a scraping project, I have faced some issues.
Currently I am working on a draft:
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
import requests
from bs4 import BeautifulSoup
import time
import random
# driver path
PATH = r"C:\Program Files (x86)\chromedriver"
BASE_URL = "https://www.immoweb.be/en/classified/house/for-sale/ottignies/1340/9308167"

driver = webdriver.Chrome(PATH)
driver.implicitly_wait(30)
driver.get(BASE_URL)
time.sleep(random.uniform(3.0, 5.0))
btn = driver.find_elements_by_xpath('//*[@id="uc-btn-accept-banner"]')[0]
btn.click()
r = requests.get(BASE_URL)
soup = BeautifulSoup(r.content, "html.parser")
def reader(url):
    ls = list()
    ImmoWebCode = url.find(class_="classified__information--immoweb-code").text.strip()
    Price = url.find("p", class_="classified__price").find("span", class_="sr-only").text.strip()
    Locality = url.find(class_="classified__information--address-row").find("span").text.strip()
    HouseType = url.find(class_="classified__title").text.strip()
    LivingArea = url.find("th", text="Living area").find_next(class_="classified-table__data").next_element.strip()
    RoomsNumber = url.find("th", text="Bedrooms").find_next(class_="classified-table__data").next_element.strip()
    Kitchen = url.find("th", text="Kitchen type").find_next(class_="classified-table__data").next_element.strip()
    TerraceOrientation = url.find("th", text="Terrace orientation").find_next(class_="classified-table__data").next_element.strip()
    TerraceArea = url.find("th", text="Terrace").find_next(class_="classified-table__data").next_element.strip()
    Furnished = url.find("th", text="Furnished").find_next(class_="classified-table__data").next_element.strip()
    ls.append(Furnished)
    OpenFire = url.find("th", text="How many fireplaces?").find_next(class_="classified-table__data").next_element.strip()
    GardenOrientation = url.find("th", text="Garden orientation").find_next(class_="classified-table__data").next_element.strip()
    ls.append(GardenOrientation)
    GardenArea = url.find("th", text="Garden surface").find_next(class_="classified-table__data").next_element.strip()
    PlotSurface = url.find("th", text="Surface of the plot").find_next(class_="classified-table__data").next_element.strip()
    ls.append(PlotSurface)
    FacadeNumber = url.find("th", text="Number of frontages").find_next(class_="classified-table__data").next_element.strip()
    SwimmingPoool = url.find("th", text="Swimming pool").find_next(class_="classified-table__data").next_element.strip()
    StateOfTheBuilding = url.find("th", text="Building condition").find_next(class_="classified-table__data").next_element.strip()
    return ls

print(reader(soup))
I start facing issues when the code reaches "Locality": I receive Exception has occurred: AttributeError: 'NoneType' object has no attribute 'find', even though it is clear that the mentioned element is present in the HTML code. I am adamant that it is a syntax issue, but I cannot put a finger on it.
That brings me to my second question: since this code will be running on multiple pages, some of those pages might not have the requested elements. How can I put a None value in place when that happens?
Thank you very much in advance!
Source Code:
<div class="classified__header-secondary-info classified__informations"><p class="classified__information--property">
3 bedrooms
<span aria-hidden="true">|</span>
199
<span class="abbreviation"><span aria-hidden="true">
m² </span> <span class="sr-only">
square meters </span></span></p> <div class="classified__information--financial"><!----> <!----> <span class="mortgage-banner__text">Request your mortgage loan</span></div> <div class="classified__information--address"><p><span class="classified__information--address-row"><span>
1340
</span> <span aria-hidden="true">—</span> <span>
Ottignies
</span>
|
</span> <button class="button button--text button--size-small classified__information--address-button">
Ask for the exact address
</button></p></div> <div class="classified__information--immoweb-code">
Immoweb code : 9308167
</div></div>
Isn't that data all within the <table> tags of the site? You can just use pandas:
import requests
import pandas as pd
url = 'https://www.immoweb.be/en/classified/house/for-sale/ottignies/1340/9308167'
headers= {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36'}
response = requests.get(url,headers=headers)
dfs = pd.read_html(response.text)
df = pd.concat(dfs).dropna().reset_index(drop=True)
df = df.pivot(index=None, columns=0,values=1).bfill().iloc[[0],:]
Output:
print(df.to_string())
0 Address As built plan Available as of Available date Basement Bathrooms Bedrooms CO₂ emission Construction year Double glazing Elevator Energy class External reference Furnished Garden Garden orientation Gas, water & electricity Heating type Investment property Kitchen type Land is facing street Living area Living room surface Number of frontages Outdoor parking spaces Price Primary energy consumption Reference number of the EPC report Shower rooms Surface of the plot Toilets Website Yearly theoretical total energy consumption
0 Grand' Route 69 A 1435 - Corbais No To be defined December 31 2022 - 12:00 AM Yes 1 3 Not specified 2020 Yes Yes Not specified 8566 - 4443 No Yes East Yes Gas No Installed No 199 m² square meters 48 m² square meters 2 1 € 410,000 410000 € Not specified Not specified 1 150 m² square meters 3 http://www.gilmont.be Not specified
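On the second question (falling back to None when an element is missing on some pages), one hedged approach is a small helper that swallows the AttributeError; the field name below is just one taken from the draft as an example:
def safe_text(tag, *find_args, **find_kwargs):
    # return the stripped text of the first match, or None when the element is absent
    try:
        return tag.find(*find_args, **find_kwargs).text.strip()
    except AttributeError:
        return None

# usage inside reader(), e.g.:
# Locality = safe_text(url, class_="classified__information--address-row")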
I would like to create the exact same table as the one shown in the following webpage: https://247sports.com/college/penn-state/Season/2022-Football/Commits/
I am currently using Selenium and Beautiful Soup in a Google Colab notebook, because I get forbidden (403) errors when calling read_html directly. I have just started to get some output, but I only want to grab the text, not the markup surrounding it.
Here is my code so far...
from kora.selenium import wd
from bs4 import BeautifulSoup
import pandas as pd
import time
import datetime as dt
import os
import re
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
url = 'https://247sports.com/college/penn-state/Season/2022-Football/Commits/'
wd.get(url)
time.sleep(5)
soup = BeautifulSoup(wd.page_source)
school=soup.find_all('span', class_='meta')
name=soup.find_all('div', class_='recruit')
position = soup.find_all('div', class_="position")
height_weight = soup.find_all('div', class_="metrics")
rating = soup.find_all('span', class_='score')
nat_rank = soup.find_all('a', class_='natrank')
state_rank = soup.find_all('a', class_='sttrank')
pos_rank = soup.find_all('a', class_='posrank')
status = soup.find_all('p', class_='commit-date withDate')
status
...and here is my output...
[<p class="commit-date withDate"> Commit 7/25/2020 </p>,
<p class="commit-date withDate"> Commit 9/4/2020 </p>,
<p class="commit-date withDate"> Commit 1/1/2021 </p>,
<p class="commit-date withDate"> Commit 3/8/2021 </p>,
<p class="commit-date withDate"> Commit 10/29/2020 </p>,
<p class="commit-date withDate"> Commit 7/28/2020 </p>,
<p class="commit-date withDate"> Commit 9/8/2020 </p>,
<p class="commit-date withDate"> Commit 8/3/2020 </p>,
<p class="commit-date withDate"> Commit 5/1/2021 </p>]
Any assistance on this is greatly appreciated.
There's no need to use Selenium. To get a response from the website you need to specify the HTTP User-Agent header; otherwise, the website thinks that you're a bot and will block you.
To create a DataFrame see this sample:
import pandas as pd
import requests
from bs4 import BeautifulSoup
url = "https://247sports.com/college/penn-state/Season/2022-Football/Commits/"
# Add the `user-agent` otherwise we will get blocked when sending the request
headers = {
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36"
}
response = requests.get(url, headers=headers).content
soup = BeautifulSoup(response, "html.parser")
data = []
for tag in soup.find_all("li", class_="ri-page__list-item")[1:]: # `[1:]` Since the first result is a table header
school = tag.find_next("span", class_="meta").text
name = tag.find_next("a", class_="ri-page__name-link").text
position = tag.find_next("div", class_="position").text
height_weight = tag.find_next("div", class_="metrics").text
rating = tag.find_next("span", class_="score").text
nat_rank = tag.find_next("a", class_="natrank").text
state_rank = tag.find_next("a", class_="sttrank").text
pos_rank = tag.find_next("a", class_="posrank").text
status = tag.find_next("p", class_="commit-date withDate").text
data.append(
{
"school": school,
"name": name,
"position": position,
"height_weight": height_weight,
"rating": rating,
"nat_rank": nat_rank,
"state_rank": state_rank,
"pos_rank": pos_rank,
"status": status,
}
)
df = pd.DataFrame(data)
print(df.to_string())
Output:
school name position height_weight rating nat_rank state_rank pos_rank status
0 Westerville South (Westerville, OH) Kaden Saunders WR 5-10 / 172 0.9509 116 5 16 Commit 7/25/2020
1 IMG Academy (Bradenton, FL) Drew Shelton OT 6-5 / 290 0.9468 130 17 14 Commit 9/4/2020
2 Central Dauphin East (Harrisburg, PA) Mehki Flowers WR 6-1 / 190 0.9461 131 4 18 Commit 1/1/2021
3 Medina (Medina, OH) Drew Allar PRO 6-5 / 220 0.9435 138 6 8 Commit 3/8/2021
4 Manheim Township (Lancaster, PA) Anthony Ivey WR 6-0 / 190 0.9249 190 6 26 Commit 10/29/2020
5 King (Milwaukee, WI) Jerry Cross TE 6-6 / 218 0.9153 218 4 8 Commit 7/28/2020
6 Northeast (Philadelphia, PA) Ken Talley WDE 6-3 / 230 0.9069 253 9 13 Commit 9/8/2020
7 Central York (York, PA) Beau Pribula DUAL 6-2 / 215 0.8891 370 12 9 Commit 8/3/2020
8 The Williston Northampton School (Easthampton, MA) Maleek McNeil OT 6-8 / 340 0.8593 705 8 64 Commit 5/1/2021
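If you also want to keep the table around rather than just print it, you can write it out with pandas (the filename here is only an example):
df.to_csv("psu_2022_commits.csv", index=False)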
I'm trying to write some code to extract some data from transfermarkt (link here for the page I'm using). I'm stuck trying to print the clubs. I've figured out that I need to access the h2 and then the a element inside it in order to get just the text. The HTML code is below:
<div class="table-header" id="to-349"><a class="vereinprofil_tooltip" id="349" href="/fc-barnsley/transfers/verein/349/saison_id/2018"><img src="https://tmssl.akamaized.net/images/wappen/small/349.png?lm=1574162298" title=" " alt="Barnsley FC" class="" /></a><h2><a class="vereinprofil_tooltip" id="349" href="/fc-barnsley/transfers/verein/349/saison_id/2018">Barnsley FC</a></h2></div>
So you can see that if I just try find_all("a", {"class": "vereinprofil_tooltip"}) it doesn't work properly, as it also returns the image element, which has no plain text. But if I could search for h2 first and then run find_all("a", {"class": "vereinprofil_tooltip"}) within the returned h2, it would get me what I want. My code is below.
import requests
from bs4 import BeautifulSoup
headers = {'User-Agent':
'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36'}
page = "https://www.transfermarkt.co.uk/league-one/transfers/wettbewerb/GB3/plus/?saison_id=2018&s_w=&leihe=1&intern=0&intern=1"
pageTree = requests.get(page, headers=headers)
pageSoup = BeautifulSoup(pageTree.content, 'html.parser')
#Players = pageSoup.find_all("a", {"class": "spielprofil_tooltip"})
Clubs = pageSoup.find_all("h2")
Club = Clubs.find("a", {"class": "vereinprofil_tooltip"})
print(Club)
I get this error (raised in getattr):
raise AttributeError(
AttributeError: ResultSet object has no attribute 'find_all'. You're probably treating a list of elements like a single element. Did you call find_all() when you meant to call find()?
I know what the error means but I've been going round in circles trying to find a way of actually doing it properly and getting what I want. Any help is appreciated.
import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36'}
page = "https://www.transfermarkt.co.uk/league-one/transfers/wettbewerb/GB3/plus/?saison_id=2018&s_w=&leihe=1&intern=0&intern=1"
pageTree = requests.get(page, headers=headers)
pageSoup = BeautifulSoup(pageTree.content, 'html.parser')

#Players = pageSoup.find_all("a", {"class": "spielprofil_tooltip"})
Clubs = pageSoup.find_all("h2")
print(type(Clubs))  # this can be removed, but I left it to expose how I figured this out
for club in Clubs:
    print(club.text)
Basically: Clubs is a list (technically a ResultSet, but the behavior is very similar), so you need to iterate over it. .text gives just the text; other attributes could be retrieved as well.
Output looks like:
Transfer record 18/19
Barnsley FC
Burton Albion
Sunderland AFC
Shrewsbury Town
Scunthorpe United
Charlton Athletic
Plymouth Argyle
Portsmouth FC
Peterborough United
Southend United
Bradford City
Blackpool FC
Bristol Rovers
Fleetwood Town
Doncaster Rovers
Oxford United
Gillingham FC
AFC Wimbledon
Walsall FC
Rochdale AFC
Accrington Stanley
Luton Town
Wycombe Wanderers
Coventry City
Transfer record 18/19
There are, however, a bunch of blank lines (i.e., .text was '') that you should probably handle as well.
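And since the original goal was the club links specifically, here is a hedged variation (the class name is taken from the question's HTML) that skips headings without a club anchor and also pulls the href:
for club in pageSoup.find_all("h2"):
    link = club.find("a", {"class": "vereinprofil_tooltip"})
    if link:  # skip headings that contain no club anchor
        print(link.text.strip(), link.get("href"))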
my guess is you might mean findAll instead of find_all
I tried this code below and it works:
from bs4 import BeautifulSoup

content = """<div class="table-header" id="to-349">
<a class="vereinprofil_tooltip" id="349" href="/fc-barnsley/transfers/verein/349/saison_id/2018">
<img src="https://tmssl.akamaized.net/images/wappen/small/349.png?lm=1574162298" title=" " alt="Barnsley FC" class="" />
</a>
<h2>
<a class="vereinprofil_tooltip" id="349" href="/fc-barnsley/transfers/verein/349/saison_id/2018">
Barnsley FC
</a>
</h2>
</div>"""
soup = BeautifulSoup(content, 'html.parser')
# get main_box
main_box = soup.findAll('a', {'class': 'vereinprofil_tooltip'})
#print(main_box)
for main_text in main_box:  # loop through the matched anchors
    if main_text.text.strip():  # skip anchors with no body text (the image-only one)
        print(main_text.text.strip())  # print it
output is
Barnsley FC
I'll edit this with a reference to the documentation about findAll; can't remember it off the top of my head.
Edit: took a look at the documentation, and it turns out find_all = findAll (the latter is just the old name):
https://www.crummy.com/software/BeautifulSoup/bs4/doc/#find-all
Now I feel dumb lol