Can't Parse Google headings with python urllib - python

I cannot parse Google search results:
import urllib.request as ur
from bs4 import BeautifulSoup as bs

def extracter(url, key, change):
    if " " in key:
        key = key.replace(" ", str(change))
    url = url + str(key)
    response = ur.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
    sauce = ur.urlopen(response).read()
    soup = bs(sauce, "html.parser")
    return soup

def google(keyword):
    soup = extracter("https://www.google.com/search?q=", str(keyword), "+")
    search_result = soup.findAll("h3", attrs={"class": "LC20lb"})
    print(search_result)

google("tony stark")
Output:
[]

I simply changed the headers and it worked:
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.71 Safari/537.36'}
Result:
[<h3 class="LC20lb"><span dir="ltr">Tony Stark (Marvel Cinematic Universe) - Wikipedia</span></h3>, <h3 class="LC20lb"><span dir="ltr">Tony Stark / Iron Man - Wikipedia</span></h3>, <h3 class="LC20lb"><span dir="ltr">Iron Man | Marvel Cinematic Universe Wiki | FANDOM ...</span></h3>, <h3 class="LC20lb"><span dir="ltr">Tony Stark (Earth-199999) | Iron Man Wiki | FANDOM ...</span></h3>, <h3 class="LC20lb"><span dir="ltr">Is Tony Stark Alive As AI? Marvel Fans Say Tony Stark ...</span></h3>, <h3 class="LC20lb"><span dir="ltr">'Avengers: Endgame' Might Not Have Been the End of Tony ...</span></h3>, <h3 class="LC20lb"><span dir="ltr">Robert Downey Jr to RETURN to MCU as AI Tony Stark - ...</span></h3>, <h3 class="LC20lb"><span dir="ltr">Avengers Endgame theory: Tony Stark is backed up as AI ...</span></h3>]
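For reference, here is a consolidated sketch of the fix, with the fuller User-Agent wired into the request (the LC20lb class is specific to Google's markup at the time and may well have changed since):

import urllib.request as ur
from bs4 import BeautifulSoup as bs

# Full browser-like User-Agent; a bare "Mozilla/5.0" gets served a stripped-down page
HEADERS = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.71 Safari/537.36'}

def extracter(url, key, change):
    if " " in key:
        key = key.replace(" ", str(change))
    request = ur.Request(url + str(key), headers=HEADERS)
    sauce = ur.urlopen(request).read()
    return bs(sauce, "html.parser")

def google(keyword):
    soup = extracter("https://www.google.com/search?q=", str(keyword), "+")
    # Return just the heading texts instead of the raw tags
    return [h3.get_text() for h3 in soup.findAll("h3", attrs={"class": "LC20lb"})]

print(google("tony stark"))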

Related

Can't find hrefs of interest with BeautifulSoup

I am trying to collect a list of hrefs from the Netflix careers site: https://jobs.netflix.com/search. Each job listing on this site has an anchor and a class: <a class=css-2y5mtm essqqm81>. To be thorough here, the entire anchor is:
<a class="css-2y5mtm essqqm81" role="link" href="/jobs/244837014" aria-label="Manager, Written Communications"\>\
<span tabindex="-1" class="css-1vbg17 essqqm80"\>\<h4 class="css-hl3xbb e1rpdjew0"\>Manager, Written Communications\</h4\>\</span\>\</a\>
Again, the information of interest here is the hrefs of the form href="/jobs/244837014". However, when I perform the standard BS commands to read the HTML:
import urllib.request
from bs4 import BeautifulSoup

html_page = urllib.request.urlopen("https://jobs.netflix.com/search")
soup = BeautifulSoup(html_page)
I don't see any of the hrefs that I'm interested in inside of soup.
Running the following loop does not show the hrefs of interest:
for link in soup.findAll('a'):
    print(link.get('href'))
What am I doing wrong?
That information is being fed into the page dynamically, via XHR calls. You need to scrape the API endpoint to get the jobs info. The following code will give you a dataframe with all jobs currently listed by Netflix:
import requests
from bs4 import BeautifulSoup as bs
import pandas as pd
from tqdm import tqdm  # if in Jupyter: from tqdm.notebook import tqdm

pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)

headers = {
    'referer': 'https://jobs.netflix.com/search',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36'
}

big_df = pd.DataFrame()
s = requests.Session()
s.headers.update(headers)

for x in tqdm(range(1, 20)):
    url = f'https://jobs.netflix.com/api/search?page={x}'
    r = s.get(url)
    df = pd.json_normalize(r.json()['records']['postings'])
    big_df = pd.concat([big_df, df], axis=0, ignore_index=True)

print(big_df[['text', 'team', 'external_id', 'updated_at', 'created_at', 'location', 'organization']])
Result:
100% 19/19 [00:29<00:00, 1.42s/it]
text team external_id updated_at created_at location organization
0 Events Manager - SEA [Publicity] 244936062 2022-11-23T07:20:16+00:00 2022-11-23T04:47:29Z Bangkok, Thailand [Marketing and PR]
1 Manager, Written Communications [Publicity] 244837014 2022-11-23T07:20:16+00:00 2022-11-22T17:30:06Z Los Angeles, California [Marketing and Publicity]
2 Manager, Creative Marketing - Korea [Marketing] 244740829 2022-11-23T07:20:16+00:00 2022-11-22T07:39:56Z Seoul, South Korea [Marketing and PR]
3 Administrative Assistant - Philippines [Netflix Technology Services] 244683946 2022-11-23T07:20:16+00:00 2022-11-22T01:26:08Z Manila, Philippines [Corporate Functions]
4 Associate, Studio FP&A - APAC [Finance] 244680097 2022-11-23T07:20:16+00:00 2022-11-22T01:01:17Z Seoul, South Korea [Corporate Functions]
... ... ... ... ... ... ... ...
365 Software Engineer (L4/L5) - Content Engineering [Core Engineering, Studio Technologies] 77239837 2022-11-23T07:20:31+00:00 2021-04-22T07:46:29Z Mexico City, Mexico [Product]
366 Distributed Systems Engineer (L5) - Data Platform [Data Platform] 201740355 2022-11-23T07:20:31+00:00 2021-03-12T22:18:57Z Remote, United States [Product]
367 Senior Research Scientist, Computer Graphics / Computer Vision / Machine Learning [Data Science and Engineering] 227665988 2022-11-23T07:20:31+00:00 2021-02-04T18:54:10Z Los Gatos, California [Product]
368 Counsel, Content - Japan [Legal and Public Policy] 228338138 2022-11-23T07:20:31+00:00 2020-11-12T03:08:04Z Tokyo, Japan [Corporate Functions]
369 Associate, FP&A [Financial Planning and Analysis] 46317422 2022-11-23T07:20:31+00:00 2017-12-26T19:38:32Z Los Angeles, California [Corporate Functions]
370 rows × 7 columns
For each job, the url would be https://jobs.netflix.com/jobs/{external_id}
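As a small follow-up, assuming the big_df built above, those URLs can be added as a column directly:

# Build the job URL for each posting from its external_id
big_df['job_url'] = 'https://jobs.netflix.com/jobs/' + big_df['external_id'].astype(str)
print(big_df[['text', 'job_url']].head())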

How to find the nearest above sibling <li> for each <li class=""><a>?

My example html is
<li><h4>A0: Pronouns</h4></li>
<li class="">
    <a>bb</a>
    <a>cc</a>
</li>
<li class="">
    <a>dd</a>
    <a>ee</a>
</li>
<li><h4>A0: Verbs Tenses & Conjugation</h4></li>
<li class="">
    <a>ff</a>
    <a>gg</a>
</li>
<li class="">
    <a>hh</a>
    <a>kk</a>
</li>
<li class="">
    <a>jj</a>
    <a>ii</a>
</li>
For each element <li class=""><a>, I would like to find its nearest above sibling <li><h4>. For example,
<li class=""><a>bb</a></li> corresponds to <li><h4>A0: Pronouns</h4></li>.
<li class=""><a>dd</a></li> corresponds to <li><h4>A0: Pronouns</h4></li>.
<li class="">ff<a>dd</a></li> corresponds to <li><h4>A0: Verbs Tenses & Conjugation</h4></li>.
<li class="">hh<a>dd</a></li> corresponds to <li><h4>A0: Verbs Tenses & Conjugation</h4></li>.
<li class="">jj<a>dd</a></li> corresponds to <li><h4>A0: Verbs Tenses & Conjugation</h4></li>.
Could you please elaborate how to do so?
import requests
from bs4 import BeautifulSoup

session = requests.Session()
headers = {
    'Accept-Encoding': 'gzip, deflate, sdch',
    'Accept-Language': 'en-US,en;q=0.8',
    'Upgrade-Insecure-Requests': '1',
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Cache-Control': 'max-age=0',
    'Connection': 'keep-alive',
}
link = 'https://french.kwiziq.com/revision/grammar'
r = session.get(link, headers=headers)
soup = BeautifulSoup(r.content, 'html.parser')

for d in soup.select('.callout-body > ul li > a:nth-of-type(1)'):
    print(d)
You can use .find_previous('h4'):
import requests
from bs4 import BeautifulSoup

url = "https://french.kwiziq.com/revision/grammar"
soup = BeautifulSoup(requests.get(url).content, "html.parser")

for a in soup.select(".callout li > a:nth-of-type(1)"):
    print(
        "{:<70} {}".format(
            a.get_text(strip=True), a.find_previous("h4").get_text(strip=True)
        )
    )
Prints:
Saying your name: Je m'appelle, Tu t'appelles, Vous vous appelez A0: Pronouns
Tu and vous are used for three types of you A0: Pronouns
Je becomes j' with verbs beginning with a vowel (elision) A0: Verbs Tenses & Conjugation
J'habite à [city] = I live in [city] A0: Idioms, Idiomatic Usage, and Structures
Je viens de + [city] = I'm from + [city] A0: Idioms, Idiomatic Usage, and Structures
Conjugate être (je suis, tu es, vous êtes) in Le Présent (present tense) A0: Verbs Tenses & Conjugation
Make most adjectives feminine by adding -e A0: Adjectives & Adverbs
Nationalities differ depending on whether you're a man or a woman (adjectives) A0: Adjectives & Adverbs
Conjugate avoir (j'ai, tu as, vous avez) in Le Présent (present tense) A0: Verbs Tenses & Conjugation
Using un, une to say "a" (indefinite articles) A0: Nouns & Articles
...
French vocabulary and grammar lists by theme C1: Idioms, Idiomatic Usage, and Structures
French Fill-in-the-Blanks Tests C1: Idioms, Idiomatic Usage, and Structures
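The same idea can be checked against the minimal HTML from the question; this self-contained sketch pairs any <li> that contains an <a> with the closest <h4> appearing before it in the document:

from bs4 import BeautifulSoup

html = """
<li><h4>A0: Pronouns</h4></li>
<li class=""><a>bb</a><a>cc</a></li>
<li class=""><a>dd</a><a>ee</a></li>
<li><h4>A0: Verbs Tenses & Conjugation</h4></li>
<li class=""><a>ff</a><a>gg</a></li>
<li class=""><a>hh</a><a>kk</a></li>
<li class=""><a>jj</a><a>ii</a></li>
"""

soup = BeautifulSoup(html, "html.parser")
for li in soup.find_all("li"):
    first_a = li.find("a")
    if first_a is None:
        continue  # heading rows hold an <h4>, not an <a>
    print(first_a.get_text(), "->", li.find_previous("h4").get_text(strip=True))

# Prints:
# bb -> A0: Pronouns
# dd -> A0: Pronouns
# ff -> A0: Verbs Tenses & Conjugation
# hh -> A0: Verbs Tenses & Conjugation
# jj -> A0: Verbs Tenses & Conjugation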
You can use :is() in your CSS selector:
from bs4 import BeautifulSoup as soup
from collections import defaultdict

# soup1 is the parsed example HTML from the question, e.g. soup1 = soup(html, 'html.parser')
d, l = defaultdict(list), None
for i in soup1.select('li > :is(a, h4):nth-of-type(1)'):
    if i.name == 'h4':
        l = i.get_text(strip=True)
    else:
        d[l].append(i.get_text(strip=True))

print(dict(d))
Output:
{'A0: Pronouns': ['bb', 'dd'], 'A0: Verbs Tenses & Conjugation': ['ff', 'hh', 'jj']}
The output is storing the first a for every li associated with a grammatical section. If you only want a 1-1 section to component result, you can use a dictionary comprehension:
new_d = {a:b for a, (b, *_) in d.items()}
Output:
{'A0: Pronouns': 'bb', 'A0: Verbs Tenses & Conjugation': 'ff'}

Python - Web Scraping for Walmart

I'm trying to get some data from Walmart using Python and BeautifulSoup (bs4).
I wrote some code to get all the category names, and that works:
import requests
from bs4 import BeautifulSoup

baseurl = 'https://www.walmart.com/'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36'
}
r = requests.get('https://www.walmart.com/all-departments')
soup = BeautifulSoup(r.content, 'lxml')
sub_list = soup.find_all('div', class_='alldeps-DepartmentNav-link-wrapper display-inline-block u-size-1-3')
print(sub_list)
The problem is, when I try to get the values from this link using the code below, I get empty results:
import requests
from bs4 import BeautifulSoup

baseurl = 'https://www.walmart.com/'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36'
}
r = requests.get('https://www.walmart.com/browse/snacks-cookies-chips/cookies/976759_976787_1001391')
soup = BeautifulSoup(r.content, 'lxml')
general_list = soup.find_all('a', class_='product-title-link line-clamp line-clamp-2 truncate-title')
print(general_list)
Searching through old posts, I only see the SerpApi solution, but that is a paid service, so is there any other way to get the values? Or am I doing something wrong?
Here is a good tutorial for Selenium:
https://selenium-python.readthedocs.io/getting-started.html#simple-usage
I've written a short script for you to get started. All you need to do is download chromedriver (Chromium) and put it on your PATH. On Windows, chromedriver will have an .exe extension.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome(executable_path='/snap/bin/chromium.chromedriver')
driver.get("https://www.walmart.com/browse/snacks-cookies-chips/cookies/976759_976787_1001391")
assert "Walmart.com" in driver.title

wait = WebDriverWait(driver, 20)
wait.until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, ".product-title-link.line-clamp.line-clamp-2.truncate-title>span")))
elems = driver.find_elements_by_css_selector(".product-title-link.line-clamp.line-clamp-2.truncate-title>span")
for el in elems:
    print(el.text)
driver.close()
My output:
Lance Sandwich Cookies, Nekot Lemon Creme, 8 Ct Box
Nature Valley Biscuits, Almond Butter Breakfast Biscuits w/ Nut Filling, 13.5 oz
Pepperidge Farm Soft Baked Strawberry Cheesecake Cookies, 8.6 oz. Bag
Nutter Butter Family Size Peanut Butter Sandwich Cookies, 16 oz
SnackWell's Devil's Food Cookie Cakes 6.75 oz. Box
Munk Pack Protein Cookies, Variety Pack, Vegan, Gluten Free, Dairy Free Snacks, 6 Count
Great Value Twist & Shout Chocolate Sandwich Cookies, 15.5 Oz.
CHIPS AHOY! Chewy Brownie Filled Chocolate Chip Cookies, 9.5 oz
Nutter Butter Peanut Butter Wafer Cookies, 10.5 oz
Nabisco Sweet Treats Cookie Variety Pack OREO, OREO Golden & CHIPS AHOY!, 30 Snack Packs (2 Cookies Per Pack)
Archway Cookies, Soft Dutch Cocoa, 8.75 oz
OREO Double Stuf Chocolate Sandwich Cookies, Family Size, 20 oz
OREO Chocolate Sandwich Cookies, Party Size, 25.5 oz
Fiber One Soft-Baked Cookies, Chocolate Chunk, 6.6 oz
Nature Valley Toasted Coconut Biscuits with Coconut Filling, 10 ct, 13.5 oz
Great Value Duplex Sandwich Creme Cookies Family Size, 25 Oz
Great Value Assorted Sandwich creme Cookies Family Size, 25 oz
CHIPS AHOY! Original Chocolate Chip Cookies, Family Size, 18.2 oz
Archway Cookies, Crispy Windmill, 9 oz
Nabisco Classic Mix Variety Pack, OREO Mini, CHIPS AHOY! Mini, Nutter Butter Bites, RITZ Bits Cheese, Easter Snacks, 20 Snack Packs
Mother's Original Circus Animal Cookies 11 oz
Lotus Biscoff Cookies, 8.8 Oz.
Archway Cookies, Crispy Gingersnap, 12 oz
Great Value Vanilla Creme Wafer Cookies, 8 oz
Pepperidge Farm Verona Strawberry Thumbprint Cookies, 6.75 oz. Bag
Absolutely Gluten Free Coconut Macaroons
Sheila G's Brownie Brittle GLUTEN-FREE Chocolate Chip Cookie Snack Thins, 4.5oz
CHIPS AHOY! Peanut Butter Cup Chocolate Cookies, Family Size, 14.25 oz
Great Value Lemon Sandwich Creme Cookies Family Size, 25 oz
Keebler Sandies Classic Shortbread Cookies 11.2 oz
Nabisco Cookie Variety Pack, OREO, Nutter Butter, CHIPS AHOY!, 12 Snack Packs
OREO Chocolate Sandwich Cookies, Family Size, 19.1 oz
Lu Petit Ecolier European Dark Chocolate Biscuit Cookies, 45% Cocoa, 5.3 oz
Keebler Sandies Pecan Shortbread Cookies 17.2 oz
CHIPS AHOY! Reeses Peanut Butter Cup Chocolate Chip Cookies, 9.5 oz
Fiber One Soft-Baked Cookies, Oatmeal Raisin, 6 ct, 6.6 oz
OREO Dark Chocolate Crme Chocolate Sandwich Cookies, Family Size, 17 oz
Pinwheels Pure Chocolate & Marshmallow Cookies, 12 oz
Keebler Fudge Stripes Original Cookies 17.3 oz
Pepperidge Farm Classic Collection Cookies, 13.25 oz. Box
It's because the website is dynamically rendered, so the JavaScript needs to run before the products show up. Therefore you need somewhere to run the JavaScript (BeautifulSoup can't do that). Have a look at the Selenium library.
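As a side note, executable_path and find_elements_by_css_selector have since been removed in Selenium 4; a rough equivalent of the script above using the current API (the product class names are taken from the answer and may no longer match Walmart's markup) would be:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()  # Selenium 4.6+ resolves a matching chromedriver automatically
driver.get("https://www.walmart.com/browse/snacks-cookies-chips/cookies/976759_976787_1001391")

selector = ".product-title-link.line-clamp.line-clamp-2.truncate-title>span"
# Wait for the JavaScript-rendered product titles to appear before reading them
WebDriverWait(driver, 20).until(
    EC.visibility_of_all_elements_located((By.CSS_SELECTOR, selector))
)
for el in driver.find_elements(By.CSS_SELECTOR, selector):
    print(el.text)

driver.quit()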

Can't parse all the next page links from a website using requests

I'm trying to get all the links by traversing the next pages from this website. My script below can parse the next-page links up to 10. However, I can't go past the link shown as 10 at the bottom of that page.
I've tried with:
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

base = 'https://www.icab.es'
link = 'https://www.icab.es/?go=eaf9d1a0ec5f1dc58757ad6cffdacedb1a58854a600312cc82c494d2c55856f1e25c06b4b6fcee5ddabebfe2d30057589a86e9750b459e9d60598cc6e5c52a4697030b2b8921f29f'

with requests.Session() as s:
    s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1; ) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36'
    p = 1
    while True:
        r = s.get(link)
        soup = BeautifulSoup(r.text, "lxml")
        """some data I can fetch myself from current pages, so ignore this portion"""
        p += 1
        next_page = soup.select_one(f"a[title='{p}']")
        if next_page:
            link = urljoin(base, next_page.get("href"))
            print("next page:", link)
        else:
            break
How can I get all the next page links from the website above?
PS selenium is not an option I would like to cope with.
You only need to follow the href of the "»" (next) link when (p - 1) % 10 == 0, i.e. when the next page number is no longer shown in the pager; otherwise the numbered link works.
Code:
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

base = 'https://www.icab.es'
link = 'https://www.icab.es/?go=eaf9d1a0ec5f1dc58757ad6cffdacedb1a58854a600312cc82c494d2c55856f1e25c06b4b6fcee5ddabebfe2d30057589a86e9750b459e9d60598cc6e5c52a4697030b2b8921f29f'

with requests.Session() as s:
    s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1; ) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36'
    p = 1
    while True:
        r = s.get(link)
        soup = BeautifulSoup(r.text, "lxml")
        """some data I can fetch myself from current pages, so ignore this portion"""
        p += 1
        if not ((p - 1) % 10):
            # the pager only shows ten numbered links, so follow the "next" arrow
            next_page = soup.select_one("a[title='Següent']")
        else:
            next_page = soup.select_one(f"a[title='{p}']")
        if next_page:
            link = urljoin(base, next_page.get("href"))
            print("page", next_page.text, link)
        else:
            break  # pager exhausted
Result (the "»" entry corresponds to page 11):
D:\python37\python.exe E:/work/Compile/python/python_project/try.py
page 2 https://www.icab.es/?go=eaf9d1a0ec5f1dc58757ad6cffdacedb1a58854a600312cc7aa5373444141c64ddd1e9bee4bbc2e02ab2de1dfe88d7a8623a04a8617c3a28f3f17b03d0448cd1d399689a629d00f53e570c691ad10f5cfba28f6f9ee8d48ddb9b701d116d7c2a6d4ea403cef5d996fcb28a12b9f7778cd7521cfdf3d243cb2b1f3de9dfe304a10437e417f6c68df79efddd721f2ab8167085132c5e745958a3a859b9d9f04b63e402ec6e8ae29bee9f4791fed51e5758ae33460e9a12b6d73f791fd118c0c95180539f1db11c86a7ab97b31f94fb84334dce6867d519873cc3b80e182ff0b778
page 3 https://www.icab.es/?go=eaf9d1a0ec5f1dc58757ad6cffdacedb1a58854a600312cc7aa5373444141c64cc111efb5ef5c42ecde14350e1f5a2e0e8db36a9258d5c95496c2450f44ccd8c4074edb392b09c03e136988ff15e707fa0e01d1ee6c3198a166e2677b4b418e0b07cafd4d98a19364077e7ed2ea0341001481d8b9622a969a524a487e7d69f6b571f2cb03c2277ecd858c68a7848a0995c1c0e873d705a72661b69ab39b253bb775bc6f7f6ae3df2028114735a04dcb8043775e73420cb40c4a5eccb727438ea225b582830ce84eb959753ded1b3eb57a14b283c282caa7ad04626be8320b4ab
page 4 https://www.icab.es/?go=eaf9d1a0ec5f1dc58757ad6cffdacedb1a58854a600312cc7aa5373444141c64d8e9a9d04523d43bfb106098548163bfec74e190632187d135f2a0949b334acad719ad7c326481a43dfc6f966eb038e0a5a178968601ad0681d586ffc8ec21e414628f96755116e65b7962dfcf3a227fc1053d17701937d4f747b94c273ce8b9ccec178386585075c17a4cb483c45b85c1209329d1251767b8a0b4fa29969cf6ad42c7b04fcc1e64b9defd528753677f56e081e75c1cbc81d1f4cc93adbde29d06388474671abbab246160d0b3f03a17d1db2c6cd6c6d7a243d872e353200a35
page 5 https://www.icab.es/?go=eaf9d1a0ec5f1dc58757ad6cffdacedb1a58854a600312cc7aa5373444141c643ba4bcf6634af323cf239c7ccf7eca827c3a245352a03532a91c0ced15db81dcfc52b6dfa69853a68cb320e29ca25e25fac3da0e85667145375c3fa1541d80b1b056c03c02400220223ad5766bd1a4824171188fd85a5412b59bd48fe604451cbd56d763be67b50e474befa78340d625d222f1bb6b337d8d2b335d1aa7d0374b1be2372e77948f22a073e5e8153c32202a219ed2ef3f695b5b0040ded1ca9c4a03462b5937182c004a1a425725d3d20a10b41fd215d551abf10ef5e8a76ace4f
page 6 https://www.icab.es/?go=eaf9d1a0ec5f1dc58757ad6cffdacedb1a58854a600312cc7aa5373444141c64418cdf5c38c01a1ac019cc46242eb9ba25f012f2e4bee18a2864a19dde58d6ee2ae93254aff239c70b7019526af1a435e0e89a7c81dc4842e365163d8f9e571ae4fc8b0fc7455f573abee020e21207a604f3d6b7c2015c300a7b1dbc75980b435bb1904535bed2610771fee5e3338a79fad6d024ec2684561c3376463b2cacc00a99659918b41a12c92233bca3eaa1e003dbb0a094b787244ef3c33688b4382f89ad64a92fa8b738dd810b6e32a087564a8db2422c5b2013e9103b1b57b4248d
page 7 https://www.icab.es/?go=eaf9d1a0ec5f1dc58757ad6cffdacedb1a58854a600312cc7aa5373444141c64f96d66b04d442c09e3b891f2a5f3fb235c1aa2786face046665066db9a63e7ca4523e5cf28f4f17898204642a7d5ef3f8474ecd5bf58b252944d148467b495ad2450ea157ce8f606d4b9a6bc2ac04bec3a666757eac42cbea0737e8191b0d375425e11c76990c82246cfb9cbe94daa46942d824ff9f144f6b51c768b50c3e35acfa81e5ebf95bcb5200f5b505595908907c99b8d59893724eb16d5694d36cd30d8a15744af616675b2d6a447c10b79ca04814aece8c7ab4d878b7955cd5cd9ef
page 8 https://www.icab.es/?go=eaf9d1a0ec5f1dc58757ad6cffdacedb1a58854a600312cc7aa5373444141c64d1c210208efbd4630d44a5a92d37e5eabccba6abf83c2584404a24d08e0ad738be3598a03bbec8975275b06378cc5f7c55a9b700eb5bd4ee243a3c30f781512c0ebd23800890cb150621caab21a7a879639331b369d92bb9668815465f5d3b6c061daa011784909fc09af75ab705612ba504b4c268b43f8a029e840b8c69531423e8b5e8fe91d7cc628c309ffb633e233932b7c1b57c5cf0a2f2f47618bca4837ce355f34ae154565b447cfffcecb66458d19e5e5f3547f6916cd1c30baec1a7
page 9 https://www.icab.es/?go=eaf9d1a0ec5f1dc58757ad6cffdacedb1a58854a600312cc7aa5373444141c6415c187c4ac2cf9d4c982984e1b56faf34a31272a90b11557d1651ad92b01a2ecd3c719cfe78863f99e31b0fc1b6bc7b09e1e0e585ebdc0b04fc9dca8744bb66e8af86d65b39827f1265a82aea0286376456ccfa9cce638d72c494db2391127979eed3d349d725f2e60e2629512c388738fc26b1c9f16a2b478862469835474b305f1300c0aa53c2c4033e4b0967a542079915e30bb18418eb79a47a292ed835dd54689c1fd9ceda898678e7114fa95d559b55367e6f7f9d1ce3fb5ebb5d479c5
page 10 https://www.icab.es/?go=eaf9d1a0ec5f1dc58757ad6cffdacedb1a58854a600312cc7aa5373444141c644ab59a0b943deffee8845c2521afef0ea3ff2d96cc2b65a071da1201603b54b15b5c4363e92285c60dffd0e893ba6a58ff528fb3278db8e746697dc8712936a560a3da8085e3dcab05949afecddaced326332986240624575c6b7f104182a8c57718ec62e728d8eaa886a611ad55e0d3dd0c1ba59b47cf89d1bd5b000f9fbc5bd7d6310742a53eedfa44383d62145c28ebcf9f180ca49a3616fcfaf7ecaaa0b2f7183fc1d10d18e0062613e73f9077d11a1dfaf044990c200ac10aac4f7cb332
page » https://www.icab.es/?go=eaf9d1a0ec5f1dc58757ad6cffdacedb1a58854a600312cc7aa5373444141c64ff2c69157ff5cf4b8ccbc2674205f4fb3048dc10db0c7cb36c42fbc59aaa972b9fab70578ff58757fae7a1f1ca17076dfddb919cf92389ba66c8de7f6ea9ec08277b0228f8bd14ea82409ff7e5a051ea58940736b475c6f75c7eba096b711812ed5b6b8454ec11145b0ce10191a38068c6ca7e7c64a86b4c71819d55b3ab34233e9887c7bfa05f9f8bc488cb0986fb2680b8cb9278a437e7c91c7b9d15426e159c30c6c2351ed300925ef1b24bbf2dbf60cf9dea935d179235ed46640d2b0b54
page 12 https://www.icab.es/?go=eaf9d1a0ec5f1dc58757ad6cffdacedb1a58854a600312cc7aa5373444141c64346907383e54eae9d772c10d3600822205ff9b81665ff0f58fd876b4e0d9aeb6e0271904c5251d9cf6eb1fdd1ea16f8ea3f42ad3db66678bc538c444e0e5e4064946826aaf85746b3f87fb436d83a8eb6d6590c25dc7f208a16c1db7307921d79269591e036fed1ec78ec7351227f925a32d4d08442b9fd65b02f6ef247ca5f713e4faffe994bf26a14c2cb21268737bc2bc92bb41b3e3aaa05de10da4e38de3ab725adb5560eee7575cdf6d51d59870efacc1b9553609ae1e16ea25e6d6e9e6
page 13 https://www.icab.es/?go=eaf9d1a0ec5f1dc58757ad6cffdacedb1a58854a600312cc7aa5373444141c64afc9149ba3dadd6054f6d8629d1c750431a15f9c4048195cfc2823f61f6cfd1f2e4f78eb835829db8e7c88279bf3a38788d8feaf5327f1b42d863bba24d893ea5e033510dc2e0579474ac7efcc1915438eacb83f2a3b5416e64e3beb726d721eb79f55082be0371414ccd132e95cd53339cf7a8d6ec15b72595bf87107d082c9db7bba6cf45b8cfe7a9352abe2f289ae8591afcfd78e17486c25e94ea57c00e290613a18a8b991def7e1cd4cae517a4ee1b744036336fbc68b657cd33cc4c949
I had problems with SSL, so I changed the default ssl_context for this site:
import ssl
import requests
import requests.adapters
from bs4 import BeautifulSoup

# adapted from https://stackoverflow.com/questions/42981429/ssl-failure-on-windows-using-python-requests/50215614
class SSLContextAdapter(requests.adapters.HTTPAdapter):
    def init_poolmanager(self, *args, **kwargs):
        ssl_context = ssl.create_default_context()
        # Sets up old and insecure TLSv1.
        ssl_context.options &= ~ssl.OP_NO_TLSv1_3 & ~ssl.OP_NO_TLSv1_2 & ~ssl.OP_NO_TLSv1_1
        ssl_context.minimum_version = ssl.TLSVersion.TLSv1
        kwargs['ssl_context'] = ssl_context
        return super(SSLContextAdapter, self).init_poolmanager(*args, **kwargs)

base = 'https://www.icab.es'
link = 'https://www.icab.es/?go=eaf9d1a0ec5f1dc58757ad6cffdacedb1a58854a600312cc82c494d2c55856f1e25c06b4b6fcee5ddabebfe2d30057589a86e9750b459e9d60598cc6e5c52a4697030b2b8921f29f'

with requests.session() as s:
    s.mount('https://www.icab.es', SSLContextAdapter())
    p = 1
    while True:
        print('Page {}..'.format(p))
        # r = urllib.request.urlopen(link, context=ssl_context)
        r = s.get(link)
        soup = BeautifulSoup(r.content, "lxml")
        for li in soup.select('li.principal'):
            print(li.get_text(strip=True))
        p += 1
        link = soup.select_one('a[title="{}"]'.format(p))
        if not link:
            link = soup.select_one('a[title="Següent"]')
        if not link:
            break
        link = base + link['href']
Prints:
Page 1..
Sr./Sra. Martínez Gòmez, Marc
Sr./Sra. Eguinoa de San Roman, Roman
Sr./Sra. Morales Santiago, Maria Victoria
Sr./Sra. Bengoa Tortajada, Javier
Sr./Sra. Moralo Rodríguez, Xavier
Sr./Sra. Romagosa Huerta, Marta
Sr./Sra. Peña Moncho, Juan
Sr./Sra. Piñana Morera, Roman
Sr./Sra. Millán Sánchez, Antonio
Sr./Sra. Martínez Mira, Manel
Sr./Sra. Montserrat Rincón, Anna
Sr./Sra. Fernández Paricio, Maria Teresa
Sr./Sra. Ruiz Macián- Dagnino, Claudia
Sr./Sra. Barba Ausejo, Pablo
Sr./Sra. Bruna de Quixano, Jose Luis
Sr./Sra. Folch Estrada, Fernando
Sr./Sra. Gracia Castellón, Sonia
Sr./Sra. Sales Valls, Gemma Elena
Sr./Sra. Pastor Giménez-Salinas, Adolfo
Sr./Sra. Font Jané, Àlvar
Sr./Sra. García González, Susana
Sr./Sra. Garcia-Tornel Florensa, Xavier
Sr./Sra. Marín Granados, Alejandra
Sr./Sra. Albero Jové, José María
Sr./Sra. Galcerà Margalef, Montserrat
Page 2..
Sr./Sra. Chimenos Minguella, Sergi
Sr./Sra. Lacasta Casado, Ramón
Sr./Sra. Alcay Morandeira, Carlos
Sr./Sra. Ribó Massó, Ignacio
Sr./Sra. Fitó Baucells, Antoni
Sr./Sra. Paredes Batalla, Patricia
Sr./Sra. Prats Viñas, Francesc
Sr./Sra. Correig Ferré, Gerard
Sr./Sra. Subirana Freixas, Alba
Sr./Sra. Álvarez Crexells, Juan
Sr./Sra. Glaser Woloschin, Joan Nicolás
Sr./Sra. Nel-lo Padro, Francesc Xavier
Sr./Sra. Oliveras Dalmau, Rosa Maria
Sr./Sra. Badia Piqué, Montserrat
Sr./Sra. Fuentes-Lojo Rius, Alejandro
Sr./Sra. Argemí Delpuy, Marc
Sr./Sra. Espinoza Carrizosa, Pina
Sr./Sra. Ges Clot, Carla
Sr./Sra. Antón Tuneu, Beatriz
Sr./Sra. Schroder Vilalta, Andrea
Sr./Sra. Belibov, Mariana
Sr./Sra. Sole Lopez, Silvia
Sr./Sra. Reina Pardo, Luis
Sr./Sra. Cardenal Lagos, Manel Josep
Sr./Sra. Bru Galiana, David
...and so on.

Remove image sources with same class reference when web scraping in python?

I'm trying to write some code to extract some data from transfermarkt (the page I'm using appears in the code below). I'm stuck trying to print the clubs. I've figured out that I need to access the h2 and then the a class in order to get just the text. The HTML code is below:
<div class="table-header" id="to-349"><a class="vereinprofil_tooltip" id="349" href="/fc-barnsley/transfers/verein/349/saison_id/2018"><img src="https://tmssl.akamaized.net/images/wappen/small/349.png?lm=1574162298" title=" " alt="Barnsley FC" class="" /></a><h2><a class="vereinprofil_tooltip" id="349" href="/fc-barnsley/transfers/verein/349/saison_id/2018">Barnsley FC</a></h2></div>
So you can see, if I just try find_all("a", {"class": "vereinprofil_tooltip"}) it doesn't work properly, as it also returns the image tag, which has no plain text. But if I could search for h2 first and then run find_all("a", {"class": "vereinprofil_tooltip"}) within the returned h2, it would get me what I want. My code is below.
import requests
from bs4 import BeautifulSoup

headers = {'User-Agent':
           'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36'}
page = "https://www.transfermarkt.co.uk/league-one/transfers/wettbewerb/GB3/plus/?saison_id=2018&s_w=&leihe=1&intern=0&intern=1"
pageTree = requests.get(page, headers=headers)
pageSoup = BeautifulSoup(pageTree.content, 'html.parser')

#Players = pageSoup.find_all("a", {"class": "spielprofil_tooltip"})
Clubs = pageSoup.find_all("h2")
Club = Clubs.find("a", {"class": "vereinprofil_tooltip"})
print(Club)
I get the error in getattr
raise AttributeError(
AttributeError: ResultSet object has no attribute 'find_all'. You're probably treating a list of elements like a single element. Did you call find_all() when you meant to call find()?
I know what the error means but I've been going round in circles trying to find a way of actually doing it properly and getting what I want. Any help is appreciated.
import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36'}
page = "https://www.transfermarkt.co.uk/league-one/transfers/wettbewerb/GB3/plus/?saison_id=2018&s_w=&leihe=1&intern=0&intern=1"
pageTree = requests.get(page, headers=headers)
pageSoup = BeautifulSoup(pageTree.content, 'html.parser')

#Players = pageSoup.find_all("a", {"class": "spielprofil_tooltip"})
Clubs = pageSoup.find_all("h2")
print(type(Clubs))  # this can be removed, but I left it to expose how I figured this out

for club in Clubs:
    print(club.text)
Basically: Clubs is a list (technically, a ResultSet, but the behavior is very similar), so you need to iterate over it as such. .text gives just the text; other attributes could be retrieved as well.
Output looks like:
Transfer record 18/19
Barnsley FC
Burton Albion
Sunderland AFC
Shrewsbury Town
Scunthorpe United
Charlton Athletic
Plymouth Argyle
Portsmouth FC
Peterborough United
Southend United
Bradford City
Blackpool FC
Bristol Rovers
Fleetwood Town
Doncaster Rovers
Oxford United
Gillingham FC
AFC Wimbledon
Walsall FC
Rochdale AFC
Accrington Stanley
Luton Town
Wycombe Wanderers
Coventry City
Transfer record 18/19
There are, however, a bunch of blank lines (i.e., .text was '') that you should probably handle as well.
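For instance, assuming the Clubs ResultSet from the snippet above, the blanks can be filtered out with a list comprehension:

# Keep only non-empty club names from the <h2> tags
club_names = [club.get_text(strip=True) for club in Clubs]
club_names = [name for name in club_names if name]
print(club_names)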
My guess is you might mean findAll instead of find_all.
I tried the code below and it works:
content = """<div class="table-header" id="to-349">
<a class="vereinprofil_tooltip" id="349" href="/fc-barnsley/transfers/verein/349/saison_id/2018">
<img src="https://tmssl.akamaized.net/images/wappen/small/349.png?lm=1574162298" title=" " alt="Barnsley FC" class="" />
</a>
<h2>
<a class="vereinprofil_tooltip" id="349" href="/fc-barnsley/transfers/verein/349/saison_id/2018">
Barnsley FC
</a>
</h2>
</div>"""
soup = BeautifulSoup(content, 'html.parser')
#get main_box
main_box = soup.findAll('a', {'class': 'vereinprofil_tooltip'})
#print(main_box)
for main_text in main_box: # looping thru the list
if main_text.text.strip(): # get the body text
print(main_text.text.strip()) # print it
Output is:
Barnsley FC
I'll edit this with a reference to the documentation about findAll; can't remember it off the top of my head.
Edit: took a look at the documentation, and it turns out find_all = findAll:
https://www.crummy.com/software/BeautifulSoup/bs4/doc/#find-all
now I feel dumb lol
