How to use soup.find and soup.find_all - Python

Here is my code and the output:

import requests
from bs4 import BeautifulSoup

res = requests.get("https://www.jobberman.com/jobs")
soup = BeautifulSoup(res.text, "html.parser")
job = soup.find("div", class_="relative inline-flex flex-col w-full text-sm font-normal pt-2")
company_name = job.find('a[href*="jobs"]')
print(company_name)

The output is:

None
But when I use the select method, I get the desired result, but I can't call .text on it:

import requests
from bs4 import BeautifulSoup

res = requests.get("https://www.jobberman.com/jobs")
soup = BeautifulSoup(res.text, "html.parser")
job = soup.find("div", class_="relative inline-flex flex-col w-full text-sm font-normal pt-2")
company_name = job.select('a[href*="jobs"]').text
print(company_name)

The output is:

AttributeError: ResultSet object has no attribute 'text'. You're probably treating a list of elements like a single element. Did you call find_all() when you meant to call find()?
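
Why the first snippet prints None: find() expects a tag name plus attribute filters, not a CSS selector, so find('a[href*="jobs"]') searches for a tag literally named a[href*="jobs"] and finds nothing. select() does accept CSS selectors, but it returns a ResultSet (a list); either index into it or use select_one() to get a single Tag. A minimal sketch, reusing the class string from your code (it may no longer match the live site):

import requests
from bs4 import BeautifulSoup

res = requests.get("https://www.jobberman.com/jobs")
soup = BeautifulSoup(res.text, "html.parser")
job = soup.find("div", class_="relative inline-flex flex-col w-full text-sm font-normal pt-2")

# select_one() takes a CSS selector and returns a single Tag (or None),
# so .text is safe to call after a None check.
company_link = job.select_one('a[href*="jobs"]') if job else None
if company_link is not None:
    print(company_link.text.strip())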

Change your selection strategy. The main issue here is that not all company names are linked:

job.find('div', {'class': 'search-result__job-meta'}).text.strip()

or

job.select_one('.search-result__job-meta').text.strip()
Example

Also store your information in a structured way for post-processing:

import requests
from bs4 import BeautifulSoup

res = requests.get("https://www.jobberman.com/jobs")
soup = BeautifulSoup(res.text, "html.parser")

data = []
for job in soup.select('div:has(>.search-result__body)'):
    data.append({
        'job': job.h3.text,
        'company': job.select_one('.search-result__job-meta').text.strip()
    })
data
Output
[{'job': 'Restaurant Manager', 'company': 'Balkaan Employments service'},
{'job': 'Executive Assistant', 'company': 'Nolla Fresh & Frozen ltd'},
{'job': 'Portfolio Manager/Instructor 1', 'company': 'Fun Science World'},
{'job': 'Microbiologist', 'company': "NEIMETH INT'L PHARMACEUTICALS PLC"},
{'job': 'Data Entry Officer', 'company': 'Nkoyo Pharmaceuticals Ltd.'},
{'job': 'Chemical Analyst', 'company': "NEIMETH INT'L PHARMACEUTICALS PLC"},
{'job': 'Senior Front-End Engineer', 'company': 'Salvo Agency'},...]

The problems with your search strategy have been covered by the comments and answers posted earlier. I am offering a solution which involves the regex library, along with the find_all() function call:
import requests
from bs4 import BeautifulSoup
import re

res = requests.get("https://www.jobberman.com/jobs")
soup = BeautifulSoup(res.text, "html.parser")
company_name = soup.find_all("a", href=re.compile(r"/jobs\?"), rel="nofollow")
for i in range(len(company_name)):
    print(company_name[i].text)
Output:
GRATIAS DEI NIGERIA LIMITED
Balkaan Employments service
Fun Science World
NEIMETH INT'L PHARMACEUTICALS PLC
Nkoyo Pharmaceuticals Ltd.
...
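
As a small idiomatic follow-up, you can iterate over the ResultSet directly instead of indexing by range:

# A ResultSet is just a list of Tags, so no index bookkeeping is needed.
for a in company_name:
    print(a.text)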

Related

Problem with .get('href') when scraping links?

So I am trying to follow a video tutorial that is just a bit outdated. In the video, href = links[idx].get('href') grabs the link; however, if I use it here, it won't work. It just says None. If I just type .getText() it will grab the title.
The element for the entire href and title is <a href="https://mullvad.net/nl/blog/2023/2/2/stop-the-proposal-on-mass-surveillance-of-the-eu/">Stop the proposal on mass surveillance of the EU</a>.
Here's my code:
import requests
from bs4 import BeautifulSoup

res = requests.get('https://news.ycombinator.com/news')
soup = BeautifulSoup(res.text, 'html.parser')
links = soup.select('.titleline')
votes = soup.select('.score')

def create_custom_hn(links, votes):
    hn = []
    for idx, item in enumerate(links):
        title = links[idx].getText()
        href = links[idx].get('href')
        print(href)
        #hn.append({'title': title, 'link': href})
    return hn

print(create_custom_hn(links, votes))

I tried to grab the link using .get('href').
Try to select your elements more specifically, and avoid using separate lists; there is no need for that, and you would have to ensure they all have the same length.
You can get all the information in one go by selecting the <tr> with class athing and its next sibling.
Example

import requests
from bs4 import BeautifulSoup

soup = BeautifulSoup(requests.get('https://news.ycombinator.com/news').text, 'html.parser')

data = []
for i in soup.select('.athing'):
    data.append({
        'title': i.select_one('span a').text,
        'link': i.select_one('span a').get('href'),
        'score': list(i.next_sibling.find('span').stripped_strings)[0]
    })
data
Output
[{'title': 'Stop the proposal on mass surveillance of the EU',
'link': 'https://mullvad.net/nl/blog/2023/2/2/stop-the-proposal-on-mass-surveillance-of-the-eu/',
'score': '287 points'},
{'title': 'Bay 12 Games has made $7M from the Steam release of Dwarf Fortress',
'link': 'http://www.bay12forums.com/smf/index.php?topic=181354.0',
'score': '416 points'},
{'title': "Google's OSS-Fuzz expands fuzz-reward program to $30000",
'link': 'https://security.googleblog.com/2023/02/taking-next-step-oss-fuzz-in-2023.html',
'score': '31 points'},
{'title': "Connecticut Parents Arrested for Letting Kids Walk to Dunkin' Donuts",
'link': 'https://reason.com/2023/01/30/dunkin-donuts-parents-arrested-kids-cops-freedom/',
'score': '225 points'},
{'title': 'Ronin 2.0 – open-source Ruby toolkit for security research and development',
'link': 'https://ronin-rb.dev/blog/2023/02/01/ronin-2-0-0-finally-released.html',
'score': '62 points'},...]
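
A side note on robustness: the example above assumes every .athing row is followed by a metadata row whose first span is the score, but HN job postings carry no score, so the lookup can crash or pick up the wrong span. A minimal defensive sketch with the same selectors and a '-' fallback:

import requests
from bs4 import BeautifulSoup

soup = BeautifulSoup(requests.get('https://news.ycombinator.com/news').text, 'html.parser')

data = []
for i in soup.select('.athing'):
    meta = i.find_next_sibling('tr')                     # metadata row after each story row
    score = meta.select_one('.score') if meta else None  # job postings have no .score span
    data.append({
        'title': i.select_one('span a').text,
        'link': i.select_one('span a').get('href'),
        'score': score.text if score else '-'
    })
print(data)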

Why does Python BeautifulSoup return an empty list?

I'm a rookie IT student. I was trying to help my friend with his job, and I wanted to create a list of customers he could serve (maybe exporting it to a file would be awesome too, but I guess I will think about that later).
When I try to run the code it just returns an empty list. Do you have any suggestions?
Any suggestions/feedback would be highly appreciated!
Thank you!
(I know it may not be the best code you have ever seen, so I apologize in advance!)
import requests
from bs4 import BeautifulSoup
import pprint

res = requests.get('https://www.paginebianche.it/toscana/li/gommisti.html')
res2 = requests.get('https://www.paginebianche.it/ricerca?qs=gommisti&dv=li&p=2')
soup = BeautifulSoup(res.text, 'html.parser')
soup2 = BeautifulSoup(res2.text, 'html.parser')
links = soup.select('.org fn')
subtext = soup.select('.address')
links2 = soup2.select('.org fn')
subtext2 = soup2.select('.address')
mega_links = links + links2
mega_subtext = subtext + subtext2

def create_custom_hn(mega_links, mega_subtext):
    hn = []
    for links, address in enumerate(mega_links):
        title = links.getText()
        address = address.getText()
        hn.append({'title': title, 'address': address})
    return hn

pprint.pprint(create_custom_hn(mega_links, mega_subtext))
The selector .org fn is wrong; it should be .org.fn to select all elements that have both classes, org and fn (with a space it is a descendant selector looking for an <fn> tag, which is why you get an empty list).
However, some items don't have .address, so your code would produce skewed results. You can use this example to get titles and addresses (in case of a missing address, - is used):
import pprint
import requests
from itertools import chain
from bs4 import BeautifulSoup

res = requests.get('https://www.paginebianche.it/toscana/li/gommisti.html')
res2 = requests.get('https://www.paginebianche.it/ricerca?qs=gommisti&dv=li&p=2')
soup = BeautifulSoup(res.text, 'html.parser')
soup2 = BeautifulSoup(res2.text, 'html.parser')

hn = []
for i in chain.from_iterable([soup.select('.item'), soup2.select('.item')]):
    title = i.h2.getText(strip=True)
    addr = i.select_one('[itemprop="address"]')
    addr = addr.getText(strip=True, separator='\n') if addr else '-'
    hn.append({'title': title, 'address': addr})
pprint.pprint(hn)
Prints:
[{'address': 'Via Don Giovanni Minzoni 44\n-\n57025\nPiombino (LI)',
'title': 'CENTROGOMMA'},
{'address': 'Via Quaglierini 14\n-\n57123\nLivorno (LI)',
'title': 'F.LLI CAPALDI'},
{'address': 'Via Ugione 9\n-\n57121\nLivorno (LI)',
'title': 'PNEUMATICI INTERGOMMA GOMMISTA'},
{'address': "Viale Carducci Giosue' 88/90\n-\n57124\nLivorno (LI)",
'title': 'ITALMOTORS'},
{'address': 'Piazza Chiesa 53\n-\n57124\nLivorno (LI)',
'title': 'Lo Coco Pneumatici'},
{'address': '-', 'title': 'PIERO GOMME'},
{'address': 'Via Pisana Livornese Nord 95\n-\n57014\nVicarello (LI)',
'title': 'GOMMISTA TRAVAGLINI PNEUMATICI'},
{'address': 'Via Cimarosa 165\n-\n57124\nLivorno (LI)',
'title': 'GOMMISTI CIONI AUTORICAMBI & SERVIZI'},
{'address': 'Loc. La Cerretella, 219\n-\n57022\nCastagneto Carducci (LI)',
'title': 'AURELIA GOMME'},
{'address': 'Strada Provinciale Vecchia Aurelia 243\n'
'-\n'
'57022\n'
'Castagneto Carducci (LI)',
'title': 'AURELIA GOMME DI GIANNELLI SIMONE'},
...and so on.
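
To illustrate the selector fix above with a toy document: a space between selectors is a descendant combinator, while writing the classes back-to-back requires both on the same element.

from bs4 import BeautifulSoup

html = '<div class="org fn">Acme</div> <div class="org"><p class="fn">Nested</p></div>'
soup = BeautifulSoup(html, 'html.parser')

print(soup.select('.org.fn'))   # matches the first div: both classes on one element
print(soup.select('.org fn'))   # [] - looks for an <fn> tag inside .org
print(soup.select('.org .fn'))  # matches the <p>: element with class fn inside .org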

Python BeautifulSoup Access Div container

I am trying to use BeautifulSoup to grab the container from the product detail page below that contains brand, product name, price, etc.
According to the Chrome inspector, it is a "div" container with the class "product-detail__info" (please see screenshot).
Unfortunately my code doesn't work...
I would appreciate it if someone could give me a tip :)
Thanks in advance
Link: https://www.nemlig.com/opvasketabs-all-in-one-5039333
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
my_url = "https://www.nemlig.com/opvasketabs-all-in-one-5039333"
#Opening connection and grabbing the page
uClient = uReq(my_url)
page_html = uClient.read()
#Closing connection
uClient.close()
#html parsing
page_soup = soup(page_html, "html.parser")
#grabs product detail container
container = page_soup.find_all("div", {"class": "product-detail__info"})
print(container)
The product-detail__info container is most likely rendered client-side by JavaScript, so it is not present in the HTML that urlopen receives. The data that you are looking for is, however, part of the source page (as a script).
Here is the code that will return it to you:
import requests
from bs4 import BeautifulSoup as soup
import json

r = requests.get('https://www.nemlig.com/opvasketabs-all-in-one-5039333')
if r.status_code == 200:
    soup = soup(r.text, "html.parser")
    scripts = soup.find_all("script")
    data = json.loads(scripts[6].next.strip()[:-1])
    print(data)

Output:
[{'#context': 'http://schema.org/', '#type': 'Organization', 'url': 'https://www.nemlig.com/', 'logo': 'https://www.nemlig.com/https://live.nemligstatic.com/s/b1.0.7272.30289/scom/dist/images/logos/nemlig-web-logo_tagline_rgb.svg', 'contactPoint': [{'#type': 'ContactPoint', 'telephone': '+45 70 33 72 33', 'contactType': 'customer service'}], 'sameAs': ['https://www.facebook.com/nemligcom/', 'https://www.instagram.com/nemligcom/', 'https://www.linkedin.com/company/nemlig-com']}, {'#context': 'http://schema.org/', '#type': 'Product', 'name': 'Opvasketabs all in one', 'brand': 'Ecover', 'image': 'https://live.nemligstatic.com/scommerce/images/opvasketabs-all-in-one.jpg?i=ZowWdq-y/5039333', 'description': '25 stk. / zero / Ecover', 'category': 'Maskinopvask', 'url': 'https://www.nemlig.com/opvasketabs-all-in-one-5039333', 'offers': {'#type': 'Offer', 'priceCurrency': 'DKK', 'price': '44.95'}}]
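
A note on robustness: indexing scripts[6] breaks as soon as the page adds or reorders script tags. Since the output above is schema.org JSON-LD, a sketch that filters by the script type attribute may be more stable; if the site appends trailing characters (as the [:-1] above suggests), you may still need to trim them before parsing.

import json
import requests
from bs4 import BeautifulSoup

r = requests.get('https://www.nemlig.com/opvasketabs-all-in-one-5039333')
soup = BeautifulSoup(r.text, 'html.parser')

data = []
for script in soup.find_all('script', type='application/ld+json'):
    try:
        data.append(json.loads(script.string))
    except (TypeError, ValueError):
        # skip blocks that are empty or not valid JSON
        continue
print(data)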

Python BeautifulSoup web scraping issue

import requests
from bs4 import BeautifulSoup

page = requests.get("http://www.freejobalert.com/upsc-recruitment/16960/#Engg-Services2019")
c = page.content
soup = BeautifulSoup(c, "html.parser")
data = soup.find_all("tr")
dict = {}
for r in data:
    td = r.find_all("td", {"style": "text-align: center;"})
    for d in td:
        link = d.find_all("a")
        for li in link:
            span = li.find_all("span", {"style": "color: #008000;"})
            for s in span:
                strong = s.find_all("strong")
                for st in strong:
                    dict['title'] = st.text
                    for l in link:
                        dict["link"] = l['href']
                    print(dict)
It is giving:
{'title': 'Syllabus', 'link': 'http://www.upsc.gov.in/'}
{'title': 'Syllabus', 'link': 'http://www.upsc.gov.in/'}
{'title': 'Syllabus', 'link': 'http://www.upsc.gov.in/'}
I am expecting:
{'title': 'Apply Online', 'link': 'https://upsconline.nic.in/mainmenu2.php'}
{'title': 'Notification', 'link': 'http://www.freejobalert.com/wp-content/uploads/2018/09/Notification-UPSC-Engg-Services-Prelims-Exam-2019.pdf'}
{'title': 'Official Website ', 'link': 'http://www.upsc.gov.in/'}
Here I want all the "Important Links", meaning "Apply Online", "Notification", and "Official Website", and the link for each table,
but it is giving me "Syllabus" as the title instead, with repeating links.
Please have a look into this.
This may help you; check the code below.

import requests
from bs4 import BeautifulSoup

page = requests.get('http://www.freejobalert.com/'
                    'upsc-recruitment/16960/#Engg-Services2019')
c = page.content
soup = BeautifulSoup(c, "html.parser")
row = soup.find_all('tr')

dict = {}
for i in row:
    for title in i.find_all('span', attrs={'style': 'color: #008000;'}):
        dict['Title'] = title.text
    for link in i.find_all('a', href=True):
        dict['Link'] = link['href']
    print(dict)
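
Note that both snippets above write into a single dict, so each print only shows the last title and link written in that pass, which is why every row reports "Syllabus". A minimal sketch that pairs each label with its own href, assuming the layout from the question (anchors inside centered table cells):

import requests
from bs4 import BeautifulSoup

page = requests.get('http://www.freejobalert.com/upsc-recruitment/16960/#Engg-Services2019')
soup = BeautifulSoup(page.content, 'html.parser')

results = []
for a in soup.select('td[style*="text-align: center"] a[href]'):
    # Each anchor carries its own label text and href, so pairing them here
    # avoids one value overwriting another across loop iterations.
    results.append({'title': a.get_text(strip=True), 'link': a['href']})
print(results)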

Python - How to retrieve certain text from a website

I have the following code:
import requests
from bs4 import BeautifulSoup
import urllib.request
import urllib.parse
import re
market = 'INDU:IND'
quote_page = 'http://www.bloomberg.com/quote/' + market
page = urllib.request.urlopen(quote_page)
soup = BeautifulSoup(page, 'html.parser')
name_box = soup.find('h1', attrs={'class': 'name'})
name = name_box.text.strip()
print('Market: ' + name)
This code works and lets me get the market name from the URL. I'm trying to do something similar with this website. Here is my code:
market = 'BTC-GBP'
quote_page = 'https://uk.finance.yahoo.com/quote/' + market
page = urllib.request.urlopen(quote_page)
soup = BeautifulSoup(page, 'html.parser')
name_box = soup.find('span', attrs={'class': 'Trsdu(0.3s) Fw(b) Fz(36px) Mb(-4px) D(ib)'})
name = name_box.text.strip()
print('Market: ' + name)
I'm not sure what to do. I want to retrieve the current rate, the amount it has increased/decreased by (as a number and a percentage), and finally when the information was last updated. How do I do this? I don't mind if you use a different method from the one I used previously, as long as you explain it. If my code is inefficient or unpythonic, could you also tell me how to fix it? I'm pretty new to web scraping and these modules. Thanks!
You can use BeautifulSoup, and when searching for the desired data, use a regex to match the dynamic span class names generated by the site's backend script:
from bs4 import BeautifulSoup as soup
import requests
import re

data = requests.get('https://uk.finance.yahoo.com/quote/BTC-GBP').text
s = soup(data, 'lxml')
d = [i.text for i in s.find_all('span', {'class': re.compile(r'Trsdu\(0\.\d+s\) Trsdu\(0\.\d+s\) Fw\(\w+\) Fz\(\d+px\) Mb\(-\d+px\) D\(\w+\)|Trsdu\(0\.\d+s\) Fw\(\d+\) Fz\(\d+px\) C\(\$data\w+\)')})]
date_published = re.findall(r'As of\s+\d+:\d+PM GMT\.|As of\s+\d+:\d+AM GMT\.', data)
final_results = dict(zip(['current', 'change', 'published'], d + date_published))
Output:
{'current': u'6,785.02', 'change': u'-202.99 (-2.90%)', 'published': u'As of 3:55PM GMT.'}
Edit: given the new URL, you need to change the span class name:

data = requests.get('https://uk.finance.yahoo.com/quote/AAPL?p=AAPL').text
final_results = dict(zip(['current', 'change', 'published'], [i.text for i in soup(data, 'lxml').find_all('span', {'class': re.compile(r'Trsdu\(0\.\d+s\) Trsdu\(0\.\d+s\) Fw\(b\) Fz\(\d+px\) Mb\(-\d+px\) D\(b\)|Trsdu\(0\.\d+s\) Fw\(\d+\) Fz\(\d+px\) C\(\$data\w+\)')})] + re.findall(r'At close:\s+\d:\d+PM EST', data)))
Output:
{'current': u'175.50', 'change': u'+3.00 (+1.74%)', 'published': u'At close: 4:00PM EST'}
You can directly use the API provided by Yahoo Finance.
For reference, check this answer:
Yahoo finance webservice API
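
For illustration, a minimal sketch against the unofficial chart endpoint that the Yahoo Finance site itself calls; it is undocumented, so the URL, required headers, and response shape may change at any time:

import requests

# Unofficial, undocumented endpoint used by the Yahoo Finance site itself.
url = 'https://query1.finance.yahoo.com/v8/finance/chart/BTC-GBP'
resp = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})

# The quote metadata sits under chart.result[0].meta in the JSON response.
meta = resp.json()['chart']['result'][0]['meta']
print(meta['regularMarketPrice'])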
