I am trying to extract a specific link for this html code
<a class="pageNum taLnk" data-offset="10" data-page-number="1"
href="www.blahblahblah.com/bb32123">Page 1 </a>
<a class="pageNum taLnk" data-offset="20" data-page-number="2"
href="www.blahblahblah.com/bb45135">Page 2 </a>
As you can see, the link (href) are disorganized, therefore there are no pattern for me to use which means I need to extract the href manually using BeautifulSoup.
I want to specifically get Page 2's href.
These can the code I have now.
from bs4 import BeautifulSoup
import urllib
url = 'https://www.tripadvisor.com/ShowUserReviews-g293917-d539542-r447460956-Duangtawan_Hotel_Chiang_Mai-Chiang_Mai.html#REVIEWS'
page = urllib.request.urlopen(url)
soup = BeautifulSoup(page, 'html.parser')
for link in soup.find_all('a', attrs = {'class' : 'pageNum taLnk'}):
print (link)
As you can see, I am stuck at trying to obtain the href information specifically for Page 2. Is there anyway to access with extra bit of information within the tags such as data-page-number = "2" or data-offset = "20".
page_2 = soup.find('a', attrs = {'data-page-number' : '2'})
This will only get you the page 2, if you want to get the next page no matter what the current page is, you should find the next page url:
next_page = soup.find('a', attrs = {'class' = 'nav next rndBtn ui_button primary taLnk'})
Some attributes, like the data-* attributes in HTML 5, have names that
can’t be used as the names of keyword arguments:
data_soup = BeautifulSoup('<div data-foo="value">foo!</div>')
data_soup.find_all(data-foo="value")
# SyntaxError: keyword can't be an expression
You can use these attributes in searches by putting them into a
dictionary and passing the dictionary into find_all() as the attrs
argument:
data_soup.find_all(attrs={"data-foo": "value"})
# [<div data-foo="value">foo!</div>]
Related
I have to extract from different web data sheet like this the section with url of website.
The problem is that the class “vermell_nobullet” than has the href than I need its repeat at least twice.
How to extract the specific class “vermell_nobullet” with the href of website.
My code
from bs4 import BeautifulSoup
import lxml
import requests
def parse_url(url):
response = requests.get(url)
content = response.content
parsed_response = BeautifulSoup(content, "lxml") # Variable que filtre pel contigut lxml
return parsed_response
depPres = "http://sac.gencat.cat/sacgencat/AppJava/organisme_fitxa.jsp?codi=6"
print(depPres)
soup = parse_url(depPres)
referClass = soup.find_all("a", {"class":"vermell_nobullet"})
referClass
Output that I have:
[<a class="vermell_nobullet" href="https://ovt.gencat.cat/gsitfc/AppJava/generic/conqxsGeneric.do?webFormId=691">
Bústia electrònica
</a>,
<a class="vermell_nobullet" href="http://presidencia.gencat.cat">http://presidencia.gencat.cat</a>]
Output that I want:
http://presidencia.gencat.cat
You can put condition like if text and href is same from a tag you can take
particular tag
referClass = soup.find_all("a", {"class":"vermell_nobullet"})
for refer in referClass:
if refer.text==refer['href']:
print(refer['href'])
Another Way find last div element and also find last href using find_all method
soup.find_all("div",class_="blockAdresa")[-1].find_all("a")[-1]['href']
Output:
'http://presidencia.gencat.cat'
I'm a beginner in python, I'm trying to get the first search result link from google which was stored inside a div with class='yuRUbf' using beautifulsoup. When I run the script output is 'None' what is the error here.
import requests
import bs4
url = 'https://www.google.com/search?q=site%3Astackoverflow.com+how+to+use+bs4+in+python&sxsrf=AOaemvKrCLt-Ji_EiPLjcEso3DVfBUmRbg%3A1630215433722&ei=CR0rYby7K7ue4-EP7pqIkAw&oq=site%3Astackoverflow.com+how+to+use+bs4+in+python&gs_lcp=Cgdnd3Mtd2l6EAM6BwgAEEcQsAM6BwgjELACECc6BQgAEM0CSgQIQRgAUMw2WPh_YLiFAWgBcAJ4AIABkAKIAd8lkgEHMC4xMC4xM5gBAKABAcgBCMABAQ&sclient=gws-wiz&ved=0ahUKEwj849XewdXyAhU7zzgGHW4NAsIQ4dUDCA8&uact=5'
request_result=requests.get( url )
soup = bs4.BeautifulSoup(request_result.text,"html.parser")
productDivs = soup.find("div", {"class": "yuRUbf"})
print(productDivs)
Let's see:
from bs4 import BeautifulSoup
import requests, json
headers = {
'User-agent':
"useragent"
}
html = requests.get('https://www.google.com/search?q=hello', headers=headers).text
soup = BeautifulSoup(html, 'lxml')
# locating div element with a tF2Cxc class
# calling for <a> tag and then calling for 'href' attribute
link = soup.find('div', class_='tF2Cxc').a['href']
print(link)
output:
'''
https://www.youtube.com/watch?v=YQHsXMglC9A
As you want first google search in which class name which you are looking for might be differ with name so first you can first find manually that link so it will be easy to identify
import requests
import bs4
url = 'https://www.google.com/search?q=site%3Astackoverflow.com+how+to+use+bs4+in+python&sxsrf=AOaemvKrCLt-Ji_EiPLjcEso3DVfBUmRbg%3A1630215433722&ei=CR0rYby7K7ue4-EP7pqIkAw&oq=site%3Astackoverflow.com+how+to+use+bs4+in+python&gs_lcp=Cgdnd3Mtd2l6EAM6BwgAEEcQsAM6BwgjELACECc6BQgAEM0CSgQIQRgAUMw2WPh_YLiFAWgBcAJ4AIABkAKIAd8lkgEHMC4xMC4xM5gBAKABAcgBCMABAQ&sclient=gws-wiz&ved=0ahUKEwj849XewdXyAhU7zzgGHW4NAsIQ4dUDCA8&uact=5'
request_result=requests.get( url )
soup = bs4.BeautifulSoup(request_result.text,"html.parser")
Using select method:
I have used css selector method in which it identifies all matching
divs and from list i have taken from index postion 1
And than i have use select_one to get a tag and find href
according to it!
main_data=soup.select("div.ZINbbc.xpd.O9g5cc.uUPGi")[1:]
main_data[0].select_one("a")['href'].replace("/url?q=","")
Using find method:
main_data=soup.find_all("div",class_="ZINbbc xpd O9g5cc uUPGi")[1:]
main_data[0].find("a")['href'].replace("/url?q=","")
Output [Same for Both the Case]:
'https://stackoverflow.com/questions/23102833/how-to-scrape-a-website-which-requires-login-using-python-and-beautifulsoup&sa=U&ved=2ahUKEwjGxv2wytXyAhUprZUCHR8mBNsQFnoECAkQAQ&usg=AOvVaw280R9Wlz2mUKHFYQUOFVv8'
I just started using beautifulsoup and am stuck on an issue regarding getting attributes of tags inside other tags. I am using the whitehouse.gov/briefing-room/ for practice. What I'm trying to do right now is just get all the links on this page and append them to an empty list. This is my code right now:
result = requests.get("https://www.whitehouse.gov/briefing-room/")
src = result.content
soup = BeautifulSoup(src, 'lxml')
urls = []
for h2_tags in soup.find_all('h2'):
a_tag = h2_tags.find('a')
urls.append(a_tag.attr['href']) # This is where I get the NoneType error
This code returns the <a tags, but the first and last 3 tags it returns are 'None' and because of this, get a type error when trying to access the attributes to get the href for these <a tags
The problem is, that some <h2> tags don't contain <a> tags. So you have to check for that alternative. Or just select all <a> tags that are under <h2> using CSS selector:
import requests
from bs4 import BeautifulSoup
result = requests.get("https://www.whitehouse.gov/briefing-room/")
src = result.content
soup = BeautifulSoup(src, 'lxml')
urls = []
for a_tag in soup.select('h2 a'): # <-- select <A> tags that are under <H2> tags
urls.append(a_tag.attrs['href'])
print(*urls, sep='\n')
Prints:
https://www.whitehouse.gov/briefing-room/statements-releases/2021/03/10/statement-by-nsc-spokesperson-emily-horne-on-national-security-advisor-jake-sullivan-leading-the-first-virtual-meeting-of-the-u-s-israel-strategic-consultative-group/
https://www.whitehouse.gov/briefing-room/press-briefings/2021/03/09/press-briefing-by-press-secretary-jen-psaki-and-deputy-director-of-the-national-economic-council-bharat-ramamurti-march-9-2021/
https://www.whitehouse.gov/briefing-room/statements-releases/2021/03/09/readout-of-the-white-houses-meeting-with-climate-finance-leaders/
https://www.whitehouse.gov/briefing-room/statements-releases/2021/03/09/readout-of-vice-president-kamala-harris-call-with-prime-minister-erna-solberg-of-norway/
https://www.whitehouse.gov/briefing-room/statements-releases/2021/03/09/nomination-sent-to-the-senate-3/
https://www.whitehouse.gov/briefing-room/statements-releases/2021/03/09/president-biden-announces-key-hire-for-the-office-of-management-and-budget/
https://www.whitehouse.gov/briefing-room/speeches-remarks/2021/03/09/remarks-by-president-biden-during-tour-of-w-s-jenks-son/
https://www.whitehouse.gov/briefing-room/statements-releases/2021/03/09/president-joseph-r-biden-jr-approves-louisiana-disaster-declaration/
https://www.whitehouse.gov/briefing-room/statements-releases/2021/03/09/statement-by-president-joe-biden-on-the-house-taking-up-the-pro-act/
https://www.whitehouse.gov/briefing-room/statements-releases/2021/03/09/white-house-announces-additional-staff/
I have code that extracts links from the main page and navigates through each page in the list of links, the new link has a tab page that is represented as follows in the source:
<Li Class=" tab-contacts" Id="contacts"><A Href="?id=448&tab=contacts"><Span Class="text">Contacts</Span>
I want to extract the href value and navigate to that page to get some information, here is my code so far:
import re
import requests
from bs4 import BeautifulSoup
r = requests.get(link_to_the_website)
data = r.content
soup = BeautifulSoup(data, "html.parser")
links = []
for i in soup.find_all('div',{'class':'leftInfoWrap'}):
link = i.find('a',href=True)
if link is None:
continue
links.append(link.get('href'))
for link in links:
soup = BeautifulSoup(link,"lxml")
tabs = soup.select('Li',{'class':' tab-contacts'})
print(tabs)
However I am getting an empty list with 'print(tabs)' command. I did verify the link variable and it is being populated. Thanks in advance
Looks like you are trying to mix find syntax with select.
I would use the parent id as an anchor then navigate to the child with css selectors and child combinator.
partial_link = soup.select_one('#contacts > a')['href']
You need to append the appropriate prefix.
I am using beautifulsoup to get all the links from a page. My code is:
import requests
from bs4 import BeautifulSoup
url = 'http://www.acontecaeventos.com.br/marketing-promocional-sao-paulo'
r = requests.get(url)
html_content = r.text
soup = BeautifulSoup(html_content, 'lxml')
soup.find_all('href')
All that I get is:
[]
How can I get a list of all the href links on that page?
You are telling the find_all method to find href tags, not attributes.
You need to find the <a> tags, they're used to represent link elements.
links = soup.find_all('a')
Later you can access their href attributes like this:
link = links[0] # get the first link in the entire page
url = link['href'] # get value of the href attribute
url = link.get('href') # or like this
Replace your last line:
links = soup.find_all('a')
By that line :
links = [a.get('href') for a in soup.find_all('a', href=True)]
It will scrap all the a tags, and for each a tags, it will append the href attribute to the links list.
If you want to know more about the for loop between the [], read about List comprehensions.
To get a list of everyhref regardless of tag use:
href_tags = soup.find_all(href=True)
hrefs = [tag.get('href') for tag in href_tags]