I'm a beginner in Python. I'm trying to get the first search result link from Google, which is stored inside a div with class='yuRUbf', using BeautifulSoup. When I run the script the output is None. What is the error here?
import requests
import bs4
url = 'https://www.google.com/search?q=site%3Astackoverflow.com+how+to+use+bs4+in+python&sxsrf=AOaemvKrCLt-Ji_EiPLjcEso3DVfBUmRbg%3A1630215433722&ei=CR0rYby7K7ue4-EP7pqIkAw&oq=site%3Astackoverflow.com+how+to+use+bs4+in+python&gs_lcp=Cgdnd3Mtd2l6EAM6BwgAEEcQsAM6BwgjELACECc6BQgAEM0CSgQIQRgAUMw2WPh_YLiFAWgBcAJ4AIABkAKIAd8lkgEHMC4xMC4xM5gBAKABAcgBCMABAQ&sclient=gws-wiz&ved=0ahUKEwj849XewdXyAhU7zzgGHW4NAsIQ4dUDCA8&uact=5'
request_result = requests.get(url)
soup = bs4.BeautifulSoup(request_result.text, "html.parser")
productDivs = soup.find("div", {"class": "yuRUbf"})
print(productDivs)
Let's see. The likely cause: without a browser-like User-Agent header, Google serves a simplified HTML layout that doesn't contain the classes you see in the browser inspector, so find() returns None. Sending a User-Agent fixes it:
from bs4 import BeautifulSoup
import requests, json
headers = {
    'User-agent': "useragent"
}
html = requests.get('https://www.google.com/search?q=hello', headers=headers).text
soup = BeautifulSoup(html, 'lxml')
# locating div element with a tF2Cxc class
# calling for <a> tag and then calling for 'href' attribute
link = soup.find('div', class_='tF2Cxc').a['href']
print(link)
Output:
https://www.youtube.com/watch?v=YQHsXMglC9A
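The same fix applies to your original yuRUbf selector; here is a minimal sketch, assuming Google still serves that class to browser-like requests (result-page class names change often, so treat the class name as an assumption):

import requests
from bs4 import BeautifulSoup

# Without a browser-like User-Agent, Google returns a simplified layout
# with no 'yuRUbf' divs, which is why find() returned None.
headers = {'User-agent': 'useragent'}  # replace with a real browser UA string
html = requests.get('https://www.google.com/search?q=hello', headers=headers).text
soup = BeautifulSoup(html, 'html.parser')
div = soup.find('div', class_='yuRUbf')
if div is not None:  # hedge: the class may not exist in the served markup
    print(div.a['href'])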
Since you want the first Google search result, note that the class name you are looking for may differ from what your browser shows; first find that link manually in the HTML that requests receives, so it is easy to identify:
import requests
import bs4
url = 'https://www.google.com/search?q=site%3Astackoverflow.com+how+to+use+bs4+in+python&sxsrf=AOaemvKrCLt-Ji_EiPLjcEso3DVfBUmRbg%3A1630215433722&ei=CR0rYby7K7ue4-EP7pqIkAw&oq=site%3Astackoverflow.com+how+to+use+bs4+in+python&gs_lcp=Cgdnd3Mtd2l6EAM6BwgAEEcQsAM6BwgjELACECc6BQgAEM0CSgQIQRgAUMw2WPh_YLiFAWgBcAJ4AIABkAKIAd8lkgEHMC4xMC4xM5gBAKABAcgBCMABAQ&sclient=gws-wiz&ved=0ahUKEwj849XewdXyAhU7zzgGHW4NAsIQ4dUDCA8&uact=5'
request_result = requests.get(url)
soup = bs4.BeautifulSoup(request_result.text, "html.parser")
Using the select method:
I used a CSS selector, which identifies all matching divs; from the resulting list I took everything from index position 1 onward. Then I used select_one to get the <a> tag and read its href.
main_data = soup.select("div.ZINbbc.xpd.O9g5cc.uUPGi")[1:]
main_data[0].select_one("a")['href'].replace("/url?q=", "")
Using the find method:
main_data = soup.find_all("div", class_="ZINbbc xpd O9g5cc uUPGi")[1:]
main_data[0].find("a")['href'].replace("/url?q=", "")
Output (same for both cases):
'https://stackoverflow.com/questions/23102833/how-to-scrape-a-website-which-requires-login-using-python-and-beautifulsoup&sa=U&ved=2ahUKEwjGxv2wytXyAhUprZUCHR8mBNsQFnoECAkQAQ&usg=AOvVaw280R9Wlz2mUKHFYQUOFVv8'
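Note that the href extracted this way still carries Google's redirect parameters (&sa=..., &ved=..., &usg=...). A simple, assumption-based cleanup for this URL shape is to cut at the first &sa=:

# Strip the /url?q= prefix, then drop the tracking parameters
# (this assumes the target URL itself never contains '&sa=').
raw = main_data[0].select_one("a")['href'].replace("/url?q=", "")
clean = raw.split("&sa=")[0]
print(clean)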
Related
I am looking to download the "Latest File" from the URL below:
https://www.abs.gov.au/statistics/economy/national-accounts/australian-national-accounts-national-income-expenditure-and-product
The file I want to download is at the following exact location:
https://www.abs.gov.au/statistics/economy/national-accounts/australian-national-accounts-national-income-expenditure-and-product/sep-2022#data-downloads
For example, the file name is "Table 1".
How can I download this using BeautifulSoup when I am only given the base URL as above?
I am unable to figure out how to work through the nested URLs within the HTML page to find the one I need to download.
First you need to get the latest link:
latest_link = 'https://www.abs.gov.au/' + soup.find('span', class_='flag_latest').find_previous('a').get('href')
Then find the document to download. In my example I download everything, but you can change it:
download_all_link = 'https://www.abs.gov.au/' + soup.find('div', class_='anchor-button-wrapper').find('a').get('href')
And the last step: download it.
FULL CODE:
import requests
from bs4 import BeautifulSoup
url = 'https://www.abs.gov.au/statistics/economy/national-accounts/australian-national-accounts-national-income-expenditure-and-product'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'lxml')
latest_link = 'https://www.abs.gov.au/' + soup.find('span', class_='flag_latest').find_previous('a').get('href')
response = requests.get(latest_link)
soup = BeautifulSoup(response.text, 'lxml')
download_all_link = 'https://www.abs.gov.au/' + soup.find('div', class_='anchor-button-wrapper').find('a').get('href')
file_data = requests.get(download_all_link).content
with open(download_all_link.split("/")[-1], 'wb') as handler:
    handler.write(file_data)
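If you need a single file such as "Table 1" rather than the download-all archive, one sketch (assuming the downloads page lists the tables as plain <a> links whose visible text contains the table name; inspect the real markup first) is to match on the link text:

# Hypothetical: grab the first link whose text mentions "Table 1".
for a in soup.find_all('a', href=True):
    if 'Table 1' in a.get_text():
        file_url = 'https://www.abs.gov.au/' + a['href']
        with open(file_url.split('/')[-1], 'wb') as handler:
            handler.write(requests.get(file_url).content)
        break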
I've never used BeautifulSoup before. Pretty cool stuff. This seems to do it for me:
from bs4 import BeautifulSoup
with open("demo.html") as fp:
soup = BeautifulSoup(fp, "html.parser")
# lets look for the span with the 'flag_latest' class attribute
for span in soup.find_all('span'):
if span.get('class', None) and 'flag_latest' in span['class']:
# step up the a level to the div and grab the a tag
print(span.parent.a['href'])
So we just look for the span with the 'flag_latest' class and then step up a level in the tree (a div) and then grab the first a tag and extract the href.
Check out the docs and read the sections on "Navigating the Tree" and "Searching the Tree".
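As a side note, the loop can be collapsed using find's class_ shortcut, which matches the same span directly:

# Equivalent, more direct lookup: take the first span with that class,
# then step up to its parent and read the first <a> tag's href.
span = soup.find('span', class_='flag_latest')
if span is not None:
    print(span.parent.a['href'])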
I have to extract, from different web data sheets like this one, the section with the URL of the website.
The problem is that the “vermell_nobullet” class that has the href I need is repeated at least twice.
How can I extract the specific “vermell_nobullet” class with the href of the website?
My code
from bs4 import BeautifulSoup
import lxml
import requests
def parse_url(url):
    response = requests.get(url)
    content = response.content
    parsed_response = BeautifulSoup(content, "lxml")  # parse the content with the lxml parser
    return parsed_response
depPres = "http://sac.gencat.cat/sacgencat/AppJava/organisme_fitxa.jsp?codi=6"
print(depPres)
soup = parse_url(depPres)
referClass = soup.find_all("a", {"class":"vermell_nobullet"})
referClass
Output that I have:
[<a class="vermell_nobullet" href="https://ovt.gencat.cat/gsitfc/AppJava/generic/conqxsGeneric.do?webFormId=691">
Bústia electrònica
</a>,
<a class="vermell_nobullet" href="http://presidencia.gencat.cat">http://presidencia.gencat.cat</a>]
Output that I want:
http://presidencia.gencat.cat
You can add a condition: if the text and the href of an <a> tag are the same, take that particular tag:
referClass = soup.find_all("a", {"class": "vermell_nobullet"})
for refer in referClass:
    if refer.text == refer['href']:
        print(refer['href'])
Another way: find the last div element and then the last href, using the find_all method:
soup.find_all("div",class_="blockAdresa")[-1].find_all("a")[-1]['href']
Output:
'http://presidencia.gencat.cat'
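A third option, if the website link is always the last vermell_nobullet anchor on the page (an assumption about this particular layout), is to take the last match with a CSS selector:

# Assumes the site URL is always the final vermell_nobullet link.
links = soup.select("a.vermell_nobullet")
print(links[-1]['href'])  # 'http://presidencia.gencat.cat'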
Hello everyone, I'm new to BeautifulSoup. I'm trying to write a function that extracts second-level URLs from a given website.
For example, if I have this website URL: https://edition.cnn.com/, my function should be able to return:
https://edition.cnn.com/world
https://edition.cnn.com/politics
https://edition.cnn.com/business
https://edition.cnn.com/health
https://edition.cnn.com/entertainment
https://edition.cnn.com/style
https://edition.cnn.com/travel
First I tried this code to retrieve all links starting with the string of the URL:
from bs4 import BeautifulSoup as bs4
import requests
import lxml
import re
def getLinks(url):
    response = requests.get(url)
    data = response.text
    soup = bs4(data, 'lxml')
    links = []
    for link in soup.find_all('a', href=re.compile(str(url))):
        links.append(link.get('href'))
    return links
But the actual output gives me all the links, even links to articles, which is not what I'm looking for. Is there a method I can use to get what I want, using regular expressions or otherwise?
The links are inside the <nav> tag, so the CSS selector nav a[href] will select only links inside the <nav> tag:
import requests
from bs4 import BeautifulSoup
url = 'https://edition.cnn.com'
soup = BeautifulSoup(requests.get(url).text, 'lxml')
for a in soup.select('nav a[href]'):
    if a['href'].count('/') > 1 or '#' in a['href']:
        continue
    print(url + a['href'])
Prints:
https://edition.cnn.com/world
https://edition.cnn.com/politics
https://edition.cnn.com/business
https://edition.cnn.com/health
https://edition.cnn.com/entertainment
https://edition.cnn.com/style
https://edition.cnn.com/travel
https://edition.cnn.com/sport
https://edition.cnn.com/videos
https://edition.cnn.com/world
https://edition.cnn.com/africa
https://edition.cnn.com/americas
https://edition.cnn.com/asia
https://edition.cnn.com/australia
https://edition.cnn.com/china
https://edition.cnn.com/europe
https://edition.cnn.com/india
https://edition.cnn.com/middle-east
https://edition.cnn.com/uk
...and so on.
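Note the list contains duplicates (/world appears in both the header and the section nav). If each section should appear once, collect the paths into a set first; a small sketch reusing the same filter:

# Deduplicate the nav paths while applying the same filter as above.
sections = {a['href'] for a in soup.select('nav a[href]')
            if a['href'].count('/') <= 1 and '#' not in a['href']}
for path in sorted(sections):
    print(url + path)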
I am trying to extract the gallery link of the first result on an imgur search.
theurl = "https://imgur.com/search?q=" +text
thepage = urlopen(theurl)
soup = BeautifulSoup(thepage,"html.parser")
link = soup.findAll('a',{"class":"image-list-link"})[0].decode_contents()
The displayed value of link is the tag's inner HTML rather than the href (decode_contents() returns a tag's contents, not its attributes). I am mainly trying to get the href value from only this section (the first result for the search).
Actually, it's pretty easy to accomplish what you're trying to do. The href of the first image (or of any image, for that matter) is located inside an <a> tag with the attribute class="image-list-link". So you can use the find() function, which returns the first match found, and then use ['href'] to get the link.
Code:
import requests
from bs4 import BeautifulSoup
r = requests.get('https://imgur.com/search?q=python')
soup = BeautifulSoup(r.text, 'lxml')
first_image_link = soup.find('a', class_='image-list-link')['href']
print(first_image_link)
# /gallery/AxKwQ2c
If you want to get the links for all the images, you can use a list comprehension.
all_image_links = [a['href'] for a in soup.find_all('a', class_='image-list-link')]
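The hrefs are site-relative (/gallery/AxKwQ2c), so if you need absolute URLs, urljoin from the standard library prepends the site root:

from urllib.parse import urljoin

# Turn the relative gallery path into a full imgur URL.
full_link = urljoin('https://imgur.com', first_image_link)
print(full_link)  # https://imgur.com/gallery/AxKwQ2c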
I am trying to access the sequence on this webpage:
https://www.ncbi.nlm.nih.gov/protein/EGW15053.1?report=fasta
The sequence is stored under the div class="seq gbff". Each line is stored under:
<span class="ff_line" id="gi_344258949_1"> *line 1 of sequence* </span>
When I try to search for the spans containing the sequence, BeautifulSoup returns None. The same problem occurs when I try to look at the children or contents of the div above the spans.
Here is the code:
import requests
import re
from bs4 import BeautifulSoup
# Create a variable with the url
url = 'https://www.ncbi.nlm.nih.gov/protein/EGW15053.1?report=fasta'
# Use requests to get the contents
r = requests.get(url)
# Get the text of the contents
html_content = r.text
# Convert the html content into a beautiful soup object
soup = BeautifulSoup(html_content, 'html.parser')
div = soup.find('div', attrs={'class': 'seq gbff'})  # attrs must be a dict, and find() (not find_all) returns a single tag with .children
for each in div.children:
    print(each)
soup.find_all('span', attrs={'class': 'ff_line'})
Neither method works and I'd greatly appreciate any help :D
This page uses JavaScript to load the data.
Using DevTools in Chrome/Firefox, I found this URL, which contains all the <span> elements:
https://www.ncbi.nlm.nih.gov/sviewer/viewer.fcgi?id=344258949&db=protein&report=fasta&extrafeat=0&fmt_mask=0&retmode=html&withmarkup=on&tool=portal&log$=seqview&maxdownloadsize=1000000
Now the hard part. You have to find this URL in the HTML, because different pages will use different arguments in the URL. Or you have to compare a few URLs and work out the schema, so you can generate the URL manually.
EDIT: if in the URL you change retmode=html to retmode=xml, then you get it as XML. If you use retmode=text, then you get it as text without HTML tags. retmode=json doesn't work.
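As a sketch of the text route (assuming the numeric id, 344258949 here, is known or has been extracted from the page), the whole FASTA record comes back as plain text with nothing to parse:

import requests

# retmode=text returns the raw FASTA record, so no HTML parsing is needed.
params = {
    'id': '344258949',
    'db': 'protein',
    'report': 'fasta',
    'retmode': 'text',
}
url = 'https://www.ncbi.nlm.nih.gov/sviewer/viewer.fcgi'
print(requests.get(url, params=params).text)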