I am trying to access the sequence on this webpage:
https://www.ncbi.nlm.nih.gov/protein/EGW15053.1?report=fasta
The sequence is stored under the div class="seq gbff". Each line is stored under
<span class='ff_line' id='gi_344258949_1'> *line 1 of sequence* </span>
When I try to search for the spans containing the sequence, beautiful soup returns None. Same problem when I try to look at the children or content of the div above the spans.
Here is the code:
import requests
import re
from bs4 import BeautifulSoup
# Create a variable with the url
url = 'https://www.ncbi.nlm.nih.gov/protein/EGW15053.1?report=fasta'
# Use requests to get the contents
r = requests.get(url)
# Get the text of the contents
html_content = r.text
# Convert the html content into a beautiful soup object
soup = BeautifulSoup(html_content, 'html.parser')
div = soup.find('div', attrs={'class': 'seq gbff'})
for each in div.children:
    print(each)
soup.find_all('span', attrs={'class': 'ff_line'})
Neither method works and I'd greatly appreciate any help :D
This page uses JavaScript to load the data.
Using DevTools in Chrome/Firefox I found this URL, and it contains all the <span> elements:
https://www.ncbi.nlm.nih.gov/sviewer/viewer.fcgi?id=344258949&db=protein&report=fasta&extrafeat=0&fmt_mask=0&retmode=html&withmarkup=on&tool=portal&log$=seqview&maxdownloadsize=1000000
Now the hard part. You have to find this URL in the HTML, because different pages will use different arguments in the URL. Or you can compare a few URLs, find the schema, and generate the URL manually.
EDIT: if in the URL you change retmode=html to retmode=xml, then you get the data as XML. If you use retmode=text, then you get it as plain text without HTML tags. retmode=json doesn't work.
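For example, a minimal sketch that fetches the sequence as plain text (assuming the viewer accepts this reduced set of query parameters; the full URL above certainly works):
import requests

# Viewer URL found with DevTools; retmode=text returns the plain FASTA record,
# so there is nothing to parse with BeautifulSoup.
url = ('https://www.ncbi.nlm.nih.gov/sviewer/viewer.fcgi'
       '?id=344258949&db=protein&report=fasta&retmode=text')
r = requests.get(url)
print(r.text)  # >EGW15053.1 ... followed by the sequence lines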
I have to extract, from different web data sheets like this one, the section with the URL of the website.
The problem is that the class "vermell_nobullet", which has the href I need, is repeated at least twice.
How can I extract the specific "vermell_nobullet" element that holds the href of the website?
My code:
from bs4 import BeautifulSoup
import lxml
import requests
def parse_url(url):
    response = requests.get(url)
    content = response.content
    parsed_response = BeautifulSoup(content, "lxml")  # parse the content with the lxml parser
    return parsed_response
depPres = "http://sac.gencat.cat/sacgencat/AppJava/organisme_fitxa.jsp?codi=6"
print(depPres)
soup = parse_url(depPres)
referClass = soup.find_all("a", {"class":"vermell_nobullet"})
referClass
Output that I have:
[<a class="vermell_nobullet" href="https://ovt.gencat.cat/gsitfc/AppJava/generic/conqxsGeneric.do?webFormId=691">
Bústia electrònica
</a>,
<a class="vermell_nobullet" href="http://presidencia.gencat.cat">http://presidencia.gencat.cat</a>]
Output that I want:
http://presidencia.gencat.cat
You can add a condition: if the text and the href of an a tag are the same, take that particular tag.
referClass = soup.find_all("a", {"class":"vermell_nobullet"})
for refer in referClass:
    if refer.text == refer['href']:
        print(refer['href'])
Another way: find the last div element, then take the last href from it using the find_all method.
soup.find_all("div",class_="blockAdresa")[-1].find_all("a")[-1]['href']
Output:
'http://presidencia.gencat.cat'
I have code that extracts links from the main page and navigates through each page in the list of links. Each new page has a tab that is represented as follows in the source:
<Li Class=" tab-contacts" Id="contacts"><A Href="?id=448&tab=contacts"><Span Class="text">Contacts</Span>
I want to extract the href value and navigate to that page to get some information, here is my code so far:
import re
import requests
from bs4 import BeautifulSoup
r = requests.get(link_to_the_website)
data = r.content
soup = BeautifulSoup(data, "html.parser")
links = []
for i in soup.find_all('div', {'class': 'leftInfoWrap'}):
    link = i.find('a', href=True)
    if link is None:
        continue
    links.append(link.get('href'))
for link in links:
    soup = BeautifulSoup(link, "lxml")
    tabs = soup.select('Li', {'class': ' tab-contacts'})
    print(tabs)
However, I am getting an empty list from the print(tabs) command. I did verify the link variable, and it is being populated. Thanks in advance!
Looks like you are trying to mix find syntax with select.
I would use the parent id as an anchor, then navigate to the child a element with a CSS child combinator.
partial_link = soup.select_one('#contacts > a')['href']
You need to append the appropriate prefix.
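Putting it together, a sketch of the fixed loop over your links list (base_url is a placeholder here, since the hrefs appear to be relative, which is why the prefix is needed):
from urllib.parse import urljoin
import requests
from bs4 import BeautifulSoup

base_url = 'http://example.com'  # placeholder for the site being scraped

for link in links:
    # Fetch the page behind the link instead of parsing the URL string itself
    page = requests.get(urljoin(base_url, link))
    soup = BeautifulSoup(page.content, 'html.parser')
    tab = soup.select_one('#contacts > a')  # parent id anchor + child combinator
    if tab is not None:
        # e.g. '?id=448&tab=contacts' resolved against the current page URL
        contacts_url = urljoin(page.url, tab['href'])
        print(contacts_url)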
I just started programming.
My task is to extract data from an HTML page into Excel.
I am using Python 3.7.
My problem is that I have a website with more URLs inside, and behind those URLs there are again more URLs.
I need the data behind the third URL.
My first problem: how can I tell the program to choose only specific links from a ul, rather than every ul on the page?
from bs4 import BeautifulSoup
import urllib
import requests
import re
page = urllib.request.urlopen("file").read()
soup = BeautifulSoup(page, "html.parser")
print(soup.prettify())
for link in soup.find_all("a", href=re.compile("katalog_")):
    links = link.get("href")
    if "katalog" in links:
        for link in soup.find_all("a", href=re.compile("alle_")):
            links = link.get("href")
print(soup.get_text())
There are many ways; one is to use find_all and be specific about the tags, like the a tag, just as you did. If that's the only option, then use a regular expression on your output. You can refer to this thread: Python BeautifulSoup Extract specific URLs. Also, please show us either the link or the HTML structure of the links you want to extract; we would like to see the differences between the URLs.
PS: Sorry I can't make comments because of <50 reputation or I would have.
Updated answer based on understanding:
from bs4 import BeautifulSoup
import urllib
import requests
page = urllib.request.urlopen("https://www.bsi.bund.de/DE/Themen/ITGrundschutz/ITGrundschutzKompendium/itgrundschutzKompendium_node.html").read()
soup = BeautifulSoup(page, "html.parser")
for firstlink in soup.find_all("a", {"class": "RichTextIntLink NavNode"}):
    firstlinks = firstlink.get("href")
    if "bausteine" in firstlinks:
        # First level: follow the "bausteine" links (drop the ;jsessionid part)
        bausteinelinks = "https://www.bsi.bund.de/" + str(firstlinks.split(';')[0])
        response = urllib.request.urlopen(bausteinelinks).read()
        soup = BeautifulSoup(response, 'html.parser')
        # Second level: follow the Basepage link on each Baustein page
        secondlink = "https://www.bsi.bund.de/" + str(soup.find("a", {"class": "RichTextIntLink Basepage"})["href"].split(';')[0])
        res = urllib.request.urlopen(secondlink).read()
        soup = BeautifulSoup(res, 'html.parser')
        # Third level: print the text of the content div
        listoftext = soup.find_all("div", {"id": "content"})
        for text in listoftext:
            print(text.text)
I am trying to get the blog content from this blog post and by content, I just mean the first six paragraphs. This is what I've come up with so far:
soup = BeautifulSoup(url, 'lxml')
body = soup.find('div', class_='post-body')
Printing body will also include other stuff under the main div tag.
Try this:
import requests
from bs4 import BeautifulSoup

res = requests.get("http://www.fashionpulis.com/2017/08/being-proud-too-soon.html").text
soup = BeautifulSoup(res, 'html.parser')
for item in soup.select("div#post-body-604825342214355274"):
    print(item.text.strip())
Use this instead if you want it to work for any post; the attribute prefix selector matches the post-body div even though the numeric suffix of its id changes from post to post:
import requests
from bs4 import BeautifulSoup

res = requests.get("http://www.fashionpulis.com/2017/08/acceptance-is-must.html").text
soup = BeautifulSoup(res, 'html.parser')
for item in soup.select("div[id^='post-body-']"):
    print(item.text)
I found this solution very interesting: Scrape multiple pages with BeautifulSoup and Python
However, I haven't found any query string parameters to work with; maybe you can build something out of this approach.
What I find most obvious to do right now is something like this:
1. Scrape every month and year and get all the titles from the Blog Archive part of the pages (e.g. on http://www.fashionpulis.com/2017/03/ and so on)
2. Build the URLs using the titles and the corresponding months/years (the URL is always http://www.fashionpulis.com/$YEAR/$MONTH/$TITLE.html)
3. Scrape the text as described by Shahin in a previous answer; a rough sketch follows below
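Here is that sketch (assuming the archive pages link to the posts with absolute URLs matching the /$YEAR/$MONTH/$TITLE.html pattern; the year range is just an example):
import re
import requests
from bs4 import BeautifulSoup

for year in range(2016, 2018):
    for month in range(1, 13):
        archive = requests.get('http://www.fashionpulis.com/%d/%02d/' % (year, month))
        soup = BeautifulSoup(archive.text, 'html.parser')
        # Collect the post URLs listed on the archive page
        post_pattern = re.compile(r'/\d{4}/\d{2}/.+\.html$')
        post_urls = {a['href'] for a in soup.find_all('a', href=post_pattern)}
        for post_url in post_urls:
            post = BeautifulSoup(requests.get(post_url).text, 'html.parser')
            for item in post.select("div[id^='post-body-']"):
                print(item.text.strip())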
I am extracting data for a research project, and I have successfully used findAll('div', attrs={'class':'someClassName'}) on many websites, but this particular website,
WebSite Link
doesn't return any values when I use the attrs option. When I don't use the attrs option, I get the entire HTML DOM.
Here is the simple code that I started with to test it out:
soup = bs(urlopen(url))
for div in soup.findAll('div', attrs={'class':'data'}):
    print div
My code works fine using requests:
import requests
from BeautifulSoup import BeautifulSoup as bs
#grab HTML
r = requests.get(r'http://www.amazon.com/s/ref=sr_pg_1?rh=n:172282,k%3adigital%20camera&keywords=digital%20camera&ie=UTF8&qid=1343600585')
html = r.text
#parse the HTML
soup = bs(html)
results= soup.findAll('div', attrs={'class': 'data'})
print results
If you or anyone reading this question would like to know the reason the code you've given (copied below) wasn't able to find the attrs value:
soup = bs(urlopen(url))
for div in soup.findAll('div', attrs={'class':'data'}):
    print div
The issue is in how you created the BeautifulSoup object: soup = bs(urlopen(url)). The value of urlopen(url) is a response object, not the HTML markup.
I'm sure any issues you encountered could have been more easily resolved by using bs(urlopen(url).read()) instead.
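For example, a minimal corrected version (url is a placeholder; shown with Python 3 imports):
from urllib.request import urlopen  # urllib2 on Python 2
from bs4 import BeautifulSoup as bs

url = 'http://example.com'  # placeholder for the page you are scraping

# .read() hands BeautifulSoup the raw HTML bytes rather than the response object
soup = bs(urlopen(url).read(), 'html.parser')
for div in soup.findAll('div', attrs={'class': 'data'}):
    print(div)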