BS4 to find embedded ID number - python

I'm trying to use BS4 to find ID numbers that are embedded in HTML. However, they aren't really attached to anything I'm used to working with, like a tag. I've looked at how to pull something from a div class, but haven't had success with that either. Below is what the soup looks like after I collect the HTML:
<div class="result-bio">
<div class="profile-image">
<img class="search-image load-member-profile" ng-click="results.loadProfile(result.UGuid)"
ng-src="http://file.asdf.org/profile/picture.ashx?id=9091a557-fd44-44be-9468-9386d90a39b8"
src="http://file.asdf.org/profile/picture.ashx?id=9091a557-fd44-44be-9468-9386d90a39b8"/>
</div>
The code I have been attempting is:
soup = BeautifulSoup(browser.page_source, 'html.parser')
print(soup)
ID_Numbers = []
for IDs in soup.find_all('div', string='http'):
    ID_Numbers.append(IDs.text)
Does anyone have a suggestion on how to solve this? I imagine I will have to strip later, but really all I want is that id=xxxx value embedded in there. I've tried most of the solutions I've seen on stack with no success. Thanks!

You need to find all the image elements, then loop over these elements to get the source for each image. Then, simply break the src down to get the id.
soup = bs4.BeautifulSoup(page_source, features='html.parser')
for i in soup.find_all('img'):
    src = i['src']
    try:
        id = src.split('?id=')[1]
        print(id)
    except IndexError:
        continue
Here, I have split the src to get the id, but in more complicated cases you may need to use regex.
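For instance, a minimal regex sketch, assuming the ids are always UUID-shaped like the one in the sample src above:

```python
import re

src = "http://file.asdf.org/profile/picture.ashx?id=9091a557-fd44-44be-9468-9386d90a39b8"

# Look for a 36-character UUID-shaped value after "id=" (the pattern is
# assumed from the sample src; adjust it if your ids look different)
match = re.search(r'id=([0-9a-f\-]{36})', src)
if match:
    print(match.group(1))
```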

You can use the built-in urllib.parse library like below:
from bs4 import BeautifulSoup
from urllib.parse import urlsplit
html_doc = """
<div class="result-bio">
<div class="profile-image">
<img class="search-image load-member-profile" ng-click="results.loadProfile(result.UGuid)"
ng-src="http://file.asdf.org/profile/picture.ashx?id=9091a557-fd44-44be-9468-9386d90a39b8"
src="http://file.asdf.org/profile/picture.ashx?id=9091a557-fd44-44be-9468-9386d90a39b8"/>
</div>
"""
soup = BeautifulSoup(html_doc, 'html.parser')
images = soup.find_all('img')
ids = []
for i in images:
    parsed = urlsplit(i['src'])
    id = parsed.query.replace("id=", "")
    ids.append(id)
print(ids)
This gave me the output:
['9091a557-fd44-44be-9468-9386d90a39b8']
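If the query string ever carries more than one parameter, parse_qs is a more robust way to pull out the id than replacing "id=" — a sketch using the same sample URL:

```python
from urllib.parse import urlsplit, parse_qs

src = "http://file.asdf.org/profile/picture.ashx?id=9091a557-fd44-44be-9468-9386d90a39b8"

# parse_qs returns a dict of lists, one list per query parameter,
# so an extra parameter like "&size=large" would not break the lookup
query = parse_qs(urlsplit(src).query)
print(query['id'][0])
```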

Related

Extract <div title="xpto"> element with BS4

I'm trying to extract all <div title="some different text here"> elements, but I couldn't find a way.
There are several of them spread over the HTML structure, and it seems that I would need to fill the title attribute with the right text value to match them, but they are all different.
What I am trying:
container_indicators = today_container.find_all('div', {'title'})
But it didn't work out.
I'm working with beautiful soup + python.
Any help?
Try:
container_indicators = today_container.find_all('div', attrs={'title': True})
Or:
container_indicators = today_container.select('div[title]')
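Both forms match any div that merely has a title attribute, whatever its value — a quick sketch on made-up HTML:

```python
from bs4 import BeautifulSoup

html = '<div title="xpto">a</div><div>b</div><div title="other">c</div>'
soup = BeautifulSoup(html, 'html.parser')

# attrs={'title': True} matches the presence of the attribute, not a value
by_attrs = soup.find_all('div', attrs={'title': True})
# the CSS attribute selector div[title] does the same
by_css = soup.select('div[title]')

print([d['title'] for d in by_attrs])
print([d['title'] for d in by_css])
```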

Get content from certain tags with certain attributes using BS4

I need to get the content from the following tag with these attributes: <span class="h6 m-0">.
An example of the HTML I'll encounter would be <span class="h6 m-0">Hello world</span>, and it obviously needs to return Hello world.
My current code is as follows:
page = BeautifulSoup(text, 'html.parser')
names = [item["class"] for item in page.find_all('span')]
This works fine, and gets me all the spans in the page, but I don't know how to specify that I only want those with the specific class "h6 m-0" and grab the content inside. How will I go about doing this?
page = BeautifulSoup(text, 'html.parser')
names = page.find_all('span', class_='h6 m-0')
Without knowing your use case I don't know if this will work.
names = [item["class"] for item in page.find_all('span', class_="h6 m-0")]
Can you please be more specific about what problem you are facing? This should work fine for you.
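One caveat worth knowing: class_ with a multi-class string matches the exact attribute string, so a span written as class="m-0 h6" (other order) would be missed; a CSS selector matches both classes regardless of order. A small sketch on made-up HTML:

```python
from bs4 import BeautifulSoup

html = '<span class="h6 m-0">Hello world</span><span class="m-0 h6">Hi</span>'
soup = BeautifulSoup(html, 'html.parser')

# class_="h6 m-0" only matches the exact attribute string "h6 m-0"
print([s.get_text() for s in soup.find_all('span', class_='h6 m-0')])
# span.h6.m-0 matches both classes in any order
print([s.get_text() for s in soup.select('span.h6.m-0')])
```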

BeautifulSoup doesn't find all spans or children

I am trying to access the sequence on this webpage:
https://www.ncbi.nlm.nih.gov/protein/EGW15053.1?report=fasta
The sequence is stored under the div class="seq gbff". Each line is stored under
<span class="ff_line" id="gi_344258949_1"> *line 1 of sequence* </span>
When I try to search for the spans containing the sequence, beautiful soup returns None. Same problem when I try to look at the children or content of the div above the spans.
Here is the code:
import requests
import re
from bs4 import BeautifulSoup
# Create a variable with the url
url = 'https://www.ncbi.nlm.nih.gov/protein/EGW15053.1?report=fasta'
# Use requests to get the contents
r = requests.get(url)
# Get the text of the contents
html_content = r.text
# Convert the html content into a beautiful soup object
soup = BeautifulSoup(html_content, 'html.parser')
div = soup.find_all('div', attrs={'class', 'seq gbff'})
for each in div.children:
    print(each)
soup.find_all('span', aatrs={'class', 'ff_line'})
Neither method works and I'd greatly appreciate any help :D
This page uses JavaScript to load the data.
With DevTools in Chrome/Firefox I found this url, and it contains all the <span> elements:
https://www.ncbi.nlm.nih.gov/sviewer/viewer.fcgi?id=344258949&db=protein&report=fasta&extrafeat=0&fmt_mask=0&retmode=html&withmarkup=on&tool=portal&log$=seqview&maxdownloadsize=1000000
Now the hard part: you have to find this url in the HTML, because different pages will use different arguments in the url. Or you can compare a few urls and work out the schema so you can generate the url manually.
EDIT: if in the url you change retmode=html to retmode=xml, then you get it as XML. If you use retmode=text, then you get it as text without HTML tags. retmode=json doesn't work.
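One way to generate the url manually is to keep the fixed arguments in a dict and only swap the id per page. A sketch — the parameter set is copied from the url above and may not be minimal, and the assumption that only id changes between protein pages is not verified:

```python
from urllib.parse import urlencode

# Arguments taken from the url found in DevTools; only 'id' is assumed
# to change between protein pages
params = {
    'id': '344258949',
    'db': 'protein',
    'report': 'fasta',
    'retmode': 'text',   # plain sequence text, no HTML tags
}
url = 'https://www.ncbi.nlm.nih.gov/sviewer/viewer.fcgi?' + urlencode(params)
print(url)
```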

I am struggling with Python/HTML. I know the class of a certain header, and I need the info from the generic <a href=...> in this h1

So, I have this:
<h1 class='entry-title'>
<a href='http://theurlthatvariesinlengthbasedonwhenirequesthehtml'>theTitleIneedthatvariesinlength</a>
</h1>
How can I retrieve the URL (it is not always the same) and the title (also not always the same)?
Parse it with an HTML parser, e.g. with BeautifulSoup it would be:
from bs4 import BeautifulSoup
data = "your HTML here"  # data can be the result of urllib.request.urlopen(url)
soup = BeautifulSoup(data, 'html.parser')
link = soup.select("h1.entry-title > a")[0]
print(link.get("href"))
print(link.get_text())
where h1.entry-title > a is a CSS selector matching an a element directly under h1 element with class="entry-title".
Well, just working with strings, you can
>>> s = '''<h1 class='entry-title'>
... <a href='http://theurlthatvariesinlengthbasedonwhenirequesthehtml'>theTitleIneedthatvariesinlength</a>
... </h1>'''
>>> s.split('>')[1].strip().split('=')[1].strip("'")
'http://theurlthatvariesinlengthbasedonwhenirequesthehtml'
>>> s.split('>')[2][:-3]
'theTitleIneedthatvariesinlength'
There are other (and better) options for parsing HTML though.
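For instance, the standard library's html.parser can do it without any third-party dependency — a minimal sketch that collects the (href, text) pair of every <a> tag:

```python
from html.parser import HTMLParser

class LinkGrabber(HTMLParser):
    """Collects an (href, text) pair for every <a> tag fed to it."""
    def __init__(self):
        super().__init__()
        self.in_link = False
        self.href = None
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            self.in_link = True
            self.href = dict(attrs).get('href')

    def handle_endtag(self, tag):
        if tag == 'a':
            self.in_link = False

    def handle_data(self, data):
        if self.in_link:
            self.links.append((self.href, data))

parser = LinkGrabber()
parser.feed("<h1 class='entry-title'>"
            "<a href='http://theurlthatvariesinlengthbasedonwhenirequesthehtml'>"
            "theTitleIneedthatvariesinlength</a></h1>")
print(parser.links)
```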

Improving a python snippet

I'm working on a python script to do some web scraping. I want to find the base URL of a given section on a web page that looks like this:
<div class='pagination'>
<a href='webpage-category/page/1'>1</a>
<a href='webpage-category/page/2'>2</a>
...
</div>
So, I just need to get everything from the first href besides the number('webpage-category/page/') and I have the following working code:
pages = [l['href'] for link in soup.find_all('div', class_='pagination')
         for l in link.find_all('a') if not re.search('pageSub', l['href'])]
s = pages[0]
f = ''.join([i for i in s if not i.isdigit()])
The problem is, generating this list is a waste, since I just need the first href. I think a Generator would be the answer but I couldn't pull this off. Maybe you guys could help me to make this code more concise?
What about this:
from bs4 import BeautifulSoup
html = """ <div class='pagination'>
<a href='webpage-category/page/1'>1</a>
<a href='webpage-category/page/2'>2</a>
</div>"""
soup = BeautifulSoup(html, 'html.parser')
link = soup.find('div', {'class': 'pagination'}).find('a')['href']
print('/'.join(link.split('/')[:-1]))
prints:
webpage-category/page
Just FYI, speaking about the code you've provided - you can use next() instead of a list comprehension:
s = next(l['href'] for link in soup.find_all('div', class_='pagination')
         for l in link.find_all('a') if not re.search('pageSub', l['href']))
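A self-contained sketch of the same idea; passing a default to next() also avoids a StopIteration when the page has no pagination links:

```python
from bs4 import BeautifulSoup

html = """<div class='pagination'>
<a href='webpage-category/page/1'>1</a>
<a href='webpage-category/page/2'>2</a>
</div>"""
soup = BeautifulSoup(html, 'html.parser')

# next() stops at the first matching href; the None default means no
# StopIteration is raised when the generator yields nothing
base = next((a['href'].rsplit('/', 1)[0]
             for a in soup.select('div.pagination a')), None)
print(base)
```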
UPD (using the website link provided):
import urllib.request
from bs4 import BeautifulSoup
url = "http://www.hdwallpapers.in/cars-desktop-wallpapers/page/2"
soup = BeautifulSoup(urllib.request.urlopen(url), 'html.parser')
links = soup.find_all('div', {'class': 'pagination'})[1].find_all('a')
print(next('/'.join(link['href'].split('/')[:-1]) for link in links
           if link.text.isdigit() and link.text != "1"))
