Extracting specific information from fetched HTML code using Python

I'm relatively new to Python and need some advice for a bioinformatics project that converts certain enzyme IDs to others.
What I have already done, and what works, is fetching the HTML for a list of IDs from the Rhea database:
import pycurl

url2 = "http://www.rhea-db.org/reaction?id=16952"

# pycurl writes bytes, so open the temporary file in binary mode
with open("xml_tempfile2.txt", "wb") as f_xml2:
    fetch2 = pycurl.Curl()
    fetch2.setopt(fetch2.URL, url2)
    fetch2.setopt(fetch2.WRITEDATA, f_xml2)
    fetch2.perform()
    fetch2.close()
So the HTML code is saved to a temporary txt file (I know, possibly not the most elegant way to do stuff, but it works for me ;).
Now what I am interested in is the following part from the HTML:
<p>
<h3>Same participants, different directions</h3>
<div>
<span>RHEA:16949</span>
<span class="icon-question">myo-inositol + NAD(+) <?> scyllo-inosose + H(+) + NADH</span>
</div><div>
<span>RHEA:16950</span>
<span class="icon-arrow-right">myo-inositol + NAD(+) => scyllo-inosose + H(+) + NADH</span>
</div><div>
<span>RHEA:16951</span>
<span class="icon-arrow-left-1">scyllo-inosose + H(+) + NADH => myo-inositol + NAD(+)</span>
</div>
</p>
I want to go through the code until the class "icon-arrow-right" is reached (this expression is unique in the HTML). Then I want to extract the information of "RHEA:XXXXXX" from the line above. So in this example, I want to end up with 16950.
Is there a simple way to do this? I've already experimented with HTMLParser but couldn't get it to look for a certain class and then give me the ID from the line above.
Thank you very much in advance!

You can use an HTML parser like BeautifulSoup to do this:
>>> from bs4 import BeautifulSoup
>>> html = """ <p>
... <h3>Same participants, different directions</h3>
... <div>
... <span>RHEA:16949</span>
... <span class="icon-question">myo-inositol + NAD(+) <?> scyllo-inosose + H(+) + NADH</span>
... </div><div>
... <span>RHEA:16950</span>
... <span class="icon-arrow-right">myo-inositol + NAD(+) => scyllo-inosose + H(+) + NADH</span>
... </div><div>
... <span>RHEA:16951</span>
... <span class="icon-arrow-left-1">scyllo-inosose + H(+) + NADH => myo-inositol + NAD(+)</span>
... </div>
... </p>"""
>>> soup = BeautifulSoup(html, 'html.parser')
>>> soup.find('span', class_='icon-arrow-right').find_previous_sibling().get_text()
'RHEA:16950'
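Since the question mentions HTMLParser and asks for the bare number, here is a rough sketch of the same idea using only the standard library. The class name PrecedingSpanFinder and its attributes are made up for this illustration, and it assumes the label always looks like "RHEA:<digits>" and sits in the span directly before the one carrying the target class.

```python
from html.parser import HTMLParser


class PrecedingSpanFinder(HTMLParser):
    """Remember the text of the <span> that precedes a span with a given class."""

    def __init__(self, target_class):
        super().__init__()
        self.target_class = target_class
        self.in_span = False
        self.buffer = []
        self.last_span_text = None
        self.result = None

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.result is None and self.target_class in classes:
            # The span closed just before this one holds the "RHEA:..." label.
            self.result = self.last_span_text
        self.in_span = True
        self.buffer = []

    def handle_data(self, data):
        if self.in_span:
            self.buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "span" and self.in_span:
            self.last_span_text = "".join(self.buffer).strip()
            self.in_span = False


html = """
<div>
<span>RHEA:16950</span>
<span class="icon-arrow-right">myo-inositol + NAD(+) => scyllo-inosose + H(+) + NADH</span>
</div><div>
<span>RHEA:16951</span>
<span class="icon-arrow-left-1">scyllo-inosose + H(+) + NADH => myo-inositol + NAD(+)</span>
</div>
"""

finder = PrecedingSpanFinder("icon-arrow-right")
finder.feed(html)
finder.close()

# Assumption: the label has the form "RHEA:<digits>".
rhea_number = int(finder.result.split(":")[1])
print(rhea_number)  # 16950
```

The BeautifulSoup one-liner above is shorter; this variant is mainly useful if you cannot install third-party packages.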

Related

How to get the number between >< [duplicate]

I am facing a problem and don't know how to solve it properly.
I want to extract the price (130 € in both examples below).
The problem is that the attributes change all the time. I am scraping hundreds of sites, and on each site the first two characters of the "id" attribute may differ, so I can't do something like this:
tag = soup_expose_html.find('span', attrs={'id' : re.compile(r'(07_content$)')})
Even something like this won't work, because nothing ties the match to the price, so I would probably get some other value:
tag = soup_expose_html.find('span', attrs={'id' : re.compile(r'([0-9]{2}_content$)')})
Example html code:
<span id="07_lbl" class="lbl">Price:</span>
<span id="07_content" class="content">130 €</span>
<span id="08_lbl" class="lbl">Value:</span>
<span id="08_content" class="content">90000 €</span>
<span id="03_lbl" class="lbl">Price:</span>
<span id="03_content" class="content">130 €</span>
<span id="04_lbl" class="lbl">Value:</span>
<span id="04_content" class="content">90000 €</span>
The only thing I can think of at the moment is to identify the price label with something like "text = 'Price:'", then get .next_sibling and extract the string, but I am not sure whether there is a better way to do it. Any suggestions? :-)
How about a find_all solution?
First collect all possible id prefixes, then iterate over them and look up the matching label and content elements:
>>> from bs4 import BeautifulSoup
>>> import re
>>> html = """
... <span id="07_lbl" class="lbl">Price:</span>
... <span id="07_content" class="content">130 €</span>
... <span id="08_lbl" class="lbl">Value:</span>
... <span id="08_content" class="content">90000 €</span>
...
...
... <span id="03_lbl" class="lbl">Price:</span>
... <span id="03_content" class="content">130 €</span>
... <span id="04_lbl" class="lbl">Value:</span>
... <span id="04_content" class="content">90000 €</span>
... """
>>>
>>> soup = BeautifulSoup(html, 'html.parser')
>>> span_id_prefixes = [
...     span['id'].replace("_content", "")
...     for span in soup.find_all('span', attrs={'id': re.compile(r'_content$')})
... ]
>>> for prefix in span_id_prefixes:
...     lbl = soup.find('span', attrs={'id': '%s_lbl' % prefix})
...     content = soup.find('span', attrs={'id': '%s_content' % prefix})
...     if lbl and content:
...         print(lbl.text, content.text)
...
Price: 130 €
Value: 90000 €
Price: 130 €
Value: 90000 €
Here is how you would easily extract only the price values like you had in mind in your original post.
html = """
<span id="07_lbl" class="lbl">Price:</span>
<span id="07_content" class="content">130 €</span>
<span id="08_lbl" class="lbl">Value:</span>
<span id="08_content" class="content">90000 €</span>
<span id="03_lbl" class="lbl">Price:</span>
<span id="03_content" class="content">130 €</span>
<span id="04_lbl" class="lbl">Value:</span>
<span id="04_content" class="content">90000 €</span>
"""
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')
price_labels = soup.find_all('span', string='Price:')
for element in price_labels:
    # element.next_sibling would return the whitespace text node between
    # the spans here, so ask for the next <span> sibling explicitly
    price_value = element.find_next_sibling('span')
    print(price_value.get_text())
# It prints:
# 130 €
# 130 €
This solution has less code and, IMO, is more clear.
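If you need the price as a number rather than as text like "130 €", a small follow-up step along these lines may help. The helper name parse_euro_amount is made up for this sketch, and it assumes whole-number amounts without thousands separators or decimals, as in the example HTML.

```python
import re


def parse_euro_amount(text):
    """Return the first run of digits in text as an int, or None if there is none."""
    match = re.search(r"\d+", text)
    return int(match.group()) if match else None


prices = [parse_euro_amount(t) for t in ["130 €", "90000 €", "n/a"]]
print(prices)  # [130, 90000, None]
```

Real listings often use formats such as "1.300,50 €", which this sketch does not handle.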
Try BeautifulSoup's select() method, which takes CSS selectors:
for span in soup_expose_html.select('span[id$="_content"]'):
    print(span.text)
The result is a list of all spans whose id ends with _content.

Python 3 get child elements (lxml)

I am using lxml with html:
from lxml import html
import requests
How would I check whether any of an element's children has class="nearby"?
My code (essentially):
url = "www.example.com"
Page = requests.get(url)
Tree = html.fromstring(Page.content)
resultList = Tree.xpath('//p[@class="result-info"]')
i = len(resultList) - 1  # to go through the list backwards
while i > 0:
    if resultList[i].HasChildWithClass("nearby"):
        print('This result has a child with the class "nearby"')
    i -= 1
What would I replace "HasChildWithClass()" with to make it actually work?
Here's an example tree:
...
<p class="result-info">
<span class="result-meta">
<span class="nearby">
... #this SHOULD print something
</span>
</span>
</p>
<p class="result-info">
<span class="result-meta">
<span class="FAR-AWAY">
... # this should NOT print anything
</span>
</span>
</p>
...
I'm not sure why you are using lxml to find the element; BeautifulSoup and re may be a better choice.
lxml = """
<p class="result-info">
<span class="result-meta">
<span class="nearby">
... #this SHOULD print something
</span>
</span>
</p>
<p class="result-info">
<span class="result-meta">
<span class="FAR-AWAY">
... # this should NOT print anything
</span>
</span>
</p>
"""
Still, here is what you asked for, using lxml:
from lxml import html

Tree = html.fromstring(lxml)
resultList = Tree.xpath('//p[@class="result-info"]')
for result in resultList:
    for e in result.iter():
        if e.attrib.get("class") == "nearby":
            print(e.text)
Or use bs4:
from bs4 import BeautifulSoup
soup = BeautifulSoup(lxml,"lxml")
result = soup.find_all("span", class_="nearby")
print(result[0].text)
Here is an experiment I did.
Take r = resultList[0] in python shell and type:
>>> dir(r)
['__bool__', '__class__', ..., 'find_class', ...
The find_class method looks promising. If you check its help text:
>>> help(r.find_class)
you'll confirm the guess. Indeed,
>>> r.find_class('nearby')
[<Element span at 0x109788ea8>]
For the other tag, s = resultList[1], in the example HTML you gave,
>>> s.find_class('nearby')
[]
Now it's clear how to tell whether a 'nearby' child exists or not.
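If you want something shaped like the HasChildWithClass() call from the question, a helper along these lines is one option. The name has_descendant_with_class is made up, and the demo uses the standard library's xml.etree.ElementTree on a well-formed snippet so that it runs without extra packages. lxml elements expose the same iter() and get() methods, so I would expect the helper to work on them as well, but treat that as an assumption to verify.

```python
import xml.etree.ElementTree as ET


def has_descendant_with_class(element, class_name):
    """True if any element below `element` lists class_name in its class attribute."""
    for node in element.iter():
        if node is element:
            continue  # iter() also yields the starting element itself
        if class_name in (node.get("class") or "").split():
            return True
    return False


snippet = """
<div>
  <p class="result-info"><span class="result-meta"><span class="nearby">near</span></span></p>
  <p class="result-info"><span class="result-meta"><span class="FAR-AWAY">far</span></span></p>
</div>
"""

root = ET.fromstring(snippet)
flags = [has_descendant_with_class(p, "nearby") for p in root.findall("p")]
print(flags)  # [True, False]
```

Splitting the class attribute on whitespace means an element with class="nearby highlighted" also counts, which a plain string comparison would miss.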
Cheers!

Web Scraping using bs4. What should I do if a string doesn't have a tag associated with

I am using bs4 for web scraping.
I have had no problem getting the desired strings inside tags, but there is one string that doesn't seem to have any tag associated with it (maybe I am wrong).
The HTML looks like this:
<li class='A'>
<span class='B'> Some_string_here </span>
" MY DESIRED STRING "
<div class='C'> Some_string_here </div>
</li>
I know how to get "Some_string_here", but I have no idea how to get "MY DESIRED STRING".
Thanks a lot in advance!!
There are various ways to do this:
>>> s = """
... <li class='A'>
... <span class='B'> Some_string_here </span>
... " MY DESIRED STRING "
... <div class='C'> Some_string_here </div>
... </li>
... """
>>> from bs4 import BeautifulSoup
>>> tree = BeautifulSoup(s, 'html.parser')
using contents:
>>> tree.li.contents
['\n', <span class="B"> Some_string_here </span>, '\n " MY DESIRED STRING "\n ', <div class="C"> Some_string_here </div>, '\n']
>>> tree.li.contents[2].strip()
'" MY DESIRED STRING "'
using strings or stripped_strings:
>>> list(tree.li.stripped_strings)
['Some_string_here', '" MY DESIRED STRING "', 'Some_string_here']
using find_all:
>>> tree.li.find_all(text=True, recursive=False)
['\n', '\n " MY DESIRED STRING "\n ', '\n']
and there are probably several other ways...
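One more variant, as a sketch: loop over the direct children of the li and keep only the non-empty text nodes. This avoids hard-coding the index 2 used with contents above. It assumes bs4 is installed; the variable names are only for illustration.

```python
from bs4 import BeautifulSoup, NavigableString

html = """<li class='A'>
<span class='B'> Some_string_here </span>
" MY DESIRED STRING "
<div class='C'> Some_string_here </div>
</li>"""

soup = BeautifulSoup(html, "html.parser")

# Direct children of <li> are either tags or text nodes (NavigableString).
direct_text = [
    child.strip()
    for child in soup.li.children
    if isinstance(child, NavigableString) and child.strip()
]
print(direct_text)  # ['" MY DESIRED STRING "']
```

HTML comments are also NavigableString subclasses in bs4, so if the real page has comments inside the li you may need to filter those out too.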

Parse between pre tag Python

I'm trying to parse the content between PRE tags in Python with this code:
s = br.open(base_url+str(string))
u = br.geturl()
seq = br.open(u)
blat = BeautifulSoup(seq)
for res in blat.find('pre').findChildren():
    seq = res.string
    print(seq)
from the following HTML source code:
<PRE><TT>
<span style='color:#22CCEE;'>T</span><span style='color:#3300FF;'>AAAAGATGA</span> <span style='color:#3300FF;'>AGTTTCTATC</span> <span style='color:#3300FF;'>ATCCAAA</span>aa<span style='color:#3300FF;'>A</span> <span style='color:#3300FF;'>TGGGCTACAG</span> <span style='color:#3300FF;'>AAAC</span><span style='color:#22CCEE;'>C</span></TT></PRE>
<HR ALIGN="CENTER"><H4><A NAME=genomic></A>Genomic chr17 (reverse strand):</H4>
<PRE><TT>
tacatttttc tctaactgca aacataatgt tttcccttgt attttacaga 41256278
tgcaaacagc tataattttg caaaaaagga aaataactct cctgaacatc 41256228
<A NAME=1></A><span style='color:#22CCEE;'>T</span><span style='color:#3300FF;'>AAAAGATGA</span> <span style='color:#3300FF;'>AGTTTCTATC</span> <span style='color:#3300FF;'>ATCCAAA</span>gt<span style='color:#3300FF;'>A</span> <span style='color:#3300FF;'>TGGGCTACAG</span> <span style='color:#3300FF;'>AAAC</span><span style='color:#22CCEE;'>C</span>gtgcc 41256178
aaaagacttc tacagagtga acccgaaaat ccttccttgg taaaaccatt 41256128
tgttttcttc ttcttcttct tcttcttttc tttttttttt ctttt</TT></PRE>
<HR ALIGN="CENTER"><H4><A NAME=ali></A>Side by Side Alignment</H4>
<PRE><TT>
00000001 taaaagatgaagtttctatcatccaaaaaatgggctacagaaacc 00000045
<<<<<<<< ||||||||||||||||||||||||||| |||||||||||||||| <<<<<<<<
41256227 taaaagatgaagtttctatcatccaaagtatgggctacagaaacc 41256183
</TT></PRE>
It gives me the elements of the first PRE tag, but I want to parse the last one. I'd appreciate any suggestions on how to achieve that.
I'd like the output to be like:
00000001 taaaagatgaagtttctatcatccaaaaaatgggctacagaaacc 00000045
<<<<<<<< ||||||||||||||||||||||||||| |||||||||||||||| <<<<<<<<
41256227 taaaagatgaagtttctatcatccaaagtatgggctacagaaacc 41256183
whereas my current output is
T
AAAAGATGA
AGTTTCTATC
ATCCAAA
A
TGGGCTACAG
AAAC
C
You can use find_all() and take the last result:
from bs4 import BeautifulSoup

with open('../index.html') as f:
    soup = BeautifulSoup(f, 'html5lib')
pre = soup.find_all('pre')[-1]
print(pre.text.strip())
where index.html contains the html you provided.
It prints:
00000001 taaaagatgaagtttctatcatccaaaaaatgggctacagaaacc 00000045
<<<<<<<< ||||||||||||||||||||||||||| |||||||||||||||| <<<<<<<<
41256227 taaaagatgaagtttctatcatccaaagtatgggctacagaaacc 41256183
Another option would be to rely on the previous h4 tag to get the appropriate pre:
h4 = soup.select('h4 > a[name="ali"]')[0].parent
print(h4.find_next_sibling('pre').text.strip())
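Here is a self-contained sketch of the anchor-based idea that also copes with a missing anchor. The helper name pre_after_anchor and the shortened HTML are made up for illustration, and it uses the built-in html.parser instead of html5lib so that only bs4 is required.

```python
from bs4 import BeautifulSoup

html = """
<h4><a name="genomic"></a>Genomic chr17 (reverse strand):</h4>
<pre><tt>tacatttttc tctaactgca 41256278</tt></pre>
<h4><a name="ali"></a>Side by Side Alignment</h4>
<pre><tt>00000001 taaaagatga 00000045
41256227 taaaagatga 41256183</tt></pre>
"""


def pre_after_anchor(soup, anchor_name):
    """Return the stripped text of the first <pre> after <a name=anchor_name>, or None."""
    # "name" must go through attrs, because find(name=...) means the tag name.
    anchor = soup.find("a", attrs={"name": anchor_name})
    if anchor is None:
        return None
    pre = anchor.find_next("pre")
    return pre.get_text().strip() if pre is not None else None


soup = BeautifulSoup(html, "html.parser")
print(pre_after_anchor(soup, "ali"))
```

Relying on the named anchor rather than on position should keep working if the page gains or loses other pre blocks.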

BeautifulSoup: get tag text behind another tag

How can I find a tag by means of another tag using BeautifulSoup? In this example I want to get, for example, '0993 999 999', which is in the div right after another div with a label text such as 'Telefon:'.
I tried to get it using this:
print(parsed.findAll('div', {'class': "dva"})[3].text)
But it does not work properly. I think there must be a way to tell BeautifulSoup that the value comes right after the 'Telefon:' text, or some other way.
<div class="kontakt">
<h2 class="section-head">Kontaktné údaje</h2>
<address itemprop="address" itemscope itemtype="http://schema.org/PostalAddress" >
<span itemprop="streetAddress" >SNP 12</span>, <span itemprop="postalCode" >904 01</span> <span itemprop="addressLocality" >Pezinok</span> </address>
<div class="jedna">Telefon:</div>
<div class="dva">013 / 688 27 78</div>
<div class="jedna">Mobil:</div>
<div class="dva">0993 999 999</div>
<div class="jedna">Fax:</div>
<div class="dva">033 / 690 97 94</div>
<div class="jedna">E-mail:</div>
<div class="dva"><br /></div></div>
EDIT: I tried this, but it does not work either.
tags = parsed.findAll('div', {'class': "jedna"})
for tag in tags:
    if tag.text == 'Telefon:':
        print(tag.next_sibling.string)
Could you please give me a hint on how to do that?
Thanks!
You can use find_next_sibling():
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup

data = u"""html here"""
soup = BeautifulSoup(data, 'html.parser')
print(soup.find('div', string='Telefon:').find_next_sibling('div', class_='dva').text)
print(soup.find('div', string='Mobil:').find_next_sibling('div', class_='dva').text)
Prints:
013 / 688 27 78
0993 999 999
FYI, you can extract the duplication and have a nice reusable function:
def get_field_value(soup, field):
    return soup.find('div', string=field + ':').find_next_sibling('div', class_='dva').text

soup = BeautifulSoup(data, 'html.parser')
print(get_field_value(soup, 'Telefon'))  # prints 013 / 688 27 78
print(get_field_value(soup, 'Mobil'))  # prints 0993 999 999
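If you want all fields at once, a sketch like the following collects every label/value pair into a dict. The function name contact_fields is made up, the HTML is a shortened copy of the example, and it assumes each "jedna" label is followed by its "dva" value as a sibling.

```python
from bs4 import BeautifulSoup

html = """
<div class="kontakt">
<div class="jedna">Telefon:</div>
<div class="dva">013 / 688 27 78</div>
<div class="jedna">Mobil:</div>
<div class="dva">0993 999 999</div>
</div>
"""


def contact_fields(soup):
    """Map each label (without the trailing colon) to the text of the value div after it."""
    fields = {}
    for label in soup.find_all("div", class_="jedna"):
        value = label.find_next_sibling("div", class_="dva")
        if value is not None:
            key = label.get_text(strip=True).rstrip(":")
            fields[key] = value.get_text(strip=True)
    return fields


soup = BeautifulSoup(html, "html.parser")
fields = contact_fields(soup)
print(fields["Mobil"])  # 0993 999 999
```

A dict avoids repeating the lookup for every field, at the cost of assuming that labels are unique within the block.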
Hope that helps.
