Parse between pre tag Python - python

I'm trying to parse between PRE tags using Python using this code
s = br.open(base_url+str(string))
u = br.geturl()
seq = br.open(u)
blat = BeautifulSoup(seq)
for res in blat.find('pre').findChildren():
seq = res.string
print seq
from the following HTML source code:
<PRE><TT>
<span style='color:#22CCEE;'>T</span><span style='color:#3300FF;'>AAAAGATGA</span> <span style='color:#3300FF;'>AGTTTCTATC</span> <span style='color:#3300FF;'>ATCCAAA</span>aa<span style='color:#3300FF;'>A</span> <span style='color:#3300FF;'>TGGGCTACAG</span> <span style='color:#3300FF;'>AAAC</span><span style='color:#22CCEE;'>C</span></TT></PRE>
<HR ALIGN="CENTER"><H4><A NAME=genomic></A>Genomic chr17 (reverse strand):</H4>
<PRE><TT>
tacatttttc tctaactgca aacataatgt tttcccttgt attttacaga 41256278
tgcaaacagc tataattttg caaaaaagga aaataactct cctgaacatc 41256228
<A NAME=1></A><span style='color:#22CCEE;'>T</span><span style='color:#3300FF;'>AAAAGATGA</span> <span style='color:#3300FF;'>AGTTTCTATC</span> <span style='color:#3300FF;'>ATCCAAA</span>gt<span style='color:#3300FF;'>A</span> <span style='color:#3300FF;'>TGGGCTACAG</span> <span style='color:#3300FF;'>AAAC</span><span style='color:#22CCEE;'>C</span>gtgcc 41256178
aaaagacttc tacagagtga acccgaaaat ccttccttgg taaaaccatt 41256128
tgttttcttc ttcttcttct tcttcttttc tttttttttt ctttt</TT></PRE>
<HR ALIGN="CENTER"><H4><A NAME=ali></A>Side by Side Alignment</H4>
<PRE><TT>
00000001 taaaagatgaagtttctatcatccaaaaaatgggctacagaaacc 00000045
<<<<<<<< ||||||||||||||||||||||||||| |||||||||||||||| <<<<<<<<
41256227 taaaagatgaagtttctatcatccaaagtatgggctacagaaacc 41256183
</TT></PRE>
It gives me the first PRE tag elements when I want to parse the last one. I'd appreciate any suggestions to achieve it.
I'd like the output to be like:
00000001 taaaagatgaagtttctatcatccaaaaaatgggctacagaaacc 00000045
<<<<<<<< ||||||||||||||||||||||||||| |||||||||||||||| <<<<<<<<
41256227 taaaagatgaagtttctatcatccaaagtatgggctacagaaacc 41256183
whereas my current output is
T
AAAAGATGA
AGTTTCTATC
ATCCAAA
A
TGGGCTACAG
AAAC
C

You can use find_all() an get the last result:
from bs4 import BeautifulSoup
soup = BeautifulSoup(open('../index.html'), 'html5lib')
pre = soup.find_all('pre')[-1]
print pre.text.strip()
where index.html contains the html you provided.
It prints:
00000001 taaaagatgaagtttctatcatccaaaaaatgggctacagaaacc 00000045
<<<<<<<< ||||||||||||||||||||||||||| |||||||||||||||| <<<<<<<<
41256227 taaaagatgaagtttctatcatccaaagtatgggctacagaaacc 41256183
Another option would be to rely on the previous h4 tag to get the appropriate pre:
h4 = soup.select('h4 > a[name="ali"]')[0].parent
print h4.find_next_sibling('pre').text.strip()

Related

How to exclude inner tags with beautifulsoup

Hey Im currently trying to parse through a website and I'm almost done, but there's a little problem. I wannt to exclude inner tags from a html code
<span class="moto-color5_5">
<strong>Text 1 </strong>
<span style="font-size:8px;">Text 2</span>
</span>
I tried using
...find("span", "moto-color5_5") but this returns
Text 1 Text 2
instead of only returning Text 1
Any suggestions?
sincierly :)
Excluding inner tags would also exclude Text 1 because it's in an inner tag <strong>.
You can however just find strong inside of your current soup:
html = """<span class="moto-color5_5">
<strong>Text 1 </strong>
<span style="font-size:8px;">Text 2</span>
</span>
"""
soup = BeautifulSoup(html)
result = soup.find("span", "moto-color5_5").find('strong')
print(result.text) # Text 1

Python 3 get child elements (lxml)

I am using lxml with html:
from lxml import html
import requests
How would I check if any of an element's children have the class = "nearby"
my code (essentially):
url = "www.example.com"
Page = requests.get(url)
Tree = html.fromstring(Page.content)
resultList = Tree.xpath('//p[#class="result-info"]')
i=len(resultList)-1 #to go though the list backwards
while i>0:
if (resultList[i].HasChildWithClass("nearby")):
print('This result has a child with the class "nearby"')
How would I replace "HasChildWithClass()" to make it actually work?
Here's an example tree:
...
<p class="result-info">
<span class="result-meta">
<span class="nearby">
... #this SHOULD print something
</span>
</span>
</p>
<p class="result-info">
<span class="result-meta">
<span class="FAR-AWAY">
... # this should NOT print anything
</span>
</span>
</p>
...
I tried to understand why you use lxml to find the element. However BeautifulSoup and re may be a better choice.
lxml = """
<p class="result-info">
<span class="result-meta">
<span class="nearby">
... #this SHOULD print something
</span>
</span>
</p>
<p class="result-info">
<span class="result-meta">
<span class="FAR-AWAY">
... # this should NOT print anything
</span>
</span>
</p>
"""
But i done what you want.
from lxml import html
Tree = html.fromstring(lxml)
resultList = Tree.xpath('//p[#class="result-info"]')
i = len(resultList) - 1 #to go though the list backwards
for result in resultList:
for e in result.iter():
if e.attrib.get("class") == "nearby":
print(e.text)
Try to use bs4
from bs4 import BeautifulSoup
soup = BeautifulSoup(lxml,"lxml")
result = soup.find_all("span", class_="nearby")
print(result[0].text)
Here is an experiment I did.
Take r = resultList[0] in python shell and type:
>>> dir(r)
['__bool__', '__class__', ..., 'find_class', ...
Now this find_class method is highly suspicious. If you check its help doc:
>>> help(r.find_class)
you'll confirm the guess. Indeed,
>>> r.find_class('nearby')
[<Element span at 0x109788ea8>]
For the other tag s = resultList[1] in the example xml code you gave,
>>> s.find_class('nearby')
[]
Now it's clear how to tell whether a 'nearby' child exists or not.
Cheers!

Scraping <span>flow text</span> with BeautifulSoup and urllib

I am working on scraping the data from a website using BeautifulSoup. For whatever reason, I cannot seem to find a way to get the text between span elements to print. Here is what I am running.
data = """ <div class="grouping">
<div class="a1 left" style="width:20px;">Text</div>
<div class="a2 left" style="width:30px;"><span
id="target_0">Data1</span>
</div>
<div class="a3 left" style="width:45px;"><span id="div_target_0">Data2
</span></div>
<div class="a4 left" style="width:32px;"><span id="reg_target_0">Data3
</span</div>
</div>
"""
My ultimate goal would be to able to print a list ["Text", "Data1", "Data2"] for each entry. But right now I am having trouble getting python and urllib to produce any text between the . Here is what I am running:
import urllib
from bs4 import BeautifulSoup
url = 'http://target.com'
html = urllib.urlopen(url).read()
soup = BeautifulSoup(html, "lxml")
Search_List = [0,4,5] # list of Target IDs to scrape
for i in Search_List:
h = str(i)
root = 'target_' + h
taggr = soup.find("span", { "id" : root })
print taggr, ", ", taggr.text
When I use urllib it produces this:
<span id="target_0"></span>,
<span id="target_4"></span>,
<span id="target_5"></span>,
However, I also downloaded the html file, and when I parse the downloaded file it produces this output (the one that I want):
<span id="target_0">Data1</span>, Data1
<span id="target_4">Data1</span>, Data1
<span id="target_5">Data1</span>, Data1
Can anyone explain to me why urllib doesn't produce the outcome?
use this code :
...
soup = BeautifulSoup(html, 'html.parser')
your_data = list()
for line in soup.findAll('span', attrs={'id': 'target_0'}):
your_data.append(line.text)
...
similarly add all class attributes which you need to extract data from and write your_data list in csv file. Hope this will help if this doesn't work out. let me know.
You can use the following approach to create your lists based on the source HTML you have shown:
from bs4 import BeautifulSoup
data = """
<div class="grouping">
<div class="a1 left" style="width:20px;">Text0</div>
<div class="a2 left" style="width:30px;"><span id="target_0">Data1</span></div>
<div class="a3 left" style="width:45px;"><span id="div_target_0">Data2</span></div>
<div class="a4 left" style="width:32px;"><span id="reg_target_0">Data3</span></div>
</div>
<div class="grouping">
<div class="a1 left" style="width:20px;">Text2</div>
<div class="a2 left" style="width:30px;"><span id="target_2">Data1</span></div>
<div class="a3 left" style="width:45px;"><span id="div_target_0">Data2</span></div>
<div class="a4 left" style="width:32px;"><span id="reg_target_0">Data3</span></div>
</div>
<div class="grouping">
<div class="a1 left" style="width:20px;">Text4</div>
<div class="a2 left" style="width:30px;"><span id="target_4">Data1</span></div>
<div class="a3 left" style="width:45px;"><span id="div_target_0">Data2</span></div>
<div class="a4 left" style="width:32px;"><span id="reg_target_0">Data3</span></div>
</div>
"""
soup = BeautifulSoup(data, "lxml")
search_ids = [0, 4, 5] # list of Target IDs to scrape
for i in search_ids:
span = soup.find("span", id='target_{}'.format(i))
if span:
grouping = span.parent.parent
print list(grouping.stripped_strings)[:-1] # -1 to remove "Data3"
The example has been slightly modified to show it finding IDs 0 and 4. This would display the following output:
[u'Text0', u'Data1', u'Data2']
[u'Text4', u'Data1', u'Data2']
Note, if the HTML you are getting back from your URL is different to that seen been viewing the source from your browser (i.e. the data you want is missing completely) then you will need to use a solution such as selenium to connect to your browser and extract the HTML. This is because in this case, the HTML is probably being generated locally via Javascript, and urllib does not have a Javascript processor.

Unable to fetch <div> tag values in python

The required value is present within the div tag:
<div class="search-page-text">
<span class="upc grey-text sml">Cost for 2: </span>
Rs. 350
</div>
I am using the below code to fetch the value "Rs. 350":
soup.select('div.search-page-text'):
But in the output i get "None". Could you pls help me resolve this issue?
An element with both a sub-element and string content can be accessed using strippe_strings:
from bs4 import BeautifulSoup
h = """<div class="search-page-text">
<span class="upc grey-text sml">Cost for 2: </span>
Rs. 350
</div>"""
soup = BeautifulSoup(h)
for s in soup.select("div.search-page-text")[0].stripped_strings:
print(s)
Output:
Cost for 2:
Rs. 350
The problem is that this includes both the strong content of the span and the div. But if you know that the div first contains the span with text, you could get the intersting string as
list(soup.select("div.search-page-text")[0].stripped_strings)[1]
If you know you only ever want the string that is the immediate text of the <div> tag and not the <span> child element, you could do this.
from bs4 import BeautifulSoup
txt = '''<div class="search-page-text">
<span class="upc grey-text sml">Cost for 2: </span>
Rs. 350
</div>'''
soup = BeautifulSoup(txt)
for div in soup.find_all("div", { "class" : "search-page-text" }):
print ''.join(div.find_all(text=True, recursive=False)).strip()
#print div.find_all(text=True, recursive=False)[1].strip()
One of the lines returned by div.find_all is just a newline. That could be handled in a variety of ways. I chose to join and strip it rather than rely on the text being at a certain index (see commented line) in the resultant list.
Python 3
For python 3 the print line should be
print (''.join(div.find_all(text=True, recursive=False)).strip())

How to print the text inside of a child tag and the href of a grandchild element with a single BeautifulSoup Object?

I have a document which contains several div.inventory siblings.
<div class="inventory">
<span class="item-number">123</span>
<span class="cost">
$1.23
</span>
</div>
I would like to iterate over them to print the item number and link of the item.
123 http://linktoitem
456 http://linktoitem2
789 http://linktoitem3
How do I parse these two values after selecting the div.inventory element?
import requests
from bs4 import BeautifulSoup
htmlSource = requests.get(url).text
soup = BeautifulSoup(htmlSource)
matches = soup.select('div.inventory')
for match in matches:
#prints 123
#prints http://linktoitem
Also - what is the difference between the select function and find* functions?
You can find both items using find() relying on the class attributes:
soup = BeautifulSoup(data)
for inventory in soup.select('div.inventory'):
number = inventory.find('span', class_='item-number').text
link = inventory.find('span', class_='cost').a.get('href')
print number, link
Example:
from bs4 import BeautifulSoup
data = """
<body>
<div class="inventory">
<span class="item-number">123</span>
<span class="cost">
$1.23
</span>
</div>
<div class="inventory">
<span class="item-number">456</span>
<span class="cost">
$1.23
</span>
</div>
<div class="inventory">
<span class="item-number">789</span>
<span class="cost">
$1.23
</span>
</div>
</body>
"""
soup = BeautifulSoup(data)
for inventory in soup.select('div.inventory'):
number = inventory.find('span', class_='item-number').text
link = inventory.find('span', class_='cost').a.get('href')
print number, link
Prints:
123 http://linktoitem
456 http://linktoitem2
789 http://linktoitem3
Note the use of select() - this method allows to use CSS Selectors for searching over the page. Also note the use of class_ argument - underscore is important since class is a reversed keyword in Python.

Categories