beautifulsoup how to recombine words - python

Some of the words outputted are split when running this code. Like the word "tolerances" is split into "tole rances". I looked at the html source and it seems that's how the page was created.
There are also many other places where the word is split. How do I recombine them before writing to text?
import requests, codecs
from bs4 import BeautifulSoup
from bs4.element import Comment

path='C:\\Users\\jason\\Google Drive\\python\\'

def tag_visible(element):
    if element.parent.name in ['sup']:
        return False
    if isinstance(element, Comment):
        return False
    return True

ticker = 'TSLA'
quarter = '18Q2'
mark1= 'ITEM 1A'
mark2= 'UNREGISTERED SALES'
url_new='https://www.sec.gov/Archives/edgar/data/1318605/000156459018019254/tsla-10q_20180630.htm'

def get_text(url,mark1,mark2):
    html = requests.get(url)
    soup = BeautifulSoup(html.text, 'html.parser')
    for hr in soup.select('hr'):
        hr.find_previous('p').extract()
    texts = soup.findAll(text=True)
    visible_texts = filter(tag_visible, texts)
    text=u" ".join(t.strip() for t in visible_texts)
    return text[text.find(mark1): text.find(mark2)]

text = get_text(url_new,mark1,mark2)
file=codecs.open(path + "test.txt", 'w', encoding='utf8')
file.write (text)
file.close()

You are dealing with HTML formatted with Microsoft Word. Don't extract text and try to process it without that context.
The section you want to process is clearly delineated with <a name="..."> tags. Let's start by selecting all elements from the <a name="ITEM_1A_RISK_FACTORS"> marker up to, but not including, the <a name="ITEM2_UNREGISTERED_SALES"> marker:
def sec_section(soup, item_name):
    """iterate over SEC document paragraphs for the section named item_name

    Item name must be a link target, starting with ITEM
    """
    # ask BS4 to find the section
    elem = soup.select_one('a[name={}]'.format(item_name))
    # scan up to the parent text element
    # html.parser does not support <text> but lxml does
    while elem.parent is not soup and elem.parent.name != 'text':
        elem = elem.parent

    yield elem

    # now we can yield all next siblings until we find one that
    # also contains an a[name^=ITEM] element:
    for elem in elem.next_siblings:
        if not isinstance(elem, str) and elem.select_one('a[name^=ITEM]'):
            return
        yield elem
This function gives us all child nodes from the <text> node in the HTML document that start at a paragraph containing a specific link target, all the way through to the next link target that names an ITEM.
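To see the sibling walk in isolation, here is a minimal sketch on a made-up three-item document. The markup and the ITEM_1_FIN anchor name are invented for illustration; real filings nest these anchors more deeply, which is why sec_section() first climbs up to the <text> level.

```python
from bs4 import BeautifulSoup, Tag

# Hypothetical, heavily simplified stand-in for an SEC filing body
html = (
    '<body>'
    '<p><a name="ITEM_1_FIN"></a>Item 1</p><p>one</p>'
    '<p><a name="ITEM_1A_RISK_FACTORS"></a>Item 1A</p><p>risk text</p>'
    '<p><a name="ITEM2_UNREGISTERED_SALES"></a>Item 2</p><p>sales text</p>'
    '</body>'
)
soup = BeautifulSoup(html, 'html.parser')

# Start at the paragraph holding the wanted anchor
start = soup.select_one('a[name="ITEM_1A_RISK_FACTORS"]').parent
section = [start]
for sib in start.next_siblings:
    # Stop at the first sibling that contains another ITEM anchor
    if isinstance(sib, Tag) and sib.select_one('a[name^="ITEM"]'):
        break
    section.append(sib)

section_text = ' '.join(el.get_text(strip=True) for el in section if isinstance(el, Tag))
print(section_text)
```

The prefix selector a[name^="ITEM"] is what ends the walk, so the section stops at whichever item comes next without naming it.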
Next, the usual Word cleanup task is to remove <font> tags and style attributes:
def clean_word(elem):
    if isinstance(elem, str):
        return elem
    # remove last-rendered break markers, non-rendering but messy
    for lastbrk in elem.select('a[name^=_AEIOULastRenderedPageBreakAEIOU]'):
        lastbrk.decompose()
    # remove font tags and styling from the document, leaving only the contents
    if 'style' in elem.attrs:
        del elem.attrs['style']
    for e in elem:  # recursively do the same for all child nodes
        clean_word(e)
    if elem.name == 'font':
        elem = elem.unwrap()
    return elem
The Tag.unwrap() method is what'll most help your case, as the text is divided up almost arbitrarily by <font> tags.
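As a small illustration of why unwrapping helps, the sketch below uses an invented fragment in which one word is split across two <font> tags (an assumption about how the filing is marked up, based on the symptom described in the question). Joining the separate strings with a space reproduces the split word, while unwrapping the tags first does not.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment imitating Word-generated markup
html = ('<p><font>within specified design tole</font>'
        '<font>rances and with high quality;</font></p>')
soup = BeautifulSoup(html, 'html.parser')

# The question's approach: every text node is stripped and space-joined
split_text = ' '.join(s.strip() for s in soup.find_all(string=True))

# Unwrapping removes the <font> tags but keeps their contents in place
for font in soup.find_all('font'):
    font.unwrap()
joined_text = soup.p.get_text()

print(split_text)
print(joined_text)
```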
Now it's suddenly trivial to extract the text cleanly:
for elem in sec_section(soup, 'ITEM_1A_RISK_FACTORS'):
    clean_word(elem)
    if not isinstance(elem, str):
        elem = elem.get_text(strip=True)
    print(elem)
This outputs, among the rest of the text:
•that the equipment and processes which we have selected for Model 3 production will be able to accurately manufacture high volumes of Model 3 vehicles within specified design tolerances and with high quality;
The text is now properly joined up, no re-combining required any more.
The whole section is still in a table, but clean_word() has now reduced it to the much more reasonable:
<div align="left">
<table border="0" cellpadding="0" cellspacing="0">
<tr>
<td valign="top">
<p> </p></td>
<td valign="top">
<p>•</p></td>
<td valign="top">
<p>that the equipment and processes which we have selected for Model 3 production will be able to accurately manufacture high volumes of Model 3 vehicles within specified design tolerances and with high quality;</p></td></tr></table></div>
so you can use smarter text extraction techniques to further ensure a clean text conversion here; you could convert such bullet tables to a * prefix, for example:
def convert_word_bullets(soup, text_bullet="*"):
    for table in soup.select('div[align=left] > table'):
        div = table.parent
        bullet = div.find(string='\u2022')
        if bullet is None:
            # not a bullet table, skip
            continue
        text_cell = bullet.find_next('td')
        div.clear()
        div.append(text_bullet + ' ')
        for i, elem in enumerate(text_cell.contents[:]):
            if i == 0 and elem == '\n':
                continue  # no need to include the first linebreak
            div.append(elem.extract())
In addition, you probably want to skip the page breaks too (a combination of <p>[page number]</p> and <hr/> elements). You can remove them by running:
for pagebrk in soup.select('p ~ hr[style^=page-break-after]'):
    pagebrk.find_previous_sibling('p').decompose()
    pagebrk.decompose()
This is more explicit than your own version, where you remove all <hr/> elements and preceding <p> element regardless of whether they are actually siblings.
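The difference can be sketched on an invented fragment where the first <hr/> has no sibling paragraph in front of it. The markup and the two helper names are made up for this comparison.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the first <hr/> follows a <div>, not a <p>
HTML = (
    '<body>'
    '<div><p>keep me</p></div>'
    '<hr style="page-break-after:always"/>'
    '<p>12</p>'
    '<hr style="page-break-after:always"/>'
    '</body>'
)

def strip_any_previous(html):
    """Question's approach: drop whichever <p> precedes each <hr/>."""
    soup = BeautifulSoup(html, 'html.parser')
    for hr in soup.select('hr'):
        hr.find_previous('p').extract()
    return soup.get_text()

def strip_sibling_only(html):
    """Sibling-based approach: only <hr/> elements with a preceding sibling <p>."""
    soup = BeautifulSoup(html, 'html.parser')
    for hr in soup.select('p ~ hr[style^=page-break-after]'):
        hr.find_previous_sibling('p').decompose()
        hr.decompose()
    return soup.get_text()

broad = strip_any_previous(HTML)
careful = strip_sibling_only(HTML)
print(repr(broad), repr(careful))
```

In this sketch find_previous() reaches into the <div> and removes real content, whereas the sibling selector leaves it alone.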
Execute both before cleaning up your Word HTML. Combined with your function that together becomes:
import os  # needed for os.path.join below

def get_text(url, item_name):
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')
    for pagebrk in soup.select('p ~ hr[style^=page-break-after]'):
        pagebrk.find_previous_sibling('p').decompose()
        pagebrk.decompose()
    convert_word_bullets(soup)
    cleaned_section = map(clean_word, sec_section(soup, item_name))
    return ''.join([
        elem.get_text(strip=True) if elem.name else elem
        for elem in cleaned_section])

text = get_text(url, 'ITEM_1A_RISK_FACTORS')
with open(os.path.join(path, 'test.txt'), 'w', encoding='utf8') as f:
    f.write(text)

This page markup is really bad. You will need to remove excess tags to fix your issue. Luckily for you, beautifulsoup can do the heavy-lifting. The code below will remove all font tags.
soup = BeautifulSoup(html.text, 'html.parser')
for font in soup.find_all('font'):
    font.unwrap()


How to scrape content from a website with no class or id specified in attribute with BeautifulSoup4

I want to scrape separate pieces of content: the text in the 'a' tag (i.e. only the name, "42mm Architecture") and 'Scope of services', 'Types of Built Projects', 'Locations of Built Projects', 'Style of work' and 'Website' as CSV file headers together with their contents, for the whole webpage.
The elements have no class or ID associated with them, so I am stuck on how to extract those details properly. There are also 'br' and 'b' tags in between.
There are multiple 'p' tags before and after the block of code provided. Here is the website.
<h2>
<a href="http://www.dezeen.com/tag/design-by-42mm-architecture" rel="noopener noreferrer" target="_blank">
42mm Architecture
</a>
|
<span style="color: #808080;">
Delhi | Top Architecture Firms/ Architects in India
</span>
</h2>
<!-- /wp:paragraph -->
<p>
<b>
Scope of services:
</b>
Architecture, Interiors, Urban Design.
<br/>
<b>
Types of Built Projects:
</b>
Residential, commercial, hospitality, offices, retail, healthcare, housing, Institutional
<br/>
<b>
Locations of Built Projects:
</b>
New Delhi and nearby states
<b>
<br/>
</b>
<b>
Style of work
</b>
<span style="font-weight: 400;">
: Contemporary
</span>
<br/>
<b>
Website
</b>
<span style="font-weight: 400;">
:
<a href="https://www.42mm.co.in/">
42mm.co.in
</a>
</span>
</p>
So how is it done using BeautifulSoup4?
This one was a bit time-consuming! The webpage markup is incomplete and has few tags and identifiers. On top of that, the content isn't even spelled consistently: for example, one place has the heading "Scope of Services" and another has "Scope of services", and there are many more cases like that. So what I have done is a crude extraction, and I'm sure it will help you, especially if you also want to paginate.
import requests
from bs4 import BeautifulSoup
import csv

page = requests.get('https://www.re-thinkingthefuture.com/top-architects/top-architecture-firms-in-india-part-1/')
soup = BeautifulSoup(page.text, 'lxml')
# there are many h2 tags but we want the one without any class name
h2 = soup.find_all('h2', class_= '')
headers = []
contents = []
header_len = []
a_tags = []
for i in h2:
    if i.find_next().name == 'a': # to make sure we do not grab the wrong tag
        a_tags.append(i.find_next().text)
        p = i.find_next_sibling()
        contents.append(p.text)
        h =[j.text for j in p.find_all('strong')] # some headings were bold in the website
        headers.append(h)
        header_len.append(len(h))
# since only some headings were in bold the max number of bold would give all headers
headers = headers[header_len.index(max(header_len))]
# removing the : from headings
headers = [i[:len(i)-1] for i in headers]
# inserted a new heading
headers.insert(0, 'Firm')
# n for traversing through headers list
# k for traversing through a_tags list
n =1
k =0
# this is the difficult part where the content will have all the details in one value including the heading like this
"""
Scope of services: Architecture, Interiors, Urban Design.Types of Built Projects: Residential, commercial, hospitality, offices, retail, healthcare, housing, InstitutionalLocations of Built Projects: New Delhi and nearby statesStyle of work: ContemporaryWebsite: 42mm.co.in
"""
# thus I am splitting it using the ':' and then splicing it from the start of the each heading
contents = [i.split(':') for i in contents]
for i in contents:
    for j in i:
        h = headers[n][:5]
        if i.index(j) == 0:
            i[i.index(j)] = a_tags[k]
            n+=1
            k+=1
        elif h in j:
            i[i.index(j)] = j[:j.index(h)]
            j = j[:j.index(h)]
            if n < len(headers)-1:
                n+=1
    n =1
    # merging those extra values in the list if any
    if len(i) == 7:
        i[3] = i[3] + ' ' + i[4]
        i.remove(i[4])
# writing into csv file
# if you don't want a line space between each row then add newline = '' argument in the open function below
with open('output.csv', 'w') as f:
    writer = csv.writer(f)
    writer.writerow(headers)
    writer.writerows(contents)
If you want to paginate then just add the page number to the end of the url and you'll be good!
page_num = 1
while page_num <13:
    page = requests.get(f'https://www.re-thinkingthefuture.com/top-architects/top-architecture-firms-in-india-part-1/{page_num}/')
    # paste the above code starting from soup = BeautifulSoup(page.text, 'lxml')
    page_num +=1
Hope this helps, let me know if there's any error.
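As a possible alternative to splitting the flattened paragraph text on ':', here is a rough sketch that treats each <b> tag as a label and collects the siblings that follow it up to the next <b>. It is written against a shortened copy of the markup from the question; parse_fields is a name made up for this example, and real pages may need extra handling.

```python
from bs4 import BeautifulSoup, Tag

# Shortened copy of the snippet from the question
html = """<p><b>Scope of services:</b> Architecture, Interiors, Urban Design.<br/>
<b>Types of Built Projects:</b> Residential, commercial<br/>
<b>Locations of Built Projects:</b> New Delhi and nearby states<b><br/></b>
<b>Style of work</b><span style="font-weight: 400;">: Contemporary</span><br/>
<b>Website</b><span style="font-weight: 400;">: <a href="https://www.42mm.co.in/">42mm.co.in</a></span></p>"""

def parse_fields(p):
    """Map each bold label in <p> to the text that follows it."""
    fields = {}
    for b in p.find_all('b'):
        label = b.get_text(strip=True).rstrip(':').strip()
        if not label:
            continue  # skip empty labels such as <b><br/></b>
        parts = []
        for sib in b.next_siblings:
            if isinstance(sib, Tag) and sib.name == 'b':
                break  # the next label starts here
            parts.append(sib.get_text() if isinstance(sib, Tag) else str(sib))
        # Some values carry the colon themselves (": Contemporary")
        fields[label] = ''.join(parts).strip().lstrip(':').strip()
    return fields

soup = BeautifulSoup(html, 'html.parser')
fields = parse_fields(soup.p)
print(fields)
```

Because the labels are read from the tags rather than from string positions, a colon inside a value does not shift the columns.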
EDIT 1:
I forgot to mention the most important part, sorry: if a tag has no class name, you can still select it the way I did in the code above:
h2 = soup.find_all('h2', class_= '')
This asks for all the h2 tags that do not have a class name. The absence of a class can itself sometimes serve as a unique identifier, as it does here.
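If you prefer to make the "no class" condition explicit, a comparable filter can be written with Tag.has_attr(). This is a sketch on invented markup, not the page from the question, and it is a different technique from the class_='' call above.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one styled heading and one without a class
html = ('<h2 class="widget-title">Menu</h2>'
        '<h2><a href="#">42mm Architecture</a></h2>')
soup = BeautifulSoup(html, 'html.parser')

# Keep only the <h2> tags that carry no class attribute at all
plain_h2 = [h for h in soup.find_all('h2') if not h.has_attr('class')]
names = [h.a.get_text(strip=True) for h in plain_h2 if h.a]
print(names)
```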
You can use this example as a basis for scraping the information from that page:
import requests
import pandas as pd
from bs4 import BeautifulSoup

url = "https://www.gov.uk/government/publications/endorsing-bodies-start-up/start-up"
soup = BeautifulSoup(requests.get(url).content, "html.parser")
parent = soup.select_one("div.govspeak")
mapping = {"sector": "sectors", "endorses businesses": "endorses businesses in"}
all_data = []
for h3 in parent.select("h3"):
    name = h3.text
    link = h3.a["href"] if h3.a else "-"
    ul = h3.find_next("ul")
    if ul and ul.find_previous("h3") == h3 and ul.parent == parent:
        li = [
            list(map(lambda x: mapping.get((i := x.strip()), i), v))
            for li in ul.select("li")
            if len(v := li.get_text(strip=True).split(":")) == 2
        ]
    else:
        li = []
    all_data.append({"name": name, "link": link, **dict(li)})
df = pd.DataFrame(all_data)
print(df)
df.to_csv("data.csv", index=False)
This creates data.csv.

Beautiful Soup - Get all text, but preserve link html?

I have to process a large archive of extremely messy HTML full of extraneous tables, spans and inline styles into markdown.
I am trying to use Beautiful Soup to accomplish this task, and my goal is basically the output of the get_text() function, except to preserve anchor tags with the href intact.
As an example, I would like to convert:
<td>
<font><span>Hello</span><span>World</span></font><br>
<span>Foo Bar <span>Baz</span></span><br>
<span>Example Link: <a href="https://google.com">Google</a></span>
</td>
Into:
Hello World
Foo Bar Baz
Example Link: <a href="https://google.com">Google</a>
My thought process so far was to simply grab all the tags and unwrap them all if they aren't anchors, but this causes the text to be repeated several times as soup.find_all(True) returns recursively nested tags as individual elements:
#!/usr/bin/env python
from bs4 import BeautifulSoup
example_html = '<td><font><span>Hello</span><span>World</span></font><br><span>Foo Bar <span>Baz</span></span><br><span>Example Link: <a href="https://google.com">Google</a></span></td>'
soup = BeautifulSoup(example_html, 'lxml')
tags = soup.find_all(True)
for tag in tags:
    if (tag.name == 'a'):
        print("<a href='{}'>{}</a>".format(tag['href'], tag.get_text()))
    else:
        print(tag.get_text())
Which returns multiple fragments/duplicates as the parser moves down the tree:
HelloWorldFoo Bar BazExample Link: Google
HelloWorldFoo Bar BazExample Link: Google
HelloWorldFoo Bar BazExample Link: Google
HelloWorld
Hello
World
Foo Bar Baz
Baz
Example Link: Google
<a href='https://google.com'>Google</a>
One of the possible ways to tackle this problem would be to introduce some special handling for a elements when it comes to printing out a text of an element.
You can do it by overriding _all_strings() method and returning a string representation of an a descendant element and skip a navigable string inside an a element. Something along these lines:
from bs4 import BeautifulSoup, NavigableString, CData, Tag

class MyBeautifulSoup(BeautifulSoup):
    def _all_strings(self, strip=False, types=(NavigableString, CData)):
        for descendant in self.descendants:
            # return "a" string representation if we encounter it
            if isinstance(descendant, Tag) and descendant.name == 'a':
                yield str(descendant)
            # skip an inner text node inside "a"
            if isinstance(descendant, NavigableString) and descendant.parent.name == 'a':
                continue
            # default behavior
            if (
                (types is None and not isinstance(descendant, NavigableString))
                or
                (types is not None and type(descendant) not in types)):
                continue
            if strip:
                descendant = descendant.strip()
                if len(descendant) == 0:
                    continue
            yield descendant
Demo:
In [1]: data = """
...: <td>
...: <font><span>Hello</span><span>World</span></font><br>
...: <span>Foo Bar <span>Baz</span></span><br>
...: <span>Example Link: <a href="https://google.com" target="_blank" style="mso-line-height-rule: exactly;-ms-text-size-adjust: 100%;-webkit-text-size-adjust: 100%;color: #395c99;font-weight: normal;tex
...: t-decoration: underline;">Google</a></span>
...: </td>
...: """
In [2]: soup = MyBeautifulSoup(data, "lxml")
In [3]: print(soup.get_text())
HelloWorld
Foo Bar Baz
Example Link: Google
To only consider direct children, set recursive=False. You then need to process each 'td' and extract the text and anchor link individually.
#!/usr/bin/env python
from bs4 import BeautifulSoup
example_html = '<td><font><span>Some Example Text</span></font><br><span>Another Example Text</span><br><span>Example Link: Google</span></td>'
soup = BeautifulSoup(example_html, 'lxml')
tags = soup.find_all(recursive=False)
for tag in tags:
    print(tag.text)
    print(tag.find('a'))
If you want the text printed on separate lines you will have to process the spans individually.
for tag in tags:
    spans = tag.find_all('span')
    for span in spans:
        print(span.text)
    print(tag.find('a'))
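Another way to get close to the output asked for in the question, sketched here as an assumption rather than as part of either answer, is to unwrap every tag except the anchors and then serialize what is left. The example reduces each anchor to its href and turns <br> into a newline; the style attribute on the link is invented to show the trimming.

```python
from bs4 import BeautifulSoup

html = ('<td><font><span>Hello</span><span>World</span></font><br>'
        '<span>Foo Bar <span>Baz</span></span><br>'
        '<span>Example Link: <a href="https://google.com" style="color: red">'
        'Google</a></span></td>')
soup = BeautifulSoup(html, 'html.parser')

# Line breaks become plain newlines
for br in soup.find_all('br'):
    br.replace_with('\n')

# find_all() returns a list, so modifying the tree while looping is safe here
for tag in soup.find_all(True):
    if tag.name == 'a':
        tag.attrs = {'href': tag.get('href', '')}  # drop inline styles etc.
    else:
        tag.unwrap()  # keep the contents, drop the tag itself

result = str(soup)
print(result)
```

Since each tag is unwrapped exactly once, no text is repeated, which avoids the duplicated output shown in the question.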

BeautifulSoup - combine consecutive tags

I have to work with the messiest HTML where individual words are split into separate tags, like in the following example:
<b style="mso-bidi-font-weight:normal"><span style='font-size:14.0pt;mso-bidi-font-size:11.0pt;line-height:107%;font-family:"Times New Roman",serif;mso-fareast-font-family:"Times New Roman"'>I</span></b><b style="mso-bidi-font-weight:normal"><span style='font-family:"Times New Roman",serif;mso-fareast-font-family:"Times New Roman"'>NTRODUCTION</span></b>
That's kind of hard to read, but basically the word "INTRODUCTION" is split into
<b><span>I</span></b>
and
<b><span>NTRODUCTION</span></b>
having the same inline properties for both span and b tags.
What's a good way to combine these? I figured I'd loop through to find consecutive b tags like this, but am stuck on how I'd go about merging the consecutive b tags.
for b in soup.findAll('b'):
    try:
        if b.next_sibling.name=='b':
            pass  ## combine them here??
    except:
        pass
Any ideas?
EDIT:
Expected output is the following
<b style="mso-bidi-font-weight:normal"><span style='font-family:"Times New Roman",serif;mso-fareast-font-family:"Times New Roman"'>INTRODUCTION</span></b>
The solution below combines text from all the selected <b> tags into one <b> of your choice and decomposes the others.
If you only want to merge the text from consecutive tags follow Danny's approach.
Code:
from bs4 import BeautifulSoup
html = '''
<div id="wrapper">
<b style="mso-bidi-font-weight:normal">
<span style='font-size:14.0pt;mso-bidi-font-size:11.0pt;line-height:107%;font-family:"Times New Roman",serif;mso-fareast-font-family:"Times New Roman"'>I</span>
</b>
<b style="mso-bidi-font-weight:normal">
<span style='font-family:"Times New Roman",serif;mso-fareast-font-family:"Times New Roman"'>NTRODUCTION</span>
</b>
</div>
'''
soup = BeautifulSoup(html, 'lxml')
container = soup.select_one('#wrapper') # it contains b tags to combine
b_tags = container.find_all('b')
# combine all the text from b tags
text = ''.join(b.get_text(strip=True) for b in b_tags)
# here you choose a tag you want to preserve and update its text
b_main = b_tags[0] # you can target it however you want, I just take the first one from the list
b_main.span.string = text # replace the text
for tag in b_tags:
    if tag is not b_main:
        tag.decompose()
print(soup)
Any comments appreciated.
Perhaps you could check if the b.previousSibling is a b tag, then append the inner text from the current node into that. After doing this - you should be able to remove the current node from the tree with b.decompose.
The approach I have used for this problem is to insert one element inside the other, then unwrap() it, which will preserve all nested text and tags -- unlike approaches using the text contents of the elements.
For example:
for b in soup.find_all('b'):
    prev = b.previous_sibling
    if prev and prev.name == 'b':  # Any conditions needed to decide to merge
        b.insert(0, prev)  # Move the previous element inside this one
        prev.unwrap()  # Unwrap <b><b>prev</b> b</b> into <b>prev b</b>
Note the use of previous_sibling instead of next_sibling so that we don't modify subsequent parts of the soup that we are about to iterate over.
Then we might want to repeat the process with <span> to achieve the final result. Perhaps also check b['style'] == prev['style'] if that condition is desired for merging.
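Put together, the two passes might look like the sketch below. The merge_adjacent helper name is made up for this example, and the markup is the question's example with the attributes stripped for brevity.

```python
from bs4 import BeautifulSoup, Tag

def merge_adjacent(soup, name):
    """Merge each <name> tag with an immediately preceding <name> sibling."""
    for tag in soup.find_all(name):
        prev = tag.previous_sibling
        if isinstance(prev, Tag) and prev.name == name:
            tag.insert(0, prev)  # move the previous element inside this one
            prev.unwrap()        # then drop its now-redundant wrapper

html = '<b><span>I</span></b><b><span>NTRODUCTION</span></b>'
soup = BeautifulSoup(html, 'html.parser')

merge_adjacent(soup, 'b')     # first join the outer <b> tags
merge_adjacent(soup, 'span')  # then the <span> tags that are now adjacent
merged = str(soup)
print(merged)
```

A style comparison such as prev.get('style') == tag.get('style') could be added to the condition if only identically styled tags should be merged.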
The adjacent answer combines only the text of the tags and does not preserve nested tags such as <i>.
The following code preserves them.
For example, for this html:
<div>
<p>A<i>b</i>cd1, <i>a</i><b><i>b</i></b><i>cd2</i> abcd3 <i>ab</i></p>
<p>cd4 <i>a</i><i>bsd5</i> <i>ab<span>cd6</span></i></p>
</div>
the result will be:
<div>
<p>A<i>b</i>cd1, <i>a<b>b</b>cd2</i> abcd3 <i>ab</i></p>
<p>cd4 <i>absd5 ab<span>cd6</span></i></p>
</div>
In the ignoring_tags_names variable, you can set which tags are considered nested and ignored when merging. Any other tags will break the merging chain.
In the re_symbols_ignore variable, you can set text characters between the same tags to be ignored when concatenating. Any other characters will break the merging chain.
You can also require the tag attributes to be identical. Note that the comparison is order-sensitive for multi-valued attributes: {class: ['a', 'b']} and {class: ['b', 'a']} are considered different, so such tags are not combined.
import re
from bs4 import BeautifulSoup, NavigableString


def find_and_combine_tags(soup, init_tag_name: str, init_tag_attrs: dict = None or {}):

    def combine_tags(tag, tags: list):
        # appending the tag chain to the first tag
        for t in tags:
            tag.append(t)
        # unwrapping them
        for t in tag.find_all(init_tag_name):
            if t.name == init_tag_name and t.attrs == init_tag_attrs:
                t.unwrap()

    def fill_next_siblings(tag, init_tag_name: str, ignoring_tags_names: list) -> list:
        next_siblings = []
        for t in tag.next_siblings:
            if isinstance(t, NavigableString) and not re_symbols_ignore.match(t):
                # any other text breaks the merging chain, as described above
                break
            elif isinstance(t, NavigableString) and re_symbols_ignore.match(t):
                next_siblings.append(t)
            elif t.name in ignoring_tags_names and t.attrs == init_tag_attrs:  # also checking the tag attrs
                next_siblings.append(t)
            else:
                # filling `next_siblings` until another tag met
                break

        has_other_tag_met = False
        for t in next_siblings:
            if t.name == init_tag_name and t.attrs == init_tag_attrs:
                has_other_tag_met = True
                break

        # removing unwanted tags on the tail of `next_siblings`
        if has_other_tag_met:
            while True:
                last_tag = next_siblings[-1]
                if isinstance(last_tag, NavigableString):
                    next_siblings.pop()
                elif last_tag.name != init_tag_name and last_tag.attrs != init_tag_attrs:
                    next_siblings.pop()
                else:
                    break

        return next_siblings

    # Ignore nested tags names
    if init_tag_name in ['i', 'b', 'em']:
        ignoring_tags_names = ['i', 'b', 'em']
    elif init_tag_name in ['div']:
        # Block tags can have many nested tags
        ignoring_tags_names = ['div', 'p', 'span', 'a']
    else:
        ignoring_tags_names = []

    # Some symbols between identical tags can be merged into them, because they do not change the font style.
    if init_tag_name == 'i':
        # Italic doesn't change the style of some characters (spaces, period, comma), so they can be combined
        re_symbols_ignore = re.compile(r'^[\s.,-]+$')
    elif init_tag_name == 'b':
        # Bold changes the style of all characters
        re_symbols_ignore = re.compile(r'^[\s]+$')
    elif init_tag_name == 'div':
        # Be careful with merging here, because the html can have some `\n` between block tags (like `div`s)
        re_symbols_ignore = re.compile(r'^[\s]+$')
    else:
        re_symbols_ignore = None

    all_wanted_tags = soup.find_all(init_tag_name)
    if all_wanted_tags:
        tag_groups_to_combine = []
        tag = all_wanted_tags[0]
        last_tag = tag
        while True:
            tags_to_append = fill_next_siblings(tag, init_tag_name, ignoring_tags_names)
            if tags_to_append:
                tag_groups_to_combine.append((tag, tags_to_append))  # the first tag and tags to append
            # looking for a next tags group
            last_tag = tags_to_append[-1] if tags_to_append else tag
            for tag in all_wanted_tags:
                if tag.sourceline > last_tag.sourceline \
                        or (tag.sourceline == last_tag.sourceline and tag.sourcepos > last_tag.sourcepos):
                    break
            # NOTE: the sourcepos test compares last_tag with itself, so only sourceline is effectively checked
            if last_tag.sourceline == all_wanted_tags[-1].sourceline and last_tag.sourcepos == last_tag.sourcepos:
                break
            last_tag = tag

        for first_tag, tags_to_append in tag_groups_to_combine:
            combine_tags(first_tag, tags_to_append)

    return soup

Extracting data from an inconsistent HTML page using BeautifulSoup4 and Python

I’m trying to extract data from this webpage and I'm having some trouble due to inconsistencies in the page's HTML formatting. I have a list of OGAP IDs and I want to extract the Gene Name and any literature information (PMID #) for each OGAP ID I iterate through. Thanks to other questions on here, and the BeautifulSoup documentation, I've been able to consistently get the gene name for each ID, but I'm having trouble with the literature part. Here are a couple of search terms that highlight the inconsistencies.
HTML sample that works
Search term: OG00131
<tr>
<td colspan="4" bgcolor="#FBFFCC" class="STYLE28">Literature describing O-GlcNAcylation:
<br> PMID:
20068230
[CAD, ETD MS/MS]; <br>
<br>
</td>
</tr>
HTML sample that doesn't work
Search term: OG00020
<td align="top" bgcolor="#FBFFCC">
<div class="STYLE28">Literature describing O-GlcNAcylation: </div>
<div class="STYLE28">
<div class="STYLE28">PMID:
16408927
[Azide-tag, nano-HPLC/tandem MS]
</div>
<br>
Site has not yet been determined. Use
OGlcNAcScan
to predict the O-GlcNAc site. </div>
</td>
Here's the code I have so far
import urllib2
from bs4 import BeautifulSoup
#define list of genes
#initialize variables
gene_list = []
literature = []
# Test list
gene_listID = ["OG00894", "OG00980", "OG00769", "OG00834","OG00852", "OG00131","OG00020"]
for i in range(len(gene_listID)):
    print gene_listID[i]
    # Specifies URL, uses the "%" to sub in different ogapIDs based on a list provided
    dbOGAP = "https://wangj27.u.hpc.mssm.edu/cgi-bin/DB_tb.cgi?textfield=%s&select=Any" % gene_listID[i]
    # Opens the URL as a page
    page = urllib2.urlopen(dbOGAP)
    # Reads the page and parses it through "lxml" format
    soup = BeautifulSoup(page, "lxml")
    gene_name = soup.find("td", text="Gene Name").find_next_sibling("td").text
    print gene_name[1:]
    gene_list.append(gene_name[1:])
    # PubMed IDs are located near the <td> tag with the term "Data and Source"
    pmid = soup.find("span", text="Data and Source")
    # Based on inspection of the website, need to move up to the parent <td> tag
    pmid_p = pmid.parent
    # Then we move to the next <td> tag, denoted as sibling (since they share parent <tr> (Table row) tag)
    pmid_s = pmid_p.next_sibling
    #for child in pmid_s.descendants:
    #    print child
    # Now we search down the tree to find the next table data (<td>) tag
    pmid_c = pmid_s.find("td")
    temp_lit = []
    # Next we print the text of the data
    #print pmid_c.text
    if "No literature is available" in pmid_c.text:
        temp_lit.append("No literature is available")
        print "Not available"
    else:
        # and then print out a list of urls for each pubmed ID we have
        print "The following is available"
        for link in pmid_c.find_all('a'):
            # the <a> tag includes more than just the link address.
            # for each <a> tag found, print the address (href attribute) and extra bits
            # link.string provides the string that appears to be hyperlinked.
            # In this case, it is the pubmedID
            print link.string
            temp_lit.append("PMID: " + link.string + " URL: " + link.get('href'))
    literature.append(temp_lit)
    print "\n"
So it seems the element is what is throwing the code for a loop. Is there a way to search for any element with the text "PMID" and return the text that comes after it (and url if there is a PMID number)? If not, would I just want to check each child, looking for the text I'm interested in?
I'm using Python 2.7.10
import requests
from bs4 import BeautifulSoup
import re
gene_listID = ["OG00894", "OG00980", "OG00769", "OG00834","OG00852", "OG00131","OG00020"]
urls = ('https://wangj27.u.hpc.mssm.edu/cgi-bin/DB_tb.cgi?textfield={}&select=Any'.format(i) for i in gene_listID)
for url in urls:
    r = requests.get(url)
    soup = BeautifulSoup(r.text, 'lxml')
    regex = re.compile(r'http://www.ncbi.nlm.nih.gov/pubmed/\d+')
    a_tag = soup.find('a', href=regex)
    has_pmid = 'PMID' in a_tag.previous_element
    if has_pmid :
        print(a_tag.text, a_tag.next_sibling, a_tag.get("href"))
    else:
        print("Not available")
out:
18984734 [GalNAz-Biotin tagging, CAD MS/MS]; http://www.ncbi.nlm.nih.gov/pubmed/18984734
20068230 [CAD, ETD MS/MS]; http://www.ncbi.nlm.nih.gov/pubmed/20068230
20068230 [CAD, ETD MS/MS]; http://www.ncbi.nlm.nih.gov/pubmed/20068230
Not available
16408927 [Azide-tag, nano-HPLC/tandem MS]; http://www.ncbi.nlm.nih.gov/pubmed/16408927
Not available
16408927 [Azide-tag, nano-HPLC/tandem MS] http://www.ncbi.nlm.nih.gov/pubmed/16408927?dopt=Citation
Find the first a tag whose href matches the target URL pattern (ending in digits), then check whether 'PMID' is in its previous element.
This website is very inconsistent and it took me many tries; I hope this helps.
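The question also asked whether one can search for the text "PMID" directly. A sketch of that idea follows; it assumes each PMID number is wrapped in an <a> tag pointing at PubMed (the pasted samples above lost their links, so the anchor in this fragment is reconstructed from the URL pattern in the answer and is an assumption).

```python
import re
from bs4 import BeautifulSoup

# Hypothetical fragment modelled on the second sample, with the link restored
html = """
<td><div class="STYLE28">Literature describing O-GlcNAcylation: </div>
<div class="STYLE28"><div class="STYLE28">PMID:
<a href="http://www.ncbi.nlm.nih.gov/pubmed/16408927">16408927</a>
[Azide-tag, nano-HPLC/tandem MS]
</div></div></td>
"""
soup = BeautifulSoup(html, 'html.parser')

records = []
# Search the text nodes themselves, regardless of the enclosing tag
for node in soup.find_all(string=re.compile(r'PMID')):
    link = node.find_next('a')
    if link is None:
        continue  # "PMID" mentioned without a link
    records.append({
        'pmid': link.get_text(strip=True),
        'url': link.get('href'),
        'note': str(link.next_sibling or '').strip(),
    })
print(records)
```

Because the search starts from the text node, it does not matter whether the label sits in a <td> or in a nested <div>.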

Combine multiple tags with lxml

I have an html file which looks like:
...
<p>
<strong>This is </strong>
<strong>a lin</strong>
<strong>e which I want to </strong>
<strong>join.</strong>
</p>
<p>
2.
<strong>But do not </strong>
<strong>touch this</strong>
<em>Maybe some other tags as well.</em>
bla bla blah...
</p>
...
What I need is, if all the tags in a 'p' block are 'strong', then combine them into one line, i.e.
<p>
<strong>This is a line which I want to join.</strong>
</p>
Without touching the other block since it contains something else.
Any suggestions? I am using lxml.
UPDATE:
So far I tried:
for p in self.tree.xpath('//body/p'):
    if p.tail is None: #no text before first element
        children = p.getchildren()
        for child in children:
            if len(children)==1 or child.tag!='strong' or child.tail is not None:
                break
        else:
            etree.strip_tags(p,'strong')
With this code I was able to strip off the strong tags in the desired part, giving:
<p>
This is a line which I want to join.
</p>
So now I just need a way to put the tag back in...
I was able to do this with bs4 (BeautifulSoup):
from bs4 import BeautifulSoup as bs
html = """<p>
<strong>This is </strong>
<strong>a lin</strong>
<strong>e which I want to </strong>
<strong>join.</strong>
</p>
<p>
<strong>But do not </strong>
<strong>touch this</strong>
</p>"""
soup = bs(html)
s = ''
# note that I use the 0th <p> block ...[0],
# so make the appropriate change in your code
for t in soup.find_all('p')[0].text:
    s = s+t.strip('\n')
s = '<p><strong>'+s+'</strong></p>'
print s # prints: <p><strong>This is a line which I want to join.</strong></p>
Then use replace_with():
p_tag = soup.p
p_tag.replace_with(bs(s, 'html.parser'))
print soup
prints:
<html><body><p><strong>This is a line which I want to join.</strong></p>
<p>
<strong>But do not </strong>
<strong>touch this</strong>
</p></body></html>
I have managed to solve my own problem.
for p in self.tree.xpath('//body/p'):
    if p.tail is None: # some conditions specifically for my doc
        children = p.getchildren()
        if len(children)>1:
            for child in children:
                #if other stuffs present, break
                if child.tag!='strong' or child.tail is not None:
                    break
            else:
                # If not break, we find a p block to fix
                # Get rid of stuffs inside p, and put a SubElement in
                etree.strip_tags(p,'strong')
                tmp_text = p.text_content()
                p.clear()
                subtext = etree.SubElement(p, "strong")
                subtext.text = tmp_text
Special thanks to @Scott, who helped me arrive at this solution. Although I cannot mark his answer as correct, I am no less grateful for his guidance.
Alternatively, you can use more specific xpath to get the targeted p elements directly :
p_target = """
//p[strong]
[not(*[not(self::strong)])]
[not(text()[normalize-space()])]
"""
for p in self.tree.xpath(p_target):
    #logic inside the loop can also be the same as your `else` block
    content = p.xpath("normalize-space()")
    p.clear()
    strong = etree.SubElement(p, "strong")
    strong.text = content
Brief explanation of the XPath being used:
//p[strong] : find p element, anywhere in the XML/HTML document, having child element strong...
[not(*[not(self::strong)])] : ..and not having child element other than strong...
[not(text()[normalize-space()])] : ..and not having non-empty text node child.
normalize-space() : get all text nodes from current context element, concatenated with consecutive whitespaces normalized to single space
