How to replace a number of strings with an incremental count in Python

I have some HTML code (for display in a browser) in a string which contains any number of svg images such as:
<table>
<tr>
<td><img src="http://localhost/images/Store.Tools.svg"></td>
<td><img src="http://localhost/images/Store.Diapers.svg"></td>
</tr>
</table>
I would like to find all of these image references and replace them with the following (in order to attach the images to an email):
<table>
<tr>
<td><cid:image1></td><td><cid:image2></td>
</tr>
</table>
SVG filenames can have an arbitrary number of dots, characters, and numbers.
What's the best way to do this in Python?

I'd use an HTML parser to find all img tags and replace them.
Example using BeautifulSoup and its replace_with():
from bs4 import BeautifulSoup

data = """
<table><tr>
<td><img src="http://localhost/images/Store.Tools.svg"></td>
<td><img src="http://localhost/images/Store.Diapers.svg"></td>
</tr></table>
"""

soup = BeautifulSoup(data, 'html.parser')
for index, image in enumerate(soup.find_all('img'), start=1):
    tag = soup.new_tag('img', src='cid:image{}'.format(index))
    image.replace_with(tag)

print(soup.prettify())
Prints:
<table>
<tr>
<td>
<img src="cid:image1"/>
</td>
<td>
<img src="cid:image2"/>
</td>
</tr>
</table>
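For context on the email side of the original goal: each cid:imageN reference must match a Content-ID header on an attached image part of a multipart/related message. A minimal sketch using the standard library's email package (the SVG bytes and IDs here are placeholders, not the original files):

```python
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText
from email.mime.image import MIMEImage

# The HTML body references the image by its Content-ID.
msg = MIMEMultipart('related')
msg.attach(MIMEText('<img src="cid:image1">', 'html'))

# Placeholder SVG bytes; in practice, read the file the original src pointed at.
img = MIMEImage(b'<svg xmlns="http://www.w3.org/2000/svg"/>', _subtype='svg+xml')
img.add_header('Content-ID', '<image1>')  # must match the cid: reference
msg.attach(img)
```

The angle brackets around the Content-ID value are required by the MIME spec, while the src attribute refers to it without them.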

Select a specific column and ignore the rest in BeautifulSoup Python (Avoiding nested tables)

I'm trying to get only the first two columns of a webpage table using beautifulsoup in python. The problem is that this table sometimes contains nested tables in the third column. The structure of the html is similar to this:
<table class="relative-table wrapped">
<tbody>
<tr>
<td>
</td>
<td>
</td>
<td>
</td>
</tr>
<tr>
<td>
</td>
<td>
</td>
<td>
<div class="table-wrap">
<table class="relative-table wrapped">
...
...
</table>
</div>
</td>
</tr>
</tbody>
</table>
The main problem is that I don't know how to simply ignore every third td so that I don't read the nested tables inside the main one. I just want a list with the first column of the main table and another list with its second column, but the nested table ruins everything when I'm reading.
I have tried with this code:
import requests
from bs4 import BeautifulSoup
page = requests.get(URL)
soup = BeautifulSoup(page.content, "html.parser")
links = soup.select("table.relative-table tbody tr td.confluenceTd")
anchorList = []
for anchor in links:
    anchorList.append(anchor.text)

del anchorList[2:len(anchorList):3]

for anchorItem in anchorList:
    print(anchorItem)
    print('-------------------')
This works really well until I reach the nested table, and then it starts deleting other columns.
I have also tried this other code:
page = requests.get(URL)
soup = BeautifulSoup(page.content, "html.parser")
for row in soup.findAll('table')[0].tbody.findAll('tr'):
    firstColumn = row.findAll('td')[0].contents
    secondColumn = row.findAll('td')[1].contents
    print(firstColumn, secondColumn)
But I get an IndexError because it's reading the nested table, and the nested table only has one td.
Does anyone know how I could read the first two columns and ignore the rest?
Thank you.
It may need some more examples / details to clarify, but as I understand it, you are selecting the first <table> and trying to iterate over its rows:
soup.table.select('tr:not(:has(table))')
The selector above would exclude all the rows that include an additional <table>.
An alternative would be to get rid of these last/third <td>:
for row in soup.table.select('tr'):
    row.select_one('td:last-of-type').decompose()
    # or by its index: row.select_one('td:nth-of-type(3)').decompose()
Now you could perform your selections on a <table> with two columns.
Example
from bs4 import BeautifulSoup
html ='''
<table class="relative-table wrapped">
<tbody>
<tr>
<td>
</td>
<td>
</td>
<td>
</td>
</tr>
<tr>
<td>
</td>
<td>
</td>
<td>
<div class="table-wrap">
<table class="relative-table wrapped">
...
...
</table>
</div>
</td>
</tr>
</tbody>
</table>
'''
soup = BeautifulSoup(html, 'html.parser')
for row in soup.table.select('tr'):
    row.select_one('td:last-of-type').decompose()

print(soup)
New soup:
<table class="relative-table wrapped">
<tbody>
<tr>
<td>
</td>
<td>
</td>
</tr>
<tr>
<td>
</td>
<td>
</td>
</tr>
</tbody>
</table>
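An alternative to decomposing cells is to never descend into the nested tables at all: find_all(..., recursive=False) only looks at direct children. A small sketch with placeholder cell text standing in for the real columns:

```python
from bs4 import BeautifulSoup

html = '''<table class="relative-table wrapped"><tbody>
<tr><td>a1</td><td>b1</td><td>c1</td></tr>
<tr><td>a2</td><td>b2</td>
<td><div class="table-wrap"><table><tr><td>nested</td></tr></table></div></td></tr>
</tbody></table>'''

soup = BeautifulSoup(html, 'html.parser')
first, second = [], []
# recursive=False keeps the iteration on the outer table's own rows and cells,
# so the nested table in the third column is never visited.
for row in soup.table.tbody.find_all('tr', recursive=False):
    cells = row.find_all('td', recursive=False)
    first.append(cells[0].get_text(strip=True))
    second.append(cells[1].get_text(strip=True))

print(first, second)  # ['a1', 'a2'] ['b1', 'b2']
```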

BS4 get the TH data within the table

I am trying to read data from a website which has a table like this:
<table border="0" width = "100%">
<tr>
<th width="33%">Product</th>
<th width="34%">ID</th>
<th width="33%">Serial</th>
</tr>
<tr>
<td align='center'>
<a target="_TOP" href="Link1.html">ProductName</a>
<br>
<a href='Link2.html' TARGET='_TOP'><img src='https://?uid=1'></a>
</td>
<td align='center'>
<a target="_TOP" href="Link2.html">ProductID</a>
<br>
<a href='Link2.html' TARGET='_TOP'><img src='https://?uid=2'></a>
</td>
<td align='center'>
<a target="_TOP" href="Link3.html">ProductSerial</a>
<br>
<a href='Link2.html' TARGET='_TOP'><img src='https://?uid=3'></a>
</td>
</tr>
</table>
and all I want from this table is the ProductID, which is the content inside of the <a> tag.
The problem is, I am trying to use BS4 to find the tag and read inside of it, but how do I accurately point BS4 to it?
I have tried:
import bs4

with open("src/file.html", 'r') as inf:
    html = inf.read()

soup = bs4.BeautifulSoup(html, features="html.parser")
for container in soup.find_all("table", {"td": ""}):
    print(container)
But it doesn't work. Is there any way to achieve this? To read the content inside of the a tag?
You can use the :nth-of-type CSS selector:
print(soup.select_one("td:nth-of-type(2) a:nth-of-type(1)").text)
Output:
ProductID
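As a runnable sketch with the question's table trimmed to its essentials (select_one returns the first match in document order, and the header row contains no td elements, so the second td of the data row is matched):

```python
from bs4 import BeautifulSoup

html = '''<table>
<tr><th>Product</th><th>ID</th><th>Serial</th></tr>
<tr>
<td><a target="_TOP" href="Link1.html">ProductName</a></td>
<td><a target="_TOP" href="Link2.html">ProductID</a></td>
<td><a target="_TOP" href="Link3.html">ProductSerial</a></td>
</tr>
</table>'''

soup = BeautifulSoup(html, 'html.parser')
# nth-of-type counts position among sibling elements of the same tag name.
print(soup.select_one("td:nth-of-type(2) a:nth-of-type(1)").text)  # ProductID
```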

How to extract HTML table following a specific heading?

I am using BeautifulSoup to parse HTML files. I have an HTML file similar to this:
<h3>Unimportant heading</h3>
<table class="foo">
<tr>
<td>Key A</td>
</tr>
<tr>
<td>A value I don't want</td>
</tr>
</table>
<h3>Unimportant heading</h3>
<table class="foo">
<tr>
<td>Key B</td>
</tr>
<tr>
<td>A value I don't want</td>
</tr>
</table>
<h3>THE GOOD STUFF</h3>
<table class="foo">
<tr>
<td>Key C</td>
</tr>
<tr>
<td>I WANT THIS STRING</td>
</tr>
</table>
<h3>Unimportant heading</h3>
<table class="foo">
<tr>
<td>Key A</td>
</tr>
<tr>
<td>A value I don't want</td>
</tr>
</table>
I want to extract the string "I WANT THIS STRING". The perfect solution would be to get the first table following the h3 heading called "THE GOOD STUFF". I have no idea how to do this with BeautifulSoup: I only know how to extract a table with a specific class, or a table nested within some particular tag, but not one following a particular tag.
I think a fallback solution could make use of the string "Key C", assuming it's unique (it almost certainly is) and appears in only that one table, but I'd feel better with going for the specific h3 heading.
Following the logic of Zroq's answer on another question, this code will give you the table following your defined header ("THE GOOD STUFF"). Please note that I just put all your HTML in a variable called "html".
import re
from bs4 import BeautifulSoup, NavigableString, Tag

soup = BeautifulSoup(html, "lxml")
for header in soup.find_all('h3', text=re.compile('THE GOOD STUFF')):
    nextNode = header
    while True:
        nextNode = nextNode.nextSibling
        if nextNode is None:
            break
        if isinstance(nextNode, Tag):
            if nextNode.name == "h3":
                break
            print(nextNode)
Output:
<table class="foo">
<tr>
<td>Key C</td>
</tr>
<tr>
<td>I WANT THIS STRING</td>
</tr>
</table>
Cheers!
The docs explain that if you don't want to use find_all, you can do this:
for sibling in soup.a.next_siblings:
    print(repr(sibling))
I am sure there are many ways to do this more efficiently, but here is what I can think of right now:
from bs4 import BeautifulSoup
import os

os.chdir('/Users/Downloads/')
html_data = open("/Users/Downloads/train.html", 'r').read()
soup = BeautifulSoup(html_data, 'html.parser')
all_td = soup.find_all("td")

flag = 'no_print'
for td in all_td:
    if flag == 'print':
        print(td.text)
        break
    if td.text == 'Key C':
        flag = 'print'
Output:
I WANT THIS STRING
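If the heading text is unique, a more direct route than walking siblings is find_next(), which returns the first matching tag after the heading in document order. A sketch on a trimmed version of the question's HTML:

```python
import re
from bs4 import BeautifulSoup

html = '''<h3>Unimportant heading</h3>
<table class="foo"><tr><td>Key A</td></tr><tr><td>A value I don't want</td></tr></table>
<h3>THE GOOD STUFF</h3>
<table class="foo"><tr><td>Key C</td></tr><tr><td>I WANT THIS STRING</td></tr></table>'''

soup = BeautifulSoup(html, 'html.parser')
header = soup.find('h3', string=re.compile('THE GOOD STUFF'))
table = header.find_next('table')  # first <table> after that heading
print(table.find_all('td')[1].text)  # I WANT THIS STRING
```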

How can I get a tag element by text content in BeautifulSoup4

I have to scrape data from a thousand sites (local HTML files). The complication is that these sites have a 90's-style structure: almost the same nested-table layout, no ids, no CSS classes, only nested tables. How can I select a specific table based on the text in one tr tag?
XPath is not a solution because the sites are mostly the same structure but don't always have the same table order, so I'm looking for a way to extract the table data from all of them, selecting or searching a certain table by some text in it and thereby obtaining the parent tag.
Any ideas?
The code on every page is huge; here is an example of the structure. The data is not always in the same table position.
Update:
Thanks to alecxe, I made this code:
# coding: utf-8
from bs4 import BeautifulSoup
html_content = """
<body>
<table id="gotthistable">
<tr>
<table id="needthistable">
<tr>
<td>text i'm searching</td>
</tr>
<tr>
<td>Some other text</td>
</tr>
</table>
</tr>
<tr>
<td>
<table>
<tr>
<td>Other text</td>
</tr>
<tr>
<td>Some other text</td>
</tr>
</table>
</td>
</tr>
</table>
<table>
<tr>
<td>Different table</td>
</tr>
</table>
</body>
"""
soup = BeautifulSoup(html_content, "lxml")
table = soup.find(lambda tag: tag.name == "table" and "searching" in tag.text)
print(table)
The output of print(table) (same as the soup variable) is:
<table>
<tr>
<table id="needthistable">
<tr>
<td>text i'm searching</td>
</tr>
</tr>
.
.
.
but with this code:
soup = BeautifulSoup(html_content, "lxml")
table = soup.find(lambda tag: tag.name == "td" and "searching" in tag.text).parent.parent
print(table)
I got the output that I want:
<table id="needthistable">
<tr>
<td>text i'm searching</td>
</tr>
<tr>
<td>Some other text</td>
</tr>
</table>
But what if it's not always the same two parent elements? I mean, if I got this td tag, how can I get the table it belongs to?
Use BeautifulSoup's regex filter:
If you pass in a regular expression object, Beautiful Soup will filter
against that regular expression using its search() method.
Example:
soup.find_all(name='tr', text=re.compile('this is part or full text of tr'))
You should use find() with a searching function and check the .text of a table to contain the desired text:
soup.find(lambda tag: tag.name == "table" and "part of text" in tag.text)
Demo:
>>> from bs4 import BeautifulSoup
>>>
>>> data = """
... <body>
... <table>
... <tr>
... <td>This text has a part of text</td>
... </tr>
... <tr>
... <td>Some other text</td>
... </tr>
... </table>
...
... <table>
... <tr>
... <td>Different table</td>
... </tr>
... </table>
... </body>
...
... """
>>>
>>> soup = BeautifulSoup(data, 'lxml')
>>>
>>> table = soup.find(lambda tag: tag.name == "table" and "part of text" in tag.text)
>>> print(table)
<table>
<tr>
<td>This text has a part of text</td>
</tr>
<tr>
<td>Some other text</td>
</tr>
</table>
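On the follow-up about nesting depth: instead of chaining .parent.parent, which assumes the td sits exactly two levels below the table, find_parent('table') climbs to the nearest enclosing <table> however deep the cell is. A small sketch:

```python
from bs4 import BeautifulSoup

html = '<table id="needthistable"><tr><td>text i\'m searching</td></tr></table>'
soup = BeautifulSoup(html, 'html.parser')

td = soup.find('td', string=lambda s: s and 'searching' in s)
table = td.find_parent('table')  # nearest enclosing table, at any depth
print(table['id'])  # needthistable
```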

Extract attribute's value with XPath in Python

I have the HTML:
<table>
<tbody>
<tr>
<td align="left" valign="top" style="padding: 0 10px 0 60px;">
<img src="/files/39.jpg" width="64" height="64">
</td>
<td align="left" valign="middle"><h1>30 Rock</h1></td>
</tr>
</tbody>
</table>
Using Python and LXML I need to extract the value from the attribute src of the <img> element. Here's what I've tried:
import lxml.html
import urllib.request

# make HTTP request to site
page = urllib.request.urlopen("http://my.url.com")
# read the downloaded page
doc = lxml.html.document_fromstring(page.read())
txt1 = doc.xpath('/html/body/table[2]/tbody/tr/td[1]/img')
When I print txt1 I get only an empty list []. How can I correct this?
Use this XPath:
//img/@src
It selects the src attributes of all img elements in the entire input document. Also note that /html/body/table[2] requires the table to be the second table that is a direct child of body; a relative // search avoids having to get that structure exactly right.
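A runnable sketch against the question's own snippet (document_fromstring wraps a fragment in html/body automatically):

```python
import lxml.html

html = '''<table><tbody><tr>
<td><img src="/files/39.jpg" width="64" height="64"></td>
<td><h1>30 Rock</h1></td>
</tr></tbody></table>'''

doc = lxml.html.document_fromstring(html)
# An attribute-valued XPath returns plain strings, one per match.
print(doc.xpath('//img/@src'))  # ['/files/39.jpg']
```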
