extract text from two tables by Beautiful Soup

I have a lot of pages where the structure is the following:
<table class='CERTAIN_CLASS'> ... </table>
A lot of stuff here (divs, ps, brs, images)
<table class='CERTAIN_CLASS'> ... </table>
What is the most efficient way to extract the text (text only!) from everything between two tables of a certain class? I've found a lot of similar questions on SO, but nothing on this specific task.

Assuming that you have already loaded the content of the HTML page into a BeautifulSoup object (called text here):
For the first element with a specific class:
text.find(class_="CERTAIN_CLASS").text.strip()
To get the text from every element with this class, iterate over all matches:
for element in text.find_all(class_="CERTAIN_CLASS"):
    print(element.text.strip())
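Note that the snippets above return the text of the tables themselves. For the text between the two tables, one possible approach is to walk the siblings that follow the first table and stop at the second. This is only a minimal sketch: it assumes both tables share the same parent, and the sample HTML and variable names are made up for illustration.

```python
from bs4 import BeautifulSoup, Tag

# Hypothetical sample page: two tables of the same class with content between them
html = """
<table class="CERTAIN_CLASS"><tr><td>first</td></tr></table>
<div>Some <b>bold</b> text</div>
<p>A paragraph</p>
<table class="CERTAIN_CLASS"><tr><td>second</td></tr></table>
"""

soup = BeautifulSoup(html, "html.parser")
first, second = soup.find_all("table", class_="CERTAIN_CLASS")[:2]

parts = []
for node in first.next_siblings:
    if node is second:  # stop once the second table is reached
        break
    if isinstance(node, Tag):
        piece = node.get_text(" ", strip=True)  # text of the element and its descendants
    else:
        piece = str(node).strip()  # bare text sitting directly between elements
    if piece:
        parts.append(piece)

between = " ".join(parts)
print(between)
```

If the tables are nested at different depths, the sibling walk will not reach the second table, and a different traversal (for example over next_elements) would be needed.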

Related

Is there a specific way of retrieving only the required information from an HTML tree? Example included

I am using Python 3.8 and BeautifulSoup 4 to parse a website. The section I want to read is here:
<h1 class="pr-new-br">
Rotring
<span> 0.7 Imza Uçlu Kurşun Versatil Kalem 37.28.221.368 </span>
</h1>
I find this from the website using this code and get the text from it (soup is the variable for the BeautifulSoup object from the website):
product_name_text = soup.select("h1.pr-new-br")[0].get_text()
However, this of course returns all of the text. I want to separate the text between the <a href> and the text between <span>.
How can I do this? How can I specifically look for a tag or a link in an href?
Thank you very much in advance, I am pretty new in the field, sorry if this is very basic.

The get_text method takes a separator argument that is placed between the text of different elements; passing strip=True also trims the whitespace around each piece.
As an example:
product_name_text = soup.select("h1.pr-new-br")[0].get_text('|', strip=True)
# You will get -> Rotring|0.7 Imza Uçlu Kurşun Versatil Kalem 37.28.221.368
# Then you can split on the same symbol to get a list of the individual elements' texts
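As a runnable sketch of this idea, the snippet below parses the markup from the question, joins the pieces with a separator, and splits them again. It assumes the class in the page is pr-new-br (as in the HTML shown) and that the separator character does not occur in the text itself; stripped_strings is shown as an alternative that avoids the separator altogether.

```python
from bs4 import BeautifulSoup

html = """
<h1 class="pr-new-br">
Rotring
<span> 0.7 Imza Uçlu Kurşun Versatil Kalem 37.28.221.368 </span>
</h1>
"""

soup = BeautifulSoup(html, "html.parser")
h1 = soup.select("h1.pr-new-br")[0]

# Join each text piece with '|' and trim the whitespace around every piece
joined = h1.get_text("|", strip=True)
brand, description = joined.split("|")

# Alternatives: address the <span> directly, or collect the trimmed strings as a list
span_text = h1.span.get_text(strip=True)
pieces = list(h1.stripped_strings)

print(brand)
print(description)
```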

How do I extract text between two objects using XPath?

I'm using XPath to extract different web elements on a webpage, but have hit a roadblock on one particular object that sits between two other objects and doesn't have a closing object behind it for a while.
I've been able to successfully extract other elements from the webpage, but don't know how to proceed at this point.
Here is a copy of what the HTML looks like from the Inspector:
<body>
<table>
<tbody>
<tr>
<td id="left_column">
<div id="top">
<h1></h1>
#SOME TEXT
<div>
<table>
.......
</table>
</div>
</div>
</td>
</tr>
Any suggestions would be greatly appreciated! Thank you!

Here is a thought that I hope will help, but without seeing the entire HTML I can't give more than just an idea. I have more experience with Selenium in Java, so I am not 100% sure that Python has the same functionality, but I imagine it does.
You should be able to get the text from any WebElement. In Java it would look something like this, and it shouldn't be too hard to change it to Python:
WebElement top = driver.findElement(By.xpath("//div[@id='top']"));
String topString = top.getText();
If in your case you're getting more than just the "#SOME TEXT", you would need to remove the text of the other elements that you don't want. Something like:
WebElement topH1 = top.findElement(By.xpath("./h1"));
WebElement topInsideDiv = top.findElement(By.xpath("./div"));
String topHString = topH1.getText();
String topInsideDivString = topInsideDiv.getText();
// since you know that the h1 string comes first and the inside div's string
// comes last, you can take the substring of topString between them
String result = topString.substring(topHString.length(),
        topString.length() - topInsideDivString.length());
This is really just an idea of how you could do it. The way that you determine the part of the string that you are interested in might need to be more complex. It could be that you just cycle through the strings to determine where you need to break apart the entire string to get what you want. If there is text before the tag, your solution would need to be more complex, perhaps by searching for the text and discounting anything you find before it, but without that information I can't really help out more than this.
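In Python, a more direct route may be to let XPath select the text nodes themselves. The sketch below is an alternative to the substring idea, not a translation of it: it uses lxml rather than Selenium, and the sample markup (including the h1 title and the table cell) is invented to mirror the question's snippet. The text() step returns only the text that sits directly inside the div, not the text of its child elements.

```python
import lxml.html

# Hypothetical markup modeled on the snippet in the question
html = """<table><tbody><tr><td id="left_column"><div id="top"><h1>Title</h1>
SOME TEXT
<div><table><tr><td>inner</td></tr></table></div></div></td></tr></tbody></table>"""

root = lxml.html.fromstring(html)

# text() selects the text nodes that are direct children of div#top
raw_nodes = root.xpath("//div[@id='top']/text()")

# Drop whitespace-only nodes and trim the rest
texts = [t.strip() for t in raw_nodes if t.strip()]
print(texts)
```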

python how to identify block html contain text?

I have raw HTML files from which I have removed the script tags.
I want to identify the block elements in the DOM (like <h1>, <p>, <div>, etc., not <a>, <em>, <b>, etc.) and enclose them in <div> tags.
Is there an easy way to do this in Python?
Is there a Python library that can identify block elements?
Thanks
UPDATE
What I actually want is to extract content from the HTML document. I have to identify the blocks that contain text. For each text element I have to find its closest parent element that is displayed as a block. After that, for each block, I will extract features such as the size and position of the block.

You should use something like Beautiful Soup or HTMLParser.
Have a look at their docs: Beautiful Soup or HTMLParser.
You should find what you are looking for there. If you cannot get it to work, consider asking a more specific question.
Here is a simple example of how you could go about this. Say data is the raw content of a site; then you could:
from bs4 import BeautifulSoup
soup = BeautifulSoup(data, "html.parser")  # you may need to add from_encoding="utf-8" or so
Then you might want to walk through the tree looking for a specific node and do something with it. You could use a function like this:
def walker(soup):
    # text nodes report a name of None and have no children to descend into
    if soup.name is not None:
        for child in soup.children:
            # do stuff with the node
            print(':'.join([str(child.name), str(type(child))]))
            walker(child)
Note: the code is from this great tutorial.
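For the UPDATE in the question (finding the closest block-level parent of each piece of text), a rough sketch could look like the following. The BLOCK_TAGS set is a hand-picked assumption, not an official list, and the sample HTML is invented. Keep in mind that a parser only sees the markup: an element's actual display type can be changed by CSS, and size or position are only known to a rendering engine such as a browser.

```python
from bs4 import BeautifulSoup

# Assumed set of tags treated as block-level; extend it as needed
BLOCK_TAGS = {
    "address", "article", "aside", "blockquote", "body", "div", "dl",
    "fieldset", "figure", "footer", "form", "h1", "h2", "h3", "h4", "h5",
    "h6", "header", "li", "main", "nav", "ol", "p", "pre", "section",
    "table", "td", "th", "ul",
}

def closest_block(node):
    """Return the nearest ancestor whose tag name is in BLOCK_TAGS, or None."""
    for parent in node.parents:
        if parent.name in BLOCK_TAGS:
            return parent
    return None

html = "<div><p>Hello <b>bold</b> world</p><h1>Title <em>here</em></h1></div>"
soup = BeautifulSoup(html, "html.parser")

blocks = []
for text_node in soup.find_all(string=True):
    if not text_node.strip():
        continue  # ignore whitespace-only text
    block = closest_block(text_node)
    # compare by identity so that two equal-looking tags are not merged
    if block is not None and not any(block is seen for seen in blocks):
        blocks.append(block)

names = [block.name for block in blocks]
print(names)
```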

Writing clean code in beautifulsoup

While parsing a table on a webpage with little semantic structure, my beautiful soup expressions are getting really ugly. I might be going about it the wrong way and would like to know how I can rewrite my code to make it more readable and less messy?
For example, in a page there are three tables. Relevant data is in the third table. The actual data starts in the second row. The first entry in the row is an index and the data I need is in the second td element. This second td element has two links and my text of interest is within the second a tag. Translating this into Beautiful Soup, I wrote
soup.find_all('table')[2].find_all('tr')[2].find_all('td')[1].find_all('a')[1].text
This works fine, and I grab all 70 elements in the table using the same principle in a list comprehension.
relevant_data = [ x.find_all('td')[1].find_all('a')[1].text for x in soup.find_all('table')[2].find_all('tr')[2:]]
Is this kind of code OK or is there any scope for improvement?

Using lxml, you can use XPath.
For example:
html = '''
<body>
<table></table>
<table></table>
<table>
<tr></tr>
<tr></tr>
<tr><td></td><td><a>blah1</a><a>blah1-1</a></td></tr>
<tr><td></td><td><a>blah2</a><a>blah2-1</a></td></tr>
<tr><td></td><td><a>blah3</a><a>blah3-1</a></td></tr>
<tr><td></td><td><a>blah4</a><a>blah4-1</a></td></tr>
<tr><td></td><td><a>blah5</a><a>blah5-1</a></td></tr>
</table>
<table></table>
</body>
'''
import lxml.html
root = lxml.html.fromstring(html)
print(root.xpath('.//table[3]/tr[position()>=2]/td[2]/a[2]/text()'))
output:
['blah1-1', 'blah2-1', 'blah3-1', 'blah4-1', 'blah5-1']
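If you would rather stay in Beautiful Soup, one option is to keep the same lookups but give the intermediate steps names inside a small helper. This is a sketch only: the function name and its parameters are made up for illustration, the defaults mirror the indices used in the question, and the sample HTML is a shortened version of the one above.

```python
from bs4 import BeautifulSoup

def second_link_texts(soup, table_index=2, first_row=2):
    """Return the text of the second <a> in the second <td> of each data row."""
    table = soup.find_all("table")[table_index]
    rows = table.find_all("tr")[first_row:]
    texts = []
    for row in rows:
        cells = row.find_all("td")
        if len(cells) < 2:
            continue  # skip rows without a second cell
        links = cells[1].find_all("a")
        if len(links) >= 2:
            texts.append(links[1].get_text(strip=True))
    return texts

html = """
<table></table>
<table></table>
<table>
<tr></tr>
<tr></tr>
<tr><td>1</td><td><a>blah1</a><a>blah1-1</a></td></tr>
<tr><td>2</td><td><a>blah2</a><a>blah2-1</a></td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
relevant_data = second_link_texts(soup)
print(relevant_data)
```

Unlike the one-line chain, this version skips rows that lack the expected cells or links instead of raising an IndexError.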

Beautiful Soup - Find identified tag in the original text

I need to manipulate certain text in an HTML document after I have identified the text in the original document. Let's say I have this HTML code
<div id="identifier">
<a href="link" id="linkid">
</a>
</div>
I want to delete the id attribute in the <a> tag. I can identify a particular tag using BeautifulSoup, but because it changes the formatting of the original document I can't search/replace the string either. I don't want to just write output of BeautifulSoup, instead I want to identify <a href="link" id="linkid"> tag in the original document and replace with just <a href="link">. Any idea how to proceed?
Answering a few questions raised:
This is a huge existing codebase that needs some updating, so it's not just a single search/replace kind of job.
The original formatting is important because the organization follows certain coding standards for formatting code, which I want to retain. Also, BS introduces extra tags for the sake of completeness like for and so on.

Which version of beautifulsoup are you using?
You can edit html nodes like dictionaries in bs4
From documentation:
http://www.crummy.com/software/BeautifulSoup/bs4/doc/#changing-tag-names-and-attributes
from bs4 import BeautifulSoup
soup = BeautifulSoup('<b class="boldest" id="verybold">Extremely bold</b>', 'html.parser')
tag = soup.b
del tag['class']  # attributes are removed like dictionary keys
del tag['id']
Also, you seem to have a problem with the way Beautiful Soup outputs the modified HTML code.
If you want to pretty-print your document or use custom formatting, you can do that easily:
http://www.crummy.com/software/BeautifulSoup/bs4/doc/#output
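Applied to the markup from the question, the attribute removal could be sketched as below. It uses Python's built-in html.parser backend, which does not wrap fragments in extra html or body tags. Whether the serialized result matches the original file byte for byte in every case is not guaranteed, so treat this as an experiment to run on a sample of your files rather than a safe bulk rewrite.

```python
from bs4 import BeautifulSoup

html = '<div id="identifier">\n<a href="link" id="linkid">\n</a>\n</div>'

soup = BeautifulSoup(html, "html.parser")

# Locate the link by its id and drop only that attribute
link = soup.find("a", id="linkid")
del link["id"]

# str(soup) serializes the tree without pretty-printing
result = str(soup)
print(result)
```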
