Extract table from HTML code with Beautiful Soup

I have the following HTML:
<table class="tbOdpis" width="100%" cellspacing="0">
<tbody>
<tr>
<td class="csEmptyLine" colspan="100" width="100%"></td>
</tr>
<tr>
<td class="csTTytul" colspan="100" width="100%">OZNACZENIE KSIĘGI WIECZYSTEJ</td>
</tr>
</tbody>
</table>
I tried to extract this table with the code below:
soup.findAll('table')[1]
Everything would be fine, except that I only received this:
<table cellspacing="0" class="tbOdpis" width="100%">
<td class="csEmptyLine" colspan="100" width="100%"></td>
</table>
Why has the second row disappeared?

Related

How to grab the number ending in "0" from the website?

I use BeautifulSoup to grab some text from the URL "https://nature.altmetric.com/details/114136890" and get this response:
# The table is called twitterGeographical_TableChoice
<table>
<tr>
<th>Country</th>
<th class="num">Count</th>
<th class="num percent">As %</th>
</tr>
<tr>
<td>Japan</td>
<td class="num">3</td>
<td class="num">12%</td>
</tr>
<tr>
<td>Poland</td>
<td class="num">3</td>
<td class="num">12%</td>
</tr>
<tr>
<td>Spain</td>
<td class="num">3</td>
<td class="num">12%</td>
</tr>
<tr>
<td>El Salvador</td>
<td class="num">2</td>
<td class="num">8%</td>
</tr>
<tr>
<td>Ecuador</td>
<td class="num">1</td>
<td class="num">4%</td>
</tr>
<tr>
<td>Mexico</td>
<td class="num">1</td>
<td class="num">4%</td>
</tr>
<tr>
<td>Chile</td>
<td class="num">1</td>
<td class="num">4%</td>
</tr>
<tr>
<td>India</td>
<td class="num">1</td>
<td class="num">4%</td>
</tr>
<tr class="meta">
<td>Unknown</td>
<td class="num">10</td>
<td class="num">40%</td>
</tr>
</table>
Then I want to get the numbers from it. I use a regular expression to get them.
My pattern is:
twitterGeographical_Table_Num_pattern = re.compile(r'<td class="num">(\d*%)</td>', re.S)
twitterGeographical_Table_Num = twitterGeographical_Table_Num_pattern.findall(twitterGeographical_TableChoice)
But I can only get 4% instead of 40%. I am puzzled. Thanks for your help!

I am not sure why you want to extract the numbers with the re module when BeautifulSoup already offers many ways to do this. Anyway, if you want to use a regex, you can use this pattern instead:
<td class=\"num\">((\d+)(%)?)</td>
Because the pattern contains capture groups, findall returns one tuple per match. The first item of each tuple is the full value, including the percent sign if there is one:
[x[0] for x in twitterGeographical_Table_Num]
Output
['10', '40%']
Side note: please consider giving the variables shorter and clearer names! :)
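A minimal sketch of the pattern above, run against only the last row of the table from the question. The names row_html, num_pattern, values and percentages are illustrative and not from the original code, and restricting the input to that one row is an assumption made to keep the example short.

```python
import re

# Last row of the table from the question (assumed input for this sketch).
row_html = '''<tr class="meta">
<td>Unknown</td>
<td class="num">10</td>
<td class="num">40%</td>
</tr>'''

# Group 1: full value, group 2: digits only, group 3: optional percent sign.
num_pattern = re.compile(r'<td class="num">((\d+)(%)?)</td>')

# With several groups, findall returns one tuple per match.
matches = num_pattern.findall(row_html)

# Full values, with or without a percent sign.
values = [m[0] for m in matches]
print(values)  # ['10', '40%']

# Only the values that actually carry a percent sign.
percentages = [m[0] for m in matches if m[2]]
print(percentages)  # ['40%']
```

Using \d+ instead of \d* also avoids matching a bare "%" with no digits in front of it.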

Python DataFrame to_html result is not displayed as a table

I am trying to send some values (server_data) to a basic web page and want to see them as a table.
I put my values into a DataFrame and converted it to HTML.
But when I display the page, I just see the HTML code, not a rendered table.
What am I missing?
Python code:
import pandas
from django.shortcuts import render

def vip_result(request):
    # (---)
    server_data = {"SERVER_IP": result1, "PORT": result2, "SERV.STATE": result3, "OPR. STATE": result4}
    df = pandas.DataFrame(server_data)
    df = df.to_html()  # render the DataFrame as an HTML string
    return render(request, 'vip_result.html', {"df": df})
HTML template (vip_result.html):
{{df}}
Result page:
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>SERVER IP</th>
<th>PORT</th>
<th>SERV.STATE</th>
<th>OPR. STATE</th>
</tr>
</thead>
<tbody>
<tr> <th>0</th> <td>10.6.87.17</td> <td>7777</td> <td>UP</td> <td>ENABLED</td> </tr>
<tr> <th>1</th> <td>10.6.87.18</td> <td>7777</td> <td>UP</td> <td>ENABLED</td> </tr>
<tr> <th>2</th> <td>10.6.87.21</td> <td>7777</td> <td>UP</td> <td>ENABLED</td> </tr>
<tr> <th>3</th> <td>10.6.87.21</td> <td>7780</td> <td>UP</td> <td>ENABLED</td> </tr>
<tr> <th>4</th> <td>10.6.87.23</td> <td>7781</td> <td>UP</td> <td>ENABLED</td> </tr>
<tr> <th>5</th> <td>10.6.87.23</td> <td>7783</td> <td>UP</td> <td>ENABLED</td> </tr>
</tbody>
</table>
Result page that I expect: a rendered table instead of the raw HTML.

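You need to tell Django to trust the content. By default, Django escapes HTML special characters in template variables, so the markup is shown as text.
{{ df|safe }}
This tells Django it can put the HTML in as HTML.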
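To see why the raw tags show up without the safe filter, here is a rough stand-in that uses Python's standard html.escape. It is not Django's own API and only illustrates the effect that escaping has on the markup; table_html is a shortened, made-up sample rather than the real to_html() output.

```python
from html import escape, unescape

# Shortened, hypothetical stand-in for the string returned by df.to_html().
table_html = '<table border="1" class="dataframe"><tr><td>UP</td></tr></table>'

# Escaping turns markup characters into entities, so a browser shows
# the tags as literal text instead of rendering a table.
escaped = escape(table_html)
print(escaped)

# Reversing the escaping gives back the original markup, which is
# roughly what "trusting" the content amounts to for the browser.
print(unescape(escaped) == table_html)
```

Only mark content as safe when it comes from a source you control, since it bypasses the escaping that protects against injected markup.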

Some <td>'s Cannot Be Found by find_next()

This is a question about scraping with BS4. I am scraping a website that has barely any IDs on the elements I need, so I am determined to use find_next, find_next_siblings, or other navigation methods of BS4.
I used find_next() to get some td values from my tables. It worked for some values, but for others it cannot find them.
Here's the HTML:
<table style="max-width: 350px;" border="0">
<tbody><tr>
<td style="max-width: 215px;">REF. NO.</td>
<td style="max-width: 12px;" align="center"> </td>
<td align="right">000124 </td>
</tr>
<tr>
<td>REF. NO.</td>
<td align="center"> </td>
<td align="right"> </td>
</tr>
<tr>
<td>MANU</td>
<td align="center"> </td>
<td align="right"></td>
</tr>
<tr>
<td>STREAK</td>
<td align="center"> </td>
<td align="right">1075</td>
</tr>
<tr>
<td>PACK</td>
<td align="center"> </td>
<td align="right">1</td>
</tr>
<tr>
<td colspan="3">ON STOCK. </td>
</tr>
.... and so on
So I used this code to get what I want:
div = soup.find('div', {'id': 'infodata'})
table_data = div.find_all('td')
for element in table_data:
    if "STREAK" in element.get_text():
        price = element.find_next('td').find_next('td').text
        print(price + "price")
    else:
        print('NOT FOUND!')
I copied and pasted the text from the HTML many times to make sure I didn't mistype anything, but it still always goes to "NOT FOUND!". If I try other labels, such as PACK, I can get their values.
By the way, I'm using two find_next() calls because the HTML has three <td> elements in every <tr>.
Why does this work for some words but not for others? Any help is appreciated. Thank you very much!

I would rewrite it like this:
trs = div.find_all('tr')
for tr in trs:
    tds = tr.select('td')
    # match the label in the first cell, then read the last cell of the same row
    if len(tds) > 1 and 'STREAK' in tds[0].get_text().strip():
        price = tds[-1].get_text().strip()
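The same row-wise idea can be sketched with only the standard library's html.parser, which may help when checking the logic without BeautifulSoup installed. RowCollector, value_for and sample are hypothetical names for this illustration, and sample is a trimmed copy of the rows from the question.

```python
from html.parser import HTMLParser


class RowCollector(HTMLParser):
    """Collect the text of every <td>, grouped by its <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row currently being parsed
        self._cell = None  # text fragments of the cell currently being parsed

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None and self._row is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def value_for(label, html):
    """Return the last cell of the first row whose first cell contains label."""
    parser = RowCollector()
    parser.feed(html)
    parser.close()
    for cells in parser.rows:
        if len(cells) > 1 and label in cells[0]:
            return cells[-1]
    return None


sample = """
<table>
<tr><td>STREAK</td><td align="center"> </td><td align="right">1075</td></tr>
<tr><td>PACK</td><td align="center"> </td><td align="right">1</td></tr>
<tr><td colspan="3">ON STOCK. </td></tr>
</table>
"""

print(value_for("STREAK", sample))  # 1075
print(value_for("PACK", sample))    # 1
```

Looking up the value inside the row that holds the label keeps the match from drifting into a neighbouring row, which chained find_next() calls can do when a row has fewer cells than expected.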

BeautifulSoup: How to extract data after specific html tag

I have the following HTML and I am trying to figure out how to tell BeautifulSoup to extract the <td> that comes after a certain HTML element. In this case I want to get the data in the <td> after <td>Color Digest</td>:
<tr>
<td> Color Digest </td>
<td> 2,36,156,38,25,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, </td>
</tr>
This is the structure of the entire HTML:
<html>
<head>
<body>
<div align="center">
<table cellspacing="0" cellpadding="0" style="clear:both; width:100%;margin:0px; font-size:1pt;">
<br>
<br>
<table>
<table>
<tbody>
<tr bgcolor="#AAAAAA">
<tr>
<tr>
<tr>
<tr>
<tr>
<tr>
<tr>
<tr>
<tr>
<tr>
<tr>
<tr>
<td> Color Digest </td>
<td> 2,36,156,38,25,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, </td>
</tr>
</tbody>
</table>

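It sounds like you need to iterate over the list of <td> elements and stop once you have found the label cell.
Example:
from bs4 import BeautifulSoup

soup = BeautifulSoup('<html><tr><td>X</td><td>Color Digest</td><td>THE DIGEST</td></tr></html>', 'html.parser')
for cell in soup.find_all('td'):
    # strip surrounding whitespace, since the cells in the real page are padded with spaces
    if cell.get_text(strip=True) == 'Color Digest':
        # the value is in the next <td> sibling
        print(cell.find_next_sibling('td').get_text(strip=True))
        break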

Selenium number of lines in a table?

I have this structure:
<table>
<tbody>
<tr id="1_2011_11_11_07_45_00" class="on">
</tr>
<tr id="1_2011_11_11_09_25_00">
</tr>
<tr id="1_2011_11_11_11_05_00">
</tr>
<tr id="1_2011_11_11_14_50_00">
</tr>
<tr id="1_2011_11_11_16_00_00">
</tr>
<tr id="1_2011_11_11_18_10_00">
</tr>
<tr id="1_2011_11_11_21_30_00">
</tr>
</tbody>
I would like to count the number of rows in the table. I am using Python for the script.
The XPath of the table is:
xpath=/html/body/form/div[3]/div/div/div[2]/div/div/table
Could anyone help me?

This can also be done via get_xpath_count, for example:
Number_of_row = $browser.get_xpath_count("/tbody/tr")
I have not tested the code above, but I think it will work.

s = """<table>
<tbody>
<tr id="1_2011_11_11_07_45_00" class="on">
</tr>
<tr id="1_2011_11_11_09_25_00">
</tr>
<tr id="1_2011_11_11_11_05_00">
</tr>
<tr id="1_2011_11_11_14_50_00">
</tr>
<tr id="1_2011_11_11_16_00_00">
</tr>
<tr id="1_2011_11_11_18_10_00">
</tr>
<tr id="1_2011_11_11_21_30_00">
</tr>
</tbody>"""
import re
# count opening <tr> tags; in a non-raw string, '\t' would be a tab character
len(re.findall(r'<tr\b', s))

XPath contains a count(<node-set expr>) function.
Simplifying your example, if your table were the only table in the HTML source, then the XPath expression
count(//table//tr)
would return the number 7.
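A minimal sketch of the same count in Python using only the standard library. xml.etree.ElementTree supports a limited XPath subset without the count() function, so the sketch takes the length of the matched list instead. The closing </table> tag is added here as an assumption, because ElementTree needs well-formed markup; table_xml and row_count are illustrative names.

```python
import xml.etree.ElementTree as ET

# The table from the question, with a closing </table> added so it parses as XML.
table_xml = """<table>
<tbody>
<tr id="1_2011_11_11_07_45_00" class="on"></tr>
<tr id="1_2011_11_11_09_25_00"></tr>
<tr id="1_2011_11_11_11_05_00"></tr>
<tr id="1_2011_11_11_14_50_00"></tr>
<tr id="1_2011_11_11_16_00_00"></tr>
<tr id="1_2011_11_11_18_10_00"></tr>
<tr id="1_2011_11_11_21_30_00"></tr>
</tbody>
</table>"""

root = ET.fromstring(table_xml)

# ".//tr" selects every <tr> descendant of the root <table> element.
row_count = len(root.findall(".//tr"))
print(row_count)  # 7
```

Real pages are often not well-formed XML, so for live HTML an HTML-aware parser or the browser driver's own element lookup is usually the more robust choice.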
