Extract table from HTML code with Beautiful Soup - python
I've got html as follows:
<table class="tbOdpis" width="100%" cellspacing="0">
<tbody>
<tr>
<td class="csEmptyLine" colspan="100" width="100%"></td>
</tr>
<tr>
<td class="csTTytul" colspan="100" width="100%">OZNACZENIE KSIĘGI WIECZYSTEJ</td>
</tr>
</tbody>
</table>
I tried to extract this table with the code below:
soup.findAll('table')[1]
Everything would be fine, except that I only get this:
<table cellspacing="0" class="tbOdpis" width="100%">
<td class="csEmptyLine" colspan="100" width="100%"></td>
</table>
Why has the second row disappeared?
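A minimal sketch for checking this, assuming the markup is exactly what is posted above and using the built-in html.parser backend: with this input both rows survive parsing. If a row is still missing on the real page, one possibility (an assumption, since the full page is not shown) is that the surrounding HTML is malformed and the chosen parser repairs it differently; Beautiful Soup's parsers are known to build different trees from invalid markup, so trying another backend such as lxml or html5lib may be worth a test.

```python
from bs4 import BeautifulSoup

# The table exactly as posted in the question.
html = """
<table class="tbOdpis" width="100%" cellspacing="0">
<tbody>
<tr>
<td class="csEmptyLine" colspan="100" width="100%"></td>
</tr>
<tr>
<td class="csTTytul" colspan="100" width="100%">OZNACZENIE KSIĘGI WIECZYSTEJ</td>
</tr>
</tbody>
</table>
"""

# html.parser ships with Python; other backends are optional installs.
soup = BeautifulSoup(html, "html.parser")

# Select the table by its class instead of by position in the page.
table = soup.find("table", class_="tbOdpis")
rows = table.find_all("tr")

# Collect the stripped text of every cell, row by row.
cells = [td.get_text(strip=True) for row in rows for td in row.find_all("td")]
print(len(rows), cells)
```

Selecting by class rather than by index ([1]) also avoids picking up a different table if the page layout changes.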
Related
How to grab a number ending in "0" from the website?
I use BeautifulSoup to grab some text from the URL "https://nature.altmetric.com/details/114136890" and get this response:
# The table is called twitterGeographical_TableChoice
<table>
<tr> <th>Country</th> <th class="num">Count</th> <th class="num percent">As %</th> </tr>
<tr> <td>Japan</td> <td class="num">3</td> <td class="num">12%</td> </tr>
<tr> <td>Poland</td> <td class="num">3</td> <td class="num">12%</td> </tr>
<tr> <td>Spain</td> <td class="num">3</td> <td class="num">12%</td> </tr>
<tr> <td>El Salvador</td> <td class="num">2</td> <td class="num">8%</td> </tr>
<tr> <td>Ecuador</td> <td class="num">1</td> <td class="num">4%</td> </tr>
<tr> <td>Mexico</td> <td class="num">1</td> <td class="num">4%</td> </tr>
<tr> <td>Chile</td> <td class="num">1</td> <td class="num">4%</td> </tr>
<tr> <td>India</td> <td class="num">1</td> <td class="num">4%</td> </tr>
<tr class="meta"> <td>Unknown</td> <td class="num">10</td> <td class="num">40%</td> </tr>
</table>
Then I want to get the numbers from it, so I use a regular expression. My pattern is:
twitterGeographical_Table_Num_pattern = re.compile('<td class=\"num\">(\d*%)</td>',re.S)
twitterGeographical_Table_Num = twitterGeographical_Table_Num_pattern.findall(twitterGeographical_TableChoice)
But I can only get 4% instead of 40%. I am puzzled. Thanks for your help!
I am not sure why you want to extract the numbers with the regex module when BeautifulSoup already offers many ways to do this. Anyway, if you prefer regex, you can use this pattern instead:
<td class=\"num\">((\d+)(%)?)</td>
Because the pattern has several groups, findall returns a tuple per match. You can then get the numbers (with the percent sign where present) using the code below:
[x[0] for x in twitterGeographical_Table_Num]
Output (for the last table row):
['10', '40%']
Side note: please consider giving your variables shorter, clearer names! :)
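The regex approach can be sketched end to end as follows. This is only an illustration under stated assumptions: table_html is a shortened copy of the table from the question (three of its rows), and the variable names are made up for the example.

```python
import re

# Shortened copy of the table from the question (three rows only).
table_html = (
    '<table> <tr> <th>Country</th> <th class="num">Count</th> '
    '<th class="num percent">As %</th> </tr> '
    '<tr> <td>Japan</td> <td class="num">3</td> <td class="num">12%</td> </tr> '
    '<tr> <td>El Salvador</td> <td class="num">2</td> <td class="num">8%</td> </tr> '
    '<tr class="meta"> <td>Unknown</td> <td class="num">10</td> '
    '<td class="num">40%</td> </tr> </table>'
)

# With several groups, findall returns one tuple per match;
# the first element is the outer group (digits plus optional "%").
pattern = re.compile(r'<td class="num">((\d+)(%)?)</td>')
numbers = [match[0] for match in pattern.findall(table_html)]
print(numbers)

# To keep only the percentages, require the "%" sign.
percents = re.findall(r'<td class="num">(\d+%)</td>', table_html)
print(percents)
```

Using \d+ instead of \d* means a cell must contain at least one digit, so a bare "%" is never captured.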
python df to html result is not table format
I am trying to send some values (server_data) to a basic web page and want to see them as a table. I put my values into a DataFrame and converted it to HTML. But when I display the table, I just see the HTML code, not a rendered table. What am I missing?
Python code:
def vip_result(request):
    (---)
    server_data = {"SERVER_IP": result1, "PORT": result2, "SERV.STATE": result3, "OPR. STATE": result4}
    df = pandas.DataFrame(server_data)
    df = df.to_html()
    return render(request, 'vip_result.html', {"df": df})
HTML template (vip_result.html):
{{df}}
Result page:
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;"> <th></th> <th>SERVER IP</th> <th>PORT</th> <th>SERV.STATE</th> <th>OPR. STATE</th> </tr>
</thead>
<tbody>
<tr> <th>0</th> <td>10.6.87.17</td> <td>7777</td> <td>UP</td> <td>ENABLED</td> </tr>
<tr> <th>1</th> <td>10.6.87.18</td> <td>7777</td> <td>UP</td> <td>ENABLED</td> </tr>
<tr> <th>2</th> <td>10.6.87.21</td> <td>7777</td> <td>UP</td> <td>ENABLED</td> </tr>
<tr> <th>3</th> <td>10.6.87.21</td> <td>7780</td> <td>UP</td> <td>ENABLED</td> </tr>
<tr> <th>4</th> <td>10.6.87.23</td> <td>7781</td> <td>UP</td> <td>ENABLED</td> </tr>
<tr> <th>5</th> <td>10.6.87.23</td> <td>7783</td> <td>UP</td> <td>ENABLED</td> </tr>
</tbody>
</table>
What I expect instead is the rendered table.
You need to tell Django to trust the content:
{{ df|safe }}
This tells Django that it can insert the HTML as HTML instead of escaping it.
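To see why the raw markup shows up on the page, here is a rough illustration that uses only the standard library rather than Django itself. Django auto-escapes template variables by default, and html.escape does approximately the same thing; the one-row table string is a made-up sample, not the asker's data.

```python
import html

# Made-up one-row sample of what DataFrame.to_html() produces.
table_html = '<table border="1" class="dataframe"><tr><td>10.6.87.17</td></tr></table>'

# Escaping turns markup characters into entities, so the browser
# displays the tags as text instead of rendering a table.
escaped = html.escape(table_html)
print(escaped)

# Unescaping recovers the original markup unchanged.
print(html.unescape(escaped) == table_html)
```

Marking a value as safe skips this escaping step, so it should only be used for HTML that you generate yourself and trust.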
Some <td>'s Cannot Be Found by find_next()
This is a question about scraping with BS4. I am scraping a website that has barely any IDs on the elements I need, so I am relying on find_next, find_next_siblings and other BS4 navigation methods. I used find_next() to get some td values from my tables. It works for some values, but for others it finds nothing. Here's the HTML:
<table style="max-width: 350px;" border="0">
<tbody><tr> <td style="max-width: 215px;">REF. NO.</td> <td style="max-width: 12px;" align="center"> </td> <td align="right">000124 </td> </tr>
<tr> <td>REF. NO.</td> <td align="center"> </td> <td align="right"> </td> </tr>
<tr> <td>MANU</td> <td align="center"> </td> <td align="right"></td> </tr>
<tr> <td>STREAK</td> <td align="center"> </td> <td align="right">1075</td> </tr>
<tr> <td>PACK</td> <td align="center"> </td> <td align="right">1</td> </tr>
<tr> <td colspan="3">ON STOCK. </td> </tr>
.... and so on
So I used this code to get what I want:
div = soup.find('div', {'id': 'infodata'})
table_data = div.find_all('td')
for element in table_data:
    if "STREAK" in element.get_text():
        price = element.find_next('td').find_next('td').text
        print(price + "price")
    else:
        print('NOT FOUND!')
I copied and pasted text from the HTML many times to make sure I didn't mistype anything, but it still always prints NOT FOUND. If I try other row labels, such as PACK, I can get them. By the way, I'm using two find_next() calls because the HTML has three td's in every <tr>. Why does this work for some words but not for others? Any help is appreciated. Thank you very much!
I would rewrite it like this:
trs = div.find_all('tr')
for tr in trs:
    tds = tr.select('td')
    if len(tds) > 1 and 'STREAK' in tds[0].get_text().strip():
        price = tds[-1].get_text().strip()
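The row-by-row idea can be wrapped in a small helper. This is a sketch under assumptions: value_for is a hypothetical name, the markup is a shortened copy of the table from the question, and the wrapping div with id "infodata" is inferred from the asker's code.

```python
from bs4 import BeautifulSoup

# Shortened copy of the table from the question; the div wrapper
# is assumed from the asker's soup.find('div', {'id': 'infodata'}).
html = """
<div id="infodata">
<table style="max-width: 350px;" border="0">
<tbody>
<tr><td>MANU</td><td align="center"> </td><td align="right"></td></tr>
<tr><td>STREAK</td><td align="center"> </td><td align="right">1075</td></tr>
<tr><td>PACK</td><td align="center"> </td><td align="right">1</td></tr>
<tr><td colspan="3">ON STOCK. </td></tr>
</tbody>
</table>
</div>
"""


def value_for(container, label):
    """Return the last cell's text of the row whose first cell equals label."""
    for tr in container.find_all("tr"):
        tds = tr.find_all("td")
        # Skip single-cell rows such as the "ON STOCK." line.
        if len(tds) > 1 and tds[0].get_text(strip=True) == label:
            return tds[-1].get_text(strip=True)
    return None  # label not present in any row


soup = BeautifulSoup(html, "html.parser")
div = soup.find("div", {"id": "infodata"})
print(value_for(div, "STREAK"))
print(value_for(div, "PACK"))
print(value_for(div, "MISSING"))
```

Working per row keeps the label and its value tied together, and returning None once at the end avoids printing a "not found" message for every non-matching cell.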
BeautifulSoup: How to extract data after specific html tag
I have following html and I am trying to figure out how exactly I can tell BeautifulSoup to extract td after certain html element. In this case I want to get data in <td> after <td>Color Digest</td> <tr> <td> Color Digest </td> <td> 2,36,156,38,25,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, </td> </tr> This is the entire HTML <html> <head> <body> <div align="center"> <table cellspacing="0" cellpadding="0" style="clear:both; width:100%;margin:0px; font-size:1pt;"> <br> <br> <table> <table> <tbody> <tr bgcolor="#AAAAAA"> <tr> <tr> <tr> <tr> <tr> <tr> <tr> <tr> <tr> <tr> <tr> <tr> <td> Color Digest </td> <td> 2,36,156,38,25,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, </td> </tr> </tbody> </table>
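Sounds like you need to iterate over a list of <td> elements and stop once you've found your data. Example:
from bs4 import BeautifulSoup
# html.parser keeps the bare <tr> in place, so soup.html.tr resolves
soup = BeautifulSoup('<html><tr><td>X</td><td>Color Digest</td><td>THE DIGEST</td></tr></html>', 'html.parser')
for cell in soup.html.tr.find_all('td'):
    # strip() copes with the whitespace around the label in the real page
    if cell.get_text().strip() == 'Color Digest':
        print(cell.find_next_sibling('td').get_text().strip())
        break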
Selenium number of lines in a table?
I have this structure:
<table>
<tbody>
<tr id="1_2011_11_11_07_45_00" class="on"> </tr>
<tr id="1_2011_11_11_09_25_00"> </tr>
<tr id="1_2011_11_11_11_05_00"> </tr>
<tr id="1_2011_11_11_14_50_00"> </tr>
<tr id="1_2011_11_11_16_00_00"> </tr>
<tr id="1_2011_11_11_18_10_00"> </tr>
<tr id="1_2011_11_11_21_30_00"> </tr>
</tbody>
and I would like to count the number of rows in the table. I am using Python for the script. The xpath of the table is:
xpath=/html/body/form/div[3]/div/div/div[2]/div/div/table
Could anyone help me?
This can also be done via get_xpath_count. For example:
Number_of_row = $browser.get_xpath_count("/tbody/tr")
I have not tested the code above, but I think it will work.
import re
s = """<table>
<tbody>
<tr id="1_2011_11_11_07_45_00" class="on"> </tr>
<tr id="1_2011_11_11_09_25_00"> </tr>
<tr id="1_2011_11_11_11_05_00"> </tr>
<tr id="1_2011_11_11_14_50_00"> </tr>
<tr id="1_2011_11_11_16_00_00"> </tr>
<tr id="1_2011_11_11_18_10_00"> </tr>
<tr id="1_2011_11_11_21_30_00"> </tr>
</tbody>"""
# count the opening <tr tags
len(re.findall('<tr', s))
XPath has a count(<node-set expr>) function. Simplifying your example, if your table were the only table in the HTML source, then the XPath expression count(//table//tr) would return the number 7.
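The same count can be checked offline with the standard library. This is a sketch under assumptions: the snippet is the table from the question with a closing </table> added so that it parses as XML, and since xml.etree.ElementTree supports only a subset of XPath (no count() function), len() is applied to the matched rows instead.

```python
import xml.etree.ElementTree as ET

# Table from the question; the closing </table> is added here
# (an assumption) so the snippet is well-formed XML.
snippet = """<table>
<tbody>
<tr id="1_2011_11_11_07_45_00" class="on"> </tr>
<tr id="1_2011_11_11_09_25_00"> </tr>
<tr id="1_2011_11_11_11_05_00"> </tr>
<tr id="1_2011_11_11_14_50_00"> </tr>
<tr id="1_2011_11_11_16_00_00"> </tr>
<tr id="1_2011_11_11_18_10_00"> </tr>
<tr id="1_2011_11_11_21_30_00"> </tr>
</tbody>
</table>"""

root = ET.fromstring(snippet)

# ".//tr" matches every <tr> below the root <table> element.
row_count = len(root.findall(".//tr"))
print(row_count)
```

This only works for well-formed markup; a real page would normally be queried through the browser driver or an HTML parser instead.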