I have added a snippet of the HTML I wish to scrape.
I would like to go through each row (tbody) and scrape the relevant data using lxml.
The XPath for each row can be found with the following:
//*[@id="re_"]/table/tbody
but I'm unsure how to set it up in Python to loop through each tbody. There is no set number of tbody rows, so it could be any number.
e.g.
for each tbody:
    ...get data
Below is the HTML page:
http://www.racingpost.com/horses/result_home.sd?race_id=651402&r_date=2016-06-07&popup=yes#results_top_tabs=re_&results_bottom_tabs=ANALYSIS
Using lxml, you can pull the table directly using the class name and extract all the tbody tags with the XPath //table[@class="grid resultRaceGrid"]/tbody:
from lxml import html

x = html.parse("http://www.racingpost.com/horses/result_home.sd?race_id=651402&r_date=2016-06-07&popup=yes#results_top_tabs=re_&results_bottom_tabs=ANALYSIS")
tbodys = x.xpath('//table[@class="grid resultRaceGrid"]/tbody')
# iterate over the list of tbody tags
for tbody in tbodys:
    # get all the rows from the tbody
    for row in tbody.xpath("./tr"):
        # extract the tds and do whatever you want.
        tds = row.xpath("./td")
        print(tds)
Obviously you can be more specific: the td tags have class names you can use to extract data, and some tr tags have classes too.
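For example, a minimal sketch of that class-based extraction against an inline sample (the column class names here are invented for illustration, not taken from the real page):

```python
from lxml import html

# Illustrative markup only; the real page's class names will differ.
doc = html.fromstring("""
<table class="grid resultRaceGrid">
  <tbody>
    <tr><td class="horse">Alpha</td><td class="pos">1</td></tr>
  </tbody>
  <tbody>
    <tr><td class="horse">Beta</td><td class="pos">2</td></tr>
  </tbody>
</table>
""")

results = []
for tbody in doc.xpath('//table[@class="grid resultRaceGrid"]/tbody'):
    # pick individual cells out of each row by their class attribute
    horse = tbody.xpath('.//td[@class="horse"]/text()')[0]
    pos = tbody.xpath('.//td[@class="pos"]/text()')[0]
    results.append((horse, pos))

print(results)
```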
I'm thinking you'd be interested in BeautifulSoup.
With your data, if you wanted to print all comment texts, it would be as simple as:
from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc, 'html.parser')  # html_doc holds your page source
for tbody in soup.find_all('tbody'):
    print(tbody.find(class_='commentText').get_text())
Note that find() takes a class via the class_ keyword, not a CSS selector string like '.commentText'.
You can do much fancier things; see the BeautifulSoup documentation for more.
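For instance, one of those fancier routes is CSS selectors via select() (a sketch against made-up markup; the real page's structure may differ):

```python
from bs4 import BeautifulSoup

html_doc = """
<table>
  <tbody><tr><td class="commentText">Led all the way</td></tr></tbody>
  <tbody><tr><td class="commentText">Held up, stayed on</td></tr></tbody>
</table>
"""
soup = BeautifulSoup(html_doc, 'html.parser')
# CSS selector: every element with class commentText inside a tbody
comments = [el.get_text() for el in soup.select('tbody .commentText')]
print(comments)
```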
I have HTML table data formatted as a string, and I'd like to add it as a row. Let's say I have a row tag in BeautifulSoup:
<tr>
</tr>
I want to add the following data to the row which is formatted as a string (including the inner tags themselves)
<td>A\</td><td>A1<time>(3)</time>, A2<time>(4)</time>, A3<time>(8)</time></td>
Is there an easy way to do this through BeautifulSoup or otherwise? (I could convert my document to a string, but that would make it harder to find the tag I need to edit.) I'm not sure if I have to add those inner tags manually.
Try tag.append:
from bs4 import BeautifulSoup
html = "<tr></tr>"
my_string = r'<td>A\</td><td>A1<time>(3)</time>, A2<time>(4)</time>, A3<time>(8)</time></td>'
soup = BeautifulSoup(html, "html.parser")
soup.find("tr").append(BeautifulSoup(my_string, "html.parser"))
print(soup)
Prints:
<tr><td>A\</td><td>A1<time>(3)</time>, A2<time>(4)</time>, A3<time>(8)</time></td></tr>
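If you'd rather build the cells programmatically than parse a string, soup.new_tag works too (a small sketch, not part of the original answer):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<tr></tr>", "html.parser")
tr = soup.find("tr")

td = soup.new_tag("td")   # create a fresh <td> element
td.string = "B"           # set its text content
tr.append(td)             # attach it to the row

print(soup)  # -> <tr><td>B</td></tr>
```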
I am struggling with the syntax required to grab some hrefs in a td.
The table, tr and td elements don't have any classes or ids.
If I wanted to grab the anchor in this example, what would I need?
<tr>
    <td><a>...
Thanks
As per the docs, you first make a parse tree:
import BeautifulSoup
html = "<html><body><tr><td><a href='foo'/></td></tr></body></html>"
soup = BeautifulSoup.BeautifulSoup(html)
and then you search in it, for example for <a> tags whose immediate parent is a <td>:
for ana in soup.findAll('a'):
    if ana.parent.name == 'td':
        print(ana["href"])
Something like this?
from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup(html)
anchors = [td.find('a') for td in soup.findAll('td')]
That should find the first "a" inside each "td" in the html you provide. You can tweak td.find to be more specific or else use findAll if you have several links inside each td.
UPDATE: re Daniele's comment, if you want to make sure you don't have any None's in the list, then you could modify the list comprehension thus:
from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup(html)
anchors = [a for a in (td.find('a') for td in soup.findAll('td')) if a]
Which basically just adds a check to see if you have an actual element returned by td.find('a').
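With the newer bs4 package the same search collapses into a single CSS selector (a sketch against invented sample markup):

```python
from bs4 import BeautifulSoup

html = "<table><tr><td><a href='foo'>x</a></td><td>plain</td></tr></table>"
soup = BeautifulSoup(html, "html.parser")
# 'td a' selects every <a> that sits anywhere under a <td>
hrefs = [a.get("href") for a in soup.select("td a")]
print(hrefs)  # -> ['foo']
```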
I am trying to scrape table data from this link
http://bet.hkjc.com/racing/pages/odds_wp.aspx?date=30-01-2017&venue=ST&raceno=2&lang=en
Here is my code
from lxml import html
import webbrowser
import re
import xlwt
import requests
import bs4
content = requests.get("http://bet.hkjc.com/racing/pages/odds_wp.aspx?date=30-01-2017&venue=ST&raceno=1&lang=en").text # Get page content
soup = bs4.BeautifulSoup(content, 'lxml') # Parse page content
table = soup.find('div', {'id': 'detailWPTable'}) # Locate that table tag
rows = table.find_all('tr') # Find all row tags in that table
for row in rows:
    columns = row.find_all('td')  # Find all data cells in this row
    print('\n')
    for column in columns:
        print(column.text.strip(), end=' ')  # Output data in each column
It is not giving any output. Please help!
The table is generated by JavaScript; requests only returns the static HTML source, which does not contain it.
Use Selenium.
I'm looking at the last line of your code:
print (column.text.strip(),end=' ') # Output data in each column
Are you sure that should read column.text? Maybe you could try column.strings or column.get_text(), or even column.stripped_strings.
I just wanted to mention that the id you are using is on the wrapping div, not on the child table element.
Maybe you could try something like:
wrapper = soup.find('div', {'id': 'detailWPTable'})
table_body = wrapper.table.tbody
rows = table_body.find_all('tr')
But thinking about it, the tr elements are also descendants of the wrapping div, so find_all should still find them.
Update: adding tbody
Update: sorry, I'm not allowed to comment yet :). Are you sure you have the correct document? Have you checked the whole soup to see that the tags are actually there?
And I guess all those lines could be written as:
rows = soup.find('div', {'id': 'detailWPTable'}).find('tbody').find_all('tr')
Update: Yeah, the wrapper div is empty, so it seems you don't get what's being generated by JavaScript, like the other guy said. Maybe you should try Selenium as he suggested? Possibly PhantomJS as well.
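One quick sanity check along those lines (a sketch that simulates what requests returns for this page, rather than fetching it): parse the static markup and confirm the wrapper div really is empty before blaming the parsing code.

```python
from bs4 import BeautifulSoup

# Simulated static source: the wrapper div exists, but JavaScript
# hasn't filled it in yet, which is what requests sees.
static_html = '<html><body><div id="detailWPTable"></div></body></html>'
soup = BeautifulSoup(static_html, 'html.parser')

wrapper = soup.find('div', {'id': 'detailWPTable'})
print(wrapper is not None)      # the div itself is in the source
print(wrapper.find_all('tr'))   # but it contains no rows: []
```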
You can try it with dryscrape like so:
import dryscrape
from bs4 import BeautifulSoup as BS

ses = dryscrape.Session()
ses.visit("http://bet.hkjc.com/racing/pages/odds_wp.aspx?date=30-01-2017&venue=ST&raceno=1&lang=en")
soup = BS(ses.body(), 'lxml')  # Parse the rendered page content
table = soup.find('div', {'id': 'detailWPTable'})  # Locate the table's wrapper div
rows = table.find_all('tr')  # Find all row tags in that table
for row in rows:
    columns = row.find_all('td')  # Find all data cells in this row
    print('\n')
    for column in columns:
        print(column.text.strip())
I am trying to parse several div blocks using Beautiful Soup using some html from a website. However, I cannot work out which function should be used to select these div blocks. I have tried the following:
import urllib2
from bs4 import BeautifulSoup

def getData():
    html = urllib2.urlopen("http://www.racingpost.com/horses2/results/home.sd?r_date=2013-09-22", timeout=10).read().decode('UTF-8')
    soup = BeautifulSoup(html)
    print(soup.title)
    print(soup.find_all('<div class="crBlock ">'))

getData()
getData()
I want to be able to select everything between <div class="crBlock "> and its correct end </div>. (Obviously there are other div tags but I want to select the block all the way down to the one that represents the end of this section of html.)
The correct use would be:
soup.find_all('div', class_="crBlock")
(BeautifulSoup matches individual class names, so the trailing space in the HTML's class attribute doesn't matter and shouldn't be part of the filter.) By default, Beautiful Soup returns the entire tag, including its contents. You can then do whatever you want with it if you store it in a variable. If you are only looking for one div, you can also use find() instead. For instance:
div = soup.find('div', class_="crBlock")
print(div.find_all(text='foobar'))
Check out the documentation page for more info on all the filters you can use.
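A self-contained sketch of the behavior (inline markup invented for illustration):

```python
from bs4 import BeautifulSoup

html = ('<div class="crBlock "><p>inside the block</p></div>'
        '<div class="other"><p>elsewhere</p></div>')
soup = BeautifulSoup(html, 'html.parser')

# find_all returns every matching tag together with its full contents
blocks = soup.find_all('div', class_='crBlock')
print(len(blocks))            # -> 1
print(blocks[0].get_text())   # -> inside the block
```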
I'm using Beautifulsoup to parse a website
request = urllib2.Request(url)
response = urllib2.urlopen(request)
soup = BeautifulSoup.BeautifulSoup(response)
I am using it to traverse a table. The problem I am running into is that BS is adding an extra end tag for the table into the HTML, one that doesn't exist in the source, which I verified with print soup.prettify(). So one of the td tags is getting left out of the table and I can't select it.
How about searching directly for each tag instead of trying to traverse into the table?
for td in soup.findAll("td"):
    ...
It's not unusual to find a tbody tag nested within a table automatically when it's not in the source. Either you can code for it or just jump straight to the tr or td tags.
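Jumping straight to the td tags is indeed robust either way; whether or not a tbody is present, the cells are still found (a small sketch with made-up markup):

```python
from bs4 import BeautifulSoup

# One source with an explicit tbody, one without.
with_tbody = "<table><tbody><tr><td>a</td></tr></tbody></table>"
without_tbody = "<table><tr><td>a</td></tr></table>"

for src in (with_tbody, without_tbody):
    soup = BeautifulSoup(src, "html.parser")
    # find_all("td") works regardless of the table's internal structure
    cells = [td.get_text() for td in soup.find_all("td")]
    print(cells)  # -> ['a'] either way
```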