Getting the child element of a particular div element using beautiful soup - python

I am trying to scrape table data from this link
http://bet.hkjc.com/racing/pages/odds_wp.aspx?date=30-01-2017&venue=ST&raceno=2&lang=en
Here is my code
from lxml import html
import webbrowser
import re
import xlwt
import requests
import bs4
content = requests.get("http://bet.hkjc.com/racing/pages/odds_wp.aspx?date=30-01-2017&venue=ST&raceno=1&lang=en").text # Get page content
soup = bs4.BeautifulSoup(content, 'lxml') # Parse page content
table = soup.find('div', {'id': 'detailWPTable'}) # Locate the div that wraps the table
rows = table.find_all('tr') # Find all row tags inside it
for row in rows:
    columns = row.find_all('td') # Find all data cells in each row
    print('\n')
    for column in columns:
        print(column.text.strip(), end=' ') # Output data in each column
It is not giving any output. Please help!

The table is generated by JavaScript, so requests will only return the initial HTML, without the rendered rows.
Use Selenium.
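For example, a minimal sketch with Selenium (assuming Chrome and a matching chromedriver are installed; the wait timeout may need tuning):
import bs4
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get("http://bet.hkjc.com/racing/pages/odds_wp.aspx?date=30-01-2017&venue=ST&raceno=1&lang=en")
# Wait until the JavaScript-generated table is actually in the DOM
WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.ID, 'detailWPTable')))
soup = bs4.BeautifulSoup(driver.page_source, 'lxml') # Parse the rendered page
driver.quit()

table = soup.find('div', {'id': 'detailWPTable'})
for row in table.find_all('tr'):
    for column in row.find_all('td'):
        print(column.text.strip(), end=' ')
    print()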

I'm looking at the last line of your code:
print(column.text.strip(), end=' ') # Output data in each column
Are you sure that should read column.text? Maybe you could try column.strings or column.get_text(), or even column.stripped_strings.
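For a quick feel for the difference, on a made-up cell:
from bs4 import BeautifulSoup

cell = BeautifulSoup('<td> 1 <b>Horse</b> </td>', 'html.parser').td
print(cell.text)                       # ' 1 Horse '
print(cell.get_text(' ', strip=True))  # '1 Horse'
print(list(cell.stripped_strings))     # ['1', 'Horse']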

I just wanted to mention that the id you are using is on the wrapping div, not on the child table element.
Maybe you could try something like:
wrapper = soup.find('div', {'id': 'detailWPTable'})
table_body = wrapper.table.tbody
rows = table_body.find_all('tr')
But thinking about it, the tr elements are also descendants of the wrapping div, so find_all should still find them.
Update: adding tbody
Update: sorry, I'm not allowed to comment yet :). Are you sure you have the correct document? Have you checked the whole soup to see that the tags are actually there?
And I guess all those lines could be written as:
rows = soup.find('div', {'id': 'detailWPTable'}).find('tbody').find_all('tr')
Update: Yeah, the wrapper div is empty. So it seems that you don't get what's being generated by JavaScript, like the other guy said. Maybe you should try Selenium as he suggested? Possibly PhantomJS as well.

You can try it with dryscrape like so:
import dryscrape
from bs4 import BeautifulSoup as BS

ses = dryscrape.Session()
ses.visit("http://bet.hkjc.com/racing/pages/odds_wp.aspx?date=30-01-2017&venue=ST&raceno=1&lang=en")
soup = BS(ses.body(), 'lxml') # Parse the rendered page content
table = soup.find('div', {'id': 'detailWPTable'}) # Locate the wrapping div
rows = table.find_all('tr') # Find all row tags in that div
for row in rows:
    columns = row.find_all('td') # Find all data cells in each row
    print('\n')
    for column in columns:
        print(column.text.strip())

Related

Python BeautifulSoup include blank lines

I'm scraping a website with python3 and BeautifulSoup and exporting to csv. The issue I am having is that some elements are blank, and when I print the page those elements go missing. I would prefer it still print even when the element is blank. Because of this, the rows in my csv file do not line up with the columns when an element is blank. I am sure that if I can get the print working as expected, I can fix the issue in my csv file.
Example html code
<tr><td>item1</td><td>server11</td><td>env</td><td>uptime</td></tr>
<tr><td>item2</td><td></td><td>env</td><td>uptime</td></tr>
As you can see, item2 has a td tag which is blank.
soup = BeautifulSoup(content, 'lxml')
for s in soup.findAll('tr'):
    print(s.get_text(","))
The output is
item1,server11,env,uptime
item2,env,uptime
However, I would like the output to look like this:
item1,server11,env,uptime
item2,,env,uptime
You can use str.join to join the texts from all the <td> tags.
For example:
from bs4 import BeautifulSoup
txt = '''
<tr><td>item1</td><td>server11</td><td>env</td><td>uptime</td></tr>
<tr><td>item2</td><td></td><td>env</td><td>uptime</td></tr>'''
soup = BeautifulSoup(txt, 'html.parser')
for tr in soup.select('tr'):
    print(','.join(td.get_text(strip=True) for td in tr.select('td')))
Prints:
item1,server11,env,uptime
item2,,env,uptime

Using BeautifulSoup to find a attribute called data-stats

I'm currently working on a web scraper that will allow me to pull stats for a football player. Usually this would be an easy task if I could just grab the divs; however, this website uses an attribute called data-stat and uses it like a class. This is an example of that.
<th scope="row" class="left " data-stat="year_id">2000</th>
If you would like to check the site for yourself here is the link.
https://www.pro-football-reference.com/players/B/BradTo00.htm
I've tried a few different methods. Either it won't work at all, or I will be able to start a for loop and start putting things into arrays; however, you will notice that not everything in the table is the same type.
Sorry for the formatting and the grammar.
Here is what I have so far. I'm sure it's not the best-looking code; it's mainly just code I've tried on my own and a few things mixed in from searching on Google. Ignore the random imports; I was trying different things.
# import libraries
import csv
from datetime import datetime
import requests
from bs4 import BeautifulSoup
import lxml.html as lh
import pandas as pd
# specify url
url = 'https://www.pro-football-reference.com/players/B/BradTo00.htm'
# request html
page = requests.get(url)
# Parse html using BeautifulSoup, you can use a different parser like lxml if present
soup = BeautifulSoup(page.content, 'lxml')
# find searches the given tag (div) with given class attribute and returns the first match it finds
headers = [c.get_text() for c in soup.find(class_ = 'table_container').find_all('td')[0:31]]
data = [[cell.get_text(strip=True) for cell in row.find_all('td')[0:32]]
        for row in soup.find_all("tr", class_=True)]
tags = soup.find(data ='pos')
#stats = tags.find_all('td')
print(tags)
You need to use the get method from BeautifulSoup to get the attributes by name
See: BeautifulSoup Get Attribute
Here is a snippet to get all the data you want from the table:
from bs4 import BeautifulSoup
import requests

url = "https://www.pro-football-reference.com/players/B/BradTo00.htm"
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')

# Get the table
table = soup.find(class_="table_outer_container")

# Get the head
thead = table.find('thead')
th_head = thead.find_all('th')
for thh in th_head:
    # Get the cell value
    print(thh.get_text())
    # Get the data-stat value
    print(thh.get('data-stat'))

# Get the body
tbody = table.find('tbody')
tr_body = tbody.find_all('tr')
for trb in tr_body:
    # Get the id
    print(trb.get('id'))
    # Get the th data
    th = trb.find('th')
    print(th.get_text())
    print(th.get('data-stat'))
    for td in trb.find_all('td'):
        # Get the cell value
        print(td.get_text())
        # Get the data-stat value
        print(td.get('data-stat'))

# Get the footer
tfoot = table.find('tfoot')
thf = tfoot.find('th')
# Get the cell value
print(thf.get_text())
# Get the data-stat value
print(thf.get('data-stat'))
for tdf in tfoot.find_all('td'):
    # Get the cell value
    print(tdf.get_text())
    # Get the data-stat value
    print(tdf.get('data-stat'))
You can of course save the data in a csv or even a json instead of printing it
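For example, a rough sketch of dumping the body rows to a CSV instead (reusing the tbody from the snippet above; the filename is just an example):
import csv

with open('stats.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    for trb in tbody.find_all('tr'):
        th = trb.find('th')
        row = [th.get_text() if th else '']  # row header cell, if present
        row += [td.get_text() for td in trb.find_all('td')]
        writer.writerow(row)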
It's not very clear what exactly you're trying to extract, but this might help you a little bit:
import requests
from bs4 import BeautifulSoup as bs

url = 'https://www.pro-football-reference.com/players/B/BradTo00.htm'
page = requests.get(url)
soup = bs(page.text, "html.parser")

# Extract all tables
tables = soup.find_all('table')

# Extract the data from each table
for table in tables:
    cols = table.find_all('td')
    for c in cols:
        print(c.text)
Hope this helps!

Wikipedia table scraping using python

I am trying to scrape tables from wikipedia. I wrote a table scraper that downloads a table and saves it as a pandas data frame.
This is the code
from bs4 import BeautifulSoup
import pandas as pd
import urllib2
headers = { 'User-Agent' : 'Mozilla/5.0' }
req = urllib2.Request('https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population', None, headers)
html = urllib2.urlopen(req).read()
soup = BeautifulSoup(html, 'lxml') # Parse the HTML as a string
print soup
# Create an object of the first object
table = soup.find("table", {"class":"wikitable sortable jquery-tablesorter"})
print table
rank=[]
country=[]
pop=[]
date=[]
per=[]
source=[]
for row in table.find_all('tr')[1:]:
    col = row.find_all('td')
    col1 = col[0].string.strip()
    rank.append(col1)
    col2 = col[1].string.strip()
    country.append(col2)
    col3 = col[2].string.strip()
    pop.append(col3)  # was appending col2, which duplicated the country column
    col4 = col[3].string.strip()
    date.append(col4)
    col5 = col[4].string.strip()
    per.append(col5)
    col6 = col[5].string.strip()
    source.append(col6)
columns={'Rank':rank,'Country':country,'Population':pop,'Date':date,'Percentage':per,'Source':source}
# Create a dataframe from the columns variable
df = pd.DataFrame(columns)
df
But it is not downloading the table. The problem is in this section
table = soup.find("table", {"class":"wikitable sortable jquery-tablesorter"})
print table
where the output is None.
As far as I can see, there is no such element on that page. The main table has "class":"wikitable sortable" but not jquery-tablesorter; that class is added by JavaScript after the page loads, so it is not in the HTML that urllib2 downloads.
Make sure you know what element you are trying to select and check if your program sees the same elements you see, then make your selector.
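For example, either of these should match the static HTML:
table = soup.find("table", class_="wikitable")       # matches any table that has the wikitable class
table = soup.select_one("table.wikitable.sortable")  # requires both classes, in any order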
The docs say you need to specify multiple classes like so:
soup.find("table", class_="wikitable sortable jquery-tablesorter")
Also, consider using requests instead of urllib2.
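With requests, the fetch part of the script would look something like this (the rest stays the same):
import requests

headers = {'User-Agent': 'Mozilla/5.0'}
html = requests.get('https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population',
                    headers=headers).text
soup = BeautifulSoup(html, 'lxml')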

Python xml - how to loop through <tbody> to get data

I have added a snippet of the HTML I wish to scrape.
I would like to go through each row (tbody) and scrape the relevant data using xml.
The XPath for each row can be found by the following:
//*[@id="re_"]/table/tbody
But I'm unsure how to set it up in Python to loop through each tbody. There is no set number of tbody rows, so it could be any number.
eg.
for each tbody:
...get data
below is the HTML page
http://www.racingpost.com/horses/result_home.sd?race_id=651402&r_date=2016-06-07&popup=yes#results_top_tabs=re_&results_bottom_tabs=ANALYSIS
Using lxml, you can pull the table directly using the class name and extract all the tbody tags with the XPath //table[@class="grid resultRaceGrid"]/tbody
from lxml import html

x = html.parse("http://www.racingpost.com/horses/result_home.sd?race_id=651402&r_date=2016-06-07&popup=yes#results_top_tabs=re_&results_bottom_tabs=ANALYSIS")
tbodys = x.xpath('//table[@class="grid resultRaceGrid"]/tbody')
# iterate over the list of tbody tags
for tbody in tbodys:
    # get all the rows from the tbody
    for row in tbody.xpath("./tr"):
        # extract the tds and do whatever you want.
        tds = row.xpath("./td")
        print(tds)
Obviously you can be more specific; the td tags have class names which you can use to filter, and some tr tags also have classes.
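For example, if the comment cells carry a class like commentText (check the real markup for the exact name), you could filter on it directly:
for tbody in tbodys:
    # keep only the cells whose class contains "commentText" (hypothetical name)
    for td in tbody.xpath('./tr/td[contains(@class, "commentText")]'):
        print(td.text_content().strip())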
I'm thinking you'd be interested in BeautifulSoup.
With your data, if you wanted to print all comment texts, it would be as simple as:
from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc, 'html.parser')
for tbody in soup.find_all('tbody'):
    # find() takes a class name via class_, not a CSS selector like '.commentText'
    print(tbody.find(class_='commentText').get_text())
You can do much more fancy stuff. You can read more here.

How to get the contents under a particular column in a table from Wikipedia using soup & python

I need to get the href links that the contents point to under a particular column from a table in wikipedia. The page is "http://en.wikipedia.org/wiki/List_of_Telugu_films_of_2015". On this page there are a few tables with class "wikitable". I need the links of the contents under the column Title for each row that they point to. I would like them to be copied onto an excel sheet.
I do not know the exact code for searching under a particular column, but I came this far and I am getting a "NoneType object is not callable" error. I am using bs4. I wanted to extract at least some part of the table so I could work toward narrowing down to the href links under the Title column, but I am ending up with this error. The code is as below:
from urllib.request import urlopen
from bs4 import BeautifulSoup
soup = BeautifulSoup(urlopen('http://en.wikipedia.org/wiki/List_of_Telugu_films_of_2015').read())
for row in soup('table', {'class': 'wikitable'})[1].tbody('tr'):
    tds = row('td')
    print(tds[0].string, tds[0].string)
A little guidance would be appreciated. Anyone know?
Figured out that the NoneType error might be related to the table filtering. The corrected code is below:
import urllib2
from bs4 import BeautifulSoup, SoupStrainer
content = urllib2.urlopen("http://en.wikipedia.org/wiki/List_of_Telugu_films_of_2015").read()
filter_tag = SoupStrainer("table", {"class":"wikitable"})
soup = BeautifulSoup(content, parse_only=filter_tag)
links = []
for sp in soup.find_all(align="center"):
    a_tag = sp('a')
    if a_tag:
        links.append(a_tag[0].get('href'))
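And since you want the links on an Excel sheet, a minimal sketch with xlwt (the filename and sheet name are just examples):
import xlwt

wb = xlwt.Workbook()
ws = wb.add_sheet('Title links')
for i, link in enumerate(links):
    ws.write(i, 0, link)  # row i, column 0
wb.save('telugu_films_2015.xls')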
