I am trying to scrape only the total score for a specified team. I have written the following:
import urllib.request
import re
from bs4 import BeautifulSoup
#url1 = "http://scores.nbcsports.com/nhl/scoreboard.asp"
## This works, however is using a set day for testing, will need url changed to url1 for current day scoreboard
url = "http://scores.nbcsports.com/nhl/scoreboard.asp?day=20141202"
page = urllib.request.urlopen(url)
soup = BeautifulSoup(page)
allrows = soup.findAll('td')
userows = [t for t in allrows if t.findAll(text=re.compile('Vancouver'))]
print(userows)
This returns:
[<td><table cellspacing="0"><tr class="shsTableTtlRow"><td class="shsNamD" colspan="1">Final</td>
<td class="shsTotD">1</td>
<td class="shsTotD">2</td>
<td class="shsTotD">3</td>
<td class="shsTotD">Tot</td>
</tr>
<tr>
<td class="shsNamD" nowrap=""><span class="shsLogo"><span class="shsNHLteam22sm_trans"></span></span>Vancouver</td>
<td class="shsTotD">1</td>
<td class="shsTotD">2</td>
<td class="shsTotD">1</td>
<td class="shsTotD">4</td>
</tr>
<tr>
<td class="shsNamD" nowrap=""><span class="shsLogo"><span class="shsNHLteam23sm_trans"></span></span>Washington</td>
<td class="shsTotD">0</td>
<td class="shsTotD">2</td>
<td class="shsTotD">1</td>
<td class="shsTotD">3</td>
</tr>
</table>
</td>, <td class="shsNamD" nowrap=""><span class="shsLogo"><span class="shsNHLteam22sm_trans"></span></span>Vancouver</td>]
What I can't seem to get to is the 4 in <td class="shsTotD">4</td> from the middle block. If it is only possible to get the 1 2 1 4 I could compare the values and always pick the largest, but I can't even seem to get that far. Thanks in advance.
Find the tag containing Vancouver and get the next td tags by using find_next_siblings():
vancouver = soup.find('a', text='Vancouver')
for td in vancouver.parent.find_next_siblings('td', class_='shsTotD'):
    print(td.text)
Prints:
1
2
1
4
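If only the total is needed, one option is to take the last of those sibling cells. This is a minimal sketch, assuming the row markup matches the output printed in the question (team name as a bare text node inside the td, total in the final shsTotD cell). The inline `html` string and the names `name_node`, `name_cell`, `score_cells` and `total` are made up for illustration:

```python
import re
from bs4 import BeautifulSoup

# Trimmed copy of the scoreboard markup printed in the question
# (assumption: the live page keeps this shape).
html = """
<table cellspacing="0">
<tr>
<td class="shsNamD" nowrap=""><span class="shsLogo"><span class="shsNHLteam22sm_trans"></span></span>Vancouver</td>
<td class="shsTotD">1</td>
<td class="shsTotD">2</td>
<td class="shsTotD">1</td>
<td class="shsTotD">4</td>
</tr>
<tr>
<td class="shsNamD" nowrap=""><span class="shsLogo"><span class="shsNHLteam23sm_trans"></span></span>Washington</td>
<td class="shsTotD">0</td>
<td class="shsTotD">2</td>
<td class="shsTotD">1</td>
<td class="shsTotD">3</td>
</tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# Locate the text node holding the team name, then climb to its <td>.
name_node = soup.find(string=re.compile("Vancouver"))
name_cell = name_node.find_parent("td")

# The period scores and the total are the following sibling cells;
# the total is the last one in the row.
score_cells = name_cell.find_next_siblings("td", class_="shsTotD")
total = int(score_cells[-1].get_text(strip=True))
print(total)
```

Taking the last cell avoids comparing the values and picking the largest, which would give the wrong answer whenever a single period score equals the total.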
How to find all tags that include tags with certain class?
The data is:
<tr>
<td class="TDo1" width=17%>Tournament</td>
<td class="TDo2" width=8%>Date</td>
<td class="TDo2" width=6%>Pts.</td>
<td class="TDo2" width=34%>Pos. Player (team)</td>
<td class="TDo5" width=35%>Pos. Opponent (team)</td>
</tr>
<tr>
<td class=TDq1>GpWl(op) 4.01/02</td>
<td class=TDq2>17.02.02</td>
<td class=TDq3>34/75</td>
<td class=TDq5>39. John Deep</td>
<td class=TDq9>68. Mark Deep</td>
</tr>
<tr>
<td class=TDp1>GpWl(op) 4.01/02</td>
<td class=TDp2>17.02.02</td>
<td class=TDp3>34/75</td>
<td class=TDp6>39. John Deep</td>
<td class=TDp8>7. Darius Star</td>
</tr>
I am trying
for mtable in bs.find_all('tr', text=re.compile(r'class=TD?3')):
    print(mtable)
but this returns zero results.
I suppose you want to find all <tr> that contain any tag with class TD<any character>3:
import re
from bs4 import BeautifulSoup
# `html` contains your html from the question
soup = BeautifulSoup(html, "html.parser")
pat = re.compile(r"TD.3")
for tr in soup.find_all(
    lambda tag: tag.name == "tr"
    and tag.find(class_=lambda cl: cl and pat.match(cl))
):
    print(tr)
Prints:
<tr>
<td class="TDq1">GpWl(op) 4.01/02</td>
<td class="TDq2">17.02.02</td>
<td class="TDq3">34/75</td>
<td class="TDq5">39. John Deep</td>
<td class="TDq9">68. Mark Deep</td>
</tr>
<tr>
<td class="TDp1">GpWl(op) 4.01/02</td>
<td class="TDp2">17.02.02</td>
<td class="TDp3">34/75</td>
<td class="TDp6">39. John Deep</td>
<td class="TDp8">7. Darius Star</td>
</tr>
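As for why the attempt in the question finds nothing: the text= argument (spelled string= in current Beautiful Soup releases) is compared against a tag's text content, never against its attributes, so a pattern like class=TD?3 has nothing to match. A small sketch of the difference, using a shortened two-row copy of the table; the names `by_text`, `cells` and `rows` are only for illustration:

```python
import re
from bs4 import BeautifulSoup

# Shortened copy of the table from the question.
html = """<table>
<tr><td class=TDo1>Tournament</td><td class=TDo2>Date</td></tr>
<tr><td class=TDq1>GpWl(op) 4.01/02</td><td class=TDq3>34/75</td></tr>
</table>"""

soup = BeautifulSoup(html, "html.parser")

# string= is tested against the text inside the tag, so a pattern that
# describes an attribute matches nothing.
by_text = soup.find_all("tr", string=re.compile(r"class=TD?3"))
print(by_text)

# class_ is tested against the class names themselves.
cells = soup.find_all("td", class_=re.compile(r"^TD.3$"))
rows = [td.find_parent("tr") for td in cells]
print(rows)
```

Note also that in a regular expression `D?` means "an optional D", not "any one character"; the dot in `TD.3` is what stands for a single arbitrary character.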
You need to match on the td tags rather than the tr tags. Like this:
In [1]: bs.find_all('td', {"class": re.compile(r'TD\w\d')})
Out[1]:
[<td class="TDo1" width="17%">Tournament</td>,
<td class="TDo2" width="8%">Date</td>,
<td class="TDo2" width="6%">Pts.</td>,
<td class="TDo2" width="34%">Pos. Player (team)</td>,
<td class="TDo5" width="35%">Pos. Opponent (team)</td>,
<td class="TDq1">GpWl(op) 4.01/02</td>,
<td class="TDq2">17.02.02</td>,
<td class="TDq3">34/75</td>,
<td class="TDq5">39. John Deep</td>,
<td class="TDq9">68. Mark Deep</td>,
<td class="TDp1">GpWl(op) 4.01/02</td>,
<td class="TDp2">17.02.02</td>,
<td class="TDp3">34/75</td>,
<td class="TDp6">39. John Deep</td>,
<td class="TDp8">7. Darius Star</td>]
This may help you:
from bs4 import BeautifulSoup
import re
t = 'your page source'
pat = re.compile(r'class=TD.3')
classes = re.findall(pat, t)
classes = [j[6:] for j in classes]  # drop the leading 'class='
soup = BeautifulSoup(t, 'html.parser')
result = list()
for i in classes:
    item = soup.find_all(attrs={"class": i})
    result.extend(item)
for i in result:
    print(i.parent)
You could use CSS selectors to get the tags whose class looks like "TD...3" and then take their parent tags:
mtables = [s.parent for s in bs.select('*[class^="TD"][class$="3"]')]
# if you want tr only:
# mtables = [s.parent for s in bs.select('tr *[class^="TD"][class$="3"]')]
mtables = list(set(mtables)) # no duplicates
However, this will not work if a tag has multiple classes (unless the first one starts with "TD" and the last ends in "3"), and you can't limit the characters in between.
You could use find with lambda twice, like this:
tagPat = '^TD.3$'
# tagPat = '^TD.*3$' # if there might be more than one character between TD and 3
mtables = bs.find_all(
    lambda p:
        p.name == 'tr' and # remove if you want all tags and not just tr
        p.find(
            lambda t: t.get('class') is not None and
            len([c for c in t.get('class') if re.search(tagPat, c)]) > 0
            , recursive=False # prevent getting <tr><tr><td class="TDo3">someval</td></tr></tr>
        ))
If you don't want to use lambda, you can replace the select in the first method with regex and find:
tagPat = '^TD.3$' # '^TD.*3$' #
mtables = [
    s.parent for s in bs.find_all(class_ = re.compile(tagPat))
    if s.parent.name == 'tr' # remove if you want all tags and not just tr
]
mtables = list(set(mtables)) # no duplicates
For the HTML in your question, all 3 methods lead to the same data. You can print it with
for mtable in mtables: print('---\n', mtable, '\n---')
and get the output
---
<tr>
<td class="TDq1">GpWl(op) 4.01/02</td>
<td class="TDq2">17.02.02</td>
<td class="TDq3">34/75</td>
<td class="TDq5">39. John Deep</td>
<td class="TDq9">68. Mark Deep</td>
</tr>
---
---
<tr>
<td class="TDp1">GpWl(op) 4.01/02</td>
<td class="TDp2">17.02.02</td>
<td class="TDp3">34/75</td>
<td class="TDp6">39. John Deep</td>
<td class="TDp8">7. Darius Star</td>
</tr>
---
I am using Beautiful Soup 4 to scrape an HTML table.
I want to display the values from one of the table rows and remove any empty td fields.
The rows in the source being scraped all share the same class values.
Is there any way to pull the data from just one row, using
data-name="Georgia" in the HTML source below?
Using: beautifulsoup4
Current code
import bs4 as bs
from urllib.request import FancyURLopener
class MyOpener(FancyURLopener):
    version = 'My new User-Agent' # Set this to a string you want for your user agent
myopener = MyOpener()
sauce = myopener.open('')
soup = bs.BeautifulSoup(sauce,'lxml')
#table = soupe.table
table = soup.find('table')
table_rows = table.find_all_next('tr')
for tr in table_rows:
    td = tr.find_all('td')
    row = [i.text for i in td]
    print(row)
HTML SOURCE
<tr>
<td class="text--gray">
<span class="save-button" data-status="unselected" data-type="country" data-name="Kazakhstan">★</span>
Kazakhstan
</td>
<td class="text--green">
81
</td>
<td class="text--green">
9
</td>
<td class="text--green">
12.5
</td>
<td class="text--red">
0
</td>
<td class="text--red">
0
</td>
<td class="text--red">
0
</td>
<td class="text--blue">
0
</td>
<td class="text--yellow">
0
</td>
</tr>
<tr>
<td class="text--gray">
<span class="save-button" data-status="unselected" data-type="country" data-name="Georgia">★</span>
Georgia
</td>
<td class="text--green">
75
</td>
<td class="text--green">
0
</td>
<td class="text--green">
0
</td>
<td class="text--red">
0
</td>
<td class="text--red">
0
</td>
<td class="text--red">
0
</td>
<td class="text--blue">
10
</td>
<td class="text--yellow">
1
</td>
</tr>
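Are you talking about something like:
tr.find_all('span', {'data-name' : True})
That should find any span that carries a data-name attribute (in the HTML above, the attribute sits on the span inside the first td, not on the td itself). I could be reading your question all wrong though.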
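To pull just the Georgia row, one approach is to find the span by its data-name value, climb to the enclosing tr, and keep the non-empty cells. This is only a sketch under the assumption that every row looks like the two in the question; the inline `html` is a shortened stand-in (with an empty cell added to show the filtering), and `span`, `row`, `values` and `result` are illustrative names:

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the table in the question; one empty <td> is added
# on purpose to show that empty fields are dropped.
html = """
<table>
<tr>
<td class="text--gray"><span class="save-button" data-type="country" data-name="Kazakhstan">*</span> Kazakhstan</td>
<td class="text--green"> 81 </td>
<td class="text--red"></td>
</tr>
<tr>
<td class="text--gray"><span class="save-button" data-type="country" data-name="Georgia">*</span> Georgia</td>
<td class="text--green"> 75 </td>
<td class="text--red"></td>
<td class="text--blue"> 10 </td>
</tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# The data-name attribute lives on the <span>, so search for that tag.
span = soup.find("span", attrs={"data-name": "Georgia"})
row = span.find_parent("tr")

# Skip the first cell (it holds the span), strip whitespace, drop empties.
values = [td.get_text(strip=True) for td in row.find_all("td")[1:]]
values = [v for v in values if v]

result = [span["data-name"]] + values
print(result)
```

If the country is missing from the page, `soup.find` returns None, so a real script would want to check `span` before calling `find_parent` on it.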
I want to traverse each row and capture the value of td.text. The problem is that the table does not have a class, and all the td elements share the same class name. I want to traverse each row and get the following output:
1st row) "AMERICANS SOCCER CLUB","B11EB - AMERICANS-B11EB-WARZALA","Cameron Coya","Player 228004","2016-09-10","player persistently infringes the laws of the game","C" (new line)
2nd row) "AVIATORS SOCCER CLUB","G12DB - AVIATORS-G12DB-REYNGOUDT","Saskia Reyes","Player 224463","2016-09-11","player/sub guilty of unsporting behavior","C" (new line)
<div style="overflow:auto; border:1px #cccccc solid;">
<table cellspacing="0" cellpadding="3" align="left" border="0" width="100%">
<tbody>
<tr class="tblHeading">
<td colspan="7">AMERICANS SOCCER CLUB</td>
</tr>
<tr bgcolor="#CCE4F1">
<td colspan="7">B11EB - AMERICANS-B11EB-WARZALA</td>
</tr>
<tr bgcolor="#FFFFFF">
<td width="19%" class="tdUnderLine"> Cameron Coya </td>
<td width="19%" class="tdUnderLine">
Rozel, Max
</td>
<td width="06%" class="tdUnderLine">
09-11-2016
</td>
<td width="05%" class="tdUnderLine" align="center">
228004
</td>
<td width="16%" class="tdUnderLine" align="center">
09/10/16 02:15 PM
</td>
<td width="30%" class="tdUnderLine"> player persistently infringes the laws of the game </td>
<td class="tdUnderLine"> Cautioned </td>
</tr>
<tr class="tblHeading">
<td colspan="7">AVIATORS SOCCER CLUB</td>
</tr>
<tr bgcolor="#CCE4F1">
<td colspan="7">G12DB - AVIATORS-G12DB-REYNGOUDT</td>
</tr>
<tr bgcolor="#FBFBFB">
<td width="19%" class="tdUnderLine"> Saskia Reyes </td>
<td width="19%" class="tdUnderLine">
HollaenderNardelli, Eric
</td>
<td width="06%" class="tdUnderLine">
09-11-2016
</td>
<td width="05%" class="tdUnderLine" align="center">
224463
</td>
<td width="16%" class="tdUnderLine" align="center">
09/11/16 06:45 PM
</td>
<td width="30%" class="tdUnderLine"> player/sub guilty of unsporting behavior </td>
<td class="tdUnderLine"> Cautioned </td>
</tr>
<tr class="tblHeading">
<td colspan="7">BERGENFIELD SOCCER CLUB</td>
</tr>
<tr bgcolor="#CCE4F1">
<td colspan="7">B11CW - BERGENFIELD-B11CW-NARVAEZ</td>
</tr>
<tr bgcolor="#FFFFFF">
<td width="19%" class="tdUnderLine"> Christian Latorre </td>
<td width="19%" class="tdUnderLine">
Coyle, Kevin
</td>
<td width="06%" class="tdUnderLine">
09-10-2016
</td>
<td width="05%" class="tdUnderLine" align="center">
226294
</td>
<td width="16%" class="tdUnderLine" align="center">
09/10/16 11:00 AM
</td>
<td width="30%" class="tdUnderLine"> player persistently infringes the laws of the game </td>
<td class="tdUnderLine"> Cautioned </td>
</tr>
I tried the following code.
import requests
from bs4 import BeautifulSoup
import re
try:
    import urllib.request as urllib2
except ImportError:
    import urllib2
url = r"G:\Freelancer\NC Soccer\Northern Counties Soccer Association ©.html"
page = open(url, encoding="utf8")
soup = BeautifulSoup(page.read(),"html.parser")
#tableList = soup.findAll("table")
for tr in soup.find_all("tr"):
    for td in tr.find_all("td"):
        print(td.text.strip())
but obviously this returns the text from every td, so I am not able to identify a particular column or determine where a new record starts. I want to know:
1) how to identify each column (the class name is the same for all of them), given that there are heading rows as well (I would appreciate it if you could provide code for that)
2) how to identify a new record in such a structure
If the data is really structured like a table, there's a good chance you can read it into pandas directly with pd.read_html(). (pd.read_table(), documented at the link below, is meant for delimited text files rather than HTML.) Note that pd.read_html() accepts URLs in its io argument.
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_table.html
count = 0
string = ""
for td in soup.find_all("td"):
    string += "\""+td.text.strip()+"\","
    count +=1
    if(count % 9 ==0):
        print(string[:-1] + "\n\n") # string[:-1] to remove the last ","
        string = ""
As the table is not in the required format, we just iterate over the td elements directly rather than going into each row and then into each td in that row, which complicates the work. I used a plain string here; you can instead append the data to a list of lists and process it later.
Hope this solves your problem
from __future__ import print_function
import re
import datetime
from bs4 import BeautifulSoup
soup = ""
with open("/tmp/a.html") as page:
    soup = BeautifulSoup(page.read(),"html.parser")
table = soup.find('div', {'style': 'overflow:auto; border:1px #cccccc solid;'}).find('table')
trs = table.find_all('tr')
table_dict = {}
game = ""
section = ""
for tr in trs:
    if tr.has_attr('class'):
        game = tr.text.strip('\n')
    if tr.has_attr('bgcolor'):
        if tr['bgcolor'] == '#CCE4F1':
            section = tr.text.strip('\n')
        else:
            tds = tr.find_all('td')
            extracted_text = [re.sub(r'([^\x00-\x7F])+','', td.text) for td in tds]
            extracted_text = [x.strip() for x in extracted_text]
            extracted_text = list(filter(lambda x: len(x) > 2, extracted_text))
            extracted_text.pop(1)
            extracted_text[2] = "Player " + extracted_text[2]
            extracted_text[3] = datetime.datetime.strptime(extracted_text[3], '%m/%d/%y %I:%M %p').strftime("%Y-%m-%d")
            extracted_text = ['"' + x + '"' for x in [game, section] + extracted_text]
            print(','.join(extracted_text))
And when run:
$ python a.py
"AMERICANS SOCCER CLUB","B11EB - AMERICANS-B11EB-WARZALA","Cameron Coya","Player 228004","2016-09-10","player persistently infringes the laws of the game","C"
"AVIATORS SOCCER CLUB","G12DB - AVIATORS-G12DB-REYNGOUDT","Saskia Reyes","Player 224463","2016-09-11","player/sub guilty of unsporting behavior","C"
"BERGENFIELD SOCCER CLUB","B11CW - BERGENFIELD-B11CW-NARVAEZ","Christian Latorre","Player 226294","2016-09-10","player persistently infringes the laws of the game","C"
Based on further conversation with the OP, the input was https://paste.fedoraproject.org/428111/87928814/raw/ and the output after running the above code is: https://paste.fedoraproject.org/428110/38792211/raw/
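On the two original sub-questions (naming the columns and spotting a new record), another way to look at it is as a small state machine over the rows: a tblHeading row sets the current club, a single-cell row sets the current team, and a seven-cell row is one record. The sketch below assumes the structure shown in the question; the column names in `COLUMNS` are my own guesses at what each cell means, since the page itself does not label them:

```python
from bs4 import BeautifulSoup

# One club/team/record group copied from the question's HTML.
html = """
<table cellspacing="0" cellpadding="3" border="0" width="100%">
<tr class="tblHeading"><td colspan="7">AMERICANS SOCCER CLUB</td></tr>
<tr bgcolor="#CCE4F1"><td colspan="7">B11EB - AMERICANS-B11EB-WARZALA</td></tr>
<tr bgcolor="#FFFFFF">
<td class="tdUnderLine"> Cameron Coya </td>
<td class="tdUnderLine"> Rozel, Max </td>
<td class="tdUnderLine"> 09-11-2016 </td>
<td class="tdUnderLine" align="center"> 228004 </td>
<td class="tdUnderLine" align="center"> 09/10/16 02:15 PM </td>
<td class="tdUnderLine"> player persistently infringes the laws of the game </td>
<td class="tdUnderLine"> Cautioned </td>
</tr>
</table>
"""

# Hypothetical column names: the source table has no header row for these.
COLUMNS = ["player", "official", "report_date", "player_id",
           "game_time", "reason", "action"]

soup = BeautifulSoup(html, "html.parser")
records = []
club = team = None

for tr in soup.find_all("tr"):
    cells = [td.get_text(strip=True) for td in tr.find_all("td")]
    if "tblHeading" in tr.get("class", []):
        club = cells[0]            # heading row: a new club starts here
    elif len(cells) == 1:
        team = cells[0]            # single-cell row: team within the club
    elif len(cells) == len(COLUMNS):
        record = dict(zip(COLUMNS, cells))   # position decides the column
        record.update(club=club, team=team)
        records.append(record)

print(records)
```

Because each record is a dict, the fields can be reordered or reformatted afterwards (for example with the csv module) without relying on counting td elements across the whole page.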
There seems to be a pattern. After every 7 tr(s), there is a new line.
So, what you can do is keep a counter starting from 1; when it reaches 7, print a new line and reset the counter to 1.
counter = 1
for tr in soup.find_all("tr"):
    for td in tr.find_all("td"):
        # place code
        counter = counter + 1
        if counter == 7:
            print("\n")
            counter = 1
<hknbody>
<tr>
<td class="padding_25 font_7 bold xicolor_07" style="width:30%">
date
</td>
<td class="font_34 xicolor_42">
19 Eylül 2013
</td>
</tr>
<tr>
<td style="height:10px" colspan="3"></td>
</tr>
<tr>
<td class="bgcolor_09" style="height:5px" colspan="3"></td>
</tr>
<tr>
<td style="height:10px" colspan="3"></td>
</tr>
<tr>
<td class="padding_25 font_7 bold xicolor_07" style="width:30%">
Size
</td>
<td class="font_34 xicolor_42">
650 cm
The class names are the same, and the cells are all in the same table.
How can I find the correct data? For example, if "date" doesn't exist in <td class="padding_25 font_7 bold xicolor_07">, skip the date and look for the next field.
If this is your HTML and you can change it, you should be using semantic HTML to markup your elements with class, id, or name attributes that describe the meaning of the data, not its appearance. Then you will have an unambiguous way of selecting the required tags.
As it is, you will have to do something like this:
import re
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')
date_tag = soup.find('td', text=re.compile(r'^\s*date\s*$')) # find first <td> containing text "date"
if date_tag:
    date_value = date_tag.find_next_sibling('td').text.strip()
>>> print(date_value)
19 Eylül 2013
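The same idea extends to every label in the table: treat each two-cell row as a label/value pair and collect them in a dict, so a missing label simply is not there. A rough sketch, assuming the data rows always have exactly two td cells while the spacer rows have one; the `fields` dict and the inline `html` are illustrative:

```python
from bs4 import BeautifulSoup

# Condensed copy of the rows from the question, including one spacer row.
html = """<table>
<tr>
<td class="padding_25 font_7 bold xicolor_07" style="width:30%"> date </td>
<td class="font_34 xicolor_42"> 19 Eylül 2013 </td>
</tr>
<tr><td style="height:10px" colspan="3"></td></tr>
<tr>
<td class="padding_25 font_7 bold xicolor_07" style="width:30%"> Size </td>
<td class="font_34 xicolor_42"> 650 cm </td>
</tr>
</table>"""

soup = BeautifulSoup(html, "html.parser")

fields = {}
for tr in soup.find_all("tr"):
    cells = tr.find_all("td", recursive=False)
    if len(cells) != 2:
        continue                      # spacer rows have a single cell
    label = cells[0].get_text(strip=True)
    value = cells[1].get_text(strip=True)
    if label:
        fields[label] = value

# .get() returns None when the label is absent, so nothing breaks
# if a page has no "date" row.
date_value = fields.get("date")
size_value = fields.get("Size")
print(fields)
```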
I am parsing an HTML document using Beautiful Soup 4.0.
Here is an example of a table in the document
<tr>
<td class="nob"></td>
<td class="">Time of price</td>
<td class=" pullElement pullData-DE000BWB14W0.teFull">08/06/2012</td>
<td class=" pullElement pullData-DE000BWB14W0.PriceTimeFull">11:43:08 </td>
<td class="nob"></td>
</tr>
<tr>
<td class="nob"></td>
<td class="">Daily volume (units)</td>
<td colspan="2" class=" pullElement pullData-DE000BWB14W0.EWXlume">0</td>
<td class="nob"></td>
</tr>
I would like to extract 08/06/2012 and 11:43:08, Daily volume, 0, etc.
This is my code to find the specific table and all of its data:
html = open("some_file.html")
soup = BeautifulSoup(html)
t = soup.find(id="ctnt-2308")
dat = [ [str(td) for td in row.findAll("td")] for row in t.findAll("tr") ]
I get a list of data that needs to be organized.
Are there any suggestions for doing this in a simple way?
Thank you
list(soup.stripped_strings)
will give you all the strings in that soup, with leading and trailing whitespace stripped and whitespace-only strings skipped.
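To keep the values grouped by row rather than as one flat list, stripped_strings can be applied per tr instead of to the whole soup. A small sketch, assuming the table carries the id from the question (ctnt-2308) and that the first non-empty cell of each row is its label; `rows` and `data` are illustrative names:

```python
from bs4 import BeautifulSoup

# The two rows from the question, wrapped in a table with the id the
# question's code searches for.
html = """<table id="ctnt-2308">
<tr>
<td class="nob"></td>
<td class="">Time of price</td>
<td class=" pullElement pullData-DE000BWB14W0.teFull">08/06/2012</td>
<td class=" pullElement pullData-DE000BWB14W0.PriceTimeFull">11:43:08 </td>
<td class="nob"></td>
</tr>
<tr>
<td class="nob"></td>
<td class="">Daily volume (units)</td>
<td colspan="2" class=" pullElement pullData-DE000BWB14W0.EWXlume">0</td>
<td class="nob"></td>
</tr>
</table>"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find(id="ctnt-2308")

# One list of cleaned strings per row; empty cells contribute nothing.
rows = [list(tr.stripped_strings) for tr in table.find_all("tr")]

# First string is the label, the rest are its values.
data = {row[0]: row[1:] for row in rows if row}
print(data)
```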