Python HTML Regex

Python HTML Regex - python

correct output using below txt file should be: PlayerA 29.2 PlayerB 32.2
I have a txt file filled with html that looks like below,
I'm trying to use a python 2.6 regular expression to collect all the playernames and ratings.
The first time the playername appears is on line 4, the rating appears on line 16.(29.2)
Then the next player name appears on line 22, the rating on line 35.
and so on...
fileout = open('C:\Python26\hotcold.txt')
read_file = fileout.readlines()
source = str(read_file)
expression = re.findall(r"(LS=113>.+?", source)
print expression
I was trying to make a expression that would find all the names and ratings but it isnt working..
<tr class="stats">
<td class="stats" colspan="1" valign="top">
<a href="index.php?c=playerview&P=245&LS=113">
PlayerA
</a>
</td>
<td class="stats" colspan="1" valign="top">
<b>
4
</b>
,
<b>
8
</b>
</td>
<td class="stats" colspan="1" valign="top">
29.2
</td>
<tr class="stats">
<td class="stats" colspan="1" valign="top">
<a href="index.php?c=playerview&P=245&LS=113">
PlayerB
</a>
</td>
<td class="stats" colspan="1" valign="top">
<b>
4
</b>
,
<b>
8
</b>
</td>
<td class="stats" colspan="1" valign="top">
32.2
</td>

I would recommend using Beautiful Soup to parse the HTML and get the values you are after.
Use the following code:
from bs4 import BeautifulSoup
with open('sample.html', 'r') as html_doc:
soup = BeautifulSoup(html_doc, 'html.parser')
for row in soup.find_all('tr', 'stats'):
row_tds = row.find_all_next('td')
print('{0} {1}'.format(
row_tds[0].find('a').string.strip() if row_tds[0].find('a').string else 'None',
row_tds[2].string.strip() if row_tds[2].string else 'None')
)
output:
$ python testparse.py
PlayerA 29.2
PlayerB 32.2
Works.

Alternatively, I would suggest using a proper html parser instead of relying on regex -- although BeautifulSoup is actually a very good and easy-to-use library.
In your sample is that missing the closing <tr> tags between the <td>?
Edit: using OP sample as source
Anyhow, and using lxml.html with simple xpath to get hopefully what you expected:
In [1]: import lxml.html
# sample.html is the same as in OP sample
In [2]: tree = lxml.html.parse("sample.html")
In [3]: root = tree.getroot()
In [4]: players = root.xpath('.//td[#class="stats"]/a/text()')
In [5]: stats = root.xpath('//td[#class="stats" and normalize-space(text())]/text()')
In [6]: print players, stats
['\nPlayerA\n', '\nPlayerB\n'] ['\n29.2\n', '\n32.2\n']
In [7]: for player, stat in zip(players, stats):
...: print player.strip(), stat.strip()
...:
PlayerA 29.2
PlayerB 32.2

Related

bs4 list comprehension with 'if' statement

I'm trying to add the text of each table row to my list if the table row contains text. I want to do this using list comprehension.
Here's what I tried
listt2 = [s.span.text for s in soup.find_all('tr') if s.span.text]
Here's the error
listt2 = [s.span.text for s in soup.find_all('tr') if s.span.text]
AttributeError: 'NoneType' object has no attribute 'text'
Here's 1 'tr' that contains a span tag:
<tr>
<td colspan="2" class="cell--section-end cell--link cell--link__icon">
<a data-analytics="[Competitions] - German Bundesliga" href="/football/german-bundesliga/event/26301018" class="cell--link__link cell-text">
<i class="i accordion__title-icon--green accordion__title-icon--right" data-char=""></i> <b class="cell-text__line cell-text__line--icon">
<span class="competitions-team-name js-ev-desc">1. FC Köln v 1899 Hoffenheim</span>
</b>
</a>
</td>
<tr>
Here's Another that doesn't:
<tr>
<td colspan="5" class="group-header">
Sat 14:30 </td>
</tr>
Please note there are many more tr tags on this page

If you want to get only <tr> tags which contain <span> tag, you can use this list comprehension:
listt2 = [s.span.text for s in soup.select('tr:has(span)') if s.span.text]
EDIT:
from bs4 import BeautifulSoup
html_doc = '''<tr>
<td colspan="2" class="cell--section-end cell--link cell--link__icon">
<a data-analytics="[Competitions] - German Bundesliga" href="/football/german-bundesliga/event/26301018" class="cell--link__link cell-text">
<i class="i accordion__title-icon--green accordion__title-icon--right" data-char=""></i> <b class="cell-text__line cell-text__line--icon">
<span class="competitions-team-name js-ev-desc">1. FC Köln v 1899 Hoffenheim</span>
</b>
</a>
</td>
<tr>'''
soup = BeautifulSoup(html_doc, 'html.parser')
listt2 = [s.span.text for s in soup.select('tr:has(span)') if s.span.text]
print(listt2)
Prints:
['1. FC Köln v 1899 Hoffenheim']

You just need to check that span is not None before looking for span.text.
listt2 = [s.span.text for s in soup.find_all('tr') if s.span is not None and s.span.text]
Because of short-circuiting, s.span.text is never evaluated if s.span is None because False and * is False

Python: Accessing a new <tr> while inside a different <tr> with BeautifulSoup4

I am trying to gather some data by webscraping a local HTML file using BeautifulSoup4. The problem is, that the information I'm trying to get is on different rows that have the same class tags. I'm not sure about how to access them. The following html screenshot contains the two rows I'm accessing with the data I need highlighted (sensitive info is scribbled out).
The code I have currently is:
def find_data(fileName):
with open(fileName) as html_file:
soup = bs(html_file, "lxml")
hline1 = soup.find("td", class_="headerTableEntry")
hline2 = hline1.find_next_sibling("td")
hline3 = hline2.find_next_sibling("td")
hline4 = hline3.find_next_sibling("td", class_="headerTableEntry")
line1 = hline1.text
line2 = hline2.text
line3 = hline3.text
#Nothing yet for lines 4,5,6
The first 3 lines work great and give 13, 39, and 33.3% as they should. But for line 4 (which should be the second tag and first tag with class=headerTableEntry) I get an error "'NoneType' object is not callable".
My question is, is there a different way to go at this so I can access all 6 data cells or is there a way to edit how I wrote line 4 to work? Thank you for your help, it is very much appreciated!

The <tr> tag is not inside another <tr> tag as you can see that first <tr> tag is closed with the </tr> So that next <td> is not a sibling of the previous, hence it returns None. It's within the next <tr> tag.
Pandas is a great package to parse html <table> tags (which this is). It actually uses beautifulsoup under the hood. Just get the full table, and slice the table for the columns you want:
html_file = '''<table>
<tr>
<td class="headerName">File:</td>
<td class="HeaderValue">Some Value</td>
<td></td>
<td class="headerName">Lines:</td>
<td class="headerTableEntry">13</td>
<td class="headerTableEntry">39</td>
<td class="headerTableEntry" style="back-ground-color:LightPink">33.3 %</td>
</tr>
<tr>
<td class="headerName">Date:</td>
<td class="HeaderValue">2020-06-18 11:15:19</td>
<td></td>
<td class="headerName">Branches:</td>
<td class="headerTableEntry">10</td>
<td class="headerTableEntry">12</td>
<td class="headerTableEntry" style="back-ground-color:#FFFF55">83.3 %</td>
</tr>
</table>'''
import pandas as pd
df = pd.read_html(html_file)[0]
df = df.iloc[:,3:]
So for your code:
def find_data(fileName):
with open(fileName) as html_file:
df = pd.read_html(html_file)[0].iloc[:,3:]
print (df)
Output:
print (df)
3 4 5 6
0 Lines: 13 39 33.3 %
1 Branches: 10 12 83.3 %

joining RE requests in Python

I have a page.htm file:
</td></tr>
<tr>
<td height="120" class="box_pic">
<img src="http://s.fdert.com/pics.php?q=4iGjVtivCiBKELV%2BVUi27TIgo9KhXusVoizsXDI8FN1HTLACXmZddfsdsgsdghqJXZkz5vSkYq6xISbd2zaUA%3D%3D" alt="[без описания]" width="140" height="105">
</td>
</tr>
<tr align="center" valign="middle">
<td valign="top">
<table width="100%" border="0" cellspacing="0" cellpadding="0">
<tr>
<td class="box_prc"><span class="nwr"><img src="/map/gender_pair.gif" width="11" height="11" alt="Сова" border=0> <a class="usernick" href="/index.php?action=user&id=79159" target="_blank">ABird</a></span></td>
</td></tr>
<tr>
<td height="120" class="box_pic">
<img src="http://s.fdert.com/pics.php?q=4iGjVtivCiBKELV%2BVUi27TIgo9KhXusVoizsXDI8FN1HTLACXmZddfsdsgsdghqJXZkz5vSkYq6xISbd2zaUA%3D%3D" alt="[без описания]" width="140" height="105">
</td>
</tr>
<tr align="center" valign="middle">
<td valign="top">
<table width="100%" border="0" cellspacing="0" cellpadding="0">
<tr>
<td class="box_prc"><span class="nwr"><img src="/map/gender_pair.gif" width="11" height="11" alt="Сова" border=0> <a class="usernick" href="/index.php?action=user&id=78759" target="_blank">ADog</a></span></td>
</td></tr>
<tr>
<td height="120" class="box_pic">
<img src="http://s.fdert.com/pics.php?q=4iGjVtivCiBKELV%2BVUi27TIgo9KhXusVoizsXDI8FN1HTLACXmZddfsdsgsdghqJXfdgfdgZkz5vSkYq6xISbd2zaUA%3D%3D" alt="[без описания]" width="140" height="105">
</td>
</tr>
<tr align="center" valign="middle">
<td valign="top">
<table width="100%" border="0" cellspacing="0" cellpadding="0">
<tr>
<td class="box_prc"><span class="nwr"><img src="/map/gender_pair.gif" width="11" height="11" alt="Сова" border=0> <a class="usernick" href="/index.php?action=user&id=87159" target="_blank">ACat56</a></span></td>
It has 3 sets of data which I need:
1) 1322679 79159 ABird
2) 1546679 78759 ADog
3) 5622679 87159 ACat56
I have 3 requests for RE which can dig elements from this page:
import re
with open('page.htm', 'r') as our_file:
page=our_file.read()
result = re.findall(r'view\.php\?item=(\d+)', page)
result2 = re.findall(r'user&id=(\d+)', page)
result3 = re.findall(r'user&id=.*>(\w+)', page)
print (result, len(result))
print (result2, len(result2))
print (result3, len(result3))
the result I get:
['1322679', '1546679', '5622679'] 3
['79159', '78759', '87159'] 3
['ABird', 'ADog', 'ACat56'] 3
Do you know the way to join these 3 requests in ONE? So that
1) file would be analized 1 time instead of 3 times
2) only ONE re.findall() would be used
3) data would be joined in the way I need
a) 1322679 79159 ABird
b) 1546679 78759 ADog
c) 5622679 87159 ACat56
the result request should be something like this:
result = re.findall(r'view\.php\?item=(\d+) SOMETHING_HERE user&id=(\d+) SOMETHING_HERE .*>(\w+)', page)

Here is how to do it properly with an HTML parser in Python 2:
from urlparse import parse_qs, urlparse
from bs4 import BeautifulSoup
def only(x):
x = list(x)
assert len(x) == 1
return x[0]
def url_params(a):
return parse_qs(urlparse(a['href']).query)
def main():
with open('page.html') as f:
soup = BeautifulSoup(f, 'html.parser')
rows = soup.find_all('tr', recursive=False)
# Data is in alternating rows, so take pairs of rows at a time
for row1, row2 in zip(rows[::2], rows[1::2]):
a = only(row1.select('td.box_pic a'))
item_id = only(url_params(a)['item'])
a = only(row2.select('a.usernick'))
user_id = only(url_params(a)['id'])
nick = a.text
print item_id, user_id, nick
main()
Output:
1322679 79159 ABird
1546679 78759 ADog
5622679 87159 ACat56
Now, this may not be as concise as the re method, but this code is aware of how the input is meant to be structured and that makes it robust. If the structure of the input changes, e.g. the format of the URLs or the shape of the HTML, this code will either continue to work correctly or it will raise an error to tell you that things aren't as expected. The re method may very easily continue to run but give you incorrect results, which is not a situation you want. And if you want to extract more information in the future, it's very easy to add the necessary lines without interfering with the existing code.

finally, I found the solution:
This is the answer, which satisfies all the requirements:
import re
with open('page.htm', 'r') as our_file:
page=our_file.read()
page = re.sub(r'[\t\r\n\s]','',page)
re.DOTALL
result = re.findall(r'view\.php\?item=(\d+).*?user&id=(\d+).*?>(\w+)', page)
print (result, len(result))
and:
1) results are in needed order
2) 1 request
result:
[('1322679', '79159', 'ABird'), ('1546679', '78759', 'ADog'), ('5622679', '87159', 'ACat56')] 3

How to extract info from varying table entries: Text vs. DIV vs. SPAN

I am new to python and searched the internet to find an answer to my problem, but so far I failed...
The problem: My aim is to extract data from websites. More specifically, from the tables in these websites. The relevant snippet from the website-code you find in "data" in my python-code example here:
from bs4 import BeautifulSoup
data = '''<table class="ds-table">
<tr>
<td class="data-label">year of birth:</td>
<td class="data-value">1994</td>
</tr>
<tr>
<td class="data-label">reporting period:</td>
<td class="data-value">
<span class="editable" id="c-scope_beginning_date">
? </span>
-
<span class="editable" id="c-scope_ending_date">
? </span>
</td>
</tr>
<tr>
<td class="data-label">reporting cycle:</td>
<td class="data-value">
<span class="editable" id="c-periodicity">
- </span>
</td>
</tr>
<tr>
<td class="data-label">grade:</td>
<td class="data-value">1.3, upper 10% of class</td>
</tr>
<tr>
<td class="data-label">status:</td>
<td class="data-value"></td>
</tr>
</table>
<table class="ds-table">
<tr>
<td class="data-label">economics:</td>
<td class="data-value"><span class="positive-value"></span></td>
</tr>
<tr>
<td class="data-label">statistics:</td>
<td class="data-value"><span class="negative-value"></span></td>
</tr>
<tr>
<td class="data-label">social:</td>
<td class="data-value"><div id="music_id" class="trigger"><span class="negative-value"></span></div></td>
</tr>
<tr>
<td class="data-label">misc:</td>
<td class="data-value">
<div id="c_assurance" class="">
<span class="positive-value"></span> </div>
</td>
</tr>
<tr>
<td class="data-label">recommendation:</td>
<td class="data-value">
<span class="negative-value"></span> </td>
</tr>
</table>'''
soup = BeautifulSoup(data)
For the class="data-label" so far I successfully implemented...
box_cdl = []
for i, cdl in enumerate(soup.findAll('td', attrs={'class': 'data-label'})):
box_cdl.append(cdl.contents[0])
print box_cdl
...which extracts the text from the columns, in the (for me satisfying) output:
[u'year of birth:',
u'reporting period:',
u'reporting cycle:',
u'grade:',
u'status:',
u'economics:',
u'statistics:',
u'social:',
u'misc:',
u'recommendation:']
Where I get stuck is the part for class="data-value" with the div- and span-fields and that some of the relevant information is hidden in the span-class. Moreover, the amount of the tr-rows can change from website to website, e.g. "status" comes after "reporting cycle" (instead of "grade").
However, when I do...
box_cdv = []
for j, cdv in enumerate(soup.findAll('td', attrs={'class': 'data-value'})):
box_cdv.append(cdv.contents[0])
print box_cdv
...I get the error:
Traceback (most recent call last):
File "<ipython-input-53-7d5c095cf647>", line 3, in <module>
box_cdv.append(cdv.contents[0])
IndexError: list index out of range
What I would like to get instead is something like this (corresponding to the above "data"-example):
[u'1994',
u'? - ?',
u'-',
u'1.3, upper 10% of class',
u'',
u'positive-value',
u'negative-value',
u'negative-value',
u'positive-value',
u'negative-value']
The Question: how can I extract this information and collect the relevant data from each tr-row, given that the adequate extraction-code depends on the type of the category (year of birth, reporting period, ..., recommendation)?
Or, asking differently: what code extracts me, depending on the category (year of birth, reporting period, ..., recommendation), the corresponding value (1994, ..., negative-value)?
Since the amount and the type of the table-entries can differ between websites, a simple "on the i-th entry do the following" procedure is not applicable. The thing I am looking for I think is something like "if you find the text "recommendation:", then extract the class-type from the span-field", I guess. But unfortunately I do not have any clue how to translate that into python-language.
Any help is highly appreciated.

You get that error because one of the tags don't have any children so the contents list gives an error when searching for that index.
You can approeach this on the following way:
1) Search for the data-label tags;
2) Find the next TD sibling;
3 A) Check of the sibling has text;
3 A) 1) If so create a dict entry with data-label as the key and the sibling text as its value;
3 A) B) If not check if the sibling first child have a class containing -value`
4) Parse the data.
Example:
soup = BeautifulSoup(data, 'lxml')
result = {}
for tag in soup.find_all("td", { "class" : "data-label" }):
NextSibling = tag.find_next("td", { "class" : "data-value" }).get_text(strip = True)
if not NextSibling and len(tag.find_next("td").select('span[class*=-value]')) > 0:
NextSibling = tag.find_next("td").select('span[class*=-value]')[0]["class"][0]
result[tag.get_text(strip = True)] = NextSibling
print (result)
Result:
{
'year of birth:': '1994',
'reporting period:': '?-?',
'reporting cycle:': '-',
'grade:': '1.3, upper 10% of class',
'status:': '',
'economics:': 'positive-value',
'statistics:': 'negative-value',
'social:': 'negative-value',
'misc:': 'positive-value',
'recommendation:': 'negative-value'
}

How can I iterate through tags with different identifiers with BeautifulSoup in Python

This is probably an easy question, but I'd like to iterate through the tags with id = dgrdAcquired_hyplnkacquired_0, dgrdAcquired_hyplnkacquired_1, etc.
Is there any easier way to do this than the code I have below? The trouble is that the number of these tags will be different for each webpage I pull up. I'm not sure how to get the text in these tags when each webpage might have a different number of tags.
html = """
<tr>
<td colspan="3"><table class="datagrid" cellspacing="0" cellpadding="3" rules="rows" id="dgrdAcquired" width="100%">
<tr class="datagridH">
<th scope="col"><font face="Arial" color="Blue" size="2"><b>Name (RSSD ID)</b></font></th><th scope="col"><font face="Arial" color="Blue" size="2"><b>Acquisition Date</b></font></th><th scope="col"><font face="Arial" color="Blue" size="2"><b>Description</b></font></th>
</tr><tr class="datagridI">
<td nowrap="nowrap"><font face="Arial" size="2">
<a id="dgrdAcquired_hyplnkacquired_0" href="InstitutionProfile.aspx?parID_RSSD=3557617&parDT_END=20110429">FIRST CHOICE COMMUNITY BANK (3557617)</a>
</font></td><td><font face="Arial" size="2">
<span id="dgrdAcquired_lbldtAcquired_0">2011-04-30</span>
</font></td><td><font face="Arial" size="2">
<span id="dgrdAcquired_lblAcquiredDescText_0">The acquired institution failed and disposition was arranged of by a regulatory agency. Assets were distributed to the acquiring institution.</span>
</font></td>
</tr><tr class="datagridAI">
<td nowrap="nowrap"><font face="Arial" size="2">
<a id="dgrdAcquired_hyplnkacquired_1" href="InstitutionProfile.aspx?parID_RSSD=104038&parDT_END=20110429">PARK AVENUE BANK, THE (104038)</a>
</font></td>
"""
soup = BeautifulSoup(html)
firm1 = soup.find('a', { "id" : "dgrdAcquired_hyplnkacquired_0"})
data1 = ''.join(firm1.findAll(text=True))
print data1
firm2 = soup.find('a', { "id" : "dgrdAcquired_hyplnkacquired_1"})
data2 = ''.join(firm2.findAll(text=True))
print data2

I would do the following, assuming that if there are n such tags, they are numbered 0...n:
soup = BeautifulSoup(html)
i = 0
data = []
while True:
firm1 = soup.find('a', { "id" : "dgrdAcquired_hyplnkacquired_%s" % i})
if not firm1:
break
data.append(''.join(firm1.findAll(text=True)))
print data[-1]
i += 1

Regex is probably overkill in this particular case.
Nonetheless here's another option:
import re
soup.find_all('a', id=re.compile(r'[dgrdAcquired_hyplnkacquired_]\d+'))
Please note: s/find_all/findAll/g if using BS3.
Result (a bit of whitespace removed for purposes of display):
[<a href="InstitutionProfile.aspx?parID_RSSD=3557617&parDT_END=20110429"
id="dgrdAcquired_hyplnkacquired_0">FIRST CHOICE COMMUNITY BANK (3557617)</a>,
<a href="InstitutionProfile.aspx?parID_RSSD=104038&parDT_END=20110429"
id="dgrdAcquired_hyplnkacquired_1">PARK AVENUE BANK, THE (104038)</a>]

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Python HTML Regex - python

Related

bs4 list comprehension with 'if' statement

Python: Accessing a new <tr> while inside a different <tr> with BeautifulSoup4

joining RE requests in Python

How to extract info from varying table entries: Text vs. DIV vs. SPAN

How can I iterate through tags with different identifiers with BeautifulSoup in Python

Categories

Resources