Joining RE requests in Python

I have a page.htm file:
</td></tr>
<tr>
<td height="120" class="box_pic">
<img src="http://s.fdert.com/pics.php?q=4iGjVtivCiBKELV%2BVUi27TIgo9KhXusVoizsXDI8FN1HTLACXmZddfsdsgsdghqJXZkz5vSkYq6xISbd2zaUA%3D%3D" alt="[без описания]" width="140" height="105">
</td>
</tr>
<tr align="center" valign="middle">
<td valign="top">
<table width="100%" border="0" cellspacing="0" cellpadding="0">
<tr>
<td class="box_prc"><span class="nwr"><img src="/map/gender_pair.gif" width="11" height="11" alt="Сова" border=0> <a class="usernick" href="/index.php?action=user&id=79159" target="_blank">ABird</a></span></td>
</td></tr>
<tr>
<td height="120" class="box_pic">
<img src="http://s.fdert.com/pics.php?q=4iGjVtivCiBKELV%2BVUi27TIgo9KhXusVoizsXDI8FN1HTLACXmZddfsdsgsdghqJXZkz5vSkYq6xISbd2zaUA%3D%3D" alt="[без описания]" width="140" height="105">
</td>
</tr>
<tr align="center" valign="middle">
<td valign="top">
<table width="100%" border="0" cellspacing="0" cellpadding="0">
<tr>
<td class="box_prc"><span class="nwr"><img src="/map/gender_pair.gif" width="11" height="11" alt="Сова" border=0> <a class="usernick" href="/index.php?action=user&id=78759" target="_blank">ADog</a></span></td>
</td></tr>
<tr>
<td height="120" class="box_pic">
<img src="http://s.fdert.com/pics.php?q=4iGjVtivCiBKELV%2BVUi27TIgo9KhXusVoizsXDI8FN1HTLACXmZddfsdsgsdghqJXfdgfdgZkz5vSkYq6xISbd2zaUA%3D%3D" alt="[без описания]" width="140" height="105">
</td>
</tr>
<tr align="center" valign="middle">
<td valign="top">
<table width="100%" border="0" cellspacing="0" cellpadding="0">
<tr>
<td class="box_prc"><span class="nwr"><img src="/map/gender_pair.gif" width="11" height="11" alt="Сова" border=0> <a class="usernick" href="/index.php?action=user&id=87159" target="_blank">ACat56</a></span></td>
It has 3 sets of data which I need:
1) 1322679 79159 ABird
2) 1546679 78759 ADog
3) 5622679 87159 ACat56
I have 3 requests for RE which can dig elements from this page:
import re
with open('page.htm', 'r') as our_file:
    page = our_file.read()
result = re.findall(r'view\.php\?item=(\d+)', page)
result2 = re.findall(r'user&id=(\d+)', page)
result3 = re.findall(r'user&id=.*>(\w+)', page)
print(result, len(result))
print(result2, len(result2))
print(result3, len(result3))
the result I get:
['1322679', '1546679', '5622679'] 3
['79159', '78759', '87159'] 3
['ABird', 'ADog', 'ACat56'] 3
Do you know a way to join these 3 requests into ONE, so that:
1) the file is analyzed once instead of 3 times
2) only ONE re.findall() is used
3) the data is joined in the way I need
a) 1322679 79159 ABird
b) 1546679 78759 ADog
c) 5622679 87159 ACat56
the result request should be something like this:
result = re.findall(r'view\.php\?item=(\d+) SOMETHING_HERE user&id=(\d+) SOMETHING_HERE .*>(\w+)', page)

Here is how to do it properly with an HTML parser in Python 2:
from urlparse import parse_qs, urlparse
from bs4 import BeautifulSoup

def only(x):
    # Return the single element of x; fail loudly if there is not exactly one
    x = list(x)
    assert len(x) == 1
    return x[0]

def url_params(a):
    return parse_qs(urlparse(a['href']).query)

def main():
    with open('page.html') as f:
        soup = BeautifulSoup(f, 'html.parser')
    rows = soup.find_all('tr', recursive=False)
    # Data is in alternating rows, so take pairs of rows at a time
    for row1, row2 in zip(rows[::2], rows[1::2]):
        a = only(row1.select('td.box_pic a'))
        item_id = only(url_params(a)['item'])
        a = only(row2.select('a.usernick'))
        user_id = only(url_params(a)['id'])
        nick = a.text
        print item_id, user_id, nick

main()
Output:
1322679 79159 ABird
1546679 78759 ADog
5622679 87159 ACat56
Now, this may not be as concise as the re method, but this code is aware of how the input is meant to be structured and that makes it robust. If the structure of the input changes, e.g. the format of the URLs or the shape of the HTML, this code will either continue to work correctly or it will raise an error to tell you that things aren't as expected. The re method may very easily continue to run but give you incorrect results, which is not a situation you want. And if you want to extract more information in the future, it's very easy to add the necessary lines without interfering with the existing code.
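For readers on Python 3, where the urlparse module moved to urllib.parse, the URL-parameter step can be sketched on its own with the standard library. This is only a minimal illustration, not the full script above; as an assumption for brevity, the helper here takes the href string directly instead of a BeautifulSoup tag.

```python
from urllib.parse import parse_qs, urlparse


def url_params(href):
    """Return the query parameters of a URL as a dict mapping names to lists."""
    return parse_qs(urlparse(href).query)


# href copied from the first usernick link in the question's page.htm
params = url_params("/index.php?action=user&id=79159")
user_id = params["id"][0]
print(user_id)
```

Because parse_qs returns a list per parameter, a missing or repeated parameter shows up as a KeyError or an unexpected list length rather than a silently wrong value.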

Finally, I found a solution that satisfies all the requirements:
import re
with open('page.htm', 'r') as our_file:
    page = our_file.read()
# Strip all whitespace (newlines included) so that .*? can span what used to be several lines
page = re.sub(r'\s', '', page)
# A bare `re.DOTALL` statement has no effect, and the flag is not needed once newlines are gone
result = re.findall(r'view\.php\?item=(\d+).*?user&id=(\d+).*?>(\w+)', page)
print(result, len(result))
With this:
1) the results are in the needed order
2) only 1 request is used
Result:
[('1322679', '79159', 'ABird'), ('1546679', '78759', 'ADog'), ('5622679', '87159', 'ACat56')] 3
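A possible variant keeps the whitespace and instead passes re.DOTALL as a flag, so that `.` also matches newlines. The sketch below runs on a shortened, hypothetical page: the `<a href="/view.php?item=...">` markup is an assumption, since the excerpt of page.htm in the question does not show those links.

```python
import re

# Shortened stand-in for page.htm; the view.php links are assumed markup.
page = '''<td class="box_pic">
<a href="/view.php?item=1322679"><img src="pic.gif"></a>
</td>
<td class="box_prc"><a class="usernick" href="/index.php?action=user&id=79159" target="_blank">ABird</a></td>
<td class="box_pic">
<a href="/view.php?item=1546679"><img src="pic.gif"></a>
</td>
<td class="box_prc"><a class="usernick" href="/index.php?action=user&id=78759" target="_blank">ADog</a></td>
'''

# re.DOTALL lets the lazy .*? cross line breaks; [^>]*> stops at the end
# of the usernick tag so the third group captures the link text.
pattern = re.compile(r'view\.php\?item=(\d+).*?user&id=(\d+)[^>]*>(\w+)', re.DOTALL)
result = pattern.findall(page)
print(result, len(result))
```

Each tuple holds the item id, the user id, and the nickname, in page order.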

Related

Python: Accessing a new <tr> while inside a different <tr> with BeautifulSoup4

I am trying to gather some data by webscraping a local HTML file using BeautifulSoup4. The problem is, that the information I'm trying to get is on different rows that have the same class tags. I'm not sure about how to access them. The following html screenshot contains the two rows I'm accessing with the data I need highlighted (sensitive info is scribbled out).
The code I have currently is:
from bs4 import BeautifulSoup as bs

def find_data(fileName):
    with open(fileName) as html_file:
        soup = bs(html_file, "lxml")
    hline1 = soup.find("td", class_="headerTableEntry")
    hline2 = hline1.find_next_sibling("td")
    hline3 = hline2.find_next_sibling("td")
    hline4 = hline3.find_next_sibling("td", class_="headerTableEntry")
    line1 = hline1.text
    line2 = hline2.text
    line3 = hline3.text
    #Nothing yet for lines 4,5,6
The first 3 lines work great and give 13, 39, and 33.3% as they should. But for line 4 (which should be the first <td> with class=headerTableEntry in the second <tr>) I get the error "'NoneType' object is not callable".
My question is, is there a different way to go at this so I can access all 6 data cells or is there a way to edit how I wrote line 4 to work? Thank you for your help, it is very much appreciated!
The second <tr> tag is not inside the first one: as you can see, the first <tr> is closed by its </tr>. The next <td> you want lives in the next <tr>, so it is not a sibling of the previous <td>, and find_next_sibling returns None.
Pandas is a great package for parsing HTML <table> tags (which this is). It can use lxml or BeautifulSoup under the hood. Just read the full table, then slice out the columns you want:
html_file = '''<table>
<tr>
<td class="headerName">File:</td>
<td class="HeaderValue">Some Value</td>
<td></td>
<td class="headerName">Lines:</td>
<td class="headerTableEntry">13</td>
<td class="headerTableEntry">39</td>
<td class="headerTableEntry" style="back-ground-color:LightPink">33.3 %</td>
</tr>
<tr>
<td class="headerName">Date:</td>
<td class="HeaderValue">2020-06-18 11:15:19</td>
<td></td>
<td class="headerName">Branches:</td>
<td class="headerTableEntry">10</td>
<td class="headerTableEntry">12</td>
<td class="headerTableEntry" style="back-ground-color:#FFFF55">83.3 %</td>
</tr>
</table>'''
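from io import StringIO
import pandas as pd
# Wrap the literal HTML in StringIO; recent pandas versions deprecate passing a raw HTML string
df = pd.read_html(StringIO(html_file))[0]
df = df.iloc[:, 3:]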
So for your code:
def find_data(fileName):
    with open(fileName) as html_file:
        df = pd.read_html(html_file)[0].iloc[:, 3:]
    print(df)
Output:
           3   4   5       6
0     Lines:  13  39  33.3 %
1  Branches:  10  12  83.3 %
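The row-boundary point can also be illustrated without any third-party package. The sketch below swaps in the standard library's xml.etree.ElementTree, which only works here because this particular snippet happens to be well-formed XML (an assumption that will not hold for most real-world HTML). It walks each <tr> separately instead of following siblings across rows.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the answer's table; must be well-formed for ElementTree.
table = '''<table>
<tr>
<td class="headerName">Lines:</td>
<td class="headerTableEntry">13</td>
<td class="headerTableEntry">39</td>
<td class="headerTableEntry">33.3 %</td>
</tr>
<tr>
<td class="headerName">Branches:</td>
<td class="headerTableEntry">10</td>
<td class="headerTableEntry">12</td>
<td class="headerTableEntry">83.3 %</td>
</tr>
</table>'''

root = ET.fromstring(table)

# One list per <tr>: siblings only exist within a row, so iterate row by row.
rows = [
    [td.text for td in tr.findall("td") if td.get("class") == "headerTableEntry"]
    for tr in root.findall("tr")
]
print(rows)
```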

Python HTML Regex

The correct output for the txt file below should be: PlayerA 29.2 PlayerB 32.2
I have a txt file filled with HTML that looks like the sample below.
I'm trying to use a Python 2.6 regular expression to collect all the player names and ratings.
The first player name appears on line 4 and its rating (29.2) on line 16.
The next player name appears on line 22 and its rating on line 35,
and so on...
fileout = open('C:\Python26\hotcold.txt')
read_file = fileout.readlines()
source = str(read_file)
expression = re.findall(r"(LS=113>.+?", source)
print expression
I was trying to write an expression that would find all the names and ratings, but it isn't working.
<tr class="stats">
<td class="stats" colspan="1" valign="top">
<a href="index.php?c=playerview&P=245&LS=113">
PlayerA
</a>
</td>
<td class="stats" colspan="1" valign="top">
<b>
4
</b>
,
<b>
8
</b>
</td>
<td class="stats" colspan="1" valign="top">
29.2
</td>
<tr class="stats">
<td class="stats" colspan="1" valign="top">
<a href="index.php?c=playerview&P=245&LS=113">
PlayerB
</a>
</td>
<td class="stats" colspan="1" valign="top">
<b>
4
</b>
,
<b>
8
</b>
</td>
<td class="stats" colspan="1" valign="top">
32.2
</td>
I would recommend using Beautiful Soup to parse the HTML and get the values you are after.
Use the following code:
from bs4 import BeautifulSoup
with open('sample.html', 'r') as html_doc:
    soup = BeautifulSoup(html_doc, 'html.parser')
    for row in soup.find_all('tr', 'stats'):
        row_tds = row.find_all_next('td')
        print('{0} {1}'.format(
            row_tds[0].find('a').string.strip() if row_tds[0].find('a').string else 'None',
            row_tds[2].string.strip() if row_tds[2].string else 'None')
        )
output:
$ python testparse.py
PlayerA 29.2
PlayerB 32.2
Works.
I would also suggest using a proper HTML parser instead of relying on regex. BeautifulSoup is a very good and easy-to-use library, and lxml is an alternative.
Is your sample missing the closing </tr> tags between the rows?
Edit: using OP sample as source
Anyhow, and using lxml.html with simple xpath to get hopefully what you expected:
In [1]: import lxml.html
# sample.html is the same as in OP sample
In [2]: tree = lxml.html.parse("sample.html")
In [3]: root = tree.getroot()
In [4]: players = root.xpath('.//td[@class="stats"]/a/text()')
In [5]: stats = root.xpath('//td[@class="stats" and normalize-space(text())]/text()')
In [6]: print players, stats
['\nPlayerA\n', '\nPlayerB\n'] ['\n29.2\n', '\n32.2\n']
In [7]: for player, stat in zip(players, stats):
   ...:     print player.strip(), stat.strip()
   ...:
PlayerA 29.2
PlayerB 32.2
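For completeness, the regex-only route the question asked about can be sketched as follows. It is a fragile illustration that assumes every row has exactly the shape of the sample (a player link ending in LS=<digits>, and the rating as the first <td> whose only content is a number); the pattern is an assumption written for this sample, not a general HTML matcher.

```python
import re

source = '''<tr class="stats">
<td class="stats" colspan="1" valign="top">
<a href="index.php?c=playerview&P=245&LS=113">
PlayerA
</a>
</td>
<td class="stats" colspan="1" valign="top">
<b>
4
</b>
,
<b>
8
</b>
</td>
<td class="stats" colspan="1" valign="top">
29.2
</td>
<tr class="stats">
<td class="stats" colspan="1" valign="top">
<a href="index.php?c=playerview&P=245&LS=113">
PlayerB
</a>
</td>
<td class="stats" colspan="1" valign="top">
<b>
4
</b>
,
<b>
8
</b>
</td>
<td class="stats" colspan="1" valign="top">
32.2
</td>
'''

# Group 1: the link text. Group 2: the first later <td> holding only a number.
# re.DOTALL lets the lazy .*? skip over the intermediate cell and its newlines.
pattern = re.compile(
    r'LS=\d+">\s*(\w+)\s*</a>.*?<td[^>]*>\s*(\d+(?:\.\d+)?)\s*</td>',
    re.DOTALL,
)
pairs = pattern.findall(source)
for name, rating in pairs:
    print(name, rating)
```

Reading the file with read() instead of str(readlines()) keeps real newlines in the text, which is what the \s* parts of the pattern rely on.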

Python + BS: picking a specific word (location) from a webpage table

Hello all. I want to pick a word at a specific location from a table on a webpage. The source code looks like:
table = '''
<TABLE class=form border=0 cellSpacing=1 cellPadding=2 width=500>
<TBODY>
<TR>
<TD vAlign=top colSpan=3><IMG class=ad src="/images/ad.gif" width=1 height=1></TD></TR>
<TR>
<TH vAlign=top width=22>Code:</TH>
<TD class=dash vAlign=top width=5 lign="left"> </TD>
<TD class=dash vAlign=top width=30 align=left><B>BAN</B></TD></TR>
<TR>
<TH vAlign=top>Color:</TH>
<TD class=dash vAlign=top align=left> </TD>
<TD class=dash vAlign=top align=left>White</TD></TR>
<TR>
<TD colSpan=3> </TD></TR></TBODY></TABLE>
'''
I want to pick the word of color here (it could be “White”, "red" or something else). What I tried is:
soup = BeautifulSoup(table)
for a in soup.find_all('table')[0].find_all('tr')[2:3]:
    print a.text
It gives:
Color:
 
White
The output looks like 4 lines. I tried adding them to a list and then removing the unwanted ones, but without success.
What's the best way to pick only the color from the table?
Many thanks.
This matches all rows containing 'white', case-insensitively:
soup = BeautifulSoup(table)
res = []
for a in soup.find_all('table')[0].find_all('tr')[2:3]:
    if 'white' in a.text.lower():
        text = a.text.encode('ascii', 'ignore').replace(':','').split()
        res.append(text)
A slightly better implementation:
# this will iterate through all 'table' and 'tr' tags within each 'table'
res = [tr.text.encode('ascii', 'ignore').replace(':','').split()
       for table in soup.findAll('table') for tr in table.findAll('tr')
       if 'color' in tr.text.lower()]
print res
[['Color', 'White']]
To return only the colors themselves, use:
# Assuming the same format throughout the html
# if format is changing just add more logic
tr.text.encode('ascii', 'ignore').replace(':','').split()[1]
...
print res
['White']
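The text-cleanup step can be tried in isolation. This sketch assumes Python 3 and assumes the row text has the shape shown in the question's output (label, a non-breaking space from the spacer cell, then the value); row_text is a hand-written stand-in, not actual BeautifulSoup output.

```python
# Assumed shape of tr.text for the "Color" row: label, a non-breaking
# space (\xa0) from the spacer cell, then the value.
row_text = "\nColor:\n\xa0\nWhite\n"

# In Python 3, str.split() with no arguments splits on any Unicode
# whitespace, including \xa0, so no ASCII re-encoding step is needed.
label, color = row_text.replace(":", "").split()
print(color)
```

The tuple unpacking also acts as a check: if a row ever yields more or fewer than two words, it raises a ValueError instead of returning the wrong cell.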

Regex to search HTML and find a number after the occurrence of a string in python [closed]

I am having trouble parsing HTML in Python. I'm looking specifically for a regex solution, not for reasons why I shouldn't use regex. Other approaches might solve this better, but unfortunately my requirements rule out other modules or libraries. Thanks for the help.
I have the following HTML:
<tbody ID='archive'>
<tr><td valign="top">Type / Path</td>
<td colspan=2>CIFS / 10.5.0.5:/selva</td>
</tr>
<tr><td valign="top">Last availability</td>
<td colspan=2>1970-01-01 05:30:00</td>
</tr>
<tr><td valign="top">Capacity Internal / Archive</td>
<td colspan=2>3.7 / 10.0 GByte</td>
</tr>
<tr><td valign="top">Blocks To sync / Transferred / Lost</td>
<td colspan=2>951 / 0 / 15 (last 24 hours)</td>
</tr>
<tr><td valign="top">Bandwidth Available / Total usage</td>
<td colspan=2>0 kB/s / 0 kB/s</td>
</tr>
<tr><td valign="top">Buffer Usage / Capacity left</td>
<td colspan=2>100 % / 0 m</td>
</tr>
</tbody>
<tr bgcolor="#CCCCCC"><th onclick="showhide(this,'events')" align=left colspan=3 width="style: auto;">▽ Event and Action Setup</th></tr>
<tbody ID='events'>
<tr>
<td>Arming</td>
<td>Enabled</td>
</tr>
<tr>
<td>Events</td>
<td colspan=2>PI MI AS UC TimeSync </td>
</tr>
<tr>
<td>Actions</td>
<td colspan=2>(IP) REC FR</td>
</tr>
</tbody>
I need to get the number which comes after the Buffer Usage element (line 17 in the code above); in this case it is 100% (line 18 in the code above), and this number can have 1 to 3 digits.
How do I get this number extracted from the code above in Python?
The reason I need to do this is so I can send out an email if the buffer is above 10%. I can code that part, but I don't know how to extract the information from the HTML above.
The code will be run on a NAS box, so it would be ideal if the solution used only the Python standard library.
Anand Davis, please try this for a start:
from bs4 import BeautifulSoup
html = """<tbody ID='archive'>
<tr><td valign="top">Type / Path</td>
<td colspan=2>CIFS / 10.5.0.5:/selva</td>
</tr>
<tr><td valign="top">Last availability</td>
<td colspan=2>1970-01-01 05:30:00</td>
</tr>
<tr><td valign="top">Capacity Internal / Archive</td>
<td colspan=2>3.7 / 10.0 GByte</td>
</tr>
<tr><td valign="top">Blocks To sync / Transferred / Lost</td>
<td colspan=2>951 / 0 / 15 (last 24 hours)</td>
</tr>
<tr><td valign="top">Bandwidth Available / Total usage</td>
<td colspan=2>0 kB/s / 0 kB/s</td>
</tr>
<tr><td valign="top">Buffer Usage / Capacity left</td>
<td colspan=2>100 % / 0 m</td>
</tr>
</tbody>
<tr bgcolor="#CCCCCC"><th onclick="showhide(this,'events')" align=left colspan=3 width="style: auto;">▽ Event and Action Setup</th></tr>
<tbody ID='events'>
<tr><td>Arming</td>
<td>Enabled</td>
</tr>
<tr><td>Events</td>
<td colspan=2>PI MI AS UC TimeSync </td>
</tr>
<tr><td>Actions</td>
<td colspan=2>(IP) REC FR</td>
</tr>
</tbody>"""
html = BeautifulSoup(html)
trs = html.find_all('tr')
for td in trs:
    if "Buffer Usage / Capacity left" in td.text:
        print td.find_all("td")[1].text.split(" ")[0]
Output:
100
The trs variable holds a list of all the rows, each containing its individual cells. You can apply further operations to this list as required.
Please refer to the Beautiful Soup documentation.
You can pass text=re.compile("Buffer Usage") to find the td that contains the text Buffer Usage, then get the next td tag and extract the usage with re.
from bs4 import BeautifulSoup
import re
soup = BeautifulSoup(html)
txt = soup.find("td", text=re.compile("Buffer Usage")).find_next("td").text
print(re.search(r"\d+", txt).group())
100
If there is always a space, you can split:
print(txt.split(None, 1)[0])
Or, if other numbers can come first, search for the number before the %:
print(re.search(r"(\d+)\s+%", txt).group(1))
Using BeautifulSoup you can access the parts of your HTML.
The following code snippet extracts the usage as an integer, but assumes that the structure of the page is always the same. It takes the 2nd column in the 6th row (index 5) and parses it using a regex.
from bs4 import BeautifulSoup # A library with which to parse HTML (fragments)
import re
s = '''<tbody ID='archive'>
<tr><td valign="top">Type / Path</td>
<td colspan=2>CIFS / 10.5.0.5:/selva</td>
</tr>
<tr><td valign="top">Last availability</td>
<td colspan=2>1970-01-01 05:30:00</td>
</tr>
<tr><td valign="top">Capacity Internal / Archive</td>
<td colspan=2>3.7 / 10.0 GByte</td>
</tr>
<tr><td valign="top">Blocks To sync / Transferred / Lost</td>
<td colspan=2>951 / 0 / 15 (last 24 hours)</td>
</tr>
<tr><td valign="top">Bandwidth Available / Total usage</td>
<td colspan=2>0 kB/s / 0 kB/s</td>
</tr>
<tr><td valign="top">Buffer Usage / Capacity left</td>
<td colspan=2>100 % / 0 m</td>
</tr>
</tbody>
<tr bgcolor="#CCCCCC"><th onclick="showhide(this,'events')" align=left colspan=3 width="style: auto;">▽ Event and Action Setup</th></tr>
<tbody ID='events'>
<tr><td>Arming</td>
<td>Enabled</td>
</tr>
<tr><td>Events</td>
<td colspan=2>PI MI AS UC TimeSync </td>
</tr>
<tr><td>Actions</td>
<td colspan=2>(IP) REC FR</td>
</tr>
</tbody>'''
doc = BeautifulSoup(s)
row = doc.find_all('tr')[5]
column = row.find_all('td')[1]
usage_string = column.get_text()
r = re.match(r'(\d{1,3}) % .+', usage_string)
usage = int(r.group(1))
If the page content is a bit more dynamic, you need to write code that finds the correct row instead of picking it out by index like this.
The BeautifulSoup documentation should give you all information you need to refine the code if necessary.
One possibility would be to check for the "archive" ID and then scan the rows, checking the first TD for the "Buffer Usage" string.
As the other answers point out, regexes are not suited to parsing HTML. See this answer. However, if you cannot install a proper parsing library like Beautiful Soup, regexes are your best bet.
A regex that will solve the problem as desired is:
import re
text ="""<tr><td valign="top">Buffer Usage / Capacity left</td>
<td colspan=2>100 % / 0 m</td>"""
result = re.search(r"Buffer Usage.*\n.*?>(\d{1,3}) % .+",text).group(1)
print result # 100
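Building on that idea, a standard-library-only sketch that also covers the 10% alert threshold from the question might look like this in Python 3. The function name buffer_usage and the constant BUFFER_RE are hypothetical, and the pattern assumes the value cell directly follows the "Buffer Usage" label cell as in the sample.

```python
import re

# Label cell, its closing tag, then the next <td> starting with 1-3 digits and a %.
# re.DOTALL lets .*? cross the line break between the two cells.
BUFFER_RE = re.compile(
    r"Buffer Usage.*?</td>\s*<td[^>]*>\s*(\d{1,3})\s*%",
    re.DOTALL,
)


def buffer_usage(html):
    """Return the buffer usage percentage as an int, or None if not found."""
    match = BUFFER_RE.search(html)
    return int(match.group(1)) if match else None


html = '''<tr><td valign="top">Bandwidth Available / Total usage</td>
<td colspan=2>0 kB/s / 0 kB/s</td>
</tr>
<tr><td valign="top">Buffer Usage / Capacity left</td>
<td colspan=2>100 % / 0 m</td>
</tr>'''

usage = buffer_usage(html)
if usage is not None and usage > 10:
    print("Buffer usage is %d%%, above the 10%% threshold" % usage)
```

Returning None when nothing matches lets the caller tell "page layout changed" apart from "usage is low", which matters if an email alert depends on the result.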

How to extract info from varying table entries: Text vs. DIV vs. SPAN

I am new to python and searched the internet to find an answer to my problem, but so far I failed...
The problem: My aim is to extract data from websites, more specifically from the tables in these websites. The relevant snippet of the website's code is in the "data" variable in my Python example here:
from bs4 import BeautifulSoup
data = '''<table class="ds-table">
<tr>
<td class="data-label">year of birth:</td>
<td class="data-value">1994</td>
</tr>
<tr>
<td class="data-label">reporting period:</td>
<td class="data-value">
<span class="editable" id="c-scope_beginning_date">
? </span>
-
<span class="editable" id="c-scope_ending_date">
? </span>
</td>
</tr>
<tr>
<td class="data-label">reporting cycle:</td>
<td class="data-value">
<span class="editable" id="c-periodicity">
- </span>
</td>
</tr>
<tr>
<td class="data-label">grade:</td>
<td class="data-value">1.3, upper 10% of class</td>
</tr>
<tr>
<td class="data-label">status:</td>
<td class="data-value"></td>
</tr>
</table>
<table class="ds-table">
<tr>
<td class="data-label">economics:</td>
<td class="data-value"><span class="positive-value"></span></td>
</tr>
<tr>
<td class="data-label">statistics:</td>
<td class="data-value"><span class="negative-value"></span></td>
</tr>
<tr>
<td class="data-label">social:</td>
<td class="data-value"><div id="music_id" class="trigger"><span class="negative-value"></span></div></td>
</tr>
<tr>
<td class="data-label">misc:</td>
<td class="data-value">
<div id="c_assurance" class="">
<span class="positive-value"></span> </div>
</td>
</tr>
<tr>
<td class="data-label">recommendation:</td>
<td class="data-value">
<span class="negative-value"></span> </td>
</tr>
</table>'''
soup = BeautifulSoup(data)
For the class="data-label" so far I successfully implemented...
box_cdl = []
for i, cdl in enumerate(soup.findAll('td', attrs={'class': 'data-label'})):
    box_cdl.append(cdl.contents[0])
print box_cdl
...which extracts the text from the columns, in the (for me satisfying) output:
[u'year of birth:',
u'reporting period:',
u'reporting cycle:',
u'grade:',
u'status:',
u'economics:',
u'statistics:',
u'social:',
u'misc:',
u'recommendation:']
Where I get stuck is the part for class="data-value" with the div- and span-fields and that some of the relevant information is hidden in the span-class. Moreover, the amount of the tr-rows can change from website to website, e.g. "status" comes after "reporting cycle" (instead of "grade").
However, when I do...
box_cdv = []
for j, cdv in enumerate(soup.findAll('td', attrs={'class': 'data-value'})):
    box_cdv.append(cdv.contents[0])
print box_cdv
...I get the error:
Traceback (most recent call last):
File "<ipython-input-53-7d5c095cf647>", line 3, in <module>
box_cdv.append(cdv.contents[0])
IndexError: list index out of range
What I would like to get instead is something like this (corresponding to the above "data"-example):
[u'1994',
u'? - ?',
u'-',
u'1.3, upper 10% of class',
u'',
u'positive-value',
u'negative-value',
u'negative-value',
u'positive-value',
u'negative-value']
The Question: how can I extract this information and collect the relevant data from each tr-row, given that the adequate extraction-code depends on the type of the category (year of birth, reporting period, ..., recommendation)?
Or, asking differently: what code extracts me, depending on the category (year of birth, reporting period, ..., recommendation), the corresponding value (1994, ..., negative-value)?
Since the amount and the type of the table-entries can differ between websites, a simple "on the i-th entry do the following" procedure is not applicable. The thing I am looking for I think is something like "if you find the text "recommendation:", then extract the class-type from the span-field", I guess. But unfortunately I do not have any clue how to translate that into python-language.
Any help is highly appreciated.
You get that error because one of the tags doesn't have any children, so indexing its contents list raises an IndexError.
You can approach this in the following way:
1) Search for the data-label tags;
2) Find the next TD sibling;
3) Check if the sibling has text;
3a) If so, create a dict entry with the data-label as the key and the sibling text as its value;
3b) If not, check if the sibling's first child has a class containing -value;
4) Parse the data.
Example:
soup = BeautifulSoup(data, 'lxml')
result = {}
for tag in soup.find_all("td", { "class" : "data-label" }):
    NextSibling = tag.find_next("td", { "class" : "data-value" }).get_text(strip = True)
    if not NextSibling and len(tag.find_next("td").select('span[class*=-value]')) > 0:
        NextSibling = tag.find_next("td").select('span[class*=-value]')[0]["class"][0]
    result[tag.get_text(strip = True)] = NextSibling
print (result)
Result:
{
'year of birth:': '1994',
'reporting period:': '?-?',
'reporting cycle:': '-',
'grade:': '1.3, upper 10% of class',
'status:': '',
'economics:': 'positive-value',
'statistics:': 'negative-value',
'social:': 'negative-value',
'misc:': 'positive-value',
'recommendation:': 'negative-value'
}
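If installing BeautifulSoup or lxml is not an option, the same label-to-value idea can be sketched with the standard library's html.parser in Python 3. This is a swapped-in technique, not the answer's method; the class LabelValueParser and its attributes are hypothetical names, and the sample is a shortened version of the question's data.

```python
from html.parser import HTMLParser


class LabelValueParser(HTMLParser):
    """Collect {label: value} pairs from data-label / data-value cells."""

    def __init__(self):
        super().__init__()
        self.result = {}
        self._cell = None        # "data-label", "data-value", or None
        self._text = []          # stripped text fragments of the current cell
        self._label = None       # label waiting for its value cell
        self._span_class = None  # first *-value span class in the current cell

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "td" and classes and classes[0] in ("data-label", "data-value"):
            self._cell, self._text, self._span_class = classes[0], [], None
        elif tag == "span" and self._cell == "data-value" and self._span_class is None:
            self._span_class = next((c for c in classes if c.endswith("-value")), None)

    def handle_data(self, data):
        if self._cell:
            self._text.append(data.strip())

    def handle_endtag(self, tag):
        if tag != "td" or not self._cell:
            return
        text = "".join(self._text)
        if self._cell == "data-label":
            self._label = text
        elif self._label is not None:
            # Prefer the cell text; fall back to the span class, then to "".
            self.result[self._label] = text or self._span_class or ""
            self._label = None
        self._cell = None


sample = '''<table class="ds-table">
<tr><td class="data-label">year of birth:</td><td class="data-value">1994</td></tr>
<tr><td class="data-label">status:</td><td class="data-value"></td></tr>
<tr><td class="data-label">misc:</td><td class="data-value">
<div id="c_assurance" class=""><span class="positive-value"></span> </div></td></tr>
</table>'''

parser = LabelValueParser()
parser.feed(sample)
print(parser.result)
```

Keying the result by label rather than by position means the order and number of rows can change between websites without affecting the lookup.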
