I have the following (repeating) HTML text from which I need to extract some values using Python and regular expressions.
<tr>
<td width="35%">Demand No</td>
<td width="65%"><input type="text" name="T1" size="12" onFocus="this.blur()" value="876716001"></td>
</tr>
I can get the first value by using
match_det = re.compile(r'<td width="35.+?">(.+?)</td>').findall(html_source_det)
That pattern only covers a single line, though. I also need the second value, which is on the line right after the first one, and I cannot get that to work. I have tried the following, but it gives no match:
match_det = re.compile('<td width="35.+?">(.+?)</td>\n'
'<td width="65.+?value="(.+?)"></td>').findall(html_source_det)
Perhaps it fails because the text spans multiple lines, but I added "\n" at the end of the first line and thought that would resolve it. It did not.
What am I doing wrong?
The html_source is retrieved by downloading it (it is not a static HTML file like the one above; I only included it here so you could see the text). Maybe this is not the best way to get the source.
I am obtaining the html_source like this:
new_url = "https://webaccess.site.int/curracc/" + url_details #not a real url
myresponse_det = urllib2.urlopen(new_url)
html_source_det = myresponse_det.read()
Please do not try to parse HTML with regex, as it is not regular. Instead use an HTML parsing library like BeautifulSoup. It will make your life a lot easier! Here is an example with BeautifulSoup:
from bs4 import BeautifulSoup
html = '''<tr>
<td width="35%">Demand No</td>
<td width="65%"><input type="text" name="T1" size="12" onFocus="this.blur()" value="876716001"></td>
</tr>'''
soup = BeautifulSoup(html, 'html.parser')  # name the parser explicitly to avoid bs4's "no parser specified" warning
print soup.find('td', attrs={'width': '65%'}).findNext('input')['value']
Or more simply:
print soup.find('input', attrs={'name': 'T1'})['value']
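If you still want to stay with a regex, one likely reason the original pattern fails is that the downloaded page has more than a bare "\n" between the two cells (for example "\r\n" or leading indentation). That is an assumption, since the real page is not shown. A minimal Python 3 sketch that tolerates any whitespace between the cells by using \s*; the sample string is made up to mimic that situation:

```python
import re

# Hypothetical sample: same markup as in the question, but with Windows
# line endings and indentation between the cells (an assumption).
html_source_det = (
    '<tr>\r\n'
    '    <td width="35%">Demand No</td>\r\n'
    '    <td width="65%"><input type="text" name="T1" size="12" '
    'onFocus="this.blur()" value="876716001"></td>\r\n'
    '</tr>'
)

# \s* matches any run of whitespace (spaces, tabs, \r, \n) between the cells.
pattern = re.compile(
    r'<td width="35[^"]*">(.+?)</td>\s*'
    r'<td width="65[^"]*">.*?value="([^"]*)"'
)

match_det = pattern.findall(html_source_det)
print(match_det)  # [('Demand No', '876716001')]
```

This is still fragile compared with a real parser: any change in attribute order or quoting breaks it.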
Here's the complete HTML code of the page I'm trying to scrape, so please take a look first: https://codepen.io/bendaggers/pen/LYpZMNv
As you can see, this is the page source of mbasic.facebook.com.
What I'm trying to do is scrape all the anchor tags that have a pattern like this:
Example
<a class="cf" href="/profile.php?id=100044454444312&fref=fr_tab">
Example with wild card.
<a class="cf" href="*">
So I decided to use a wildcard, href="*", since the values are dynamic.
Here's my (not working) Python code.
driver.get('https://mbasic.facebook.com/cheska.cabral.796/friends')
pagex = re.compile(driver.page_source)
pattern = "<a class=\"cf\" href=\"*\">"
print(pagex.findall(pattern))
Note that the page contains several anchors matching this pattern, so I need to capture all of them and print them.
<td class="w n" style="vertical-align: middle"><img src="https://scontent.fceb2-1.fna.fbcdn.net/v/t1.0-1/cp0/e15/q65/p50x50/79342209_112439723581175_5245034566049071104_o.jpg?_nc_cat=108&_nc_sid=dbb9e7&efg=eyJpIjoiYiJ9&_nc_ohc=lADKURnNsk4AX8WTS1F&_nc_ht=scontent.fceb2-1.fna&_nc_tp=3&oh=96f40cb2f95acbcfe9f6e4dc6cb31161&oe=5EC27AEB" class="bo s" alt="Natividad Cruz, profile picture" /></td>
<td class="w t" style="vertical-align: middle"><a class="cf" href="/profile.php?id=100044454444312&fref=fr_tab">Natividad Cruz</a>
<td class="w n" style="vertical-align: middle"><img src="https://scontent.fceb2-1.fna.fbcdn.net/v/t1.0-1/cp0/e15/q65/p50x50/10306248_10201945477974508_4213924286888352892_n.jpg?_nc_cat=109&_nc_sid=dbb9e7&efg=eyJpIjoiYiJ9&_nc_ohc=Z2daQ-qGgpsAX8BmLKr&_nc_ht=scontent.fceb2-1.fna&_nc_tp=3&oh=22f2b487166a7cd06e4ff650af4f7a7b&oe=5EC34325" class="bo s" alt="John Vinas, profile picture" /></td>
<td class="w t" style="vertical-align: middle"><a class="cf" href="/john.vinas?fref=fr_tab">John Vinas</a>
My goal is to find all of these anchor tags and print them in the terminal. I'd appreciate your help on this. Thank you!
Tried another set of code but no luck :)
driver.get('https://mbasic.facebook.com/cheska.cabral.796/friends')
pagex = driver.page_source
pattern = "<td class=\".*\" style=\"vertical-align: middle\"><a class=\".*\">"
x = re.findall(pattern, pagex)
print(x)
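I think your wildcard needs a dot in front of it, like .* (in a regex, * on its own only repeats the character before it; the dot is what means "any character").
I'd also recommend using a library like Beautiful Soup for this; it might make your life easier.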
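To make the point about the dot concrete, here is a small sketch using two anchors copied from the question. Instead of .*, it uses [^"]* ("anything except a double quote") so each match stops at the end of its own href; that choice is my suggestion, not something from the original post.

```python
import re

# Two anchors taken from the question's sample markup.
page = (
    '<td class="w t" style="vertical-align: middle">'
    '<a class="cf" href="/profile.php?id=100044454444312&fref=fr_tab">Natividad Cruz</a>\n'
    '<td class="w t" style="vertical-align: middle">'
    '<a class="cf" href="/john.vinas?fref=fr_tab">John Vinas</a>\n'
)

# The parentheses capture only the href value; [^"]* cannot run past the
# closing quote, unlike a greedy .* which may swallow several tags at once.
pattern = r'<a class="cf" href="([^"]*)">'

links = re.findall(pattern, page)
print(links)
```

Note that the pattern is passed to re.findall and the page is the text being searched, which is the reverse of the first attempt in the question.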
You should use a parsing library, such as BeautifulSoup or requests-html. If you want to do it manually, then build on the second attempt you posted. The first won't get you what you want because you are compiling the entire page as a regular expression.
import re
s = """<a class="cf" href="/profile.php?id=100044454444312&fref=fr_tab">\n\n<h1>\n<a class="cf" href="/profile.php?id=20004666644312&fref=fr_tab">"""
patt = r'<a.*?class[="]{2}cf.*?href.*?profile.*?>'
matches = re.findall(patt, s)
Output
>>> matches
['<a class="cf" href="/profile.php?id=100044454444312&fref=fr_tab">',
'<a class="cf" href="/profile.php?id=20004666644312&fref=fr_tab">']
As the previous respondent mentioned, BeautifulSoup is probably the best option available in Python for scraping web pages. To import Beautiful Soup and the other libraries, use the following:
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup
After that, the following commands should do what you need:
req=Request(url,headers = {'User-Agent': 'Chrome/64.0.3282.140'})
result=urlopen(req).read()
soup = BeautifulSoup(result, "html.parser")
atags=soup('a')
Here url is the link you want to scrape, and the headers argument sets the User-Agent string (browser name and version) sent with the request.
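The snippet above stops at collecting every anchor in atags. With BeautifulSoup you would typically narrow that down with something like soup.find_all('a', class_='cf'). If installing bs4 is not an option, here is a rough stdlib-only sketch of the same filtering using html.parser instead of BeautifulSoup. The class name CfLinkCollector and the sample markup are illustrative, not part of the original posts.

```python
from html.parser import HTMLParser


class CfLinkCollector(HTMLParser):
    """Collects the href of every <a> tag whose class list contains 'cf'."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        classes = (attr_map.get("class") or "").split()
        if tag == "a" and "cf" in classes and attr_map.get("href"):
            self.links.append(attr_map["href"])


# Sample markup based on the question; the second row has no class "cf".
sample = (
    '<td class="w t"><a class="cf" href="/john.vinas?fref=fr_tab">John Vinas</a></td>'
    '<td class="w n"><a class="other" href="/ignored">skip me</a></td>'
    '<td class="w t"><a class="cf" href="/profile.php?id=100044454444312">Natividad Cruz</a></td>'
)

collector = CfLinkCollector()
collector.feed(sample)
print(collector.links)
```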
Sorry for this silly question as I'm new to web scraping and have no knowledge about HTML etc.
I'm trying to scrape data from this website. Specifically, from this part/table of the page:
末"四"位数 9775,2275,4775,7275
末"五"位数 03881,23881,43881,63881,83881,16913,66913
末"六"位数 313110,563110,813110,063110
末"七"位数 4210962,9210962,9785582
末"八"位数 63262036
末"九"位数 080876872
I'm sorry that it's in Chinese and looks terrible, since I can't embed the picture. The table is roughly in the middle of the page (about 40% of the way down from the top). The table id is 'tr_zqh'.
Here is my source code:
import bs4 as bs
import urllib.request
def scrapezqh(url):
    source = urllib.request.urlopen(url).read()
    page = bs.BeautifulSoup(source, 'html.parser')
    print(page)
url = 'http://data.eastmoney.com/xg/xg/detail/300741.html?tr_zqh=1'
print(scrapezqh(url))
It scrapes most of the table except the part that I'm interested in. Here is a part of what it returns, where I think the data should be:
<td class="tdcolor">网下有效申购股数(万股)
</td>
<td class="tdwidth" id="td_wxyxsggs">
</td>
</tr>
<tr id="tr_zqh">
<td class="tdtitle" id="td_zqhrowspan">中签号
</td>
<td class="tdcolor">中签号公布日期
</td>
<td class="ltxt" colspan="3"> 2018-02-22 (周四)
</td>
I'd like to get the content of this table row: tr id="tr_zqh" (the 6th line above). For some reason its data is not scraped (there is no content below it), yet when I inspect the page in the browser the data is in the table. I don't think it is a dynamic table, which BeautifulSoup4 can't handle. I've tried both the lxml and html parsers, and I've tried pandas.read_html; all returned the same results. I'd like some help understanding why the data is missing and how I can fix it. Many thanks!
I forgot to mention that I tried page.find('tr'). It returned a part of the table but not the lines I'm interested in: page.find('tr') returns the 1st line of the screenshot, and I want the data of the 2nd and 3rd lines (highlighted in the screenshot).
If you extract a couple of variables from the initial page, you can use them to make a request to the API directly. That returns a JSON object (wrapped in a "var zqh=" prefix) from which you can read the data.
import requests
import re
import json
from pprint import pprint
s = requests.session()
r = s.get('http://data.eastmoney.com/xg/xg/detail/300741.html?tr_zqh=1')
gdpm = re.search('var gpdm = \'(.*)\'', r.text).group(1)
token = re.search('http://dcfm.eastmoney.com/em_mutisvcexpandinterface/api/js/get\?type=XGSG_ZQH&token=(.*)&st=', r.text).group(1)
url = "http://dcfm.eastmoney.com/em_mutisvcexpandinterface/api/js/get?type=XGSG_ZQH&token=" + token + "&st=LASTFIGURETYPE&sr=1&filter=%28securitycode='" + gdpm + "'%29&js=var%20zqh=%28x%29"
r = s.get(url)
j = json.loads(r.text[8:])
for i in range(len(j)):
    print(j[i]['LOTNUM'])
#pprint(j)
Outputs:
9775,2275,4775,7275
03881,23881,43881,63881,83881,16913,66913
313110,563110,813110,063110
4210962,9210962,9785582
63262036
080876872
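The r.text[8:] slice in the code above works because the response starts with the 8-character prefix "var zqh=" (set by the js=var%20zqh=%28x%29 part of the URL). A slightly more defensive sketch of that step follows. The raw string is a made-up stand-in for the real response; only the LOTNUM key and the values are taken from the answer.

```python
import json

# Hypothetical response body, shaped like what the answer's code expects:
# a JavaScript assignment wrapping a JSON list of objects with a LOTNUM key.
raw = 'var zqh=[{"LOTNUM": "9775,2275,4775,7275"}, {"LOTNUM": "63262036"}]'

prefix = "var zqh="
# Strip the prefix only if it is really there, instead of a fixed [8:] slice.
payload = raw[len(prefix):] if raw.startswith(prefix) else raw

rows = json.loads(payload)
lot_numbers = [row["LOTNUM"] for row in rows]
for lot in lot_numbers:
    print(lot)
```

Checking for the prefix means the code keeps working if the js parameter, and therefore the wrapper, is changed later.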
From where I look at things your question isn't clear to me. But here's what I did.
I do a lot of webscraping so I just made a package to get me beautiful soup objects of any webpage. Package is here.
So my answer depends on that. But you can take a look at the sourcecode and see that there's really nothing esoteric about it. You may drag out the soup-making part and use as you wish.
Here we go.
pip install pywebber --upgrade
from pywebber import PageRipper
page = PageRipper(url='http://data.eastmoney.com/xg/xg/detail/300741.html?tr_zqh=1', parser='html5lib')
page_soup = page.soup
tr_zqh_table = page_soup.find('tr', id='tr_zqh')
From here you can call tr_zqh_table.find_all('td'):
tr_zqh_table.find_all('td')
Output
[
<td class="tdtitle" id="td_zqhrowspan">中签号
</td>, <td class="tdcolor">中签号公布日期
</td>, <td class="ltxt" colspan="3"> 2018-02-22 (周四)
</td>
]
Going a bit further
for td in tr_zqh_table.find_all('td'):
    print(td.contents)
Output
['中签号\n ']
['中签号公布日期\n ']
['\xa02018-02-22 (周四)\n ']
I'm facing quite a tricky problem while trying to fetch some data with BeautifulSoup.
I'd like to find all the tables that have certain text in them (in my example code 'Name:', 'City:' and 'Address:') and parse the text that is located in the very next table in the source code.
Page source code:
...
...
<td>Name:</td>
<td>John</td>
...
<td>City:</td>
<td>London</td>
...
<td>Address:</td>
<td>Bowling Alley 123</td>
...
...
I'd like to parse: "John", "London", "Bowling Alley 123"
Sorry I don't have any python code here to show my past effort, but it's because I've no idea where to start. Thanks!
This is clunky, but depending on how your TD's are wrapped and how consistent your TD targets are, you should be able to find them, iterate through them and use findNextSibling() to get your data:
from BeautifulSoup import BeautifulSoup
html = """\
<table>
<tr>
<td>Name:</td>
<td>John</td>
</tr>
<tr>
<td>City:</td>
<td>London</td>
</tr>
<tr>
<td>Address:</td>
<td>Bowling Alley 123</td>
</tr>
</table>
"""
targets=["City:","Address:","Name:"]
soup = BeautifulSoup(html)
for tr in soup.findAll("tr"):
    for td in tr.findAll("td"):
        if td.text in targets:
            print td.findNextSibling().text
Bottom line, as long as you've got some sane/normal elements containing your TD's, using the NextSibling functions should get you where you're going.
Whether this works properly depends on whether the HTML is properly formed, but it will likely work even if there are extraneous newlines or other text.
import bs4
def parseCAN(html):
    b = bs4.BeautifulSoup(html)
    matches = ('City:', 'Address:', 'Name:')
    found = []
    elements = b.findAll('td')
    for n, e in enumerate(elements):
        if e.text not in matches:
            continue
        if n < len(elements) - 1:
            found.append(elements[n+1].text)
    return found
When I try to capture the following cell:
<td>But<200g/M2</td>
name = fila.select('.//td[2]/text()').extract()
I only capture the following:
"But"
Apparently there is a conflict with the characters "<" and "/".
Escape the special characters with a '\', so:
But\<200g\/M2
Note that creating a file with those characters wouldn't be so easy.
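For what it's worth, inside HTML itself a literal "<" in text is normally written as the entity &lt; rather than with a backslash, and parsers turn it back into "<" when you read the text. A minimal sketch, assuming you control how the page is generated (if you are scraping someone else's page, you cannot change its markup and need a lenient parser instead):

```python
import html

raw_text = "But<200g/M2"

# html.escape replaces &, < and > (and quotes by default) with entities;
# the slash is left alone because it is not special in text content.
escaped = html.escape(raw_text)
print(escaped)   # But&lt;200g/M2

# html.unescape reverses the process, which is what a parser does for you.
restored = html.unescape(escaped)
print(restored)  # But<200g/M2
```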
Here is an approach that uses BeautifulSoup, in case you have more luck with a different library:
from bs4 import BeautifulSoup
soup = BeautifulSoup("""<html><head><title>StackOverflow-Question</title></head><body>
<table>
<tr>
<td>Ifs</td>
<td>Ands</td>
<td>But<200g/M2</td>
</tr>
</table>
</body></html>""")
print soup.find_all('td')[2].get_text()
The output of this is:
But<200g/M2
If you wanted to use XPath you could also use The ElementTree XML API. Here I'm using BeautifulSoup to take HTML and convert it to valid XML so I can run an XPath query against it:
from bs4 import BeautifulSoup
import xml.etree.ElementTree as ET
html = """<html><head><title>StackOverflow-Question</title></head><body>
<table>
<tr>
<td>Ifs / Ands / Or</td>
<td>But<200g/M2</td>
</tr>
</table>
</body></html>"""
soup = BeautifulSoup(html)
root = ET.fromstring(soup.prettify())
print root.findall('.//td[2]')[0].text
The output of this is the same. Note that the HTML is slightly different: the target cell is now the second one, and the query uses td[2] because XPath positions start at 1 while Python indexes start at 0.
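The 1-based versus 0-based difference can be seen with ElementTree alone. This is only a sketch on hand-written, already well-formed XML (the "<" is written as &lt; here, which is an assumption about the input, and the BeautifulSoup clean-up step from the answer is skipped):

```python
import xml.etree.ElementTree as ET

# Hand-written, well-formed version of the table from the answer.
xml_source = """<table>
  <tr>
    <td>Ifs / Ands / Or</td>
    <td>But&lt;200g/M2</td>
  </tr>
</table>"""

root = ET.fromstring(xml_source)

# XPath position predicate: td[2] is the SECOND td of its parent (1-based).
second_cell = root.findall('.//td[2]')[0].text

# Python list indexing on the result is 0-based: [0] is the first match.
first_cell = root.findall('.//td')[0].text

print(second_cell)
print(first_cell)
```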
Here is the part of the HTML:
<td class="team-name">
<div class="goat_australia"></div>
Melbourne<br />
Today
</td>
<td class="team-name">
<div class="goat_australia"></div>
Sydney<br />
Tomorrow
</td>
So I would like to return all the td tags with the class name "team-name", but only those that contain the text "Today".
My code so far:
from BeautifulSoup import BeautifulSoup
import urllib2, re
starting_url = urllib2.urlopen('http://www.mysite.com.au/').read()
soup = BeautifulSoup(''.join(starting_url))
soup2 = soup.findAll("td", {'class':'team-name'})
for entry in soup2:
    if "Today" in soup2:
        print entry
If I run this, nothing is returned.
If I take out that last if statement and just put
print soup2
I get back all the td tags, but some have "Today" and some have "Tomorrow" etc.
So, any pointers? Is there a way to pass two conditions to the soup.findAll function?
I also tried running a findAll on a findAll; that did not work.
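Your if statement tests the whole result list (soup2) instead of the current entry, so it never matches. Using the structure of the code you've got currently, try looking for "Today" inside each entry with an embedded findAll: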
for entry in soup2:
    if entry.findAll(text=re.compile("Today")):
        print entry