How to extract specific string on a web page using Python - python

Here's the complete HTML Code of the page that I'm trying to scrape so please take a look first https://codepen.io/bendaggers/pen/LYpZMNv
As you can see, this is the page source of mbasic.facebook.com.
What I'm trying to do is scrape all the anchor tags that have a pattern like this:
Example
<a class="cf" href="/profile.php?id=100044454444312&fref=fr_tab">
Example with wild card.
<a class="cf" href="*">
so I decided to add a wild card identifier after href="*" since the value are dynamic.
Here's my (not working) Python Code.
driver.get('https://mbasic.facebook.com/cheska.cabral.796/friends')
pagex = re.compile(driver.page_source)
pattern = "<a class=\"cf\" href=\"*\">"
print(pagex.findall(pattern))
Note that in the page, there are several patterns like this so I need to capture all and print it.
<td class="w n" style="vertical-align: middle"><img src="https://scontent.fceb2-1.fna.fbcdn.net/v/t1.0-1/cp0/e15/q65/p50x50/79342209_112439723581175_5245034566049071104_o.jpg?_nc_cat=108&_nc_sid=dbb9e7&efg=eyJpIjoiYiJ9&_nc_ohc=lADKURnNsk4AX8WTS1F&_nc_ht=scontent.fceb2-1.fna&_nc_tp=3&oh=96f40cb2f95acbcfe9f6e4dc6cb31161&oe=5EC27AEB" class="bo s" alt="Natividad Cruz, profile picture" /></td>
<td class="w t" style="vertical-align: middle"><a class="cf" href="/profile.php?id=100044454444312&fref=fr_tab">Natividad Cruz</a>
<td class="w n" style="vertical-align: middle"><img src="https://scontent.fceb2-1.fna.fbcdn.net/v/t1.0-1/cp0/e15/q65/p50x50/10306248_10201945477974508_4213924286888352892_n.jpg?_nc_cat=109&_nc_sid=dbb9e7&efg=eyJpIjoiYiJ9&_nc_ohc=Z2daQ-qGgpsAX8BmLKr&_nc_ht=scontent.fceb2-1.fna&_nc_tp=3&oh=22f2b487166a7cd06e4ff650af4f7a7b&oe=5EC34325" class="bo s" alt="John Vinas, profile picture" /></td>
<td class="w t" style="vertical-align: middle"><a class="cf" href="/john.vinas?fref=fr_tab">John Vinas</a>
My goal is to print or findall the anchor tags and display it in terminal. Appreciate your help on this. Thank you!
Tried another set of code but no luck :)
driver.get('https://mbasic.facebook.com/cheska.cabral.796/friends')
pagex = driver.page_source
pattern = "<td class=\".*\" style=\"vertical-align: middle\"><a class=\".*\">"
x = re.findall(pattern, pagex)
print(x)

I think your wildcard match needs a dot in front like .*
I'd also recommend using a library like Beautiful Soup for this, it might make your life easier.

You should use a parsing library, such as BeautifulSoup or requests-html. If you want to do it manually, then build on the second attempt you posted. The first won't get you what you want because you are compiling the entire page as a regular expression.
import re
s = """<a class="cf" href="/profile.php?id=100044454444312&fref=fr_tab">\n\n<h1>\n<a class="cf" href="/profile.php?id=20004666644312&fref=fr_tab">"""
patt = r'<a.*?class[="]{2}cf.*?href.*?profile.*?>'
matches = re.findall(patt, s)
Output
>>>matches
['<a class="cf" href="/profile.php?id=100044454444312&fref=fr_tab">',
'<a class="cf" href="/profile.php?id=20004666644312&fref=fr_tab">']

As mentioned by the previous respondent, BeautifulSoup is the best thats available out there in python to scrape web pages. To import beautiful soup and other libraries use the following commands
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup
Post this the below set of commands should solve your purpose
req=Request(url,headers = {'User-Agent': 'Chrome/64.0.3282.140'})
result=urlopen(req).read()
soup = BeautifulSoup(result, "html.parser")
atags=soup('a')
url in the above command is the link you want to scrape and headers argument takes by browser specs/version

Related

BeautifulSoup4 not able to scrape data from this table

Sorry for this silly question as I'm new to web scraping and have no knowledge about HTML etc.
I'm trying to scrape data from this website. Specifically, from this part/table of the page:
末"四"位数 9775,2275,4775,7275
末"五"位数 03881,23881,43881,63881,83881,16913,66913
末"六"位数 313110,563110,813110,063110
末"七"位数 4210962,9210962,9785582
末"八"位数 63262036
末"九"位数 080876872
I'm sorry that's in Chinese and it looks terrible since I can't embed the picture. However, The table is roughly in the middle(40 percentile from the top) of the page. The table id is 'tr_zqh'.
Here is my source code:
import bs4 as bs
import urllib.request
def scrapezqh(url):
source = urllib.request.urlopen(url).read()
page = bs.BeautifulSoup(source, 'html.parser')
print(page)
url = 'http://data.eastmoney.com/xg/xg/detail/300741.html?tr_zqh=1'
print(scrapezqh(url))
It scrapes most of the table but the part that I'm interested in. Here is a part of what it returns where I think the data should be:
<td class="tdcolor">网下有效申购股数(万股)
</td>
<td class="tdwidth" id="td_wxyxsggs"> 
</td>
</tr>
<tr id="tr_zqh">
<td class="tdtitle" id="td_zqhrowspan">中签号
</td>
<td class="tdcolor">中签号公布日期
</td>
<td class="ltxt" colspan="3"> 2018-02-22 (周四)
</td>
I'd like to get the content of this table: tr id="tr_zqh" (the 6th row above). However for some reason it doesn't scrape its data(No content below). However, when I check the source code of the webpage, the data are in the table. I don't think it is a dynamic table which BeautifulSoup4 can't handle. I've tried both lxml and html parser and I've tried pandas.read_html. It returned the same results. I'd like to get some help to understand why it doesn't get the data and how I can fix it. Many thanks!
Forgot to mention that I tried page.find('tr'), it returned a part of the table but not the lines I'm interested. Page.find('tr') returns the 1st line of the screenshot. I want to get the data of the 2nd & 3rd line(highlighted in the screenshot)
If you extract a couple of variables from the initial page you can use themto make a request to the api directly. Then you get a json object which you can use to get the data.
import requests
import re
import json
from pprint import pprint
s = requests.session()
r = s.get('http://data.eastmoney.com/xg/xg/detail/300741.html?tr_zqh=1')
gdpm = re.search('var gpdm = \'(.*)\'', r.text).group(1)
token = re.search('http://dcfm.eastmoney.com/em_mutisvcexpandinterface/api/js/get\?type=XGSG_ZQH&token=(.*)&st=', r.text).group(1)
url = "http://dcfm.eastmoney.com/em_mutisvcexpandinterface/api/js/get?type=XGSG_ZQH&token=" + token + "&st=LASTFIGURETYPE&sr=1&filter=%28securitycode='" + gdpm + "'%29&js=var%20zqh=%28x%29"
r = s.get(url)
j = json.loads(r.text[8:])
for i in range (len(j)):
print ( j[i]['LOTNUM'])
#pprint(j)
Outputs:
9775,2275,4775,7275
03881,23881,43881,63881,83881,16913,66913
313110,563110,813110,063110
4210962,9210962,9785582
63262036
080876872
From where I look at things your question isn't clear to me. But here's what I did.
I do a lot of webscraping so I just made a package to get me beautiful soup objects of any webpage. Package is here.
So my answer depends on that. But you can take a look at the sourcecode and see that there's really nothing esoteric about it. You may drag out the soup-making part and use as you wish.
Here we go.
pip install pywebber --upgrade
from pywebber import PageRipper
page = PageRipper(url='http://data.eastmoney.com/xg/xg/detail/300741.html?tr_zqh=1', parser='html5lib')
page_soup = page.soup
tr_zqh_table = page_soup.find('tr', id='tr_zqh')
from here you can do tr_zqh_table.find_all('td')
tr_zqh_table.find_all('td')
Output
[
<td class="tdtitle" id="td_zqhrowspan">中签号
</td>, <td class="tdcolor">中签号公布日期
</td>, <td class="ltxt" colspan="3"> 2018-02-22 (周四)
</td>
]
Going a bit further
for td in tr_zqh_table.find_all('td'):
print(td.contents)
Output
['中签号\n ']
['中签号公布日期\n ']
['\xa02018-02-22 (周四)\n ']

Can't get a regex pattern to work in Python

I have the following (repeating) HTML text from which I need to extract some values using Python and regular expressions.
<tr>
<td width="35%">Demand No</td>
<td width="65%"><input type="text" name="T1" size="12" onFocus="this.blur()" value="876716001"></td>
</tr>
I can get the first value by using
match_det = re.compile(r'<td width="35.+?">(.+?)</td>').findall(html_source_det)
But the above is on one line. However, I also need to get the second value which is on the line following the first one but I cannot get it to work. I have tried the following, but I won't get a match
match_det = re.compile('<td width="35.+?">(.+?)</td>\n'
'<td width="65.+?value="(.+?)"></td>').findall(html_source_det)
Perhaps I am unable to get it to work since the text is multiline, but I added "\n" at the end of the first line, so I thought this would resolve it but it did not.
What I am doing wrong?
The html_source is retrieved downloading it (it is not a static HTML file like outlined above - I only put it here so you could see the text). Maybe this is not the best way in getting the source.
I am obtaining the html_source like this:
new_url = "https://webaccess.site.int/curracc/" + url_details #not a real url
myresponse_det = urllib2.urlopen(new_url)
html_source_det = myresponse_det.read()
Please do not try to parse HTML with regex, as it is not regular. Instead use an HTML parsing library like BeautifulSoup. It will make your life a lot easier! Here is an example with BeautifulSoup:
from bs4 import BeautifulSoup
html = '''<tr>
<td width="35%">Demand No</td>
<td width="65%"><input type="text" name="T1" size="12" onFocus="this.blur()" value="876716001"></td>
</tr>'''
soup = BeautifulSoup(html)
print soup.find('td', attrs={'width': '65%'}).findNext('input')['value']
Or more simply:
print soup.find('input', attrs={'name': 'T1'})['value']

Find 2 attributes in BeautifulSoup

Here is the part of the HTML:
<td class="team-name">
<div class="goat_australia"></div>
Melbourne<br />
Today
</td>
<td class="team-name">
<div class="goat_australia"></div>
Sydney<br />
Tomorrow
</td>
So i would like to return all these td tags with the class name "team-name", and only if it contains the text "Today" in it.
My code so far:
from BeautifulSoup import BeautifulSoup
import urllib2, re
starting_url = urllib2.urlopen('http://www.mysite.com.au/').read()
soup = BeautifulSoup(''.join(starting_url))
soup2 = soup.findAll("td", {'class':'team-name'})
for entry in soup2:
if "Today" in soup2:
print entry
If i run this nothing returns.
If i take out that last if statement and just put
print soup2
I get back all the td tags, but some have "Today" and some have "Tomorrow" etc.
So any pointers? is there a way to add 2 attributes to the soup.findAll function?
I also tried running a findAll on a findAll, that did not work.
Using the structure of the code you've got currently, try looking for "Today" with an embedded findAll:
for entry in soup2:
if entry.findAll(text=re.compile("Today")):
print entry

Printing certain HTML Python Mechanize

Im making a small python script for auto logon to a website. But i'm stuck.
I'm looking to print into terminal a small part of the html, located within this tag in the html file on the site:
<td class=h3 align='right'> John Appleseed</td><td> <img border=0 src="../tbs_v7_0/images/myaccount.gif" alt="My Account"></td>
But how do I extract and print just the name, John Appleseed?
I'm using Pythons' Mechanize on a mac, by the way.
Mechanize is only good for fetching the html. Once you want to extract information from the html, you could use for example BeautifulSoup. (See also my answer to a similar question: Web mining or scraping or crawling? What tool/library should I use?)
Depending on where the <td> is located in the html (it's unclear from your question), you could use the following code:
html = ... # this is the html you've fetched
from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup(html)
# use this (gets all <td> elements)
cols = soup.findAll('td')
# or this (gets only <td> elements with class='h3')
cols = soup.findAll('td', attrs={"class" : 'h3'})
print cols[0].renderContents() # print content of first <td> element
As you have not provided the full HTML of the page, the only option right now is either using string.find() or regular expressions.
But, the standard way of finding this is using xpath. See this question: How to use Xpath in Python?
You can obtain the xpath for an element using "inspect element" feature of firefox.
For ex, if you want to find the XPATH for username in stackoverflow site.
Open firefox and login to the website & RIght-click on username(shadyabhi in my case) and select Inspect Element.
Keep your mouse over tag or right click it and "Copy xpath".
You can use a parser to extract any information in a document. I suggest you to use lxml module.
Here you have an example:
from lxml import etree
from StringIO import StringIO
parser = etree.HTMLParser()
tree = etree.parse(StringIO("""<td class=h3 align='right'> John Appleseed</td><td> <img border=0 src="../tbs_v7_0/images/myaccount.gif" alt="My Account"></td>"""),parser)
>>> tree.xpath("string()").strip()
u'John Appleseed'
More information about lxml here

Problems with regex matching

I'm currently learning Python and I'm trying to make a small scraper but I'm running in to problems with Beautiful Soup and regex.
I am trying to match all links in a site that has the following html:
<td>
Place Number 1
</td>
<td width="100">
California </td>
<td>
Place Number 2
</td>
<td width="100">
Florida </td>
I want to get all the following links : "/lxxxx/Place+Number+x"
I am using python and beautifulsoup for this:
import BeautifulSoup
import urllib2
import re
address = 'http://www.example.com'
html = urllib2.urlopen(address).read()
soup = BeautifulSoup(html)
for tag in soup.findAll('a', id = re.compile('l[0-9]*')):
print tag['href']
The regex part in the soup.findAll I found on some example code because I can't seem to get the example from the BeautifulSoup documentation to work.Without the regex part, I got all the links on the page, but I only want the "lxxx" ones
What am I doing wrong with my regex? Maybe there's a way to do this wthout regexes, but I can't seem to find a way.
Shouldn't you be trying to do the regex match on href and not id?
for tag in soup.findAll('a', href = re.compile('l[0-9]*')):
print tag['href']
I would suggest
for tag in soup.findAll('a', href = re.compile('^/l[0-9]+/.*$')):
print tag['href']
for avoiding tags looking like but not exactly what you are look like
Apart from check href not id
re.compile(r'^\/l[0-9]{4}/Place\+Number\+[0-9]+')
match seems to assume your regex starts with "^".
>>> m = re.compile(r"abc")
>>> m.match("eabc")
>>> m.match("abcd")
<_sre.SRE_Match object at 0x7f23192318b8>
So adding the \/ allows the first slash to be matched. Also I'm using {4} to match four numbers rather than * which will match zero or more numbers.
>>> m = re.compile(r'\/l[0-9]*')
>>> m.match("/longurl/somewhere")
<_sre.SRE_Match object at 0x7f2319231850>

Categories