I'm currently learning Python and I'm trying to make a small scraper but I'm running in to problems with Beautiful Soup and regex.
I am trying to match all links in a site that has the following html:
<td>
Place Number 1
</td>
<td width="100">
California </td>
<td>
Place Number 2
</td>
<td width="100">
Florida </td>
I want to get all the following links : "/lxxxx/Place+Number+x"
I am using python and beautifulsoup for this:
import BeautifulSoup
import urllib2
import re
address = 'http://www.example.com'
html = urllib2.urlopen(address).read()
soup = BeautifulSoup(html)
for tag in soup.findAll('a', id = re.compile('l[0-9]*')):
print tag['href']
The regex part in the soup.findAll I found on some example code because I can't seem to get the example from the BeautifulSoup documentation to work.Without the regex part, I got all the links on the page, but I only want the "lxxx" ones
What am I doing wrong with my regex? Maybe there's a way to do this wthout regexes, but I can't seem to find a way.
Shouldn't you be trying to do the regex match on href and not id?
for tag in soup.findAll('a', href = re.compile('l[0-9]*')):
print tag['href']
I would suggest
for tag in soup.findAll('a', href = re.compile('^/l[0-9]+/.*$')):
print tag['href']
for avoiding tags looking like but not exactly what you are look like
Apart from check href not id
re.compile(r'^\/l[0-9]{4}/Place\+Number\+[0-9]+')
match seems to assume your regex starts with "^".
>>> m = re.compile(r"abc")
>>> m.match("eabc")
>>> m.match("abcd")
<_sre.SRE_Match object at 0x7f23192318b8>
So adding the \/ allows the first slash to be matched. Also I'm using {4} to match four numbers rather than * which will match zero or more numbers.
>>> m = re.compile(r'\/l[0-9]*')
>>> m.match("/longurl/somewhere")
<_sre.SRE_Match object at 0x7f2319231850>
Related
Here's the complete HTML Code of the page that I'm trying to scrape so please take a look first https://codepen.io/bendaggers/pen/LYpZMNv
As you can see, this is the page source of mbasic.facebook.com.
What I'm trying to do is scrape all the anchor tags that have a pattern like this:
Example
<a class="cf" href="/profile.php?id=100044454444312&fref=fr_tab">
Example with wild card.
<a class="cf" href="*">
so I decided to add a wild card identifier after href="*" since the value are dynamic.
Here's my (not working) Python Code.
driver.get('https://mbasic.facebook.com/cheska.cabral.796/friends')
pagex = re.compile(driver.page_source)
pattern = "<a class=\"cf\" href=\"*\">"
print(pagex.findall(pattern))
Note that in the page, there are several patterns like this so I need to capture all and print it.
<td class="w n" style="vertical-align: middle"><img src="https://scontent.fceb2-1.fna.fbcdn.net/v/t1.0-1/cp0/e15/q65/p50x50/79342209_112439723581175_5245034566049071104_o.jpg?_nc_cat=108&_nc_sid=dbb9e7&efg=eyJpIjoiYiJ9&_nc_ohc=lADKURnNsk4AX8WTS1F&_nc_ht=scontent.fceb2-1.fna&_nc_tp=3&oh=96f40cb2f95acbcfe9f6e4dc6cb31161&oe=5EC27AEB" class="bo s" alt="Natividad Cruz, profile picture" /></td>
<td class="w t" style="vertical-align: middle"><a class="cf" href="/profile.php?id=100044454444312&fref=fr_tab">Natividad Cruz</a>
<td class="w n" style="vertical-align: middle"><img src="https://scontent.fceb2-1.fna.fbcdn.net/v/t1.0-1/cp0/e15/q65/p50x50/10306248_10201945477974508_4213924286888352892_n.jpg?_nc_cat=109&_nc_sid=dbb9e7&efg=eyJpIjoiYiJ9&_nc_ohc=Z2daQ-qGgpsAX8BmLKr&_nc_ht=scontent.fceb2-1.fna&_nc_tp=3&oh=22f2b487166a7cd06e4ff650af4f7a7b&oe=5EC34325" class="bo s" alt="John Vinas, profile picture" /></td>
<td class="w t" style="vertical-align: middle"><a class="cf" href="/john.vinas?fref=fr_tab">John Vinas</a>
My goal is to print or findall the anchor tags and display it in terminal. Appreciate your help on this. Thank you!
Tried another set of code but no luck :)
driver.get('https://mbasic.facebook.com/cheska.cabral.796/friends')
pagex = driver.page_source
pattern = "<td class=\".*\" style=\"vertical-align: middle\"><a class=\".*\">"
x = re.findall(pattern, pagex)
print(x)
I think your wildcard match needs a dot in front like .*
I'd also recommend using a library like Beautiful Soup for this, it might make your life easier.
You should use a parsing library, such as BeautifulSoup or requests-html. If you want to do it manually, then build on the second attempt you posted. The first won't get you what you want because you are compiling the entire page as a regular expression.
import re
s = """<a class="cf" href="/profile.php?id=100044454444312&fref=fr_tab">\n\n<h1>\n<a class="cf" href="/profile.php?id=20004666644312&fref=fr_tab">"""
patt = r'<a.*?class[="]{2}cf.*?href.*?profile.*?>'
matches = re.findall(patt, s)
Output
>>>matches
['<a class="cf" href="/profile.php?id=100044454444312&fref=fr_tab">',
'<a class="cf" href="/profile.php?id=20004666644312&fref=fr_tab">']
As mentioned by the previous respondent, BeautifulSoup is the best thats available out there in python to scrape web pages. To import beautiful soup and other libraries use the following commands
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup
Post this the below set of commands should solve your purpose
req=Request(url,headers = {'User-Agent': 'Chrome/64.0.3282.140'})
result=urlopen(req).read()
soup = BeautifulSoup(result, "html.parser")
atags=soup('a')
url in the above command is the link you want to scrape and headers argument takes by browser specs/version
I'm having trouble understanding how the bs4 parsing works to pull out information that is several levels down in a tag hierarchy.
Here is an example of what I'm trying to parse (from www.j-archive.com/showgame.php?game_id=50):
...
<table>
<tr>
<td>
<div onmouseover="toggle('clue_DJ_1_1', 'clue_DJ_1_1_stuck', '<em class="correct_response"><i>The Red Badge of Courage</i></em><br /><br /><table width="100%"><tr><td class="right">Kelley</td></tr></table>')" onmouseout="toggle('clue_DJ_1_1', 'clue_DJ_1_1_stuck', 'This classic by Stephen Crane is subtitled "An Episode of the American Civil War"')" onclick="togglestick('clue_DJ_1_1_stuck')">
I specifically want to get out the words "The Red Badge of Courage", so that's within <table>, <tr>, <td>, and <div>, and then appears to be part of the attribute onmouseover.
I can pull out all onmouseover statements with:
for tag in soup.findAll(onmouseover=True):
print(tag['onmouseover'])
But I don't know hot to parse inside that ouput.
Thanks in advance.
Since the text you're interested in is in the <em> tags, it's pretty easy to parse using substring indexing:
import requests
from bs4 import BeautifulSoup
req = requests.get('http://www.j-archive.com/showgame.php?game_id=50')
soup = BeautifulSoup(req.text, 'lxml')
for tag in soup.findAll('div',onmouseover=True):
parseText = str(tag['onmouseover'])
tag1 = '<em class="correct_response">'
tag2 = '</em>'
i1 = parseText.index(tag1)
i2 = parseText.index(tag2)
print(parseText[i1+len(tag1):i2])
This code got all of the entries until 'Final Jeopardy'.
I'm using Beautiful Soup and I want to extract the text within '' with the findall method.
content = urllib.urlopen(address).read()
soup = BeautifulSoup(content, from_encoding='utf-8')
soup.prettify()
x = soup.findAll(do not know what to write)
An extract from soup as an example:
<td class="leftCell identityColumn snap" onclick="fundview('Schroder
European Special Situations');" title="Schroder European Special
Situations"> <a class="coreExpandArrow" href="javascript:
void(0);"></a> <span class="sigill"><a class="qtpop"
href="/vips/ska/all/sv/quicktake/redirect?perfid=0P0000XZZ3&flik=Chosen">
<img
src="/vips/Content/corestyles/4pSigillGubbe.gif"/></a></span>
<span class="bluetext" style="white-space: nowrap; overflow:
hidden;">Schroder European Spe..</span>
I would like the result from soup.findAll(do not know what to write) to be: Schroder European Special Situations and the findall logic should be based on that it is the text between single quotation marks.
Locate the td element and get the onclick attribute value - the BeautifulSoup's job at this point would be completed. The next step would be to extract the desired text from the attribute value - let's use regular expressions for that. Implementation:
import re
onclick = soup.select_one("td.identityColumn[onclick]")["onclick"]
match = re.search(r"fundview\('(.*?)'\);", onclick)
if match:
print(match.group(1))
Alternatively, it looks like the span with bluetext class has the desired text inside:
soup.select_one("td.identityColumn span.bluetext").get_text()
Also, make sure you are using the 4th BeautifulSoup version and your import statement is:
from bs4 import BeautifulSoup
I have the following (repeating) HTML text from which I need to extract some values using Python and regular expressions.
<tr>
<td width="35%">Demand No</td>
<td width="65%"><input type="text" name="T1" size="12" onFocus="this.blur()" value="876716001"></td>
</tr>
I can get the first value by using
match_det = re.compile(r'<td width="35.+?">(.+?)</td>').findall(html_source_det)
But the above is on one line. However, I also need to get the second value which is on the line following the first one but I cannot get it to work. I have tried the following, but I won't get a match
match_det = re.compile('<td width="35.+?">(.+?)</td>\n'
'<td width="65.+?value="(.+?)"></td>').findall(html_source_det)
Perhaps I am unable to get it to work since the text is multiline, but I added "\n" at the end of the first line, so I thought this would resolve it but it did not.
What I am doing wrong?
The html_source is retrieved downloading it (it is not a static HTML file like outlined above - I only put it here so you could see the text). Maybe this is not the best way in getting the source.
I am obtaining the html_source like this:
new_url = "https://webaccess.site.int/curracc/" + url_details #not a real url
myresponse_det = urllib2.urlopen(new_url)
html_source_det = myresponse_det.read()
Please do not try to parse HTML with regex, as it is not regular. Instead use an HTML parsing library like BeautifulSoup. It will make your life a lot easier! Here is an example with BeautifulSoup:
from bs4 import BeautifulSoup
html = '''<tr>
<td width="35%">Demand No</td>
<td width="65%"><input type="text" name="T1" size="12" onFocus="this.blur()" value="876716001"></td>
</tr>'''
soup = BeautifulSoup(html)
print soup.find('td', attrs={'width': '65%'}).findNext('input')['value']
Or more simply:
print soup.find('input', attrs={'name': 'T1'})['value']
Here is the part of the HTML:
<td class="team-name">
<div class="goat_australia"></div>
Melbourne<br />
Today
</td>
<td class="team-name">
<div class="goat_australia"></div>
Sydney<br />
Tomorrow
</td>
So i would like to return all these td tags with the class name "team-name", and only if it contains the text "Today" in it.
My code so far:
from BeautifulSoup import BeautifulSoup
import urllib2, re
starting_url = urllib2.urlopen('http://www.mysite.com.au/').read()
soup = BeautifulSoup(''.join(starting_url))
soup2 = soup.findAll("td", {'class':'team-name'})
for entry in soup2:
if "Today" in soup2:
print entry
If i run this nothing returns.
If i take out that last if statement and just put
print soup2
I get back all the td tags, but some have "Today" and some have "Tomorrow" etc.
So any pointers? is there a way to add 2 attributes to the soup.findAll function?
I also tried running a findAll on a findAll, that did not work.
Using the structure of the code you've got currently, try looking for "Today" with an embedded findAll:
for entry in soup2:
if entry.findAll(text=re.compile("Today")):
print entry