How to get unlabeled data by BeautifulSoup Python - python

<div class="content-cell mdl-cell mdl-cell--6-col mdl-typography--body-1">
Contents
<a href="https://www.google.com/search?q=Google+what+time+is+it">
Google what time is it
</a>
<br/>
It's 4:38.
<br/>
2018. 2. 5. 5:38:41 PM
</div>]
In the code above, I want to extract the answer ("It's 4:38") and the timestamp. For the question, I used
for link in soup.find_all('a'): Questions.append(link.text);
but I couldn't do the same with the answers and timestamp. How do I resolve this issue?

You can see that the text you want is a descendant of the div element. You can also see that it immediately follows the first br element that is such a descendant. Then one way to find it is simply to iterate through the descendants of the div looking for that br. When you see that take the next item.
Here's how it will play out.
>>> import bs4
>>> soup = bs4.BeautifulSoup(open('sumin.htm').read(), 'lxml')
>>> div = soup.find('div')
>>> for element in div.descendants:
... element.name, element
...
(None, '\n Contents\n ')
('a', <a href="https://www.google.com/search?q=Google+what+time+is+it">
Google what time is it
</a>)
(None, '\n Google what time is it\n ')
(None, '\n')
('br', <br/>)
(None, "\n It's 4:38.\n ")
('br', <br/>)
(None, '\n 2018. 2. 5. 5:38:41 PM\n ')
Notice that elements such as br have the name property but navigable strings do not (this property is None).

How to get unlabeled data
Actually, it is not unlabeled. The text you want is located inside the <div> tag. To get that text you can check if the text is NavigableString.
If you check the type of each content,
Contents #<class 'bs4.element.NavigableString'>
<a href="https://www.google.com/search?q=Google+what+time+is+it">
Google what time is it
</a> #<class 'bs4.element.Tag'>
#<class 'bs4.element.NavigableString'>
<br/> #<class 'bs4.element.Tag'>
It's 4:38. #<class 'bs4.element.NavigableString'>
<br/> #<class 'bs4.element.Tag'>
2018. 2. 5. 5:38:41 PM #<class 'bs4.element.NavigableString'>
Code:
>>> from bs4 import BeautifulSoup, NavigableString
>>> html = '''<div class="content-cell mdl-cell mdl-cell--6-col mdl-typography--body-1">
... Contents
... <a href="https://www.google.com/search?q=Google+what+time+is+it">
... Google what time is it
... </a>
... <br/>
... It's 4:38.
... <br/>
... 2018. 2. 5. 5:38:41 PM
... </div>'''
>>> soup = BeautifulSoup(html, 'lxml')
>>> div = soup.find('div', class_='content-cell mdl-cell mdl-cell--6-col mdl-typography--body-1')
>>> contents = [x.strip() for x in div.contents if isinstance(x, NavigableString)]
>>> contents
['Contents', '', "It's 4:38.", '2018. 2. 5. 5:38:41 PM']
From this, you can understand what NavigableString is. Now, to get the date and time you can simply join the last 2 elements of the list.
>>> ' '.join(contents[-2:])
"It's 4:38. 2018. 2. 5. 5:38:41 PM"

text = '<div class="content-cell mdl-cell mdl-cell--6-col mdl-typography--body-1">\n Contents\n \n Google what time is it\n \n <br/>\n It\'s 4:38.\n <br/>\n 2018. 2. 5. 5:38:41 PM\n </div>'
s = BeautifulSoup(text,"lxml")
>>> s.find("br").findNext("br").next
'\n 2018. 2. 5. 5:38:41 PM\n '
>>> s.find("br").next
"\n It's 4:38.\n "

Use select_one() in combo with SelectorGadget Chrome extension to grab CSS selectors by clicking on the desired element in your browser:
soup.select_one('.YwPhnf').text
# 09:06
Or you can use stripped_strings, but it's not as pretty as using CSS selectors:
html = '''
<div class="content-cell mdl-cell mdl-cell--6-col mdl-typography--body-1">
Contents
<a href="https://www.google.com/search?q=Google+what+time+is+it">
Google what time is it
</a>
<br/>
It's 4:38.
<br/>
2018. 2. 5. 5:38:41 PM
</div>
'''
soup = BeautifulSoup(html, 'lxml')
# returns a generator object
current_time = list(soup.select_one('.content-cell').stripped_strings)[2]
print(current_time)
# It's 4:38.
Code:
from bs4 import BeautifulSoup
import requests, lxml
headers = {
'User-agent':
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}
params = {
"q": "what time it is", # query
"gl": "us", # country to search from
"hl": "en" # language
}
html = requests.get("https://www.google.com/search", headers=headers, params=params)
soup = BeautifulSoup(html.text, 'lxml')
current_time = soup.select_one('.YwPhnf').text
current_date = soup.select_one('.KfQeJ:nth-child(1)').text
print(f'{current_time}\n{current_date}')
# 2:11 AM
# September 11, 2021
Alternatively, you can achieve the same thing by using Google Direct Answer Box API from SerpApi. It's a paid API with a free plan.
The difference in your case is that you only need to get the data you want from the structured JSON, rather than figuring out to extract things and maintain the parser over time if something won't work correctly because of some changes in the HTML.
Code to integrate:
from serpapi import GoogleSearch
params = {
"api_key": "YOUR_API_KEY",
"engine": "google",
"q": "what time it is",
"gl": "us",
"hl": "en"
}
search = GoogleSearch(params)
results = search.get_dict()
print(results['answer_box']['result'])
# 2:07 AM
Disclaimer, I work for SerpApi.

Related

Remove image sources with same class reference when web scraping in python?

I'm trying to write some code to extract some data from transfermarkt (Link Here for the page I'm using). I'm stuck trying to print the clubs. I've figured out that I need to access h2 and then the a class in order to just get the text. The HTML code is below
<div class="table-header" id="to-349"><a class="vereinprofil_tooltip" id="349" href="/fc-barnsley/transfers/verein/349/saison_id/2018"><img src="https://tmssl.akamaized.net/images/wappen/small/349.png?lm=1574162298" title=" " alt="Barnsley FC" class="" /></a><h2><a class="vereinprofil_tooltip" id="349" href="/fc-barnsley/transfers/verein/349/saison_id/2018">Barnsley FC</a></h2></div>
so you can see if I just try find_all("a", "class": "vereinprofil_tooltip"}) it doesn't work properly as it also returns the image file which has no plain text? But if I can search for h2 first and then search find_all("a", "class": "vereinprofil_tooltip"}) within the returned h2 it would get me what I want. My code is below.
import requests
from bs4 import BeautifulSoup
headers = {'User-Agent':
'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36'}
page = "https://www.transfermarkt.co.uk/league-one/transfers/wettbewerb/GB3/plus/?saison_id=2018&s_w=&leihe=1&intern=0&intern=1"
pageTree = requests.get(page, headers=headers)
pageSoup = BeautifulSoup(pageTree.content, 'html.parser')
#Players = pageSoup.find_all("a", {"class": "spielprofil_tooltip"})
Clubs = pageSoup.find_all("h2")
Club = Clubs.find("a", {"class": "vereinprofil_tooltip"})
print(Club)
I get the error in getattr
raise AttributeError(
AttributeError: ResultSet object has no attribute 'find_all'. You're probably treating a list of elements like a single element. Did you call find_all() when you meant to call find()?
I know what the error means but I've been going round in circles trying to find a way of actually doing it properly and getting what I want. Any help is appreciated.
import requests
from bs4 import BeautifulSoup
headers = {'User-Agent':'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36'}
page = "https://www.transfermarkt.co.uk/league-one/transfers/wettbewerb/GB3/
plus/?saison_id=2018&s_w=&leihe=1&intern=0&intern=1"
pageTree = requests.get(page, headers=headers)
pageSoup = BeautifulSoup(pageTree.content, 'html.parser')
#Players = pageSoup.find_all("a", {"class": "spielprofil_tooltip"})
Clubs = pageSoup.find_all("h2")
print(type(Clubs)) # this can be removed, but I left it to expose how I figured this out
for club in Clubs:
print(club.text)
Basically: Clubs is a list (technically, a ResultSet, but the behavior is very similar), you need to iterate it as such. .text gives just the text, other attributes could be retrieved as well.
Output looks like:
Transfer record 18/19
Barnsley FC
Burton Albion
Sunderland AFC
Shrewsbury Town
Scunthorpe United
Charlton Athletic
Plymouth Argyle
Portsmouth FC
Peterborough United
Southend United
Bradford City
Blackpool FC
Bristol Rovers
Fleetwood Town
Doncaster Rovers
Oxford United
Gillingham FC
AFC Wimbledon
Walsall FC
Rochdale AFC
Accrington Stanley
Luton Town
Wycombe Wanderers
Coventry City
Transfer record 18/19
There are, however, a bunch of blank lines (I.e., .text was '') that you should probably handle as well.
my guess is you might mean findAll instead of find_all
I tried this code below and it works
content = """<div class="table-header" id="to-349">
<a class="vereinprofil_tooltip" id="349" href="/fc-barnsley/transfers/verein/349/saison_id/2018">
<img src="https://tmssl.akamaized.net/images/wappen/small/349.png?lm=1574162298" title=" " alt="Barnsley FC" class="" />
</a>
<h2>
<a class="vereinprofil_tooltip" id="349" href="/fc-barnsley/transfers/verein/349/saison_id/2018">
Barnsley FC
</a>
</h2>
</div>"""
soup = BeautifulSoup(content, 'html.parser')
#get main_box
main_box = soup.findAll('a', {'class': 'vereinprofil_tooltip'})
#print(main_box)
for main_text in main_box: # looping thru the list
if main_text.text.strip(): # get the body text
print(main_text.text.strip()) # print it
output is
Barnsley FC
I'll edit this with a reference to the documentation about findAll. cant remember it on to pof my head
edit:
did a look at the documentation, turns out find_all = findAll..
https://www.crummy.com/software/BeautifulSoup/bs4/doc/#find-all
now I feel dumb lol

Data Scraping across <div>'s

I am trying to extract information from a repeating set of rows containing many embedded 's. For the page, I am trying to write a scraper to get various elements from this page. For some reason, I can't find a way to get to the tag with the class that contains the information for each row. Further, I am not able to isolate the sections that I will need to extract the information. For reference, here is a sample of one row:
<div id="dTeamEventResults" class="col-md-12 team-event-results"><div>
<div class="row team-event-result team-result">
<div class="col-md-12 main-info">
<div class="row">
<div class="col-md-7 event-name">
<dl>
<dt>Team Number:</dt>
<dd>11733</dd>
<dt>Team:</dt>
<dd> Aqua Duckies</dd>
<dt>Program:</dt>
<dd>FIRST LEGO League Jr.</dd>
</dl>
</div>
The script I have started to build looks like the following:
from urllib2 import urlopen as uReq
from bs4 import BeautifulSoup as soup
my_url = 'https://www.firstinspires.org/team-event-search#type=teams&sort=name&keyword=NJ&programs=FLLJR,FLL,FTC,FRC&year=2017'
uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()
page_soup = soup(page_html, "html.parser")
rows = page_soup.findAll("div", {"class":"row team-event-result team-result"})
whenever I run len(rows), it always results in 0. I seem to have hit a wall and am having trouble. Thanks for your help!
The content of this page is generated dynamically so to catch that you need to use any browser simulator like selenium. Here is the script which will fetch your desired content. Give this a shot:
from bs4 import BeautifulSoup
from selenium import webdriver
driver = webdriver.Chrome()
driver.get('https://www.firstinspires.org/team-event-search#type=teams&sort=name&keyword=NJ&programs=FLLJR,FLL,FTC,FRC&year=2017')
soup = BeautifulSoup(driver.page_source,"lxml")
for items in soup.select('.main-info'):
docs = ' '.join([' '.join([item.text,' '.join(val.text.split())]) for item,val in zip(items.select(".event-name dt"),items.select(".event-name dd"))])
location = ' '.join([' '.join(item.text.split()) for item in items.select(".event-location-type address")])
print("Event_Info: {}\nEvent_Location: {}\n".format(docs,location))
driver.quit()
The results look something like:
Event_Info: Team Number: 11733 Team: Aqua Duckies Program: FIRST LEGO League Jr.
Event_Location: Sparta, NJ 07871 USA
Event_Info: Team Number: 4281 Team: Bulldogs Program: FIRST Robotics Competition
Event_Location: Somerset, NJ 08873 USA
This seems like an issue of multiple-class tags. I believe this question might help you figure out the solution.
You can search specifically for dt and dd, the tags containing the target data:
from bs4 import BeautifulSoup as soup
from urllib2 import urlopen as uReq
import re
data = str(uReq('https://www.firstinspires.org/team-event-search#type=teams&sort=name&keyword=NJ&programs=FLLJR,FLL,FTC,FRC&year=2017').read())
s = soup(data, 'lxml')
headers = map(lambda x:x[:-1], [[b.text for b in i.find_all('dt')] for i in s.find_all('dl')][0])
data = [[re.sub('\s{2,}', '', b.text) for b in i.find_all('dd')] for i in s.find_all('dl')]
print(data)
final_data = [dict(zip(headers, i)) for i in data]
print(final_data)
When running this code on your example above, the output is:
[[u'11733', u' Aqua Duckies', u'FIRST LEGO League Jr.']]
[{u'Program': u'FIRST LEGO League Jr.', u'Team Number': u'11733', u'Team': u' Aqua Duckies'}]

Focusing in on specific results while scraping Twitter with Python and Beautiful Soup 4?

This is a follow up to my post Using Python to Scrape Nested Divs and Spans in Twitter?.
I'm not using the Twitter API because it doesn't look at the tweets by
hashtag this far back. Complete code and output are below after examples.
I want to scrape specific data from each tweet. name and handle are retrieving exactly what I'm looking for, but I'm having trouble narrowing down the rest of the elements.
As an example:
link = soup('a', {'class': 'tweet-timestamp js-permalink js-nav js-tooltip'})
url = link[0]
Retrieves this:
<a class="tweet-timestamp js-permalink js-nav js-tooltip" href="/Mikepeeljourno/status/648787700980408320" title="2:13 AM - 29 Sep 2015">
<span class="_timestamp js-short-timestamp " data-aria-label-part="last" data-long-form="true" data-time="1443518016" data-time-ms="1443518016000">29 Sep 2015</span></a>
For url, I only need the href value from the first line.
Similarly, the retweets and favorites commands return large chunks of html, when all I really need is the numerical value that is displayed for each one.
How can I narrow down the results to the required data for the url, retweetcount and favcount outputs?
I am planning to have this cycle through all the tweets once I get it working, in case that has an influence on your suggestions.
Complete Code:
from bs4 import BeautifulSoup
import requests
import sys
url = 'https://twitter.com/search?q=%23bangkokbombing%20since%3A2015-08-10%20until%3A2015-09-30&src=typd&lang=en'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36'}
r = requests.get(url, headers=headers)
data = r.text.encode('utf-8')
soup = BeautifulSoup(data, "html.parser")
name = soup('strong', {'class': 'fullname js-action-profile-name show-popup-with-id'})
username = name[0].contents[0]
handle = soup('span', {'class': 'username js-action-profile-name'})
userhandle = handle[0].contents[1].contents[0]
link = soup('a', {'class': 'tweet-timestamp js-permalink js-nav js-tooltip'})
url = link[0]
messagetext = soup('p', {'class': 'TweetTextSize js-tweet-text tweet-text'})
message = messagetext[0]
retweets = soup('button', {'class': 'ProfileTweet-actionButtonUndo js-actionButton js-actionRetweet'})
retweetcount = retweets[0]
favorites = soup('button', {'class': 'ProfileTweet-actionButtonUndo u-linkClean js-actionButton js-actionFavorite'})
favcount = favorites[0]
print (username, "\n", "#", userhandle, "\n", "\n", url, "\n", "\n", message, "\n", "\n", retweetcount, "\n", "\n", favcount) #extra linebreaks for ease of reading
Complete Output:
Michael Peel
#Mikepeeljourno
<a class="tweet-timestamp js-permalink js-nav js-tooltip" href="/Mikepeeljourno/status/648787700980408320" title="2:13 AM - 29 Sep 2015"><span class="_timestamp js-short-timestamp " data-aria-label-part="last" data-long-form="true" data-time="1443518016" data-time-ms="1443518016000">29 Sep 2015</span></a>
<p class="TweetTextSize js-tweet-text tweet-text" data-aria-label-part="0" lang="en"><a class="twitter-hashtag pretty-link js-nav" data-query-source="hashtag_click" dir="ltr" href="/hashtag/FT?src=hash"><s>#</s><b>FT</b></a> Case closed: <a class="twitter-hashtag pretty-link js-nav" data-query-source="hashtag_click" dir="ltr" href="/hashtag/Thailand?src=hash"><s>#</s><b>Thailand</b></a> police chief proclaims <a class="twitter-hashtag pretty-link js-nav" data-query-source="hashtag_click" dir="ltr" href="/hashtag/Bangkokbombing?src=hash"><s>#</s><b><strong>Bangkokbombing</strong></b></a> solved ahead of his retirement this week -even as questions over case grow</p>
<button class="ProfileTweet-actionButtonUndo js-actionButton js-actionRetweet" data-modal="ProfileTweet-retweet" type="button">
<div class="IconContainer js-tooltip" title="Undo retweet">
<span class="Icon Icon--retweet"></span>
<span class="u-hiddenVisually">Retweeted</span>
</div>
<div class="IconTextContainer">
<span class="ProfileTweet-actionCount">
<span aria-hidden="true" class="ProfileTweet-actionCountForPresentation">4</span>
</span>
</div>
</button>
<button class="ProfileTweet-actionButtonUndo u-linkClean js-actionButton js-actionFavorite" type="button">
<div class="IconContainer js-tooltip" title="Undo like">
<div class="HeartAnimationContainer">
<div class="HeartAnimation"></div>
</div>
<span class="u-hiddenVisually">Liked</span>
</div>
<div class="IconTextContainer">
<span class="ProfileTweet-actionCount">
<span aria-hidden="true" class="ProfileTweet-actionCountForPresentation">2</span>
</span>
</div>
</button>
It was suggested that BeautifulSoup - extracting attribute values may have an answer to this question there. However, I think the question and its answers do not have sufficient context or explanation to be helpful in more complex situations. The link to the relevant part of the Beautiful Soup Documentation is helpful though, http://www.crummy.com/software/BeautifulSoup/documentation.html#The%20attributes%20of%20Tags
Use the dictionary-like access to the Tag's attributes.
For example, to get the href attribute value:
links = soup('a', {'class': 'tweet-timestamp js-permalink js-nav js-tooltip'})
url = link[0]["href"]
Or, if you need to get the href values for every link found:
links = soup('a', {'class': 'tweet-timestamp js-permalink js-nav js-tooltip'})
urls = [link["href"] for link in links]
As a side note, you don't need to specify the complete class value to locate elements. class is a special multi-valued attribute and you can just use one of the classes (if this is enough to narrow down the search for the desired elements). For example, instead of:
soup('a', {'class': 'tweet-timestamp js-permalink js-nav js-tooltip'})
You may use:
soup('a', {'class': 'tweet-timestamp'})
Or, a CSS selector:
soup.select("a.tweet-timestamp")
Alecxe already explained to use the 'href' key to get the value.
So I'm going to answer the other part of your questions:
Similarly, the retweets and favorites commands return large chunks of
html, when all I really need is the numerical value that is displayed
for each one.
.contents returns a list of all the children. Since you're finding 'buttons' which has several children you're interested in, you can just get them from the following parsed content list:
retweetcount = retweets[0].contents[3].contents[1].contents[1].string
This will return the value 4.
If you want a rather more readable approach, try this:
retweetcount = retweets[0].find_all('span', class_='ProfileTweet-actionCountForPresentation')[0].string
favcount = favorites[0].find_all('span', { 'class' : 'ProfileTweet-actionCountForPresentation')[0].string
This returns 4 and 2 respectively.
This works because we convert the ResultSet returned by soup/find_all and get the tag element (using [0]) and recursively find across all it's descendants again using find_all().
Now you can loop across each tweet and extract this information rather easily.

Extract content with BeautifulSoup and Python

I'm trying to scrap a forum but I can't deal with the comments, because the users use emoticons, and bold font, and cite previous messages, and and and...
For example, here's one of the comments that I have a problem with:
<div class="content">
<blockquote>
<div>
<cite>User write:</cite>
I DO NOT WANT THIS <img class="smilies" alt=":116:" title="116">
</div>
</blockquote>
<br/>
THIS IS THE COMMENT THAT I NEED!
</div>
I searching for help for the last 4 days and I couldn't find anything, so I decided to ask here.
This is the code that I'm using:
def make_soup(url):
html = urlopen(url).read()
return BeautifulSoup(html, "lxml")
def get_messages(url):
soup = make_soup(url)
msg = soup.find("div", {"class" : "content"})
# I get in msg the hole message, exactly as I wrote previously
print msg
# Here I get:
# 1. <blockquote> ... </blockquote>
# 2. <br/>
# 3. THIS IS THE COMMENT THAT I NEED!
for item in msg.children:
print item
I'm looking for a way to deal with messages in a general way, no matter how they are. Sometimes they put emoticons between the text and I need to remove them and get the hole message (in this situation, bsp will put each part of the message (first part, emoticon, second part) in different items).
Thanks in advance!
Use decompose http://www.crummy.com/software/BeautifulSoup/bs4/doc/#decompose
Decompose extracts tags that you don't want. In your case:
soup.blockquote.decompose()
or all unwanted tags:
for tag in ['blockquote', 'img', ... ]:
soup.find(tag).decompose()
Your example:
>>> from bs4 import BeautifulSoup
>>> html = """<div class="content">
... <blockquote>
... <div>
... <cite>User write:</cite>
... I DO NOT WANT THIS <img class="smilies" alt=":116:" title="116">
... </div>
... </blockquote>
... <br/>
... THIS IS THE COMMENT THAT I NEED!
... </div>"""
>>> soup = BeautifulSoup(html, 'html.parser')
>>> soup.find('blockquote').decompose()
>>> soup.find("div", {"class" : "content"}).text.strip()
u'THIS IS THE COMMENT THAT I NEED!'
Update
Sometimes all you have is a tag starting point but you are actually interested in the content before or after that starting point. You can use .next_sibling and .previous_sibling to navigate between page elements that are on the same level of the parse tree:
>>> html = """<div>No<blockquote>No</blockquote>Yes.<em>Yes!</em>Yes?</div>No!"""
>>> soup = BeautifulSoup(html, 'html.parser')
>>> elm = soup.blockquote.next_sibling
>>> txt = ""
>>> while elm:
... txt += elm.string
... elm = elm.next_sibling
...
>>> print(txt)
u'Yes.Yes!Yes?'
BeautifulSoup has a get_text method. Maybe this is what you want.
From their documentation:
markup = '\nI linked to <i>example.com</i>\n'
soup = BeautifulSoup(markup)
soup.get_text()
u'\nI linked to example.com\n'
soup.i.get_text()
u'example.com'
If the text you want is never within any additional tags, as in your example, you can use extract() to get rid of all the tags and their contents:
html = '<div class="content">\
<blockquote>\
<div>\
<cite>User write:</cite>\
I DO NOT WANT THIS <img class="smilies" alt=":116:" title="116">\
</div>\
</blockquote>\
<br/>\
THIS IS THE COMMENT THAT I NEED!\
</div>'
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
div = soup.find('div', class_='content')
tags = div.findAll(recursive=False)
for tag in tags:
tag.extract()
text = div.get_text(strip=True)
print(text)
This gives:
THIS IS THE COMMENT THAT I NEED!
To deal with emoticons, you'll have to do something more complicated. You'll probably have to define a list of emoticons to recognize yourself, and then parse the text to look for them.

Python and BeautifulSoup, not finding 'a'

Here's a piece of HTML code (from delicious):
<h4>
<a rel="nofollow" class="taggedlink " href="http://imfy.us/" >Generate Secure Links with Anonymous Referers & Anti-Bot Protection</a>
<span class="saverem">
<em class="bookmark-actions">
<strong><a class="inlinesave action" href="/save?url=http%3A%2F%2Fimfy.us%2F&title=Generate%20Secure%20Links%20with%20Anonymous%20Referers%20%26%20Anti-Bot%20Protection&jump=%2Fdux&key=fFS4QzJW2lBf4gAtcrbuekRQfTY-&original_user=dux&copyuser=dux&copytags=web+apps+url+security+generator+shortener+anonymous+links">SAVE</a></strong>
</em>
</span>
</h4>
I'm trying to find all the links where class="inlinesave action". Here's the code:
sock = urllib2.urlopen('http://delicious.com/theuser')
html = sock.read()
soup = BeautifulSoup(html)
tags = soup.findAll('a', attrs={'class':'inlinesave action'})
print len(tags)
But it doesn't find anything!
Any thoughts?
Thanks
If you want to look for an anchor with exactly those two classes you'd, have to use a regexp, I think:
tags = soup.findAll('a', attrs={'class': re.compile(r'\binlinesave\b.*\baction\b')})
Keep in mind that this regexp won't work if the ordering of the class names is reversed (class="action inlinesave").
The following statement should work for all cases (even though it looks ugly imo.):
soup.findAll('a',
attrs={'class':
re.compile(r'\baction\b.*\binlinesave\b|\binlinesave\b.*\baction\b')
})
Python string methods
html=open("file").read()
for item in html.split("<strong>"):
if "class" in item and "inlinesave action" in item:
url_with_junk = item.split('href="')[1]
m = url_with_junk.index('">')
print url_with_junk[:m]
May be that issue is fixed in verion 3.1.0, I could parse yours,
>>> html="""<h4>
... <a rel="nofollow" class="taggedlink " href="http://imfy.us/" >Generate Secure Links with Anony
... <span class="saverem">
... <em class="bookmark-actions">
... <strong><a class="inlinesave action" href="/save?url=http%3A%2F%2Fimfy.us%2F&title=Gen
... </em>
... </span>
... </h4>"""
>>>
>>> from BeautifulSoup import BeautifulSoup
>>> soup = BeautifulSoup(html)
>>> tags = soup.findAll('a', attrs={'class':'inlinesave action'})
>>> print len(tags)
1
>>> tags
[<a class="inlinesave action" href="/save?url=http%3A%2F%2Fimfy.us%2F&title=Generate%20Secure%
>>>
I have tried with BeautifulSoup 2.1.1 also, its does not work at all.
You might make some forward progress using pyparsing:
from pyparsing import makeHTMLTags, withAttribute
htmlsrc="""<h4>... etc."""
atag = makeHTMLTags("a")[0]
atag.setParseAction(withAttribute(("class","inlinesave action")))
for result in atag.searchString(htmlsrc):
print result.href
Gives (long result output snipped at '...'):
/save?url=http%3A%2F%2Fimfy.us%2F&title=Genera...+anonymous+links

Categories