I'm working on a Python script to do some web scraping. I want to find the base URL of a given section on a web page that looks like this:
<div class='pagination'>
<a href='webpage-category/page/1'>1</a>
<a href='webpage-category/page/2'>2</a>
...
</div>
So, I just need to get everything from the first href besides the number ('webpage-category/page/'), and I have the following working code:
pages = [l['href'] for link in soup.find_all('div', class_='pagination')
         for l in link.find_all('a') if not re.search('pageSub', l['href'])]
s = pages[0]
f = ''.join([i for i in s if not i.isdigit()])
The problem is, generating this whole list is a waste, since I just need the first href. I think a generator would be the answer, but I couldn't pull it off. Maybe you guys could help me make this code more concise?
What about this:
from bs4 import BeautifulSoup
html = """ <div class='pagination'>
<a href='webpage-category/page/1'>1</a>
<a href='webpage-category/page/2'>2</a>
</div>"""
soup = BeautifulSoup(html, 'html.parser')
link = soup.find('div', {'class': 'pagination'}).find('a')['href']
print '/'.join(link.split('/')[:-1])
prints:
webpage-category/page
Just FYI, speaking about the code you've provided - you can use next() instead of a list comprehension:
s = next(l['href'] for link in soup.find_all('div', class_='pagination')
         for l in link.find_all('a') if not re.search('pageSub', l['href']))
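Note that next() raises StopIteration if nothing matches. If that is possible with your data, you can pass a default as the second argument (the generator expression then needs its own parentheses):

s = next((l['href'] for link in soup.find_all('div', class_='pagination')
          for l in link.find_all('a') if not re.search('pageSub', l['href'])), None)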
UPD (using the website link provided):
import urllib2
from bs4 import BeautifulSoup
url = "http://www.hdwallpapers.in/cars-desktop-wallpapers/page/2"
soup = BeautifulSoup(urllib2.urlopen(url), 'html.parser')
links = soup.find_all('div', {'class': 'pagination'})[1].find_all('a')
print next('/'.join(link['href'].split('/')[:-1]) for link in links
           if link.text.isdigit() and link.text != "1")
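That snippet is Python 2; a rough Python 3 equivalent (an untested sketch, since urllib2 became urllib.request and print is a function) would be:

from urllib.request import urlopen
from bs4 import BeautifulSoup

url = "http://www.hdwallpapers.in/cars-desktop-wallpapers/page/2"
soup = BeautifulSoup(urlopen(url), 'html.parser')
links = soup.find_all('div', {'class': 'pagination'})[1].find_all('a')
print(next('/'.join(link['href'].split('/')[:-1]) for link in links
           if link.text.isdigit() and link.text != "1"))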
How can I get the links if the tag is in this form?
<div class="BNeawe vvjwJb AP7Wnd">Going Gourmet Catering (#goinggourmet) - Instagram</div></h3><div class="BNeawe UPmit AP7Wnd">www.instagram.com › goinggourmet</div>
I have tried the below code and it helped me get only URLs, but the URLs come in this format.
/url?q=https://bespokecatering.sydney/&sa=U&ved=2ahUKEwjTv6ueseHyAhUHb30KHYTYABwQFnoECAEQAg&usg=AOvVaw076QI0_4Yw4hNZ6iXHQZL-
/url?q=https://www.facebook.com/bespokecatering.sydney/videos/lockdown-does-not-mean-unfulfilled-cravings-order-our-weekly-favorites-order-her/892336708293067/%3Fextid%3DSEO----&sa=U&ved=2ahUKEwjTv6ueseHyAhUHb30KHYTYABwQtwJ6BAgEEAE&usg=AOvVaw2YQI1Bqwip72axc-Nh2_6e
/url?q=https://www.instagram.com/bespoke_catering/%3Fhl%3Den&sa=U&ved=2ahUKEwjTv6ueseHyAhUHb30KHYTYABwQFnoECAoQAg&usg=AOvVaw1QUCWYmxfSLb6Jx20hyXIR
I need only the URLs from Facebook and Instagram, without any extra query parameters. What I mean is I want only the real link, not the redirect link.
I need something like this from the above links:
'https://www.facebook.com/bespokecatering.sydney'
'https://www.instagram.com/bespoke_catering'
div = soup.find_all('div', attrs={'class':'kCrYT'})
for w in div:
    for link in w.select('a'):
        urls = link['href']
        print(urls)
Any help is much appreciated.
I tried the below code, but it returns empty or different results:
div = soup.find_all('div', attrs={'class':'kCrYT'})
for w in div:
    for link in w.select('a'):
        urls = link['href']
        print(urls)

for url in urls:
    try:
        j = url.split('=')[1]
        k = '/'.join(j.split('/')[0:4])
        # print(k)
    except:
        k = ''
You already have your <a> elements selected - just loop over the selection and print the results via ['href']:
div = soup.find_all('div', attrs={'class':'kCrYT'})
for w in div:
    for link in w.select('a'):
        print(link['href'])
If you improve your question and add the additional information requested, we can give a more detailed answer.
EDIT
Answering your additional question with a simple example (something you should have provided in your question):
import requests
from bs4 import BeautifulSoup
result = '''
<div class="kCrYT">
<a href="/url?q=https://bespokecatering.sydney/&sa=U&ved=2ahUKEwjTv6ueseHyAhUHb30KHYTYABwQFnoECAEQAg&usg=AOvVaw076QI0_4Yw4hNZ6iXHQZL-"></a>
</div>
<div class="kCrYT">
<a href="/url?q=https://www.facebook.com/bespokecatering.sydney/videos/lockdown-does-not-mean-unfulfilled-cravings-order-our-weekly-favorites-order-her/892336708293067/%3Fextid%3DSEO----&sa=U&ved=2ahUKEwjTv6ueseHyAhUHb30KHYTYABwQtwJ6BAgEEAE&usg=AOvVaw2YQI1Bqwip72axc-Nh2_6e"></a>
</div>
<div class="kCrYT">
<a href="/url?q=https://www.instagram.com/bespoke_catering/%3Fhl%3Den&sa=U&ved=2ahUKEwjTv6ueseHyAhUHb30KHYTYABwQFnoECAoQAg&usg=AOvVaw1QUCWYmxfSLb6Jx20hyXIR"></a>
</div>
'''

soup = BeautifulSoup(result, 'lxml')
div = soup.find_all('div', attrs={'class':'kCrYT'})
for w in div:
    for link in w.select('a'):
        # The href is a Google redirect (/url?q=...): pull out the 'q'
        # parameter and drop the URL-encoded '?' (%3F) and everything after it
        query = requests.utils.urlparse(link['href']).query
        params = dict(x.split('=') for x in query.split('&'))
        print(params['q'].split('%3F')[0])
Result:
https://bespokecatering.sydney/
https://www.facebook.com/bespokecatering.sydney/videos/lockdown-does-not-mean-unfulfilled-cravings-order-our-weekly-favorites-order-her/892336708293067/
https://www.instagram.com/bespoke_catering/
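For comparison, here is a standard-library sketch of the same extraction (my variant, not the code above): urllib.parse.parse_qs both splits the query string and decodes the %3F/%3D escapes, so a plain '?' split works afterwards.

from urllib.parse import urlparse, parse_qs

href = "/url?q=https://www.instagram.com/bespoke_catering/%3Fhl%3Den&sa=U&ved=2ahUKEwjTv6ueseHyAhUHb30KHYTYABwQFnoECAoQAg&usg=AOvVaw1QUCWYmxfSLb6Jx20hyXIR"
q = parse_qs(urlparse(href).query)['q'][0]  # 'https://www.instagram.com/bespoke_catering/?hl=en'
print(q.split('?')[0])                      # https://www.instagram.com/bespoke_catering/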
I'm trying to use BS4 to find ID numbers that are embedded in HTML. However, they aren't really attached to a tag's text or anything else I'm used to working with. I've looked at how to pull something from a div class, but haven't had success with that either. Below is what the soup looks like after I collect the HTML:
<div class="result-bio">
<div class="profile-image">
<img class="search-image load-member-profile" ng-click="results.loadProfile(result.UGuid)"
ng-src="http://file.asdf.org/profile/picture.ashx?id=9091a557-fd44-44be-9468-9386d90a39b8"
src="http://file.asdf.org/profile/picture.ashx?id=9091a557-fd44-44be-9468-9386d90a39b8"/>
</div>
The code I have been attempting is:
soup = BeautifulSoup(browser.page_source, 'html.parser')
print(soup)
ID_Numbers = []
for IDs in soup.find_all('div', string='http'):
    ID_Numbers.append(IDs.text)
Does anyone have a suggestion on how to solve this? I imagine I will have to strip it later, but really all I want is that id=xxxx value embedded in there. I've tried most of the solutions I've seen on Stack Overflow with no success. Thanks!
You need to find all the image elements, then loop over these elements to get the source for each image. Then, simply break the src down to get the id.
import bs4

soup = bs4.BeautifulSoup(page_source, features='html.parser')
for i in soup.find_all('img'):
    src = i['src']
    try:
        id = src.split('?id=')[1]
        print(id)
    except IndexError:
        continue
Here, I have split the src to get the id, but in more complicated cases you may need to use regex.
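For example, a regex sketch along these lines (assuming the id is always a 36-character GUID in the query string):

import re

src = "http://file.asdf.org/profile/picture.ashx?id=9091a557-fd44-44be-9468-9386d90a39b8"
match = re.search(r'[?&]id=([0-9a-fA-F-]{36})', src)
if match:
    print(match.group(1))  # 9091a557-fd44-44be-9468-9386d90a39b8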
You can use the built-in urllib.parse library like below:
from bs4 import BeautifulSoup
from urllib.parse import urlsplit
html_doc = """
<div class="result-bio">
<div class="profile-image">
<img class="search-image load-member-profile" ng-click="results.loadProfile(result.UGuid)"
ng-src="http://file.asdf.org/profile/picture.ashx?id=9091a557-fd44-44be-9468-9386d90a39b8"
src="http://file.asdf.org/profile/picture.ashx?id=9091a557-fd44-44be-9468-9386d90a39b8"/>
</div>
"""
soup = BeautifulSoup(html_doc, 'html.parser')
images = soup.find_all('img')

ids = []
for i in images:
    parsed = urlsplit(i['src'])
    id = parsed.query.replace("id=", "")
    ids.append(id)

print(ids)
This gave me the output:
['9091a557-fd44-44be-9468-9386d90a39b8']
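If the src ever gains more query parameters than just id, parse_qs from the same module is a more robust sketch than replace():

from urllib.parse import urlsplit, parse_qs

src = "http://file.asdf.org/profile/picture.ashx?id=9091a557-fd44-44be-9468-9386d90a39b8"
print(parse_qs(urlsplit(src).query)['id'][0])  # 9091a557-fd44-44be-9468-9386d90a39b8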
I have the following:
html =
'''<div class=“file-one”>
<a href=“/file-one/additional” class=“file-link">
<h3 class=“file-name”>File One</h3>
</a>
<div class=“location”>
Down
</div>
</div>'''
And I would like to get just the text of the href, which is /file-one/additional. So I did:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')
link_text = “”
for a in soup.find_all(‘a’, href=True, text=True):
link_text = a[‘href’]
print “Link: “ + link_text
But it just prints a blank, nothing, just Link:. So I tested it out on another site with different HTML, and it worked.
What could I be doing wrong? Or is there a possibility that the site intentionally programmed to not return the href?
Thank you in advance and will be sure to upvote/accept answer!
The 'a' tag in your HTML does not have any text directly, but it contains an 'h3' tag that has text. This means that text is None, and .find_all() fails to select the tag. Generally, do not use the text parameter if the tag contains any other HTML elements besides text content.
You can resolve this issue if you use only the tag's name (and the href keyword argument) to select elements. Then add a condition in the loop to check if they contain text.
soup = BeautifulSoup(html, 'html.parser')
links_with_text = []
for a in soup.find_all('a', href=True):
    if a.text:
        links_with_text.append(a['href'])
Or you could use a list comprehension, if you prefer one-liners.
links_with_text = [a['href'] for a in soup.find_all('a', href=True) if a.text]
Or you could pass a lambda to .find_all().
tags = soup.find_all(lambda tag: tag.name == 'a' and tag.get('href') and tag.text)
If you want to collect all links whether they have text or not, just select all 'a' tags that have a 'href' attribute. Anchor tags usually have links but that's not a requirement, so I think it's best to use the href argument.
Using .find_all().
links = [a['href'] for a in soup.find_all('a', href=True)]
Using .select() with CSS selectors.
links = [a['href'] for a in soup.select('a[href]')]
You can also use attrs and a regex search to get the href:
soup.find('a', href = re.compile(r'[/]([a-z]|[A-Z])\w+')).attrs['href']
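One caveat with that one-liner: .find() returns None when nothing matches, so chaining .attrs would raise an AttributeError. A guarded sketch:

import re

tag = soup.find('a', href=re.compile(r'[/]([a-z]|[A-Z])\w+'))
if tag is not None:
    print(tag.attrs['href'])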
First of all, use a different text editor that doesn't use curly quotes.
Second, remove the text=True flag from the soup.find_all call.
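Putting both fixes together (straight quotes in the HTML, no text=True), a corrected sketch of your snippet:

from bs4 import BeautifulSoup

html = '''<div class="file-one">
<a href="/file-one/additional" class="file-link">
<h3 class="file-name">File One</h3>
</a>
</div>'''

soup = BeautifulSoup(html, 'html.parser')
for a in soup.find_all('a', href=True):
    print("Link: " + a['href'])  # Link: /file-one/additional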
You could solve this with just a couple lines of gazpacho:
from gazpacho import Soup
html = """\
<div class="file-one">
<a href="/file-one/additional" class="file-link">
<h3 class="file-name">File One</h3>
</a>
<div class="location">
Down
</div>
</div>
"""
soup = Soup(html)
soup.find("a", {"class": "file-link"}).attrs['href']
Which would output:
'/file-one/additional'
A bit late to the party, but I had the same issue recently scraping some recipes, and got mine printing cleanly by doing this:
from bs4 import BeautifulSoup
import requests

source = requests.get('url for website')
soup = BeautifulSoup(source.text, 'lxml')

for article in soup.find_all('article'):
    link = article.find('a', href=True)['href']
    print(link)
from bs4 import BeautifulSoup
import requests
url ="http://www.basketnews.lt/lygos/59-nacionaline-krepsinio-asociacija/2013/naujienos.html"
r = requests.get(url)
soup = BeautifulSoup(r.text)
naujienos = soup.findAll('a', {'class':'title'})
print naujienos
Here is the important part of the HTML:
<div class="title">
<span class="feedbacks"></span>
</div>
I get an empty list. Where is my mistake?
EDIT:
Thanks, it worked. Now I want to print the news titles. This is how I am trying to do it:
nba = soup.select('div.title > a')
for i in nba:
    print "" + i.string + "\n"
I get at most 5 titles before an error occurs: cannot concatenate 'str' and 'NoneType' objects.
soup.findAll('a', {'class':'title'})
This says, give me all a tags that also have class="title". That's obviously not what you're trying to do.
I think you want a tags that are the direct descendant of a tag with class="title". You can try using a css selector:
soup.select('div.title > a')
Out[58]:
[Blatche'as: “Garantuoju, kad laimėsime”,
 <a href="/news-73147-rockets-veikiausiai-pasiliks-mchalea.html">“Rockets” veikiausiai pasiliks McHale’ą</a>,
 # snip lots of other links
]
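As for the error in your edit: .string is None whenever a tag contains anything other than a single string (nested tags, for example), and concatenating None to a str raises exactly that error. A sketch using .get_text(), which always returns a string:

nba = soup.select('div.title > a')
for i in nba:
    print(i.get_text().strip() + "\n")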
I've got the following code trying to return data from some HTML; however, I am unable to return what I require...
import urllib2
from bs4 import BeautifulSoup
from time import sleep
def getData():
    htmlfile = open('C:/html.html', 'rb')
    html = htmlfile.read()
    soup = BeautifulSoup(html)
    items = soup.find_all('div', class_="blocks")
    for item in items:
        links = item.find_all('h3')
        for link in links:
            print link

getData()
Returns a list of the following:
<h3>
<a href="http://www.mywebsite.com/titles" title="Click for details(x)">
TITLE STUFF HERE (YES)
</a>
</h3>
<h3>
<a href="http://www.mywebsite.com/titles" title="Click for details(x)">
TITLE STUFF HERE (MAYBE)
</a>
</h3>
I want to be able to return just the title: TITLE STUFF HERE (YES) and TITLE STUFF HERE (MAYBE)
Another thing I want to be able to do is use the soup.find_all("a", limit=2) function, but instead of limiting the results to two, I want it to return ONLY the second link... so a select feature, not a limit? (Does such a feature exist?)
import urllib2
from bs4 import BeautifulSoup
from time import sleep
def getData():
    htmlfile = open('C:/html.html', 'rb')
    html = htmlfile.read()
    soup = BeautifulSoup(html)
    items = soup.find_all('div', class_="blocks")
    for item in items:
        links = item.find_all('a')
        for link in links:
            if link.parent.name == 'h3':
                print(link.text)

getData()
You can also just find all the links from the very beginning and check both that the parent is an h3 and that the parent's parent is a div with class blocks, as in the sketch below.
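A sketch of that approach, which also answers the second part of the question, since .find_all() returns a plain list you can index:

soup = BeautifulSoup(html)
for link in soup.find_all('a'):
    parent = link.parent
    grandparent = parent.parent
    # keep only links nested as <div class="blocks"> > <h3> > <a>
    if (parent.name == 'h3' and grandparent is not None
            and grandparent.name == 'div'
            and 'blocks' in (grandparent.get('class') or [])):
        print(link.text)

# "Only the second link": find_all() returns a list, so index it
# (this raises IndexError if fewer than two links exist)
second_link = soup.find_all('a', limit=2)[1]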