Getting duplicate links when scraping - Python

I am trying to collect the "a" tags that are inside class="featured" on the site http://www.pakistanfashionmagazine.com.
I wrote this piece of code; it has no errors, but it duplicates the links. How can I overcome this duplication?
from bs4 import BeautifulSoup
import requests

url = raw_input("Enter a website to extract the URL's from: ")
r = requests.get(url)
data = r.text
soup = BeautifulSoup(data)
results = soup.findAll('div', attrs={"class": 'featured'})
for div in results:
    links = div.findAll('a')
    for a in links:
        print "http://www.pakistanfashionmagazine.com/" + a['href']

The actual HTML page has two links per item <div>; one for the image, the other for the <h4> tag:
<div class="item">
    <div class="image">
        <img src="/siteimages/upload/BELLA-Embroidered-Lawn-Collection3-STITCHED-SUITSPKR-14000-ONLY_1529IM1-thumb.jpg" alt="Featured Product" />
    </div>
    <div class="detail">
        <h4>BELLA Embroidered Lawn Collection*3 STITCHED SUITS#PKR 14000 ONLY</h4>
        <em>updated: 2013-06-03</em>
        <p>BELLA Embroidered Lawn Collection*3 STITCHED SUITS#PKR 14000 ONLY</p>
    </div>
</div>
Limit your links to just one or the other; I'd use CSS selectors here:
links = soup.select('div.featured .detail a[href]')
for link in links:
    print "http://www.pakistanfashionmagazine.com/" + link['href']
Now 32 links are printed, not 64.
If you need to limit this to just the second featured section (Beauty Tips), select the featured divs, pick the second one from the list, then select the links within it:
links = soup.select('div.featured')[1].select('.detail a[href]')
Now you have just the 8 links in that section.
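As an aside, if you'd rather keep the findAll approach and simply drop the repeated targets, a set works too. Below is a minimal Python 3 sketch of that idea; it also uses urljoin so relative hrefs are resolved against the base URL instead of being naively prefixed (variable names are illustrative):

from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

base = "http://www.pakistanfashionmagazine.com/"
soup = BeautifulSoup(requests.get(base).text, "html.parser")

seen = set()
for div in soup.find_all('div', attrs={"class": "featured"}):
    for a in div.find_all('a', href=True):
        url = urljoin(base, a['href'])  # resolve relative hrefs against the base URL
        if url not in seen:             # each image/heading pair yields its target only once
            seen.add(url)
            print(url)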

Related

How do I get href links under a specific class with BeautifulSoup

This is an example of the type of block of HTML source code I'm targeting with BeautifulSoup
<div class="fighter_list left">
<meta itemprop="image" content="/image_crop/44/44/_images/fighter/1406924569376_20140801011731_Picture17.JPG">
<img class="lazy" src="/image_crop/44/44/_images/fighter/1406924569376_20140801011731_Picture17.JPG" data-original="/image_crop/44/44/_images/fighter/1406924569376_20140801011731_Picture17.JPG" alt="Jason DeLucia" title="Jason DeLucia" />
<div class="fighter_result_data">
<a itemprop="url" href="/fighter/Jason-DeLucia-22"><span itemprop="name">Jason<br />DeLucia</span></a><br>
This is one of multiple blocks like this, one for each "fighter_list left" on the page.
I want to get all of the itemprop="url" href links that are in the "fighter_list left" class (i.e. /fighter/Jason-DeLucia-22)
When I try the below code I get nothing.
for link in html.find_all('a', class_="fighter_List left", itemprop="url"):
    print(link.get('href'))
The closest I can get is getting every itemprop=url link on the page when I omit the class_= part.
But I only want the ones under the fighter_list left class.
This is the website https://www.sherdog.com/events/UFC-1-The-Beginning-7
Your find_all returns nothing because the fighter_list left class is on the enclosing <div>, not on the <a> tags themselves. You can use a CSS selector for the task:
import requests
from bs4 import BeautifulSoup

url = "https://www.sherdog.com/events/UFC-1-The-Beginning-7"
soup = BeautifulSoup(requests.get(url).content, "html.parser")

for link in soup.select('.fighter_list.left [itemprop="url"]'):
    print(link["href"])
Prints:
/fighter/Jason-DeLucia-22
/fighter/Royce-Gracie-19
/fighter/Gerard-Gordeau-15
/fighter/Ken-Shamrock-4
/fighter/Royce-Gracie-19
/fighter/Kevin-Rosier-17
/fighter/Gerard-Gordeau-15
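For comparison, a sketch of the same filtering without CSS selectors, assuming the same page structure: select the container divs first, then the itemprop="url" anchors inside them:

import requests
from bs4 import BeautifulSoup

url = "https://www.sherdog.com/events/UFC-1-The-Beginning-7"
soup = BeautifulSoup(requests.get(url).content, "html.parser")

# class_ matches individual class tokens, so "fighter_list" matches class="fighter_list left"
for container in soup.find_all("div", class_="fighter_list"):
    if "left" in container.get("class", []):  # keep only the left-hand column
        for link in container.find_all("a", itemprop="url"):
            print(link["href"])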

BeautifulSoup Web Scraping - Can't access and extract element

This is my first question on Stack Overflow.
I am working on a web scraping project and I am trying to access HTML elements with Beautiful Soup.
Can someone please advise me on how to extract the following elements?
The task is to scrape all job listings from a search result page.
The job listing elements are inside the "ResultsSectionContainer".
I want to access each article element and:
extract its id, e.g. job-item-7460756
extract its href where data-at="job-item-title"
extract its h2 text (solved)
How can I loop through the ResultsSectionContainer and access/extract the information for each article element / job-item id?
The name of the article class seems to be dynamic/unique and (I guess) changes every time a new search is done.
<div class="ResultsSectionContainer-gdhf14-0 cxyAav">
    <article class="sc-fzowVh cUgVEH" id="job-item-7460756">
        <a class="sc-fzoiQi eRNcm" data-at="job-item-title"
           href="/stellenangebote--Wirtschaftsinformatiker-m-w-d-mit-Schwerpunkt-ERP-Systeme-Heidelberg-Celonic-Deutschland-GmbH-Co-KG--7460756-inline.html" target="_blank">
            <h2 class="sc-fzqARJ iyolKq">Wirtschaftsinformatiker (m/w/d) mit Schwerpunkt ERP-Systeme</h2>
        </a>
    </article>
    <article class="sc-fzowVh cUgVEH" id="job-item-7465958">
    ...
You can do it like this:
Select the <div> with the class name ResultsSectionContainer-gdhf14-0.
Find all the <article> tags inside that <div> using .find_all() - this gives you a list of all article tags.
Iterate over that list and extract the data you need.
from bs4 import BeautifulSoup

s = '''<div class="ResultsSectionContainer-gdhf14-0 cxyAav">
<article class="sc-fzowVh cUgVEH" id="job-item-7460756">
<a class="sc-fzoiQi eRNcm" data-at="job-item-title"
href="/stellenangebote--Wirtschaftsinformatiker-m-w-d-mit-Schwerpunkt-ERP-Systeme-Heidelberg-Celonic-Deutschland-GmbH-Co-KG--7460756-inline.html" target="_blank">
<h2 class="sc-fzqARJ iyolKq"> Wirtschaftsinformatiker (m/w/d) mit Schwerpunkt ERP-Systeme
</h2>
</a>
</div>'''

soup = BeautifulSoup(s, 'lxml')
d = soup.find('div', class_='ResultsSectionContainer-gdhf14-0')
for i in d.find_all('article'):
    job_id = i['id']
    job_link = i.find('a', {'data-at': 'job-item-title'})['href']
    print(f'JOB_ID: {job_id}\nJOB_LINK: {job_link}')
JOB_ID: job-item-7460756
JOB_LINK: /stellenangebote--Wirtschaftsinformatiker-m-w-d-mit-Schwerpunkt-ERP-Systeme-Heidelberg-Celonic-Deutschland-GmbH-Co-KG--7460756-inline.html
If all article classes are the same, try this:
articles = soup.find_all("article", attrs={"class": "sc-fzowVh cUgVEH"})  # soup is your BeautifulSoup object
for article in articles:
    print(article.get("id"))
    print(article.a.get("href"))
    print(article.h2.text.strip())
You could do something like this:
results = soup.findAll('article', {'class': 'sc-fzowVh cUgVEH'})
for result in results:
    id = result.attrs['id']
    href = result.find('a').attrs['href']
    h2 = result.find('h2').text.strip()
    print(f' Job id: \t{id}\n Job link: \t{href}\n Job desc: \t{h2}\n')
    print('---')
You may also want to prefix the href with the URL of the site you're pulling the results from.
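For instance, a minimal sketch of that prefixing with urljoin; the base URL below is an assumption for illustration, so substitute the site you actually scraped:

from urllib.parse import urljoin

base_url = "https://www.stepstone.de"  # assumption: replace with the real base URL
relative_href = "/stellenangebote--Wirtschaftsinformatiker-m-w-d-mit-Schwerpunkt-ERP-Systeme-Heidelberg-Celonic-Deutschland-GmbH-Co-KG--7460756-inline.html"

# urljoin handles leading slashes, avoiding doubled or missing separators
print(urljoin(base_url, relative_href))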

How to get only links from parsed html using python?

How can I get the links if the tag is in this form?
<div class="BNeawe vvjwJb AP7Wnd">Going Gourmet Catering (#goinggourmet) - Instagram</div></h3><div class="BNeawe UPmit AP7Wnd">www.instagram.com › goinggourmet</div>
I have tried the code below and it helped me get only URLs, but they come in this format:
/url?q=https://bespokecatering.sydney/&sa=U&ved=2ahUKEwjTv6ueseHyAhUHb30KHYTYABwQFnoECAEQAg&usg=AOvVaw076QI0_4Yw4hNZ6iXHQZL-
/url?q=https://www.facebook.com/bespokecatering.sydney/videos/lockdown-does-not-mean-unfulfilled-cravings-order-our-weekly-favorites-order-her/892336708293067/%3Fextid%3DSEO----&sa=U&ved=2ahUKEwjTv6ueseHyAhUHb30KHYTYABwQtwJ6BAgEEAE&usg=AOvVaw2YQI1Bqwip72axc-Nh2_6e
/url?q=https://www.instagram.com/bespoke_catering/%3Fhl%3Den&sa=U&ved=2ahUKEwjTv6ueseHyAhUHb30KHYTYABwQFnoECAoQAg&usg=AOvVaw1QUCWYmxfSLb6Jx20hyXIR
I need only the URLs from Facebook and Instagram, without any extra query parameters; that is, I want the real link, not the Google redirect link.
I need something like this from above links,
'https://www.facebook.com/bespokecatering.sydney'
'https://www.instagram.com/bespoke_catering'
div = soup.find_all('div', attrs={'class': 'kCrYT'})
for w in div:
    for link in w.select('a'):
        urls = link['href']
        print(urls)
Any help is much appreciated.
I tried the code below, but it returns empty or different results:
div = soup.find_all('div', attrs={'class': 'kCrYT'})
for w in div:
    for link in w.select('a'):
        urls = link['href']
        print(urls)

for url in urls:
    try:
        j = url.split('=')[1]
        k = '/'.join(j.split('/')[0:4])
        # print(k)
    except:
        k = ''
You already have your <a> elements selected; just loop over the selection and print each result via ['href']:
div = soup.find_all('div', attrs={'class': 'kCrYT'})
for w in div:
    for link in w.select('a'):
        print(link['href'])
If you improve your question and add the additional information requested, we can give a more detailed answer.
EDIT
Answering your additional question with a simple example (something you should have provided in your question):
import requests
from bs4 import BeautifulSoup

result = '''
<div class="kCrYT">
<a href="/url?q=https://bespokecatering.sydney/&sa=U&ved=2ahUKEwjTv6ueseHyAhUHb30KHYTYABwQFnoECAEQAg&usg=AOvVaw076QI0_4Yw4hNZ6iXHQZL-"></a>
</div>
<div class="kCrYT">
<a href="/url?q=https://www.facebook.com/bespokecatering.sydney/videos/lockdown-does-not-mean-unfulfilled-cravings-order-our-weekly-favorites-order-her/892336708293067/%3Fextid%3DSEO----&sa=U&ved=2ahUKEwjTv6ueseHyAhUHb30KHYTYABwQtwJ6BAgEEAE&usg=AOvVaw2YQI1Bqwip72axc-Nh2_6e"></a>
</div>
<div class="kCrYT">
<a href="/url?q=https://www.instagram.com/bespoke_catering/%3Fhl%3Den&sa=U&ved=2ahUKEwjTv6ueseHyAhUHb30KHYTYABwQFnoECAoQAg&usg=AOvVaw1QUCWYmxfSLb6Jx20hyXIR"></a>
</div>
'''

soup = BeautifulSoup(result, 'lxml')
div = soup.find_all('div', attrs={'class': 'kCrYT'})
for w in div:
    for link in w.select('a'):
        print(dict(x.split('=') for x in requests.utils.urlparse(link['href']).query.split('&'))['q'].split('%3F')[0])
Result:
https://bespokecatering.sydney/
https://www.facebook.com/bespokecatering.sydney/videos/lockdown-does-not-mean-unfulfilled-cravings-order-our-weekly-favorites-order-her/892336708293067/
https://www.instagram.com/bespoke_catering/
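Note that the dict/split one-liner breaks if any parameter value contains a literal '='. A more robust sketch of the same extraction uses urllib.parse.parse_qs, which splits the query safely and percent-decodes the values for you, so the trailing query string can then be dropped with a plain '?' split:

from urllib.parse import urlparse, parse_qs

href = "/url?q=https://www.instagram.com/bespoke_catering/%3Fhl%3Den&sa=U&ved=2ahUKEwjTv6ueseHyAhUHb30KHYTYABwQFnoECAoQAg&usg=AOvVaw1QUCWYmxfSLb6Jx20hyXIR"

query = parse_qs(urlparse(href).query)  # parse_qs percent-decodes values
real_url = query['q'][0]                # 'https://www.instagram.com/bespoke_catering/?hl=en'
print(real_url.split('?')[0])           # https://www.instagram.com/bespoke_catering/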

Beautiful Soup Query with Filtering in Python

I am trying to save the contents of each article in its own text file. What I am having trouble with is coming up with a Beautiful Soup approach that returns only articles of type News, ignoring the other article types.
Website in question: https://www.nature.com/nature/articles
Info
Every article is enclosed in a pair of <article> tags.
Each article's type is hidden inside a <span> tag whose data-test attribute has the value article.type.
The title of the article is placed inside an <a> tag with the data-track-label="link" attribute.
The article body is wrapped in a <div> tag (look for "body" in the class attribute).
Current code
I was able to get to the point where I can query the <span> for articles of the News type, but I am struggling with the next steps to return the rest of the article-specific information.
How can I take this further? For articles of type News, I'd like to also return the article's title and body, while ignoring articles that are not of type News.
# Send HTTP requests
import requests
from bs4 import BeautifulSoup

class WebScraper:
    @staticmethod
    def get_the_source():
        # Obtain the URL
        url = 'https://www.nature.com/nature/articles'
        # Get the webpage
        r = requests.get(url)
        # Check response object's status code
        if r:
            the_source = open("source.html", "wb")
            soup = BeautifulSoup(r.content, 'html.parser')
            type_news = soup.find_all("span", string='News')
            for i in type_news:
                print(i.text)
            the_source.write(r.content)
            the_source.close()
            print('\nContent saved.')
        else:
            print(f'The URL returned {r.status_code}!')

WebScraper.get_the_source()
Sample HTML for an article that is of type News
The source code has 19 other articles with similar and different article types.
<article class="u-full-height c-card c-card--flush" itemscope itemtype="http://schema.org/ScholarlyArticle">
  <div class="c-card__image">
    <picture>
      <source type="image/webp"
              srcset="//media.springernature.com/w165h90/magazine-assets/d41586-021-00485-2/d41586-021-00485-2_18927840.jpg?as=webp 160w,
                      //media.springernature.com/w290h158/magazine-assets/d41586-021-00485-2/d41586-021-00485-2_18927840.jpg?as=webp 290w"
              sizes="(max-width: 640px) 160px, (max-width: 1200px) 290px, 290px">
      <img src="//media.springernature.com/w290h158/magazine-assets/d41586-021-00485-2/d41586-021-00485-2_18927840.jpg" alt="" itemprop="image">
    </picture>
  </div>
  <div class="c-card__body u-display-flex u-flex-direction-column">
    <h3 class="c-card__title" itemprop="name headline">
      <a href="/articles/d41586-021-00485-2" class="c-card__link u-link-inherit" itemprop="url" data-track="click" data-track-action="view article" data-track-label="link">Mars arrivals and Etna eruption — February's best science images</a>
    </h3>
    <div class="c-card__summary u-mb-16 u-hide-sm-max" itemprop="description">
      <p>The month’s sharpest science shots, selected by <i>Nature's</i> photo team.</p>
    </div>
    <div class="u-mt-auto">
      <ul data-test="author-list" class="c-author-list c-author-list--compact u-mb-4">
        <li itemprop="creator" itemscope="" itemtype="http://schema.org/Person"><span itemprop="name">Emma Stoye</span></li>
      </ul>
      <div class="c-card__section c-meta">
        <span class="c-meta__item c-meta__item--block-at-xl" data-test="article.type">
          <span class="c-meta__type">News</span>
        </span>
        <time class="c-meta__item c-meta__item--block-at-xl" datetime="2021-03-05" itemprop="datePublished">05 Mar 2021</time>
      </div>
    </div>
  </div>
</article>
</div>
</li>
<li class="app-article-list-row__item">
  <div class="u-full-height" data-native-ad-placement="false">
The simplest way, which also gets you more results per request, is to add News to the query string as a param:
https://www.nature.com/nature/articles?type=news
import requests
from bs4 import BeautifulSoup as bs

r = requests.get('https://www.nature.com/nature/articles?type=news')
soup = bs(r.content, 'lxml')

news_articles = soup.select('.app-article-list-row__item')
for n in news_articles:
    print(n.select_one('.c-card__link').text)
A variety of params for page 2 of news:
https://www.nature.com/nature/articles?searchType=journalSearch&sort=PubDate&type=news&page=2
If you monitor the browser's network tab while manually filtering on the page or selecting different page numbers, you can see how the query strings are constructed and tailor your requests accordingly, e.g.
https://www.nature.com/nature/articles?type=news&year=2021
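Building on that, a small sketch that pages through the news results by incrementing the page param; the range of three pages is an arbitrary example, and it assumes type and page alone are sufficient, as the URLs above suggest:

import requests
from bs4 import BeautifulSoup as bs

for page in range(1, 4):  # arbitrary example: first three pages
    r = requests.get('https://www.nature.com/nature/articles',
                     params={'type': 'news', 'page': page})
    soup = bs(r.content, 'lxml')
    for n in soup.select('.app-article-list-row__item'):
        print(page, n.select_one('.c-card__link').text.strip())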
Otherwise, you could do more convoluted (in/ex)clusion with CSS selectors, based on whether article nodes have a specific child containing "News" (inclusion); exclusion being "News" combined with another word/symbol (as per the categories list):
import requests
from bs4 import BeautifulSoup as bs

r = requests.get('https://www.nature.com/nature/articles')
soup = bs(r.content, 'lxml')

news_articles = soup.select('.app-article-list-row__item:has(.c-meta__type:contains("News"):not( \
    :contains("&"), \
    :contains("in"), \
    :contains("Career"), \
    :contains("Feature")))')  # exclusions

for n in news_articles:
    print(n.select_one('.c-card__link').text)
You can remove categories from the :not() list if you want News & or News In etc...
If you don't want to filter via the URL, loop through the <article> tags and check the text of the element with class c-meta__type:
articles = soup.select('article')
for article in articles:
    article_type = article.select_one('.c-meta__type').text.strip()
    if article_type == 'News':
        # or, if the type only needs to contain News:
        # if 'News' in article_type:
        title = article.select_one('a').text
        summary = article.select_one('.c-card__summary p').text
        print("{}: {}\n{}\n\n".format(article_type, title, summary))
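To round off the original task of saving each News article to its own text file, here is a minimal sketch built on the assumptions stated in the question: the title link is relative to nature.com, and the body sits in a <div> whose class contains "body". Both selectors come from the question's notes, not from a verified inspection of the live page:

import requests
from bs4 import BeautifulSoup as bs
from urllib.parse import urljoin

base = 'https://www.nature.com'
listing = bs(requests.get(base + '/nature/articles?type=news').content, 'lxml')

for item in listing.select('.app-article-list-row__item'):
    link = item.select_one('a[data-track-label="link"]')
    if link is None:
        continue
    article_url = urljoin(base, link['href'])
    article = bs(requests.get(article_url).content, 'lxml')
    body = article.select_one('div[class*="body"]')  # assumption from the question's notes
    if body is None:
        continue
    # derive a file name from the last path segment of the article URL
    name = article_url.rstrip('/').rsplit('/', 1)[-1] + '.txt'
    with open(name, 'w', encoding='utf-8') as f:
        f.write(link.text.strip() + '\n\n' + body.get_text(' ', strip=True))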

How can I get the link from href in "a" with class name by using python 3

I've tried to get the link from Google Maps; the element is:
<div class="something1">
<span class="something2"></span>
<a data-track-id="Google Map" href="https://www.google.com/maps/dir//11111/#22222" target="_blank" class="something3">Google Map</a>
</div>
from which I only want to get https://www.google.com/maps/dir//11111/#22222
My code is
gpslocation = []
for gps in (secondpage_parser.find("a", {"data-track-id": "Google Map"})):
    gpslocation.append(gps.attrs["href"])
I'm using two URL pages (main and second page) for scraping a blog website; this element is on the second page. Other info like the story title or author name works fine, since it appears as text and I can use get_text().
But in this case, I could not get the link from the href. Please help.
P.S. Since I only want the latitude and longitude from the link (11111 and 22222), is there a way to use str.split?
Thank you so much.
Your loop fails because find() returns a single Tag, and iterating over a Tag loops over its children rather than over a list of matches. To get the link directly, you can use the following:
secondpage_parser.find("a", {"data-track-id":"Google Map"})['href']
Use soup.find(...)['href'] to get the href of a single link, or soup.find_all('a', href=True) to find all links that have an href.
Yes, you can use split to get just the latitude and longitude:
First split on // and take the last element ([-1]).
Then split on /# to get both lat and long.
from bs4 import BeautifulSoup

data = """
<div class="something1">
    <span class="something2"></span>
    <a data-track-id="Google Map" href="https://www.google.com/maps/dir//11111/#22222" target="_blank" class="something3">Google Map</a>
</div>
"""

soup = BeautifulSoup(data, "html.parser")
for gps in soup.find_all('a', href=True):
    href = gps['href']
    print(href)
    lati, longi = href.split("//")[-1].split('/#')
    print(lati)
    print(longi)
