How to get only links from parsed html using python? - python

How can I get the links if the tag is in this form?
<div class="BNeawe vvjwJb AP7Wnd">Going Gourmet Catering (#goinggourmet) - Instagram</div></h3><div class="BNeawe UPmit AP7Wnd">www.instagram.com › goinggourmet</div>
I tried the code below, which gets me only the URLs, but they come back in this format:
/url?q=https://bespokecatering.sydney/&sa=U&ved=2ahUKEwjTv6ueseHyAhUHb30KHYTYABwQFnoECAEQAg&usg=AOvVaw076QI0_4Yw4hNZ6iXHQZL-
/url?q=https://www.facebook.com/bespokecatering.sydney/videos/lockdown-does-not-mean-unfulfilled-cravings-order-our-weekly-favorites-order-her/892336708293067/%3Fextid%3DSEO----&sa=U&ved=2ahUKEwjTv6ueseHyAhUHb30KHYTYABwQtwJ6BAgEEAE&usg=AOvVaw2YQI1Bqwip72axc-Nh2_6e
/url?q=https://www.instagram.com/bespoke_catering/%3Fhl%3Den&sa=U&ved=2ahUKEwjTv6ueseHyAhUHb30KHYTYABwQFnoECAoQAg&usg=AOvVaw1QUCWYmxfSLb6Jx20hyXIR
I need only the Facebook and Instagram URLs, with nothing extra. That is, I want the real link, not the redirect link.
From the links above, I need something like this:
'https://www.facebook.com/bespokecatering.sydney'
'https://www.instagram.com/bespoke_catering'
div = soup.find_all('div',attrs={'class':'kCrYT'})
for w in div:
    for link in w.select('a'):
        urls = link['href']
        print(urls)
Any help is much appreciated.
I tried the below code, but it returns empty results or different results
div = soup.find_all('div',attrs={'class':'kCrYT'})
for w in div:
    for link in w.select('a'):
        urls = link['href']
        print(urls)
for url in urls:
    try:
        j=url.split('=')[1]
        k= '/'.join(j.split('/')[0:4])
        #print(k)
    except:
        k = ''

You have already selected the <a> elements. Just loop over the selection and print each result via ['href']:
div = soup.find_all('div',attrs={'class':'kCrYT'})
for w in div:
    for link in w.select('a'):
        print(link['href'])
If you improve your question and add the requested information, we can give a more detailed answer.
EDIT
Answering your additional question with a simple example (something you should provide in your question):
import requests
from bs4 import BeautifulSoup
result = '''
<div class="kCrYT">
</div>
<div class="kCrYT">
</div>
<div class="kCrYT">
</div>
'''
soup = BeautifulSoup(result, 'lxml')
div = soup.find_all('div',attrs={'class':'kCrYT'})
for w in div:
    for link in w.select('a'):
        print(dict(x.split('=') for x in requests.utils.urlparse(link['href']).query.split('&'))['q'].split('%3F')[0])
Result:
https://bespokecatering.sydney/
https://www.facebook.com/bespokecatering.sydney/videos/lockdown-does-not-mean-unfulfilled-cravings-order-our-weekly-favorites-order-her/892336708293067/
https://www.instagram.com/bespoke_catering/
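The one-liner above splits the query string by hand. A minimal sketch of the same idea using only the standard library's urllib.parse is shown below. The hrefs are copied from the question; the helper name profile_url and the WANTED_HOSTS set are assumptions made for illustration, and keeping only the first path segment is one guess at what "real link" means for a profile.

```python
from urllib.parse import urlsplit, parse_qs

# Redirect-style hrefs copied from the question.
hrefs = [
    "/url?q=https://bespokecatering.sydney/&sa=U&ved=2ahUKEwjTv6ueseHyAhUHb30KHYTYABwQFnoECAEQAg&usg=AOvVaw076QI0_4Yw4hNZ6iXHQZL-",
    "/url?q=https://www.facebook.com/bespokecatering.sydney/videos/lockdown-does-not-mean-unfulfilled-cravings-order-our-weekly-favorites-order-her/892336708293067/%3Fextid%3DSEO----&sa=U&ved=2ahUKEwjTv6ueseHyAhUHb30KHYTYABwQtwJ6BAgEEAE&usg=AOvVaw2YQI1Bqwip72axc-Nh2_6e",
    "/url?q=https://www.instagram.com/bespoke_catering/%3Fhl%3Den&sa=U&ved=2ahUKEwjTv6ueseHyAhUHb30KHYTYABwQFnoECAoQAg&usg=AOvVaw1QUCWYmxfSLb6Jx20hyXIR",
]

# Assumption: only these two hosts are of interest.
WANTED_HOSTS = {"www.facebook.com", "www.instagram.com"}


def profile_url(href):
    """Return scheme://host/first-path-segment for a wanted host, else None."""
    query = parse_qs(urlsplit(href).query)  # parse_qs also decodes %3F, %3D, ...
    if "q" not in query:
        return None
    target = urlsplit(query["q"][0])
    if target.netloc not in WANTED_HOSTS:
        return None
    segments = [s for s in target.path.split("/") if s]
    if not segments:
        return None
    return f"{target.scheme}://{target.netloc}/{segments[0]}"


profiles = [u for u in map(profile_url, hrefs) if u]
print(profiles)
```

In a scraper, hrefs would instead be collected from link['href'] inside the loop over the kCrYT divs.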

Related

BeautifulSoup Web Scraping - Can't access and extract element

This is my first question on Stack Overflow.
I am working on a web scraping project and am trying to access HTML elements with Beautiful Soup.
Can someone please advise me how to extract the following elements?
The task is to scrape all job listings from a search result page.
The job listing elements are inside the "ResultsSectionContainer".
I want to access each <article> element and:
extract its id, e.g. job-item-7460756
extract the href of the link where data-at="job-item-title"
extract its h2 text (solved)
How do I loop through the ResultsSectionContainer and extract this information for each <article> element whose id starts with job-item?
The article class name seems to be dynamic and changes (I guess) every time a new search is run.
<div class="ResultsSectionContainer-gdhf14-0 cxyAav">
<article class="sc-fzowVh cUgVEH" id="job-item-7460756">
<a class="sc-fzoiQi eRNcm" data-at="job-item-title"
href="/stellenangebote--Wirtschaftsinformatiker-m-w-d-mit-Schwerpunkt-ERP-Systeme-Heidelberg-Celonic-Deutschland-GmbH-Co-KG--7460756-inline.html" target="_blank">
<h2 class="sc-fzqARJ iyolKq"> Wirtschaftsinformatiker (m/w/d) mit Schwerpunkt ERP-Systeme
</h2>
</a>
<article class="sc-fzowVh cUgVEH" id="job-item-7465958">
...
You can do it like this:
Select the <div> whose class name is ResultsSectionContainer-gdhf14-0.
Find all the <article> tags inside that <div> using .find_all(). This gives you a list of all article tags.
Iterate over that list and extract the data you need.
from bs4 import BeautifulSoup
s = '''<div class="ResultsSectionContainer-gdhf14-0 cxyAav">
<article class="sc-fzowVh cUgVEH" id="job-item-7460756">
<a class="sc-fzoiQi eRNcm" data-at="job-item-title"
href="/stellenangebote--Wirtschaftsinformatiker-m-w-d-mit-Schwerpunkt-ERP-Systeme-Heidelberg-Celonic-Deutschland-GmbH-Co-KG--7460756-inline.html" target="_blank">
<h2 class="sc-fzqARJ iyolKq"> Wirtschaftsinformatiker (m/w/d) mit Schwerpunkt ERP-Systeme
</h2>
</a>
</div>'''
soup = BeautifulSoup(s, 'lxml')
d = soup.find('div', class_='ResultsSectionContainer-gdhf14-0')
for i in d.find_all('article'):
    job_id = i['id']
    job_link = i.find('a', {'data-at': 'job-item-title'})['href']
    print(f'JOB_ID: {job_id}\nJOB_LINK: {job_link}')
JOB_ID: job-item-7460756
JOB_LINK: /stellenangebote--Wirtschaftsinformatiker-m-w-d-mit-Schwerpunkt-ERP-Systeme-Heidelberg-Celonic-Deutschland-GmbH-Co-KG--7460756-inline.html
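Since the question notes that the generated class names may change between searches, one possible alternative is to match on the stable parts only: the ResultsSectionContainer class prefix, the job-item- id prefix, and the data-at attribute. The sketch below assumes those three parts really are stable, which the source does not confirm. It uses CSS attribute selectors through select(), and a closing </article> tag has been added to the sample markup so that it parses cleanly with html.parser.

```python
from bs4 import BeautifulSoup

html = '''<div class="ResultsSectionContainer-gdhf14-0 cxyAav">
<article class="sc-fzowVh cUgVEH" id="job-item-7460756">
<a class="sc-fzoiQi eRNcm" data-at="job-item-title"
href="/stellenangebote--Wirtschaftsinformatiker-m-w-d-mit-Schwerpunkt-ERP-Systeme-Heidelberg-Celonic-Deutschland-GmbH-Co-KG--7460756-inline.html" target="_blank">
<h2 class="sc-fzqARJ iyolKq"> Wirtschaftsinformatiker (m/w/d) mit Schwerpunkt ERP-Systeme
</h2>
</a>
</article>
</div>'''

soup = BeautifulSoup(html, "html.parser")

jobs = []
# [attr^="x"] matches attribute values that start with "x".
selector = 'div[class^="ResultsSectionContainer"] article[id^="job-item-"]'
for article in soup.select(selector):
    link = article.select_one('a[data-at="job-item-title"]')
    jobs.append({
        "id": article["id"],
        "href": link["href"] if link else None,
        "title": article.h2.get_text(strip=True) if article.h2 else None,
    })

print(jobs)
```

The `if link else None` guards keep the loop from failing on an article that lacks the expected link or heading.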
If all the article classes are the same, try this:
# `data` is assumed to be the BeautifulSoup object for the page
articles = data.find_all("article", attrs={"class": "sc-fzowVh cUgVEH"})
for article in articles:
    print(article.get("id"))
    print(article.a.get("href"))
    print(article.h2.text.strip())
You could do something like this:
results = soup.find_all('article', {'class': 'sc-fzowVh cUgVEH'})
for result in results:
    job_id = result.attrs['id']  # avoid shadowing the built-in id()
    href = result.find('a').attrs['href']
    h2 = result.find('h2').text.strip()  # text of the heading only
    print(f' Job id: \t{job_id}\n Job link: \t{href}\n Job desc: \t{h2}\n')
    print('---')
You may also want to prefix the href with the URL of the site you are pulling the results from.
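One way to do that prefixing is urllib.parse.urljoin, which resolves a relative href against the URL of the page it came from. The base URL below is hypothetical, since the thread does not name the site; the href is the one from the question.

```python
from urllib.parse import urljoin

# Hypothetical URL of the search results page (not given in the thread).
BASE_URL = "https://jobs.example.com/search"

# Relative href taken from the question's HTML.
href = ("/stellenangebote--Wirtschaftsinformatiker-m-w-d-mit-Schwerpunkt-"
        "ERP-Systeme-Heidelberg-Celonic-Deutschland-GmbH-Co-KG--7460756-inline.html")

# A root-relative href replaces the whole path of the base URL.
absolute = urljoin(BASE_URL, href)
print(absolute)
```

Compared with plain string concatenation, urljoin also copes with hrefs that are already absolute or relative to the current path.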

Scrapy: how to get links to users?

I'm trying to get links to group members:
response.css('.text--ellipsisOneLine::attr(href)').getall()
Why isn't this working?
html:
<div class="flex flex--row flex--noGutters flex--alignCenter">
<div class="flex-item _memberItem-module_name__BSx8i">
<a href="/ru-RU/Connect-IT-Meetup-in-Chisinau/members/280162178/profile/?returnPage=1">
<h4 class="text--bold text--ellipsisOneLine">Liviu Cernei</h4>
</a>
</div>
</div>
Your selector isn't working because you are looking for an attribute (href) that this element doesn't have.
response.css('.text--ellipsisOneLine::attr(href)').getall()
This selector looks for an href attribute on elements with the class text--ellipsisOneLine. In your HTML snippet that class matches only this element:
<h4 class="text--bold text--ellipsisOneLine">Liviu Cernei</h4>
As you can see, there is no href attribute. If you want the text inside this h4 element, you need to use the ::text pseudo-element:
response.css('.text--ellipsisOneLine::text').getall()
I realize that this isn't Scrapy, but for web scraping I personally use the requests module and BeautifulSoup4. The following snippet uses them to get a list of users:
import requests
from bs4 import BeautifulSoup
response = requests.get('https://www.meetup.com/ru-RU/Connect-IT-Meetup-in-Chisinau/members/')
if response.status_code == 200:
    # Parse only when the request succeeded, so html_doc is always defined here
    html_doc = response.text
    html_source = BeautifulSoup(html_doc, 'html.parser')
    users = html_source.find_all('h4')
    for user in users:
        print(user.text)
Using a CSS selector:
response.css('.member-item .flex--alignCenter a::attr(href)').getall()

How can I get the link from href in "a" with class name by using python 3

I'm trying to get the Google Maps link from this element:
<div class="something1">
<span class="something2"></span>
<a data-track-id="Google Map" href="https://www.google.com/maps/dir//11111/#22222" target="_blank" class="something3">Google Map</a>
</div>
From this I only want to get https://www.google.com/maps/dir//11111/#22222
My code is:
gpslocation = []
for gps in (secondpage_parser.find("a", {"data-track-id":"Google Map"})):
    gpslocation.append(gps.attrs["href"])
I'm scraping a blog website using two pages (main and secondpage), and this element is on the second page. Other fields such as Story-Title and Author Name work because they appear as text, so I can use get_text().
But in this case I cannot get the link from the href. Please help.
P.S. If I only want the latitude and longitude from the link (11111 and 22222), is there a way to use str.rsplit?
Thank you so much
You can use the following:
secondpage_parser.find("a", {"data-track-id":"Google Map"})['href']
Use soup.find(...)['href'] to get the href of the first matching link, or soup.find_all('a' ... , href=True) to get all links that have an href.
Yes, you can use split to get only the latitude and longitude:
First split on // and take the last element ([-1]).
Then split on /# to get both the latitude and the longitude.
from bs4 import BeautifulSoup
data = """
<div class="something1">
<span class="something2"></span>
<a data-track-id="Google Map" href="https://www.google.com/maps/dir//11111/#22222" target="_blank" class="something3">Google Map</a>
</div>
"""
soup = BeautifulSoup(data, "html.parser")
for gps in soup.find_all('a', href=True):
    href = gps['href']
    print(href)
    lati, longi = href.split("//")[-1].split('/#')
    print(lati)
    print(longi)
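As an alternative to chained split calls, here is a small sketch that lets urllib.parse.urlsplit separate the path from the fragment first. It assumes the link always has the shape shown in the question, with the latitude as the last path segment and the longitude after the #.

```python
from urllib.parse import urlsplit

# href taken from the question's HTML.
href = "https://www.google.com/maps/dir//11111/#22222"

parts = urlsplit(href)
# parts.path is "/maps/dir//11111/"; drop the trailing slash, keep the last segment.
lati = parts.path.rstrip("/").rsplit("/", 1)[-1]
# parts.fragment is everything after the "#".
longi = parts.fragment

print(lati, longi)
```

This also answers the rsplit part of the question: rsplit("/", 1) splits once from the right, so only the final path segment is separated.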

Getting Duplicate links in Scraping

I am trying to collect the "a" tags inside class="featured" from the site http://www.pakistanfashionmagazine.com
I wrote this piece of code. It runs without errors, but it duplicates the links. How can I get rid of the duplicates?
from bs4 import BeautifulSoup
import requests
url = input("Enter a website to extract the URL's from: ")
r = requests.get(url)
data = r.text
soup = BeautifulSoup(data, 'html.parser')
results = soup.findAll('div', attrs={"class":'featured'})
for div in results:
    links = div.findAll('a')
    for a in links:
        print("http://www.pakistanfashionmagazine.com/" + a['href'])
The actual HTML page has two links per item <div>; one for the image, the other for the <h4> tag:
<div class="item">
<div class="image">
<img src="/siteimages/upload/BELLA-Embroidered-Lawn-Collection3-STITCHED-SUITSPKR-14000-ONLY_1529IM1-thumb.jpg" alt="Featured Product" /> </div>
<div class="detail">
<h4>BELLA Embroidered Lawn Collection*3 STITCHED SUITS#PKR 14000 ONLY</h4>
<em>updated: 2013-06-03</em>
<p>BELLA Embroidered Lawn Collection*3 STITCHED SUITS#PKR 14000 ONLY</p>
</div>
</div>
Limit your links to just one or the other; I'd use CSS selectors here:
links = soup.select('div.featured .detail a[href]')
for link in links:
    print("http://www.pakistanfashionmagazine.com/" + link['href'])
Now 32 links are printed, not 64.
If you need to limit this to just the second featured section (Beauty Tips), select the featured divs, pick the second one from the list, then select the links within it:
links = soup.select('div.featured')[1].select('.detail a[href]')
Now you have just the 8 links in that section.
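If both links per item have to stay in the selection for some reason, another option is to drop repeated hrefs after collecting them. The sketch below uses dict.fromkeys, which keeps the first occurrence of each key in order. The hrefs list is hypothetical sample data standing in for the values of a['href']; only the site URL comes from the question.

```python
from urllib.parse import urljoin

BASE = "http://www.pakistanfashionmagazine.com/"

# Hypothetical hrefs, as they might be collected from a['href'] in the loop:
# each item appears twice (image link and heading link).
hrefs = ["/a/1.html", "/a/1.html", "/b/2.html", "/b/2.html"]

# dict.fromkeys keeps insertion order and silently drops repeated keys.
unique_links = list(dict.fromkeys(urljoin(BASE, h) for h in hrefs))

for link in unique_links:
    print(link)
```

Using urljoin here also avoids the double slash that plain concatenation produces when the href already starts with "/".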

Improving a python snippet

I'm working on a python script to do some web scraping. I want to find the base URL of a given section on a web page that looks like this:
<div class='pagination'>
<a href='webpage-category/page/1'>1</a>
<a href='webpage-category/page/2'>2</a>
...
</div>
So I just need everything from the first href except the number ('webpage-category/page/'), and I have the following working code:
pages = [l['href'] for link in soup.find_all('div', class_='pagination')
         for l in link.find_all('a') if not re.search('pageSub', l['href'])]
s = pages[0]
f = ''.join([i for i in s if not i.isdigit()])
The problem is that building this list is wasteful, since I only need the first href. I think a generator would be the answer, but I couldn't pull it off. Could you help me make this code more concise?
What about this:
from bs4 import BeautifulSoup
html = """ <div class='pagination'>
<a href='webpage-category/page/1'>1</a>
<a href='webpage-category/page/2'>2</a>
</div>"""
soup = BeautifulSoup(html, 'html.parser')
link = soup.find('div', {'class': 'pagination'}).find('a')['href']
print('/'.join(link.split('/')[:-1]))
prints:
webpage-category/page
Just FYI, regarding the code you've provided: you can use next() with a generator expression instead of building a list:
s = next(l['href'] for link in soup.find_all('div', class_='pagination')
         for l in link.find_all('a') if not re.search('pageSub', l['href']))
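One caveat with next() on a generator is that it raises StopIteration when nothing matches. A minimal sketch of passing a default instead is shown below, using a plain list of hrefs so that it runs without a parsed page. The pageSub entry is a hypothetical example of a link the filter is meant to skip.

```python
import re

# Sample hrefs; the first one is a hypothetical link that the filter skips.
hrefs = [
    "webpage-category/pageSub/9",
    "webpage-category/page/1",
    "webpage-category/page/2",
]

# The generator stops at the first match; None is returned if there is none.
first = next((h for h in hrefs if not re.search("pageSub", h)), None)

# Cut off the trailing page number, keeping the final slash.
base = first.rsplit("/", 1)[0] + "/" if first else None
print(base)
```

Note that the generator expression needs its own parentheses when next() is given a second argument.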
UPD (using the website link provided):
from urllib.request import urlopen
from bs4 import BeautifulSoup
url = "http://www.hdwallpapers.in/cars-desktop-wallpapers/page/2"
soup = BeautifulSoup(urlopen(url), 'html.parser')
links = soup.find_all('div', {'class': 'pagination'})[1].find_all('a')
print(next('/'.join(link['href'].split('/')[:-1]) for link in links
           if link.text.isdigit() and link.text != "1"))
