Unable to extract full URL from @href using Scrapy - Python

I am trying to extract the url of a product from amazon.in. The href-attribute inside the a-tag from the source looks like this:
href="/Parachute-Coconut-Oil-600-Free/dp/B081WSB91C/ref=sr_1_49?dchild=1&fpw=pantry&fst=as%3Aoff&qid=1588693187&s=pantry&sr=8-49&srs=9574332031&swrs=789D2F4EC1B25821250A55BFCB953F03"
What Scrapy is extracting is:
/Parachute-Coconut-Oil-Bottle-600ml/dp/B071FB2ZVT?dchild=1
I used the following xpath:
//div[@class="a-section a-spacing-none a-spacing-top-small"]//a[@class="a-link-normal a-text-normal"]/@href
This is the website I am trying to scrape:
https://www.amazon.in/s?i=pantry&srs=9574332031&bbn=9735693031&rh=n%3A9735693031&dc&page=2&fst=as%3Aoff&qid=1588056650&swrs=789D2F4EC1B25821250A55BFCB953F03&ref=sr_pg_2
How can I extract the expected url with Scrapy?

That is known as a relative URL. To get the full URL, join it to the site's base URL. I don't know what your code looks like, but try something like this:
half_url = response.xpath('//div[@class="a-section a-spacing-none a-spacing-top-small"]//a[@class="a-link-normal a-text-normal"]/@href').extract_first()
# half_url already starts with '/', so the base must not end with one
full_url = 'https://www.amazon.in' + half_url
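A more forgiving way to combine the two parts is urllib.parse.urljoin from the standard library, which takes care of leading and trailing slashes. The sketch below uses a shortened form of the search URL and the extracted href from the question as sample values; inside a Scrapy callback, response.urljoin(half_url) performs the same kind of join against the response's own URL.

```python
from urllib.parse import urljoin

# Sample values based on the question; in a spider these would come from
# response.url and the XPath result.
page_url = "https://www.amazon.in/s?i=pantry&srs=9574332031&page=2"
half_url = "/Parachute-Coconut-Oil-Bottle-600ml/dp/B071FB2ZVT?dchild=1"

# The path and query of page_url are replaced; scheme and host are kept.
full_url = urljoin(page_url, half_url)
print(full_url)
```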

Related

How can I get a clean URL from a href selector with Scrapy?

When I inspect the source code on the results page of a Google search, I see the URL I'm looking for. Something like this:
<a href="https://www.amazon.co.uk/<product_name>/dp/<product_code>" ping=...> </a>
But when I use Scrapy to retrieve that same URL with the following selector:
response.xpath("//a[contains(@href, 'amazon.com')]/@href").get()
I get the following URL that doesn't work:
'/url?q=https://www.amazon.com/<product_name>/dp/<product_code>&sa=U&ved=2ahUKEwjbjbqFzPLsAhWQpFkKHUgfA50QFjAAegQIARAB&usg=AOvVaw0cZpoBVRm94Z7lbphHTjsW'
Considering I don't want to slice the strings manually:
How can I get the URL without the /url?q= at the very beginning?
And how can I get the URL without the last piece of random stuff?
You can remove the /url?q= part from the string:
url = response.xpath("//a[contains(@href, 'amazon.com')]/@href").get().replace('/url?q=', '').split('&')[0]
Or, to strip it strictly at the beginning, use a regex:
import re
url = re.sub(r'^/url\?q=', '', response.xpath("//a[contains(@href, 'amazon.com')]/@href").get()).split('&')[0]
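Another option that avoids string slicing altogether is to treat the href as what it is: a URL whose q query parameter holds the real target. This is a minimal sketch using only the standard library; the helper name clean_google_href and the sample href are made up for illustration.

```python
from urllib.parse import parse_qs, urlsplit


def clean_google_href(href):
    """Return the target URL stored in the 'q' parameter of a redirect link.

    Falls back to the href unchanged when there is no 'q' parameter.
    """
    query = parse_qs(urlsplit(href).query)
    targets = query.get("q")
    return targets[0] if targets else href


# Illustrative href in the same shape as the one in the question.
href = "/url?q=https://www.amazon.com/some-product/dp/B000000000&sa=U&ved=abc&usg=xyz"
print(clean_google_href(href))
```

Because parse_qs splits the query on & and decodes percent-escapes, the trailing tracking parameters are dropped without counting characters.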

Scrapy - scraping html custom attributes

I am trying to scrape a website and I want to scrape a custom html attribute.
First I get the link:
result.css('p.paraclass a').extract()
It looks like this:
<a data-id="12345">I am a link</a>
I'd like to scrape the value of the data-id attribute. I can do this by getting the entire link and then manipulating it, but I'd like to figure out if there is a way to do it directly with a Scrapy selector.
I believe the following will work:
result.css('a::attr(data-id)').extract()
Two ways to achieve this:
from scrapy.selector import Selector
partial_body = '<a data-id="12345">I am a link</a>'
sel = Selector(text=partial_body)
XPath selector:
sel.xpath('//a/@data-id').extract()
# output: ['12345']
CSS selector:
sel.css('a::attr(data-id)').extract_first()
# output: '12345'

Python: Parse next page's url from html

I want to parse next page's url.
I used to parse it successfully with the example below:
»
Code:
url_n=soup3.find('',rel = 'next')['href']
But when I do it with another one, it doesn't work:
<a href='/showroom.php?cate_name_eng_lv1=news-and-culture&p=1' class='next' >next</a>
I tried this:
url_n=soup3.find('',class = 'next')['href']
or this:
url_n=soup.select_one('.next')['href']
Any advice is appreciated!
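One likely cause, offered as a sketch rather than a confirmed diagnosis: class is a reserved word in Python, so find('', class = 'next') is a syntax error. BeautifulSoup accepts class_ (with a trailing underscore) or an attrs dictionary instead. Note also that the last attempt in the question calls soup rather than soup3, which may simply be a different variable. The example below runs against the anchor from the question using the built-in html.parser; soup3 and url_n follow the question's naming.

```python
from bs4 import BeautifulSoup

html = "<a href='/showroom.php?cate_name_eng_lv1=news-and-culture&p=1' class='next' >next</a>"
soup3 = BeautifulSoup(html, "html.parser")

# class_ avoids the clash with Python's reserved word 'class'
url_n = soup3.find("a", class_="next")["href"]

# Equivalent lookups via an attrs dictionary and a CSS selector
same_via_attrs = soup3.find("a", attrs={"class": "next"})["href"]
same_via_css = soup3.select_one("a.next")["href"]

print(url_n)
```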

Extract Link URL After Specified Element with Python and Beautifulsoup4

I'm trying to extract a link from a page with Python and the BeautifulSoup library, but I'm stuck. The link is in the sidebar area of the following page, directly underneath the h4 subtitle "Original Source:"
http://www.eurekalert.org/pub_releases/2016-06/uonc-euc062016.php
I've managed to isolate the link (mostly), but I'm unsure of how to further advance my targeting to actually extract the link. Here's my code so far:
import requests
from bs4 import BeautifulSoup
url = "http://www.eurekalert.org/pub_releases/2016-06/uonc-euc062016.php"
data = requests.get(url)
soup = BeautifulSoup(data.text, 'lxml')
source_url = soup.find('section', class_='widget hidden-print').find('div', class_='widget-content').findAll('a')[-1]
print(source_url)
I am currently getting the full HTML of the last element I've isolated, whereas I'm trying to get just the link. Of note, this is the only link on the page I'm trying to get.
You're looking for the link's href HTML attribute. source_url is a bs4.element.Tag, which has a get method:
source_url.get('href')
You almost got it!
SOLUTION 1:
Read the .text property of the tag you've assigned to source_url. This works here only because the link's visible text is the URL itself.
So instead of:
print(source_url)
You should use:
print(source_url.text)
Output:
http://news.unchealthcare.org/news/2016/june/e-cigarette-use-can-alter-hundreds-of-genes-involved-in-airway-immune-defense
SOLUTION 2:
Call source_url.get('href') to read the href attribute of the tag returned by findAll.
print(source_url.get('href'))
Output:
http://news.unchealthcare.org/news/2016/june/e-cigarette-use-can-alter-hundreds-of-genes-involved-in-airway-immune-defense

How to find specific video html tag using beautiful soup?

Does anyone know how to use BeautifulSoup in Python?
I have a search engine with a list of different URLs.
I want to get only the HTML tag that contains a video embed URL, and extract the link.
Example:
import BeautifulSoup
html = '''https://archive.org/details/20070519_detroit2'''
#or this.. html = '''http://www.kumby.com/avatar-the-last-airbender-book-3-chapter-5/'''
#or this... html = '''https://www.youtube.com/watch?v=fI3zBtE_S_k'''
soup = BeautifulSoup.BeautifulSoup(html)
What should I do next to get the video or object HTML tag, or the exact link of the video?
I need it for my iframe. I will integrate the Python script with my PHP: the script gets the video link and outputs it, and I then echo it into my iframe.
You need to get the HTML of the page, not just the URL.
Use the built-in urllib library like this:
from urllib.request import urlopen
from bs4 import BeautifulSoup as BS
url = 'https://archive.org/details/20070519_detroit2'
# open and read the page
page = urlopen(url)
html = page.read()
# create a BeautifulSoup parse-able "soup"
soup = BS(html, 'html.parser')
# get the src attribute from the video tag
video = soup.find("video").get("src")
Also, for the site you are using, I noticed that to get the embed link you just change details in the link to embed, so it looks like this:
https://archive.org/embed/20070519_detroit2
So if you want to do it for multiple URLs without having to parse, do something like this:
url = 'https://archive.org/details/20070519_detroit2'
spl = url.split('/')
spl[3] = 'embed'
embed = "/".join(spl)
print(embed)
EDIT
To get the embed link for the other links you provided in your edit, look through the HTML of the page you are parsing until you find the link, then get the tag it is in and then the attribute.
For
'''http://www.kumby.com/avatar-the-last-airbender-book-3-chapter-5/'''
just do
soup.find("iframe").get("src")
iframe because the link is in the iframe tag, and .get("src") because the link is the src attribute.
You can try the next one yourself, because you should learn how to do it if you want to be able to do it in the future :)
Good luck!
You can't parse a URL. BeautifulSoup is used to parse an HTML page. Retrieve the page first:
from urllib.request import urlopen
from bs4 import BeautifulSoup
data = urlopen("https://archive.org/details/20070519_detroit2")
html = data.read()
Then you can use find, and then take the src attribute:
soup = BeautifulSoup(html, 'html.parser')
video = soup.find('video')
src = video['src']
This is a one-liner to get all the downloadable MP4 files on that page, in case you need it.
import bs4
from urllib.request import urlopen
url = 'https://archive.org/details/20070519_detroit2'
soup = bs4.BeautifulSoup(urlopen(url), 'html.parser')
# tag.get avoids a KeyError on <a> tags that have no href
links = [a['href'] for a in soup.find_all(lambda tag: tag.name == "a" and '.mp4' in tag.get('href', ''))]
print(links)
Here is the output:
['/download/20070519_detroit2/20070519_detroit_jungleearth.mp4',
'/download/20070519_detroit2/20070519_detroit_sweetkissofdeath.mp4',
'/download/20070519_detroit2/20070519_detroit_goodman.mp4',
...
'/download/20070519_detroit2/20070519_detroit_wilson_512kb.mp4']
These are relative links; join them with the domain to get absolute URLs.
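Joining the relative links to the page URL can be sketched with urllib.parse.urljoin from the standard library. The two paths below are copied from the output above, and page_url is the page they were scraped from; the variable names are only illustrative.

```python
from urllib.parse import urljoin

page_url = "https://archive.org/details/20070519_detroit2"
links = [
    "/download/20070519_detroit2/20070519_detroit_jungleearth.mp4",
    "/download/20070519_detroit2/20070519_detroit_goodman.mp4",
]

# urljoin keeps the scheme and host of page_url and swaps in each path
absolute_links = [urljoin(page_url, link) for link in links]
for link in absolute_links:
    print(link)
```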
