Python: Parse next page's url from html - python

I want to parse the next page's URL.
I used to parse it successfully with the example below:
Code:
url_n=soup3.find('',rel = 'next')['href']
But when I try it on another page, where the link looks like this, it doesn't work:
<a href='/showroom.php?cate_name_eng_lv1=news-and-culture&p=1' class='next' >next</a>
I tried this:
url_n=soup3.find('',class = 'next')['href']
or this:
url_n=soup.select_one('.next')['href']
Any advice is appreciated!
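A minimal sketch of one likely cause, assuming Beautiful Soup 4: `class` is a reserved word in Python, so `find(..., class = 'next')` is a syntax error, and the keyword Beautiful Soup accepts is `class_`. Note also that the second attempt calls `soup` rather than `soup3`, which may be a different (or undefined) object. The HTML string below is the anchor from the question; the other variable names are illustrative.

```python
from bs4 import BeautifulSoup

# The anchor from the question, parsed on its own for illustration.
html = "<a href='/showroom.php?cate_name_eng_lv1=news-and-culture&p=1' class='next' >next</a>"
soup = BeautifulSoup(html, "html.parser")

# `class` is reserved in Python, so Beautiful Soup takes `class_` instead.
link = soup.find("a", class_="next")
url_n = link["href"] if link is not None else None

# The CSS-selector form should find the same element.
url_css = soup.select_one("a.next")["href"]

print(url_n)
```

Checking for `None` before indexing avoids a `TypeError` on pages that have no "next" link, such as the last page of results.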

Related

How can I get a clean URL from a href selector with Scrapy?

When I inspect the source code on the results page of a Google search, I see the URL I'm looking for. Something like this:
<a href="https://www.amazon.co.uk/<product_name>/dp/<product_code>" ping=...> </a>
But when I use Scrapy to retrieve that same URL with the following selector:
response.xpath("//a[contains(@href, 'amazon.com')]/@href").get()
I get the following URL that doesn't work:
'/url?q=https://www.amazon.com/<product_name>/dp/<product_code>&sa=U&ved=2ahUKEwjbjbqFzPLsAhWQpFkKHUgfA50QFjAAegQIARAB&usg=AOvVaw0cZpoBVRm94Z7lbphHTjsW'
Considering I don't want to slice the strings manually:
How can I get the URL without the /url?q= at the very beginning?
And how can I get the URL without the trailing tracking parameters (&sa=..., &ved=..., &usg=...)?
You can strip the /url?q= prefix from the string and drop everything after the first &:
url = response.xpath("//a[contains(@href, 'amazon.com')]/@href").get().replace('/url?q=', '').split('&')[0]
Or, to match the prefix strictly at the beginning, use a regex:
import re
url = re.sub(r'^/url\?q=', '', response.xpath("//a[contains(@href, 'amazon.com')]/@href").get()).split('&')[0]
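Since the question asks for a way that avoids manual string slicing, here is a hedged alternative that uses only the standard library: treat the href as a URL and read its `q` query parameter. The helper name `clean_google_href` and the sample href are made up for illustration.

```python
from urllib.parse import urlsplit, parse_qs

def clean_google_href(href):
    """Return the target of a '/url?q=...' redirect link, or href unchanged."""
    parts = urlsplit(href)
    if parts.path == "/url":
        # parse_qs splits the query on '&' and percent-decodes the values.
        targets = parse_qs(parts.query).get("q")
        if targets:
            return targets[0]
    return href

# Hypothetical href in the same shape as the one in the question.
raw = "/url?q=https://www.amazon.com/example-product/dp/B000000000&sa=U&ved=abc&usg=xyz"
clean = clean_google_href(raw)
print(clean)
```

Unlike `split('&')[0]`, this also decodes percent-escapes in the target, which matters if the wrapped URL contains encoded characters.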

Unable to extract full URL @href using Scrapy

I am trying to extract the url of a product from amazon.in. The href-attribute inside the a-tag from the source looks like this:
href="/Parachute-Coconut-Oil-600-Free/dp/B081WSB91C/ref=sr_1_49?dchild=1&fpw=pantry&fst=as%3Aoff&qid=1588693187&s=pantry&sr=8-49&srs=9574332031&swrs=789D2F4EC1B25821250A55BFCB953F03"
What Scrapy is extracting is:
/Parachute-Coconut-Oil-Bottle-600ml/dp/B071FB2ZVT?dchild=1
I used the following xpath:
//div[@class="a-section a-spacing-none a-spacing-top-small"]//a[@class="a-link-normal a-text-normal"]/@href
This is the website I am trying to scrape:
https://www.amazon.in/s?i=pantry&srs=9574332031&bbn=9735693031&rh=n%3A9735693031&dc&page=2&fst=as%3Aoff&qid=1588056650&swrs=789D2F4EC1B25821250A55BFCB953F03&ref=sr_pg_2
How can I extract the expected url with Scrapy?
That is known as a relative URL. To get the full URL, join it with the site's base URL. I don't know what your code looks like, but try something like this:
half_url = response.xpath('//div[@class="a-section a-spacing-none a-spacing-top-small"]//a[@class="a-link-normal a-text-normal"]/@href').extract_first()
full_url = 'https://www.amazon.in' + half_url
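As a sketch of a more robust way to do the same join, `urljoin` from the standard library resolves a relative href against the page URL and handles the slashes itself. The shortened `page_url` below is a stand-in for the search page in the question; inside a Scrapy callback, `response.urljoin(half_url)` does the same thing using the response's own URL as the base.

```python
from urllib.parse import urljoin

# Stand-in for the page being scraped (shortened for illustration).
page_url = "https://www.amazon.in/s?i=pantry&page=2"
# Relative href as extracted by the XPath in the question.
half_url = "/Parachute-Coconut-Oil-Bottle-600ml/dp/B071FB2ZVT?dchild=1"

# urljoin keeps the scheme and host of page_url and replaces path and query.
full_url = urljoin(page_url, half_url)
print(full_url)
```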

Scraping data from span tag with class using BS

I'm trying to scrape project names from GitLab. When I inspect the source code, I see that the name of a project is in:
<span class='project-name'>Project Name</span>
Unfortunately, when I try to scrape this data I get an empty list. My code looks like:
url = 'https://gitlab.com/users/USER/projects'
source = urllib.request.urlopen(url).read()
soup = bs.BeautifulSoup(source,'lxml')
repos = [repo.text for repo in soup.find_all('span',{'class':'project-name'})]
I tried other approaches, such as using attrs, class_, or other HTML tags, but nothing works. What could be wrong here?
If you inspect the page in the Network tab of Chrome developer tools, you can see that the projects are not part of the HTML returned by the initial request.
That means the project information is requested afterwards. To get the projects, send a request to the https://gitlab.com/users/USER/projects.json endpoint.
The response from that endpoint is JSON, so we can load it with the json module. The resulting dictionary has an entry called html that contains HTML, which we can parse with BeautifulSoup; the rest of the code stays the same:
import bs4 as bs
import json
import urllib.request
url = 'https://gitlab.com/users/USER/projects.json'
source = urllib.request.urlopen(url).read()
soup = bs.BeautifulSoup(json.loads(source)["html"],'html.parser')
repos = [repo.text for repo in soup.find_all('span',{'class':'project-name'})]
print(repos)
Output:
['freebsd', 'freebsd-ports', 'freebsd-test', 'risc-vhdl', 'dotfiles', 'tideyBot']
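The parsing step can be tried offline with a small sketch. The payload below is invented and only mimics the shape described above (a JSON object with an `html` entry); the function name `project_names` is also hypothetical.

```python
import json
from bs4 import BeautifulSoup

# Invented payload mimicking the described shape: JSON with an "html" entry.
payload = json.dumps({
    "html": "<ul><li><span class='project-name'>alpha</span></li>"
            "<li><span class='project-name'>beta</span></li></ul>"
})

def project_names(raw_json):
    """Extract the project names from the HTML fragment embedded in the JSON."""
    fragment = json.loads(raw_json).get("html", "")
    soup = BeautifulSoup(fragment, "html.parser")
    return [span.get_text(strip=True)
            for span in soup.find_all("span", class_="project-name")]

names = project_names(payload)
print(names)
```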

HTML in browser doesn't correspond to scraped data in python

For a project I have to scrape data from different websites, and I'm having a problem with one of them.
When I look at the source code, the things I want are in a table, so it seems easy to scrape. But when I run my script, that part of the source doesn't show up.
Here is my code. I tried different things. At first there weren't any headers; then I added some, but it made no difference.
# import libraries
from bs4 import BeautifulSoup
import requests
# specify the url
quote_page = 'http://www.airpl.org/Pollens/pollinariums-sentinelles'
# query the website, sending a browser-like User-Agent header
response = requests.get(quote_page, headers={'User-Agent': 'Mozilla/5.0'})
print(response.text)
# parse the html using Beautiful Soup and store it in the variable `soup`
soup = BeautifulSoup(response.text, 'html.parser')
# write the parsed document to a text file
with open('allergene.txt', 'w', encoding='utf-8') as f:
    f.write(str(soup))
What I'm looking for on the website is the content after "Herbacée", whose HTML looks like:
<p class="level1">
<img src="/static/img/state-0.png" alt="pas d'émission" class="state">
Herbacee
</p>
Do you have any idea what's wrong?
Thanks for your help and happy new year guys :)
This page uses JavaScript to render the table. The page that actually contains the table is:
http://www.alertepollens.org/gardens/garden/1/state/
You can find this URL in Chrome DevTools under the Network tab.

How to find specific video html tag using beautiful soup?

Does anyone know how to use BeautifulSoup in Python?
I have a search engine with a list of different URLs.
I want to get only the HTML tag containing a video embed URL, and extract the link.
Example:
import BeautifulSoup
html = '''https://archive.org/details/20070519_detroit2'''
#or this.. html = '''http://www.kumby.com/avatar-the-last-airbender-book-3-chapter-5/'''
#or this... html = '''https://www.youtube.com/watch?v=fI3zBtE_S_k'''
soup = BeautifulSoup.BeautifulSoup(html)
What should I do next to get the HTML tag of the video or object, or the exact link to the video?
I need to put it in my iframe. I will integrate the Python script with my PHP: the script gets the video link and outputs it, and then I echo it into my iframe.
You need to get the HTML of the page, not just the URL.
Use the built-in urllib library like this:
from urllib.request import urlopen
from bs4 import BeautifulSoup as BS
url = 'https://archive.org/details/20070519_detroit2'
# open and read the page
page = urlopen(url)
html = page.read()
# create the BeautifulSoup parse-able "soup"
soup = BS(html, 'html.parser')
# get the src attribute from the video tag
video = soup.find("video").get("src")
Also, for the site you are using, I noticed that to get the embed link you just change details in the URL to embed, so it looks like this:
https://archive.org/embed/20070519_detroit2
So if you want to do this for multiple URLs without having to parse anything, do something like this:
url = 'https://archive.org/details/20070519_detroit2'
# 'https://archive.org/details/...' splits into ['https:', '', 'archive.org', 'details', ...]
spl = url.split('/')
spl[3] = 'embed'
embed = "/".join(spl)
print(embed)
EDIT
To get the embed link for the other URLs you added in your edit, look through the HTML of the page you are parsing until you find the link, then get the tag it is in and then the attribute.
For
'''http://www.kumby.com/avatar-the-last-airbender-book-3-chapter-5/'''
just do
soup.find("iframe").get("src")
Use iframe because the link is in the iframe tag, and .get("src") because the link is the src attribute.
You can try the next one yourself, because you should learn how to do it if you want to be able to do it in the future :)
Good luck!
You can't parse a URL. BeautifulSoup is used to parse an HTML page. Retrieve the page first:
from urllib.request import urlopen
data = urlopen("https://archive.org/details/20070519_detroit2")
html = data.read()
Then you can use find, and then take the src attribute:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')
video = soup.find('video')
src = video['src']
This is a one-liner to get all the downloadable MP4 files on that page, in case you need it.
import bs4
from urllib.request import urlopen
url = 'https://archive.org/details/20070519_detroit2'
soup = bs4.BeautifulSoup(urlopen(url), 'html.parser')
# keep only <a> tags whose href contains '.mp4'; .get avoids a KeyError on anchors without an href
links = [a['href'] for a in soup.find_all(lambda tag: tag.name == "a" and '.mp4' in tag.get('href', ''))]
print(links)
Here is the output:
['/download/20070519_detroit2/20070519_detroit_jungleearth.mp4',
'/download/20070519_detroit2/20070519_detroit_sweetkissofdeath.mp4',
'/download/20070519_detroit2/20070519_detroit_goodman.mp4',
...
'/download/20070519_detroit2/20070519_detroit_wilson_512kb.mp4']
These are relative links; join them with the domain (https://archive.org) to get absolute URLs.
