Extracting text from a CSS node with Scrapy - Python

I'm trying to scrape a catalog id number from this page:
from scrapy.selector import Selector
from scrapy.http import HtmlResponse
url = 'http://www.enciclovida.mx/busquedas/resultados?utf8=%E2%9C%93&busqueda=basica&id=&nombre=astomiopsis+exserta&button='
response = HtmlResponse(url=url)
Using this CSS selector (which works in R with rvest::html_nodes):
".result-nombre-container > h5:nth-child(2) > a:nth-child(1)"
I would like to retrieve the catalog id, which in this case should be:
6011038
I'm fine with XPath if that's easier.

I don't have Scrapy here, but I tested this XPath and it will get you the href:
//div[contains(@class, 'result-nombre-container')]/h5[2]/a/@href
If you're having too much trouble with Scrapy and CSS selector syntax, I would also suggest trying out the BeautifulSoup Python package. With BeautifulSoup you can do things like
link.get('href')
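For instance, a minimal BeautifulSoup sketch (assuming the page is fetched with requests and a recent BeautifulSoup whose soupsieve backend supports :nth-child; the selector mirrors the rvest one above):

import requests
from bs4 import BeautifulSoup

url = ('http://www.enciclovida.mx/busquedas/resultados'
       '?utf8=%E2%9C%93&busqueda=basica&id=&nombre=astomiopsis+exserta&button=')
soup = BeautifulSoup(requests.get(url).text, 'html.parser')
# select_one takes the same CSS selector that worked in rvest
link = soup.select_one('.result-nombre-container > h5:nth-child(2) > a:nth-child(1)')
if link is not None:
    print(link.get('href'))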

If you need to parse the id from the href:
catalog_id = response.xpath("//div[contains(@class, 'result-nombre-container')]/h5[2]/a/@href").re_first(r'(\d+)$')

There seems to be only one link in the h5 element, so in short:
response.css('h5 > a::attr(href)').re(r'(\d+)$')
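Putting it together, a minimal sketch (one assumption here: the page is fetched with requests, since HtmlResponse(url=url) alone carries an empty body):

import requests
from scrapy.http import HtmlResponse

url = ('http://www.enciclovida.mx/busquedas/resultados'
       '?utf8=%E2%9C%93&busqueda=basica&id=&nombre=astomiopsis+exserta&button=')
# an HtmlResponse built from just a url has no body to select against,
# so feed it the downloaded HTML explicitly
response = HtmlResponse(url=url, body=requests.get(url).content)
catalog_id = response.css('h5 > a::attr(href)').re_first(r'(\d+)$')
print(catalog_id)  # expected: 6011038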

Related

How to retrieve URLs under a certain property using BeautifulSoup in Python?

I am trying to retrieve URLs under a certain property. The current code I have is:
import urllib.request
import lxml.html

url = 'https://play.acast.com/s/jeg-kan-ingenting-om-vin/33.hvorforercheninblancfraloireogsor-afrikaikkelengerpafolksradar-'
connection = urllib.request.urlopen(url)
dom = lxml.html.fromstring(connection.read())
links = []
for link in dom.xpath('//meta/@content'):  # the content attribute of every <meta> tag
    if 'mp3' in link:
        links.append(link)
output = set(links)
for i in output:
    print(i)
This outputs two links, which is not what I want:
https://sphinx.acast.com/jeg-kan-ingenting-om-vin/33.hvorforercheninblancfraloireogsor-afrikaikkelengerpafolksradar-/media.mp3
https://sphinx.acast.com/jeg-kan-ingenting-om-vin/33.hvorforercheninblancfraloireogsor-afrikaikkelengerpafolksradar-r/media.mp3
What I would like is to get only the URL under the og:audio property, not the og:audio:secure_url property.
How do I accomplish this?
To only select a tag where the property="og:audio" and not property="og:audio:secure_url", you can use an [attribute=value]
CSS selector. In your case it would be: [property="og:audio"].
Since you tagged beautifulsoup, you can do it as follows (re-opening the URL, since the earlier connection.read() already consumed the stream):
soup = BeautifulSoup(urllib.request.urlopen(url).read(), "html.parser")
for tag in soup.select('[property="og:audio"]'):
    print(tag["content"])
Output:
https://sphinx.acast.com/jeg-kan-ingenting-om-vin/33.hvorforercheninblancfraloireogsor-afrikaikkelengerpafolksradar-/media.mp3
A better way would be to study the XHR calls in the Network tab when you inspect the page. In the response of https://feeder.acast.com/api/v1/shows/jeg-kan-ingenting-om-vin/episodes/33.hvorforercheninblancfraloireogsor-afrikaikkelengerpafolksradar-?showInfo=true the url key is what you are looking for.
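A sketch of that API approach (the JSON shape is an assumption based on the description above):

import requests

api_url = ('https://feeder.acast.com/api/v1/shows/jeg-kan-ingenting-om-vin/'
           'episodes/33.hvorforercheninblancfraloireogsor-afrikaikkelengerpafolksradar-'
           '?showInfo=true')
data = requests.get(api_url).json()
# per the answer above, the audio link sits under the "url" key
print(data['url'])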

Can't find the Web scraping selector for div class

I am new to web scraping and this is one of my first projects. I can't find the right selector for my soup.select("").
I want to get the "data-phone" value (see the picture below). It is inside a div class and then in an <a href>, which makes it a little complicated for me!
I searched online and found that I should use soup.find_all, but that was not very helpful. Can anyone help me or give me a quick tip? Thank you!
My code:
import webbrowser, requests, bs4, os

url = "https://www.pagesjaunes.ca/search/si/1/electricien/Montreal+QC"
res = requests.get(url)
res.raise_for_status()
soup = bs4.BeautifulSoup(res.text, "html.parser")
result = soup.find('a', {'class': 'mlr__item__cta jsMlrMenu'})  # attrs takes a dict; a set here was a bug
phone = result['data-phone']
print(phone)
I think one of the simplest ways is to use soup.select, which accepts normal CSS selectors:
https://www.crummy.com/software/BeautifulSoup/bs4/doc/#css-selectors
soup.select('a.mlr__item_cta.jsMlrMenu')
This should return the entire list of anchors from which you can pick the data attribute.
Note: I just tried it in the terminal:
from bs4 import BeautifulSoup
import requests

url = 'https://en.wikipedia.org/wiki/Web_scraping'
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')
result = soup.select('a.mw-jump-link')  # or any other selector
print(result)
print(result[0].get("href"))
You will have to loop over the result of soup.select and just collect the data-phone value from the attribute.
UPDATE
OK, I have searched the DOM myself, and here is how I managed to retrieve all the phone data:
anchors = soup.select('a[data-phone]')
for a in anchors:
    print(a.get('data-phone'))
It also works with just the data-attribute selector: soup.select('[data-phone]')
Surprisingly, this one with classes also works for me:
for a in soup.select('a.mlr__item__cta.jsMlrMenu'):
    print(a.get('data-phone'))
There is no surprise, we just had a typo in our first selector...
Find the difference :)
GOOD: a.mlr__item__cta.jsMlrMenu
BAD : a.mlr__item_cta.jsMlrMenu
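Putting the pieces together, a minimal end-to-end sketch (the User-Agent header is an assumption; some sites reject the default client headers):

import requests
from bs4 import BeautifulSoup

url = "https://www.pagesjaunes.ca/search/si/1/electricien/Montreal+QC"
# a browser-like User-Agent is a guess; the site may block the default one
res = requests.get(url, headers={"User-Agent": "Mozilla/5.0"})
res.raise_for_status()
soup = BeautifulSoup(res.text, "html.parser")
for a in soup.select("a[data-phone]"):
    print(a.get("data-phone"))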

Scrapy - scraping HTML custom attributes

I am trying to scrape a website and I want to scrape a custom HTML attribute.
First I get the link:
result.css('p.paraclass a').extract()
It looks like this:
<a data-id="12345" href="...">I am a link</a>
I'd like to scrape the value of the data-id attribute. I can do this by getting the entire link and then manipulating it, but I'd like to figure out if there is a way to do it directly with a Scrapy selector.
I believe the following will work:
result.css('a::attr(data-id)').extract()
Two ways to achieve this:
from scrapy.selector import Selector

partial_body = '<a data-id="12345">I am a link</a>'
sel = Selector(text=partial_body)
XPath selector:
sel.xpath('//a/@data-id').extract()
# output: ['12345']
CSS selector:
sel.css('a::attr(data-id)').extract_first()
# output: '12345'

soup.find_all works but soup.select doesn't work

I'm playing around with parsing an HTML page using CSS selectors:
import requests
import webbrowser
from bs4 import BeautifulSoup
page = requests.get('http://www.marketwatch.com', headers={'User-Agent': 'Mozilla/5.0'})
soup = BeautifulSoup(page.content, 'html.parser')
I'm having trouble selecting a list tag with a class when using the select method. However, I have no trouble when using the find_all method
soup.find_all('ul', class_= "latestNews j-scrollElement")
This returns the output I desire, but for some reason I can't do the same using
CSS selectors. I want to know what I'm doing wrong.
Here is my attempt:
soup.select("ul .latestNews j-scrollElement")
which is returning an empty list.
I can't figure out what I'm doing wrong with the select method.
Thank you.
From the documentation:
If you want to search for tags that match two or more CSS classes, you
should use a CSS selector:
css_soup.select("p.strikeout.body")
In your case, you'd call it like this:
In [1588]: soup.select("ul.latestNews.j-scrollElement")
Out[1588]:
[<ul class="latestNews j-scrollElement" data-track-code="MW_Header_Latest News|MW_Header_Latest News_Facebook|MW_Header_Latest News_Twitter" data-track-query=".latestNews__headline a|a.icon--facebook|a.icon--twitter">
...
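To see why the original attempt returns an empty list, here is a small self-contained contrast (descendant combinator vs. compound class selector):

from bs4 import BeautifulSoup

html = '<ul class="latestNews j-scrollElement"><li>item</li></ul>'
soup = BeautifulSoup(html, 'html.parser')
# "ul .latestNews j-scrollElement" looks for a <j-scrollElement> tag nested
# inside a .latestNews element inside a <ul> -- no such tag exists
print(soup.select("ul .latestNews j-scrollElement"))  # []
# "ul.latestNews.j-scrollElement" matches a <ul> carrying both classes
print(soup.select("ul.latestNews.j-scrollElement"))   # [<ul class="latestNews j-scrollElement">...]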

Scrape through a website with href references

I am using scrapy, and I want to scrape through www.rentler.com. I have gone to the website and searched for the city that I am interested in, and here is the link of that search result:
https://www.rentler.com/search?Location=millcreek&MaxPrice=
Now, all of the listings that I am interested in are contained on that page, and I want to recursively step through them, one by one.
Each listing is listed under:
<body>/<div id="wrap">/<div class="container search-res">/<ul class="search-results"><li class="result">
each result has a <a class="search-result-link" href="/listing/288910">
I know that I need to create a rule for the CrawlSpider and have it look at that href and append it to the URL. That way it could go to each page and grab the data I am interested in.
I think I need something like this:
rules = (Rule(SgmlLinkExtractor(allow="not sure what to insert here, but this is where I think the href pattern goes"), callback='parse_item', follow=True),)
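For what it's worth, here is a sketch of that rule in modern Scrapy (SgmlLinkExtractor has since been removed; the allow regex for /listing/ hrefs is an assumption about what you want to follow):

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class RentlerSpider(CrawlSpider):
    name = 'rentler'
    allowed_domains = ['rentler.com']
    start_urls = ['https://www.rentler.com/search?Location=millcreek&MaxPrice=']

    rules = (
        # follow every /listing/<id> link found on the results page
        Rule(LinkExtractor(allow=r'/listing/\d+'), callback='parse_item'),
    )

    def parse_item(self, response):
        # placeholder: pull whatever fields you need from the listing page
        yield {'url': response.url}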
UPDATE
Thank you for the input. Here is what I now have; it seems to run but does not scrape:
import re
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from KSL.items import KSLitem

class KSL(CrawlSpider):
    name = "ksl"
    allowed_domains = ["https://www.rentler.com"]
    start_urls = ["https://www.rentler.com/ksl/listing/index/?sid=17403849&nid=651&ad=452978"]
    regex_pattern = '<a href="listing/(.*?)" class="search-result-link">'

    def parse_item(self, response):
        items = []
        hxs = HtmlXPathSelector(response)
        sites = re.findall(self.regex_pattern, "https://www.rentler.com/search?location=millcreek&MaxPrice=")
        for site in sites:
            item = KSLitem()
            item['price'] = site.select('//div[@class="price"]/text()').extract()
            item['address'] = site.select('//div[@class="address"]/text()').extract()
            item['stats'] = site.select('//ul[@class="basic-stats"]/li/div[@class="count"]/text()').extract()
            item['description'] = site.select('//div[@class="description"]/div/p/text()').extract()
            items.append(item)
        return items
Thoughts?
If you need to scrape data out of HTML files, which is the case here, I would recommend BeautifulSoup; it's very easy to install and use:
from bs4 import BeautifulSoup

bs = BeautifulSoup(html, 'html.parser')
for link in bs.find_all('a'):
    if link.has_attr('href'):
        print(link.attrs['href'])
This little script gets every href inside <a> tags.
Edit: fully functional script:
I tested this on my computer and the result was as expected. BeautifulSoup just needs plain HTML, and you can scrape what you need out of it. Take a look at this code:
import requests
from bs4 import BeautifulSoup

html = requests.get('https://www.rentler.com/search?Location=millcreek&MaxPrice=').text
bs = BeautifulSoup(html, 'html.parser')
possible_links = bs.find_all('a')
for link in possible_links:
    if link.has_attr('href'):
        print(link.attrs['href'])
That only shows how to scrape the hrefs out of the page you are trying to scrape. You can of course use it inside scrapy; as I said, BeautifulSoup only needs plain HTML, which is why I use requests.get(url).text. Scrapy can hand that same plain HTML to BeautifulSoup.
Edit 2
OK, look, I don't think you need scrapy at all. If the previous script gets you all the links you want to take data from, you only need something like the following.
Suppose you have a valid list of URLs you want specific data from, say price, acres, address... You could build it with the previous script: instead of printing the URLs to the screen, append them to a list, keeping only the ones that start with /listing/. That way you have a valid list of URLs.
for url in valid_urls:
    bs = BeautifulSoup(requests.get(url).text, 'html.parser')
    price = bs.find('span', {'class': 'amount'}).text
    print(price)
You only need to look at the source code and you'll get the idea of how to scrape the data you need from every single URL.
You can use a regular expression to find all the rental home ids from the links. From there, you can use the ids you have and scrape that page instead.
import re

regex_pattern = '<a href="/listing/(.*?)" class="search-result-link">'
rental_home_ids = re.findall(regex_pattern, SOURCE_OF_THE_RENTLER_PAGE)
for rental_id in rental_home_ids:
    # Process the data from the page here.
    print(rental_id)
EDIT:
Here's a working-on-its-own version of the code. It prints all the link ids. You can use it as-is.
import re
import urllib.request

url_to_scrape = "https://www.rentler.com/search?Location=millcreek&MaxPrice="
page_source = urllib.request.urlopen(url_to_scrape).read().decode('utf-8')
regex_pattern = '<a href="/listing/(.*?)" class="search-result-link">'
rental_home_ids = re.findall(regex_pattern, page_source)
for rental_id in rental_home_ids:
    # Process the data from the page here.
    print(rental_id)
