Amazon Web Service ItemSearch DetailPageURLs with Associate IDs? - python

The DetailPageURLs returned by ItemSearch seem to include an incorrect ID/tag rather than the Associate ID I requested the search with.
I'm getting:
http://www.amazon.co.uk/gp/product/1590595009?SubscriptionId=XXX&tag=foo-12&linkCode=as2&camp=1634&creative=19450&creativeASIN=1590595009
When I expect:
http://www.amazon.co.uk/gp/product/1590595009?SubscriptionId=XXX&tag=wwwmydomain-12&linkCode=as2&camp=1634&creative=19450&creativeASIN=1590595009
How do I get the correct tag? (Note that SO rewrites the above links to their own Associate ID if you click either of the above)
I'm using Python and PyAWS 0.3.0, although I think the problem is with my request, rather than with the API wrapper.
(As an aside, the Amazon Associates Link Checker (available for both the U.K. and U.S. stores) is invaluable for testing these links.)

Simple error in the end... I was including the tag in the initial search:
for searchResult in ecs.ItemSearch(item, SearchIndex=index,
                                   AssociateTag='wwwmydomain-12'):
but not in the secondary loop that steps through each result to get more details:
for item in ecs.ItemSearch(searchResult.ASIN, ResponseGroup='Medium'):
should be:
for item in ecs.ItemSearch(searchResult.ASIN, ResponseGroup='Medium',
                           AssociateTag='wwwmydomain-12'):
The tag is needed in both calls; it seems it is not carried over from the first request.
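Putting the fix together, a minimal sketch of the corrected nested loops. It assumes the PyAWS ecs module is already set up with your SubscriptionId (as in the snippets above), that item and index hold the search term and SearchIndex, and that result objects expose a DetailPageURL attribute (an assumption; only ASIN access is shown above). The tag value is the placeholder from the question.
ASSOCIATE_TAG = 'wwwmydomain-12'  # placeholder Associate ID from the question

for searchResult in ecs.ItemSearch(item, SearchIndex=index,
                                   AssociateTag=ASSOCIATE_TAG):
    # Pass the tag again on the per-item request, otherwise the
    # DetailPageURL comes back with a different tag.
    for detail in ecs.ItemSearch(searchResult.ASIN, ResponseGroup='Medium',
                                 AssociateTag=ASSOCIATE_TAG):
        print(detail.DetailPageURL)  # should now contain tag=wwwmydomain-12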

Related

Extracting element IDs of required forms in selenium/python

I'm crawling this page (https://boards.greenhouse.io/reddit/jobs/4330383) using Selenium in Python and am looping through all of the required fields using:
required = driver.find_elements_by_css_selector("[aria-required=true]")
The problem is that I can't view each element's id. The command required[0].id (which is the same as driver.find_element_by_id("first_name").id) returns a long string of alphanumeric characters and hyphens, even though the id is first_name in the HTML. Can someone explain to me why the id is being changed from first_name to this string? And how can I view the actual id that I'm expecting?
Additionally, what would be the simplest way to reference the associated label mentioned right before it in the HTML (i.e. "First Name " in this case)? The goal is to loop through the required list and be able to tell what each of these forms is actually asking the user for.
Thanks in advance! Any advice or alternatives are welcome, as well.
Your code is almost there. All you need to do is use the .get_attribute() method to get the ids:
required = driver.find_elements_by_css_selector("[aria-required=true]")
for r in required:
    print(r.get_attribute("id"))
driver.find_element_by_id("first_name") returns a web element object.
To get a web element's attribute value, such as href or id, the get_attribute() method should be called on the web element object.
So you need to change your code to:
driver.find_element_by_id("first_name").get_attribute("id")
This will give you the id attribute value of that element.
I'm going to answer my second question ("How to reference the element's associated label?") since I just figured it out using the find_element_by_xpath() method in conjunction with the .get_attribute("id") solutions mentioned in the previous answers:
ele_id = driver.find_element_by_id("first_name").get_attribute("id")
label_text = driver.find_element_by_xpath('//label[@for="{}"]'.format(ele_id)).text
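Combining both answers, a short sketch that pairs each required field with its label text. It assumes the same driver session on the Greenhouse page and that every required field has a matching <label for="..."> element; if a label is missing, the find_element_by_xpath call will raise NoSuchElementException.
required = driver.find_elements_by_css_selector("[aria-required=true]")
for field in required:
    field_id = field.get_attribute("id")  # e.g. "first_name"
    # The label whose 'for' attribute points at this field's id.
    label = driver.find_element_by_xpath('//label[@for="{}"]'.format(field_id))
    print(field_id, label.text.strip())   # e.g. first_name  First Name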

XPath results from Scrapy don't match the results from the HTML page

I'm having some issues crawling this search page:
https://www.simplyhired.com/search?q=data+engineer&l=United+States&pn=1&job=ZMzeXt6JW0jMuZc6H-3Af3sqOGzeQMLj7X5mnXXv9ZteeAoGm6oDdg
I'm trying to extract these elements (job name, company, location and salary estimate) from the SimplyHired search results for Data Engineer jobs in the US.
But when I try an XPath locator for any of them using the Selector module, I get different results, and in a different order.
Also, the outputs don't line up: the index of a job name in the job-name XPath results is not the same as the index of its location in the location XPath results, for example.
Here is my code:
from scrapy import Selector
import requests
response = requests.get('https://www.simplyhired.com/search?q=data+engineer&l=united+states&mi=exact&sb=dd&pn=1&job=X1yGOt2Y8QTJm0tYqyptbgV9Pu19ge0GkVZK7Im5WbXm-zUr-QMM-A').text
sel = Selector(text=response)
# job name
sel.xpath('//main[@id="job-list"]/div/article[contains(@class,"SerpJob")]/div/div[@class="jobposting-title-container"]/h2/a/text()').extract()
# company
sel.xpath('//main[@id="job-list"]/div/article/div/h3[@class="jobposting-subtitle"]/span[@class="JobPosting-labelWithIcon jobposting-company"]/text()').extract()
# location
sel.xpath('//main[@id="job-list"]//div/article/div/h3[@class="jobposting-subtitle"]/span[@class="JobPosting-labelWithIcon jobposting-location"]/span/span/text()').extract()
# salary estimates
sel.xpath('//main[@id="job-list"]//div/article/div/div[@class="SerpJob-metaInfo"]//div[@class="SerpJob-metaInfoLeft"]/span/text()[2]').extract()
I'm not quite sure whether you're trying to use Scrapy or requests. It looks like you want to use requests but with XPath selectors.
For websites like this, it's best to look at each individual job advert as a 'card'. You want to loop over each card with the XPath selectors you need to get the data you want.
Code Example
cards = sel.xpath('//div[@class="SerpJob-jobCard card"]')
for card in cards:
    title = card.xpath('.//a[@class="card-link"]/text()').get()
    company = card.xpath('.//span[@class="JobPosting-labelWithIcon jobposting-company"]/text()').get()
    salary = card.xpath('.//span[@class="jobposting-salary"]/text()').get()
    location = card.xpath('.//span[@class="jobposting-location"]/text()').get()
Explanation
You want to search each card with relative XPath selectors. The .// prefix searches within the chunk of HTML downstream of the card variable.
Prefer get() over extract(). get() returns a single value and is always a string (or None if nothing matches), which is what we want while looping over each card. extract() returns every matching value, and even when there is only one match it wraps it in a list, which is often not what you want. That ambiguity makes extract() less ideal; if you do want multiple values, use getall(), which is explicit and always returns a list.
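A tiny illustration of the difference, using a throwaway piece of HTML rather than the SimplyHired page:
from scrapy import Selector

sel = Selector(text='<div><p>first</p><p>second</p></div>')
print(sel.xpath('//p/text()').get())      # 'first'  - one value, always a string
print(sel.xpath('//p/text()').getall())   # ['first', 'second'] - explicit list
print(sel.xpath('//p/text()').extract())  # ['first', 'second'] - same, but ambiguous
print(sel.xpath('//span/text()').get())   # None - no match, no exception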
Additional Information
If you find you're not getting the correct data in the right format, check whether content is being added to the page by JavaScript: turn off your browser's JavaScript and refresh the page. On this particular site, none of the data you need is loaded by JavaScript, which makes it much easier to scrape.
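Putting it all together, a hedged end-to-end sketch using the question's URL and the card approach above; the class names are copied from the selectors in this answer and may break if SimplyHired changes its markup:
from scrapy import Selector
import requests

url = ('https://www.simplyhired.com/search?q=data+engineer&l=united+states'
       '&mi=exact&sb=dd&pn=1&job=X1yGOt2Y8QTJm0tYqyptbgV9Pu19ge0GkVZK7Im5WbXm-zUr-QMM-A')
sel = Selector(text=requests.get(url).text)

jobs = []
for card in sel.xpath('//div[@class="SerpJob-jobCard card"]'):
    # Relative selectors keep title, company, location and salary aligned per job.
    jobs.append({
        'title': card.xpath('.//a[@class="card-link"]/text()').get(),
        'company': card.xpath('.//span[@class="JobPosting-labelWithIcon jobposting-company"]/text()').get(),
        'location': card.xpath('.//span[@class="jobposting-location"]/text()').get(),
        'salary': card.xpath('.//span[@class="jobposting-salary"]/text()').get(),
    })

print(jobs[:3])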

How to loop through div ids and retrieve data

I'm trying to retrieve posts (title, URL and image) from a website using Selenium. I have got to the first level, but on the second level the posts run from (div id="abcde-11-0") to (div id="abcde-11-20"). In this case, how should I get into the second level so I can reach the (a href) and retrieve the data?
Here is my code:
# I get to the first level by using xpath
Container = chromedrv.find_elements_by_xpath('//*[@id="MainContent"]/div[2]/div/div/div[6]/div[2]')
I'm not sure I understood your question correctly, but please try this code:
first_level_rows = chromedrv.find_elements_by_xpath('//*[@id="MainContent"]/div[2]/div/div/div[6]/div[2]')
for first_level_row in first_level_rows:
    second_level_rows = first_level_row.find_elements_by_xpath('.//*[contains(@id, "abcde-11-")]')
find_elements_by_xpath('.//*[contains(@id, "abcde-11-")]') => this part extracts all elements whose id contains "abcde-11-"; the leading . keeps the search relative to first_level_row.
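To then pull the title, URL and image out of each post, a hedged continuation of the snippet above. It assumes each "abcde-11-N" div contains one <a> (link and title) and one <img>, which is a guess since the question doesn't show the inner markup:
for first_level_row in first_level_rows:
    second_level_rows = first_level_row.find_elements_by_xpath('.//*[contains(@id, "abcde-11-")]')
    for post in second_level_rows:
        link = post.find_element_by_tag_name('a')     # assumed: one link per post
        image = post.find_element_by_tag_name('img')  # assumed: one image per post
        title = link.text
        url = link.get_attribute('href')
        img_src = image.get_attribute('src')
        print(title, url, img_src)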

How to get all elements by tag?

I use diskcache to persist my data. I save users with cache.add(key=k, value=v, tag="users"); now I want to get all users by tag, but there is no such method.
How can I do this?
I've found only one way to do this:
def _get_all(self):
    r = []
    for k in list(self._cache.iterkeys()):
        r.append(self._cache.get(key=k))
    return r
But this approach does not take a tag as an argument, so I cannot keep different kinds of items in one diskcache instance and filter them by tag.
Looking at the source code of python-diskcache, the sole purpose of the tag is to be used in an SQLite index that enables fast cache eviction/culling on the basis of a tag value.
The only SQL statement that tag is ever used in is in the .evict() method.
There is no official API to get cache entries by tag, and the library is not designed to do that. The whole underlying setup of the database and the item-retrieval mechanics are key-centered.
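If you still need tag-style grouping, one workaround (not part of the diskcache API) is to maintain your own key index per tag next to the entries. A minimal sketch, assuming you control every write and using a reserved '__tag_index__:<tag>' key that I've made up for illustration; evictions and deletes are not reflected in the index:
import diskcache

class TaggedCache(object):
    """Thin wrapper that keeps its own list of keys per tag inside the cache."""

    def __init__(self, directory):
        self._cache = diskcache.Cache(directory)

    def _index_key(self, tag):
        return '__tag_index__:%s' % tag  # reserved key holding this tag's key list

    def add(self, key, value, tag):
        if self._cache.add(key=key, value=value, tag=tag):
            keys = self._cache.get(self._index_key(tag)) or []
            keys.append(key)
            self._cache[self._index_key(tag)] = keys

    def get_all(self, tag):
        keys = self._cache.get(self._index_key(tag)) or []
        return [self._cache.get(key=k) for k in keys]

cache = TaggedCache('/tmp/users_cache')
cache.add('u1', {'name': 'alice'}, tag='users')
cache.add('u2', {'name': 'bob'}, tag='users')
print(cache.get_all('users'))  # [{'name': 'alice'}, {'name': 'bob'}]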

Getting author's articles from Scopus using Scopus API (AUTHENTICATION_ERROR)

I've registered at http://www.developers.elsevier.com/action/devprojects. I created a project and got my Scopus key.
Now, using this generated key, I would like to find an author by first name, last name and subject area. I make requests from my university network, which is allowed to access Scopus (I have full manual access to Scopus search and use it from Firefox with no problem). However, I want to automate my Scopus mining by writing a simple script that finds an author's publications given his/her first name, last name and subject area.
Here's my code:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import requests
import json
from scopus import SCOPUS_API_KEY
scopus_author_search_url = 'http://api.elsevier.com/content/search/author?'
headers = {'Accept': 'application/json', 'X-ELS-APIKey': SCOPUS_API_KEY}
search_query = 'query=AUTHFIRST(%s) AND AUTHLASTNAME(%s) AND SUBJAREA(%s)' % ('John', 'Kitchin', 'COMP')
# api_resource = "http://api.elsevier.com/content/search/author?apiKey=%s&" % (SCOPUS_API_KEY)
# request the first page of search results
page_request = requests.get(scopus_author_search_url + search_query, headers=headers)
print page_request.url
# parse the response as JSON
page = json.loads(page_request.content.decode("utf-8"))
print page
Where SCOPUS_API_KEY looks just like this: SCOPUS_API_KEY="xxxxxxxx".
Although I have full access to Scopus from my university network, I'm getting this response:
{u'service-error': {u'status': {u'statusText': u'Requestor configuration settings insufficient for access to this resource.', u'statusCode': u'AUTHENTICATION_ERROR'}}}
The generated link looks like this: http://api.elsevier.com/content/search/author?query=AUTHFIRST(John)%20AND%20AUTHLASTNAME(Kitchin)%20AND%20SUBJAREA(COMP) and when I click it, it shows an XML file:
<service-error>
<status>
<statusCode>AUTHORIZATION_ERROR</statusCode>
<statusText>No APIKey provided for request</statusText>
</status>
</service-error>
Or, when I change the scopus_author_search_url to "http://api.elsevier.com/content/search/author?apiKey=%s&" % (SCOPUS_API_KEY) I'm getting:
{u'service-error': {u'status': {u'statusText': u'Requestor configuration settings insufficient for access to this resource.', u'statusCode': u'AUTHENTICATION_ERROR'}}} and the XML file:
<service-error>
<status>
<statusCode>AUTHENTICATION_ERROR</statusCode>
<statusText>Requestor configuration settings insufficient for access to this resource.</statusText>
</status>
</service-error>
What can be the cause of this problem and how can I fix it?
I have just registered for an API key and tested it first with this URL:
http://api.elsevier.com/content/search/author?apikey=4xxxxxxxxxxxxxxxxxxxxxxxxxxxxx43&query=AUTHFIRST%28John%29+AND+AUTHLASTNAME%28Kitchin%29+AND+SUBJAREA%28COMP%29
This works fine from my university network. I also tested a second API key, so I have verified one registered with a website on my university domain and one registered with the website http://apitest.example.com, ruling out the domain name used at registration as the source of your problem.
I tested this:
- in the browser,
- using your Python code with the API key in the headers (the only change I made to your code was removing
from scopus import SCOPUS_API_KEY
and adding
SCOPUS_API_KEY = '4xxxxxxxxxxxxxxxxxxxxxxxxxxxxx43'),
- using your Python code adapted to put the apikey in the URL instead of the headers.
In all cases, the query returns two authors, one at Carnegie Mellon and one at Palo Alto.
I can't replicate your error message. If I try to use the API key from an IP address unregistered with elsevier (e.g. my home computer), I see a different error:
<service-error>
<status>
<statusCode>AUTHENTICATION_ERROR</statusCode>
<statusText>Client IP Address: xxx.yyy.aaa.bbb does not resolve to an account</statusText>
</status>
</service-error>
If I use a random (wrong) API key from the university network, I see
<service-error>
<status>
<statusCode>AUTHORIZATION_ERROR</statusCode>
<statusText>APIKey <mad3upa1phanum3r1ck3y> with IP address <my.uni.IP.add> is unrecognized or has insufficient privileges for access to this resource</statusText>
</status>
</service-error>
Debug steps
As I can't replicate your problem, here are some diagnostic steps you can use to resolve it:
1. Use your browser at uni to actually submit the API query with your key in the URL (i.e. copy the URL above, paste it into the address bar, substitute your key and see whether you get the XML back).
2. If 1 returns the XML you expect, move on to submitting the request via Python: first, copy the exact URL straight into Python (no variable substitution via %s, no apikey in the header) and simply do a .get() on it (see the sketch after this list).
3. If 2 returns correctly, ensure that your SCOPUS_API_KEY holds the exact key value, no more, no less; i.e. print SCOPUS_API_KEY should output your apikey: 4xxxxxxxxxxxxxxxxxxxxxxxxxxxxx43
4. If 1 returns the error, it looks like your uni (for whatever reason) does not have access to the author search API. This doesn't make much sense given that you can perform a manual search, but that is all I can conclude.
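For debug step 2, a minimal sketch; the apikey value below is a placeholder to substitute with your own key:
import requests

# Exact URL from above, no %s substitution, no apikey header.
url = ('http://api.elsevier.com/content/search/author'
       '?apikey=4xxxxxxxxxxxxxxxxxxxxxxxxxxxxx43'
       '&query=AUTHFIRST%28John%29+AND+AUTHLASTNAME%28Kitchin%29+AND+SUBJAREA%28COMP%29')
r = requests.get(url)
print(r.status_code)
print(r.content)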
Docs
For reference, the authentication algorithm documentation is here, but it is not very simple to follow. You are using authentication option 1, and your method should just work.
N.B. The API is limited to 5000 author retrievals per week. If you have run a lot of queries in a loop, even if they have failed, it is possible that you have exceeded that...
For future reference: the OP was using the package scopus, which has long since been renamed to pybliometrics.
Nowadays you can do
from pybliometrics.scopus import AuthorSearch
q = "AUTHFIRST(John) AND AUTHLASTNAME(Kitchin) AND SUBJAREA(COMP)"
s = AuthorSearch(q) # handles access, retrieval, parsing and even caches results
print(s)
results = s.authors # Holds all the information as a list of namedtuples
print(results) # You can put this into a pandas DataFrame as well
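For example, turning the results into a DataFrame is a one-liner; a sketch assuming pandas is installed (s.authors is a list of namedtuples, which pandas consumes directly):
import pandas as pd

df = pd.DataFrame(results)  # one row per author, one column per namedtuple field
print(df.head())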
