Authenticated API Call in Python + Pagination

I am making an authenticated API call in Python - and dealing with pagination. If I call each page separately, I can pull in each of the 20 pages of records and combine them into a dataframe, but it's obviously not a very efficient process and the code gets quite lengthy.
I found some instructions on here to retrieve all records across all pages -- and the json output confirms there are 20 total pages, ~4900 records -- but somehow I'm still only getting page 1 of the data. Any ideas on how I might pull each page into a single dataframe via a single call to the API?
Link I Referenced:
Python to retrieve multiple pages of data from API with GET
My Code:
import requests
import json
import pandas as pd
for page in range(1, 20):
    url = "https://api.xxx.xxx...json?fields=keywords, source, campaign, per_page=250&date_range=all_time"
    headers = {"myauthkey"}
    response = requests.get(url, headers=headers)
    print(response.json())  # Shows Page 1, per_page: 25, total_pages: 20, total_records: 4900
    data = response.json()
    df = pd.json_normalize(data, 'records')
df.info()  # confirms I've only pulled in 250 records, 1 page of the 20 pages of data.
I've scoured the web and this site - and can't find a solution to efficiently pull in all 20 pages of data, except to call each page one-by-one. I think one solution might be looping through the code until the final page of data has been reached, but I'm not quite sure how to set that up and sort of thought the above might accomplish that - and maybe it does, but maybe the subsequent code I've written is not right to pull in all pages of data.
Appreciate any guidance/advice. Thanks!
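One way to set that loop up, sketched under the assumption that the API accepts a page query parameter and reports total_pages in its JSON (names are taken from the printed output above; the auth header is a placeholder):

import requests
import pandas as pd

base_url = "https://api.xxx.xxx...json"   # placeholder URL, as in the question
headers = {"Authorization": "myauthkey"}  # hypothetical header name -- use whatever auth the API expects

all_records = []
page = 1
total_pages = 1  # corrected after the first response

while page <= total_pages:
    params = {
        "fields": "keywords,source,campaign",
        "per_page": 250,
        "date_range": "all_time",
        "page": page,  # assumed name of the paging parameter
    }
    response = requests.get(base_url, headers=headers, params=params)
    data = response.json()
    total_pages = data.get("total_pages", total_pages)
    all_records.extend(data["records"])
    page += 1

df = pd.json_normalize(all_records)
df.info()  # should now report ~4900 rows across all 20 pages

The key differences from the code above are that the page number is actually passed to the API, and each page's records are accumulated before the dataframe is built once at the end.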

Related

Python - How to scrape a table from a website with a dropdown of available rows

I am trying to scrape the earnings calendar data from the table from zacks.com and the url is attached below.
https://www.zacks.com/stock/research/aapl/earnings-calendar
The thing is I am trying to scrape all the data from the table, but it has a dropdown list to select 10, 25, 50 or 100 rows on a page. Ideally I want to scrape all 100 rows, but when I select 100 from the dropdown list, the url doesn't change. My code is below.
Note that the website blocks the default user-agent, so I had to use the Chrome driver to impersonate a human visiting the site. The result from pd.read_html is a list of all the tables, and d[4] returns the earnings calendar with only 10 rows (which I want to change to 100).
import pandas as pd
from selenium import webdriver

driver = webdriver.Chrome('../files/chromedriver96')
symbol = 'AAPL'
url = 'https://www.zacks.com/stock/research/{}/earnings-calendar'.format(symbol)
driver.get(url)
content = driver.page_source
d = pd.read_html(content)  # list of every table on the rendered page
d[4]  # the earnings calendar table
So I'm calling for help from anyone who can guide me on this.
Thanks!
UPDATE: it looks like my last post was downvoted due to a lack of clear articulation and evidence of past research. Maybe I am still a newbie at posting questions on this site. I have actually found several pages, including this page, with the same issue, but the solutions didn't seem to work for me, which is why I posted this as a new question.
UPDATE 12/05:
Thanks a lot for the advice. As commented below, I finally got it working. Below is the code I used:
import time

dropdown = driver.find_element_by_css_selector('#earnings_announcements_earnings_table_length')
time.sleep(1)
hundreds = dropdown.find_element_by_xpath(".//option[. = '100']")
hundreds.click()
Having taken a look, this is not going to be something that is easy to scrape. Given that the table is produced by JavaScript, I would say you have two options.
Option one:
Use Selenium to render the page, allowing the JavaScript to run. This way you can simply use the id/class of the dropdown to interact with it.
You can then scrape the data by looking at the values in the table.
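A rough sketch of that flow, reusing the dropdown id and table index from the question (both are assumptions about the page structure and may need adjusting):

import time
import pandas as pd
from selenium import webdriver

driver = webdriver.Chrome('../files/chromedriver96')  # driver path taken from the question
driver.get('https://www.zacks.com/stock/research/AAPL/earnings-calendar')

# Switch the rows-per-page dropdown to 100, then give the table a moment to re-render
dropdown = driver.find_element_by_css_selector('#earnings_announcements_earnings_table_length')
dropdown.find_element_by_xpath(".//option[. = '100']").click()
time.sleep(1)

# Re-read the rendered page and pull out the earnings calendar table
tables = pd.read_html(driver.page_source)
earnings = tables[4]  # index 4 as in the question; verify it still holds
print(earnings.shape)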
Option two:
This is the more challenging one. Look through the data that the page gets in response and try to find the requests which result in the data you then see on the page. By cross-referencing these, there will be a way to directly request the data you want.
You may find that to get at the data you want, you need to accept a key from the original request to the page and then send that key as part of a second request. This approach should allow you to scrape the data without having to run a Selenium instance, which will be more efficient.
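A very rough sketch of option two; the data endpoint here is purely illustrative and would have to be replaced with whatever request actually shows up in the browser's Network panel:

import requests

# Hypothetical data endpoint -- find the real one via the Network panel
api_url = 'https://www.zacks.com/hypothetical/earnings-data-endpoint'
page_url = 'https://www.zacks.com/stock/research/AAPL/earnings-calendar'

session = requests.Session()
# Hit the page first so any cookies or tokens the site issues are stored on the session
session.get(page_url, headers={'User-Agent': 'Mozilla/5.0'})

# Replay the data request, reusing those cookies plus the original page as Referer
resp = session.get(api_url, headers={'User-Agent': 'Mozilla/5.0', 'Referer': page_url})
data = resp.json()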
My personal suggestion is to go with option one as computer resources are cheap and developer time expensive.

Unable to get all the links within a page

I am trying to scrape this page:
https://www.jny.com/collections/bottoms
It lists a total of 55 products, but only 24 are shown once the page is loaded. However, the div contains a list of all 55 products. I am trying to scrape that using Scrapy like this:
def parse(self, response):
    print("in here")
    self.product_url = response.xpath('//div[@class = "collection-grid js-filter-grid"]//a/@href').getall()
    print(len(self.product_url))
    print(self.product_url)
It only gives me a list of length 25. How do I get the rest?
I would suggest scraping it through the API directly - the other option would be rendering Javascript using something like Splash/Selenium, which is really not ideal.
If you open up the Network panel in the Developer Tools on Chrome/Firefox, filter down to only the XHR requests and reload the page, you should be able to see all of the requests being sent out. Some of those requests can help us figure out how the data is being loaded into the HTML.
Clicking on those requests can give us more details on how the requests are being made and the request structure. At the end of the day, for your use case, you would probably want to send out a request to https://www.jny.com/collections/bottoms/products.json?limit=250&page=1 and parse the body_html attribute for each Product in the response (perhaps using scrapy.selector.Selector) and use that however you want. Good luck!
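A minimal sketch of that spider, assuming the response follows the usual Shopify-style products.json layout (a top-level products list with title, handle and body_html fields) -- verify the keys against the actual response:

import json
import scrapy

class BottomsSpider(scrapy.Spider):
    name = 'bottoms'
    # limit=250 per page is more than enough to cover the 55 products here
    start_urls = ['https://www.jny.com/collections/bottoms/products.json?limit=250&page=1']

    def parse(self, response):
        data = json.loads(response.text)
        for product in data.get('products', []):  # 'products' key assumed
            yield {
                'title': product.get('title'),
                'url': '/products/{}'.format(product.get('handle', '')),  # handle-based URL, typical layout
                'body_html': product.get('body_html'),
            }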

(scrapy) HTTP GET response different from what's being displayed on browser

I am new to web scraping and scrapy.
I am trying to scrape items from a website by parsing the GET response, which is in JSON.
However, I am noticing that instead of having just the 90 or so elements that are shown on the website, the raw JSON response contains 140+ elements.
Just by inspecting the json array, there doesn't seem to be any difference between items that do end up getting displayed in the browser vs those that don't.
Is it possible for me to capture with scrapy the filtered array of items instead of the raw information?
So I've realized that when the website loads, it makes 1 request for product details and 1 for stock availability. By cross-checking those responses, I realized that only those products with items available are displayed.
Now my question is: can these 2 requests be handled in one Scrapy spider class?
I'd recommend scraping all the items, and then filtering them in a custom pipeline.
You would simply get the stock data in open_spider(), and filter out the items you don't need in process_item().
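A rough sketch of that pipeline; the availability endpoint and field names are placeholders, since they depend on the actual requests the site makes:

import requests
from scrapy.exceptions import DropItem

class StockFilterPipeline:
    def open_spider(self, spider):
        # Placeholder URL -- replace with the stock-availability request the site really makes
        resp = requests.get('https://example.com/api/stock-availability')
        # Keep only the ids of products reported as available (field names are assumptions)
        self.available_ids = {entry['id'] for entry in resp.json() if entry.get('available')}

    def process_item(self, item, spider):
        if item.get('id') not in self.available_ids:
            raise DropItem('Not in stock: {}'.format(item.get('id')))
        return item

The pipeline then just needs to be enabled via ITEM_PIPELINES in settings.py.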

Incomplete data after scraping a website for Data

I am working on some web scraping using Python and experienced some issues with extracting the table values. For example, I am interested in scraping the ETF values from http://www.etf.com/etfanalytics/etf-finder.
Here is the code I am trying to use for the scraping:
#Import packages
import pandas as pd
import requests

#Get website url and get request
etf_list = "http://www.etf.com/etfanalytics/etf-finder"
etf_df = pd.read_html(requests.get(etf_list, headers={'User-agent': 'Mozilla/5.0'}).text)

#printing the scraped data to screen
print(etf_df)

#Output the read data into dataframes
frame = {}  # dict to hold one dataframe per scraped table
for i in range(0, len(etf_df)):
    frame[i] = pd.DataFrame(etf_df[i])
    print(frame[i])
I have several issues.
The tables only consist of 20 entries while the total entries per table from the website should be 2166 entries. How do I amend the code to pull all the values?
Some of the dataframes could not be properly assigned after scraping from the site. For example, the output for frame[0] is not in dataframe format, and nothing is shown for frame[0] when trying to view it as a DataFrame in the Python console. However, it seems fine when printed to the screen. Would it be better if I parsed the HTML using BeautifulSoup instead?
As noted by Alex, the website requests the data from http://www.etf.com/etf-finder-funds-api//-aum/0/3000/1, which checks the Referer header to see if you're allowed to see it.
However, Alex is wrong in saying that you're unable to change the header.
It is in fact very easy to send custom headers using requests:
>>> r = requests.get('http://www.etf.com/etf-finder-funds-api//-aum/0/3000/1', headers={'Referer': 'http://www.etf.com/etfanalytics/etf-finder'})
>>> data = r.json()
>>> len(data)
2166
At this point, data is a list containing all the records you need; pandas has a simple way of loading it into a dataframe.
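For example, as a minimal continuation of the session above (assuming the response really is a flat list of per-fund records):
>>> import pandas as pd
>>> df = pd.json_normalize(data)  # or pd.DataFrame(data) if the records are already flat
>>> len(df)
2166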
You only get 20 rows of the table because only 20 rows are present in the HTML page by default. View the source code of the page you are trying to parse. A possible solution would be to iterate through the pagination until the end, but the pagination there is implemented with JS and is not reflected in the URL, so I don't see how you can access the next pages of the table directly.
Looks like there is a request to
http://www.etf.com/etf-finder-funds-api//-aum/100/100/1
on that page when I try to load the 2nd group of 100 rows. But getting access to that URL might be very tricky, if possible at all. Maybe for this particular site you should use something like WebBrowser in C# (I don't know what the equivalent is in Python, but I'm sure Python can do it). You will be able to imitate a browser and execute JavaScript.
Edit: I've tried running the following JS code in the console on the page you provided.
jQuery.ajax({
    url: "http://www.etf.com/etf-finder-funds-api//-aum/0/3000/1",
    success: function(data) {
        console.log(JSON.parse(data));
    }
});
It logged an array of all 2166 objects representing the table rows you are looking for. Try it yourself to see the result. It looks like in the request URL "0" is a start index and "3000" is a limit.
But if you try this from some other domain you will get 403 Forbidden. This is because they have a Referer header check.
Edit again: as mentioned by @stranac, it is easy to set that header. Just set it to http://www.etf.com/etfanalytics/etf-finder and enjoy.

Facebook Graph API {page}/links only returns 60 days of data

I'm managing a Facebook page and also analyzing its insights. We own the page and every post on the page feed (the page doesn't allow other users to post). I'm doing an analysis of all of the posts we've ever created.
I've been using the {page}/posts edge to get the post ids but found out that it only returns a subset of the data. Then I tried {page}/links and {page}/videos because these are the post types I'm mostly interested in. The videos edge works great; it gave me all of the video ids from the page. However, {page}/links only returned 2 months' worth of link ids.
Here is a sample GET I'm using (I'm trying to get the post ids from 10/2014 to 12/2014):
https://graph.facebook.com/v2.2/{actual_page_id}/links?fields=id,created_time&since=1414175236&until=1419445636&access_token=[The_actual_access_token]
But I get an empty result string:
{"data": []}
And when I set the date within the 2-month frame, I get a proper response.
My question is: Is there a way to get ALL of the Facebook page posts ids that we have created? I've tried to set limits and paging but none have worked. Thank you very much for your help.
The below snippet should solve your issue. It uses Facepy and handles paging on its own.
from facepy import GraphAPI

access = '<access_token>'
graph = GraphAPI(access)
page_id = '<page_id>'

# page=True tells Facepy to follow the paging cursors itself
data = graph.get(page_id + "/posts?fields=id", page=True, retry=5)
data1 = []
for i in data:
    data1.append(i)
print(data1)
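For reference, the same paging can also be done by hand with plain requests by following the paging.next URL the Graph API returns with each batch (a sketch, not tested against v2.2):

import requests

access_token = '<access_token>'
page_id = '<page_id>'
url = 'https://graph.facebook.com/v2.2/{}/posts'.format(page_id)
params = {'fields': 'id,created_time', 'access_token': access_token}

post_ids = []
while url:
    resp = requests.get(url, params=params).json()
    post_ids.extend(post['id'] for post in resp.get('data', []))
    # paging.next is a full URL that already carries the token and cursor
    url = resp.get('paging', {}).get('next')
    params = {}

print(len(post_ids))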
