Incomplete data after scraping a website for data - Python

I am working on some web scraping using Python and have run into issues extracting table values. For example, I am interested in scraping the ETF values from http://www.etf.com/etfanalytics/etf-finder. Below is a snapshot of the tables I am trying to scrape values from.
Here is the code I am trying to use for the scraping.
# Import packages
import pandas as pd
import requests

# Get the website URL and send the request
etf_list = "http://www.etf.com/etfanalytics/etf-finder"
etf_df = pd.read_html(requests.get(etf_list, headers={'User-agent': 'Mozilla/5.0'}).text)

# Print the scraped data to screen
print(etf_df)

# Collect the parsed tables into dataframes
frame = {}
for i in range(len(etf_df)):
    frame[i] = pd.DataFrame(etf_df[i])
    print(frame[i])
I have several issues:
1. The tables only contain 20 entries, while the total entries per table on the website should be 2166. How do I amend the code to pull all the values?
2. Some of the dataframes could not be properly assigned after scraping from the site. For example, frame[0] is not in DataFrame format and nothing shows up for frame[0] when trying to view it as a DataFrame in the Python console, although it seems fine when printed to the screen. Would it be better to parse the HTML using BeautifulSoup instead?

As noted by Alex, the website requests the data from http://www.etf.com/etf-finder-funds-api//-aum/0/3000/1, and that endpoint checks the Referer header to decide whether you're allowed to see it.
However, Alex is wrong in saying that you're unable to change the header.
It is in fact very easy to send custom headers using requests:
>>> r = requests.get('http://www.etf.com/etf-finder-funds-api//-aum/0/3000/1', headers={'Referer': 'http://www.etf.com/etfanalytics/etf-finder'})
>>> data = r.json()
>>> len(data)
2166
At this point, data holds all 2166 records you need, and pandas has a simple way of loading it into a DataFrame.
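For example, a minimal sketch of loading that JSON into pandas (assuming the response is a list of per-fund records, as the len() above suggests):
import requests
import pandas as pd

url = 'http://www.etf.com/etf-finder-funds-api//-aum/0/3000/1'
headers = {'Referer': 'http://www.etf.com/etfanalytics/etf-finder'}
data = requests.get(url, headers=headers).json()

# A list of dicts converts straight into a DataFrame.
df = pd.DataFrame(data)
print(df.shape)  # expect 2166 rows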

You get only 20 rows of the table because only 20 rows are present in the HTML page by default. View the source code of the page you are trying to parse. One possible solution would be to iterate through the pagination to the end, but the pagination there is implemented with JS and is not reflected in the URL, so I don't see how you could access the next pages of the table directly.
Looks like there is a request to
http://www.etf.com/etf-finder-funds-api//-aum/100/100/1
on that page when I try to load the 2nd group of 100 rows. But getting access to that URL might be tricky, if possible at all. Maybe for this particular site you should use something like WebBrowser in C# (in Python, a browser-automation tool such as Selenium plays the same role). That would let you imitate a browser and execute JavaScript.
Edit: I tried running the following JS code in the console on the page you provided:
jQuery.ajax({
    url: "http://www.etf.com/etf-finder-funds-api//-aum/0/3000/1",
    success: function(data) {
        console.log(JSON.parse(data));
    }
});
It logged an array of all 2166 objects representing the table rows you are looking for. Try it yourself to see the result. It looks like in the request URL, "0" is a start index and "3000" is a limit.
But if you try this from some other domain you will get 403 Forbidden. This is because they have a Referer header check.
Edit again: as mentioned by @stranac, it is easy to set that header. Just set it to http://www.etf.com/etfanalytics/etf-finder and enjoy.

Related

How do I extract data from dynamically populated site?

I want to extract domain data from https://www.whois.com/whois/. For example, to get information for a domain named tinymail.com, I want to use https://www.whois.com/whois/tinymail.com. If I open it in a browser first, then soup gives credible data; otherwise no domain data is received (I guess the site is putting the data in a cache of some sort). I do not want to use the Selenium method (as it will increase the time required). I tried inspecting the networking option in inspect element but saw only two updates, neither of which shows any data.
You can use requests to get the data. This retrieves data from the website in the question:
import requests

url = 'https://www.whois.com/whois/'
r = requests.get(url)
if r.status_code == 200:
    # page works
    print(r.text)
else:
    print('no website')
Here is a link for more: https://docs.python-requests.org/en/latest/
Also, you can sign up for an API key to get specific data. This might be free for limited data requests.
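If the per-domain page itself is what you need, here is a minimal sketch with a browser-like User-Agent (whether this returns populated data depends on whether the site fills the page in with JavaScript; if it does, plain requests won't see it):
import requests
from bs4 import BeautifulSoup

url = 'https://www.whois.com/whois/tinymail.com'
r = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
soup = BeautifulSoup(r.text, 'html.parser')

# Check whether the domain data made it into the raw HTML.
print(soup.title)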

Python - How to scrape a table from a website with a dropdown of available rows

I am trying to scrape the earnings calendar data from the table from zacks.com and the url is attached below.
https://www.zacks.com/stock/research/aapl/earnings-calendar
The thing is, I am trying to scrape all the data from the table, but it has a dropdown list to select 10, 25, 50 or 100 rows on a page. Ideally I want to scrape all 100 rows, but when I select 100 from the dropdown list, the URL doesn't change. My code is below.
Note that the website blocks the default user agent, so I had to use ChromeDriver to imitate a human visiting the site. The result obtained from pd.read_html is a list of all the tables, and d[4] returns the earnings calendar with only 10 rows (which I want to change to 100).
from selenium import webdriver
import pandas as pd

driver = webdriver.Chrome('../files/chromedriver96')
symbol = 'AAPL'
url = 'https://www.zacks.com/stock/research/{}/earnings-calendar'.format(symbol)
driver.get(url)
content = driver.page_source
d = pd.read_html(content)
d[4]
So I'm calling for help from anyone who can guide me on this.
Thanks!
UPDATE: it looks like my last post was downvoted due to a lack of clear articulation and evidence of past research. Maybe I am still a newbie at posting questions on this site. Actually, I have found several pages, including this one, with the same issue, but the solutions didn't seem to work for me, which is why I posted this as a new question.
UPDATE 12/05:
Thanks a lot for the advice. As commented below, I finally got it working. Below is the code I used:
import time

dropdown = driver.find_element_by_css_selector('#earnings_announcements_earnings_table_length')
time.sleep(1)
hundreds = dropdown.find_element_by_xpath(".//option[. = '100']")
hundreds.click()
Having taken a look, this is not going to be something that is easy to scrape. Given that the table is produced by JavaScript, I would say you have two options.
Option one:
Use Selenium to render the page, allowing the JavaScript to run. This way you can simply use the id/class of the dropdown to interact with it.
You can then scrape the data by looking at the values in the table.
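A minimal sketch of that interaction, assuming the rows-per-page control is a native <select> inside the wrapper div the asker identified (adjust the selector if the markup differs):
from selenium import webdriver
from selenium.webdriver.support.ui import Select
import pandas as pd
import time

driver = webdriver.Chrome('../files/chromedriver96')
driver.get('https://www.zacks.com/stock/research/AAPL/earnings-calendar')

# Pick "100" from the rows-per-page dropdown, then give the table a
# moment to re-render before re-reading the page source.
dropdown = Select(driver.find_element_by_css_selector(
    '#earnings_announcements_earnings_table_length select'))
dropdown.select_by_visible_text('100')
time.sleep(1)

tables = pd.read_html(driver.page_source)
print(tables[4])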
Option two:
This is the more challenging one. Look through the requests the page makes in the background and try to find the ones whose responses contain the data you then see on the page. By cross-referencing these, there will be a way to request the data you want directly.
You may find that to get at the data you need to accept a key from the original request to the page and then send that key as part of a second request. This way you can scrape the data without having to run a Selenium instance, which will be more efficient.
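A sketch of that two-step flow with requests.Session, assuming the "key" is carried in cookies; the API endpoint below is a placeholder you would find in the browser's network tab:
import requests

api_url = 'https://www.zacks.com/REPLACE_WITH_REAL_ENDPOINT'  # placeholder

session = requests.Session()
session.headers['User-Agent'] = 'Mozilla/5.0'

# First request: load the page so any session cookies/tokens get set.
session.get('https://www.zacks.com/stock/research/AAPL/earnings-calendar')

# Second request: the session sends those cookies automatically.
response = session.get(api_url)
print(response.status_code)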
My personal suggestion is to go with option one, as computer resources are cheap and developer time is expensive.

Authenticated API Call Python + Pagination

I am making an authenticated API call in Python - and dealing with pagination. If I call each page separately, I can pull in each of the 20 pages of records and combine them into a dataframe, but it's obviously not a very efficient process and the code gets quite lengthy.
I found some instructions on here for retrieving all records across all pages -- and the JSON output confirms there are 20 total pages, ~4900 records -- but somehow I'm still only getting page 1 of the data. Any ideas on how I might pull each page into a single dataframe via a single call to the API?
Link I Referenced:
Python to retrieve multiple pages of data from API with GET
My Code:
import requests
import json
import pandas as pd

for page in range(1, 20):
    url = "https://api.xxx.xxx...json?fields=keywords, source, campaign, per_page=250&date_range=all_time"
    headers = {"Authorization": "myauthkey"}  # placeholder for the real auth header
    response = requests.get(url, headers=headers)
    print(response.json())  # shows page 1, per_page: 25, total_pages: 20, total_records: 4900
    data = response.json()
    df = pd.json_normalize(data, 'records')
df.info()  # confirms I've only pulled in 250 records, 1 page of the 20 pages of data
I've scoured the web and this site and can't find a solution to efficiently pull in all 20 pages of data, except to call each page one by one. I think one solution might be looping until the final page of data has been reached (a sketch of that approach follows below), but I'm not quite sure how to set that up. I sort of thought the above might accomplish that, but maybe the subsequent code I've written is not right to pull in all pages of data.
Appreciate any guidance/advice. Thanks!
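A minimal sketch of that loop, assuming the API accepts page and per_page query parameters and reports total_pages in its JSON (as the printed output above suggests); the endpoint and auth header are placeholders:
import requests
import pandas as pd

base_url = "https://api.xxx.xxx...json"  # placeholder endpoint
headers = {"Authorization": "myauthkey"}  # placeholder auth header

frames = []
page = 1
while True:
    # Request one page at a time; 'page' and 'per_page' are assumed
    # parameter names -- check the API docs for the real ones.
    response = requests.get(base_url, headers=headers,
                            params={"page": page, "per_page": 250})
    data = response.json()
    frames.append(pd.json_normalize(data, "records"))
    if page >= data["total_pages"]:  # stop once the last page is reached
        break
    page += 1

df = pd.concat(frames, ignore_index=True)
df.info()  # should now show all ~4900 records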

Where is the information stored in the HTML? (web scraping)

A beginner here.
I want to extract all the jobs from Barclays (https://search.jobs.barclays/search-jobs)
I got through scraping the first page but am struggling to get to the next page, as the URL doesn't change.
I tried to scrape the URL from the next-page button, but that href brings me back to the homepage.
Does that mean that all the job data is actually stored within the original html?
If so, how can I extract it?
Thanks!
So I analyzed the website, and it communicates with the server using an API, so you can get data directly from it as a JSON file.
This is the API link in this specific case(for my computer): https://search.jobs.barclays/search-jobs/results?ActiveFacetID=44699&CurrentPage=2&RecordsPerPage=15&Distance=50&RadiusUnitType=0&Keywords=&Location=&ShowRadius=False&IsPagination=False&CustomFacetName=&FacetTerm=&FacetType=0&SearchResultsModuleName=Search+Results&SearchFiltersModuleName=Search+Filters&SortCriteria=0&SortDirection=0&SearchType=5&PostalCode=&fc=&fl=&fcf=&afc=&afl=&afcf=
For you the URL might be different, but the concept is the same:
As you can see, there is a CurrentPage=2 parameter inside the URL, which you can use to get any of the pages using requests, then extract what you need from the JSON.
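For instance, here is a sketch that walks the first few pages by changing CurrentPage (the exact set of query parameters, and the JSON structure that comes back, should be checked against your own network tab):
import requests

base = 'https://search.jobs.barclays/search-jobs/results'
for page in range(1, 4):  # first three pages as a demonstration
    params = {
        'CurrentPage': page,
        'RecordsPerPage': 15,
        'SearchType': 5,
    }
    r = requests.get(base, params=params,
                     headers={'User-Agent': 'Mozilla/5.0'})
    data = r.json()
    # Inspect the keys to find where the job listings live in the JSON.
    print(page, list(data.keys()))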

How to retrieve data from API Explorer?

My question is more on the "concept" side, as I don't have any code to show yet. I basically have access to an API Explorer for a website, but the information retrieved when I put a specific URL in the API Explorer is not the same as the HTML information I'd get if I opened a webpage with the same URL and "inspected" the elements. I'm honestly lost on how to retrieve the data I need, as it is only present in the API Explorer and can't be accessed via web scraping.
Here is an example to show you what I mean:
API Explorer link: https://platform.worldcat.org/api-explorer/apis/worldcatidentities/identity/Read,
and the specific url to request is: http://www.worldcat.org/identities/lccn-n80126307/
If I open the URL (http://www.worldcat.org/identities/lccn-n80126307/) myself and "inspect element", the information shown there does not include all the data the API Explorer returns. For example, the language count, audLevel, oclcnum and many others are not present in the HTML version but are in the API Explorer, and for other authors the genres count only exists in the API Explorer.
I realize that one is XML and the other HTML, so is that why the data is not the same in both versions? And whatever the reason, what can I do to retrieve the data present only in the API Explorer (such as genres count, audLevel, oclcnum, etc.)?
Any insight would be really helpful.
It's not unusual for sites not to show all the data that's in the underlying JSON/XML. That sort of thing often holds interesting content that isn't displayed anywhere on the site.
In this case the server gives you what you ask for. If you're going for the data using Python, all you really have to do is specify in your header what you're after. If you don't do that on this site, you get the HTML stuff.
If you do it like this, you'll get the XML data you're interested in:
import requests
import xml.dom.minidom

url = 'https://www.worldcat.org/identities/lccn-n80126307/'
r = requests.get(url, headers={'Accept': 'application/json'})

# A couple of lines for pretty-printing the XML
dom = xml.dom.minidom.parseString(r.text)
pretty_xml_as_string = dom.toprettyxml()
print(pretty_xml_as_string)
Then all you have to do is extract the content you're after. That can be done in many ways. Let me know if this helps you.
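For example, a sketch of pulling elements out of the parsed DOM (the tag name here is an assumption; check the pretty-printed XML above for the real element names):
# 'genre' is an assumed tag name taken from the question's wording;
# inspect the XML to find the elements you actually need.
for node in dom.getElementsByTagName('genre'):
    print(node.toxml())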
