I want to extract domain data from https://www.whois.com/whois/. For example, to get information for the domain tinymail.com, I want to use https://www.whois.com/whois/tinymail.com. If I open the page in a browser first, soup gives credible data; otherwise no domain data is received (I guess the site is putting the data in a cache). I do not want to use Selenium (as it will increase the time required). I have tried inspecting the Network tab in the browser's inspector, but saw only two requests, neither of which shows any data.
You can use requests to get the data. This retrieves the page from the website in the question:
import requests

url = 'https://www.whois.com/whois/'
r = requests.get(url)
if r.status_code == 200:
    # page works
    print(r.text)
else:
    print('no website')
Here is a link for more: https://docs.python-requests.org/en/latest/
Also, you can sign up for an API key to get specific data. This might be free for limited data requests.
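If you want to go straight to the example domain from the question and feed the result to BeautifulSoup, here is a minimal sketch; the browser-like User-Agent header is an assumption, since some sites serve an empty shell to the default requests user agent:

import requests
from bs4 import BeautifulSoup

# tinymail.com is the example domain from the question
url = 'https://www.whois.com/whois/tinymail.com'
# The User-Agent is an assumption: some sites return a bare shell
# to the default requests user agent
r = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
soup = BeautifulSoup(r.text, 'html.parser')
print(soup.get_text(separator='\n', strip=True))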
I am trying to scrape this page:
https://www.jny.com/collections/bottoms
It lists 55 products in total, but only 24 are rendered when the page first loads. However, the div contains the list of all 55 products. I am trying to scrape it using Scrapy like this:
def parse(self, response):
    print("in here")
    # Collect every product link inside the collection grid
    self.product_url = response.xpath('//div[@class = "collection-grid js-filter-grid"]//a/@href').getall()
    print(len(self.product_url))
    print(self.product_url)
It only gives me a list of length 25. How do I get the rest?
I would suggest scraping it through the API directly; the other option would be rendering the JavaScript using something like Splash/Selenium, which is really not ideal.
If you open up the Network panel in the developer tools on Chrome/Firefox, filter down to only the XHR requests and reload the page, you should be able to see all of the requests being sent out. Some of those requests can help us figure out how the data is being loaded into the HTML.
Clicking on those requests can give us more details on how the requests are being made and the request structure. At the end of the day, for your use case, you would probably want to send out a request to https://www.jny.com/collections/bottoms/products.json?limit=250&page=1 and parse the body_html attribute for each Product in the response (perhaps using scrapy.selector.Selector) and use that however you want. Good luck!
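For illustration, here's a minimal sketch of that request, assuming the endpoint returns a Shopify-style {"products": [...]} payload:

import requests

# Sketch assuming a Shopify-style products.json endpoint, as seen in the
# network traffic; the exact response shape may differ
url = "https://www.jny.com/collections/bottoms/products.json"
resp = requests.get(url, params={"limit": 250, "page": 1})
resp.raise_for_status()

products = resp.json().get("products", [])
print(len(products))  # should cover all 55 products, not just the first 24
for product in products:
    # body_html holds the product description markup; parse it with
    # scrapy.selector.Selector or BeautifulSoup if needed
    print(product.get("title"))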
I am trying to scrape a table of company info from this page: https://tools.ceres.org/resources/tools/sec-sustainability-disclosure/
I can see the table contents when using Chrome's dev tools element inspector, but when I request the page in my script, the contents of the table are gone; the markup comes back with no content.
Any idea how I can get that sweet, sweet content?
Thanks
Code is below:
import requests
from bs4 import BeautifulSoup

response = requests.get("https://tools.ceres.org/resources/tools/sec-sustainability-disclosure/")
page = BeautifulSoup(response.text, "html.parser")
print(page)  # the table shows up empty here
You can find the API in the network traffic tab: it's calling
https://tools.ceres.org/resources/tools/sec-sustainability-disclosure/##api-disclosure?isabstract=0&companyName=&ticker=&year=2018&analysis=1&index=&sic=&keywords=
and you should be able to reconstruct the table from the resulting JSON. I haven't played around with all the parameters, but it seems like only year affects the resulting data set, i.e.
https://tools.ceres.org/resources/tools/sec-sustainability-disclosure/##api-disclosure?isabstract=0&year=2018&analysis=1
should give you the same result as the query above.
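As an unverified sketch of that idea: note that requests treats everything after a # as a URL fragment and won't send it to the server, so the hashes are percent-encoded as %23 below, on the assumption that the server really expects them in the path.

import requests
import pandas as pd

# Unverified sketch: the URL is copied from the network tab above, with '#'
# percent-encoded as %23 so requests actually sends it as part of the path
api_url = ("https://tools.ceres.org/resources/tools/sec-sustainability-disclosure/"
           "%23%23api-disclosure?isabstract=0&year=2018&analysis=1")
resp = requests.get(api_url)
resp.raise_for_status()

# assuming the response is a JSON array of row objects, rebuild the table
df = pd.DataFrame(resp.json())
print(df.head())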
Based on the network traffic in the dev tools, the content isn't directly in the HTML; it gets loaded dynamically by an ApiService.js script. My suggestion would be to use Selenium to extract the content once the page has fully loaded (for example, by waiting until the loading element has disappeared).
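A minimal sketch of that approach; the .loading selector is a placeholder for whatever element the page actually shows while the table is being fetched:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get("https://tools.ceres.org/resources/tools/sec-sustainability-disclosure/")
# Wait until the (hypothetical) loading indicator disappears, i.e. the
# ApiService.js call has finished and the table is in the DOM
WebDriverWait(driver, 30).until(
    EC.invisibility_of_element_located((By.CSS_SELECTOR, ".loading"))
)
html = driver.page_source  # now contains the dynamically loaded table
driver.quit()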
I am working on some web scraping using Python and have run into issues extracting table values. For example, I am interested in scraping the ETF values from http://www.etf.com/etfanalytics/etf-finder. Below is a snapshot of the tables I am trying to scrape values from.
Here is the code I am trying to use for the scraping.
#Import packages
import pandas as pd
import requests

#Get website url and get request
etf_list = "http://www.etf.com/etfanalytics/etf-finder"
etf_df = pd.read_html(requests.get(etf_list, headers={'User-agent': 'Mozilla/5.0'}).text)

#printing the scraped data to screen
print(etf_df)

# Output the read data into dataframes
frame = {}
for i in range(len(etf_df)):
    frame[i] = pd.DataFrame(etf_df[i])
    print(frame[i])
I have several issues.
1. The tables only contain 20 entries, while each table on the website should have 2166 entries. How do I amend the code to pull all the values?
2. Some of the dataframes could not be properly assigned after scraping the site. For example, the output for frame[0] is not in dataframe format and nothing shows for frame[0] when viewing it as a DataFrame in the Python console, yet it looks fine when printed to the screen. Would it be better to parse the HTML using BeautifulSoup instead?
As noted by Alex, the website requests the data from http://www.etf.com/etf-finder-funds-api//-aum/0/3000/1, which checks the Referer header to see if you're allowed to see it.
However, Alex is wrong in saying that you're unable to change the header.
It is in fact very easy to send custom headers using requests:
>>> r = requests.get('http://www.etf.com/etf-finder-funds-api//-aum/0/3000/1', headers={'Referer': 'http://www.etf.com/etfanalytics/etf-finder'})
>>> data = r.json()
>>> len(data)
2166
At this point, data holds all the records you need; pandas probably has a simple way of loading them into a dataframe.
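For example, something like this should work, assuming the JSON comes back as a flat list of per-fund records (a sketch, not verified against the live API):

import pandas as pd
import requests

r = requests.get(
    'http://www.etf.com/etf-finder-funds-api//-aum/0/3000/1',
    headers={'Referer': 'http://www.etf.com/etfanalytics/etf-finder'},
)
# json_normalize flattens any nested fields into columns; assumes the
# response is a list of records
df = pd.json_normalize(r.json())
print(df.shape)  # expect 2166 rows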
You get only 20 rows of the table because only 20 rows are present in the HTML page by default; view the source code of the page you are trying to parse. One possible solution would be to iterate through the pagination to the end, but the pagination there is implemented in JS and is not reflected in the URL, so I don't see how you can access the next pages of the table directly.
Looks like there is a request to
http://www.etf.com/etf-finder-funds-api//-aum/100/100/1
on that page when I try to load the 2nd group of 100 rows. But getting access to that URL might be very tricky, if possible at all. Maybe for this particular site you should use something like WebBrowser in C# (I don't know the Python equivalent, but I'm sure Python can do everything here). You would be able to imitate a browser and execute JavaScript.
Edit: I've tried to run the following JS code in the console on the page you provided:
jQuery.ajax({
    url: "http://www.etf.com/etf-finder-funds-api//-aum/0/3000/1",
    success: function(data) {
        console.log(JSON.parse(data));
    }
});
It logged an array of all 2166 objects representing the table rows you are looking for. Try it yourself to see the result. It looks like in the request URL, "0" is a start index and "3000" is a limit.
But if you try this from some other domain, you will get 403 Forbidden. This is because they have a Referer header check.
Edit again: as mentioned by @stranac, it is easy to set that header. Just set it to http://www.etf.com/etfanalytics/etf-finder and enjoy.
My question is more on the "concept" side, as I don't have any code to show yet. I've got access to an API Explorer for a website, but the information retrieved when I put a specific URL into the API Explorer is not the same as the HTML I'd get if I opened a page with the same URL and "inspected" the elements. I'm honestly lost on how to retrieve the data I need, as it is only present in the API Explorer and can't be reached via web scraping.
Here is an example to show you what I mean:
API Explorer link: https://platform.worldcat.org/api-explorer/apis/worldcatidentities/identity/Read,
and the specific url to request is: http://www.worldcat.org/identities/lccn-n80126307/
If I put in the URL (http://www.worldcat.org/identities/lccn-n80126307/) myself and "inspect element", the information shown does not include all the data that the API Explorer returns.
For example, the language count, audLevel, oclcnum and many others are missing from the HTML version but present in the API Explorer, and with other authors, the genres count exists only in the API Explorer.
I realize that one is XML and the other HTML, so is that why the data differs between the two versions? And whatever the reason, what can I do to retrieve the data present only in the API Explorer (such as genres count, audLevel, oclcnum, etc.)?
Any insight would be really helpful.
It's not unusual for sites not to show all the data that's in the underlying JSON/XML. Those often hold interesting content that isn't displayed anywhere on the site.
In this case the server gives you what you ask for. If you're going for the data using Python, all you really have to do is specify what you're after in your headers. If you don't do that on this site, you get the HTML.
If you do it like this, you'll get the XML data you're interested in:
import requests
import xml.dom.minidom

url = 'https://www.worldcat.org/identities/lccn-n80126307/'
# asking for a non-HTML content type makes the server return the raw XML
r = requests.get(url, headers={'Accept': 'application/json'})

# a couple of lines for printing the xml pretty
dom = xml.dom.minidom.parseString(r.text)
print(dom.toprettyxml())
Then all you have to do is extract the content, you're after. That can be done in many ways. Let me know if this helps you.
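For instance, once you have the XML you could pull out specific fields with ElementTree; the element names below are guesses based on the fields mentioned in the question, so check them against the pretty-printed output:

import requests
import xml.etree.ElementTree as ET

url = 'https://www.worldcat.org/identities/lccn-n80126307/'
r = requests.get(url, headers={'Accept': 'application/json'})

root = ET.fromstring(r.text)
# 'audLevel' and 'oclcnum' are guessed element names taken from the
# question; adjust after inspecting the real XML structure
for elem in root.iter():
    if elem.tag in ('audLevel', 'oclcnum'):
        print(elem.tag, elem.text)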
I'm using requests to access this webpage and subsequently parsing and inspecting the HTML with Beautiful Soup.
This page allows a user to specify the number of days in the past for which results should be returned, via a form on the page.
When I submit the request in the browser with my selection of 365 days and examine the response, I find that form data was sent with the request. Of note is the form datum "dnf_class_values[procurement_notice][_posted_date]: 365", as this is the only element that corresponds with my selection of 365 days.
When this request is returned in the browser, I get n results, where n is the maximum possible count, since 365 days is the largest period available. n is visible in the markup as <span class="lst-cnt">.
I can't seem to duplicate the sending of that form data with requests. Here is the relevant portion of my code:
import requests
from bs4 import BeautifulSoup as bs

formData = {'dnf_class_values[procurement_notice][_posted_date]': '365'}
r = requests.post("https://www.fbo.gov/index?s=opportunity&mode=list&tab=list&tabmode=list&pp=20&pageID=1", data=formData)
s = bs(r.content, 'html.parser')  # specify a parser explicitly
print(s.find('span', {'class': 'lst-cnt'}))
This is returning the same number of results as when the form is submitted with the default value for number of days.
I've tried URL encoding the key in data, as well as using requests.get, and specifying params as opposed to data. Additionally, I've attempted to append the form data field as a query string parameter:
url...?s=opportunity&mode=list&tab=list&tabmode=list&pp=20&pageID=1&dnf_class_values%5Bprocurement_notice%5D%5B_posted_date%5D=365
What is the appropriate syntax for that request?
You cannot send only the fields you care about; you need to send everything. Duplicate the POST request that Chrome made exactly.
Note that some of the POSTed values may be CSRF tokens. The Base64-encoded strings are particularly likely (dnf_opt_template, dnf_opt_template_dir, dnf_opt_subform_template and dnf_class_values[procurement_notice][notice_id]), and should probably be pulled out of the HTML for the original page using BeautifulSoup. The rest can be hardcoded.
Otherwise, your original syntax was correct.
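A sketch of that approach: fetch the form page first, collect every input (hidden tokens included), then override just the posted-date field. Only the date field's name comes from the question; the rest is an assumption about the page's form markup:

import requests
from bs4 import BeautifulSoup as bs

url = ("https://www.fbo.gov/index?s=opportunity&mode=list"
       "&tab=list&tabmode=list&pp=20&pageID=1")

session = requests.Session()
page = bs(session.get(url).content, "html.parser")

# Start from whatever the form already carries, CSRF-style tokens included
form_data = {
    inp["name"]: inp.get("value", "")
    for inp in page.select("form input[name]")
}
form_data["dnf_class_values[procurement_notice][_posted_date]"] = "365"

r = session.post(url, data=form_data)
s = bs(r.content, "html.parser")
print(s.find("span", {"class": "lst-cnt"}))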