I'm using requests to access this webpage and subsequently parsing and inspecting the HTML with Beautiful Soup.
This page allows a user to specify the number of days in the past for which results should be returned. This is accomplished via a form on the page:
When I submit the request in the browser with my selection of 365 days and examine the response, I find this form data was sent with the request:
Of note is the form datum "dnf_class_values[procurement_notice][_posted_date]: 365" as this is the only element that corresponds with my selection of 365 days.
When the response is returned in the browser, I get n results, where n is the maximum possible count, since 365 days is the largest available time period. n is visible in the markup as <span class="lst-cnt">.
I can't seem to duplicate the sending of that form data with requests. Here is the relevant portion of my code:
import requests
from bs4 import BeautifulSoup as bs
formData = {'dnf_class_values[procurement_notice][_posted_date]':'365'}
r = requests.post("https://www.fbo.gov/index?s=opportunity&mode=list&tab=list&tabmode=list&pp=20&pageID=1", data = formData)
s = bs(r.content)
s.find('span',{'class':'lst-cnt'})
This is returning the same number of results as when the form is submitted with the default value for number of days.
I've tried URL encoding the key in data, as well as using requests.get, and specifying params as opposed to data. Additionally, I've attempted to append the form data field as a query string parameter:
url...?s=opportunity&mode=list&tab=list&tabmode=list&pp=20&pageID=1&dnf_class_values%5Bprocurement_notice%5D%5B_posted_date%5D=365
What is the appropriate syntax for that request?
You cannot send only the fields you care about; you need to send everything. Duplicate the POST request that Chrome made exactly.
Note that some of the POSTed values may be CSRF tokens. The Base64-encoded strings are particularly likely (dnf_opt_template, dnf_opt_template_dir, dnf_opt_subform_template and dnf_class_values[procurement_notice][notice_id]), and should probably be pulled out of the HTML for the original page using BeautifulSoup. The rest can be hardcoded.
Otherwise, your original syntax was correct.
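A minimal sketch of that approach, assuming the tokens live in hidden <input> elements on the search page (whether a Session is needed and which hidden fields exist are assumptions; compare against the POST recorded in the browser):
import requests
from bs4 import BeautifulSoup

url = ('https://www.fbo.gov/index?s=opportunity&mode=list&tab=list'
       '&tabmode=list&pp=20&pageID=1')

session = requests.Session()

# Fetch the page once so any per-session tokens in hidden inputs can be reused.
soup = BeautifulSoup(session.get(url).content, 'html.parser')
form_data = {inp['name']: inp.get('value', '')
             for inp in soup.find_all('input', type='hidden')
             if inp.has_attr('name')}

# Overlay the value you actually want to change, plus any remaining fields
# hardcoded from the browser's recorded POST.
form_data['dnf_class_values[procurement_notice][_posted_date]'] = '365'

r = session.post(url, data=form_data)
result = BeautifulSoup(r.content, 'html.parser')
print(result.find('span', {'class': 'lst-cnt'}))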
Use case: I need to check whether JSON data from a URL has been updated by checking its created_date field, which lies in the first few lines. The page's entire JSON payload is huge, and I don't want to retrieve the whole page just to check the first few lines.
Currently, for both
x=feedparser.parse(url)
y=requests.get(url).text
#y.split("\n") etc..
the entire URL's data is retrieved and then parsed.
I want to do some sort of next(url), or read only the first 10 lines (chunks), thus not requesting the entire page's data, i.e. just scan for the 'created_date' field and exit.
What can be used to solve this? Thanks for your knowledge, and apologies for the noob question.
Example of URL -> https://www.w3schools.com/xml/plant_catalog.xml
I want to stop reading the URL's data as soon as I know the first PLANT object's LIGHT tag hasn't changed from 'Mostly Shady' (without needing to read/get the data below it).
The original poster stated that the solution below worked:
Instead of a GET request, one can try a HEAD request:
"The GET method requests a representation of the specified resource. Requests using GET should only retrieve data. The HEAD method asks for a response identical to a GET request, but without the response body."
This way, you don't need to request the entire JSON, which speeds up the server-side part and is also friendlier to the hosting server!
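A minimal sketch of that, assuming the server exposes a Last-Modified or ETag header that changes when the feed is regenerated (if it exposes neither, HEAD alone won't reveal the created_date):
import requests

url = 'https://www.w3schools.com/xml/plant_catalog.xml'  # example URL from the question

# HEAD returns only the response headers, never the body.
head = requests.head(url)
print(head.headers.get('Last-Modified'))
print(head.headers.get('ETag'))

# Compare these against the values saved from the previous check; only issue
# a full GET when they have changed.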
I want to extract domain data from https://www.whois.com/whois/. For example, to get information for the domain tinymail.com I want to use https://www.whois.com/whois/tinymail.com. If I open the page in a browser first, then soup gives credible data; otherwise no domain data is received (I guess it is something like the site putting the data in a cache). I do not want to use the Selenium approach (as it will increase the time required). I have tried inspecting the Network tab in the browser's developer tools, but saw only two requests, neither of which shows any data.
You can use requests to get the data. This retrieves data from the website in the question:
import requests

url = 'https://www.whois.com/whois/'  # append a domain, e.g. 'tinymail.com', to look up a specific record
r = requests.get(url)
if r.status_code == 200:
    # page works
    print(r.text)
else:
    print('no website')
Here is a link for more: https://docs.python-requests.org/en/latest/
Also, you can sign up for an API key to get specific data. This might be free for limited data requests.
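A generic sketch of the API route, with a placeholder endpoint, parameter names, and key (any real provider's URL and parameters will differ, so check their documentation):
import requests

API_URL = 'https://api.example.com/whois'  # placeholder endpoint, not a real service
API_KEY = 'your-api-key'                   # placeholder key

r = requests.get(API_URL, params={'domain': 'tinymail.com', 'apiKey': API_KEY})
if r.status_code == 200:
    print(r.json())
else:
    print('request failed:', r.status_code)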
I am working on some web scraping using Python and experienced some issues with extracting the table values. For example, I am interested in scraping the ETF values from http://www.etf.com/etfanalytics/etf-finder. Below is a snapshot of the tables I am trying to scrape values from.
Here is the code I am trying to use for the scraping.
#Import packages
import pandas as pd
import requests
#Get website url and get request
etf_list = "http://www.etf.com/etfanalytics/etf-finder"
etf_df = pd.read_html(requests.get(etf_list, headers={'User-agent': 'Mozilla/5.0'}).text)
#printing the scraped data to screen
print(etf_df)
# Output the read data into dataframes
for i in range(0, len(etf_df)):
    frame[i] = pd.DataFrame(etf_df[i])
    print(frame[i])
I have several issues.
The tables only consist of 20 entries while the total entries per table from the website should be 2166 entries. How do I amend the code to pull all the values?
Some of the dataframes could not be properly assigned after scraping the site. For example, the output for frame[0] is not in DataFrame format, and nothing was shown for frame[0] when trying to view it as a DataFrame in the Python console. However, it seems fine when printed to the screen. Would it be better if I parsed the HTML using BeautifulSoup instead?
As noted by Alex, the website requests the data from http://www.etf.com/etf-finder-funds-api//-aum/0/3000/1, which checks the Referer header to see if you're allowed to see it.
However, Alex is wrong in saying that you're unable to change the header.
It is in fact very easy to send custom headers using requests:
>>> r = requests.get('http://www.etf.com/etf-finder-funds-api//-aum/0/3000/1', headers={'Referer': 'http://www.etf.com/etfanalytics/etf-finder'})
>>> data = r.json()
>>> len(data)
2166
At this point, data is a dict containing all the data you need; pandas probably has a simple way of loading it into a DataFrame.
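A minimal sketch of that last step, assuming a reasonably recent pandas (where json_normalize is a top-level function) and that the decoded JSON is a list of row records:
import pandas as pd
import requests

url = 'http://www.etf.com/etf-finder-funds-api//-aum/0/3000/1'
headers = {'Referer': 'http://www.etf.com/etfanalytics/etf-finder'}

data = requests.get(url, headers=headers).json()

# Build a DataFrame from the decoded JSON; json_normalize also copes with
# nested dictionaries, not just flat records.
df = pd.json_normalize(data)
print(df.shape)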
You get only 20 rows of the table because only 20 rows are present in the HTML page by default; view the source code of the page you are trying to parse. One possible solution would be to iterate through the pagination to the end, but the pagination there is implemented with JS and is not reflected in the URL, so I don't see how you can access the next pages of the table directly.
Looks like there is a request to
http://www.etf.com/etf-finder-funds-api//-aum/100/100/1
on that page when I try to load the 2nd group of 100 rows. But getting access to that URL might be very tricky, if it is possible at all. Maybe for this particular site you should use something like WebBrowser in C# (I don't know what the equivalent would be in Python, but I'm sure Python can do it). You would be able to imitate a browser and execute JavaScript.
Edit: I've tried running the following JS code in the console on the page you provided.
jQuery.ajax({
    url: "http://www.etf.com/etf-finder-funds-api//-aum/0/3000/1",
    success: function(data) {
        console.log(JSON.parse(data));
    }
});
It logged an array of all 2166 objects representing the table rows you are looking for. Try it yourself to see the result. It looks like in the request URL "0" is a start index and "3000" is a limit.
But if you try this from some other domain you will get 403 Forbidden, because they have a Referer header check.
Edit again: as mentioned by @stranac, it is easy to set that header. Just set it to http://www.etf.com/etfanalytics/etf-finder and enjoy.
I want to open a page, find a number, multiply it by another random number, and then submit the result to the page. So what I'm doing is saving the page as HTML, finding the two numbers, multiplying them, and then sending the result as a POST, but
post = urllib.urlencode({'answer': goal, 'submit': 'Submit+Answer'})
req2 = urllib2.Request("example", None, headers)
response = urllib2.urlopen(req, post)  # this causes it not to work: it opens the page a second time
This makes it connect a second time, and thus the random number sent is wrong because the page generates a new random number. So how can I send a POST request to a page I already have open, without reopening it?
You might want to use something like mechanize, which enables stateful web browsing in Python. You could use it to load a URL, read a value from a page, perform the multiplication, place that number into a form on the page, and then submit it.
Does that sound like what you're trying to do? This page gives some information on how to fill in forms using mechanize.
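A rough sketch of that flow with mechanize (the URL, form index, and computed value here are placeholders, since the original page isn't shown):
import mechanize

br = mechanize.Browser()
br.set_handle_robots(False)                    # skip robots.txt handling if the site disallows robots
page = br.open("http://example.com/puzzle")    # placeholder URL
html = page.read()

# ... parse `html` here to find the number and compute the product ...
goal = 42  # placeholder for the computed answer

br.select_form(nr=0)        # assumes the answer form is the first form on the page
br["answer"] = str(goal)    # field name taken from the question's POST data
response = br.submit()
print(response.read())

Because the submit goes through the same Browser instance, session cookies and any hidden fields from the originally loaded form are carried along, which is usually how the server matches the answer to the question it issued.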
I don't believe urllib supports keeping the connection open, as described here.
Looks like you'll have to send a reference to the original calculation back with your post. Or send the data back at the same time as the answer, so the server has some way of matching question with answer.
I'm looking to be able to query a site for warranty information on the machine this script would be running on. It should be able to fill out a form if needed (like, say, in the case of HP's service site) and would then retrieve the resulting web page.
I already have the bits in place to parse the resulting HTML that is reported back; I'm just having trouble with what needs to be done in order to POST the data that goes into the fields and then retrieve the resulting page.
If you absolutely need to use urllib2, the basic gist is this:
import urllib
import urllib2
url = 'http://whatever.foo/form.html'
form_data = {'field1': 'value1', 'field2': 'value2'}
params = urllib.urlencode(form_data)
response = urllib2.urlopen(url, params)
data = response.read()
If you send along POST data (the 2nd argument to urlopen()), the request method is automatically set to POST.
I suggest you do yourself a favor and use mechanize, a full-blown urllib2 replacement that acts exactly like a real browser. A lot of sites use hidden fields, cookies, and redirects, none of which urllib2 handles for you by default, whereas mechanize does.
Check out Emulating a browser in Python with mechanize for a good example.
Using urllib and urllib2 together,
data = urllib.urlencode([('field1',val1), ('field2',val2)]) # list of two-element tuples
content = urllib2.urlopen('post-url', data)
Reading content (for example, content.read()) will give you the page source.
I’ve only done a little bit of this, but:
You’ve got the HTML of the form page. Extract the name attribute for each form field you need to fill in.
Create a dictionary mapping the names of each form field with the values you want submit.
Use urllib.urlencode to turn the dictionary into the body of your post request.
Include this encoded data as the second argument to urllib2.Request(), after the URL that the form should be submitted to.
The server will either return a resulting web page, or return a redirect to a resulting web page. If it does the latter, you’ll need to issue a GET request to the URL specified in the redirect response.
I hope that makes some sort of sense?
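A rough, untested sketch of those steps using urllib2 (Python 2, matching the answers above); the URL and field names are placeholders, and BeautifulSoup is just one way to do the extraction:
import urllib
import urllib2
import urlparse
from bs4 import BeautifulSoup

form_page_url = 'http://whatever.foo/form.html'  # placeholder, as above
html = urllib2.urlopen(form_page_url).read()

# Pull the form, its action URL, and its field names out of the HTML.
soup = BeautifulSoup(html, 'html.parser')
form = soup.find('form')
action_url = urlparse.urljoin(form_page_url, form.get('action', ''))
print [inp.get('name') for inp in form.find_all('input')]  # see which fields need values

# Map the field names to the values you want to submit.
form_data = {'field1': 'value1', 'field2': 'value2'}  # placeholder names/values

# Turn the dictionary into the body of the POST request.
body = urllib.urlencode(form_data)

# POST to the form's action URL; urllib2 generally follows a redirect
# response automatically (re-requesting the new URL with GET), so the
# resulting page ends up in `response`.
request = urllib2.Request(action_url, body)
response = urllib2.urlopen(request)
print response.read()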