I'm trying to get past the disclaimer page at http://casesearch.courts.state.md.us/casesearch/ by mimicking the POST request sent when you click "I Agree". I've been able to do it with PHP, but for some reason I can't recreate the request in Python: the response is always the page with the disclaimer, despite sending the correct POST data. The title of the page I want to land on is "Maryland Judiciary Case Search Criteria". I'm using Python 3.
url = "http://casesearch.courts.state.md.us/casesearch/"
postdata = parse.urlencode({'disclaimer':'Y','action':'Continue'})
postdata = postdata.encode("utf-8")
header = {"Content-Type":"application/x-www-form-urlencoded"}
req = request.Request(url,data=postdata,headers=header)
res = request.urlopen(req)
print(res.read())
It looks like the URL you want to post to is actually http://casesearch.courts.state.md.us/casesearch/processDisclaimer.jis. Try this:
url = "http://casesearch.courts.state.md.us/casesearch/processDisclaimer.jis"
postdata = parse.urlencode({'disclaimer':'Y','action':'Continue'})
postdata = postdata.encode("utf-8")
header = {"Content-Type":"application/x-www-form-urlencoded"}
req = request.Request(url,data=postdata,headers=header)
res = request.urlopen(req)
print(res.read())
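Note that the site may also set a session cookie when you accept the disclaimer, and plain urlopen calls don't persist cookies between requests. A minimal sketch using requests.Session (assuming the requests library is available), which keeps cookies across the POST and the follow-up GET:

import requests

disclaimer_url = "http://casesearch.courts.state.md.us/casesearch/processDisclaimer.jis"
search_url = "http://casesearch.courts.state.md.us/casesearch/"

with requests.Session() as s:
    # Accept the disclaimer; the session stores any cookies the server sets
    s.post(disclaimer_url, data={'disclaimer': 'Y', 'action': 'Continue'})
    # The same session reuses those cookies on the next request
    res = s.get(search_url)
    print("Maryland Judiciary Case Search Criteria" in res.text)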
This is what I have so far. I'm very new to this so point out if I'm doing anything wrong.
import requests

url = "https://9anime.to/user/watchlist"
payload = {
    "username": "blahblahblah",
    "password": "secretstringy"
    # anything else?
}

with requests.Session() as s:
    res = s.post(url, data=payload)
    print(res.status_code)
You will need to inspect the form and see where it posts to via its action attribute. In this case it posts to user/ajax/login. Instead of requesting the watchlist URL, you should post the login details to the login URL; once you are logged in you can request your watchlist.
from lxml.html import fromstring
import requests

url = "https://9anime.to/user/watchlist"
loginurl = "https://9anime.to/user/ajax/login"
payload = {
    "username": "someemail@gmail.com",
    "password": "somepass"
}

with requests.Session() as s:
    res = s.post(loginurl, data=payload)
    print(res.content)
    # b'{"success":true,"message":"Login successful"}'
    res = s.get(url)
    tree = fromstring(res.content)
    elem = tree.cssselect("div.watchlist div.widget-body")[0]
    print(elem.text_content())
    # Your watch list is empty.
You would need some knowledge (documentation of some form) of what that URL is expecting and how you are expected to interact with it. There is no way to know given only the information you have provided.
If you already have some system that can interact with that URL (e.g. you're able to log in with your browser), then you could try to reverse-engineer what your browser is doing...
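For example, if you can perform the action in your browser, the Network tab of the developer tools shows the exact request it sends, and you can replicate it with requests. A rough sketch; the URL, headers, and form fields below are placeholders you would copy from the captured request:

import requests

# Hypothetical values: replace with the URL, headers, and form fields
# captured from your browser's Network tab
url = "https://example.com/some/endpoint"
headers = {
    "User-Agent": "Mozilla/5.0",
    "Referer": "https://example.com/form-page",
}
payload = {"field1": "value1", "field2": "value2"}

res = requests.post(url, headers=headers, data=payload)
print(res.status_code)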
I have been reading a lot about how to submit a form with Python and then read and scrape the resulting page. However, I haven't managed to do it with the specific form I am filling in; my code returns the HTML of the form page. Here is my code:
import requests

values = {}
values['archive'] = "1"
values['descripteur[]'] = ["mc82", "mc84"]
values['typeavis[]'] = ["10", "6", "7", "8", "9"]
values['dateparutionmin'] = "01/01/2015"
values['dateparutionmax'] = "31/12/2015"

req = requests.post('https://www.boamp.fr/avis/archives', data=values)
print(req.text)
Any suggestions appreciated. req.text is just the HTML of the search form page again.
You may be posting the data to the wrong page. I opened the URL and submitted the form, and found that the post data is actually sent to https://www.boamp.fr/avis/liste (a tool like Fiddler is sometimes useful for figuring this out).
So your code should be:
req = requests.post('https://www.boamp.fr/avis/liste', data=values)
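Putting it together, a sketch of the full script with the corrected URL (the payload is unchanged from the question):

import requests

values = {
    'archive': "1",
    'descripteur[]': ["mc82", "mc84"],
    'typeavis[]': ["10", "6", "7", "8", "9"],
    'dateparutionmin': "01/01/2015",
    'dateparutionmax': "31/12/2015",
}

req = requests.post('https://www.boamp.fr/avis/liste', data=values)
print(req.status_code)
print(req.text[:500])  # peek at the start of the results page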
I'm trying to write code to collect resumes from the "indeed.com" website.
In order to download resumes from "indeed.com" you have to log in with your account.
The problem is that after posting the data I get response [200], which indicates a successful POST, but the login still fails.
Here is my code:
import requests
from bs4 import BeautifulSoup
from lxml import html

page = requests.get('https://secure.indeed.com/account/login')
soup = BeautifulSoup(page.content, 'html.parser')
row_text = soup.text

# Pull the tokens out of the page source
surftok = str(row_text[row_text.find('"surftok":') + 11:row_text.find('","tmpl":')])
formtok = str(row_text[row_text.find('"tk":') + 6:row_text.find('","variation":')])
logintok = str(row_text[row_text.find('"loginTk":') + 11:row_text.find('","debugBarLink":')])
cfb = int(str(row_text[row_text.find('"cfb":') + 6:row_text.find(',"pvr":')]))
pvr = int(str(row_text[row_text.find('"pvr":') + 6:row_text.find(',"obo":')]))
hl = str(row_text[row_text.find('"hl":') + 6:row_text.find('","co":')])

data = {
    'action': 'login',
    '__email': 'myEmail',
    '__password': 'myPassword',
    'remember': '1',
    'hl': hl,
    'cfb': cfb,
    'pvr': pvr,
    'form_tk': formtok,
    'surftok': surftok,
    'login_tk': logintok
}

response = requests.post("https://secure.indeed.com/", data=data)
print(response)
print('myEmail' in response.text)
It shows response [200], but when I search for my email in the response page to make sure the login succeeded, I don't find it. It seems the login failed for a reason I don't know.
Send headers as well in your POST request; copy them from the request headers your browser sends (visible in the developer tools).
headers = {'user-agent': 'Chrome'}
response = requests.post("https://secure.indeed.com/", headers=headers, data=data)
Some websites use JavaScript redirection, and "indeed.com" is one of them. Unfortunately, Python's requests library does not execute JavaScript, so it cannot follow such redirects. In these situations we may use Selenium.
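A minimal Selenium sketch (assuming Selenium and a Chrome driver are installed; the find_element_by_name style matches older Selenium releases, and the field names are taken from the question's form data, so they may differ on the live page):

from selenium import webdriver

driver = webdriver.Chrome()
driver.get('https://secure.indeed.com/account/login')

# Field names are assumptions based on the question's form data
driver.find_element_by_name('__email').send_keys('myEmail')
driver.find_element_by_name('__password').send_keys('myPassword')
driver.find_element_by_name('__password').submit()

# After the JavaScript redirect completes, the logged-in page is reachable
print(driver.current_url)
driver.quit()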
I'm trying to scrape data from the http://portal.uspto.gov/EmployeeSearch/ website.
I open the site in a browser, click the Search button inside the "Search by Organisation" part of the site, and look at the request being sent to the server.
When I post the same request using the Python requests library, I don't get the result page I expect; I get the same search page, with no employee data on it.
I've tried all variants, and nothing seems to work.
My question is: what URL should I use in my request, and do I need to specify headers (tried that too, copying the headers viewed in Firefox developer tools) or something else?
Below is the code that sends the request:
import requests
from bs4 import BeautifulSoup

def scrape_employees():
    URL = 'http://portal.uspto.gov/EmployeeSearch/searchEm.do;jsessionid=98BC24BA630AA0AEB87F8109E2F95638.prod_portaljboss4_jvm1?action=displayResultPageByOrgShortNm&currentPage=1'
    response = requests.post(URL)
    site_data = response.content
    soup = BeautifulSoup(site_data, "html.parser")
    print(soup.prettify())

if __name__ == '__main__':
    scrape_employees()
All the data you need is in the form tag:
the action attribute is the URL you POST to;
the input elements are the data you need to post, as {name: value} pairs.
import requests, bs4, urllib.parse, re

def make_soup(url):
    r = requests.get(url)
    soup = bs4.BeautifulSoup(r.text, 'lxml')
    return soup

def get_form(soup):
    form = soup.find(name='form', action=re.compile(r'OrgShortNm'))
    return form

def get_action(form, base_url):
    action = form['action']
    # action is a relative url, convert it to an absolute url
    abs_action = urllib.parse.urljoin(base_url, action)
    return abs_action

def get_form_data(form, org_code):
    data = {}
    for inp in form('input'):
        # if the value is empty, put the org_code in this field
        data[inp['name']] = inp['value'] or org_code
    return data

if __name__ == '__main__':
    url = 'http://portal.uspto.gov/EmployeeSearch/'
    soup = make_soup(url)
    form = get_form(soup)
    action = get_action(form, url)
    data = get_form_data(form, '1634')
    # make request to the action using data
    r = requests.post(action, data=data)
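To inspect what came back, you can parse the response the same way the question does (a sketch continuing from the script above; the page structure may of course change):

    # continuing from the script above
    print(r.status_code)
    soup = bs4.BeautifulSoup(r.text, 'lxml')
    print(soup.prettify()[:1000])  # peek at the start of the result page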
I have seen questions like this asked many, many times, but none of the answers have helped.
I'm trying to submit data to a form on the web. I've tried requests and urllib and neither has worked.
For example, here is code that should search for the [python] tag on SO:
import urllib
import urllib2

url = 'http://stackoverflow.com/'

# Prepare the data
values = {'q': '[python]'}
data = urllib.urlencode(values)

# Send HTTP POST request
req = urllib2.Request(url, data)
response = urllib2.urlopen(req)
html = response.read()

# Print the result
print(html)
Yet when I run it, I get the HTML source of the home page.
Here is an example using requests:
import requests

data = {
    'q': '[python]'
}
r = requests.get('http://stackoverflow.com', data=data)
print(r.text)
Same result! I don't understand why these methods aren't working; I've tried them on various sites with no success. If anyone has successfully done this, please show me how!
Thanks so much!
If you want to pass q as a parameter in the URL using requests, use the params argument, not data (see Passing Parameters In URLs):
r = requests.get('http://stackoverflow.com', params=data)
This will request https://stackoverflow.com/?q=%5Bpython%5D, which isn't what you are looking for.
You really want to POST to a form. Try this:
r = requests.post('https://stackoverflow.com/search', data=data)
This is essentially the same as GET-ting https://stackoverflow.com/questions/tagged/python, but I think you'll get the idea from this.
import urllib
import urllib2

url = 'http://www.someserver.com/cgi-bin/register.cgi'
values = {'name': 'Michael Foord',
          'location': 'Northampton',
          'language': 'Python'}

data = urllib.urlencode(values)
req = urllib2.Request(url, data)
response = urllib2.urlopen(req)
the_page = response.read()
This makes a POST request with the data specified in values. We need urllib to encode the form data and then urllib2 to send the request.
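In Python 3, urllib and urllib2 were merged into urllib.request and urllib.parse, so the same request looks like this (a sketch of the equivalent code):

from urllib import request, parse

url = 'http://www.someserver.com/cgi-bin/register.cgi'
values = {'name': 'Michael Foord',
          'location': 'Northampton',
          'language': 'Python'}

# urlencode builds the form body; it must be encoded to bytes for a POST
data = parse.urlencode(values).encode('utf-8')
req = request.Request(url, data)
response = request.urlopen(req)
the_page = response.read()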
Python's mechanize library is also great for this, and it even lets you submit forms. You can use the following code to create a browser object and make requests.
import mechanize, re

br = mechanize.Browser()
br.set_handle_robots(False)   # ignore robots
br.set_handle_refresh(False)  # can sometimes hang without this
br.addheaders = [('User-agent', 'Firefox')]

br.open("http://google.com")
br.select_form('f')
br.form['q'] = 'foo'
br.submit()

resp = None
for link in br.links():
    siteMatch = re.compile('www.foofighters.com').search(link.url)
    if siteMatch:
        resp = br.follow_link(link)
        break

content = resp.get_data()
print(content)