IMDB allows you to create a watchlist, which can be easily exported in CSV format. I would like to be able to do this programmatically using Python.
The problem I am facing is that I obviously cannot access it without logging in. If I try to access it directly I get a 404 response. So I figure I will need to log in first and fetch the data afterwards.
Looking at the HTML code, I find that at least one of the login forms has these inputs:
<input type="hidden" name="49e6c" value="898d" />
<input id="usernameprompt" type="text" size="20" name="login" value="" >
<input id="passwordprompt" type="password" size="20" name="password">
<input type="submit" class="linkasbutton-primary" value="Login!">
The values in the first input have not changed yet, so I figure that is not yet an issue.
The location of the form is at https://secure.imdb.com/register-imdb/login?ref_=nv_usr_lgin_3, designated IMDBLOGIN in the code.
Now I would like to use this information to log in, using the name of each input as key and value as value:
form = OrderedDict([("49e6c", "898d"), ("login", username), ("password", password), ("submit", "submit")])
url = urlsplit(IMDBLOGIN)
try:
    conn = httplib.HTTPSConnection(url.netloc)
    request = url.path + "?" + url.query + "&" + urlencode(form)
    conn.putrequest("POST", request)
    conn.putheader("Content-Type", "application/x-www-form-urlencoded")
    conn.endheaders()
    loggedin = conn.getresponse()
    logger.debug("Log in first %s %s %s", loggedin.status, loggedin.reason, loggedin.getheaders())
except:
    logger.exception("Can't log in via HTTPS")
finally:
    conn.close()
The problem is that I am unsure what to do with the submit input. The result I am now getting is 400 (Bad request).
Furthermore I don't know if I am on the right track anyway. Any considerations are welcome!
Your best bet is probably to use an inspector, like Chrome's "F12" developer tools, to take a look at the request that IMDB sends to itself in response to the user filling out the login form. When I did this, I noticed similar form values to the ones you had, though there are also cookies and other information that IMDB may be relying on to allow the authentication to complete. This is, of course, a notoriously fragile type of code.
If this is just for your own personal use, you could also consider simply signing in to IMDB from your browser, then finding the cookies that are set in your browser session and using them in your requests. This is the technique used by IMDbPY, which you might consider looking at.
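As a rough illustration of the cookie approach, here is a minimal Python 3 sketch that attaches cookies copied from a browser session to a request. The URL, the cookie names and the cookie values are all placeholders I made up; the real ones have to be read from your own browser's developer tools.

```python
from urllib.request import Request


def request_with_browser_cookies(url, cookies):
    """Build a GET request carrying cookies copied from a browser session."""
    # A Cookie header is a "; "-separated list of name=value pairs.
    cookie_header = "; ".join(
        "%s=%s" % (name, value) for name, value in cookies.items()
    )
    return Request(url, headers={"Cookie": cookie_header})


# Hypothetical cookie names and values, for illustration only.
browser_cookies = {"session-id": "000-0000000-0000000", "id": "EXAMPLE"}
req = request_with_browser_cookies("https://example.com/watchlist.csv", browser_cookies)

# urllib.request.urlopen(req) would then send the request with those cookies.
```

Cookies copied this way stop working when the browser session expires, so this is only practical for personal scripts.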
It turns out that it is not actually that difficult. To make things easier I have switched from httplib to urllib2.
IMDBLOGIN = "https://secure.imdb.com/register/login?ref_=nv_usr_lgin_3"
form = OrderedDict([("49e6c", "3478"), ("login", self.username), ("password", self.password)])
cj = CookieJar()
opener = build_opener(HTTPHandler(), HTTPSHandler(), HTTPErrorProcessor(),
                      HTTPRedirectHandler(), HTTPCookieProcessor(cj))
params = urlencode(form)
response = opener.open(IMDBLOGIN, params)
# cookies automatically sent
response2 = opener.open(csv_url)
content = response2.read()
Basically, I do not need to add the submit input, and so far I have not been able to figure out what purpose the first input has, because I seem to be able to fill in any value. (Though not extensively tested)
After logging in I make sure that I keep the cookie for the next request, and retrieve the file.
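For anyone on Python 3, where urllib2 became urllib.request and cookielib became http.cookiejar, the same idea might look roughly like the sketch below. The login URL, the credentials and the helper names (make_session, build_login_request) are placeholders of mine, not IMDb's real endpoints. The key difference from the code in the question is that the form fields go in the request body rather than in the query string.

```python
from http.cookiejar import CookieJar
from urllib.parse import urlencode
from urllib.request import HTTPCookieProcessor, Request, build_opener


def make_session():
    """Return an opener that stores and resends cookies, plus its cookie jar."""
    cookie_jar = CookieJar()
    opener = build_opener(HTTPCookieProcessor(cookie_jar))
    return opener, cookie_jar


def build_login_request(login_url, username, password):
    """Build a POST request with the form fields in the body, not the URL."""
    form = [("login", username), ("password", password)]
    body = urlencode(form).encode("ascii")
    # Passing data makes urllib send a POST with a form-encoded body.
    return Request(login_url, data=body)


opener, cookie_jar = make_session()
# Placeholder URL and credentials.
login_request = build_login_request("https://example.com/login", "alice", "s3cret")

# opener.open(login_request) would log in and store the session cookies;
# a later opener.open(csv_url) would send them back automatically.
```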
I'm having a hard time understanding why I can't load the "/ticket" page properly, while "/" loads fine. The /ticket page should be running and let me type in a username and email, but I'm missing something.
I'm running Python 3.10.4
"localhost:8080/"
"localhost:8080/ticket"
import os
import bottle
import base64
import traceback
from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes
from cryptography.hazmat.backends import default_backend
from cryptography.hazmat.primitives import padding
from binascii import hexlify
from html import escape
from bottle import route, request, get, post, response
page = '''
<html>
<head><title>Welcome</title></head>
<body>
<p><h2>Welcome, %s!</h2></p>
<p><h3>There are %s tickets left</h3></p>
<p>%s</p>
</body>
</html>
'''
key = os.urandom(32)
iv = os.urandom(16)
cipher = Cipher(algorithms.AES(key), modes.CBC(iv), backend=default_backend())
def get_ticket(username, email, amount):
    encryptor = cipher.encryptor()
    padder = padding.PKCS7(128).padder()
    message = padder.update("%s&%s&%d" % (username, email, amount)) + padder.finalize()
    ct = encryptor.update(message) + encryptor.finalize()
    return base64.b64encode(ct)

@bottle.route('/', method=('GET', 'POST'))
def default():
    decryptor = cipher.decryptor()
    unpadder = padding.PKCS7(128).unpadder()
    try:
        ticket = bottle.request.get_cookie('ticket')
        if ticket:
            message = decryptor.update(base64.b64decode(ticket)) + decryptor.finalize()
            username, email, amount = (unpadder.update(message) + unpadder.finalize()).split('&')
            return page % (escape(username), escape(amount),
                           'Already done' if amount == '1' else
                           'Go ahead')
        else:
            return '<html><body><p><a href=/ticket>Fill in the forms</a></p></body><html>\n'
    except Exception as e:
        print(traceback.format_exc())
        raise bottle.HTTPError(status=401, body=e.message)

@bottle.post('/ticket', ['GET', 'POST'])
def ticket():
    username = bottle.request.forms.get('username')
    email = bottle.request.forms.get('email')
    if username and email:
        ticket = get_ticket(username, email, 1)
        bottle.response.set_cookie('ticket', ticket)
        return '<html><body><p>Thank you!</p></body></html>\n'
    raise bottle.HTTPError(status=400, body='Please fill in the form')

bottle.run(host='localhost', port=8080)
The reason this code hits an error is that you never actually serve a form for the user to fill in (and thus your code correctly errors). I don't really understand your flow at all, but I think it's supposed to look something like this:
user navigates to /
if user doesn't have a ticket, user is sent to a login page
user logs in and gets a ticket
user goes back to / and requests it, this time with a ticket (authentication cookie) in the get request.
Mind you I don't understand what these tickets are, as they appear to be an encrypted form of whatever the user submits and the constant 1, but I guess there's more logic coming along later.
Currently there are quite a few problems with the setup:
/ accepts post requests, but never does anything with them. It also doesn't make sense for it to accept post requests, since POST is to send data to a server.
your /ticket endpoint would do what you want if you sent it a form, but you never serve a form for the user to fill in.
your /ticket endpoint needs to return one thing if you make a GET request to it, another completely different thing if you make a POST request. This is possible, but it's a bad way to build these endpoints in general.
after getting the cookie the browser is simply going to throw it away. If you want to do cookie-based authentication, you need to store the cookie and handle it (normally you use this with JS in the webbrowser, and take care to add it to all requests).
If we add a login endpoint:
@bottle.route("/login")
def login():
    return """<html><body>
    <form action="./ticket" method="POST">
    Username: <input type="text" name="username"><br>
    Email: <input type="text" name="email"><br>
    <input type="submit">
    </form>
    </body></html>"""
And edit the previous href to point us to /login, we now get a form. Filling that form in and pressing 'submit' correctly sends the data to the /ticket endpoint, calling the code to get the ticket. Unfortunately that fails for unrelated reasons (the padding and encryption steps need bytes, not strings), but that's a different question.
I suggest you have a think about the overall flow of this application---what are these tickets for, and where is the user supposed to end up? Regardless you will need to serve some kind of login form at some point. You probably want this to be a static file, by the way, and served as such---see the bottle docs.
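On the bytes-versus-strings point mentioned above, here is a minimal sketch of the conversions involved, using only the standard library. It deliberately leaves out the padding and AES steps from the question (so the "ticket" here is merely base64-encoded, not protected in any way). The function names encode_ticket and decode_ticket are my own.

```python
import base64


def encode_ticket(username, email, amount):
    """Turn the ticket fields into a text-safe string."""
    # Format as text first, then encode to bytes. In the question's code,
    # the padder and encryptor would operate on these bytes.
    plaintext = ("%s&%s&%d" % (username, email, amount)).encode("utf-8")
    # b64encode returns bytes; decode to str so it can be used as a cookie value.
    return base64.b64encode(plaintext).decode("ascii")


def decode_ticket(ticket):
    """Reverse encode_ticket: bytes back to text, then split the fields."""
    plaintext = base64.b64decode(ticket)
    username, email, amount = plaintext.decode("utf-8").split("&")
    return username, email, int(amount)


ticket = encode_ticket("bob", "bob@example.com", 1)
print(decode_ticket(ticket))
```

The same pattern (encode before padding and encrypting, decode after decrypting and unpadding) is what the original get_ticket and default functions would need under Python 3.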
I'm having an issue while trying to scrape the HTML from the page the user is currently on. Essentially the user is building a list of exercises to create a workout routine. The user picks from a Select field, and each time they click the "add" button, it populates a list of what they have chosen so far. Then I will grab the text from that list and match it to what I have in my database.
My issue is coming up in requests.get(url_for('createARoutine')).
requests.exceptions.MissingSchema: Invalid URL '/createaroutine': No schema supplied. Perhaps you meant http:///createaroutine?
When testing it with the direct URL "http://127.0.0.1:5000/createaroutine", my error changes to werkzeug.routing.BuildError: Could not build url for endpoint 'http://127.0.0.1:5000/createaroutine'. Did you mean 'createARoutine' instead?
@app.route("/createaroutine", methods=["GET", "POST"])
def createARoutine():
    """
    present form to creatine new routine, each time user clicks to
    add an exercise, show the exercise to the side"""
    form = CreateRoutineForm()
    query = Exercises.query.all()
    choices = [(c.id, c.name) for c in query]
    form.exercises.choices = choices
    # collect all exercises and add to routine,
    # also add routine to users favorites
    if request.method == 'POST':
        this_html = requests.get(url_for('createARoutine'))  # <---- ERROR
        soup = BeautifulSoup(this_html, 'html.parser')
        p = soup.find_all("li", {"id": "exerciseChoices"})
        print(p)
        return redirect(url_for('showWorkoutRoutines'))
    return render_template("createRoutine.html", form=form)
This will not work, as requests needs a fully qualified URL, including the scheme and the server where you are running.
On the other hand, I'm struggling to understand why you are doing what you are doing! You are calling your own site with a GET, instead of accessing the data from where you are. This is a terrible idea. You have all the data you need in that exact function. If you need the HTML, it's in the template. You should never, ever do what you are doing here.
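Purely to illustrate the first point (what "fully qualified" means, not as a recommendation to call your own site), here is a small standard-library sketch. The base address is the development server address from the question, and is_absolute is a helper name I made up. As far as I know, Flask's url_for also accepts _external=True to produce an absolute URL.

```python
from urllib.parse import urljoin, urlsplit


def is_absolute(url):
    """A fully qualified URL has both a scheme (http) and a host."""
    parts = urlsplit(url)
    return bool(parts.scheme and parts.netloc)


relative = "/createaroutine"  # what url_for returns by default
absolute = urljoin("http://127.0.0.1:5000", relative)

print(is_absolute(relative))  # False: this is what triggers MissingSchema
print(absolute)               # http://127.0.0.1:5000/createaroutine
```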
I am trying to enter my username and password and log in to a website. I am a beginner and am trying this for the first time. I don't know if I have to include any other data here.
The sample website that I am trying is http://testing-ground.scraping.pro/login. I am passing my credentials and checking whether the contents of the welcome page appear after logging in by printing page.content. But it displays the "access denied" content (which appears when you enter the wrong credentials). I don't know where I am going wrong here.
import requests
from requests_html import HTMLSession  # HTMLSession comes from requests_html, not requests

with HTMLSession() as c:
    url='http://testing-ground.scraping.pro/login?mode=login'
    usr='admin'
    pwd='12345'
    c.get(url)
    login_data=dict(username=usr,password=pwd)
    c.post(url,data=login_data)
    page=c.get('http://testing-ground.scraping.pro/login?mode=welcome')
    print(page.content)
I have not tested this but it seems like you used the wrong names for the username and password parameters in the request. A quick inspection of the request sent by the site shows this:
As you can see, in the form data the username is sent as usr and the password is sent as pwd. However, when you built the dict for the login data, you used login_data=dict(username=usr,password=pwd), which constructs a dict of {"username": usr, "password": pwd}, which does not match the requirements of the actual request. What you want instead is dict(usr=usr,pwd=pwd).
Have a look in the <form> tag in page source. Or you can check the Network tab for post request for the name value of each field. The correct field names are usr and pwd.
So basically change this line of code:
login_data=dict(username=usr,password=pwd)
to
login_data=dict(usr=usr,pwd=pwd)
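If you would rather find the field names programmatically than by reading the page source, something like this standard-library sketch could work. The sample HTML is a made-up approximation of a login form, not the real page's markup, and FormFieldCollector is a name I chose.

```python
from html.parser import HTMLParser


class FormFieldCollector(HTMLParser):
    """Collect the name attribute of every <input> that has one."""

    def __init__(self):
        super().__init__()
        self.field_names = []

    def handle_starttag(self, tag, attrs):
        if tag == "input":
            name = dict(attrs).get("name")
            if name:
                self.field_names.append(name)


# Hypothetical form markup, for illustration only.
SAMPLE_FORM = """
<form action="login?mode=login" method="post">
  <input type="text" name="usr" id="usr">
  <input type="password" name="pwd" id="pwd">
  <input type="submit" value="Login">
</form>
"""

collector = FormFieldCollector()
collector.feed(SAMPLE_FORM)
print(collector.field_names)  # ['usr', 'pwd']
```

The names it prints are the keys to use in the data dict passed to post().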
I am trying to work through some code to connect to merchantos.com's rest API via Python.
With some research, I have managed to get the GET access working, using the following urllib2 code:
# NOTE: This api key has been made bogus
lcMOS_APIKey = '07203c82fab495xxxxxxxxxxxxxxxxxxxc2a499c'
# also bogus...
lcMOS_Acct = '98765'
lcBaseURL = 'https://api.merchantos.com/API/Account/' + lcMOS_Acct + '/'
# create a password manager
password_mgr = urllib2.HTTPPasswordMgrWithDefaultRealm()
password_mgr.add_password(None, lcBaseURL, lcMOS_APIKey, 'apikey')
# create "opener" (OpenerDirector instance)
handler = urllib2.HTTPBasicAuthHandler(password_mgr)
opener = urllib2.build_opener(handler)
urllib2.install_opener(opener)
# use the opener to fetch a URL
#loReturn = opener.open(lcBaseURL + lcURLEnd)
loReturn = opener.open(lcBaseURL + 'Customer.xml?firstName=Alex')
lcResponse = loReturn.read()
So, the above successfully pulls data back. I get an XML of the customer record.
Now, what I need to do is change the method so that I can do a PUT (for an update) and a POST (for a create/new).
MerchantOS requires the following for an update:
UPDATE / HTTP PUT
To update an existing record/object you do an HTTP PUT request. The put/post data should be an XML block defining the updates to the object. For example to update an Item you would PUT to API/Account/1/Item/2 with an block (1 is the account number and 2 the itemID in this example).
So, for example, I want to do a PUT to update customer ID = 2
I would provide a data reference to an XML block for the
<Customer>
..contents omitted here...
</Customer>
And I am to point it to the URL.
The problems I am facing here are:
I do not know where/how to change the method to PUT
I need to know how to attach my data block and post it
So, can someone please show me how to adapt the above code for a GET to make a PUT .. as well as a POST (for creating a new record)
Thanks, in advance, for any assistance in this regard.
Scott.
You might try cURL instead of urllib. cURL is extremely flexible and addresses your needs:
http://pycurl.sourceforge.net/
Here are two of the options you can set with cURL:
CURLOPT_POST: A parameter set to 1 tells the library to do a regular HTTP post...
CURLOPT_POSTFIELDS: The full data to post in a HTTP POST operation...
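If you would rather stay in the standard library, Python 3's urllib.request.Request accepts a method argument, so a PUT or POST with an XML body can be sketched as below. (In Python 2's urllib2, a common workaround was to override the request's get_method.) The URL shapes, the XML content and the Content-Type header are my assumptions based on the quoted documentation, and build_api_request is a hypothetical helper. The API key as username with 'apikey' as password mirrors the question's code.

```python
import base64
from urllib.request import Request


def build_api_request(url, api_key, xml_body=None, method="GET"):
    """Build a request with HTTP Basic auth and an optional XML body."""
    # Basic auth: base64("username:password"); the question uses the API key
    # as the username and the literal string 'apikey' as the password.
    credentials = ("%s:%s" % (api_key, "apikey")).encode("ascii")
    headers = {"Authorization": "Basic " + base64.b64encode(credentials).decode("ascii")}
    data = None
    if xml_body is not None:
        data = xml_body.encode("utf-8")
        headers["Content-Type"] = "application/xml"  # assumed content type
    return Request(url, data=data, headers=headers, method=method)


base_url = "https://api.merchantos.com/API/Account/98765/"  # bogus account, as in the question
xml = "<Customer><firstName>Alex</firstName></Customer>"    # made-up example body

# Update customer 2 (PUT) and create a new customer (POST); URL shapes assumed.
put_request = build_api_request(base_url + "Customer/2", "BOGUSKEY", xml, method="PUT")
post_request = build_api_request(base_url + "Customer", "BOGUSKEY", xml, method="POST")

# urllib.request.urlopen(put_request) would then send the update.
```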
I am having a rough time gathering the data from a website programmatically. I am attempting to utilize this example to log into the server, but it is not working since I think that this is the wrong type of login.
The site I am trying to access redirects to a login page when I attempt to download the data to parse the html.
This is the URL:
https://mtred.com/rewards.html
and here's the code:
# build opener with HTTPCookieProcessor
o = urllib2.build_opener( urllib2.HTTPCookieProcessor() )
urllib2.install_opener( o )
# assuming the site expects 'user' and 'pass' as query params
p = urllib.urlencode( { 'UserLogin_username': 'mylogin', 'UserLogin_password': 'mypass' } )
# perform login with params
f = o.open( 'http://www.mtred.com/user/login.html', p )
data = f.read()
f.close()
# second request should automatically pass back any
# cookies received during login... thanks to the HTTPCookieProcessor
f = o.open( 'https://www.mtred.com/rewards.html',p )
data = f.read()
print data
It kicks me to the login page again when I attempt to open rewards. I am trying to parse the rewards to do some statistics automatically, since this information isn't available via a public API.
One issue that pops out is that you're passing in the id values of the form parameters for the login, not the name parameters. E.g., in the user name form field, you are specifying UserLogin_username, but the name of that field as expected by the server is "UserLogin[username]"
<label for="UserLogin_username" class="required">
username or email <span class="required">*</span></label>
<input name="UserLogin[username]" id="UserLogin_username" type="text" /> </div>
<div class="row">
<label for="UserLogin_password" class="required">password <span class="required">*</span></label>
<input name="UserLogin[password]" id="UserLogin_password" type="password" /> </div>
Since the server isn't getting back parameters that it knows about, the behavior you're seeing is not unexpected. (Not saying that there aren't other problems here; haven't looked.)
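To make the difference concrete, here is how the form body looks when the name attributes from the HTML above are used as keys. This is Python 3 syntax (urllib.parse instead of urllib), and the credentials are the placeholders from the question.

```python
from urllib.parse import parse_qs, urlencode

# Keys are the name attributes of the inputs, not their id attributes.
login_fields = {
    "UserLogin[username]": "mylogin",
    "UserLogin[password]": "mypass",
}

body = urlencode(login_fields)
print(body)
# UserLogin%5Busername%5D=mylogin&UserLogin%5Bpassword%5D=mypass

# Decoding shows what the server would see after parsing the body.
print(parse_qs(body))
```

The square brackets are percent-encoded on the wire, which is normal; the server decodes them back to UserLogin[username] and UserLogin[password].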
You must include in your POST data the value named "YII_CSRF_TOKEN" that is included in the HTML form,
or use the "ClientForm" library.
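A rough sketch of picking up such a token without an extra library: parse the login page, copy every hidden input into the payload, then add the credentials. The sample HTML and the token value are invented for illustration, and HiddenInputCollector and build_login_payload are names of my own. The real page has to be fetched first, with the same cookie-enabled opener that is used for the login POST.

```python
from html.parser import HTMLParser


class HiddenInputCollector(HTMLParser):
    """Collect name/value pairs of all <input type="hidden"> elements."""

    def __init__(self):
        super().__init__()
        self.hidden_fields = {}

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "input" and attributes.get("type") == "hidden" and attributes.get("name"):
            self.hidden_fields[attributes["name"]] = attributes.get("value", "")


def build_login_payload(login_page_html, username, password):
    """Combine the page's hidden fields (e.g. a CSRF token) with the credentials."""
    collector = HiddenInputCollector()
    collector.feed(login_page_html)
    payload = dict(collector.hidden_fields)
    payload["UserLogin[username]"] = username
    payload["UserLogin[password]"] = password
    return payload


# Made-up login page fragment with a made-up token value.
SAMPLE_PAGE = (
    '<form method="post">'
    '<input type="hidden" name="YII_CSRF_TOKEN" value="abc123" />'
    '<input name="UserLogin[username]" id="UserLogin_username" type="text" />'
    '<input name="UserLogin[password]" id="UserLogin_password" type="password" />'
    "</form>"
)

payload = build_login_payload(SAMPLE_PAGE, "mylogin", "mypass")
print(payload)
```

The resulting dict can then be passed through urlencode and posted as the login body.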