Requests dict from cookiejar issue with escaped chars - python

I'm running into some issues getting a cookie into a dictionary with Python. The value still seems to be escaped even after running the helper provided by requests.
resp = requests.get(geturl, cookies=cookies)
cookies = requests.utils.dict_from_cookiejar(resp.cookies)
This is what cookies looks like:
{'P-fa9d887b1fe1a997d543493080644610': '"\\050dp1\\012S\'variant\'\\012p2\\012S\'corrected\'\\012p3\\012sS\'pid\'\\012p4\\012VNTA2NjU0OTU4MDc5MTgwOA\\075\\075\\012p5\\012sS\'format\'\\012p6\\012S\'m3u8\'\\012p7\\012sS\'mode\'\\012p8\\012Vlive\\012p9\\012sS\'type\'\\012p10\\012S\'video/mp2t\'\\012p11\\012s."'}
Is there any way to unescape the escaped characters in the value of P-fa9d887b1fe1a997d543493080644610 and make its contents part of the dict itself?
Edit:
I would like the dictionary to look something like:
{'format': 'm3u8', 'variant': 'corrected', 'mode': u'live', 'pid': u'NTA2NjU0OTU4MDc5MTgwOA==', 'type': 'video/mp2t'}

You are dealing with the Python pickle format for data serialisation. First evaluate the value as a string literal so that the escaped characters are unescaped, then strip the surrounding double quotes and load the pickle from the resulting string with the pickle.loads function.
>>> import pickle
>>> import ast
>>> pickle.loads(ast.literal_eval("'''" + cookies.values()[0] + "'''")[1:-1])
{'pid': u'NTA2NjU0OTU4MDc5MTgwOA==', 'type': 'video/mp2t', 'variant': 'corrected', 'mode': u'live', 'format': 'm3u8'}
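The snippet above is Python 2 (cookies.values() is indexable and pickle.loads accepts a str). A rough Python 3 equivalent might look like the sketch below. The function name decode_pickled_cookie is made up for illustration, and the latin-1 encoding step is an assumption that the payload contains only single-byte characters. Keep in mind that unpickling can execute arbitrary code, so this is only reasonable for cookies from a server you trust.

```python
import ast
import pickle


def decode_pickled_cookie(value):
    """Hypothetical helper: unescape a quoted cookie value and unpickle it."""
    # Evaluate the octal escapes (\050, \012, ...) the way a string literal would.
    unescaped = ast.literal_eval("'''" + value + "'''")
    # Drop the surrounding double quotes; pickle.loads needs bytes in Python 3.
    payload = unescaped[1:-1].encode("latin-1")
    # Only unpickle data that comes from a trusted source.
    return pickle.loads(payload)


# Cookie value copied from the question.
value = '"\\050dp1\\012S\'variant\'\\012p2\\012S\'corrected\'\\012p3\\012sS\'pid\'\\012p4\\012VNTA2NjU0OTU4MDc5MTgwOA\\075\\075\\012p5\\012sS\'format\'\\012p6\\012S\'m3u8\'\\012p7\\012sS\'mode\'\\012p8\\012Vlive\\012p9\\012sS\'type\'\\012p10\\012S\'video/mp2t\'\\012p11\\012s."'
result = decode_pickled_cookie(value)
print(result)
```

With a dict returned by dict_from_cookiejar, the value could be fetched with next(iter(cookies.values())) instead of cookies.values()[0].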


How to extract some data from url using Python

I have a URL as follows:
https://some_url/vivi/v2/ZUxOZmVrdzJqTURxV20wQ0RvRld6SytEQWNocThwMGVnbFJ4RDQrZzJMeGRBcnhPYnUzV1pRPT0=/BE?category=PASSENGER&make=30&model=124&regmonth=3&regdate=2015-03&body=443,4781&facelift=252&seats=4&bodyHeight=443&bodyLength=443&weight=-1&engine=1394&wheeldrive=196&transmission=400
What I need is to get the string after v2/, thus ZUxOZmVrdzJqTURxV20wQ0RvRld6SytEQWNocThwMGVnbFJ4RDQrZzJMeGRBcnhPYnUzV1pRPT0=
I use furl to extract the parameter value. I do it as follows:
furl(url).args['category']  # gives PASSENGER
But here I do not have the name of the parameter.
How can I do that?
If you don't need a generalized solution and only care about the URL given in the question, you can do the following:
url="https://some_url/vivi/v2/ZUxOZmVrdzJqTURxV20wQ0RvRld6SytEQWNocThwMGVnbFJ4RDQrZzJMeGRBcnhPYnUzV1pRPT0=/BE?category=PASSENGER&make=30&model=124&regmonth=3&regdate=2015-03&body=443,4781&facelift=252&seats=4&bodyHeight=443&bodyLength=443&weight=-1&engine=1394&wheeldrive=196&transmission=400"
answer=url.split('/')[5]
Use the following code, which looks up the path component that follows 'v2':
l=url.split('/')
m=l[l.index('v2')+1]
print(m)
The desired output can also be obtained with re:
import re
url = "https://some_url/vivi/v2/ZUxOZmVrdzJqTURxV20wQ0RvRld6SytEQWNocThwMGVnbFJ4RDQrZzJMeGRBcnhPYnUzV1pRPT0=/BE?category=PASSENGER&make=30&model=124&regmonth=3&regdate=2015-03&body=443,4781&facelift=252&seats=4&bodyHeight=443&bodyLength=443&weight=-1&engine=1394&wheeldrive=196&transmission=400"
re.findall(r'v2/(.*)/', url)
This results in ['ZUxOZmVrdzJqTURxV20wQ0RvRld6SytEQWNocThwMGVnbFJ4RDQrZzJMeGRBcnhPYnUzV1pRPT0='].
But it's safer to use split() the way others mentioned, because when the API version changes to v3 this re code won't work anymore.
The string that you are after is not a query parameter, it is part of the URL path.
In the general case you can use the urllib.parse module to parse the URL into its components, then access the path. Then extract the required part of the path:
import base64
from urllib.parse import urlparse, parse_qs
parsed_url = urlparse(url)
s = parsed_url.path.split('/')[-2] # second last component of path
>>> s
'ZUxOZmVrdzJqTURxV20wQ0RvRld6SytEQWNocThwMGVnbFJ4RDQrZzJMeGRBcnhPYnUzV1pRPT0='
>>> base64.b64decode(s)
b'eLNfekw2jMDqWm0CDoFWzK+DAchq8p0eglRxD4+g2LxdArxObu3WZQ=='
The keys and values of the query string can also be processed into a dictionary and accessed by key:
params = parse_qs(parsed_url.query)
>>> params
{'category': ['PASSENGER'], 'make': ['30'], 'model': ['124'], 'regmonth': ['3'], 'regdate': ['2015-03'], 'body': ['443,4781'], 'facelift': ['252'], 'seats': ['4'], 'bodyHeight': ['443'], 'bodyLength': ['443'], 'weight': ['-1'], 'engine': ['1394'], 'wheeldrive': ['196'], 'transmission': ['400']}
>>> params['category']
['PASSENGER']
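To address the concern raised above about the version segment changing from v2 to v3, the path-based approach can be combined with a small pattern match. This is only a sketch: the helper name segment_after_version and the assumption that the version segment always looks like "v" followed by digits are mine, not something the URL's owner guarantees.

```python
import re
from urllib.parse import urlparse


def segment_after_version(url):
    """Hypothetical helper: return the path segment following 'v<digits>'."""
    segments = urlparse(url).path.split("/")
    # Stop one short of the end so segments[i + 1] always exists.
    for i, segment in enumerate(segments[:-1]):
        if re.fullmatch(r"v\d+", segment):
            return segments[i + 1]
    return None  # no version segment found


url = ("https://some_url/vivi/v2/"
       "ZUxOZmVrdzJqTURxV20wQ0RvRld6SytEQWNocThwMGVnbFJ4RDQrZzJMeGRBcnhPYnUzV1pRPT0="
       "/BE?category=PASSENGER&make=30")
token = segment_after_version(url)
print(token)
```

Because only the parsed path is inspected, a "/" or "v2" appearing in the query string would not affect the result.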

Extracting articles from the New York Times by using Python and the New York Times API

I am trying to create a corpus of text documents (articles concerning terrorist attacks) via the New York Times API in Python.
I am aware that the NYT API does not provide the full body text, but it does provide the URL from which I can scrape the article. So the idea is to extract the "web_url" values from the API response and then scrape the full article body.
I am trying to use the NYT API library on Python with these lines:
from nytimesarticle import articleAPI
api = articleAPI("*Your Key*")
articles = api.search( q = 'terrorist attack')
print(articles['response'],['docs'],['web_url'])
But I cannot extract the "web_url" or the articles. All I get is this output:
{'meta': {'time': 19, 'offset': 10, 'hits': 0}, 'docs': []} ['docs'] ['web_url']
There seems to be an issue with the nytimesarticle module itself. For example, see the following:
>>> articles = api.search(q="trump+women+accuse", begin_date=20161001)
>>> print(articles)
{'response': {'docs': [], 'meta': {'offset': 0, 'hits': 0, 'time': 21}}, 'status': 'OK', 'copyright': 'Copyright (c) 2013 The New York Times Company. All Rights Reserved.'}
But if I use requests (as is used in the module) to access the API directly, I get the results I'm looking for:
>>> import requests
>>> r = requests.get("http://api.nytimes.com/svc/search/v2/articlesearch.json?q=trump+women+accuse&begin_date=20161001&api-key=XXXXX")
>>> data = r.json()
>>> len(data["response"]["docs"])
10
meaning that 10 articles were returned (the full value of data is 16kb, so I won't include it all here). Contrast that to the response from api.search(), where articles["response"]["docs"] is an empty list.
nytimesarticle.py is only 115 lines long, so it's pretty straightforward to debug. Printing the value of the URL sent to the API reveals this:
>>> articles = api.search(q="trump+women+accuse", begin_date=20161001)
https://api.nytimes.com/svc/search/v2/articlesearch.json?q=b'trump+women+accuse'&begin_date=20161001&api-key=XXXXX
# ^^ note the b'...' wrapped around the value of q
The offending code encodes every string parameter to UTF-8, which makes it a bytes object. This is not necessary, and wrecks the constructed URL as shown above. Fortunately, there is a pull request that fixes this:
>>> articles = api.search(q="trump+women+accuse", begin_date=20161001)
http://api.nytimes.com/svc/search/v2/articlesearch.json?begin_date=20161001&q=trump+women+accuse&api-key=XXXXX
>>> len(articles["response"]["docs"])
10
This also allows for other string parameters such as sort="newest" to be used, as the bytes formatting was causing an error previously.
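The effect described above can be reproduced without the module. The sketch below is not the module's actual code; it only illustrates, under Python 3, why formatting an encoded (bytes) value into a URL string leaves a b'...' wrapper behind, and how urllib.parse.urlencode builds the query from plain strings instead. The query text here uses spaces purely for illustration.

```python
from urllib.parse import urlencode

query = "trump women accuse"

# Encoding first yields bytes; formatting bytes into a str keeps the b'...' repr.
broken = "q={}".format(query.encode("utf-8"))

# Passing plain str values lets urlencode do the escaping itself.
fixed = urlencode({"q": query, "begin_date": 20161001})

print(broken)
print(fixed)
```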
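The commas in the print call separate three different things to print: articles['response'], the list ['docs'] and the list ['web_url']. That is why the output ends with ['docs'] ['web_url'].
You'll want something like this
articles['response']['docs']['web_url']
But 'docs' is a list (and here an empty one), so the line above won't work either. Iterate over the list instead:
articles = articles['response']['docs']
for article in articles:
    print(article['web_url'])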
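The loop above can be wrapped into a small function that tolerates an empty or missing 'docs' list. This is a minimal sketch: extract_web_urls is a made-up name, and the sample dictionary only mimics the response shape shown in the question with placeholder URLs rather than real API output.

```python
def extract_web_urls(result):
    """Hypothetical helper: collect every 'web_url' found under response -> docs."""
    docs = result.get("response", {}).get("docs", [])
    return [doc["web_url"] for doc in docs if "web_url" in doc]


# Made-up sample in the same shape as the API response from the question.
sample = {
    "response": {
        "meta": {"time": 19, "offset": 0, "hits": 2},
        "docs": [
            {"web_url": "https://example.com/article-1"},
            {"web_url": "https://example.com/article-2"},
        ],
    }
}

urls = extract_web_urls(sample)
empty = extract_web_urls({"response": {"meta": {"hits": 0}, "docs": []}})
print(urls, empty)
```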

JSON encoding, customizing dumps method

I have the following dictionary:
data = {
'name': 'david',
'avatar': '\xed\xb3\x1cW\x7f\x87\x1c\xb9*Pw\x9a#W\x05\xeaNs\xe4\xaa#\xddi\x8e\\\x15\xa8;\xcd\x91\xab\x02u\xa79rU\xa0\xee4\xf7K\xb9\x05{t\x02\xc6I\xb6\xaa\xbf\x00\x00\x00\x00IEND\xaeB`\x82...'
}
When I try doing:
import json
json.dumps(data)
I get a UnicodeDecodeError, probably because it's trying to interpret the avatar data.
How would I 'skip' a value when doing json-encoding, so the value of the key just uses its string and doesn't try and encode it?
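JSON has no type for raw binary data, so one common workaround (not necessarily what the asker had in mind) is to base64-encode binary values before dumping and decode them again after loading. The sketch below assumes Python 3, where the avatar would be a bytes object; the helper name encode_bytes and the shortened avatar value are illustrative only.

```python
import base64
import json


def encode_bytes(obj):
    """Hypothetical default hook: turn bytes into a base64 text string."""
    if isinstance(obj, (bytes, bytearray)):
        return base64.b64encode(obj).decode("ascii")
    raise TypeError("Cannot serialize object of type " + type(obj).__name__)


# Shortened stand-in for the binary avatar data from the question.
data = {"name": "david", "avatar": b"\xed\xb3\x1cW\x7f\x87\x1c\xb9"}

# json.dumps calls the default hook for values it cannot serialize itself.
encoded = json.dumps(data, default=encode_bytes)

# The receiver reverses the step to get the original bytes back.
decoded = json.loads(encoded)
avatar = base64.b64decode(decoded["avatar"])
print(encoded)
```

The receiving side has to know which keys hold base64 text, since after encoding they are indistinguishable from ordinary strings.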

How to url encode the binary contents of a video in Python?

I want to post a video to Tumblr through its API, using the Tumblpy library.
My code is this:
import urllib
import requests
r = requests.get(video_url)
f = {'data': r.content}
dat = urllib.urlencode(f)  # Python 2; in Python 3 this is urllib.parse.urlencode
# t is a Tumblpy instance
t.post('post', blog_url='http://tumblrname.tumblr.com/', params={'type': 'video',
    'title': post.title, 'slug': post.slug, 'date': post.date, 'data': dat, 'tags': post.tagscsv,
    'caption': post.body_html})
This does not work. I suspect I am not encoding the binary contents the way the API expects, but I am not sure.
Presumably it's going to be similar to how you post a photo, in which case the library wants a file(like) object. A requests response can act as a file-like object just fine:
import requests
# stream=True is needed so that r.raw has not already been consumed
r = requests.get(video_url, stream=True)
t.post('post', blog_url='http://tumblrname.tumblr.com/',
       params={'type': 'video', 'title': post.title, 'slug': post.slug,
               'date': post.date, 'data': r.raw, 'tags': post.tagscsv,
               'caption': post.body_html})
where r.raw gives you a file-like object that, when read, yields the video data read from video_url.

jquery: post with json will actually post array

I have a Python CGI script, and when jQuery POSTs a JSON object to it, the object is transformed into array-style form fields. What I actually see in the POST from jQuery is:
login_user[username]=dfdsfdsf&login_user[password]=dsfsdf
(the [ and ] already escaped)
My question is how I can convert this string back to JSON in python? Or, how can I convert this string to python array/dict structure so that I can process it easier?
[edit]
My jquery is posting:
{'login_user': {'username':username, 'password':password}}
If what you want to accomplish is to send structured data from the browser and then unpack it in your Python backend and keep the same structure, I suggest the following:
Create JavaScript objects in the browser to hold your data:
var d = {}
d['login_user'] = { 'username': 'foo', 'password': 'bar' }
Serialize to JSON, with https://github.com/douglascrockford/JSON-js
POST to your backend doing something like this:
$.post(url, {'data': encoded_json_data}, ...)
In your Python code, parse the JSON. In my example, POST stands for wherever your CGI script exposes its POST data:
import json
data = json.loads(POST['data'])
data['login_user']
import re
thestring = "login_user[username]=dfdsfdsf&login_user[password]=dafef"
pattern = re.compile(r'^login_user\[username\]=(.*)&login_user\[password\]=(.*)')
match = pattern.search(thestring)
print(match.groups())
Output:
('dfdsfdsf', 'dafef')
Thus,
lp = match.groups()
print("{'login_user': {'username':" + lp[0] + ", 'password':" + lp[1] + "}}")
prints the following (the values are unquoted, so this is a display string rather than valid JSON):
{'login_user': {'username':dfdsfdsf, 'password':dafef}}
>>> import json
>>> data = {'login_user':{'username':'dfdsfdsf', 'password':'dsfsdf'}}
>>> json.dumps(data)
'{"login_user": {"username": "dfdsfdsf", "password": "dsfsdf"}}'
I suspect that data would already be contained in a GET var if that's coming from the URL...
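If changing the JavaScript side is not an option, the bracketed form string can also be turned back into a nested dict on the Python side. The following is a minimal sketch assuming Python 3 and keys of the form name[sub][subsub]; parse_bracketed_form is a made-up name, and conflicting keys such as a=1&a[b]=2 are deliberately not handled.

```python
import re
from urllib.parse import parse_qsl


def parse_bracketed_form(body):
    """Hypothetical helper: rebuild nested dicts from keys like a[b]=value."""
    result = {}
    # parse_qsl also decodes percent-escaped brackets (%5B and %5D).
    for key, value in parse_qsl(body, keep_blank_values=True):
        parts = re.findall(r"[^\[\]]+", key)  # 'a[b]' -> ['a', 'b']
        if not parts:
            continue
        node = result
        for part in parts[:-1]:
            node = node.setdefault(part, {})
        node[parts[-1]] = value
    return result


body = "login_user[username]=dfdsfdsf&login_user[password]=dsfsdf"
parsed = parse_bracketed_form(body)
escaped = parse_bracketed_form(
    "login_user%5Busername%5D=dfdsfdsf&login_user%5Bpassword%5D=dsfsdf"
)
print(parsed)
```

From there, json.dumps(parsed) gives the JSON string if that is what the rest of the script expects.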
