I've been looking around for a better solution than what I'm already doing. I need to construct an XML document to send to a SOAP service, but I need to generate the XML dynamically. The problem (which may or may not be a problem) is that I find this really long and I think there must be a better way of achieving this. I'm using Python 2.7.5, and this is my XML (kind of, it's actually larger):
SINGLE_PAYMENT = '''...
<v1:Shipping>
    <v1:Type>%s</v1:Type>
    <v1:Address1>%s</v1:Address1>
    <v1:Address2>%s</v1:Address2>
    <v1:City>%s</v1:City>
    <v1:Country>%s</v1:Country>
    <v1:Items>%s</v1:Items>
    <v1:State>%s</v1:State>
    <v1:Carrier>%s</v1:Carrier>
    <v1:Weight>%s</v1:Weight>
    <v1:Total>%s</v1:Total>
</v1:Shipping>
....'''
Then I do
SoapMessage = SINGLE_PAYMENT % (...)  # filled with the variables passed to this function
Is there any better way of doing this? Thanks!
You could use a proper XML library, or you could just write a method that generates a single "shipping" element using templates, for example:
def generate_tag(name, values, version='1'):
    # Opening tag, one tab-indented child element per key/value pair,
    # and the closing tag, all joined with newlines.
    return '\n'.join([
        '<v{0}:{1}>'.format(version, name),
        '\n'.join(
            '\t<v{2}:{0}>{1}</v{2}:{0}>'.format(k, v, version) for k, v in values.iteritems()
        ),
        '</v{0}:{1}>'.format(version, name)
    ])
# Output:
>>> print generate_tag('Shipping', {"Total": "100", "Type": "A"})
<v1:Shipping>
	<v1:Total>100</v1:Total>
	<v1:Type>A</v1:Type>
</v1:Shipping>
But, as you can see, that gets messy quickly.
Using the libraries (such as ElementTree or MiniDom) would be more to your benefit, though, as they will take care of any needed escaping and provide a better interface should you need to make changes later.
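For illustration, here is roughly what building that Shipping fragment could look like with ElementTree. This is only a sketch: the v1 namespace URI below is a placeholder (the real one comes from your WSDL), and ElementTree takes care of the escaping for you:

import xml.etree.ElementTree as ET

V1 = 'http://example.com/v1'  # placeholder namespace URI -- use the one from your WSDL
ET.register_namespace('v1', V1)

def build_shipping(values):
    shipping = ET.Element('{%s}Shipping' % V1)
    for name, value in values.items():
        child = ET.SubElement(shipping, '{%s}%s' % (V1, name))
        child.text = value  # special characters are escaped on serialization
    return ET.tostring(shipping)

print build_shipping({'Type': 'A', 'Total': '100'})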
Can someone help me understand what I'm doing wrong in the following code:
def matchTrigTohost(gtriggerids, gettriggers):
    mylist = []
    for eachid in gettriggers:
        gtriggerids['params']['triggerids'] = str(eachid)
        hgetjsonObject = updateitem(gtriggerids, processor)
        hgetjsonObject = json.dumps(hgetjsonObject)
        hgetjsonObject = json.loads(hgetjsonObject)
        hgetjsonObject = eval(hgetjsonObject)
        hostid = hgetjsonObject["result"][0]["hostid"]
        hname = hgetjsonObject["result"][0]["name"]
        endval = hostid + "--" + hname
        mylist.append(endval)
    return(hgetjsonObject)
The variable gettriggers contains a lot of ids (~3500):
[ "26821", "26822", "26810", ..... ]
I'm looping through the ids in the variable and assigning them to a json object.
gtriggerids = {
    "jsonrpc": "2.0",
    "method": "host.get",
    "params": {
        "output": ["hostid", "name"],
        "triggerids": "26821"
    },
    "auth": mytoken,
    "id": 2
}
When I run the code against the above json variable, it is very slow. It is taking several minutes to check each ID. I'm sure I'm doing many things wrong here or at least not in the pythonic way. Can anyone help me speed this up? I'm very new to python.
NOTE:
The dump(), load(), eval() were used to convert the str produced to json.
You asked for help knowing what you're doing wrong. Happy to oblige :-)
At the lowest level—why your function is running slowly—you're running many unnecessary operations. Specifically, you're moving data between formats (python dictionaries and JSON strings) and back again which accomplishes nothing but wasting CPU cycles.
You mentioned this is the only way you could get the data in the format you needed. That brings me to the second thing you're doing wrong.
You're throwing code at the wall instead of understanding what's happening.
I'm quite sure (and several of your commenters appear to agree) that your code is not the only way to arrange your data into a usable structure. What you should do instead is:
Understand as much as you can about the data you're being given. I suspect the output of updateitem() should be your first target of learning.
Understand the right/typical way to interact with that data. Your data doesn't have to be a dictionary before you can use it. Maybe it's not the best approach.
Understand what regularities and irregularities the data may have. Part of your problem may not be with types or dictionaries, but with an unpredictable/dirty data source.
Armed with all this new knowledge, manipulate your data as simply as you can.
I can pretty much guarantee the result will run faster.
More detail! Some things you wrote suggest misconceptions:
I'm looping through the ids in the variable and assigning them to a json object.
No, you can't assign to a JSON object. In python, JSON data is always a string. You probably mean that you're assigning to a python dictionary, which (sometimes!) can be converted to a JSON object, represented as a string. Make sure you have all those concepts clear before you move forward.
The dump(), load(), eval() were used to convert the str produced to json.
Again, you don't call dumps() on a string. You use that to convert a python object to a string. Run this code in a REPL, go step by step, and inspect or play with each output to understand what it is.
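To make that concrete, here is a sketch of the same loop with the redundant conversions removed. It assumes updateitem() already returns a parsed dict, which is exactly the kind of thing to verify first:

def matchTrigTohost(gtriggerids, gettriggers):
    mylist = []
    for eachid in gettriggers:
        gtriggerids['params']['triggerids'] = str(eachid)
        # Assumption: updateitem() already returns a Python dict, so no
        # dumps()/loads()/eval() round-trip is needed.
        response = updateitem(gtriggerids, processor)
        hostid = response["result"][0]["hostid"]
        hname = response["result"][0]["name"]
        mylist.append(hostid + "--" + hname)
    return mylist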
I'm downloading tweets from the Twitter Streaming API using Tweepy. I manage to check whether the downloaded data has keys such as 'extended_tweet', but I'm struggling with a specific key inside another key.
def on_data(self, data):
    savingTweet = {}
    if not "retweeted_status" in data:
        dataJson = json.loads(data)
        if 'extended_tweet' in dataJson:
            savingTweet['text'] = dataJson['extended_tweet']['full_text']
        else:
            savingTweet['text'] = dataJson['text']
        if 'coordinates' in dataJson:
            if 'coordinates' in dataJson['coordinates']:
                savingTweet['coordinates'] = dataJson['coordinates']['coordinates']
            else:
                savingTweet['coordinates'] = 'null'
I'm checking 'extended_tweet' properly, but when I try to do the same with ['coordinates']['coordinates'] I get the following error:
TypeError: argument of type 'NoneType' is not iterable
Twitter documentation says that key 'coordinates' has the following structure:
"coordinates":
{
"coordinates":
[
-75.14310264,
40.05701649
],
"type":"Point"
}
I managed to solve it by just putting the conflicting check in a try/except, but I think this is not the most suitable approach to the problem. Any other ideas?
So the Twitter API docs are probably lying a bit about what they return (shock horror!) and it looks like you're getting a None in place of the expected data structure. You've already decided against using try/except, so I won't go over that, but here are a few other suggestions.
Using dict get() default
There are a couple of options that occur to me. The first is to make use of the default argument of the dict get() method: you can provide a fallback if the expected key does not exist, which allows you to chain multiple calls together.
For example you can achieve most of what you are trying to do with the following:
return {
    # "or {}" guards against keys that are present but set to None
    'text': (data.get('extended_tweet') or {}).get('full_text', data['text']),
    'coordinates': (data.get('coordinates') or {}).get('coordinates', 'null')
}
It's not super pretty, but it does work. It's likely to be a little slower than what you are doing, too.
Using JSONPath
Another option, which is likely overkill for this situation is to use a JSONPath library which will allow you to search within data structures for items matching a query. Something like:
from jsonpath_rw import parse

matches = parse('extended_tweet.full_text').find(data)
if matches:
    print(matches[0].value)
This is going to be a lot slower than what you are doing, and for just a few fields it is overkill, but if you are doing a lot of this kind of work it could be a handy tool in the box. JSONPath can also express much more complicated paths, or very deeply nested paths where the get method might not work or would be unwieldy.
Parse the JSON first!
The last thing I would mention is to make sure you parse your JSON before you do your test for "retweeted_status". If the text appears anywhere (say inside the text of a tweet) this test will trigger.
JSON parsing with a competent library is usually extremely fast too, so unless you are having real speed problems it's not necessarily worth worrying about.
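Putting those suggestions together, here is a rough sketch (extract_tweet is a hypothetical helper name, not part of Tweepy) that parses the payload first and then does the safe lookups:

import json

def extract_tweet(data):
    # Parse the raw payload first, then test keys on the parsed dict.
    tweet = json.loads(data)
    if 'retweeted_status' in tweet:
        return None
    return {
        'text': (tweet.get('extended_tweet') or {}).get('full_text', tweet.get('text')),
        # "or {}" guards against 'coordinates' being present but set to None
        'coordinates': (tweet.get('coordinates') or {}).get('coordinates', 'null'),
    }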
I'm trying to load JSON back into an object. The "loads" method seems to work without error, but the object doesn't seem to have the properties I expect.
How can I go about examining/inspecting the object that I have (this is web-based code).
results = {"Subscriber": {"firstname": "Neal", "lastname": "Walters"}}
subscriber = json.loads(results)
for item in inspect.getmembers(subscriber):
    self.response.out.write("<BR>Item")
    for subitem in item:
        self.response.out.write("<BR> SubItem=" + subitem)
The attempt above returned this:
Item
SubItem=__class__
I don't think it matters, but for context: the JSON is actually coming from a urlfetch in Google App Engine to a REST web service created using this utility: http://code.google.com/p/appengine-rest-server.
The data is being retrieved from a datastore with this definition:
class Subscriber(db.Model):
    firstname = db.StringProperty()
    lastname = db.StringProperty()
Thanks,
Neal
Update #1: Basically I'm trying to deserialize JSON back into an object.
In theory it was serialized from an object, and I want to now get it back into an object.
Maybe the better question is how to do that?
Update #2: I was trying to abstract a complex program down to a few lines of code, so I made a few mistakes in "pseudo-coding" it for purposes of posting here.
Here's a better code sample, now taken out of the website so I can run it on my PC.
results = '{"Subscriber": {"firstname": "Neal", "lastname": "Walters"}}'
subscriber = json.loads(results)
for key, value in subscriber.items():
    print " %s: %s" % (key, value)
The above runs, but what it displays doesn't look any more structured than the JSON string itself. It displays this:
Subscriber: {u'lastname': u'Walters', u'firstname': u'Neal'}
I have more of a Microsoft background, so when I hear serialize/deserialize, I think of going from an object to a string and from a string back to an object. So if I serialize to JSON and then deserialize, what do I get: a dictionary, a list, or an object? Actually, I'm getting the JSON from a REST web method that is serializing my object on my behalf.
Ideally I want a subscriber object that matches my Subscriber class above, and ideally, I don't want to write one-off custom code (i.e. code that would be specific to "Subscriber"), because I would like to do the same thing with dozens of other classes. If I have to write some custom code, I will need to do it generically so it will work with any class.
Update #3: This is to explain more of why I think this is a needed tool. I'm writing a huge app, probably on Google App Engine (GAE). We are leaning toward a REST architecture for several reasons, but one is that our web GUI should access the data store via a REST web layer. (I'm a lot more used to SOAP, so switching to REST is a small challenge in itself.) One of the classic ways of getting and updating data is through a business or data tier. By using the REST utility mentioned above, I have the choice of XML or JSON. (I'm hoping to do a small working prototype of both before we develop the huge app.) Then, suppose we have a successful app and GAE doubles its prices: we can rewrite just the data tier, take our Python/Django user tier (web code), and run it on Amazon or somewhere else.
If I'm going to do all that, why would I want everything to be dictionary objects? Wouldn't I want the power of a full-blown class structure? One of the next tricks is a sort of object-relational mapping (ORM), so that we don't necessarily expose our exact data tables but rather a more logical layer.
We also want to expose a RESTful API to paying users, who might be using any language. For them, they can use XML or JSON, and they wouldn't use the serialize routine discussed here.
JSON only encodes strings, numbers, booleans, null, JavaScript objects (Python dicts), and arrays (Python lists).
You have to create a function that turns the returned dictionary into an instance of your class, and then pass it to json.loads via the object_hook keyword argument along with the JSON string. Here's some code that fleshes it out:
import json

class Subscriber(object):
    firstname = None
    lastname = None

class Post(object):
    author = None
    title = None

def decode_from_dict(cls, vals):
    obj = cls()
    for key, val in vals.items():
        setattr(obj, key, val)
    return obj

SERIALIZABLE_CLASSES = {'Subscriber': Subscriber,
                        'Post': Post}

def decode_object(d):
    for field in d:
        if field in SERIALIZABLE_CLASSES:
            cls = SERIALIZABLE_CLASSES[field]
            return decode_from_dict(cls, d[field])
    return d

results = '''[{"Subscriber": {"firstname": "Neal", "lastname": "Walters"}},
              {"Post": {"author": {"Subscriber": {"firstname": "Neal",
                                                  "lastname": "Walters"}},
                        "title": "Decoding JSON Objects"}}]'''

result = json.loads(results, object_hook=decode_object)
print result
print result[1].author
This will handle any class that can be instantiated without arguments to the constructor and for which setattr will work.
Also, this uses json. I have no experience with simplejson so YMMV but I hear that they are identical.
Note that although the values for the two subscriber objects are identical, the resulting objects are not. This could be fixed by memoizing the decode_from_dict function.
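A rough sketch of what that memoization could look like (purely illustrative; it keys the cache on the class plus the field values and assumes those values are hashable):

_decode_cache = {}

def decode_from_dict(cls, vals):
    # Cache key: the target class plus its field contents, so identical
    # payloads map to the same object. Assumes the values are hashable.
    key = (cls, tuple(sorted(vals.items())))
    if key not in _decode_cache:
        obj = cls()
        for k, v in vals.items():
            setattr(obj, k, v)
        _decode_cache[key] = obj
    return _decode_cache[key]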
results in your snippet is a dict, not a string, so the json.loads would raise an exception. If that is fixed, each subitem in the inner loop is then a tuple, so trying to add it to a string as you are doing would raise another exception. I guess you've simplified your code, but the two type errors should already show that you simplified it too much (and incorrectly). Why not use an (equally simplified) working snippet, and the actual string you want to json.loads instead of one that can't possibly reproduce your problem? That course of action would make it much easier to help you.
Beyond peering at the actual string, and showing some obvious information such as type(subscriber), it's hard to offer much more help based on that clearly-broken code and such insufficient information :-(
Edit: in "update2", the OP says
It displays this: Subscriber: {u'lastname': u'Walters', u'firstname': u'Neal'}
...and what else could it possibly display, pray?! You're printing the key as string, then the value as string -- the key is a string, and the value is another dict, so of course it's "stringified" (and all strings in JSON are Unicode -- just like in C# or Java, and you say you come from a MSFT background, so why does this surprise you at all?!). str(somedict), identically to repr(somedict), shows the repr of keys and values (with braces around it all and colons and commas as appropriate separators).
JSON, a completely language-independent serialization format though originally centered on Javascript, has absolutely no idea of what classes (if any) you expect to see instances of (of course it doesn't, and it's just absurd to think it possibly could: how could it possibly be language-independent if it hard-coded the very concept of "class", a concept which so many languages, including Javascript, don't even have?!) -- so it uses (in Python terms) strings, numbers, lists, and dicts (four very basic data types that any semi-decent modern language can be expected to have, at least in some library if not embedded in the language proper!). When you json.loads a string, you'll always get some nested combination of the four datatypes above (all strings will be unicode and numbers will be ints or floats, BTW;-).
If you have no idea (and don't want to encode by some arbitrary convention or other) what class's instances are being serialized, but absolutely must have class instances back (not just dicts etc) when you deserialize, JSON per se can't help you -- that metainformation cannot possibly be present in the JSON-serialized string itself.
If you're OK with the four fundamental types, and just want to see some printed results that you consider "prettier" than the default Python string printing of the fundamental types in question, you'll have to code your own recursive pretty-printing function depending on your subjective definition of "pretty" (I doubt you'd like Python's own pprint standard library module any more than you like your current results;-).
My guess is that loads is returning a dictionary. To iterate over its content, use something like:
for key, value in subscriber.items():
    self.response.out.write("%s: %s" % (key, value))
Situation:
I am writing a basic templating system in Python/mod_python that reads in a main HTML template and replaces instances of ":value:" throughout the document with additional HTML or db results and then returns it as a view to the user.
I am not trying to replace all instances of a single substring; the values can vary. There is a finite list of what's acceptable; it is not unlimited. The syntax for the values is [colon]value[colon]; examples might be ":gallery:", ":related:", ":comments:". The replacement may be additional static HTML or a call to a function, and the functions may vary as well.
Question:
What's the most efficient way to read in the main HTML file and replace the unknown combination of values with their appropriate replacement?
Thanks in advance for any thoughts/solutions,
c
There are dozens of templating options that already exist. Consider genshi, mako, jinja2, django templates, or more.
You'll find that you're reinventing the wheel with little/no benefit.
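For instance, with Jinja2 (just one of the options above; note it uses {{ value }} placeholders rather than your :value: markers) the substitution boils down to a couple of lines. The template and values below are made up purely for illustration:

from jinja2 import Template

# Made-up template and values, just to show the API.
template = Template("<h1>{{ title }}</h1>\n<div>{{ gallery }}</div>")
print template.render(title="My Page", gallery="<em>gallery markup here</em>")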
If you can't use an existing templating system for whatever reason, your problem seems best tackled with regular expressions:
import re

valre = re.compile(r':\w+:')

def dosub(correspvals, correspfuns, lastditch):
    def f(value):
        v = value.group()[1:-1]
        if v in correspvals:
            return correspvals[v]
        if v in correspfuns:
            return correspfuns[v]()  # or whatever args you need
        # what if a value has neither a corresponding value to
        # substitute, NOR a function to call? Whatever...:
        return lastditch(v)
    return f

replacer = dosub(adict, another, somefun)
thehtml = valre.sub(replacer, thehtml)
Basically you'll need two dictionaries (one mapping values to corresponding values, another mapping values to corresponding functions to be called) and a function to be called as a last-ditch attempt for values that can't be found in either dictionary; the code above shows you how to put these things together (I'm using a closure, a class would of course do just as well) and how to apply them for the required replacement task.
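As a quick illustration of how the pieces fit together (adict, another, and somefun below are made-up sample mappings, not anything from the question):

adict = {'gallery': '<div id="gallery">...</div>'}               # value -> static HTML
another = {'related': lambda: '<ul><li>Related post</li></ul>'}  # value -> function
somefun = lambda v: ''                                           # last-ditch: drop unknown values

replacer = dosub(adict, another, somefun)
print valre.sub(replacer, "<p>:gallery: and :related: and :unknown:</p>")
# -> <p><div id="gallery">...</div> and <ul><li>Related post</li></ul> and </p>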
This is probably a job for a templating engine, and for Python there are a number of choices. In this Stack Overflow question people have listed their favourites, and some helpfully explain why: What is your single favorite Python templating engine?
My question is as follows:
I have to read a big XML file (50 MB) and anonymise some tags/fields that relate to private information, like name, surname, address, email, phone number, etc.
I know exactly which tags in XML are to be anonymised.
s|<a>alpha</a>|MD5ed(alpha)|e;
s|<h>beta</h>|MD5ed(beta)|e;
where alpha and beta refer to whatever characters appear within the tags; these will be hashed, probably using an algorithm like MD5.
I will only convert the tag value, not the tags themselves.
I hope, I am clear enough about my problem. How do I achieve this?
You have to do something like the following in Python.
import xml.etree.ElementTree as xml  # or lxml or whatever
import hashlib

theDoc = xml.parse("sample.xml")

for alphaTag in theDoc.findall("xpath/to/tag"):
    print alphaTag, alphaTag.text
    alphaTag.text = hashlib.md5(alphaTag.text).hexdigest()

xml.dump(theDoc)
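One caveat: xml.dump() only echoes the tree to stdout for debugging. To keep the anonymised copy you would write the modified tree back out, roughly like this (the output filename here is made up):

# dump() only prints to stdout; write() saves the modified tree to a file.
theDoc.write("sample_anonymised.xml", encoding="utf-8")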
Using regexps is indeed dangerous, unless you know exactly the format of the file, it's easy to parse with regexps, and you are sure that it will not change in the future.
Otherwise you could indeed use XML::Twig, as below. An alternative would be to use XML::LibXML, although the file might be a bit big to load entirely into memory (then again, maybe not; memory is cheap these days), so you might have to use the pull mode, which I don't know much about.
Compact XML::Twig code:
#!/usr/bin/perl
use strict;
use warnings;
use XML::Twig;
use Digest::MD5 'md5_base64';

my @tags_to_anonymize = qw( name surname address email phone);

# the handler for each element ($_) sets its content with the md5 and then flushes
my %handlers = map { $_ => sub { $_->set_text( md5_base64( $_->text))->flush } } @tags_to_anonymize;

XML::Twig->new( twig_roots => \%handlers, twig_print_outside_roots => 1)
         ->parsefile( "my_big_file.xml")
         ->flush;
Bottom line: don't parse XML using regex.
Use your language's DOM parsing libraries instead, and if you know the elements you need to anonymize, grab them using XPath and hash their contents by setting their innerText/innerHTML properties (or whatever your language calls them).
As Welbog said, don't try to parse XML with a regex. You'll regret it eventually.
Probably the easiest way to do this is using XML::Twig. It can process XML in chunks, which lets you handle very large files.
Another possibility would be using SAX, especially with XML::SAX::Machines. I've never really used that myself, but it's a stream-oriented system, so it should be able to handle large files. The downside is that you'll probably have to write more code to collect the text inside each tag that you care about (where XML::Twig will collect that text for you).