For boto3, how many client instances should I use? - python

When I'm using the boto3 SDK for Python, is it better to have a single client object, like this:
client = boto3.client('s3')
# use client through the file
and then use that everywhere, or should I instantiate it as needed, like this:
size = boto3.client('s3').head_object(Bucket=bucket, Key=key)['ContentLength']
Which is better? Does it make a difference?

I don't see any harm in using a single client object throughout the file for a particular AWS service. Since boto is a widely used, standard SDK, it won't change drastically, and even when a backwards-incompatible change is made, it is announced. So it won't affect your application while a process is running.
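A minimal sketch of the single-client approach, assuming boto3 is installed and credentials are configured. The helper names get_client and object_size are hypothetical (they are not part of boto3); get_client simply caches one client per service name so that every caller in the process reuses it. The boto3 documentation describes low-level clients as generally thread safe, which is what makes sharing one reasonable.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def get_client(service_name):
    """Return one shared boto3 client per service name (hypothetical helper)."""
    # Imported lazily so that importing this module does not require boto3.
    import boto3

    return boto3.client(service_name)


def object_size(bucket, key):
    """Look up an object's size in bytes using the shared S3 client."""
    return get_client("s3").head_object(Bucket=bucket, Key=key)["ContentLength"]
```

Every call to get_client("s3") after the first returns the same object, so the cost of constructing the client is paid once.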

Related

Fastest way of communication between multiple EC2 instances in python

I am looking for the absolute fastest way to submit a short string from one to multiple EC2 Instances of the type t2.nano. Example: Something happens on Instance 1, Instance 2,3,4 should (almost) instantly know about it. Target is < 5ms. For now the instances are all in the same region and same cluster availability zone.
What I have looked at so far:
Shared drive where instance1 can drop the data and the rest of the instances can check it
-> Not possible as this instance type does not support shared drives
Redis
-> I tested this locally and it is actually pretty slow: at least XXms, and sometimes XXXms, for one read and one write (just for testing).
Any ideas how to solve this problem?
You can try AWS EFS
Multiple compute instances, including Amazon EC2, Amazon ECS, and AWS Lambda, can access an Amazon EFS file system at the same time, providing a common data source for workloads and applications running on more than one compute instance or server.
https://docs.aws.amazon.com/efs/latest/ug/whatisefs.html
Consider using Twisted for Publish/Subscribe between your clients, where clients see all messages posted by other clients.
Alternatively, consider Autobahn which builds abstraction layers on Twisted including WebSocket-based pub/sub and WAMP.

Caching Google API calls for unit tests

I've got a Google App Engine project that uses the Google Cloud Language API, and I'm using the Google API Client Library (Python) to make the API calls.
When running my unit tests, I make quite a few calls to the API. This slows down my testing and also incurs costs.
I'd like to cache the calls to the Google API to speed up my tests and avoid the API charges, and I'd rather not roll my own if another solution is available.
I found this Google API page, which suggests doing this:
import httplib2
http = httplib2.Http(cache=".cache")
And I've added these lines to my code (there is another option to use GAE memcache, but that won't persist between test runs). Right after these lines, I create my API call connection:
NLP = discovery.build("language", "v1", developerKey=API_KEY)
The caching isn't working and the above solution seems too simple so I suspect I am missing something.
UPDATE:
I updated my tests so that App Engine is not used (just a regular unit test) and I also figured out that I can pass the http I created to the Google API client like this:
NLP = discovery.build("language", "v1", http=http, developerKey=API_KEY)
Now, the initial discovery call is cached but the actual API calls are not cached, e.g., this call is not cached:
result = NLP.documents().annotateText(body=data).execute()
The suggested code:
http = httplib2.Http(cache=".cache")
is trying to cache to the local filesystem in a directory called ".cache". On App Engine, you cannot write to the local filesystem, so this does nothing.
Instead, you could try caching to Memcache. The other suggestion on the Python Client docs referenced is to do exactly this:
from google.appengine.api import memcache
http = httplib2.Http(cache=memcache)
Since all App Engine apps get free access to shared memcache this should be better than nothing.
If this fails, you could also try memoization. I've had success memoizing calls to slow or flaky APIs, but it comes at the cost of increased memory usage (so I need bigger instances).
EDIT: I see from your comment you're having this problem locally. I was originally thinking that memoization would be an alternative, but the need to hack on httplib2 makes that overly complicated. I'm back to thinking about how to convince httplib2 to do the right thing.
If you're trying to make a test run faster by caching an API call result, stop and consider whether you may have taken a wrong turn.
If you can restructure your code so that the API call can be replaced with a unittest.mock object, your tests will run much, much faster.
I just came across vcrpy which seems to do exactly this. I'll update this answer after I've had a chance to try it out.
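A minimal sketch of the unittest.mock approach mentioned above. The function analyze, the request body, and the canned response are made up for illustration; only the call chain documents().annotateText(body=...).execute() is taken from the question. The idea is to pass the API client in as an argument so a test can substitute a mock for it.

```python
from unittest import mock


def analyze(nlp, text):
    """Code under test: calls the Language API through the client object `nlp`."""
    # Illustrative request body; the real one depends on your application.
    body = {"document": {"type": "PLAIN_TEXT", "content": text}}
    return nlp.documents().annotateText(body=body).execute()


def test_analyze_uses_canned_response():
    # A MagicMock stands in for the object returned by discovery.build(...).
    fake_nlp = mock.MagicMock()
    canned = {"entities": [{"name": "Seattle"}]}  # made-up response
    chain = fake_nlp.documents.return_value.annotateText
    chain.return_value.execute.return_value = canned

    result = analyze(fake_nlp, "I live in Seattle.")

    # No network call is made; the mock returns the canned data.
    assert result == canned
    chain.assert_called_once()
```

Because nothing leaves the process, the test is fast and incurs no API charges, at the cost of not exercising the real service.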

Parallel copy of buckets/keys with the boto3 or boto API between 2 different accounts/connections

I want to copy keys from buckets between 2 different accounts using the boto3 API.
In boto3, I executed the following code and the copy worked:
import boto3
source = boto3.client('s3')
destination = boto3.client('s3')
# Download the object from the source, then upload its bytes to the destination
obj = source.get_object(Bucket='bucket', Key='key')
destination.put_object(Bucket='bucket', Key='key', Body=obj['Body'].read())
Basically, I am fetching the data with GET and writing it with PUT in another account.
Along similar lines, with the boto API, I did the following:
from boto.s3.connection import S3Connection
from boto.s3.key import Key
source = S3Connection()
source_bucket = source.get_bucket('bucket')
source_key = Key(source_bucket, key_name)
destination = S3Connection()
destination_bucket = destination.get_bucket('bucket')
dist_key = Key(destination_bucket, source_key.key)
dist_key.set_contents_from_string(source_key.get_contents_as_string())
The above code achieves the purpose of copying any type of data.
But it is really slow. It takes around 15-20 seconds to copy 1GB of data, and I have to copy 100GB or more.
I tried Python multithreading, with each thread doing a copy operation. The performance was bad: it took 30 seconds to copy 1GB. I suspect the GIL might be the issue here.
I tried multiprocessing and got the same result as with a single process, i.e. 15-20 seconds for a 1GB file.
I am using a very high-end server with 48 cores and 128GB RAM. The network speed in my environment is 10 Gbps.
Most of the search results talk about copying data between buckets in the same account, not across accounts. Can anyone please guide me here? Is my approach wrong? Does anyone have a better solution?
Yes, it is the wrong approach.
You shouldn't download the file. You are using AWS infrastructure, so you should make use of the efficient AWS backend call to do the work. Your approach wastes resources.
The S3 client's copy method (boto3.client('s3').copy) will do the job better than this.
In addition, you didn't describe what you are trying to achieve (e.g. is this some sort of replication requirement?).
With a proper understanding of your own needs, you may find that you don't even need a server to do the job: S3 bucket event triggers, Lambda, etc. can all run the copy without a server.
To copy files between two different AWS accounts, you can check out this link: Copy S3 object between AWS account
Note:
S3 is a huge virtual object store shared by everyone, which is why bucket names MUST be unique. This also means the S3 "controller" can do a lot of fancy work similar to a file server, e.g. replicating, copying, and moving files in the backend, without sending the data through your own network.
As long as you set up the proper IAM permissions/policies for the destination bucket, objects can move across buckets without an additional server.
This is very similar to a file server. Users can copy files to each other without "download/upload": one user just creates a folder with write permission for all, and the copy from another user is done entirely within the file server, at the fastest raw disk I/O performance. You need neither a powerful instance nor a high-performance network when using the backend S3 copy API.
Your method is like downloading a file over FTP from another user of the same file server, which creates unwanted network traffic.
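A minimal sketch of the backend copy described above, using the S3 client's copy method. The function name copy_between_buckets and the bucket and key names are placeholders, and the sketch assumes the credentials behind the client can read the source object and write to the destination bucket (the cross-account permissions must already be in place).

```python
def copy_between_buckets(s3_client, src_bucket, src_key, dst_bucket, dst_key):
    """Ask S3 to copy an object itself instead of downloading and re-uploading it."""
    copy_source = {"Bucket": src_bucket, "Key": src_key}
    # The object data is copied inside S3; it does not pass through this machine.
    s3_client.copy(copy_source, dst_bucket, dst_key)


# Usage (assumes boto3 is installed and credentials are configured):
#   import boto3
#   s3 = boto3.client("s3")
#   copy_between_buckets(s3, "source-bucket", "key", "destination-bucket", "key")
```

Since each copy is only an API request, many keys can be submitted concurrently from a small machine without the server's bandwidth becoming the bottleneck.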
You should check out the TransferManager in boto3. It automatically handles the threading of multipart uploads in an efficient way. See the docs for more detail.
Basically, you just have to use the upload_file method and the TransferManager will take care of the rest.
import boto3
# Get the service client
s3 = boto3.client('s3')
# Upload tmp.txt to bucket-name at key-name
s3.upload_file("tmp.txt", "bucket-name", "key-name")

Python oauth2client async

I am fighting with tornado and the official python oauth2client, gcloud... modules.
These modules accept an alternate http client passed with http=, as long as it has a method called request which can be called by any of these libraries, whenever an http request must be sent to google and/or to renew the access tokens using the refresh tokens.
I have created a simple class which has a self.client = AsyncHttpClient()
Then in its request method, returns self.client.fetch(...)
My goal is to be able to yield any of these libraries' calls, so that Tornado will execute them asynchronously.
The thing is that they are highly dependent on what the default client, httplib2.Http(), returns: (response, content)
I am really stuck and cannot find a clean way of making this async.
If anyone already found a way, please help.
Thank you in advance
These libraries do not support asynchronous operation, and porting them is not always easy.
oauth2client
Depending on what you want to do, Tornado's GoogleOAuth2Mixin or tornado-alf may be enough.
gcloud
I am not aware of any Tornado/asyncio implementation of gcloud-python, so you could:
Write it yourself. Again, this is not a simple transport change of Connection.http or request; all the code around it must be able to use/yield futures/coroutines.
Wrap it in a ThreadPoolExecutor (as #Apero mentioned). This is a high-level API, so any nested API calls within that yield will be executed in the same thread (not using the pool). It could work well.
Run it in an external app (with ProcessPoolExecutor or Popen).
When I had a similar problem with AWS a couple of years ago, I ended up executing the CLI asynchronously (Tornado + subprocess.Popen + some CLI (awscli, or boto-based)) and handling simple cases (like S3 and basic EC2 operations) with plain AsyncHTTPClient.
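A minimal sketch of the ThreadPoolExecutor option. It uses the standard library's asyncio rather than Tornado's own API (recent Tornado versions run on the asyncio event loop, so the same pattern should carry over, but that is an assumption to verify for your version). blocking_api_call is a hypothetical stand-in for any synchronous call into oauth2client or gcloud.

```python
import asyncio
import time
from concurrent.futures import ThreadPoolExecutor


def blocking_api_call(name):
    """Stand-in for a synchronous library call (e.g. an httplib2-based request)."""
    time.sleep(0.1)
    return f"result for {name}"


async def fetch_all(names, executor):
    """Run each blocking call in the thread pool and await them concurrently."""
    loop = asyncio.get_running_loop()
    futures = [loop.run_in_executor(executor, blocking_api_call, n) for n in names]
    # gather returns the results in the same order as the inputs.
    return await asyncio.gather(*futures)


def main():
    with ThreadPoolExecutor(max_workers=4) as executor:
        return asyncio.run(fetch_all(["a", "b", "c"], executor))
```

The event loop stays free while the worker threads block, but each in-flight call still occupies one thread, so the pool size bounds the concurrency.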

How do I create a D-Bus service that dynamically creates multiple objects?

I'm new to D-Bus (and to Python, double whammy!) and I am trying to figure out the best way to do something that was discussed in the tutorial.
However, a text editor application could as easily own multiple bus names (for example, org.kde.KWrite in addition to generic TextEditor), have multiple objects (maybe /org/kde/documents/4352 where the number changes according to the document), and each object could implement multiple interfaces, such as org.freedesktop.DBus.Introspectable, org.freedesktop.BasicTextField, org.kde.RichTextDocument.
For example, say I want to create a wrapper around flickrapi such that the service can expose a handful of Flickr API methods (say, urls_lookupGroup()). This is relatively straightforward if I want to assume that the service will always be specifying the same API key and that the auth information will be the same for everyone using the service.
Especially in the latter case, I cannot really assume this will be true.
Based on the documentation quoted above, I am assuming there should be something like this:
# Get the connection proxy object.
flickrConnectionService = bus.get_object("com.example.FlickrService",
                                         "/Connection")
# Ask the connection object to connect, the return value would be
# maybe something like "/connection/5512" ...
flickrObjectPath = flickrConnectionService.connect("MY_APP_API_KEY",
                                                   "MY_APP_API_SECRET",
                                                   flickrUsername)
# Get the service proxy object.
flickrService = bus.get_object("com.example.FlickrService",
                               flickrObjectPath)
# Ask the flickr service object to get group information.
groupInfo = flickrService.getFlickrGroupInfo('s3a-belltown')
So, my questions:
1) Is this how this should be handled?
2) If so, how will the service know when the client is done? Is there a way to detect that the current client has dropped its connection, so that the service can clean up its dynamically created objects? Also, how would I create the individual objects in the first place?
3) If this is not how this should be handled, what are some other suggestions for accomplishing something similar?
I've read through a number of D-Bus tutorials and various documentation and about the closest I've come to seeing what I am looking for is what I quoted above. However, none of the examples look to actually do anything like this so I am not sure how to proceed.
1) Mostly yes. I would only change one thing in the connect method, as I explain in 2).
2) D-Bus connections are not persistent; everything is done with request/response messages, and no connection state is stored unless you implement it in separate objects, as you do with your flickrObject. The D-Bus objects in the Python bindings are mostly proxies that abstract the remote objects as if you were "connected" to them, but what they really do is build messages based on the information you give when instantiating the D-Bus object (object path, interface, and so on). So the service cannot know when the client is done unless the client announces it with another explicit call.
To handle unexpected client termination, you can create a D-Bus object in the client and send its object path to the service when connecting; change your connect method to also accept an ObjectPath parameter. The service can listen for the NameOwnerChanged signal to find out whether a client has died.
To create the individual objects, you only have to instantiate an object in the same service, as you do with your "/Connection", but you have to make sure you use an object path that is not already in use. You could have a "/Connection/Manager" and various "/Connection/1", "/Connection/2"...
3) If you need to store the connection state, you have to do something like that.
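The bookkeeping described in 2) can be sketched in plain Python, without any D-Bus library. The class ConnectionManager and its method names are hypothetical, and the sketch assumes clients are tracked by their unique bus name (such as ":1.42"), which is a variation on the object-path idea above. In a real service, connect would be an exported D-Bus method and on_name_owner_changed would be registered as a handler for the NameOwnerChanged signal, whose three string arguments it mirrors.

```python
import itertools


class ConnectionManager:
    """Tracks per-client connection object paths (plain-Python model, no D-Bus)."""

    def __init__(self):
        self._ids = itertools.count(1)
        self._paths_by_owner = {}  # client's unique bus name -> list of object paths

    def connect(self, owner):
        """Allocate a fresh, unused object path for the calling client."""
        path = f"/Connection/{next(self._ids)}"
        self._paths_by_owner.setdefault(owner, []).append(path)
        return path

    def on_name_owner_changed(self, name, old_owner, new_owner):
        """Return the object paths to clean up when a client leaves the bus."""
        # An empty new_owner means the name no longer has an owner: the client is gone.
        if new_owner == "":
            return self._paths_by_owner.pop(name, [])
        return []
```

A monotonically increasing counter is the simplest way to guarantee that a path is never reused while an old object with the same path might still exist.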
