Background
I finally convinced someone willing to share his full archival node 5868GiB database for free (which now requires to be built in ram and thus requires 100000$ worth of ram in order to be built but can be run on an ssd once done).
However he want to send it only through sending a single tar file over raw tcp using a rather slow (400Mps) connection for this task.
I m needing to get it on dropbox and as a result, he don’t want to use the https://www.dropbox.com/request/[my upload key here] allowing to upload files through a web browser without a dropbox account (it really annoyed him that I talked about using an other method or compressing the database to the point he is on the verge of changing his mind about sharing it).
Because on my side, dropbox allows using 10Tib of storage for free during 30 days and I didn’t receive the required ssd yet (so once received I will be able to download it using a faster speed).
The problem
I m fully aware of upload file to my dropbox from python script but in my case the file doesn t fit into a memory buffer not even on disk.
And previously in api v1 it wasn t possible to append data to an exisiting file (but I didn t find the answer for v2).
To upload a large file to the Dropbox API using the Dropbox Python SDK, you would use upload sessions to upload it in pieces. There's a basic example here.
Note that the Dropbox API only supports files up to 350 GB though.
One csv file is uploaded to the cloud storage everyday around 0200 hrs but sometime due to job fail or system crash file upload happens very late. So I want to create a cloud function that can trigger my python bq load script whenever the file is uploaded to the storage.
file_name : seller_data_{date}
bucket name : sale_bucket/
The question lacks enough description of the desired usecase and any issues the OP has faced. However, here are a few possible approaches that you might chose from depending on the usecase.
The simple way: Cloud Functions with Storage trigger.
This is probably the simplest and most efficient way of running a Python function whenever a file gets uploaded to your bucket.
The most basic tutorial is this.
The hard way: App Engine with a few tricks.
Having a basic Flask application hosted on GAE (Standard or Flex), with an endpoint specifically to handle this chek of the files existing, download object, manipulate it and then do something.
This route can act as a custom HTTP triggered function, where once it receives a request (could be from a simple curl request, visit from the browser, PubSub event, or even another Cloud Function).
Once it receives a GET (or POST) request, it downloads the object into the /tmp dir, process it and then do something.
The small benefit with GAE over CF is that you can set a minimum of one instance to stay always alive which means you will not have the cold starts, or risk the request timing out before the job is done.
The brutal/overkill way: Clour Run.
Similar approach to App Engine, but with Cloud Run you'll also need to work with the Dockerfile, have in mind that Cloud Run will scale down to zero when there's no usage, and other minor things that apply to building any application on Cloud Run.
########################################
For all the above approaches, some additional things you might want to achieve are the same:
a) Downloading the object and doing some processing on it:
You will have to download it to the /tmp directory as it's the directory for both GAE and CF to store temporary files. Cloud Run is a bit different here but let's not get deep into it as it's an overkill byitself.
However, keep in mind that if your file is large you might cause a high memory usage.
And ALWAYS clean that directory after you have finished with the file. Also when opening a file always use with open ... as it will also make sure to not keep files open.
b) Downloading the latest object in the bucket:
This is a bit tricky and it needs some extra custom code. There are many ways to achieve it, but the one I use (always tho paying close attention to memory usage), is upon the creation of the object I upload to the bucket, I get the current time, use Regex to transform it into something like results_22_6.
What happens now is that once I list the objects from my other script, they are already listed in an accending order. So the last element in the list is the latest object.
So basically what I do then is to check if the filename I have in /tmp is the same as the name of the object[list.length] in the bucket. If yes then do nothing, if no then delete the old one and download the latest one in the bucket.
This might not be optimal, but for me it's kinda preferable.
I have about 100 million json files (10 TB), each with a particular field containing a bunch of text, for which I would like to perform a simple substring search and return the filenames of all the relevant json files. They're all currently stored on Google Cloud Storage. Normally for a smaller number of files I might just spin up a VM with many CPUs and run multiprocessing via Python, but alas this is a bit too much.
I want to avoid spending too much time setting up infrastructure like a Hadoop server, or loading all of that into some MongoDB database. My question is: what would be a quick and dirty way to perform this task? My original thoughts were to set up something on Kubernetes with some parallel processing running Python scripts, but I'm open to suggestions and don't really have a clue how to go about this.
Easier would be to just load the GCS data into Big Query and just run your query from there.
Send your data to AWS S3 and use Amazon Athena.
The Kubernetes option would be set up a cluster in GKE and install Presto in it with a lot of workers, use a hive metastore with GCS and query from there. (Presto doesn't have direct GCS connector yet, afaik) -- This option seems more elaborate.
Hope it helps!
I would like to run a program on my laptop (Gazebo simulator) and send a stream of image data to a GCE instance, where it will be run through an object-detection network and sent back to my laptop in near real-time. Is such a set-up possible?
My best idea right now is, for each image:
Save the image as a JPEG on my personal machine
Stream the JPEG to a Cloud Storage bucket
Access the storage bucket from my GCE instance and transfer the file to the instance
In my python script, convert the JPEG image to numpy array and run through the object detection network
Save the detection results in a text file and transfer to the Cloud Storage bucket
Access the storage bucket from my laptop and download the detection results file
Convert the detection results file to a numpy array for further processing
This seems like a lot of steps, and I am curious if there are ways to speed it up, such as reducing the number of save and load operations or transporting the image in a better format.
If your question is "is it possible to set up such a system and do those actions in real time?" then I think the answer is yes I think so. If your question is "how can I reduce the number of steps in doing the above" then I am not sure I can help and will defer to one of the experts on here and can't wait to hear the answer!
I have implemented a system that I think is similar to what you describe for research of Forex trading algorithms (e.g. upload data to storage from my laptop, compute engine workers pull the data and work on it, post results back to storage and I download the compiled results from my laptop).
I used the Google PubSub architecture - apologies if you have already read up on this. It allows near-realtime messaging between programs. For example you can have code looping on your laptop that scans a folder that looks out for new images. When they appear it automatically uploads the files to a bucket and once theyre in the bucket it can send a message to the instance(s) telling them that there are new files there to process, or you can use the "change notification" feature of Google Storage buckets. The instances can do the work, send the results back to the storage and send a notification to the code running on your laptop that work is done and results are available for pick-up.
Note that I set this up for my project above and encountered problems to the point that I gave up with PubSub. The reason was that the Python Client Library for PubSub only supports 'asynchronous' message pulls, which seems to mean that the subscribers will pull multiple messages from the queue and process them in parallel. There are some features to help manage 'flow control' of messages built into the API, but even with them implemented I couldn't get it to work the way I wanted. For my particular application I wanted to process everything in order, one file at a time because it was important to me that I'm clear what the instance is doing and the order its doing it in. There are several threads on google search, StackOverflow and Google groups that discuss workarounds for this using queues, classes, allocating specific tasks for specific instances, etc which I tried, but even these presented problems for me. Some of these links are:
Run synchronous pull in PubSub using Python client API and pubsub problems pulling one message at a time and there are plenty more if you would like them!
You may find that if the processing of an image is relatively quick, order isn't too important and you don't mind an instance working on multiple things in parallel that my problems don't really apply to your case.
FYI, I ended up just making a simple loop on my 'worker instances' that scans the 'task list' bucket every 30 seconds or whatever to look for new files to process, but obviously this isn't quite the real-time approach that you were originally looking for. Good luck!
I was wondering when dealing with a web service API that returns XML, whether it's better (faster) to just call the external service each time and parse the XML (using ElementTree) for display on your site or to save the records into the database (after parsing it once or however many times you need to each day) and make database calls instead for that same information.
First off -- measure. Don't just assume that one is better or worse than the other.
Second, if you really don't want to measure, I'd guess the database is a bit faster (assuming the database is relatively local compared to the web service). Network latency usually is more than parse time unless we're talking a really complex database or really complex XML.
Everyone is being very polite in answering this question: "it depends"... "you should test"... and so forth.
True, the question does not go into great detail about the application and network topographies involved, but if the question is even being asked, then it's likely a) the DB is "local" to the application (on the same subnet, or the same machine, or in memory), and b) the webservice is not. After all, the OP uses the phrases "external service" and "display on your own site." The phrase "parsing it once or however many times you need to each day" also suggests a set of data that doesn't exactly change every second.
The classic SOA myth is that the network is always available; going a step further, I'd say it's a myth that the network is always available with low latency. Unless your own internal systems are crap, sending an HTTP query across the Internet will always be slower than a query to a local DB or DB cluster. There are any number of reasons for this: number of hops to the remote server, outage or degradation issues that you can't control on the remote end, and the internal processing time for the remote web service application to analyze your request, hit its own persistence backend (aka DB), and return a result.
Fire up your app. Do some latency and response times to your DB. Now do the same to a remote web service. Unless your DB is also across the Internet, you'll notice a huge difference.
It's not at all hard for a competent technologist to scale a DB, or for you to completely remove the DB from caching using memcached and other paradigms; the latency between servers sitting near each other in the datacentre is monumentally less than between machines over the Internet (and more secure, to boot). Even if achieving this scale requires some thought, it's under your control, unlike a remote web service whose scaling and latency are totally opaque to you. I, for one, would not be too happy with the idea that the availability and responsiveness of my site are based on someone else entirely.
Finally, what happens if the remote web service is unavailable? Imagine a world where every request to your site involves a request over the Internet to some other site. What happens if that other site is unavailable? Do your users watch a spinning cursor of death for several hours? Do they enjoy an Error 500 while your site borks on this unexpected external dependency?
If you find yourself adopting an architecture whose fundamental features depend on a remote Internet call for every request, think very carefully about your application before deciding if you can live with the consequences.
Consuming the webservices is more efficient because there are a lot more things you can do to scale your webservices and webserver (via caching, etc.). By consuming the middle layer, you also have the options to change the returned data format (e.g. you can decide to use JSON rather than XML). Scaling database is much harder (involving replication, etc.) so in general, reduce hits on DB if you can.
There is not enough information to be able to say for sure in the general case. Why don't you do some tests and find out? Since it sounds like you are using python you will probably want to use the timeit module.
Some things that could effect the result:
Performance of the web service you are using
Reliability of the web service you are using
Distance between servers
Amount of data being returned
I would guess that if it is cacheable, that a cached version of the data will be faster, but that does not necessarily mean using a local RDBMS, it might mean something like memcached or an in memory cache in your application.
It depends - who is calling the web service? Is the web service called every time the user hits the page? If that's the case I'd recommend introducing a caching layer of some sort - many web service API's throttle the amount of hits you can make per hour.
Whether you choose to parse the cached XML on the fly or call the data from a database probably won't matter (unless we are talking enterprise scaling here). Personally, I'd much rather make a simple SQL call than write a DOM Parser (which is much more prone to exceptional scenarios).
It depends from case to case, you'll have to measure (or at least make an educated guess).
You'll have to consider several things.
Web service
it might hit database itself
it can be cached
it will introduce network latency and might be unreliable
or it could be in local network and faster than accessing even local disk
DB
might be slow since it needs to access disk (although databases have internal caches, but those are usually not targeted)
should be reliable
Technology itself doesn't mean much in terms of speed - in one case database parses SQL, in other XML parser parses XML, and database is usually acessed via socket as well, so you have both parsing and network in either case.
Caching data in your application if applicable is probably a good idea.
As a few people have said, it depends, and you should test it.
Often external services are slow, and caching them locally (in a database in memory, e.g., with memcached) is faster. But perhaps not.
Fortunately, it's cheap and easy to test.
Test definitely. As a rule of thumb, XML is good for communicating between apps, but once you have the data inside of your app, everything should go into a database table. This may not apply in all cases, but 95% of the time it has for me. Anytime I ever tried to store data any other way (ex. XML in a content management system) I ended up wishing I would have just used good old sprocs and sql server.
It sounds like you essentially want to cache results, and are wondering if it's worth it. But if so, I would NOT use a database (I assume you are thinking of a relational DB): RDBMSs are not good for caching; even though many use them. You don't need persistence nor ACID.
If choice was between Oracle/MySQL and external web service, I would start with just using service.
Instead, consider real caching systems; local or not (memcache, simple in-memory caches etc).
Or if you must use a DB, use key/value store, BDB works well. Store response message in its serialized form (XML), try to fetch from cache, if not, from service, parse. Or if there's a convenient and more compact serialization, store and fetch that.