I have large csv files and excel files where I read them and create the needed create table script dynamically depending on the fields and types it has. Then insert the data to the created table.
I have read this and understood that I should send them with jobs.insert() instead of tabledata.insertAll() for large amount of data.
This is how I call it (Works for smaller files not large ones).
result = client.push_rows(datasetname,table_name,insertObject) # insertObject is a list of dictionaries
When I use library's push_rows it gives this error in windows.
[Errno 10054] An existing connection was forcibly closed by the remote host
and this in ubuntu.
[Errno 32] Broken pipe
So when I went through BigQuery-Python code it uses table_data.insertAll().
How can I do this with this library? I know we can upload through Google storage but I need direct upload method with this.
When handling large files don't use streaming, but batch load: Streaming will easily handle up to 100,000 rows per second. That's pretty good for streaming, but not for loading large files.
The sample code linked is doing the right thing (batch instead of streaming), so what we see is a different problem: This sample code is trying to load all this data straight into BigQuery, but the uploading through POST part fails. gsutil has a more robust uploading algorithm than just a plain POST.
Solution: Instead of loading big chunks of data through POST, stage them in Google Cloud Storage first, then tell BigQuery to read files from GCS.
See also BigQuery script failing for large file
Related
We need to move our video file storage to AWS S3. The old location is a cdn, so I only have url for each file (1000+ files, > 1TB total file size). Running an upload tool directly on the storage server is not an option.
I already created a tool that downloads the file, uploads file to S3 bucket and updates the DB records with new HTTP url and works perfectly except it takes forever.
Downloading the file takes some time (considering each file close to a gigabyte) and uploading it takes longer.
Is it possible to upload the video file directly from cdn to S3, so I could reduce processing time into half? Something like reading chunk of file and then putting it to S3 while reading next chunk.
Currently I use System.Net.WebClient to download the file and AWSSDK to upload.
PS: I have no problem with internet speed, I run the app on a server with 1GBit network connection.
No, there isn't a way to direct S3 to fetch a resource, on your behalf, from a non-S3 URL and save it in a bucket.
The only "fetch"-like operation S3 supports is the PUT/COPY operation, where S3 supports fetching an object from one bucket and storing it in another bucket (or the same bucket), even across regions, even across accounts, as long as you have a user with sufficient permission for the necessary operations on both ends of the transaction. In that one case, S3 handles all the data transfer, internally.
Otherwise, the only way to take a remote object and store it in S3 is to download the resource and then upload it to S3 -- however, there's nothing preventing you from doing both things at the same time.
To do that, you'll need to write some code, using presumably either asynchronous I/O or threads, so that you can simultaneously be receiving a stream of downloaded data and uploading it, probably in symmetric chunks, using S3's Multipart Upload capability, which allows you to write individual chunks (minimum 5MB each) which, with a final request, S3 will validate and consolidate into a single object of up to 5TB. Multipart upload supports parallel upload of chunks, and allows your code to retry any failed chunks without restarting the whole job, since the individual chunks don't have to be uploaded or received by S3 in linear order.
If the origin supports HTTP range requests, you wouldn't necessarily even need to receive a "stream," you could discover the size of the object and then GET chunks by range and multipart-upload them. Do this operation with threads or asynch I/O handling multiple ranges in parallel, and you will likely be able to copy an entire object faster than you can download it in a single monolithic download, depending on the factors limiting your download speed.
I've achieved aggregate speeds in the range of 45 to 75 Mbits/sec while uploading multi-gigabyte files into S3 from outside of AWS using this technique.
This has been answered by me in this question, here's the gist:
object = Aws::S3::Object.new(bucket_name: 'target-bucket', key: 'target-key')
object.upload_stream do |write_stream|
IO.copy_stream(URI.open('http://example.com/file.ext'), write_stream)
end
This is no 'direct' pull-from-S3, though. At least this doesn't download each file and then uploads in serial, but streams 'through' the client. If you run the above on an EC2 instance in the same region as your bucket, I believe this is as 'direct' as it gets, and as fast as a direct pull would ever be.
if a proxy ( node express ) is suitable for you then the portions of code at these 2 routes could be combined to do a GET POST fetch chain, retreiving then re-posting the response body to your dest. S3 bucket.
step one creates response.body
step two
set the stream in 2nd link to response from the GET op in link 1 and you will upload to dest.bucket the stream ( arrayBuffer ) from the first fetch
Background
I finally convinced someone willing to share his full archival node 5868GiB database for free (which now requires to be built in ram and thus requires 100000$ worth of ram in order to be built but can be run on an ssd once done).
However he want to send it only through sending a single tar file over raw tcp using a rather slow (400Mps) connection for this task.
I m needing to get it on dropbox and as a result, he don’t want to use the https://www.dropbox.com/request/[my upload key here] allowing to upload files through a web browser without a dropbox account (it really annoyed him that I talked about using an other method or compressing the database to the point he is on the verge of changing his mind about sharing it).
Because on my side, dropbox allows using 10Tib of storage for free during 30 days and I didn’t receive the required ssd yet (so once received I will be able to download it using a faster speed).
The problem
I m fully aware of upload file to my dropbox from python script but in my case the file doesn t fit into a memory buffer not even on disk.
And previously in api v1 it wasn t possible to append data to an exisiting file (but I didn t find the answer for v2).
To upload a large file to the Dropbox API using the Dropbox Python SDK, you would use upload sessions to upload it in pieces. There's a basic example here.
Note that the Dropbox API only supports files up to 350 GB though.
TL;DR: Trying to put .json files into S3 bucket using Boto3, process is very slow. Looking for ways to speed it up.
This is my first question on SO, so I apologize if I leave out any important details. Essentially I am trying to pull data from Elasticsearch and store it in an S3 bucket using Boto3. I referred to this post to pull multiple pages of ES data using the scroll function of the ES Python client. As I am scrolling, I am processing the data and putting it in the bucket as a [timestamp].json format, using this:
s3 = boto3.resource('s3')
data = '{"some":"json","test":"data"}'
key = "path/to/my/file/[timestamp].json"
s3.Bucket('my_bucket').put_object(Key=key, Body=data)
While running this on my machine, I noticed that this process is very slow. Using line profiler, I discovered that this line is consuming over 96% of the time in my entire program:
s3.Bucket('my_bucket').put_object(Key=key, Body=data)
What modification(s) can I make in order to speed up this process? Keep in mind, I am creating the .json files in my program (each one is ~240 bytes) and streaming them directly to S3 rather than saving them locally and uploading the files. Thanks in advance.
Since you are potentially uploading many small files, you should consider a few items:
Some form of threading/multiprocessing. For example you can see How to upload small files to Amazon S3 efficiently in Python
Creating some form of archive file (ZIP) containing sets of your small data blocks and uploading them as larger files. This is of course dependent on your access patterns. If you go this route, be sure to use the boto3 upload_file or upload_fileobj methods instead as they will handle multi-part upload and threading.
S3 performance implications as described in Request Rate and Performance Considerations
Please pardon my ignorance if this question may sound silly the expert audience here
Currently as per my use case
I am performing certain analysis on the data present in aws redshift tables and saving them a csv file in s3 buckets
(operation is some what similar to Pivot for redshift database)
and after that i am updating data back to redshift db using copy command
Currently after performing analysis (which is done in python3) for 200 csv files are generated which are saved in 200 different table in redshift
The count of csv would keep on increasing with time
Currently the whole process takes about 50-60 minutes to complete
25 minutes to get approx 200 csv and update them in s3 buckets
25 minutes to update the approx 200 csv into 200 aws redshift tables
The size of csv vary form few MB to 1GB
I was looking for tools or aws technologies which can help me reduce my time
*additional info
Structure of csv keeps on changing .Hence i have to drop and create tables again
This would be a repetitive tasks and would be executed in every 6hours
You can achieve a significant speed-up by:
Using multi-part upload of CSV to S3, so instead of waiting for a single file to upload, multi-part upload will upload the file to S3 in-parallel, saving you considerable time. Read about it here and here. Here is the Boto3 reference for it.
Copying data into Redshift from S3, in parallel. If you split your file in multiple parts, and then run the COPY command, the data will be loaded from multiple files in parallel, instead of waiting for 1 GB file to load, which might be really slow. Read more about it here.
Hope this helps.
You should explore Athena. It's a tool that comes within the AWS package and gives you the flexibility to query csv (or even gzip) files.
It'll save you the time you take to manually copy the data in the Redshift tables and you'll be able to query the dataset from the csv itself. Athena has the ability to query them from an s3 bucket.
However, still in the development phase, you'll have to spend sometime with it as it's not very user friendly. A syntax error in your query logs you out from your AWS session rather than throwing a syntax error. Moreover, you'll not find too many documentation and developer talks over the internet since Athena is still largely unexplored.
Athena charges you depending upon the data that your query fetches and is thus, more pocket friendly. If the query fails to execute, Amazon wouldn't charge you.
I am trying to write a data migration script moving data from one database to another (Teradata to snowflake) using JDBC cursors.
The table I am working on has about 170 million records and I am running into the issue where when I execute the batch insert a maximum number of expressions in a list exceeded, expected at most 16,384, got 170,000,000.
I was wondering if there was any way around this or if there was a better way to batch migrate records without exporting the records to a file and moving it to s3 to be consumed by the snowflake.
If your table has 170M records, then using JDBC INSERT to Snowflake is not feasible. It would perform millions of separate insert commands to the database, each requiring a round-trip to the cloud service, which would require hundreds of hours.
Your most efficient strategy would be to export from Teradata into multiple delimited files -- say with 1 - 10 million rows each. You can then either use the Amazon's client API to move the files to S3 using parallelism, or use Snowflake's own PUT command to upload the files to Snowflake's staging area for your target table. Either way, you can then load the files very rapidly using Snowflake's COPY command once they are in your S3 bucket or Snowflake's staging area.