copy large data to GCS

copy large data to GCS - python

I have 10001920 images.And their name is train_0, train_1, ....
I tried to copy them like
!gsutil -m cp -r /content/train/* gs://{my_bucket_name}/data
And I failed b.c it was too long. So I decided to use wild card like
!gsutil -m cp -r /content/train/train_1????.png gs://{my_bucket_name}/data
And I wanted to upload iterative way. After using 'for statement' to generate command line,
for script in script_list:
os.system(script)
And returns
31512
I just wanna know how can I upload those huge files to GCS.
Please give me some ideas

I don't think * should be used. It's not used that way in the documentation. I'd just try:
!gsutil -m cp -r ./content/train gs://{my_bucket_name}/data
This explains the failure number:
Also, although most commands normally fail upon encountering an error when the -m flag is disabled, all commands continue to try all operations when -m is enabled with multiple threads or processes, and the number of failed operations (if any) are reported as an exception at the end of the command's execution.

Related

Multiples Commands To CMD

I have to use The following command line tool: ncftp,
After that I need to
execute the following commands:
"open ftp//..."
"get -R Folder", but I need to do this automatically . How do I achieve this using Python or command line

You can use the Python subprocess module for this.
from subprocess import Popen, PIPE
# if you don't want your script to print the output of the ncftp
# commands, use Popen(['ncftp'], stdin=PIPE, stdout=PIPE)
with Popen(['ncftp'], stdin=PIPE) as proc:
proc.stdin.write(b"open ...\n") # must terminate each command with \n
proc.stdin.write(b"get -R Folder\n")
# ...etc
See the documentation for subprocess for more information. It can be a little tricky to get the hang of this library, but it's very versatile.
Alternatively, you can use the non-interactive commands ncftpget (docs) and ncftpput (docs) from the NcFTP package.
I recommend reading through the documentation on these commands before proceeding.
In the comments, you said you needed to get some files, delete those files and after upload some new files. Here's how you can do that:
$ ncftpget -R -DD -u username -p password ftp://server path/to/local/directory path/to/remote/directory/Folder
$ ncftpput -R -u username -p password ftp://server path/to/remote/directory path/to/local/directory/Folder
-DD will delete all files after downloading, but it will leave the directory and any subdirectories in place
If you need to delete the empty folder, you can run the ncftpget command again without -R (but the folder must be completely empty, i.e. no subdirectories, so rinse and repeat as necessary).
You can do this in a bash script or using subprocess.run in Python.

Execute bash commands that are within a list from python

i got this list
commands = ['cd var','cd www','cd html','sudo rm -r folder']
I'm trying to execute one by one all the elements inside as a bash script, with no success. Do i need a for loop here?
how to achieve that?, thanks all!!!!

for command in commands:
os.system(command)
is one way you could do it ... although just cd'ing into a bunch of directories isnt going to have much impact
NOTE this will run each command in its own subshell ... so they would not remember their state (ie any directory changes or environmental variables)
if you need to run them all in one subshell than you need to chain them together with "&&"
os.system(" && ".join(commands)) # would run all of the commands in a single subshell
as noted in the comments, in general it is preferred to use subprocess module with check_call or one of the other variants. however in this specific instance i personally think that you are in a 6 to 1 half a dozen to the other, and os.system was less typing (and its gonna exist whether you are using python3.7 or python2.5 ... but in general use subprocess exactly which call probably depends on the version of python you are using ... there is a great description in the post linked in the comments by #triplee why you should use subprocess instead)
really you should reformat your commands to simply
commands = ["sudo rm -rf var/www/html/folder"] note that you will probably need to add your python file to your sudoers file
also Im not sure exactly what you are trying to accomplish here ... but i suspect this might not be the ideal way to go about it (although it should work...)

This is just a suggestion, but if your just wanting to change directories and delete folders, you could use os.chdir() and shutil.rmtree():
from os import chdir
from os import getcwd
from shutil import rmtree
directories = ['var','www','html','folder']
print(getcwd())
# current working directory: $PWD
for directory in directories[:-1]:
chdir(directory)
print(getcwd())
# current working directory: $PWD/var/www/html
rmtree(directories[-1])
Which will cd three directories deep into html, and delelte folder. The current working directory changes when you call chdir(), as seen when you call os.getcwd().

declare -a command=("cd var","cd www","cd html","sudo rm -r folder")
## now loop through the above array
for i in "${command[#]}"
do
echo "$i"
# or do whatever with individual element of the array
done
# You can access them using echo "${arr[0]}", "${arr[1]}" also

Skyfield.api loader behaves differently in docker container

I wish to specify to Skyfield a download directory as documented here :
http://rhodesmill.org/skyfield/files.html
Here is my script:
from skyfield.api import Loader
load = Loader('~/data/skyfield')
# Next line downloads deltat.data, deltat.preds, Leap_Second.dat in ~/data/skyfield
ts = load.timescale()
t = ts.utc(2017,9,13,0,0,0)
stations_url = 'http://celestrak.com/NORAD/elements/stations.txt'
# Next line downloads stations.txt in ~/data/skyfield AND deltat.data, deltat.preds, Leap_Second.dat in $PWD !!!
satellites = load.tle(stations_url)
satellite = satellites['ISS (ZARYA)']
Expected behaviour (works fine outside docker)
The 3 deltat files (deltat.data, deltat.preds and Leap_Second.dat) are downloaded in ~/data/skyfield with load.timescale() and stations.txt is downloaded at the same place with load.tle(stations_url)
Behaviour when run in a container
The 3 deltat files get downloaded twice :
one time in the specified folder at the call load.timescale()
another time in the current directory at the call load.tle(stations_url)
This is frustrating because they already exist at this point and they pollute current directory. Note that stations.txt end up in the right place (~/data/skyfield)
If the container is ran interactively, then calling exec(open("script.py").read()) in a python shell gives a normal behaviour again. Can anyone reproduce this issue? It is hard to tell wether it comes from python, docker or skyfield.
The dockerfile is just these 2 lines:
FROM continuumio/anaconda3:latest
RUN conda install -c astropy astroquery && conda install -c anaconda ephem=3.7.6.0 && pip install skyfield
Then (assuming the built image is tagged astro) I run it with :
docker run --rm -w /tmp/working -v $PWD:/tmp/working astro:latest python script.py
And here is the output (provided the folders are empty before the run):
[#################################] 100% deltat.data
[#################################] 100% deltat.preds
[#################################] 100% Leap_Second.dat
[#################################] 100% stations.txt
[#################################] 100% deltat.data
[#################################] 100% deltat.preds
[#################################] 100% Leap_Second.dat
EDIT
Adding -t to docker run did not solve the issue but helped to even illustrate it better. I think it may come from Skyfield because some recent issues on github seem quite similar although not exactly the same.

The simple solution here is to add -t to your docker run command to allocate a pseudo TTY:
docker run --rm -t -w /tmp/working -v $PWD:/tmp/working astro:latest python script.py
What you are seeing is caused by the way the lines are printed and buffering of non-TTY based stdout. The percentage up to 100% is likely printed on a line without newlines. Then after 100% it is printed again with a newline. With buffering, this causes it to be printed twice.
When you run the same command with a TTY, there is no buffering and the lines are printed realtime so the newlines actually work as desired.
The code path isn't actually running twice :)
See Docker run with pseudoTTY (-t) gives instant stdout, buffering happens without it for another explanation (possibly better than mine).

Error in check_call() subprocess, executing 'mv' unix command: "Syntax error: '(' unexpected"

I'm making a python script for Travis CI.
.travis.yml
...
script:
- support/travis-build.py
...
The python file travis-build.py is something like this:
#!/usr/bin/env python
from subprocess import check_call
...
check_call(r"mv !(my_project|cmake-3.0.2-Darwin64-universal) ./my_project/final_folder", shell=True)
...
When Travis building achieves that line, I'm getting an error:
/bin/sh: 1: Syntax error: "(" unexpected
I just tried a lot of different forms to write it, but I get the same result. Any idea?
Thanks in advance!
Edit
My current directory layout:
- my_project/final_folder/
- cmake-3.0.2-Darwin64-universal/
- fileA
- fileB
- fileC
I'm trying with this command to move all the current files fileA, fileB and fileC, excluding my_project and cmake-3.0.2-Darwin64-universal folders into ./my_project/final_folder. If I execute this command on Linux shell, I get my aim but not through check_call() command.
Note: I can't move the files one by one, because there are many others
I don't know which shell Travis are using by default because I don't specify it, I only know that if I write the command in my .travis.yml:
.travis.yml
...
script:
# Here is the previous Travis code
- mv !(my_project|cmake-3.0.2-Darwin64-universal) ./my_project/final_folder
...
It works. But If I use the script, it fails.
I found this command from the following issue:
How to use 'mv' command to move files except those in a specific directory?

You're using the bash feature extglob, to try to exclude the files that you're specifying. You'll need to enable it in order to have it exclude the two entries you're specifying.
The python subprocess module explicitly uses /bin/sh when you use shell=True, which doesn't enable the use of bash features like this by default (it's a compliance thing to make it more like original sh).
If you want to get bash to interpret the command; you have to pass it to bash explicitly, for example using:
subprocess.check_call(["bash", "-O", "extglob", "-c", "mv !(my_project|cmake-3.0.2-Darwin64-universal) ./my_project/final_folder"])
I would not choose to do the job in this manner, though.

Let me try again: in which shell do you expect your syntax !(...) to work? Is it bash? Is it ksh? I have never used it, and a quick search for a corresponding bash feature led nowhere. I suspect your syntax is just wrong, which is what the error message is telling you. In that case, your problem is entirely independent form python and the subprocess module.
If a special shell you have on your system supports this syntax, you need to make sure that Python is using the same shell when invoking your command. It tells you which shell it has been using: /bin/sh. This is usually just a link to the real shell executable. Does it point to the same shell you have tested your command in?
Edit: the SO solution you referenced contains the solution in the comments:
Tip: Note however that using this pattern relies on extglob. You can
enable it using shopt -s extglob (If you want extended globs to be
turned on by default you can add shopt -s extglob to .bashrc)
Just to demonstrate that different shells might deal with your syntax in different ways, first using bash:
$ !(uname)
-bash: !: event not found
And then, using /bin/dash:
$ !(uname)
Linux

The argument to a subprocess.something method must be a list of command line arguments. Use e.g. shlex.split() to make the string be split into correct command line arguments:
import shlex, subprocess
subprocess.check_call( shlex.split("mv !(...)") )
EDIT:
So, the goal is to move files/directories, with the exemption of some file(s)/directory(ies). By playing around with bash, I could get it to work like this:
mv `ls | grep -v -e '\(exclusion1\|exclusion2\)'` my_project
So in your situation that would be:
mv `ls | grep -v -e '\(myproject\|cmake-3.0.2-Darwin64-universal\)'` my_project
This could go into the subprocess.check_call(..., shell=True) and it should do what you expect it to do.

How to ignore files starting with a period using watchmedo?

My file editor creates temporary files prefixed with a ..
I am running:
watchmedo shell-command -p '*.py' -R -c 'echo "${watch_src_path}"'
I see events for the temporary files as I am editing, then two events on file save (presumably because it does a delete and write).
I would like to see one event -- only when I save a file.
Is there a way for me to do this with just the CLI? I am not interested in creating a python script and using the watchdog API directly.

Use the --ignore-patterns (-i) switch.
watchmedo shell-command \
-p'*.py' \
-R \
-c'echo "${watch_src_path}"'\
--ignore-patterns="*/.*"
Note that watchmedo is matching on the full watch_src_path so your ignore pattern can't be as simple as ".*" like you'd think at first. Also all the pitfalls of wildcards are in effect, so if you were doing something silly like working in a hidden directory /path/to/some/.hidden/dir then you'd have to have a fancier pattern.
You also might want the --ignore-directories (-D) switch if the directory-related event is causing you annoyance too (this one is just boolean, no argument needed).

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

copy large data to GCS - python

Related

Multiples Commands To CMD

Execute bash commands that are within a list from python

Skyfield.api loader behaves differently in docker container

Error in check_call() subprocess, executing 'mv' unix command: "Syntax error: '(' unexpected"

How to ignore files starting with a period using watchmedo?

Categories

Resources