Writing csv files to local host from a docker container

I am trying to set up a very basic data processing project where I use Docker to create an Ubuntu environment on an EC2 instance, install Python, take an input CSV, perform some simple data manipulation, then write the data to a new CSV in the folder where the input was. I have been able to run my Python code successfully locally, as well as directly on the EC2 instance, but when I run it in the Docker container, the data appears to be processed (my script prints out the data), but the results are not saved at the end of the run. Is there a command I am missing from my Dockerfile that is causing the results not to be saved? Alternatively, is there a way I can save the output directly to an S3 bucket?
EDIT: The path to the input files is "/home/ec2-user/docker_test/data" and the path to the code is "/home/ec2-user/docker_test/code". After the data is processed, I want the result to be written as a new file in the "/home/ec2-user/docker_test/data" directory on the host.
Dockerfile:
FROM ubuntu:latest
RUN apt-get update \
&& apt-get install -y --no-install-recommends software-properties-common \
&& add-apt-repository -y ppa:deadsnakes/ppa \
&& apt-get update \
&& apt-get install -q -y --no-install-recommends python3.6 python3.6-dev python3-pip python3-setuptools \
&& apt-get clean \
&& rm -rf /var/lib/apt/lists/*
VOLUME /home/ec2-user/docker_test/data
VOLUME /home/ec2-user/docker_test/code
WORKDIR /home/ec2-user/docker_test/
COPY requirements.txt ./
RUN cat requirements.txt | xargs -n 1 -L 1 python3.6 -m pip install --no-cache-dir
COPY . .
ENV LC_ALL C.UTF-8
ENV LANG=C.UTF-8
CMD python3.6 main.py
Python Script:
import pandas as pd
import os
from code import processing
path = os.getcwd()
def main():
    df = pd.read_csv(path + '/data/table.csv')
    print('input df: \n{}'.format(df))
    df_out = processing.processing(df)
    df_out.to_csv(path + '/data/updated_table.csv', index = False)
    print('\noutput df: \n{}'.format(df_out))
if __name__ == '__main__':
    main()
EDIT: I have been running the dockerfile with "docker run docker_test"

Ok, gotcha, with the edit about expectations of the CSV being output to the host, we do have a problem with how this is set up.
You've got two VOLUMEs declared in your Dockerfile, which is fine. With a plain docker run, each of these becomes an anonymous volume managed by Docker. Docker-managed volumes are great for persisting data outside the container's own filesystem on a single host, but you can't easily browse them from your host as if they were a normal directory.
If you want the file to show up on your host, you can create a bind mounted volume at runtime, which maps a path in your host filesystem to a path in the Docker container's filesystem.
docker run -v $(pwd):/home/ec2-user/docker_test/data docker_test will do this. On a *nix system, $(pwd) evaluates to the directory you run the command from, so run it from the host's data directory (/home/ec2-user/docker_test/data in your case) or substitute that path directly. Adjust as needed if your host is Windows.
With a volume set up this way, when the CSV is created in the container file system at the location you intend, it will be accessible on your host in the location relative to however you've mapped it.
Read up on volumes. They're vital to using Docker and not hard to grasp at first glance, but there are some gotchas in the details.
Regarding uploading to S3, I would recommend using the boto3 library and doing it in your Python script. You could also use something like s3cmd if you find that simpler.
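A minimal sketch of the boto3 route, assuming the container has AWS credentials available (for example through an instance role on the EC2 host). The function name, the bucket name in the usage note, and the optional client parameter are hypothetical; upload_file is the standard boto3 S3 client method. The client parameter only exists so the function can be exercised without network access.

```python
import os


def upload_csv(local_path, bucket, key=None, client=None):
    """Upload a local file to S3 and return the object key that was used."""
    if client is None:
        # Imported lazily so the module still loads where boto3 is absent.
        import boto3
        client = boto3.client("s3")
    if key is None:
        # Default to the file's base name as the object key.
        key = os.path.basename(local_path)
    client.upload_file(local_path, bucket, key)
    return key
```

In the script from the question, this could be called right after df_out.to_csv(...), for instance upload_csv(path + '/data/updated_table.csv', 'my-example-bucket'), where the bucket name is a placeholder.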

You could use S3FS Fuse to mount the S3 bucket as a drive in your docker container. This basically creates a folder on your filesystem that is actually the S3 bucket. Anything that you save/modify in that folder will be reflected in the S3 bucket.
If you delete the docker container or unmount the drive you still have your S3 bucket intact, so you don't need to worry too much about erasing files in the S3 bucket through normal docker use.

Related

Docker : Add a valid entrypoint for multiple python script

Hello, I have to build a Docker image for the following bioinformatics tool: https://github.com/CAMI-challenge/CAMISIM. Their Dockerfile works but takes a long time to build, and I would like to build my own, slightly differently, to learn. I face some issues: there are several Python scripts that I should be able to choose from at run time, not only a main one. If I add one script in particular as an ENTRYPOINT, then the behavior isn't exactly what it should be.
The Dockerfile:
FROM ubuntu:20.04
ENV DEBIAN_FRONTEND=noninteractive
USER root
#COPY ./install_docker.sh ./
#RUN chmod +x ./install_docker.sh && sh ./install_docker.sh
RUN apt-get update && \
apt install -y git python3-pip libxml-simple-perl libncursesw5 && \
git clone https://github.com/CAMI-challenge/CAMISIM.git && \
pip3 install numpy ete3 biom-format biopython matplotlib joblib scikit-learn
ENTRYPOINT ["python3"]
ENV PATH="/CAMISIM/:${PATH}"
This yields :
sudo docker run camisim:latest metagenomesimulation.py --help
python3: can't open file 'metagenomesimulation.py': [Errno 2] No such file or directory
Adding that script as an ENTRYPOINT after python3 allows me to use it with 2 drawbacks: I cannot use another script (I could build a second docker image but that would be a bad solution), and it outputs:
ERROR: 0
usage: python metagenomesimulation.py configuration_file_path
#######################################
# MetagenomeSimulationPipeline #
#######################################
Pipeline for the simulation of a metagenome
optional arguments:
-h, --help show this help message and exit
-silent, --silent Hide unimportant Progress Messages.
-debug, --debug_mode more information, also temporary data will not be deleted
-log LOGFILE, --logfile LOGFILE
output will also be written to this log file
optional config arguments:
-seed SEED seed for random number generators
-s {0,1,2}, --phase {0,1,2}
available options: 0,1,2. Default: 0
0 -> Full run,
1 -> Only Comunity creation,
2 -> Only Readsimulator
-id DATA_SET_ID, --data_set_id DATA_SET_ID
id of the dataset, part of prefix of read/contig sequence ids
-p MAX_PROCESSORS, --max_processors MAX_PROCESSORS
number of available processors
required:
config_file path to the configuration file
You can see there is an error that shouldn't be there, and it does not actually use the help flag. The original Dockerfile is:
FROM ubuntu:20.04
RUN apt update
RUN apt install -y python3 python3-pip perl libncursesw5
RUN perl -MCPAN -e 'install XML::Simple'
ADD requirements.txt /requirements.txt
RUN cat requirements.txt | xargs -n 1 pip install
ADD *.py /usr/local/bin/
ADD scripts /usr/local/bin/scripts
ADD tools /usr/local/bin/tools
ADD defaults /usr/local/bin/defaults
WORKDIR /usr/local/bin
ENTRYPOINT ["python3"]
It works but shows the error as above, so not so much. Said error is not present when using the tool outside of docker. Last time I made a Docker image I just pulled the git repo and added the main .sh script as an ENTRYPOINT and everything worked despite being more complex (see https://github.com/Louis-MG/Metadbgwas).
Why would I need ADD and to move everything? I added the git folder to the PATH, so why can't the scripts be found? How is it different from the Metadbgwas image?

In your first setup, you start in the image root directory / and run git clone to check out the repository into /CAMISIM. You never change the current directory, though, so when you try to run python3 metagenomesimulation.py --help it's looking in / and not /CAMISIM, hence the "not found" error.
You can fix this just by changing the current directory. At any point after you check out the repository, run
WORKDIR /CAMISIM
You should also delete the ENTRYPOINT line. For each of the scripts you could run as a top-level entry point, check two things:
Is it executable; if you ls -l metagenomesimulation.py are there x in the permission listing? If not, on the host system, run chmod +x metagenomesimulation.py and commit to source control. (Or you could RUN chmod ... in the Dockerfile if you really can't change the repository.)
Does it have a "shebang" line? The very first line of the script should be
#!/usr/bin/env python3
If both of these things are true, then you can just run ./metagenomesimulation.py without explicitly saying python3; since you add the directory to $PATH as well, you can probably run it without specifying the ./... file location.
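The two checks above can also be scripted. This is only a sketch: check_script is a hypothetical helper name, and it tests the executable bit for the current user rather than reproducing the full ls -l listing.

```python
import os


def check_script(path):
    """Return (is_executable, has_shebang) for the script at `path`."""
    # True if the current user is allowed to execute the file.
    is_executable = os.access(path, os.X_OK)
    # A shebang must be the very first bytes of the file.
    with open(path, "rb") as f:
        first_line = f.readline()
    has_shebang = first_line.startswith(b"#!")
    return is_executable, has_shebang
```

Calling check_script("metagenomesimulation.py") from the repository directory would report both conditions at once; only if both are true can the script be run directly without naming python3.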
(Probably deleting the ENTRYPOINT line on its own is enough, given that ENV PATH setting, but your script still might be confused by starting up in the wrong directory.)
The long "help" output just suggests to me that the script is expecting a configuration file name as a parameter and you haven't provided it, or else you've repeated the script name in both the entrypoint and command parts of the container command string.

In the end very little was required and the original Dockerfile was correct; the same error is displayed anyway, which is due to the script itself.
What was missing was a link to the interpreter, so I could remove the ENTRYPOINT and actually interpret the script instead of having python look for it in its own path. The Dockerfile:
FROM ubuntu:20.04
ENV DEBIAN_FRONTEND=noninteractive
USER root
RUN ln -s /usr/bin/python3 /usr/bin/python
RUN apt-get update && \
apt install -y git python3-pip libxml-simple-perl libncursesw5 && \
git clone https://github.com/CAMI-challenge/CAMISIM.git && \
pip3 install numpy ete3 biom-format biopython matplotlib joblib scikit-learn
ENV PATH="/CAMISIM:${PATH}"
Trying WORKDIR as suggested instead of the PATH yielded an error.

Is it possible to share a volume with 2 docker containers?

I can't run 2 containers together, whereas I can run each one of them separately.
I have this 1st container/image related to this DockerFile
FROM debian:latest
RUN apt-get update && apt-get install python3-pip -y && pip3 install requests
ADD test1.py /app/container1/test1.py
WORKDIR /app/
CMD python3 container1/test1.py
I have this 2nd container/image related to this DockerFile
FROM debian:latest
RUN apt-get update && apt-get install python3-pip -y && pip3 install requests
ADD test2.py /app/container2/test2.py
WORKDIR /app/
CMD python3 container2/test2.py
No issues to create images:
docker image build ./authentif -t test1:latest
docker image build ./authoriz -t test2:latest
When I run the 1st container with this command:
docker container run -it --network my_network --name test1_container \
--mount type=volume,src=my_volume,dst=/app -e LOG=1 \
--rm test1:latest
it works.
And if I want to check my volume:
sudo ls /var/lib/docker/volumes/my_volume/_data
I can see data in my volume
However, when I want to run the 2nd container:
docker container run -it --network my_network --name test2_container \
--mount type=volume,src=my_volume,dst=/app -e LOG=1 \
--rm test2:latest
I have this error:
python3: can't open file '/app/container2/test2.py': [Errno 2] No such file or directory
If I delete everything and start over, then running the 2nd container first works, but if I then run the 1st container, I get the error again.
Why is that?
In container1, let's assume that my Python script writes data to a file, for example:
import os
print("test111111111")
if os.environ.get('LOG') == "1":
    print("1111111")
    with open('record.log', 'a') as file:
        file.write("file11111")

I can't reproduce your issue. When I start 2 containers using
docker run -d --rm -v myvolume:/app --name container1 debian tail -f /dev/null
docker run -d --rm -v myvolume:/app --name container2 debian tail -f /dev/null
and then do
docker exec container1 /bin/sh -c 'echo hello > /app/hello.txt'
docker exec container2 cat /app/hello.txt
it prints out 'hello' as expected.

You are mounting the volume over /app, the directory that contains your application code. That hides the code and replaces it with something else.
The absolute best approach here, if you can handle it, is to avoid sharing files at all. Keep the data somewhere like a relational database (which may be stateful). Don't mount anything on to your containers. Especially if you're looking forward to a clustered environment like Kubernetes, sharing files can be surprisingly tricky.
If you can't get rid of the shared directory, then put it somewhere other than /app. You might need to configure the alternate directory using an environment variable.
# mount the volume on /data, not /app
docker container run ... \
--mount type=volume,src=my_volume,dst=/data \
...
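On the Python side, the alternate directory could be read from an environment variable. This is a sketch under assumptions: the variable name DATA_DIR, the /data default, and the helper names are all hypothetical, and record.log is borrowed from the question's script.

```python
import os
from pathlib import Path


def data_dir(default="/data"):
    """Directory for shared files, overridable through DATA_DIR (hypothetical name)."""
    return Path(os.environ.get("DATA_DIR", default))


def append_log(message, filename="record.log"):
    """Append one line to a log file inside the shared data directory."""
    target = data_dir() / filename
    # Create the directory if the mount point does not exist yet.
    target.parent.mkdir(parents=True, exist_ok=True)
    with target.open("a") as f:
        f.write(message + "\n")
    return target
```

With this layout the application code stays in /app inside the image, and only the files the containers are meant to share live on the volume. The directory can be changed at run time with docker run -e DATA_DIR=... without rebuilding the image.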
What's actually happening in your setup is that Docker has a feature to copy the contents of the image into an empty named volume on first use. This only happens if the volume is completely empty and only with a named Docker volume, not with bind mounts, and it doesn't happen on other container systems like Kubernetes. (I'd discourage actually relying on this behavior.)
So when you run the first container, Docker sees that my_volume is empty and copies the test1 image's /app content into it; the container then sees the code it expects in /app and apparently works fine. The second container sees that my_volume is non-empty, and so the volume contents (with the first image's code) hide what was in the image (the second image's code). I'd expect, if you started from scratch, whichever of the two containers you started first would work, but not the other; and if you change the code in the working image, a new container won't see that change (it will use the code out of the volume).
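The sequence can be mimicked with a toy model in plain Python. This is only an illustration of the reasoning above, not how Docker is implemented: dictionaries stand in for the image's /app directory and for the named volume, and start_container is a made-up name.

```python
def start_container(image_app, volume):
    """Toy model: return the files a container sees at /app.

    image_app: dict of filename -> content baked into the image at /app.
    volume:    dict standing in for the named volume; mutated in place.
    """
    if not volume:
        # Empty named volume: image content is copied in on first use.
        volume.update(image_app)
    # The mount hides the image's own /app; only the volume is visible.
    return volume


my_volume = {}
seen_by_test1 = start_container({"container1/test1.py": "..."}, my_volume)
seen_by_test2 = start_container({"container2/test2.py": "..."}, my_volume)

print(sorted(seen_by_test2))                   # ['container1/test1.py']
print("container2/test2.py" in seen_by_test2)  # False
```

The second call finds the volume already populated, so test2.py is never visible, which matches the "No such file or directory" error from the question.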

Binary sometimes, but not always, fails in AWS Lambda

We are running LibreOffice to convert Office documents to PDF in AWS Lambda. This normally works well. However, sometimes it fails with a DeploymentException. When this happens, it fails for all invocations on a single Lambda host (meaning all invocations in a single log file).
This is the Dockerfile without our python code copying and CMD/entrypoint:
FROM public.ecr.aws/lambda/python:3.9
RUN yum install -y \
cups-libs \
cairo \
libSM \
jre \
tar
RUN curl --location -o libreoffice.rpm.tar.gz https://download.documentfoundation.org/libreoffice/stable/7.2.5/rpm/x86_64/LibreOffice_7.2.5_Linux_x86-64_rpm.tar.gz
RUN tar zxf libreoffice.rpm.tar.gz
RUN yum install -y LibreOffice_7.2.5.2_Linux_x86-64_rpm/RPMS/*.rpm
COPY requirements.txt .
RUN pip install -r requirements.txt && rm requirements.txt
The Office file is copied to local tmp space handled by tempfile.TemporaryDirectory to ensure we don't leave anything behind to fill up when the Lambda is reused.
This is the subprocess execution:
with tempfile.TemporaryDirectory(dir=work_directory) as work_directory_name:
    my_work_directory = Path(work_directory_name)
    tmp_filename = my_work_directory / original_filename.name
    with tmp_filename.open('wb') as tmp_file:
        tmp_file.write(data)
    result = subprocess.run(
        ['libreoffice7.2', '--headless', '--nologo', '--convert-to', 'pdf', '--outdir', my_work_directory, tmp_filename],
        env={
            'HOME': work_directory
        }
    )
When this fails, the subprocess returns code 1 and stderr is:
terminate called after throwing an instance of 'com::sun::star::deployment::DeploymentException'
Unspecified Application Error
This error indicates that there is something wrong with the installation. I have checked host info using both uname and /proc/cpuinfo and while the lambdas end up running on two variants of Xeon CPUs, there are no other differences.
My assumption is that LibreOffice somehow ends up depending on a dynamic library from the host, and that this particular library has different versions on different Lambda hosts.
Any suggestions to how this could be resolved?

Pass windows environmental variables to dockerized python app

I am running a python application that reads two paths from Windows env vars and proceeds to use the executables in those paths to do OCR on some documents. Since POPPLER, TESSERACT env vars are already set in Windows, this Python snippet works for me:
popplerPath = os.environ.get('POPPLER')
tesseractPath = os.environ.get('TESSERACT')
Now I am trying to dockerize the app, and, to my understanding, since my container will need access to those paths, I need to mount them using VOLUME during run. My dockerfile looks like this:
FROM python:3.7.7-slim
WORKDIR ./
COPY requirements.txt .
RUN pip install -r requirements.txt
COPY documents/ .
COPY src/ ./src
CMD [ "python", "./src/run.py" ]
I build the image using:
docker build -t ocr .
And I try to run my container using:
docker run -v %POPPLER%:%POPPLER% -v %TESSERACT%:%TESSERACT% ocr
... but my app still gets a None value for these paths and can't use the executable files. Is my approach correct and beyond that, is it a good dev practice?

See the doc, the switch for environment variable is -e:
$ docker run -e MYVAR1 --env MYVAR2=foo --env-file ./env.list ubuntu bash
and in dockerfile, you can use
ENV FOO=/bar
If I understand your statement correctly, your paths are mounted in the container at the same paths as on the host. The only problem is your Python script, which expects the paths to be provided by environment variables. These variables will not exist in the container unless you pass them on from your host system.
Once you have verified that the volumes mounted with -v are there correctly, you can try:
docker run -v %POPPLER%:%POPPLER% -v %TESSERACT%:%TESSERACT% --env POPPLER=%POPPLER% --env TESSERACT=%TESSERACT% ocr
or, if you always run it this way, you can consider putting them in your Dockerfile to save some keystrokes.

Any executable you call must be built into the image. Containers can't usually call executables on the host or in other containers. In the specific example you show, a Linux container can't run a Windows executable, even if you do use a bind mount to inject it into the container.
The "slim" python images are built on Debian GNU/Linux, and you need to use its APT tool to install these executable dependencies in your Dockerfile. (https://www.debian.org/distrib/packages has a search box to help you find the right package name; Ubuntu Linux also uses Debian packages.)
FROM python:3.7-slim
RUN apt-get update \
&& DEBIAN_FRONTEND=noninteractive \
apt-get install -y \
poppler-utils \
tesseract-ocr-all
COPY requirements.txt .
...
I'd suggest putting reasonable defaults in your code if these environment variables aren't set. The apt-get install command will put them in the system path inside the image.
popplerPath = os.environ.get('POPPLER', 'poppler')
tesseractPath = os.environ.get('TESSERACT', 'tesseract')
If you really need them as environment variables you could use the Dockerfile ENV directive
ENV POPPLER=poppler TESSERACT=tesseract
Environment variables from the host don't automatically get passed through to the container; you need a Dockerfile ENV or docker run -e option. Also remember that the container has an isolated filesystem (and Windows-syntax paths don't make sense in Linux containers) so these environment variables would need to be container paths, the second half of your proposed docker run -v option.
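The "environment variable with a default" idea can be combined with a check that the executable really is on the container's PATH. This is a sketch only: resolve_tool is a hypothetical helper name, while shutil.which is the standard-library lookup.

```python
import os
import shutil


def resolve_tool(env_name, default):
    """Return the full path of an executable named by an env var or a default."""
    candidate = os.environ.get(env_name, default)
    # shutil.which searches PATH, or checks the file itself if given a path.
    resolved = shutil.which(candidate)
    if resolved is None:
        raise FileNotFoundError(
            f"{candidate!r} (from {env_name} or its default) is not on PATH"
        )
    return resolved
```

For example, tesseractPath = resolve_tool('TESSERACT', 'tesseract') would fail at startup with a clear message if the package was not installed in the image, rather than later in the middle of an OCR run.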

Retrieving .csv file written by docker python program

I am trying to access the .csv file which my dockerized python program is making.
Here is my docker file:
# Use an official Python runtime as a parent image
FROM python:3.7
# Set the working directory to /BotCloud
WORKDIR /BotCloud
# Copy the current directory contents into the container at /BotCloud
ADD . /BotCloud
# Install any needed packages specified in requirements.txt
RUN pip install --trusted-host pypi.python.org -r requirements.txt
RUN wget http://prdownloads.sourceforge.net/ta-lib/ta-lib-0.4.0-src.tar.gz && \
tar -xvzf ta-lib-0.4.0-src.tar.gz && \
cd ta-lib/ && \
./configure --prefix=/usr && \
make && \
make install
RUN rm -R ta-lib ta-lib-0.4.0-src.tar.gz
RUN pip install ta-lib
# Run BotLiveSnake.py when the container launches
CMD ["python","-u", "BotLiveSnake.py"]
Here is the code snippet that is in my python file BotSnakeLive.py
def write(string):
    with open('outfile.csv','w') as f:
        f.write(string)
        f.write("\n")
write(str("Starting Time: "+datetime.datetime.utcfromtimestamp(int(df.tail(1)['Open Time'])/10**3).strftime('%Y-%m-%d,%H:%M:%SUTC'))+",Trading:"+str(pairing)+",Starting Money:"+str(money)+",SLpercent:"+str(SLpercent)+",TPpercent,"+str(TPpercent))
Running my python program locally, outfile.csv is created in the same folder as my python program. However, with docker, I'm not sure where this outfile ends up. Any help would be appreciated.

In general, references to file paths that don't start with / are always interpreted relative to the current working directory. Unless you've changed that somehow (os.chdir, an entrypoint script running cd, the docker run -w option) that will be the WORKDIR you declared in the Dockerfile.
So: your file should be in /BotCloud/outfile.csv, in the container's filesystem space.
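To confirm this from inside the program, the script can print where a relative name resolves. A small sketch, with where_will_it_land as a made-up helper name:

```python
import os
from pathlib import Path


def where_will_it_land(relative_name):
    """Absolute path that a relative filename refers to right now."""
    # Relative paths are resolved against the current working directory.
    return Path(os.getcwd()) / relative_name


print(where_will_it_land("outfile.csv"))
```

Inside the container from the question, where the working directory is /BotCloud, this would print /BotCloud/outfile.csv.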
Note that containers have their own isolated filesystem space that is destroyed when the container is deleted. If the primary way your application communicates is via files, it may be much easier to use a non-Docker mechanism, such as Python virtual environments, to isolate your application from the rest of the system. If you do stay with Docker, you can mount a host directory into the container with docker run -v, or use docker cp to copy files out. (With docker run -v in particular, it helps if the data is written somewhere other than your application's directory.)
