I have a directory with a bunch of CSV files. I want to import them all into MongoDB and then delete all of them.
On Ubuntu 14.04, the following would work:
for f in /home/v/scr/alerts/*;
do
mongoimport -d emails -c main --type csv --file "$f" --headerline && rm /home/v/scr/alerts/*;
done
However, I receive the following output now (on Ubuntu 16.04):
2018-01-29T21:49:00.752+0000 connected to: localhost
2018-01-29T21:49:00.759+0000 imported 1 document
2018-01-29T21:49:00.767+0000 Failed: open /home/v/scr/alerts/fH88Vaxr.csv: no such file or directory
2018-01-29T21:49:00.767+0000 imported 0 documents
2018-01-29T21:49:00.772+0000 Failed: open /home/v/scr/alerts/m45EkP9N.csv: no such file or directory
2018-01-29T21:49:00.772+0000 imported 0 documents
It appears that the first CSV file is imported correctly, but then everything is removed, which is not what I want.
After the first import you are removing all files from the source directory.
Delete only the file that has just been imported:
for f in /home/v/scr/alerts/*;
do
mongoimport -d emails -c main --type csv --file "$f" --headerline
rm "$f" # rm only the current file
done
Or delete all source files after they have been imported successfully:
for f in /home/v/scr/alerts/*;
do
mongoimport -d emails -c main --type csv --file "$f" --headerline
done
rm /home/v/scr/alerts/* #rm all files
Optionally, we can include a regular-file check as well:
if [ -f "$f" ]; then
# import
fi
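For completeness, here is roughly the same import-then-delete flow done entirely in Python; this is only a sketch using pymongo (the database, collection, and path mirror the question, and each file is removed only after its rows have been inserted):
import csv
import glob
import os

from pymongo import MongoClient  # pip install pymongo

client = MongoClient("localhost", 27017)
collection = client["emails"]["main"]

for path in glob.glob("/home/v/scr/alerts/*.csv"):
    with open(path) as fh:
        rows = list(csv.DictReader(fh))  # the header line becomes the field names
    if rows:
        collection.insert_many(rows)
    os.remove(path)  # remove only the file that was just imported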
curl -F user=aditya -F password=1234 -F date=20220516 -F format=csv -F report=sales -F type=IT -F family=SAAS -F version=4 https://www.yahoo.jsp > file.zip
Note: I have changed the data here, but the format is the same. I need the file to be downloaded from the website and saved to my Desktop. I have 4 classifications of files, each with multiple subclassifications.
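A rough Python sketch of the same request, assuming the -F fields are meant to be sent as multipart form data; the URL and field values are just the placeholders from the command above, and the result is written to the Desktop as file.zip:
import os

import requests  # pip install requests

url = "https://www.yahoo.jsp"  # placeholder URL from the question
form = {
    "user": "aditya", "password": "1234", "date": "20220516",
    "format": "csv", "report": "sales", "type": "IT",
    "family": "SAAS", "version": "4",
}

# curl -F sends multipart/form-data; (None, value) keeps each part a plain field
files = {name: (None, value) for name, value in form.items()}
response = requests.post(url, files=files)
response.raise_for_status()

# equivalent of "> file.zip", but saved to the Desktop
target = os.path.expanduser("~/Desktop/file.zip")
with open(target, "wb") as fh:
    fh.write(response.content)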
I have a bash script that extracts a tar file:
tar --no-same-owner -xzf "$FILE" -C "$FOLDER"
--no-same-owner is needed because this script runs as root in Docker and I want the files to be owned by root, rather than by the original uid/gid that created the tar.
I have changed the script to a Python script and need to add the --no-same-owner functionality, but I can't see an option in the docs to do so:
with tarfile.open(file_path, "r:gz") as tar:
tar.extractall(extraction_folder)
Is this possible? Or do I need to run the bash command as a subprocess?
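For what it's worth, one way to get the same effect with tarfile alone, as a sketch: tarfile only applies the archived uid/gid when the process runs as root, so overwriting the ownership on each member before extraction should leave the files owned by the extracting user (the two paths are placeholders standing in for the script's variables):
import os
import tarfile

file_path = "archive.tar.gz"        # placeholder for the real tar path
extraction_folder = "/some/folder"  # placeholder for the real destination

with tarfile.open(file_path, "r:gz") as tar:
    for member in tar.getmembers():
        # discard the ownership recorded in the archive so extraction
        # (which only chowns when running as root) keeps the current user
        member.uid = os.getuid()
        member.gid = os.getgid()
        member.uname = ""
        member.gname = ""
    tar.extractall(extraction_folder)
On Python 3.12+, passing filter="data" to extractall should have a similar ownership effect, since the data filter discards the archived owner information along with applying other safety restrictions.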
I just noticed strange behaviour in either Python, PySpark, or maybe even Hadoop.
I have accidentally created a folder with a backslash in its name on HDFS:
>hdfs dfs -ls -h
drwxr-xr-x   - user hdfs          0 2020-08-04 08:59 Q2\solution2
I'm using Spark version 2.3.0.2.6.5.0-292 with Python 2.7.5.
So here is what I have tried. Start pyspark2, then execute the following commands:
>import os
>os.system("hdfs dfs -rm -r -f 'Q2\solution2'")
0
The file/folder is not deleted!
However, when I execute the same command directly from the OS...
hdfs dfs -rm -r -f 'Q2\solution2'
The file/folder is deleted!
Can anyone explain why this is happening?
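One way to narrow it down is to take the shell layer out of the picture; a small sketch that passes the arguments as a list via subprocess, so the literal name Q2\solution2 reaches hdfs without any quoting or escaping in between:
import subprocess

# no shell involved: each argument is passed to hdfs verbatim,
# and the raw string keeps the backslash literal
ret = subprocess.call(["hdfs", "dfs", "-rm", "-r", "-f", r"Q2\solution2"])
print(ret)  # 0 if hdfs reported success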
Could you please show me how to implement a git hook?
Before committing, the hook should run a python script. Something like this:
cd c:\my_framework & run_tests.py --project Proxy-Tests\Aeries \
--client Aeries --suite <Commit_file_Name> --dryrun
If the dry run fails then commit should be stopped.
You need to tell us in what way the dry run will fail. Will there be an output .txt file with errors? Will there be an error displayed on the terminal?
In any case, you must name the pre-commit script pre-commit and save it in the .git/hooks/ directory.
Since your dry run script seems to be in a different path than the pre-commit script, here's an example that finds and runs your script.
I assume from the backslash in your path that you are on a Windows machine, and I also assume that your dry-run script lives in the same project where your git repository is, in a folder called tools (of course you can change this to your actual folder).
#!/bin/sh
#Directory containing your python script
FILE_PATH=tools
#Get the relative path of the root directory of the project
rdir=`git rev-parse --git-dir`
rel_path="$(dirname "$rdir")"
#Cd to that path and run the file.
cd "$rel_path/$FILE_PATH"
echo "Running dryrun script..."
python run_tests.py
#From that point on you need to handle the dry run error/s.
#For demonstration purposes I'll assume that an output.txt file that holds
#the result is produced.
#Extract the result from the output file
final_res=$(tac output.txt | grep -m 1 . | grep 'error')
echo -e "--------Dry run result---------\n${final_res}"
#If a warning and/or error exists, abort the commit
if [ -n "$final_res" ]; then
    echo -e "Dry run failed.\nAborting commit..."
    exit 1
fi
Now every time you fire git commit -m the pre-commit script will run the dry run file and abort the commit if any errors have occurred, keeping your files in the staging area.
I have implemented this in my hook. Here is the code snippet.
#!/bin/sh
#Path of your python script
RUN_TESTS="run_tests.py"
FRAMEWORK_DIR="/my-framework/"
CUR_DIR=`echo ${PWD##*/}`
#Get the full path of the root directory of the project into rDIR
rDIR=`git rev-parse --git-dir --show-toplevel | head -2 | tail -1`
OneStepBack=/../
CD_FRAMEWORK_DIR="$rDIR$OneStepBack$FRAMEWORK_DIR"
#Find list of modified files - to be committed
LIST_OF_FILES=`git status --porcelain | awk -F" " '{print $2}' | grep ".txt" `
for FILE in $LIST_OF_FILES; do
    cd "$CD_FRAMEWORK_DIR"
    python $RUN_TESTS --dryrun --project "$CUR_DIR/$FILE"
    OUT=$?
    if [ $OUT -eq 0 ]; then
        continue
    else
        exit 1
    fi
done
I have a directory in HDFS that contains roughly 10,000 .xml files. I have a Python script "processxml.py" that takes a file and does some processing on it. Is it possible to run the script on all of the files in the HDFS directory, or do I need to copy them to local storage first in order to do so?
For example, when I run the script on files in a local directory I have:
cd /path/to/files
for file in *.xml
do
python /path/processxml.py "$file" > "/path2/$file"
done
So basically, how would I go about doing the same, but this time the files are in hdfs?
You basically have two options:
1) Use the Hadoop streaming jar to create a MapReduce job (here you will only need the map part); a minimal mapper skeleton is sketched after the two options below. Use this command from the shell or inside a shell script:
hadoop jar <the location of the streamlib> \
-D mapred.job.name=<name for the job> \
-input /hdfs/input/dir \
-output /hdfs/output/dir \
-file your_script.py \
-mapper "python your_script.py" \
-numReduceTasks 0
2) Create a Pig script and ship your Python code. Here is a basic example for the script:
input_data = LOAD '/hdfs/input/dir';
DEFINE mycommand `python your_script.py` ship('/path/to/your/script.py');
updated_data = STREAM input_data THROUGH mycommand PARALLEL 20;
STORE updated_data INTO '/hdfs/output/dir';
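For the streaming option above, note that the mapper is fed the input records on stdin and whatever it prints to stdout ends up in the -output directory, so your_script.py needs roughly this stdin/stdout shape (a sketch; the actual processing logic is yours):
#!/usr/bin/env python
# minimal Hadoop-streaming mapper skeleton: read records from stdin,
# write results to stdout
import sys

def process(line):
    # placeholder for the real processing logic
    return line.strip()

if __name__ == "__main__":
    for line in sys.stdin:
        result = process(line)
        if result:
            print(result)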
If you need to process data in your files or move/cp/rm/etc. them around the file-system then PySpark (Spark with Python interface) would be one of the best options (speed, memory).
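A rough PySpark sketch of that last option, assuming each .xml file should be handled as a whole unit; process_xml is a stand-in for the logic in processxml.py and the HDFS paths are the example ones from above:
from pyspark import SparkContext

sc = SparkContext(appName="process-xml")

def process_xml(content):
    # placeholder: parse/transform the XML text and return the result
    return content

# wholeTextFiles yields (path, file content) pairs, one pair per file
files = sc.wholeTextFiles("/hdfs/input/dir/*.xml")
results = files.mapValues(process_xml)
results.saveAsTextFile("/hdfs/output/dir")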