PySpark cannot delete file from HDFS containing backslash - python

I just noticed strange behaviour in either Python, PySpark, or maybe even Hadoop.
I have accidentally created a folder with a backslash in its name on HDFS:
>hdfs dfs -ls -h
drwxr-xr-x -user hdfs 0 2020-08-04 08:59 Q2\solution2
I'm using Spark version 2.3.0.2.6.5.0-292 with Python 2.7.5.
So here is what I have tried. Start pyspark2, then execute the following commands:
>import os
>os.system("hdfs dfs -rm -r -f 'Q2\solution2'")
0
The file/folder is not deleted!
However, when I execute the same command directly from the OS shell...
hdfs dfs -rm -r -f 'Q2\solution2'
The file/folder is deleted!
Can anyone explain why this is happening?
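For what it is worth, one way to take the extra shell-quoting layer out of the picture while debugging is to call the HDFS CLI through subprocess with an argument list and look at its stderr. This is only an investigation sketch, not a confirmed explanation; the doubled-backslash note in the comment is an assumption to test, since the HDFS shell treats path arguments as glob patterns.

import subprocess

# Path copied from the question; if the HDFS shell interprets the backslash
# as a glob escape, doubling it ("Q2\\solution2") may be needed - untested.
path = r"Q2\solution2"

proc = subprocess.Popen(
    ["hdfs", "dfs", "-rm", "-r", "-f", path],
    stdout=subprocess.PIPE,
    stderr=subprocess.PIPE,
)
out, err = proc.communicate()
print(proc.returncode)
print(out)
print(err)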

Related

no-same-owner flag in python tar extract

I have a bash script that extracts a tar file:
tar --no-same-owner -xzf "$FILE" -C "$FOLDER"
--no-same-owner is needed because this script runs as root in Docker, and I want the files to be owned by root rather than by the original uid/gid that created the tar.
I have changed the script to a Python script and need to add the --no-same-owner functionality, but I can't see an option in the docs to do so:
with tarfile.open(file_path, "r:gz") as tar:
    tar.extractall(extraction_folder)
Is this possible? Or do I need to run the bash command as a subprocess?
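As far as I know, tarfile has no direct switch equivalent to tar's --no-same-owner: when the process runs as root, extractall restores the uid/gid stored in the archive. A minimal sketch of one workaround, assuming the script really does run as root in the container (the helper name extract_no_same_owner is just illustrative), is to reset ownership after extraction:

import os
import tarfile

def extract_no_same_owner(file_path, extraction_folder):
    """Extract a .tar.gz, then hand ownership of everything to root."""
    with tarfile.open(file_path, "r:gz") as tar:
        tar.extractall(extraction_folder)
        # Mimic --no-same-owner: chown every extracted entry to root:root.
        for member in tar.getmembers():
            target = os.path.join(extraction_folder, member.name)
            if os.path.lexists(target):
                os.lchown(target, 0, 0)

Running the original command via subprocess (e.g. subprocess.check_call(["tar", "--no-same-owner", "-xzf", file_path, "-C", extraction_folder])) is of course also a valid option.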

AWSGLUE python package - ls cannot access dir

I'm trying to install the local awsglue package for development purposes on my local machine (Windows + Git Bash):
https://github.com/awslabs/aws-glue-libs/tree/glue-1.0
https://support.wharton.upenn.edu/help/glue-debugging
The Spark directory and the py4j file mentioned in the error below do exist, but I still get the error.
The directory from which I trigger the .sh is below:
user#machine xxxx64~/Desktop/lm_aws_glue/aws-glue-libs-glue-1.0/bin
$ ./glue-setup.sh
ls: cannot access 'C:\Spark\spark-3.1.1-bin-hadoop2.7/python/lib/py4j-*-src.zip': No such file or directory
rm: cannot remove 'PyGlue.zip': No such file or directory
./glue-setup.sh: line 14: zip: command not found
ls result:
$ ls -l
total 7
-rwxr-xr-x 1 n1543781 1049089 135 May 5 2020 gluepyspark*
-rwxr-xr-x 1 n1543781 1049089 114 May 5 2020 gluepytest*
-rwxr-xr-x 1 n1543781 1049089 953 Mar 5 11:10 glue-setup.sh*
-rwxr-xr-x 1 n1543781 1049089 170 May 5 2020 gluesparksubmit*
The original install code requires a few tweaks and then works OK. I still need a workaround for zip (one possible approach is sketched after the script below).
#!/usr/bin/env bash
#original code
#ROOT_DIR="$(cd $(dirname "$0")/..; pwd)"
#cd $ROOT_DIR
#re-written
ROOT_DIR="$(cd /c/aws-glue-libs; pwd)"
cd $ROOT_DIR
SPARK_CONF_DIR=$ROOT_DIR/conf
GLUE_JARS_DIR=$ROOT_DIR/jarsv1
#original code
#PYTHONPATH="$SPARK_HOME/python/:$PYTHONPATH"
#PYTHONPATH=`ls $SPARK_HOME/python/lib/py4j-*-src.zip`:"$PYTHONPATH"
#re-written
PYTHONPATH="/c/Spark/spark-3.1.1-bin-hadoop2.7/python/:$PYTHONPATH"
PYTHONPATH=`ls /c/Spark/spark-3.1.1-bin-hadoop2.7/python/lib/py4j-*-src.zip`:"$PYTHONPATH"
# Generate the zip archive for glue python modules
rm PyGlue.zip
zip -r PyGlue.zip awsglue
GLUE_PY_FILES="$ROOT_DIR/PyGlue.zip"
export PYTHONPATH="$GLUE_PY_FILES:$PYTHONPATH"
# Run mvn copy-dependencies target to get the Glue dependencies locally
#mvn -f $ROOT_DIR/pom.xml -DoutputDirectory=$ROOT_DIR/jarsv1 dependency:copy-dependencies
export SPARK_CONF_DIR=${ROOT_DIR}/conf
mkdir $SPARK_CONF_DIR
rm $SPARK_CONF_DIR/spark-defaults.conf
# Generate spark-defaults.conf
echo "spark.driver.extraClassPath $GLUE_JARS_DIR/*" >> $SPARK_CONF_DIR/spark-defaults.conf
echo "spark.executor.extraClassPath $GLUE_JARS_DIR/*" >> $SPARK_CONF_DIR/spark-defaults.conf
# Restore present working directory
cd -
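As for the missing zip command: one possible workaround (my assumption, not part of the original setup script) is to build PyGlue.zip with Python's zipfile module instead of the zip binary. A rough sketch, using the same root directory as the rewritten script (adjust the path to however your Python installation sees /c/aws-glue-libs, e.g. C:/aws-glue-libs):

import os
import zipfile

# Same ROOT_DIR as in glue-setup.sh; a Windows Python usually wants "C:/aws-glue-libs".
root_dir = "C:/aws-glue-libs"
src = os.path.join(root_dir, "awsglue")
dst = os.path.join(root_dir, "PyGlue.zip")

with zipfile.ZipFile(dst, "w", zipfile.ZIP_DEFLATED) as zf:
    for dirpath, _, filenames in os.walk(src):
        for name in filenames:
            full = os.path.join(dirpath, name)
            # Store entries relative to the repo root so the archive keeps
            # the awsglue/ prefix, as `zip -r PyGlue.zip awsglue` would.
            zf.write(full, os.path.relpath(full, root_dir))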

How to import multiple csv files to MongoDB?

I have a directory with a bunch of CSV files; I want to import them all into MongoDB and then delete all of them.
On Ubuntu 14.04, the following would work:
for f in /home/v/scr/alerts/*;
do
mongoimport -d emails -c main --type csv --file "$f" --headerline && rm /home/v/scr/alerts/*;
done
However, I receive the following output now (on Ubuntu 16.04):
2018-01-29T21:49:00.752+0000 connected to: localhost
2018-01-29T21:49:00.759+0000 imported 1 document
2018-01-29T21:49:00.767+0000 Failed: open /home/v/scr/alerts/fH88Vaxr.csv: no such file or directory
2018-01-29T21:49:00.767+0000 imported 0 documents
2018-01-29T21:49:00.772+0000 Failed: open /home/v/scr/alerts/m45EkP9N.csv: no such file or directory
2018-01-29T21:49:00.772+0000 imported 0 documents
It appears that the first CSV file is imported correctly, then everything is removed, which is not what I want.
After the first import you are removing all files from the source directory.
Delete only the file that has been imported:
for f in /home/v/scr/alerts/*;
do
mongoimport -d emails -c main --type csv --file "$f" --headerline
rm "$f" # rm only the current file
done
Or delete all source files after they have all been imported successfully:
for f in /home/v/scr/alerts/*;
do
mongoimport -d emails -c main --type csv --file "$f" --headerline
done
rm /home/v/scr/alerts/* #rm all files
Optionally, you can include a regular-file check as well:
if [ -f "$f" ]; then
# import
fi
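Since the thread is Python-tagged, the same per-file logic can also be written in Python with glob and subprocess. This is only a sketch of the shell loop above; the directory and the mongoimport flags come from the question, while the *.csv filter is an assumption based on the error output. Each file is removed only when its own import exits with status 0.

import glob
import os
import subprocess

for path in glob.glob("/home/v/scr/alerts/*.csv"):
    status = subprocess.call([
        "mongoimport", "-d", "emails", "-c", "main",
        "--type", "csv", "--file", path, "--headerline",
    ])
    # Delete only the file that was just imported, and only on success.
    if status == 0:
        os.remove(path)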

Implement Git hook - prePush and preCommit

Could you please show me how to implement a Git hook?
Before committing, the hook should run a Python script, something like this:
cd c:\my_framework & run_tests.py --project Proxy-Tests\Aeries \
--client Aeries --suite <Commit_file_Name> --dryrun
If the dry run fails, then the commit should be stopped.
You need to tell us in what way the dry run will fail. Will there be an output .txt file with errors? Will there be an error displayed on the terminal?
In any case, you must name the pre-commit script pre-commit and save it in the .git/hooks/ directory.
Since your dry run script seems to be in a different path than the pre-commit script, here's an example that finds and runs your script.
I assume from the backslashes in your path that you are on a Windows machine, and I also assume that your dry-run script lives in the same project where Git is initialized, in a folder called tools (of course you can change this to your actual folder).
#!/bin/sh
#Directory that contains your python script
FILE_PATH=tools/
#Get relative path of the root directory of the project
rdir=`git rev-parse --git-dir`
rel_path="$(dirname "$rdir")"
#Cd to that path and run the file.
cd $rel_path/$FILE_PATH
echo "Running dryrun script..."
python run_tests.py
#From that point on you need to handle the dry run error/s.
#For demonstration purposes I'll assume that an output.txt file that holds
#the result is produced.
#Extract the result from the output file
final_res="tac output | grep -m 1 . | grep 'error'"
echo -e "--------Dry run result---------\n"${final_res}
#If a warning and/or error exists abort the commit
eval "$final_res" | while read -r line; do
if [ $line != "0" ]; then
echo -e "Dry run failed.\nAborting commit..."
exit 1
fi
done
Now every time you run git commit, the pre-commit script will run the dry-run file and abort the commit if any errors have occurred, keeping your files in the staging area.
I have implemented this in my hook. Here is the code snippet.
#!/bin/sh
#Path of your python script
RUN_TESTS="run_tests.py"
FRAMEWORK_DIR="/my-framework/"
CUR_DIR=`echo ${PWD##*/}`
#Get the full path of the root directory of the project
rDIR=`git rev-parse --git-dir --show-toplevel | head -2 | tail -1`
OneStepBack=/../
CD_FRAMEWORK_DIR="$rDIR$OneStepBack$FRAMEWORK_DIR"
#Find list of modified files - to be committed
LIST_OF_FILES=`git status --porcelain | awk -F" " '{print $2}' | grep ".txt" `
for FILE in $LIST_OF_FILES; do
cd $CD_FRAMEWORK_DIR
python $RUN_TESTS --dryrun --project $CUR_DIR/$FILE
OUT=$?
if [ $OUT -eq 0 ];then
continue
else
exit 1
fi
done
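A side note on the same idea: Git only requires .git/hooks/pre-commit to be executable, so the hook itself can be a small Python script instead of shell, which may be easier to keep in sync with run_tests.py. A rough sketch under the question's assumptions (framework path, --dryrun/--suite arguments); Git for Windows usually honours the shebang line, but that is worth verifying on your setup.

#!/usr/bin/env python
# .git/hooks/pre-commit - abort the commit when the dry run fails.
import subprocess
import sys

FRAMEWORK_DIR = r"c:\my_framework"   # assumed location, as in the question

# Files staged for this commit.
staged = subprocess.check_output(
    ["git", "diff", "--cached", "--name-only"]
).decode().splitlines()

for name in staged:
    ret = subprocess.call(
        ["python", "run_tests.py", "--dryrun", "--suite", name],
        cwd=FRAMEWORK_DIR,
    )
    if ret != 0:
        sys.stderr.write("Dry run failed for %s, aborting commit.\n" % name)
        sys.exit(1)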

Processing multiple files in HDFS via Python

I have a directory in HDFS that contains roughly 10,000 .xml files. I have a Python script, processxml.py, that takes a file and does some processing on it. Is it possible to run the script on all of the files in the HDFS directory, or do I need to copy them to the local filesystem first in order to do so?
For example, when I run the script on files in a local directory I have:
cd /path/to/files
for file in *.xml
do
python /path/processxml.py $file > /path2/$file
done
So basically, how would I go about doing the same, but this time the files are in hdfs?
You basically have two options:
1) Use the Hadoop Streaming connector to create a MapReduce job (here you will only need the map part). Use this command from the shell or inside a shell script:
hadoop jar <the location of the streamlib> \
-D mapred.job.name=<name for the job> \
-input /hdfs/input/dir \
-output /hdfs/output/dir \
-file your_script.py \
-mapper python your_script.py \
-numReduceTasks 0
2) Create a Pig script and ship your Python code. Here is a basic example of the script:
input_data = LOAD '/hdfs/input/dir';
DEFINE mycommand `python your_script.py` ship('/path/to/your/script.py');
updated_data = STREAM input_data THROUGH mycommand PARALLEL 20;
STORE updated_data INTO 'hdfs/output/dir';
If you need to process the data in your files, or move/cp/rm them around the filesystem, then PySpark (Spark with a Python interface) would be one of the best options (for speed and memory).
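To make the PySpark suggestion a bit more concrete, here is a rough sketch; the directory names are the placeholders used above, and process_xml stands in for whatever processxml.py does to one document. It reads each XML file as a whole and processes it per file.

from pyspark import SparkContext

sc = SparkContext(appName="process-xml")

def process_xml(content):
    # Placeholder for the logic in processxml.py.
    return content.upper()

# wholeTextFiles yields (path, file_content) pairs, one per XML file,
# so each document is processed as a unit rather than line by line.
files = sc.wholeTextFiles("/hdfs/input/dir/*.xml")
results = files.mapValues(process_xml)
results.saveAsTextFile("/hdfs/output/dir")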
