Processing multiple files in HDFS via Python

I have a directory in HDFS that contains roughly 10,000 .xml files. I have a Python script "processxml.py" that takes a file and does some processing on it. Is it possible to run the script on all of the files in the HDFS directory, or do I need to copy them to the local file system first in order to do so?
For example, when I run the script on files in a local directory I have:
cd /path/to/files
for file in *.xml
do
  python /path/processxml.py $file > /path2/$file
done
So basically, how would I go about doing the same, but this time the files are in hdfs?

You basically have two options:
1) Use the Hadoop Streaming jar to create a MapReduce job (here you will only need the map part). Use this command from the shell or inside a shell script:
hadoop jar <the location of the streamlib> \
-D mapred.job.name=<name for the job> \
-input /hdfs/input/dir \
-output /hdfs/output/dir \
-file your_script.py \
-mapper python your_script.py \
-numReduceTasks 0
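Note that a streaming mapper reads records from standard input rather than taking a file path argument, so processxml.py may need a small wrapper. A minimal sketch of such a mapper (process_line is just a placeholder for your XML logic):
#!/usr/bin/env python
# Minimal Hadoop Streaming mapper sketch: records arrive on stdin,
# results go to stdout; process_line is a hypothetical placeholder.
import sys

def process_line(line):
    # replace this with the real processing from processxml.py
    return line.strip().upper()

for line in sys.stdin:
    sys.stdout.write(process_line(line) + "\n")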
2) Create a Pig script and ship your Python code with it. Here is a basic example of the script:
input_data = LOAD '/hdfs/input/dir';
DEFINE mycommand `python your_script.py` SHIP('/path/to/your/script.py');
updated_data = STREAM input_data THROUGH mycommand PARALLEL 20;
STORE updated_data INTO '/hdfs/output/dir';

If you need to process the data in your files or move/copy/remove them around the file system, then PySpark (Spark with a Python interface) would be one of the best options (for speed and memory).
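For illustration, a minimal PySpark sketch could look like this, assuming each .xml file is processed as a whole; process_xml is just a placeholder for the logic in processxml.py, and the paths reuse the example directories above:
# PySpark sketch: read every file in the directory as (path, content) pairs,
# apply a placeholder processing function and write the results back to HDFS.
from pyspark import SparkContext

def process_xml(content):
    # replace this with the real processing from processxml.py
    return content.upper()

sc = SparkContext(appName="ProcessXmlFiles")
files = sc.wholeTextFiles("/hdfs/input/dir")       # one record per file
results = files.map(lambda pair: process_xml(pair[1]))
results.saveAsTextFile("/hdfs/output/dir")         # output dir must not exist yet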

Related

no-same-owner flag in python tar extract

I have a bash script that extracts a tar file:
tar --no-same-owner -xzf "$FILE" -C "$FOLDER"
--no-same-owner is needed because this script runs as root in Docker, and I want the extracted files to be owned by root rather than by the original uid/gid that created the tar.
I have changed the script to a Python script and need to add the --no-same-owner functionality, but I can't see an option for it in the tarfile docs:
import tarfile

with tarfile.open(file_path, "r:gz") as tar:
    tar.extractall(extraction_folder)
Is this possible? Or do I need to run the bash command as a subprocess?
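One possible approach (a sketch, not a dedicated tarfile flag) is to reset the ownership metadata on each member before extracting, which should have the same effect as --no-same-owner when the script runs as root:
import tarfile

with tarfile.open(file_path, "r:gz") as tar:
    for member in tar.getmembers():
        # drop the original uid/gid so extraction as root leaves files owned by root
        member.uid = 0
        member.gid = 0
        member.uname = "root"
        member.gname = "root"
    tar.extractall(extraction_folder)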

Mixed execution of two Python scripts & DatastaxBulk loader scripts to load to .csv in Apache Cassandra

I have a .sh file, fileMaster.sh, in which I call two Python scripts:
python script1.py && python script2.py
Now, the problem is that after script2.py finishes I want to upload the resulting .csv into Apache Cassandra with the DataStax Bulk Loader.
So, if I do this:
python script1.py && python script2.py && fileSlave.sh
where fileSlave.sh is:
export PATH=/home/mypc/dsbulk-1.7.0/bin:$PATH
source ~/.bashrc
dsbulk load -url /home/mypc/Desktop/foldertest/data.csv -k data_test -t data_table -delim "," -header true -m '0=time_exp, 1=p'
it gives me "access denied" when loading into Cassandra. As you can imagine, the same happens if I add the contents of fileSlave.sh directly below the Python calls in fileMaster.sh.
How can I do that?
I solved the problem: fileMaster.sh needs to be this:
python script1.py && python script2.py
chmod u+x ./fileSlave.sh
./fileSlave.sh
It works!
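If you prefer to keep the whole pipeline in Python instead of chaining shell scripts, a rough sketch (paths and dsbulk options copied from the question) could be:
# Sketch: run the two scripts and then dsbulk from a single Python driver.
import subprocess

subprocess.check_call(["python", "script1.py"])
subprocess.check_call(["python", "script2.py"])

subprocess.check_call([
    "/home/mypc/dsbulk-1.7.0/bin/dsbulk", "load",
    "-url", "/home/mypc/Desktop/foldertest/data.csv",
    "-k", "data_test",
    "-t", "data_table",
    "-delim", ",",
    "-header", "true",
    "-m", "0=time_exp, 1=p",
])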

Pyspark can not delete file from HDFS containing backslash

Just noticed strange behaviour of either Python, Pyspark or maybe even Hadoop.
I have accidentally created a folder with a backslash in its name on HDFS:
> hdfs dfs -ls -h
drwxr-xr-x   - user hdfs          0 2020-08-04 08:59 Q2\solution2
I'm using Spark version 2.3.0.2.6.5.0-292 with Python 2.7.5.
So here is what I have tried. Start pyspark2, then execute the following commands:
>import os
>os.system("hdfs dfs -rm -r -f 'Q2\solution2'")
0
The file/folder is not deleted!
However, when I execute the same command directly from the OS shell...
hdfs dfs -rm -r -f 'Q2\solution2'
The file/folder is deleted!
Can anyone explain why this is happening?
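One thing worth trying (just a guess at the cause, i.e. shell quoting, not a confirmed explanation) is to bypass the shell entirely and pass the path as a single argument:
# Sketch: subprocess with an argument list skips the shell, so the backslash
# is passed through literally.
import subprocess

rc = subprocess.call(["hdfs", "dfs", "-rm", "-r", "-f", r"Q2\solution2"])
print(rc)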

How to load additional JARs for a Hadoop Streaming job on Amazon EMR

TL;DR
How can I upload or specify additional JARs for a Hadoop Streaming job on Amazon Elastic MapReduce (Amazon EMR)?
Long version
I want to analyze a set of Avro files (> 2000 files) using Hadoop on Amazon Elastic MapReduce (Amazon EMR). It should be a simple exercise through which I should gain some confidence with MapReduce and Amazon EMR (I am new to both).
Since Python is my favorite language, I have decided to use Hadoop Streaming. I have built a simple mapper and reducer in Python and tested them on a local Hadoop (single-node install). The command I was issuing on my local Hadoop install was this:
$HADOOP_PREFIX/bin/hadoop jar $HADOOP_PREFIX/share/hadoop/tools/lib/hadoop-streaming-2.4.0-amzn-1.jar \
-files avro-1.7.7.jar,avro-mapred-1.7.7.jar \
-libjars avro-1.7.7.jar,avro-mapred-1.7.7.jar \
-input "input" \
-mapper "python2.7 $PWD/mapper.py" \
-reducer "python2.7 $PWD/reducer.py" \
-output "output/outdir" \
-inputformat org.apache.avro.mapred.AvroAsTextInputFormat
and the job completed successfully.
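For reference, a minimal mapper.py along these lines might look like the sketch below (the field name "id" is only a placeholder); AvroAsTextInputFormat hands each Avro record to the mapper as its JSON representation on stdin:
#!/usr/bin/env python2.7
# Trimmed-down streaming mapper sketch: parse the JSON record from stdin
# and emit a (key, count) pair; "id" is a hypothetical field name.
import json
import sys

for line in sys.stdin:
    line = line.strip()
    if not line:
        continue
    record = json.loads(line)
    sys.stdout.write("%s\t%d\n" % (record.get("id", "unknown"), 1))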
I have a bucket on Amazon S3 with a folder containing all the input files and another folder with the mapper and reducer scripts (mapper.py and reducer.py respectively).
Using the interface, I created a small cluster, added a bootstrap action to install all the required Python modules on each node, and then added a "Hadoop Streaming" step specifying the location of the mapper and reducer scripts on S3.
The problem is that I don't have the slightest idea how to upload, or specify in the options, the two JARs required to run this job: avro-1.7.7.jar and avro-mapred-1.7.7.jar.
I have tried several things:
using the -files flag in combination with -libjars in the optional arguments;
adding another bootstrap action that downloads the JARs on every node (and I have tried downloading them to different locations on the nodes);
uploading the JARs to my bucket and specifying a full s3://... path as the argument to -libjars (note: these files are silently ignored by Hadoop, and a warning is issued).
If I don't pass the two JARs the job fails (it does not recognize the -inputformat class), but I have tried all the possibilities (and combinations thereof!) I could think of, to no avail.
In the end, I figured it out (and it was, of course, obvious).
Here's how I did it:
Add a bootstrap action that downloads the JARs on every node. For example, you can upload the JARs to your bucket, make them public and then do:
wget https://yourbucket/path/somejar.jar -O $HOME/somejar.jar
wget https://yourbucket/path/avro-1.7.7.jar -O $HOME/avro-1.7.7.jar
wget https://yourbucket/path/avro-mapred-1.7.7.jar -O $HOME/avro-mapred-1.7.7.jar
When you specify -libjars in the optional arguments, use absolute paths, so:
-libjars /home/hadoop/somejar.jar,/home/hadoop/avro-1.7.7.jar,/home/hadoop/avro-mapred-1.7.7.jar
I have lost a number of hours that I am ashamed to admit; I hope this helps somebody else.
Edit (Feb 10th, 2015)
I have double-checked, and I want to point out that environment variables do not seem to be expanded when passed in the optional arguments field. So use the explicit path instead of $HOME (i.e. /home/hadoop).
Edit (Feb 11th, 2015)
If you want to launch a streaming job on Amazon EMR using the AWS CLI, you can use the following command:
aws emr create-cluster --ami-version '3.3.2' \
--instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType='m1.medium' InstanceGroupType=CORE,InstanceCount=2,InstanceType='m3.xlarge' \
--name 'TestStreamingJob' \
--no-auto-terminate \
--log-uri 's3://path/to/your/bucket/logs/' \
--no-termination-protected \
--enable-debugging \
--bootstrap-actions Path='s3://path/to/your/bucket/script.sh',Name='ExampleBootstrapScript' Path='s3://path/to/your/bucket/another_script.sh',Name='AnotherExample' \
--steps file://./steps_test.json
and you can specify the steps in a JSON file:
[
  {
    "Name": "Avro",
    "Args": ["-files", "s3://path/to/your/mapper.py,s3://path/to/your/reducer.py",
             "-libjars", "/home/hadoop/avro-1.7.7.jar,/home/hadoop/avro-mapred-1.7.7.jar",
             "-inputformat", "org.apache.avro.mapred.AvroAsTextInputFormat",
             "-mapper", "mapper.py",
             "-reducer", "reducer.py",
             "-input", "s3://path/to/your/input_directory/",
             "-output", "s3://path/to/your/output_directory/"],
    "ActionOnFailure": "CONTINUE",
    "Type": "STREAMING"
  }
]
(Please note that the official Amazon documentation is somewhat outdated; in fact, it uses the old Amazon EMR CLI tool, which is deprecated in favor of the more recent AWS CLI.)

Setting the number of reducers does not work

I am using Hadoop Streaming with -io typedbytes and have set mapred.reduce.tasks=2, but I end up with only one output file. If I set mapred.reduce.tasks=0, I get many output files. I am very confused.
So my question is:
How can I make the mapred.reduce.tasks=num (num > 1) setting take effect when using -io typedbytes in streaming?
PS: my mapper's output is (key: Python string, value: NumPy array).
And my .sh file:
hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-1.2.1.jar \
-D mapred.reduce.tasks=2 \
-fs local \
-jt local \
-io typedbytes \
-inputformat org.apache.hadoop.mapred.SequenceFileAsBinaryInputFormat \
-input FFT_SequenceFile \
-output pinvoutput \
-mapper 'pinvmap.py' \
-file pinvmap.py \
-D mapred.reduce.tasks=2 \
-fs local \
-jt local
By checking the values of -fs and -jt, I can see that you are running in local mode.
In local mode, at most one reducer can run.
Because local mode uses the local file system and a single JVM, no Hadoop daemons run in this mode.
In pseudo-distributed mode, where all the daemons run on the same machine, the property -D mapred.reduce.tasks=n will work and result in n reducers.
So you should use pseudo-distributed mode to work with multiple reducers.
Hope it helps!
