I am able to find the name of the input file in a mapper class using FileSplit when writing the program in Java.
Is there a corresponding way to do this when writing the program in Python (using streaming)?
I found the following in the Hadoop streaming documentation on Apache:
See Configured Parameters. During the execution of a streaming job,
the names of the "mapred" parameters are transformed. The dots ( . )
become underscores ( _ ). For example, mapred.job.id becomes
mapred_job_id and mapred.jar becomes mapred_jar. In your code, use the
parameter names with the underscores.
But I still can't understand how to make use of this inside my mapper.
Any help is highly appreciated.
Thanks
According to the "Hadoop : The Definitive Guide"
Hadoop sets job configuration parameters as environment variables for Streaming programs. However, it replaces non-alphanumeric character with underscores to make sure they are valid names. The following Python expression illustrates how you can retrieve the value of the mapred.job.id property from within a Python Streaming script:
os.environ["mapred_job_id"]
You can also set environment variables for the Streaming process launched by MapReduce by applying the -cmdenv option to the Streaming launcher program (once for each variable you wish to set). For example, the following sets the MAGIC_PARAMETER environment variable:
-cmdenv MAGIC_PARAMETER=abracadabra
By parsing the mapreduce_map_input_file (new) or map_input_file (deprecated) environment variable, you will get the map input file name.
Notice:
The two environment variables are case-sensitive; all letters are lower-case.
The new ENV_VARIABLE for Hadoop 2.x is MAPREDUCE_MAP_INPUT_FILE
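Putting the two answers together, a streaming mapper written in Python can read the variable straight from its environment. A minimal sketch (it falls back to the deprecated name for older Hadoop versions; the tab-separated output format is just an example):
#!/usr/bin/env python
import os
import sys

# Input file for this map task, as exposed by Hadoop streaming.
# Hadoop 2.x uses mapreduce_map_input_file; older versions use map_input_file.
input_file = os.environ.get("mapreduce_map_input_file",
                            os.environ.get("map_input_file", "unknown"))

for line in sys.stdin:
    line = line.rstrip("\n")
    if line:
        # Emit the source file name as the key so the reducer can tell
        # which input file each record came from.
        print("%s\t%s" % (input_file, line))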
(I'm not a psychologist nor a PsychoPy expert, but I'm helping one.)
We're developing an experiment using the Coder (not the Builder).
The output of the experiment is written out using data.ExperimentHandler.addData().
When I run the application on my machine I get output like this:
,,1,1,z,f,o,o,1.254349554568762,0.8527608618387603,0.0,1.0,FSAS,,,,,,,,,,,,,,,52,male,52,
When the other person runs the application she gets:
;;0;1;z;z;j;j;105.498.479.999.369;0.501072800019756;1.0;1.0;FNES;;;;;;;;;;;;;;;23;male;0
(the difference I want to show is about format, not the values)
There are 2 differences:
RT values > 1 are printed wrong: the value 105.498.479.999.369 (an RT of less than 1.5 s) should be 1.05498479999369
The value separator is , in my output and ; in hers. I don't know which value separator, comma or semicolon, is best for later processing in R.
I think it has something to do with regional settings.
Question: Is it possible to force the format so the application always generates the same output, independent of locale settings?
You don't specify whether you are using the graphical Builder interface to PsychoPy to generate your Python script, or writing a custom script directly in Python.
If using the Builder interface, click the "Experiment settings" toolbar icon, select the "Data" tab, and specify that the datafile delimiter should be "comma" rather than "auto".
If running your own script, then in the saveAsWideText() method of the ExperimentHandler class, similarly specify ',' as the delim parameter. The API is here:
https://psychopy.org/api/data.html
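For example, in a Coder script it might look like this (a minimal sketch; the experiment name and file names are placeholders for whatever your script already uses):
from psychopy import data

# Hypothetical handler set-up -- use your own name/dataFileName here.
exp = data.ExperimentHandler(name="demo", dataFileName="results")

# ... exp.addData(...) calls during the experiment ...

# Force a comma delimiter so the output format no longer depends on
# the regional settings of the machine running the experiment.
exp.saveAsWideText("results.csv", delim=",")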
In future, you will likely get better support at the dedicated PsychoPy forum at https://discourse.psychopy.org
The actual PsychoPy developers and other experienced users are much more likely to see your queries there than here on Stack Overflow.
I'm trying to build Python from source, using the --prefix option to control the target directory where it gets installed.
After successful installation, in some files in the target directory I see the entries of the working directory from where I actually built.
Example files which have entries for abs_srcdir & abs_builddir:
lib/python3.9/_sysconfigdata__linux_x86_64-linux-gnu.py
lib/python3.9/config-3.9-x86_64-linux-gnu/Makefile
How can I avoid this?
I am a bit unfamiliar with the process in Python, but I can tell that these are part of the Preset Output Variables.
From docs:
Some output variables are preset by the Autoconf macros. Some of the Autoconf macros set additional output variables, which are mentioned in the descriptions for those macros. See Output Variable Index, for a complete list of output variables. See Installation Directory Variables, for the list of the preset ones related to installation directories. Below are listed the other preset ones, many of which are precious variables (see Setting Output Variables, AC_ARG_VAR).
You can see the variables you mentioned here - B.2 Output Variable Index. Since these are preset variables, I don't see a way to exclude them at build time; manually removing them post-installation, or writing some sort of script to do so, seems like the only way to solve this.
If this were done in GNU Make, you could use the filter-out text function.
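If a post-installation clean-up script is acceptable, something along these lines would do it (just a sketch; build_dir and prefix are assumptions you would replace with your actual configure directory and --prefix value):
# Hypothetical paths -- adjust to where you ran ./configure and where you installed.
build_dir = "/home/me/build/Python-3.9.18"
prefix = "/opt/python3.9"

files = [
    prefix + "/lib/python3.9/_sysconfigdata__linux_x86_64-linux-gnu.py",
    prefix + "/lib/python3.9/config-3.9-x86_64-linux-gnu/Makefile",
]

for path in files:
    with open(path) as f:
        text = f.read()
    # Replace every occurrence of the build directory with the install prefix
    # so abs_srcdir / abs_builddir no longer point at the build tree.
    with open(path, "w") as f:
        f.write(text.replace(build_dir, prefix))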
I am writing a script to export Alembic caches of animation in a massive project containing lots of Maya files. Our main character is having an issue: along the way his eyes somehow ended up with the same name. This has created issues with the Alembic export. Does Maya already have some sort of clean-up function that can correct matching names?
Any two objects can have the same name, but never the same DAG path. In your script, make sure all your ls, listRelatives calls, etc. have the full path, longName, or long flags set so you always operate on full DAG paths as opposed to the possibly conflicting short names.
To my knowledge, Maya (and its Python API) does not offer anything like that.
You'll have to run a snippet that checks for duplicates before the export.
Or, alternatively, use an already existing script and run that.
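A pre-export check could look roughly like this (a sketch only, assuming it runs inside Maya's Python interpreter; the rename shown in the comment is just an illustration):
import maya.cmds as cmds
from collections import Counter

# List every DAG node by its full path, then count the short names.
full_paths = cmds.ls(dag=True, long=True)
short_names = [p.rsplit("|", 1)[-1] for p in full_paths]
duplicates = {n for n, count in Counter(short_names).items() if count > 1}

for path in full_paths:
    if path.rsplit("|", 1)[-1] in duplicates:
        print("duplicate short name: " + path)
        # Optionally rename to something unique before the Alembic export, e.g.
        # cmds.rename(path, path.rsplit("|", 1)[-1] + "_fixed")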
I'm using Python with Hadoop streaming for a project, and I need functionality similar to that provided by the TotalOrderPartitioner and InputSampler in Hadoop; that is, I need to sample the data first and create a partition file, then use the partition file in the mapper to decide which K-V pair will go to which reducer. I need to do it in Hadoop 1.0.4.
I could only find some Hadoop streaming examples with KeyFieldBasedPartitioner and customized partitioners, which use the -partitioner option in the command to tell Hadoop to use these partitioners. The examples I found using TotalOrderPartitioner and InputSampler are all in Java, and they need to use the writePartitionFile() of InputSampler and the DistributedCache class to do the job. So I am wondering: is it possible to use TotalOrderPartitioner with Hadoop streaming? If it is possible, how can I organize my code to use it? If it is not, is it practical to implement the total partitioner in Python first and then use it?
I did not try it, but taking the example with KeyFieldBasedPartitioner and simply replacing:
-partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner
with
-partitioner org.apache.hadoop.mapreduce.lib.partition.TotalOrderPartitioner
should work.
One possible way to use TotalOrderPartitioner in Hadoop streaming is to recode a small part of it to get the pathname of its partition file from an environment variable, then compile it, define that environment variable on your systems, and pass its name to the streaming job with the -cmdenv option (documented at https://hadoop.apache.org/docs/stable/hadoop-streaming/HadoopStreaming.html#Streaming_Command_Options).
Source code for TotalOrderPartitioner is available at TotalOrderPartitioner.java. In it, getPartitionFile() is defined on two lines starting on line 143, and its second line shows that if it's not given an argument it uses DEFAULT_PATH as the partition file name. DEFAULT_PATH is defined on line 54 as "_partition.lst", and line 83 has a comment that says it's assumed to be in the DistributedCache. Based on that, without modifying getPartitionFile(), it should be possible to use _partition.lst as the partition filename as long as it's in the DistributedCache.
That leaves the issue of running an InputSampler to write content to the partition file. I think that's best done by running an already coded Java MapReduce job that uses TotalOrderPartitioner, at least to get an example of the output of InputSampler to determine its format. If the example job can be altered to process the type of data you want then you could use it to create a partition file usable for your purposes. A couple of coded MapReduce jobs using TotalOrderPartitioner are TotalOrderSorting.java and TotalSortMapReduce.java.
Alternatively at twittomatic there is a simple, custom IntervalPartitioner.java in which the partition file pathname is hardcoded as /partitions.lst and in the sorter directory is supplied a script, sample.sh, that builds partition.lst using hadoop, a live twitter feed and sample.py. It should be fairly easy to adapt this system to your needs starting with replacing the twitter feed with a sample of your data.
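If you go the pure-Python route mentioned in the question instead, the mapper-side logic is small. A sketch under these assumptions: a partition.lst file (one sorted cut-point key per line) has already been produced by sampling the input and is shipped to each task with -file, and the job partitions on the first output field (for example with KeyFieldBasedPartitioner):
#!/usr/bin/env python
# mapper.py -- assigns each record a reducer index based on sampled cut points.
import bisect
import sys

with open("partition.lst") as f:
    cuts = [line.rstrip("\n") for line in f if line.strip()]

for line in sys.stdin:
    key, _, value = line.rstrip("\n").partition("\t")
    # Keys below the first cut point go to reducer 0, the next range to 1, etc.
    reducer = bisect.bisect_right(cuts, key)
    # Emit the reducer index as the first field so the partitioner can route
    # the record; the reducers then receive globally ordered key ranges.
    print("%d\t%s\t%s" % (reducer, key, value))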
I am generating log files in Python. I have around 20 parameters, which I am reading from a config.cnf file. Based on the values of these parameters, I name the log file. I want to use abbreviations to make the file name short. The present names of the files and directories generated from my code are very long.
This is an example of a file name: "cifar10_symm_ter_1bits_128_256_512_f1024_2_det_fil_ssl_1e-06_chan_ssl_1e-06_fco_ssl_1e-06_fci_ssl_1e-06_prthr1e-06_bc_init_cgs_c_50_4_cgs_f_50_16.txt". Here I have inserted different parameter names such as "symm_ter", and following each parameter I have added its details, such as "1bits". If there are some recommended naming techniques (rather than exactly a library), that would also be helpful.
I want to shorten file and folder names and at the same time keep details about the parameters in the file name. If there is some Python library which can help me abbreviate the parameter names used in the file name, that would solve my task. Also, if that library can help display details about the abbreviated parameters, that would help. I have looked into the argparse library, but that is for command-line parameters; in my code I am reading from a configuration file, config.cnf. I have read the Python module naming conventions; however, I am concerned about the brevity of the names here.
You can shorten the name by encoding your options through your own dictionary (for example, if an option can only be 128, 256, or 512, you can decide it will be A, B, or C), but you will lose readability.
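For example (the abbreviations and parameter values below are purely illustrative):
# Illustrative abbreviation table -- choose codes that stay readable to you.
abbrev = {"symm_ter": "st", "det_fil_ssl": "dfs", "chan_ssl": "cs"}

params = {"symm_ter": "1bits", "det_fil_ssl": "1e-06", "chan_ssl": "1e-06"}
name = "_".join(abbrev.get(k, k) + str(v) for k, v in sorted(params.items()))
# -> "cs1e-06_dfs1e-06_st1bits"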
In fact, I think you are on the wrong track. The name is not the place to store your 20 parameters. What happens with 21, or with a very long option? You are already in trouble.
If I had to do this, I would use a general name for the log file (e.g. mylog + datetime) and either put the 20 parameters in the first log line, or add a secondary file with the same root (e.g. mylog + datetime + .desc) to store and retrieve these values.
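A sketch of that second idea, with a timestamped log name and a .desc sidecar holding the parameters (the file names here are just placeholders):
import datetime
import json

stamp = datetime.datetime.now().strftime("%Y%m%d_%H%M%S")
log_name = "mylog_%s.txt" % stamp

# Hypothetical parameter dict, as read from config.cnf.
params = {"symm_ter": "1bits", "fil_ssl": 1e-06, "cgs_c": [50, 4]}

# Keep the log name short; the full parameter set lives in the sidecar file.
with open("mylog_%s.desc" % stamp, "w") as f:
    json.dump(params, f, indent=2)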
Hth
It seems you want to design a CMS (content management system) using the file name as a descriptor. This is (a little) meaningful if you need to move these folders around as individuals.
If your store will stay in the same place, using an SQLite DB to link a folder name with its parameters will be much more powerful, and it is quite simple in Python.
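A sketch of that, with a hypothetical folder name and parameter dict:
import json
import sqlite3

conn = sqlite3.connect("runs.db")
conn.execute("CREATE TABLE IF NOT EXISTS runs (folder TEXT PRIMARY KEY, params TEXT)")

folder = "run_0042"
params = {"symm_ter": "1bits", "fil_ssl": 1e-06}

# Store the parameters as JSON keyed by the folder name.
conn.execute("INSERT OR REPLACE INTO runs VALUES (?, ?)", (folder, json.dumps(params)))
conn.commit()

# Later, look the parameters back up by folder name.
row = conn.execute("SELECT params FROM runs WHERE folder = ?", (folder,)).fetchone()
print(json.loads(row[0]))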