Python mrjob mapreduce: how to preprocess the input file

I am trying to pre-process an XML file to extract certain nodes before putting it into mapreduce. I have the following code:
from mrjob.compat import jobconf_from_env
from mrjob.job import MRJob
from mrjob.util import cmd_line, bash_wrap

class MRCountLinesByFile(MRJob):

    def configure_options(self):
        super(MRCountLinesByFile, self).configure_options()
        self.add_file_option('--filter')

    def mapper_cmd(self):
        cmd = cmd_line([self.options.filter, jobconf_from_env('mapreduce.map.input.file')])
        return cmd

if __name__ == '__main__':
    MRCountLinesByFile.run()
And on the command line, I type:
python3 test_job_conf.py --filter ./filter.py -r local < test.txt
test.txt is a normal XML file like the one here, while filter.py is a script that extracts all title information.
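For illustration, a minimal sketch of what such a filter.py could look like (this is an assumed example, not the original script; it just opens the file named on the command line and prints every <title> node):

#!/usr/bin/env python3
# Hypothetical filter.py: print the text of every <title> node in the XML file
# whose name is passed as the first command-line argument
import sys
import xml.etree.ElementTree as ET

filename = sys.argv[1]
with open(filename) as f:
    tree = ET.parse(f)
for title in tree.iter('title'):
    print(title.text)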
However, I am getting the following errors:
Creating temp directory /tmp/test_job_conf.vagrant.20160406.042648.689625
Running step 1 of 1...
Traceback (most recent call last):
File "./filter.py", line 8, in <module>
with open(filename) as f:
FileNotFoundError: [Errno 2] No such file or directory: 'None'
Step 1 of 1 failed: Command '['./filter.py', 'None']' returned non-zero exit status 1
It looks like mapreduce.map.input.file renders None in this case. How can I ask the mapper_cmd function to read the file that mrjob is currently reading?

As far as my understanding goes, your self.add_file_option should have the path to your file:
self.add_file_option('--items', help='Path to u.item')
I do not quite get your exact scenario, but here is my understanding.
You use the configure option to make sure a given file is sent to all the mappers for processing, for example when you want to do an ancillary lookup against data in a file other than the main input. This ancillary lookup file is made available by self.add_file_option('--items', help='Path to u.item').
To preprocess something, say before a reducer or a mapper phase, you use reducer_init or mapper_init. These init steps also need to be mentioned in your steps function, like shown below for example:
def steps(self):
    return [
        MRStep(mapper=self.mapper_get_name,
               reducer_init=self.reducer_init,
               reducer=self.reducer_count_name),
        MRStep(reducer=self.reducer_find_maxname)
    ]
Within your init function you do the actual pre-processing of whatever needs to be done before the data reaches the mapper or reducer. Say, for example, you open a file xyz and map the values of the first field to the second field, which you would then use in your reducer and output:
def reducer_init(self):
    self.movieNames = {}
    with open("xyz") as f:
        for line in f:
            fields = line.split('|')
            self.movieNames[fields[0]] = fields[1]
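Putting the two pieces together, a minimal sketch of how the ancillary file passed via --items could be used in a mapper_init (the class name MRLookupExample and the tab-separated input format are my assumptions, not from the question):

from mrjob.job import MRJob


class MRLookupExample(MRJob):

    def configure_options(self):
        super(MRLookupExample, self).configure_options()
        # ship the ancillary lookup file to every task
        self.add_file_option('--items', help='Path to u.item')

    def mapper_init(self):
        # pre-processing: build the lookup table once per mapper
        self.movieNames = {}
        with open(self.options.items) as f:
            for line in f:
                fields = line.split('|')
                self.movieNames[fields[0]] = fields[1]

    def mapper(self, _, line):
        movie_id = line.split('\t')[0]
        yield self.movieNames.get(movie_id, 'unknown'), 1


if __name__ == '__main__':
    MRLookupExample.run()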
Hope this helps!!

Related

Unable to extract some files from zip file

I have made a Python script which fetches the latest files from the university e-class (lectures in PDF format, scripts, etc.) via requests and downloads them. After downloading, it automatically extracts the zip into the specific folder I want. But the extraction function sometimes gets stuck on the same files.
The function is this:
import zipfile
from tqdm import tqdm

zipnm = 'Διαδικτυακά και Φορητά Πληροφοριακά Συστήματα'
quartls = '6'

def extractZip(zipName, quartInd):
    with zipfile.ZipFile('./' + zipName + '.zip', 'r') as zip_ref:
        for member in tqdm(zip_ref.infolist(), desc='Extracting '):
            try:
                zip_ref.extract(member, './' + quartInd + '° εξάμηνο/' + zipName)
            except zipfile.error as e:
                print(e)

if __name__ == '__main__':
    extractZip(zipnm, quartls)
When running on terminal it throws this:
PS Microsoft.PowerShell.Core\FileSystem::\\192.168.1.200\[REDACTED]> python .\test.py
Extracting : 39%|███████████████████████████ | 11/28 [00:00<00:01, 12.04it/s]
Traceback (most recent call last):
File "\\192.168.1.200\[REDACTED]\test.py", line 18, in <module>
extractZip(zipsd, quartsd)
File "\\192.168.1.200\[REDACTED]\test.py", line 12, in extractZip
zip_ref.extract(member, './'+quartInd+'° εξάμηνο/'+zipName)
File "C:\Users\[REDACTED]\AppData\Local\Programs\Python\Python39\lib\zipfile.py", line 1616, in extract
return self._extract_member(member, path, pwd)
File "C:\Users\[REDACTED]\AppData\Local\Programs\Python\Python39\lib\zipfile.py", line 1683, in _extract_member
os.mkdir(targetpath)
FileExistsError: [WinError 183] Δεν είναι δυνατή η δημιουργία ενός αρχείου όταν αυτό το αρχείο υπάρχει ήδη: '6° εξάμηνο\\Διαδικτυακά και Φορητά Πληροφοριακά Συστήματα\\Εργαστήρια\\Lab 5 - Introduction to PHP'
When I try to extract the zip file manually, it gets stuck on 2 files, which I can only retry or abort, although the other files that already exist show me the option to replace them. My question is: why is this happening for these 2 files, and why am I not allowed to replace them like the others? Are the files corrupted (although I checked their size and whether the EOF is missing, and found nothing suspicious)?
P.S. I don't want my script to check whether a file exists and exclude it; I just want to find the source of the problem so I can act accordingly.
P.S.#2:
The files being extracted are actually on an Ubuntu server machine, which I access via Samba from Windows; that is where I run the script and do anything else with the files.
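One way to dig further, as a hypothetical diagnostic sketch (not part of the original script): check whether any archive members map to the same extraction path, e.g. names that differ only in case, which is one thing that can make os.mkdir hit an already-existing directory on a case-insensitive filesystem such as Windows/Samba.

import zipfile
from collections import Counter

def find_colliding_members(zip_path):
    # report members whose names collapse to the same path once case
    # and trailing slashes are ignored
    with zipfile.ZipFile(zip_path, 'r') as zip_ref:
        names = [m.filename for m in zip_ref.infolist()]
    counts = Counter(name.rstrip('/').lower() for name in names)
    return [name for name in names if counts[name.rstrip('/').lower()] > 1]

print(find_colliding_members('./Διαδικτυακά και Φορητά Πληροφοριακά Συστήματα.zip'))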

Second `ParallelRunStep` in pipeline times out at start

I'm trying to run a sequence of more than one ParallelRunStep in an AzureML pipeline. To do so, I create each step with the following helper:
def create_step(name, script, inp, inp_ds):
    out = pip_core.PipelineData(name=f"{name}_out", datastore=dstore, is_directory=True)
    out_ds = out.as_dataset()
    out_ds_named = out_ds.as_named_input(f"{name}_out")

    config = cont_steps.ParallelRunConfig(
        source_directory="src",
        entry_script=script,
        mini_batch_size="1",
        error_threshold=0,
        output_action="summary_only",
        compute_target=compute_target,
        environment=component_env,
        node_count=2,
        logging_level="DEBUG"
    )

    step = cont_steps.ParallelRunStep(
        name=name,
        parallel_run_config=config,
        inputs=[inp_ds],
        output=out,
        arguments=[],
        allow_reuse=False,
    )
    return step, out, out_ds_named
As an example I create two steps like this
step1, out1, out1_ds_named = create_step("step1", "demo_s1.py", input_ds, named_input_ds)
step2, out2, out2_ds_named = create_step("step2", "demo_s2.py", out1, out1_ds_named)
Creating an experiment and submitting it to an existing workspace and Azure ML compute cluster works. Also, the first step step1 uses the input_ds, runs its script demo_s1.py (which produces its output files), and finishes successfully.
However, the second step step2 never gets started.
And there is a final exception:
The experiment failed. Finalizing run...
Cleaning up all outstanding Run operations, waiting 300.0 seconds
2 items cleaning up...
Cleanup took 0.16968441009521484 seconds
Starting the daemon thread to refresh tokens in background for process with pid = 394
Traceback (most recent call last):
File "driver/amlbi_main.py", line 52, in <module>
main()
File "driver/amlbi_main.py", line 44, in main
JobStarter().start_job()
File "/mnt/batch/tasks/shared/LS_root/jobs/pipeline/azureml/08a1e1e1-7c3f-4c5a-84ad-ca99b8a6cb31/mounts/workspaceblobstore/azureml/08a1e1e1-7c3f-4c5a-84ad-ca99b8a6cb31/driver/job_starter.py", line 48, in start_job
job.start()
File "/mnt/batch/tasks/shared/LS_root/jobs/pipeline/azureml/08a1e1e1-7c3f-4c5a-84ad-ca99b8a6cb31/mounts/workspaceblobstore/azureml/08a1e1e1-7c3f-4c5a-84ad-ca99b8a6cb31/driver/job.py", line 70, in start
master.start()
File "/mnt/batch/tasks/shared/LS_root/jobs/pipeline/azureml/08a1e1e1-7c3f-4c5a-84ad-ca99b8a6cb31/mounts/workspaceblobstore/azureml/08a1e1e1-7c3f-4c5a-84ad-ca99b8a6cb31/driver/master.py", line 174, in start
self._start()
File "/mnt/batch/tasks/shared/LS_root/jobs/pipeline/azureml/08a1e1e1-7c3f-4c5a-84ad-ca99b8a6cb31/mounts/workspaceblobstore/azureml/08a1e1e1-7c3f-4c5a-84ad-ca99b8a6cb31/driver/master.py", line 149, in _start
self.wait_for_input_init()
File "/mnt/batch/tasks/shared/LS_root/jobs/pipeline/azureml/08a1e1e1-7c3f-4c5a-84ad-ca99b8a6cb31/mounts/workspaceblobstore/azureml/08a1e1e1-7c3f-4c5a-84ad-ca99b8a6cb31/driver/master.py", line 124, in wait_for_input_init
raise exc
exception.FirstTaskCreationTimeout: Unable to create any task within 600 seconds.
Load the datasource and read the first row locally to see how long it will take.
Set the advanced argument '--first_task_creation_timeout' to a larger value in arguments in ParallelRunStep.
I have the impression that the second step is waiting for some data. However, the first step creates the supplied output directory and also a file:
import argparse
import os

def init():
    pass

def run(parallel_input):
    print(f"*** Running {os.path.basename(__file__)} with input {parallel_input}")
    parser = argparse.ArgumentParser(description="Data Preparation")
    parser.add_argument('--output', type=str, required=True)
    args, unknown_args = parser.parse_known_args()
    out_path = os.path.join(args.output, "1.data")
    os.makedirs(args.output, exist_ok=True)
    open(out_path, "a").close()
    return [out_path]
I have no idea how to debug this further. Does anybody have an idea?
You can check this notebook for parallel run steps and make sure that you are using the same packages:
https://github.com/Azure/MachineLearningNotebooks/blob/master/how-to-use-azureml/machine-learning-pipelines/parallel-run/tabular-dataset-inference-iris.ipynb
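Apart from that, the timeout message itself suggests raising --first_task_creation_timeout via the arguments of ParallelRunStep. A minimal sketch of that change against the helper above (the 1800-second value is just an assumed example):

step = cont_steps.ParallelRunStep(
    name=name,
    parallel_run_config=config,
    inputs=[inp_ds],
    output=out,
    # advanced argument mentioned in the timeout message; the value in seconds is an assumption
    arguments=["--first_task_creation_timeout", "1800"],
    allow_reuse=False,
)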

Python cPickle unable to load an OCR model library

I have just installed ocropus OCR with all dependencies on my Windows 7 machine (I am using 32-bit Python 2.7). It seems to be working fine, except that I cannot load the default OCR model en-default.pyrnn.gz and receive a traceback. I am using the following syntax:
python ocropus-rpred -m en-default.pyrnn.gz book\0001\*.png
here is the error
INFO: #inputs47
# loading object /usr/local/share/ocropus/en-default.pyrnn.gz
Traceback (most recent call last):
File "ocropus-rpred" line 109, in <module>
network = ocrolib.load_object(args.model,verbose=1)
File "C:\anaconda32\lib\site-packages\ocrolib\common.py", line 513, in load_object
return unpickler.load()
EOFError
I have checked that the file is not empty; also double-checked that the binary mode flags are enabled, i.e. "wb" and "rb"; and also converted the newlines of common.py using dos2unix. I am unable to solve this problem. If anyone has experienced similar issues, kindly share.
import cPickle
import gzip

def save_object(fname,obj,zip=0):
    if zip==0 and fname.endswith(".gz"):
        zip = 1
    if zip>0:
        # with gzip.GzipFile(fname,"wb") as stream:
        with os.popen("gzip -9 > '%s'"%fname,"wb") as stream:
            cPickle.dump(obj,stream,2)
    else:
        with open(fname,"wb") as stream:
            cPickle.dump(obj,stream,2)

def unpickle_find_global(mname,cname):
    if mname=="lstm.lstm":
        return getattr(lstm,cname)
    if not mname in sys.modules.keys():
        exec "import "+mname
    return getattr(sys.modules[mname],cname)

def load_object(fname,zip=0,nofind=0,verbose=0):
    """Loads an object from disk. By default, this handles zipped files
    and searches in the usual places for OCRopus. It also handles some
    class names that have changed."""
    if not nofind:
        fname = ocropus_find_file(fname)
    if verbose:
        print "# loading object",fname
    if zip==0 and fname.endswith(".gz"):
        zip = 1
    if zip>0:
        # with gzip.GzipFile(fname,"rb") as stream:
        with os.popen("gunzip < '%s'"%fname,"rb") as stream:
            unpickler = cPickle.Unpickler(stream)
            unpickler.find_global = unpickle_find_global
            return unpickler.load()
    else:
        with open(fname,"rb") as stream:
            unpickler = cPickle.Unpickler(stream)
            unpickler.find_global = unpickle_find_global
            return unpickler.load()
UPDATE: Please note that I have since used Python's native gzip, and it is working fine. Thank you for pointing that out. Here is the correct syntax that works on Windows: with gzip.GzipFile(fname, "rb") as stream:
Your use of gunzip (in the load_object function) is incorrect. Unless passed the -c argument, gunzip writes the decompressed data to a new file, not to its stdout (which is what you seem to be attempting to do).
As a result, it doesn't write anything to its stdout, and your stream variable contains no data, hence the EOFError.
A quick fix is to change your gunzip command line to give it the -c argument.
More info here: http://linux.die.net/man/1/gzip
That said, why are you even shelling out to gunzip to decompress your data? Python's built-in gzip module should handle that without problems.
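For example, a minimal sketch of the gzip-based variant (essentially the commented-out line already present in common.py, and assuming the same unpickle_find_global helper shown above):

import cPickle
import gzip

def load_object_gzip(fname):
    # use Python's built-in gzip module instead of shelling out to gunzip
    with gzip.GzipFile(fname, "rb") as stream:
        unpickler = cPickle.Unpickler(stream)
        unpickler.find_global = unpickle_find_global
        return unpickler.load()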

ConfigParser instance has no attribute '[extension]'

I am learning Python and I am trying to do the simple task of reading information from a config file.
So, using the Python doc and this similar problem as a reference, I created two files.
This is my config file config.ini (also tried config.cfg)
[DEFAULT]
OutDir = path_to_file/
[AUTH]
TestInt = 100
TestStr = blue
TestParse = blua
and this is my python file test.py
import ConfigParser
from ConfigParser import *

config = ConfigParser()
config.read(config.cfg)

for name in config.options('AUTH'):
    print name

out = config.get('DEFAULT', 'OutDir')
print 'Output directory is ' + out
However, when running the command python test.py, I am getting this error:
Traceback (most recent call last):
File "test.py", line 7, in <module>
config.read(config.cfg)
AttributeError: ConfigParser instance has no attribute 'cfg'
Note: I thought that meant the extension couldn't be read, so I created the .ini file and changed it in the code, but I received the same error, except it read ...has no attribute 'ini' instead.
I am not sure what I am doing wrong, since I am doing exactly the same thing as the Python doc and the solution someone used to fix this similar issue.
config.read takes a string as its argument. You forgot to quote the file name, and config was coincidentally the name of an existing Python object (your ConfigParser instance) that potentially could have had a cfg attribute. You'd get an entirely different error if you had written config.read(foobarbaz.ini).
The correct line is
config.read('config.cfg') # or 'config.ini', if that's the file name
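For completeness, a corrected test.py could look like this (a sketch, assuming the config file is saved as config.ini next to the script):

import ConfigParser

config = ConfigParser.ConfigParser()
config.read('config.ini')  # pass the file name as a string

for name in config.options('AUTH'):
    print name

out = config.get('DEFAULT', 'OutDir')
print 'Output directory is ' + out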

How to know which file is calling which file, filesystem

How can I know which file is calling which file in the filesystem? For example, file1.exe is calling file2.exe, so file2.exe is modified, and file1.exe is entered in the log file. This is on Windows.
I have searched the Internet but was not able to find any samples.
In order to know which file is calling which file, you can use the trace module.
For example, if you have 2 files:
***file1.py***

import file2

def call1():
    file2.call2()

***file2.py***

def call2():
    print "---------"
You can use it from the console:
$ python -m trace --trackcalls path/to/file1.py
or within a program, using a Trace object:
***tracefile.py***

import trace, sys
from file1 import call1

# specify what to trace here
tracer = trace.Trace(ignoredirs=[sys.prefix, sys.exec_prefix], trace=0, count=1)
tracer.runfunc(call1)  # call the function call1 in file1
results = tracer.results()
results.write_results(summary=True, coverdir='.')
