Variable number of arguments to Python script - python

The command line to run my Python script is:
./parse_ms.py inputfile 3 2 2 2
The arguments are an input file; the number 3 is the number of samples in my study, each with 2 individuals.
In the script, I indicate the arguments as follows:
inputfile = open(sys.argv[1], "r")
nsam = int(sys.argv[2])
nind1 = int(sys.argv[3])
nind2 = int(sys.argv[4])
nind3 = int(sys.argv[5])
However, the number of samples may vary. I can have:
./parse_ms.py input 4 6 8 2 20
in this case, I have 4 samples with 6, 8, 2 and 20 individuals in each.
It seems inefficient to add another sys.argv entry every time a sample is added. Is there a way to make this more general? That is, if I set nsam to 5, Python should automatically expect five numbers to follow, giving the individuals in each sample.

You can simply slice off the rest of sys.argv into a list. e.g.
inputfile = open(sys.argv[1], "r")
num_samples = int(sys.argv[2])
samples = sys.argv[3:3+num_samples]
Although if those are all your arguments, you can simply not pass the number of samples and just grab everything:
inputfile = open(sys.argv[1], "r")
samples = sys.argv[2:]
The samples can be converted to the proper datatype afterward, e.g. samples = [int(s) for s in samples].
Also, look at argparse for a nicer way of handling command line arguments in general.
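For instance, a minimal argparse sketch (the argument names here are illustrative, not from the original script) that collects all per-sample counts without hard-coding how many there are:
import argparse

parser = argparse.ArgumentParser(description="Parse ms output")
parser.add_argument("inputfile", type=argparse.FileType("r"),
                    help="input file to parse")
parser.add_argument("ninds", type=int, nargs="+",
                    help="number of individuals in each sample")
args = parser.parse_args()
nsam = len(args.ninds)  # the number of samples is implied by how many counts were given
Called as ./parse_ms.py inputfile 2 2 2, this yields args.ninds == [2, 2, 2] and nsam == 3.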

You can build a list of ninds and even catch exceptions by doing the following:
from sys import argv

try:
    ninds = [int(argv[i + 3]) for i in range(int(argv[2]))]
except IndexError:
    print("Error. Expected %s samples and got %d" % (argv[2], len(argv[3:])))

Related

Python input() does not read whole input data

I'm trying to read data from stdin; I'm actually using Ctrl+C, Ctrl+V to paste the values into cmd, but the process stops at some point, and it's always the same point. The input file is a .in file; the formatting is that the first row is a single number and the next 3 rows contain sets of numbers separated by spaces. I'm using Python 3.9.9. The problem only occurs with longer files (number of elements in the sets > 10000); with short input everything is fine. It seems as if the memory just ran out. I had the following approach:
def readData():
    # Read input
    for line in range(5):
        x = list(map(int, input().rsplit()))
        if(line == 0):
            nodes_num = x[0]
        if(line == 1):
            masses_list = x
        if(line == 2):
            init_seq_list = x
        if(line == 3):
            fin_seq_list = x
    return nodes_num, masses_list, init_seq_list, fin_seq_list
and the data which works:
6
2400 2000 1200 2400 1600 4000
1 4 5 3 6 2
5 3 2 4 6 1
and the long input file:
https://pastebin.com/atAcygkk
it stops at the sequence: ... 2421 1139 322], so it's like a part of 4th row.
To read input from "standard input", you just need to use the stdin stream. Since your data is all on lines, you can just read until the EOL delimiter, without having to track lines yourself with an index number. This code will work when run as python3.9 sowholeinput.py < atAcygkk.txt, or cat atAcygkk.txt | python3.9 sowholeinput.py.
import sys

def read_data():
    stream = sys.stdin
    num = int(stream.readline())
    masses = [int(t) for t in stream.readline().split()]
    init_seq = [int(t) for t in stream.readline().split()]
    fin_seq = [int(t) for t in stream.readline().split()]
    return num, masses, init_seq, fin_seq
Interestingly, it does not work, as you describe, when pasting the text using the terminal cut-and-paste. This implies a limitation with that method, not Python itself.
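An alternative sketch that sidesteps line-by-line reading entirely: slurp all of standard input at once and split on whitespace (this assumes the same four-block layout described above).
import sys

def read_data_all():
    tokens = sys.stdin.read().split()
    num = int(tokens[0])
    masses = [int(t) for t in tokens[1:1 + num]]
    init_seq = [int(t) for t in tokens[1 + num:1 + 2 * num]]
    fin_seq = [int(t) for t in tokens[1 + 2 * num:1 + 3 * num]]
    return num, masses, init_seq, fin_seq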

Reading input lines for int objects separated with whitespace?

I'm trying to solve a programming problem that involves returning a boolean for an uploaded profile picture, matching its resolution with the one that I provide as input and returning the statement described below. This is one test case that is giving me errors:
180
3
640 480 CROP IT
320 200 UPLOAD ANOTHER
180 180 ACCEPTED
The first line gives the dimension that needs to be matched, the second line is the number of test cases, and the remaining lines contain resolutions with whitespace separators. For each resolution, the output shown on that line needs to be printed.
I've tried this, since it was the most natural thing I could think of, being very new to Python I/O:
from sys import stdin, stdout
dim = int(input())
n = int(input())
out = ''
for cases in range(0, n):
    in1 = int(stdin.readline().rstrip('\s'))
    in2 = int(stdin.readline().rstrip('\s'))
    out += str(prof_pic(in1, in2, dim)) + '\n'
stdout.write(out)
ValueError: invalid literal for int() with base 10 : '640 480\n'
prof_pic is the function I'm refraining from describing here to keep the post from getting too long, but it is written in such a way that the width and height parameters both get compared with dim and an output is returned. The problem is with reading those lines. What is the best way to read such lines with differing separators?
You can try this; it works in Python 3.x:
dimention = int(input())
t = int(input())
for i in range(t):
    a = list(map(int, input().split()))
Instead of:
in2 = int(stdin.readline().rstrip('\s'))
you may try:
in2 = list(map(int, stdin.readline().split()[:2]))
and you get
in2 = [640, 480]
You're calling readline. As the name implies, this reads in a whole line. (If you're not sure what you're getting, you should try printing it out.) So, you get something like this:
640 480 CROP IT
You can't call int on that.
What you want to do is split that line into separate pieces like this:
['640', '480', 'CROP IT']
For example:
line = stdin.readline().rstrip('\s')
in1, in2, rest = line.split(None, 2)
Now you can convert those first two into ints:
in1 = int(in1)
in2 = int(in2)
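Putting the pieces together, a minimal sketch of the whole read loop (prof_pic is a hypothetical stand-in here, since the original post does not show it):
from sys import stdin, stdout

def prof_pic(w, h, dim):  # hypothetical stand-in for the real function
    return "ACCEPTED" if w == dim and h == dim else "CHECK"

dim = int(stdin.readline())
n = int(stdin.readline())
out = []
for _ in range(n):
    w, h, rest = stdin.readline().split(None, 2)
    out.append(prof_pic(int(w), int(h), dim))
stdout.write("\n".join(out) + "\n")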

Fastest way to process a large file?

I have multiple 3 GB tab delimited files. There are 20 million rows in each file. All the rows have to be independently processed, no relation between any two rows. My question is, what will be faster?
Reading line-by-line?
with open() as infile:
    for line in infile:
Reading the file into memory in chunks and processing it, say 250 MB at a time?
The processing is not very complicated; I am just grabbing the value in column1 into List1, column2 into List2, etc. I might need to add some column values together.
I am using Python 2.7 on a Linux box that has 30 GB of memory. The files are ASCII text.
Is there any way to speed things up in parallel? Right now I am using the former method and the process is very slow. Is using a CSV reader module going to help?
I don't have to do it in Python; any other language or database ideas are welcome.
It sounds like your code is I/O bound. This means that multiprocessing isn't going to help—if you spend 90% of your time reading from disk, having an extra 7 processes waiting on the next read isn't going to help anything.
And, while using a CSV reading module (whether the stdlib's csv or something like NumPy or Pandas) may be a good idea for simplicity, it's unlikely to make much difference in performance.
Still, it's worth checking that you really are I/O bound, instead of just guessing. Run your program and see whether your CPU usage is close to 0% or close to 100% of one core. Do what Amadan suggested in a comment, and run your program with just pass for the processing, and see whether that cuts the time by 5% or by 70%. You may even want to try comparing with a loop over os.open and os.read(1024*1024) or something and see if that's any faster.
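For example, a quick-and-dirty check of the bare read loop might look like this (a sketch; path stands in for one of your files):
import time

path = "data.tsv"  # hypothetical file name
start = time.time()
with open(path) as infile:
    for line in infile:
        pass  # processing stubbed out; compare this wall time against the full run
print "bare read loop took %.1f seconds" % (time.time() - start)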
Since you're using Python 2.x, Python is relying on the C stdio library to guess how much to buffer at a time, so it might be worth forcing it to buffer more. The simplest way to do that is to use readlines(bufsize) for some large bufsize. (You can try different numbers and measure them to see where the peak is. In my experience, usually anything from 64K to 8MB is about the same, but depending on your system that may be different, especially if you're, e.g., reading off a network filesystem with great throughput but horrible latency that swamps the throughput-vs.-latency of the actual physical drive and the caching the OS does.)
So, for example:
bufsize = 65536
with open(path) as infile:
    while True:
        lines = infile.readlines(bufsize)
        if not lines:
            break
        for line in lines:
            process(line)
Meanwhile, assuming you're on a 64-bit system, you may want to try using mmap instead of reading the file in the first place. This certainly isn't guaranteed to be better, but it may be better, depending on your system. For example:
import mmap

with open(path) as infile:
    m = mmap.mmap(infile.fileno(), 0, access=mmap.ACCESS_READ)
A Python mmap is sort of a weird object—it acts like a str and like a file at the same time, so you can, e.g., manually iterate scanning for newlines, or you can call readline on it as if it were a file. Both of those will take more processing from Python than iterating the file as lines or doing batch readlines (because a loop that would be in C is now in pure Python… although maybe you can get around that with re, or with a simple Cython extension?)… but the I/O advantage of the OS knowing what you're doing with the mapping may swamp the CPU disadvantage.
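As a sketch of the first option (scanning for newlines yourself; process() and path are the same placeholders used above, Python 2 shown to match the rest of the answer):
import mmap

with open(path, "rb") as infile:
    m = mmap.mmap(infile.fileno(), 0, access=mmap.ACCESS_READ)
    pos = 0
    while True:
        nl = m.find("\n", pos)   # next newline, or -1 at end of the mapping
        if nl == -1:
            break
        process(m[pos:nl])       # one line, without its newline
        pos = nl + 1
    if pos < len(m):
        process(m[pos:])         # trailing line if the file doesn't end with a newline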
Unfortunately, Python doesn't expose the madvise call that you'd use to tweak things in an attempt to optimize this in C (e.g., explicitly setting MADV_SEQUENTIAL instead of making the kernel guess, or forcing transparent huge pages)—but you can actually ctypes the function out of libc.
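(For what it's worth, newer Pythons do expose this directly; a minimal sketch, assuming Python 3.8+ on a platform that defines MADV_SEQUENTIAL:)
import mmap

with open(path, "rb") as infile:
    m = mmap.mmap(infile.fileno(), 0, access=mmap.ACCESS_READ)
    # Hint to the kernel that the mapping will be read sequentially.
    m.madvise(mmap.MADV_SEQUENTIAL)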
I know this question is old, but I wanted to do a similar thing, so I created a simple framework which helps you read and process a large file in parallel. I'm leaving what I tried as an answer.
This is the code; I give an example at the end.
import gc
import multiprocessing as mp
import os
import time

import psutil

def chunkify_file(fname, size=1024*1024*1000, skiplines=-1):
    """
    function to divide a large text file into chunks, each having size ~= size, so that the chunks are line aligned
    Params :
        fname : path to the file to be chunked
        size : size of each chunk is ~= this
        skiplines : number of lines at the beginning to skip, -1 means don't skip any lines
    Returns :
        start and end position of chunks in bytes
    """
    chunks = []
    fileEnd = os.path.getsize(fname)
    with open(fname, "rb") as f:
        if(skiplines > 0):
            for i in range(skiplines):
                f.readline()
        chunkEnd = f.tell()
        count = 0
        while True:
            chunkStart = chunkEnd
            f.seek(f.tell() + size, os.SEEK_SET)
            f.readline()  # make this chunk line aligned
            chunkEnd = f.tell()
            chunks.append((chunkStart, chunkEnd - chunkStart, fname))
            count += 1
            if chunkEnd > fileEnd:
                break
    return chunks
def parallel_apply_line_by_line_chunk(chunk_data):
    """
    function to apply a function to each line in a chunk
    Params :
        chunk_data : the data for this chunk
    Returns :
        list of the non-None results for this chunk
    """
    chunk_start, chunk_size, file_path, func_apply = chunk_data[:4]
    func_args = chunk_data[4:]
    t1 = time.time()
    chunk_res = []
    with open(file_path, "rb") as f:
        f.seek(chunk_start)
        cont = f.read(chunk_size).decode(encoding='utf-8')
        lines = cont.splitlines()
        for i, line in enumerate(lines):
            ret = func_apply(line, *func_args)
            if(ret != None):
                chunk_res.append(ret)
    return chunk_res
def parallel_apply_line_by_line(input_file_path, chunk_size_factor, num_procs, skiplines, func_apply, func_args, fout=None):
    """
    function to apply a supplied function line by line in parallel
    Params :
        input_file_path : path to input file
        chunk_size_factor : size of 1 chunk in MB
        num_procs : number of parallel processes to spawn, max used is number of available cores - 1
        skiplines : number of top lines to skip while processing
        func_apply : a function which expects a line and outputs None for lines we don't want processed
        func_args : arguments to the function func_apply
        fout : do we want to output the processed lines to a file
    Returns :
        list of the non-None results obtained by processing each line
    """
    num_parallel = min(num_procs, psutil.cpu_count()) - 1
    jobs = chunkify_file(input_file_path, 1024 * 1024 * chunk_size_factor, skiplines)
    jobs = [list(x) + [func_apply] + func_args for x in jobs]
    print("Starting the parallel pool for {} jobs ".format(len(jobs)))
    lines_counter = 0
    # maxtasksperchild - if not supplied, something weird happens and memory blows up as the processes keep on lingering
    pool = mp.Pool(num_parallel, maxtasksperchild=1000)
    outputs = []
    for i in range(0, len(jobs), num_parallel):
        print("Chunk start = ", i)
        t1 = time.time()
        chunk_outputs = pool.map(parallel_apply_line_by_line_chunk, jobs[i : i + num_parallel])
        for i, subl in enumerate(chunk_outputs):
            for x in subl:
                if(fout != None):
                    print(x, file=fout)
                else:
                    outputs.append(x)
                lines_counter += 1
        del(chunk_outputs)
        gc.collect()
        print("All Done in time ", time.time() - t1)
    print("Total lines we have = {}".format(lines_counter))
    pool.close()
    pool.terminate()
    return outputs
Say, for example, I have a file in which I want to count the number of words in each line; then the processing of each line would look like:
def count_words_line(line):
    return len(line.strip().split())
and then call the function like:
parallel_apply_line_by_line(input_file_path, 100, 8, 0, count_words_line, [], fout=None)
Using this, I get a speed up of ~8 times as compared to vanilla line by line reading on a sample file of size ~20GB in which I do some moderately complicated processing on each line.

Declaring a positional argument inside a group of optional arguments in Python

I want to use argparse in Python to declare arguments as the following:
./get_efms_by_ids [-h] [-v] [inputfile [1 3 4 9] [-c 11..18] [20 25 40]]
What I want to do in this case is the following:
If inputfile is given, it can be followed by two types of optional arguments: 1 3 4 9 or -c 11..18, or both of them. If I do not give inputfile, those arguments must be absent.
Some examples of command-line usage:
./get_efms_by_ids Vacf.txt // default: get 1 or 10 first lines in Vacf.txt
./get_efms_by_ids Vacf.txt 1 3 4 9 // get the lines that indexes: 1 3 4 9 in Vacf.txt
./get_efms_by_ids Vacf.txt c 11..18 22 25 29 // get the lines that indexes are from 11 to 18, then the lines 22, 25, 29
./get_efms_by_ids c 11.. 18 // shows a readable error message
./get_efms_by_ids 1 3 4 9 // shows a readable error message
One can use nargs='?' or nargs='*' as in the following example:
parser = argparse.ArgumentParser(description='Selecting some Elementary Flux Modes by indexes.',version='1.0')
parser.add_argument('efm_matrix_file', type=file, help='give the name of the efms matrix file')
parser.add_argument('ids', nargs='?', help='give the indexes of the chosen efms')
parser.add_argument('-i','--indexes',nargs='*', help='give the begin and start indexes of the chosen efms')
But the result did not satisfy the purpose I proposed at the beginning of this post.
Any help will be appreciated.
First, I would ditch the -c option. You don't need both -c and .. to indicate a range of values. This would simplify your call to something like
./get_efms_by_ids [-h] [-v] [inputfile [index ...]]
where each index can be either a single integer or a range specified by lower..upper.
The argument parser could then be as simple as
import os
import sys
from argparse import ArgumentParser, ArgumentTypeError

def index_type(s):
    try:
        return int(s)
    except ValueError:
        try:
            return map(int, s.split(".."))
        except:
            raise ArgumentTypeError("Invalid index: %s" % (s,))

p = ArgumentParser()
p.add_argument("-v", action="store_true")  # -h/--help is added by argparse automatically
p.add_argument("inputfile", nargs="?")
p.add_argument("indices", nargs="*", type=index_type)
args = p.parse_args()

if not (args.inputfile is None or os.path.exists(args.inputfile)):
    sys.exit("Invalid file name: %s" % (args.inputfile,))
You'll have to check that the first positional argument (if any) is a valid file or not after parsing, since any arbitrary string could be a valid file name.
The index_type function is just an example of how you could transform each index (whether an integer or range) during the course of parsing.
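For illustration, here is roughly how that conversion behaves (hypothetical probes; the list result assumes Python 2, where map() returns a list):
index_type("7")        # -> 7
index_type("11..18")   # -> [11, 18]  (just the two endpoints, not the full range)
index_type("oops")     # raises ArgumentTypeError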
I take a different approach from chepner, but borrow some of chepner's ideas: ditching the -c option and using a modified index_type().
Code
#!/usr/bin/env python
import argparse
from itertools import chain

def index_type(s):
    try:
        return [int(s)]
    except ValueError:
        try:
            start, stop = map(int, s.split('..'))
            return range(start, stop + 1)
        except:
            raise argparse.ArgumentTypeError("Invalid index: %s" % (s,))

def get_options():
    parser = argparse.ArgumentParser()
    parser.add_argument('-v')
    parser.set_defaults(fileinput=None)
    options, remaining = parser.parse_known_args()
    if remaining:
        parser = argparse.ArgumentParser()
        parser.add_argument('fileinput', type=argparse.FileType())
        parser.add_argument('selected_lines', nargs='*', type=index_type)
        parser.parse_args(remaining, namespace=options)
        # Convert a nested list into a set of line numbers
        options.selected_lines = set(chain.from_iterable(options.selected_lines))
        # If the command line does not specify the line numbers, assume a default
        if not options.selected_lines:
            options.selected_lines = set(index_type('1..10'))
    return options

if __name__ == '__main__':
    options = get_options()
    # If the command line contains a file name, loop through the file and
    # process only the lines requested
    if options.fileinput is not None:
        for line_number, line in enumerate(options.fileinput, 1):
            if line_number in options.selected_lines:
                line = line.rstrip()
                print '{:>4} {}'.format(line_number, line)
Discussion
The argparse module allows for optional arguments, but fileinput cannot be optional because it is a positional argument; that is how argparse operates.
To get around this limitation, I parse the command line twice: the first time to get the -v flag. For the first parsing, I use the parse_known_args() method, which ignores those parameters it does not understand.
For the second parsing, I work on the remaining arguments, assuming the first argument is the file name, followed by a series of line numbers.
Parsing line numbers is tricky. The ultimate goal is to convert something like "11..18 1 3 4 9" into [1, 3, 4, 9, 11, 12, 13, 14, 15, 16, 17, 18]
Using a modified index_type() (thanks to chepner), I was able to parse the command line from "11..18 1 3 4 9" to [[11, 12, 13, 14, 15, 16, 17, 18], [1], [3], [4], [9]]
The next step is to convert this nested list into a set of line numbers for easy look up
As a bonus, if the command line does not specify any line number, I assume 1..10
After get_options returns, options.fileinput will either be None or a file handle--no need to open the file to read. options.selected_lines will be a set of line numbers to select
The final task is to go through the lines, if it is selected, process it. In my case, I just print it out

Unable to have a command line parameter in Python

I run
import sys
print "x \tx^3\tx^3+x^3\t(x+1)^3\tcube+cube=cube+1"
for i in range(sys.argv[2]):  # mistake here
    cube=i*i*i
    cube2=cube+cube
    cube3=(i+1)*(i+1)*(i+1)
    truth=(cube2==cube3)
    print i, "\t", cube, "\t", cube + cube, "\t", cube3, "\t", truth
I get
Traceback (most recent call last):
File "cube.py", line 5, in <module>
for i in range(sys.argv[2]):
IndexError: list index out of range
How can you use command line parameter as follows in the code?
Example of the use
python cube.py 100
It should give
x x^3 x^3+x^3 (x+1)^3 cube+cube=cube+1
0 0 0 1 False
1 1 2 8 False
2 8 16 27 False
--- cut ---
97 912673 1825346 941192 False
98 941192 1882384 970299 False
99 970299 1940598 1000000 False
Use:
sys.argv[1]
Also note that arguments are always strings, while range expects an integer.
So the correct code would be:
for i in range(int(sys.argv[1])):
You want int(sys.argv[1]) not 2.
Ideally you would check the length of sys.argv first and print a useful error message if the user doesn't provide the proper arguments.
Edit: See http://www.faqs.org/docs/diveintopython/kgp_commandline.html
Here are some tips on how you can often solve this type of problem yourself:
Read what the error message is telling you: "list index out of range".
What list? Two choices (1) the list returned by range (2) sys.argv
In this case, it can't be (1); it's impossible to get that error out of
for i in range(some_integer) ... but you may not know that, so in general, if there are multiple choices within a line for the source of an error, and you can't see which is the cause, split the line into two or more statements:
num_things = sys.argv[2]
for i in range(num_things):
and run the code again.
By now we know that sys.argv is the list. What index? Must be 2. How come that's out of range? Knowledge-based answer: Because Python counts list indexes from 0. Experiment-based answer: Insert this line before the failing line:
print list(enumerate(sys.argv))
So you need to change the [2] to [1]. Then you will get another error, because in range(n) the n must be an integer, not a string ... and you can work through this new problem in a similar fashion -- extra tip: look up range() in the docs.
I'd like to suggest having a look at Python's argparse module, which is a giant improvement in parsing commandline parameters - it can also do the conversion to int for you including type-checking and error-reporting / generation of help messages.
It's sys.argv[1] instead of [2]. You also want to make sure that you convert it to an integer if you're doing math with it.
so instead of
for i in range(sys.argv[2]):
you want
for i in range(int(sys.argv[1])):
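Putting it all together, a corrected version of the script might look like this (a sketch, keeping the post's Python 2 style):
import sys

if len(sys.argv) != 2:
    sys.exit("usage: python cube.py N")

print "x \tx^3\tx^3+x^3\t(x+1)^3\tcube+cube=cube+1"
for i in range(int(sys.argv[1])):
    cube = i * i * i
    cube2 = cube + cube
    cube3 = (i + 1) * (i + 1) * (i + 1)
    print i, "\t", cube, "\t", cube2, "\t", cube3, "\t", (cube2 == cube3)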
