Python input() does not read whole input data

I'm trying to read data from stdin; actually I'm using Ctrl+C, Ctrl+V to paste the values into cmd, but it stops the process at some point, and it's always the same point. The input file is a .in file; the format is that the first row is one number and the next 3 rows contain sets of numbers separated by spaces. I'm using Python 3.9.9. The problem only occurs with longer files (number of elements in the sets > 10000); with short input everything is fine. It seems like the memory just runs out. I had the following approach:
def readData():
    # Read input
    for line in range(5):
        x = list(map(int, input().rsplit()))
        if line == 0:
            nodes_num = x[0]
        if line == 1:
            masses_list = x
        if line == 2:
            init_seq_list = x
        if line == 3:
            fin_seq_list = x
    return nodes_num, masses_list, init_seq_list, fin_seq_list
and the data which works:
6
2400 2000 1200 2400 1600 4000
1 4 5 3 6 2
5 3 2 4 6 1
and the long input file:
https://pastebin.com/atAcygkk
it stops at the sequence: ... 2421 1139 322], so it's like part of the 4th row.

To read input from "standard input", you just need to use the stdin stream. Since your data is all on lines, you can just read until the EOL delimiter, without having to track lines yourself with an index number. This code works when run as python3.9 sowholeinput.py < atAcygkk.txt, or cat atAcygkk.txt | python3.9 sowholeinput.py.
import sys

def read_data():
    stream = sys.stdin
    num = int(stream.readline())
    masses = [int(t) for t in stream.readline().split()]
    init_seq = [int(t) for t in stream.readline().split()]
    fin_seq = [int(t) for t in stream.readline().split()]
    return num, masses, init_seq, fin_seq
Interestingly, it does not work, as you describe, when pasting the text using terminal cut-and-paste. This implies a limitation of that method, not of Python itself.
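If the line boundaries themselves are suspect (for example after a messy paste), a variant that reads the whole stream at once and slices it by token count is a bit more robust. A minimal sketch, assuming the layout described above (one count, then three rows of num integers each):

import sys

def read_data_tokens():
    # read everything at once and split on any whitespace
    tokens = sys.stdin.read().split()
    num = int(tokens[0])
    masses = [int(t) for t in tokens[1:1 + num]]
    init_seq = [int(t) for t in tokens[1 + num:1 + 2 * num]]
    fin_seq = [int(t) for t in tokens[1 + 2 * num:1 + 3 * num]]
    return num, masses, init_seq, fin_seq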

Related

Reading input lines for int objects separated by whitespace?

I'm trying to solve a programming problem that involves matching an uploaded profile pic's resolution against the one I provide as input and returning one of the statements described below. This is one such test case that is giving me errors:
180
3
640 480 CROP IT
320 200 UPLOAD ANOTHER
180 180 ACCEPTED
The first line is the dimension that needs to be matched, the second line is the number of test cases, and the rest consist of resolutions with whitespace separators. For each resolution, the output shown on that line needs to be printed.
I've tried this, since it was the most natural thing I could think of and being very new to Python I/O:
from sys import stdin, stdout
dim = int(input())
n = int(input())
out = ''
for cases in range(0, n):
    in1 = int(stdin.readline().rstrip('\s'))
    in2 = int(stdin.readline().rstrip('\s'))
    out += str(prof_pic(in1, in2, dim)) + '\n'
stdout.write(out)
ValueError: invalid literal for int() with base 10: '640 480\n'
prof_pic is the function that I'm refraining from describing here to keep the post from getting too long. But I've written it so that the width and height params both get compared with dim and an output is returned. The problem is with reading those lines. What is the best way to read such lines with differing separators?
You can try this; it's Python 3.x:
dimension = int(input())
t = int(input())
for i in range(t):
    a = list(map(int, input().split()))
Instead of:
in2 = int(stdin.readline().rstrip('\s'))
you may try:
in2 = list(map(int, stdin.readline().split()[:2]))
and you get
in2 = [640, 480]
You're calling readline. As the name implies, this reads in a whole line. (If you're not sure what you're getting, you should try printing it out.) So, you get something like this:
640 480 CROP IT
You can't call int on that.
What you want to do is split that line into separate pieces like this:
['640', '480', 'CROP IT']
For example:
line = stdin.readline().rstrip()
in1, in2, rest = line.split(None, 2)
Now you can convert those first two into ints:
in1 = int(in1)
in2 = int(in2)
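Putting the pieces together, a minimal sketch of the corrected reading loop; prof_pic here is only a hypothetical stub, since the real comparison function isn't shown in the question:

from sys import stdin, stdout

def prof_pic(width, height, dim):
    # hypothetical stand-in for the poster's undisclosed logic
    return 'w=%d h=%d dim=%d' % (width, height, dim)

dim = int(stdin.readline())
n = int(stdin.readline())
out = []
for _ in range(n):
    w, h = stdin.readline().split(None, 2)[:2]  # two ints first, label last
    out.append(prof_pic(int(w), int(h), dim))
stdout.write('\n'.join(out) + '\n')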

Variable number of arguments to Python script

The command line to run my Python script is:
./parse_ms.py inputfile 3 2 2 2
the arguments are an input file; the number 3 is the number of samples in my study, each with 2 individuals.
In the script, I indicate the arguments as follows:
inputfile = open(sys.argv[1], "r")
nsam = int(sys.argv[2])
nind1 = int(sys.argv[3])
nind2 = int(sys.argv[4])
nind3 = int(sys.argv[5])
However, the number of samples may vary. I can have:
./parse_ms.py input 4 6 8 2 20
in this case, I have 4 samples with 6, 8, 2 and 20 individuals in each.
It seems inefficient to add another sys.argv every time a sample is added. Is there a way to make this more general? That is, if I set nsam to 5, Python automatically expects five numbers to follow for the individuals in each sample.
You can simply slice off the rest of sys.argv into a list. e.g.
inputfile = open(sys.argv[1], "r")
num_samples = int(sys.argv[2])
samples = sys.argv[3:3+num_samples]
Although if that is all your arguments, you can simply not pass a number of samples and just grab everything.
inputfile = open(sys.argv[1], "r")
samples = sys.argv[2:]
Samples can be converted to the proper datatype afterward.
Also, look at argparse for a nicer way of handling command line arguments in general.
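For example, a minimal argparse sketch of the same interface (the names here are illustrative); nargs='+' collects one or more values, so the sample count no longer needs to be passed at all:

import argparse

parser = argparse.ArgumentParser(description='Parse an ms input file.')
parser.add_argument('inputfile', type=argparse.FileType('r'))
parser.add_argument('ninds', type=int, nargs='+',
                    help='number of individuals in each sample')
args = parser.parse_args()

nsam = len(args.ninds)  # e.g. ./parse_ms.py input 6 8 2 20 gives nsam == 4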
You can have a list of ninds and even catch exceptions by doing the following:
from sys import argv

try:
    ninds = [int(argv[i + 3]) for i in range(int(argv[2]))]
except IndexError:
    print("Error. Expected %s samples and got %d" % (argv[2], len(argv[3:])))

converting a file of numbers into a list

To put it simply, I am trying to read a file that will eventually contain nothing but numbers, separated by spaces, commas, or newlines. I have read through a lot of these posts and fixed some things. I learned the values are imported as strings first. However, I am running into an issue where the numbers are imported as lists, so now I have a list of lists. This would be fine except I can't check them against ints or add numbers to them. The idea is to have each user assigned a number and then saved. I'm not worrying about saving right now; I'm just worried about importing the numbers and being able to use them as individual numbers.
my code thus far:
fo1 = open('mach_uID_3.txt', 'a+')
t1 = fo1.read()
t2 = []
print t1
for x in t1.split():
    print x
    z = [int(n) for n in x.split()]
    t2.append(z)
print t2
print t2[3]
fo1.close()
and the file its reading is.
0 1 2 25
34
23
my results are pretty ugly, but here you go:
0 1 2 25
34
23
0
1
2
25
34
23
[[0], [1], [2], [25], [34], [23]]
[25]
Process finished with exit code 0
Use extend instead of append:
t2.extend(int(n) for n in x.split())
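The difference in a nutshell: append adds its argument as one element, while extend adds each item of its argument individually:

>>> t = [0, 1]
>>> t.append([2, 3])  # nests the argument as a single element
>>> t
[0, 1, [2, 3]]
>>> t = [0, 1]
>>> t.extend([2, 3])  # splices the items in one by one
>>> t
[0, 1, 2, 3]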
To have all the numbers in a single, flattened list, do this:
fo1 = open('mach_uID_3.txt', 'a+')
number_list = list(map(int, fo1.read().split()))
fo1.close()
But it's better to open the file like this:
with open('mach_uID_3.txt', 'a+') as fo1:
    number_list = list(map(int, fo1.read().split()))
so you don't have to explicitly close it.
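Since the file may eventually use commas as well as spaces and newlines, here is a sketch that splits on any run of commas or whitespace (same hypothetical file name as above):

import re

with open('mach_uID_3.txt') as fo1:
    tokens = re.split(r'[,\s]+', fo1.read())
    number_list = [int(tok) for tok in tokens if tok]  # skip empty strings at the edges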

how to get results from many lists on the same line in python

I have three lists:
alist=[1,2,3,4,5]
blist=['a','b','c','d','e']
clist=['#','#','$','&','*']
I want my output in this format:
1 2 3 4 5
a b c d e
# # $ & *
I am able to print in the correct format, but when I have lists with many elements it actually prints like this:
1 2 3 4 5 6 ..........................................................................
................................................................................
a b c d e ............................................................................
......................................................................................
# # $ & * .............................................................................
.......................................................................................
but I want my output like this:
12345....................................................................
abcde...................................................................
##$&*...................................................................
............................................................... {this line is from alist}
................................................................ {this line is from blist}
................................................................ {this line is from clist}
Try the following:
term_width = 80
all_lists = (alist, blist, clist)
length = max(map(len, all_lists))
for offset in xrange(0, length, term_width):
    print '\n'.join(''.join(map(str, l[offset:offset+term_width])) for l in all_lists)
This assumes the terminal width is 80 characters, which is the default. You might want to detect its actual width with the curses library or something based on it.
Either way, to adapt to any output width you only need to change the term_width value and the code will use it.
It also assumes all elements are 1 character long. If that's not the case, please clarify.
If you need to detect terminal width, you may find some solutions here: How to get Linux console window width in Python
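On Python 3.3+ the standard library can also do this without curses; a minimal sketch:

import shutil

# falls back to 80x24 when the size can't be determined
term_width = shutil.get_terminal_size((80, 24)).columns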

Efficient file buffering & scanning methods for large files in python

The description of the problem I am having is a bit complicated, and I will err on the side of providing more complete information. For the impatient, here is the briefest way I can summarize it:
What is the fastest (least execution time) way to split a text file into ALL (overlapping) substrings of size N (bounded N, e.g. 36), while throwing out newline characters?
I am writing a module which parses files in the FASTA ascii-based genome format. These files comprise what is known as the 'hg18' human reference genome, which you can download from the UCSC genome browser (go slugs!) if you like.
As you will notice, the genome files are composed of chr[1..22].fa and chr[XY].fa, as well as a set of other small files which are not used in this module.
Several modules already exist for parsing FASTA files, such as BioPython's SeqIO. (Sorry, I'd post a link, but I don't have the points to do so yet.) Unfortunately, every module I've been able to find doesn't do the specific operation I am trying to do.
My module needs to split the genome data ('CAGTACGTCAGACTATACGGAGCTA' could be a line, for instance) into every single overlapping N-length substring. Let me give an example using a very small file (the actual chromosome files are between 355 and 20 million characters long) and N=8:
>>> import cStringIO
>>> example_file = cStringIO.StringIO("""\
... >header
... CAGTcag
... TFgcACF
... """)
>>> for read in parse(example_file):
...     print read
...
CAGTCAGT
AGTCAGTF
GTCAGTFG
TCAGTFGC
CAGTFGCA
AGTFGCAC
GTFGCACF
The function I found to have the absolute best performance of the methods I could think of is this:
def parse(file):
    size = 8  # of course in my code this is a function argument
    file.readline()  # skip past the header
    buffer = ''
    for line in file:
        buffer += line.rstrip().upper()
        while len(buffer) >= size:
            yield buffer[:size]
            buffer = buffer[1:]
This works, but unfortunately it still takes about 1.5 hours (see note below) to parse the human genome this way. Perhaps this is the very best I am going to see with this method (a complete code refactor might be in order, but I'd like to avoid it as this approach has some very specific advantages in other areas of the code), but I thought I would turn this over to the community.
Thanks!
Note, this time includes a lot of extra calculation, such as computing the opposing strand read and doing hashtable lookups on a hash of approximately 5G in size.
Post-answer conclusion: It turns out that using fileobj.read() and then manipulating the resulting string (string.replace(), etc.) took relatively little time and memory compared to the remainder of the program, and so I used that approach. Thanks everyone!
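For completeness, a minimal sketch of that read-everything approach, matching the generator interface of parse above (error handling omitted; this is an assumption of what the final code looked like, not the author's actual version):

def parse_whole(file, size=8):
    file.readline()  # skip past the header
    whole = file.read().replace('\n', '').upper()
    for offset in xrange(len(whole) - size + 1):
        yield whole[offset:offset + size]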
Could you mmap the file and start pecking through it with a sliding window? I wrote a stupid little program that runs pretty small:
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
sarnold 20919 0.0 0.0 33036 4960 pts/2 R+ 22:23 0:00 /usr/bin/python ./sliding_window.py
Working through a 636229 byte fasta file (found via http://biostar.stackexchange.com/questions/1759) took .383 seconds.
#!/usr/bin/python

import mmap
import os

def parse(string, size):
    stride = 8
    start = string.find("\n")
    while start < size - stride:
        print string[start:start+stride]
        start += 1

fasta = open("small.fasta", 'r')
fasta_size = os.stat("small.fasta").st_size
fasta_map = mmap.mmap(fasta.fileno(), 0, mmap.MAP_PRIVATE, mmap.PROT_READ)
parse(fasta_map, fasta_size)
Some classic IO-bound changes:
Use a lower-level read operation like os.read and read into a large fixed buffer.
Use threading/multiprocessing, where one thread reads and buffers while the other processes.
If you have multiple processors/machines, use multiprocessing/mq to divvy up processing across CPUs, map-reduce style.
Using a lower-level read operation wouldn't be that much of a rewrite; a rough sketch follows below. The others would be pretty large rewrites.
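Here is what that lower-level read could look like, under the assumption that windows may span chunk boundaries, so the last size-1 characters are carried between chunks (header handling omitted):

import os

def parse_chunked(fd, size=8, bufsize=1 << 20):
    carry = ''
    while True:
        chunk = os.read(fd, bufsize)  # read a large fixed-size chunk
        if not chunk:
            break
        data = carry + chunk.replace('\n', '').upper()
        for offset in xrange(len(data) - size + 1):
            yield data[offset:offset + size]
        carry = data[-(size - 1):]  # tail for windows spanning the boundary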
I suspect the problem is that you have so much data stored in string format, which is really wasteful for your use case, that you're running out of real memory and thrashing swap. 128 GB should be enough to avoid this... :)
Since you've indicated in comments that you need to store additional information anyway, a separate class which references a parent string would be my choice. I ran a short test using chr21.fa from chromFa.zip from hg18; the file is about 48MB and just under 1M lines. I only have 1GB of memory here, so I simply discard the objects afterwards. This test thus won't show problems with fragmentation, cache, or related, but I think it should be a good starting point for measuring parsing throughput:
import mmap
import os
import time
import sys

class Subseq(object):
    __slots__ = ("parent", "offset", "length")

    def __init__(self, parent, offset, length):
        self.parent = parent
        self.offset = offset
        self.length = length

    # these are discussed in comments:
    def __str__(self):
        return self.parent[self.offset:self.offset + self.length]

    def __hash__(self):
        return hash(str(self))

    def __getitem__(self, index):
        # doesn't currently handle slicing
        assert 0 <= index < self.length
        return self.parent[self.offset + index]

    # other methods

def parse(file, size=8):
    file.readline()  # skip header
    whole = "".join(line.rstrip().upper() for line in file)
    for offset in xrange(0, len(whole) - size + 1):
        yield Subseq(whole, offset, size)

class Seq(object):
    __slots__ = ("value", "offset")

    def __init__(self, value, offset):
        self.value = value
        self.offset = offset

def parse_sep_str(file, size=8):
    file.readline()  # skip header
    whole = "".join(line.rstrip().upper() for line in file)
    for offset in xrange(0, len(whole) - size + 1):
        yield Seq(whole[offset:offset + size], offset)

def parse_plain_str(file, size=8):
    file.readline()  # skip header
    whole = "".join(line.rstrip().upper() for line in file)
    for offset in xrange(0, len(whole) - size + 1):
        yield whole[offset:offset+size]

def parse_tuple(file, size=8):
    file.readline()  # skip header
    whole = "".join(line.rstrip().upper() for line in file)
    for offset in xrange(0, len(whole) - size + 1):
        yield (whole, offset, size)

def parse_orig(file, size=8):
    file.readline()  # skip header
    buffer = ''
    for line in file:
        buffer += line.rstrip().upper()
        while len(buffer) >= size:
            yield buffer[:size]
            buffer = buffer[1:]

def parse_os_read(file, size=8):
    file.readline()  # skip header
    file_size = os.fstat(file.fileno()).st_size
    whole = os.read(file.fileno(), file_size).replace("\n", "").upper()
    for offset in xrange(0, len(whole) - size + 1):
        yield whole[offset:offset+size]

def parse_mmap(file, size=8):
    file.readline()  # skip past the header
    buffer = ""
    for line in file:
        buffer += line
        if len(buffer) >= size:
            for start in xrange(0, len(buffer) - size + 1):
                yield buffer[start:start + size].upper()
            buffer = buffer[-(size - 1):]  # keep the last size-1 chars for the next line
    for start in xrange(0, len(buffer) - size + 1):
        yield buffer[start:start + size].upper()

def length(x):
    return sum(1 for _ in x)

def duration(secs):
    return "%dm %ds" % divmod(secs, 60)

def main(argv):
    tests = [parse, parse_sep_str, parse_tuple, parse_plain_str, parse_orig, parse_os_read]
    n = 0
    for fn in tests:
        n += 1
        with open(argv[1]) as f:
            start = time.time()
            length(fn(f))
            end = time.time()
        print "%d %-20s %s" % (n, fn.__name__, duration(end - start))

    fn = parse_mmap
    n += 1
    with open(argv[1]) as f:
        f = mmap.mmap(f.fileno(), 0, mmap.MAP_PRIVATE, mmap.PROT_READ)
        start = time.time()
        length(fn(f))
        end = time.time()
    print "%d %-20s %s" % (n, fn.__name__, duration(end - start))

if __name__ == "__main__":
    sys.exit(main(sys.argv))
1 parse 1m 42s
2 parse_sep_str 1m 42s
3 parse_tuple 0m 29s
4 parse_plain_str 0m 36s
5 parse_orig 0m 45s
6 parse_os_read 0m 34s
7 parse_mmap 0m 37s
The first four are my code, while orig is yours and the last two are from other answers here.
User-defined objects are much more costly to create and collect than tuples or plain strings! This shouldn't be that surprising, but I had not realized it would make this much of a difference (compare #1 and #3, which really only differ in a user-defined class vs tuple). You said you want to store additional information, like offset, with the string anyway (as in the parse and parse_sep_str cases), so you might consider implementing that type in a C extension module. Look at Cython and related if you don't want to write C directly.
Cases #1 and #2 being identical is expected: by pointing to a parent string, I was trying to save memory rather than processing time, but this test doesn't measure that.
I have a function that processes a text file, using buffered reads and writes and parallel computing with an async pool of worker processes. I have an AMD with 2 cores and 8GB RAM, running GNU/Linux, and can process 300000 lines in less than 1 second, 1000000 lines in approximately 4 seconds, and approximately 4500000 lines (more than 220MB) in approximately 20 seconds:
# -*- coding: utf-8 -*-
import sys
from multiprocessing import Pool

def process_file(f, fo="result.txt", fi=sys.argv[1]):
    fi = open(fi, "r", 4096)
    fo = open(fo, "w", 4096)
    b = []
    x = 0
    result = None
    pool = None
    for line in fi:
        b.append(line)
        x += 1
        if (x % 200000) == 0:
            if pool is None:
                pool = Pool(processes=20)
            if result is None:
                result = pool.map_async(f, b)
            else:
                presult = result.get()
                result = pool.map_async(f, b)
                for l in presult:
                    fo.write(l)
            b = []
    if result is not None:
        for l in result.get():
            fo.write(l)
    if b:
        for l in b:
            fo.write(f(l))
    fo.close()
    fi.close()
The first argument is a function that receives one line, processes it, and returns the result to be written to the file; the next is the output file, and the last is the input file (you can omit the last argument if you take the input file as the first parameter of your script).
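A hypothetical usage sketch, with a trivial per-line function (the function must be defined at module level so the worker processes can pickle it):

def upper_line(line):
    # example per-line transformation
    return line.upper()

if __name__ == "__main__":
    # e.g. python script.py input.txt -> writes result.txt
    process_file(upper_line)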
