from itertools import islice

response = s3.get_object(Bucket=bucket, Key=file)

def generate_files(resp, N):
    while True:
        line = list(islice(resp["Body"].iter_lines(), 0, 10))
        if not line:
            break
        yield line
    return
However, when I call my generator, the result is not as expected.
for line in generate_files(response, 10):
    print('--')
    print([l.decode('utf-8') for l in line])
Instead of going from lines 0 to 9, then 10 to 19, then 20 to 29, and so on, it skips ahead an arbitrary number of lines between generator calls: it returns lines 0 to 9, then lines 17 to 26, then lines 33 to 40, and so on. It always moves forward, so it seems the stream keeps being read even after islice stops taking lines. I've also tried with zip and get the same result.
What am I missing here?
I think the problem is in iter_lines, because iter_lines is not reentrant safe. Calling this method multiple times causes some of the received data to be lost. If you need to call it from multiple places, you should, in my opinion, use the resulting iterator object instead, e.g.:
import requests

r = requests.get('https://httpbin.org/stream/20', stream=True)
lines = r.iter_lines()

for line in lines:
    print(line)
Sorry, but I'm not in a position to test this against your exact code, so I can't verify it directly. More information is available here: https://2.python-requests.org/projects/3/user/advanced/
Please give feedback if you try it.
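Applied to the generator above, a minimal sketch (assuming the same s3 response object) would create the line iterator once and slice from it repeatedly:

from itertools import islice

def generate_files(resp, n):
    lines = resp["Body"].iter_lines()   # create the iterator once
    while True:
        batch = list(islice(lines, n))  # take the next n lines from the same iterator
        if not batch:
            break
        yield batch

Each call then yields consecutive, non-overlapping batches of n lines, because islice keeps consuming the same underlying iterator.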
I have CSV files with up to 10M+ rows. I am attempting to get the total line count of a file so I can split the processing of each file across multiple processes. To do this, I set a start and end line for each sub-process to handle. This cuts my processing time from 180s to 110s for a file size of 2GB. However, it requires knowing the line count, and getting the exact count takes ~30 seconds. That time feels wasted: with an approximation, the final thread might have to read an extra hundred thousand lines or so, which would only add a couple of seconds, as opposed to the 30 seconds it takes to get the exact line count.
How would I go about getting an approximate line count for files? I would like this estimate to be within 1 million lines (Preferably within a couple hundred thousand lines). Would something like this be possible?
This will be horribly inaccurate, but it takes the size of one row and divides the size of the file by it.
import csv
import os

with open("example.csv", newline="") as f:
    reader = csv.reader(f)
    row1 = next(reader)

# approximate byte length of the first row, including delimiters and the newline
_Size = len(",".join(row1)) + 1

print("Size of Line 1 > ", _Size)
print("Size of File >", os.path.getsize("example.csv"))
print("Approx Lines >", os.path.getsize("example.csv") / _Size)
(Edit) If you change the last line to math.floor(os.path.getsize("example.csv") / _Size) (and add import math), it's actually quite accurate.
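If the first row isn't representative, a slightly more robust variant (just a sketch, assuming the same example.csv) averages the byte length of the first few hundred lines before dividing:

import itertools
import os

def approx_line_count(path, sample_lines=500):
    # average the byte length of the first sample_lines lines
    with open(path, "rb") as f:
        sample = list(itertools.islice(f, sample_lines))
    if not sample:
        return 0
    avg_len = sum(len(line) for line in sample) / float(len(sample))
    return int(os.path.getsize(path) / avg_len)

print("Approx Lines >", approx_line_count("example.csv"))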
I'd suggest you split the file into chunks of similar size, before even parsing.
The example code below will split data.csv into 4 chunks of approximately equal size, by seeking and searching for the next line break. It'll then call launch_worker() for each chunk, indicating the start offset and length of the data that worker should handle.
Ideally you'd use a subprocess for each worker.
import os

n_workers = 4

# open the log file, and find out how long it is
f = open('data.csv', 'rb')
length_total = f.seek(0, os.SEEK_END)

# split the file evenly among n workers
length_worker = int(length_total / n_workers)

prev_worker_end = 0

for i in range(n_workers):
    # seek to the next worker's approximate start
    file_pos = f.seek(prev_worker_end + length_worker, os.SEEK_SET)

    # see if we tried to seek past the end of the file... the last worker probably will
    if file_pos >= length_total:   # <-- (3)
        # ... if so, this worker's chunk extends to the end of the file
        this_worker_end = length_total
    else:
        # ... otherwise, look for the next line break
        buf = f.read(256)                  # <-- (1)
        next_line_end = buf.index(b'\n')   # <-- (2)
        this_worker_end = file_pos + next_line_end

    # calculate how long this worker's chunk is
    this_worker_length = this_worker_end - prev_worker_end

    if this_worker_length > 0:
        # if there is any data in the chunk, then try to launch a worker
        launch_worker(prev_worker_end, this_worker_length)

    # remember where the last worker got to in the file
    prev_worker_end = this_worker_end + 1
Some expansion on the markers in the code:
(1) You'll need to make sure that the read() consumes at least an entire line. Alternatively you could loop and perform multiple read()s if you don't know upfront how long a line could be (see the sketch after this list).
(2) This assumes \n line endings... you may need to modify this for your data.
(3) The last worker will get slightly less data to handle than the others... this is because we always search forwards for the next line break. The more workers you have, the less data the final worker gets. It's not very significant (~200-500 bytes in my testing).
Make sure you always use binary mode, as text mode can give you wonky seek()s / read()s.
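For marker (1), a minimal sketch of such a read loop (an illustration, not part of the original code) could be:

def read_until_newline(f, block_size=256):
    # keep reading blocks until the buffer contains a line break, or we hit EOF
    buf = b''
    while b'\n' not in buf:
        block = f.read(block_size)
        if not block:          # EOF: treat the rest of the file as the last line
            return len(buf)
        buf += block
    return buf.index(b'\n')

next_line_end = read_until_newline(f) would then stand in for markers (1) and (2) in the loop above.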
An example launch_worker() would look like this:
def launch_worker(offset, length):
    print('Starting a worker... using chunk %d - %d (%d bytes)...'
          % (offset, offset + length, length))

    with open('data.csv', 'rb') as f:
        f.seek(offset, os.SEEK_SET)
        worker_buf = f.read(length)

    lines = worker_buf.split(b'\n')

    print('First Line:')
    print('\t' + str(lines[0]))
    print('Last Line:')
    print('\t' + str(lines[-1]))
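To actually run each chunk in its own process, as suggested above, a rough sketch (not part of the original answer; process_chunk is just an illustrative stand-in for your real per-chunk parsing) hands each offset/length pair to a multiprocessing.Process:

import multiprocessing
import os

def process_chunk(path, offset, length):
    # each worker re-opens the file and handles only its own byte range
    with open(path, 'rb') as f:
        f.seek(offset, os.SEEK_SET)
        lines = f.read(length).split(b'\n')
    print('worker handled %d lines' % len(lines))

def launch_worker(offset, length):
    p = multiprocessing.Process(target=process_chunk, args=('data.csv', offset, length))
    p.start()
    return p

You'd collect the returned Process objects and join() them once the loop over workers has finished.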
I have this code, for a Python 3.6 script, that should count the number of missing rows of numbers within a CSV. However, I get the following error when I run it:
Error:
Traceback (most recent call last):
File "C:\Users\GapReport.py", line 14, in <module>
EndDoc_Padded, EndDoc_Padded = (int(s.strip()[2:]) for s in line)
File "C:\Users\GapReport.py", line 14, in <genexpr>
EndDoc_Padded, EndDoc_Padded = (int(s.strip()[2:]) for s in line)
ValueError: invalid literal for int() with base 10: 'AC-SEC 000000001'
Code:
import csv

def out(*args):
    print('{},{}'.format(*(str(i).rjust(4, "0") for i in args)))

prev = 0
data = csv.reader(open('Padded Numbers_export.csv'))
print(*next(data), sep=', ')  # header

for line in data:
    EndDoc_Padded, EndDoc_Padded = (int(s.strip()[2:]) for s in line)
    if start != prev+1:
        out(prev+1, start-1)
    prev = end

out(start, end)
I'm stumped on how to fix these issues. Also, the CSV has many lines in it, so if there's a way to limit it to just a few numbers, please feel free to let me know.
CSV Snippet (Sorry if I wasn't clear before!):
The values you have in your CSV file are not numeric.
For example, FMAC-SEC 000000001 is not a number. So when you run int(s.strip()[2:]), it is not able to convert it to an int.
Some more comments on the code:
What is the utility of doing EndDoc_Padded, EndDoc_Padded = (...)? Currently you are assigning values to two variables with the same name. Either name one of them something else, or just have one variable there.
Are you trying to get the two different values from each column? In that case, you need to split line into two first. Are the contents of your file comma separated? If yes, then do for s in line.split(','), otherwise use the appropriate separator value in split().
You are running this inside a loop, so each time the values of the two variables would get updated to the values from the last line. If you're trying to obtain 2 lists of all the values, then this won't work.
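If the goal is just to pull the numeric part out of values like FMAC-SEC 000000001, one possible sketch (assuming every cell ends in a single space-separated, zero-padded number) is:

def extract_number(cell):
    # "FMAC-SEC 000000001" -> 1: take the last whitespace-separated token
    return int(cell.strip().split()[-1])

print(extract_number("FMAC-SEC 000000001"))   # prints 1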
I have multiple 3 GB tab delimited files. There are 20 million rows in each file. All the rows have to be independently processed, no relation between any two rows. My question is, what will be faster?
Reading line-by-line?
with open() as infile:
    for line in infile:
Reading the file into memory in chunks and processing it, say 250 MB at a time?
The processing is not very complicated: I am just grabbing the value in column1 into List1, column2 into List2, etc. I might need to add some column values together.
I am using python 2.7 on a linux box that has 30GB of memory. ASCII Text.
Any way to speed things up in parallel? Right now I am using the former method and the process is very slow. Is using any CSVReader module going to help?
I don't have to do it in python, any other language or database use ideas are welcome.
It sounds like your code is I/O bound. This means that multiprocessing isn't going to help—if you spend 90% of your time reading from disk, having an extra 7 processes waiting on the next read isn't going to help anything.
And, while using a CSV reading module (whether the stdlib's csv or something like NumPy or Pandas) may be a good idea for simplicity, it's unlikely to make much difference in performance.
Still, it's worth checking that you really are I/O bound, instead of just guessing. Run your program and see whether your CPU usage is close to 0% or close to 100% of a core. Do what Amadan suggested in a comment, and run your program with just pass for the processing and see whether that cuts off 5% of the time or 70%. You may even want to try comparing with a loop over os.open and os.read(1024*1024) or something and see if that's any faster.
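For that raw-read comparison, a rough sketch (assuming path points at one of your files) might be:

import os
import time

start = time.time()
fd = os.open(path, os.O_RDONLY)
while True:
    buf = os.read(fd, 1024 * 1024)   # 1 MB at a time, no line splitting
    if not buf:
        break
os.close(fd)
print('raw 1MB reads took %.1fs' % (time.time() - start))

If this is barely faster than your current loop, you're almost certainly I/O bound.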
Since you're using Python 2.x, Python is relying on the C stdio library to guess how much to buffer at a time, so it might be worth forcing it to buffer more. The simplest way to do that is to use readlines(bufsize) for some large bufsize. (You can try different numbers and measure them to see where the peak is. In my experience, usually anything from 64K to 8MB is about the same, but depending on your system that may be different—especially if you're, e.g., reading off a network filesystem with great throughput but horrible latency that swamps the throughput-vs.-latency of the actual physical drive and the caching the OS does.)
So, for example:
bufsize = 65536
with open(path) as infile:
    while True:
        lines = infile.readlines(bufsize)
        if not lines:
            break
        for line in lines:
            process(line)
Meanwhile, assuming you're on a 64-bit system, you may want to try using mmap instead of reading the file in the first place. This certainly isn't guaranteed to be better, but it may be better, depending on your system. For example:
import mmap

with open(path, 'rb') as infile:
    m = mmap.mmap(infile.fileno(), 0, access=mmap.ACCESS_READ)
A Python mmap is sort of a weird object—it acts like a str and like a file at the same time, so you can, e.g., manually iterate scanning for newlines, or you can call readline on it as if it were a file. Both of those will take more processing from Python than iterating the file as lines or doing batch readlines (because a loop that would be in C is now in pure Python… although maybe you can get around that with re, or with a simple Cython extension?)… but the I/O advantage of the OS knowing what you're doing with the mapping may swamp the CPU disadvantage.
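For example, a minimal sketch of the readline-style iteration over the mapping (same assumed path and process as above) could look like:

import mmap

with open(path, 'rb') as infile:
    m = mmap.mmap(infile.fileno(), 0, access=mmap.ACCESS_READ)
    try:
        while True:
            line = m.readline()   # behaves like file.readline(); returns b'' at EOF
            if not line:
                break
            process(line)
    finally:
        m.close()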
Unfortunately, Python doesn't expose the madvise call that you'd use to tweak things in an attempt to optimize this in C (e.g., explicitly setting MADV_SEQUENTIAL instead of making the kernel guess, or forcing transparent huge pages)—but you can actually ctypes the function out of libc.
I know this question is old, but I wanted to do a similar thing, so I created a simple framework which helps you read and process a large file in parallel. I'm leaving what I tried as an answer.
This is the code; I give an example at the end.
import gc
import multiprocessing as mp
import os
import time

import psutil

def chunkify_file(fname, size=1024*1024*1000, skiplines=-1):
    """
    function to divide a large text file into chunks each having size ~= size so that the chunks are line aligned

    Params :
        fname : path to the file to be chunked
        size : size of each chunk is ~> this
        skiplines : number of lines in the beginning to skip, -1 means don't skip any lines
    Returns :
        start and end position of chunks in Bytes
    """
    chunks = []
    fileEnd = os.path.getsize(fname)
    with open(fname, "rb") as f:
        if skiplines > 0:
            for i in range(skiplines):
                f.readline()

        chunkEnd = f.tell()
        count = 0
        while True:
            chunkStart = chunkEnd
            f.seek(f.tell() + size, os.SEEK_SET)
            f.readline()  # make this chunk line aligned
            chunkEnd = f.tell()
            chunks.append((chunkStart, chunkEnd - chunkStart, fname))
            count += 1

            if chunkEnd > fileEnd:
                break
    return chunks
def parallel_apply_line_by_line_chunk(chunk_data):
    """
    function to apply a function to each line in a chunk

    Params :
        chunk_data : the data for this chunk
    Returns :
        list of the non-None results for this chunk
    """
    chunk_start, chunk_size, file_path, func_apply = chunk_data[:4]
    func_args = chunk_data[4:]

    t1 = time.time()
    chunk_res = []
    with open(file_path, "rb") as f:
        f.seek(chunk_start)
        cont = f.read(chunk_size).decode(encoding='utf-8')
        lines = cont.splitlines()

        for i, line in enumerate(lines):
            ret = func_apply(line, *func_args)
            if ret is not None:
                chunk_res.append(ret)
    return chunk_res
def parallel_apply_line_by_line(input_file_path, chunk_size_factor, num_procs, skiplines, func_apply, func_args, fout=None):
    """
    function to apply a supplied function line by line in parallel

    Params :
        input_file_path : path to input file
        chunk_size_factor : size of 1 chunk in MB
        num_procs : number of parallel processes to spawn, max used is num of available cores - 1
        skiplines : number of top lines to skip while processing
        func_apply : a function which expects a line and outputs None for lines we don't want processed
        func_args : arguments to function func_apply
        fout : do we want to output the processed lines to a file
    Returns :
        list of the non-None results obtained by processing each line
    """
    num_parallel = min(num_procs, psutil.cpu_count()) - 1

    jobs = chunkify_file(input_file_path, 1024 * 1024 * chunk_size_factor, skiplines)
    jobs = [list(x) + [func_apply] + func_args for x in jobs]

    print("Starting the parallel pool for {} jobs ".format(len(jobs)))

    lines_counter = 0

    # maxtasksperchild - if not supplied, something weird happens and memory
    # blows up as the processes keep on lingering
    pool = mp.Pool(num_parallel, maxtasksperchild=1000)

    outputs = []
    for i in range(0, len(jobs), num_parallel):
        print("Chunk start = ", i)
        t1 = time.time()
        chunk_outputs = pool.map(parallel_apply_line_by_line_chunk, jobs[i : i + num_parallel])

        for i, subl in enumerate(chunk_outputs):
            for x in subl:
                if fout is not None:
                    print(x, file=fout)
                else:
                    outputs.append(x)
                lines_counter += 1
        del(chunk_outputs)
        gc.collect()
        print("All Done in time ", time.time() - t1)

    print("Total lines we have = {}".format(lines_counter))

    pool.close()
    pool.terminate()
    return outputs
Say, for example, I have a file in which I want to count the number of words in each line; then the processing of each line would look like:
def count_words_line(line):
    return len(line.strip().split())
and then call the function like:
parallel_apply_line_by_line(input_file_path, 100, 8, 0, count_words_line, [], fout=None)
Using this, I get a speed up of ~8 times as compared to vanilla line by line reading on a sample file of size ~20GB in which I do some moderately complicated processing on each line.
My starting point was a problem with NumPy's function loadtxt:
X = np.loadtxt(filename, delimiter=",")
that gave a MemoryError in np.loadtxt(..). I googled it and came to this question on StackOverflow. That gave the following solution:
import numpy as np

def iter_loadtxt(filename, delimiter=',', skiprows=0, dtype=float):
    def iter_func():
        with open(filename, 'r') as infile:
            for _ in range(skiprows):
                next(infile)
            for line in infile:
                line = line.rstrip().split(delimiter)
                for item in line:
                    yield dtype(item)
        iter_loadtxt.rowlength = len(line)

    data = np.fromiter(iter_func(), dtype=dtype)
    data = data.reshape((-1, iter_loadtxt.rowlength))
    return data

data = iter_loadtxt('your_file.ext')
So I tried that, but then encountered the following error message:
> data = data.reshape((-1, iter_loadtxt.rowlength))
> ValueError: total size of new array must be unchanged
Then I tried to add the number of rows and maximum number of cols to the code with the code fragments down here, which I partly got from another question and partly wrote myself:
num_rows = 0
max_cols = 0
with open(filename, 'r') as infile:
    for line in infile:
        num_rows += 1
        tmp = line.split(",")
        if len(tmp) > max_cols:
            max_cols = len(tmp)

def iter_func():
    # didn't change

data = np.fromiter(iter_func(), dtype=dtype, count=num_rows)
data = data.reshape((num_rows, max_cols))
But this still gave the same error message though I thought it should have been solved. On the other hand I'm not sure if I'm calling data.reshape(..) in the correct manner.
I commented out the line where data.reshape(..) is called to see what happened. That gave this error message:
> ValueError: need more than 1 value to unpack
This happened at the first point where something is done with X, the variable this whole problem is about.
I know this code can work on the input files I got, because I saw it used with them. But I can't figure out why I can't solve this problem. My best guess is that, because I'm using a 32-bit Python version (on a 64-bit Windows machine), something goes wrong with memory that doesn't happen on other computers. But I'm not sure. For info: I have 8GB of RAM and a 1.2GB file, but my RAM is not full according to Task Manager.
What I want to solve is this: I'm using open source code that needs to read and parse the given file just like np.loadtxt(filename, delimiter=","), but within the memory I have available. I know the code originally worked on MacOsx and Linux, and to be more precise: "MacOsx 10.9.2 and Linux (version 2.6.18-194.26.1.el5 (brewbuilder#norob.fnal.gov) (gcc version 4.1.2 20080704 (Red Hat 4.1.2-48)) 1 SMP Tue Nov 9 12:46:16 EST 2010)."
I don't care that much about time. My file contains roughly 200,000 lines, each with 100 or 1000 items (depending on the input file: one kind always has 100, the other always 1000). Each item is a floating point number with 3 decimals, possibly negated, and items are separated by a comma and a space. E.g.: [..] 0.194, -0.007, 0.004, 0.243, [..], with 100 or 1000 of those items per line (of which you see 4 here), for roughly 200,000 lines.
I'm using Python 2.7 because the open source code needs that.
Does any of you have the solution for this? Thanks in advance.
On Windows a 32 bit process is only given a maximum of 2GB (or GiB?) memory and numpy.loadtxt is notorious for being heavy on memory, so that explains why the first approach doesn't work.
The second problem you appear to be facing is that the particular file you are testing with has missing data, i.e. not all lines have the same number of values. This is easy to check, for example:
import numpy as np

numbers_per_line = []
with open(filename) as infile:
    for line in infile:
        numbers_per_line.append(line.count(delimiter) + 1)

# Check where there might be problems
numbers_per_line = np.array(numbers_per_line)
expected_number = 100
print np.where(numbers_per_line != expected_number)
I tried to do this problem at spoj.com, but I keep getting a runtime error (NZEC). I code in Python, and I don't know why. Here is my code:
import sys

def unique(lines, exact=True):
    for L in lines:
        if L.count('#') != 1 and exact:
            return False
        if L.count('#') > 1 and not exact:
            return False
    return True

def resolve(N, lines):
    diags = [
        [lines[i][i+j] for i in range(max(0, -j), min(N, N-j))]
        for j in range(-N+1, N)]
    anti_diags = [
        [lines[i][N-1 - (i+j)] for i in range(max(0, -j), min(N, N-j))]
        for j in range(-N+1, N)]

    if unique(lines) and unique(zip(*lines)) and unique(diags, False) and unique(anti_diags, False):
        return "YES"
    return "NO"

input_file = sys.stdin
output_file = sys.stdout

T = int(raw_input())
for i in range(1, T + 1):
    n = int(raw_input())
    lines = []
    for _ in range(n):
        line = raw_input().strip()
        lines.append(list(line))
        print resolve(n, lines)
It works fine locally with input like:
2
3
..#
#..
.#.
4
.#..
...#
#...
..#.
When I run your code, I get the following:
$ python spoj.py < input
Traceback (most recent call last):
File "spoj.py", line 38, in <module>
print resolve(n, lines)
File "spoj.py", line 15, in resolve
for j in range(-N+1, N)]
IndexError: list index out of range
That's the runtime error you're getting. But the problem is actually in this part:
for _ in range(n):
    line = raw_input().strip()
    lines.append(list(line))
    print resolve(n, lines)
Since the print statement is indented, your program reads a single line, appends it to lines, and calls resolve before reading the following lines. After removing the extra indent, it works fine on my computer.
There's plenty of room to further improve this program, both in style and logic. Let me know if you would like some further advice.
EDIT:
You mentioned in the comments that this indentation problem was a copy-pasting error. In that case, as sp1r pointed out in his comment, it's most likely that, as a comment on the problem's page suggests, the input used by the online judge is malformed.
In that case, you have to do more rigorous parsing of the input, as it may not be formatted as the example input suggests. That is, instead of:
2
3
..#
#..
.#.
(etc)
you could have something more like:
2 3
..#
#.. .#.
(etc)
That might break your code, because it assumes whoever posted the problem was competent enough to make an input that looks like the example. That is not always the case on SPOJ; a lot of the input files are malformed, and you can find other cases where malformed input causes otherwise correct programs to fail.
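One way to make the parsing robust to arbitrary line breaks is to read the whole input at once and split it into whitespace-separated tokens. A rough sketch (Python 2, reusing the resolve() function above) might look like:

import sys

def main():
    tokens = iter(sys.stdin.read().split())   # line breaks no longer matter
    T = int(next(tokens))
    for _ in range(T):
        n = int(next(tokens))
        lines = [list(next(tokens)) for _ in range(n)]
        print resolve(n, lines)

main()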
On SPOJ you don't need sys, sys.stdin or sys.stdout; just use raw_input() and print.
"I am on windows not on linux. I simply change this line to switch to console modes for tests."
In any language and on any OS, this works fine:
# For Python
python spoj.py < input.txt > output.txt

# If you are using C/C++, or any EXE file
spoj.exe < input.txt > output.txt
SPOJ will neither test this on your computer nor use any UI. They will test your code in a "console", and it might be a Linux machine or any other OS. Use this input/output redirection to test your code.