Delete element of list in pool.map() python - python

processPool.map(parserMethod, ((inputFile[line:line + chunkSize], sharedQueue) for line in xrange(0, lengthOfFile, chunkSize)))
Here, I am passing control to parserMethod with a tuple of params inputFile[line:line + chunkSize] and a sharedQueue.
Can anyone tell me how I can delete the elements of inputFile[line:line + chunkSize] after it is passed to the parserMethod ?
Thanks !

del inputFile[line:line + chunkSize]
will remove those items. However, your map is stepping through the entire file, which makes me wonder: are you trying to remove them as they're parsed? This requires the map or parser to alter an input argument, which invites trouble.
If you're only trying to save memory usage, it's a little late: you already saved the entire file in InputFile. If you need only to clean up after the parsing, then use the extreme form of delete, once, after the parsing is finished:
del inputFile[:]
If you want to reduce the memory requirement up front, you have to back up a step. Instead of putting the entire file into a list, try making an nice input pipeline. You didn't post the context of this code, so I'm going to use a generic case with a couple of name assumptions:
def line_chunk_stream(input_stream, chunk_size):
# Generator to return a stream of paring units,
# <chunk_size> lines each.
# To make sure you could check the logic here,
# I avoided several Pythonic short-cuts.
line_count = 0
parse_chunk = []
for line in input_stream:
line_count += 1
parse_chunk.append(line)
if line_count % chunk_size == 0:
yield parse_chunk
del parse_chunk[:]
input_stream = open("source_file", 'r')
parse_stream = line_chunk_stream(input_stream, chunk_size)
parserMethod(parse_stream)
I hope that at least one of these solves your underlying problem.

Related

python: issue with handling csv object in an iteration

First, sorry if the title is not clear. I (noob) am baffled by this...
Here's my code:
import csv
from random import random
from collections import Counter
def rn(dic, p):
for ptry in parties:
if p < float(dic[ptry]):
return ptry
else:
p -= float(dic[ptry])
def scotland(r):
r['SNP'] = 48
r['Con'] += 5
r['Lab'] += 1
r['LibDem'] += 5
def n_ireland(r):
r['DUP'] = 9
r['Alliance'] = 1
# SF = 7
def election():
results = Counter([rn(row, random()) for row in data])
scotland(results)
n_ireland(results)
return results
parties = ['Con', 'Lab', 'LibDem', 'Green', 'BXP', 'Plaid', 'Other']
with open('/Users/andrew/Downloads/msp.csv', newline='') as f:
data = csv.DictReader(f)
for i in range(1000):
print(election())
What happens is that in every iteration after the first one, the variable data seems to have vanished: the function election() creates a Counter object from a list obtained by processing data, but on every pass after the first one, this object is empty, so the function just returns the hard coded data from scotland() and n_ireland(). (msp.csv is a csv file containing detailed polling data). I'm sure I'm doing something stupid but would welcome anyone gently pointing out where...
I’m going to place a bet on your definition of newline. Are you sure you don’t want newline = “\n” ? Otherwise it will interpret the entire file as a single line, which explains what you’re seeing.
EDIT
I now see another issue. The file object in python acts as a generator for each line. The problem is once the generator is finished (you hit the end of the file), you have no more data generated. To solve this: reset your file pointer to the beginning of the file like so:
with open('/Users/andrew/Downloads/msp.csv') as f:
data = csv.DictReader(f)
for i in range(1000):
print(election())
f.seek(0)
Here the call to f.seek(0) will reset the file pointer to the beginning of your file. You are correct that data is a global object given the way you've defined it at the module level, there's no need to pass it as a parameter.
I agree with #smassey, you might need to change the code to
with open('/Users/andrew/Downloads/msp.csv', newline='\n') as f:
or simply try not use that argument
with open('/Users/andrew/Downloads/msp.csv') as f:

Fast extraction of chunks of lines from large CSV file

I have a large CSV file full of stock-related data formatted as such:
Ticker Symbol, Date, [some variables...]
So each line starts of with the symbol (like "AMZN"), then has the date, then has 12 variables related to price or volume on the selected date. There are about 10,000 different securities represented in this file and I have a line for each day that the stock has been publicly traded for each of them. The file is ordered first alphabetically by ticker symbol and second chronologically by date. The entire file is about 3.3 GB.
The sort of task I want to solve would be to be able to extract the most recent n lines of data for a given ticker symbol with respect to the current date. I have code that does this, but based on my observations it seems to take, on average, around 8-10 seconds per retrieval (all tests have been extracting 100 lines).
I have functions I'd like to run that require me to grab such chunks for hundreds or thousands of symbols, and I would really like to reduce the time. My code is inefficient, but I am not sure how to make it run faster.
First, I have a function called getData:
def getData(symbol, filename):
out = ["Symbol","Date","Open","High","Low","Close","Volume","Dividend",
"Split","Adj_Open","Adj_High","Adj_Low","Adj_Close","Adj_Volume"]
l = len(symbol)
beforeMatch = True
with open(filename, 'r') as f:
for line in f:
match = checkMatch(symbol, l, line)
if beforeMatch and match:
beforeMatch = False
out.append(formatLineData(line[:-1].split(",")))
elif not beforeMatch and match:
out.append(formatLineData(line[:-1].split(",")))
elif not beforeMatch and not match:
break
return out
(This code has a couple of helper functions, checkMatch and formatLineData, which I will show below.) Then, there is another function called getDataColumn that gets the column I want with the correct number of days represented:
def getDataColumn(symbol, col=12, numDays=100, changeRateTransform=False):
dataset = getData(symbol)
if not changeRateTransform:
column = [day[col] for day in dataset[-numDays:]]
else:
n = len(dataset)
column = [(dataset[i][col] - dataset[i-1][col])/dataset[i-1][col] for i in range(n - numDays, n)]
return column
(changeRateTransform converts raw numbers into daily change rate numbers if True.) The helper functions:
def checkMatch(symbol, symbolLength, line):
out = False
if line[:symbolLength+1] == symbol + ",":
out = True
return out
def formatLineData(lineData):
out = [lineData[0]]
out.append(datetime.strptime(lineData[1], '%Y-%m-%d').date())
out += [float(d) for d in lineData[2:6]]
out += [int(float(d)) for d in lineData[6:9]]
out += [float(d) for d in lineData[9:13]]
out.append(int(float(lineData[13])))
return out
Does anyone have any insight on what parts of my code run slow and how I can make this perform better? I can't do the sort of analysis I want to do without speeding this up.
EDIT:
In response to the comments, I made some changes to the code in order to utilize the existing methods in the csv module:
def getData(symbol, database):
out = ["Symbol","Date","Open","High","Low","Close","Volume","Dividend",
"Split","Adj_Open","Adj_High","Adj_Low","Adj_Close","Adj_Volume"]
l = len(symbol)
beforeMatch = True
with open(database, 'r') as f:
databaseReader = csv.reader(f, delimiter=",")
for row in databaseReader:
match = (row[0] == symbol)
if beforeMatch and match:
beforeMatch = False
out.append(formatLineData(row))
elif not beforeMatch and match:
out.append(formatLineData(row))
elif not beforeMatch and not match:
break
return out
def getDataColumn(dataset, col=12, numDays=100, changeRateTransform=False):
if not changeRateTransform:
out = [day[col] for day in dataset[-numDays:]]
else:
n = len(dataset)
out = [(dataset[i][col] - dataset[i-1][col])/dataset[i-1][col] for i in range(n - numDays, n)]
return out
Performance was worse using the csv.reader class. I tested on two stocks, AMZN (near top of file) and ZNGA (near bottom of file). With the original method, the run times were 0.99 seconds and 18.37 seconds, respectively. With the new method leveraging the csv module, the run times were 3.04 seconds and 64.94 seconds, respectively. Both return the correct results.
My thought is that the time is being taken up more from finding the stock than from the parsing. If I try these methods on the first stock in the file, A, the methods both run in about 0.12 seconds.
When you're going to do lots of analysis on the same dataset, the pragmatic approach would be to read it all into a database. It is made for fast querying; CSV isn't. Use the sqlite command line tools, for example, which can directly import from CSV. Then add a single index on (Symbol, Date) and lookups will be practically instantaneous.
If for some reason that is not feasible, for example because new files can come in at any moment and you cannot afford the preparation time before starting your analysis of them, you'll have to make the best of dealing with CSV directly, which is what the rest of my answer will focus on. Remember that it's a balancing act, though. Either you pay a lot upfront, or a bit extra for every lookup. Eventually, for some amount of lookups it would have been cheaper to pay upfront.
Optimization is about maximizing the amount of work not done. Using generators and the built-in csv module aren't going to help much with that in this case. You'd still be reading the whole file and parsing all of it, at least for line breaks. With that amount of data, it's a no-go.
Parsing requires reading, so you'll have to find a way around it first. Best practices of leaving all intricacies of the CSV format to the specialized module bear no meaning when they can't give you the performance you want. Some cheating must be done, but as little as possible. In this case, I suppose it is safe to assume that the start of a new line can be identified as b'\n"AMZN",' (sticking with your example). Yes, binary here, because remember: no parsing yet. You could scan the file as binary from the beginning until you find the first line. From there read the amount of lines you need, decode and parse them the proper way, etc. No need for optimization there, because a 100 lines are nothing to worry about compared to the hundreds of thousands of irrelevant lines you're not doing that work for.
Dropping all that parsing buys you a lot, but the reading needs to be optimized as well. Don't load the whole file into memory first and skip as many layers of Python as you can. Using mmap lets the OS decide what to load into memory transparently and lets you work with the data directly.
Still you're potentially reading the whole file, if the symbol is near the end. It's a linear search, which means the time it takes is linearly proportional to the number of lines in the file. You can do better though. Because the file is sorted, you could improve the function to instead perform a kind of binary search. The number of steps that will take (where a step is reading a line) is close to the binary logarithm of the number of lines. In other words: the number of times you can divide your file into two (almost) equally sized parts. When there are one million lines, that's a difference of five orders of magnitude!
Here's what I came up with, based on Python's own bisect_left with some measures to account for the fact that your "values" span more than one index:
import csv
from itertools import islice
import mmap
def iter_symbol_lines(f, symbol):
# How to recognize the start of a line of interest
ident = b'"' + symbol.encode() + b'",'
# The memory-mapped file
mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
# Skip the header
mm.readline()
# The inclusive lower bound of the byte range we're still interested in
lo = mm.tell()
# The exclusive upper bound of the byte range we're still interested in
hi = mm.size()
# As long as the range isn't empty
while lo < hi:
# Find the position of the beginning of a line near the middle of the range
mid = mm.rfind(b'\n', 0, (lo+hi)//2) + 1
# Go to that position
mm.seek(mid)
# Is it a line that comes before lines we're interested in?
if mm.readline() < ident:
# If so, ignore everything up to right after this line
lo = mm.tell()
else:
# Otherwise, ignore everything from right before this line
hi = mid
# We found where the first line of interest would be expected; go there
mm.seek(lo)
while True:
line = mm.readline()
if not line.startswith(ident):
break
yield line.decode()
with open(filename) as f:
r = csv.reader(islice(iter_symbol_lines(f, 'AMZN'), 10))
for line in r:
print(line)
No guarantees about this code; I didn't pay much attention to edge cases, and I couldn't test with (any of) your file(s), so consider it a proof of concept. It is plenty fast, however – think tens of milliseconds on an SSD!
So I have an alternative solution which I ran and tested on my own as well with a sample data set that I got on Quandl that appears to have all the same headers and similar data. (Assuming that I havent misunderstood the end result that your trying to achieve).
I have this command line tool that one of our engineers built for us for parsing massive csvs - since I deal with absurd amount of data on a day to day basis - it is open sourced and you can get it here: https://github.com/DataFoxCo/gocsv
I also already wrote the short bash script for it in case you don't want to pipeline the commands but it does also support pipelining.
The command to run the following short script follows a super simple convention:
bash tickers.sh wikiprices.csv 'AMZN' '2016-12-\d+|2016-11-\d+'
#!/bin/bash
dates="$3"
cat "$1" \
| gocsv filter --columns 'ticker' --regex "$2" \
| gocsv filter --columns 'date' --regex "$dates" > "$2"'-out.csv'
both arguments for ticker and for dates are regexes
You can add as many variations as your want into that one regex, separating them by |.
So if you wanted AMZN and MSFT then you would simply modify it to this: AMZN|MSFT
I did something very similar with the dates - but i only limited my sample run to any dates from this month or last month.
End Result
Starting data:
myusername$ gocsv dims wikiprices.csv
Dimensions:
Rows: 23946
Columns: 14
myusername$ bash tickers.sh wikiprices.csv 'AMZN|MSFT' '2016-12-\d+'
myusername$ gocsv dims AMZN|MSFT-out.csv
Dimensions:
Rows: 24
Columns: 14
Here is a sample where I limited to only those 2 tickers and then to december only:
Voila - in a matter of seconds you have a second file saved with out the data you care about.
The gocsv program has great documentation by the way - and a ton of other functions e.g. running a vlookup basically at any scale (which is what inspired the creator to make the tool)
in addition to using csv.reader I think using itertools.groupby would speed up looking for the wanted sections, so the actual iteration could look something like this:
import csv
from itertools import groupby
from operator import itemgetter #for the keyfunc for groupby
def getData(wanted_symbol, filename):
with open(filename) as file:
reader = csv.reader(file)
#so each line in reader is basically line[:-1].split(",") from the plain file
for symb, lines in groupby(reader, itemgetter(0)):
#so here symb is the symbol at the start of each line of lines
#and lines is the lines that all have that symbol in common
if symb != wanted_symbol:
continue #skip this whole section if it has a different symbol
for line in lines:
#here we have each line as a list of fields
#for only the lines that have `wanted_symbol` as the first element
<DO STUFF HERE>
so in the space of <DO STUFF HERE> you could have the out.append(formatLineData(line)) to do what your current code does but the code for that function has a lot of unnecessary slicing and += operators which I think are pretty expensive for lists (might be wrong), another way you could apply the conversions is to have a list of all the conversions:
def conv_date(date_str):
return datetime.strptime(date_str, '%Y-%m-%d').date()
#the conversions applied to each element (taken from original formatLineData)
castings = [str, conv_date, #0, 1
float, float, float, float, #2:6
int, int, int, #6:9
float, float, float, float, #9:13
int] #13
then use zip to apply these to each field in a line in a list comprehension:
[conv(val) for conv, val in zip(castings, line)]
so you would replace <DO STUFF HERE> with out.append with that comprehension.
I'd also wonder if switching the order of groupby and reader would be better since you don't need to parse most of the file as csv, just the parts you are actually iterating over so you could use a keyfunc that seperates just the first field of the string
def getData(wanted_symbol, filename):
out = [] #why are you starting this with strings in it?
def checkMatch(line): #define the function to only take the line
#this would be the keyfunc for groupby in this example
return line.split(",",1)[0] #only split once, return the first element
with open(filename) as file:
for symb, lines in groupby(file,checkMatch):
#so here symb is the symbol at the start of each line of lines
if symb != wanted_symbol:
continue #skip this whole section if it has a different symbol
for line in csv.reader(lines):
out.append( [typ(val) for typ,val in zip(castings,line)] )
return out

Parsing blockbased program output using Python

I am trying to parse the output of a statistical program (Mplus) using Python.
The format of the output (example here) is structured in blocks, sub-blocks, columns, etc. where the whitespace and breaks are very important. Depending on the eg. options requested you get an addional (sub)block or column here or there.
Approaching this using regular expressions has been a PITA and completely unmaintainable. I have been looking into parsers as a more robust solution, but
am a bit overwhelmed by all the possible tools and approaches;
have the impression that they are not well suited for this kind of output.
E.g. LEPL has something called line-aware parsing, which seems to go in the right direction (whitespace, blocks, ...) but is still geared to parsing programming syntax, not output.
Suggestion in which direction to look would be appreciated.
Yes, this is a pain to parse. You don't -- however -- actually need very many regular expressions. Ordinary split may be sufficient for breaking this document into manageable sequences of strings.
These are a lot of what I call "Head-Body" blocks of text. You have titles, a line of "--"'s and then data.
What you want to do is collapse a "head-body" structure into a generator function that yields individual dictionaries.
def get_means_intecepts_thresholds( source_iter ):
"""Precondition: Current line is a "MEANS/INTERCEPTS/THRESHOLDS" line"""
head= source_iter.next().strip().split()
junk= source_iter.next().strip()
assert set( junk ) == set( [' ','-'] )
for line in source_iter:
if len(line.strip()) == 0: continue
if line.strip() == "SLOPES": break
raw_data= line.strip().split()
data = dict( zip( head, map( float, raw_data[1:] ) ) )
yield int(raw_data[0]), data
def get_slopes( source_iter ):
"""Precondition: Current line is a "SLOPES" line"""
head= source_iter.next().strip().split()
junk= source_iter.next().strip()
assert set( junk ) == set( [' ','-'] )
for line in source_iter:
if len(line.strip()) == 0: continue
if line.strip() == "SLOPES": break
raw_data= line.strip().split() )
data = dict( zip( head, map( float, raw_data[1:] ) ) )
yield raw_data[0], data
The point is to consume the head and the junk with one set of operations.
Then consume the rows of data which follow using a different set of operations.
Since these are generators, you can combine them with other operations.
def get_estimated_sample_statistics( source_iter ):
"""Precondition: at the ESTIMATED SAMPLE STATISTICS line"""
for line in source_iter:
if len(line.strip()) == 0: continue
assert line.strip() == "MEANS/INTERCEPTS/THRESHOLDS"
for data in get_means_intercepts_thresholds( source_iter ):
yield data
while True:
if len(line.strip()) == 0: continue
if line.strip() != "SLOPES": break
for data in get_slopes( source_iter ):
yield data
Something like this may be better than regular expressions.
Based on your example, what you have is a bunch of different, nested sub-formats that, individually, are very easily parsed. What can be overwhelming is the sheer number of formats and the fact that they can be nested in different ways.
At the lowest level you have a set of whitespace-separated values on a single line. Those lines combine into blocks, and how the blocks combine and nest within each other is the complex part. This type of output is designed for human reading and was never intended to be "scraped" back into machine-readable form.
First, I would contact the author of the software and find out if there is an alternate output format available, such as XML or CSV. If done correctly (i.e. not just the print-format wrapped in clumsy XML, or with commas replacing whitespace), this would be much easier to handle. Failing that I would try to come up with a hierarchical list of formats and how they nest. For example,
ESTIMATED SAMPLE STATISTICS begins a block
Within that block MEANS/INTERCEPTS/THRESHOLDS begins a nested block
The next two lines are a set of column headings
This is followed by one (or more?) rows of data, with a row header and data values
And so on. If you approach each of these problems separately, you will find that it's tedious but not complex. Think of each of the above steps as modules that test the input to see if it matches and if it does, then call other modules to test further for things that can occur "inside" the block, backtracking if you get to something that doesn't match what you expect (this is called "recursive descent" by the way).
Note that you will have to do something like this anyway, in order to build an in-memory version of the data (the "data model") on which you can operate.
My suggestion is to do rough massaging of the lines to more useful form. Here is some experiments with your data:
from __future__ import print_function
from itertools import groupby
import string
counter = 0
statslist = [ statsblocks.split('\n')
for statsblocks in open('mlab.txt').read().split('\n\n')
]
print(len(statslist), 'blocks')
def blockcounter(line):
global counter
if not line[0]:
counter += 1
return counter
blocklist = [ [block, list(stats)] for block, stats in groupby(statslist, blockcounter)]
for blockno,block in enumerate(blocklist):
print(120 * '=')
for itemno,line in enumerate(block[1:][0]):
if len(line)<4 and any(line[-1].endswith(c) for c in string.letters) :
print('\n** DATA %i, HEADER (%r)**' % (blockno,line[-1]))
else:
print('\n** DATA %i, item %i, length %i **' % (blockno, itemno, len(line)))
for ind,subdata in enumerate(line):
if '___' in subdata:
print(' *** Numeric data starts: ***')
else:
if 6 < len(subdata)<16:
print( '** TYPE: %s **' % subdata)
print('%3i : %s' %( ind, subdata))
You could try PyParsing. It enables you to write a grammar for what you want to parse. It has other examples than parsing programming languages. But I agree with Jim Garrison that your case doesn't seem to call for a real parser, because writing the grammar would be cumbersome. I would try a brute-force solution, e.g. splitting lines at whitespaces. It's not foolproof, but we can assume the output is correct, so if a line has n headers, the next line will have exactly n values.
It turns out that tabular program output like this was one of my earliest applications of pyparsing. Unfortunately, that exact example dealt with a proprietary format that I can't publish, but there is a similar example posted here: http://pyparsing.wikispaces.com/file/view/dictExample2.py .

What is the lightest way of doing this task?

I have a file whose contents are of the form:
.2323 1
.2327 1
.3432 1
.4543 1
and so on some 10,000 lines in each file.
I have a variable whose value is say a=.3344
From the file I want to get the row number of the row whose first column is closest to this variable...for example it should give row_num='3' as .3432 is closest to it.
I have tried in a method of loading the first columns element in a list and then comparing the variable to each element and getting the index number
If I do in this method it is very much time consuming and slow my model...I want a very quick method as this need to to called some 1000 times minimum...
I want a method with least overhead and very quick can anyone please tell me how can it be done very fast.
As the file size is maximum of 100kb can this be done directly without loading into any list of anything...if yes how can it be done.
Any method quicker than the method mentioned above are welcome but I am desperate to improve the speed -- please help.
def get_list(file, cmp, fout):
ind, _ = min(enumerate(file), key=lambda x: abs(x[1] - cmp))
return fout[ind].rstrip('\n').split(' ')
#root = r'c:\begpython\wavnk'
header = 6
for lst in lists:
save = database_index[lst]
#print save
index, base,abs2, _ , abs1 = save
using_data[index] = save
base = 'C:/begpython/wavnk/'+ base.replace('phone', 'text')
fin, fout = base + '.pm', base + '.mcep'
file = open(fin)
fout = open(fout).readlines()
[next(file) for _ in range(header)]
file = [float(line.partition(' ')[0]) for line in file]
join_cost_index_end[index] = get_list(file, float(abs1), fout)
join_cost_index_strt[index] = get_list(file, float(abs2), fout)
this is the code i was using..copying file into a list.and all please give better alternarives to this
Building on John Kugelman's answer, here's a way you might be able to do a binary search on a file with fixed-length lines:
class SubscriptableFile(object):
def __init__(self, file):
self._file = file
file.seek(0,0)
self._line_length = len(file.readline())
file.seek(0,2)
self._len = file.tell() / self._line_length
def __len__(self):
return self._len
def __getitem__(self, key):
self._file.seek(key * self._line_length)
s = self._file.readline()
if s:
return float(s.split()[0])
else:
raise KeyError('Line number too large')
This class wraps a file in a list-like structure, so that now you can use the functions of the bisect module on it:
def find_row(file, target):
fw = SubscriptableFile(file)
i = bisect.bisect_left(fw, target)
if fw[i + 1] - target < target - fw[i]:
return i + 1
else:
return i
Here file is an open file object and target is the number you want to find. The function returns the number of the line with the closest value.
I will note, however, that the bisect module will try to use a C implementation of its binary search when it is available, and I'm not sure if the C implementation supports this kind of behavior. It might require a true list, rather than a "fake list" (like my SubscriptableFile).
Is the data in the file sorted in numerical order? Are all the lines of the same length? If not, the simplest approach is best. Namely, reading through the file line by line. There's no need to store more than one line in memory at a time.
Code:
def closest(num):
closest_row = None
closest_value = None
for row_num, row in enumerate(file('numbers.txt')):
value = float(row.split()[0])
if closest_value is None or abs(value - num) < abs(closest_value - num):
closest_row = row
closest_row_num = row_num
closest_value = value
return (closest_row_num, closest_row)
print closest(.3344)
Output for sample data:
(2, '.3432 1\n')
If the lines are all the same length and the data is sorted then there are some optimizations that will make this a very fast process. All the lines being the same length would let you seek directly to particular lines (you can't do this in a normal text file with lines of different length). Which would then enable you to do a binary search.
A binary search would be massively faster than a linear search. A linear search will on average have to read 5,000 lines of a 10,000 line file each time, whereas a binary search would on average only read log2 10,000 ≈ 13 lines.
Load it into a list then use bisect.

Number of lines in csv.DictReader

I have a csv DictReader object (using Python 3.1), but I would like to know the number of lines/rows contained in the reader before I iterate through it. Something like as follows...
myreader = csv.DictReader(open('myFile.csv', newline=''))
totalrows = ?
rowcount = 0
for row in myreader:
rowcount +=1
print("Row %d/%d" % (rowcount,totalrows))
I know I could get the total by iterating through the reader, but then I couldn't run the 'for' loop. I could iterate through a copy of the reader, but I cannot find how to copy an iterator.
I could also use
totalrows = len(open('myFile.csv').readlines())
but that seems an unnecessary re-opening of the file. I would rather get the count from the DictReader if possible.
Any help would be appreciated.
Alan
rows = list(myreader)
totalrows = len(rows)
for i, row in enumerate(rows):
print("Row %d/%d" % (i+1, totalrows))
You only need to open the file once:
import csv
f = open('myFile.csv', 'rb')
countrdr = csv.DictReader(f)
totalrows = 0
for row in countrdr:
totalrows += 1
f.seek(0) # You may not have to do this, I didn't check to see if DictReader did
myreader = csv.DictReader(f)
for row in myreader:
do_work
No matter what you do you have to make two passes (well, if your records are a fixed length - which is unlikely - you could just get the file size and divide, but lets presume that isn't the case). Opening the file again really doesn't cost you much, but you can avoid it as illustrated here. Converting to a list just to use len() is potentially going to waste tons of memory, and not be any faster.
Note: The 'Pythonic' way is to use enumerate instead of +=, but the UNPACK_TUPLE opcode is so expensive that it makes enumerate slower than incrementing a local. That being said, it's likely an unnecessary micro-optimization that you should probably avoid.
More Notes: If you really just want to generate some kind of progress indicator, it doesn't necessarily have to be record based. You can tell() on the file object in the loop and just report what % of the data you're through. It'll be a little uneven, but chances are on any file that's large enough to warrant a progress bar the deviation on record length will be lost in the noise.
I cannot find how to copy an
iterator.
Closest is itertools.tee, but simply making a list of it, as #J.F.Sebastian suggests, is best here, as itertools.tee's docs explain:
This itertool may require significant
auxiliary storage (depending on how
much temporary data needs to be
stored). In general, if one iterator
uses most or all of the data before
another iterator starts, it is faster
to use list() instead of tee().
As mentioned in the answer https://stackoverflow.com/a/2890569/8056572 you can get the number of lines by taking the length of the reader converted to a list. However, this will have an impact on the RAM consumption and you will loose the benefits of the reader (which is a generator).
The best solution in my opinion is to open the file 2 times:
count the number of lines:
total_rows = sum(1 for _ in open('myFile.csv')) # -1 if you want to remove the header from the count
Note: I am not using .readlines() to avoid to load all the lines in memory
iterate over the lines
According to your snippet you will have something like this:
import csv
totalrows = sum(1 for _ in open('myFile.csv'))
myreader = csv.DictReader(open('myFile.csv'))
for i, _ in enumerate(myreader, start=1):
print("Row %d/%d" % (i, totalrows))
Note: the start=1 in the enumerate indicates the first value of i. By default it is 0, if you keep this default value you have to use i + 1 in the print statement
If you really do not want to open the file two times you can use seek as mentioned in the answer https://stackoverflow.com/a/2891061/8056572
import csv
f = open('myFile.csv')
total_rows = sum(1 for _ in f)
f.seek(0)
myreader = csv.DictReader(f)
for i, _ in enumerate(myreader, start=1):
print("Row %d/%d" % (i, totalrows))

Categories