I am a biologist who is just trying to use Python to automate a ton of calculations, so I have very little programming experience.
I have a very large array that contains values that are formatted into two columns of observations. Sometimes the observations will be the same between the columns:
v1,v2
x,y
a,b
a,a
x,x
In order to save time and effort I wanted to make an if statement that just prints 0 if the two columns are the same and then moves on. If the values are the same there is no need to run those instances through the downstream analyses.
This is what I have so far, just to test out the if statement. It has yet to recognize any instances where the columns are equivalent.
Script:
mylines = []
with open('xxxx', 'r') as myfile:
    for myline in myfile:
        mylines.append(myline)  ## reads the data into the two-column format mentioned above
rang = len(open('xxxxx', 'r').readlines())  ## returns the number of lines in the file
for x in range(1, rang):
    li = mylines[x]  ## selected row as defined by x and the number of lines in the file
    spit = li.split(',', 2)  ## splits the selected values so they can be accessed separately
    print(spit[0])  ## first value
    print(spit[1])  ## second value
    if spit[0] == spit[1]:
        print(0)
    else:
        print('Issue')
Output:
192Alhe52
192Alhe52
Issue ##should be 0
188Alhe48
192Alhe52
Issue
191Alhe51
192Alhe52
Issue
How do I get Python to recognize that certain observations are actually equal?
When you read the values and store them in the array, you may be storing the '\n' (line-break character) as well, so your array actually looks like this:
print(mylines)
['x,y\n', 'a,b\n', 'a,a\n', 'x,x\n']
To work around this issue, use strip(), which removes this character along with any trailing blank spaces at the end of the string that would also affect the comparison:
mylines.append(myline.strip())
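To see why the trailing newline breaks the comparison, here is a quick stand-alone check:

line = 'a,a\n'
spit = line.split(',')
print(repr(spit[1]))             # 'a\n' -- the newline is still attached
print(spit[1] == 'a')            # False
print(spit[1].strip() == 'a')    # True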
You also shouldn't use rang = len(open('xxxxx', 'r').readlines()), because that reads the entire file a second time. You already have the lines in memory:
rang=len(mylines)
There is a more readable, Pythonic way to replicate your for loop:
for li in mylines[1:]:
    spit = li.split(',')
    if spit[0] == spit[1]:
        print(0)
    else:
        print('Issue')
Or even:
for first, second in (li.split(',') for li in mylines[1:]):
    if first == second:
        print(0)
    else:
        print('Issue')
Both versions iterate over mylines starting from the second element, so the header row is skipped.
Also, if you're interested in python packages, you should have a look at pandas. Assuming you have a csv file:
import pandas as pd

df = pd.read_csv('xxxx')
for i, elements in df.iterrows():
    if elements['v1'] == elements['v2']:
        print('Equal')
    else:
        print('Different')
will do the trick. If you need to modify values and write another file
df.to_csv('nameYouWant')
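As a side note, pandas can also do the comparison without an explicit loop. A minimal vectorized sketch, assuming the same v1/v2 column names as above:

import pandas as pd

df = pd.read_csv('xxxx')
same = df['v1'] == df['v2']  # boolean Series, one entry per row
print(same.sum())            # number of rows where the two columns match

This is usually much faster than iterrows() on large files.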
For one, your issue with the equals test is likely because iterating over lines like this also yields the newline character. There is a string method, .strip(), that gets rid of it. Also, your second argument to split is 2 (the maximum number of splits), which can break a row into up to three pieces - though with two-column data that doesn't show here. You can avoid parsing the lines yourself by using the csv module, since your file presumably is CSV:
import csv

with open("yourfile.txt") as file:
    reader = csv.reader(file)
    next(reader)  # skip header
    for first, second in reader:
        print(first)
        print(second)
        if first == second:
            print(0)
        else:
            print("Issue")
Thanks in advance for any advice. First-time poster here, so I'll do my best to include all the required info. I am also quite a beginner with Python; I have been doing some online tutorials and some copy/paste coding from Stack Overflow - it's FrankenCoding... so I'm probably approaching this wrong.
I need to compare two CSV files that will have a changing number of columns; there will only ever be two columns that match (for example, email_address in one file and EMAIL in the other). Both files will have headers, but the names of those headers may change. The files may be anywhere from a few thousand lines up to 2,000,000+, with potentially 100+ columns (though a handful is more likely).
Output is to a third file, results.csv, containing all the info. It may be a merge (all unique entries), a subtract (remove entries present in one or the other) or an intersect (all entries present in both).
I have searched here and found a lot of good information, but everything I saw assumed a fixed number of columns in the files. I've tried dict and DictReader, and I know the answer is in there somewhere, but right now I'm a bit confused. Since I haven't made any progress in several days and can only devote so much time to this, I'm hoping I can get a nudge in the right direction.
Ideally, I want to learn how to do it myself, which means understanding how the data is 'moving around'.
Extract of the CSV files below. I didn't add more columns than (I think) necessary; the dataset I have now will match on originalid/UID or emailaddress/email, but this may not always be the case.
Original.csv
"originalid","emailaddress",""
"12345678","Bob#mail.com",""
"23456789","NORMA#EMAIL.COM",""
"34567890","HENRY#some-mail.com",""
"45678901","Analisa#sports.com",""
"56789012","greta#mail.org",""
"67890123","STEVEN#EMAIL.ORG",""
Compare.CSV
"email","","DATEOFINVALIDATION_WITH_TIME","OPTOUTDATE_WITH_TIME","EMAIL_USERS"
"Bob#mail.com",,,"true"
"NORMA#EMAIL.COM",,,"true"
"HENRY#some-mail.com",,,"true"
"Henrietta#AWESOME.CA",,,"true"
"NORMAN#sports.CA",,,"true"
"albertina#justemail.CA",,,"true"
Data in results.csv should be all columns from Original.CSV + all columns in Compare.csv, but not the matching one (email) :
"originalid","emailaddress","","DATEOFINVALIDATION_WITH_TIME","OPTOUTDATE_WITH_TIME","EMAIL_USERS"
"12345678","Bob#mail.com","",,,"true"
"23456789","NORMA#EMAIL.COM","",,,"true"
"34567890","HENRY#some-mail.com","",,,"true"
Here are my results as they are now:
email,,DATEOFINVALIDATION_WITH_TIME,OPTOUTDATE_WITH_TIME,EMAIL_USERS
Bob#mail.com,,,true,"['12345678', 'Bob#mail.com', '']"
NORMA#EMAIL.COM,,,true,"['23456789', 'NORMA#EMAIL.COM', '']"
HENRY#some-mail.com,,,true,"['34567890', 'HENRY#some-mail.com', '']"
And here's where I'm at with the code. The print statement shows the matching data from the files on screen, but not everything makes it into the file, so I'm missing something in there. I'm also not getting the headers from Original.csv, though the data is coming in.
import csv

def get_column_from_file(filename, column_name):
    f = open(filename, 'r')
    reader = csv.reader(f)
    headers = next(reader, None)
    i = 0
    max = len(headers)
    while i < max:
        if headers[i] == column_name:
            column_header = i
            # print(headers[i])
        i = i + 1
    return column_header
file_to_check = "Original.csv"
file_console = "Compare.csv"
column_to_read = get_column_from_file(file_console, 'email')
column_to_compare = get_column_from_file(file_to_check, 'emailaddress')

with open(file_console, 'r') as master:
    master_indices = dict((r[1], r) for i, r in enumerate(csv.reader(master)))

with open('Compare.csv', 'r') as hosts:
    with open('results.csv', 'w', newline='') as results:
        reader = csv.reader(hosts)
        writer = csv.writer(results)
        writer.writerow(next(reader, []))
        for row in reader:
            index = master_indices.get(row[0])
            if index is not None:
                print(row + [master_indices.get(row[0])])
                writer.writerow(row + [master_indices.get(row[0])])
Thanks for your time!
Pat
I like that you want to do this yourself, and recognize a need to "understand how the data is moving around." This is exactly how you should be thinking of the problem: focusing on the movement of data rather than the result. Some people may disagree with me, but I think this is a good philosophy to follow as it will make future reuse easier.
You're not trying to build a tool that combines two CSVs; you're trying to organize data (that happens to come from a CSV) according to a common reference (email address) and output the result as a CSV. Because you are talking about potentially large data sets (2,000,000+ rows with potentially 100+ columns), recognize that it is important to pay attention to the asymptotic runtime. If you do not know what this is, I recommend you read up on Big-O notation and asymptotic algorithm analysis, though you might be okay without it. A toy illustration follows.
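For intuition, here is a small sketch with made-up rows: matching by scanning a list is O(n*m), while building a dict keyed on email first brings the total work down to roughly O(n + m).

rows_a = [('a@x.com', 1), ('b@x.com', 2)]
rows_b = [('b@x.com', 'true'), ('c@x.com', 'true')]

# O(n*m): for each row of one file, scan every row of the other
slow = [r for r in rows_a if any(r[0] == s[0] for s in rows_b)]

# O(n+m): build a dict once, then each lookup is O(1) on average
by_email = {s[0]: s for s in rows_b}
fast = [r for r in rows_a if r[0] in by_email]

assert slow == fast  # both find the row for b@x.com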
First you decide what, from each CSV, is your key. You've already done this, 'email' for 'Compare.csv' and 'emailaddress' from 'Original.csv'.
Now, build yourself a function to produce dictionaries from the CSV based off the key.
def get_dict_from_csv(path_to_csv, key):
    with open(path_to_csv, 'r') as f:
        reader = csv.reader(f)
        headers, *rest = reader  # requires Python 3
        key_index = headers.index(key)  # find index of key
        # dictionary comprehensions are your friend; just think about what you want the dict to look like
        d = {row[key_index]: row[:key_index] + row[key_index+1:]  # key_index+1 skips the email entry
             for row in rest}
        headers.remove(key)
        d['HEADERS'] = headers  # add headers so you know what the information in the dict is
        return d
Now you can call this function on both of your CSVs.
file_console_dict = get_dict_from_csv('Compare.csv', 'email')
file_to_check_dict = get_dict_from_csv('Original.csv', 'emailaddress')
Now you have two dicts which are keyed off the same information. Now we need a function to combine these into one dict.
def combine_dicts(*dicts):
    d, *rest = dicts  # requires Python 3
    # iteratively pull other dicts into the first one, d
    for r in rest:
        original_headers = d['HEADERS'][:]
        new_headers = r['HEADERS'][:]
        # copy headers
        d['HEADERS'].extend(new_headers)
        # find missing keys
        s = set(d.keys()) - set(r.keys())  # keys present in d but not in r
        for k in s:
            d[k].extend([''] * len(new_headers))
        del r['HEADERS']  # we don't want to copy this a second time in the loop below
        for k, v in r.items():
            # use setdefault in case the key didn't exist in the first dict
            d.setdefault(k, [''] * len(original_headers)).extend(v)
    return d
Now you have one dict which has all the information you want, all you need to do is write it back as a CSV.
def write_dict_to_csv(output_file, d, include_key=False):
    with open(output_file, 'w', newline='') as results:
        writer = csv.writer(results)
        # email isn't in your HEADERS, so you'll need to add it
        if include_key:
            headers = ['email'] + d['HEADERS']
        else:
            headers = d['HEADERS']
        writer.writerow(headers)
        # now remove it from the dict so we can iterate over it without including it twice
        del d['HEADERS']
        for k, v in d.items():
            if include_key:
                row = [k] + v
            else:
                row = v
            writer.writerow(row)
And that should be it. To call all of this is just
file_console_dict = get_dict_from_csv('Compare.csv', 'email')
file_to_check_dict = get_dict_from_csv('Original.csv', 'emailaddress')
results_dict = combine_dicts(file_to_check_dict, file_console_dict)
write_dict_to_csv('results.csv', results_dict, include_key=True)  # include_key=True matches the output shown below
And you can easily see how this can be extended to arbitrarily many dictionaries.
You said you didn't want the email to be in the final CSV. This is counter-intuitive to me, so I made it an option in write_dict_to_csv() in case you change your mind.
When I run all the above I get
email,originalid,,,DATEOFINVALIDATION_WITH_TIME,OPTOUTDATE_WITH_TIME,EMAIL_USERS
Bob#mail.com,12345678,,,,true
NORMA#EMAIL.COM,23456789,,,,true
HENRY#some-mail.com,34567890,,,,true
Analisa#sports.com,45678901,,,,,
greta#mail.org,56789012,,,,,
STEVEN#EMAIL.ORG,67890123,,,,,
Henrietta#AWESOME.CA,,,,,true
NORMAN#sports.CA,,,,,true
albertina#justemail.CA,,,,,true
Right now it looks like you only use writerow once for the header:
writer.writerow(next(reader, []))
As francisco pointed out, uncommenting that last line may fix your problem. You can do this by removing the "#" at the beginning of the line.
I have a small script which compares, from CSV input files, how many items of the first list are in the second list. However, it takes quite some time to run when there are many references.
data_1 = import_csv("test1.csv")
print(len(data_1))
data_2 = import_csv("test2.csv")
print(len(data_2))
data_to_keep = len([i for i in data_1 if i in data_2])
I just run a test with 598756 items for the first list and 76612 for the second, and the script hasn't finished yet.
As I'm still relatively new to Python, I would like to know if there is a faster way to achieve what I'm trying to do. Thank you for your help :)
EDIT: import_csv looks like this:
import csv

def import_csv(csvfilename):
    data = []
    with open(csvfilename, "r", encoding="utf-8", errors="ignore") as scraped:
        reader = csv.reader(scraped, delimiter=',')
        for row in reader:
            if row:  # avoid blank lines
                data.append(row[0])
    return data
Make data_2 a set.
data_2 = set(import_csv("test2.csv"))
In Python, sets are much faster for checking if an object is present (using the in operator).
You might also see some improvement from switching the order of your inputs: build the set from the larger file, so you do fewer lookups when you iterate over the elements of the smaller file. A direct translation of your snippet is sketched below.
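A minimal sketch of the set-based version, reusing the import_csv helper from the question:

data_1 = import_csv("test1.csv")
data_2 = set(import_csv("test2.csv"))  # set membership tests are O(1) on average

# same count as before, but each 'in' check no longer scans a whole list
data_to_keep = sum(1 for i in data_1 if i in data_2)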
You can use set and its intersection, if duplicates can be safely discarded:
data1 = [1,2,3,3,4]
data2 = [2,3,5,6,1,6]
print(len(set(data1).intersection(data2)))
# 3
This is a set operation and will generally be much faster than what you are doing now.
Try it (note that a csv.reader yields lists and can only be traversed once, so the second file is materialized into a set of tuples first):
import csv

with open('test1.csv', newline='') as csvfile, open('test2.csv', newline='') as csvfile2:
    list1 = csv.reader(csvfile, delimiter=',')
    # rows are lists (unhashable), so store them as tuples in a set
    rows2 = {tuple(row) for row in csv.reader(csvfile2, delimiter=',')}
    data_to_keep = len([row for row in list1 if tuple(row) in rows2])
I'm making a few assumptions here but here's an idea...
test1.csv and test2.csv hold something unique, like serial numbers. Like...
9210268126,4628032171,6691918168,1499888554,2024134986,
8826205840,5643225730,3174290295,1881330725,7192644763,
7210351670,7956881819,4897219228,4638431591,6444695480,
1949859915,8919131597,2176933146,3875411064,3546520925
Try...
with open("test1.csv") as f1, open("test2.csv") as f2:
    # flatten each file into a set of individual serial numbers;
    # plain line.split(",") would give lists, which are unhashable and can't form a set
    data_1 = {num for line in f1 for num in line.strip().split(",") if num}
    data_2 = {num for line in f2 for num in line.strip().split(",") if num}
Since the entries are unique, we can use set operations to see which ones appear in the other file:
data_to_keep = data_1.intersection(data_2)
I'm not sure how to do it faster - at that point it might be a hardware bottleneck.
This one should also work. It converts the list to a dictionary, which avoids the sequential search the in operator performs on a list. With large datasets you often want to avoid in on lists.
data_1 = import_csv("test1.csv")
data_2 = dict([(i,i) for i in import_csv("test2.csv")])
data_to_keep = len([i for i in data_1 if data_2.get(i) is not None])
I'm extracting 4 columns from an imported CSV file (~500MB) to be used for fitting a scikit-learn regression model.
It seems that the function used to do the extraction is extremely slow. I just learnt Python today; any suggestions on how the function can be sped up? Can multithreading/multiple cores be used? My system has 4 cores.
def splitData(jobs):
    salaries = [jobs[i]['salaryNormalized'] for i, v in enumerate(jobs)]
    descriptions = [jobs[i]['description'] + jobs[i]['normalizedLocation'] + jobs[i]['category'] for i, v in enumerate(jobs)]
    titles = [jobs[i]['title'] for i, v in enumerate(jobs)]
    return salaries, descriptions, titles
print type(jobs)
<type 'list'>
print jobs[:1]
[{'category': 'Engineering Jobs', 'salaryRaw': '20000 - 30000/annum 20-30K', 'rawLocation': 'Dorking, Surrey, Surrey', 'description': 'Engineering Systems Analyst Dorking Surrey Salary ****K Our client is located in Dorking, Surrey and are looking for Engineering Systems Analyst our client provides specialist software development Keywords Mathematical Modelling, Risk Analysis, System Modelling, Optimisation, MISER, PIONEEER Engineering Systems Analyst Dorking Surrey Salary ****K', 'title': 'Engineering Systems Analyst', 'sourceName': 'cv-library.co.uk', 'company': 'Gregory Martin International', 'contractTime': 'permanent', 'normalizedLocation': 'Dorking', 'contractType': '', 'id': '12612628', 'salaryNormalized': '25000'}]
def loadData(filePath):
    reader = csv.reader(open(filePath))
    rows = []
    for i, row in enumerate(reader):
        categories = ["id", "title", "description", "rawLocation", "normalizedLocation",
                      "contractType", "contractTime", "company", "category",
                      "salaryRaw", "salaryNormalized", "sourceName"]
        # Skip header row
        if i != 0:
            rows.append(dict(zip(categories, row)))
    return rows

def splitData(jobs):
    salaries = []
    descriptions = []
    titles = []
    for i in xrange(len(jobs)):
        salaries.append(jobs[i]['salaryNormalized'])
        descriptions.append(jobs[i]['description'] + jobs[i]['normalizedLocation'] + jobs[i]['category'])
        titles.append(jobs[i]['title'])
    return salaries, descriptions, titles

def fit(salaries, descriptions, titles):
    # Vectorize
    vect = TfidfVectorizer()
    vect2 = TfidfVectorizer()
    descriptions = vect.fit_transform(descriptions)
    titles = vect2.fit_transform(titles)
    # Fit
    X = hstack((descriptions, titles))
    y = [np.log(float(salaries[i])) for i, v in enumerate(salaries)]
    rr = Ridge(alpha=0.035)
    rr.fit(X, y)
    return vect, vect2, rr, X, y

jobs = loadData(paths['train_data_path'])
salaries, descriptions, titles = splitData(jobs)
vect, vect2, rr, X_train, y_train = fit(salaries, descriptions, titles)
I see multiple problems with your code that directly impact its performance.
You enumerate the jobs list multiple times. You could enumerate it only once and reuse the enumerated list (stored in a variable).
You don't use the value from the enumerated items at all. All you need is the index, which you could easily get using the built-in range function.
Each of the lists is generated in an eager manner: building the first list blocks the execution of the program for a while, and the same happens with the second and third lists, whose iterations are exactly the same.
What I would suggest is to use a generator, so that you process the data in a lazy manner. It's more memory-efficient and allows you to extract the data on the go.
def splitData(jobs):
    for job in jobs:
        yield (job['salaryNormalized'],
               job['description'] + job['normalizedLocation'] + job['category'],
               job['title'])
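To consume the generator, unpack each tuple as you iterate; a small sketch (the downstream call is hypothetical):

for salary, description, title in splitData(jobs):
    handle(salary, description, title)  # hypothetical per-job processing step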
One simple speedup is to cut down on your list traversals. You can build a generator or generator expression that returns tuples for a single dictionary, then zip the resulting iterable:
(salaries, descriptions, titles) = zip(*((j['salaryNormalized'], j['description'] + j['normalizedLocation'] + j['category'], j['title']) for j in jobs))
Unfortunately, that still creates three sizable in-memory lists - using a generator expression rather than a list comprehension should at least prevent it from creating a full list of three-element tuples prior to zipping.
Correct me if I'm wrong, but it seems that TfidfVectorizer accepts an iterator (e.g. a generator expression) as well. This helps prevent having multiple copies of this rather large data in memory, which is probably what makes it slow. Alternatively, it can work with files directly: one could transform the csv into separate files and then feed those files to TfidfVectorizer without keeping their contents in memory at all; a sketch follows.
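For reference, a minimal sketch of the file-based mode; the paths are made up, and input='filename' tells the vectorizer to open and read each listed file itself:

from sklearn.feature_extraction.text import TfidfVectorizer

# each list entry is a path; fit_transform reads the files from disk,
# so the raw text never has to sit in a Python list
vect = TfidfVectorizer(input='filename')
X = vect.fit_transform(['descriptions_0.txt', 'descriptions_1.txt'])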
Edit 1
Now that you provided some more code, I can be a bit more specific.
First of all, please note that loadData is doing more than it needs to; it duplicates functionality present in csv.DictReader. If we use that, we can skip listing the category names ourselves. A different syntax for opening files is used because, that way, they're closed automatically. Also, some names are changed to be both more accurate and Pythonic (underscore style).
def data_from_file(filename):
    rows = []
    with open(filename) as f:
        reader = csv.DictReader(f)
        for row in reader:
            rows.append(row)
    return rows
We can now change this so that we don't build the list of all rows in memory, but instead give back a row one at a time right after we read it from the file. If this looks like magic, just read a little about generators in Python.
def data_from_file(filename):
    with open(filename) as f:
        reader = csv.DictReader(f)
        for row in reader:
            yield row
Now let's have a look at splitData. We could write it more cleanly like this:
def split_data(jobs):
    salaries = []
    descriptions = []
    titles = []
    for job in jobs:
        salaries.append(job['salaryNormalized'])
        descriptions.append(job['description'] + job['normalizedLocation'] +
                            job['category'])
        titles.append(job['title'])
    return salaries, descriptions, titles
But again, we don't want to build three huge lists in memory, and generally it's not practical for one function to give us three different things. So let's split it up:
def extract_salaries(jobs):
    for job in jobs:
        yield job['salaryNormalized']
And so on. This helps us set up a processing pipeline: every time we request a value from extract_salaries(data_from_file(filename)), a single line of the csv is read and the salary extracted; the next time, the second line is read, giving back the second salary. There's no need to write functions for this simple case. Instead, you can use generator expressions:
salaries = (job['salaryNormalized'] for job in data_from_file(filename))
descriptions = (job['description'] + job['normalizedLocation'] +
                job['category'] for job in data_from_file(filename))
titles = (job['title'] for job in data_from_file(filename))
You can now pass these generators to fit, where the most important modification is this:
y = [np.log(float(salary)) for salary in salaries]
You can't index into an iterator (something that gives you one value at a time), so you just take each salary from salaries as long as there are more, and do something with it.
In the end, you will read the whole csv file multiple times, but I don't expect that to be the bottleneck. Otherwise, some more restructuring is required.
Edit 2
Using DictReader seems a bit slow. I'm not sure why, but you may stick with your own implementation of it (modified to be a generator) or, even better, go with namedtuples:
from collections import namedtuple

def data_from_file(filename):
    with open(filename) as f:
        reader = csv.reader(f)
        header = reader.next()  # Python 2; use next(reader) on Python 3
        Job = namedtuple('Job', header)
        for row in reader:
            yield Job(*row)
Then access the attributes with a dot (job.salaryNormalized). But anyway note that you can get the list of column names from the file; don't duplicate it in code.
You may of course decide to keep a single copy of the file in memory after all. In that case, do something like this:
data = list(data_from_file(filename))
salaries = (job['salaryNormalized'] for job in data)
The functions remain untouched. The call to list consumes the whole generator and stores all values in a list.
You don't need the indexes at all. Just iterate over jobs directly. This saves the creation of an extra list of tuples, and it removes a level of indirection:
salaries = [j['salaryNormalized'] for j in jobs]
descriptions = [j['description'] + j['normalizedLocation'] + j['category'] for j in jobs]
titles = [j['title'] for j in jobs]
This still iterates over the data three times.
Alternatively you could get everything in one list comprehension, grouping the relevant data from one job together in a tuple;
data = [(j['salaryNormalized'],
         j['description'] + j['normalizedLocation'] + j['category'],
         j['title']) for j in jobs]
Saving the best for last; why not fill the lists straight from the CSV file instead of making a dict first?
import csv

with open('data.csv', 'r') as df:
    reader = csv.reader(df)
    # I made up the row indices...
    data = [(row[1], row[3] + row[7] + row[6], row[2]) for row in reader]
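If you then need the three sequences separately, zip can transpose the list of tuples in one go (a small sketch):

# unpack the 3-tuples into three parallel tuples
salaries, descriptions, titles = zip(*data)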
I have a csv DictReader object (using Python 3.1), but I would like to know the number of lines/rows contained in the reader before I iterate through it. Something like the following...
myreader = csv.DictReader(open('myFile.csv', newline=''))
totalrows = ?
rowcount = 0
for row in myreader:
    rowcount += 1
    print("Row %d/%d" % (rowcount, totalrows))
I know I could get the total by iterating through the reader, but then I couldn't run the 'for' loop. I could iterate through a copy of the reader, but I cannot find how to copy an iterator.
I could also use
totalrows = len(open('myFile.csv').readlines())
but that seems an unnecessary re-opening of the file. I would rather get the count from the DictReader if possible.
Any help would be appreciated.
Alan
rows = list(myreader)
totalrows = len(rows)
for i, row in enumerate(rows):
    print("Row %d/%d" % (i+1, totalrows))
You only need to open the file once:
import csv

f = open('myFile.csv', newline='')  # text mode; the csv module on Python 3 wants newline=''
countrdr = csv.DictReader(f)
totalrows = 0
for row in countrdr:
    totalrows += 1

f.seek(0)  # rewind; the fresh DictReader below re-reads the header from the top
myreader = csv.DictReader(f)
for row in myreader:
    pass  # do_work
No matter what you do, you have to make two passes (well, if your records were a fixed length - which is unlikely - you could just get the file size and divide, but let's presume that isn't the case). Opening the file again really doesn't cost you much, but you can avoid it as illustrated here. Converting to a list just to use len() will potentially waste tons of memory, and won't be any faster.
Note: The 'Pythonic' way is to use enumerate instead of +=, but the UNPACK_TUPLE opcode is so expensive that it makes enumerate slower than incrementing a local. That being said, it's likely an unnecessary micro-optimization that you should probably avoid.
More Notes: If you really just want to generate some kind of progress indicator, it doesn't necessarily have to be record based. You can tell() on the file object in the loop and just report what % of the data you're through. It'll be a little uneven, but chances are that on any file large enough to warrant a progress bar, the deviation in record length will be lost in the noise. A sketch follows.
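A minimal sketch of that idea. One wrinkle: in Python 3, tell() is disabled while a text-mode file is being iterated, so this wrapper counts characters consumed instead (for multi-byte encodings the percentage is approximate); the file name matches the question, everything else is illustrative:

import csv
import os

size = os.path.getsize('myFile.csv')

def lines_with_progress(path):
    consumed = 0
    with open(path, newline='') as f:
        for line in f:
            consumed += len(line)  # characters; close enough to bytes for ASCII data
            print("Progress: %.1f%%" % (100.0 * consumed / size))
            yield line

# csv readers accept any iterable of strings, not just file objects
for row in csv.DictReader(lines_with_progress('myFile.csv')):
    pass  # per-row work goes here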
I cannot find how to copy an iterator.

The closest is itertools.tee, but simply making a list of it, as @J.F.Sebastian suggests, is best here, as itertools.tee's docs explain:

This itertool may require significant auxiliary storage (depending on how much temporary data needs to be stored). In general, if one iterator uses most or all of the data before another iterator starts, it is faster to use list() instead of tee().
As mentioned in the answer https://stackoverflow.com/a/2890569/8056572, you can get the number of lines by taking the length of the reader converted to a list. However, this will have an impact on RAM consumption and you will lose the benefits of the reader (which is a generator).
The best solution, in my opinion, is to open the file twice:
count the number of lines:
total_rows = sum(1 for _ in open('myFile.csv')) # -1 if you want to remove the header from the count
Note: I am not using .readlines(), to avoid loading all the lines into memory.
iterate over the lines
According to your snippet you will have something like this:
import csv

totalrows = sum(1 for _ in open('myFile.csv')) - 1  # -1 so the header row is not counted
myreader = csv.DictReader(open('myFile.csv'))
for i, _ in enumerate(myreader, start=1):
    print("Row %d/%d" % (i, totalrows))
Note: start=1 in the enumerate sets the first value of i. By default it is 0; if you keep that default you have to use i + 1 in the print statement.
If you really do not want to open the file two times you can use seek as mentioned in the answer https://stackoverflow.com/a/2891061/8056572
import csv

f = open('myFile.csv')
total_rows = sum(1 for _ in f) - 1  # -1 to exclude the header, as above
f.seek(0)
myreader = csv.DictReader(f)
for i, _ in enumerate(myreader, start=1):
    print("Row %d/%d" % (i, total_rows))