I want to read two columns of a CSV file separately, but with the code below Python only shows the first column and prints nothing for the second, even though the second column also has lots of rows in the file.
import csv
import pprint

f = open("arachnid.csv", 'r')
read = csv.DictReader(f)

for i in range(3):
    read.next()

for i in read:
    pprint.pprint(i["binomialAuthority_label"])

for i in read:
    pprint.pprint(i["rdf-schema#label"])
The reason for this is that DictReader, used the way you are using it, creates an iterator (a generator-like object). Once you have iterated over it, it is exhausted and you cannot iterate over it again this way.
If you want to keep your logic as is, you can call seek(0) on your file object to reset its position, like so:
f.seek(0)
The next time you iterate over your DictReader object, it will give you what you are looking for. The part of your code of interest would then be this:
for i in read:
    pprint.pprint(i["binomialAuthority_label"])

# This is where you set your seek(0) before the second loop
f.seek(0)

for i in read:
    pprint.pprint(i['rdf-schema#label'])
Your DictReader instance gets exhausted after your first for i in read: loop, so when you try your second loop, there is nothing left to iterate over.
Once you've iterated over the CSV the first time, seek the file back to the start and create a new instance of the DictReader before starting again. You'll want a new DictReader instance, because otherwise you would need to manually skip the header line.
f = open(filename)
read = csv.DictReader(f)
for i in read:
    print i

f.seek(0)
read = csv.DictReader(f)
for i in read:
    print i
Related
I am trying to pickle a csv file and then turn its pickled representation back into a csv file.
This is the code I came up with:
from pathlib import Path
import pickle, csv
csvFilePath = Path('/path/to/file.csv')
pathToSaveTo = Path('/path/to/newFile.csv')
csvFile = open(csvFilePath, 'r')
f = csvFile.read()
csvFile.close()
f_pickled = pickle.dumps(f)
f_unpickled = pickle.loads(f_pickled)
#save unpickled csv file
new_csvFile = open(pathToSaveTo, 'w')
csvWriter = csv.writer(new_csvFile)
csvWriter.writerow(f_unpickled)
new_csvFile.close()
newFile.csv is created, but there are two problems with its content:
There is now a comma between every character.
There is now a pair of quotation marks after every line.
What would I have to change about my code to get an exact copy of file.csv?
The problem is that you are reading the raw text of the file with f = csvFile.read(); then, on writing, you are feeding that data, which is a single lump of text in one string, through a CSV writer object. The CSV writer sees the string as an iterable and writes each of its elements, i.e. each character, into its own CSV cell. Then there is no data for a second row, and the process ends.
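The effect is easy to reproduce in isolation; a minimal demonstration (not part of the question's code):
import csv, io

buf = io.StringIO()
csv.writer(buf).writerow("name0,rank0")
print(buf.getvalue())
# n,a,m,e,0,",",r,a,n,k,0
# every character becomes its own cell, and the literal comma gets quoted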
The pickle dumps and loads you perform are just a no-op: nothing happens there. If there were any issue, it would rather be an unpicklable object reference in the object you pass to dumps: you would get an exception, not different data when loads is called.
Now, without knowing why you want to do this, and what intermediate steps you have planned for the data, it is hard to advise: you are essentially performing a chain of no-ops, reading a file, pickling and unpickling its contents, and writing those contents back to disk.
At which point do you need these data structured as rows, or as CSV cells? Just apply the proper transforms where you need it, and you are done.
If you want the whole "do nothing" cycle to actually go through having the CSV data separated into different elements in Python, you can do:
from pathlib import Path
import pickle, csv
csvFilePath = Path('file.csv')
pathToSaveTo = Path('newFile.csv')
data = list(csv.reader(open(csvFilePath)))
# ^ consumes all iterations of the reader: each iteration is a row, stored as a list where each cell value is a list element
pickled_data = pickle.dumps(data)
restored_data = pickle.loads(pickled_data)
csv.writer(open(pathToSaveTo, "wt")).writerows(restored_data)
Note that in this snippet the data is read through csv.reader, not directly. Wrapping it in a list call causes all rows to be read and turned into list items, because the reader is otherwise a lazy iterator (and it would not be picklable, since one of the attributes its state depends on is an open file).
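As a side note, trying to pickle the reader object itself, rather than the materialized list, fails; a quick illustration (assuming file.csv exists; the exact error message varies by Python version):
import csv, pickle

with open('file.csv') as f:
    reader = csv.reader(f)
    try:
        pickle.dumps(reader)
    except TypeError as e:
        print(e)  # e.g. "cannot pickle '_csv.reader' object"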
I believe the problem is in how you're attempting to write the CSV file; the pickling and unpickling are fine. If you compare f with f_unpickled:
if f == f_unpickled:
    print("Same")
This printed in my case. If you print the types, you'll see they're both strings.
The better option is to follow the documented style and write each row one at a time, rather than writing the entire string, newlines included. Something like this:
from pathlib import Path
import pickle, csv

csvFilePath = Path('file.csv')
pathToSaveTo = Path('newFile.csv')

rows = []
with open(csvFilePath, 'r') as file:
    reader = csv.reader(file)
    for row in reader:
        rows.append(row)

# pickle and unpickle
rows_pickled = pickle.dumps(rows)
rows_unpickled = pickle.loads(rows_pickled)
if rows == rows_unpickled:
    print("Same")

# save unpickled csv file
with open(pathToSaveTo, 'w', newline='') as csvfile:
    csvWriter = csv.writer(csvfile)
    for row in rows_unpickled:
        csvWriter.writerow(row)
This worked when I tested it, albeit it would take more finagling with line separators to avoid an empty line at the end.
I have a (very large) CSV file that looks something like this:
header1,header2,header3
name0,rank0,serial0
name1,rank1,serial1
name2,rank2,serial2
I've written some code that processes the file and writes it out (using csv.writer), modified so that some information I compute is appended to the end of each row:
header1,header2,header3,new_hdr4,new_hdr5
name0,rank0,serial0,salary0,base0
name1,rank1,serial1,salary1,base1
name2,rank2,serial2,salary2,base2
What I'm trying to do is structure the script so that it auto-detects whether or not the CSV file it's reading has already been processed. If it has been processed, I can skip a lot of expensive calculations later. I'm trying to understand whether there is a reasonable way of doing this within the reader loop. I could just open the file once, read in enough to do the detection, and then close and reopen it with a flag set, but this seems hackish.
Is there a way to do this within the same reader? The logic is something like:
read first N lines  # N is small
if (some condition)
    already_processed = TRUE
    read_all_csv_without_processing
else
    read_all_csv_WITH_processing
I can't just use the iterator that reader gives me, because by the time I've gotten enough lines to do my conditional check, I don't have any good way to go back to the beginning of the CSV. Is closing and reopening it really the most elegant way to do this?
If you're using the usual Python method to read the file (with open("file.csv", "r") as f: or equivalent), you can "reset" the file reading by calling f.seek(0).
Here is a piece of code that should (I guess) look a bit more like the way you're reading your file. It demonstrates that resetting csvfile with csvfile.seek(0) also resets csvreader:
with open('so.txt', 'r') as csvfile:
    csvreader = csv.reader(csvfile, delimiter=',')
    for row in csvreader:
        print('Checking if processed')
        print(', '.join(row))
        #if condition:
        if True:
            print('File already processed')
            already_processed = True
            print('Resetting the file')
            csvfile.seek(0)
            for row in csvreader:
                print(', '.join(row))
            break
I suppose that if you do not want to just test the first few lines of the file, you could create a single iterator out of a list of the lines already read chained with the continuation of the csv reader.
Given:
header1,header2,header3
name0,rank0,serial0
name1,rank1,serial1
name2,rank2,serial2
You can do:
import csv
from itertools import chain

with open(fn) as f:
    reader = csv.reader(f)
    header = next(reader)

    N = 2
    p_list = []
    for i in range(N):          # N is however many rows you need to set the processed flag
        p_list.append(next(reader))

    print("p_list:", p_list)
    # now use p_list to determine if the file was processed, e.g. processed = check(p_list)

    for row in chain(iter(p_list), reader):   # chain creates a single stream of rows...
        # handle processed or not here from the stream of rows...
        # if not processed:
        #     process
        # else:
        #     handle already processed...
        # print row just to show the csv data is complete:
        print(row)
Prints:
p_list: [['name0', 'rank0', 'serial0'], ['name1', 'rank1', 'serial1']]
['name0', 'rank0', 'serial0']
['name1', 'rank1', 'serial1']
['name2', 'rank2', 'serial2']
I think what you're trying to achieve is to use the first lines to decide between the types of processing, then re-use those lines in read_all_csv_WITH_processing or read_all_csv_without_processing, while still not loading the full CSV file into memory. To achieve that, you can load the first lines into a list and concatenate that with the rest of the file using itertools.chain, like this:
import itertools

top_lines = []
reader_iterator = csv.reader(fil)
do_heavy_processing = True
while True:
    # Can't use "for line in reader_iterator" directly, otherwise we're
    # going to close the iterator when going out of the loop after the first
    # N iterations
    line = next(reader_iterator)
    top_lines.append(line)
    if some_condition(line):
        do_heavy_processing = False
        break
    elif not_worth_going_further(line):
        break
full_file = itertools.chain(top_lines, reader_iterator)
if do_heavy_processing:
    read_all_csv_WITH_processing(full_file)
else:
    read_all_csv_without_processing(full_file)
I will outline what I consider a much better approach. I presume this is happening over various runs. What you need to do is persist the files seen between runs and only process what has not been seen:
import pickle
import glob

def process(fle):
    # read_all_csv_with_processing
    pass

def already_process(fle):
    # read_all_csv_without_processing
    pass

try:
    # if it exists, we ran the code previously.
    with open("seen.pkl", "rb") as f:
        seen = pickle.load(f)
except IOError as e:
    # Else first run, so just create the set.
    print(e)
    seen = set()

for file in glob.iglob("path_where_files_are/*.csv"):
    # if not seen before, just process
    if file not in seen:
        process(file)
    else:
        # already processed so just do whatever
        already_process(file)
    seen.add(file)

# persist the set.
with open("seen.pkl", "wb") as f:
    pickle.dump(seen, f)
Even if for some strange reason you somehow process the same files in the same run, all you need to do then is implement the seen set logic.
Another alternative would be to use a unique marker in the file that you add at the start if processed.
# something here
header1,header2,header3,new_hdr4,new_hdr5
name0,rank0,serial0,salary0,base0
name1,rank1,serial1,salary1,base1
name2,rank2,serial2,salary2,base2
Then all you would need to check is the very first line. Also, if you want to get the first n lines from a file, even starting from a certain row, use itertools.islice.
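For reference, a small sketch of islice over a csv reader (the file name and counts here are made up for illustration):
import csv
from itertools import islice

with open("some.csv") as f:
    reader = csv.reader(f)
    first_line = next(reader)             # enough to check for the marker
    next_rows = list(islice(reader, 5))   # or grab the next 5 rows
    # islice(reader, 10, 20) would give rows 10..19 of whatever remains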
To be robust you might want to wrap your code in a try/finally in case it errors so you don't end up going over the same files already processed on the next run:
try:
    for file in glob.iglob("path_where_files_are/*.csv"):
        # if not seen before, just process
        if file not in seen:
            process(file)
        else:
            # already processed so just do whatever
            already_process(file)
        seen.add(file)
finally:
    # persist the set.
    with open("seen.pkl", "wb") as f:
        pickle.dump(seen, f)
I have two functions. The first creates a new CSV file (from an existing CSV). The second appends the same data to the new CSV, but in a slightly different order of the rows.
When I run this all together in one file, the first function works but the second does not. However, when I tried putting the second function in a separate file and calling it from the first script, it did work, albeit I had to enter the input twice.
What do I need to change to get the second function to run properly?
import csv

export = raw_input('>')
new_file = raw_input('>')

ynabfile = open(export, 'rb')
reader = csv.reader(ynabfile)

def create_file():
    with open(new_file, 'wb') as result:
        writer = csv.writer(result)
        for r in reader:
            writer.writerow((r[3], r[5], r[6], r[7], r[7],
                             r[8], r[8], r[9], r[10]))

def append():
    with open(new_file, 'ab') as result2:
        writer2 = csv.writer(result2)
        for i in reader:
            writer2.writerow((i[3], i[5], i[6], i[7], i[7],
                              i[8], i[8], i[10], i[9]))

create_file()
append()
I'm new to Python and programming in general, so if there is an all around better way to do this, I'm all ears.
The csv reader has already read the entire file pointed to by ynabfile, so the second call (and any subsequent call) to either create_file or append will not be able to fetch any more data through the reader until the file pointer is sent back to the beginning. In your case, a quick fix would be this:
create_file()
ynabfile.seek(0)
append()
I recommend restructuring your code a bit to avoid pitfalls like this. A few recommendations (a sketch follows this list):
Read all the contents of ynabfile into a list instead, if the entire file fits into memory.
Have create_file and append take the input and output file names as parameters.
Alternatively, have those two functions take the file object (ynabfile in this case), ensure it is seeked back to the beginning, and create a new csv.reader instance from it.
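A rough sketch of that restructuring, keeping the question's Python 2 style and column order (the load_rows helper name is made up for illustration):
import csv

def load_rows(export_path):
    # read the whole export into memory once so it can be reused
    with open(export_path, 'rb') as f:
        return list(csv.reader(f))

def create_file(rows, out_path):
    with open(out_path, 'wb') as result:
        writer = csv.writer(result)
        for r in rows:
            writer.writerow((r[3], r[5], r[6], r[7], r[7],
                             r[8], r[8], r[9], r[10]))

def append(rows, out_path):
    with open(out_path, 'ab') as result:
        writer = csv.writer(result)
        for r in rows:
            writer.writerow((r[3], r[5], r[6], r[7], r[7],
                             r[8], r[8], r[10], r[9]))

rows = load_rows(raw_input('>'))
new_file = raw_input('>')
create_file(rows, new_file)
append(rows, new_file)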
I haven't been able to re.sub a csv file.
My expression is doing its job, but the writerow part is where I'm stuck.
re.sub                     out
"A1","Address2"            "A1","Address2"
0138,"DEERFIELD AVE"       0138,"DEERFIELD"
0490,"REMMINGTON COURT"    0490,"REMMINGTON"
2039,"SANDHILL DR"         2039,"SANDHILL"
import csv
import re

with open('aa_street.txt', 'rb') as f:
    reader = csv.reader(f)
    read = csv.reader(f)
    for row in read:
        row_one = re.sub('\s+(DR|COURT|AVE|)\s*$', ' ', row[1])
        row_zero = row[0]
        print row_one
    for row in reader:
        print writerow([row[0],row[1]])
Perhaps something like this is what you need?
#!/usr/local/cpython-3.3/bin/python
# "A1","Address2" "A1","Address2"
# 0138,"DEERFIELD AVE" 0138,"DEERFIELD"
# 0490,"REMMINGTON COURT" 0490,"REMMINGTON"
# 2039,"SANDHILL DR" 2039,"SANDHILL"
import re
import csv
with open('aa_street.txt', 'r') as infile, open('actual-output', 'w') as outfile:
    reader = csv.reader(infile)
    writer = csv.writer(outfile)

    for row in reader:
        row_zero = row[0]
        row_one = re.sub(r'\s+(DR|COURT|AVE|)\s*$', '', row[1])
        writer.writerow([row_zero, row_one])
A file is an iterator—you iterate over it once, and then it's empty.
A csv.reader is also an iterator.
In general, if you want to reuse an iterator, there are three ways to do it:
Re-generate the iterator (and, if its source was an iterator, re-generate that as well, and so on up the chain); in this case, that means opening the file again.
Use itertools.tee.
Copy the iterator into a sequence and reuse that instead.
In the special case of files, you can fake #1 by using f.seek(0). Some other iterators have similar behavior. But in general, you shouldn't rely on this.
Anyway, the last one is the easiest, so let's just see how that works:
reader = list(csv.reader(f))
read = reader
Now you've got a list of all of the rows in the file. You can copy it, loop over it, loop over the copy, close the file, loop over the copy again, it's still there.
Of course the downside is that you need enough memory to hold the whole thing (plus, you can't start processing the first line until you've finished reading the last one). If that's a problem, you need to either reorganize your code so it only needs one pass, or re-open (or seek) the file.
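For completeness, option #2 with itertools.tee would look roughly like this for the file above; keep in mind that tee buffers whatever the slower copy hasn't consumed yet, so two full sequential passes still end up holding every row in memory (a sketch, not part of the original answer):
import csv
from itertools import tee

f = open('aa_street.txt', 'rb')
read, reader = tee(csv.reader(f))  # two independent copies of the same row stream

for row in read:
    print row   # first pass
for row in reader:
    print row   # second pass still sees every row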
I am a beginner with Python. I am now trying to figure out why the second 'for' loop doesn't work in the following script. I mean that I only get the result of the first 'for' loop, and nothing from the second one. I copied and pasted my script and the CSV data below.
It would be helpful if you could tell me why it works this way and how to make the second 'for' loop work as well.
My SCRIPT:
import csv

file = "data.csv"
fh = open(file, 'rb')
read = csv.DictReader(fh)

for e in read:
    print(e['a'])

for e in read:
    print(e['b'])
"data.csv":
a,b,c
tree,bough,trunk
animal,leg,trunk
fish,fin,body
The csv reader is an iterator over the file. Once you have gone through it, you have read to the end of the file, so there is no more to read. If you need to go through it again, you can seek back to the beginning of the file:
fh.seek(0)
This will reset the file to the beginning so you can read it again. Depending on the code, it may also be necessary to skip the field name header:
next(fh)
This is necessary for your code, since the DictReader consumed that line the first time around to determine the field names, and it's not going to do that again. It may not be necessary for other uses of csv.
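Applied to the question's code, the second loop would then look something like this (a sketch combining the two lines above):
for e in read:
    print(e['a'])

fh.seek(0)   # rewind the underlying file
next(fh)     # skip the header line the DictReader already consumed
for e in read:
    print(e['b'])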
If the file isn't too big and you need to do several things with the data, you could also just read the whole thing into a list:
data = list(read)
Then you can do what you want with data.
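With the question's column names, that could look like this (an illustrative sketch):
data = list(read)   # read is the DictReader from the question

for e in data:
    print(e['a'])
for e in data:
    print(e['b'])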
I have created a small function that takes the path of a CSV file, reads it, and returns a list of dicts all at once; you can then loop through that list very easily:
def read_csv_data(path):
    """
    Reads CSV from the given path and returns a list of dicts mapping field names to values
    """
    data = csv.reader(open(path))
    # Read the column names from the first line of the file
    fields = next(data)
    data_lines = []
    for row in data:
        items = dict(zip(fields, row))
        data_lines.append(items)
    return data_lines
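For example, with the data.csv from the question above, the helper could be used like this (a small usage sketch):
rows = read_csv_data("data.csv")
for row in rows:
    print(row['a'])
for row in rows:
    print(row['b'])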