I have a fairly basic question, and I'm wondering what the best solution would be using Python. I have a set of CSV files, and within each file, I have rows of comma-separated elements. Importantly, there are two distinct blocks of rows in each CSV file, let's say "Block 1" and "Block 2". Some values overlap between Block 1 and Block 2 (the specific items of interest are the names of particular .jpg files), but the order will vary. Here is a shortened version of how each file is organized:
Trial,Image,Type,Reps
1,511.jpg,T,1REP
2,101a.jpg,2,1REP
3,185a.jpg,5,3REP
4,566.jpg,T,3REP
5,560.jpg,T,3REP
Trial,Image,Type,Reps,Keypress
1,101a.jpg,2,1REP,1
2,185a.jpg,5,3REP,0
3,511.jpg,T,1REP,1
4,560.jpg,T,3REP,1
5,566.jpg,T,3REP,0
For some clarification, this is the log file of an experiment where Block 1 is the time when images are studied. "Type" corresponds to the type of picture, and "Reps" corresponds to how many times overall the picture is seen (1 or 3 times), neither of which is central to what I want to achieve. What I would like to do is this: for each row in the first block, match the name of the .jpg file to the same name in the second block. Then I need to append the Block 1 row with "1" or "0" based on whether the corresponding "Keypress" element in Block 2 is "1" or "0". Basically, when tested on the pictures, participants make a button press of "1" or "0", and I want to back-sort which pictures got which press during study. Critically, I need to preserve the order of Block 1 (the studied order of images) with whatever solution I take.
Apologies for how basic this request is...I'm learning.
Your question isn't what I would call basic at all (and it has nothing to do with sorting). In fact, the processing you want is fairly involved. Essentially each file has to be read twice: first to extract the information needed from the second block, and then again to update the first block. Additionally, each reading of the file breaks down into two sub-steps, since there are two kinds of csv data in each file which must be handled separately in each pass.
Since it's fairly difficult to update a file in-place, an updated version of the file is first written to a separate temporary file which then replaces the original if processing completes without errors.
import csv
import shutil
from tempfile import NamedTemporaryFile

TRIAL = 0
IMAGE = 1
KEYPRESS = 4

filename = 'backsorting.csv'
img_resp_map = {}

# first pass
with open(filename, 'r', newline='') as csvfile:
    reader = csv.reader(csvfile)
    # skip over first block
    next(reader)  # header
    while True:
        row = next(reader)
        if not row[TRIAL].isdigit():  # header of second block?
            break
    # use data in second block to create an image-to-response mapping
    for row in reader:
        img_resp_map[row[IMAGE]] = row[KEYPRESS]

# second pass
with open(filename, 'r', newline='') as csvfile:
    reader = csv.reader(csvfile)
    fields = next(reader)  # get header of first block
    with NamedTemporaryFile('w', newline='', dir='.', delete=False) as tempcsv:
        writer = csv.writer(tempcsv)
        writer.writerow(fields + ['Keypress'])  # new header with added field
        # copy and update rows of first block by appending the new field
        for row in reader:
            if not row[TRIAL].isdigit():  # header of second block?
                break
            writer.writerow(row + [img_resp_map[row[IMAGE]]])
        # copy second block of file unchanged
        writer.writerow(row)  # header of second block (already read)
        writer.writerows(reader)

# NOTE: the following is dangerous since it wipes out the original file
shutil.move(tempcsv.name, filename)  # replace original file with temp one
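If you want to be safer, you could keep a backup copy of the original right before that move (a small sketch; the .bak suffix is just a convention):

shutil.copy2(filename, filename + '.bak')  # keep a backup of the original
shutil.move(tempcsv.name, filename)        # then replace it with the temp file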
My test file was named backsorting.csv and initially had this in it:
Trial,Image,Type,Reps
1,511.jpg,T,1REP
2,101a.jpg,2,1REP
3,185a.jpg,5,3REP
4,566.jpg,T,3REP
5,560.jpg,T,3REP
Trial,Image,Type,Reps,Keypress
1,101a.jpg,2,1REP,1
2,185a.jpg,5,3REP,0
3,511.jpg,T,1REP,1
4,560.jpg,T,3REP,1
5,566.jpg,T,3REP,0
After running the script, its contents were changed to this:
Trial,Image,Type,Reps,Keypress
1,511.jpg,T,1REP,1
2,101a.jpg,2,1REP,1
3,185a.jpg,5,3REP,0
4,566.jpg,T,3REP,0
5,560.jpg,T,3REP,1
Trial,Image,Type,Reps,Keypress
1,101a.jpg,2,1REP,1
2,185a.jpg,5,3REP,0
3,511.jpg,T,1REP,1
4,560.jpg,T,3REP,1
5,566.jpg,T,3REP,0
Assuming the csv files are small enough, I would simply use a dictionary {} to map values from one block to the other.
Load up all values from Block 2 first.
import csv

d = {}
with open('some1.csv', 'r', newline='') as f:
    reader = csv.reader(f)
    for row in reader:
        num, file_name, third, keypress = row  # e.g. 1,a.jpg,XYZ,1
        d[file_name] = keypress
Now when iterating over Block 1, retrieve the values you have stored from Block 2, and append them to your data.
with open('some2.csv', 'r', newline='') as f:
    reader = csv.reader(f)
    for row in reader:
        num, file_name, third, reps = row  # e.g. 1,a.jpg,XYZ,1
        lst = [num, file_name, third, reps, d.get(file_name, -1)]
        # now convert `lst` to csv, and write to file
Note that the second code block uses a value of -1 if a matching filename wasn't found in the stored Block 2 data.
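To finish the write step, here is a minimal sketch using csv.writer (the output filename merged.csv is an assumption):

import csv

with open('some2.csv', 'r', newline='') as f, \
     open('merged.csv', 'w', newline='') as out:
    reader = csv.reader(f)
    writer = csv.writer(out)
    for row in reader:
        # append the stored Keypress, or -1 if the image was never tested
        writer.writerow(row + [d.get(row[1], -1)])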
I have been doing these tasks:

1. Write a script that reads in the data from the CSV file pastimes.csv located in the chapter 9 practice files folder, skipping over the header row
2. Display each row of data (except for the header row) as a list of strings
3. Add code to your script to determine whether or not the second entry in each row (the "Favorite Pastime") converted to lower-case includes the word "fighting" using the string methods find() and lower()

I have completed two of them, but I really don't understand the third one; my English is not very good and I can't quite work out what they are asking for.
import csv

with open("pastimes.csv", "r") as my_file:
    my_file_reader = csv.reader(my_file)
    next(my_file_reader)
    for row in my_file_reader:
        print(row)
Output:

['Fezzik', 'Fighting']
['Westley', 'Winning']
['Inigo Montoya', 'Sword fighting']
['Buttercup', 'Complaining']

The header row which I skipped: Person, Favorite pastime
You need something like:
import csv

with open("pastimes.csv", "r") as my_file:
    my_file_reader = csv.reader(my_file)
    next(my_file_reader)
    for row in my_file_reader:
        print(row)
        if row[1].lower().find('fighting') >= 0:
            print('Second entry lowered contains "fighting"')
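As an aside, if the exercise didn't insist on find() and lower(), the more idiomatic Python check would be the in operator:

if 'fighting' in row[1].lower():
    print('Second entry lowered contains "fighting"')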
I'm using Python 3.6.0. I have written dictionaries to CSV files before, but never to files that already contain text. I am having trouble with that now. Here's my code:
import csv

f = '/Users/[my name]/Documents/Webscraper/tests/output_sheet_1.csv'
bigdict = {'ex_1': 1, 'ex_2': 2, 'ex_3': 3}
with open(f, 'r+') as file:
    fieldnames = ['ex_1', 'ex_2', 'ex_3']
    writer = csv.DictWriter(file, fieldnames=fieldnames, delimiter=',')
    if '\n' not in file.readline()[-1]:
        file.write('\n')
    writer.writerow(bigdict)
When I run this, Python appends the dictionary on the first line after the row containing the fieldnames, starting at the last cell that's below a fieldname. In other words, the first row contains many entries, including ex_1, ex_2, and ex_3 in the last three cells. In the second row, I have values stored in all cells except for those under ex_1, ex_2, and ex_3, which are blank. Python writes the integer 1 below ex_3, and writes 2 and 3 in the cells to the right of it.
I would like to re-position them so each number is under its respective fieldname cell. How do I do this, and why is this happening? Thanks.
You have two issues:
If there is no file (or if it is empty) you would want to have a header line added.
If you are appending a new row to an existing file, you are trying to ensure that the last character in your file is a newline. writerow() will add a trailing newline so normally this would not be a problem. If however the file has been manually edited and the trailing newline is missing, this would cause the new row to be appended to the end of the last line.
The first issue can be resolved by first testing the size of the file. If it is 0 then it exists but is empty. If the file does not exist, an OSError exception is raised. write_header is used to signal this.
The second issue is a bit more tricky. If you open the file in binary mode, it is possible to seek to the last byte of the file and read it in. This can be checked to see if it is a newline. If your file ever used another encoding, this would need to be changed. The file can then be reopened in append mode and the new row written.
This can all be done as follows:
import csv
import os

filename = '/Users/[my name]/Documents/Webscraper/tests/output_sheet_1.csv'
bigdict = {'ex_1': 1, 'ex_2': 2, 'ex_3': 3}

# Does the file exist? If not (or it is empty), write a header
try:
    write_header = os.path.getsize(filename) == 0
except OSError:
    write_header = True

# If the file exists, does it end with a newline?
if write_header:
    no_ending_newline = False
else:
    with open(filename, 'rb') as f_input:
        f_input.seek(-1, 2)  # move to the last byte of the file
        no_ending_newline = f_input.read() != b'\n'

with open(filename, 'a', newline='') as f_output:
    fieldnames = ['ex_1', 'ex_2', 'ex_3']
    csv_writer = csv.DictWriter(f_output, fieldnames=fieldnames)
    if write_header:
        csv_writer.writeheader()
    if no_ending_newline:
        f_output.write('\n')
    csv_writer.writerow(bigdict)
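For example, assuming you start with no file at all, running the script twice would leave output_sheet_1.csv looking like this (the header is written once, and each run appends one row):

ex_1,ex_2,ex_3
1,2,3
1,2,3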
I have a number of txt files that represent spatial data in a grid form, essentially arrays of the same dimensions in which each value signifies a trait about the corresponding parcel of land. I have been trying to script a sequence that imports each file, adds "-9999" on the border of the entire grid, and saves out to an otherwise identical txt file.
The first 6 rows of each txt file are header rows, and shouldn't be changed.
My progress is as follows:
for datfile in spatialfiles:
    results = []
    borderrow = []
    with open('{}.txt'.format(datfile)) as inputfile:
        #header = inputfile.readlines()
        for line in inputfile:
            row = ['-9999'] + line.strip().split(' ') + ['-9999']
            results.append(row)
            for cell in range(len(row)):
                borderrow.append('-9999')
    results = [borderrow] + results[6:] + [borderrow]
    with file("{}-new.txt".format(datfile), 'w') as outputFile:
        for row in header[:6]:
            outputFile.write(row)
        for row in results:
            outputFile.write(row)
"header = inputfile.readlines()" has been commented out because it seems to cause a NameError in which "row" is no longer recognized. At the same time, I haven't found another way to retain the 6 header rows for exporting later.
Why does readlines() seem to alter the ability to iterate through the lines of the inputfile when it is only being used to write to a variable? What am I missing? (Any other pointers on my undoubtedly bloated code always welcome!)
readlines() reads the whole file into memory, parses it into a list of lines, and leaves the file position at the end of the file. When you then iterate over the same file object, reading resumes from that position, which is already at the end, so the loop body never executes and row is never assigned, hence the NameError. Call readlines() once and work with the resulting list instead: the first 6 entries are your header, and the rest are the grid rows.
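A minimal sketch of that single-read approach, assuming spatialfiles is your existing list of file-name stems and the grid values are space-separated, as in your code:

for datfile in spatialfiles:
    with open('{}.txt'.format(datfile)) as inputfile:
        lines = inputfile.readlines()  # read the file exactly once

    header = lines[:6]                 # first 6 lines, newlines intact
    results = []
    for line in lines[6:]:
        # pad each data row with the border value on both sides
        results.append(['-9999'] + line.strip().split(' ') + ['-9999'])

    borderrow = ['-9999'] * len(results[0])  # full-width border row
    results = [borderrow] + results + [borderrow]

    with open('{}-new.txt'.format(datfile), 'w') as outputfile:
        outputfile.writelines(header)
        for row in results:
            outputfile.write(' '.join(row) + '\n')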
I have a stock file which looks like this:
12334232:seat belt:2.30:12:10:30
14312332:toy card:3.40:52:10:30
12512312:xbox one:5.30:23:10:30
12543243:laptop:1.34:14:10:30
65478263:banana:1.23:23:10:30
27364729:apple:4.23:42:10:30
28912382:orange:1.12:16:10:30
12892829:elephant:6.45:14:10:30
I want to replace the items in the fourth column with the numbers in the sixth column if, after a certain transaction, they are below the numbers in the fifth column. How would I replace the items in the fourth column?
Every time I use the following lines of code, it overwrites the whole file with nothing (deletes everything):
for line in stockfile:
    c = line.split(":")
    print("pass")
    if stock_order[i] == User_list[i][0]:
        stockfile.write(line.replace(current_stocklevel_list[i], reorder_order[i]))
    else:
        i = i + 1
I want the stockfile to look like this after it has replaced the necessary items in the column:
12334232:seat belt:2.30:30:10:30
14312332:toy card:3.40:30:10:30
12512312:xbox one:5.30:30:10:30
12543243:laptop:1.34:30:10:30
65478263:banana:1.23:30:10:30
27364729:apple:4.23:30:10:30
28912382:orange:1.12:30:10:30
12892829:elephant:6.45:30:10:30
If you are opening the file again later, you should use "a" (append) mode so that the file doesn't get truncated.
The write pointer will automatically be at the end of the file.
So:
f = open("filename", "a")
f.seek(0) # To start from beginning
But if you want to both read and write, then add "+" to the mode and the file won't be truncated either.
f = open("filename", "r+")
Both the read and write pointers will be at the beginning of the file; you'll need to seek to the position where you wish to start writing/reading.
But you are doing it wrong.
A file's content is overwritten in place, not inserted automatically.
Content is only appended when the write position is at the end of the file.
So you either need to load the whole file, make the changes you need, and write everything back.
Or you have to write the changes at some point and shift the remaining content, truncating the file if the new content is shorter than before.
The mmap module can help you treat the file as a string; you will be able to efficiently shift data and resize the file.
But if you really want to change the file in place, you should give the file fixed-length columns. Then, when you want to change a value, you do not need to shift anything back and forth: just find the right row and column, seek there, and write the new value over the old one (making sure to overwrite all of the old value).
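To illustrate the fixed-width idea, a hedged sketch (the record length, field offset, and field width here are assumptions for illustration, not values from your actual file):

RECORD_LEN = 40     # assumed fixed length of one line, newline included
STOCK_OFFSET = 25   # assumed byte offset of the stock field within a line
STOCK_WIDTH = 4     # assumed fixed width of the stock field

with open('stockfile.txt', 'r+b') as f:
    f.seek(2 * RECORD_LEN + STOCK_OFFSET)  # stock field of the third record
    f.write(b'30'.rjust(STOCK_WIDTH))      # overwrite with a same-width value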
You should try to read in the data first:
with open('inputfile', 'r') as infile:
    data = infile.readlines()
Then you can loop over the data and edit as needed and write it out:
import random  # only needed for the placeholder condition below

with open('outputfile', 'w') as outfile:
    for line in data:
        c = line.strip().split(":")
        if random.randint(1, 3) == 1:
            # update fourth column based on some good reason
            c[3] = str(int(c[3]) + 2)
        outfile.write(':'.join(c) + '\n')
Or you could do it in one go with something like:

import os
import random

with open('inputfile', 'r') as infile, open('outputfile', 'w') as outfile:
    for line in infile:
        c = line.strip().split(":")
        if random.randint(1, 3) == 1:
            # update fourth column based on some good reason
            c[3] = str(int(c[3]) + 2)
        outfile.write(':'.join(c) + '\n')
os.rename('outputfile', 'inputfile')
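Applying this to the actual condition from your question, a sketch (assuming the columns are id:name:price:stock:threshold:reorder, as your sample suggests):

with open('inputfile', 'r') as infile, open('outputfile', 'w') as outfile:
    for line in infile:
        c = line.strip().split(':')
        if int(c[3]) < int(c[4]):  # stock below the fifth-column threshold?
            c[3] = c[5]            # reset it to the sixth-column reorder level
        outfile.write(':'.join(c) + '\n')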
I have a csv file which I want to sort by taking one row at a time. While sorting a row, I want to ignore whitespace (or empty cells). Also, I want to ignore the first row and first column while sorting.
This is what my code looks like:
import csv, sys, operator

fname = "Source.csv"
new_fname = "Dest.csv"
data = csv.reader(open(fname, "rb"), delimiter=',')
num = 1
sortedlist = []
ind = 0
for row in data:
    if num == 1:
        sortedlist.append(row)
        with open(new_fname, "wb") as f:
            filewriter = csv.writer(f, delimiter=",")
            filewriter.writerow(sortedlist[ind])
            ind += 1
    elif num > 1:
        sortedlist.append(sorted(row))
        with open(new_fname, "ab") as f:
            filewriter = csv.writer(f, delimiter=",")
            filewriter.writerow(sortedlist[ind])
            ind += 1
    num += 1
I was able to ignore the first row, but I am not sure how to ignore the whitespace and the first column while sorting. Any suggestions are welcome.
I simplified your code significantly and here's what I got (although I didn't understand the part about empty columns; they are values as well... Did you mean that you wanted to keep empty columns in the same place instead of putting them at the start?)
import csv

if __name__ == '__main__':
    reader = csv.reader(open("Source.csv", "r", newline=''), delimiter=',')
    out_file = open("Dest.csv", "w", newline='')
    writer = csv.writer(out_file, delimiter=",")
    writer.writerow(next(reader))
    for row in reader:
        writer.writerow([row[0]] + sorted(row[1:]))
    out_file.close()
Always write executable code under if __name__ == '__main__':; this is done so that your code is not executed if your script was not run directly, but rather imported by another script.
We keep the out_file variable so that we can call out_file.close() cleanly later; the code will work without it, but it's a clean way to write files.
Do not use "wb", "rb", "ab" for text files; the "b" part stands for "binary" and should be reserved for binary files.
next(reader) gets the first line of the csv file (or raises StopIteration if the file is empty).
for row in reader: already starts from the second line (because we ran next(reader) earlier), so we don't need any line-number conditionals anymore.
row[0] gets the first element of the list, row[1:] gets all elements of the list, except the first one. For example, row[3:] would ignore first 3 elements and return the rest of the list. In this case, we only sort the row without its first element by doing sorted(row[1:])
EDIT: If you really want to remove empty columns from your csv, replace sorted(row[1:]) with sorted(filter(lambda x: x.strip()!='', row[1:])). This will remove empty columns from the list before sorting, but keep in mind that empty values in csv are still values.
EDIT2: As correctly pointed out by @user3468054, values will be sorted as strings; if you want them to be sorted as numbers, add a named parameter key=int to the sorted function, or key=float if your values are floats.
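Putting the two edits together, a short sketch of the loop body that drops blank cells and sorts the rest numerically (assuming every non-empty value is numeric):

for row in reader:
    values = [v for v in row[1:] if v.strip()]  # drop empty cells
    writer.writerow([row[0]] + sorted(values, key=float))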