Python - Most efficient way to overwrite a specific row in a CSV file - python

Given the following csv file :
01;blue;brown;black
02;glass;rock;paper
03;pigeon;squirel;shark
My goal is to replace the (unique) line containing '02' in the 1st position.
I wrote this piece of code:
import csv

with open("csv", 'r+', newline='', encoding='utf-8') as csvfile, open('csvout', 'w', newline='', encoding='utf-8') as out:
    reader = csv.reader(csvfile, delimiter=';')
    writer = csv.writer(out, delimiter=';')
    for row in reader:
        if row[0] != '02':
            writer.writerow(row)
        else:
            writer.writerow(['02', 'A', 'B', 'C'])
But rewriting the whole CSV into another file doesn't seem to be the most efficient way to proceed, especially for large files:
Once the match is found, we continue to read until the end.
We have to rewrite every line one by one.
Writing a second file is neither practical nor storage-efficient.
I wrote a second piece of code that seems to address these problems:
with open("csv", 'r+', newline='', encoding='utf-8') as csvfile:
    content = csvfile.readlines()
    for index, row in enumerate(content):
        row = row.split(';')
        if row[0] == '02':  # match on the 1st field, as stated in the goal
            tochange = index
            break
    content.pop(tochange)
    content.insert(tochange, '02;A;B;C\n')
    content = "".join(content)
    csvfile.seek(0)
    csvfile.truncate(0) # Erase content
    csvfile.write(content)
Do you agree that the second solution is more efficient?
Do you have any improvements, or a better way to proceed?
EDIT: The number of characters in the line can vary.
EDIT 2: Apparently I have to read and rewrite everything if I don't want to use padding.
A possible alternative would be a database-like solution; I will consider it for the future.
If I had to choose between these 2 solutions, which one would be best performance-wise?

As the number of characters in the line may vary, I either have to read and write the whole file or, as @tobias_k said, use seek() to go back to the beginning of the line and:
If the line is shorter, write just the line and pad with spaces;
If it is the same length, write just the line;
If it is longer, rewrite that line and all the following ones.
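For illustration only, the seek()-based variant described in the list above might be sketched as follows. This is a sketch under assumptions: the file is opened in binary mode, rows contain no quoted or multi-line fields, and whoever reads the file later strips the trailing padding. The function name overwrite_row_in_place and its parameters are hypothetical, and the "longer row" case is not handled (the function just returns False).

```python
def overwrite_row_in_place(path, key, new_fields, delimiter=b';'):
    """Overwrite the first row whose first field equals `key`, in place.

    Returns True on success, False if the key is missing or the new row
    is longer than the old one (that case needs a rewrite of the tail).
    """
    new_line = delimiter.join(new_fields)
    with open(path, 'r+b') as f:
        while True:
            start = f.tell()              # byte offset where this line begins
            line = f.readline()
            if not line:
                return False              # reached end of file, key not found
            body = line.rstrip(b'\r\n')   # row content without its line ending
            if body.split(delimiter)[0] == key:
                if len(new_line) > len(body):
                    return False          # would overwrite the following rows
                f.seek(start)
                # Pad with spaces so the row keeps its original byte length.
                f.write(new_line.ljust(len(body)))
                return True
```

With the sample file from the question, overwrite_row_in_place('csv', b'02', [b'02', b'A', b'B', b'C']) would leave the other two rows untouched, at the cost of trailing spaces in the last field of the replaced row.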
I want to avoid padding, so I used time.perf_counter() to measure the execution time of both pieces of code, and the second solution appears to be almost twice as fast (CSV of 10,000 lines, match at the 6,000th).
One alternative would be to migrate to a relational database.

Related

Trying to remove rows based in csv file based off column value

I'm trying to remove duplicated rows in a csv file based on if a column has a unique value. My code looks like this:
seen = set()
for line in fileinput.FileInput('DBA.csv', inplace=1):
    if line[2] in seen:
        continue # skip duplicated line
    seen.add(line[2])
    print(line, end='')
I'm trying to get the value of the 2 index column in every row and check if it's unique. But for some reason my seen set looks like this:
{'b', '"', 't', '/', 'k'}
Any advice on where my logic is flawed?
You're reading your file line by line, so when you pick line[2] you're actually picking the third character of each line you're running this on.
If you want to capture the value of the third column (index 2) for each row, you need to parse your CSV first, something like:
import csv
seen = set()
with open("DBA.csv", newline='') as f:  # Python 3: text mode with newline=''
    reader = csv.reader(f)
    for line in reader:
        if line[2] in seen:
            continue
        seen.add(line[2])
        print(line) # this will NOT print valid CSV, it will print Python list
If you want to edit your CSV in place I'm afraid it will be a bit more complicated than that. If your CSV is not huge, you can load it in memory, truncate it and then write down your lines:
import csv
seen = set()
with open("DBA.csv", "r+", newline='') as f:  # Python 3: text mode, read and write
    handler = csv.reader(f)
    data = list(handler)
    f.seek(0)
    f.truncate()
    handler = csv.writer(f)
    for line in data:
        if line[2] in seen:
            continue
        seen.add(line[2])
        handler.writerow(line)
Otherwise you'll have to read your file line by line and use a buffer that you'll pass to csv.reader() to parse it, check the value of its third column and if not seen write the line to the live-editing file. If seen, you'll have to seek back to the previous line beginning before writing the next line etc.
Of course, you don't need to use the csv module if you know your line structures well which can simplify the things (you won't need to deal with passing buffers left and right), but for a universal solution it's highly advisable to let the csv module do your bidding.
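The line-by-line approach described above could look roughly like the sketch below. It relies on assumptions not stated in the answer: every record fits on one physical line (no quoted line breaks), and the file is small enough in line count that the repeated seeks are acceptable. The function name dedupe_in_place and its parameters are hypothetical. It works because the write position can never overtake the read position when rows are only ever dropped.

```python
import csv

def dedupe_in_place(path, column=2, encoding='utf-8'):
    """Drop rows whose value in `column` was already seen, editing the file in place."""
    seen = set()
    with open(path, 'r+b') as f:
        read_pos = write_pos = 0
        while True:
            f.seek(read_pos)
            raw = f.readline()
            if not raw:
                break                      # end of file
            read_pos = f.tell()            # where the next unread line starts
            # Let the csv module parse this single line (quotes, escapes, ...).
            rows = list(csv.reader([raw.decode(encoding)]))
            row = rows[0] if rows else []
            if len(row) > column:
                if row[column] in seen:
                    continue               # duplicate: do not copy it forward
                seen.add(row[column])
            # Kept line: write it at the current write position.
            f.seek(write_pos)
            f.write(raw)
            write_pos = f.tell()
        f.truncate(write_pos)              # cut off the leftover tail
```

Lines that are too short to have the requested column are kept as they are.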

Python working with CSV with 2 delimiters

I have a program which outputs its data into a CSV file. These files contain 2 delimiters, these are , and "" for text. The text also contains commas.
How can I work with these 2 delimiters?
My current code gives me list index out of range. If the CSV file is needed I can provide it.
Current code:
import csv
def readcsv():
    with open('pythontest.csv') as csvfile:
        dialect = csv.Sniffer().sniff(csvfile.read(1024),delimiters=',"')
        csvfile.seek(0)
        reader = csv.reader(csvfile,dialect)
        for row in reader:
            asset_ip_addresses.append(row[0])
            service_protocollen.append(row[1])
            service_porten.append(row[2])
            vurn_cvssen.append(row[3])
            vurn_risk_scores.append(row[4])
            vurn_descriptions.append(row[5])
            vurn_cve_urls.append(row[6])
            vurn_solutions.append(row[7])
The CSV file I'm working with: http://www.pastebin.com/bUbDC419
It seems to have problems handling the second line. If I append the rows to a list, the first row seems to be OK, but the second row is taken as one whole thing and its commas are no longer separated.
I guess it has something to do with the line breaks ("enters").
I don't think you should need to define a custom dialect, unless I'm missing something.
The official documentation shows you can provide quotechar as a keyword to the reader() method. The example from the documentation modified for your code:
import csv
with open('pythontest.csv', newline='') as csvfile:
    reader = csv.reader(csvfile, delimiter=',', quotechar='"')
    for row in reader:
        pass  # do something with the row
row is a list of strings for each item in the row with " quotes removed.
The issue with the index out of range suggests that one of the row[x] cannot be accessed.
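To check this without the original file, here is a small self-contained sketch. The sample data is invented for illustration (it is not the content of the pastebin link) and uses io.StringIO in place of a real file. It suggests that a quoted field containing both commas and a line break comes back as a single list element:

```python
import csv
import io

# Made-up sample: the 4th field is quoted and contains commas and a line break.
sample = (
    '10.0.0.1,tcp,443,"Weak cipher, see notes\nsecond line of description",7.5\n'
    '10.0.0.2,udp,53,"No issues",0.0\n'
)

# newline='' keeps embedded line breaks intact, as recommended for csv input.
rows = list(csv.reader(io.StringIO(sample, newline=''), delimiter=',', quotechar='"'))

for row in rows:
    print(len(row), row)
```

Each of the two records should be parsed into five fields, so row[0] to row[4] are safe to access here.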
OK, I think I understand what kind of file you are reading... let's say the content of your CSV file looks like this
192.168.12.255,"Great site, a lot of good, recommended",0,"Last, first, middle"
192.168.0.255,"About cats, dogs, must visit!",1,"One, two, three"
Here is code that reads it row by row; text in quotes is returned as a single list element and is not split on the commas inside it. The example passes quoting=csv.QUOTE_ALL, although the reader's default quotechar='"' already handles quoted fields this way.
import csv
with open('students.csv', newline='') as f:
    reader = csv.reader(f, delimiter=',', quoting=csv.QUOTE_ALL)
    for row in reader:
        print(row[0])
        print(row[1])
        print(row[2])
        print(row[3])
The printed output will look like this
192.168.12.255
Great site, a lot of good, recommended
0
Last, first, middle
192.168.0.255
About cats, dogs, must visit!
1
One, two, three
PS solution is based on the latest official documentation, see here https://docs.python.org/3/library/csv.html
How about a quick fix like this, which would split a CSV row like a,"b,c",d into the strings a, b, c, d?
def readcsv():
    with open('pythontest.csv') as csvfile:
        dialect = csv.Sniffer().sniff(csvfile.read(1024),delimiters=',"')
        csvfile.seek(0)
        reader = csv.reader(csvfile,dialect)
        for rowx in reader:
            row=[e.split(r',') if isinstance(e,str) else e for e in rowx]
            #do your stuff on row

How to search CSV line for string in certain column, print entire line to file if found

Sorry, very much a beginner with Python and could really use some help.
I have a large CSV file, items separated by commas, that I'm trying to go through with Python. Here is an example of a line in the CSV.
123123,JOHN SMITH,SMITH FARMS,A,N,N,12345 123 AVE,CITY,NE,68355,US,12345 123 AVE,CITY,NE,68355,US,(123) 555-5555,(321) 555-5555,JSMITH#HOTMAIL.COM,15-JUL-16,11111,2013,22-DEC-93,NE,2,1\par
I'd like my code to scan each line and look at only the 9th item (the state). For every item that matches my query, I'd like that entire line to be written to a CSV.
The problem I have is that my code will find every occurrence of my query throughout the entire line, instead of just the 9th item. For example, if I scan looking for "NE", it will write the above line in my CSV, but also one that contains the string "NEARY ROAD."
Sorry if my terminology is off, again, I'm a beginner. Any help would be greatly appreciated.
I've listed my coding below:
import csv
with open('Sample.csv', 'rb') as f, open('NE_Sample.csv', 'wb') as outf:
    reader = csv.reader(f, delimiter=',')
    writer = csv.writer(outf)
    for line in f:
        if "NE" in line:
            print ('Found: []'.format(line))
            writer.writerow([line])
You're not actually using your reader to read the input CSV, you're just reading the raw lines from the file itself.
A fixed version looks like the following (untested):
import csv
# Python 3: open CSV files in text mode with newline='' (Python 2 used 'rb'/'wb')
with open('Sample.csv', newline='') as f, open('NE_Sample.csv', 'w', newline='') as outf:
    reader = csv.reader(f, delimiter=',')
    writer = csv.writer(outf)
    for row in reader:
        if row[8] == 'NE':
            print('Found: {}'.format(row))
            writer.writerow(row)
The changes are as follows:
Instead of iterating over the input file's lines, we iterate over the rows parsed by the reader (each of which is a list of each of the values in the row).
We check to see if the 9th item in the row (i.e. row[8]) is equal to "NE".
If so, we output that row to the output file by passing it in, as-is, to the writer's writerow method.
I also fixed a typo in your print statement - the format method uses braces (not square brackets) to mark replacement locations.
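The difference between searching the raw line and comparing one parsed field can be seen in a small sketch. The two sample lines are shortened, made-up variants of the line in the question, and the helper name rows_in_state is hypothetical:

```python
import csv

def rows_in_state(lines, state, column=8):
    """Return the parsed rows whose field at `column` equals `state` exactly."""
    return [row for row in csv.reader(lines)
            if len(row) > column and row[column] == state]

# Invented sample data: only the first line has NE as its 9th item.
sample = [
    "1,JOHN SMITH,SMITH FARMS,A,N,N,12345 123 AVE,CITY,NE,68355,US",
    "2,JANE DOE,DOE FARMS,A,N,N,77 NEARY ROAD,CITY,IA,50001,US",
]

substring_hits = [line for line in sample if "NE" in line]  # raw substring search
field_hits = rows_in_state(sample, "NE")                    # 9th item only

print(len(substring_hits), len(field_hits))
```

The substring search also picks up the second line because of "NEARY ROAD", while the field comparison keeps only the row whose state is NE.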
This snippet should solve your problem:
import csv
# Python 3: open CSV files in text mode with newline='' (Python 2 used 'rb'/'wb')
with open('Sample.csv', newline='') as f, open('NE_Sample.csv', 'w', newline='') as outf:
    reader = csv.reader(f, delimiter=',')
    writer = csv.writer(outf)
    for row in reader:
        if "NE" in row:
            print('Found: {}'.format(row))
            writer.writerow(row)
if "NE" in line in your code checks whether "NE" is a substring of the string line, which does not work as intended, because the lines are the raw lines of your input file.
If you use if "NE" in row: where row is a parsed line of your input file, you are doing exact element matching. Note that this matches "NE" in any column; use row[8] == 'NE' to restrict the test to the 9th item.

Trying to copy column1 from a csv file to another empty file using python

I'm looking for a way using python to copy the first column from a csv into an empty file. I'm trying to learn python so any help would be great!
So if this is test.csv
A 32
D 21
C 2
B 20
I want this output
A
D
C
B
I've tried the following commands in python but the output file is empty
f= open("test.csv",'r')
import csv
reader = csv.reader(f,delimiter="\t")
names=""
for each_line in reader:
    names=each_line[0]
First, you want to open your files. A good practice is to use the with statement (that, technically speaking, introduces a context manager) so that when your code exits from the with block all the files are automatically closed
with open('test.csv') as inpfile, open('out.csv', 'w') as outfile:
next you want a loop on the lines of the input file (note the indentation, we are inside the with block), line splitting is automatic when you read a text file with lines separated by newlines…
    for line in inpfile:
each line is a string, but you think of it as two fields separated by white space — this situation is so common that strings have a method to deal with this situation (note again the increasing indent, we are in the for loop block)
        fields = line.split()
by default .split() splits on white space, but you can use, e.g., split(',') to split on commas, etc — that said, fields is a list of strings, for your first record it is equal to ['A', '32'] and you want to output just the first field in this list… for this purpose a file object has the .write() method, that writes a string, just a string, to the file, and fields[0] IS a string, but we have to add a newline character to it because, in this respect, .write() is different from print().
        outfile.write(fields[0]+'\n')
That's all, but if you omit my comments it's 4 lines of code
with open('test.csv') as inpfile, open('out.csv', 'w') as outfile:
    for line in inpfile:
        fields = line.split()
        outfile.write(fields[0]+'\n')
When you are done with learning (some) Python, ask for an explanation of this...
with open('test.csv') as ifl, open('out.csv', 'w') as ofl:
    ofl.write('\n'.join(line.split()[0] for line in ifl))
Addendum
The csv module in such a simple case adds the additional conveniences of
auto-splitting each line into a list of strings
taking care of the details of output (newlines, etc)
and when learning Python it's more fruitful to see how these steps can be done using the bare language, or at least that is my opinion…
The situation is different when your data file is complex, has headers, has quoted strings possibly containing quoted delimiters, and so on; in those cases the use of csv is recommended, as it takes into account all the gory details. For complex data analysis requirements you will need other packages, not included in the standard library, e.g., numpy and pandas, but that is another story.
This answer reads the CSV file, treating a space character as the column separator. You have to add header=None, otherwise the first row will be taken as the header (the column names).
ss is a slice: the 0th column, taking all rows as denoted by :
The last line writes the slice to a new file.
import pandas as pd
df = pd.read_csv('test.csv', sep=' ', header=None)
ss = df.iloc[:, 0]  # .ix was removed in pandas 1.0; .iloc selects by position
ss.to_csv('new_path.csv', sep=' ', index=False, header=False)
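import csv
# Python 3: text mode with newline='' instead of 'rb'/'wb'
with open("test.csv", newline='') as infile, open("output.csv", "w", newline='') as outfile:
    reader = csv.reader(infile, delimiter='\t')
    writer = csv.writer(outfile)
    for e in reader:
        writer.writerow([e[0]])  # wrap in a list; a bare string would be split into characters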
One option is to create an empty list, append the column values to it, and then write that list to another CSV, for example:
import csv
def writetocsv(l):
    # copy the values into a new list
    b = list(l)
    print(b)
    with open("newfile.csv", 'w', newline='') as f:
        w = csv.writer(f, delimiter=',')
        for value in b:
            w.writerow([value])
adcb_list = []
with open("test.csv", 'r', newline='') as f:
    reader = csv.reader(f, delimiter="\t")
    for each_line in reader:
        adcb_list.append(each_line[0])  # keep only the first column
writetocsv(adcb_list)
hope this works for you :-)

Using CSV module to append multiple files while removing appended headers

I would like to use the Python CSV module to open a CSV file for appending. Then, from a list of CSV files, I would like to read each csv file and write it to the appended CSV file. My script works great - except that I cannot find a way to remove the headers from all but the first CSV file being read. I am certain that my else block of code is not executing properly. Perhaps my syntax for my if else code is the problem? Any thoughts would be appreciated.
writeFile = open(append_file,'a+b')
writer = csv.writer(writeFile,dialect='excel')
for files in lstFiles:
    readFile = open(input_file,'rU')
    reader = csv.reader(readFile,dialect='excel')
    for i in range(0,len(lstFiles)):
        if i == 0:
            oldHeader = readFile.readline()
            newHeader = writeFile.write(oldHeader)
            for row in reader:
                writer.writerow(row)
        else:
            reader.next()
            for row in reader:
                row = readFile.readlines()
                writer.writerow(row)
    readFile.close()
writeFile.close()
You're effectively iterating over lstFiles twice. For each file in your list, you're running your inner for loop up from 0. You want something like:
import csv
# Python 3: text mode with newline='' and next(reader) instead of reader.next()
writeFile = open(append_file, 'a', newline='')
writer = csv.writer(writeFile, dialect='excel')
headers_needed = True
for input_file in lstFiles:
    readFile = open(input_file, newline='')
    reader = csv.reader(readFile, dialect='excel')
    oldHeader = next(reader)
    if headers_needed:
        writer.writerow(oldHeader)
        headers_needed = False
    for row in reader:
        writer.writerow(row)
    readFile.close()
writeFile.close()
You could also use enumerate over the lstFiles to iterate over tuples containing the iteration count and the filename, but I think the boolean shows the logic more clearly.
You probably do not want to mix iterating over the csv reader and directly calling readline on the underlying file.
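As a self-contained illustration of the boolean-flag idea, here is a sketch written for Python 3. The function name append_csv_files and its parameters are hypothetical, and skipping empty input files is an assumption of this sketch rather than something the question asks for:

```python
import csv

def append_csv_files(append_file, input_files):
    """Append the rows of every input file, writing a header row only once."""
    headers_needed = True
    with open(append_file, 'a', newline='') as write_file:
        writer = csv.writer(write_file, dialect='excel')
        for input_file in input_files:
            with open(input_file, newline='') as read_file:
                reader = csv.reader(read_file, dialect='excel')
                header = next(reader, None)   # None if the file is empty
                if header is None:
                    continue                  # assumption: skip empty files
                if headers_needed:
                    writer.writerow(header)
                    headers_needed = False
                writer.writerows(reader)      # remaining rows, header excluded
```

Because every row goes through the csv writer, an input file that lacks a final line break does not run into the first row of the next file.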
I think you're iterating too many times (over various things: both your list of files and the files themselves). You've definitely got some consistency problems; it's a little hard to be sure since we can't see your variable initializations. This is what I think you want:
# Python 3: text mode with newline='' so line endings are copied unchanged
with open(append_file, 'a', newline='') as writeFile:
    need_headers = True
    for input_file in lstFiles:
        with open(input_file, newline='') as readFile:
            headers = readFile.readline()
            if need_headers:
                # Write the headers only if we need them
                writeFile.write(headers)
                need_headers = False
            # Now write the rest of the input file.
            for line in readFile:
                writeFile.write(line)
I took out all the csv-specific stuff since there's no reason to use it for this operation. I also cleaned the code up considerably to make it easier to follow, using the files as context managers and a well-named boolean instead of the "magic" i == 0 check. The result is a much nicer block of code that (hopefully) won't have you jumping through hoops to understand what's going on.
