Suppose I have sample data in an Excel document:
header1      header2    header3
some data    testing    123
moar data    hello!     456
I export this data to CSV format from Excel, via File > Save As > .csv.
This is my data sample.csv:
$ cat sample.csv
header1,header2,header3
some data,testing,123
moar data,hello!,456%
Note that Excel apparently does not add a newline at the end of the file by default -- the % at the end is the shell's marker (zsh prints a % when output does not end in a newline).
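As a quick sanity check (a small aside, not part of the Excel workflow), the final byte of the exported file can be inspected directly:

import os

# print the last byte of the file; b'6' here means there is no trailing newline
with open('sample.csv', 'rb') as f:
    f.seek(-1, os.SEEK_END)
    print(f.read(1))  # b'6'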
Now let's say I want to append a row (or rows) to the CSV file. I can use the csv module to do that:
import csv

def append_to_csv_file(file: str, row: dict, encoding=None) -> None:
    # open file for reading and writing
    with open(file, 'a+', newline='', encoding=encoding) as out_file:
        # retrieve field names (CSV file headers)
        reader = csv.reader(out_file)
        out_file.seek(0)
        field_names = next(reader, None)
        # add new row to the CSV file
        writer = csv.DictWriter(out_file, field_names)
        writer.writerow(row)

row = {'header1': 'new data', 'header2': 'blah', 'header3': 789}
append_to_csv_file('sample.csv', row)
So a newline is now added to the end of the file, but the problem is that the data is appended to the end of the last line rather than starting on a separate line:
$ cat sample.csv
header1,header2,header3
some data,testing,123
moar data,hello!,456new data,blah,789
This causes an issue when I want to read the updated data back from the file (the overflow fields end up under the None restkey that csv.DictReader uses for extras):

with open('sample.csv', newline='') as f:
    print(list(csv.DictReader(f)))
# [{..., 'header3': '456new data', None: ['blah', '789']}]
Question: what is the best way to handle the case where a CSV file might not end with a newline, when appending one or more rows to the file?
Current attempt
This is my workaround for the case where I am appending to a CSV file that may not end with a newline character:
import csv

def append_to_csv_file(file: str, row: dict, encoding=None) -> None:
    with open(file, 'a+', newline='', encoding=encoding) as out_file:
        # get current file position
        pos = out_file.tell()
        print('pos:', pos)
        # seek to one character back
        out_file.seek(pos - 1)
        # read in last character
        c = out_file.read(1)
        print(out_file.tell(), repr(c))
        if c != '\n':
            delta = out_file.write('\n')
            pos += delta
            print('new_pos:', pos)
        # retrieve field names (CSV file headers)
        reader = csv.reader(out_file)
        out_file.seek(0)
        field_names = next(reader, None)
        # add new row to the CSV file
        writer = csv.DictWriter(out_file, field_names)
        # out_file.seek(pos + 1)
        writer.writerow(row)

row = {'header1': 'new data', 'header2': 'blah', 'header3': 789}
append_to_csv_file('sample.csv', row)
This is the output from running the script:
pos: 68
68 '6'
new_pos: 69
The contents of the CSV file now look as expected:
$ cat sample.csv
header1,header2,header3
some data,testing,123
moar data,hello!,456
new data,blah,789
I am wondering if anyone knows of an easier way to do this; I feel like I might be overthinking it a bit. I basically want to account for cases where the CSV file might need a newline added to the end before a new row is appended.
If it helps, I am running this on macOS.
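For what it's worth, one alternative I have considered (a sketch, only tested against the sample above) is to do the newline check in binary mode, where seeking relative to the end of the file is allowed, and keep the text-mode handling for the csv module:

import csv
import os

def append_to_csv_file(file: str, row: dict, encoding=None) -> None:
    # check the last byte in binary mode, where relative seeks are allowed
    with open(file, 'rb') as f:
        f.seek(0, os.SEEK_END)
        if f.tell() > 0:
            f.seek(-1, os.SEEK_END)
            needs_newline = f.read(1) != b'\n'
        else:
            needs_newline = False

    # read the header row to get the field names
    with open(file, newline='', encoding=encoding) as f:
        field_names = next(csv.reader(f), None)

    # append the row, writing the missing trailing newline first if needed
    with open(file, 'a', newline='', encoding=encoding) as f:
        if needs_newline:
            f.write('\n')
        csv.DictWriter(f, field_names).writerow(row)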
Related
I have to add a couple of lists in Python as columns to an existing CSV file. I want to use a temporary file for the output CSV because I want to sort the first 2 columns of the resulting data and then write that to a new, final CSV file. I don't want to keep the unsorted csv file, which is why I am trying to use tempfile.NamedTemporaryFile for that step. It produces nothing in the final CSV file, but there are no other code errors. I changed how the with blocks are indented but was unable to fix it. I tested with a file on disk, which works fine. I need help understanding what I am doing wrong. Here is my code:
# Open the existing csv in read mode and new temporary csv in write mode
with open(csvfile.name, 'r') as read_f, \
        tempfile.NamedTemporaryFile(suffix='.csv', prefix='inter', mode='w', delete=False) as write_f:
    csv_reader = csv.reader(read_f)
    csv_writer = csv.writer(write_f)
    i = 0
    for row in csv_reader:
        # Append the new list values to that row/list
        row.append(company_list[i])
        row.append(highest_percentage[i])
        # Add the updated row / list to the output file
        csv_writer.writerow(row)
        i += 1

with open(write_f.name) as data:
    stuff = csv.reader(data)
    sortedlist = sorted(stuff, key=operator.itemgetter(0, 1))

# now write the sorted result into the final CSV file
with open(fileout, 'w', newline='') as f:
    fileWriter = csv.writer(f)
    for row in sortedlist:
        fileWriter.writerow(row)
You should insert a write_f.seek(0, 0) just before the line that reopens the temporary file:

write_f.seek(0, 0)
with open(write_f.name) as data:

Seeking forces Python to flush its internal write buffer, so the rows are actually on disk when the file is read back by name.
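Equivalently (my own variation, not from the answer above), an explicit flush makes the intent clearer:

write_f.flush()  # push buffered rows to disk before reopening by name
with open(write_f.name) as data:
    stuff = csv.reader(data)
    sortedlist = sorted(stuff, key=operator.itemgetter(0, 1))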
I found out what was causing the IndexError and, consequently, the empty final CSV. I resolved it with the help of this: CSV file written with Python has blank lines between each row. (Opening the file the csv writer uses with newline='' is what prevents those blank lines.) Here's my changed code, which works as desired:
with open(csvfile.name, 'r') as read_f, \
        tempfile.NamedTemporaryFile(suffix='.csv', prefix='inter', newline='', mode='w+', delete=False) as write_f:
    csv_reader = csv.reader(read_f)
    csv_writer = csv.writer(write_f)
    i = 0
    for row in csv_reader:
        # Append the new list values to that row/list
        row.append(company_list[i])
        row.append(highest_percentage[i])
        # Add the updated row / list to the output file
        csv_writer.writerow(row)
        i += 1

with open(write_f.name) as read_stuff, \
        open(fileout, 'w', newline='') as write_stuff:
    read_data = csv.reader(read_stuff)
    write_data = csv.writer(write_stuff)
    sortedlist = sorted(read_data, key=operator.itemgetter(0, 1))
    for row in sortedlist:
        write_data.writerow(row)
I have a CSV file with a column 'Flag' that has 0 and 1 values. My goal is to move all rows with a 0 value to another CSV file. The script will be scheduled to run every hour.
So far I have written the code below:
with open("path/to/my/input/file.csv", "rt", encoding="utf8") as f:
reader = csv.DictReader(f, delimiter=',')
with open("/path/to/my/output/file.csv", "a+", encoding="utf8") as f_out:
writer = csv.DictWriter(f_out, fieldnames=reader.fieldnames, delimiter=",")
writer.writeheader()
for row in reader:
if row['flag'] == '0':
writer.writerow(row)
With @Raghvendra's help below, by adding 'a+' to my code I'm able to add rows to my output.csv file. However, it adds a header row to the output file each time the script runs. Also, how do I prevent adding rows with matching IDs? Would it be possible to replace rows in output.csv where the ID matches an ID in input.csv, instead of appending rows with duplicate IDs?
Would someone be able to help me with this? Thanks in advance!
input file.csv:
id   date        data1   data2    flag
1    2020-03-01  mydata  mydata1  0
2    2020-03-02  mydata  mydata   1
3    2020-03-03  mydata  mydata1  0
Now my problem is preventing records with duplicate IDs from being added to my output.csv. I would need to overwrite records with matching IDs instead, if possible.
In order to match IDs, we cannot avoid reading the output file first.
import csv

data = dict()

# first read the output file in (if one exists already)
try:
    with open("output file.csv", encoding="utf8") as f_out:
        for row in csv.DictReader(f_out):
            data[row['id']] = row
except OSError:
    pass

# now add the new rows from the input file; rows with an existing id are replaced
with open("input file.csv", encoding="utf8") as f:
    reader = csv.DictReader(f)
    for row in reader:
        if row['flag'] == '0':
            data[row['id']] = row

# finally rewrite the output file, writing the header exactly once
with open("output file.csv", "w", newline="", encoding="utf8") as f_out:
    writer = csv.DictWriter(f_out, fieldnames=reader.fieldnames)
    writer.writeheader()
    for row in data:
        writer.writerow(data[row])
To append new rows to the file rather than overwriting the existing values, try using append ('a') mode on the file instead of write ('w'):

with open("/path/to/my/output/file.csv", "a+", encoding="utf8") as f_out:

There is no need to write the t, since it refers to text mode, which is the default.
Documented here:
Character   Meaning
'r'         open for reading (default)
'w'         open for writing, truncating the file first
'x'         open for exclusive creation, failing if the file already exists
'a'         open for writing, appending to the end of the file if it exists
'b'         binary mode
't'         text mode (default)
'+'         open a disk file for updating (reading and writing)
'U'         universal newlines mode (deprecated)
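As a minimal illustration of the difference between 'w' and 'a' (my example, not from the documentation):

# 'w' truncates: after these two writes the file contains only 'second'
with open('demo.txt', 'w') as f:
    f.write('first\n')
with open('demo.txt', 'w') as f:
    f.write('second\n')

# 'a' appends: the file now contains 'second' and 'third'
with open('demo.txt', 'a') as f:
    f.write('third\n')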
The second part of your question is not so clear. Can you elaborate a little more?
I have a CSV file that is being constantly appended. It has multiple headers and the only common thing among the headers is that the first column is always "NAME".
How do I split the single CSV file into separate CSV files, one for each header row?
Here is a sample file:
"NAME","AGE","SEX","WEIGHT","CITY"
"Bob",20,"M",120,"New York"
"Peter",33,"M",220,"Toronto"
"Mary",43,"F",130,"Miami"
"NAME","COUNTRY","SPORT","NUMBER","SPORT","NUMBER"
"Larry","USA","Football",14,"Baseball",22
"Jenny","UK","Rugby",5,"Field Hockey",11
"Jacques","Canada","Hockey",19,"Volleyball",4
"NAME","DRINK","QTY"
"Jesse","Beer",6
"Wendel","Juice",1
"Angela","Milk",3
If the size of the csv file is not huge -- so it can all be in memory at once -- just use read() to read the file into a string and then run a regex over that string:

import re

with open(ur_csv) as f:
    data = f.read()

chunks = re.finditer(r'(^"NAME".*?)(?=^"NAME"|\Z)', data, re.S | re.M)
for i, chunk in enumerate(chunks, 1):
    with open('/path/{}.csv'.format(i), 'w') as fout:
        fout.write(chunk.group(1))

The lookahead (?=^"NAME"|\Z) ends each chunk just before the next header line (or the end of the string) without consuming it, so the next match can begin there.
If the size of the file is a concern, you can use mmap to create something that looks like a big string without being all in memory at the same time. Then run the regex over the mmap to separate the csv chunks. Note that against an mmap the pattern (and the output file) must be bytes, not str:

import mmap
import re

with open(ur_csv, 'rb') as f:
    mf = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    chunks = re.finditer(rb'(^"NAME".*?)(?=^"NAME"|\Z)', mf, re.S | re.M)
    for i, chunk in enumerate(chunks, 1):
        with open('/path/{}.csv'.format(i), 'wb') as fout:
            fout.write(chunk.group(1))
In either case, this will write all the chunks to files named 1.csv, 2.csv, etc.
Copy the input to a new output file each time you see a header line. Something like this (not checked for errors):
partNum = 1
outHandle = None
for line in open("yourfile.csv", "r").readlines():
    if line.startswith('"NAME"'):
        if outHandle is not None:
            outHandle.close()
        outHandle = open("part%d.csv" % (partNum,), "w")
        partNum += 1
    outHandle.write(line)
outHandle.close()
The above will break if the input does not begin with a header line or if the input is empty.
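A slightly hardened variant of the same idea (my sketch) tolerates an empty input or stray lines before the first header:

partNum = 0
outHandle = None
with open("yourfile.csv") as inHandle:
    for line in inHandle:
        if line.startswith('"NAME"'):
            if outHandle is not None:
                outHandle.close()
            partNum += 1
            outHandle = open("part%d.csv" % partNum, "w")
        if outHandle is not None:  # ignore anything before the first header
            outHandle.write(line)
if outHandle is not None:
    outHandle.close()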
You can use the python csv package to read your source file and write multiple csv files based on the rule that if element 0 in a row == "NAME", spawn off a new file. Something like this...
import csv

outfile_name = "out_%d.csv"
out_num = 1

with open('nameslist.csv', newline='') as csvfile:
    csvreader = csv.reader(csvfile, delimiter=',')
    csv_buffer = []
    for row in csvreader:
        if row[0] != "NAME":
            csv_buffer.append(row)
        else:
            # a new header row: write out the chunk buffered so far
            if csv_buffer:
                with open(outfile_name % out_num, 'w', newline='') as csvout:
                    csv.writer(csvout).writerows(csv_buffer)
                out_num += 1
            csv_buffer = [row]
    # write out the final buffered chunk
    if csv_buffer:
        with open(outfile_name % out_num, 'w', newline='') as csvout:
            csv.writer(csvout).writerows(csv_buffer)
P.S. I haven't actually tested this but that's the general concept
Given the other answers, the only modification I would suggest is to read with csv.DictReader. Pseudo-code would be like this, assuming that the first line in the file is the first header.
Note that this assumes there is no blank line or other indicator between the entries, so that a 'NAME' header occurs right after data. If there were a blank line between appended files, you could use that as the indicator to rebuild the field names from the next row. If you need to handle the inputs as lists, then the previous answers are better.
ifile = open(filename, newline='')
infile = csv.DictReader(ifile)
infields = infile.fieldnames
filenum = 1
ofile = open('outfile' + str(filenum), 'w', newline='')
outfields = infields  # This allows you to change the header field
outfile = csv.DictWriter(ofile, fieldnames=outfields, extrasaction='ignore')
outfile.writerow(dict((fn, fn) for fn in outfields))
for row in infile:
    if row['NAME'] != 'NAME':
        # process this row here and do whatever is needed
        pass
    else:
        ofile.close()
        # build infields again from this row
        infields = [row["NAME"], ...]  # This assumes you know the names & order
        # A dict cannot be pulled as a list and keep the order that you want.
        filenum += 1
        ofile = open('outfile' + str(filenum), 'w', newline='')
        outfields = infields  # This allows you to change the header field
        outfile = csv.DictWriter(ofile, fieldnames=outfields, extrasaction='ignore')
        outfile.writerow(dict((fn, fn) for fn in outfields))
# This is the end of the loop. All data has been read and processed
ofile.close()
ifile.close()
If the exact order of the new header does not matter, except for NAME in the first entry, then you can build the new field list as follows:

infields = [row['NAME']]
for k in row.keys():
    if k != 'NAME':
        infields.append(row[k])

This will create the new header with NAME in entry 0, but the others will not be in any particular order.
I'm new to Python and I am struggling with this code. I have 2 files: the 1st is a text file containing email addresses (one per line), the 2nd is a csv file with 5-6 columns. The script should take the search input from file1, search file2, and store the output in another csv file (only the first 3 columns); see the example below. I have also copied the script I was working on. If there is a better/more efficient script then please let me know. Thank you, I appreciate your help.
File1 (output.txt)

rrr@company.com
eee@company.com
ccc@company.com

File2 (final.csv)

Sam,Smith,sss@company.com,admin
Eric,Smith,eee@company.com,finance
Joe,Doe,jjj@company.com,telcom
Chase,Li,ccc@company.com,IT

output (out_name_email.csv)

Eric,Smith,eee@company.com
Chase,Li,ccc@company.com
Here is the script:

import csv

outputfile = 'C:\\Python27\\scripts\\out_name_email.csv'
inputfile = 'C:\\Python27\\scripts\\output.txt'
datafile = 'C:\\Python27\\scripts\\final.csv'

names = []
with open(inputfile) as f:
    for line in f:
        names.append(line)

with open(datafile, 'rb') as fd, open(outputfile, 'wb') as fp_out1:
    writer = csv.writer(fp_out1, delimiter=",")
    reader = csv.reader(fd, delimiter=",")
    headers = next(reader)
    for row in fd:
        for name in names:
            if name in line:
                writer.writerow(row)
Load the emails into a set for O(1) lookup:
with open(inputfile) as fin:
    emails = set(line.strip() for line in fin)
Then loop over the rows once and check whether each row's email exists in emails - no need to loop over each possible match for each row (note the email address is the third column, i.e. index 2):

# ...
for row in reader:
    if row[2] in emails:
        writer.writerow(row)
If you're not doing anything else, then you can make it:
writer.writerows(row for row in reader if row[2] in emails)
A couple of notes: in your original code you're not using the csv.reader object reader (you're looping over fd), and you appear to have some naming issues with names, line and row...
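Putting it together, here is a sketch of the whole script with those fixes applied, also trimming each row to the first three columns as the desired output shows (the next(reader) call assumes final.csv has a header row, as your original script did):

import csv

outputfile = 'C:\\Python27\\scripts\\out_name_email.csv'
inputfile = 'C:\\Python27\\scripts\\output.txt'
datafile = 'C:\\Python27\\scripts\\final.csv'

# load the emails into a set for O(1) membership tests
with open(inputfile) as fin:
    emails = set(line.strip() for line in fin)

with open(datafile, newline='') as fd, open(outputfile, 'w', newline='') as fp_out1:
    reader = csv.reader(fd)
    writer = csv.writer(fp_out1)
    next(reader)  # skip the header row, as the original script did
    writer.writerows(row[:3] for row in reader if row[2] in emails)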
I have CSV files in which Data is formatted as follows:
file1.csv
ID,NAME
001,Jhon
002,Doe
file2.csv
ID,SCHOOLS_ATTENDED
001,my Nice School
002,His lovely school
file3.csv
ID,SALARY
001,25
002,40
The ID field is a kind of primary key that will be used to fetch records.
What is the most efficient way to read 3 to 4 files and get corresponding data and store in another CSV file having headings (ID,NAME,SCHOOLS_ATTENDED,SALARY)?
The file sizes are in the hundreds of megabytes (100-200 MB).
Hundreds of megabytes aren't that much. Why not go for a simple approach using the csv module and collections.defaultdict:
import csv
from collections import defaultdict

result = defaultdict(dict)
fieldnames = {"ID"}

for csvfile in ("file1.csv", "file2.csv", "file3.csv"):
    with open(csvfile, newline="") as infile:
        reader = csv.DictReader(infile)
        for row in reader:
            id = row.pop("ID")
            for key in row:
                fieldnames.add(key)  # wasteful, but I don't care enough
                result[id][key] = row[key]
The resulting defaultdict looks like this:
>>> result
defaultdict(<type 'dict'>,
{'001': {'SALARY': '25', 'SCHOOLS_ATTENDED': 'my Nice School', 'NAME': 'Jhon'},
'002': {'SALARY': '40', 'SCHOOLS_ATTENDED': 'His lovely school', 'NAME': 'Doe'}})
You could then combine that into a CSV file (not my prettiest work, but good enough for now):
with open("out.csv", "w", newline="") as outfile:
writer = csv.DictWriter(outfile, sorted(fieldnames))
writer.writeheader()
for item in result:
result[item]["ID"] = item
writer.writerow(result[item])
out.csv then contains
ID,NAME,SALARY,SCHOOLS_ATTENDED
001,Jhon,25,my Nice School
002,Doe,40,His lovely school
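One caveat: sorted(fieldnames) only puts ID first in out.csv because it happens to sort alphabetically before the other names. To pin ID to the front explicitly, a small tweak (mine, not part of the answer above) would be:

# put ID first, then the remaining field names alphabetically
ordered_fields = ["ID"] + sorted(fieldnames - {"ID"})
writer = csv.DictWriter(outfile, ordered_fields)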
Following is working code for combining multiple csv files whose names contain a specific keyword into one final csv file. I have set the default keyword to "file", but you can set it to an empty string if you want to combine all csv files from a folder_path. This code takes the header from your first csv file and uses it as the header in the final combined csv file; it ignores the headers of all other csv files.
import csv
import glob

def Combine_multiple_csv_files_thatContainsKeywordInTheirNames_into_one_csv_file(folder_path, keyword='file'):
    # takes the header only from the 1st csv; all other csv headers are skipped
    # and their data rows are appended to the final csv
    fileNames = glob.glob(folder_path + "*" + keyword + "*" + ".csv")  # fileNames includes folder_path too
    with open(folder_path + "Combined_csv.csv", "w", newline='') as fout:
        print('Combining multiple csv files into 1')
        # csv.reader yields one list per row; opening the output file with
        # newline='' stops the writer from introducing doubled newlines
        csv_write_file = csv.writer(fout, delimiter=',')
        with open(fileNames[0], mode='rt') as read_file:
            csv_read_file = csv.reader(read_file, delimiter=',')
            csv_write_file.writerows(csv_read_file)
        for num in range(1, len(fileNames)):
            with open(fileNames[num], mode='rt') as read_file:
                csv_read_file = csv.reader(read_file, delimiter=',')
                next(csv_read_file)  # ignore this file's header row
                csv_write_file.writerows(csv_read_file)
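A hypothetical usage example (the folder path here is made up; the trailing slash matters because the paths are built by string concatenation):

# combine every csv under ./reports/ whose name contains "file"
Combine_multiple_csv_files_thatContainsKeywordInTheirNames_into_one_csv_file('./reports/', keyword='file')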