I have a file.dat which looks like:
id | user_id | venue_id | latitude | longitude | created_at
---------+---------+----------+-----------+-----------+-----------------
984301 |2041916 |5222 | | |2012-04-21 17:39:01
984222 |15824 |5222 |38.8951118 |-77.0363658|2012-04-21 17:43:47
984315 |1764391 |5222 | | |2012-04-21 17:37:18
984234 |44652 |5222 |33.800745 |-84.41052 | 2012-04-21 17:43:43
I need to get a CSV file with the rows that have empty latitude and longitude removed, like:
id,user_id,venue_id,latitude,longitude,created_at
984222,15824,5222,38.8951118,-77.0363658,2012-04-21T17:43:47
984234,44652,5222,33.800745,-84.41052,2012-04-21T17:43:43
984291,105054,5222,45.5234515,-122.6762071,2012-04-21T17:39:22
I tried to do that using the following code:
import csv

with open('file.dat', 'r') as input_file:
    lines = input_file.readlines()
    newLines = []
    for line in lines:
        newLine = line.strip('|').split()
        newLines.append(newLine)

with open('file.csv', 'w') as output_file:
    file_writer = csv.writer(output_file)
    file_writer.writerows(newLines)
But I still get a CSV file with "|" symbols in it, and the empty latitude/longitude rows are not removed.
Where is the mistake?
In general I need to use the resulting CSV file in a DataFrame, so maybe there is some way to reduce the number of steps.
str.strip() only removes leading and trailing characters from a string, so strip('|') leaves the "|" characters in the middle of each line untouched.
You want to split the lines on "|" and then strip each element of the resulting list:
import csv

with open('file.dat') as dat_file, open('file.csv', 'w') as csv_file:
    csv_writer = csv.writer(csv_file)
    for line in dat_file:
        row = [field.strip() for field in line.split('|')]
        if len(row) == 6 and row[3] and row[4]:
            csv_writer.writerow(row)
Use this:
import pandas as pd

data = pd.read_csv('file.dat', sep='|', header=0, skipinitialspace=True)
data.dropna(inplace=True)
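If you also need the cleaned data written out as a CSV file (which the question asks for), you can then save the DataFrame, for example:
data.to_csv('file.csv', index=False)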
I used standard pandas features without pre-processing the data. I got the idea from one of the previous answers and improved it. If the data headers contain spaces (which is often the case in CSV files), we should define the column names ourselves and skip line 1 with the original headers.
After that, we can remove NaN values only by the specific columns:
import pandas as pd

data = pd.read_csv("checkins.dat", sep='|', header=None, skiprows=1,
                   low_memory=False, skipinitialspace=True,
                   names=['id', 'user_id', 'venue_id', 'latitude', 'longitude', 'created_at'])
data.dropna(subset=['latitude', 'longitude'], inplace=True)
Using split() without parameters splits on whitespace.
For example, "test1 test2".split() results in ["test1", "test2"].
Instead, try this:
newLine = line.split("|")
Maybe it's better to use map() instead of a list comprehension, as it can be a bit faster. Writing the CSV file is also easy with the csv module (in Python 3, wrap map() in list() so that len() and indexing work):
import csv

with open('file.dat', 'r') as fin:
    with open('file.csv', 'w') as fout:
        csv_writer = csv.writer(fout)
        for line in fin:
            newline = list(map(str.strip, line.split('|')))
            if len(newline) == 6 and newline[3] and newline[4]:
                csv_writer.writerow(newline)
with open("filename.dat") as f:
with open("filename.csv", "w") as f1:
for line in f:
f1.write(line)
This can be used to convert a .dat file to .csv file
Combining the previous answers, I wrote my code for Python 2.7:
import csv

lat_index = 3
lon_index = 4
fields_num = 6
csv_counter = 0

with open("checkins.dat") as dat_file:
    with open("checkins.csv", "w") as csv_file:
        csv_writer = csv.writer(csv_file)
        for dat_line in dat_file:
            new_line = map(str.strip, dat_line.split('|'))
            if len(new_line) == fields_num and new_line[lat_index] and new_line[lon_index]:
                csv_writer.writerow(new_line)
                csv_counter += 1

print("Done. Total rows written: {:,}".format(csv_counter))
This has worked for me:
data = pd.read_csv('file.dat', sep='::', names=list_for_names_of_columns)
I have several CSV files that look like this:
Input
Name Code
blackberry 1
wineberry 2
rasberry 1
blueberry 1
mulberry 2
I would like to add a new column to all CSV files so that it would look like this:
Output
Name Code Berry
blackberry 1 blackberry
wineberry 2 wineberry
rasberry 1 rasberry
blueberry 1 blueberry
mulberry 2 mulberry
The script I have so far is this:
import csv
with open(input.csv,'r') as csvinput:
    with open(output.csv, 'w') as csvoutput:
        writer = csv.writer(csvoutput)
        for row in csv.reader(csvinput):
            writer.writerow(row+['Berry'])
(Python 3.2)
But in the output, the script leaves a blank line after every row, and the new column contains only the word Berry:
Output
Name Code Berry
blackberry 1 Berry
wineberry 2 Berry
rasberry 1 Berry
blueberry 1 Berry
mulberry 2 Berry
This should give you an idea of what to do:
>>> v = open('C:/test/test.csv')
>>> r = csv.reader(v)
>>> row0 = r.next()
>>> row0.append('berry')
>>> print row0
['Name', 'Code', 'berry']
>>> for item in r:
... item.append(item[0])
... print item
...
['blackberry', '1', 'blackberry']
['wineberry', '2', 'wineberry']
['rasberry', '1', 'rasberry']
['blueberry', '1', 'blueberry']
['mulberry', '2', 'mulberry']
>>>
Edit: note that in Python 3 you must use next(r).
Thanks for accepting the answer. Here you have a bonus (your working script):
import csv

with open('C:/test/test.csv','r') as csvinput:
    with open('C:/test/output.csv', 'w') as csvoutput:
        writer = csv.writer(csvoutput, lineterminator='\n')
        reader = csv.reader(csvinput)

        all = []
        row = next(reader)
        row.append('Berry')
        all.append(row)

        for row in reader:
            row.append(row[0])
            all.append(row)

        writer.writerows(all)
Please note the lineterminator parameter in csv.writer. By default it is set to '\r\n', and this is why you get the double spacing.
Also note the use of a list to collect all the lines and write them in one shot with writerows. If your file is very, very big this is probably not a good idea (RAM), but for normal files I think it is faster because there is less I/O.
As indicated in the comments to this post, instead of nesting the two with statements you can do it on the same line:
with open('C:/test/test.csv','r') as csvinput, open('C:/test/output.csv', 'w') as csvoutput:
I'm surprised no one suggested Pandas. Although using a set of dependencies like Pandas might seem more heavy-handed than is necessary for such an easy task, it produces a very short script and Pandas is a great library for doing all sorts of CSV (and really all data types) data manipulation. Can't argue with 4 lines of code:
import pandas as pd
csv_input = pd.read_csv('input.csv')
csv_input['Berries'] = csv_input['Name']
csv_input.to_csv('output.csv', index=False)
Check out Pandas Website for more information!
Contents of output.csv:
Name,Code,Berries
blackberry,1,blackberry
wineberry,2,wineberry
rasberry,1,rasberry
blueberry,1,blueberry
mulberry,2,mulberry
import csv

with open('input.csv','r') as csvinput:
    with open('output.csv', 'w') as csvoutput:
        writer = csv.writer(csvoutput)
        for row in csv.reader(csvinput):
            if row[0] == "Name":
                writer.writerow(row+["Berry"])
            else:
                writer.writerow(row+[row[0]])
Maybe something like that is what you intended?
Also, CSV stands for comma-separated values, so you really need commas to separate your values, like this:
Name,Code
blackberry,1
wineberry,2
rasberry,1
blueberry,1
mulberry,2
I used pandas and it worked well. While I was using it, I had to open a file, add some random columns to it, and then save it back to the same file.
This code adds multiple column entries; you may edit it as much as you need.
import pandas as pd
csv_input = pd.read_csv('testcase.csv') #reading my csv file
csv_input['Phone1'] = csv_input['Name'] #this would also copy the cell value
csv_input['Phone2'] = csv_input['Name']
csv_input['Phone3'] = csv_input['Name']
csv_input['Phone4'] = csv_input['Name']
csv_input['Phone5'] = csv_input['Name']
csv_input['Country'] = csv_input['Name']
csv_input['Website'] = csv_input['Name']
csv_input.to_csv('testcase.csv', index=False) #this writes back to your file
If you don't want the cell values to be copied, first create an empty column in your CSV file manually, for example one named Hours, and then add this line to the code above:
csv_input['New Value'] = csv_input['Hours']
Or, more simply, without adding the manual column:
csv_input['New Value'] = ''  # simple and easy
I hope it helps.
Yes, it's an old question, but it might help someone.
import csv
import uuid

# read and write csv files
with open('in_file','r') as r_csvfile:
    with open('out_file','w',newline='') as w_csvfile:
        dict_reader = csv.DictReader(r_csvfile, delimiter='|')

        # add a new column to the existing ones
        fieldnames = dict_reader.fieldnames + ['ADDITIONAL_COLUMN']
        writer_csv = csv.DictWriter(w_csvfile, fieldnames, delimiter='|')
        writer_csv.writeheader()

        for row in dict_reader:
            row['ADDITIONAL_COLUMN'] = str(uuid.uuid4().int >> 64)[0:6]
            writer_csv.writerow(row)
I don't see where you're adding the new column, but try this:
import csv

i = 0
Berry = open("newcolumn.csv","r").readlines()
with open('input.csv','r') as csvinput:
    with open('output.csv', 'w') as csvoutput:
        writer = csv.writer(csvoutput)
        for row in csv.reader(csvinput):
            writer.writerow(row + [Berry[i].strip()])
            i += 1
This code should do what you asked for; I have tested it on the sample data.
import csv

with open(in_path, 'r') as f_in, open(out_path, 'w') as f_out:
    csv_reader = csv.reader(f_in, delimiter=';')
    writer = csv.writer(f_out)
    for row in csv_reader:
        writer.writerow(row + [row[0]])
In case of a large file, you can use pandas.read_csv with the chunksize argument, which lets you read the dataset chunk by chunk:
import pandas as pd

INPUT_CSV = "input.csv"
OUTPUT_CSV = "output.csv"
CHUNKSIZE = 1_000  # Maximum number of rows in memory

header = True
mode = "w"
for chunk_df in pd.read_csv(INPUT_CSV, chunksize=CHUNKSIZE):
    chunk_df["Berry"] = chunk_df["Name"]
    # You can apply any other transformation to the chunk
    # ...
    chunk_df.to_csv(OUTPUT_CSV, header=header, mode=mode)
    header = False  # Do not write the header for the other chunks
    mode = "a"  # 'a' stands for append mode; all the other chunks will be appended
If you want to update the file in place, you can write to a temporary file and replace the original with it at the end:
import os
import pandas as pd

INPUT_CSV = "input.csv"
TMP_CSV = "tmp.csv"
CHUNKSIZE = 1_000  # Maximum number of rows in memory

header = True
mode = "w"
for chunk_df in pd.read_csv(INPUT_CSV, chunksize=CHUNKSIZE):
    chunk_df["Berry"] = chunk_df["Name"]
    # You can apply any other transformation to the chunk
    # ...
    chunk_df.to_csv(TMP_CSV, header=header, mode=mode)
    header = False  # Do not write the header for the other chunks
    mode = "a"  # 'a' stands for append mode; all the other chunks will be appended

os.replace(TMP_CSV, INPUT_CSV)
For adding a new column to an existing CSV file (with headers), if the column to be added has a small enough number of values, here is a convenient function (somewhat similar to @joaquin's solution). The function takes the:
Existing CSV filename
Output CSV filename (which will have the updated content) and
List with the header name and column values
import csv

def add_col_to_csv(csvfile, fileout, new_list):
    with open(csvfile, 'r') as read_f, \
         open(fileout, 'w', newline='') as write_f:
        csv_reader = csv.reader(read_f)
        csv_writer = csv.writer(write_f)
        i = 0
        for row in csv_reader:
            row.append(new_list[i])
            csv_writer.writerow(row)
            i += 1
Example:
new_list1 = ['test_hdr',4,4,5,5,9,9,9]
add_col_to_csv('exists.csv','new-output.csv',new_list1)
To append a new column to an existing CSV file using Python, without a header name:
import csv

default_text = 'Some Text'
# Open the input file in read mode and the output file in write mode
with open('problem-one-answer.csv', 'r') as read_obj, \
     open('output_1.csv', 'w', newline='') as write_obj:
    # Create a csv.reader object from the input file object
    csv_reader = csv.reader(read_obj)
    # Create a csv.writer object from the output file object
    csv_writer = csv.writer(write_obj)
    # Read each row of the input csv file as a list
    for row in csv_reader:
        # Append the default text to the row / list
        row.append(default_text)
        # Add the updated row / list to the output file
        csv_writer.writerow(row)
Thank you.
You may just write:
import pandas as pd
import csv
df = pd.read_csv('csv_name.csv')
df['Berry'] = df['Name']
df.to_csv("csv_name.csv",index=False)
Then you are done. To check it, you may run:
h = pd.read_csv('csv_name.csv')
print(h)
If you want to add a column with some arbitrary new elements (a, b, c), you may replace the 4th line of the code by the following (the list must have one element per row of the CSV):
df['Berry'] = ['a','b','c']
I have the datafile below and I want to delete every line that contains the number "30" in the first column. This number always has this position.
What I have thought of is to read the file, build a list from the first column, check whether "30" appears in each item of that list, and then delete the entire line given the index.
However, I am not sure how to proceed.
Please let me know your thoughts.
Datafile
Here is what I have tried up to this point:
f = open("file.txt","r")
lines = f.readlines()
f.close()
f = open("file.txt","w")
for line in lines:
if line!="30"+"\n":
f.write(line)
f.close()
f = open("file.txt", "r")
lines = f.readlines()
f.close()
f = open("file.txt", "w")
for line in lines:
if '30' not in line[4:6]:
f.write(line)
f.close()
Try this
If you're willing to use pandas, you could do it in three lines:
import pandas as pd
# Read in file
df = pd.read_csv("file.txt", header=None, delim_whitespace=True)
# Remove rows where the first column contains '30'
# (cast to str in case the column was parsed as numeric)
df = df[~df[0].astype(str).str.contains('30')]
# Save the result
df.to_csv("cleaned.txt", sep='\t', index=False, header=False)
This approach can easily be extended to perform other types of filtering or manipulating your data.
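For instance (a hypothetical extension, assuming the file also has a numeric third column), several filters can be combined into one expression:
import pandas as pd

df = pd.read_csv("file.txt", header=None, delim_whitespace=True)
# Drop rows whose first column contains '30' or whose third column is not positive
df = df[~df[0].astype(str).str.contains('30') & (df[2] > 0)]
df.to_csv("cleaned.txt", sep='\t', index=False, header=False)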
One way you can do this is with a regular expression that captures "30" within the leading number, and then write back only the lines that do not match:
import re

f = open("file.txt", "r")
lines = f.readlines()
f.close()

f = open("file.txt", "w")
for line in lines:
    if not re.search(r'^\d*30', line):
        f.write(line)
f.close()
Hope it works well.
I am trying to read a file with the data below:
Et1, Arista2, Ethernet1
Et2, Arista2, Ethernet2
Ma1, Arista2, Management1
I need to read the file and replace Et with Ethernet and Ma with Management; the digit at the end should stay the same. The actual output should be as follows:
Ethernet1, Arista2, Ethernet1
Ethernet2, Arista2, Ethernet2
Management1, Arista2, Management1
I tried some code with regular expressions. I am able to get to the point where I can parse all of Et1, Et2 and Ma1, but I am unable to replace them.
import re

with open('test.txt','r') as fin:
    for line in fin:
        data = re.findall(r'\A[A-Z][a-z]\Z\d[0-9]*', line)
        print(data)
The output looks like this:
['Et1']
['Et2']
['Ma1']
import re

# compile once, to avoid compiling in each iteration
re_et = re.compile(r'^Et(\d+),')
re_ma = re.compile(r'^Ma(\d+),')

with open('test.txt') as fin:
    for line in fin:
        data = re_et.sub(r'Ethernet\g<1>,', line.strip())
        data = re_ma.sub(r'Management\g<1>,', data)
        print(data)
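If you also want to save the converted lines rather than just print them (an assumption; the output file name test_out.txt below is made up), a minimal variant writes them to a new file:
import re

re_et = re.compile(r'^Et(\d+),')
re_ma = re.compile(r'^Ma(\d+),')

with open('test.txt') as fin, open('test_out.txt', 'w') as fout:
    for line in fin:
        data = re_et.sub(r'Ethernet\g<1>,', line.strip())
        data = re_ma.sub(r'Management\g<1>,', data)
        fout.write(data + '\n')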
This example follows Joseph Farah's suggestion
import csv

file_name = 'data.csv'
output_file_name = "corrected_data.csv"

data = []
with open(file_name, "rb") as csvfile:
    reader = csv.reader(csvfile, delimiter=',')
    for row in reader:
        data.append(row)

corrected_data = []
for row in data:
    tmp_row = []
    for col in row:
        if 'Et' in col and not "Ethernet" in col:
            col = col.replace("Et", "Ethernet")
        elif 'Ma' in col and not "Management" in col:
            col = col.replace("Ma", "Management")
        tmp_row.append(col)
    corrected_data.append(tmp_row)

with open(output_file_name, "wb") as csvfile:
    writer = csv.writer(csvfile, delimiter=',')
    for row in corrected_data:
        writer.writerow(row)

print data
Here are the steps you should take:
Read each line in the file.
Separate each line into smaller list items using the commas as delimiters.
Use str.replace() to replace the characters with the words you want; keep in mind that anything containing "Et" (including the "Et" at the start of "Ethernet") would be replaced, so remember to account for that. The same goes for "Ma" and "Management".
Roll it back into one big list and put it back in the file with file.write(). You may have to overwrite the original file.
I have a CSV file rsvp1.csv:
_id event_id comments
1 | x | hello..
2 | y | bye
3 | y | hey
4 | z | hi
My question is:
For each event e, how can I get the comments written to a separate text file?
There is some error in the following code:
import csv

with open('rsvps1.csv','rU') as f:
    reader = csv.DictReader(f, delimiter=',')
    rows = list(reader)

fi = open('rsvp.txt','wb')
k = 0
for row in rows:
    if k == row['event_id']:
        fi.write(row['comment']+"\n")
    else:
        fi.write(row['event_id']+"\t")
        fi.write(row['comment']+"\n")
        k = row['event_id']
f.close()
fi.close()
I think it is best to just forget you're working with a CSV file and think of it as a normal file, in which you can do the following:
with open('file.csv', 'r') as f:
    lines = f.readlines()
    for line in lines:
        if not line.startswith('_id'):
            line_values = line.split(',')
            with open('%s.txt' % line_values[1], 'a') as fp:
                fp.write(line_values[2] + '\n')
Splitting the csv File
Given a file rsvps1.csv with this content:
_id,event_id,comments
1,x,hello
2,y,bye
3,y,hey
4,z,hi
This:
import csv
import itertools as it
from operator import itemgetter

with open('rsvps1.csv') as fin:
    fieldnames = next(csv.reader(fin))
    fin.seek(0)
    rows = list(csv.DictReader(fin))

# itertools.groupby expects the rows to already be grouped (e.g. sorted) by event_id
for event_id, event in it.groupby(rows, key=itemgetter('event_id')):
    with open('event_{}.txt'.format(event_id), 'w') as fout:
        csv_out = csv.DictWriter(fout, fieldnames)
        csv_out.writeheader()
        csv_out.writerows(event)
splits it into three files:
event_x.txt
_id,event_id,comments
1,x,hello
event_y.txt
_id,event_id,comments
2,y,bye
3,y,hey
and event_z.txt
_id,event_id,comments
4,z,hi
Adapt the output to your needs.
Only Comments
If you do not want a csv as output, this becomes simpler:
import csv
import itertools as it
from operator import itemgetter

with open('rsvps1.csv') as fin:
    rows = list(csv.DictReader(fin))

for event_id, event in it.groupby(rows, key=itemgetter('event_id')):
    with open('event_{}_comments.txt'.format(event_id), 'w') as fout:
        for item in event:
            fout.write('{}\n'.format(item['comments']))
Now event_y_comments.txt has this content:
bye
hey
I would suggest that you use pandas as your import tool. It creates a clear data structure from your CSV file, similar to a spreadsheet in MS Excel. You can then use iterrows to loop over your event_ids and process the comments.
import pandas as pd

data = pd.read_csv('rsvps1.csv', sep=',')
for index, row in data.iterrows():
    print(row['event_id'], row['comments'])  # Python 3.x
However, I am not sure what you want to write to the file. Just the comments for all event_ids? The complete 'comments' column can be exported to a separate file by
data.to_csv('output.csv', columns=['comments'])
Additional information according to the comment:
When you want to save only the comments that share the same event_id, you first have to select the corresponding rows. This is done by
selected_data = data[data['event_id'] == 'x']
for the event_id 'x'. selected_data now contains a dataframe that only holds rows that have 'x' in the 'event_id'-column. You can then loop through this dataframe as shown above.
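A minimal pandas sketch of the per-event output itself (assuming the rsvps1.csv layout shown above, with event_id and comments columns): groupby yields one sub-DataFrame per event, and each group's comments can then be written to its own text file.
import pandas as pd

data = pd.read_csv('rsvps1.csv')
for event_id, group in data.groupby('event_id'):
    # one file per event, containing only that event's comments
    group['comments'].to_csv('event_{}_comments.txt'.format(event_id),
                             index=False, header=False)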