I want to know if it is possible to use the pandas to_csv() function to add a dataframe to an existing csv file. The csv file has the same structure as the loaded data.
You can specify a python write mode in the pandas to_csv function. For append it is 'a'.
In your case:
df.to_csv('my_csv.csv', mode='a', header=False)
The default mode is 'w'.
If the file initially might be missing, you can make sure the header is printed at the first write using this variation:
import os
output_path = 'my_csv.csv'
df.to_csv(output_path, mode='a', header=not os.path.exists(output_path))
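As a minimal, self-contained sketch of the header-on-first-write pattern above: the helper name `append_with_header` and the temporary path are made up for illustration, and `index=False` is my own assumption so the file can be read back without an extra index column.

```python
import os
import tempfile

import pandas as pd


def append_with_header(df, path):
    """Append df to path, writing the header only if the file does not exist yet."""
    write_header = not os.path.exists(path)
    df.to_csv(path, mode="a", header=write_header, index=False)


# Demo in a temporary directory so no real file is touched.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "my_csv.csv")
    append_with_header(pd.DataFrame({"A": [1], "B": [2]}), demo_path)  # creates the file, writes header
    append_with_header(pd.DataFrame({"A": [3], "B": [4]}), demo_path)  # appends rows only
    result = pd.read_csv(demo_path)
```

After both calls the file should contain one header line and two data rows.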
You can append to a csv by opening the file in append mode:
with open('my_csv.csv', 'a') as f:
    df.to_csv(f, header=False)
If this was your csv, foo.csv:
,A,B,C
0,1,2,3
1,4,5,6
If you read that and then append, for example, df + 6:
In [1]: df = pd.read_csv('foo.csv', index_col=0)
In [2]: df
Out[2]:
   A  B  C
0  1  2  3
1  4  5  6
In [3]: df + 6
Out[3]:
    A   B   C
0   7   8   9
1  10  11  12
In [4]: with open('foo.csv', 'a') as f:
   ...:     (df + 6).to_csv(f, header=False)
foo.csv becomes:
,A,B,C
0,1,2,3
1,4,5,6
0,7,8,9
1,10,11,12
with open(filename, 'a') as f:
    df.to_csv(f, header=f.tell()==0)
Create file unless exists, otherwise append
Add header if file is being created, otherwise skip it
A little helper function I use with some header checking safeguards to handle it all:
def appendDFToCSV_void(df, csvFilePath, sep=","):
    import os
    import pandas as pd
    # No file yet: create it and write the header.
    if not os.path.isfile(csvFilePath):
        df.to_csv(csvFilePath, mode='a', index=False, sep=sep)
        return
    # Read only the first row of the existing file to get its header.
    csv_columns = pd.read_csv(csvFilePath, nrows=1, sep=sep).columns
    if len(df.columns) != len(csv_columns):
        raise Exception("Columns do not match!! Dataframe has " + str(len(df.columns)) + " columns. CSV file has " + str(len(csv_columns)) + " columns.")
    elif not (df.columns == csv_columns).all():
        raise Exception("Columns and column order of dataframe and csv file do not match!!")
    else:
        df.to_csv(csvFilePath, mode='a', index=False, sep=sep, header=False)
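A variation on the same idea, as a sketch only: it reads just the header line with the stdlib `csv` module instead of parsing with pandas. The name `append_checked` and the demo file are hypothetical, and comparing column names as strings is my assumption.

```python
import csv
import os
import tempfile

import pandas as pd


def append_checked(df, csv_path, sep=","):
    """Append df to csv_path after verifying the existing header matches df.columns."""
    if not os.path.isfile(csv_path):
        # First write: create the file with a header.
        df.to_csv(csv_path, index=False, sep=sep)
        return
    # Read only the header row of the existing file.
    with open(csv_path, newline="") as f:
        existing = next(csv.reader(f, delimiter=sep), [])
    expected = [str(c) for c in df.columns]
    if existing != expected:
        raise ValueError(f"Header mismatch: file has {existing}, dataframe has {expected}")
    df.to_csv(csv_path, mode="a", index=False, sep=sep, header=False)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "checked.csv")
    append_checked(pd.DataFrame({"A": [1], "B": [2]}), path)
    append_checked(pd.DataFrame({"A": [3], "B": [4]}), path)
    try:
        # Same columns in a different order should be rejected.
        append_checked(pd.DataFrame({"B": [5], "A": [6]}), path)
        mismatch_rejected = False
    except ValueError:
        mismatch_rejected = True
    checked = pd.read_csv(path)
```

Raising `ValueError` instead of a bare `Exception` makes it easier for callers to catch only this failure.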
I started with PySpark dataframes and got type conversion errors when converting them to pandas dataframes and appending to a csv, because of the schema/column types in the PySpark dataframes.
I solved the problem by forcing all columns in each dataframe to be of type string and then appending to the csv as follows:
with open('testAppend.csv', 'a') as f:
    df2.toPandas().astype(str).to_csv(f, header=False)
This is how I did it in 2021
Let us say I have a csv sales.csv which has the following data in it:
sales.csv:
Order Name,Price,Qty
oil,200,2
butter,180,10
and to add more rows I can load them in a data frame and append it to the csv like this:
import pandas
data = [
['matchstick', '60', '11'],
['cookies', '10', '120']
]
dataframe = pandas.DataFrame(data)
dataframe.to_csv("sales.csv", index=False, mode='a', header=False)
and the output will be:
Order Name,Price,Qty
oil,200,2
butter,180,10
matchstick,60,11
cookies,10,120
A bit late to the party but you can also use a context manager, if you're opening and closing your file multiple times, or logging data, statistics, etc.
from contextlib import contextmanager
import pandas as pd
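@contextmanager
def open_file(path, mode):
    file_to = open(path, mode)
    try:
        yield file_to
    finally:
        # close the file even if the body of the with block raises
        file_to.close()
## later
saved_df = pd.DataFrame(data)
# open in append mode and write to the handle the context manager yields
with open_file('yourcsv.csv', 'a') as outfile:
    saved_df.to_csv(outfile, header=False)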
I have a requirement to write some columns as the first row and the remaining columns as the second row.
I have stored them in one dataframe such as:
columnA columnB columnC columnD
A B C D
to a text file sample.txt:
A,B
C,D
This is the code :
cleaned_data.iloc[:, 0:1].to_csv("report_csv.txt", encoding='utf-8', index=False, header=False, line_terminator='')
cleaned_data.iloc[:,1:].to_csv("report_csv.txt", encoding='utf-8', index=False, header=False, mode='a', line_terminator='')
It should produce the expected sample.txt. However, there is a third line which is empty and I don't want it to exist. I tried line_terminator='', but it does not work for '', although it works for values such as ' ' or 'abc'.
I'm sure there is a better way of producing the sample text file than what I've written. I'm open to other alternatives.
Still, how can I remove the last empty line? I'm using Python 3.8.
I'm not able to reproduce your issue, but it might be the case that your strings in the dataframe contain trailing line breaks. I'm running Pandas 0.23.4 on linux
import pandas
print(pandas.__version__)
I created what I think your dataframe contains using the command
df = pandas.DataFrame({'colA':['A'], 'colB': ['B'], 'colC':['C'], 'colD':['D']})
To check the contents of a cell, you could use df['colA'][0].
The indexing I needed to grab the first and second columns was
df.iloc[:, 0:2]
and the way I got to a CSV did not rely on lineterminator
df.iloc[:, 0:2].to_csv("report_csv.txt", encoding='utf-8', index=False, header=False)
df.iloc[:,2:].to_csv("report_csv.txt", encoding='utf-8', index=False, header=False, mode='a')
When I run
with open('report_csv.txt','r') as file_handle:
    dat = file_handle.read()
I get 'A,B\nC,D\n' from dat.
To get no trailing newline on the last line, use to_string()
with open('output.txt','w') as file_handle:
    file_handle.write(df.iloc[:, 0:2].to_string(header=False,index=False)+"\n")
    file_handle.write(df.iloc[:,2:].to_string(header=False,index=False))
Then we can verify the file is formatted as desired by running
with open('output.txt','r') as file_handle:
    dat = file_handle.read()
The dat contains 'A B\nC D'. If spaces are not an acceptable delimiter, they could be replaced by a , prior to writing to file.
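If a comma delimiter is needed, one possible sketch is to build the text with `to_csv` (which returns a string when no path is given) and strip the final newline before writing. The variable names and the temporary `sample.txt` path here are illustrative only.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({"colA": ["A"], "colB": ["B"], "colC": ["C"], "colD": ["D"]})

# With no path argument, to_csv returns the CSV text as a string.
first = df.iloc[:, 0:2].to_csv(index=False, header=False)
second = df.iloc[:, 2:].to_csv(index=False, header=False)

# Normalise line endings, then drop the trailing newline.
text = (first + second).replace("\r\n", "\n").rstrip("\n")

with tempfile.TemporaryDirectory() as tmp:
    out_path = os.path.join(tmp, "sample.txt")
    # newline="" stops Python from translating "\n" on write or read.
    with open(out_path, "w", newline="") as fh:
        fh.write(text)
    with open(out_path, newline="") as fh:
        written = fh.read()  # 'A,B\nC,D'
```

Because the whole text is assembled before writing, there is no need for a second `to_csv` call in append mode.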
I'm currently sorting a csv file. The output is correct, but it isn't properly formatted. The following is the file I'm sorting
And here is the output after I sort (I'll include the code after the image)
Obviously I'm having a delimiter issue, but here is my code:
with open(out_file, 'r') as unsort:  ## Opens OMI Data
    with open(Pandora_Sorted, 'w') as sort:  ## Opens file to write to
        for line in unsort:
            if "Datetime" in line:  ## Searches lines
                writer = csv.writer(sort, delimiter=',')
                writer.writerow(headers)  ## Writes header
            elif "T13" in line:
                writer = csv.writer(sort)
                writer.writerow(line)  ## Writes to output file
I think it's easier to read the csv file into a pandas data frame and sort it there; see the sample code below.
import pandas as pd
df = pd.read_csv(input_file)
df.sort_values(by = ['Datetime'], inplace = True)
df.to_csv(output_file, index=False)  # index=False avoids writing an extra index column
Do you need to be explicit about your separator for the writer?
Here in the second line of your elif:
elif "T13" in line:
    writer = csv.writer(sort, delimiter=',')
    writer.writerow(line)  # Writes to output file
For the provided code, the header would also have formatting similar to the other rows due to the following line:
writer=csv.writer(sort, delimiter = ',')
Using pandas, the following sorts in ascending order by a list of columns, list_of_columns:
import pandas as pd
input_csv = pd.read_csv(out_file, sep=',')
# sort_values returns a new dataframe, so assign the result
input_csv = input_csv.sort_values(by=list_of_columns, ascending=True)
input_csv.to_csv(Pandora_Sorted, sep=',', index=False)
For example, list_of_columns could be
list_of_columns = ['Datetime', 'JulianDate', 'repetition']
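A small runnable sketch of the multi-column sort, using an in-memory CSV in place of the real file. The three data rows are invented for illustration; only the column names come from the example above.

```python
import io

import pandas as pd

# Invented sample rows; the real file would be read from disk instead.
raw = io.StringIO(
    "Datetime,JulianDate,repetition\n"
    "2016-01-02T13:00:00,2,1\n"
    "2016-01-01T13:00:00,1,2\n"
    "2016-01-01T13:00:00,1,1\n"
)
frame = pd.read_csv(raw)

list_of_columns = ["Datetime", "JulianDate", "repetition"]
# sort_values returns a new frame; the original frame is left unchanged.
sorted_frame = frame.sort_values(by=list_of_columns, ascending=True)

# index=False keeps the old row numbers out of the output.
csv_text = sorted_frame.to_csv(index=False)
```

Ties on `Datetime` are broken by `JulianDate`, then by `repetition`, in the order the columns are listed.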
The following code is effective to insert a row (features names) in my dataset as a first row:
features = ['VendorID', 'mta_tax', 'tip_amount', 'tolls_amount', 'improvement_surcharge', 'total_amount']
df = pd.DataFrame(pd.read_csv(path + 'data.csv', sep=','))
df.loc[-1] = features # adding a row
df.index = df.index + 1 # shifting index
df = df.sort_index() # sorting by index
But data.csv is very large ~ 10 GB, hence I am wondering if I can insert features row directly in the file without loading it! Is it possible?
Thank you
You don't have to load the entire file into memory; use the stdlib csv module's writer to append a row to the end of the file. Note that this adds the row at the end of the file, not as the first row.
import csv
import os
with open(os.path.join(path, 'data.csv'), 'a') as f:
    writer = csv.writer(f)
    writer.writerow(features)
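If the row really has to be the first line, the file has to be rewritten, but it can be streamed rather than loaded. Here is a rough sketch, assuming there is enough free disk space for a second copy and that a plain comma join is adequate for the header. `prepend_line` is a hypothetical helper and the demo data is invented.

```python
import os
import shutil
import tempfile


def prepend_line(path, line):
    """Write `line` plus the old contents to a temp file, then swap it into place."""
    directory = os.path.dirname(os.path.abspath(path))
    with tempfile.NamedTemporaryFile("wb", dir=directory, delete=False) as tmp:
        tmp.write((line + "\n").encode("utf-8"))
        with open(path, "rb") as src:
            # copyfileobj copies in fixed-size chunks, so memory use stays small.
            shutil.copyfileobj(src, tmp)
    os.replace(tmp.name, path)


features = ["VendorID", "mta_tax", "tip_amount"]
with tempfile.TemporaryDirectory() as d:
    data_path = os.path.join(d, "data.csv")
    with open(data_path, "w", newline="") as f:
        f.write("1,0.5,2.0\n2,0.5,0.0\n")
    prepend_line(data_path, ",".join(features))
    with open(data_path, newline="") as f:
        lines = f.read().splitlines()
```

Working in binary mode sidesteps encoding questions for the existing data; only the new header line is encoded.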
I have a 1 million line CSV file. I want to call a lookup function on each row's first column and append its result as a new column in the same CSV (if possible).
What I want is something like this:
for each row in dataframe
    string = row[1]
    result = lookupFunction(string)
    row.append(result)
I know I could do it using Python's csv library by opening my CSV, reading each row, doing my operation, and writing the results to a new CSV.
This is my code using Python's CSV library
with open(rawfile, 'r') as f:
    with open(newFile, 'a') as csvfile:
        csvwritter = csv.writer(csvfile, delimiter=' ')
        for line in f:
            #do operation
However I really want to do it with Pandas because it would be something new to me.
This is what my data looks like
77,#oshkosh # tannersville pa,,PA,US
82,#osithesakcom ca,,CA,US
88,#osp open records or,,OR,US
89,#ospbco tel ord in,,IN,US
98,#ospwmnwithn return in,,IN,US
99,#ospwmnwithn tel ord in,,IN,US
100,#osram sylvania inc ma,,MA,US
106,#osteria giotto montclair nj,,NJ,US
Any help and guidance will be appreciated. Thanks
Here is a simple example of adding two columns from your csv file into a new column:
import pandas as pd
df = pd.read_csv("yourpath/yourfile.csv")
df['newcol'] = df['col1'] + df['col2']
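For the lookup case in the question, a sketch along these lines might work. `lookup_function` is a made-up stand-in for the real lookup, I am assuming the text to look up is in the second field (index 1, as in the pseudocode), and the in-memory buffers stand in for the real input and output files.

```python
import io

import pandas as pd


def lookup_function(text):
    # Hypothetical stand-in for the real lookup: last token, upper-cased.
    return text.split()[-1].upper()


raw = io.StringIO(
    "77,#oshkosh # tannersville pa,,PA,US\n"
    "82,#osithesakcom ca,,CA,US\n"
    "88,#osp open records or,,OR,US\n"
)
out = io.StringIO()

# header=None because the sample has no header row;
# chunksize keeps only a few rows in memory at a time.
for chunk in pd.read_csv(raw, header=None, chunksize=2):
    chunk["result"] = chunk[1].map(lookup_function)
    chunk.to_csv(out, header=False, index=False)

rows = out.getvalue().splitlines()
```

For a real file, a small `chunksize` like 2 would be replaced by something in the tens of thousands, and `out` by a path opened once for writing.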
create df and csv
import pandas as pd
df = pd.DataFrame(dict(A=[1, 2], B=[3, 4]))
df.to_csv('test_add_column.csv')
read csv into dfromcsv
dfromcsv = pd.read_csv('test_add_column.csv', index_col=0)
create new column
dfromcsv['C'] = dfromcsv['A'] * dfromcsv['B']
dfromcsv
write csv
dfromcsv.to_csv('test_add_column.csv')
read it again
dfromcsv2 = pd.read_csv('test_add_column.csv', index_col=0)
dfromcsv2
How do I prevent Python from automatically writing objects into a csv in a different format than the original? For example, I have a list object such as the following:
row = ['APR16', '100.00000']
I want to write this row as is. However, when I use the writerow function of the csv writer, it writes into the csv file as 16-Apr and just 10. I want to keep the original formatting.
EDIT:
Here is the code:
import pandas as pd
dates = ['APR16', 'MAY16', 'JUN16']
numbers = [100.00000, 200.00000, 300.00000]
for i in range(3):
    row = []
    row.append(dates[i])
    row.append(numbers[i])
    prow = pd.DataFrame(row)
    prow.to_csv('test.csv', index=False, header=False)
And result:
Using pandas:
import pandas as pd
dates = ['APR16', 'MAY16', 'JUN16']
numbers = [100.00000, 200.00000, 300.00000]
data = list(zip(dates, numbers))  # materialise the zip iterator into a list of (date, number) rows
fd = pd.DataFrame(data)
fd.to_csv('test.csv', index=False, header=False) # csv-file
fd.to_excel("test.xls", header=False,index=False) # or xls-file
Result in my terminal:
➜ ~ cat test.csv
APR16
100.00000
Result in LibreOffice:
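As a side note, the CSV file itself only stores text, so a value like 16-Apr is most likely produced by the spreadsheet program guessing a type when it opens the file. To control how floats are written on the pandas side, `float_format` can be used. A minimal sketch (the column names `date` and `number` are invented for illustration):

```python
import io

import pandas as pd

dates = ["APR16", "MAY16", "JUN16"]
numbers = [100.0, 200.0, 300.0]
frame = pd.DataFrame({"date": dates, "number": numbers})

buffer = io.StringIO()
# float_format fixes the number of decimals written for float columns.
frame.to_csv(buffer, index=False, header=False, float_format="%.5f")
lines = buffer.getvalue().splitlines()  # first line: 'APR16,100.00000'
```

Reading the file in a text editor, or importing the column as text in the spreadsheet, shows what was actually written.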