I have a large CSV file with 40+ columns. I'm trying to sort it using pandas and write only selected columns into a new file. Here's my code:
Edit: I was probably wrong to assume I've done everything correctly up until the end, so here's the entire file. I read in 10 CSV files, add them to one, filter the rows so that they are unique in the way I need, and then I want to filter again, this time selecting just a few columns.
I am completely new to Python, so the code probably looks disgusting, and I assume that's where the issue is.
if __name__ == "__main__":
    files = ['airOT199701.csv', 'airOT199702.csv', 'airOT199703.csv', 'airOT199704.csv', 'airOT199705.csv', 'airOT199706.csv', 'airOT199707.csv', 'airOT199708.csv', 'airOT199709.csv', 'airOT199710.csv', 'airOT199711.csv', 'airOT199712.csv']
    with open('filterflights.csv', 'w') as outcsv:
        writer = csv.DictWriter(outcsv, fieldnames = ["YEAR","MONTH","DAY_OF_MONTH","DAY_OF_WEEK","FL_DATE","UNIQUE_CARRIER","TAIL_NUM","FL_NUM","ORIGIN_AIRPORT_ID","ORIGIN","ORIGIN_STATE_ABR","DEST_AIRPORT_ID","DEST","DEST_STATE_ABR","CRS_DEP_TIME","DEP_TIME","DEP_DELAY","DEP_DELAY_NEW","DEP_DEL15","DEP_DELAY_GROUP","TAXI_OUT","WHEELS_OFF","WHEELS_ON","TAXI_IN","CRS_ARR_TIME","ARR_TIME","ARR_DELAY","ARR_DELAY_NEW","ARR_DEL15","ARR_DELAY_GROUP","CANCELLED","CANCELLATION_CODE","DIVERTED","CRS_ELAPSED_TIME","ACTUAL_ELAPSED_TIME","AIR_TIME","FLIGHTS","DISTANCE","DISTANCE_GROUP","CARRIER_DELAY","WEATHER_DELAY","NAS_DELAY","SECURITY_DELAY","LATE_AIRCRAFT_DELAY","DIFFERENCE"])
        writer.writeheader()
        filewriter = csv.writer(outcsv, delimiter=',')
        for i in range(len(files)):
            reader = csv.reader(open(files[i], 'r'), delimiter=',')
            next(reader, None)
            result = set()
            for r in reader:
                r.append(abs(int(r[8])-int(r[11]))%25)
                key = (r[7],r[8],r[11])
                if key not in result:
                    filewriter.writerow(r)
                    result.add(key)
    df = pd.read_csv('filterflights.csv')
    df.header(3)
    df = df[["FL_DATE","FL_NUM","ORIGIN_AIRPORT_ID","ORIGIN","ORIGIN_STATE_ABR", "DEST_AIRPORT_ID","DEST","DEST_STATE_ABR", "DEP_TIME", "ARR_TIME", "DISTANCE", "DIFFERENCE"]]
    df.header(3)
    df.to_csv('filteredflights.csv', index=False)
I get the error: AttributeError: 'DataFrame' object has no attribute 'header' on line 23. All the CSV files are in the same folder as the Python file.
Possible issue: original csv files do not have DIFFERENCE column, can that cause the issue? Trying to append value with r.append, but maybe it doesn't know what to append to?
You can use DataFrame.reindex() to subset the data frame and preserve the given column order:
col_subset = ["FL_DATE","FL_NUM","ORIGIN_AIRPORT_ID","ORIGIN","ORIGIN_STATE_ABR", "DEST_AIRPORT_ID","DEST","DEST_STATE_ABR", "DEP_TIME", "ARR_TIME", "DISTANCE", "DIFFERENCE"]
df = df.reindex(columns= col_subset)
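As for the AttributeError itself: a DataFrame has no header method, and the call that previews the first rows is head. Here is a minimal sketch of both points. The small frame is made-up stand-in data (only the column names come from the question), and DIFFERENCE is deliberately left out to show how reindex treats a missing column.

```python
import pandas as pd

# Made-up stand-in for filterflights.csv; only the column names
# are taken from the question.
df = pd.DataFrame({
    "FL_DATE": ["1997-01-01", "1997-01-02"],
    "FL_NUM": [101, 202],
    "ORIGIN": ["JFK", "LAX"],
    "DEST": ["SFO", "ORD"],
    "DISTANCE": [2586, 1744],
})

col_subset = ["FL_NUM", "FL_DATE", "DEST", "DIFFERENCE"]

# head(), not header(), returns the first n rows.
preview = df.head(1)

# reindex keeps the requested column order and fills columns that are
# missing from the frame (here DIFFERENCE) with NaN instead of raising.
subset = df.reindex(columns=col_subset)
print(subset)
```

By contrast, df[col_subset] raises a KeyError when a listed column is missing, which can be the more useful behaviour if a typo in a column name is a possibility.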
Background: the CSV file is going to grow to a huge size as many columns are added, so I would prefer not to use pandas DataFrame.to_csv to write the whole matrix from memory. The data also needs to be written into the same file instead of generating a new file, as in previous topics and the code I tried below.
Perhaps pandas to_csv in append mode could write a new column, but I'm not sure how to write that.
data1,data2,data3,data4
1,4,2,4
2,32,1,4
3,3,1,5
4,3,1,5
5,2,22,9
6,3,34,9
7,5,4,9
import csv
def add_col_to_csv(csvfile,fileout,new_list):
    with open(csvfile, 'r') as read_f, \
            open(fileout, 'w', newline='') as write_f:
        csv_reader = csv.reader(read_f)
        csv_writer = csv.writer(write_f)
        i = 0
        for row in csv_reader:
            row.append(new_list[i])
            csv_writer.writerow(row)
            i += 1
new_list1 = ['new_col',4,4,5,5,9,9,9]
add_col_to_csv('input.csv','output.csv',new_list1)
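You can use something like this:
import pandas as pd
# to_csv returns None when given a path, so there is nothing to keep or delete
pd.DataFrame(new_list1).to_csv('output.csv', mode='a', index=False, header=False)
# rebind the name so the old list can be garbage-collected
new_list1 = []
This appends the values and releases them from memory right afterwards. Note that mode='a' adds the values as new rows at the end of the file, not as a new column. You can enable index and header depending on the values in your array. However, this is an awkward way to extend CSV files; consider JSON instead.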
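CSV is a row-oriented text format, so a column cannot be added without rewriting every line of the file. What can be avoided is holding the whole matrix in memory. Below is a minimal sketch of that idea, assuming Python 3 and one new value per existing row (header included). The function name append_column_in_place is made up for illustration. It streams the rows into a temporary file in the same directory and then swaps it over the original, so the result ends up under the original file name.

```python
import csv
import os
import tempfile
from itertools import zip_longest

_MISSING = object()  # sentinel: marks whichever input ran out first


def append_column_in_place(csv_path, new_values):
    """Append one value per row to csv_path, one row in memory at a time."""
    directory = os.path.dirname(os.path.abspath(csv_path))
    # Temporary file in the same directory, so the final swap
    # stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(suffix=".csv", dir=directory)
    try:
        with os.fdopen(fd, "w", newline="") as tmp_f, \
                open(csv_path, "r", newline="") as read_f:
            writer = csv.writer(tmp_f)
            pairs = zip_longest(csv.reader(read_f), new_values,
                                fillvalue=_MISSING)
            for row, value in pairs:
                if row is _MISSING or value is _MISSING:
                    raise ValueError("row count and new column length differ")
                writer.writerow(row + [value])
        # Replace the original only after every row was written.
        os.replace(tmp_path, csv_path)
    except Exception:
        os.remove(tmp_path)
        raise
```

With the data from the question this would be called as append_column_in_place('input.csv', new_list1). Because new_values can be any iterable, the new column can come from a generator and does not need to exist as a full list either.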
I have 14 CSV files, each with 100 columns. I want to extract the first column from each file and copy them into a single CSV file, and then do the same for each of the 100 columns (for example, the next step is to put the second column from each file into another CSV file).
What I've tried so far is the code below, which works perfectly for extracting one column, but I want to put it in a loop so that I get all 100 files at once. How can I do it?
import csv
import itertools as IT
filenames = ['Sul-v1.csv', 'Sul-v2.csv','Sul-v3.csv', 'Sul-v4.csv', 'Sul-v5.csv', 'Sul-v6.csv', 'Sul-v7.csv', 'Sul-v8.csv', 'Sul-v9.csv', 'Sul-v10.csv', 'Sul-v11.csv', 'Sul-v12.csv', 'Sul-v13.csv', 'Sul-v14.csv']
handles = [open(filename, 'rb') for filename in filenames]
readers = [csv.reader(f, delimiter=',') for f in handles]
with open('combined.csv', 'wb') as h:
    writer = csv.writer(h, delimiter=',', lineterminator='\n', )
    for rows in IT.izip_longest(*readers, fillvalue=['']*2):
        combined_row = []
        for row in rows:
            row = row[:1] # select the columns you want
            if len(row) == 1:
                combined_row.extend(row)
            else:
                combined_row.extend(['']*2)
        writer.writerow(combined_row)
for f in handles:
    f.close()
Thanks in advance!
Use pandas.
Start by loading all the CSV files into one DataFrame (see here).
Next, save each column into a new CSV file by looping over the columns and calling to_csv.
Make sure you pass the column to to_csv using the columns argument.
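A rough sketch of those steps follows. It assumes the input files have no header row, and it keeps one DataFrame per file rather than a single combined one, since that makes it simpler to pick the nth column of each file. The function name and the output file pattern are made up for illustration.

```python
import pandas as pd


def write_one_file_per_column(filenames, out_pattern="column_{}.csv"):
    """Write column n of every input file side by side into one CSV per n."""
    # Assumption: no header row. dtype=str copies the values as text
    # instead of letting pandas convert them to numbers.
    frames = [pd.read_csv(name, header=None, dtype=str) for name in filenames]
    n_cols = min(frame.shape[1] for frame in frames)
    written = []
    for col in range(n_cols):
        # Column `col` of each file becomes one column of the output;
        # shorter files are padded with empty cells.
        combined = pd.concat([frame.iloc[:, col] for frame in frames],
                             axis=1, ignore_index=True)
        out_name = out_pattern.format(col + 1)
        combined.to_csv(out_name, index=False, header=False)
        written.append(out_name)
    return written
```

Called with the 14 file names from the question, this would produce column_1.csv through column_100.csv, each holding 14 columns.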
I have a CSV file with information and want to replace the information in a specific location with a new value.
For example if my CSV file looks like this:
example1,example2,0
example3,example4,0
example5,example6,0
Note that each row is labelled for example:
test = row[0]
test1 = row[1]
test2 = row[2]
If I want to replace
test[0]
with a new value how would I go about doing it?
The simplest way, without installing any additional package, is to use the built-in csv module to read the whole file into a list of rows and replace the desired element.
Here is code that does just that:
import csv
# Python 3: open CSV files in text mode with newline=''
with open('test.csv', 'r', newline='') as in_file, open('test_out.csv', 'w', newline='') as out_file:
    data = [row for row in csv.reader(in_file)]
    data[0][0] = 'new value'  # first row, first column
    writer = csv.writer(out_file)
    writer.writerows(data)
There are a handful of ways to do this, but personally I'm a big fan of pandas. With pandas, you can read a CSV file with df = pd.read_csv('path_to_file.csv', header=None) (header=None because the example file has no header row). Make whatever changes you want; for row 1, column 1, use df.iloc[0, 0] = new_val. When you are done, save to the same file with df.to_csv('path_to_file.csv', index=False, header=False).
I have been trying initially to create a program to go through one file and select certain columns that will then be moved to a new text file. So far I have
import os, sys, csv
os.chdir("C://Users//nelsonj//Desktop//Master_Project")
with open('CHS_2009_test.txt', "rb") as sitefile:
    reader = csv.reader(sitefile, delimiter=',')
    pref_cols = [0,1,2,4,6,8,10,12,14,18,20,22,24,26,30,34,36,40]
    for row in reader:
        new_cols = list(row[i] for i in pref_cols)
        print new_cols
I have been trying to use the csv functions to write the new file, but I am continuously getting errors. I will eventually need to do this over a folder of files, but thought I would try it on one file before tackling that.
Code I attempted to use to write this data to a new file
for row in reader:
    with open("CHS_2009_edit.txt", 'w') as file:
        new_cols = list(row[i] for i in pref_cols)
        newfile = csv.writer(file)
        newfile.writerows(new_cols)
This kind of works in that I get a new file, but it only prints the second row of values from my CSV (i.e., not the header values) and places commas between each individual character instead of copying over the original columns as they were.
I am using PythonWin with Python 2.6 (from ArcGIS).
Thanks for the help!
NEW UPDATED CODE
import os, sys, csv
path = ('C://Users//nelsonj//Desktop//Master_Project')
for filename in os.listdir(path):
    pref_cols = [0,1,2,4,6,8,10,12,14,18,20,22,24,26,30,34,36,40]
    with open(filename, "rb") as sitefile:
        with open(filename.rsplit('.',1)[0] + "_Master.txt", 'w') as output_file:
            reader = csv.reader(sitefile, delimiter=',')
            writer = csv.writer(output_file)
            for row in reader:
                new_row = list(row[i] for i in pref_cols)
                writer.writerow(new_row)
                print new_row
I'm getting "list index out of range" for new_row, but it still seems to be processing the file. The only thing I can't get it to do now is loop through all the files in my directory. Here's a link to a screenshot of the data text file.
Try this:
new_header = list(row[i] for i in pref_cols if i < len(row))
That should avoid the error, because it skips any index that lies beyond the end of a short row, but it may not avoid the underlying problem. Would you paste your CSV file somewhere that I can access, and I'll fix this for you?
For your purpose of filtering, you don't have to treat the header differently from the rest of the data. You can go ahead and remove the following block:
headers = reader.next()
for row in headers:
    new_header = list(row[i] for i in pref_cols)
    print new_header
Your code did not work because you treated headers as a list of rows, but headers is just one row.
Update
This update deals with writing the CSV data to a new file. You should move the open statement above the for row loop, so the file is opened once rather than once per row:
with open("CHS_2009_edit.txt", 'w') as output_file:
    writer = csv.writer(output_file)
    for row in reader:
        new_cols = list(row[i] for i in pref_cols)
        writer.writerow(new_cols)
Update 2
This update deals with the header output problem. If you followed my suggestions, you should not have this problem. I don't know what your current code looks like, but it looks like you supplied a string where the code expects a list. Here is the code that I tried on my system (using my made-up data), and it seems to work:
pref_cols = [...] # <<=== Should be set before entering the loop
with open('CHS_2009_test.txt', "rb") as sitefile:
    with open('CHS_2009_edit.txt', 'w') as output_file:
        reader = csv.reader(sitefile, delimiter=',')
        writer = csv.writer(output_file)
        for row in reader:
            new_row = list(row[i] for i in pref_cols)
            writer.writerow(new_row)
One thing to notice: I use writerow() to write a single row, where you use writerows() -- that makes a difference.
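For the remaining part of the question, looping over every file in the folder, here is a minimal sketch. It is written for Python 3, whereas the question uses Python 2.6, so treat that as an assumption. The function name, the .txt filter and the choice to skip short rows are also illustrative assumptions rather than something taken from the question. Two details matter: os.listdir returns bare file names, so each one has to be joined with the folder path before opening, and rows with fewer fields than the largest wanted index (blank lines, for example) are what trigger "list index out of range".

```python
import csv
import os


def filter_columns_in_folder(folder, pref_cols, suffix="_Master.txt"):
    """Write a column-filtered copy of every .txt file found in folder."""
    written = []
    for filename in sorted(os.listdir(folder)):
        # Skip other file types and the outputs of an earlier run.
        if not filename.endswith(".txt") or filename.endswith(suffix):
            continue
        in_path = os.path.join(folder, filename)  # listdir gives bare names
        out_path = os.path.join(folder, filename.rsplit(".", 1)[0] + suffix)
        with open(in_path, "r", newline="") as sitefile, \
                open(out_path, "w", newline="") as output_file:
            writer = csv.writer(output_file)
            for row in csv.reader(sitefile):
                # A row with too few fields would raise IndexError below.
                if len(row) <= max(pref_cols):
                    continue
                writer.writerow([row[i] for i in pref_cols])
        written.append(out_path)
    return written
```

Skipping short rows silently is only one option; padding them with empty strings, or logging them, may suit the data better.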
I have two files, the first one is called book1.csv, and looks like this:
header1,header2,header3,header4,header5
1,2,3,4,5
1,2,3,4,5
1,2,3,4,5
The second file is called book2.csv, and looks like this:
header1,header2,header3,header4,header5
1,2,3,4
1,2,3,4
1,2,3,4
My goal is to copy the column that contains the 5's in book1.csv to the corresponding column in book2.csv.
The problem with my code seems to be that it is not appending correctly, nor is it selecting just the index that I want to copy. It also gives an error saying that I have selected an incorrect index position. The output is as follows:
header1,header2,header3,header4,header5
1,2,3,4
1,2,3,4
1,2,3,41,2,3,4,5
Here is my code:
import csv
with open('C:/Users/SAM/Desktop/book2.csv','a') as csvout:
    write=csv.writer(csvout, delimiter=',')
    with open('C:/Users/SAM/Desktop/book1.csv','rb') as csvfile1:
        read=csv.reader(csvfile1, delimiter=',')
        header=next(read)
        for row in read:
            row[5]=write.writerow(row)
What should I do to get this to append properly?
Thanks for any help!
What about something like this? I read in both books and, for every row in book2, append the last element of the matching book1 row, storing the results in a list. Then I write the contents of that list to a new .csv file.
import csv
with open('book1.csv', 'r') as book1:
    with open('book2.csv', 'r') as book2:
        reader1 = csv.reader(book1, delimiter=',')
        reader2 = csv.reader(book2, delimiter=',')
        both = []
        fields = next(reader1) # read header row
        next(reader2) # read and ignore header row
        for row1, row2 in zip(reader1, reader2):
            row2.append(row1[-1])
            both.append(row2)
with open('output.csv', 'w') as output:
    writer = csv.writer(output, delimiter=',')
    writer.writerow(fields) # write a header row
    writer.writerows(both)
Although some of the code above will work, it does not scale well, and a vectorised approach is a better fit. Working with numpy or pandas makes some of these tasks easier, so it is worth learning a bit of either.
You can download pandas from the pandas website.
# Load pandas
import pandas as pd
# Load each file into a pandas DataFrame (backed by NumPy arrays)
data1 = pd.read_csv('csv1.csv', sep=',')
data2 = pd.read_csv('csv2.csv', sep=',')
# Now add 'header5' from data1 to data2
data2['header5'] = data1['header5']
# Save it back to CSV without the row index
data2.to_csv('output.csv', index=False)
Regarding the "error that I have selected an incorrect index position," I suspect this is because you're using row[5] in your code. Indexing in Python starts from 0, so if you have A = [1, 2, 3, 4, 5] then to get the 5 you would do print(A[4]).
Assuming the two files have the same number of rows and the rows are in the same order, I think you want to do something like this:
import csv
# Open the two input files, which I've renamed to be more descriptive,
# and also an output file that we'll be creating
with open("four_col.csv", mode='r') as four_col, \
        open("five_col.csv", mode='r') as five_col, \
        open("five_output.csv", mode='w', newline='') as outfile:
    four_reader = csv.reader(four_col)
    five_reader = csv.reader(five_col)
    five_writer = csv.writer(outfile)
    _ = next(four_reader) # Ignore headers for the 4-column file
    headers = next(five_reader)
    five_writer.writerow(headers)
    for four_row, five_row in zip(four_reader, five_reader):
        last_col = five_row[-1] # Or use five_row[4]
        four_row.append(last_col)
        five_writer.writerow(four_row)
Why not read the files line by line and use the -1 index to find the last item?
endings=[]
with open('book1.csv') as book1:
    for line in book1:
        # if not header line:
        # strip the newline first, otherwise it stays attached to the last item
        endings.append(line.rstrip('\n').split(',')[-1])
linecounter=0
with open('book2.csv') as book2:
    for line in book2:
        # if not header line:
        print(line.rstrip('\n')+','+endings[linecounter]) # or write to file
        linecounter+=1
You should also catch errors if row numbers don't match.
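One way to detect such a mismatch is to pair the rows with itertools.zip_longest instead of zip, because zip stops silently at the end of the shorter file. The following is only a sketch under a few assumptions: Python 3, both files start with a header row, and the function name copy_last_column is made up for illustration.

```python
import csv
from itertools import zip_longest


def copy_last_column(src_path, dst_path, out_path):
    """Append the last column of src_path to every data row of dst_path."""
    missing = object()  # sentinel marking the file that ran out first
    with open(src_path, newline="") as src, \
            open(dst_path, newline="") as dst, \
            open(out_path, "w", newline="") as out:
        src_reader = csv.reader(src)
        dst_reader = csv.reader(dst)
        writer = csv.writer(out)
        header = next(src_reader)
        next(dst_reader)  # skip this header; the source header is written
        writer.writerow(header)
        paired = zip_longest(src_reader, dst_reader, fillvalue=missing)
        for line_no, (src_row, dst_row) in enumerate(paired, start=2):
            if src_row is missing or dst_row is missing:
                raise ValueError("row counts differ at line %d" % line_no)
            writer.writerow(dst_row + [src_row[-1]])
```

For the files in the question the call would be copy_last_column('book1.csv', 'book2.csv', 'output.csv'). If the row counts differ, a ValueError names the first line at which one file has no partner row.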