I need to modify multiple .csv files in my directory. Is it possible to do it with a simple script?
My .csv columns are in this order:
X_center,Y_center,X_Area,Y_Area,Classification
I would like to change them to this order:
Classification,X_center,Y_center,X_Area,Y_Area
So far I managed to write:
import os
import csv

for file in os.listdir("."):
    if file.endswith(".csv"):
        with open('*.csv', 'r') as infile, open('reordered.csv', 'a') as outfile:
            fieldnames = ['Classification','X_center','Y_center','X_Area','Y_Area']
            writer = csv.DictWriter(outfile, fieldnames=fieldnames)
            writer.writeheader()
            for row in csv.DictReader(infile):
                writer.writerow(row)
        csv_file.close()
But it replaces the values in every row with the header names Classification,X_center,Y_center,X_Area,Y_Area.
Is it possible to open a file, re-order the columns and save the file under the same name?
I checked similar solutions that were given on other threads but no luck.
Thanks for the help!
First off, I think your problem lies in opening the literal pattern '*.csv' inside the loop instead of opening file. Also, I would recommend never overwriting your original input files; it's much safer to write copies to a new directory. Here's a modified version of your script that does that.
import os
import csv
import argparse

ap = argparse.ArgumentParser()
ap.add_argument("-i", "--input", required=True)
ap.add_argument("-o", "--output", required=True)
args = vars(ap.parse_args())

if os.path.exists(args["output"]) and os.path.isdir(args["output"]):
    print("Writing to {}".format(args["output"]))
else:
    print("Cannot write to directory {}".format(args["output"]))
    exit()

fieldnames = ['Classification', 'X_center', 'Y_center', 'X_Area', 'Y_Area']

for file in os.listdir(args["input"]):
    if file.endswith(".csv"):
        print("{} ...".format(file))
        # newline='' avoids blank lines between rows on Windows
        with open(os.path.join(args["input"], file), 'r') as infile, \
             open(os.path.join(args["output"], file), 'w', newline='') as outfile:
            writer = csv.DictWriter(outfile, fieldnames=fieldnames)
            writer.writeheader()
            for row in csv.DictReader(infile):
                writer.writerow(row)
        # no explicit close() needed -- the with statement closes both files
To use it, create a new directory for your outputs and then run like so:
python this.py -i input_dir -o output_dir
Note:
From your question you seemed to want each file modified in place, so this does essentially that (it outputs a file of the same name, just in a different directory) but leaves your inputs unharmed. If you actually wanted all the files reordered into a single file, as your code open('reordered.csv', 'a') implies, you could easily do that by moving the output initialization code so it is executed before entering the loop, as sketched below.
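For instance, a minimal sketch of that single-file variant (an assumption on my part: --output now names an output file rather than a directory):

import os
import csv
import argparse

ap = argparse.ArgumentParser()
ap.add_argument("-i", "--input", required=True)
ap.add_argument("-o", "--output", required=True)  # here: an output file, not a directory
args = vars(ap.parse_args())

fieldnames = ['Classification', 'X_center', 'Y_center', 'X_Area', 'Y_Area']

# open the single output once, before the loop, and write one header
with open(args["output"], 'w', newline='') as outfile:
    writer = csv.DictWriter(outfile, fieldnames=fieldnames)
    writer.writeheader()
    for file in os.listdir(args["input"]):
        if file.endswith(".csv"):
            with open(os.path.join(args["input"], file), 'r') as infile:
                for row in csv.DictReader(infile):
                    writer.writerow(row)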
Using pandas & pathlib.
from pathlib import Path  # available in Python 3.4+
import pandas as pd

dir = r'c:\path\to\csvs'  # raw string for Windows paths
csv_files = [f for f in Path(dir).glob('*.csv')]  # finds all csvs in your folder
cols = ['Classification', 'X_center', 'Y_center', 'X_Area', 'Y_Area']

for csv in csv_files:                  # iterate over the list
    df = pd.read_csv(csv)              # read the csv
    df[cols].to_csv(csv, index=False)  # overwrite the file in place with the reordered columns
    print(f'{csv.name} saved.')
Naturally, if there is a CSV without those columns, this code will fail; you can add a try/except if that's a concern, as sketched below.
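A minimal sketch of that guard, assuming a missing column surfaces as a KeyError from the df[cols] selection:

from pathlib import Path
import pandas as pd

csv_files = Path(r'c:\path\to\csvs').glob('*.csv')
cols = ['Classification', 'X_center', 'Y_center', 'X_Area', 'Y_Area']

for csv in csv_files:
    df = pd.read_csv(csv)
    try:
        df[cols].to_csv(csv, index=False)
        print(f'{csv.name} saved.')
    except KeyError:
        # one or more expected columns are missing; leave this file untouched
        print(f'{csv.name} skipped: missing expected columns.')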
Related
I have a .csv file of around 30000 rows. The default delimiter is a semicolon. I created a small Python script to convert that delimiter to a comma and save the result back to the same file. The script runs without any errors but does nothing at the end: the delimiter is still a semicolon. The .txt file is created, but nothing is written back to the main file. The code I am using is as follows:
import csv
from pathlib import Path
import os

cwd = os.getcwd()        # Get the current working directory (cwd)
files = os.listdir(cwd)  # Get all the files in that directory
print("Files in %r: %s" % (cwd, files))

with open('RadGridExport.csv', mode='r', encoding='utf-8') as infile:
    reader = csv.reader(infile, dialect="excel")
    with open('temp.txt', mode='w', encoding='utf-8') as outfile:
        writer = csv.writer(outfile, delimiter=',')
        writer.writerows(reader)
You have missed the delimiter while reading. By default csv.reader looks for a comma; since that is not the case here, you have to specify the delimiter:
reader = csv.reader(infile, dialect="excel", delimiter=";")
And you need not mention the comma as delimiter while writing, since it is the default.
Or the easiest way is to use the pandas package:
import pandas as pd

df = pd.read_csv('RadGridExport.csv', sep=';')  # read with the semicolon delimiter
df.to_csv('RadGridExport.csv', index=False)     # write back with the default comma
I am trying to combine multiple csv files into one, and have tried a number of methods but I am struggling.
I import the data from multiple csv files, and when I compile them into one csv file, the first few rows come out nicely, but then the output starts getting gaps of variable size inserted between the rows. The combined file also never seems to finish: data just keeps being appended, which does not make sense to me because I am compiling a finite amount of data.
I have already tried writing close statements for the file, and I still get the same result: my designated combined csv file never stops receiving data, and the data is randomly spaced throughout the file. I just want a normally compiled csv.
Is there an error in my code? Is there any explanation as to why my csv file is behaving this way?
csv_file_list = glob.glob(Dir + '/*.csv')  # returns the file list
print(csv_file_list)

with open(Avg_Dir + '.csv', 'w') as f:
    wf = csv.writer(f, delimiter=',')
    print(f)
    for files in csv_file_list:
        rd = csv.reader(open(files, 'r'), delimiter=',')
        for row in rd:
            print(row)
            wf.writerow(row)
Your code works for me.
Alternatively, you can merge files as follows:
csv_file_list = glob.glob(Dir + '/*.csv')

with open(Avg_Dir + '.csv', 'w') as wf:
    for file in csv_file_list:
        with open(file) as rf:
            for line in rf:
                if line.strip():  # if line is not empty
                    if not line.endswith("\n"):
                        line += "\n"
                    wf.write(line)
Or, if the files are not too large, you can read each file at once. But in this case all empty lines and headers will be copied:
csv_file_list = glob.glob(Dir + '/*.csv')

with open(Avg_Dir + '.csv', 'w') as wf:
    for file in csv_file_list:
        with open(file) as rf:
            wf.write(rf.read().strip() + "\n")
Consider several adjustments:
Use a context manager, with, for both the read and the write process. This avoids the need to call close() on file objects, which you currently do not do on the read objects.
For the skipped-lines issue: use either the argument newline='' in open() or the lineterminator="\n" argument in csv.writer(). See SO answers for the former and the latter.
Use os.path.join() to properly concatenate folder and file paths. This method is os-agnostic, so it accounts for Windows and Unix machines using forward or back slashes.
Adjusted script:
import os
import csv, glob

Dir = r"C:\Path\To\Source"
Avg_Dir = r"C:\Path\To\Destination\Output"

csv_file_list = glob.glob(os.path.join(Dir, '*.csv'))  # returns the file list
print(csv_file_list)

with open(os.path.join(Avg_Dir, 'Output.csv'), 'w', newline='') as f:
    wf = csv.writer(f, lineterminator='\n')
    for files in csv_file_list:
        with open(files, 'r') as r:
            next(r)  # SKIP HEADERS
            rr = csv.reader(r)
            for row in rr:
                wf.writerow(row)
I am trying to append several csv files into a single csv file using Python while adding the file name (or, even better, a sub-string of the file name) as a new variable. All files have headers. The following script does the trick of merging the files, but does not address the file-name-as-a-new-variable issue:
import glob

filenames = glob.glob("/filepath/*.csv")
outputfile = open("out.csv", "a")

for line in open(str(filenames[1])):
    outputfile.write(line)

for i in range(1, len(filenames)):
    f = open(str(filenames[i]))
    f.next()
    for line in f:
        outputfile.write(line)

outputfile.close()
I was wondering if there are any good suggestions. I have about 25k small size csv files (less than 100KB each).
You can use Python's csv module to parse the CSV files for you, and to format the output. Example code (untested):
import csv

with open(output_filename, "wb") as outfile:
    writer = None
    for input_filename in filenames:
        with open(input_filename, "rb") as infile:
            reader = csv.DictReader(infile)
            if writer is None:
                field_names = ["Filename"] + reader.fieldnames
                writer = csv.DictWriter(outfile, field_names)
                writer.writeheader()
            for row in reader:
                row["Filename"] = input_filename
                writer.writerow(row)
A few notes:
Always use with to open files. This makes sure they will get closed again when you are done with them. Your code doesn't correctly close the input files.
CSV files should be opened in binary mode.
Indices start at 0 in Python. Your code skips the first file, and includes the lines from the second file twice. If you just want to iterate over a list, you don't need to bother with indices in Python; simply use for x in my_list instead, as in the sketch below.
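For example, a minimal sketch of the original line-based approach rewritten without manual indexing (keeping the skip-the-header-after-the-first-file logic; enumerate is just used to tell the first file apart):

import glob

filenames = glob.glob("/filepath/*.csv")

with open("out.csv", "w") as outputfile:
    for i, name in enumerate(filenames):
        with open(name) as f:
            if i > 0:
                next(f)  # skip the header of every file after the first
            for line in f:
                outputfile.write(line)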
Simple changes will achieve what you want:
For the first loop, change
outputfile.write(line) -> outputfile.write(line.rstrip('\n') + ',file\n')
and later
outputfile.write(line.rstrip('\n') + ',' + filenames[i] + '\n')
(the rstrip is needed because line already ends with a newline, so the new field must be spliced in before it). A consolidated sketch follows below.
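Putting those changes together with the indexing note from the previous answer, a hedged sketch of the whole script might look like this (this is a sketch, not the answerer's exact code: it writes the header once with a new file column, then tags every data row with its source filename):

import glob

filenames = glob.glob("/filepath/*.csv")

with open("out.csv", "w") as outputfile:
    for i, name in enumerate(filenames):
        with open(name) as f:
            header = next(f)
            if i == 0:
                # write the header once, with the new column name appended
                outputfile.write(header.rstrip('\n') + ',file\n')
            for line in f:
                # append the source filename as the last field of every data row
                outputfile.write(line.rstrip('\n') + ',' + name + '\n')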
I have a number of .csv files in a folder (1.csv, 2.csv, 3.csv, etc.) and I need to loop over them all. The output should be a corresponding NEW file for each existing one, but each should only contain 2 columns.
Here is a sample of the csv files:
004,444.444.444.444,448,11:16 PDT,11-24-15
004,444.444.444.444,107,09:55 PDT,11-25-15
004,444.444.444.444,235,09:45 PDT,11-26-15
004,444.444.444.444,241,11:00 PDT,11-27-15
And here is how I would like the output to look:
448,11-24-15
107,11-25-15
235,11-26-15
241,11-27-15
Here is my working attempt at achieving this with Python:
import csv
import os
import glob

path = '/csvs/'
for infile in glob.glob(os.path.join(path, '*csv')):
    inputfile = open(infile, 'r')
    output = os.rename(inputfile + ".out", 'w')

    # Extracts the important columns from the .csv into a new file
    with open(infile, 'r') as source:
        readr = csv.reader(source)
        with open(output, "w") as result:
            writr = csv.writer(result)
            for r in readr:
                writr.writerow((r[4], r[2]))
Using only the second half of this code, I am able to get the desired output by specifying the input files in the code. However, this Python script will be a small part of a much larger bash script that will be (hopefully) fully automated.
How can I adjust the input of this script to loop over each file and create a new one with just the 2 specified columns?
Please let me know if there is anything I need to clarify.
inputfile is a file object you opened, but then you do:
os.rename(inputfile + ".out", 'w')
This does not work: you are trying to add a string and an open file object with the + operator. I am not even sure why you need that line, or even the line inputfile = open(infile, 'r'); you open the file again in the with statement anyway.
Another issue:
You specify your path as path = '/csvs/'. It is highly unlikely that you have a 'csvs' directory under the root directory; you probably meant a relative directory, so drop the leading slash.
You can just do -
path = 'csvs/'
for infile in glob.glob(os.path.join(path, '*csv')):
    output = infile + '.out'
    with open(infile, 'r') as source:
        readr = csv.reader(source)
        with open(output, "w") as result:
            writr = csv.writer(result)
            for r in readr:
                writr.writerow((r[4], r[2]))
You can use the pandas library. It offers a lot of functionality for dealing with csv files. read_csv will read the csv file for you and give you a DataFrame object. Visit this link for an example of how to write a csv file from a pandas DataFrame. Moreover, there are lots of tutorials available on the net. A sketch of this approach is below.
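For instance, a minimal hedged sketch of the pandas approach for this question (assuming the sample files have no header row, hence header=None, and keeping the same column positions as the code above):

import glob
import os
import pandas as pd

path = 'csvs/'
for infile in glob.glob(os.path.join(path, '*csv')):
    df = pd.read_csv(infile, header=None)  # sample files have no header row
    # keep only the date and count columns, in the same order as (r[4], r[2]) above
    df[[4, 2]].to_csv(infile + '.out', header=False, index=False)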
I am a beginner with Python. I have multiple CSV files (more than 10), and all of them have same number of columns. I would like to merge all of them into a single CSV file, where I will not have headers repeated.
So essentially I need to have just the first row with all the headers and from then I need all the rows from all CSV files merged. How do I do this?
Here's what I tried so far.
import glob
import csv

with open('output.csv', 'wb') as fout:
    wout = csv.writer(fout, delimiter=',')
    interesting_files = glob.glob("*.csv")
    for filename in interesting_files:
        print 'Processing', filename
        # Open and process file
        h = True
        with open(filename, 'rb') as fin:
            fin.next()  # skip header
            for line in csv.reader(fin, delimiter=','):
                wout.writerow(line)
If you are on a Linux system:
head -1 directory/one_file.csv > output.csv   ## write the header row to the final file
tail -q -n +2 directory/*.csv >> output.csv   ## append every csv from its second line on; -q suppresses the "==> file <==" headers tail prints for multiple files
While I think that the best answer is the one from @valentin, you can do this without using the csv module at all:
import glob

interesting_files = glob.glob("*.csv")

header_saved = False
with open('output.csv', 'wb') as fout:
    for filename in interesting_files:
        with open(filename) as fin:
            header = next(fin)
            if not header_saved:
                fout.write(header)
                header_saved = True
            for line in fin:
                fout.write(line)
If you don't mind the overhead, you could use pandas, which is shipped with common Python distributions. If you plan to do more with spreadsheet tables, I recommend using pandas rather than trying to write your own libraries.
import pandas as pd
import glob

interesting_files = glob.glob("*.csv")

df_list = []
for filename in sorted(interesting_files):
    df_list.append(pd.read_csv(filename))

full_df = pd.concat(df_list)
full_df.to_csv('output.csv', index=False)  # index=False keeps the row index out of the output
Just a little more on pandas. Because it is made to deal with spreadsheet-like data, it knows the first line is a header. When reading a CSV it separates the data table from the header, which is kept as metadata of the DataFrame, the standard datatype in pandas. If you concat several of these DataFrames, it concatenates only the data parts when their headers are the same. If the headers differ, pandas aligns on the column names and fills the gaps with NaN rather than merging them blindly, so it is worth checking the result in case your directory is polluted with CSV files from another source.
Another thing: I just added sorted() around interesting_files. I assume your files are named in order and this order should be kept. I am not sure about glob, but the os functions do not necessarily return files sorted by name.
Your attempt is almost working, but the issues are:
you're opening the file for reading but closing it before writing the rows.
you're never writing the header. You have to write it exactly once.
Also, you have to exclude output.csv from the glob, else the output is also an input!
Here's the corrected code, passing the csv reader object directly to the csv.writerows method for shorter & faster code, and writing the header from the first file to the output file.
import glob
import csv

output_file = 'output.csv'
header_written = False

with open(output_file, 'w', newline="") as fout:  # just "wb" in Python 2
    wout = csv.writer(fout, delimiter=',')
    # filter out the output file itself
    interesting_files = [x for x in glob.glob("*.csv") if x != output_file]
    for filename in interesting_files:
        print('Processing {}'.format(filename))
        with open(filename) as fin:
            cr = csv.reader(fin, delimiter=",")
            header = next(cr)  # read the header
            if not header_written:
                wout.writerow(header)
                header_written = True
            wout.writerows(cr)
Note that solutions using raw line-by-line processing miss an important point: if the header is multi-line (i.e. a quoted header cell contains a newline), they fail miserably, botching the header line or repeating part of it several times, effectively corrupting the file.
The csv module (and pandas, too) handles those cases gracefully, as the small demonstration below shows.
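For instance, a small self-contained demonstration (hypothetical data, not from the question): a quoted header cell containing a newline spans two physical lines, which line-based copying would split, while csv.reader returns it as one logical row.

import csv
import io

data = '"first\nheader cell",second\n1,2\n3,4\n'

# line-based iteration sees 4 physical lines, splitting the header in two
print(data.splitlines())  # ['"first', 'header cell",second', '1,2', '3,4']

# csv.reader honors the quoting and returns 3 logical rows
for row in csv.reader(io.StringIO(data)):
    print(row)  # ['first\nheader cell', 'second'], then ['1', '2'], then ['3', '4']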
Your indentation is wrong: you need to put the loop inside the with block. You can also pass a csv.reader directly to writer.writerows.
import csv
import glob

with open('output.csv', 'wb') as fout:
    wout = csv.writer(fout)
    interesting_files = glob.glob("*.csv")
    for filename in interesting_files:
        print 'Processing', filename
        with open(filename, 'rb') as fin:
            next(fin)  # skip header
            wout.writerows(csv.reader(fin))