I have a number of .csv files in a folder (1.csv, 2.csv, 3.csv, etc.) and I need to loop over them all. The output should be a corresponding NEW file for each existing one, but each should only contain 2 columns.
Here is a sample of the csv files:
004,444.444.444.444,448,11:16 PDT,11-24-15
004,444.444.444.444,107,09:55 PDT,11-25-15
004,444.444.444.444,235,09:45 PDT,11-26-15
004,444.444.444.444,241,11:00 PDT,11-27-15
And here is how I would like the output to look:
448,11-24-15
107,11-25-15
235,11-26-15
241,11-27-15
Here is my working attempt at achieving this with Python:
import csv
import os
import glob

path = '/csvs/'
for infile in glob.glob(os.path.join(path, '*csv')):
    inputfile = open(infile, 'r')
    output = os.rename(inputfile + ".out", 'w')

    # Extracts the important columns from the .csv into a new file
    with open(infile, 'r') as source:
        readr = csv.reader(source)
        with open(output, "w") as result:
            writr = csv.writer(result)
            for r in readr:
                writr.writerow((r[4], r[2]))
Using only the second half of this code, I am able to get the desired output by specifying the input files in the code. However, this Python script will be a small part of a much larger bash script that will be (hopefully) fully automated.
How can I adjust the input of this script to loop over each file and create a new one with just the 2 specified columns?
Please let me know if there is anything I need to clarify.
inputfile is a file object you opened, but then you are doing -
os.rename(inputfile + ".out", 'w')
This does not work: you are trying to add a string and the opened file object with the + operator. I am not even sure why you need that line, or even the line inputfile = open(infile, 'r'), since you are opening the file again in the with statement.
Another issue -
You specify your path as path = '/csvs/'. It is highly unlikely that you have a csvs directory under the root directory; you probably wanted some other relative directory, so you should use a relative path.
You can just do -
path = 'csvs/'
for infile in glob.glob(os.path.join(path, '*csv')):
    output = infile + '.out'
    with open(infile, 'r') as source:
        readr = csv.reader(source)
        with open(output, "w") as result:
            writr = csv.writer(result)
            for r in readr:
                writr.writerow((r[4], r[2]))
You can use the pandas library. It offers a lot of functionality for dealing with CSV files: read_csv will read the csv file for you and give you a DataFrame object, and DataFrame.to_csv will write a DataFrame back out to a csv file. Moreover, there are lots of tutorials available on the net.
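For example, here is a minimal pandas sketch of the task above (the column positions and the csvs/ folder are carried over from the question; header=None because the sample files have no header row):

import glob
import pandas as pd

# read every csv in the folder, keep columns 2 and 4, and write a matching .out file
for infile in glob.glob('csvs/*.csv'):
    df = pd.read_csv(infile, header=None)  # the files have no header row
    df[[2, 4]].to_csv(infile + '.out', header=False, index=False)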
Related
I am trying to read files with a .txt extension and want to append them all into a single txt file. I can read the data, but what is the best way to write it all into a single .txt file?
sources = ["list of paths to files you want to write from"]
dest = open("file.txt", "a")
for src in sources:
    source = open(src, "r")
    data = source.readlines()
    for d in data:
        dest.write(d)
    source.close()
dest.close()
If your destination doesn't already exist, you can use "w" (write) mode instead of "a" (append) mode.
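For reference, here is a minimal variant of the same merge using with blocks, so both files are closed automatically even if an error occurs (the source paths are placeholders):

sources = ["a.txt", "b.txt"]  # hypothetical input paths

with open("file.txt", "a") as dest:
    for src in sources:
        with open(src, "r") as source:
            dest.write(source.read())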
Try this.
x.txt:
Python is fun
y.txt:
Hello World. Welcome to my code.
z.txt:
I know that python is popular.
Main Python file:
list_ = ['x.txt', 'y.txt', 'z.txt']
new_list_ = []
for i in list_:
    with open(i, "r") as x:  # close each input as soon as it is read
        re = x.read()
    new_list_.append(re)

with open('all.txt', "w") as file:
    for line in new_list_:
        file.write(line + "\n")
After you find the filenames, if you have a lot of files you should avoid building the result by repeated string concatenation: in Python each concatenation copies the accumulated string, so the total cost grows quadratically, whereas collecting the pieces in a list and joining them once is linear. I think the code below demonstrates the full example.
import glob

# get every txt file from the current directory
txt_files = glob.iglob('./*.txt')

def get_file_content(filename):
    content = ''
    with open(filename, 'r') as f:
        content = f.read()
    return content

contents = []
for txt_file in txt_files:
    contents.append(get_file_content(txt_file))

with open('complete_content.txt', 'w') as f:
    f.write(''.join(contents))
I need to modify multiple .csv files in my directory. Is it possible to do it with a simple script?
My .csv columns are in this order:
X_center,Y_center,X_Area,Y_Area,Classification
I would like to change them to this order:
Classification,X_center,Y_center,X_Area,Y_Area
So far I managed to write:
import os
import csv

for file in os.listdir("."):
    if file.endswith(".csv"):
        with open('*.csv', 'r') as infile, open('reordered.csv', 'a') as outfile:
            fieldnames = ['Classification','X_center','Y_center','X_Area','Y_Area']
            writer = csv.DictWriter(outfile, fieldnames=fieldnames)
            writer.writeheader()
            for row in csv.DictReader(infile):
                writer.writerow(row)
csv_file.close()
But it changes every row to Classification,X_center,Y_center,X_Area,Y_Area (replaces values in every row).
Is it possible to open a file, re-order the columns and save the file under the same name?
I checked similar solutions that were given on other threads but no luck.
Thanks for the help!
First off, I think your problem lies in opening '*.csv' in the loop instead of opening file. Also, I would recommend never overwriting your original input files; it's much safer to write copies to a new directory. Here's a modified version of your script which does that.
import os
import csv
import argparse

ap = argparse.ArgumentParser()
ap.add_argument("-i", "--input", required=True)
ap.add_argument("-o", "--output", required=True)
args = vars(ap.parse_args())

if os.path.exists(args["output"]) and os.path.isdir(args["output"]):
    print("Writing to {}".format(args["output"]))
else:
    print("Cannot write to directory {}".format(args["output"]))
    exit()

for file in os.listdir(args["input"]):
    if file.endswith(".csv"):
        print("{} ...".format(file))
        with open(os.path.join(args["input"], file), 'r') as infile, open(os.path.join(args["output"], file), 'w') as outfile:
            fieldnames = ['Classification','X_center','Y_center','X_Area','Y_Area']
            writer = csv.DictWriter(outfile, fieldnames=fieldnames)
            writer.writeheader()
            for row in csv.DictReader(infile):
                writer.writerow(row)
To use it, create a new directory for your outputs and then run like so:
python this.py -i input_dir -o output_dir
Note:
From your question you seemed to want each file to be modified in place, so this does basically that (it outputs a file of the same name, just in a different directory) but leaves your inputs unharmed. If you actually wanted all the files reordered into a single file, as your code open('reordered.csv', 'a') implies, you could easily do that by moving the output initialization code so it is executed before entering the loop, as sketched below.
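A minimal sketch of that single-file variant, reusing the imports and argument handling from the script above (the output name reordered.csv is carried over from the question):

# open one combined output before the loop, so its header is written only once
with open(os.path.join(args["output"], "reordered.csv"), 'w') as outfile:
    fieldnames = ['Classification', 'X_center', 'Y_center', 'X_Area', 'Y_Area']
    writer = csv.DictWriter(outfile, fieldnames=fieldnames)
    writer.writeheader()
    for file in os.listdir(args["input"]):
        if file.endswith(".csv"):
            with open(os.path.join(args["input"], file), 'r') as infile:
                for row in csv.DictReader(infile):
                    writer.writerow(row)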
Using pandas & pathlib.
from pathlib import Path  # available in Python 3.4+
import pandas as pd

dir = r'c:\path\to\csvs'  # raw string for Windows paths
csv_files = list(Path(dir).glob('*.csv'))  # finds all csvs in your folder

cols = ['Classification','X_center','Y_center','X_Area','Y_Area']

for csv in csv_files:  # iterate over the list
    df = pd.read_csv(csv)  # read csv
    df[cols].to_csv(csv.name, index=False)  # note: csv.name writes into the current working directory
    print(f'{csv.name} saved.')
Naturally, if there is a csv without those columns then this code will fail; you can add a try/except if that's the case, as in the sketch below.
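A minimal sketch of that guard, continuing the snippet above; pandas raises a KeyError when requested columns are missing, so that is the exception to catch:

for csv in csv_files:
    df = pd.read_csv(csv)
    try:
        df[cols].to_csv(csv.name, index=False)
        print(f'{csv.name} saved.')
    except KeyError as e:
        print(f'{csv.name} skipped: {e}')  # this file lacks one of the expected columns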
I am trying to combine multiple csv files into one, and have tried a number of methods but I am struggling.
I import the data from multiple csv files, and when I compile them together into one csv file, the first few rows come out nicely, but then it starts randomly inserting a variable number of blank rows between the data rows, and it never finishes filling out the combined csv file; data just keeps getting added to it, which does not make sense to me because I am compiling a finite amount of data.
I have already tried writing close statements for the file, and I still get the same result: my designated combined csv file never stops getting data, and the data is randomly spaced throughout the file. I just want a normally compiled csv.
Is there an error in my code? Is there any explanation as to why my csv file is behaving this way?
csv_file_list = glob.glob(Dir + '/*.csv')  # returns the file list
print (csv_file_list)

with open(Avg_Dir + '.csv','w') as f:
    wf = csv.writer(f, delimiter = ',')
    print (f)
    for files in csv_file_list:
        rd = csv.reader(open(files,'r'), delimiter = ',')
        for row in rd:
            print (row)
            wf.writerow(row)
Your code works for me.
Alternatively, you can merge files as follows:
csv_file_list = glob.glob(Dir + '/*.csv')
with open(Avg_Dir + '.csv','w') as wf:
    for file in csv_file_list:
        with open(file) as rf:
            for line in rf:
                if line.strip():  # if line is not empty
                    if not line.endswith("\n"):
                        line += "\n"
                    wf.write(line)
Or, if the files are not too large, you can read each file at once. But in this case all empty lines and headers will be copied:
csv_file_list = glob.glob(Dir + '/*.csv')
with open(Avg_Dir + '.csv','w') as wf:
    for file in csv_file_list:
        with open(file) as rf:
            wf.write(rf.read().strip()+"\n")
Consider several adjustments:
Use a context manager, with, for both the read and write process. This avoids the need to close() file objects, which you do not do on the read objects.
For the skipped-lines issue: use either the argument newline='' in open() or the lineterminator="\n" argument in csv.writer(); both are covered in existing SO answers.
Use os.path.join() to properly concatenate folder and file paths. This method is OS-agnostic, so it handles Windows and Unix machines with their backslash or forward-slash separators.
Adjusted script:
import os
import csv, glob

Dir = r"C:\Path\To\Source"
Avg_Dir = r"C:\Path\To\Destination\Output"

csv_file_list = glob.glob(os.path.join(Dir, '*.csv'))  # returns the file list
print (csv_file_list)

with open(os.path.join(Avg_Dir, 'Output.csv'), 'w', newline='') as f:
    wf = csv.writer(f, lineterminator='\n')

    for files in csv_file_list:
        with open(files, 'r') as r:
            next(r)  # SKIP HEADERS
            rr = csv.reader(r)
            for row in rr:
                wf.writerow(row)
I am trying to append several csv files into a single csv file using Python, while adding the file name (or, even better, a sub-string of the file name) as a new variable. All files have headers. The following script does the trick of merging the files, but does not handle the file-name-as-variable issue:
import glob

filenames = glob.glob("/filepath/*.csv")
outputfile = open("out.csv","a")

for line in open(str(filenames[1])):
    outputfile.write(line)

for i in range(1,len(filenames)):
    f = open(str(filenames[i]))
    f.next()
    for line in f:
        outputfile.write(line)

outputfile.close()
I was wondering if there are any good suggestions. I have about 25k small size csv files (less than 100KB each).
You can use Python's csv module to parse the CSV files for you, and to format the output. Example code (untested):
import csv

with open(output_filename, "wb") as outfile:
    writer = None
    for input_filename in filenames:
        with open(input_filename, "rb") as infile:
            reader = csv.DictReader(infile)
            if writer is None:
                field_names = ["Filename"] + reader.fieldnames
                writer = csv.DictWriter(outfile, field_names)
                writer.writeheader()
            for row in reader:
                row["Filename"] = input_filename
                writer.writerow(row)
A few notes:
Always use with to open files. This makes sure they will get closed again when you are done with them. Your code doesn't correctly close the input files.
CSV files should be opened in binary mode (this is Python 2 advice, as in the code above; in Python 3, open them in text mode with newline='' instead).
Indices start at 0 in Python. Your code skips the first file, and includes the lines from the second file twice. If you just want to iterate over a list, you don't need to bother with indices in Python. Simply use for x in my_list instead.
Simple changes to your original script will also achieve what you want.
For the header line (the first loop), change
outputfile.write(line) -> outputfile.write(line.rstrip('\n') + ',file\n')
and in the second loop use
outputfile.write(line.rstrip('\n') + ',' + filenames[i] + '\n')
so each row gets the source file name appended as a final column.
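Put together, a minimal sketch of that approach (rewritten with with blocks and zero-based iteration; using os.path.basename for the new column is my own assumption):

import glob
import os

filenames = glob.glob("/filepath/*.csv")

with open("out.csv", "w") as outputfile:
    for i, name in enumerate(filenames):
        with open(name) as f:
            header = f.readline().rstrip("\n")
            if i == 0:
                outputfile.write(header + ",file\n")  # write the header, plus the new column, once
            for line in f:
                # append the source file's base name as the last column
                outputfile.write(line.rstrip("\n") + "," + os.path.basename(name) + "\n")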
I ran a grep command and found several hundred instances of a string in a large directory of data. This file is 2 MB and has strings that I would like to extract out and put into an Excel file for easy access later. The part that I'm extracting is a path to a data file I need to work on later.
I have been reading about Python lately and thought I could somehow do this extraction automatically. But I'm a bit stumped how to start. I have this so far:
data = open("C:\python27\text.txt").read()
if "string" in data:
But then I'm not sure what to use to get out of the file what I want. Anything for a beginner to chew on?
EDIT
Here is some more info on what I was looking for. I have several hundred lines in a text file. Each line has a path and some strings like this:
/path/to/file:STRING=SOME_STRING, ANOTHER_STRING
What I would like from these lines are the paths of those lines with a specific "STRING=SOME_STRING". For example if the line looks like this, I want the path (/path/to/file) to be extracted to another file:
/path/to/file:STRING=SOME_STRING
All this is quite easily done with standard Python, but for "Excel" (xls or xlsx) files you'd have to install a third-party library. However, if you just need a 2D table that can open up in a spreadsheet, you can use Comma Separated Values (CSV) files. These are compatible with Excel and other spreadsheet software, and support comes integrated in Python.
As for searching for a string inside a file, it is straightforward. You may not even need regular expressions for most things. What information do you want along with the string?
Also, the "os" module in the standard library has some functions to list all files in a directory, or in a directory tree. The most straightforward is os.listdir(path).
String methods like "count" and "find" can be used beyond "in" to locate the string in a file, or count the number of occurrences.
And finally, the "csv" module can write a properly formatted file to read in any spreadsheet.
Along the way, you may use Python's built-in list objects as an easy way to manipulate your data sets.
Here is a sample program that counts the strings given on the command line found in files in a given directory, and assembles a .csv table with them:
# -*- coding: utf-8 -*-
import csv
import sys, os

output_name = "count.csv"

def find_in_file(path, string_list):
    count = []
    file_ = open(path)
    data = file_.read()
    file_.close()
    for string in string_list:
        count.append(data.count(string))
    return count

def main():
    if len(sys.argv) < 3:
        print "Use: %s directory_path <string1> [string2 [...]]\n" % sys.argv[0]
        sys.exit(1)
    target_dir = sys.argv[1]
    string_list = sys.argv[2:]
    csv_file = open(output_name, "wt")
    writer = csv.writer(csv_file)
    header = ["Filename"] + string_list
    writer.writerow(header)
    for filename in os.listdir(target_dir):
        path = os.path.join(target_dir, filename)
        if not os.path.isfile(path):
            continue
        line = [filename] + find_in_file(path, string_list)
        writer.writerow(line)
    csv_file.close()

if __name__ == "__main__":
    main()
The steps to do this are as follows:
Make a list of all files in the directory (This isn't necessary if you're only interested in a single file)
Extract the names of those files that you're interested in
In a loop, read in those files line by line
See if the line matches your pattern
Extract the part of the line before the first : character
So, the code would look something like this, provided your text files are formatted the way you've shown in the question and that this format is reliably correct:
import sys, os, glob

dir_path = sys.argv[1]
if dir_path[-1] != os.sep:
    dir_path += os.sep

# use standard *NIX wildcards to get your file names; in this case, all the files with a .txt extension
file_list = glob.glob(dir_path + '*.txt')

with open('out_file.csv', 'w') as out_file:
    for filename in file_list:
        with open(filename, 'r') as in_file:
            for line in in_file:
                if 'STRING=SOME_STRING' in line:
                    out_file.write(line.split(':')[0] + '\n')
This program would be run as python extract_paths.py path/to/directory and would give you a file called out_file.csv in your current directory.
This file can then be imported into Excel as a CSV file. If your input is less reliable than you've suggested, regular expressions might be a better choice, as in the sketch below.
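A minimal sketch of the regex variant (the exact pattern is an assumption based on the line format shown in the question):

import re

# match "<path>:STRING=SOME_STRING" at the start of a line and capture the path
pattern = re.compile(r'^([^:]+):STRING=SOME_STRING\b')

with open('out_file.csv', 'w') as out_file:
    with open('input.txt', 'r') as in_file:  # hypothetical input file name
        for line in in_file:
            m = pattern.match(line)
            if m:
                out_file.write(m.group(1) + '\n')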