Python: Find and replace strings in batch csv files - python

I have several hundred csv files that I would like to search for the string "Keyed,Bet" and change it to "KeyedBet". The string may or may not be within the file, and may be in different columns in different files.
I cobbled together the script below, but it doesn't work. I am definitely using replace() incorrectly, but can't quite figure out how, and am creating a new file when I don't really need to- if it simply updated the current file and saved under the same name, that would be fine (but beyond my beginner capabilities).
Where did I go wrong here? Thanks for the help!
import os
import csv
path='.'
filenames = os.listdir(path)
for filename in filenames:
    if filename.endswith('.csv'):
        r=csv.reader(open(filename))
        new_data = []
        for row in r:
            replace("Keyed,Bet","KeyedBet")
            new_data.append(row)
        newfilename = "".join(filename.split(".csv")) + "_edited3.csv"
        with open(newfilename, "w") as f:
            writer = csv.writer(f)
            writer.writerows(new_data)

Why reinvent the wheel? Just download sed + its dependencies, then
sed -i 's/Keyed,Bet/KeyedBet/ig' *.csv
Edit: The command above should work fine in Linux. Windows sed requires its quoted tokens to be double-quoted, rather than single.
sed -i "s/Keyed,Bet/KeyedBet/ig" *.csv

If you want to change the original files you can use fileinput.input with inplace=True to actually modify the original file, glob will find all the csv files for you in the given directory:
import os
import fileinput
from glob import iglob
path = '.'
# with inplace=True, whatever print() writes goes back into the file being read
for line in fileinput.input(iglob(os.path.join(path, "*.csv")), inplace=True):
    print(line.replace("Keyed,Bet", "KeyedBet"), end="")
Not quite one line but a lot less than 15.
To create new files:
import os
from glob import glob
path = '.'
# glob() returns a complete list, so the new *_edited3.csv files are not picked up mid-loop
for filename in glob(os.path.join(path, "*.csv")):
    # filename already includes path, so it is not joined with path again
    with open(filename) as f, open(os.path.splitext(filename)[0] + "_edited3.csv", "w") as f2:
        for line in f:
            f2.write(line.replace("Keyed,Bet", "KeyedBet"))
Since you are only replacing a string, it is easier to open the files without the csv module and use str.replace. If you knew the string always appeared in the same column, the csv module would be the better option, but here it seems the substring can appear anywhere.
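A minimal sketch that combines the ideas above into one helper, assuming the files are UTF-8 text and small enough to read into memory. The name replace_in_files and its parameters are illustrative and do not come from the answers. It rewrites a file only when the target string is actually present, and returns the paths it changed.

```python
import glob
import os


def replace_in_files(directory, old, new, pattern="*.csv"):
    """Replace `old` with `new` in every matching file; return changed paths."""
    changed = []
    # glob.glob builds the full list up front, so files written during
    # the loop cannot affect the iteration.
    for path in sorted(glob.glob(os.path.join(directory, pattern))):
        # newline="" keeps the original line endings untouched
        with open(path, encoding="utf-8", newline="") as f:
            text = f.read()
        if old not in text:
            continue  # nothing to replace, leave the file alone
        with open(path, "w", encoding="utf-8", newline="") as f:
            f.write(text.replace(old, new))
        changed.append(path)
    return changed
```

Calling replace_in_files('.', "Keyed,Bet", "KeyedBet") would cover the case in the question; keeping a backup of the folder first is a sensible precaution, since the files are overwritten.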

Related

Open all files matching regex - python

I want to open and manipulate all files in a directory that have a numbered extension (eg. .342) My regex is '(.[0-9]{3})' I'm going to combine them all in one single file and massage them before outputting the new file.
I can't figure out what I'm supposed to feed the regex as input. I know I want to feed it the list of dir files. I guess I iterate through every file in the directory first, and put only the matched ones in matchlist, THEN I loop through matchlist and open them.
(I've looked at a bunch of examples.)
This is where I am so far.
import glob, os, re
Path = "data"
os.chdir(Path)
matchlist = re.search('(.[0-9]{3})', file )
for file in glob.glob(matchlist):
    with open(file) as fp:
        for line in fp:
            print(line.strip())
Keep in mind that globs use a different syntax than regex.
You probably want either:
for filename in os.listdir():
    # the trailing $ restricts the match to the end of the name
    if re.search(r'(\.[0-9]{3})$', filename):
        # ...
or:
for file in glob.glob('./*.[0-9][0-9][0-9]'):
    # ...
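For the stated goal of merging the matched files into one, a rough sketch could look like the following. The function name combine_numbered_files and the output path are assumptions made for illustration, and the "massaging" step is reduced to stripping whitespace from each line.

```python
import os
import re

# Matches names that end in a dot followed by exactly three digits.
NUMBERED_EXT = re.compile(r"\.[0-9]{3}$")


def combine_numbered_files(directory, out_path):
    """Concatenate every file named like 'name.342' into out_path."""
    # The list of names is built before the output file is opened.
    names = sorted(n for n in os.listdir(directory) if NUMBERED_EXT.search(n))
    with open(out_path, "w") as out:
        for name in names:
            with open(os.path.join(directory, name)) as fp:
                for line in fp:
                    out.write(line.strip() + "\n")
    return names
```

The regex is applied to each filename in turn, which is the piece the question was missing: re.search needs one string at a time, not the whole directory listing.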

How to load data set having multiple 'No-extension files' in python?

I am trying to load a dataset for my machine learning project and it requires me to load files having no extensions.
I tried :
import os
import glob
files = filter(os.path.isfile, glob.glob("./[0-9]*"))
for name in files:
    with open(name) as fh:
        contents = fh.read()
But doesn't return anything, mainly that glob command has nothing in it.
Also tried :
import os
import glob
path = './dataset1/training_validation/2012-07-10/'
for infile in glob.glob(os.path.join(path, '*')):
    print("test")
    file = open(infile, 'r')
    print(file)
but this returns [] because of that glob command.
I'm stuck in here and couldn't find anything over the internet.
My actual problem is to load 'no extension files in a training and testing set' from two folders, validation, and the test itself. I can iterate through the folder but don't know how to handle those file types.
When I open those files in a text editor, it shows me something like this.
So I know that it's a binary format of an image, but I have no idea how to store them and train on them.
Any help would be appreciated. Thanks.
Two things:
File extensions (.txt, .dat, .bat, .f90, etc.) are not meaningful to Python, at least when using glob or numpy or something of the sort, because the extension is just part of a string. Some of us were raised (within Windows) to believe that file extensions mean something (I too fell for it).
The file you are looking at is a text file containing the ASCII representation of a binary image as 0s and 1s. So it's not a binary file, and it's not an image file (per se), but it is a text file, which means we can read it as such from Python.
To read this in, you could do either:
1. Use numpy to do data = numpy.loadtxt(<filename>), however you might have trouble delimiting the digits.
2. Use Python's standard open function on the file, and loop through each line using for line in <file_handle>:. This way, each row of data is a string, which can be parsed easily (see documentation on string indexing).
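The second option might be sketched as follows, assuming each line of the file is an undelimited run of '0' and '1' characters (the question's screenshot is not available, so this layout is a guess). The function name read_bitmap is illustrative.

```python
def read_bitmap(path):
    """Read a text file of 0/1 digits into a list of integer rows."""
    rows = []
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if line:  # skip blank lines
                # each character of the line becomes one pixel value
                rows.append([int(ch) for ch in line])
    return rows
```

The resulting list of lists can be passed to numpy.array if an array is needed for training.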
Good luck!
IMO this simply means that your path does not exist.
As a first test, try an absolute path to your folder; you may have confused the folder's position relative to your current working directory.
I got it to work with the following code.
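import random
from os import listdir
from os.path import isfile, join
# dirName is the folder that holds the extensionless files
fileNames = [f for f in listdir(dirName) if isfile(join(dirName, f))]
random.shuffle(fileNames)
for files in fileNames:
    data = open(join(dirName, files), 'r')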
Thanks for your responses.

Errors with Glob while outputting file names

I am combining two questions here because they are related to each other.
Question 1: I am trying to use glob to open all the files in a folder but it is giving me "Syntax Error". I am using Python 3.xx. Has the syntax changed for Python 3.xx?
Error Message:
File "multiple_files.py", line 29
files = glob.glob(/src/xyz/rte/folder/)
SyntaxError: invalid syntax
Code:
import csv
import os
import glob
from pandas import DataFrame, read_csv
#extracting
files = glob.glob(/src/xyz/rte/folder/)
for fle in files:
    with open (fle) as f:
        print("output" + fle)
    f_read.close()
Question 2: I want to read input files, append "output" to the names and print out the names of the files. How can I do that?
Example: Input file name would be - xyz.csv and the code should print output_xyz.csv .
Your help is appreciated.
Your first problem is that strings, including pathnames, need to be in quotes. This:
files = glob.glob(/src/xyz/rte/folder/)
… is trying to divide a bunch of variables together, but the leftmost and rightmost divisions are missing operands, so you've confused the parser. What you want is this:
files = glob.glob('/src/xyz/rte/folder/')
Your next problem is that this glob pattern doesn't have any globs in it, so the only thing it's going to match is the directory itself.
That's perfectly legal, but kind of useless.
And then you try to open each match as a text file. Which you can't do with a directory, hence the IsADirectoryError.
The answer here is less obvious, because it's not clear what you want.
Maybe you just wanted all of the files in that directory? In that case, you don't want glob.glob, you want listdir (or maybe scandir): os.listdir('/src/xyz/rte/folder/').
Maybe you wanted all of the files in that directory or any of its subdirectories? In that case, you could do it with pathlib's Path.rglob (or glob.glob with recursive=True and a ** pattern), but os.walk is probably clearer.
Maybe you did want all the files in that directory that match some pattern, so glob.glob is right—but in that case, you need to specify what that pattern is. For example, if you wanted all .csv files, that would be glob.glob('/src/xyz/rte/folder/*.csv').
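The three possibilities above can be sketched side by side. The function names are made up for illustration; only os.listdir, os.walk, os.path and glob.glob are real library calls.

```python
import glob
import os


def files_in_dir(folder):
    """Files directly inside folder (subdirectories are skipped)."""
    return sorted(
        os.path.join(folder, name)
        for name in os.listdir(folder)
        if os.path.isfile(os.path.join(folder, name))
    )


def files_in_tree(folder):
    """Files in folder and in all of its subdirectories."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(folder):
        found.extend(os.path.join(dirpath, name) for name in filenames)
    return sorted(found)


def files_matching(folder, pattern="*.csv"):
    """Entries directly inside folder whose names match a glob pattern."""
    return sorted(glob.glob(os.path.join(folder, pattern)))
```

Each helper returns full paths, so the results can be passed straight to open().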
Finally, you say "I want to read input files, append "output" to the names and print out the names of the files". Why do you want to read the files if you're not doing anything with the contents? You can do that, of course, but it seems pretty wasteful. If you just want to print out the filenames with output appended, that's easy:
import os
for filename in os.listdir('/src/xyz/rte/folder/'):
    print('output_' + filename)
This works in http://pyfiddle.io:
Docs: https://docs.python.org/3/library/glob.html
import csv
import os
import glob
# create some files
for n in ["a","b","c","d"]:
    with open('{}.txt'.format(n),"w") as f:
        f.write(n)
print("\nFiles before")
# get all files
files = glob.glob("./*.*")
for fle in files:
    print(fle) # print file
    path,fileName = os.path.split(fle) # split name from path
    # open file for read and second one for write with modified name
    with open (fle) as f,open('{}{}output_{}'.format(path,os.sep, fileName),"w") as w:
        content = f.read() # read all
        w.write(content.upper()) # write all modified
# check files afterwards
print("\nFiles after")
files = glob.glob("./*.*") # pattern for all files
for fle in files:
    print(fle)
Output:
Files before
./d.txt
./main.py
./c.txt
./b.txt
./a.txt
Files after
./d.txt
./output_c.txt
./output_d.txt
./main.py
./output_main.py
./c.txt
./b.txt
./output_b.txt
./a.txt
./output_a.txt
I am on Windows and would use os.walk (docs) instead.
for d,subdirs,files in os.walk("./"): # unpack the current dir, its subdirs, and its files
    print("Current dir:", d)
    print("Subdirs:", subdirs)
    print("Files:", files)
Output:
Current dir: ./
Subdirs: []
Files: ['d.txt', 'output_c.txt', 'output_d.txt', 'main.py', 'output_main.py',
'c.txt', 'b.txt', 'output_b.txt', 'a.txt', 'output_a.txt']
It also recurses into subdirs.

Iterating over a text files in a subdirectory

How do I iterate over text files only within a directory? What I have thus far is;
for file in glob.glob('*'):
    f = open(file)
    text = f.read()
    f.close()
This works, however I am having to store my .py file in the same directory (folder) to get it to run, and as a result the iteration is including the .py file itself. Ideally what I want to command is either;
"Look in this subdirectory/folder, and iterate over all the files in there"
OR...
"Look through all files in this directory and iterate over those with .txt extension"
I'm sure I'm asking for something fairly straightforward, but I do not know how to proceed. It's probably worth highlighting that I found the glob module through trial and error, so if this is the wrong way to go about it, feel free to correct me! Thanks.
The glob.glob function actually takes a globbing pattern as its parameter.
For instance, "*.txt" will match the files whose names end with .txt.
Here is how you can use it:
for file in glob.glob("*.txt"):
    f = open(file)
    text = f.read()
    f.close()
If however you want to exclude some specific files, say .py files, this is not directly supported by globbing's syntax, as explained here.
In that case, you'll need to get those files, and manually exclude them:
pythonFiles = glob.glob("*.py")
otherFiles = [f for f in glob.glob("*") if f not in pythonFiles]
glob.glob() uses the same wildcard pattern matching as your standard unix-like shell. The pattern can be used to filter on extensions of course:
# this will list all ".py" files in the current directory
>>> glob.glob("*.py")
['__init__.py', 'manage.py', 'fabfile.py', 'fixmig.py']
but it can also be used to explore a given path, relative:
>>> glob.glob("../*")
['../etc', '../docs', '../setup.sh', '../tools', '../project', '../bin', '../pylint.html', '../sql']
or absolute:
>>> glob.glob("/home/bruno/Bureau/mailgun/*")
['/home/bruno/Bureau/mailgun/Domains_ Verify - Mailgun.html', '/home/bruno/Bureau/mailgun/Domains_ Verify - Mailgun_files']
And you can of course do both at once:
>>> glob.glob("/home/bruno/Bureau/*.pdf")
['/home/bruno/Bureau/marvin.pdf', '/home/bruno/Bureau/24-pages.pdf', '/home/bruno/Bureau/alice-in-wonderland.pdf']
The solution is very simple.
for file in glob.glob('*'):
    if not file.endswith('.txt'):
        continue
    f = open(file)
    text = f.read()
    f.close()
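For the other half of the question ("look in this subdirectory"), the folder can simply be joined into the glob pattern, so the script no longer has to live next to the text files. A small sketch, where read_text_files and its folder argument are hypothetical names:

```python
import glob
import os


def read_text_files(folder):
    """Return {path: contents} for every .txt file directly inside folder."""
    texts = {}
    for path in sorted(glob.glob(os.path.join(folder, "*.txt"))):
        with open(path) as f:  # the with block closes the file for us
            texts[path] = f.read()
    return texts
```

Because the pattern ends in .txt, a .py script stored in the same folder would not be picked up.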

I am trying to write a Python Script to Print a list of Files In Directory

As the title implies, I am looking to create a script that will print the list of file names in a directory to a CSV file.
I have a folder on my desktop that contains approximately 150 PDFs. I'd like to have the file names written to a CSV.
I am brand new to Python and may be jumping out of the frying pan and into the fire with this project.
Can anyone offer some insight to get me started?
First, grab all of the files in the directory, then simply write them to a file.
from os import listdir
from os.path import isfile, join
import csv
onlyfiles = [f for f in listdir("./") if isfile(join("./", f))]
# newline='' stops the csv module from writing blank lines on Windows (Python 3)
with open('file_name.csv', 'w', newline='') as print_to:
    writer = csv.writer(print_to)
    writer.writerow(onlyfiles)
Please Note
"./" in the listdir call is the directory you want to grab the files from.
Please replace 'file_name.csv' with the name of the file you want to write to.
The following will create a csv file with all *.pdf files:
from glob import glob
with open('/tmp/filelist.csv', 'w') as fout:
    # write the csv header -- optional
    fout.write("filename\n")
    # write each filename with a newline character
    fout.writelines(['%s\n' % fn for fn in glob('/path/to/*.pdf')])
glob() is a nice shortcut to using listdir because it supports wildcards.
import os
csvpath = "csvfile.csv"
dirpath = "."
f = open(csvpath, "w")
f.write(",".join(os.listdir(dirpath)))
f.close()
This can be improved to present the filenames in whatever form you need, for example so that they can be read back later. For instance, it most probably won't write Unicode filenames as UTF-8 and may garble the encoding, but that is easy to fix.
If the directory is very big, with many files, you may have to wait some time for os.listdir() to return them all. This can also be addressed by using another method instead of os.listdir().
To differentiate between files and subdirectories, see Michael's answer.
Also, using os.path.isfile() or os.path.isdir(), you can recursively collect the files in all subdirectories if you wish.
Like this:
def getall(path):
    files = []
    for x in os.listdir(path):
        x = os.path.join(path, x)
        if os.path.isdir(x):
            files += getall(x)
        else:
            files.append(x)
    return files
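Putting the pieces together for the original request, here is one possible sketch that walks a folder with os.walk and writes one filename per CSV row. The name write_file_list, the "filename" header and the default ".pdf" filter are assumptions chosen for illustration.

```python
import csv
import os


def write_file_list(root, csv_path, extension=".pdf"):
    """Write the paths of matching files under root to csv_path, one per row."""
    names = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            # case-insensitive check so that .PDF is found as well
            if name.lower().endswith(extension):
                names.append(os.path.join(dirpath, name))
    names.sort()
    # newline="" is what the csv module expects for files it writes to
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["filename"])
        writer.writerows([n] for n in names)
    return names
```

Writing one name per row, rather than all names in a single row, tends to be easier to open in a spreadsheet, and csv.writer takes care of quoting names that contain commas.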
