os.path.basename to outfile - python

For every input file processed (see code below) I am trying to use "os.path.basename" to write to a new output file - I know I am missing something obvious...?
import os
import glob
import gzip

dbpath = '/home/university/Desktop/test'
for infile in glob.glob(os.path.join(dbpath, 'G[D|E]/????/*.gz')):
    print("current file is: " + infile)
    **
    outfile = os.path.basename('/home/university/Desktop/test/G[D|E]/????/??????.xaa.fastq.gz').rsplit('.xaa.fastq.gz')[0]
    file = open(outfile, 'w+')
    **
    gzsuppl = Chem.ForwardSDMolSupplier(gzip.open(infile))
    for m in gzsuppl:
        if m is None: continue
        ...etc
    file.close()
    print(count)
It is not clear to me how to capture the variable [0] (i.e. everything upstream of .xaa.fastq.gz) and use it as the basename for the new output file.
Unfortunately it simply writes the new output file as "??????" rather than the actual sequence of 6 letters.
Thanks for any help given.

This seems like it will get everything upstream of the .xaa.fastq.gz in the paths returned from glob() in your sample code:
import os

filepath = '/home/university/Desktop/test/GD /AAML/DEAAML.xaa.fastq.gz'
filepath = os.path.normpath(filepath)  # Changes path separators for Windows.

# This section was adapted from answer https://stackoverflow.com/a/3167684/355230
folders = []
while 1:
    filepath, folder = os.path.split(filepath)
    if folder:
        folders.append(folder)
    else:
        if filepath:
            folders.append(filepath)
        break
folders.reverse()

if len(folders) > 1:
    # The last element of folders should contain the original filename.
    filename_prefix = os.path.basename(folders[-1]).split('.')[0]
    outfile = os.path.join(*(folders[:-1] + [filename_prefix + '.rest_of_filename']))
    print(outfile)  # -> \home\university\Desktop\test\GD \AAML\DEAAML.rest_of_filename
Of course what ends up in outfile isn't the final path plus filename, since I don't know what the remainder of the filename will be and just put a placeholder in (the '.rest_of_filename').

I'm not familiar with the kind of input data you're working with, but here's what I can tell you:
The "something obvious" you're missing is that outfile has no connection to infile. Your outfile line uses the ?????? rather than the actual filename because that's what you ask for. It's glob.glob that turns it into a list of matches.
Here's how I'd write that aspect of the outfile line:
outfile = infile.rsplit('.xaa.fastq.gz', 1)[0]
(The , 1 ensures that it'll never split more than once, no matter how crazy a filename gets. It's just a good habit to get into when using split or rsplit like this.)
You're also setting yourself up for a bug: the glob pattern matches *.gz files which don't necessarily end in .xaa.fastq.gz. A stray .gz file that happens to wind up in the folder listing would leave outfile with the same path as infile, and you'd end up writing to the input file.
There are three solutions to this problem which apply to your use case:
1. Use *.xaa.fastq.gz instead of *.gz in your glob. I don't recommend this because it's easy for a typo to sneak in and make them different again, which would silently reintroduce the bug.
2. Write your output to a different folder than you took your input from:
outfile = os.path.join(outpath, os.path.relpath(infile, dbpath))
outparent = os.path.dirname(outfile)
if not os.path.exists(outparent):
    os.makedirs(outparent)
3. Add an assert outfile != infile line so the program will die with a meaningful error message in the "this should never actually happen" case, rather than silently doing incorrect things.
The indentation of what you posted could be wrong, but it looks like you're opening a bunch of files, then only closing the last one. My advice is to use this instead, so it's impossible to get that wrong:
with open(outfile, 'w+') as file:
    # put things which use `file` here
The name file shadows the Python built-in of the same name (in Python 2), and the variable names you chose are unhelpful. I'd rename infile to inpath, outfile to outpath, and file to outfile. That way, you can tell whether each one is a path (i.e. a string) or a Python file object just from the variable name, and there's no risk of accessing file before you (re)define it and getting a very confusing error message.
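Putting those suggestions together, a minimal sketch of the loop might look like this (the molecule-processing body is elided as in your question, and the separate output folder is a made-up path):
import glob
import gzip
import os

dbpath = '/home/university/Desktop/test'      # input tree from the question
outdir = '/home/university/Desktop/test_out'  # hypothetical separate output folder

for inpath in glob.glob(os.path.join(dbpath, 'G[D|E]/????/*.xaa.fastq.gz')):
    # derive the prefix from the matched path, not from a wildcard pattern
    prefix = os.path.basename(inpath).rsplit('.xaa.fastq.gz', 1)[0]
    outpath = os.path.join(outdir, prefix)
    assert outpath != inpath  # the "this should never actually happen" guard

    with open(outpath, 'w+') as outfile:
        # ... process gzip.open(inpath) and write results to outfile ...
        pass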

Related

Extracting the last instance of specific data from x number of files using python

I have a folder containing hundreds of files (scan_zmat_x.txt) where x is an incremental int [1,2,3...]. I need to open the file, find the last instance of a line, let's call it "gumballs" for now. Then I need to put everything in a new file. So far I've tried using .sh scripts but I only have access to a Windows machine currently. So py is a good alternative. I'm really stuck and could use some guidance.
I appreciate it. Cheers.
#!/bin/tcsh
Efile=opt/e.txt
Logs=opt/infiles/scan_zmat_$i.log
for i in Logs do
    grep -winr "gumballs" scan_zmat_$i.log | tail -n 1 > $Efile
done
If you're willing to wait, just find each line with the value you want in each file, keeping the last one each time
import glob

search_string = "gumballs"

with open("results.txt", "w") as fh_results:  # open for writing
    for name_file in glob.iglob("scan_zmat_*.txt"):
        discovered_line = None  # clear the match for each file
        with open(name_file) as fh:
            for line in fh:
                if search_string in line:  # update on each match
                    discovered_line = line
        if discovered_line is not None:
            fh_results.write(discovered_line)  # includes newline chars
        # else:  # optional message
        #     print(f"WARNING: no lines in '{name_file}' matched '{search_string}'")
Caveats
Both this and your original search may be sorted by the inode (or Windows/NTFS equivalent), which is normally the order the files were written in, but not necessarily.
If you want to be certain they're sorted, use glob.glob and sort it as you prefer instead of using glob.iglob directly (which yields an iterable of the filenames in the order provided by the filesystem); see the sketch below.
If the files are large, it could be more efficient to seek backwards in blocks (repeatedly .seek()ing).
If you don't want to wait, here's a version that seeks backwards from the end of each file instead. Not tested!
#!/usr/bin/env python3
from os.path import join as joinpath
from os import listdir  # like 'ls' in bash

# Path to files you want to read
DIRPATH = "/home/some/path/"
OUTFILE = "out.txt"  # will be created in cwd

collected = []  # OS dependent sorting

def main():
    for fname in listdir(DIRPATH):
        # Check with 'os.path.isfile' if you want
        if fname.startswith("scan_zmat_") and fname.endswith(".txt"):
            with open(joinpath(DIRPATH, fname), "rb") as fd:
                # GOTO second last byte
                fd.seek(-1, 2)
                # every read() advances the pointer
                while fd.read(1) == b"\n":
                    fd.seek(-2, 1)
                end = fd.tell() + 1
                fd.seek(-1, 1)
                while fd.read(1) != b"\n":
                    fd.seek(-2, 1)
                collected.append(fd.read(end - fd.tell()))
    with open(OUTFILE, "wb") as fd:
        fd.writelines(collected)

if __name__ == "__main__":
    main()

Modifying a file in-place inside nested for loops

I am iterating directories and files inside of them while I modify in place each file. I am looking to have the new modified file being read right after.
Here is my code with descriptive comments:
# go through each directory based on their ids
for id in id_list:
    id_dir = os.path.join(ouput_dir, id)
    os.chdir(id_dir)
    # go through all files (with a specific extension)
    for filename in glob('*' + ext):
        # modify the file by replacing all new-line characters with an empty space
        with fileinput.FileInput(filename, inplace=True) as f:
            for line in f:
                print(line.replace('\n', ' '), end='')
        # here I would like to read the NEW modified file
        with open(filename) as newf:
            content = newf.read()
As it stands, newf is not the new modified file, but the original f. I think I understand why that is; however, I have found it difficult to overcome the issue.
I can always do 2 separate iterations (go through each directory based on their ids, go through all files with the specific extension and modify each file, then repeat the whole iteration to read each one of them), but I was hoping there is a more efficient way around it. Perhaps it is possible to restart the inner for loop after the modification has taken place, and only then have the read take place (to at least avoid repeating the outer for loop).
Any ideas/designs for achieving the above in a clean and efficient way?
For me it works with this code:
#!/usr/bin/env python3
import os
from glob import glob
import fileinput

id_list = ['1']
ouput_dir = '.'
ext = '.txt'

# go through each directory based on their ids
for id in id_list:
    id_dir = os.path.join(ouput_dir, id)
    os.chdir(id_dir)
    # go through all files (with a specific extension)
    for filename in glob('*' + ext):
        # modify the file by replacing all new-line characters with an empty space
        for line in fileinput.FileInput(filename, inplace=True):
            print(line.replace('\n', ' '), end="")
        # here I would like to read the NEW modified file
        with open(filename) as newf:
            content = newf.read()
            print(content)
notice how I iterate over the lines!
I am not saying that the way you are going about doing this is incorrect, but I feel that you are overcomplicating it. Here is my super simple solution.
from glob import glob

# `ext` is assumed to be defined as in the question, e.g. ext = '.txt'
for filename in glob('*' + ext):
    # instead of trying to modify in place, we read the data in and strip the line endings
    f_in = (x.rstrip() for x in open(filename, 'rb').readlines())
    with open(filename, 'wb') as f_out:  # we then write the data stream back out
        # extra modification to the data can go here; I just remove the \r and \n and write back out
        for i in f_in:
            f_out.write(i)
    # now there is no need to read the data back in, because we already have a static reference to it

How to take input of a directory

What I'm trying to do is trawl through a directory of log files whose names begin like "filename001.log"; there can be 100s of files in a directory.
The code I want to run against each file is to check that the 8th position of the log always contains a number. I have a suspicion that a non-digit is throwing off our parser. Here's some simple code I'm trying to check this with:
# import re
from urlparse import urlparse

a = '/folderA/filename*.log'  #<< currently this only does 1 file
b = '/folderB/'  #<< I'd like it to write the same file name as it read

with open(b, 'w') as newfile, open(a, 'r') as oldfile:
    data = oldfile.readlines()
    for line in data:
        parts = line.split()
        status = parts[8]  # value of 8th position in the log file
        isDigit = status.isdigit()
        if isDigit == False:
            print " Not A Number :", status
            newfile.write(status)
My problem is:
How do I tell it to read all the files in a directory? (The above really only works for 1 file at a time)
If I find something that is not a number I would like to write that character into a file in a different folder but with the same name as the log file. For example, if I find filename002.log has a "*" in one of the log lines, I would like folderB/filename002.log to be made and the non-digit character to be written to it.
Sounds simple enough; I'm just not very good at coding.
To read files in one directory matching a given pattern and write to another, use the glob module and the os.path functions to construct the output files:
import glob
import os.path

srcpat = '/folderA/filename*.log'
dstdir = '/folderB'

for srcfile in glob.iglob(srcpat):
    if not os.path.isfile(srcfile): continue
    dstfile = os.path.join(dstdir, os.path.basename(srcfile))
    with open(srcfile) as src, open(dstfile, 'w') as dst:
        for line in src:
            parts = line.split()
            status = parts[8]  # value of 8th position in the log file
            if not status.isdigit():
                print " Not A Number :", status
                dst.write(status)  # Or print >>dst, status if you want newline
This will create empty files even if no bad entries are found. You can deal with that in one of two ways. Either wait until you're finished processing a file (when the with block closes), check the output file's size, and delete it if it's empty; or take a lazy approach: unconditionally delete the output file before beginning iteration, but don't open it. Only when you find a bad value do you open the file (for append instead of write, to keep earlier loops' output from being discarded), write to it, and let it close.
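A minimal sketch of that lazy approach (same hypothetical folders as above):
import glob
import os
import os.path

srcpat = '/folderA/filename*.log'
dstdir = '/folderB'

for srcfile in glob.iglob(srcpat):
    if not os.path.isfile(srcfile): continue
    dstfile = os.path.join(dstdir, os.path.basename(srcfile))
    if os.path.exists(dstfile):
        os.remove(dstfile)  # clear stale output; don't create a new file yet
    for line in open(srcfile):
        status = line.split()[8]
        if not status.isdigit():
            # append so earlier bad values from this file aren't discarded
            with open(dstfile, 'a') as dst:
                dst.write(status + '\n')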
Import os and use: for filename in os.listdir('path'):. This lists every entry in the directory, including the names of subdirectories (it does not recurse into them).
Simply open a second file with the correct path. Since you already have filename from iterating with the above method, you only have to replace the directory. You can use os.path.join for that.
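For example (folder names taken from the question):
import os

src_dir = '/folderA'
dst_dir = '/folderB'

for filename in os.listdir(src_dir):
    src_path = os.path.join(src_dir, filename)  # where to read from
    dst_path = os.path.join(dst_dir, filename)  # same name, different folder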

Python copy and rename many small csv files based on selected characters within the files

I'm not a programmer; I'm a pilot who has done just a little bit of scripting in a past life, so I'm completely non-current at this. I have searched the forum and found somewhat similar problems that, with more expertise and time I might be able to adapt to my problem, but I hope I can get closer by asking my own question. I hope my problem is unique enough that those considering answering do not feel their time is wasted, considering my disadvantage. Anyway here is my problem:
Some of my crew members periodically have a need to rename a few hundred to more than 1,000 small csv files based on a specific convention applied to their contents. Not all of the files are used in a given project, but any subset of them could be used, so automation makes a lot of sense here. Currently this is done manually as needed. I can easily move all these files into a single directory for processing, since all their file names are unique as received.
Here are representative excerpts from two example csv files, preceded by their respective file names (As I receive them):
A_13LSAT_2014-04-23_1431.csv:
1,KDAL CURLO RW13L SAT 20140414_0644,SID,N/A,DDI
2,*,RW13L(AER),SAT
3,RW13L(AER),+325123.36,-0965121.20,RW31R(DER),+325031.35,-0965020.95
4,1,1.2,+325123.36,-0965121.20,0.0,+325031.35,-0965020.95,2.0
3,RW31R(DER),+325031.35,-0965020.95,GH13L,+324947.23,-0964929.84
4,1,2.4,+325031.35,-0965020.95,0.0,+324947.23,-0964929.84,2.0
5,TTT,0,0
5,CVE,0,0
A_RROSEE_2014-04-03_1419.csv:
1,KDFW SEEVR STAR RRONY SEEVR 20140403_1340,STAR,N/A,DDI
2,*,RRONY,SEEVR
3,RRONY,+333455.16,-0952530.56,ROWZE,+333233.02,-0954016.52
4,1,12.6,+333455.16,-0952530.56,0.0,+333233.02,-0954016.52,2.0
5,EIC,0,1
5,SLR,0,0
I know these files are not code, but I entered them indented in this post so they would display properly.
The files must be renamed due to the 8.3 limitation of the platform they are used on.
The convention is:
•On the first line, the first two characters in the second word of the second "cell" (Which are the 6th and 7th characters of the second cell), and,
•on line 2, the first three characters of the third cell, and
•the first three characters of the fourth cell.
The contents and format of the files must remain unaltered. In theory this convention yields unique names for every file so duplication of file names should not be a problem.
The files above would be copied and renamed respectively to:
CURW1SAT.csv
SERROSEE.csv
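To make the convention concrete, here is how the first name is derived (an illustrative snippet only, using the two lines quoted above):
# first two lines of A_13LSAT_2014-04-23_1431.csv
line1 = '1,KDAL CURLO RW13L SAT 20140414_0644,SID,N/A,DDI'
line2 = '2,*,RW13L(AER),SAT'

cells1 = line1.split(',')
cells2 = line2.split(',')
second_word = cells1[1].split()[1]                      # 'CURLO'
name = second_word[:2] + cells2[2][:3] + cells2[3][:3]  # 'CU' + 'RW1' + 'SAT'
print name + '.csv'                                     # -> CURW1SAT.csv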
That's it. Just a script that will scan a directory full of these csv files, and create renamed copies in the same directory according to the convention I just described, based on their contents. I'm attempting to use ActiveState Python 2.7.7.
Thanks in advance for any consideration.
It's not what you'd call pretty, but neither am I; and it works (and it's simple)
import os
import glob

fileset = set(glob.glob(os.path.basename(os.path.join(".", "*.csv"))))
for filename in fileset:
    with open(filename, "r") as f:
        csv_file = f.readlines()
    out = csv_file[0].split(",")[1].split(" ")[1][:2]
    out += csv_file[1].split(",")[2][:3]
    out += csv_file[1].split(",")[3][:3]
    os.rename(filename, out + ".csv")
just drop this in the folder with all the csv's to be renamed and run it
That is indeed not too complicated. Python has out of the box everything you need.
I don't think it's a good idea to rename the files, in case of error (e.g. collision) it would make the process dangerous, copying to another folder is safer.
The code could look like that:
import csv
import os
import os.path
import sys
import shutil

def Process(input_directory, output_directory, filename):
    """This method reads the file named 'filename' in input_directory and copies
    it to output_directory, renaming it."""
    # Read the file and extract the first 2 lines.
    with open(os.path.join(input_directory, filename), 'r') as csv_file:
        reader = csv.reader(csv_file, delimiter=',')
        line1 = reader.next()
        line2 = reader.next()
    line1_second_cell = line1[1]
    # split() separates words by spaces into a list, [1] takes the second.
    second_word = line1_second_cell.split()[1]
    line2_third_cell = line2[2]
    line2_fourth_cell = line2[3]
    # [:2] takes the first two characters from a string.
    new_filename = second_word[:2] + line2_third_cell[:3] + line2_fourth_cell[:3]
    new_filename += '.csv'
    print 'copying', filename, 'to', new_filename
    shutil.copyfile(
        os.path.join(input_directory, filename),
        os.path.join(output_directory, new_filename))

# sys.argv is the list of arguments passed on the command line.
if len(sys.argv) == 3:
    input_directory = sys.argv[1]
    output_directory = sys.argv[2]
    # os.listdir gives all the entries in the directory (including the names
    # of subdirectories; the .csv check below filters those out).
    for filename in os.listdir(input_directory):
        if filename.endswith(".csv"):
            Process(input_directory, output_directory, filename)
else:
    print "Usage:", sys.argv[0], "source_directory target_directory"
On windows you can run it in a command line (cmd.exe):
C:\where_your_python_is\python.exe C:\where_your_script_is\renamer.py C:\input C:\output
On linux it would be a little simpler as the python binary is in the path:
python /where_your_script_is/renamer.py /input /output
Put this in a script, and when you run it, give it the directory name as an argument on the command line:
import csv
import sys
import os

def rename_csv_file(filename):
    global directory
    with open(filename, 'r') as csv_file:
        newfilename = str()
        rownum = 0
        filereader = csv.reader(csv_file, delimiter=',')
        for row in filereader:
            if rownum == 0:
                newfilename = row[1].split()[1][:2]
            elif rownum == 1:
                newfilename += row[2][:3]
                newfilename += row[3][:3]
                break
            rownum += 1
    # rename after the file has been closed
    newfilename += '.csv'
    newfullpath = os.path.join(directory, newfilename)
    os.rename(filename, newfullpath)

if len(sys.argv) < 2:
    print "Usage: {} directory_name".format(sys.argv[0])
    sys.exit()

directory = sys.argv[1]
csvfiles = [os.path.join(directory, f) for f in os.listdir(directory)
            if os.path.isfile(os.path.join(directory, f)) and f.endswith('.csv')]
for f in csvfiles:
    rename_csv_file(f)
This assumes that every csv in your directory needs to be renamed. The code could be more condensed, but I tried to spell it out a bit so you could see what was going on.
import os
import csv
import shutil

# change this to the directory where your csvs are stored
dirname = r'C:\yourdirectory'
os.chdir(dirname)

for item in os.listdir(dirname):  # look through directory contents
    if item.endswith('.csv'):
        f = open(item)
        r = csv.reader(f)
        line1 = r.next()  # get the first line of csv
        line2 = r.next()  # get the second line of csv
        f.close()
        name1 = line1[1].split()[1][:2]  # first part: 2 chars of the 2nd word of the 2nd cell
        name2 = line2[2][:3]  # second part
        name3 = line2[3][:3]  # third part
        newname = name1 + name2 + name3 + '.csv'
        shutil.copy2(os.path.join(dirname, item), newname)  # copied csv with newname

Find "string" in Text File - Add it to Excel File Using Python

I ran a grep command and found several hundred instances of a string in a large directory of data. This file is 2 MB and has strings that I would like to extract out and put into an Excel file for easy access later. The part that I'm extracting is a path to a data file I need to work on later.
I have been reading about Python lately and thought I could somehow do this extraction automatically. But I'm a bit stumped how to start. I have this so far:
data = open(r"C:\python27\text.txt").read()
if "string" in data:
But then I'm not sure what to use to get out of the file what I want. Anything for a beginner to chew on?
EDIT
Here is some more info on what I was looking for. I have several hundred lines in a text file. Each line has a path and some strings like this:
/path/to/file:STRING=SOME_STRING, ANOTHER_STRING
What I would like from these lines are the paths of those lines with a specific "STRING=SOME_STRING". For example if the line looks like this, I want the path (/path/to/file) to be extracted to another file:
/path/to/file:STRING=SOME_STRING
All this is quite easily done with standard Python, but for "excel" (xls or xlsx) files you'd have to install a third-party library. However, if you just need a 2D table that can be opened in a spreadsheet, you can use Comma Separated Values (CSV) files: these are compatible with Excel and other spreadsheet software, and support for them comes integrated in Python.
As for searching a string inside a file, it is straightforward. You may not even need regular expressions for most things. What information do you want along with the string?
Also, the "os" module onthse standardlib has some functions to list all files in a directory, or in a directory tree. The most straightforward is os.listdir(path)
String methods like "count" and "find" can be used beyond "in" to locate the string in a file, or count the number of ocurrences.
And finally, the "CSV" module can write a properly formated file to read in ay spreadsheet.
Along the away, you may abuse python's buit-in list objects as an easy way to manipulate data sets around.
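A tiny demonstration of those pieces (reusing the path from your question):
import os

data = open(r"C:\python27\text.txt").read()
print data.count("string")        # number of occurrences
print data.find("string")         # index of the first occurrence, or -1
print os.listdir(r"C:\python27")  # every entry in that directory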
Here is a sample program that counts the strings given on the command line in the files of a given directory, and assembles a .CSV table with them:
# -*- coding: utf-8 -*-
import csv
import sys, os

output_name = "count.csv"

def find_in_file(path, string_list):
    count = []
    file_ = open(path)
    data = file_.read()
    file_.close()
    for string in string_list:
        count.append(data.count(string))
    return count

def main():
    if len(sys.argv) < 3:
        print "Use: %s directory_path <string1> [string2 [...]]\n" % sys.argv[0]
        sys.exit(1)
    target_dir = sys.argv[1]
    string_list = sys.argv[2:]
    csv_file = open(output_name, "wt")
    writer = csv.writer(csv_file)
    header = ["Filename"] + string_list
    writer.writerow(header)
    for filename in os.listdir(target_dir):
        path = os.path.join(target_dir, filename)
        if not os.path.isfile(path):
            continue
        line = [filename] + find_in_file(path, string_list)
        writer.writerow(line)
    csv_file.close()

if __name__ == "__main__":
    main()
The steps to do this are as follows:
1. Make a list of all files in the directory (this isn't necessary if you're only interested in a single file).
2. Extract the names of the files you're interested in.
3. In a loop, read those files line by line.
4. See if each line matches your pattern.
5. Extract the part of the line before the first ':' character.
So, the code would look something like this, provided your text files are formatted the way you've shown in the question and that this format is reliably correct:
import sys, os, glob
dir_path = sys.argv[1]
if dir_path[-1] != os.sep: dir_path+=os.sep
file_list = glob.glob(dir_path+'*.txt') #use standard *NIX wildcards to get your file names, in this case, all the files with a .txt extension
with open('out_file.csv', 'w') as out_file:
for filename in file_list:
with open(filename, 'r') as in_file:
for line in in_file:
if 'STRING=SOME_STRING' in line:
out_file.write(line.split(':')[0]+'\n')
This program would be run as python extract_paths.py path/to/directory and would give you a file called out_file.csv in your current directory.
This file can then be imported into Excel as a CSV file. If your input is less reliable than you've suggested, regular expressions might be a better choice.
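For instance, a stricter regex-based version of the inner loop might look like this (a sketch; the pattern assumes lines exactly like the ones shown in the question, and the file names are illustrative):
import re

# anchor the path before ':' and require an exact STRING=SOME_STRING key/value
pattern = re.compile(r'^([^:]+):STRING=SOME_STRING(?:,|$)')

with open('out_file.csv', 'w') as out_file:
    for line in open('text.txt'):
        match = pattern.match(line)
        if match:
            out_file.write(match.group(1) + '\n')  # just the /path/to/file part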
