Searching multiple text files for two strings? - python

I have a folder with many text files (EPA10.txt, EPA55.txt, EPA120.txt..., EPA150.txt). I have 2 strings that are to be searched in each file and the result of the search is written in a text file result.txt. So far I have it working for a single file. Here is the working code:
if 'LZY_201_335_R10A01' and 'LZY_201_186_R5U01' in open('C:\\Temp\\lamip\\EPA150.txt').read():
    with open("C:\\Temp\\lamip\\result.txt", "w") as f:
        f.write('Current MW in node is EPA150')
else:
    with open("C:\\Temp\\lamip\\result.txt", "w") as f:
        f.write('NOT EPA150')
Now I want this to be repeated for all the text files in the folder. Please help.

Since you don't know all the file names in advance (EPA1.txt to EPA150.txt, say), keep them together in one folder and use os.listdir() to get the list of file names, for example listdir("C:/Temp/lamip").
Also, your if statement is wrong: 'a' and 'b' in text is parsed as 'a' and ('b' in text), and a non-empty string is always true, so only the second string is actually checked. Test each string separately instead:
text = file.read()
if "string1" in text and "string2" in text:
Here's the code:
from os import listdir
with open("C:/Temp/lamip/result.txt", "w") as f:
    for filename in listdir("C:/Temp/lamip"):
        # Only search the .txt inputs, and skip the output file itself
        if not filename.endswith('.txt') or filename == 'result.txt':
            continue
        with open('C:/Temp/lamip/' + filename) as currentFile:
            text = currentFile.read()
        if ('LZY_201_335_R10A01' in text) and ('LZY_201_186_R5U01' in text):
            f.write('Current MW in node is ' + filename[:-4] + '\n')
        else:
            f.write('NOT ' + filename[:-4] + '\n')
PS: You can use / instead of \\ in your paths; Windows accepts forward slashes as path separators, so Python can open such paths as-is.
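To see why the original condition misbehaves, here is a minimal sketch. The helper name contains_all and the sample text are made up for illustration; only the two search strings come from the question.

```python
def contains_all(text, needles):
    """Return True only if every string in needles occurs in text."""
    return all(needle in text for needle in needles)


# Sample text (hypothetical): the second ID is present, the first is missing.
text = "header LZY_201_186_R5U01 footer"

# Python parses this as: 'LZY_201_335_R10A01' and ('LZY_201_186_R5U01' in text)
# The left operand is a non-empty string, so it never makes the test fail.
buggy = bool('LZY_201_335_R10A01' and 'LZY_201_186_R5U01' in text)

# Checking each string separately gives the intended result.
fixed = contains_all(text, ['LZY_201_335_R10A01', 'LZY_201_186_R5U01'])

print(buggy, fixed)  # prints: True False
```

The buggy form reports a match even though one of the two strings is absent, which is why each string needs its own "in" test.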

Modularise! Modularise!
Well, not in the sense of writing distinct Python modules, but by isolating the different tasks at hand:
1. Find the files you wish to search.
2. Read the file and locate the text.
3. Write the result into a separate file.
Each of these tasks can be solved independently. For example, to list the files you have os.listdir, whose output you might want to filter.
For step 2, it does not matter whether you have 1 or 1,000 files to search. The routine is the same. You merely have to iterate over each file found in step 1. This indicates that step 2 could be implemented as a function that takes the filename (and possibly the search strings) as arguments and returns True or False.
Step 3 combines each element from step 1 with the result of step 2.
The result:
import os
files = [fn for fn in os.listdir('C:/Temp/lamip') if fn.endswith('.txt')]
# perhaps filter `files`
def does_fn_contain_string(filename):
    with open('C:/Temp/lamip/' + filename) as blargh:
        content = blargh.read()
    # use `or` instead of `and` if either string is enough
    return 'string1' in content and 'string2' in content
with open('results.txt', 'w') as output:
    for fn in files:
        if does_fn_contain_string(fn):
            output.write('Current MW in node is {0}\n'.format(fn[:-4]))
        else:
            output.write('NOT {0}\n'.format(fn[:-4]))
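The three steps can also be kept as three small functions. The following is only a sketch under assumptions: the function names, the pathlib-based listing, and the choice to skip the output file by name are illustrative, not part of the answer above.

```python
from pathlib import Path


def list_text_files(folder, skip=("result.txt",)):
    """Step 1: sorted .txt files in folder, leaving out the output file."""
    return sorted(p for p in Path(folder).glob("*.txt") if p.name not in skip)


def file_contains_all(path, needles):
    """Step 2: True if every search string occurs in the file."""
    text = Path(path).read_text()
    return all(needle in text for needle in needles)


def write_report(folder, needles, out_name="result.txt"):
    """Step 3: write one line per input file and return the lines written."""
    lines = []
    for path in list_text_files(folder, skip=(out_name,)):
        if file_contains_all(path, needles):
            lines.append("Current MW in node is " + path.stem)
        else:
            lines.append("NOT " + path.stem)
    # The report is written only after every input has been read.
    Path(folder, out_name).write_text("\n".join(lines) + "\n")
    return lines
```

A call such as write_report("C:/Temp/lamip", ["LZY_201_335_R10A01", "LZY_201_186_R5U01"]) would then cover the whole folder; each function can be tested on its own.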

You can do this by creating a for loop that runs through all your .txt files in the current working directory.
import os
with open("result.txt", "w") as resultfile:
    for result in [txt for txt in os.listdir(os.getcwd()) if txt.endswith(".txt")]:
        if result == "result.txt":
            continue  # do not search the output file itself
        with open(result) as infile:
            text = infile.read()
        if 'LZY_201_335_R10A01' in text and 'LZY_201_186_R5U01' in text:
            resultfile.write('Current MW in node is {0}\n'.format(result[:-4]))
        else:
            resultfile.write('NOT {0}\n'.format(result[:-4]))

Related

Extracting a differentiating numerical value from multiple files - PowerShell/Python

I have multiple text files containing different text.
They all contain a single appearance of the same 2 lines I am interested in:
================================================================
Result: XX/100
I am trying to write a script to collect all those XX values (numerical values between 0 and 100), and paste them in a CSV file with the text file name in column A and the numerical value in column B.
I have considered using Python or PowerShell for this purpose.
How can I identify the line where "Result" appears directly below the "===.." line, read it up to the '\n', and then strip "Result: " and "/100" from it?
"Result" and other numerical values could appear elsewhere in the files, but never in the quoted format directly below a "=====" line, like the line I'm interested in.
Thank you!
Edit: I have written this poor naive attempt to collect the numerical values.
import os
dir_path = os.path.dirname(os.path.realpath(__file__))
for filename in os.listdir(dir_path):
    if filename.endswith(".txt"):
        with open(filename,"r") as f:
            lineFound=False
            for index, line in enumerate(f):
                if lineFound:
                    line=line.replace("Result: ", "")
                    line=line.replace("/100","")
                    line.strip()
                    grade=line
                    lineFound=False
                    print(grade, end='')
                    continue
                if index>3:
                    if "================================================================" in line:
                        lineFound=True
I'd still be happy to learn if there's a simple way to do this with PowerShell tbh
For the output, I used csv writer to append the results to a file one by one.
There are three steps involved here. The first is to get a list of files. There are plenty of answers for that on Stack Overflow; the one linked in the docstring below is stupidly complete.
Once you have the list of files, you can simply just load the files themselves one by one, and then do some simple string.split() to get the value you want.
Finally, write the results into a CSV file. Since the CSV file is a simple one, you don't need to use the CSV library for this.
See the code example below. Note that I copied/pasted the function for generating the list of files from my personal github repo. I reuse that one a lot.
import os
def get_files_from_path(path: str = ".", ext: str or list = None) -> list:
    """Find files in path and return them as a list.
    Gets all files in folders and subfolders
    See the answer on the link below for a ridiculously
    complete answer for this.
    https://stackoverflow.com/a/41447012/9267296
    Args:
        path (str, optional): Which path to start on.
            Defaults to '.'.
        ext (str/list, optional): Optional file extension.
            Defaults to None.
    Returns:
        list: list of file paths
    """
    result = []
    for subdir, dirs, files in os.walk(path):
        for fname in files:
            filepath = f"{subdir}{os.sep}{fname}"
            if ext is None:
                result.append(filepath)
            elif isinstance(ext, str) and fname.lower().endswith(ext.lower()):
                result.append(filepath)
            elif isinstance(ext, list):
                for item in ext:
                    if fname.lower().endswith(item.lower()):
                        result.append(filepath)
                        break  # add each file only once
    return result
filelist = get_files_from_path("path/to/files/", ext=".txt")
split1 = "================================================================\nResult: "
split2 = "/100"
with open("output.csv", "w") as outfile:
    outfile.write('filename, value\n')
    for filename in filelist:
        with open(filename) as infile:
            # text between the marker lines and "/100" is the value
            value = infile.read().split(split1)[1].split(split2)[0]
        print(value)
        outfile.write(f'"{filename}", {value}\n')
You could try this.
In this example the filename written to the CSV will be its full (absolute) path. You may just want the base filename.
It uses the same, albeit seemingly unnecessary, mechanism as the question for deriving the source directory; it would be unusual to have your Python script in the same directory as your data.
import os
import glob
equals = '=' * 64
dir_path = os.path.dirname(os.path.realpath(__file__))
outfile = os.path.join(dir_path, 'foo.csv')
with open(outfile, 'w') as csv:
    print('A,B', file=csv)
    for file in glob.glob(os.path.join(dir_path, '*.txt')):
        prev = None
        with open(file) as indata:
            for line in indata:
                t = line.split()
                # prev is None on the first line, so guard before calling startswith
                if len(t) == 2 and t[0] == 'Result:' and prev is not None and prev.startswith(equals):
                    v = t[1].split('/')
                    if len(v) == 2 and v[1] == '100':
                        print(f'{file},{v[0]}', file=csv)
                        break
                prev = line
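A regular expression is another way to pick out the value. This is a sketch, not a definitive solution: it assumes the separator line consists only of "=" characters and sits directly above the "Result: XX/100" line, and the names RESULT_RE, extract_result and rows_to_csv are made up for illustration.

```python
import csv
import io
import re

# A line made only of '=' characters, immediately followed by "Result: NN/100".
RESULT_RE = re.compile(r"^=+[ \t]*\r?\nResult:[ \t]*(\d{1,3})/100", re.MULTILINE)


def extract_result(text):
    """Return the NN of 'Result: NN/100' found right below a '====' line, or None."""
    match = RESULT_RE.search(text)
    return int(match.group(1)) if match else None


def rows_to_csv(rows):
    """Serialise (filename, value) rows to CSV text with the csv module."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical file content: an earlier "Result" line that must be ignored.
sample = "intro\nResult: 5/100 (draft)\n" + "=" * 64 + "\nResult: 87/100\n"
print(rows_to_csv([("report.txt", extract_result(sample))]))  # report.txt,87
```

Returning None when the pattern is absent lets the caller decide whether to skip the file or log it, instead of failing with an IndexError.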

Rename files in a directory with incremental index

INPUT: I want to add increasing numbers to file names in a directory sorted by date. For example, add "01_", "02_", "03_"...to these files below.
test1.txt (oldest text file)
test2.txt
test3.txt
test4.txt (newest text file)
Here's the code so far. I can get the file names, but each character in the file name seems to be its own item in a list.
import os
for file in os.listdir("/Users/Admin/Documents/Test"):
    if file.endswith(".txt"):
        print(file)
The EXPECTED results are:
01_test1.txt
02_test2.txt
03_test3.txt
04_test4.txt
with test1 being the oldest and test4 being the newest.
How do I add a 01_, 02_, 03_, 04_ to each file name?
I've tried something like this. But it adds a '01_' to every single character in the file name.
new_test_names = ['01_'.format(i) for i in file]
print (new_test_names)
If you want to number your files by age, you'll need to sort them first. Call sorted and pass os.path.getmtime as the key parameter; this sorts in ascending order of modification time (oldest to newest).
Use glob.glob to get all text files in a given directory. It is not recursive as written, but a recursive extension is a minimal addition if you are using Python 3.
Use str.zfill to build prefixes of the form 0x_.
Use os.rename to rename your files.
import glob
import os
sorted_files = sorted(
    glob.glob('path/to/your/directory/*.txt'), key=os.path.getmtime)
for i, f in enumerate(sorted_files, 1):
    try:
        head, tail = os.path.split(f)
        os.rename(f, os.path.join(head, str(i).zfill(2) + '_' + tail))
    except OSError:
        print('Invalid operation')
It always helps to make a check using try-except, to catch any errors that shouldn't be occurring.
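Before renaming anything it can help to compute the new names first and inspect them. The sketch below assumes a hypothetical helper, plan_renames, that only builds (old, new) pairs and leaves the actual os.rename call to the caller; widening the prefix beyond two digits for 100 or more files is also an assumption, not something the question asked for.

```python
import os


def plan_renames(paths):
    """Return (old, new) path pairs, numbered from oldest to newest mtime."""
    ordered = sorted(paths, key=os.path.getmtime)
    # At least two digits; more if there are 100 or more files.
    width = max(2, len(str(len(ordered))))
    plan = []
    for index, old in enumerate(ordered, start=1):
        head, tail = os.path.split(old)
        new_name = str(index).zfill(width) + "_" + tail
        plan.append((old, os.path.join(head, new_name)))
    return plan
```

A caller could print the pairs as a dry run and, once they look right, loop over them with os.rename(old, new).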
This should work:
import glob
import os
# glob returns full paths, so put the prefix in front of the base name only
new_test_names = ["{:02d}_{}".format(i, os.path.basename(filename)) for i, filename in enumerate(glob.glob("/Users/Admin/Documents/Test/*.txt"), start=1)]
Or without list comprehension:
for i, filename in enumerate(glob.glob("/Users/Admin/Documents/Test/*.txt"), start=1):
    print("{:02d}_{}".format(i, os.path.basename(filename)))
Three things to learn about here:
glob, which makes this sort of file matching easier.
enumerate, which lets you write a loop with an index variable.
format, specifically the 02d modifier, which prints two-digit numbers (zero-padded).
Two methods to format an integer with a leading zero:
1. Use .format
import os
i = 1
for file in os.listdir("/Users/Admin/Documents/Test"):
    if file.endswith(".txt"):
        print('{0:02d}'.format(i) + '_' + file)
        i+=1
2. Use .zfill
import os
i = 1
for file in os.listdir("/Users/Admin/Documents/Test"):
    if file.endswith(".txt"):
        print(str(i).zfill(2) + '_' + file)
        i+=1
The easiest way is to simply have a variable, such as i, which will hold the number and prepend it to the string using some kind of formatting that guarantees it will have at least 2 digits:
import os
i = 1
for file in os.listdir("/Users/Admin/Documents/Test"):
    if file.endswith(".txt"):
        print('%02d_%s' % (i, file)) # %02d means your number will have at least 2 digits
        i += 1
You can also take a look at enumerate and glob to make your code even shorter (but make sure you understand the fundamentals before using it).
import os
test_dir = '/Users/Admin/Documents/Test'
txt_files = [file
             for file in os.listdir(test_dir)
             if file.endswith('.txt')]
numbered_files = ['%02d_%s' % (i + 1, file)
                  for i, file in enumerate(txt_files)]

How to run a code for multiple fastq files?

I would like to run the following code on multiple fastq files in a folder. The folder contains different fastq files; first I have to read one file, perform the required operations, and store the results in a separate .fastq file; then read the second file, perform the same operations, and save the results in a new second .fastq file. The same procedure should be repeated for all the files in the folder.
How can I do this? Can someone suggest a way?
from Bio.SeqIO.QualityIO import FastqGeneralIterator
fout=open("prova_FiltraN_CE_filt.fastq","w")
fin=open("prova_FiltraN_CE.fastq","rU")
maxN=0
countall=0
countincl=0
with open("prova_FiltraN_CE.fastq", "rU") as handle:
    for (title, sequence, quality) in FastqGeneralIterator(handle):
        countN = sequence.count("N", 0, len(sequence))
        countall+=1
        if countN==maxN:
            fout.write("@%s\n%s\n+\n%s\n" % (title, sequence, quality))
            countincl+=1
fin.close
fout.close
print countall, countincl
I think the following will do what you want. What I did was make your code into a function (and modified it to be what I think is more correct) and then called that function for every .fastq file found in the designated folder. The output file names are generated from the input files found.
from Bio.SeqIO.QualityIO import FastqGeneralIterator
import glob
import os
def process(in_filepath, out_filepath):
    maxN = 0
    countall = 0
    countincl = 0
    with open(in_filepath, "rU") as fin:
        with open(out_filepath, "w") as fout:
            for (title, sequence, quality) in FastqGeneralIterator(fin):
                countN = sequence.count("N", 0, len(sequence))
                countall += 1
                if countN == maxN:
                    fout.write("@%s\n%s\n+\n%s\n" % (title, sequence, quality))
                    countincl += 1
    print os.path.split(in_filepath)[1], countall, countincl
folder = "/path/to/folder" # folder to process
for in_filepath in glob.glob(os.path.join(folder, "*.fastq")):
    root, ext = os.path.splitext(in_filepath)
    if not root.endswith("_filt"): # avoid processing existing output files
        out_filepath = root + "_filt" + ext
        process(in_filepath, out_filepath)
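The naming rule used in that loop can be isolated and checked without Biopython. This is a small sketch in Python 3; the helper name output_path_for and the sample paths are hypothetical.

```python
import os


def output_path_for(in_path, suffix="_filt"):
    """Return the output path for an input file, or None for existing outputs."""
    root, ext = os.path.splitext(in_path)
    if root.endswith(suffix):
        return None  # already a filtered file; do not process it again
    return root + suffix + ext


print(output_path_for("reads/sample1.fastq"))       # reads/sample1_filt.fastq
print(output_path_for("reads/sample1_filt.fastq"))  # None
```

Skipping names that already end in "_filt" matters because the outputs land in the same folder as the inputs and would otherwise be picked up on a second run.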

How would I read and write from multiple files in a single directory? Python

I am writing a Python code and would like some more insight on how to approach this issue.
I am trying to read in multiple files in order that end with .log. With this, I hope to write specific values to a .csv file.
Within the text file, there are X/Y values that are extracted below:
Textfile.log:
X/Y = 5
X/Y = 6
Textfile.log.2:
X/Y = 7
X/Y = 8
DesiredOutput in the CSV file:
5
6
7
8
Here is the code I've come up with so far:
def readfile():
    import os
    i = 0
    for file in os.listdir("\mydir"):
        if file.endswith(".log"):
            return file
def main ():
    import re
    list = []
    list = readfile()
    for line in readfile():
        x = re.search(r'(?<=X/Y = )\d+', line)
        if x:
            list.append(x.group())
        else:
            break
    f = csv.write(open(output, "wb"))
    while 1:
        if (i>len(list-1)):
            break
        else:
            f.writerow(list(i))
            i += 1
if __name__ == '__main__':
    main()
I'm confused on how to make it read the .log file, then the .log.2 file.
Is it possible to just have it automatically read all the files in 1 directory without typing them in individually?
Update: I'm using Windows 7 and Python V2.7
The simplest way to read files sequentially is to build a list and then loop over it. Something like:
for fname in list_of_files:
    with open(fname, 'r') as f:
        #Do all the stuff you do to each file
        pass
This way whatever you do to read each file will be repeated and applied to every file in list_of_files. Since lists are ordered, the files are processed in the order in which they appear in the list.
Borrowing from @The2ndSon's answer, you can pick up the files with os.listdir(dir). This will simply list all files and directories within dir in an arbitrary order. From this you can pull out and order all of your files like this:
allFiles = os.listdir(some_dir)
logFiles = [fname for fname in allFiles if "log" in fname.split('.')]
logFiles.sort(key = lambda x: x.split('.')[-1])
logFiles[0], logFiles[-1] = logFiles[-1], logFiles[0]
The above code will work with files named like "somename.log", "somename.log.2" and so on. You can then take logFiles and plug it in as list_of_files. Note that the last line is only necessary if the first file is "somename.log" instead of "somename.log.1". If the first file has a number on the end, just exclude the last step.
Line By Line Explanation:
allFiles = os.listdir(some_dir)
This line takes all files and directories within some_dir and returns them as a list
logFiles = [fname for fname in allFiles if "log" in fname.split('.')]
Perform a list comprehension to gather all of the files with log in the name as part of the extension. "something.log.somethingelse" will be included, "log_something.somethingelse" will not.
logFiles.sort(key = lambda x: x.split('.')[-1])
Sort the list of log files in place by the last extension. x.split('.')[-1] splits the file name into a list of period delimited values and takes the last entry. If the name is "name.log.5", it will be sorted as "5". If the name is "name.log", it will be sorted as "log".
logFiles[0], logFiles[-1] = logFiles[-1], logFiles[0]
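Swap the first and last entries of the list of log files. This is necessary because the sorting operation will put "name.log" as the last entry and "name.log.1" as the first. Note that with three or more files the swap also moves the lowest-numbered file to the end of the list, so it only gives the intended order when there are two files.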
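For more than two files, or for numbers of 10 and above (which sort before "2" as strings), a numeric sort key avoids the swap altogether. The following is a sketch that assumes the rotated files end in a plain integer; the names log_sort_key and ordered_logs are made up for illustration.

```python
def log_sort_key(name):
    """Sort key: 'x.log' counts as 0, 'x.log.N' counts as the integer N."""
    last = name.rsplit(".", 1)[-1]
    return int(last) if last.isdigit() else 0


def ordered_logs(names):
    """Keep names that have 'log' as one dot-separated part, in numeric order."""
    logs = [n for n in names if "log" in n.split(".")]
    return sorted(logs, key=log_sort_key)


names = ["a.log.10", "a.log.2", "a.log", "notes.txt", "a.log.1"]
print(ordered_logs(names))  # ['a.log', 'a.log.1', 'a.log.2', 'a.log.10']
```

The result of ordered_logs(os.listdir(some_dir)) can then be used as list_of_files in the loop shown earlier.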
If you change the naming scheme for your log files you can easily return a list of files that have the ".log" extension. For example, if you change the file names to Textfile1.log and Textfile2.log you can update readfile() to be:
import os
def readfile():
    my_list = []
    for file in os.listdir("."):
        if file.endswith(".log"):
            my_list.append(file)
    return my_list
print readfile() will then output ['Textfile1.log', 'Textfile2.log']. Using the word 'list' as a variable name is generally avoided, as it shadows Python's built-in list type.

Directory Name + File Name to Text File

I have many directories with files in them. I want to create a comma delimited txt file showing directory name, and the files that are within that particular directory, see example below:
What I'm looking for:
DirName,Filename
999,123.tif
999,1234.tif
999,abc.tif
900,1236.tif
900,xyz.tif
...etc
The python code below pushes a list of those file paths into a text file, however I'm unsure of how to format the list as described above.
My current code looks like:
Update
I've been able to format the text file now, however I'm noticing all the directory names/filenames are not being written to the text file. It is only writing ~4000 of the ~8000 dir/files. Is there some sort limit that I'm reaching with the text file number of rows, the list (mylist) size, or some bad file dir/file character that is stopping it (see updated code below)?
from os import listdir
from os.path import isfile, join
root = r'C:\temp'
mylist = ['2']
for path, subdirs, files in os.walk(root):
    for name in files:
        mylist.append(os.path.join(path, name))
txt = open(r'C:\temp\output.txt', 'w')
txt.write('dir' + ',' + 'file' + '\n')
for item in mylist:
    list = mylist.pop(0)
    dir, filename = os.path.basename(os.path.dirname(list)), os.path.basename(list)
    txt.write(dir + ',' + filename + '\n')
with open(r'C:\temp\output.txt', 'r') as f:
    read_data = f.read()
Thank You
Maybe this helps:
You could get the absolute file paths and then do the following
import os.path
p = "/tmp/999/123.tif"
dir, filename = os.path.basename(os.path.dirname(p)), os.path.basename(p)
Result:
In [21]: dir
Out[21]: '999'
In [22]: filename
Out[22]: '123.tif'
Also consider using csv module to write this kind of files.
import csv
import os.path
# You already have a list of absolute paths
files = ["/tmp/999/123.tif"]
# csv writer
with open('/tmp/out.csv', 'wb') as out_file:
    csv_writer = csv.writer(out_file, delimiter=',')
    csv_writer.writerow(('DirName','Filename'))
    for f in files:
        csv_writer.writerow((os.path.basename(os.path.dirname(f)),
                             os.path.basename(f)))
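The directory walk and the csv writer can also be combined so that no intermediate list is modified while it is being looped over. In the question's code, calling mylist.pop(0) inside "for item in mylist" makes the loop stop after roughly half the entries, which would explain the missing rows. The sketch below targets Python 3, where csv files are opened in text mode with newline=''; the function name write_dir_listing is hypothetical.

```python
import csv
import os


def write_dir_listing(root, out_path):
    """Write one 'DirName,Filename' row per file under root; return the row count."""
    count = 0
    # Keep out_path outside root, otherwise the output file lists itself.
    with open(out_path, "w", newline="") as out_file:
        writer = csv.writer(out_file)
        writer.writerow(("DirName", "Filename"))
        for path, subdirs, files in os.walk(root):
            for name in files:
                # basename of the containing folder, e.g. '999' for .../999/123.tif
                writer.writerow((os.path.basename(path), name))
                count += 1
    return count
```

Comparing the returned count with the number of files expected is a quick check that nothing was skipped.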
