How to search for a string in a folder of text files using Python

I am writing some scripts to process some text files in Python. Locally the script reads from a single txt file, so I use
index_file = open('index.txt', 'r')
for line in index_file:
    ...
and loop through the file to find a matching string. But when using Amazon EMR, the index.txt file is split into multiple txt files in a single folder.
So I would like to replicate that locally and search multiple txt files for a certain string, but I am struggling to find clean code to do that.
What is the best way to go about it while writing minimal code?

import os
from glob import glob

def readindex(path):
    pattern = '*.txt'
    full_path = os.path.join(path, pattern)
    for fname in sorted(glob(full_path)):
        for line in open(fname, 'r'):
            yield line

# read lines to memory list for using multiple times
linelist = list(readindex("directory"))
for line in linelist:
    print line,
This script defines a generator (see this question for details about generators) that iterates, in sorted order, through all the files with extension "txt" in the directory "directory". It yields all the lines as one stream, so after calling the function you can iterate over them as if they came from one open file, which seems to be what the question author wanted. The trailing comma in print line, makes sure the newline is not printed twice, although the body of the for loop would be replaced by the question author anyway; in that case line.rstrip() can be used to get rid of the newline.
The glob module finds all the pathnames matching a specified pattern according to the rules used by the Unix shell, although results are returned in arbitrary order.
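As a quick illustration, here is a minimal sketch of using readindex() for the original goal of finding a matching string across the split files; the search term is a placeholder, not something from the question.
search_term = 'some_id'  # hypothetical string to look for
for line in readindex("directory"):
    if search_term in line:
        print line.rstrip()  # rstrip() drops the trailing newline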

Related

How to take input of a directory

What I'm trying to do is trawl through a directory of log files that begin like "filename001.log"; there can be hundreds of files in a directory.
The code I want to run against each file checks that the 8th position of the log always contains a number. I suspect that a non-digit is throwing off our parser. Here's some simple code I'm using to check this:
# import re
from urlparse import urlparse
a = '/folderA/filename*.log' #<< currently this only does 1 file
b = '/folderB/' #<< I'd like it to write the same file name as it read
with open(b, 'w') as newfile, open(a, 'r') as oldfile:
    data = oldfile.readlines()
    for line in data:
        parts = line.split()
        status = parts[8] # value of 8th position in the log file
        isDigit = status.isdigit()
        if isDigit == False:
            print " Not A Number :", status
            newfile.write(status)
My problem is:
How do I tell it to read all the files in a directory? (The above really only works for 1 file at a time)
If I find something that is not a number, I would like to write that character into a file in a different folder but with the same name as the log file. For example, if I find that filename002.log has a "*" in one of the log lines, I would like folderB/filename002.log to be created and the non-digit character written to it.
Sounds simple enough; I'm just not very good at coding.
To read files in one directory matching a given pattern and write to another, use the glob module and the os.path functions to construct the output files:
import glob
import os

srcpat = '/folderA/filename*.log'
dstdir = '/folderB'
for srcfile in glob.iglob(srcpat):
    if not os.path.isfile(srcfile):
        continue
    dstfile = os.path.join(dstdir, os.path.basename(srcfile))
    with open(srcfile) as src, open(dstfile, 'w') as dst:
        for line in src:
            parts = line.split()
            status = parts[8]  # value of 8th position in the log file
            if not status.isdigit():
                print " Not A Number :", status
                dst.write(status)  # Or print >>dst, status if you want a newline
This will create empty files even if no bad entries are found. You can handle that in two ways. One option is to wait until you have finished processing a file (and the with block is closed), check the size of the output file, and delete it if it is empty. The other is a lazy approach: unconditionally delete the output file before beginning iteration, but don't open it; only when you hit a bad value do you open the file (for append rather than write, so earlier finds are not discarded), write to it, and let it close.
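For illustration, a rough sketch of that lazy approach follows; the paths and the 8th-field check come from the answer above, while the wiring around them is an assumption.
import glob
import os

srcpat = '/folderA/filename*.log'
dstdir = '/folderB'
for srcfile in glob.iglob(srcpat):
    dstfile = os.path.join(dstdir, os.path.basename(srcfile))
    if os.path.exists(dstfile):
        os.remove(dstfile)  # clear output from a previous run
    with open(srcfile) as src:
        for line in src:
            status = line.split()[8]
            if not status.isdigit():
                with open(dstfile, 'a') as dst:  # append so earlier finds survive
                    dst.write(status + '\n')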
Import os and use for filenames in os.listdir('path'):. This lists every entry in the directory, including the names of subdirectories (it does not recurse into them).
Simply open a second file with the correct path. Since you already have filename from iterating with the above method, you only have to replace the directory. You can use os.path.join for that.
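A minimal sketch of that idea, reusing the folderA/folderB names from the question:
import os

src_dir = '/folderA'
dst_dir = '/folderB'
for filename in os.listdir(src_dir):
    src_path = os.path.join(src_dir, filename)
    dst_path = os.path.join(dst_dir, filename)  # same file name, different folder
    with open(src_path) as src, open(dst_path, 'w') as dst:
        pass  # process src line by line, write findings to dst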

Match two files against each other and write output as file - Python

I'm new to Python; this is my second time coding in it. The main point of this script is to take a text file that contains thousands of lines of file names (the sNotUsed file) and match it against about 50 XML files. The XML files may contain up to thousands of lines each and are formatted as most XML files are. I'm not sure what the problem with the code so far is. The code is not complete, as I have not added the part that writes the output back to an XML file, but the current last line should print at least once. It does not, though.
Examples of the two file formats are as follows:
TEXT FILE:
fileNameWithoutExtension1
fileNameWithoutExtension2
fileNameWithoutExtension3
etc.
XML FILE:
<blocks>
    <more stuff="name">
        <Tag2>
        <Tag3 name="Tag3">
        <!--COMMENT-->
        <fileType>../../dir/fileNameWithoutExtension1</fileType>
        <fileType>../../dir/fileNameWithoutExtension4</fileType>
</blocks>
MY CODE SO FAR:
import os
import re
sNotUsed=list()
sFile = open("C:\Users\xxx\Desktop\sNotUsed.txt", "r") # open snotused txt file
for lines in sFile:
    sNotUsed.append(lines)
#sNotUsed = sFile.readlines() # read all lines and assign to list
sFile.close() # close file
xmlFiles=list() # list of xmlFiles in directory
usedS=list() # list of S files that do not match against sFile txt
search = "\w/([\w\-]+)"
# getting the list of xmlFiles
filelist=os.listdir('C:\Users\xxx\Desktop\dir')
for files in filelist:
    if files.endswith('.xml'):
        xmlFile = open(files, "r+") # open first file with read + write access
        xmlComp = xmlFile.readlines() # read lines and assign to list
        for lines in xmlComp: # iterate by line in list of lines
            temp = re.findall(search, lines)
            #print temp
            if temp:
                if temp[0] in sNotUsed:
                    print "yes" # debugging. I know there is at least one match for sure, but this is not being printed.
TO HELP CLEAR THINGS UP:
Sorry, I guess my question wasn't very clear. I would like the script to go through each XML file line by line and see if the FILENAME part of that line matches an exact line of the sNotUsed.txt file. If there is a match, I want to delete that line from the XML. If the line doesn't match any of the lines in sNotUsed.txt, I would like it to be part of the output of the new modified XML file (which will overwrite the old one). Please let me know if this is still not clear.
EDITED, WORKING CODE
import os
import re
import codecs
sFile = open("C:\Users\xxx\Desktop\sNotUsed.txt", "r") # open sNotUsed txt file
sNotUsed=sFile.readlines() # read all lines and assign to list
sFile.close() # close file
search = re.compile(r"\w/([\w\-]+)")
sNotUsed=[x.strip().replace(',','') for x in sNotUsed]
directory=r'C:\Users\xxx\Desktop\dir'
filelist=os.listdir(directory) # getting the list of xmlFiles
# for each file in the list
for files in filelist:
    if files.endswith('.xml'): # make sure it is an XML file
        xmlFile = codecs.open(os.path.join(directory, files), "r", encoding="UTF-8") # open first file with read
        xmlComp = xmlFile.readlines() # read lines and assign to list
        print xmlComp
        xmlFile.close() # closing the file since the lines have already been read and assigned to a variable
        xmlEdit = codecs.open(os.path.join(directory, files), "w", encoding="UTF-8") # opening the same file again and overwriting all existing lines
        for lines in xmlComp: # iterate by line in list of lines
            #headerInd = re.search(search, lines) # used to get the headers, comments, and ending blocks
            temp = re.findall(search, lines) # finds all strings that match the regular expression compiled above and makes a list for each
            if temp: # if the list is not empty
                if temp[0] not in sNotUsed: # if the first (and only) value in each list is not in the sNotUsed list
                    xmlEdit.write(lines) # write it in the file
            else: # if the list is empty
                xmlEdit.write(lines) # write it (used to preserve the beginning and ending blocks of the XML, as well as comments)
There are a lot of things to say, but I'll try to stay concise.
PEP8: Style Guide for Python Code
You should use lower case with underscores for local variables.
Take a look at PEP8, the Style Guide for Python Code.
File objects and with statement
Use the with statement to open a file, see: File Objects: http://docs.python.org/2/library/stdtypes.html#bltin-file-objects
Escape Windows filenames
Backslashes in Windows filenames can cause problems in Python programs. You must escape the string using double backslashes or use raw strings.
For example: if your Windows filename is "dir\notUsed.txt", you should escape it like this: "dir\\notUsed.txt" or use a raw string r"dir\notUsed.txt". If you don't do that, the "\n" will be interpreted as a newline!
Note: if you need to support Unicode filenames, you can use Unicode raw strings: ur"dir\notUsed.txt".
See also question 19065115 on Stack Overflow.
Store the filenames in a set: it is an optimized collection without duplicates.
not_used_path = ur"dir\sNotUsed.txt"
with open(not_used_path) as not_used_file:
    not_used_set = set([line.strip() for line in not_used_file])
Compile your regex
It is more efficient to compile a regex when it is used numerous times. Again, you should use raw strings to avoid backslash interpretation.
pattern = re.compile(r"\w/([\w\-]+)")
Warning: the os.listdir() function returns a list of filenames, not a list of full paths. See this function in the Python documentation.
In your example, you read a desktop directory 'C:\Users\xxx\Desktop\dir' with os.listdir(), and then you want to open each XML file in that directory with open(files, "r+"). But this is wrong unless your current working directory is that desktop directory. The classic usage is to use the os.path.join() function like this:
desktop_dir = r'C:\Users\xxx\Desktop\dir'
for filename in os.listdir(desktop_dir):
    desktop_path = os.path.join(desktop_dir, filename)
If you want to extract the filename's extension, you can use the os.path.splitext() function.
desktop_dir = r'C:\Users\xxx\Desktop\dir'
for filename in os.listdir(desktop_dir):
    if os.path.splitext(filename)[1].lower() != '.xml':
        continue
    desktop_path = os.path.join(desktop_dir, filename)
You can simplify this with a list comprehension:
desktop_dir = r'C:\Users\xxx\Desktop\dir'
xml_list = [os.path.join(desktop_dir, filename)
            for filename in os.listdir(desktop_dir)
            if os.path.splitext(filename)[1].lower() == '.xml']
Parse an XML file
How do you parse an XML file? This is a great question!
There are several possibilities:
- use regex, efficient but dangerous;
- use SAX parser, efficient too but confusing and difficult to maintain;
- use DOM parser, less efficient but clearer...
Consider using the lxml package (see: http://lxml.de/).
The regex approach is dangerous, because the way you read the file, you don't take the XML encoding into account. And that is bad! Very bad indeed! XML files are usually encoded in UTF-8, so you should first decode the UTF-8 byte stream. A simple way to do that is to use codecs.open() to open the encoded file.
for xml_path in xml_list:
    with codecs.open(xml_path, "r", encoding="UTF-8") as xml_file:
        content = xml_file.read()
With this solution, the full XML content is stored in the content variable as a Unicode string. You can then use a Unicode regex to parse the content.
Finally, you can use a set intersection to find whether a given XML file contains names in common with the text file.
for xml_path in xml_list:
    with codecs.open(xml_path, "r", encoding="UTF-8") as xml_file:
        content = xml_file.read()
    actual_set = set(pattern.findall(content))
    print(not_used_set & actual_set)
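To go one step further toward the asker's goal of actually removing the matching entries, here is a hedged sketch of the DOM-style approach with lxml; it assumes the real files are well-formed XML and reuses the xml_list, not_used_set, and <fileType> names from the examples above.
from lxml import etree
import os

for xml_path in xml_list:
    tree = etree.parse(xml_path)  # lxml reads the declared encoding itself
    root = tree.getroot()
    for elem in root.iter('fileType'):
        name = os.path.basename(elem.text or '')
        if name in not_used_set:
            elem.getparent().remove(elem)  # drop the entry listed in sNotUsed.txt
    tree.write(xml_path, encoding='UTF-8', xml_declaration=True)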

python: extracting specific lines from all files in a zip archive

I need to extract a specific line (the second line) from all the files contained in a zip archive. My attempts (obviously) didn't work. Everything I could find related to this involved using a specific string/variable to narrow down the contents to be extracted from archived files - I can't use that in my case.
The closest I've gotten is extracting ALL lines from ALL files.
import zipfile

with zipfile.ZipFile() as input_zipfile:
    for f in input_zipfile.namelist():
        for line in input_zipfile.read(f).split("\n"):
            print line
Ideally I would want to use something like .readlines() and then print line[1] to get the second line of each file. But that doesn't work with zipfiles. Do I need to create temporary files and use that syntax, or is there a way around this?
I tried changing the last line to print line[1] but then I get an IndexError.
As a side note, the files aren't large (4-12 lines). So I guess making temporary files isn't out of the question, but it seems too roundabout and inelegant.
This will work:
import zipfile

with zipfile.ZipFile() as input_zipfile:
    for f in input_zipfile.namelist():
        lines = input_zipfile.read(f).split("\n")
        print lines[1]
(Your original code loops through the list of lines for no reason, instead of just printing the second one.)
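As a small variation (the archive name below is a placeholder, since the question leaves it out), ZipFile.open() returns a file-like object, which avoids reading whole members into memory and lets you guard against the IndexError on very short files:
import zipfile

with zipfile.ZipFile("archive.zip") as input_zipfile:  # archive name is hypothetical
    for name in input_zipfile.namelist():
        member = input_zipfile.open(name)
        lines = member.readlines()
        if len(lines) > 1:
            print lines[1].rstrip()  # second line, without its trailing newline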

Edit all files in subdirectories with python

I am trying to recursively loop through a series of directories (about 3 levels deep). In each directory is a series of text files, and I want to replace a line of text with the directory path if the line contains a certain string. So, for example:
/path/to/text/file/fName.txt
If a line in fName.txt contains the string 'String1', I want to replace that line with 'some text' + file, where file is the last part of the path.
This seems like it should be easy in python but I can't seem to manage it.
Edit: Apologies for a very badly written question, I had to rush off, shouldn't have hit enter.
Here's what I have so far
import os
for dirname, dirs, files in os.walk("~/dir1/dir2"):
    print files
    for fName in files:
        fpath = os.path.join(dirname, fName)
        print fpath
        f = open(fpath)
        for line in f:
            # where I'm getting stuck
            s = s.replace("old_txt", "new_txt")
            # create new file and save output
What I'm getting stuck on is how to replace an entire line based on only a section of the line. For example, if the line were
That was a useless question,
I can't seem to make replace do what I want. What I'm trying to do is change the entire line based only on searching for 'useless'. Also, is there a better way of modifying a single line than re-writing the entire file?
Thanks.
os.walk (look at the example in its documentation) is all you need.
Parse each file with with open(...) as f:, analyze it, and overwrite it (carefully, after testing) with with open(..., 'w') as f:.
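A minimal sketch of that advice, assuming the replacement described in the question (lines containing 'String1' become 'some text' plus the file name); the root directory is the one from the question's code:
import os

root = os.path.expanduser("~/dir1/dir2")  # expand ~ before walking
for dirname, dirs, files in os.walk(root):
    for fname in files:
        fpath = os.path.join(dirname, fname)
        with open(fpath) as f:
            lines = f.readlines()
        # replace any line containing the search string with the new text
        new_lines = ["some text " + fname + "\n" if "String1" in line else line
                     for line in lines]
        if new_lines != lines:
            with open(fpath, "w") as f:  # overwrite only if something changed
                f.writelines(new_lines)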

Find "string" in Text File - Add it to Excel File Using Python

I ran a grep command and found several hundred instances of a string in a large directory of data. This file is 2 MB and has strings that I would like to extract out and put into an Excel file for easy access later. The part that I'm extracting is a path to a data file I need to work on later.
I have been reading about Python lately and thought I could somehow do this extraction automatically, but I'm a bit stumped about how to start. I have this so far:
data = open("C:\python27\text.txt").read()
if "string" in data:
But then I'm not sure what to use to get out of the file what I want. Anything for a beginner to chew on?
EDIT
Here is some more info on what I was looking for. I have several hundred lines in a text file. Each line has a path and some strings like this:
/path/to/file:STRING=SOME_STRING, ANOTHER_STRING
What I would like from these lines is the path of each line that has a specific "STRING=SOME_STRING". For example, if the line looks like this, I want the path (/path/to/file) to be extracted to another file:
/path/to/file:STRING=SOME_STRING
All this is quite easily done with standard Python, but for Excel (xls or xlsx) files you'd have to install a third-party library. However, if you just need a 2D table that can be opened in a spreadsheet, you can use Comma Separated Values (CSV) files; these are compatible with Excel and other spreadsheet software, and support for them comes integrated in Python.
As for searching for a string inside a file, it is straightforward. You may not even need regular expressions for most things. What information do you want along with the string?
Also, the "os" module in the standard library has some functions to list all files in a directory, or in a directory tree. The most straightforward is os.listdir(path).
String methods like "count" and "find" can be used beyond "in" to locate the string in a file, or to count the number of occurrences.
And finally, the "csv" module can write a properly formatted file that can be read in any spreadsheet.
Along the way, you can use Python's built-in list objects as an easy way to manipulate your data sets.
Here is a sample program that counts occurrences of the strings given on the command line in the files of a given directory, and assembles a .csv table with them:
# -*- coding: utf-8 -*-
import csv
import sys, os

output_name = "count.csv"

def find_in_file(path, string_list):
    count = []
    file_ = open(path)
    data = file_.read()
    file_.close()
    for string in string_list:
        count.append(data.count(string))
    return count

def main():
    if len(sys.argv) < 3:
        print "Use %s directory_path <string1>[ string2 [...]])\n" % __package__
        sys.exit(1)
    target_dir = sys.argv[1]
    string_list = sys.argv[2:]
    csv_file = open(output_name, "wt")
    writer = csv.writer(csv_file)
    header = ["Filename"] + string_list
    writer.writerow(header)
    for filename in os.listdir(target_dir):
        path = os.path.join(target_dir, filename)
        if not os.path.isfile(path):
            continue
        line = [filename] + find_in_file(path, string_list)
        writer.writerow(line)
    csv_file.close()

if __name__=="__main__":
    main()
The steps to do this are as follows:
- Make a list of all files in the directory (this isn't necessary if you're only interested in a single file)
- Extract the names of the files that you're interested in
- In a loop, read in those files line by line
- See if the line matches your pattern
- Extract the part of the line before the first : character
So, the code would look something like this, provided your text files are formatted the way you've shown in the question and that this format is reliably correct:
import sys, os, glob

dir_path = sys.argv[1]
if dir_path[-1] != os.sep:
    dir_path += os.sep
file_list = glob.glob(dir_path + '*.txt')  # use standard *NIX wildcards to get your file names, in this case, all the files with a .txt extension
with open('out_file.csv', 'w') as out_file:
    for filename in file_list:
        with open(filename, 'r') as in_file:
            for line in in_file:
                if 'STRING=SOME_STRING' in line:
                    out_file.write(line.split(':')[0] + '\n')
This program would be run as python extract_paths.py path/to/directory and would give you a file called out_file.csv in your current directory.
This file can then be imported into Excel as a CSV file. If your input is less reliable than you've suggested, regular expressions might be a better choice.
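For illustration, a regex variant of the same loop might look like the sketch below; the pattern is an assumption based on the sample line /path/to/file:STRING=SOME_STRING, not something given in the answer, and file_list is reused from the snippet above.
import re

# hypothetical pattern: everything before the first ':' on lines carrying STRING=SOME_STRING
path_re = re.compile(r'^(?P<path>[^:]+):STRING=SOME_STRING\b')

with open('out_file.csv', 'w') as out_file:
    for filename in file_list:
        with open(filename, 'r') as in_file:
            for line in in_file:
                m = path_re.match(line)
                if m:
                    out_file.write(m.group('path') + '\n')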
