Extracting line data based on a specific pattern in a text file using Python

I have a huge report file in which I have to do some data processing on lines starting with the code "MLT-TRR".
For now I have extracted all the lines that start with that code and placed them in a separate file, Rules.txt. The new file looks like this:
MLT-TRR Warning C:\Users\Di\Pictures\SavedPictures\top.png 63 10 Port is not registered [Folder: 'Picture']
MLT-TRR Warning C:\Users\Di\Pictures\SavedPictures\tree.png 315 10 Port is not registered [Folder: 'Picture.first_inst']
MLT-TRR Warning C:\Users\Di\Pictures\SavedPictures\top.png 315 10 Port is not registered [Folder: 'Picture.second_inst']
MLT-TRR Warning C:\Users\Di\Pictures\SavedPictures\tree.png 317 10 Port is not registered [Folder: 'Picture.third_inst']
MLT-TRR Warning C:\Users\Di\Pictures\SavedPictures\top.png 317 10 Port is not registered [Folder: 'Picture.fourth_inst']
For each of these lines I have to extract the data that comes after "[Folder: 'Picture". If there is no data after "[Folder: 'Picture", as in the case of my first line, then skip that line and move on to the next one.
I also want to extract the file name for each of those lines: top.png, tree.png.
I couldn't think of a simpler method to do this, as it involves a loop and gets messy.
Is there any way I can do this, extracting just the file paths and the ending data of each line?
import os
import sys

folder_path = os.path.dirname(os.path.abspath(__file__))
inFile1 = 'Rules.txt'

def open_file(filename):
    try:
        # collect every line containing the code
        with open(filename, 'r') as f:
            targets = [line for line in f if "MLT-TRR" in line]
        print targets
        # write the matching lines to a separate file
        with open(inFile1, "w") as f2:
            for line in targets:
                f2.write(line)  # lines already end with '\n'
    except Exception, e:
        print str(e)
        exit(1)

if __name__ == '__main__':
    filename = sys.argv[1]
    open_file(filename)

To extract the filenames and other data, you should be able to use a regular expression:
import re

for line in f:
    match = re.match(r"^MLT-TRR.*([A-Za-z]:\\[-A-Za-z0-9_:\\.]+).*\[Folder: 'Picture\.(\w+)']", line)
    if match:
        filename = match.group(1)
        data = match.group(2)
This assumes that the data after 'Picture. contains only alphanumeric characters and underscores. You may have to change the allowed characters in the filename part, [A-Za-z0-9_:\\.], if you have unusual filenames. It also assumes the filenames start with a Windows drive letter (so absolute paths), to make them easier to distinguish from other data on the line.
If you just want the basename of the filename, then after extracting it you can use os.path.basename or pathlib.Path.name.
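Putting it together, a minimal sketch (variable names here are illustrative) that reads Rules.txt from the question, skips lines with nothing after 'Picture, and collects (basename, data) pairs:

import os
import re

pattern = re.compile(
    r"^MLT-TRR.*([A-Za-z]:\\[-A-Za-z0-9_:\\.]+).*\[Folder: 'Picture\.(\w+)']")

results = []
with open('Rules.txt') as f:
    for line in f:
        match = pattern.match(line)
        if match:  # lines like [Folder: 'Picture'] (no dot) simply don't match
            results.append((os.path.basename(match.group(1)), match.group(2)))

# e.g. [('tree.png', 'first_inst'), ('top.png', 'second_inst'), ...]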

I had a very similar problem and solved it by searching for the specific line 'key' (in your case "MLT-TRR") with regex and then specifying which 'bytes' to take from that line. I then append the selected data to an array.
import re  # import the regex module

# Make empty lists:
P190 = []       # my file
shot = []       # events in my file (multiple lines of text for each event)
S011east = []   # what I want
S011north = []  # another thing I want

# Create your regex:
S011 = re.compile(r"^S0\w*\W*11\b")

# Search and append:
# Open the P190 file
with open(import_file_path, 'rt') as infile:
    for lines in infile:
        P190.append(lines.rstrip('\n'))

# Locate the specific lines and extract the data
for line in P190:
    if S011.search(line) is not None:
        easting = float(line[47:55])
        S011east.append(easting)
        northing = float(line[55:64])
        S011north.append(northing)
If you set up the regex to look for "MLT-TRR ????? Folder: 'Picture.'" then it should skip any lines that don't have any further information.
For the second part of your question: I doubt your file names are a constant length, so the byte-slicing method above won't work, as you can't specify a fixed number of bytes to extract. This code extracts the name and extension from a file path; you could apply it to whatever you extract from each line.
import os
tail = os.path.basename(import_file_path)  # get the file name from the path
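For example, with one of the paths from the question:

import os
print(os.path.basename(r"C:\Users\Di\Pictures\SavedPictures\top.png"))  # prints: top.png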


Unable to return required strings from XML files

I have created this code to have a user point at a directory and for it to go through the directory looking for .xml files. Once found, the program is supposed to search each file for strings that are 32 characters long. That is the only requirement; the content is not important at this time, just that it returns the 32-character strings.
I have tried using the regex module in Python as below. When run, the program iterates over the available files and returns all the file names, but the string_recovery function returns only empty lists. I have confirmed visually that the XML contains 32-character strings.
import os
import re
import tkinter as tk
from tkinter import filedialog

def string_recovery(data):
    short_string = re.compile(r"^[a-zA-Z0-9\-._]{32}$")
    strings = re.findall(short_string, data)
    print(strings)

def xml_search(directory):
    xml_files = []
    for root, dirs, files in os.walk(directory):
        for file in files:
            if file.endswith(".xml"):
                xml_files.append(os.path.join(root, file))
    print("The following XML files have been found.")
    print(xml_files)
    for xml_file in xml_files:
        with open(xml_file, "r") as f:
            string_recovery(f.read())

def key_finder():
    directory = filedialog.askdirectory()
    xml_search(directory)

key_finder()
By default, Python patterns are not "multiline", so ^ and $ match the start and end of the whole text block, not of each line. You need to set the re.M (aka re.MULTILINE) flag:
compare:
import re
text = """
foo
12345678901234567890123456789011
12345678901234567890123456789011
"""
pattern = r"^[a-zA-Z0-9\-._]{32}$"
print(re.findall(pattern, text, re.M)) ## <--- flag
Giving:
['12345678901234567890123456789011', '12345678901234567890123456789011']
with:
import re
text = """
foo
12345678901234567890123456789011
12345678901234567890123456789011
"""
pattern = r"^[a-zA-Z0-9\-._]{32}$"
print(re.findall(pattern, text))
Giving:
[]
Maybe you should also go over each line. Your current loop passes the whole file at once:

for xml_file in xml_files:
    with open(xml_file, "r") as f:
        string_recovery(f.read())

Check first that string_recovery works on a single line (I cannot reproduce your example, but create a variable line = ... holding a line that should be recovered and try it).
And go over each line instead of the whole file:
for xml_file in xml_files:
    with open(xml_file, "r") as f:
        for line in f.readlines():
            string_recovery(line)
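Alternatively, here is a minimal sketch of string_recovery with the multiline flag applied, so you can keep passing the whole file in one call:

import re

def string_recovery(data):
    # re.M makes ^ and $ match at the start and end of every line
    short_string = re.compile(r"^[a-zA-Z0-9\-._]{32}$", re.M)
    return short_string.findall(data)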

Search and replace a string in a number of .txt files in Python

There are multiple files in a directory with extensions .txt, .dox, .qcr, etc.
I need to list the .txt files and search and replace text in each .txt file only.
I need to search for $$\d, where \d stands for a number 1, 2, 3 ... 100, and replace it with xxx.
Please let me know the Python script for this.
Thanks in advance.
-Shrinivas
# I created the following script; it works for a single txt file, but it does not work when more than one txt file lies in the directory.
-----
def replaceAll(file, searchExp, replaceExp):
    for line in fileinput.input(file, inplace=1):
        if searchExp in line:
            line = line.replace(searchExp, replaceExp)
        sys.stdout.write(line)

# following code is not working; I expect it to list the files starting
# with "um_*.txt", open each file and replace the "$$\d" with the replaceAll function.
for um_file in glob.glob('*.txt'):
    t = open(um_file, 'r')
    replaceAll("t.read", "$$\d", "xxx")
    t.close()
fileinput.input(...) is designed to process a whole batch of files, and must be ended with a corresponding fileinput.close(). So you can either process them all in one single call:
import fileinput
import glob
import sys

def replaceAll(file, searchExp, replaceExp):
    for line in fileinput.input(file, inplace=True):
        if searchExp in line:
            line = line.replace(searchExp, replaceExp)
        dummy = sys.stdout.write(line)  # assign to avoid a possible echo of the byte count
    fileinput.close()  # to orderly close everything

replaceAll(glob.glob('*.txt'), "$$\d", "xxx")
or consistently close fileinput after processing each file, though that rather ignores fileinput's main feature.
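For completeness, a minimal sketch of that second option (closing fileinput after each file), keeping the same literal-substring replacement as above:

import fileinput
import glob
import sys

def replaceAll(path, searchExp, replaceExp):
    for line in fileinput.input(path, inplace=True):
        if searchExp in line:
            line = line.replace(searchExp, replaceExp)
        sys.stdout.write(line)
    fileinput.close()  # close this file before opening the next one

for um_file in glob.glob('*.txt'):
    replaceAll(um_file, "$$", "xxx")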
Try this:

import glob
import re
import sys

def replaceAll(file, searchExp, replaceExp):
    for line in file.readlines():
        try:
            line = line.replace(re.findall(searchExp, line)[0], replaceExp)
        except IndexError:  # no match on this line
            pass
        sys.stdout.write(line)

# List the files matching "*.txt", open each file and
# replace the digits with the replaceAll function.
for um_file in glob.glob('*.txt'):
    t = open(um_file, 'r')
    replaceAll(t, r"\d+", "xxx")
    t.close()
Here we are passing a file handle to the replaceAll function rather than a string.
You can try this:

import os
import re

the_files = [i for i in os.listdir("foldername") if i.endswith("txt")]
for file in the_files:
    # os.listdir returns bare names, so join them back onto the folder
    path = os.path.join("foldername", file)
    new_data = re.sub(r"\d+", "xxx", open(path).read())
    final_file = open(path, 'w')
    final_file.write(new_data)
    final_file.close()
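One caveat for the original requirement: the answers above replace bare digits, but the question asks for $$ followed by a number, and $ is a special regex character that must be escaped. A small sketch of a pattern closer to that requirement:

import re

text = "set $$1 and $$42 but keep port 8080"
print(re.sub(r"\$\$\d+", "xxx", text))  # prints: set xxx and xxx but keep port 8080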

How to copy one file based on a variable

I am building a Python script to edit lots of files by searching and replacing words in them.
There is an original file named C:\python 3.5/remedy line 1.ahk.
There is a file containing the words I want to replace (search words) in the original document, and a text file that has the list of the new words that I would like placed into the final document.
The script runs and works perfectly. The final document is created and named based on a line in the final text file (code begins on line 72), so I can tell what the final product is by looking at it. This file is originally created as output = open("C:\python 3.5\output.ahk", 'w') and later in the script it is renamed based on line 37 of the text file. All of that works fine.
So the seemingly simple part I can't figure out is how to take this one file and move it into the directory where it belongs. That directory is created from the same line that the file gets its name from (code starts on line 82). How do I move my file into a directory that has been created by the script, i.e. based on a variable (code starts on line 84), when the name of the file is also based on a variable?
import shutil
# below is where your modified file sits, before we move it into its own directory named dst, based on a variable mainnewdir
srcdir = r'C:\python 3.5/'+(justfilename)
dst = (mainnewdir)+(justfilename)
shutil.copyfile(src, dst)
Why does Python format it with extra \ in the code?
Why does it seem not to give me an error if I use a / instead of a \?
Here is the entire code, like I said only the last part of moving the file does not work:
import os
import linecache
import sys
import string
import re

## information/replacingvalues.txt is the text of the values you want in your final document
# information = open("C:\python 3.5\replacingvalues.txt", 'r')
information = open("C:\python 3.5/replacingvalues.txt", 'r')
# information = open("C:\Program Files (x86)\Python35-32\Scripts\Text_Find_and_Replace\information/replacingvalues.txt",
# Text_Find_and_Replace\Result/output.txt is the dir and the sum or final document
# output = open("C:\python 3.5\output.ahk", 'w')
createblank = open("C:\python 3.5/output.ahk", 'w')
createblank.close()
output = open("C:\python 3.5\output.ahk", 'w')
# field = open("C:\Program Files (x86)\Python35-32\Scripts\Text_Find_and_Replace\Field/values.txt"
# Field is the file of words you will be replacing
field = open("C:\python 3.5/values.txt", 'r')
# modified code for AutoHotkey
# Text_Find_and_Replace\Test/remedy line 1.ahk is the original doc you want modified
with open("C:\python 3.5/remedy line 1.ahk", 'r') as myfile:
    inline = myfile.read()
## remedy line 1.ahk
informations = []
fields = []
dictionary = {}
i = 0
for line in information:
    informations.append(line.splitlines())
for lines in field:
    fields.append(lines.split())
    i = i + 1
if len(fields) != len(informations):
    print("replacing values and values have different numbers")
    exit()
else:
    for i in range(0, i):
        rightvalue = str(informations[i])
        rightvalue = rightvalue.strip('[]')
        rightvalue = rightvalue[1:-1]
        leftvalue = str(fields[i])
        leftvalue = leftvalue.strip('[]')
        leftvalue = leftvalue.strip("'")
        dictionary[leftvalue] = rightvalue
robj = re.compile('|'.join(dictionary.keys()))
result = robj.sub(lambda m: dictionary[m.group(0)], inline)
output.write(result)
information.close()
output.close()
field.close()
import os
import linecache
import shutil  # needed for the copy at the end

linecache.clearcache()
newfilename = linecache.getline("C:\python 3.5/remedy line 1.txt", 37)
filename = "C:\python 3.5/output.ahk"
os.rename(filename, newfilename.strip())
# os.rename(filename, newfilename.strip()+".ahk")
linecache.clearcache()
############## below will create a new directory based on the word or words in line 37 of the txt file.
newdirname = linecache.getline("C:\python 3.5/remedy line 1.txt", 37)
# newpath = r'C:\pythontest\automadedir'
# below removes the \n, i.e. the newline character
justfilename = newdirname.strip()
# below removes the .txt from the rest of the justfilename
autocreateddir = justfilename.strip(".txt")
# below is an example of combining a string and a variable:
# it makes the variable that will be the name of the new directory,
# based on reading line 37 of the text file above
mainnewdir = r'C:\pythontest\automadedir/' + autocreateddir
if not os.path.exists(mainnewdir):
    os.makedirs(mainnewdir)
linecache.clearcache()
# ####################################################
# below is where your modified file sits, before we move it into its own
# directory named dst, based on the variable mainnewdir
srcdir = r'C:\python 3.5/' + justfilename
dst = mainnewdir + justfilename
shutil.copyfile(src, dst)
Backslashes do not have a mind of their own.
When you paste Windows paths as-is and they contain \n, \r, \b, \x, \v, \U (Python 3), (refer to the table here for all of them), you are using escape sequences without noticing it.
When the escape sequence doesn't exist (e.g. \p), the path happens to work. But when it is a recognized sequence, the resulting filename is often invalid, which explains the apparent randomness of the issue.
To be able to safely paste Windows paths without changing or escaping them, just use the raw prefix:
my_file = r"C:\temp\foo.txt"
so the backslashes won't be interpreted. One exception, though: if the string ends with a backslash, you still have to double it.
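As for the actual move, which the answer above does not cover: a minimal sketch using os.path.join (avoiding the missing separator in dst = (mainnewdir)+(justfilename) from the question) and shutil.move, assuming justfilename and mainnewdir are set as in the question's script:

import os
import shutil

src = os.path.join(r'C:\python 3.5', justfilename)
dst = os.path.join(mainnewdir, justfilename)
shutil.move(src, dst)  # or shutil.copyfile(src, dst) to keep the original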

How do I run this Python 2.7 script over multiple files in the same directory [closed]

This script currently grabs specific types of IP addresses out of a file and formats them into CSV.
How do I change it to look through all files in its directory (the same directory as the script) and create a new output file? This is my first week with Python, so please keep it as simple as possible.
#!/usr/bin/python
# Extract IP addresses from a file

# import modules
import re

# Open source file
infile = open('stix1.xml', 'r')
# Open output file
outfile = open('ExtractedIPs.csv', 'w')
# Create a list
BadIPs = []

# search each line in the doc
for line in infile:
    # ignore empty lines
    if line.isspace(): continue
    # find IPs that are Indicator Titles
    IP = re.findall(r"(?:<indicator:Title>IP:) (\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})", line)
    # Only take finds
    if not IP: continue
    # Add each found IP to the BadIP list
    BadIPs.append(IP)

# tidy up for CSV format
data = str(BadIPs)
data = data.replace('[', '')
data = data.replace(']', '')
data = data.replace("'", "")
# Write IPs to a file
outfile.write(data)
infile.close()
outfile.close()
I think you want to have a look at glob.glob: https://docs.python.org/2/library/glob.html
This will return a list of files matching a given pattern.
Then you can do something like:

import re, glob

def do_something_with(f):
    # Open source file
    infile = open(f, 'r')
    # Open output file
    outfile = open('ExtractedIPs.csv', 'a')  # 'a' so each file appends
    # Create a list
    BadIPs = []
    ### rest of your code
    # ...
    outfile.write(data)
    infile.close()
    outfile.close()

for f in glob.glob("*.xml"):
    do_something_with(f)
Assuming that you want to add all outputs to the same file, this would be the script:
#!/usr/bin/python
import glob
import re

for infileName in glob.glob("*.xml"):
    # Open source file
    infile = open(infileName, 'r')
    # Append to the output file
    outfile = open('ExtractedIPs.csv', 'a')
    # Create a list
    BadIPs = []
    # search each line in the doc
    for line in infile:
        # ignore empty lines
        if line.isspace(): continue
        # find IPs that are Indicator Titles
        IP = re.findall(r"(?:<indicator:Title>IP:) (\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})", line)
        # Only take finds
        if not IP: continue
        # Add each found IP to the BadIP list
        BadIPs.append(IP)
    # tidy up for CSV format
    data = str(BadIPs)
    data = data.replace('[', '')
    data = data.replace(']', '')
    data = data.replace("'", "")
    # Write IPs to the file
    outfile.write(data)
    infile.close()
    outfile.close()
You could get a list of all XML files like this (note that in Python 2, os.listdir needs a directory argument):

filenames = [nm for nm in os.listdir('.') if nm.endswith('.xml')]
And then you iterate over all the files:

for fn in filenames:
    with open(fn) as infile:
        for ln in infile:
            # do your thing
The with-statement makes sure that the file is closed after you're done with it.
import sys
Make a function out of your current code, for example def extract(filename).
Call the script with all filenames: python myscript.py file1 file2 file3
Inside your script, loop over the filenames: for filename in sys.argv[1:].
Call the function inside the loop: extract(filename).
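A minimal sketch of that structure (extract is a stand-in for your existing code):

import sys

def extract(filename):
    # your current extraction logic, applied to a single file
    pass

for filename in sys.argv[1:]:
    extract(filename)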
I had a need to do this, and also to go into subdirectories as well. You need to import os and os.path, then you can use a function like this:
def recursive_glob(rootdir='.', suffix=()):
    """Recursively traverses the full path from the root directory and
    returns the paths and file names of files with a suffix in the tuple."""
    pathlist = []
    filelist = []
    for looproot, dirnames, filenames in os.walk(rootdir):
        for filename in filenames:
            if filename.endswith(suffix):
                pathlist.append(os.path.join(looproot, filename))
                filelist.append(filename)
    return pathlist, filelist
You pass the function the top-level directory you want to start from and the suffix for the file type you are looking for. This was written and tested on Windows, but I believe it will work on other OSes as well, as long as you have file extensions to work from.
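For example (the root directory here is illustrative):

pathlist, filelist = recursive_glob(r"C:\reports", (".xml",))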
You could just use os.listdir() if all files in your current folder are relevant. If not, say you want all the .xml files, then use glob.glob("*.xml"). Either way, the overall program can be improved, roughly as follows:
# import modules
import csv
import os
import re

pat = re.compile(reg)  # reg is your regex
with open("out.csv", "w") as fw:
    writer = csv.writer(fw)
    for f in os.listdir('.'):  # or glob.glob("*.xml")
        with open(f) as fr:
            # generator of the non-blank lines in this file
            lines = (line for line in fr if not line.isspace())
            # generator expression for all IPs in that file
            ips = (ip for line in lines for ip in pat.findall(line))
            writer.writerow(ips)
You will probably have to change it to suit your exact needs, but the idea is that this version has far fewer side effects and much lower memory consumption, and close is managed by the context managers. Please comment if it doesn't work.

python compare and output to new file

I have a Python script that does the following, in the stated order:
1. Takes an argument (in this case a filename) and removes all characters other than a-z, A-Z, 0-9 and the period '.'.
2. Strips out all information from the new file except IP addresses that are later going to be compared to a watchlist.
3. Cleans up the file and saves it as a new file to be compared to the watchlist.
4. Finally, it compares this cleaned-up file (ip_list_clean) to the watchlist file and outputs matching lines to a new file (malicious_ips).
It is part 4 I am struggling with. The following code works up until stage 4, which stops the rest from working:
#!/usr/bin/python
import re
import sys
import cgi

# Compare the cleaned-up list of IPs against the botwatch
# list and output the results to a new file.
new_list = set()
outfile = open("final_downloads/malicious_ips", "w")
for line in open("final_downloads/ip_list_clean", "r")
    if line in open("/var/www/botwatch.txt", "r")
        outfile.write(line)
        new_list.add(line)
outfile.close()
Any ideas as to why the last section does not work? In fact, it stops the whole thing from working.
You are missing some colons in the last section. Try this:
new_list = set()
outfile = open("final_downloads/malicious_ips", "w")
for line in open("final_downloads/ip_list_clean", "r"):
    if line in open("/var/www/botwatch.txt", "r"):
        outfile.write(line)
        new_list.add(line)
outfile.close()
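Note that line in open(...) reopens and rescans botwatch.txt for every input line. A more efficient sketch (same file paths as above) loads the watchlist into a set once:

with open("/var/www/botwatch.txt") as f:
    watchlist = set(f)

with open("final_downloads/ip_list_clean") as infile, \
     open("final_downloads/malicious_ips", "w") as outfile:
    for line in infile:
        if line in watchlist:
            outfile.write(line)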
