Python's regex .match() function works inconsistently with multiline strings

I'm writing a script in Python that takes a directory containing journal entries in the form of Markdown files and processes each file in order to create an object from it. These objects are appended to a list of journal entry objects. Each object contains three fields: title, date, and body.
In order to create this list of entry objects, I loop over each file in a directory and append to the list the return value of a function called entry_create_object, which takes the file text as an input.
def load_entries(directory):
    entries = []
    for filename in os.listdir(directory):
        filepath = os.path.join(directory, filename)
        with open(filepath, 'r') as f:
            text = f.read()
            entry_object = entry_create_object(text)
            if entry_object:
                entries.append(entry_object)
            else:
                print(f"Couldn't read {filepath}")
    return entries
In order to create the object, I use regular expressions to find the information I need for the title and date fields. The body is just the file contents. The function returns None if it doesn't match title and date. The following code is what I use for this:
def entry_create_object(ugly_entry):
    title = re.match('^# (.*)', ugly_entry)
    date = re.match('(Date:|Created at:) (\w{3}) (\d{2}), (\d{4})', ugly_entry)
    body = ugly_entry
    if not (title and date and body):
        return
    entry_object = {}
    entry_object['title'], entry_object['date'], entry_object['body'] = title, date, body
    return entry_object
For some reason I can't understand, my regular expression for dates works for some files but not for others, even though I've been able to successfully match what I wanted when testing my pattern in an online regular expression webapp such as Regexr. The title regex pattern works fine for all files.
I've found in my testing that re.match is very inconsistent with multiline strings overall, but I haven't been able to find a way of fixing it.
I can't see anything wrong with my pattern.
Example of file that successfully matches both title and date:
# Time tracker
Created at: Oct 21, 2020 4:16 PM
Date: Oct 21, 2020
[...]
Example of file that fails to match date:
# Bad habits
Created at: Dec 6, 2020 4:24 PM
Date: Dec 6, 2020
[...]
Thank you for your time.

Let's decode the regex.
date = re.match('(Date:|Created at:) (\w{3}) (\d{2}), (\d{4})', ugly_entry)
That's three letters, followed by a space, followed by exactly 2 digits, followed by a comma and a space, followed by 4 digits. Given that description, can you see why the following string does not match?
Created at: Dec 6, 2020 4:24 PM
I shouldn't spoil the surprise, but you want (\d{1,2}),
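Here's a quick way to see it for yourself (a minimal check using the two example date lines from your files):

import re

strict = r'(Date:|Created at:) (\w{3}) (\d{2}), (\d{4})'
relaxed = r'(Date:|Created at:) (\w{3}) (\d{1,2}), (\d{4})'

for line in ['Date: Oct 21, 2020', 'Date: Dec 6, 2020']:
    # \d{2} demands exactly two digits, so the single-digit day "6" fails
    print(re.match(strict, line) is not None, re.match(relaxed, line) is not None)
# True True    <- "Oct 21" has a two-digit day, both patterns match
# False True   <- "Dec 6" matches only once the day may be one or two digits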

Related

Create a tuple for each line by filtering data using regex from a text file

So basically I want to create a function which loads a text file and returns a list of tuples by filtering the required parts using regex.
This text file contains lines in the form of:
-rw-r--r-- 1 jttoivon hyad-all 25399 Nov 2 21:25 exception_hierarchy.pdf
And the tuple should be like:
(25399, "Nov", 2, 21, 25, "exception_hierarchy.pdf")
Basically it filters out the size, month, day, hour, minute, and filename. I can load the file, but I'm unable to apply a regex to filter out those parts and return a list of tuples. Help with that particular part will do.
Try this on each line
def getTupleFromLine(line):
    # Split on whitespace, then keep the fields after permissions, link count, owner and group:
    # ['25399', 'Nov', '2', '21:25', 'exception_hierarchy.pdf']
    l = line.split()[4:]
    hour, minute = l[3].split(":")
    return (int(l[0]), l[1], int(l[2]), int(hour), int(minute), l[4])
As this seems to be output from ls -la, the order of your information should be static. You can do something like:
string = "-rw-r--r-- 1 jttoivon hyad-all 25399 Nov 2 21:25 exception_hierarchy.pdf"
words = string.split()
tupl = (words[4], words[5], words[6], words[7], words[8])
print(tupl)
Here the time is one word, since split() splits on whitespace by default. If you want, you can split it further to achieve your exact output.
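Since the question asks for regex specifically, here's a minimal sketch of that route (it assumes every line has exactly the ls -la shape shown above):

import re

line_pattern = re.compile(
    r"\S+\s+\d+\s+\S+\s+\S+\s+"    # permissions, link count, owner, group
    r"(\d+)\s+(\w{3})\s+(\d+)\s+"  # size, month, day
    r"(\d+):(\d+)\s+(\S+)"         # hour, minute, filename
)

def tuple_from_line(line):
    size, month, day, hour, minute, name = line_pattern.match(line).groups()
    return (int(size), month, int(day), int(hour), int(minute), name)

print(tuple_from_line("-rw-r--r-- 1 jttoivon hyad-all 25399 Nov 2 21:25 exception_hierarchy.pdf"))
# (25399, 'Nov', 2, 21, 25, 'exception_hierarchy.pdf')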

Parsing mixed text file with a lot header info

I'm an amateur astronomer (and retired) and just messing around with an idea. I want to scrape data from a NASA website that is a text file and extract specific values to help me identify when to observe. The text file automatically updates every 60 seconds. The text file has some header information before the actual rows and columns of data that I need to process. The actual data is numerical. Example:
Prepared by NASA
Please send comments and suggestions to xxx.com
Date Time Year
yr mo da hhmm day value1 value2
2019 03 31 1933 234 6.00e-09 1.00e-09
I want to access the string numerical data and convert it into a double
From what I can see the file is space delimited
I want to poll the website every 60 seconds and if Value 1 and Value 2 are above a specific threshold that will trigger a PyAutoGUI to automate a software application to take an image.
After reading the file from the website, I tried converting it into a dictionary, thinking that I could then map keys to values, but I can't predict the exact location of the values I need. I thought that once I extracted the values I needed, I would write them to a file and then try to convert the strings into doubles or floats.
I have tried to use
import re
re.split
to read each line and split out info but I get a huge mess because of the header information
I wanted to use a simple approach to open the file and this works
import urllib
import csv
data = urllib.urlopen("https://services.swpc.noaa.gov/text/goes-xray-flux-primary.txt").read()
print (data)
I found this on Stack Overflow but I don't understand how I would use it:
file = open('abc.txt', 'r')
while 1:
    a = file.readline()
    if a == '':
        break
    a = a.split()  # This creates a list of the input
    name = a[0]
    value = int(a[1])  # or value=float(a[1]), whatever you want
    # use the name and value howsoever
file.close()
What I want is to extract Value 1 and Value 2 as doubles or floats. Then, in Part II (which I have not yet started), I will compare Value 1 and Value 2, and if they are above a specific threshold this will trigger PyAutoGUI to interact with my imaging software and take an image.
Here's a simple example of using regular expressions. This assumes you'd read the whole file into memory with a single f.read() rather than bothering to process individual lines, which, with regular expressions, is often the simpler way to go (and I'm lazy and didn't want to have to create a test file):
import re

data = """
blah
blah
yr mo da hhmm day value1 value2
2019 03 31 1933 234 6.00e-09 1.00e-09
blah
"""

pattern = re.compile(r"(\d+) (\d+) (\d+) (\d+) (\d+) ([^\s]+) ([^\s]+)")

def main():
    m = pattern.search(data)
    if m:
        # Do whatever processing you want to do here. You have access to all 7 input
        # fields via m.group(1-7)
        d1 = float(m.group(6))
        d2 = float(m.group(7))
        print(">{}< >{}<".format(d1, d2))
    else:
        print("no match")

main()
Output:
>6e-09< >1e-09<
You'd want to tweak this a bit if I've made a wrong assumption about the input data, but it gives you the general idea anyway. It should handle just about anything else that exists in the input, as long as nothing else looks like the one line you're interested in.
UPDATE:
I can't leave well enough alone. Here's code that pulls the data from the URL you provide and processes all the matching lines:
import re
import urllib

pattern = re.compile(r"(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+([^\s]+)\s+([^\s]+)")

def main():
    data = urllib.urlopen("https://services.swpc.noaa.gov/text/goes-xray-flux-primary.txt").read()
    pos = 0
    while True:
        m = pattern.search(data, pos)
        if not m:
            break
        pos = m.end()
        # Do whatever processing you want to do here. You have access to all 8 input
        # fields via m.group(1-8)
        f1 = float(m.group(7))
        f2 = float(m.group(8))
        print(">{}< >{}<".format(f1, f2))

main()
Result:
>9.22e-09< >1e-09<
>1.06e-08< >1e-09<
...
>8.99e-09< >1e-09<
>1.01e-08< >1e-09<
This was a fun little challenge. I've pulled all the data out of the table for you, mapped it to a class, and converted the data to int and Decimal as appropriate. Once it's populated you can read all the data you want from it.
To get the data I've used the requests library instead of urllib; that's merely personal preference. You could pip install requests if you wanted to use it too. It has a method iter_lines that can traverse the rows of data.
This may be overkill for what you need, but as I wrote it anyway I thought I'd post it for you.
import re
from datetime import datetime
from decimal import Decimal

import requests


class SolarXrayFluxData:
    def __init__(
        self,
        year,
        month,
        day,
        time,
        modified_julian_day,
        seconds_of_the_day,
        short,
        long
    ):
        self.date = datetime(
            int(year), int(month), int(day), hour=int(time[:2]), minute=int(time[2:])
        )
        self.modified_julian_day = int(modified_julian_day)
        self.seconds_of_the_day = int(seconds_of_the_day)
        self.short = Decimal(short)
        self.long = Decimal(long)


class GoesXrayFluxPrimary:
    def __init__(self):
        self.created_at = ''
        self.data = []

    def extract_data(self, url):
        data = requests.get(url)
        for i, line in enumerate(data.iter_lines(decode_unicode=True)):
            if line[0] in [':', '#']:
                if i == 1:
                    self.set_created_at(line)
                continue
            row_data = re.findall(r"(\S+)", line)
            self.data.append(SolarXrayFluxData(*row_data))

    def set_created_at(self, line):
        date_str = re.search(r'\d{4}\s\w{3}\s\d{2}\s\d{4}', line).group(0)
        self.created_at = datetime.strptime(date_str, '%Y %b %d %H%M')


if __name__ == '__main__':
    goes_xray_flux_primary = GoesXrayFluxPrimary()
    goes_xray_flux_primary.extract_data('https://services.swpc.noaa.gov/text/goes-xray-flux-primary.txt')
    print("Created At: %s" % goes_xray_flux_primary.created_at)
    for row in goes_xray_flux_primary.data:
        print(row.date)
        print("%.12f, %.12f" % (row.short, row.long))
The intention of the SolarXrayFluxData class is to store each item's data and to make sure it is in a nice, usable format. The GoesXrayFluxPrimary class is used to populate a list of SolarXrayFluxData and to store any other data that you might want to pull out. For example, I've grabbed the created date and time. You could also get the Location and Source from the header data.
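As for the polling side of your question (Part II), a minimal sketch reusing the class above could look like this; the threshold value and the take_image stub are hypothetical placeholders you'd replace with your own PyAutoGUI automation:

import time

THRESHOLD = 1e-6  # hypothetical flux threshold; tune it to your observing needs

def take_image():
    # Placeholder: hook your PyAutoGUI automation in here.
    print("Threshold exceeded, triggering the imaging software")

while True:
    goes = GoesXrayFluxPrimary()
    goes.extract_data('https://services.swpc.noaa.gov/text/goes-xray-flux-primary.txt')
    latest = goes.data[-1]  # most recent row of the table
    if latest.short > THRESHOLD or latest.long > THRESHOLD:
        take_image()
    time.sleep(60)  # the file updates every 60 seconds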

Counting date occurrences in python?

I'm currently trying to count the number of times a date occurs within a chat log. For example, the file I'm reading from may look something like this:
*username* (mm/dd/yyyy hh:mm:ss): *message here*
However I need to split the date from the time, as I currently treat them as one. I'm currently struggling to solve my problem, so any help is appreciated. Below is some sample code that I'm currently using to try to get the date count working. I'm currently using a counter, but I'm wondering if there are other ways to count dates.
filename = tkFileDialog.askopenfile(filetypes=(("Text files", "*.txt"),))
mtxtr = filename.read()
date = []
number = []
occurences = Counter(date)
mtxtformat = mtxtr.split("\r\n")
print 'The Dates in the chat are as follows'
print "--------------------------------------------"
for mtxtf in mtxtformat:
    participant = mtxtf.split("(")[0]
    date1 = mtxtf.split("(")[-1]
    message = date1.split(")")[0]
    date.append(date1.strip())
for item in date:
    if item not in number:
        number.append(item)
for item in number:
    occurences = date.count(item)
    print("Date Occurences " + " is: " + str(occurences))
Easiest way would be to use regex and take the count of the date pattern you have in the log file. It would be faster too.
If you know the date and time are going to be enclosed in parentheses at the start of the message (i.e. no parentheses (...): will be seen before the one containing the date and time):
*username* (mm/dd/yyyy hh:mm:ss): *message here*
Then you can extract based on the parens:
import re
...
parens = re.compile(r'\((.+)\)')
for mtxtf in mtxtformat:
    match = parens.search(mtxtf)
    date.append(match.group(1).split(' ')[0])
...
Note: If the message itself contains parens, this may match more than just the needed (mm/dd/yyyy hh:mm:ss). Doing match.group(1).split(' ')[0] would still give you the information you are looking for assuming there is no information enclosed in parens before your date-time information (for the current line).
Note2: Ideally enclose this in a try-except to continue on to the next line if the current line doesn't contain useful information.
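Putting it together, you can feed the extracted dates straight into the Counter the question already imports. A minimal sketch (the log below is a made-up sample in the stated format):

import re
from collections import Counter

log = """alice (01/02/2020 10:00:01): hi
bob (01/02/2020 10:00:05): hello
alice (02/02/2020 09:30:00): morning"""

date_pattern = re.compile(r'\((\d{2}/\d{2}/\d{4}) ')  # capture the date, drop the time
occurrences = Counter(date_pattern.findall(log))
for date, count in occurrences.items():
    print("Date " + date + " occurs " + str(count) + " times")
# Date 01/02/2020 occurs 2 times
# Date 02/02/2020 occurs 1 times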

python : reading a datetime from a log file using regex

I have a log file which has text that looks like this.
Jul 1 03:27:12 syslog: [m_java][ 1/Jul/2013 03:27:12.818][j:[SessionThread <]^Iat com/avc/abc/magr/service/find.something(abc/1235/locator/abc;Ljava/lang/String;)Labc/abc/abcd/abcd;(bytecode:7)
There are two time formats in the file. I need to sort this log file based on the date time format enclosed in [].
This is the regex I am trying to use, but it does not return anything.
t_pat = re.compile(r".*\[\d+/\D+/.*\]")
I want to go over each line in the file, apply this pattern, and sort the lines based on the date & time.
Can someone help me on this? Thanks!
You have a space in there that needs to be added to the regular expression
text = "Jul 1 03:27:12 syslog: [m_java][ 1/Jul/2013 03:27:12.818][j:[SessionThread <]^Iat com/avc/abc/magr/service/find.something(abc/1235/locator/abc;Ljava/lang/String;)Labc/abc/abcd/abcd;(bytecode:7)"
matches = re.findall(r"\[\s*(\d+/\D+/.*?)\]", text)
print matches
['1/Jul/2013 03:27:12.818']
Next parse the time using the following function
http://docs.python.org/2/library/time.html#time.strptime
Finally use this as a key into a dict, and the line as the value, and sort these entries based on the key.
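A minimal sketch of that dict-based approach (mylog.txt is a hypothetical file of such lines):

import re
from datetime import datetime

entries = {}
with open('mylog.txt') as f:
    for line in f:
        matches = re.findall(r"\[\s*(\d+/\D+/.*?)\]", line)
        if matches:
            # the parsed datetime becomes the sort key for the whole line
            key = datetime.strptime(matches[0], '%d/%b/%Y %H:%M:%S.%f')
            entries[key] = line
for key in sorted(entries):
    print(entries[key])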
You are not matching the initial space; you also want to group the date for easy extraction, and limit the \D and .* patterns to non-greedy:
t_pat = re.compile(r".*\[\s?(\d+/\D+?/.*?)\]")
Demo:
>>> re.compile(r".*\[\s?(\d+/\D+?/.*?)\]").search(line).group(1)
'1/Jul/2013 03:27:12.818'
You can narrow down the pattern some more; you only need to match 3 letters for the month for example:
t_pat = re.compile(r".*\[\s?(\d{1,2}/[A-Z][a-z]{2}/\d{4} \d{2}:\d{2}:[\d.]{2,})\]")
Read all the lines of the file and use the sort function and pass in a function that parses out the date and uses that as the key for sorting:
import re
import datetime

def parse_date_from_log_line(line):
    t_pat = re.compile(r".*\[\s?(\d+/\D+?/.*?)\]")
    date_string = t_pat.search(line).group(1)
    format = '%d/%b/%Y %H:%M:%S.%f'
    return datetime.datetime.strptime(date_string, format)

log_path = 'mylog.txt'
with open(log_path) as log_file:
    lines = log_file.readlines()
lines.sort(key=parse_date_from_log_line)

Parsing text files using Python

I am very new to Python and am looking to use it to parse a text file. The file has between 250-300 lines of the following format:
---- Mark Grey (mark.grey@gmail.com) changed status from Busy to Available @ 14/07/2010 16:32:36 ----
---- Silvia Pablo (spablo@gmail.com) became Available @ 14/07/2010 16:32:39 ----
I need to store the following information into another file (excel or text) for all the entries from this file
UserName/ID Previous Status New Status Date Time
So my result file should look like this for the above entries:
Mark Grey/mark.grey@gmail.com Busy Available 14/07/2010 16:32:36
Silvia Pablo/spablo@gmail.com NaN Available 14/07/2010 16:32:39
Thanks in advance; any help would be really appreciated.
To get you started:
import re

result = []
regex = re.compile(
    r"""^-*\s+
    (?P<name>.*?)\s+
    \((?P<email>.*?)\)\s+
    (?:changed\s+status\s+from\s+(?P<previous>.*?)\s+to|became)\s+
    (?P<new>.*?)\s+@\s+
    (?P<date>\S+)\s+
    (?P<time>\S+)\s+
    -*$""", re.VERBOSE)

with open("inputfile") as f:
    for line in f:
        match = regex.match(line)
        if match:
            result.append([
                match.group("name"),
                match.group("email"),
                match.group("previous")
                # etc.
            ])
        else:
            pass  # Match attempt failed
will get you an array of the parts of the match. I'd then suggest you use the csv module to store the results in a standard format.
import re

pat = re.compile(r"----\s+(.*?) \((.*?)\) (?:changed status from (\w+) to|became) (\w+) @ (.*?) ----\s*")
with open("data.txt") as f:
    for line in f:
        (name, email, prev, curr, date) = pat.match(line).groups()
        print "{0}/{1} {2} {3} {4}".format(name, email, prev or "NaN", curr, date)
This makes assumptions about whitespace and also assumes that every line conforms to the pattern. You might want to add error checking (such as checking that pat.match() doesn't return None) if you want to handle dirty input gracefully.
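For example, a hedged variant of the same loop with that check added (a minimal sketch):

import re

pat = re.compile(r"----\s+(.*?) \((.*?)\) (?:changed status from (\w+) to|became) (\w+) @ (.*?) ----\s*")
with open("data.txt") as f:
    for line in f:
        m = pat.match(line)
        if m is None:
            continue  # skip non-conforming lines instead of raising AttributeError
        name, email, prev, curr, date = m.groups()
        print "{0}/{1} {2} {3} {4}".format(name, email, prev or "NaN", curr, date)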
The two RE patterns of interest seem to be...:
p1 = r'^---- ([^(]+) \(([^)]+)\) changed status from (\w+) to (\w+) @ (\S+) (\S+) ----$'
p2 = r'^---- ([^(]+) \(([^)]+)\) became (\w+) @ (\S+) (\S+) ----$'
so I'd do:
import csv, re, sys

# assign p1, p2 as above (or enhance them, etc etc)
r1 = re.compile(p1)
r2 = re.compile(p2)

data = []
with open('somefile.txt') as f:
    for line in f:
        m = r1.match(line)
        if m:
            data.append(m.groups())
            continue
        m = r2.match(line)
        if not m:
            print>>sys.stderr, "No match for line: %r" % line
            continue
        listofgroups = list(m.groups())
        listofgroups.insert(2, 'NaN')
        data.append(listofgroups)

with open('result.csv', 'w') as f:
    w = csv.writer(f)
    w.writerow('UserName/ID Previous Status New Status Date Time'.split())
    w.writerows(data)
If the two patterns I described are not general enough, they may need to be tweaked, of course, but I think this general approach will be useful. While many Python users on Stack Overflow intensely dislike REs, I find them very useful for this kind of pragmatic ad hoc text processing.
Maybe the dislike is explained by others wanting to use REs for absurd uses such as ad hoc parsing of CSV, HTML, XML, ... -- and many other kinds of structured text formats for which perfectly good parsers exist! And also, other tasks well beyond REs' "comfort zone", and requiring instead solid general parser systems like pyparsing. Or at the other extreme super-simple tasks done perfectly well with simple strings (e.g. I remember a recent SO question which used if re.search('something', s): instead of if 'something' in s:!-).
But for the reasonably broad swathe of tasks (excluding the very simplest ones at one end, and the parsing of structured or somewhat-complicated grammars at the other) for which REs are appropriate, there's really nothing wrong with using them, and I recommend to all programmers to learn at least REs' basics.
Alex mentioned pyparsing and so here is a pyparsing approach to your same problem:
from pyparsing import Word, Suppress, Regex, oneOf, SkipTo
import datetime

DASHES = Word('-').suppress()
LPAR, RPAR, AT = map(Suppress, "()@")
date = Regex(r'\d{2}/\d{2}/\d{4}')
time = Regex(r'\d{2}:\d{2}:\d{2}')
status = oneOf("Busy Available Idle Offline Unavailable")

statechange1 = 'changed status from' + status('fromstate') + 'to' + status('tostate')
statechange2 = 'became' + status('tostate')
linefmt = (DASHES + SkipTo('(')('name') + LPAR + SkipTo(RPAR)('email') + RPAR +
           (statechange1 | statechange2) +
           AT + date('date') + time('time') + DASHES)

def convertFields(tokens):
    if 'fromstate' not in tokens:
        tokens['fromstate'] = 'NULL'
    tokens['name'] = tokens.name.strip()
    tokens['email'] = tokens.email.strip()
    d, mon, yr = map(int, tokens.date.split('/'))
    h, m, s = map(int, tokens.time.split(':'))
    tokens['datetime'] = datetime.datetime(yr, mon, d, h, m, s)

linefmt.setParseAction(convertFields)

for line in text.splitlines():
    fields = linefmt.parseString(line)
    print "%(name)s/%(email)s %(fromstate)-10.10s %(tostate)-10.10s %(datetime)s" % fields
prints:
Mark Grey/mark.grey@gmail.com Busy Available 2010-07-14 16:32:36
Silvia Pablo/spablo@gmail.com NULL Available 2010-07-14 16:32:39
pyparsing allows you to attach names to the results fields (just like the named groups in Tom Pietzcker's RE-styled answer), plus parse-time actions to act on or manipulate the parsed actions - note the conversion of the separate date and time fields into a true datetime object, already converted and ready for processing after parsing with no additional muss nor fuss.
Here is a modified loop that just dumps out the parsed tokens and the named fields for each line:
for line in text.splitlines():
    fields = linefmt.parseString(line)
    print fields.dump()
prints:
['Mark Grey ', 'mark.grey@gmail.com', 'changed status from', 'Busy', 'to', 'Available', '14/07/2010', '16:32:36']
- date: 14/07/2010
- datetime: 2010-07-14 16:32:36
- email: mark.grey@gmail.com
- fromstate: Busy
- name: Mark Grey
- time: 16:32:36
- tostate: Available
['Silvia Pablo ', 'spablo@gmail.com', 'became', 'Available', '14/07/2010', '16:32:39']
- date: 14/07/2010
- datetime: 2010-07-14 16:32:39
- email: spablo@gmail.com
- fromstate: NULL
- name: Silvia Pablo
- time: 16:32:39
- tostate: Available
I suspect that as you continue to work on this problem, you will find other variations on the format of the input text specifying how the user's state changed. In this case, you would just add another definition like statechange1 or statechange2, and insert it into linefmt with the others. I feel that pyparsing's structuring of the parser definition helps developers come back to a parser after things have changed, and easily extend their parsing program.
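For instance, a hypothetical third variation (assuming lines like '---- ... signed off @ ...' ever show up) would slot in like this:

# Hypothetical extra pattern for a third kind of status line
statechange3 = 'signed' + oneOf("on off")('tostate')
linefmt = (DASHES + SkipTo('(')('name') + LPAR + SkipTo(RPAR)('email') + RPAR +
           (statechange1 | statechange2 | statechange3) +
           AT + date('date') + time('time') + DASHES)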
Well, if I were to approach this problem, I'd probably start by splitting each entry into its own separate string. This looks like it might be line oriented, so an inputfile.split('\n') is probably adequate. From there I would craft a regular expression to match each of the possible status changes, with subgroups wrapping each of the important fields.
Thanks very much for all your comments; they were very useful. I wrote my code using the directory functionality. What it does is read through the files and create an output file for each user with all of their status updates. Here is the code pasted below.
#Script to extract info from individual data files and print out a data file combining info from these files
import os
import commands

dataFileDir = "data/";

#Dictionary linking names to email ids
#For the time being, assume no 2 people have the same name
usrName2Id = {};

#User id to user name mapping to check for duplicate names
usrId2Name = {};

#Store info: key: user ids and values a dictionary with time stamp keys and status messages values
infoDict = {};

#Given an array of space tokenized inputs, extract user name
def getUserName(info, mailInd):
    userName = "";
    for i in range(mailInd-1, 0, -1):
        if info[i].endswith("-") or info[i].endswith("+"):
            break;
        userName = info[i]+" "+userName;
    userName = userName.strip();
    userName = userName.replace("  ", " ");
    userName = userName.replace(" ", "_");
    return userName;

#Given an array of space tokenized inputs, extract time stamp
def getTimeStamp(info, timeStartInd):
    timeStamp = "";
    for i in range(timeStartInd+1, len(info)):
        timeStamp = timeStamp+" "+info[i];
    timeStamp = timeStamp.replace("-", "");
    timeStamp = timeStamp.strip();
    return timeStamp;

#Given an array of space tokenized inputs, extract status message
def getStatusMsg(info, startInd, endInd):
    msg = "";
    for i in range(startInd, endInd):
        msg = msg+" "+info[i];
    msg = msg.strip();
    msg = msg.replace(" ", "_");
    return msg;

#Extract and store info from each line in the datafile
def extractLineInfo(line):
    print line;
    info = line.split(" ");
    mailInd = -1; userId = "-NONE-";
    timeStartInd = -1; timeStamp = "-NONE-";
    becameInd = -1;
    statusMsg = "-NONE-";
    #Find indices of email id and "@" char indicating start of timestamp
    for i in range(0, len(info)):
        #print (str(i)+" "+info[i]);
        if(info[i].startswith("(") and info[i].endswith("@in.ibm.com)")):
            mailInd = i;
        if(info[i] == "@"):
            timeStartInd = i;
        if(info[i] == "became"):
            becameInd = i;
    #Debug print of mail and time stamp start inds
    """print "\n";
    print "Index of mail id: "+str(mailInd);
    print "Index of time start index: "+str(timeStartInd);
    print "\n";"""
    #Extract IBM user id and name for lines with ibm id
    if(mailInd >= 0):
        userId = info[mailInd].replace("(", "");
        userId = userId.replace(")", "");
        userName = getUserName(info, mailInd);
    #Lines with no ibm id are of the form "Suraj Godar Mr became idle @ 15/07/2010 16:30:18"
    elif(becameInd > 0):
        userName = getUserName(info, becameInd);
    #Time stamp info
    if(timeStartInd >= 0):
        timeStamp = getTimeStamp(info, timeStartInd);
        if(mailInd >= 0):
            statusMsg = getStatusMsg(info, mailInd+1, timeStartInd);
        elif(becameInd > 0):
            statusMsg = getStatusMsg(info, becameInd, timeStartInd);
    print userId;
    print userName;
    print timeStamp
    print statusMsg+"\n";
    if not(userName in usrName2Id) and not(userName == "-NONE-") and not(userId == "-NONE-"):
        usrName2Id[userName] = userId;
    #Store status messages keyed by user email ids
    timeDict = {};
    #Retrieve user id corresponding to user name
    if userName in usrName2Id:
        userId = usrName2Id[userName];
    #For valid user ids, store status message in the dict within dict data str arrangement
    if not(userId == "-NONE-"):
        if not(userId in infoDict.keys()):
            infoDict[userId] = {};
        timeDict = infoDict[userId];
        if not(timeStamp in timeDict.keys()):
            timeDict[timeStamp] = statusMsg;
        else:
            timeDict[timeStamp] = timeDict[timeStamp]+" "+statusMsg;

#Print for each user a file containing status
def printStatusFiles(dataFileDir):
    volNum = 0;
    for userName in usrName2Id:
        volNum = volNum+1;
        filename = dataFileDir+"/"+"status-"+str(volNum)+".txt";
        file = open(filename, "w");
        print "Printing output file name: "+filename;
        print volNum, userName, usrName2Id[userName]+"\n";
        file.write(userName+" "+usrName2Id[userName]+"\n");
        timeDict = infoDict[usrName2Id[userName]];
        for time in sorted(timeDict.keys()):
            file.write(time+" "+timeDict[time]+"\n");

#Read and store data from individual data files
def readDataFiles(dataFileDir):
    #Process each datafile
    files = os.listdir(dataFileDir)
    files.sort();
    for i in range(0, len(files)):
    #for i in range(0,1):
        file = files[i];
        #Do not process other non-data files lying around in that dir
        if not file.endswith(".txt"):
            continue
        print "Processing data file: "+file
        dataFile = dataFileDir+str(file);
        inpFile = open(dataFile, "r");
        lines = inpFile.readlines();
        #Process lines
        for line in lines:
            #Clean lines
            line = line.strip();
            line = line.replace("/India/Contr/IBM", "");
            line = line.strip();
            #Skip header line of the file and L's sign in sign out times
            if(line.startswith("System log for account") or line.find("signed") > -1):
                continue;
            extractLineInfo(line);
        print "\n";

readDataFiles(dataFileDir);
print "\n";
printStatusFiles("out/");
