Say I have a number of fields, each with a fixed length in characters. The first field is the ID with a length of 10, the second field is the phone number with a length of 20, and so forth. Can I set the delimiter based on the length?
The data does not have any structure to it, so in order to read it I have to find a way to construct the table as I read it in. It's a plain text file. I do, however, have the character length of each field. I have not done anything like this before, so before I spend hours on it I wanted to see if this is even possible.
In Python you can just slice the string:
msg = "ID12345678PHONE123456789012345BLOB"
_id = msg[:10]
phone = msg[10:30]
blob = msg[30:34]
print(_id, phone, blob)
Result:
ID12345678 PHONE123456789012345 BLOB
Option 2: If you open the file in binary mode and get byte strings, you can use the struct module to unpack them.
import struct
msg = b"ID12345678PHONE123456789012345BLOB"
_id, phone, blob = struct.unpack("10s20s4s", msg)
print(_id, phone, blob)
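Note that because the input is a byte string, struct.unpack returns bytes objects (b'ID12345678' and so on), so you will likely want to decode each field, e.g. _id.decode('ascii'), to get regular strings.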
In Python it should be quite simple indeed.
In pandas you can use the function read_fwf(), which is built for fixed-width files.
Given for example something like this
myfile.txt
1name1surname1
2name2surname2
with columns of size [1, 5, 8]
you can read the file in this way
import pandas as pd
df = pd.read_fwf("myfile.txt", widths=[1, 5, 8], header=None)
(header=None tells pandas the file has no header row.) See the pandas documentation for read_fwf for further details on this function.
If you want to parse the file yourself instead, which is also quite simple:
import pandas as pd

# column name + size
meta_data = [('id', 1), ('name', 5), ('surname', 8)]

def my_parser(line):
    curr_dict = {}
    start = 0
    end = 0
    for name, size in meta_data:
        end = end + size
        curr_dict[name] = line[start:end]
        start = end
    return curr_dict

with open("myfile.txt", "r") as f_o:
    lines = f_o.readlines()

dicts = []
for line in lines:
    dicts.append(my_parser(line.rstrip("\n")))

df = pd.DataFrame(dicts)
I am new to Python and am trying to read a PDF file to pull the ID No. I have been successful so far in extracting the text out of the PDF file using pdfplumber. Below is the code block:
import pdfplumber
with pdfplumber.open('ABC.pdf') as pdf_file:
    firstpage = pdf_file.pages[0]
    raw_text = firstpage.extract_text()
    print(raw_text)
Here is the text output:
Welcome to ABC
01 January, 1991
ID No. : 10101010
Welcome to your ABC portal. Learn
More text here..
Even more text here..
Mr Jane Doe
Jack & Jill Street Learn more about your
www.abc.com
....
....
....
However, I am unable to find the optimal way to parse this unstructured text further. The final output I am expecting is just the ID No., i.e. 10101010. On a side note, the script would be run against a fairly huge set of PDFs, so performance is a concern.
Try using a regular expression:
import pdfplumber
import re

with pdfplumber.open('ABC.pdf') as pdf_file:
    firstpage = pdf_file.pages[0]
    raw_text = firstpage.extract_text()

m = re.search(r'ID No\. : (\d+)', raw_text)
if m:
    print(m.group(1))
Of course you'll have to iterate over all the PDF's pages - not just the first one! Also ask yourself whether there can be more than one match per page. Anyway: you know the structure of the input better than I do (and we don't have access to the sample file), so I'll leave that as an exercise for you.
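For instance, a minimal sketch that scans every page and stops at the first hit (assuming the 'ID No.' line looks the same wherever it appears):

import pdfplumber
import re

pattern = re.compile(r'ID No\. : (\d+)')

with pdfplumber.open('ABC.pdf') as pdf_file:
    for page in pdf_file.pages:
        text = page.extract_text()
        if not text:
            continue  # some pages may have no extractable text
        m = pattern.search(text)
        if m:
            print(m.group(1))
            break  # drop this break to collect every match instead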
If the length of the ID number is always the same, I would try to find its location with the find function. position = raw_text.find('ID No. : ') should return the position of the I in ID No., so position + 9 should be the first digit of the ID. If the number always has a length of 8, you could get it with int(raw_text[position+9:position+17]).
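A minimal sketch of that idea, reusing raw_text from the extraction snippet above (the 8-digit length is an assumption):

marker = 'ID No. : '
position = raw_text.find(marker)
if position != -1:  # find returns -1 when the marker is absent
    id_no = int(raw_text[position + len(marker):position + len(marker) + 8])
    print(id_no)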
If you are new to Python and actually need to process serious amounts of data, I suggest that you look at Scala as an alternative.
For data processing in general, and regular expression matching in particular, the time it takes to get results is much reduced.
Here is an answer to your question in Scala instead of Python:
import com.itextpdf.text.pdf.PdfReader
import com.itextpdf.text.pdf.parser.PdfTextExtractor
val fil = "ABC.pdf"
val textFromPage = (1 to (new PdfReader(fil)).getNumberOfPages).par.map(page => PdfTextExtractor.getTextFromPage(new PdfReader(fil), page)).mkString
val r = "ID No\\. : (\\d+)".r
val res = for (m <- r.findAllMatchIn(textFromPage)) yield m.group(1)
res.foreach(println)
I tried to look for a solution but nothing was giving me quite what I needed. I'm not sure regex can do what I need.
I need to process a large amount of data where license information is provided. I just need to grab the number of licenses and the name for each license then group and tally the license counts for each company.
Here's an example of the data pulled:
L00129A578-E105C1D138 1 Centralized Recording
$42.00
L00129A677-213DC6D60E 1 Centralized Recording
$42.00
1005272AE2-C1D6CACEC8 5 Station
$45.00
100525B658-3AC4D2C93A 5 Station
$45.00
I would need to grab the license count and license name, then group like items, so it would grab (1 Centralized Recording, 1 Centralized Recording, 5 Station, 5 Station), then add the license counts and output (2 Centralized Recording, 10 Station).
What would be the easiest way to implement this?
It looks like you're trying to ignore the license number, and get the count and name. So, the following should point you on your way for your data, if it is as uniform as it seems:
import re
r = re.compile(r"\s+(\d+)\s+[A-Za-z ]+")
r = re.compile(r"\s+(\d+)\s+([A-Za-z ]+)")
m = r.search(" 1 Centralized")
m.groups()
# ('1', 'Centralized')
That regex just says: require but ignore one or more spaces, capture the string of digits after them, require but ignore one or more spaces after that, and capture the capital letters, lowercase letters, and spaces that follow. (You may need to trim off a newline when you're done.)
The file-handling bit would look like:
with open('/path/to/your_data_file.txt') as f:
    for line in f:
        # run regex and do stuff for each line
        pass
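Putting those pieces together, here is a minimal sketch that tallies the counts per license name with collections.Counter (the file path is a placeholder):

import re
from collections import Counter

r = re.compile(r"\s+(\d+)\s+([A-Za-z ]+)")
totals = Counter()

with open('/path/to/your_data_file.txt') as f:
    for line in f:
        m = r.search(line)
        if m:  # price lines like $42.00 don't match and are skipped
            count, name = m.groups()
            totals[name.strip()] += int(count)

for name, count in totals.items():
    print(count, name)  # e.g. 2 Centralized Recording, 10 Station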
import re, io, pandas as pd

a = open('your_data_file.txt').read()  # re.sub needs the file contents, not the file object
pd.read_csv(io.StringIO(re.sub(r'(?m).*\s(\d+)\s+(.*\S+)\s+$\n|.*', '\\1,\\2', a)),
            header=None).groupby(1).sum()[0].to_dict()
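In other words, the re.sub rewrites each license line to a count,name CSV row and blanks out everything else, and pandas then groups by name and sums the counts; for the sample data this should yield something like {'Centralized Recording': 2, 'Station': 10}.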
Pandas is a good tool for jobs like this. You might have to play around with it a bit. You will also need to export your Excel file as a .csv file. In the interpreter, try:
import pandas
raw = pandas.read_csv('myfile.csv')
print(raw.columns)
That will give you the column headings for the csv file. If you have headers name and nums, then you can extract those as a list of tuples as follows:
extract = list(zip(raw.name, raw.nums))
You can then sort this list by name:
extract = sorted(extract)
Pandas probably has a method for compressing this easily, but I can't recall it, so:
def accum(c):
    nm = c[0][0]
    count = 0
    result = []
    for x in c:
        if x[0] == nm:
            count += x[1]
        else:
            result.append((nm, count))
            nm = x[0]
            count = x[1]
    result.append((nm, count))
    return result

done = accum(extract)
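(For reference, the pandas method alluded to above is groupby; a minimal sketch, assuming the columns really are name and nums:)

import pandas as pd

raw = pd.read_csv('myfile.csv')
totals = raw.groupby('name')['nums'].sum()
print(totals.to_dict())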
Now you can write this to a text file as follows (f-strings require Python 3.6+):
with open("myjob.txt", "w+") as fout:
    for x in done:
        line = f"name: {x[0]} count: {x[1]}\n"
        fout.write(line)
I'm currently trying to count the number of times a date occurs within a chat log. For example, the file I'm reading from may look something like this:
*username* (mm/dd/yyyy hh:mm:ss): *message here*
However, I need to split the date from the time, as I currently treat them as one. I'm currently struggling to solve my problem, so any help is appreciated. Below is some sample code that I'm currently using to try to get the date count working. I'm currently using a Counter; however, I'm wondering if there are other ways to count dates.
import tkFileDialog
from collections import Counter

filename = tkFileDialog.askopenfile(filetypes=(("Text files", "*.txt"),))
mtxtr = filename.read()
date = []
number = []
occurences = Counter(date)
mtxtformat = mtxtr.split("\r\n")
print 'The Dates in the chat are as follows'
print "--------------------------------------------"
for mtxtf in mtxtformat:
    participant = mtxtf.split("(")[0]
    date1 = mtxtf.split("(")[-1]
    message = date1.split(")")[0]
    date.append(date1.strip())
for item in date:
    if item not in number:
        number.append(item)
for item in number:
    occurences = date.count(item)
    print("Date Occurences " + " is: " + str(occurences))
The easiest way would be to use a regex and count the occurrences of the date pattern you have in the log file. It would be faster too.
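A minimal sketch of that approach, assuming the dates always appear in the mm/dd/yyyy form shown (mtxtr is the file contents from the question):

import re
from collections import Counter

dates = re.findall(r'\d{2}/\d{2}/\d{4}', mtxtr)
for item, count in Counter(dates).items():
    print item, count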
If you know the date and time are going to be enclosed in parentheses at the start of the message (i.e. no parentheses (...): will be seen before the one containing the date and time):
*username* (mm/dd/yyyy hh:mm:ss): *message here*
Then you can extract based on the parens:
import re
...
parens = re.compile(r'\((.+)\)')
for mtxtf in mtxtformat:
    match = parens.search(mtxtf)
    date.append(match.group(1).split(' ')[0])
...
Note: If the message itself contains parens, this may match more than just the needed (mm/dd/yyyy hh:mm:ss). Doing match.group(1).split(' ')[0] would still give you the information you are looking for, assuming no information is enclosed in parens before your date-time information on the current line.
Note 2: Ideally, enclose this in a try-except so you can continue on to the next line if the current line doesn't contain useful information.
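Once the date list is filled this way, the Counter the question already uses will give the per-date tallies, e.g.:

occurences = Counter(date)
for item, count in occurences.items():
    print "Date " + item + " occurs " + str(count) + " times"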
I am very new to Python and am looking to use it to parse a text file. The file has between 250-300 lines of the following format:
---- Mark Grey (mark.grey#gmail.com) changed status from Busy to Available # 14/07/2010 16:32:36 ----
---- Silvia Pablo (spablo#gmail.com) became Available # 14/07/2010 16:32:39 ----
I need to store the following information into another file (excel or text) for all the entries from this file
UserName/ID Previous Status New Status Date Time
So my result file should look like this for the above entries:
Mark Grey/mark.grey#gmail.com Busy Available 14/07/2010 16:32:36
Silvia Pablo/spablo#gmail.com NaN Available 14/07/2010 16:32:39
Thanks in advance. Any help would be really appreciated.
To get you started:
import re

result = []
regex = re.compile(
    r"""^-*\s+
    (?P<name>.*?)\s+
    \((?P<email>.*?)\)\s+
    (?:changed\s+status\s+from\s+(?P<previous>.*?)\s+to|became)\s+
    (?P<new>.*?)\s+#\s+
    (?P<date>\S+)\s+
    (?P<time>\S+)\s+
    -*$""", re.VERBOSE)

with open("inputfile") as f:
    for line in f:
        match = regex.match(line)
        if match:
            result.append([
                match.group("name"),
                match.group("email"),
                match.group("previous"),
                # etc.
            ])
        else:
            pass  # Match attempt failed
This will get you a list of the parts of each match. I'd then suggest you use the csv module to store the results in a standard format.
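For instance, a minimal sketch of the csv step (output.csv is a placeholder, and it assumes each row of result has been filled out with all six groups):

import csv

with open("output.csv", "w", newline="") as fout:  # Python 3; on Python 2 use mode "wb" and drop newline
    writer = csv.writer(fout)
    writer.writerow(["Name", "Email", "Previous Status", "New Status", "Date", "Time"])
    writer.writerows(result)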
import re
pat = re.compile(r"----\s+(.*?) \((.*?)\) (?:changed status from (\w+) to|became) (\w+) # (.*?) ----\s*")

with open("data.txt") as f:
    for line in f:
        (name, email, prev, curr, date) = pat.match(line).groups()
        print "{0}/{1} {2} {3} {4}".format(name, email, prev or "NaN", curr, date)
This makes assumptions about whitespace and also assumes that every line conforms to the pattern. You might want to add error checking (such as checking that pat.match() doesn't return None) if you want to handle dirty input gracefully.
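For example, the matching line could be guarded like this:

m = pat.match(line)
if m is not None:
    (name, email, prev, curr, date) = m.groups()
    print "{0}/{1} {2} {3} {4}".format(name, email, prev or "NaN", curr, date)
else:
    print "skipping malformed line: %r" % line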
The two RE patterns of interest seem to be...:
p1 = r'^---- ([^(]+) \(([^)]+)\) changed status from (\w+) to (\w+) # (\S+) (\S+) ----$'
p2 = r'^---- ([^(]+) \(([^)]+)\) became (\w+) # (\S+) (\S+) ----$'
so I'd do:
import csv, re, sys

# assign p1, p2 as above (or enhance them, etc etc)
r1 = re.compile(p1)
r2 = re.compile(p2)

data = []
with open('somefile.txt') as f:
    for line in f:
        m = r1.match(line)
        if m:
            data.append(m.groups())
            continue
        m = r2.match(line)
        if not m:
            print>>sys.stderr, "No match for line: %r" % line
            continue
        listofgroups = list(m.groups())
        listofgroups.insert(2, 'NaN')
        data.append(listofgroups)

with open('result.csv', 'w') as f:
    w = csv.writer(f)
    w.writerow('UserName/ID Previous Status New Status Date Time'.split())
    w.writerows(data)
If the two patterns I described are not general enough, they may need to be tweaked, of course, but I think this general approach will be useful. While many Python users on Stack Overflow intensely dislike REs, I find them very useful for this kind of pragmatic ad hoc text processing.
Maybe the dislike is explained by others wanting to use REs for absurd uses such as ad hoc parsing of CSV, HTML, XML, ... -- and many other kinds of structured text formats for which perfectly good parsers exist! And also, other tasks well beyond REs' "comfort zone", and requiring instead solid general parser systems like pyparsing. Or at the other extreme super-simple tasks done perfectly well with simple strings (e.g. I remember a recent SO question which used if re.search('something', s): instead of if 'something' in s:!-).
But for the reasonably broad swathe of tasks (excluding the very simplest ones at one end, and the parsing of structured or somewhat-complicated grammars at the other) for which REs are appropriate, there's really nothing wrong with using them, and I recommend that all programmers learn at least REs' basics.
Alex mentioned pyparsing and so here is a pyparsing approach to your same problem:
from pyparsing import Word, Suppress, Regex, oneOf, SkipTo
import datetime

DASHES = Word('-').suppress()
LPAR, RPAR, AT = map(Suppress, "()#")
date = Regex(r'\d{2}/\d{2}/\d{4}')
time = Regex(r'\d{2}:\d{2}:\d{2}')
status = oneOf("Busy Available Idle Offline Unavailable")

statechange1 = 'changed status from' + status('fromstate') + 'to' + status('tostate')
statechange2 = 'became' + status('tostate')

linefmt = (DASHES + SkipTo('(')('name') + LPAR + SkipTo(RPAR)('email') + RPAR +
           (statechange1 | statechange2) +
           AT + date('date') + time('time') + DASHES)

def convertFields(tokens):
    if 'fromstate' not in tokens:
        tokens['fromstate'] = 'NULL'
    tokens['name'] = tokens.name.strip()
    tokens['email'] = tokens.email.strip()
    d, mon, yr = map(int, tokens.date.split('/'))
    h, m, s = map(int, tokens.time.split(':'))
    tokens['datetime'] = datetime.datetime(yr, mon, d, h, m, s)

linefmt.setParseAction(convertFields)

for line in text.splitlines():
    fields = linefmt.parseString(line)
    print "%(name)s/%(email)s %(fromstate)-10.10s %(tostate)-10.10s %(datetime)s" % fields
prints:
Mark Grey/mark.grey#gmail.com Busy Available 2010-07-14 16:32:36
Silvia Pablo/spablo#gmail.com NULL Available 2010-07-14 16:32:39
pyparsing allows you to attach names to the result fields (just like the named groups in Tim Pietzcker's RE-styled answer), plus parse-time actions to act on or manipulate the parsed tokens - note the conversion of the separate date and time fields into a true datetime object, already converted and ready for processing after parsing with no additional muss or fuss.
Here is a modified loop that just dumps out the parsed tokens and the named fields for each line:
for line in text.splitlines():
    fields = linefmt.parseString(line)
    print fields.dump()
prints:
['Mark Grey ', 'mark.grey#gmail.com', 'changed status from', 'Busy', 'to', 'Available', '14/07/2010', '16:32:36']
- date: 14/07/2010
- datetime: 2010-07-14 16:32:36
- email: mark.grey#gmail.com
- fromstate: Busy
- name: Mark Grey
- time: 16:32:36
- tostate: Available
['Silvia Pablo ', 'spablo#gmail.com', 'became', 'Available', '14/07/2010', '16:32:39']
- date: 14/07/2010
- datetime: 2010-07-14 16:32:39
- email: spablo#gmail.com
- fromstate: NULL
- name: Silvia Pablo
- time: 16:32:39
- tostate: Available
I suspect that as you continue to work on this problem, you will find other variations on the format of the input text specifying how the user's state changed. In this case, you would just add another definition like statechange1 or statechange2, and insert it into linefmt with the others. I feel that pyparsing's structuring of the parser definition helps developers come back to a parser after things have changed, and easily extend their parsing program.
Well, if I were to approach this problem, I'd probably start by splitting each entry into its own separate string. This looks like it might be line-oriented, so inputfile.split('\n') is probably adequate. From there I would craft a regular expression to match each of the possible status changes, with subgroups wrapping each of the important fields.
Thanks very much for all your comments. They were very useful. I wrote my code using the directory functionality. What it does is read through the files and create an output file for each user with all their status updates. Here is the code, pasted below.
#Script to extract info from individual data files and print out a data file combining info from these files
import os
import commands

dataFileDir="data/";
#Dictionary linking names to email ids
#For the time being, assume no 2 people have the same name
usrName2Id={};
#User id to user name mapping to check for duplicate names
usrId2Name={};
#Store info: key: user ids and values a dictionary with time stamp keys and status messages values
infoDict={};

#Given an array of space tokenized inputs, extract user name
def getUserName(info,mailInd):
    userName="";
    for i in range(mailInd-1,0,-1):
        if info[i].endswith("-") or info[i].endswith("+"):
            break;
        userName=info[i]+" "+userName;
    userName=userName.strip();
    userName=userName.replace("  "," ");
    userName=userName.replace(" ","_");
    return userName;

#Given an array of space tokenized inputs, extract time stamp
def getTimeStamp(info,timeStartInd):
    timeStamp="";
    for i in range(timeStartInd+1,len(info)):
        timeStamp=timeStamp+" "+info[i];
    timeStamp=timeStamp.replace("-","");
    timeStamp=timeStamp.strip();
    return timeStamp;

#Given an array of space tokenized inputs, extract status message
def getStatusMsg(info,startInd,endInd):
    msg="";
    for i in range(startInd,endInd):
        msg=msg+" "+info[i];
    msg=msg.strip();
    msg=msg.replace(" ","_");
    return msg;

#Extract and store info from each line in the datafile
def extractLineInfo(line):
    print line;
    info=line.split(" ");
    mailInd=-1;userId="-NONE-";
    timeStartInd=-1;timeStamp="-NONE-";
    becameInd=-1;
    statusMsg="-NONE-";
    #Find indices of email id and "#" char indicating start of timestamp
    for i in range(0,len(info)):
        #print (str(i)+" "+info[i]);
        if(info[i].startswith("(") and info[i].endswith("#in.ibm.com)")):
            mailInd=i;
        if(info[i]=="#"):
            timeStartInd=i;
        if(info[i]=="became"):
            becameInd=i;
    #Debug print of mail and time stamp start inds
    """print "\n";
    print "Index of mail id: "+str(mailInd);
    print "Index of time start index: "+str(timeStartInd);
    print "\n";"""
    #Extract IBM user id and name for lines with ibm id
    if(mailInd>=0):
        userId=info[mailInd].replace("(","");
        userId=userId.replace(")","");
        userName=getUserName(info,mailInd);
    #Lines with no ibm id are of the form "Suraj Godar Mr became idle # 15/07/2010 16:30:18"
    elif(becameInd>0):
        userName=getUserName(info,becameInd);
    #Time stamp info
    if(timeStartInd>=0):
        timeStamp=getTimeStamp(info,timeStartInd);
        if(mailInd>=0):
            statusMsg=getStatusMsg(info,mailInd+1,timeStartInd);
        elif(becameInd>0):
            statusMsg=getStatusMsg(info,becameInd,timeStartInd);
    print userId;
    print userName;
    print timeStamp
    print statusMsg+"\n";
    if not(userName in usrName2Id) and not(userName=="-NONE-") and not(userId=="-NONE-"):
        usrName2Id[userName]=userId;
    #Store status messages keyed by user email ids
    timeDict={};
    #Retrieve user id corresponding to user name
    if userName in usrName2Id:
        userId=usrName2Id[userName];
    #For valid user ids, store status message in the dict within dict data str arrangement
    if not(userId=="-NONE-"):
        if not(userId in infoDict.keys()):
            infoDict[userId]={};
        timeDict=infoDict[userId];
        if not(timeStamp in timeDict.keys()):
            timeDict[timeStamp]=statusMsg;
        else:
            timeDict[timeStamp]=timeDict[timeStamp]+" "+statusMsg;

#Print for each user a file containing status
def printStatusFiles(dataFileDir):
    volNum=0;
    for userName in usrName2Id:
        volNum=volNum+1;
        filename=dataFileDir+"/"+"status-"+str(volNum)+".txt";
        file = open(filename,"w");
        print "Printing output file name: "+filename;
        print volNum,userName,usrName2Id[userName]+"\n";
        file.write(userName+" "+usrName2Id[userName]+"\n");
        timeDict=infoDict[usrName2Id[userName]];
        for time in sorted(timeDict.keys()):
            file.write(time+" "+timeDict[time]+"\n");

#Read and store data from individual data files
def readDataFiles(dataFileDir):
    #Process each datafile
    files=os.listdir(dataFileDir)
    files.sort();
    for i in range(0,len(files)):
        #for i in range(0,1):
        file=files[i];
        #Do not process other non-data files lying around in that dir
        if not file.endswith(".txt"):
            continue
        print "Processing data file: "+file
        dataFile=dataFileDir+str(file);
        inpFile=open(dataFile,"r");
        lines=inpFile.readlines();
        #Process lines
        for line in lines:
            #Clean lines
            line=line.strip();
            line=line.replace("/India/Contr/IBM","");
            line=line.strip();
            #Skip header line of the file and L's sign in sign out times
            if(line.startswith("System log for account") or line.find("signed")>-1):
                continue;
            extractLineInfo(line);

print "\n";
readDataFiles(dataFileDir);
print "\n";
printStatusFiles("out/");