Say I have a number of fields, each with a fixed length in characters. The first field is the ID with a length of 10, the second field is the phone number with a length of 20, and so forth. Can I set the delimiter based on the length?
The data does not have any structure to it, so in order to read it I have to find a way to construct the table as I read it in. It's a plain text file. I do, however, have the character length of each field. I have not done anything like this before, so before I spend hours on it I wanted to see if this is even possible.
In Python you can just slice the string:
msg = "ID12345678PHONE123456789012345BLOB"
_id = msg[:10]
phone = msg[10:30]
blob = msg[30:34]
print(_id, phone, blob)
Result:
ID12345678 PHONE123456789012345 BLOB
Option 2: If you open the file in binary mode and get byte strings, you can use the struct module to unpack them.
import struct
msg = b"ID12345678PHONE123456789012345BLOB"
_id, phone, blob = struct.unpack("10s20s4s", msg)
print(_id, phone, blob)
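Note that because the input is a byte string, struct.unpack returns bytes objects (b'ID12345678' and so on), so you will likely want to decode each field, e.g. _id.decode('ascii'), to get regular strings.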
In Python it should be quite simple indeed.
In pandas you can use the function read_fwf(), which is built for fixed-width files.
Given for example something like this
myfile.txt
1name1surname1
2name2surname2
with columns of size [1, 5, 8]
you can read the file in this way
import pandas as pd
df = pd.read_fwf("myfile.txt", widths=[1, 5, 8], header=None)
(header=None tells pandas the file has no header row.) See the pandas documentation for read_fwf for further details on this function.
If you want to parse the file yourself instead, which is also quite simple:
import pandas as pd

# column name + size
meta_data = [('id', 1), ('name', 5), ('surname', 8)]

def my_parser(line):
    curr_dict = {}
    start = 0
    end = 0
    for name, size in meta_data:
        end = end + size
        curr_dict[name] = line[start:end]
        start = end
    return curr_dict

with open("myfile.txt", "r") as f_o:
    lines = f_o.readlines()

dicts = []
for line in lines:
    dicts.append(my_parser(line.rstrip("\n")))

df = pd.DataFrame(dicts)
I am new to Python and am trying to read a PDF file to pull the ID No. I have been successful so far in extracting the text out of the PDF file using pdfplumber. Below is the code block:
import pdfplumber
with pdfplumber.open('ABC.pdf') as pdf_file:
    firstpage = pdf_file.pages[0]
    raw_text = firstpage.extract_text()
    print(raw_text)
Here is the text output:
Welcome to ABC
01 January, 1991
ID No. : 10101010
Welcome to your ABC portal. Learn
More text here..
Even more text here..
Mr Jane Doe
Jack & Jill Street Learn more about your
www.abc.com
....
....
....
However, I am unable to find the optimal way to parse this unstructured text further. The final output I am expecting is just the ID No., i.e. 10101010. On a side note, the script would be run against a fairly huge set of PDFs, so performance is a concern.
Try using a regular expression:
import pdfplumber
import re

with pdfplumber.open('ABC.pdf') as pdf_file:
    firstpage = pdf_file.pages[0]
    raw_text = firstpage.extract_text()

m = re.search(r'ID No\. : (\d+)', raw_text)
if m:
    print(m.group(1))
Of course you'll have to iterate over all the PDF's pages - not just the first one! Also ask yourself whether there can be more than one match per page. Anyway: you know the structure of the input better than I do (and we don't have access to the sample file), so I'll leave that as an exercise for you.
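For instance, a minimal sketch that scans every page and stops at the first hit (assuming the 'ID No.' line looks the same wherever it appears):

import pdfplumber
import re

pattern = re.compile(r'ID No\. : (\d+)')

with pdfplumber.open('ABC.pdf') as pdf_file:
    for page in pdf_file.pages:
        text = page.extract_text()
        if not text:
            continue  # some pages may have no extractable text
        m = pattern.search(text)
        if m:
            print(m.group(1))
            break  # drop this break to collect every match instead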
If the length of the ID number is always the same, I would try to find its location with the find function. position = raw_text.find('ID No. : ') should return the position of the I in ID No., so position + 9 should be the first digit of the ID. If the number always has a length of 8, you could get it with int(raw_text[position+9:position+17]).
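A minimal sketch of that idea, reusing raw_text from the extraction snippet above (the 8-digit length is an assumption):

marker = 'ID No. : '
position = raw_text.find(marker)
if position != -1:  # find returns -1 when the marker is absent
    id_no = int(raw_text[position + len(marker):position + len(marker) + 8])
    print(id_no)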
If you are new to Python and actually need to process serious amounts of data, I suggest that you look at Scala as an alternative.
For data processing in general, and regular expression matching in particular, the time it takes to get results is much reduced.
Here is an answer to your question in Scala instead of Python:
import com.itextpdf.text.pdf.PdfReader
import com.itextpdf.text.pdf.parser.PdfTextExtractor
val fil = "ABC.pdf"
val textFromPage = (1 to (new PdfReader(fil)).getNumberOfPages).par.map(page => PdfTextExtractor.getTextFromPage(new PdfReader(fil), page)).mkString
val r = "ID No\\. : (\\d+)".r
val res = for (m <- r.findAllMatchIn(textFromPage)) yield m.group(1)
res.foreach(println)
I tried to look for a solution but nothing was giving me quite what I needed. I'm not sure regex can do what I need.
I need to process a large amount of data where license information is provided. I just need to grab the number of licenses and the name for each license then group and tally the license counts for each company.
Here's an example of the data pulled:
L00129A578-E105C1D138 1 Centralized Recording
$42.00
L00129A677-213DC6D60E 1 Centralized Recording
$42.00
1005272AE2-C1D6CACEC8 5 Station
$45.00
100525B658-3AC4D2C93A 5 Station
$45.00
I would need to grab the license count and license name, then group like items, so it would grab (1 Centralized Recording, 1 Centralized Recording, 5 Station, 5 Station), then add the license counts and output (2 Centralized Recording, 10 Station).
What would be the easiest way to implement this?
It looks like you're trying to ignore the license number, and get the count and name. So, the following should point you on your way for your data, if it is as uniform as it seems:
import re
r = re.compile(r"\s+(\d+)\s+[A-Za-z ]+")
r = re.compile(r"\s+(\d+)\s+([A-Za-z ]+)")
m = r.search(" 1 Centralized")
m.groups()
# ('1', 'Centralized')
That regex just says: require but ignore one or more spaces, capture the string of digits after them, require but ignore one or more spaces after that, and capture the capital letters, lowercase letters, and spaces that follow. (You may need to trim off a newline when you're done.)
The file-handling bit would look like:
with open('/path/to/your_data_file.txt') as f:
    for line in f:
        # run regex and do stuff for each line
        pass
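Putting those pieces together, here is a minimal sketch that tallies the counts per license name with collections.Counter (the file path is a placeholder):

import re
from collections import Counter

r = re.compile(r"\s+(\d+)\s+([A-Za-z ]+)")
totals = Counter()

with open('/path/to/your_data_file.txt') as f:
    for line in f:
        m = r.search(line)
        if m:  # price lines like $42.00 don't match and are skipped
            count, name = m.groups()
            totals[name.strip()] += int(count)

for name, count in totals.items():
    print(count, name)  # e.g. 2 Centralized Recording, 10 Station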
import re, io, pandas as pd

a = open('your_data_file.txt').read()  # re.sub needs the file contents, not the file object
pd.read_csv(io.StringIO(re.sub(r'(?m).*\s(\d+)\s+(.*\S+)\s+$\n|.*', '\\1,\\2', a)),
            header=None).groupby(1).sum()[0].to_dict()
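In other words, the re.sub rewrites each license line to a count,name CSV row and blanks out everything else, and pandas then groups by name and sums the counts; for the sample data this should yield something like {'Centralized Recording': 2, 'Station': 10}.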
Pandas is a good tool for jobs like this. You might have to play around with it a bit. You will also need to export your Excel file as a .csv file. In the interpreter, try:
import pandas
raw = pandas.read_csv('myfile.csv')
print(raw.columns)
That will give you the column headings for the csv file. If you have headers name and nums, then you can extract those as a list of tuples as follows:
extract = list(zip(raw.name, raw.nums))
You can then sort this list by name:
extract = sorted(extract)
Pandas probably has a method for compressing this easily, but I can't recall it, so:
def accum(c):
    nm = c[0][0]
    count = 0
    result = []
    for x in c:
        if x[0] == nm:
            count += x[1]
        else:
            result.append((nm, count))
            nm = x[0]
            count = x[1]
    result.append((nm, count))
    return result

done = accum(extract)
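(For reference, the pandas method alluded to above is groupby; a minimal sketch, assuming the columns really are name and nums:)

import pandas as pd

raw = pd.read_csv('myfile.csv')
totals = raw.groupby('name')['nums'].sum()
print(totals.to_dict())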
Now you can write this to a text file as follows (f-strings require Python 3.6+):
with open("myjob.txt", "w+") as fout:
    for x in done:
        line = f"name: {x[0]} count: {x[1]}\n"
        fout.write(line)
I'm currently trying to count the number of times a date occurs within a chat log. For example, the file I'm reading from may look something like this:
*username* (mm/dd/yyyy hh:mm:ss): *message here*
However, I need to split the date from the time, as I currently treat them as one. I'm currently struggling to solve my problem, so any help is appreciated. Below is some sample code that I'm currently using to try to get the date count working. I'm currently using a Counter; however, I'm wondering if there are other ways to count dates.
import tkFileDialog
from collections import Counter

filename = tkFileDialog.askopenfile(filetypes=(("Text files", "*.txt"),))
mtxtr = filename.read()
date = []
number = []
occurences = Counter(date)
mtxtformat = mtxtr.split("\r\n")
print 'The Dates in the chat are as follows'
print "--------------------------------------------"
for mtxtf in mtxtformat:
    participant = mtxtf.split("(")[0]
    date1 = mtxtf.split("(")[-1]
    message = date1.split(")")[0]
    date.append(date1.strip())
for item in date:
    if item not in number:
        number.append(item)
for item in number:
    occurences = date.count(item)
    print("Date Occurences " + " is: " + str(occurences))
The easiest way would be to use a regex and count the occurrences of the date pattern you have in the log file. It would be faster too.
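A minimal sketch of that approach, assuming the dates always appear in the mm/dd/yyyy form shown (mtxtr is the file contents from the question):

import re
from collections import Counter

dates = re.findall(r'\d{2}/\d{2}/\d{4}', mtxtr)
for item, count in Counter(dates).items():
    print item, count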
If you know the date and time are going to be enclosed in parentheses at the start of the message (i.e. no parentheses (...): will be seen before the one containing the date and time):
*username* (mm/dd/yyyy hh:mm:ss): *message here*
Then you can extract based on the parens:
import re
...
parens = re.compile(r'\((.+)\)')
for mtxtf in mtxtformat:
    match = parens.search(mtxtf)
    date.append(match.group(1).split(' ')[0])
...
Note: If the message itself contains parens, this may match more than just the needed (mm/dd/yyyy hh:mm:ss). Doing match.group(1).split(' ')[0] would still give you the information you are looking for, assuming no information is enclosed in parens before your date-time information on the current line.
Note 2: Ideally, enclose this in a try-except so you can continue on to the next line if the current line doesn't contain useful information.
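Once the date list is filled this way, the Counter the question already uses will give the per-date tallies, e.g.:

occurences = Counter(date)
for item, count in occurences.items():
    print "Date " + item + " occurs " + str(count) + " times"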
I am very new to Python and am looking to use it to parse a text file. The file has between 250-300 lines of the following format:
---- Mark Grey (mark.grey#gmail.com) changed status from Busy to Available # 14/07/2010 16:32:36 ----
---- Silvia Pablo (spablo#gmail.com) became Available # 14/07/2010 16:32:39 ----
I need to store the following information into another file (excel or text) for all the entries from this file
UserName/ID Previous Status New Status Date Time
So my result file should look like this for the above entries:
Mark Grey/mark.grey#gmail.com Busy Available 14/07/2010 16:32:36
Silvia Pablo/spablo#gmail.com NaN Available 14/07/2010 16:32:39
Thanks in advance. Any help would be really appreciated.
To get you started:
import re

result = []
regex = re.compile(
    r"""^-*\s+
    (?P<name>.*?)\s+
    \((?P<email>.*?)\)\s+
    (?:changed\s+status\s+from\s+(?P<previous>.*?)\s+to|became)\s+
    (?P<new>.*?)\s+#\s+
    (?P<date>\S+)\s+
    (?P<time>\S+)\s+
    -*$""", re.VERBOSE)

with open("inputfile") as f:
    for line in f:
        match = regex.match(line)
        if match:
            result.append([
                match.group("name"),
                match.group("email"),
                match.group("previous"),
                # etc.
            ])
        else:
            pass  # Match attempt failed
This will get you a list of the parts of each match. I'd then suggest you use the csv module to store the results in a standard format.
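For instance, a minimal sketch of the csv step (output.csv is a placeholder, and it assumes each row of result has been filled out with all six groups):

import csv

with open("output.csv", "w", newline="") as fout:  # Python 3; on Python 2 use mode "wb" and drop newline
    writer = csv.writer(fout)
    writer.writerow(["Name", "Email", "Previous Status", "New Status", "Date", "Time"])
    writer.writerows(result)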
import re
pat = re.compile(r"----\s+(.*?) \((.*?)\) (?:changed status from (\w+) to|became) (\w+) # (.*?) ----\s*")

with open("data.txt") as f:
    for line in f:
        (name, email, prev, curr, date) = pat.match(line).groups()
        print "{0}/{1} {2} {3} {4}".format(name, email, prev or "NaN", curr, date)
This makes assumptions about whitespace and also assumes that every line conforms to the pattern. You might want to add error checking (such as checking that pat.match() doesn't return None) if you want to handle dirty input gracefully.
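For example, the matching line could be guarded like this:

m = pat.match(line)
if m is not None:
    (name, email, prev, curr, date) = m.groups()
    print "{0}/{1} {2} {3} {4}".format(name, email, prev or "NaN", curr, date)
else:
    print "skipping malformed line: %r" % line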
The two RE patterns of interest seem to be...:
p1 = r'^---- ([^(]+) \(([^)]+)\) changed status from (\w+) to (\w+) # (\S+) (\S+) ----$'
p2 = r'^---- ([^(]+) \(([^)]+)\) became (\w+) # (\S+) (\S+) ----$'
so I'd do:
import csv, re, sys

# assign p1, p2 as above (or enhance them, etc etc)
r1 = re.compile(p1)
r2 = re.compile(p2)

data = []
with open('somefile.txt') as f:
    for line in f:
        m = r1.match(line)
        if m:
            data.append(m.groups())
            continue
        m = r2.match(line)
        if not m:
            print>>sys.stderr, "No match for line: %r" % line
            continue
        listofgroups = list(m.groups())
        listofgroups.insert(2, 'NaN')
        data.append(listofgroups)

with open('result.csv', 'w') as f:
    w = csv.writer(f)
    w.writerow('UserName/ID Previous Status New Status Date Time'.split())
    w.writerows(data)
If the two patterns I described are not general enough, they may need to be tweaked, of course, but I think this general approach will be useful. While many Python users on Stack Overflow intensely dislike REs, I find them very useful for this kind of pragmatic ad hoc text processing.
Maybe the dislike is explained by others wanting to use REs for absurd uses such as ad hoc parsing of CSV, HTML, XML, ... -- and many other kinds of structured text formats for which perfectly good parsers exist! And also, other tasks well beyond REs' "comfort zone", and requiring instead solid general parser systems like pyparsing. Or at the other extreme super-simple tasks done perfectly well with simple strings (e.g. I remember a recent SO question which used if re.search('something', s): instead of if 'something' in s:!-).
But for the reasonably broad swathe of tasks (excluding the very simplest ones at one end, and the parsing of structured or somewhat-complicated grammars at the other) for which REs are appropriate, there's really nothing wrong with using them, and I recommend that all programmers learn at least REs' basics.
Alex mentioned pyparsing and so here is a pyparsing approach to your same problem:
from pyparsing import Word, Suppress, Regex, oneOf, SkipTo
import datetime

DASHES = Word('-').suppress()
LPAR, RPAR, AT = map(Suppress, "()#")
date = Regex(r'\d{2}/\d{2}/\d{4}')
time = Regex(r'\d{2}:\d{2}:\d{2}')
status = oneOf("Busy Available Idle Offline Unavailable")

statechange1 = 'changed status from' + status('fromstate') + 'to' + status('tostate')
statechange2 = 'became' + status('tostate')

linefmt = (DASHES + SkipTo('(')('name') + LPAR + SkipTo(RPAR)('email') + RPAR +
           (statechange1 | statechange2) +
           AT + date('date') + time('time') + DASHES)

def convertFields(tokens):
    if 'fromstate' not in tokens:
        tokens['fromstate'] = 'NULL'
    tokens['name'] = tokens.name.strip()
    tokens['email'] = tokens.email.strip()
    d, mon, yr = map(int, tokens.date.split('/'))
    h, m, s = map(int, tokens.time.split(':'))
    tokens['datetime'] = datetime.datetime(yr, mon, d, h, m, s)

linefmt.setParseAction(convertFields)

for line in text.splitlines():
    fields = linefmt.parseString(line)
    print "%(name)s/%(email)s %(fromstate)-10.10s %(tostate)-10.10s %(datetime)s" % fields
prints:
Mark Grey/mark.grey#gmail.com Busy Available 2010-07-14 16:32:36
Silvia Pablo/spablo#gmail.com NULL Available 2010-07-14 16:32:39
pyparsing allows you to attach names to the result fields (just like the named groups in Tim Pietzcker's RE-styled answer), plus parse-time actions to act on or manipulate the parsed tokens - note the conversion of the separate date and time fields into a true datetime object, already converted and ready for processing after parsing with no additional muss or fuss.
Here is a modified loop that just dumps out the parsed tokens and the named fields for each line:
for line in text.splitlines():
    fields = linefmt.parseString(line)
    print fields.dump()
prints:
['Mark Grey ', 'mark.grey#gmail.com', 'changed status from', 'Busy', 'to', 'Available', '14/07/2010', '16:32:36']
- date: 14/07/2010
- datetime: 2010-07-14 16:32:36
- email: mark.grey#gmail.com
- fromstate: Busy
- name: Mark Grey
- time: 16:32:36
- tostate: Available
['Silvia Pablo ', 'spablo#gmail.com', 'became', 'Available', '14/07/2010', '16:32:39']
- date: 14/07/2010
- datetime: 2010-07-14 16:32:39
- email: spablo#gmail.com
- fromstate: NULL
- name: Silvia Pablo
- time: 16:32:39
- tostate: Available
I suspect that as you continue to work on this problem, you will find other variations on the format of the input text specifying how the user's state changed. In this case, you would just add another definition like statechange1 or statechange2, and insert it into linefmt with the others. I feel that pyparsing's structuring of the parser definition helps developers come back to a parser after things have changed, and easily extend their parsing program.
Well, if I were to approach this problem, I'd probably start by splitting each entry into its own separate string. This looks like it might be line-oriented, so inputfile.split('\n') is probably adequate. From there I would craft a regular expression to match each of the possible status changes, with subgroups wrapping each of the important fields.
Thanks very much for all your comments. They were very useful. I wrote my code using the directory functionality. What it does is read through the files and create an output file for each user with all their status updates. Here is the code, pasted below.
#Script to extract info from individual data files and print out a data file combining info from these files
import os
import commands

dataFileDir="data/";
#Dictionary linking names to email ids
#For the time being, assume no 2 people have the same name
usrName2Id={};
#User id to user name mapping to check for duplicate names
usrId2Name={};
#Store info: key: user ids and values a dictionary with time stamp keys and status messages values
infoDict={};

#Given an array of space tokenized inputs, extract user name
def getUserName(info,mailInd):
    userName="";
    for i in range(mailInd-1,0,-1):
        if info[i].endswith("-") or info[i].endswith("+"):
            break;
        userName=info[i]+" "+userName;
    userName=userName.strip();
    userName=userName.replace("  "," ");
    userName=userName.replace(" ","_");
    return userName;

#Given an array of space tokenized inputs, extract time stamp
def getTimeStamp(info,timeStartInd):
    timeStamp="";
    for i in range(timeStartInd+1,len(info)):
        timeStamp=timeStamp+" "+info[i];
    timeStamp=timeStamp.replace("-","");
    timeStamp=timeStamp.strip();
    return timeStamp;

#Given an array of space tokenized inputs, extract status message
def getStatusMsg(info,startInd,endInd):
    msg="";
    for i in range(startInd,endInd):
        msg=msg+" "+info[i];
    msg=msg.strip();
    msg=msg.replace(" ","_");
    return msg;

#Extract and store info from each line in the datafile
def extractLineInfo(line):
    print line;
    info=line.split(" ");
    mailInd=-1;userId="-NONE-";
    timeStartInd=-1;timeStamp="-NONE-";
    becameInd=-1;
    statusMsg="-NONE-";
    #Find indices of email id and "#" char indicating start of timestamp
    for i in range(0,len(info)):
        #print (str(i)+" "+info[i]);
        if(info[i].startswith("(") and info[i].endswith("#in.ibm.com)")):
            mailInd=i;
        if(info[i]=="#"):
            timeStartInd=i;
        if(info[i]=="became"):
            becameInd=i;
    #Debug print of mail and time stamp start inds
    """print "\n";
    print "Index of mail id: "+str(mailInd);
    print "Index of time start index: "+str(timeStartInd);
    print "\n";"""
    #Extract IBM user id and name for lines with ibm id
    if(mailInd>=0):
        userId=info[mailInd].replace("(","");
        userId=userId.replace(")","");
        userName=getUserName(info,mailInd);
    #Lines with no ibm id are of the form "Suraj Godar Mr became idle # 15/07/2010 16:30:18"
    elif(becameInd>0):
        userName=getUserName(info,becameInd);
    #Time stamp info
    if(timeStartInd>=0):
        timeStamp=getTimeStamp(info,timeStartInd);
        if(mailInd>=0):
            statusMsg=getStatusMsg(info,mailInd+1,timeStartInd);
        elif(becameInd>0):
            statusMsg=getStatusMsg(info,becameInd,timeStartInd);
    print userId;
    print userName;
    print timeStamp
    print statusMsg+"\n";
    if not(userName in usrName2Id) and not(userName=="-NONE-") and not(userId=="-NONE-"):
        usrName2Id[userName]=userId;
    #Store status messages keyed by user email ids
    timeDict={};
    #Retrieve user id corresponding to user name
    if userName in usrName2Id:
        userId=usrName2Id[userName];
    #For valid user ids, store status message in the dict within dict data str arrangement
    if not(userId=="-NONE-"):
        if not(userId in infoDict.keys()):
            infoDict[userId]={};
        timeDict=infoDict[userId];
        if not(timeStamp in timeDict.keys()):
            timeDict[timeStamp]=statusMsg;
        else:
            timeDict[timeStamp]=timeDict[timeStamp]+" "+statusMsg;

#Print for each user a file containing status
def printStatusFiles(dataFileDir):
    volNum=0;
    for userName in usrName2Id:
        volNum=volNum+1;
        filename=dataFileDir+"/"+"status-"+str(volNum)+".txt";
        file = open(filename,"w");
        print "Printing output file name: "+filename;
        print volNum,userName,usrName2Id[userName]+"\n";
        file.write(userName+" "+usrName2Id[userName]+"\n");
        timeDict=infoDict[usrName2Id[userName]];
        for time in sorted(timeDict.keys()):
            file.write(time+" "+timeDict[time]+"\n");

#Read and store data from individual data files
def readDataFiles(dataFileDir):
    #Process each datafile
    files=os.listdir(dataFileDir)
    files.sort();
    for i in range(0,len(files)):
        #for i in range(0,1):
        file=files[i];
        #Do not process other non-data files lying around in that dir
        if not file.endswith(".txt"):
            continue
        print "Processing data file: "+file
        dataFile=dataFileDir+str(file);
        inpFile=open(dataFile,"r");
        lines=inpFile.readlines();
        #Process lines
        for line in lines:
            #Clean lines
            line=line.strip();
            line=line.replace("/India/Contr/IBM","");
            line=line.strip();
            #Skip header line of the file and L's sign in sign out times
            if(line.startswith("System log for account") or line.find("signed")>-1):
                continue;
            extractLineInfo(line);

print "\n";
readDataFiles(dataFileDir);
print "\n";
printStatusFiles("out/");