Read and select specific rows from text file regex Python - python

I have a large number of text files to read from in Python. Each file is structured as the following sample:
------------------------------------------------------------------------------
\\
Paper: some_integer
From: <some_email_address>
Date: Wed, 4 Apr 2001 12:08:13 GMT (27kb)
Date (revised v2): Tue, 8 May 2001 10:39:33 GMT (27kb)
Title: some_title
Authors: name_1, name_2
Comments: 28 pages, JHEP latex
\\
blablabla (this is a multiline abstract of the paper)
blablabla
blablabla
\\
I would like to automatically extract and store (e.g., as a list) the Title, Authors, and abstract (the text between the second and third \\ - note that it starts with an indent) from each text file. Also note that the white line between Date (revised) and Title is really there (it is not a typo that I introduced).
My attempts so far have involved (I am showing the steps for a single text file, say the first file in the list):
filename = os.listdir(path)[0]
test = pd.read_csv(filename, header=None, delimiter="\t")
Which gives me:
0
0 ----------------------------------------------...
1 \\
2 Paper: some_integer
3 From: <some_email_address>
4 Date: Wed, 4 Apr 2001 12:08:13 GMT (27kb)
5 Date (revised v2): Tue, 8 May 2001 10:39:33 G...
6 Title: some_title...
7 Authors: name_1, name_2
8 Comments: 28 pages, JHEP latex
9 Report-no: DUKE-CGTP-00-01
10 \\
11 blabla...
12 blabla...
13 blabla...
14 \\
I can then select a given row (e.g., the one featuring the title) with:
test[test[0].str.contains("Title")].to_string()
But it is truncated, it is not a clean string (some attributes show up) and I find this entire pandas-based approach quite tedious actually... There must be an easier way to directly select the rows of interest from the text file using regex. At least I hope so...

you could process line by line.
import re
data = {}
temp_s = match = ''
with open('myfile.txt', 'r') as infile:
for line in infile:
if ":" in line:
line = line.split(':')
data[line[0]] = line[1]
elif re.search(r'.*\w+', line):
match = re.search(r'(\w.*)', line)
match = match.group(1)
temp_s += match
while 1:
line = infile.next()
if re.search(r'.*\w+', line):
match = re.search(r'(\w.*)', line)
temp_s += match.group(1)
else:
break
data['abstract'] = temp_s

How about iterating over each line in the file and split by the first : if it is present in line, collecting the result of the split in a dictionary:
with open("input.txt") as f:
data = dict(line.strip().split(": ", 1) for line in f if ": " in line)
As a result, the data would contain:
{
'Comments': '28 pages, JHEP latex',
'Paper': 'some_integer',
'From': '<some_email_address>',
'Date (revised v2)': 'Tue, 8 May 2001 10:39:33 GMT (27kb)',
'Title': 'some_title',
'Date': 'Wed, 4 Apr 2001 12:08:13 GMT (27kb)',
'Authors': 'name_1, name_2'
}

If your files really always have the same structure, you could come up with:
# -*- coding: utf-8> -*-
import re
string = """
------------------------------------------------------------------------------
\\
Paper: some_integer
From: <some_email_address>
Date: Wed, 4 Apr 2001 12:08:13 GMT (27kb)
Date (revised v2): Tue, 8 May 2001 10:39:33 GMT (27kb)
Title: some_title
Authors: name_1, name_2
Comments: 28 pages, JHEP latex
\\
blablabla (this is the abstract of the paper)
\\
"""
rx = re.compile(r"""
^Title:\s(?P<title>.+)[\n\r] # Title at the beginning of a line
Authors:\s(?P<authors>.+)[\n\r] # Authors: ...
Comments:\s(?P<comments>.+)[\n\r] # ... and so on ...
.*[\n\r]
(?P<abstract>.+)""",
re.MULTILINE|re.VERBOSE) # so that the caret matches any line
# + verbose for this explanation
for match in rx.finditer(string):
print match.group('title'), match.group('authors'), match.group('abstract')
# some_title name_1, name_2 blablabla (this is the abstract of the paper)
This approach takes Title as the anchor (beginning of a line) and skims the text afterwards. The named groups may not really be necessary but make the code easier to understand. The pattern [\n\r] looks for newline characters.
See a demo on regex101.com.

This pattern will get you started:
\\[^\\].*[^\\]+Title:\s+(\S+)\s+Authors:\s+(.*)[^\\]+\\+\s+([^\\]*)\n\\
Assume 'txtfile.txt' is of the format as shown on the top. If using python 2.7x:
import re
with open('txtfile.txt', 'r') as f:
input_string = f.read()
p = r'\\[^\\].*[^\\]+Title:\s+(\S+)\s+Authors:\s+(.*)[^\\]+\\+\s+([^\\]*)\n\\'
print re.findall(p, input_string)
Output:
[('some_title', 'name_1, name_2', 'blablabla (this is a multiline abstract of the paper)\n blablabla\n blablabla')]

Related

How to parse log file by regex grouping

Trying to parse based on the grouping, below is the input file to parse.
Cannot able to aggregate multiple groups from my regex which produces expected output. Need some recommendations to print data in expected output. (Note Group2 can have different other (strings) in the actual log-file)
#Parse out the timedate stamp Jan 20 03:25:08 to capture two groups
Example groups
1.) Jan 20 03:25 2.) logrotate
1.) Jan 20 05:03 2.) ntpd
logfile= """Jan 20 03:25:08 fakehost logrotate: ALERT exited abnormally with [1]
Jan 20 03:25:08 fakehost run-parts(/etc/cron.daily)[20447]: finished logrotate
Jan 20 03:26:21 fakehost anacron[28969]: Job 'cron.daily' terminated
Jan 20 03:26:21 fakehost anacron[28969]: Normal exit (1 job run)
Jan 20 03:30:01 fakehost CROND[31462]: (root) CMD (/usr/lib64/sa/sa1 1 1)
Jan 20 03:30:01 fakehost CROND[31461]: (root) CMD (/var/system/bin/sys-cmd -F
Jan 20 05:03:03 fakehost ntpd[3705]: synchronized to time.faux.biz, stratum 2
"""
Expected output:
minute,total_count,logrotate,CROND,ntpd,anacron,run-parts
Jan 20 03:25,2,1,0,0,0,1
Jan 20 03:26,2,0,2,0,1,1
Jan 20 03:30,2,0,2,0,0,0
Jan 20 05:03,1,0,0,1,0,0
This is my code:
import re
output = {}
regex = re.compile(r'^(\w+ \d+ \d+:\d+):\d+ \w+ (\w+).*$')
with open("logfile", "r+") as myfile:
for log_line in myfile:
match = regex.match(log_line)
if match:
if match.group(1) and match.group(2):
print(match.groups())
# Struck here to arrange the data
output[match.group(1)]['total_count'] += 1
output[match.group(1)][match.group(2)] += 1
for k, v in output.items():
print('{0} {1}'.format(k, v))
import re
output = []
regex = re.compile(r'^(\w+ \d+ \d+:\d+):\d+ \w+ (\w+).*$')
with open("logfile.txt", "r+") as myfile:
for log_line in myfile:
match = regex.match(log_line)
if match:
if match.group(1) and match.group(2):
dataDict = {'minute': match.group(1), 'total_count': 1}
dataDict[match.group(2)] = 1
lastInsertedIndex = len(output) - 1
if (len(output) > 0): # data exist, check if same minute data exist or not
# same minute, update existing data
if (output[lastInsertedIndex]['minute'] == match.group(1)):
lastInsertedIndexDict = output[lastInsertedIndex]
if (match.group(2) in lastInsertedIndexDict):
lastInsertedIndexDict[match.group(2)] = lastInsertedIndexDict[match.group(2)] + 1 # updating group(2)
else:
lastInsertedIndexDict[match.group(2)] = 1
# updating total count
lastInsertedIndexDict['total_count'] = lastInsertedIndexDict['total_count'] + 1
output[lastInsertedIndex] = lastInsertedIndexDict
else: # new minute, simply append
output.append(dataDict)
else: # output list is empty
output.append(dataDict)
for data in output:
print(data)
Here the idea is after we have match.groups(), create a dictionary with minute as a key & for total_count value as 1. Then for match.group(1) set value as 1 with the new found key.
As the data would be in increasing order of time, so check if previously inserted data is of same minute or different minute.
If same minute then increase the dictionary's total_count & match.group(2) values by 1
If different minute then simply append the dictionary to output list
Currently output list would print keys & values. Incase if you want to print only values then instead of print(data) in the last line, you can change to print(data.values())
Just to mention, I have assumed that you are not facing any issue in regex & that whatever regex you have provided is fulfilling your requirement.
In case you face any issue in regex or need help in that do let me know in comment.

Rewriting a txt file in python, creating new lines where there is a certain string

I have converted a PDF bank statement to a txt file. Here is a snippet of the .txt file:
15 Apr 20DDOPEN 100.00DDBENNON WATER SRVCS29.00DDBG BUSINESS106.00BPC BOB PETROL MINISTRY78.03BPC BARBARA STREAMING DATA30.50CRPAYPAL Z4J22FR450.00CRPAYNAL AAWDL4Z4J22222KHMG30.0019,028.4917 Apr 20CRCASH IN AT HSBC BANK
What is the easiest way of re-writing the text file in python to create a new line at certain points. i.e. after a number ‘xx.xx’ there in a new date such as ‘xx APR’
For example the text to become:
15 Apr 20DDOPEN 100.00
BENNON WATER SRVCS29.00
DDBG BUSINESS106.00...(etc)
I am just trying to make a PDF more readable and useful when working amongst my other files.
If you know of another PDF to txt python converter which works better, I would also be interested.
Thanks for your help
First step would be getting the text file into Python
with open(“file.txt”) as file:
data = file.read()
This next part, initially, I thought you wouldn't be able to do, but in your example, each part contains a number XX.XX The important thing to notice here is that there is a '.' in each number.
Using Python's string find command, you can iteratively look for that '.' and add a newline character two characters later. You can change my indices below to remove the DD as well if you want.
index = 0
while(index != -1):
index = data.find('.', index)
if index != -1:
data = data[:index+3] + '\n' + data[index+3:]
Then you need to write the new data back to the file.
file = open('ValidEmails.txt','w')
file.write(data)
For the given input the following should work:
import re
counter = 0
l = "15 Apr 20DDOPEN 100.00DDBENNON WATER SRVCS29.00DDBG BUSINESS106.00BPC BOB PETROL MINISTRY78.03BPC BARBARA STREAMING DATA30.50CRPAYPAL Z4J22FR450.00CRPAYNAL AAWDL4Z4J22222KHMG30.0019,028.4917 Apr 20CRCASH IN AT HSBC BANK"
nums = re.finditer("[\d]+[\.][\d]+", l)
for elem in nums:
idx = elem.span()[1] + counter
l = l[:idx] + '\n' + l[idx:]
counter += 1
print(l)
The output is:
15 Apr 20DDOPEN 100.00
DDBENNON WATER SRVCS29.00
DDBG BUSINESS106.00
BPC BOB PETROL MINISTRY78.03
BPC BARBARA STREAMING DATA30.50
CRPAYPAL Z4J22FR450.00
CRPAYNAL AAWDL4Z4J22222KHMG30.0019
,028.4917
Apr 20CRCASH IN AT HSBC BANK
Then you should easily able to write line by line to a file.

Dig out information with Python re

I want to dig out information from the log files and wrote the script below:
import re
file = '''Date,Time,Type,User,Message
Thu Jul 18, 2019 14:18:41.945,EFM,201202 ,Robot picked
Thu Jul 18, 2019 14:18:51.486,DS ,201202 ,Module 1
Thu Jul 18, 2019 14:19:07.747,DS ,201202 ,Door opened
Thu Jul 18, 2019 14:20:08.231,EFM,203204205206,Robot picked
Thu Jul 18, 2019 14:20:08.231,DS ,203204 ,Module 2
Thu Jul 18, 2019 14:20:10.282,DS ,203204 ,Door opened
...
'''
p1 = re.compile(r'\w{3} \w{3} \d\d, \d{4} (\d\d:\d\d:\d\d.\d{3}),EFM,(\d+?\s*?),Robot picked')
p2 = re.compile(r'\w{3} \w{3} \d\d, \d{4} (\d\d:\d\d:\d\d.\d{3}),DS ,(\d+?\s*?),Module 1')
p3 = re.compile(r'\w{3} \w{3} \d\d, \d{4} (\d\d:\d\d:\d\d.\d{3}),DS ,(\d+?\s*?),Door opened')
w_file = r'D:\sample.txt'
lines = file.readlines()
t_file =open(w_file,'w')
info = ['User','Time1','Time2','Time3' ]
t_file.write('{}\n'.format(','.join(item for item in info)))
for line in lines:
p1_line = re.findall(p1, line.strip())
p2_line = re.findall(p2, line.strip())
p3_line = re.findall(p3, line.strip())
if p1_line and p2_line and p3_line:
if p1_line[0][1][:3] == p2_line[0][1][:3] and p1_line[0][1][:3] == p5_line[0][1][:3]:
t_file.write('{},{},{},{}\n'.format(p1_line[0][1].strip(),p1_line[0][0],p2_line[0][0],p3_line[0][0])
t_file.close()
When I open the sample.txt file, there is only the 'User,Time1,Time2,Time3' row. Can any find what's wrong in my script?
What I want is like below:
User,Time1,Time2,Time3
201202,14:18:41.945,14:18:51.486,14:19:07.747
203204205206,14:20:08.231,14:20:08.231,14:20:10.282
The issue with your script is that you are trying to match all regular expressions to the same line, and then performing an and condition, which of course fails.
Each regular expression works but only for specific lines, hence 2 out of the 3 will return [] which evaluates to False.
For example, given:
line = 'Thu Jul 18, 2019 14:18:41.945,EFM,201202 ,Robot picked'
You will have:
p1_line = [('14:18:41.945', '201202 ')] # match
p2_line = [] # no match
p3_line = [] # no match
Once you and these three values, the condition will evaluate to False and for this reason nothing will be written to the file:
if p1_line and p2_line and p3_line: # this evaluates to False
So, depending on the exact logic you want to implement, you may have to store and remember past matches and build on that.

Django view decoding danish chars from CSV

I have a csv file that contains some strange (incorrect) encoded danish characters (å-ø-æ). In my Django view I'm trying to grab a string from the first row, and the date from the second row in the file. The file looks like this if I copy paste it.
01,01,Project Name: SAM_LOGIK_rsm¿de_HD,,,Statistics as of: Sat Oct 01 17:09:16 2016
02,01,Project created: Tue Apr 12 09:10:16 2016,,,Last Session Started: Sat Oct 01 16:59:22 2016
The string SAM_LOGIK_rsm¿de_HD should be SAM_LOGIK_Årsmøde_HD - which is the value I want to store in the DB.
I am decoding the file with iso-8859-1 (otherwise I get an error).
with open(latest, 'rt', encoding='iso-8859-1') as csvfile:
for i, row in enumerate(csvfile):
if "Project Name:" in row:
this = row.split(',')
project_list.append(this[2][14:]) # gets the project name as is
if i >= 1:
break
else:
this = row.split(',')
date = datetime.strptime(this[5][22:-1], '%c') # datetime object
project_list.append(date)
if i >= 1:
break # break at row 2
csvfile.close()
This stores the string 'as is', and I'm not sure what to do to convert it back into danish before I store it in the DB. The DB and Django are set up to work with danish chars.
If I try to decode it as utf.8 - I get a UnicodeDecodeError which reveals some more information.
01,01,Project Name: SAM_LOGIK_\x81rsm\xbfde_HD,,,Statistics as of: Sat Oct'
01 17:09:16 2016\r02,01,Project created: Tue Apr 12 09:10:16 2016,,,Last'
EDIT:
I found out that the strings in the csv are actually corrupted - and the application that created them (Avid Media Composer) at least consistently applies the same values for - Å-å-Æ-æ-Ø-ø
Å = \x81 unassigned in UTF8
å = Œ - u"\u0153" OE ligature
Æ = ® - chr(174)
æ = ¾ - chr(190)
Ø = » - chr(187)
ø = ¿ - chr(191)
I fixed it like this.
replacements = {'\x81':'Å','Œ':'å','®':'Æ','¾':'æ','¿':'ø','»':'Ø'}
with open(newest, 'rt', encoding='iso-8859-1') as csvfile:
for i, row in enumerate(csvfile):
if "Project Name:" in row:
this = row.split(',')
project_list.append("".join([replacements.get(c, c) for c in this[2][14:]]))
if i >= 1:
break
else:
this = row.split(',')
date = datetime.strptime(this[5][22:-1], '%c') # datetime object
project_list.append(date)
if i >= 1:
break # break at row 2
try that
row.decode('iso-8859-1').encode('utf-8')
And if you're use the "with" statement closing the file isn't necessary

Python: Parsing a colon delimited file with various counts of fields

I'm trying to parse a a few files with the following format in 'clientname'.txt
hostname:comp1
time: Fri Jan 28 20:00:02 GMT 2011
ip:xxx.xxx.xx.xx
fs:good:45
memory:bad:78
swap:good:34
Mail:good
Each section is delimited by a : but where lines 0,2,6 have 2 fields... lines 1,3-5 have 3 or more fields. (A big issue I've had trouble with is the time: line, since 20:00:02 is really a time and not 3 separate fields.
I have several files like this that I need to parse. There are many more lines in some of these files with multiple fields.
...
for i in clients:
if os.path.isfile(rpt_path + i + rpt_ext): # if the rpt exists then do this
rpt = rpt_path + i + rpt_ext
l_count = 0
for line in open(rpt, "r"):
s_line = line.rstrip()
part = s_line.split(':')
print part
l_count = l_count + 1
else: # else break
break
First I'm checking if the file exists first, if it does then open the file and parse it (eventually) As of now I'm just printing the output (print part) to make sure it's parsing right.
Honestly, the only trouble I'm having at this point is the time: field. How can I treat that line specifically different than all the others? The time field is ALWAYS the 2nd line in all of my report files.
split method has the following syntax split( [sep [,maxsplit]]) and if the maxsplit is given, it will make maxsplit+1 parts. In you case, you just have give maxsplit as 1. Just split(':',1) would solve your problem.
If time is a special case, you could do:
[...]
s_line = line.rstrip()
if line.startswith('time:'):
part = s_line.split(':', 1)
else:
part = s_line.split(':')
print part
[...]
This would give you:
['hostname', 'comp1']
['time', ' Fri Jan 28 20:00:02 GMT 2011']
['ip', 'xxx.xxx.xx.xx']
['fs', 'good', '45']
['memory', 'bad', '78']
['swap', 'good', '34']
['Mail', 'good']
And doesn't rely on the position of time in the file.
Design considerations:
Robustly handle extraneous whitespace, including blank lines, and missing colons.
Extract a record_type, which is then used to decide how to parse the remainder of the line.
>>> def munched(s, n=None):
... if n is None:
... n = 99999999 # this kludge should not be necessary
... return [x.strip() for x in s.split(':', n)]
...
>>> def parse_line(line):
... if ':' not in line:
... return [line.strip(), '']
... record_type, remainder = munched(line, 1)
... if record_type == 'time':
... data = [remainder]
... else:
... data = munched(remainder)
... return record_type, data
...
>>> for guff in """
... hostname:comp1
... time: Fri Jan 28 20:00:02 GMT 2011
... ip:xxx.xxx.xx.xx
... fs:good:45
... memory : bad : 78
... missing colon
... Mail:good""".splitlines(True):
... print repr(guff), parse_line(guff)
...
'\n' ['', '']
'hostname:comp1\n' ('hostname', ['comp1'])
'time: Fri Jan 28 20:00:02 GMT 2011\n' ('time', ['Fri Jan 28 20:00:02 GMT 2011'])
'ip:xxx.xxx.xx.xx\n' ('ip', ['xxx.xxx.xx.xx'])
'fs:good:45\n' ('fs', ['good', '45'])
' memory : bad : 78 \n' ('memory', ['bad', '78'])
'missing colon\n' ['missing colon', '']
'Mail:good' ('Mail', ['good'])
>>>
If the time field always the 2nd line. Why can't you skip it and parse it separately?
Something like
for i, line in enumerate(open(rpt, "r").read().splitlines()):
if i==1: # Special parsing for time: line
data = line[5:]
else:
# your normal parsing logic

Categories