printing dates in a log file - python

I have a log file whose entries contain a date in a consistent format.
Example:
date1
date2
...
There is one date per log entry, so the number of dates corresponds to the number of entries in my log file. How can I print the dates from the log file using regular expressions?
What I have tried:
import re
dateRegex = re.compile('^\w{3}\s\d\d:\d\d:\d\d')
f = open("logfile.log","r")
for line in f.readlines():
    matches = re.findall(dateRegex,line)
    print matches
The output I am getting is (many []):
[]
[]
[]
...
...

Your pattern seems to be missing the day of the month:
import re
dateRegex = re.compile(r'^\w{3}\s\d\d?\s\d\d:\d\d:\d\d')
#                              ^^^^^^^ I added ? to cater for dates between 1 & 9
f = open("logfile.log","r")
for line in f.readlines():
    matches = re.findall(dateRegex,line)
    if matches: # Check if there are matches
        print matches[0] # Print first element of list returned by findall
I think that you can use re.match instead, since you're testing line by line and using the beginning of line anchor:
import re
dateRegex = re.compile(r'\w{3}\s\d\d?\s\d\d:\d\d:\d\d')
f = open("logfile.log","r")
for line in f.readlines():
    matches = re.match(dateRegex,line)
    if matches:
        print matches.group()
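As a quick way to check the pattern without a log file, here is a minimal sketch that runs it against a few in-memory lines. The sample lines and the helper name extract_dates are made up for illustration, and the timestamp layout (month abbreviation, day, time) is assumed from the pattern above. It is written with print() as a function, so it targets Python 3.

```python
import re

# Month abbreviation, one- or two-digit day, then HH:MM:SS.
DATE_RE = re.compile(r'\w{3}\s\d\d?\s\d\d:\d\d:\d\d')

def extract_dates(lines):
    """Return the leading timestamp of every line that starts with one."""
    dates = []
    for line in lines:
        match = DATE_RE.match(line)  # match() only looks at the start of the line
        if match:
            dates.append(match.group())
    return dates

# Hypothetical sample lines
sample = [
    "Jan 12 08:15:02 host sshd[411]: session opened",
    "    continuation line without a timestamp",
    "Feb 3 23:59:59 host cron[77]: job finished",
]

for date in extract_dates(sample):
    print(date)
```

Lines that do not start with a timestamp are skipped rather than printed as empty lists.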

Related

Getting the line number of a string

Suppose I have a very long string taken from a file:
lf = open(filename, 'r')
text = lf.readlines()
lf.close()
or
lineList = [line.strip() for line in open(filename)]
text = '\n'.join(lineList)
How can I find the line number of each regular expression match in this string (in this case, the line number of 'match')?
regex = re.compile(somepattern)
for match in re.findall(regex, text):
    continue
Thank you for your time in advance
Edit: I forgot to add that the pattern we are searching for spans multiple lines, and I am interested in the line where the match starts.
We need re.Match objects rather than the matched strings themselves, and re.finditer provides them; they give access to the starting position of each match. Consider the following example: let's say I want to find every pair of digits located immediately before and after a newline (\n):
import re
lineList = ["123","456","789","ABC","XYZ"]
text = '\n'.join(lineList)
for match in re.finditer(r"\d\n\d", text, re.MULTILINE):
    start = match.span()[0] # .span() gives tuple (start, end)
    line_no = text[:start].count("\n")
    print(line_no)
Output:
0
1
Explanation: once I have the starting position, I simply count the newlines before it, which gives the line number. Note: I assumed line numbers start from 0.
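The same idea can be wrapped in a small helper. This is only a sketch: the function name line_numbers_of_matches, the sample text, and the choice of 1-based line numbers are assumptions made for illustration.

```python
import re

def line_numbers_of_matches(pattern, text):
    """Yield (line_number, matched_text) pairs, with line numbers starting at 1."""
    for match in re.finditer(pattern, text):
        # The number of newlines before the match start is the number of lines preceding it.
        yield text.count("\n", 0, match.start()) + 1, match.group()

# Hypothetical sample: a pattern that spans two lines
text = "alpha\nbegin\nend\nbeta\nbegin\nend\n"
for line_no, matched in line_numbers_of_matches(r"begin\nend", text):
    print(line_no, repr(matched))
```

Counting from the start of the text for every match is simple but rescans the string each time; for very large inputs with many matches, keeping a running count between matches would avoid that.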
Perhaps something like this:
lf = open(filename, 'r')
text_lines = lf.readlines()
lf.close()
regex = re.compile(somepattern)
for line_number, line in enumerate(text_lines):
    for match in re.findall(regex, line):
        print('Match found on line %d: %s' % (line_number, match))

How to ignore the next same word as the first in a text using python? [duplicate]

I have a huge file containing lines such as DDD-1126N|refseq:NP_285726|uniprotkb:P00112 and DDD-1081N|uniprotkb:P12121. I want to grab the number after uniprotkb.
Here's my code:
x = 'uniprotkb:P'
f = open('m.txt')
for line in f:
    print line.find(x)
    print line[36:31 + len(x)]
The problem is that line.find(x) returns either 10 or 26, and my slice only grabs the complete number when it is 26. I'm new to programming, so I'm looking for a way to grab the complete number after the word, something like this (pseudocode):
x = 'uniprotkb:'
f = open('m.txt')
for line in f:
    if x in line:
        print the number after x
Use regular expressions:
import re
for line in open('m.txt'):
    match = re.search(r'uniprotkb:P(\d+)', line)
    if match:
        print match.group(1)
import re
regex = re.compile('uniprotkb:P([0-9]*)')
print regex.findall(string)
The re module is quite unnecessary here if x is static and always matches a substring at the end of each line (like "DDD-1126N|refseq:NP_285726|uniprotkb:P00112"):
x = 'uniprotkb:'
f = open('m.txt')
for line in f:
  if x in line:
    print line[line.find(x)+len(x):]
Edit:
To answer your comment: if the fields are separated by the pipe character (|), then you could do this:
sep = "|"
x = 'uniprotkb:'
f = open('m.txt')
for line in f:
  if x in line:
    # strip() drops the trailing newline; "if x in l" keeps only fields that contain x
    matches = [l[l.find(x)+len(x):] for l in line.strip().split(sep) if x in l]
    print matches
If m.txt has the following line:
DDD-1126N|uniprotkb:285726|uniprotkb:P00112
Then the above will output:
['285726', 'P00112']
Replace sep = "|" with whatever the column separator would be.
Um, for one thing I'd suggest you use the csv module to read a TSV file.
But generally, you can use a regular expression:
import re
regex = re.compile(r"(?<=\buniprotkb:)\w+")
f = open('m.txt')
for line in f:
    match = regex.search(line)
    if match:
        print match.group()
The regular expression matches a string of alphanumeric characters if it's preceded by uniprotkb:.
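To see the lookbehind in action without a file, here is a small sketch run on the two example lines from the question. The helper name uniprot_ids is made up for illustration, and findall is used instead of search (a deliberate swap) so that a line with more than one uniprotkb field returns every ID. It targets Python 3.

```python
import re

# \w+ that is immediately preceded by "uniprotkb:"; the prefix itself is not part of the match.
UNIPROT_RE = re.compile(r"(?<=\buniprotkb:)\w+")

def uniprot_ids(line):
    """Return every ID that follows 'uniprotkb:' in one line."""
    return UNIPROT_RE.findall(line)

lines = [
    "DDD-1126N|refseq:NP_285726|uniprotkb:P00112",
    "DDD-1081N|uniprotkb:P12121",
]
for line in lines:
    print(uniprot_ids(line))
```

Because the prefix sits inside a lookbehind, no capture group is needed and the refseq field is ignored.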

Python - Replace parenthesis with periods and remove first and last period

I am trying to process an input file containing a list of DNS lookups in which the subdomain/domain separators are the label lengths in parentheses rather than periods. It looks like this:
(8)subdomain(5)domain(3)com(0)
(8)subdomain(5)domain(3)com(0)
(8)subdomain(5)domain(3)com(0)
I would like to replace the parentheses and numbers with periods and then remove the first and last period. My code currently does this, but leaves the last period. Any help is appreciated. Here is the code:
import re
file = open('test.txt', 'rb')
writer = open('outfile.txt', 'wb')
for line in file:
    newline1 = re.sub(r"\(\d+\)",".",line)
    if newline1.startswith('.'):
        newline1 = newline1[1:-1]
    writer.write(newline1)
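You can split the lines with the \(\d+\) regex and then join the parts with ., stripping the periods at both ends:
for line in file:
    # strip the newline first so the trailing period is really at the end of the string
    res = ".".join(re.split(r'\(\d+\)', line.strip()))
    writer.write(res.strip('.') + '\n')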
Given that your re.sub call works like this:
> re.sub(r"\(\d+\)",".", "(8)subdomain(5)domain(3)com(0)")
'.subdomain.domain.com.'
the only thing you need to do is strip the resulting string from any leading and trailing .:
> s = re.sub(r"\(\d+\)",".", "(8)subdomain(5)domain(3)com(0)")
> s.strip(".")
'subdomain.domain.com'
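Full drop-in solution:
for line in file:
    # drop the trailing newline first, otherwise strip(".") cannot reach the last period
    newline1 = re.sub(r"\(\d+\)", ".", line.strip()).strip(".")
    writer.write(newline1 + "\n")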
import re
def repl(matchobj):
    if matchobj.group(1):
        return "."
    else:
        return ""
x="(8)subdomain(5)domain(3)com(0)"
print re.sub(r"^\(\d+\)|((?<!^)\(\d+\))(?!$)|\(\d+\)$",repl,x)
Output: subdomain.domain.com
You can define your own replace function.
import re
for line in file:
    line = re.sub(r'\(\d+\)','.',line.strip())
    line = line.strip('.')
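One detail worth checking in any of these variants: a line read from a file usually still ends in a newline, and strip('.') only removes periods at the very ends of the string. A minimal sketch of the whole conversion, using a hypothetical helper name decode_dns_name and Python 3 syntax, might look like this:

```python
import re

LENGTH_MARKER = re.compile(r"\(\d+\)")

def decode_dns_name(line):
    """Turn '(8)subdomain(5)domain(3)com(0)' into 'subdomain.domain.com'."""
    # Remove surrounding whitespace (including the newline) before touching the periods.
    dotted = LENGTH_MARKER.sub(".", line.strip())
    return dotted.strip(".")

print(decode_dns_name("(8)subdomain(5)domain(3)com(0)\n"))
```

When writing the result back to a file, the newline removed by strip() has to be added again.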

search the line in a file for a number after a particular word and iterate over that number

I have a file with multiple lines, and one of them reads: loop_iter 10 {apples=0; oranges=0}
import sys
import re
input_file = open(r'C:\infile')
input_file_read = input_file.read()
for line in input_file_read:
    match = re.search("loop_iter\s*(\d+)" , input_file_read)
    print match.group(1)
Right now this prints the number as many times as there are lines in the file. If I instead do
for line in input_file_read:
    if line.startswith("loop_iter"):
        match = re.search("loop_iter\s*(\d+)" , input_file_read)
        print match.group(1)
it does not work.
The syntax coloring here in Stack Overflow might have given you a hint already... But it looks like your quote marks don't match:
if line.startswith('loop_iter"):
Try
if line.startswith("loop_iter"):
I guess I got it to work once I converted the match.group(1) into an integer.
import re
input_file = open(r'C:\infile')
for line in input_file:  # iterate over the lines of the file
    match = re.search(r"loop_iter\s*(\d+)", line)
    if match:
        i = int(match.group(1))  # group(1) is a string, so convert it
        for x in range(i):
            print 'something'
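For reference, here is a self-contained sketch of the same idea that runs on an in-memory string instead of C:\infile. The sample text and the function name read_loop_count are assumptions for illustration, and the code targets Python 3.

```python
import re

def read_loop_count(text):
    """Return the integer after 'loop_iter', or None if no such line exists."""
    # re.MULTILINE makes ^ match at the start of every line, not just the start of the string.
    match = re.search(r"^loop_iter\s+(\d+)", text, re.MULTILINE)
    return int(match.group(1)) if match else None

# Hypothetical file contents
sample = "setup {x=1}\nloop_iter 10 {apples=0; oranges=0}\nteardown\n"

count = read_loop_count(sample)
for i in range(count or 0):  # "or 0" covers the case where nothing matched
    print("iteration", i)
```

Searching the whole text once avoids repeating the search for every line, and returning None makes the "no loop_iter line" case explicit.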
