Take part of a regex search result in Python - python

I want to read in a header file and output each of the variables that has the form x = 1.0; as double = x;
At the moment I've got this, which just outputs the whole line:
import re
input = open("file_with_vars.hpp", 'r')
out = open("output.txt", 'w')
for line in input:
    if re.match("(.*) = (\d)", line):
        print >> out, line
But I can't work out how to take part of the line and output the variable name and the double string to file.
EDIT:
So now I have
for line in cell:
    m = re.search('(.*)\s*=\s*(\d+\.\d+)', line)
    print m.group()
But I get the error "AttributeError: 'NoneType' object has no attribute 'group'".

Use search instead of match.
The regex is .*\s*=\s*\d+\.\d+
Test:
import re
y="x=1.0"
m=re.search('(.*)\s*=\s*(\d+\.\d+)',y)
The group function can be used to extract the matched strings as
>>> print m.group()
'x=1.0'
>>> print m.group(1)
'x'
>>> print m.group(2)
'1.0'
EDIT
How to search lines within a file:
for line in cell:
    try:
        m = re.search('(.*)\s*=\s*(\d+\.\d+)', line)
        print m.group()
    except AttributeError:
        pass
The NoneType error occurs because some lines in the file do not match the regex, so the search method returns None for them.
The try/except takes care of the exception.
pass is a null statement in Python.
For an input file
x=10.2
y=15.3
z=12.4
w=48
the output is
x=10.2
y=15.3
z=12.4
Here w=48 does not match the regex, so search returns None, which is safely handled by the try block.
OR
As Jerry pointed out, an if makes this simpler:
for line in cell:
    m = re.search('\S*\s*=\s*(\d+\.\d+)', line)
    if m:
        print m.group()
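To get at the variable name and the number separately, the capture groups can be read with group(1) and group(2). Below is a minimal sketch in Python 3. The helper name extract_doubles, the \w+ name pattern, the sample lines, and the "double name = value;" output format are all assumptions made for illustration, since the question does not pin the format down exactly.

```python
import re

# Assumed pattern: an identifier, '=', then a decimal number such as 1.0
ASSIGNMENT = re.compile(r'(\w+)\s*=\s*(\d+\.\d+)')

def extract_doubles(lines):
    """Return (name, value) pairs for lines that look like `x = 1.0;`."""
    pairs = []
    for line in lines:
        m = ASSIGNMENT.search(line)
        if m:  # search() returns None when the line does not match
            pairs.append((m.group(1), m.group(2)))
    return pairs

# Hypothetical sample lines standing in for the header file
sample = ["x = 1.0;", "int w = 48;", "y=15.3;"]
for name, value in extract_doubles(sample):
    print("double %s = %s;" % (name, value))
```

Checking the match object with if before calling group avoids the AttributeError without needing try/except.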

You are printing the whole line after the match. Since more than one match may exist, you can use re.findall(). You also need [\d\.]+ instead of \d:
for line in input:
    if re.match("(.*) = [\d\.]+", line):
        print re.findall("(.*) = [\d\.]+", line)
As for the spaces before and after =, check whether they are always present. If input such as var=num is possible, put ? after each space in your regex pattern: (.*) ?= ?[\d\.]+

import re
input = open("file_with_vars.hpp", 'r')
out = open("output.txt", 'w')
for line in input:
    if re.findall("(.*?)\s*=\s*(\d+(?:\.\d*)?)", line):
        print >> out, line
Try this. This should work.

Related

Can't extract substring from the string using regex in python

I want to extract the substring "login attempt [b'admin'/b'admin']" from the string:
2021-05-06T00:00:15.921179Z [HoneyPotSSHTransport,1127,5.188.87.53] login attempt [b'admin'/b'admin'] succeeded.
But python returns the whole string. My code is:
import re
hand = open('cowrie.log')
outF = open("Usernames.txt", "w")
for line in hand:
    if re.findall(r'login\sattempt\s\[[a-zA-z0-9]\'[a-zA-z0-9]+\'/[a-zA-z0-9]+\'[a-zA-z0-9]+\'\]', line):
        print(line)
        outF.write(line)
        outF.write("\n")
outF.close()
Thanks in advance. This is the LINK which contains the data from which I want to extract.
Your code states: if re.findall returns something, print the whole line. But you should print the return from re.findall and write that as a string.
Or use re.search if you expect a single match.
Note that [A-z] matches more than [A-Za-z].
import re
hand = open('cowrie.log')
outF = open("Usernames.txt", "w")
for line in hand:
    res = re.search(r"login\sattempt\s\[[a-zA-Z0-9]'[a-zA-Z0-9]+'/[a-zA-Z0-9]+'[a-zA-Z0-9]+']", line)
    if res:
        outF.write(res.group())
        outF.write("\n")
outF.close()
Usernames.txt now contains:
login attempt [b'admin'/b'admin']
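The remark that [A-z] matches more than [A-Za-z] can be checked directly. This small Python 3 sketch lists the ASCII characters that [A-z] accepts but [A-Za-z] does not; the variable name extra is only illustrative.

```python
import re

# Every ASCII character matched by [A-z] but not by [A-Za-z]
extra = [
    c for c in map(chr, range(128))
    if re.fullmatch(r'[A-z]', c) and not re.fullmatch(r'[A-Za-z]', c)
]
print(extra)  # ['[', '\\', ']', '^', '_', '`']
```

These six characters sit between Z and a in ASCII, which is why the range A-z picks them up.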

Getting the line number of a string

Suppose I have a very long string taken from a file:
lf = open(filename, 'r')
text = lf.readlines()
lf.close()
or
lineList = [line.strip() for line in open(filename)]
text = '\n'.join(lineList)
How can one find the line number of a regular expression match in this string (in this case, the line number of 'match')?
regex = re.compile(somepattern)
for match in re.findall(regex, text):
    continue
Thank you for your time in advance
Edit: Forgot to add that the pattern we are searching for spans multiple lines, and I am interested in the starting line.
We need re.Match objects rather than the matched strings themselves. re.finditer yields them, and each one carries the starting position of its match. Consider the following example: say I want to find every pair of digits located immediately before and after a newline (\n):
import re
lineList = ["123","456","789","ABC","XYZ"]
text = '\n'.join(lineList)
for match in re.finditer(r"\d\n\d", text, re.MULTILINE):
    start = match.span()[0] # .span() gives tuple (start, end)
    line_no = text[:start].count("\n")
    print(line_no)
Output:
0
1
Explanation: once I have the starting position, I simply count the newlines before it, which gives the line number. Note: I assumed line numbers start from 0.
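The same newline-counting idea works for a pattern that spans several lines, which is the case the question's edit asks about. Here is a hedged Python 3 sketch: the helper name match_start_lines, the sample text, and the choice of 1-based line numbers are assumptions for illustration.

```python
import re

def match_start_lines(pattern, text, flags=0):
    """Yield (1-based line number, matched text) for each match in text."""
    for m in re.finditer(pattern, text, flags):
        # Newlines before the match start tell us which line it begins on
        yield text.count("\n", 0, m.start()) + 1, m.group()

# Hypothetical text with a two-line pattern occurring twice
text = "alpha\nbegin\nend\nbeta\nbegin\nend"
results = list(match_start_lines(r"begin\nend", text))
print(results)  # [(2, 'begin\nend'), (5, 'begin\nend')]
```

Counting from the start of the text for every match is simple but rescans the string each time; for very large inputs with many matches it may be worth keeping a running count instead.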
Perhaps something like this:
lf = open(filename, 'r')
text_lines = lf.readlines()
lf.close()
regex = re.compile(somepattern)
for line_number, line in enumerate(text_lines):
    for match in re.findall(regex, line):
        print('Match found on line %d: %s' % (line_number, match))

Print line if line starts with any letter of the alphabet

I'm trying to print all of my reptile subspecies in my python program. I have a text file with a bunch of subspecies and their DNA sequence IDs. I just want to create a dictionary of subspecies (keys) and their respective DNA sequence IDs (values). But to do that I need to first learn how to separate the two.
So I want to print all of the subspecies names only, and to ignore the sequence IDs.
So far I have
import re
file = open('repCleanSubs2.txt')
for line in file:
    if line.startswith('[a-zA-Z]'):
        print line
I believe the compiler takes the '[a-zA-Z]' as a string literal, rather than as a search for any letter of the alphabet regardless of case, which is what I want.
Is there some syntax that I'm missing in my if statement?
Thanks!
startswith does not interpret regular expressions. Use the re module you have imported to check whether a string matches:
if re.match('^[a-zA-Z]+', line) is not None:
    print line
starts with: ^
one or more matching characters: +
http://www.fon.hum.uva.nl/praat/manual/Regular_expressions_1__Special_characters.html
import re
file = open('repCleanSubs2.txt')
for line in file:
    match = re.findall('^[a-zA-Z]+', line)
    if match:
        print line, match
The ^ sign means match from the beginning of the line, letters between a-z and A-Z
+ means at least one or more characters in [a-zA-Z] must be found
re.findall will return a list of all the patterns it could find in the string you supplied to it
Try the following lines instead of the startswith.
if re.match("^[a-zA-Z]", line):
    print line
Try this, it's working for me:
import re
file = open('repCleanSubs2.txt')
for line in file:
    if (re.match('[a-zA-Z]',line)):
        print line
Without using re:
import string
with open('repCleanSubs2.txt') as c_file:
    for line in c_file:
        if any([line.startswith(c) for c in string.letters]):
            print line
Try this:
file = open("abc.xyz")
file_content = file.read()
line = file_content.splitlines()
output_data = []
for i in line:
    # Comparing i[0] with the string '[a-zA-Z]' is never true; test the first character instead
    if i and i[0].isalpha():
        output_data.append(i)
        print(i)
It can be done without a regular expression:
data = open('repCleanSubs2.txt').read().splitlines() ## Read file and extract data as list
print [i for i in data if i[:1].isalpha()] ## i[:1] is safe for empty lines, where i[0] would raise IndexError
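To see the difference between the approaches side by side, here is a small Python 3 sketch. The sample lines are made up for illustration (the real contents of repCleanSubs2.txt are not shown in the question).

```python
import re

# Hypothetical lines: names, an ID, an empty line, and a line that
# literally starts with the text '[a-zA-Z]'
lines = ["Anolis carolinensis", "12345", "", "[a-zA-Z] literal", "python regius"]

# startswith compares plain text, so only the literal prefix matches
literal = [l for l in lines if l.startswith('[a-zA-Z]')]

# re.match anchors at the start of the string and treats the class as a pattern
by_regex = [l for l in lines if re.match(r'[a-zA-Z]', l)]

# l[:1] is '' for an empty line, so isalpha() is False instead of raising
by_isalpha = [l for l in lines if l[:1].isalpha()]

print(literal)
print(by_regex)
print(by_isalpha)
```

Note that isalpha() also accepts non-ASCII letters, so the last two can differ on input that is not plain ASCII.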

Python - Replace parenthesis with periods and remove first and last period

I am trying to take an input file with a list of DNS lookups that contains subdomain/domain separators with the string length in parenthesis as opposed to periods. It looks like this:
(8)subdomain(5)domain(3)com(0)
(8)subdomain(5)domain(3)com(0)
(8)subdomain(5)domain(3)com(0)
I would like to replace the parenthesis and numbers with periods and then remove the first and last period. My code currently does this, but leaves the last period. Any help is appreciated. Here is the code:
import re
file = open('test.txt', 'rb')
writer = open('outfile.txt', 'wb')
for line in file:
    newline1 = re.sub(r"\(\d+\)",".",line)
    if newline1.startswith('.'):
        newline1 = newline1[1:-1]
    writer.write(newline1)
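You can split the lines with the \(\d+\) regex and then join with ., stripping periods at both ends (the line ending is stripped first, otherwise the trailing period is not the last character):
for line in file:
    res = ".".join(re.split(r'\(\d+\)', line.strip()))
    writer.write(res.strip('.') + '\n')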
Given that your re.sub call works like this:
> re.sub(r"\(\d+\)",".", "(8)subdomain(5)domain(3)com(0)")
'.subdomain.domain.com.'
the only thing you need to do is strip the resulting string from any leading and trailing .:
> s = re.sub(r"\(\d+\)",".", "(8)subdomain(5)domain(3)com(0)")
> s.strip(".")
'subdomain.domain.com'
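Full drop-in solution (the line ending is stripped first so that the trailing . really is at the end of the string):
for line in file:
    newline1 = re.sub(r"\(\d+\)",".",line.strip()).strip(".")
    writer.write(newline1 + "\n")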
import re
def repl(matchobj):
    if matchobj.group(1):
        return "."
    else:
        return ""
x="(8)subdomain(5)domain(3)com(0)"
print re.sub(r"^\(\d+\)|((?<!^)\(\d+\))(?!$)|\(\d+\)$",repl,x)
Output: subdomain.domain.com
You can define your own replace function.
import re
for line in file:
    line = re.sub(r'\(\d+\)','.',line.strip())
    line = line.strip('.')
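One detail worth spelling out: a line read from a file usually still ends with a newline, so after the substitution the final period is followed by "\n" and strip(".") cannot reach it. That is also why the [1:-1] slice in the question removes the newline rather than the period. A minimal Python 3 sketch of the difference follows; the helper name to_dotted is illustrative only.

```python
import re

def to_dotted(line):
    """Turn '(8)subdomain(5)domain(3)com(0)' into 'subdomain.domain.com'."""
    # Drop the line ending first, otherwise the final '.' is not at the end
    # of the string and strip('.') leaves it in place.
    return re.sub(r"\(\d+\)", ".", line.rstrip("\r\n")).strip(".")

raw = "(8)subdomain(5)domain(3)com(0)\n"
without_rstrip = re.sub(r"\(\d+\)", ".", raw).strip(".")
print(repr(without_rstrip))  # 'subdomain.domain.com.\n'
print(repr(to_dotted(raw)))  # 'subdomain.domain.com'
```

When writing the result back out, the newline has to be added again by hand.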

Match an element of every line

I have a list of rules for a given input file for my function. If any of them are violated in the file given, I want my program to return an error message and quit.
Every gene in the file should be on the same chromosome
Thus, for lines such as:
NM_001003443 chr11 + 5997152 5927598 5921052 5926098 1 5928752,5925972, 5927204,5396098,
NM_001003444 chr11 + 5925152 5926098 5925152 5926098 2 5925152,5925652, 5925404,5926098,
NM_001003489 chr11 + 5925145 5926093 5925115 5926045 4 5925151,5925762, 5987404,5908098,
etc.
Each line in the file will be variations of this line
Thus, I want to make sure every line in the file is on chr11
Yet I may be given a file with a different list of chr(and any number of numbers). Thus I want to write a function that will make sure whatever number is found on chr in the line is the same for every line.
Should I use a regular expression for this, or what should I do? This is in python by the way.
Such as: chr\d+ ?
I am unsure how to make sure that whatever is matched is the same in every line though...
I currently have:
from re import *
for line in file:
    r = 'chr\d+'
    i = search(r, line)
    if i in line:
but I don't know how to make sure it is the same in every line...
In reference to sajattack's answer
fp = open(infile, 'r')
for line in fp:
    filestring = ''
    filestring +=line
    chrlist = search('chr\d+', filestring)
    chrlist = chrlist.group()
for chr in chrlist:
    if chr != chrlist[0]:
        print('Every gene in file not on same chromosome')
Just read the file and have a while loop check each line to make sure it contains chr11. There are string functions to search for substrings in a string. As soon as you find a line that returns false (does not contain chr11) then break out of the loop and set a flag valid = false.
import re
fp = open(infile, 'r')
fp.readline()
tar = re.findall(r'chr\d+', fp.readline())[0]
for line in fp:
    if (line.find(tar) == -1):
        print("Not valid")
        break
This should search for a number in the line and check for validity.
Is it safe to assume that the first chr is the correct one? If so, use this:
import re
chrlist = re.findall("chr[0-9]+", open('file').read())
# ^ this is a list with all chr(whatever numbers)
for chr in chrlist:
    if chr != chrlist[0]:
        print("Chr does not match")
        break
My solution uses a "match group" to collect the matched numbers from the "chr" string.
import re
pat = re.compile(r'\schr(\d+)\s')
def chr_val(line):
    m = re.search(pat, line)
    if m is not None:
        return m.group(1)
    else:
        return ''
def is_valid(f):
    line = f.readline()
    v = chr_val(line)
    if not v:
        return False
    return all(chr_val(line) == v for line in f)
with open("test.txt", "r") as f:
    print("The file is {0}".format("valid" if is_valid(f) else "NOT valid"))
Notes:
Pre-compiles the regular expression for speed.
Uses a raw string (r'') to specify the regular expression.
The pattern requires white space (\s) on either side of the chr string.
is_valid() returns False if the first line doesn't have a good chr value. Then it returns a Boolean value that is true if all of the following lines match the chr value of the first line.
Your sample code just prints something like The file is True so I made it a bit friendlier.
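Another way to phrase the same check is to collect the distinct chr labels into a set and see whether more than one turns up. This is only a sketch in Python 3 under a few assumptions: the helper name chromosomes is hypothetical, the sample lines are shortened and altered versions of the ones in the question (the third was changed to chr12 to show a failure), and lines without any chr label are simply ignored.

```python
import re

# 'chr' followed by digits, as a whole word
CHR = re.compile(r'\bchr\d+\b')

def chromosomes(lines):
    """Return the set of chromosome labels found, one per matching line."""
    found = set()
    for line in lines:
        m = CHR.search(line)
        if m:
            found.add(m.group())
    return found

# Hypothetical, shortened sample lines
sample = [
    "NM_001003443 chr11 + 5997152 5927598",
    "NM_001003444 chr11 + 5925152 5926098",
    "NM_001003489 chr12 + 5925145 5926093",
]
labels = chromosomes(sample)
if len(labels) > 1:
    print("Every gene in file not on same chromosome: %s" % sorted(labels))
```

A set also makes it easy to report which labels were seen, which a plain True/False check does not.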
