Python - Replace parenthesis with periods and remove first and last period - python

I am trying to take an input file with a list of DNS lookups that contains subdomain/domain separators with the string length in parenthesis as opposed to periods. It looks like this:
(8)subdomain(5)domain(3)com(0)
(8)subdomain(5)domain(3)com(0)
(8)subdomain(5)domain(3)com(0)
I would like to replace the parenthesis and numbers with periods and then remove the first and last period. My code currently does this, but leaves the last period. Any help is appreciated. Here is the code:
import re
file = open('test.txt', 'rb')
writer = open('outfile.txt', 'wb')
for line in file:
newline1 = re.sub(r"\(\d+\)",".",line)
if newline1.startswith('.'):
newline1 = newline1[1:-1]
writer.write(newline1)

You can split the lines with \(\d+\) regex and then join with . stripping commas at both ends:
for line in file:
res =".".join(re.split(r'\(\d+\)', line))
writer.write(res.strip('.'))
See IDEONE demo

Given that your re.sub call works like this:
> re.sub(r"\(\d+\)",".", "(8)subdomain(5)domain(3)com(0)")
'.subdomain.domain.com.'
the only thing you need to do is strip the resulting string from any leading and trailing .:
> s = re.sub(r"\(\d+\)",".", "(8)subdomain(5)domain(3)com(0)")
> s.strip(".")
'subdomain.domain.com'
Full drop in solution:
for line in file:
newline1 = re.sub(r"\(\d+\)",".",line).strip(".")
writer.write(newline1)

import re
def repl(matchobj):
if matchobj.group(1):
return "."
else:
return ""
x="(8)subdomain(5)domain(3)com(0)"
print re.sub(r"^\(\d+\)|((?<!^)\(\d+\))(?!$)|\(\d+\)$",repl,x)
Output:subdomain.domain.com.
You can define your own replace function.

import re
for line in file:
line = re.sub(r'\(\d\)','.',line)
line = line.strip('.')

Related

python open csv search for pattern and strip everything else

I got a csv file 'svclist.csv' which contains a single column list as follows:
pf=/usr/sap/PL5/SYS/profile/PL5_D00_s4prd1
pf=/usr/sap/PL5/SYS/profile/PL5_ASCS01_s4prdascs
I need to strip each line from everything except the PL5 directoy and the 2 numbers in the last directory
and should look like that
PL5,00
PL5,01
I started the code as follow:
clean_data = []
with open('svclist.csv', 'rt') as f:
for line in f:
if line.__contains__('profile'):
print(line, end='')
and I'm stuck here.
Thanks in advance for the help.
you can use the regular expression - (PL5)[^/].{0,}([0-9]{2,2})
For explanation, just copy the regex and paste it here - 'https://regexr.com'. This will explain how the regex is working and you can make the required changes.
import re
test_string_list = ['pf=/usr/sap/PL5/SYS/profile/PL5_D00_s4prd1',
'pf=/usr/sap/PL5/SYS/profile/PL5_ASCS01_s4prdascs']
regex = re.compile("(PL5)[^/].{0,}([0-9]{2,2})")
result = []
for test_string in test_string_list:
matchArray = regex.findall(test_string)
result.append(matchArray[0])
with open('outfile.txt', 'w') as f:
for row in result:
f.write(f'{str(row)[1:-1]}\n')
In the above code, I've created one empty list to hold the tuples. Then, I'm writing to the file. I need to remove the () at the start and end. This can be done via str(row)[1:-1] this will slice the string.
Then, I'm using formatted string to write content into 'outfile.csv'
You can use regex for this, (in general, when trying to extract a pattern this might be a good option)
import re
pattern = r"pf=/usr/sap/PL5/SYS/profile/PL5_.*(\d{2})"
with open('svclist.csv', 'rt') as f:
for line in f:
if 'profile' in line:
last_two_numbers = pattern.findall(line)[0]
print(f'PL5,{last_two_numbers}')
This code goes over each line, checks if "profile" is in the line (this is the same as _contains_), then extracts the last two digits according to the pattern
I made the assumption that the number is always between the two underscores. You could run something similar to this within your for-loop.
test_str = "pf=/usr/sap/PL5/SYS/profile/PL5_D00_s4prd1"
test_list = test_str.split("_") # splits the string at the underscores
output = test_list[1].strip(
"abcdefghijklmnopqrstuvwxyz" + str.swapcase("abcdefghijklmnopqrstuvwxyz")) # removing any character
try:
int(output) # testing if the any special characters are left
print(f"PL5, {output}")
except ValueError:
print(f'Something went wrong! Output is PL5,{output}')

Getting the line number of a string

Suppose I have a very long string taken from a file:
lf = open(filename, 'r')
text = lf.readlines()
lf.close()
or
lineList = [line.strip() for line in open(filename)]
text = '\n'.join(lineList)
How can one find specific regular expression's line number in this string( in this case the line number of 'match'):
regex = re.compile(somepattern)
for match in re.findall(regex, text):
continue
Thank you for your time in advance
Edit: Forgot to add that the pattern that we are searching is multiple lines and I am interested in the starting line.
We need to get re.Match objects rather than strings themselves using re.finditer, which will allow getting information about starting position. Consider following example: lets say I want to find every two digits which are located immediately before and after newline (\n) then:
import re
lineList = ["123","456","789","ABC","XYZ"]
text = '\n'.join(lineList)
for match in re.finditer(r"\d\n\d", text, re.MULTILINE):
start = match.span()[0] # .span() gives tuple (start, end)
line_no = text[:start].count("\n")
print(line_no)
Output:
0
1
Explanation: After I get starting position I simply count number of newlines before that place, which is same as getting number of line. Note: I assumed line numbers are starting from 0.
Perhaps something like this:
lf = open(filename, 'r')
text_lines = lf.readlines()
lf.close()
regex = re.compile(somepattern)
for line_number, line in enumerate(text_lines):
for match in re.findall(regex, line):
print('Match found on line %d: %s' % (line_number, match))

Print line if line starts with any letter of the alphabet

I'm trying to print all of my reptile subspecies in my python program. I have a text file with a bunch of subspecies and their DNA sequence IDs. I just want to create a dictionary of subspecies (keys) and their respective DNA sequence IDs (values). But to do that I need to first learn how to separate the two.
So I want to print all of the subspecies names only, and to ignore the sequence IDs.
So far I have
import re
file = open('repCleanSubs2.txt')
for line in file:
if line.startswith('[a-zA-Z]'):
print line
I believe the compiler takes the '[a-zA-Z]'as a string literal, rather than a search for any letter of the alphabet regardless the case sensitivity, which is what I want.
Is there some syntax that I'm missing in my if statement?
Thanks!
startswith does not interpret regular expressions. use the re module you have imported to check if a string is a match:
if re.match('^[a-zA-Z]+', line) is not None:
print line
starts with: ^
one or more matching characters: +
http://www.fon.hum.uva.nl/praat/manual/Regular_expressions_1__Special_characters.html
import re
file = open('repCleanSubs2.txt')
for line in file:
match = re.findall('^[a-zA-Z]+', line)
if match:
print line, match
The ^ sign means match from the beginning of the line, letters between a-z and A-Z
+ means at least one or more characters in [a-zA-Z] must be found
re.findall will return a list of all the patterns it could find in the string you supplied to it
Try the following lines instead of the startswith.
if re.match("^[a-zA-Z]", line):
print line
Try this, its working for me:
import re
file = open('repCleanSubs2.txt')
for line in file:
if (re.match('[a-zA-Z]',line)):
print line
without using re:
import string
with open('repCleanSubs2.txt') as c_file:
for line in c_file:
if any([line.startswith(c) for c in string.letters]):
print line
Try this
file = open("abc.xyz")
file_content = file.read()
line = file_content.splitlines()
output_data = []
for i in line:
if i[0] == '[a-zA-Z]':
output_data.append(i)
print(i)
It can be done without regular expression
data = open('repCleanSubs2.txt').read().splitlines() ## Read file and extract data as list
print [i for i in data if i[0].isalpha()]

Replace part of a matched string in python

I have the following matched strings:
punctacros="Tasla"_TONTA
punctacros="Tasla"_SONTA
punctacros="Tasla"_JONTA
punctacros="Tasla"_BONTA
I want to replace only a part (before the underscore) of the matched strings, and the rest of it should remain the same in each original string.
The result should look like this:
TROGA_TONTA
TROGA_SONTA
TROGA_JONTA
TROGA_BONTA
Edit:
This should work:
from re import sub
with open("/path/to/file") as myfile:
lines = []
for line in myfile:
line = sub('punctacros="Tasla"(_.*)', r'TROGA\1', line)
lines.append(line)
with open("/path/to/file", "w") as myfile:
myfile.writelines(lines)
Result:
TROGA_TONTA
TROGA_SONTA
TROGA_JONTA
TROGA_BONTA
Note however, if your file is exactly like the sample given, you can replace the re.sub line with this:
line = "TROGA_"+line.split("_", 1)[1]
eliminating the need of Regex altogether. I didn't do this though because you seem to want a Regex solution.
mystring.replace('punctacross="Tasla"', 'TROGA_')
where mystring is string with those four lines. It will return string with replaced values.
If you want to replace everything before the first underscore, try this:
#! /usr/bin/python3
data = ['punctacros="Tasla"_TONTA',
'punctacros="Tasla"_SONTA',
'punctacros="Tasla"_JONTA',
'punctacros="Tasla"_BONTA',
'somethingelse!="Tucku"_CONTA']
for s in data:
print('TROGA' + s[s.find('_'):])

Splitting lines in python based on some character

Input:
!,A,56281,12/12/19,19:34:12,000.0,0,37N22.714,121W55.576,+0013!,A,56281,12/1
2/19,19:34:13,000.0,0,37N22.714,121W55.576,+0013!,A,56281,12/12/19,19:34:14,000.
0,0,37N22.714,121W55.576,+0013!,A,56281,12/12/19,19:34:15,000.0,0,37N22.714,121W
55.576,+0013!,A,56281,12/12/19,19:34:16,000.0,0,37N22.714,121W55.576,+0013!,A,56
281,12/12/19,19:34:17,000.0,0,37N22.714,121W55.576,+0013!,A,56281,12/12/19,19:34
:18,000.0,0,37N22.714,121W55.576,+0013!,A,56281,12/12/19,19:34:19,000.0,0,37N22.
Output:
!,A,56281,12/12/19,19:34:12,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:13,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:14,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:15,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:16,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:17,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:18,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:19,000.0,0,37N22.
'!' is the starting character and +0013 should be the ending of each line (if present).
Problem which I am getting:
Output is like :
!,A,56281,12/12/19,19:34:12,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/1
2/19,19:34:13,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:14,000.
0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:15,000.0,0,37N22.714,121W
Any help would be highly appreciated...!!!
My code:
file_open= open('sample.txt','r')
file_read= file_open.read()
file_open2= open('output.txt','w+')
counter =0
for i in file_read:
if '!' in i:
if counter == 1:
file_open2.write('\n')
counter= counter -1
counter= counter +1
file_open2.write(i)
You can try something like this:
with open("abc.txt") as f:
data=f.read().replace("\r\n","") #replace the newlines with ""
#the newline can be "\n" in your system instead of "\r\n"
ans=filter(None,data.split("!")) #split the data at '!', then filter out empty lines
for x in ans:
print "!"+x #or write to some other file
.....:
!,A,56281,12/12/19,19:34:12,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:13,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:14,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:15,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:16,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:17,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:18,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:19,000.0,0,37N22.
Could you just use str.split?
lines = file_read.split('!')
Now lines is a list which holds the split data. This is almost the lines you want to write -- The only difference is that they don't have trailing newlines and they don't have '!' at the start. We can put those in easily with string formatting -- e.g. '!{0}\n'.format(line). Then we can put that whole thing in a generator expression which we'll pass to file.writelines to put the data in a new file:
file_open2.writelines('!{0}\n'.format(line) for line in lines)
You might need:
file_open2.writelines('!{0}\n'.format(line.replace('\n','')) for line in lines)
if you find that you're getting more newlines than you wanted in the output.
A few other points, when opening files, it's nice to use a context manager -- This makes sure that the file is closed properly:
with open('inputfile') as fin:
lines = fin.read()
with open('outputfile','w') as fout:
fout.writelines('!{0}\n'.format(line.replace('\n','')) for line in lines)
Another option, using replace instead of split, since you know the starting and ending characters of each line:
In [14]: data = """!,A,56281,12/12/19,19:34:12,000.0,0,37N22.714,121W55.576,+0013!,A,56281,12/1
2/19,19:34:13,000.0,0,37N22.714,121W55.576,+0013!,A,56281,12/12/19,19:34:14,000.
0,0,37N22.714,121W55.576,+0013!,A,56281,12/12/19,19:34:15,000.0,0,37N22.714,121W
55.576,+0013!,A,56281,12/12/19,19:34:16,000.0,0,37N22.714,121W55.576,+0013!,A,56
281,12/12/19,19:34:17,000.0,0,37N22.714,121W55.576,+0013!,A,56281,12/12/19,19:34
:18,000.0,0,37N22.714,121W55.576,+0013!,A,56281,12/12/19,19:34:19,000.0,0,37N22.""".replace('\n', '')
In [15]: print data.replace('+0013!', "+0013\n!")
!,A,56281,12/12/19,19:34:12,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:13,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:14,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:15,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:16,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:17,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:18,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:19,000.0,0,37N22.
Just for some variance, here is a regular expression answer:
import re
outputFile = open('output.txt', 'w+')
with open('sample.txt', 'r') as f:
for line in re.findall("!.+?(?=!|$)", f.read(), re.DOTALL):
outputFile.write(line.replace("\n", "") + '\n')
outputFile.close()
It will open the output file, get the contents of the input file, and loop through all the matches using the regular expression !.+?(?=!|$) with the re.DOTALL flag. The regular expression explanation & what it matches can be found here: http://regex101.com/r/aK6aV4
After we have a match, we strip out the new lines from the match, and write it to the file.
Let's try to add a \n before every "!"; then let python splitlines :-) :
file_read.replace("!", "!\n").splitlines()
I will actually implement as a generator so that you can work on the data stream rather than the entire content of the file. This will be quite memory friendly if working with huge files
>>> def split_on_stream(it,sep="!"):
prev = ""
for line in it:
line = (prev + line.strip()).split(sep)
for parts in line[:-1]:
yield parts
prev = line[-1]
yield prev
>>> with open("test.txt") as fin:
for parts in split_on_stream(fin):
print parts
,A,56281,12/12/19,19:34:12,000.0,0,37N22.714,121W55.576,+0013
,A,56281,12/12/19,19:34:13,000.0,0,37N22.714,121W55.576,+0013
,A,56281,12/12/19,19:34:14,000.0,0,37N22.714,121W55.576,+0013
,A,56281,12/12/19,19:34:15,000.0,0,37N22.714,121W55.576,+0013
,A,56281,12/12/19,19:34:16,000.0,0,37N22.714,121W55.576,+0013
,A,56281,12/12/19,19:34:17,000.0,0,37N22.714,121W55.576,+0013
,A,56281,12/12/19,19:34:18,000.0,0,37N22.714,121W55.576,+0013
,A,56281,12/12/19,19:34:19,000.0,0,37N22.

Categories