Change line in regex python - python

I have a few lines of string e.g:
AR0003242303
TR0402304004
CR0402340404
I want to create a dictionary from these lines.
I need to change them with a regex to:
KOLAORM0003242303
KOLTORM0402304004
KOLCORM0402340404
So I need to split off the first two characters: put KOL before them, put O between them, and put M after the second character. How can I achieve this? After many attempts I have lost patience with regex, and unfortunately I have no time to learn it better right now. I need a result now :(
Could someone help me with this case?

Using re.sub --> re.sub(r"^([A-Z])([A-Z])", r"KOL\1O\2M", string)
Ex:
import re
s = ["AR0003242303", "TR0402304004", "CR0402340404"]
for i in s:
    print(re.sub(r"^([A-Z])([A-Z])", r"KOL\1O\2M", i))
Output:
KOLAORM0003242303
KOLTORM0402304004
KOLCORM0402340404
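Since the question also asks for a dictionary, here is a minimal sketch that maps each original code to its converted form. The helper name `convert` and the choice of the original string as the key are assumptions for illustration; the question does not say what the keys should be.

```python
import re

# Hypothetical helper: wraps the re.sub call shown above.
def convert(code):
    # \1 and \2 refer to the first and second captured letters.
    return re.sub(r"^([A-Z])([A-Z])", r"KOL\1O\2M", code)

codes = ["AR0003242303", "TR0402304004", "CR0402340404"]

# Assumption: original string as key, converted string as value.
converted = {code: convert(code) for code in codes}
print(converted)
```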

You don't need regex for this. You can convert the string to a list of characters, rebuild the list with the extra pieces inserted, and join it back into a string:
def get_convert_s(s):
    li = list(s)
    # Insert the letter 'O' (not the digit zero) between the first two characters
    li = ['KOL', li[0], 'O', li[1], 'M', *li[2:]]
    return ''.join(li)
print(get_convert_s('AR0003242303'))
#KOLAORM0003242303
print(get_convert_s('TR0402304004'))
#KOLTORM0402304004
print(get_convert_s('CR0402340404'))
#KOLCORM0402340404

import re
regex = re.compile(r"([A-Z])([A-Z])([0-9]+)")
inputs = [
'AR0003242303',
'TR0402304004',
'CR0402340404'
]
results = []
for item in inputs:
    matches = regex.match(item)
    groups = matches.groups()
    results.append('KOL{}O{}M{}'.format(*groups))
print(results)

Assuming the strings in your list always have the same length, Devesh's answer is pretty much the best approach (no reason to overcomplicate it).
My solution is similar to Devesh's; I just like writing functions as one-liners:
codes = ["AR0003242303", "TR0402304004", "CR0402340404"]
def convert_s(s):
    # The inserted character is the letter 'O', not the digit zero
    return "KOL" + s[0] + "O" + s[1] + "M" + s[2:]
for code in codes:
    print(convert_s(code))
It returns the same output, though.

Related

python: parse substring in brackets from string

What's a pythonic way to parse this string in the brackets:
txt = 'foo[bar]'
to get as result:
bar
What have I tried:
How I would solve it and I believe it's not very elegant:
result = txt.split('[')[1].split(']')[0]
I strongly think there is a library or method out there that has a more fault-tolerant and elegant solution for this. That's why I created this question.
Using Regex.
Ex:
import re
txt = 'foo[bar]'
print(re.findall(r"\[(.*?)\]", txt))
Output:
['bar']
One out of many ways, using slicing:
print(txt[4:].strip("[]"))
OR
import re
txt = 'foo[bar]'
m = re.search(r"\[([A-Za-z0-9_]+)\]", txt)
print(m.group(1))
OUTPUT:
bar
Another take on slicing, using str.index to find the locations of the starting and ending delimiters. Once you have a Python slice, you can use it just like an index into the string, except that the slice gives you the range of characters from the start to the end instead of a single character.
def make_slice(s, first, second):
    floc = s.index(first)
    sloc = s.index(second, floc)
    # return a Python slice that will extract the contents between the first
    # and second delimiters
    return slice(floc+1, sloc)
txt = 'foo[bar]'
slc = make_slice(txt, '[', ']')
print(txt[slc])
Prints:
bar
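The question asks for something more fault-tolerant, and str.index raises ValueError when a delimiter is missing. A minimal sketch using str.partition, which never raises, might look like this. The function name `between` and the choice to return None on missing delimiters are assumptions for illustration.

```python
def between(text, start="[", end="]"):
    """Return the text between the first `start` and the next `end`, or None."""
    # partition returns an empty separator when the delimiter is absent
    _, sep_start, rest = text.partition(start)
    if not sep_start:
        return None
    inner, sep_end, _ = rest.partition(end)
    if not sep_end:
        return None
    return inner

print(between("foo[bar]"))
print(between("no brackets here"))
```

Returning None keeps "no brackets" distinguishable from "empty brackets", which yields an empty string.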

Printing substrings' patterns from a string in Python

The input to this problem is a string with a specific form. For example, if s is the string, inputs can be s='3(a)2(b)' or s='3(aa)2(bbb)' or s='4(aaaa)'. The output should be a string in which each substring inside the brackets is repeated as many times as the number that precedes it.
For example,
Input ='3(a)2(b)'
Output='aaabb'
Input='4(aaa)'
Output='aaaaaaaaaaaa'
and similarly for other inputs. The program should print an empty string for wrong or invalid inputs.
This is what I've tried so far
s='3(aa)2(b)'
p=''
q=''
for i in range(0,len(s)):
    #print(s[i],end='')
    if s[i]=='(':
        k=int(s[i-1])
        while(s[i+1]!=')'):
            p+=(s[i+1])
            i+=1
    if s[i]==')':
        q+=k*p
print(q)
Can anyone tell what's wrong with my code?
A oneliner would be:
''.join(int(y[0])*y[1] for y in (x.split('(') for x in Input.split(')')[:-1]))
It works like this. We take the input, and split on the close paren
In [1]: Input ='3(a)2(b)'
In [2]: a = Input.split(')')[:-1]
In [3]: a
Out[3]: ['3(a', '2(b']
This gives us the integer, character pairs we're looking for, but we need to get rid of the open paren. So for each x in a, we split on the open paren to get a two-element list where the first element is the int (still as a string) and the second is the character. You'll see this in b
In [4]: b = [x.split('(') for x in a]
In [5]: b
Out[5]: [['3', 'a'], ['2', 'b']]
So for each element in b, we need to cast the first element as an integer with int() and multiply by the character.
In [6]: c = [int(y[0])*y[1] for y in b]
In [7]: c
Out[7]: ['aaa', 'bb']
Now we join on the empty string to combine them into one string with
In [8]: ''.join(c)
Out[8]: 'aaabb'
Try this:
import re
s = '3(a)2(b)'  # example input from the question
a = re.findall(r'\d+', s)
b = re.findall(r'[a-zA-Z]+', s)
c = ''
for i, j in zip(a, b):
    c += int(i) * j
print(c)
Here is how you could do it:
Step 1: Simple case, getting the data out of a really simple template
Let's assume your template string is 3(a). That's the simplest case I could think of. We'll need to extract pieces of information from that string. The first one is the count of chars that will have to be rendered. The second is the char that has to be rendered.
This is a case where regular expressions are well suited (hence the use of the re module from Python's standard library).
I won't do a full course on regex. You'll have to do that on your own. However, I'll quickly explain the steps I used. So, count (the variable that holds the number of times we should render the char) is a digit (or several). Hence our first capturing group will be something like (\d+). Then we have a char to extract that is enclosed in parentheses, hence \((\w+)\) (I actually allow several chars to be rendered at once). So, if we put them together, we get (\d+)\((\w+)\). For testing you can check this out.
Applied to our case, a straight forward use of the re module is:
import re
# Our template
template = '3(a)'
# Run the regex
match = re.search(r'(\d+)\((\w+)\)', template)
if match:
    # Get the count from the first capturing group
    count = int(match.group(1))
    # Get the string to render from the second capturing group
    string = match.group(2)
    # Print as many times the string as count was given
    print(count * string)
Output:
aaa
Yeah!
Step 2: Full case, with several templates
Okay, we know how to do it for 1 template, how to do the same for several, for instance 3(a)4(b)? Well... How would we do it "by hand"? We'd read the full template from left to right and apply each template one by one. Then this is what we'll do with python!
Luckily for us, the re module has a function just for that: finditer. It does exactly what we described above.
So, we'll do something like:
import re
# Our template
template = '3(a)4(b)'
# Iterate through found templates
for match in re.finditer(r'(\d+)\((\w+)\)', template):
    # Get the count from the first capturing group
    count = int(match.group(1))
    # Get the string to render from the second capturing group
    string = match.group(2)
    print(count * string)
Output:
aaa
bbbb
Okay... All that remains is to combine the pieces. We can collect each rendered piece in a list, and then join the items of this list at the end, no?
Let's do it!
import re
template = '3(a)4(b)'
parts = []
for match in re.finditer(r'(\d+)\((\w+)\)', template):
    parts.append(int(match.group(1)) * match.group(2))
print(''.join(parts))
Output:
aaabbbb
Yeah!
Step 3: Final step, optimization
Because we can always do better, we won't stop. for loops are cool. But what I love (it's personal) about python is that there is so much stuff you can actually just write with one line! Is it the case here? Well yes :).
First we can remove the for loop and the append using a list comprehension:
parts = [int(match.group(1)) * match.group(2) for match in re.finditer(r'(\d+)\((\w+)\)', template)]
rendered = ''.join(parts)
Finally, let's remove the two lines with parts populating and then join and let's do all that in a single line:
import re
template = '3(a)4(b)'
rendered = ''.join(
    int(match.group(1)) * match.group(2)
    for match in re.finditer(r'(\d+)\((\w+)\)', template))
print(rendered)
Output:
aaabbbb
Yeah! Still the same output :).
Hope it helped!
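The question also asks for an empty string on wrong or invalid input, which finditer alone does not enforce because it silently skips text that does not match. A minimal sketch that validates the whole template first might look like this. The function name `render` and the rule "anything that is not entirely made of count(text) units is invalid" are assumptions, since the question does not define invalid input precisely.

```python
import re

# One "count(text)" unit, same pattern as in the steps above.
UNIT = r"(\d+)\((\w+)\)"
# The whole input must consist of one or more units and nothing else.
WHOLE = "(?:" + UNIT + ")+"

def render(template):
    # Assumption: any input that is not entirely made of units is invalid.
    if not re.fullmatch(WHOLE, template):
        return ""
    return "".join(int(m.group(1)) * m.group(2)
                   for m in re.finditer(UNIT, template))

print(render("3(a)2(b)"))
print(repr(render("3(a")))
```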
The value of 'p' should be refreshed after each iteration.
s='1(aaa)2(bb)'
p=''
q=''
i=0
while i<len(s):
    if s[i]=='(':
        k=int(s[i-1])
        p=''
        while(s[i+1]!=')'):
            p+=(s[i+1])
            i+=1
    if s[i]==')':
        q+=k*p
    i+=1
print(q)
The code was not behaving the way I wanted. The problem is the placement of 'p', the variable that accumulates the substring inside the parentheses: it was never reset, so it kept growing from one group to the next. Initializing 'p' inside the 'if' block does the job.
s='2(aa)2(bb)'
q=''
for i in range(0,len(s)):
    if s[i]=='(':
        k=int(s[i-1])
        p=''
        while(s[i+1]!=')'):
            #print(i,'first time')
            p+=s[i+1]
            i+=1
        q+=p*k
        #print(i,'second time')
print(q)
What you want is not really to print substrings; the real purpose is most likely to generate text from a regular expression or from commands.
You can write a parameterized function to do it, or use something like this:
The Python library rstr has the function xeger(), which does what you need by generating random strings and only returning ones that match:
Example
Install with pip install rstr
In [1]: from __future__ import print_function
In [2]: import rstr
In [3]: for dummy in range(10):
...: print(rstr.xeger(r"(a|b)[cd]{2}\1"))
...:
acca
bddb
adda
bdcb
bccb
bcdb
adca
bccb
bccb
acda
Warning
For complex re patterns this might take a long time to generate any matches.

How to move all special characters to the end of the string in Python?

I'm trying to move all non-alphanumeric characters to the end of the strings. I am having a hard time with the regex since I don't know where the special characters will be. Here are a couple of simple examples.
hello*there*this*is*a*str*ing*with*asterisks
and&this&is&a&str&ing&&with&ampersands&in&i&t
one%mo%refor%good%mea%sure%I%think%you%get%it
How would I go about sliding all the special characters to the end of the string?
Here is what I tried, but I didn't get anything.
r = re.compile(r'(.+?)(\**)')
r.sub(r'\1\2', string)
Edit:
Expected output for the first string would be:
hellotherethisisastringwithasterisks********
There's no need for regex here. Just use str.isalpha and build up two lists, then join them:
strings = ['hello*there*this*is*a*str*ing*with*asterisks',
'and&this&is&a&str&ing&&with&ampersands&in&i&t',
'one%mo%refor%good%mea%sure%I%think%you%get%it']
for s in strings:
    a = []
    b = []
    for c in s:
        if c.isalpha():
            a.append(c)
        else:
            b.append(c)
    print(''.join(a+b))
Result:
hellotherethisisastringwithasterisks********
andthisisastringwithampersandsinit&&&&&&&&&&&
onemoreforgoodmeasureIthinkyougetit%%%%%%%%%%
Alternative print() call for Python 3.5 and higher:
print(*a, *b, sep='')
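A more compact variant of the same idea relies on the fact that Python's sorted() is stable. The sketch below is an illustration, not part of the answer above: the function name `move_specials_to_end` is hypothetical, and it uses str.isalnum rather than str.isalpha because the question talks about non-alphanumeric characters.

```python
def move_specials_to_end(s):
    # The key is False for alphanumeric characters and True for the rest.
    # False sorts before True, and because sorted() is stable, characters
    # keep their original relative order within each group.
    return ''.join(sorted(s, key=lambda c: not c.isalnum()))

print(move_specials_to_end('hello*there*this*is*a*str*ing*with*asterisks'))
```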
Here is my proposed solution for this with regex:
import re
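def move_nonalpha(string, char):
    # re.escape makes any special character safe to use as a pattern
    pattern = re.escape(char)
    char_list = re.findall(pattern, string)
    if len(char_list) > 0:
        items = re.split(pattern, string)
        if len(items) > 0:
            return ''.join(items) + ''.join(char_list)
    # Nothing to move: return the string unchanged instead of None
    return string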
Usage:
string = "hello*there*this*is*a*str*ing*with*asterisks"
print (move_nonalpha(string,"*"))
Gives me output:
hellotherethisisastringwithasterisks********
I tried with your other input patterns as well and it's working. Hope it'll help.
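The function above moves one known character at a time. If the special characters are not known in advance, a single character class can cover all of them. This is a minimal sketch; the function name `move_specials_regex` and the definition of "special" as anything outside ASCII letters and digits are assumptions.

```python
import re

# Assumption: "special" means anything that is not an ASCII letter or digit.
SPECIAL = re.compile(r"[^A-Za-z0-9]")

def move_specials_regex(s):
    specials = SPECIAL.findall(s)   # special characters, in original order
    kept = SPECIAL.sub("", s)       # the string with them removed
    return kept + "".join(specials)

print(move_specials_regex("and&this&is&a&str&ing&&with&ampersands&in&i&t"))
```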

Python - Extract text from string

What are the most efficient ways to extract text from a string? Are there some available functions or regex expressions, or some other way?
For example, my string is below and I want to extract the IDs as well
as the ScreenNames, separately.
[User(ID=1234567890, ScreenName=RandomNameHere), User(ID=233323490, ScreenName=AnotherRandomName), User(ID=4459284, ScreenName=YetAnotherName)]
Thank you!
Edit: These are the text strings that I want to pull. I want them to be in a list.
Target_IDs = 1234567890, 233323490, 4459284
Target_ScreenNames = RandomNameHere, AnotherRandomName, YetAnotherName
import re
text = '[User(ID=1234567890, ScreenName=RandomNameHere), User(ID=233323490, ScreenName=AnotherRandomName), User(ID=4459284, ScreenName=YetAnotherName)]'
print('Target IDs = ' + ','.join(re.findall(r'ID=(\d+)', text)))
print('Target ScreenNames = ' + ','.join(re.findall(r' ScreenName=(\w+)', text)))
Output :
Target IDs = 1234567890,233323490,4459284
Target ScreenNames = RandomNameHere,AnotherRandomName,YetAnotherName
It depends. Assuming that all your text comes in the form of
TagName = TagValue1, TagValue2, ...
You need just two calls to split.
tag, value_string = string.split('=')
values = value_string.split(',')
Remove the excess whitespace (a couple of rstrip()/lstrip() calls will probably suffice) and you are done. Or you can use regex. It is slightly more powerful, but in this case I think it's a matter of personal taste.
If you want more complex syntax with nonterminals, terminals and all that, you'll need lex/yacc, which will require some background in parsers. A rather interesting thing to play with, but not something you'll want to use for storing program options and such.
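The two-split approach described above can be sketched as follows. The sample line is taken from the question's edit; using strip() in a list comprehension instead of separate rstrip()/lstrip() calls is a choice made for this illustration.

```python
line = "Target_IDs = 1234567890, 233323490, 4459284"

# First split: separate the tag name from the value list (split once only).
tag, value_string = line.split("=", 1)
tag = tag.strip()

# Second split: separate the values and strip the surrounding whitespace.
values = [v.strip() for v in value_string.split(",")]

print(tag, values)
```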
The regex I'd use would be:
(?:ID=|ScreenName=)+(\d+|[\w\d]+)
However, this assumes that IDs are only digits (\d) and that usernames contain only word characters ([\w\d]; note that \w already covers digits and the underscore).
This regex (when combined with re.findall) would return a list of matches that could be iterated through and sorted in some fashion like so:
import re
s = "[User(ID=1234567890, ScreenName=RandomNameHere), User(ID=233323490, ScreenName=AnotherRandomName), User(ID=4459284, ScreenName=YetAnotherName)]"
pattern = re.compile(r'(?:ID=|ScreenName=)+(\d+|[\w\d]+)')
ids = []
names = []
for p in re.findall(pattern, s):
    if p.isnumeric():
        ids.append(p)
    else:
        names.append(p)
print(ids, names)
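Sorting matches by isnumeric() would misfile a screen name that consists only of digits. A minimal alternative sketch captures both fields in one pattern, so each ID stays paired with its screen name. The pattern assumes the exact "User(ID=..., ScreenName=...)" layout shown in the question.

```python
import re

s = "[User(ID=1234567890, ScreenName=RandomNameHere), User(ID=233323490, ScreenName=AnotherRandomName), User(ID=4459284, ScreenName=YetAnotherName)]"

# With two capture groups, findall returns a list of (id, name) tuples.
pairs = re.findall(r"User\(ID=(\d+), ScreenName=(\w+)\)", s)

ids = [user_id for user_id, _ in pairs]
names = [name for _, name in pairs]
print(ids)
print(names)
```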

Returning all characters before the first underscore

Using re in Python, I would like to return all of the characters in a string that precede the first appearance of an underscore. In addition, I would like the returned string to be in all uppercase and without any non-alphanumeric characters.
For example:
AG.av08_binloop_v6 = AGAV08
TL.av1_binloopv2 = TLAV1
I am pretty sure I know how to return a string in all uppercase using string.upper() but I'm sure there are several ways to remove the . efficiently. Any help would be greatly appreciated. I am still learning regular expressions slowly but surely. Each tip gets added to my notes for future use.
To further clarify, my above examples aren't the actual strings. The actual string would look like:
AG.av08_binloop_v6
With my desired output looking like:
AGAV08
And the next example would be the same. String:
TL.av1_binloopv2
Desired output:
TLAV1
Again, thanks all for the help!
Even without re:
text.split('_', 1)[0].replace('.', '').upper()
Try this:
re.sub(r"[^A-Z\d]", "", re.search(r"^[^_]*", text).group(0).upper())
Since everyone is giving their favorite implementation, here's mine that doesn't use re:
>>> for s in ('AG.av08_binloop_v6', 'TL.av1_binloopv2'):
...     print(''.join(c for c in s.split('_',1)[0] if c.isalnum()).upper())
...
AGAV08
TLAV1
I put .upper() on the outside of the generator so it is only called once.
You don't have to use re for this. Simple string operations would be enough based on your requirements:
tests = """
AG.av08_binloop_v6 = AGAV08
TL.av1_binloopv2 = TLAV1
"""
for t in tests.strip().splitlines():
    print(t[:t.find('_')].replace('.', '').upper())
# Returns:
# AGAV08
# TLAV1
Or if you absolutely must use re:
import re
pat = r'([a-zA-Z0-9.]+)_.*'
pat_re = re.compile(pat)
# strip() drops the blank first line, which would otherwise give no match
for t in tests.strip().splitlines():
    print(re.sub(r'\.', '', pat_re.findall(t)[0]).upper())
# Returns:
# AGAV08
# TLAV1
Hey, just for fun, another option to get the text before the first underscore is:
before_underscore, sep, after_underscore = text.partition('_')
So all in one line could be:
re.sub(r"[^A-Z\d]", "", text.partition('_')[0].upper())
import re
re.sub(r"[^A-Z\d]", "", yourstr.split('_',1)[0].upper())
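For reuse, the re-based variants above can be wrapped in a small function. This is a minimal sketch; the function name `prefix_code` is hypothetical, and it assumes only ASCII letters and digits should be kept.

```python
import re

def prefix_code(text):
    # [^_]* matches everything up to (not including) the first underscore;
    # it always matches, possibly with an empty string.
    head = re.match(r"[^_]*", text).group(0)
    # Remove anything that is not an ASCII letter or digit, then uppercase.
    return re.sub(r"[^A-Za-z0-9]", "", head).upper()

for sample in ("AG.av08_binloop_v6", "TL.av1_binloopv2"):
    print(sample, "->", prefix_code(sample))
```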
