Replacing all numeric values with a formatted string - python

What I am trying to do is:
Find out all the numeric values in a string.
import re
input_string = "高露潔光感白輕悅薄荷牙膏100 79.80"
numbers = re.finditer(r'[-+]?[0-9]*\.?[0-9]+(?:[eE][-+]?[0-9]+)?',input_string)
for number in numbers:
    print("{} start > {}, end > {}".format(number.group(), number.start(0), number.end(0)))
'''Output'''
>>100 start > 12, end > 15
>>79.80 start > 18, end > 23
And then I want to replace every integer and float value with a certain format:
INT_(number of digits) and FLT_(number of decimal places)
e.g. 100 -> INT_3 // 79.80 -> FLT_2
Thus, the expected output string looks like this:
"高露潔光感白輕悅薄荷牙膏INT_3 FLT_2"
But the string replace method in Python is kind of weird and can't achieve what I want to do.
So I am trying to build the result by concatenating substrings:
string[:number.start(0)] + "INT_%s"%len(number.group()) +.....
which looks stupid and, most importantly, I still can't make it work.
Can anyone give me some advice on this problem?

Use re.sub with a callback function, inside which you can perform various manipulations on the match:
import re
def repl(match):
    chunks = match.group(1).split(".")
    if len(chunks) == 2:
        return "FLT_{}".format(len(chunks[1]))
    else:
        return "INT_{}".format(len(chunks[0]))
input_string = "高露潔光感白輕悅薄荷牙膏100 79.80"
result = re.sub(r'[-+]?([0-9]*\.?[0-9]+)(?:[eE][-+]?[0-9]+)?',repl,input_string)
print(result)
See the Python demo
Details:
The regex now has a capturing group over the number part (([0-9]*\.?[0-9]+)); its contents are analyzed inside the repl function.
Inside the repl function, the contents of Group 1 are split on . to see whether we have a float/double. If yes, we return the length of the fractional part; otherwise, the length of the integer number.

You need to group the parts of your regex, possibly like this:
import re
def repl(m):
    if m.group(1) is None:  # int
        return "INT_%i" % len(m.group(2))
    else:  # float
        return "FLT_%i" % len(m.group(2))
input_string = "高露潔光感白輕悅薄荷牙膏100 79.80"
numbers = re.sub(r'[-+]?([0-9]*\.)?([0-9]+)([eE][-+]?[0-9]+)?',repl,input_string)
print(numbers)
group 0 is the whole string that was matched (can be passed to float or int)
group 1 is any digits before the . plus the . itself if it exists, else it is None
group 2 is all digits after the . if it exists, else it is just all the digits
group 3 is the exponential part if it exists, else None
You can get a Python number from the match with
def parse(m):
    s = m.group(0)
    # if there is a dot or an exponential part it must be a float
    if m.group(1) is not None or m.group(3) is not None:
        return float(s)
    else:
        return int(s)
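To see what each group holds in practice, here is a small sketch that runs the pattern from this answer over a few sample strings. The name NUMBER and the sample inputs are chosen here purely for illustration; parse is the function shown above.

```python
import re

# Same pattern as in the answer, compiled once. NUMBER is an illustrative name.
NUMBER = re.compile(r'[-+]?([0-9]*\.)?([0-9]+)([eE][-+]?[0-9]+)?')

def parse(m):
    """Turn a match into an int or a float, depending on which groups took part."""
    s = m.group(0)
    if m.group(1) is not None or m.group(3) is not None:
        return float(s)
    return int(s)

for text in ("100", "79.80", "-2e3"):
    m = NUMBER.fullmatch(text)
    print(text, m.groups(), repr(parse(m)))
# 100 (None, '100', None) 100
# 79.80 ('79.', '80', None) 79.8
# -2e3 (None, '2', 'e3') -2000.0
```

Groups that did not take part in the match come back as None, which is what the repl and parse functions rely on.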

You are probably looking for something like the code below (of course there are other ways to do it). This one just starts with what you were doing and shows how it can be done.
import re
input_string = u"高露潔光感白輕悅薄荷牙膏100 79.80"
numbers = re.finditer(r'[-+]?[0-9]*\.?[0-9]+(?:[eE][-+]?[0-9]+)?',input_string)
s = input_string
for m in list(numbers)[::-1]:
    num = m.group(0)
    if '.' in num:
        s = "%sFLT_%s%s" % (s[:m.start(0)],str(len(num)-num.index('.')-1),s[m.end(0):])
    else:
        s = "%sINT_%s%s" % (s[:m.start(0)],str(len(num)), s[m.end(0):])
print(s)
This may look a bit complicated because there are really several simple problems to solve.
For instance, your initial regex finds both ints and floats, but you wish to apply totally different replacements afterward. This would be much more straightforward if you were doing only one thing at a time. But as parts of floats may look like an int, doing everything at once may not be such a bad idea; you just have to understand that this leads to a secondary check to tell the two cases apart.
Another, more fundamental issue is that you really can't replace anything inside a Python string. Python strings are immutable objects, hence you have to make a copy. This is fine anyway, because the format change may need insertion or removal of characters, and an in-place replacement wouldn't be efficient.
The last difficulty to take into account is that the replacements must be made backward: if you change the beginning of the string, the positions of the later matches shift as well, and the next replacement would no longer land in the right place. If we do it backward, all is fine.
Of course I agree that using re.sub() is much simpler.
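As an alternative to iterating backward, the shifting-offset problem can also be avoided by never modifying the string at all and instead collecting the untouched pieces and the replacements in a list. The sketch below is only one way to do that; the helper names label and replace_numbers are made up for this example, and the labeling rule is the same as in the code above (so a leading sign would be counted as a character).

```python
import re

NUMBER = re.compile(r'[-+]?[0-9]*\.?[0-9]+(?:[eE][-+]?[0-9]+)?')

def label(num):
    """Return FLT_<decimals> for a number containing a dot, else INT_<length>."""
    if '.' in num:
        return "FLT_%d" % (len(num) - num.index('.') - 1)
    return "INT_%d" % len(num)

def replace_numbers(text):
    """Build the result left to right; offsets always refer to the original text."""
    pieces = []
    last = 0
    for m in NUMBER.finditer(text):
        pieces.append(text[last:m.start()])  # unchanged text before this match
        pieces.append(label(m.group(0)))     # replacement for the match
        last = m.end()
    pieces.append(text[last:])               # unchanged tail after the last match
    return ''.join(pieces)

print(replace_numbers("高露潔光感白輕悅薄荷牙膏100 79.80"))
# 高露潔光感白輕悅薄荷牙膏INT_3 FLT_2
```

Because every slice is taken from the original string, the order of processing no longer matters.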

Related

Can't wrap my head around how to remove a list of characters from another list

I've been able to isolate the list (or string) of characters I want excluded from a user entered string. But I don't see how to then remove all these unwanted characters. After I do this, I think I can try joining the user string so it all becomes one alphabet input like the instructions say.
Instructions:
Remove all non-alpha characters
Write a program that removes all non-alpha characters from the given input.
For example, if the input is:
-Hello, 1 world$!
the output should be:
Helloworld
My code:
userEntered = input()
makeList = userEntered.split()
def split(userEntered):
    return list(userEntered)
if userEntered.isalnum() == False:
    for i in userEntered:
        if i.isalpha() == False:
            #answer = userEntered[slice(userEntered.index(i))]
            reference = split(userEntered)
            excludeThis = i
            print(excludeThis)
When I print excludeThis, I get this as my output:
-
,
1
$
!
So I think I might be on the right track. I need to figure out how to get these characters out of the user input. Any help is appreciated.
Loop over the input string. If the character is alphabetic, add it to the result string.
userEntered = input()
result = ''
for char in userEntered:
    if char.isalpha():
        result += char
print(result)
This can also be done with a regular expression:
import re
userEntered = input()
result = re.sub(r'[^a-z]', '', userEntered, flags=re.I)
print(result)
The regexp [^a-z] matches any character that is not a letter from a to z. The re.I flag makes it case-insensitive, so uppercase letters are kept as well. Every matching character is replaced with an empty string, which removes it.
There are basically two main parts to this: distinguishing alpha from non-alpha characters, and getting a string with only the former. If isalpha() is satisfactory for the former, then that leaves the latter. My understanding is that the solution considered most Pythonic is to join a comprehension. It would look like this:
''.join(char for char in userEntered if char.isalpha())
BTW, there are several places in the code where you are making it more complicated than it needs to be. In Python, you can iterate over strings, so there's no need to convert userEntered to a list. isalnum() checks whether the string is all alphanumeric, so it's rather irrelevant (alphanumeric includes digits). You shouldn't ever compare a boolean to True or False, just use the boolean. So, for instance, if i.isalpha() == False: can be simplified to just if not i.isalpha():.
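Putting the advice above together, a minimal sketch might look like the following. The function names remove_non_alpha and non_alpha_chars are invented for this example, and the input is hard-coded instead of read with input() so the snippet can be run as is.

```python
def remove_non_alpha(text):
    """Keep only the characters for which str.isalpha() is true."""
    return ''.join(char for char in text if char.isalpha())

def non_alpha_chars(text):
    """List the characters that would be dropped (note: spaces are included)."""
    return [char for char in text if not char.isalpha()]

sample = "-Hello, 1 world$!"
print(remove_non_alpha(sample))  # Helloworld
print(non_alpha_chars(sample))   # ['-', ',', ' ', '1', ' ', '$', '!']
```

The second function mirrors what the original loop printed; it shows that the spaces are excluded too, even though they were hard to see in the printed output.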

How do I check isalpha() for only the first 5 characters of a string?

I want to validate a PAN card whose first 5 characters are alphabets, the next 4 are numbers and the last character is an alphabet again. I can't use isalnum() because I want to check this specific order too, not just verify whether it contains both numbers and letters.
Here is a snippet of my code:
def validate_PAN(pan):
    for i in pan:
        pan.isalpha(pan[0:4])==True:
            return 1
        pan.isdigit(pan[5:9])==True:
            return 1
        pan.isalpha(pan[9])==True:
            return 1
        else:
            return 0
This obviously returns an error since it is wrong. How can I fix this?
Just do string slicing and check
s[:5].isalpha()
pan[0:4] - This checks only the first 4 characters, not 5.
s[m:n] - This slices the string from index m up to, but not including, index n.
Mistake in your code
pan.isalpha(pan[0:4])==True
This gives you the error because isalpha() doesn't accept any arguments and you aren't using if before it.
You must use - if pan[:5].isalpha():
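Applying the slicing idea to all three parts of the PAN gives something like the sketch below. It assumes a PAN is exactly 10 characters long (5 letters, 4 digits, 1 letter, as described in the question); the function name validate_pan is just a lowercase variant of the one in the question. Note that isalpha() and isdigit() also accept non-ASCII letters and digits, so this is looser than a check for A-Z and 0-9.

```python
def validate_pan(pan):
    """Return True if pan is 5 letters, then 4 digits, then 1 letter."""
    return (
        len(pan) == 10            # also guards the pan[9] lookup below
        and pan[:5].isalpha()     # characters 0-4
        and pan[5:9].isdigit()    # characters 5-8
        and pan[9].isalpha()      # character 9
    )

print(validate_pan('ABCDE1111E'))  # True
print(validate_pan('ABC1111DEF'))  # False
```

No loop over the characters is needed, because each string method already checks every character of the slice it is called on.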
You can use a regular expression for simplicity's sake
import re
PAN_1 = 'ABCDE1111E'
PAN_2 = 'ABC1111DEF'
def is_valid_PAN(PAN_number):
    # fullmatch rejects trailing characters, which match alone would let through
    return bool(re.fullmatch(r'[a-z]{5}\d{4}[a-z]', PAN_number, re.IGNORECASE))
print(is_valid_PAN(PAN_1)) #True
print(is_valid_PAN(PAN_2)) #False
Regular expressions are a good fit for this.
import re
# Pattern for matching a PAN number
pattern = r'\b[A-Z]{5}[0-9]{4}[A-Z]\b'
# compile the pattern for better performance with repetitive matches
pobject = re.compile(pattern)
pan_number = "AXXMP1234Z"
result = pobject.match(pan_number)
if result:
    print("Matched for PAN: ", result.group(0))
else:
    print("Failed")

Change string for defined pattern (Python)

Learning Python, I came across a demanding beginner's exercise.
Let's say you have a string made up of "blocks" of characters separated by ';'. An example would be:
cdk;2(c)3(i)s;c
And you have to return a new string based on the old one but in accordance with a certain pattern (which is also a string), for example:
c?*
This pattern means that each block must start with a 'c', the '?' character must be replaced by some other letter and finally '*' by an arbitrary number of letters.
So when the pattern is applied you return something like:
cdk;cciiis
Another example:
string: 2(a)bxaxb;ab
pattern: a?*b
result: aabxaxb
My very crude attempt resulted in this:
def switch(string,pattern):
    d = []
    for v in range(0,string):
        r = float("inf")
        for m in range (0,pattern):
            if pattern[m] == string[v]:
                d.append(pattern[m])
            elif string[m]==';':
                d.append(pattern[m])
            elif (pattern[m]=='?' & Character.isLetter(string.charAt(v))):
                d.append(pattern[m])
    return d
Tips?
To split a string you can use split() function.
For pattern detection in strings you can use regular expressions (regex) with the re library.
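As a rough sketch of how those two tips could be combined: the code below reads the exercise as "expand every n(x) group, then keep only the blocks that match the pattern", which is an interpretation inferred from the two examples in the question, not a confirmed specification. The helper names expand_block and pattern_to_regex are made up here, and '?' and '*' are assumed to stand for ASCII letters only.

```python
import re

def expand_block(block):
    """Expand counted groups, e.g. '2(c)3(i)s' -> 'cciiis'."""
    return re.sub(r'(\d+)\(([^)]*)\)',
                  lambda m: m.group(2) * int(m.group(1)),
                  block)

def pattern_to_regex(pattern):
    """Translate the exercise's pattern: '?' = one letter, '*' = any number of letters."""
    parts = []
    for ch in pattern:
        if ch == '?':
            parts.append('[A-Za-z]')
        elif ch == '*':
            parts.append('[A-Za-z]*')
        else:
            parts.append(re.escape(ch))  # every other character matches itself
    return re.compile(''.join(parts))

def switch(string, pattern):
    """Expand each ';'-separated block and keep those that match the whole pattern."""
    regex = pattern_to_regex(pattern)
    expanded = (expand_block(block) for block in string.split(';'))
    return ';'.join(block for block in expanded if regex.fullmatch(block))

print(switch('cdk;2(c)3(i)s;c', 'c?*'))   # cdk;cciiis
print(switch('2(a)bxaxb;ab', 'a?*b'))     # aabxaxb
```

Under this reading, the block 'c' is dropped in the first example because '?' requires one more letter, and 'ab' is dropped in the second for the same reason.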

Python: Expanding a string of variables with integers

I'm still new to Python and learning the more basic things in programming.
Right now I'm trying to create a function that will duplicate each letter according to the number that follows it.
Example:
expand('d3f4e2')
>dddffffee
I'm not sure how to write the function for this.
As I understand it, you want to repeat each letter by the number beside it.
The key to any solution is splitting things into pairs of strings to be repeated, and repeat counts, and then iterating those pairs in lock-step.
If you only need single-character strings and single-digit repeat counts, this is just breaking the string up into 2-character pairs, which you can do with mshsayem's answer, or with slicing (s[::2] is the strings, s[1::2] is the counts).
But what if you want to generalize this to multi-letter strings and multi-digit counts?
Well, somehow we need to group the string into runs of digits and non-digits. If we could do that, we could use pairs of those groups in exactly the same way mshsayem's answer uses pairs of characters.
And it turns out that we can do this very easily. There's a nifty function in the standard library called groupby that lets you group anything into runs according to any function. And there's a function isdigit that distinguishes digits and non-digits.
So, this gets us the runs we want:
>>> import itertools
>>> s = 'd13fx4e2'
>>> [''.join(group) for (key, group) in itertools.groupby(s, str.isdigit)]
['d', '13', 'fx', '4', 'e', '2']
Now we zip this up the same way that mshsayem zipped up the characters:
>>> groups = (''.join(group) for (key, group) in itertools.groupby(s, str.isdigit))
>>> ''.join(c*int(d) for (c, d) in zip(groups, groups))
'dddddddddddddfxfxfxfxee'
So:
import itertools
def expand(s):
    groups = (''.join(group) for (key, group) in itertools.groupby(s, str.isdigit))
    return ''.join(c*int(d) for (c, d) in zip(groups, groups))
Naive approach (if the digits are only single, and characters are single too):
>>> def expand(s):
...     s = iter(s)
...     return "".join(c*int(d) for (c,d) in zip(s,s))
...
>>> expand("d3s5")
'dddsssss'
Poor explanation:
Terms/functions:
iter() gives you an iterator object.
zip() makes tuples from iterables.
int() parses an integer from a string.
<expression> for <variable> in <iterable> is a generator expression (the same syntax as a list comprehension, without the brackets).
<string>.join joins an iterable of strings, using <string> as the separator.
Process:
First we make an iterator over the given string.
zip() is used to make tuples of a character and its repeat count, e.g. ('d','3'), ('s','5'). (zip() pulls from the iterables to make the tuples. Note that for each tuple it pulls from the same iterable twice and, because our iterable is an iterator, that means it advances twice.)
Now for ... in iterates over the tuples. Using two variables (c,d) unpacks each tuple into them.
But d is still a string; int makes it an integer.
<string> * <integer> repeats the string that many times.
Finally join returns the result.
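The point about zip() advancing the same iterator twice is the heart of the trick, so here is a short sketch that contrasts it with passing the string itself twice. The variable names are chosen for this example only.

```python
text = "d3s5"

# Passing the SAME iterator twice: each tuple consumes two consecutive characters.
it = iter(text)
pairs = list(zip(it, it))
print(pairs)     # [('d', '3'), ('s', '5')]

# Passing the string twice: zip creates two independent iterators,
# so every character is simply paired with itself.
same = list(zip(text, text))
print(same)      # [('d', 'd'), ('3', '3'), ('s', 's'), ('5', '5')]

# With the pairs in hand, the expansion is a repeat-and-join.
expanded = "".join(c * int(d) for c, d in pairs)
print(expanded)  # dddsssss
```

If the input has an odd number of characters, zip() stops at the last complete pair and the leftover character is silently ignored.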
Here is a multi-digit, multi-char version:
import re
def expand(s):
    # raw string, so that \d reaches the regex engine unchanged
    s = re.findall(r'([^0-9]+)(\d+)',s)
    return "".join(c*int(d) for (c,d) in s)
By the way, using itertools.groupby is better, as shown by abarnert.
Let's look at how you could do this manually, using only tools that a novice will understand. It's better to actually learn about zip and iterators and comprehensions and so on, but it may also help to see the clunky and verbose way you write the same thing.
So, let's start with just single characters and single digits:
def expand(s):
    result = ''
    repeated_char_next = True
    for char in s:
        if repeated_char_next:
            char_to_repeat = char
            repeated_char_next = False
        else:
            repeat_count = int(char)
            result += char_to_repeat * repeat_count
            repeated_char_next = True
    return result
This is a very simple state machine. There are two states: either the next character is a character to be repeated, or it's a digit that gives a repeat count. After reading the former, we don't have anything to add yet (we know the character, but not how many times to repeat it), so all we do is switch states. After reading the latter, we now know what to add (since we know both the character and the repeat count), so we do that, and also switch states. That's all there is to it.
Now, to expand it to multi-char repeat strings and multi-digit repeat counts:
def expand(s):
    result = ''
    current_repeat_string = ''
    current_repeat_count = ''
    for char in s:
        if char.isdigit():
            current_repeat_count += char
        else:
            if current_repeat_count:
                # We've just switched from a digit back to a non-digit
                count = int(current_repeat_count)
                result += current_repeat_string * count
                current_repeat_count = ''
                current_repeat_string = ''
            current_repeat_string += char
    # Flush the last string/count pair, which no later non-digit triggers
    if current_repeat_count:
        result += current_repeat_string * int(current_repeat_count)
    return result
The state here is pretty similar—we're either in the middle of reading non-digits, or in the middle of reading digits. But we don't automatically switch states after each character; we only do it when getting a digit after non-digits, or vice-versa. Plus, we have to keep track of all the characters in the current repeat string and in the current repeat count. I've collapsed the state flag into that repeat string, but there's nothing else tricky here.
There is more than one way to do this, but assuming that the sequence of characters in your input is always the same, eg: a single character followed by a number, the following would work
def expand(text):
    alphatest = False
    finalexpanded = ""  # Blank string variable to hold final output
    # first part is used for iterating through range of size i
    # this solution assumes you have a numeric character coming after your
    # alphabetic character every time
    for i in text:
        if alphatest:
            i = int(i)  # converts the string number to an integer
            for value in range(0, i):  # loops through range of size i
                finalexpanded += alphatemp  # adds your alphabetic character to string
            alphatest = False  # once the loop is finished, resets alphatest to False
            i = str(i)  # converts i back to string to avoid error from i.isalpha() test
        if i.isalpha():  # tests i to see if it is an alphabetic character
            alphatemp = i  # sets alphatemp to i for loop above
            alphatest = True  # sets alphatest True for loop above
    print(finalexpanded)  # prints the final result

python: regular expressions, how to match a string of undefined length which has a structure and finishes with a specific group

I need to create a regexp to match strings like this 999-123-222-...-22
The string can end with &Ns=(any number) or without it. So valid strings for me are:
999-123-222-...-22
999-123-222-...-22&Ns=12
And following are not valid:
999-123-222-...-22&N=1
I have been testing this for several hours already but did not manage to solve it; I really need some help.
Not sure if you want to literally match 999-123-222-...-22 or if that can be any sequence of numbers/dashes. Here are two different regexes:
/^[\d-]+(&Ns=\d+)?$/
/^999-123-222-\.\.\.-22(&Ns=\d+)?$/
The key idea is the (&Ns=\d+)?$ part, which matches an optional &Ns=<digits>, and is anchored to the end of the string with $.
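For reference, this is roughly how the first regex could be tried out in Python. The sample strings are made up for this sketch and assume that the "..." in the question stands for further groups of digits. Keep in mind that [\d-]+ is permissive (it would also accept a string of only dashes) and that in Python $ also matches just before a trailing newline.

```python
import re

# First regex from above, written as a Python raw string.
PATTERN = re.compile(r'^[\d-]+(&Ns=\d+)?$')

def is_valid(s):
    """True if s is digits/dashes, optionally followed by &Ns=<digits>."""
    return PATTERN.match(s) is not None

for sample in ("999-123-222-22",
               "999-123-222-22&Ns=12",
               "999-123-222-22&N=1"):
    print(sample, is_valid(sample))
# 999-123-222-22 True
# 999-123-222-22&Ns=12 True
# 999-123-222-22&N=1 False
```

The last sample fails because "&N=1" is not matched by the optional group, so the $ anchor cannot be satisfied.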
If you just want to allow the strings 999-123-222-...-22 and 999-123-222-...-22&Ns=12, you had better use a string function.
If you want to allow any numbers between - you can use the regex:
^(\d+-){3}[.]{3}-\d+(&Ns=\d+)?$
If the numbers must be of only 3 digits and the last number of only 2 digits you can use:
^(\d{3}-){3}[.]{3}-\d{2}(&Ns=\d{2})?$
This looks like a phone number with extension information.
Why not make things simpler for yourself (and anyone who has to read this later) and split the input rather than use a complicated regex?
s = '999-123-222-...-22&Ns=12'
parts = s.split('&Ns=') # splits on '&Ns=' and removes it
If the piece before the "&" is a phone number, you could do another split and get the area code etc into separate fields, like so:
phone_parts = parts[0].split('-') # breaks up the digit string and removes the '-'
area_code = phone_parts[0]
The portion found after the optional '&Ns=' can be checked to see if it is numeric with the string method isdigit, which returns True if all characters in the string are digits and there is at least one character, and False otherwise.
if len(parts) > 1:
extra_digits_ok = parts[1].isdigit()
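Pulled together into one function, the split-based check might look like the sketch below. It assumes that the part before '&Ns=' must consist of dash-separated groups of digits (reading the "..." in the question as "more groups like these") and that '&Ns=' may appear at most once; the function name check is invented for this example.

```python
def check(s):
    """Validate '<digits>-<digits>-...' optionally followed by '&Ns=<digits>'."""
    parts = s.split('&Ns=')
    if len(parts) > 2:
        return False  # '&Ns=' appeared more than once
    # Every dash-separated piece before '&Ns=' must be a non-empty run of digits.
    head_ok = all(piece.isdigit() for piece in parts[0].split('-'))
    # The suffix is optional, but if present it must be numeric.
    tail_ok = len(parts) == 1 or parts[1].isdigit()
    return head_ok and tail_ok

print(check('999-123-222-22'))         # True
print(check('999-123-222-22&Ns=12'))   # True
print(check('999-123-222-22&N=1'))     # False
```

In the last case nothing is split off, so '22&N=1' ends up as the final dash-separated piece and fails the isdigit() test.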
