python string manipulation and processing

I have a number of codes which I need to process, and these come through in a number of different formats which I need to manipulate first to get them in the right format:
Examples of codes:
ABC1.12 - correct format
ABC 1.22 - space between letters and numbers
ABC1.12/13 - 2 codes joined together and leading 1. missing from 13, should be ABC1.12 and ABC1.13
ABC 1.12 / 1.13 - codes joined together and spaces
I know how to remove the spaces but am not sure how to handle the codes which have been split. I know I can use the split function to create 2 codes but not sure how I can then append the letters (and first number part) to the second code. This is the 3rd and 4th example in the list above.
WHAT I HAVE SO FAR
val = # code
retList = [val]
if "/" in val:
    (code1, code2) = session_codes = val.split("/", 1)
    (initial_letters, numbers) = code1.split(".", 1)
    if initial_letters not in code2:
        code2 = initial_letters + '.' + code2
    # reset list so that it returns both values
    retList = [code1, code2]
This won't really handle the splits for 4 as the code2 becomes ABC1.1.13

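You can use a regex for this purpose.
A possible implementation would be as follows:
>>> import re
>>> def foo(st):
...     # drop spaces, then split joined codes on "/"
...     parts = st.replace(' ', '').split("/")
...     # separate the leading letters from the rest of the first code
...     parts = list(re.findall("^([A-Za-z]+)(.*)$", parts[0])[0]) + parts[1:]
...     parts = parts[0:1] + [x.split('.') for x in parts[1:]]
...     # reuse the first code's leading number when a later code lacks one
...     parts = parts[0:1] + ['.'.join(x) if len(x) > 1 else '.'.join([parts[1][0], x[0]]) for x in parts[1:]]
...     return [parts[0] + p for p in parts[1:]]
...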
>>> foo('ABC1.12')
['ABC1.12']
>>> foo('ABC 1.22')
['ABC1.22']
>>> foo('ABC1.12/13')
['ABC1.12', 'ABC1.13']
>>> foo('ABC 1.12 / 1.13')
['ABC1.12', 'ABC1.13']
>>>

Are you familiar with regex? That would be an angle worth exploring here. Also, consider splitting on the space character, not just the slash and decimal.

I suggest you write a regular expression for each code pattern and then form a larger regular expression which is the union of the individual ones.
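
The regex suggestions above can be sketched as a single pattern that captures the letters, the first number and any "/"-joined extras. This is only an illustration, not a definitive implementation: the name expand_codes and the assumption that every code is letters followed by major.minor digits are mine, not part of the question.

```python
import re

# Assumed shape: LETTERS, MAJOR.MINOR, then zero or more "/MINOR" or "/MAJOR.MINOR"
CODE_RE = re.compile(r"^([A-Za-z]+)(\d+)\.(\d+)((?:/(?:\d+\.)?\d+)*)$")


def expand_codes(raw):
    """Return the list of normalised codes contained in one raw code string."""
    compact = re.sub(r"\s+", "", raw)  # remove every space first
    match = CODE_RE.match(compact)
    if match is None:
        raise ValueError("unrecognised code: %r" % raw)
    letters, major, minor, rest = match.groups()
    codes = ["%s%s.%s" % (letters, major, minor)]
    # rest looks like "/13" or "/1.13"; the first split item is always empty
    for part in rest.split("/")[1:]:
        if "." in part:
            codes.append(letters + part)  # already has its own major number
        else:
            codes.append("%s%s.%s" % (letters, major, part))  # borrow the major number
    return codes


for sample in ["ABC1.12", "ABC 1.22", "ABC1.12/13", "ABC 1.12 / 1.13"]:
    print(sample, "->", expand_codes(sample))
```

Anything that does not fit the assumed shape raises ValueError, which makes unexpected formats visible instead of silently producing a wrong code.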

Using PyParsing
The answer by @Abhijit is a good one, and for this simple problem a regex may be the way to go. However, when dealing with parsing problems, you'll often need a more extensible solution that can grow with your problem. I've found that pyparsing is great for that: you write the grammar and it does the parsing:
from pyparsing import *
index = Combine(Word(alphas))
# Define what a number is and convert it to a float
number = Combine(Word(nums)+Optional('.'+Optional(Word(nums))))
number.setParseAction(lambda x: float(x[0]))
# What do extra numbers look like?
marker = Word('/').suppress()
extra_numbers = marker + number
# Define what a possible line could be
line_code = Group(index + number + ZeroOrMore(extra_numbers))
grammar = OneOrMore(line_code)
From this definition we can parse the string:
S = '''ABC1.12
ABC 1.22
XXX1.12/13/77/32.
XYZ 1.12 / 1.13
'''
print(grammar.parseString(S))
Giving:
[['ABC', 1.12], ['ABC', 1.22], ['XXX', 1.12, 13.0, 77.0, 32.0], ['XYZ', 1.12, 1.13]]
Advantages:
The numbers are now in the correct format, as we've cast them to floats during parsing. Many more "numbers" are handled: look at the index "XXX", where numbers of the form 1.12, 13 and 32. are all parsed, regardless of the decimal point.

Take a look at this method. It might be the simplest way to do it.
val = unicode(raw_input())
# find the index of the first numeric character
for aChar in val:
    if aChar.isnumeric():
        lastIndex = val.index(aChar)
        break
part1 = val[:lastIndex].strip()  # the letters
part2 = val[lastIndex:]          # the numeric part
if "/" not in part2:
    print part1+part2
else:
    if " " not in part2:
        # e.g. 1.12/13: reuse the number before the dot for every code
        codes = []
        divPart2 = part2.split(".")
        partCodes = divPart2[1].split("/")
        for aPart in partCodes:
            codes.append(part1+divPart2[0]+"."+aPart)
        print codes
    else:
        # e.g. 1.12 / 1.13: each part is already a full number
        codes = []
        divPart2 = part2.split("/")
        for aPart in divPart2:
            aPart = aPart.strip()
            codes.append(part1+aPart)
        print codes

Related

How to find a String that contains Specific other String?

I have a problem with my homework that is confusing me. This is the problem:
The input is a string that is a linear equation like "A + B = C",
but for some reason one of A, B or C is not clear to us and we can't read it properly.
For example:
"1# + 24 = 34" or "5131 + #251 = 76382"
Note that this can happen to any one part of the equation (A, B or C), and '#' can stand for more than one digit!
(If the input is "10# + 50 = 10052", the output should be "10002 + 50 = 10052".)
So here is the question: how can I highlight or select the part of this string that contains '#'?
I tried searching on RegExr and couldn't find a pattern that matches my situation.

This retrieves the part of string that contains #:
import re
textExample = "5131 + #251 = 76382"
re.findall(r'[^ ]*#[^ ]*',textExample)

In case the expression does not always separate operators and numbers with spaces, you should search for a preceding or subsequent digit around the pound sign:
import re
equation = "5131 + #251 = 76382"
r = re.findall(r"((?<=\d)#|#(?=\d))",equation)
If you only intend to replace the pound sign with some digits, you don't need to find/highlight it. Simply use the built-in string replace method:
equality = equation.replace("#","71") #==> '5131 + 71251 = 76382'
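
Since the homework ultimately wants the completed equation, one possible follow-up is to locate the operand containing '#' and recompute it from the other two. The sketch below is an illustration under assumptions of my own: the helper name fill_equation is hypothetical, the input always has the form "A + B = C", at most one operand contains '#', and no check is made that the computed value agrees with the digits that were still visible.

```python
import re

# Assumed input shape: "A + B = C", spaces optional, operands contain no spaces
EQUATION_RE = re.compile(r"\s*(\S+?)\s*\+\s*(\S+?)\s*=\s*(\S+)\s*")


def fill_equation(equation):
    """Return the equation with the operand containing '#' recomputed."""
    match = EQUATION_RE.fullmatch(equation)
    if match is None:
        raise ValueError("not of the form 'A + B = C': %r" % equation)
    a, b, c = match.groups()
    if "#" in a:
        a = str(int(c) - int(b))
    elif "#" in b:
        b = str(int(c) - int(a))
    elif "#" in c:
        c = str(int(a) + int(b))
    return "%s + %s = %s" % (a, b, c)


print(fill_equation("10# + 50 = 10052"))
print(fill_equation("5131 + #251 = 76382"))
```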

How to extract a part of a string

I have this string:
-1007.88670550662*p**(-1.0) + 67293.8347365694*p**(-0.416543501823503)
but actually I have a lot of strings like this:
a*p**(-1.0) + b*p**(c)
where a, b and c are doubles. I would like to extract a, b and c from this string. How can I do this using Python?

import re
s = '-1007.88670550662*p**(-1.0) + 67293.8347365694*p**(-0.416543501823503)'
pattern = r'-?\d+\.\d*'
a,_,b,c = re.findall(pattern,s)
print(a, b, c)
Output
('-1007.88670550662', '67293.8347365694', '-0.416543501823503')
Here s is your test string and pattern is the regex pattern. We are looking for floats, and once findall() has found them we assign them to a, b and c (the second match, the -1.0 exponent, is discarded into _).
Note that this method works only if your string has the format you've given; otherwise you can adjust the pattern to match what you want.
Edit: as most people stated in the comments, if you need to allow a + in front of your positive numbers you can use the pattern r'[-+]?\d+\.\d*'.

Using the regular expression
(-?\d+\.?\d*)\*p\*\*\(-1\.0\)\s*\+\s*(-?\d+\.?\d*)\*p\*\*\((-?\d+\.?\d*)\)
We can do
import re
pat = r'(-?\d+\.?\d*)\*p\*\*\(-1\.0\)\s*\+\s*(-?\d+\.?\d*)\*p\*\*\((-?\d+\.?\d*)\)'
regex = re.compile(pat)
print(regex.findall('-1007.88670550662*p**(-1.0) + 67293.8347365694*p**(-0.416543501823503)'))
will print [('-1007.88670550662', '67293.8347365694', '-0.416543501823503')]

If your formats are consistent, and you don't want to deep dive into regex (check out regex101 for this, btw) you could just split your way through it.
Here's a start:
>>> s= "-1007.88670550662*p**(-1.0) + 67293.8347365694*p**(-0.416543501823503)"
>>> a, buf, c = s.split("*p**")
>>> b = buf.split()[-1]
>>> a,b,c
('-1007.88670550662', '67293.8347365694', '(-0.416543501823503)')
>>> [float(x.strip("()")) for x in (a,b,c)]
[-1007.88670550662, 67293.8347365694, -0.416543501823503]

The re module can certainly be made to work for this, although as some of the comments on the other answers have pointed out, the corner cases can be interesting -- decimal points, plus and minus signs, etc. It could be even more interesting; e.g. can one of your numbers be imaginary?
Anyway, if your string is always a valid Python expression, you can use Python's built-in tools to process it. Here is a good generic explanation about the ast module's NodeVisitor class. To use it for your example is quite simple:
import ast
x = "-1007.88670550662*p**(-1.0) + 67293.8347365694*p**(-0.416543501823503)"
def getnums(s):
    result = []
    class GetNums(ast.NodeVisitor):
        def visit_Num(self, node):
            result.append(node.n)
        def visit_UnaryOp(self, node):
            if (isinstance(node.op, ast.USub) and
                    isinstance(node.operand, ast.Num)):
                result.append(-node.operand.n)
            else:
                ast.NodeVisitor.generic_visit(self, node)
    GetNums().visit(ast.parse(s))
    return result
print(getnums(x))
This will return a list with all the numbers in your expression:
[-1007.88670550662, -1.0, 67293.8347365694, -0.416543501823503]
The visit_UnaryOp method is only required for Python 3.x.
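
On Python 3.8 and later, ast.Num and visit_Num are deprecated in favour of ast.Constant, and the old names may be missing altogether in the newest releases. A minimal sketch of the same idea with ast.Constant follows; the names get_numbers and NumberCollector are my own, and the bool check is an added assumption so that True and False are not reported as numbers.

```python
import ast


def _is_number(value):
    # bool is a subclass of int, so exclude it explicitly
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def get_numbers(expr):
    """Collect the numeric literals of a Python expression, folding unary minus."""
    result = []

    class NumberCollector(ast.NodeVisitor):
        def visit_Constant(self, node):
            if _is_number(node.value):
                result.append(node.value)

        def visit_UnaryOp(self, node):
            operand = node.operand
            if (isinstance(node.op, ast.USub)
                    and isinstance(operand, ast.Constant)
                    and _is_number(operand.value)):
                result.append(-operand.value)  # "-1.0" is USub applied to 1.0
            else:
                self.generic_visit(node)

    NumberCollector().visit(ast.parse(expr, mode="eval"))
    return result


x = "-1007.88670550662*p**(-1.0) + 67293.8347365694*p**(-0.416543501823503)"
print(get_numbers(x))
```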

You can use something like this, where subject is the input string:
import re
a,_,b,c = re.findall(r"[\d\-.]+", subject)
print(a,b,c)

While I prefer MooingRawr's answer as it is simple, I would extend it a bit to cover more situations.
A floating point number can be converted to string with surprising variety of formats:
Exponential format (eg. 2.0e+07)
Without leading digit (eg. .5, which is equal to 0.5)
Without trailing digit (eg. 5., which is equal to 5)
Positive numbers with plus sign (eg. +5, which is equal to 5)
Numbers without decimal part (integers) (eg. 0 or 5)
Script
import re
test_values = [
    '-1007.88670550662*p**(-1.0) + 67293.8347365694*p**(-0.416543501823503)',
    '-2.000e+07*p**(-1.0) + 1.23e+07*p**(-5e+07)',
    '+2.*p**(-1.0) + -1.*p**(5)',
    '0*p**(-1.0) + .123*p**(7.89)'
]
pattern = r'([-+]?\.?\d+\.?\d*(?:[eE][-+]?\d+)?)'
for value in test_values:
    print("Test with '%s':" % value)
    matches = re.findall(pattern, value)
    del matches[1]  # drop the fixed -1.0 exponent
    print(matches, end='\n\n')
Output:
Test with '-1007.88670550662*p**(-1.0) + 67293.8347365694*p**(-0.416543501823503)':
['-1007.88670550662', '67293.8347365694', '-0.416543501823503']
Test with '-2.000e+07*p**(-1.0) + 1.23e+07*p**(-5e+07)':
['-2.000e+07', '1.23e+07', '-5e+07']
Test with '+2.*p**(-1.0) + -1.*p**(5)':
['+2.', '-1.', '5']
Test with '0*p**(-1.0) + .123*p**(7.89)':
['0', '.123', '7.89']

Another alternating-case in-a-string in Python 3.+

I'm very new to Python and am trying to understand how to manipulate strings.
What I want to do is change a string by removing the spaces and alternating the case from upper to lower, IE "This is harder than I thought it would be" to "ThIsIsHaRdErThAnItHoUgHtItWoUlDbE"
I've cobbled together some code to remove the spaces (heavily borrowed from here):
string1 = input("Ask user for something.")
nospace = ""
for a in string1:
    if a == " ":
        pass
    else:
        nospace=nospace+a
... but just can't get my head around the caps/lower case part. There are several similar issues on this site and I've tried amending a few of them, with no joy. I realise I need to define a range and iterate through it, but that's where I draw a blank.
for c in nospace[::]:
    d = ""
    c = nospace[:1].lower()
    d = d + c
    c = nospace[:1].upper
    print d
All I am getting is a column of V's. I'm obviously getting this very wrong. Please can someone advise where? Thanks in advance.

Here is a cutesie way to do this:
>>> s = "This is harder than I thought it would be"
>>> from itertools import cycle
>>> funcs = cycle([str.upper, str.lower])
>>> ''.join(next(funcs)(c) for c in s if c != ' ')
'ThIsIsHaRdErThAnItHoUgHtItWoUlDbE'
>>>
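Or, as suggested by Moses in the comments, you can use str.isspace, which takes care of all whitespace, not just a single space ' ' (the cycle is re-created first so that it starts with str.upper again):
>>> funcs = cycle([str.upper, str.lower])
>>> ''.join(next(funcs)(c) for c in s if not c.isspace())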
'ThIsIsHaRdErThAnItHoUgHtItWoUlDbE'
This approach makes only a single pass over the string, although a two-pass method is likely performant enough.
Now, if you were starting with a nospace string, the best way is to convert to some mutable type (e.g. a list) and use slice-assignment notation. It's a little bit inefficient because it builds intermediate data structures, but slicing is fast in Python, so it may be quite performant. You have to ''.join at the end, to bring it back to a string:
>>> nospace
'ThisisharderthanIthoughtitwouldbe'
>>> nospace = list(nospace)
>>> nospace[0::2] = map(str.upper, nospace[0::2])
>>> nospace[1::2] = map(str.lower, nospace[1::2])
>>> ''.join(nospace)
'ThIsIsHaRdErThAnItHoUgHtItWoUlDbE'
>>>

You're trying to do everything at once. Don't. Break your program into steps:
Read the string.
Remove the spaces from the string (as @A.Sherif just demonstrated here).
Go over the string character by character. If the character is in an odd position (counting from 1), convert it to uppercase. Otherwise, convert it to lowercase.
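
The steps above can be sketched as one small function that returns the result instead of printing it. The name alternate_case is hypothetical, and the sketch assumes that "spaces" means any whitespace.

```python
def alternate_case(text):
    """Remove whitespace, then alternate upper/lower case starting with upper."""
    letters = "".join(text.split())  # step 2: drop all whitespace
    # step 3: index 0 is the first (odd, 1-based) position, so it is uppercased
    return "".join(
        ch.upper() if i % 2 == 0 else ch.lower()
        for i, ch in enumerate(letters)
    )


print(alternate_case("This is harder than I thought it would be"))
```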

Your second loop is where it breaks: nospace is never shortened, so c = nospace[:1] always grabs the first character of the string, and that is the only character that is ever printed. A solution would be as follows:
string1 = str(input("Ask user for something."))
nospace = ''.join(string1.split(' '))
for i in range(0, len(nospace)):
    if i % 2 == 0:
        print(nospace[i].upper(), end="")
    else:
        print(nospace[i].lower(), end="")
You could also replace the if/else statement with a ternary operator:
for i in range(0, len(nospace)):
    print(nospace[i].upper() if (i % 2 == 0) else nospace[i].lower(), end='')
A final way, using enumerate as suggested in the comments:
for i, c in enumerate(nospace):
    print(c.upper() if (i % 2 == 0) else c.lower(), end='')

Python - how to substitute a substring using regex with n occurrencies

I have a string with many occurrences of a single pattern, like
a = 'eresQQQutnohnQQQjkhjhnmQQQlkj'
and I have another string like
b = 'rerTTTytu'
I want to substitute the entire second string into the first, aligning 'TTT' with one 'QQQ' at a time, so in this case I want 3 different results:
'ererTTTytuohnQQQjkhjhnmQQQlkj'
'eresQQQutnrerTTTytujhnmQQQlkj'
'eresQQQutnohnQQQjkhjrerTTTytu'
I've tried using re.sub
re.sub('\w{3}QQQ\w{3}' ,b,a)
but I obtain only the first one, and I don't know how to get the other two solutions.

Edit: As you requested, the two characters on each side of 'QQQ' are now replaced as well.
I don't know if this is the most elegant or simplest solution for the problem, but it works:
import re
# Find all occurrences of ??QQQ?? in a - where ? is any non-whitespace character
matches = [x.start() for x in re.finditer(r'\S{2}QQQ\S{2}', a)]
# Replace each ??QQQ?? with b, one occurrence per result
results = [a[:idx] + re.sub(r'\S{2}QQQ\S{2}', b, a[idx:], 1) for idx in matches]
print(results)
Output
['errerTTTytunohnQQQjkhjhnmQQQlkj',
'eresQQQutnorerTTTytuhjhnmQQQlkj',
'eresQQQutnohnQQQjkhjhrerTTTytuj']
Since you didn't specify the output format, I just put it in a list.
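
A possible variation, shown here only as a sketch: each match object already carries its start and end offsets, so the replacement can be spliced in directly without running re.sub a second time. The helper name replace_each is hypothetical, and the replacement is inserted literally (backreferences such as \1 are not expanded).

```python
import re


def replace_each(pattern, replacement, text):
    """Return one copy of text per match, each with only that match replaced."""
    return [
        text[:m.start()] + replacement + text[m.end():]
        for m in re.finditer(pattern, text)
    ]


a = 'eresQQQutnohnQQQjkhjhnmQQQlkj'
b = 'rerTTTytu'
for result in replace_each(r'\S{2}QQQ\S{2}', b, a):
    print(result)
```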

Check if a string ends with a decimal in Python 2

I want to check whether a string ends with a decimal number, where the digits vary. After searching for a while, the closest solution I found was to put the values into a tuple and use that as the condition for endswith(). But is there a shorter way than listing every possible combination?
I tried hard-coding the end condition, but if there are new elements in the list it won't work for those. I also tried using a regex, but it returns other elements together with the decimal elements as well. Any help would be appreciated.
list1 = ["abcd 1.01", "zyx 22.98", "efgh 3.0", "qwe -70"]
for e in list1:
    if e.endswith('.0') or e.endswith('.98'):
        print 'pass'
Edit: Sorry, I should have specified that I do not want 'qwe -70' to be accepted; only those elements with a decimal point should be accepted.

I'd like to propose another solution: using regular expressions to search for an ending decimal.
You can define a regular expression for an ending decimal with the following regex [-+]?[0-9]*\.[0-9]+$.
The regex broken apart:
[-+]?: optional - or + symbol at the beginning
[0-9]*: zero or more digits
\.: required dot
[0-9]+: one or more digits
$: must be at the end of the line
Then we can test the regular expression to see if it matches any of the members in the list:
import re
regex = re.compile(r'[-+]?[0-9]*\.[0-9]+$')
list1 = ["abcd 1.01", "zyx 22.98", "efgh 3.0", "qwe -70", "test"]
for e in list1:
    if regex.search(e) is not None:
        print e + " passes"
    else:
        print e + " does not pass"
The output for the previous script is the following:
abcd 1.01 passes
zyx 22.98 passes
efgh 3.0 passes
qwe -70 does not pass
test does not pass

Your example data leaves many possibilities open:
Last character is a digit:
e[-1].isdigit()
Everything after the last space is a number:
try:
    float(e.rsplit(None, 1)[-1])
except ValueError:
    # no number
    pass
else:
    print "number"
Using regular expressions (re.search rather than re.match, because re.match only matches at the start of the string):
re.search('[.0-9]$', e)

# split each string on whitespace; the second item is the candidate number
suspects = [x.split() for x in list1]
# try to cast it to float; a ValueError means it is not a number
for x in suspects:
    try:
        float(x[1])
        print "{} - ends with float".format(" ".join(x))
    except ValueError:
        print "{} - does not end with float".format(" ".join(x))
Output:
abcd 1.01 - ends with float
zyx 22.98 - ends with float
efgh 3.0 - ends with float
qwe -70 - ends with float
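
As the output shows, float() also accepts "-70", which the question's edit wants rejected. One way to keep the float-based check and still require a decimal point is sketched below; the function name ends_with_decimal is my own, and the sketch assumes the number is the last whitespace-separated token.

```python
def ends_with_decimal(text):
    """True if the last whitespace-separated token is a number with a decimal point."""
    parts = text.rsplit(None, 1)
    if not parts:  # empty or whitespace-only string
        return False
    token = parts[-1]
    if "." not in token:  # rejects integers such as "-70"
        return False
    try:
        float(token)
    except ValueError:
        return False
    return True


list1 = ["abcd 1.01", "zyx 22.98", "efgh 3.0", "qwe -70"]
print([e for e in list1 if ends_with_decimal(e)])
```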

I think this will work for this case:
import re
regex = r"([0-9]+\.[0-9]+)"
list1 = ["abcd 1.01", "zyx 22.98", "efgh 3.0", "qwe -70"]
for e in list1:
    token = e.split(' ')[1]  # renamed from str to avoid shadowing the built-in
    if re.search(regex, token):
        print True #Code for yes condition
    else:
        print False #Code for no condition

As you correctly guessed, endswith() is not a good way to approach this, given that the number of combinations is basically infinite. The way to go is, as many suggested, a regular expression that requires the string to end with digits, a decimal point and one or more further digits. Besides that, keep the code simple and readable. The strip() is there in case an input string has an extra space at the end, which would otherwise complicate the regex unnecessarily.
You can see this in action at: https://eval.in/649155
import re
regex = r"[0-9]+\.[0-9]+$"
list1 = ["abcd 1.01", "zyx 22.98", "efgh 3.0", "qwe -70"]
for e in list1:
    if re.search(regex, e.strip()):
        print e, 'pass'

The following may help:
import re
reg = re.compile(r'^[a-z]+ \-?[0-9]+\.[0-9]+$')
if reg.match(the_string):  # the_string is the string to test
    pass  # do something...
else:
    pass  # do other...
