Splitting String in Python with Multiple Delineating Characters - python

I would like to split two different strings in Python into lists of the same length, 5. For example,
string1= '21007250.12000 -18047085.73200 1604.90200 59.10000 21007239.94800'
string2= '24784864.18300-318969464.50000 -1543.53600 34.48000 24784864.9700'
string1_final = ['21007250.12000','-18047085.73200','1604.90200','59.10000','21007239.94800']
string2_final = ['24784864.18300','-318969464.50000','-1543.53600','34.48000','24784864.9700']
Notice that the strings are split on whitespace, and that two numbers joined only by a minus sign are also separated while the minus sign is kept. I've tried using string2.split() and string2.split('-'), but the latter removes the minus sign. Any help would be greatly appreciated.

You can use code similar to the answer to a related question: insert a space before each minus sign, then split on whitespace:
import re
string1 = '21007250.12000 -18047085.73200 1604.90200 59.10000 21007239.94800'
string2 = '24784864.18300-318969464.50000 -1543.53600 34.48000 24784864.9700'
def process_string(string):
    string_spaces_added = re.sub('-', ' -', string)
    string_spaces_removed = re.sub(' +', ' ', string_spaces_added)
    return string_spaces_removed.split()
print(process_string(string1))
print(process_string(string2))
Output:
['21007250.12000', '-18047085.73200', '1604.90200', '59.10000', '21007239.94800']
['24784864.18300', '-318969464.50000', '-1543.53600', '34.48000', '24784864.9700']
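Since str.split() with no arguments already collapses runs of whitespace, the second re.sub looks optional. The following is a minimal sketch of that shorter variant; the name split_numbers and the conversion to float are illustrative additions, not part of the answer above.

```python
import re

def split_numbers(line):
    """Insert a space before each minus sign, then split on whitespace."""
    # str.split() with no arguments treats any run of whitespace as one
    # separator, so doubled spaces created by the substitution do no harm.
    return re.sub('-', ' -', line).split()

string2 = '24784864.18300-318969464.50000 -1543.53600 34.48000 24784864.9700'
parts = split_numbers(string2)
print(parts)

# Optional: convert the pieces to numbers.
values = [float(p) for p in parts]
print(len(values))
```

This assumes a minus sign only ever starts a number; input in exponent notation such as 1e-5 would be split in the wrong place.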

You could try something like this:
string1 = '21007250.12000 -18047085.73200 1604.90200 59.10000 21007239.94800'
string2 = '24784864.18300-318969464.50000 -1543.53600 34.48000 24784864.9700'
def splitter(string_to_split: str) -> list:
    out = []
    for item in string_to_split.split():
        if "-" in item and not item.startswith("-"):
            out.extend(item.replace("-", " -").split())
        else:
            out.append(item)
    return out
for s in [string1, string2]:
    print(splitter(s))
Output:
['21007250.12000', '-18047085.73200', '1604.90200', '59.10000', '21007239.94800']
['24784864.18300', '-318969464.50000', '-1543.53600', '34.48000', '24784864.9700']

Well, it looks like you want the numbers in the strings, rather than to "split on variable delimiters"; i.e. it's not a string like "123 -abc def ghi", it's always a string of numbers.
So use a simple regex to identify: an optional negative sign, some digits, then optionally a decimal point followed by decimal digits (assuming there will always be digits after the decimal point, unlike NumPy's representation of numbers like 2. == 2.0).
import re
numbers = re.compile(r'(-?\d+(?:\.\d+)?)')
string1 = numbers.findall(string1)
string1 == string1_final
# True
string2 = numbers.findall(string2)
string2 == string2_final
# True
# also works for these:
string3 = '123 21007250.12000 -5000 -67.89 200-300.4-7'
numbers.findall(string3)
# ['123', '21007250.12000', '-5000', '-67.89', '200', '-300.4', '-7']
If you want to match only ASCII digits (in Python 3, \d in a str pattern matches any Unicode decimal digit, for example Arabic-Indic or Devanagari digits), change each \d in the regex to [0-9], or compile the pattern with the re.ASCII flag.
Note: this regex doesn't include the possibility for exponents, complex numbers, powers, etc.
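If exponents do turn up in the data, one possible extension is an optional exponent group after the mantissa. This is only a sketch: the name numbers_exp and the sample string are made up for illustration, and complex numbers or powers are still not handled.

```python
import re

# Hypothetical extension of the pattern above: an optional exponent part
# such as e-3 or E+4 may follow the mantissa.
numbers_exp = re.compile(r'-?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?')

sample = '1.5e-3-2.0E+4 7'
found = numbers_exp.findall(sample)
print(found)  # ['1.5e-3', '-2.0E+4', '7']

# Every matched piece is a valid float literal.
print([float(x) for x in found])
```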

Related

split string on any special character using python

currently I can have many dynamic separators in string like
new_123_12313131
new$123$12313131
new#123#12313131
etc etc . I just want to check if there is a special character in string then just get value after last separator like in this example just want 12313131
This is a good use case for isdigit():
l = [
    'new_123_12313131',
    'new$123$12313131',
    'new#123#12313131',
]
output = []
for s in l:
    temp = ''
    for char in s:
        if char.isdigit():
            temp += char
    output.append(temp)
print(output)
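Result: ['12312313131', '12312313131', '12312313131']
Note that this keeps every digit in the string, including the 123 that comes before the last separator.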
Assuming you define 'special character' as anything that's not alphanumeric, you can use the str.isalnum() method to find the last special character and split on it, something like this:
def split_non_special(input) -> str:
    """
    Find first special character starting from the end and get the last piece
    """
    for i in reversed(input):
        if not i.isalnum():
            return input.split(i)[-1]  # return as soon as a separator is found
    return ''  # no separator found
# inputs = ['new_123_12313131', 'new$123$12313131', 'new#123#12313131', 'eefwfwrfwfwf3243']
# outputs = [split_non_special(input) for input in inputs]
# ['12313131', '12313131', '12313131', ''] # outputs
"just get value after last separator"
The more obvious way is to use re.findall:
from re import findall
text = 'new_123_12313131'  # example input
findall(r'\d+$', text)  # ['12313131']
Python provides what you seem to consider "special" characters in the string module as string.punctuation, which contains these characters:
!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
Using that in conjunction with the re module you can do this:
from string import punctuation
import re
re.split(f"[{punctuation}]", my_string)
my_string being the string you want to split.
Results for your examples
['new', '123', '12313131']
To get just the digits, you can use:
re.findall(r"\d+", my_string)
Results:
['123', '12313131']
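To get only the value after the last separator, as the question asks, the split result can be indexed with [-1]. The sketch below also passes the punctuation through re.escape so that characters such as ], ^ and - are taken literally inside the character class; last_field is a hypothetical helper name.

```python
import re
from string import punctuation

# re.escape makes every punctuation character literal inside the class.
separator = re.compile(f"[{re.escape(punctuation)}]")

def last_field(text):
    """Return the part after the last punctuation separator (hypothetical helper)."""
    return separator.split(text)[-1]

samples = ['new_123_12313131', 'new$123$12313131', 'new#123#12313131']
results = [last_field(s) for s in samples]
print(results)  # ['12313131', '12313131', '12313131']
```

If the string contains no separator at all, this returns the whole string unchanged.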

finding an indefinite sequence of numbers and dashes regex python

So I have a bunch of strings that contain a sequence of numbers and dashes:
strings = [
    '32sdjhsdjhsdjb20-11-3kjddjsdsdj435',
    'jdhjhdahj200-19-39-2-12-2jksjfkfjkdf3345',
    '1232sdsjsdkjsop99-7-21sdjsdjsdj',
]
I have a function:
def get_nums():
    for string in strings:
        print(re.findall(r'\d+-\d+', string))
I want this function to return the following:
['20-11-3']
['200-19-39-2-12-2']
['99-7-21']
But my function returns:
['20-11']
['200-19', '39-2', '12-2']
['99-7']
I have no idea how to return the full sequence of numbers and dashes.
The sequences will always begin and end with numbers, never dashes. If there are no dashes between the numbers they should not be returned.
How can I use regex to return these sequences? Is there an alternative to regex that would be better here?
def get_nums():
    for string in strings:
        print(re.findall(r'\d+(?:-\d+)+', string))
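This needs to be (?:…) rather than just (…), because re.findall returns the contents of the capturing groups instead of the whole match when the pattern contains any; see https://medium.com/@yeukhon/non-capturing-group-in-pythons-regular-expression-75c4a828a9eb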
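The difference between the two kinds of group can be seen by running both patterns through re.findall on the first sample string. This is a small illustration, with variable names chosen only for this example.

```python
import re

s = '32sdjhsdjhsdjb20-11-3kjddjsdsdj435'

# With a capturing group, findall returns only the group's last repetition.
capturing = re.findall(r'\d+(-\d+)+', s)

# With a non-capturing group, findall returns the whole match.
non_capturing = re.findall(r'\d+(?:-\d+)+', s)

print(capturing)      # ['-3']
print(non_capturing)  # ['20-11-3']
```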
import re
strings = [
    '32sdjhsdjhsdjb20-11-3kjddjsdsdj435',
    'jdhjhdahj200-19-39-2-12-2jksjfkfjkdf3345',
    '1232sdsjsdkjsop99-7-21sdjsdjsdj',
]
def get_nums():
    for string in strings:
        print(re.search(r'\d+(-\d+)+', string).group(0))
get_nums()
Output:
20-11-3
200-19-39-2-12-2
99-7-21
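Note that re.search returns None when a string contains no such sequence, so calling .group(0) on the result would raise AttributeError. A guarded variant might look like this; first_dash_sequence is a hypothetical name.

```python
import re

def first_dash_sequence(text):
    """Return the first digit-dash sequence in text, or None if there is none."""
    match = re.search(r'\d+(?:-\d+)+', text)
    return match.group(0) if match else None

print(first_dash_sequence('1232sdsjsdkjsop99-7-21sdjsdjsdj'))  # 99-7-21
print(first_dash_sequence('abc123'))                           # None
```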

Write a Regex to extract number before '/'

I don't want to use string split because I have numbers 1-99, and a column of strings that contain '#/#' somewhere in the text.
How can I write a regex to extract the number 10 in the following example:
He got 10/19 questions right.
Use a lookahead to match on the /, like this:
\d+(?=/)
You may need to escape the / if your implementation uses it as its delimiter.
Live example: https://regex101.com/r/xdT4vq/1
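In Python, the lookahead pattern can be used with re.search. A minimal sketch, assuming only the first such number is wanted (the variable names are illustrative):

```python
import re

text = "He/she got 10/19 questions right."

# \d+ matches the digits; (?=/) requires a following slash without consuming it.
match = re.search(r'\d+(?=/)', text)
first_number = int(match.group()) if match else None
print(first_number)  # 10
```

No escaping of the / is needed here, because Python regex patterns are ordinary strings without delimiters.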
You can still use str.split() if you carefully construct logic around it:
t = "He got 10/19 questions right."
t2 = "He/she got 10/19 questions right"
for q in [t, t2]:
    # split whole string at spaces
    # split each part at /
    # only keep parts that contain / (but not at the 1st position) and
    # otherwise consist only of digits
    numbers = [x.split("/") for x in q.split()
               if "/" in x and all(c in "0123456789/" for c in x)
               and not x.startswith("/")]
    if numbers:
        print(numbers[0][0])
Output:
10
10
import re
myString = "He got 10/19 questions right."
oldnumber = re.findall('[0-9]+/', myString) #find one or more digits followed by a slash.
newNumber = oldnumber[0].replace("/","") #get rid of the slash.
print(newNumber)
>>>10
res = re.search(r'(\d+)/\d+', 'He got 10/19 questions right.')
res.groups()
('10',)
Find all numbers that precede a forward slash, and exclude the slash itself from the result by wrapping only the digits in a capturing group (parentheses).
>>> import re
>>> myString = 'He got 10/19 questions right.'
>>> stringNumber = re.findall('([0-9]+)/', myString)
>>> stringNumber
['10']
This returns all numbers that are followed by a forward slash, as a list of strings. If you want integers, map int over the list and convert the result back to a list.
>>> intNumber = list(map(int, stringNumber))
>>> intNumber
[10]

Check if a string has unique characters excluding whitespace

I'm practicing questions from Cracking the coding interview to become better and just in case, be prepared. The first problem states: Find if a string has all unique characters or not? I wrote this and it works perfectly:
def isunique(string):
    x = []
    for i in string:
        if i in x:
            return False
        else:
            x.append(i)
    return True
Now, my question is, what if I have all unique characters like in:
'I am J'
which would be pretty rare, but let's say it occurs by mere chance, how can I create an exception for the spaces? Is there a way to not count the space as a character, so that the function returns True and not False?
No matter how many spaces or special characters your string contains, this version only looks at word characters (letters, digits and underscore, which is what \w matches):
import re
def isunique(string):
    pattern = r'\w'
    search = re.findall(pattern, string)
    string = search
    x = []
    for i in string:
        if i in x:
            return False
        else:
            x.append(i)
    return True
print(isunique('I am J'))
output:
True
Test case without spaces:
print(isunique('war'))
True
Test case with spaces:
print(isunique('w a r'))
True
Test case with repeated letters:
print(isunique('warrior'))
False
Create a list of characters you want to consider as non-characters and replace them in string. Then perform your function code.
As an alternative, a better way to check that the characters are unique is to compare the length of the final string with the length of the set built from it:
def isunique(my_string):
    nonchars = [' ', '.', ',']
    for nonchar in nonchars:
        my_string = my_string.replace(nonchar, '')
    return len(set(my_string)) == len(my_string)
Sample Run:
>>> isunique( 'I am J' )
True
As per the Python documentation for set():
Return a new set object, optionally with elements taken from iterable.
set is a built-in class. See set and Set Types — set, frozenset for
documentation about this class.
And... a pool of answers is never complete unless there is also a regex solution:
def is_unique(string):
    import re
    patt = re.compile(r"^.*?(.).*?(\1).*$")
    return not re.search(patt, string)
(I'll leave the whitespace handling as an exercise to the OP)
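One possible way to do that exercise is to capture only non-whitespace characters with \S, so that repeated spaces never count as duplicates. This is a sketch under that assumption, not the answerer's own solution; the names REPEATED and is_unique_ignoring_whitespace are made up here.

```python
import re

# (\S) captures one non-whitespace character, .* skips ahead, and \1 requires
# the same character again. re.DOTALL lets .* cross newlines.
REPEATED = re.compile(r'(\S).*\1', re.DOTALL)

def is_unique_ignoring_whitespace(string):
    """True if no non-whitespace character occurs more than once."""
    return REPEATED.search(string) is None

print(is_unique_ignoring_whitespace('I am J'))   # True
print(is_unique_ignoring_whitespace('warrior'))  # False
```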
An elegant approach (YMMV), with collections.Counter.
from collections import Counter
def isunique(string):
    return Counter(string.replace(' ', '')).most_common(1)[0][-1] == 1
Alternatively, if your strings contain more than just whitespaces (tabs and newlines for instance), I'd recommend regex based substitution:
import re
string = re.sub(r'\s+', '', string, flags=re.M)
Simple solution
def isunique(string):
    return all(string.count(i)==1 for i in string if i!=' ')
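The count-based version rescans the whole string for every character. A single-pass alternative keeps the characters seen so far in a set and stops at the first repeat. This is a sketch that assumes all whitespace (not just the space character) should be ignored; isunique_single_pass is a hypothetical name.

```python
def isunique_single_pass(string):
    """True if every non-whitespace character occurs at most once."""
    seen = set()
    for ch in string:
        if ch.isspace():
            continue  # whitespace never counts as a duplicate
        if ch in seen:
            return False  # stop at the first repeated character
        seen.add(ch)
    return True

print(isunique_single_pass('I am J'))   # True
print(isunique_single_pass('warrior'))  # False
```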

Breaking up substrings in Python based on characters

I am trying to write code that will take a string and remove specific data from it. I know that the data will look like the line below, and I only need the data within the " " marks, not the marks themselves.
inputString = 'type="NN" span="123..145" confidence="1.0" '
Is there a way to take a substring of a string using two delimiting characters as the start and stop points?
You can extract all the text between pairs of " characters using regular expressions:
import re
inputString = 'type="NN" span="123..145" confidence="1.0" '
pat = re.compile('"([^"]*)"')
strings = []  # collects the text found between each pair of quotes
while True:
    mat = pat.search(inputString)
    if mat is None:
        break
    strings.append(mat.group(1))
    inputString = inputString[mat.end():]  # continue after the closing quote
print(strings)
or, easier:
import re
inputString='type="NN" span="123..145" confidence="1.0" '
strings=re.findall('"([^"]*)"', inputString)
print(strings)
Output for both versions:
['NN', '123..145', '1.0']
fields = inputString.split('"')
print(fields[1], fields[3], fields[5])
You could split the string at each space to get a list of 'key="value"' substrings and then use regular expressions to parse the substrings.
Using your input string:
>>> input_string = 'type="NN" span="123..145" confidence="1.0" '
>>> input_string_split = input_string.split()
>>> print(input_string_split)
['type="NN"', 'span="123..145"', 'confidence="1.0"']
Then use regular expressions:
>>> import re
>>> pattern = r'"([^"]+)"'
>>> for substring in input_string_split:
...     match_obj = re.search(pattern, substring)
...     print(match_obj.group(1))
...
NN
123..145
1.0
The regular expression '"([^"]+)"' matches anything within quotation marks (provided there is at least one character). The round brackets indicate the bit of the regular expression that you are interested in.
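If the attribute names matter as well as the values, two capturing groups can collect key/value pairs in one pass. This is a sketch that assumes every key consists of word characters and every value is enclosed in double quotes, as in the sample input; the variable name attributes is illustrative.

```python
import re

input_string = 'type="NN" span="123..145" confidence="1.0" '

# Each match yields a (key, value) tuple; dict() turns the pairs into a mapping.
attributes = dict(re.findall(r'(\w+)="([^"]*)"', input_string))
print(attributes)
# {'type': 'NN', 'span': '123..145', 'confidence': '1.0'}
```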
