I am parsing some data where the standard format is something like 10 pizzas. Sometimes the data is entered incorrectly and we end up with 5pizzas instead of 5 pizzas. In this scenario, I want to parse out the number of pizzas.
The naïve way of doing this would be to check character by character, building up a string until we reach a non-digit and then casting that string as an integer.
num_pizzas = ""
for character in data_input:
    if character.isdigit():
        num_pizzas += character
    else:
        break
num_pizzas = int(num_pizzas)
This is pretty clunky, though. Is there an easier way to split a string where it switches from numeric digits to alphabetic characters?
You ask for a way to split a string on digits, but in your example what you actually want is just the leading number. That is easily done with itertools.takewhile():
>>> import itertools
>>> int("".join(itertools.takewhile(str.isdigit, "10pizzas")))
10
This makes a lot of sense: we take characters from the string while they are digits. This has the advantage of stopping as soon as we reach the first non-digit character.
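One thing the one-liner does not cover is input with no leading digits, where int("") raises ValueError. A minimal sketch of a wrapper for that case follows; the name leading_int and its default parameter are made up for illustration and are not part of the original answer.

```python
import itertools


def leading_int(text, default=None):
    """Return the integer formed by the leading digits of text.

    Hypothetical helper: returns `default` when text has no leading digit.
    """
    digits = "".join(itertools.takewhile(str.isdigit, text))
    return int(digits) if digits else default


print(leading_int("10pizzas"))  # 10
print(leading_int("5 pizzas"))  # 5
print(leading_int("pizzas"))    # None
```

Returning a default instead of raising is a design choice; raising may be preferable if a missing count means the record is invalid.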
If you need the later data too, then what you are looking for is itertools.groupby() mixed in with a simple list comprehension:
>>> ["".join(x) for _, x in itertools.groupby("dfsd98sd8f68as7df56", key=str.isdigit)]
['dfsd', '98', 'sd', '8', 'f', '68', 'as', '7', 'df', '56']
If you then want to make one giant number:
>>> int("".join("".join(x) for is_number, x in itertools.groupby("dfsd98sd8f68as7df56", key=str.isdigit) if is_number is True))
98868756
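To split the string at digits you can use re.split with the regular expression (\d+); the capturing group keeps the digits in the result:
>>> import re
>>> def my_split(s):
...     # list() is needed on Python 3, where filter returns an iterator
...     return list(filter(None, re.split(r'(\d+)', s)))
...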
>>> my_split('5pizzas')
['5', 'pizzas']
>>> my_split('foo123bar')
['foo', '123', 'bar']
To find the first number use re.search:
>>> re.search(r'\d+', '5pizzas').group()
'5'
>>> re.search(r'\d+', 'foo123bar').group()
'123'
If you know the number must be at the start of the string then you can use re.match instead of re.search. If you want to find all the numbers and discard the rest you can use re.findall.
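For comparison, here is a short sketch of re.match and re.findall on the same kind of input. The variable names are only illustrative.

```python
import re

# re.match only succeeds if the pattern matches at the very start of the string.
at_start = re.match(r'\d+', '5pizzas')
not_at_start = re.match(r'\d+', 'foo123bar')
print(at_start.group())  # 5
print(not_at_start)      # None

# re.findall returns every run of digits and discards everything else.
all_numbers = re.findall(r'\d+', 'dfsd98sd8f68as7df56')
print(all_numbers)       # ['98', '8', '68', '7', '56']
```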
How about a regex?
import re
reg = re.compile(r'(?P<numbers>\d*)(?P<rest>.*)')
result = reg.search(data_input)  # data_input is the string to parse, e.g. '5pizzas'
if result:
    numbers = result.group('numbers')
    rest = result.group('rest')
Answer added as possible way to solve How to split a string into a list by digits? which was dupe-linked to this question.
You can do the splitting yourself:
use a temporary list to accumulate characters that are not digits
if you find a digit, add the temporary list (''.join()-ed) to the result list (only if not empty) and do not forget to clear the temporary list
repeat until all characters are processed; if the temporary list still has content, add it
text = "Ka12Tu12La"
splitted = []  # our result
tmp = []  # our temporary character collector
for c in text:
    if not c.isdigit():
        tmp.append(c)  # not a digit, add it
    elif tmp:  # c is a digit, if tmp filled, add it
        splitted.append(''.join(tmp))
        tmp = []
if tmp:
    splitted.append(''.join(tmp))
print(splitted)
Output:
['Ka', 'Tu', 'La']
References:
What exactly does the .join() method do?
I have made a string without spaces: instead of spaces, I used 0000000. There will be no alphabet letters. For example, 000000020000000050000000190000000200000000 should equal "test". Sorry, I am very new to Python. If someone can help me out, that would be awesome.
You should be able to achieve the desired effect using regular expressions and re.sub()
If you want to extract the literal word "test" from that string as mentioned in the comments, you'll need to account for the fact that if you have 8 0's, it will match the first 7 from left to right, so a number like 20 followed by 7 0's would cause a few issues. We can get around this by matching the string in reverse (right to left) and then reversing the finished string to undo the initial reverse.
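To make the problem concrete, the sketch below compares a plain left-to-right substitution with the reversed one. It rebuilds the question's input from its parts so the separators are easy to see; the names SEPARATOR, left_to_right and right_to_left are only for illustration.

```python
import re

SEPARATOR = '0' * 7  # seven zeros stand in for one space
# Rebuild the input from its parts: separator, 20, separator, 5, ...
my_string = SEPARATOR + SEPARATOR.join(['20', '5', '19', '20']) + SEPARATOR

# Left to right: the trailing 0 of "20" is swallowed by the separator match.
left_to_right = re.sub('0{7}', ' ', my_string).split()

# Right to left: reverse the input, substitute, then reverse the result back.
right_to_left = re.sub('0{7}', ' ', my_string[::-1])[::-1].split()

print(left_to_right)  # ['2', '05', '19', '2', '0']
print(right_to_left)  # ['20', '5', '19', '20']
```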
Here's the solution I came up with as my revised answer:
import re
my_string = '000000020000000050000000190000000200000000'
# Substitute a space in place of 7 0's
# Reverse the string in the input, and then reverse the output
new_string = re.sub('0{7}', ' ', my_string[::-1])[::-1]
# >>> new_string
# ' 20 5 19 20 '
Then we can strip the leading and trailing whitespace from this answer and split it into an array
my_array = new_string.strip().split()
# >>> my_array
# ['20', '5', '19', '20']
After that, you can process the array in whatever way you see fit to get the word "test" out of it.
My solution to that would probably be the following:
import string
word = ''.join([string.ascii_lowercase[int(x) - 1] for x in my_array])
# >>> word
# 'test'
NOTE: This answer has been completely rewritten (v2).
I have a line containing values separated by underscores, like this:
1_0_1_A2C_1A_2BE_DCAAFFC_0_0_0
I need code to check the DCAAFFC field: if its last 4 characters are not 0000, the code should replace them (here AFFC) with 0000, giving DCA0000.
So the line should become:
1_0_1_A2C_1A_2BE_DCA0000_0_0_0
I need the code to work on both Python 2 and 3, please.
P.S. The value of the DCAAFFC field is not fixed; it is always changing.
code = "1_0_1_A2C_1A_2BE_DCAAFFC_0_0_0"
I will assume that the format is strictly like this. Then you can get DCAAFFC with code.split('_')[-4]. Finally, you can overwrite its last four characters with 0000 by slicing; str.replace would also change any earlier occurrence of the same four characters.
Here is the full code:
>>> code = "1_0_1_A2C_1A_2BE_DCAAFFC_0_0_0"
>>> frag = code.split("_")
>>> frag
['1', '0', '1', 'A2C', '1A', '2BE', 'DCAAFFC', '0', '0', '0']
>>> frag[-4] = frag[-4][:-4] + "0000"
>>> final_code = "_".join(frag)
>>> final_code
'1_0_1_A2C_1A_2BE_DCA0000_0_0_0'
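If this has to be applied to many lines, the same idea can be wrapped in a function. This is only a sketch: the name zero_field_suffix and its parameters are invented here, and field_index=-4 assumes the field is always fourth from the end, as in the question.

```python
def zero_field_suffix(line, field_index=-4, width=4, sep="_"):
    """Overwrite the last `width` characters of one field with zeros.

    Hypothetical helper; the defaults assume the layout from the question.
    """
    fields = line.split(sep)
    field = fields[field_index]
    # Slicing touches only the tail of the field, unlike str.replace.
    fields[field_index] = field[:-width] + "0" * width
    return sep.join(fields)


print(zero_field_suffix("1_0_1_A2C_1A_2BE_DCAAFFC_0_0_0"))
# 1_0_1_A2C_1A_2BE_DCA0000_0_0_0
```

It uses only str.split, slicing and str.join, so it should behave the same on Python 2 and Python 3.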
Try regular expressions, e.g.:
import re
old_string = '1_0_1_A2C_1A_2BE_DCAAFFC_0_0_0'
match = re.search('_([a-zA-Z]{7})_', old_string)
span = match.span()
new_string = old_string[:span[0]+4] + '0000_' + old_string[span[1]:]
print(new_string)
Is this a general string or just some hexadecimal representation of a number? For numbers in Python 3, '_' underscores are used just for adding readability and do not affect the number value in any way.
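As a quick illustration of the point about underscores in numbers, here is a small sketch. It assumes Python 3.6 or later, where underscores in numeric literals are accepted; the variable names are arbitrary.

```python
# Underscores inside a numeric literal are ignored (Python 3.6 and later).
million = 1_000_000
hex_value = 0xDCA_AFFC

print(million == 1000000)      # True
print(hex_value == 0xDCAAFFC)  # True
print(int("1_000"))            # 1000
```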
Say you have a general string like the one you've given, and would like to replace the last 4 characters of every underscore-delimited group longer than 4 characters with '0000'. A simple one-liner would be:
hexadecimal_string = "1_0_1_A2C_1A_2BE_DCAAFFC_0_0_0"
hexadecimal_string = "_".join([ substring if len(substring)<=4 else substring[:-4]+'0'*4 for substring in hexadecimal_string.split('_')])
Here,
hexadecimal_string.split('_') separates all groups by '_' as separator,
substring if len(substring)<=4 else substring[:-4]+'0'*4 takes care of every such substring group having length more than 4 to have ending 4 characters replaced by '0'*4 or '0000',
such for loop usage is a list comprehension feature of Python.
'_'.join() joins the subgroups back into one main string using '_' as separator in string.
Other answers posted here work well for the specific string in the question; I'm sharing this answer to meet your one-liner requirement in Python 3.
If the length of the string is always the same, and the position of the part that needs to be replaced with zeros is always the same, you can just do this:
txt = '1_0_1_A2C_1A_2BE_DCAAFFC_0_0_0'
new = txt[0:20]+'0000'+txt[-6:]
print(new)
The output will be
1_0_1_A2C_1A_2BE_DCA0000_0_0_0
It would help if you gave us some other examples of the strings.
I started studying Python yesterday and I wanted to study a little about the string split method.
I wasn't looking for anything specific, I was just trying to learn it. I saw that it's possible to split a string on multiple characters, but what if I want to use the maxsplit parameter for only one of those characters?
I searched a little about it and found nothing, so I'm here to ask how. Here's an example:
Let's suppose I have this string:
normal_string = "1d30 drake dreke"
I want this to be a list like this:
['1', '30', 'drake', 'dreke']
Now let's suppose I use a method that splits on multiple characters, so I split on the character 'd' and the character ' '.
The thing is:
I don't want to remove the "d" from "drake" and "dreke", only from "1d30", but I still want to split on every space character.
I need to apply a maxsplit parameter ONLY to the character "d". How can I do it?
Do the following:
normal_string = "1d30 drake dreke"
# first split by d
start, end = normal_string.split("d", maxsplit=1)
# then split by space and concat the results
res = start.split() + end.split()
print(res)
Output
['1', '30', 'drake', 'dreke']
A more general approach, albeit more advanced, is to do:
res = [w for s in normal_string.split("d", maxsplit=1) for w in s.split()]
print(res)
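The comprehension can also be wrapped in a small function. The sketch below uses a made-up name, split_first_then_whitespace. Unlike the two-variable unpacking above, it does not raise ValueError when the separator is missing, because it never assumes exactly two pieces.

```python
def split_first_then_whitespace(text, sep, maxsplit=1):
    """Split on `sep` at most `maxsplit` times, then split each piece on whitespace."""
    return [word
            for piece in text.split(sep, maxsplit)
            for word in piece.split()]


print(split_first_then_whitespace("1d30 drake dreke", "d"))
# ['1', '30', 'drake', 'dreke']
print(split_first_then_whitespace("10 pizzas", "d"))
# ['10', 'pizzas']
```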
I'm wondering if there's any way to find how many pairs of parentheses are in a string.
I have to do some string manipulation and I sometimes have something like:
some_string = '1.8.0*99(0000000*kWh)'
or something like
some_string = '1.6.1*01(007.717*kW)(1604041815)'
What I'd like to do is:
get all the digits between the parentheses (e.g for the first string: 0000000)
if there are 2 pairs of parentheses (there will always be max 2 pairs) get all the digits and join them (e.g for the second string I'll have: 0077171604041815)
How can I verify how many pairs of parentheses are in a string, so that I can later do something like:
if number_of_pairs == 1:
    do_this
else:
    do_that
Or maybe there's an easier way to do what I want but couldn't think of one so far.
I know how to get only the digits in a string: final_string = re.sub('[^0-9]', '', my_string), but I'm wondering how could I treat both cases.
Since parentheses always come in pairs, just count the left (or right) parentheses in the string and you'll have your answer.
num_of_parenthesis = some_string.count('(')
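Applied to the two sample strings from the question, the count looks like this. This is a sketch only: count_paren_pairs is a hypothetical name, and it assumes every opening parenthesis has a matching closing one.

```python
def count_paren_pairs(text):
    """Count '(' characters, assuming each one has a matching ')'."""
    return text.count('(')


one_pair = count_paren_pairs('1.8.0*99(0000000*kWh)')
two_pairs = count_paren_pairs('1.6.1*01(007.717*kW)(1604041815)')
print(one_pair)   # 1
print(two_pairs)  # 2
```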
You can do it like this (assuming you already know there's at least one parenthesis):
re.sub(r'[^0-9]+', '', some_string.split('(', 1)[1])
or only with re.sub:
re.sub(r'^[^(]*\(|[^0-9]+', '', some_string)
If you want all the digits in a single string, use re.findall after replacing any . and join into a single string:
In [15]: s = '1.6.1*01(007.717*kW)(1604041815)'
In [16]: "".join(re.findall(r"\((\d+).*?\)", s.replace(".", "")))
Out[16]: '0077171604041815'
In [17]: s = '1.8.0*99(0000000*kWh)'
In [18]: "".join(re.findall(r"\((\d+).*?\)", s.replace(".", "")))
Out[18]: '0000000'
The count of parens is irrelevant when all you want is to extract any digits inside them. Based on the fact "you only have max two pairs" I presume the format is consistent.
Or if the parens always have digits, find the data in the parens and sub all bar the digits:
In [19]: s = '1.6.1*01(007.717*kW)(1604041815)'
In [20]: "".join([re.sub(r"[^0-9]", "", m) for m in re.findall(r"\((.*?)\)", s)])
Out[20]: '0077171604041815'
I'm still new to Python and learning the more basic things in programming.
Right now I'm trying to create a function that will duplicate each letter in a string by the number that follows it.
Example:
expand('d3f4e2')
>dddffffee
I'm not sure how to write the function for this.
Basically, I understand you want to repeat each letter as many times as the number beside it says.
The key to any solution is splitting things into pairs of strings to be repeated, and repeat counts, and then iterating those pairs in lock-step.
If you only need single-character strings and single-digit repeat counts, this is just breaking the string up into 2-character pairs, which you can do with mshsayem's answer, or with slicing (s[::2] is the strings, s[1::2] is the counts).
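The slicing variant mentioned above could look roughly like this. It is a sketch for the single-character, single-digit case only, and expand_pairs is a made-up name.

```python
def expand_pairs(s):
    """Expand alternating single characters and single-digit repeat counts."""
    chars = s[::2]    # characters at even positions: 'd', 'f', 'e'
    counts = s[1::2]  # digits at odd positions: '3', '4', '2'
    return ''.join(c * int(d) for c, d in zip(chars, counts))


print(expand_pairs('d3f4e2'))  # dddffffee
```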
But what if you want to generalize this to multi-letter strings and multi-digit counts?
Well, somehow we need to group the string into runs of digits and non-digits. If we could do that, we could use pairs of those groups in exactly the same way mshsayem's answer uses pairs of characters.
And it turns out that we can do this very easily. There's a nifty function in the standard library called groupby that lets you group anything into runs according to any function. And there's a function isdigit that distinguishes digits and non-digits.
So, this gets us the runs we want:
>>> import itertools
>>> s = 'd13fx4e2'
>>> [''.join(group) for (key, group) in itertools.groupby(s, str.isdigit)]
['d', '13', 'fx', '4', 'e', '2']
Now we zip this up the same way that mshsayem zipped up the characters:
>>> groups = (''.join(group) for (key, group) in itertools.groupby(s, str.isdigit))
>>> ''.join(c*int(d) for (c, d) in zip(groups, groups))
'dddddddddddddfxfxfxfxee'
So:
def expand(s):
    groups = (''.join(group) for (key, group) in itertools.groupby(s, str.isdigit))
    return ''.join(c*int(d) for (c, d) in zip(groups, groups))
Naive approach (if the digits are only single, and characters are single too):
>>> def expand(s):
...     s = iter(s)
...     return "".join(c*int(d) for (c,d) in zip(s,s))
...
>>> expand("d3s5")
'dddsssss'
Poor explanation:
Terms/functions:
iter() gives you an iterator object.
zip() makes tuples from iterables.
int() parses an integer from a string.
<expression> for <variable> in <iterable> is a generator expression (a list comprehension without the square brackets).
<string>.join joins an iterable of strings, using <string> as the separator.
Process:
First we make an iterator over the given string.
zip() is used to make tuples of a character and its repeat count, e.g. ('d','3'), ('s','5'). (zip() pulls from its arguments to build each tuple. Note that for each tuple it pulls from the same iterator twice, so the iterator advances twice.)
Now for ... in iterates over the tuples; using two variables (c,d) unpacks each tuple into them.
But d is still a string, so int() converts it to an integer.
<string> * <integer> repeats the string that many times.
Finally, join returns the result.
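The pairing step is the least obvious part, so here is a small sketch of it on its own. It contrasts passing the same iterator twice with passing the plain string twice; the variable names are just for illustration.

```python
shared = iter("d3s5")
# Both arguments are the same iterator, so each tuple consumes two characters.
pairs_from_iterator = list(zip(shared, shared))
print(pairs_from_iterator)  # [('d', '3'), ('s', '5')]

# With the plain string, zip makes two independent iterators,
# so each character is simply paired with itself.
pairs_from_string = list(zip("d3s5", "d3s5"))
print(pairs_from_string)    # [('d', 'd'), ('3', '3'), ('s', 's'), ('5', '5')]
```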
Here is a multi-digit, multi-char version:
import re
def expand(s):
    s = re.findall(r'([^0-9]+)(\d+)', s)
    return "".join(c*int(d) for (c,d) in s)
By the way, using itertools.groupby is better, as shown by abarnert.
Let's look at how you could do this manually, using only tools that a novice will understand. It's better to actually learn about zip and iterators and comprehensions and so on, but it may also help to see the clunky and verbose way you write the same thing.
So, let's start with just single characters and single digits:
def expand(s):
    result = ''
    repeated_char_next = True
    for char in s:
        if repeated_char_next:
            char_to_repeat = char
            repeated_char_next = False
        else:
            repeat_count = int(char)
            result += char_to_repeat * repeat_count
            repeated_char_next = True
    return result
This is a very simple state machine. There are two states: either the next character is a character to be repeated, or it's a digit that gives a repeat count. After reading the former, we don't have anything to add yet (we know the character, but not how many times to repeat it), so all we do is switch states. After reading the latter, we now know what to add (since we know both the character and the repeat count), so we do that, and also switch states. That's all there is to it.
Now, to expand it to multi-char repeat strings and multi-digit repeat counts:
def expand(s):
    result = ''
    current_repeat_string = ''
    current_repeat_count = ''
    for char in s:
        if char.isdigit():
            current_repeat_count += char
        else:
            if current_repeat_count:
                # We've just switched from a digit back to a non-digit
                count = int(current_repeat_count)
                result += current_repeat_string * count
                current_repeat_count = ''
                current_repeat_string = ''
            current_repeat_string += char
    # Flush the final group, which no later non-digit triggers
    if current_repeat_count:
        result += current_repeat_string * int(current_repeat_count)
    return result
The state here is pretty similar—we're either in the middle of reading non-digits, or in the middle of reading digits. But we don't automatically switch states after each character; we only do it when getting a digit after non-digits, or vice-versa. Plus, we have to keep track of all the characters in the current repeat string and in the current repeat count. I've collapsed the state flag into that repeat string, but there's nothing else tricky here.
There is more than one way to do this, but assuming that the sequence of characters in your input is always the same, e.g. a single character followed by a number, the following would work:
def expand(input):
    alphatest = False
    finalexpanded = ""  # Blank string variable to hold final output
    # first part is used for iterating through range of size i
    # this solution assumes you have a numeric character coming after your
    # alphabetic character every time
    for i in input:
        if alphatest == True:
            i = int(i)  # converts the string number to an integer
            for value in range(0, i):  # loops through range of size i
                finalexpanded += alphatemp  # adds your alphabetic character to string
            alphatest = False  # Once loop is finished resets your alphatest variable to False
            i = str(i)  # converts i back to string to avoid error from i.isalpha() test
        if i.isalpha():  # tests i to see if it is an alphabetic character
            alphatemp = i  # sets alphatemp to i for loop above
            alphatest = True  # sets alphatest True for loop above
    print(finalexpanded)  # prints the final result