I have a list of strings. Each string has the form of data0*(\d*) if we use a regular expression form.
The following is an example of the strings:
data000000, data000003, data0172, data2312, data008212312
I would like to take only the meaningful number portion. All numbers are integers. For example, in the above case, I would like to get another list containing:
0, 3, 172, 2312, 8212312
What would be the best way in the above case?
The following is the solution that I thought:
import re
string_list = ["data0000172", ..... ]
number_list = []
for string in string_list:
    match = re.search(r"data0*(\d+)", string)
    if match:
        number_list.append(match.group(1))
    else:
        raise Exception("Wrong format.")
However, the above might be inefficient. Could you suggest a better way for doing this?
If you are sure that the strings start with "data", you can just slice off the prefix and convert the rest to an integer. Leading zeroes aren't an issue there: int() accepts a zero-padded digit string.
lst = ["data000000", "data000003", "data0172", "data2312", "data008212312"]
result = [int(x[4:]) for x in lst]
result:
[0, 3, 172, 2312, 8212312]
Or use good old replace in case the prefix may be missing (it will be slightly slower):
result = [int(x.replace("data","")) for x in lst]
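If the input may contain malformed entries, a small helper can combine the prefix removal with validation. This is only a sketch: the name extract_number is made up for illustration, and str.removeprefix assumes Python 3.9 or newer.

```python
def extract_number(s, prefix="data"):
    """Return the integer that follows `prefix` in `s` (hypothetical helper)."""
    digits = s.removeprefix(prefix)  # leaves s unchanged if the prefix is missing
    if not digits.isdecimal():       # rejects empty strings and non-digit characters
        raise ValueError(f"Wrong format: {s!r}")
    return int(digits)               # int() copes with the leading zeroes


lst = ["data000000", "data000003", "data0172", "data2312", "data008212312"]
result = [extract_number(x) for x in lst]
print(result)  # [0, 3, 172, 2312, 8212312]
```

Raising ValueError keeps the "wrong format" check from the question while avoiding a regex altogether.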
import re
st = 'data0000172'
a = float(re.search('data(\d+)',st).group(1))
print(a)
Output:
172.0
This extracts the numeric part of the string. Apply it to each element of your list. Note that float gives 172.0; use int instead if you need integers.
In case the strings might not be of the form data<num>, or some of the entries are broken for some reason, and you still want the solution to work, you can do the following:
import re
ll = ['data000000', 'data000003', 'data0172', 'data2312', 'data008212312']
ss = ''.join(ll)
res = [int(s) for s in re.findall(r'\d+', ss)]
print(res)
re.findall is applied to the single string produced by the join. Because the pattern has no capture groups, it returns a list of the matched digit runs as strings, and int converts each of them.
Output:
[0, 3, 172, 2312, 8212312]
Note: passing the list itself to re.findall, without the join, raises a TypeError.
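One caveat of joining with an empty string: if an entry ends with digits and the next one starts with digits, the two runs merge into one number. A possible alternative, sketched here with a hypothetical helper name, searches each string separately and skips entries that contain no digits.

```python
import re

DIGITS = re.compile(r"\d+")


def first_numbers(strings):
    """Return the first digit run of each string as an int, skipping strings without digits."""
    numbers = []
    for s in strings:
        match = DIGITS.search(s)
        if match is not None:
            numbers.append(int(match.group()))
    return numbers


broken = ["data12", "34abc", "no digits here"]
print(first_numbers(broken))                               # [12, 34]
print([int(n) for n in DIGITS.findall("".join(broken))])   # [1234], the runs merged
```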
Suppose I have a string that has the same sub-string repeated multiple times and I want to replace each occurrence with a different element from a list.
For example, consider this scenario:
pattern = "_____" # repeated pattern
s = "a(_____), b(_____), c(_____)"
r = [0,1,2] # elements to insert
The goal is to obtain a string of the form:
s = "a(_000_), b(_001_), c(_002_)"
The number of occurrences is known, and the list r has the same length as the number of occurrences (3 in the previous example) and contains increasing integers starting from 0.
I came up with this solution:
import re
pattern = "_____"
s = "a(_____), b(_____), c(_____)"
l = [m.start() for m in re.finditer(pattern, s)]
i = 0
for el in l:
    s = s[:el] + f"_{str(i).zfill(5 - 2)}_" + s[el + 5:]
    i += 1
print(s)
Output: a(_000_), b(_001_), c(_002_)
This solves my problem, but it seems to me a bit cumbersome, especially the for-loop. Is there a better way, maybe more "pythonic" (intended as concise, possibly elegant, whatever it means) to solve the task?
You can simply use re.sub() method to replace each occurrence of the pattern with a different element from the list.
import re
pattern = re.compile("_____")
s = "a(_____), b(_____), c(_____)"
r = [0,1,2]
for val in r:
    s = re.sub(pattern, f"_{val:03d}_", s, count=1)
print(s)
You can also skip re entirely and build the string directly from the values in the r list and their indexes:
r = [0,1,2]
s = ", ".join(f"{'abc'[i]}(_{val:03d}_)" for i, val in enumerate(r))
print(s)
a(_000_), b(_001_), c(_002_)
TL;DR
Use re.sub with a replacement callable and an iterator:
import re
p = re.compile("_____")
s = "a(_____), b(_____), c(_____)"
r = [0, 1, 2]
it = iter(r)
print(re.sub(p, lambda _: f"_{next(it):03d}_", s))
Long version
Generally speaking, it is a good idea to re.compile your pattern once ahead of time. If you are going to use that pattern repeatedly later, this can make the regex calls more efficient, although the re module also caches recently compiled patterns, so the gain is often small. There is basically no downside to compiling the pattern, so I would just make it a habit.
As for avoiding the for-loop altogether, the re.sub function allows us to pass a callable as the repl argument, which takes a re.Match object as its only argument and returns a string. Wouldn't it be nice, if we could have such a replacement function that takes the next element from our replacements list every time it is called?
Well, since you have an iterable of replacement elements, we can leverage the iterator protocol to avoid explicit looping over the elements. All we need to do is give our replacement function access to an iterator over those elements, so that it can grab a new one via the next function every time it is called.
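One thing to keep in mind with this idea is that next raises StopIteration once the iterator is exhausted, i.e. when the string contains more matches than there are replacements. The sketch below is one way to handle that, assuming you would rather leave surplus matches untouched; the function name and the sentinel are illustrative and not part of the original answer.

```python
import re

_MISSING = object()  # sentinel, so that any real replacement value stays usable


def replace_in_order(pattern, replacements, string):
    """Replace matches one by one; leave extra matches as they are (hypothetical helper)."""
    iterator = iter(replacements)

    def repl(match):
        value = next(iterator, _MISSING)  # default avoids StopIteration
        if value is _MISSING:
            return match.group()          # nothing left: keep the original text
        return f"_{value:03d}_"

    return pattern.sub(repl, string)


p = re.compile("_____")
print(replace_in_order(p, [0, 1], "a(_____), b(_____), c(_____)"))
# a(_000_), b(_001_), c(_____)
```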
The string format specification that Jamiu used in his answer is great if you know that the sub-string to be replaced will always be exactly five underscores (_____) and that your replacement numbers will never exceed 999.
So in its simplest form, a function doing what you described could look like this:
import re
from collections.abc import Iterable
def multi_replace(
    pattern: re.Pattern[str],
    replacements: Iterable[int],
    string: str,
) -> str:
    iterator = iter(replacements)
    def repl(_match: re.Match[str]) -> str:
        return f"_{next(iterator):03d}_"
    return re.sub(pattern, repl, string)
Trying it out with your example data:
if __name__ == "__main__":
    p = re.compile("_____")
    s = "a(_____), b(_____), c(_____)"
    r = [0, 1, 2]
    print(multi_replace(p, r, s))
Output: a(_000_), b(_001_), c(_002_)
In this simple application, we aren't doing anything with the Match object in our replacement function.
If you want to make it a bit more flexible, there are a few avenues possible. Let's say the sub-strings to replace might (perhaps unexpectedly) be a different number of underscores. Let's further assume that the numbers might get bigger than 999.
First of all, the pattern would need to change a bit. And if we still want to center the replacement in an arbitrary number of underscores, we'll actually need to access the match object in our replacement function to check the number of underscores.
The format specifiers are still useful because they allow centering the inserted object with the ^ align code.
import re
from collections.abc import Iterable
def dynamic_replace(
    pattern: re.Pattern[str],
    replacements: Iterable[int],
    string: str,
) -> str:
    iterator = iter(replacements)
    def repl(match: re.Match[str]) -> str:
        replacement = f"{next(iterator):03d}"
        length = len(match.group())
        return f"{replacement:_^{length}}"
    return re.sub(pattern, repl, string)
if __name__ == "__main__":
    p = re.compile("(_+)")
    s = "a(_____), b(_____), c(_____), d(_______), e(___)"
    r = [0, 1, 2, 30, 4000]
    print(dynamic_replace(p, r, s))
Output: a(_000_), b(_001_), c(_002_), d(__030__), e(4000)
Here we are building the replacement string based on the length of the match group (i.e. the number of underscores) to ensure that the number is always centered.
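To see the centering in isolation, here is a small sketch of the nested format specification on its own. The helper name center_in_underscores is made up for this illustration.

```python
def center_in_underscores(text, width):
    # "_" is the fill character, "^" centers, and the nested {width} supplies the field width.
    return f"{text:_^{width}}"


print(center_in_underscores("000", 5))   # _000_
print(center_in_underscores("030", 7))   # __030__
print(center_in_underscores("4000", 3))  # 4000 (text longer than the width is left as is)
```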
I think you get the idea. As always, separation of concerns is a good idea. You can put the replacement logic in its own function and refer to that, whenever you need to adjust it.
I don't think regex is the best fit for this situation.
pattern = "_____" # repeated pattern
s = "a(_____), b(_____), c(_____)"
r = [0,1,2] # elements to insert
fstring = s.replace(pattern, "_{}_")
str_out = fstring.format(*r)
str_out_pad = fstring.format(*[str(entry).zfill(3) for entry in r])
print(str_out)
print(str_out_pad)
--
a(_0_), b(_1_), c(_2_)
a(_000_), b(_001_), c(_002_)
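Note that str.format treats every brace in the template as part of a replacement field, so this approach needs care if s can contain literal { or } characters. A sketch that sidesteps this by splitting on the pattern and interleaving the formatted values (fill_placeholders is a hypothetical name, and the _NNN_ layout is taken from the question):

```python
def fill_placeholders(s, pattern, values):
    """Replace each occurrence of `pattern` in `s` with the next value, formatted as _NNN_."""
    parts = s.split(pattern)                   # text between the placeholders
    fillers = [f"_{v:03d}_" for v in values]
    if len(fillers) != len(parts) - 1:
        raise ValueError("number of values does not match number of placeholders")
    pieces = [parts[0]]
    for filler, rest in zip(fillers, parts[1:]):
        pieces.append(filler)
        pieces.append(rest)
    return "".join(pieces)


print(fill_placeholders("a(_____), b(_____), c(_____)", "_____", [0, 1, 2]))
# a(_000_), b(_001_), c(_002_)
print(fill_placeholders("{a}(_____)", "_____", [7]))
# {a}(_007_)
```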
I have a few lines of string e.g:
AR0003242303
TR0402304004
CR0402340404
I want to create a dictionary from these lines.
And I need to change them with a regex to:
KOLAORM0003242303
KOLTORM0402304004
KOLCORM0402340404
So I need to split off the first two characters, put KOL before them, put O between them, and put M after the second character. How can I achieve this? After many attempts I have lost patience with regex, and unfortunately I have no time to learn it better right now. I need a result now :(
Could someone help me with this case?
Using re.sub --> re.sub(r"^([A-Z])([A-Z])", r"KOL\1O\2M", string)
Ex:
import re
s = ["AR0003242303", "TR0402304004", "CR0402340404"]
for i in s:
    print(re.sub(r"^([A-Z])([A-Z])", r"KOL\1O\2M", i))
Output:
KOLAORM0003242303
KOLTORM0402304004
KOLCORM0402340404
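Since the question also mentions building a dictionary, one possible reading is a mapping from each original line to its converted form. That interpretation is an assumption; the sketch simply reuses the substitution above in a dict comprehension.

```python
import re

PREFIX = re.compile(r"^([A-Z])([A-Z])")  # the two leading capital letters

lines = ["AR0003242303", "TR0402304004", "CR0402340404"]

# Assumed layout: original line -> converted line
converted = {line: PREFIX.sub(r"KOL\1O\2M", line) for line in lines}

for original, new in converted.items():
    print(original, "->", new)
```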
You don't need regex for this. You can get the list of characters from the string, rebuild the list with the extra pieces, and join it back into a string:
def get_convert_s(s):
    li = list(s)
    # 'O' is the letter O, as in the requested output
    li = ['KOL', li[0], 'O', li[1], 'M', *li[2:]]
    return ''.join(li)
print(get_convert_s('AR0003242303'))
#KOLAORM0003242303
print(get_convert_s('TR0402304004'))
#KOLTORM0402304004
print(get_convert_s('CR0402340404'))
#KOLCORM0402340404
import re
regex = re.compile(r"([A-Z])([A-Z])([0-9]+)")
inputs = [
    'AR0003242303',
    'TR0402304004',
    'CR0402340404'
]
results = []
for line in inputs:
    # line is used instead of input to avoid shadowing the builtin
    matches = regex.match(line)
    groups = matches.groups()
    results.append('KOL{}O{}M{}'.format(*groups))
print(results)
Assuming the length of the strings in your list will always be the same, Devesh's answer is pretty much the best approach (no reason to overcomplicate it).
My solution is similar to Devesh's, I just like writing functions as one-liners:
strings = ["AR0003242303", "TR0402304004", "CR0402340404"]
def convert_s(s):
    return "KOL" + s[0] + "O" + s[1] + "M" + s[2:]
for s in strings:
    print(convert_s(s))
It returns the same output, though.
Python 3:
Given a string (an equation), return a list of positive and negative integers.
I've tried various regex and list comprehension solutions to no avail.
Given an equation 4+3x or -5+2y or -7y-2x
Returns: [4,3], [-5,2], [-7,-2]
input
str = '-7y-2x'
output
my_list = [-7, -2]
Simple solution using re.findall function:
import re
s = '-5+2y'
result = [int(d) for d in re.findall(r'-?\d+', s)]
print(result)
The output:
[-5, 2]
-?\d+ - matches positive and negative integers
Raw string notation (r"text") keeps regular expressions sane. Without it, every backslash ('\') in a regular expression would have to be prefixed with another one to escape it.
This regex should solve your problem.
[\+\-]?[0-9]+
Also, here is some code that goes with it.
import re
regex = re.compile(r'[\+\-]?[0-9]+')
nums = [int(k) for k in regex.findall('5-21x')]
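Applied to the three equations from the question, the pattern behaves as follows. This is a sketch; the helper name is illustrative, and it assumes every term has an explicit numeric coefficient (a bare x or -y contributes nothing to the result).

```python
import re

SIGNED_INT = re.compile(r"[+-]?\d+")  # optional sign followed by digits


def coefficients(equation):
    """Return the signed integers found in the equation string, in order."""
    return [int(token) for token in SIGNED_INT.findall(equation)]


for eq in ("4+3x", "-5+2y", "-7y-2x"):
    print(eq, coefficients(eq))
# 4+3x [4, 3]
# -5+2y [-5, 2]
# -7y-2x [-7, -2]
```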
I have a list of strings
['time_10', 'time_23', 'time_345', 'date_10', 'date_23', 'date_345']
I want to use regular expression to get strings that end with a specific number.
As I understand it, I first have to combine all strings from the list into one large string and then form some kind of pattern to use for the regular expression.
I would be grateful if you could provide
regex(some_pattern, some_string)
that would return
['time_10', 'date_10']
or just
'time_10, date_10'
str.endswith is enough.
l = ['time_10', 'time_23', 'time_345', 'date_10', 'date_23', 'date_345']
result = [s for s in l if s.endswith('10')]
print(result)
['time_10', 'date_10']
If you insist on using regex,
import re
result = [s for s in l if re.search('10$', s)]
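Both versions match on the trailing characters only, so a string such as 'time_110' would also be selected when looking for 10. If that matters, a negative lookbehind can require that the number is not preceded by another digit. A sketch, with a hypothetical helper name:

```python
import re


def ending_with_number(strings, number):
    """Keep strings whose trailing number is exactly `number`."""
    # (?<!\d) rejects matches that are only the tail of a longer number
    pattern = re.compile(rf"(?<!\d){number}$")
    return [s for s in strings if pattern.search(s)]


l = ['time_10', 'time_23', 'time_345', 'date_10', 'date_23', 'date_345', 'time_110']
result = ending_with_number(l, 10)
print(result)             # ['time_10', 'date_10']
print(", ".join(result))  # time_10, date_10
```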
How do I add a dot into a Python list?
For example
groups = [0.122, 0.1212, 0.2112]
If I want to output this data, how would I make it so it is like
122, 1212, 2112
I tried write(groups...[0]) and further research but didn't get far. Thanks.
Thank you.
[str(g).split(".")[1] for g in groups]
results in
['122', '1212', '2112']
Edit:
Use it like this:
groups = [0.122, 0.1212, 0.2112]
decimals = [str(g).split(".")[1] for g in groups]
You could use a list comprehension and return a list of strings
groups = [0.122, 0.1212, 0.2112]
[str(x).split(".")[1] for x in groups]
Result
['122', '1212', '2112']
The list comprehension is doing the following:
Turn each list element into a string
Split the string at the "." character
Return the substring to the right of the split
Return a list based on the above logic
This should do it:
groups = [0.122, 0.1212, 0.2112]
import re
groups_str = ", ".join([str(x) for x in groups])
re.sub('[0-9]*[.]', "", groups_str)
[str(x) for x in groups] will make strings of the items.
", ".join will connect the items, as a string.
import re gives access to the regular-expression functions, such as re.sub.
re.sub then replaces every run of digits followed by a dot with an empty string.
EDIT (no extra modules):
Building on Lutz's answer, this also works when the list contains an integer (no dot):
decimals = [str(g).split("0.") for g in groups]
decimals = [i for x in decimals for i in x if i != '']
It won't work though when you have numbers like 11.11, where there is a part you don't want to ignore in front of the dot.
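For inputs such as 11.11 or plain integers, str.partition may be a simpler building block, since it always returns three parts. This sketch (hypothetical helper name) keeps only the digits after the dot and returns an empty string when there is no fractional part, which is a different choice from the snippet above; it also assumes str() does not switch to scientific notation for the value.

```python
def fractional_digits(value):
    """Return the digits after the decimal point of `value` as a string."""
    _, _, fraction = str(value).partition(".")  # fraction is "" when there is no dot
    return fraction


groups = [0.122, 0.1212, 0.2112, 11.11, 5]
print([fractional_digits(g) for g in groups])
# ['122', '1212', '2112', '11', '']
```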