finding an indefinite sequence of numbers and dashes regex python - python

So I have a bunch of strings that contain a sequence of numbers and dashes:
strings = [
'32sdjhsdjhsdjb20-11-3kjddjsdsdj435',
'jdhjhdahj200-19-39-2-12-2jksjfkfjkdf3345',
'1232sdsjsdkjsop99-7-21sdjsdjsdj',
]
I have a function:
def get_nums():
    for string in strings:
        print(re.findall('\d+-\d+', string))
I want this function to return the following:
['20-11-3']
['200-19-39-2-12-2']
['99-7-21']
But my function returns:
['20-11']
['200-19', '39-2', '12-2']
['99-7']
I have no idea how to return the full sequence of numbers and dashes.
The sequences will always begin and end with numbers, never dashes. If there are no dashes between the numbers they should not be returned.
How can I use regex to return these sequences? Is there an alternative to regex that would be better here?

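def get_nums():
    for string in strings:
        print(re.findall(r'\d+(?:-\d+)+', string))
The group needs to be non-capturing, (?:…) rather than just (…), because re.findall returns the contents of the capturing group instead of the whole match when the pattern contains one. See https://medium.com/#yeukhon/non-capturing-group-in-pythons-regular-expression-75c4a828a9eb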
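A small comparison may make the difference concrete. This is only a sketch; the variable names sample, capturing and non_capturing are made up for illustration.

```python
import re

sample = 'jdhjhdahj200-19-39-2-12-2jksjfkfjkdf3345'

# Capturing group: findall returns only the last repetition of the group.
capturing = re.findall(r'\d+(-\d+)+', sample)

# Non-capturing group: findall returns the full match.
non_capturing = re.findall(r'\d+(?:-\d+)+', sample)

print(capturing)      # ['-2']
print(non_capturing)  # ['200-19-39-2-12-2']
```

The trailing 3345 is not returned in either case because it contains no dash.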

import re
strings = [
'32sdjhsdjhsdjb20-11-3kjddjsdsdj435',
'jdhjhdahj200-19-39-2-12-2jksjfkfjkdf3345',
'1232sdsjsdkjsop99-7-21sdjsdjsdj',
]
def get_nums():
    for string in strings:
        print(re.search(r'\d+(-\d+)+', string).group(0))
get_nums()
Output:
20-11-3
200-19-39-2-12-2
99-7-21

Related

split string on any special character using python

Currently my strings can contain many different separators, for example:
new_123_12313131
new$123$12313131
new#123#12313131
and so on. I want to check whether the string contains a special character and, if it does, get only the value after the last separator. In these examples that would be 12313131.
This is a good use case for isdigit():
l = [
'new_123_12313131',
'new$123$12313131',
'new#123#12313131',
]
output = []
for s in l:
    temp = ''
    for char in s:
        if char.isdigit():
            temp += char
    output.append(temp)
print(output)
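Result: ['12312313131', '12312313131', '12312313131']
Note that this concatenates every digit in the string, so the result also includes the digits that come before the last separator.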
Assuming you define a 'special character' as anything that's not alphanumeric, you can use str.isalnum() to find the first special character from the end and split on it, something like this:
def split_non_special(input) -> str:
    """
    Find first special character starting from the end and get the last piece
    """
    for i in reversed(input):
        if not i.isalnum():
            return input.split(i)[-1]  # return as soon as a separator is found
    return ''  # no separator found
# inputs = ['new_123_12313131', 'new$123$12313131', 'new#123#12313131', 'eefwfwrfwfwf3243']
# outputs = [split_non_special(input) for input in inputs]
# ['12313131', '12313131', '12313131', ''] # outputs
"just get value after last separator"
The more obvious way is to use re.findall:
from re import findall
text = 'new_123_12313131'
findall(r'\d+$', text)  # ['12313131']
Python provides what you seem to consider "special" characters in the string module as string.punctuation. These are the characters:
!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
Using that in conjunction with the re module you can do this:
from string import punctuation
import re
re.split(f"[{re.escape(punctuation)}]", my_string)  # re.escape keeps ], \, ^ and - literal inside the character class
my_string being the string you want to split.
Results for your examples
['new', '123', '12313131']
To get just the runs of digits you can use:
re.findall(r"\d+", my_string)
Results:
['123', '12313131']
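To go straight to the value after the last separator, the split can be combined with indexing. This is a minimal sketch that assumes a "separator" is any run of characters other than ASCII letters and digits; the function name value_after_last_separator is made up for illustration.

```python
import re

def value_after_last_separator(s):
    # Split on runs of non-alphanumeric ASCII characters and keep the last piece.
    return re.split(r'[^A-Za-z0-9]+', s)[-1]

examples = ['new_123_12313131', 'new$123$12313131', 'new#123#12313131']
results = [value_after_last_separator(s) for s in examples]
print(results)  # ['12313131', '12313131', '12313131']
```

If the string contains no separator at all, this returns the whole string, so check for that case separately if it matters.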

Filtering a list of strings using regex

I have a list of strings that looks like this,
strlist = [
'list/category/22',
'list/category/22561',
'list/category/3361b',
'list/category/22?=1512',
'list/category/216?=591jf1!',
'list/other/1671',
'list/1y9jj9/1yj32y',
'list/category/91121/91251',
'list/category/0027',
]
I want to use regex to find the strings in this list that consist of list/category/ followed by an integer of any length and nothing else: there must be no letters or symbols after the integer.
So in my example, the output should look like this
list/category/22
list/category/22561
list/category/0027
I used the following code:
newlist = []
for i in strlist:
    if re.match('list/category/[0-9]+[0-9]',i):
        newlist.append(i)
        print(i)
but this is my output:
list/category/22
list/category/22561
list/category/3361b
list/category/22?=1512
list/category/216?=591jf1!
list/category/91121/91251
list/category/0027
How do I fix my regex? And also is there a way to do this in one line using a filter or match command instead of a for loop?
You can try the below regex:
^list\/category\/\d+$
Explanation of the above regex:
^ - Represents the start of the given test String.
\d+ - Matches digits that occur one or more times.
$ - Matches the end of the test string. This is the part your regex missed.
IMPLEMENTATION IN PYTHON
import re
pattern = re.compile(r"^list\/category\/\d+$", re.MULTILINE)
match = pattern.findall("list/category/22\n"
                        "list/category/22561\n"
                        "list/category/3361b\n"
                        "list/category/22?=1512\n"
                        "list/category/216?=591jf1!\n"
                        "list/other/1671\n"
                        "list/1y9jj9/1yj32y\n"
                        "list/category/91121/91251\n"
                        "list/category/0027")
print(match)
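The question also asks for a one-line version that works on the list directly. One possible sketch uses re.fullmatch (Python 3.4+), which only succeeds when the whole string matches, so the ^ and $ anchors are not needed. The names pattern and newlist are reused from the question for illustration.

```python
import re

strlist = [
    'list/category/22',
    'list/category/22561',
    'list/category/3361b',
    'list/category/22?=1512',
    'list/category/216?=591jf1!',
    'list/other/1671',
    'list/1y9jj9/1yj32y',
    'list/category/91121/91251',
    'list/category/0027',
]

pattern = re.compile(r'list/category/\d+')

# Keep only the strings that match the pattern from start to end.
newlist = [s for s in strlist if pattern.fullmatch(s)]
print(newlist)  # ['list/category/22', 'list/category/22561', 'list/category/0027']
```

The same result can be written as list(filter(pattern.fullmatch, strlist)) if a filter call is preferred.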

Convert Unicode char code to char on Python

I have a list of Unicode character codes I need to convert into chars on python 2.7.
U+0021
U+0022
U+0023
.......
U+0024
How to do that?
This regular expression will replace all U+nnnn sequences with the corresponding Unicode character:
import re
s = u'''\
U+0021
U+0022
U+0023
.......
U+0024
'''
s = re.sub(ur'U\+([0-9A-F]{4})',lambda m: unichr(int(m.group(1),16)),s)
print(s)
Output:
!
"
#
.......
$
Explanation:
unichr gives the character of a codepoint, e.g. unichr(0x21) == u'!'.
int('0021',16) converts a hexadecimal string to an integer.
lambda m: expression is an anonymous function that receives the regex match.
It defines a function equivalent to def func(m): return expression but inline.
re.sub matches a pattern and sends each match to a function that returns the replacement. In this case, the pattern is U+hhhh where h is a hexadecimal digit, and the replacement function converts the hexadecimal digit string into a Unicode character.
In case anyone using Python 3 wonders how to do this, I'll leave this post here for reference, since I didn't realize the author was asking about Python 2.7.
Just use the built-in python function chr():
char = chr(0x2474)
print(char)
Output:
⑴
Remember that the four digits in the Unicode code point notation U+WXYZ stand for the hexadecimal number WXYZ, which in Python is written as 0xWXYZ.
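For a whole block of U+nnnn codes in Python 3, the re.sub approach from the answer above can be combined with chr(). This is a sketch under the assumption that every code has exactly four hexadecimal digits; the function name decode_codepoints is made up for illustration.

```python
import re

def decode_codepoints(text):
    # Replace each U+hhhh sequence with the character it names.
    return re.sub(
        r'U\+([0-9A-Fa-f]{4})',
        lambda m: chr(int(m.group(1), 16)),
        text,
    )

s = "U+0021\nU+0022\nU+0023\nU+0024\n"
decoded = decode_codepoints(s)
print(decoded)
```

For the four codes above this prints !, ", # and $ on separate lines.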
The code below takes every Unicode string in the list and encodes it to ASCII, ignoring any characters that cannot be encoded.
for I in list:
    print(I.encode('ascii', 'ignore'))
a = 'U+aaa'
a.encode('ascii','ignore')
'aaa'
This converts from Unicode to ASCII by dropping any non-ASCII characters, which I think is what you want.

find substring from list - python

I have a list with elements I would like to remove from a string:
Example
list = ['345','DEF', 'QWERTY']
my_string = '12345XYZDEFABCQWERTY'
Is there a way to iterate over the list and find where its elements occur in the string? My final objective is to remove those elements from the string (I don't know if this is the proper way, since strings are immutable).
You could use a regex union:
import re
def delete_substrings_from_string(substrings, text):
    pattern = re.compile('|'.join(map(re.escape, substrings)))
    return re.sub(pattern, '', text)
print(delete_substrings_from_string(['345', 'DEF', 'QWERTY'], '12345XYZDEFABCQWERTY'))
# 12XYZABC
print(delete_substrings_from_string(['AA', 'ZZ'], 'ZAAZ'))
# ZZ
It uses re.escape so that the substrings are matched literally rather than interpreted as regex syntax.
It uses only one pass so it should be reasonably fast and it ensures that the second example isn't converted to an empty string.
If you want a faster solution, you could build a Trie-based regex out of your substrings.
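The point about the single pass can be checked by comparing it with chained str.replace calls. This is an illustrative sketch; the function names remove_in_one_pass and remove_sequentially are made up here.

```python
import re

def remove_in_one_pass(substrings, text):
    # One regex alternation: text exposed by a removal is not scanned again.
    pattern = re.compile('|'.join(map(re.escape, substrings)))
    return pattern.sub('', text)

def remove_sequentially(substrings, text):
    # One replace per substring: a later replace sees the result of earlier ones.
    for sub in substrings:
        text = text.replace(sub, '')
    return text

one_pass = remove_in_one_pass(['AA', 'ZZ'], 'ZAAZ')
sequential = remove_sequentially(['AA', 'ZZ'], 'ZAAZ')
print(repr(one_pass))    # 'ZZ'
print(repr(sequential))  # ''
```

In the sequential version, removing AA joins the two Z characters, and the next replace then deletes the resulting ZZ.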

How to extract just the characters "abc-3456" from the given text in python

I have this code:
import re
text = "this is my desc abc-3456"
m = re.findall("\w+\\-\d+", text)
print m
This prints ['abc-3456'] but I want to get only abc-3456 (without the square brackets and the quotes).
How to do this?
import re
text = "this is my desc abc-3456"
m = re.findall("\w+\\-\d+", text)
print m[0]
re.findall(pattern, string, flags=0)
Return all non-overlapping matches of pattern in string, as a list of strings.
findall returns list of strings. If you want the first one then use m[0].
print m[0] will give string without [] and ''.
If you only want the first (or only) result, do this:
import re
text = "this is my desc abc-3456"
m = re.search("\w+\\-\d+", text)
print m.group()
re.findall returns a list of matches. Each result in that list is a string. You can use re.finditer if you want match objects instead.
In Python, a list's representation is in brackets: [member1, member2, ...].
A string ("somestring") representation is in quotes: 'somestring'.
This means the representation of a list of strings is:
['somestring1', 'somestring2', ...]
So you have a string in a list. The characters you want to remove are part of Python's representation of the list, not part of the data you have.
To get the string simply take the first element from the list:
mystring = m[0]
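The snippets above use the Python 2 print statement. A Python 3 sketch of the re.search variant, which also covers the case where nothing matches, might look like this (the variable name code is made up for illustration):

```python
import re

text = "this is my desc abc-3456"

# re.search returns a match object, or None when the pattern is not found.
m = re.search(r"\w+-\d+", text)
code = m.group() if m else None
print(code)  # abc-3456
```

Checking for None first avoids an AttributeError when the text contains no such token.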
