Extracting alphanumeric substring from a string in python - python

I have a string in Python:
text = '(b)'
I want to extract the 'b'. I could strip the first and last characters of the string, but I won't do that because the text may also be '(a)', '(iii)', 'i)', '(1' or '(2)'. Sometimes there are no parentheses at all, but the string will always contain an alphanumeric value, and that value is what I want to retrieve.
This has to be done in one line, or in a block of code that returns just the value, because it will be used iteratively in multiple situations.
What is the best way to do that in Python?

I don't think regex is needed here. You can just strip off any parentheses with str.strip:
>>> text = '(b)'
>>> text.strip('()')
'b'
>>> text = '(iii)'
>>> text.strip('()')
'iii'
>>> text = 'i)'
>>> text.strip('()')
'i'
>>> text = '(1'
>>> text.strip('()')
'1'
>>> text = '(2)'
>>> text.strip('()')
'2'
>>> text = 'a'
>>> text.strip('()')
'a'
>>>
Regarding @MikeMcKerns' comment, a more robust solution would be to pass string.punctuation to str.strip:
>>> from string import punctuation
>>> punctuation # Just to demonstrate
'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
>>>
>>> text = '*(ab2**)'
>>> text.strip(punctuation)
'ab2'
>>>

You could do this with Python's re module:
>>> import re
>>> text = '(5a)'
>>> match = re.search(r'\(?([0-9A-Za-z]+)\)?', text)
>>> match.group(1)
'5a'
>>> text = '*(ab2**)'
>>> match = re.search(r'\(?([0-9A-Za-z]+)\)?', text)
>>> match.group(1)
'ab2'

Not fancy, but this is pretty generic
>>> import string
>>> ''.join(i for i in text if i in string.ascii_letters+'0123456789')
This works for any combination of parentheses, including ones in the middle of the string, and also when other non-alphanumeric characters (aside from the parentheses) are present.
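As a minimal sketch of the same filtering idea with a concrete input: the answer does not show a value for text, so the one below is an assumption, and keep_alnum is a hypothetical helper name. The second print swaps in str.isalnum, which is shorter but also accepts non-ASCII letters and digits.

```python
import string

# ASCII letters and digits only, as in the answer above.
ALNUM = set(string.ascii_letters + string.digits)

def keep_alnum(text):
    """Drop every character that is not an ASCII letter or digit."""
    return ''.join(ch for ch in text if ch in ALNUM)

text = '(a)(b2)*c'  # assumed sample input
print(keep_alnum(text))                            # ab2c
print(''.join(ch for ch in text if ch.isalnum()))  # ab2c
```

Note that this joins all alphanumeric characters together, so '(a)(b2)' becomes one value rather than two.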

re.match(r'\(?([a-zA-Z0-9]+)', text).group(1)
For the example inputs you provided, it would be:
>>> a=['(a)', '(iii)', 'i)', '(1' , '(2)']
>>> [ re.match(r'\(?([a-zA-Z0-9]+)', text).group(1) for text in a ]
['a', 'iii', 'i', '1', '2']
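The regex answers call .group(1) directly, which raises AttributeError when the string has no alphanumeric characters at all. A hedged sketch that returns None in that case, next to the strip-based approach for comparison; first_alnum and the extra '()' sample are hypothetical additions.

```python
import re
from string import punctuation

ALNUM_RUN = re.compile(r'[A-Za-z0-9]+')

def first_alnum(text):
    """Return the first run of ASCII letters/digits, or None if there is none."""
    match = ALNUM_RUN.search(text)
    return match.group(0) if match else None

samples = ['(a)', '(iii)', 'i)', '(1', '(2)', 'a', '()']
print([first_alnum(t) for t in samples])
# ['a', 'iii', 'i', '1', '2', 'a', None]

# str.strip never raises, but yields '' when nothing is left:
print([t.strip(punctuation) for t in samples])
# ['a', 'iii', 'i', '1', '2', 'a', '']
```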

Related

How can I replace all occurrences of a substring using regex?

I have a string, s = 'sdfjoiweng%#$foo$fsoifjoi', and I would like to replace 'foo' with 'bar'.
I tried re.sub(r'\bfoo\b', 'bar', s) and re.sub(r'[foo]', 'bar', s), but it doesn't do anything. What am I doing wrong?
You can replace it directly:
>>> import re
>>> s = 'sdfjoiweng%#$foo$fsoifjoi'
>>> print(re.sub('foo','bar',s))
sdfjoiweng%#$bar$fsoifjoi
It will also work for more occurrences of foo like below:
>>> s = 'sdfjoiweng%#$foo$fsoifoojoi'
>>> print(re.sub('foo','bar',s))
sdfjoiweng%#$bar$fsoibarjoi
If you want to replace only standalone occurrences of foo, rather than every foo in the string, alecxe's answer does that. To replace only the first occurrence, pass count=1 to re.sub.
re.sub(r'\bfoo\b', 'bar', s)
Here, \b matches a word boundary: a position between a word character (\w) and a non-word character. That is exactly what surrounds foo in the string sdfjoiweng%#$foo$fsoifjoi, since $ is a non-word character. It works for me:
In [1]: import re
In [2]: s = 'sdfjoiweng%#$foo$fsoifjoi'
In [3]: re.sub(r'\bfoo\b', 'bar', s)
Out[3]: 'sdfjoiweng%#$bar$fsoifjoi'
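To make the difference between these variants concrete, here is a small sketch on an assumed input that contains foo three times: once between $ signs, once inside a longer word, and once as a separate word.

```python
import re

s = 'sdfjoiweng%#$foo$fsoifoojoi foo'  # assumed sample input

# Plain pattern: every occurrence, including the one inside 'fsoifoojoi'.
all_hits = re.sub('foo', 'bar', s)
print(all_hits)     # sdfjoiweng%#$bar$fsoibarjoi bar

# Word boundaries: only occurrences not glued to other word characters.
whole_words = re.sub(r'\bfoo\b', 'bar', s)
print(whole_words)  # sdfjoiweng%#$bar$fsoifoojoi bar

# count=1: only the first occurrence.
first_only = re.sub('foo', 'bar', s, count=1)
print(first_only)   # sdfjoiweng%#$bar$fsoifoojoi foo
```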
You can use the str.replace method directly instead of a regex:
>>> s = 'sdfjoiweng%#$foo$fsoifjoifoo'
>>>
>>> s.replace("foo","bar")
'sdfjoiweng%#$bar$fsoifjoibar'
>>>
>>>
To add to the above, the code below shows how to replace multiple words at once. I've used this to replace 165,000 words in one step.
Note that \b restricts the match to whole words; if you remove it, the pattern also matches substrings. The pattern must be a raw string, otherwise '\b' is read as a backspace character rather than a word boundary.
import re
s = 'thisis a test'
re.sub(r'\bthis\b|test', '', s)
This gives:
'thisis a '
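When each word needs its own replacement, one common pattern is a single compiled alternation plus a lookup in the substitution callback. This is a sketch under assumptions: the replacements mapping and the replace_words name are hypothetical, and it is not necessarily how the 165,000-word case above was done.

```python
import re

# Hypothetical mapping of whole words to their replacements.
replacements = {'this': 'that', 'test': 'trial'}

# One alternation of all (escaped) words, anchored on word boundaries.
pattern = re.compile(
    r'\b(?:' + '|'.join(re.escape(word) for word in replacements) + r')\b'
)

def replace_words(text):
    """Replace every whole-word key of `replacements` in a single pass."""
    return pattern.sub(lambda m: replacements[m.group(0)], text)

print(replace_words('thisis a test, this too'))
# thisis a trial, that too
```

'thisis' is left alone because the \b after 'this' does not match in the middle of a word.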

Python: How to remove [' and ']?

I want to remove [' from start and '] characters from the end of a string.
This is my text:
"['45453656565']"
I need to have this text:
"45453656565"
I've tried to use str.replace
text = text.replace("['","");
but it does not work.
You need to strip your text by passing the unwanted characters to str.strip() method:
>>> s = "['45453656565']"
>>>
>>> s.strip("[']")
'45453656565'
Or, if you want to convert it to an integer, you can simply pass the stripped result to the int function:
>>> try:
...     val = int(s.strip("[']"))
... except ValueError:
...     print("Invalid string")
...
>>> val
45453656565
Using re.sub:
>>> my_str = "['45453656565']"
>>> import re
>>> re.sub(r"['\]\[]", "", my_str)
'45453656565'
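Since the text looks like the repr of a Python list, another option, not mentioned in the answers here, is to parse it with ast.literal_eval instead of removing characters. This is a sketch that assumes the string is always a valid Python literal; literal_eval raises ValueError or SyntaxError otherwise.

```python
import ast

text = "['45453656565']"

# Safely evaluate the literal; the result is a real list of strings.
items = ast.literal_eval(text)
print(items)          # ['45453656565']
print(items[0])       # 45453656565
print(int(items[0]))  # 45453656565
```

This also handles lists with more than one element, which the strip-based approach does not.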
You could loop over the characters and keep only the digits:
>>> number_array = "['34325235235']"
>>> int(''.join(c for c in number_array if c.isdigit()))
34325235235
This solution works for both "['34325235235']" and '["34325235235"]', and for any other combination of digits and other characters.
You can also import the re module and use a regular expression to get it:
>>> import re
>>> theString = "['34325235235']"
>>> int(re.sub(r'\D', '', theString)) # Optionally parse to int
Instead of hacking your data by stripping brackets, you should edit the script that created it to print out just the numbers. E.g., instead of lazily doing
output.write(str(mylist))
you can write
for elt in mylist:
    output.write(elt + "\n")
Then when you read your data back in, it'll contain the numbers (as strings) without any quotes, commas or brackets.
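A minimal round-trip sketch of that suggestion. The sample values are assumptions, and io.StringIO stands in for a real file so the example is self-contained.

```python
import io

mylist = ['45453656565', '34325235235']  # assumed sample data (strings)

# Stand-in for a real file opened with open(path, 'w').
output = io.StringIO()
for elt in mylist:
    output.write(elt + "\n")

# Read the data back: one value per line, no quotes or brackets to remove.
output.seek(0)
numbers = [line.strip() for line in output]
print(numbers)  # ['45453656565', '34325235235']
```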

regex to match a word and everything after it?

I need to extract some HTTP data as a string from an HTTP packet that I have in string format. I am trying to use the regular expression below to match 'data:' and everything after it, but it's not working. I am new to regex and Python.
>>> import re
>>> pat = re.compile(r'(?:/bdata:/b)?\w$')
>>> string = " dnfhndkn data: ndknfdjoj pop"
>>> res = re.match(pat, string)
>>> print(res)
None
re.match matches only at the beginning of the string. Use re.search to match at any position. (See search() vs. match())
>>> import re
>>> pat = re.compile(r'(?:/bdata:/b)?\w$')
>>> string = " dnfhndkn data: ndknfdjoj pop"
>>> res = re.search(pat,string)
>>> res
<_sre.SRE_Match object at 0x0000000002838100>
>>> res.group()
'p'
To match everything after 'data:', replace \w with .*. Also remove /b: it matches a literal slash followed by b, not a word boundary (which is written \b).
>>> import re
>>> pat = re.compile(r'(?:data:).*$')
>>> string = " dnfhndkn data: ndknfdjoj pop"
>>> res = re.search(pat, string)
>>> print(res.group())
data: ndknfdjoj pop
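The difference between re.match and re.search, plus a capture group for only the part after 'data:', can be sketched as follows. The variable names are illustrative.

```python
import re

text = " dnfhndkn data: ndknfdjoj pop"
pat = re.compile(r'data:.*')

# match() only tries position 0, where the text starts with a space.
print(pat.match(text))           # None

# search() scans forward until the pattern fits.
print(pat.search(text).group())  # data: ndknfdjoj pop

# A capture group returns only what follows 'data:' (leading spaces skipped).
after = re.search(r'data:\s*(.*)', text)
print(after.group(1))            # ndknfdjoj pop
```

By default . does not match a newline, so for a payload that spans several lines you would pass re.DOTALL.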
No need for a regular expression here. You can just slice the string:
>>> string
' dnfhndkn data: ndknfdjoj pop'
>>> string.index('data')
10
>>> string[string.index('data'):]
'data: ndknfdjoj pop'
str.index('data') returns the point in the string where the substring data is found. The slice from this position to the end string[10:] gives you the part of the string you are interested in.
By the way, string is a potentially problematic variable name if you are planning on using the string module at any point...
You can also just do:
string.split("data:")[1]
assuming "data:" appears exactly once in each string. This returns only the text after "data:".
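A related option is str.partition, which splits on the first occurrence only and does not raise when the marker is missing. A small sketch, with text used as the variable name to avoid shadowing the string module:

```python
text = " dnfhndkn data: ndknfdjoj pop"

# partition() returns (before, separator, after) for the first occurrence.
before, sep, after = text.partition("data:")
print(repr(after))  # ' ndknfdjoj pop'
print(sep + after)  # data: ndknfdjoj pop

# When the marker is missing, sep and after are empty strings.
missing = " no marker here".partition("data:")
print(missing)      # (' no marker here', '', '')
```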

Split string without non-characters

I'm trying to split a string that looks like this for example:
':foo [bar]'
Using str.split() on this of course returns [':foo','[bar]']
But how can I make it return just ['foo','bar'] containing only these characters?
I don't like regular expressions, but do like Python, so I'd probably write this as
>>> s = ':foo [bar]'
>>> ''.join(c for c in s if c.isalnum() or c.isspace())
'foo bar'
>>> ''.join(c for c in s if c.isalnum() or c.isspace()).split()
['foo', 'bar']
The ''.join idiom is a little strange, I admit, but you can almost read the rest in English: "join every character for the characters in s if the character is alphanumeric or the character is whitespace, and then split that".
Alternatively, if you know that the symbols you want to remove will always be on the outside and the word will still be separated by spaces, and you know what they are, you might try something like
>>> s = ':foo [bar]'
>>> s.split()
[':foo', '[bar]']
>>> [word.strip(':[]') for word in s.split()]
['foo', 'bar']
Do str.split() as normal, and then process each element to remove the non-letters. Something like:
>>> my_string = ':foo [bar]'
>>> parts = [''.join(c for c in s if c.isalpha()) for s in my_string.split()]
>>> parts
['foo', 'bar']
You'll have to pass through the list [':foo', '[bar]'] and strip out all non-letter characters, using regular expressions. See "Regex replace (in Python) - a simpler way?" for examples and references to documentation.
You have to try regular expressions.
Use re.sub() to remove the :, [ and ] characters, and then split the resulting string using a space as the delimiter.
>>> st = ':foo [bar]'
>>> import re
>>> new_st = re.sub(r'[\[\]:]','',st)
>>> new_st.split(' ')
['foo', 'bar']
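If a regular expression is acceptable, re.findall can do the cleaning and the splitting in one step by matching the wanted runs of characters instead of removing the unwanted ones. A minimal sketch; the second input is an assumed example.

```python
import re

s = ':foo [bar]'

# Each run of ASCII letters/digits becomes one list element.
words = re.findall(r'[A-Za-z0-9]+', s)
print(words)  # ['foo', 'bar']

# \w+ also keeps underscores (and non-ASCII word characters).
print(re.findall(r'\w+', ':foo_1 [bar]'))  # ['foo_1', 'bar']
```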

python string pattern matching

new_str="##2##*##1"
new_str1="##3##*##5##7"
How can I split the strings above in Python?
for val in new_str.split("##*"):
    logging.debug("=======")
    logging.debug(val[2:])  # will give
    for st in val.split("##*"):
        pass  # how do I get the values after ## in new_str and new_str1?
I don't understand the question.
Are you trying to split a string by a delimiter? Then use split:
>>> a = "##2##*##1"
>>> b = "##3##*##5##7"
>>>
>>> a.split("##*")
['##2', '##1']
>>> b.split("##*")
['##3', '##5##7']
Are you trying to strip extraneous characters from a string? Then use strip:
>>> c = b.split("##*")[1]
>>> c
'##5##7'
>>> c.strip("#")
'5##7'
Are you trying to remove all the hashes (#) from a string? Then use replace:
>>> c.replace("#","")
'57'
Are you trying to find all the characters after the last "##"? Then use rsplit with its optional maxsplit argument to split only once, from the right:
>>> a.rsplit("##",1)
['##2##*', '1']
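If the goal is the value after every "##", which is one possible reading of the question, a sketch with re.findall could look like this. The helper name values_after_hashes is hypothetical, and the values are assumed to consist of word characters.

```python
import re

new_str = "##2##*##1"
new_str1 = "##3##*##5##7"

def values_after_hashes(text):
    """Return each run of word characters that directly follows '##'."""
    return re.findall(r'##(\w+)', text)

print(values_after_hashes(new_str))   # ['2', '1']
print(values_after_hashes(new_str1))  # ['3', '5', '7']

# Grouped per "##*"-separated chunk:
grouped = [values_after_hashes(chunk) for chunk in new_str1.split("##*")]
print(grouped)  # [['3'], ['5', '7']]
```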
