Extract letters (and a specific number) from a string - python

I have a list of strings similar to the one below:
l = ['ad2g3f234','4jafg32','fg23g523']
For each string in l, I want to delete every digit (except for 2 and 3 if they appear as 23). So in this case, I want the following outcome:
n = ['adgf23','jafg','fg23g23']
How do I go about this? I tried re.findall like:
w = [re.findall(r'[a-zA-Z]+',t) for t in l]
but it doesn't give my desired outcome.

You can capture 23 in a group, and remove all other digits. In the replacement, use the group which holds 23 if it is there, else replace with an empty string.
import re
l = ['ad2g3f234', '4jafg32', 'fg23g523']
result = [
    re.sub(
        r"(23)|(?:(?!23)\d)+",
        lambda m: m.group(1) if m.group(1) else "",
        s,
    )
    for s in l
]
print(result)
Output
['adgf23', 'jafg', 'fg23g23']

One way would be just to replace the string twice:
[re.sub(r"\d", "", i.replace("23", "*")).replace("*", "23") for i in l]
Output:
['adgf23', 'jafg', 'fg23g23']

Use a placeholder with re.sub:
import re
l = ['ad2g3f234','4jafg32','fg23g523']
w = [re.sub('#', '23', re.sub(r'\d', '', re.sub('23', '#', t))) for t in l]
['adgf23', 'jafg', 'fg23g23']
EDIT
As Chris's answer shows, the approach is the same, although plain str.replace is a better choice than re.sub for the fixed strings.
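One caveat with the placeholder answers above: they quietly assume that the placeholder character ('*' or '#') never occurs in the input. Here is a minimal sketch that makes that assumption explicit. The function name strip_digits_keep_23 and the NUL placeholder are my own choices for illustration, not something from the answers.

```python
import re

def strip_digits_keep_23(s, placeholder="\0"):
    # Assumption: the placeholder must not occur in the input,
    # otherwise it would be turned into "23" at the end.
    if placeholder in s:
        raise ValueError("placeholder occurs in input")
    hidden = s.replace("23", placeholder)        # protect every "23"
    without_digits = re.sub(r"\d", "", hidden)   # drop all remaining digits
    return without_digits.replace(placeholder, "23")

l = ['ad2g3f234', '4jafg32', 'fg23g523']
w = [strip_digits_keep_23(t) for t in l]
print(w)
```

With '*' as the placeholder, an input such as 'a*b23' would come back as 'a23b23', which is why the check is there.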

Using re.sub with function
import re
def replace(m):
    if m.group() == '23':
        return m.group()
    else:
        return ''
l = ['ad2g3f234','4jafg32','fg23g523']
w = [re.sub(r'23|\d', replace, x) for x in l]
#w: ['adgf23', 'jafg', 'fg23g23']
Explanation
re.sub(r'23|\d', replace, x)
- checks first for 23, then for any single digit
- the replace function leaves a match of 23 untouched
- and replaces a single-digit match with the empty string.
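The order of the alternatives matters here, because the regex engine tries them left to right at each position. A small sketch that compares both orders (the helper name keep_23 and the variables good and bad are mine, just for the comparison):

```python
import re

def keep_23(match):
    # Keep the literal "23"; replace any other single digit with "".
    return match.group() if match.group() == "23" else ""

strings = ["ad2g3f234", "4jafg32", "fg23g523"]

# "23" is listed first, so it wins wherever both alternatives could match.
good = [re.sub(r"23|\d", keep_23, s) for s in strings]

# With \d first, the lone "2" is consumed before "23" is ever tried,
# so every digit ends up removed.
bad = [re.sub(r"\d|23", keep_23, s) for s in strings]

print(good)
print(bad)
```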

Related

How can I find a character with specific criteria?

I want to loop through a string and find a character that is not a letter, a digit, or one of _ . #. This is my code:
mystr = "saddas das"
for x in range(0, len(mystr)):
    if not(mystr[x].isdigit() or mystr[x].isalpha or mystr[x]=="#" or mystr[x]=="_" or mystr[x]=="."):
        print (x)
Unfortunately it doesn't detect anything, while it should return the index of the space.
for x in range(0, len(mystr)):
    if not(mystr[x].isdigit() or mystr[x].isalpha() or mystr[x]=="#" or mystr[x]=="_" or mystr[x]=="."):
        print (x)
You forgot the parentheses in mystr[x].isalpha. To call the method you need mystr[x].isalpha(). Without the parentheses, mystr[x].isalpha is the method object itself, which always evaluates as true, so the condition is never met and your code prints nothing.
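To see the difference for yourself, here is a quick check on the space character from the question (the variable name ch is just for illustration):

```python
mystr = "saddas das"
ch = mystr[6]  # the space

# Without parentheses this is the bound method object, which is always truthy.
print(bool(ch.isalpha))

# With parentheses the method is actually called and returns the real answer.
print(ch.isalpha())
```

The first line prints True for any character at all, the second prints False for the space.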
Use enumerate(), which returns both the position and the character as you iterate:
mystr = "saddas das"
for pos, c in enumerate(mystr):
    # change your conditions to make it easier to understand, isalpha() helps
    if c.isdigit() or c.isalpha() or c in "#_.":
        continue # do nothing
    else:
        print (pos)
Output:
6
Using a regex:
import re
pattern = re.compile(r'[^\d\w\.#]')
s = "saddas das"
for match in pattern.finditer(s):
    print(match.start())
Output
6
The pattern r'[^\d\w\.#]' matches every character that is not a digit, a letter, _, . or #.
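As a side note, \w already covers letters, digits and the underscore, and a dot needs no escaping inside a character class, so the class can probably be shortened. A sketch comparing the two (the names original and shorter are mine):

```python
import re

s = "saddas das"

original = re.compile(r"[^\d\w\.#]")
# \w already matches letters, digits and "_", and "." is literal
# inside [...], so this shorter class should behave the same here.
shorter = re.compile(r"[^\w.#]")

positions_original = [m.start() for m in original.finditer(s)]
positions_shorter = [m.start() for m in shorter.finditer(s)]
print(positions_original)
print(positions_shorter)
```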

Add text between strings as needed

I want something that replaces text between two occurrences of the same string as follows:
Input:- "abcdefghcd","cd","k"
Output :- "abkefghk"
You might think that something as simple as .replace() would work, but it's not that easy. Some more examples:
Input:- "123*45","*","u"
Output:- "123*45" # No change because there aren't two occurrences of "*"
Input:- "text*text*hello*text","*","k"
Output:- "textktextkhello*text"
I don't know how to do it. Any ideas?
Count the occurrences and only replace the first n-1 of them if n is odd.
>>> s, find, replace = "text*text*hello*text", "*", "k"
>>> s.replace(find, replace, 2*(s.count(find)//2))
'textktextkhello*text'
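Wrapped up as a function and run against the three examples from the question, that idea could look like this. The name replace_pairs is made up for this sketch.

```python
def replace_pairs(s, find, repl):
    # Replace only as many occurrences as form complete pairs;
    # an unpaired trailing occurrence is left alone.
    pairs = s.count(find) // 2
    return s.replace(find, repl, 2 * pairs)

print(replace_pairs("abcdefghcd", "cd", "k"))
print(replace_pairs("123*45", "*", "u"))
print(replace_pairs("text*text*hello*text", "*", "k"))
```

A count of 0 passed to str.replace means "replace nothing", which is what handles the single-occurrence case.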
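What about splitting and joining? (Note that this replaces every occurrence, so it only gives the wanted result when the number of occurrences is even.)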
Input = "abcdefghcd"
replace_="cd"
with_='k'
data=Input.split(replace_)
print(with_.join(data))
output:
abkefghk
Split the string and substitute the pattern only if at least two occurrences are found.
>>> replace = lambda s, pat, sub: "".join([x + sub for x in s.split(pat) if x]) if len(s.split(pat))>2 else s
>>> replace("abcdefghcd", "cd", "k")
'abkefghk'
>>> replace("123*45", "*", "u")
'123*45'
If you favor an explicit function (recommended) instead of a one-liner:
def replace(s, pat, sub, occ=2):
    """Return a string of replaced letters unless below the occurrences."""
    if len(s.split(pat)) > occ:
        return "".join([x + sub for x in s.split(pat) if x])
    return s

more efficient way to replace items on a list based on a condition

I have the following piece of code. Basically, I'm trying to replace a word if it matches one of these regex patterns. If the word matches even once, the word should be completely gone from the new list. The code below works, however, I'm wondering if there's a way to implement this so that I can indefinitely add more patterns to the 'pat' list without having to write additional if statements within the for loop.
To clarify, my regex patterns have negative lookaheads and lookbehinds to make sure it's one word.
pat = [r'(?<![a-z][ ])Pacific(?![ ])', r'(?<![a-z][ ])Global(?![ ])']
if isinstance(x, list):
    new = []
    for i in x:
        if re.search(pat[0], i):
            i = re.sub(pat[0], '', i)
        if re.search(pat[1], i):
            i = re.sub(pat[1], '', i)
        if len(i) > 0:
            new.append(i)
    x = new
else:
    x = x.strip()
Just add another for loop:
for patn in pat:
    if re.search(patn, i):
        i = re.sub(patn, '', i)
if i:
    new.append(i)
pat = [r'(?<![a-z][ ])Pacific(?![ ])', r'(?<![a-z][ ])Global(?![ ])']
if isinstance(x, list):
    new = []
    for i in x:
        for p in pat:
            i = re.sub(p, '', i)
        if len(i) > 0:
            new.append(i)
    x = new
else:
    x = x.strip()
Add another loop:
pat = [r'(?<![a-z][ ])Pacific(?![ ])', r'(?<![a-z][ ])Global(?![ ])']
if isinstance(x, list):
    new = []
    for i in x:
        # iterate through pat list
        for regx in pat:
            if re.search(regx, i):
                i = re.sub(regx, '', i)
        ...
If the only thing that changes between your patterns is the word, you can join the words with | (alternation, i.e. "or"). Your two example patterns then become the single pattern below.
r'(?<![a-z][ ])(?:Pacific|Global)(?![ ])'
If you need more words, just add them separated by a pipe, for example (?:word1|word2|word3).
The ?: inside the parentheses makes the group non-capturing.
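If the word list keeps growing, the alternation can be built from a list instead of typed by hand. A rough sketch, assuming the lookarounds from the question stay the same for every word; the sample list x and the use of re.escape are my additions:

```python
import re

words = ["Pacific", "Global"]  # extend this list as needed

# re.escape guards against words that contain regex metacharacters.
alternatives = "|".join(re.escape(w) for w in words)
pattern = re.compile(r"(?<![a-z][ ])(?:" + alternatives + r")(?![ ])")

x = ["Pacific", "Global", "north Pacific", "Atlantic"]  # made-up sample data

# Apply the single combined pattern, then drop items that became empty.
new = [t for t in (pattern.sub("", item) for item in x) if t]
print(new)
```

"north Pacific" survives because the negative lookbehind rejects a match that is preceded by a lowercase letter and a space.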
Something like this:
[word for word in l if not any(re.search(p, word) for p in pat)]
I will attempt a guess here; if I am wrong, please skip to the "this is how I'd write it" and modify the code that I provide, according to what you intend to do (which I may have failed to understand).
I am assuming you are trying to eliminate the words "Global" and "Pacific" in a list of phrases that may contain them.
If that is the case, I think your regular expression does not do what you specify. You probably intended to have something like the following (which does not work as-is!):
pat = [r'(?<=[a-z][ ])Pacific(?=[ ])', r'(?<=[a-z][ ])Global(?=[ ])']
The difference is in the lookaround assertions, which are positive ((?=...) and (?<=...)) instead of negative ((?!...) and (?<!...)).
Furthermore, writing your regular expressions like this will not always correctly eliminate white space between your words.
This is how I'd write it:
words = ['Pacific', 'Global']
pat = "|".join(r'\b' + word + r'\b\s*' for word in words)
if isinstance(x, str):
    x = x.strip() # I don't understand why you don't sub here, anyway!
else:
    x = [s for s in (re.sub(pat, '', s) for s in x) if s != '']
In the regular expression for patterns, notice (a) \b, standing for "the empty string, but only at the beginning or end of a word" (see the manual), (b) the use of | for separating alternative patterns, and (c) \s, standing for "characters considered whitespace". The latter is what takes care of correctly removing unnecessary space after each eliminated word.
This works correctly in both Python 2 and Python 3. I think the code is much clearer and, in terms of efficiency, it's best if you leave re to do its work instead of testing each pattern separately.
Given:
x = ["from Global a to Pacific b",
"Global Pacific",
"Pacific Global",
"none",
"only Global and that's it"]
this produces:
x = ['from a to b', 'none', "only and that's it"]

Remove Last instance of a character and rest of a string

If I have a string as follows:
foo_bar_one_two_three
Is there a clean way, with RegEx, to return: foo_bar_one_two?
I know I can use split, pop and join for this, but I'm looking for a cleaner solution.
result = my_string.rsplit('_', 1)[0]
Which behaves like this:
>>> my_string = 'foo_bar_one_two_three'
>>> print(my_string.rsplit('_', 1)[0])
foo_bar_one_two
See the documentation entry for str.rsplit([sep[, maxsplit]]).
One way is to use rfind to get the index of the last _ character and then slice the string to extract the characters up to that point:
>>> s = "foo_bar_one_two_three"
>>> idx = s.rfind("_")
>>> if idx >= 0:
...     s = s[:idx]
...
>>> print(s)
foo_bar_one_two
You need to check that the rfind call returns something greater than -1 before using it to get the substring otherwise it'll strip off the last character.
If you must use regular expressions (and I tend to prefer non-regex solutions for simple cases like this), you can do it thus:
>>> import re
>>> s = "foo_bar_one_two_three"
>>> re.sub('_[^_]*$','',s)
'foo_bar_one_two'
Similar to the rsplit solution, rpartition will also work:
result = my_string.rpartition("_")[0]
You'll need to watch out for the case where the separator character is not found. In that case the original string will be in index 2, not 0.
doc string:
rpartition(...)
    S.rpartition(sep) -> (head, sep, tail)

    Search for the separator sep in S, starting at the end of S, and return
    the part before it, the separator itself, and the part after it. If the
    separator is not found, return two empty strings and S.
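To cover the not-found case mentioned above, you can check the middle element of the tuple before trusting the head. A minimal sketch; the helper name strip_last_segment is hypothetical:

```python
def strip_last_segment(s, sep="_"):
    head, found, tail = s.rpartition(sep)
    # When sep is absent, rpartition returns ('', '', s),
    # so fall back to the original string.
    return head if found else s

print(strip_last_segment("foo_bar_one_two_three"))
print(strip_last_segment("foobar"))
```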
Here is a generic function to remove everything after the last occurrence of any specified string. For extra credit, it also supports removing everything after the nth last occurrence.
def removeEverythingAfterLast (needle, haystack, n=1):
    while n > 0:
        idx = haystack.rfind(needle)
        if idx >= 0:
            haystack = haystack[:idx]
            n -= 1
        else:
            break
    return haystack
In your case, to remove everything after the last '_', you would simply call it like this:
updatedString = removeEverythingAfterLast('_', yourString)
If you wanted to remove everything after the 2nd last '_', you would call it like this:
updatedString = removeEverythingAfterLast('_', yourString, 2)
I know this is Python, and my answer may be a little off in syntax, but in Java you would do:
String a = "foo_bar_one_two_three";
String[] b = a.split("_");
String c = "";
for(int i=0; i<b.length-1; i++){
    c += b[i];
    if(i != b.length-2){
        c += "_";
    }
}
//and at this point, c is "foo_bar_one_two"
I hope the split function works the same way in Python. :)
EDIT:
Using the limit part of the function you can do:
String a = "foo_bar_one_two_three";
String[] b = a.split("_",StringUtils.countMatches(a,"_"));
//and at this point, b is the array = [foo,bar,one,two_three], because with a positive limit the last element keeps the rest of the input
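For what it's worth, the split-and-rejoin loop from the first Java snippet translates to Python roughly like this. It is only a sketch of the same idea, reusing the variable names a and c from the Java version:

```python
a = "foo_bar_one_two_three"

# Split on every "_", drop the last piece, and glue the rest back together.
parts = a.split("_")
c = "_".join(parts[:-1])
print(c)
```

Note that if the string contains no "_" at all, parts[:-1] is empty and c ends up as an empty string.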

python string manipulation

I have a string s with nested brackets: s = "AX(p>q)&E((-p)Ur)"
I want to remove all characters between all pairs of brackets and store in a new string like this: new_string = AX&E
I tried doing this:
p = re.compile("\(.*?\)", re.DOTALL)
new_string = p.sub("", s)
It gives output: AX&EUr)
Is there any way to correct this, rather than iterating each element in the string?
Another simple option is removing the innermost parentheses at every stage, until there are no more parentheses:
p = re.compile(r"\([^()]*\)")
count = 1
while count:
    s, count = p.subn("", s)
Working example: http://ideone.com/WicDK
You can just use string manipulation without regular expression
>>> s = "AX(p>q)&E(qUr)"
>>> [ i.split("(")[0] for i in s.split(")") ]
['AX', '&E', '']
I leave it to you to join the strings up.
>>> import re
>>> s = "AX(p>q)&E(qUr)"
>>> re.compile("""\([^\)]*\)""").sub('', s)
'AX&E'
Yeah, it should be:
>>> import re
>>> s = "AX(p>q)&E(qUr)"
>>> p = re.compile("\(.*?\)", re.DOTALL)
>>> new_string = p.sub("", s)
>>> new_string
'AX&E'
Nested brackets (or tags, ...) cannot be handled in a general way with regular expressions. See http://www.amazon.de/Mastering-Regular-Expressions-Jeffrey-Friedl/dp/0596528124/ref=sr_1_1?ie=UTF8&s=gateway&qid=1304230523&sr=8-1-spell for details on why. You would need a real parser.
It's possible to construct a regex that handles two levels of nesting, but it is already ugly, and three levels will be quite long. And you don't want to think about four levels. ;-)
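For this particular task the "real parser" can be very small: a single pass that tracks the nesting depth and keeps only characters seen at depth 0. A minimal sketch, assuming only round brackets matter; the function name strip_parenthesized is made up, and it does iterate over the string, which the question hoped to avoid:

```python
def strip_parenthesized(s):
    depth = 0
    kept = []
    for ch in s:
        if ch == "(":
            depth += 1
        elif ch == ")":
            # Drop a stray ")" rather than letting depth go negative.
            if depth > 0:
                depth -= 1
        elif depth == 0:
            # Only characters outside all brackets are kept.
            kept.append(ch)
    return "".join(kept)

new_string = strip_parenthesized("AX(p>q)&E((-p)Ur)")
print(new_string)
```

Unlike a fixed regex, this works for any nesting depth.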
You can use PyParsing to parse the string:
from pyparsing import nestedExpr
import sys
s = "AX(p>q)&E((-p)Ur)"
expr = nestedExpr('(', ')')
result = expr.parseString('(' + s + ')').asList()[0]
s = ''.join(filter(lambda x: isinstance(x, str), result))
print(s)
Most code is from: How can a recursive regexp be implemented in python?
You could use re.subn():
import re
s = 'AX(p>q)&E((-p)Ur)'
while True:
    s, n = re.subn(r'\([^)(]*\)', '', s)
    if n == 0:
        break
print(s)
Output
AX&E
this is just how you do it:
# strings
# double and single quotes use in Python
"hey there! welcome to CIP"
'hey there! welcome to CIP'
"you'll understand python"
'i said, "python is awesome!"'
'i can\'t live without python'
# use of 'r' before string
print(r"\new code", "\n")
first = "code in"
last = "python"
first + last #concatenation
# slicing of strings
user = "code in python!"
print(user)
print(user[5]) # print an element
print(user[-3]) # print an element from rear end
print(user[2:6]) # slicing the string
print(user[:6])
print(user[2:])
print(len(user)) # length of the string
print(user.upper()) # convert to uppercase
print(user.lstrip())
print(user.rstrip())
print(max(user)) # max alphabet from user string
print(min(user)) # min alphabet from user string
print(user.join(["1", "2", "3", "4"])) # join needs strings; a list of ints raises TypeError
input()
