Python: Replace characters in string with equivalent in another string - python

I'm trying to replace a character within a string with an equivalent character in another string.
String after changes made:
#C/084&"
#3*#%#C
Original String:
#+/084&"
#3*#%#+
How can I replace all 'C's back to '+'?

Im not sure what you mean by "equivalent character", but if you mean replace all occurrences of a specific character anywhere in the string you can use string.replace('C','+').

Use original_str.replace('C', '+')
Example:
>>> 'C++'.replace('C', '+')
'+++'
>>>
UPDATE:
first = list('C/084&"')
second = '3*#%#C'
for i, x in enumerate(first):
first[i] = first[i] if x!='C' else second[i]
first = ''.join(first)
After it first is 3/084&"

You can also re module for eg.
import re
string1 = '#+/084&"'
string2 = re.sub("+", "C", string1)
>>>string2
'#C/084&"'

You can easily do this with string.replace() function:
string1 = "#3*#%#C"
string2 = string1.replace("C","+")
print(string2)

Related

Python Removing non-alphabetical characters with exceptions

I am having a hard time doing Data Analysis on a large text that has lots of non-alphabetical chars. I tried using
string = filter(str.isalnum, string)
but I also have "#" in my text that I want to keep. How do I make an exception for a character like "#" ?
It is easier to use regular expressions:
string = re.sub("[^A-Za-z0-9#]", "", string)
You can use re.sub
re.sub(r'[^\w\s\d#]', '', string)
Example:
>>> re.sub(r'[^\w\s\d#]', '', 'This is # string 123 *$^%')
This is # string 123
One way to do this would be to create a function that returns True or False if an input character is valid.
import string
valid_characters = string.ascii_letters + string.digits + '#'
def is_valid_character(character):
return character in valid_characters
# Instead of using `filter`, we `join` all characters in the input string
# if `is_valid_character` is `True`.
def get_valid_characters(string):
return "".join(char for char in string if is_valid_character(char))
Some example output:
>>> print(valid_characters)
abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789#
>>> get_valid_characters("!Hello_#world?")
'Helloworld'
>>> get_valid_characters("user#example")
'user#example'
A simpler way to write it would be using regex. This will accomplish the same thing:
import re
def get_valid_characters(string):
return re.sub(r"[^\w\d#]", "", string)
You could use a lambda function to specify your allowed characters. But also note that filter returns a <filter object> which is an iterator over the returned values. So you will have to stich it back to a string:
string = "?filter_#->me3!"
extra_chars = "#!"
filtered_object = filter(lambda c: c.isalnum() or c in extra_chars, string)
string = "".join(filtered_object)
print(string)
Gives:
filter#me3!

Slice string at last digit in Python

So I have strings with a date somewhere in the middle, like 111_Joe_Smith_2010_Assessment and I want to truncate them such that they become something like 111_Joe_Smith_2010. The code that I thought would work is
reverseString = currentString[::-1]
stripper = re.search('\d', reverseString)
But for some reason this doesn't always give me the right result. Most of the time it does, but every now and then, it will output a string that looks like 111_Joe_Smith_2010_A.
If anyone knows what's wrong with this, it would be super helpful!
You can use re.sub and $ to match and substitute alphabetical characters
and underscores until the end of the string:
import re
d = ['111_Joe_Smith_2010_Assessment', '111_Bob_Smith_2010_Test_assessment']
new_s = [re.sub('[a-zA-Z_]+$', '', i) for i in d]
Output:
['111_Joe_Smith_2010', '111_Bob_Smith_2010']
You could strip non-digit characters from the end of the string using re.sub like this:
>>> import re
>>> re.sub(r'\D+$', '', '111_Joe_Smith_2010_Assessment')
'111_Joe_Smith_2010'
For your input format you could also do it with a simple loop:
>>> s = '111_Joe_Smith_2010_Assessment'
>>> i = len(s) - 1
>>> while not s[i].isdigit():
... i -= 1
...
>>> s[:i+1]
'111_Joe_Smith_2010'
You can use the following approach:
def clean_names():
names = ['111_Joe_Smith_2010_Assessment', '111_Bob_Smith_2010_Test_assessment']
for name in names:
while not name[-1].isdigit():
name = name[:-1]
print(name)
Here is another solution using rstrip() to remove trailing letters and underscores, which I consider a pretty smart alternative to re.sub() as used in other answers:
import string
s = '111_Joe_Smith_2010_Assessment'
new_s = s.rstrip(f'{string.ascii_letters}_') # For Python 3.6+
new_s = s.rstrip(string.ascii_letters+'_') # For other Python versions
print(new_s) # 111_Joe_Smith_2010

Splitting a string using re module of python

I have a string
s = 'count_EVENT_GENRE in [1,2,3,4,5]'
#I have to capture only the field 'count_EVENT_GENRE'
field = re.split(r'[(==)(>=)(<=)(in)(like)]', s)[0].strip()
#o/p is 'cou'
# for s = 'sum_EVENT_GENRE in [1,2,3,4,5]' o/p = 'sum_EVENT_GENRE'
which is fine
My doubt is for any character in (in)(like) it is splitting the string s at that character and giving me first slice.(as after "cou" it finds one matching char i:e n). It's happening for any string that contains any character from (in)(like).
Ex : 'percentage_AMOUNT' o/p = 'p'
as it finds a matching char as 'e' after p.
So i want some advice how to treat (in)(like) as words not as characters , when splitting occurs/matters.
please suggest a syntax.
Answering your question, the [(==)(>=)(<=)(in)(like)] is a character class matching single characters you defined inside the class. To match sequences of characters, you need to remove [ and ] and use alternation:
r'==?|>=?|<=?|\b(?:in|like)\b'
or better:
r'[=><]=?|\b(?:in|like)\b'
You code would look like:
import re
ss = ['count_EVENT_GENRE in [1,2,3,4,5]','coint_EVENT_GENRE = "ROMANCE"']
for s in ss:
field = re.split(r'[=><]=?|\b(?:in|like)\b', s)[0].strip()
print(field)
However, there might be other (easier, or safer - depending on the actual specifications) ways to get what you want (splitting with space and getting the first item, use re.match with r'\w+' or r'[a-z]+(?:_[A-Z]+)+', etc.)
If your value is at the start of the string and starts with lowercase ASCII letters, and then can have any amount of sequences of _ followed with uppercase ASCII letters, use:
re.match(r'[a-z]+(?:_[A-Z]+)*', s)
Full demo code:
import re
ss = ['count_EVENT_GENRE in [1,2,3,4,5]','coint_EVENT_GENRE = "ROMANCE"']
for s in ss:
fieldObj = re.match(r'[a-z]+(?:_[A-Z]+)*', s)
if fieldObj:
print(fieldObj.group())
If you want only the first word of your string, then this should do the job:
import re
s = 'count_EVENT_GENRE in [1,2,3,4,5]'
field = re.split(r'\W', s)[0]
# count_EVENT_GENRE
Is there anything wrong with using split?
>>> s = 'count_EVENT_GENRE in [1,2,3,4,5]'
>>> s.split(' ')[0]
'count_EVENT_GENRE'
>>> s = 'coint_EVENT_GENRE = "ROMANCE"'
>>> s.split(' ')[0]
'coint_EVENT_GENRE'
>>>

Python: Change uppercase letter

I can't figure out how to replace the second uppercase letter in a string in python.
for example:
string = "YannickMorin"
I want it to become yannick-morin
As of now I can make it all lowercase by doing string.lower() but how to put a dash when it finds the second uppercase letter.
You can use Regex
>>> import re
>>> split_res = re.findall('[A-Z][^A-Z]*', 'YannickMorin')
['Yannick', 'Morin' ]
>>>'-'.join(split_res).lower()
This is more a task for regular expressions:
result = re.sub(r'[a-z]([A-Z])', r'-\1', inputstring).lower()
Demo:
>>> import re
>>> inputstring = 'YannickMorin'
>>> re.sub(r'[a-z]([A-Z])', r'-\1', inputstring).lower()
'yannic-morin'
Find uppercase letters that are not at the beginning of the word and insert a dash before. Then convert everything to lowercase.
>>> import re
>>> re.sub(r'\B([A-Z])', r'-\1', "ThisIsMyText").lower()
'this-is-my-text'
the lower() method does not change the string in place, it returns the value that either needs to be printed out, or assigned to another variable. You need to replace it.. One solution is:
strAsList = list(string)
strAsList[0] = strAsList[0].lower()
strAsList[7] = strAsList[7].lower()
strAsList.insert(7, '-')
print (''.join(strAsList))

Breaking up substrings in Python based on characters

I am trying to write code that will take a string and remove specific data from it. I know that the data will look like the line below, and I only need the data within the " " marks, not the marks themselves.
inputString = 'type="NN" span="123..145" confidence="1.0" '
Is there a way to take a Substring of a string within two characters to know the start and stop points?
You can extract all the text between pairs of " characters using regular expressions:
import re
inputString='type="NN" span="123..145" confidence="1.0" '
pat=re.compile('"([^"]*)"')
while True:
mat=pat.search(inputString)
if mat is None:
break
strings.append(mat.group(1))
inputString=inputString[mat.end():]
print strings
or, easier:
import re
inputString='type="NN" span="123..145" confidence="1.0" '
strings=re.findall('"([^"]*)"', inputString)
print strings
Output for both versions:
['NN', '123..145', '1.0']
fields = inputString.split('"')
print fields[1], fields[3], fields[5]
You could split the string at each space to get a list of 'key="value"' substrings and then use regular expressions to parse the substrings.
Using your input string:
>>> input_string = 'type="NN" span="123..145" confidence="1.0" '
>>> input_string_split = input_string.split()
>>> print input_string_split
[ 'type="NN"', 'span="123..145"', 'confidence="1.0"' ]
Then use regular expressions:
>>> import re
>>> pattern = r'"([^"]+)"'
>>> for substring in input_string_split:
match_obj = search(pattern, substring)
print match_obj.group(1)
NN
123..145
1.0
The regular expression '"([^"]+)"' matches anything within quotation marks (provided there is at least one character). The round brackets indicate the bit of the regular expression that you are interested in.

Categories