Replacing a substring but skipping previous occurrences - Python

I have a long string that may contain multiple copies of the same sub-strings. I would like to extract certain sub-strings using a regex, then append [i] to each extracted sub-string and replace the original one with it.
Using the regex, I extracted ['df.Libor3m','df.Libor3m_lag1','df.Libor3m_lag1']. However, when I tried to add [i] to each item, the first 'df.Libor3m_lag1' in the string got replaced twice.
function_text_MD='0.11*(np.maximum(df.Libor3m,0.9)-np.maximum(df.Libor3m_lag1,0.9))+0.7*np.maximum(df.Libor3m_lag1,0.9)'
read_var = re.findall(r"df.[\w+][^\W]+",function_text_MD)
for var_name in read_var:
    function_text_MD.find(var_name)   # looks up the index, but the result is never used
    new_var_name = var_name + '[i]'
    function_text_MD = function_text_MD.replace(var_name, new_var_name, 1)
So I got '0.11*(np.maximum(df.Libor3m[i],0.9)-np.maximum(df.Libor3m_lag1[i][i],0.9))+0.7*np.maximum(df.Libor3m_lag1,0.9)'.
That is, [i] was appended twice to the first df.Libor3m_lag1.
What I want to get:
'0.11*(np.maximum(df.Libor3m[i],0.9)-np.maximum(df.Libor3m_lag1[i],0.9))+0.7*np.maximum(df.Libor3m_lag1[i],0.9)'
Thanks in advance!

Here is the code.
import re
function_text_MD='0.11*(np.maximum(df.Libor3m,0.9)-np.maximum(df.Libor3m_lag1,0.9))+0.7*np.maximum(df.Libor3m_lag1,0.9)'
read_var = re.findall(r"df.[\w+][^\W]+",function_text_MD)
for var_name in read_var:
    function_text_MD = function_text_MD.replace(var_name, var_name + '[i]')
print(function_text_MD)

t = "0.11*(np.maximum(df.Libor3m,0.9)-np.maximum(df.Libor3m_lag1,0.9))+0.7*np.maximum(df.Libor3m_lag1,0.9)"
p = re.split("(?<=df\.)[a-zA-Z_0-9]+", t)
s = re.findall("(?<=df\.)[a-zA-Z_0-9]+", t)
s = [x+"[i]" for x in s]
result = "".join([p[0],s[0],p[1],s[1],p[2],s[2]])
1. Use the regular expression to split the string.
2. Use the same regular expression to find the separators (the variable names).
3. Change the separators to what you want.
4. Put the two lists back together and join them (note the final p[3] piece, so the tail of the string is kept).
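For the original question there is also a shorter route (just a sketch, not part of the answer above): let re.sub do every replacement in one pass with a backreference, so no variable can be tagged twice.
import re

function_text_MD = '0.11*(np.maximum(df.Libor3m,0.9)-np.maximum(df.Libor3m_lag1,0.9))+0.7*np.maximum(df.Libor3m_lag1,0.9)'
# Append [i] to every df.<name> reference; each match is handled exactly once.
result = re.sub(r'(df\.\w+)', r'\1[i]', function_text_MD)
print(result)
# 0.11*(np.maximum(df.Libor3m[i],0.9)-np.maximum(df.Libor3m_lag1[i],0.9))+0.7*np.maximum(df.Libor3m_lag1[i],0.9)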

Related

How to explicitly find string using str.contains() in a loop?

I am searching for particular strings in the first column of a big file using str.contains(). Some rows are reported even if they only partially match the provided string. For example:
My file structure:
miRNA,Gene,Species_ID,PCT
miR-17-5p/331-5p,AAK1,9606,0.94
miR-17-5p/31-5p,Gnp,9606,0.92
miR-17-5p/130-5p,AAK1,9606,0.94
miR-17-5p/30-5p,Gnp,9606,0.94
When I run my search code:
DE_miRNAs = ['31-5p', '150-3p'] #the actual list is much bigger
for miRNA in DE_miRNAs:
    targets = pd.read_csv('my_file.csv')
    new_df = targets.loc[targets['miRNA'].str.contains(miRNA)]
I am expecting to get only the second row:
miR-17-5p/31-5p,Gnp,9606,0.92
but I get both the first and second rows - 331-5p comes up in the result too, although it should not:
miR-17-5p/331-5p,AAK1,9606,0.94
miR-17-5p/31-5p,Gnp,9606,0.92
Is there a way to make str.contains() more specific? There is a suggestion here, but how can I implement it in a for loop? str.contains(r"\bmiRNA\b") does not work.
Thank you.
Use str.contains with a regex alternation which is surrounded by word boundaries on both sides:
DE_miRNAs = ['31-5p', '150-3p']
regex = r'\b(' + '|'.join(DE_miRNAs) + r')\b'
targets = pd.read_csv('my_file.csv')
new_df = targets.loc[targets['miRNA'].str.contains(regex)]
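A small self-contained check of this idea (a sketch: the DataFrame below is built inline instead of being read from my_file.csv, and a non-capturing group is used so pandas does not warn about match groups):
import pandas as pd

# Rows mirroring the file structure from the question.
targets = pd.DataFrame({
    'miRNA': ['miR-17-5p/331-5p', 'miR-17-5p/31-5p',
              'miR-17-5p/130-5p', 'miR-17-5p/30-5p'],
    'Gene': ['AAK1', 'Gnp', 'AAK1', 'Gnp'],
})

DE_miRNAs = ['31-5p', '150-3p']
regex = r'\b(?:' + '|'.join(DE_miRNAs) + r')\b'   # \b(?:31-5p|150-3p)\b

new_df = targets.loc[targets['miRNA'].str.contains(regex)]
print(new_df)   # only the miR-17-5p/31-5p row is kept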
contains is a function that takes a regex pattern as an argument. You should be more explicit about the regex pattern you are using.
In your case, I suggest you use /31-5p instead of 31-5p:
DE_miRNAs = ['31-5p', '150-3p'] #the actual list is much bigger
for miRNA in DE_miRNAs:
    targets = pd.read_csv('my_file.csv')
    new_df = targets.loc[targets['miRNA'].str.contains("/" + miRNA)]
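If the per-item loop from the question has to stay, the word-boundary idea from the first answer can also be applied inside it. This is only a sketch (re.escape is a precaution in case an identifier ever contains a regex metacharacter):
import re
import pandas as pd

targets = pd.read_csv('my_file.csv')   # path taken from the question

DE_miRNAs = ['31-5p', '150-3p']
frames = []
for miRNA in DE_miRNAs:
    # \b...\b stops 331-5p from matching 31-5p.
    mask = targets['miRNA'].str.contains(r'\b' + re.escape(miRNA) + r'\b')
    frames.append(targets.loc[mask])
new_df = pd.concat(frames)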

Python - Extract text from string

What is the most efficient way to extract text from a string? Are there built-in functions, regular expressions, or some other way?
For example, my string is below and I want to extract the IDs as well
as the ScreenNames, separately.
[User(ID=1234567890, ScreenName=RandomNameHere), User(ID=233323490, ScreenName=AnotherRandomName), User(ID=4459284, ScreenName=YetAnotherName)]
Thank you!
Edit: These are the text strings that I want to pull. I want them to be in a list.
Target_IDs = 1234567890, 233323490, 4459284
Target_ScreenNames = RandomNameHere, AnotherRandomName, YetAnotherName
import re
s = '[User(ID=1234567890, ScreenName=RandomNameHere), User(ID=233323490, ScreenName=AnotherRandomName), User(ID=4459284, ScreenName=YetAnotherName)]'
print('Target IDs = ' + ','.join(re.findall(r'ID=(\d+)', s)))
print('Target ScreenNames = ' + ','.join(re.findall(r'ScreenName=(\w+)', s)))
Output :
Target IDs = 1234567890,233323490,4459284
Target ScreenNames = RandomNameHere,AnotherRandomName,YetAnotherName
It depends. Assuming that all your text comes in the form of
TagName = TagValue1, TagValue2, ...
You need just two calls to split.
tag, value_string = string.split('=')
values = value_string.split(',')
Remove the excess whitespace (a couple of rstrip()/lstrip() calls will probably suffice) and you are done. Or you can use regexes; they are slightly more powerful, but in this case I think it's a matter of personal taste.
If you want a more complex syntax with nonterminals, terminals and all that, you'll need lex/yacc, which requires some background in parsers. It is a rather interesting thing to play with, but not something you'd want to use for storing program options and the like.
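A minimal sketch of that split-and-strip approach, applied to the Target_ScreenNames line from the question's edit:
line = 'Target_ScreenNames = RandomNameHere, AnotherRandomName, YetAnotherName'

tag, value_string = line.split('=')
tag = tag.strip()                                      # 'Target_ScreenNames'
values = [v.strip() for v in value_string.split(',')]  # strip the excess spaces
print(tag, values)   # Target_ScreenNames ['RandomNameHere', 'AnotherRandomName', 'YetAnotherName']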
The regex I'd use would be:
(?:ID=|ScreenName=)+(\d+|[\w\d]+)
However, this assumes that IDs consist only of digits (\d) and usernames only of letters and digits ([\w\d]).
This regex, combined with re.findall, returns a list of matches that can be iterated over and sorted in some fashion, like so:
import re
s = "[User(ID=1234567890, ScreenName=RandomNameHere), User(ID=233323490, ScreenName=AnotherRandomName), User(ID=4459284, ScreenName=YetAnotherName)]"
pattern = re.compile(r'(?:ID=|ScreenName=)+(\d+|[\w\d]+)')
ids = []
names = []
for p in re.findall(pattern, s):
    if p.isnumeric():
        ids.append(p)
    else:
        names.append(p)
print(ids, names)
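An alternative sketch (not part of the answer above): capture both fields of each User(...) entry with a single pattern, so the IDs and names stay paired and no isnumeric() check is needed.
import re

s = '[User(ID=1234567890, ScreenName=RandomNameHere), User(ID=233323490, ScreenName=AnotherRandomName), User(ID=4459284, ScreenName=YetAnotherName)]'

pairs = re.findall(r'ID=(\d+), ScreenName=(\w+)', s)
ids = [pair[0] for pair in pairs]
names = [pair[1] for pair in pairs]
print(ids, names)
# ['1234567890', '233323490', '4459284'] ['RandomNameHere', 'AnotherRandomName', 'YetAnotherName']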

Python - how to substitute a substring using regex with n occurrences

I have a string with many occurrences of a single pattern, like
a = 'eresQQQutnohnQQQjkhjhnmQQQlkj'
and I have another string like
b = 'rerTTTytu'
I want to substitute the entire second string, using the 'QQQ' and the 'TTT' as reference points, and in this case I want to obtain 3 different results:
'ererTTTytuohnQQQjkhjhnmQQQlkj'
'eresQQQutnrerTTTytujhnmQQQlkj'
'eresQQQutnohnQQQjkhjrerTTTytu'
I've tried using re.sub
re.sub(r'\w{3}QQQ\w{3}', b, a)
but I obtain only the first one, and I don't know how to get the other two solutions.
Edit: As you requested, the two characters surrounding 'QQQ' will be replaced as well now.
I don't know if this is the most elegant or simplest solution for the problem, but it works:
import re

a = 'eresQQQutnohnQQQjkhjhnmQQQlkj'
b = 'rerTTTytu'

# Find the start index of every occurrence of ??QQQ?? in a - where ? is any non-whitespace character
matches = [x.start() for x in re.finditer(r'\S{2}QQQ\S{2}', a)]
# Replace each ??QQQ?? with b, one occurrence at a time
results = [a[:idx] + re.sub(r'\S{2}QQQ\S{2}', b, a[idx:], 1) for idx in matches]
print(results)
Output
['errerTTTytunohnQQQjkhjhnmQQQlkj',
'eresQQQutnorerTTTytuhjhnmQQQlkj',
'eresQQQutnohnQQQjkhjhrerTTTytuj']
Since you didn't specify the output format, I just put it in a list.
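An equivalent variant (just a sketch): splice b in using the span of each match from finditer instead of re-running re.sub on a slice; it produces the same list as above.
import re

a = 'eresQQQutnohnQQQjkhjhnmQQQlkj'
b = 'rerTTTytu'

# For every match, keep the text before it, insert b, keep the text after it.
results = [a[:m.start()] + b + a[m.end():]
           for m in re.finditer(r'\S{2}QQQ\S{2}', a)]
print(results)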

Non-standard splitting

I have a string value like:
a='[-sfdfj aidjf -dugs jfdsif -usda [[s dfdsf sdf]]]'
I want to transform "a" into a dictionary: the strings with a preceding "-" character should be keys, and whatever comes after the space should be the value of the preceding key.
If we are working with "a", then what I want is the resulting dictionary like:
dict_a={'-sfdfj': 'aidjf', '-dugs': 'jfdsif', '-usda': '[[s dfdsf sdf]]'}
This would be simple if not for the last value ('[[s dfdsf sdf]]'), which contains spaces. Otherwise I would just strip the outer brackets and split "a", then convert the resulting list into dict_a, but alas reality is not on my side.
Even if I get the list like:
list_a=['-sfdfj', 'aidjf', '-dugs', 'jfdsif', '-usda', '[[s dfdsf sdf]']
this would be enough.
Any help will be appreciated.
You can split the string by '-' and then add the '-' back.
a = '[-sfdfj aidjf -dugs jfdsif -usda [[s dfdsf sdf]]]'
a = a[1:-1]  # get rid of the leading and trailing []
sections = a.split('-')
dict_a = {}
for s in sections:
    s = s.strip()
    if len(s) == 0:
        continue
    key_value = s.split(' ')         # split key and value by space
    key = '-' + key_value[0]         # the first element is the key
    value = ' '.join(key_value[1:])  # the rest is the value
    dict_a[key] = value
I can tell you a way to go about it.
Strip the quotes and the outer brackets, then split the string on spaces. Iterate over the resulting list and check for opening brackets. Keep a count of the open brackets and keep joining the list items (with a space between each) until you encounter an equal number of closing brackets; the remaining items stay as they are. You could try implementing it yourself (a rough sketch follows below); if you face any issues, I'll help you with the code.
@Chong's answer is a neater way to go about it.
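Here is a rough sketch of that bracket-counting idea (the variable names are illustrative, not from the original post):
a = '[-sfdfj aidjf -dugs jfdsif -usda [[s dfdsf sdf]]]'

tokens = a[1:-1].split(' ')   # drop the outer brackets, split on spaces
merged = []
buffer = []
depth = 0
for tok in tokens:
    depth += tok.count('[') - tok.count(']')
    buffer.append(tok)
    if depth == 0:            # brackets are balanced, so this item is complete
        merged.append(' '.join(buffer))
        buffer = []

# merged == ['-sfdfj', 'aidjf', '-dugs', 'jfdsif', '-usda', '[[s dfdsf sdf]]']
dict_a = dict(zip(merged[::2], merged[1::2]))
print(dict_a)   # {'-sfdfj': 'aidjf', '-dugs': 'jfdsif', '-usda': '[[s dfdsf sdf]]'}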
Using a regular expression:
>>> import re
>>> dict(re.findall(r'(-\S+) ([^-]+)', a[:-1].replace(' -', '-')))
{'-sfdfj': 'aidjf', '-dugs': 'jfdsif', '-usda': '[[s dfdsf sdf]]'}
Using @ChongTang's idea:
>>> dict(('-' + b).strip().split(maxsplit=1) for b in a[1:-1].split('-') if b)
{'-sfdfj': 'aidjf', '-dugs': 'jfdsif', '-usda': '[[s dfdsf sdf]]'}
You can try this:
import re
a='[-sfdfj aidjf -dugs jfdsif -usda [[s dfdsf sdf]]]'
pattern_key=re.compile(r'(?P<key>-\S+)\s+')
pattern_val=re.compile(r' (?P<val>[^-].*?)( -|\Z)')
d={}
matches=pattern_key.finditer(a)
matches1=pattern_val.finditer(a)
for m, n in zip(matches, matches1):
    d[m.group('key')] = n.group('val')
print(d)

How to convert a multiline string into a list of lines?

In Sikuli I've got a multiline string from the clipboard like this...
Names = App.getClipboard();
So Names =
#corazona
#Pebleo00
#cofriasd
«paflio
and I have used this regex to delete the first character if it is not in the \x00-\x7F hex range, is not a word character, or is a digit:
import re
Names = re.sub(r"(?m)^([^\x00-\x7F]+|\W|\d)", "", Names)
So now Names =
corazona
Pebleo00
cofriasd
paflio
But I am having trouble with the second regex, the one that converts "Names" into the items of a sequence. I would like to convert "Names" into...
'corazona', 'Pebleo00', 'cofriasd', 'paflio'
or
'corazona', 'Pebleo00', 'cofriasd', 'paflio',
so that Sikuli can then recognize it as a list (I've found that Sikuli is able to recognize it even with the trailing comma and space at the end) by using...
NamesAsList = eval(Names)
How could I do this in Python? Is it necessary to use a regex, or is there another way to do this in Python?
I have already done this with a .NET regex; I just don't know how to do it in Python, and I have googled it with no result.
This is how I did it using the .NET regex.
Text to find:
(.*[^$])(\r\n|\z)
Replace with:
'$1',%" "%
Thanks in advance.
A couple of one-liners. Your question isn't completely clear, but I am assuming you want to split a given string on newlines and then generate a list of strings by removing the first character of each if it's not alphanumeric. Here's how I'd go about it:
import re
r = re.compile(r'^[^a-zA-Z0-9]')  # match a leading character that is not alphanumeric
s = '#abc\ndef\nghi'
l = [r.sub('', x) for x in s.split()]
# join this list with commas (if that's required; otherwise you already have the list)
','.join(l)
Hope that's what you want.
If Names is a string before you "convert" it, in which each name is separated by a new line ('\n'), then this will work:
NamesAsList = Names.split('\n')
See this question for other options.
You could use splitlines()
import re
clipBoard = App.getClipboard();
Names = re.sub(r"(?m)^([^\x00-\x7F]+|\W|\d)", "", clipBoard)
# Replace the end of a line with a comma.
singleNames = ', '.join(Names.splitlines())
print(singleNames)
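If the quoted form from the question is really needed (so that eval() sees string literals), a small extension of the same idea would look like the sketch below; note that Names.splitlines() is already a Python list, so the eval() step can usually be skipped altogether.
# Names as produced above; wrap each name in quotes before joining,
# giving "'corazona', 'Pebleo00', 'cofriasd', 'paflio'"
quotedNames = ', '.join("'" + name + "'" for name in Names.splitlines())
print(quotedNames)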
