I am searching for particular strings in the first column of a big file using str.contains(). Some cases are reported even if they only partially match the provided string. For example:
My file structure:
miRNA,Gene,Species_ID,PCT
miR-17-5p/331-5p,AAK1,9606,0.94
miR-17-5p/31-5p,Gnp,9606,0.92
miR-17-5p/130-5p,AAK1,9606,0.94
miR-17-5p/30-5p,Gnp,9606,0.94
When I run my search code:
DE_miRNAs = ['31-5p', '150-3p']  # the actual list is much bigger
for miRNA in DE_miRNAs:
    targets = pd.read_csv('my_file.csv')
    new_df = targets.loc[targets['miRNA'].str.contains(miRNA)]
I am expecting to get only the second row:
miR-17-5p/31-5p,Gnp,9606,0.92
but I get both the first and second rows; 331-5p comes up in the result too, which it should not:
miR-17-5p/331-5p,AAK1,9606,0.94
miR-17-5p/31-5p,Gnp,9606,0.92
Is there a way to make str.contains() more specific? There is a suggestion here, but how can I implement it in a for loop? str.contains(r"\bmiRNA\b") does not work.
Thank you.
Use str.contains with a regex alternation which is surrounded by word boundaries on both sides:
DE_miRNAs = ['31-5p', '150-3p']
regex = r'\b(' + '|'.join(DE_miRNAs) + r')\b'
targets = pd.read_csv('my_file.csv')
new_df = targets.loc[targets['miRNA'].str.contains(regex)]
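End to end, with the sample my_file.csv from the question, this is roughly what it looks like (a sketch; the non-capturing group (?:...) is my addition, to avoid the "pattern has match groups" warning pandas emits for capturing groups):
import pandas as pd

DE_miRNAs = ['31-5p', '150-3p']
regex = r'\b(?:' + '|'.join(DE_miRNAs) + r')\b'  # e.g. \b(?:31-5p|150-3p)\b

targets = pd.read_csv('my_file.csv')
new_df = targets.loc[targets['miRNA'].str.contains(regex)]
print(new_df)  # expected to keep only the miR-17-5p/31-5p row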
contains is a function that takes a regex pattern as an argument. You should be more explicit about the regex pattern you are using.
In your case, I suggest you use /31-5p instead of 31-5p:
DE_miRNAs = ['31-5p', '150-3p']  # the actual list is much bigger
for miRNA in DE_miRNAs:
    targets = pd.read_csv('my_file.csv')
    new_df = targets.loc[targets['miRNA'].str.contains("/" + miRNA)]
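If you keep the loop form, note that new_df is overwritten on each pass and the CSV is re-read every time; one possible variant (a sketch, not the answer's code) reads the file once and collects all matching rows:
import pandas as pd

DE_miRNAs = ['31-5p', '150-3p']  # the actual list is much bigger
targets = pd.read_csv('my_file.csv')  # read the file once, outside the loop

# regex=False because "/" + miRNA is meant as a literal substring here
frames = [targets.loc[targets['miRNA'].str.contains("/" + m, regex=False)] for m in DE_miRNAs]
new_df = pd.concat(frames)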
I am trying to use the following code to make replacements in a pandas dataframe:
replacerscompanya = {',':'','.':'','-':'','ltd':'limited','&':'and'}
df1['CompanyA'] = df1['CompanyA'].replace(replacerscompanya)
replacersaddress1a = {',':'','.':'','-':'','ltd':'limited','&':'and', r'\brd\b':'road'}
df1['Address1A'] = df1['Address1A'].replace(replacersaddress1a)
replacersaddress2a = {',':'','.':'','-':'','ltd':'limited','&':'and', r'\brd\b':'road'}
df1['Address2A'] = df1['Address2A'].replace(replacersaddress2a)
It does not give me an error, but when I check the dataframe, no replacements have been made.
I had previously just used a number of lines of the code below to achieve the same result, but I was hoping to create something a bit simpler to adjust.
df1['CompanyA'] = df1['CompanyA'].str.replace('.','')
Any ideas as to what is going on here?
Thanks!
Escape . in the dictionary because it is a special regex character, and add the parameter regex=True for substring replacement and also for replacing by regex:
replacersaddress1a = {',':'','\.':'','-':'','ltd':'limited','&':'and', r'\brd\b':'road'}
df1['CompanyA'] = df1['CompanyA'].replace(replacerscompanya, regex=True)
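For illustration, here is a minimal, self-contained sketch of the same fix (the sample df1 below is made up); escaping the literal keys with re.escape keeps only the intended patterns as regex:
import re
import pandas as pd

df1 = pd.DataFrame({'CompanyA': ['acme ltd.', 'foo & bar co-op']})  # hypothetical sample data

literals = {',': '', '.': '', '-': '', 'ltd': 'limited', '&': 'and'}
replacerscompanya = {re.escape(k): v for k, v in literals.items()}  # '.' becomes '\.'
replacerscompanya[r'\brd\b'] = 'road'  # genuine regex patterns are added as-is

df1['CompanyA'] = df1['CompanyA'].replace(replacerscompanya, regex=True)
print(df1['CompanyA'].tolist())  # roughly: ['acme limited', 'foo and bar coop']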
I have a long string that may contain multiple occurrences of the same sub-strings. I would like to extract certain sub-strings using regex. Then, for each extracted sub-string, I want to append [i] to it and replace the original one.
Using regex, I extracted ['df.Libor3m','df.Libor3m_lag1','df.Libor3m_lag1']. However, when I tried to add [i] to each item, the first 'df.Libor3m_lag1' in the string got replaced twice.
function_text_MD = '0.11*(np.maximum(df.Libor3m,0.9)-np.maximum(df.Libor3m_lag1,0.9))+0.7*np.maximum(df.Libor3m_lag1,0.9)'
read_var = re.findall(r"df.[\w+][^\W]+", function_text_MD)
for var_name in read_var:
    function_text_MD.find(var_name)
    new_var_name = var_name + '[i]'
    function_text_MD = function_text_MD.replace(var_name, new_var_name, 1)
So I got '0.11*(np.maximum(df.Libor3m[i],0.9)-np.maximum(df.Libor3m_lag1[i][i],0.9))+0.7*np.maximum(df.Libor3m_lag1,0.9)'.
df.Libor3m_lag1 had [i] appended twice, giving df.Libor3m_lag1[i][i].
What I want to get:
'0.11*(np.maximum(df.Libor3m[i],0.9)-np.maximum(df.Libor3m_lag1[i],0.9))+0.7*np.maximum(df.Libor3m_lag1[i],0.9)'
Thanks in advance!
Here is the code.
import re

function_text_MD = '0.11*(np.maximum(df.Libor3m,0.9)-np.maximum(df.Libor3m_lag1,0.9))+0.7*np.maximum(df.Libor3m_lag1,0.9)'
# Deduplicate the variable names and require that a name is not followed by another
# word character, so df.Libor3m does not also match inside df.Libor3m_lag1.
read_var = set(re.findall(r"df\.\w+", function_text_MD))
for var_name in read_var:
    function_text_MD = re.sub(re.escape(var_name) + r"(?!\w)", var_name + "[i]", function_text_MD)
print(function_text_MD)
t = "0.11*(np.maximum(df.Libor3m,0.9)-np.maximum(df.Libor3m_lag1,0.9))+0.7*np.maximum(df.Libor3m_lag1,0.9)"
p = re.split("(?<=df\.)[a-zA-Z_0-9]+", t)
s = re.findall("(?<=df\.)[a-zA-Z_0-9]+", t)
s = [x+"[i]" for x in s]
result = "".join([p[0],s[0],p[1],s[1],p[2],s[2]])
Use the regular expression to split the string first.
Use the same regular expression to find the separators (the variable names).
Change those separators to what you want.
Put the two lists back together and join them; a generalized sketch that does not hardcode the indices follows below.
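A possible generalization of the same idea (my own sketch, not the answer's original code) that works for any number of matches:
import re

t = "0.11*(np.maximum(df.Libor3m,0.9)-np.maximum(df.Libor3m_lag1,0.9))+0.7*np.maximum(df.Libor3m_lag1,0.9)"
pieces = re.split(r"(?<=df\.)\w+", t)                        # text around the variable names
names = [x + "[i]" for x in re.findall(r"(?<=df\.)\w+", t)]  # variable names with [i] appended
# Interleave: piece0 + name0 + piece1 + name1 + ... + trailing piece.
result = "".join(piece + name for piece, name in zip(pieces, names + [""]))
print(result)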
First of all, sorry if the title isn't very explicit; it's hard for me to formulate it properly. That's also why I haven't been able to find out whether this question has already been asked.
So, I have a list of strings, and I want to perform a "procedural" search, replacing every * in my target substring with any possible substring.
Here is an example:
strList = ['obj_1_mesh',
'obj_2_mesh',
'obj_TMP',
'mesh_1_TMP',
'mesh_2_TMP',
'meshTMP']
searchFor('mesh_*')
# should return: ['mesh_1_TMP', 'mesh_2_TMP']
In this case, where there is just one *, I just split each string on * and use startswith() and/or endswith(), so that's OK.
But I don't know how to do the same thing if there are multiple * in the search string.
So my question is: how do I search for any number of unknown substrings in place of * in a list of strings?
For example:
strList = ['obj_1_mesh',
'obj_2_mesh',
'obj_TMP',
'mesh_1_TMP',
'mesh_2_TMP',
'meshTMP']
searchFor('*_1_*')
# should return: ['obj_1_mesh', 'mesh_1_TMP']
Hope everything is clear enough. Thanks.
Consider using fnmatch, which provides Unix shell-style pattern matching. More info here: http://docs.python.org/2/library/fnmatch.html
from fnmatch import fnmatch
strList = ['obj_1_mesh',
'obj_2_mesh',
'obj_TMP',
'mesh_1_TMP',
'mesh_2_TMP',
'meshTMP']
searchFor = '*_1_*'
resultSubList = [x for x in strList if fnmatch(x, searchFor)]
This should do the trick
I would use the regular expression package for this if I were you. You'll have to learn a little bit of regex to make correct search queries, but it's not too bad. '.+' is pretty similar to '*' in this case.
import re

def search_strings(str_list, search_query):
    regex = re.compile(search_query)
    result = []
    for string in str_list:
        match = regex.match(string)
        if match is not None:
            result.append(match.group())
    return result

strList = ['obj_1_mesh',
           'obj_2_mesh',
           'obj_TMP',
           'mesh_1_TMP',
           'mesh_2_TMP',
           'meshTMP']

print(search_strings(strList, '.+_1_.+'))
This should return ['obj_1_mesh', 'mesh_1_TMP']. I tried to replicate the '*_1_*' case. For 'mesh_*' you could make the search_query 'mesh_.+'. Here is the link to the python regex api: https://docs.python.org/2/library/re.html
The simplest way to do this is to use fnmatch, as shown in ma3oun's answer. But here's a way to do it using Regular Expressions, aka regex.
First we transform your searchFor pattern so it uses '.+?' as the "wildcard" instead of '*'. Then we compile the result into a regex pattern object so we can efficiently use it for multiple tests.
For an explanation of regex syntax, please see the docs. But briefly, the dot means any character (except a newline), the + means look for one or more of them, and the ? makes the match non-greedy, i.e., it matches the smallest string that conforms to the pattern rather than the longest (which is what greedy matching does).
import re
strList = ['obj_1_mesh',
'obj_2_mesh',
'obj_TMP',
'mesh_1_TMP',
'mesh_2_TMP',
'meshTMP']
searchFor = '*_1_*'
pat = re.compile(searchFor.replace('*', '.+?'))
result = [s for s in strList if pat.match(s)]
print(result)
output
['obj_1_mesh', 'mesh_1_TMP']
If we use searchFor = 'mesh_*' the result is
['mesh_1_TMP', 'mesh_2_TMP']
Please note that this solution is not robust. If searchFor contains other characters that have special meaning in a regex, they need to be escaped. Actually, rather than doing that searchFor.replace transformation, it would be cleaner to just write the pattern using regex syntax in the first place; a couple of safer options are sketched below.
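For example (my own sketch, not part of the original answer), you could escape the literal pieces around each * yourself, or let fnmatch.translate build the regex for you:
import re
from fnmatch import translate

searchFor = '*_1_*'

# Escape everything except the wildcard, then rebuild the wildcard as '.*'
pat = re.compile('.*'.join(re.escape(part) for part in searchFor.split('*')) + r'\Z')
# Or let fnmatch.translate do the wildcard-to-regex conversion
pat2 = re.compile(translate(searchFor))

print(bool(pat.match('mesh_1_TMP')), bool(pat2.match('mesh_1_TMP')))  # True True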
If the string you are looking for is always a plain literal substring, you can just use the find function; you'll get something like:
for s in strList:
    if s.find(searchFor) != -1:
        do_something()
If you have more than one string to look for (like abc*123*test), you are going to need to look for each piece in turn: find the second one in the same string starting at the index where you found the first plus its length, and so on. A rough sketch of that idea follows below.
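Here is that sequential-find idea spelled out (my own illustration; the helper name matches_pattern is made up):
def matches_pattern(s, pattern):
    # Look for each literal piece of the pattern in order,
    # starting the next search where the previous piece ended.
    pos = 0
    for part in pattern.split('*'):
        idx = s.find(part, pos)
        if idx == -1:
            return False
        pos = idx + len(part)
    return True  # note: unlike fnmatch, this does not anchor at the start or end

strList = ['obj_1_mesh', 'obj_2_mesh', 'obj_TMP', 'mesh_1_TMP', 'mesh_2_TMP', 'meshTMP']
print([s for s in strList if matches_pattern(s, '*_1_*')])  # ['obj_1_mesh', 'mesh_1_TMP']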
I have following string:
BUCKET1:/dir1/dir2/BUCKET1:/dir3/dir4/BUCKET2:/dir5/dir6
I am trying to split it in such a way that I would get back the following dict / other data structure:
BUCKET1 -> /dir1/dir2/, BUCKET1 -> /dir3/dir4/, BUCKET2 -> /dir5/dir6/
I can somehow split it if I only have one BUCKET, not multiple, like this:
res.split(res.split(':', 1)[0].replace('.', '').upper()) -> it's not perfect
Input: ADRIAN:/dir1/dir11/DANIEL:/dir2/ADI_BUCKET:/dir3/CULEA:/dir4/ADRIAN:/dir5/ADRIAN:/dir6/
Output: [(ADRIAN, /dir1/dir11), (DANIEL, /dir2/), (CULEA, /dir3/), (ADRIAN, /dir5/), (ADRIAN, /dir6/)]
As per Wiktor Stribiżew's comments, the following regex does the job:
r"(BUCKET1|BUCKET2):(.*?)(?=(?:BUCKET1|BUCKET2)|$)"
I'd recommend learning regex, just as the others have suggested. However, if you're looking for an alternative, here's a way of doing it without regex. It also produces the output you're looking for.
string = input("Enter:") #Put your own input here.
tempList = string.replace("BUCKET",':').split(":")
outputList = []
for i in range(1,len(tempList)-1,2):
someTuple = ("BUCKET"+tempList[i],tempList[i+1])
outputList.append(someTuple)
print(outputList) #Put your own output here.
This will produce:
[('BUCKET1', '/dir1/dir2/'), ('BUCKET1', '/dir3/dir4/'), ('BUCKET2', '/dir5/dir6')]
This code is hopefully easier to understand and manipulate if you're unfamiliar with Regex, although I'd still personally recommend Regex to solve this if you're familiar with how to use it.
Use the re.findall() function:
import re

s = "ADRIAN:/dir1/dir11/DANIEL:/dir2/ADI_BUCKET:/dir3/CULEA:/dir4/ADRIAN:/dir5/ADRIAN:/dir6/"
result = re.findall(r'(\w+):([^:]+\/)', s)
print(result)
The output:
[('ADRIAN', '/dir1/dir11/'), ('DANIEL', '/dir2/'), ('ADI_BUCKET', '/dir3/'), ('CULEA', '/dir4/'), ('ADRIAN', '/dir5/'), ('ADRIAN', '/dir6/')]
Use regex instead?
import re
test = 'BUCKET1:/dir1/dir2/BUCKET1:/dir3/dir4/BUCKET2:/dir5/dir6'
output = re.findall(r'(?P<bucket>[A-Z0-9]+):(?P<path>[/a-z0-9]+)', test)
print(output)
Which gives
[('BUCKET1', '/dir1/dir2/'), ('BUCKET1', '/dir3/dir4/'), ('BUCKET2', '/dir5/dir6')]
It appears you have a list of predefined "buckets" that you want to use as boundaries for the records inside the string.
That means the easiest way to match these key-value pairs is to match one of the bucket names, then a colon, and then any chars that do not start a sequence equal to one of those bucket names.
You may use
r"(BUCKET1|BUCKET2):(.*?)(?=(?:BUCKET1|BUCKET2)|$)"
Compile with re.S / re.DOTALL if your values span across multiple lines. See the regex demo.
Details:
(BUCKET1|BUCKET2) - capture group one that matches and stores in .group(1) any of the bucket names
: - a colon
(.*?) - any 0+ chars, as few as possible (as *? is a lazy quantifier), up to the first occurrence of (but not including)...
(?=(?:BUCKET1|BUCKET2)|$) - any of the bucket names or the end of the string (a quick greedy-vs-lazy comparison follows this list).
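To see why the lazy quantifier and the lookahead matter, here is a quick side-by-side (my own illustration, using the string from the question):
import re

s = "BUCKET1:/dir1/dir2/BUCKET1:/dir3/dir4/BUCKET2:/dir5/dir6"
# Greedy: the first value swallows everything up to the end of the string
print(re.findall(r"(BUCKET1|BUCKET2):(.*)", s))
# Lazy + lookahead: each value stops right before the next bucket name (or the end)
print(re.findall(r"(BUCKET1|BUCKET2):(.*?)(?=(?:BUCKET1|BUCKET2)|$)", s))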
Build it dynamically while escaping bucket names (just to play it safe in case those names contain * or + or other special chars):
import re
buckets = ['BUCKET1','BUCKET2']
rx = r"({0}):(.*?)(?=(?:{0})|$)".format("|".join([re.escape(bucket) for bucket in buckets]))
print(rx)
s = "BUCKET1:/dir1/dir2/BUCKET1:/dir3/dir4/BUCKET2:/dir5/dir6"
print(re.findall(rx, s))
# => (BUCKET1|BUCKET2):(.*?)(?=(?:BUCKET1|BUCKET2)|$)
[('BUCKET1', '/dir1/dir2/'), ('BUCKET1', '/dir3/dir4/'), ('BUCKET2', '/dir5/dir6')]
See the online Python demo.
I have a string with a lot of recurrences of a single pattern, like
a = 'eresQQQutnohnQQQjkhjhnmQQQlkj'
and I have another string like
b = 'rerTTTytu'
I want to substitute the entire second string, using the 'QQQ' and the 'TTT' as reference points, and in this case I want to get 3 different results:
'ererTTTytuohnQQQjkhjhnmQQQlkj'
'eresQQQutnrerTTTytujhnmQQQlkj'
'eresQQQutnohnQQQjkhjrerTTTytu'
I've tried using re.sub
re.sub('\w{3}QQQ\w{3}', b, a)
but I obtain only the first one, and I don't know how to get the other two solutions.
Edit: As you requested, the two characters surrounding 'QQQ' will be replaced as well now.
I don't know if this is the most elegant or simplest solution for the problem, but it works:
import re

a = 'eresQQQutnohnQQQjkhjhnmQQQlkj'  # from the question
b = 'rerTTTytu'

# Find all occurrences of ??QQQ?? in a - where ? is any non-whitespace character
matches = [x.start() for x in re.finditer(r'\S{2}QQQ\S{2}', a)]
# Replace each ??QQQ?? with b, one occurrence at a time
results = [a[:idx] + re.sub(r'\S{2}QQQ\S{2}', b, a[idx:], 1) for idx in matches]
print(results)
Output
['errerTTTytunohnQQQjkhjhnmQQQlkj',
'eresQQQutnorerTTTytuhjhnmQQQlkj',
'eresQQQutnohnQQQjkhjhrerTTTytuj']
Since you didn't specify the output format, I just put it in a list.
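An equivalent formulation (my own sketch) that slices around each match instead of calling re.sub a second time; it produces the same three strings:
import re

a = 'eresQQQutnohnQQQjkhjhnmQQQlkj'
b = 'rerTTTytu'
results = [a[:m.start()] + b + a[m.end():] for m in re.finditer(r'\S{2}QQQ\S{2}', a)]
print(results)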