Hello I am new to python, and I hope you can help me. I have a text file (call it data.txt) with data on gene number with corresponding rs number and some distance measure. The data looks something like this:
rs1982171 55349 40802
rs6088650 55902 38550
rs1655902 3105 12220
rs1013677 55902 0
where the first column is rs number, second column is gene number, and third column is some distance measure. The data is much bigger, but hopefully the above gives you an idea of the dataset. What I want to do is find all the rs numbers that correspond to a certain gene. For example, for the data set above, gene 55902= {rs6088650, rs1013677}. Ideally, I want my code to find all rs numbers corresponding to a given gene. Since I am unable to do that now, I instead wrote a short code that gives the lines that contain the string "55902" in the data.txt file:
import re
data=open("data.txt","r")
for line in data:
    line=line.rstrip()
    if re.search("55902",line):
        print line
The problem with this code is that the output is something like this:
rs6088650 55902 38550
rs1655902 3105 12220
rs1013677 55902 0
I want my code to ignore the string "55902" in the rs number. In other words, I don't want my code to output the second line in the above output, because its gene number is not 55902. I would like my output to be:
rs6088650 55902 38550
rs1013677 55902 0
How can I modify the above code to achieve what I want? Any help would be appreciated. Thanks in advance.
There's no need for regular expressions here, as all you're looking for is a simple static sequence. This line:
if re.search("55902",line):
Could be expressed as:
if "55902" in line:
And if you only want to check the second column, split the line first:
if '55902' in line.split()[1]:
Since you're now already checking the correct column, check for equality rather than membership:
if line.split()[1] == '55902':
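Putting the pieces together, here is a minimal sketch (written in Python 3 syntax) that compares only the second column and also builds the gene-to-rs-number mapping the question asks for. The sample rows are copied from the question; the helper name `rs_by_gene` is made up for illustration, and with a real file you would iterate over `open("data.txt")` instead of the list.

```python
from collections import defaultdict

# Sample rows from the question; with a real file, iterate over open("data.txt") instead.
lines = [
    "rs1982171 55349 40802",
    "rs6088650 55902 38550",
    "rs1655902 3105 12220",
    "rs1013677 55902 0",
]

def rs_by_gene(rows):
    """Map each gene number (column 2) to the list of rs numbers (column 1)."""
    genes = defaultdict(list)
    for row in rows:
        fields = row.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        rs_number, gene = fields[0], fields[1]
        genes[gene].append(rs_number)
    return genes

genes = rs_by_gene(lines)
print(genes["55902"])
```

Because the comparison is done on the whole field, the rs number rs1655902 can no longer produce a false hit for gene 55902.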
You can use a word boundary (\b) to match the number only as a whole word:
>>> import re
>>> re.search(r"\b55902\b", "rs1655902 3105 12220")
>>> re.search(r"\b55902\b", "rs6088650 55902 38550")
<_sre.SRE_Match object at 0x7f82594566b0>
if re.search(r"\b55902\b", line):
    ....
You can do this easily with a more powerful regular expression. One possible quick solution is to use a regex of the form:
r'\b55902\b'
The \b are word boundaries.
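A quick check of how `\b` behaves on rows like the ones in the question (Python 3 syntax). One caveat to keep in mind: the word-boundary pattern matches the number in any column, so a row whose distance value happened to be 55902 would match too. The last row below is invented purely to show that.

```python
import re

pattern = re.compile(r"\b55902\b")

rows = [
    "rs6088650 55902 38550",   # gene column is 55902: match
    "rs1655902 3105 12220",    # 55902 only inside the rs number: no match
    "rs0000001 3105 55902",    # invented row: 55902 in the distance column
]

# True where the pattern is found anywhere in the row as a whole word
matches = [bool(pattern.search(row)) for row in rows]
print(matches)
```

If only the gene column should count, comparing the split field directly is the safer choice.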
If you want to use a regex, you can use match or search along with the word boundary \b, as in:
x = " rs1982171 55349 40802".strip()
if (re.match(r"\b55349\b", x.split()[1])):
    print x
I'm not sure if this is a problem in my understanding of regex modules, or a silly mistake I'm making in my for loop.
I have a list of numbers that look like this:
4; 94
3; 92
1; 53
etc.
I made a regex pattern to match just the last two digits of the string:
'^.*\s([0-9]+)$'
This works when I take each element of the list 1 at a time.
However, when I try to make a for loop:
for i in xData:
    if re.findall('^.*\s([0-9]+)$', i):
        print i
The output is simply the entire string instead of just the last two digits.
I'm sure I'm missing something very simple here but if someone could point me in the right direction that would be great. Thanks.
You are printing the whole string, i. If you wanted to print the output of re.findall(), then store the result and print that result:
for i in xData:
    results = re.findall('^.*\s([0-9]+)$', i)
    if results:
        print results
I don't think that re.findall() is the right method here, since your lines contain just the one set of digits. Use re.search() to get a match object, and if the match object is not None, take the first group data:
for i in xData:
    match = re.search('^.*\s([0-9]+)$', i)
    if match:
        print match.group(1)
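For reference, a self-contained version of the same idea in Python 3 syntax. The list `x_data` stands in for the question's `xData` and holds the three sample values shown there; the pattern is shortened to `\s([0-9]+)$`, which is an equivalent choice here because `re.search` scans the string and does not need the leading `^.*`.

```python
import re

x_data = ["4; 94", "3; 92", "1; 53"]  # sample values from the question

last_numbers = []
for item in x_data:
    # search() returns a match object, or None when the pattern is not found
    match = re.search(r"\s([0-9]+)$", item)
    if match:
        last_numbers.append(match.group(1))  # group(1) is the captured digits

print(last_numbers)
```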
I might be missing something here, but if all you're looking to do is get the last 2 characters, could you use the below?
for i in xData:
    print(i[-2:])
There are probably several ways to solve this problem, so I'm open to any ideas.
I have a file, and within that file is the string "D133330593". Note: I do have the exact position within the file where this string occurs, but I don't know if that helps.
Following this string there are 6 digits, and I need to replace these 6 digits with 6 other digits.
This is what I have so far:
def editfile():
    f = open(filein,'r')
    filedata = f.read()
    f.close()
    #This is the line that needs help
    newdata = filedata.replace( -TOREPLACE- ,-REPLACER-)
    #Basically what I need is something that lets me say "D133330593******"
    #->"D133330593123456" Note: The following 6 digits don't need to be
    #anything specific, just different from the original 6
    f = open(filein,'w')
    f.write(newdata)
    f.close()
Use the re module to define your pattern and then use the sub() function to substitute occurrences of that pattern with your own string.
import re
...
pat = re.compile(r"D133330593\d{6}")
newdata = re.sub(pat, "D133330593abcdef", filedata)
The above defines a pattern as your string ("D133330593") followed by six decimal digits. The next line replaces ALL occurrences of this pattern with the replacement string (the original prefix followed by "abcdef" in this case), if that is what you want. Note that sub() returns the new string and does not modify filedata, so the result has to be assigned.
If you only want to replace a limited number of occurrences, use the count keyword argument of sub(), which caps the number of replacements made. If you want a different replacement for each occurrence, you can pass a function as the replacement instead of a string.
Check out this library for more info - https://docs.python.org/3.6/library/re.html
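A minimal sketch of how this could be wired into the file-editing function from the question, in Python 3 syntax. The names `replace_digits` and `edit_file` are hypothetical, and "123456" is just the example replacement from the question's comment. The marker is captured in a group so it can be put back unchanged, and a function is used as the replacement so that digits following the group reference cannot be misread as part of the group number.

```python
import re

def replace_digits(text, marker="D133330593", new_digits="123456"):
    """Replace the six digits that follow `marker` with `new_digits`."""
    # re.escape guards against regex metacharacters in the marker
    pattern = re.compile("(" + re.escape(marker) + r")\d{6}")
    # A function replacement avoids ambiguity between a group reference
    # such as \1 and the digits that come right after it.
    return pattern.sub(lambda m: m.group(1) + new_digits, text)

def edit_file(path):
    """Read the file, replace the digits after the marker, write it back."""
    with open(path, "r") as f:
        filedata = f.read()
    newdata = replace_digits(filedata)
    with open(path, "w") as f:
        f.write(newdata)

print(replace_digits("zshisjD133330593090909fdjgsl"))
```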
Let's simplify your problem to you having a string:
s = "zshisjD133330593090909fdjgsl"
and you wanting to replace the 6 characters after "D133330593" with "123456" to produce:
"zshisjD133330593123456fdjgsl"
To achieve this, we first need to find the index of "D133330593". This is done by just using str.index:
i = s.index("D133330593")
Then we replace the next 6 characters, but for this we first need the length of the string we searched for:
l = len("D133330593")
then do the replace:
s[:i+l] + "123456" + s[i+l+6:]
which gives us the desired result of:
'zshisjD133330593123456fdjgsl'
I am sure that you can now integrate this into your code to work with a file, but this is how you can do the heart of your problem.
Note that using variables as above is the right thing to do, since it avoids recomputing the same values. Nevertheless, if your file isn't too long (i.e. efficiency isn't too much of a big deal) you can do the whole process outlined above in one line:
s[:s.index("D133330593")+len("D133330593")] + "123456" + s[s.index("D133330593")+len("D133330593")+6:]
which gives the same result.
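The same slicing steps can be wrapped in a small helper. This is only a sketch: the name `replace_after` is made up, and it assumes the marker occurs in the string (otherwise `str.index` raises `ValueError`) and that only the first occurrence should be changed.

```python
def replace_after(s, marker, replacement):
    """Overwrite the len(replacement) characters that follow the first `marker`."""
    start = s.index(marker) + len(marker)  # raises ValueError if marker is absent
    return s[:start] + replacement + s[start + len(replacement):]

result = replace_after("zshisjD133330593090909fdjgsl", "D133330593", "123456")
print(result)
```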
I'm learning the re module for Python and doing some experiments. I have a question about using an expression; here is the example:
import re
name = 'abc123def456'
m = re.compile('.*[^0-9]').match(name)
print m.group()
Result is 'abc123def'
What should I do if I want to take out the numbers entirely?
Thank you!
You can extract all runs of letters and concatenate them to get just the letters in the string. See below:
"".join(re.findall("[a-zA-Z]+",name))
I would like to replace all substring occurrences with regular expressions. The original sentences would be like:
mystring = "Carl's house is big. He is asking 1M for that(the house)."
Now let's suppose I have two substrings I would like to bold. I bold the words by adding ** at the beginning and at the end of the substring. The 2 substrings are:
substring1 = "house", so bolded it would be "**house**"
substring2 = "the house", so bolded it would be "**the house**"
At the end I want the original sentence like this:
mystring = "Carl's **house** is big. He is asking 1M for that(**the house**)."
The main problem is that as I have several substrings to replace, they can overlap words like the example above. If I analyze the longest substring at first, I am getting this:
Carl's **house** is big. He is asking 1M for that(**the **house****).
On the other hand, if I analyze the shortest substring first, I am getting this:
Carl's **house** is big. He is asking 1M for that(the **house**).
It seems I will need to replace from the longest substring to the shortest, but I wonder how to make sure that text already bolded in the first replacement is not matched again in the second. Also remember that a substring can appear several times in the string.
Note: Suppose the string ** will never occur in the original string, so we can use it to bold our words.
You can search for all of the strings at once, so that the fact that one is a substring of another doesn't matter:
re.sub(r"(house|the house)", r"**\1**", mystring)
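For an arbitrary list of substrings, the single-pass idea can be sketched like this (Python 3 syntax). The function name `bold_all` is made up for illustration. The substrings are sorted longest first so that when two alternatives could start at the same position the longer one wins, and `re.escape` is applied in case a substring contains regex metacharacters.

```python
import re

def bold_all(text, substrings):
    """Wrap every occurrence of any substring in ** using a single regex pass."""
    # Longest first: alternation picks the first alternative that matches.
    ordered = sorted(substrings, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(s) for s in ordered))
    return pattern.sub(lambda m: "**" + m.group(0) + "**", text)

mystring = "Carl's house is big. He is asking 1M for that(the house)."
bolded = bold_all(mystring, ["house", "the house"])
print(bolded)
```

Since each position in the text is consumed by at most one match, an already bolded "the house" is never revisited for "house".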
You could have a group that is not captured and is not required. If you look at the regex pattern (?P<repl>(?:the )?house), the (?:the )? part says that there might be a "the " in the string; if it is present, include it in the match. This way, you let the re library optimize the way it matches. Here is the complete example:
>>> data = "Carl's house is big. He is asking 1M for that(the house)."
>>> re.sub('(?P<repl>(?:the )?house)', '**\g<repl>**', data)
"Carl's **house** is big. He is asking 1M for that(**the house**)."
Note: \g<repl> is used to get all the string matched by the group <repl>
You could do two passes:
First: Go through from longest to shortest and replace with something like:
'the house': 'AA_THE_HOUSE'
'house': 'BB_HOUSE'
Second: Go through replace like:
'AA_THE_HOUSE': '**the house**'
'BB_HOUSE': '**house**'
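Replace the strings with some unique placeholder values, and then replace the placeholders with the original strings enclosed in ** to make them bold. The placeholders must not contain any of the substrings being replaced (a lowercase 'temp_the_house' would still contain 'house'), so uppercase ones are used here.
For example:
'the house' with 'TEMP_THE_HOUSE'
'house' with 'TEMP_HOUSE'
then 'TEMP_HOUSE' with '**house**'
'TEMP_THE_HOUSE' with '**the house**'
This should work fine. You can automate this by using two lists.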
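The two-pass idea can be automated roughly as follows (Python 3 syntax). This is a sketch under stated assumptions: the name `bold_two_pass` is made up, and each placeholder is a single character from the Unicode private-use area starting at U+E000, which is assumed not to occur in the text or in any of the substrings.

```python
def bold_two_pass(text, substrings):
    """Bold substrings via placeholders so earlier replacements are not re-matched."""
    ordered = sorted(substrings, key=len, reverse=True)  # longest first
    placeholders = []
    # Pass 1: swap each substring for a one-character placeholder
    # (assumed absent from the text and from every substring).
    for i, sub in enumerate(ordered):
        token = chr(0xE000 + i)
        placeholders.append((token, sub))
        text = text.replace(sub, token)
    # Pass 2: swap each placeholder for the bolded original substring.
    for token, sub in placeholders:
        text = text.replace(token, "**" + sub + "**")
    return text

mystring = "Carl's house is big. He is asking 1M for that(the house)."
two_pass_result = bold_two_pass(mystring, ["house", "the house"])
print(two_pass_result)
```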
I am new to regular expressions and I am trying to write a pattern of phone numbers, in order to identify them and be able to extract them. My doubt can be summarized to the following simple example:
First I try to identify whether the string contains something like (+34), which should be optional:
prefixsrch = re.compile(r'(\(?\+34\)?)?')
which I test on the following string like this:
line0 = "(+34)"
print prefixsrch.findall(line0)
which yields the result:
['(+34)','']
My first question is: why does it find two occurrences of the pattern? I guess that this is related to the fact that the prefix is optional, but I do not completely understand it. Anyway, now for my big doubt:
If we do a similar thing searching for a pattern of 9 digits we get the same:
numsrch = re.compile(r'\d{9}')
line1 = "971756754"
print numsrch.findall(line1)
yields something like:
['971756754']
which is fine. Now what I want to do is identify a 9 digits number, preceded or not, by (+34). So to my understanding I should do something like:
phonesrch = re.compile(r'(\(?\+34\)?)?\d{9}')
If I test it in the following strings...
line0 = "(+34)971756754"
line1 = "971756754"
print phonesrch.findall(line0)
print phonesrch.findall(line1)
this is, to my surprise, what I get:
['(+34)']
['']
What I was expecting to get is ['(+34)971756754'] and ['971756754']. Does anybody have any insight into this? Thank you very much in advance.
Your capturing group is wrong. Put the country code in a non-capturing group and wrap the entire expression in the capturing group:
>>> line0 = "(+34)971756754"
>>> line1 = "971756754"
>>> re.findall(r'((?:\(?\+34\)?)?\d{9})',line0)
['(+34)971756754']
>>> re.findall(r'((?:\(?\+34\)?)?\d{9})',line1)
['971756754']
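As for your first question, "why does it find two occurrences of the pattern?":
This is because ? means 0 or 1 repetitions, so the whole pattern is optional and an empty string is also a valid match. After matching (+34), findall finds one more, empty, match at the end of the string.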
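Both effects can be reproduced side by side in a short script (Python 3 syntax; the variable names are chosen here for illustration). When the pattern contains one capturing group, `findall` returns only the text of that group; with a non-capturing group `(?:...)` and no other groups, it returns the whole match.

```python
import re

line0 = "(+34)971756754"
line1 = "971756754"

# One capturing group: findall returns just the group's text.
capturing = re.compile(r"(\(?\+34\)?)?\d{9}")
# Non-capturing group and no other groups: findall returns the whole match.
non_capturing = re.compile(r"(?:\(?\+34\)?)?\d{9}")

group_only = capturing.findall(line0)
whole0 = non_capturing.findall(line0)
whole1 = non_capturing.findall(line1)

# A fully optional pattern also matches the empty string at the end.
prefix_matches = re.findall(r"(\(?\+34\)?)?", "(+34)")

print(group_only, whole0, whole1, prefix_matches)
```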