Find common items in 2 text files - python

To introduce you to the context of my problem: I have two files containing information about genes:
pos.bed contains positions of specific genes and hg19-genes.txt contains all the existing genes of the species, with some associated fields such as the position of each gene (start and end), its name, its symbol, etc.
The problem is that in pos, only the position of the gene is indicated, but not its name/symbol. I would like to read through both files and compare the start and end in each line. If there is a match, I would like to get the symbol of the corresponding gene.
I wrote this little python code:
pos=open('C:/Users/Claire/Desktop/Arithmetics/pos.bed','r')
gen=open('C:/Users/Claire/Desktop/Arithmetics/hg19-genes.txt','r')
for row in pos:
    row=row.split()
    start=row[11]
    end=row[12]
    for row2 in gen:
        row2=row2.split()
        start2=row2[3]
        end2=row2[4]
        sym=row2[10]
        if start==start2 and end==end2:
            print sym
pos.close()
gen.close()
But it seems like this is only comparing the two files line by line (like line 2 in file pos with line 2 in file gen only). So I tried adding an else to the if statement, but I get an error message:
else:
    gen.next()
StopIteration Traceback (most recent call last)
<ipython-input-9-a309fdca7035> in <module>()
14 print sym
15 else:
---> 16 gen.next()
17
18 pos.close()
StopIteration:
I know it is possible to compare all the lines of 2 files, no matter the position of the line, by doing something like:
same = set(file1).intersection(file2)
but in my case I only want to compare some columns of each line as the lines have different information in each file (except for the start and the end). Is there a similar way to compare lines in files, but only for some specified items?

gen is a file iterator that yields the lines of the file exactly once, and it is exhausted while processing the first row of pos. The simplest workaround for that is to open the gen file inside the outer loop:
pos=open('C:/Users/Claire/Desktop/Arithmetics/pos.bed','r')
for row in pos:
    row=row.split()
    start=row[11]
    end=row[12]
    gen=open('C:/Users/Claire/Desktop/Arithmetics/hg19-genes.txt','r')
    for row2 in gen:
        row2=row2.split()
        start2=row2[3]
        end2=row2[4]
        sym=row2[10]
        if start==start2 and end==end2:
            print sym
    gen.close()
pos.close()
Another option would be to read all lines of gen into a list and use that list in the inner loop.
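A third option, sketched here with the column indices assumed from the question's code (3, 4, 10 in gen; 11, 12 in pos), is to read gen once into a dictionary keyed on the (start, end) pair; each pos row then becomes a single lookup instead of a full scan of gen:
# Build a lookup table from gen once; columns 3, 4 and 10 are assumed
# to hold start, end and symbol, as in the question's code.
symbols = {}
gen = open('C:/Users/Claire/Desktop/Arithmetics/hg19-genes.txt', 'r')
for line in gen:
    fields = line.split()
    symbols[(fields[3], fields[4])] = fields[10]
gen.close()
# Probe the table with columns 11 and 12 of each pos row.
pos = open('C:/Users/Claire/Desktop/Arithmetics/pos.bed', 'r')
for line in pos:
    fields = line.split()
    sym = symbols.get((fields[11], fields[12]))
    if sym is not None:
        print sym
pos.close()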

Related

I want to use the sum function to count multiple occurrences of specific characters, but my script only works for one character

This script is supposed to calculate the total weight of a protein, so I decided to count the occurrences of certain characters in the sequence. However, only the first expression produces a result, which causes the total weight to equal that first value (everything under the first one comes out as zero, which is definitely incorrect). How do I get my script to pay attention to the other lines?
This is a shortened version:
akt3_file = open('AKT3 fasta.txt', 'r') #open up the fasta file
for line in akt3_file:
    ala =(sum(line.count('A') for line in akt3_file)*89*1000) #this value is 1780000
    arg =(sum(line.count('R') for line in akt3_file)*174*1000)
    asn =(sum(line.count('N') for line in akt3_file)*132*1000)
    asp =(sum(line.count('D') for line in akt3_file)*133*1000)
    asx =(sum(line.count('B') for line in akt3_file)*133*1000)
protein_weight = ala+arg+asn+asp+asx
print(protein_weight) # the problem is that this value is also 1780000
akt3_file.close() #close the fasta file
The issue you have is that you're trying to iterate over your file's lines several times. While that's actually possible (unlike most iterators, file objects can be rewound with seek), you're not doing it properly, so all the iterations except for the first don't see any data.
In this case, you probably don't need to iterate over the lines at all. Just read the full text of the file into a string, and count the characters you want out of that string:
with open('AKT3 fasta.txt', 'r') as akt_3file: # A with is not necessary, but a good idea.
    data = akt_3file.read() # Read the whole file into the data string.
ala = data.count('A') * 89 * 1000 # Now we can count all the occurrences in all lines at
arg = data.count('R') * 174 * 1000 # once, and there's no issue iterating the file, since
asn = data.count('N') * 132 * 1000 # we're not dealing with the file any more, just the
asp = data.count('D') * 133 * 1000 # immutable data string.
asx = data.count('B') * 133 * 1000
protein_weight = ala + arg + asn + asp + asx # Same total as in the question's code.
print(protein_weight)
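For completeness: if you really did need to iterate over the lines more than once, rewinding with seek is the mechanism mentioned above. A minimal sketch, assuming the same file:
akt3_file = open('AKT3 fasta.txt', 'r')
ala = sum(line.count('A') for line in akt3_file) * 89 * 1000
akt3_file.seek(0) # rewind to the start so the next pass over the lines sees data again
arg = sum(line.count('R') for line in akt3_file) * 174 * 1000
akt3_file.close()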

Python: a way to ignore/account for newlines with read()

So I am having a problem extracting text from a large (>1 GB) text file. The file is structured as follows:
>header1
hereComesTextWithNewlineAtPosition_80
hereComesTextWithNewlineAtPosition_80
hereComesTextWithNewlineAtPosition_80
andEnds
>header2
hereComesTextWithNewlineAtPosition_80
hereComesTextWithNewlineAtPosAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAlineAtPosition_80
MaybeAnotherTargetBBBBBBBBBBBrestText
andEndsSomewhereHere
Now I have the information that in the entry with header2 I need to extract the text from position X to position Y (the A's in this example), starting with 1 as the first letter in the line below the header.
BUT: the positions do not account for newline characters. So basically when it says from 1 to 95 it really means just the letters from 1 to 80 and the following 15 of the next line.
My first solution was to use file.read(X-1) to skip the unwanted part in front and then file.read(Y-X) to get the part I want, but when that stretches over newline(s) I get too few characters extracted.
Is there a way to solve this with another Python function than read(), maybe? I thought about just replacing all newlines with empty strings, but the file may be quite large (millions of lines).
I also tried to account for the newlines by taking extractLength // 80 as added length, but this is problematic in cases like the example: when, e.g., 95 characters are split 2-80-13 over 3 lines, I actually need 2 additional positions, but 95 // 80 is 1.
UPDATE:
I modified my code to use Biopython:
for s in SeqIO.parse(sys.argv[2], "fasta"):
    #foundClusters stores the information for substrings I want extracted
    currentCluster = foundClusters.get(s.id)
    if(currentCluster is not None):
        for i in range(len(currentCluster)):
            outputFile.write(">"+s.id+"|cluster"+str(i)+"\n")
            flanking = 25
            start = currentCluster[i][0]
            end = currentCluster[i][1]
            left = currentCluster[i][2]
            if(start - flanking < 0):
                start = 0
            else:
                start = start - flanking
            if(end + flanking > end + left):
                end = end + left
            else:
                end = end + flanking
            #for debugging only
            print(currentCluster)
            print(start)
            print(end)
            outputFile.write(s.seq[start, end+1])
But I get the following error:
[[1, 55, 2782]]
0
80
Traceback (most recent call last):
File "findClaClusters.py", line 92, in <module>
outputFile.write(s.seq[start, end+1])
File "/usr/local/lib/python3.4/dist-packages/Bio/Seq.py", line 236, in __getitem__
return Seq(self._data[index], self.alphabet)
TypeError: string indices must be integers
UPDATE2:
Changed outputFile.write(s.seq[start, end+1]) to:
outRecord = SeqRecord(s.seq[start: end+1], id=s.id+"|cluster"+str(i), description="Repeat-Cluster")
SeqIO.write(outRecord, outputFile, "fasta")
and it's working :)
With Biopython:
from Bio import SeqIO
X = 66
Y = 130
for s in SeqIO.parse("test.fst", "fasta"):
    if "header2" == s.id:
        print s.seq[X: Y+1]
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
Biopython lets you parse a FASTA file and access each record's id, description and sequence easily. You then have a Seq object that you can manipulate conveniently without recoding everything yourself (reverse complement and so on).
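If you would rather not depend on Biopython, the newline problem can also be handled by joining each record's lines before indexing, a cheaper version of the "replace all newlines" idea from the question since only one record is joined at a time. A rough sketch, assuming the same test.fst layout:
# Collect each record's sequence with newlines stripped, then index into
# the joined string, so positions ignore line breaks.
records = {}
with open("test.fst") as fh:
    header = None
    parts = []
    for line in fh:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                records[header] = "".join(parts)
            header = line[1:]
            parts = []
        else:
            parts.append(line)
    if header is not None:
        records[header] = "".join(parts)
X, Y = 66, 130
print(records["header2"][X:Y+1])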

Convert a Column oriented file to CSV output using shell

I have a file that comes from MapReduce output in the format below that needs conversion to CSV using a shell script.
25-MAY-15
04:20
Client
0000000010
127.0.0.1
PAY
ISO20022
PAIN000
100
1
CUST
API
ABF07
ABC03_LIFE.xml
AFF07/LIFE
100000
Standard Life
================================================
==================================================
AFF07-B000001
2000
ABC Corp
..
BE900000075000027
AFF07-B000002
2000
XYZ corp
..
BE900000075000027
AFF07-B000003
2000
3MM corp
..
BE900000075000027
I need the output in the CSV format below, where some of the values in the file are repeated and the TRANSACTION ID is added:
25-MAY-15,04:20,Client,0000000010,127.0.0.1,PAY,ISO2002,PAIN000,100,1,CUST,API,ABF07,ABC03_LIFE.xml,AFF07/LIFE,100000,Standard Life, 25-MAY-15,04:20,Client,0000000010,127.0.0.1,PAY,ISO2002,PAIN000,100,1,CUST,API,AFF07-B000001, 2000,ABC Corp,..,BE900000075000027
25-MAY-15,04:20,Client,0000000010,127.0.0.1,PAY,ISO2002,PAIN000,100,1,CUST,API,ABF07,ABC03_LIFE.xml,AFF07/LIFE,100000,Standard Life, 25-MAY-15,04:20,Client,0000000010,127.0.0.1,PAY,ISO2002,PAIN000,100,1,CUST,API,AFF07-B000002,2000,XYZ Corp,..,BE900000075000027
The TRANSACTION IDs are AFF07-B000001, AFF07-B000002 and AFF07-B000003, which have different values, and I have put a marker line at the point where the transaction IDs start. The values before the demarcation should repeat, and the transaction ID column needs to be added along with those repeating values, as given in the format above.
A Bash shell script is what I may need, and CentOS is the flavour of Linux.
I am getting the error below when I execute the code:
Traceback (most recent call last):
File "abc.py", line 37, in <module>
main()
File "abc.py", line 36, in main
createTxns(fh)
File "abc.py", line 7, in createTxns
first17.append( fh.readLine().rstrip() )
AttributeError: 'file' object has no attribute 'readLine'
Can someone help me out?
Is this a correct description of the input file and output format?
The input file consists of:
17 lines, followed by
groups of 10 lines each - each group holding one transaction id
Each output row consists of:
29 common fields, followed by
5 fields derived from each of the 10-line groups above
So we just translate this into some Python:
def createTxns(fh):
    """fh is the file handle of the input file"""
    # 1. Read 17 lines from fh
    first17 = []
    for i in range(17):
        first17.append( fh.readline().rstrip() )
    # 2. Form the common fields.
    commonFields = first17 + first17[0:12]
    # 3. Process the rest of the file in groups of ten lines.
    while True:
        # read 10 lines
        group = []
        for i in range(10):
            x = fh.readline()
            if x == '':
                break
            group.append( x.rstrip() )
        if len(group) != 10:
            break # we've reached the end of the file
        fields = commonFields + [ group[2], group[4], group[6], group[7], group[9] ]
        row = ",".join(fields)
        print row

def main():
    with open("input-file", "r") as fh:
        createTxns(fh)

main()
This code shows how to:
open a file handle
read lines from a file handle
strip off the ending newline
check for end of input when reading from a file
concatenate lists together
join strings together
I would recommend that you read the Input and Output chapter of the Python tutorial if you are going the Python route.
You just have to break the problem down and try it. For the first 17 lines, use f.readline() and concatenate them into a string. Then use the replace method to get the beginning of the string that you want in the CSV:
str.replace("\n", ",")
Then use the split method to break the rest down into a list:
str.split("\n")
Then write the file out in a loop. Use a counter to make your life easier. First write out the header string:
25-MAY-15,04:20,Client,0000000010,127.0.0.1,PAY,ISO2002,PAIN000,100,1,CUST,API,ABF07,ABC03_LIFE.xml,AFF07/LIFE,100000,Standard Life, 25-MAY-15,04:20,Client,0000000010,127.0.0.1,PAY,ISO2002,PAIN000,100,1,CUST,API
Then write each item in the list with a "," in front of it:
,AFF07-B000001, 2000,ABC Corp,..,BE900000075000027
At a count of 5, write the "\n" with the header again, and don't forget to reset your counter so it can begin again:
\n25-MAY-15,04:20,Client,0000000010,127.0.0.1,PAY,ISO2002,PAIN000,100,1,CUST,API,ABF07,ABC03_LIFE.xml,AFF07/LIFE,100000,Standard Life, 25-MAY-15,04:20,Client,0000000010,127.0.0.1,PAY,ISO2002,PAIN000,100,1,CUST,API
Give it a try and let us know if you need more assistance. I assumed that you have some scripting background :) Good luck!!
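A rough sketch of that counter approach, assuming (as in the answer above) that the first 17 lines are the repeating values, that each transaction is the 5 lines shown in the sample, and that the ===== marker lines and blanks are skipped; input-file and output.csv are placeholder names:
with open("input-file") as fh, open("output.csv", "w") as out:
    first17 = [fh.readline().rstrip() for _ in range(17)]
    header = ",".join(first17 + first17[:12]) # 17 fields plus the first 12 repeated
    count = 0
    for line in fh:
        line = line.strip()
        if not line or line.startswith("="):
            continue # skip blank lines and the ===== marker lines
        if count == 0:
            out.write("\n" + header) # start a new row with the repeating values
        out.write("," + line)
        count += 1
        if count == 5:
            count = 0 # reset so the next transaction starts a new row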

Wit's end with file to dict

Python: 2.7.9
I erased all of my code because I'm going nuts.
Here's the gist (it's for a Rosalind challenge thingy):
I want to take a file that looks like this (no quotes on carets)
">"Rosalind_0304
actgatcgtcgctgtactcg
actcgactacgtagctacgtacgctgcatagt
">"Rosalind_2480
gctatcggtactgcgctgctacgtg
ccccccgaagaatagatag
">"Rosalind_2452
cgtacgatctagc
aaattcgcctcgaactcg
etc...
What I can't figure out how to do is basically everything at this point, my mind is so muddled. I'll just show kind of what I was doing, but failing to do.
1st. I want to search the file for '>'
Then assign the rest of that line into the dictionary as a key.
read the next lines up until the next '>' and do some calculations and return
findings into the value for that key.
go through the file and do it for every string.
then compare all values and return the key of whichever one is highest.
Can anyone help?
It might help if I just take a break. I've been coding all day and I think I smell colors.
def func(dna_str):
    bla
    return gcp #gc count percentage returned to the value in dict
With my_function somewhere that returns that percentage value:
with open('rosalind.txt', 'r') as ros:
    rosa = {line[1:].split(' ')[0]:my_function(line.split(' ')[1].strip()) for line in ros if line.strip()}
    top_key = max(rosa, key=rosa.get)
    print(top_key, rosa.get(top_key))
For each line in the file, that will first check if there's anything left of the line after stripping trailing whitespace, then discard the blank lines. Next, it adds each non-blank line as an entry to a dictionary, with the key being everything to the left of the space except for the unneeded >, and the value being the result of sending everything to the right of the space to your function.
Then it saves the key corresponding to the highest value, then prints that key along with its corresponding value. You're left with a dictionary rosa that you can process however you like.
Complete code of the module:
def my_function(dna):
    return 100 * len(dna.replace('A','').replace('T',''))/len(dna)

with open('rosalind.txt', 'r') as ros:
    with open('rosalind_clean.txt', 'w') as output:
        for line in ros:
            if line.startswith('>'):
                output.write('\n'+line.strip())
            elif line.strip():
                output.write(line.strip())

with open('rosalind_clean.txt', 'r') as ros:
    rosa = {line[1:].split(' ')[0]:my_function(line.split(' ')[1].strip()) for line in ros if line.strip()}
    top_key = max(rosa, key=rosa.get)
    print(top_key, rosa.get(top_key))
Complete content of rosalind.txt:
>Rosalind_6404 CCTGCGGAAGATCGGCACTAGAATAGCCAGAACCG
TTTCTCTGAGGCTTCCGGCCTTCCCTCCCACTAATAATTCTGAGG
>Rosalind_5959 CCATCGGTAGCGCATCCTTAGTCCAATTAAGTCCCTATCCA
GGCGCTCCGCCGAAGGTCTATATCCA
TTTGTCAGCAGACACGC
>Rosalind_0808 CCACCCTCGTGGT
ATGGCTAGGCATTCAGGAACCGGAGAACGCTTCAGACCAGCCCGGACTGGGAACCTGCGGGCAGTAGGTGGAAT
Result when running the module:
Rosalind_0808 60.91954022988506
This should properly handle an input file that doesn't necessarily have one entry per line.
See SO's formatting guide to learn how to make inline or block code tags to get past things like ">". If you want it to appear as regular text rather than code, escape the > with a backslash:
Type:
\>Rosalind
Result:
>Rosalind
I think I got that part down now. Thanks so much. BUUUUT. It's throwing an error about it.
rosa = {line[1:].split(' ')[0]:calc(line.split(' ')[1].strip()) for line in ros if line.strip()}
IndexError: list index out of range
This is my func, btw:
def calc(dna_str):
    gc = 0   # initialize the counters; without this the += lines
    divc = 0 # below would raise an UnboundLocalError
    for x in dna_str:
        if x == 'G':
            gc += 1
            divc += 1
        elif x == 'C':
            gc += 1
            divc += 1
        else:
            divc += 1
    gcp = float(gc)/divc # divide as floats so Python 2 doesn't truncate
    return gcp
Exact test file, no blank lines before or after:
>Rosalind_6404
CCTGCGGAAGATCGGCACTAGAATAGCCAGAACCGTTTCTCTGAGGCTTCCGGCCTTCCC
TCCCACTAATAATTCTGAGG
>Rosalind_5959
CCATCGGTAGCGCATCCTTAGTCCAATTAAGTCCCTATCCAGGCGCTCCGCCGAAGGTCT
ATATCCATTTGTCAGCAGACACGC
>Rosalind_0808
CCACCCTCGTGGTATGGCTAGGCATTCAGGAACCGGAGAACGCTTCAGACCAGCCCGGAC
TGGGAACCTGCGGGCAGTAGGTGGAAT
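For the record, the IndexError comes from line.split(' ')[1]: in this test file the header and the sequence are on separate lines, so a header line like >Rosalind_6404 contains no space and has no index 1. A sketch that accumulates multi-line records into the dict directly, without the intermediate cleaned file, and then scores them with the calc function above:
# Build {header: joined_sequence} from a FASTA-style file where each
# sequence may span several lines.
seqs = {}
with open('rosalind.txt') as ros:
    name = None
    for line in ros:
        line = line.strip()
        if line.startswith('>'):
            name = line[1:]
            seqs[name] = ''
        elif line:
            seqs[name] += line
rosa = {name: calc(dna) for name, dna in seqs.items()}
top_key = max(rosa, key=rosa.get)
print(top_key, rosa.get(top_key))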

Search through directory for items with multiple criteria

I'm trying to write some code that searches through a directory and pulls out all the items that start with a certain numbers (defined by a list) and that end with '.labels.txt'. This is what I have so far.
lbldir = '/musc.repo/Data/shared/my_labeled_images/labeled_image_maps/'

picnum = []
for ii in os.listdir(picdir):
    num = ii.rstrip('.png')
    picnum.append(num)

lblpath = []
for file in os.listdir(lbldir):
    if fnmatch.fnmatch(file, '*.labels.txt') and fnmatch.fnmatch(file, ii in picnum + '.*'):
        lblpath.append(os.path.abspath(file))
Here is the error I get
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-10-a03c65e65a71> in <module>()
3 lblpath = []
4 for file in os.listdir(lbldir):
----> 5 if fnmatch.fnmatch(file, '*.labels.txt') and fnmatch.fnmatch(file, ii in picnum + '.*'):
6 lblpath.append(os.path.abspath(file))
TypeError: can only concatenate list (not "str") to list
I realize the ii in picnum part won't work but I don't know how to get around it. Can this be accomplished with the fnmatch module or do I need regular expressions?
The error comes because you are trying to add ".*" (a string) to the end of picnum, which is a list, and not a string.
Also, ii in picnum isn't giving you back each item of picnum, because you are not iterating over ii. It just has the last value that it was assigned in your first loop.
Instead of testing both at once with the and, you might have a nested test that operates when you find a file matching .labels.txt, as below. This uses re instead of fnmatch to extract the digits from the beginning of the file name, instead of trying to match each picnum. This replaces your second loop:
import re

for file in os.listdir(lbldir):
    if file.endswith('.labels.txt'):
        startnum = re.match(r"\d+", file)
        if startnum and startnum.group(0) in picnum:
            lblpath.append(os.path.abspath(file))
I think that should work, but it is obviously untested without your actual file names.
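If you want to avoid regular expressions altogether, str.startswith also accepts a tuple of prefixes, so the same test can be written with plain string methods; a sketch using the picnum and lblpath variables from above (the trailing '.' guards against one number matching a longer one as a prefix):
# No regex: startswith accepts a tuple of candidate prefixes.
prefixes = tuple(num + '.' for num in picnum)
for file in os.listdir(lbldir):
    if file.endswith('.labels.txt') and file.startswith(prefixes):
        lblpath.append(os.path.abspath(file))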
