python newbie - where is my if/else wrong? - python

Complete beginner so I'm sorry if this is obvious!
I have a file which is name | +/- or IG_name | 0 in a long list like so -
S1 +
IG_1 0
S2 -
IG_S3 0
S3 +
S4 -
dnaA +
IG_dnaA 0
Everything which starts with IG_ has a corresponding name. I want to add the + or - to the IG_name. e.g. IG_S3 is + like S3 is.
The information is gene names and strand information, IG = intergenic region. Basically I want to know which strand the intergenic region is on.
What I think I want:
open file
for every line, if the line starts with IG_*
find the line with *
print("IG_" and the line it found)
else
print line
What I have:
with open(sys.argv[2]) as geneInfo:
with open(sys.argv[1]) as origin:
for line in origin:
if line.startswith("IG_"):
name = line.split("_")[1]
nname = name[:-3]
for newline in geneInfo:
if re.match(nname, newline):
print("IG_"+newline)
else:
print(line)
where origin is the mixed list and geneInfo has only the names not IG_names.
With this code I end up with a list containing only the else statements.
S1 +
S2 -
S3 +
S4 -
dnaA +
My problem is that I don't know what is wrong to search so I can (attempt) to fix it!

Below is some step-by-step annotated code that hopefully does what you want (though instead of using print I have aggregated the results into a list so you can actually make use of it). I'm not quite sure what happened with your existing code (especially how you're processing two files?)
s_dict = {}
ig_list = []
with open('genes.txt', 'r') as infile: # Simulating reading the file you pass in sys.argv
for line in infile:
if line.startswith('IG_'):
ig_list.append(line.split()[0]) # Collect all our IG values for later
else:
s_name, value = line.split() # Separate out the S value and its operator
s_dict[s_name] = value.strip() # Add to dictionary to map S to operator
# Now you can go back through your list of IG values and append the appropriate operator
pulled_together = []
for item in ig_list:
s_value = item.split('_')[1]
# The following will look for the operator mapped to the S value. If it is
# not found, it will instead give you 'not found'
corresponding_operator = s_dict.get(s_value, 'Not found')
pulled_together.append([item, corresponding_operator])
print ('List structure')
print (pulled_together)
print ('\n')
print('Printout of each item in list')
for item in pulled_together:
print(item[0] + '\t' + item[1])

nname = name[:-3]
Python's slicing through list is very powerful, but can be tricky to understand correctly.
When you write [:-3], you take everything except the last three items. The thing is, if you have less than three element in your list, it does not return you an error, but an empty list.
I think this is where things does not work, as there are not much elements per line, it returns you an empty list. If you could tell what do you exactly want it to return there, with an example or something, it would help a lot, as i don't really know what you're trying to get with your slicing.

Does this do what you want?
from __future__ import print_function
import sys
# Read and store all the gene info lines, keyed by name
gene_info = dict()
with open(sys.argv[2]) as gene_info_file:
for line in gene_info_file:
tokens = line.split()
name = tokens[0].strip()
gene_info[name] = line
# Read the other file and lookup the names
with open(sys.argv[1]) as origin_file:
for line in origin_file:
if line.startswith("IG_"):
name = line.split("_")[1]
nname = name[:-3].strip()
if nname in gene_info:
lookup_line = gene_info[nname]
print("IG_" + lookup_line)
else:
pass # what do you want to do in this case?
else:
print(line)

Related

How to extract and trim the fasta sequence using biopython

Hellow everybody, I am new to python struggling to do a small task using biopython.I have two file- one containing list of ids and associated number.eg
id.txt
tr_F6LMO6_F6LMO6_9LE 25
tr_F6ISE0_F6ISE0_9LE 17
tr_F6HSF4_F6HSF4_9LE 27
tr_F6PLK9_F6PLK9_9LE 19
tr_F6HOT8_F6HOT8_9LE 29
Second file containg a large fasta sequences.eg below
fasta_db.fasta
>tr|F6LMO6|F6LMO6_9LEHG Transporter
MLAPETRRKRLFSLIFLCTILTTRDLLSVGIFQPSHNARYGGMGGTNLAIGGSPMDIGTN
PANLGLSSKKELEFGVSLPYIRSVYTDKLQDPDPNLAYTNSQNYNVLAPLPYIAIRIPIT
EKLTYGGGVYVPGGGNGNVSELNRATPNGQTFQNWSGLNISGPIGDSRRIKESYSSTFYV
>tr|F6ISE0|F6ISE0_9LEHG peptidase domain protein OMat str.
MPILKVAFVSFVLLVFSLPSFAEEKTDFDGVRKAVVQIKVYSQAINPYSPWTTDGVRASS
GTGFLIGKKRILTNAHVVSNAKFIQVQRYNQTEWYRVKILFIAHDCDLAILEAEDGQFYK
>tr|F6HSF4|F6HSF4_9LEHG hypothetical protein,
MNLRSYIREIQVGLLCILVFLMSLYLLYFESKSRGASVKEILGNVSFRYKTAQRKFPDRM
LWEDLEQGMSVFDKDSVRTDEASEAVVHLNSGTQIELDPQSMVVLQLKENREILHLGEGS
>tr|F6PLK9|F6PLK9_9LEHG Uncharacterized protein mano str.
MRKITGSYSKISLLTLLFLIGFTVLQSETNSFSLSSFTLRDLRLQKSESGNNFIELSPRD
RKQGGELFFDFEEDEASNLQDKTGGYRVLSSSYLVDSAQAHTGKRSARFAGKRSGIKISG
I wanted to match the id from the first file with second file and print those matched seq in a new file after removing the length(from 1 to 25, in eq) .
Eg output[ 25(associated value with id,first file), aa removed from start, when id matched].
fasta_pruned.fasta
>tr|F6LMO6|F6LMO6_9LEHG Transporter
LLSVGIFQPSHNARYGGMGGTNLAIGGSPMDIGTNPANLGLSSKKELEFGVSL
PYIRSVYTDKLQDPDPNLAYTNSQNYNVLAPLPYIAIRIPITEKLTYGGGVYV
PGGGNGNVSELNRATPNGQTFQNWSGLNISGPIGDSRRIKESYSSTFYV
Biopython cookbook was way above my head being new to python programming.Thanks for any help you can give.
I tried and messed up. Here is it.
from Bio import SeqIO
from Bio import Seq
f1 = open('fasta_pruned.fasta','w')
lengthdict = dict()
with open("seqid_len.txt") as seqlengths:
for line in seqlengths:
split_IDlength = line.strip().split(' ')
lengthdict[split_IDlength[0]] = split_IDlength[1]
with open("species.fasta","rU") as spe:
for record in SeqIO.parse(spe,"fasta"):
if record[0] == '>' :
split_header = line.split('|')
accession_ID = split_header[1]
if accession_ID in lengthdict:
f1.write(str(seq_record.id) + "\n")
f1.write(str(seq_record_seq[split_IDlength[1]-1:]))
f1.close()
Your code has almost everything except for a couple of small things which prevent it from giving the desired output:
Your file id.txt has two spaces between the id and the location. You take the 2nd element which would be empty in this case.
When the file is read it is interpreted as a string but you want the position to be an integer
lengthdict[split_IDlength[0]] = int(split_IDlength[-1])
Your ids are very similar but not identical, the only identical part is the 6 character identifier which could be used to map the two files (double check that before you assume it works). Having identical keys makes mapping much easier.
f1 = open('fasta_pruned.fasta', 'w')
fasta = dict()
with open("species.fasta","rU") as spe:
for record in SeqIO.parse(spe, "fasta"):
fasta[record.id.split('|')[1]] = record
lengthdict = dict()
with open("seqid_len.txt") as seqlengths:
for line in seqlengths:
split_IDlength = line.strip().split(' ')
lengthdict[split_IDlength[0].split('_')[1]] = int(split_IDlength[1])
for k, v in lengthdict.items():
if fasta.get(k) is None:
continue
print('>' + k)
print(fasta[k].seq[v:])
f1.write('>{}\n'.format(k))
f1.write(str(fasta[k].seq[v:]) + '\n')
f1.close()
Output:
>F6LMO6
LLSVGIFQPSHNARYGGMGGTNLAIGGSPMDIGTNPANLGLSSKKELEFGVSLPYIRSVYTDKLQDPDPNLAYTNSQNYNVLAPLPYIAIRIPITEKLTYGGGVYVPGGGNGNVSELNRATPNGQTFQNWSGLNISGPIGDSRRIKESYSSTFYV
>F6ISE0
LPSFAEEKTDFDGVRKAVVQIKVYSQAINPYSPWTTDGVRASSGTGFLIGKKRILTNAHVVSNAKFIQVQRYNQTEWYRVKILFIAHDCDLAILEAEDGQFYK
>F6HSF4
YFESKSRGASVKEILGNVSFRYKTAQRKFPDRMLWEDLEQGMSVFDKDSVRTDEASEAVVHLNSGTQIELDPQSMVVLQLKENREILHLGEGS
>F6PLK9
IGFTVLQSETNSFSLSSFTLRDLRLQKSESGNNFIELSPRDRKQGGELFFDFEEDEASNLQDKTGGYRVLSSSYLVDSAQAHTGKRSARFAGKRSGIKISG
>F6HOT8

Python read .txt File -> list

I have a .txt File and I want to get the values in a list.
The format of the txt file should be:
value0,timestamp0
value1,timestamp1
...
...
...
In the end I want to get a list with
[[value0,timestamp0],[value1,timestamp1],.....]
I know it's easy to get these values by
direction = []
for line in open(filename):
direction,t = line.strip().split(',')
direction = float(direction)
t = long(t)
direction.append([direction,t])
return direction
But I have a big problem: When creating the data I forgot to insert a "\n" in each row.
Thats why I have this format:
value0, timestamp0value1,timestamp1value2,timestamp2value3.....
Every timestamp has exactly 13 characters.
Is there a way to get these data in a list as I want it? Would be very much work get the data again.
Thanks
Max
import re
input = "value0,0123456789012value1,0123456789012value2,0123456789012value3"
for (line, value, timestamp) in re.findall("(([^,]+),(.{13}))", input):
print value, timestamp
You will have to strip the last , but you can insert a comma after every 13 chars following a comma:
import re
s = "-0.1351197,1466615025472-0.25672746,1466615025501-0.3661744,1466615025531-0.4646‌​7665,1466615025561-0.5533287,1466615025591-0.63311553,1466615025621-0.7049236,146‌​6615025652-0.7695509,1466615025681-1.7158673,1466615025711-1.6896278,146661502574‌​1-1.65375,1466615025772-1.6092329,1466615025801"
print(re.sub("(?<=,)(.{13})",r"\1"+",", s))
Which will give you:
-0.1351197,1466615025472,-0.25672746,1466615025501,-0.3661744,1466615025531,-0.4646‌​7665,1466615025561,-0.5533287,1466615025591,-0.63311553,1466615025621,-0.7049236,146‌​6615025652-0.7695509,1466615025681,-1.7158673,1466615025711,-1.6896278,146661502574‌​1-1.65375,1466615025772,-1.6092329,1466615025801,
I coded a quickie using your example, and not using 13 but len("timestamp") so you can adapt
instr = "value,timestampvalue2,timestampvalue3,timestampvalue4,timestamp"
previous_i = 0
for i,c in enumerate(instr):
if c==",":
next_i = i+len("timestamp")+1
print(instr[previous_i:next_i])
previous_i = next_i
output is descrambled:
value,timestamp
value2,timestamp
value3,timestamp
value4,timestamp
I think you could do something like this:
direction = []
for line in open(filename):
list = line.split(',')
v = list[0]
for s in list[1:]:
t = s[:13]
direction.append([float(v), long(t)])
v = s[13:]
If you're using python 3.X, then the long function no longer exists -- use int.

Unable to create a dictionary

I am trying to write a script for log parsing.
I got a file in which logs are jumbled up. Every first line of a particular log will have time stamp so I want to sort them using that.
For e.g.
10:48 Start
.
.
10:50 start
.
.
10:42 start
First line will contain key word ‘Start’ and the time stamp. The lines between ‘Start’ and before next ‘start’ are one set. I want to sort all of these sets in log files based on their time stamp.
Code Logic:
I thought of creating dictionary, where I will pick this time and assign it as ‘key’ and the text in value for that log set. And then I will sort the ‘keys’ in dictionary and print their ‘values’ in that sorted order in a file.
However I am getting error “TypeError: unhashable type: 'list'”
write1 = False
x = 0
search3 = "start"
matched = dict()
matched = {}
# fo is a list which is defined elsewhre in the code.
for line in fo:
if search3 in line:
#got the Hello2 printed which indicates script enters this loop
print('hello2')
write1 = True
x +=1
time = line.split()[3]
name1 = [time.split(':')[0] +":"+time.split(':')[1] + ":"+ time.split(':')[2]]
matched[name1] = line
elif write1:
matched[name1] += line
print(matched.keys())
Please let me know if my logic and the way I am doing is correct?
You set name1 as a list. Lists aren't hashable, only tuples are. However, I assume that you want name1 to be a string so you just want to remove the brackets:
name1 = time.split(':')[0] +":"+time.split(':')[1] + ":"+ time.split(':')[2]

Compare configuration data text with a default data text

I am in the process of understanding how to compare data from two text files and print the data that does not match into a new document or output.
The Program Goal:
Allow the user to compare the data in a file that contains many lines of data with a default file that has the correct values of the data.
Compare multiple lines of different data with the same parameters against a default list of the data with the same parameters
Example:
Lets say I have the following text document that has these parameters and data:
Lets call it Config.txt:
<231931844151>
Bird = 3
Cat = 4
Dog = 5
Bat = 10
Tiger = 11
Fish = 16
<92103884812>
Bird = 4
Cat = 40
Dog = 10
Bat = Null
Tiger = 19
Fish = 24
etc. etc.
Let's call this my Configuration data, now I need to make sure that the values these parameters inside my Config Data file are correct.
So I have a default data file that has the correct values for these parameters/variables. Lets call it Default.txt
<Correct Parameters>
Bird = 3
Cat = 40
Dog = 10
Bat = 10
Tiger = 19
Fish = 234
This text file is the default configuration or the correct configuration for the data.
Now I want to compare these two files and print out the data that is incorrect.
So, in theory, if I were to compare these two text document I should get an output of the following: Lets call this Output.txt
<231931844151>
Cat = 4
Dog = 5
Tiger = 11
Fish = 16
<92103884812>
Bird = 4
Bat = Null
Fish = 24
etc. etc.
Since these are the parameters that are incorrect or do not match. So in this case we see that for <231931844151> the parameters Cat, Dog, Tiger, and Fish did not match the default text file so those get printed. In the case of <92103884812> Bird, Bat, and Fish do not match the default parameters so those get printed.
So that's the gist of it for now.
Code:
Currently this is my approach I am trying to do however I'm not sure how I can compare a data file that has different sets of lines with the same parameters to a default data file.
configFile = open("Config.txt", "rb")
defaultFile = open("Default.txt", "rb")
with open(configFile) as f:
dataConfig = f.read().splitlines()
with open(defaultFile) as d:
dataDefault = d.read().splitlines()
def make_dict(data):
return dict((line.split(None, 1)[0], line) for line in data)
defdict = make_dict(dataDefault)
outdict = make_dict(dataConfig)
#Create a sorted list containing all the keys
allkeys = sorted(set(defdict) | set(outdict))
#print allkeys
difflines = []
for key in allkeys:
indef = key in defdict
inout = key in outdict
if indef and not inout:
difflines.append(defdict[key])
elif inout and not indef:
difflines.append(outdict[key])
else:
#key must be in both dicts
defval = defdict[key]
outval = outdict[key]
if outval != defval:
difflines.append(outval)
for line in difflines:
print line
Summary:
I want to compare two text documents that have data/parameters in them, One text document will have a series of data with the same parameters while the other will have just one series of data with the same parameters. I need to compare those parameters and print out the ones that do not match the default. How can I go about doing this in Python?
EDIT:
Okay so thanks to #Maria 's code I think I am almost there. Now I just need to figure out how to compare the dictionary to the list and print out the differences. Here's an example of what I am trying to do:
for i in range (len(setNames)):
print setNames[i]
for k in setData[i]:
if k in dataDefault:
print dataDefault
obvious the print line is just there to see if it worked or not but I'm not sure if this is the proper way about going through this.
Sample code for parsing the file into separate dictionaries. This works by finding the group separators (blank lines). setNames[i] is the name of the set of parameters in the dictionary at setData[i]. Alternatively you can create an object which has a string name member and a dictionary data member and keep a list of those. Doing the comparisons and outputting it how you want is up to you, this just regurgitates the input file to the command line in a slightly different format.
# The function you wrote
def make_dict(data):
return dict((line.split(None, 1)[0], line) for line in data)
# open the file and read the lines into a list of strings
with open("Config.txt" , "rb") as f:
dataConfig = f.read().splitlines()
# get rid of trailing '', as they cause problems and are unecessary
while (len(dataConfig) > 0) and (dataConfig[len(dataConfig) - 1] == ''):
dataConfig.pop()
# find the indexes of all the ''. They amount to one index past the end of each set of parameters
setEnds = []
index = 0
while '' in dataConfig[index:]:
setEnds.append(dataConfig[index:].index('') + index)
index = setEnds[len(setEnds) - 1] + 1
# separate out your input into separate dictionaries, and keep track of the name of each dictionary
setNames = []
setData = []
i = 0;
j = 0;
while j < len(setEnds):
setNames.append(dataConfig[i])
setData.append(make_dict(dataConfig[i+1:setEnds[j]]))
i = setEnds[j] + 1
j += 1
# handle the last index to the end of the list. Alternativel you could add len(dataConfig) to the end of setEnds and you wouldn't need this
if len(setEnds) > 0:
setNames.append(dataConfig[i])
setData.append(make_dict(dataConfig[i+1:]))
# regurgitate the input to prove it worked the way you wanted.
for i in range(len(setNames)):
print setNames[i]
for k in setData[i]:
print "\t" + k + ": " + setData[i][k];
print ""
Why not just use those dicts and loop through them to compare?
for keys in outdict:
if defdict.get(keys):
print outdict.get(keys)

python--import data from file and autopopulate a dictionary

I am a python newbie and am trying to accomplish the following.
A text file contains data in a slightly weird format and I was wondering whether there is an easy way to parse it and auto-fill an empty dictionary with the correct keys and values.
The data looks something like this
01> A B 2 ##01> denotes the line number, that's all
02> EWMWEM
03> C D 3
04> EWWMWWST
05> Q R 4
06> WESTMMMWW
So each pair of lines describe a full set of instructions for a robot arm. For lines 1-2 is for arm1, 3-4 is for arm 2, and so on. The first line states the location and the second line states the set of instructions (movement, changes in direction, turns, etc.)
What I am looking for is a way to import this text file, parse it properly, and populate a dictionary that will generate automatic keys. Note the file only contains value. This is why I am having a hard time. How do I tell the program to generate armX (where X is the ID from 1 to n) and assign a tuple (or a pair) to it such that the dictionary reads.
dict = {'arm1': ('A''B'2, EWMWEM) ...}
I am sorry if the newbie-ish vocab is redundant or unclear. Please let me know and I will be happy to clarify.
A commented code that is easy to understand will help me learn the concepts and motivation.
Just to provide some context. The point of the program is to load all the instructions and then execute the methods on the arms. So if you think there is a more elegant way to do it without loading all the instructions, please suggest.
def get_instructions_dict(instructions_file):
even_lines = []
odd_lines = []
with open(instructions_file) as f:
i = 1
for line in f:
# split the lines into id and command lines
if i % 2==0:
# command line
even_lines.append(line.strip())
else:
# id line
odd_lines.append(line.strip())
i += 1
# create tuples of (id, cmd) and zip them with armX ( armX, (id, command) )
# and combine them into a dict
result = dict( zip ( tuple("arm%s" % i for i in range(1,len(odd_lines)+1)),
tuple(zip(odd_lines,even_lines)) ) )
return result
>>> print get_instructions_dict('instructions.txt')
{'arm3': ('Q R 4', 'WESTMMMWW'), 'arm1': ('A B 2', 'EWMWEM'), 'arm2': ('C D 3', 'EWWMWWST')}
Note dict keys are not ordered. If that matters, use OrderedDict
I would do something like that:
mydict = {} # empty dict
buffer = ''
for line in open('myFile'): # open the file, read line by line
linelist = line.strip().replace(' ', '').split('>') # line 1 would become ['01', 'AB2']
if len(linelist) > 1: # eliminates empty lines
number = int(linelist[0])
if number % 2: # location line
buffer = linelist[1] # we keep this till we know the instruction
else:
mydict['arm%i' % number/2] = (buffer, linelist[1]) # we know the instructions, we write all to the dict
robot_dict = {}
arm_number = 1
key = None
for line in open('sample.txt'):
line = line.strip().replace("\n",'')
if not key:
location = line
key = 'arm' + str(arm_number) #setting key for dict
else:
instruction = line
robot_dict[key] = (location,line)
key = None #reset key
arm_number = arm_number + 1

Categories