Compare configuration data text with a default data text - python

I am trying to understand how to compare data from two text files and print the data that does not match to a new document or output.
The Program Goal:
Allow the user to compare the data in a file that contains many lines of data with a default file that holds the correct values.
Compare multiple sets of lines, each using the same parameters, against a single default list of those parameters.
Example:
Let's say I have the following text document that has these parameters and data.
Let's call it Config.txt:
<231931844151>
Bird = 3
Cat = 4
Dog = 5
Bat = 10
Tiger = 11
Fish = 16
<92103884812>
Bird = 4
Cat = 40
Dog = 10
Bat = Null
Tiger = 19
Fish = 24
etc. etc.
Let's call this my configuration data. Now I need to make sure that the values of these parameters inside my config data file are correct.
So I have a default data file that has the correct values for these parameters/variables. Let's call it Default.txt:
<Correct Parameters>
Bird = 3
Cat = 40
Dog = 10
Bat = 10
Tiger = 19
Fish = 234
This text file is the default configuration or the correct configuration for the data.
Now I want to compare these two files and print out the data that is incorrect.
So, in theory, if I were to compare these two text documents, I should get the following output. Let's call this Output.txt:
<231931844151>
Cat = 4
Dog = 5
Tiger = 11
Fish = 16
<92103884812>
Bird = 4
Bat = Null
Fish = 24
etc. etc.
These are the parameters that are incorrect or do not match. In this case, for <231931844151> the parameters Cat, Dog, Tiger, and Fish did not match the default text file, so those get printed. For <92103884812>, Bird, Bat, and Fish do not match the default parameters, so those get printed.
So that's the gist of it for now.
Code:
This is my current approach. However, I'm not sure how to compare a data file that has several sets of lines with the same parameters against a default data file.
# Read both files as lists of lines (text mode, so each line is a str)
with open("Config.txt") as f:
    dataConfig = f.read().splitlines()
with open("Default.txt") as d:
    dataDefault = d.read().splitlines()
def make_dict(data):
    # Key each non-empty line by its first whitespace-separated token
    return dict((line.split(None, 1)[0], line) for line in data if line.strip())
defdict = make_dict(dataDefault)
outdict = make_dict(dataConfig)
# Create a sorted list containing all the keys
allkeys = sorted(set(defdict) | set(outdict))
# print(allkeys)
difflines = []
for key in allkeys:
    indef = key in defdict
    inout = key in outdict
    if indef and not inout:
        difflines.append(defdict[key])
    elif inout and not indef:
        difflines.append(outdict[key])
    else:
        # key must be in both dicts
        defval = defdict[key]
        outval = outdict[key]
        if outval != defval:
            difflines.append(outval)
for line in difflines:
    print(line)
Summary:
I want to compare two text documents that have data/parameters in them. One text document will have a series of data sets with the same parameters, while the other will have just one set of data with those parameters. I need to compare those parameters and print out the ones that do not match the default. How can I go about doing this in Python?
EDIT:
Okay, so thanks to @Maria's code I think I am almost there. Now I just need to figure out how to compare the dictionary to the list and print out the differences. Here's an example of what I am trying to do:
for i in range(len(setNames)):
    print(setNames[i])
    for k in setData[i]:
        if k in dataDefault:
            print(dataDefault)
Obviously the print line is just there to see whether it worked, but I'm not sure if this is the proper way to go about this.

Sample code for parsing the file into separate dictionaries. This works by finding the group separators (blank lines). setNames[i] is the name of the set of parameters in the dictionary at setData[i]. Alternatively, you can create an object which has a string name member and a dictionary data member and keep a list of those. Doing the comparisons and outputting the result how you want is up to you; this just regurgitates the input file to the command line in a slightly different format.
# The function you wrote
def make_dict(data):
    return dict((line.split(None, 1)[0], line) for line in data)
# open the file (text mode) and read the lines into a list of strings
with open("Config.txt", "r") as f:
    dataConfig = f.read().splitlines()
# get rid of trailing '', as they cause problems and are unnecessary
while dataConfig and dataConfig[-1] == '':
    dataConfig.pop()
# find the indexes of all the ''. They amount to one index past the end of each set of parameters
setEnds = []
index = 0
while '' in dataConfig[index:]:
    setEnds.append(dataConfig.index('', index))
    index = setEnds[-1] + 1
# separate out your input into separate dictionaries, and keep track of the name of each dictionary
setNames = []
setData = []
i = 0
j = 0
while j < len(setEnds):
    setNames.append(dataConfig[i])
    setData.append(make_dict(dataConfig[i+1:setEnds[j]]))
    i = setEnds[j] + 1
    j += 1
# handle the last set, which runs to the end of the list. Alternatively you could add len(dataConfig) to the end of setEnds and you wouldn't need this
if i < len(dataConfig):
    setNames.append(dataConfig[i])
    setData.append(make_dict(dataConfig[i+1:]))
# regurgitate the input to prove it worked the way you wanted.
for i in range(len(setNames)):
    print(setNames[i])
    for k in setData[i]:
        print("\t" + k + ": " + setData[i][k])
    print("")
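To go from the parsed sets to the output the question asks for, something along these lines might work. It is only a sketch: parse_params and diff_against_default are hypothetical helper names, and it assumes that every set starts with a header line beginning with "<" and that every parameter line looks like "Name = value". It works on lists of lines, so it does not depend on blank-line separators.

```python
def parse_params(lines):
    """Map each parameter name to its value for lines like 'Cat = 40'."""
    params = {}
    for line in lines:
        name, sep, value = line.partition("=")
        if sep:  # lines without '=' (e.g. '<Correct Parameters>') are skipped
            params[name.strip()] = value.strip()
    return params


def diff_against_default(config_lines, default_lines):
    """Return a list of (set_name, [lines whose value differs from the default])."""
    defaults = parse_params(default_lines)
    results = []
    for line in config_lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith("<"):  # assumed header format, e.g. '<231931844151>'
            results.append((line, []))
        elif results and "=" in line:
            name, _, value = line.partition("=")
            if defaults.get(name.strip()) != value.strip():
                results[-1][1].append(line)
    return results


config = ["<231931844151>", "Bird = 3", "Cat = 4", "Dog = 5",
          "Bat = 10", "Tiger = 11", "Fish = 16",
          "<92103884812>", "Bird = 4", "Cat = 40", "Dog = 10",
          "Bat = Null", "Tiger = 19", "Fish = 24"]
default = ["<Correct Parameters>", "Bird = 3", "Cat = 40", "Dog = 10",
           "Bat = 10", "Tiger = 19", "Fish = 234"]

for set_name, bad_lines in diff_against_default(config, default):
    print(set_name)
    for bad in bad_lines:
        print(bad)
```

Values are compared as strings, so "10" and "10.0" would count as different; convert them first if that matters for your data.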

Why not just use those dicts and loop through them to compare?
for key in outdict:
    # print the config line when it differs from the default (or has no default)
    if defdict.get(key) != outdict[key]:
        print(outdict[key])

Related

How to extract and trim the fasta sequence using biopython

Hello everybody, I am new to Python and struggling to do a small task using Biopython. I have two files. One contains a list of ids, each with an associated number, e.g.
id.txt
tr_F6LMO6_F6LMO6_9LE 25
tr_F6ISE0_F6ISE0_9LE 17
tr_F6HSF4_F6HSF4_9LE 27
tr_F6PLK9_F6PLK9_9LE 19
tr_F6HOT8_F6HOT8_9LE 29
The second file contains a large set of FASTA sequences, e.g.
fasta_db.fasta
>tr|F6LMO6|F6LMO6_9LEHG Transporter
MLAPETRRKRLFSLIFLCTILTTRDLLSVGIFQPSHNARYGGMGGTNLAIGGSPMDIGTN
PANLGLSSKKELEFGVSLPYIRSVYTDKLQDPDPNLAYTNSQNYNVLAPLPYIAIRIPIT
EKLTYGGGVYVPGGGNGNVSELNRATPNGQTFQNWSGLNISGPIGDSRRIKESYSSTFYV
>tr|F6ISE0|F6ISE0_9LEHG peptidase domain protein OMat str.
MPILKVAFVSFVLLVFSLPSFAEEKTDFDGVRKAVVQIKVYSQAINPYSPWTTDGVRASS
GTGFLIGKKRILTNAHVVSNAKFIQVQRYNQTEWYRVKILFIAHDCDLAILEAEDGQFYK
>tr|F6HSF4|F6HSF4_9LEHG hypothetical protein,
MNLRSYIREIQVGLLCILVFLMSLYLLYFESKSRGASVKEILGNVSFRYKTAQRKFPDRM
LWEDLEQGMSVFDKDSVRTDEASEAVVHLNSGTQIELDPQSMVVLQLKENREILHLGEGS
>tr|F6PLK9|F6PLK9_9LEHG Uncharacterized protein mano str.
MRKITGSYSKISLLTLLFLIGFTVLQSETNSFSLSSFTLRDLRLQKSESGNNFIELSPRD
RKQGGELFFDFEEDEASNLQDKTGGYRVLSSSYLVDSAQAHTGKRSARFAGKRSGIKISG
I want to match the ids from the first file against the second file and write the matched sequences to a new file after removing the leading residues (positions 1 to 25 in the example).
Example output (25 is the value associated with the id in the first file; that many amino acids are removed from the start when the id matches):
fasta_pruned.fasta
>tr|F6LMO6|F6LMO6_9LEHG Transporter
LLSVGIFQPSHNARYGGMGGTNLAIGGSPMDIGTNPANLGLSSKKELEFGVSL
PYIRSVYTDKLQDPDPNLAYTNSQNYNVLAPLPYIAIRIPITEKLTYGGGVYV
PGGGNGNVSELNRATPNGQTFQNWSGLNISGPIGDSRRIKESYSSTFYV
The Biopython cookbook was way above my head, being new to Python programming. Thanks for any help you can give.
I tried and messed up. Here it is.
from Bio import SeqIO
from Bio import Seq
f1 = open('fasta_pruned.fasta','w')
lengthdict = dict()
with open("seqid_len.txt") as seqlengths:
    for line in seqlengths:
        split_IDlength = line.strip().split(' ')
        lengthdict[split_IDlength[0]] = split_IDlength[1]
with open("species.fasta","rU") as spe:
    for record in SeqIO.parse(spe,"fasta"):
        if record[0] == '>' :
            split_header = line.split('|')
            accession_ID = split_header[1]
            if accession_ID in lengthdict:
                f1.write(str(seq_record.id) + "\n")
                f1.write(str(seq_record_seq[split_IDlength[1]-1:]))
f1.close()
Your code has almost everything it needs; a few small things prevent it from giving the desired output:
Your file id.txt has two spaces between the id and the length, so split(' ') returns an empty string as the 2nd element.
The value is read from the file as a string, but you want the position to be an integer:
lengthdict[split_IDlength[0]] = int(split_IDlength[-1])
Your ids are very similar but not identical. The only identical part is the 6-character identifier, which could be used to map the two files (double-check that before you assume it works). Having identical keys makes mapping much easier.
from Bio import SeqIO
f1 = open('fasta_pruned.fasta', 'w')
fasta = dict()
with open("species.fasta") as spe:
    for record in SeqIO.parse(spe, "fasta"):
        # key each record by its 6-character accession, e.g. 'F6LMO6'
        fasta[record.id.split('|')[1]] = record
lengthdict = dict()
with open("seqid_len.txt") as seqlengths:
    for line in seqlengths:
        # split() with no argument copes with one or more spaces
        split_IDlength = line.split()
        lengthdict[split_IDlength[0].split('_')[1]] = int(split_IDlength[-1])
for k, v in lengthdict.items():
    if fasta.get(k) is None:
        continue
    print('>' + k)
    print(fasta[k].seq[v:])
    f1.write('>{}\n'.format(k))
    f1.write(str(fasta[k].seq[v:]) + '\n')
f1.close()
Output:
>F6LMO6
LLSVGIFQPSHNARYGGMGGTNLAIGGSPMDIGTNPANLGLSSKKELEFGVSLPYIRSVYTDKLQDPDPNLAYTNSQNYNVLAPLPYIAIRIPITEKLTYGGGVYVPGGGNGNVSELNRATPNGQTFQNWSGLNISGPIGDSRRIKESYSSTFYV
>F6ISE0
LPSFAEEKTDFDGVRKAVVQIKVYSQAINPYSPWTTDGVRASSGTGFLIGKKRILTNAHVVSNAKFIQVQRYNQTEWYRVKILFIAHDCDLAILEAEDGQFYK
>F6HSF4
YFESKSRGASVKEILGNVSFRYKTAQRKFPDRMLWEDLEQGMSVFDKDSVRTDEASEAVVHLNSGTQIELDPQSMVVLQLKENREILHLGEGS
>F6PLK9
IGFTVLQSETNSFSLSSFTLRDLRLQKSESGNNFIELSPRDRKQGGELFFDFEEDEASNLQDKTGGYRVLSSSYLVDSAQAHTGKRSARFAGKRSGIKISG
>F6HOT8
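The output above keeps only the accession in the header, while the desired fasta_pruned.fasta keeps the full header line. If you want that, here is a rough sketch that does the same matching with plain Python instead of Biopython (so it is not the answer's method, just an alternative). read_fasta and trim_records are hypothetical helper names, and the sketch assumes headers look like ">tr|ACCESSION|..." and ids look like "tr_ACCESSION_...".

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line, []
        elif line and header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def trim_records(fasta_lines, id_lines, width=60):
    """Return FASTA text for matched records with the first N residues removed."""
    offsets = {}
    for line in id_lines:
        fields = line.split()  # copes with one or more spaces
        if len(fields) == 2:
            # 'tr_F6LMO6_F6LMO6_9LE' -> accession 'F6LMO6'
            offsets[fields[0].split("_")[1]] = int(fields[1])
    out = []
    for header, seq in read_fasta(fasta_lines):
        parts = header.split("|")
        accession = parts[1] if len(parts) >= 3 else None
        if accession in offsets:
            trimmed = seq[offsets[accession]:]
            out.append(header)  # keep the full original header
            out.extend(trimmed[i:i + width] for i in range(0, len(trimmed), width))
    return "\n".join(out) + "\n"


example_fasta = [">tr|F6LMO6|F6LMO6_9LEHG Transporter",
                 "MLAPETRRKRLFSLIFLCTILTTRDLLSVGIFQPSHNARYGGMGGTNLAIGGSPMDIGTN"]
example_ids = ["tr_F6LMO6_F6LMO6_9LE  25"]
print(trim_records(example_fasta, example_ids), end="")
```

To use it on real files, pass open file objects instead of the example lists and write the returned text to fasta_pruned.fasta.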

python newbie - where is my if/else wrong?

Complete beginner so I'm sorry if this is obvious!
I have a file in which each line is either name | +/- or IG_name | 0, in a long list like so:
S1 +
IG_1 0
S2 -
IG_S3 0
S3 +
S4 -
dnaA +
IG_dnaA 0
Everything which starts with IG_ has a corresponding name. I want to add the + or - to the IG_name. e.g. IG_S3 is + like S3 is.
The information is gene names and strand information, IG = intergenic region. Basically I want to know which strand the intergenic region is on.
What I think I want:
open file
for every line, if the line starts with IG_*
    find the line with *
    print("IG_" and the line it found)
else
    print line
What I have:
with open(sys.argv[2]) as geneInfo:
    with open(sys.argv[1]) as origin:
        for line in origin:
            if line.startswith("IG_"):
                name = line.split("_")[1]
                nname = name[:-3]
                for newline in geneInfo:
                    if re.match(nname, newline):
                        print("IG_"+newline)
            else:
                print(line)
where origin is the mixed list and geneInfo has only the names not IG_names.
With this code I end up with a list containing only the else statements.
S1 +
S2 -
S3 +
S4 -
dnaA +
My problem is that I don't know what is wrong, so I don't know what to search for in order to (attempt to) fix it!
Below is some step-by-step annotated code that hopefully does what you want (though instead of using print I have aggregated the results into a list so you can actually make use of it). I'm not quite sure what happened with your existing code (especially how you're processing two files?)
s_dict = {}
ig_list = []
with open('genes.txt', 'r') as infile:  # Simulating reading the file you pass in sys.argv
    for line in infile:
        if line.startswith('IG_'):
            ig_list.append(line.split()[0])  # Collect all our IG values for later
        else:
            s_name, value = line.split()  # Separate out the S value and its operator
            s_dict[s_name] = value.strip()  # Add to dictionary to map S to operator
# Now you can go back through your list of IG values and append the appropriate operator
pulled_together = []
for item in ig_list:
    s_value = item.split('_')[1]
    # The following will look for the operator mapped to the S value. If it is
    # not found, it will instead give you 'Not found'
    corresponding_operator = s_dict.get(s_value, 'Not found')
    pulled_together.append([item, corresponding_operator])
print('List structure')
print(pulled_together)
print('\n')
print('Printout of each item in list')
for item in pulled_together:
    print(item[0] + '\t' + item[1])
nname = name[:-3]
Python's slicing is very powerful, but it can be tricky to understand correctly.
When you write [:-3], you take everything except the last three items. The catch is that if the sequence has fewer than three elements, Python does not raise an error; it returns an empty sequence (here an empty string, since name is a string).
I think this is where things go wrong: there are not many characters per line, so the slice may leave you with less than you expect, possibly nothing at all. If you could show exactly what you want it to return there, with an example, it would help a lot, as I don't really know what you're trying to get with your slicing.
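To see what the slice does on lines like the ones in the question, here is a small experiment. The sample strings are made up to match the format shown above (name, one space, strand symbol, newline); the last two lines show an alternative that splits on whitespace and so does not depend on how many characters trail the name.

```python
line = "IG_S3 0\n"
name = line.split("_")[1]           # everything after the first underscore
trimmed = name[:-3]                 # drops the space, the '0' and the newline

no_newline = "IG_S3 0".split("_")[1][:-3]   # e.g. a last line without '\n'
too_short = "ab"[:-3]                       # fewer than three characters: no error

# Whitespace-based alternative: take the first field, then strip the 'IG_' prefix
robust = line.split()[0].split("_", 1)[1]

print(repr(name), repr(trimmed), repr(no_newline), repr(too_short), repr(robust))
```

The slice only gives the gene name when exactly three characters follow it, which is why a final line with no newline comes out one character short.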
Does this do what you want?
from __future__ import print_function
import sys
# Read and store all the gene info lines, keyed by name
gene_info = dict()
with open(sys.argv[2]) as gene_info_file:
    for line in gene_info_file:
        tokens = line.split()
        name = tokens[0].strip()
        gene_info[name] = line
# Read the other file and lookup the names
with open(sys.argv[1]) as origin_file:
    for line in origin_file:
        if line.startswith("IG_"):
            name = line.split("_")[1]
            nname = name[:-3].strip()
            if nname in gene_info:
                lookup_line = gene_info[nname]
                print("IG_" + lookup_line)
            else:
                pass  # what do you want to do in this case?
        else:
            print(line)

How can I organize case-insensitive text and the material following it?

I'm very new to Python so it'd be very appreciated if this could be explained as in-depth as possible.
If I have some text like this on a text file:
matthew : 60 kg
MaTtHew : 5 feet
mAttheW : 20 years old
maTThEw : student
MaTTHEW : dog owner
How can I make a piece of code that can write something like...
Matthew : 60 kg , 5 feet , 20 years old , student , dog owner
...by only gathering information from the text file?
from functools import reduce  # reduce is not a builtin in Python 3
def test_data():
    # This is obviously the source data as a multi-line string constant.
    source = """
matthew : 60 kg
MaTtHew : 5 feet
mAttheW : 20 years old
maTThEw : student
MaTTHEW : dog owner
bob : 70 kg
BoB : 6 ft
"""
    # Split on newline. This will return a list of lines like ["matthew : 60 kg", "MaTtHew : 5 feet", etc]
    return source.split("\n")
def append_pair(d, p):
    k, v = p
    if k in d:
        d[k] = d[k] + [v]
    else:
        d[k] = [v]
    return d
if __name__ == "__main__":
    # Do a list comprehension. For every line in the test data, split on the first ":", strip off
    # leading/trailing whitespace, and convert to lowercase. This will yield lists of lists.
    # This is mostly a list of key/value size-2-lists
    pairs = [[x.strip().lower() for x in line.split(":", 1)] for line in test_data()]
    # Filter the lists in the main list that do not have a size of 2. This will yield a list of key/value pairs like:
    # [["matthew", "60 kg"], ["matthew", "5 feet"], etc]
    cleaned_pairs = [p for p in pairs if len(p) == 2]
    # This will iterate the list of key/value pairs and send each to append_pair, which will either append to
    # an existing key, or create a new key.
    d = reduce(append_pair, cleaned_pairs, {})
    # Now, just print out the resulting dictionary.
    for k, v in d.items():
        print("{}: {}".format(k, ", ".join(v)))
import sys
# There's a number of assumptions I have to make based on your description.
# I'll try to point those out.
# Should be self-explanatory. something like: "C:\Users\yourname\yourfile"
path_to_file = "put_your_path_here"
# open a file for reading. The 'r' indicates read-only
infile = open(path_to_file, 'r')
# reads in the file line by line and strips the "invisible" endline character
readLines = [line.strip() for line in infile]
# make sure we close the file
infile.close()
# An associative array. Does not use normal numerical indexing.
# Instead, in our case, we'll use a string (the name) to index into it.
# At a given name index (AKA key) we'll save the attributes about that person.
names = dict()
# iterate through each line we read in from the file
# each line in this loop will be stored in the variable
# item for that iteration.
for item in readLines:
    # assuming that your file has a strict format:
    # name : attribute
    index = item.find(':')
    # if there was a ':' found then continue
    if index != -1:
        # grab only the name of the person, trim it and convert the string to all lowercase
        name = item[0:index].strip().lower()
        # see if our associative array already has that person
        if name in names:
            # if that person has already been indexed add the new attribute
            # this assumes there are no duplicates so I don't check for them.
            names[name].append(item[index+1:len(item)].strip())
        else:
            # if that person was not in the array then add them.
            # we're adding a list at that index to store their attributes.
            names[name] = list()
            # append the attribute to the list.
            # the len() function tells us how long the string 'item' is
            # offsetting the index by 1 so we don't capture the ':'
            names[name].append(item[index+1:len(item)].strip())
    else:
        # there was no ':' found in the line so skip it
        pass
# iterate through keys (names) we found.
for name in names:
    # write it to stdout. I am using this because the "print" built-in to python
    # always ends with a new line. This way I can print the name and then
    # iterate through the attributes associated with them
    sys.stdout.write(name + " : ")
    # iterate through attributes
    for attribute in names[name]:
        sys.stdout.write(attribute + ", ")
    # end each person with a new line.
    sys.stdout.write('\n')
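For comparison, the same grouping can be written more compactly with collections.defaultdict. This is just a sketch, assuming every useful line has the form "name : attribute"; group_by_name is a hypothetical helper name and the sample list stands in for the lines of your text file.

```python
from collections import defaultdict


def group_by_name(lines):
    """Collect the attributes for each name, ignoring the case of the name."""
    grouped = defaultdict(list)
    for line in lines:
        name, sep, value = line.partition(":")  # split on the first ':' only
        if sep and name.strip():
            grouped[name.strip().lower()].append(value.strip())
    return grouped


sample = [
    "matthew : 60 kg",
    "MaTtHew : 5 feet",
    "mAttheW : 20 years old",
    "maTThEw : student",
    "MaTTHEW : dog owner",
]

for name, values in group_by_name(sample).items():
    print("{} : {}".format(name.capitalize(), " , ".join(values)))
```

Lowercasing the name before using it as the dictionary key is what makes the grouping case-insensitive; capitalize() is only applied when printing.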

tricky string parsing with python

I have a text file like this:
ID = 31
Ne = 5122
============
List of 104 four tuples:
1 2 12 40
2 3 4 21
.
.
51 21 41 42
ID = 34
Ne = 5122
============
List of 104 four tuples:
3 2 12 40
4 3 4 21
.
.
The four-tuples are tab delimited.
For each ID, I'm trying to make a dictionary with the ID as the key and the four-tuples (in list/tuple form) as the value for that key.
dict = {31: [(1,2,12,40), (2,3,4,21), ...], 34: [(3,2,12,40), (4,3,4,21), ...]}
My string-parsing knowledge is limited to reading the file with file.readlines() and using str.replace() and str.split() on 'ID = '. But there has to be a better way. Here are the beginnings of what I have.
file = open('text.txt', 'r')
fp = file.readlines()
B = []
for x in fp:
    x.replace('\t',',')
    x.replace('\n',')')
    B.append(x)
Something like this:
ll = []
for line in fp:
    tt = tuple(int(x) for x in line.split())
    ll.append(tt)
That will produce a list of tuples to assign to the key for your dictionary.
Python's great for this stuff, why not write up a 5-10 liner for it? It's kind of what the language is meant to excel at.
$ cat test
ID = 31
Ne = 5122
============
List of 104 four tuples:
1 2 12 40
2 3 4 21
ID = 34
Ne = 5122
============
List of 104 four tuples:
3 2 12 40
4 3 4 21
data = {}
for block in open('test').read().split('ID = '):
    if not block.strip():
        continue  # nothing before the first 'ID = '
    lines = block.split('\n')
    ID = int(lines[0])
    # lines[1:4] are the Ne line, the ==== rule and the 'List of ...' header
    tups = [[int(y) for y in filter(None, line.split('\t'))] for line in lines[4:]]
    data[ID] = tuple(filter(None, tups))
print(data)
# {31: ([1, 2, 12, 40], [2, 3, 4, 21]), 34: ([3, 2, 12, 40], [4, 3, 4, 21])}
Only annoying thing is all the filters - sorry, that's just the result of empty strings and stuff from extra newlines, etc. For a one-off little script, it's no biggie.
I think this will do the trick for you:
def parse_file(filename):
    """
    Parses an input data file containing tags of the form "ID = ##" (where ## is a
    number) followed by rows of data. Returns a dictionary where the ID numbers
    are the keys and all of the rows of data are stored as a list of tuples
    associated with the key.
    Args:
        filename (string) name of the file you want to parse
    Returns:
        my_dict (dictionary) dictionary of data with ID numbers as keys
    """
    my_dict = {}
    with open(filename, "r") as my_file:  # handles opening and closing file
        rows = my_file.readlines()
        for row in rows:
            if "ID = " in row:
                my_key = int(row.split("ID = ")[1])  # grab the ID number
                my_list = []  # initialize a new data list for a new ID
                my_dict[my_key] = my_list  # store the list now; rows are appended to it below
            elif row != "\n":  # skip rows that only have newline char
                try:  # if this fails, we don't have a valid data line
                    my_list.append(tuple([int(x) for x in row.split()]))
                except ValueError:
                    continue  # header rows such as "Ne = 5122" are skipped
    return my_dict
I made it a function so that you can call it from anywhere, just passing the filename. It makes assumptions about the file format, but if the file format is always what you showed us here, it should work for you. You would call it on your data.txt file like:
a_dictionary = parse_file("data.txt")
I tested it on the data that you gave us and it seems to work just fine after deleting the "..." rows.
Edit: I noticed one small bug. As written, it will add an empty tuple in place of a new line character ("\n") wherever that appears alone on a line. To fix this, put the try: and except: clauses inside of this:
elif row != "\n": # skips rows that only contain newline char
I added this to the full code above as well.
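If you would rather not rely on try/except to tell data rows from header rows, a variation is to accept a row only when every field is made of digits. The sketch below is written under that assumption (non-negative integers only); parse_blocks is a hypothetical name, and it takes any iterable of lines so it can be tried without a file.

```python
def parse_blocks(lines):
    """Map each ID to the list of integer tuples that follow its 'ID = n' line."""
    result = {}
    current = None
    for line in lines:
        line = line.strip()
        if line.startswith("ID ="):
            # start (or reuse) the list for this ID
            current = result.setdefault(int(line.split("=")[1]), [])
        elif current is not None:
            fields = line.split()
            # header rows ('Ne = 5122', '====', 'List of ...') and '.' rows fail this check
            if fields and all(field.isdigit() for field in fields):
                current.append(tuple(int(field) for field in fields))
    return result


sample = """ID = 31
Ne = 5122
============
List of 104 four tuples:
1\t2\t12\t40
2\t3\t4\t21
.
.
ID = 34
Ne = 5122
============
List of 104 four tuples:
3\t2\t12\t40
4\t3\t4\t21
"""

print(parse_blocks(sample.splitlines()))
```

With a real file you could call parse_blocks(open("data.txt")), since a file object is also an iterable of lines.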

python--import data from file and autopopulate a dictionary

I am a python newbie and am trying to accomplish the following.
A text file contains data in a slightly weird format and I was wondering whether there is an easy way to parse it and auto-fill an empty dictionary with the correct keys and values.
The data looks something like this
01> A B 2 ##01> denotes the line number, that's all
02> EWMWEM
03> C D 3
04> EWWMWWST
05> Q R 4
06> WESTMMMWW
So each pair of lines describes a full set of instructions for a robot arm. Lines 1-2 are for arm 1, lines 3-4 are for arm 2, and so on. The first line of a pair states the location and the second line states the set of instructions (movement, changes in direction, turns, etc.).
What I am looking for is a way to import this text file, parse it properly, and populate a dictionary with automatically generated keys. Note that the file only contains values, which is why I am having a hard time. How do I tell the program to generate armX (where X is the ID from 1 to n) and assign a tuple (or a pair) to it, so that the dictionary reads:
dict = {'arm1': ('A''B'2, EWMWEM) ...}
I am sorry if the newbie-ish vocab is redundant or unclear. Please let me know and I will be happy to clarify.
Commented code that is easy to understand will help me learn the concepts and motivation.
Just to provide some context: the point of the program is to load all the instructions and then execute the methods on the arms. So if you think there is a more elegant way to do it without loading all the instructions, please suggest it.
def get_instructions_dict(instructions_file):
    even_lines = []
    odd_lines = []
    with open(instructions_file) as f:
        i = 1
        for line in f:
            # split the lines into id and command lines
            if i % 2 == 0:
                # command line
                even_lines.append(line.strip())
            else:
                # id line
                odd_lines.append(line.strip())
            i += 1
    # create tuples of (id, cmd) and zip them with armX ( armX, (id, command) )
    # and combine them into a dict
    result = dict(zip(tuple("arm%s" % i for i in range(1, len(odd_lines) + 1)),
                      tuple(zip(odd_lines, even_lines))))
    return result
>>> print(get_instructions_dict('instructions.txt'))
{'arm3': ('Q R 4', 'WESTMMMWW'), 'arm1': ('A B 2', 'EWMWEM'), 'arm2': ('C D 3', 'EWWMWWST')}
Note that dict keys are not guaranteed to be ordered in older Python versions (insertion order is only guaranteed from Python 3.7). If that matters, use OrderedDict.
I would do something like that:
mydict = {}  # empty dict
buffer = ''
for line in open('myFile'):  # open the file, read line by line
    linelist = line.strip().replace(' ', '').split('>')  # line 1 would become ['01', 'AB2']
    if len(linelist) > 1:  # eliminates empty lines
        number = int(linelist[0])
        if number % 2:  # location line
            buffer = linelist[1]  # we keep this till we know the instruction
        else:
            # we know the instructions, we write all to the dict
            # (parentheses needed: % binds before /, and // keeps the arm number an int)
            mydict['arm%i' % (number // 2)] = (buffer, linelist[1])
robot_dict = {}
arm_number = 1
key = None
for line in open('sample.txt'):
    line = line.strip()
    if not key:
        location = line
        key = 'arm' + str(arm_number)  # setting key for dict
    else:
        instruction = line
        robot_dict[key] = (location, instruction)
        key = None  # reset key
        arm_number = arm_number + 1
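Another way to pair the lines up, shown here only as a sketch: drop blank lines, take every other line with slicing, and number the pairs with enumerate. pair_arm_lines is a hypothetical name, and it assumes the file holds just the values (no "01>" prefixes), with each location line followed by its instruction line.

```python
def pair_arm_lines(lines):
    """Build {'arm1': (location, instructions), ...} from alternating lines."""
    cleaned = [line.strip() for line in lines if line.strip()]
    # even positions are location lines, odd positions are instruction lines
    pairs = zip(cleaned[0::2], cleaned[1::2])
    return {"arm%d" % number: pair for number, pair in enumerate(pairs, start=1)}


sample = ["A B 2\n", "EWMWEM\n", "C D 3\n", "EWWMWWST\n", "Q R 4\n", "WESTMMMWW\n"]
arms = pair_arm_lines(sample)
for key in sorted(arms):
    print(key, arms[key])
```

If the file ends with a location line that has no instructions, zip silently drops it, so you may want to check that the number of non-blank lines is even.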
