Flux variability analysis only for transport reactions between compartments?

I would like to do a FVA only for selected reactions, in my case on transport reactions between compartments (e.g. between the cytosol and mitochondrion). I know that I can use selected_reactions in doFVA like this:
import cbmpy as cbm
mod = cbm.CBRead.readSBML3FBC('iMM904.xml.gz')
cbm.doFVA(mod, selected_reactions=['R_FORtm', 'R_CO2tm'])
Is there a way to get the entire list of transport reactions, not only the two I added manually? I thought about
selecting the reactions based on their ending 'tm', but that fails for 'R_ORNt3m' (and probably other reactions, too).
I want to share this model with others. What is the best way of storing the information in the SBML file?
Currently, I would store the information in the reaction annotation as in
this answer. For example
mod.getReaction('R_FORtm').setAnnotation('FVA', 'yes')
which could be parsed.

There is no built-in function for this kind of task. As you already mentioned, relying on the IDs is generally not a good idea, as those can differ between databases, models and groups (e.g. if someone decided to simply enumerate reactions from r1 to rn and/or metabolites from m1 to mm, filtering based on IDs fails). Instead, one can make use of the compartment field of the species. In CBMPy you can access a species' compartment as follows:
import cbmpy as cbm
import pandas as pd
mod = cbm.CBRead.readSBML3FBC('iMM904.xml.gz')
mod.getSpecies('M_atp_c').getCompartmentId()
# will return 'c'
# run an FBA
cbm.doFBA(mod)
This can be used to find all fluxes between compartments as one can check for each reaction in which compartment their reagents are located. A possible implementation could look as follows:
def get_fluxes_associated_with_compartments(model_object, compartments, return_values=True):
    # check whether the provided compartment IDs are valid
    if not isinstance(compartments, (list, set)) or not set(compartments).issubset(model_object.getCompartmentIds()):
        raise ValueError("Please provide valid compartment IDs as a list!")
    else:
        compartments = set(compartments)

    # all reactions in the model
    model_reactions = model_object.getReactionIds()

    # find the reactions whose reagents are located exactly in the provided compartments
    return_reaction_ids = [ri for ri in model_reactions
                           if compartments == set(si.getCompartmentId()
                                                  for si in model_object.getReaction(ri).getSpeciesObj())]

    # return each reaction along with its flux value
    if return_values:
        return {ri: model_object.getReaction(ri).getValue() for ri in return_reaction_ids}

    # return only a list of reaction IDs
    return return_reaction_ids
So you pass your model object and a list of compartment IDs, and for each reaction the function checks whether the compartments of its reagents are exactly the specified ones.
In your case you would use it as follows:
# compartment IDs for cytosol and mitochondrion
comps = ['c', 'm']
# you only want the reaction IDs; drop 'return_values=False' if you also want the corresponding flux values
trans_cyt_mit = get_fluxes_associated_with_compartments(mod, comps, return_values=False)
The list trans_cyt_mit will then contain all desired reaction IDs (also the two you specified in your question) which you can then pass to the doFVA function.
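For completeness, that last step is just the doFVA call from your question with the computed list:
cbm.doFVA(mod, selected_reactions=trans_cyt_mit)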
Regarding the second part of your question: I highly recommend storing those reactions in a group rather than using annotations:
# create an empty group
mod.createGroup('group_trans_cyt_mit')
# get the group object so that we can manipulate it
cyt_mit = mod.getGroup('group_trans_cyt_mit')
# we can only add objects to a group so we get the reaction object for each transport reaction
reaction_objects = [mod.getReaction(ri) for ri in trans_cyt_mit]
# add all the reaction objects to the group
cyt_mit.addMember(reaction_objects)
When you now export the model, e.g. by using
cbm.CBWrite.writeSBML3FBCV2(mod, 'iMM904_with_groups.xml')
this group will be stored in the SBML as well. If a colleague reads the SBML file again, they can easily run an FVA for the same reactions by accessing the group members, which is far easier than parsing annotations:
# do an FVA; fva_res: Reaction, Reduced Costs, Variability Min, Variability Max, abs(Max-Min), MinStatus, MaxStatus
fva_res, rea_names = cbm.doFVA(mod, selected_reactions=mod.getGroup('group_trans_cyt_mit').getMemberIDs())
fva_dict = dict(zip(rea_names, fva_res.tolist()))
# store results in a dataframe which makes the selection of reactions easier
fva_df = pd.DataFrame.from_dict(fva_dict, orient='index')
fva_df = fva_df.rename({0: "flux_value", 1: "reduced_cost_unscaled", 2: "variability_min",
                        3: "variability_max", 4: "abs_diff_var", 5: "min_status", 6: "max_status"},
                       axis='columns')
Now you can easily query the dataframe and find the flexible and non-flexible reactions within your group:
# filter the reactions with flexibility
fva_flex = fva_df.query("abs_diff_var > 10 ** (-4)")
# filter the reactions that are not flexible
fva_not_flex = fva_df.query("abs_diff_var <= 10 ** (-4)")

Related

Python Gitlab API - list shared projects of a group/subgroup

I need to find all projects and shared projects within a Gitlab group with subgroups. I managed to list the names of all projects like this:
group = gl.groups.get(11111, lazy=True)
# find all projects, also in subgroups
projects = group.projects.list(include_subgroups=True, all=True)
for prj in projects:
    print(prj.attributes['name'])
print("")
What I am missing is a way to also list the shared projects within the group. Or, to put it in other words: find all projects where my group is a member. Is this possible with the Python API?
So, inspired by the answer of sytech, I found out that my code was not working in the first place, as the shared projects were still hidden in the subgroups. So I came up with the following code that digs through all levels of subgroups to find all shared projects. I assume this can be written much more elegantly, but it works for me:
# group definition
main_group_id = 11111

# create empty list that will contain the final result
list_subgroups_id_all = []
# create empty list that acts as temporary storage of the results outside the function
list_subgroups_id_stored = []

# function to create a list of subgroups of a group (id)
def find_subgroups(group_id):
    # retrieve group object
    group = gl.groups.get(group_id)
    # create empty list to store ids of subgroups
    list_subgroups_id = []
    # iterate through group to find ids of all subgroups
    for sub in group.subgroups.list():
        list_subgroups_id.append(sub.id)
    return list_subgroups_id

# function to iterate over the various groups for subgroup detection
def iterate_subgroups(group_id, list_subgroups_id_all):
    # for a given id, find existing subgroups (id) and store them in a list
    list_subgroups_id = find_subgroups(group_id)
    # add the found items to the list storage variable, so that the results are not overwritten
    list_subgroups_id_stored.append(list_subgroups_id)
    # for each found subgroup_id, test whether it is already part of the total id list
    # if not, store it and test for more subgroups
    for test_id in list_subgroups_id:
        if test_id not in list_subgroups_id_all:
            # add it to the total subgroup id list (final results list)
            list_subgroups_id_all.append(test_id)
            # check whether test_id contains more subgroups
            list_subgroups_id_tmp = iterate_subgroups(test_id, list_subgroups_id_all)
            # if so, append to the stored subgroup list that is currently checked
            list_subgroups_id_stored.append(list_subgroups_id_tmp)
    return list_subgroups_id_all

# find all subgroups, subsubgroups, etc. and store their ids in a list
list_subgroups_id_all = iterate_subgroups(main_group_id, list_subgroups_id_all)
print("***ids of all subgroups***")
print(list_subgroups_id_all)
print("")

print("***names of all subgroups***")
list_names = []
for ids in list_subgroups_id_all:
    group = gl.groups.get(ids)
    group_name = group.attributes['name']
    list_names.append(group_name)
print(list_names)
print("")

# print all directly integrated projects of the main group, also those in subgroups
print("***integrated projects***")
group = gl.groups.get(main_group_id)
projects = group.projects.list(include_subgroups=True, all=True)
for prj in projects:
    print(prj.attributes['name'])
print("")

# print all shared projects
print("***shared projects***")
for sub in list_subgroups_id_all:
    group = gl.groups.get(sub)
    for shared_prj in group.shared_projects:
        print(shared_prj['path_with_namespace'])
print("")
One question that remains - at the very beginning I retrieve the main group by its id (here: 11111), but can I actually also get this id by looking for the name of the group? Something like: group_id = gl.group.get(attribute={'name','foo'}) (not working)?
You can get the shared projects by the .shared_projects attribute:
group = gl.groups.get(11111)
for proj in group.shared_projects:
print(proj['path_with_namespace'])
However, you cannot use the lazy=True argument to gl.groups.get.
>>> group = gl.groups.get(11111, lazy=True)
>>> group.shared_projects
AttributeError: shared_projects
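Regarding the remaining question about finding a group id by name: as far as I know you can pass a search term when listing groups, so something along these lines should work ('foo' is a placeholder for your group name):
found = gl.groups.list(search='foo', all=True)
# the search also matches substrings, so verify the exact name
group_id = [g.id for g in found if g.name == 'foo'][0]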

Pyomo | Creating simple model with indexed set

I am having trouble creating a simple model in pyomo. I want to define the following abstract model:
An attempt at creating an abstract model
I define
m.V = pyo.Set()
m.C = pyo.Set()  # I first wanted to make this a set indexed over m.V, but that does not work as I cannot create variables with indexed sets (see next line)
m.Components = pyo.Var(m.V * m.C, domain=pyo.Binary)
Now I have no idea how to add the constraint. Just adding
def constr(m, v):
    return sum(m.Components[v, c] for c in m.C) == 2
m.Constraint = pyo.Constraint(m.V, rule=constr)
will lead to the model also summing over components in m.C that should not fall under the given member of m.V (e.g. if I pass m.V = ['Cars', 'Boats'] and one of the 'Boats' components I want to pass is 'New sails', the above constraint will also put a constraint on m.Components['Cars', 'New sails'], which does not make much sense).
Trying to work out a concrete example
Now if I try to work through this problem in a concrete way and follow e.g. Variable indexed by an indexed Set with Pyomo, I still get an issue with the constraint. E.g. say I want to create a model that has this structure:
set_dict = {'Car': ['New wheels', 'New gearbox', 'New seats'], 'Boat': ['New seats', 'New sail', 'New rudder']}
I then create these sets and variables:
m.V = pyo.Set(initialize=['Car', 'Boat'])
m.C = pyo.Set(initialize=['New wheels', 'New gearbox', 'New seats', 'New sail', 'New rudder'])
m.VxC = pyo.Set(m.V * m.C, within=set_dict)
m.Components = pyo.Var(m.VxC, domain=pyo.Binary)
But now I still don't see a way to add the constraint in a pyomo-native way. I cannot define a function that sums just over m.C, as then it will again sum over values that are not allowed (e.g., as above, 'New sail' for the 'Car' vehicle type). It seems the only way to do this is to refer back to set_dict and loop & sum over that?
I need to create an abstract model, so I want to be able to write out this model in a pyomo native way, not relying on additional dictionaries and other objects to pass the right dimensions/sets into the model.
Any idea how I could do this?
You didn't say what form your data is in, but some variation of the below should work. I'm not a huge fan of AbstractModels, but each format for the data has some accommodation for building sparse sets, which is what you want here to represent the legal combinations of V x C.
By adding a membership test within your constraint(s), you can still sum across either V or C as needed.
import pyomo.environ as pyo

m = pyo.AbstractModel()

### SETS
m.V = pyo.Set()
m.C = pyo.Set()
m.VC = pyo.Set(within=m.V * m.C)

### VARS
m.select = pyo.Var(m.VC, domain=pyo.Binary)

### CONSTRAINTS
def constr(m, v):
    return sum(m.select[v, c] for c in m.C if (v, c) in m.VC) == 2
m.Constraint = pyo.Constraint(m.V, rule=constr)
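For illustration, here is one way to feed the sparse set when creating an instance; the data dict below just encodes the example from your question in Pyomo's AbstractModel data format:
data = {None: {
    'V': {None: ['Car', 'Boat']},
    'C': {None: ['New wheels', 'New gearbox', 'New seats', 'New sail', 'New rudder']},
    'VC': {None: [('Car', 'New wheels'), ('Car', 'New gearbox'), ('Car', 'New seats'),
                  ('Boat', 'New seats'), ('Boat', 'New sail'), ('Boat', 'New rudder')]},
}}
instance = m.create_instance(data)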

Sanitization error applying reaction to molecule RDKit

Sanitization error while applying a reaction to a molecule with a wedged bond.
I am getting this error while applying a proton removal reaction to a molecule, but I do not see any error in the MolBlock information.
This is for a reaction problem in which I am trying to apply a simple reaction (proton removal) to a molecule given its isomeric SMILES.
I created a function to apply the reaction using SMARTS and SMILES, but I am getting the following error, which I could not fix.
I am using the following code to load my inputs.
smile = rdkit.Chem.rdmolfiles.MolToSmiles(mol, isomericSmiles=True)
which leads to:
C/C1=C\\C[C@@H]([C+](C)C)CC/C(C)=C/CC1
I create the following dictionary to use my SMILES and SMARTS:
reaction_smarts = {}
# proton removal reaction
reaction_smarts["proton_removal"] = "[Ch:1]-[C+1:2]>>[C:1]=[C+0:2].[H+]"
reactions = {name: AllChem.ReactionFromSmarts(reaction_smarts[name]) for name in reaction_smarts}
# function to run reactions
def run_reaction(molecule, reaction):
    products = []
    for product in reaction.RunReactant(molecule, 0):
        Chem.SanitizeMol(product[0])
        products.append(product[0])
    return products
# apply reaction
products = run_reaction(cation_to_rdkit_mol["mol_name"], reactions["proton_removal"])
At this step I am getting this error but I cannot fix it.
RDKit ERROR: [10:43:23] Explicit valence for atom # 0 C, 5, is greater than permitted
Expected results should be the molecule with the double bond and its stereoisomers:
First product: CC(C)=C1C/C=C(\\C)CC/C=C(\\C)CC1
Second product: C=C(C)[C@@H]1C/C=C(\\C)CC/C=C(\\C)CC1
Third product: C=C(C)[C@H]1C/C=C(\\C)CC/C=C(\\C)CC1
I am using Chem.EnumerateStereoisomers.EnumerateStereoisomers() to get all stereoisomers, but I am only getting the first and second product. I also added your initial proposal, product[0].GetAtomWithIdx(0).SetNumExplicitHs(0), which indeed fixes the explicit valence error; now I am trying to figure out how to get all three stereoisomers.
Any hint why this is happening? Because if I check the mol block with all the info about valence, it seems to be fine.
The error states that the explicit valence for atom 0 (carbon) is 5, which suggests that the explicit hydrogen hasn't been removed although the bond is now a double bond, hence a valence of 5. I am not too familiar with reaction SMARTS, although an easy way to fix this manually would be to set the number of explicit hydrogens on atom 0 to 0 before you sanitize:
product.GetAtomWithIdx(0).SetNumExplicitHs(0)
Chem.SanitizeMol(product)
Edit 1:
Scratch that, I did some experimentation, try this reaction:
rxn = AllChem.ReactionFromSmarts('[#6@@H:1]-[#6+:2] >> [#6H0:1]=[#6+0:2]')
This way in the reaction definition we explicitly state that a hydrogen is lost and the resultant molecule will sanitize. Does this work for you?
Edit 2:
When I run this reaction the product does not seem to contain a cation:
mol = Chem.MolFromSmiles('C/C1=C\\C[C@@H]([C+](C)C)CC/C(C)=C/CC1')
rxn = AllChem.ReactionFromSmarts('[#6@@H:1]-[#6+:2] >> [#6H0:1]=[#6+0:2]')
products = list()
for product in rxn.RunReactant(mol, 0):
    Chem.SanitizeMol(product[0])
    products.append(product[0])
print(Chem.MolToSmiles(products[0]))
Output:
'CC(C)=C1C/C=C(\\C)CC/C=C(\\C)CC1'
Edit 3:
I think I now understand what you are looking for:
mol = Chem.MolFromSmiles('C/C1=C\\C[C@@H]([C+](C)C)CC/C(C)=C/CC1')

# Reactant SMARTS
reactant_smarts = '[CH3:1][C+:2][C@@H:3]'

# Product SMARTS
product_smarts = [
    '[CH2:1]=[CH0+0:2][CH:3]',
    '[CH2:1]=[CH0+0:2][C@H:3]',
    '[CH2:1]=[CH0+0:2][C@@H:3]',
]

# Reaction SMARTS
reaction_smarts = str(reactant_smarts + '>>' + '.'.join(product_smarts))

# RDKit Reaction
rxn = AllChem.ReactionFromSmarts(reaction_smarts)

# Get Products
results = list()
for products in rxn.RunReactant(mol, 0):
    for product in products:
        Chem.SanitizeMol(product)
        results.append(product)
        print(Chem.MolToSmiles(product))
Output:
'C=C(C)C1C/C=C(\\C)CC/C=C(\\C)CC1'
'C=C(C)[C@H]1C/C=C(\\C)CC/C=C(\\C)CC1'
'C=C(C)[C@@H]1C/C=C(\\C)CC/C=C(\\C)CC1'
'C=C(C)C1C/C=C(\\C)CC/C=C(\\C)CC1'
'C=C(C)[C@H]1C/C=C(\\C)CC/C=C(\\C)CC1'
'C=C(C)[C@@H]1C/C=C(\\C)CC/C=C(\\C)CC1'
Note that we get the same products twice; I think this is because the reactant SMARTS matches both CH3 groups, so the reaction is applied to both. I hope this is what you are looking for.
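If the duplicates bother you, one simple option (my suggestion, not part of the original snippet) is to deduplicate on canonical SMILES:
# keep one product per canonical SMILES string (uses `results` from the loop above)
unique_products = {Chem.MolToSmiles(p): p for p in results}
for smi in unique_products:
    print(smi)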

Why does my association model find subgroups in a dataset when there shouldn't be any?

I give a lot of information on the methods that I used to write my code. If you just want to read my question, skip to the quotes at the end.
I'm working on a project that has a goal of detecting sub populations in a group of patients. I thought this sounded like the perfect opportunity to use association rule mining as I'm currently taking a class on the subject.
There are 42 variables in total. Of those, 20 are continuous and had to be discretized. For each variable, I used the Freedman-Diaconis rule to determine how many categories to divide a group into.
def Freedman_Diaconis(column_values):
    # sort the list first
    column_values[1].sort()
    first_quartile = int(len(column_values[1]) * .25)
    third_quartile = int(len(column_values[1]) * .75)
    fq_value = column_values[1][first_quartile]
    tq_value = column_values[1][third_quartile]
    iqr = tq_value - fq_value
    # 1.0/3 rather than 1/3 so the exponent is not truncated under integer division
    n_to_pow = len(column_values[1]) ** (-1.0 / 3)
    # bin width
    h = 2 * iqr * n_to_pow
    # number of bins = data range / bin width ([0] is the minimum after sorting)
    retval = (column_values[1][-1] - column_values[1][0]) / h
    test = int(retval + 1)
    return test
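(Each column is passed as a (name, values) pair, which is why the function indexes column_values[1] throughout; a call with made-up values would look like num_bins = Freedman_Diaconis(('x1', [0.2, 0.5, 0.1, 0.9, 0.4, 0.7])).)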
From there I used min-max normalization
from sklearn import preprocessing

def min_max_transform(column_of_data, num_bins):
    # scale the column values into the range [1, num_bins]
    min_max_normalizer = preprocessing.MinMaxScaler(feature_range=(1, num_bins))
    data_min_max = min_max_normalizer.fit_transform(column_of_data[1])
    data_min_max_ints = take_int(data_min_max)
    return data_min_max_ints
to transform my data, and then I simply took the integer portion to get the final categorization.
def take_int(list_of_float):
    ints = []
    for flt in list_of_float:
        asint = int(flt)
        ints.append(asint)
    return ints
I then also wrote a function that I used to combine this value with the variable name.
def string_transform(prefix, column, index):
    transformed_list = []
    transformed = ""
    if index < 4:
        for entry in column[1]:
            transformed = prefix + str(entry)
            transformed_list.append(transformed)
    else:
        prefix_num = prefix.split('x')
        for entry in column[1]:
            transformed = str(prefix_num[1]) + 'x' + str(entry)
            transformed_list.append(transformed)
    return transformed_list
This was done to differentiate variables that have the same value, but appear in different columns. For example, having a value of 1 for variable x14 means something different from getting a value of 1 in variable x20. The string transform function would create 14x1 and 20x1 for the previously mentioned examples.
After this, I wrote everything to a file in basket format
import csv
import os

def create_basket(list_of_lists, headers):
    if not os.path.exists('baskets'):
        os.makedirs('baskets')
    down_length = len(list_of_lists[0])
    with open('baskets/dataset.basket', 'w') as basketfile:
        basket_writer = csv.DictWriter(basketfile, fieldnames=headers)
        for i in range(0, down_length):
            basket_writer.writerow({"trt": list_of_lists[0][i], "y": list_of_lists[1][i], "x1": list_of_lists[2][i],
                                    "x2": list_of_lists[3][i], "x3": list_of_lists[4][i], "x4": list_of_lists[5][i],
                                    "x5": list_of_lists[6][i], "x6": list_of_lists[7][i], "x7": list_of_lists[8][i],
                                    "x8": list_of_lists[9][i], "x9": list_of_lists[10][i], "x10": list_of_lists[11][i],
                                    "x11": list_of_lists[12][i], "x12": list_of_lists[13][i], "x13": list_of_lists[14][i],
                                    "x14": list_of_lists[15][i], "x15": list_of_lists[16][i], "x16": list_of_lists[17][i],
                                    "x17": list_of_lists[18][i], "x18": list_of_lists[19][i], "x19": list_of_lists[20][i],
                                    "x20": list_of_lists[21][i], "x21": list_of_lists[22][i], "x22": list_of_lists[23][i],
                                    "x23": list_of_lists[24][i], "x24": list_of_lists[25][i], "x25": list_of_lists[26][i],
                                    "x26": list_of_lists[27][i], "x27": list_of_lists[28][i], "x28": list_of_lists[29][i],
                                    "x29": list_of_lists[30][i], "x30": list_of_lists[31][i], "x31": list_of_lists[32][i],
                                    "x32": list_of_lists[33][i], "x33": list_of_lists[34][i], "x34": list_of_lists[35][i],
                                    "x35": list_of_lists[36][i], "x36": list_of_lists[37][i], "x37": list_of_lists[38][i],
                                    "x38": list_of_lists[39][i], "x39": list_of_lists[40][i], "x40": list_of_lists[41][i]})
and I used the apriori package in Orange to see if there were any association rules.
rules = Orange.associate.AssociationRulesSparseInducer(patient_basket, support=0.3, confidence=0.3)
print "%4s %4s %s" % ("Supp", "Conf", "Rule")
for r in rules:
    my_rule = str(r)
    split_rule = my_rule.split("->")
    if 'trt' in split_rule[1]:
        print 'treatment rule'
        print "%4.1f %4.1f %s" % (r.support, r.confidence, r)
Using this technique, I found quite a few association rules with my testing data.
THIS IS WHERE I HAVE A PROBLEM
When I read the notes for the training data, there is this note
...That is, the only
reason for the differences among observed responses to the same treatment across patients is
random noise. Hence, there is NO meaningful subgroup for this dataset...
My question is,
why do I get multiple association rules that would imply that there are subgroups, when according to the notes I shouldn't see anything?
I'm getting lift numbers that are above 2, as opposed to the 1 you should expect if everything were random, as the notes state.
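(For reference, lift(X -> Y) = supp(X and Y) / (supp(X) * supp(Y)); independence gives exactly 1, so values well above 1 look like genuine associations.)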
Supp Conf Rule
0.3 0.7 6x0 -> trt1
Even though my code runs, I'm not getting results anywhere close to what should be expected. This leads me to believe that I messed something up, but I'm not sure what it is.
After some research, I realized that my sample size is too small for the number of variables that I have. I would need a way larger sample size in order to really use the method that I was using. In fact, the method that I tried to use was developed with the assumption that it would be run on databases with hundreds of thousands or millions of rows.

Python: Find subsets across many-to-many mapping

I am trying to work with a many-to-many mapping, finding subsets of one set that map to specific subsets of the other set.
I have many genes. Each gene is a member of one or more COGs (and vice versa), e.g.:
gene1 is member of COG1
gene1 is member of COG1003
gene2 is member of COG2
gene3 is member of COG273
gene4 is member of COG1
gene5 is member of COG273
gene5 is member of COG71
gene6 is member of COG1
gene6 is member of COG273
I have a short set of COGs that represents an enzyme, e.g. COG1, COG273.
I want to find all sets of genes that between them have membership of every COG in the enzyme, but without unnecessary overlaps (in this case, for instance, 'gene1 and gene6' would be spurious as gene6 is already a member of both COGs).
In this example, the answers would be:
gene1 and gene3
gene1 and gene5
gene3 and gene4
gene4 and gene5
gene6
Although I could get all members of each COG and create a 'product', this would contain spurious results (as mentioned above) where more genes than necessary are in the set.
My mappings are currently contained in a dictionary where the key is the gene ID and the value is a list of the COG IDs of which that gene is a member. However I accept that this might not be the best way to have the mapping stored.
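For concreteness, with the example above that dictionary, and its inverse (COG ID to member genes, which is handy for lookups in the other direction), would look like this:
gene_cog_dict = {'gene1': ['COG1', 'COG1003'], 'gene2': ['COG2'], 'gene3': ['COG273'],
                 'gene4': ['COG1'], 'gene5': ['COG273', 'COG71'], 'gene6': ['COG1', 'COG273']}

# invert the many-to-many mapping: COG ID -> list of member genes
cog_gene_dict = {}
for gene, cogs in gene_cog_dict.items():
    for cog in cogs:
        cog_gene_dict.setdefault(cog, []).append(gene)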
One basic attack:
Keep your representation as it is for now.
Initialize a dictionary with the COGs as keys; each value is an initial count of 0.
Now start building your list of enzyme coverage sets (ecs_list), one ecs at a time. Do this by starting at the front of the gene list and working your way to the end, considering all combinations.
Write a recursive routine to solve the remaining COGs in the enzyme. Something like this:
def pick_a_gene(gene_list, cog_list, solution_set, cog_count_dict):
    pick the first gene in the list that is in at least one cog in the list.
    let the rest of the list be remaining_gene_list.
    add the gene to the solution set.
    for each of the gene's cogs:
        increment the cog's count in cog_count_dict
        remove the cog from cog_list (if it's still there).
    is there anything left in cog_list?
    yes:
        pick_a_gene(remaining_gene_list, cog_list, solution_set, cog_count_dict)
    no:  # we have a solution: check it for minimality
        from every non-zero entry in cog_count_dict, subtract 1. This gives us a list of excess coverage.
        while the excess list is not empty:
            pick the next gene in the solution set, starting from the *end* (if none, break the loop)
            if the gene's cogs are all covered by the excess:
                remove the gene from the solution set.
                decrement the excess count of each of its cogs.
        The remaining set of genes is an ECS; add it to ecs_list
Does this work for you? I believe that it covers the minimal sets properly, given the well-behaved example you have. Note that starting from the high end when we check minimality guards against a case like this:
gene1: cog1, cog5
gene2: cog2, cog5
gene3: cog3
gene4: cog1, cog2, cog4
enzyme: cog1 - cog5
We can see that we need gene3, gene4, and either gene1 or gene2. If we eliminate from the low end, we'll toss out gene1 and never find that solution. If we start from the high end, we'll eliminate gene2, but find that solution in a later pass of the main loop.
It's possible to construct a case in which there is a 3-way conflict of this ilk. In that case, we'd have to write an extra loop in the minimality check to find them all. However, I gather that your data aren't that nasty to us.
def findGenes(seq1, seq2, llist):
    from collections import OrderedDict
    from collections import Counter
    from itertools import product

    od = OrderedDict()
    for b, a in llist:
        od.setdefault(a, []).append(b)

    llv = []
    for k, v in od.items():
        if seq1 == k or seq2 == k:
            llv.append(v)

    # flat list needed for counting gene frequencies
    flatL = [x for sublist in llv for x in sublist]
    cFlatl = Counter(flatL)

    # this will gather genes that, like gene6, belong to both COGs
    l_lonely = []
    for k in cFlatl:
        if cFlatl[k] > 1:
            l_lonely.append(k)

    newL = []
    temp = []
    for sublist in llv:
        for el in sublist:
            if el not in l_lonely:
                newL.append(el)
        temp.append(newL)
        newL = []

    # temp contains only genes that do not belong to both COGs
    # product will connect genes from the different COG groups
    p = product(*temp)
    for el in list(p):
        print(el)
    print(l_lonely)
Input:
lt = [('gene1', 'COG1'), ('gene1', 'COG1003'), ('gene2', 'COG2'), ('gene3', 'COG273'), ('gene4', 'COG1'),
      ('gene5', 'COG273'), ('gene5', 'COG71'), ('gene6', 'COG1'), ('gene6', 'COG273')]
findGenes('COG1', 'COG273', lt)
Output:
('gene1', 'gene3')
('gene1', 'gene5')
('gene4', 'gene3')
('gene4', 'gene5')
['gene6']
Does this do it for you? Note that since you said you had a short set of COGs, I went ahead and did nested for loops; there may be ways to optimize this...
For future reference, please post any code that you've got along with your question.
import itertools

d = {'gene1': ['COG1', 'COG1003'], 'gene2': ['COG2'], 'gene3': ['COG273'],
     'gene4': ['COG1'], 'gene5': ['COG273', 'COG71'], 'gene6': ['COG1', 'COG273']}
# example list of COGs containing only one enzyme; NOTE: your data should be a list of multiple sets
COGs = [set(['COG1', 'COG273'])]

# create all pair-wise combinations of our data
gene_pairs = [l for l in itertools.combinations(d.keys(), 2)]

found = set()
for pair in gene_pairs:
    join = set(d[pair[0]] + d[pair[1]])  # set of COGs covered by the gene pair
    for COG in COGs:
        # check if a single gene is already part of the enzyme
        if sorted(d[pair[0]]) == sorted(list(COG)):
            found.add(pair[0])
        elif sorted(d[pair[1]]) == sorted(list(COG)):
            found.add(pair[1])
        # check if the gene combination is part of the enzyme
        if COG <= join and pair[0] not in found and pair[1] not in found:
            found.add(pair)

for l in found:
    if isinstance(l, tuple):  # if tuple
        print l[0], l[1]
    else:
        print l
Thanks for the suggestions; they have inspired me to hack something together using recursion. I want to deal with arbitrary gene-COG relationships, so it needs to be a general solution. This should yield all sets of genes (enzymes) that between them are members of all required COGs, without duplicate enzymes and without redundant genes:
from copy import deepcopy

def get_enzyme_cogs(enzyme, gene_cog_dict):
    """Get all COGs of which there is at least one member gene in the enzyme."""
    cog_list = []
    for gene in enzyme:
        cog_list.extend(gene_cog_dict[gene])
    return set(cog_list)

def get_gene_by_gene_cogs(enzyme, gene_cog_dict):
    """Get COG memberships for each gene in enzyme."""
    cogs_list = []
    for gene in enzyme:
        cogs_list.append(set(gene_cog_dict[gene]))
    return cogs_list

def add_gene(target_enzyme_cogs, gene_cog_dict, cog_gene_dict, proposed_enzyme=None, fulfilled_cogs=None):
    """Generator for all enzymes with membership of all target_enzyme_cogs, without duplicate enzymes or redundant genes."""
    base_enzyme_genes = proposed_enzyme or []
    fulfilled_cogs = get_enzyme_cogs(base_enzyme_genes, gene_cog_dict)

    ## Which COG will we try to find a member of?
    next_cog_to_fill = sorted(list(target_enzyme_cogs - fulfilled_cogs))[0]
    gene_members_of_cog = cog_gene_dict[next_cog_to_fill]

    for gene in gene_members_of_cog:
        ## Check whether any already-present gene's COG set is a subset of the proposed gene's COG set; if so, skip the addition
        subset_found = False
        proposed_gene_cogs = set(gene_cog_dict[gene]) & target_enzyme_cogs
        for gene_cogs_set in get_gene_by_gene_cogs(base_enzyme_genes, gene_cog_dict):
            if gene_cogs_set.issubset(proposed_gene_cogs):
                subset_found = True
                break
        if subset_found:
            continue

        ## Add gene to proposed enzyme
        proposed_enzyme = deepcopy(base_enzyme_genes)
        proposed_enzyme.append(gene)

        ## Determine which COG memberships are fulfilled by the genes in the proposed enzyme
        fulfilled_cogs = get_enzyme_cogs(proposed_enzyme, gene_cog_dict)

        if (fulfilled_cogs & target_enzyme_cogs) == target_enzyme_cogs:
            ## Proposed enzyme has members of every required COG, so yield
            enzyme = deepcopy(proposed_enzyme)
            proposed_enzyme.remove(gene)
            yield enzyme
        else:
            ## Proposed enzyme is still missing some COG members
            for enzyme in add_gene(target_enzyme_cogs, gene_cog_dict, cog_gene_dict, proposed_enzyme, fulfilled_cogs):
                yield enzyme
Input:
gene_cog_dict = {'gene1':['COG1','COG1003'], 'gene2':['COG2'], 'gene3':['COG273'], 'gene4':['COG1'], 'gene5':['COG273','COG71'], 'gene6':['COG1','COG273']}
cog_gene_dict = {'COG2': ['gene2'], 'COG1': ['gene1', 'gene4', 'gene6'], 'COG71': ['gene5'], 'COG273': ['gene3', 'gene5', 'gene6'], 'COG1003': ['gene1']}
target_enzyme_cogs = set(['COG1', 'COG273'])
Usage:
for enzyme in add_gene(target_enzyme_cogs, gene_cog_dict, cog_gene_dict):
    print enzyme
Output:
['gene1', 'gene3']
['gene1', 'gene5']
['gene4', 'gene3']
['gene4', 'gene5']
['gene6']
I have no idea about its performance though.
