Home > RAVEN > getModelFromHomology.m

getModelFromHomology

PURPOSE ^

getModelFromHomology

SYNOPSIS ^

function draftModel=getModelFromHomology(models,blastStructure,getModelFor,preferredOrder,strictness,onlyGenesInModels,maxE,minLen,minSim,mapNewGenesToOld)

DESCRIPTION ^

 getModelFromHomology
   Constructs a new model from a set of existing models and gene 
   homology information.

   models            a cell array of model structures to build the model
                     from
   blastStructure    a blastStructure as produced by getBlast or
                     getBlastFromExcel
   getModelFor       the name of the organism to build a model for. Must
                     have hits in both directions to the organisms in models
   preferredOrder    the order in which reactions should be added from the
                     models. If not supplied, reactions will be included from
                     all models, otherwise one gene will only result in reactions
                     from one model (opt, default {})
   strictness        integer that specifies which reactions that should be included
                     1: Include only 1-1 orthologs (only include genes that
                     map back to the original gene in the blast in the
                     opposite direction)
                     2: Include the reactions for all genes below the cutoff
                     3: Include only best 1-1 orthologs (opt, default 1)
   onlyGenesInModels blast only against genes that exists in the models.
                     This tends to import a larger fraction from the existing 
                     models but may give less reliable results. Has effect only
                     if strictness=3 (opt, default false)
   maxE              only look at genes with E-values <= this value (opt, default 10^-30)
   minLen            only look at genes with overlap >= this value (opt,
                     default 200)
   minSim            only look at genes with similarity >= this value (opt,
                     default 40 (%))
   mapNewGenesToOld  determines how to match genes if not looking at only
                     1-1 orthologs. Either map the new genes to the old or 
                     old genes to new. The default is to map the new genes 
                     (opt, default true)

   draftModel        a model structure for the new organism

   The models in the models structure should have named the metabolites in
   the same manner (run mapMetNames), have their reversible reactions in the
   same direction (run sortModel), and use the same compartment names.
   To avoid keeping unneccesary old genes, the models should not have
   'or'-relations in their grRules (use expandModel).

   Returns a model structure for the new organism.

   draftModel        a new model structure

   Rasmus Agren, 2013-01-25

CROSS-REFERENCE INFORMATION ^

This function calls: This function is called by:

SOURCE CODE ^

0001 function draftModel=getModelFromHomology(models,blastStructure,getModelFor,preferredOrder,strictness,onlyGenesInModels,maxE,minLen,minSim,mapNewGenesToOld)
0002 % getModelFromHomology
0003 %   Constructs a new model from a set of existing models and gene
0004 %   homology information.
0005 %
0006 %   models            a cell array of model structures to build the model
0007 %                     from
0008 %   blastStructure    a blastStructure as produced by getBlast or
0009 %                     getBlastFromExcel
0010 %   getModelFor       the name of the organism to build a model for. Must
0011 %                     have hits in both directions to the organisms in models
0012 %   preferredOrder    the order in which reactions should be added from the
0013 %                     models. If not supplied, reactions will be included from
0014 %                     all models, otherwise one gene will only result in reactions
0015 %                     from one model (opt, default {})
0016 %   strictness        integer that specifies which reactions that should be included
0017 %                     1: Include only 1-1 orthologs (only include genes that
0018 %                     map back to the original gene in the blast in the
0019 %                     opposite direction)
0020 %                     2: Include the reactions for all genes below the cutoff
0021 %                     3: Include only best 1-1 orthologs (opt, default 1)
0022 %   onlyGenesInModels blast only against genes that exists in the models.
0023 %                     This tends to import a larger fraction from the existing
0024 %                     models but may give less reliable results. Has effect only
0025 %                     if strictness=3 (opt, default false)
0026 %   maxE              only look at genes with E-values <= this value (opt, default 10^-30)
0027 %   minLen            only look at genes with overlap >= this value (opt,
0028 %                     default 200)
0029 %   minSim            only look at genes with similarity >= this value (opt,
0030 %                     default 40 (%))
0031 %   mapNewGenesToOld  determines how to match genes if not looking at only
0032 %                     1-1 orthologs. Either map the new genes to the old or
0033 %                     old genes to new. The default is to map the new genes
0034 %                     (opt, default true)
0035 %
0036 %   draftModel        a model structure for the new organism
0037 %
0038 %   The models in the models structure should have named the metabolites in
0039 %   the same manner (run mapMetNames), have their reversible reactions in the
0040 %   same direction (run sortModel), and use the same compartment names.
0041 %   To avoid keeping unneccesary old genes, the models should not have
0042 %   'or'-relations in their grRules (use expandModel).
0043 %
0044 %   Returns a model structure for the new organism.
0045 %
0046 %   draftModel        a new model structure
0047 %
0048 %   Rasmus Agren, 2013-01-25
0049 %
0050 
0051 %NOTE: "to" and "from" means relative the new organism
0052 
0053 if nargin<4
0054     preferredOrder=[];
0055 end
0056 if nargin<5
0057     strictness=1;
0058 end
0059 if nargin<6
0060     onlyGenesInModels=false;
0061 end
0062 if nargin<7
0063     maxE=10^-30;
0064 end
0065 if nargin<8
0066     minLen=200;
0067 end
0068 if nargin<9
0069     minSim=40;
0070 end
0071 if nargin<10
0072     mapNewGenesToOld=true;
0073 end
0074 
0075 preferredOrder=preferredOrder(:);
0076 
0077 %Check that all the information is in the blast structure
0078 modelNames=cell(numel(models),1);
0079 for i=1:numel(models)
0080     modelNames{i}=models{i}.id;
0081 end
0082 
0083 %Assume for now that all information is there and that it's correct
0084 %This is important to fix since no further checks are being made!
0085 
0086 %Remove all gene matches that are below the cutoffs
0087 for i=1:numel(blastStructure)
0088     indexes=blastStructure(i).evalue<maxE & blastStructure(i).aligLen>=minLen & blastStructure(i).identity>=minSim; %Do it in this direction to lose NaNs
0089     blastStructure(i).toGenes(~indexes)=[];
0090     blastStructure(i).fromGenes(~indexes)=[];
0091     blastStructure(i).evalue(~indexes)=[];
0092     blastStructure(i).aligLen(~indexes)=[];
0093     blastStructure(i).identity(~indexes)=[];
0094 end
0095 
0096 %Remove all reactions from the models that have no genes encoding for them.
0097 %Also remove all genes that encode for no reactions. There shouldn't be any
0098 %but there might be mistakes
0099 for i=1:numel(models)
0100     [hasGenes crap crap]=find(models{i}.rxnGeneMat);
0101     hasNoGenes=1:numel(models{i}.rxns);
0102     hasNoGenes(hasGenes)=[];
0103     
0104     models{i}=removeRxns(models{i},hasNoGenes,true,true);
0105 end
0106 
0107 %Create a structure that contains all genes used in the blasts in any
0108 %direction for each of the models in models and for the new new organism
0109 %The first cell is for the new organism and then according to the preferred
0110 %order. If no such order is supplied, then according to the order in
0111 %models.
0112 allGenes=cell(numel(models)+1,1);
0113 if isempty(preferredOrder)
0114     useOrder=modelNames;
0115 else
0116     useOrder=preferredOrder;
0117 end
0118 
0119 %Get the corresponding indexes for those models in the models structure
0120 useOrderIndexes=zeros(numel(models),1);
0121 for i=1:numel(models)
0122     index=strmatch(models{i}.id,useOrder,'exact');
0123     useOrderIndexes(index)=i;
0124 end
0125 
0126 %Remove all genes from the blast structure that have no genes in the models
0127 if onlyGenesInModels==true
0128     for i=1:numel(blastStructure)
0129         toId=strmatch(blastStructure(i).toId,preferredOrder,'exact');
0130         fromId=strmatch(blastStructure(i).fromId,preferredOrder,'exact');
0131         
0132         %Either toId is non-empty or fromId, but not both
0133         if any(toId)
0134             %Find genes in model
0135             I=ismember(blastStructure(i).toGenes,models{useOrderIndexes(toId)}.genes);
0136         else
0137             %Find genes in model
0138             I=ismember(blastStructure(i).fromGenes,models{useOrderIndexes(fromId)}.genes);
0139         end
0140         blastStructure(i).fromGenes(~I)=[];
0141         blastStructure(i).toGenes(~I)=[];
0142         blastStructure(i).evalue(~I)=[];
0143         blastStructure(i).aligLen(~I)=[];
0144         blastStructure(i).identity(~I)=[];
0145         
0146         %Check that no matching in blastStructure is empty. This happens if
0147         %no genes in the models are present in the corresponding sheet
0148         if isempty(blastStructure(i).fromGenes)
0149             throw(MException('',['No genes in matching from ' blastStructure(i).fromId ' to ' blastStructure(i).toId ' are present in the corresponding model']));
0150         end
0151     end
0152 end
0153 
0154 %If only best 1-1 orthologs are to be used then all other measurements are
0155 %deleted from the blastStructure. All code after this stays the same. This
0156 %means that preferred order can still matter. The best ortholog scoring is
0157 %based only on the E-value
0158 if strictness==3
0159     for i=1:numel(blastStructure)
0160         keep=false(numel(blastStructure(i).toGenes),1);
0161         [allFromGenes crap I]=unique(blastStructure(i).fromGenes);
0162         
0163         %It would be nice to get rid of this loop
0164         for j=1:numel(allFromGenes)
0165             allMatches=find(I==j);
0166             bestMatches=allMatches(blastStructure(i).evalue(allMatches)==min(blastStructure(i).evalue(allMatches)));
0167             
0168             %Keep the best matches
0169             keep(bestMatches)=true;
0170         end
0171         
0172         %Delete all matches that were not best matches
0173         blastStructure(i).fromGenes(~keep)=[];
0174         blastStructure(i).toGenes(~keep)=[];
0175         blastStructure(i).evalue(~keep)=[];
0176         blastStructure(i).aligLen(~keep)=[];
0177         blastStructure(i).identity(~keep)=[];
0178     end
0179 end
0180 
0181 useOrder=[{getModelFor};useOrder];
0182 
0183 for i=1:numel(blastStructure)
0184     toIndex=strmatch(blastStructure(i).toId,useOrder,'exact');
0185     fromIndex=strmatch(blastStructure(i).fromId,useOrder,'exact');
0186     
0187     %Add all genes to the corresponding list in allGenes
0188     allGenes{toIndex}=[allGenes{toIndex};blastStructure(i).toGenes];
0189     allGenes{fromIndex}=[allGenes{fromIndex};blastStructure(i).fromGenes];
0190 end
0191 
0192 %Keep only the unique gene names
0193 maxOtherGeneNr=0; %Determines the dimension of the connectivity matrixes
0194 for i=1:numel(allGenes)
0195    allGenes{i}=unique(allGenes{i});
0196    if i>1
0197        if numel(allGenes{i})>maxOtherGeneNr
0198             maxOtherGeneNr=numel(allGenes{i});
0199        end
0200    end
0201 end
0202 
0203 %Generate a cell array of matrixes the describes how the genes in the new
0204 %organism map to the models. Each cell matches to the corresponding model in
0205 %useOrder (starting at 2 of course). First dimension is gene in new organsism,
0206 %second which gene it is in the other organism. The second matrix describes
0207 %how they map back.
0208 
0209 %As it is now, a significant match is indicated by a 1. This could be
0210 %expanded to contain some kind of significance level. The first dimension
0211 %is still the genes in the new model.
0212 
0213 allTo=cell(numel(useOrder)-1,1);
0214 allFrom=cell(numel(useOrder)-1,1);
0215 
0216 for i=1:numel(useOrder)-1
0217     allTo{i}=sparse(numel(allGenes{1}),numel(allGenes{i+1}));
0218     allFrom{i}=sparse(numel(allGenes{1}),numel(allGenes{i+1}));
0219 end
0220 
0221 %Fill the matches to other species
0222 for i=1:numel(blastStructure)
0223     if strcmp(blastStructure(i).toId,getModelFor)
0224         %This was to the new organism
0225         %They should all match so no checks are being made
0226         [crap a]=ismember(blastStructure(i).toGenes,allGenes{1});
0227         fromModel=strmatch(blastStructure(i).fromId,useOrder,'exact');
0228         [crap b]=ismember(blastStructure(i).fromGenes,allGenes{fromModel});
0229         idx = sub2ind(size(allTo{fromModel-1}), a, b);
0230         allTo{fromModel-1}(idx)=1;
0231     else
0232         %This was from the new organism
0233         [crap a]=ismember(blastStructure(i).fromGenes,allGenes{1});
0234         toModel=strmatch(blastStructure(i).toId,useOrder,'exact');
0235         [crap b]=ismember(blastStructure(i).toGenes,allGenes{toModel});
0236         idx = sub2ind(size(allFrom{toModel-1}), a, b);
0237         allFrom{toModel-1}(idx)=1;
0238     end
0239 end
0240 
0241 %Now we have all the gene matches in a convenient way. For all the genes in
0242 %the new organism get the genes that should be included from other
0243 %organisms. If all genes should be included this simply means keep the
0244 %allFrom matrix as is is. If only 1-1 orthologs are to be included then only
0245 %those elements are kept.
0246 
0247 finalMappings=cell(numel(useOrder)-1,1);
0248 if strictness==1 || strictness==3
0249     for j=1:numel(allFrom)
0250         finalMappings{j}=allTo{j}~=0 & allFrom{j}~=0;
0251     end
0252 else
0253     if mapNewGenesToOld==true
0254         finalMappings=allFrom;
0255     else
0256         finalMappings=allTo;
0257     end 
0258 end
0259 
0260 %Remove all genes from the mapping that are not in the models. This doesn't
0261 %do much if only genes in the models were used for the original
0262 %mapping. Also simplify the finalMapping and allGenes structures so that
0263 %they only contain mappings that exist
0264 usedNewGenes=false(numel(allGenes{1}),1);
0265 
0266 for i=1:numel(allGenes)-1
0267     %First remove mappings for those genes that are not in the model
0268     if onlyGenesInModels==false
0269         a=ismember(allGenes{i+1},models{useOrderIndexes(i)}.genes);
0270         finalMappings{i}(:,~a)=false;
0271     end
0272     
0273     %Then remove unused ones and simplify
0274     [a b crap]=find(finalMappings{i});
0275     usedGenes=false(numel(allGenes{i+1}),1);
0276     usedNewGenes(a)=true;
0277     usedGenes(b)=true;
0278     finalMappings{i}=finalMappings{i}(:,usedGenes);
0279     allGenes{i+1}=allGenes{i+1}(usedGenes);
0280 end
0281 
0282 %Remove all new genes that have not been mapped to anything
0283 allGenes{1}=allGenes{1}(usedNewGenes);
0284 for i=1:numel(finalMappings)
0285     finalMappings{i}=finalMappings{i}(usedNewGenes,:);
0286 end
0287 
0288 %Now is it time to choose which reactions should be included from which
0289 %models. If there is a preferred order specified then each gene can only
0290 %result in reactions from one model, otherwise they should all be included
0291 
0292 %Start by simplifying the models by removing genes/reactions that are
0293 %not used. This is where it gets weird with complexes, especially "or" complexes.
0294 %In this step only reactions which are encoded by one single gene, or where all
0295 %genes should be deleted, are deleted. The info on the full complex is still
0296 %present in the grRules
0297 
0298 for i=1:numel(models)
0299     a=ismember(models{useOrderIndexes(i)}.genes,allGenes{i+1});
0300     
0301     %Don't remove reactions with complexes if not all genes in the complex
0302     %should be deleted.
0303     %NOTE: This means that not all the genes in 'a' are guaranteed to be
0304     %deleted. This approach works fine for 'and' complexes, but there
0305     %should be a check that it doesn't keep 'or' genes if it doesn't have
0306     %to!
0307     models{useOrderIndexes(i)}=removeGenes(models{useOrderIndexes(i)},~a,true,false);
0308 end
0309 
0310 %Since I want to use mergeModels in the end, I simplify the models further
0311 %by deleting genes/reactions in the order specified by preferredOrder. This
0312 %means that the last model will only contain reactions for genes that
0313 %mapped only to that model
0314 
0315 allUsedGenes=false(numel(allGenes{1}),1);
0316 
0317 if ~isempty(preferredOrder) && numel(models)>1
0318     [usedGenes crap crap]=find(finalMappings{1}); %All that are used in the first model in preferredOrder
0319     allUsedGenes(usedGenes)=true;
0320     for i=2:numel(finalMappings)
0321         [usedGenes crap crap]=find(finalMappings{i});
0322         usedGenes=unique(usedGenes);
0323         a=ismember(usedGenes,find(allUsedGenes));
0324         
0325         [craps genesToDelete crap]=find(finalMappings{i}(usedGenes(a),:)); %IMPORTANT! IS it really correct to remove all genes that map?
0326         genesToDelete=unique(genesToDelete); %Maybe not needed, but for clarity if nothing else
0327         
0328         %Remove all the genes that were already found and add the other
0329         %ones to allUsedGenes
0330         [models{useOrderIndexes(i)} notDeleted]=removeGenes(models{useOrderIndexes(i)},allGenes{i+1}(genesToDelete),true,false);
0331         allUsedGenes(usedGenes)=true;
0332         
0333         %Remove the deleted genes from finalMappings and allGenes
0334         %Don't remove the genes in notDeleted, they are part of complexes
0335         %with some non-mapped genes
0336         deletedIndexes=~ismember(allGenes{i+1}(genesToDelete),notDeleted);
0337         finalMappings{i}(:,genesToDelete(deletedIndexes))=[];
0338         allGenes{i+1}(genesToDelete(deletedIndexes))=[];
0339     end
0340 end
0341 
0342 %Now loop through the models and update the gene associations. Genes not
0343 %belonging to the new organism will be renamed as 'OLD_MODELID_gene'
0344 for i=1:numel(models)
0345     %Find all the new genes that should be used for this model
0346     [newGenes oldGenes crap]=find(finalMappings{i});
0347     
0348     %Create a new gene list with the genes from the new organism and those
0349     %genes that could not be removed
0350     replaceableGenes=allGenes{i+1}(unique(oldGenes));
0351     nonReplaceableGenes=setdiff(models{useOrderIndexes(i)}.genes,replaceableGenes);
0352     fullGeneList=[allGenes{1}(unique(newGenes));nonReplaceableGenes];
0353     
0354     %Just to save some indexing later. This is the LAST index of
0355     %replaceable ones
0356     nonRepStartIndex=numel(unique(newGenes)); 
0357     
0358     %Construct a new rxnGeneMat
0359     newRxnGeneMat=sparse(numel(models{useOrderIndexes(i)}.rxns),numel(fullGeneList));
0360     
0361     %Now update the rxnGeneMat matrix. This is a little tricky and could
0362     %probably be done in a more efficient way, but I just loop through the
0363     %reactions and add them one by one
0364     for j=1:numel(models{useOrderIndexes(i)}.rxns)
0365            %Get the old genes encoding for this reaction
0366         [crap oldGeneIds crap]=find(models{useOrderIndexes(i)}.rxnGeneMat(j,:));
0367         
0368         %Update the matrix for each gene. This includes replacing one gene
0369         %with several new ones if there were several matches
0370         for k=1:numel(oldGeneIds)
0371             %Match the gene to one in the gene list. This is done as a text
0372             %match. Could probably be done better, but I'm a little lost in
0373             %the indexing
0374             
0375             geneName=models{useOrderIndexes(i)}.genes(oldGeneIds(k));
0376             
0377             %First search in the mappable genes
0378             mapIndex=strmatch(geneName,allGenes{i+1},'exact');
0379             
0380             if ~isempty(mapIndex)
0381                 %Get the new genes for that gene
0382                 [a crap crap]=find(finalMappings{i}(:,mapIndex));
0383                 
0384                 %Find the positions of these genes in the final gene list
0385                 [a b]=ismember(allGenes{1}(a),fullGeneList);
0386                 
0387                 %Update the matrix
0388                 newRxnGeneMat(j,b)=1;
0389                 
0390                 %Update the grRules string. This is tricky, but I hope
0391                 %that it's ok to replace the old gene name with the new one
0392                 %and add ') or (' if there were several matches. Be sure of
0393                 %this!
0394                 repString=fullGeneList{b(1)};
0395                 for l=2:numel(b)
0396                     repString=[repString ') or (' fullGeneList{b(l)}];
0397                 end
0398                 models{useOrderIndexes(i)}.grRules{j}=strrep(models{useOrderIndexes(i)}.grRules{j},geneName{1},repString);
0399             else
0400                 %Then search in the non-replaceable genes. There could only
0401                 %be one match here
0402                 index=strmatch(geneName,nonReplaceableGenes,'exact');
0403                 
0404                 %Update the matrix
0405                 newRxnGeneMat(j,nonRepStartIndex+index)=1;
0406 
0407                 models{useOrderIndexes(i)}.grRules{j}=strrep(models{useOrderIndexes(i)}.grRules{j},geneName{1},strcat('OLD_',models{useOrderIndexes(i)}.id,'_',geneName{1}));
0408             end
0409         end
0410     end
0411     
0412     %Add the new list of genes
0413     models{useOrderIndexes(i)}.rxnGeneMat=newRxnGeneMat;
0414     if ~isempty(nonReplaceableGenes)
0415         models{useOrderIndexes(i)}.genes=[allGenes{1}(unique(newGenes));strcat('OLD_',models{useOrderIndexes(i)}.id,'_',nonReplaceableGenes)];
0416     else
0417         models{useOrderIndexes(i)}.genes=allGenes{1}(unique(newGenes));
0418     end
0419 end
0420 
0421 %Now merge the models. All information should be correct except for 'or'
0422 %complexes
0423 draftModel=mergeModels(models);
0424 
0425 %Change description of the resulting model
0426 draftModel.id=getModelFor;
0427 description='Generated by getModelFromHomology using ';
0428 for i=1:numel(models)
0429     if i<numel(models)
0430         description=[description '"' models{i}.id '", '];
0431     else
0432         description=[description '"' models{i}.id '"'];
0433     end
0434 end
0435 draftModel.description=description;
0436 end

Generated on Tue 23-Apr-2013 15:18:37 by m2html © 2005