Home > RAVEN > getModelFromHomology.m

getModelFromHomology

PURPOSE ^

getModelFromHomology

SYNOPSIS ^

function draftModel=getModelFromHomology(models,blastStructure,getModelFor,preferredOrder,strictness,onlyGenesInModels,maxE,minLen,minSim,mapNewGenesToOld)

DESCRIPTION ^

 getModelFromHomology
   Constructs a new model from a set of existing models and gene 
   homology information.

   models            a cell array of model structures to build the model
                     from
   blastStructure    a blastStructure as produced by getBlast or
                     getBlastFromExcel
   getModelFor       the name of the organism to build a model for. Must
                     have hits in both directions to the organisms in models
   preferredOrder    the order in which reactions should be added from the
                     models. If not supplied, reactions will be included from
                     all models, otherwise one gene will only result in reactions
                     from one model (opt, default {})
   strictness        integer that specifies which reactions that should be included
                     1: Include only 1-1 orthologs (only include genes that
                     map back to the original gene in the blast in the
                     opposite direction)
                     2: Include the reactions for all genes below the cutoff
                     3: Include only best 1-1 orthologs (opt, default 1)
   onlyGenesInModels blast only against genes that exists in the models.
                     This tends to import a larger fraction from the existing 
                     models but may give less reliable results. Has effect only
                     if strictness=3 (opt, default false)
   maxE              only look at genes with E-values <= this value (opt, default 10^-30)
   minLen            only look at genes with overlap >= this value (opt,
                     default 200)
   minSim            only look at genes with similarity >= this value (opt,
                     default 40 (%))
   mapNewGenesToOld  determines how to match genes if not looking at only
                     1-1 orthologs. Either map the new genes to the old or 
                     old genes to new. The default is to map the new genes 
                     (opt, default true)

   draftModel        a model structure for the new organism

   The models in the models structure should have named the metabolites in
   the same manner, have their reversible reactions in the
   same direction (run sortModel), and use the same compartment names.
   To avoid keeping unneccesary old genes, the models should not have
   'or'-relations in their grRules (use expandModel).

   Returns a model structure for the new organism.

   draftModel        a new model structure

   Rasmus Agren, 2013-09-24

CROSS-REFERENCE INFORMATION ^

This function calls: This function is called by:

SOURCE CODE ^

0001 function draftModel=getModelFromHomology(models,blastStructure,getModelFor,preferredOrder,strictness,onlyGenesInModels,maxE,minLen,minSim,mapNewGenesToOld)
0002 % getModelFromHomology
0003 %   Constructs a new model from a set of existing models and gene
0004 %   homology information.
0005 %
0006 %   models            a cell array of model structures to build the model
0007 %                     from
0008 %   blastStructure    a blastStructure as produced by getBlast or
0009 %                     getBlastFromExcel
0010 %   getModelFor       the name of the organism to build a model for. Must
0011 %                     have hits in both directions to the organisms in models
0012 %   preferredOrder    the order in which reactions should be added from the
0013 %                     models. If not supplied, reactions will be included from
0014 %                     all models, otherwise one gene will only result in reactions
0015 %                     from one model (opt, default {})
0016 %   strictness        integer that specifies which reactions that should be included
0017 %                     1: Include only 1-1 orthologs (only include genes that
0018 %                     map back to the original gene in the blast in the
0019 %                     opposite direction)
0020 %                     2: Include the reactions for all genes below the cutoff
0021 %                     3: Include only best 1-1 orthologs (opt, default 1)
0022 %   onlyGenesInModels blast only against genes that exists in the models.
0023 %                     This tends to import a larger fraction from the existing
0024 %                     models but may give less reliable results. Has effect only
0025 %                     if strictness=3 (opt, default false)
0026 %   maxE              only look at genes with E-values <= this value (opt, default 10^-30)
0027 %   minLen            only look at genes with overlap >= this value (opt,
0028 %                     default 200)
0029 %   minSim            only look at genes with similarity >= this value (opt,
0030 %                     default 40 (%))
0031 %   mapNewGenesToOld  determines how to match genes if not looking at only
0032 %                     1-1 orthologs. Either map the new genes to the old or
0033 %                     old genes to new. The default is to map the new genes
0034 %                     (opt, default true)
0035 %
0036 %   draftModel        a model structure for the new organism
0037 %
0038 %   The models in the models structure should have named the metabolites in
0039 %   the same manner, have their reversible reactions in the
0040 %   same direction (run sortModel), and use the same compartment names.
0041 %   To avoid keeping unneccesary old genes, the models should not have
0042 %   'or'-relations in their grRules (use expandModel).
0043 %
0044 %   Returns a model structure for the new organism.
0045 %
0046 %   draftModel        a new model structure
0047 %
0048 %   Rasmus Agren, 2013-09-24
0049 %
0050 
0051 %NOTE: "to" and "from" means relative the new organism
0052 
0053 if nargin<4
0054     preferredOrder=[];
0055 end
0056 if nargin<5
0057     strictness=1;
0058 end
0059 if nargin<6
0060     onlyGenesInModels=false;
0061 end
0062 if nargin<7
0063     maxE=10^-30;
0064 end
0065 if nargin<8
0066     minLen=200;
0067 end
0068 if nargin<9
0069     minSim=40;
0070 end
0071 if nargin<10
0072     mapNewGenesToOld=true;
0073 end
0074 
0075 preferredOrder=preferredOrder(:);
0076 
0077 %Check that all the information is in the blast structure
0078 modelNames=cell(numel(models),1);
0079 for i=1:numel(models)
0080     modelNames{i}=models{i}.id;
0081 end
0082 
0083 %Assume for now that all information is there and that it's correct
0084 %This is important to fix since no further checks are being made!
0085 
0086 %Remove all gene matches that are below the cutoffs
0087 for i=1:numel(blastStructure)
0088     indexes=blastStructure(i).evalue<maxE & blastStructure(i).aligLen>=minLen & blastStructure(i).identity>=minSim; %Do it in this direction to lose NaNs
0089     blastStructure(i).toGenes(~indexes)=[];
0090     blastStructure(i).fromGenes(~indexes)=[];
0091     blastStructure(i).evalue(~indexes)=[];
0092     blastStructure(i).aligLen(~indexes)=[];
0093     blastStructure(i).identity(~indexes)=[];
0094 end
0095 
0096 %Remove all reactions from the models that have no genes encoding for them.
0097 %Also remove all genes that encode for no reactions. There shouldn't be any
0098 %but there might be mistakes
0099 for i=1:numel(models)
0100     [hasGenes crap crap]=find(models{i}.rxnGeneMat);
0101     hasNoGenes=1:numel(models{i}.rxns);
0102     hasNoGenes(hasGenes)=[];
0103     
0104     models{i}=removeRxns(models{i},hasNoGenes,true,true);
0105 end
0106 
0107 %Create a structure that contains all genes used in the blasts in any
0108 %direction for each of the models in models and for the new new organism
0109 %The first cell is for the new organism and then according to the preferred
0110 %order. If no such order is supplied, then according to the order in
0111 %models.
0112 allGenes=cell(numel(models)+1,1);
0113 if isempty(preferredOrder)
0114     useOrder=modelNames;
0115 else
0116     useOrder=preferredOrder;
0117 end
0118 
0119 %Get the corresponding indexes for those models in the models structure
0120 useOrderIndexes=zeros(numel(models),1);
0121 for i=1:numel(models)
0122     index=strmatch(models{i}.id,useOrder,'exact');
0123     useOrderIndexes(index)=i;
0124 end
0125 
0126 %Remove all genes from the blast structure that have no genes in the models
0127 if onlyGenesInModels==true
0128     modelGenes={};
0129     for i=1:numel(models)
0130         modelGenes=[modelGenes;models{i}.genes];
0131     end
0132     for i=1:numel(blastStructure)
0133         %Check to see if it should match the toId or fromId
0134         if strcmpi(blastStructure(i).fromId,getModelFor)
0135             I=ismember(blastStructure(i).toGenes,modelGenes);
0136         else
0137             I=ismember(blastStructure(i).fromGenes,modelGenes);
0138         end
0139         blastStructure(i).fromGenes(~I)=[];
0140         blastStructure(i).toGenes(~I)=[];
0141         blastStructure(i).evalue(~I)=[];
0142         blastStructure(i).aligLen(~I)=[];
0143         blastStructure(i).identity(~I)=[];
0144         
0145         %Check that no matching in blastStructure is empty. This happens if
0146         %no genes in the models are present in the corresponding sheet
0147         if isempty(blastStructure(i).fromGenes)
0148             dispEM(['No genes in matching from ' blastStructure(i).fromId ' to ' blastStructure(i).toId ' are present in the corresponding model']);
0149         end
0150     end
0151 end
0152 
0153 %If only best 1-1 orthologs are to be used then all other measurements are
0154 %deleted from the blastStructure. All code after this stays the same. This
0155 %means that preferred order can still matter. The best ortholog scoring is
0156 %based only on the E-value
0157 if strictness==3
0158     for i=1:numel(blastStructure)
0159         keep=false(numel(blastStructure(i).toGenes),1);
0160         [allFromGenes crap I]=unique(blastStructure(i).fromGenes);
0161         
0162         %It would be nice to get rid of this loop
0163         for j=1:numel(allFromGenes)
0164             allMatches=find(I==j);
0165             bestMatches=allMatches(blastStructure(i).evalue(allMatches)==min(blastStructure(i).evalue(allMatches)));
0166             
0167             %Keep the best matches
0168             keep(bestMatches)=true;
0169         end
0170         
0171         %Delete all matches that were not best matches
0172         blastStructure(i).fromGenes(~keep)=[];
0173         blastStructure(i).toGenes(~keep)=[];
0174         blastStructure(i).evalue(~keep)=[];
0175         blastStructure(i).aligLen(~keep)=[];
0176         blastStructure(i).identity(~keep)=[];
0177     end
0178 end
0179 
0180 useOrder=[{getModelFor};useOrder];
0181 
0182 for i=1:numel(blastStructure)
0183     toIndex=strmatch(blastStructure(i).toId,useOrder,'exact');
0184     fromIndex=strmatch(blastStructure(i).fromId,useOrder,'exact');
0185     
0186     %Add all genes to the corresponding list in allGenes
0187     allGenes{toIndex}=[allGenes{toIndex};blastStructure(i).toGenes];
0188     allGenes{fromIndex}=[allGenes{fromIndex};blastStructure(i).fromGenes];
0189 end
0190 
0191 %Keep only the unique gene names
0192 maxOtherGeneNr=0; %Determines the dimension of the connectivity matrixes
0193 for i=1:numel(allGenes)
0194    allGenes{i}=unique(allGenes{i});
0195    if i>1
0196        if numel(allGenes{i})>maxOtherGeneNr
0197             maxOtherGeneNr=numel(allGenes{i});
0198        end
0199    end
0200 end
0201 
0202 %Generate a cell array of matrixes the describes how the genes in the new
0203 %organism map to the models. Each cell matches to the corresponding model in
0204 %useOrder (starting at 2 of course). First dimension is gene in new organsism,
0205 %second which gene it is in the other organism. The second matrix describes
0206 %how they map back.
0207 
0208 %As it is now, a significant match is indicated by a 1. This could be
0209 %expanded to contain some kind of significance level. The first dimension
0210 %is still the genes in the new model.
0211 
0212 allTo=cell(numel(useOrder)-1,1);
0213 allFrom=cell(numel(useOrder)-1,1);
0214 
0215 for i=1:numel(useOrder)-1
0216     allTo{i}=sparse(numel(allGenes{1}),numel(allGenes{i+1}));
0217     allFrom{i}=sparse(numel(allGenes{1}),numel(allGenes{i+1}));
0218 end
0219 
0220 %Fill the matches to other species
0221 for i=1:numel(blastStructure)
0222     if strcmp(blastStructure(i).toId,getModelFor)
0223         %This was to the new organism
0224         %They should all match so no checks are being made
0225         [crap a]=ismember(blastStructure(i).toGenes,allGenes{1});
0226         fromModel=strmatch(blastStructure(i).fromId,useOrder,'exact');
0227         [crap b]=ismember(blastStructure(i).fromGenes,allGenes{fromModel});
0228         idx = sub2ind(size(allTo{fromModel-1}), a, b);
0229         allTo{fromModel-1}(idx)=1;
0230     else
0231         %This was from the new organism
0232         [crap a]=ismember(blastStructure(i).fromGenes,allGenes{1});
0233         toModel=strmatch(blastStructure(i).toId,useOrder,'exact');
0234         [crap b]=ismember(blastStructure(i).toGenes,allGenes{toModel});
0235         idx = sub2ind(size(allFrom{toModel-1}), a, b);
0236         allFrom{toModel-1}(idx)=1;
0237     end
0238 end
0239 
0240 %Now we have all the gene matches in a convenient way. For all the genes in
0241 %the new organism get the genes that should be included from other
0242 %organisms. If all genes should be included this simply means keep the
0243 %allFrom matrix as is is. If only 1-1 orthologs are to be included then only
0244 %those elements are kept.
0245 
0246 finalMappings=cell(numel(useOrder)-1,1);
0247 if strictness==1 || strictness==3
0248     for j=1:numel(allFrom)
0249         finalMappings{j}=allTo{j}~=0 & allFrom{j}~=0;
0250     end
0251 else
0252     if mapNewGenesToOld==true
0253         finalMappings=allFrom;
0254     else
0255         finalMappings=allTo;
0256     end 
0257 end
0258 
0259 %Remove all genes from the mapping that are not in the models. This doesn't
0260 %do much if only genes in the models were used for the original
0261 %mapping. Also simplify the finalMapping and allGenes structures so that
0262 %they only contain mappings that exist
0263 usedNewGenes=false(numel(allGenes{1}),1);
0264 
0265 for i=1:numel(allGenes)-1
0266     %First remove mappings for those genes that are not in the model
0267     if onlyGenesInModels==false
0268         a=ismember(allGenes{i+1},models{useOrderIndexes(i)}.genes);
0269         finalMappings{i}(:,~a)=false;
0270     end
0271     
0272     %Then remove unused ones and simplify
0273     [a b crap]=find(finalMappings{i});
0274     usedGenes=false(numel(allGenes{i+1}),1);
0275     usedNewGenes(a)=true;
0276     usedGenes(b)=true;
0277     finalMappings{i}=finalMappings{i}(:,usedGenes);
0278     allGenes{i+1}=allGenes{i+1}(usedGenes);
0279 end
0280 
0281 %Remove all new genes that have not been mapped to anything
0282 allGenes{1}=allGenes{1}(usedNewGenes);
0283 for i=1:numel(finalMappings)
0284     finalMappings{i}=finalMappings{i}(usedNewGenes,:);
0285 end
0286 
0287 %Now is it time to choose which reactions should be included from which
0288 %models. If there is a preferred order specified then each gene can only
0289 %result in reactions from one model, otherwise they should all be included
0290 
0291 %Start by simplifying the models by removing genes/reactions that are
0292 %not used. This is where it gets weird with complexes, especially "or" complexes.
0293 %In this step only reactions which are encoded by one single gene, or where all
0294 %genes should be deleted, are deleted. The info on the full complex is still
0295 %present in the grRules
0296 
0297 for i=1:numel(models)
0298     a=ismember(models{useOrderIndexes(i)}.genes,allGenes{i+1});
0299     
0300     %Don't remove reactions with complexes if not all genes in the complex
0301     %should be deleted.
0302     %NOTE: This means that not all the genes in 'a' are guaranteed to be
0303     %deleted. This approach works fine for 'and' complexes, but there
0304     %should be a check that it doesn't keep 'or' genes if it doesn't have
0305     %to!
0306     models{useOrderIndexes(i)}=removeGenes(models{useOrderIndexes(i)},~a,true,false);
0307 end
0308 
0309 %Since I want to use mergeModels in the end, I simplify the models further
0310 %by deleting genes/reactions in the order specified by preferredOrder. This
0311 %means that the last model will only contain reactions for genes that
0312 %mapped only to that model
0313 
0314 allUsedGenes=false(numel(allGenes{1}),1);
0315 
0316 if ~isempty(preferredOrder) && numel(models)>1
0317     [usedGenes crap crap]=find(finalMappings{1}); %All that are used in the first model in preferredOrder
0318     allUsedGenes(usedGenes)=true;
0319     for i=2:numel(finalMappings)
0320         [usedGenes crap crap]=find(finalMappings{i});
0321         usedGenes=unique(usedGenes);
0322         a=ismember(usedGenes,find(allUsedGenes));
0323         
0324         [craps genesToDelete crap]=find(finalMappings{i}(usedGenes(a),:)); %IMPORTANT! IS it really correct to remove all genes that map?
0325         genesToDelete=unique(genesToDelete); %Maybe not needed, but for clarity if nothing else
0326         
0327         %Remove all the genes that were already found and add the other
0328         %ones to allUsedGenes
0329         [models{useOrderIndexes(i)} notDeleted]=removeGenes(models{useOrderIndexes(i)},allGenes{i+1}(genesToDelete),true,false);
0330         allUsedGenes(usedGenes)=true;
0331         
0332         %Remove the deleted genes from finalMappings and allGenes
0333         %Don't remove the genes in notDeleted, they are part of complexes
0334         %with some non-mapped genes
0335         deletedIndexes=~ismember(allGenes{i+1}(genesToDelete),notDeleted);
0336         finalMappings{i}(:,genesToDelete(deletedIndexes))=[];
0337         allGenes{i+1}(genesToDelete(deletedIndexes))=[];
0338     end
0339 end
0340 
0341 %Now loop through the models and update the gene associations. Genes not
0342 %belonging to the new organism will be renamed as 'OLD_MODELID_gene'
0343 for i=1:numel(models)
0344     %Find all the new genes that should be used for this model
0345     [newGenes oldGenes crap]=find(finalMappings{i});
0346     
0347     %Create a new gene list with the genes from the new organism and those
0348     %genes that could not be removed
0349     replaceableGenes=allGenes{i+1}(unique(oldGenes));
0350     nonReplaceableGenes=setdiff(models{useOrderIndexes(i)}.genes,replaceableGenes);
0351     fullGeneList=[allGenes{1}(unique(newGenes));nonReplaceableGenes];
0352     
0353     %Just to save some indexing later. This is the LAST index of
0354     %replaceable ones
0355     nonRepStartIndex=numel(unique(newGenes)); 
0356     
0357     %Construct a new rxnGeneMat
0358     newRxnGeneMat=sparse(numel(models{useOrderIndexes(i)}.rxns),numel(fullGeneList));
0359     
0360     %Now update the rxnGeneMat matrix. This is a little tricky and could
0361     %probably be done in a more efficient way, but I just loop through the
0362     %reactions and add them one by one
0363     for j=1:numel(models{useOrderIndexes(i)}.rxns)
0364            %Get the old genes encoding for this reaction
0365         [crap oldGeneIds crap]=find(models{useOrderIndexes(i)}.rxnGeneMat(j,:));
0366         
0367         %Update the matrix for each gene. This includes replacing one gene
0368         %with several new ones if there were several matches
0369         for k=1:numel(oldGeneIds)
0370             %Match the gene to one in the gene list. This is done as a text
0371             %match. Could probably be done better, but I'm a little lost in
0372             %the indexing
0373             
0374             geneName=models{useOrderIndexes(i)}.genes(oldGeneIds(k));
0375             
0376             %First search in the mappable genes
0377             mapIndex=strmatch(geneName,allGenes{i+1},'exact');
0378             
0379             if ~isempty(mapIndex)
0380                 %Get the new genes for that gene
0381                 [a crap crap]=find(finalMappings{i}(:,mapIndex));
0382                 
0383                 %Find the positions of these genes in the final gene list
0384                 [a b]=ismember(allGenes{1}(a),fullGeneList);
0385                 
0386                 %Update the matrix
0387                 newRxnGeneMat(j,b)=1;
0388                 
0389                 %Update the grRules string. This is tricky, but I hope
0390                 %that it's ok to replace the old gene name with the new one
0391                 %and add ') or (' if there were several matches. Be sure of
0392                 %this!
0393                 repString=fullGeneList{b(1)};
0394                 for l=2:numel(b)
0395                     repString=[repString ') or (' fullGeneList{b(l)}];
0396                 end
0397                 models{useOrderIndexes(i)}.grRules{j}=strrep(models{useOrderIndexes(i)}.grRules{j},geneName{1},repString);
0398             else
0399                 %Then search in the non-replaceable genes. There could only
0400                 %be one match here
0401                 index=strmatch(geneName,nonReplaceableGenes,'exact');
0402                 
0403                 %Update the matrix
0404                 newRxnGeneMat(j,nonRepStartIndex+index)=1;
0405 
0406                 models{useOrderIndexes(i)}.grRules{j}=strrep(models{useOrderIndexes(i)}.grRules{j},geneName{1},strcat('OLD_',models{useOrderIndexes(i)}.id,'_',geneName{1}));
0407             end
0408         end
0409     end
0410     
0411     %Add the new list of genes
0412     models{useOrderIndexes(i)}.rxnGeneMat=newRxnGeneMat;
0413     if ~isempty(nonReplaceableGenes)
0414         models{useOrderIndexes(i)}.genes=[allGenes{1}(unique(newGenes));strcat('OLD_',models{useOrderIndexes(i)}.id,'_',nonReplaceableGenes)];
0415     else
0416         models{useOrderIndexes(i)}.genes=allGenes{1}(unique(newGenes));
0417     end
0418 end
0419 
0420 %Now merge the models. All information should be correct except for 'or'
0421 %complexes
0422 draftModel=mergeModels(models);
0423 
0424 %Change description of the resulting model
0425 draftModel.id=getModelFor;
0426 description='Generated by getModelFromHomology using ';
0427 for i=1:numel(models)
0428     if i<numel(models)
0429         description=[description models{i}.id];
0430     else
0431         description=[description ', ' models{i}.id];
0432     end
0433 end
0434 draftModel.description=description;
0435 end

Generated on Mon 06-Jan-2014 14:58:12 by m2html © 2005