getModelFromHomology Constructs a new model from a set of existing models and gene homology information. models a cell array of model structures to build the model from blastStructure a blastStructure as produced by getBlast or getBlastFromExcel getModelFor the name of the organism to build a model for. Must have hits in both directions to the organisms in models preferredOrder the order in which reactions should be added from the models. If not supplied, reactions will be included from all models, otherwise one gene will only result in reactions from one model (opt, default {}) strictness integer that specifies which reactions that should be included 1: Include only 1-1 orthologs (only include genes that map back to the original gene in the blast in the opposite direction) 2: Include the reactions for all genes below the cutoff 3: Include only best 1-1 orthologs (opt, default 1) onlyGenesInModels blast only against genes that exists in the models. This tends to import a larger fraction from the existing models but may give less reliable results. Has effect only if strictness=3 (opt, default false) maxE only look at genes with E-values <= this value (opt, default 10^-30) minLen only look at genes with overlap >= this value (opt, default 200) minSim only look at genes with similarity >= this value (opt, default 40 (%)) mapNewGenesToOld determines how to match genes if not looking at only 1-1 orthologs. Either map the new genes to the old or old genes to new. The default is to map the new genes (opt, default true) draftModel a model structure for the new organism The models in the models structure should have named the metabolites in the same manner (run mapMetNames), have their reversible reactions in the same direction (run sortModel), and use the same compartment names. To avoid keeping unneccesary old genes, the models should not have 'or'-relations in their grRules (use expandModel). Returns a model structure for the new organism. draftModel a new model structure Rasmus Agren, 2013-01-25
0001 function draftModel=getModelFromHomology(models,blastStructure,getModelFor,preferredOrder,strictness,onlyGenesInModels,maxE,minLen,minSim,mapNewGenesToOld) 0002 % getModelFromHomology 0003 % Constructs a new model from a set of existing models and gene 0004 % homology information. 0005 % 0006 % models a cell array of model structures to build the model 0007 % from 0008 % blastStructure a blastStructure as produced by getBlast or 0009 % getBlastFromExcel 0010 % getModelFor the name of the organism to build a model for. Must 0011 % have hits in both directions to the organisms in models 0012 % preferredOrder the order in which reactions should be added from the 0013 % models. If not supplied, reactions will be included from 0014 % all models, otherwise one gene will only result in reactions 0015 % from one model (opt, default {}) 0016 % strictness integer that specifies which reactions that should be included 0017 % 1: Include only 1-1 orthologs (only include genes that 0018 % map back to the original gene in the blast in the 0019 % opposite direction) 0020 % 2: Include the reactions for all genes below the cutoff 0021 % 3: Include only best 1-1 orthologs (opt, default 1) 0022 % onlyGenesInModels blast only against genes that exists in the models. 0023 % This tends to import a larger fraction from the existing 0024 % models but may give less reliable results. Has effect only 0025 % if strictness=3 (opt, default false) 0026 % maxE only look at genes with E-values <= this value (opt, default 10^-30) 0027 % minLen only look at genes with overlap >= this value (opt, 0028 % default 200) 0029 % minSim only look at genes with similarity >= this value (opt, 0030 % default 40 (%)) 0031 % mapNewGenesToOld determines how to match genes if not looking at only 0032 % 1-1 orthologs. Either map the new genes to the old or 0033 % old genes to new. The default is to map the new genes 0034 % (opt, default true) 0035 % 0036 % draftModel a model structure for the new organism 0037 % 0038 % The models in the models structure should have named the metabolites in 0039 % the same manner (run mapMetNames), have their reversible reactions in the 0040 % same direction (run sortModel), and use the same compartment names. 0041 % To avoid keeping unneccesary old genes, the models should not have 0042 % 'or'-relations in their grRules (use expandModel). 0043 % 0044 % Returns a model structure for the new organism. 0045 % 0046 % draftModel a new model structure 0047 % 0048 % Rasmus Agren, 2013-01-25 0049 % 0050 0051 %NOTE: "to" and "from" means relative the new organism 0052 0053 if nargin<4 0054 preferredOrder=[]; 0055 end 0056 if nargin<5 0057 strictness=1; 0058 end 0059 if nargin<6 0060 onlyGenesInModels=false; 0061 end 0062 if nargin<7 0063 maxE=10^-30; 0064 end 0065 if nargin<8 0066 minLen=200; 0067 end 0068 if nargin<9 0069 minSim=40; 0070 end 0071 if nargin<10 0072 mapNewGenesToOld=true; 0073 end 0074 0075 preferredOrder=preferredOrder(:); 0076 0077 %Check that all the information is in the blast structure 0078 modelNames=cell(numel(models),1); 0079 for i=1:numel(models) 0080 modelNames{i}=models{i}.id; 0081 end 0082 0083 %Assume for now that all information is there and that it's correct 0084 %This is important to fix since no further checks are being made! 0085 0086 %Remove all gene matches that are below the cutoffs 0087 for i=1:numel(blastStructure) 0088 indexes=blastStructure(i).evalue<maxE & blastStructure(i).aligLen>=minLen & blastStructure(i).identity>=minSim; %Do it in this direction to lose NaNs 0089 blastStructure(i).toGenes(~indexes)=[]; 0090 blastStructure(i).fromGenes(~indexes)=[]; 0091 blastStructure(i).evalue(~indexes)=[]; 0092 blastStructure(i).aligLen(~indexes)=[]; 0093 blastStructure(i).identity(~indexes)=[]; 0094 end 0095 0096 %Remove all reactions from the models that have no genes encoding for them. 0097 %Also remove all genes that encode for no reactions. There shouldn't be any 0098 %but there might be mistakes 0099 for i=1:numel(models) 0100 [hasGenes crap crap]=find(models{i}.rxnGeneMat); 0101 hasNoGenes=1:numel(models{i}.rxns); 0102 hasNoGenes(hasGenes)=[]; 0103 0104 models{i}=removeRxns(models{i},hasNoGenes,true,true); 0105 end 0106 0107 %Create a structure that contains all genes used in the blasts in any 0108 %direction for each of the models in models and for the new new organism 0109 %The first cell is for the new organism and then according to the preferred 0110 %order. If no such order is supplied, then according to the order in 0111 %models. 0112 allGenes=cell(numel(models)+1,1); 0113 if isempty(preferredOrder) 0114 useOrder=modelNames; 0115 else 0116 useOrder=preferredOrder; 0117 end 0118 0119 %Get the corresponding indexes for those models in the models structure 0120 useOrderIndexes=zeros(numel(models),1); 0121 for i=1:numel(models) 0122 index=strmatch(models{i}.id,useOrder,'exact'); 0123 useOrderIndexes(index)=i; 0124 end 0125 0126 %Remove all genes from the blast structure that have no genes in the models 0127 if onlyGenesInModels==true 0128 for i=1:numel(blastStructure) 0129 toId=strmatch(blastStructure(i).toId,preferredOrder,'exact'); 0130 fromId=strmatch(blastStructure(i).fromId,preferredOrder,'exact'); 0131 0132 %Either toId is non-empty or fromId, but not both 0133 if any(toId) 0134 %Find genes in model 0135 I=ismember(blastStructure(i).toGenes,models{useOrderIndexes(toId)}.genes); 0136 else 0137 %Find genes in model 0138 I=ismember(blastStructure(i).fromGenes,models{useOrderIndexes(fromId)}.genes); 0139 end 0140 blastStructure(i).fromGenes(~I)=[]; 0141 blastStructure(i).toGenes(~I)=[]; 0142 blastStructure(i).evalue(~I)=[]; 0143 blastStructure(i).aligLen(~I)=[]; 0144 blastStructure(i).identity(~I)=[]; 0145 0146 %Check that no matching in blastStructure is empty. This happens if 0147 %no genes in the models are present in the corresponding sheet 0148 if isempty(blastStructure(i).fromGenes) 0149 throw(MException('',['No genes in matching from ' blastStructure(i).fromId ' to ' blastStructure(i).toId ' are present in the corresponding model'])); 0150 end 0151 end 0152 end 0153 0154 %If only best 1-1 orthologs are to be used then all other measurements are 0155 %deleted from the blastStructure. All code after this stays the same. This 0156 %means that preferred order can still matter. The best ortholog scoring is 0157 %based only on the E-value 0158 if strictness==3 0159 for i=1:numel(blastStructure) 0160 keep=false(numel(blastStructure(i).toGenes),1); 0161 [allFromGenes crap I]=unique(blastStructure(i).fromGenes); 0162 0163 %It would be nice to get rid of this loop 0164 for j=1:numel(allFromGenes) 0165 allMatches=find(I==j); 0166 bestMatches=allMatches(blastStructure(i).evalue(allMatches)==min(blastStructure(i).evalue(allMatches))); 0167 0168 %Keep the best matches 0169 keep(bestMatches)=true; 0170 end 0171 0172 %Delete all matches that were not best matches 0173 blastStructure(i).fromGenes(~keep)=[]; 0174 blastStructure(i).toGenes(~keep)=[]; 0175 blastStructure(i).evalue(~keep)=[]; 0176 blastStructure(i).aligLen(~keep)=[]; 0177 blastStructure(i).identity(~keep)=[]; 0178 end 0179 end 0180 0181 useOrder=[{getModelFor};useOrder]; 0182 0183 for i=1:numel(blastStructure) 0184 toIndex=strmatch(blastStructure(i).toId,useOrder,'exact'); 0185 fromIndex=strmatch(blastStructure(i).fromId,useOrder,'exact'); 0186 0187 %Add all genes to the corresponding list in allGenes 0188 allGenes{toIndex}=[allGenes{toIndex};blastStructure(i).toGenes]; 0189 allGenes{fromIndex}=[allGenes{fromIndex};blastStructure(i).fromGenes]; 0190 end 0191 0192 %Keep only the unique gene names 0193 maxOtherGeneNr=0; %Determines the dimension of the connectivity matrixes 0194 for i=1:numel(allGenes) 0195 allGenes{i}=unique(allGenes{i}); 0196 if i>1 0197 if numel(allGenes{i})>maxOtherGeneNr 0198 maxOtherGeneNr=numel(allGenes{i}); 0199 end 0200 end 0201 end 0202 0203 %Generate a cell array of matrixes the describes how the genes in the new 0204 %organism map to the models. Each cell matches to the corresponding model in 0205 %useOrder (starting at 2 of course). First dimension is gene in new organsism, 0206 %second which gene it is in the other organism. The second matrix describes 0207 %how they map back. 0208 0209 %As it is now, a significant match is indicated by a 1. This could be 0210 %expanded to contain some kind of significance level. The first dimension 0211 %is still the genes in the new model. 0212 0213 allTo=cell(numel(useOrder)-1,1); 0214 allFrom=cell(numel(useOrder)-1,1); 0215 0216 for i=1:numel(useOrder)-1 0217 allTo{i}=sparse(numel(allGenes{1}),numel(allGenes{i+1})); 0218 allFrom{i}=sparse(numel(allGenes{1}),numel(allGenes{i+1})); 0219 end 0220 0221 %Fill the matches to other species 0222 for i=1:numel(blastStructure) 0223 if strcmp(blastStructure(i).toId,getModelFor) 0224 %This was to the new organism 0225 %They should all match so no checks are being made 0226 [crap a]=ismember(blastStructure(i).toGenes,allGenes{1}); 0227 fromModel=strmatch(blastStructure(i).fromId,useOrder,'exact'); 0228 [crap b]=ismember(blastStructure(i).fromGenes,allGenes{fromModel}); 0229 idx = sub2ind(size(allTo{fromModel-1}), a, b); 0230 allTo{fromModel-1}(idx)=1; 0231 else 0232 %This was from the new organism 0233 [crap a]=ismember(blastStructure(i).fromGenes,allGenes{1}); 0234 toModel=strmatch(blastStructure(i).toId,useOrder,'exact'); 0235 [crap b]=ismember(blastStructure(i).toGenes,allGenes{toModel}); 0236 idx = sub2ind(size(allFrom{toModel-1}), a, b); 0237 allFrom{toModel-1}(idx)=1; 0238 end 0239 end 0240 0241 %Now we have all the gene matches in a convenient way. For all the genes in 0242 %the new organism get the genes that should be included from other 0243 %organisms. If all genes should be included this simply means keep the 0244 %allFrom matrix as is is. If only 1-1 orthologs are to be included then only 0245 %those elements are kept. 0246 0247 finalMappings=cell(numel(useOrder)-1,1); 0248 if strictness==1 || strictness==3 0249 for j=1:numel(allFrom) 0250 finalMappings{j}=allTo{j}~=0 & allFrom{j}~=0; 0251 end 0252 else 0253 if mapNewGenesToOld==true 0254 finalMappings=allFrom; 0255 else 0256 finalMappings=allTo; 0257 end 0258 end 0259 0260 %Remove all genes from the mapping that are not in the models. This doesn't 0261 %do much if only genes in the models were used for the original 0262 %mapping. Also simplify the finalMapping and allGenes structures so that 0263 %they only contain mappings that exist 0264 usedNewGenes=false(numel(allGenes{1}),1); 0265 0266 for i=1:numel(allGenes)-1 0267 %First remove mappings for those genes that are not in the model 0268 if onlyGenesInModels==false 0269 a=ismember(allGenes{i+1},models{useOrderIndexes(i)}.genes); 0270 finalMappings{i}(:,~a)=false; 0271 end 0272 0273 %Then remove unused ones and simplify 0274 [a b crap]=find(finalMappings{i}); 0275 usedGenes=false(numel(allGenes{i+1}),1); 0276 usedNewGenes(a)=true; 0277 usedGenes(b)=true; 0278 finalMappings{i}=finalMappings{i}(:,usedGenes); 0279 allGenes{i+1}=allGenes{i+1}(usedGenes); 0280 end 0281 0282 %Remove all new genes that have not been mapped to anything 0283 allGenes{1}=allGenes{1}(usedNewGenes); 0284 for i=1:numel(finalMappings) 0285 finalMappings{i}=finalMappings{i}(usedNewGenes,:); 0286 end 0287 0288 %Now is it time to choose which reactions should be included from which 0289 %models. If there is a preferred order specified then each gene can only 0290 %result in reactions from one model, otherwise they should all be included 0291 0292 %Start by simplifying the models by removing genes/reactions that are 0293 %not used. This is where it gets weird with complexes, especially "or" complexes. 0294 %In this step only reactions which are encoded by one single gene, or where all 0295 %genes should be deleted, are deleted. The info on the full complex is still 0296 %present in the grRules 0297 0298 for i=1:numel(models) 0299 a=ismember(models{useOrderIndexes(i)}.genes,allGenes{i+1}); 0300 0301 %Don't remove reactions with complexes if not all genes in the complex 0302 %should be deleted. 0303 %NOTE: This means that not all the genes in 'a' are guaranteed to be 0304 %deleted. This approach works fine for 'and' complexes, but there 0305 %should be a check that it doesn't keep 'or' genes if it doesn't have 0306 %to! 0307 models{useOrderIndexes(i)}=removeGenes(models{useOrderIndexes(i)},~a,true,false); 0308 end 0309 0310 %Since I want to use mergeModels in the end, I simplify the models further 0311 %by deleting genes/reactions in the order specified by preferredOrder. This 0312 %means that the last model will only contain reactions for genes that 0313 %mapped only to that model 0314 0315 allUsedGenes=false(numel(allGenes{1}),1); 0316 0317 if ~isempty(preferredOrder) && numel(models)>1 0318 [usedGenes crap crap]=find(finalMappings{1}); %All that are used in the first model in preferredOrder 0319 allUsedGenes(usedGenes)=true; 0320 for i=2:numel(finalMappings) 0321 [usedGenes crap crap]=find(finalMappings{i}); 0322 usedGenes=unique(usedGenes); 0323 a=ismember(usedGenes,find(allUsedGenes)); 0324 0325 [craps genesToDelete crap]=find(finalMappings{i}(usedGenes(a),:)); %IMPORTANT! IS it really correct to remove all genes that map? 0326 genesToDelete=unique(genesToDelete); %Maybe not needed, but for clarity if nothing else 0327 0328 %Remove all the genes that were already found and add the other 0329 %ones to allUsedGenes 0330 [models{useOrderIndexes(i)} notDeleted]=removeGenes(models{useOrderIndexes(i)},allGenes{i+1}(genesToDelete),true,false); 0331 allUsedGenes(usedGenes)=true; 0332 0333 %Remove the deleted genes from finalMappings and allGenes 0334 %Don't remove the genes in notDeleted, they are part of complexes 0335 %with some non-mapped genes 0336 deletedIndexes=~ismember(allGenes{i+1}(genesToDelete),notDeleted); 0337 finalMappings{i}(:,genesToDelete(deletedIndexes))=[]; 0338 allGenes{i+1}(genesToDelete(deletedIndexes))=[]; 0339 end 0340 end 0341 0342 %Now loop through the models and update the gene associations. Genes not 0343 %belonging to the new organism will be renamed as 'OLD_MODELID_gene' 0344 for i=1:numel(models) 0345 %Find all the new genes that should be used for this model 0346 [newGenes oldGenes crap]=find(finalMappings{i}); 0347 0348 %Create a new gene list with the genes from the new organism and those 0349 %genes that could not be removed 0350 replaceableGenes=allGenes{i+1}(unique(oldGenes)); 0351 nonReplaceableGenes=setdiff(models{useOrderIndexes(i)}.genes,replaceableGenes); 0352 fullGeneList=[allGenes{1}(unique(newGenes));nonReplaceableGenes]; 0353 0354 %Just to save some indexing later. This is the LAST index of 0355 %replaceable ones 0356 nonRepStartIndex=numel(unique(newGenes)); 0357 0358 %Construct a new rxnGeneMat 0359 newRxnGeneMat=sparse(numel(models{useOrderIndexes(i)}.rxns),numel(fullGeneList)); 0360 0361 %Now update the rxnGeneMat matrix. This is a little tricky and could 0362 %probably be done in a more efficient way, but I just loop through the 0363 %reactions and add them one by one 0364 for j=1:numel(models{useOrderIndexes(i)}.rxns) 0365 %Get the old genes encoding for this reaction 0366 [crap oldGeneIds crap]=find(models{useOrderIndexes(i)}.rxnGeneMat(j,:)); 0367 0368 %Update the matrix for each gene. This includes replacing one gene 0369 %with several new ones if there were several matches 0370 for k=1:numel(oldGeneIds) 0371 %Match the gene to one in the gene list. This is done as a text 0372 %match. Could probably be done better, but I'm a little lost in 0373 %the indexing 0374 0375 geneName=models{useOrderIndexes(i)}.genes(oldGeneIds(k)); 0376 0377 %First search in the mappable genes 0378 mapIndex=strmatch(geneName,allGenes{i+1},'exact'); 0379 0380 if ~isempty(mapIndex) 0381 %Get the new genes for that gene 0382 [a crap crap]=find(finalMappings{i}(:,mapIndex)); 0383 0384 %Find the positions of these genes in the final gene list 0385 [a b]=ismember(allGenes{1}(a),fullGeneList); 0386 0387 %Update the matrix 0388 newRxnGeneMat(j,b)=1; 0389 0390 %Update the grRules string. This is tricky, but I hope 0391 %that it's ok to replace the old gene name with the new one 0392 %and add ') or (' if there were several matches. Be sure of 0393 %this! 0394 repString=fullGeneList{b(1)}; 0395 for l=2:numel(b) 0396 repString=[repString ') or (' fullGeneList{b(l)}]; 0397 end 0398 models{useOrderIndexes(i)}.grRules{j}=strrep(models{useOrderIndexes(i)}.grRules{j},geneName{1},repString); 0399 else 0400 %Then search in the non-replaceable genes. There could only 0401 %be one match here 0402 index=strmatch(geneName,nonReplaceableGenes,'exact'); 0403 0404 %Update the matrix 0405 newRxnGeneMat(j,nonRepStartIndex+index)=1; 0406 0407 models{useOrderIndexes(i)}.grRules{j}=strrep(models{useOrderIndexes(i)}.grRules{j},geneName{1},strcat('OLD_',models{useOrderIndexes(i)}.id,'_',geneName{1})); 0408 end 0409 end 0410 end 0411 0412 %Add the new list of genes 0413 models{useOrderIndexes(i)}.rxnGeneMat=newRxnGeneMat; 0414 if ~isempty(nonReplaceableGenes) 0415 models{useOrderIndexes(i)}.genes=[allGenes{1}(unique(newGenes));strcat('OLD_',models{useOrderIndexes(i)}.id,'_',nonReplaceableGenes)]; 0416 else 0417 models{useOrderIndexes(i)}.genes=allGenes{1}(unique(newGenes)); 0418 end 0419 end 0420 0421 %Now merge the models. All information should be correct except for 'or' 0422 %complexes 0423 draftModel=mergeModels(models); 0424 0425 %Change description of the resulting model 0426 draftModel.id=getModelFor; 0427 description='Generated by getModelFromHomology using '; 0428 for i=1:numel(models) 0429 if i<numel(models) 0430 description=[description '"' models{i}.id '", ']; 0431 else 0432 description=[description '"' models{i}.id '"']; 0433 end 0434 end 0435 draftModel.description=description; 0436 end