getModelFromHomology Constructs a new model from a set of existing models and gene homology information. models a cell array of model structures to build the model from blastStructure a blastStructure as produced by getBlast or getBlastFromExcel getModelFor the name of the organism to build a model for. Must have hits in both directions to the organisms in models preferredOrder the order in which reactions should be added from the models. If not supplied, reactions will be included from all models, otherwise one gene will only result in reactions from one model (opt, default {}) strictness integer that specifies which reactions that should be included 1: Include only 1-1 orthologs (only include genes that map back to the original gene in the blast in the opposite direction) 2: Include the reactions for all genes below the cutoff 3: Include only best 1-1 orthologs (opt, default 1) onlyGenesInModels blast only against genes that exists in the models. This tends to import a larger fraction from the existing models but may give less reliable results. Has effect only if strictness=3 (opt, default false) maxE only look at genes with E-values <= this value (opt, default 10^-30) minLen only look at genes with overlap >= this value (opt, default 200) minSim only look at genes with similarity >= this value (opt, default 40 (%)) mapNewGenesToOld determines how to match genes if not looking at only 1-1 orthologs. Either map the new genes to the old or old genes to new. The default is to map the new genes (opt, default true) draftModel a model structure for the new organism The models in the models structure should have named the metabolites in the same manner, have their reversible reactions in the same direction (run sortModel), and use the same compartment names. To avoid keeping unneccesary old genes, the models should not have 'or'-relations in their grRules (use expandModel). Returns a model structure for the new organism. draftModel a new model structure Rasmus Agren, 2013-09-24
0001 function draftModel=getModelFromHomology(models,blastStructure,getModelFor,preferredOrder,strictness,onlyGenesInModels,maxE,minLen,minSim,mapNewGenesToOld) 0002 % getModelFromHomology 0003 % Constructs a new model from a set of existing models and gene 0004 % homology information. 0005 % 0006 % models a cell array of model structures to build the model 0007 % from 0008 % blastStructure a blastStructure as produced by getBlast or 0009 % getBlastFromExcel 0010 % getModelFor the name of the organism to build a model for. Must 0011 % have hits in both directions to the organisms in models 0012 % preferredOrder the order in which reactions should be added from the 0013 % models. If not supplied, reactions will be included from 0014 % all models, otherwise one gene will only result in reactions 0015 % from one model (opt, default {}) 0016 % strictness integer that specifies which reactions that should be included 0017 % 1: Include only 1-1 orthologs (only include genes that 0018 % map back to the original gene in the blast in the 0019 % opposite direction) 0020 % 2: Include the reactions for all genes below the cutoff 0021 % 3: Include only best 1-1 orthologs (opt, default 1) 0022 % onlyGenesInModels blast only against genes that exists in the models. 0023 % This tends to import a larger fraction from the existing 0024 % models but may give less reliable results. Has effect only 0025 % if strictness=3 (opt, default false) 0026 % maxE only look at genes with E-values <= this value (opt, default 10^-30) 0027 % minLen only look at genes with overlap >= this value (opt, 0028 % default 200) 0029 % minSim only look at genes with similarity >= this value (opt, 0030 % default 40 (%)) 0031 % mapNewGenesToOld determines how to match genes if not looking at only 0032 % 1-1 orthologs. Either map the new genes to the old or 0033 % old genes to new. The default is to map the new genes 0034 % (opt, default true) 0035 % 0036 % draftModel a model structure for the new organism 0037 % 0038 % The models in the models structure should have named the metabolites in 0039 % the same manner, have their reversible reactions in the 0040 % same direction (run sortModel), and use the same compartment names. 0041 % To avoid keeping unneccesary old genes, the models should not have 0042 % 'or'-relations in their grRules (use expandModel). 0043 % 0044 % Returns a model structure for the new organism. 0045 % 0046 % draftModel a new model structure 0047 % 0048 % Rasmus Agren, 2013-09-24 0049 % 0050 0051 %NOTE: "to" and "from" means relative the new organism 0052 0053 if nargin<4 0054 preferredOrder=[]; 0055 end 0056 if nargin<5 0057 strictness=1; 0058 end 0059 if nargin<6 0060 onlyGenesInModels=false; 0061 end 0062 if nargin<7 0063 maxE=10^-30; 0064 end 0065 if nargin<8 0066 minLen=200; 0067 end 0068 if nargin<9 0069 minSim=40; 0070 end 0071 if nargin<10 0072 mapNewGenesToOld=true; 0073 end 0074 0075 preferredOrder=preferredOrder(:); 0076 0077 %Check that all the information is in the blast structure 0078 modelNames=cell(numel(models),1); 0079 for i=1:numel(models) 0080 modelNames{i}=models{i}.id; 0081 end 0082 0083 %Assume for now that all information is there and that it's correct 0084 %This is important to fix since no further checks are being made! 0085 0086 %Remove all gene matches that are below the cutoffs 0087 for i=1:numel(blastStructure) 0088 indexes=blastStructure(i).evalue<maxE & blastStructure(i).aligLen>=minLen & blastStructure(i).identity>=minSim; %Do it in this direction to lose NaNs 0089 blastStructure(i).toGenes(~indexes)=[]; 0090 blastStructure(i).fromGenes(~indexes)=[]; 0091 blastStructure(i).evalue(~indexes)=[]; 0092 blastStructure(i).aligLen(~indexes)=[]; 0093 blastStructure(i).identity(~indexes)=[]; 0094 end 0095 0096 %Remove all reactions from the models that have no genes encoding for them. 0097 %Also remove all genes that encode for no reactions. There shouldn't be any 0098 %but there might be mistakes 0099 for i=1:numel(models) 0100 [hasGenes crap crap]=find(models{i}.rxnGeneMat); 0101 hasNoGenes=1:numel(models{i}.rxns); 0102 hasNoGenes(hasGenes)=[]; 0103 0104 models{i}=removeRxns(models{i},hasNoGenes,true,true); 0105 end 0106 0107 %Create a structure that contains all genes used in the blasts in any 0108 %direction for each of the models in models and for the new new organism 0109 %The first cell is for the new organism and then according to the preferred 0110 %order. If no such order is supplied, then according to the order in 0111 %models. 0112 allGenes=cell(numel(models)+1,1); 0113 if isempty(preferredOrder) 0114 useOrder=modelNames; 0115 else 0116 useOrder=preferredOrder; 0117 end 0118 0119 %Get the corresponding indexes for those models in the models structure 0120 useOrderIndexes=zeros(numel(models),1); 0121 for i=1:numel(models) 0122 index=strmatch(models{i}.id,useOrder,'exact'); 0123 useOrderIndexes(index)=i; 0124 end 0125 0126 %Remove all genes from the blast structure that have no genes in the models 0127 if onlyGenesInModels==true 0128 modelGenes={}; 0129 for i=1:numel(models) 0130 modelGenes=[modelGenes;models{i}.genes]; 0131 end 0132 for i=1:numel(blastStructure) 0133 %Check to see if it should match the toId or fromId 0134 if strcmpi(blastStructure(i).fromId,getModelFor) 0135 I=ismember(blastStructure(i).toGenes,modelGenes); 0136 else 0137 I=ismember(blastStructure(i).fromGenes,modelGenes); 0138 end 0139 blastStructure(i).fromGenes(~I)=[]; 0140 blastStructure(i).toGenes(~I)=[]; 0141 blastStructure(i).evalue(~I)=[]; 0142 blastStructure(i).aligLen(~I)=[]; 0143 blastStructure(i).identity(~I)=[]; 0144 0145 %Check that no matching in blastStructure is empty. This happens if 0146 %no genes in the models are present in the corresponding sheet 0147 if isempty(blastStructure(i).fromGenes) 0148 dispEM(['No genes in matching from ' blastStructure(i).fromId ' to ' blastStructure(i).toId ' are present in the corresponding model']); 0149 end 0150 end 0151 end 0152 0153 %If only best 1-1 orthologs are to be used then all other measurements are 0154 %deleted from the blastStructure. All code after this stays the same. This 0155 %means that preferred order can still matter. The best ortholog scoring is 0156 %based only on the E-value 0157 if strictness==3 0158 for i=1:numel(blastStructure) 0159 keep=false(numel(blastStructure(i).toGenes),1); 0160 [allFromGenes crap I]=unique(blastStructure(i).fromGenes); 0161 0162 %It would be nice to get rid of this loop 0163 for j=1:numel(allFromGenes) 0164 allMatches=find(I==j); 0165 bestMatches=allMatches(blastStructure(i).evalue(allMatches)==min(blastStructure(i).evalue(allMatches))); 0166 0167 %Keep the best matches 0168 keep(bestMatches)=true; 0169 end 0170 0171 %Delete all matches that were not best matches 0172 blastStructure(i).fromGenes(~keep)=[]; 0173 blastStructure(i).toGenes(~keep)=[]; 0174 blastStructure(i).evalue(~keep)=[]; 0175 blastStructure(i).aligLen(~keep)=[]; 0176 blastStructure(i).identity(~keep)=[]; 0177 end 0178 end 0179 0180 useOrder=[{getModelFor};useOrder]; 0181 0182 for i=1:numel(blastStructure) 0183 toIndex=strmatch(blastStructure(i).toId,useOrder,'exact'); 0184 fromIndex=strmatch(blastStructure(i).fromId,useOrder,'exact'); 0185 0186 %Add all genes to the corresponding list in allGenes 0187 allGenes{toIndex}=[allGenes{toIndex};blastStructure(i).toGenes]; 0188 allGenes{fromIndex}=[allGenes{fromIndex};blastStructure(i).fromGenes]; 0189 end 0190 0191 %Keep only the unique gene names 0192 maxOtherGeneNr=0; %Determines the dimension of the connectivity matrixes 0193 for i=1:numel(allGenes) 0194 allGenes{i}=unique(allGenes{i}); 0195 if i>1 0196 if numel(allGenes{i})>maxOtherGeneNr 0197 maxOtherGeneNr=numel(allGenes{i}); 0198 end 0199 end 0200 end 0201 0202 %Generate a cell array of matrixes the describes how the genes in the new 0203 %organism map to the models. Each cell matches to the corresponding model in 0204 %useOrder (starting at 2 of course). First dimension is gene in new organsism, 0205 %second which gene it is in the other organism. The second matrix describes 0206 %how they map back. 0207 0208 %As it is now, a significant match is indicated by a 1. This could be 0209 %expanded to contain some kind of significance level. The first dimension 0210 %is still the genes in the new model. 0211 0212 allTo=cell(numel(useOrder)-1,1); 0213 allFrom=cell(numel(useOrder)-1,1); 0214 0215 for i=1:numel(useOrder)-1 0216 allTo{i}=sparse(numel(allGenes{1}),numel(allGenes{i+1})); 0217 allFrom{i}=sparse(numel(allGenes{1}),numel(allGenes{i+1})); 0218 end 0219 0220 %Fill the matches to other species 0221 for i=1:numel(blastStructure) 0222 if strcmp(blastStructure(i).toId,getModelFor) 0223 %This was to the new organism 0224 %They should all match so no checks are being made 0225 [crap a]=ismember(blastStructure(i).toGenes,allGenes{1}); 0226 fromModel=strmatch(blastStructure(i).fromId,useOrder,'exact'); 0227 [crap b]=ismember(blastStructure(i).fromGenes,allGenes{fromModel}); 0228 idx = sub2ind(size(allTo{fromModel-1}), a, b); 0229 allTo{fromModel-1}(idx)=1; 0230 else 0231 %This was from the new organism 0232 [crap a]=ismember(blastStructure(i).fromGenes,allGenes{1}); 0233 toModel=strmatch(blastStructure(i).toId,useOrder,'exact'); 0234 [crap b]=ismember(blastStructure(i).toGenes,allGenes{toModel}); 0235 idx = sub2ind(size(allFrom{toModel-1}), a, b); 0236 allFrom{toModel-1}(idx)=1; 0237 end 0238 end 0239 0240 %Now we have all the gene matches in a convenient way. For all the genes in 0241 %the new organism get the genes that should be included from other 0242 %organisms. If all genes should be included this simply means keep the 0243 %allFrom matrix as is is. If only 1-1 orthologs are to be included then only 0244 %those elements are kept. 0245 0246 finalMappings=cell(numel(useOrder)-1,1); 0247 if strictness==1 || strictness==3 0248 for j=1:numel(allFrom) 0249 finalMappings{j}=allTo{j}~=0 & allFrom{j}~=0; 0250 end 0251 else 0252 if mapNewGenesToOld==true 0253 finalMappings=allFrom; 0254 else 0255 finalMappings=allTo; 0256 end 0257 end 0258 0259 %Remove all genes from the mapping that are not in the models. This doesn't 0260 %do much if only genes in the models were used for the original 0261 %mapping. Also simplify the finalMapping and allGenes structures so that 0262 %they only contain mappings that exist 0263 usedNewGenes=false(numel(allGenes{1}),1); 0264 0265 for i=1:numel(allGenes)-1 0266 %First remove mappings for those genes that are not in the model 0267 if onlyGenesInModels==false 0268 a=ismember(allGenes{i+1},models{useOrderIndexes(i)}.genes); 0269 finalMappings{i}(:,~a)=false; 0270 end 0271 0272 %Then remove unused ones and simplify 0273 [a b crap]=find(finalMappings{i}); 0274 usedGenes=false(numel(allGenes{i+1}),1); 0275 usedNewGenes(a)=true; 0276 usedGenes(b)=true; 0277 finalMappings{i}=finalMappings{i}(:,usedGenes); 0278 allGenes{i+1}=allGenes{i+1}(usedGenes); 0279 end 0280 0281 %Remove all new genes that have not been mapped to anything 0282 allGenes{1}=allGenes{1}(usedNewGenes); 0283 for i=1:numel(finalMappings) 0284 finalMappings{i}=finalMappings{i}(usedNewGenes,:); 0285 end 0286 0287 %Now is it time to choose which reactions should be included from which 0288 %models. If there is a preferred order specified then each gene can only 0289 %result in reactions from one model, otherwise they should all be included 0290 0291 %Start by simplifying the models by removing genes/reactions that are 0292 %not used. This is where it gets weird with complexes, especially "or" complexes. 0293 %In this step only reactions which are encoded by one single gene, or where all 0294 %genes should be deleted, are deleted. The info on the full complex is still 0295 %present in the grRules 0296 0297 for i=1:numel(models) 0298 a=ismember(models{useOrderIndexes(i)}.genes,allGenes{i+1}); 0299 0300 %Don't remove reactions with complexes if not all genes in the complex 0301 %should be deleted. 0302 %NOTE: This means that not all the genes in 'a' are guaranteed to be 0303 %deleted. This approach works fine for 'and' complexes, but there 0304 %should be a check that it doesn't keep 'or' genes if it doesn't have 0305 %to! 0306 models{useOrderIndexes(i)}=removeGenes(models{useOrderIndexes(i)},~a,true,false); 0307 end 0308 0309 %Since I want to use mergeModels in the end, I simplify the models further 0310 %by deleting genes/reactions in the order specified by preferredOrder. This 0311 %means that the last model will only contain reactions for genes that 0312 %mapped only to that model 0313 0314 allUsedGenes=false(numel(allGenes{1}),1); 0315 0316 if ~isempty(preferredOrder) && numel(models)>1 0317 [usedGenes crap crap]=find(finalMappings{1}); %All that are used in the first model in preferredOrder 0318 allUsedGenes(usedGenes)=true; 0319 for i=2:numel(finalMappings) 0320 [usedGenes crap crap]=find(finalMappings{i}); 0321 usedGenes=unique(usedGenes); 0322 a=ismember(usedGenes,find(allUsedGenes)); 0323 0324 [craps genesToDelete crap]=find(finalMappings{i}(usedGenes(a),:)); %IMPORTANT! IS it really correct to remove all genes that map? 0325 genesToDelete=unique(genesToDelete); %Maybe not needed, but for clarity if nothing else 0326 0327 %Remove all the genes that were already found and add the other 0328 %ones to allUsedGenes 0329 [models{useOrderIndexes(i)} notDeleted]=removeGenes(models{useOrderIndexes(i)},allGenes{i+1}(genesToDelete),true,false); 0330 allUsedGenes(usedGenes)=true; 0331 0332 %Remove the deleted genes from finalMappings and allGenes 0333 %Don't remove the genes in notDeleted, they are part of complexes 0334 %with some non-mapped genes 0335 deletedIndexes=~ismember(allGenes{i+1}(genesToDelete),notDeleted); 0336 finalMappings{i}(:,genesToDelete(deletedIndexes))=[]; 0337 allGenes{i+1}(genesToDelete(deletedIndexes))=[]; 0338 end 0339 end 0340 0341 %Now loop through the models and update the gene associations. Genes not 0342 %belonging to the new organism will be renamed as 'OLD_MODELID_gene' 0343 for i=1:numel(models) 0344 %Find all the new genes that should be used for this model 0345 [newGenes oldGenes crap]=find(finalMappings{i}); 0346 0347 %Create a new gene list with the genes from the new organism and those 0348 %genes that could not be removed 0349 replaceableGenes=allGenes{i+1}(unique(oldGenes)); 0350 nonReplaceableGenes=setdiff(models{useOrderIndexes(i)}.genes,replaceableGenes); 0351 fullGeneList=[allGenes{1}(unique(newGenes));nonReplaceableGenes]; 0352 0353 %Just to save some indexing later. This is the LAST index of 0354 %replaceable ones 0355 nonRepStartIndex=numel(unique(newGenes)); 0356 0357 %Construct a new rxnGeneMat 0358 newRxnGeneMat=sparse(numel(models{useOrderIndexes(i)}.rxns),numel(fullGeneList)); 0359 0360 %Now update the rxnGeneMat matrix. This is a little tricky and could 0361 %probably be done in a more efficient way, but I just loop through the 0362 %reactions and add them one by one 0363 for j=1:numel(models{useOrderIndexes(i)}.rxns) 0364 %Get the old genes encoding for this reaction 0365 [crap oldGeneIds crap]=find(models{useOrderIndexes(i)}.rxnGeneMat(j,:)); 0366 0367 %Update the matrix for each gene. This includes replacing one gene 0368 %with several new ones if there were several matches 0369 for k=1:numel(oldGeneIds) 0370 %Match the gene to one in the gene list. This is done as a text 0371 %match. Could probably be done better, but I'm a little lost in 0372 %the indexing 0373 0374 geneName=models{useOrderIndexes(i)}.genes(oldGeneIds(k)); 0375 0376 %First search in the mappable genes 0377 mapIndex=strmatch(geneName,allGenes{i+1},'exact'); 0378 0379 if ~isempty(mapIndex) 0380 %Get the new genes for that gene 0381 [a crap crap]=find(finalMappings{i}(:,mapIndex)); 0382 0383 %Find the positions of these genes in the final gene list 0384 [a b]=ismember(allGenes{1}(a),fullGeneList); 0385 0386 %Update the matrix 0387 newRxnGeneMat(j,b)=1; 0388 0389 %Update the grRules string. This is tricky, but I hope 0390 %that it's ok to replace the old gene name with the new one 0391 %and add ') or (' if there were several matches. Be sure of 0392 %this! 0393 repString=fullGeneList{b(1)}; 0394 for l=2:numel(b) 0395 repString=[repString ') or (' fullGeneList{b(l)}]; 0396 end 0397 models{useOrderIndexes(i)}.grRules{j}=strrep(models{useOrderIndexes(i)}.grRules{j},geneName{1},repString); 0398 else 0399 %Then search in the non-replaceable genes. There could only 0400 %be one match here 0401 index=strmatch(geneName,nonReplaceableGenes,'exact'); 0402 0403 %Update the matrix 0404 newRxnGeneMat(j,nonRepStartIndex+index)=1; 0405 0406 models{useOrderIndexes(i)}.grRules{j}=strrep(models{useOrderIndexes(i)}.grRules{j},geneName{1},strcat('OLD_',models{useOrderIndexes(i)}.id,'_',geneName{1})); 0407 end 0408 end 0409 end 0410 0411 %Add the new list of genes 0412 models{useOrderIndexes(i)}.rxnGeneMat=newRxnGeneMat; 0413 if ~isempty(nonReplaceableGenes) 0414 models{useOrderIndexes(i)}.genes=[allGenes{1}(unique(newGenes));strcat('OLD_',models{useOrderIndexes(i)}.id,'_',nonReplaceableGenes)]; 0415 else 0416 models{useOrderIndexes(i)}.genes=allGenes{1}(unique(newGenes)); 0417 end 0418 end 0419 0420 %Now merge the models. All information should be correct except for 'or' 0421 %complexes 0422 draftModel=mergeModels(models); 0423 0424 %Change description of the resulting model 0425 draftModel.id=getModelFor; 0426 description='Generated by getModelFromHomology using '; 0427 for i=1:numel(models) 0428 if i<numel(models) 0429 description=[description models{i}.id]; 0430 else 0431 description=[description ', ' models{i}.id]; 0432 end 0433 end 0434 draftModel.description=description; 0435 end