getKEGGModelForOrganism Reconstructs a genome-scale metabolic model based on protein homology to the orthologies in KEGG organismID three letter abbreviation of the organism (as used in KEGG). If not available, use a closely related species. This is used for determing the phylogenetic distance. fastaFile a FASTA file that contains the protein sequences of the organism for which to reconstruct a model (opt, if no FASTA file is supplied then a model is reconstructed based only on the organism abbreviation. This option ignores all settings except for keepUndefinedStoich, keepIncomplete and keepGeneral) dataDir directory for which to retrieve the input data. Should contain a combination of these sub-folders: -dataDir\keggdb The KEGG database files used in 1a (see below). -dataDir\fasta The multi-FASTA files generated in 1b or downloaded from the RAVEN Toolbox homepage (see below) -dataDir\aligned The aligned FASTA files as generated in 2a (see below) -dataDir\hmms The Hidden Markov Models as generated in 2b or downloaded from the RAVEN Toolbox homepage (see below) (opt, see note about fastaFile. Note that in order to rebuild the KEGG model from a database dump, as opposed to using the version supplied with RAVEN, you would still need to supply this) outDir directory to save the results from the quering of the Hidden Markov models. The output is specific for the input sequences and the settings used. It is stored in this manner so that the function can continue if interrupted or if it should run in parallell. Be careful not to leave output files from different organisms or runs with different settings in the same folder. They will not be overwritten (opt, default is a temporary dir where all *.out files are deleted before and after doing the reconstruction) keepUndefinedStoich include reactions in the form n A <=> n+1 A. These will be dealt with as two separate metabolites (opt, default true) keepIncomplete include reactions which have been labelled as "incomplete", "erroneous" or "unclear" (opt, default true) keepGeneral include reactions which have been labelled as "general reaction". These are reactions on the form "an aldehyde <=> an alcohol", and are therefore unsuited for modelling purposes. Note that not all reactions have this type of annotation, and the script will therefore not be able to remove all such reactions (opt, default false) cutOff significans score from HMMer needed to assign genes to a KO (opt, default 10^-50) minScoreRatioG a gene is only assigned to KOs for which the score is >=log(score)/log(best score) for that gene. This is to prevent that a gene which clearly belongs to one KO is assigned also to KOs with much lower scores (opt, default 0.8 (lower is less strict)) minScoreRatioKO ignore genes in a KO if their score is <log(score)/log(best score in KO). This is to "prune" KOs which have many genes and where some are clearly a better fit (opt, default 0.3 (lower is less strict)) maxPhylDist -1: only use sequences from the same domain (Prokaryota, Eukaryota) other (positive) value: only use sequences for organisms where the phylogenetic distance is at the most this large (as calculated in getPhylDist) (opt, default Inf, which means that all sequences will be used) nSequences for each KO, use up to this many sequences from the most closely related species. This is mainly to speed up the alignment process for KOs with very many genes (opt, default inf) model the reconstructed model PLEASE READ THIS: The input to this function can be confusing, because it is intended to be run in parallell on a cluster or in multiple sessions. It therefore saves a lot of intermediate results to disc. This also serves the purpose of not having to do reduntant calculations. This, however, comes with the disadvantage of somewhat trickier handling. This is what this function does: 1a. Downloads relevant parts from the KEGG FTP and constructs a general RAVEN model representing the metabolic network. This output is distributed with RAVEN (it's in the raven\kegg directory). Delete those files to rerun this parsing. 1b. Generates FASTA files from the downloaded KEGG files. One multi-FASTA file for each KO in KEGG is generated. These multi-fasta files can be downloaded from the RAVEN Toolbox homepage if you do not have access to the KEGG FTP or don't want to do this time-consuming parsing. Just be sure that you actually need them first (see below) These steps only have to be redone every time KEGG updates their database (or rather when the updates are large enough to warrant rerunning this part). Many user would probably never use this feature. 2a. Does alignment of the multi-FASTA files for future use. This uses the settings "useEvDist" and "nSequences" to control which sequences should be used for constructing Hidden Markov models (HMMs), and later for matching your sequences to. 2b. Trains Hidden Markov models using HMMer for each of the aligned FASTA files. The most common alternatives here would be to use sequences from only eukaryotes, only prokaryotes or all sequences in KEGG. The Hidden Markov models for those options can be downloaded from the RAVEN Toolbox homepage. This is normally the most convenient way, but if you would like to use, for example, only fungal sequences for training the HMMs then you need to run this part. 3a. Queries the HMMs with sequences for the organism you are making a model for. This step uses both the output from step 1a and from 2b. 3b. Constructs a model based on the fit to the HMMs and the chosen parameters. These steps are specific to the organism for which you are reconstructing the model. In principle the function looks at which output that is already available and runs runs only the parts that are required for step 3. This means that (see the definition of the parameters for details): -1a is only performed if there are no KEGG model files in the raven\kegg directory -1b is only performed if not all required HMMs OR aligned FASTA files OR multi-FASTA files exist in the defined dataDir. This means that this step is skipped if the FASTA files or HMMs are downloaded from the RAVEN Toolbox homepage instead. If not all files exist it will try to find the KEGG database files in dataDir. If it cannot find them it will try to download them via the KEGG FTP. -2a is only performed if not all required HMMs OR aligned FASTA files files exist in the defined dataDir. This means that this step is skipped if the HMMs are downloaded from the RAVEN Toolbox homepage instead. -2b is only performed if not all required HMMs exist in the defined dataDir. This means that this step is skipped if the FASTA files or HMMs are downloaded from the RAVEN Toolbox homepage instead. -3a is performed for the required HMMs for which no corresponding .out file exists in outDir. This is just a way to enable the function to be run in parallell or to resume if interrupeted. -3b is always performed. Usage: model=getKEGGModelForOrganism(organismID,fastaFile,dataDir,outDir,... keepUndefinedStoich,keepIncomplete,keepGeneral,cutOff,minScoreRatioG,... minScoreRatioKO,maxPhylDist,nSequences) Rasmus Agren, 2013-11-22
0001 function model=getKEGGModelForOrganism(organismID,fastaFile,dataDir,outDir,... 0002 keepUndefinedStoich,keepIncomplete,keepGeneral,cutOff,minScoreRatioG,... 0003 minScoreRatioKO,maxPhylDist,nSequences) 0004 % getKEGGModelForOrganism 0005 % Reconstructs a genome-scale metabolic model based on protein homology to the 0006 % orthologies in KEGG 0007 % 0008 % organismID three letter abbreviation of the organism (as used in 0009 % KEGG). If not available, use a closely related 0010 % species. This is used for determing the 0011 % phylogenetic distance. 0012 % fastaFile a FASTA file that contains the protein sequences of 0013 % the organism for which to reconstruct a model (opt, 0014 % if no FASTA file is supplied then a model is 0015 % reconstructed based only on the organism 0016 % abbreviation. This option ignores all settings except for 0017 % keepUndefinedStoich, keepIncomplete and keepGeneral) 0018 % dataDir directory for which to retrieve the input data. 0019 % Should contain a combination of these sub-folders: 0020 % -dataDir\keggdb 0021 % The KEGG database files used in 1a (see below). 0022 % -dataDir\fasta 0023 % The multi-FASTA files generated in 1b or 0024 % downloaded from the RAVEN Toolbox homepage (see 0025 % below) 0026 % -dataDir\aligned 0027 % The aligned FASTA files as generated in 2a (see 0028 % below) 0029 % -dataDir\hmms 0030 % The Hidden Markov Models as generated in 2b or 0031 % downloaded from the RAVEN Toolbox homepage (see 0032 % below) 0033 % (opt, see note about fastaFile. Note that in order to 0034 % rebuild the KEGG model from a database dump, as opposed to 0035 % using the version supplied with RAVEN, you would still need 0036 % to supply this) 0037 % outDir directory to save the results from the quering of 0038 % the Hidden Markov models. The output is specific 0039 % for the input sequences and the settings used. It 0040 % is stored in this manner so that the function can 0041 % continue if interrupted or if it should run in 0042 % parallell. Be careful not to leave output files 0043 % from different organisms or runs with different 0044 % settings in the same folder. They will not be 0045 % overwritten (opt, default is a temporary dir where 0046 % all *.out files are deleted before and after doing 0047 % the reconstruction) 0048 % keepUndefinedStoich include reactions in the form n A <=> n+1 A. These 0049 % will be dealt with as two separate metabolites 0050 % (opt, default true) 0051 % keepIncomplete include reactions which have been labelled as 0052 % "incomplete", "erroneous" or "unclear" (opt, 0053 % default true) 0054 % keepGeneral include reactions which have been labelled as 0055 % "general reaction". These are reactions on the form 0056 % "an aldehyde <=> an alcohol", and are therefore 0057 % unsuited for modelling purposes. Note that not all 0058 % reactions have this type of annotation, and the 0059 % script will therefore not be able to remove all 0060 % such reactions (opt, default false) 0061 % cutOff significans score from HMMer needed to assign genes 0062 % to a KO (opt, default 10^-50) 0063 % minScoreRatioG a gene is only assigned to KOs for which the score 0064 % is >=log(score)/log(best score) for that gene. 0065 % This is to prevent that a gene which clearly belongs 0066 % to one KO is assigned also to KOs with much lower 0067 % scores (opt, default 0.8 (lower is less strict)) 0068 % minScoreRatioKO ignore genes in a KO if their score is 0069 % <log(score)/log(best score in KO). This is to 0070 % "prune" KOs which have many genes and where some 0071 % are clearly a better fit 0072 % (opt, default 0.3 (lower is less strict)) 0073 % maxPhylDist -1: only use sequences from the same domain 0074 % (Prokaryota, Eukaryota) 0075 % other (positive) value: only use sequences for 0076 % organisms where the phylogenetic distance is at 0077 % the most this large (as calculated in getPhylDist) 0078 % (opt, default Inf, which means that all sequences will 0079 % be used) 0080 % nSequences for each KO, use up to this many sequences from 0081 % the most closely related species. This is mainly to 0082 % speed up the alignment process for KOs with very 0083 % many genes (opt, default inf) 0084 % 0085 % model the reconstructed model 0086 % 0087 % PLEASE READ THIS: The input to this function can be confusing, because 0088 % it is intended to be run in parallell on a cluster or in multiple sessions. 0089 % It therefore saves a lot of intermediate results to disc. This also 0090 % serves the purpose of not having to do reduntant calculations. This, 0091 % however, comes with the disadvantage of somewhat trickier handling. 0092 % This is what this function does: 0093 % 0094 % 1a. Downloads relevant parts from the KEGG FTP and constructs a general 0095 % RAVEN model representing the metabolic network. This output is 0096 % distributed with RAVEN (it's in the raven\kegg directory). Delete 0097 % those files to rerun this parsing. 0098 % 1b. Generates FASTA files from the downloaded KEGG files. One 0099 % multi-FASTA file for each KO in KEGG is generated. These 0100 % multi-fasta files can be downloaded from the RAVEN Toolbox 0101 % homepage if you do not have access to the KEGG FTP or don't want to 0102 % do this time-consuming parsing. Just be sure that you actually need 0103 % them first (see below) 0104 % 0105 % These steps only have to be redone every time KEGG updates their 0106 % database (or rather when the updates are large enough to warrant 0107 % rerunning this part). Many user would probably never use this feature. 0108 % 0109 % 2a. Does alignment of the multi-FASTA files for future use. This uses the 0110 % settings "useEvDist" and "nSequences" to control which sequences 0111 % should be used for constructing Hidden Markov models (HMMs), and 0112 % later for matching your sequences to. 0113 % 2b. Trains Hidden Markov models using HMMer for each of the aligned 0114 % FASTA files. 0115 % 0116 % The most common alternatives here would be to use sequences from only 0117 % eukaryotes, only prokaryotes or all sequences in KEGG. The Hidden Markov 0118 % models for those options can be downloaded from the RAVEN Toolbox 0119 % homepage. This is normally the most convenient way, but if you would 0120 % like to use, for example, only fungal sequences for training the HMMs 0121 % then you need to run this part. 0122 % 0123 % 3a. Queries the HMMs with sequences for the organism you are making a 0124 % model for. This step uses both the output from step 1a and from 2b. 0125 % 3b. Constructs a model based on the fit to the HMMs and the chosen 0126 % parameters. 0127 % 0128 % These steps are specific to the organism for which you are reconstructing 0129 % the model. 0130 % 0131 % In principle the function looks at which output that is already available 0132 % and runs runs only the parts that are required for step 3. This means 0133 % that (see the definition of the parameters for details): 0134 % -1a is only performed if there are no KEGG model files in the raven\kegg 0135 % directory 0136 % -1b is only performed if not all required HMMs OR aligned FASTA files 0137 % OR multi-FASTA files exist in the defined dataDir. This means that this 0138 % step is skipped if the FASTA files or HMMs are downloaded from the RAVEN 0139 % Toolbox homepage instead. If not all files exist it will try to find 0140 % the KEGG database files in dataDir. If it cannot find them it will try 0141 % to download them via the KEGG FTP. 0142 % -2a is only performed if not all required HMMs OR aligned FASTA files 0143 % files exist in the defined dataDir. This means that this step is skipped 0144 % if the HMMs are downloaded from the RAVEN Toolbox homepage instead. 0145 % -2b is only performed if not all required HMMs exist in the defined 0146 % dataDir. This means that this step is skipped if the FASTA files or 0147 % HMMs are downloaded from the RAVEN Toolbox homepage instead. 0148 % -3a is performed for the required HMMs for which no corresponding .out 0149 % file exists in outDir. This is just a way to enable the function to be 0150 % run in parallell or to resume if interrupeted. 0151 % -3b is always performed. 0152 % 0153 % Usage: model=getKEGGModelForOrganism(organismID,fastaFile,dataDir,outDir,... 0154 % keepUndefinedStoich,keepIncomplete,keepGeneral,cutOff,minScoreRatioG,... 0155 % minScoreRatioKO,maxPhylDist,nSequences) 0156 % 0157 % Rasmus Agren, 2013-11-22 0158 % 0159 0160 if nargin<2 0161 fastaFile=[]; 0162 end 0163 if nargin<3 0164 dataDir=[]; 0165 end 0166 if nargin<4 0167 outDir=[]; 0168 end 0169 if isempty(outDir) 0170 outDir=tempdir; 0171 %Delete all *.out files if any exist 0172 delete(fullfile(outDir,'*.out')); 0173 end 0174 if nargin<5 0175 keepUndefinedStoich=true; 0176 end 0177 if nargin<6 0178 keepIncomplete=true; 0179 end 0180 if nargin<7 0181 keepGeneral=true; 0182 end 0183 if nargin<8 0184 cutOff=10^-50; 0185 end 0186 if nargin<9 0187 minScoreRatioG=0.8; 0188 end 0189 if nargin<10 0190 minScoreRatioKO=0.3; 0191 end 0192 if nargin<11 0193 maxPhylDist=inf; %Include all sequences for each reaction 0194 end 0195 if nargin<12 0196 nSequences=inf; %Include all sequences for each reaction 0197 end 0198 0199 %Check if the fasta-file contains '/' or'\'. If not then it's probably just 0200 %a file name. It is then merged with the current folder 0201 if any(fastaFile) 0202 if ~any(strfind(fastaFile,'\')) && ~any(strfind(fastaFile,'/')) 0203 fastaFile=fullfile(pwd,fastaFile); 0204 end 0205 %Create the required sub-folders in dataDir if they dont exist 0206 if ~exist(fullfile(dataDir,'keggdb'),'dir') 0207 mkdir(dataDir,'keggdb'); 0208 end 0209 if ~exist(fullfile(dataDir,'fasta'),'dir') 0210 mkdir(dataDir,'fasta'); 0211 end 0212 if ~exist(fullfile(dataDir,'aligned'),'dir') 0213 mkdir(dataDir,'aligned'); 0214 end 0215 if ~exist(fullfile(dataDir,'hmms'),'dir') 0216 mkdir(dataDir,'hmms'); 0217 end 0218 if ~exist(outDir,'dir') 0219 mkdir(outDir); 0220 end 0221 end 0222 0223 %First generate the full KEGG model. The dataDir mustn't be supplied as 0224 %there is also an internal RAVEN version available 0225 if any(dataDir) 0226 [model, KOModel]=getModelFromKEGG(fullfile(dataDir,'keggdb'),keepUndefinedStoich,keepIncomplete,keepGeneral); 0227 else 0228 [model, KOModel]=getModelFromKEGG([],keepUndefinedStoich,keepIncomplete,keepGeneral); 0229 end 0230 fprintf('Completed generation of KEGG model\n'); 0231 model.id=organismID; 0232 model.c=zeros(numel(model.rxns),1); 0233 0234 %If no FASTA file is supplied, then just remove all genes which are not for 0235 %the given organism ID 0236 if isempty(fastaFile) 0237 %All IDs are three letters 0238 I=cellfun(@(x) strcmpi(x(1:3),organismID),model.genes); 0239 0240 %Remove those genes 0241 model.genes=model.genes(I); 0242 model.rxnGeneMat=model.rxnGeneMat(:,I); 0243 end 0244 0245 %First remove all reactions without genes 0246 hasGenes=any(model.rxnGeneMat,2); 0247 0248 model=removeRxns(model,~hasGenes,true); 0249 0250 %If no FASTA file is supplied, then we're done here 0251 if isempty(fastaFile) 0252 return; 0253 end 0254 0255 %Trim the genes so that they only contain information that can be matched 0256 %to the KEGG file of protein sequences (remove all information after first 0257 %parenthesis) 0258 0259 %NOTE: For some reason the organism abbreviation should be in lower case in 0260 %this database. Fix this here 0261 for i=1:numel(KOModel.genes) 0262 parIndex=strfind(KOModel.genes{i},'('); 0263 if any(parIndex) 0264 KOModel.genes{i}=KOModel.genes{i}(1:parIndex-1); 0265 end 0266 colIndex=strfind(KOModel.genes{i},':'); 0267 KOModel.genes{i}=[lower(KOModel.genes{i}(1:colIndex-1)) KOModel.genes{i}(colIndex:end)]; 0268 end 0269 0270 %Create a phylogenetic distance structure 0271 phylDistStruct=getPhylDist(fullfile(dataDir,'keggdb'),maxPhylDist==-1); 0272 [crap phylDistId]=ismember(model.id,phylDistStruct.ids); 0273 fprintf('Completed creation of phylogenetic distance matrix\n'); 0274 0275 %Calculate the real maximal distance now. An abitary large number of 1000 is 0276 %used for the "all in kingdom" or "all sequences" options. This is a bit 0277 %inconvenient way to do it, but it's to make it fit with some older code 0278 if isinf(maxPhylDist) || maxPhylDist==-1 0279 maxPhylDist=1000; 0280 end 0281 0282 %Get the KO ids for which files have been generated. Maybe not the neatest 0283 %way.. 0284 fastaFiles=listFiles(fullfile(dataDir,'fasta','*.fa')); 0285 alignedFiles=listFiles(fullfile(dataDir,'aligned','*.fa')); 0286 alignedWorking=listFiles(fullfile(dataDir,'aligned','*.faw')); 0287 hmmFiles=listFiles(fullfile(dataDir,'hmms','*.hmm')); 0288 outFiles=listFiles(fullfile(outDir,'*.out')); 0289 0290 %Check if multi-FASTA files should be generated. This should only be 0291 %performed if there are IDs in the KOModel structure that haven't been 0292 %parsed yet 0293 missingFASTA=setdiff(KOModel.rxns,[fastaFiles;alignedFiles;hmmFiles;outFiles]); 0294 0295 if ~isempty(missingFASTA) 0296 if ~exist(fullfile(dataDir,'keggdb','genes.pep'),'file') 0297 %If no sequence file exists then download from KEGG 0298 downloadKEGG(fullfile(dataDir,'keggdb')); 0299 end 0300 %Only construct models for KOs which don't have files already 0301 fastaModel=removeRxns(KOModel,setdiff(KOModel.rxns,missingFASTA),true,true); 0302 %Permute the order of the KOs in the model so that constructMultiFasta 0303 %can be run on several processors at once 0304 fastaModel=permuteModel(fastaModel,randperm(RandStream.create('mrg32k3a','Seed',cputime()),numel(fastaModel.rxns)),'rxns'); 0305 constructMultiFasta(fastaModel,fullfile(dataDir,'keggdb','genes.pep'),fullfile(dataDir,'fasta')); 0306 end 0307 fprintf('Completed generation of multi-FASTA files\n'); 0308 0309 %Get the directory for RAVEN Toolbox. This is to get the path to the third 0310 %party software used 0311 [ST I]=dbstack('-completenames'); 0312 ravenPath=fileparts(ST(I).file); 0313 0314 %Check if alignment of FASTA files should be performed 0315 missingAligned=setdiff(KOModel.rxns,[alignedFiles;hmmFiles;alignedWorking;outFiles]); 0316 if ~isempty(missingAligned) 0317 missingAligned=missingAligned(randperm(RandStream.create('mrg32k3a','Seed',cputime()),numel(missingAligned))); 0318 %Align all sequences using CLUSTALW 0319 for i=1:numel(missingAligned) 0320 %This is checked here because it could be that it is created by a 0321 %parallell process. The faw-files are saved as temporary files to 0322 %keppt track of which files are being worked on 0323 if ~exist(fullfile(dataDir,'aligned',[missingAligned{i} '.faw']),'file') && ~exist(fullfile(dataDir,'aligned',[missingAligned{i} '.fa']),'file') 0324 %Check that the multi-FASTA file exists. It should do so since 0325 %we are saving empty files as well. Print a warning and 0326 %continue if not. 0327 if ~exist(fullfile(dataDir,'fasta',[missingAligned{i} '.fa']),'file') 0328 dispEM(['WARNING: The multi-FASTA file for ' missingAligned{i} ' does not exist'],false); 0329 continue; 0330 end 0331 0332 %If the multi-FASTA file is empty then save an empty aligned 0333 %file and continue 0334 s=dir(fullfile(dataDir,'fasta',[missingAligned{i} '.fa'])); 0335 if s.bytes<=0 0336 fid=fopen(fullfile(dataDir,'aligned',[missingAligned{i} '.fa']),'w'); 0337 fclose(fid); 0338 continue; 0339 end 0340 0341 %Create an empty file to prevent other threads to start to work on the same alignment 0342 fid=fopen(fullfile(dataDir,'aligned',[missingAligned{i} '.faw']),'w'); 0343 fclose(fid); 0344 0345 %First load the FASTA file, then select up to nSequences sequences 0346 %of the most closely related species, apply any constraints from 0347 %maxPhylDist, and save it as a temporary file, and create the 0348 %model from that 0349 0350 fastaStruct=fastaread(fullfile(dataDir,'fasta',[missingAligned{i} '.fa'])); 0351 phylDist=inf(numel(fastaStruct),1); 0352 for j=1:numel(fastaStruct) 0353 %Get the organism abbreviation 0354 index=strfind(fastaStruct(j).Header,':'); 0355 if any(index) 0356 abbrev=fastaStruct(j).Header(1:index(1)-1); 0357 [crap index]=ismember(abbrev,phylDistStruct.ids); 0358 if any(index) 0359 phylDist(j)=phylDistStruct.distMat(index(1),phylDistId); 0360 end 0361 end 0362 end 0363 0364 %Inf means that it should not be included 0365 phylDist(phylDist>maxPhylDist)=[]; 0366 0367 %Sort based on phylDist 0368 [crap order]=sort(phylDist); 0369 0370 %Save the first nSequences hits to a temporary FASTA file 0371 if nSequences<=numel(fastaStruct) 0372 fastaStruct=fastaStruct(order(1:nSequences)); 0373 else 0374 fastaStruct=fastaStruct(order); 0375 end 0376 0377 %Do the alignment if there are more than one sequences, 0378 %otherwise just save the sequence (or an empty file) 0379 if numel(fastaStruct)>1 0380 tmpFile=tempname; 0381 fastawrite(tmpFile,fastaStruct); 0382 0383 %Do the alignment for this file 0384 [status output]=system([fullfile(ravenPath,'software','clustalw2','clustalw2') ' -infile="' tmpFile '" -align -outfile="' fullfile(dataDir,'aligned',[missingAligned{i} '.faw']) '" -output=FASTA -type=PROTEIN -OUTORDER=INPUT']); 0385 if status~=0 0386 dispEM(['Error when performing alignment of ' missingAligned{i} ':\n' output]); 0387 end 0388 %Remove the old tempfile 0389 if exist(tmpFile, 'file') 0390 delete([tmpFile '*']); 0391 end 0392 else 0393 %If there is only one sequence then it's not possible to do a 0394 %multiple alignment. Just print the sequence instead. An 0395 %empty file was written previously so that doesn't have to 0396 %be dealt with 0397 if numel(fastaStruct)==1 0398 %CLUSTALW only uses the first word as gene name. This 0399 %doesn't really have an effect, but might as well be 0400 %consistent 0401 I=regexp(fastaStruct.Header,' ','split'); 0402 fastaStruct.Header=strrep(I{1},':','_'); 0403 fastawrite(fullfile(dataDir,'aligned',[missingAligned{i} '.faw']),fastaStruct); 0404 end 0405 end 0406 %Move the temporary file to the real one 0407 movefile(fullfile(dataDir,'aligned',[missingAligned{i} '.faw']),fullfile(dataDir,'aligned',[missingAligned{i} '.fa']),'f'); 0408 end 0409 end 0410 end 0411 fprintf('Completed multiple alignment of sequences\n'); 0412 0413 %Check if training of Hidden Markov models should be performed 0414 missingHMMs=setdiff(KOModel.rxns,[hmmFiles;outFiles]); 0415 if ~isempty(missingHMMs) 0416 missingHMMs=missingHMMs(randperm(RandStream.create('mrg32k3a','Seed',cputime()),numel(missingHMMs))); 0417 0418 %Train models for all missing KOs 0419 for i=1:numel(missingHMMs) 0420 %This is checked here because it could be that it is created by a 0421 %parallell process 0422 if ~exist(fullfile(dataDir,'hmms',[missingHMMs{i} '.hmm']),'file') && ~exist(fullfile(dataDir,'hmms',[missingHMMs{i} '.hmw']),'file') 0423 %Check that the aligned FASTA file exists. It could be that it 0424 %is still being worked on by some other instance of the 0425 %program (the .faw file should then exist). This should not 0426 %happen on a single computer. I don't throw an error, because 0427 %it should finalize the ones it can. 0428 if ~exist(fullfile(dataDir,'aligned',[missingHMMs{i} '.fa']),'file') 0429 dispEM(['The aligned FASTA file for ' missingHMMs{i} ' does not exist'],false); 0430 continue; 0431 end 0432 0433 %If the multi-FASTA file is empty then save an empty aligned 0434 %file and continue 0435 s=dir(fullfile(dataDir,'aligned',[missingHMMs{i} '.fa'])); 0436 if s.bytes<=0 0437 fid=fopen(fullfile(dataDir,'hmms',[missingHMMs{i} '.hmm']),'w'); 0438 fclose(fid); 0439 continue; 0440 end 0441 %Create a temporary file to indicate that it is working on the 0442 %KO. This is because hmmbuild cannot overwrite existing files. 0443 fid=fopen(fullfile(dataDir,'hmms',[missingHMMs{i} '.hmw']),'w'); 0444 fclose(fid); 0445 0446 %Create HMM. This is saved with a "hm" ending to indicate that 0447 %it hasn't been calibrated yet. This is because I want to be 0448 %able to see which models are uncalibrated if something goes 0449 %wrong 0450 [status output]=system([fullfile(ravenPath,'software','hmmer-2.3','hmmbuild') ' "' fullfile(dataDir,'hmms',[missingHMMs{i} '.hm']) '" "' fullfile(dataDir,'aligned',[missingHMMs{i} '.fa']) '"']); 0451 if status~=0 0452 dispEM(['Error when training HMM for ' missingHMMs{i} ':\n' output]); 0453 end 0454 0455 %This is only available for linux 0456 if ~ispc 0457 [status output]=system([fullfile(ravenPath,'software','hmmer-2.3','hmmcalibrate') ' "' fullfile(dataDir,'hmms',[missingHMMs{i} '.hm']) '"']); 0458 if status~=0 0459 %This check is because some HMMs gives an error saying 0460 %that the number of iterations might be too low. 0461 %However, raising it doesn't have any effect. The error 0462 %is therefore ignored and the uncalibrated model is 0463 %used 0464 if isempty(strfind(output,'--num may be set too small?')) 0465 delete(fullfile(dataDir,'hmms',[missingHMMs{i} '.hm'])); 0466 dispEM(['Error when calibrating HMM for ' missingHMMs{i} ':\n' output]); 0467 else 0468 dispEM(['Cannot calibrate the HMM for ' missingHMMs{i} '. Using uncalibrated version'],false); 0469 end 0470 end 0471 end 0472 %Move the temporary file to the real one 0473 movefile(fullfile(dataDir,'hmms',[missingHMMs{i} '.hm']),fullfile(dataDir,'hmms',[missingHMMs{i} '.hmm']),'f'); 0474 %Delete the temporary file 0475 delete(fullfile(dataDir,'hmms',[missingHMMs{i} '.hmw'])); 0476 end 0477 end 0478 end 0479 fprintf('Completed generation of HMMs\n'); 0480 0481 %Check which new .out files that should be generated 0482 %Check if training of Hidden Markov models should be performed 0483 missingOUT=setdiff(KOModel.rxns,outFiles); 0484 if ~isempty(missingOUT) 0485 missingOUT=missingOUT(randperm(RandStream.create('mrg32k3a','Seed',cputime()),numel(missingOUT))); 0486 for i=1:numel(missingOUT) 0487 %This is checked here because it could be that it is created by a 0488 %parallell process 0489 if ~exist(fullfile(outDir,[missingOUT{i} '.out']),'file') 0490 %Check that the HMM file exists. It should do so since 0491 %we are saving empty files as well. Print a warning and 0492 %continue if not. 0493 if ~exist(fullfile(dataDir,'hmms',[missingOUT{i} '.hmm']),'file') 0494 dispEM(['The HMM file for ' missingOUT{i} ' does not exist'],false); 0495 continue; 0496 end 0497 0498 %Save an empty file to prevent several threads working on the 0499 %same file 0500 fid=fopen(fullfile(outDir,[missingOUT{i} '.out']),'w'); 0501 fclose(fid); 0502 0503 %If the HMM file is empty then save an out file and continue 0504 s=dir(fullfile(dataDir,'hmms',[missingOUT{i} '.hmm'])); 0505 if s.bytes<=0 0506 continue; 0507 end 0508 0509 %Check each gene in the input file against this model 0510 [status output]=system([fullfile(ravenPath,'software','hmmer-2.3','hmmsearch') ' "' fullfile(dataDir,'hmms',[missingOUT{i} '.hmm']) '" "' fastaFile '"']); 0511 if status~=0 0512 dispEM(['Error when querying HMM for ' missingOUT{i} ':\n' output]); 0513 end 0514 0515 %Save the output to a file 0516 fid=fopen(fullfile(outDir,[missingOUT{i} '.out']),'w'); 0517 fwrite(fid,output); 0518 fclose(fid); 0519 end 0520 end 0521 end 0522 fprintf('Completed matching to HMMs\n'); 0523 0524 %***Begin retrieving the output and putting together the resulting model 0525 0526 %Retrieve matched genes from the HMMs 0527 koGeneMat=zeros(numel(KOModel.rxns),3000); %Make room for 3000 genes 0528 genes=cell(3000,1); 0529 %Store the best score for a gene in a hash list (since I will be searching 0530 %many times) 0531 hTable = java.util.Hashtable; 0532 0533 geneCounter=0; 0534 for i=1:numel(KOModel.rxns) 0535 if exist(fullfile(outDir,[KOModel.rxns{i} '.out']), 'file') 0536 fid=fopen(fullfile(outDir,[KOModel.rxns{i} '.out']),'r'); 0537 beginMatches=false; 0538 while 1 0539 %Get the next line 0540 tline = fgetl(fid); 0541 0542 %Abort at end of file 0543 if ~ischar(tline) 0544 break; 0545 end 0546 0547 if beginMatches==false 0548 %This is how the listing of matches begins 0549 if any(strfind(tline,'E-value ')) 0550 %Read one more line that is only padding 0551 tline = fgetl(fid); 0552 beginMatches=true; 0553 end 0554 else 0555 %If matches should be read 0556 if ~strcmp(tline,' [no hits above thresholds]') && ~isempty(tline) 0557 elements=regexp(tline,' ','split'); 0558 elements=elements(cellfun(@any,elements)); 0559 0560 %Check if the match is below the treshhold 0561 score=str2num(elements{end-1}); 0562 gene=elements{1}; 0563 if score<=cutOff 0564 %If the score is exactly 0, change it to a very 0565 %small value to avoid NaN 0566 if score==0 0567 score=10^-250; 0568 end 0569 %Check if the gene is added already and, is so, get 0570 %the best score for it 0571 I=hTable.get(gene); 0572 if any(I) 0573 koGeneMat(i,I)=score; 0574 else 0575 geneCounter=geneCounter+1; 0576 %The gene was not present yet so add it 0577 hTable.put(gene,geneCounter); 0578 genes{geneCounter}=gene; 0579 koGeneMat(i,geneCounter)=score; 0580 end 0581 end 0582 else 0583 break; 0584 end 0585 end 0586 end 0587 fclose(fid); 0588 end 0589 end 0590 koGeneMat=koGeneMat(:,1:geneCounter); 0591 0592 %Remove the genes for each KO that are below minScoreRatioKO. 0593 for i=1:size(koGeneMat,1) 0594 J=find(koGeneMat(i,:)); 0595 if any(J) 0596 koGeneMat(i,J(log(koGeneMat(i,J))/log(min(koGeneMat(i,J)))<minScoreRatioKO))=0; 0597 end 0598 end 0599 0600 %Remove the KOs for each gene that are below minScoreRatioG 0601 for i=1:size(koGeneMat,2) 0602 J=find(koGeneMat(:,i)); 0603 if any(J) 0604 koGeneMat(J(log(koGeneMat(J,i))/log(min(koGeneMat(J,i)))<minScoreRatioG),i)=0; 0605 end 0606 end 0607 0608 %Create the new model 0609 model.genes=genes(1:geneCounter); 0610 model.grRules=cell(numel(model.rxns),1); 0611 model.grRules(:)={''}; 0612 model.rxnGeneMat=sparse(numel(model.rxns),numel(model.genes)); 0613 0614 %Loop through the reactions and add the corresponding genes 0615 for i=1:numel(model.rxns) 0616 if isstruct(model.rxnMiriams{i}) 0617 %Get all KOs 0618 I=find(strcmpi(model.rxnMiriams{i}.name,'urn:miriam:kegg.ko')); 0619 KOs=model.rxnMiriams{i}.value(I); 0620 %Find the KOs and the corresponding genes 0621 J=ismember(KOModel.rxns,KOs); 0622 [crap K]=find(koGeneMat(J,:)); 0623 0624 if any(K) 0625 model.rxnGeneMat(i,K)=1; 0626 %Also delete KOs for which no genes were found. If no genes at all 0627 %were matched to the reaction it will be deleted later 0628 L=sum(koGeneMat(J,:),2)==0; 0629 model.rxnMiriams{i}.value(I(L))=[]; 0630 model.rxnMiriams{i}.name(I(L))=[]; 0631 end 0632 end 0633 end 0634 0635 %Find and delete all reactions without any genes. This also removes genes 0636 %that are not used (which could happen because minScoreRatioG and 0637 %minScoreRatioKO) 0638 I=sum(model.rxnGeneMat,2)==0; 0639 model=removeRxns(model,I,true,true); 0640 0641 %Add the gene associations as 'or' 0642 for i=1:numel(model.rxns) 0643 %Find the involved genes 0644 I=find(model.rxnGeneMat(i,:)); 0645 model.grRules{i}=['(' model.genes{I(1)}]; 0646 for j=2:numel(I) 0647 model.grRules{i}=[model.grRules{i} ' or ' model.genes{I(j)}]; 0648 end 0649 model.grRules{i}=[model.grRules{i} ')']; 0650 end 0651 end 0652 0653 %Supporter function to list the files in a directory and return them as a 0654 %cell array 0655 function files=listFiles(directory) 0656 temp=dir(directory); 0657 files=cell(numel(temp),1); 0658 for i=1:numel(temp) 0659 files{i}=temp(i,1).name; 0660 end 0661 files=strrep(files,'.fa',''); 0662 files=strrep(files,'.hmm',''); 0663 files=strrep(files,'.out',''); 0664 files=strrep(files,'.faw',''); 0665 end