getKEGGModelForOrganism Reconstructs a genome-scale metabolic model based on protein homology to the orthologies in KEGG organismID three letter abbreviation of the organism (as used in KEGG). If not available, use a closely related species. This is used for determing the phylogenetic distance. fastaFile a FASTA file that contains the protein sequences of the organism for which to reconstruct a model dataDir directory for which to retrieve the input data. Should contain a combination of these sub-folders: -dataDir\keggdb The KEGG database files used in 1a (see below). -dataDir\fasta The multi-FASTA files generated in 1b or downloaded from the RAVEN Toolbox homepage (see below) -dataDir\aligned The aligned FASTA files as generated in 2a (see below) -dataDir\hmms The Hidden Markov Models as generated in 2b or downloaded from the RAVEN Toolbox homepage (see below) outDir directory to save the results from the quering of the Hidden Markov models. The output is specific for the input sequences and the settings used. It is stored in this manner so that the function can continue if interrupted or if it should run in parallell. Be careful not to leave output files from different organisms or runs with different settings in the same folder. They will not be overwritten (opt, default is a temporary dir where all *.out files are deleted before and after doing the reconstruction) keepUndefinedStoich include reactions in the form n A <=> n+1 A. These will be dealt with as two separate metabolites (opt, default true) keepIncomplete include reactions which have been labelled as "incomplete", "erroneous" or "unclear" (opt, default true) keepGeneral include reactions which have been labelled as "general reaction". These are reactions on the form "an aldehyde <=> an alcohol", and are therefore unsuited for modelling purposes. Note that not all reactions have this type of annotation, and the script will therefore not be able to remove all such reactions (opt, default false) cutOff significans score from HMMer needed to assign genes to a KO (opt, default 10^-50) minScoreRatioG a gene is only assigned to KOs for which the score is >=log(score)/log(best score) for that gene. This is to prevent that a gene which clearly belongs to one KO is assigned also to KOs with much lower scores (opt, default 0.8 (lower is less strict)) minScoreRatioKO ignore genes in a KO if their score is <log(score)/log(best score in KO). This is to "prune" KOs which have many genes and where some are clearly a better fit (opt, default 0.3 (lower is less strict)) maxPhylDist -1: only use sequences from the same domain (Prokaryota, Eukaryota) other (positive) value: only use sequences for organisms where the phylogenetic distance is at the most this large (as calculated in getPhylDist) (opt, default Inf, which means that all sequences will be used) nSequences for each KO, use up to this many sequences from the most closely related species. This is mainly to speed up the alignment process for KOs with very many genes (opt, default inf) model the reconstructed model PLEASE READ THIS: The input to this function can be confusing, because it is intended to be run in parallell on a cluster or in multiple sessions. It therefore saves a lot of intermediate results to disc. This also serves the purpose of not having to do reduntant calculations. This, however, comes with the disadvantage of somewhat trickier handling. This is what this function does: 1a. Downloads relevant parts from the KEGG FTP and constructs a general RAVEN model representing the metabolic network. This output is distributed with RAVEN (it's in the raven\kegg directory). Delete those files to rerun this parsing. 1b. Generates FASTA files from the downloaded KEGG files. One multi-FASTA file for each KO in KEGG is generated. These multi-fasta files can be downloaded from the RAVEN Toolbox homepage if you do not have access to the KEGG FTP or don't want to do this time-consuming parsing. Just be sure that you actually need them first (see below) These steps only have to be redone every time KEGG updates their database (or rather when the updates are large enough to warrant rerunning this part). Many user would probably never use this feature. 2a. Does alignment of the multi-FASTA files for future use. This uses the settings "useEvDist" and "nSequences" to control which sequences should be used for constructing Hidden Markov models (HMMs), and later for matching your sequences to. 2b. Trains Hidden Markov models using HMMer for each of the aligned FASTA files. The most common alternatives here would be to use sequences from only eukaryotes, only prokaryotes or all sequences in KEGG. The Hidden Markov models for those options can be downloaded from the RAVEN Toolbox homepage. This is normally the most convenient way, but if you would like to use, for example, only fungal sequences for training the HMMs then you need to run this part. 3a. Queries the HMMs with sequences for the organism you are making a model for. This step uses both the output from step 1a and from 2b. 3b. Constructs a model based on the fit to the HMMs and the chosen parameters. These steps are specific to the organism for which you are reconstructing the model. In principle the function looks at which output that is already available and runs runs only the parts that are required for step 3. This means that (see the definition of the parameters for details): -1a is only performed if there are no KEGG model files in the raven\kegg directory -1b is only performed if not all required HMMs OR aligned FASTA files OR multi-FASTA files exist in the defined dataDir. This means that this step is skipped if the FASTA files or HMMs are downloaded from the RAVEN Toolbox homepage instead. If not all files exist it will try to find the KEGG database files in dataDir. If it cannot find them it will try to download them via the KEGG FTP. -2a is only performed if not all required HMMs OR aligned FASTA files files exist in the defined dataDir. This means that this step is skipped if the HMMs are downloaded from the RAVEN Toolbox homepage instead. -2b is only performed if not all required HMMs exist in the defined dataDir. This means that this step is skipped if the FASTA files or HMMs are downloaded from the RAVEN Toolbox homepage instead. -3a is performed for the required HMMs for which no corresponding .out file exists in outDir. This is just a way to enable the function to be run in parallell or to resume if interrupeted. -3b is always performed. Usage: model=getKEGGModelForOrganism(organismID,fastaFile,dataDir,outDir,... keepUndefinedStoich,keepIncomplete,keepGeneral,cutOff,minScoreRatioG,... minScoreRatioKO,maxPhylDist,nSequences) Rasmus Agren, 2013-02-04
0001 function model=getKEGGModelForOrganism(organismID,fastaFile,dataDir,outDir,... 0002 keepUndefinedStoich,keepIncomplete,keepGeneral,cutOff,minScoreRatioG,... 0003 minScoreRatioKO,maxPhylDist,nSequences) 0004 % getKEGGModelForOrganism 0005 % Reconstructs a genome-scale metabolic model based on protein homology to the 0006 % orthologies in KEGG 0007 % 0008 % organismID three letter abbreviation of the organism (as used in 0009 % KEGG). If not available, use a closely related 0010 % species. This is used for determing the 0011 % phylogenetic distance. 0012 % fastaFile a FASTA file that contains the protein sequences of 0013 % the organism for which to reconstruct a model 0014 % dataDir directory for which to retrieve the input data. 0015 % Should contain a combination of these sub-folders: 0016 % -dataDir\keggdb 0017 % The KEGG database files used in 1a (see below). 0018 % -dataDir\fasta 0019 % The multi-FASTA files generated in 1b or 0020 % downloaded from the RAVEN Toolbox homepage (see 0021 % below) 0022 % -dataDir\aligned 0023 % The aligned FASTA files as generated in 2a (see 0024 % below) 0025 % -dataDir\hmms 0026 % The Hidden Markov Models as generated in 2b or 0027 % downloaded from the RAVEN Toolbox homepage (see 0028 % below) 0029 % outDir directory to save the results from the quering of 0030 % the Hidden Markov models. The output is specific 0031 % for the input sequences and the settings used. It 0032 % is stored in this manner so that the function can 0033 % continue if interrupted or if it should run in 0034 % parallell. Be careful not to leave output files 0035 % from different organisms or runs with different 0036 % settings in the same folder. They will not be 0037 % overwritten (opt, default is a temporary dir where 0038 % all *.out files are deleted before and after doing 0039 % the reconstruction) 0040 % keepUndefinedStoich include reactions in the form n A <=> n+1 A. These 0041 % will be dealt with as two separate metabolites 0042 % (opt, default true) 0043 % keepIncomplete include reactions which have been labelled as 0044 % "incomplete", "erroneous" or "unclear" (opt, 0045 % default true) 0046 % keepGeneral include reactions which have been labelled as 0047 % "general reaction". These are reactions on the form 0048 % "an aldehyde <=> an alcohol", and are therefore 0049 % unsuited for modelling purposes. Note that not all 0050 % reactions have this type of annotation, and the 0051 % script will therefore not be able to remove all 0052 % such reactions (opt, default false) 0053 % cutOff significans score from HMMer needed to assign genes 0054 % to a KO (opt, default 10^-50) 0055 % minScoreRatioG a gene is only assigned to KOs for which the score 0056 % is >=log(score)/log(best score) for that gene. 0057 % This is to prevent that a gene which clearly belongs 0058 % to one KO is assigned also to KOs with much lower 0059 % scores (opt, default 0.8 (lower is less strict)) 0060 % minScoreRatioKO ignore genes in a KO if their score is 0061 % <log(score)/log(best score in KO). This is to 0062 % "prune" KOs which have many genes and where some 0063 % are clearly a better fit 0064 % (opt, default 0.3 (lower is less strict)) 0065 % maxPhylDist -1: only use sequences from the same domain 0066 % (Prokaryota, Eukaryota) 0067 % other (positive) value: only use sequences for 0068 % organisms where the phylogenetic distance is at 0069 % the most this large (as calculated in getPhylDist) 0070 % (opt, default Inf, which means that all sequences will 0071 % be used) 0072 % nSequences for each KO, use up to this many sequences from 0073 % the most closely related species. This is mainly to 0074 % speed up the alignment process for KOs with very 0075 % many genes (opt, default inf) 0076 % 0077 % model the reconstructed model 0078 % 0079 % PLEASE READ THIS: The input to this function can be confusing, because 0080 % it is intended to be run in parallell on a cluster or in multiple sessions. 0081 % It therefore saves a lot of intermediate results to disc. This also 0082 % serves the purpose of not having to do reduntant calculations. This, 0083 % however, comes with the disadvantage of somewhat trickier handling. 0084 % This is what this function does: 0085 % 0086 % 1a. Downloads relevant parts from the KEGG FTP and constructs a general 0087 % RAVEN model representing the metabolic network. This output is 0088 % distributed with RAVEN (it's in the raven\kegg directory). Delete 0089 % those files to rerun this parsing. 0090 % 1b. Generates FASTA files from the downloaded KEGG files. One 0091 % multi-FASTA file for each KO in KEGG is generated. These 0092 % multi-fasta files can be downloaded from the RAVEN Toolbox 0093 % homepage if you do not have access to the KEGG FTP or don't want to 0094 % do this time-consuming parsing. Just be sure that you actually need 0095 % them first (see below) 0096 % 0097 % These steps only have to be redone every time KEGG updates their 0098 % database (or rather when the updates are large enough to warrant 0099 % rerunning this part). Many user would probably never use this feature. 0100 % 0101 % 2a. Does alignment of the multi-FASTA files for future use. This uses the 0102 % settings "useEvDist" and "nSequences" to control which sequences 0103 % should be used for constructing Hidden Markov models (HMMs), and 0104 % later for matching your sequences to. 0105 % 2b. Trains Hidden Markov models using HMMer for each of the aligned 0106 % FASTA files. 0107 % 0108 % The most common alternatives here would be to use sequences from only 0109 % eukaryotes, only prokaryotes or all sequences in KEGG. The Hidden Markov 0110 % models for those options can be downloaded from the RAVEN Toolbox 0111 % homepage. This is normally the most convenient way, but if you would 0112 % like to use, for example, only fungal sequences for training the HMMs 0113 % then you need to run this part. 0114 % 0115 % 3a. Queries the HMMs with sequences for the organism you are making a 0116 % model for. This step uses both the output from step 1a and from 2b. 0117 % 3b. Constructs a model based on the fit to the HMMs and the chosen 0118 % parameters. 0119 % 0120 % These steps are specific to the organism for which you are reconstructing 0121 % the model. 0122 % 0123 % In principle the function looks at which output that is already available 0124 % and runs runs only the parts that are required for step 3. This means 0125 % that (see the definition of the parameters for details): 0126 % -1a is only performed if there are no KEGG model files in the raven\kegg 0127 % directory 0128 % -1b is only performed if not all required HMMs OR aligned FASTA files 0129 % OR multi-FASTA files exist in the defined dataDir. This means that this 0130 % step is skipped if the FASTA files or HMMs are downloaded from the RAVEN 0131 % Toolbox homepage instead. If not all files exist it will try to find 0132 % the KEGG database files in dataDir. If it cannot find them it will try 0133 % to download them via the KEGG FTP. 0134 % -2a is only performed if not all required HMMs OR aligned FASTA files 0135 % files exist in the defined dataDir. This means that this step is skipped 0136 % if the HMMs are downloaded from the RAVEN Toolbox homepage instead. 0137 % -2b is only performed if not all required HMMs exist in the defined 0138 % dataDir. This means that this step is skipped if the FASTA files or 0139 % HMMs are downloaded from the RAVEN Toolbox homepage instead. 0140 % -3a is performed for the required HMMs for which no corresponding .out 0141 % file exists in outDir. This is just a way to enable the function to be 0142 % run in parallell or to resume if interrupeted. 0143 % -3b is always performed. 0144 % 0145 % Usage: model=getKEGGModelForOrganism(organismID,fastaFile,dataDir,outDir,... 0146 % keepUndefinedStoich,keepIncomplete,keepGeneral,cutOff,minScoreRatioG,... 0147 % minScoreRatioKO,maxPhylDist,nSequences) 0148 % 0149 % Rasmus Agren, 2013-02-04 0150 % 0151 0152 if nargin<4 0153 outDir=[]; 0154 end 0155 if isempty(outDir) 0156 outDir=tempdir; 0157 %Delete all *.out files if any exist 0158 delete(fullfile(outDir,'*.out')); 0159 end 0160 if nargin<5 0161 keepUndefinedStoich=true; 0162 end 0163 if nargin<6 0164 keepIncomplete=true; 0165 end 0166 if nargin<7 0167 keepGeneral=true; 0168 end 0169 if nargin<8 0170 cutOff=10^-50; 0171 end 0172 if nargin<9 0173 minScoreRatioG=0.8; 0174 end 0175 if nargin<10 0176 minScoreRatioKO=0.3; 0177 end 0178 if nargin<5 0179 maxPhylDist=inf; %Include all sequences for each reaction 0180 end 0181 if nargin<12 0182 nSequences=inf; %Include all sequences for each reaction 0183 end 0184 0185 %Check if the fasta-file contains '/' or'\'. If not then it's probably just 0186 %a file name. It is then merged with the current folder 0187 if ~any(strfind(fastaFile,'\')) && ~any(strfind(fastaFile,'/')) 0188 fastaFile=fullfile(pwd,fastaFile); 0189 end 0190 %Create the required sub-folders in dataDir if they dont exist 0191 if ~exist(fullfile(dataDir,'keggdb'),'dir') 0192 mkdir(dataDir,'keggdb'); 0193 end 0194 if ~exist(fullfile(dataDir,'fasta'),'dir') 0195 mkdir(dataDir,'fasta'); 0196 end 0197 if ~exist(fullfile(dataDir,'aligned'),'dir') 0198 mkdir(dataDir,'aligned'); 0199 end 0200 if ~exist(fullfile(dataDir,'hmms'),'dir') 0201 mkdir(dataDir,'hmms'); 0202 end 0203 if ~exist(outDir,'dir') 0204 mkdir(outDir); 0205 end 0206 0207 %First generate the full KEGG model 0208 [model KOModel]=getModelFromKEGG(fullfile(dataDir,'keggdb'),keepUndefinedStoich,keepIncomplete,keepGeneral); 0209 0210 fprintf('Completed generation of KEGG model\n'); 0211 model.id=organismID; 0212 model.c=zeros(numel(model.rxns),1); 0213 0214 %First remove all reactions without genes 0215 hasGenes=any(model.rxnGeneMat,2); 0216 0217 model=removeRxns(model,~hasGenes,true); 0218 0219 %Trim the genes so that they only contain information that can be matched 0220 %to the KEGG file of protein sequences (remove all information after first 0221 %parenthesis) 0222 0223 %NOTE: For some reason the organism abbreviation should be in lower case in 0224 %this database. Fix this here 0225 for i=1:numel(KOModel.genes) 0226 parIndex=strfind(KOModel.genes{i},'('); 0227 if any(parIndex) 0228 KOModel.genes{i}=KOModel.genes{i}(1:parIndex-1); 0229 end 0230 colIndex=strfind(KOModel.genes{i},':'); 0231 KOModel.genes{i}=[lower(KOModel.genes{i}(1:colIndex-1)) KOModel.genes{i}(colIndex:end)]; 0232 end 0233 0234 %Create a phylogenetic distance structure 0235 phylDistStruct=getPhylDist(fullfile(dataDir,'keggdb'),maxPhylDist==-1); 0236 [crap phylDistId]=ismember(model.id,phylDistStruct.ids); 0237 fprintf('Completed creation of phylogenetic distance matrix\n'); 0238 0239 %Calculate the real maximal distance now. An abitary large number of 1000 is 0240 %used for the "all in kingdom" or "all sequences" options. This is a bit 0241 %inconvenient way to do it, but it's to make it fit with some older code 0242 if isinf(maxPhylDist) || maxPhylDist==-1 0243 maxPhylDist=1000; 0244 end 0245 0246 %Get the KO ids for which files have been generated. Maybe not the neatest 0247 %way.. 0248 fastaFiles=listFiles(fullfile(dataDir,'fasta','*.fa')); 0249 alignedFiles=listFiles(fullfile(dataDir,'aligned','*.fa')); 0250 alignedWorking=listFiles(fullfile(dataDir,'aligned','*.faw')); 0251 hmmFiles=listFiles(fullfile(dataDir,'hmms','*.hmm')); 0252 outFiles=listFiles(fullfile(outDir,'*.out')); 0253 0254 %Check if multi-FASTA files should be generated. This should only be 0255 %performed if there are IDs in the KOModel structure that haven't been 0256 %parsed yet 0257 missingFASTA=setdiff(KOModel.rxns,[fastaFiles;alignedFiles;hmmFiles;outFiles]); 0258 0259 if ~isempty(missingFASTA) 0260 if ~exist(fullfile(dataDir,'keggdb','genes.pep'),'file') 0261 %If no sequence file exists then download from KEGG 0262 downloadKEGG(fullfile(dataDir,'keggdb')); 0263 end 0264 %Only construct models for KOs which don't have files already 0265 fastaModel=removeRxns(KOModel,setdiff(KOModel.rxns,missingFASTA),true,true); 0266 %Permute the order of the KOs in the model so that constructMultiFasta 0267 %can be run on several processors at once 0268 fastaModel=permuteModel(fastaModel,randperm(RandStream.create('mrg32k3a','Seed',cputime()),numel(fastaModel.rxns)),'rxns'); 0269 constructMultiFasta(fastaModel,fullfile(dataDir,'keggdb','genes.pep'),fullfile(dataDir,'fasta')); 0270 end 0271 fprintf('Completed generation of multi-FASTA files\n'); 0272 0273 %Get the directory for RAVEN Toolbox. This is to get the path to the third 0274 %party software used 0275 [ST I]=dbstack('-completenames'); 0276 ravenPath=fileparts(ST(I).file); 0277 0278 %Check if alignment of FASTA files should be performed 0279 missingAligned=setdiff(KOModel.rxns,[alignedFiles;hmmFiles;alignedWorking;outFiles]); 0280 if ~isempty(missingAligned) 0281 missingAligned=missingAligned(randperm(RandStream.create('mrg32k3a','Seed',cputime()),numel(missingAligned))); 0282 %Align all sequences using CLUSTALW 0283 for i=1:numel(missingAligned) 0284 %This is checked here because it could be that it is created by a 0285 %parallell process. The faw-files are saved as temporary files to 0286 %keppt track of which files are being worked on 0287 if ~exist(fullfile(dataDir,'aligned',[missingAligned{i} '.faw']),'file') && ~exist(fullfile(dataDir,'aligned',[missingAligned{i} '.fa']),'file') 0288 %Check that the multi-FASTA file exists. It should do so since 0289 %we are saving empty files as well. Print a warning and 0290 %continue if not. 0291 if ~exist(fullfile(dataDir,'fasta',[missingAligned{i} '.fa']),'file') 0292 fprintf(['WARNING: The multi-FASTA file for ' missingAligned{i} ' does not exist\n']); 0293 continue; 0294 end 0295 0296 %If the multi-FASTA file is empty then save an empty aligned 0297 %file and continue 0298 s=dir(fullfile(dataDir,'fasta',[missingAligned{i} '.fa'])); 0299 if s.bytes<=0 0300 fid=fopen(fullfile(dataDir,'aligned',[missingAligned{i} '.fa']),'w'); 0301 fclose(fid); 0302 continue; 0303 end 0304 0305 %Create an empty file to prevent other threads to start to work on the same alignment 0306 fid=fopen(fullfile(dataDir,'aligned',[missingAligned{i} '.faw']),'w'); 0307 fclose(fid); 0308 0309 %First load the FASTA file, then select up to nSequences sequences 0310 %of the most closely related species, apply any constraints from 0311 %maxPhylDist, and save it as a temporary file, and create the 0312 %model from that 0313 0314 fastaStruct=fastaread(fullfile(dataDir,'fasta',[missingAligned{i} '.fa'])); 0315 phylDist=inf(numel(fastaStruct),1); 0316 for j=1:numel(fastaStruct) 0317 %Get the organism abbreviation 0318 index=strfind(fastaStruct(j).Header,':'); 0319 if any(index) 0320 abbrev=fastaStruct(j).Header(1:index(1)-1); 0321 [crap index]=ismember(abbrev,phylDistStruct.ids); 0322 if any(index) 0323 phylDist(j)=phylDistStruct.distMat(index(1),phylDistId); 0324 end 0325 end 0326 end 0327 0328 %Inf means that it should not be included 0329 phylDist(phylDist>maxPhylDist)=[]; 0330 0331 %Sort based on phylDist 0332 [crap order]=sort(phylDist); 0333 0334 %Save the first nSequences hits to a temporary FASTA file 0335 if nSequences<=numel(fastaStruct) 0336 fastaStruct=fastaStruct(order(1:nSequences)); 0337 else 0338 fastaStruct=fastaStruct(order); 0339 end 0340 0341 %Do the alignment if there are more than one sequences, 0342 %otherwise just save the sequence (or an empty file) 0343 if numel(fastaStruct)>1 0344 tmpFile=tempname; 0345 fastawrite(tmpFile,fastaStruct); 0346 0347 %Do the alignment for this file 0348 [status output]=system([fullfile(ravenPath,'software','clustalw2','clustalw2') ' -infile="' tmpFile '" -align -outfile="' fullfile(dataDir,'aligned',[missingAligned{i} '.faw']) '" -output=FASTA -type=PROTEIN -OUTORDER=INPUT']); 0349 if status~=0 0350 throw(MException('',['Error when performing alignment of ' missingAligned{i} ':\n' output])); 0351 end 0352 %Remove the old tempfile 0353 if exist(tmpFile, 'file') 0354 delete([tmpFile '*']); 0355 end 0356 else 0357 %If there is only one sequence then it's not possible to do a 0358 %multiple alignment. Just print the sequence instead. An 0359 %empty file was written previously so that doesn't have to 0360 %be dealt with 0361 if numel(fastaStruct)==1 0362 %CLUSTALW only uses the first word as gene name. This 0363 %doesn't really have an effect, but might as well be 0364 %consistent 0365 I=regexp(fastaStruct.Header,' ','split'); 0366 fastaStruct.Header=strrep(I{1},':','_'); 0367 fastawrite(fullfile(dataDir,'aligned',[missingAligned{i} '.faw']),fastaStruct); 0368 end 0369 end 0370 %Move the temporary file to the real one 0371 movefile(fullfile(dataDir,'aligned',[missingAligned{i} '.faw']),fullfile(dataDir,'aligned',[missingAligned{i} '.fa']),'f'); 0372 end 0373 end 0374 end 0375 fprintf('Completed multiple alignment of sequences\n'); 0376 0377 %Check if training of Hidden Markov models should be performed 0378 missingHMMs=setdiff(KOModel.rxns,[hmmFiles;outFiles]); 0379 if ~isempty(missingHMMs) 0380 missingHMMs=missingHMMs(randperm(RandStream.create('mrg32k3a','Seed',cputime()),numel(missingHMMs))); 0381 0382 %Train models for all missing KOs 0383 for i=1:numel(missingHMMs) 0384 %This is checked here because it could be that it is created by a 0385 %parallell process 0386 if ~exist(fullfile(dataDir,'hmms',[missingHMMs{i} '.hmm']),'file') && ~exist(fullfile(dataDir,'hmms',[missingHMMs{i} '.hmw']),'file') 0387 %Check that the aligned FASTA file exists. It could be that it 0388 %is still being worked on by some other instance of the 0389 %program (the .faw file should then exist). This should not 0390 %happen on a single computer. I don't throw an error, because 0391 %it should finalize the ones it can. 0392 if ~exist(fullfile(dataDir,'aligned',[missingHMMs{i} '.fa']),'file') 0393 fprintf(['WARNING: The aligned FASTA file for ' missingHMMs{i} ' does not exist\n']); 0394 continue; 0395 end 0396 0397 %If the multi-FASTA file is empty then save an empty aligned 0398 %file and continue 0399 s=dir(fullfile(dataDir,'aligned',[missingHMMs{i} '.fa'])); 0400 if s.bytes<=0 0401 fid=fopen(fullfile(dataDir,'hmms',[missingHMMs{i} '.hmm']),'w'); 0402 fclose(fid); 0403 continue; 0404 end 0405 %Create a temporary file to indicate that it is working on the 0406 %KO. This is because hmmbuild cannot overwrite existing files. 0407 fid=fopen(fullfile(dataDir,'hmms',[missingHMMs{i} '.hmw']),'w'); 0408 fclose(fid); 0409 0410 %Create HMM. This is saved with a "hm" ending to indicate that 0411 %it hasn't been calibrated yet. This is because I want to be 0412 %able to see which models are uncalibrated if something goes 0413 %wrong 0414 [status output]=system([fullfile(ravenPath,'software','hmmer-2.3','hmmbuild') ' "' fullfile(dataDir,'hmms',[missingHMMs{i} '.hm']) '" "' fullfile(dataDir,'aligned',[missingHMMs{i} '.fa']) '"']); 0415 if status~=0 0416 throw(MException('',['Error when training HMM for ' missingHMMs{i} ':\n' output])); 0417 end 0418 0419 %This is only available for linux 0420 if ~ispc 0421 [status output]=system([fullfile(ravenPath,'software','hmmer-2.3','hmmcalibrate') ' "' fullfile(dataDir,'hmms',[missingHMMs{i} '.hm']) '"']); 0422 if status~=0 0423 %This check is because some HMMs gives an error saying 0424 %that the number of iterations might be too low. 0425 %However, raising it doesn't have any effect. The error 0426 %is therefore ignored and the uncalibrated model is 0427 %used 0428 if isempty(strfind(output,'--num may be set too small?')) 0429 delete(fullfile(dataDir,'hmms',[missingHMMs{i} '.hm'])); 0430 throw(MException('',['Error when calibrating HMM for ' missingHMMs{i} ':\n' output])); 0431 else 0432 fprintf(['WARNING: Cannot calibrate the HMM for ' missingHMMs{i} '. Using uncalibrated version\n']); 0433 end 0434 end 0435 end 0436 %Move the temporary file to the real one 0437 movefile(fullfile(dataDir,'hmms',[missingHMMs{i} '.hm']),fullfile(dataDir,'hmms',[missingHMMs{i} '.hmm']),'f'); 0438 %Delete the temporary file 0439 delete(fullfile(dataDir,'hmms',[missingHMMs{i} '.hmw'])); 0440 end 0441 end 0442 end 0443 fprintf('Completed generation of HMMs\n'); 0444 0445 %Check which new .out files that should be generated 0446 %Check if training of Hidden Markov models should be performed 0447 missingOUT=setdiff(KOModel.rxns,outFiles); 0448 if ~isempty(missingOUT) 0449 missingOUT=missingOUT(randperm(RandStream.create('mrg32k3a','Seed',cputime()),numel(missingOUT))); 0450 for i=1:numel(missingOUT) 0451 %This is checked here because it could be that it is created by a 0452 %parallell process 0453 if ~exist(fullfile(outDir,[missingOUT{i} '.out']),'file') 0454 %Check that the HMM file exists. It should do so since 0455 %we are saving empty files as well. Print a warning and 0456 %continue if not. 0457 if ~exist(fullfile(dataDir,'hmms',[missingOUT{i} '.hmm']),'file') 0458 fprintf(['WARNING: The HMM file for ' missingOUT{i} ' does not exist\n']); 0459 continue; 0460 end 0461 0462 %Save an empty file to prevent several threads working on the 0463 %same file 0464 fid=fopen(fullfile(outDir,[missingOUT{i} '.out']),'w'); 0465 fclose(fid); 0466 0467 %If the HMM file is empty then save an out file and continue 0468 s=dir(fullfile(dataDir,'hmms',[missingOUT{i} '.hmm'])); 0469 if s.bytes<=0 0470 continue; 0471 end 0472 0473 %Check each gene in the input file against this model 0474 [status output]=system([fullfile(ravenPath,'software','hmmer-2.3','hmmsearch') ' "' fullfile(dataDir,'hmms',[missingOUT{i} '.hmm']) '" "' fastaFile '"']); 0475 if status~=0 0476 throw(MException('',['Error when querying HMM for ' missingOUT{i} ':\n' output])); 0477 end 0478 0479 %Save the output to a file 0480 fid=fopen(fullfile(outDir,[missingOUT{i} '.out']),'w'); 0481 fwrite(fid,output); 0482 fclose(fid); 0483 end 0484 end 0485 end 0486 fprintf('Completed matching to HMMs\n'); 0487 0488 %***Begin retrieving the output and putting together the resulting model 0489 0490 %Retrieve matched genes from the HMMs 0491 koGeneMat=zeros(numel(KOModel.rxns),3000); %Make room for 3000 genes 0492 genes=cell(3000,1); 0493 %Store the best score for a gene in a hash list (since I will be searching 0494 %many times) 0495 hTable = java.util.Hashtable; 0496 0497 geneCounter=0; 0498 for i=1:numel(KOModel.rxns) 0499 if exist(fullfile(outDir,[KOModel.rxns{i} '.out']), 'file') 0500 fid=fopen(fullfile(outDir,[KOModel.rxns{i} '.out']),'r'); 0501 beginMatches=false; 0502 while 1 0503 %Get the next line 0504 tline = fgetl(fid); 0505 0506 %Abort at end of file 0507 if ~ischar(tline) 0508 break; 0509 end 0510 0511 if beginMatches==false 0512 %This is how the listing of matches begins 0513 if any(strfind(tline,'E-value ')) 0514 %Read one more line that is only padding 0515 tline = fgetl(fid); 0516 beginMatches=true; 0517 end 0518 else 0519 %If matches should be read 0520 if ~strcmp(tline,' [no hits above thresholds]') && ~isempty(tline) 0521 elements=regexp(tline,' ','split'); 0522 elements=elements(cellfun(@any,elements)); 0523 0524 %Check if the match is below the treshhold 0525 score=str2num(elements{end-1}); 0526 gene=elements{1}; 0527 if score<=cutOff 0528 %If the score is exactly 0, change it to a very 0529 %small value to avoid NaN 0530 if score==0 0531 score=10^-250; 0532 end 0533 %Check if the gene is added already and, is so, get 0534 %the best score for it 0535 I=hTable.get(gene); 0536 if any(I) 0537 koGeneMat(i,I)=score; 0538 else 0539 geneCounter=geneCounter+1; 0540 %The gene was not present yet so add it 0541 hTable.put(gene,geneCounter); 0542 genes{geneCounter}=gene; 0543 koGeneMat(i,geneCounter)=score; 0544 end 0545 end 0546 else 0547 break; 0548 end 0549 end 0550 end 0551 fclose(fid); 0552 end 0553 end 0554 koGeneMat=koGeneMat(:,1:geneCounter); 0555 0556 %Remove the genes for each KO that are below minScoreRatioKO. 0557 for i=1:size(koGeneMat,1) 0558 J=find(koGeneMat(i,:)); 0559 if any(J) 0560 koGeneMat(i,J(log(koGeneMat(i,J))/log(min(koGeneMat(i,J)))<minScoreRatioKO))=0; 0561 end 0562 end 0563 0564 %Remove the KOs for each gene that are below minScoreRatioG 0565 for i=1:size(koGeneMat,2) 0566 J=find(koGeneMat(:,i)); 0567 if any(J) 0568 koGeneMat(J(log(koGeneMat(J,i))/log(min(koGeneMat(J,i)))<minScoreRatioG),i)=0; 0569 end 0570 end 0571 0572 %Create the new model 0573 model.genes=genes(1:geneCounter); 0574 model.grRules=cell(numel(model.rxns),1); 0575 model.grRules(:)={''}; 0576 model.rxnGeneMat=sparse(numel(model.rxns),numel(model.genes)); 0577 0578 %Loop through the reactions and add the corresponding genes 0579 for i=1:numel(model.rxns) 0580 if isstruct(model.rxnMiriams{i}) 0581 %Get all KOs 0582 I=find(strcmpi(model.rxnMiriams{i}.name,'urn:miriam:kegg.ko')); 0583 KOs=model.rxnMiriams{i}.value(I); 0584 %Find the KOs and the corresponding genes 0585 J=ismember(KOModel.rxns,KOs); 0586 [crap K]=find(koGeneMat(J,:)); 0587 0588 if any(K) 0589 model.rxnGeneMat(i,K)=1; 0590 %Also delete KOs for which no genes were found. If no genes at all 0591 %were matched to the reaction it will be deleted later 0592 L=sum(koGeneMat(J,:),2)==0; 0593 model.rxnMiriams{i}.value(I(L))=[]; 0594 model.rxnMiriams{i}.name(I(L))=[]; 0595 end 0596 end 0597 end 0598 0599 %Find and delete all reactions without any genes. This also removes genes 0600 %that are not used (which could happen because minScoreRatioG and 0601 %minScoreRatioKO) 0602 I=sum(model.rxnGeneMat,2)==0; 0603 model=removeRxns(model,I,true,true); 0604 0605 %Add the gene associations as 'or' 0606 for i=1:numel(model.rxns) 0607 %Find the involved genes 0608 I=find(model.rxnGeneMat(i,:)); 0609 model.grRules{i}=['(' model.genes{I(1)}]; 0610 for j=2:numel(I) 0611 model.grRules{i}=[model.grRules{i} ' or ' model.genes{I(j)}]; 0612 end 0613 model.grRules{i}=[model.grRules{i} ')']; 0614 end 0615 end 0616 0617 %Supporter function to list the files in a directory and return them as a 0618 %cell array 0619 function files=listFiles(directory) 0620 temp=dir(directory); 0621 files=cell(numel(temp),1); 0622 for i=1:numel(temp) 0623 files{i}=temp(i,1).name; 0624 end 0625 files=strrep(files,'.fa',''); 0626 files=strrep(files,'.hmm',''); 0627 files=strrep(files,'.out',''); 0628 files=strrep(files,'.faw',''); 0629 end