Home > RAVEN > getKEGGModelForOrganism.m

getKEGGModelForOrganism

PURPOSE ^

getKEGGModelForOrganism

SYNOPSIS ^

function model=getKEGGModelForOrganism(organismID,fastaFile,dataDir,outDir,keepUndefinedStoich,keepIncomplete,keepGeneral,cutOff,minScoreRatioG,minScoreRatioKO,maxPhylDist,nSequences)

DESCRIPTION ^

 getKEGGModelForOrganism
   Reconstructs a genome-scale metabolic model based on protein homology to the
   orthologies in KEGG

   organismID          three letter abbreviation of the organism (as used in
                       KEGG). If not available, use a closely related
                       species. This is used for determing the
                       phylogenetic distance.
   fastaFile           a FASTA file that contains the protein sequences of
                       the organism for which to reconstruct a model (opt,
                       if no FASTA file is supplied then a model is
                       reconstructed based only on the organism
                       abbreviation. This option ignores all settings except for
                       keepUndefinedStoich, keepIncomplete and keepGeneral)
   dataDir             directory for which to retrieve the input data.
                       Should contain a combination of these sub-folders:
                       -dataDir\keggdb
                           The KEGG database files used in 1a (see below).
                       -dataDir\fasta
                           The multi-FASTA files generated in 1b or
                           downloaded from the RAVEN Toolbox homepage (see
                           below)
                       -dataDir\aligned
                           The aligned FASTA files as generated in 2a (see
                           below)
                       -dataDir\hmms
                           The Hidden Markov Models as generated in 2b or
                           downloaded from the RAVEN Toolbox homepage (see
                           below)
                       (opt, see note about fastaFile. Note that in order to
                       rebuild the KEGG model from a database dump, as opposed to
                       using the version supplied with RAVEN, you would still need
                       to supply this)
   outDir              directory to save the results from the quering of
                       the Hidden Markov models. The output is specific
                       for the input sequences and the settings used. It
                       is stored in this manner so that the function can
                       continue if interrupted or if it should run in
                       parallell. Be careful not to leave output files
                       from different organisms or runs with different
                       settings in the same folder. They will not be
                       overwritten (opt, default is a temporary dir where
                       all *.out files are deleted before and after doing
                       the reconstruction)
   keepUndefinedStoich include reactions in the form n A <=> n+1 A. These
                       will be dealt with as two separate metabolites 
                       (opt, default true)
   keepIncomplete      include reactions which have been labelled as
                       "incomplete", "erroneous" or "unclear" (opt,
                       default true)
   keepGeneral         include reactions which have been labelled as
                       "general reaction". These are reactions on the form
                       "an aldehyde <=> an alcohol", and are therefore
                       unsuited for modelling purposes. Note that not all
                       reactions have this type of annotation, and the
                       script will therefore not be able to remove all
                       such reactions (opt, default false)
   cutOff              significans score from HMMer needed to assign genes
                       to a KO (opt, default 10^-50)
   minScoreRatioG      a gene is only assigned to KOs for which the score
                       is >=log(score)/log(best score) for that gene. 
                       This is to prevent that a gene which clearly belongs 
                       to one KO is assigned also to KOs with much lower 
                       scores (opt, default 0.8 (lower is less strict))
   minScoreRatioKO     ignore genes in a KO if their score is
                       <log(score)/log(best score in KO). This is to
                       "prune" KOs which have many genes and where some
                       are clearly a better fit
                       (opt, default 0.3 (lower is less strict))
   maxPhylDist         -1: only use sequences from the same domain
                       (Prokaryota, Eukaryota)
                       other (positive) value: only use sequences for
                       organisms where the phylogenetic distance is at
                       the most this large (as calculated in getPhylDist)
                       (opt, default Inf, which means that all sequences will
                       be used)
   nSequences          for each KO, use up to this many sequences from 
                       the most closely related species. This is mainly to
                       speed up the alignment process for KOs with very
                       many genes (opt, default inf)

   model               the reconstructed model

   PLEASE READ THIS: The input to this function can be confusing, because
   it is intended to be run in parallell on a cluster or in multiple sessions.
   It therefore saves a lot of intermediate results to disc. This also
   serves the purpose of not having to do reduntant calculations. This,
   however, comes with the disadvantage of somewhat trickier handling.
   This is what this function does:

   1a. Downloads relevant parts from the KEGG FTP and constructs a general
       RAVEN model representing the metabolic network. This output is 
       distributed with RAVEN (it's in the raven\kegg directory). Delete
       those files to rerun this parsing.
   1b. Generates FASTA files from the downloaded KEGG files. One
       multi-FASTA file for each KO in KEGG is generated. These
       multi-fasta files can be downloaded from the RAVEN Toolbox
       homepage if you do not have access to the KEGG FTP or don't want to
       do this time-consuming parsing. Just be sure that you actually need
       them first (see below)

   These steps only have to be redone every time KEGG updates their
   database (or rather when the updates are large enough to warrant
   rerunning this part). Many user would probably never use this feature.

   2a. Does alignment of the multi-FASTA files for future use. This uses the
       settings "useEvDist" and "nSequences" to control which sequences
       should be used for constructing Hidden Markov models (HMMs), and
       later for matching your sequences to. 
   2b. Trains Hidden Markov models using HMMer for each of the aligned
       FASTA files.

    The most common alternatives here would be to use sequences from only
   eukaryotes, only prokaryotes or all sequences in KEGG. The Hidden Markov
   models for those options can be downloaded from the RAVEN Toolbox
   homepage. This is normally the most convenient way, but if you would
   like to use, for example, only fungal sequences for training the HMMs
   then you need to run this part.

   3a. Queries the HMMs with sequences for the organism you are making a
       model for. This step uses both the output from step 1a and from 2b.
   3b. Constructs a model based on the fit to the HMMs and the chosen
       parameters.

   These steps are specific to the organism for which you are reconstructing
   the model.

   In principle the function looks at which output that is already available
   and runs runs only the parts that are required for step 3. This means
   that (see the definition of the parameters for details):
   -1a is only performed if there are no KEGG model files in the raven\kegg
   directory
   -1b is only performed if not all required HMMs OR aligned FASTA files
   OR multi-FASTA files exist in the defined dataDir. This means that this
   step is skipped if the FASTA files or HMMs are downloaded from the RAVEN
   Toolbox homepage instead. If not all files exist it will try to find
   the KEGG database files in dataDir. If it cannot find them it will try
   to download them via the KEGG FTP.
   -2a is only performed if not all required HMMs OR aligned FASTA files
   files exist in the defined dataDir. This means that this step is skipped
   if the HMMs are downloaded from the RAVEN Toolbox homepage instead.
   -2b is only performed if not all required HMMs exist in the defined
   dataDir. This means that this step is skipped if the FASTA files or
   HMMs are downloaded from the RAVEN Toolbox homepage instead.
   -3a is performed for the required HMMs for which no corresponding .out
   file exists in outDir. This is just a way to enable the function to be
   run in parallell or to resume if interrupeted.
   -3b is always performed.

   Usage: model=getKEGGModelForOrganism(organismID,fastaFile,dataDir,outDir,...
    keepUndefinedStoich,keepIncomplete,keepGeneral,cutOff,minScoreRatioG,...
    minScoreRatioKO,maxPhylDist,nSequences)

   Rasmus Agren, 2013-11-22

CROSS-REFERENCE INFORMATION ^

This function calls: This function is called by:

SUBFUNCTIONS ^

SOURCE CODE ^

0001 function model=getKEGGModelForOrganism(organismID,fastaFile,dataDir,outDir,...
0002     keepUndefinedStoich,keepIncomplete,keepGeneral,cutOff,minScoreRatioG,...
0003     minScoreRatioKO,maxPhylDist,nSequences)
0004 % getKEGGModelForOrganism
0005 %   Reconstructs a genome-scale metabolic model based on protein homology to the
0006 %   orthologies in KEGG
0007 %
0008 %   organismID          three letter abbreviation of the organism (as used in
0009 %                       KEGG). If not available, use a closely related
0010 %                       species. This is used for determing the
0011 %                       phylogenetic distance.
0012 %   fastaFile           a FASTA file that contains the protein sequences of
0013 %                       the organism for which to reconstruct a model (opt,
0014 %                       if no FASTA file is supplied then a model is
0015 %                       reconstructed based only on the organism
0016 %                       abbreviation. This option ignores all settings except for
0017 %                       keepUndefinedStoich, keepIncomplete and keepGeneral)
0018 %   dataDir             directory for which to retrieve the input data.
0019 %                       Should contain a combination of these sub-folders:
0020 %                       -dataDir\keggdb
0021 %                           The KEGG database files used in 1a (see below).
0022 %                       -dataDir\fasta
0023 %                           The multi-FASTA files generated in 1b or
0024 %                           downloaded from the RAVEN Toolbox homepage (see
0025 %                           below)
0026 %                       -dataDir\aligned
0027 %                           The aligned FASTA files as generated in 2a (see
0028 %                           below)
0029 %                       -dataDir\hmms
0030 %                           The Hidden Markov Models as generated in 2b or
0031 %                           downloaded from the RAVEN Toolbox homepage (see
0032 %                           below)
0033 %                       (opt, see note about fastaFile. Note that in order to
0034 %                       rebuild the KEGG model from a database dump, as opposed to
0035 %                       using the version supplied with RAVEN, you would still need
0036 %                       to supply this)
0037 %   outDir              directory to save the results from the quering of
0038 %                       the Hidden Markov models. The output is specific
0039 %                       for the input sequences and the settings used. It
0040 %                       is stored in this manner so that the function can
0041 %                       continue if interrupted or if it should run in
0042 %                       parallell. Be careful not to leave output files
0043 %                       from different organisms or runs with different
0044 %                       settings in the same folder. They will not be
0045 %                       overwritten (opt, default is a temporary dir where
0046 %                       all *.out files are deleted before and after doing
0047 %                       the reconstruction)
0048 %   keepUndefinedStoich include reactions in the form n A <=> n+1 A. These
0049 %                       will be dealt with as two separate metabolites
0050 %                       (opt, default true)
0051 %   keepIncomplete      include reactions which have been labelled as
0052 %                       "incomplete", "erroneous" or "unclear" (opt,
0053 %                       default true)
0054 %   keepGeneral         include reactions which have been labelled as
0055 %                       "general reaction". These are reactions on the form
0056 %                       "an aldehyde <=> an alcohol", and are therefore
0057 %                       unsuited for modelling purposes. Note that not all
0058 %                       reactions have this type of annotation, and the
0059 %                       script will therefore not be able to remove all
0060 %                       such reactions (opt, default false)
0061 %   cutOff              significans score from HMMer needed to assign genes
0062 %                       to a KO (opt, default 10^-50)
0063 %   minScoreRatioG      a gene is only assigned to KOs for which the score
0064 %                       is >=log(score)/log(best score) for that gene.
0065 %                       This is to prevent that a gene which clearly belongs
0066 %                       to one KO is assigned also to KOs with much lower
0067 %                       scores (opt, default 0.8 (lower is less strict))
0068 %   minScoreRatioKO     ignore genes in a KO if their score is
0069 %                       <log(score)/log(best score in KO). This is to
0070 %                       "prune" KOs which have many genes and where some
0071 %                       are clearly a better fit
0072 %                       (opt, default 0.3 (lower is less strict))
0073 %   maxPhylDist         -1: only use sequences from the same domain
0074 %                       (Prokaryota, Eukaryota)
0075 %                       other (positive) value: only use sequences for
0076 %                       organisms where the phylogenetic distance is at
0077 %                       the most this large (as calculated in getPhylDist)
0078 %                       (opt, default Inf, which means that all sequences will
0079 %                       be used)
0080 %   nSequences          for each KO, use up to this many sequences from
0081 %                       the most closely related species. This is mainly to
0082 %                       speed up the alignment process for KOs with very
0083 %                       many genes (opt, default inf)
0084 %
0085 %   model               the reconstructed model
0086 %
0087 %   PLEASE READ THIS: The input to this function can be confusing, because
0088 %   it is intended to be run in parallell on a cluster or in multiple sessions.
0089 %   It therefore saves a lot of intermediate results to disc. This also
0090 %   serves the purpose of not having to do reduntant calculations. This,
0091 %   however, comes with the disadvantage of somewhat trickier handling.
0092 %   This is what this function does:
0093 %
0094 %   1a. Downloads relevant parts from the KEGG FTP and constructs a general
0095 %       RAVEN model representing the metabolic network. This output is
0096 %       distributed with RAVEN (it's in the raven\kegg directory). Delete
0097 %       those files to rerun this parsing.
0098 %   1b. Generates FASTA files from the downloaded KEGG files. One
0099 %       multi-FASTA file for each KO in KEGG is generated. These
0100 %       multi-fasta files can be downloaded from the RAVEN Toolbox
0101 %       homepage if you do not have access to the KEGG FTP or don't want to
0102 %       do this time-consuming parsing. Just be sure that you actually need
0103 %       them first (see below)
0104 %
0105 %   These steps only have to be redone every time KEGG updates their
0106 %   database (or rather when the updates are large enough to warrant
0107 %   rerunning this part). Many user would probably never use this feature.
0108 %
0109 %   2a. Does alignment of the multi-FASTA files for future use. This uses the
0110 %       settings "useEvDist" and "nSequences" to control which sequences
0111 %       should be used for constructing Hidden Markov models (HMMs), and
0112 %       later for matching your sequences to.
0113 %   2b. Trains Hidden Markov models using HMMer for each of the aligned
0114 %       FASTA files.
0115 %
0116 %    The most common alternatives here would be to use sequences from only
0117 %   eukaryotes, only prokaryotes or all sequences in KEGG. The Hidden Markov
0118 %   models for those options can be downloaded from the RAVEN Toolbox
0119 %   homepage. This is normally the most convenient way, but if you would
0120 %   like to use, for example, only fungal sequences for training the HMMs
0121 %   then you need to run this part.
0122 %
0123 %   3a. Queries the HMMs with sequences for the organism you are making a
0124 %       model for. This step uses both the output from step 1a and from 2b.
0125 %   3b. Constructs a model based on the fit to the HMMs and the chosen
0126 %       parameters.
0127 %
0128 %   These steps are specific to the organism for which you are reconstructing
0129 %   the model.
0130 %
0131 %   In principle the function looks at which output that is already available
0132 %   and runs runs only the parts that are required for step 3. This means
0133 %   that (see the definition of the parameters for details):
0134 %   -1a is only performed if there are no KEGG model files in the raven\kegg
0135 %   directory
0136 %   -1b is only performed if not all required HMMs OR aligned FASTA files
0137 %   OR multi-FASTA files exist in the defined dataDir. This means that this
0138 %   step is skipped if the FASTA files or HMMs are downloaded from the RAVEN
0139 %   Toolbox homepage instead. If not all files exist it will try to find
0140 %   the KEGG database files in dataDir. If it cannot find them it will try
0141 %   to download them via the KEGG FTP.
0142 %   -2a is only performed if not all required HMMs OR aligned FASTA files
0143 %   files exist in the defined dataDir. This means that this step is skipped
0144 %   if the HMMs are downloaded from the RAVEN Toolbox homepage instead.
0145 %   -2b is only performed if not all required HMMs exist in the defined
0146 %   dataDir. This means that this step is skipped if the FASTA files or
0147 %   HMMs are downloaded from the RAVEN Toolbox homepage instead.
0148 %   -3a is performed for the required HMMs for which no corresponding .out
0149 %   file exists in outDir. This is just a way to enable the function to be
0150 %   run in parallell or to resume if interrupeted.
0151 %   -3b is always performed.
0152 %
0153 %   Usage: model=getKEGGModelForOrganism(organismID,fastaFile,dataDir,outDir,...
0154 %    keepUndefinedStoich,keepIncomplete,keepGeneral,cutOff,minScoreRatioG,...
0155 %    minScoreRatioKO,maxPhylDist,nSequences)
0156 %
0157 %   Rasmus Agren, 2013-11-22
0158 %
0159 
0160 if nargin<2
0161     fastaFile=[];
0162 end
0163 if nargin<3
0164     dataDir=[];
0165 end
0166 if nargin<4
0167     outDir=[];
0168 end
0169 if isempty(outDir)
0170     outDir=tempdir;
0171     %Delete all *.out files if any exist
0172     delete(fullfile(outDir,'*.out'));
0173 end
0174 if nargin<5
0175     keepUndefinedStoich=true;
0176 end
0177 if nargin<6
0178     keepIncomplete=true;
0179 end
0180 if nargin<7
0181     keepGeneral=true;
0182 end
0183 if nargin<8
0184     cutOff=10^-50;
0185 end
0186 if nargin<9
0187     minScoreRatioG=0.8;
0188 end
0189 if nargin<10
0190     minScoreRatioKO=0.3;
0191 end    
0192 if nargin<11
0193     maxPhylDist=inf; %Include all sequences for each reaction
0194 end
0195 if nargin<12
0196     nSequences=inf; %Include all sequences for each reaction
0197 end
0198 
0199 %Check if the fasta-file contains '/' or'\'. If not then it's probably just
0200 %a file name. It is then merged with the current folder
0201 if any(fastaFile)
0202     if ~any(strfind(fastaFile,'\')) && ~any(strfind(fastaFile,'/'))
0203         fastaFile=fullfile(pwd,fastaFile);
0204     end
0205     %Create the required sub-folders in dataDir if they dont exist
0206     if ~exist(fullfile(dataDir,'keggdb'),'dir')
0207         mkdir(dataDir,'keggdb');
0208     end
0209     if ~exist(fullfile(dataDir,'fasta'),'dir')
0210         mkdir(dataDir,'fasta');
0211     end
0212     if ~exist(fullfile(dataDir,'aligned'),'dir')
0213         mkdir(dataDir,'aligned');
0214     end
0215     if ~exist(fullfile(dataDir,'hmms'),'dir')
0216         mkdir(dataDir,'hmms');
0217     end
0218     if ~exist(outDir,'dir')
0219         mkdir(outDir);
0220     end
0221 end
0222 
0223 %First generate the full KEGG model. The dataDir mustn't be supplied as
0224 %there is also an internal RAVEN version available
0225 if any(dataDir)
0226     [model, KOModel]=getModelFromKEGG(fullfile(dataDir,'keggdb'),keepUndefinedStoich,keepIncomplete,keepGeneral);
0227 else
0228     [model, KOModel]=getModelFromKEGG([],keepUndefinedStoich,keepIncomplete,keepGeneral);
0229 end
0230 fprintf('Completed generation of KEGG model\n');
0231 model.id=organismID;
0232 model.c=zeros(numel(model.rxns),1);
0233 
0234 %If no FASTA file is supplied, then just remove all genes which are not for
0235 %the given organism ID
0236 if isempty(fastaFile)
0237     %All IDs are three letters
0238     I=cellfun(@(x) strcmpi(x(1:3),organismID),model.genes);
0239     
0240     %Remove those genes
0241     model.genes=model.genes(I);
0242     model.rxnGeneMat=model.rxnGeneMat(:,I);
0243 end
0244 
0245 %First remove all reactions without genes
0246 hasGenes=any(model.rxnGeneMat,2);
0247 
0248 model=removeRxns(model,~hasGenes,true);
0249 
0250 %If no FASTA file is supplied, then we're done here
0251 if isempty(fastaFile)
0252     return;
0253 end
0254 
0255 %Trim the genes so that they only contain information that can be matched
0256 %to the KEGG file of protein sequences (remove all information after first
0257 %parenthesis)
0258 
0259 %NOTE: For some reason the organism abbreviation should be in lower case in
0260 %this database. Fix this here
0261 for i=1:numel(KOModel.genes)    
0262     parIndex=strfind(KOModel.genes{i},'(');
0263     if any(parIndex)
0264        KOModel.genes{i}=KOModel.genes{i}(1:parIndex-1); 
0265     end
0266     colIndex=strfind(KOModel.genes{i},':');
0267     KOModel.genes{i}=[lower(KOModel.genes{i}(1:colIndex-1)) KOModel.genes{i}(colIndex:end)];
0268 end
0269 
0270 %Create a phylogenetic distance structure
0271 phylDistStruct=getPhylDist(fullfile(dataDir,'keggdb'),maxPhylDist==-1);
0272 [crap phylDistId]=ismember(model.id,phylDistStruct.ids);
0273 fprintf('Completed creation of phylogenetic distance matrix\n');
0274 
0275 %Calculate the real maximal distance now. An abitary large number of 1000 is
0276 %used for the "all in kingdom" or "all sequences" options. This is a bit
0277 %inconvenient way to do it, but it's to make it fit with some older code
0278 if isinf(maxPhylDist) || maxPhylDist==-1
0279     maxPhylDist=1000;
0280 end
0281 
0282 %Get the KO ids for which files have been generated. Maybe not the neatest
0283 %way..
0284 fastaFiles=listFiles(fullfile(dataDir,'fasta','*.fa'));
0285 alignedFiles=listFiles(fullfile(dataDir,'aligned','*.fa'));
0286 alignedWorking=listFiles(fullfile(dataDir,'aligned','*.faw'));
0287 hmmFiles=listFiles(fullfile(dataDir,'hmms','*.hmm'));
0288 outFiles=listFiles(fullfile(outDir,'*.out'));
0289 
0290 %Check if multi-FASTA files should be generated. This should only be
0291 %performed if there are IDs in the KOModel structure that haven't been
0292 %parsed yet
0293 missingFASTA=setdiff(KOModel.rxns,[fastaFiles;alignedFiles;hmmFiles;outFiles]);
0294 
0295 if ~isempty(missingFASTA)
0296     if ~exist(fullfile(dataDir,'keggdb','genes.pep'),'file')
0297         %If no sequence file exists then download from KEGG
0298         downloadKEGG(fullfile(dataDir,'keggdb'));
0299     end
0300     %Only construct models for KOs which don't have files already
0301     fastaModel=removeRxns(KOModel,setdiff(KOModel.rxns,missingFASTA),true,true);
0302     %Permute the order of the KOs in the model so that constructMultiFasta
0303     %can be run on several processors at once
0304     fastaModel=permuteModel(fastaModel,randperm(RandStream.create('mrg32k3a','Seed',cputime()),numel(fastaModel.rxns)),'rxns');
0305     constructMultiFasta(fastaModel,fullfile(dataDir,'keggdb','genes.pep'),fullfile(dataDir,'fasta'));
0306 end
0307 fprintf('Completed generation of multi-FASTA files\n');
0308 
0309 %Get the directory for RAVEN Toolbox. This is to get the path to the third
0310 %party software used
0311 [ST I]=dbstack('-completenames');
0312 ravenPath=fileparts(ST(I).file);
0313 
0314 %Check if alignment of FASTA files should be performed
0315 missingAligned=setdiff(KOModel.rxns,[alignedFiles;hmmFiles;alignedWorking;outFiles]);
0316 if ~isempty(missingAligned)
0317     missingAligned=missingAligned(randperm(RandStream.create('mrg32k3a','Seed',cputime()),numel(missingAligned)));
0318     %Align all sequences using CLUSTALW
0319     for i=1:numel(missingAligned)
0320         %This is checked here because it could be that it is created by a
0321         %parallell process. The faw-files are saved as temporary files to
0322         %keppt track of which files are being worked on
0323         if ~exist(fullfile(dataDir,'aligned',[missingAligned{i} '.faw']),'file') && ~exist(fullfile(dataDir,'aligned',[missingAligned{i} '.fa']),'file')
0324             %Check that the multi-FASTA file exists. It should do so since
0325             %we are saving empty files as well. Print a warning and
0326             %continue if not.
0327             if ~exist(fullfile(dataDir,'fasta',[missingAligned{i} '.fa']),'file')
0328                 dispEM(['WARNING: The multi-FASTA file for ' missingAligned{i} ' does not exist'],false);
0329                 continue;
0330             end
0331             
0332             %If the multi-FASTA file is empty then save an empty aligned
0333             %file and continue
0334             s=dir(fullfile(dataDir,'fasta',[missingAligned{i} '.fa']));
0335             if s.bytes<=0
0336                 fid=fopen(fullfile(dataDir,'aligned',[missingAligned{i} '.fa']),'w');
0337                 fclose(fid);
0338                 continue;
0339             end
0340             
0341             %Create an empty file to prevent other threads to start to work on the same alignment
0342             fid=fopen(fullfile(dataDir,'aligned',[missingAligned{i} '.faw']),'w');
0343             fclose(fid);
0344             
0345             %First load the FASTA file, then select up to nSequences sequences
0346             %of the most closely related species, apply any constraints from
0347             %maxPhylDist, and save it as a temporary file, and create the
0348             %model from that
0349 
0350             fastaStruct=fastaread(fullfile(dataDir,'fasta',[missingAligned{i} '.fa']));
0351             phylDist=inf(numel(fastaStruct),1);
0352             for j=1:numel(fastaStruct)
0353                %Get the organism abbreviation
0354                index=strfind(fastaStruct(j).Header,':');
0355                if any(index)
0356                     abbrev=fastaStruct(j).Header(1:index(1)-1);
0357                     [crap index]=ismember(abbrev,phylDistStruct.ids);
0358                     if any(index)
0359                         phylDist(j)=phylDistStruct.distMat(index(1),phylDistId);
0360                     end
0361                end
0362             end
0363 
0364             %Inf means that it should not be included
0365             phylDist(phylDist>maxPhylDist)=[];
0366 
0367             %Sort based on phylDist
0368             [crap order]=sort(phylDist);
0369 
0370             %Save the first nSequences hits to a temporary FASTA file
0371             if nSequences<=numel(fastaStruct)
0372                 fastaStruct=fastaStruct(order(1:nSequences));
0373             else
0374                 fastaStruct=fastaStruct(order);
0375             end
0376             
0377             %Do the alignment if there are more than one sequences,
0378             %otherwise just save the sequence (or an empty file)
0379             if numel(fastaStruct)>1
0380                 tmpFile=tempname;
0381                 fastawrite(tmpFile,fastaStruct);
0382 
0383                 %Do the alignment for this file
0384                 [status output]=system([fullfile(ravenPath,'software','clustalw2','clustalw2') ' -infile="' tmpFile '" -align -outfile="' fullfile(dataDir,'aligned',[missingAligned{i} '.faw']) '" -output=FASTA -type=PROTEIN -OUTORDER=INPUT']);
0385                 if status~=0
0386                     dispEM(['Error when performing alignment of ' missingAligned{i} ':\n' output]); 
0387                 end
0388                 %Remove the old tempfile
0389                 if exist(tmpFile, 'file')
0390                    delete([tmpFile '*']);
0391                 end
0392             else
0393                 %If there is only one sequence then it's not possible to do a
0394                 %multiple alignment. Just print the sequence instead. An
0395                 %empty file was written previously so that doesn't have to
0396                 %be dealt with
0397                 if numel(fastaStruct)==1
0398                     %CLUSTALW only uses the first word as gene name. This
0399                     %doesn't really have an effect, but might as well be
0400                     %consistent
0401                     I=regexp(fastaStruct.Header,' ','split');
0402                     fastaStruct.Header=strrep(I{1},':','_');
0403                     fastawrite(fullfile(dataDir,'aligned',[missingAligned{i} '.faw']),fastaStruct);
0404                 end
0405             end
0406             %Move the temporary file to the real one
0407             movefile(fullfile(dataDir,'aligned',[missingAligned{i} '.faw']),fullfile(dataDir,'aligned',[missingAligned{i} '.fa']),'f');
0408         end
0409     end
0410 end
0411 fprintf('Completed multiple alignment of sequences\n');
0412 
0413 %Check if training of Hidden Markov models should be performed
0414 missingHMMs=setdiff(KOModel.rxns,[hmmFiles;outFiles]);
0415 if ~isempty(missingHMMs)
0416     missingHMMs=missingHMMs(randperm(RandStream.create('mrg32k3a','Seed',cputime()),numel(missingHMMs)));
0417     
0418     %Train models for all missing KOs
0419     for i=1:numel(missingHMMs)
0420         %This is checked here because it could be that it is created by a
0421         %parallell process
0422         if ~exist(fullfile(dataDir,'hmms',[missingHMMs{i} '.hmm']),'file') && ~exist(fullfile(dataDir,'hmms',[missingHMMs{i} '.hmw']),'file')
0423             %Check that the aligned FASTA file exists. It could be that it
0424             %is still being worked on by some other instance of the
0425             %program (the .faw file should then exist). This should not
0426             %happen on a single computer. I don't throw an error, because
0427             %it should finalize the ones it can.
0428             if ~exist(fullfile(dataDir,'aligned',[missingHMMs{i} '.fa']),'file')
0429                 dispEM(['The aligned FASTA file for ' missingHMMs{i} ' does not exist'],false);
0430                 continue;
0431             end
0432             
0433             %If the multi-FASTA file is empty then save an empty aligned
0434             %file and continue
0435             s=dir(fullfile(dataDir,'aligned',[missingHMMs{i} '.fa']));
0436             if s.bytes<=0
0437                 fid=fopen(fullfile(dataDir,'hmms',[missingHMMs{i} '.hmm']),'w');
0438                 fclose(fid);
0439                 continue;
0440             end
0441             %Create a temporary file to indicate that it is working on the
0442             %KO. This is because hmmbuild cannot overwrite existing files.
0443             fid=fopen(fullfile(dataDir,'hmms',[missingHMMs{i} '.hmw']),'w');
0444             fclose(fid);
0445                 
0446             %Create HMM. This is saved with a "hm" ending to indicate that
0447             %it hasn't been calibrated yet. This is because I want to be
0448             %able to see which models are uncalibrated if something goes
0449             %wrong
0450             [status output]=system([fullfile(ravenPath,'software','hmmer-2.3','hmmbuild') ' "' fullfile(dataDir,'hmms',[missingHMMs{i} '.hm']) '" "' fullfile(dataDir,'aligned',[missingHMMs{i} '.fa']) '"']);
0451             if status~=0
0452                 dispEM(['Error when training HMM for ' missingHMMs{i} ':\n' output]);
0453             end
0454             
0455             %This is only available for linux
0456             if ~ispc
0457                 [status output]=system([fullfile(ravenPath,'software','hmmer-2.3','hmmcalibrate') ' "' fullfile(dataDir,'hmms',[missingHMMs{i} '.hm']) '"']);
0458                 if status~=0
0459                     %This check is because some HMMs gives an error saying
0460                     %that the number of iterations might be too low.
0461                     %However, raising it doesn't have any effect. The error
0462                     %is therefore ignored and the uncalibrated model is
0463                     %used
0464                     if isempty(strfind(output,'--num may be set too small?'))
0465                         delete(fullfile(dataDir,'hmms',[missingHMMs{i} '.hm']));
0466                         dispEM(['Error when calibrating HMM for ' missingHMMs{i} ':\n' output]);
0467                     else
0468                         dispEM(['Cannot calibrate the HMM for ' missingHMMs{i} '. Using uncalibrated version'],false);
0469                     end
0470                 end
0471             end
0472             %Move the temporary file to the real one
0473             movefile(fullfile(dataDir,'hmms',[missingHMMs{i} '.hm']),fullfile(dataDir,'hmms',[missingHMMs{i} '.hmm']),'f');
0474             %Delete the temporary file
0475             delete(fullfile(dataDir,'hmms',[missingHMMs{i} '.hmw']));
0476         end
0477     end
0478 end
0479 fprintf('Completed generation of HMMs\n');
0480 
0481 %Check which new .out files that should be generated
0482 %Check if training of Hidden Markov models should be performed
0483 missingOUT=setdiff(KOModel.rxns,outFiles);
0484 if ~isempty(missingOUT)
0485     missingOUT=missingOUT(randperm(RandStream.create('mrg32k3a','Seed',cputime()),numel(missingOUT)));
0486     for i=1:numel(missingOUT)
0487         %This is checked here because it could be that it is created by a
0488         %parallell process
0489         if ~exist(fullfile(outDir,[missingOUT{i} '.out']),'file')
0490             %Check that the HMM file exists. It should do so since
0491             %we are saving empty files as well. Print a warning and
0492             %continue if not.
0493             if ~exist(fullfile(dataDir,'hmms',[missingOUT{i} '.hmm']),'file')
0494                 dispEM(['The HMM file for ' missingOUT{i} ' does not exist'],false);
0495                 continue;
0496             end
0497             
0498             %Save an empty file to prevent several threads working on the
0499             %same file
0500             fid=fopen(fullfile(outDir,[missingOUT{i} '.out']),'w');
0501             fclose(fid);
0502             
0503             %If the HMM file is empty then save an out file and continue
0504             s=dir(fullfile(dataDir,'hmms',[missingOUT{i} '.hmm']));
0505             if s.bytes<=0
0506                 continue;
0507             end
0508             
0509             %Check each gene in the input file against this model
0510             [status output]=system([fullfile(ravenPath,'software','hmmer-2.3','hmmsearch') ' "' fullfile(dataDir,'hmms',[missingOUT{i} '.hmm']) '" "' fastaFile '"']);
0511             if status~=0
0512                 dispEM(['Error when querying HMM for ' missingOUT{i} ':\n' output]); 
0513             end
0514 
0515             %Save the output to a file
0516             fid=fopen(fullfile(outDir,[missingOUT{i} '.out']),'w');
0517             fwrite(fid,output);
0518             fclose(fid);
0519         end 
0520     end
0521 end
0522 fprintf('Completed matching to HMMs\n');
0523             
0524 %***Begin retrieving the output and putting together the resulting model
0525 
0526 %Retrieve matched genes from the HMMs
0527 koGeneMat=zeros(numel(KOModel.rxns),3000); %Make room for 3000 genes
0528 genes=cell(3000,1);
0529 %Store the best score for a gene in a hash list (since I will be searching
0530 %many times)
0531 hTable = java.util.Hashtable;
0532 
0533 geneCounter=0;
0534 for i=1:numel(KOModel.rxns)
0535     if exist(fullfile(outDir,[KOModel.rxns{i} '.out']), 'file')
0536         fid=fopen(fullfile(outDir,[KOModel.rxns{i} '.out']),'r');
0537         beginMatches=false;
0538         while 1
0539             %Get the next line
0540             tline = fgetl(fid);
0541 
0542             %Abort at end of file
0543             if ~ischar(tline)
0544                 break;
0545             end
0546             
0547             if beginMatches==false
0548                 %This is how the listing of matches begins
0549                 if any(strfind(tline,'E-value '))
0550                     %Read one more line that is only padding
0551                     tline = fgetl(fid);
0552                     beginMatches=true;
0553                 end
0554             else
0555                 %If matches should be read
0556                 if ~strcmp(tline,'    [no hits above thresholds]') && ~isempty(tline)
0557                     elements=regexp(tline,' ','split');
0558                     elements=elements(cellfun(@any,elements));
0559                     
0560                     %Check if the match is below the treshhold
0561                     score=str2num(elements{end-1});
0562                     gene=elements{1};
0563                     if score<=cutOff
0564                         %If the score is exactly 0, change it to a very
0565                         %small value to avoid NaN
0566                         if score==0
0567                            score=10^-250; 
0568                         end
0569                         %Check if the gene is added already and, is so, get
0570                         %the best score for it
0571                         I=hTable.get(gene);
0572                         if any(I)
0573                             koGeneMat(i,I)=score;    
0574                         else
0575                             geneCounter=geneCounter+1;
0576                             %The gene was not present yet so add it
0577                             hTable.put(gene,geneCounter);
0578                             genes{geneCounter}=gene;
0579                             koGeneMat(i,geneCounter)=score;
0580                         end
0581                     end
0582                 else
0583                     break;
0584                 end
0585             end
0586         end
0587         fclose(fid);
0588     end
0589 end
0590 koGeneMat=koGeneMat(:,1:geneCounter);
0591 
0592 %Remove the genes for each KO that are below minScoreRatioKO.
0593 for i=1:size(koGeneMat,1)
0594     J=find(koGeneMat(i,:));
0595     if any(J)
0596         koGeneMat(i,J(log(koGeneMat(i,J))/log(min(koGeneMat(i,J)))<minScoreRatioKO))=0;
0597     end
0598 end
0599 
0600 %Remove the KOs for each gene that are below minScoreRatioG
0601 for i=1:size(koGeneMat,2)
0602     J=find(koGeneMat(:,i));
0603     if any(J)
0604         koGeneMat(J(log(koGeneMat(J,i))/log(min(koGeneMat(J,i)))<minScoreRatioG),i)=0;
0605     end
0606 end
0607 
0608 %Create the new model
0609 model.genes=genes(1:geneCounter);
0610 model.grRules=cell(numel(model.rxns),1);
0611 model.grRules(:)={''};
0612 model.rxnGeneMat=sparse(numel(model.rxns),numel(model.genes));
0613 
0614 %Loop through the reactions and add the corresponding genes
0615 for i=1:numel(model.rxns)
0616     if isstruct(model.rxnMiriams{i})
0617         %Get all KOs
0618         I=find(strcmpi(model.rxnMiriams{i}.name,'urn:miriam:kegg.ko'));
0619         KOs=model.rxnMiriams{i}.value(I);
0620         %Find the KOs and the corresponding genes
0621         J=ismember(KOModel.rxns,KOs);
0622         [crap K]=find(koGeneMat(J,:));
0623         
0624         if any(K)
0625             model.rxnGeneMat(i,K)=1;
0626             %Also delete KOs for which no genes were found. If no genes at all
0627             %were matched to the reaction it will be deleted later
0628             L=sum(koGeneMat(J,:),2)==0;
0629             model.rxnMiriams{i}.value(I(L))=[];
0630             model.rxnMiriams{i}.name(I(L))=[];
0631         end
0632     end
0633 end
0634 
0635 %Find and delete all reactions without any genes. This also removes genes
0636 %that are not used (which could happen because minScoreRatioG and
0637 %minScoreRatioKO)
0638 I=sum(model.rxnGeneMat,2)==0;
0639 model=removeRxns(model,I,true,true);
0640 
0641 %Add the gene associations as 'or'
0642 for i=1:numel(model.rxns)
0643     %Find the involved genes
0644     I=find(model.rxnGeneMat(i,:));
0645     model.grRules{i}=['(' model.genes{I(1)}];
0646     for j=2:numel(I)
0647         model.grRules{i}=[model.grRules{i} ' or ' model.genes{I(j)}];
0648     end
0649     model.grRules{i}=[model.grRules{i} ')'];
0650 end
0651 end
0652 
0653 %Supporter function to list the files in a directory and return them as a
0654 %cell array
0655 function files=listFiles(directory)
0656     temp=dir(directory);
0657     files=cell(numel(temp),1);
0658     for i=1:numel(temp)
0659         files{i}=temp(i,1).name;
0660     end
0661     files=strrep(files,'.fa','');
0662     files=strrep(files,'.hmm','');
0663     files=strrep(files,'.out','');
0664     files=strrep(files,'.faw','');
0665 end

Generated on Mon 06-Jan-2014 14:58:12 by m2html © 2005