Home > RAVEN > getKEGGModelForOrganism.m

getKEGGModelForOrganism

PURPOSE ^

getKEGGModelForOrganism

SYNOPSIS ^

function model=getKEGGModelForOrganism(organismID,fastaFile,dataDir,outDir,keepUndefinedStoich,keepIncomplete,keepGeneral,cutOff,minScoreRatioG,minScoreRatioKO,maxPhylDist,nSequences)

DESCRIPTION ^

 getKEGGModelForOrganism
   Reconstructs a genome-scale metabolic model based on protein homology to the
   orthologies in KEGG

   organismID          three letter abbreviation of the organism (as used in
                       KEGG). If not available, use a closely related
                       species. This is used for determing the
                       phylogenetic distance.
   fastaFile           a FASTA file that contains the protein sequences of
                       the organism for which to reconstruct a model
   dataDir             directory for which to retrieve the input data.
                       Should contain a combination of these sub-folders:
                       -dataDir\keggdb
                           The KEGG database files used in 1a (see below).
                       -dataDir\fasta
                           The multi-FASTA files generated in 1b or
                           downloaded from the RAVEN Toolbox homepage (see
                           below)
                       -dataDir\aligned
                           The aligned FASTA files as generated in 2a (see
                           below)
                       -dataDir\hmms
                           The Hidden Markov Models as generated in 2b or
                           downloaded from the RAVEN Toolbox homepage (see
                           below)
   outDir              directory to save the results from the quering of
                       the Hidden Markov models. The output is specific
                       for the input sequences and the settings used. It
                       is stored in this manner so that the function can
                       continue if interrupted or if it should run in
                       parallell. Be careful not to leave output files
                       from different organisms or runs with different
                       settings in the same folder. They will not be
                       overwritten (opt, default is a temporary dir where
                       all *.out files are deleted before and after doing
                       the reconstruction)
   keepUndefinedStoich include reactions in the form n A <=> n+1 A. These
                       will be dealt with as two separate metabolites 
                       (opt, default true)
   keepIncomplete      include reactions which have been labelled as
                       "incomplete", "erroneous" or "unclear" (opt,
                       default true)
   keepGeneral         include reactions which have been labelled as
                       "general reaction". These are reactions on the form
                       "an aldehyde <=> an alcohol", and are therefore
                       unsuited for modelling purposes. Note that not all
                       reactions have this type of annotation, and the
                       script will therefore not be able to remove all
                       such reactions (opt, default false)
   cutOff              significans score from HMMer needed to assign genes
                       to a KO (opt, default 10^-50)
   minScoreRatioG      a gene is only assigned to KOs for which the score
                       is >=log(score)/log(best score) for that gene. 
                       This is to prevent that a gene which clearly belongs 
                       to one KO is assigned also to KOs with much lower 
                       scores (opt, default 0.8 (lower is less strict))
   minScoreRatioKO     ignore genes in a KO if their score is
                       <log(score)/log(best score in KO). This is to
                       "prune" KOs which have many genes and where some
                       are clearly a better fit
                       (opt, default 0.3 (lower is less strict))
   maxPhylDist         -1: only use sequences from the same domain
                       (Prokaryota, Eukaryota)
                       other (positive) value: only use sequences for
                       organisms where the phylogenetic distance is at
                       the most this large (as calculated in getPhylDist)
                       (opt, default Inf, which means that all sequences will
                       be used)
   nSequences          for each KO, use up to this many sequences from 
                       the most closely related species. This is mainly to
                       speed up the alignment process for KOs with very
                       many genes (opt, default inf)

   model               the reconstructed model

   PLEASE READ THIS: The input to this function can be confusing, because
   it is intended to be run in parallell on a cluster or in multiple sessions.
   It therefore saves a lot of intermediate results to disc. This also
   serves the purpose of not having to do reduntant calculations. This,
   however, comes with the disadvantage of somewhat trickier handling.
   This is what this function does:

   1a. Downloads relevant parts from the KEGG FTP and constructs a general
       RAVEN model representing the metabolic network. This output is 
       distributed with RAVEN (it's in the raven\kegg directory). Delete
       those files to rerun this parsing.
   1b. Generates FASTA files from the downloaded KEGG files. One
       multi-FASTA file for each KO in KEGG is generated. These
       multi-fasta files can be downloaded from the RAVEN Toolbox
       homepage if you do not have access to the KEGG FTP or don't want to
       do this time-consuming parsing. Just be sure that you actually need
       them first (see below)

   These steps only have to be redone every time KEGG updates their
   database (or rather when the updates are large enough to warrant
   rerunning this part). Many user would probably never use this feature.

   2a. Does alignment of the multi-FASTA files for future use. This uses the
       settings "useEvDist" and "nSequences" to control which sequences
       should be used for constructing Hidden Markov models (HMMs), and
       later for matching your sequences to. 
   2b. Trains Hidden Markov models using HMMer for each of the aligned
       FASTA files.

    The most common alternatives here would be to use sequences from only
   eukaryotes, only prokaryotes or all sequences in KEGG. The Hidden Markov
   models for those options can be downloaded from the RAVEN Toolbox
   homepage. This is normally the most convenient way, but if you would
   like to use, for example, only fungal sequences for training the HMMs
   then you need to run this part.

   3a. Queries the HMMs with sequences for the organism you are making a
       model for. This step uses both the output from step 1a and from 2b.
   3b. Constructs a model based on the fit to the HMMs and the chosen
       parameters.

   These steps are specific to the organism for which you are reconstructing
   the model.

   In principle the function looks at which output that is already available
   and runs runs only the parts that are required for step 3. This means
   that (see the definition of the parameters for details):
   -1a is only performed if there are no KEGG model files in the raven\kegg
   directory
   -1b is only performed if not all required HMMs OR aligned FASTA files
   OR multi-FASTA files exist in the defined dataDir. This means that this
   step is skipped if the FASTA files or HMMs are downloaded from the RAVEN
   Toolbox homepage instead. If not all files exist it will try to find
   the KEGG database files in dataDir. If it cannot find them it will try
   to download them via the KEGG FTP.
   -2a is only performed if not all required HMMs OR aligned FASTA files
   files exist in the defined dataDir. This means that this step is skipped
   if the HMMs are downloaded from the RAVEN Toolbox homepage instead.
   -2b is only performed if not all required HMMs exist in the defined
   dataDir. This means that this step is skipped if the FASTA files or
   HMMs are downloaded from the RAVEN Toolbox homepage instead.
   -3a is performed for the required HMMs for which no corresponding .out
   file exists in outDir. This is just a way to enable the function to be
   run in parallell or to resume if interrupeted.
   -3b is always performed.

   Usage: model=getKEGGModelForOrganism(organismID,fastaFile,dataDir,outDir,...
    keepUndefinedStoich,keepIncomplete,keepGeneral,cutOff,minScoreRatioG,...
    minScoreRatioKO,maxPhylDist,nSequences)

   Rasmus Agren, 2013-02-04

CROSS-REFERENCE INFORMATION ^

This function calls: This function is called by:

SUBFUNCTIONS ^

SOURCE CODE ^

0001 function model=getKEGGModelForOrganism(organismID,fastaFile,dataDir,outDir,...
0002     keepUndefinedStoich,keepIncomplete,keepGeneral,cutOff,minScoreRatioG,...
0003     minScoreRatioKO,maxPhylDist,nSequences)
0004 % getKEGGModelForOrganism
0005 %   Reconstructs a genome-scale metabolic model based on protein homology to the
0006 %   orthologies in KEGG
0007 %
0008 %   organismID          three letter abbreviation of the organism (as used in
0009 %                       KEGG). If not available, use a closely related
0010 %                       species. This is used for determing the
0011 %                       phylogenetic distance.
0012 %   fastaFile           a FASTA file that contains the protein sequences of
0013 %                       the organism for which to reconstruct a model
0014 %   dataDir             directory for which to retrieve the input data.
0015 %                       Should contain a combination of these sub-folders:
0016 %                       -dataDir\keggdb
0017 %                           The KEGG database files used in 1a (see below).
0018 %                       -dataDir\fasta
0019 %                           The multi-FASTA files generated in 1b or
0020 %                           downloaded from the RAVEN Toolbox homepage (see
0021 %                           below)
0022 %                       -dataDir\aligned
0023 %                           The aligned FASTA files as generated in 2a (see
0024 %                           below)
0025 %                       -dataDir\hmms
0026 %                           The Hidden Markov Models as generated in 2b or
0027 %                           downloaded from the RAVEN Toolbox homepage (see
0028 %                           below)
0029 %   outDir              directory to save the results from the quering of
0030 %                       the Hidden Markov models. The output is specific
0031 %                       for the input sequences and the settings used. It
0032 %                       is stored in this manner so that the function can
0033 %                       continue if interrupted or if it should run in
0034 %                       parallell. Be careful not to leave output files
0035 %                       from different organisms or runs with different
0036 %                       settings in the same folder. They will not be
0037 %                       overwritten (opt, default is a temporary dir where
0038 %                       all *.out files are deleted before and after doing
0039 %                       the reconstruction)
0040 %   keepUndefinedStoich include reactions in the form n A <=> n+1 A. These
0041 %                       will be dealt with as two separate metabolites
0042 %                       (opt, default true)
0043 %   keepIncomplete      include reactions which have been labelled as
0044 %                       "incomplete", "erroneous" or "unclear" (opt,
0045 %                       default true)
0046 %   keepGeneral         include reactions which have been labelled as
0047 %                       "general reaction". These are reactions on the form
0048 %                       "an aldehyde <=> an alcohol", and are therefore
0049 %                       unsuited for modelling purposes. Note that not all
0050 %                       reactions have this type of annotation, and the
0051 %                       script will therefore not be able to remove all
0052 %                       such reactions (opt, default false)
0053 %   cutOff              significans score from HMMer needed to assign genes
0054 %                       to a KO (opt, default 10^-50)
0055 %   minScoreRatioG      a gene is only assigned to KOs for which the score
0056 %                       is >=log(score)/log(best score) for that gene.
0057 %                       This is to prevent that a gene which clearly belongs
0058 %                       to one KO is assigned also to KOs with much lower
0059 %                       scores (opt, default 0.8 (lower is less strict))
0060 %   minScoreRatioKO     ignore genes in a KO if their score is
0061 %                       <log(score)/log(best score in KO). This is to
0062 %                       "prune" KOs which have many genes and where some
0063 %                       are clearly a better fit
0064 %                       (opt, default 0.3 (lower is less strict))
0065 %   maxPhylDist         -1: only use sequences from the same domain
0066 %                       (Prokaryota, Eukaryota)
0067 %                       other (positive) value: only use sequences for
0068 %                       organisms where the phylogenetic distance is at
0069 %                       the most this large (as calculated in getPhylDist)
0070 %                       (opt, default Inf, which means that all sequences will
0071 %                       be used)
0072 %   nSequences          for each KO, use up to this many sequences from
0073 %                       the most closely related species. This is mainly to
0074 %                       speed up the alignment process for KOs with very
0075 %                       many genes (opt, default inf)
0076 %
0077 %   model               the reconstructed model
0078 %
0079 %   PLEASE READ THIS: The input to this function can be confusing, because
0080 %   it is intended to be run in parallell on a cluster or in multiple sessions.
0081 %   It therefore saves a lot of intermediate results to disc. This also
0082 %   serves the purpose of not having to do reduntant calculations. This,
0083 %   however, comes with the disadvantage of somewhat trickier handling.
0084 %   This is what this function does:
0085 %
0086 %   1a. Downloads relevant parts from the KEGG FTP and constructs a general
0087 %       RAVEN model representing the metabolic network. This output is
0088 %       distributed with RAVEN (it's in the raven\kegg directory). Delete
0089 %       those files to rerun this parsing.
0090 %   1b. Generates FASTA files from the downloaded KEGG files. One
0091 %       multi-FASTA file for each KO in KEGG is generated. These
0092 %       multi-fasta files can be downloaded from the RAVEN Toolbox
0093 %       homepage if you do not have access to the KEGG FTP or don't want to
0094 %       do this time-consuming parsing. Just be sure that you actually need
0095 %       them first (see below)
0096 %
0097 %   These steps only have to be redone every time KEGG updates their
0098 %   database (or rather when the updates are large enough to warrant
0099 %   rerunning this part). Many user would probably never use this feature.
0100 %
0101 %   2a. Does alignment of the multi-FASTA files for future use. This uses the
0102 %       settings "useEvDist" and "nSequences" to control which sequences
0103 %       should be used for constructing Hidden Markov models (HMMs), and
0104 %       later for matching your sequences to.
0105 %   2b. Trains Hidden Markov models using HMMer for each of the aligned
0106 %       FASTA files.
0107 %
0108 %    The most common alternatives here would be to use sequences from only
0109 %   eukaryotes, only prokaryotes or all sequences in KEGG. The Hidden Markov
0110 %   models for those options can be downloaded from the RAVEN Toolbox
0111 %   homepage. This is normally the most convenient way, but if you would
0112 %   like to use, for example, only fungal sequences for training the HMMs
0113 %   then you need to run this part.
0114 %
0115 %   3a. Queries the HMMs with sequences for the organism you are making a
0116 %       model for. This step uses both the output from step 1a and from 2b.
0117 %   3b. Constructs a model based on the fit to the HMMs and the chosen
0118 %       parameters.
0119 %
0120 %   These steps are specific to the organism for which you are reconstructing
0121 %   the model.
0122 %
0123 %   In principle the function looks at which output that is already available
0124 %   and runs runs only the parts that are required for step 3. This means
0125 %   that (see the definition of the parameters for details):
0126 %   -1a is only performed if there are no KEGG model files in the raven\kegg
0127 %   directory
0128 %   -1b is only performed if not all required HMMs OR aligned FASTA files
0129 %   OR multi-FASTA files exist in the defined dataDir. This means that this
0130 %   step is skipped if the FASTA files or HMMs are downloaded from the RAVEN
0131 %   Toolbox homepage instead. If not all files exist it will try to find
0132 %   the KEGG database files in dataDir. If it cannot find them it will try
0133 %   to download them via the KEGG FTP.
0134 %   -2a is only performed if not all required HMMs OR aligned FASTA files
0135 %   files exist in the defined dataDir. This means that this step is skipped
0136 %   if the HMMs are downloaded from the RAVEN Toolbox homepage instead.
0137 %   -2b is only performed if not all required HMMs exist in the defined
0138 %   dataDir. This means that this step is skipped if the FASTA files or
0139 %   HMMs are downloaded from the RAVEN Toolbox homepage instead.
0140 %   -3a is performed for the required HMMs for which no corresponding .out
0141 %   file exists in outDir. This is just a way to enable the function to be
0142 %   run in parallell or to resume if interrupeted.
0143 %   -3b is always performed.
0144 %
0145 %   Usage: model=getKEGGModelForOrganism(organismID,fastaFile,dataDir,outDir,...
0146 %    keepUndefinedStoich,keepIncomplete,keepGeneral,cutOff,minScoreRatioG,...
0147 %    minScoreRatioKO,maxPhylDist,nSequences)
0148 %
0149 %   Rasmus Agren, 2013-02-04
0150 %
0151 
0152 if nargin<4
0153     outDir=[];
0154 end
0155 if isempty(outDir)
0156     outDir=tempdir;
0157     %Delete all *.out files if any exist
0158     delete(fullfile(outDir,'*.out'));
0159 end
0160 if nargin<5
0161     keepUndefinedStoich=true;
0162 end
0163 if nargin<6
0164     keepIncomplete=true;
0165 end
0166 if nargin<7
0167     keepGeneral=true;
0168 end
0169 if nargin<8
0170     cutOff=10^-50;
0171 end
0172 if nargin<9
0173     minScoreRatioG=0.8;
0174 end
0175 if nargin<10
0176     minScoreRatioKO=0.3;
0177 end    
0178 if nargin<5
0179     maxPhylDist=inf; %Include all sequences for each reaction
0180 end
0181 if nargin<12
0182     nSequences=inf; %Include all sequences for each reaction
0183 end
0184 
0185 %Check if the fasta-file contains '/' or'\'. If not then it's probably just
0186 %a file name. It is then merged with the current folder
0187 if ~any(strfind(fastaFile,'\')) && ~any(strfind(fastaFile,'/'))
0188     fastaFile=fullfile(pwd,fastaFile);
0189 end
0190 %Create the required sub-folders in dataDir if they dont exist
0191 if ~exist(fullfile(dataDir,'keggdb'),'dir')
0192     mkdir(dataDir,'keggdb');
0193 end
0194 if ~exist(fullfile(dataDir,'fasta'),'dir')
0195     mkdir(dataDir,'fasta');
0196 end
0197 if ~exist(fullfile(dataDir,'aligned'),'dir')
0198     mkdir(dataDir,'aligned');
0199 end
0200 if ~exist(fullfile(dataDir,'hmms'),'dir')
0201     mkdir(dataDir,'hmms');
0202 end
0203 if ~exist(outDir,'dir')
0204     mkdir(outDir);
0205 end
0206 
0207 %First generate the full KEGG model
0208 [model KOModel]=getModelFromKEGG(fullfile(dataDir,'keggdb'),keepUndefinedStoich,keepIncomplete,keepGeneral);
0209 
0210 fprintf('Completed generation of KEGG model\n');
0211 model.id=organismID;
0212 model.c=zeros(numel(model.rxns),1);
0213 
0214 %First remove all reactions without genes
0215 hasGenes=any(model.rxnGeneMat,2);
0216 
0217 model=removeRxns(model,~hasGenes,true);
0218 
0219 %Trim the genes so that they only contain information that can be matched
0220 %to the KEGG file of protein sequences (remove all information after first
0221 %parenthesis)
0222 
0223 %NOTE: For some reason the organism abbreviation should be in lower case in
0224 %this database. Fix this here
0225 for i=1:numel(KOModel.genes)    
0226     parIndex=strfind(KOModel.genes{i},'(');
0227     if any(parIndex)
0228        KOModel.genes{i}=KOModel.genes{i}(1:parIndex-1); 
0229     end
0230     colIndex=strfind(KOModel.genes{i},':');
0231     KOModel.genes{i}=[lower(KOModel.genes{i}(1:colIndex-1)) KOModel.genes{i}(colIndex:end)];
0232 end
0233 
0234 %Create a phylogenetic distance structure
0235 phylDistStruct=getPhylDist(fullfile(dataDir,'keggdb'),maxPhylDist==-1);
0236 [crap phylDistId]=ismember(model.id,phylDistStruct.ids);
0237 fprintf('Completed creation of phylogenetic distance matrix\n');
0238 
0239 %Calculate the real maximal distance now. An abitary large number of 1000 is
0240 %used for the "all in kingdom" or "all sequences" options. This is a bit
0241 %inconvenient way to do it, but it's to make it fit with some older code
0242 if isinf(maxPhylDist) || maxPhylDist==-1
0243     maxPhylDist=1000;
0244 end
0245 
0246 %Get the KO ids for which files have been generated. Maybe not the neatest
0247 %way..
0248 fastaFiles=listFiles(fullfile(dataDir,'fasta','*.fa'));
0249 alignedFiles=listFiles(fullfile(dataDir,'aligned','*.fa'));
0250 alignedWorking=listFiles(fullfile(dataDir,'aligned','*.faw'));
0251 hmmFiles=listFiles(fullfile(dataDir,'hmms','*.hmm'));
0252 outFiles=listFiles(fullfile(outDir,'*.out'));
0253 
0254 %Check if multi-FASTA files should be generated. This should only be
0255 %performed if there are IDs in the KOModel structure that haven't been
0256 %parsed yet
0257 missingFASTA=setdiff(KOModel.rxns,[fastaFiles;alignedFiles;hmmFiles;outFiles]);
0258 
0259 if ~isempty(missingFASTA)
0260     if ~exist(fullfile(dataDir,'keggdb','genes.pep'),'file')
0261         %If no sequence file exists then download from KEGG
0262         downloadKEGG(fullfile(dataDir,'keggdb'));
0263     end
0264     %Only construct models for KOs which don't have files already
0265     fastaModel=removeRxns(KOModel,setdiff(KOModel.rxns,missingFASTA),true,true);
0266     %Permute the order of the KOs in the model so that constructMultiFasta
0267     %can be run on several processors at once
0268     fastaModel=permuteModel(fastaModel,randperm(RandStream.create('mrg32k3a','Seed',cputime()),numel(fastaModel.rxns)),'rxns');
0269     constructMultiFasta(fastaModel,fullfile(dataDir,'keggdb','genes.pep'),fullfile(dataDir,'fasta'));
0270 end
0271 fprintf('Completed generation of multi-FASTA files\n');
0272 
0273 %Get the directory for RAVEN Toolbox. This is to get the path to the third
0274 %party software used
0275 [ST I]=dbstack('-completenames');
0276 ravenPath=fileparts(ST(I).file);
0277 
0278 %Check if alignment of FASTA files should be performed
0279 missingAligned=setdiff(KOModel.rxns,[alignedFiles;hmmFiles;alignedWorking;outFiles]);
0280 if ~isempty(missingAligned)
0281     missingAligned=missingAligned(randperm(RandStream.create('mrg32k3a','Seed',cputime()),numel(missingAligned)));
0282     %Align all sequences using CLUSTALW
0283     for i=1:numel(missingAligned)
0284         %This is checked here because it could be that it is created by a
0285         %parallell process. The faw-files are saved as temporary files to
0286         %keppt track of which files are being worked on
0287         if ~exist(fullfile(dataDir,'aligned',[missingAligned{i} '.faw']),'file') && ~exist(fullfile(dataDir,'aligned',[missingAligned{i} '.fa']),'file')
0288             %Check that the multi-FASTA file exists. It should do so since
0289             %we are saving empty files as well. Print a warning and
0290             %continue if not.
0291             if ~exist(fullfile(dataDir,'fasta',[missingAligned{i} '.fa']),'file')
0292                 fprintf(['WARNING: The multi-FASTA file for ' missingAligned{i} ' does not exist\n']);
0293                 continue;
0294             end
0295             
0296             %If the multi-FASTA file is empty then save an empty aligned
0297             %file and continue
0298             s=dir(fullfile(dataDir,'fasta',[missingAligned{i} '.fa']));
0299             if s.bytes<=0
0300                 fid=fopen(fullfile(dataDir,'aligned',[missingAligned{i} '.fa']),'w');
0301                 fclose(fid);
0302                 continue;
0303             end
0304             
0305             %Create an empty file to prevent other threads to start to work on the same alignment
0306             fid=fopen(fullfile(dataDir,'aligned',[missingAligned{i} '.faw']),'w');
0307             fclose(fid);
0308             
0309             %First load the FASTA file, then select up to nSequences sequences
0310             %of the most closely related species, apply any constraints from
0311             %maxPhylDist, and save it as a temporary file, and create the
0312             %model from that
0313 
0314             fastaStruct=fastaread(fullfile(dataDir,'fasta',[missingAligned{i} '.fa']));
0315             phylDist=inf(numel(fastaStruct),1);
0316             for j=1:numel(fastaStruct)
0317                %Get the organism abbreviation
0318                index=strfind(fastaStruct(j).Header,':');
0319                if any(index)
0320                     abbrev=fastaStruct(j).Header(1:index(1)-1);
0321                     [crap index]=ismember(abbrev,phylDistStruct.ids);
0322                     if any(index)
0323                         phylDist(j)=phylDistStruct.distMat(index(1),phylDistId);
0324                     end
0325                end
0326             end
0327 
0328             %Inf means that it should not be included
0329             phylDist(phylDist>maxPhylDist)=[];
0330 
0331             %Sort based on phylDist
0332             [crap order]=sort(phylDist);
0333 
0334             %Save the first nSequences hits to a temporary FASTA file
0335             if nSequences<=numel(fastaStruct)
0336                 fastaStruct=fastaStruct(order(1:nSequences));
0337             else
0338                 fastaStruct=fastaStruct(order);
0339             end
0340             
0341             %Do the alignment if there are more than one sequences,
0342             %otherwise just save the sequence (or an empty file)
0343             if numel(fastaStruct)>1
0344                 tmpFile=tempname;
0345                 fastawrite(tmpFile,fastaStruct);
0346 
0347                 %Do the alignment for this file
0348                 [status output]=system([fullfile(ravenPath,'software','clustalw2','clustalw2') ' -infile="' tmpFile '" -align -outfile="' fullfile(dataDir,'aligned',[missingAligned{i} '.faw']) '" -output=FASTA -type=PROTEIN -OUTORDER=INPUT']);
0349                 if status~=0
0350                     throw(MException('',['Error when performing alignment of ' missingAligned{i} ':\n' output])); 
0351                 end
0352                 %Remove the old tempfile
0353                 if exist(tmpFile, 'file')
0354                    delete([tmpFile '*']);
0355                 end
0356             else
0357                 %If there is only one sequence then it's not possible to do a
0358                 %multiple alignment. Just print the sequence instead. An
0359                 %empty file was written previously so that doesn't have to
0360                 %be dealt with
0361                 if numel(fastaStruct)==1
0362                     %CLUSTALW only uses the first word as gene name. This
0363                     %doesn't really have an effect, but might as well be
0364                     %consistent
0365                     I=regexp(fastaStruct.Header,' ','split');
0366                     fastaStruct.Header=strrep(I{1},':','_');
0367                     fastawrite(fullfile(dataDir,'aligned',[missingAligned{i} '.faw']),fastaStruct);
0368                 end
0369             end
0370             %Move the temporary file to the real one
0371             movefile(fullfile(dataDir,'aligned',[missingAligned{i} '.faw']),fullfile(dataDir,'aligned',[missingAligned{i} '.fa']),'f');
0372         end
0373     end
0374 end
0375 fprintf('Completed multiple alignment of sequences\n');
0376 
0377 %Check if training of Hidden Markov models should be performed
0378 missingHMMs=setdiff(KOModel.rxns,[hmmFiles;outFiles]);
0379 if ~isempty(missingHMMs)
0380     missingHMMs=missingHMMs(randperm(RandStream.create('mrg32k3a','Seed',cputime()),numel(missingHMMs)));
0381     
0382     %Train models for all missing KOs
0383     for i=1:numel(missingHMMs)
0384         %This is checked here because it could be that it is created by a
0385         %parallell process
0386         if ~exist(fullfile(dataDir,'hmms',[missingHMMs{i} '.hmm']),'file') && ~exist(fullfile(dataDir,'hmms',[missingHMMs{i} '.hmw']),'file')
0387             %Check that the aligned FASTA file exists. It could be that it
0388             %is still being worked on by some other instance of the
0389             %program (the .faw file should then exist). This should not
0390             %happen on a single computer. I don't throw an error, because
0391             %it should finalize the ones it can.
0392             if ~exist(fullfile(dataDir,'aligned',[missingHMMs{i} '.fa']),'file')
0393                 fprintf(['WARNING: The aligned FASTA file for ' missingHMMs{i} ' does not exist\n']);
0394                 continue;
0395             end
0396             
0397             %If the multi-FASTA file is empty then save an empty aligned
0398             %file and continue
0399             s=dir(fullfile(dataDir,'aligned',[missingHMMs{i} '.fa']));
0400             if s.bytes<=0
0401                 fid=fopen(fullfile(dataDir,'hmms',[missingHMMs{i} '.hmm']),'w');
0402                 fclose(fid);
0403                 continue;
0404             end
0405             %Create a temporary file to indicate that it is working on the
0406             %KO. This is because hmmbuild cannot overwrite existing files.
0407             fid=fopen(fullfile(dataDir,'hmms',[missingHMMs{i} '.hmw']),'w');
0408             fclose(fid);
0409                 
0410             %Create HMM. This is saved with a "hm" ending to indicate that
0411             %it hasn't been calibrated yet. This is because I want to be
0412             %able to see which models are uncalibrated if something goes
0413             %wrong
0414             [status output]=system([fullfile(ravenPath,'software','hmmer-2.3','hmmbuild') ' "' fullfile(dataDir,'hmms',[missingHMMs{i} '.hm']) '" "' fullfile(dataDir,'aligned',[missingHMMs{i} '.fa']) '"']);
0415             if status~=0
0416                 throw(MException('',['Error when training HMM for ' missingHMMs{i} ':\n' output]));
0417             end
0418             
0419             %This is only available for linux
0420             if ~ispc
0421                 [status output]=system([fullfile(ravenPath,'software','hmmer-2.3','hmmcalibrate') ' "' fullfile(dataDir,'hmms',[missingHMMs{i} '.hm']) '"']);
0422                 if status~=0
0423                     %This check is because some HMMs gives an error saying
0424                     %that the number of iterations might be too low.
0425                     %However, raising it doesn't have any effect. The error
0426                     %is therefore ignored and the uncalibrated model is
0427                     %used
0428                     if isempty(strfind(output,'--num may be set too small?'))
0429                         delete(fullfile(dataDir,'hmms',[missingHMMs{i} '.hm']));
0430                         throw(MException('',['Error when calibrating HMM for ' missingHMMs{i} ':\n' output]));
0431                     else
0432                         fprintf(['WARNING: Cannot calibrate the HMM for ' missingHMMs{i} '. Using uncalibrated version\n']);
0433                     end
0434                 end
0435             end
0436             %Move the temporary file to the real one
0437             movefile(fullfile(dataDir,'hmms',[missingHMMs{i} '.hm']),fullfile(dataDir,'hmms',[missingHMMs{i} '.hmm']),'f');
0438             %Delete the temporary file
0439             delete(fullfile(dataDir,'hmms',[missingHMMs{i} '.hmw']));
0440         end
0441     end
0442 end
0443 fprintf('Completed generation of HMMs\n');
0444 
0445 %Check which new .out files that should be generated
0446 %Check if training of Hidden Markov models should be performed
0447 missingOUT=setdiff(KOModel.rxns,outFiles);
0448 if ~isempty(missingOUT)
0449     missingOUT=missingOUT(randperm(RandStream.create('mrg32k3a','Seed',cputime()),numel(missingOUT)));
0450     for i=1:numel(missingOUT)
0451         %This is checked here because it could be that it is created by a
0452         %parallell process
0453         if ~exist(fullfile(outDir,[missingOUT{i} '.out']),'file')
0454             %Check that the HMM file exists. It should do so since
0455             %we are saving empty files as well. Print a warning and
0456             %continue if not.
0457             if ~exist(fullfile(dataDir,'hmms',[missingOUT{i} '.hmm']),'file')
0458                 fprintf(['WARNING: The HMM file for ' missingOUT{i} ' does not exist\n']);
0459                 continue;
0460             end
0461             
0462             %Save an empty file to prevent several threads working on the
0463             %same file
0464             fid=fopen(fullfile(outDir,[missingOUT{i} '.out']),'w');
0465             fclose(fid);
0466             
0467             %If the HMM file is empty then save an out file and continue
0468             s=dir(fullfile(dataDir,'hmms',[missingOUT{i} '.hmm']));
0469             if s.bytes<=0
0470                 continue;
0471             end
0472             
0473             %Check each gene in the input file against this model
0474             [status output]=system([fullfile(ravenPath,'software','hmmer-2.3','hmmsearch') ' "' fullfile(dataDir,'hmms',[missingOUT{i} '.hmm']) '" "' fastaFile '"']);
0475             if status~=0
0476                 throw(MException('',['Error when querying HMM for ' missingOUT{i} ':\n' output])); 
0477             end
0478 
0479             %Save the output to a file
0480             fid=fopen(fullfile(outDir,[missingOUT{i} '.out']),'w');
0481             fwrite(fid,output);
0482             fclose(fid);
0483         end 
0484     end
0485 end
0486 fprintf('Completed matching to HMMs\n');
0487             
0488 %***Begin retrieving the output and putting together the resulting model
0489 
0490 %Retrieve matched genes from the HMMs
0491 koGeneMat=zeros(numel(KOModel.rxns),3000); %Make room for 3000 genes
0492 genes=cell(3000,1);
0493 %Store the best score for a gene in a hash list (since I will be searching
0494 %many times)
0495 hTable = java.util.Hashtable;
0496 
0497 geneCounter=0;
0498 for i=1:numel(KOModel.rxns)
0499     if exist(fullfile(outDir,[KOModel.rxns{i} '.out']), 'file')
0500         fid=fopen(fullfile(outDir,[KOModel.rxns{i} '.out']),'r');
0501         beginMatches=false;
0502         while 1
0503             %Get the next line
0504             tline = fgetl(fid);
0505 
0506             %Abort at end of file
0507             if ~ischar(tline)
0508                 break;
0509             end
0510             
0511             if beginMatches==false
0512                 %This is how the listing of matches begins
0513                 if any(strfind(tline,'E-value '))
0514                     %Read one more line that is only padding
0515                     tline = fgetl(fid);
0516                     beginMatches=true;
0517                 end
0518             else
0519                 %If matches should be read
0520                 if ~strcmp(tline,'    [no hits above thresholds]') && ~isempty(tline)
0521                     elements=regexp(tline,' ','split');
0522                     elements=elements(cellfun(@any,elements));
0523                     
0524                     %Check if the match is below the treshhold
0525                     score=str2num(elements{end-1});
0526                     gene=elements{1};
0527                     if score<=cutOff
0528                         %If the score is exactly 0, change it to a very
0529                         %small value to avoid NaN
0530                         if score==0
0531                            score=10^-250; 
0532                         end
0533                         %Check if the gene is added already and, is so, get
0534                         %the best score for it
0535                         I=hTable.get(gene);
0536                         if any(I)
0537                             koGeneMat(i,I)=score;    
0538                         else
0539                             geneCounter=geneCounter+1;
0540                             %The gene was not present yet so add it
0541                             hTable.put(gene,geneCounter);
0542                             genes{geneCounter}=gene;
0543                             koGeneMat(i,geneCounter)=score;
0544                         end
0545                     end
0546                 else
0547                     break;
0548                 end
0549             end
0550         end
0551         fclose(fid);
0552     end
0553 end
0554 koGeneMat=koGeneMat(:,1:geneCounter);
0555 
0556 %Remove the genes for each KO that are below minScoreRatioKO.
0557 for i=1:size(koGeneMat,1)
0558     J=find(koGeneMat(i,:));
0559     if any(J)
0560         koGeneMat(i,J(log(koGeneMat(i,J))/log(min(koGeneMat(i,J)))<minScoreRatioKO))=0;
0561     end
0562 end
0563 
0564 %Remove the KOs for each gene that are below minScoreRatioG
0565 for i=1:size(koGeneMat,2)
0566     J=find(koGeneMat(:,i));
0567     if any(J)
0568         koGeneMat(J(log(koGeneMat(J,i))/log(min(koGeneMat(J,i)))<minScoreRatioG),i)=0;
0569     end
0570 end
0571 
0572 %Create the new model
0573 model.genes=genes(1:geneCounter);
0574 model.grRules=cell(numel(model.rxns),1);
0575 model.grRules(:)={''};
0576 model.rxnGeneMat=sparse(numel(model.rxns),numel(model.genes));
0577 
0578 %Loop through the reactions and add the corresponding genes
0579 for i=1:numel(model.rxns)
0580     if isstruct(model.rxnMiriams{i})
0581         %Get all KOs
0582         I=find(strcmpi(model.rxnMiriams{i}.name,'urn:miriam:kegg.ko'));
0583         KOs=model.rxnMiriams{i}.value(I);
0584         %Find the KOs and the corresponding genes
0585         J=ismember(KOModel.rxns,KOs);
0586         [crap K]=find(koGeneMat(J,:));
0587         
0588         if any(K)
0589             model.rxnGeneMat(i,K)=1;
0590             %Also delete KOs for which no genes were found. If no genes at all
0591             %were matched to the reaction it will be deleted later
0592             L=sum(koGeneMat(J,:),2)==0;
0593             model.rxnMiriams{i}.value(I(L))=[];
0594             model.rxnMiriams{i}.name(I(L))=[];
0595         end
0596     end
0597 end
0598 
0599 %Find and delete all reactions without any genes. This also removes genes
0600 %that are not used (which could happen because minScoreRatioG and
0601 %minScoreRatioKO)
0602 I=sum(model.rxnGeneMat,2)==0;
0603 model=removeRxns(model,I,true,true);
0604 
0605 %Add the gene associations as 'or'
0606 for i=1:numel(model.rxns)
0607     %Find the involved genes
0608     I=find(model.rxnGeneMat(i,:));
0609     model.grRules{i}=['(' model.genes{I(1)}];
0610     for j=2:numel(I)
0611         model.grRules{i}=[model.grRules{i} ' or ' model.genes{I(j)}];
0612     end
0613     model.grRules{i}=[model.grRules{i} ')'];
0614 end
0615 end
0616 
0617 %Supporter function to list the files in a directory and return them as a
0618 %cell array
0619 function files=listFiles(directory)
0620     temp=dir(directory);
0621     files=cell(numel(temp),1);
0622     for i=1:numel(temp)
0623         files{i}=temp(i,1).name;
0624     end
0625     files=strrep(files,'.fa','');
0626     files=strrep(files,'.hmm','');
0627     files=strrep(files,'.out','');
0628     files=strrep(files,'.faw','');
0629 end

Generated on Tue 23-Apr-2013 15:18:37 by m2html © 2005