Home > RAVEN > getGenesFromKEGG.m

getGenesFromKEGG

PURPOSE ^

getGenesFromKEGG

SYNOPSIS ^

function model=getGenesFromKEGG(keggPath,koList)

DESCRIPTION ^

 getGenesFromKEGG
   Retrieves information on all genes stored in KEGG database

   keggPath    this function reads data from a local FTP dump of the KEGG
               database. keggPath is the pathway to the root of the database

   koList      the number of genes in KEGG is very large. koList can be a
               cell array with KO identifiers, in which case only genes
               belonging to one of those KEGG orthologies are retrieved (opt,
               default all KOs with associated reactions)

   model       a model structure generated from the database. The following
               fields are filled
               id:             'KEGG'
               description:    'Automatically generated from KEGG database'
               rxns:           KO ids
               rxnNames:       Name for each entry
               genes:          IDs for all the genes. Genes are saved as
                               organism abbreviation:id (same as in KEGG).
                               'HSA:124' for example is alcohol dehydrogenase
                               in Homo sapiens
               rxnGeneMat      A binary matrix that indicates whether a
                               specific gene is present in a KO id
  
   If the file keggGenes.mat is in the RAVEN directory it will be loaded
   instead of parsing of the KEGG files. If it does not exist it will be
   saved after parsing of the KEGG files. In general, you should remove
   the keggGenes.mat file if you want to rebuild the model structure from
   a newer version of KEGG.

   Usage: model=getGenesFromKEGG(keggPath,koList)

   Rasmus Agren, 2012-12-16

CROSS-REFERENCE INFORMATION ^

This function calls: This function is called by:

SUBFUNCTIONS ^

SOURCE CODE ^

0001 function model=getGenesFromKEGG(keggPath,koList)
0002 % getGenesFromKEGG
0003 %   Retrieves information on all genes stored in KEGG database
0004 %
0005 %   keggPath    this function reads data from a local FTP dump of the KEGG
0006 %               database. keggPath is the pathway to the root of the database
0007 %
0008 %   koList      the number of genes in KEGG is very large. koList can be a
0009 %               cell array with KO identifiers, in which case only genes
0010 %               belonging to one of those KEGG orthologies are retrieved (opt,
0011 %               default all KOs with associated reactions)
0012 %
0013 %   model       a model structure generated from the database. The following
0014 %               fields are filled
0015 %               id:             'KEGG'
0016 %               description:    'Automatically generated from KEGG database'
0017 %               rxns:           KO ids
0018 %               rxnNames:       Name for each entry
0019 %               genes:          IDs for all the genes. Genes are saved as
0020 %                               organism abbreviation:id (same as in KEGG).
0021 %                               'HSA:124' for example is alcohol dehydrogenase
0022 %                               in Homo sapiens
0023 %               rxnGeneMat      A binary matrix that indicates whether a
0024 %                               specific gene is present in a KO id
0025 %
0026 %   If the file keggGenes.mat is in the RAVEN directory it will be loaded
0027 %   instead of parsing of the KEGG files. If it does not exist it will be
0028 %   saved after parsing of the KEGG files. In general, you should remove
0029 %   the keggGenes.mat file if you want to rebuild the model structure from
0030 %   a newer version of KEGG.
0031 %
0032 %   Usage: model=getGenesFromKEGG(keggPath,koList)
0033 %
0034 %   Rasmus Agren, 2012-12-16
0035 %
0036 
0037 %NOTE: This is how one entry looks in the file
0038 
0039 % ENTRY       K11440                      KO
0040 % NAME        gbsB
0041 % DEFINITION  choline dehydrogenase [EC:1.1.1.1]
0042 % CLASS       Metabolism; Amino Acid Metabolism; Glycine, serine and threonine
0043 %             metabolism [PATH:ko00260]
0044 % DBLINKS     RN: R08557 R08558
0045 %             COG: COG1454
0046 % GENES       BSU: BSU31050(gbsB)
0047 %             BCY: Bcer98_1477
0048 %             BLI: BL02563(gbsB)
0049 %             BLD: BLi03269(gbsB)
0050 %             BCL: ABC0979(gbsB)
0051 %             BAY: RBAM_028150(gbsB)
0052 %             BPU: BPUM_2734(gbsB)
0053 % ADDITIONAL  INFO...
0054 % ///
0055 
0056 %The file is not tab-delimited. Instead each label is 12 characters
0057 %(except for '///')
0058 
0059 %Check if the genes have been parsed before and saved. If so, load the
0060 %model.
0061 [ST I]=dbstack('-completenames');
0062 ravenPath=fileparts(ST(I).file);
0063 genesFile=fullfile(ravenPath,'kegg','keggGenes.mat');
0064 if exist(genesFile, 'file')
0065     fprintf(['NOTE: Importing KEGG genes from ' strrep(genesFile,'\','/') '.\n']);
0066     load(genesFile);
0067 else
0068     %Download required files from KEGG if it doesn't exist in the directory
0069     downloadKEGG(keggPath);
0070     
0071     %Get all KOs that are associated to reactions
0072     allKOs=getAllKOs(keggPath);
0073     
0074     %Since the list of genes will be accessed many times I use a hash table
0075     geneMap=containers.Map('KeyType','char','ValueType','int32');
0076     
0077     %Add new functionality in the order specified in models
0078     model.id='KEGG';
0079     model.description='Automatically generated from KEGG database';
0080 
0081     %Preallocate memory
0082     model.rxns=cell(numel(allKOs),1);
0083     model.rxnNames=cell(numel(allKOs),1);
0084 
0085     %Reserve space for 500000 genes
0086     model.genes=cell(500000,1);
0087 
0088     %Load information on KO ID, name, and associated genes
0089     fid = fopen(fullfile(keggPath,'ko'), 'r');
0090 
0091     %Keeps track of how many KOs that have been added
0092     koCounter=0;
0093     nGenes=0;
0094     addingGenes=false;
0095     
0096     %These contain the mapping between KOs and genes and are used to
0097     %generate the model.rxnGeneMat in the end
0098     koIndex=zeros(5000000,1);
0099     geneIndex=koIndex;
0100     nMappings=0;
0101 
0102     skipRead=false;
0103 
0104     %Loop through the file
0105     while 1
0106       %Get the next line
0107       if skipRead==false
0108         tline = fgetl(fid);
0109       else
0110         skipRead=false;
0111       end
0112 
0113       %Abort at end of file
0114       if ~ischar(tline)
0115           break;
0116       end
0117 
0118       %Skip '///'
0119       if numel(tline)<12
0120           addingGenes=false;
0121           continue;
0122       end
0123 
0124       %Check if it's a new reaction
0125       if strcmp(tline(1:12),'ENTRY       ')
0126           %Check if it should be added
0127           koID=tline(13:18);
0128 
0129           if ismember(koID,allKOs)
0130              addMe=true;
0131              koCounter=koCounter+1
0132           else
0133              addMe=false;
0134              continue;
0135           end      
0136 
0137           %Add reaction ID (always 6 characters)
0138           model.rxns{koCounter}=koID;
0139           model.rxnNames{koCounter}=''; %Will be overwritten if it exists
0140       end
0141 
0142       %To get here we must be in a KO that should be added
0143       if addMe==true
0144           %Add name
0145           if strcmp(tline(1:12),'DEFINITION  ')
0146               model.rxnNames{koCounter}=tline(13:end);
0147           end
0148 
0149           %Check if it's time to start adding genes
0150           if strcmp(tline(1:12),'GENES       ')
0151               addingGenes=true;
0152           end
0153 
0154           %Add genes
0155           if addingGenes==true
0156              %We are now adding genes for the current KO. All gene rows start
0157              %with 12 spaces. If this is not true it means that there is
0158              %additional annotation such as AUTHORS. This is not common but it
0159              %exists.
0160              if strcmp(tline(1:12),'            ') || strcmp(tline(1:12),'GENES       ')
0161                  geneLine=tline(13:end);
0162 
0163                  %Check if the line is from a new organism of from the same as
0164                  %before
0165                  if strcmp(geneLine(4),':')
0166                      currentOrganism=geneLine(1:3);
0167                  end
0168 
0169                  %Parse the string
0170                  genes=regexp(geneLine(6:end),' ','split');
0171                  genes=strcat(currentOrganism,':',genes(:));
0172                  
0173                  %Add the genes to the gene list
0174                  for i=1:numel(genes)
0175                      geneString=genes{i};
0176                      if geneMap.isKey(geneString);
0177                         index=geneMap(geneString);
0178                      else
0179                         nGenes=nGenes+1;
0180                         model.genes{nGenes}=geneString;
0181                         index=nGenes;
0182                         geneMap(geneString)=index;
0183                      end
0184                      
0185                      %Add the connection to the KO
0186                      nMappings=nMappings+1;
0187                      koIndex(nMappings)=koCounter;
0188                      geneIndex(nMappings)=index;
0189                  end
0190              else
0191                  %Now we want to throw away everything until the next
0192                  %entry
0193                  while 1
0194                     tline = fgetl(fid);
0195                     if strcmp(tline,'///')
0196                         %When the new entry is found, skip reading one line to
0197                         %fit with the rest of the code
0198                         skipRead=true;
0199                         break;
0200                     end
0201                  end
0202              end
0203           end
0204       end
0205     end
0206     %Close the file
0207     fclose(fid);
0208 
0209     %If too much space was allocated, shrink the model
0210     model.rxns=model.rxns(1:koCounter);
0211     model.rxnNames=model.rxnNames(1:koCounter);
0212     model.genes=model.genes(1:nGenes);
0213     
0214     %Retrieve and add the gene associations
0215     model.rxnGeneMat=sparse(koIndex(1:nMappings),geneIndex(1:nMappings),ones(nMappings,1));
0216     
0217     %To make sure the size is correct if the last KOs don't have genes
0218     if size(model.rxnGeneMat,1)~=koCounter
0219         model.rxnGeneMat(koCounter,1)=0;
0220     end
0221     
0222     %Save the model structure
0223     save(genesFile,'model');
0224 end
0225 
0226 %Only get the KOs in koList
0227 I=~ismember(model.rxns,koList);
0228 model=removeRxns(model,I,true,true);
0229 end
0230 
0231 function allKOs=getAllKOs(keggPath)
0232     %Retrieves all KOs that are associated to reactions. This is because
0233     %the number of genes in KEGG is very large so without this parsing it
0234     %would take many hours
0235     
0236     allKOs={};
0237     
0238     %First check if the reactions have already been parsed
0239     [ST I]=dbstack('-completenames');
0240     ravenPath=fileparts(ST(I).file);
0241     rxnsFile=fullfile(ravenPath,'kegg','keggRxns.mat');
0242     if exist(rxnsFile, 'file')
0243         fprintf(['NOTE: Importing KEGG ORTHOLOGY list from ' strrep(rxnsFile,'\','/') '.\n']);
0244         load(rxnsFile,'model');
0245         
0246         %Loop through the reactions and add the corresponding genes
0247         for i=1:numel(model.rxns)
0248             if isstruct(model.rxnMiriams{i})
0249                 %Get all KOs
0250                 allKOs=[allKOs;model.rxnMiriams{i}.value(strcmpi(model.rxnMiriams{i}.name,'urn:miriam:kegg.ko'))];
0251             end
0252         end
0253     else
0254         %Parse the reaction file instead
0255         %First load information on reaction ID, reaction name, KO, pathway, and ec-number
0256         fid = fopen(fullfile(keggPath,'reaction'), 'r');
0257         orthology=false;
0258         while 1
0259           %Get the next line
0260           tline = fgetl(fid);
0261 
0262           %Abort at end of file
0263           if ~ischar(tline)
0264               break;
0265           end
0266 
0267           %Skip '///'
0268           if numel(tline)<12
0269               continue;
0270           end
0271           
0272           %Check if it's a new reaction
0273           if strcmp(tline(1:12),'ENTRY       ')
0274               orthology=false;
0275           end
0276           
0277           if strcmp(tline(1:9),'REFERENCE')
0278               orthology=false;
0279           end
0280 
0281           %Add KO-ids
0282           if numel(tline)>16
0283               if strcmp(tline(1:16),'ORTHOLOGY   KO: ') || strcmp(tline(1:16),'            KO: ') || strcmp(tline(1:12),'ORTHOLOGY   ') || orthology==true
0284                   if strcmp(tline(13:16),'KO:') %This is in the old version
0285                     allKOs=[allKOs;tline(17:22)];
0286                   else
0287                       if strcmp(tline(1:12),'ORTHOLOGY   ')
0288                         %This means that it found one KO in the new format and that
0289                         %subsequent lines might be other KOs
0290                         orthology=true;
0291                       end
0292                       allKOs=[allKOs;tline(13:18)];
0293                   end
0294               end
0295           end
0296         end
0297 
0298         %Close the file
0299         fclose(fid);
0300     end
0301     allKOs=unique(allKOs);
0302 end

Generated on Tue 23-Apr-2013 15:18:37 by m2html © 2005