Characterization of Genetic Signal Sequences with Batch-Learning SOM


  • Takashi Abe
  • Shun Ikeda
  • Shigehiko Kanaya
  • Kennosuke Wada
  • Toshimichi Ikemura



batch learning SOM, BL-SOM, oligonucleotide frequency, the Earth Simulator, genome informatics, DDC: 004 (Data processing, computer science, computer systems)


Kohonen's self-organizing map (SOM), an unsupervised clustering algorithm, is an effective tool for clustering and visualizing high-dimensional complex data on a single map. We previously modified the conventional SOM for genome informatics to obtain the Batch-Learning SOM (BL-SOM), in which the learning process and the resulting map are independent of the order of data input. We generated BL-SOMs for tetra- and pentanucleotide frequencies in 300,000 10-kb sequences from 13 eukaryotes for which almost complete genomic sequences are available. BL-SOM recognized species-specific characteristics of oligonucleotide frequencies in most 10-kb sequences, permitting species-specific classification of the sequences without any information regarding the species. We next constructed BL-SOMs with tetra- and pentanucleotide frequencies in 37,086 full-length mouse cDNA sequences. With BL-SOM we also analyzed occurrence patterns of oligonucleotides that are thought to be involved in transcriptional regulation in the human genome.
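As a rough illustration of the two ingredients named in the abstract, the following sketch computes a tetranucleotide frequency vector for a sequence and performs one batch-style SOM update. It is not the authors' BL-SOM implementation: the function names `kmer_frequencies` and `batch_som_epoch`, the one-dimensional map, the rectangular neighborhood, and the plain averaging rule are all assumptions chosen to keep the example short. The point it shows is that, because every best-matching node is found against frozen weights before any node moves, the result does not depend on the order of the input vectors (up to floating-point rounding).

```python
from itertools import product


def kmer_frequencies(seq, k=4):
    """Relative frequency of every DNA k-mer in seq, in A/C/G/T lexical order."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if word in counts:  # windows containing N or other symbols are skipped
            counts[word] += 1
    total = sum(counts.values())
    return [counts[w] / total if total else 0.0 for w in kmers]


def batch_som_epoch(weights, data, radius=1):
    """One batch update on a 1-D map (simplifying assumption).

    Each node is replaced by the mean of all input vectors whose
    best-matching node lies within `radius` grid steps of it.
    Nodes that receive no vectors keep their old weights.
    """
    n_nodes, dim = len(weights), len(weights[0])
    sums = [[0.0] * dim for _ in range(n_nodes)]
    hits = [0] * n_nodes
    for x in data:
        # Best-matching node by squared Euclidean distance to the frozen weights.
        bmu = min(
            range(n_nodes),
            key=lambda j: sum((a - b) ** 2 for a, b in zip(weights[j], x)),
        )
        for j in range(max(0, bmu - radius), min(n_nodes, bmu + radius + 1)):
            hits[j] += 1
            for d in range(dim):
                sums[j][d] += x[d]
    return [
        [s / hits[j] for s in sums[j]] if hits[j] else list(weights[j])
        for j in range(n_nodes)
    ]
```

In use, each 10-kb sequence would be turned into a 256-dimensional vector with `kmer_frequencies(seq, 4)` (1,024 dimensions for `k=5`), and the list of vectors would be passed to `batch_som_epoch` repeatedly, typically with a shrinking `radius`. Shuffling the list between calls should leave the resulting map unchanged, which is the property the batch formulation is meant to provide.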