Sep. 5, 2008 Research Highlight Biology

Generating new information from the web

Development of a new search engine allows statistical analysis of numerous databases containing scientific papers and omics data

Figure 1: Researchers at the RIKEN BASE division have developed a new search engine that allows machines to analyze statistically the semantic web of numerous databases containing scientific papers and omics data.

Researchers from the RIKEN Bioinformatics And Systems Engineering (BASE) division (formerly the Genomic Sciences Center) in Yokohama have developed a search engine that can find statistically significant information from integrated published scientific papers and omics data. They have applied it to various problems such as using the externally observable or phenotypic characteristics of mice to estimate the location of genes in which mutations have been chemically induced.

At present, pages on the World Wide Web and the information they contain are designed to be read and handled by humans, not machines. Computer scientists hope one day to generate a Semantic Web, within which information will be understandable by machines that can then automate the process of finding, sharing and combining data, as well as analyzing it statistically. However, this demands structuring information in a form capable of being read by machine, and has led to much work on developing such datasets together with computer languages able to handle them.

But even in fields such as molecular biology, where there is an awareness of the value of producing structured machine-readable datasets, the overwhelming majority of information is still published in a non-structured form. And SPARQL, the most widely used language for querying such structured data, does not support statistical evaluation of the links between data.

In a recent paper in Bioinformatics 1, Norio Kobayashi and Tetsuro Toyoda at BASE have detailed a practical advance in dealing with this problem (Fig. 1). They have developed a new computer language, General and Rapid Association Study Query Language (GRASQL), which extends SPARQL with procedures for associating entities using statistical measures. Based on their new language, they have built a prototype rapid search engine that can be used to make statistical inferences in non-structured data.

To test the power of their computer language and its search engine, Kobayashi and Toyoda used it to evaluate links between keywords statistically, and applied it to problems such as ranking researchers in a particular field on the basis of their number of publications in that field. “It has also been applied to the PosMed system that is used to estimate those genes responsible for the phenotypes generated by N-ethyl-N-nitrosourea (ENU), which causes random mutations in genomes,” says Toyoda. “The system has contributed to more than 50 discoveries in the ENU mutant mice project.”
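The idea of ranking researchers by how strongly their publications are linked to a keyword can be illustrated with a small sketch. The article does not say which statistical measure GRASQL uses, so this example assumes a one-sided hypergeometric tail test as a stand-in; it is not the authors' implementation. The function names, the toy `docs` list, and the author labels are all hypothetical.

```python
from math import comb


def hypergeom_tail(k, K, n, N):
    """P(X >= k) when n documents are drawn from N, of which K mention the keyword."""
    upper = min(K, n)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return hits / comb(N, n)


def rank_authors(docs, keyword):
    """Rank authors by how over-represented `keyword` is among their documents.

    docs: list of (set_of_authors, set_of_keywords) pairs (hypothetical format).
    Returns (author, keyword_docs, total_docs, p_value) tuples, smallest p first.
    """
    N = len(docs)
    K = sum(1 for _, kws in docs if keyword in kws)
    counts = {}  # author -> [total documents, documents mentioning keyword]
    for authors, kws in docs:
        for author in authors:
            entry = counts.setdefault(author, [0, 0])
            entry[0] += 1
            if keyword in kws:
                entry[1] += 1
    scored = [
        (author, k, n, hypergeom_tail(k, K, n, N))
        for author, (n, k) in counts.items()
    ]
    # Sort by p-value, then by name so that ties are ordered deterministically.
    return sorted(scored, key=lambda row: (row[3], row[0]))


# Toy corpus, invented purely for illustration.
docs = [
    ({"A"}, {"ENU", "mouse"}),
    ({"A"}, {"ENU"}),
    ({"A", "B"}, {"ENU", "phenotype"}),
    ({"B"}, {"yeast"}),
    ({"B"}, {"yeast"}),
    ({"C"}, {"yeast"}),
    ({"C"}, {"ENU"}),
    ({"C"}, {"rice"}),
]

ranking = rank_authors(docs, "ENU")
for author, k, n, p in ranking:
    print(f"{author}: {k}/{n} documents mention ENU, p = {p:.4f}")
```

Compared with a plain count of publications, a test of this kind asks whether an author's share of keyword-bearing documents is larger than chance would predict given the whole corpus, which is the sort of question a purely pattern-matching query cannot answer.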

References

  1. Kobayashi, N. & Toyoda, T. Statistical search on the Semantic Web. Bioinformatics 24, 1002–1010 (2008). doi: 10.1093/bioinformatics/btn054
