Discovering network structures at Algodan/HIIT

What?

The Discovering Network Structures (DiNS) project is a research collaboration between several prominent researchers at Algodan / HIIT working on methods for inferring networks from large amounts of quantitative data.

Wiki

Note that there is an internal wiki for discussion among DiNS researchers.

Why?

Network-like structures are numerous in various domains. For example, the molecular level processes of a living organism can be modelled by using gene regulatory networks and metabolic networks. The human society is full of different social networks. The Internet is a huge network, with a plethora of subnetworks. New computational methods are needed for finding the structure of such networks and for understanding their dynamic behaviour. As explicit knowledge of the networks is typically missing, we have to resort to computational techniques that reconstruct networks from data. As growing amounts of data describing such networks are available, we see improved possibilities for significant progress on this very demanding area.

Principal investigators

Jaakko Hollmen, chief research scientist
Patrik Hoyer, academy research fellow
Aapo Hyvärinen, professor
Mikko Koivisto, academy research fellow
Heikki Mannila, academy professor
Petri Myllymäki, professor
Teemu Roos, postdoctoral researcher
Juho Rousu, professor
Hannu Toivonen, professor
Esko Ukkonen, professor

Research topics, publications, and software

	Biomine Public biological databases contain huge amounts of rich data, such as annotated sequences, proteins, domains, and orthology groups, genes and gene expressions, gene and protein interactions, scientific articles, and ontologies. The Biomine project develops methods for the analysis of such collections of data. Representative publications Link discovery in graphs derived from biological databases Petteri Sevon, Lauri Eronen, Petteri Hintsanen, Kimmo Kulovesi, Hannu Toivonen. 3rd International Workshop on Data Integration in the Life Sciences 2006 (DILS'06), LNBI 4705, 35-49, Hinxton, UK, July 2006. Springer. Finding Reliable Subgraphs from Large Probabilistic Graphs Petteri Hintsanen, Hannu Toivonen. Data Mining and Knowledge Discovery 17 (1): 3-23. 2008. Demo The Biomine prototype search engine for biological data integrates data from several publicly available biological databases with the aim of aiding explorative discovery of connections between biological entities, such as genes and phenotypes: http://biomine.cs.helsinki.fi/
	Seriation The problem in seriation is finding an ordering (total or partial order) among entities. For instance, in paleontology, the data may consist of a collection of fossil sites, a set of taxa, and the presence/absence information for all taxa, and the problem is to find a good ordering for the sites. Representative publications Seriation in Paleontological Data Using Markov Chain Monte Carlo Methods K. Puolamäki, M. Fortelius, H. Mannila. PLoS Comput Biol 2(2): e6 Software Software for seriation in paleontology data
	Learning discrete-valued Bayesian networks Several DiNS-researchers have worked on the topic of learning Bayesian networks for discrete-valued random variables. Both Bayesian approaches and methods based on the Minimum Description Length principle have been investigated. Representative publications Exact Bayesian structure discovery in Bayesian networks Mikko Koivisto and Kismat Sood. Journal of Machine Learning Research, 5:549-573, 2004. Advances in exact Bayesian structure discovery in Bayesian networks Mikko Koivisto. In: R. Dechter and T. Richardson (eds.), Proceedings of the 22nd Conference on Uncertainty in Artificial Intelligence (UAI 2006), pp. 241-248, AUAI Press, 2006. A Simple Approach for Finding the Globally Optimal Bayesian Network Structure T. Silander and P. Myllymäki. Pp. 445-452 in Proceedings of the 22nd Conference on Uncertainty in Artificial Intelligence (UAI-2006), edited by R. Dechter and T. Richardson. AUAI Press, 2006 On Sensitivity of the MAP Bayesian Network Structure to the Equivalent Sample Size Parameter T. Silander and P. Kontkanen and P. Myllymäki. Pp. 360-367 in the Proceedings of the The 23rd Conference on Uncertainty in Artificial Intelligence (UAI-2007), edited by R. Parr and L. van der Gaag. AUAI Press, 2007 Factorized NML Criterion for Learning Bayesian Network Structures T. Silander, T. Roos, P. Kontkanen, and P. Myllymäki. Pp. 257-264 in Proceedings of the 4th European Workshop on Probabilistic Graphical Models (PGM-08), edited by M. Jaeger and T. Nielsen. September 17-19, 2008, Hirtshals, Denmark. Software and demos The computer program REBEL is available from Mikko Koivisto. B-course is a web-based data analysis tool for Bayesian modeling, in particular multi-variate dependence and causal modeling and classification, hosted by the CoSCo group. Bene is a software for exact structure learning hosted by the CoSCo group. The program source code for Linux and Windows can be accessed from the demo page.
	Learning continuous-valued linear causal networks In the LiNGAM project, we develop methods for identifying data-generating processes with a DAG structure, where the observed variables are continuous-valued and all the mechanisms are linear. Representative publications A linear, non-gaussian acyclic model for causal discovery S. Shimizu, P.O. Hoyer, A. Hyvärinen, and A.J. Kerminen. Journal of Machine Learning Research, 7:2003-2030, 2006. Estimation of causal effects using linear non-gaussian causal models with hidden variables P. O. Hoyer, S. Shimizu, A. J. Kerminen, and M. Palviainen. International Journal of Approximate Reasoning 49: 362-378, 2008. Software We distribute a LiNGAM matlab package for performing the basic LiNGAM analysis.
	Modeling metabolic networks In this project, computational methods for reconstructing metabolic networks are developed. The methods utilize optimization techniques and graph traversal algorithms to discover a set of biochemical reactions that is most likely catalyzed by the enzymatic genes of the target organism. An important aspect of the developed methods is that they generate networks which are structurally consistent (gapless). Representative publications An analytic and systematic framework for estimating metabolic flux ratios from 13C tracer experiments Ari Rantanen, Juho Rousu, Paula Jouhten, Nicola Zamboni, Hannu Maaheimo, Esko Ukkonen. BMC Bioinformatics 2008, 9:266 A computational method for reconstructing gapless metabolic networks Esa Pitkänen, Ari Rantanen, Juho Rousu, Esko Ukkonen. Proc. 2nd International Conference on Bioinformatics Research and Development (BIRD '08). Communications in Computer and Information Science, Vol. 13, Springer, 2008
	Computer-assisted stemmatology Stemmatology (a.k.a. stemmatics) studies relations among different variants of a document that have been gradually built from an original text by copying and modifying earlier versions. The aim of such study is to reconstruct the family tree (causal graph) of the variants. Representative publications A Compression-Based Method for Stemmatic Analysis T.Roos, T.Heikkilä, and P. Myllymäki (2006). In Proc. ECAI-06, pp. 805-806. See also an extended version of the paper, and a related challenge and causality workbench task. Evaluating methods for computer-assisted stemmatology using artificial benchmark data sets T. Roos and T. Heikkilä. Literary and Linguistic Computing, doi:10.1093/llc/fqp002. 2009.