Modern Data Mining in Astronomy

Interacademiaal College / Interacademic Course 2009

Overview

The amount of data produced by observational and theoretical work in astronomy and astrophysics today is very large, so it is important to be able to organise and access these data efficiently.

This course aims to give you a solid basic understanding of databases in astronomy, experience in accessing them and applying them to astronomical research, and an introduction to the statistical tools of astronomical computing.

Throughout the course a number of practical exercises will be given. A key aim is to enable students to reach a level where they can apply the tools to their own research problems. At the end of the course the student should be able to plan optimal data processing for their research, select the necessary tools, and use the Virtual Observatory for data mining and for publishing their own results.

After taking this course you should be able to search and access astronomical databases efficiently, and know the basics of accessing databases both through online interfaces and programmatically. You will also be able to apply up-to-date statistical techniques to your data and use these to mine large datasets for scientifically interesting information.

Practical details

The course will be given by Andrey Belikov (Groningen), Jarle Brinchmann (Leiden) and Edwin Valentijn (Groningen).

The lectures and practical classes will take place in room 112 in the astronomy department, Buys Ballot Laboratory, Utrecht University. This link shows the location in Google maps. The address is Princetonplein 5, 3584CC, Utrecht.

Each course day will run from 11:00 to 15:00 with a short break for lunch.

Structure, evaluation and time-budget

The evaluation (grading) of the course will be done on the basis of a written project to be handed in towards the end of the semester. More details regarding this will be given in the lectures and on this web-site when available.

Each day will consist of a 2-hour lecture, followed after lunch by a 1.5-hour practical session related to the day's lecture. During the practical session ("werkcollege") we will get hands-on experience with topics covered in the lectures. These tasks will not be graded, but they are an integral and important part of the course and will be assumed as background for your final project work.

The overall time allocated to the IAC is 200hrs. This will be distributed in the following way:

This does not add up to 200 hrs because we count each lecture day as 8 hrs for the students, to allow for the time spent travelling.

Some suggested literature for the course

There is no single book for this course; rather, a number of books are useful as references for various aspects of it. Individual lectures might also have suggested literature, so follow the lecture links below to see this.

The following are books that are useful for the topics of the course in general:

Detailed lecture overview

The person mainly responsible for each lecture is indicated in square brackets. The detailed plan of the individual lectures is still somewhat preliminary, but the overall content of the course is set. Click a lecture title to access the page for that lecture, which will also contain links, literature suggestions and, after the lecture has been given, the lecture slides.

February 3

Introduction. Astronomical Data and Data Mining Overview.

Responsible: J. Brinchmann

This lecture will introduce present-day data handling in astronomy. We will cover aspects of both the data gathering process and data storage. The history of data centers in astronomy will be reviewed. This will provide a context for the course and a first introduction to the use of these facilities.

February 17

Database and programming basics

Responsible: A. Belikov/E. Valentijn

To use a database efficiently, or to build your own database, it is necessary to have some knowledge of database management systems and how to interact with them. This lecture will introduce the main query language for databases, SQL, and give an overview of how databases operate and how to interact with them through programs.
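As a rough illustration of what interacting with a database through a program can look like, the sketch below uses Python's standard `sqlite3` module on an in-memory database. The table `sources`, its columns and all values are invented for this example and do not correspond to any real survey; the lecture itself may use a different database system and language.

```python
import sqlite3

# Hypothetical toy catalogue: table name, column names and values are
# invented for illustration and do not come from any real survey.
ROWS = [
    (1, 150.10, 2.20, 17.5),
    (2, 150.30, 2.25, 19.1),
    (3, 210.80, -1.05, 16.2),
    (4, 211.00, -1.10, 21.4),
]


def build_catalogue(rows):
    """Create an in-memory SQLite database holding a small source table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE sources ("
        "id INTEGER PRIMARY KEY, ra_deg REAL, dec_deg REAL, r_mag REAL)"
    )
    # Parameterised inserts: the driver fills in the '?' placeholders.
    conn.executemany("INSERT INTO sources VALUES (?, ?, ?, ?)", rows)
    conn.commit()
    return conn


def bright_sources(conn, mag_limit):
    """Return (id, r_mag) for sources brighter than mag_limit, brightest first."""
    cursor = conn.execute(
        "SELECT id, r_mag FROM sources WHERE r_mag < ? ORDER BY r_mag",
        (mag_limit,),
    )
    return cursor.fetchall()


conn = build_catalogue(ROWS)
print(bright_sources(conn, 18.0))
```

The selection and the sorting are both done by the database in the SQL statement, so the program only receives the rows it asked for. This matters once the table is too large to load into memory.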

Along with SQL we will introduce some other standards used in programming (XML, UML) and languages for data visualisation and processing (IDL, Python).

The second part of the lecture will be dedicated to efficient visualisation of data.

March 3

Introduction to databases in astronomy

Responsible: J. Brinchmann

Databases in astronomy exist to allow us both to do science more efficiently and to apply novel tools and techniques to explore data. The lecture will first give some examples of astronomical databases and how to access the data.

We will take an in-depth look at the Sloan Digital Sky Survey as an example of the practical use of SQL applied to observational data. This will be complemented with an in-depth look at the Millennium database, a theoretical dataset provided on the web through a database and accessible via SQL queries.

This will be followed by an overview of how these databases have been used in science and the potential uses for future scientific studies.

March 17

Introduction to statistical methods in astronomy & the computational tools

Responsible: J. Brinchmann

A good understanding of basic statistics and an overview of statistical methods are essential to make good use of the massive amounts of astronomical data that current and future facilities produce. This lecture will start with a recap of basic statistics, introduce Bayesian statistics and correlation functions, give an introduction to statistical computing (mostly IDL and R), and give an overview of data mining techniques and tools.

March 31

Astro-Wise

Responsible: A. Belikov/E. Valentijn

Astro-Wise is an integrated environment, an information system, for handling massive amounts of data, from raw data obtained directly from the telescope, through reduced data, to the final scientific analysis. This lecture will introduce the system and give students a practical introduction to Astro-Wise and how to use it for astronomical research. Part of the lecture will be dedicated to the astrometric and photometric reduction of astronomical images using Astro-Wise.

April 14

Statistical methods in astronomy & the computational tools (continued)

Responsible: J. Brinchmann

This lecture will continue the discussion of statistical techniques in data mining. We will cover data summarising and characterisation, as well as density estimation (histograms and kernel estimation). Large datasets often require classification or dimensionality reduction, so we will also discuss principal component analysis, cluster analysis and neural networks, and give a summary of other current techniques.

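To make the two density estimation approaches named above concrete, here is a minimal sketch in Python with NumPy that compares a normalised histogram with a Gaussian kernel estimate. The synthetic two-population sample, the bin count, the bandwidth and the function names are all assumptions chosen for illustration; they are not taken from the lecture material.

```python
import numpy as np


def histogram_density(samples, bins, value_range):
    """Histogram normalised so that it integrates to one over value_range."""
    density, edges = np.histogram(
        samples, bins=bins, range=value_range, density=True
    )
    return density, edges


def gaussian_kernel_estimate(samples, grid, bandwidth):
    """Evaluate a Gaussian kernel density estimate at the points in grid."""
    samples = np.asarray(samples, dtype=float)
    grid = np.asarray(grid, dtype=float)
    # Scaled distance between every grid point and every sample.
    z = (grid[:, None] - samples[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z**2) / np.sqrt(2.0 * np.pi)
    # Average the kernels over the samples and rescale by the bandwidth.
    return kernels.sum(axis=1) / (samples.size * bandwidth)


# Synthetic data (illustrative assumption): two Gaussian populations.
rng = np.random.default_rng(42)
samples = np.concatenate(
    [rng.normal(-2.0, 0.5, 300), rng.normal(1.0, 1.0, 700)]
)

grid = np.linspace(-8.0, 8.0, 801)
kde = gaussian_kernel_estimate(samples, grid, bandwidth=0.3)
hist, edges = histogram_density(samples, bins=40, value_range=(-8.0, 8.0))

# Both estimates should integrate to approximately one.
dx = grid[1] - grid[0]
print("kernel estimate integral:", kde.sum() * dx)
print("histogram integral:", (hist * np.diff(edges)).sum())
```

The histogram depends on the choice of bin width and bin edges, while the kernel estimate depends on the bandwidth. In both cases too small a value gives a noisy estimate and too large a value washes out real structure, such as the two peaks in this sample.
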
April 28

The Virtual Observatory

Responsible: A. Belikov/E. Valentijn

The Virtual Observatory (see for instance IVO and references therein) is a system that aims to link the various astronomical databases around the world through a unified interface, so that data from a wide range of facilities can be combined. This lecture will give an overview of the Virtual Observatory, how to interact with it and how to help extend it. Part of the lecture will be dedicated to the practical use of the VO and the tools available to interact with it.

May 12

Future directions in astronomical data management

Responsible: A. Belikov/E. Valentijn

This lecture will conclude the course and look at recent and future developments in astronomical data management. This will include an introduction to Grid computing and related tools, and a review of upcoming information systems in astronomy. Practical work with present-day Grid facilities (EGEE-III http://www.eu-egee.org/) will be part of the lecture.

Tasks for exams

Tasks for the Data Mining in Astronomy course exams

Please select a task and send your choice to Edith Fayolle by 5 May. At most three people can work on each task, so it is best to send a ranked list of your three preferred tasks.

The deadline for submission of the project will be the end of June, and there will be a subsequent oral presentation at about the same time (see PDF for some details). In your email to Edith Fayolle, please therefore also give your availability in late June/early July for the short presentation, which will probably take place in Utrecht (TBD).

News

28/1 An updated version of the course schedule with more in-depth descriptions of the lectures as well as suggested literature is available. Note that a swap of lectures has taken place with the VO lecture now happening April 28.

Lectures