Massively Parallel Machine Learning in the Virtual Observatory as a Key Technology in the Era of Multi-Million Spectral Surveys

P. Škoda* and Mihir Arjunwadkar2
Astronomical Institute of the Czech Academy of Sciences, Fricova 298, Ondrejov


Abstract

The archives of multi-object spectral surveys such as SDSS or LAMOST currently contain millions of pipeline-reduced spectra of celestial objects. Most can be identified as stars of recognised spectral types through quick comparison with extensive lists of template spectra. To date, the dominant application of spectral libraries has been the statistical estimation of similarity, computed sequentially or with simple parallelism, by comparing all the survey spectra and their PCA components with a grid of templates.
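As a rough illustration of the template comparison described above, the following sketch picks the best-matching template by least-squares residual after fitting a single scale factor. It is a minimal toy under stated assumptions, not the procedure of any survey pipeline: the function name `best_template`, the template names, and all flux values are hypothetical, and real pipelines also weight by flux errors and work on a common wavelength grid.

```python
def best_template(flux, templates):
    """Return (name, residual) of the template closest in shape to `flux`.

    Each template is multiplied by its least-squares scale factor first,
    so only the spectral shape matters, not the absolute flux level.
    """
    best_name, best_resid = None, float("inf")
    for name, tmpl in templates.items():
        # Scale a that minimises sum((flux - a * tmpl)^2)
        scale = sum(f * t for f, t in zip(flux, tmpl)) / sum(t * t for t in tmpl)
        resid = sum((f - scale * t) ** 2 for f, t in zip(flux, tmpl))
        if resid < best_resid:
            best_name, best_resid = name, resid
    return best_name, best_resid


# Hypothetical templates on a shared 5-pixel wavelength grid (made-up numbers)
templates = {
    "absorption": [1.0, 0.9, 0.5, 0.9, 1.0],
    "emission": [1.0, 1.2, 2.0, 1.2, 1.0],
}
observed = [2.0, 2.5, 3.9, 2.3, 2.0]  # made-up observed spectrum
match_name, match_resid = best_template(observed, templates)
```

Because every spectrum is compared with every template independently, this kind of loop parallelises trivially, which is the "simple parallelism" referred to above.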

In this paper we propose a new approach that uses modern machine-learning techniques such as semi-supervised training, deep learning, or outlier detection to help identify rare cases of unusual objects, such as stars with strong emission lines or P-Cyg profiles, or blazars, and to eliminate the instrumental and processing artefacts that cannot be handled correctly by a normal streaming pipeline. The volume of data and the time-consuming algorithms require a ‘Big Data’ approach, using massively parallel processing in the cloud with modern technologies such as GPGPUs, Hadoop and Spark.
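To make the idea of outlier detection concrete, the sketch below flags spectra whose line strength deviates strongly from the bulk of the sample, using a robust modified z-score based on the median absolute deviation (MAD). This is a deliberately simple stand-in, not the semi-supervised or deep-learning methods proposed here; the function `mad_outliers`, the threshold of 3.5, and the H-alpha index values are all assumptions made for illustration.

```python
from statistics import median


def mad_outliers(values, threshold=3.5):
    """Return indices of values whose modified z-score exceeds `threshold`."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        return []  # no spread in the sample, nothing can be ranked
    # The factor 0.6745 rescales the MAD to be comparable with a
    # standard deviation for normally distributed data.
    return [i for i, v in enumerate(values)
            if 0.6745 * abs(v - med) / mad > threshold]


# Hypothetical H-alpha indices (line-centre flux / continuum) for 8 spectra
halpha_index = [0.62, 0.58, 0.65, 0.60, 0.61, 2.40, 0.59, 0.63]
candidates = mad_outliers(halpha_index)
```

In this toy sample only the sixth spectrum, with its line well above the continuum, is flagged as an emission-line candidate. A flagged spectrum may equally be an instrumental artefact, which is why the results still need to be verified.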

An important step in verifying the results is interactive visualisation and cross-matching with other data, such as photometric surveys, spectra acquired by other surveys, space missions and multi-wavelength data of similar coverage, as well as comparison with alternative models. All this can be easily achieved through correct exploitation of Virtual Observatory standards.




Keywords: stars: emission-line, Be; methods: data analysis; techniques: spectroscopic; virtual observatory; machine learning