Linked Data Mining Challenge

Know@LOD 2015 will host the third edition of the Linked Data Mining Challenge.

Note: Deadlines for the challenge result submission and associated papers are different from those of the regular Know@LOD 2015 papers, see the timeline below.

Prizes

  • The best performing solution will be awarded an Amazon voucher worth 100€.

General Overview of the Challenge

Linked data represents a novel type of data source that has so far remained nearly untouched by advanced data mining methods. It breaks down many traditional assumptions about source data and thus poses a number of challenges:

  • While the individual published datasets typically follow a relatively regular, relational-like (or hierarchical, in the case of taxonomic classification) structure, the presence of semantic links among them makes the resulting ‘hyper-dataset’ akin to general graph datasets. On the other hand, compared to graphs such as social networks, there is a larger variety of link types in the graph.
  • The datasets have been published for entirely different purposes, such as statistical data publishing based on legal commitment of government bodies vs. publishing of encyclopedic data by internet volunteers vs. data sharing within a researcher community. This introduces further data modeling heterogeneity and uneven degree of completeness and reliability.
  • The amount and diversity of resources, as well as of their link sets, are steadily growing, which allows new linked datasets to be included in the mining dataset nearly on the fly; at the same time, however, it makes the feature selection problem extremely hard.

The Linked Data Mining Challenge (LDMC) will consist of one task, which is the prediction of the review class of movies.

The best participant in the challenge will receive an award. The ranking of the participants will be made by the LDMC organizers, taking into account both the quality of the submitted LDMC paper (evaluated by Know@LOD workshop PC members) and the prediction quality (i.e., accuracy, see below).

Task Overview

The task concerns the prediction of movies' review class, i.e., "good" or "bad". The initial dataset was retrieved from Metacritic, which offers an average rating of all-time reviews for a list of movies. The ratings were used to divide the movies into two classes: movies with a score above 60 are regarded as "good" movies, while movies with a score below 40 are regarded as "bad" movies. For each movie we provide the corresponding DBpedia URI. These mappings can be used to extract semantic features from DBpedia or other LOD repositories, to be exploited in the learning approaches proposed in the challenge.
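As one possible starting point, the provided DBpedia URIs can be used to fetch semantic features via SPARQL. The sketch below only builds such a query string for a single movie; actually sending it to the public DBpedia endpoint (e.g., with the SPARQLWrapper library) is left out, and the choice of properties (types and Wikipedia categories) is just one illustrative option.

```python
def build_feature_query(movie_uri):
    """Return a SPARQL query selecting the types and categories of a movie.

    A minimal sketch; rdf:type and dct:subject are standard DBpedia
    properties, but participants may of course query any others.
    """
    return (
        "PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>\n"
        "PREFIX dct: <http://purl.org/dc/terms/>\n"
        "SELECT ?p ?o WHERE {\n"
        f"  <{movie_uri}> ?p ?o .\n"
        "  FILTER (?p IN (rdf:type, dct:subject))\n"
        "}"
    )

query = build_feature_query("http://dbpedia.org/resource/Best_Kept_Secret_(film)")
print(query)
```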

Source Data, Required Results, and Evaluation

The dataset is available for download here: training data, test data. It consists of a training set of 1,600 instances for learning the predictive models (this data contains the value of the target attribute) and a test set of 400 instances for evaluating the models (this data is without the target attribute). The target attribute to be predicted is the Label attribute.

The datasets contain semicolon-separated values: the movie's title (Movie), release date (Release date), DBpedia URI (DBpedia_URI), label (Label; training set only), and an id. A sample of the training CSV file is as follows:


"Movie";"Release date";"DBpedia_URI";"Label";"id"
"Best Kept Secret";9/6/13 12:00 AM;"http://dbpedia.org/resource/Best_Kept_Secret_(film)";"good";1.0

The participants have to submit their results on the test data, i.e., the predicted label for each movie. The results have to be delivered in a (syntactically correct) CSV format that includes the predicted labels.
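A sketch of serializing predictions in the same semicolon-separated style as the input files; the exact column layout expected by the organizers is an assumption here, and the prediction values are invented examples.

```python
import csv
import io

# Hypothetical id -> predicted label mapping (example values only).
predictions = {"401": "good", "402": "bad"}

buf = io.StringIO()
writer = csv.writer(buf, delimiter=';', quoting=csv.QUOTE_NONNUMERIC)
writer.writerow(["id", "Label"])
for movie_id, label in predictions.items():
    writer.writerow([movie_id, label])

print(buf.getvalue())
```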

The submitted results will be evaluated against a gold standard with respect to accuracy, calculated as:

    Accuracy = (number of correctly classified test instances) / (total number of test instances)

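In code, this metric is a simple sketch over two parallel lists of gold-standard and predicted labels (the label values below are illustrative, not actual challenge data):

```python
def accuracy(gold, predicted):
    """Fraction of predictions that match the gold-standard labels."""
    correct = sum(g == p for g, p in zip(gold, predicted))
    return correct / len(gold)

# Toy example: 3 of 4 predictions correct.
print(accuracy(["good", "bad", "good", "bad"],
               ["good", "bad", "bad", "bad"]))  # 0.75
```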
Beside the CSV file containing the predictions, the participants are expected to submit a paper describing the used methods and techniques, as well as the results obtained, i.e., the hypotheses perceived as interesting either by the computational method or by the participants themselves. 

The participants should provide a detailed description of their approach, so that it can be easily reproduced. For example, it should be clearly stated which feature sets were used (and how they were created), along with the preprocessing steps, the type of predictor, the model parameters' values and tuning, etc.

The papers will be evaluated by the evaluation panel, both with respect to the soundness and originality of the methods used, and with respect to the validity of the hypotheses and nuggets found. The papers should meet the standard norms of scientific writing.

Allowed Datasets

For building the movie review predictor, any dataset that follows the Linked Open Data principles may be used.

Non-LOD datasets may be used only if the participants later publish them in a way that makes them accessible using standard Semantic Web technologies, e.g., RDF or SPARQL.

For example, one may map the movies from the dataset to the corresponding movies in a non-LOD dataset X, allowing the retrieval of additional data from dataset X. The participants are then expected to publish the mappings from DBpedia to the dataset X movies, as well as the additional data retrieved from dataset X, for example as RDF.

Important: Since the Metacritic dataset is publicly available, we kindly ask the participants not to use the Metacritic rating scores to tune the predictor for the movies in the test set. Any submission found not to comply with this rule will be disqualified.

However, information retrieved from Metacritic other than the movies' rating score is allowed, e.g., users' textual reviews for a given movie.

Submission Procedure

Results submission

Paper submission

  • In addition to your results, you have to submit a paper describing your solution
  • The paper format is Springer LNCS, with a limit of four pages
  • Papers are submitted online via Easychair before Friday March 27th

Presentation of Results

  • Challenge papers will be included in the workshop proceedings of Know@LOD
  • The authors of the best performing systems will be asked to present their solution at the workshop

For any questions related to the submission procedure, please address the contact persons below.

Additional Resources

Baseline Model

We provide a simple classification model that will serve as a baseline. The model is implemented in the RapidMiner platform, using the Linked Open Data extension. In this process we used the movies' DBpedia URIs to extract the direct types and categories of each movie. On the resulting dataset we built a k-NN classifier (k=3) and applied it to the test set.
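The official baseline is a RapidMiner process, but the underlying idea can be sketched in a few lines of Python: represent each movie as the set of its DBpedia types and categories, and classify a new movie by majority vote among its k = 3 nearest training movies, with "nearness" measured here (as a simplifying assumption) by feature-set overlap. All feature sets and labels below are invented toy data.

```python
from collections import Counter

def knn_predict(train, labels, query, k=3):
    """Majority vote among the k training movies sharing the most features."""
    # Rank training movies by descending overlap with the query's feature set.
    ranked = sorted(range(len(train)), key=lambda i: -len(train[i] & query))
    votes = Counter(labels[i] for i in ranked[:k])
    return votes.most_common(1)[0][0]

# Toy feature sets (DBpedia types/categories) and labels, for illustration.
train = [{"dbo:Film", "dbc:Documentary_films"},
         {"dbo:Film", "dbc:Horror_films"},
         {"dbo:Film", "dbc:Documentary_films", "dbc:2013_films"},
         {"dbo:Film", "dbc:Horror_films", "dbc:Sequel_films"}]
labels = ["good", "bad", "good", "bad"]

print(knn_predict(train, labels, {"dbo:Film", "dbc:Documentary_films"}))  # good
```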

The process can be downloaded here.

Movies LOD Sources

Besides DBpedia, other movie-related LOD sources exist, e.g., DBtropes and LinkedMDB.

Timeline

  • 19 January 2015: Data for the LDMC task available for download
  • 27 March 2015: Submission of paper and solution deadline
  • 03 April 2015: Notification of acceptance

Organization

Contact persons

  • Petar Ristoski, University of Mannheim, Germany, petar.ristoski (at) informatik.uni-mannheim.de
  • Heiko Paulheim, University of Mannheim, Germany, heiko (at) informatik.uni-mannheim.de
  • Vojtěch Svátek, University of Economics, Prague, svatek (at) vse.cz 
  • Václav Zeman,  University of Economics, Prague, vaclav.zeman (at) vse.cz
