TY  - JOUR
T1  - Ecological Interactions and the Netflix Problem
JF  - bioRxiv
DO  - 10.1101/089771
SP  - 089771
AU  - Philippe Desjardins-Proulx
AU  - Idaline Laigle
AU  - Timothée Poisot
AU  - Dominique Gravel
Y1  - 2016/01/01
UR  - http://biorxiv.org/content/early/2016/11/26/089771.abstract
N2  - Species interactions are key components of ecosystems, but we generally have an incomplete picture of who-eats-who in a given community. Different techniques have been devised to predict species interactions using theoretical models or abundances. Here, we explore the K nearest neighbour approach, with a special emphasis on recommendation, along with other machine learning techniques. Recommenders are algorithms developed for companies like Netflix to predict whether a customer would like a product given the preferences of similar customers. These machine learning techniques are well-suited to ecological interactions, since they focus on positive-only data. We also explore how the K nearest neighbour approach can be used with both positive and negative information, in which case the goal of the algorithm is to fill missing entries in a matrix (imputation). By removing a prey from a predator, we find that recommenders can guess the missing prey about 50% of the time on the first try, out of up to 881 possibilities. Traits do not significantly improve the results of the K nearest neighbour approach, although a simple test with a supervised learning approach (random forests) shows that we can predict interactions with high accuracy using only three traits per species. These techniques are complementary, addressing different scenarios. Recommenders can predict interactions in the absence of traits, using only information about other species' interactions. Supervised learning algorithms such as random forests base their predictions on traits only; they do not exploit other species' interactions and are useful when traits are known for the species. The random forest result shows that binary interactions can be predicted without regard to the ecological community given only three variables: body mass and two variables for the species' phylogeny. Further work should focus on developing custom similarity measures specialized for ecology to improve the KNN algorithms, and on using richer data to capture indirect relationships between species.
ER  -