The 2009 International Conference on High Performance Computing & Simulation
June 21-24, 2009, Leipzig, Germany
Using Grids to Support Recommender Systems: A Case Study of Generating Movie Recommendations on the EELA-2 Infrastructure
Leandro Neumann Ciuffo
Istituto Nazionale di Fisica Nucleare (INFN)
The proposed demo presents a "gridified" implementation of a Recommender System (RS) based on the classic Collaborative Filtering (CF) algorithm, which can be used by on-line businesses as a CRM tool.
The Collaborative Filtering technique relies on large datasets in order to generate accurate recommendations. However, for a large retailer like Amazon.com, with tens of millions of customers and millions of distinct catalogue items, generating accurate recommendations in real time is impractical. This limitation has driven research into a large number of alternative solutions. Hence, several variations of the classic CF algorithm are present in the literature, but all of them reduce recommendation quality to some extent.
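To make the scalability issue concrete, the following is a minimal sketch of a user-based CF recommender. It is an illustration under stated assumptions, not the implementation used in this work: the similarity measure (Pearson correlation), the weighted-average prediction, the function names and the toy ratings are all hypothetical choices. Note that generating recommendations for one user requires comparing that user with every other user, so a full run over all users grows quadratically with the number of users.

```python
from math import sqrt


def pearson(a, b):
    """Pearson correlation over the items both users rated (0.0 if undefined)."""
    common = sorted(set(a) & set(b))
    if len(common) < 2:
        return 0.0
    mean_a = sum(a[i] for i in common) / len(common)
    mean_b = sum(b[i] for i in common) / len(common)
    num = sum((a[i] - mean_a) * (b[i] - mean_b) for i in common)
    den_a = sqrt(sum((a[i] - mean_a) ** 2 for i in common))
    den_b = sqrt(sum((b[i] - mean_b) ** 2 for i in common))
    den = den_a * den_b
    return num / den if den else 0.0


def recommend(target, ratings, top_n=5):
    """Rank items the target has not rated, using positively correlated users."""
    scores, weights = {}, {}
    for other, other_ratings in ratings.items():
        if other == target:
            continue
        sim = pearson(ratings[target], other_ratings)
        if sim <= 0:
            continue  # ignore dissimilar or uncorrelated users
        for item, value in other_ratings.items():
            if item not in ratings[target]:
                scores[item] = scores.get(item, 0.0) + sim * value
                weights[item] = weights.get(item, 0.0) + sim
    predicted = {item: scores[item] / weights[item] for item in scores}
    # Highest predicted rating first; ties broken by item id.
    return sorted(predicted.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]


# Hypothetical toy data: user id -> {movie id: rating}.
ratings = {
    "u1": {"m1": 5, "m2": 3, "m3": 4},
    "u2": {"m1": 5, "m2": 3, "m3": 4, "m4": 5},
    "u3": {"m1": 1, "m2": 5, "m3": 2, "m4": 1, "m5": 4},
}
result = recommend("u1", ratings)
```

In this toy example only "u2" correlates positively with "u1", so the only candidate for "u1" is "m4", the one movie "u2" rated that "u1" has not seen.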
In this scenario, where traditional CF systems have suffered from scalability issues, Grid computing appears as an innovative approach that can be used for running ever-larger workloads and services.
This work is essentially a case study of porting the classic CF algorithm to a Grid environment. In order to test our approach empirically, we developed an application that runs on the EELA-2 infrastructure (http://eoc.eu-eela.eu) and uses CF to generate recommendations for the users of an on-line Movie Recommender System (http://canalcinefilia.com.br).
Two different Grid middleware stacks are part of the EELA-2 infrastructure: gLite and OurGrid. Currently, four versions of the proposed application have been deployed, three of them based on gLite. The demo will also present and discuss the multiple strategies adopted in porting our application to both middleware stacks.
To perform this case study we developed an RS to recommend movies on the Cinefilia website (http://canalcinefilia.com.br). Our current dataset consists of more than 900 movies, which received 36,984 ratings provided by 380 unique users. Users can freely interact with the Cinefilia website to rate as many movies as they wish. Cinefilia uses a standard MySQL database to store and retrieve information about its users. Our application was originally deployed on the GILDA test-bed (https://gilda.ct.infn.it/) prior to being ported to the EELA-2 infrastructure. In one of our implementation approaches, we use parametric jobs where each user is a parameter of the actual executable file.
Such a distributed approach can, in principle, reduce the execution time of the original algorithm, since each job launched to the Grid is in charge of calculating the recommendations for one single user. The overall execution time to generate recommendations for all users will depend on the number of Computing Elements (CEs) and Worker Nodes (WNs) available on the Grid, as well as on the efficiency of the Workload Management System (WMS) in distributing the tasks. The best scenario would be to have all jobs running simultaneously on different Worker Nodes. The larger the number of free CPUs in a Grid, the better the chance that this scenario occurs. However, early experiments have shown that many jobs tend to be kept waiting in execution queues for several minutes. To cope with this problem, instead of submitting hundreds of small jobs (one per user), we chose to cluster them in groups of 50 users. Hence, each job is in charge of calculating recommendations for a group of up to 50 users.
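The grouping step can be sketched as follows. This is a minimal illustration, assuming that users are identified by the integers 1 to 380 and that groups are formed from consecutive ids; the function name and the id scheme are hypothetical, and only the group size of 50 and the total of 380 users come from the text.

```python
def batch_users(user_ids, batch_size=50):
    """Split user ids into consecutive groups of at most `batch_size`."""
    if batch_size < 1:
        raise ValueError("batch_size must be positive")
    ids = list(user_ids)
    return [ids[i:i + batch_size] for i in range(0, len(ids), batch_size)]


# 380 users, as in the current Cinefilia dataset (ids are an assumption).
batches = batch_users(range(1, 381))
print(len(batches), len(batches[-1]))  # prints "8 30"
```

With 380 users this yields 8 jobs instead of 380: seven groups of 50 users and a final group of 30.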
In both implementation strategies, a pre-processing script must be executed in the User Interface prior to launching the jobs. This script is in charge of downloading all ratings from the on-line MySQL database; splitting the users' ratings into several text files, one per user; registering the files in the LCG File Catalog (LFC); and storing them on a Grid Storage Element (SE). We actually create two replicas of the input files: one stored in an SE in Spain and another in an SE in Brazil. This approach aims at reducing the download time by avoiding the transfer of those files across the Atlantic.
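The splitting step of the pre-processing script could look like the sketch below. It covers only the local part (one text file per user) and leaves out the database download and the registration in the LFC. The row layout, the file naming scheme `user_<id>.txt` and the "movie rating" line format are assumptions made for illustration, since the abstract does not describe the actual file format.

```python
import os
import tempfile
from collections import defaultdict


def write_user_files(rows, out_dir):
    """Write one text file per user from (user_id, movie_id, rating) rows."""
    by_user = defaultdict(list)
    for user_id, movie_id, rating in rows:
        by_user[user_id].append((movie_id, rating))
    paths = {}
    for user_id, items in by_user.items():
        # Hypothetical naming scheme: user_<id>.txt
        path = os.path.join(out_dir, f"user_{user_id}.txt")
        with open(path, "w", encoding="utf-8") as fh:
            for movie_id, rating in sorted(items):
                fh.write(f"{movie_id} {rating}\n")
        paths[user_id] = path
    return paths


# Hypothetical rows, as they might be returned by a query on the ratings table.
rows = [(1, 10, 4), (2, 10, 5), (1, 7, 3)]
out_dir = tempfile.mkdtemp()
paths = write_user_files(rows, out_dir)
```

Each resulting file would then be registered in the catalogue and replicated, which is done with Grid tools outside the scope of this sketch.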
At the end of the job execution, it is possible to retrieve multiple output files (one .SQL file per user). It is worth noting that the recommendations for all users are also registered in an AMGA server. Therefore, there are two ways of delivering the outcome of the execution (i.e., the movie recommendations) to the Cinefilia website database: (i) another script is launched to assemble all .SQL files into a single one containing the recommendations for all users, and the final step of our workflow is then to update the on-line database using the generated .SQL file; (ii) one of the website pages interacts directly with the AMGA server by means of a PHP API for AMGA. In order to be authenticated to use the Grid services, the API code takes, as one of its input values, the personal X.509 public-private key pair of the author of this paper. All communications between the client and the AMGA server are sent over HTTPS connections.
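Option (i) amounts to concatenating the per-user output files. A minimal sketch is given below, assuming the files sit in one local directory and carry a lowercase `.sql` extension; the function name, the file names and the `recommendations` table in the sample statements are hypothetical.

```python
import glob
import os
import tempfile


def merge_sql_files(in_dir, out_path):
    """Concatenate every *.sql file in `in_dir` into `out_path`, in name order."""
    out_abs = os.path.abspath(out_path)
    sources = [
        p for p in sorted(glob.glob(os.path.join(in_dir, "*.sql")))
        if os.path.abspath(p) != out_abs  # never read the output file itself
    ]
    with open(out_path, "w", encoding="utf-8") as out:
        for path in sources:
            with open(path, encoding="utf-8") as src:
                text = src.read()
            if text:
                # Make sure each file's last statement ends with a newline.
                out.write(text if text.endswith("\n") else text + "\n")
    return len(sources)


# Hypothetical per-user outputs retrieved from the Grid.
work_dir = tempfile.mkdtemp()
samples = {
    "user_1.sql": "INSERT INTO recommendations VALUES (1, 42);",
    "user_2.sql": "INSERT INTO recommendations VALUES (2, 7);\n",
}
for name, statement in samples.items():
    with open(os.path.join(work_dir, name), "w", encoding="utf-8") as fh:
        fh.write(statement)

merged_path = os.path.join(tempfile.mkdtemp(), "all_users.sql")
merged_count = merge_sql_files(work_dir, merged_path)
```

The merged file can then be applied to the on-line database in a single step, for instance with the standard MySQL client.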
Grids first emerged within scientific communities, such as High Energy Physics (HEP) experiments. However, the enormous research activity in recent years has contributed to the development of new areas of interest. Commercial users have been attracted by this technology, which can potentially be exploited by industries and SMEs to offer new services with reduced costs and higher performance.
This expansion from science to business is bringing Grids closer to "utility computing," where computing power is viewed as a utility, available on a pay-as-you-use basis, like gas or electricity. This is not yet the case, but there are several ongoing initiatives to develop new tools and Grid services that should allow the outsourcing of computing resources in the short term.
The author understands that Grid computing technologies are becoming ever more mature and well known. As a consequence, new applications are being deployed on Grids, even those that are not very CPU-intensive and that could be handled by small computing clusters.
Hence, the impact of the proposed work is manifold:
As future work, we intend to discuss new strategies for distributing the workload among the gLite Worker Nodes, as well as to run new experiments in order to present a detailed comparison of execution times between the two middleware stacks. We also intend to test some Cloud computing services and to create a new version of our Recommender System exploring the GRelC service (http://grelc.unile.it).
Leandro N. Ciuffo was born in Brazil in 1978 and holds a B.Sc. in Computer Science from the Universidade Federal de Juiz de Fora (UFJF) and an M.Sc. in Computing from the Universidade Federal Fluminense (UFF), both Brazilian universities. Since September 2006 he has collaborated with INFN Catania (Italy), working as an activity manager in two EC Grid projects: EELA (http://www.eu-eela.org/first-phase.php) and its successor EELA-2 (http://www.eu-eela.eu/). Currently Leandro is in charge of the NA3 activity (Application Support) of the EELA-2 project.
The author thanks the EELA-2 Project for its support. EELA-2 is co-funded by the European Commission in the frame of the Seventh EU Framework Programme - Research Infrastructures. This work reflects only the author's views. The Community is not liable for any use that may be made of the information contained therein.