Yahoo machine learning: increase accuracy by favoring scale over accuracy

At the Hadoop Summit earlier this year, Yahoo shared the story of the journey they made with Hadoop, right from the moment they started using it 10 years ago up until today. With 600 petabytes of data stored on Hadoop, 43,000 servers and 1 million jobs run every single day, Yahoo can be called a hardcore “Hadooper”. Peter Cnudde, VP Engineering at Yahoo, explains how Yahoo is using Hadoop by means of some use cases, such as Flickr.

Flickr is a Yahoo application for sharing and storing photos. Yahoo is using Hadoop to show people the prettiest pics uploaded, in both their private collections and the public sphere. How? Through machine learning: by feeding the ‘machine’ a variety of examples, they teach it to recognize patterns and eventually identify beauty. Yahoo uses machine learning in everything they do, from showing you the most beautiful pictures on Flickr to showing you the most relevant search results and blocking spam from your mailbox. In the case of Flickr, Yahoo stores all pictures people upload worldwide (and that’s a lot of pictures) on Hadoop, unleashes its machine learning on them and succeeds in identifying the most beautiful ones.

They apply the same method to ads. How do they know which ads to show to which people at which moments? Thanks to a technique called ‘ad click prediction’. By analyzing users’ flow and click behavior in search and on websites, they train machine learning models that better predict users’ clicking behavior, and can thus show more relevant ads.
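To make the idea of click prediction a bit more concrete, here is a minimal sketch in Python. It is not Yahoo's model: it assumes a plain logistic regression trained with stochastic gradient descent, and the feature names (`ad_matches_query`, `clicked_similar_before`) and the toy click log are made up purely for illustration.

```python
import math


def sigmoid(z):
    """Squash a raw score into a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def score(weights, bias, features):
    """Weighted sum of the feature values plus the bias."""
    return bias + sum(weights.get(name, 0.0) * value
                      for name, value in features.items())


def train_click_model(examples, epochs=200, learning_rate=0.5):
    """Fit a logistic regression on (features, clicked) pairs with SGD."""
    weights = {}
    bias = 0.0
    for _ in range(epochs):
        for features, clicked in examples:
            # Difference between the predicted click probability and reality.
            error = sigmoid(score(weights, bias, features)) - clicked
            bias -= learning_rate * error
            for name, value in features.items():
                weights[name] = weights.get(name, 0.0) - learning_rate * error * value
    return weights, bias


def predict_click(weights, bias, features):
    """Estimated probability that the user clicks the ad."""
    return sigmoid(score(weights, bias, features))


# Hypothetical click log: each entry is (features of an impression, clicked?).
toy_log = [
    ({"ad_matches_query": 1.0, "clicked_similar_before": 1.0}, 1),
    ({"ad_matches_query": 1.0, "clicked_similar_before": 0.0}, 1),
    ({"ad_matches_query": 0.0, "clicked_similar_before": 1.0}, 0),
    ({"ad_matches_query": 0.0, "clicked_similar_before": 0.0}, 0),
]

weights, bias = train_click_model(toy_log)
p_relevant = predict_click(weights, bias, {"ad_matches_query": 1.0, "clicked_similar_before": 0.0})
p_irrelevant = predict_click(weights, bias, {"ad_matches_query": 0.0, "clicked_similar_before": 1.0})
```

After training, `p_relevant` should come out higher than `p_irrelevant`, so the ad that matches the query would be the one to show. A production system would use far more features and far more data, which is where the scale of Hadoop comes in.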

For Yahoo, scale is key. The more parameters you can store and process, the more accurate the results are. Plain as day! Of course, with a database of that size, errors are bound to occur. Yahoo’s response? Let them. By giving up some correctness, they vastly increase scale and speed. The trade-off pays off: with a little accuracy sacrificed along the way, they can process more data and in the end provide more accurate results.
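One well-known way to trade a little correctness for scale is feature hashing, sketched below in Python. This is an illustration only; the talk as summarized here does not say which techniques Yahoo uses. The sketch assumes a fixed, deliberately tiny number of buckets (`NUM_BUCKETS`) and invented feature names. Different features may collide in the same bucket, which introduces small errors, but memory use stays constant no matter how many features arrive.

```python
import zlib

# Hypothetical size, kept tiny on purpose so that collisions are guaranteed.
NUM_BUCKETS = 16


def bucket_for(feature_name, num_buckets=NUM_BUCKETS):
    """Map a feature name to a bucket index using a stable hash."""
    return zlib.crc32(feature_name.encode("utf-8")) % num_buckets


def hashed_vector(feature_names, num_buckets=NUM_BUCKETS):
    """Count features per bucket; colliding features share one slot."""
    vector = [0.0] * num_buckets
    for name in feature_names:
        vector[bucket_for(name, num_buckets)] += 1.0
    return vector


# 1000 distinct, made-up feature names squeezed into 16 slots.
features = [f"user_{i}_clicked_ad_{i % 7}" for i in range(1000)]
vector = hashed_vector(features)
used_buckets = {bucket_for(name) for name in features}
```

No feature is lost (the counts still add up to 1000), but individual features can no longer be told apart once they share a bucket. That is the kind of small, tolerated error that buys room for many more parameters.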

In 10 years’ time, Yahoo has made a hell of a Hadoop journey, one that is probably still in store for a lot of organizations out there. At first, Hadoop was employed solely for web search. Gradually they started using it throughout the company. Today they are doing big things with machine learning, and Hadoop is a substantial part of this. According to Peter Cnudde, machine learning is the basis of innovation and will eventually change society.