AWS S3 behind Netflix’s success
◆ Netflix as the big data tycoon
Netflix is known as one of the most sophisticated users in the big data community. They appear regularly at big data conferences like Strata and discuss how they use data analytics in their business and what their infrastructure looks like.
My theory for why Netflix is successful while many others are not is that their sophisticated big data capability enables them to deliver better service at wider margins. Media industry people often see online video delivery as just another distribution channel and do not pay much attention to this “brain” part of the cloud, but it is the secret sauce of their success.
◆ From the user data to recommendations
I have tried all the major movie services over the years, including Netflix, Hulu, Apple, Amazon, cable’s TVEverywhere, as well as Joost, CinemaNow and MovieLink (remember them?). Among them, Netflix stands out for the power of its recommendations. Other services push what they want you to watch, such as new shows, while Netflix’s top page is filled with personalized recommendations.
In talks at big data conferences, Netflix shows off how they use remarkably detailed usage data to come up with such recommendations.
With streaming, Netflix knows what you watch, on which date and at what time, on what device, whether you quit partway through, where you stopped, and whether you resumed watching. It is not a simple “people who watch this movie also watch these” factor.
In my household, I hold the Netflix account and everyone else in the family shares it. Each of us has very different tastes, so I felt sorry for confusing Netflix, but they are actually one step ahead. Through the analysis of this usage data, they already roughly know the profiles of my family members. And they show it in a subtle way, with categories such as “SF Action” or “Foreign Art Films”, rather than in a creepy way such as “one for your teenage son” or “for mom”.
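To make the kind of usage record described above concrete, here is a minimal sketch in Python. The field names, the 90% completion threshold, and the `summarize` helper are all assumptions made for illustration; Netflix’s actual schema and logic were not described in the talks.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List


@dataclass
class PlaybackEvent:
    """One viewing session of one title (hypothetical field names)."""
    title_id: str
    device: str
    started_at: datetime
    stopped_at_sec: int   # position in the title where playback stopped
    duration_sec: int     # full length of the title


def summarize(events: List[PlaybackEvent]) -> dict:
    """Derive simple signals from all sessions of a single title."""
    events = sorted(events, key=lambda e: e.started_at)
    first = events[0]
    # Assumption: reaching 90% of the running time counts as finishing.
    finish_line = 0.9 * first.duration_sec
    furthest = max(e.stopped_at_sec for e in events)
    return {
        "title_id": first.title_id,
        "devices": sorted({e.device for e in events}),
        "start_hour": first.started_at.hour,
        "quit_early": first.stopped_at_sec < finish_line,
        "first_stop_sec": first.stopped_at_sec,
        "resumed": len(events) > 1,
        "completed": furthest >= finish_line,
    }
```

A viewer who stops a film on the TV after 20 minutes and finishes it on a tablet the next morning would produce two events, and the summary would show an early quit, a resume, and a completion across two devices.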
◆ Scale out on Amazon S3
Netflix is the best-known user of Amazon Web Services (AWS), which serves as the infrastructure for this massive data analytics operation. They give “data center management is not our main business” as the reason for using AWS.
Early in their history they had their own data center and ran an Oracle database, but the amount of data exploded as their online streaming service caught on, to the point where they could no longer keep up by building new data centers. So in 2009-10 they moved almost 100% to the cloud for both processing and storage, to be able to scale rapidly.
Currently, AWS’s S3 is used to store both video and user behavior data. User orders are processed in the NoSQL database Cassandra, and the data is then dumped into S3 once a day. According to an engineer’s admission in a Strata talk, they had so much trouble with this transfer process that they developed their own software for it and named it Aegisthus. Aegisthus is a figure in Greek mythology who, in some tellings of a famous tragedy, killed the Trojan princess Cassandra.
User data stored in S3 is analyzed with Hadoop tools, and the results are stored back in S3. S3 is generally known as a “pay as you go” service, but big customers like Netflix are usually assigned a fixed capacity, so they use the slack capacity for user data analytics after midnight on the West Coast, when video streaming volume drops sharply.
The speaker emphasized the concept of “the right tools for the right job” in his talk. Depending on what your business model is, you have to choose where to put your own resources and what to buy from outside. A big data strategy is not defined solely by the amount of data or the size of the company. Strategic priorities are often more important in the “build or buy” decision. Cloud storage provides advantages for enterprises of all sizes.
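The idea of running analytics on slack capacity can be sketched as simple arithmetic: whatever part of the fixed allocation streaming does not use in a given hour is available for batch jobs. The function names, the abstract “units”, and the sample numbers below are hypothetical and only illustrate the reasoning, not how Netflix actually schedules its jobs.

```python
from typing import List


def slack_by_hour(reserved_units: int, streaming_load: List[int]) -> List[int]:
    """Capacity left over each hour once streaming demand is served."""
    return [max(reserved_units - load, 0) for load in streaming_load]


def hours_for_analytics(
    reserved_units: int, streaming_load: List[int], needed_units: int
) -> List[int]:
    """Indices of the hours whose slack is large enough for the batch job."""
    slack = slack_by_hour(reserved_units, streaming_load)
    return [hour for hour, free in enumerate(slack) if free >= needed_units]


# Made-up load for three consecutive hours: evening peak, midnight, early morning.
sample_load = [95, 40, 15]
print(hours_for_analytics(100, sample_load, needed_units=50))
```

With these made-up numbers, only the two hours after the evening peak leave enough room for a job that needs 50 units.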