Amazon EC2 and other on-demand or cloud computing vendors are providing a way to scale up and scale down your computing and storage requirements.
Over the last 18 months or so I have reviewed and demoed a bunch of data mining and business intelligence (BI) software and suites. This was both for my own benefit, so I could check out what is out there in this software space, and to see how effective each is at providing a way to scale your BI computing needs up and down.
Many people want to use these compute clouds like dedicated hosting providers, which is the wrong use, at least until the prices drop to be comparable to dedicated hosting.
What they are for is peak-demand resourcing and short-term and mid-term computing requirements. This is especially true for tasks that are constrained by CPU and that scale well.
Let's look at the short-term computing example:
You have a task which needs 16 core-hours of work. On one machine with 4 cores (2x dual core) that is 4 hours per core, so roughly 4 hours to complete. However, the task scales, so if you throw 16 cores at it, it will be completed in about 1 hour.
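To make the arithmetic concrete, here is a minimal sketch in Python. It reads the example as 16 core-hours of total work and assumes the task parallelizes perfectly with no startup or coordination overhead, which real jobs rarely manage. The function name `wall_clock_hours` is my own, purely for illustration.

```python
def wall_clock_hours(core_hours: float, cores: int) -> float:
    """Ideal wall-clock time for perfectly parallel work (no overhead assumed)."""
    if cores <= 0:
        raise ValueError("cores must be positive")
    return core_hours / cores


# Total work in the example: 4 cores busy for 4 hours each.
CORE_HOURS = 16

for cores in (4, 8, 16):
    print(cores, "cores ->", wall_clock_hours(CORE_HOURS, cores), "hours")
```

The total core-hours you pay for stay the same in this idealized model; what you are buying with the extra cores is the shorter wait.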
I see this as the core of providing any kind of on-demand service. The main selling point is you are saving people time. They can take that time and reinvest it (if you like) and do more analysis or more in-depth analysis. They can broaden the techniques they use to analyze their data or use the same techniques in more detail.
One example of this type of broadening of analysis is using Weka and running many different learning algorithms on the same training and testing set of data to find the best one. This is a perfect use of on-demand computing.
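The shape of that job is "one independent task per algorithm, same data for all of them", which is why it fans out so well. Below is a rough sketch of that shape in Python rather than Weka itself. The two toy learners, the data, and all the names are stand-ins I made up for illustration; in practice each task would be a real learning algorithm, and the CPU-bound work would run in separate processes or on separate machines rather than in threads.

```python
from concurrent.futures import ThreadPoolExecutor

# Toy (feature, label) pairs. Purely illustrative data.
TRAIN = [(1, 0), (2, 0), (3, 0), (6, 1), (7, 1), (8, 1)]
TEST = [(0, 0), (4, 0), (5, 1), (9, 1)]


def train_majority(train):
    """Baseline learner: always predict the most common training label."""
    labels = [y for _, y in train]
    majority = max(set(labels), key=labels.count)
    return lambda x: majority


def train_threshold(train):
    """Predict 1 when x is above the midpoint between the two class means."""
    zeros = [x for x, y in train if y == 0]
    ones = [x for x, y in train if y == 1]
    cut = (sum(zeros) / len(zeros) + sum(ones) / len(ones)) / 2
    return lambda x: 1 if x > cut else 0


LEARNERS = {"majority": train_majority, "threshold": train_threshold}


def evaluate(name):
    """Train one learner on TRAIN and return (name, accuracy on TEST)."""
    model = LEARNERS[name](TRAIN)
    correct = sum(1 for x, y in TEST if model(x) == y)
    return name, correct / len(TEST)


def best_learner():
    """Evaluate every learner as its own task and pick the most accurate."""
    with ThreadPoolExecutor() as pool:
        results = dict(pool.map(evaluate, LEARNERS))
    return max(results, key=results.get), results


if __name__ == "__main__":
    print(best_learner())
```

Because no task depends on another, adding a tenth or a hundredth algorithm only adds more tasks to hand out, not more waiting.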
So how do you provide this type of service under a cloud computing model?
You have to introduce a queue.
Your base load is done by a dedicated hosting provider and your peak load by starting and stopping (ramping up and down) the compute resources as required.
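One way to picture the ramping decision is below, as a small Python sketch. It assumes you can measure the queue in core-hours and that you have a target time to drain it; both the function name and the parameters are hypothetical, and a real scheduler would also have to allow for machine startup time and billing granularity.

```python
import math


def on_demand_cores(queued_core_hours: float, base_cores: int,
                    deadline_hours: float) -> int:
    """Extra cores to rent so the queue drains within the deadline.

    Assumes perfectly parallel work; returns 0 when the dedicated
    (base) machines can already keep up.
    """
    if deadline_hours <= 0:
        raise ValueError("deadline_hours must be positive")
    needed = math.ceil(queued_core_hours / deadline_hours)
    return max(0, needed - base_cores)


# 16 core-hours queued, 4 dedicated cores:
print(on_demand_cores(16, 4, deadline_hours=4))  # base load covers it
print(on_demand_cores(16, 4, deadline_hours=1))  # rent the difference
```

When the result drops back to zero, the rented machines can be shut down and the dedicated ones carry on with the base load.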
Why stop with one queue? Many service providers have many channels for accessing their services. Work which must be done immediately, where time is at a premium, would pay a premium for that immediacy. Work which isn't required until the next business day can be completed when spare resources are freed up.
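A minimal sketch of the two-tier idea in Python, using a single priority heap in place of separate queues. The tier names and the `TieredQueue` class are my own inventions for illustration; pricing and billing are left out entirely.

```python
import heapq
import itertools

# Lower number = served first. Tier names are illustrative assumptions.
TIERS = {"immediate": 0, "next_business_day": 1}


class TieredQueue:
    """Serve premium work first, and FIFO within each tier."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # tie-breaker keeps FIFO order

    def submit(self, job, tier):
        heapq.heappush(self._heap, (TIERS[tier], next(self._counter), job))

    def next_job(self):
        """Return the next job to run, or None when the queue is empty."""
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[2]


if __name__ == "__main__":
    q = TieredQueue()
    q.submit("nightly report", "next_business_day")
    q.submit("urgent model refit", "immediate")
    print(q.next_job())  # the urgent job jumps ahead
```

The overnight work still gets done; it simply waits until no premium work is ahead of it.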
Another way to approach this is to let a market form for the compute resources. The best mechanism for controlling the use of resources is a price signal.
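I haven't settled on a particular market mechanism, so treat the following as one possible sketch rather than a recommendation: a simple sealed-bid allocation in Python where the highest bidders per core-hour are served first until capacity runs out. The function name, the bid format, and the "lowest accepted bid" clearing price are all assumptions made for illustration.

```python
def allocate(bids, capacity):
    """Hand out cores to the highest bidders first.

    bids: list of (bidder, price_per_core_hour, cores_wanted).
    Returns (allocation dict, lowest accepted price or None).
    """
    allocation = {}
    clearing_price = None
    for bidder, price, wanted in sorted(bids, key=lambda b: b[1], reverse=True):
        if capacity == 0:
            break
        granted = min(wanted, capacity)
        allocation[bidder] = granted
        capacity -= granted
        clearing_price = price
    return allocation, clearing_price


if __name__ == "__main__":
    bids = [("a", 0.10, 8), ("b", 0.40, 8), ("c", 0.25, 8)]
    print(allocate(bids, capacity=12))
```

The price signal does the rationing here: whoever values the immediacy most gets the cores, and everyone else either waits or raises their bid.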
Have Fun
Saturday, October 18, 2008
Datamining in the cloud