When not to use spot instances
All three major cloud providers now offer some variant of spot instances, a concept Amazon Web Services introduced first. There are many interesting usage scenarios for these machines, but it is just as useful to know when not to use them.
Spot instances should not be used for applications or systems that cannot tolerate intermittent, random node failures. Long-running processing tasks that have no mechanism for saving partial results externally should also not run on spot instances.
Both of these examples describe designs that are neither redundant nor resilient to failure. To see why spot instances make these failure scenarios more likely, it helps to understand how spot instances work.
Spot instances, from every major cloud provider, are spare compute capacity that no customer is currently paying for at on-demand or reserved pricing. Each cloud provider has to build extra capacity into its data centers so that it can handle peak demand from all of its customers at any time, which means keeping idle compute available for the moments it is needed.
The downside, for the cloud provider, is paying for and maintaining machines that may sit unused. To make this more economical, the providers, starting with AWS, began offering spot instances: customers pay a heavily discounted rate for these nodes, but accept the risk that a machine can be taken away at any moment by another customer willing to pay the full rate for it.
When a user willing to pay full price launches an instance, the cloud provider checks whether any unused nodes of that instance type are available and, if so, hands one over to the paying customer. If all of the spare capacity is currently occupied by spot users, however, one of them will have their machine reclaimed and recycled into an instance usable by the higher-paying customer.
So, as you can see, any system, service, or application that cannot sustain downtime should not be set up on a spot instance. If the machine can essentially fail (by being repackaged for another user) at any moment, you don't want critical workloads running on it. Of course, an on-demand or reserved instance can also fail, due to real hardware faults, software issues, data center outages, and so on, but those failures are far less frequent than the simulated failure of a spot instance being reclaimed.
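The reclamation isn't entirely without warning. On AWS, for example, a spot instance receives an interruption notice via the instance metadata service roughly two minutes before it is taken away, which a well-behaved workload can poll and react to. Here is a minimal sketch of that polling pattern; the function names are my own, and other providers expose similar (but differently shaped) signals:

```python
import json
import urllib.request
import urllib.error

# AWS publishes a spot interruption notice at this metadata path roughly
# two minutes before reclaiming the instance. Until then it returns 404.
METADATA_URL = "http://169.254.169.254/latest/meta-data/spot/instance-action"

def fetch_notice(url=METADATA_URL, timeout=1.0):
    """Return the interruption notice as a dict, or None if none is scheduled."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return json.loads(resp.read().decode())
    except urllib.error.URLError:
        # A 404 (or no metadata service at all) means no interruption yet.
        return None

def should_drain(notice):
    """Decide whether the workload should start draining and checkpointing."""
    return notice is not None and notice.get("action") in ("stop", "terminate")
```

A worker loop would call `fetch_notice()` every few seconds and, once `should_drain()` is true, stop accepting new work and flush whatever state it can in the two minutes it has left.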
One system that likely should not run on spot instances is a database. A database takes read and write requests from users or applications and creates or retrieves records from its internal storage, maintaining state for whatever works with it. Normally a database would be set up to back up its data to a remote system on a regular schedule.
If a database were configured to run on top of spot instances, everything would run smoothly until demand spiked for that instance type and the node was taken away. At that point, all of the state saved on the node's local storage would be gone. That would be incredibly bad if the database were backing the user data of a website, storing patient records for a medical system, or supporting anything else that needs persistent storage to function properly.
The backups mentioned previously help, since they can be restored onto another node when one becomes available, but a window of recent writes will almost certainly be lost. The database simply won't have time to create a backup of its internal storage and upload it to the backup location between the moment it is notified the instance is being reclaimed and the moment it is actually taken away.
If a database were to run on spot instances, it would likely need to be configured in some kind of multi-master mode, arranged so that a write sent to one master is also replicated to a second or third master at the same time. That way the data lands on several machines, and if one of them is taken away the others still hold up-to-date information.
This may seem like a good setup, but with spot instances you have no idea how many nodes could be taken away at any moment. In the arrangement above, what happens if all of the master nodes are reclaimed at the same time, or within a short window of each other? The effect is the same as with the single-node database: all data since the last backup point is lost.
What about running the database's read replicas on spot instances? This is likely a much better approach than running the masters on them, but there are still potential issues. If a user is reading from a replica when it is reclaimed for a full-paying customer, that user's request fails. It may also take several seconds, or minutes, before the system notices the missing replica and stops routing requests to it.
During that time, the user sending read requests will likely have a bad experience and see the system as unresponsive. Other users may not see the same errors, since they may already be talking to a replica that wasn't reclaimed, but for some users of the service it is a decidedly less than ideal experience.
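One client-side mitigation is to retry a failed read against a different replica rather than surfacing the error immediately. A minimal sketch, with plain callables standing in for replica connections (the names and the exception type are illustrative, not any particular driver's API):

```python
import random

class AllReplicasFailed(Exception):
    """Raised when no replica answered within the allowed attempts."""

def read_with_failover(replicas, query, attempts=3):
    """Try the query against randomly chosen replicas, skipping dead ones.

    `replicas` is a list of callables standing in for replica connections;
    a reclaimed spot node simply raises ConnectionError when used.
    """
    last_error = None
    for replica in random.sample(replicas, min(attempts, len(replicas))):
        try:
            return replica(query)
        except ConnectionError as err:
            last_error = err  # node likely reclaimed; try the next one
    raise AllReplicasFailed(f"no replica answered: {last_error}")
```

This masks a single reclaimed replica from the user at the cost of extra latency on the retried request; it does nothing, of course, if every replica was reclaimed at once.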
As mentioned, another workload that should avoid spot instances is the long-running batch or processing job that cannot save partial progress along the way. If something takes hours or days to process and runs on a machine that can't be guaranteed to exist that long, all progress up to the moment the spot instance is reclaimed is lost.
Even if a job saves partial results to the spot instance's local disk while it runs, that doesn't help much unless those results are also pushed to a remote location at regular intervals. Without a remote copy, all work to date is still lost when the machine is taken away.
One example might be a machine learning job run over a large dataset loaded locally onto the spot instance. The training process makes many iterations, or epochs, over the data to learn its specifics and produce the predictions the model creator needs. Typically, every few epochs, the training system writes the current model state to local disk to record its progress. But if that partial model is never pushed to a remote location as a backup, it can be lost the moment the spot instance is reclaimed.
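The fix is the "checkpoint locally, then copy remotely" pattern. A rough sketch, where a plain directory stands in for an object store such as S3 or GCS (in a real job the `shutil.copy` would be an upload call, and all function names here are my own):

```python
import os
import shutil

def save_checkpoint(model_state: bytes, epoch: int,
                    local_dir: str, remote_dir: str) -> str:
    """Write a checkpoint locally, then copy it to durable remote storage.

    `remote_dir` stands in for an object store; the local copy dies with
    the spot instance, the remote copy survives it.
    """
    os.makedirs(local_dir, exist_ok=True)
    os.makedirs(remote_dir, exist_ok=True)
    # Zero-padded epoch number so lexicographic order matches epoch order.
    name = f"checkpoint-epoch-{epoch:06d}.bin"
    local_path = os.path.join(local_dir, name)
    with open(local_path, "wb") as f:
        f.write(model_state)
    shutil.copy(local_path, os.path.join(remote_dir, name))
    return name

def latest_checkpoint(remote_dir: str):
    """On restart, find the newest surviving checkpoint, if any."""
    files = sorted(os.listdir(remote_dir)) if os.path.isdir(remote_dir) else []
    return files[-1] if files else None
```

When a replacement instance starts, it calls `latest_checkpoint()` and resumes from that epoch instead of from scratch, so only the epochs since the last upload are lost.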
Another job that can suffer from this issue is a large sort. Consider several billion items that need to be ordered by a certain set of properties. Depending on the machine resources available, the algorithm used, and the number of fields being sorted on, this can take a very long time, and on a spot instance it runs the risk of failing because the machine is reclaimed before the full sort completes.
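A sort like this can be made interruption-tolerant by structuring it as an external merge sort that persists each sorted chunk as it finishes. A toy sketch of the idea (a directory again stands in for remote storage that outlives the instance; the function name is hypothetical):

```python
import heapq
import json
import os

def sort_resumably(items, chunk_size, chunk_dir):
    """External sort that persists each sorted chunk, so a restarted run
    skips chunks that already completed before the interruption."""
    os.makedirs(chunk_dir, exist_ok=True)
    for i in range(0, len(items), chunk_size):
        path = os.path.join(chunk_dir, f"chunk-{i // chunk_size:06d}.json")
        if os.path.exists(path):
            continue  # this chunk was sorted and saved before we were reclaimed
        with open(path, "w") as f:
            json.dump(sorted(items[i:i + chunk_size]), f)
    # Merge the individually sorted chunks into the final ordering.
    chunks = []
    for name in sorted(os.listdir(chunk_dir)):
        with open(os.path.join(chunk_dir, name)) as f:
            chunks.append(json.load(f))
    return list(heapq.merge(*chunks))
```

If the instance is reclaimed mid-run, a replacement instance re-invokes the same function and only re-sorts the chunks that never made it to storage, rather than starting the whole job over.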
Video rendering is another task that can run for a very long time, depending on the number of frames and the output quality. If a rendering job for a long video sequence were started on a spot instance, hours of compute could be spent on it, only for all of that progress to disappear the moment an on-demand customer needs a machine of that type and the cloud provider decides your instance is the one they get.
So if you are considering running large or long-running tasks on top of spot instances, design them so that partial results can be saved remotely and periodically. Otherwise you risk losing a lot of progress, along with the time and money spent on a spot instance that never finished its task.
A similar issue arises when hosting an API on spot instances. Just as with the read replicas, a user's request could be routed to a node that was working fine a moment ago but has just been reclaimed by an on-demand customer grabbing the instance away from the spot user running the API.
The customer making requests to this API may suddenly stop getting responses, because the node no longer exists. It may take several seconds, or minutes, for the API layer to detect that the spot node has gone away and start re-routing traffic to another available node, assuming the other nodes weren't reclaimed at the same time. This issue is less severe than losing a database master, but if every API node were reclaimed at once, it could be just as bad.
Another consideration is whether the API is stateless. If the API is designed to be completely stateless, meaning it doesn't matter which node a given user hits because they always get the same response, then it is much safer to run on spot instances. If the API tracks state, however, spot instances become a problem, because that state can be lost when the instance goes away.
An example would be an API that pins each logged-in user to a particular instance so that the user's session information can be tracked and maintained within the API process. The big problem with this setup is that if the node backing a specific user dies, as it does when a spot instance is reclaimed, that user's session state disappears with it. They effectively appear to be logged out, simply because the node they were pinned to went away and they are now forced to talk to another node that knows nothing about their session.
Ideally this is prevented by saving session state in a remote database rather than inside an API process running on a spot node, though whether that is easy depends on the API server the cloud customer is using and how it is configured.
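The shape of that fix is simple: every node reads and writes sessions through a shared external store, so no node is special. A toy sketch, with an in-process dict standing in for something like Redis or a managed database (all names here are illustrative):

```python
class ExternalSessionStore:
    """Stand-in for a shared session store (e.g. Redis or a managed
    database) that every API node talks to. Because no session lives
    inside a single API process, losing one spot node does not log
    anyone out."""

    def __init__(self):
        self._sessions = {}

    def save(self, session_id, data):
        self._sessions[session_id] = dict(data)

    def load(self, session_id):
        return self._sessions.get(session_id)

def handle_request(store, session_id, node_name):
    """Any node can serve any user, because the state is external."""
    session = store.load(session_id) or {"logged_in": False}
    return {"served_by": node_name, "logged_in": session["logged_in"]}
```

With this arrangement the same session answers correctly no matter which node handles the request, which is exactly the property that makes spot instances tolerable for an API tier.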
One more issue with APIs on spot instances arises if demand for that instance type suddenly surges among customers willing to pay on-demand pricing or to sign long-term contracts for reduced rates. If demand grows large enough that no spot instances remain available, or the remaining ones are out of reach at your price, there may be no capacity left for the API nodes to run on at all. At that point, any requests to the hosted API simply fail, because there are no resources left to process them.
A final situation where spot instances are a poor fit involves time sensitivity. Even if you run only short tasks on the spot instances in your cloud account, sometimes those tasks need to complete within a relatively tight window.
Even with all of your spot instances up and running the software needed to process these time-sensitive tasks, the instances can still be taken away at any moment. If that happens, the queue of tasks the instance was working through backs up, because nothing is processing it, unless another node takes over.
This extends the workload of the other nodes, which now have to pick up the slack of the missing node or nodes. Progress can slow to a point that is unacceptable given the time sensitivity of the tasks. Adding more nodes to the processing pool alleviates this, but it takes time for new nodes to start up and begin working through the queue, and it only works if more nodes are actually available, which is unlikely if one was just taken away.
If tasks must be completed within a guaranteed time frame from when they enter a work queue, spot instances are likely the wrong tool, because you cannot guarantee the nodes will be there, ready and able to process the work given to them. This is especially true for financial transactions or fraud-detection processing that has to happen quickly within a specific workflow. If those jobs were interrupted part way through, or could not complete for lack of available spot capacity, the result could be a very bad experience for end users.
Although this article covers many situations where spot instances should not be used, there are plenty of situations where they are ideal, not least because they can save an enormous amount of money compared with paying full on-demand prices. A future article will walk through many of those beneficial examples.