What is the difference between EC2 and EMR?
When most Amazon Web Services customers start out, they begin with a service known as Amazon Elastic Compute Cloud, or Amazon EC2 as it is better known. However, after building up a large amount of data, you may want to start looking into another service known as Amazon Elastic MapReduce, or Amazon EMR.
What is the difference between EC2 and EMR? Amazon EC2 is a cloud-based service which gives customers access to a wide range of compute instances, or virtual machines. Amazon EMR is a managed big data service which provides pre-configured compute clusters running software such as Apache Spark, Apache Hive, Apache HBase, Apache Flink, Apache Hudi, and Presto.
So Amazon EC2 would essentially be considered lower level than Amazon EMR, as it simply provides the machines running the operating systems and software, while Amazon EMR comes with the software pre-installed and configured by Amazon. This reduces setup time and frees the customer from all of the maintenance and patching required with a manual setup.
Amazon EC2 is a service provided by Amazon which gives access to virtual machines running operating systems and software chosen by the cloud customer. These virtual machines could run one of the many flavors of Linux, or the Microsoft Windows operating system; the choice is entirely up to the customer. However, when comparing it to Amazon EMR, it’s really the software installed on those machines that matters.
For example, any of these EC2 instances could be set up and configured to run software from the big data ecosystem, including every framework that Amazon EMR provides. The big downside to running things this way, though, is that everything needs to be set up and configured manually. The master nodes of a given big data system need to be configured one way while the worker nodes need to be configured another. On top of this, the patching and updating of these virtual machines must be managed by the customer running them.
However, when using Amazon EMR, it is still quite useful to know about the types of Amazon EC2 instances available, because when setting up and running an Amazon EMR cluster, you will need to choose the Amazon EC2 instance type that the cluster will run on top of. There are Amazon EC2 instance types for pretty much every workload you can conceive of.
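To make that choice concrete, here is a minimal sketch of the kind of cluster definition you might hand to the EMR API (for example via boto3's `run_job_flow` call). The cluster name, release label, and instance types below are illustrative assumptions, not recommendations:

```python
# A sketch of an EMR cluster definition. The name, release label, and
# instance types are illustrative assumptions; tune them to your workload.
cluster_definition = {
    "Name": "example-spark-cluster",        # hypothetical cluster name
    "ReleaseLabel": "emr-6.10.0",           # an example EMR release
    "Applications": [{"Name": "Spark"}],    # software EMR pre-installs
    "Instances": {
        # The EC2 instance types the EMR cluster will run on top of:
        "MasterInstanceType": "m5.xlarge",  # general purpose master node
        "SlaveInstanceType": "r5.2xlarge",  # memory optimized workers
        "InstanceCount": 4,                 # one master plus three workers
    },
}

# With AWS credentials configured, this could be launched with boto3:
#   import boto3
#   boto3.client("emr").run_job_flow(**cluster_definition)
print(cluster_definition["Instances"]["MasterInstanceType"])
```

The point is simply that the EC2 instance type is an explicit input to every EMR cluster, so the instance families described below still matter even though EMR manages the machines for you.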
The general purpose instances offered by Amazon EC2 are the types of machines a user would select when they need a balanced mix of CPU, memory, network, and storage performance for the workload that will be running on them. They are designed to be roughly average in each of these areas, so that they work reasonably well for almost any workload.
The general purpose EC2 instances come in a range of configurations, offering from 1 to 96 vCPUs across various underlying CPU types, and between 512 MiB and 384 GiB of RAM. Some instances have only Elastic Block Store volumes available to them, while others have up to four 900 GiB NVMe local ephemeral drives. As you can see, the variation available among the general purpose virtual machines is quite large.
Compute optimized Amazon EC2 instances are designed and configured to be heavier on the compute side. So for the same price as a general purpose Amazon EC2 instance, these machines will usually give you more cores, or alternatively a CPU with a higher clock frequency.
Again, these instances offer a range of vCPUs, starting as low as a single vCPU and increasing all the way up to 96. When comparing them to the general purpose instances, you’ll likely notice less RAM, or less storage, for the same number of vCPUs or equivalent price. When you are running a compute intensive workload with Amazon EMR, these are the types of Amazon EC2 instances to choose.
Memory optimized instances, by comparison, are designed to have more RAM than an equivalently priced general purpose or compute optimized instance. These machines currently range in memory capacity from 8 GiB up to a massive 24 TiB of RAM in the u-24tb1.metal instance. That amount of memory can hold enormous datasets in a single virtual machine in order to speed up processing on that data.
If you are running a big data workload on top of Amazon EMR and need to be able to load massive amounts of data into memory, these are the instance types to use. Machines like the u-24tb1.metal can get quite pricey to run on a continuous basis, but if you have a need for that much memory, it probably makes sense to run a machine of that size and load all of the data into memory, as long as the software being used to analyze the data can make use of it. Many of the frameworks that run on Amazon EMR can take advantage of this capacity on a single node, and the Amazon engineers have tuned the configurations for those instances if you choose to use them.
The accelerated computing instances are quite different from the others previously mentioned, mainly because they have extra hardware attached and available to the Amazon EC2 instance that wouldn’t normally be available to the other instance types.
This includes things like graphics processing units (GPUs), which are very useful for big data processing and machine learning. For example, some of these instances give access to up to 8 NVIDIA Tesla V100 GPUs, each containing 5120 CUDA cores and 640 Tensor cores, over and above the vCPUs available to the instance. The GPU cards in these machines are connected together with NVLink for optimal performance.
Another of these machine types uses the custom-made AWS Inferentia chips, which are optimized for performing inference on machine learning models. These instances can have up to 16 of these custom processors and work with the AWS Neuron SDK.
Another interesting offering in this category is the F1 instances, which give the user access to field programmable gate arrays, or FPGAs for short. These are effectively customizable hardware that allows the end user to create highly customized applications in the cloud, heavily tuned for their specific use case. If you need something incredibly optimized, this is likely the route to go.
As you can see, these types of instances are mostly used when you have a very specific use case where the extra hardware can increase performance. The extra features made available by these instances usually come at an extra cost compared to the general purpose, compute optimized, or memory optimized instances with the same amount of vCPU or RAM.
If the workload you are running requires massive amounts of storage, the storage optimized instances are likely what should back your Amazon EMR cluster. Compared to the other instance types, they can have a huge amount of extra local storage available to them.
These instances range in capacity from 475 GiB of local storage up to a vast 60,000 GiB spread across 8 local drives. An EMR cluster of these machines could provide enough storage to host basically any dataset you would want to process.
The i3 and i3en families of these instances provide local Non-Volatile Memory Express (NVMe) drives to give the best performance available for reading and writing to disk. These drives offer the fastest IOPS available, the lowest latency, and high sequential throughput. When running a workload that requires a lot of disk access, these make a lot of sense if you want the highest storage performance currently available in the cloud.
Amazon EMR is a fully managed service provided by Amazon to host and maintain clusters of Amazon EC2 instances so that they are able to run big data workloads. This means that the Amazon engineers have pre-configured software that will set up the machines to run the big data frameworks without the end customer needing to know how to set them up, and they specifically tune these software systems for the Amazon EC2 instance type selected by the end user.
As you’ll see, there is a large variety of big data software available to run on Amazon EMR. The service also allows you to run many of these systems independently of one another, so an AWS customer can run several big data workloads without each affecting the others. For example, there could be an Amazon EMR cluster set up specifically for development, another for testing, and another for production workloads. This way the dev and test environments can run without affecting any of the production work that is currently running.
Another benefit of running these big data systems on Amazon EMR is that you are billed for actual usage of the clusters. If you only need a big data processing cluster for a few hours to process a certain job, you can spin up the cluster on top of the Amazon EC2 instances, run it for as long as the job needs to complete the work, and then spin the cluster down afterwards. You will only be charged for the time that the cluster was up and processing the data. The job results can be saved long term in a database or in an Amazon S3 storage bucket.
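To make the pay-for-usage model concrete, here is a back-of-the-envelope calculation. The hourly rates below are purely hypothetical; real EC2 and EMR prices vary by instance type and region:

```python
# Hypothetical hourly rates -- real prices vary by instance type and region.
ec2_rate_per_hour = 0.20   # assumed EC2 cost per instance-hour (USD)
emr_rate_per_hour = 0.05   # assumed EMR surcharge per instance-hour (USD)

instances = 10             # size of the transient cluster
hours = 3                  # the cluster is spun down once the job completes

# You pay only while the cluster is up, so the total is simply:
total_cost = instances * hours * (ec2_rate_per_hour + emr_rate_per_hour)
print(f"${total_cost:.2f}")
```

The same cluster left running around the clock would cost ten times as much per month as one spun up for a few hours a day, which is why transient job-scoped clusters are such a common EMR pattern.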
Apache Spark is a framework that can run on top of Amazon EMR to process big data workloads, and it has become one of the standard environments for processing large datasets. It can run locally on your laptop for initial development and testing, but when you are ready to run against a large dataset it makes the most sense to run it on a cluster of machines. This could be set up manually, but the simplest way to get going is with a service like Amazon EMR, which will pre-install the software on a cluster of Amazon EC2 instances for you, tuned to work as well as possible with those machines.
To work with Apache Spark you can write your processing code in languages like Python or Scala. The code points to the Apache Spark master node, or nodes, and those nodes send the processing instructions out to the worker nodes available to the cluster. This allows you to distribute the work across many machines very simply, and Amazon EMR makes it even simpler by configuring and setting up the machines so that running jobs is easy. The maintenance and failover of the machines is completely handled by Amazon and not something the cloud customer needs to worry about.
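The driver-and-workers pattern can be illustrated in miniature with plain Python. This is only an analogy using a local thread pool, not the Spark API itself: the partitioning, the per-partition function, and the final combine step mirror what Spark does across real machines.

```python
from concurrent.futures import ThreadPoolExecutor

# A toy "dataset" split into partitions, as a Spark cluster would split it.
partitions = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]

def process_partition(records):
    # The work each "worker" performs on its own partition of the data.
    return sum(r * r for r in records)

# The "driver" farms partitions out to workers and gathers the results.
with ThreadPoolExecutor(max_workers=3) as pool:
    partial_results = list(pool.map(process_partition, partitions))

# Combine the partial results into the final answer, as Spark would.
total = sum(partial_results)
print(partial_results, total)
```

On a real cluster the partitions live on different machines and the driver only ships the function and collects the results, but the shape of the computation is the same.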
Apache Hadoop is another big data processing engine which can be run with Amazon EMR. Apache Hadoop came before Apache Spark, and Apache Spark is built on top of some of the functionality provided by Apache Hadoop.
Apache Hadoop is a MapReduce framework which allows building data processing jobs in various programming languages and spreading that work across many worker nodes. Again, this system needs one or more master nodes, usually more than one for high availability, which coordinate the map and reduce tasks across all of the worker nodes in the cluster. When running Apache Hadoop on an Amazon EMR cluster, all of the setup, configuration, and maintenance is handled by the Amazon systems and engineers.
With a MapReduce framework, the code is designed so that the work is distributed across many worker nodes as map tasks, which essentially take in records of data, do some processing on them, and output one or more output records. Sometimes the output is then fed to a reduce task, which aggregates or merges the map output records into the job’s output. That output could then be fed into further map and reduce tasks if needed to complete the work.
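The map-then-reduce flow above can be sketched in plain Python with the classic word count example. This mimics the dataflow only; a real Hadoop job would express these steps through its Java or streaming APIs and spread them across nodes.

```python
from collections import defaultdict

lines = ["big data on emr", "emr runs big data jobs"]

# Map phase: each input record becomes zero or more (key, value) pairs.
mapped = [(word, 1) for line in lines for word in line.split()]

# Shuffle: group the mapped pairs by key before reducing.
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reduce phase: aggregate each key's values into the job's output.
word_counts = {word: sum(counts) for word, counts in groups.items()}
print(word_counts)
```

The shuffle step in the middle is what the cluster handles for you: it routes all pairs with the same key to the same reducer, no matter which map task emitted them.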
Another Apache project offered by Amazon EMR is Apache HBase. This is essentially a big data style database and is part of the Apache Hadoop ecosystem. It runs on top of the Hadoop Distributed File System (HDFS) to provide resilient storage for the database records: blocks of data are replicated across many nodes so that if a single node fails, the data is not lost.
Apache HBase would be considered a NoSQL, or non-relational, database. It is very much like a key-value store, but with many extra features, virtually unlimited storage space, and the ability to be dialed up to handle almost any amount of read and write load sent to it. It could be considered very similar to Google’s Bigtable. The benefit of running it with Amazon EMR is that the whole system is maintained by Amazon, instead of needing a local team of engineers to keep things running smoothly on the underlying EC2 instances.
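HBase’s wide-column, key-value model can be pictured with an ordinary Python dictionary keyed by row key, column family, and column qualifier. This is a sketch of the data model only, not the HBase API, and all the names in it are made up for illustration:

```python
# A toy model of an HBase table: each cell is addressed by a row key plus a
# (column family, qualifier) pair. All names here are illustrative.
table = {}

def put(row_key, family, qualifier, value):
    # Write a single cell, as an HBase Put would.
    table[(row_key, family, qualifier)] = value

def get(row_key, family, qualifier):
    # Read a single cell back, as an HBase Get would.
    return table.get((row_key, family, qualifier))

put("user#1001", "profile", "name", "Ada")
put("user#1001", "profile", "email", "ada@example.com")

print(get("user#1001", "profile", "name"))
```

The real system adds versioned cells, ordered row scans, and replicated storage on HDFS, but the addressing scheme is essentially this.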
Apache Hive is another open source project from the Apache Software Foundation which is essentially a big data warehouse. It allows running SQL-style queries across tables of data and is built on top of Apache Hadoop. It can query data stored in HDFS or in other storage systems like Amazon S3. Since this software is also available with Amazon EMR, it is fully tuned by the Amazon engineering team to run as well as possible on the EC2 instances selected when launching the Hive cluster.
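Hive’s query language is closely modeled on SQL, so the flavor of a Hive query can be shown with Python’s built-in sqlite3 module. This illustrates only the SQL-on-tables idea; Hive itself would compile a query like this into distributed jobs over data in HDFS or S3, and the table and values below are invented for the example:

```python
import sqlite3

# An in-memory stand-in for a Hive table of page view events.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page_views (page TEXT, views INTEGER)")
conn.executemany(
    "INSERT INTO page_views VALUES (?, ?)",
    [("home", 120), ("about", 30), ("home", 80)],
)

# A HiveQL aggregation would look almost identical to this standard SQL.
rows = conn.execute(
    "SELECT page, SUM(views) FROM page_views GROUP BY page ORDER BY page"
).fetchall()
print(rows)
```

The appeal of Hive is exactly this familiarity: analysts write ordinary-looking SQL while the engine fans the work out across the cluster.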
Another similar big data system is Apache Hudi, which is designed to ingest and store large analytical datasets, effectively acting as a data lake. It is mainly designed to provide stream-like interfaces to this data for other systems such as Apache Spark and Apache Hadoop. Again, Amazon EMR can provide fully configured Hudi environments which are patched and maintained by Amazon so that you don’t have to worry about it.
Finally, Presto is another offering available on Amazon EMR, which allows running fast, interactive SQL queries on top of large datasets from various sources. Amazon EMR makes running Presto simple by having everything pre-configured and set up to run on the Amazon EC2 instances selected for the job.
So as you can see, Amazon EC2 is really used in tandem with Amazon EMR. Amazon EMR lets you use these big data systems and environments on top of Amazon EC2 without needing to worry about the underlying compute instances, while also keeping everything running smoothly and maintained for you.