How to Run Elastic MapReduce Hadoop Job Using Custom Jar - Amazon EMR Tutorial

Amazon EMR is a web service that lets developers process enormous amounts of data easily and efficiently. It uses a hosted Hadoop framework running on the web-scale infrastructure of Amazon EC2 and Amazon S3.
Amazon EMR hides most of the cumbersome details of Hadoop: it takes care of provisioning Hadoop, running and terminating the job flow, moving data between Amazon EC2 and Amazon S3, and optimizing Hadoop.
In this tutorial, we will first develop a WordCount Java example using the Hadoop MapReduce framework and then run the program on Amazon Elastic MapReduce.

Prerequisites

You must have valid AWS account credentials. You should also have a general familiarity with the Eclipse IDE before you begin, although you can use any other IDE of your choice.

Step 1 – Develop Hadoop MapReduce WordCount Java Program

In this section, we will develop the WordCount application. A WordCount program determines how many times each word appears in a set of files.
  • 1. In Eclipse (or whichever IDE you are using), create a simple Java project named "WordCount".
  • 2. Create a Java class named Map and override the map method as follows:
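    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Emits (word, 1) for every whitespace-separated token of an input line.
    public class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
      private final static IntWritable one = new IntWritable(1);
      private Text word = new Text();

      @Override
      public void map(LongWritable key, Text value, Context context)
          throws IOException, InterruptedException {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
          word.set(tokenizer.nextToken());
          context.write(word, one);
        }
      }
    }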
    
  • 3. Create a Java class named Reduce and override the reduce method as follows:
    import java.io.IOException;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    // Sums the counts that the mappers emitted for one word.
    public class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
      @Override
      protected void reduce(Text key, Iterable<IntWritable> values,
                            Context context)
          throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable value : values) {
          sum += value.get();
        }
        context.write(key, new IntWritable(sum));
      }
    }
    
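  • 4. Create a Java class named WordCount and define the main method as follows:
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

    public class WordCount {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Job.getInstance replaces the constructor new Job(conf, name),
        // which is deprecated in Hadoop 2 and later.
        Job job = Job.getInstance(conf, "wordcount");
        job.setJarByClass(WordCount.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);

        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);

        // args[0] is the input location, args[1] the output location.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Report failure through the exit code instead of ignoring it.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }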
    
  • 5. Export the WordCount program as a JAR from Eclipse and save it to a location on disk. Make sure that you specify the main class (WordCount) when exporting the JAR file.
    Your JAR is now ready.
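
Before running anything on a cluster, it can help to see the map, shuffle and reduce phases of WordCount in isolation. The following is a minimal plain-Java sketch with no Hadoop dependency. The class LocalWordCount and its method names are made up for illustration and are not part of Hadoop or of the project built above; the sketch only imitates the data flow, it does not reproduce how Hadoop distributes the work.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Hypothetical helper, not part of Hadoop: imitates the map, shuffle and
// reduce phases of WordCount inside a single JVM.
final class LocalWordCount {

    // "Map" phase: emit a (word, 1) pair for every whitespace-separated token.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            pairs.add(new SimpleEntry<>(tokenizer.nextToken(), 1));
        }
        return pairs;
    }

    // "Shuffle" phase: group the emitted values by key, ordered by key.
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                   .add(pair.getValue());
        }
        return grouped;
    }

    // "Reduce" phase: sum the values collected for one key.
    static int reduce(Iterable<Integer> values) {
        int sum = 0;
        for (int value : values) {
            sum += value;
        }
        return sum;
    }

    // Runs the three phases over all input lines and returns word -> count.
    static Map<String, Integer> countWords(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            pairs.addAll(map(line));
        }
        Map<String, Integer> counts = new TreeMap<>();
        shuffle(pairs).forEach((word, ones) -> counts.put(word, reduce(ones)));
        return counts;
    }

    public static void main(String[] args) {
        List<String> lines = List.of("hello world", "hello hadoop");
        countWords(lines).forEach(
                (word, count) -> System.out.println(word + "\t" + count));
    }
}
```

For the two sample lines, the sketch prints hadoop with a count of 1, hello with 2 and world with 1, each as a tab-separated pair.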

Step 2 – Upload the WordCount JAR and Input Files to Amazon S3

Now we are going to upload the WordCount JAR to Amazon S3. First, go to https://console.aws.amazon.com/s3/home. Next, click “Create Bucket”, give your bucket a name, and click the “Create” button. Select your new S3 bucket in the left-hand pane, then upload the WordCount JAR and a sample input file whose words you want to count.

Step 3 – Running an Amazon Elastic MapReduce Hadoop job

Running Hadoop WordCount example

Now that the JAR is uploaded to S3, all we need to do is create a new job flow by following the steps below. (I encourage the reader to consult "How to Create a Job Flow Using a Custom JAR" for details on each step.)
  • 1. Sign in to the AWS Management Console and open the Amazon Elastic MapReduce console at https://console.aws.amazon.com/elasticmapreduce/
  • 2. Click Create New Job Flow.
  • 3. In the DEFINE JOB FLOW page, enter the following details,

    a. Job Flow Name = WordCountJob
    b. Select Run your own application
    c. Select Custom JAR in the drop-down list
    d. Click Continue

  • 4. In the SPECIFY PARAMETERS page, enter values in the boxes using the following table as a guide, and then click Continue.
    JAR Location = bucketName/jarFileLocation
    JAR Arguments =
    s3n://bucketName/inputFileLocation
    s3n://bucketName/outputpath

    Please note that the output path must be unique each time we execute the job. Hadoop always creates a folder with the name specified here, so that folder must not already exist.
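
Because the output path has to be different for every run, one option is to derive it from the current time. The sketch below is only an illustration: OutputPaths and uniqueOutputPath are hypothetical names, and the timestamp suffix is an assumed convention rather than anything Amazon EMR requires. Two runs started within the same second would still collide.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

// Hypothetical helper: appends a UTC timestamp to a base output location so
// that each run of the job writes to a folder that does not exist yet.
final class OutputPaths {

    private static final DateTimeFormatter STAMP =
            DateTimeFormatter.ofPattern("yyyyMMdd-HHmmss").withZone(ZoneOffset.UTC);

    static String uniqueOutputPath(String basePath, Instant now) {
        // Drop a trailing slash so the suffix attaches to the folder name.
        String trimmed = basePath.endsWith("/")
                ? basePath.substring(0, basePath.length() - 1)
                : basePath;
        return trimmed + "-" + STAMP.format(now);
    }

    public static void main(String[] args) {
        // "bucketName" and "outputpath" are the placeholders used in this tutorial.
        System.out.println(
                uniqueOutputPath("s3n://bucketName/outputpath", Instant.now()));
    }
}
```

The printed value can be pasted as the second JAR argument in place of a hand-picked output path.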

After starting the job, wait and monitor it as it runs through the Hadoop flow. You can look for errors by using the Debug button. The job should complete within 10 to 15 minutes, depending on the size of the input. Once the job completes, you can view the results in the S3 Browser panel. You can also download the files from S3 and analyze the outcome of the job.
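
As one way to analyze a downloaded result file, the sketch below lists the most frequent words. It relies on the fact that TextOutputFormat writes each record as the key, a tab character, and the value on one line. The class WordCountResults, its method names, and the choice of a top-10 listing are hypothetical; the file passed on the command line is assumed to be an output file you have already downloaded from S3 (reducer output files are typically named like part-r-00000).

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical helper for inspecting WordCount output on a local machine.
final class WordCountResults {

    // Parses lines of the form "word<TAB>count"; malformed lines are skipped.
    static List<Map.Entry<String, Integer>> parse(List<String> lines) {
        List<Map.Entry<String, Integer>> rows = new ArrayList<>();
        for (String line : lines) {
            int tab = line.lastIndexOf('\t');
            if (tab <= 0) {
                continue; // no separator, or an empty key
            }
            try {
                int count = Integer.parseInt(line.substring(tab + 1).trim());
                rows.add(new SimpleEntry<>(line.substring(0, tab), count));
            } catch (NumberFormatException e) {
                // The text after the tab is not a number, so ignore the line.
            }
        }
        return rows;
    }

    // Returns at most n rows, ordered by descending count.
    static List<Map.Entry<String, Integer>> top(
            List<Map.Entry<String, Integer>> rows, int n) {
        List<Map.Entry<String, Integer>> sorted = new ArrayList<>(rows);
        sorted.sort(Map.Entry.<String, Integer>comparingByValue().reversed());
        return sorted.subList(0, Math.min(n, sorted.size()));
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("Usage: WordCountResults <downloaded-output-file>");
            return;
        }
        List<String> lines = Files.readAllLines(Paths.get(args[0]));
        for (Map.Entry<String, Integer> row : top(parse(lines), 10)) {
            System.out.println(row.getKey() + "\t" + row.getValue());
        }
    }
}
```

A job with several reducers produces several output files, so the sketch would need to be run once per file or extended to read them all.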
