Hadoop: Counting How Many Times a Specific Word Appears in a File
Published: 2019-06-27


Suppose the file word.txt contains the following:

what is you name?

my name is zhang san。

The task is to count how many times the word "is" appears in word.txt.
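Before involving Hadoop, it can help to pin down the expected answer locally. The following is a minimal plain-Java sketch, not part of the MapReduce job; the class name LocalWordCountCheck and its countWord method are hypothetical names introduced only for this illustration. It applies the same rule the mapper uses: split each line on whitespace and compare each token exactly.

```java
import java.util.Arrays;
import java.util.List;
import java.util.StringTokenizer;

// Hypothetical local helper, not part of the Hadoop job.
final class LocalWordCountCheck {

    // Counts whitespace-delimited tokens that are exactly equal to `target`.
    static int countWord(List<String> lines, String target) {
        int count = 0;
        for (String line : lines) {
            StringTokenizer tokens = new StringTokenizer(line);
            while (tokens.hasMoreTokens()) {
                if (tokens.nextToken().equals(target)) {
                    count++;
                }
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // The two lines of word.txt from the example above.
        List<String> lines = Arrays.asList("what is you name?", "my name is zhang san。");
        System.out.println(countWord(lines, "is")); // prints 2
    }
}
```

For the sample file, each line contains "is" once as a standalone token, so the job should report a count of 2.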

 

The code is as follows:

PerWordMapper

package com.hadoop.wordcount;

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class PerWordMapper extends Mapper<Object, Text, Text, IntWritable> {

    // Reused output key and a constant count of 1 for every match.
    private final Text keyText = new Text();
    private final IntWritable intValue = new IntWritable(1);

    @Override
    protected void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        // Split the input line on whitespace.
        StringTokenizer to = new StringTokenizer(value.toString());
        while (to.hasMoreTokens()) {
            String t = to.nextToken();
            // This is where the word to count is selected.
            if (t.equals("is")) {
                keyText.set(t);
                context.write(keyText, intValue);
            }
        }
    }
}
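One thing to keep in mind about the mapper: StringTokenizer with its default delimiters splits only on whitespace, so punctuation stays attached to the token. In the sample file the tokens "name?" and "san。" keep their punctuation, and an input such as "is," would not equal "is". If that matters for your data, one option is to normalize each token before comparing. The sketch below is an assumption about how that could be done, not what the original code does; TokenNormalizer and normalize are hypothetical names.

```java
// Hypothetical helper, not part of the original job.
final class TokenNormalizer {

    // Strips leading and trailing characters that are neither letters nor digits.
    static String normalize(String token) {
        return token.replaceAll("^[^\\p{L}\\p{N}]+|[^\\p{L}\\p{N}]+$", "");
    }

    public static void main(String[] args) {
        System.out.println("is,".equals("is"));            // false: the comma is part of the token
        System.out.println(normalize("is,").equals("is")); // true after normalization
        System.out.println(normalize("name?"));            // name
    }
}
```

In the mapper, this would amount to comparing normalize(t) instead of t. Whether to do so depends on whether "is," should count as an occurrence of "is" in your use case.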

 

PerWordReducer

package com.hadoop.wordcount;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class PerWordReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    // Reused output value holding the total for the current key.
    private final IntWritable intValue = new IntWritable(0);

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // Add up every count emitted for this key.
        // Iterate with for-each rather than calling iterator() repeatedly.
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        intValue.set(sum);
        context.write(key, intValue);
    }
}

PerWordCount

package com.hadoop.wordcount;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class PerWordCount {

    public static void main(String[] args)
            throws IOException, InterruptedException, ClassNotFoundException {
        Configuration conf = new Configuration();
        // Strip generic Hadoop options; what remains should be the input and output paths.
        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
        System.out.println("otherArgs.length:" + otherArgs.length);
        if (otherArgs.length != 2) {
            System.err.println("Usage: wordcount <in> <out>");
            System.exit(2);
        }

        // Job.getInstance replaces the constructor new Job(conf, name), which is deprecated in Hadoop 2.x and later.
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(PerWordCount.class);
        job.setMapperClass(PerWordMapper.class);
        // Summing is associative, so the reducer can also serve as the combiner.
        job.setCombinerClass(PerWordReducer.class);
        job.setReducerClass(PerWordReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

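The driver registers PerWordReducer as the combiner as well as the reducer. That is safe here because addition is associative and commutative: summing partial sums gives the same total as summing every individual 1. The plain-Java sketch below illustrates this with made-up data; CombinerSketch and its methods are hypothetical names, and the two lists stand in for the values emitted by two map tasks.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of why a sum reducer can double as a combiner.
final class CombinerSketch {

    // Same logic as the reducer: add up all values for one key.
    static int sum(List<Integer> values) {
        int total = 0;
        for (int v : values) {
            total += v;
        }
        return total;
    }

    // No combiner: the reducer sees every value from every map task.
    static int withoutCombiner(List<List<Integer>> mapOutputs) {
        List<Integer> all = new ArrayList<>();
        for (List<Integer> output : mapOutputs) {
            all.addAll(output);
        }
        return sum(all);
    }

    // With combiner: each map task's values are summed first, then the reducer sums the partial sums.
    static int withCombiner(List<List<Integer>> mapOutputs) {
        List<Integer> partials = new ArrayList<>();
        for (List<Integer> output : mapOutputs) {
            partials.add(sum(output));
        }
        return sum(partials);
    }

    public static void main(String[] args) {
        // Made-up example: one map task emitted three 1s, another emitted two.
        List<List<Integer>> mapOutputs = Arrays.asList(
                Arrays.asList(1, 1, 1),
                Arrays.asList(1, 1));
        System.out.println(withoutCombiner(mapOutputs)); // prints 5
        System.out.println(withCombiner(mapOutputs));    // prints 5
    }
}
```

A reducer that computed something non-associative, such as an average, could not be reused as a combiner this way.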
 

 

Reposted from: http://urkpl.baihongyu.com/
