How to output N different types of value in a Hadoop map/reduce job

coderplay


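BTW: once again I have to lament the lack of machines. Processing a 3.4 GB corpus on a single machine took ten-odd hours, which was really frustrating. If only I had N machines.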

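Very often, especially when processing large data sets, we want a single MapReduce pass to solve several problems at once, so that the data does not have to be read again. For example, in text clustering or classification, the mapper reads the corpus, segments it into words, and then has to compute both the term frequency and the document frequency of every term. For each term, the former is really a vector that holds the term's frequency in each of the N documents, while the latter is just a non-negative integer. This is where a special Writable class comes in handy: GenericWritable.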
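To make the two quantities concrete, here is a minimal plain-Java sketch with no Hadoop involved. The class name `TfDfSketch` and the toy corpus are made up for illustration. It builds, in a single pass over the documents, a per-term map from document id to term frequency, and derives the document frequency as the number of distinct documents that contain the term.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper for illustration; not part of redpoll or Hadoop.
class TfDfSketch {
  // term -> (document id -> number of occurrences in that document)
  final Map<String, Map<Integer, Integer>> tf = new TreeMap<>();

  // Single pass over the corpus; each document is an already tokenized list of terms.
  void index(List<List<String>> docs) {
    for (int docId = 0; docId < docs.size(); docId++) {
      for (String term : docs.get(docId)) {
        tf.computeIfAbsent(term, t -> new TreeMap<>()).merge(docId, 1, Integer::sum);
      }
    }
  }

  // Document frequency: number of distinct documents that contain the term.
  int df(String term) {
    Map<Integer, Integer> perDoc = tf.get(term);
    return perDoc == null ? 0 : perDoc.size();
  }

  public static void main(String[] args) {
    TfDfSketch sketch = new TfDfSketch();
    sketch.index(Arrays.asList(
        Arrays.asList("hadoop", "map", "reduce", "map"),
        Arrays.asList("lucene", "map")));
    // prints {0=2, 1=1} df=2
    System.out.println(sketch.tf.get("map") + " df=" + sketch.df("map"));
  }
}
```

The per-term map is the "vector" and the integer is the "non-negative integer" from the paragraph above; the two have different shapes, which is why a single job needs two value types.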

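To use it, extend GenericWritable and add every Writable type you want to emit as a value to the static CLASSES variable of your subclass. In the TermMapper and TermReducer below, my values come in three types: ArrayWritable, IntWritable, and my own TFWritable, so all three have to be added to the CLASSES of TermWritable.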

package redpoll.examples;

import org.apache.hadoop.io.GenericWritable;
import org.apache.hadoop.io.Writable;

/**
 * Generic Writable class for terms.
 * @author Jeremy Chow(coderplay@gmail.com)
 */
public class TermWritable extends GenericWritable {
  private static Class<? extends Writable>[] CLASSES = null;

  static {
    CLASSES = (Class<? extends Writable>[]) new Class[] {
        org.apache.hadoop.io.ArrayWritable.class,
        org.apache.hadoop.io.IntWritable.class,
        redpoll.examples.TFWritable.class
        };
  }

  public TermWritable() {
  }

  public TermWritable(Writable instance) {
    set(instance);
  }

  @Override
  protected Class<? extends Writable>[] getTypes() {
    return CLASSES;
  }
}
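Why the list of types matters: as far as I understand GenericWritable, it serializes a wrapped value as a one-byte index into the array returned by getTypes(), followed by the wrapped instance's own fields, and on reading it uses that index to pick the class to instantiate. The following self-contained sketch imitates that mechanism with plain java.io. Every name in it (`MiniWritable`, `MiniInt`, `MiniText`, `MiniGeneric`) is a hypothetical stand-in, not a Hadoop class.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Arrays;

// Simplified stand-in for Hadoop's Writable interface (hypothetical).
interface MiniWritable {
  void write(DataOutputStream out) throws IOException;
  void readFields(DataInputStream in) throws IOException;
}

class MiniInt implements MiniWritable {
  int value;
  MiniInt() {}
  MiniInt(int value) { this.value = value; }
  public void write(DataOutputStream out) throws IOException { out.writeInt(value); }
  public void readFields(DataInputStream in) throws IOException { value = in.readInt(); }
}

class MiniText implements MiniWritable {
  String value = "";
  MiniText() {}
  MiniText(String value) { this.value = value; }
  public void write(DataOutputStream out) throws IOException { out.writeUTF(value); }
  public void readFields(DataInputStream in) throws IOException { value = in.readUTF(); }
}

class MiniGeneric {
  // Plays the role of CLASSES: the position in this array is the type tag.
  static final Class<?>[] TYPES = { MiniInt.class, MiniText.class };

  // Serialize as: one tag byte, then the instance's own fields.
  static byte[] wrap(MiniWritable instance) {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    try (DataOutputStream out = new DataOutputStream(bytes)) {
      int tag = Arrays.asList(TYPES).indexOf(instance.getClass());
      if (tag < 0) {
        throw new IllegalArgumentException("type not registered: " + instance.getClass());
      }
      out.writeByte(tag);
      instance.write(out);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return bytes.toByteArray();
  }

  // Read the tag, instantiate the matching class, then let it read its fields.
  static MiniWritable unwrap(byte[] data) {
    try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
      int tag = in.readByte() & 0xff;
      MiniWritable instance =
          (MiniWritable) TYPES[tag].getDeclaredConstructor().newInstance();
      instance.readFields(in);
      return instance;
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    } catch (ReflectiveOperationException e) {
      throw new IllegalStateException(e);
    }
  }
}
```

In this sketch, `MiniGeneric.unwrap(MiniGeneric.wrap(new MiniText("term")))` gives back a `MiniText`, while wrapping an unregistered type fails immediately. Two consequences carry over to the real class under the same assumption: the order of the registered types must be identical on the writing and the reading side, and every registered type must be constructible without arguments.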

When the Mapper collects data, it wraps each Writable it wants to emit in the TermWritable defined above.

package redpoll.examples;

import java.io.IOException;
import java.io.StringReader;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;

/**
 * Performs word segmentation and emits the TF and DF records for each term.<p>
 * in: key is document id, value is a document instance. <br>
 * output:
 * <li>key is term, value is a &lt;documentId, tf&gt; pair</li>
 * <li>key is term, value is the document frequency corresponding to the key</li>
 * @author Jeremy Chow(coderplay@gmail.com)
 */
public class TermMapper extends MapReduceBase implements
    Mapper<LongWritable, Document, Text, TermWritable> {
  private static final Log log = LogFactory.getLog(TermMapper.class
      .getName());
  
  /* analyzer for word segmentation */
  private Analyzer analyzer = null;
   
  /* frequency weight for document title */
  private IntWritable titleWeight = new IntWritable(2);
  /* frequency weight for document content */
  private IntWritable contentWeight = new IntWritable(1);

  
  public void map(LongWritable key, Document value,
      OutputCollector<Text, TermWritable> output, Reporter reporter)
      throws IOException {
    doMap(key, value.getTitle(), titleWeight, output, reporter);
    doMap(key, value.getContent(), contentWeight, output, reporter);
  }
  
  private void doMap(LongWritable key, String value, IntWritable weight,
      OutputCollector<Text, TermWritable> output, Reporter reporter)
      throws IOException {
    // do word segmentation
    TokenStream ts = analyzer.tokenStream("dummy", new StringReader(value));
    Token token = new Token();
    while ((token = ts.next(token)) != null) {
      String termString = new String(token.termBuffer(), 0, token.termLength());
      Text term = new Text(termString);
      // <term, <documentId,tf>>
      TFWritable tf = new TFWritable(key, weight);
      output.collect(term, new TermWritable(tf)); // wrap then collect
      // <term, weight>
      output.collect(term, new TermWritable(weight)); // wrap then collect
    }
  }
    
  @Override
  public void configure(JobConf job) {
    String analyzerName = job.get("redpoll.text.analyzer");
    try {
      if (analyzerName != null)
        analyzer = (Analyzer) Class.forName(analyzerName).newInstance();
    } catch (Exception excp) {
      log.warn("Cannot instantiate analyzer " + analyzerName
          + ", falling back to StandardAnalyzer", excp);
    }
    if (analyzer == null)
      analyzer = new StandardAnalyzer();
  }

}

When the Reducer needs the data, it unwraps it:

package redpoll.examples;

import java.io.IOException;
import java.util.ArrayList;
import java.util.Iterator;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.io.ArrayWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

/**
 * Form a tf vector and calculate the df for terms.
 * @author Jeremy Chow(coderplay@gmail.com)
 */
public class TermReducer extends MapReduceBase implements Reducer<Text, TermWritable, Text, Writable> {
  
  private static final Log log = LogFactory.getLog(TermReducer.class.getName());
  
  public void reduce(Text key, Iterator<TermWritable> values,
      OutputCollector<Text, Writable> output, Reporter reporter)
      throws IOException {
    ArrayList<TFWritable> tfs = new ArrayList<TFWritable>();
    int sum = 0;
//    log.info("term:" + key.toString());
    while (values.hasNext()) {
      Writable value = values.next().get(); // unwrap
      if (value instanceof TFWritable) {
        tfs.add((TFWritable) value);
      } else {
        sum += ((IntWritable) value).get();
      }
    }
    
    TFWritable writables[] = new TFWritable[tfs.size()];
    ArrayWritable aw = new ArrayWritable(TFWritable.class, tfs.toArray(writables));
    // wrap again
    output.collect(key, new TermWritable(aw)); 
    output.collect(key, new TermWritable(new IntWritable(sum)));
  }

}

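At this point the values no longer have to be wrapped in TermWritable when they are collected. I only do so because I redefined the OutputFormat so that it writes to two different files, and the types written to the two files differ as well.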
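The custom OutputFormat itself is not shown in this post. The following is not that implementation, only a plain-Java sketch of the idea of routing records to two destinations by the runtime type of the value. The class `TypeRoutingWriter` and its two in-memory "files" are hypothetical, and a `List` and an `Integer` stand in for the TF vector and the DF count.

```java
import java.util.List;

// Hypothetical sketch; not the author's OutputFormat and no Hadoop classes are used.
class TypeRoutingWriter {
  final StringBuilder tfFile = new StringBuilder(); // stands in for the TF-vector file
  final StringBuilder dfFile = new StringBuilder(); // stands in for the DF file

  // Append the record to one of the two sinks, chosen by the value's runtime type.
  void write(String key, Object value) {
    if (value instanceof Integer) {
      dfFile.append(key).append('\t').append(value).append('\n');
    } else if (value instanceof List) {
      tfFile.append(key).append('\t').append(value).append('\n');
    } else {
      throw new IllegalArgumentException("unexpected value type: " + value);
    }
  }
}
```

In a real job the same dispatch would presumably sit inside the record writer returned by the OutputFormat, with each branch writing to its own file.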


2008-10-30 04:46
Comments (6)

#6 qingzew 2014-05-22

What should I do if a single key has multiple values in the map output?

#5 javalive20120108 2012-06-20

In reply to #3:
In the map method,
String pathName = ((FileSplit) context.getInputSplit()).getPath().toString();
gives you the full path of each input file, so you can decide there which files to treat as input.

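#4 riddle_chen 2009-05-27

you_laner wrote:

Strictly speaking, this does not output several different types of value; it only wraps values of different types in one class. I would like to tell the values apart by their keys and save them to several files, and then use some of the files from the first job as input to a later job, but I don't know how to do that.

MultipleOutputFormat lets you save the aggregated values to different files according to their keys. To feed files into a later job, just use FileInputFormat.setInputPaths.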

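#3 you_laner 2009-05-20

Strictly speaking, this does not output several different types of value; it only wraps values of different types in one class.

I would like to tell the values apart by their keys and save them to several files, and then use some of the files from the first job as input to a later job, but I don't know how to do that.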

#2 shuchaoo 2009-05-05

Nice use of GenericWritable! I can't see how the term frequency is actually computed, though. Is there a combiner?

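#1 yawl 2008-10-30

It looks like a platform that scales well, such as EC2, is the best fit after all. Either way, one machine running for 10 hours and 10 machines running for 1 hour both cost $1 (small instance).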
