File already exists error when using MultipleOutputs

2018.8.4 2025.9.27 Hadoop 474 1 min

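In a reduce stage I saw the first task attempt fail for some reason, and then every later attempt failed as well, each reporting a File already exists error. The cause is that MultipleOutputs.write accepts either a named-output alias or an absolute path as its target. If you pass an absolute path, the file is created at its final location immediately, which bypasses MR's output commit mechanism (write to a temporary directory first, then move the result into the final directory at the end). After the first attempt fails, the file is still there when the second attempt runs, so the retry errors out and exits. MR does clean up output when a task fails, but only taskAttemptPath; it does not remove any other files the task created.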

  // Excerpt from Hadoop's FileOutputCommitter: aborting a task deletes
  // only taskAttemptPath, nothing else the task may have written.
  @Private
  public void abortTask(TaskAttemptContext context, Path taskAttemptPath) throws IOException {
    if (hasOutputPath()) {
      context.progress();
      if (taskAttemptPath == null) {
        taskAttemptPath = getTaskAttemptPath(context);
      }
      FileSystem fs = taskAttemptPath.getFileSystem(context.getConfiguration());
      // Recursive delete of the attempt's own temporary directory.
      if (!fs.delete(taskAttemptPath, true)) {
        LOG.warn("Could not delete " + taskAttemptPath);
      }
    } else {
      LOG.warn("Output Path is null in abortTask()");
    }
  }
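
The failure sequence can be reproduced without a cluster. The following is a toy simulation built on java.nio.file, not Hadoop code: the class CommitSimulation, the method runAttempt, and the directory names are all made up for illustration. Each attempt stages one file under its own attempt directory and writes another file straight to a final path; aborting removes only the attempt directory, as abortTask does.

```java
import java.io.IOException;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical stand-in for a task attempt plus a committer; not a Hadoop API.
final class CommitSimulation {

    // Recursive delete, playing the role of fs.delete(taskAttemptPath, true).
    static void deleteRecursively(Path dir) throws IOException {
        if (!Files.exists(dir)) {
            return;
        }
        List<Path> paths;
        try (Stream<Path> walk = Files.walk(dir)) {
            // Reverse order puts children before their parent directories.
            paths = walk.sorted(Comparator.reverseOrder()).collect(Collectors.toList());
        }
        for (Path p : paths) {
            Files.delete(p);
        }
    }

    // Runs one attempt. directFile models an absolute-path output that skips
    // the staging directory; fail=true models a task that dies and is aborted.
    static void runAttempt(Path outputDir, Path directFile, int attempt, boolean fail)
            throws IOException {
        Path attemptDir = outputDir.resolve("_temporary").resolve("attempt_" + attempt);
        Files.createDirectories(attemptDir);
        Files.createDirectories(directFile.getParent());

        Path staged = attemptDir.resolve("part-r-00000");
        Files.createFile(staged);      // staged output, invisible until commit
        Files.createFile(directFile);  // final location; throws if it already exists

        if (fail) {
            deleteRecursively(attemptDir);  // abort cleans the attempt directory only
            return;
        }
        Files.move(staged, outputDir.resolve("part-r-00000"));  // commit
        deleteRecursively(attemptDir);
    }

    public static void main(String[] args) throws IOException {
        Path root = Files.createTempDirectory("mo-demo");
        Path outputDir = root.resolve("job-output");
        Path directFile = root.resolve("elsewhere").resolve("data-r-00000");

        runAttempt(outputDir, directFile, 0, true);  // first attempt fails
        System.out.println("direct file survived abort: " + Files.exists(directFile));
        try {
            runAttempt(outputDir, directFile, 1, false);  // retry
        } catch (FileAlreadyExistsException e) {
            System.out.println("retry failed: file already exists");
        }
    }
}
```

Running main shows that the directly written file survives the abort and that the retry then fails on it, which mirrors the behaviour described above.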

Solution

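One option is to clean up the file before the second attempt runs. The final output file name is the user-specified path plus a number, though, and that number is not easy to obtain. On top of that, if speculative execution is enabled, the speculative attempt fails too. So the better fix is to avoid absolute output paths altogether: write files by key, and if the results need to end up in separate directories, move the files there afterwards on the client side.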
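
The client-side move could look roughly like the sketch below. It uses java.nio.file on a local directory so that it runs anywhere; against HDFS the same loop would go through Hadoop's FileSystem (listing the output directory and calling rename). The class MoveByKey, the method moveByKey, and the target layout targetRoot/<key>/part-r-NNNNN are hypothetical, and the sketch assumes the reducer called MultipleOutputs.write(key, value, baseOutputPath) with the key as a relative baseOutputPath, so that committed files are named like keyA-r-00000.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper that runs on the client after the job has committed.
final class MoveByKey {

    // Assumed naming: <baseOutputPath>-r-<5-digit partition>, e.g. keyA-r-00000.
    private static final Pattern NAMED_OUTPUT = Pattern.compile("(.+)-r-(\\d{5})");

    // Moves every per-key file from committedDir into targetRoot/<key>/ and
    // returns the number of files moved. Other entries are left in place.
    static int moveByKey(Path committedDir, Path targetRoot) throws IOException {
        List<Path> files;
        try (Stream<Path> listing = Files.list(committedDir)) {
            // Snapshot the listing first so moves do not disturb iteration.
            files = listing.filter(Files::isRegularFile).collect(Collectors.toList());
        }
        int moved = 0;
        for (Path file : files) {
            Matcher m = NAMED_OUTPUT.matcher(file.getFileName().toString());
            if (!m.matches() || m.group(1).equals("part")) {
                continue;  // skip _SUCCESS, default part files, anything unrelated
            }
            Path keyDir = targetRoot.resolve(m.group(1));
            Files.createDirectories(keyDir);
            // Fails with FileAlreadyExistsException rather than overwriting.
            Files.move(file, keyDir.resolve("part-r-" + m.group(2)));
            moved++;
        }
        return moved;
    }
}
```

Because the move happens only after the job has committed, failed or speculative attempts never touch the final directories; their output stays under the attempt directory that MR cleans up itself.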

Related issue

This issue only points out that using an absolute path defeats the output committing mechanism; its outcome was an added warning comment.
https://issues.apache.org/jira/browse/MAPREDUCE-6357

Author: fatkun
Link: https://fatkun.github.io/2018/08/use-multipleoutputs-file-already-exists.html
License: CC BY-NC-SA 4.0