Failing to concat HDFS files
At work I use Hadoop via Cascading. I wanted the final output of a multi-stage job to be a single file. This is more complicated that I'd hoped, and led me down some blind alleys where Google wasn't helping. Hopefully this article will help others hitting the same issue.
Using a single reducer
I was initially doing this with job.setNumReduceTasks(0), which restricted the job to use only a single reducer. The side effect is that it writes out to a single file, but depending on what's going on in that reduce, you don't want it running single threaded. Also, if Cascading's flow planner puts a map-side operation as that last thing that needs doing, then you might still get multiple output files.
Using getmerge
My search for hdfs merge single file turned up various answers recommending the hadoop -getmerge
command. This is great if you want to merge from hdfs files to a single local file. If you want everything staying on HDFS, this is a mistake since it streams everything locally, then pushes it back. This will get worse and worse as you get more data.
Using FileSystem.concat(...)
This way lies madness
Browsing through Hadoop's FileSystem class, I found the following function:
FileSystem.concat(Path trg, Path[] psrcs)
Concat existing files together.
Parameters:
trg - the path to the target destination.
psrcs - the paths to the sources to use for the concatenation.
Fantastic! This is probably what I want. Turns out to be the opposite of what I want, and not at all what I expected from the documentation:
- The source files will be removed. This is in a javadoc at DFSClient.java:1566. Good to know
- All paths (source and destination) must be in the same directory (this is in an in-line comment at FSNamesystem:1580)
- The destination must exist. Found that out when a call to getInode on FSNamesystem:1633 fails if the file does not exist. Fine, I'll create an empty file.
- The destination file must not be empty. This is "by design" according to FSNamesystem:1639. Fine, I'll create a new file with a blank line.
- The destination file must have the last block full (FSNamesystem:1654). Why would that be a requirement.
This makes it unusable as a general purpose "concat these files together" since I can't concat to a non-existing file, I can't concat to a blank file, and I can't somehow align the blocks. Luckily this bug has been closed as "not a problem" in HDFS-6641.
This all boils down to FileSystem.concat(...)
not being for concatenating multiple files like most people would think of it. I think it's explained wrong. It's for moving the blocks of many files into a single file. It's a namenode only operation in that it remaps blocks without doing any copying. Once that's the purpose, everything else falls into place. The last block of trg
must be full, since a file can't have a non-full block in the middle. The psrcs
files disappear because this isn't a copy. The "same directory" and "trg not empty" restrictions still make no sense, but whatever.
This is not the api you are looking for. The documentation is misleading, but I filed HDFS-7800 to hopefully improve that.
Force a useless map-reduce cycle
The solution I went with was to use job.setNumReduceTasks(0)
, but force a useless map-reduce at the end of my Cascading flow. This is inefficient, but atleast the reducer does nothing but write down data, so it's pretty fast.