
How to create a filesystem in Hadoop
By Hualiang Xu, Jul 13 2015, [email protected]

Hadoop supports a number of file systems:
1. HDFS, with scheme hdfs://host:port/path_to_file. It is the default Hadoop filesystem when no scheme is specified. Refer to the Hadoop setup documentation for HDFS initialization.
2. Local file system, with scheme file:///path_to_file, through which Hadoop can access the Linux/Windows filesystem.
3. S3 native file system, with scheme s3n://bucket/object; each object in a bucket is treated as a file.
4. S3 block file system, with scheme s3://bucket/object.
5. FTP file system, and many others.

FileSystem

These file systems coexist in a single Hadoop environment and are selected by scheme:

    Path path = new Path("s3n://bucket/object");
    FileSystem fs = path.getFileSystem(new Configuration());
    FSDataInputStream strm = fs.open(path);

This differs from unix-like operating systems, which need a mount point for each filesystem. In the example above, a NativeS3FileSystem instance is created, and a NativeS3FsInputStream is instantiated for S3 object access. The scheme is hooked up with a specific filesystem either by the Java service-loading facility or by Hadoop configuration. Hadoop configuration takes priority; refer to FileSystem.getFileSystemClass(String scheme, Configuration conf). The key "fs.<scheme>.impl" is looked up first in the Hadoop configuration (hdfs-site.xml or core-site.xml). If the scheme is not configured there, the service-loading facility checks hadoop-common-2.4.0.jar:META-INF/services/org.apache.hadoop.fs.FileSystem, hadoop-hdfs-2.4.0.jar:META-INF/services/org.apache.hadoop.fs.FileSystem, and the same service file in any other jar on the classpath. With the file system class name resolved, an instance is created by Java reflection (refer to Java reflection for the magic).
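
As a small illustration of the configuration route (a sketch only: the scheme "myfs" and the class org.myfs.MyFileSystem are the ones used later in this article, and the path is made up):

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;

    public class SchemeLookupExample {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Bind the scheme "myfs" to its implementation class. This key is
        // checked before the META-INF/services entries on the classpath.
        conf.set("fs.myfs.impl", "org.myfs.MyFileSystem");

        // FileSystem.get() resolves the scheme of the URI through
        // getFileSystemClass() and instantiates the class by reflection.
        FileSystem fs = FileSystem.get(URI.create("myfs://bucket/object"), conf);
        System.out.println(fs.getClass().getName());   // prints org.myfs.MyFileSystem
      }
    }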

So far the role of the scheme in Hadoop should be clear. Next we elaborate on how to build your own file system. Each Hadoop filesystem has to extend the abstract FileSystem class:

    class MyFileSystem extends FileSystem {
      @Override
      public String getScheme() { return "myfs"; }

      @Override
      public FSDataOutputStream append(Path path, int bufferSize, Progressable progress) throws IOException { ... }

      @Override
      public FSDataOutputStream create(Path path, FsPermission permission, boolean overwrite,
          int bufferSize, short replication, long blockSize, Progressable progress) throws IOException { ... }

      @Override
      public boolean delete(Path path, boolean recursive) throws IOException { ... }

      @Override
      public FileStatus getFileStatus(Path path) throws IOException { ... }

      @Override
      public FileStatus[] listStatus(Path path) throws FileNotFoundException, IOException { ... }

      @Override
      public boolean mkdirs(Path path, FsPermission permission) throws IOException { ... }

      @Override
      public FSDataInputStream open(Path path, int bufferSize) throws IOException { ... }

      @Override
      public boolean rename(Path path, Path dest) throws IOException { ... }

      @Override
      public URI getUri() { ... }

      @Override
      public Path getWorkingDirectory() { ... }

      @Override
      public void setWorkingDirectory(Path path) { ... }
    }

The minimum support for a file system (read-only, with no folder support) is to implement getScheme, getFileStatus and open. To support a folder structure, listStatus must be implemented as well.
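
Below is a minimal sketch of that read-only case. The class and helper names (MyReadOnlyFileSystem, lookupLength) are made up for illustration, MyInputStream is the seekable stream sketched in the input/output stream section further down, and the remaining abstract methods still need stub bodies because FileSystem is an abstract class:

    import java.io.FileNotFoundException;
    import java.io.IOException;
    import java.net.URI;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.fs.permission.FsPermission;
    import org.apache.hadoop.util.Progressable;

    // Read-only, flat (no folders) filesystem: only getScheme, getFileStatus
    // and open do real work; every write-side operation is rejected.
    public class MyReadOnlyFileSystem extends FileSystem {
      private Path workingDir = new Path("myfs://root/");

      @Override
      public String getScheme() { return "myfs"; }

      @Override
      public URI getUri() { return URI.create("myfs://root/"); }

      @Override
      public FileStatus getFileStatus(Path path) throws IOException {
        long length = lookupLength(path);   // hypothetical metadata lookup
        // isdir=false: every path is reported as a plain file in this sketch
        return new FileStatus(length, false, 1, 64 * 1024 * 1024,
            System.currentTimeMillis(), path);
      }

      @Override
      public FSDataInputStream open(Path path, int bufferSize) throws IOException {
        // Wrap our seekable stream in the FSDataInputStream decorator.
        // A real implementation would first translate the logical myfs path
        // into the physical location of the data.
        return new FSDataInputStream(new MyInputStream(getConf(), path));
      }

      @Override
      public FileStatus[] listStatus(Path path) throws FileNotFoundException, IOException {
        // No folder support: a listing is just the single file itself
        return new FileStatus[] { getFileStatus(path) };
      }

      // Write-side operations are unsupported in a read-only filesystem
      @Override
      public FSDataOutputStream create(Path path, FsPermission permission, boolean overwrite,
          int bufferSize, short replication, long blockSize, Progressable progress)
          throws IOException {
        throw new IOException("myfs is read-only");
      }

      @Override
      public FSDataOutputStream append(Path path, int bufferSize, Progressable progress)
          throws IOException {
        throw new IOException("myfs is read-only");
      }

      @Override
      public boolean rename(Path src, Path dst) throws IOException {
        throw new IOException("myfs is read-only");
      }

      @Override
      public boolean delete(Path path, boolean recursive) throws IOException {
        throw new IOException("myfs is read-only");
      }

      @Override
      public boolean mkdirs(Path path, FsPermission permission) throws IOException {
        throw new IOException("myfs is read-only");
      }

      @Override
      public Path getWorkingDirectory() { return workingDir; }

      @Override
      public void setWorkingDirectory(Path path) { workingDir = path; }

      // Placeholder for whatever backing store the real filesystem talks to
      private long lookupLength(Path path) throws IOException { return 0L; }
    }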

When we type the command "hadoop fs -ls myfs://path", MyFileSystem is instantiated and getFileStatus(new Path("myfs://path")) is called, returning a FileStatus object. When the FileStatus describes a file, the command prints the file's metadata. When the FileStatus describes a directory, listStatus(new Path("myfs://path")) is called with the path as the parent. A few words on FileStatus: the unix-like shell command "ls -l" prints the metadata of a file/folder:

    drwxrwxr-x 5 hxu hxu 4096 Jun 3 19:11 workspace
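
A FileStatus carries the same kind of fields. As a sketch, the directory entry above could be expressed like this (the path and the exact timestamp are made up; the size, permission, owner and group come from the ls -l line):

    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.fs.permission.FsPermission;

    public class FileStatusExample {
      public static void main(String[] args) {
        // drwxrwxr-x 5 hxu hxu 4096 Jun 3 19:11 workspace
        FileStatus status = new FileStatus(
            4096L,                          // length in bytes
            true,                           // isdir: this entry is a directory
            1,                              // block replication
            64 * 1024 * 1024,               // block size
            1433358660000L,                 // modification time, roughly Jun 3 2015 19:11 UTC
            0L,                             // access time
            new FsPermission((short) 0775), // rwxrwxr-x
            "hxu",                          // owner
            "hxu",                          // group
            new Path("myfs://root/workspace"));
        System.out.println(status);
      }
    }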

FileStatus is the data structure that holds all of this information. A custom file system is necessary when you have a new protocol to support, or when your data format is customized. FTP is an example of the first case: the idea is to access a file through the FTP protocol. An example of the second case: say a file is compressed and hosted in HDFS, and I want to read/write the data without any knowledge of the compression. In that case I would like MyFileSystem to be compression aware, so that "myfs://file" in Hadoop is compression unaware. In short, a customized Hadoop filesystem can handle a new protocol, a customized data format on top of an existing filesystem, or even data in the cloud.

Input and output stream

The interface for a Hadoop filesystem has been explained, but we still have to customize the input/output streams to make it a working file system. In Hadoop, FSDataInputStream and FSDataOutputStream are the wrappers for the input and output stream respectively, following the decorator pattern. In my example, I have MyInputStream and MyOutputStream; the structure is sketched below.
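
From the caller's point of view the decorator is what you see. A usage sketch (it assumes the myfs scheme is already registered and that the file exists; both are illustrative):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class DecoratorUsageExample {
      public static void main(String[] args) throws Exception {
        Path path = new Path("myfs://root/data.bin");
        FileSystem fs = path.getFileSystem(new Configuration());

        FSDataInputStream in = fs.open(path);   // FSDataInputStream decorating MyInputStream
        byte[] header = new byte[16];
        in.readFully(0, header);                // positioned read, added by the decorator
        in.seek(1024);                          // seek is delegated to the wrapped stream
        int next = in.read();
        in.close();
        System.out.println("first byte after seek: " + next);
      }
    }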

MyFileSystem hooks up with the scheme, handles file/folder metadata and maintains the working directory. MyInputStream/MyOutputStream is where the real IO happens: the interface speaks in terms of a logical file, while the implementation deals with the physical data. The compressed-file example, or any more elaborately formatted file, is handled inside the input/output stream.
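
Below is a minimal sketch of that idea for the read side, assuming gzip-compressed files stored in HDFS. The reopen-and-skip seek is an illustrative choice, not the article's actual implementation; MyFileSystem.open() would simply return new FSDataInputStream(new MyInputStream(conf, physicalPath)):

    import java.io.IOException;
    import java.io.InputStream;
    import java.util.zip.GZIPInputStream;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Reads gzip data from HDFS but exposes a plain, uncompressed byte stream.
    // FSInputStream supplies the Seekable/PositionedReadable plumbing that the
    // FSDataInputStream decorator requires.
    public class MyInputStream extends FSInputStream {
      private final Configuration conf;
      private final Path compressedPath;   // physical hdfs:// location of the gzip file
      private InputStream decompressed;    // logical (uncompressed) view of the data
      private long pos;                    // logical position in uncompressed bytes

      public MyInputStream(Configuration conf, Path compressedPath) throws IOException {
        this.conf = conf;
        this.compressedPath = compressedPath;
        reopen();
      }

      private void reopen() throws IOException {
        FileSystem hdfs = compressedPath.getFileSystem(conf);
        decompressed = new GZIPInputStream(hdfs.open(compressedPath));
        pos = 0;
      }

      @Override
      public int read() throws IOException {
        int b = decompressed.read();
        if (b >= 0) {
          pos++;
        }
        return b;
      }

      @Override
      public long getPos() {
        return pos;
      }

      @Override
      public void seek(long target) throws IOException {
        // Naive seek: restart from the beginning and skip forward. Good enough
        // for a sketch; a real implementation would need something smarter.
        decompressed.close();
        reopen();
        long remaining = target;
        while (remaining > 0) {
          long skipped = decompressed.skip(remaining);
          if (skipped <= 0) {
            throw new IOException("seek past end of stream: " + target);
          }
          remaining -= skipped;
        }
        pos = target;
      }

      @Override
      public boolean seekToNewSource(long targetPos) {
        return false;   // only one copy of the data, nothing to fail over to
      }

      @Override
      public void close() throws IOException {
        decompressed.close();
      }
    }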

Deployment to Amazon EMR

EMR (Elastic MapReduce) is a service offered by Amazon, built on top of EC2 and Amazon's customized Hadoop ecosystem. EMRFS (the EMR filesystem) is Amazon's implementation for S3. The goal is to unlock computation against S3 within the cloud: data in the cloud (S3) and computation in the cloud (EMR), with no extra networking charge for data access. Please refer to the EMR online documentation for more of its features, such as bootstrapping and steps. I have already developed myfs and packaged it into myfs-0.1.jar. It takes a couple of steps to make it work within EMR:
1. Upload myfs-0.1.jar to S3, s3://myfsbucket/myfs-0.1.jar for example.
2. Bootstrap actions in the EMR setup (the equivalent core-site.xml entry is sketched at the end of this section):
   a. Configure hadoop: --core-key-value, fs.myfs.impl=org.myfs.MyFileSystem
   b. S3get: -s, s3://myfsbucket/myfs-0.1.jar, -d, /usr/share/my/lib/
   c. Configure hadoop: adding "/usr/share/my/lib/" to "/home/hadoop/conf/hadoop-user-env.sh"
3. After bootstrapping, myfs is up and running.

With the filesystem in place, we can operate on the formatted data (or cloud data) as a logical file, and unlock everything Hadoop and its MapReduce paradigm are capable of against this new data.
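
For reference, the --core-key-value argument in step 2a amounts to the following core-site.xml property (a sketch; the fs.myfs.impl key follows Hadoop's fs.<scheme>.impl convention and the class name comes from step 2a):

    <property>
      <name>fs.myfs.impl</name>
      <value>org.myfs.MyFileSystem</value>
    </property>

Alternatively, shipping a META-INF/services/org.apache.hadoop.fs.FileSystem entry inside myfs-0.1.jar would let the service-loading facility pick up MyFileSystem without this configuration step.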