Hive – Selected data import/query – Files and folders (mapred.input.dir.recursive)

Data import in Hive by default expects a directory name in its query specified by LOCATION keyword.
By default Hive picks up all the files from the dir and imports into itself.
If the directory does not contain files, rather consists of sub directories Hive blows up with the exception:

java.io.IOException:java.io.IOException: Not a file: /path/to/data/*

This quick post is about 2 customized ways of importing data into Hive:

1. Importing data from directory which contains sub directories
Set the mapred parameter recursive to true.
Syntax:

SET mapred.input.dir.recursive=true

The above parameter will enable hive to recursively scan all sub-directories and fetch all the data from all sub-directories.

In case you stumble upon this error, add both the parameters to your hive script:

FAILED: Hive Internal Error: hive.mapred.supports.subdirectories must be true if any one of following is true: hive.internal.ddl.list.bucketing.enable, hive.optimize.listbucketing and mapred.input.dir.recursive
set hive.mapred.supports.subdirectories=true;
set mapred.input.dir.recursive=true;

Table Creation Query is same as usual:

 DROP TABLE IF EXISTS sampledata;
 CREATE EXTERNAL TABLE sampldata (
 datestr string, ref_id string, locations string)
 ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
 LOCATION '/path/to/data/';

2. Fetching data only from selected files from the input directory:
Hive 0.8.0 onwards provides a virtual column named INPUT__FILE__NAME which can be used in while clauses to filter out unwanted files from data.

The table creation syntax is exactly same as any other table creation syntax. We filter our data out in the select queries.
Test:

 SELECT INPUT__FILE__NAME FROM sampledata LIMIT 20;

Fetch selected file’s data (by file name pattern):

SELECT * FROM sampledata WHERE INPUT__FILE__NAME LIKE '%FILENAME_BYDATE_201311%' limit 10;

This fetches data from files which have 201311 in name (i.e. data for month of nov-2013)

Hope its helpful. Cheers \m/

Yash Sharma is a Big Data & Machine Learning Engineer, A newbie OpenSource contributor, Plays guitar and enjoys teaching as part time hobby.
Talk to Yash about Distributed Systems and Data platform designs.

Leave a Reply

Your email address will not be published. Required fields are marked *