Hive fixed length serde can be used in scenarios where we do not have any delimiters in out data file. Using RegexSerDe for fixed length strings is pretty straight:
CREATE EXTERNAL TABLE customers (userid STRING, fb_id STRING, twitter_id STRING, status STRING)
ROW FORMAT SERDE ‘org.apache.hadoop.hive.contrib.serde2.RegexSerDe’
WITH SERDEPROPERTIES (“input.regex” = “(.{10})(.{10})(.{10})(.2})” )
LOCATION ‘path/to/data’;
The above query only expects exactly 32 characters in a line of text (10+10+10+2). The query can be customized to Ignore any characters at end after the useful data is read:
CREATE EXTERNAL TABLE customers ((userid STRING, fb_id STRING, twitter_id STRING, status STRING)
ROW FORMAT SERDE ‘org.apache.hadoop.hive.contrib.serde2.RegexSerDe’
WITH SERDEPROPERTIES (“input.regex” = “(.{10})(.{10})(.{10})(.{2}).*” )
LOCATION ‘path/to/data’;
Thats all. Have fun. Cheers \m/
Hi..thanks for such a nice explanation..!
I have a scenario in which I’ll be receiving the fixed length record in a file but the records vary in length and they have to be extracted out from different positions.
For eg. if I get a record like AAAABBBCCCCCCDD, I just need AAAA (from position 1 to 4) as my first column and then DD (from position 13 to 14) as my second column. I need only few columns, not all.
Is there any way to work out the same?
I know I can write the substring function to extract the same but I wanted to know whether this can be handled while creating external table only?
Thanks in advance..!
Try this : https://github.com/sanjivsingh/FixedLengthAndDelimitedSerde
Supported both fixed length and delimited data serilization/deserialization
It support both case in single definition.