Monday, May 25, 2015

How to read AVRO files using HAWQ

The below article will describe the steps required to read AVRO files using HAWQ. In the below example, we will showcase 2 different approaches as an example to read AVRO files
Case 1: Read raw AVRO files on HDFS filesystem.
gpadmin=# CREATE EXTERNAL TABLE hawq_avro_example (number int, first_name varchar, last_name varchar) LOCATION ('pxf://hdm1:50070/tmp/doctors.avro?Profile=Avro') FORMAT 'custom' (formatter='pxfwritable_import');
[gpadmin@hdw3 ~]$hdfs dfs -copyFromLocal doctors.avro /tmp/doctors.avro
gpadmin=# select * from hawq_avro_example ;                                                                                                                   number | first_name  | last_name
--------+-------------+-----------
      6 | Colin       | Baker
      3 | Jon         | Pertwee
      4 | Tom         | Baker
      5 | Peter       | Davison
     11 | Matt        | Smith
      1 | William     | Hartnell
      7 | Sylvester   | McCoy
      8 | Paul        | McGann
      2 | Patrick     | Troughton
      9 | Christopher | Eccleston
     10 | David       | Tennant
(11 rows)
Case 2: Read from Hive tables based on AVRO schema. In the below example, we have defined an avro schema and had included extra_field column just to showcase that default value can be read in Hive table and used in HAWQ if required.
hive> CREATE TABLE doctors
     ROW FORMAT
     SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
     STORED AS
     INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
     OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
     TBLPROPERTIES ('avro.schema.literal'='{
       "namespace": "testing.hive.avro.serde",
       "name": "doctors",
       "type": "record",
       "fields": [
         {
           "name":"number",
           "type":"int",
           "doc":"Order of playing the role"
         },
         {
           "name":"first_name",
           "type":"string",
           "doc":"first name of actor playing role"
         },
         {
           "name":"last_name",
           "type":"string",
           "doc":"last name of actor playing role"
         },
         {
           "name":"extra_field",
           "type":"string",
           "doc:":"an extra field not in the original file",
           "default":"fishfingers and custard"
         }
       ]
     }');
OK
Time taken: 5.789 seconds
hive> LOAD DATA LOCAL INPATH '/tmp/doctors.avro'  INTO TABLE doctors;
gpadmin=# CREATE EXTERNAL TABLE hive_avro_test (number int, first_name varchar, last_name varchar, extra_field varchar) LOCATION ('pxf://hdm1.phd.local:50070/doctors?PROFILE=Hive') FORMAT 'custom' (formatter='pxfwritable_import');
CREATE EXTERNAL TABLE
gpadmin=# select * from hive_avro_test ;
 number | first_name  | last_name |       extra_field
--------+-------------+-----------+-------------------------
      6 | Colin       | Baker     | fishfingers and custard
      3 | Jon         | Pertwee   | fishfingers and custard
      4 | Tom         | Baker     | fishfingers and custard
      5 | Peter       | Davison   | fishfingers and custard
     11 | Matt        | Smith     | fishfingers and custard
      1 | William     | Hartnell  | fishfingers and custard
      7 | Sylvester   | McCoy     | fishfingers and custard
      8 | Paul        | McGann    | fishfingers and custard
      2 | Patrick     | Troughton | fishfingers and custard
      9 | Christopher | Eccleston | fishfingers and custard
     10 | David       | Tennant   | fishfingers and custard
(11 rows)

No comments:

Post a Comment