How to resolve the Parquet "is not a Parquet file" error

Parquet is a column-oriented storage format for Hadoop. It comes with Snappy as its default compression codec, which is highly efficient both for storage space and for parallel processing. Impala queries perform exceptionally well with Parquet files compared to text files.

For workloads that read only a few columns out of hundreds in a large dataset (for example, SELECT emp_name, address FROM emp), column-oriented formats such as Parquet and ORC yield better performance than row-oriented ones.
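
As a quick illustration (the table and column names here are hypothetical), a Parquet-backed Hive table with Snappy compression and a narrow column scan might look like this:

  CREATE TABLE emp (
    emp_id   INT,
    emp_name STRING,
    address  STRING
  )
  STORED AS PARQUET
  TBLPROPERTIES ('parquet.compression'='SNAPPY');

  -- Only the emp_name and address column chunks are read from disk:
  SELECT emp_name, address FROM emp;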

We will now discuss a very common Parquet-related error and the detailed steps to resolve it.


Error: A simple SELECT query fails with the stack trace below:
2019-09-16 10:57:33,195 WARN [main] org.apache.hadoop.mapred.YarnChild: Exception running child : java.io.IOException: java.lang.reflect.InvocationTargetException
 at org.apache.hadoop.hive.io.HiveIOExceptionHandlerChain.handleRecordReaderCreationException(HiveIOExceptionHandlerChain.java:97)
 at org.apache.hadoop.hive.io.HiveIOExceptionHandlerUtil.handleRecordReaderCreationException(HiveIOExceptionHandlerUtil.java:57)
 at org.apache.hadoop.hive.shims.HadoopShimsSecure$CombineFileRecordReader.initNextRecordReader(HadoopShimsSecure.java:267)
 at org.apache.hadoop.hive.shims.HadoopShimsSecure$CombineFileRecordReader.<init>(HadoopShimsSecure.java:213)
 at org.apache.hadoop.hive.shims.HadoopShimsSecure$CombineFileInputFormatShim.getRecordReader(HadoopShimsSecure.java:334)
 at org.apache.hadoop.hive.ql.io.CombineHiveInputFormat.getRecordReader(CombineHiveInputFormat.java:695)
 at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.<init>(MapTask.java:169)
 at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:438)
 at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
 at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
 at java.security.AccessController.doPrivileged(Native Method)
 at javax.security.auth.Subject.doAs(Subject.java:422)
 at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1924)
 at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
Caused by: java.lang.reflect.InvocationTargetException
 at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
 at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
 at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
 at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
 at org.apache.hadoop.hive.shims.HadoopShimsSecure$CombineFileRecordReader.initNextRecordReader(HadoopShimsSecure.java:253)
 ... 11 more
Caused by: java.lang.RuntimeException: hdfs://user/hive/warehouse/<schema_name.db>/<tbl_name>/<partition>/part-m-00000 is not a Parquet file. expected magic number at tail [80, 65, 82, 49] but found [1, 50, 48, 10]
 at parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:423)
 at parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:386)
 at parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:372)
 at org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.getSplit(ParquetRecordReaderWrapper.java:252)
 at org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.<init>(ParquetRecordReaderWrapper.java:95)
 at org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.<init>(ParquetRecordReaderWrapper.java:81)
 at org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat.getRecordReader(MapredParquetInputFormat.java:72)
 at org.apache.hadoop.hive.ql.io.CombineHiveRecordReader.<init>(CombineHiveRecordReader.java:68)
 ... 16 more


Root Cause: A mismatch between the table's declared file format and the actual format of the files on HDFS leads to this error. A simple misconfiguration, such as loading text files into a table declared as Parquet, is usually the culprit.

Step 1: Check the health of the path for any corrupt or missing blocks. Fix any corrupt/missing block issues before moving further.
[user@host~]$ hdfs fsck  hdfs://HANameservice/user/hive/warehouse/<schema>/<tbl>/<partition>/part-m-00000
 Connecting to namenode via https://<Namenode>:50470/fsck?ugi=<user>&path=%2Fuser%2Fhive%2Fwarehouse%2F<schema>%2F<tbl>%2F<part>%2Fpart-m-00000
 FSCK started by user (auth:KERBEROS_SSL) from /X.X.X.X for path /user/hive/warehouse/<schema>/<tbl>/<partition>/part-m-00000  at Mon Sep 16 12:30:26 EDT 2019
 .Status: HEALTHY
  Total size:    2456790469 B
  Total dirs:    0
  Total files:   1
  Total symlinks:                0
  Total blocks (validated):      19 (avg. block size 129304761 B)
  Minimally replicated blocks:   19 (100.0 %)
  Over-replicated blocks:        0 (0.0 %)
  Under-replicated blocks:       0 (0.0 %)
  Mis-replicated blocks:         0 (0.0 %)
  Default replication factor:    3
  Average block replication:     3.0
  Corrupt blocks:                0
  Missing replicas:              0 (0.0 %)
  Number of data-nodes:          9
  Number of racks:               1
 FSCK ended at Mon Sep 16 12:30:26 EDT 2019 in 1 milliseconds


 The filesystem under path '/user/hive/warehouse/<schema>/<tbl>/<partition>/part-m-00000' is HEALTHY
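
If fsck reports corrupt or missing blocks instead, locate and handle the affected files before continuing. A minimal sketch (note that -delete permanently removes the damaged files, so prefer restoring them from a backup or re-ingesting the data):

  # List the files that have corrupt blocks under the table path
  hdfs fsck /user/hive/warehouse/<schema>/<tbl> -list-corruptfileblocks

  # Either move the damaged files to /lost+found or delete them
  hdfs fsck /user/hive/warehouse/<schema>/<tbl> -move
  hdfs fsck /user/hive/warehouse/<schema>/<tbl> -delete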

Step 2: Check whether the file mentioned in the error really is a Parquet file. A Parquet file starts and ends with the 4-byte magic number PAR1, so you should see PAR1 at the beginning of the first line:
  hdfs dfs -cat /user/hive/warehouse/<schema>/<tbl>/<partition>/part-m-00000 | head -1
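
Since the error complains about the magic number at the tail of the file ([80, 65, 82, 49] is the ASCII encoding of PAR1, while the trailing byte 10 in [1, 50, 48, 10] is a newline, typical of text data), it can help to check both ends of the file explicitly. A minimal sketch:

  hdfs dfs -cat /user/hive/warehouse/<schema>/<tbl>/<partition>/part-m-00000 | head -c 4   # expect: PAR1
  hdfs dfs -cat /user/hive/warehouse/<schema>/<tbl>/<partition>/part-m-00000 | tail -c 4   # expect: PAR1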
  
  
Step 3: Check the DDL of the table for any potential misconfiguration:
  show create table emp;
| ROW FORMAT SERDE                                   |
|   'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'  |
| STORED AS INPUTFORMAT                              |
|   'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'  |
| OUTPUTFORMAT                                       |
|   'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat' |
  

Step 4: Check whether the file format specified in the DDL matches the actual format of the files in HDFS. In this case, the table DDL says the data should be Parquet, but the actual file underneath was plain text.
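
If the parquet-tools utility happens to be available on your cluster (not every distribution ships it), it offers another way to confirm the mismatch. Against a genuine Parquet file the command below prints the footer metadata; against a text file it fails with the same "is not a Parquet file" error:

  parquet-tools meta hdfs://HANameservice/user/hive/warehouse/<schema>/<tbl>/<partition>/part-m-00000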

Fix:

    There are two ways to fix this problem; review each option carefully before choosing one.

Option 1: Convert the text data into Parquet:
      INSERT OVERWRITE TABLE parquet_table SELECT * FROM text_table;
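
The one-liner above assumes a readable text table already exists. Since in this scenario the mismatched table itself cannot be queried, a fuller sketch (with hypothetical table and column names) is to first align the broken table's metadata with the actual data, then rewrite it into a properly defined Parquet table:

      -- Point the existing table's metadata at the real (text) format so it becomes readable
      ALTER TABLE emp SET FILEFORMAT TEXTFILE;

      -- Create a Parquet table with the same schema and rewrite the data into it
      CREATE TABLE emp_parquet (emp_id INT, emp_name STRING, address STRING)
      STORED AS PARQUET;

      INSERT OVERWRITE TABLE emp_parquet SELECT * FROM emp;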

Option 2: Modify the DDL statement to specify the underlying file format as text.
     ALTER TABLE tablename SET FILEFORMAT TEXTFILE;
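
Keep in mind that SET FILEFORMAT at the table level only changes the table's metadata; existing partitions keep their own stored format. For a partitioned table like the one in this error, the affected partition may need to be altered as well (the partition spec below is hypothetical):

     ALTER TABLE tablename PARTITION (load_date='2019-09-16') SET FILEFORMAT TEXTFILE;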



For more details on Parquet: http://parquet.apache.org/
For Hive tables with Parquet: https://cwiki.apache.org/confluence/display/Hive/Parquet
