Parquet is a column-oriented storage format for Hadoop. It uses Snappy as the default compression codec, which offers a good trade-off between storage space and parallel processing speed. Impala queries perform significantly better against Parquet files than against plain text files.
For workloads that read only a few columns out of hundreds in a large dataset (e.g. select emp_name, address from emp), column-oriented formats such as Parquet and ORC yield better performance than row-oriented storage.
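As a quick illustration (the emp table and its columns here are hypothetical), a Hive table can be stored as Parquet simply by declaring it in the DDL:
CREATE TABLE emp (
  emp_id   INT,
  emp_name STRING,
  address  STRING
)
STORED AS PARQUET;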
We will now discuss a very common Parquet-related error and the detailed steps to resolve it.
Error: A simple SELECT query fails with the error below.
2019-09-16 10:57:33,195 WARN [main] org.apache.hadoop.mapred.YarnChild: Exception running child : java.io.IOException: java.lang.reflect.InvocationTargetException
at org.apache.hadoop.hive.io.HiveIOExceptionHandlerChain.handleRecordReaderCreationException(HiveIOExceptionHandlerChain.java:97)
at org.apache.hadoop.hive.io.HiveIOExceptionHandlerUtil.handleRecordReaderCreationException(HiveIOExceptionHandlerUtil.java:57)
at org.apache.hadoop.hive.shims.HadoopShimsSecure$CombineFileRecordReader.initNextRecordReader(HadoopShimsSecure.java:267)
at org.apache.hadoop.hive.shims.HadoopShimsSecure$CombineFileRecordReader.<init>(HadoopShimsSecure.java:213)
at org.apache.hadoop.hive.shims.HadoopShimsSecure$CombineFileInputFormatShim.getRecordReader(HadoopShimsSecure.java:334)
at org.apache.hadoop.hive.ql.io.CombineHiveInputFormat.getRecordReader(CombineHiveInputFormat.java:695)
at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.<init>(MapTask.java:169)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:438)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1924)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
at org.apache.hadoop.hive.shims.HadoopShimsSecure$CombineFileRecordReader.initNextRecordReader(HadoopShimsSecure.java:253)
... 11 more
Caused by: java.lang.RuntimeException: hdfs://user/hive/warehouse/<schema_name.db>/<tbl_name>/<partition>/part-m-00000 is not a Parquet file. expected magic number at tail [80, 65, 82, 49] but found [1, 50, 48, 10]
at parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:423)
at parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:386)
at parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:372)
at org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.getSplit(ParquetRecordReaderWrapper.java:252)
at org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.<init>(ParquetRecordReaderWrapper.java:95)
at org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.<init>(ParquetRecordReaderWrapper.java:81)
at org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat.getRecordReader(MapredParquetInputFormat.java:72)
at org.apache.hadoop.hive.ql.io.CombineHiveRecordReader.<init>(CombineHiveRecordReader.java:68)
... 16 more
Root Cause: A simple misconfiguration can lead to this issue. The error message itself is the key clue: a valid Parquet file must end with the 4-byte magic PAR1 (ASCII codes 80, 65, 82, 49), and this file ends with something else, so Hive is being asked to read a non-Parquet file as Parquet. The steps below show how to confirm this.
Step 1: Check the health of the file/directory for corrupt or missing blocks. Fix any corrupt or missing block issues before moving further (see the note after the fsck output below).
[user@host~]$ hdfs fsck hdfs://HANameservice/user/hive/warehouse/<schema>/<tbl>/<partition>/part-m-00000
Connecting to namenode via https://<Namenode>:50470/fsck?ugi=<user>&path=%2Fuser%2Fhive%2Fwarehouse%2F<schema>%2F<tbl>%2F<part>%2Fpart-m-00000
FSCK started by user (auth:KERBEROS_SSL) from /X.X.X.X for path /user/hive/warehouse/<schema>/<tbl>/<partition>/part-m-00000 at Mon Sep 16 12:30:26 EDT 2019
.Status: HEALTHY
Total size: 2456790469 B
Total dirs: 0
Total files: 1
Total symlinks: 0
Total blocks (validated): 19 (avg. block size 129304761 B)
Minimally replicated blocks: 19 (100.0 %)
Over-replicated blocks: 0 (0.0 %)
Under-replicated blocks: 0 (0.0 %)
Mis-replicated blocks: 0 (0.0 %)
Default replication factor: 3
Average block replication: 3.0
Corrupt blocks: 0
Missing replicas: 0 (0.0 %)
Number of data-nodes: 9
Number of racks: 1
FSCK ended at Mon Sep 16 12:30:26 EDT 2019 in 1 milliseconds
The filesystem under path '/user/hive/warehouse/<schema>/<tbl>/<partition>/part-m-00000' is HEALTHY
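Note: In this case the file reports HEALTHY, so block corruption is not the cause. If fsck does report corrupt or missing blocks, listing the affected files can help scope the damage; the warehouse path below is only an example:
hdfs fsck /user/hive/warehouse/<schema>/<tbl> -list-corruptfileblocks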
Step 2: Check whether the file mentioned in the error is really a Parquet file. A Parquet file begins and ends with the 4-byte magic string PAR1 (the [80, 65, 82, 49] in the error above), so you should see PAR1 at the start of the output if it is a Parquet file.
hdfs dfs -cat /user/hive/warehouse/<schema>/<tbl>/<partition>/part-m-00000|head -1
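A slightly more targeted check is to print just the first and last four bytes, since the PAR1 magic appears at both the head and the tail of a valid Parquet file (note that the tail check still streams the whole file through cat):
hdfs dfs -cat /user/hive/warehouse/<schema>/<tbl>/<partition>/part-m-00000 | head -c 4
hdfs dfs -cat /user/hive/warehouse/<schema>/<tbl>/<partition>/part-m-00000 | tail -c 4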
Step 3: Check the DDL of the table for any potential misconfiguration.
show create table emp;
| ROW FORMAT SERDE |
| 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' |
| STORED AS INPUTFORMAT |
| 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' |
| OUTPUTFORMAT |
| 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat' |
Step 4: Check whether the file format specified in the DDL matches the actual format of the file in HDFS. In my case the table DDL says the data should be Parquet, but the actual file underneath was plain text.
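For comparison, a table that was genuinely created as a text table typically shows SerDe and input/output format lines like the ones below (the exact output varies by Hive version):
| ROW FORMAT SERDE |
| 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' |
| STORED AS INPUTFORMAT |
| 'org.apache.hadoop.mapred.TextInputFormat' |
| OUTPUTFORMAT |
| 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat' |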
Fix:
There are two ways to fix this problem. Review each option carefully before attempting it.
Option 1: Convert the text data into Parquet by loading it through a table that is actually stored as Parquet.
INSERT OVERWRITE TABLE parquet_table SELECT * FROM text_table;
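An equivalent approach (the table names emp_text and emp_parquet below are placeholders) is a CREATE TABLE AS SELECT, which creates the Parquet table and converts the data in one statement:
CREATE TABLE emp_parquet STORED AS PARQUET AS SELECT * FROM emp_text;
Either way, the SELECT must read the text data through a table whose DDL really matches the text format; selecting through the mis-defined Parquet table will keep failing with the same error.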
Option 2: Modify the DDL statement to specify the underlying file format as text.
alter table tablename SET FILEFORMAT TEXTFILE;
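For a partitioned table like the one in this example, the file format is also recorded per partition, so existing partitions may need the same change (the partition column and value below are placeholders):
ALTER TABLE tablename PARTITION (load_date='2019-09-16') SET FILEFORMAT TEXTFILE;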
For more details on Parquet : http://parquet.apache.org/
For hive tables with parquet : https://cwiki.apache.org/confluence/display/Hive/Parquet