Parquet is a column-oriented storage format for Hadoop. It uses Snappy as the default compression codec, which offers a good trade-off between storage space and parallel processing speed. Impala queries perform significantly better against Parquet files than against plain text files.
For workloads that read only a few columns out of hundreds in a large dataset (e.g. select emp_name, address from emp), column-oriented formats such as Parquet and ORC yield better performance than row-oriented storage.
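As a quick illustration (the emp table and its columns here are hypothetical), a Hive table can be stored as Parquet simply by declaring it in the DDL:
CREATE TABLE emp (
  emp_id   INT,
  emp_name STRING,
  address  STRING
)
STORED AS PARQUET;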
We will now discuss a very common Parquet-related error and the detailed steps to resolve it.
Error: A simple SELECT query fails with the error below.
2019-09-16 10:57:33,195 WARN [main] org.apache.hadoop.mapred.YarnChild: Exception running child : java.io.IOException: java.lang.reflect.InvocationTargetException
at org.apache.hadoop.hive.io.HiveIOExceptionHandlerChain.handleRecordReaderCreationException(HiveIOExceptionHandlerChain.java:97)
at org.apache.hadoop.hive.io.HiveIOExceptionHandlerUtil.handleRecordReaderCreationException(HiveIOExceptionHandlerUtil.java:57)
at org.apache.hadoop.hive.shims.HadoopShimsSecure$CombineFileRecordReader.initNextRecordReader(HadoopShimsSecure.java:267)
at org.apache.hadoop.hive.shims.HadoopShimsSecure$CombineFileRecordReader.<init>(HadoopShimsSecure.java:213)
at org.apache.hadoop.hive.shims.HadoopShimsSecure$CombineFileInputFormatShim.getRecordReader(HadoopShimsSecure.java:334)
at org.apache.hadoop.hive.ql.io.CombineHiveInputFormat.getRecordReader(CombineHiveInputFormat.java:695)
at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.<init>(MapTask.java:169)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:438)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1924)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
at org.apache.hadoop.hive.shims.HadoopShimsSecure$CombineFileRecordReader.initNextRecordReader(HadoopShimsSecure.java:253)
... 11 more
Caused by: java.lang.RuntimeException: hdfs://user/hive/warehouse/<schema_name.db>/<tbl_name>/<partition>/part-m-00000 is not a Parquet file. expected magic number at tail [80, 65, 82, 49] but found [1, 50, 48, 10]
at parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:423)
at parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:386)
at parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:372)
at org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.getSplit(ParquetRecordReaderWrapper.java:252)
at org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.<init>(ParquetRecordReaderWrapper.java:95)
at org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.<init>(ParquetRecordReaderWrapper.java:81)
at org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat.getRecordReader(MapredParquetInputFormat.java:72)
at org.apache.hadoop.hive.ql.io.CombineHiveRecordReader.<init>(CombineHiveRecordReader.java:68)
... 16 more
Root Cause: A simple misconfiguration can lead to this issue. The error message itself is the key clue: a valid Parquet file must end with the 4-byte magic PAR1 (ASCII codes 80, 65, 82, 49), and this file ends with something else, so Hive is being asked to read a non-Parquet file as Parquet. The steps below show how to confirm this.
Step 1: Check the health of the file/directory for corrupt or missing blocks. Fix any corrupt or missing block issues before moving further (see the note after the fsck output below).
[user@host~]$ hdfs fsck hdfs://HANameservice/user/hive/warehouse/<schema>/<tbl>/<partition>/part-m-00000
Connecting to namenode via https://<Namenode>:50470/fsck?ugi=<user>&path=%2Fuser%2Fhive%2Fwarehouse%2F<schema>%2F<tbl>%2F<part>%2Fpart-m-00000
FSCK started by user (auth:KERBEROS_SSL) from /X.X.X.X for path /user/hive/warehouse/<schema>/<tbl>/<partition>/part-m-00000 at Mon Sep 16 12:30:26 EDT 2019
.Status: HEALTHY
Total size: 2456790469 B
Total dirs: 0
Total files: 1
Total symlinks: 0
Total blocks (validated): 19 (avg. block size 129304761 B)
Minimally replicated blocks: 19 (100.0 %)
Over-replicated blocks: 0 (0.0 %)
Under-replicated blocks: 0 (0.0 %)
Mis-replicated blocks: 0 (0.0 %)
Default replication factor: 3
Average block replication: 3.0
Corrupt blocks: 0
Missing replicas: 0 (0.0 %)
Number of data-nodes: 9
Number of racks: 1
FSCK ended at Mon Sep 16 12:30:26 EDT 2019 in 1 milliseconds
The filesystem under path '/user/hive/warehouse/<schema>/<tbl>/<partition>/part-m-00000' is HEALTHY
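Note: In this case the file reports HEALTHY, so block corruption is not the cause. If fsck does report corrupt or missing blocks, listing the affected files can help scope the damage; the warehouse path below is only an example:
hdfs fsck /user/hive/warehouse/<schema>/<tbl> -list-corruptfileblocks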
Step 2: Check whether the file mentioned in the error is really a Parquet file. A Parquet file begins and ends with the 4-byte magic string PAR1 (the [80, 65, 82, 49] in the error above), so you should see PAR1 at the start of the output if it is a Parquet file.
hdfs dfs -cat /user/hive/warehouse/<schema>/<tbl>/<partition>/part-m-00000|head -1
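A slightly more targeted check is to print just the first and last four bytes, since the PAR1 magic appears at both the head and the tail of a valid Parquet file (note that the tail check still streams the whole file through cat):
hdfs dfs -cat /user/hive/warehouse/<schema>/<tbl>/<partition>/part-m-00000 | head -c 4
hdfs dfs -cat /user/hive/warehouse/<schema>/<tbl>/<partition>/part-m-00000 | tail -c 4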
Step 3: Check the DDL of the table for any potential misconfiguration.
show create table emp;
| ROW FORMAT SERDE |
| 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' |
| STORED AS INPUTFORMAT |
| 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' |
| OUTPUTFORMAT |
| 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat' |
Step 4: Check whether the file format specified in the DDL matches the actual format of the file in HDFS. In my case the table DDL says the data should be Parquet, but the actual file underneath was plain text.
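For comparison, a table that was genuinely created as a text table typically shows SerDe and input/output format lines like the ones below (the exact output varies by Hive version):
| ROW FORMAT SERDE |
| 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' |
| STORED AS INPUTFORMAT |
| 'org.apache.hadoop.mapred.TextInputFormat' |
| OUTPUTFORMAT |
| 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat' |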
Fix:
There are two ways to fix this problem. Review each option carefully before attempting it.
Option 1: Convert the text data into Parquet by loading it through a table that is actually stored as Parquet.
INSERT OVERWRITE TABLE parquet_table SELECT * FROM text_table;
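An equivalent approach (the table names emp_text and emp_parquet below are placeholders) is a CREATE TABLE AS SELECT, which creates the Parquet table and converts the data in one statement:
CREATE TABLE emp_parquet STORED AS PARQUET AS SELECT * FROM emp_text;
Either way, the SELECT must read the text data through a table whose DDL really matches the text format; selecting through the mis-defined Parquet table will keep failing with the same error.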
Option 2: Modify the DDL statement to specify the underlying file format as text.
alter table tablename SET FILEFORMAT TEXTFILE;
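For a partitioned table like the one in this example, the file format is also recorded per partition, so existing partitions may need the same change (the partition column and value below are placeholders):
ALTER TABLE tablename PARTITION (load_date='2019-09-16') SET FILEFORMAT TEXTFILE;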
For more details on Parquet : http://parquet.apache.org/
For hive tables with parquet : https://cwiki.apache.org/confluence/display/Hive/Parquet