Fixing ArrayIndexOutOfBoundsException: 1024 when MapReduce reads Hive ORC files
While processing ORC files in a MapReduce job, I ran into the following exception:
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 1024
at org.apache.orc.impl.RunLengthIntegerReaderV2.nextVector(RunLengthIntegerReaderV2.java:369)
at org.apache.orc.impl.TreeReaderFactory$BytesColumnVectorUtil.commonReadByteArrays(TreeReaderFactory.java:1231)
at org.apache.orc.impl.TreeReaderFactory$BytesColumnVectorUtil.readOrcByteArrays(TreeReaderFactory.java:1268)
at org.apache.orc.impl.TreeReaderFactory$StringDirectTreeReader.nextVector(TreeReaderFactory.java:1368)
at org.apache.orc.impl.TreeReaderFactory$StringTreeReader.nextVector(TreeReaderFactory.java:1212)
at org.apache.orc.impl.TreeReaderFactory$ListTreeReader.nextVector(TreeReaderFactory.java:1902)
at org.apache.orc.impl.TreeReaderFactory$StructTreeReader.nextBatch(TreeReaderFactory.java:1737)
at org.apache.orc.impl.RecordReaderImpl.nextBatch(RecordReaderImpl.java:1045)
at org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.ensureBatch(RecordReaderImpl.java:77)
at org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.hasNext(RecordReaderImpl.java:89)
A search showed that this bug was already fixed in Hive 2.1.1. That is the version I am using, and the corresponding source code does contain the fix from the following patch: https://issues.apache.org/jira/browse/HIVE-14483
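Decompiling my final packaged jar, however, showed that it contained the unfixed version of the code. Going through the dependencies, I found that the following module also pulls in the ORC jar: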
<dependency>
<groupId>org.apache.orc</groupId>
<artifactId>orc-mapreduce</artifactId>
<version>1.1.0</version>
</dependency>
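As an alternative to decompiling, the classpath itself can be asked where a class comes from. The following is only a minimal sketch: the class name ClassOriginFinder and its helper methods are made up for illustration and are not part of Hive or ORC. It assumes it is run with the same unpackaged classpath as the job (before the jars are merged into a single fat jar). If it prints more than one location for org.apache.orc.impl.RunLengthIntegerReaderV2, two jars ship the same class and the one that wins may be the unpatched copy.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URL;
import java.util.Collections;
import java.util.List;

// Hypothetical helper, not part of Hive or ORC: lists every classpath
// location that provides a given class, to spot duplicate copies.
final class ClassOriginFinder {

    // Converts a fully qualified class name into its classpath resource path.
    static String toResourcePath(String className) {
        return className.replace('.', '/') + ".class";
    }

    // Returns every location on the classpath that contains the class file.
    static List<URL> findAll(String className) {
        ClassLoader loader = Thread.currentThread().getContextClassLoader();
        if (loader == null) {
            loader = ClassLoader.getSystemClassLoader();
        }
        try {
            return Collections.list(loader.getResources(toResourcePath(className)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Defaults to the class at the top of the stack trace.
        String name = args.length > 0
                ? args[0]
                : "org.apache.orc.impl.RunLengthIntegerReaderV2";
        List<URL> hits = findAll(name);
        if (hits.isEmpty()) {
            System.out.println(name + " was not found on the classpath");
        }
        for (URL hit : hits) {
            System.out.println(hit);
        }
    }
}
```

Each printed URL names the jar that holds a copy of the class, which makes it easier to see which dependency needs to be upgraded or excluded.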
The fix is to upgrade orc-mapreduce to version 1.1.2, with the dependency configured as follows:
<dependency>
<groupId>org.apache.orc</groupId>
<artifactId>orc-mapreduce</artifactId>
<version>1.1.2</version>
</dependency>