Programming interfaces

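  • RDD: Resilient Distributed Dataset. The main programming interface before Spark 2.0. Since Spark 2.0 it is no longer the recommended interface and has largely been replaced by Dataset, although it is still supported.
  • Dataset: the recommended programming interface since Spark 2.0, replacing RDD. Like an RDD, a Dataset is strongly typed, but this does not apply to PySpark, because Python is dynamically typed; the typed Dataset API is available only in Scala and Java. The Spark engine applies richer optimizations to Datasets, so they generally perform better than RDDs.
  • DataFrame: a Dataset organized into named columns, similar to a data frame in Python (pandas) or R. In Scala, DataFrame is not a separate class but a type alias for Dataset[Row].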

Architecture


The error message is as follows:

  Cloning https://github.com/gethue/PyHive to /tmp/pip-req-build-86w_hwe4
Running command git clone -q https://github.com/gethue/PyHive /tmp/pip-req-build-86w_hwe4
fatal: unable to access 'https://github.com/gethue/PyHive/': gnutls_handshake() failed: The TLS connection was non-properly terminated.
WARNING: Discarding git+https://github.com/gethue/PyHive. Command errored out with exit status 128: git clone -q https://github.com/gethue/PyHive /tmp/pip-req-build-86w_hwe4 Check the logs for full command output.
ERROR: Command errored out with exit status 128: git clone -q https://github.com/gethue/PyHive /tmp/pip-req-build-86w_hwe4 Check the logs for full command output.
The command '/bin/sh -c ./build/env/bin/pip install --no-cache-dir psycopg2-binary django_redis==4.11.0 flower git+https://github.com/gethue/PyHive git+https://github.com/bryanyang0528/ksql-python pydruid pybigquery elasticsearch-dbapi pyasn1==0.4.1 python-snappy==0.5.4 threadloop sqlalchemy-clickhouse infi.clickhouse_orm==1.0.4' returned a non-zero code: 1

Most answers online blame a proxy problem, but that did not help in my case. After some searching I found a proxy that works well: https://mirror.ghproxy.com
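A minimal sketch of how the proxy might be applied to the failing requirement. It assumes the proxy serves a GitHub URL when the full upstream URL is appended to the proxy address; that URL pattern is an assumption, not something the error log confirms. The helper name `proxied_url` is hypothetical.

```shell
#!/bin/bash
# Assumption: the proxy serves a GitHub URL when that full URL is
# appended to the proxy address.
GH_PROXY="https://mirror.ghproxy.com"

# proxied_url is a hypothetical helper: it prefixes GitHub URLs with the
# proxy address and returns any other URL unchanged.
proxied_url() {
  case "$1" in
    https://github.com/*) printf '%s/%s\n' "$GH_PROXY" "$1" ;;
    *) printf '%s\n' "$1" ;;
  esac
}

# Build the pip requirement for the package that failed to clone.
echo "git+$(proxied_url https://github.com/gethue/PyHive)"
```

The printed requirement could then replace `git+https://github.com/gethue/PyHive` in the `pip install` line of the Dockerfile. Under the same assumption, git's `url.<base>.insteadOf` setting is another way to redirect every GitHub clone without editing each requirement.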


Garbage collector types

The Java HotSpot VM includes three different types of collectors, each with different performance characteristics.

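  • The serial collector uses a single thread to perform all garbage collection work, which makes it relatively efficient because there is no communication overhead between threads. It is best suited to single-processor machines because it cannot take advantage of multiprocessor hardware, although it can be useful on multiprocessors for applications with small data sets (up to approximately 100 MB). The serial collector is selected by default on certain hardware and operating system configurations, or can be explicitly enabled with the option -XX:+UseSerialGC.
  • The parallel collector (also known as the throughput collector) performs minor collections in parallel, which can significantly reduce garbage collection overhead. It is intended for applications with medium-sized to large data sets that run on multiprocessor or multithreaded hardware. The parallel collector is selected by default on certain hardware and operating system configurations, or can be explicitly enabled with the option -XX:+UseParallelGC. Parallel compaction is a feature that lets the parallel collector perform major collections in parallel. Without parallel compaction, major collections are performed using a single thread, which can significantly limit scalability. Parallel compaction is enabled by default if -XX:+UseParallelGC has been specified. The option to turn it off is -XX:-UseParallelOldGC.
  • The mostly concurrent collector performs most of its work concurrently (for example, while the application is still running) to keep garbage collection pauses short. It is designed for applications with medium-sized to large data sets in which response time is more important than overall throughput, because the techniques used to minimize pauses can reduce application performance. The Java HotSpot VM offers a choice between two mostly concurrent collectors; see The Mostly Concurrent Collectors. Use -XX:+UseConcMarkSweepGC to enable the CMS collector or -XX:+UseG1GC to enable the G1 collector.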
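The options above can be collected in a small lookup, sketched here under stated assumptions: `gc_flag` is a hypothetical helper, `app.jar` is a placeholder, and whether a given option is accepted depends on the JDK version in use.

```shell
#!/bin/bash
# gc_flag is a hypothetical helper that maps a collector name to the
# HotSpot option named in the list above.
gc_flag() {
  case "$1" in
    serial)   echo "-XX:+UseSerialGC" ;;
    parallel) echo "-XX:+UseParallelGC" ;;
    cms)      echo "-XX:+UseConcMarkSweepGC" ;;
    g1)       echo "-XX:+UseG1GC" ;;
    *)        echo "unknown collector: $1" >&2; return 1 ;;
  esac
}

# Print the launch command instead of running it; app.jar is a placeholder.
echo "java $(gc_flag g1) -jar app.jar"
```

Printing the command rather than executing it keeps the sketch independent of any installed JDK.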

The full exception is as follows:

2021-08-25 14:37:24,329 [INFO] [Dispatcher thread {Central}] |HistoryEventHandler.criticalEvents|: [HISTORY][DAG:dag_1612602874723_344108_1][Event:TASK_ATTEMPT_FINISHED]: vertexName=Map 1, taskAttemptId=attempt_1612602874723_344108_1_00_000004_0, creationTime=1629873439340, allocationTime=1629873440827, startTime=1629873442059, finishTime=1629873444322, timeTaken=2263, status=FAILED, taskFailureType=FATAL, errorEnum=APPLICATION_ERROR, diagnostics=Error: Error while running task ( failure ) : java.lang.RuntimeException: java.lang.AssertionError: Output column number expected to be 0 when isRepeating
at org.apache.hadoop.hive.ql.exec.tez.MapRecordSource.processRow(MapRecordSource.java:101)
at org.apache.hadoop.hive.ql.exec.tez.MapRecordSource.pushRecord(MapRecordSource.java:76)
at org.apache.hadoop.hive.ql.exec.tez.MapRecordProcessor.run(MapRecordProcessor.java:419)
at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:267)
at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:250)
at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:381)
at org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:82)
at org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:69)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1762)
at org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:69)
at org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:39)
at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
at com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:125)
at com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:57)
at com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:78)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.AssertionError: Output column number expected to be 0 when isRepeating
at org.apache.hadoop.hive.ql.exec.vector.BytesColumnVector.setElement(BytesColumnVector.java:492)
at org.apache.hadoop.hive.ql.exec.vector.expressions.VectorUDFMapIndexBaseScalar.evaluate(VectorUDFMapIndexBaseScalar.java:84)
at org.apache.hadoop.hive.ql.exec.vector.expressions.VectorExpression.evaluateChildren(VectorExpression.java:271)
at org.apache.hadoop.hive.ql.exec.vector.expressions.VectorCoalesce.evaluate(VectorCoalesce.java:60)
at org.apache.hadoop.hive.ql.exec.vector.expressions.VectorExpression.evaluateChildren(VectorExpression.java:271)
at org.apache.hadoop.hive.ql.exec.vector.expressions.gen.FilterStringGroupColNotEqualStringGroupScalarBase.evaluate(FilterStringGroupColNotEqualStringGroupScalarBase.java:64)
at org.apache.hadoop.hive.ql.exec.vector.expressions.FilterExprAndExpr.evaluate(FilterExprAndExpr.java:42)
at org.apache.hadoop.hive.ql.exec.vector.VectorFilterOperator.process(VectorFilterOperator.java:125)
at org.apache.hadoop.hive.ql.exec.Operator.vectorForward(Operator.java:966)
at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:939)
at org.apache.hadoop.hive.ql.exec.TableScanOperator.process(TableScanOperator.java:125)
at org.apache.hadoop.hive.ql.exec.vector.VectorMapOperator.deliverVectorizedRowBatch(VectorMapOperator.java:812)
at org.apache.hadoop.hive.ql.exec.vector.VectorMapOperator.process(VectorMapOperator.java:845)
at org.apache.hadoop.hive.ql.exec.tez.MapRecordSource.processRow(MapRecordSource.java:92)
... 19 more
, errorMessage=Cannot recover from this error:java.lang.RuntimeException: java.lang.AssertionError: Output column number expected to be 0 when isRepeating
at org.apache.hadoop.hive.ql.exec.tez.MapRecordSource.processRow(MapRecordSource.java:101)
at org.apache.hadoop.hive.ql.exec.tez.MapRecordSource.pushRecord(MapRecordSource.java:76)
at org.apache.hadoop.hive.ql.exec.tez.MapRecordProcessor.run(MapRecordProcessor.java:419)
at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:267)
at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:250)
at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:381)
at org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:82)
at org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:69)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1762)
at org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:69)
at org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:39)
at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
at com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:125)
at com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:57)
at com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:78)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.AssertionError: Output column number expected to be 0 when isRepeating
at org.apache.hadoop.hive.ql.exec.vector.BytesColumnVector.setElement(BytesColumnVector.java:492)
at org.apache.hadoop.hive.ql.exec.vector.expressions.VectorUDFMapIndexBaseScalar.evaluate(VectorUDFMapIndexBaseScalar.java:84)
at org.apache.hadoop.hive.ql.exec.vector.expressions.VectorExpression.evaluateChildren(VectorExpression.java:271)
at org.apache.hadoop.hive.ql.exec.vector.expressions.VectorCoalesce.evaluate(VectorCoalesce.java:60)
at org.apache.hadoop.hive.ql.exec.vector.expressions.VectorExpression.evaluateChildren(VectorExpression.java:271)
at org.apache.hadoop.hive.ql.exec.vector.expressions.gen.FilterStringGroupColNotEqualStringGroupScalarBase.evaluate(FilterStringGroupColNotEqualStringGroupScalarBase.java:64)
at org.apache.hadoop.hive.ql.exec.vector.expressions.FilterExprAndExpr.evaluate(FilterExprAndExpr.java:42)
at org.apache.hadoop.hive.ql.exec.vector.VectorFilterOperator.process(VectorFilterOperator.java:125)
at org.apache.hadoop.hive.ql.exec.Operator.vectorForward(Operator.java:966)
at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:939)
at org.apache.hadoop.hive.ql.exec.TableScanOperator.process(TableScanOperator.java:125)
at org.apache.hadoop.hive.ql.exec.vector.VectorMapOperator.deliverVectorizedRowBatch(VectorMapOperator.java:812)
at org.apache.hadoop.hive.ql.exec.vector.VectorMapOperator.process(VectorMapOperator.java:845)
at org.apache.hadoop.hive.ql.exec.tez.MapRecordSource.processRow(MapRecordSource.java:92)
... 19 more
, nodeHttpAddress=datanode143-ysten3:8042, counters=Counters: 15, File System Counters, HDFS_BYTES_READ=216088, HDFS_BYTES_WRITTEN=3, HDFS_READ_OPS=3, HDFS_LARGE_READ_OPS=0, HDFS_WRITE_OPS=1, org.apache.tez.common.counters.TaskCounter, GC_TIME_MILLIS=136, CPU_MILLISECONDS=7620, WALL_CLOCK_MILLISECONDS=2160, PHYSICAL_MEMORY_BYTES=1120403456, VIRTUAL_MEMORY_BYTES=3798720512, COMMITTED_HEAP_BYTES=1120403456, INPUT_RECORDS_PROCESSED=26, INPUT_SPLIT_LENGTH_BYTES=10128027, OUTPUT_RECORDS=0, HIVE, CREATED_FILES=1
2021-08-25 14:37:24,332 [INFO] [Dispatcher thread {Central}] |impl.TaskImpl|: Failing task: task_1612602874723_344108_1_00_000004 due to FATAL error reported by TaskAttempt. CurrentFailedAttempts=1
2021-08-25 14:37:24,334 [INFO] [Dispatcher thread {Central}] |HistoryEventHandler.criticalEvents|: [HISTORY][DAG:dag_1612602874723_344108_1][Event:TASK_FINISHED]: vertexName=Map 1, taskId=task_1612602874723_344108_1_00_000004, startTime=1629873442059, finishTime=1629873444333, timeTaken=2274, status=FAILED, successfulAttemptID=null, diagnostics=TaskAttempt 0 failed, info=[Error: Error while running task ( failure ) : java.lang.RuntimeException: java.lang.AssertionError: Output column number expected to be 0 when isRepeating
at org.apache.hadoop.hive.ql.exec.tez.MapRecordSource.processRow(MapRecordSource.java:101)
at org.apache.hadoop.hive.ql.exec.tez.MapRecordSource.pushRecord(MapRecordSource.java:76)
at org.apache.hadoop.hive.ql.exec.tez.MapRecordProcessor.run(MapRecordProcessor.java:419)
at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:267)
at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:250)
at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:381)
at org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:82)
at org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:69)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1762)
at org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:69)
at org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:39)
at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
at com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:125)
at com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:57)
at com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:78)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.AssertionError: Output column number expected to be 0 when isRepeating
at org.apache.hadoop.hive.ql.exec.vector.BytesColumnVector.setElement(BytesColumnVector.java:492)
at org.apache.hadoop.hive.ql.exec.vector.expressions.VectorUDFMapIndexBaseScalar.evaluate(VectorUDFMapIndexBaseScalar.java:84)
at org.apache.hadoop.hive.ql.exec.vector.expressions.VectorExpression.evaluateChildren(VectorExpression.java:271)
at org.apache.hadoop.hive.ql.exec.vector.expressions.VectorCoalesce.evaluate(VectorCoalesce.java:60)
at org.apache.hadoop.hive.ql.exec.vector.expressions.VectorExpression.evaluateChildren(VectorExpression.java:271)
at org.apache.hadoop.hive.ql.exec.vector.expressions.gen.FilterStringGroupColNotEqualStringGroupScalarBase.evaluate(FilterStringGroupColNotEqualStringGroupScalarBase.java:64)
at org.apache.hadoop.hive.ql.exec.vector.expressions.FilterExprAndExpr.evaluate(FilterExprAndExpr.java:42)
at org.apache.hadoop.hive.ql.exec.vector.VectorFilterOperator.process(VectorFilterOperator.java:125)
at org.apache.hadoop.hive.ql.exec.Operator.vectorForward(Operator.java:966)
at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:939)
at org.apache.hadoop.hive.ql.exec.TableScanOperator.process(TableScanOperator.java:125)
at org.apache.hadoop.hive.ql.exec.vector.VectorMapOperator.deliverVectorizedRowBatch(VectorMapOperator.java:812)
at org.apache.hadoop.hive.ql.exec.vector.VectorMapOperator.process(VectorMapOperator.java:845)
at org.apache.hadoop.hive.ql.exec.tez.MapRecordSource.processRow(MapRecordSource.java:92)
... 19 more
, errorMessage=Cannot recover from this error:java.lang.RuntimeException: java.lang.AssertionError: Output column number expected to be 0 when isRepeating
at org.apache.hadoop.hive.ql.exec.tez.MapRecordSource.processRow(MapRecordSource.java:101)
at org.apache.hadoop.hive.ql.exec.tez.MapRecordSource.pushRecord(MapRecordSource.java:76)
at org.apache.hadoop.hive.ql.exec.tez.MapRecordProcessor.run(MapRecordProcessor.java:419)
at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:267)
at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:250)
at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:381)
at org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:82)
at org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:69)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1762)
at org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:69)
at org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:39)
at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
at com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:125)
at com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:57)
at com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:78)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.AssertionError: Output column number expected to be 0 when isRepeating
at org.apache.hadoop.hive.ql.exec.vector.BytesColumnVector.setElement(BytesColumnVector.java:492)
at org.apache.hadoop.hive.ql.exec.vector.expressions.VectorUDFMapIndexBaseScalar.evaluate(VectorUDFMapIndexBaseScalar.java:84)
at org.apache.hadoop.hive.ql.exec.vector.expressions.VectorExpression.evaluateChildren(VectorExpression.java:271)
at org.apache.hadoop.hive.ql.exec.vector.expressions.VectorCoalesce.evaluate(VectorCoalesce.java:60)
at org.apache.hadoop.hive.ql.exec.vector.expressions.VectorExpression.evaluateChildren(VectorExpression.java:271)
at org.apache.hadoop.hive.ql.exec.vector.expressions.gen.FilterStringGroupColNotEqualStringGroupScalarBase.evaluate(FilterStringGroupColNotEqualStringGroupScalarBase.java:64)
at org.apache.hadoop.hive.ql.exec.vector.expressions.FilterExprAndExpr.evaluate(FilterExprAndExpr.java:42)
at org.apache.hadoop.hive.ql.exec.vector.VectorFilterOperator.process(VectorFilterOperator.java:125)
at org.apache.hadoop.hive.ql.exec.Operator.vectorForward(Operator.java:966)
at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:939)
at org.apache.hadoop.hive.ql.exec.TableScanOperator.process(TableScanOperator.java:125)
at org.apache.hadoop.hive.ql.exec.vector.VectorMapOperator.deliverVectorizedRowBatch(VectorMapOperator.java:812)
at org.apache.hadoop.hive.ql.exec.vector.VectorMapOperator.process(VectorMapOperator.java:845)
at org.apache.hadoop.hive.ql.exec.tez.MapRecordSource.processRow(MapRecordSource.java:92)
... 19 more
], counters=Counters: 0

The fix is to add the following configuration to hive-site.xml:


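For the detailed installation steps, follow the official installation guide. This post only supplements that guide.

Installing from the apache-tez-0.10.1-bin.tar.gz package downloaded from the official site did not succeed and produced the exception below. In the end, installing by building from the official source code worked and passed testing.

Environment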


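In my setup, passwordless sudo could not be configured during the Docker image build stage, yet real deployments sometimes need sudo. My workaround is to build the image and run CMD as root, execute the parts that need elevated privileges inside the CMD script, and then start the service process as a regular user.

Starting a process as a regular user from root can be done with either su or runuser. I use su: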

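#!/bin/bash

# Run from the directory that contains this script.
cd "$(dirname "$0")" || exit 1

# This step needs root, so it runs before dropping privileges.
if ! "${RANGER_HOME}/enable-hive-plugin.sh"; then
    echo "Failed to enable the Ranger plugin!" >&2
    exit 1
fi

# Start HiveServer2 as the unprivileged hive user, preserving the environment.
su -mp -c '/opt/hive/bin/hiveserver2' hive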