public class AdaptiveScheduler extends Object implements SchedulerNG
SchedulerNG implementation that uses declarative resource management and
automatically adapts the parallelism if not enough resources can be acquired to run at the
configured parallelism, as described in FLIP-160.
This scheduler only supports jobs with streaming semantics, i.e., all vertices are connected via pipelined data-exchanges.
The implementation is spread over multiple State classes that control which RPCs are
allowed in a given state and what state transitions are possible (see the FLIP for an overview).
This class can thus be roughly split into two parts:
1) RPCs, which must forward the call to the state via State.tryRun(Class, ThrowingConsumer, String) or State.tryCall(Class, FunctionWithException, String).
2) Context methods, which are called by states to either transition into another state or access functionality of some component in the scheduler.
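The two-part split can be pictured with a small, self-contained sketch. It is not Flink's implementation: the `Cancellable` and `StatusProvider` capability interfaces, the `MiniScheduler` class, and the status strings are hypothetical, and `tryRun`/`tryCall` are simplified to take `java.util.function` types instead of `ThrowingConsumer` and `FunctionWithException`. The idea it illustrates is that an RPC is executed only if the current state supports it, and is otherwise ignored.

```java
import java.util.Optional;
import java.util.function.Consumer;
import java.util.function.Function;

public class StateForwardingSketch {

    /** Simplified stand-in for the scheduler's State interface (assumption, not Flink's code). */
    interface State {
        /** Runs the action only if this state is an instance of the requested type. */
        default <T> void tryRun(Class<T> type, Consumer<T> action, String debugMessage) {
            if (type.isInstance(this)) {
                action.accept(type.cast(this));
            } else {
                System.out.println(
                        "Ignoring '" + debugMessage + "' in state " + getClass().getSimpleName());
            }
        }

        /** Like tryRun, but returns a value if the state supports the call. */
        default <T, V> Optional<V> tryCall(Class<T> type, Function<T, V> action, String debugMessage) {
            if (type.isInstance(this)) {
                return Optional.of(action.apply(type.cast(this)));
            }
            return Optional.empty();
        }
    }

    /** Hypothetical capability interfaces: a state implements only what it allows. */
    interface Cancellable {
        void cancel();
    }

    interface StatusProvider {
        String status();
    }

    static final class Executing implements State, Cancellable, StatusProvider {
        private boolean cancelled;

        @Override
        public void cancel() {
            cancelled = true;
        }

        @Override
        public String status() {
            return cancelled ? "CANCELLING" : "RUNNING";
        }
    }

    static final class Finished implements State, StatusProvider {
        @Override
        public String status() {
            return "FINISHED";
        }
    }

    static final class MiniScheduler {
        private State state;

        MiniScheduler(State initialState) {
            this.state = initialState;
        }

        // Part 1: RPCs forward the call to the current state.
        void cancel() {
            state.tryRun(Cancellable.class, Cancellable::cancel, "cancel");
        }

        String requestJobStatus() {
            return state.tryCall(StatusProvider.class, StatusProvider::status, "requestJobStatus")
                    .orElse("UNKNOWN");
        }

        // Part 2: context method that a state would call to trigger a transition.
        void goToFinished() {
            state = new Finished();
        }
    }

    public static void main(String[] args) {
        MiniScheduler scheduler = new MiniScheduler(new Executing());
        scheduler.cancel();                               // Executing supports cancel
        System.out.println(scheduler.requestJobStatus()); // prints CANCELLING
        scheduler.goToFinished();
        scheduler.cancel();                               // Finished does not: call is ignored
        System.out.println(scheduler.requestJobStatus()); // prints FINISHED
    }
}
```

Dispatching on a capability type keeps the "which RPC is allowed in which state" decision inside the state classes, so the scheduler class itself needs no per-state branching.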
| Constructor and Description |
|---|
| AdaptiveScheduler(JobGraph jobGraph, org.apache.flink.configuration.Configuration configuration, DeclarativeSlotPool declarativeSlotPool, SlotAllocator slotAllocator, Executor ioExecutor, ClassLoader userCodeClassLoader, CheckpointsCleaner checkpointsCleaner, CheckpointRecoveryFactory checkpointRecoveryFactory, java.time.Duration initialResourceAllocationTimeout, java.time.Duration resourceStabilizationTimeout, JobManagerJobMetricGroup jobManagerJobMetricGroup, RestartBackoffTimeStrategy restartBackoffTimeStrategy, long initializationTimestamp, org.apache.flink.runtime.concurrent.ComponentMainThreadExecutor mainThreadExecutor, org.apache.flink.runtime.rpc.FatalErrorHandler fatalErrorHandler, JobStatusListener jobStatusListener, ExecutionGraphFactory executionGraphFactory) |
| Modifier and Type | Method | Description |
|---|---|---|
| void | acknowledgeCheckpoint(org.apache.flink.api.common.JobID jobID, ExecutionAttemptID executionAttemptID, long checkpointId, CheckpointMetrics checkpointMetrics, TaskStateSnapshot checkpointState) | |
| void | archiveFailure(RootExceptionHistoryEntry failure) | Archives the failure. |
| void | cancel() | |
| boolean | canScaleUp(ExecutionGraph executionGraph) | Asks if we can scale up the currently executing job. |
| CompletableFuture<Void> | closeAsync() | |
| void | declineCheckpoint(DeclineCheckpoint decline) | |
| CompletableFuture<CoordinationResponse> | deliverCoordinationRequestToCoordinator(OperatorID operator, CoordinationRequest request) | Delivers a coordination request to the OperatorCoordinator with the given OperatorID and returns the coordinator's response. |
| void | deliverOperatorEventToCoordinator(ExecutionAttemptID taskExecution, OperatorID operator, OperatorEvent evt) | Delivers the given OperatorEvent to the OperatorCoordinator with the given OperatorID. |
| ArchivedExecutionGraph | getArchivedExecutionGraph(org.apache.flink.api.common.JobStatus jobStatus, Throwable cause) | Creates an ArchivedExecutionGraph for the given jobStatus and failure cause. |
| Executor | getIOExecutor() | Gets the I/O executor. |
| CompletableFuture<org.apache.flink.api.common.JobStatus> | getJobTerminationFuture() | |
| org.apache.flink.runtime.concurrent.ComponentMainThreadExecutor | getMainThreadExecutor() | Gets the main thread executor. |
| void | goToCanceling(ExecutionGraph executionGraph, ExecutionGraphHandler executionGraphHandler, OperatorCoordinatorHandler operatorCoordinatorHandler, List<ExceptionHistoryEntry> failureCollection) | Transitions into the Canceling state. |
| void | goToCreatingExecutionGraph() | Transitions into the CreatingExecutionGraph state. |
| void | goToExecuting(ExecutionGraph executionGraph, ExecutionGraphHandler executionGraphHandler, OperatorCoordinatorHandler operatorCoordinatorHandler, List<ExceptionHistoryEntry> failureCollection) | Transitions into the Executing state. |
| void | goToFailing(ExecutionGraph executionGraph, ExecutionGraphHandler executionGraphHandler, OperatorCoordinatorHandler operatorCoordinatorHandler, Throwable failureCause, List<ExceptionHistoryEntry> failureCollection) | Transitions into the Failing state. |
| void | goToFinished(ArchivedExecutionGraph archivedExecutionGraph) | Transitions into the Finished state. |
| void | goToRestarting(ExecutionGraph executionGraph, ExecutionGraphHandler executionGraphHandler, OperatorCoordinatorHandler operatorCoordinatorHandler, java.time.Duration backoffTime, List<ExceptionHistoryEntry> failureCollection) | Transitions into the Restarting state. |
| CompletableFuture<String> | goToStopWithSavepoint(ExecutionGraph executionGraph, ExecutionGraphHandler executionGraphHandler, OperatorCoordinatorHandler operatorCoordinatorHandler, CheckpointScheduling checkpointScheduling, CompletableFuture<String> savepointFuture, List<ExceptionHistoryEntry> failureCollection) | Transitions into the StopWithSavepoint state. |
| void | goToWaitingForResources() | Transitions into the WaitingForResources state. |
| void | handleGlobalFailure(Throwable cause) | Handles a global failure. |
| boolean | hasDesiredResources(ResourceCounter desiredResources) | Checks whether we have the desired resources. |
| boolean | hasSufficientResources() | Checks if we currently have sufficient resources for executing the job. |
| org.apache.flink.runtime.scheduler.adaptive.FailureResult | howToHandleFailure(Throwable failure) | Asks how to handle the failure. |
| boolean | isState(org.apache.flink.runtime.scheduler.adaptive.State expectedState) | Checks whether the current state is the expected state. |
| void | notifyKvStateRegistered(org.apache.flink.api.common.JobID jobId, JobVertexID jobVertexId, KeyGroupRange keyGroupRange, String registrationName, org.apache.flink.queryablestate.KvStateID kvStateId, InetSocketAddress kvStateServerAddress) | |
| void | notifyKvStateUnregistered(org.apache.flink.api.common.JobID jobId, JobVertexID jobVertexId, KeyGroupRange keyGroupRange, String registrationName) | |
| void | onFinished(ArchivedExecutionGraph archivedExecutionGraph) | Callback which is called when the execution reaches the Finished state. |
| void | reportCheckpointMetrics(org.apache.flink.api.common.JobID jobID, ExecutionAttemptID executionAttemptID, long checkpointId, CheckpointMetrics checkpointMetrics) | |
| ExecutionGraphInfo | requestJob() | |
| JobDetails | requestJobDetails() | |
| org.apache.flink.api.common.JobStatus | requestJobStatus() | |
| KvStateLocation | requestKvStateLocation(org.apache.flink.api.common.JobID jobId, String registrationName) | |
| SerializedInputSplit | requestNextInputSplit(JobVertexID vertexID, ExecutionAttemptID executionAttempt) | |
| ExecutionState | requestPartitionState(IntermediateDataSetID intermediateResultId, ResultPartitionID resultPartitionId) | |
| void | runIfState(org.apache.flink.runtime.scheduler.adaptive.State expectedState, Runnable action) | Runs the given action if the current state equals the expected state. |
| ScheduledFuture<?> | runIfState(org.apache.flink.runtime.scheduler.adaptive.State expectedState, Runnable action, java.time.Duration delay) | Runs the given action after a delay if the state at that time equals the expected state. |
| void | startScheduling() | |
| CompletableFuture<String> | stopWithSavepoint(String targetDirectory, boolean terminate, org.apache.flink.core.execution.SavepointFormatType formatType) | |
| CompletableFuture<String> | triggerCheckpoint() | |
| CompletableFuture<String> | triggerSavepoint(String targetDirectory, boolean cancelJob, org.apache.flink.core.execution.SavepointFormatType formatType) | |
| org.apache.flink.runtime.scheduler.adaptive.CreatingExecutionGraph.AssignmentResult | tryToAssignSlots(org.apache.flink.runtime.scheduler.adaptive.CreatingExecutionGraph.ExecutionGraphWithVertexParallelism executionGraphWithVertexParallelism) | Tries to assign slots to the created ExecutionGraph. |
| void | updateAccumulators(AccumulatorSnapshot accumulatorSnapshot) | |
| boolean | updateTaskExecutionState(TaskExecutionStateTransition taskExecutionState) | |

Methods inherited from class Object: clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Methods inherited from interface SchedulerNG: updateTaskExecutionState

public AdaptiveScheduler(JobGraph jobGraph, org.apache.flink.configuration.Configuration configuration, DeclarativeSlotPool declarativeSlotPool, SlotAllocator slotAllocator, Executor ioExecutor, ClassLoader userCodeClassLoader, CheckpointsCleaner checkpointsCleaner, CheckpointRecoveryFactory checkpointRecoveryFactory, java.time.Duration initialResourceAllocationTimeout, java.time.Duration resourceStabilizationTimeout, JobManagerJobMetricGroup jobManagerJobMetricGroup, RestartBackoffTimeStrategy restartBackoffTimeStrategy, long initializationTimestamp, org.apache.flink.runtime.concurrent.ComponentMainThreadExecutor mainThreadExecutor, org.apache.flink.runtime.rpc.FatalErrorHandler fatalErrorHandler, JobStatusListener jobStatusListener, ExecutionGraphFactory executionGraphFactory) throws JobExecutionException

public void startScheduling()
Specified by: startScheduling in interface SchedulerNG

public CompletableFuture<Void> closeAsync()
Specified by: closeAsync in interface org.apache.flink.util.AutoCloseableAsync

public void cancel()
Specified by: cancel in interface SchedulerNG

public CompletableFuture<org.apache.flink.api.common.JobStatus> getJobTerminationFuture()
Specified by: getJobTerminationFuture in interface SchedulerNG

public void handleGlobalFailure(Throwable cause)
Description copied from interface: GlobalFailureHandler
Handles a global failure.
Specified by: handleGlobalFailure in interface GlobalFailureHandler
Parameters:
cause - A cause that describes the global failure.

public boolean updateTaskExecutionState(TaskExecutionStateTransition taskExecutionState)
Specified by: updateTaskExecutionState in interface SchedulerNG

public SerializedInputSplit requestNextInputSplit(JobVertexID vertexID, ExecutionAttemptID executionAttempt) throws IOException
Specified by: requestNextInputSplit in interface SchedulerNG
Throws: IOException

public ExecutionState requestPartitionState(IntermediateDataSetID intermediateResultId, ResultPartitionID resultPartitionId) throws PartitionProducerDisposedException

public ExecutionGraphInfo requestJob()
Specified by: requestJob in interface SchedulerNG

public void archiveFailure(RootExceptionHistoryEntry failure)

public org.apache.flink.api.common.JobStatus requestJobStatus()
Specified by: requestJobStatus in interface SchedulerNG

public JobDetails requestJobDetails()
Specified by: requestJobDetails in interface SchedulerNG

public KvStateLocation requestKvStateLocation(org.apache.flink.api.common.JobID jobId, String registrationName) throws UnknownKvStateLocation, FlinkJobNotFoundException

public void notifyKvStateRegistered(org.apache.flink.api.common.JobID jobId, JobVertexID jobVertexId, KeyGroupRange keyGroupRange, String registrationName, org.apache.flink.queryablestate.KvStateID kvStateId, InetSocketAddress kvStateServerAddress) throws FlinkJobNotFoundException
Specified by: notifyKvStateRegistered in interface SchedulerNG
Throws: FlinkJobNotFoundException

public void notifyKvStateUnregistered(org.apache.flink.api.common.JobID jobId, JobVertexID jobVertexId, KeyGroupRange keyGroupRange, String registrationName) throws FlinkJobNotFoundException
Specified by: notifyKvStateUnregistered in interface SchedulerNG
Throws: FlinkJobNotFoundException

public void updateAccumulators(AccumulatorSnapshot accumulatorSnapshot)
Specified by: updateAccumulators in interface SchedulerNG

public CompletableFuture<String> triggerSavepoint(@Nullable String targetDirectory, boolean cancelJob, org.apache.flink.core.execution.SavepointFormatType formatType)
Specified by: triggerSavepoint in interface SchedulerNG

public CompletableFuture<String> triggerCheckpoint()
Specified by: triggerCheckpoint in interface SchedulerNG

public void acknowledgeCheckpoint(org.apache.flink.api.common.JobID jobID, ExecutionAttemptID executionAttemptID, long checkpointId, CheckpointMetrics checkpointMetrics, TaskStateSnapshot checkpointState)
Specified by: acknowledgeCheckpoint in interface SchedulerNG

public void reportCheckpointMetrics(org.apache.flink.api.common.JobID jobID, ExecutionAttemptID executionAttemptID, long checkpointId, CheckpointMetrics checkpointMetrics)
Specified by: reportCheckpointMetrics in interface SchedulerNG

public void declineCheckpoint(DeclineCheckpoint decline)
Specified by: declineCheckpoint in interface SchedulerNG

public CompletableFuture<String> stopWithSavepoint(@Nullable String targetDirectory, boolean terminate, org.apache.flink.core.execution.SavepointFormatType formatType)
Specified by: stopWithSavepoint in interface SchedulerNG

public void deliverOperatorEventToCoordinator(ExecutionAttemptID taskExecution, OperatorID operator, OperatorEvent evt) throws org.apache.flink.util.FlinkException
Description copied from interface: SchedulerNG
Delivers the given OperatorEvent to the OperatorCoordinator with the given OperatorID.
Failure semantics: If the task manager sends an event for a non-running task or a non-existing operator coordinator, the call is answered with an exception. If both task and coordinator exist, the call from the TaskManager is assumed to be valid, and any exception that bubbles up must cause a job failure.
Specified by: deliverOperatorEventToCoordinator in interface SchedulerNG
Throws: org.apache.flink.util.FlinkException - Thrown if the task is not running or no operator/coordinator exists for the given ID.

public CompletableFuture<CoordinationResponse> deliverCoordinationRequestToCoordinator(OperatorID operator, CoordinationRequest request) throws org.apache.flink.util.FlinkException
Description copied from interface: SchedulerNG
Delivers a coordination request to the OperatorCoordinator with the given OperatorID and returns the coordinator's response.
Specified by: deliverCoordinationRequestToCoordinator in interface SchedulerNG
Throws: org.apache.flink.util.FlinkException - Thrown if the task is not running, or no operator/coordinator exists for the given ID, or the coordinator cannot handle client events.

public boolean hasDesiredResources(ResourceCounter desiredResources)
Parameters:
desiredResources - desiredResources describing the desired resources
Returns: true if we have enough resources; otherwise false

public boolean hasSufficientResources()
Returns: true if we have sufficient resources; otherwise false

public ArchivedExecutionGraph getArchivedExecutionGraph(org.apache.flink.api.common.JobStatus jobStatus, @Nullable Throwable cause)
Creates an ArchivedExecutionGraph for the given jobStatus and failure cause.
Parameters:
jobStatus - jobStatus to create the ArchivedExecutionGraph with
cause - cause represents the failure cause for the ArchivedExecutionGraph; null if there is no failure cause
Returns: ArchivedExecutionGraph

public void goToWaitingForResources()
Description copied from interface: StateTransitions.ToWaitingForResources
Transitions into the WaitingForResources state.

public void goToExecuting(ExecutionGraph executionGraph, ExecutionGraphHandler executionGraphHandler, OperatorCoordinatorHandler operatorCoordinatorHandler, List<ExceptionHistoryEntry> failureCollection)
Description copied from interface: StateTransitions.ToExecuting
Transitions into the Executing state.
Parameters:
executionGraph - executionGraph to pass to the Executing state
executionGraphHandler - executionGraphHandler to pass to the Executing state
operatorCoordinatorHandler - operatorCoordinatorHandler to pass to the Executing state
failureCollection - collection of failures that are propagated

public void goToCanceling(ExecutionGraph executionGraph, ExecutionGraphHandler executionGraphHandler, OperatorCoordinatorHandler operatorCoordinatorHandler, List<ExceptionHistoryEntry> failureCollection)
Description copied from interface: StateTransitions.ToCancelling
Transitions into the Canceling state.
Parameters:
executionGraph - executionGraph to pass to the Canceling state
executionGraphHandler - executionGraphHandler to pass to the Canceling state
operatorCoordinatorHandler - operatorCoordinatorHandler to pass to the Canceling state
failureCollection - collection of failures that are propagated

public void goToRestarting(ExecutionGraph executionGraph, ExecutionGraphHandler executionGraphHandler, OperatorCoordinatorHandler operatorCoordinatorHandler, java.time.Duration backoffTime, List<ExceptionHistoryEntry> failureCollection)
Description copied from interface: StateTransitions.ToRestarting
Transitions into the Restarting state.
Parameters:
executionGraph - executionGraph to pass to the Restarting state
executionGraphHandler - executionGraphHandler to pass to the Restarting state
operatorCoordinatorHandler - operatorCoordinatorHandler to pass to the Restarting state
backoffTime - backoffTime to wait before transitioning to the Restarting state
failureCollection - collection of failures that are propagated

public void goToFailing(ExecutionGraph executionGraph, ExecutionGraphHandler executionGraphHandler, OperatorCoordinatorHandler operatorCoordinatorHandler, Throwable failureCause, List<ExceptionHistoryEntry> failureCollection)
Description copied from interface: StateTransitions.ToFailing
Transitions into the Failing state.
Parameters:
executionGraph - executionGraph to pass to the Failing state
executionGraphHandler - executionGraphHandler to pass to the Failing state
operatorCoordinatorHandler - operatorCoordinatorHandler to pass to the Failing state
failureCause - failureCause describing why the job execution failed
failureCollection - collection of failures that are propagated

public CompletableFuture<String> goToStopWithSavepoint(ExecutionGraph executionGraph, ExecutionGraphHandler executionGraphHandler, OperatorCoordinatorHandler operatorCoordinatorHandler, CheckpointScheduling checkpointScheduling, CompletableFuture<String> savepointFuture, List<ExceptionHistoryEntry> failureCollection)
Description copied from interface: StateTransitions.ToStopWithSavepoint
Transitions into the StopWithSavepoint state.
Parameters:
executionGraph - executionGraph to pass to the StopWithSavepoint state
executionGraphHandler - executionGraphHandler to pass to the StopWithSavepoint state
operatorCoordinatorHandler - operatorCoordinatorHandler to pass to the StopWithSavepoint state
savepointFuture - Future for the savepoint to complete.
failureCollection - collection of failures that are propagated

public void goToFinished(ArchivedExecutionGraph archivedExecutionGraph)
Description copied from interface: StateTransitions.ToFinished
Transitions into the Finished state.
Parameters:
archivedExecutionGraph - archivedExecutionGraph which is passed to the Finished state

public void goToCreatingExecutionGraph()
Transitions into the CreatingExecutionGraph state.

public org.apache.flink.runtime.scheduler.adaptive.CreatingExecutionGraph.AssignmentResult tryToAssignSlots(org.apache.flink.runtime.scheduler.adaptive.CreatingExecutionGraph.ExecutionGraphWithVertexParallelism executionGraphWithVertexParallelism)
Tries to assign slots to the created ExecutionGraph. If it is possible, then this method returns a successful AssignmentResult which contains the assigned ExecutionGraph. If not, then the assignment result is a failure.
Parameters:
executionGraphWithVertexParallelism - executionGraphWithVertexParallelism to assign slots to resources
Returns: AssignmentResult representing the result of the assignment

public boolean canScaleUp(ExecutionGraph executionGraph)
Parameters:
executionGraph - executionGraph for making the scaling decision.

public void onFinished(ArchivedExecutionGraph archivedExecutionGraph)
Callback which is called when the execution reaches the Finished state.
Parameters:
archivedExecutionGraph - archivedExecutionGraph represents the final state of the job execution

public org.apache.flink.runtime.scheduler.adaptive.FailureResult howToHandleFailure(Throwable failure)
Parameters:
failure - failure describing the failure cause
Returns: FailureResult which describes how to handle the failure

public Executor getIOExecutor()
public org.apache.flink.runtime.concurrent.ComponentMainThreadExecutor getMainThreadExecutor()
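
The state-guarded helpers isState and runIfState can be pictured with a minimal sketch. This is an illustration under assumptions, not the scheduler's code: the `RunIfStateSketch` class, its `transitionTo` and `shutdown` helpers, and the use of a plain `ScheduledExecutorService` in place of the component main thread executor are all hypothetical, and states are compared by identity. What it shows is that the delayed variant checks the state when the task fires, not when it is scheduled, so an action scheduled in one state is skipped if the state has changed in the meantime.

```java
import java.time.Duration;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class RunIfStateSketch {

    // Stand-in for the main thread executor (assumption for this sketch).
    private final ScheduledExecutorService executor =
            Executors.newSingleThreadScheduledExecutor();

    // Current state; volatile because the sketch reads it from the executor thread.
    private volatile Object state;

    RunIfStateSketch(Object initialState) {
        this.state = initialState;
    }

    /** Checks whether the current state is the expected state (identity comparison). */
    boolean isState(Object expectedState) {
        return expectedState == state;
    }

    /** Runs the action immediately, but only if the current state is the expected one. */
    void runIfState(Object expectedState, Runnable action) {
        if (isState(expectedState)) {
            action.run();
        }
    }

    /** Schedules the action; the state check happens when the delay has elapsed. */
    ScheduledFuture<?> runIfState(Object expectedState, Runnable action, Duration delay) {
        return executor.schedule(
                () -> runIfState(expectedState, action), delay.toMillis(), TimeUnit.MILLISECONDS);
    }

    /** Hypothetical helper standing in for the goTo... transition methods. */
    void transitionTo(Object newState) {
        state = newState;
    }

    void shutdown() {
        executor.shutdownNow();
    }

    public static void main(String[] args) throws Exception {
        Object waiting = new Object();
        Object executing = new Object();
        RunIfStateSketch scheduler = new RunIfStateSketch(waiting);
        AtomicInteger runs = new AtomicInteger();

        // State matches, so the action runs right away.
        scheduler.runIfState(waiting, runs::incrementAndGet);

        // Scheduled while waiting, but the state changes before the delay elapses,
        // so the guard sees the new state and the action is skipped.
        ScheduledFuture<?> pending =
                scheduler.runIfState(waiting, runs::incrementAndGet, Duration.ofMillis(200));
        scheduler.transitionTo(executing);
        pending.get();

        System.out.println("actions run: " + runs.get());
        scheduler.shutdown();
    }
}
```

Guarding at execution time is what makes timeouts safe to leave pending across transitions: a stale timer from an earlier state simply does nothing.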

public boolean isState(org.apache.flink.runtime.scheduler.adaptive.State expectedState)
Parameters:
expectedState - expectedState is the expected state
Returns: true if the current state equals the expected state; otherwise false

public void runIfState(org.apache.flink.runtime.scheduler.adaptive.State expectedState, Runnable action)
Parameters:
expectedState - expectedState is the expected state
action - action to run if the current state equals the expected state

public ScheduledFuture<?> runIfState(org.apache.flink.runtime.scheduler.adaptive.State expectedState, Runnable action, java.time.Duration delay)
Parameters:
expectedState - expectedState describes the required state at the time of running the action
action - action to run if the expected state equals the actual state
delay - delay after which to run the action

Copyright © 2014–2022 The Apache Software Foundation. All rights reserved.