Handle errors with automatic retries and notifications.
Failure is inevitable. Kestra provides automatic retries and error handling to help you build resilient workflows.
Error handling
By default, a failure of any task will stop the execution and will mark it as failed.
For more control over error handling, you can add errors
tasks, AllowFailure
tasks, or automatic retries.
The errors
property allows you to execute one or more actions before terminating the flow, e.g. sending an email or a Slack message to your team. The property is named errors
because they are triggered upon errors within your flow.
You can implement error handling at the flow-level or on a namespace-level.
- Flow-level: Useful to implement custom alerting for a specific flow or task. This can be accomplished by adding
errors
tasks. - Namespace-level: Useful to send a notification for any failed Execution within a given namespace. This approach allows you to implement centralized error handling for all flows within a given namespace.
Flow-level error handling using errors
The errors
is a property of a flow that accepts a list of tasks that will be executed when an error occurs. You can add as many tasks as you want, and they will be executed sequentially.
The example below sends a flow-level failure alert via Slack using the SlackIncomingWebhook task defined using the errors
property.
id: unreliable_flow
namespace: company.team
tasks:
- id: fail
type: io.kestra.plugin.scripts.shell.Commands
taskRunner:
type: io.kestra.plugin.core.runner.Process
commands:
- exit 1
errors:
- id: alert_on_failure
type: io.kestra.plugin.notifications.slack.SlackIncomingWebhook
url: "{{ secret('SLACK_WEBHOOK') }}" # https://hooks.slack.com/services/xyz/xyz/xyz
payload: |
{
"channel": "#alerts",
"text": "Failure alert for flow {{ flow.namespace }}.{{ flow.id }} with ID {{ execution.id }}"
}
Check the error handling page for more details.
Namespace-level error handling using a Flow trigger
To get notified on a workflow failure, you can leverage Kestra's built-in notification tasks, including among others (the list keeps growing with new releases):
For a centralized namespace-level alerting, we recommend adding a dedicated monitoring workflow with one of the above mentioned notification tasks and a Flow trigger. Below is an example workflow that automatically sends a Slack alert as soon as any flow in the namespace company.analytics
fails or finishes with warnings.
id: failure_alert
namespace: company.monitoring
tasks:
- id: send
type: io.kestra.plugin.notifications.slack.SlackExecution
url: "{{ secret('SLACK_WEBHOOK') }}"
channel: "#general"
executionId: "{{trigger.executionId}}"
triggers:
- id: listen
type: io.kestra.plugin.core.trigger.Flow
conditions:
- type: io.kestra.plugin.core.condition.ExecutionStatusCondition
in:
- FAILED
- WARNING
- type: io.kestra.plugin.core.condition.ExecutionNamespaceCondition
namespace: company.analytics
prefix: true
Adding this single flow will ensure that you receive a Slack alert on any flow failure in the prod
namespace. Here is an example alert notification:
Retries
When working with external systems, transient errors are common. For example, a file may not be available yet, an API might be temporarily unreachable, or a database can be under maintenance. In such cases, retries can automatically fix the issue without human intervention.
Configuring retries
Each task can be retried a certain number of times and in a specific way. Use the retry
property with the desired type of retry.
The following types of retries are currently supported:
- Constant: The task will be retried every X seconds/minutes/hours/days.
- Exponential: The task will also be retried every X seconds/minutes/hours/days, but with an exponential backoff, i.e. an exponential time interval in between each retry attempt.
- Random: The task will be retried every X seconds/minutes/hours/days with a random delay i.e. a random time interval in between each retry attempt.
In this example, we will retry the task 5 times up to 1 minute of a total task run duration, with a constant interval of 2 seconds between each retry attempt.
id: retries
namespace: company.team
tasks:
- id: fail_four_times
type: io.kestra.plugin.scripts.shell.Commands
taskRunner:
type: io.kestra.plugin.core.runner.Process
commands:
- 'if [ "{{ taskrun.attemptsCount }}" -eq 4 ]; then exit 0; else exit 1; fi'
retry:
type: constant
interval: PT2S
maxAttempt: 5
maxDuration: PT1M
warningOnRetry: false
errors:
- id: will_never_run
type: io.kestra.plugin.core.debug.Return
format: This will never be executed as retries will fix the issue
Adding a retry configuration to our tutorial workflow
Let's get back to our example from the Fundamentals section. We will add a retry configuration to the api
task. API calls are prone to transient errors, so we will retry that task up to 10 times, for at most 1 hour of total duration, every 10 seconds (i.e. with a constant interval of 10 seconds in between retry attempts).
id: getting_started
namespace: company.team
tasks:
- id: api
type: io.kestra.plugin.core.http.Request
uri: https://dummyjson.com/products
retry:
type: constant # type: string
interval: PT20S # type: Duration
maxDuration: PT1H # type: Duration
maxAttempt: 10 # type: int
warningOnRetry: true # type: boolean, default is false
Was this page helpful?