Core concepts

Job

A job in Ocypod represents some task created by clients, which will be queued, then fetched and processed by workers.

Each job has a set of metadata associated with it, some of which is managed by Ocypod, and some of which can be created/updated by clients/workers.

Job lifecycle and statuses

When a job is initially created, it's added to a queue and assigned the queued status.

Workers will then poll that queue for new jobs, receiving the job's payload (the contents of its input field) and the job's ID. When a worker receives a job, the job is removed from its queue and its status is set to running.

If the worker completes the job, it will send a message to Ocypod asking it to update the job's status to completed. If there's some error/exception and the worker can't finish the job, it will mark the job as failed.

If the worker doesn't mark the job as completed or failed before the job's timeout (or heartbeat timeout) is exceeded, then Ocypod marks the job as timed_out.

Ocypod will periodically look at all failed and timed out jobs and check if they're eligible for automatic retries, and if so, will re-queue them.
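The lifecycle described above can be read as a small state machine. The following is a minimal sketch in Python that models only the statuses and transitions mentioned in this section; the names Status, TRANSITIONS and can_transition are illustrative and not part of Ocypod, and the server may define further statuses not shown here.

```python
from enum import Enum


class Status(str, Enum):
    QUEUED = "queued"
    RUNNING = "running"
    COMPLETED = "completed"
    FAILED = "failed"
    TIMED_OUT = "timed_out"


# Transitions as described in the prose above (an illustrative model only).
TRANSITIONS = {
    Status.QUEUED: {Status.RUNNING},        # a worker picks the job up
    Status.RUNNING: {Status.COMPLETED, Status.FAILED, Status.TIMED_OUT},
    Status.FAILED: {Status.QUEUED},         # re-queued if eligible for retry
    Status.TIMED_OUT: {Status.QUEUED},      # re-queued if eligible for retry
    Status.COMPLETED: set(),                # nothing follows completion
}


def can_transition(current: Status, new: Status) -> bool:
    """Return True if the lifecycle above allows moving from current to new."""
    return new in TRANSITIONS[current]
```

For example, can_transition(Status.QUEUED, Status.RUNNING) is true, while can_transition(Status.COMPLETED, Status.QUEUED) is false.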

Job metadata

The Ocypod server maintains the following information about a job, some of which is immutable, some of which will be modified by Ocypod throughout a job's lifecycle, and some of which is modifiable by clients.

Job Status

A job in Ocypod will always have one of the following statuses:

To aid clients that are checking on the status of jobs, each job also has an ended boolean field. This is set to true if the job is in its final state, or false otherwise.

A job is marked as ended in the following circumstances:

Queue

Each queue in Ocypod has its own settings, which are used as defaults for jobs created on that queue (though they can be overridden on a per-job basis).

A queue in Ocypod is a FIFO, with new jobs being added to the beginning of the queue, and workers taking jobs from the end of the queue.
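The ordering can be illustrated with an in-memory double-ended queue. This is only a sketch of the FIFO behaviour using Python's collections.deque, not a description of how Ocypod stores queues internally; the job IDs are made up.

```python
from collections import deque

queue = deque()

# New jobs are added to the beginning of the queue.
for job_id in (1, 2, 3):
    queue.appendleft(job_id)

# Workers take jobs from the end, so the oldest job comes out first.
first = queue.pop()
remaining = list(queue)
```

Here first is job 1, the earliest job added, and jobs 3 and 2 remain in the queue.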

Queue settings

Each queue has a number of settings, which are defaults that are applied to new jobs created in that queue. Each can be overridden on a per-job basis; they exist at the queue level only for convenience.
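The relationship between queue defaults and per-job overrides can be sketched as a simple dictionary merge. The setting names below are the ones documented in this section, but the values and the effective_settings helper are hypothetical, and this is not how Ocypod implements the behaviour.

```python
# Hypothetical queue-level defaults, using the setting names from this section.
queue_settings = {
    "timeout": "5m",
    "heartbeat_timeout": "30s",
    "expires_after": "1d",
    "retries": 2,
    "retry_delays": [],
}


def effective_settings(queue_defaults: dict, job_overrides: dict) -> dict:
    """Return the settings for one job: queue defaults, with job values winning."""
    return {**queue_defaults, **job_overrides}


# A job that needs a longer timeout and no retries; everything else is inherited.
job_settings = effective_settings(queue_settings, {"timeout": "1h", "retries": 0})
```

The queue's own settings are left unchanged, so other jobs on the same queue still receive the defaults.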


timeout

This is the maximum amount of time a job can be running for before it's considered to have timed out. It's specified as a human readable duration string, e.g. "30s", "1h15m5s", "3w2d", etc.

To disable timeouts entirely, this can be set to "0s".
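To make the duration format concrete, here is a minimal Python sketch that parses strings like the examples above. It assumes only the units w, d, h, m and s; Ocypod's own parser may accept more units or different spellings, and parse_duration is a hypothetical helper, not part of any Ocypod client.

```python
import re
from datetime import timedelta

# Seconds per unit, for the units that appear in the examples above.
_UNIT_SECONDS = {"w": 604800, "d": 86400, "h": 3600, "m": 60, "s": 1}
_TOKEN = re.compile(r"(\d+)([wdhms])")


def parse_duration(text: str) -> timedelta:
    """Parse a string such as "30s" or "1h15m5s" into a timedelta."""
    position = 0
    total_seconds = 0
    for match in _TOKEN.finditer(text):
        if match.start() != position:
            # Something unparseable sits between two number/unit pairs.
            raise ValueError(f"invalid duration: {text!r}")
        total_seconds += int(match.group(1)) * _UNIT_SECONDS[match.group(2)]
        position = match.end()
    if not text or position != len(text):
        raise ValueError(f"invalid duration: {text!r}")
    return timedelta(seconds=total_seconds)
```

With this sketch, parse_duration("1h15m5s") yields 4505 seconds, and parse_duration("0s") yields a zero duration, which is the value that disables the timeout.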


heartbeat_timeout

For long-running jobs, it's recommended that workers send regular heartbeats to the Ocypod server to let it know that the job is still being processed. This allows timeouts or failures to be noticed much earlier than when relying on timeout alone.

The heartbeat_timeout setting determines how long a job can be running without getting a heartbeat update before it's considered to have timed out. It's specified as a human readable duration string.

To disable heartbeat timeouts entirely, this can be set to "0s".
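The interaction between timeout and heartbeat_timeout can be sketched as follows. This is an illustrative model rather than Ocypod's implementation: the has_timed_out function and its parameters are hypothetical, and measuring the heartbeat timeout from the job's start time when no heartbeat has been sent yet is an assumption.

```python
from datetime import datetime, timedelta
from typing import Optional


def has_timed_out(
    started_at: datetime,
    last_heartbeat: Optional[datetime],
    now: datetime,
    timeout: timedelta,
    heartbeat_timeout: timedelta,
) -> bool:
    """Return True if either limit is exceeded; a zero duration disables a check."""
    zero = timedelta(0)
    if timeout > zero and now - started_at > timeout:
        return True
    # Assumption: before the first heartbeat, measure from the job's start time.
    reference = started_at if last_heartbeat is None else last_heartbeat
    if heartbeat_timeout > zero and now - reference > heartbeat_timeout:
        return True
    return False
```

With a 5 minute timeout and a 30 second heartbeat_timeout, a worker that goes silent is noticed after 30 seconds rather than after the full 5 minutes.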


expires_after

This setting determines how long jobs that have ended (either successfully completed, failed, or timed out without any retries) will remain in the system. After this period of time, the job and its metadata will be cleared from Ocypod.

This is specified as a human readable duration string, and can be set to "0s" to disable expiry entirely. In this case, you'll be responsible for managing and cleaning up old jobs manually.


retries

This controls the number of times that jobs created in this queue will be automatically retried.

If a job fails or times out and still has retries remaining, it will be re-queued.

To disable retries, this can be set to 0.

retry_delays

This configures an optional list of delays to apply whenever a job is retried. This allows for different backoff strategies to be configured, depending on the application.

If the number of retries exceeds the number of retry delays specified, then the last value will continue to be used.

E.g. configuring a queue with retries: 4 and retry_delays: ["10s", "1m", "5m"] means that if a job in this queue keeps failing, Ocypod will wait 10 seconds before retrying for the 1st time, 1 minute before retrying a 2nd time, and 5 minutes before retrying for the 3rd and 4th times.

To disable retry delays, this can be omitted, or set to an empty list.
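The rule for choosing a delay can be written in a few lines. The sketch below reproduces the worked example above; retry_delay is a hypothetical helper, retry numbers are counted from 1, and returning None when no delays are configured is a choice made for this illustration.

```python
from typing import List, Optional


def retry_delay(retry_number: int, retry_delays: List[str]) -> Optional[str]:
    """Return the delay to wait before the given retry (1 = first retry).

    Once the retries outnumber the configured delays, the last delay is reused.
    Returns None when no delays are configured.
    """
    if not retry_delays:
        return None
    index = min(retry_number, len(retry_delays)) - 1
    return retry_delays[index]


# The example above: retries: 4 and retry_delays: ["10s", "1m", "5m"].
delays = [retry_delay(n, ["10s", "1m", "5m"]) for n in range(1, 5)]
```

Here delays is ["10s", "1m", "5m", "5m"], matching the waits described above for the 1st to 4th retries.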

Tag

A tag is a short string that can be attached to a job at creation time. An endpoint for getting all job IDs by tag is provided.

This allows separate jobs to be grouped together. Use cases include, e.g.: