Core concepts
Job
A job in Ocypod represents some task created by clients, which will be queued, then fetched and processed by workers.
Each job has a set of metadata associated with it, some of which is managed by Ocypod, and some of which can be created/updated by clients/workers.
Job lifecycle and statuses
When a job is initially created, it's added to a queue and assigned the queued status.
Clients will then poll that queue for new jobs, receiving the job's payload (the contents of its input field) and the job's ID. The job is removed from its queue, and its status is set to running.
If the client completes the job, it will send a message to Ocypod asking it to update the job's status to completed. If there's some error/exception and the client can't finish the job, it will mark the job as failed.
If the client neither completes nor fails the job before the job's timeout (or heartbeat timeout) is exceeded, Ocypod marks the job as timed_out.
Ocypod will periodically look at all failed and timed out jobs and check if they're eligible for automatic retries, and if so, will re-queue them.
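The lifecycle above can be pictured as a small state machine. The following Python sketch is only an illustration, not Ocypod's implementation: the TRANSITIONS table and the next_status helper are hypothetical names, and cancellation is left out because the lifecycle text does not say when it is allowed.

```python
# Hypothetical model of the lifecycle described above; not Ocypod's actual code.
TRANSITIONS = {
    "queued": {"running"},                            # picked up from the queue
    "running": {"completed", "failed", "timed_out"},  # outcome of processing
    "failed": {"queued"},                             # re-queued if eligible for retry
    "timed_out": {"queued"},                          # re-queued if eligible for retry
    "completed": set(),                               # final state
}


def next_status(current: str, requested: str) -> str:
    """Return the requested status if the transition is allowed, else raise."""
    if requested not in TRANSITIONS.get(current, set()):
        raise ValueError(f"cannot move job from {current!r} to {requested!r}")
    return requested
```

For example, next_status("queued", "running") succeeds, while next_status("completed", "running") raises ValueError.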
Job metadata
The Ocypod server maintains the following information about a job, some of which is immutable, some of which will be modified by Ocypod throughout a job's lifecycle, and some of which is modifiable by clients.
- id: autogenerated ID for the job, generated when a job is first created and queued
- queue: name of the queue the job was created in
- status: current status of the job
- tags: list of tags (if any) assigned to this job at creation time
- created_at: date/time this job was first created and queued
- started_at: date/time this job was accepted by a client, and the job's status changed to running
- ended_at: date/time this job stopped running, whether due to successful completion, timing out, or failure
- last_heartbeat: date/time the last heartbeat for this job was sent by the client executing it
- input: the job's payload, sent by the client creating this job; this typically contains the data needed for a worker to execute the job
- output: contains any information the client working on this job decides to store here, such as the job's result, progress information, or partial results; it can be set at any time while the job is running
- timeout: maximum execution time of the job before it's marked as timed out
- heartbeat_timeout: maximum time without receiving a heartbeat before the job is marked as timed out
- expires_after: amount of time this job metadata will persist in Ocypod after the job reaches a final state (i.e. completed/failed/timed_out with no retries remaining)
- retries: number of times this job will automatically be requeued on failure
- retries_attempted: number of times this job has failed and been requeued
- retry_delays: minimum amount of time to wait between each retry attempt
- ended: indicates whether the job is in a final state or not (i.e. completed, cancelled, or failed/timed out with no retries remaining)
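As a rough picture of how these fields fit together, the sketch below collects them in a Python dataclass. The field names come from the list above; the class name JobMetadata, the Python types, and the default values are assumptions made for illustration and do not describe Ocypod's wire format or its real defaults.

```python
from dataclasses import dataclass, field
from typing import Any, List, Optional


@dataclass
class JobMetadata:
    """Illustrative container only; types and defaults are assumed, not Ocypod's."""
    id: int                                   # assumed to be an integer
    queue: str
    status: str = "queued"
    tags: List[str] = field(default_factory=list)
    created_at: Optional[str] = None          # date/times kept as strings here
    started_at: Optional[str] = None
    ended_at: Optional[str] = None
    last_heartbeat: Optional[str] = None
    input: Any = None                         # payload supplied at creation
    output: Any = None                        # written by the worker while running
    timeout: Optional[str] = None             # duration strings such as "30s"
    heartbeat_timeout: Optional[str] = None
    expires_after: Optional[str] = None
    retries: int = 0
    retries_attempted: int = 0
    retry_delays: List[str] = field(default_factory=list)
    ended: bool = False
```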
Job Status
A job in Ocypod will always have one of the following statuses:
- queued: set by the server when a job is first created and added to a queue
- running: set by the server when a worker picks up a job
- completed: set by the client to mark a job as successfully completed
- failed: set by the client to mark a job as having failed
- timed_out: set by the server when a job exceeds either its timeout or heartbeat_timeout
- cancelled: set by the client to mark that a job has been cancelled
To aid clients that are checking on the status of jobs, each job also has an ended boolean field. This is set to true if the job is in its final state, or false otherwise.
A job is marked as ended in the following circumstances:
- job has completed status
- job has cancelled status
- job has failed status and 0 retries remaining
- job has timed_out status and 0 retries remaining
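The rules above translate directly into a small predicate. This is a sketch rather than Ocypod's code: the function name is_ended is hypothetical, and it assumes that the number of retries remaining equals retries minus retries_attempted.

```python
def is_ended(status: str, retries: int, retries_attempted: int) -> bool:
    """Return True if a job with these values is in a final state."""
    if status in ("completed", "cancelled"):
        return True
    if status in ("failed", "timed_out"):
        # Assumption: retries remaining = retries - retries_attempted.
        return retries - retries_attempted <= 0
    return False  # queued or running jobs have not ended
```

A failed job with retries set to 2 and retries_attempted at 1 would therefore not count as ended, whereas the same job after its second retry would.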
Queue
Each queue in Ocypod has its own settings, which are used as defaults for jobs created on that queue (though they can be overridden on a per-job basis).
A queue in Ocypod is a FIFO, with new jobs being added to the beginning of the queue, and workers taking jobs from the end of the queue.
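The ordering can be illustrated with a double-ended queue from Python's standard library. This only models the FIFO behaviour described above; the helper names add_job and take_job are hypothetical, and Ocypod's actual storage is not shown here.

```python
from collections import deque
from typing import Deque, Optional


def add_job(queue: Deque[int], job_id: int) -> None:
    """New jobs go to the beginning (left side) of the queue."""
    queue.appendleft(job_id)


def take_job(queue: Deque[int]) -> Optional[int]:
    """Workers take from the end (right side); returns None if the queue is empty."""
    return queue.pop() if queue else None
```

Jobs added in the order 1, 2, 3 are taken in the same order, so the oldest job is always processed first.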
Queue settings
Each queue has a number of settings, which are defaults that are applied to new jobs created in that queue. Each can be overridden on a per-job basis; they exist at the queue level only for convenience.
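Conceptually, the settings a job ends up with are the queue's defaults with any per-job values layered on top. A minimal sketch, assuming settings are held as plain dictionaries (effective_settings is a hypothetical helper, not part of Ocypod):

```python
def effective_settings(queue_settings: dict, job_overrides: dict) -> dict:
    """Queue settings act as defaults; any key given for the job takes precedence."""
    return {**queue_settings, **job_overrides}


# Illustrative values only.
queue_defaults = {"timeout": "10m", "retries": 2}
merged = effective_settings(queue_defaults, {"timeout": "1h"})
```

Here merged keeps the queue's retries value but uses the job's own timeout.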
timeout
This is the maximum amount of time a job can be running for before it's considered to have timed out. It's specified as a human readable duration string, e.g. "30s", "1h15m5s", "3w2d", etc.
To disable timeouts entirely, this can be set to "0s".
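To make the duration format concrete, here is a simplified parser for strings like the examples above. It is a sketch, not the parser Ocypod uses: parse_duration is a hypothetical helper, and it assumes only the units w, d, h, m and s, written without spaces.

```python
import re

# Seconds per unit; only the units appearing in the examples above are handled.
_UNIT_SECONDS = {"w": 604800, "d": 86400, "h": 3600, "m": 60, "s": 1}
_TOKEN = re.compile(r"(\d+)([wdhms])")


def parse_duration(text: str) -> int:
    """Convert a string such as "1h15m5s" into a number of seconds."""
    pos = 0
    total = 0
    for match in _TOKEN.finditer(text):
        if match.start() != pos:
            # Something unparseable sits between two number/unit pairs.
            raise ValueError(f"unexpected characters in duration: {text!r}")
        total += int(match.group(1)) * _UNIT_SECONDS[match.group(2)]
        pos = match.end()
    if pos == 0 or pos != len(text):
        raise ValueError(f"invalid duration: {text!r}")
    return total
```

With this sketch, "1h15m5s" parses to 4505 seconds and "0s" parses to 0, the value that disables the timeout.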
heartbeat_timeout
For long running jobs, it's recommended that workers send regular heartbeats to the Ocypod server to let it know that the job is still being processed. This allows timeouts or failures to be noticed much earlier than if just relying on timeout.
The heartbeat_timeout setting determines how long a job can be running without getting a heartbeat update before it's considered to have timed out. It's specified as a human readable duration string.
To disable heartbeat timeouts entirely, this can be set to "0s".
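The interaction between timeout and heartbeat_timeout can be sketched as a single check. This is an illustration under stated assumptions, not Ocypod's logic: has_timed_out is a hypothetical helper, durations are passed as timedelta values, and a job that has not yet sent a heartbeat is measured from its started_at time.

```python
from datetime import datetime, timedelta
from typing import Optional


def has_timed_out(
    now: datetime,
    started_at: datetime,
    last_heartbeat: Optional[datetime],
    timeout: timedelta,
    heartbeat_timeout: timedelta,
) -> bool:
    """Return True if a running job has exceeded either timeout."""
    zero = timedelta(0)
    # A zero duration ("0s") disables the corresponding check.
    if timeout > zero and now - started_at > timeout:
        return True
    if heartbeat_timeout > zero:
        # Assumption: before the first heartbeat, measure from started_at.
        reference = last_heartbeat or started_at
        if now - reference > heartbeat_timeout:
            return True
    return False
```

With a one-hour timeout and a one-minute heartbeat_timeout, a worker that goes silent is noticed after about a minute rather than after the full hour.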
expires_after
This setting determines how long jobs that have ended (either successfully completed, failed, or timed out without any retries) will remain in the system. After this period of time, the job and its metadata will be cleared from Ocypod.
This is specified as a human readable duration string, and can be set to "0s" to disable expiry entirely. In this case, you'll be responsible for managing and cleaning up old jobs manually.
retries
This controls the number of times that jobs created in this queue will be automatically retried.
If a job fails or times out and has a number of retries remaining, it will be re-queued.
To disable retries, this can be set to 0.
retry_delays
This configures an optional list of delays to apply whenever a job is retried. This allows for different backoff strategies to be configured, depending on the application.
If the number of retries exceeds the number of retry delays specified, then the last value will continue to be used.
E.g. configuring a queue with retries: 4 and retry_delays: ["10s", "1m", "5m"] means that if a job in this queue keeps failing, Ocypod will wait 10 seconds before retrying for the 1st time, 1 minute before retrying a 2nd time, and 5 minutes before retrying for the 3rd and 4th times.
To disable retry delays, this can be omitted, or set to an empty list.
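The rule for choosing a delay, including reuse of the last value, can be sketched in a few lines. The helper name delay_before_retry is hypothetical and the attempt number is assumed to be 1-based; this is not Ocypod's own code.

```python
from typing import List, Optional


def delay_before_retry(attempt: int, retry_delays: List[str]) -> Optional[str]:
    """Return the delay for a 1-based retry attempt, or None if no delays are set."""
    if attempt < 1:
        raise ValueError("attempt numbers start at 1")
    if not retry_delays:
        return None  # omitted or empty list: retry without delay
    # Past the end of the list, keep using the last configured delay.
    index = min(attempt, len(retry_delays)) - 1
    return retry_delays[index]
```

Using the configuration from the example above, attempts 1 through 4 map to "10s", "1m", "5m" and "5m".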
Tag
A tag is a short string that can be attached to a job at creation time. An endpoint for getting all job IDs by tag is provided.
This allows separate jobs to be grouped together. Use cases include:
- using a batch ID tag to group a related set of jobs
- using a username tag to track all jobs belonging to a user
- using a source tag to track the client/process that created a job
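The grouping that tags provide can be pictured as an index from tag to job IDs. The sketch below builds such an index locally from job records that carry id and tags fields; index_by_tag is a hypothetical helper and does not show how the Ocypod endpoint is implemented or called.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def index_by_tag(jobs: Iterable[dict]) -> Dict[str, List[int]]:
    """Map each tag to the IDs of the jobs that carry it."""
    index = defaultdict(list)
    for job in jobs:
        for tag in job.get("tags") or []:
            index[tag].append(job["id"])
    return dict(index)


# Illustrative records only; the tag values are made up.
jobs = [
    {"id": 1, "tags": ["batch-7", "alice"]},
    {"id": 2, "tags": ["batch-7"]},
    {"id": 3, "tags": None},
]
by_tag = index_by_tag(jobs)
```

Looking up by_tag["batch-7"] then gives the IDs of every job in that batch.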