Lightning

This guide offers more context on the lamindb.integrations.lightning.Checkpoint callback. For end-to-end examples, see the accompanying guides.

Quickstart

Pass ll.Checkpoint and a logger into Trainer. The logger gives checkpoints meaningful, namespaced artifact keys; without it, keys fall back to a bare checkpoints/ prefix (or {run_uid}/checkpoints/ when ln.track() is active).

Any logger implementing Lightning’s Logger interface works (TensorBoardLogger, WandbLogger, MLFlowLogger, CSVLogger, etc.). We use TensorBoardLogger in the examples below.

import lamindb as ln
import lightning.pytorch as pl
from lightning.pytorch.loggers import TensorBoardLogger
from lamindb.integrations import lightning as ll

ln.track()

logger = TensorBoardLogger(save_dir="logs")
checkpoint = ll.Checkpoint(monitor="val_loss", mode="min", save_top_k=3)

trainer = pl.Trainer(
    max_epochs=10,
    callbacks=[checkpoint],
    logger=logger,
)
trainer.fit(model, datamodule=datamodule)

After training, each saved checkpoint file is a LaminDB artifact:

checkpoint.last_checkpoint_artifact
checkpoint.last_checkpoint_artifact.key
# e.g. "logs/lightning_logs/2r5pIRnK7z0q/checkpoints/epoch=0-step=100.ckpt"

checkpoint.checkpoint_key_prefix
# e.g. "logs/lightning_logs/2r5pIRnK7z0q/checkpoints"

How is a run organized?

A Lightning Trainer coordinates three concerns during training:

  1. Logger — writes metrics (loss curves, learning rate, etc.) to a dashboard directory. The logger determines the local directory layout: {save_dir}/{name}/{version}/.

  2. ModelCheckpoint — saves model snapshots (.ckpt files) into a checkpoints/ subdirectory underneath the logger’s directory.

  3. SaveConfigCallback — when using LightningCLI, writes the fully resolved config.yaml into the logger’s directory so you can reproduce exactly which hyperparameters were used.

All three share the same directory tree. The logger creates it, the checkpoint callback writes into it, and the config callback stores the config beside it:

logs/                          # logger save_dir
  lightning_logs/              # logger name
    version_0/                 # logger version (local filesystem)
      events.out.tfevents.*    # ← logger output (TensorBoard)
      config.yaml              # ← SaveConfigCallback
      checkpoints/
        epoch=0-step=100.ckpt  # ← ModelCheckpoint
        epoch=1-step=200.ckpt
        hparams.yaml           # ← auto-generated by Lightning

LaminDB’s integration replaces ModelCheckpoint with ll.Checkpoint and Lightning’s SaveConfigCallback with ll.SaveConfigCallback. Checkpoint files, the config, and hparams.yaml become lamindb.Artifact records with lineage tracking and optional feature annotations.

Note that artifact keys in LaminDB do not mirror the local directory layout exactly: by default, the callback uses the LaminDB run UID instead of Lightning's auto-incrementing version_N directory. See "How are artifact keys derived?" below for details.
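
Because all artifacts of a run share this base prefix, you can query a run's artifacts by key. A sketch using the checkpoint_key_prefix shown in the quickstart:

# all artifacts this run stored under its checkpoints/ prefix
ln.Artifact.filter(key__startswith=checkpoint.checkpoint_key_prefix).df()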

Which kinds of artifacts?

Checkpoint saves three kinds of artifacts:

| Kind | Example key | When |
| --- | --- | --- |
| checkpoint | …/checkpoints/epoch=0-step=100.ckpt | Every time Lightning writes a checkpoint |
| config | …/config.yaml | When using ll.SaveConfigCallback |
| hparams | …/checkpoints/hparams.yaml | When Lightning generates it |

Checkpoints and hparams.yaml live under the checkpoints/ subdirectory, while the config sits directly under the base prefix.

The callback keeps a reference to the latest artifact of each kind, plus the most recent artifact event:

checkpoint.last_checkpoint_artifact
checkpoint.last_config_artifact
checkpoint.last_hparams_artifact
checkpoint.last_artifact_event

How is data lineage tracked?

When a run is being tracked with ln.track():

  • checkpoint artifacts are recorded as run outputs — they are produced by the training run.

  • config artifacts are recorded as run inputs — the resolved config is part of the run specification.

  • hparams.yaml is saved as an artifact but not linked as a run input.
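
You can inspect this lineage on a saved artifact afterwards. A small sketch, assuming a tracked run as in the quickstart:

artifact = checkpoint.last_checkpoint_artifact
artifact.run             # the training run that produced this checkpoint
artifact.view_lineage()  # renders the graph of inputs and outputs around the artifact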

How are artifact keys derived?

LaminDB artifact keys are not necessarily a mirror of the local filesystem layout. Lightning uses auto-incrementing version directories (version_0, version_1, …) on disk, but these are meaningless as artifact identifiers — they depend on what already exists locally and cannot reliably distinguish runs across machines.

Instead, when ln.track() is active, the callback uses the LaminDB run UID as the version segment by default (run_uid_is_version=True). This guarantees that every tracked run produces unique artifact keys regardless of local state.

The base prefix is determined in the following order of priority:

| Scenario | Base prefix |
| --- | --- |
| dirpath set (± logger) | {dirpath}/{run_uid} |
| No dirpath + logger | {save_dir_basename}/{name}/{run_uid} |
| No dirpath + no logger | {run_uid} |

run_uid above refers to the active LaminDB run UID (from ln.context.run.uid). When no run is tracked or run_uid_is_version=False, the callback falls back to the logger’s own version (e.g. version_0) or omits the segment entirely.
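
As a rough sketch of the priority rules above for the tracked-run case (illustration only, not the actual implementation):

import os

def base_prefix(dirpath, logger, run_uid):
    if dirpath:  # 1. an explicit dirpath wins
        return f"{dirpath}/{run_uid}"
    if logger:   # 2. otherwise derive the prefix from the logger
        return f"{os.path.basename(logger.save_dir)}/{logger.name}/{run_uid}"
    return run_uid  # 3. bare run UID as the last resort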

Checkpoint & hparams keys:

| Scenario | LaminDB key pattern |
| --- | --- |
| Logger present (recommended) | {save_dir_basename}/{name}/{run_uid}/checkpoints/{filename} |
| No logger, explicit dirpath | {dirpath}/{run_uid}/checkpoints/{filename} |
| No logger, no dirpath | {run_uid}/checkpoints/{filename} |

Config keys:

| Scenario | Key pattern |
| --- | --- |
| Logger present | {save_dir_basename}/{name}/{run_uid}/config.yaml |
| No logger, explicit dirpath | {dirpath}/{run_uid}/config.yaml |
| No logger, no dirpath | {run_uid}/config.yaml |

For example, with TensorBoardLogger(save_dir="logs") and a tracked run:

logs/lightning_logs/2r5pIRnK7z0q/       # base prefix ({save_dir_basename}/{name}/{run_uid})
  config.yaml                            # ← config artifact
  checkpoints/
    epoch=0-step=100.ckpt                # ← checkpoint artifact
    hparams.yaml                         # ← hparams artifact

Opting out of run UID keys

Pass run_uid_is_version=False to fall back to the logger-managed version directory, matching Lightning’s local layout more closely:

checkpoint = ll.Checkpoint(
    monitor="val_loss",
    run_uid_is_version=False,
)

With this setting, the key uses the logger’s version (version_0, etc.) instead of the run UID. This is mainly useful when you don’t call ln.track() or when you want artifact keys that exactly mirror the local directory tree.

Why run UIDs instead of version_N?

Lightning’s auto-incrementing version_N depends on what directories already exist at save_dir. Two runs on different machines — or the same machine after clearing logs/ — can both produce version_0. With run_uid_is_version=True (the default), each tracked run gets a unique prefix derived from the Lamin run, so artifact keys never collide.

Use with the Lightning CLI

The Lightning CLI resolves a YAML config into concrete model and data module arguments. To also store that resolved config as a LaminDB artifact, pass ll.SaveConfigCallback in your training script and declare the trainer, logger, callbacks, model, and data in a config file.

config.yaml

trainer:
  max_epochs: 10

  logger:
    class_path: lightning.pytorch.loggers.TensorBoardLogger
    init_args:
      save_dir: logs

  callbacks:
    - class_path: lamindb.integrations.lightning.Checkpoint
      init_args:
        monitor: val/loss
        mode: min
        save_top_k: 3

model:
  learning_rate: 1.0e-3

data:
  batch_size: 64

train.py

import lamindb as ln
from lightning.pytorch.cli import LightningCLI
from lamindb.integrations.lightning import SaveConfigCallback
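# MyModel and MyDataModule below are your own LightningModule / LightningDataModule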

ln.track()

def cli_main() -> None:
    LightningCLI(
        model_class=MyModel,
        datamodule_class=MyDataModule,
        save_config_callback=SaveConfigCallback,
    )

if __name__ == "__main__":
    cli_main()

Run training through the CLI:

python train.py fit --config config.yaml

ll.SaveConfigCallback extends Lightning’s built-in version: it writes the local file as usual and then delegates to whichever ArtifactPublishingModelCheckpoint is registered on the trainer to persist the config as an artifact.
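
In a CLI run you usually don't hold a reference to the callback instance, so one way to retrieve the stored config afterwards is to query by key. A sketch using standard Artifact fields:

# most recently saved resolved config artifact
ln.Artifact.filter(key__endswith="config.yaml").order_by("-created_at").first()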

Annotating with features

Attach custom run-level and artifact-level feature values through features=:

logger = TensorBoardLogger(save_dir="logs")
checkpoint = ll.Checkpoint(
    monitor="val_loss",
    features={
        "run": {"training_framework": "lightning"},
        "artifact": {"dataset_version": "2026-03"},
    },
)

trainer = pl.Trainer(callbacks=[checkpoint], logger=logger)

Feature names must already exist as Feature records in your LaminDB instance.
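
If they don't exist yet, register them once. A minimal sketch for the two names used above:

ln.Feature(name="training_framework", dtype="str").save()
ln.Feature(name="dataset_version", dtype="str").save()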

The callback can also auto-track standard Lightning fields. Create the corresponding LaminDB features once:

ll.save_lightning_features()

This enables the following automatically tracked features:

  • Artifact-level: is_best_model, is_last_model, score, model_rank, save_weights_only, monitor, mode

  • Run-level: logger_name, logger_version, max_epochs, max_steps, precision, accumulate_grad_batches, gradient_clip_val, monitor, mode
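
After training, you can read these values back from an artifact's annotations:

# dictionary of feature values attached to the latest checkpoint artifact
checkpoint.last_checkpoint_artifact.features.get_values()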

Extending the callback

Subclass Checkpoint

Subclass when you want to keep LaminDB persistence and additionally notify an external system after each artifact is saved:

import lightning.pytorch as pl
from lightning.pytorch.loggers import TensorBoardLogger

from lamindb.integrations import lightning as ll
from my_model_registry import ModelRegistry


class ModelRegistryCheckpoint(ll.Checkpoint):
    """Register each checkpoint in an external model registry."""

    def __init__(self, *args, registry_project: str, **kwargs):
        super().__init__(*args, **kwargs)
        self.registry_project = registry_project
        self.model_registry = ModelRegistry()

    def on_artifact_saved(self, event: ll.ArtifactSavedEvent) -> None:
        if event.kind == "checkpoint":
            # register the model in your external system
            self.model_registry.register(
                project=self.registry_project,
                model_uri=event.storage_uri,
                metadata={"lamin_key": event.key},
            )


logger = TensorBoardLogger(save_dir="logs")
checkpoint = ModelRegistryCheckpoint(
    registry_project="my-project",
    monitor="val_loss",
    save_top_k=3,
)
trainer = pl.Trainer(callbacks=[checkpoint], logger=logger)
trainer.fit(model, datamodule=datamodule)

Each event gives you:

  • event.kind: "checkpoint", "config", or "hparams"

  • event.artifact: the persisted LaminDB artifact

  • event.key: the LaminDB artifact key

  • event.local_path: the local file path Lightning wrote

  • event.storage_uri: the stable storage URI for downstream systems

Attach an observer

Observers are useful when you want composition instead of inheritance:

import lightning.pytorch as pl
from lightning.pytorch.loggers import TensorBoardLogger

from lamindb.integrations import lightning as ll


class ArtifactLogger:
    def on_artifact_saved(self, event: ll.ArtifactSavedEvent) -> None:
        print(event.kind, event.storage_uri)

    def on_artifact_removed(self, event: ll.ArtifactRemovedEvent) -> None:
        print("removed", event.key)


logger = TensorBoardLogger(save_dir="logs")
checkpoint = ll.Checkpoint(
    monitor="val_loss",
    artifact_observers=[ArtifactLogger()],
)

trainer = pl.Trainer(callbacks=[checkpoint], logger=logger)
trainer.fit(model, datamodule=datamodule)

Observers receive the same events that subclasses see.

Integrating other systems

To register checkpoints in another system (e.g. ClearML, Weights & Biases, MLflow, Neptune, or Comet), use the artifact lifecycle events rather than re-deriving paths from Lightning internals.

The key hand-off value is event.storage_uri, which resolves to the persisted artifact location. event.artifact gives you the full LaminDB record when you need metadata beyond the URI.
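
For example, a small observer could forward each checkpoint's storage URI to your MLflow tracking run. A hedged sketch: mlflow.set_tag is a standard MLflow call, but how you map events onto your own tracking runs is up to you:

import mlflow

from lamindb.integrations import lightning as ll


class MLflowCheckpointObserver:
    def on_artifact_saved(self, event: ll.ArtifactSavedEvent) -> None:
        if event.kind == "checkpoint":
            # record where LaminDB persisted the checkpoint so it can be found from MLflow
            mlflow.set_tag("lamin_checkpoint_uri", event.storage_uri)
            mlflow.set_tag("lamin_checkpoint_key", event.key)


checkpoint = ll.Checkpoint(
    monitor="val_loss",
    artifact_observers=[MLflowCheckpointObserver()],
)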