Lightning¶
This guide offers more context on the lamindb.integrations.lightning.Checkpoint callback. For end-to-end examples, see the following guides:
- docs:clearml
Quickstart¶
Pass ll.Checkpoint and a logger into Trainer. The logger is what gives
checkpoints meaningful, namespaced artifact keys — without it, keys fall back
to a bare checkpoints/ prefix (or just the run UID when ln.track() is
active).
Any logger implementing Lightning’s Logger interface works (TensorBoardLogger,
WandbLogger, MLFlowLogger, CSVLogger, etc.). We use TensorBoardLogger
in the examples below.
import lamindb as ln
import lightning.pytorch as pl
from lightning.pytorch.loggers import TensorBoardLogger
from lamindb.integrations import lightning as ll
ln.track()
logger = TensorBoardLogger(save_dir="logs")
checkpoint = ll.Checkpoint(monitor="val_loss", mode="min", save_top_k=3)
trainer = pl.Trainer(
    max_epochs=10,
    callbacks=[checkpoint],
    logger=logger,
)
trainer.fit(model, datamodule=datamodule)
After training, each saved checkpoint file is a LaminDB artifact:
checkpoint.last_checkpoint_artifact
checkpoint.last_checkpoint_artifact.key
# e.g. "logs/lightning_logs/2r5pIRnK7z0q/checkpoints/epoch=0-step=100.ckpt"
checkpoint.checkpoint_key_prefix
# e.g. "logs/lightning_logs/2r5pIRnK7z0q/checkpoints"
How is a run organized?¶
A Lightning Trainer coordinates three concerns during training:
- Logger — writes metrics (loss curves, learning rate, etc.) to a dashboard directory. The logger determines the local directory layout: {save_dir}/{name}/{version}/.
- ModelCheckpoint — saves model snapshots (.ckpt files) into a checkpoints/ subdirectory underneath the logger’s directory.
- SaveConfigCallback — when using LightningCLI, writes the fully resolved config.yaml into the logger’s directory so you can reproduce exactly which hyperparameters were used.

All three share the same directory tree: the logger creates it, the checkpoint callback writes into it, and the config callback stores the config beside it:
logs/                          # logger save_dir
  lightning_logs/              # logger name
    version_0/                 # logger version (local filesystem)
      events.out.tfevents.*    # ← logger output (TensorBoard)
      config.yaml              # ← SaveConfigCallback
      checkpoints/
        epoch=0-step=100.ckpt  # ← ModelCheckpoint
        epoch=1-step=200.ckpt
      hparams.yaml             # ← auto-generated by Lightning
LaminDB’s integration replaces ModelCheckpoint with ll.Checkpoint and
Lightning’s SaveConfigCallback with ll.SaveConfigCallback. Checkpoint
files, the config, and hparams.yaml become lamindb.Artifact records with
lineage tracking and optional feature annotations.
Note that artifact keys in LaminDB do not mirror the local directory layout
exactly — the callback uses the LaminDB run UID instead of Lightning’s
auto-incrementing version_N directory by default. See
How artifact keys are derived for details.
Which kinds of artifacts?¶
Checkpoint saves three kinds of artifacts:
| Kind | Example key | When |
|---|---|---|
| checkpoint | logs/lightning_logs/2r5pIRnK7z0q/checkpoints/epoch=0-step=100.ckpt | Every time Lightning writes a checkpoint |
| config | logs/lightning_logs/2r5pIRnK7z0q/config.yaml | When using LightningCLI with ll.SaveConfigCallback |
| hparams | logs/lightning_logs/2r5pIRnK7z0q/checkpoints/hparams.yaml | When Lightning generates it |
Checkpoints and hparams.yaml live under the checkpoints/ subdirectory,
while the config sits directly under the base prefix.
The callback tracks the latest artifact of each kind:
checkpoint.last_checkpoint_artifact
checkpoint.last_config_artifact
checkpoint.last_hparams_artifact
checkpoint.last_artifact_event
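last_artifact_event holds the most recent ArtifactSavedEvent (described under Extending the callback below); a minimal usage sketch:

event = checkpoint.last_artifact_event
if event is not None:  # assumed to be None before the first save
    print(event.kind, event.key)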
How is data lineage tracked?¶
When a run is being tracked with ln.track():
- checkpoint artifacts are recorded as run outputs — they are produced by the training run.
- config artifacts are recorded as run inputs — the resolved config is part of the run specification.
- hparams.yaml is saved as an artifact but not linked as a run input.
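A minimal sketch of inspecting that lineage after training, assuming the run's input/output artifact accessors:

run = ln.context.run
run.output_artifacts.df()  # the checkpoint artifacts produced by this run
run.input_artifacts.df()   # the resolved config consumed by this run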
How are artifact keys derived?¶
LaminDB artifact keys are not necessarily a mirror of the local filesystem layout.
Lightning uses auto-incrementing version directories (version_0, version_1,
…) on disk, but these are meaningless as artifact identifiers — they depend on
what already exists locally and cannot reliably distinguish runs across
machines.
Instead, when ln.track() is active, the callback uses the LaminDB run UID
as the version segment by default (run_uid_is_version=True). This guarantees
that every tracked run produces unique artifact keys regardless of local state.
The base prefix is determined by priority:
| Scenario | Base prefix |
|---|---|
| Logger present | {save_dir_basename}/{name}/{run_uid} |
| No logger, explicit dirpath | {dirpath} |
| No logger, no dirpath | {run_uid} when a run is tracked, otherwise empty |
run_uid above refers to the active LaminDB run UID (from ln.context.run.uid).
When no run is tracked or run_uid_is_version=False, the callback falls back
to the logger’s own version (e.g. version_0) or omits the segment entirely.
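As an illustration (a hypothetical sketch, not the callback's actual code), the version segment is chosen roughly like this:

def version_segment(run, logger, run_uid_is_version: bool = True) -> str | None:
    # hypothetical sketch of the precedence described above
    if run_uid_is_version and run is not None:
        return run.uid  # unique per tracked run, e.g. "2r5pIRnK7z0q"
    if logger is not None and logger.version is not None:
        return f"version_{logger.version}"  # Lightning's local auto-increment
    return None  # segment omitted entirely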
Checkpoint & hparams keys:
| Scenario | LaminDB key pattern |
|---|---|
| Logger present (recommended) | {save_dir_basename}/{name}/{run_uid}/checkpoints/{filename} |
| No logger, explicit dirpath | {dirpath}/checkpoints/{filename} |
| No logger, no dirpath | {run_uid}/checkpoints/{filename}, or checkpoints/{filename} when untracked |
Config keys:
| Scenario | Key pattern |
|---|---|
| Logger present | {save_dir_basename}/{name}/{run_uid}/config.yaml |
| No logger, explicit dirpath | {dirpath}/config.yaml |
| No logger, no dirpath | {run_uid}/config.yaml, or config.yaml when untracked |
For example, with TensorBoardLogger(save_dir="logs") and a tracked run:
logs/lightning_logs/2r5pIRnK7z0q/   # base prefix ({save_dir_basename}/{name}/{run_uid})
  config.yaml                       # ← config artifact
  checkpoints/
    epoch=0-step=100.ckpt           # ← checkpoint artifact
    hparams.yaml                    # ← hparams artifact
Opting out of run UID keys¶
Pass run_uid_is_version=False to fall back to the logger-managed version
directory, matching Lightning’s local layout more closely:
checkpoint = ll.Checkpoint(
    monitor="val_loss",
    run_uid_is_version=False,
)
With this setting, the key uses the logger’s version (version_0, etc.)
instead of the run UID. This is mainly useful when you don’t call ln.track()
or when you want artifact keys that exactly mirror the local directory tree.
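With the TensorBoardLogger from the quickstart, a checkpoint key would then look like:

logs/lightning_logs/version_0/checkpoints/epoch=0-step=100.ckpt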
Why run UIDs instead of version_N?¶
Lightning’s auto-incrementing version_N depends on what directories already
exist at save_dir. Two runs on different machines — or the same machine after
clearing logs/ — can both produce version_0. With run_uid_is_version=True
(the default), each tracked run gets a unique prefix derived from the Lamin
run, so artifact keys never collide.
Use with the Lightning CLI¶
The Lightning CLI resolves a YAML config into concrete model and data module
arguments. To also store that resolved config as a LaminDB artifact, pass
ll.SaveConfigCallback in your training script and declare the trainer,
logger, callbacks, model, and data in a config file.
config.yaml
trainer:
  max_epochs: 10
  logger:
    class_path: lightning.pytorch.loggers.TensorBoardLogger
    init_args:
      save_dir: logs
  callbacks:
    - class_path: lamindb.integrations.lightning.Checkpoint
      init_args:
        monitor: val/loss
        mode: min
        save_top_k: 3
model:
  learning_rate: 1.0e-3
data:
  batch_size: 64
train.py
import lamindb as ln
from lightning.pytorch.cli import LightningCLI
from lamindb.integrations.lightning import SaveConfigCallback

ln.track()

def cli_main() -> None:
    LightningCLI(
        model_class=MyModel,
        datamodule_class=MyDataModule,
        save_config_callback=SaveConfigCallback,
    )

if __name__ == "__main__":
    cli_main()
python train.py fit --config config.yaml
ll.SaveConfigCallback extends Lightning’s built-in version: it writes the
local file as usual and then delegates to whichever
ArtifactPublishingModelCheckpoint is registered on the trainer to persist the
config as an artifact.
Annotating with features¶
Attach custom run-level and artifact-level feature values through features=:
logger = TensorBoardLogger(save_dir="logs")
checkpoint = ll.Checkpoint(
    monitor="val_loss",
    features={
        "run": {"training_framework": "lightning"},
        "artifact": {"dataset_version": "2026-03"},
    },
)
trainer = pl.Trainer(callbacks=[checkpoint], logger=logger)
Feature names must already exist as Feature records in LaminDB.
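A minimal sketch of registering the feature names used above (the dtypes are assumptions):

ln.Feature(name="training_framework", dtype=str).save()
ln.Feature(name="dataset_version", dtype=str).save()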
The callback can also auto-track standard Lightning fields. Create the corresponding LaminDB features once:
ll.save_lightning_features()
This enables the following auto-tracked features:
- Artifact-level: is_best_model, is_last_model, score, model_rank, save_weights_only, monitor, mode
- Run-level: logger_name, logger_version, max_epochs, max_steps, precision, accumulate_grad_batches, gradient_clip_val, monitor, mode
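Once created, auto-tracked features are queryable like any other artifact annotation; a sketch assuming lamindb's feature-value filtering:

# find the best checkpoint of this run via its auto-tracked features
ln.Artifact.filter(
    key__startswith=checkpoint.checkpoint_key_prefix,
    is_best_model=True,
).df()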
Extending the callback¶
Subclass Checkpoint¶
Subclass when you want to keep LaminDB persistence and additionally notify an external system after each artifact is saved:
from lamindb.integrations import lightning as ll
from my_model_registry import ModelRegistry
class ModelRegistryCheckpoint(ll.Checkpoint):
    """Register each checkpoint in an external model registry."""

    def __init__(self, *args, registry_project: str, **kwargs):
        super().__init__(*args, **kwargs)
        self.registry_project = registry_project
        self.model_registry = ModelRegistry()

    def on_artifact_saved(self, event: ll.ArtifactSavedEvent) -> None:
        if event.kind == "checkpoint":
            # register the model in your external system
            self.model_registry.register(
                project=self.registry_project,
                model_uri=event.storage_uri,
                metadata={"lamin_key": event.key},
            )
logger = TensorBoardLogger(save_dir="logs")
checkpoint = ModelRegistryCheckpoint(
    registry_project="my-project",
    monitor="val_loss",
    save_top_k=3,
)
trainer = pl.Trainer(callbacks=[checkpoint], logger=logger)
trainer.fit(model, datamodule=datamodule)
Each event gives you:
- event.kind: "checkpoint", "config", or "hparams"
- event.artifact: the persisted LaminDB artifact
- event.key: the LaminDB artifact key
- event.local_path: the local file path Lightning wrote
- event.storage_uri: the stable storage URI for downstream systems
Attach an observer¶
Observers are useful when you want composition instead of inheritance:
from lamindb.integrations import lightning as ll
class ArtifactLogger:
    def on_artifact_saved(self, event: ll.ArtifactSavedEvent) -> None:
        print(event.kind, event.storage_uri)

    def on_artifact_removed(self, event: ll.ArtifactRemovedEvent) -> None:
        print("removed", event.key)
logger = TensorBoardLogger(save_dir="logs")
checkpoint = ll.Checkpoint(
    monitor="val_loss",
    artifact_observers=[ArtifactLogger()],
)
trainer = pl.Trainer(callbacks=[checkpoint], logger=logger)
trainer.fit(model, datamodule=datamodule)
Observers receive the same events that subclasses see.
Integrating other systems¶
To register checkpoints in another system (e.g. ClearML, Weights & Biases, MLflow, Neptune, or Comet), use the artifact lifecycle events rather than re-deriving paths from Lightning internals.
The key hand-off value is event.storage_uri, which resolves to the persisted
artifact location. event.artifact gives you the full LaminDB record when you
need metadata beyond the URI.