
Guide#

Welcome to the LaminDB guide! 👋

Curate, store, track, query, integrate, and learn from biological data.

LaminDB is an open-source data lake for R&D in biology.

It gives you components to build on data lineage & biological entities with an ORM on top of your existing infrastructure: object storage (local directories, S3, GCP) mapped onto a SQL query engine (SQLite, Postgres, and soon, BigQuery).

You can readily create distributed LaminDB instances at any scale:

  • Get started on your laptop, deploy in the cloud, or work with a mesh of instances for different teams and purposes.

  • Share them through a hub akin to HuggingFace & GitHub - see, e.g., lamin.ai/sunnyosun.

Warning

Public beta: Currently only recommended for collaborators as we still make breaking changes.

Installation#

LaminDB is a Python package available for Python 3.8+.

pip install lamindb

Biological entities are installed like so:

pip install 'lamindb[bionty,wetlab]'

Import#

In your Python script, import LaminDB as:

import lamindb as ln

Quick setup#

Quick setup on the command line:

  • Sign up via lamin signup <email>

  • Log in via lamin login <handle>

  • Set up an instance via lamin init --storage <storage> --schema <schema_modules>

Example code
lamin signup testuser1@lamin.ai
lamin login testuser1
lamin init --storage ./mydata --schema bionty,wetlab
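
The same setup can be scripted from Python. A minimal sketch, assuming the lamindb.setup module mirrors the CLI commands with the same arguments (this module is not shown elsewhere in this guide):

import lamindb as ln

ln.setup.login("testuser1")  # assumes login mirrors `lamin login`
ln.setup.init(storage="./mydata", schema="bionty,wetlab")  # assumes init mirrors `lamin init`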

See Setup quickstart for more.

Track & query data#

Track data & metadata with sources#

Track the T in ETL and ELT.

import lamindb as ln
import pandas as pd

# track global data source (Run & Transform records)
ln.track()
#> ℹ️ Instance: testuser1/mydata
#> ℹ️ User: testuser1
#> ℹ️ Loaded notebook: Transform(id='OdlFhFWW7qg3', v='0', name='04-memory', title='Track in-memory data objects', type=notebook, created_by='DzTjkKse', created_at=datetime.datetime(2023, 3, 15, 16, 14, 42))
#> ℹ️ Loaded run:
#> Run(id='L1oBMKW60ndt5YtjRqav', transform_id='sePTpDsGJRq3', transform_v='0', created_by='bKeW4T6E', created_at=datetime.datetime(2023, 3, 14, 21, 49, 36))

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

# serialize the data object and create a SQL metadata record, including hash and linked source (run record)
file = ln.File(df, name="My dataframe")
#> File(id='dZvGD7YUKCKG4X4aLd5K', name='My dataframe', suffix='.parquet', size=2240, hash='R2_kKlH1nBGesMdyulMYkA', source_id='L1oBMKW60ndt5YtjRqav', storage_id='wor0ul6c')

# upload serialized version to the configured storage
# commit a File record to the SQL database
ln.add(file)
#> File(id='dZvGD7YUKCKG4X4aLd5K', name='My dataframe', suffix='.parquet', size=2240, hash='R2_kKlH1nBGesMdyulMYkA', source_id='L1oBMKW60ndt5YtjRqav', storage_id='wor0ul6c', created_at=datetime.datetime(2023, 3, 14, 21, 49, 46))
# create (or query) a transform record
transform = ln.Transform(name="My pipeline")
#> Transform(id='fhn5Zydf', v='1', name='My pipeline', type=pipeline, created_by='bKeW4T6E')

# create a run with the above transform (pipeline) as the data source
run = ln.Run(transform=transform)
#> Run(id='2aaKWH8dwBE6hnj3n9K9', transform_id='fhn5Zydf', transform_v='1', created_by='bKeW4T6E')

# access the transform from the run via
print(run.transform)
#> Transform(id='fhn5Zydf', v='1', name='My pipeline', created_by='bKeW4T6E')
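
# alternatively, query for an existing transform record by name instead of creating a duplicate
# (a sketch: assumes select statements support .one_or_none(), which this guide doesn't show)
existing = ln.select(ln.Transform, name="My pipeline").one_or_none()
if existing is not None:
    transform = existing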

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

# create a data object with a SQL metadata record, including hash and the linked run record
file = ln.File(df, name="My dataframe", source=run)
#> File(id='dZvGD7YUKCKG4X4aLd5K', name='My dataframe', suffix='.parquet', size=2240, hash='R2_kKlH1nBGesMdyulMYkA', source_id='L1oBMKW60ndt5YtjRqav', storage_id='wor0ul6c')

# upload serialized version to the configured storage
# commit a File record to the SQL database
ln.add(file)
#> File(id='dZvGD7YUKCKG4X4aLd5K', name='My dataframe', suffix='.parquet', size=2240, hash='R2_kKlH1nBGesMdyulMYkA', source_id='L1oBMKW60ndt5YtjRqav', storage_id='wor0ul6c', created_at=datetime.datetime(2023, 3, 14, 21, 49, 46))

Query & load data#

file = ln.select(ln.File, name="My dataframe").one()
#> File(id='dZvGD7YUKCKG4X4aLd5K', name='My dataframe', suffix='.parquet', size=2240, hash='R2_kKlH1nBGesMdyulMYkA', source_id='L1oBMKW60ndt5YtjRqav', storage_id='wor0ul6c', created_at=datetime.datetime(2023, 3, 14, 21, 49, 46))
df = file.load()
#>    a  b
#> 0  1  3
#> 1  2  4

Get the data ingested by the latest run:

run = ln.select(ln.Run).order_by(ln.Run.created_at.desc()).first()
#> Run(id='L1oBMKW60ndt5YtjRqav', transform_id='sePTpDsGJRq3', transform_v='0', created_by='bKeW4T6E', created_at=datetime.datetime(2023, 3, 14, 21, 49, 36))
files = ln.select(ln.File).where(ln.File.source == run).all()
#> [File(id='dZvGD7YUKCKG4X4aLd5K', name='My dataframe', suffix='.parquet', size=2240, hash='R2_kKlH1nBGesMdyulMYkA', source_id='L1oBMKW60ndt5YtjRqav', storage_id='wor0ul6c', created_at=datetime.datetime(2023, 3, 14, 21, 49, 46))]
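
Select statements compose before execution. A short sketch, assuming results can also be rendered as a DataFrame via .df() (an assumption; only .all(), .first(), and .one() appear above):

# files ingested by the latest run, as a DataFrame
ln.select(ln.File).where(ln.File.source == run).df()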

See Track data for more.

Track biological metadata#

Track biological features#

import bionty as bt  # Lamin's manager for biological knowledge
import lamindb as ln

ln.Run()  # assume we're in a notebook and don't need to pass a transform

# a sample single cell RNA-seq dataset
adata = ln.dev.datasets.anndata_mouse_sc_lymph_node()

# create a gene reference with:
# - Ensembl id as the standardized id
# - mouse as the species
reference = bt.Gene(species="mouse")

# parse gene identifiers from the data and map them against the reference
features = ln.Features(adata, reference)
#> 🔶 id column not found, using index as features.
#> ✅ 10000 terms (100.0%) are mapped.
#> 🔶 0 terms (0.0%) are not mapped.
# The result is a hashed feature set record:
print(features)
#> Features(id='2Mv3JtH-ScBVYHilbLaQ', type='gene', created_by='bKeW4T6E')
# gene records can be accessed via:
print(features.genes[:3])
#> [Gene(id='ENSMUSG00000020592', species_id='NCBI_10090'),
#>  Gene(id='ENSMUSG00000034931', species_id='NCBI_10090'),
#>  Gene(id='ENSMUSG00000071005', species_id='NCBI_10090')]

# track data with features
file = ln.File(adata, name="Mouse Lymph Node scRNA-seq", features=features)

# access linked gene references
print(file.features.genes[:3])
#> [Gene(id='ENSMUSG00000020592', species_id='NCBI_10090'),
#>  Gene(id='ENSMUSG00000034931', species_id='NCBI_10090'),
#>  Gene(id='ENSMUSG00000071005', species_id='NCBI_10090')]

# upload serialized data to configured storage
# commit a File record to the SQL database
# commit all linked features to the SQL database
ln.add(file)
#> File(id='VRu0Mg93d5l6NLb4znCD', name='Mouse Lymph Node scRNA-seq', suffix='.h5ad', size=17341245, hash='Qprqj0O23197Ko-VobaZiw', source_id='EB78Sl5KPG6wW6XcOlsm', storage_id='0Xt6BY40', created_at=datetime.datetime(2023, 3, 17, 6, 49, 39))
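
Everything that was linked can be queried back. A minimal round-trip sketch using only calls shown above:

# query the file by name and traverse to its linked gene records
file = ln.select(ln.File, name="Mouse Lymph Node scRNA-seq").one()
print(file.features.genes[:3])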

See Track biological features for more.

Tip

  • Each page in this guide is a Jupyter Notebook, which you can download here.

  • You can run these notebooks in hosted versions of JupyterLab, e.g., Saturn Cloud, Google Vertex AI, and others.

  • We recommend JupyterLab for the best notebook-tracking experience.

📬 Reach out to learn about data modules that connect your assays & workflows as part of our data platform's enterprise plan.