Open Lightly, drop in your raw images or logs, and start shaping a dataset that actually moves your model’s metrics. Connect cloud storage (S3, GCS, Azure) or upload a folder, then use the Python SDK to compute representation vectors with the model of your choice. In minutes, the web console builds similarity maps, flags near-duplicates, and spotlights outliers. Filter by metadata, confidence, or custom rules to carve out a first, lean training set. Assign tags to collections (e.g., v0.1-baseline, needs-labels, hard-negatives), compare snapshots side by side, and share a link with your team for review.
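The near-duplicate flagging the console performs can be sketched locally with cosine similarity over whatever embedding vectors you computed. This is an illustrative stand-in using NumPy, not the Lightly SDK's actual API:

```python
import numpy as np

def flag_near_duplicates(embeddings, threshold=0.95):
    """Return index pairs whose cosine similarity exceeds the threshold."""
    # Normalize rows so the dot product equals cosine similarity.
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = normed @ normed.T
    np.fill_diagonal(sims, 0.0)
    # Keep each pair once (upper triangle only).
    i, j = np.where(np.triu(sims, k=1) > threshold)
    return list(zip(i.tolist(), j.tolist()))

# Toy embeddings: rows 0 and 1 are near-identical, row 2 is distinct.
emb = np.array([[1.0, 0.0], [0.999, 0.01], [0.0, 1.0]])
print(flag_near_duplicates(emb))  # [(0, 1)]
```

In practice you would feed in the representation vectors computed by your model of choice; the threshold is a tuning knob that trades recall of duplicates against false positives.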
Spin up an iterative improvement loop without changing your stack. Train a baseline, push predictions and vectors back to Lightly, and mine the misses: clusters of misclassifications, edge cases, or underrepresented scenarios. Prioritize exactly what to label next—no more random sampling. Export the selected batch to your annotation tool, re-train, and measure the lift. Schedule recurring runs that pull in new data, score it for novelty and utility, and queue the top candidates for labeling. Every export is versioned with lineage so you can always reproduce a release or roll back.
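The "prioritize exactly what to label next" step can be approximated with plain uncertainty sampling over your model's softmax outputs. A minimal sketch; entropy ranking is just one of several selection strategies such tools support:

```python
import numpy as np

def entropy_ranking(probs, k=2):
    """Rank unlabeled samples by predictive entropy, highest (most uncertain) first."""
    eps = 1e-12  # avoid log(0)
    ent = -(probs * np.log(probs + eps)).sum(axis=1)
    return np.argsort(-ent)[:k]

# Toy softmax outputs for three unlabeled samples.
probs = np.array([
    [0.98, 0.01, 0.01],   # confident prediction
    [0.34, 0.33, 0.33],   # nearly uniform -> label first
    [0.70, 0.20, 0.10],   # moderately uncertain
])
print(entropy_ranking(probs, k=2).tolist())  # [1, 2]
```

The returned indices are the batch you would export to your annotation tool; after retraining, you push fresh predictions back and repeat the loop.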
Use it to keep production honest. Route fresh data from pipelines to Lightly and watch distribution shifts in real time—lighting conditions changing, new camera angles, a new SKU on shelves, or sensor patterns drifting. Build targeted refresh sets that rebalance your corpus and protect accuracy. Governance is built in: role-based access, audit logs, and searchable history of who changed what and when. Tie samples to model runs and notes so handoffs between ML, data, and QA stay tight, even across teams and time zones. Integrate with your orchestration (Airflow, Prefect, GitHub Actions) to make curation part of CI for data.
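Watching for distribution shift can be sketched as a crude centroid check between a reference embedding batch and a fresh production batch. Real monitoring uses richer statistics, so treat this as illustration only:

```python
import numpy as np

def centroid_shift(reference, fresh):
    """Euclidean distance between batch centroids: a crude drift score."""
    return float(np.linalg.norm(reference.mean(axis=0) - fresh.mean(axis=0)))

rng = np.random.default_rng(0)
ref = rng.normal(size=(500, 8))      # reference embeddings
same = rng.normal(size=(500, 8))     # fresh batch from the same distribution
shifted = ref + 3.0                  # e.g. new lighting moves the embedding cloud

print(centroid_shift(ref, same), centroid_shift(ref, shifted))
```

A score that climbs over successive batches is the trigger for building a targeted refresh set; a mean-only check misses variance and mode changes, which is why production systems pair it with per-cluster statistics.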
Concrete flows to get value fast:
- Computer vision: import a week of factory footage, cluster similar frames, remove redundancies, and promote rare defect examples for labeling; ship a compact training set that boosts recall with fewer annotations.
- Retail: filter shelf images by store, time, and planogram, then mine hard negatives where the detector overfires.
- Autonomous systems: tag nighttime and weather subsets, track gaps, and schedule targeted data collection.
- Text or logs: embed documents with your encoder, de-duplicate, group by topic, and isolate noisy sources before RAG or classification.
In each case, you move from a pile of raw data to a curated, documented, and shareable dataset that is ready for training, backed by clear evidence of why those samples were chosen.
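The de-duplication step of the text/log flow can be sketched by collapsing lines to templates before embedding or clustering. `dedupe_logs` is a hypothetical helper for illustration, not part of any SDK:

```python
import re
from collections import Counter

def dedupe_logs(lines):
    """Collapse log lines to templates by masking numbers, then count duplicates."""
    templates = [re.sub(r"\d+", "<num>", ln.strip()) for ln in lines]
    counts = Counter(templates)
    unique = list(dict.fromkeys(templates))  # first-seen order, duplicates dropped
    return unique, counts

logs = [
    "2024-05-01 12:00:01 worker 17 finished batch 4",
    "2024-05-01 12:00:02 worker 18 finished batch 5",
    "disk usage at 91 percent",
]
unique, counts = dedupe_logs(logs)
print(unique)  # the two worker lines collapse into one template
```

Deduplicating templates first keeps a noisy source from dominating the topic clusters you build downstream.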
Community
Free
Samples per Dataset: 1,000
Datasets: 3
Upload: Thumbnails, Embeddings
Data Selection: Histogram Filters, Lightly Filters, Active Learning
Support: Community

Professional
$440.00 per month
Samples per Dataset: 100,000
Datasets: 10
Upload: Thumbnails, Embeddings, Custom Metadata, Original Images
Data Selection: Histogram Filters, Lightly Filters, Active Learning
Support: Priority E-mail/Slack

Enterprise
Custom pricing
Includes all features of the Professional plan, plus:
Unlimited samples
Unlimited datasets
On-premise / Private Cloud deployment
Automated labeling
Other data formats
Comments
Tailored to your needs