this post was submitted on 03 Sep 2025
124 points (96.3% liked)

First "modern and powerful" open source LLM?

Key features

  • Fully open model: open weights + open data + full training details including all data and training recipes
  • Massively Multilingual: 1811 natively supported languages
  • Compliant: Apertus is trained while respecting opt-out consent of data owners (even retrospectively), and avoiding memorization of training data
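
Since the weights are published openly, trying the model locally should work with standard tooling. Below is a minimal sketch using Hugging Face transformers; the repository id and model size are assumptions for illustration, not taken from the post.

    # Minimal sketch: load an Apertus checkpoint with Hugging Face transformers.
    # The repo id below is an assumption; check the official release for the
    # exact model names and sizes. device_map="auto" needs the accelerate package.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "swiss-ai/Apertus-8B-Instruct-2509"  # assumed repo id

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id, torch_dtype="auto", device_map="auto"
    )

    prompt = "What distinguishes a fully open LLM from an open-weights one?"
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=128)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))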
top 6 comments
[–] snikta@programming.dev 21 points 4 days ago

A fully open-source LLM

As a fully open language model, Apertus allows researchers, professionals and enthusiasts to build upon the model and adapt it to their specific needs, as well as to inspect any part of the training process. This distinguishes Apertus from models that make only selected components accessible.

“With this release, we aim to provide a blueprint for how a trustworthy, sovereign, and inclusive AI model can be developed,” says Martin Jaggi, Professor of Machine Learning at EPFL and member of the Steering Committee of the Swiss AI Initiative. The model will be regularly updated by the development team, which includes specialized engineers and a large number of researchers from CSCS, ETH Zurich and EPFL.

[–] kuberoot@discuss.tchncs.de 14 points 3 days ago (1 children)

Apertus was developed with due consideration to Swiss data protection laws, Swiss copyright laws, and the transparency obligations under the EU AI Act. Particular attention has been paid to data integrity and ethical standards: the training corpus builds only on data which is publicly available. It is filtered to respect machine-readable opt-out requests from websites, even retroactively, and to remove personal data and other undesired content before training begins.

We probably won't get better than this, but it sounds like it's still trained on scraped data unless you explicitly opt out, including anything that may be mirrored by third parties that don't opt out. Also, they can remove data from the training material retroactively... but they presumably won't retrain the model from scratch, which means the removed data will still be in the official weights, and those weights will keep a potential advantage over models trained later on the published training data.
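
For reference, "machine-readable opt-out requests" usually means robots.txt-style directives; whether Apertus honors exactly this mechanism isn't stated in the thread. A rough sketch of checking such an opt-out with Python's standard library, with a placeholder crawler name:

    # Rough sketch: check whether a site has opted out of crawling for a given
    # user agent via robots.txt. The crawler name is a placeholder, not the one
    # any real training pipeline uses.
    from urllib import robotparser

    rp = robotparser.RobotFileParser()
    rp.set_url("https://example.com/robots.txt")
    rp.read()

    allowed = rp.can_fetch("HypotheticalAICrawler", "https://example.com/some/page")
    print("allowed to crawl:", allowed)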

From the license:

SNAI will regularly provide a file with hash values for download which you can apply as an output filter to your use of our Apertus LLM. The file reflects data protection deletion requests which have been addressed to SNAI as the developer of the Apertus LLM. It allows you to remove Personal Data contained in the model output.

Oof, so they're basically passing data protection deletion requests on to the users and telling them to handle those requests themselves.

They also claim "open data", but I'm having trouble finding the actual training data, only the "Training data reconstruction scripts"...
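
For what it's worth, the hash-file mechanism in the license could plausibly work something like the sketch below. The actual file format, text normalization, and hashing scheme aren't described in the thread, so all of that is assumed here.

    # Hypothetical sketch of a hash-based output filter like the one the license
    # describes. File format, normalization, and n-gram size are assumptions.
    import hashlib

    def load_blocklist(path: str) -> set[str]:
        """Read one hex-encoded SHA-256 hash per line."""
        with open(path) as f:
            return {line.strip() for line in f if line.strip()}

    def redact(text: str, blocklist: set[str], ngram: int = 8) -> str:
        """Replace any run of `ngram` words whose hash appears in the blocklist."""
        words = text.split()
        out = list(words)
        for i in range(len(words) - ngram + 1):
            span = " ".join(words[i:i + ngram]).lower()
            digest = hashlib.sha256(span.encode("utf-8")).hexdigest()
            if digest in blocklist:
                out[i:i + ngram] = ["[REDACTED]"] * ngram
        return " ".join(out)

    # Usage: filtered = redact(model_output, load_blocklist("snai_deletions.txt"))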

[–] lime@feddit.nu 14 points 3 days ago* (last edited 3 days ago)

that's the problem with deletion requests, the data isn't in there. it can't be, from a purely mathematical standpoint. statistically, with the amount of stuff that goes into training, any full work included in an llm is represented by less than one bit. but the model just... remakes sensitive information from scratch. it reconstructs infringing data based on patterns.

which of course highlights the big issue with data anonymization: it can't really be done.
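
To put rough numbers on the "less than one bit" point: dividing the information capacity of the weights by the size of the corpus leaves a tiny per-token budget. The figures below are assumptions for illustration, not Apertus's published numbers.

    # Back-of-the-envelope capacity check. All figures are assumptions for
    # illustration, not taken from the Apertus release.
    params = 8e9            # assumed parameter count (8B-class model)
    bits_per_param = 16     # bf16 weights
    train_tokens = 15e12    # assumed corpus size in tokens

    weight_bits = params * bits_per_param
    print(f"{weight_bits / train_tokens:.4f} bits of weight capacity per training token")
    # ~0.0085 bits per token: far too little to store the corpus verbatim, so
    # apparent recall of a specific text is reconstruction from learned patterns
    # rather than retrieval of stored data.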

[–] dwt@feddit.org 12 points 3 days ago

This raises the question: how good is it? Has anyone tested it yet?

[–] lascapi@jlai.lu 9 points 3 days ago

Sounds good!

Is it the first LLM that is open like that (architecture, model weights, training data and recipes)?

[–] witty_username@feddit.nl 3 points 3 days ago

But can it send me into a psychotic rage?