hfselect package

Submodules

hfselect.dataset module

class hfselect.dataset.Dataset(dataset: Dataset | IterableDataset, text_col: str | Tuple[str], label_col: str, is_regression: bool, metadata: dict | None = None)[source]

Bases: Dataset

This custom dataset contains an internal dataset, metadata and instructions about processing the data

collate_fn(rows: dict, tokenizer: PreTrainedTokenizer, max_length: int = 128, return_token_type_ids: bool = False)[source]

The collate function for pre-processing and tokenizing the data

Parameters:

rows – The dataset rows (usually a batch)
tokenizer – The tokenizer to be used
max_length – The maximum length of one input text. Longer texts are truncated.
return_token_type_ids – Whether to return token type IDs

Returns:
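
The truncation behaviour described for max_length can be pictured with a toy example. This is only a sketch under stated assumptions: whitespace splitting stands in for a real PreTrainedTokenizer, joining several text columns with a space is an assumption, and toy_collate and the column names are hypothetical.

```python
from typing import Dict, List, Sequence


def toy_collate(
    rows: Dict[str, List[str]],
    text_cols: Sequence[str],
    max_length: int = 128,
) -> List[List[str]]:
    """Join the text columns of each row and keep at most max_length tokens."""
    num_rows = len(rows[text_cols[0]])
    batch = []
    for i in range(num_rows):
        # Assumption: multiple text columns are joined with a single space.
        text = " ".join(rows[col][i] for col in text_cols)
        # Whitespace splitting is a stand-in for real tokenization.
        batch.append(text.split()[:max_length])
    return batch


# Hypothetical batch in column-oriented form (one list per column).
rows = {"premise": ["a b c", "d"], "hypothesis": ["e f", "g h i"]}
tokens = toy_collate(rows, ("premise", "hypothesis"), max_length=4)
```

The first row has five tokens after joining, so its last token is dropped. The second row has exactly four and is kept whole.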

classmethod from_disk(filepath) → Dataset | None[source]

Loads the dataset from local filepath

Parameters: filepath – Filepath for the dataset
Returns: The loaded dataset

classmethod from_hugging_face(name: str, split: str, text_col: str | List[str], label_col: str, is_regression: bool, subset: str | None = None, num_examples: int | None = None, seed: int | None = None, streaming: bool = False, trust_remote_code: bool | None = None) → Dataset[source]

Loads an underlying HF dataset and creates the dataset wrapper class around it

Parameters:

name – The repo ID of the HF dataset
split – The split of the HF dataset
text_col – The text column of the HF dataset. This can be a tuple of columns to be concatenated.
label_col – The label column of the HF dataset
is_regression – A flag that signals if the underlying task is a regression task
subset – The subset of the dataset on HF
num_examples – Number of examples to sample. If this is None, the whole dataset is used.
seed – The random state for sampling examples
streaming – Whether to use the option for streaming datasets from HF
trust_remote_code – Trust remote code for HF datasets. If set to None, the local config of the datasets package is used. By default, this results in a False value.

Returns:

A dataset class with the specified underlying HF dataset

save(filepath) → None[source]

Locally saves the dataset

Parameters: filepath – Filepath for the dataset

Returns:

exception hfselect.dataset.EmptyDatasetError(message: str | None = None)[source]

Bases: Exception

EmptyDatasetError is raised when a dataset is empty (possibly after filtering).

default_message = 'The dataset is empty.'
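
A minimal sketch of the default-message pattern suggested by the signature above. The class body is an illustrative stand-in, not the hfselect implementation, and require_rows is a hypothetical helper.

```python
from typing import List, Optional


class EmptyDatasetError(Exception):
    """Illustrative stand-in, not the hfselect implementation."""

    default_message = "The dataset is empty."

    def __init__(self, message: Optional[str] = None):
        # Fall back to the class-level default when no message is given.
        super().__init__(message or self.default_message)


def require_rows(rows: List) -> List:
    """Hypothetical helper: reject an empty list of rows."""
    if not rows:
        raise EmptyDatasetError()
    return rows


try:
    require_rows([])
except EmptyDatasetError as err:
    caught_message = str(err)
```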

hfselect.embedding_dataset module

class hfselect.embedding_dataset.EmbeddingDataset(x: array | List[array], y: array | List[array], metadata: dict | None = None)[source]

Bases: Dataset

An EmbeddingDataset contains two sets of embeddings: a dataset embedded using a base model and the same dataset embedded by a fine-tuned model. It can be used to train an ESM on the transformation of the embedding space caused by fine-tuning the model.

classmethod from_disk(filepath: str)[source]

Loads an EmbeddingDataset from a local file

Parameters: filepath – Filepath of the saved EmbeddingDataset
Returns: The loaded EmbeddingDataset

save(filepath: str) → None[source]

Saves an EmbeddingDataset to a local file

Parameters: filepath – Filepath to save the embedding

Returns:

exception hfselect.embedding_dataset.InvalidEmbeddingDatasetError(message: str)[source]

Bases: Exception

This error should be raised when an embedding dataset is invalid.

hfselect.embedding_dataset.create_embedding_dataset(dataset: Dataset, base_model: PreTrainedModel, tuned_model: PreTrainedModel, tokenizer: PreTrainedTokenizer, device_name: str = 'cpu', output_path: str | None = None, batch_size: int = 128) → EmbeddingDataset[source]

Creates an EmbeddingDataset by embedding the same dataset with a base model and fine-tuned model

Parameters:

dataset – The dataset to be embedded
base_model – The base model
tuned_model – The fine-tuned model
tokenizer – The tokenizer to be used
device_name – The device name of the device for computation (e.g. “cpu”, “cuda”)
output_path – If an output path is passed here, the EmbeddingDataset will be saved
batch_size – The batch size for embedding the dataset

Returns:

The resulting EmbeddingDataset

hfselect.esm module

class hfselect.esm.ESM(*args, **kwargs)[source]

Bases: Module, PyTorchModelHubMixin

An ESM (embedding space map) is a neural network that approximates the effect of fine-tuning a language model on the embedding space. It works similarly to an adapter that is placed on top of the base language model, i.e. it is applied to the embeddings computed by the base language model.

convert_legacy_to_new() → None[source]

In the previous version of the package (0.1.0), the underlying model of the ESM had a different attribute name. To ensure compatibility, this function renames the attribute from sequential to model.

Returns:
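
The rename described above can be sketched on a plain Python object. LegacyHolder and rename_attribute are hypothetical names used only for illustration. The actual method operates on the ESM itself and may differ in detail.

```python
class LegacyHolder:
    """Hypothetical stand-in for an object that uses the old attribute name."""

    def __init__(self, network):
        self.sequential = network


def rename_attribute(obj, old: str, new: str) -> None:
    """Move obj.<old> to obj.<new> if only the old name is present."""
    if hasattr(obj, old) and not hasattr(obj, new):
        setattr(obj, new, getattr(obj, old))
        delattr(obj, old)


holder = LegacyHolder(network="weights")
rename_attribute(holder, "sequential", "model")
# A second call is a no-op because the old name no longer exists.
rename_attribute(holder, "sequential", "model")
```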

create_config() → ESMConfig[source]

Returns the ESMConfig of the model. This ensures that it is returned in the right format.

Returns: The ESMConfig of the ESM

forward(x: Tensor) → Tensor[source]

The forward pass of the ESM

Parameters: x – The embeddings to be transformed by the ESM
Returns: The transformed embeddings
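
To make the idea of transforming embeddings concrete, the sketch below applies an affine map to a single embedding vector in pure Python. It assumes that a 'linear' architecture corresponds to one affine layer, which is not confirmed here. affine_map, weight and bias are illustrative names, not part of hfselect.

```python
from typing import List

Vector = List[float]


def affine_map(x: Vector, weight: List[Vector], bias: Vector) -> Vector:
    """Compute weight @ x + bias for a single embedding vector."""
    return [
        sum(w * v for w, v in zip(row, x)) + b
        for row, b in zip(weight, bias)
    ]


# Toy 2-dimensional example with hand-picked parameters.
weight = [[1.0, 0.0], [0.0, 2.0]]
bias = [0.5, -1.0]
transformed = affine_map([3.0, 4.0], weight, bias)
```

In the package, forward takes and returns a Tensor, so the real computation is batched. The sketch only shows the shape of the transformation.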

property is_initialized: bool

Whether the model is initialized or not

Returns:

publish(repo_id: str, config: ESMConfig | Dict[str, float | int | str] | None = None) → None[source]

Publishes the ESM to the HF Hub

Parameters:

repo_id – The repo ID to publish the model at. It is advised to include your HF username in the repo ID.
config – An ESMConfig with metadata about the ESM. The model card will contain the data from this config.

Returns:

Save weights in local directory.

Parameters:

save_directory (str or Path) – Path to directory in which the model weights and configuration will be saved.
config (dict or DataclassInstance, optional) – Model configuration specified as a key/value dictionary or a dataclass instance.
push_to_hub (bool, optional, defaults to False) – Whether or not to push your model to the Hugging Face Hub after saving it.
repo_id (str, optional) – ID of your repository on the Hub. Used only if push_to_hub=True. Will default to the folder name if not provided.
model_card_kwargs (Dict[str, Any], optional) – Additional arguments passed to the model card template to customize the model card.
push_to_hub_kwargs – Additional keyword arguments passed along to the ModelHubMixin.push_to_hub method.

Returns:

url of the commit on the Hub if push_to_hub=True, None otherwise.

Return type:

str or None

exception hfselect.esm.ESMNotInitializedError(details_message: str | None = None)[source]

Bases: Exception

This error is raised when a forward pass of the ESM is triggered before properly defining its architecture.

custom_message = 'ESM was not initialized correctly. Define the ESM architecture before using it for training or inference.'

hfselect.esm_logme module

exception hfselect.esm_logme.NoESMsFoundError[source]

Bases: Exception

hfselect.esm_logme.compute_scores(dataset: Dataset, base_model: PreTrainedModel, esms: list[ESM], tokenizer: PreTrainedTokenizer, batch_size: int = 128, device_name: str = 'cpu') → list[float][source]

Computes the ESM-LogME scores for all ESMs.

Parameters:

dataset – The target dataset
base_model – The base LM used for computing embeddings
esms – List of the ESMs representing the intermediate datasets
tokenizer – The tokenizer used for tokenizing the target texts
batch_size – Describes how many embeddings are computed and transformed in a batch
device_name – The device name of the device for computation (e.g. “cpu”, “cuda”)

Returns:

The ESM-LogME scores produced by the ESMs

Return type:

list[float]

hfselect.esm_logme.compute_task_ranking(dataset: Dataset, model_name: str, esms: list[ESM] | None = None, esm_repo_ids: list[str] | None = None, batch_size: int = 128, device_name: str = 'cpu') → TaskRanking[source]

Computes a task ranking by first computing scores and then ranking the intermediate datasets by their scores.

Parameters:

dataset – The target dataset
model_name – The name of the base LM used for computing embeddings
esms – List of the ESMs representing the intermediate datasets
esm_repo_ids – List of the HF repo IDs of the ESMs representing the intermediate datasets
batch_size – Describes how many embeddings are computed and transformed in a batch
device_name – The device name of the device for computation (e.g. “cpu”, “cuda”)

Returns:

A task ranking of the intermediate tasks. Intermediate datasets with invalid ESMs are excluded.

Return type:

TaskRanking
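
A usage sketch that chains the documented calls. The dataset name, column names and model name are assumptions chosen for illustration. The function names and parameters are taken from the signatures on this page. Running it requires hfselect, network access and an available set of ESMs for the chosen model.

```python
def rank_intermediate_tasks(model_name: str = "bert-base-multilingual-uncased"):
    """Rank intermediate tasks for a small sample of a target dataset."""
    # Imported lazily so the sketch can be defined without hfselect installed.
    from hfselect.dataset import Dataset
    from hfselect.esm_logme import compute_task_ranking
    from hfselect.utils import find_esm_repo_ids

    # Assumed example dataset and columns; replace with your own target task.
    dataset = Dataset.from_hugging_face(
        name="stanfordnlp/imdb",
        split="train",
        text_col="text",
        label_col="label",
        is_regression=False,
        num_examples=1000,
        seed=42,
    )

    # Look up the ESM repo IDs that match the base model.
    repo_ids = find_esm_repo_ids(model_name=model_name)

    ranking = compute_task_ranking(
        dataset=dataset,
        model_name=model_name,
        esm_repo_ids=repo_ids,
        batch_size=128,
        device_name="cpu",
    )
    # Convert the ranking to a DataFrame for inspection.
    return ranking.to_pandas()
```

Sampling a subset with num_examples keeps the embedding step short. The sample size of 1000 is an arbitrary choice here.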

hfselect.esmconfig module

Bases: PretrainedConfig

ESMConfig is a config for an ESM. It contains metadata that is written to the model card when the ESM is uploaded to HF.

get(attr_name: str, default_return_val: Any = None) → Any[source]

A get function to make the class behave like a dictionary

Parameters:

attr_name – The name of the attribute to access
default_return_val – A default value that gets returned when the attribute does not exist

Returns:

The value of the attribute if it exists, and otherwise the default return value
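
A dictionary-like get on an attribute-based object can be sketched with getattr. AttrConfig and the attribute name used below are hypothetical, and the real ESMConfig may implement this differently.

```python
from typing import Any


class AttrConfig:
    """Hypothetical config object that stores metadata as attributes."""

    def __init__(self, **metadata: Any):
        for key, value in metadata.items():
            setattr(self, key, value)

    def get(self, attr_name: str, default_return_val: Any = None) -> Any:
        # getattr with a default mirrors dict.get for attributes.
        return getattr(self, attr_name, default_return_val)


config = AttrConfig(base_model_name="bert-base-uncased")
present = config.get("base_model_name")
missing = config.get("task_id", "unknown")
```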

property is_valid: bool

Checks if the config is valid. Only ESMs with valid configs should be uploaded and used for task selection. An ESMConfig must contain the name of the base language model and the dataset that was used to fine-tune it.

Returns: The validity of the config

exception hfselect.esmconfig.InvalidESMConfigError(message: str | None = None)[source]

Bases: Exception

Raised when the ESMConfig is invalid

default_message = 'The Config is not a valid ESM Config. Task ID and base model name need to be specified.'

hfselect.logme module

class hfselect.logme.LogME(regression=False)[source]

Bases: object

fit(f: ndarray, y: ndarray, add_intercept=False)[source]

Parameters:

f – [N, F], feature matrix from pre-trained model
y – target labels. For classification, y has shape [N] with elements in [0, C_t). For regression, y has shape [N, C] with C regression labels

Returns:

LogME score (how well f can fit y directly)

predict(f: ndarray)[source]

Parameters: f – [N, F], feature matrix
Returns: prediction, return shape [N, X]

reset()[source]

hfselect.logme.each_evidence(y_, f, fh, v, s, vh, N, D)[source]

Computes the maximum evidence for each class

hfselect.logme.truncated_svd(x)[source]

hfselect.model_utils module

hfselect.model_utils.get_pooled_output(base_model: PreTrainedModel, input_ids: Tensor, attention_mask: Tensor)[source]

Embeds texts using a language model

Parameters:

base_model – The language model
input_ids – The input IDs of the texts (after tokenization)
attention_mask – The attention masks of the texts (after tokenization)

Returns:

The embeddings of the texts

hfselect.setup_logger module

hfselect.task_ranking module

exception hfselect.task_ranking.InvalidTaskRankingError(message: str | None = None)[source]

Bases: Exception

An Exception raised when the task ranking contains an error

default_message = 'The task ranking is invalid.'

class hfselect.task_ranking.TaskRanking(esm_configs: list[ESMConfig], scores: list[float], ranks: list[int] | None = None)[source]

Bases: Sequence

A task ranking contains the ESM configs of the ranked ESMs, their scores and their ranks

to_pandas() → DataFrame[source]

Creates a Pandas DataFrame of the ranking

Returns: The resulting DataFrame
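
Since ranks is optional in the constructor, ranks can presumably be derived from the scores. The sketch below shows one way to do that, assuming that a higher score is better and that ranks start at 1. Both are assumptions, and ranks_from_scores is a hypothetical helper.

```python
from typing import List


def ranks_from_scores(scores: List[float]) -> List[int]:
    """Return 1-based ranks, where the highest score gets rank 1."""
    # Indices sorted by descending score; ties keep their original order.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    ranks = [0] * len(scores)
    for position, index in enumerate(order, start=1):
        ranks[index] = position
    return ranks


example_ranks = ranks_from_scores([0.2, 0.9, -0.1])
```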

hfselect.trainers module

class hfselect.trainers.ESMTrainer(model: Module | None = None, optimizer: Optimizer | None = None, weight_decay: float = 0.01, learning_rate: float = 0.01, device_name: str = 'cpu')[source]

Bases: Trainer

A trainer class that trains ESMs

train_with_embeddings(embedding_dataset: EmbeddingDataset, architecture: str | dict[str, str | tuple[str]] | None = 'linear', output_dir: str | None = None, num_epochs: int = 10, batch_size: int = 32, reset_model: bool = True, verbose: int = 1) → ESM[source]

Trains an ESM using an EmbeddingDataset dataset. The ESM is fitted to the embedding pairs in the dataset.

Parameters:

embedding_dataset – The embeddings of the same dataset embedded by a base model and a fine-tuned model
architecture – The desired architecture of the ESM
output_dir – If a directory is specified, the ESM will be saved locally after training
num_epochs – The number of epochs for training the ESM
batch_size – The batch size for training the ESM
reset_model – If set to False, the same model will be trained further with multiple calls of the function.
verbose – 0 hides everything, 1 shows the complete training of the ESM, and 2 shows the ESM training epochs.

Returns:

The resulting ESM

train_with_models(dataset: Dataset, base_model: PreTrainedModel, tuned_model: PreTrainedModel, tokenizer: PreTrainedTokenizer, architecture: str | dict[str, str | tuple[str]] | None = 'linear', model_output_dir: str | None = None, embeddings_output_filepath: str | None = None, num_epochs: int = 10, train_batch_size: int = 32, embeddings_batch_size: int = 128, device_name: str = 'cpu') → ESM[source]

Trains an ESM using a dataset, a base language model and a fine-tuned language model. Internally, an EmbeddingDataset is created. Following this, train_with_embeddings is called and the ESM is fitted to the embedding pairs in the dataset.

Parameters:

dataset – The dataset used for fine-tuning the language model
base_model – The base language model
tuned_model – The fine-tuned language model
tokenizer – The tokenizer for processing input texts
architecture – The desired architecture of the ESM
model_output_dir – If a directory is specified, the ESM will be saved locally after training
embeddings_output_filepath – If a filepath is specified, the EmbeddingDataset will be saved locally
num_epochs – The number of epochs for training the ESM
train_batch_size – The batch size for training the ESM
embeddings_batch_size – The batch size for creating the EmbeddingDataset
device_name – The device name of the device for computation (e.g. “cpu”, “cuda”)

Returns:

The resulting ESM
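
A usage sketch for train_with_models based only on the signatures on this page. The wrapper function train_esm_sketch is hypothetical. The caller is expected to supply an hfselect Dataset, the two models and a tokenizer.

```python
def train_esm_sketch(dataset, base_model, tuned_model, tokenizer):
    """Train an ESM from a dataset, a base model and its fine-tuned version."""
    # Imported lazily so the sketch can be defined without hfselect installed.
    from hfselect.trainers import ESMTrainer

    trainer = ESMTrainer(device_name="cpu")
    esm = trainer.train_with_models(
        dataset=dataset,
        base_model=base_model,
        tuned_model=tuned_model,
        tokenizer=tokenizer,
        architecture="linear",
        num_epochs=10,
        train_batch_size=32,
        embeddings_batch_size=128,
        device_name="cpu",
    )
    return esm
```

Passing embeddings_output_filepath as well would keep the intermediate EmbeddingDataset on disk, which allows later calls to train_with_embeddings without embedding the dataset again.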

class hfselect.trainers.Trainer(model: Module | None = None, optimizer: Optimizer | None = None, learning_rate: float = 0.001, weight_decay: float = 0.01, device_name: str = 'cpu')[source]

Bases: ABC

An abstract trainer class

property avg_loss

The average loss per training example

Returns: The average loss per training example

reset_loss()[source]

Resets the loss for optimization.

Returns:
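
The interplay of avg_loss and reset_loss can be pictured as a running total that is cleared on demand. LossTracker and its update method are hypothetical. How the actual Trainer accumulates the loss is not documented here.

```python
class LossTracker:
    """Hypothetical accumulator for the average loss per training example."""

    def __init__(self) -> None:
        self.total_loss = 0.0
        self.num_examples = 0

    def update(self, batch_loss_sum: float, batch_size: int) -> None:
        # Add the summed loss of one batch and the number of examples in it.
        self.total_loss += batch_loss_sum
        self.num_examples += batch_size

    @property
    def avg_loss(self) -> float:
        # Avoid division by zero before any batch has been seen.
        if self.num_examples == 0:
            return 0.0
        return self.total_loss / self.num_examples

    def reset_loss(self) -> None:
        self.total_loss = 0.0
        self.num_examples = 0


tracker = LossTracker()
tracker.update(batch_loss_sum=6.0, batch_size=4)
tracker.update(batch_loss_sum=2.0, batch_size=4)
avg_before_reset = tracker.avg_loss
tracker.reset_loss()
avg_after_reset = tracker.avg_loss
```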

hfselect.utils module

hfselect.utils.fetch_esm_configs(repo_ids: list[str]) → list[ESMConfig][source]

Fetches ESMConfigs by their repo IDs. Invalid ESMConfigs are excluded from the results. Exclusions are reported in the logs.

Parameters: repo_ids – The HF repo IDs of the ESMs
Returns: A list of ESMConfigs

hfselect.utils.fetch_esms(repo_ids: list[str]) → list[ESM][source]

Fetches ESMs by their repo IDs. Invalid ESMs are excluded from the results. Exclusions are reported in the logs.

Parameters: repo_ids – The HF repo IDs of the ESMs
Returns: A list of ESMs

hfselect.utils.find_esm_model_infos(model_name: str | None = None, filters: list[str] | None = None) → list[ModelInfo][source]

Finds HF ModelInfos for all ESMs specified by the language model name and filters

Parameters:

model_name – The name of the base language model
filters – Filters for selecting ESMs (see hf_api.list_models)

Returns:

A list of ESM ModelInfos

hfselect.utils.find_esm_repo_ids(model_name: str | None = None, filters: list[str] | None = None) → list[str][source]

Finds all ESM repo IDs for the specified language model name and filters

Parameters:

model_name – The name of the base language model
filters – Filters for selecting ESMs (see hf_api.list_models)

Returns:

A list of ESM repo IDs

hfselect.utils.get_esm_coverage(filters: list[str] | None = None) → dict[str, int][source]

Finds out how many ESMs are available for each base model

Parameters: filters – Filters for selecting ESMs (see hf_api.list_models)
Returns: A dictionary with base model names as keys and the number of available ESMs for them as values
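
The shape of the returned dictionary can be illustrated with collections.Counter. The input list below is made-up data, and count_by_base_model is a hypothetical helper. The real function queries the HF Hub instead.

```python
from collections import Counter
from typing import Dict, Iterable


def count_by_base_model(base_model_names: Iterable[str]) -> Dict[str, int]:
    """Count how many ESMs are listed for each base model name."""
    return dict(Counter(base_model_names))


# Made-up data: one base model name per ESM found.
coverage = count_by_base_model(
    ["bert-base-uncased", "roberta-base", "bert-base-uncased"]
)
```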
