hfselect package

Submodules

hfselect.dataset module

class hfselect.dataset.Dataset(dataset: Dataset | IterableDataset, text_col: str | Tuple[str], label_col: str, is_regression: bool, metadata: dict | None = None)

Bases: Dataset

A custom dataset wrapper that holds an internal dataset, its metadata, and instructions for processing the data.

collate_fn(rows: dict, tokenizer: PreTrainedTokenizer, max_length: int = 128, return_token_type_ids: bool = False)

The collate function for pre-processing and tokenizing the data

Parameters:
  • rows – The dataset rows (usually a batch)

  • tokenizer – The tokenizer to be used

  • max_length – The maximum length of one input text. Longer texts are truncated.

  • return_token_type_ids – Whether to return token type IDs

Returns:

classmethod from_disk(filepath) → Dataset | None

Loads the dataset from a local filepath

Parameters:

filepath – Filepath for the dataset

Returns:

The loaded dataset

classmethod from_hugging_face(name: str, split: str, text_col: str | List[str], label_col: str, is_regression: bool, subset: str | None = None, num_examples: int | None = None, seed: int | None = None, streaming: bool = False, trust_remote_code: bool | None = None) → Dataset

Loads an underlying HF dataset and creates the dataset wrapper class around it

Parameters:
  • name – The repo ID of the HF dataset

  • split – The split of the HF dataset

  • text_col – The text column of the HF dataset. This can be a tuple of columns to be concatenated.

  • label_col – The label column of the HF dataset

  • is_regression – A flag that signals if the underlying task is a regression task

  • subset – The subset of the dataset on HF

  • num_examples – Number of examples to sample. If this is None, the whole dataset is used.

  • seed – The random state for sampling examples

  • streaming – Whether to use the option for streaming datasets from HF

  • trust_remote_code – Trust remote code for HF datasets. If set to None, the local config of the datasets package is used. By default, this results in a False value.

Returns:

A dataset class with the specified underlying HF dataset
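The interplay of num_examples and seed can be pictured with a small standard-library sketch. It only illustrates seeded subsampling and is not the package's implementation; the helper name sample_rows and the toy rows are hypothetical.

```python
import random


def sample_rows(rows, num_examples=None, seed=None):
    """Return a reproducible random subset of rows (hypothetical helper)."""
    # None means "use the whole dataset", mirroring the parameter description.
    # Assumption: asking for more rows than exist also returns everything.
    if num_examples is None or num_examples >= len(rows):
        return list(rows)
    # A dedicated Random instance leaves the global random state untouched.
    rng = random.Random(seed)
    return rng.sample(rows, num_examples)


# Toy rows standing in for a real dataset.
rows = [{"text": f"example {i}", "label": i % 2} for i in range(10)]
subset_a = sample_rows(rows, num_examples=3, seed=42)
subset_b = sample_rows(rows, num_examples=3, seed=42)
```

With the same seed, both calls select the same three rows, which is the property that makes a sampled subset reproducible across runs.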

save(filepath) → None

Saves the dataset locally

Parameters:

filepath – Filepath for the dataset

exception hfselect.dataset.EmptyDatasetError(message: str | None = None)

Bases: Exception

An EmptyDatasetError is raised when a dataset is empty (possibly after filtering).

default_message = 'The dataset is empty.'
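As a rough illustration of the "empty after filtering" case, the sketch below defines a local stand-in exception with a class-level default message and raises it when a filter removes every row. The stand-in class, the filter_rows helper, and its min_chars parameter are hypothetical and are not taken from the package source.

```python
class EmptyDatasetError(Exception):
    """Local stand-in for illustration; not imported from hfselect."""

    default_message = "The dataset is empty."

    def __init__(self, message=None):
        # Fall back to the class-level default when no message is given.
        super().__init__(message or self.default_message)


def filter_rows(rows, min_chars=1):
    """Drop rows with too little text; fail loudly if nothing is left."""
    kept = [row for row in rows if len(row["text"].strip()) >= min_chars]
    if not kept:
        raise EmptyDatasetError()
    return kept


try:
    filter_rows([{"text": "   "}, {"text": ""}])
    error_text = None
except EmptyDatasetError as err:
    error_text = str(err)
```

Raising a dedicated exception at this point lets callers distinguish an empty dataset from other failures.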

hfselect.embedding_dataset module

class hfselect.embedding_dataset.EmbeddingDataset(x: array | List[array], y: array | List[array], metadata: dict | None = None)

Bases: Dataset

An EmbeddingDataset contains two sets of embeddings: a dataset embedded using a base model and the same dataset embedded by a fine-tuned model. It can be used to train an ESM on the transformation of the embedding space caused by fine-tuning the model.

classmethod from_disk(filepath: str)

Loads an EmbeddingDataset from a local file

Parameters:

filepath – Filepath of the saved EmbeddingDataset

Returns:

The loaded EmbeddingDataset

save(filepath: str) → None

Saves an EmbeddingDataset to a local file

Parameters:

filepath – Filepath to save the embedding

exception hfselect.embedding_dataset.InvalidEmbeddingDatasetError(message: str)

Bases: Exception

This error should be raised when an embedding dataset is invalid.

hfselect.embedding_dataset.create_embedding_dataset(dataset: Dataset, base_model: PreTrainedModel, tuned_model: PreTrainedModel, tokenizer: PreTrainedTokenizer, device_name: str = 'cpu', output_path: str | None = None, batch_size: int = 128) → EmbeddingDataset

Creates an EmbeddingDataset by embedding the same dataset with a base model and fine-tuned model

Parameters:
  • dataset – The dataset to be embedded

  • base_model – The base model

  • tuned_model – The fine-tuned model

  • tokenizer – The tokenizer to be used

  • device_name – The name of the device used for computation (e.g. “cpu”, “cuda”)

  • output_path – If an output path is passed here, the EmbeddingDataset will be saved

  • batch_size – The batch size for embedding the dataset

Returns:

The resulting EmbeddingDataset

hfselect.esm module

class hfselect.esm.ESM(*args, **kwargs)

Bases: Module, PyTorchModelHubMixin

An ESM (embedding space map) is a neural network that approximates the effect of fine-tuning a language model on the embedding space. It works similarly to an adapter that is placed on top of the base language model, i.e. applied to the embeddings computed by the base language model.

convert_legacy_to_new() → None

In the previous version of the package (0.1.0), the underlying model of the ESM had a different attribute name. To ensure compatibility, this function renames the attribute from sequential to model.

create_config() → ESMConfig

Returns the ESMConfig of the model. This ensures that it is returned in the right format.

Returns:

The ESMConfig of the ESM

forward(x: Tensor) → Tensor

The forward pass of the ESM

Parameters:

x – The embeddings to be transformed by the ESM

Returns:

The transformed embeddings
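To make the idea of transforming embeddings concrete, here is a minimal pure-Python sketch of a linear map applied to one embedding vector. It assumes a purely linear ESM with a weight matrix and a bias; the function name linear_esm_forward and all numbers are hypothetical, and the real class operates on tensors rather than lists.

```python
def linear_esm_forward(x, weight, bias):
    """Apply y = W x + b to one embedding vector (hypothetical stand-in)."""
    return [
        sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i
        for row, b_i in zip(weight, bias)
    ]


# Toy 3-dimensional "base model" embedding.
base_embedding = [1.0, 2.0, 3.0]

# Identity weights and a zero bias leave the embedding unchanged.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
unchanged = linear_esm_forward(base_embedding, identity, [0.0, 0.0, 0.0])

# A different map rescales and shifts the embedding, standing in for
# the change that fine-tuning causes in the embedding space.
doubling = [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 2.0]]
scaled = linear_esm_forward(base_embedding, doubling, [0.5, 0.5, 0.5])
```

The input and output have the same dimensionality here, so the transformed vectors can be used wherever the base embeddings would be used.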

property is_initialized: bool

Whether the model is initialized or not

Returns:

publish(repo_id: str, config: ESMConfig | Dict[str, float | int | str] | None = None) → None

Publishes the ESM to the HF Hub

Parameters:
  • repo_id – The repo ID to publish the model at. It is advisable to include your HF username in the repo ID.

  • config – An ESMConfig with metadata about the ESM. The model card will contain the data from this config.

save_pretrained(save_directory: str | Path, *, config: dict | DataclassInstance | None = None, repo_id: str | None = None, push_to_hub: bool = False, model_card_kwargs: Dict[str, Any] | None = None, **push_to_hub_kwargs) → str | None

Saves the weights in a local directory.

Parameters:
  • save_directory (str or Path) – Path to directory in which the model weights and configuration will be saved.

  • config (dict or DataclassInstance, optional) – Model configuration specified as a key/value dictionary or a dataclass instance.

  • push_to_hub (bool, optional, defaults to False) – Whether or not to push your model to the Hugging Face Hub after saving it.

  • repo_id (str, optional) – ID of your repository on the Hub. Used only if push_to_hub=True. Will default to the folder name if not provided.

  • model_card_kwargs (Dict[str, Any], optional) – Additional arguments passed to the model card template to customize the model card.

  • push_to_hub_kwargs – Additional keyword arguments passed along to the ModelHubMixin.push_to_hub method.

Returns:

The URL of the commit on the Hub if push_to_hub=True, None otherwise.

Return type:

str or None

exception hfselect.esm.ESMNotInitializedError(details_message: str | None = None)

Bases: Exception

This error is raised when a forward pass of the ESM is triggered before properly defining its architecture.

custom_message = 'ESM was not initialized correctly. Define the ESM architecture before using it for training or inference.'
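The guard described here can be sketched without any dependencies. The example below uses a local stand-in exception and a hypothetical TinyESM class whose "architecture" is a single scaling factor; it shows only the pattern of checking an is_initialized flag before a forward pass, not the package's actual code.

```python
class ESMNotInitializedError(Exception):
    """Local stand-in for illustration; not imported from hfselect."""


class TinyESM:
    """Hypothetical minimal model that mimics the initialization guard."""

    def __init__(self):
        self.model = None  # architecture not defined yet

    @property
    def is_initialized(self):
        return self.model is not None

    def define_architecture(self, scale):
        # A single scaling factor stands in for a real network.
        self.model = lambda x: [scale * value for value in x]

    def forward(self, x):
        if not self.is_initialized:
            raise ESMNotInitializedError(
                "Define the architecture before calling forward."
            )
        return self.model(x)


esm = TinyESM()
try:
    esm.forward([1.0, 2.0])
    guard_triggered = False
except ESMNotInitializedError:
    guard_triggered = True

esm.define_architecture(scale=2.0)
output = esm.forward([1.0, 2.0])
```

Failing early with a specific exception is easier to diagnose than an attribute error raised deep inside the forward pass.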

hfselect.esm_logme module

exception hfselect.esm_logme.NoESMsFoundError

Bases: Exception

hfselect.esm_logme.compute_scores(dataset: Dataset, base_model: PreTrainedModel, esms: list[ESM], tokenizer: PreTrainedTokenizer, batch_size: int = 128, device_name: str = 'cpu') → list[float]

Computes the ESM-LogME scores for all ESMs.

Parameters:
  • dataset – The target dataset

  • base_model – The base LM used for computing embeddings

  • esms – List of the ESMs representing the intermediate datasets

  • tokenizer – The tokenizer used for tokenizing the target texts

  • batch_size – Describes how many embeddings are computed and transformed in a batch

  • device_name – The name of the device used for computation (e.g. “cpu”, “cuda”)

Returns:

The ESM-LogME scores produced by the ESMs

Return type:

list[float]

hfselect.esm_logme.compute_task_ranking(dataset: Dataset, model_name: str, esms: list[ESM] | None = None, esm_repo_ids: list[str] | None = None, batch_size: int = 128, device_name: str = 'cpu') → TaskRanking

Computes a task ranking by first computing scores and then ranking the intermediate datasets by their scores.

Parameters:
  • dataset – The target dataset

  • model_name – The name of the base LM used for computing embeddings

  • esms – List of the ESMs representing the intermediate datasets

  • esm_repo_ids – List of the HF repo IDs of the ESMs representing the intermediate datasets

  • batch_size – Describes how many embeddings are computed and transformed in a batch

  • device_name – The name of the device used for computation (e.g. “cpu”, “cuda”)

Returns:

A task ranking of the intermediate tasks. Intermediate datasets with invalid ESMs are excluded.

Return type:

TaskRanking
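The second step, turning scores into a ranking, can be sketched with the standard library alone. The helper rank_by_score, the task names, and the scores below are made up for illustration, and the sketch assumes that a higher score indicates a more promising intermediate task.

```python
def rank_by_score(task_ids, scores):
    """Return (rank, task_id, score) tuples, best first (hypothetical helper)."""
    # Assumption: a higher score means a more promising intermediate task.
    ordered = sorted(
        zip(task_ids, scores), key=lambda pair: pair[1], reverse=True
    )
    return [
        (rank, task_id, score)
        for rank, (task_id, score) in enumerate(ordered, start=1)
    ]


# Made-up task names and scores, not real results.
ranking = rank_by_score(["task_a", "task_b", "task_c"], [-0.71, -0.42, -0.95])
```

Ranks start at 1, so the first tuple always describes the best-scoring candidate.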

hfselect.esmconfig module

class hfselect.esmconfig.ESMConfig(base_model_name: str | None = None, task_id: str | None = None, task_subset: str | None = None, text_column: str | tuple[str] | None = None, label_column: str | None = None, task_split: str | None = None, num_examples: int | None = None, seed: int | None = None, language: str | None = None, esm_architecture: str | None = None, esm_embedding_dim: int | None = None, lm_num_epochs: int | None = None, lm_batch_size: int | None = None, lm_learning_rate: float | None = None, lm_weight_decay: float | None = None, lm_optimizer: str | None = None, esm_num_epochs: int | None = None, esm_batch_size: int | None = None, esm_learning_rate: float | None = None, esm_weight_decay: float | None = None, esm_optimizer: str | None = None, developers: str | None = None, version: str | None = '0.2.1', **kwargs)

Bases: PretrainedConfig

ESMConfig is a config for an ESM. It contains metadata that is written to the model card when the ESM is uploaded to HF.

get(attr_name: str, default_return_val: Any = None) → Any

A get function to make the class behave like a dictionary

Parameters:
  • attr_name – The name of the attribute to access

  • default_return_val – A default value that gets returned when the attribute does not exist

Returns:

The value of the attribute if it exists, and otherwise the default return value

property is_valid: bool

Checks if the config is valid. Only ESMs with valid configs should be uploaded and used for task selection. An ESMConfig must contain the name of the base language model and the dataset that was used to fine-tune it.

Returns:

The validity of the config
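The dictionary-style get and the validity check can be illustrated together with a small dataclass. TinyConfig is a hypothetical stand-in that keeps only two of the fields, and the field values are placeholders; the validity rule shown is an assumption based on the description above, not the package's source.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class TinyConfig:
    """Hypothetical stand-in with two of the ESMConfig fields."""

    base_model_name: Optional[str] = None
    task_id: Optional[str] = None

    def get(self, attr_name: str, default_return_val: Any = None) -> Any:
        # getattr with a default gives dictionary-like access to attributes.
        return getattr(self, attr_name, default_return_val)

    @property
    def is_valid(self) -> bool:
        # Assumption: both the base model name and the task ID must be set.
        return self.base_model_name is not None and self.task_id is not None


# Placeholder values for illustration only.
complete = TinyConfig(base_model_name="bert-base-uncased", task_id="imdb")
incomplete = TinyConfig(base_model_name="bert-base-uncased")
```

Note that get only falls back to the default for attributes that do not exist; an existing attribute set to None is returned as None.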

exception hfselect.esmconfig.InvalidESMConfigError(message: str | None = None)

Bases: Exception

Raised when the ESMConfig is invalid

default_message = 'The Config is not a valid ESM Config. Task ID and base model name need to be specified.'

hfselect.logme module

class hfselect.logme.LogME(regression=False)

Bases: object

fit(f: ndarray, y: ndarray, add_intercept=False)

Parameters:
  • f – [N, F], feature matrix from pre-trained model

  • y – target labels. For classification, y has shape [N] with elements in [0, C_t). For regression, y has shape [N, C] with C regression labels

Returns:

LogME score (how well f can fit y directly)

predict(f: ndarray)

Parameters:

f – [N, F], feature matrix

Returns:

The predictions, with shape [N, X]

reset()

hfselect.logme.each_evidence(y_, f, fh, v, s, vh, N, D)

Computes the maximum evidence for each class
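For classification, LogME-style scoring is usually described as evaluating each class separately against a binary indicator target. Assuming that is what "for each class" refers to here, the label preparation can be sketched as follows; the helper one_vs_rest_targets is hypothetical and covers only the construction of the targets, not the evidence computation itself.

```python
def one_vs_rest_targets(labels, num_classes):
    """Build one binary target vector per class from integer labels."""
    return [
        [1.0 if label == class_index else 0.0 for label in labels]
        for class_index in range(num_classes)
    ]


labels = [0, 2, 1, 2]  # shape [N] with values in [0, C)
targets = one_vs_rest_targets(labels, num_classes=3)
```

Each inner list has length N and marks the examples that belong to one class, so a per-class score can be computed against it.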

hfselect.logme.truncated_svd(x)

hfselect.model_utils module

hfselect.model_utils.get_pooled_output(base_model: PreTrainedModel, input_ids: Tensor, attention_mask: Tensor)

Embeds texts using a language model

Parameters:
  • base_model – The language model

  • input_ids – The input IDs of the texts (after tokenization)

  • attention_mask – The attention masks of the texts (after tokenization)

Returns:

The embeddings of the texts

hfselect.setup_logger module

hfselect.task_ranking module

exception hfselect.task_ranking.InvalidTaskRankingError(message: str | None = None)

Bases: Exception

An Exception raised when the task ranking contains an error

default_message = 'The task ranking is invalid.'

class hfselect.task_ranking.TaskRanking(esm_configs: list[ESMConfig], scores: list[float], ranks: list[int] | None = None)

Bases: Sequence

A task ranking contains the ESM configs of the ranked ESMs, their scores, and their ranks

to_pandas() → DataFrame

Creates a Pandas DataFrame of the ranking

Returns:

The resulting dataframe

hfselect.trainers module

class hfselect.trainers.ESMTrainer(model: Module | None = None, optimizer: Optimizer | None = None, weight_decay: float = 0.01, learning_rate: float = 0.01, device_name: str = 'cpu')

Bases: Trainer

A trainer class that trains ESMs

train_with_embeddings(embedding_dataset: EmbeddingDataset, architecture: str | dict[str, str | tuple[str]] | None = 'linear', output_dir: str | None = None, num_epochs: int = 10, batch_size: int = 32, reset_model: bool = True, verbose: int = 1) → ESM

Trains an ESM using an EmbeddingDataset. The ESM is fitted to the embedding pairs in the dataset.

Parameters:
  • embedding_dataset – The embeddings of the same dataset embedded by a base model and a fine-tuned model

  • architecture – The desired architecture of the ESM

  • output_dir – If a directory is specified, the ESM will be saved locally after training

  • num_epochs – The number of epochs for training the ESM

  • batch_size – The batch size for training the ESM

  • reset_model – If set to False, the same model will be trained further across multiple calls of the function.

  • verbose – The verbosity level: 0 hides all output, 1 shows progress for the complete ESM training, and 2 shows progress for the individual training epochs.

Returns:

The resulting ESM
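Fitting a map to embedding pairs can be sketched in pure Python for a tiny linear case. This is only an illustration: the loss function and optimizer used by the trainer are not stated in this reference, so the mean squared error and plain full-batch gradient descent below are assumptions, and train_linear_map and the toy data are hypothetical.

```python
def train_linear_map(base, tuned, num_epochs=200, learning_rate=0.1):
    """Fit W so that W @ base[i] approximates tuned[i] (assumed MSE loss)."""
    dim = len(base[0])
    weight = [[0.0] * dim for _ in range(dim)]
    losses = []
    for _ in range(num_epochs):
        grad = [[0.0] * dim for _ in range(dim)]
        total = 0.0
        for x, y in zip(base, tuned):
            pred = [
                sum(weight[i][j] * x[j] for j in range(dim))
                for i in range(dim)
            ]
            for i in range(dim):
                err = pred[i] - y[i]
                total += err * err
                for j in range(dim):
                    # Gradient of the squared error, averaged over examples.
                    grad[i][j] += 2.0 * err * x[j] / len(base)
        for i in range(dim):
            for j in range(dim):
                weight[i][j] -= learning_rate * grad[i][j]
        # Loss measured before this epoch's update.
        losses.append(total / len(base))
    return weight, losses


# Toy embedding pairs: "fine-tuning" scales dimension 0 by 2 and dimension 1 by 3.
base = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
tuned = [[2.0, 0.0], [0.0, 3.0], [2.0, 3.0]]
weight, losses = train_linear_map(base, tuned)
```

On this toy data the learned matrix approaches the scaling that generated the targets, and the recorded loss decreases over the epochs.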

train_with_models(dataset: Dataset, base_model: PreTrainedModel, tuned_model: PreTrainedModel, tokenizer: PreTrainedTokenizer, architecture: str | dict[str, str | tuple[str]] | None = 'linear', model_output_dir: str | None = None, embeddings_output_filepath: str | None = None, num_epochs: int = 10, train_batch_size: int = 32, embeddings_batch_size: int = 128, device_name: str = 'cpu') → ESM

Trains an ESM using a dataset, a base language model and a fine-tuned language model. Internally, an EmbeddingDataset is created first. Then train_with_embeddings is called and the ESM is fitted to the embedding pairs in the dataset.

Parameters:
  • dataset – The dataset used for fine-tuning the language model

  • base_model – The base language model

  • tuned_model – The fine-tuned language model

  • tokenizer – The tokenizer for processing input texts

  • architecture – The desired architecture of the ESM

  • model_output_dir – If a directory is specified, the ESM will be saved locally after training

  • embeddings_output_filepath – If a filepath is specified, the EmbeddingDataset will be saved locally

  • num_epochs – The number of epochs for training the ESM

  • train_batch_size – The batch size for training the ESM

  • embeddings_batch_size – The batch size for creating the EmbeddingDataset

  • device_name – The name of the device used for computation (e.g. “cpu”, “cuda”)

Returns:

The resulting ESM

class hfselect.trainers.Trainer(model: Module | None = None, optimizer: Optimizer | None = None, learning_rate: float = 0.001, weight_decay: float = 0.01, device_name: str = 'cpu')

Bases: ABC

An abstract trainer class

property avg_loss

The average loss per training example

Returns:

The average loss per training example
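An average loss per training example is typically tracked with a running total and an example count. The sketch below shows one way to do that, including a reset; the RunningLoss class and its update method are hypothetical and do not reflect how the trainer stores its state.

```python
class RunningLoss:
    """Hypothetical accumulator for an average loss per training example."""

    def __init__(self):
        self.total_loss = 0.0
        self.num_examples = 0

    def update(self, batch_loss_sum, batch_size):
        # batch_loss_sum is the summed (not averaged) loss over the batch.
        self.total_loss += batch_loss_sum
        self.num_examples += batch_size

    @property
    def avg_loss(self):
        # Avoid division by zero before the first update.
        if self.num_examples == 0:
            return 0.0
        return self.total_loss / self.num_examples

    def reset_loss(self):
        self.total_loss = 0.0
        self.num_examples = 0


tracker = RunningLoss()
tracker.update(batch_loss_sum=6.0, batch_size=4)
tracker.update(batch_loss_sum=2.0, batch_size=4)
average = tracker.avg_loss
tracker.reset_loss()
```

Dividing by the number of examples rather than the number of batches keeps the average correct when the last batch is smaller than the others.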

reset_loss()

Resets the loss for optimization.

Returns:

hfselect.utils module

hfselect.utils.fetch_esm_configs(repo_ids: list[str]) → list[ESMConfig]

Fetches ESMConfigs by their repo IDs. Invalid ESMConfigs are excluded from the results; the exclusions are reported in the logs.

Parameters:

repo_ids – The HF repo IDs of the ESMs

Returns:

A list of ESMConfigs

hfselect.utils.fetch_esms(repo_ids: list[str]) → list[ESM]

Fetches ESMs by their repo IDs. Invalid ESMs are excluded from the results; the exclusions are reported in the logs.

Parameters:

repo_ids – The HF repo IDs of the ESMs

Returns:

A list of ESMs

hfselect.utils.find_esm_model_infos(model_name: str | None = None, filters: list[str] | None = None) → list[ModelInfo]

Finds HF ModelInfos for all ESMs specified by the language model name and filters

Parameters:
  • model_name – The name of the base language model

  • filters – Filters for selecting ESMs (see hf_api.list_models)

Returns:

A list of ESM ModelInfos

hfselect.utils.find_esm_repo_ids(model_name: str | None = None, filters: list[str] | None = None) → list[str]

Finds all ESM repo IDs for the specified language model name and filters

Parameters:
  • model_name – The name of the base language model

  • filters – Filters for selecting ESMs (see hf_api.list_models)

Returns:

A list of ESM repo IDs

hfselect.utils.get_esm_coverage(filters: list[str] | None = None) → dict[str, int]

Finds out how many ESMs are available for each base model

Parameters:

filters – Filters for selecting ESMs (see hf_api.list_models)

Returns:

A dictionary with base model names as keys and the number of available ESMs for each of them as values
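The shape of this result can be illustrated by counting entries per base model with collections.Counter. The helper count_esms_per_base_model works on plain dictionaries with made-up names; how the function actually gathers the ESM metadata from the Hub is not shown here.

```python
from collections import Counter


def count_esms_per_base_model(configs):
    """Count configs per base model name (hypothetical helper on plain dicts)."""
    return dict(Counter(config["base_model_name"] for config in configs))


# Made-up entries standing in for ESM metadata.
configs = [
    {"base_model_name": "model-a", "task_id": "task_1"},
    {"base_model_name": "model-a", "task_id": "task_2"},
    {"base_model_name": "model-b", "task_id": "task_1"},
]
coverage = count_esms_per_base_model(configs)
```

A mapping like this makes it easy to see which base models have enough ESMs available to make a task ranking worthwhile.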

Module contents