hfselect package
Submodules
hfselect.dataset module
- class hfselect.dataset.Dataset(dataset: Dataset | IterableDataset, text_col: str | Tuple[str], label_col: str, is_regression: bool, metadata: dict | None = None)[source]
Bases:
Dataset
This custom dataset contains an internal dataset, metadata, and instructions for processing the data.
- collate_fn(rows: dict, tokenizer: PreTrainedTokenizer, max_length: int = 128, return_token_type_ids: bool = False)[source]
The collate function for pre-processing and tokenizing the data
- Parameters:
rows – The dataset rows (usually a batch)
tokenizer – The tokenizer to be used
max_length – The maximum length of one input text. Longer texts are truncated.
return_token_type_ids – Whether to return token type IDs
Returns:
- classmethod from_disk(filepath) Dataset | None[source]
Loads the dataset from local filepath
- Parameters:
filepath – Filepath for the dataset
- Returns:
The loaded dataset
- classmethod from_hugging_face(name: str, split: str, text_col: str | List[str], label_col: str, is_regression: bool, subset: str | None = None, num_examples: int | None = None, seed: int | None = None, streaming: bool = False, trust_remote_code: bool | None = None) Dataset[source]
Loads an underlying HF dataset and creates the dataset wrapper class around it
- Parameters:
name – The repo ID of the HF dataset
split – The split of the HF dataset
text_col – The text column of the HF dataset. This can be a tuple of columns to be concatenated.
label_col – The label column of the HF dataset
is_regression – A flag that signals if the underlying task is a regression task
subset – The subset of the dataset on HF
num_examples – Number of examples to sample. If this is None, the whole dataset is used.
seed – The random state for sampling examples
streaming – Whether to use the option for streaming datasets from HF
trust_remote_code – Trust remote code for HF datasets. If set to None, the local config of the datasets package is used. By default, this results in a False value.
- Returns:
A dataset class with the specified underlying HF dataset
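The num_examples and seed parameters above describe reproducible subsampling. A minimal sketch of that idea using only the standard library follows; the helper sample_rows is hypothetical and is not how hfselect implements sampling internally.

```python
import random
from typing import Optional


def sample_rows(rows: list, num_examples: Optional[int], seed: Optional[int]) -> list:
    """Hypothetical helper: draw a reproducible subsample of dataset rows."""
    # None means "use the whole dataset", mirroring the num_examples description.
    if num_examples is None or num_examples >= len(rows):
        return list(rows)
    # A dedicated Random instance keeps the global random state untouched.
    rng = random.Random(seed)
    return rng.sample(rows, num_examples)


rows = [{"text": f"example {i}", "label": i % 2} for i in range(100)]
first = sample_rows(rows, num_examples=10, seed=42)
second = sample_rows(rows, num_examples=10, seed=42)
```

Calling the helper twice with the same seed yields the same subsample, which is the property a fixed seed is meant to provide.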
hfselect.embedding_dataset module
- class hfselect.embedding_dataset.EmbeddingDataset(x: array | List[array], y: array | List[array], metadata: dict | None = None)[source]
Bases:
Dataset
An EmbeddingDataset contains two sets of embeddings: a dataset embedded using a base model and the same dataset embedded by a fine-tuned model. It can be used to train an ESM on the transformation of the embedding space caused by fine-tuning the model.
- exception hfselect.embedding_dataset.InvalidEmbeddingDatasetError(message: str)[source]
Bases:
Exception
This error should be raised when an embedding dataset is invalid.
- hfselect.embedding_dataset.create_embedding_dataset(dataset: Dataset, base_model: PreTrainedModel, tuned_model: PreTrainedModel, tokenizer: PreTrainedTokenizer, device_name: str = 'cpu', output_path: str | None = None, batch_size: int = 128) EmbeddingDataset[source]
Creates an EmbeddingDataset by embedding the same dataset with a base model and fine-tuned model
- Parameters:
dataset – The dataset to be embedded
base_model – The base model
tuned_model – The fine-tuned model
tokenizer – The tokenizer to be used
device_name – The device name of the device for computation (e.g. “cpu”, “cuda”)
output_path – If an output path is passed here, the EmbeddingDataset will be saved
batch_size – The batch size for embedding the dataset
- Returns:
The resulting EmbeddingDataset
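The pairing described above (the same texts embedded once by the base model and once by the fine-tuned model, in batches) can be sketched without any model at all. The functions embed_pairs, toy_base and toy_tuned below are hypothetical stand-ins for illustration, not part of hfselect.

```python
from typing import Callable, Iterator, List, Sequence, Tuple

Vector = List[float]
EmbedFn = Callable[[Sequence[str]], List[Vector]]


def batched(items: Sequence[str], batch_size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def embed_pairs(texts: Sequence[str], base_embed: EmbedFn, tuned_embed: EmbedFn,
                batch_size: int = 128) -> Tuple[List[Vector], List[Vector]]:
    """Embed every batch with both functions so x[i] and y[i] share one text."""
    x: List[Vector] = []
    y: List[Vector] = []
    for batch in batched(texts, batch_size):
        x.extend(base_embed(batch))
        y.extend(tuned_embed(batch))
    return x, y


# Toy stand-ins for the two language models (assumptions for illustration).
def toy_base(batch: Sequence[str]) -> List[Vector]:
    return [[float(len(text)), 1.0] for text in batch]


def toy_tuned(batch: Sequence[str]) -> List[Vector]:
    return [[2.0 * len(text), 0.0] for text in batch]


x, y = embed_pairs(["a", "bb", "ccc"], toy_base, toy_tuned, batch_size=2)
```

Here x plays the role of the base-model embeddings and y the role of the fine-tuned embeddings, aligned row by row.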
hfselect.esm module
- class hfselect.esm.ESM(*args, **kwargs)[source]
Bases:
Module, PyTorchModelHubMixin
An ESM (embedding space map) is a neural network that approximates the effect of fine-tuning a language model on the embedding space. It works similarly to an adapter that can be placed on top of the base language model, that is, applied to the embeddings computed by the base language model.
- convert_legacy_to_new() None[source]
In the previous version of the package (0.1.0), the underlying model of the ESM had a different attribute name. To ensure compatibility, this function renames the attribute from sequential to model.
Returns:
- create_config() ESMConfig[source]
Returns the ESMConfig of the model. This ensures that it is returned in the right format.
- Returns:
The ESMConfig of the ESM
- forward(x: Tensor) Tensor[source]
The forward pass of the ESM
- Parameters:
x – The embeddings to be transformed by the ESM
- Returns:
The transformed embeddings
- property is_initialized: bool
Whether the model is initialized or not
Returns:
- publish(repo_id: str, config: ESMConfig | Dict[str, float | int | str] | None = None) None[source]
Publishes the ESM to the HF Hub
- Parameters:
repo_id – The repo ID to publish the model at. It is advised to include your HF username in the repo ID.
config – An ESMConfig with metadata about the ESM. The model card will contain the data from this config.
Returns:
- save_pretrained(save_directory: str | Path, *, config: dict | DataclassInstance | None = None, repo_id: str | None = None, push_to_hub: bool = False, model_card_kwargs: Dict[str, Any] | None = None, **push_to_hub_kwargs) str | None[source]
Save weights in local directory.
- Parameters:
save_directory (str or Path) – Path to directory in which the model weights and configuration will be saved.
config (dict or DataclassInstance, optional) – Model configuration specified as a key/value dictionary or a dataclass instance.
push_to_hub (bool, optional, defaults to False) – Whether or not to push your model to the Huggingface Hub after saving it.
repo_id (str, optional) – ID of your repository on the Hub. Used only if push_to_hub=True. Will default to the folder name if not provided.
model_card_kwargs (Dict[str, Any], optional) – Additional arguments passed to the model card template to customize the model card.
push_to_hub_kwargs – Additional key word arguments passed along to the [~ModelHubMixin.push_to_hub] method.
- Returns:
url of the commit on the Hub if push_to_hub=True, None otherwise.
- Return type:
str or None
- exception hfselect.esm.ESMNotInitializedError(details_message: str | None = None)[source]
Bases:
Exception
This error is raised when a forward pass of the ESM is triggered before properly defining its architecture.
- custom_message = 'ESM was not initialized correctly. Define the ESM architecture before using it for training or inference.'
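The interplay of is_initialized, forward and ESMNotInitializedError can be illustrated with a toy linear map in plain Python. ToyLinearESM and NotInitializedError are hypothetical stand-ins; the real ESM is a PyTorch module and its architecture may differ.

```python
from typing import List, Optional


class NotInitializedError(Exception):
    """Hypothetical stand-in for ESMNotInitializedError."""


class ToyLinearESM:
    """Toy linear map over embeddings; not the real hfselect.esm.ESM."""

    def __init__(self, weight: Optional[List[List[float]]] = None,
                 bias: Optional[List[float]] = None) -> None:
        self.weight = weight  # one row per output dimension
        self.bias = bias

    @property
    def is_initialized(self) -> bool:
        return self.weight is not None and self.bias is not None

    def forward(self, x: List[float]) -> List[float]:
        # Refuse to run before the architecture (here: weight and bias) exists.
        if not self.is_initialized:
            raise NotInitializedError("Define the architecture before calling forward.")
        return [sum(w * v for w, v in zip(row, x)) + b
                for row, b in zip(self.weight, self.bias)]


esm = ToyLinearESM(weight=[[1.0, 0.0], [0.0, 2.0]], bias=[0.5, 0.0])
transformed = esm.forward([3.0, 4.0])
```

An instance created without weights raises the error on forward, which mirrors the guard described for the real class.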
hfselect.esm_logme module
- hfselect.esm_logme.compute_scores(dataset: Dataset, base_model: PreTrainedModel, esms: list[ESM], tokenizer: PreTrainedTokenizer, batch_size: int = 128, device_name: str = 'cpu') list[float][source]
Computes the ESM-LogME scores for all ESMs.
- Parameters:
dataset – The target dataset
base_model – The base LM used for computing embeddings
esms – List of the ESMs representing the intermediate datasets
tokenizer – The tokenizer used for tokenizing the target texts
batch_size – Describes how many embeddings are computed and transformed in a batch
device_name – The device name of the device for computation (e.g. “cpu”, “cuda”)
- Returns:
The ESM-LogME scores produced by the ESMs
- Return type:
scores
- hfselect.esm_logme.compute_task_ranking(dataset: Dataset, model_name: str, esms: list[ESM] | None = None, esm_repo_ids: list[str] | None = None, batch_size: int = 128, device_name: str = 'cpu') TaskRanking[source]
Computes a task ranking by first computing scores and then ranking the intermediate datasets by their scores.
- Parameters:
dataset – The target dataset
model_name – The name of the base LM used for computing embeddings
esms – List of the ESMs representing the intermediate datasets
esm_repo_ids – List of the HF repo IDs of the ESMs representing the intermediate datasets
batch_size – Describes how many embeddings are computed and transformed in a batch
device_name – The device name of the device for computation (e.g. “cpu”, “cuda”)
- Returns:
A task ranking of the intermediate tasks. Intermediate datasets with invalid ESMs are excluded.
- Return type:
task_ranking
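The ranking step itself (scores first, then ordering the intermediate datasets by score) can be sketched as a sort. The helper rank_by_score and the repo IDs and scores below are made up for illustration, and the sketch assumes that a higher score indicates a better intermediate task.

```python
from typing import List, Tuple


def rank_by_score(repo_ids: List[str], scores: List[float]) -> List[Tuple[str, float]]:
    """Hypothetical helper: pair each ESM repo ID with its score, best first."""
    if len(repo_ids) != len(scores):
        raise ValueError("repo_ids and scores must have the same length")
    # Assumption: higher scores are better, so sort in descending order.
    return sorted(zip(repo_ids, scores), key=lambda pair: pair[1], reverse=True)


ranking = rank_by_score(
    ["user/esm-a", "user/esm-b", "user/esm-c"],  # illustrative repo IDs
    [-0.71, -0.42, -0.95],                        # illustrative scores
)
```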
hfselect.esmconfig module
- class hfselect.esmconfig.ESMConfig(base_model_name: str | None = None, task_id: str | None = None, task_subset: str | None = None, text_column: str | tuple[str] | None = None, label_column: str | None = None, task_split: str | None = None, num_examples: int | None = None, seed: int | None = None, language: str | None = None, esm_architecture: str | None = None, esm_embedding_dim: int | None = None, lm_num_epochs: int | None = None, lm_batch_size: int | None = None, lm_learning_rate: float | None = None, lm_weight_decay: float | None = None, lm_optimizer: str | None = None, esm_num_epochs: int | None = None, esm_batch_size: int | None = None, esm_learning_rate: float | None = None, esm_weight_decay: float | None = None, esm_optimizer: str | None = None, developers: str | None = None, version: str | None = '0.2.1', **kwargs)[source]
Bases:
PretrainedConfig
ESMConfig is a config for an ESM. It contains metadata that is written to the model card when the ESM is uploaded to HF.
- get(attr_name: str, default_return_val: Any = None) Any[source]
A get function to make the class behave like a dictionary
- Parameters:
attr_name – The name of the attribute to access
default_return_val – A default value that gets returned when the attribute does not exist
- Returns:
The value of the attribute if it exists, and otherwise the default return value
- property is_valid: bool
Checks if the config is valid. Only ESMs with valid configs should be uploaded and used for task selection. An ESMConfig must contain the name of the base language model and the dataset that was used to fine-tune it.
- Returns:
The validity of the config
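The dictionary-like get and the is_valid check can be illustrated with a small dataclass. ToyConfig is a hypothetical stand-in, and the choice of task_id as the field that identifies the dataset is an assumption based on the constructor signature.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class ToyConfig:
    """Hypothetical stand-in for ESMConfig with only two of its fields."""

    base_model_name: Optional[str] = None
    task_id: Optional[str] = None

    def get(self, attr_name: str, default_return_val: Any = None) -> Any:
        # Behave like dict.get: fall back to the default for unknown attributes.
        return getattr(self, attr_name, default_return_val)

    @property
    def is_valid(self) -> bool:
        # Assumption: validity requires both the base model and the dataset ID.
        return bool(self.base_model_name) and bool(self.task_id)


config = ToyConfig(base_model_name="bert-base-multilingual-uncased", task_id="imdb")
incomplete = ToyConfig(task_id="imdb")
```

In this sketch, config.get("missing", "n/a") returns the fallback instead of raising, and incomplete.is_valid is False because the base model name is absent.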
hfselect.logme module
- class hfselect.logme.LogME(regression=False)[source]
Bases:
object
- fit(f: ndarray, y: ndarray, add_intercept=False)[source]
- Parameters:
f – [N, F], feature matrix from pre-trained model
y – target labels. For classification, y has shape [N] with elements in [0, C_t). For regression, y has shape [N, C] with C regression labels
- Returns:
LogME score (how well f can fit y directly)
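The label shapes above can be made concrete. In the reference LogME formulation, classification labels are turned into one binary one-vs-rest target per class and the per-class evidence is averaged; whether hfselect.logme does exactly this is an assumption. The helper one_vs_rest_targets is hypothetical and only shows the label preparation, not the evidence computation.

```python
from typing import List


def one_vs_rest_targets(y: List[int], num_classes: int) -> List[List[float]]:
    """Hypothetical helper: one binary target vector of length N per class."""
    for label in y:
        if not 0 <= label < num_classes:
            raise ValueError(f"label {label} is outside [0, {num_classes})")
    return [[1.0 if label == cls else 0.0 for label in y]
            for cls in range(num_classes)]


y = [0, 2, 1, 2]                      # shape [N] with N = 4
targets = one_vs_rest_targets(y, 3)   # 3 target vectors, each of length N
```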
hfselect.model_utils module
- hfselect.model_utils.get_pooled_output(base_model: PreTrainedModel, input_ids: Tensor, attention_mask: Tensor)[source]
Embeds texts using a language model
- Parameters:
base_model – The language model
input_ids – The input IDs of the texts (after tokenization)
attention_mask – The attention masks of the texts (after tokenization)
- Returns:
The embeddings of the texts
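The documentation above does not say which pooling strategy get_pooled_output applies. As one common possibility, and purely as an assumption, the sketch below shows masked mean pooling for a single text in plain Python: token vectors are averaged, and padded positions (attention mask 0) are ignored. The function masked_mean_pool is hypothetical.

```python
from typing import List


def masked_mean_pool(token_embeddings: List[List[float]],
                     attention_mask: List[int]) -> List[float]:
    """Hypothetical helper: average the token vectors whose mask entry is 1."""
    dim = len(token_embeddings[0])
    total = [0.0] * dim
    count = 0
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(vector):
                total[i] += value
    if count == 0:
        return total  # no real tokens: return the zero vector
    return [value / count for value in total]


# Two real tokens followed by one padding position.
pooled = masked_mean_pool([[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]], [1, 1, 0])
```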
hfselect.setup_logger module
hfselect.task_ranking module
- exception hfselect.task_ranking.InvalidTaskRankingError(message: str | None = None)[source]
Bases:
Exception
An exception raised when the task ranking contains an error.
- default_message = 'The task ranking is invalid.'
hfselect.trainers module
- class hfselect.trainers.ESMTrainer(model: Module | None = None, optimizer: Optimizer | None = None, weight_decay: float = 0.01, learning_rate: float = 0.01, device_name: str = 'cpu')[source]
Bases:
Trainer
A trainer class that trains ESMs.
- train_with_embeddings(embedding_dataset: EmbeddingDataset, architecture: str | dict[str, str | tuple[str]] | None = 'linear', output_dir: str | None = None, num_epochs: int = 10, batch_size: int = 32, reset_model: bool = True, verbose: int = 1) ESM[source]
Trains an ESM using an EmbeddingDataset dataset. The ESM is fitted to the embedding pairs in the dataset.
- Parameters:
embedding_dataset – The embeddings of the same dataset embedded by a base model and a fine-tuned model
architecture – The desired architecture of the ESM
output_dir – If a directory is specified, the ESM will be saved locally after training
num_epochs – The number of epochs for training the ESM
batch_size – The batch size for training the ESM
reset_model – If set to False, the same model will be trained further across multiple calls of the function.
verbose – 0 hides everything, 1 shows the complete training of the ESM, and 2 shows the ESM training epochs.
- Returns:
The resulting ESM
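Fitting an ESM to embedding pairs amounts to regression from base embeddings to fine-tuned embeddings. A minimal sketch for the linear case, using plain-Python stochastic gradient descent on a squared error, is shown below. The loss, the optimizer and the function train_linear_map are assumptions for illustration and are not taken from ESMTrainer.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def predict(weight: Matrix, bias: List[float], x: List[float]) -> List[float]:
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def mse(weight: Matrix, bias: List[float], xs: Matrix, ys: Matrix) -> float:
    """Squared error summed over dimensions, averaged over examples."""
    total = 0.0
    for x, y in zip(xs, ys):
        total += sum((p - t) ** 2 for p, t in zip(predict(weight, bias, x), y))
    return total / len(xs)


def train_linear_map(xs: Matrix, ys: Matrix, num_epochs: int = 10,
                     learning_rate: float = 0.01) -> Tuple[Matrix, List[float]]:
    """Hypothetical helper: fit y ~ W x + b, one example per update step."""
    dim_in, dim_out = len(xs[0]), len(ys[0])
    weight = [[0.0] * dim_in for _ in range(dim_out)]
    bias = [0.0] * dim_out
    for _ in range(num_epochs):
        for x, y in zip(xs, ys):
            pred = predict(weight, bias, x)
            for o in range(dim_out):
                grad = 2.0 * (pred[o] - y[o])  # derivative of the squared error
                for i in range(dim_in):
                    weight[o][i] -= learning_rate * grad * x[i]
                bias[o] -= learning_rate * grad
    return weight, bias


# Toy embedding pairs: the "fine-tuned" space scales the two dimensions by 2 and 3.
xs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
ys = [[2.0, 0.0], [0.0, 3.0], [2.0, 3.0]]
loss_before = mse([[0.0, 0.0], [0.0, 0.0]], [0.0, 0.0], xs, ys)
weight, bias = train_linear_map(xs, ys, num_epochs=500, learning_rate=0.05)
loss_after = mse(weight, bias, xs, ys)
```

After training, loss_after is far below loss_before, which shows the learned map approximating the transformation between the two embedding spaces.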
- train_with_models(dataset: Dataset, base_model: PreTrainedModel, tuned_model: PreTrainedModel, tokenizer: PreTrainedTokenizer, architecture: str | dict[str, str | tuple[str]] | None = 'linear', model_output_dir: str | None = None, embeddings_output_filepath: str | None = None, num_epochs: int = 10, train_batch_size: int = 32, embeddings_batch_size: int = 128, device_name: str = 'cpu') ESM[source]
Trains an ESM using a dataset, a base language model and a fine-tuned language model. Internally, an EmbeddingDataset is created. Then train_with_embeddings is called and the ESM is fitted to the embedding pairs in the dataset.
- Parameters:
dataset – The dataset used for fine-tuning the language model
base_model – The base language model
tuned_model – The fine-tuned language model
tokenizer – The tokenizer for processing input texts
architecture – The desired architecture of the ESM
model_output_dir – If a directory is specified, the ESM will be saved locally after training
embeddings_output_filepath – If a filepath is specified, the EmbeddingDataset will be saved locally
num_epochs – The number of epochs for training the ESM
train_batch_size – The batch size for training the ESM
embeddings_batch_size – The batch size for creating the EmbeddingDataset
device_name – The device name of the device for computation (e.g. “cpu”, “cuda”)
- Returns:
The resulting ESM
- class hfselect.trainers.Trainer(model: Module | None = None, optimizer: Optimizer | None = None, learning_rate: float = 0.001, weight_decay: float = 0.01, device_name: str = 'cpu')[source]
Bases:
ABC
An abstract trainer class
- property avg_loss
The average loss per training example
- Returns:
The average loss per training example
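An average loss per training example is typically tracked as a running sum divided by the number of examples seen. The sketch below shows that bookkeeping. LossTracker is a hypothetical class, and the assumption that each batch reports a mean loss is not confirmed by the documentation above.

```python
class LossTracker:
    """Hypothetical helper: running average of the loss per training example."""

    def __init__(self) -> None:
        self.total_loss = 0.0
        self.num_examples = 0

    def update(self, batch_mean_loss: float, batch_size: int) -> None:
        # Weight each batch by its size so that smaller batches do not skew the average.
        self.total_loss += batch_mean_loss * batch_size
        self.num_examples += batch_size

    @property
    def avg_loss(self) -> float:
        if self.num_examples == 0:
            return 0.0
        return self.total_loss / self.num_examples


tracker = LossTracker()
tracker.update(batch_mean_loss=0.5, batch_size=32)
tracker.update(batch_mean_loss=1.0, batch_size=16)
```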
hfselect.utils module
- hfselect.utils.fetch_esm_configs(repo_ids: list[str]) list[ESMConfig][source]
Fetches ESMConfigs by their repo IDs. Invalid ESMConfigs are excluded from the results. This can be seen in the logs.
- Parameters:
repo_ids – The HF repo IDs of the ESMs
- Returns:
A list of ESMConfigs
- hfselect.utils.fetch_esms(repo_ids: list[str]) list[ESM][source]
Fetches ESMs by their repo IDs. Invalid ESMs are excluded from the results. This can be seen in the logs.
- Parameters:
repo_ids – The HF repo IDs of the ESMs
- Returns:
A list of ESMs
- hfselect.utils.find_esm_model_infos(model_name: str | None = None, filters: list[str] | None = None) list[ModelInfo][source]
Finds HF ModelInfos for all ESMs specified by the language model name and filters
- Parameters:
model_name – The name of the base language model
filters – Filters for selecting ESMs (see hf_api.list_models)
- Returns:
A list of ESM ModelInfos
- hfselect.utils.find_esm_repo_ids(model_name: str | None = None, filters: list[str] | None = None) list[str][source]
Finds all ESM repo IDs for the specified language model name and filters
- Parameters:
model_name – The name of the base language model
filters – Filters for selecting ESMs (see hf_api.list_models)
- Returns:
A list of ESM repo IDs
- hfselect.utils.get_esm_coverage(filters: list[str] | None = None) dict[str, int][source]
Finds out how many ESMs are available for each base model
- Parameters:
filters – Filters for selecting ESMs (see hf_api.list_models)
- Returns:
A dictionary with base model names as keys and the number of available ESMs for each as values
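The shape of the returned dictionary can be illustrated by counting base model names. The helper count_esms_per_base_model and the input list are hypothetical; the real function queries the HF Hub rather than taking a list.

```python
from collections import Counter
from typing import Dict, List


def count_esms_per_base_model(base_model_names: List[str]) -> Dict[str, int]:
    """Hypothetical helper: map each base model name to its number of ESMs."""
    return dict(Counter(base_model_names))


coverage = count_esms_per_base_model([
    "bert-base-multilingual-uncased",
    "bert-base-multilingual-uncased",
    "roberta-base",
])
```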