elemeta.nlp.extractors.low_level package#

Submodules#

elemeta.nlp.extractors.low_level.abstract_text_metafeature_extractor module#

class elemeta.nlp.extractors.low_level.abstract_text_metafeature_extractor.AbstractTextMetafeatureExtractor(name: str | None = None)#

Bases: ABC

Representation of a metafeature extractor. This class holds the function to be run to extract the metafeature value and the name of the metafeature

Methods

__call__(text)

run self.extract on the given text

extract(text)

extracts the metric from the text

abstract extract(text: str) → Any#

Extracts the metric from the text.

Parameters:

text (str) – the text to extract the metafeature from

Returns:

the metadata extracted from the text

Return type:

Any
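Concrete extractors implement extract and are then directly callable. A minimal standalone sketch of this pattern (it mimics the interface described above; it is not elemeta's actual implementation):

```python
from abc import ABC, abstractmethod
from typing import Any, Optional

class TextMetafeatureExtractor(ABC):
    """Standalone sketch of the extractor pattern described above."""

    def __init__(self, name: Optional[str] = None):
        # the metafeature name defaults to the subclass name
        self.name = name or type(self).__name__

    @abstractmethod
    def extract(self, text: str) -> Any:
        """Extract the metafeature value from the text."""

    def __call__(self, text: str) -> Any:
        # calling the extractor delegates to extract
        return self.extract(text)

class CharCount(TextMetafeatureExtractor):
    """Toy metafeature: number of characters in the text."""

    def extract(self, text: str) -> int:
        return len(text)

print(CharCount()("petrified"))  # 9
print(CharCount().name)          # CharCount
```

Subclasses only provide extract; naming and the callable interface come from the base class.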

elemeta.nlp.extractors.low_level.abstract_text_pair_metafeature_extractor module#

class elemeta.nlp.extractors.low_level.abstract_text_pair_metafeature_extractor.AbstractTextPairMetafeatureExtractor(name: str | None = None)#

Bases: AbstractPairMetafeatureExtractor

This class holds the function to be run on a pair of texts to extract the metafeature value, and the name of the metafeature

Methods

__call__(input_1, input_2)

run self.extract on the given inputs

extract(input_1, input_2)

extracts the metric from the pair of texts

abstract extract(input_1: str, input_2: str) → Any#

Extracts the metric from the pair of texts.

Parameters:

  • input_1 (str) – the first text

  • input_2 (str) – the second text

Returns:

the metadata extracted from the inputs

Return type:

Any
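The pair variant follows the same pattern with two inputs. A minimal standalone sketch (again a mimic of the interface, not elemeta's code):

```python
from abc import ABC, abstractmethod
from typing import Any, Optional

class TextPairMetafeatureExtractor(ABC):
    """Standalone sketch of the pair-extractor pattern."""

    def __init__(self, name: Optional[str] = None):
        self.name = name or type(self).__name__

    @abstractmethod
    def extract(self, input_1: str, input_2: str) -> Any:
        """Extract the metafeature value from the pair of texts."""

    def __call__(self, input_1: str, input_2: str) -> Any:
        return self.extract(input_1, input_2)

class LengthDiff(TextPairMetafeatureExtractor):
    """Toy pair metafeature: absolute difference in character length."""

    def extract(self, input_1: str, input_2: str) -> int:
        return abs(len(input_1) - len(input_2))

print(LengthDiff()("short", "a longer text"))  # 8
```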

elemeta.nlp.extractors.low_level.avg_token_length module#

class elemeta.nlp.extractors.low_level.avg_token_length.AvgTokenLength(tokenizer: Callable[[str], List[str]], tokens_to_exclude: Set[str] | None = None, name: str | None = None)#

Bases: AbstractTextMetafeatureExtractor

Implementation of AbstractTextMetafeatureExtractor class that returns the average token length

Example

>>> from elemeta.nlp.extractors.low_level.avg_token_length import AvgTokenLength
>>> from nltk import word_tokenize
>>> text = "Once I was afraid, I was petrified"
>>> avg_token_length = AvgTokenLength(word_tokenize)
>>> result = avg_token_length(text)
>>> print(result)  # Output: 3.5

Methods

__call__(text)

run self.extract on the given text

extract(text)

returns the average token length in the text

extract(text: str) → float#

returns the average token length in the text

Parameters:

text (str) – the string to run on

Returns:

the average token length in the text

Return type:

float
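The computation can be sketched in plain Python. Here str.split stands in for nltk's word_tokenize (so the comma stays attached to "afraid" and the average differs from the 3.5 in the example above), and the handling of tokens_to_exclude is an assumption:

```python
from typing import Callable, List, Optional, Set

def avg_token_length(
    text: str,
    tokenizer: Callable[[str], List[str]],
    tokens_to_exclude: Optional[Set[str]] = None,
) -> float:
    # tokenize, drop excluded tokens, then average the remaining lengths
    exclude = tokens_to_exclude or set()
    tokens = [t for t in tokenizer(text) if t not in exclude]
    return sum(len(t) for t in tokens) / len(tokens)

text = "Once I was afraid, I was petrified"
print(avg_token_length(text, str.split))         # 4.0
print(avg_token_length(text, str.split, {"I"}))  # 5.2
```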

elemeta.nlp.extractors.low_level.hinted_profanity_token_count module#

class elemeta.nlp.extractors.low_level.hinted_profanity_token_count.HintedProfanityTokensCount(tokenizer: Callable[[str], List[str]], name: str | None = None)#

Bases: AbstractTextMetafeatureExtractor

Implementation of AbstractTextMetafeatureExtractor class that counts the number of profanity words

Example

>>> from elemeta.nlp.extractors.low_level.hinted_profanity_token_count import HintedProfanityTokensCount
>>> from nltk import word_tokenize
>>> text = "Once I was afraid, I was petrified"
>>> profanity_token_counter = HintedProfanityTokensCount(word_tokenize)
>>> result = profanity_token_counter(text)
>>> print(result)  # Output: 0

Methods

__call__(text)

run self.extract on the given text

extract(text)

return the number of profanity words in the text

extract(text: str) → int#

return the number of profanity words in the text

Parameters:

text (str) – the string to run on

Returns:

the number of profanity words in the text

Return type:

int

elemeta.nlp.extractors.low_level.must_appear_tokens_parentage module#

class elemeta.nlp.extractors.low_level.must_appear_tokens_parentage.MustAppearTokensPercentage(tokenizer: Callable[[str], List[str]], must_appear: Set[str], name: str | None = None)#

Bases: AbstractTextMetafeatureExtractor

Implementation of AbstractTextMetafeatureExtractor class that returns the ratio between the number of tokens from a given must-appear set that occur in the text and the size of that set

Example

>>> from elemeta.nlp.extractors.low_level.must_appear_tokens_parentage import MustAppearTokensPercentage
>>> from nltk import word_tokenize
>>> text = "Once I was afraid, I was petrified"
>>> must_appear_tokens_percentage = MustAppearTokensPercentage(word_tokenize, {"I","yes"})
>>> result = must_appear_tokens_percentage(text)
>>> print(result)  # Output: 0.5

Methods

__call__(text)

run self.extract on the given text

extract(text)

gives the percentage of the tokens in the must_appear set that appeared in the text

extract(text: str) → float#

gives the percentage of the tokens in the must_appear set that appeared in the text

Parameters:

text (str) – the text to check appearance in

Returns:

the ratio between the number of must-appear tokens found in the text and the size of the must_appear set

Return type:

float
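Consistent with the example above (1 of the 2 must-appear tokens found → 0.5), the ratio is presumably computed over the must_appear set. A plain-Python sketch, using str.split in place of word_tokenize:

```python
from typing import Callable, List, Set

def must_appear_percentage(
    text: str, tokenizer: Callable[[str], List[str]], must_appear: Set[str]
) -> float:
    # fraction of the must_appear set found among the text's tokens
    tokens = set(tokenizer(text))
    return len(must_appear & tokens) / len(must_appear)

text = "Once I was afraid, I was petrified"
print(must_appear_percentage(text, str.split, {"I", "yes"}))  # 0.5
```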

elemeta.nlp.extractors.low_level.regex_token_matches_count module#

class elemeta.nlp.extractors.low_level.regex_token_matches_count.TokenRegexMatchesCount(tokenizer: Callable[[str], List[str]], regex: str = '.*', name: str | None = None)#

Bases: AbstractTextMetafeatureExtractor

Return number of tokens that match the given regex.

Example

>>> from elemeta.nlp.extractors.low_level.regex_token_matches_count import TokenRegexMatchesCount
>>> from nltk import word_tokenize
>>> text = "he hee is"
>>> regex = "h.+"
>>> token_regex_matches_counter = TokenRegexMatchesCount(word_tokenize, regex=regex)
>>> result = token_regex_matches_counter(text)
>>> print(result)  # Output: 2

Methods

__call__(text)

run self.extract on the given text

extract(text)

return the number of matches of the given regex in the text

validator(token)

checks whether the token matches the regex

extract(text: str) → int#

return the number of matches of the given regex in the text

Parameters:

text (str) – the string to run on

Returns:

the number of tokens in the text that match the given regex

Return type:

int

validator(token: str) → bool#

checks whether the token matches the regex

Parameters:

token (str) – the token to check against the regex

Returns:

True if the token matches the regex, False otherwise

Return type:

bool

elemeta.nlp.extractors.low_level.semantic_embedding_pair_similarity module#

class elemeta.nlp.extractors.low_level.semantic_embedding_pair_similarity.SemanticEmbeddingPairSimilarity(name: str | None = None)#

Bases: AbstractPairMetafeatureExtractor

Calculates the semantic embedding pair similarity between two input tensors.

Parameters:#

  • input_1 (Tensor) – The first input tensor.

  • input_2 (Tensor) – The second input tensor.

Returns:#

Tensor: The semantic embedding pair similarity between the two input tensors.

Examples:#

>>> import torch
>>> from elemeta.nlp.extractors.low_level.semantic_embedding_pair_similarity import SemanticEmbeddingPairSimilarity
>>> input_1 = torch.tensor([1, 2, 3], dtype=torch.float)
>>> input_2 = torch.tensor([4, 5, 6], dtype=torch.float)
>>> extractor = SemanticEmbeddingPairSimilarity()
>>> similarity = extractor(input_1, input_2)
>>> print(similarity) #Output: tensor([[0.9746]])

Methods

__call__(input_1, input_2)

run self.extract on the given inputs

extract(input_1, input_2)

computes the similarity between the two input tensors

extract(input_1: Tensor, input_2: Tensor) → Tensor#

Computes the similarity between the two input tensors.

Parameters:

  • input_1 (Tensor) – the first input tensor

  • input_2 (Tensor) – the second input tensor

Returns:

the similarity between the two input tensors

Return type:

Tensor
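The tensor([[0.9746]]) in the example matches plain cosine similarity between the two vectors, which can be checked without torch or elemeta (that the extractor uses cosine similarity is an assumption inferred from that output):

```python
import math
from typing import Sequence

def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    # dot product divided by the product of the vector norms
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

print(round(cosine_similarity([1, 2, 3], [4, 5, 6]), 4))  # 0.9746
```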

elemeta.nlp.extractors.low_level.semantic_text_to_group_similarity module#

class elemeta.nlp.extractors.low_level.semantic_text_to_group_similarity.SemanticTextToGroupSimilarity(group: List[str], embedding_model: str | None = None, modules: Iterable[Module] | None = None, device: str | None = None, cache_folder: str | None = None, use_auth_token: bool | str | None = None, name: str | None = None)#

Bases: AbstractTextMetafeatureExtractor

Extracts the similarity between a text and a group of texts.

Parameters:
  • group (List[str]) – Group of strings to compare to.

  • embedding_model (Optional[str]) – The name of the SentenceTransformer model to use, by default “all-MiniLM-L6-v2”.

  • modules (Optional[Iterable[nn.Module]]) – This parameter can be used to create custom SentenceTransformer models from scratch.

  • device (Optional[str]) – Device (like ‘cuda’ / ‘cpu’) that should be used for computation. If None, checks if a GPU can be used.

  • cache_folder (Optional[str]) – Path to store models.

  • use_auth_token (Union[bool, str, None]) – HuggingFace authentication token to download private models.

  • name (Optional[str]) – Name of the extractor.

Examples

>>> from elemeta.nlp.extractors.low_level.semantic_text_to_group_similarity import SemanticTextToGroupSimilarity
>>> group = ["apple", "banana", "orange"]
>>> extractor = SemanticTextToGroupSimilarity(group)
>>> text = "apple"
>>> similarity = extractor.extract(text)
>>> print(similarity) #Output: 1.000000238418579

Methods

__call__(text)

run self.extract on the given text

extract(input)

Extracts the similarity between a text and a group of texts.

extract(input: str) → float#

Extracts the similarity between a text and a group of texts.

Parameters:

input (str) – Text to compare to the group.

Returns:

Maximum similarity between the input text and the group.

Return type:

float

elemeta.nlp.extractors.low_level.tokens_count module#

class elemeta.nlp.extractors.low_level.tokens_count.TokensCount(tokenizer: Callable[[str], List[str]], exclude_tokens_list: Set[str] | None = None, include_tokens_list: Set[str] | None = None, name: str | None = None)#

Bases: AbstractTextMetafeatureExtractor

Return the number of tokens in the text.

Example

>>> from elemeta.nlp.extractors.low_level.tokens_count import TokensCount
>>> from nltk import word_tokenize
>>> text = "Once I was afraid, I was petrified"
>>> tokens_count = TokensCount(word_tokenize)
>>> result = tokens_count(text)
>>> print(result)  # Output: 8

Methods

__call__(text)

run self.extract on the given text

extract(text)

counts the number of tokens in the text

extract(text: str) → int#

counts the number of tokens in the text

Parameters:

text (str) – the text to count tokens in

Returns:

the number of tokens in the text

Return type:

int
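A plain-Python sketch of the counting. str.split stands in for word_tokenize (so the comma is not a separate token and the count differs from the 8 in the example above), and how exclude_tokens_list and include_tokens_list combine is an assumption:

```python
from typing import Callable, List, Optional, Set

def tokens_count(
    text: str,
    tokenizer: Callable[[str], List[str]],
    exclude_tokens_list: Optional[Set[str]] = None,
    include_tokens_list: Optional[Set[str]] = None,
) -> int:
    tokens = tokenizer(text)
    if exclude_tokens_list:
        # drop tokens on the exclude list
        tokens = [t for t in tokens if t not in exclude_tokens_list]
    if include_tokens_list:
        # keep only tokens on the include list
        tokens = [t for t in tokens if t in include_tokens_list]
    return len(tokens)

text = "Once I was afraid, I was petrified"
print(tokens_count(text, str.split))                # 7
print(tokens_count(text, str.split, {"I", "was"}))  # 3
```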

elemeta.nlp.extractors.low_level.unique_token_count module#

class elemeta.nlp.extractors.low_level.unique_token_count.UniqueTokenCount(tokenizer: Callable[[str], List[str]], exclude_tokens_list: Set[str] | None = None, include_tokens_list: Set[str] | None = None, name: str | None = None)#

Bases: AbstractTextMetafeatureExtractor

Returns the number of unique tokens in the text.

Example

>>> from elemeta.nlp.extractors.low_level.unique_token_count import UniqueTokenCount
>>> from nltk import word_tokenize
>>> text = "Once I was afraid, I was petrified"
>>> unique_token_count = UniqueTokenCount(word_tokenize)
>>> result = unique_token_count(text)
>>> print(result)  # Output: 4

Methods

__call__(text)

run self.extract on the given text

extract(text)

counts the number of unique tokens in the text

extract(text: str) → int#

counts the number of unique tokens in the text

Parameters:

text (str) – the text to count unique tokens in

Returns:

the number of unique tokens in the text

Return type:

int

elemeta.nlp.extractors.low_level.unique_token_ratio module#

class elemeta.nlp.extractors.low_level.unique_token_ratio.UniqueTokensRatio(tokenizer: Callable[[str], List[str]], exceptions: Set[str], name: str | None = None)#

Bases: AbstractTextMetafeatureExtractor

Return the ratio between the number of tokens that appear exactly once and the number of distinct tokens

Example

>>> from elemeta.nlp.extractors.low_level.unique_token_ratio import UniqueTokensRatio
>>> from nltk import word_tokenize
>>> text = "Once I was afraid, I was petrified"
>>> unique_tokens_ratio = UniqueTokensRatio(word_tokenize,exceptions={"was"})
>>> result = unique_tokens_ratio(text)
>>> print(result)  # Output: 0.8

Methods

__call__(text)

run self.extract on the given text

extract(text)

returns the ratio of tokens that appear exactly once to all distinct tokens

extract(text: str) → float#

Returns the ratio len(set(tokens that appear once)) / len(set(tokens)), computed over the tokens that are not listed in exceptions.

Parameters:

text (str) – the text we want to find the unique-tokens ratio on

Returns:

the ratio between len(set(tokens that appear once)) and len(set(tokens))

Return type:

float
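The formula len(set(tokens that appear once)) / len(set(tokens)) reproduces the 0.8 from the example. A plain-Python sketch, with the token list hard-coded to what nltk's word_tokenize produces for that sentence:

```python
from collections import Counter
from typing import List, Set

def unique_tokens_ratio(tokens: List[str], exceptions: Set[str]) -> float:
    # count occurrences of each token that is not an exception
    counts = Counter(t for t in tokens if t not in exceptions)
    appear_once = sum(1 for c in counts.values() if c == 1)
    return appear_once / len(counts)

# word_tokenize("Once I was afraid, I was petrified") yields:
tokens = ["Once", "I", "was", "afraid", ",", "I", "was", "petrified"]
# excluding "was": 4 of the 5 distinct tokens appear exactly once
print(unique_tokens_ratio(tokens, exceptions={"was"}))  # 0.8
```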

Module contents#