elemeta.nlp.extractors.low_level package#
Submodules#
elemeta.nlp.extractors.low_level.abstract_text_metafeature_extractor module#
- class elemeta.nlp.extractors.low_level.abstract_text_metafeature_extractor.AbstractTextMetafeatureExtractor(name: str | None = None)#
Bases:
ABC
Representation of a MetafeatureExtractor. This class holds the function to be run to extract the metafeature value and the name of the metafeature.
Methods
__call__(text) – run self.extract on the given text
extract(text) – extract the metric from the given text
- abstract extract(text: str) → Any#
Extract the metric from the given text.
- Parameters:
text (str) – the string to run on
- Returns:
the metadata extracted from text
- Return type:
Any
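The contract above can be sketched in standalone form. The class below only mirrors the documented interface (the real base class lives in elemeta, so the default-name behavior and the helper class here are assumptions for illustration):

```python
from abc import ABC, abstractmethod
from typing import Any, Optional

# Minimal standalone sketch of the AbstractTextMetafeatureExtractor contract;
# the real base class lives in elemeta, so details here are assumed.
class TextMetafeatureExtractorSketch(ABC):
    def __init__(self, name: Optional[str] = None):
        # Assumed behavior: fall back to the class name when no name is given
        self.name = name or type(self).__name__

    def __call__(self, text: str) -> Any:
        # __call__ simply delegates to extract, as documented above
        return self.extract(text)

    @abstractmethod
    def extract(self, text: str) -> Any:
        """Extract the metafeature value from the text."""

# A toy concrete extractor: the character count of the text
class CharCount(TextMetafeatureExtractorSketch):
    def extract(self, text: str) -> int:
        return len(text)

counter = CharCount()
print(counter("hello"))  # 5
print(counter.name)      # CharCount
```

Concrete extractors such as AvgTokenLength below follow this pattern: implement extract, and callers invoke the instance directly.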
elemeta.nlp.extractors.low_level.abstract_text_pair_metafeature_extractor module#
- class elemeta.nlp.extractors.low_level.abstract_text_pair_metafeature_extractor.AbstractTextPairMetafeatureExtractor(name: str | None = None)#
Bases:
AbstractPairMetafeatureExtractor
This class holds the function to be run to extract the metadata value and the name of the metadata.
Methods
__call__(input_1, input_2) – run self.extract on the given inputs
extract(input_1, input_2) – extract the metric from the given input pair
- abstract extract(input_1: str, input_2: str) → Any#
Extract the metric from the given pair of texts.
- Parameters:
input_1 (str) – the first string to run on
input_2 (str) – the second string to run on
- Returns:
the metadata extracted from the input pair
- Return type:
Any
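As with the single-text base class, the pair contract can be sketched in standalone form; the sketch below mirrors the documented interface, and both classes in it are illustrative assumptions rather than elemeta code:

```python
from abc import ABC, abstractmethod
from typing import Any, Optional

# Standalone sketch of the pair-extractor contract; the real base class
# lives in elemeta, so names and defaults here are assumed.
class TextPairMetafeatureExtractorSketch(ABC):
    def __init__(self, name: Optional[str] = None):
        self.name = name or type(self).__name__

    def __call__(self, input_1: str, input_2: str) -> Any:
        # __call__ delegates to extract, as documented above
        return self.extract(input_1, input_2)

    @abstractmethod
    def extract(self, input_1: str, input_2: str) -> Any:
        """Extract the metafeature value from the pair of texts."""

# A toy pair metric: how many whitespace-separated tokens the texts share
class SharedTokenCount(TextPairMetafeatureExtractorSketch):
    def extract(self, input_1: str, input_2: str) -> int:
        return len(set(input_1.split()) & set(input_2.split()))

extractor = SharedTokenCount()
print(extractor("red green blue", "green blue yellow"))  # 2
```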
elemeta.nlp.extractors.low_level.avg_token_length module#
- class elemeta.nlp.extractors.low_level.avg_token_length.AvgTokenLength(tokenizer: Callable[[str], List[str]], tokens_to_exclude: Set[str] | None = None, name: str | None = None)#
Bases:
AbstractTextMetafeatureExtractor
Implementation of AbstractTextMetafeatureExtractor that returns the average token length in the text.
Example
>>> from elemeta.nlp.extractors.low_level.avg_token_length import AvgTokenLength
>>> from nltk import word_tokenize
>>> text = "Once I was afraid, I was petrified"
>>> avg_token_length = AvgTokenLength(word_tokenize)
>>> result = avg_token_length(text)
>>> print(result)  # Output: 3.5
Methods
__call__(text) – run self.extract on the given text
extract(text) – return the average token length in the text
- extract(text: str) → float#
Return the average token length in the text.
- Parameters:
text (str) – the string to run on
- Returns:
the average token length in the text
- Return type:
float
elemeta.nlp.extractors.low_level.hinted_profanity_token_count module#
- class elemeta.nlp.extractors.low_level.hinted_profanity_token_count.HintedProfanityTokensCount(tokenizer: Callable[[str], List[str]], name: str | None = None)#
Bases:
AbstractTextMetafeatureExtractor
Implementation of AbstractTextMetafeatureExtractor that counts the number of profanity words in the text.
Example
>>> from elemeta.nlp.extractors.low_level.hinted_profanity_token_count import HintedProfanityTokensCount
>>> from nltk import word_tokenize
>>> text = "Once I was afraid, I was petrified"
>>> profanity_token_counter = HintedProfanityTokensCount(word_tokenize)
>>> result = profanity_token_counter(text)
>>> print(result)  # Output: 0
Methods
__call__(text) – run self.extract on the given text
extract(text) – return the number of profanity words in the text
- extract(text: str) → int#
Return the number of profanity words in the text.
- Parameters:
text (str) – the string to run on
- Returns:
the number of profanity words in the text
- Return type:
int
elemeta.nlp.extractors.low_level.must_appear_tokens_parentage module#
- class elemeta.nlp.extractors.low_level.must_appear_tokens_parentage.MustAppearTokensPercentage(tokenizer: Callable[[str], List[str]], must_appear: Set[str], name: str | None = None)#
Bases:
AbstractTextMetafeatureExtractor
Implementation of AbstractTextMetafeatureExtractor that returns the percentage of tokens from the given must-appear set that appear in the text.
Example
>>> from elemeta.nlp.extractors.low_level.must_appear_tokens_parentage import MustAppearTokensPercentage
>>> from nltk import word_tokenize
>>> text = "Once I was afraid, I was petrified"
>>> must_appear_tokens_percentage = MustAppearTokensPercentage(word_tokenize, {"I","yes"})
>>> result = must_appear_tokens_percentage(text)
>>> print(result)  # Output: 0.5
Methods
__call__(text) – run self.extract on the given text
extract(text) – give the percentage of tokens in the must_appear set that appear in the text
- extract(text: str) → float#
Give the percentage of the tokens in the must_appear set that appear in the text.
- Parameters:
text (str) – the text to check appearance on
- Returns:
the fraction of tokens in the must_appear set that appear in the text
- Return type:
float
elemeta.nlp.extractors.low_level.regex_token_matches_count module#
- class elemeta.nlp.extractors.low_level.regex_token_matches_count.TokenRegexMatchesCount(tokenizer: Callable[[str], List[str]], regex: str = '.*', name: str | None = None)#
Bases:
AbstractTextMetafeatureExtractor
Return the number of tokens that match the given regex.
Example
>>> from elemeta.nlp.extractors.low_level.regex_token_matches_count import TokenRegexMatchesCount
>>> from nltk import word_tokenize
>>> text = "he hee is"
>>> regex = "h.+"
>>> token_regex_matches_counter = TokenRegexMatchesCount(word_tokenize, regex=regex)
>>> result = token_regex_matches_counter(text)
>>> print(result)  # Output: 2
Methods
__call__(text) – run self.extract on the given text
extract(text) – return the number of tokens in the text that match the regex
validator(token) – check whether the token matches the regex
- extract(text: str) → int#
Return the number of tokens in the text that match the regex.
- Parameters:
text (str) – the string to run on
- Returns:
the number of tokens in the text that match the regex
- Return type:
int
- validator(token: str) → bool#
Check whether the given token matches the regex.
- Parameters:
token (str) – the token to check against the regex
- Returns:
True if the token matches the regex, False otherwise
- Return type:
bool
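The relationship between validator and extract can be illustrated with a standalone sketch. Note the real TokenRegexMatchesCount may use a different matching strategy; the helper below and its use of re.fullmatch are assumptions for illustration:

```python
import re

# Standalone sketch of the validator/extract relationship described above;
# the real TokenRegexMatchesCount lives in elemeta and may match tokens
# differently, so re.fullmatch here is an assumption.
def validator(token: str, regex: str = "h.+") -> bool:
    # A token "abides by" the regex when the whole token matches it
    return re.fullmatch(regex, token) is not None

# extract then reduces to counting the tokens the validator accepts
tokens = ["he", "hee", "is"]
print(sum(validator(t) for t in tokens))  # 2
```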
elemeta.nlp.extractors.low_level.semantic_embedding_pair_similarity module#
- class elemeta.nlp.extractors.low_level.semantic_embedding_pair_similarity.SemanticEmbeddingPairSimilarity(name: str | None = None)#
Bases:
AbstractPairMetafeatureExtractor
Calculates the semantic embedding similarity between two input tensors.
Parameters:#
input_1 (Tensor) – the first input tensor
input_2 (Tensor) – the second input tensor
Returns:#
Tensor: The semantic embedding pair similarity between the two input tensors.
Examples:#
>>> import torch
>>> from elemeta.nlp.extractors.low_level.semantic_embedding_pair_similarity import SemanticEmbeddingPairSimilarity
>>> input_1 = torch.tensor([1, 2, 3], dtype=torch.float)
>>> input_2 = torch.tensor([4, 5, 6], dtype=torch.float)
>>> extractor = SemanticEmbeddingPairSimilarity()
>>> similarity = extractor(input_1, input_2)
>>> print(similarity)  # Output: tensor([[0.9746]])
Methods
__call__(input_1, input_2) – run self.extract on the given inputs
extract(input_1, input_2) – extract the similarity between the two input tensors
- extract(input_1: Tensor, input_2: Tensor) → Tensor#
Extract the similarity between the two input tensors.
- Parameters:
input_1 (Tensor) – the first input tensor
input_2 (Tensor) – the second input tensor
- Returns:
the similarity between the two input tensors
- Return type:
Tensor
elemeta.nlp.extractors.low_level.semantic_text_to_group_similarity module#
- class elemeta.nlp.extractors.low_level.semantic_text_to_group_similarity.SemanticTextToGroupSimilarity(group: List[str], embedding_model: str | None = None, modules: Iterable[Module] | None = None, device: str | None = None, cache_folder: str | None = None, use_auth_token: bool | str | None = None, name: str | None = None)#
Bases:
AbstractTextMetafeatureExtractor
Extracts the similarity between a text and a group of texts.
- Parameters:
group (List[str]) – Group of strings to compare to.
embedding_model (Optional[str]) – The name of the SentenceTransformer model to use, by default “all-MiniLM-L6-v2”.
modules (Optional[Iterable[nn.Module]]) – This parameter can be used to create custom SentenceTransformer models from scratch.
device (Optional[str]) – Device (like ‘cuda’ / ‘cpu’) that should be used for computation. If None, checks if a GPU can be used.
cache_folder (Optional[str]) – Path to store models.
use_auth_token (Union[bool, str, None]) – HuggingFace authentication token to download private models.
name (Optional[str]) – Name of the extractor.
Examples
>>> from elemeta.nlp.extractors.low_level.semantic_text_to_group_similarity import SemanticTextToGroupSimilarity
>>> group = ["apple", "banana", "orange"]
>>> extractor = SemanticTextToGroupSimilarity(group)
>>> text = "apple"
>>> similarity = extractor.extract(text)
>>> print(similarity)  # Output: 1.000000238418579
Methods
__call__(text) – run self.extract on the given text
extract(input) – extract the similarity between a text and the group of texts
- extract(input: str) → float#
Extracts the similarity between a text and a group of texts.
- Parameters:
input (str) – Text to compare to the group.
- Returns:
Maximum similarity between the input text and the group.
- Return type:
float
elemeta.nlp.extractors.low_level.tokens_count module#
- class elemeta.nlp.extractors.low_level.tokens_count.TokensCount(tokenizer: Callable[[str], List[str]], exclude_tokens_list: Set[str] | None = None, include_tokens_list: Set[str] | None = None, name: str | None = None)#
Bases:
AbstractTextMetafeatureExtractor
Return the number of tokens in the text.
Example
>>> from elemeta.nlp.extractors.low_level.tokens_count import TokensCount
>>> from nltk import word_tokenize
>>> text = "Once I was afraid, I was petrified"
>>> tokens_count = TokensCount(word_tokenize)
>>> result = tokens_count(text)
>>> print(result)  # Output: 8
Methods
__call__(text) – run self.extract on the given text
extract(text) – count the number of tokens in the text
- extract(text: str) → int#
Count the number of tokens in the text.
- Parameters:
text (str) – the text to count tokens in
- Returns:
the number of tokens in the text
- Return type:
int
elemeta.nlp.extractors.low_level.unique_token_count module#
- class elemeta.nlp.extractors.low_level.unique_token_count.UniqueTokenCount(tokenizer: Callable[[str], List[str]], exclude_tokens_list: Set[str] | None = None, include_tokens_list: Set[str] | None = None, name: str | None = None)#
Bases:
AbstractTextMetafeatureExtractor
Returns the number of unique tokens in the text.
Example
>>> from elemeta.nlp.extractors.low_level.unique_token_count import UniqueTokenCount
>>> from nltk import word_tokenize
>>> text = "Once I was afraid, I was petrified"
>>> unique_token_count = UniqueTokenCount(word_tokenize)
>>> result = unique_token_count(text)
>>> print(result)  # Output: 4
Methods
__call__(text) – run self.extract on the given text
extract(text) – count the number of unique tokens in the text
- extract(text: str) → int#
Count the number of unique tokens in the text.
- Parameters:
text (str) – the text to count unique tokens in
- Returns:
the number of unique tokens in the text
- Return type:
int
elemeta.nlp.extractors.low_level.unique_token_ratio module#
- class elemeta.nlp.extractors.low_level.unique_token_ratio.UniqueTokensRatio(tokenizer: Callable[[str], List[str]], exceptions: Set[str], name: str | None = None)#
Bases:
AbstractTextMetafeatureExtractor
Return the ratio between the number of unique tokens and the number of all tokens in the text.
Example
>>> from elemeta.nlp.extractors.low_level.unique_token_ratio import UniqueTokensRatio
>>> from nltk import word_tokenize
>>> text = "Once I was afraid, I was petrified"
>>> unique_tokens_ratio = UniqueTokensRatio(word_tokenize, exceptions={"was"})
>>> result = unique_tokens_ratio(text)
>>> print(result)  # Output: 0.8
Methods
__call__(text) – run self.extract on the given text
extract(text) – return the unique-token ratio of the text
- extract(text: str) → float#
Return the unique-token ratio of the text, computed only over tokens that are not in the exceptions set.
- Parameters:
text (str) – the text we want to find unique words ratio on
- Returns:
the ratio between len(set(tokens that appear once)) and len(set(tokens))
- Return type:
float