elemeta.nlp.extractors.low_level package#
Submodules#
elemeta.nlp.extractors.low_level.abstract_text_metafeature_extractor module#
- class elemeta.nlp.extractors.low_level.abstract_text_metafeature_extractor.AbstractTextMetafeatureExtractor(name: str | None = None)#
Bases:
ABC
Representation of a metafeature extractor. This class holds the function used to extract the metafeature value and the name of the metafeature.
Methods
- __call__(text): Runs self.extract on the given text.
- extract(text): Extracts the metric from the text.
- abstract extract(text: str) → Any #
Extracts the metric from the given text
- Parameters:
text (str) – the text to extract the metric from
- Returns:
the metadata extracted from text
- Return type:
Any
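Concrete extractors subclass this interface and implement extract. As a hypothetical illustration of the pattern (a minimal sketch, not elemeta's actual base class — the class and extractor names here are invented), a character-count extractor could look like this:

```python
from abc import ABC, abstractmethod
from typing import Any, Optional

# Minimal sketch of the text-metafeature-extractor pattern: subclasses
# implement extract(), and __call__ delegates to it so instances can be
# used as plain callables.
class TextMetafeatureExtractor(ABC):
    def __init__(self, name: Optional[str] = None):
        # Defaulting the metafeature name to the class name is an
        # assumption made for this sketch.
        self.name = name or type(self).__name__

    @abstractmethod
    def extract(self, text: str) -> Any:
        ...

    def __call__(self, text: str) -> Any:
        return self.extract(text)

class CharCount(TextMetafeatureExtractor):
    """Hypothetical extractor: counts characters in the text."""
    def extract(self, text: str) -> int:
        return len(text)

print(CharCount()("hello"))  # 5
```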
elemeta.nlp.extractors.low_level.abstract_text_pair_metafeature_extractor module#
- class elemeta.nlp.extractors.low_level.abstract_text_pair_metafeature_extractor.AbstractTextPairMetafeatureExtractor(name: str | None = None)#
Bases:
AbstractPairMetafeatureExtractor
This class holds the function used to extract the metadata value from a pair of inputs and the name of the metadata.
Methods
- __call__(input_1, input_2): Runs self.extract on the given inputs.
- extract(input_1, input_2): Extracts the metric from the input pair.
- abstract extract(input_1: str, input_2: str) → Any #
Extracts the metric from the given pair of texts
- Parameters:
input_1 (str) – the first text
input_2 (str) – the second text
- Returns:
the metadata extracted from the inputs
- Return type:
Any
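The pair-extractor pattern is analogous: implement extract over two inputs. A hypothetical sketch (the names here are invented for illustration, not elemeta's code) that measures the absolute difference in character length between two strings:

```python
from abc import ABC, abstractmethod
from typing import Any, Optional

# Minimal sketch of the pair-extractor pattern; __call__ forwards both
# inputs to extract().
class TextPairMetafeatureExtractor(ABC):
    def __init__(self, name: Optional[str] = None):
        self.name = name or type(self).__name__

    @abstractmethod
    def extract(self, input_1: str, input_2: str) -> Any:
        ...

    def __call__(self, input_1: str, input_2: str) -> Any:
        return self.extract(input_1, input_2)

class LengthDifference(TextPairMetafeatureExtractor):
    """Hypothetical extractor: absolute difference in character length."""
    def extract(self, input_1: str, input_2: str) -> int:
        return abs(len(input_1) - len(input_2))

print(LengthDifference()("short", "longer text"))  # 6
```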
elemeta.nlp.extractors.low_level.avg_token_length module#
- class elemeta.nlp.extractors.low_level.avg_token_length.AvgTokenLength(tokenizer: Callable[[str], List[str]], tokens_to_exclude: Set[str] | None = None, name: str | None = None)#
Bases:
AbstractTextMetafeatureExtractor
Implementation of AbstractTextMetafeatureExtractor that returns the average token length.
Example
>>> from elemeta.nlp.extractors.low_level.avg_token_length import AvgTokenLength
>>> from nltk import word_tokenize
>>> text = "Once I was afraid, I was petrified"
>>> avg_token_length = AvgTokenLength(word_tokenize)
>>> result = avg_token_length(text)
>>> print(result)  # Output: 3.5
Methods
- __call__(text): Runs self.extract on the given text.
- extract(text): Returns the average token length in the text.
- extract(text: str) → float #
Returns the average token length in the text
- Parameters:
text (str) – the string to run on
- Returns:
the average token length in the text
- Return type:
float
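The computation behind the example above can be sketched without elemeta. This is an illustrative reimplementation (assuming tokens_to_exclude simply filters tokens before averaging), not the library's code:

```python
from typing import List, Set

def avg_token_length(tokens: List[str], tokens_to_exclude: Set[str] = frozenset()) -> float:
    # Filter out excluded tokens, then average the character lengths of
    # the remaining tokens.
    kept = [t for t in tokens if t not in tokens_to_exclude]
    return sum(len(t) for t in kept) / len(kept)

# nltk.word_tokenize("Once I was afraid, I was petrified") yields these tokens:
tokens = ["Once", "I", "was", "afraid", ",", "I", "was", "petrified"]
print(avg_token_length(tokens))  # 3.5
```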
elemeta.nlp.extractors.low_level.hinted_profanity_token_count module#
- class elemeta.nlp.extractors.low_level.hinted_profanity_token_count.HintedProfanityTokensCount(tokenizer: Callable[[str], List[str]], name: str | None = None)#
Bases:
AbstractTextMetafeatureExtractor
Implementation of AbstractTextMetafeatureExtractor that counts the number of profanity words.
Example
>>> from elemeta.nlp.extractors.low_level.hinted_profanity_token_count import HintedProfanityTokensCount
>>> from nltk import word_tokenize
>>> text = "Once I was afraid, I was petrified"
>>> profanity_token_counter = HintedProfanityTokensCount(word_tokenize)
>>> result = profanity_token_counter(text)
>>> print(result)  # Output: 0
Methods
- __call__(text): Runs self.extract on the given text.
- extract(text): Returns the number of profanity words in the text.
- extract(text: str) → int #
Returns the number of profanity words in the text
- Parameters:
text (str) – the string to run on
- Returns:
the number of profanity words in the text
- Return type:
int
elemeta.nlp.extractors.low_level.must_appear_tokens_parentage module#
- class elemeta.nlp.extractors.low_level.must_appear_tokens_parentage.MustAppearTokensPercentage(tokenizer: Callable[[str], List[str]], must_appear: Set[str], name: str | None = None)#
Bases:
AbstractTextMetafeatureExtractor
Implementation of AbstractTextMetafeatureExtractor that returns the ratio between the number of tokens from the given token set that appear in the text and the total number of tokens in the set.
Example
>>> from elemeta.nlp.extractors.low_level.must_appear_tokens_parentage import MustAppearTokensPercentage
>>> from nltk import word_tokenize
>>> text = "Once I was afraid, I was petrified"
>>> must_appear_tokens_percentage = MustAppearTokensPercentage(word_tokenize, {"I", "yes"})
>>> result = must_appear_tokens_percentage(text)
>>> print(result)  # Output: 0.5
Methods
- __call__(text): Runs self.extract on the given text.
- extract(text): Gives the percentage of tokens from the must_appear set that appear in the text.
- extract(text: str) → float #
Gives the percentage of tokens from the must_appear set that appear in the text
- Parameters:
text (str) – the text to check appearance on
- Returns:
the ratio between the number of must-appear tokens found in the text and the size of the must_appear set
- Return type:
float
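The described ratio can be sketched in a few lines without elemeta; this is an illustrative reimplementation consistent with the example above (0.5 for {"I", "yes"}), not the library's code:

```python
from typing import Callable, List, Set

def must_appear_percentage(text: str,
                           tokenizer: Callable[[str], List[str]],
                           must_appear: Set[str]) -> float:
    # Fraction of the must_appear set that is found among the text's tokens.
    tokens = set(tokenizer(text))
    return len(must_appear & tokens) / len(must_appear)

# "I" appears in the text, "yes" does not -> 1 of 2 must-appear tokens.
print(must_appear_percentage("Once I was afraid, I was petrified",
                             str.split, {"I", "yes"}))  # 0.5
```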
elemeta.nlp.extractors.low_level.regex_token_matches_count module#
- class elemeta.nlp.extractors.low_level.regex_token_matches_count.TokenRegexMatchesCount(tokenizer: Callable[[str], List[str]], regex: str = '.*', name: str | None = None)#
Bases:
AbstractTextMetafeatureExtractor
Returns the number of tokens that match the given regex.
Example
>>> from elemeta.nlp.extractors.low_level.regex_token_matches_count import TokenRegexMatchesCount
>>> from nltk import word_tokenize
>>> text = "he hee is"
>>> regex = "h.+"
>>> token_regex_matches_counter = TokenRegexMatchesCount(word_tokenize, regex=regex)
>>> result = token_regex_matches_counter(text)
>>> print(result)  # Output: 2
Methods
- __call__(text): Runs self.extract on the given text.
- extract(text): Returns the number of tokens in the text that match the given regex.
- validator(token): Checks whether the token matches the regex.
- extract(text: str) → int #
Returns the number of tokens in the text that match the given regex
- Parameters:
text (str) – the string to run on
- Returns:
the number of tokens in the text that match the regex
- Return type:
int
- validator(token: str) → bool #
Checks whether the token matches the regex
- Parameters:
token (str) – the token to check against the regex
- Returns:
True if the token matches the regex, False otherwise
- Return type:
bool
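The tokenize-then-validate flow can be sketched without elemeta. This illustrative reimplementation assumes the validator behaves like re.match (anchored at the start of the token), which is consistent with the example output of 2:

```python
import re
from typing import Callable, List

def count_token_regex_matches(text: str,
                              tokenizer: Callable[[str], List[str]],
                              pattern: str = ".*") -> int:
    # Tokenize the text, then count tokens the regex matches from the
    # token's start (re.match semantics; an assumption for this sketch).
    rx = re.compile(pattern)
    return sum(1 for token in tokenizer(text) if rx.match(token))

# "he" and "hee" match "h.+"; "is" does not.
print(count_token_regex_matches("he hee is", str.split, "h.+"))  # 2
```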
elemeta.nlp.extractors.low_level.semantic_embedding_pair_similarity module#
- class elemeta.nlp.extractors.low_level.semantic_embedding_pair_similarity.SemanticEmbeddingPairSimilarity(name: str | None = None)#
Bases:
AbstractPairMetafeatureExtractor
Calculates the semantic embedding pair similarity between two input tensors.
Parameters:#
input_1 (Tensor) – The first input tensor.
input_2 (Tensor) – The second input tensor.
Returns:#
Tensor: The semantic embedding pair similarity between the two input tensors.
Examples:#
>>> import torch
>>> from elemeta.nlp.extractors.low_level.semantic_embedding_pair_similarity import SemanticEmbeddingPairSimilarity
>>> input_1 = torch.tensor([1, 2, 3], dtype=torch.float)
>>> input_2 = torch.tensor([4, 5, 6], dtype=torch.float)
>>> extractor = SemanticEmbeddingPairSimilarity()
>>> similarity = extractor(input_1, input_2)
>>> print(similarity)  # Output: tensor([[0.9746]])
Methods
- __call__(input_1, input_2): Runs self.extract on the given inputs.
- extract(input_1, input_2): Computes the similarity between the two input tensors.
- extract(input_1: Tensor, input_2: Tensor) → Tensor #
Computes the semantic embedding similarity between the two input tensors
- Parameters:
input_1 (Tensor) – the first input tensor
input_2 (Tensor) – the second input tensor
- Returns:
the similarity between the two input tensors
- Return type:
Tensor
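The 0.9746 in the example is consistent with cosine similarity between the two vectors. A plain-Python sketch of that computation (an assumption about the underlying metric, not elemeta's code, which operates on embedding tensors):

```python
import math
from typing import Sequence

def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    # Dot product divided by the product of the vector norms.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Matches the tensor([[0.9746]]) in the example above.
print(round(cosine_similarity([1, 2, 3], [4, 5, 6]), 4))  # 0.9746
```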
elemeta.nlp.extractors.low_level.semantic_text_to_group_similarity module#
- class elemeta.nlp.extractors.low_level.semantic_text_to_group_similarity.SemanticTextToGroupSimilarity(group: List[str], embedding_model: str | None = None, modules: Iterable[Module] | None = None, device: str | None = None, cache_folder: str | None = None, use_auth_token: bool | str | None = None, name: str | None = None)#
Bases:
AbstractTextMetafeatureExtractor
Extracts the similarity between a text and a group of texts.
- Parameters:
group (List[str]) – Group of strings to compare to.
embedding_model (Optional[str]) – The name of the SentenceTransformer model to use, by default “all-MiniLM-L6-v2”.
modules (Optional[Iterable[nn.Module]]) – This parameter can be used to create custom SentenceTransformer models from scratch.
device (Optional[str]) – Device (like ‘cuda’ / ‘cpu’) that should be used for computation. If None, checks if a GPU can be used.
cache_folder (Optional[str]) – Path to store models.
use_auth_token (Union[bool, str, None]) – HuggingFace authentication token to download private models.
name (Optional[str]) – Name of the extractor.
Examples
>>> from elemeta.nlp.extractors.low_level.semantic_text_to_group_similarity import SemanticTextToGroupSimilarity
>>> group = ["apple", "banana", "orange"]
>>> extractor = SemanticTextToGroupSimilarity(group)
>>> text = "apple"
>>> similarity = extractor.extract(text)
>>> print(similarity)  # Output: 1.000000238418579
Methods
- __call__(text): Runs self.extract on the given text.
- extract(input): Extracts the similarity between a text and a group of texts.
- extract(input: str) → float #
Extracts the similarity between a text and a group of texts.
- Parameters:
input (str) – Text to compare to the group.
- Returns:
Maximum similarity between the input text and the group.
- Return type:
float
elemeta.nlp.extractors.low_level.tokens_count module#
- class elemeta.nlp.extractors.low_level.tokens_count.TokensCount(tokenizer: Callable[[str], List[str]], exclude_tokens_list: Set[str] | None = None, include_tokens_list: Set[str] | None = None, name: str | None = None)#
Bases:
AbstractTextMetafeatureExtractor
Return the number of tokens in the text.
Example
>>> from elemeta.nlp.extractors.low_level.tokens_count import TokensCount
>>> from nltk import word_tokenize
>>> text = "Once I was afraid, I was petrified"
>>> tokens_count = TokensCount(word_tokenize)
>>> result = tokens_count(text)
>>> print(result)  # Output: 8
Methods
- __call__(text): Runs self.extract on the given text.
- extract(text): Counts the number of tokens in the text.
- extract(text: str) → int #
Counts the number of tokens in the text
- Parameters:
text (str) – the text to count tokens in
- Returns:
the number of tokens in the text
- Return type:
int
elemeta.nlp.extractors.low_level.unique_token_count module#
- class elemeta.nlp.extractors.low_level.unique_token_count.UniqueTokenCount(tokenizer: Callable[[str], List[str]], exclude_tokens_list: Set[str] | None = None, include_tokens_list: Set[str] | None = None, name: str | None = None)#
Bases:
AbstractTextMetafeatureExtractor
Returns the number of unique tokens in the text.
Example
>>> from elemeta.nlp.extractors.low_level.unique_token_count import UniqueTokenCount
>>> from nltk import word_tokenize
>>> text = "Once I was afraid, I was petrified"
>>> unique_token_count = UniqueTokenCount(word_tokenize)
>>> result = unique_token_count(text)
>>> print(result)  # Output: 4
Methods
- __call__(text): Runs self.extract on the given text.
- extract(text): Counts the number of unique tokens in the text.
- extract(text: str) → int #
Counts the number of unique tokens in the text
- Parameters:
text (str) – the text to count unique tokens in
- Returns:
the number of unique tokens in the text
- Return type:
int
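The example output of 4 is consistent with counting tokens that appear exactly once, the same notion of "unique" that UniqueTokensRatio documents. A sketch under that assumption (not the library's code):

```python
from collections import Counter
from typing import List

def unique_token_count(tokens: List[str]) -> int:
    # Count tokens whose frequency in the token list is exactly one.
    counts = Counter(tokens)
    return sum(1 for c in counts.values() if c == 1)

# nltk.word_tokenize("Once I was afraid, I was petrified") yields these
# tokens; "Once", "afraid", ",", and "petrified" each appear once.
tokens = ["Once", "I", "was", "afraid", ",", "I", "was", "petrified"]
print(unique_token_count(tokens))  # 4
```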
elemeta.nlp.extractors.low_level.unique_token_ratio module#
- class elemeta.nlp.extractors.low_level.unique_token_ratio.UniqueTokensRatio(tokenizer: Callable[[str], List[str]], exceptions: Set[str], name: str | None = None)#
Bases:
AbstractTextMetafeatureExtractor
Return the ratio between the number of unique tokens to all tokens
Example
>>> from elemeta.nlp.extractors.low_level.unique_token_ratio import UniqueTokensRatio
>>> from nltk import word_tokenize
>>> text = "Once I was afraid, I was petrified"
>>> unique_tokens_ratio = UniqueTokensRatio(word_tokenize, exceptions={"was"})
>>> result = unique_tokens_ratio(text)
>>> print(result)  # Output: 0.8
Methods
- __call__(text): Runs self.extract on the given text.
- extract(text): Returns the ratio between the number of unique tokens and the number of distinct tokens.
- extract(text: str) → float #
Returns the ratio len(set(tokens that appear once)) / len(set(tokens)), computed over tokens that are not in the exceptions set
- Parameters:
text (str) – the text to find the unique-words ratio on
- Returns:
the ratio between len(set(tokens that appear once)) and len(set(tokens))
- Return type:
float
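The formula in the docstring can be sketched directly; this illustrative reimplementation reproduces the example's 0.8 (not the library's code):

```python
from collections import Counter
from typing import List, Set

def unique_tokens_ratio(tokens: List[str], exceptions: Set[str]) -> float:
    # Drop exception tokens, then divide the number of tokens that appear
    # exactly once by the number of distinct tokens.
    counts = Counter(t for t in tokens if t not in exceptions)
    appear_once = sum(1 for c in counts.values() if c == 1)
    return appear_once / len(counts)

# With "was" excluded, 4 of the 5 distinct tokens appear exactly once.
tokens = ["Once", "I", "was", "afraid", ",", "I", "was", "petrified"]
print(unique_tokens_ratio(tokens, {"was"}))  # 0.8
```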