elemeta.nlp.extractors.low_level package#
Submodules#
elemeta.nlp.extractors.low_level.abstract_text_metafeature_extractor module#
- class elemeta.nlp.extractors.low_level.abstract_text_metafeature_extractor.AbstractTextMetafeatureExtractor(name: str | None = None)#
Bases:
ABC
Representation of a MetafeatureExtractor. This class holds the function to be run to extract the metafeature value and the name of the metafeature.
Methods
__call__(text) – run self.extract on the given text
extract(text) – extract the metric from the given text
- abstract extract(text: str) → Any#
Extract the metric from the given text.
- Parameters:
text (str) – the string to run on
- Returns:
the metadata extracted from text
- Return type:
Any
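The contract above can be sketched in standalone form. The class below only mirrors the documented interface (the real base class lives in elemeta, so the default-name behavior and the helper class here are assumptions for illustration):

```python
from abc import ABC, abstractmethod
from typing import Any, Optional

# Minimal standalone sketch of the AbstractTextMetafeatureExtractor contract;
# the real base class lives in elemeta, so details here are assumed.
class TextMetafeatureExtractorSketch(ABC):
    def __init__(self, name: Optional[str] = None):
        # Assumed behavior: fall back to the class name when no name is given
        self.name = name or type(self).__name__

    def __call__(self, text: str) -> Any:
        # __call__ simply delegates to extract, as documented above
        return self.extract(text)

    @abstractmethod
    def extract(self, text: str) -> Any:
        """Extract the metafeature value from the text."""

# A toy concrete extractor: the character count of the text
class CharCount(TextMetafeatureExtractorSketch):
    def extract(self, text: str) -> int:
        return len(text)

counter = CharCount()
print(counter("hello"))  # 5
print(counter.name)      # CharCount
```

Concrete extractors such as AvgTokenLength below follow this pattern: implement extract, and callers invoke the instance directly.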
elemeta.nlp.extractors.low_level.abstract_text_pair_metafeature_extractor module#
- class elemeta.nlp.extractors.low_level.abstract_text_pair_metafeature_extractor.AbstractTextPairMetafeatureExtractor(name: str | None = None)#
Bases:
AbstractPairMetafeatureExtractor
This class holds the function to be run to extract the metadata value and the name of the metadata.
Methods
__call__(input_1, input_2) – run self.extract on the given inputs
extract(input_1, input_2) – extract the metric from the given input pair
- abstract extract(input_1: str, input_2: str) → Any#
Extract the metric from the given pair of texts.
- Parameters:
input_1 (str) – the first string to run on
input_2 (str) – the second string to run on
- Returns:
the metadata extracted from the input pair
- Return type:
Any
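As with the single-text base class, the pair contract can be sketched in standalone form; the sketch below mirrors the documented interface, and both classes in it are illustrative assumptions rather than elemeta code:

```python
from abc import ABC, abstractmethod
from typing import Any, Optional

# Standalone sketch of the pair-extractor contract; the real base class
# lives in elemeta, so names and defaults here are assumed.
class TextPairMetafeatureExtractorSketch(ABC):
    def __init__(self, name: Optional[str] = None):
        self.name = name or type(self).__name__

    def __call__(self, input_1: str, input_2: str) -> Any:
        # __call__ delegates to extract, as documented above
        return self.extract(input_1, input_2)

    @abstractmethod
    def extract(self, input_1: str, input_2: str) -> Any:
        """Extract the metafeature value from the pair of texts."""

# A toy pair metric: how many whitespace-separated tokens the texts share
class SharedTokenCount(TextPairMetafeatureExtractorSketch):
    def extract(self, input_1: str, input_2: str) -> int:
        return len(set(input_1.split()) & set(input_2.split()))

extractor = SharedTokenCount()
print(extractor("red green blue", "green blue yellow"))  # 2
```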
elemeta.nlp.extractors.low_level.avg_token_length module#
- class elemeta.nlp.extractors.low_level.avg_token_length.AvgTokenLength(tokenizer: Callable[[str], List[str]], tokens_to_exclude: Set[str] | None = None, name: str | None = None)#
Bases:
AbstractTextMetafeatureExtractor
Implementation of AbstractTextMetafeatureExtractor that returns the average token length in the text.
Example
>>> from elemeta.nlp.extractors.low_level.avg_token_length import AvgTokenLength
>>> from nltk import word_tokenize
>>> text = "Once I was afraid, I was petrified"
>>> avg_token_length = AvgTokenLength(word_tokenize)
>>> result = avg_token_length(text)
>>> print(result)  # Output: 3.5
Methods
__call__(text) – run self.extract on the given text
extract(text) – return the average token length in the text
- extract(text: str) → float#
Return the average token length in the text.
- Parameters:
text (str) – the string to run on
- Returns:
the average token length in the text
- Return type:
float
elemeta.nlp.extractors.low_level.hinted_profanity_token_count module#
- class elemeta.nlp.extractors.low_level.hinted_profanity_token_count.HintedProfanityTokensCount(tokenizer: Callable[[str], List[str]], name: str | None = None)#
Bases:
AbstractTextMetafeatureExtractor
Implementation of AbstractTextMetafeatureExtractor that counts the number of profanity words in the text.
Example
>>> from elemeta.nlp.extractors.low_level.hinted_profanity_token_count import HintedProfanityTokensCount
>>> from nltk import word_tokenize
>>> text = "Once I was afraid, I was petrified"
>>> profanity_token_counter = HintedProfanityTokensCount(word_tokenize)
>>> result = profanity_token_counter(text)
>>> print(result)  # Output: 0
Methods
__call__(text) – run self.extract on the given text
extract(text) – return the number of profanity words in the text
- extract(text: str) → int#
Return the number of profanity words in the text.
- Parameters:
text (str) – the string to run on
- Returns:
the number of profanity words in the text
- Return type:
int
elemeta.nlp.extractors.low_level.must_appear_tokens_parentage module#
- class elemeta.nlp.extractors.low_level.must_appear_tokens_parentage.MustAppearTokensPercentage(tokenizer: Callable[[str], List[str]], must_appear: Set[str], name: str | None = None)#
Bases:
AbstractTextMetafeatureExtractor
Implementation of AbstractTextMetafeatureExtractor that returns the percentage of tokens from the given must-appear set that appear in the text.
Example
>>> from elemeta.nlp.extractors.low_level.must_appear_tokens_parentage import MustAppearTokensPercentage
>>> from nltk import word_tokenize
>>> text = "Once I was afraid, I was petrified"
>>> must_appear_tokens_percentage = MustAppearTokensPercentage(word_tokenize, {"I","yes"})
>>> result = must_appear_tokens_percentage(text)
>>> print(result)  # Output: 0.5
Methods
__call__(text) – run self.extract on the given text
extract(text) – give the percentage of tokens in the must_appear set that appear in the text
- extract(text: str) → float#
Give the percentage of the tokens in the must_appear set that appear in the text.
- Parameters:
text (str) – the text to check appearance on
- Returns:
the fraction of tokens in the must_appear set that appear in the text
- Return type:
float
elemeta.nlp.extractors.low_level.regex_token_matches_count module#
- class elemeta.nlp.extractors.low_level.regex_token_matches_count.TokenRegexMatchesCount(tokenizer: Callable[[str], List[str]], regex: str = '.*', name: str | None = None)#
Bases:
AbstractTextMetafeatureExtractor
Return the number of tokens that match the given regex.
Example
>>> from elemeta.nlp.extractors.low_level.regex_token_matches_count import TokenRegexMatchesCount
>>> from nltk import word_tokenize
>>> text = "he hee is"
>>> regex = "h.+"
>>> token_regex_matches_counter = TokenRegexMatchesCount(word_tokenize, regex=regex)
>>> result = token_regex_matches_counter(text)
>>> print(result)  # Output: 2
Methods
__call__(text) – run self.extract on the given text
extract(text) – return the number of tokens in the text that match the regex
validator(token) – check whether the token matches the regex
- extract(text: str) → int#
Return the number of tokens in the text that match the regex.
- Parameters:
text (str) – the string to run on
- Returns:
the number of tokens in the text that match the regex
- Return type:
int
- validator(token: str) → bool#
Check whether the given token matches the regex.
- Parameters:
token (str) – the token to check against the regex
- Returns:
True if the token matches the regex, False otherwise
- Return type:
bool
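The relationship between validator and extract can be illustrated with a standalone sketch. Note the real TokenRegexMatchesCount may use a different matching strategy; the helper below and its use of re.fullmatch are assumptions for illustration:

```python
import re

# Standalone sketch of the validator/extract relationship described above;
# the real TokenRegexMatchesCount lives in elemeta and may match tokens
# differently, so re.fullmatch here is an assumption.
def validator(token: str, regex: str = "h.+") -> bool:
    # A token "abides by" the regex when the whole token matches it
    return re.fullmatch(regex, token) is not None

# extract then reduces to counting the tokens the validator accepts
tokens = ["he", "hee", "is"]
print(sum(validator(t) for t in tokens))  # 2
```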
elemeta.nlp.extractors.low_level.semantic_embedding_pair_similarity module#
- class elemeta.nlp.extractors.low_level.semantic_embedding_pair_similarity.SemanticEmbeddingPairSimilarity(name: str | None = None)#
Bases:
AbstractPairMetafeatureExtractor
Calculates the semantic embedding similarity between two input tensors.
Parameters:#
input_1 (Tensor) – the first input tensor
input_2 (Tensor) – the second input tensor
Returns:#
Tensor: The semantic embedding pair similarity between the two input tensors.
Examples:#
>>> import torch
>>> from elemeta.nlp.extractors.low_level.semantic_embedding_pair_similarity import SemanticEmbeddingPairSimilarity
>>> input_1 = torch.tensor([1, 2, 3], dtype=torch.float)
>>> input_2 = torch.tensor([4, 5, 6], dtype=torch.float)
>>> extractor = SemanticEmbeddingPairSimilarity()
>>> similarity = extractor(input_1, input_2)
>>> print(similarity)  # Output: tensor([[0.9746]])
Methods
__call__(input_1, input_2) – run self.extract on the given inputs
extract(input_1, input_2) – extract the similarity between the two input tensors
- extract(input_1: Tensor, input_2: Tensor) → Tensor#
Extract the similarity between the two input tensors.
- Parameters:
input_1 (Tensor) – the first input tensor
input_2 (Tensor) – the second input tensor
- Returns:
the similarity between the two input tensors
- Return type:
Tensor
elemeta.nlp.extractors.low_level.semantic_text_to_group_similarity module#
- class elemeta.nlp.extractors.low_level.semantic_text_to_group_similarity.SemanticTextToGroupSimilarity(group: List[str], embedding_model: str | None = None, modules: Iterable[Module] | None = None, device: str | None = None, cache_folder: str | None = None, use_auth_token: bool | str | None = None, name: str | None = None)#
Bases:
AbstractTextMetafeatureExtractor
Extracts the similarity between a text and a group of texts.
- Parameters:
group (List[str]) – Group of strings to compare to.
embedding_model (Optional[str]) – The name of the SentenceTransformer model to use, by default “all-MiniLM-L6-v2”.
modules (Optional[Iterable[nn.Module]]) – This parameter can be used to create custom SentenceTransformer models from scratch.
device (Optional[str]) – Device (like ‘cuda’ / ‘cpu’) that should be used for computation. If None, checks if a GPU can be used.
cache_folder (Optional[str]) – Path to store models.
use_auth_token (Union[bool, str, None]) – HuggingFace authentication token to download private models.
name (Optional[str]) – Name of the extractor.
Examples
>>> from elemeta.nlp.extractors.low_level.semantic_text_to_group_similarity import SemanticTextToGroupSimilarity
>>> group = ["apple", "banana", "orange"]
>>> extractor = SemanticTextToGroupSimilarity(group)
>>> text = "apple"
>>> similarity = extractor.extract(text)
>>> print(similarity)  # Output: 1.000000238418579
Methods
__call__(text) – run self.extract on the given text
extract(input) – extract the similarity between a text and the group of texts
- extract(input: str) → float#
Extracts the similarity between a text and a group of texts.
- Parameters:
input (str) – Text to compare to the group.
- Returns:
Maximum similarity between the input text and the group.
- Return type:
float
elemeta.nlp.extractors.low_level.tokens_count module#
- class elemeta.nlp.extractors.low_level.tokens_count.TokensCount(tokenizer: Callable[[str], List[str]], exclude_tokens_list: Set[str] | None = None, include_tokens_list: Set[str] | None = None, name: str | None = None)#
Bases:
AbstractTextMetafeatureExtractor
Return the number of tokens in the text.
Example
>>> from elemeta.nlp.extractors.low_level.tokens_count import TokensCount
>>> from nltk import word_tokenize
>>> text = "Once I was afraid, I was petrified"
>>> tokens_count = TokensCount(word_tokenize)
>>> result = tokens_count(text)
>>> print(result)  # Output: 8
Methods
__call__(text) – run self.extract on the given text
extract(text) – count the number of tokens in the text
- extract(text: str) → int#
Count the number of tokens in the text.
- Parameters:
text (str) – the text to count tokens in
- Returns:
the number of tokens in the text
- Return type:
int
elemeta.nlp.extractors.low_level.unique_token_count module#
- class elemeta.nlp.extractors.low_level.unique_token_count.UniqueTokenCount(tokenizer: Callable[[str], List[str]], exclude_tokens_list: Set[str] | None = None, include_tokens_list: Set[str] | None = None, name: str | None = None)#
Bases:
AbstractTextMetafeatureExtractor
Returns the number of unique tokens in the text.
Example
>>> from elemeta.nlp.extractors.low_level.unique_token_count import UniqueTokenCount
>>> from nltk import word_tokenize
>>> text = "Once I was afraid, I was petrified"
>>> unique_token_count = UniqueTokenCount(word_tokenize)
>>> result = unique_token_count(text)
>>> print(result)  # Output: 4
Methods
__call__(text) – run self.extract on the given text
extract(text) – count the number of unique tokens in the text
- extract(text: str) → int#
Count the number of unique tokens in the text.
- Parameters:
text (str) – the text to count unique tokens in
- Returns:
the number of unique tokens in the text
- Return type:
int
elemeta.nlp.extractors.low_level.unique_token_ratio module#
- class elemeta.nlp.extractors.low_level.unique_token_ratio.UniqueTokensRatio(tokenizer: Callable[[str], List[str]], exceptions: Set[str], name: str | None = None)#
Bases:
AbstractTextMetafeatureExtractor
Return the ratio between the number of unique tokens and the number of all tokens in the text.
Example
>>> from elemeta.nlp.extractors.low_level.unique_token_ratio import UniqueTokensRatio
>>> from nltk import word_tokenize
>>> text = "Once I was afraid, I was petrified"
>>> unique_tokens_ratio = UniqueTokensRatio(word_tokenize, exceptions={"was"})
>>> result = unique_tokens_ratio(text)
>>> print(result)  # Output: 0.8
Methods
__call__(text) – run self.extract on the given text
extract(text) – return the unique-token ratio of the text
- extract(text: str) → float#
Return the unique-token ratio of the text, computed only over tokens that are not in the exceptions set.
- Parameters:
text (str) – the text we want to find unique words ratio on
- Returns:
the ratio between len(set(tokens that appear once)) and len(set(tokens))
- Return type:
float