vocabulary
- class aitoolbox.nlp.core.vocabulary.Vocabulary(name, document_level=False)
Bases: object
Vocabulary for storing tokens and converting between tokens and their indices
- Parameters:
name (str) – name or type of the vocabulary. Used only for tracking purposes
document_level (bool) – whether the vocabulary is built at the sentence level or at the document level. A document consists of multiple sentences, so in document-level mode additional tokens marking the start and the end of the document are added.
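The effect of document_level on the special tokens can be pictured with the minimal sketch below. It is an illustration only, not the aitoolbox implementation: the helper build_special_tokens and every token string in it are hypothetical placeholders, since the actual token names are not listed here.

```python
def build_special_tokens(document_level=False):
    """Hypothetical helper: map special tokens to their reserved indices."""
    # Placeholder sentence-level tokens (assumed names, not taken from aitoolbox).
    tokens = ["<PAD>", "<OOV>", "<SOS>", "<EOS>"]
    if document_level:
        # Document-level mode adds start-of-document and end-of-document markers.
        tokens += ["<SOD>", "<EOD>"]
    return {token: index for index, token in enumerate(tokens)}


sentence_level = build_special_tokens()
doc_level = build_special_tokens(document_level=True)
```

In this sketch the document-level mapping is the sentence-level mapping plus two extra entries.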
- add_sentence(sentence_tokens)
Add a tokenized sentence to the vocabulary
- Parameters:
sentence_tokens (list) – sentence tokens, e.g. list of words representing the sentence
- Returns:
None
- add_word(word)
Add a single word to the vocabulary
- Parameters:
word (str) – single word string
- Returns:
None
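How add_word and add_sentence could relate to each other is sketched below. SketchVocabulary and its attributes word2index, index2word and word2count are assumed names for a hypothetical stand-in class, not the actual aitoolbox internals.

```python
from collections import Counter


class SketchVocabulary:
    """Hypothetical stand-in for illustration, not the aitoolbox class."""

    def __init__(self):
        self.word2index = {}
        self.index2word = {}
        self.word2count = Counter()

    def add_word(self, word):
        # Assign the next free index the first time a word is seen.
        if word not in self.word2index:
            index = len(self.word2index)
            self.word2index[word] = index
            self.index2word[index] = word
        # Count every occurrence, new or repeated.
        self.word2count[word] += 1

    def add_sentence(self, sentence_tokens):
        # A sentence is added one token at a time.
        for token in sentence_tokens:
            self.add_word(token)


vocab = SketchVocabulary()
vocab.add_sentence(["the", "cat", "saw", "the", "dog"])
vocab.add_word("cat")
```

Under these assumptions, repeated words keep their first index and only their count grows.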
- trim(min_count)
Remove words below a certain count threshold
- Parameters:
min_count (int) – minimum number of occurrences a word must have to stay in the vocabulary
- Returns:
None
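Count-based trimming can be sketched as follows. The function trim_counts is a hypothetical helper, and two details are assumptions rather than documented behavior: a word whose count equals min_count is kept, and the surviving words are re-indexed from zero.

```python
from collections import Counter


def trim_counts(word2count, min_count):
    """Hypothetical helper: re-index the words that meet the count threshold."""
    # Assumption: a count equal to min_count is enough to keep the word.
    kept_words = [word for word, count in word2count.items() if count >= min_count]
    # Assumption: surviving words receive fresh consecutive indices.
    return {word: index for index, word in enumerate(kept_words)}


counts = Counter(["the", "cat", "saw", "the", "dog", "cat"])
trimmed_word2index = trim_counts(counts, min_count=2)
```

Here "saw" and "dog" occur once each, so only "the" and "cat" survive the trim.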
- convert_sent2idx_sent(sent_tokens, start_end_token=True)
Convert the given tokenized sentence into a sequence of token indices