Stemming and Lemmatization are both text normalization techniques used in Natural Language Processing to reduce words to their base or root forms. However, they differ in approach and accuracy.
| Aspect | Stemming | Lemmatization |
|---|---|---|
| Definition | Chops off word endings to get the stem (may not be a real word). | Reduces a word to its dictionary root (lemma) using vocabulary and POS. |
| Method | Rule-based or algorithmic cutting of suffixes. | Dictionary-based, using morphological analysis. |
| Output | May not be a valid word (e.g., “studies” → “studi”). | Valid root word with proper meaning (e.g., “studies” → “study”). |
| Speed | Generally faster. | Slower due to more complex processing. |
| Accuracy | Less accurate; may produce stems that aren't actual words. | More accurate and linguistically meaningful. |
The example below contrasts the two using NLTK's `PorterStemmer` and `WordNetLemmatizer`:

```python
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download('wordnet', quiet=True)  # the lemmatizer needs the WordNet corpus

ps = PorterStemmer()
lemmatizer = WordNetLemmatizer()

words = ["running", "runs", "runner", "better", "cats", "studies", "wolves", "was", "geese"]
for word in words:
    stem = ps.stem(word)
    lemma = lemmatizer.lemmatize(word)  # defaults to treating the word as a noun
    print(f"Word: {word:10} | Stem: {stem:10} | Lemma: {lemma}")

# The lemmatizer is POS-sensitive: 'better' maps to 'good' only as an adjective.
print("Lemmatize 'better' as adjective:", lemmatizer.lemmatize("better", pos='a'))
print("Stem 'better':", ps.stem("better"))
```