Python NLTK pos_tag not returning the correct part-of-speech tag

Question

Given this:

text = word_tokenize("The quick brown fox jumps over the lazy dog")

and running:

nltk.pos_tag(text)

I get:

[('The', 'DT'), ('quick', 'NN'), ('brown', 'NN'), ('fox', 'NN'), ('jumps', 'NNS'), ('over', 'IN'), ('the', 'DT'), ('lazy', 'NN'), ('dog', 'NN')]

This is incorrect. The tags for quick, brown and lazy in the sentence should be:

('quick', 'JJ'), ('brown', 'JJ') , ('lazy', 'JJ')

Testing this through their online tool gives the same result: quick, brown and lazy are tagged as nouns when they should be adjectives.

machine-learning nlp nltk pos-tagger python

asked on 2015-06-13 19:52:28 · dmcc

1 answer

score 55 · Answer 1

In short:

NLTK is not perfect. In fact, no model is perfect.

Note:

Since NLTK version 3.1, the default pos_tag function no longer uses the old English MaxEnt pickle.

It is now the perceptron tagger from @Honnibal's implementation; see nltk.tag.pos_tag:

>>> import inspect
>>> from nltk import pos_tag
>>> print(inspect.getsource(pos_tag))
def pos_tag(tokens, tagset=None):
    tagger = PerceptronTagger()
    return _pos_tag(tokens, tagset, tagger)

It is better, but not perfect:

>>> from nltk import pos_tag
>>> pos_tag("The quick brown fox jumps over the lazy dog".split())
[('The', 'DT'), ('quick', 'JJ'), ('brown', 'NN'), ('fox', 'NN'), ('jumps', 'VBZ'), ('over', 'IN'), ('the', 'DT'), ('lazy', 'JJ'), ('dog', 'NN')]
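To make "better, but not perfect" concrete, the output above can be checked against the three adjective tags the question asked for. This is only a small sketch in plain Python: the `mismatches` helper and the `expected` dictionary are illustrative names, not part of NLTK, and the expected tags cover only the three words named in the question.

```python
# Tags reported above for the default tagger in NLTK 3.1 (copied from the output).
predicted = [('The', 'DT'), ('quick', 'JJ'), ('brown', 'NN'), ('fox', 'NN'),
             ('jumps', 'VBZ'), ('over', 'IN'), ('the', 'DT'), ('lazy', 'JJ'),
             ('dog', 'NN')]

# Expected tags from the question; only these three words were specified there.
expected = {'quick': 'JJ', 'brown': 'JJ', 'lazy': 'JJ'}


def mismatches(tagged, expected):
    """Return (word, predicted_tag, expected_tag) for every word tagged differently."""
    return [(word, tag, expected[word])
            for word, tag in tagged
            if word in expected and tag != expected[word]]


remaining = mismatches(predicted, expected)
print(remaining)  # [('brown', 'NN', 'JJ')]
```

Of the three words the question complained about, only brown is still tagged as a noun.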

If you just want TL;DR solutions, see https://github.com/alvations/nltk_cli

In long:

Try using another tagger (see https://github.com/nltk/nltk/tree/develop/nltk/tag ), e.g.:

HunPos
Stanford POS
Senna

Using the default MaxEnt POS tagger from NLTK (the default before version 3.1), i.e. nltk.pos_tag:

>>> from nltk import word_tokenize, pos_tag
>>> text = "The quick brown fox jumps over the lazy dog"
>>> pos_tag(word_tokenize(text))
[('The', 'DT'), ('quick', 'NN'), ('brown', 'NN'), ('fox', 'NN'), ('jumps', 'NNS'), ('over', 'IN'), ('the', 'DT'), ('lazy', 'NN'), ('dog', 'NN')]

Using the Stanford POS tagger:

$ cd ~
$ wget http://nlp.stanford.edu/software/stanford-postagger-2015-04-20.zip
$ unzip stanford-postagger-2015-04-20.zip
$ mv stanford-postagger-2015-04-20 stanford-postagger
$ python
>>> from os.path import expanduser
>>> home = expanduser("~")
>>> from nltk.tag.stanford import POSTagger
>>> _path_to_model = home + '/stanford-postagger/models/english-bidirectional-distsim.tagger'
>>> _path_to_jar = home + '/stanford-postagger/stanford-postagger.jar'
>>> st = POSTagger(path_to_model=_path_to_model, path_to_jar=_path_to_jar)
>>> text = "The quick brown fox jumps over the lazy dog"
>>> st.tag(text.split())
[(u'The', u'DT'), (u'quick', u'JJ'), (u'brown', u'JJ'), (u'fox', u'NN'), (u'jumps', u'VBZ'), (u'over', u'IN'), (u'the', u'DT'), (u'lazy', u'JJ'), (u'dog', u'NN')]

Using HunPOS (NOTE: the default encoding is ISO-8859-1, not UTF-8):

$ cd ~
$ wget https://hunpos.googlecode.com/files/hunpos-1.0-linux.tgz
$ tar zxvf hunpos-1.0-linux.tgz
$ wget https://hunpos.googlecode.com/files/en_wsj.model.gz
$ gzip -d en_wsj.model.gz 
$ mv en_wsj.model hunpos-1.0-linux/
$ python
>>> from os.path import expanduser
>>> home = expanduser("~")
>>> from nltk.tag.hunpos import HunposTagger
>>> _path_to_bin = home + '/hunpos-1.0-linux/hunpos-tag'
>>> _path_to_model = home + '/hunpos-1.0-linux/en_wsj.model'
>>> ht = HunposTagger(path_to_model=_path_to_model, path_to_bin=_path_to_bin)
>>> text = "The quick brown fox jumps over the lazy dog"
>>> ht.tag(text.split())
[('The', 'DT'), ('quick', 'JJ'), ('brown', 'JJ'), ('fox', 'NN'), ('jumps', 'NNS'), ('over', 'IN'), ('the', 'DT'), ('lazy', 'JJ'), ('dog', 'NN')]

Using Senna (make sure you have the latest version of NLTK; there have been some changes to the API):

$ cd ~
$ wget http://ronan.collobert.com/senna/senna-v3.0.tgz
$ tar zxvf senna-v3.0.tgz
$ python
>>> from os.path import expanduser
>>> home = expanduser("~")
>>> from nltk.tag.senna import SennaTagger
>>> st = SennaTagger(home+'/senna')
>>> text = "The quick brown fox jumps over the lazy dog"
>>> st.tag(text.split())
[('The', u'DT'), ('quick', u'JJ'), ('brown', u'JJ'), ('fox', u'NN'), ('jumps', u'VBZ'), ('over', u'IN'), ('the', u'DT'), ('lazy', u'JJ'), ('dog', u'NN')]
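The four outputs above are easier to compare side by side. The following sketch uses plain Python and needs none of the taggers installed: it copies the tag sequences reported above and lists the words on which the taggers disagree. The `disagreements` helper and the dictionary layout are illustrative choices, not an NLTK API.

```python
words = "The quick brown fox jumps over the lazy dog".split()

# Tag sequences copied from the outputs shown above, one per tagger.
tags_by_tagger = {
    "MaxEnt":   "DT NN NN NN NNS IN DT NN NN".split(),
    "Stanford": "DT JJ JJ NN VBZ IN DT JJ NN".split(),
    "HunPos":   "DT JJ JJ NN NNS IN DT JJ NN".split(),
    "Senna":    "DT JJ JJ NN VBZ IN DT JJ NN".split(),
}


def disagreements(words, tags_by_tagger):
    """Return (word, {tagger: tag}) for each word that does not get one unanimous tag."""
    rows = []
    for i, word in enumerate(words):
        per_tagger = {name: seq[i] for name, seq in tags_by_tagger.items()}
        if len(set(per_tagger.values())) > 1:
            rows.append((word, per_tagger))
    return rows


diff = disagreements(words, tags_by_tagger)
for word, per_tagger in diff:
    print(word, per_tagger)
```

On these outputs the taggers differ on quick, brown, jumps and lazy: the old MaxEnt tagger labels all three adjectives as nouns, and both MaxEnt and HunPos tag jumps as NNS rather than VBZ.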

Or try building a better POS tagger:

Ngram Tagger: http://streamhacker.com/2008/11/03/part-of-speech-tagging-with-nltk-part-1/
Affix / Regex Tagger: http://streamhacker.com/2008/11/10/part-of-speech-tagging-with-nltk-part-2/
Build your own Brill tagger (read the code, it's a pretty fun tagger: http://www.nltk.org/_modules/nltk/tag/brill.html ), see http://streamhacker.com/2008/12/03/part-of-speech-tagging-with-nltk-part-3/
Perceptron Tagger: https://honnibal.wordpress.com/2013/09/11/a-good-part-of-speechpos-tagger-in-about-200-lines-of-python/
LDA Tagger: http://scm.io/blog/hack/2015/02/lda-intentions/
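As a rough idea of what the first two approaches in that list involve, here is a minimal pure-Python sketch of a unigram tagger that backs off to suffix rules and then to a default tag. It is not NLTK's implementation: the training sentences, the suffix rules and the function names are all made up for illustration, and a real tagger would be trained on a large tagged corpus.

```python
import re
from collections import Counter, defaultdict

# Hypothetical toy training data; far too small for real use.
TRAIN = [
    [("the", "DT"), ("quick", "JJ"), ("cat", "NN"), ("sleeps", "VBZ")],
    [("a", "DT"), ("lazy", "JJ"), ("dog", "NN"), ("barks", "VBZ")],
    [("the", "DT"), ("brown", "JJ"), ("fox", "NN"), ("runs", "VBZ")],
]

# Hypothetical suffix rules, tried in order for words unseen in training.
SUFFIX_RULES = [
    (re.compile(r".*ing$"), "VBG"),
    (re.compile(r".*ed$"), "VBD"),
    (re.compile(r".*ly$"), "RB"),
    (re.compile(r".*s$"), "NNS"),
]


def train_unigram(sentences):
    """Map each lowercased word to its most frequent tag in the training data."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        for word, tag in sentence:
            counts[word.lower()][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}


def tag(tokens, model, default="NN"):
    """Tag tokens: unigram lookup first, then suffix rules, then the default tag."""
    tagged = []
    for token in tokens:
        lowered = token.lower()
        guess = model.get(lowered)
        if guess is None:
            guess = next((t for pattern, t in SUFFIX_RULES if pattern.match(lowered)),
                         default)
        tagged.append((token, guess))
    return tagged


model = train_unigram(TRAIN)
result = tag("The quick brown fox jumps over the lazy dog".split(), model)
print(result)
```

With this toy data the adjectives come out as JJ because they were seen in training, but jumps falls through to the suffix rule and gets NNS, and over falls through to the default NN. Context-free lookups like this cannot tell a plural noun from a third-person verb, which is the gap that Brill rules and the perceptron tagger try to close by looking at neighbouring words and tags.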

Complaints about pos_tag accuracy on Stack Overflow include:

Issues regarding NLTK HunPos include:

Issues with NLTK and the Stanford POS tagger include:
