Basic spaCy Operations - Machine Learning and NLP Study Notes

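What is spaCy?

A natural language processing library for Python.
It supports part-of-speech tagging, named entity recognition, dependency parsing, and more.

See the official site for details.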
spacy.io

Basic operations

These notes record the basic operations for future reference.

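import spacy
# The original notes loaded the 'en' shortcut, which only works in spaCy v2 and earlier;
# the full package name below also works in spaCy v3.
nlp = spacy.load('en_core_web_sm')
doc = nlp(u'Jeffrey Navin saw the girl with the telescope. She looked very strong.')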

In spaCy, a word is not a plain string but a special Token object that carries information such as its part of speech.

doc[0]
>> Jeffrey
type(doc[0])
>> spacy.tokens.token.Token
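
As a small illustration of what a Token carries besides its text, the sketch below lists a few lexical attributes. It assumes `spacy.blank("en")`, a tokenizer-only pipeline that these notes do not otherwise use, so that it runs without a downloaded model; `rows` is just an illustrative name.

```python
import spacy

# Tokenizer-only English pipeline (assumption: chosen so no trained model is required).
nlp = spacy.blank("en")
doc = nlp("Jeffrey Navin saw the girl with the telescope.")

# For each token: position in the doc, character offset, text, and two boolean flags.
rows = [(t.i, t.idx, t.text, t.is_alpha, t.is_punct) for t in doc]
for row in rows:
    print(row)
```

Attributes that depend on a trained model, such as `pos_` or `ent_type_`, stay empty on a blank pipeline.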

Splitting into sentences.

sentences = list(doc.sents)
sentences
>> [Jeffrey Navin saw the girl with the telescope., She looked very strong.]
len(sentences)
>> 2
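
Each sentence is a Span, so it also exposes its token range. The sketch below is a rough illustration under two assumptions that are not in the notes above: it uses spaCy v3's `nlp.add_pipe("sentencizer")` call style, and it uses the rule-based sentencizer on a blank pipeline instead of a trained parser so that no model download is needed.

```python
import spacy

# Blank pipeline plus the rule-based sentencizer (spaCy v3 call style, an assumption).
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

doc = nlp("Jeffrey Navin saw the girl with the telescope. She looked very strong.")

# For each sentence: start token index, end token index (exclusive), and text.
spans = [(sent.start, sent.end, sent.text) for sent in doc.sents]
for span in spans:
    print(span)
```

With a trained pipeline, `doc.sents` usually comes from the dependency parser, so boundaries may differ from this punctuation-based split on harder text.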

Extracting part-of-speech information

(doc[0].pos_, doc[0].pos)
>> (u'PROPN', 94)
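
`pos_` is the readable label and `pos` is the integer ID for the same value. The sketch below maps between the two and looks up short descriptions of labels. It relies on `spacy.symbols`, `nlp.vocab.strings`, and `spacy.explain`, none of which appear in these notes, and on a blank pipeline so that no model is needed. The integer can differ between spaCy versions, so it is probably safer not to hard-code it.

```python
import spacy
from spacy.symbols import PROPN  # integer ID behind the 'PROPN' label

# Tokenizer-only pipeline; only its vocabulary is used here.
nlp = spacy.blank("en")

# Integer ID -> label, via the vocabulary's string store.
label = nlp.vocab.strings[PROPN]
print(PROPN, label)

# Short human-readable descriptions for POS, tag, and entity labels.
descriptions = {name: spacy.explain(name) for name in ["PROPN", "NNP", "VBD", "GPE"]}
for name, text in descriptions.items():
    print(name, "->", text)
```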

Extracting noun chunks

list(doc.noun_chunks)[0].text
>> u'Jeffrey Navin'

Getting lemmas and fine-grained POS tags

for sent in doc.sents:
    for token in sent:
        # surface form, lemma, fine-grained POS tag
        print(token.text, token.lemma_, token.tag_)

Jeffrey jeffrey NNP
Navin navin NNP
saw see VBD
the the DT
girl girl NN
with with IN
the the DT
telescope telescope NN
. . .
She -PRON- PRP
looked look VBD
very very RB
strong strong JJ
. . .

Extracting named entities

for sent in doc.sents:
    for token in sent:
        # tokens outside any entity have an empty ent_type_, shown here as 'O'
        print(token.text, token.ent_type_ if token.ent_type_ else 'O')

Jeffrey PERSON
Navin PERSON
saw O
the O
girl O
with O
the O
telescope O
. O
She O
looked O
very O
strong O
. O
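
The per-token output above does not show where an entity begins. One possible way to get BIO-style tags is to combine `ent_iob_` with `ent_type_`, as sketched below. To keep it runnable without a trained model, the PERSON entity is set by hand on a blank pipeline, which is an assumption of this sketch; `bio_tag` is a hypothetical helper name, not part of spaCy.

```python
import spacy
from spacy.tokens import Span

# Blank pipeline; the entity is annotated manually instead of predicted by a model.
nlp = spacy.blank("en")
doc = nlp("Jeffrey Navin saw the girl")
doc.ents = [Span(doc, 0, 2, label="PERSON")]


def bio_tag(token):
    """Return 'B-TYPE' or 'I-TYPE' for entity tokens and 'O' otherwise."""
    if token.ent_iob_ in ("B", "I"):
        return token.ent_iob_ + "-" + token.ent_type_
    return "O"


tags = [(token.text, bio_tag(token)) for token in doc]
for text, tag in tags:
    print(text, tag)
```

The same loop should work unchanged on a doc produced by a trained pipeline, where the entities come from the model.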

More detailed named entity extraction

doc = nlp(u'San Francisco considers banning sidewalk delivery robots')
ents = [(ent.text, ent.start_char, ent.end_char, ent.label_) for ent in doc.ents]
>> [(u'San Francisco', 0, 13, u'GPE')]

doc = nlp(u'Tom goes to New York')
ents = [(ent.text, ent.start_char, ent.end_char, ent.label_) for ent in doc.ents]
>> [(u'Tom', 0, 3, u'PERSON'), (u'New York', 12, 20, u'GPE')]

That covers the basic operations.