Recommended: Get up to speed quickly with spaCy for natural language processing in Python (with code & links)
Author: Paco Nathan    Translated by: 笪洁琼    Proofread by: 和中华
This article is a brief introduction to natural language processing (sometimes called "text analytics") in Python using spaCy and related libraries, along with a look at some current applications.
Environment setup documentation page:
https://support.dominodatalab.com/hc/en-us/articles/115000392643-Environment-management
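If you are working outside that environment, a minimal local setup sketch (assuming pip and the same en_core_web_sm model used throughout this article) is simply to verify the model loads:

import spacy

try:
    # the small English pipeline used throughout this article
    nlp = spacy.load("en_core_web_sm")
except OSError:
    # the model is not installed yet; download it first, e.g.
    #   python -m spacy download en_core_web_sm
    raise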
import spacy

nlp = spacy.load("en_core_web_sm")
text = "The rain in Spain falls mainly on the plain."doc = nlp(text)for token in doc:????print(token.text,?token.lemma_,?token.pos_,?token.is_stop)
The the DET True
rain rain NOUN False
in in ADP True
Spain Spain PROPN False
falls fall VERB False
mainly mainly ADV False
on on ADP True
the the DET True
plain plain NOUN False
. . PUNCT False
import pandas as pd

cols = ("text", "lemma", "POS", "explain", "stopword")
rows = []

for t in doc:
    row = [t.text, t.lemma_, t.pos_, spacy.explain(t.pos_), t.is_stop]
    rows.append(row)

df = pd.DataFrame(rows, columns=cols)
df
the raw text
the lemma, i.e. the root form of the word
the part-of-speech tag
a stopword flag, i.e. whether the token is a common word that may be filtered out
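If any of the tag abbreviations look unfamiliar, spacy.explain() (already used for the "explain" column above) returns a short human-readable description of a tag or label; for example:

import spacy

# look up short descriptions for part-of-speech tags
for label in ("PROPN", "ADP", "PUNCT"):
    print(label, "->", spacy.explain(label))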
from spacy import displacy

displacy.render(doc, style="dep")
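When running outside a notebook, displacy.render() can hand back the markup as a string instead of drawing it inline; a small sketch for saving the dependency parse (the file name here is just an example):

from spacy import displacy

# with jupyter=False, render() returns the SVG markup instead of displaying it
svg = displacy.render(doc, style="dep", jupyter=False)

with open("dep_parse.svg", "w", encoding="utf-8") as f:
    f.write(svg)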

text = "We were all out at the zoo one day, I was doing some acting, walking on the railing of the gorilla exhibit. I fell in. Everyone screamed and Tommy jumped in after me, forgetting that he had blueberries in his front pocket. The gorillas just went wild."doc = nlp(text)for sent in doc.sents:print(">", sent)
> We were all out at the zoo one day, I was doing some acting, walking on the railing of the gorilla exhibit.
> I fell in.
> Everyone screamed and Tommy jumped in after me, forgetting that he had blueberries in his front pocket.
> The gorillas just went wild.
for sent in doc.sents:
    print(">", sent.start, sent.end)
doc[48:54]

The gorillas just went wild.
token = doc[51]
print(token.text, token.lemma_, token.pos_)

went go VERB
import sys
import warnings

warnings.filterwarnings("ignore")
from bs4 import BeautifulSoup
import requests
import traceback

def get_text (url):
    buf = []

    try:
        soup = BeautifulSoup(requests.get(url).text, "html.parser")

        for p in soup.find_all("p"):
            buf.append(p.get_text())

        return "\n".join(buf)
    except:
        print(traceback.format_exc())
        sys.exit(-1)
https://opensource.org/licenses/
lic = {}
lic["mit"] = nlp(get_text("https://opensource.org/licenses/MIT"))
lic["asl"] = nlp(get_text("https://opensource.org/licenses/Apache-2.0"))
lic["bsd"] = nlp(get_text("https://opensource.org/licenses/BSD-3-Clause"))

for sent in lic["bsd"].sents:
    print(">", sent)
> SPDX short identifier: BSD-3-Clause
> Note: This license has also been called the "New BSD License" or "Modified BSD License"
> See also the 2-clause BSD License.
…
pairs = [
    ["mit", "asl"],
    ["asl", "bsd"],
    ["bsd", "mit"]
]

for a, b in pairs:
    print(a, b, lic[a].similarity(lic[b]))
mit asl 0.9482039305669306
asl bsd 0.9391555350757145
bsd mit 0.9895838089575453
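These similarity scores are based on averaged word vectors, and the small en_core_web_sm model does not ship with full pretrained vectors, so for more meaningful comparisons a medium or large model can be swapped in; a sketch, assuming en_core_web_md has been downloaded:

import spacy

# the *_md and *_lg English models include pretrained word vectors
nlp_md = spacy.load("en_core_web_md")

doc_mit = nlp_md(get_text("https://opensource.org/licenses/MIT"))
doc_bsd = nlp_md(get_text("https://opensource.org/licenses/BSD-3-Clause"))

print(doc_mit.similarity(doc_bsd))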
Now let's dig into spaCy's natural language understanding (NLU) features. Given a document to parse, from a purely grammatical standpoint we can extract the noun chunks (https://spacy.io/usage/linguistic-features#noun-chunks), i.e. each of the noun phrases:
text = "Steve Jobs and Steve Wozniak incorporated Apple Computer on January 3, 1977, in Cupertino, California."doc = nlp(text)for chunk in doc.noun_chunks:print(chunk.text)
Steve Jobs
Steve Wozniak
Apple Computer
January
Cupertino
California
for ent in doc.ents:
    print(ent.text, ent.label_)
displacy.render(doc, style="ent")
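The entity labels (e.g. PERSON, ORG, DATE, GPE) can also be expanded with spacy.explain(); a quick sketch:

for ent in doc.ents:
    # print each entity together with a human-readable description of its label
    print(ent.text, ent.label_, spacy.explain(ent.label_))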

import nltk

nltk.download("wordnet")

[nltk_data] Downloading package wordnet to /home/ceteri/nltk_data...
[nltk_data] Package wordnet is already up-to-date!
True
from spacy_wordnet.wordnet_annotator import WordnetAnnotator

print("before", nlp.pipe_names)

if "WordnetAnnotator" not in nlp.pipe_names:
    nlp.add_pipe(WordnetAnnotator(nlp.lang), after="tagger")

print("after", nlp.pipe_names)

before ['tagger', 'parser', 'ner']
after ['tagger', 'WordnetAnnotator', 'parser', 'ner']
token = nlp("withdraw")[0]token._.wordnet.synsets()
[Synset('withdraw.v.01'),
 Synset('retire.v.02'),
 Synset('disengage.v.01'),
 Synset('recall.v.07'),
 Synset('swallow.v.05'),
 Synset('seclude.v.01'),
 Synset('adjourn.v.02'),
 Synset('bow_out.v.02'),
 Synset('withdraw.v.09'),
 Synset('retire.v.08'),
 Synset('retreat.v.04'),
 Synset('remove.v.01')]
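Each synset also carries a WordNet gloss; a short sketch using NLTK's Synset.definition() to print the first few:

# print the name and gloss for the first few synsets of "withdraw"
for s in token._.wordnet.synsets()[:3]:
    print(s.name(), "-", s.definition())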
token._.wordnet.lemmas()
[Lemma('withdraw.v.01.withdraw'),
 Lemma('withdraw.v.01.retreat'),
 Lemma('withdraw.v.01.pull_away'),
 Lemma('withdraw.v.01.draw_back'),
 Lemma('withdraw.v.01.recede'),
 Lemma('withdraw.v.01.pull_back'),
 Lemma('withdraw.v.01.retire'),
 …
token._.wordnet.wordnet_domains()
['astronomy',
 'school',
 'telegraphy',
 'industry',
 'psychology',
 'ethnology',
 'ethnology',
 'administration',
 'school',
 'finance',
 'economy',
 'exchange',
 'banking',
 'commerce',
 'medicine',
 'ethnology',
 'university',
 …
domains = ["finance", "banking"]
sentence = nlp("I want to withdraw 5,000 euros.")

enriched_sent = []

for token in sentence:
    # get synsets within the desired domains
    synsets = token._.wordnet.wordnet_synsets_for_domain(domains)

    if synsets:
        lemmas_for_synset = []

        for s in synsets:
            # get synset variants and add to the enriched sentence
            lemmas_for_synset.extend(s.lemma_names())

        enriched_sent.append("({})".format("|".join(set(lemmas_for_synset))))
    else:
        enriched_sent.append(token.text)

print(" ".join(enriched_sent))
I (require|want|need) to (draw_off|withdraw|draw|take_out) 5,000 euros .
import scattertext as st

if "merge_entities" not in nlp.pipe_names:
    nlp.add_pipe(nlp.create_pipe("merge_entities"))

if "merge_noun_chunks" not in nlp.pipe_names:
    nlp.add_pipe(nlp.create_pipe("merge_noun_chunks"))

convention_df = st.SampleCorpora.ConventionData2012.get_data()
corpus = st.CorpusFromPandas(convention_df,
                             category_col="party",
                             text_col="text",
                             nlp=nlp).build()
html = st.produce_scattertext_explorer(corpus,
                                       category="democrat",
                                       category_name="Democratic",
                                       not_category_name="Republican",
                                       width_in_pixels=1000,
                                       metadata=convention_df["speaker"])
from IPython.display import IFrame

file_name = "foo.html"

with open(file_name, "wb") as f:
    f.write(html.encode("utf-8"))

IFrame(src=file_name, width=1200, height=700)
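Outside of Jupyter, the generated page can simply be opened in a browser instead of being embedded in an IFrame; a minimal sketch:

import os
import webbrowser

# open the saved scattertext explorer page in the default browser
webbrowser.open("file://" + os.path.realpath(file_name))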

Summary
It is worth noting that machine learning for natural language advanced rapidly starting in the mid-2000s, as Google began winning international language translation competitions. Another major shift came during 2017-2018, when, on the back of the many successes of deep learning, those approaches began to outperform earlier machine learning models.
For example, consider the ELMo language embedding model from Allen AI research, followed by Google's BERT (https://ai.googleblog.com/2018/11/open-sourcing-bert-state-of-art-pre.html), and most recently by …
END
Reposted from the 数据派THU WeChat public account.
Copyright notice: part of the content on this account comes from the internet. When reposting, please credit the original link and author; if anything infringes or is mis-attributed, please contact us.