1. When a custom keyword contains spaces or special symbols, jieba cannot recognize it as a single word.
2. I found a solution on GitHub (shared freely, use at your own risk): modify jieba's source file __init__.py.
Open the default dictionary (in the jieba root directory) or your custom dictionary, and change every space that separates a word from its frequency and part-of-speech tag to @@.
(@@ is chosen because this delimiter is unlikely to appear in ordinary keywords.)
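For example, a dictionary entry like the following (the word, the frequency 3, and the tag n are just my own illustration, not from the GitHub post)
happy dog 3 n
would become
happy dog@@3@@n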
Then open __init__.py in the jieba root directory. Each change below shows the original line followed by the modified line:
re_han_default = re.compile("([\u4E00-\u9FD5a-zA-Z0-9+-sharp&\._]+)", re.U)
re_han_default = re.compile("(.+)", re.U)
re_userdict = re.compile("^(.+?)( [0-9]+)?( [a-z]+)?$", re.U)
re_userdict = re.compile("^(.+?)(\u0040\u0040[0-9]+)?(\u0040\u0040[a-z]+)?$", re.U)
word, freq = line.split(" ")[:2]
word, freq = line.split("\u0040\u0040")[:2]
Similarly:
re_han_cut_all = re.compile("([\u4E00-\u9FD5]+)", re.U)
re_han_cut_all = re.compile("(.+)", re.U)
But after changing these regular expressions to (.+), a large number of emoji and unwanted symbols such as =, (, and ) end up in the segmentation output.
3. Expected output
I just want jieba to recognize Chinese and English keywords in my custom dictionary that contain spaces or are joined with -, and to strip out other special characters such as emoji.
How should I modify it? For example:
string = "my dog is a happy dog"
jieba.add_word("happy dog")
jieba.cut(string)
current output: ["my", "dog", "is", "a", "happy", "dog"]
expected output: ["my", "dog", "is", "a", "happy dog"]
I'm really not good at regular expressions. I hope someone with more experience can tell me how to modify this.