Document normalization (normalization)
Normalizing documents improves recall.
Example code
# normalization
GET _analyze
{
  "text": "Mr. Ma is an excellent teacher",
  "analyzer": "english"
}
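For reference, the english analyzer lowercases tokens, drops English stop words such as "is" and "an", and stems the remainder; on this sentence the output should be roughly mr, ma, excel (the stem of excellent), teacher. This is what lets differently inflected forms of a word match the same documents.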
Character filter
Preprocessing applied before tokenization; filters out useless characters.
HTML strip character filter
Official reference
HTML strip character filter | Elasticsearch Guide [8.11] | Elastic
Example code
GET /_analyze
{
  "tokenizer": "keyword",
  "char_filter": [ "html_strip" ],
  "text": "<p>I&apos;m so <b>happy</b>!</p>"
}
Mapping character filter (MappingCharFilter)
Official reference
Mapping character filter | Elasticsearch Guide [8.11] | Elastic
Example code
PUT my_index
{
  "settings": {
    "analysis": {
      "char_filter": {
        "my_char_filter": {
          "type": "mapping",
          "mappings": [
            "滚 => *",
            "垃 => *",
            "圾 => *"
          ]
        }
      },
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "keyword",
          "char_filter": [ "my_char_filter" ]
        }
      }
    }
  }
}

GET my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "你就是个垃圾!滚"
}
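Because the keyword tokenizer emits the whole input as one token, the response should contain a single token with each mapped character replaced, roughly 你就是个**!* (each of 垃, 圾, 滚 masked with *).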
Pattern replace character filter
Official reference
Pattern replace character filter | Elasticsearch Guide [8.11] | Elastic
Example code
PUT my_index
{
  "settings": {
    "analysis": {
      "char_filter": {
        "my_char_filter": {
          "type": "pattern_replace",
          "pattern": "(\\d{3})\\d{4}(\\d{4})",
          "replacement": "$1****$2"
        }
      },
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "keyword",
          "char_filter": [ "my_char_filter" ]
        }
      }
    }
  }
}

GET my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "您的手机号是17611001200"
}
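Two details worth noting: \d must be escaped as \\d inside a JSON string, and keeping the capture groups in the replacement ($1****$2) masks only the middle four digits, so the analyzed output should read 您的手机号是176****1200.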
Token filter
Handles stop words, tense normalization, case conversion, synonym conversion, filler-word removal, and so on. For example: has => have, him => he, apples => apple.
Example code
# stop words
PUT /test_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "type": "standard",
          "stopwords": [ "me", "you" ]
        }
      }
    }
  }
}

GET test_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": ["Teacher me and you in the china"]
}
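The description above also mentions synonym conversion; here is a minimal sketch of a synonym token filter (the index name synonym_index and the synonym pairs are illustrative, not from the original):

# synonym token filter (sketch; index name and synonym pairs are illustrative)
PUT /synonym_index
{
  "settings": {
    "analysis": {
      "filter": {
        "my_synonym": {
          "type": "synonym",
          "synonyms": [ "mother, mom", "ps => playstation" ]
        }
      },
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "standard",
          "filter": [ "lowercase", "my_synonym" ]
        }
      }
    }
  }
}

GET synonym_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "ps"
}

With the => syntax, ps is replaced by playstation at analysis time; comma-separated entries are treated as equivalent terms.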
Tokenizer
Splits the text into individual tokens.
Official reference
Tokenizer reference | Elasticsearch Guide [8.11] | Elastic
Common tokenizers
- standard analyzer: the default analyzer; weak for Chinese, since it splits character by character.
- pattern tokenizer: splits text into terms on separators matched by a regular expression.
- simple_pattern tokenizer: matches the terms themselves with a restricted regular expression; faster than the pattern tokenizer.
- whitespace analyzer: splits on whitespace.
- IK analyzer: a Chinese analyzer (GitHub: medcl/elasticsearch-analysis-ik; the IK Analysis plugin integrates the Lucene IK analyzer into Elasticsearch and supports custom dictionaries).
Example code
# tokenizer
POST _analyze
{
  "analyzer": "ik_max_word",
  "text": "小孩儿不能吃糖"
}
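Assuming the IK plugin is installed, it ships two analyzers: ik_max_word enumerates all dictionary matches (finest granularity, useful for indexing), while ik_smart produces a coarser split (useful for search queries). Running the same sentence through ik_smart shows the difference:

# ik_smart: coarse-grained Chinese segmentation (requires the IK plugin)
POST _analyze
{
  "analyzer": "ik_smart",
  "text": "小孩儿不能吃糖"
}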
Custom analyzer
A custom analyzer combines three parts:
- char_filter: built-in or custom character filters.
- token filter: built-in or custom token filters.
- tokenizer: a built-in or custom tokenizer.
Example code
PUT custom_analysis
{
  "settings": {
    "analysis": {
      "char_filter": {
        "my_char_filter": {
          "type": "mapping",
          "mappings": [ "& => and", "| => or" ]
        },
        "html_strip_char_filter": {
          "type": "html_strip",
          "escaped_tags": [ "a" ]
        }
      },
      "filter": {
        "my_stopword": {
          "type": "stop",
          "stopwords": [ "is", "in", "the", "a", "at", "for" ]
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "pattern",
          "pattern": "[ ,.!?]"
        }
      },
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "char_filter": [ "my_char_filter", "html_strip_char_filter" ],
          "filter": [ "my_stopword", "lowercase" ],
          "tokenizer": "my_tokenizer"
        }
      }
    }
  }
}

GET custom_analysis/_analyze
{
  "analyzer": "my_analyzer",
  "text": ["What is ,as.df ss in ? & | is ! in the a at for "]
}
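Note the execution order inside a custom analyzer: character filters run first on the raw text, then the tokenizer splits it into tokens, and finally the token filters transform each token. That is why & => and is applied before my_stopword and lowercase ever see a token.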