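A note on an Elasticsearch IK analyzer performance problem

Posted on December 6, 2019 by yuer

We recently hit slow Elasticsearch writes in production and traced them to one specific document, which took between 8 and 20 seconds to index.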

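Symptoms and diagnosis

After reproducing the problem, I pinned down the trigger conditions:

A text field analyzed in ik_smart mode.
More than 8,000 repeated "哈" characters.

In other words, ik_smart performs very poorly on long runs of a repeated word.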

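How to reproduce

Anyone can try this on their own ES cluster.

Download the POST test data: https://yuerblog.cc/wp-content/uploads/data.txt

Send the Elasticsearch analyze request: curl -H 'Content-Type: application/json' 'http://localhost:9200/_analyze' -d @data.txt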
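If you would rather not download the file, the request can be approximated with a short script. This is a minimal sketch, not the author's tooling: it assumes data.txt is a JSON body for the _analyze API containing an analyzer name and the repeated text, and the helper names build_analyze_body, analyze and compare_analyzers are made up for illustration.

```python
import json
import time
import urllib.request


def build_analyze_body(char="哈", count=8000, analyzer="ik_smart"):
    """Build an _analyze request body: one character repeated `count` times."""
    payload = {"analyzer": analyzer, "text": char * count}
    return json.dumps(payload, ensure_ascii=False).encode("utf-8")


def analyze(body, url="http://localhost:9200/_analyze"):
    """POST the body to Elasticsearch; return (elapsed_seconds, token_count)."""
    req = urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    start = time.perf_counter()
    with urllib.request.urlopen(req) as resp:
        result = json.load(resp)
    return time.perf_counter() - start, len(result["tokens"])


def compare_analyzers():
    """Time both IK analyzers on the same input (needs a running ES with IK)."""
    for name in ("ik_max_word", "ik_smart"):
        elapsed, tokens = analyze(build_analyze_body(analyzer=name))
        print(f"{name}: {tokens} tokens in {elapsed:.2f}s")
```

Calling compare_analyzers() against a test cluster with the IK plugin installed should make the gap between the two analyzers visible; avoid running it against production.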

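First, the difference between ik_max_word and ik_smart

In this scenario ik_max_word has no performance problem, while ik_smart does.

ik_max_word

ik_max_word splits the text into as many words as it can find. For example:

中华人民共和国我爱我的祖国

Analyzing it with ik_max_word yields: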

{
	"tokens": [{
		"token": "中华人民共和国",
		"start_offset": 0,
		"end_offset": 7,
		"type": "CN_WORD",
		"position": 0
	}, {
		"token": "中华人民",
		"start_offset": 0,
		"end_offset": 4,
		"type": "CN_WORD",
		"position": 1
	}, {
		"token": "中华",
		"start_offset": 0,
		"end_offset": 2,
		"type": "CN_WORD",
		"position": 2
	}, {
		"token": "华人",
		"start_offset": 1,
		"end_offset": 3,
		"type": "CN_WORD",
		"position": 3
	}, {
		"token": "人民共和国",
		"start_offset": 2,
		"end_offset": 7,
		"type": "CN_WORD",
		"position": 4
	}, {
		"token": "人民",
		"start_offset": 2,
		"end_offset": 4,
		"type": "CN_WORD",
		"position": 5
	}, {
		"token": "共和国",
		"start_offset": 4,
		"end_offset": 7,
		"type": "CN_WORD",
		"position": 6
	}, {
		"token": "共和",
		"start_offset": 4,
		"end_offset": 6,
		"type": "CN_WORD",
		"position": 7
	}, {
		"token": "国",
		"start_offset": 6,
		"end_offset": 7,
		"type": "CN_CHAR",
		"position": 8
	}, {
		"token": "我",
		"start_offset": 8,
		"end_offset": 9,
		"type": "CN_CHAR",
		"position": 9
	}, {
		"token": "爱我",
		"start_offset": 9,
		"end_offset": 11,
		"type": "CN_WORD",
		"position": 10
	}, {
		"token": "祖国",
		"start_offset": 12,
		"end_offset": 14,
		"type": "CN_WORD",
		"position": 11
	}, {
		"token": "祖",
		"start_offset": 12,
		"end_offset": 13,
		"type": "CN_WORD",
		"position": 12
	}, {
		"token": "国",
		"start_offset": 13,
		"end_offset": 14,
		"type": "CN_CHAR",
		"position": 13
	}]
}

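Because ik_max_word emits so many combinations, a query is much more likely to recall the document.

ik_smart

ik_smart runs into trouble with 8,000 repeated "哈" characters because it does extra computation on top of the ik_max_word result, and that extra step is where the time goes. For the same sentence, ik_smart yields: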

{
	"tokens": [{
		"token": "中华人民共和国",
		"start_offset": 0,
		"end_offset": 7,
		"type": "CN_WORD",
		"position": 0
	}, {
		"token": "我",
		"start_offset": 8,
		"end_offset": 9,
		"type": "CN_CHAR",
		"position": 1
	}, {
		"token": "爱我",
		"start_offset": 9,
		"end_offset": 11,
		"type": "CN_WORD",
		"position": 2
	}, {
		"token": "祖国",
		"start_offset": 12,
		"end_offset": 14,
		"type": "CN_WORD",
		"position": 3
	}]
}

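Notice that after ik_smart has processed the ik_max_word result, the remaining terms no longer overlap each other.

This pruning of overlapping terms, known as ambiguity resolution, is the culprit behind the slowdown.

Here is how the ik_max_word output for 8,000 "哈" characters overlaps (only the first few tokens are shown):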

{
	"tokens": [{
		"token": "哈哈哈哈",
		"start_offset": 0,
		"end_offset": 4,
		"type": "CN_WORD",
		"position": 0
	}, {
		"token": "哈哈哈",
		"start_offset": 0,
		"end_offset": 3,
		"type": "CN_WORD",
		"position": 1
	}, {
		"token": "哈哈",
		"start_offset": 0,
		"end_offset": 2,
		"type": "CN_WORD",
		"position": 2
	}, {
		"token": "哈哈哈哈",
		"start_offset": 1,
		"end_offset": 5,
		"type": "CN_WORD",
		"position": 3
	}, {
		"token": "哈哈哈",
		"start_offset": 1,
		"end_offset": 4,
		"type": "CN_WORD",
		"position": 4
	}, {
		"token": "哈哈",
		"start_offset": 1,
		"end_offset": 3,
		"type": "CN_WORD",
		"position": 5
	}, {
		"token": "哈哈哈哈",
		"start_offset": 2,
		"end_offset": 6,
		"type": "CN_WORD",
		"position": 6
	}, {
		"token": "哈哈哈",
		"start_offset": 2,
		"end_offset": 5,
		"type": "CN_WORD",
		"position": 7
	}, {
		"token": "哈哈",
		"start_offset": 2,
		"end_offset": 4,
		"type": "CN_WORD",
		"position": 8
	}

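The start_offset and end_offset ranges of neighboring terms chain together in one unbroken run of overlaps.

When resolving ambiguity, ik_smart walks the ik_max_word token list in order. It first collects each set of terms that are chained together by overlaps, then searches each set for the best non-overlapping subset, which becomes that set's ik_smart output. With a repeated word the overlap set becomes huge, so the cost of enumerating non-overlapping subsets explodes.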
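The grouping step can be illustrated with a small simulation. This is a simplified sketch and not IK's actual code: tokens are reduced to (start_offset, end_offset) pairs, max_word_tokens is a hypothetical stand-in that assumes the dictionary contains the repeated word at lengths 2, 3 and 4 (as in the output above), and crossing_groups only performs the grouping, not the search for the best non-overlapping subset.

```python
def max_word_tokens(n, word_lengths=(4, 3, 2)):
    """Simulate ik_max_word-style tokens for one character repeated n times,
    assuming a dictionary word exists at every length in word_lengths."""
    return [(i, i + k) for i in range(n) for k in word_lengths if i + k <= n]


def crossing_groups(tokens):
    """Split (start, end) tokens into groups chained together by overlaps."""
    groups = []
    group_end = -1  # furthest end offset reached by the current group
    for start, end in sorted(tokens, key=lambda t: (t[0], -t[1])):
        if groups and start < group_end:
            # Token begins inside the current group's span: it crosses the group.
            groups[-1].append((start, end))
            group_end = max(group_end, end)
        else:
            # No overlap with the current group: start a new one.
            groups.append([(start, end)])
            group_end = end
    return groups


# Offsets copied from the ik_max_word output for the example sentence.
sentence_tokens = [
    (0, 7), (0, 4), (0, 2), (1, 3), (2, 7), (2, 4), (4, 7), (4, 6), (6, 7),
    (8, 9), (9, 11), (12, 14), (12, 13), (13, 14),
]
print([len(g) for g in crossing_groups(sentence_tokens)])        # [9, 1, 1, 3]
print([len(g) for g in crossing_groups(max_word_tokens(8000))])  # [23994]
```

An ordinary sentence breaks into several small groups, so each group is cheap to resolve. Under the stated assumptions, the 8,000-character input collapses into a single group of 23,994 tokens, and any per-group search then has to work on that entire set at once.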

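Solution

Index with ik_max_word at write time, and analyze with ik_smart at query time while limiting the length of the query string (so external users cannot attack you with repeated-word input).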
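One way to express this setup is through the field mapping's analyzer and search_analyzer parameters plus a length guard on the application side. The sketch below makes several assumptions: the field name content, the 64-character limit and the helper names are all hypothetical, and the mapping uses the typeless format of Elasticsearch 7.x, so older versions would need a type name under mappings.

```python
import json

MAX_QUERY_CHARS = 64  # hypothetical limit; tune it for your own traffic


def build_index_body(field="content"):
    """Index settings: ik_max_word when indexing, ik_smart when searching."""
    return {
        "mappings": {
            "properties": {
                field: {
                    "type": "text",
                    "analyzer": "ik_max_word",
                    "search_analyzer": "ik_smart",
                }
            }
        }
    }


def clamp_query(text, max_chars=MAX_QUERY_CHARS):
    """Truncate user input so ik_smart never sees a very long repeated run."""
    return text[:max_chars]


def build_search_body(user_input, field="content"):
    """Match query built from length-limited user input."""
    return {"query": {"match": {field: clamp_query(user_input)}}}


print(json.dumps(build_index_body(), indent=2))
print(len(build_search_body("哈" * 8000)["query"]["match"]["content"]))  # 64
```

Truncation is the simplest guard; rejecting over-long queries with an error would work as well and makes the limit visible to callers.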

Links

The issue I filed against the IK analyzer: https://github.com/medcl/elasticsearch-analysis-ik/issues/740
How the IK analyzer works: https://blog.csdn.net/wl044090432/article/details/71723051

