Persistent memory leak in some business PODs on K8S
A small number of PHP services in our production K8S cluster show POD memory that keeps climbing until OOM. This is most likely tied to some unusual code behavior, so it deserves a proper analysis.
I decided to start from how POD memory is actually monitored and accounted for, and work out where the memory is really going.
Analysis process
I have broken the investigation into steps; this is also the order in which I actually worked through it.
step 1
Containerization relies on cgroups to limit memory, and Docker's container memory metrics are also collected via cgroups, so the first thing to understand is how cgroups work. The core model is:
A cgroup hierarchy (a tree, which in practice is just a directory tree) has to be created first. The OS can have multiple hierarchies, and each hierarchy can have any number of subsystems (cpu, memory, io, ...) attached to it, but each subsystem can only be attached to one hierarchy in the whole system, never to several at once.
Put simply, if cgroups only had the cpu, memory and io subsystems, then across the whole system:
1) At most 3 cgroup hierarchies can be mounted, each managing a single subsystem.
2) At minimum 1 hierarchy is mounted, managing all 3 subsystems (see the mount sketch below).
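To make the constraint concrete, here is a minimal sketch. The paths are illustrative, and on a typical CentOS node these controllers are already claimed by systemd's default hierarchies, so the mounts below only succeed on an unconfigured environment:

# one hierarchy managing several subsystems
mkdir -p /tmp/cg-all
mount -t cgroup -o cpu,memory,blkio none /tmp/cg-all

# ... or one hierarchy per subsystem
mkdir -p /tmp/cg-mem
mount -t cgroup -o memory none /tmp/cg-mem   # fails with EBUSY if memory is already attached to another hierarchy

# which hierarchy ID each subsystem is currently attached to
cat /proc/cgroups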
step 2
In practice, the standard setup is to give each subsystem its own hierarchy, and then create child cgroups inside that hierarchy to apply resource limits.
CentOS creates a set of such hierarchies by default, one per subsystem, and K8S simply creates subdirectories inside them to use the cgroup facilities.
[root@10-42-53-112 ~]# ll /sys/fs/cgroup/
total 0
dr-xr-xr-x 7 root root 0 Jul 6 10:26 blkio
lrwxrwxrwx 1 root root 11 May 17 17:05 cpu -> cpu,cpuacct
lrwxrwxrwx 1 root root 11 May 17 17:05 cpuacct -> cpu,cpuacct
dr-xr-xr-x 7 root root 0 Jul 6 10:26 cpu,cpuacct
dr-xr-xr-x 5 root root 0 Jul 6 10:26 cpuset
dr-xr-xr-x 7 root root 0 Jul 6 10:26 devices
dr-xr-xr-x 5 root root 0 Jul 6 10:26 freezer
dr-xr-xr-x 5 root root 0 Jul 6 10:26 hugetlb
dr-xr-xr-x 7 root root 0 Jul 6 10:26 memory
lrwxrwxrwx 1 root root 16 May 17 17:05 net_cls -> net_cls,net_prio
dr-xr-xr-x 5 root root 0 Jul 6 10:26 net_cls,net_prio
lrwxrwxrwx 1 root root 16 May 17 17:05 net_prio -> net_cls,net_prio
dr-xr-xr-x 5 root root 0 Jul 6 10:26 perf_event
dr-xr-xr-x 7 root root 0 Jul 6 10:26 pids
dr-xr-xr-x 2 root root 0 Jul 6 10:26 rdma
dr-xr-xr-x 7 root root 0 Jul 6 10:26 systemd
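As a small illustration of what K8S does inside these default hierarchies, the sketch below (run as root on a test node; the "demo" name is made up) creates a child cgroup under the memory hierarchy, gives it a limit, and attaches the current shell to it:

mkdir /sys/fs/cgroup/memory/demo                                            # a child cgroup is just a directory
echo $((256*1024*1024)) > /sys/fs/cgroup/memory/demo/memory.limit_in_bytes  # 256 MiB limit
echo $$ > /sys/fs/cgroup/memory/demo/cgroup.procs                           # move the current shell into it
cat /sys/fs/cgroup/memory/demo/memory.usage_in_bytes                        # usage is now accounted per cgroup
# cleanup: move the shell back to the root cgroup, then remove the directory
echo $$ > /sys/fs/cgroup/memory/cgroup.procs
rmdir /sys/fs/cgroup/memory/demo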
step 3
Taking memory as the example: we know a POD can declare a resource limit, but how does that actually take effect?
1) First, use docker ps to find the containers belonging to the target POD; there are at least two, the pause container and the application container.
2) Take the application container's container ID and run docker inspect; among the labels there is the POD's unique identifier (uid):
"io.kubernetes.pod.uid": "931369e9-2a87-4090-a304-dd02122e7acc",
The same docker inspect output also gives this application container's full ID, which is what we will see again later in the cgroup paths:
"Id": "7e75c3921b2157ccecc5cff5055940c782f02cb8227ae080874220bb06124dad",
The labels also tell us the ID of the pause container belonging to the same POD:
"io.kubernetes.sandbox.id": "dc9b09ac63191180ac5dca2836ebd15c82add818424ccf23417ebd16c0587a1d",
3) K8S creates a kubepods child cgroup; again taking memory as the example:
[root@10-42-53-112 ~]# ll /sys/fs/cgroup/memory/kubepods/
total 0
drwxr-xr-x 4 root root 0 Jul 6 10:26 besteffort
drwxr-xr-x 3 root root 0 Jul 6 10:26 burstable
-rw-r--r-- 1 root root 0 Jul 6 10:26 cgroup.clone_children
--w--w--w- 1 root root 0 Jul 6 10:26 cgroup.event_control
-rw-r--r-- 1 root root 0 Jul 6 10:26 cgroup.procs
-rw-r--r-- 1 root root 0 Jul 6 10:26 memory.failcnt
--w------- 1 root root 0 Jul 6 10:26 memory.force_empty
-rw-r--r-- 1 root root 0 Jul 6 10:26 memory.kmem.failcnt
-rw-r--r-- 1 root root 0 May 17 17:05 memory.kmem.limit_in_bytes
-rw-r--r-- 1 root root 0 Jul 6 10:26 memory.kmem.max_usage_in_bytes
-r--r--r-- 1 root root 0 Jul 6 10:26 memory.kmem.slabinfo
-rw-r--r-- 1 root root 0 Jul 6 10:26 memory.kmem.tcp.failcnt
-rw-r--r-- 1 root root 0 Jul 6 10:26 memory.kmem.tcp.limit_in_bytes
-rw-r--r-- 1 root root 0 Jul 6 10:26 memory.kmem.tcp.max_usage_in_bytes
-r--r--r-- 1 root root 0 Jul 6 10:26 memory.kmem.tcp.usage_in_bytes
-r--r--r-- 1 root root 0 Jul 6 10:26 memory.kmem.usage_in_bytes
-rw-r--r-- 1 root root 0 May 17 17:05 memory.limit_in_bytes
-rw-r--r-- 1 root root 0 Jul 6 10:26 memory.max_usage_in_bytes
-rw-r--r-- 1 root root 0 Jul 6 10:26 memory.memsw.failcnt
-rw-r--r-- 1 root root 0 Jul 6 10:26 memory.memsw.limit_in_bytes
-rw-r--r-- 1 root root 0 Jul 6 10:26 memory.memsw.max_usage_in_bytes
-r--r--r-- 1 root root 0 Jul 6 10:26 memory.memsw.usage_in_bytes
-rw-r--r-- 1 root root 0 Jul 6 10:26 memory.move_charge_at_immigrate
-r--r--r-- 1 root root 0 Jul 6 10:26 memory.numa_stat
-rw-r--r-- 1 root root 0 Jul 6 10:26 memory.oom_control
---------- 1 root root 0 Jul 6 10:26 memory.pressure_level
-rw-r--r-- 1 root root 0 Jul 6 10:26 memory.soft_limit_in_bytes
-r--r--r-- 1 root root 0 Jul 6 10:26 memory.stat
-rw-r--r-- 1 root root 0 Jul 6 10:26 memory.swappiness
-r--r--r-- 1 root root 0 Jul 6 10:26 memory.usage_in_bytes
-rw-r--r-- 1 root root 0 Jul 6 10:26 memory.use_hierarchy
-rw-r--r-- 1 root root 0 Jul 6 10:26 notify_on_release
drwxr-xr-x 4 root root 0 Jul 6 10:26 pod07ddb571-fbf5-496a-a391-938d1a5bdfef
drwxr-xr-x 4 root root 0 Jul 6 10:26 pod0d9d11d6-ce6f-41b5-9d89-31803fe050c6
drwxr-xr-x 4 root root 0 Jul 6 10:26 pod56c2f6e1-24fd-43f2-91f3-928f5f221f57
drwxr-xr-x 4 root root 0 Jul 6 10:26 pod5f3ea4c8-e39b-41e0-a729-20a5e98d6f7a
drwxr-xr-x 4 root root 0 Jul 6 10:26 pod79b470b0-2fa9-403e-b4b7-6e4878a5ac49
drwxr-xr-x 4 root root 0 Jul 6 10:26 pod8bad5a6c-3523-47c5-81ff-f641030d85b0
drwxr-xr-x 4 root root 0 Jul 6 10:26 pod931369e9-2a87-4090-a304-dd02122e7acc
-rw-r--r-- 1 root root 0 Jul 6 10:26 tasks
K8S resource limits are applied at the POD level, so under this cgroup K8S also creates a child memory cgroup per POD, where the concrete POD-level limit is enforced.
Before drilling into the POD-level cgroup, let's look at the memory limit at the kubepods level:
[root@10-42-53-112 ~]# cat /sys/fs/cgroup/memory/kubepods/memory.limit_in_bytes
32457519104
The total memory limit for all PODs is about 30.23G, while the host has 32G of RAM; the remaining 1G or so is not handed to this cgroup because of the memory reservations configured in kubelet.
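As a quick sanity check, the gap between the host's physical memory and the kubepods limit can be computed directly. This is only a rough sketch and assumes the kubelet reservations are what account for the difference:

total=$(awk '/MemTotal/ {print $2 * 1024}' /proc/meminfo)           # host RAM in bytes
limit=$(cat /sys/fs/cgroup/memory/kubepods/memory.limit_in_bytes)   # what all PODs together may use
echo "reserved outside kubepods: $(( (total - limit) / 1024 / 1024 )) MiB"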
step 4
With the POD uid found above, we can now locate the POD-level cgroup:
[root@10-42-53-112 ~]# cat /sys/fs/cgroup/memory/kubepods/pod931369e9-2a87-4090-a304-dd02122e7acc/memory.limit_in_bytes
2147483648
The whole POD is limited to 2G, which matches the limit defined in the Deployment YAML.
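Note that Guaranteed PODs sit directly under kubepods while Burstable/BestEffort PODs sit under a QoS subdirectory (both show up in the listing above), so when in doubt the POD's cgroup directory can simply be searched for by uid. A small sketch, assuming the cgroupfs cgroup driver as on this node (the systemd driver uses .slice names with underscores instead):

POD_UID=931369e9-2a87-4090-a304-dd02122e7acc
find /sys/fs/cgroup/memory/kubepods -maxdepth 2 -type d -name "pod${POD_UID}"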
step 5
One level below the POD are the container cgroups. What is the memory limit set to there?
cat /sys/fs/cgroup/memory/kubepods/pod931369e9-2a87-4090-a304-dd02122e7acc/7e75c3921b2157ccecc5cff5055940c782f02cb8227ae080874220bb06124dad/memory.limit_in_bytes
2147483648
It looks like it simply carries the POD-level limit: the POD only has that much memory anyway, so a single container inside it can use at most the same amount.
Why bother with container-level cgroups at all, then? At the very least, it means memory usage can be broken down and inspected per container:
[root@10-42-53-112 ~]# cat /sys/fs/cgroup/memory/kubepods/pod931369e9-2a87-4090-a304-dd02122e7acc/7e75c3921b2157ccecc5cff5055940c782f02cb8227ae080874220bb06124dad/memory.usage_in_bytes
1949036544
The application container is using about 1.8G, nearly exhausting the POD's memory limit. (docker stats shows the same per-container memory usage.)
Using the sandbox container ID found earlier (i.e. the pause container), let's check its memory usage as well:
[root@10-42-53-112 ~]# cat /sys/fs/cgroup/memory/kubepods/pod931369e9-2a87-4090-a304-dd02122e7acc/dc9b09ac63191180ac5dca2836ebd15c82add818424ccf23417ebd16c0587a1d/memory.usage_in_bytes
1089536
Only about 1M, so the pause container's memory usage is negligible.
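To avoid reading these files one by one, a small loop over the POD's cgroup directory (a sketch; cgroup v1 layout as on this node) prints the usage of every container in the POD:

POD_CG=/sys/fs/cgroup/memory/kubepods/pod931369e9-2a87-4090-a304-dd02122e7acc
for c in "$POD_CG"/*/; do                                     # each subdirectory is one container cgroup
    printf '%s %s bytes\n' "$(basename "$c")" "$(cat "$c/memory.usage_in_bytes")"
done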
step 6
But is the application container really using 1.8G? Let's look at its detailed memory statistics:
[root@10-42-53-112 ~]# cat /sys/fs/cgroup/memory/kubepods/pod931369e9-2a87-4090-a304-dd02122e7acc/7e75c3921b2157ccecc5cff5055940c782f02cb8227ae080874220bb06124dad/memory.stat
cache 60653568
rss 325496832
rss_huge 0
shmem 26628096
mapped_file 27844608
dirty 540672
writeback 946176
swap 0
pgpgin 3729620103
pgpgout 3729526233
pgfault 5994964305
pgmajfault 0
inactive_anon 27070464
active_anon 324997120
inactive_file 24436736
active_file 8134656
unevictable 0
hierarchical_memory_limit 2147483648
hierarchical_memsw_limit 2147483648
total_cache 60653568
total_rss 325496832
total_rss_huge 0
total_shmem 26628096
total_mapped_file 27844608
total_dirty 540672
total_writeback 946176
total_swap 0
total_pgpgin 3729620103
total_pgpgout 3729526233
total_pgfault 5994964305
total_pgmajfault 0
total_inactive_anon 27070464
total_active_anon 324997120
total_inactive_file 24436736
total_active_file 8134656
total_unevictable 0
total_rss plus total_cache adds up to less than 400MB (roughly 386MB). Where did the rest of the memory go?
step 7
It turns out that a cgroup's memory.usage_in_bytes counts not only rss and page cache, but also kmem, i.e. memory allocated by the kernel on behalf of the cgroup. Let's check the actual kmem usage:
[root@10-42-53-112 ~]# cat /sys/fs/cgroup/memory/kubepods/pod931369e9-2a87-4090-a304-dd02122e7acc/7e75c3921b2157ccecc5cff5055940c782f02cb8227ae080874220bb06124dad/memory.kmem.usage_in_bytes
1564602368
Sure enough, about 1.5G; added to rss and cache, that is roughly the 1.8G we saw. Why is most of this application container's memory being used by the kernel, and for what?
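The accounting can be verified with a quick sketch that sums rss, cache and kmem for the container cgroup and compares the total with usage_in_bytes (the values will not match exactly, since each read is a separate snapshot):

CG=/sys/fs/cgroup/memory/kubepods/pod931369e9-2a87-4090-a304-dd02122e7acc/7e75c3921b2157ccecc5cff5055940c782f02cb8227ae080874220bb06124dad
rss=$(awk '$1=="total_rss" {print $2}' "$CG/memory.stat")
cache=$(awk '$1=="total_cache" {print $2}' "$CG/memory.stat")
kmem=$(cat "$CG/memory.kmem.usage_in_bytes")
echo "rss+cache+kmem = $((rss + cache + kmem)), usage_in_bytes = $(cat "$CG/memory.usage_in_bytes")"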
step 8
kmem shows up as kernel slab allocations, and the application container's slabinfo can be read directly:
[root@10-42-53-112 ~]# cat /sys/fs/cgroup/memory/kubepods/pod931369e9-2a87-4090-a304-dd02122e7acc/7e75c3921b2157ccecc5cff5055940c782f02cb8227ae080874220bb06124dad/memory.kmem.slabinfo
For the container with high memory usage, the slabinfo looks like this:
[root@10-42-53-112 ~]# cat /sys/fs/cgroup/memory/kubepods/pod931369e9-2a87-4090-a304-dd02122e7acc/7e75c3921b2157ccecc5cff5055940c782f02cb8227ae080874220bb06124dad/memory.kmem.slabinfo
slabinfo - version: 2.1
# name <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>
tw_sock_TCP 544 544 240 34 2 : tunables 0 0 0 : slabdata 16 16 0
kmalloc-8192 64 64 8192 4 8 : tunables 0 0 0 : slabdata 16 16 0
hugetlbfs_inode_cache 0 0 624 52 8 : tunables 0 0 0 : slabdata 0 0 0
UDPv6 25 25 1280 25 8 : tunables 0 0 0 : slabdata 1 1 0
TCPv6 28 28 2304 14 8 : tunables 0 0 0 : slabdata 2 2 0
TCP 240 240 2176 15 8 : tunables 0 0 0 : slabdata 16 16 0
kmalloc-16 2560 2560 16 256 1 : tunables 0 0 0 : slabdata 10 10 0
radix_tree_node 5208 5208 584 56 8 : tunables 0 0 0 : slabdata 93 93 0
kmalloc-96 672 672 96 42 1 : tunables 0 0 0 : slabdata 16 16 0
kmalloc-2048 256 256 2048 16 8 : tunables 0 0 0 : slabdata 16 16 0
kmalloc-1024 512 512 1024 32 8 : tunables 0 0 0 : slabdata 16 16 0
kmalloc-192 672 672 192 42 2 : tunables 0 0 0 : slabdata 16 16 0
kmalloc-8 8192 8192 8 512 1 : tunables 0 0 0 : slabdata 16 16 0
xfs_inode 3618 4114 960 34 8 : tunables 0 0 0 : slabdata 121 121 0
ovl_inode 3300 3504 680 48 8 : tunables 0 0 0 : slabdata 73 73 0
kmalloc-32 2048 2048 32 128 1 : tunables 0 0 0 : slabdata 16 16 0
eventpoll_pwq 896 896 72 56 1 : tunables 0 0 0 : slabdata 16 16 0
kmalloc-64 1024 1024 64 64 1 : tunables 0 0 0 : slabdata 16 16 0
kmalloc-4096 168 168 4096 8 8 : tunables 0 0 0 : slabdata 21 21 0
pde_opener 1632 1632 40 102 1 : tunables 0 0 0 : slabdata 16 16 0
kmalloc-512 576 576 512 32 4 : tunables 0 0 0 : slabdata 18 18 0
skbuff_head_cache 640 640 256 32 2 : tunables 0 0 0 : slabdata 20 20 0
uts_namespace 0 0 440 37 4 : tunables 0 0 0 : slabdata 0 0 0
inode_cache 864 864 600 54 8 : tunables 0 0 0 : slabdata 16 16 0
pid 608 608 128 32 1 : tunables 0 0 0 : slabdata 19 19 0
signal_cache 510 510 1088 30 8 : tunables 0 0 0 : slabdata 17 17 0
sighand_cache 255 255 2112 15 8 : tunables 0 0 0 : slabdata 17 17 0
files_cache 736 736 704 46 8 : tunables 0 0 0 : slabdata 16 16 0
task_struct 173 200 7808 4 8 : tunables 0 0 0 : slabdata 50 50 0
UNIX 576 576 1024 32 8 : tunables 0 0 0 : slabdata 18 18 0
sock_inode_cache 736 736 704 46 8 : tunables 0 0 0 : slabdata 16 16 0
mm_struct 512 512 1024 32 8 : tunables 0 0 0 : slabdata 16 16 0
cred_jar 2394 2394 192 42 2 : tunables 0 0 0 : slabdata 57 57 0
shmem_inode_cache 414 414 704 46 8 : tunables 0 0 0 : slabdata 9 9 0
proc_inode_cache 4419 4512 672 48 8 : tunables 0 0 0 : slabdata 94 94 0
dentry 7900536 7900536 192 42 2 : tunables 0 0 0 : slabdata 188108 188108 0
filp 5344 5344 256 32 2 : tunables 0 0 0 : slabdata 167 167 0
anon_vma 7200 7452 88 46 1 : tunables 0 0 0 : slabdata 162 162 0
anon_vma_chain 17088 17088 64 64 1 : tunables 0 0 0 : slabdata 267 267 0
vm_area_struct 11486 11560 200 40 2 : tunables 0 0 0 : slabdata 289 289 0
And for a container with low memory usage, the slabinfo looks like this:
[root@10-10-67-233 ~]# cat /sys/fs/cgroup/memory/kubepods/pod180b0a55-7c9a-45d3-a13b-01b654dce11a/c4d6ac72bfb3c98cb901a9a4c0d6a39408b16f3b68c0a142dbc507f07e1366ec/memory.kmem.slabinfo
slabinfo - version: 2.1
# name <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>
tw_sock_TCP 238 238 240 34 2 : tunables 0 0 0 : slabdata 7 7 0
kmalloc-8192 64 64 8192 4 8 : tunables 0 0 0 : slabdata 16 16 0
hugetlbfs_inode_cache 0 0 624 52 8 : tunables 0 0 0 : slabdata 0 0 0
UDPv6 0 0 1280 25 8 : tunables 0 0 0 : slabdata 0 0 0
TCPv6 0 0 2304 14 8 : tunables 0 0 0 : slabdata 0 0 0
TCP 240 240 2176 15 8 : tunables 0 0 0 : slabdata 16 16 0
kmalloc-16 3840 3840 16 256 1 : tunables 0 0 0 : slabdata 15 15 0
kmalloc-96 672 672 96 42 1 : tunables 0 0 0 : slabdata 16 16 0
kmalloc-2048 256 256 2048 16 8 : tunables 0 0 0 : slabdata 16 16 0
radix_tree_node 1400 1400 584 56 8 : tunables 0 0 0 : slabdata 25 25 0
kmalloc-1024 512 512 1024 32 8 : tunables 0 0 0 : slabdata 16 16 0
kmalloc-192 672 672 192 42 2 : tunables 0 0 0 : slabdata 16 16 0
kmalloc-8 8192 8192 8 512 1 : tunables 0 0 0 : slabdata 16 16 0
xfs_inode 578 578 960 34 8 : tunables 0 0 0 : slabdata 17 17 0
ovl_inode 1056 1056 680 48 8 : tunables 0 0 0 : slabdata 22 22 0
kmalloc-32 2048 2048 32 128 1 : tunables 0 0 0 : slabdata 16 16 0
eventpoll_pwq 896 896 72 56 1 : tunables 0 0 0 : slabdata 16 16 0
kmalloc-64 1024 1024 64 64 1 : tunables 0 0 0 : slabdata 16 16 0
kmalloc-4096 136 136 4096 8 8 : tunables 0 0 0 : slabdata 17 17 0
pde_opener 1632 1632 40 102 1 : tunables 0 0 0 : slabdata 16 16 0
kmalloc-512 512 512 512 32 4 : tunables 0 0 0 : slabdata 16 16 0
skbuff_head_cache 512 512 256 32 2 : tunables 0 0 0 : slabdata 16 16 0
uts_namespace 0 0 440 37 4 : tunables 0 0 0 : slabdata 0 0 0
inode_cache 864 864 600 54 8 : tunables 0 0 0 : slabdata 16 16 0
pid 544 544 128 32 1 : tunables 0 0 0 : slabdata 17 17 0
signal_cache 480 480 1088 30 8 : tunables 0 0 0 : slabdata 16 16 0
sighand_cache 255 255 2112 15 8 : tunables 0 0 0 : slabdata 17 17 0
files_cache 736 736 704 46 8 : tunables 0 0 0 : slabdata 16 16 0
task_struct 208 212 7808 4 8 : tunables 0 0 0 : slabdata 53 53 0
UNIX 512 512 1024 32 8 : tunables 0 0 0 : slabdata 16 16 0
sock_inode_cache 736 736 704 46 8 : tunables 0 0 0 : slabdata 16 16 0
mm_struct 512 512 1024 32 8 : tunables 0 0 0 : slabdata 16 16 0
cred_jar 2016 2016 192 42 2 : tunables 0 0 0 : slabdata 48 48 0
shmem_inode_cache 368 368 704 46 8 : tunables 0 0 0 : slabdata 8 8 0
proc_inode_cache 2409 2592 672 48 8 : tunables 0 0 0 : slabdata 54 54 0
dentry 1642116 1642116 192 42 2 : tunables 0 0 0 : slabdata 39098 39098 0
filp 4928 4928 256 32 2 : tunables 0 0 0 : slabdata 154 154 0
anon_vma 11914 11914 88 46 1 : tunables 0 0 0 : slabdata 259 259 0
anon_vma_chain 10816 10816 64 64 1 : tunables 0 0 0 : slabdata 169 169 0
vm_area_struct 9974 10200 200 40 2 : tunables 0 0 0 : slabdata 255 255 0
The biggest gap is in dentry memory: 7,900,536 objects * 192 bytes is roughly 1.4G, which indeed matches the missing memory. What is a dentry for? Roughly speaking, it is the kernel's directory entry cache used to speed up filesystem path lookups.
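To rank slab caches by size without doing the multiplication by hand, a small awk sketch over the cgroup's slabinfo works (num_objs * objsize, printed in MB, ignoring per-slab overhead; column positions follow the slabinfo 2.1 header):

CG=/sys/fs/cgroup/memory/kubepods/pod931369e9-2a87-4090-a304-dd02122e7acc/7e75c3921b2157ccecc5cff5055940c782f02cb8227ae080874220bb06124dad
awk 'NR>2 {printf "%-24s %10.1f MB\n", $1, $3*$4/1024/1024}' "$CG/memory.kmem.slabinfo" | sort -k2 -rn | head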
step 9
The container above holds 7.9 million dentries, about 1.4G of memory; running slabtop on the host shows roughly 30 million dentries allocated machine-wide, around 6G of memory.
Only a handful of applications show this leak, so the suspicion was that it is tied to specific code behavior. I straced php-fpm to see whether heavy file activity was driving the dentry growth:
[root@zfilter-api-smzdm-com-5657c49d7d-cgzzm ~]# strace -p 48 2>&1 |grep open
open("/tmp/phpn1quPl", O_RDWR|O_CREAT|O_EXCL, 0600) = 5
open("/tmp/phphHcgMJ", O_RDWR|O_CREAT|O_EXCL, 0600) = 5
open("/tmp/phpp6Oq1f", O_RDWR|O_CREAT|O_EXCL, 0600) = 5
open("/tmp/phpPRfPXO", O_RDWR|O_CREAT|O_EXCL, 0600) = 5
open("/tmp/phpzms77o", O_RDWR|O_CREAT|O_EXCL, 0600) = 5
open("/tmp/phpDps111", O_RDWR|O_CREAT|O_EXCL, 0600) = 5
open("/tmp/phpHmbTnH", O_RDWR|O_CREAT|O_EXCL, 0600) = 5
open("/tmp/phpl7hv2u", O_RDWR|O_CREAT|O_EXCL, 0600) = 5
open("/tmp/phpRYE3jq", O_RDWR|O_CREAT|O_EXCL, 0600) = 5
open("/tmp/phpfGUqxn", O_RDWR|O_CREAT|O_EXCL, 0600) = 5
open("/tmp/phpVodQrn", O_RDWR|O_CREAT|O_EXCL, 0600) = 5
It really is creating temporary files non-stop.
Next, strace was run again with the full output saved to a log, to find the HTTP request that triggers the /tmp file creation:
strace -p 48 -s 2048 > ./debug.log 2>&1
From debug.log, the endpoint creating the temp files is clearly /comment/bgm_bulk_index, with a POST body of 102633 bytes and Content-Type application/x-www-form-urlencoded:
read(4, "\0178SCRIPT_FILENAME/data/webroot/phpsrc/zfilter-api-smzdm-com/api/index.php\f\0QUERY_STRING\16\4REQUEST_METHODPOST\f!CONTENT_TYPEapplication/x-www-form-urlencoded\16\6CONTENT_LENGTH102633\v\nSCRIPT_NAME/index.php\v\27REQUEST_URI/comment/bgm_bulk_index\f\nDOCUMENT_URI/index.php\r.DOCUMENT_ROOT/data/webroot/phpsrc/zfilter-api-smzdm-com/api\17\10SERVER_PROTOCOLHTTP/1.1\21\7GATEWAY_INTERFACECGI/1.1\17\tSERVER_SOFTWAREnginx/1.7\v\fREMOTE_ADDR10.42.53.112\v\5REMOTE_PORT49028\v\fSERVER_ADDR10.42.130.74\v\3SERVER_PORT809\v\25SERVER_NAMEzfilter-api.smzdm.com\17\3REDIRECT_STATUS200\t\31HTTP_HOSTzfilter-api.smzdm.com:809\17\22HTTP_USER_AGENTSMZDM PHP CURL 1.0\v\3HTTP_ACCEPT*/*\27VHTTP__CATCALLFROMMETHOD/data/webroot/phpsrc/phpjob-comments-job/job/index.php daemon refresh_comment_es 0 8 0\25\36HTTP__CATCALLERDOMAINphpjob.phpjob-comments-job.job\25SHTTP__CATCALLERMETHODhttp%3a%2f%2fzfilter%2dapi%2esmzdm%2ecom%3a%38%30%39%2fcomment%2fbgm%5fbulk%5findex\27-HTTP__CATCHILDMESSAGEIDzfilter-api.smzdm.com-0a2abcbf-442781-5682431\0305HTTP__CATPARENTMESSAGEIDphpjob.phpjob-comments-job.job-0a2abcbf-442781-275909\0265HTTP__CATROOTMESSAGEIDphpjob.phpjob-comments-job.job-0a2abcbf-442781-275909\23\6HTTP_CONTENT_LENGTH102633\21!HTTP_CONTENT_TYPEapplication/x-www-form-urlencoded\v\fHTTP_EXPECT100-continue\0\0", 1224) = 1224 |
The observed behavior is that php-fpm first reads 16384 bytes of data from the socket:
read(4, "doc_arr=%5B%7B%22comment_id%22%3A18677409%2C%22channel_id%22%3A3%2C%22article_id%22%3A2375099%2C%22receive_user_id%22%3A0%2C%22at_user_ids%22%3A%221604147%22%2C%22content1%22%3A%22%40%5Cu81ea%5Cu7531%5Cu843d%5Cu4f53+%5Cu80fd%5Cu52a0%5Cu4f60%5Cu597d%5Cu53cb%5Cu5417%5Cuff1f++%22%2C%22root_id%22%3A0%2C%22parent_id%22%3A0%2C%22parent_ids%22%3A%22%22%2C%22content2%22%3Anull%2C%22content3%22%3Anull%2C%22content4%22%3Anull%2C%22content5%22%3Anull%2C%22ip%22%3A%22180.156.213.241%22%2C%22remote_ip%22%3A%22%22%2C%22user_agent%22%3A%22%5Cu4ec0%5Cu4e48%5Cu503c%5Cu5f97%5Cu4e70HD+2.2.4+rv%3A3+%28iPad%3B+iPhone+OS+8.1.1%3B+zh_CN%29%22%2C%22comment_from%22%3A%22%22%2C%22reply_from%22%3A0%2C%22creation_date%22%3A%222014-12-09+20%3A45%3A27%22%2C%22card_num%22%3A0%2C%22up_num%22%3A0%2C%22down_num%22%3A0%2C%22sort_v1%22%3A0%2C%22sort_v2%22%3A0%2C%22sort_v3%22%3A0%2C%22sort_v4%22%3A0%2C%22sort_v5%22%3A0%2C%22children_ids_1%22%3A%22%22%2C%22children_ids_2%22%3A%22%22%2C%22children_ids_3%22%3A%22%22%2C%22children_ids_4%22%3A%22%22%2C%22children_ids_5%22%3A%22%22%2C%22status%22%3A1%2C%22is_locked%22%3A0%2C%22comment_card_list%22%3A%5B%5D%2C%22report%22%3A%7B%22report_latest_time%22%3A%220000-00-00+00%3A00%3A00%22%2C%22report_count%22%3A0%2C%22report_count2%22%3A0%7D%2C%22have_read%22%3A1%2C%22origin_status%22%3A0%2C%22report_log_list%22%3A%5B%5D%2C%22user_info%22%3A%7B%22user_id%22%3A2772343%2C%22nickname%22%3A%22mini%5Cu6768%5Cu4e3d%22%7D%2C%22admin_log_list%22%3A%5B%7B%22id%22%3A7061455%2C%22editor_id%22%3A187%2C%22editor_name%22%3A%22wangtao%22%2C%22ctype%22%3A3%2C%22description%22%3A%22%22%2C%22op_module%22%3A%22%22%2C%22creation_date%22%3A%222015-03-10+11%3A18%3A26%22%7D%5D%2C%22admin_first_check_time%22%3A%222015-03-10+11%3A18%3A26%22%2C%22ploy_risk_type%22%3A%22EMPTY%22%2C%22ploy_description%22%3A%22%22%2C%22risk_type_tencent%22%3A%22EMPTY%22%2C%22tencent_moderation_description%22%3A%22%22%7D%2C%7B%22comment_id%22%3A18677417%2C%22channel_id%22%3A3%2C%22article_id%22%3A2375775%2C%22receive_user_id%22%3A0%2C%22at_user_ids%22%3A%22%"..., 16384) = 16384 |
and only then does it create a temp file and start writing the subsequent data into it:
getcwd("/data/webroot/phpsrc/zfilter-api-smzdm-com/api", 4096) = 47 open("/tmp/phpxdEznL", O_RDWR|O_CREAT|O_EXCL, 0600) = 5 write(5, "doc_arr=%5B%7B%22comment_id%22%3A18677409%2C%22channel_id%22%3A3%2C%22article_id%22%3A2375099%2C%22receive_user_id%22%3A0%2C%22at_user_ids%22%3A%221604147%22%2C%22content1%22%3A%22%40%5Cu81ea%5Cu7531%5Cu843d%5Cu4f53+%5Cu80fd%5Cu52a0%5Cu4f60%5Cu597d%5Cu53cb%5Cu5417%5Cuff1f++%22%2C%22root_id%22%3A0%2C%22parent_id%22%3A0%2C%22parent_ids%22%3A%22%22%2C%22content2%22%3Anull%2C%22content3%22%3Anull%2C%22content4%22%3Anull%2C%22content5%22%3Anull%2C%22ip%22%3A%22180.156.213.241%22%2C%22remote_ip%22%3A%22%22%2C%22user_agent%22%3A%22%5Cu4ec0%5Cu4e48%5Cu503c%5Cu5f97%5Cu4e70HD+2.2.4+rv%3A3+%28iPad%3B+iPhone+OS+8.1.1%3B+zh_CN%29%22%2C%22comment_from%22%3A%22%22%2C%22reply_from%22%3A0%2C%22creation_date%22%3A%222014-12-09+20%3A45%3A27%22%2C%22card_num%22%3A0%2C%22up_num%22%3A0%2C%22down_num%22%3A0%2C%22sort_v1%22%3A0%2C%22sort_v2%22%3A0%2C%22sort_v3%22%3A0%2C%22sort_v4%22%3A0%2C%22sort_v5%22%3A0%2C%22children_ids_1%22%3A%22%22%2C%22children_ids_2%22%3A%22%22%2C%22children_ids_3%22%3A%22%22%2C%22children_ids_4%22%3A%22%22%2C%22children_ids_5%22%3A%22%22%2C%22status%22%3A1%2C%22is_locked%22%3A0%2C%22comment_card_list%22%3A%5B%5D%2C%22report%22%3A%7B%22report_latest_time%22%3A%220000-00-00+00%3A00%3A00%22%2C%22report_count%22%3A0%2C%22report_count2%22%3A0%7D%2C%22have_read%22%3A1%2C%22origin_status%22%3A0%2C%22report_log_list%22%3A%5B%5D%2C%22user_info%22%3A%7B%22user_id%22%3A2772343%2C%22nickname%22%3A%22mini%5Cu6768%5Cu4e3d%22%7D%2C%22admin_log_list%22%3A%5B%7B%22id%22%3A7061455%2C%22editor_id%22%3A187%2C%22editor_name%22%3A%22wangtao%22%2C%22ctype%22%3A3%2C%22description%22%3A%22%22%2C%22op_module%22%3A%22%22%2C%22creation_date%22%3A%222015-03-10+11%3A18%3A26%22%7D%5D%2C%22admin_first_check_time%22%3A%222015-03-10+11%3A18%3A26%22%2C%22ploy_risk_type%22%3A%22EMPTY%22%2C%22ploy_description%22%3A%22%22%2C%22risk_type_tencent%22%3A%22EMPTY%22%2C%22tencent_moderation_description%22%3A%22%22%7D%2C%7B%22comment_id%22%3A18677417%2C%22channel_id%22%3A3%2C%22article_id%22%3A2375775%2C%22receive_user_id%22%3A0%2C%22at_user_ids%22%3A%22%"..., 8192) = 8192 write(5, 
"249%22%2C%22user_agent%22%3A%22Mozilla%5C%2F5.0+%28Windows+NT+6.3%3B+WOW64%29+AppleWebKit%5C%2F537.36+%28KHTML%2C+like+Gecko%29+Chrome%5C%2F37.0.2062.120+Safari%5C%2F537.36%22%2C%22comment_from%22%3A%22%22%2C%22reply_from%22%3A0%2C%22creation_date%22%3A%222014-12-09+20%3A46%3A04%22%2C%22card_num%22%3A0%2C%22up_num%22%3A0%2C%22down_num%22%3A0%2C%22sort_v1%22%3A0%2C%22sort_v2%22%3A0%2C%22sort_v3%22%3A0%2C%22sort_v4%22%3A0%2C%22sort_v5%22%3A0%2C%22children_ids_1%22%3A%22%22%2C%22children_ids_2%22%3A%22%22%2C%22children_ids_3%22%3A%22%22%2C%22children_ids_4%22%3A%22%22%2C%22children_ids_5%22%3A%22%22%2C%22status%22%3A1%2C%22is_locked%22%3A0%2C%22comment_card_list%22%3A%5B%5D%2C%22report%22%3A%7B%22report_latest_time%22%3A%220000-00-00+00%3A00%3A00%22%2C%22report_count%22%3A0%2C%22report_count2%22%3A0%7D%2C%22have_read%22%3A1%2C%22origin_status%22%3A0%2C%22report_log_list%22%3A%5B%5D%2C%22user_info%22%3A%7B%22user_id%22%3A755583%2C%22nickname%22%3A%22%5Cu5b59%5Cu5c0f%5Cu9c81%22%7D%2C%22admin_log_list%22%3A%5B%7B%22id%22%3A7061445%2C%22editor_id%22%3A187%2C%22editor_name%22%3A%22wangtao%22%2C%22ctype%22%3A3%2C%22description%22%3A%22%22%2C%22op_module%22%3A%22%22%2C%22creation_date%22%3A%222015-03-10+11%3A18%3A26%22%7D%5D%2C%22admin_first_check_time%22%3A%222015-03-10+11%3A18%3A26%22%2C%22ploy_risk_type%22%3A%22EMPTY%22%2C%22ploy_description%22%3A%22%22%2C%22risk_type_tencent%22%3A%22EMPTY%22%2C%22tencent_moderation_description%22%3A%22%22%7D%2C%7B%22comment_id%22%3A18677457%2C%22channel_id%22%3A3%2C%22article_id%22%3A5309159%2C%22receive_user_id%22%3A0%2C%22at_user_ids%22%3A%22%22%2C%22content1%22%3A%22%5Cu521a%5Cu4e70%5Cu7684%5Cuff0c%5Cu5c31%5Cu964d%5Cu4ef7%5Cu4e86%5Cuff0c%5Cu5509%5Cuff01%5Cu4e0d%5Cu8fc7%5Cu786e%5Cu5b9e%5Cu633a%5Cu597d%5Cuff0c%5Cu633a%5Cu6d41%5Cu7545%5Cuff0c%5Cu8fd9%5Cu4e2a%5Cu4ef7%5Cu4f4d%5Cu5f88%5Cu503c%5Cu4e86%5Cu3002%22%2C%22root_id%22%3A0%2C%22parent_id%22%3A0%2C%22parent_ids%22%3A%22%22%2C%22content2%22%3Anull%2C%22content3%22%3Anull%2C%22content4%22%3Anull%2C%22content5%22%3Anull%2C%22ip%22%3A%"..., 8192) = 8192 |
Finally it reads all the data back from the temp file into memory, and only then does the PHP script's own processing logic start.
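To gauge how fast these temp files churn, a rough one-liner can count opens of /tmp/php* over a fixed window (pid 48 is the php-fpm worker traced above; including openat as well in case the libc uses it):

timeout 10 strace -f -e trace=open,openat -p 48 2>&1 | grep -c '"/tmp/php'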
step 10
I polled the /tmp directory at high frequency and managed to grab one of the temp files to look at its content:
while true; do cp /tmp/php* . 2>/dev/null; done
The content turned out to be exactly the POST body of the /comment/bgm_bulk_index endpoint, so the suspicion was that PHP-FPM falls back to a temporary file when the POST body is too large.
In the PHP source file SAPI.c, the function sapi_read_standard_form_data is responsible for reading the POST form data:
SAPI_API SAPI_POST_READER_FUNC(sapi_read_standard_form_data)
{
    ...
    SG(request_info).request_body = php_stream_temp_create_ex(TEMP_STREAM_DEFAULT, SAPI_POST_BLOCK_SIZE, PG(upload_tmp_dir));

    if (sapi_module.read_post) {
        size_t read_bytes;

        for (;;) {
            char buffer[SAPI_POST_BLOCK_SIZE];

            read_bytes = sapi_read_post_block(buffer, SAPI_POST_BLOCK_SIZE);

            if (read_bytes > 0) {
                if (php_stream_write(SG(request_info).request_body, buffer, read_bytes) != read_bytes) {
    ....
When FPM handles POST data, it creates the request_body buffer via php_stream_temp_create_ex; the 2nd argument is a memory threshold, and once the buffered data exceeds that threshold the stream spills over to a temporary file.
It then loops, reading blocks of POST data and writing them into this buffer. In our case the POST body is roughly 100KB in total, so it blows past the memory threshold and ends up in a temp file.
The SAPI_POST_BLOCK_SIZE threshold is defined in hexadecimal and works out to 16384 bytes:
#define SAPI_POST_BLOCK_SIZE 0x4000
The only way to raise it is to patch the PHP-FPM source and recompile.
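If the code reading above is right, the behavior should be reproducible without the original job: any form-encoded POST larger than 16KB ought to make php-fpm spill to /tmp. The sketch below is a hypothetical reproduction (URL and paths are placeholders), run inside the container while watching /tmp:

# build a >16KB application/x-www-form-urlencoded body
{ printf 'doc_arr='; head -c 100000 /dev/zero | tr '\0' 'a'; } > /tmp/big.body
curl -s -o /dev/null -H 'Content-Type: application/x-www-form-urlencoded' \
     --data-binary @/tmp/big.body http://127.0.0.1/index.php &
ls -l /tmp/php* 2>/dev/null     # a short-lived phpXXXXXX temp file should appear while the request is in flight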
step 11
Finally, on the node hosting the high-memory POD, drop the slab/dentry caches once and watch whether the POD's memory comes down:
echo 2 > /proc/sys/vm/drop_caches
POD memory dropped from 1.8G to 346M, which roughly matches the actual RSS, confirming that the kmem portion was reclaimed.
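The effect can be measured directly against the container cgroup with a before/after read (a sketch; note that echo 2 > /proc/sys/vm/drop_caches frees reclaimable slab objects such as dentries and inodes node-wide, not just for this POD):

CG=/sys/fs/cgroup/memory/kubepods/pod931369e9-2a87-4090-a304-dd02122e7acc/7e75c3921b2157ccecc5cff5055940c782f02cb8227ae080874220bb06124dad
before=$(cat "$CG/memory.usage_in_bytes")
echo 2 > /proc/sys/vm/drop_caches
after=$(cat "$CG/memory.usage_in_bytes")
echo "freed: $(( (before - after) / 1024 / 1024 )) MiB"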
step 12
Although the PHP endpoint above creates temp files frequently, it also deletes them when each request finishes. So why does the slab cache accumulate millions of dentry objects? Shouldn't they be reclaimed and reused after deletion? Or does the kernel also cache entries for deleted files, so that a later stat can immediately report that the file does not exist? Hard to say offhand.
After some searching, it turns out the kernel does indeed cache dentries for deleted files:
Negative state (negative dentry):
The inode associated with the dentry no longer exists, either because the corresponding on-disk inode has been deleted, or because the dentry was created by resolving the pathname of a nonexistent file. The dentry's d_inode field is set to NULL, but the object is still kept in the dentry cache, so that later lookups of the same path can complete quickly. The term "negative" is somewhat misleading, as no negative value is involved at all.
So PHP's constant create-then-delete pattern keeps allocating new dentry objects, and the old (now negative) dentries pile up until the system runs short of memory, at which point the cache finally starts being reclaimed.
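This is easy to reproduce outside of PHP. A sketch: run the create/delete loop inside the application container (e.g. via kubectl exec), then read that container cgroup's slabinfo from the host; the dentry count should keep climbing even though every file has been removed:

# inside the application container: create and delete 100k uniquely named files
for i in $(seq 1 100000); do : > /tmp/leaktest.$i; rm -f /tmp/leaktest.$i; done

# on the host: the container cgroup's dentry count keeps growing
CG=/sys/fs/cgroup/memory/kubepods/pod931369e9-2a87-4090-a304-dd02122e7acc/7e75c3921b2157ccecc5cff5055940c782f02cb8227ae080874220bb06124dad
grep '^dentry ' "$CG/memory.kmem.slabinfo"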
Summary
The takeaway from this case is that Docker counting kmem toward a cgroup's memory usage by default is quite a trap: whichever cgroup allocates a slab object gets charged for it, which is arguably not entirely reasonable.
So perhaps stopping Docker from including kmem in the memory usage statistics would be the better approach? There is plenty of discussion about this online, so I won't repeat it here.
