调查APISIX上的java插件重启问题

问题

在使用apisix中,我们有一个java plugin做加解密,命令行配置如下:

ext-plugin:
  cmd: ['java', '-jar', '-XX:+HeapDumpOnOutOfMemoryError' ,'-XX:HeapDumpPath=/usr/local/apisix/logs/apisixrunner_oomdump.hprof','-XX:ErrorFile=/usr/local/apisix/logs/apisixrunner_hs_error_dump.log', '-Xmx1g', '-Xms512m', '-Dspring.profiles.active=develop', '/usr/apisix/apisix-runner-bin/apisix-java-plugin-runner.jar']

但是发现这个jvm偶尔会自动重启。遂开始调查。

调查步骤

1,先对比了下java日志,发现没有异常,就是自动重启(上下并无明显异常):

[]06-17 04:30:38.540.+0000[] [INFO ][main][o.a.a.p.r.PluginRunnerApplication.logStarted:61]Started PluginRunnerApplication in 3.652 seconds (JVM running for 5.473)

2,查看是否有oom日志,上面的heapdump中没有hprof生成。
3,问AI APISIX用lua写的,所以问问ai它是如何管理插件系统的:
在这里插入图片描述于是过滤apisix容器日志,发现真的有respawn相关日志:
在这里插入图片描述
可以看到确实是被kill -9了,
4,于是查看为啥被kill了,使用的如下命令:

# 检查 OOM Killer 详细日志
dmesg -T | grep -i "out of memory\|killed process" -B20 -A20  # 显示上下文

查看到如下内容:

[Tue Jun 17 05:51:40 2025] [61337]    99 61337   286571    12845      81        0           958 openresty
[Tue Jun 17 05:51:40 2025] [60265]    99 60265   286378    12718      81        0           958 openresty
[Tue Jun 17 05:51:40 2025] [42939]    99 42939   284377    10724      75        0           958 openresty
[Tue Jun 17 05:51:40 2025] [64084]     0 64084     3816      259      13        0           958 bash
[Tue Jun 17 05:51:40 2025] [  596]     0   596  1395294   133765     380        0           958 java
[Tue Jun 17 05:51:40 2025] [54493]     0 54493     1954       99      10        0           958 tail
[Tue Jun 17 05:51:40 2025] Memory cgroup out of memory: Kill process 59314 (epollEventLoopG) score 1213 or sacrifice child
[Tue Jun 17 05:51:40 2025] Killed process 596 (java), UID 0, total-vm:5581176kB, anon-rss:524188kB, file-rss:10872kB, shmem-rss:0kB
[Tue Jun 17 05:51:40 2025] openresty invoked oom-killer: gfp_mask=0x50, order=0, oom_score_adj=958
[Tue Jun 17 05:51:40 2025] openresty cpuset=docker-a71969200f6221395c742fe44dd198730358fe11d1a668980c0eb7410e42a79d.scope mems_allowed=0
[Tue Jun 17 05:51:40 2025] CPU: 6 PID: 53681 Comm: openresty Kdump: loaded Tainted: G               ------------ T 3.10.0-1160.el7.x86_64 #1
[Tue Jun 17 05:51:40 2025] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.15.0-0-g2dd4b9b3f840-prebuilt.qemu.org 04/01/2014
[Tue Jun 17 05:51:40 2025] Call Trace:
[Tue Jun 17 05:51:40 2025]  [<ffffffff9f381340>] dump_stack+0x19/0x1b
[Tue Jun 17 05:51:40 2025]  [<ffffffff9f37bc60>] dump_header+0x90/0x229
[Tue Jun 17 05:51:40 2025]  [<ffffffff9edc1b26>] ? find_lock_task_mm+0x56/0xc0
[Tue Jun 17 05:51:40 2025]  [<ffffffff9edc208d>] oom_kill_process+0x2cd/0x490
[Tue Jun 17 05:51:40 2025]  [<ffffffff9ee40cdc>] mem_cgroup_oom_synchronize+0x55c/0x590
[Tue Jun 17 05:51:40 2025]  [<ffffffff9ee40140>] ? mem_cgroup_charge_common+0xc0/0xc0
[Tue Jun 17 05:51:40 2025]  [<ffffffff9edc2974>] pagefault_out_of_memory+0x14/0x90
[Tue Jun 17 05:51:40 2025]  [<ffffffff9f37a16c>] mm_fault_error+0x6a/0x157
[Tue Jun 17 05:51:40 2025]  [<ffffffff9f38e8d1>] __do_page_fault+0x491/0x500
[Tue Jun 17 05:51:40 2025]  [<ffffffff9f38ea26>] trace_do_page_fault+0x56/0x150
[Tue Jun 17 05:51:40 2025]  [<ffffffff9f38dfa2>] do_async_page_fault+0x22/0xf0
[Tue Jun 17 05:51:40 2025]  [<ffffffff9f38a7a8>] async_page_fault+0x28/0x30
[Tue Jun 17 05:51:40 2025] Task in /kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod6962c5d3_918f_425a_b560_2b977138f026.slice/docker-a71969200f6221395c742fe44dd198730358fe11d1a668980c0eb7410e42a79d.scope killed as a result of limit of /kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod6962c5d3_918f_425a_b560_2b977138f026.slice
[Tue Jun 17 05:51:40 2025] memory: usage 2097152kB, limit 2097152kB, failcnt 564269
[Tue Jun 17 05:51:40 2025] memory+swap: usage 2097152kB, limit 9007199254740988kB, failcnt 0

这里面有看到kill的原因是因为 Memory cgroup out of memory, 然后选择牺牲掉子进程(也就是java)。
可以看到该任务是k8s启动,于是联想到k8s的request resource limit.
5, 解决办法:可以修改apisix的resource limit 或者调低 java的xms和xmx。

总结

1,善用ai,尤其是deepwiki这种,可以给不太熟悉的代码仓库一些大的逻辑参考。
2,oom killer grep command for dmesg