Nginx forwards requests to two backend services. Some requests returned 502 without ever reaching a backend, and the Nginx error log showed:

no live upstreams while connecting to upstream

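Cause:

Requests forwarded to both nodes in the upstream group timed out (upstream timed out (110: Connection timed out) while reading response header from upstream). Nginx therefore considered both nodes unavailable, stopped forwarding new requests, and returned 502.

Nginx only tries forwarding again after fail_timeout has elapsed. This is Nginx's passive health check mechanism.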

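1. Problem description

In production, callers of the backend service frequently receive a 502 status code. The execution times in the backend logs look reasonable, and the requests that got 502 never reached the backend service. What is the cause?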

2025-10-21 15:22:58.251 POST /xxxx/xx/xxx/save with HTTP status:502
2025-10-21 15:22:58.245 POST /xxxx/xx/xxx/save with HTTP status:502

2. Root cause analysis

Nginx configuration

upstream xxxx-prod{
        server 192.168.6.27:8092 weight=4;
        server 192.168.6.26:8092 weight=1;
}

 server {
        listen 8080;# default_server;
        charset utf-8;

        location  /  {
                proxy_redirect off;
                proxy_set_header Host  $http_host;
                proxy_set_header X-Real-IP $remote_addr;
                proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
                proxy_pass http://xxxx-prod;
                }
        access_log  /var/log/nginx/xxxx-prod_access.log jsonlog;
        error_log  /var/log/nginx/xxxx-prod_error.log ;

}

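Nginx acts as the load balancer. External services call http://192.168.6.27:8080, and requests are forwarded to the two backend nodes, http://192.168.6.26:8092 and http://192.168.6.27:8092.

Log analysis

Check the corresponding Nginx error log: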

2025/10/21 15:22:52 [error] 5873#5873: *1345620337 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 192.168.20.32, server: , request: "POST /xxxx/xx/xxxxxxxxx/xxxxxx/xxxx/detail HTTP/1.1", upstream: "http://192.168.6.27:8092/xxxx/xx/xxxxxxxxx/xxxxxx/xxxx/detail", host: "192.168.6.27:8080"
2025/10/21 15:22:52 [error] 5874#5874: *1345620421 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 192.168.20.32, server: , request: "POST /xxxx/xx/xxxxxxxxx/xxxxxx/xxxxx/dept HTTP/1.1", upstream: "http://192.168.6.27:8092/xxxx/xx/xxxxxxxxx/xxxxxx/xxxxx/dept", host: "192.168.6.27:8080"
2025/10/21 15:22:52 [error] 5874#5874: *1345622340 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 192.168.20.32, server: , request: "POST /xxxx/xx/xxxxxxxxx/xxxxxx/xxxxx/dept HTTP/1.1", upstream: "http://192.168.6.26:8092/xxxx/xx/xxxxxxxxx/xxxxxx/xxxxx/dept", host: "192.168.6.27:8080"
2025/10/21 15:22:52 [error] 5874#5874: *1345622215 no live upstreams while connecting to upstream, client: 192.168.20.32, server: , request: "POST /xxxx/xx/xxxxxxxxx/xxxxxx/xxxxx/xxxxxx/list HTTP/1.1", upstream: "http://xxxx-prod/xxxx/xx/xxxxxxxxx/xxxxxx/xxxxx/xxxxxx/list", host: "192.168.6.27:8080"
2025/10/21 15:22:52 [error] 5873#5873: *1345622348 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 192.168.20.32, server: , request: "POST /xxxx/xx/xxxxxxxxx/xxxxxx/xxxxx/dept HTTP/1.1", upstream: "http://192.168.6.26:8092/xxxx/xx/xxxxxxxxx/xxxxxx/xxxxx/dept", host: "192.168.6.27:8080"
2025/10/21 15:22:52 [error] 5873#5873: *1345622336 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 192.168.20.32, server: , request: "POST /xxxx/xx/xxxxxxxxx/xxxxxx/xxxxx/dept HTTP/1.1", upstream: "http://192.168.6.27:8092/xxxx/xx/xxxxxxxxx/xxxxxx/xxxxx/dept", host: "192.168.6.27:8080"
2025/10/21 15:22:52 [error] 5873#5873: *1345622336 no live upstreams while connecting to upstream, client: 192.168.20.32, server: , request: "POST /xxxx/xx/xxxxxxxxx/xxxxxx/xxxxx/dept HTTP/1.1", upstream: "http://xxxx-prod/xxxx/xx/xxxxxxxxx/xxxxxx/xxxxx/dept", host: "192.168.6.27:8080"
2025/10/21 15:22:52 [error] 5873#5873: *1345622336 no live upstreams while connecting to upstream, client: 192.168.20.32, server: , request: "POST /xxxx/xx/xxxxxxxxx/xxxxxx/xxxxx/dept HTTP/1.1", upstream: "http://xxxx-prod/xxxx/xx/xxxxxxxxx/xxxxxx/xxxxx/dept", host: "192.168.6.27:8080"
2025/10/21 15:22:52 [error] 5873#5873: *1345622336 no live upstreams while connecting to upstream, client: 192.168.20.32, server: , request: "POST /xxxx/xx/xxxxxxxxx/xxxxxx/xxx/info HTTP/1.1", upstream: "http://xxxx-prod/xxxx/xx/xxxxxxxxx/xxxxxx/xxx/info", host: "192.168.6.27:8080"
2025/10/21 15:22:52 [error] 5874#5874: *1345622356 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 192.168.20.32, server: , request: "POST /xxxx/xx/xxxxxxxxx/xxxxxx/xxxxx/dept HTTP/1.1", upstream: "http://192.168.6.27:8092/xxxx/xx/xxxxxxxxx/xxxxxx/xxxxx/dept", host: "192.168.6.27:8080"

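The log contains two kinds of errors: "upstream timed out (110: Connection timed out) while reading response header from upstream" and "no live upstreams while connecting to upstream".

The log shows that requests forwarded to both backends took longer than the read timeout Nginx applies to the upstream (proxy_read_timeout, which defaults to 60s unless it is set elsewhere), so those requests failed with a 504 timeout. Nginx then concluded that no upstream node was alive, returned 502, and rejected new requests.

The two 502 responses from the problem description correspond to these two log entries: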

2025/10/21 15:22:56 [error] 5874#5874: *1345618781 no live upstreams while connecting to upstream, client: 192.168.20.31, server: , request: "POST /xxxx/xx/xxx/save HTTP/1.1", upstream: "http://xxxx-prod/xxxx/xx/xxx/save", host: "192.168.6.27:8080"
2025/10/21 15:22:58 [error] 5874#5874: *1345623128 no live upstreams while connecting to upstream, client: 192.168.20.31, server: , request: "POST /xxxx/xx/xxx/save HTTP/1.1", upstream: "http://xxxx-prod/xxxx/xx/xxx/save", host: "192.168.6.27:8080"

Node health check policy

The configuration in use is:

upstream xxxx-prod{
        server 192.168.6.27:8092 weight=4;
        server 192.168.6.26:8092 weight=1;
}

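Nginx uses passive health checks.

Neither max_fails nor fail_timeout is configured, so the defaults apply.

By default max_fails is 1 and fail_timeout is 10s. Once a server has 1 failed attempt (a timeout or connection error) within fail_timeout, it is marked unavailable and receives no new requests until fail_timeout has passed.

When both nodes are marked unavailable, you get the error seen above: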

no live upstreams while connecting to upstream

Nginx returns 502 and rejects new requests. Forwarding resumes only after fail_timeout has elapsed.
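To make the implicit behaviour visible, the same upstream block can be written with the defaults spelled out. This is only a sketch of what the existing configuration already means; it changes nothing at runtime.

```nginx
upstream xxxx-prod {
    # Equivalent to the original block: the documented defaults are
    # max_fails=1 and fail_timeout=10s, here written explicitly.
    # One failed attempt within 10s marks the server unavailable for 10s.
    server 192.168.6.27:8092 weight=4 max_fails=1 fail_timeout=10s;
    server 192.168.6.26:8092 weight=1 max_fails=1 fail_timeout=10s;
}
```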

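3. Solution

We do not want Nginx to intercept new requests and return 502. Every request should be forwarded to the backend service, so passive health checks are disabled here.

To disable Nginx's checks on the upstream servers, set max_fails to 0.

Official documentation: The zero value disables the accounting of attempts.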

upstream xxxx-prod{
        server 192.168.6.27:8092 weight=4;
        server 192.168.6.26:8092 weight=1;
}

Change it to:

upstream xxxx-prod{
        server 192.168.6.27:8092 weight=4 max_fails=0;
        server 192.168.6.26:8092 weight=1 max_fails=0;
}

This disables the check.

Alternatively, tune max_fails and fail_timeout to suit your own situation.
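As a rough sketch of that alternative, the fragment below keeps passive checks enabled but makes them less trigger-happy, and also raises the read timeout that produced the original "upstream timed out" errors. The specific values (3 failures, 5s, 120s) are illustrative assumptions, not recommendations taken from this incident; pick values based on how slow the backend can legitimately be.

```nginx
upstream xxxx-prod {
    # Assumed values: tolerate 3 failed attempts within 5s before marking
    # a server unavailable, and retry it again after 5s.
    server 192.168.6.27:8092 weight=4 max_fails=3 fail_timeout=5s;
    server 192.168.6.26:8092 weight=1 max_fails=3 fail_timeout=5s;
}

server {
    listen 8080;

    location / {
        # The default proxy_read_timeout is 60s. A larger value (assumed
        # here) reduces timeouts, and therefore failure counts, for
        # requests that are slow but healthy.
        proxy_read_timeout 120s;
        proxy_pass http://xxxx-prod;
    }
}
```

The trade-off is that a backend that is truly down will be detected later, so clients may wait longer before they see an error.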

4. Extras

4.1 Structure of Nginx error log entries

What is 5874#5874 in the error log?

2025/10/21 15:22:52 [error] 5874#5874: *1345620421 upstream timed out ...

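In this line, 5874#5874 identifies the Nginx process that wrote the entry. In detail:

I. Structure

The standard Nginx error log format is:

[time] [log level] pid#tid: *connection number message

For example: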

[error] 5874#5874: *1345620421 upstream timed out ...

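The fields are:

Field | Example | Meaning
First number | 5874 | Process ID (PID): which Nginx worker process wrote this entry
Second number | #5874 | Thread ID (TID). For most Nginx workers on Linux this equals the PID, because by default Nginx runs multiple processes with a single thread each
Connection number | *1345620421 | Serial number of the client connection (the same value as $connection). Requests sent over the same keep-alive connection share it
Remaining text | upstream timed out ... | Error details recorded by the Nginx module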

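II. In plain terms

So:

5874#5874

can be read as:

"logged by the Nginx worker process with PID 5874 (which is also thread 5874)"

III. Practical uses

This information helps you:

1. Tell which worker reported the error
If several different pid#tid values report errors at the same moment, multiple workers are handling requests concurrently.

2. Analyze the problem together with the connection number

  • *1345620421 is the connection serial number, not a per-request ID. Requests sent over the same keep-alive connection log the same number (for example *1345622336 in the log above).
  • You can search the access log (access.log) for the same number to recover the context of the requests on that connection, provided the access log format includes $connection. ($request_id is a separate, randomly generated identifier and does not match this number.)

3. Check whether a worker crashed or hung
If a given PID stops producing log entries after an error, that worker process has very likely exited or been restarted by the master.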
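For the correlation described in points 1 and 2 to work, the access log has to record the worker PID and the connection number. A minimal sketch follows, assuming a new format name (conn_trace) and a separate log file, both of which are hypothetical and not part of the original setup, which uses its own jsonlog format. All variables used are standard Nginx variables.

```nginx
# In the http {} context. "conn_trace" is a hypothetical format name.
log_format conn_trace '$time_iso8601 pid=$pid conn=$connection '
                      'conn_reqs=$connection_requests "$request" '
                      'status=$status upstream=$upstream_addr '
                      'upstream_status=$upstream_status rt=$request_time';

server {
    listen 8080;

    # Hypothetical extra log file, written alongside the existing access log.
    access_log /var/log/nginx/xxxx-prod_trace.log conn_trace;
}
```

With this in place, conn=1345620421 in the access log lines up with *1345620421 in the error log, and pid= shows which worker handled the request.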

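Filtering the log by the process identifier 5874#5874 shows that after requests to both backend nodes hit 504 timeouts, Nginx considered no node alive and intercepted the new requests that followed.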

2025/10/21 15:22:52 [error] 5874#5874: *1345620421 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 192.168.20.32, server: , request: "POST /xxxx/xx/xxxxxxxxx/xxxxxx/xxxxx/dept HTTP/1.1", upstream: "http://192.168.6.27:8092/xxxx/xx/xxxxxxxxx/xxxxxx/xxxxx/dept", host: "192.168.6.27:8080"
2025/10/21 15:22:52 [error] 5874#5874: *1345622340 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 192.168.20.32, server: , request: "POST /xxxx/xx/xxxxxxxxx/xxxxxx/xxxxx/dept HTTP/1.1", upstream: "http://192.168.6.26:8092/xxxx/xx/xxxxxxxxx/xxxxxx/xxxxx/dept", host: "192.168.6.27:8080"
2025/10/21 15:22:52 [error] 5874#5874: *1345622215 no live upstreams while connecting to upstream, client: 192.168.20.32, server: , request: "POST /xxxx/xx/xxxxxxxxx/xxxxxx/xxxxx/xxxxxx/list HTTP/1.1", upstream: "http://xxxx-prod/xxxx/xx/xxxxxxxxx/xxxxxx/xxxxx/xxxxxx/list", host: "192.168.6.27:8080"
2025/10/21 15:22:52 [error] 5874#5874: *1345622356 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 192.168.20.32, server: , request: "POST /xxxx/xx/xxxxxxxxx/xxxxxx/xxxxx/dept HTTP/1.1", upstream: "http://192.168.6.27:8092/xxxx/xx/xxxxxxxxx/xxxxxx/xxxxx/dept", host: "192.168.6.27:8080"
2025/10/21 15:22:53 [error] 5874#5874: *1345622356 no live upstreams while connecting to upstream, client: 192.168.20.32, server: , request: "POST /xxxx/xx/xxxxxxxxx/xxxxxx/xxxxx/xxxxxx/list HTTP/1.1", upstream: "http://xxxx-prod/xxxx/xx/xxxxxxxxx/xxxxxx/xxxxx/xxxxxx/list", host: "192.168.6.27:8080"
2025/10/21 15:22:56 [error] 5874#5874: *1345618781 no live upstreams while connecting to upstream, client: 192.168.20.31, server: , request: "POST /xxxx/xx/xxx/save HTTP/1.1", upstream: "http://xxxx-prod/xxxx/xx/xxx/save", host: "192.168.6.27:8080"
2025/10/21 15:22:57 [error] 5874#5874: *1345621535 no live upstreams while connecting to upstream, client: 192.168.20.31, server: , request: "POST /xxxx/xx/xxxx/xxxxx/orderInfo HTTP/1.1", upstream: "http://xxxx-prod/xxxx/xx/xxxx/xxxxx/orderInfo", host: "192.168.6.27:8080"
2025/10/21 15:22:57 [error] 5874#5874: *1345621535 no live upstreams while connecting to upstream, client: 192.168.20.31, server: , request: "POST /xxxx/xx/xxx/xxxxxx/list HTTP/1.1", upstream: "http://xxxx-prod/xxxx/xx/xxx/xxxxxx/list", host: "192.168.6.27:8080"
2025/10/21 15:22:57 [error] 5874#5874: *1345621535 no live upstreams while connecting to upstream, client: 192.168.20.31, server: , request: "POST /xxxx/xx/xxxx/xxxxxx/list HTTP/1.1", upstream: "http://xxxx-prod/xxxx/xx/xxxx/xxxxxx/list", host: "192.168.6.27:8080"
2025/10/21 15:22:57 [error] 5874#5874: *1345622637 no live upstreams while connecting to upstream, client: 192.168.20.32, server: , request: "POST /xxxx/xx/xxx/xxxxxxx/check HTTP/1.1", upstream: "http://xxxx-prod/xxxx/xx/xxx/xxxxxxx/check", host: "192.168.6.27:8080"
2025/10/21 15:22:58 [error] 5874#5874: *1345622637 no live upstreams while connecting to upstream, client: 192.168.20.32, server: , request: "POST /xxxx/xx/xxx/xxxxxx/list HTTP/1.1", upstream: "http://xxxx-prod/xxxx/xx/xxx/xxxxxx/list", host: "192.168.6.27:8080"
2025/10/21 15:22:58 [error] 5874#5874: *1345622637 no live upstreams while connecting to upstream, client: 192.168.20.32, server: , request: "POST /xxxx/xx/xxxx/xxxxxx/list HTTP/1.1", upstream: "http://xxxx-prod/xxxx/xx/xxxx/xxxxxx/list", host: "192.168.6.27:8080"
2025/10/21 15:22:58 [error] 5874#5874: *1345623128 no live upstreams while connecting to upstream, client: 192.168.20.31, server: , request: "POST /xxxx/xx/xxx/save HTTP/1.1", upstream: "http://xxxx-prod/xxxx/xx/xxx/save", host: "192.168.6.27:8080"
2025/10/21 15:22:58 [error] 5874#5874: *1345621535 no live upstreams while connecting to upstream, client: 192.168.20.31, server: , request: "POST /xxxx/xx/xxx/xxxxxx/list HTTP/1.1", upstream: "http://xxxx-prod/xxxx/xx/xxx/xxxxxx/list", host: "192.168.6.27:8080"

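4.2 upstream node health state is not shared between workers

By default (with no zone directive in the upstream block), the health state of upstream nodes (failure count, down flag, and so on) is not shared between workers.
Each worker keeps its own state, counts failures independently, and decides independently.

This means:

  • Worker A may consider a node down;
  • Worker B may still consider it healthy;
  • Worker C may even be sending requests to it successfully at that moment.

As a result:

At the same moment, different requests (handled by different workers)
may hit the same upstream node and get different results (200 / 504 / 502).

This explains why, after the error log already shows no live upstreams while connecting to upstream, later requests were still processed instead of all being rejected with 502.

The access log shows this even more clearly: once the incident starts, 200, 504 and 502 responses are interleaved, because the workers may disagree about node health.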
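If a single, consistent view of node health across workers is preferred, the upstream block can declare a shared memory zone with the zone directive (available in open source Nginx since 1.9.0), which keeps the group's run-time state in memory shared between worker processes. The zone name and size below are assumptions chosen for illustration, and this was not part of the fix applied in this incident.

```nginx
upstream xxxx-prod {
    # Hypothetical zone name and size. With a zone, failure counters and
    # availability state are shared by all workers instead of per worker.
    zone xxxx_prod_zone 64k;

    server 192.168.6.27:8092 weight=4;
    server 192.168.6.26:8092 weight=1;
}
```

Note that sharing the state cuts both ways: all workers then stop using a failed node at the same time, so the mixed 200 / 504 / 502 pattern would likely turn into a more uniform run of 502 responses while both nodes are marked unavailable.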
