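The cluster has already been torn down, so I can no longer check whether process 1061751 was still alive. I also went through the Ray client server logs, including gcs_server.out. The logs of all core components (gcs_server.*, raylet.*, dashboard.*, monitor.*, and so on) stop at the failure time, 2-20T21:06:15, with no output after that. According to the Slurm database, this timestamp matches the time the job failed. The logs that show obvious errors around the failure time are listed below: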
log_monitor.log:
2023-02-20 21:06:19,584|INFO log_monitor.py:247 -- Beginning to track file worker-7184a2fe3399ee0c45917de1781d419b7715fb6ee4923f2a960675fd-02000000-80841.out
2023-02-20 21:06:20,586|ERROR log_monitor.py:534 -- The log monitor on node SH-IDC1-10-140-1-176 failed with the following error:
Traceback (most recent call last):
File "/mnt/petrelfs/huangting.p/.conda/envs/ray2.2.0-py3.10/lib/python3.10/site-packages/ray/_private/log_monitor.py", line 520, in <module>
log_monitor.run()
File "/mnt/petrelfs/huangting.p/.conda/envs/ray2.2.0-py3.10/lib/python3.10/site-packages/ray/_private/log_monitor.py", line 436, in run
anything_published = self.check_log_files_and_publish_updates()
File "/mnt/petrelfs/huangting.p/.conda/envs/ray2.2.0-py3.10/lib/python3.10/site-packages/ray/_private/log_monitor.py", line 339, in check_log_files_and_publish_updates
file_info.reopen_if_necessary()
File "/mnt/petrelfs/huangting.p/.conda/envs/ray2.2.0-py3.10/lib/python3.10/site-packages/ray/_private/log_monitor.py", line 79, in reopen_if_necessary
new_inode = os.stat(self.filename).st_ino
FileNotFoundError: [Errno 2] No such file or directory: '/mnt/cache/share_data/huangting.p/ray/session_2023-02-20_17-46-26_882787_1059999/logs/worker-65110474ef8dc70b12a4df98c6ab9c9a671758254b23109f56b79aa6-01000000-1061751.err'
2023-02-20 21:06:20,612|ERROR log_monitor.py:534 -- The log monitor on node SH-IDC1-10-140-0-76 failed with the following error:
Traceback (most recent call last):
File "/mnt/petrelfs/huangting.p/.conda/envs/ray2.2.0-py3.10/lib/python3.10/site-packages/ray/_private/log_monitor.py", line 520, in <module>
log_monitor.run()
File "/mnt/petrelfs/huangting.p/.conda/envs/ray2.2.0-py3.10/lib/python3.10/site-packages/ray/_private/log_monitor.py", line 436, in run
anything_published = self.check_log_files_and_publish_updates()
File "/mnt/petrelfs/huangting.p/.conda/envs/ray2.2.0-py3.10/lib/python3.10/site-packages/ray/_private/log_monitor.py", line 339, in check_log_files_and_publish_updates
file_info.reopen_if_necessary()
File "/mnt/petrelfs/huangting.p/.conda/envs/ray2.2.0-py3.10/lib/python3.10/site-packages/ray/_private/log_monitor.py", line 79, in reopen_if_necessary
new_inode = os.stat(self.filename).st_ino
FileNotFoundError: [Errno 2] No such file or directory: '/mnt/cache/share_data/huangting.p/ray/session_2023-02-20_17-46-26_882787_1059999/logs/worker-65110474ef8dc70b12a4df98c6ab9c9a671758254b23109f56b79aa6-01000000-1061751.err'
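For reference, the traceback above ends in an unguarded `os.stat` call on a worker log file that no longer exists. Below is a minimal sketch of how such an inode check could tolerate a missing file. This is not Ray's code and not a proposed patch; `current_inode` and `needs_reopen` are hypothetical helper names I made up for illustration.

```python
import os
from typing import Optional


def current_inode(path: str) -> Optional[int]:
    """Return the inode number of `path`, or None if the file does not exist."""
    try:
        return os.stat(path).st_ino
    except FileNotFoundError:
        # The file was removed (for example, the session directory was cleaned up).
        return None


def needs_reopen(path: str, known_inode: int) -> bool:
    """True only if the file still exists and its inode differs from known_inode."""
    inode = current_inode(path)
    return inode is not None and inode != known_inode
```

With a check like this, a deleted log file would simply be skipped instead of raising `FileNotFoundError`. Whether skipping is the right behavior for the log monitor is a separate question.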
dashboard_agent.log:
2023-02-20 21:06:27,468|ERROR reporter_agent.py:938 -- Error publishing node physical stats.
Traceback (most recent call last):
File "/mnt/petrelfs/huangting.p/.conda/envs/ray2.2.0-py3.10/lib/python3.10/site-packages/ray/dashboard/modules/reporter/reporter_agent.py", line 915, in _perform_iteration
formatted_status_string = await self._gcs_aio_client.internal_kv_get(
File "/mnt/petrelfs/huangting.p/.conda/envs/ray2.2.0-py3.10/lib/python3.10/site-packages/ray/_private/gcs_utils.py", line 151, in wrapper
return await f(self, *args, **kwargs)
File "/mnt/petrelfs/huangting.p/.conda/envs/ray2.2.0-py3.10/lib/python3.10/site-packages/ray/_private/gcs_utils.py", line 446, in internal_kv_get
reply = await self._kv_stub.InternalKVGet(req, timeout=timeout)
File "/mnt/petrelfs/huangting.p/.conda/envs/ray2.2.0-py3.10/lib/python3.10/site-packages/grpc/aio/_call.py", line 290, in __await__
raise _create_rpc_error(self._cython_call._initial_metadata,
grpc.aio._call.AioRpcError: <AioRpcError of RPC that terminated with:
| status = StatusCode.UNAVAILABLE
| details = "failed to connect to all addresses; last error: UNKNOWN: ipv4:10.140.0.76:6379: Failed to connect to remote host: Connection refused"
| debug_error_string = "UNKNOWN:failed to connect to all addresses; last error: UNKNOWN: ipv4:10.140.0.76:6379: Failed to connect to remote host: Connection refused {created_time:"2023-02-20T21:06:27.468451143+08:00", grpc_status:14}"
>
2023-02-20 21:06:34,990|ERROR reporter_agent.py:938 -- Error publishing node physical stats.
Traceback (most recent call last):
File "/mnt/petrelfs/huangting.p/.conda/envs/ray2.2.0-py3.10/lib/python3.10/site-packages/ray/dashboard/modules/reporter/reporter_agent.py", line 915, in _perform_iteration
formatted_status_string = await self._gcs_aio_client.internal_kv_get(
File "/mnt/petrelfs/huangting.p/.conda/envs/ray2.2.0-py3.10/lib/python3.10/site-packages/ray/_private/gcs_utils.py", line 151, in wrapper
return await f(self, *args, **kwargs)
File "/mnt/petrelfs/huangting.p/.conda/envs/ray2.2.0-py3.10/lib/python3.10/site-packages/ray/_private/gcs_utils.py", line 446, in internal_kv_get
reply = await self._kv_stub.InternalKVGet(req, timeout=timeout)
File "/mnt/petrelfs/huangting.p/.conda/envs/ray2.2.0-py3.10/lib/python3.10/site-packages/grpc/aio/_call.py", line 290, in __await__
raise _create_rpc_error(self._cython_call._initial_metadata,
grpc.aio._call.AioRpcError: <AioRpcError of RPC that terminated with:
| status = StatusCode.UNAVAILABLE
| details = "failed to connect to all addresses; last error: UNKNOWN: ipv4:10.140.0.76:6379: Failed to connect to remote host: Connection refused"
| debug_error_string = "UNKNOWN:failed to connect to all addresses; last error: UNKNOWN: ipv4:10.140.0.76:6379: Failed to connect to remote host: Connection refused {created_time:"2023-02-20T21:06:34.989999268+08:00", grpc_status:14}"
>
2023-02-20 21:06:42,532|ERROR reporter_agent.py:938 -- Error publishing node physical stats.
Traceback (most recent call last):
File "/mnt/petrelfs/huangting.p/.conda/envs/ray2.2.0-py3.10/lib/python3.10/site-packages/ray/dashboard/modules/reporter/reporter_agent.py", line 915, in _perform_iteration
formatted_status_string = await self._gcs_aio_client.internal_kv_get(
File "/mnt/petrelfs/huangting.p/.conda/envs/ray2.2.0-py3.10/lib/python3.10/site-packages/ray/_private/gcs_utils.py", line 151, in wrapper
return await f(self, *args, **kwargs)
File "/mnt/petrelfs/huangting.p/.conda/envs/ray2.2.0-py3.10/lib/python3.10/site-packages/ray/_private/gcs_utils.py", line 446, in internal_kv_get
reply = await self._kv_stub.InternalKVGet(req, timeout=timeout)
File "/mnt/petrelfs/huangting.p/.conda/envs/ray2.2.0-py3.10/lib/python3.10/site-packages/grpc/aio/_call.py", line 290, in __await__
raise _create_rpc_error(self._cython_call._initial_metadata,
grpc.aio._call.AioRpcError: <AioRpcError of RPC that terminated with:
| status = StatusCode.UNAVAILABLE
| details = "failed to connect to all addresses; last error: UNKNOWN: ipv4:10.140.0.76:6379: Failed to connect to remote host: Connection refused"
| debug_error_string = "UNKNOWN:failed to connect to all addresses; last error: UNKNOWN: ipv4:10.140.0.76:6379: Failed to connect to remote host: Connection refused {created_time:"2023-02-20T21:06:42.53237816+08:00", grpc_status:14}"
>
2023-02-20 21:06:50,092|ERROR reporter_agent.py:938 -- Error publishing node physical stats.
Traceback (most recent call last):
File "/mnt/petrelfs/huangting.p/.conda/envs/ray2.2.0-py3.10/lib/python3.10/site-packages/ray/dashboard/modules/reporter/reporter_agent.py", line 915, in _perform_iteration
formatted_status_string = await self._gcs_aio_client.internal_kv_get(
File "/mnt/petrelfs/huangting.p/.conda/envs/ray2.2.0-py3.10/lib/python3.10/site-packages/ray/_private/gcs_utils.py", line 151, in wrapper
return await f(self, *args, **kwargs)
File "/mnt/petrelfs/huangting.p/.conda/envs/ray2.2.0-py3.10/lib/python3.10/site-packages/ray/_private/gcs_utils.py", line 446, in internal_kv_get
reply = await self._kv_stub.InternalKVGet(req, timeout=timeout)
File "/mnt/petrelfs/huangting.p/.conda/envs/ray2.2.0-py3.10/lib/python3.10/site-packages/grpc/aio/_call.py", line 290, in __await__
raise _create_rpc_error(self._cython_call._initial_metadata,
grpc.aio._call.AioRpcError: <AioRpcError of RPC that terminated with:
| status = StatusCode.UNAVAILABLE
| details = "failed to connect to all addresses; last error: UNKNOWN: ipv4:10.140.0.76:6379: Failed to connect to remote host: Connection refused"
| debug_error_string = "UNKNOWN:failed to connect to all addresses; last error: UNKNOWN: ipv4:10.140.0.76:6379: Failed to connect to remote host: Connection refused {created_time:"2023-02-20T21:06:50.092078005+08:00", grpc_status:14}"
>
2023-02-20 21:06:57,605|ERROR reporter_agent.py:938 -- Error publishing node physical stats.
Traceback (most recent call last):
File "/mnt/petrelfs/huangting.p/.conda/envs/ray2.2.0-py3.10/lib/python3.10/site-packages/ray/dashboard/modules/reporter/reporter_agent.py", line 915, in _perform_iteration
formatted_status_string = await self._gcs_aio_client.internal_kv_get(
File "/mnt/petrelfs/huangting.p/.conda/envs/ray2.2.0-py3.10/lib/python3.10/site-packages/ray/_private/gcs_utils.py", line 151, in wrapper
return await f(self, *args, **kwargs)
File "/mnt/petrelfs/huangting.p/.conda/envs/ray2.2.0-py3.10/lib/python3.10/site-packages/ray/_private/gcs_utils.py", line 446, in internal_kv_get
reply = await self._kv_stub.InternalKVGet(req, timeout=timeout)
File "/mnt/petrelfs/huangting.p/.conda/envs/ray2.2.0-py3.10/lib/python3.10/site-packages/grpc/aio/_call.py", line 290, in __await__
raise _create_rpc_error(self._cython_call._initial_metadata,
grpc.aio._call.AioRpcError: <AioRpcError of RPC that terminated with:
| status = StatusCode.UNAVAILABLE
| details = "failed to connect to all addresses; last error: UNKNOWN: ipv4:10.140.0.76:6379: Failed to connect to remote host: Connection refused"
| debug_error_string = "UNKNOWN:failed to connect to all addresses; last error: UNKNOWN: ipv4:10.140.0.76:6379: Failed to connect to remote host: Connection refused {grpc_status:14, created_time:"2023-02-20T21:06:57.605693228+08:00"}"
>
2023-02-20 21:07:05,119|ERROR reporter_agent.py:938 -- Error publishing node physical stats.
Traceback (most recent call last):
File "/mnt/petrelfs/huangting.p/.conda/envs/ray2.2.0-py3.10/lib/python3.10/site-packages/ray/dashboard/modules/reporter/reporter_agent.py", line 915, in _perform_iteration
formatted_status_string = await self._gcs_aio_client.internal_kv_get(
File "/mnt/petrelfs/huangting.p/.conda/envs/ray2.2.0-py3.10/lib/python3.10/site-packages/ray/_private/gcs_utils.py", line 151, in wrapper
return await f(self, *args, **kwargs)
File "/mnt/petrelfs/huangting.p/.conda/envs/ray2.2.0-py3.10/lib/python3.10/site-packages/ray/_private/gcs_utils.py", line 446, in internal_kv_get
reply = await self._kv_stub.InternalKVGet(req, timeout=timeout)
File "/mnt/petrelfs/huangting.p/.conda/envs/ray2.2.0-py3.10/lib/python3.10/site-packages/grpc/aio/_call.py", line 290, in __await__
raise _create_rpc_error(self._cython_call._initial_metadata,
grpc.aio._call.AioRpcError: <AioRpcError of RPC that terminated with:
| status = StatusCode.UNAVAILABLE
| details = "failed to connect to all addresses; last error: UNKNOWN: ipv4:10.140.0.76:6379: Failed to connect to remote host: Connection refused"
// Remaining output omitted; reply if you need the rest.
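The repeated `Connection refused` for `10.140.0.76:6379` means nothing was accepting TCP connections on the GCS port at that point. If this happens again while the cluster is still up, a quick probe like the sketch below could confirm it from any node. It uses only the Python standard library; `can_connect` is a hypothetical helper name, and the 2-second default timeout is an arbitrary choice.

```python
import socket


def can_connect(host: str, port: int, timeout_s: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout_s."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts.
        return False
```

For the address in the traceback above, the call would be `can_connect("10.140.0.76", 6379)`. A `False` result only says the port is not reachable; it does not say why the GCS process stopped.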
ray_client_server_23001.err:
2023-02-20 21:06:20,587|WARNING worker.py:1851 -- The log monitor on node SH-IDC1-10-140-1-176 failed with the following error:
Traceback (most recent call last):
File "/mnt/petrelfs/huangting.p/.conda/envs/ray2.2.0-py3.10/lib/python3.10/site-packages/ray/_private/log_monitor.py", line 520, in <module>
log_monitor.run()
File "/mnt/petrelfs/huangting.p/.conda/envs/ray2.2.0-py3.10/lib/python3.10/site-packages/ray/_private/log_monitor.py", line 436, in run
anything_published = self.check_log_files_and_publish_updates()
File "/mnt/petrelfs/huangting.p/.conda/envs/ray2.2.0-py3.10/lib/python3.10/site-packages/ray/_private/log_monitor.py", line 339, in check_log_files_and_publish_updates
file_info.reopen_if_necessary()
File "/mnt/petrelfs/huangting.p/.conda/envs/ray2.2.0-py3.10/lib/python3.10/site-packages/ray/_private/log_monitor.py", line 79, in reopen_if_necessary
new_inode = os.stat(self.filename).st_ino
FileNotFoundError: [Errno 2] No such file or directory: '/mnt/cache/share_data/huangting.p/ray/session_2023-02-20_17-46-26_882787_1059999/logs/worker-65110474ef8dc70b12a4df98c6ab9c9a671758254b23109f56b79aa6-01000000-1061751.err'
2023-02-20 21:06:20,612|WARNING worker.py:1851 -- The log monitor on node SH-IDC1-10-140-0-76 failed with the following error:
Traceback (most recent call last):
File "/mnt/petrelfs/huangting.p/.conda/envs/ray2.2.0-py3.10/lib/python3.10/site-packages/ray/_private/log_monitor.py", line 520, in <module>
log_monitor.run()
File "/mnt/petrelfs/huangting.p/.conda/envs/ray2.2.0-py3.10/lib/python3.10/site-packages/ray/_private/log_monitor.py", line 436, in run
anything_published = self.check_log_files_and_publish_updates()
File "/mnt/petrelfs/huangting.p/.conda/envs/ray2.2.0-py3.10/lib/python3.10/site-packages/ray/_private/log_monitor.py", line 339, in check_log_files_and_publish_updates
file_info.reopen_if_necessary()
File "/mnt/petrelfs/huangting.p/.conda/envs/ray2.2.0-py3.10/lib/python3.10/site-packages/ray/_private/log_monitor.py", line 79, in reopen_if_necessary
new_inode = os.stat(self.filename).st_ino
FileNotFoundError: [Errno 2] No such file or directory: '/mnt/cache/share_data/huangting.p/ray/session_2023-02-20_17-46-26_882787_1059999/logs/worker-65110474ef8dc70b12a4df98c6ab9c9a671758254b23109f56b79aa6-01000000-1061751.err'
raylet.2.err:
/mnt/petrelfs/huangting.p/.conda/envs/ray2.2.0-py3.10/lib/python3.10/site-packages/ray/dashboard/agent.py:51: DeprecationWarning: There is no current event loop
aiogrpc.init_grpc_aio()
/mnt/petrelfs/huangting.p/.conda/envs/ray2.2.0-py3.10/lib/python3.10/site-packages/ray/dashboard/modules/reporter/reporter_agent.py:65: UserWarning: Importing gpustat failed, fix this to have full functionality of the dashboard. The original error was:

libtinfow.so.6: cannot open shared object file: No such file or directory
warnings.warn(
[2023-02-20 21:07:26,085 C 39637 39637] (raylet) gcs_rpc_client.h:537: Check failed: absl::ToInt64Seconds(absl::Now() - gcs_last_alive_time_) < ::RayConfig::instance().gcs_rpc_server_reconnect_timeout_s() Failed to connect to GCS within 60 seconds
*** StackTrace Information ***
/mnt/petrelfs/huangting.p/.conda/envs/ray2.2.0-py3.10/lib/python3.10/site-packages/ray/core/src/ray/raylet/raylet(+0x4fa17a) [0x556a3ea8d17a] ray::operator<<()
/mnt/petrelfs/huangting.p/.conda/envs/ray2.2.0-py3.10/lib/python3.10/site-packages/ray/core/src/ray/raylet/raylet(+0x4fbc52) [0x556a3ea8ec52] ray::SpdLogMessage::Flush()
/mnt/petrelfs/huangting.p/.conda/envs/ray2.2.0-py3.10/lib/python3.10/site-packages/ray/core/src/ray/raylet/raylet(+0x4fbf67) [0x556a3ea8ef67] ray::RayLog::~RayLog()
/mnt/petrelfs/huangting.p/.conda/envs/ray2.2.0-py3.10/lib/python3.10/site-packages/ray/core/src/ray/raylet/raylet(+0x386493) [0x556a3e919493] ray::rpc::GcsRpcClient::CheckChannelStatus()
/mnt/petrelfs/huangting.p/.conda/envs/ray2.2.0-py3.10/lib/python3.10/site-packages/ray/core/src/ray/raylet/raylet(+0x386903) [0x556a3e919903] boost::asio::detail::wait_handler<>::do_complete()
/mnt/petrelfs/huangting.p/.conda/envs/ray2.2.0-py3.10/lib/python3.10/site-packages/ray/core/src/ray/raylet/raylet(+0xa4a87b) [0x556a3efdd87b] boost::asio::detail::scheduler::do_run_one()
/mnt/petrelfs/huangting.p/.conda/envs/ray2.2.0-py3.10/lib/python3.10/site-packages/ray/core/src/ray/raylet/raylet(+0xa4c041) [0x556a3efdf041] boost::asio::detail::scheduler::run()
/mnt/petrelfs/huangting.p/.conda/envs/ray2.2.0-py3.10/lib/python3.10/site-packages/ray/core/src/ray/raylet/raylet(+0xa4c270) [0x556a3efdf270] boost::asio::io_context::run()
/mnt/petrelfs/huangting.p/.conda/envs/ray2.2.0-py3.10/lib/python3.10/site-packages/ray/core/src/ray/raylet/raylet(+0x15b1a9) [0x556a3e6ee1a9] main
/lib64/libc.so.6(__libc_start_main+0xf5) [0x7fe4ef4f2555] __libc_start_main
/mnt/petrelfs/huangting.p/.conda/envs/ray2.2.0-py3.10/lib/python3.10/site-packages/ray/core/src/ray/raylet/raylet(+0x19b1b7) [0x556a3e72e1b7]
Neither gcs_server.err nor gcs_server.out contains any error output.
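To make the "logs stop at the failure time" check repeatable, here is a rough sketch that reports the last timestamp found in each file of a log directory. It assumes timestamps look like `2023-02-20 21:06:19`, which matches both the Python-side logs and the raylet line quoted above; other formats would need a different pattern. The names `TS_RE`, `last_timestamp`, and `last_timestamps` are hypothetical.

```python
import re
from datetime import datetime
from pathlib import Path
from typing import Dict, Iterable, Optional

# Assumed timestamp shape: "YYYY-MM-DD HH:MM:SS" (fractional seconds are ignored).
TS_RE = re.compile(r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}")


def last_timestamp(lines: Iterable[str]) -> Optional[datetime]:
    """Return the timestamp from the last line that contains one, or None."""
    last = None
    for line in lines:
        match = TS_RE.search(line)
        if match:
            last = datetime.strptime(match.group(0), "%Y-%m-%d %H:%M:%S")
    return last


def last_timestamps(log_dir: str) -> Dict[str, Optional[datetime]]:
    """Map each regular file directly under log_dir to its last timestamp."""
    result = {}
    for path in sorted(Path(log_dir).iterdir()):
        if path.is_file():
            # errors="replace" keeps the scan going on non-UTF-8 bytes.
            with path.open(errors="replace") as f:
                result[path.name] = last_timestamp(f)
    return result
```

Running `last_timestamps` on the session's `logs` directory and sorting by value would show which component went quiet first.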