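How can I improve RayTask worker process reuse when resources are limited, instead of having the processes reclaimed immediately after a task finishes?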

[Ray environment] POC
[Ray version and libraries] Ray 2.5.1
[Setup]
- K8S, 8 CPUs, 64 GB memory

import asyncio
import time

import ray

ray.init()


@ray.remote(num_cpus=0.01)
def sleep():
    time.sleep(1)
    return 0


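async def coro():
    start_time_out = time.time()
    ref = sleep.remote()

    # Awaiting the ObjectRef yields to the event loop until the task is done,
    # so the ray.get() that follows returns right away.
    await ref
    ray.get(ref)
    end_time_out = time.time()
    # Subtract the 1 s spent in time.sleep() so only the scheduling and
    # worker startup overhead remains.
    overhead = end_time_out - start_time_out - 1
    print(f'remote task cost {overhead:.6f} s')
    return overhead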


all_result = []


async def main():
    tasks = []
    for i in range(30):
        task = asyncio.create_task(coro())
        tasks.append(task)
    done, _ = await asyncio.wait(tasks)

    result = [fut.result() for fut in done]
    all_result.extend(result)
    # An overhead below 0.01 s is treated as "the task landed on a live worker".
    lived_num = sum([1 if i < 0.01 else 0 for i in result])
    print(f'--------------------------- live process num {lived_num}, rate {(lived_num / len(result)) * 100:.2f}%')
    return


if __name__ == '__main__':
    for i in range(10):
        asyncio.run(main())
        time.sleep(1)
        asyncio.run(main())

    lived_num = sum([1 if i < 0.01 else 0 for i in all_result])
    print(f'>>>>>>>>>>>>>>>>>>>>>>>>>>> live process num {lived_num}, rate {(lived_num / len(all_result)) * 100:.2f}%')
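
The script above only reports how many tasks came in under the 0.01 s threshold. To compare runs before and after a configuration change, it may help to look at the spread of the overheads as well. The following is a minimal sketch; `summarize_overheads` is a hypothetical helper name, not part of Ray, and the sample values are the first eight overheads of the second batch in the output shown in this post.

```python
from statistics import median


def summarize_overheads(overheads, warm_threshold=0.01):
    """Summarize per-task overheads (seconds) collected by the script.

    warm_threshold mirrors the 0.01 s cutoff the script uses to decide
    that a task was served by an already running worker.
    """
    if not overheads:
        raise ValueError("overheads must not be empty")
    warm = sum(1 for o in overheads if o < warm_threshold)
    return {
        "count": len(overheads),
        "warm": warm,
        "warm_rate": warm / len(overheads),
        "min": min(overheads),
        "median": median(overheads),
        "max": max(overheads),
    }


# First eight overheads of the second batch in the posted output.
sample = [0.008732, 0.010633, 0.011312, 0.012592,
          0.019629, 0.032814, 0.033267, 0.035363]
summary = summarize_overheads(sample)
print(summary)
```

In the script, passing `all_result` to this helper after the loop would give one summary per run.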

[Reproduction]
- RayTask spends longer than expected on task scheduling, roughly 1-6 seconds, and worker processes are not reused effectively.

(base) [root@9c9b487bf-t9ls7 opt]# python ray_test.py
2023-10-21 11:37:50,317 INFO worker.py:1364 -- Connecting to existing Ray cluster at address: 172.17.1.171:6380...
2023-10-21 11:37:50,335 INFO worker.py:1544 -- Connected to Ray cluster. View the dashboard at 172.17.1.171:8265
remote task cost 1.403650 s
remote task cost 1.414990 s
remote task cost 1.404192 s
remote task cost 1.403921 s
remote task cost 1.440736 s
remote task cost 1.488831 s
remote task cost 1.501244 s
remote task cost 1.506321 s
remote task cost 1.775334 s
remote task cost 1.839326 s
remote task cost 1.847032 s
remote task cost 1.847158 s
remote task cost 1.847301 s
remote task cost 2.026421 s
remote task cost 2.168200 s
remote task cost 2.307553 s
remote task cost 2.370319 s
remote task cost 2.404747 s
remote task cost 2.408004 s
remote task cost 2.411449 s
remote task cost 2.522131 s
remote task cost 2.592857 s
remote task cost 2.656482 s
remote task cost 2.814053 s
remote task cost 2.822711 s
remote task cost 2.841676 s
remote task cost 2.909370 s
remote task cost 2.993161 s
remote task cost 3.034261 s
remote task cost 3.074411 s
--------------------------- live process num 0, rate 0.00%
remote task cost 0.008732 s
remote task cost 0.010633 s
remote task cost 0.011312 s
remote task cost 0.012592 s
remote task cost 0.019629 s
remote task cost 0.032814 s
remote task cost 0.033267 s
remote task cost 0.035363 s
remote task cost 0.886400 s
remote task cost 0.894015 s
remote task cost 0.894363 s
remote task cost 0.951854 s
remote task cost 0.975335 s
remote task cost 1.017292 s
remote task cost 1.018023 s
remote task cost 1.018267 s
remote task cost 1.018735 s
remote task cost 1.024701 s
remote task cost 1.040073 s
remote task cost 1.040633 s
remote task cost 1.040805 s
remote task cost 1.341469 s
remote task cost 1.382063 s
remote task cost 1.642497 s
remote task cost 1.643132 s
remote task cost 1.651612 s
remote task cost 1.836777 s
remote task cost 1.895972 s
remote task cost 1.904340 s
remote task cost 1.908550 s
--------------------------- live process num 1, rate 3.33%
You could try adjusting the kill_idle_workers_interval_ms parameter; the default is 200.
ray.init(
    _system_config={
        "kill_idle_workers_interval_ms": 200,
    },
)
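
The snippet above passes 200, which is the default mentioned in the reply, so it would not change anything by itself. Below is a minimal sketch of trying a different value; `build_system_config` and `start_local_ray` are hypothetical helper names, and 1000 is an arbitrary illustration, not a recommended setting. Note also that the posted log shows the script connecting to an existing cluster. As far as I know, Ray does not accept `_system_config` in that case, so the setting would likely have to be applied where the head node is started.

```python
def build_system_config(kill_idle_workers_interval_ms=200):
    """Build the _system_config dict for the parameter named in the reply.

    200 is the default quoted in the reply; other values are experiments.
    """
    if kill_idle_workers_interval_ms <= 0:
        raise ValueError("interval must be a positive number of milliseconds")
    return {"kill_idle_workers_interval_ms": int(kill_idle_workers_interval_ms)}


def start_local_ray(interval_ms):
    """Start a new local Ray instance with the overridden interval.

    Assumption: this starts a fresh cluster rather than connecting to an
    existing one, since _system_config is set at cluster start.
    """
    import ray  # imported lazily so the config helper works without Ray

    return ray.init(_system_config=build_system_config(interval_ms))


# Hypothetical value for an experiment; measure before and after.
config = build_system_config(1000)
print(config)
```

Whether a larger interval actually raises the reuse rate is something to confirm by rerunning the test script and comparing the reported rates.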

OK, I'll give it a try. Thanks!