Calling get on an Actor method is very slow. Is this expected?

tlu · February 8, 2023, 04:07

Test code:

import ray
import time


@ray.remote
class RayMemoryObject:
    def __init__(self):
        self.data = 1

    def get_data(self):
        return self.data


class MemoryObject:
    def __init__(self):
        self.data = 1

    def get_data(self):
        return self.data


NUM_RUNS = [1, 10, 100, 1000, 5000, 10000, 50000, 100000]
EACH_RUN = 5
if __name__ == "__main__":
    ray.init(num_cpus=8)
    ray_ref = RayMemoryObject.remote()
    mem_obj = MemoryObject()
    res = []
    for r in NUM_RUNS:
        print(f"Testing getting a number for {r} times...")
        tmp_data_ray = []
        tmp_data_mem = []
        for e_r in range(EACH_RUN):
            t0 = time.time()
            for i in range(r):
                ray.get(ray_ref.get_data.remote())
            t1 = time.time()
            for i in range(r):
                mem_obj.get_data()
            t2 = time.time()
            tmp_data_mem.append(t2 - t1)
            tmp_data_ray.append(t1 - t0)
        res.append([sum(tmp_data_mem) / len(tmp_data_mem), sum(tmp_data_ray) / len(tmp_data_ray)])
    print("%8s | %8s | %8s | %8s" % ("Num Iter", "Phy Mem", "Ray Mem", "Ratio"))
    for i in range(len(res)):
        print("%8d | %8.2f | %8.2f | %8.2f" % (NUM_RUNS[i], res[i][0], res[i][1], res[i][1] / res[i][0]))

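The first class reads the value from Ray's shared memory; the second reads it from physical memory. The results (times in seconds, averaged over 5 runs) are: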

Num Iter |  Phy Mem |  Ray Mem |    Ratio
       1 |     0.00 |     0.13 | 27309.15
      10 |     0.00 |     0.01 |   498.49
     100 |     0.00 |     0.08 |  1223.88
    1000 |     0.00 |     0.79 |  1275.69
    5000 |     0.00 |     4.16 |  1301.07
   10000 |     0.01 |     8.65 |  1679.47
   50000 |     0.02 |    41.11 |  1870.96
  100000 |     0.04 |    83.57 |  2112.83

Reading the value from Ray's shared memory is more than a thousand times slower than reading it from physical memory.
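Dividing the "Ray Mem" column by the iteration count gives a rough per-call cost. The sketch below uses only the totals from the table above; they are rounded to two decimals, so the small rows are imprecise. The names `ray_totals` and `per_call_ms` are illustrative.

```python
# Totals copied from the "Ray Mem" column of the table above (seconds).
ray_totals = {
    1: 0.13, 10: 0.01, 100: 0.08, 1000: 0.79,
    5000: 4.16, 10000: 8.65, 50000: 41.11, 100000: 83.57,
}


def per_call_ms(total_seconds, num_calls):
    """Average latency of a single call, in milliseconds."""
    return total_seconds / num_calls * 1000.0


per_call = {n: per_call_ms(t, n) for n, t in ray_totals.items()}
for n, ms in per_call.items():
    print(f"{n:>8d} calls: {ms:8.3f} ms per call")
```

For runs of 100 iterations or more this works out to a little under 1 ms per `ray.get(ray_ref.get_data.remote())` round trip. The single-iteration row is far higher, presumably because it includes one-time startup cost.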

Profile results (first few lines):
80036972 function calls (78347649 primitive calls) in 696.781 seconds

Ordered by: internal time

ncalls tottime percall cumtime percall filename:lineno(function)
830555 481.887 0.001 481.887 0.001 {method 'get_objects' of 'ray._raylet.CoreWorker' objects}
830555 44.821 0.000 44.821 0.000 {method 'submit_actor_task' of 'ray._raylet.CoreWorker' objects}
830555 16.173 0.000 16.173 0.000 {ray._raylet.split_buffer}
1 14.670 14.670 696.783 696.783 main.py:1(<module>)
830556 12.344 0.000 17.479 0.000 inspect.py:2909(_bind)
830555 9.008 0.000 566.797 0.001 worker.py:2205(get)
830555 6.928 0.000 17.581 0.000 {built-in method loads}
1661127/830558 6.815 0.000 591.847 0.001 client_mode_hook.py:96(wrapper)
830555 6.300 0.000 49.202 0.000 serialization.py:341(deserialize_objects)
830555 6.063 0.000 80.557 0.000 actor.py:1109(_actor_method_call)
830555 5.671 0.000 554.389 0.001 worker.py:643(get_objects)
830556 4.984 0.000 29.325 0.000 signature.py:81(flatten_args)
830555 4.745 0.000 9.390 0.000 worker.py:517(get_serialization_context)
830705 4.714 0.000 5.438 0.000 inspect.py:2781(__init__)

Here {method 'get_objects' of 'ray._raylet.CoreWorker' objects} takes a very long time. Is this expected?
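To put that row in proportion, here is a small calculation using only the numbers from the profile output above (total time, `tottime`, and `ncalls`). The variable names are illustrative.

```python
# Numbers copied from the profile output above.
total_seconds = 696.781        # total profiled time
get_objects_seconds = 481.887  # tottime of CoreWorker.get_objects
submit_seconds = 44.821        # tottime of CoreWorker.submit_actor_task
calls = 830555                 # ncalls for both rows

share = get_objects_seconds / total_seconds
get_objects_ms = get_objects_seconds / calls * 1000.0
submit_ms = submit_seconds / calls * 1000.0

print(f"get_objects share of total time: {share:.1%}")
print(f"get_objects per call: {get_objects_ms:.3f} ms")
print(f"submit_actor_task per call: {submit_ms:.3f} ms")
```

So `get_objects` accounts for roughly 69% of the profiled time, at about 0.58 ms per call.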

Catch-Bull · February 8, 2023, 09:41

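I think this makes sense. In your case:

ray.get: The data is the integer 1, which is small enough that it never passes through the raylet process. Most of the time goes to communication between the driver process and the actor process, which involves serialization, deserialization, and gRPC calls.
MemoryObject: This is just a plain in-memory access.

Summary: As I understand it, a gap of several thousand times between the two is not a problem. The comparison is not a fair one. If you want to see how much overhead a Ray actor adds on top of a direct gRPC call, you should compare ray.get against a gRPC client/server.
Further note: If the data you get exceeds the built-in threshold (100KB by default, adjustable), then ray.get(ray_ref.get_data.remote()) will also involve communication between raylet processes, and the gap will be even larger.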
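To see why any serialize, send, and deserialize round trip is far slower than a direct method call, here is a minimal sketch that uses only the Python standard library. It is not Ray and not gRPC: a thread connected over `socket.socketpair()` with `pickle` stands in for the actor process, so the absolute numbers are not comparable to the table above. The helpers `serve` and `compare` are hypothetical names for this illustration.

```python
import pickle
import socket
import struct
import threading
import time


class MemoryObject:
    def __init__(self):
        self.data = 1

    def get_data(self):
        return self.data


def _recv_exact(sock, n):
    """Read exactly n bytes from the socket."""
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("peer closed the socket")
        buf += chunk
    return buf


def _send_msg(sock, obj):
    """Pickle obj and send it with a 4-byte length prefix."""
    payload = pickle.dumps(obj)
    sock.sendall(struct.pack("!I", len(payload)) + payload)


def _recv_msg(sock):
    """Receive one length-prefixed message and unpickle it."""
    (size,) = struct.unpack("!I", _recv_exact(sock, 4))
    return pickle.loads(_recv_exact(sock, size))


def serve(sock, obj):
    """Answer method-name requests until the client sends None."""
    while True:
        method = _recv_msg(sock)
        if method is None:
            break
        _send_msg(sock, getattr(obj, method)())


def compare(num_calls=2000):
    """Time num_calls round trips against num_calls direct calls."""
    obj = MemoryObject()
    client, server = socket.socketpair()
    worker = threading.Thread(target=serve, args=(server, obj), daemon=True)
    worker.start()

    t0 = time.perf_counter()
    for _ in range(num_calls):
        _send_msg(client, "get_data")   # serialize + send the request
        remote_value = _recv_msg(client)  # wait for and deserialize the reply
    t1 = time.perf_counter()
    for _ in range(num_calls):
        local_value = obj.get_data()    # plain in-memory access
    t2 = time.perf_counter()

    _send_msg(client, None)  # tell the server loop to stop
    worker.join()
    client.close()
    server.close()
    return remote_value, local_value, t1 - t0, t2 - t1


if __name__ == "__main__":
    remote_value, local_value, remote_s, local_s = compare()
    print(f"round trip: {remote_s:.4f}s, direct: {local_s:.4f}s")
```

Each loop iteration on the round-trip side waits for a reply before sending the next request, just as the benchmark above calls ray.get once per remote call, so the per-call latency is paid in full on every iteration.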

tlu · February 9, 2023, 07:53

Got it. Thx.