Profiler

class paddle.profiler.Profiler(*, targets: Optional[Iterable[ProfilerTarget]] = None, scheduler: Union[Callable[[int], ProfilerState], tuple, None] = None, on_trace_ready: Optional[Callable[..., Any]] = None, record_shapes: Optional[bool] = False, profile_memory: Optional[bool] = False, timer_only: Optional[bool] = False, custom_device_types: Optional[list] = [], with_flops: Optional[bool] = False)

The performance profiler. This class manages starting and stopping profiling, as well as exporting and summarizing the collected performance data.

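Parameters

targets (list, optional) - The devices to profile. By default, all available and supported devices are profiled. CPU, GPU, and MLU are currently supported (see ProfilerTarget for the allowed values).

scheduler (Callable|tuple, optional) - If a Callable, it acts as the scheduler of the profiler state: it takes a step_num argument and returns the corresponding state (see the ProfilerState description for details). A scheduler can be generated with the make_scheduler interface. If this parameter is not set (None), the default scheduler keeps the profiler in the RECORD state until the end. If a tuple, it holds two values, start_batch and end_batch, and the profiler is in the RECORD state within the half-open interval [start_batch, end_batch).

on_trace_ready (Callable, optional) - A callback that handles the profiler's output. It takes the Profiler object as its argument and provides a way to customize post-processing. It is called whenever performance data is returned, that is, when the profiler is in the RECORD_AND_RETURN state or when profiling ends. Defaults to export_chrome_tracing('./profiler_log/').

record_shapes (bool, optional) - If True, the shapes of the input Tensors of framework operators are collected. Defaults to False.

profile_memory (bool, optional) - If True, memory profiling data is collected. Defaults to False.

timer_only (bool, optional) - If True, only the data loading time and the time consumed by each iteration are measured, and no profiling is performed. Otherwise, the model is timed and profiled at the same time. Default: False.

custom_device_types (list, optional) - If targets contains profiler.ProfilerTarget.CUSTOM_DEVICE, custom_device_types selects the custom device types to profile. The default, an empty list, means all custom devices.

with_flops (bool, optional) - If True, the data needed to compute FLOPs metrics is collected. Default: False.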
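
The two forms of the scheduler parameter can be illustrated with a small stand-alone sketch. It does not import Paddle: the State enum is a stand-in for paddle.profiler.ProfilerState, and window_scheduler is a hypothetical helper that mimics the documented tuple behavior (recording in [start_batch, end_batch)). Returning RECORD_AND_RETURN on the last recorded step is an assumption made here so that a callback would have data to process; the real tuple handling may differ in detail.

```python
from enum import Enum


class State(Enum):
    """Stand-in for paddle.profiler.ProfilerState (not imported from Paddle)."""
    CLOSED = 0
    READY = 1  # not used in this simplified sketch
    RECORD = 2
    RECORD_AND_RETURN = 3


def window_scheduler(start_batch, end_batch):
    """Hypothetical helper: a scheduler that records in [start_batch, end_batch)."""
    def scheduler(step_num):
        if start_batch <= step_num < end_batch:
            # Assumption: hand the data back on the last recorded step.
            if step_num == end_batch - 1:
                return State.RECORD_AND_RETURN
            return State.RECORD
        return State.CLOSED
    return scheduler


scheduler = window_scheduler(2, 5)
states = [scheduler(step).name for step in range(7)]
print(states)
# ['CLOSED', 'CLOSED', 'RECORD', 'RECORD', 'RECORD_AND_RETURN', 'CLOSED', 'CLOSED']
```

In real code the callable would return members of paddle.profiler.ProfilerState and be passed as scheduler=..., or the tuple (2, 5) could be passed directly.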

Code example 1

Profile batches [2, 5).

 >>> import paddle.profiler as profiler
 >>> import paddle
 >>> paddle.device.set_device('gpu')
 >>> with profiler.Profiler(
 ...     targets=[profiler.ProfilerTarget.CPU, profiler.ProfilerTarget.GPU],
 ...     scheduler = (2, 5),
 ...     on_trace_ready = profiler.export_chrome_tracing('./log')
 ... ) as p:
 ...     for iter in range(10):
 ...         # train()
 ...         p.step()

         

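Code example 2

Profile batches [2, 4], [7, 9], and [12, 14]. Each cycle of closed=1, ready=1, record=3 spans five steps, so the third window starts at step 12; the 10-iteration loop in this example reaches only the first two windows.

 >>> import paddle.profiler as profiler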
 >>> import paddle
 >>> paddle.device.set_device('gpu')
 >>> with profiler.Profiler(
 ...     targets=[profiler.ProfilerTarget.CPU, profiler.ProfilerTarget.GPU],
 ...     scheduler = profiler.make_scheduler(closed=1, ready=1, record=3, repeat=3),
 ...     on_trace_ready = profiler.export_chrome_tracing('./log')
 ... ) as p:
 ...     for iter in range(10):
 ...         # train()
 ...         p.step()
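
The recording windows implied by closed=1, ready=1, record=3, repeat=3 can be checked with a small stand-alone model. This is a sketch of how the cycle arithmetic is assumed to work (a repeating closed, ready, record pattern that stops after repeat cycles), not Paddle's implementation; cycle_state is a hypothetical name, and the model folds the final returning step of each window into "RECORD".

```python
def cycle_state(step, closed, ready, record, repeat):
    """Toy model of a repeating closed -> ready -> record cycle."""
    period = closed + ready + record
    # After `repeat` full cycles the profiler stays closed.
    if repeat > 0 and step // period >= repeat:
        return "CLOSED"
    pos = step % period
    if pos < closed:
        return "CLOSED"
    if pos < closed + ready:
        return "READY"
    return "RECORD"


recorded = [
    step for step in range(20)
    if cycle_state(step, closed=1, ready=1, record=3, repeat=3) == "RECORD"
]
print(recorded)  # [2, 3, 4, 7, 8, 9, 12, 13, 14]
```

Under this model, a loop would need at least 15 iterations to pass through all three windows.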

         

Code example 3

Use all default parameters without the context manager, profiling the whole run.

 >>> import paddle.profiler as profiler
 >>> import paddle
 >>> paddle.device.set_device('gpu')
 >>> p = profiler.Profiler()
 >>> p.start()
 >>> for iter in range(10):
 ...     #train()
 ...     p.step()
 >>> p.stop()
 >>> p.summary()

         

Code example 4

Use the profiler to obtain the model's throughput and time cost.

 >>> import paddle
 >>> import paddle.profiler as profiler

 >>> class RandomDataset(paddle.io.Dataset):
 ...     def __init__(self, num_samples):
 ...         self.num_samples = num_samples
 ...     def __getitem__(self, idx):
 ...         image = paddle.rand(shape=[100], dtype='float32')
 ...         label = paddle.randint(0, 10, shape=[1], dtype='int64')
 ...         return image, label
 ...     def __len__(self):
 ...         return self.num_samples
 >>> class SimpleNet(paddle.nn.Layer):
 ...     def __init__(self):
 ...         super().__init__()
 ...         self.fc = paddle.nn.Linear(100, 10)
 ...     def forward(self, image, label=None):
 ...         return self.fc(image)
 >>> dataset = RandomDataset(20 * 4)
 >>> simple_net = SimpleNet()
 >>> opt = paddle.optimizer.SGD(learning_rate=1e-3, parameters=simple_net.parameters())
 >>> BATCH_SIZE = 4
 >>> loader = paddle.io.DataLoader(
 ...     dataset,
 ...     batch_size=BATCH_SIZE)
 >>> p = profiler.Profiler(timer_only=True)
 >>> p.start()
 >>> for i, (image, label) in enumerate(loader()):
 ...     out = simple_net(image)
 ...     loss = paddle.nn.functional.cross_entropy(out, label)
 ...     avg_loss = paddle.mean(loss)
 ...     avg_loss.backward()
 ...     opt.minimize(avg_loss)
 ...     simple_net.clear_gradients()
 ...     p.step(num_samples=BATCH_SIZE)
 ...     if i % 10 == 0:
 ...         step_info = p.step_info(unit='images')
 ...         print("Iter {}: {}".format(i, step_info))
 ...         # The average statistics for 10 steps between the last and this call will be
 ...         # printed when the "step_info" is called at 10 iteration intervals.
 ...         # The values you get may be different from the following.
 ...         # Iter 0:  reader_cost: 0.51946 s batch_cost: 0.66077 s ips: 6.054 images/s
 ...         # Iter 10:  reader_cost: 0.00014 s batch_cost: 0.00441 s ips: 907.009 images/s
 >>> p.stop()
 >>> # The performance summary will be automatically printed when the "stop" is called.
 >>> # Reader Ratio: 2.658%
 >>> # Time Unit: s, IPS Unit: images/s
 >>> # |                 |       avg       |       max       |       min       |
 >>> # |   reader_cost   |     0.00011     |     0.00013     |     0.00007     |
 >>> # |    batch_cost   |     0.00405     |     0.00434     |     0.00326     |
 >>> # |       ips       |    1086.42904   |    1227.30604   |    959.92796    |

         

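Methods

start()

Starts the profiler and enters the state scheduler(0). That is, the profiler state changes from CLOSED to scheduler(0), and the behavior corresponding to the new state is triggered.

Code example

 >>> import paddle.profiler as profiler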
 >>> import paddle
 >>> paddle.device.set_device('gpu')
 >>> prof = profiler.Profiler(
 ...     targets=[profiler.ProfilerTarget.CPU, profiler.ProfilerTarget.GPU],
 ...     scheduler = (1, 9),
 ...     on_trace_ready = profiler.export_chrome_tracing('./log'))
 >>> prof.start()
 >>> for iter in range(10):
 ...     # train()
 ...     prof.step()
 >>> prof.stop()

          

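stop()

Stops the profiler and enters the CLOSED state. That is, the profiler state changes from the current state to CLOSED and the profiler shuts down. If performance data is returned, the on_trace_ready callback is called to process it.

Code example

 >>> import paddle.profiler as profiler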
 >>> import paddle
 >>> paddle.device.set_device('gpu')
 >>> prof = profiler.Profiler(
 ...     targets=[profiler.ProfilerTarget.CPU, profiler.ProfilerTarget.GPU],
 ...     scheduler = (1, 7),
 ...     on_trace_ready = profiler.export_chrome_tracing('./log'))
 >>> prof.start()
 >>> for iter in range(10):
 ...     # train()
 ...     prof.step()
 >>> prof.stop()
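
Because on_trace_ready receives the Profiler object, a custom callback can decide where and how the data is written. The sketch below is one possible shape for such a callback, assuming only the export(path, format) method documented on this page. make_trace_handler and the trace_N.json naming are hypothetical, and FakeProfiler is a stand-in so the sketch runs without Paddle.

```python
import os
import tempfile


def make_trace_handler(dir_name):
    """Hypothetical factory: build a callback suitable for on_trace_ready."""
    calls = {"count": 0}

    def handler(prof):
        # `prof` is the Profiler object handed to the callback; export(path, format)
        # is the method documented on this page.
        os.makedirs(dir_name, exist_ok=True)
        path = os.path.join(dir_name, "trace_{}.json".format(calls["count"]))
        prof.export(path=path, format="json")
        calls["count"] += 1

    return handler


class FakeProfiler:
    """Stand-in used only to exercise the callback without Paddle installed."""

    def __init__(self):
        self.exported = []

    def export(self, path, format="json"):
        self.exported.append((os.path.basename(path), format))


fake = FakeProfiler()
handler = make_trace_handler(tempfile.mkdtemp())
handler(fake)
handler(fake)
print(fake.exported)  # [('trace_0.json', 'json'), ('trace_1.json', 'json')]
```

With a real profiler, the handler would be passed as on_trace_ready=make_trace_handler('./log') in place of profiler.export_chrome_tracing('./log').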

          

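step(num_samples: Optional[int]=None)

Signals the profiler to enter the next step. The new profiler state is computed from the scheduler, and the behavior corresponding to the new state is triggered. If performance data is returned, the on_trace_ready callback is called to process it.

Parameters

num_samples (int|None, optional) - The number of samples in each step of the model run (the batch size). It is used to compute throughput when timer_only is True. Default: None.

Code example

 >>> import paddle.profiler as profiler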
 >>> import paddle
 >>> paddle.device.set_device('gpu')
 >>> prof = profiler.Profiler(
 ...     targets=[profiler.ProfilerTarget.CPU, profiler.ProfilerTarget.GPU],
 ...     scheduler = (3, 7),
 ...     on_trace_ready = profiler.export_chrome_tracing('./log'))

 >>> prof.start()
 >>> for iter in range(10):
 ...     #train()
 ...     prof.step()
 >>> prof.stop()
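
The interplay of start(), step(), and stop() with the scheduler and the callback can be pictured with a toy state machine. This is a simplified model of the transitions described on this page, with states as plain strings; ToyProfiler and sched are hypothetical names, and the exact moment at which the real profiler invokes on_trace_ready may differ.

```python
class ToyProfiler:
    """Toy model of the start/step/stop transitions; not Paddle's implementation."""

    def __init__(self, scheduler, on_trace_ready):
        self.scheduler = scheduler
        self.on_trace_ready = on_trace_ready
        self.step_num = 0
        self.state = "CLOSED"

    def start(self):
        # CLOSED -> scheduler(0)
        self.state = self.scheduler(self.step_num)

    def step(self):
        # Leaving a RECORD_AND_RETURN step means data is available.
        if self.state == "RECORD_AND_RETURN":
            self.on_trace_ready(self)
        self.step_num += 1
        self.state = self.scheduler(self.step_num)

    def stop(self):
        # Data still being recorded at shutdown is also handed to the callback.
        if self.state in ("RECORD", "RECORD_AND_RETURN"):
            self.on_trace_ready(self)
        self.state = "CLOSED"


def sched(step_num):
    """Record steps [3, 7), returning data on the last one."""
    if step_num == 6:
        return "RECORD_AND_RETURN"
    if 3 <= step_num < 6:
        return "RECORD"
    return "CLOSED"


fired = []
prof = ToyProfiler(sched, lambda p: fired.append(p.step_num))
prof.start()
for _ in range(10):
    prof.step()
prof.stop()
print(fired)  # [6]
```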

          

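step_info(unit: Optional[str]=None)

Gets the statistics of the current iteration. If this method is called at a fixed iteration interval, the result is the average over all iterations between the previous call and this one. The statistics are:

reader_cost: the cost of loading data, in seconds.
batch_cost: the cost of one iteration, in seconds.
ips (Instances Per Second): the model throughput, in samples/s or another unit depending on the unit parameter. When num_samples of step() is None, the unit is steps/s.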
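
How the three statistics relate can be sketched as follows. This is a simplified model, assuming a constant number of samples per step and throughput computed as samples per step divided by the average batch cost. interval_stats is a hypothetical helper, not part of paddle.profiler, and the library's internal bookkeeping may differ.

```python
def interval_stats(batch_costs, reader_costs, num_samples=None, unit="samples"):
    """Average per-step timings over an interval and derive a throughput figure."""
    avg_batch = sum(batch_costs) / len(batch_costs)
    avg_reader = sum(reader_costs) / len(reader_costs)
    if num_samples is None:
        # Without a sample count, throughput can only be expressed in steps.
        ips, ips_unit = 1.0 / avg_batch, "steps/s"
    else:
        ips, ips_unit = num_samples / avg_batch, "{}/s".format(unit)
    return {
        "reader_cost": avg_reader,
        "batch_cost": avg_batch,
        "ips": ips,
        "ips_unit": ips_unit,
    }


with_samples = interval_stats([0.125, 0.375], [0.0625, 0.0625], num_samples=4, unit="images")
without_samples = interval_stats([0.125, 0.375], [0.0625, 0.0625])
print(with_samples["ips"], with_samples["ips_unit"])        # 16.0 images/s
print(without_samples["ips"], without_samples["ips_unit"])  # 4.0 steps/s
```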

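Parameters

unit (string, optional) - The unit of the input data. It takes effect only when num_samples of step() is specified as a number. For example, when unit is images, the throughput unit is images/s. Default: None, in which case the throughput unit is samples/s.

Returns

A string representing the statistics.

Code example

 >>> import paddle.profiler as profiler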
 >>> prof = profiler.Profiler(timer_only=True)
 >>> prof.start()
 >>> for iter in range(20):
 ...     #train()
 ...     prof.step()
 ...     if iter % 10 == 0:
 ...         print("Iter {}: {}".format(iter, prof.step_info()))
 ...         # The example does not call the DataLoader, so there is no "reader_cost".
 ...         # Iter 0:  batch_cost: 0.00001 s ips: 86216.623 steps/s
 ...         # Iter 10:  batch_cost: 0.00001 s ips: 103645.034 steps/s
 >>> prof.stop()
 >>> # Time Unit: s, IPS Unit: steps/s
 >>> # |                 |       avg       |       max       |       min       |
 >>> # |    batch_cost   |     0.00000     |     0.00002     |     0.00000     |
 >>> # |       ips       |   267846.19437  |   712030.38727  |   45134.16662   |

          

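export(path, format="json")

Exports the performance data to a file.

Parameters

path (str) - The file name to which the performance data is exported.

format (str, optional) - The export format. "json" and "pb" are currently supported: "json" exports a chrome tracing file and "pb" exports a protobuf file. Defaults to "json".

Code example

 >>> import paddle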
 >>> paddle.device.set_device('gpu')
 >>> import paddle.profiler as profiler
 >>> prof = profiler.Profiler(
 ...     targets=[profiler.ProfilerTarget.CPU, profiler.ProfilerTarget.GPU],
 ...     scheduler = (3, 7))
 >>> prof.start()
 >>> for iter in range(10):
 ...     # train()
 ...     prof.step()
 >>> prof.stop()
 >>> prof.export(path="./profiler_data.json", format="json")
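
A file exported with format="json" can be post-processed with ordinary JSON tooling. The sketch below assumes the file follows the general Chrome Trace Event format, with complete events marked "ph": "X" and carrying "name" and "dur" fields. Which fields Paddle actually writes is not confirmed by this page, so the sample data is synthetic and total_duration_by_name is a hypothetical helper.

```python
import json
import os
import tempfile
from collections import defaultdict


def total_duration_by_name(trace_path):
    """Sum the durations of complete ("X") events per event name."""
    with open(trace_path, "r", encoding="utf-8") as f:
        trace = json.load(f)
    # The Chrome format allows either {"traceEvents": [...]} or a bare list.
    events = trace["traceEvents"] if isinstance(trace, dict) else trace
    totals = defaultdict(float)
    for event in events:
        if event.get("ph") == "X" and "dur" in event:
            totals[event["name"]] += event["dur"]
    return dict(totals)


# Synthetic trace standing in for an exported file (event names are made up).
sample = {"traceEvents": [
    {"name": "matmul", "ph": "X", "ts": 0, "dur": 30},
    {"name": "matmul", "ph": "X", "ts": 40, "dur": 20},
    {"name": "relu", "ph": "X", "ts": 70, "dur": 5},
    {"name": "process_name", "ph": "M"},
]}
sample_path = os.path.join(tempfile.mkdtemp(), "profiler_data.json")
with open(sample_path, "w", encoding="utf-8") as f:
    json.dump(sample, f)

totals = total_duration_by_name(sample_path)
print(totals)  # {'matmul': 50.0, 'relu': 5.0}
```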

          

summary(sorted_by=SortedKeys.CPUTotal, op_detail=True, thread_sep=False, time_unit='ms', views=None)

Summarizes the performance data and prints it as tables. Six views are currently supported: overview, model, distributed, operator, memory manipulation, and user-defined.

Parameters

sorted_by (SortedKeys, optional) - How the table entries are sorted. Defaults to SortedKeys.CPUTotal.

op_detail (bool, optional) - Whether to print detailed information about each process inside an operator. Defaults to True.

thread_sep (bool, optional) - Whether to print the data separately for each thread. Defaults to False.

time_unit (str, optional) - The time unit of the table data. Defaults to 'ms'; 's', 'us', and 'ns' are also available.

views (SummaryView|list[SummaryView], optional) - The list of tables to print. Defaults to None, which prints all tables.

Code example

 >>> import paddle
 >>> paddle.device.set_device('gpu')
 >>> import paddle.profiler as profiler
 >>> prof = profiler.Profiler(
 ...     targets=[profiler.ProfilerTarget.CPU, profiler.ProfilerTarget.GPU],
 ...     scheduler = (3, 7),
 ...     on_trace_ready = profiler.export_chrome_tracing('./log'))
 >>> prof.start()
 >>> for iter in range(10):
 ...     # train()
 ...     prof.step()
 >>> prof.stop()
 >>> prof.summary(sorted_by=profiler.SortedKeys.CPUTotal, op_detail=True, thread_sep=False, time_unit='ms')

          

Tutorials that use this API

PyTorch usage: