Improve `ball_query()` runtime for large-scale cases (#2006)

Summary:
Overview
The current C++ code for `pytorch3d.ops.ball_query()` performs a floating-point multiplication for every coordinate of every pair of points (until the maximum number of neighbor points is reached). This PR modifies the code (both the CPU and CUDA versions) to implement the following idea: a `D`-cube around the `D`-ball is first constructed, and any point pairs falling outside the cube are skipped without explicitly computing the squared distances. This change is especially useful when the dimension `D` and the number of points `P2` are large and the radius is much smaller than the extent of the space occupied by the point clouds; as much as a ~2.5x speedup (CPU case; ~1.8x speedup in the CUDA case) is observed when `D = 10` and `radius = 0.01`. In all benchmark cases, points were distributed uniformly at random inside a unit `D`-cube.

The benchmark code used was different from `tests/benchmarks/bm_ball_query.py` (only the forward part is benchmarked, and larger input sizes were used) and is stored in `tests/benchmarks/bm_ball_query_large.py`.

Average time comparisons
[8 figures: average time comparisons]
Peak time comparisons
[8 figures: peak time comparisons]
Full benchmark logs
benchmark-before-change.txt benchmark-after-change.txt
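
Illustrative sketch

The cube pre-check described in the overview can be illustrated with a minimal pure-Python sketch. This is not the C++/CUDA code changed in this PR: the function name `ball_query_sketch`, its argument layout, and the strict `<` comparison against the squared radius are assumptions made for illustration only.

```python
def ball_query_sketch(p1, p2, radius, K):
    """Hypothetical illustration of the cube pre-check (not the PyTorch3D kernel).

    For each point in p1, collect the indices of up to K points in p2
    whose squared distance is below radius**2 (assumed strict comparison).
    """
    radius2 = radius * radius
    result = []
    for a in p1:
        idxs = []
        for j, b in enumerate(p2):
            # Cube pre-check: if any single coordinate differs by more than
            # the radius, the point lies outside the D-cube enclosing the
            # D-ball, so it cannot be inside the ball. No multiplication needed.
            if any(abs(x - y) > radius for x, y in zip(a, b)):
                continue
            # Only points inside the cube pay for the squared distance.
            dist2 = sum((x - y) ** 2 for x, y in zip(a, b))
            if dist2 < radius2:
                idxs.append(j)
                if len(idxs) == K:
                    break  # maximum number of neighbors reached
        result.append(idxs)
    return result


# Small 2-D example: index 1 is rejected by the cube check alone,
# index 3 passes the cube check but fails the distance check.
p1 = [(0.0, 0.0)]
p2 = [(0.05, 0.0), (0.5, 0.0), (0.0, 0.09), (0.08, 0.08), (0.01, 0.01)]
neighbors = ball_query_sketch(p1, p2, radius=0.1, K=3)
```

The pre-check uses only subtractions and comparisons and can stop at the first coordinate that exceeds the radius, which is why the benefit grows with `D` and with the fraction of pairs that fall outside the cube.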
Pull Request resolved: https://github.com/facebookresearch/pytorch3d/pull/2006
Reviewed By: shapovalov
Differential Revision: D85356394
Pulled By: bottler
fbshipit-source-id: 9b3ce5fc87bb73d4323cc5b4190fc38ae42f41b2