CENG795-HW2: Raytracing with BVH and Transformations
WARNING: This blog post discusses not only my submission, but also a few fixes I have done while writing the post. Whenever such a deviation from the submission occurs, it is denoted as [NYI].
Hello everybody. Welcome to my second blog post about the first homework of the course CENG795, Advanced Ray Tracing.
I have developed a simple raytracer in the previous homework, which worked pretty well. However, it lacked two fundamental features: an acceleration structure and support for transformations. For the homework, we also had to implement instancing, which works pretty well with transformations.
However, this time, I did not have a total success.
Implementation
Bounding Volume Hierarchy
The first thing I did was to implement an acceleration structure.
After having more than 4 hours of rendering time in some scenes
(other_dragon to be exact), I needed that
structure.
There is two kinds of acceleration structures one can use in a scene: Top Level Acceleration Structures (TLAS) and Bottom Level Acceleration Structures (BLAS). I only implemented BLAS since transformations would make TLAS worse: transforming a bounding box without transforming every vertex will result in a non-tight bounding box if the transformation includes rotations. However, there is no such issue for BLAS due to the fact that BLAS is constructed inside the local space of the mesh. My implementation can easily be extended to TLAS, but it currently does not support it.
I added a jagged array of BVH nodes to the scene. Each array in
mesh_bvh contain the full BVH tree of the mesh at the same
index.
std::vector<std::vector<bvh_node_children_t>> mesh_bvh;Why not an array of nodes (bvh_node_t in my
implementation) but an array of children? The reason is to
simplify the creation of BVH. A bvh_node_children_t is
actually a simple array of 4 BVH nodes. I planned for some sort of a
SIMD traversal for which I would apply bounding box-ray intersection
algorithm to 4 nodes in parallel. Since I have not even succeed on
implementing the homework, I had to discard that plan. However, I wanted
to make this optimization possible in an easy way. This is why my BVH is
not a binary tree but a quad tree. The children of a node are laid
sequentially in the memory as bvh_node_children_t. This
also meant that I needed a single integer per node to denote children,
this index will be used to index
std::vector<bvh_node_children_t> part. This approach
also had another advantage of improved cache efficiency. If a BVH node
is hit, all of its children must be visited. Laying the children
sequentially can improve the performance, especially when a child is
discarded quickly before populating the cache with its own children.
One of the problems with this approach is indexing. Now, there are two indices denoting a BVH node: the index in the children array array and the index in the children array. This can be solved by using a bitfield:
[NYI]
struct bvh_index_t {
size_t array_index: 62;
size_t child_index: 2;
};I did not use this approach. Well, this actually came to my mind while writing this post. If I have time, I will probably change my implementation to use this trick. This is not a crucial trick tho, since BVH creation is not (at least should not) be a bottleneck and this improvement will not be a drastic one. It is sexy nonetheless.
For BVH creation, I had slice of the primitives which I divided using two algorithms: the simple median one and a complete tree one.
Of course the median one does not work as it is since my nodes had 4
children. However, 25th, 50th and 75th percentiles of the primitive
array can be used for partition. I divided (n) and modulo’d
(m) The size of the array slice. For each child, the start
of the child’s slice is the end of the previous child’s slice (at the
first iteration, the end of the slice of the previous child is the start
of the parent’s slice). This ensures there is no missed primitive. The
end of the child’s slice is the start of it + n +
(m < i) where i is the index of the child
we are processing. I actually have a wtf comment at the end
of this line, but it works to make it so that the partitions lean left.
If there is any remaining primitive after dividing by 4, the parititons
are increased by 1 starting from the leftmost child.
For the other implementation, I needed to find the largest power of 4 smaller than the size of the slice, not including the size itself.
static size_t get_previous_power_of_4(size_t n) {
return std::pow(4.0, std::floor(std::log2(n - 1) / 2.0));
}This achieves what I needed. It is somewhat slow since
log2 is not the fastest thing a processor can do. It can be
improved by using bit hacks; however, I did not use them since I needed
a working code first and I sometimes cannot be sure about the
correctness of bit hacks, especially ones I create myself. After calling
this function with the size of the slice, I just use this as the size of
the slice of each child. However, I segfaulted when I used this code. A
quick debugging session revealed that I accessed out-of-bounds in the
primitive array.
Let’s have 60 primitives. The get_previous_power_of_4
function will return 16, which is the size of each child
slice. The half-open intervals then would be:
0-16, 16-32, 32-48, 48-64
As you can see, the last slice accesses 60, 61, 62 and 63, all of which are out-of-bounds. The fix was simple: clamping the end to the size of the array. This also works when the right slice is empty, e.g. 40 primitives:
0-16, 16-32, 32-40, 40-40
Afterwards, I called the construction function recursively for each children and after returning, set the parent’s bounding box accordingly.
The BVH intersection was simple, but I wanted to support a possible TLAS implementation. After thinking for some time, I remembered what I did in my internship. I used callbacks for events in order to separate the GUI and the application logic in my internship and I realized that this can easily be applied in this case. An intersection is an event after all, not different than a key press!
If the BVH results in a negative index, which means that it is a leaf
node and the absolute value of this index denotes something for the
callback. The absolute value of the index, along with the ray and the
t_candidate reference, will be passed to a callback. This
way, the callback can do any intersection it wants. For BLAS, the
callback simply delegated the execution to the triangle intersection
function. As I said, I did not implement TLAS but it would simply be a
callback that delegates the execution to the BVH intersection with the
root at mesh_bvh[index] where the index is the
index passed to the callback.
When I tried to run with bunny, I saw that the geometry
looked like it was frontface-culled.
After some painful debugging, I came accross that this line was the culprit:
return intersect(bvh, ch, 0, t_out, index, hit_callback) or
intersect(bvh, ch, 1, t_out, index, hit_callback) or
intersect(bvh, ch, 2, t_out, index, hit_callback) or
intersect(bvh, ch, 3, t_out, index, hit_callback);If the first intersection succeeds, due to the short-circuit nature
of or, the remaining children are not tested at all!
However, if backface culling is enabled (which I managed to implement
using the boolean template approach I mentioned in the HW1 blog post),
the result is less severe:
After fixing the line so that each intersection is done and the
resulting booleans are saved to variables, which are ord to
get the return value, the result was correct (same as the one from HW1).
However, the time was about 2x of the non-BVH version!
Houston, we have a problem. After another turn of painful debugging
session, I realized that no matter what, the
aabb_intersection function returned true.
Since I took the intersection code from here and it
was pretty much magic, I could not debug it. I fixed the problem when I
came across this
AABB-ray intersection implementation, which is pretty much the same as
the previous one. But for some reason, this time it worked.
And now, a successful bunny in 136.5 milliseconds? BVH
was a great success!
Now, I can construct BVH for other_dragon in
1.62426 seconds and render the image in
0.904226 seconds. If I divide my result from HW1
(15122.9 seconds) by the total time the program needs to do
all the work with BVH (measured using Measure-Command
cmdlet of PowerShell), I got a 5815.5x improvement! This is
more than three orders of magnitude! Of course not all scenes had
improvements that dramatic. For instance, bunny by
comparison improved by about 27x, which is still more than
one order of magnitude. Still impressive!
Instancing
Before transformations, I wanted to implement instancing since I
thought that it would be easier. I thought about the architecture of the
program in the toilet cabins (the best place on the Earth to solve
engineering problems, for some reason) and decided to replace rendering
of the meshes with rendering of instances of meshes, with a
<Mesh> element also corresponding to an instance.
struct instance_t {
glm::mat4 transform = glm::mat4(1.0f);
glm::mat4 inverse_transform = glm::mat4(1.0f);
size_t material;
size_t index;
};However, for some reason, I thought that spheres and singular
triangles would also have instancing, so changed all of three primitive
types to use instance_t. I did not have time to fix it, but
I will probably fix in the next homework. This does not cause any
problem in outputs but simply decreases the performance a bit. On the
other hand, it has a nice side effect: the triangles in my meshes each
had materials since I used the same type for singular triangle
primitives. When I extracted the material from triangles to instances,
this small memory inefficiency was also solved. Yet I need a better
solution, probably by using two different kinds of triangles, one for
triangle primitives and one for triangles in meshes.
The next thing was modifying the parser. First, I pushed the
transformations. The transformations themselves were discarded after
parsing since the instance_t struct has enough information.
I used stringstream to parse transformations used in
primitives and glm’s translate,
rotate and scale functions to create
transformation matrices for instances. I simply multiplied the
transformation from left.
After this, I took the base mesh, multiplied its transform with the
new instance’s calculated transform and pre-calculated the inverse of
this. This however should have been overridable by
resetTransform attribute, but I forgot to do it since it
was not needed for my first tests, so this check is not in my
submission. Adding it was simple tho: [NYI] a single
line of if and a matching closing brace.
Transformations
Transformation should be easy, right? I can just multiply the inverse of the transformation of the mesh with the ray and call it a day, right? Sadly, no. I was thinking that the transformation implementation would be like “Let’s go, in and out 20 minutes adventure” but it was the opposite. Honestly, I would prefer implementing BVH if I could have implemented only one.
The current architecture was suitable to implement transformations.
The ray_cast (which takes a ray and finds the closest
intersection) method of the scene was modified so that it
iterated over the instances instead of meshes/triangles/spheres. This
meant that at each iteration, I had the model transformation. To utilize
it, I created a ray::transform method, which takes a
transformation matrix and applies it to the ray. The code was
simple:
rt::ray rt::ray::transform(glm::mat4 transformation) const {
const auto start = origin;
const auto end = apply(1.0f);
const auto tstart = transformation * glm::vec4(start, 1.0f);
const auto tend = transformation * glm::vec4(end, 1.0f);
return ray{.origin = tstart, .direction = glm::normalize(tend - tstart)};
}The code is pretty self-explanatory: it applies the ray to get the
point with distance 1 from the origin, which becomes our
endpoint. This creates a line segment with two points. In the lecture of
2024-10-22, our instructor had said that a vector is difference between
two points. I applied this principle by transforming two points, whose
difference is the direction of the ray, and it did not work, which I
realized while writing the blog post. However, at the moment, I have not
fixed it totally. [NYI] Simply multiplying the
transformation matrix from the left with the direction vector (with
w component being 0.0) resulted in better
results (and made grass_desert and
metal_glass_plates work barring some transformation
problems) but some other cases like spheres are still not
working. Nonetheless, my submission uses the code above to transform a
ray.
After this, when the usual ray-intersection algorithms are applied,
the resulting t is actually in local space. I fixed it
using this:
t_candidate = glm::length(
glm::vec3(tri.transform * glm::vec4(hit_point, 1.0f)) - r.origin
);where hit_point is calculated using t from the local
space. I am not sure whether this has a problem, but it works nice with
my NYI fixes, so I assume that this part was correct.
The last thing I added was support for transformation matrices in shading. I passed the transformation matrix and its inverse and used it to calculate the the starting point of the shadow/reflection/refraction ray. The problem probably was lack of conversion of the normal from local to world space when shading.
Results
Only simple_transform and marching_dragons cases work well (marching_dragons has some intersection problems causing holes).
Performance
Some of the instanced and/or transformed cases:
| Testcase | Initialization Time | Rendering Time |
|---|---|---|
| dragon_metal | 1.66781 | 3.34593 |
| ellipsoids | 0.0009215 | 0.127835 |
| marching_dragons | 0.642299 | 1.49556 |
| metal_glass_plates | 0.0009182 | 1.4536 |
| simple_transform | 0.0005905 | 0.0972494 |
| spheres | 0.0006594 | 0.115266 |
| grass_desert | 0.004157 | 9.4637 |
| two_berserkers | 0.13364 | 0.29851 |
For comparison with the previous homework, here is the table for the cases I tested in HW1:
| Testcase | Initialization Time (Complete Tree BVH) | Initialization Time (Median BVH) | Rendering Time (Complete Tree BVH) | Rendering Time (Median BVH) |
|---|---|---|---|---|
| Berserker | 0.019156044 | 0.019131013 | 0.335249032 | 0.321981069 |
| Bunny | 0.026116776 | 0.02602214 | 0.076112225 | 0.072042758 |
| Car | 0.036106005 | 0.036436071 | 0.515020124 | 0.473704802 |
| Car Front | 0.036106005 | 0.036436071 | 0.51656498 | 0.476866441 |
| Cornellbox | 0.000743204 | 0.000748143 | 0.145353368 | 0.14518485 |
| Cornellbox Recursive | 0.000798003 | 0.000804106 | 0.235976484 | 0.234756606 |
| Low Poly | 0.032911273 | 0.032460691 | 0.83449588 | 0.816300713 |
| Other Dragon | 1.641374146 | 1.568181485 | 0.914947284 | 0.961579617 |
| Science Tree Glass | 0.012107489 | 0.012102265 | 0.804868588 | 0.821330206 |
| Spheres | 0.000773866 | 0.000767542 | 0.134561967 | 0.130511143 |
| Spheres Mirror | 0.00081141 | 0.000775633 | 0.152832844 | 0.148085395 |
| Tower | 0.023215116 | 0.023918437 | 0.901310673 | 0.988822282 |
| Two Spheres | 0.000517918 | 0.000508618 | 0.08861407 | 0.089882233 |
| Windmill | 0.044253173 | 0.044485013 | 0.782553569 | 0.818833286 |
Each of the cases actually ran for 10 times and the result is calculated as the geometric mean of these results. Car and Car Front have same initialization times since they actually render the same scene.
For visualization purposes, I also draw bar charts for the previous table. First graph shows each scene, phase and BVH split algorithm separately. Second graph stacks initialization and rendering times of each algorithm so that the total times can easily be compared. Both plots are drawn using matplotlib.
Conclusion
The homework was, well, not a success. My half-baked submission is so bad that I have and will not try to render the video examples. However, I have at least some idea to fix these problems, and also an extra week. I also learnt once again that looking down to any kind of work is harmful. Having unrelated problems to the homeworks also does not help. Nevertheless, I am glad that I tried to do this homework, the insights of these homeworks are simply great.