The current constant STRING cache is a mess for several reasons:
- it's the only consumer of a cstring Hash type that we could otherwise remove
- hashing cstring keys is time consuming
- it only handles ASCII STRINGs
- we can't perform any compilation-time precaching
- rehashing performance is terrible when we cache a lot of STRINGs
- we can't easily cache constant STRINGs thawed from PBC, which contain a lot of duplication
We can fix it:
- Create a new custom data structure with which we can cache these STRINGs easily by their important characteristics (length, encoding type, and contents). This is likely some kind of binary tree; for best performance characteristics, an AVL tree has the most advantages, as we perform many more fetches than inserts and never delete. A red-black tree would also be fine if that's easier, but its additional complexity adds few advantages (and is slightly more expensive to read).
- Rewrite the constant STRING caching mechanism in Parrot_str_new_constant() to use the new cache. We should index into the tree based on the STRING's characteristics, not any hash key or not the contents of the STRING naively.
- Add a function to create a new constant STRING given an encoding, a length, and a buffer.
- Use that new function when thawing STRINGs from PBC to avoid re-creating STRINGs we already have.
- Remove the cstring hash (optional).
Simple profiling tests done on Rakudo reveal that we can reduce our usage of STRING headers thawed from PBC by more than 80%.
This is a task of moderate difficulty, primarily in creating and debugging the binary tree. Takers welcome.