The current constant STRING cache is a mess for several reasons:

  • it's the only consumer of a cstring Hash type that we could otherwise remove
  • hashing cstring keys is time consuming
  • it only handles ASCII STRINGs
  • we can't perform any compilation-time precaching
  • rehashing performance is terrible when we cache a lot of STRINGs
  • we can't easily cache constant STRINGs thawed from PBC, which contain a lot of duplication

We can fix it:

  • Create a new custom data structure with which we can cache these STRINGs easily by their important characteristics (length, encoding type, and contents). This is likely some kind of binary tree; for best performance characteristics, an AVL tree has the most advantages, as we perform many more fetches than inserts and never delete. A red-black tree would also be fine if that's easier, but its additional complexity adds few advantages (and is slightly more expensive to read).
  • Rewrite the constant STRING caching mechanism in Parrot_str_new_constant() to use the new cache. We should index into the tree based on the STRING's characteristics, not any hash key or not the contents of the STRING naively.
  • Add a function to create a new constant STRING given an encoding, a length, and a buffer.
  • Use that new function when thawing STRINGs from PBC to avoid re-creating STRINGs we already have.
  • Remove the cstring hash (optional).

Simple profiling tests done on Rakudo reveal that we can reduce our usage of STRING headers thawed from PBC by more than 80%.

This is a task of moderate difficulty, primarily in creating and debugging the binary tree. Takers welcome.