Hands/palms/fingers are almost always problematic because of the nature of our physical biomechanics.
First there is the left/right hand, then there is the palm side and fist side, then there is the frontal view and back view relative to the body, and then there is also a 1st/2nd/3rd/4th person's perspective. Trying to compress 3D space accurately into 2D space is difficult.
Not only does that complicate the training, it also complicates the tagging and inference prompting.
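To get a rough feel for how fast the tagging blows up, here is a minimal sketch that just multiplies out the distinctions listed above. The axis names and values (`hand`, `side`, `view`, `perspective`) are my own assumption for illustration, not an actual tagging scheme from any dataset or model.

```python
from itertools import product

# Hypothetical tag axes, taken from the distinctions listed above.
# These names are illustrative only, not a real tagging vocabulary.
HAND_AXES = {
    "hand": ["left", "right"],
    "side": ["palm", "fist"],
    "view": ["front", "back"],
    "perspective": ["1st", "2nd", "3rd", "4th"],
}


def hand_tag_combinations(axes=HAND_AXES):
    """Return every combination of tag values, one dict per combination."""
    names = list(axes)
    return [dict(zip(names, values)) for values in product(*axes.values())]


combos = hand_tag_combinations()
print(len(combos))  # 2 * 2 * 2 * 4 = 32
```

That is 32 combinations for a single hand before finger poses, occlusion, or a second hand are even considered, and each one would need consistent tags in training and a matching way to ask for it in the prompt.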
Natural language prompts are also problematic, since different people use different ways and means of expression and have different capabilities of expression, and there is significant redundancy and irregularity in the use of languages.
Maybe if someone designs another layer of AI that estimates, guesses, predicts, or directs the person's expression capabilities or style through some sort of prompt Q&A test, and remaps that customisation into the LLM layer, things might be better, but then that complicates the inputs.
