I have been thinking about the audio problems with RTK (Remembering the Kanji). I think the biggest problem is hearing just a short sound for a character without any context. If the character's sound were played first and then followed by “as in” and the audio of a word that uses it, it could become more effective. That way you could go from sound to character without having to focus on both characters.
Even better might be to pause after saying the word and then say a sentence that uses the word, accompanied by a picture.
I believe this approach would more closely follow the concept of comprehensible input pioneered by Stephen Krashen. I think this video shows good examples of why that method is better for learning.