A Lightweight Framework for Audio-Visual Segmentation with an Audio-Guided Space–Time Memory Network