Text this: Multimodal Latent Representation Learning for Video Moment Retrieval