Text this: Learning Distinctive Feature Representation for Descriptors on Multimodal Images by Incorporating Multiple Negative Samples