Publications
Welcome to the publications section of my portfolio. Here you can find a curated list of my published research papers in top-tier conferences and journals in the fields of machine learning, computer vision, and natural language processing. Browse through the collection to get a glimpse of my work and research interests.
Learning geospatially aware place embeddings via weak-supervision

Vamsi Krishna Penumadu, Nitesh Methani, Saurabh Sohoney

SIGSPATIAL 2022

Understanding and representing real-world places (physical locations where drivers can deliver packages) is key to successfully and efficiently delivering packages to the customer's doorstep. A prerequisite to this is the task of capturing similarity and relatedness between places. Intuitively, places that belong to the same building should have similar characteristics in geospatial as well as textual space. However, these assumptions fail in practice as existing methods use customer address text as a proxy for places. While providing the address text, customers tend to miss out on key tokens, use vernacular content or place synonyms, and do not follow a standard structure, making addresses inherently ambiguous. Thus, modelling the problem from a linguistic perspective alone is not sufficient. To overcome these shortcomings, we adapt various state-of-the-art embedding learning techniques to the geospatial domain and propose Places-FastText, Places-Bert, and Places-GraphSage. We train these models using weak supervision by innovatively leveraging different geospatial signals already available from historical delivery data. Our experiments and intrinsic evaluation demonstrate the significance of utilizing these signals and neighborhood information in learning geospatially aware place embeddings. The conclusions are further validated by significant improvements on two domain-specific tasks, viz. Pair-wise Matching (recall@95precision improves by 29%) and Candidate Generation (average recall@k improves by 10%), as evaluated on UAE addresses.
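As a rough illustration of the weak-supervision idea, the sketch below pairs addresses whose historical delivery coordinates fall within a small radius and treats them as weak positive training pairs. The 25-metre threshold, the record fields, and the make_weak_pairs helper are illustrative assumptions for this sketch, not the pipeline used in the paper.

import math
from itertools import combinations

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two (lat, lon) points."""
    r = 6_371_000  # mean Earth radius in metres
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def make_weak_pairs(deliveries, radius_m=25.0):
    """Pair addresses whose historical delivery points fall within radius_m.

    `deliveries` is a list of dicts with 'address', 'lat', 'lon' keys
    (hypothetical schema). The returned pairs act as weak positive labels
    for training place embeddings -- no manual annotation needed.
    """
    pairs = []
    for a, b in combinations(deliveries, 2):
        if haversine_m(a["lat"], a["lon"], b["lat"], b["lon"]) <= radius_m:
            pairs.append((a["address"], b["address"]))
    return pairs

deliveries = [
    {"address": "Flat 12, Marina Heights, Dubai", "lat": 25.0803, "lon": 55.1403},
    {"address": "Apt 12, Marina Hights Tower, Dubai", "lat": 25.0804, "lon": 55.1404},
    {"address": "Villa 3, Jumeirah 1, Dubai", "lat": 25.2320, "lon": 55.2590},
]
print(make_weak_pairs(deliveries))  # pairs the two Marina addresses only

In practice such weakly labelled pairs would feed a contrastive or graph-based training objective along the lines of the Places-FastText, Places-Bert, and Places-GraphSage models described above.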

PlotQA: Reasoning over Scientific Plots

Nitesh S. Methani

M.S. Thesis, May 2021

Reasoning over plots by question-answering (QA) is a challenging machine learning task at the intersection of vision, language processing, and reasoning. Existing synthetic datasets (FigureQA, DVQA) for reasoning over plots do not contain variability in data labels, real-valued data, or complex reasoning questions. Consequently, proposed models for these datasets do not fully address the challenge of reasoning over plots. In particular, they assume that the answer comes either from a small fixed-size vocabulary or from a bounding box within the image. However, in practice, this is an unrealistic assumption because many questions require reasoning and thus have real-valued answers which appear neither in a small fixed-size vocabulary nor in the image. In this work, we aim to bridge this gap between existing datasets and real-world plots. Specifically, we propose PlotQA with 28.9 million question-answer pairs over 224,377 plots on data from real-world sources and questions based on crowd-sourced question templates. Further, 80.76% of the questions in PlotQA have out-of-vocabulary (OOV) answers, i.e., answers that do not appear in a small fixed-size vocabulary. Analysis of existing models on PlotQA reveals that they cannot deal with OOV questions: their overall accuracy on our dataset is in single digits. This is not surprising given that these models were not designed for such questions. As a step towards a more holistic model which can address fixed-vocabulary as well as OOV questions, we propose a hybrid model for data interpretation and reasoning over plots (HYDRA). Using this approach, specific questions are answered by choosing the answer from a fixed vocabulary or by extracting it from a predicted bounding box in the plot, while other questions are answered with a table question-answering engine which is fed a structured table generated by detecting visual elements in the image. On the existing DVQA dataset, our model has an accuracy of 58%, significantly improving on the highest reported accuracy of 46%. On PlotQA, our model has an accuracy of 22.52%, which is significantly better than the state-of-the-art models.
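A minimal sketch of the hybrid routing idea follows, under the assumption that a question classifier decides whether table-level reasoning is needed. All component names here (vocab_qa, bbox_qa, table_qa, extract_table, needs_table_reasoning) are hypothetical placeholders, not HYDRA's actual interfaces.

def answer_plot_question(question, plot_image,
                         vocab_qa, bbox_qa, table_qa, extract_table,
                         needs_table_reasoning):
    # Route each question down one of two answering paths.
    if needs_table_reasoning(question):
        # Reasoning questions: detect the plot's visual elements, rebuild the
        # underlying data table, and let a table-QA engine compute the answer.
        table = extract_table(plot_image)
        return table_qa(question, table)
    # Structural/retrieval questions: answer from a small fixed vocabulary,
    # or fall back to reading text inside a predicted bounding box.
    answer = vocab_qa(question, plot_image)
    return answer if answer is not None else bbox_qa(question, plot_image)

# Toy stubs so the sketch runs end to end; the real components are learned models.
result = answer_plot_question(
    "What is the difference between the two bars in 2010?",
    plot_image=None,
    vocab_qa=lambda q, img: None,
    bbox_qa=lambda q, img: "2010",
    table_qa=lambda q, tbl: tbl["A"][0] - tbl["B"][0],
    extract_table=lambda img: {"A": [8.0], "B": [3.0]},
    needs_table_reasoning=lambda q: "difference" in q.lower(),
)
print(result)  # 5.0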

A Systematic Evaluation of Object Detection Networks for Scientific Plots

Pritha Ganguly, Nitesh S. Methani, Mitesh M. Khapra, Pratyush Kumar

AAAI 2021

Are existing object detection methods adequate for detecting text and visual elements in scientific plots, which are arguably different from the objects found in natural images? To answer this question, we train and compare the accuracy of Fast/Faster R-CNN, SSD, YOLO and RetinaNet on the PlotQA dataset with over 220,000 scientific plots. At the standard IOU setting of 0.5, most networks perform well, with mAP scores greater than 80% in detecting the relatively simple objects in plots. However, the performance drops drastically when evaluated at a stricter IOU of 0.9, with the best model giving a mAP of 35.70%. Note that such a stricter evaluation is essential when dealing with scientific plots, where even minor localisation errors can lead to large errors in downstream numerical inferences. Given this poor performance, we propose minor modifications to existing models by combining ideas from different object detection networks. While this significantly improves the performance, there are still two main issues: (i) performance on text objects, which are essential for reasoning, is very poor, and (ii) inference time is unacceptably large considering the simplicity of plots. To solve this open problem, we make a series of contributions: (a) an efficient region proposal method based on Laplacian edge detectors, (b) a feature representation of region proposals that includes neighbouring information, (c) a linking component to join multiple region proposals for detecting longer textual objects, and (d) a custom loss function that combines a smooth L1 loss with an IOU-based loss. Combining these ideas, our final model is very accurate at extreme IOU values, achieving a mAP of 93.44% at 0.9 IOU. At the same time, our model is very efficient, with an inference time 16x lower than that of current models, including one-stage detectors. Our model also achieves a high accuracy on an extrinsic plot-to-table conversion task with an F1 score of 0.77. With these contributions, we make definitive progress in object detection for plots and enable further exploration on automated reasoning over plots.
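The custom loss in contribution (d) can be sketched as below, assuming the common formulation where the IOU-based term is 1 - IOU and the two terms are added with a weighting factor. The equal weighting and the plain IOU variant are assumptions made for illustration, not the paper's precise definition.

import torch
import torch.nn.functional as F

def box_iou(pred, target):
    """IOU for axis-aligned boxes given as (x1, y1, x2, y2) tensors of shape (N, 4)."""
    x1 = torch.max(pred[:, 0], target[:, 0])
    y1 = torch.max(pred[:, 1], target[:, 1])
    x2 = torch.min(pred[:, 2], target[:, 2])
    y2 = torch.min(pred[:, 3], target[:, 3])
    inter = (x2 - x1).clamp(min=0) * (y2 - y1).clamp(min=0)
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
    return inter / (area_p + area_t - inter + 1e-7)

def combined_box_loss(pred, target, lam=1.0):
    """Smooth L1 regression loss plus an IOU-based term (1 - IOU), weighted by lam."""
    smooth_l1 = F.smooth_l1_loss(pred, target)       # penalises coordinate error
    iou_term = (1.0 - box_iou(pred, target)).mean()  # penalises poor box overlap
    return smooth_l1 + lam * iou_term

pred = torch.tensor([[10.0, 10.0, 50.0, 30.0]])
target = torch.tensor([[12.0, 11.0, 52.0, 31.0]])
print(combined_box_loss(pred, target))

Combining a coordinate-wise regression term with an overlap-based term is what pushes boxes towards the very tight localisation needed at 0.9 IOU.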

PlotQA: Reasoning over Scientific Plots

Nitesh S. Methani, Pritha Ganguly, Mitesh M. Khapra, Pratyush Kumar

WACV 2020

Existing synthetic datasets (FigureQA, DVQA) for reasoning over plots do not contain variability in data labels, real-valued data, or complex reasoning questions. Consequently, proposed models for these datasets do not fully address the challenge of reasoning over plots. In particular, they assume that the answer comes either from a small fixed-size vocabulary or from a bounding box within the image. However, in practice, this is an unrealistic assumption because many questions require reasoning and thus have real-valued answers which appear neither in a small fixed-size vocabulary nor in the image. In this work, we aim to bridge this gap between existing datasets and real-world plots. Specifically, we propose PlotQA with 28.9 million question-answer pairs over 224,377 plots on data from real-world sources and questions based on crowd-sourced question templates. Further, 80.76% of the questions in PlotQA have out-of-vocabulary (OOV) answers, i.e., answers that do not appear in a small fixed-size vocabulary. Analysis of existing models on PlotQA reveals that they cannot deal with OOV questions: their overall accuracy on our dataset is in single digits. This is not surprising given that these models were not designed for such questions. As a step towards a more holistic model which can address fixed-vocabulary as well as OOV questions, we propose a hybrid approach: specific questions are answered by choosing the answer from a fixed vocabulary or by extracting it from a predicted bounding box in the plot, while other questions are answered with a table question-answering engine which is fed a structured table generated by detecting visual elements in the image. On the existing DVQA dataset, our model has an accuracy of 58%, significantly improving on the highest reported accuracy of 46%. On PlotQA, our model has an accuracy of 22.52%, which is significantly better than the state-of-the-art models.
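A small sketch of how an OOV statistic like the one quoted above can be measured: count the fraction of gold answers that do not appear in a fixed answer vocabulary. The vocabulary and answers below are toy data for illustration, not drawn from PlotQA.

def oov_answer_fraction(answers, vocabulary):
    """Fraction of gold answers that do not appear in a fixed answer vocabulary.

    Answers are compared as lowercase strings; numeric answers produced by
    reasoning (e.g. "28.35") rarely appear in any fixed vocabulary, which is
    what drives the OOV statistic. Toy data below, not PlotQA itself.
    """
    vocab = {str(v).lower() for v in vocabulary}
    oov = sum(1 for a in answers if str(a).lower() not in vocab)
    return oov / len(answers)

vocabulary = ["yes", "no", "brazil", "india", "2010", "2015"]
answers = ["yes", "28.35", "brazil", "0.62", "india", "143.7", "no", "3.08"]
print(f"{oov_answer_fraction(answers, vocabulary):.2%}")  # 50.00%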