Our Models
We have trained and tuned state-of-the-art computer vision and natural language models to create one-of-a-kind software with a strong emphasis on user experience.
Computer Vision
"YOLOv5 is a family of object detection architectures and models pretrained on the COCO dataset, and represents Ultralytics open-source research into future vision AI methods, incorporating lessons learned and best practices evolved over thousands of hours of research and development."
We at Argos have tuned this architecture with a custom set of objects and images that we believe will be most relevant to our users.
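For illustration, a checkpoint produced by this kind of tuning can be loaded for inference through PyTorch Hub; the weight and image file names below are hypothetical, not our actual artifacts:

```python
# Minimal sketch: load a custom-trained YOLOv5 checkpoint via PyTorch Hub.
# "argos_best.pt" and "frame.jpg" are placeholder names for illustration.
import torch

# Load fine-tuned weights using the ultralytics/yolov5 hub entry point.
model = torch.hub.load("ultralytics/yolov5", "custom", path="argos_best.pt")

# Run inference on a single video frame; results carry boxes, classes,
# and confidence scores for each detected object.
results = model("frame.jpg")
results.print()  # print a summary of the detections
```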
NLP Auto-Captioning
Our automatic video captioning process uses an Inception model. The model itself is made up of symmetric and asymmetric building blocks, including convolutions, average pooling, max pooling, and dropout, with the loss computed using softmax. This model type has achieved over 78% accuracy on the large ImageNet dataset, so we are comfortable using it for our captioning process.
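As a rough sketch of the encoder side of such a pipeline, a Keras InceptionV3 pretrained on ImageNet can embed each frame into a feature vector that a caption decoder then consumes; the decoder that generates the caption text is omitted here:

```python
# Sketch of the captioning encoder, assuming a Keras InceptionV3 pretrained
# on ImageNet. The decoder that maps features to caption text is not shown.
import numpy as np
import tensorflow as tf

# Drop the final softmax classifier and keep pooled convolutional features.
encoder = tf.keras.applications.InceptionV3(
    weights="imagenet", include_top=False, pooling="avg")

def embed_frame(frame):
    """Map one 299x299 RGB frame to a 2048-dim feature vector."""
    x = tf.keras.applications.inception_v3.preprocess_input(frame)
    return encoder(np.expand_dims(x, axis=0))  # shape (1, 2048)

features = embed_frame(np.zeros((299, 299, 3), dtype=np.float32))
```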
Facial Recognition
Team Argos has implemented a facial recognition Support Vector Machine (SVM) to identify people who have been seen in a video. Argos allows the user to label an image of a person, and if that person is seen again, the label is used in the video caption.
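A minimal sketch of this flow, assuming face embeddings from the open-source face_recognition library feed a scikit-learn SVM; the file names and labels are illustrative, not our production setup:

```python
# Sketch: user-labeled faces train an SVM; later sightings reuse the label.
import face_recognition
from sklearn.svm import SVC

# User-labeled reference images: one (image path, name) pair per person.
labeled = [("alice.jpg", "Alice"), ("bob.jpg", "Bob")]

X, y = [], []
for path, name in labeled:
    image = face_recognition.load_image_file(path)
    X.append(face_recognition.face_encodings(image)[0])  # 128-dim embedding
    y.append(name)

clf = SVC(kernel="linear")
clf.fit(X, y)

# When a face is seen in a later video frame, predict who it is so the
# user's label can be inserted into the caption.
frame = face_recognition.load_image_file("frame.jpg")
for encoding in face_recognition.face_encodings(frame):
    print(clf.predict([encoding])[0])
```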
Model Evaluation
Training YOLOv5
Argos's focus is on capturing only the important events in video clips, but we don't want to miss anything either. In training our custom YOLOv5 model, we therefore focused on improving recall, which means reducing false negatives (relevant objects that went undetected).
We improved this metric over time by steadily increasing the number of training epochs and the number of training images for our model.
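Concretely, recall is the fraction of ground-truth objects the model actually detects; a toy check with scikit-learn, using made-up labels:

```python
# Recall = TP / (TP + FN): the share of relevant objects actually detected.
from sklearn.metrics import recall_score

y_true = [1, 1, 1, 0, 1, 0]  # 1 = object present in the ground truth
y_pred = [1, 0, 1, 0, 1, 0]  # 1 = object detected by the model
print(recall_score(y_true, y_pred))  # 0.75: one relevant object was missed
```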
Our primary data source for images is Google's Open Images dataset, which has a robust and diverse selection of pre-annotated images.
Auto-Caption
The ROUGE F1 score is a standard NLP metric used to evaluate automatic summarization. After many iterations of development, our ROUGE scores sit around 0.3, which is low for this type of model.
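For reference, a ROUGE F1 score for one generated caption can be computed with Google's rouge-score package; the caption pair below is invented for illustration:

```python
# Sketch: score a generated caption against a reference with rouge-score.
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
reference = "a person walks a dog across the street"
generated = "a man walking a dog on the street"
scores = scorer.score(reference, generated)
print(scores["rouge1"].fmeasure)  # unigram-overlap F1 for this pair
```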
We attribute the low ROUGE scores to our model architecture and to our human-generated lookup sentences.
In the future, we plan to switch to a transformer architecture and change the lookup functionality to improve the overall Auto-Caption scores.