Model Description

Overview

This model detects hand gestures for use as input controls for video games. It uses object detection to recognize specific hand poses from a webcam or other standard camera and translates them into game actions. The goal of the project is to explore whether computer vision–based gesture recognition can provide a low-cost, accessible alternative to traditional game controllers.

Training Approach

The model was trained with the Ultralytics framework using the nano variant of YOLOv8 (YOLOv8n). Training started from pretrained YOLOv8n weights, which were fine-tuned on a custom hand gesture dataset.

Intended Use Cases

  • Gesture-controlled video games with simple control schemes
  • Touchless interfaces
  • Interactive displays
  • Public kiosks
  • Smart home media controls
  • Desktop navigation
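A minimal sketch of how detections could drive a simple control scheme like the ones listed above. The class names, action names, and confidence threshold here are illustrative assumptions, not values taken from the project code:

```python
# Hypothetical mapping from the four gesture classes described in this card
# to game actions. Names and the 0.5 threshold are illustrative only.
GESTURE_TO_ACTION = {
    "Open Palm": "forward",
    "Closed Fist": "backward",
    "Peace Sign": "jump",
    "Thumbs Up": "attack",
}

def action_for_detection(class_name, confidence, threshold=0.5):
    """Return the game action for one detection, or None if it is too weak."""
    if confidence < threshold:
        return None
    return GESTURE_TO_ACTION.get(class_name)
```

In a real game loop, the highest-confidence detection per frame would be passed through a function like this and the resulting action forwarded to the input system.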

Training Data

Dataset Sources

The training dataset was constructed from two sources:

Rock-Paper-Scissors dataset

Custom gesture dataset

  • Created by recording a 30-second video of the author performing gestures
  • Video parsed into frames at 10 frames per second
  • Images manually selected and annotated

Dataset Size

| Category | Count |
| --- | --- |
| Original Images | 444 |
| Augmented Images | 1066 |
| Image Resolution | 512 × 512 |

Class Distribution

| Class | Gesture | Annotation Count |
| --- | --- | --- |
| Forward | Open Palm | 169 |
| Backward | Closed Fist | 210 |
| Jump | Peace Sign | 187 |
| Attack | Thumbs Up | 121 |

Data Collection Methodology

The dataset combines stock gesture images with a custom dataset created from recorded video frames.

The custom dataset was generated by:

  • Recording a short gesture demonstration video
  • Extracting frames at 10 FPS
  • Selecting usable frames
  • Annotating gesture bounding boxes

This process produced 236 custom images that were merged with the stock dataset.
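The frame-extraction step above amounts to simple index arithmetic. The 30 fps source rate below is an assumption (the card only states the 10 FPS sampling rate), and in practice a tool such as OpenCV's `VideoCapture` would read out the selected frames:

```python
def sample_frame_indices(video_fps, duration_s, target_fps):
    """Indices of the frames to keep when sampling a video at target_fps."""
    total_frames = int(video_fps * duration_s)
    step = video_fps / target_fps          # e.g. 30 fps / 10 fps -> every 3rd frame
    return [int(i * step) for i in range(int(total_frames / step))]

# A 30-second clip at an assumed 30 fps yields 300 candidate frames at 10 FPS;
# the card reports 236 images surviving manual selection and annotation.
indices = sample_frame_indices(video_fps=30, duration_s=30, target_fps=10)
```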

Annotation Process

All annotations were created manually using Roboflow. Bounding boxes were drawn around the visible hand gesture in each image. Because annotation metadata from the original dataset could not be imported, all 444 images were annotated manually. Estimated annotation time: 2–3 hours.

Train / Validation / Test Split

| Dataset Split | Image Count |
| --- | --- |
| Training | 933 |
| Validation | 88 |
| Test | 45 |

Data Augmentation

The following augmentations were applied:

  • Rotation: ±15 degrees
  • Saturation adjustment: ±30%

These augmentations expanded the dataset from 444 to 1066 images. The counts are consistent with tripling only the 311 training images (933 augmented training images, plus the 88 validation and 45 test images left unaugmented).

Dataset Availability

The dataset is available on Roboflow Universe: https://universe.roboflow.com/b-data-497-ws/hand-gesture-controls

Known Dataset Biases and Limitations

  • Small dataset size
  • Class imbalance (thumbs-up has fewer examples)
  • Mixed image quality between stock and custom images
  • Limited diversity in backgrounds and lighting conditions
  • Limited number of subjects (primarily one person)

These factors may affect model generalization.
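The class imbalance noted above can be quantified with inverse-frequency class weights, a common way to make rare classes (here, thumbs-up) count more during training. This is a generic sketch of the technique, not something the project itself applied:

```python
# Annotation counts from the class distribution table in this card.
CLASS_COUNTS = {"Forward": 169, "Backward": 210, "Jump": 187, "Attack": 121}

def inverse_frequency_weights(counts):
    """Weight each class by total / (n_classes * count); rarer classes weigh more."""
    total = sum(counts.values())
    n = len(counts)
    return {cls: total / (n * c) for cls, c in counts.items()}

weights = inverse_frequency_weights(CLASS_COUNTS)
```

With these counts, the under-represented Attack class receives the largest weight and the over-represented Backward class the smallest.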


Training Procedure

Framework

Training was performed in Google Colab using Python code adapted from an existing YOLOv11 training run and modified for YOLOv8n.

Model Architecture

Base model: YOLOv8n (Nano)

Reasons for selection:

  • Lightweight architecture
  • Low inference latency
  • Lower hardware requirements
  • Faster training times
  • Suitable for real-time applications

Training Configuration

| Parameter | Value |
| --- | --- |
| Epochs | 200 (training stopped early) |
| Early stopping patience | 10 |
| Image size | 512 × 512 |
| Batch size | 64 |
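Under the Ultralytics API, a run with this configuration might look like the following. The `data.yaml` path is a placeholder, and the project's exact notebook code was not published with this card:

```python
from ultralytics import YOLO

# Start from pretrained YOLOv8n weights and fine-tune on the gesture dataset.
model = YOLO("yolov8n.pt")
results = model.train(
    data="data.yaml",   # placeholder path to the Roboflow dataset config
    epochs=200,         # upper bound; early stopping ended the run sooner
    patience=10,        # early-stopping patience from the table above
    imgsz=512,
    batch=64,
)
```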

Training Hardware

| Component | Specification |
| --- | --- |
| GPU | A100 (High-RAM) |
| VRAM | 80 GB |
| Training Time | ~40 minutes |

Preprocessing Steps

  • Images resized to 512×512
  • Bounding box annotations normalized
  • Augmented images generated before training
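Normalizing a pixel-space bounding box into YOLO's (x_center, y_center, width, height) format is simple arithmetic. This helper is illustrative rather than taken from the project:

```python
def to_yolo_box(x_min, y_min, x_max, y_max, img_w=512, img_h=512):
    """Convert pixel corner coordinates to normalized YOLO (cx, cy, w, h)."""
    cx = (x_min + x_max) / 2 / img_w
    cy = (y_min + y_max) / 2 / img_h
    w = (x_max - x_min) / img_w
    h = (y_max - y_min) / img_h
    return cx, cy, w, h
```

All four values fall in [0, 1], which is what the YOLO label files expect after the 512×512 resize.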

Evaluation Results

Overall Metrics

Final model performance at epoch 41:

| Metric | Score |
| --- | --- |
| mAP@50 | 0.97 |
| mAP@50–95 | 0.78 |
| Precision | 0.93 |
| Recall | 0.91 |
| F1 Score | 0.94 |

These results exceed the predefined project success criteria.

Per-Class Performance

Sample Class Images

Key Visualizations

  • Confusion Matrix
  • F1 Curve
  • Precision–Recall Curve

Performance Analysis

The model achieved high precision and recall across all gesture classes, indicating strong detection performance on the test set.

Several factors contributed to this performance:

  • A small number of distinct gesture classes
  • Highly visible and consistent hand poses
  • A balanced dataset for most classes

However, the dataset size is relatively small, which may inflate evaluation scores and limit generalization.

Failure cases were observed in several situations:

  • Complex or cluttered backgrounds
  • Low confidence detections
  • Ambiguous or blurred gesture poses

These issues highlight areas where the model could be improved with more diverse training data.

Limitations and Biases

Known Failure Cases

The model struggled with some of the photos from the Rock-Paper-Scissors dataset, as these images contain complex backgrounds, partially occluded hands, or ambiguous gestures.

Data Biases

Potential biases include:

  • limited subject diversity
  • similar backgrounds across many images
  • dataset partially composed of stock imagery
  • limited environmental variability

Environmental Limitations

Model performance may degrade when:

  • lighting conditions vary significantly
  • gestures are performed at unusual angles
  • hands are partially occluded
  • gestures appear at extreme scales or distances

Inappropriate Use Cases

This model should not be used for:

  • complex gesture recognition (complex 3D control schemes)
  • sign language recognition
  • high-precision human-computer interaction systems
  • any safety-critical applications

Sample Size Limitations

The dataset is relatively small for object detection training, which may limit generalization to new users or environments. Future improvements would most likely come from a larger and more diverse dataset. The best course of action would be to drop the stock image dataset and instead collect gesture videos from a diverse set of individuals, backgrounds, and lighting conditions.


Future Work

Potential improvements include:

  • collecting a larger and more diverse gesture dataset
  • increasing the number of gesture classes
  • improving image quality and environmental diversity
  • exploring hand keypoint detection models instead of object detection

Keypoint estimation in particular could allow detection of more complex hand gestures and improve gesture recognition accuracy.