Building Vision AI with Foundation and Generative Models

A code-first course on building vision systems with foundation and generative models. Part of the Hands-On AI Science series, designed around Innovation-First Learning principles.

Vision AI Tasks

Vision AI has moved from narrow classifiers to foundation models that understand, describe, and generate images. Every product that touches cameras, documents, or visual content now depends on these capabilities. This course prepares students to build vision systems that see, reason, and create.

Foundation & Generative Models

Core concepts, models, and ideas behind modern computer vision: convolutional and vision transformer architectures, contrastive learning, diffusion processes, latent space geometry, and multimodal alignment between images and text.
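As a taste of the contrastive-learning unit, here is a minimal NumPy sketch of a CLIP-style symmetric contrastive objective on toy embeddings. This is illustrative only, not course code: the function name, toy data, and temperature value are our own choices, and real CLIP training operates on encoder outputs rather than random vectors.

```python
import numpy as np

def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric cross-entropy over a batch of paired image/text embeddings."""
    # L2-normalize so dot products become cosine similarities
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature       # (N, N) similarity matrix
    labels = np.arange(len(logits))          # matching pairs sit on the diagonal

    def xent(l):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[labels, labels].mean()

    # average the image-to-text and text-to-image directions
    return 0.5 * (xent(logits) + xent(logits.T))

rng = np.random.default_rng(0)
emb = rng.normal(size=(4, 8))
# identical pairs should score a much lower loss than random pairings
aligned = clip_contrastive_loss(emb, emb)
random_pairs = clip_contrastive_loss(emb, rng.normal(size=(4, 8)))
```

The diagonal of the similarity matrix holds the true image-text pairs, so minimizing this loss pulls matched embeddings together and pushes mismatched ones apart — the mechanism behind the multimodal alignment covered in the course.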

Tools & Platforms

PyTorch, OpenCV, Hugging Face Diffusers, Stable Diffusion, YOLO, SAM, CLIP, Roboflow, Weights & Biases, and Google Colab.

Modular Syllabus

A tailored syllabus is built for each audience: graduate or undergraduate students, across engineering, digital health, and computer science.

Innovation Through Tools Mastery

As AI and mature libraries handle standard tasks, professional developers must focus on innovation. Student projects tackle new use cases by generating unique data and fine-tuning task-specific vision models.

Guided Student Projects

Students begin their projects while learning the material and enrich them as new concepts are introduced. Each team gives several in-class presentations for discussion and feedback.

Typical Weekly Schedule

Downloads: Sample Syllabus (PDF) · Poster (HIT)

Week 1: Image Processing Fundamentals (OpenCV, NumPy, filtering, edges)
Week 2: CNNs & Image Classification (PyTorch, ResNet, transfer learning)
Week 3: Object Detection (YOLO, Roboflow, bounding boxes)
Week 4: Semantic Segmentation (SAM, U-Net, pixel-level masks)
Week 5: Project Proposal Presentations (student proposals, peer feedback)
Week 6: Vision Transformers (ViT, DINOv2, Hugging Face)
Week 7: Multimodal Models & CLIP (CLIP, BLIP-2, visual QA)
Week 8: Interim Project Presentations (progress demos, instructor feedback)
Week 9: Generative Models: GANs (StyleGAN, image-to-image translation)
Week 10: Diffusion Models (Stable Diffusion, HF Diffusers)
Week 11: Image Editing & ControlNet (inpainting, ControlNet, img2img)
Week 12: Video Understanding & 3D Vision (optical flow, depth estimation, NeRF)
Week 13: Final Project Presentations (live demos, peer evaluation)
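As a preview of the diffusion weeks, here is a minimal NumPy sketch of the forward noising process: sampling x_t directly from x_0 under a linear beta schedule. The function name and toy "image" are our own illustrative choices; the course works with full models via Hugging Face Diffusers rather than this hand-rolled math.

```python
import numpy as np

def forward_diffuse(x0, t, betas, rng):
    """Sample x_t from the closed form sqrt(abar_t)*x0 + sqrt(1-abar_t)*eps."""
    alphas = 1.0 - betas
    alpha_bar = np.cumprod(alphas)[t]        # cumulative signal-retention factor
    eps = rng.normal(size=x0.shape)          # fresh Gaussian noise
    return np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * eps

rng = np.random.default_rng(0)
betas = np.linspace(1e-4, 0.02, 1000)        # linear schedule, 1000 steps
x0 = np.ones((8, 8))                         # a toy "image" of constant pixels
slightly_noisy = forward_diffuse(x0, 10, betas, rng)     # early step: mostly signal
nearly_gaussian = forward_diffuse(x0, 999, betas, rng)   # final step: mostly noise
```

At early timesteps the sample stays close to the original image; by the final step it is statistically indistinguishable from pure Gaussian noise. Training a network to reverse this corruption, step by step, is what models like Stable Diffusion do.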

Other courses in the Hands-On AI Science series:

Building Language AI: LLMs and Agents
Building Scalable AI: Big Data and Distributed Intelligence
Building Temporal AI: Sequential Intelligence and RL