I build systems that perceive, learn, and act in the physical world. My interests span robot learning (world models, imitation learning, teleoperation), computer vision (3D reconstruction, physics-informed restoration, NeRFs), and AI systems (LLM agents, benchmarking, CUDA optimization). I've worked on world models and humanoid control at a robotics startup, published at NeurIPS 2025, and currently conduct vision research at the National Center for Supercomputing Applications. I also enjoy building from first principles: I've written an OS kernel, designed CPUs, and built VR teleoperation pipelines for data collection.
News
Sep 2025 · Paper accepted at NeurIPS 2025: "Toward Engineering AGI: Benchmarking the Engineering Design Capabilities of LLMs"
Aug 2025 · Joined a stealth robotics startup in SF working on world models, CUDA kernel optimization, and Unitree G1 humanoid control
May 2025 · Started as an ML Intern at AstraZeneca, where I built RAG agents for large-scale pharma data access
Jan 2025 · Started as a Computer Vision Researcher at NCSA and CSL with Prof. Narendra Ahuja, working on atmospheric turbulence restoration with physics-informed deep learning
Publications
Toward Engineering AGI: Benchmarking the Engineering Design Capabilities of LLMs · X. Guo, Y. Li, X. Kong, N. Sabharwal, et al. · Proceedings of the 39th Conference on Neural Information Processing Systems (NeurIPS 2025)
Physics-Based Dynamic Scene Reconstruction for Atmospheric Turbulence Correction · N. Sabharwal, et al. · In preparation, target: AAAI 2026
Selected Projects
Multimodal Robotic World Model · 2024 – present
In progress
Developing a unified sensor and teleoperation AI model that fuses vision, proprioception, and human demonstration data to accelerate robot skill acquisition. The model enables robots to build internal world representations from diverse sensory inputs and operator guidance, allowing them to learn complex manipulation tasks with significantly less data while enabling safer, more intuitive human-robot collaboration.
Engineered an intuitive VR teleoperation system that enables precise, real-time remote robot control by translating head movements into robot actions and providing immersive 3D visual feedback for enhanced depth perception.
Real-Time Head-Driven Control: Captured 3-DOF head pose at 60 Hz from an Oculus headset via ROS and translated head movements into joint commands for a multi-DOF robotic neck, producing smooth, natural robot motion and demonstration data for imitation learning (a minimal sketch of this bridge appears after this list).
Immersive 3D Visual Feedback: Integrated a stereo RGBD camera streaming dual-view video into the Oculus, closing the loop so the operator sees the robot's environment, acts with immediate feedback, and gains the depth perception needed for fine manipulation.
Modular & Extensible Architecture: Designed a Unity–ROS–Dynamixel workflow with configurable DOF settings, ensuring adaptability to various robot platforms and paving the way for future full-body telemanipulation. Implemented ROS damping filters and H.264/H.265 video compression, achieving 60 Hz command updates with 45–60 ms round-trip delay and sub-degree motion accuracy.
Demonstrated Results: Successfully executed remote folding and assembly tasks over a 3,000-mile network. Trained imitation learning policies for long-horizon tasks (sorting, insertion, folding), cutting fine manipulation time by ~30%, establishing a robust foundation for scalable imitation-learning data collection.
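Below is a minimal sketch of the head-pose-to-joint-command bridge described above, written against ROS 1 (rospy). The topic names, joint names, limits, and smoothing constant are illustrative assumptions rather than the project's exact configuration.

# Minimal sketch of the head-pose -> neck-joint bridge (ROS 1 / rospy).
# Topic names, joint names, limits, and the smoothing factor are assumptions.
import math
import rospy
from geometry_msgs.msg import PoseStamped
from sensor_msgs.msg import JointState

NECK_JOINTS = ["neck_yaw", "neck_pitch", "neck_roll"]   # hypothetical joint names
LIMIT = math.radians(60)                                # clamp to +/-60 deg per joint
ALPHA = 0.2                                             # low-pass (damping) factor

class HeadToNeck:
    def __init__(self):
        self.cmd = [0.0, 0.0, 0.0]
        self.pub = rospy.Publisher("/neck/joint_command", JointState, queue_size=1)
        rospy.Subscriber("/oculus/head_pose", PoseStamped, self.on_pose, queue_size=1)

    def on_pose(self, msg):
        q = msg.pose.orientation
        # Quaternion -> roll/pitch/yaw (ZYX convention)
        roll = math.atan2(2 * (q.w * q.x + q.y * q.z), 1 - 2 * (q.x * q.x + q.y * q.y))
        pitch = math.asin(max(-1.0, min(1.0, 2 * (q.w * q.y - q.z * q.x))))
        yaw = math.atan2(2 * (q.w * q.z + q.x * q.y), 1 - 2 * (q.y * q.y + q.z * q.z))
        for i, t in enumerate([yaw, pitch, roll]):
            t = max(-LIMIT, min(LIMIT, t))              # joint-limit clamp
            self.cmd[i] += ALPHA * (t - self.cmd[i])    # exponential smoothing filter
        out = JointState(name=NECK_JOINTS, position=list(self.cmd))
        out.header.stamp = rospy.Time.now()
        self.pub.publish(out)

if __name__ == "__main__":
    rospy.init_node("head_to_neck_teleop")
    HeadToNeck()
    rospy.spin()   # callbacks arrive at headset rate (~60 Hz)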
Photorealistic Sim-to-Real via 3D Gaussian Splatting + MuJoCo · 2025
The visual sim-to-real gap is one of the biggest bottlenecks in scaling robot learning: policies trained on flat-shaded MuJoCo scenes fail against real-world lighting and textures. This project fuses a photorealistic 3D Gaussian Splat of a real lab with MuJoCo physics so robot policies train in an environment that looks real and behaves realistically, tackling the same visual grounding problem that labs like Physical Intelligence, Google DeepMind, and Toyota Research Institute are racing to solve.
3DGS–Physics Alignment: ICP registration pipeline aligning a real-world Gaussian Splat capture (iPhone LiDAR + COLMAP) of an ALOHA bimanual setup with its MuJoCo twin, achieving 7.0 mm RMSE via point-to-plane ICP with Tukey robust loss and adaptive voxel downsampling (sketched after this list).
Real-Time Composite Rendering: MuJoCo robot foreground (segmentation mask) + custom NumPy Gaussian rasterizer background at 30–60 FPS across 7+ viewpoints including wrist cameras.
Policy-Ready API: Observation interface matching the real ALOHA multi-camera topology, outputting dicts compatible with LeRobot and OpenPI policy servers for direct sim-to-real transfer.
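A hedged sketch of the coarse-to-fine alignment step above, using Open3D's point-to-plane ICP with a Tukey robust kernel; the file names, voxel/threshold schedule, and initialization are illustrative assumptions, not the project's exact values.

# Coarse-to-fine point-to-plane ICP with a Tukey robust kernel (Open3D).
# Paths, voxel sizes, and correspondence thresholds are illustrative.
import numpy as np
import open3d as o3d

def align(source_path="splat_points.ply", target_path="mujoco_scene.ply"):
    src_full = o3d.io.read_point_cloud(source_path)   # Gaussian-splat centers
    tgt_full = o3d.io.read_point_cloud(target_path)   # points sampled from MuJoCo meshes
    T = np.eye(4)                                      # initial guess (e.g., manual or global reg.)
    for voxel, dist in [(0.04, 0.10), (0.02, 0.05), (0.01, 0.02)]:   # coarse -> fine
        src = src_full.voxel_down_sample(voxel)
        tgt = tgt_full.voxel_down_sample(voxel)
        tgt.estimate_normals(o3d.geometry.KDTreeSearchParamHybrid(radius=3 * voxel, max_nn=30))
        loss = o3d.pipelines.registration.TukeyLoss(k=dist)          # robust to outliers
        result = o3d.pipelines.registration.registration_icp(
            src, tgt, dist, T,
            o3d.pipelines.registration.TransformationEstimationPointToPlane(loss))
        T = result.transformation
    print("inlier RMSE (m):", result.inlier_rmse)
    return T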
Physics-Based Dynamic Scene Reconstruction · 2024 – 2025
Submitting to AAAI 2026
Developing unsupervised, physics-informed deep learning frameworks to model and compensate for atmospheric turbulence and fluid dynamics in visual data. Integrates convolutional encoders, optical flow estimation, and advanced digital signal processing filters (Fourier and wavelet domain) with physical optics simulations and 3D geometric scene reconstruction. The system accurately predicts and corrects refractive and flow-induced distortions in real-world imagery, with potential to transform imaging in challenging environmental conditions. Model training conducted on parallel supercomputing clusters.
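As one illustration of the flow-compensation component, here is a minimal, hedged sketch of a classical baseline: estimate dense optical flow of each frame against a temporal reference, warp frames back onto the reference grid, and average. The learned encoders, Fourier/wavelet filtering, and physics-based optics model in the actual framework sit on top of this idea; the function and parameter choices below are illustrative.

# Flow-compensated temporal averaging baseline for turbulence-induced wobble.
# Farneback parameters and the reference choice are illustrative assumptions.
import cv2
import numpy as np

def flow_compensated_mean(frames):
    """frames: list of HxW uint8 grayscale images from a turbulent sequence."""
    ref = np.mean(np.stack(frames), axis=0).astype(np.uint8)   # temporal mean as reference
    h, w = ref.shape
    gx, gy = np.meshgrid(np.arange(w, dtype=np.float32),
                         np.arange(h, dtype=np.float32))
    acc = np.zeros((h, w), np.float32)
    for f in frames:
        # Dense flow from the reference geometry to the distorted frame
        flow = cv2.calcOpticalFlowFarneback(ref, f, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        # Pull the distorted frame back onto the reference grid
        warped = cv2.remap(f, gx + flow[..., 0], gy + flow[..., 1],
                           cv2.INTER_LINEAR, borderMode=cv2.BORDER_REFLECT)
        acc += warped.astype(np.float32)
    return (acc / len(frames)).astype(np.uint8)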
Hybrid Robot Learning & Control for Contact-Rich Manipulation · 2025
Deploying learned policies in the real world requires more than good imitation: it requires robustness to novel states, graceful degradation, and integration with classical perception and control. This project builds the full stack from first principles (URDFs, PD control, 3D vision) through modern imitation learning (DAgger, ResNet18 visuomotor policies), culminating in a Simplex safety architecture that achieves 100% task success where vision-only (40%) and learned-only (60%) approaches each fall short: the kind of hybrid system needed for reliable real-world robot deployment.
3D Perception: RGBD → point-cloud deprojection → hand-eye calibration → HSV segmentation → DBSCAN clustering → coarse-to-fine ICP for 6-DOF pose. Multi-view fusion from 5+ cameras enables autonomous pick-and-place with zero ground-truth state.
Imitation Learning + DAgger: Trained MLP (state) and ResNet18 (vision) behavioral cloning policies on 20k+ teleoperated demos. DAgger closed the covariate shift gap → 80%+ success on randomized pick-and-place.
Simplex Architecture (100% Success): Vision-based reactive controller (HSV dirt detection → boustrophedon paths) with automatic BC fallback (switching logic sketched below). 97.6% ± 2.0% cleaning, 100% success vs. 40% vision-only and 60% BC-only on 150-marker contact-rich wiping.
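A minimal sketch of the Simplex-style switching logic referenced above: the reactive vision controller acts whenever dirt detection is confident, and the learned BC policy takes over otherwise. HSV bounds, the confidence threshold, and the planner/policy interfaces are illustrative assumptions, not the project's exact implementation.

# Simplex-style fallback between an analytic reactive controller and a BC policy.
# Thresholds, HSV bounds, and interfaces are assumptions for illustration.
import cv2
import numpy as np

DIRT_LO = np.array([0, 0, 0])        # hypothetical HSV lower bound for marker ink
DIRT_HI = np.array([180, 255, 60])   # hypothetical HSV upper bound
MIN_PIXELS = 200                     # confidence threshold on detected dirt area

def detect_dirt(rgb):
    hsv = cv2.cvtColor(rgb, cv2.COLOR_RGB2HSV)
    return cv2.inRange(hsv, DIRT_LO, DIRT_HI)           # binary dirt mask (0/255)

def simplex_act(rgb, robot_state, coverage_planner, bc_policy):
    mask = detect_dirt(rgb)
    if int(mask.sum() // 255) >= MIN_PIXELS:
        # Perception is trustworthy: follow the analytic boustrophedon coverage path
        return coverage_planner.next_waypoint(mask, robot_state)
    # Ambiguous or degraded perception: fall back to the ResNet18 BC policy
    return bc_policy.predict(rgb, robot_state)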
CUDA Kernel Optimization: CNN Forward Pass
Every breakthrough in AI training and inference (FlashAttention, cuDNN, custom training kernels) comes down to someone understanding exactly how GPU hardware works. This project is that deep dive: iteratively rewriting the same CNN forward pass across 6+ kernel versions, exploiting every level of the CUDA memory hierarchy, mixed-precision arithmetic, and async execution; these are the same primitives that power model training infrastructure at scale.
Memory Hierarchy: Weights in constant memory, I/O tiles in shared memory, register accumulators, achieving order-of-magnitude reduction in global memory traffic.
FP16 Mixed Precision + Tree Reduction: GPU performance is memory-bandwidth bound: FP16 halves the memory footprint, doubling effective bandwidth and fitting 2× more data in cache, and modern GPUs (Tensor Cores, FP16 ALUs) deliver 2× compute throughput on half-precision ops. Implemented via CUDA intrinsics (__half, __hmul) with a 3D thread-block tree reduction over input channels, eliminating serial loops and halving register pressure to enable higher occupancy, the same reduced-precision strategy that makes large-scale training feasible.
Tiled GEMM (im2col): Convolution re-cast as a matrix multiply (the im2col transform is sketched after this list); 16×16 shared-memory tiled kernel with unrolled inner loops for near-peak occupancy.
Streams & Layer-Adaptive Dispatch: Direct conv for shallow layers, GEMM for deep layers. 10 concurrent CUDA streams pipelining kernel execution with memory transfers.
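For clarity, here is the im2col reformulation behind the tiled-GEMM kernel, sketched in NumPy rather than CUDA: once each receptive field is unrolled into a column, the convolution is a single matrix multiply that maps naturally onto shared-memory tiles. The shapes and the self-check are illustrative.

# im2col: turn a convolution into one GEMM (NumPy sketch; the real kernel is CUDA).
import numpy as np

def conv2d_im2col(x, w):
    """x: (C, H, W) input, w: (M, C, K, K) filters, stride 1, no padding."""
    C, H, W = x.shape
    M, _, K, _ = w.shape
    Ho, Wo = H - K + 1, W - K + 1
    # Unroll every KxK receptive field into a column: (C*K*K, Ho*Wo)
    cols = np.empty((C * K * K, Ho * Wo), dtype=x.dtype)
    idx = 0
    for c in range(C):
        for i in range(K):
            for j in range(K):
                cols[idx] = x[c, i:i + Ho, j:j + Wo].reshape(-1)
                idx += 1
    # Filters flatten to (M, C*K*K); the whole conv is now a single GEMM
    out = w.reshape(M, -1) @ cols            # (M, Ho*Wo)
    return out.reshape(M, Ho, Wo)

# Quick self-check against a naive direct convolution on random data
x = np.random.rand(3, 8, 8).astype(np.float32)
w = np.random.rand(4, 3, 3, 3).astype(np.float32)
ref = np.zeros((4, 6, 6), np.float32)
for m in range(4):
    for i in range(6):
        for j in range(6):
            ref[m, i, j] = np.sum(w[m] * x[:, i:i + 3, j:j + 3])
print(np.allclose(conv2d_im2col(x, w), ref, atol=1e-5))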
AI-Accelerated Hardware Design & Benchmarking · 2024
Published at NeurIPS 2025
Leveraging Large Language Models to automate the generation and verification of SystemVerilog modules for complex hardware design tasks. Developed a comprehensive benchmarking pipeline to evaluate LLM performance in hardware design, featuring built-in automated graders that test LLM-generated modules against testbenches using open-source simulation software. The suite includes a diverse array of tasks varying in complexity and domain-specific requirements, enabling thorough assessment of LLM capabilities across the hardware engineering lifecycle.
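A hypothetical shape of the automated grading loop, assuming Icarus Verilog as the open-source simulator: write the LLM-generated module to disk, compile it against a fixed testbench, run the simulation, and score the run from the testbench's output. The tool choice, file layout, and PASS/FAIL protocol shown here are assumptions for illustration.

# Hypothetical grader loop: compile an LLM-generated module against a testbench
# with Icarus Verilog, simulate, and report a structured result for refinement.
import subprocess
import tempfile
from pathlib import Path

def grade_module(generated_sv: str, testbench_path: str) -> dict:
    with tempfile.TemporaryDirectory() as tmp:
        dut = Path(tmp) / "dut.sv"
        dut.write_text(generated_sv)
        build = Path(tmp) / "sim.out"
        compile_run = subprocess.run(
            ["iverilog", "-g2012", "-o", str(build), str(dut), testbench_path],
            capture_output=True, text=True)
        if compile_run.returncode != 0:
            # Compile errors become structured feedback for the next LLM attempt
            return {"passed": False, "stage": "compile", "log": compile_run.stderr}
        sim_run = subprocess.run(["vvp", str(build)], capture_output=True, text=True)
        passed = "ALL TESTS PASSED" in sim_run.stdout and "FAIL" not in sim_run.stdout
        return {"passed": passed, "stage": "simulate", "log": sim_run.stdout}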
RISC-V OS Kernel + Journaling Filesystem + I/O Drivers · 2025
Developed a full UNIX-style operating system kernel and robust journaling filesystem from scratch for the RISC-V architecture.
Kernel: Implemented core OS functionalities including a bootloader, trap handling, Sv39 virtual memory with demand paging, and process abstraction. Supports essential user-mode syscalls (open, close, read, write, ioctl, exec, fork, wait, usleep, fscreate, fsdelete). Created cooperative and preemptive threading models using condition variables and timer (mtime) interrupts.
Custom Filesystem + Block Cache: Engineered a block-based filesystem with a write-ahead journal for metadata consistency and crash recovery. Supports create, read, write, delete, flush, and multi-level indirection. Mountable via VirtIO block device or in-memory "memio" for rapid testing. Implemented a write-back cache with configurable associativity, reducing I/O latency by ~45% while ensuring data consistency through the journaling mechanism.
Device Drivers & MMIO: Developed drivers for UART (polling & interrupt-driven), Real-Time Clock (RTC), Platform-Level Interrupt Controller (PLIC), VirtIO block & RNG devices (with custom ISR integration), and GPIO and SPI interfaces for embedded peripherals.
Games! Built a unified I/O interface to load and execute ELF binaries (Star Trek, Doom, Rogue, Zork) on QEMU RISC-V, validated by automated tests for correct loading, execution, and system-call handling.
System on Chip: 16-bit CPU Core + Graphics Controller · 2024
Designed and implemented a 16-bit CPU with a reduced, x86-inspired ISA in SystemVerilog, end-to-end from ISA specification through FPGA verification.
ISA & Pipeline: 16-bit processor with Program Counter, Instruction Register, general-purpose registers, ALU, and a 3-stage fetch–decode–execute pipeline supporting ADD, AND, NOT, BR, JMP, JSR, LDR, STR, and PAUSE. FSM control unit sequences memory access, ALU operations, and I/O interactions via a custom cpu_to_io bridge interfacing with on-board switches and hex displays.
Memory-Mapped I/O & BRAM: Mapped UART, switches, and seven-segment displays into the CPU's address space using on-chip Spartan-7 Block RAM, handling read/write timing without an external "ready" signal.
Graphics Controller: Developed an IP-core-based HDMI graphics controller for 80×30 character text rendering over AXI4 on Vivado IP Integrator. Implemented monochrome and color text output using VRAM and font ROM, supporting inverse text and palette-based coloring.
FPGA Verified: Synthesized and achieved timing closure in Vivado, deploying and verifying stable operation at 50 MHz on a Xilinx Spartan-7 board.
Sensor-Triggered Smart Access Control
Developed an autonomous security system to modernize access control, replacing traditional key- and card-based systems with sensor-triggered visual verification and remote actuation.
Problem: Campus and home access relying on physical cards or keys creates delays, extra costs, and accessibility barriers.
Solution: An ultrasonic sensor (HC-SR04) detects approaching individuals, triggering an Arduino UNO and OV7670 camera. The captured image streams via UART → PC → Telegram bot. The owner sends "door open" / "door close" over Telegram, and an ESP32 actuates SG90 servo motors to lock or unlock (the PC-side bridge is sketched below).
Tech Stack: HC-SR04 ultrasonic sensor (50 cm trigger range, power-saving standby), OV7670 camera over UART, Arduino UNO for coordination, ESP32 for Wi-Fi and Telegram API, SG90 servos for actuation, Make.com for workflow automation.
Features: Live video feed for verification, automated entry/exit logs, updatable face database, upgradable firmware. Real-world applications include keyless Airbnb access, ID-free campus entry, and improved accessibility.
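A hedged sketch of the PC-side bridge in the UART → PC → Telegram flow: forward a captured frame to the owner and relay their "door open" / "door close" reply to the ESP32. The bot token, chat ID, serial framing, and ESP32 endpoint are all illustrative assumptions rather than the project's exact protocol.

# PC-side bridge: serial frame from the Arduino -> Telegram photo -> owner reply -> ESP32.
# Token, chat ID, frame framing, and the ESP32 endpoint are hypothetical.
import requests
import serial

BOT = "https://api.telegram.org/bot<TOKEN>"     # hypothetical bot token
CHAT_ID = "<OWNER_CHAT_ID>"                     # hypothetical owner chat ID
ESP32 = "http://192.168.1.50/servo"             # hypothetical ESP32 endpoint

def send_frame(jpeg_bytes):
    requests.post(f"{BOT}/sendPhoto", data={"chat_id": CHAT_ID},
                  files={"photo": ("visitor.jpg", jpeg_bytes, "image/jpeg")})

def wait_for_command(offset=0):
    while True:
        updates = requests.get(f"{BOT}/getUpdates",
                               params={"offset": offset, "timeout": 30}).json()
        for u in updates.get("result", []):
            offset = u["update_id"] + 1
            text = u.get("message", {}).get("text", "").lower()
            if text in ("door open", "door close"):
                return text, offset

def main():
    ser = serial.Serial("/dev/ttyUSB0", 115200, timeout=60)   # Arduino UNO link
    offset = 0
    while True:
        size = int.from_bytes(ser.read(4), "little")          # assumed length-prefixed frame
        frame = ser.read(size)                                # JPEG from the OV7670 capture
        send_frame(frame)
        cmd, offset = wait_for_command(offset)
        requests.post(ESP32, json={"action": cmd.split()[1]}) # "open" / "close" -> servos

if __name__ == "__main__":
    main()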
My first project! Built and scaled a Harry Potter fan community to 12,500+ registered users and 4,000+ social media followers. Led and coordinated a team of 30 volunteers to develop quizzes, discussion forums, and engaging content, fostering a highly active online platform.
Research Experience
National Center for Supercomputing Applications (NCSA) · Jan 2025 – Present
Computer Vision Researcher, PI: Prof. Narendra Ahuja
Leading research in video restoration using unsupervised, physics-informed deep learning for modeling atmospheric turbulence and fluid dynamics.
Integrated Vision Transformers, NeRFs, and 3D Mamba-based architectures with optical flow estimation and 3D geometric scene reconstruction to correct non-rigid distortions in real-world videos.
Training on parallel supercomputing clusters with CUDA-accelerated pipelines.
Pioneered systematic evaluation framework for assessing frontier LLMs on complex hardware design synthesis, benchmarking across hierarchical SystemVerilog tasks.
Architected closed-loop agentic system with automated verification pipeline integrating simulation and synthesis tools, enabling iterative refinement through structured error feedback.
Industry Experience
AstraZeneca · May – Aug 2025
Machine Learning Intern · Gaithersburg, MD
Built an intelligent agent with a natural-language interface using RAG and LangGraph with the MCP framework, querying millions of rows of pharma data in under 10 s.
Designed a secure, highly available AWS stack (ECS, ALB, WAF, CloudWatch) delivering 99.9% uptime and scaling to 50+ concurrent users.
Led cross-functional discovery, built the MVP independently, then collaborated with the Mexico Data Science team for a company-wide rollout.
Mashreq Bank · May – Jul 2023
Cloud Infrastructure Intern
Developed and optimized RESTful & GraphQL APIs, improving data retrieval speed by 40% and scalability for high-traffic usage.
Refined CI/CD pipelines using Jenkins and Docker, reducing deployment time by 50%.