Machine Learning Engineer


1. Hardware-Aware Optimization Experience
I have extensive experience in hardware-aware optimization, particularly with GPUs and tensor cores.

  • GPU Utilization: Improved GPU utilization with mixed-precision training in TensorFlow/Keras and PyTorch, accelerating computation without sacrificing model quality. Enabling automatic mixed precision (AMP) yielded up to 2x faster training (a minimal sketch follows this list).

  • Tensor Cores: Used NVIDIA tensor cores for the matrix multiplications in large-scale transformer models. Casting weights and activations to FP16 maximized throughput while maintaining accuracy.

  • Quantization and Sparsity: Applied post-training quantization (PTQ) and quantization-aware training (QAT) to compress models for deployment on edge devices, reducing model size by 70% and inference latency by 50% (a quantization sketch also follows this list).

  • Cache Efficiency: Designed custom data pipelines to optimize memory access patterns, ensuring cache locality and reducing memory bottlenecks in CPU-bound pre-processing tasks.
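
A minimal sketch of the AMP training step referenced in the first bullet, assuming a CUDA GPU is available; the model, optimizer, and loss below are hypothetical placeholders, and TensorFlow/Keras has an equivalent mixed_float16 policy.

import torch
from torch import nn
from torch.cuda.amp import autocast, GradScaler

# Hypothetical model and optimizer; any float32 network is wrapped the same way.
model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10)).cuda()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()
scaler = GradScaler()  # scales the loss so FP16 gradients do not underflow

def train_step(inputs, targets):
    optimizer.zero_grad(set_to_none=True)
    with autocast():                      # forward pass runs in mixed precision
        loss = loss_fn(model(inputs), targets)
    scaler.scale(loss).backward()         # backward on the scaled loss
    scaler.step(optimizer)                # unscale gradients, then apply the update
    scaler.update()                       # adapt the loss scale for the next step
    return loss.item()

And for the quantization bullet, a sketch of post-training dynamic quantization in PyTorch; the 70% size and 50% latency figures above refer to the production models, not this toy network.

import torch
from torch import nn, quantization

float_model = nn.Sequential(nn.Linear(256, 256), nn.ReLU(), nn.Linear(256, 8))
# Replace Linear layers with int8 dynamically quantized equivalents for CPU/edge inference.
int8_model = quantization.quantize_dynamic(float_model, {nn.Linear}, dtype=torch.qint8)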


2. Parallelism in AI
I have successfully implemented multiple forms of parallelism in AI systems:

  • Model Sharding: Partitioned transformer models across GPUs using frameworks like DeepSpeed and Hugging Face Accelerate, enabling the training of 10-billion-parameter models on limited hardware.

  • Tensor Parallelism: Used NVIDIA Megatron-LM for splitting tensor operations across multiple GPUs to reduce memory usage.

  • Pipeline Parallelism: Segmented models into computational stages using PyTorch Lightning to achieve balanced workload distribution, optimizing GPU utilization in multi-node environments.

For example, while training a large-scale language model, I combined pipeline and tensor parallelism to minimize communication overhead and maximize throughput.
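
As a minimal illustration of the idea (a hand-rolled sketch, not the Megatron-LM or DeepSpeed setup used in production), the snippet below shards a hypothetical two-stage network across two GPUs and passes activations between them; real pipeline parallelism additionally splits each batch into micro-batches so both devices stay busy.

import torch
from torch import nn

# Hypothetical two-stage model: each stage lives on its own GPU (assumes cuda:0 and cuda:1 exist).
class TwoStageModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.stage0 = nn.Sequential(nn.Linear(1024, 4096), nn.GELU()).to("cuda:0")
        self.stage1 = nn.Linear(4096, 1024).to("cuda:1")

    def forward(self, x):
        x = self.stage0(x.to("cuda:0"))   # first stage runs on GPU 0
        x = self.stage1(x.to("cuda:1"))   # activations move to GPU 1 for the second stage
        return x

model = TwoStageModel()
output = model(torch.randn(8, 1024))      # a batch flows through both GPUs in sequence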


3. Novel Optimization Proposal
While working on a recommendation system, I proposed a hybrid approach combining approximate nearest neighbor (ANN) search and graph-based methods.

  • Implementation: Integrated the solution using Faiss for ANN retrieval and a Graph Neural Network (GNN) to learn embeddings dynamically (sketched below). This reduced response time by 30% while improving recommendation accuracy.

  • Validation: Conducted extensive A/B testing on live user traffic to measure the improvement in click-through rate (CTR).
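
A minimal sketch of the retrieval side, assuming the GNN embeddings are already computed; the embedding values, index parameters, and sizes here are illustrative rather than the production settings.

import numpy as np
import faiss

d = 128
item_embeddings = np.random.rand(100_000, d).astype("float32")  # stand-in for GNN item embeddings
faiss.normalize_L2(item_embeddings)                             # normalize so inner product = cosine

quantizer = faiss.IndexFlatIP(d)
index = faiss.IndexIVFFlat(quantizer, d, 1024, faiss.METRIC_INNER_PRODUCT)  # inverted-file ANN index
index.train(item_embeddings)           # learn the coarse clustering
index.add(item_embeddings)

index.nprobe = 32                      # search more clusters for higher recall
user_embeddings = item_embeddings[:5]  # stand-in for GNN user embeddings
scores, item_ids = index.search(user_embeddings, 10)  # top-10 candidate items per user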


4. Research Idea to Working Prototype
During a project on neural style transfer, I turned a research paper on adaptive instance normalization (AdaIN) into a real-time image stylization tool.

  • Challenges: The primary challenge was optimizing the model for inference on mobile devices, since TensorFlow Lite did not fully support the custom layers involved.

  • Solution: Developed a custom kernel for AdaIN and converted the model into ONNX format for deployment. The resulting tool achieved real-time performance at 30 FPS on mobile GPUs.
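
The AdaIN operation itself is small; here is a PyTorch sketch of the core layer (the encoder/decoder around it and the mobile-specific kernel are omitted, and the shapes are illustrative).

import torch

def adain(content_feat, style_feat, eps=1e-5):
    # Shift the channel-wise statistics of the content features to match the style features.
    c_mean = content_feat.mean(dim=(2, 3), keepdim=True)
    c_std = content_feat.std(dim=(2, 3), keepdim=True) + eps
    s_mean = style_feat.mean(dim=(2, 3), keepdim=True)
    s_std = style_feat.std(dim=(2, 3), keepdim=True)
    return s_std * (content_feat - c_mean) / c_std + s_mean

# Hypothetical encoder feature maps of shape (batch, channels, height, width).
content = torch.randn(1, 512, 32, 32)
style = torch.randn(1, 512, 32, 32)
stylized_features = adain(content, style)

# Exporting a full stylization network would then look roughly like:
# torch.onnx.export(stylizer, example_image, "adain_stylizer.onnx", opset_version=13)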


5. Distributed AI Training/Inference Systems
I have significant experience in building and scaling distributed AI systems.

  • Challenges: Managing inter-GPU communication overhead in distributed training and ensuring fault tolerance.

  • Solution: Leveraged Horovod and NCCL for efficient gradient synchronization across GPUs. Implemented checkpointing strategies and automatic recovery in PyTorch to ensure robustness. For inference, utilized model parallelism in Kubernetes-based deployments to balance load across nodes.
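
A condensed sketch of that pattern with Horovod; the model and hyperparameters are placeholders, each process drives one GPU, and gradients are averaged with NCCL allreduce under the hood.

import torch
import horovod.torch as hvd

hvd.init()                                     # one process per GPU, launched via horovodrun/mpirun
torch.cuda.set_device(hvd.local_rank())

model = torch.nn.Linear(128, 10).cuda()        # hypothetical model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01 * hvd.size())

# Wrap the optimizer so gradients are averaged across workers on every step.
optimizer = hvd.DistributedOptimizer(optimizer, named_parameters=model.named_parameters())

# Start all workers from identical weights and optimizer state.
hvd.broadcast_parameters(model.state_dict(), root_rank=0)
hvd.broadcast_optimizer_state(optimizer, root_rank=0)

def save_checkpoint(epoch):
    # Only rank 0 writes checkpoints; on failure, every rank reloads the same file and resumes.
    if hvd.rank() == 0:
        torch.save({"epoch": epoch,
                    "model": model.state_dict(),
                    "optimizer": optimizer.state_dict()},
                   f"checkpoint_{epoch}.pt")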


6. Improving AI System Speed Without Quality Loss
In a computer vision project for object detection, I reduced model inference time as follows:

  • Methodology: Pruning redundant layers and fine-tuning the remaining model, combined with TensorRT optimizations.

  • Results: Achieved a 40% reduction in latency while maintaining over 98% mAP (mean Average Precision).

  • Tools Used: TensorFlow Model Optimization Toolkit, NVIDIA TensorRT, and ONNX Runtime.
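
As an illustration of the pruning step with the TensorFlow Model Optimization Toolkit; the backbone, sparsity target, and schedule below are placeholders rather than the production detector.

import tensorflow as tf
import tensorflow_model_optimization as tfmot

# Hypothetical baseline network; any tf.keras model can be wrapped the same way.
base_model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, 3, activation="relu", input_shape=(224, 224, 3)),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(10),
])

# Ramp weight sparsity from 0% to 50% over the fine-tuning run.
schedule = tfmot.sparsity.keras.PolynomialDecay(
    initial_sparsity=0.0, final_sparsity=0.5, begin_step=0, end_step=2000)
pruned = tfmot.sparsity.keras.prune_low_magnitude(base_model, pruning_schedule=schedule)

pruned.compile(optimizer="adam",
               loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))
# pruned.fit(train_ds, epochs=2, callbacks=[tfmot.sparsity.keras.UpdatePruningStep()])

# Strip the pruning wrappers before exporting to ONNX / building the TensorRT engine.
final_model = tfmot.sparsity.keras.strip_pruning(pruned)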