This is my approach to the LunarLander-v3 environment, completed as the final term project for the course "Foundations of Artificial Intelligence" in May 2025.
Below is an excerpt from my full report ("LunarLander-v3 23020351.ipynb").
If this project has helped you, I’m glad to have been a part of your journey!
This (pretty) comprehensive study successfully implemented and compared two fundamentally different reinforcement learning approaches for the LunarLander-v3 control task: Double Deep Q-Network (DDQN) with N-step returns and Proximal Policy Optimization (PPO) with vectorized environments. Both algorithms exceeded the environment's 200-point "solved" threshold, demonstrating the viability of deep reinforcement learning for autonomous spacecraft control systems.
The project encompassed extensive experimental work including hyperparameter analysis, custom reward function implementation, multi-session statistical validation, and comprehensive performance evaluation across multiple metrics. The modular codebase architecture enabled systematic comparison of value-based versus policy-based learning paradigms in a controlled experimental setting.
Network Design: The final DDQN implementation utilized a progressive architecture (256 -> 128 -> 64) with layer normalization for training stability and conservative Xavier initialization (gain=0.1) to prevent gradient instabilities that plagued earlier versions.
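As a concrete reference, here is a minimal PyTorch sketch of a Q-network with that layout; the layer sizes, LayerNorm usage, and Xavier gain of 0.1 follow the description above, while the activation choice and the input/output dimensions (8-dimensional observations, 4 discrete actions for LunarLander) are filled in as reasonable assumptions rather than the report's exact code.

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Progressive 256 -> 128 -> 64 Q-network with layer normalization (sketch)."""

    def __init__(self, state_dim: int = 8, n_actions: int = 4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 256), nn.LayerNorm(256), nn.ReLU(),
            nn.Linear(256, 128), nn.LayerNorm(128), nn.ReLU(),
            nn.Linear(128, 64), nn.LayerNorm(64), nn.ReLU(),
            nn.Linear(64, n_actions),
        )
        # Conservative Xavier initialization (gain=0.1) to keep early gradients small.
        for module in self.net:
            if isinstance(module, nn.Linear):
                nn.init.xavier_uniform_(module.weight, gain=0.1)
                nn.init.zeros_(module.bias)

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)
```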
N-Step Returns Enhancement: The integration of 3-step returns significantly improved temporal credit assignment, enabling faster propagation of landing rewards through multi-step sequences while maintaining computational efficiency compared to the computationally prohibitive Prioritized Experience Replay (PER) approach.
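The mechanics of folding N consecutive transitions into a single N-step transition can be sketched as follows; the helper name and tuple layout are illustrative assumptions, not the report's exact implementation.

```python
from collections import deque

GAMMA, N_STEPS = 0.99, 3

def make_nstep_transition(window: deque):
    """Collapse the N oldest transitions into one N-step transition.

    Each entry in `window` is assumed to be (state, action, reward, next_state, done).
    """
    state, action = window[0][0], window[0][1]
    n_step_return = 0.0
    next_state, done = window[-1][3], window[-1][4]
    for i, (_, _, reward, s_next, d) in enumerate(window):
        n_step_return += (GAMMA ** i) * reward
        if d:  # episode ended inside the window: truncate the return there
            next_state, done = s_next, True
            break
    # The stored transition is later bootstrapped with GAMMA ** N_STEPS in the TD target.
    return state, action, n_step_return, next_state, done
```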
Training Stabilization: A comprehensive stabilization framework including reward normalization, gradient clipping (max norm 1.0), and a strategic warmup phase (300 episodes of forced exploration) ensured robust convergence without the catastrophic gradient explosions experienced in preliminary implementations.
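A minimal sketch of one stabilized update step is shown below; the Huber (smooth L1) loss and batching details are assumptions, reward normalization is assumed to happen when transitions are stored, and only the gradient clipping (max norm 1.0) and Double-DQN target construction are taken directly from the description above.

```python
import torch
import torch.nn as nn

def train_step(online_net, target_net, optimizer, batch, gamma=0.99, n_steps=3):
    """One stabilized DDQN update on a batch of N-step transitions (sketch)."""
    states, actions, returns, next_states, dones = batch

    # Double-DQN target: the online network selects the action, the target network evaluates it.
    with torch.no_grad():
        next_actions = online_net(next_states).argmax(dim=1, keepdim=True)
        next_q = target_net(next_states).gather(1, next_actions).squeeze(1)
        targets = returns + (gamma ** n_steps) * next_q * (1.0 - dones)

    q_values = online_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    loss = nn.functional.smooth_l1_loss(q_values, targets)  # loss choice is an assumption

    optimizer.zero_grad()
    loss.backward()
    # Gradient clipping at max norm 1.0, as in the stabilization framework above.
    nn.utils.clip_grad_norm_(online_net.parameters(), max_norm=1.0)
    optimizer.step()
    return loss.item()
```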
Vectorized Architecture: The use of 4 parallel environments with Stable-Baselines3's MlpPolicy enabled efficient sample collection and reduced gradient variance, demonstrating the effectiveness of distributed training for policy-based methods.
Conservative Hyperparameters: The selection of clip_range=0.2, learning_rate=3e-4, and GAE_lambda=0.95 reflected a deliberate emphasis on training stability over aggressive optimization, resulting in consistent performance across evaluation sessions.
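Under those settings, the PPO setup can be reproduced with a short Stable-Baselines3 sketch like the one below; the training budget and any arguments not listed above are assumptions left at SB3 defaults.

```python
from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_vec_env

# Four parallel copies of the environment for vectorized rollout collection.
vec_env = make_vec_env("LunarLander-v3", n_envs=4)

model = PPO(
    "MlpPolicy",
    vec_env,
    learning_rate=3e-4,
    clip_range=0.2,
    gae_lambda=0.95,
    verbose=1,
)
model.learn(total_timesteps=1_000_000)  # training budget is illustrative
```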
The rigorous 10-session evaluation protocol (1,000 episodes per algorithm) revealed significant performance differences:
Success Rate Performance: PPO achieved superior reliability with a 92.0% ± 2.4% success rate compared to DDQN's 79.9% ± 4.6%, representing a 15.2% relative improvement in landing consistency. PPO's lower standard deviation (2.4% vs 4.6%) also indicates more predictable deployment performance.
Fuel Efficiency Paradigm: DDQN demonstrated exceptional resource efficiency, consuming only 104.47 ± 3.12 fuel units compared to PPO's 210.23 ± 1.02 units—a remarkable 50.3% reduction. This efficiency advantage stems from DDQN's value-based learning approach that inherently optimizes the fuel-reward trade-off.
Precision and Consistency: PPO achieved marginally better landing precision (0.947 vs 0.901) and demonstrated superior consistency across all metrics, with standard deviations approximately 50% lower than DDQN across reward and success rate measurements.
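The session-level aggregation behind these numbers can be sketched roughly as follows; the helper is hypothetical and simply illustrates computing per-session success rates (200-point threshold) and their mean ± standard deviation across sessions.

```python
import numpy as np

def evaluate_sessions(run_episode, n_sessions=10, episodes_per_session=100):
    """Return mean and std of the per-session success rate (success: reward >= 200)."""
    session_success = []
    for _ in range(n_sessions):
        rewards = np.array([run_episode() for _ in range(episodes_per_session)])
        session_success.append(float(np.mean(rewards >= 200.0)))
    session_success = np.array(session_success)
    return session_success.mean(), session_success.std()
```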
Despite lower success rates, DDQN achieved marginally higher average rewards (245.54 vs 242.33), indicating that successful DDQN landings generate higher individual episode rewards, likely due to substantial fuel efficiency bonuses. This paradox highlights the complex relationship between task completion and reward optimization in reinforcement learning.
DDQN's 50% fuel efficiency advantage suggests the emergence of a conservative, ballistic-trajectory approach that minimizes engine usage while accepting lower success rates. This behavior pattern aligns with value-based learning's natural tendency to optimize long-term cumulative rewards, resulting in risk-averse policies that prioritize resource conservation.
Exploration-Exploitation Balance: The extended warmup phase (300 episodes) and gradual epsilon decay (0.998) enabled comprehensive state space exploration before transitioning to exploitation, resulting in well-informed but cautious policies.
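A minimal sketch of such a schedule is given below; the minimum epsilon value is an assumption, while the 0.998 decay factor and 300-episode warmup come from the description above.

```python
EPS_MIN, EPS_DECAY, WARMUP_EPISODES = 0.01, 0.998, 300

def next_epsilon(episode: int, epsilon: float) -> float:
    """Exploration rate for the next episode."""
    if episode < WARMUP_EPISODES:
        return 1.0                            # forced exploration during warmup
    return max(EPS_MIN, epsilon * EPS_DECAY)  # gradual decay afterwards
```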
N-Step Learning Benefits: The 3-step lookahead mechanism accelerated learning of landing sequences while avoiding the computational overhead of more complex temporal credit assignment methods.
PPO's higher fuel consumption paired with 92% success rate indicates a more aggressive control strategy that prioritizes landing success over resource conservation. This reflects policy-based learning's direct optimization of action selection for task completion rather than value maximization.
Stochastic Policy Advantages: PPO's inherent stochasticity enabled more robust exploration of the action space, leading to discovery of highly successful but resource-intensive control strategies.
Vectorized Training Benefits: The 4-environment parallel collection mechanism gathered four transitions per policy step, quadrupling sample throughput and enabling rapid policy refinement through diverse experience accumulation.
Fuel-Critical Applications: DDQN's exceptional resource efficiency makes it ideal for scenarios where fuel conservation is paramount, such as deep space missions or multi-landing sequences where each drop of propellant matters.
Reliability-Critical Applications: PPO's 92% success rate and low variance make it preferable for high-stakes missions where landing failure would be catastrophic, such as crewed missions or expensive scientific payloads.
Computational Constraints: DDQN's simpler architecture and lower computational requirements during inference make it suitable for embedded systems with limited processing power.
The implementation process also revealed several critical practical details:
Tensor Creation Optimization: The discovery that `torch.FloatTensor([list_of_numpy_arrays])` caused catastrophic training slowdowns (training runs of roughly 90 minutes instead of normal speeds) highlights the importance of efficient tensor operations in practical implementations.
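For illustration, the slow and fast patterns look roughly like this; the batch contents are synthetic, and the fix is simply to stack the NumPy arrays into one contiguous array before converting.

```python
import numpy as np
import torch

# A sampled batch: a Python list of per-state NumPy arrays (contents are synthetic).
states = [np.random.randn(8).astype(np.float32) for _ in range(64)]

# Slow: constructing a tensor directly from a list of NumPy arrays copies element by element.
slow_batch = torch.FloatTensor(states)

# Fast: stack into one contiguous NumPy array first, then convert once.
fast_batch = torch.from_numpy(np.stack(states))
```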
Gradient Explosion Prevention: The necessity of conservative initialization and gradient clipping was demonstrated through the failure of alternative approaches, providing concrete evidence for these stabilization techniques.
Hyperparameter Sensitivity: The systematic learning rate and discount factor experiments confirmed that DDQN performance is highly sensitive to these parameters, with lr=0.0001 and γ=0.99 emerging as optimal through empirical validation.
Multi-Session Validation: Single-session evaluations can be misleading; the 10-session protocol revealed important variance characteristics that inform deployment reliability estimates.
Video Generation Value: The visual analysis capability proved invaluable for qualitative policy assessment, enabling identification of behavioral patterns not captured by numerical metrics.
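One way to reproduce such recordings is Gymnasium's RecordVideo wrapper, sketched below under the assumption that the environment is created with render_mode="rgb_array" and that the moviepy dependency for encoding is installed; the random policy stands in for a trained agent.

```python
import gymnasium as gym
from gymnasium.wrappers import RecordVideo

env = gym.make("LunarLander-v3", render_mode="rgb_array")
env = RecordVideo(env, video_folder="videos", episode_trigger=lambda ep: True)

obs, info = env.reset(seed=0)
done = False
while not done:
    action = env.action_space.sample()  # stand-in for the trained policy's action
    obs, reward, terminated, truncated, info = env.step(action)
    done = terminated or truncated
env.close()
```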
Custom Reward Function Impact: The fuel efficiency and precision bonus experiments demonstrated that reward shaping can significantly influence learned behaviors, providing a tool for biasing algorithms toward specific objectives.
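A hypothetical shaping wrapper along these lines is sketched below; the bonus terms and coefficients are illustrative assumptions, not the exact bonuses used in the report's experiments.

```python
import gymnasium as gym

class ShapedLander(gym.Wrapper):
    """Adds illustrative fuel-saving and landing-precision bonuses to the base reward."""

    def step(self, action):
        obs, reward, terminated, truncated, info = self.env.step(action)
        if action == 0:                 # no engine fired: small fuel-saving bonus
            reward += 0.05
        if terminated and reward > 0:   # rough proxy for a successful touchdown
            reward += 10.0 * max(0.0, 1.0 - abs(obs[0]))  # bonus for landing near pad center
        return obs, reward, terminated, truncated, info

env = ShapedLander(gym.make("LunarLander-v3"))
```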
Environment Specificity: Results are specific to the LunarLander-v3 discrete action space; continuous control variants may favor different algorithmic approaches.
Computational Resources: Training required substantial GPU resources; deployment on resource-constrained spacecraft computers would require model compression techniques.
Environmental Variability: The standardized environment lacks the unpredictability of real-world lunar surface conditions, wind effects, and hardware variations.
Hybrid Approaches: Investigate combining DDQN's fuel efficiency with PPO's reliability through ensemble methods or hierarchical control architectures.
Domain Randomization: Implement environmental variations (gravity fluctuations, sensor noise, actuator delays) to improve real-world transferability.
Model Compression: Develop quantization and pruning techniques to enable deployment on spacecraft computing systems.
Continuous Control: Extend the comparison to continuous action spaces using algorithms like SAC (Soft Actor-Critic) or TD3 (Twin Delayed DDPG).
This project makes several important contributions to the reinforcement learning community:
- Comprehensive Algorithm Comparison: Provides rigorous statistical comparison of value-based vs policy-based approaches with practical deployment considerations.
- Implementation Best Practices: Documents critical implementation details and optimization techniques often omitted from academic papers.
- Multi-Metric Evaluation Framework: Establishes a comprehensive evaluation protocol incorporating success rate, fuel efficiency, precision, and statistical robustness.
- Practical Decision Framework: Offers concrete guidance for algorithm selection based on mission requirements and operational constraints.
The project successfully demonstrates that both DDQN and PPO can solve the lunar landing problem effectively, but with fundamentally different approaches and trade-offs. DDQN excels in resource-constrained scenarios requiring fuel efficiency, while PPO dominates when mission success rates are paramount. This comprehensive analysis provides valuable insights for practitioners implementing reinforcement learning solutions in aerospace applications and similar control domains.
For real-world lunar landing applications, a dual-algorithm approach leveraging DDQN for nominal operations (fuel efficiency) and PPO for contingency scenarios (maximum reliability) would provide optimal operational flexibility across diverse mission requirements. Maybe with a bit of luck too; it's kind of necessary these days.