Debug university cluster configurations (SLURM and SSH) #56

Closed
@jeremymanning

Objective

Verify and debug Clustrix functionality with university-managed clusters that do not require API keys:

  1. SLURM cluster - Traditional HPC scheduler
  2. SSH cluster - Direct SSH-based execution

Context

Before investigating cloud provider issues, we need to establish that core cluster functionality works correctly with controlled university infrastructure.

Testing Approach

Local Jupyter Testing Required: University servers require VPN access, so testing must be done from a local Jupyter server (not Colab notebooks, which cannot reach hosts behind the VPN).

SLURM Cluster Testing

Configuration Requirements

  • Cluster hostname and SSH access
  • SLURM partition and account information
  • SSH key authentication setup
  • Module loading requirements (if any)
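
To make the checklist above concrete, here is a minimal sketch of a settings container that reports which required items are still missing before any job is submitted. `SlurmSettings` and its field names are hypothetical and are not part of the Clustrix API; treating `account` as optional is also an assumption, since some sites do not enforce one.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class SlurmSettings:
    """Hypothetical container for the SLURM configuration checklist."""

    host: str                       # cluster login node hostname
    username: str                   # SSH username
    key_file: str                   # path to the private SSH key
    partition: str                  # SLURM partition to submit to
    account: Optional[str] = None   # assumption: optional at some sites
    modules: List[str] = field(default_factory=list)

    def missing(self) -> List[str]:
        """Return the names of required fields that are empty."""
        required = ("host", "username", "key_file", "partition")
        return [name for name in required if not getattr(self, name)]

    def module_lines(self) -> List[str]:
        """Return the shell lines that load the requested modules."""
        return [f"module load {name}" for name in self.modules]
```

Calling `missing()` before the first test run gives a quick list of what still needs to be filled in for a given university cluster.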

Test Cases

  • Basic job submission and execution
  • Resource specification (cores, memory, time)
  • Job status monitoring and result retrieval
  • Parallel loop execution
  • Error handling and cleanup
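
The resource specification test case can be checked partly offline by inspecting the generated batch script. The sketch below renders a script using standard `#SBATCH` directives; `render_sbatch` is a hypothetical helper written for illustration and is not the job script generator in `clustrix/utils.py`. The default values are placeholders, not recommendations.

```python
from typing import Iterable, Optional


def render_sbatch(
    job_name: str,
    command: str,
    partition: str,
    cores: int = 1,
    memory: str = "4G",
    time_limit: str = "00:30:00",
    account: Optional[str] = None,
    modules: Iterable[str] = (),
) -> str:
    """Render a SLURM batch script as a string (illustrative only)."""
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --partition={partition}",
        f"#SBATCH --cpus-per-task={cores}",
        f"#SBATCH --mem={memory}",
        f"#SBATCH --time={time_limit}",
        # %j expands to the job ID, keeping logs from different runs apart.
        f"#SBATCH --output={job_name}_%j.out",
        f"#SBATCH --error={job_name}_%j.err",
    ]
    if account:
        lines.append(f"#SBATCH --account={account}")
    # All #SBATCH directives must precede the first executable line.
    lines.extend(f"module load {name}" for name in modules)
    lines.append(command)
    return "\n".join(lines) + "\n"
```

A test can then assert that the requested cores, memory, and time limit appear in the script before anything is sent to the scheduler.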

SSH Cluster Testing

Configuration Requirements

  • Remote server hostname and SSH access
  • Python environment setup on remote server
  • Working directory permissions
  • SSH key authentication
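
Each item above can be checked with a one-line remote command before running anything through Clustrix. The sketch below only builds the `ssh` argument lists (for use with `subprocess.run`); `build_ssh_command` and `preflight_commands` are hypothetical helpers, and the specific checks (`python3 --version`, `test -w`) are assumptions about what a reasonable preflight looks like.

```python
import shlex
from typing import Dict, List, Optional


def build_ssh_command(
    host: str,
    username: str,
    key_file: str,
    remote_command: str,
    work_dir: Optional[str] = None,
    connect_timeout: int = 10,
) -> List[str]:
    """Build an ssh argument list that runs one command on the remote host."""
    if work_dir:
        # Quote the directory so paths containing spaces survive the remote shell.
        remote_command = f"cd {shlex.quote(work_dir)} && {remote_command}"
    return [
        "ssh",
        "-i", key_file,
        "-o", "BatchMode=yes",  # fail instead of prompting for a password
        "-o", f"ConnectTimeout={connect_timeout}",
        f"{username}@{host}",
        remote_command,
    ]


def preflight_commands(
    host: str, username: str, key_file: str, work_dir: str
) -> Dict[str, List[str]]:
    """Return one ssh command per configuration requirement to verify."""
    checks = {
        "python": "python3 --version",
        "work_dir_writable": f"test -w {shlex.quote(work_dir)}",
    }
    return {
        name: build_ssh_command(host, username, key_file, cmd)
        for name, cmd in checks.items()
    }
```

With `BatchMode=yes`, a misconfigured key shows up as an immediate non-zero exit rather than a hung password prompt, which is easier to detect from a notebook.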

Test Cases

  • Basic function execution over SSH
  • File transfer and cleanup
  • Environment replication
  • Error handling and connection recovery
  • Multiple concurrent jobs
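
The connection recovery and concurrent job cases can be exercised with a small harness like the one below. It is a sketch under assumptions: `run_with_retry` and `run_concurrently` are hypothetical names, the retry policy (three attempts, exponential backoff) is illustrative, and each task is any zero-argument callable, for example one that wraps a remote function call.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence, Tuple, Type, TypeVar

T = TypeVar("T")


def run_with_retry(
    task: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 1.0,
    retry_on: Tuple[Type[BaseException], ...] = (ConnectionError, TimeoutError),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run task, retrying with exponential backoff on connection errors."""
    for attempt in range(1, attempts + 1):
        try:
            return task()
        except retry_on:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** (attempt - 1))
    raise ValueError("attempts must be at least 1")


def run_concurrently(
    tasks: Sequence[Callable[[], T]], max_workers: int = 4, **retry_kwargs
) -> List[T]:
    """Run tasks in a thread pool and return results in submission order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(run_with_retry, t, **retry_kwargs) for t in tasks]
        return [f.result() for f in futures]
```

Passing a fake `sleep` makes the retry behavior testable without waiting, and a deliberately flaky task stands in for a dropped VPN connection.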

Implementation Plan

  1. Set Up Local Jupyter Environment: Configure a local development environment with VPN access
  2. SLURM Configuration: Create and test university SLURM cluster configuration
  3. SSH Configuration: Create and test university SSH server configuration
  4. Systematic Testing: Execute comprehensive test suite for both cluster types
  5. Documentation: Document any configuration requirements or limitations discovered

Success Criteria

  • SLURM cluster executes jobs successfully with proper resource allocation
  • SSH cluster executes jobs successfully with proper environment setup
  • Both configurations handle errors gracefully
  • Performance is acceptable given university network and VPN latency
  • Configuration examples are documented for other university users

Related Files

  • clustrix/executor.py (core execution logic)
  • clustrix/config.py (configuration management)
  • clustrix/utils.py (job script generation)
  • tests/test_executor.py (execution tests)

Priority

High - Establishes baseline functionality before cloud provider debugging
