Sr Director of Engineering - Infinia

ID: 2024-4988
# of Openings: 1
Location: Remote (Germany)
Posting Location (Country): Germany
Posting Location (City): Remote
Job Function: Engineering
Worker Type: Regular Full-Time Employee

Overview

This is an incredible opportunity to be part of a company that has been at the forefront of AI and high-performance data storage innovation for over two decades. DataDirect Networks (DDN) is a global market leader renowned for powering many of the world's most demanding AI data centers, in industries ranging from life sciences and healthcare to financial services, autonomous cars, government, academia, research, and manufacturing.

"DDN's A3I solutions are transforming the landscape of AI infrastructure." – IDC

“The real differentiator is DDN. I never hesitate to recommend DDN. DDN is the de facto name for AI Storage in high performance environments” - Marc Hamilton, VP, Solutions Architecture & Engineering | NVIDIA

DDN is the global leader in AI and multi-cloud data management at scale. Our cutting-edge data intelligence platform is designed to accelerate AI workloads, enabling organizations to extract maximum value from their data. With a proven track record of performance, reliability, and scalability, DDN empowers businesses to tackle the most challenging AI and data-intensive workloads with confidence.

Our success is driven by our unwavering commitment to innovation, customer-centricity, and a team of passionate professionals who bring their expertise and dedication to every project. This is a chance to make a significant impact at a company that is shaping the future of AI and data management.

Our commitment to innovation, customer success, and market leadership makes this an exciting and rewarding role for a driven professional looking to make a lasting impact in the world of AI and data storage.

Job Description

Sr Director of Engineering - Infinia Distributed Platform

We are looking for an experienced and technically driven Senior Director of Engineering to lead the Infinia Distributed Platform organization, the foundational team powering DDN’s flagship AI-native distributed data platform.

In this role, you will oversee engineering teams responsible for the core systems that enable Infinia’s performance, scalability, and reliability at global scale. This includes mission-critical components such as task scheduling, distributed tracing, memory management, SPDK data access, profiling, networking, reliability, distributed locking, internal key-value stores, and filesystem clients — all orchestrated within a multi-tenant, high-throughput environment.

You will define the strategy, scale execution, and mentor engineering leaders to deliver production-grade systems that meet the demands of AI/ML, high-performance computing, and enterprise analytics.

This is a hands-on technical leadership role at the heart of Infinia’s distributed architecture — where decisions today shape how data moves tomorrow.

Key Responsibilities

Core Systems Leadership

Lead and scale multiple engineering teams focused on critical path components of the Infinia platform:

Task scheduling and orchestration
Tracing and observability infrastructure
Memory management and performance tuning
SPDK-based I/O data path
Reliability and fault-tolerance systems
Networking stack optimization and event-driven I/O
TDS (Tenant Data Services) and multi-tenant isolation
DLM (Distributed Lock Manager) and concurrency control
Internal KVStore for system metadata and state
FS client for scalable POSIX-like access

Technical Strategy & Execution

Own the end-to-end architecture, roadmap, and execution for all core components.
Guide technical design reviews, enforce performance standards, and align cross-team priorities to platform milestones.
Collaborate with architecture and infrastructure teams to evolve platform interfaces, service contracts, and internal APIs.

Organizational Growth & Team Development

Hire, mentor, and develop engineering managers and senior ICs to build a culture of accountability, innovation, and technical rigor.
Drive a results-oriented mindset focused on high-velocity, high-reliability software delivery.
Set clear goals and foster professional growth through coaching, feedback, and performance management.

Cross-Functional Collaboration

Partner with product management, field engineering, and customer teams to shape feature priorities and ensure core platform needs are anticipated early.
Interface with support and site reliability teams to define SLAs, improve telemetry, and reduce MTTR for platform incidents.
Contribute to platform-wide initiatives in multi-tenancy, fault isolation, observability, and performance benchmarking.

Platform Reliability & Performance

Champion operational excellence across core services — including incident response, regression testing, and release stability.
Optimize memory usage, lock contention, thread scheduling, and task pipelines to deliver microsecond-level performance where required.
Establish strong internal metrics and observability standards to measure system health, responsiveness, and uptime.

Required Qualifications

12+ years of engineering experience in distributed systems, operating systems, or storage platform engineering.
5+ years of experience leading multi-team organizations delivering core systems software in production environments.
Strong expertise in systems programming (C, C++, Rust) and deep knowledge of concurrency, memory models, and network programming.
Proven track record designing and scaling services related to task scheduling, locking, memory, and I/O performance.
Experience managing components at the intersection of infrastructure and application performance, especially in multi-tenant platforms.
Excellent communication, roadmap planning, and cross-functional leadership skills.

Preferred Qualifications

Experience with SPDK, RDMA, DPDK, or high-performance storage stacks.
Knowledge of distributed coordination protocols, key-value stores, or scalable metadata architectures.
Background in AI/ML, HPC, or cloud-native infrastructure (Kubernetes, microservices, etc.).
Familiarity with observability tools (e.g., tracing frameworks, profilers, Prometheus, OpenTelemetry).

Success Metrics – First 30 Days

Strategic Alignment

Ramp up on all core components, existing technical challenges, and roadmap priorities.
Meet with team leads and cross-functional partners to assess execution readiness and architectural cohesion.

Early Impact

Identify 2–3 areas for performance optimization, team structure refinement, or architectural alignment.
Deliver a 90-day strategy plan outlining key initiatives across reliability, latency, and scalability.

Team Integration

Build trust and alignment with engineering managers and ICs.
Assess hiring needs and begin shaping the next phase of team growth.

Success Metrics – Beyond 30 Days

Timely, high-quality delivery of core platform milestones aligned to the product roadmap.
Improvements in performance, fault-tolerance, and memory/network efficiency across key subsystems.
Clear reduction in escalations, latency spikes, and cross-component coordination complexity.
Team health, engagement, and velocity aligned with long-term technical and business goals.
