HPC Engineer, AI Infrastructure
Company: Tesla, Inc.
Location: Palo Alto
Posted on: June 2, 2025
Job Description:
Tesla's Supercomputing/AI infrastructure team works directly
with the high-performance computing and machine learning
infrastructure on which our ML algorithms run; this includes
virtual simulations, Autopilot hardware, silicon design, and Dojo.
With the rapidly growing need for more data and optimized compute
resources, cluster builds are getting larger and increasingly
complex. Continued development/automation of deployment,
monitoring, self-healing and alerting processes is imperative to
the success of our engineering groups. As the scope and impact of
our Optimus, Full-Self-Driving (FSD) & Robotaxi efforts continue to
scale, so does the value of this team and its work.

As an HPC
Engineer, you will be responsible for maintaining and improving our
platform to ensure our Full-Self-Driving (FSD), Optimus & Dojo
engineering teams have the necessary tools and resources to be
productive. This includes managing/operating our AI infrastructure,
monitoring compute/GPU/network metrics, Linux troubleshooting &
performance tuning, and collaborating with our Data Center team to
coordinate the smooth operation of hundreds of servers & bring up
new GPU capacity. Your work will directly facilitate neural network
training at scale, streamline FSD development, and enable Dojo to
become the most powerful supercomputer to date.
What You'll Do
- Support the AI/ML cluster infrastructure on both GPU and Dojo
platforms, focusing on systems automation, configuration management
and deployment at scale
- Improve our monitoring & self-healing pipelines, as well as
security posture
- Work with hardware and storage vendors to tune and optimize our
server, storage and network performance
- Performance tuning & OS provisioning on Linux systems
- Manage HPC clusters, workloads and applications
- Automation and systems engineering
- Participate in a 24x7 on-call rotation
What You'll Bring
- Proficiency with scripting languages such as Python or
Bash
- Proficiency with Linux & network fundamentals
- Experience with configuration management software (Ansible,
etc.), systems monitoring & alerting (Prometheus, Grafana,
Telegraf, Splunk, etc.) is a plus
- Experience with high-throughput low-latency networks, GPU-based
computing systems, and/or high performance storage systems is a
plus
- Experience with Slurm, LSF and storage management of parallel
file systems is a plus
- Bachelor's Degree in Computer Science, Computer Engineering,
Electrical Engineering, Physics or proof of exceptional skills in
related field
- 3+ years of additional equivalent experience or evidence of
exceptional ability related to the position
Compensation and Benefits
Along with competitive pay, as a full-time Tesla employee,
you are eligible for the following benefits at day 1 of hire:
- Aetna PPO and HSA plans > 2 medical plan options with $0
payroll deduction
- Family-building, fertility, adoption and surrogacy
benefits
- Dental (including orthodontic coverage) and vision plans, both
have options with a $0 paycheck contribution
- Company-paid Health Savings Account (HSA) contribution when
enrolled in the High Deductible Aetna medical plan with HSA
- Healthcare and Dependent Care Flexible Spending Accounts
(FSA)
- 401(k) with employer match, Employee Stock Purchase Plans, and
other financial benefits
- Company paid Basic Life, AD&D, short-term and long-term
disability insurance
- Employee Assistance Program
- Sick and Vacation time (Flex time for salary positions), and
Paid Holidays
- Back-up childcare and parenting support resources
- Voluntary benefits to include: critical illness, hospital
indemnity, accident insurance, theft & legal services, and pet
insurance
- Weight Loss and Tobacco Cessation Programs
- Tesla Babies program
- Commuter benefits
- Employee discounts and perks program
Expected Compensation
$133,440 - $355,920/annual salary + cash and stock awards + benefits
Pay offered may vary depending on multiple
individualized factors, including market location, job-related
knowledge, skills, and experience. The total compensation package
for this position may also include other elements dependent on the
position offered. Details of participation in these benefit plans
will be provided if an employee receives an offer of
employment.

Tesla is an Equal Opportunity employer. All qualified
applicants will receive consideration for employment without regard
to any factor, including veteran status and disability status,
protected by applicable federal, state or local laws.

Tesla is also
committed to working with and providing reasonable accommodations
to individuals with disabilities. Please let your recruiter know if
you need an accommodation at any point during the interview
process.

Privacy is a top priority for Tesla. We build it into our
products and view it as an essential part of our business. To
understand more about the data we collect and process as part of
your application, please view our Tesla Talent Privacy Notice.