
NVIDIA Jetson Orin Nano Guide: From Unboxing to Running Model Benchmarks

By Anthony Kung
Domain: Embedded AI and infrastructure
Focus: Jetson Orin Nano setup, performance tuning, and benchmarking
Scope: JetPack, Docker, vision benchmarks, Ollama, and vLLM benchmarking


The NVIDIA Jetson Orin Nano Developer Kit is one of the more approachable ways to do modern edge AI work locally. It gives you an ARM Linux system, CUDA support, TensorRT acceleration, camera interfaces, GPIO, NVMe support, and enough GPU performance to run real computer vision workloads plus smaller local LLM and VLM experiments.

NVIDIA's current Jetson Orin Nano Super Developer Kit positioning highlights 67 sparse INT8 TOPS, a 1024-core Ampere GPU, 32 Tensor Cores, a 6-core Arm Cortex-A78AE CPU, 8GB LPDDR5, 102 GB/s memory bandwidth, support for microSD and NVMe storage, and operation across 7W to 25W modes depending on configuration (NVIDIA Technical Blog).

This guide covers the full setup flow from unboxing to benchmarking, including microSD preparation, firmware update considerations, first boot, system verification, performance mode configuration, Docker validation, computer vision benchmarking, Ollama testing, and vLLM-style LLM benchmarking.


Required hardware

The Jetson Orin Nano Developer Kit package typically includes:

  • Jetson Orin Nano module
  • reference carrier board
  • preinstalled Wi-Fi/Bluetooth module
  • Wi-Fi antenna
  • 19V power supply
  • quick-start material

Additional required items:

  • 64GB or larger microSD card
  • DisplayPort monitor, or a known-good DisplayPort-to-HDMI adapter
  • USB keyboard
  • USB mouse
  • Ethernet or Wi-Fi network access
  • host computer for writing the microSD card

Optional but strongly recommended:

  • M.2 NVMe SSD
  • better airflow or active cooling
  • USB or CSI camera

The microSD card is fine for first boot and basic learning. For real benchmarking, Docker images, Hugging Face caches, and larger model files, NVMe storage is the better long-term path.

Recommended software version

JetPack 6.x is the practical target for Jetson Orin Nano development.

JetPack 6.2 includes Jetson Linux 36.4.3, Linux kernel 5.15, an Ubuntu 22.04 root filesystem, CUDA 12.6, TensorRT 10.3, cuDNN 9.3, VPI 3.2, and support for new Super modes on Jetson Orin Nano and Orin NX (NVIDIA Technical Blog).

Jetson Linux 36.4.4, which ships as part of JetPack 6.2.1, is a product-quality release for Jetson Orin Nano and includes fixes related to SDK Manager flashing and Docker 28 compatibility (NVIDIA Docs).

JetPack 7 exists, but current JetPack 7 releases are aimed primarily at newer Thor-class hardware. For Jetson Orin Nano, JetPack 6.x remains the safer target unless NVIDIA explicitly documents newer support for this board.

Firmware compatibility warning

Fresh Jetson Orin Nano Developer Kits may ship with older factory firmware. Older firmware can create a frustrating setup failure: the board may boot JetPack 5.x, but fail to boot JetPack 6.x cleanly.

Common pattern:

Fresh Jetson Orin Nano
  |
  +-- Firmware already supports JetPack 6.x
  |     |
  |     +-- Flash JetPack 6.x SD card and boot normally
  |
  +-- Firmware is too old
        |
        +-- JetPack 6.x may show black screen, UEFI shell, or boot failure
        +-- Boot JetPack 5.x first
        +-- Apply the QSPI bootloader update
        +-- Reboot with JetPack 6.x SD card

NVIDIA's JetPack installation docs explicitly note that using JetPack 6.x SD card images for the first time may require updating the QSPI bootloaders first (NVIDIA Docs).

The practical rule is: reflashing the same JetPack 6.x microSD card over and over will not fix an outdated QSPI firmware problem.

Setup path options

There are two normal setup paths.

Path A: microSD card method

Best for:

  • Windows hosts
  • macOS hosts
  • Linux hosts where the simplest setup is preferred
  • first-time board bring-up
  • initial validation

This is the cleanest starting point for most users.

Path B: SDK Manager method

Best for:

  • x86 Ubuntu 20.04 or 22.04 host systems
  • direct flashing through NVIDIA SDK Manager
  • NVMe-first installation
  • recovery-mode workflows
  • more advanced maintenance

A realistic workflow is:

Initial bring-up: microSD
Development storage: NVMe
Advanced flashing: SDK Manager
Production-style setup: SDK Manager or manual flashing

Flashing the microSD card

Download the Jetson Orin Nano Developer Kit SD card image from NVIDIA's JetPack download page.

If the board already has JetPack 6-compatible firmware, write the JetPack 6.x image directly. If the board fails to boot JetPack 6.x, complete the older-firmware update path first.

Windows or macOS host

  1. Insert the microSD card.
  2. Open Balena Etcher.
  3. Select the Jetson Orin Nano image.
  4. Select the microSD card.
  5. Start flashing.
  6. Safely eject the card.

Linux host

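GUI tools such as Balena Etcher, Raspberry Pi Imager, or GNOME Disks work well. A command-line path also works. The SD card image is downloaded as a .zip archive, so extract it first; the .img filename below is a placeholder for the extracted image:

lsblk

# Replace /dev/sdX with the real microSD device.
# Do not use a partition path like /dev/sdX1.
# If the desktop auto-mounted the card, unmount its partitions first (for example: sudo umount /dev/sdX1).
sudo dd if=jetson-orin-nano-image.img of=/dev/sdX bs=64M status=progress conv=fsync
sync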

Warning: dd can destroy the wrong disk if you choose the wrong device path.
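Because the most common dd mistake is pointing at a partition instead of the whole card, a tiny pre-flight check can help. The helper below is a hypothetical sketch, not part of any NVIDIA or Linux tooling: it only recognizes common Linux naming (sdX, mmcblkN, nvmeNnM) and cannot tell whether the disk is actually your microSD card, so still compare the size shown by lsblk.

```python
import re
import sys

# Hypothetical helper: whole-disk names on Linux look like /dev/sdb, /dev/mmcblk0, or /dev/nvme0n1.
# Partitions add a suffix (/dev/sdb1, /dev/mmcblk0p1, /dev/nvme0n1p1) and must not be used with dd.
WHOLE_DISK_PATTERNS = (
    re.compile(r"/dev/sd[a-z]+"),
    re.compile(r"/dev/mmcblk\d+"),
    re.compile(r"/dev/nvme\d+n\d+"),
)


def looks_like_whole_disk(path: str) -> bool:
    """Return True if path matches a whole-disk naming pattern rather than a partition."""
    return any(pattern.fullmatch(path) for pattern in WHOLE_DISK_PATTERNS)


if __name__ == "__main__":
    # Example: python3 check_device.py /dev/sdb
    for device in sys.argv[1:]:
        verdict = "whole disk" if looks_like_whole_disk(device) else "NOT a whole-disk path"
        print(f"{device}: {verdict}")
```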

First boot

Insert the flashed microSD card into the slot on the underside of the module.

Connect:

  • DisplayPort monitor
  • USB keyboard
  • USB mouse
  • Ethernet, if available
  • 19V barrel power supply

The board powers on when power is connected.

During first boot, Ubuntu asks for:

  • EULA acceptance
  • language
  • keyboard layout
  • time zone
  • network setup
  • username and password
  • login configuration

If the board drops into a UEFI shell, shows a black screen, or fails to boot Ubuntu, outdated firmware is one of the first things to suspect.

System verification

After login:

cat /etc/os-release
cat /etc/nv_tegra_release
dpkg-query -W nvidia-l4t-core
uname -a
lsblk
df -h
free -h
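If you want the L4T release in a machine-readable form (for benchmark logs, for instance), the first line of /etc/nv_tegra_release is easy to parse. A sketch, assuming the usual `# R36 (release), REVISION: 4.3, ...` layout; the JetPack mapping only covers R35 (JetPack 5.x) and R36 (JetPack 6.x):

```python
import re
from pathlib import Path
from typing import Optional

# Typical first line: "# R36 (release), REVISION: 4.3, GCID: ..., BOARD: generic, EABI: aarch64, DATE: ..."
RELEASE_RE = re.compile(r"#\s*R(\d+)\s*\(release\),\s*REVISION:\s*([\d.]+)")

# L4T R35.x shipped as JetPack 5.x and R36.x as JetPack 6.x; other releases are not covered here.
JETPACK_FAMILY = {35: "5.x", 36: "6.x"}


def parse_l4t_release(text: str) -> Optional[str]:
    """Return a version such as '36.4.3', or None if the text has an unexpected format."""
    match = RELEASE_RE.search(text)
    if match is None:
        return None
    return f"{match.group(1)}.{match.group(2)}"


def jetpack_family(l4t_version: str) -> Optional[str]:
    """Map an L4T version string to its JetPack major family, if known."""
    return JETPACK_FAMILY.get(int(l4t_version.split(".")[0]))


if __name__ == "__main__":
    release_file = Path("/etc/nv_tegra_release")
    if release_file.exists():
        version = parse_l4t_release(release_file.read_text())
        family = jetpack_family(version) if version else None
        print(f"L4T {version} (JetPack {family})")
    else:
        print(f"{release_file} not found; is this a Jetson?")
```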

Open a second terminal and monitor runtime behavior:

sudo tegrastats

tegrastats is one of the most useful Jetson tools because it shows CPU and GPU load, memory and swap usage, temperatures, and power-rail readings in one place.
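Those readings are easier to compare across runs once they are extracted into numbers. The parser below is a sketch: the sample line is illustrative only, field names and layout differ between L4T releases and boards, and it assumes VDD_IN is reported as the module input rail in `currentmW/averagemW` form.

```python
import re
import sys
from typing import Any, Dict

# Illustrative line only; real tegrastats output varies by release and board.
SAMPLE_LINE = (
    "RAM 2215/7620MB (lfb 2x4MB) SWAP 0/3810MB (cached 0MB) "
    "CPU [2%@729,1%@729,0%@729,0%@729,0%@729,0%@729] EMC_FREQ 0%@2133 "
    "GR3D_FREQ 0%@[305] cpu@47.5C soc2@46.968C gpu@47.187C tj@48.5C "
    "VDD_IN 4524mW/4524mW VDD_CPU_GPU_CV 483mW/483mW"
)


def parse_tegrastats(line: str) -> Dict[str, Any]:
    """Extract RAM, GPU load, temperatures, and VDD_IN power from one tegrastats line."""
    stats: Dict[str, Any] = {}
    ram = re.search(r"RAM (\d+)/(\d+)MB", line)
    if ram:
        stats["ram_used_mb"] = int(ram.group(1))
        stats["ram_total_mb"] = int(ram.group(2))
    gpu = re.search(r"GR3D_FREQ (\d+)%", line)
    if gpu:
        stats["gpu_load_pct"] = int(gpu.group(1))
    # Temperatures appear as name@valueC, e.g. gpu@47.187C.
    stats["temps_c"] = {
        name: float(value) for name, value in re.findall(r"(\w+)@(-?[\d.]+)C\b", line)
    }
    # Power rails appear as NAME currentmW/averagemW.
    vdd_in = re.search(r"VDD_IN (\d+)mW/(\d+)mW", line)
    if vdd_in:
        stats["vdd_in_mw"] = int(vdd_in.group(1))
        stats["vdd_in_avg_mw"] = int(vdd_in.group(2))
    return stats


if __name__ == "__main__":
    # Usage: python3 parse_tegrastats.py tegrastats.log  (without an argument, parses the sample)
    if len(sys.argv) > 1:
        with open(sys.argv[1]) as log:
            for raw in log:
                print(parse_tegrastats(raw))
    else:
        print(parse_tegrastats(SAMPLE_LINE))
```

To capture a run for later parsing, tegrastats can write to a file at a fixed interval, for example `sudo tegrastats --interval 1000 --logfile tegrastats.log`; check `tegrastats --help` on your release for the exact options.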

System update

Run normal package updates:

sudo apt update
sudo apt upgrade -y
sudo reboot

After reboot:

cat /etc/nv_tegra_release
dpkg-query -W nvidia-l4t-core

Avoid mixing random desktop CUDA, TensorRT, PyTorch, or container packages with JetPack-managed Jetson packages unless you know exactly how they line up with the installed L4T release.

Basic developer tools

Install a practical baseline:

sudo apt install -y \
  build-essential git curl wget htop nano vim \
  python3-pip python3-venv python3-dev cmake pkg-config unzip

Optional but useful:

sudo apt install -y net-tools openssh-server tmux

Enable SSH:

sudo systemctl enable ssh
sudo systemctl start ssh
ip addr

Connect from another machine:

ssh username@jetson_ip_address

Performance mode setup

For benchmarking, use the official power supply, adequate cooling, and the highest supported performance mode that your thermal setup can actually sustain.

Check the current mode:

sudo nvpmodel -q
sudo nvpmodel -q --verbose

JetPack 6.2 introduced new Super modes for supported Orin Nano configurations. The exact mode IDs can vary, so check the board itself before assuming -m 0 means the same thing across every image (NVIDIA Technical Blog).
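One way to see which IDs exist on your image is to read the mode headers from /etc/nvpmodel.conf. This sketch assumes the `< POWER_MODEL ID=... NAME=... >` header format used in Jetson nvpmodel configuration files; if it prints nothing, inspect the file by hand.

```python
import re
from pathlib import Path
from typing import Dict

# Assumed header format in nvpmodel config files: < POWER_MODEL ID=<n> NAME=<name> >
MODE_HEADER = re.compile(r"<\s*POWER_MODEL\s+ID=(\d+)\s+NAME=([^\s>]+)\s*>")


def list_power_modes(conf_text: str) -> Dict[int, str]:
    """Map each power-mode ID to its name as declared in the nvpmodel config text."""
    return {int(mode_id): name for mode_id, name in MODE_HEADER.findall(conf_text)}


if __name__ == "__main__":
    conf = Path("/etc/nvpmodel.conf")
    if conf.exists():
        for mode_id, name in sorted(list_power_modes(conf.read_text()).items()):
            print(f"-m {mode_id}: {name}")
    else:
        print(f"{conf} not found (not a Jetson, or a different config location)")
```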

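Set the desired high-performance mode. The 0 below is only a placeholder: substitute the ID of the mode you want, because on Orin Nano the lowest ID is not necessarily the fastest mode: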

sudo nvpmodel -m 0
sudo reboot

After reboot:

sudo nvpmodel -q

Lock clocks for benchmark runs:

sudo jetson_clocks
sudo jetson_clocks --show

For reproducible results, always record:

sudo nvpmodel -q
sudo jetson_clocks --show
cat /etc/nv_tegra_release
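Capturing those outputs automatically makes the record harder to forget. The script below is a sketch with hypothetical names: it runs the same three commands and prints their output as JSON that can be saved next to each result file. It uses `sudo -n` so it fails instead of hanging on a password prompt; run the whole script with sudo if you want the privileged outputs included.

```python
import json
import subprocess
from typing import Dict, List

# Mirrors the checklist above; "sudo -n" never prompts, it just fails without cached credentials.
DEFAULT_COMMANDS: List[List[str]] = [
    ["sudo", "-n", "nvpmodel", "-q"],
    ["sudo", "-n", "jetson_clocks", "--show"],
    ["cat", "/etc/nv_tegra_release"],
]


def capture(commands: List[List[str]], timeout_s: float = 30.0) -> Dict[str, str]:
    """Run each command and map its command line to stdout, or to an error description."""
    results: Dict[str, str] = {}
    for cmd in commands:
        key = " ".join(cmd)
        try:
            proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout_s)
        except (OSError, subprocess.TimeoutExpired) as exc:
            results[key] = f"ERROR: {exc}"
            continue
        if proc.returncode == 0:
            results[key] = proc.stdout.strip()
        else:
            results[key] = f"ERROR (exit {proc.returncode}): {proc.stderr.strip()}"
    return results


if __name__ == "__main__":
    print(json.dumps(capture(DEFAULT_COMMANDS), indent=2))
```

For example: `sudo python3 capture_config.py > run-config.json` (both filenames are arbitrary).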

Jetson Stats

jtop is a convenient higher-level monitor for Jetson devices.

Install:

sudo -H pip3 install -U jetson-stats
sudo reboot

Run:

jtop

Useful values include CPU, GPU, RAM, swap, temperatures, power mode, fan behavior, and throttling state.

Optional NVMe setup

NVMe storage is strongly recommended for model files, Docker images, and benchmark datasets.

Check the drive:

lsblk

Assuming the drive appears as /dev/nvme0n1:

sudo parted /dev/nvme0n1 --script mklabel gpt
sudo parted /dev/nvme0n1 --script mkpart primary ext4 0% 100%
sudo mkfs.ext4 -F /dev/nvme0n1p1
sudo mkdir -p /mnt/nvme
sudo mount /dev/nvme0n1p1 /mnt/nvme

Get the UUID:

sudo blkid /dev/nvme0n1p1

Add it to /etc/fstab:

UUID=YOUR_UUID_HERE /mnt/nvme ext4 defaults,noatime 0 2
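Copying the UUID by hand is error-prone, and blkid also prints a PARTUUID that is easy to grab by mistake. A small sketch (hypothetical script name and helpers) that pulls out the filesystem UUID and builds the same fstab entry:

```python
import re
import sys
from typing import Optional


def uuid_from_blkid(blkid_output: str) -> Optional[str]:
    """Return the filesystem UUID from blkid output, ignoring PARTUUID/PTUUID fields."""
    # \b stops the pattern from matching the UUID="..." tail of PARTUUID="...".
    match = re.search(r'\bUUID="([^"]+)"', blkid_output)
    return match.group(1) if match else None


def fstab_line(uuid: str, mountpoint: str = "/mnt/nvme", fstype: str = "ext4",
               options: str = "defaults,noatime") -> str:
    """Build the fstab entry used in this guide: dump 0, fsck pass 2."""
    return f"UUID={uuid} {mountpoint} {fstype} {options} 0 2"


if __name__ == "__main__":
    # Usage: python3 make_fstab_line.py "$(sudo blkid /dev/nvme0n1p1)"
    if len(sys.argv) > 1:
        uuid = uuid_from_blkid(sys.argv[1])
        print(fstab_line(uuid) if uuid else "no UUID found in blkid output")
```

If you only need the value, `sudo blkid -s UUID -o value /dev/nvme0n1p1` prints the UUID alone.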

Test the mount:

sudo umount /mnt/nvme
sudo mount -a
df -h

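Create working folders. The freshly mounted filesystem is owned by root, so give your user ownership of the mount point first:

sudo chown "$USER":"$USER" /mnt/nvme
mkdir -p /mnt/nvme/models
mkdir -p /mnt/nvme/hf-cache
mkdir -p /mnt/nvme/docker-data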
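The docker-data folder only takes effect once Docker is told to use it through the data-root option in /etc/docker/daemon.json. On JetPack images that file typically already registers the nvidia runtime, so the setting should be merged in rather than overwriting the file. A minimal sketch (hypothetical script) that prints the merged JSON for review:

```python
import json
from pathlib import Path
from typing import Any, Dict

DAEMON_JSON = Path("/etc/docker/daemon.json")
DATA_ROOT = "/mnt/nvme/docker-data"  # the folder created above


def with_data_root(config: Dict[str, Any], data_root: str) -> Dict[str, Any]:
    """Return a copy of the daemon config with data-root set; other keys (e.g. runtimes) are kept."""
    merged = dict(config)
    merged["data-root"] = data_root
    return merged


if __name__ == "__main__":
    text = DAEMON_JSON.read_text() if DAEMON_JSON.exists() else ""
    current = json.loads(text) if text.strip() else {}
    print(json.dumps(with_data_root(current, DATA_ROOT), indent=4))
```

Write the output to a temporary file first (for example `python3 merge_daemon_json.py > /tmp/daemon.json`, then `sudo cp /tmp/daemon.json /etc/docker/daemon.json` and `sudo systemctl restart docker`). Piping straight into `sudo tee /etc/docker/daemon.json` can truncate the file before the script reads it. Images stored under the old location are not moved automatically, so expect to pull them again.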

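Move Hugging Face caches. HF_HOME is enough for current huggingface_hub and transformers releases. TRANSFORMERS_CACHE is deprecated and only needed by older transformers versions; if you set it, point it at the hub subfolder so both libraries share one cache layout:

echo 'export HF_HOME=/mnt/nvme/hf-cache' >> ~/.bashrc
echo 'export TRANSFORMERS_CACHE=/mnt/nvme/hf-cache/hub' >> ~/.bashrc
source ~/.bashrc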

Docker GPU runtime verification

JetPack includes NVIDIA container runtime integration for Jetson.

Check Docker:

docker --version
sudo docker info | grep -i runtime

Run a Jetson L4T base container:

sudo docker run --rm -it --runtime nvidia nvcr.io/nvidia/l4t-base:r36.2.0

Inside the container:

cat /etc/os-release
exit

If Docker breaks after updates, check the installed L4T version and whether the board is on a release with the Docker 28 compatibility fixes noted for 36.4.4 (NVIDIA Docs).

Running benchmarks

Recommended benchmark groups:

1. System baseline
2. Computer vision / TensorRT-style model benchmarks
3. Ollama local LLM smoke test
4. vLLM serving benchmark
5. Optional custom PyTorch CUDA benchmark

For every result, record:

Board:
JetPack version:
Jetson Linux / L4T version:
Power mode:
jetson_clocks status:
Storage:
Cooling:
Ambient temperature:
Model:
Precision:
Batch size:
Input size:
Average latency:
Throughput:
Power:
Temperature:
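Keeping results in a structured file makes comparisons across power modes easier. A minimal sketch, assuming CSV output is acceptable; the class and field names are illustrative and simply mirror the checklist above:

```python
import csv
import io
from dataclasses import asdict, dataclass, fields
from typing import List, Optional


@dataclass
class BenchmarkRecord:
    """One benchmark result plus the configuration it was measured under."""

    board: str
    jetpack: str
    l4t: str
    power_mode: str
    jetson_clocks: bool
    storage: str
    cooling: str
    ambient_temp_c: Optional[float]
    model: str
    precision: str
    batch_size: int
    input_size: str
    avg_latency_ms: float
    throughput: float
    throughput_unit: str  # e.g. "FPS" or "tok/s"
    avg_power_w: Optional[float]
    peak_temp_c: Optional[float]


def records_to_csv(records: List[BenchmarkRecord]) -> str:
    """Serialize records to CSV text with a header row derived from the dataclass fields."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=[f.name for f in fields(BenchmarkRecord)])
    writer.writeheader()
    for record in records:
        writer.writerow(asdict(record))  # None values are written as empty cells
    return buffer.getvalue()
```

Append one record per run and keep the CSV next to the raw tegrastats log for that run.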

Benchmark 1: system baseline

Monitoring terminal:

sudo tegrastats

Configuration terminal:

sudo nvpmodel -q
sudo jetson_clocks --show
cat /etc/nv_tegra_release
free -h
df -h

Install sysbench:

sudo apt install -y sysbench

CPU test:

sysbench cpu --threads=6 --time=30 run

Memory test:

sysbench memory --time=30 run

This does not measure AI acceleration directly, but it does catch unstable setups quickly.

Benchmark 2: NVIDIA Jetson computer vision benchmarks

NVIDIA's jetson_benchmarks repository supports common models such as Inception V4, ResNet-50, OpenPose, VGG-19, YOLO-V3, Super Resolution, and UNet, and includes a benchmark CSV for Orin Nano (GitHub).

cd ~
git clone https://github.com/NVIDIA-AI-IOT/jetson_benchmarks.git
cd jetson_benchmarks
mkdir -p models
sudo sh install_requirements.sh

Download Orin Nano models:

python3 utils/download_models.py \
  --all \
  --csv_file_path benchmark_csv/orin-nano-benchmarks.csv \
  --save_dir "$(pwd)/models"

Confirm the power mode and lock the clocks:

sudo nvpmodel -q
sudo jetson_clocks

Run all Orin Nano benchmarks:

sudo python3 benchmark.py \
  --all \
  --csv_file_path benchmark_csv/orin-nano-benchmarks.csv \
  --model_dir "$(pwd)/models" \
  --jetson_clocks

Run a single model:

sudo python3 benchmark.py \
  --model_name resnet \
  --csv_file_path benchmark_csv/orin-nano-benchmarks.csv \
  --model_dir "$(pwd)/models" \
  --jetson_clocks

Benchmark 3: Ollama LLM smoke test

Ollama is a simple way to do a first local-LLM sanity check.

Install:

curl -fsSL https://ollama.com/install.sh | sh

Check service status:

systemctl status ollama

Run a small model:

ollama run llama3.2:1b

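Example controlled timing test. time measures the whole command, including model loading on a cold start. Ollama keeps a model loaded for a few minutes by default, so a second run mostly excludes load time. The --verbose flag additionally prints prompt-evaluation and generation rates in tokens per second:

time ollama run llama3.2:1b "Write a short explanation of TensorRT for edge AI."
ollama run --verbose llama3.2:1b "Write a short explanation of TensorRT for edge AI."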

In another terminal:

sudo tegrastats
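For numbers that exclude terminal rendering and shell overhead, Ollama's local REST API returns timing fields with each non-streamed response, with durations in nanoseconds. A sketch assuming the default address http://localhost:11434 and the /api/generate endpoint; the script name is hypothetical:

```python
import json
import sys
import urllib.request
from typing import Any, Dict, Optional

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local address
NS_PER_S = 1e9


def _rate(count: Optional[int], duration_ns: Optional[int]) -> Optional[float]:
    """Tokens per second, or None when the field is missing or zero."""
    if not count or not duration_ns:
        return None
    return count / (duration_ns / NS_PER_S)


def timing_summary(response: Dict[str, Any]) -> Dict[str, Optional[float]]:
    """Convert Ollama's nanosecond timing fields into seconds and tokens/s."""
    return {
        "load_s": response.get("load_duration", 0) / NS_PER_S,
        "total_s": response.get("total_duration", 0) / NS_PER_S,
        "prompt_tok_per_s": _rate(response.get("prompt_eval_count"), response.get("prompt_eval_duration")),
        "output_tok_per_s": _rate(response.get("eval_count"), response.get("eval_duration")),
    }


def generate(prompt: str, model: str = "llama3.2:1b", timeout_s: float = 600.0) -> Dict[str, Any]:
    """Send one non-streaming request and return the parsed JSON response."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode("utf-8")
    request = urllib.request.Request(OLLAMA_URL, data=body, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(request, timeout=timeout_s) as response:
        return json.load(response)


if __name__ == "__main__":
    # Usage: python3 ollama_timing.py "Write a short explanation of TensorRT for edge AI."
    if len(sys.argv) < 2:
        print("usage: ollama_timing.py PROMPT")
    else:
        print(timing_summary(generate(sys.argv[1])))
```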

For larger models, NVMe is strongly recommended. The Orin Nano's shared 8GB memory means quantization, context length, and runtime choice matter a lot.
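A back-of-the-envelope check helps decide what can fit: weight memory is roughly parameter count times bits per weight divided by 8. This estimate ignores the KV cache, activations, the CUDA context, and layers that quantized checkpoints often keep at higher precision, all of which share the same 8GB with the OS, so treat it as a lower bound.

```python
def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Rough weight footprint in GB (1e9 bytes): parameters x bits per weight / 8."""
    # params_billion * 1e9 params * (bits / 8) bytes, divided by 1e9 bytes per GB.
    return params_billion * bits_per_weight / 8


if __name__ == "__main__":
    for params in (1, 8):
        for bits in (16, 8, 4):
            print(f"{params}B params @ {bits}-bit ~ {weight_memory_gb(params, bits):.1f} GB of weights")
```

By this estimate an 8B model needs about 16 GB for weights at 16-bit, which cannot fit, but about 4 GB at 4-bit.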

Benchmark 4: vLLM LLM serving benchmark

Jetson AI Lab documents a vLLM-based GenAI benchmarking flow for Jetson (Jetson AI Lab).

Set the high-performance mode (replace 0 with the mode ID you selected earlier) and lock the clocks:

sudo nvpmodel -m 0
sudo jetson_clocks

Pull the container:

sudo docker pull ghcr.io/nvidia-ai-iot/vllm:latest-jetson-orin

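Start the container. The Hugging Face cache mount uses HF_HOME when it is set, so downloaded models land in the NVMe cache configured earlier instead of on the microSD card:

sudo docker run --rm -it \
  --network host \
  --shm-size=16g \
  --ulimit memlock=-1 \
  --ulimit stack=67108864 \
  --runtime=nvidia \
  --name=vllm \
  -v "${HF_HOME:-$HOME/.cache/huggingface}:/root/.cache/huggingface" \
  ghcr.io/nvidia-ai-iot/vllm:latest-jetson-orin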

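In the first shell (the one started by docker run), serve a quantized model. On the 8GB Orin Nano an 8B model is close to the memory limit even at 4-bit; if the server fails to start, try a smaller model or cap the context with --max-model-len, keeping it large enough for the 2048-token prompts plus 128 output tokens used in the benchmarks below:

vllm serve "RedHatAI/Meta-Llama-3.1-8B-Instruct-quantized.w4a16" \
  --gpu-memory-utilization 0.8

Open a second shell into the same container to run the benchmark client:

sudo docker exec -it vllm bash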
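Once the server reports that it is ready, a single short request confirms the model loaded before you spend time on a full benchmark. A sketch, assuming vllm serve's default port 8000 and its OpenAI-compatible /v1/completions route (reachable from the host because of --network host); the script name is hypothetical:

```python
import json
import sys
import urllib.request
from typing import Any, Dict

# Assumptions: default vllm serve port and its OpenAI-compatible completions route.
VLLM_URL = "http://localhost:8000/v1/completions"
MODEL = "RedHatAI/Meta-Llama-3.1-8B-Instruct-quantized.w4a16"


def build_payload(prompt: str, max_tokens: int = 32) -> Dict[str, Any]:
    """Request body for one short completion; temperature 0 keeps replies largely repeatable."""
    return {"model": MODEL, "prompt": prompt, "max_tokens": max_tokens, "temperature": 0}


def summarize(response: Dict[str, Any]) -> str:
    """One-line summary: token counts from the usage block plus the generated text."""
    usage = response.get("usage", {})
    text = response["choices"][0]["text"]
    return f"prompt={usage.get('prompt_tokens')} completion={usage.get('completion_tokens')} text={text!r}"


def complete(prompt: str, timeout_s: float = 300.0) -> Dict[str, Any]:
    """POST one completion request and return the parsed JSON response."""
    body = json.dumps(build_payload(prompt)).encode("utf-8")
    request = urllib.request.Request(VLLM_URL, data=body, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(request, timeout=timeout_s) as response:
        return json.load(response)


if __name__ == "__main__":
    # Usage: python3 vllm_smoke.py "Say hello in five words."
    if len(sys.argv) < 2:
        print("usage: vllm_smoke.py PROMPT")
    else:
        print(summarize(complete(sys.argv[1])))
```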

Warm-up or single-user benchmark:

vllm bench serve \
  --dataset-name random \
  --model RedHatAI/Meta-Llama-3.1-8B-Instruct-quantized.w4a16 \
  --num-prompts 50 \
  --percentile-metrics ttft,tpot,itl,e2el \
  --random-input-len 2048 \
  --random-output-len 128 \
  --max-concurrency 1

Multi-user benchmark:

vllm bench serve \
  --dataset-name random \
  --model RedHatAI/Meta-Llama-3.1-8B-Instruct-quantized.w4a16 \
  --num-prompts 50 \
  --percentile-metrics ttft,tpot,itl,e2el \
  --random-input-len 2048 \
  --random-output-len 128 \
  --max-concurrency 8

Higher concurrency usually raises total throughput while also increasing per-request latency. Record both.

Benchmark 5: simple PyTorch CUDA check

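# torch_cuda_check.py: confirm PyTorch sees the GPU, then time a large FP32 matmul.
import time
from typing import Final

import torch

MATRIX_SIZE: Final[int] = 4096
TIMED_ITERATIONS: Final[int] = 10


def main() -> None:
    print("PyTorch:", torch.__version__)
    print("CUDA available:", torch.cuda.is_available())

    if not torch.cuda.is_available():
        return

    device: Final[torch.device] = torch.device("cuda")
    print("Device:", torch.cuda.get_device_name(device))
    x: torch.Tensor = torch.randn((MATRIX_SIZE, MATRIX_SIZE), device=device)
    y: torch.Tensor = torch.randn((MATRIX_SIZE, MATRIX_SIZE), device=device)

    # Warm-up: the first matmul pays one-time cuBLAS initialization cost.
    z: torch.Tensor = x @ y
    torch.cuda.synchronize()

    # CUDA kernels run asynchronously, so synchronize before stopping the clock.
    start = time.perf_counter()
    for _ in range(TIMED_ITERATIONS):
        z = x @ y
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start

    # An n x n by n x n matmul costs about 2 * n^3 floating-point operations.
    tflops = 2 * MATRIX_SIZE**3 * TIMED_ITERATIONS / elapsed / 1e12
    print("Result shape:", z.shape)
    print(f"Average matmul time: {elapsed / TIMED_ITERATIONS * 1e3:.1f} ms (~{tflops:.2f} TFLOPS)")


if __name__ == "__main__":
    main()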

Run:

python3 torch_cuda_check.py

If CUDA is unavailable, avoid random desktop CUDA packages. Prefer Jetson-specific wheels, containers, or JetPack-compatible package sources.

Result interpretation

For computer vision models, FPS alone is not enough. Also record:

  • input resolution
  • precision
  • preprocessing cost
  • postprocessing cost
  • batch size
  • camera or video I/O overhead
  • power mode
  • thermal behavior

For LLMs, the most useful metrics are:

TTFT: Time to first token
ITL: Inter-token latency
TPOT: Time per output token, excluding the first token
Output tok/s: Generated token throughput
E2E latency: Full request latency
Peak memory: Whether the model fits comfortably
Power: Efficiency during inference
Temperature: Thermal stability during sustained load
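To make the latency definitions concrete, the sketch below derives them from the arrival times of one request's streamed tokens. It follows the common convention, also used in vLLM's benchmark output, that TPOT excludes the first token; the function name and return layout are illustrative.

```python
from typing import Dict, List


def llm_latency_metrics(request_start_s: float, token_times_s: List[float]) -> Dict[str, object]:
    """Derive TTFT, ITL, TPOT, E2E latency, and output tok/s from token arrival timestamps."""
    if not token_times_s:
        raise ValueError("need at least one generated token")
    n_tokens = len(token_times_s)
    ttft = token_times_s[0] - request_start_s
    e2el = token_times_s[-1] - request_start_s
    # ITL: gap between each pair of consecutive tokens (usually reported as a distribution).
    itl = [later - earlier for earlier, later in zip(token_times_s, token_times_s[1:])]
    # TPOT: average decode time per token after the first one.
    tpot = (e2el - ttft) / (n_tokens - 1) if n_tokens > 1 else 0.0
    return {
        "ttft_s": ttft,
        "itl_s": itl,
        "tpot_s": tpot,
        "e2el_s": e2el,
        # Output throughput for this request, including the wait for the first token.
        "output_tok_per_s": n_tokens / e2el,
    }
```

For a single request the mean of the ITL gaps equals TPOT; they diverge once results are aggregated over many requests, because ITL is pooled per gap while TPOT is averaged per request.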

For edge-AI comparisons, report both performance and efficiency:

Performance:
  FPS
  tok/s
  latency
  throughput

Efficiency:
  FPS/W
  tok/s/W
  joules per inference
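The two efficiency views are reciprocals: FPS/W is (inferences/s) divided by (joules/s), i.e. inferences per joule, while joules per inference is average watts divided by inferences per second. A small sketch; state which power figure you used (for example averaged VDD_IN from tegrastats divided by 1000, or wall power), since the two differ:

```python
from typing import Dict


def efficiency(throughput_per_s: float, avg_power_w: float) -> Dict[str, float]:
    """Work per watt and energy per unit of work over the same measurement window.

    throughput_per_s is FPS (vision) or output tok/s (LLM); avg_power_w is average power in watts.
    """
    if throughput_per_s <= 0 or avg_power_w <= 0:
        raise ValueError("throughput and power must both be positive")
    return {
        "per_watt": throughput_per_s / avg_power_w,         # FPS/W or tok/s/W
        "joules_per_unit": avg_power_w / throughput_per_s,  # J per inference or per token
    }
```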

Troubleshooting

JetPack 6.x SD card does not boot

Most likely cause:

Outdated QSPI firmware

Recommended fix:

Boot a JetPack 5.x image first.
Apply the QSPI firmware update.
Then boot JetPack 6.x again.

Display does not work

Use DisplayPort when possible. If HDMI is required, use a known-good DisplayPort-to-HDMI adapter because the board uses DisplayPort, not native HDMI.

Docker GPU runtime fails

Check:

docker --version
sudo docker info | grep -i runtime
cat /etc/nv_tegra_release

Benchmark results look too slow

Check:

sudo nvpmodel -q
sudo jetson_clocks --show
sudo tegrastats

Common causes:

Not running in the intended high-performance mode
jetson_clocks not enabled
Thermal throttling
Slow microSD storage
Model too large for available memory
CPU fallback instead of GPU execution
Incorrect container or package version
Background desktop load

Ollama or vLLM falls back to CPU

Watch tegrastats. If the GPU stays quiet and CPU load spikes, the runtime probably is not using the accelerated path you expected.

Out of memory

Possible fixes:

Use a smaller model
Use a quantized model
Reduce context length
Reduce concurrency
Move caches and models to NVMe
Use swap only as a last-resort stability tool

If swap is used during benchmarking, document it clearly because it can make the results misleading.
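A quick way to document swap is to read /proc/meminfo before and after a run. The sketch below computes swap in use as SwapTotal minus SwapFree (values are in kB, i.e. KiB); note that SwapTotal counts every active swap device, including any zram-based swap.

```python
from pathlib import Path
from typing import Dict


def meminfo_kib(text: str) -> Dict[str, int]:
    """Parse /proc/meminfo lines like 'SwapFree:  1024 kB' into {'SwapFree': 1024}."""
    values: Dict[str, int] = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        parts = rest.split()
        if parts and parts[0].isdigit():
            values[key.strip()] = int(parts[0])
    return values


def swap_used_mib(text: str) -> float:
    """Swap currently in use, in MiB."""
    info = meminfo_kib(text)
    return (info.get("SwapTotal", 0) - info.get("SwapFree", 0)) / 1024


if __name__ == "__main__":
    meminfo = Path("/proc/meminfo")
    if meminfo.exists():
        print(f"Swap in use: {swap_used_mib(meminfo.read_text()):.1f} MiB")
```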

Recommended benchmark report format

# Jetson Orin Nano Benchmark Report

## System

| Item | Value |
| --- | --- |
| Board | Jetson Orin Nano Developer Kit |
| Memory | 8GB LPDDR5 |
| JetPack | TBD |
| L4T / Jetson Linux | TBD |
| Storage | microSD / NVMe |
| Power Mode | MAXN SUPER / 25W / 15W |
| jetson_clocks | Enabled / Disabled |
| Cooling | Stock / modified |
| Ambient Temp | TBD |

## Computer Vision Benchmarks

| Model | Input | Precision | FPS | Avg Power | FPS/W | Peak Temp |
| --- | ---: | --- | ---: | ---: | ---: | ---: |
| ResNet-50 | 224x224 | TBD | TBD | TBD | TBD | TBD |
| YOLO | TBD | TBD | TBD | TBD | TBD | TBD |
| UNet | 256x256 | TBD | TBD | TBD | TBD | TBD |

## LLM Benchmarks

| Model | Runtime | Quantization | Concurrency | Output tok/s | TTFT | ITL | Peak RAM | Peak Temp |
| --- | --- | --- | ---: | ---: | ---: | ---: | ---: | ---: |
| llama3.2:1b | Ollama | default | 1 | TBD | TBD | TBD | TBD | TBD |
| Llama 3.1 8B | vLLM | W4A16 | 1 | TBD | TBD | TBD | TBD | TBD |
| Llama 3.1 8B | vLLM | W4A16 | 8 | TBD | TBD | TBD | TBD | TBD |

Final setup flow

1. Flash microSD card.
2. Update firmware if JetPack 6.x does not boot.
3. Boot JetPack 6.x.
4. Update system packages.
5. Install developer tools.
6. Enable SSH.
7. Configure the desired high-performance mode.
8. Enable jetson_clocks for benchmarks.
9. Verify behavior with tegrastats or jtop.
10. Move models and caches to NVMe if available.
11. Validate Docker GPU runtime.
12. Run jetson_benchmarks.
13. Run an Ollama smoke test.
14. Run a vLLM benchmark for serving metrics.
15. Record latency, throughput, power, and thermal data.

The most important setup requirement on Jetson is software-stack consistency: firmware, JetPack, CUDA, TensorRT, Docker, PyTorch, and model runtime versions need to line up. Once the board is stable, the most useful comparisons are sustained performance per watt under realistic thermal and memory constraints, not just isolated peak FPS or token throughput.
