As you increase the size and complexity of your quantum simulation, you rapidly require a large increase in computational power. This guide describes considerations that can help you choose hardware for your simulation.
Your simulation setup depends on the following:
- Noise; noisy (realistic) simulations require more compute power than noiseless (idealised) simulations.
- Number of qubits.
- Circuit depth; the number of time steps required to perform the circuit.
Quick start
The following graph provides loose guidelines to help you get started with choosing hardware for your simulation. The qubit upper bounds in this chart are not technical limits.
Choose hardware for your simulation
1. Evaluate whether your simulation can be run locally
If you have a modern laptop with at least 8GB of memory, you can run your simulation locally in the following cases:
- Noiseless simulations that use fewer than 29 qubits.
- Noisy simulations that use fewer than 18 qubits.
If you intend to simulate a circuit many times, consider multinode simulation. For more information about multinode simulation see step 5, below.
2. Estimate your memory requirements
You can estimate your memory requirements with the following rule of thumb:
- Noiseless simulation: \( memory\ required = 8 \cdot 2^N bytes \) for an N-qubit circuit
- Noisy simulation: \( memory\ required = 16 \cdot 2^N bytes \) for an N-qubit circuit
In addition to memory size, consider the bandwidth of your memory. qsim performs best when it can use the maximum number of threads. Multi-threaded simulation benefits from high-bandwidth memory (above 100GB/s).
3. Decide between CPUs and GPUs
- GPU hardware starts to outperform CPU hardware significantly (up to 15x faster) for circuits with more than 20 qubits.
- The maximum number of qubits that you can simulate with a GPU is limited by the memory of the GPU. Currently, for a noiseless simulation on an NVIDIA A100 GPU (with 40GB of memory), the maximum number of qubits is 32.
- For noiseless simulations with 32-40 qubits, you can use CPUs. However, the runtime time increases exponentially with the number of qubits, and runtimes are long for simulations above 32 qubits.
The following charts show the runtime for a random circuit run on Google Compute Engine, using an NVidia A100 GPU, and a compute-optimized CPU (c2-standard-4). The first chart shows the runtimes for the noiseless simulation. The second chart shows the runtimes for a noisy simulation, using a phase damping channel (p=0.01). The charts use a log scale.
4. Select a specific machine
After you decide whether you want to use CPUs or GPUs for your simulation, choose a specific machine:
- Restrict your options to machines that meet your memory requirements. For more information about memory requirements, see step 2.
- Decide if performance (speed) or cost is more important to you:
- For a table of performance benchmarks, see Sample benchmarks below.
- For more information about GCP pricing, see the Google Cloud pricing calculator.
- Prioritizing performance is particularly important in the following
scenarios:
- Simulating with a higher f value (f is the maximum number of
qubits allowed per fused gate).
- For small to medium size circuits (up to 22 qubits), keep f low (2 or 3).
- For medium to large size qubits (22+ qubits), use a higher f typically, f=4 is the best option).
- Simulating a deep circuit (depth 30+).
- Simulating with a higher f value (f is the maximum number of
qubits allowed per fused gate).
5. Consider multiple compute nodes
Simulating in multinode mode is useful when your simulation can be parallelized. In a noisy simulation, the trajectories (also known as repetitions, iterations) are “embarrassingly parallelizable”, there is an automated workflow for distributing these trajectories over multiple nodes. A simulation of many noiseless circuits can also be distributed over multiple compute nodes.
For mor information about running a mulitnode simulation, see Multinode quantum simulation using HTCondor on Google Cloud.
Runtime estimates
Runtime grows exponentially with the number of qubits, and linearly with circuit depth beyond 20 qubits.
- For noiseless simulations, runtime grows at a rate of \( 2^N \) for an N-qubit circuit. For more information about runtimes for small circuits, see Additional notes for advanced users below).
- For noisy simulations, runtime grows at a rate of \( 2^N \) multiplied by the number of iterations for an N-qubit circuit.
Additional notes for advanced users
- The impact of noise on simulation depends on:
- What type of errors are included in your noise channel (decoherence, depolarizing channels, coherent errors, readout errors).
- How you can represent your noise model using Kraus operator formalism:
- Performance is best in the case where all Kraus operators are proportional to unitary matrices, such as when using only a depolarizing channel. In this case, memory requirements are equal to noiseless memory requirements (8*2^n bytes).
- Using noise which cannot be represented with Kraus operators proportional to unitary matrices, can slow down simulations by a factor of up to 6** **compared to using a depolarizing channel only
- Noisy simulations are faster with lower noise (when one Kraus operator dominates).
- Experimenting with the 'f' parameter (maximum number of qubits allowed per
fused gate):
- The advanced user is advised to try out multiple f values to optimize
their simulation setup.
- Note that f=2 or f=3 can be optimal for large circuits simulated on CPUs with a smaller number of threads (say, up to four or eight threads). However, this depends on the circuit structure.
- Note that f=6 is very rarely optimal.
- The advanced user is advised to try out multiple f values to optimize
their simulation setup.
- Using the optimal number of threads:
- Use the maximum number of threads on CPUs for the best performance.
- If the maximum number of threads is not used on multi-socket machines then it is advisable to distribute threads evenly to all sockets or to run all threads within a single socket. Separate simulations on each socket can be run simultaneously in the latter case.
- Note that currently the number of CPU threads does not affect the performance for small circuits (smaller than 17 qubits). Only one thread is used because of OpenMP overhead.
- Runtime estimates for small circuits:
- For circuits that contain fewer than 20 qubits, the qsimcirq translation layer performance overhead tends to dominate the runtime estimate. In addition to this, qsim is not optimized for small circuits.
- The total small circuits runtime overhead for an N qubit circuit depends on the circuit depth and on N. The overhead can be large enough to conceal the \( 2^N \) growth in runtime.
Sample benchmarks
Noiseless simulation benchmarks data sheet
For a random circuit, depth=20, f=3, max threads.
processor type | machine | # of qubits | runtime |
CPU | c2-standard-60 | 34 | 291.987 |
CPU | c2-standard-60 | 32 | 54.558 |
CPU | c2-standard-60 | 30 | 13.455 |
CPU | c2-standard-60 | 28 | 2.837 |
CPU | c2-standard-60 | 24 | 0.123 |
CPU | c2-standard-60 | 20 | 0.013 |
CPU | c2-standard-60 | 16 | 0.009 |
CPU | c2-standard-4-4 | 30 | 52.880 |
CPU | c2-standard-4-4 | 28 | 12.814 |
CPU | c2-standard-4-4 | 24 | 0.658 |
CPU | c2-standard-4-4 | 20 | 0.031 |
CPU | c2-standard-4-4 | 16 | 0.008 |
GPU | a100 | 32 | 7.415 |
GPU | a100 | 30 | 1.561 |
GPU | a100 | 28 | 0.384 |
GPU | a100 | 24 | 0.030 |
GPU | a100 | 20 | 0.010 |
GPU | a100 | 16 | 0.007 |
GPU | t4 | 30 | 10.163 |
GPU | t4 | 28 | 2.394 |
GPU | t4 | 24 | 0.118 |
GPU | t4 | 20 | 0.014 |
GPU | t4 | 16 | 0.007 |
Noisy simulation benchmarks data sheet
For one trajectory of a random circuit, depth=20, f=3, max threads.
processor type | machine | noise type | # of qubits | runtime |
CPU | c2-standard-60 | depolarizing | 30 | 13.021 |
CPU | c2-standard-60 | depolarizing | 28 | 2.840 |
CPU | c2-standard-60 | depolarizing | 26 | 0.604 |
CPU | c2-standard-60 | depolarizing | 24 | 0.110 |
CPU | c2-standard-60 | depolarizing | 20 | 0.009 |
CPU | c2-standard-60 | depolarizing | 16 | 0.006 |
CPU | c2-standard-60 | dephasing | 30 | 122.788 |
CPU | c2-standard-60 | dephasing | 28 | 29.966 |
CPU | c2-standard-60 | dephasing | 26 | 6.378 |
CPU | c2-standard-60 | dephasing | 24 | 1.181 |
CPU | c2-standard-60 | dephasing | 20 | 0.045 |
CPU | c2-standard-60 | dephasing | 16 | 0.023 |
CPU | c2-standard-4-4 | depolarizing | 26 | 2.807 |
CPU | c2-standard-4-4 | depolarizing | 24 | 0.631 |
CPU | c2-standard-4-4 | depolarizing | 20 | 0.027 |
CPU | c2-standard-4-4 | depolarizing | 16 | 0.005 |
CPU | c2-standard-4-4 | dephasing | 26 | 33.038 |
CPU | c2-standard-4-4 | dephasing | 24 | 7.432 |
CPU | c2-standard-4-4 | dephasing | 20 | 0.230 |
CPU | c2-standard-4-4 | dephasing | 16 | 0.014 |
GPU | a100 | depolarizing | 30 | 1.568 |
GPU | a100 | depolarizing | 28 | 0.391 |
GPU | a100 | depolarizing | 26 | 0.094 |
GPU | a100 | depolarizing | 24 | 0.026 |
GPU | a100 | depolarizing | 20 | 0.006 |
GPU | a100 | depolarizing | 16 | 0.004 |
GPU | a100 | dephasing | 30 | 17.032 |
GPU | a100 | dephasing | 28 | 3.959 |
GPU | a100 | dephasing | 26 | 0.896 |
GPU | a100 | dephasing | 24 | 0.236 |
GPU | a100 | dephasing | 20 | 0.029 |
GPU | a100 | dephasing | 16 | 0.021 |
GPU | t4 | depolarizing | 30 | 10.229 |
GPU | t4 | depolarizing | 28 | 2.444 |
GPU | t4 | depolarizing | 26 | 0.519 |
GPU | t4 | depolarizing | 24 | 0.115 |
GPU | t4 | depolarizing | 20 | 0.009 |
GPU | t4 | depolarizing | 16 | 0.004 |
GPU | t4 | dephasing | 28 | 21.800 |
GPU | t4 | dephasing | 26 | 5.056 |
GPU | t4 | dephasing | 24 | 1.164 |
GPU | t4 | dephasing | 20 | 0.077 |
GPU | t4 | dephasing | 16 | 0.017 |