Version 4.1 - The Async Quantum Revolution
The 2026 Quantum Breakthrough
From Quantum-Enhanced to Quantum-First
Q-Store v4.1 represents a fundamental architectural transformation - moving from quantum as a helper to quantum as the primary compute engine. With async execution and optimized circuit batching, we achieve a 10-20× speedup over sequential quantum execution.
Core Philosophy: "Make quantum the primary compute, with minimal classical overhead and zero-blocking async execution."
GitHub Discussions: Share your thoughts on the v4.1 async design
What's New in v4.1
| Aspect | v4.0 (Current) | v4.1 (New) | Impact |
|---|---|---|---|
| Architecture | Quantum-enhanced classical | Quantum-first (70% quantum) | 14× more quantum compute |
| Execution | Sequential circuits | Async parallel (10-20×) | Never blocks on I/O |
| Storage | Blocking I/O | Async Zarr + Parquet | Zero-blocking writes |
| Performance vs v4.0 | Baseline | 10-20× faster | Major improvement! |
| Performance vs GPU | Much slower | Still 85-183× slower | GPU wins for speed |
| Layers | Mixed classical/quantum | Pure quantum pipeline | Minimal classical overhead |
| PyTorch | Broken | Fixed + async support | Production-ready |
Note: Real-world test with IonQ Simulator on Cats vs Dogs (1,000 images): 38.1 minutes vs 6-12 seconds for GPU
Key Innovations (Unique to Q-Store v4.1)
- AsyncQuantumExecutor: Non-blocking circuit execution with 10-20× throughput
- Quantum-First Layers: 70% quantum computation (vs 5% in v4.0)
- Zero-Blocking Storage: Async Zarr checkpoints + Parquet metrics
- Local Chip Optimization: Direct quantum chip access eliminates network latency
- Fixed PyTorch Integration: QuantumLayer with proper async support
- Multi-Basis Measurements: Extract more features per circuit execution
The Reality Check: When Quantum Wins
Two Different Comparisons
Comparison 1: v4.1 vs v4.0 (Quantum Internal)
Fashion MNIST Training (1,000 samples):
- Q-Store v4.0 (Sequential): ~45 minutes
- Q-Store v4.1 (Async): ~30-45 seconds (local chip), ~3-4 minutes (cloud IonQ)
Result: 60-90× faster (local) or 10-15× faster (cloud)
What changed?
- v4.0: Submit circuit → wait → result → next circuit (sequential)
- v4.1: Submit 10-20 circuits at once → poll in background (parallel; see the sketch below)
- Local chip: Eliminates 200ms network latency per batch
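The submission pattern behind that change can be illustrated with plain asyncio. This is a minimal sketch under stated assumptions, not Q-Store internals: `submit_circuit` is a hypothetical stand-in for the backend call, and the 2-second sleep stands in for round-trip latency.

```python
import asyncio

async def submit_circuit(circuit):
    """Hypothetical stand-in for a backend call with ~2 s round-trip latency."""
    await asyncio.sleep(2)
    return {"circuit": circuit, "counts": {}}

async def run_sequential(circuits):
    # v4.0 pattern: each circuit waits for the previous one to finish.
    return [await submit_circuit(c) for c in circuits]

async def run_parallel(circuits):
    # v4.1 pattern: submit everything up front so the latencies overlap.
    return await asyncio.gather(*(submit_circuit(c) for c in circuits))

# 20 circuits: ~40 s with run_sequential, ~2 s with run_parallel on this toy model.
results = asyncio.run(run_parallel(list(range(20))))
```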
Comparison 2: Quantum vs Classical GPU (Real Competition)
Fashion MNIST Training (1,000 samples):
- Classical GPU (A100): ~2-3 minutes
- Q-Store v4.1 (Cloud IonQ): ~3-4 minutes (0.7-1.0× slower)
- Q-Store v4.1 (Local IonQ): ~30-45 seconds (3-5× faster)
Why is local quantum faster?
| Factor | Cloud IonQ | Local IonQ | Advantage |
|---|---|---|---|
Architecture Highlights
Quantum-First Layer Pipeline
v4.0 Architecture (5% quantum):
```python
model = Sequential([
    Flatten(),                           # Classical
    Dense(128, activation='relu'),       # Classical (95% compute)
    Dense(64, activation='relu'),        # Classical (95% compute)
    QuantumLayer(n_qubits=4, depth=2),   # Quantum (5% compute)
    Dense(10, activation='softmax')      # Classical (95% compute)
])
# Total: 95% classical, 5% quantum
```
v4.1 Architecture (70% quantum):
```python
model = Sequential([
    Flatten(),                                      # Classical (5%)
    QuantumFeatureExtractor(n_qubits=8, depth=4),   # Quantum (40%)
    QuantumPooling(n_qubits=4),                     # Quantum (15%)
    QuantumFeatureExtractor(n_qubits=4, depth=3),   # Quantum (30%)
    QuantumReadout(n_qubits=4, n_classes=10)        # Quantum (5%)
])
# Total: 30% classical, 70% quantum
```
New Quantum-First Layers
1. QuantumFeatureExtractor
Replaces classical Dense layers with quantum circuits:
```python
from q_store.layers import QuantumFeatureExtractor

layer = QuantumFeatureExtractor(
    n_qubits=8,
    depth=4,
    entanglement='full',                # 'linear', 'full', 'circular'
    measurement_bases=['Z', 'X', 'Y'],  # Multi-basis for rich features
    backend='ionq'
)

# Async forward pass (never blocks!)
features = await layer.call_async(inputs)

# Output dimension: n_qubits × len(measurement_bases)
# Example: 8 qubits × 3 bases = 24 features per sample
```
Key innovations:
- Multi-basis measurements (more information per circuit; see the sketch below)
- Async execution (never blocks on IonQ latency)
- Parallel submission (batch all samples at once)
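To make the multi-basis point concrete, here is a rough sketch of the standard trick, using plain Python lists as a stand-in for Q-Store's circuit type (the gate-tuple representation is assumed for illustration only): X- and Y-basis expectation values are obtained by appending a basis-change rotation and then measuring in Z, so each sample yields n_qubits × n_bases features.

```python
# Hypothetical gate-list representation; gates are (name, qubit) tuples.
BASIS_CHANGE = {
    "Z": [],            # computational basis: measure directly
    "X": ["H"],         # H rotates X eigenstates onto the Z axis
    "Y": ["Sdg", "H"],  # S-dagger then H rotates Y eigenstates onto the Z axis
}

def per_basis_circuits(base_gates, n_qubits=8, bases=("Z", "X", "Y")):
    """Return one measurement variant of the circuit per basis."""
    variants = {}
    for basis in bases:
        suffix = [(gate, q) for q in range(n_qubits) for gate in BASIS_CHANGE[basis]]
        variants[basis] = list(base_gates) + suffix
    return variants

variants = per_basis_circuits(base_gates=[("RY", q) for q in range(8)])
# 8 qubits × 3 bases -> 24 expectation values per sample, matching the example above.
```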
2. QuantumNonlinearity
Quantum-native activation functions:
```python
from q_store.layers import QuantumNonlinearity

layer = QuantumNonlinearity(
    n_qubits=6,
    nonlinearity_type='amplitude_damping',  # or 'phase_damping', 'parametric'
    strength=0.1
)

# Natural quantum nonlinearity - no classical compute!
output = await layer.call_async(inputs)
```
Advantage: natural quantum nonlinearity vs classical ReLU/Tanh.
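For reference, the amplitude_damping option corresponds to the textbook single-qubit amplitude-damping channel. The numpy sketch below only illustrates that math (with gamma playing the role of strength); it is not Q-Store's implementation.

```python
import numpy as np

def amplitude_damping(rho, gamma=0.1):
    """Apply the single-qubit amplitude-damping channel with damping rate gamma."""
    K0 = np.array([[1.0, 0.0], [0.0, np.sqrt(1.0 - gamma)]])
    K1 = np.array([[0.0, np.sqrt(gamma)], [0.0, 0.0]])
    # Kraus form: rho -> K0 rho K0^dagger + K1 rho K1^dagger
    return K0 @ rho @ K0.conj().T + K1 @ rho @ K1.conj().T

# Acting on |+><+|: the channel is non-unitary (irreversible), which is the
# effect the layer relies on in place of a classical ReLU/Tanh.
plus = np.full((2, 2), 0.5)
print(amplitude_damping(plus, gamma=0.1))
```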
3. QuantumPooling
Information-theoretically optimal compression:
```python
from q_store.layers import QuantumPooling

layer = QuantumPooling(
    n_qubits=8,
    pool_size=2,
    pooling_type='partial_trace'  # or 'measurement'
)

# Reduces 8 qubits → 4 qubits
pooled = await layer.call_async(inputs)
```
4. QuantumReadout
Multi-class quantum measurement:
```python
from q_store.layers import QuantumReadout

layer = QuantumReadout(
    n_qubits=4,
    n_classes=10,
    readout_type='computational'
)

# Returns class probabilities via the Born rule
probs = await layer.call_async(features)  # Shape: (batch_size, n_classes)
```
AsyncQuantumExecutor
The Problem: IonQ Latency Kills Performance
Sequential Execution (v4.0):
```python
for sample in batch:
    result = ionq.execute(circuit, sample)  # Wait ~2 s
    # Blocked! Cannot do anything else!
# Total: 32 samples × 2 s = 64 s per batch
```
Async Execution (v4.1):
```python
async def train_batch(batch):
    # Submit ALL circuits at once (non-blocking)
    futures = [
        ionq.execute_async(circuit, sample)
        for sample in batch
    ]

    # Do other work while waiting!
    preprocess_next_batch()
    update_metrics()

    # Await all results
    results = await asyncio.gather(*futures)
    return results

# Total: 32 samples in parallel = ~2-4 s per batch
# Result: 16-32× faster!
```
AsyncQuantumExecutor Features
```python
from q_store.runtime import AsyncQuantumExecutor

executor = AsyncQuantumExecutor(
    backend='ionq',
    max_concurrent=100,  # 100 circuits in flight
    batch_size=20,       # Submit 20 at once
    cache_size=1000      # LRU cache for results
)

# Non-blocking submission
future = await executor.submit(circuit)

# Batch submission
results = await executor.submit_batch(circuits)

# Automatic caching (instant for repeated circuits)
# Background polling (never blocks)
# Connection pooling (better utilization)
```
Zero-Blocking Storage Architecture
The Problem: Storage I/O Blocks Training
v4.0 (blocking):
```python
for batch in training:
    loss = train_batch(batch)

    # BLOCKS training loop!
    save_checkpoint(model)  # ~500 ms
    log_metrics(loss)       # ~100 ms

# Lost 600 ms per batch to I/O
```
v4.1 (async):
```python
async def train():
    for batch in training:
        loss = await train_batch(batch)

        # Fire-and-forget (never blocks!)
        metrics_logger.log(loss)              # 0 ms blocking
        await checkpoint_manager.save(model)  # Async in background

# Zero blocking on I/O!
```
Storage Stack
```python
from q_store.storage import (
    AsyncMetricsLogger,
    CheckpointManager,
    AsyncBuffer
)

# Async Parquet metrics (never blocks)
metrics = AsyncMetricsLogger(
    output_path='experiments/run_001/metrics.parquet',
    buffer_size=1000
)

await metrics.log(TrainingMetrics(
    epoch=1,
    step=100,
    train_loss=0.342,
    circuit_execution_time_ms=107,
    cost_usd=0.05
))

# Async Zarr checkpoints (compressed)
checkpoints = CheckpointManager(
    checkpoint_dir='experiments/run_001/checkpoints'
)

await checkpoints.save(
    epoch=10,
    model_state=model.state_dict(),
    optimizer_state=optimizer.state_dict()
)
```
Storage hierarchy:
- L1 - In-Memory: Model parameters, gradients (O(1) ns access)
- L2 - Async Buffer: Pending writes (O(1) μs access; sketched below)
- L3 - Zarr Checkpoints: Model state (async write, ms latency)
- L4 - Parquet Metrics: Training telemetry (async append, ms latency)
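The L2 tier is what keeps the training loop from ever waiting on disk. As a rough, stdlib-only illustration of that fire-and-forget pattern (this is not the AsyncMetricsLogger implementation; the JSON-lines write stands in for the Parquet append):

```python
import asyncio
import json

class BufferedLogger:
    """Sketch of an L2 buffer: log() returns immediately, a background task drains to disk."""

    def __init__(self, path, buffer_size=1000):
        # Must be constructed inside a running event loop (create_task requires one).
        self.path = path
        self.queue = asyncio.Queue(maxsize=buffer_size)
        self._drainer = asyncio.create_task(self._drain())

    def log(self, record: dict):
        # Non-blocking from the training loop's point of view; raises if the buffer is full.
        self.queue.put_nowait(record)

    async def _drain(self):
        with open(self.path, "a") as f:
            while True:
                record = await self.queue.get()
                f.write(json.dumps(record) + "\n")
                f.flush()
                self.queue.task_done()
```

A full buffer raising immediately, rather than silently blocking, is one simple way to surface back-pressure to the training loop.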
Fixed PyTorch Integration
The Problem in v4.0
```python
# v4.0: Broken!
from q_store.torch import QuantumLayer

layer = QuantumLayer(n_qubits=4, depth=2)
print(layer.n_parameters)  # AttributeError!
```
The Solution in v4.1
```python
# v4.1: Fixed + async!
from q_store.torch import QuantumLayer
import torch
import torch.nn as nn

class HybridQNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.quantum = QuantumLayer(
            n_qubits=8,
            depth=4,
            backend='ionq'
        )
        self.output = nn.Linear(24, 10)  # 8 qubits × 3 bases = 24 features

    def forward(self, x):
        # Async execution with autograd support
        x = self.quantum(x)
        return self.output(x)

# Standard PyTorch training
model = HybridQNN()
optimizer = torch.optim.Adam(model.parameters())
criterion = nn.CrossEntropyLoss()

for epoch in range(10):
    for batch_x, batch_y in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(batch_x), batch_y)
        loss.backward()  # Quantum gradients computed via SPSA
        optimizer.step()
```
What's fixed:
- n_parameters property now works
- Async execution integrated with PyTorch autograd
- Proper gradient estimation via SPSA (see the sketch below)
- GPU tensor support (CUDA)
- DistributedDataParallel compatibility
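SPSA is a natural fit here because it needs only two loss evaluations per gradient estimate regardless of how many circuit parameters there are, and each evaluation means real circuit executions. A minimal numpy sketch of the textbook estimator (Q-Store's exact estimator and hyperparameters may differ):

```python
import numpy as np

def spsa_gradient(loss_fn, theta, c=0.1, rng=None):
    """One SPSA gradient estimate: two loss evaluations, any parameter dimension."""
    rng = rng or np.random.default_rng()
    delta = rng.choice([-1.0, 1.0], size=theta.shape)  # Rademacher perturbation
    loss_plus = loss_fn(theta + c * delta)
    loss_minus = loss_fn(theta - c * delta)
    # Since each delta_i is +/-1, dividing by delta_i equals multiplying by it.
    return (loss_plus - loss_minus) / (2.0 * c) * delta

# Quadratic stand-in for an expensive quantum loss function:
theta = np.zeros(8)
grad = spsa_gradient(lambda t: float(np.sum((t - 1.0) ** 2)), theta)
```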
When to Use Q-Store v4.1
Excellent Use Cases
1. Quantum ML Research
- Testing quantum ML algorithms and architectures
- Publishing papers on quantum machine learning
- Algorithm development with free IonQ simulator
- Exploring quantum advantage in specific domains
- Benchmarking quantum vs classical approaches
2. Small Dataset Problems
- <1,000 training samples where data is expensive
- Non-convex optimization landscapes
- Problems where classical gets stuck in local minima
- Better loss landscape exploration via quantum tunneling
- Comparable accuracy (58% vs 60% classical on quick tests)
3. Educational Applications
- Teaching quantum machine learning concepts
- University courses on quantum computing
- Hands-on quantum circuit design
- Understanding quantum-classical hybrid systems
4. Algorithm Prototyping
- Cost-free experimentation with quantum circuits
- Testing new quantum layer architectures
- Validating quantum ML hypotheses
- Zero cost with IonQ simulator
Not Recommended For
1. Production Training at Scale
- Large datasets (>1K samples)
- Time-critical applications
- Use classical GPUs: 183-457× faster, $0.01 vs $1,152+ cost
- Real QPU costs: $1,152 (Aria) to $4,480 (Forte) per run
2. Speed-Critical Applications
- Real-time inference
- High-throughput services (>100 req/s)
- GPU training: 7.5s vs 38 minutes for 1K images
- Network latency dominates (55% of execution time)
3. Cost-Sensitive Deployments
- Budget-constrained projects
- GPU cost: $0.01 per training run
- Quantum cost: $1,152-$4,480 per run on real QPU
- Simulator is free but 183-457× slower than GPU
Honest Performance Table (Real Test Data)
Test: Cats vs Dogs (1,000 images, 5 epochs, 180×180×3 RGB)
| Metric | NVIDIA H100 | NVIDIA A100 | NVIDIA V100 | IonQ Cloud (v4.1.1) | Winner |
|---|---|---|---|---|---|
| Training Time | 5s | 7.5s | 12.5s | 2,288s (38.1 min) | GPU (457×) |
| Time per Epoch | 1.0s | 1.5s | 2.5s | 457s (7.6 min) | GPU (305×) |
| Samples/Second | 40 | 26.7 | 16 | 0.35 | GPU (114×) |
| Cost per Run | $0.009 | $0.010 | $0.012 | $0 (simulator) | Quantum (free) |
| Cost (Real QPU) | $0.009 | $0.010 | $0.012 | $1,152-$4,480 | GPU (115,200×) |
| Energy | 700W | 400W | 300W | 50-80W | Quantum (5×) |
| Accuracy | 60-70% | 60-70% | 60-70% | 58.5% | Comparable |
| Loss Exploration | Local optima | Local optima | Local optima | Better | Quantum |
| Production Ready | Yes | Yes | Yes | Research only | GPU |
Performance Optimizations
1. Adaptive Batch Scheduler
```python
from q_store.runtime import AdaptiveBatchScheduler

scheduler = AdaptiveBatchScheduler(
    min_batch_size=10,
    max_batch_size=100,
    target_latency_ms=5000
)

# Adjusts batch size based on:
# - Queue depth
# - Circuit complexity
# - Historical latency
batch_size = scheduler.get_batch_size(
    queue_depth=15,
    circuit_complexity=50
)
```
2. Multi-Level Caching
```python
from q_store.runtime import MultiLevelCache

cache = MultiLevelCache()

# L1: Hot parameters (100 entries, <1 ms)
# L2: Compiled circuits (1,000 entries, ~10 ms)
# L3: Results (10,000 entries, ~100 ms)

result = cache.get_result(circuit_hash, params_hash)

# Cache statistics
stats = cache.stats()
print(f"Total hit rate: {stats['total_hit_rate']:.2%}")
```
3. Native Gate Compilation
```python
from q_store.compiler import IonQNativeCompiler

compiler = IonQNativeCompiler()

# Compile to IonQ native gates (30% speedup!)
native_circuit = compiler.compile(circuit)

# Native gates: GPi(φ), GPi2(φ), MS(φ)
# vs universal gates: RY, RZ, CNOT
```
Migration from v4.0
What's Compatible
- All v4.0 verification APIs (circuit equivalence, properties, formal)
- All v4.0 profiling APIs (circuit profiler, performance analyzer)
- All v4.0 visualization APIs (circuit diagrams, Bloch sphere)
- TensorFlow integration (just add async support)
- Backend configurations
What's New in v4.1
```python
# v4.0: Sequential execution
from q_store.tf import QuantumLayer

layer = QuantumLayer(n_qubits=4)
output = layer(inputs)  # Blocks until done

# v4.1: Async execution (recommended!)
from q_store.layers import QuantumFeatureExtractor

layer = QuantumFeatureExtractor(n_qubits=8, depth=4)
output = await layer.call_async(inputs)  # Non-blocking!

# v4.1 also supports a synchronous API for compatibility
output = layer.call_sync(inputs)  # Works, but slower
```
Migration Checklist
- Update to async training loops (optional, but 10-20× faster than v4.0)
- Replace classical Dense layers with QuantumFeatureExtractor (14× more quantum)
- Switch to AsyncMetricsLogger for storage (zero blocking)
- Enable CheckpointManager for Zarr checkpoints (compressed, async)
- For PyTorch: Update to fixed QuantumLayer (n_parameters works now!)
- Understand performance: GPU is faster for production, quantum excels at research
- Test with async executor (max_concurrent=100 recommended)
Installation
```bash
# Install Q-Store v4.1.1 (January 2026)
pip install q-store==4.1.1

# With async support
pip install q-store[async]==4.1.1

# Full installation (all backends)
pip install q-store[all]==4.1.1

# Development installation
git clone https://github.com/yucelz/q-store
cd q-store
pip install -e ".[dev,async]"
```
Quick Start Example
```python
import asyncio
from q_store.layers import (
    QuantumFeatureExtractor,
    QuantumPooling,
    QuantumReadout
)
from q_store.runtime import AsyncQuantumExecutor
from q_store.storage import AsyncMetricsLogger, CheckpointManager

async def train_quantum_model():
    # Build quantum-first model (70% quantum!)
    model = Sequential([
        Flatten(),
        QuantumFeatureExtractor(n_qubits=8, depth=4, backend='ionq'),
        QuantumPooling(n_qubits=4),
        QuantumFeatureExtractor(n_qubits=4, depth=3),
        QuantumReadout(n_qubits=4, n_classes=10)
    ])

    # Setup async storage (never blocks!)
    metrics = AsyncMetricsLogger('experiments/run_001/metrics.parquet')
    checkpoints = CheckpointManager('experiments/run_001/checkpoints')

    # Async training loop
    for epoch in range(10):
        for batch_x, batch_y in train_loader:
            # Forward pass (async, non-blocking)
            predictions = await model.forward_async(batch_x)

            # Loss & gradients
            loss = criterion(predictions, batch_y)
            gradients = await model.backward_async(loss)

            # Optimizer step
            optimizer.step(gradients)

            # Log metrics (async, never blocks!)
            await metrics.log(TrainingMetrics(
                epoch=epoch,
                loss=loss.item(),
                circuit_execution_time_ms=107
            ))

        # Checkpoint (async, compressed)
        if epoch % 10 == 0:
            await checkpoints.save(epoch, model.state_dict())

    print("Training complete! 10-20× faster than v4.0")
    print("Note: Classical GPUs are still 183-457× faster for large-scale training")

# Run async training
asyncio.run(train_quantum_model())
```
Real Performance Report
For detailed performance analysis, including:
- Real-world benchmark results (Cats vs Dogs dataset)
- Network latency analysis (55% overhead)
- Cost comparison (GPU vs QPU: $0.01 vs $1,152)
- Bottleneck identification and optimization recommendations
- When quantum makes sense vs when to use classical
See the full Q-Store v4.1.1 Performance Report
Ready to explore quantum ML? Star us on GitHub and join the quantum research community!