Chapter 10: Deployment, Operations, and Best Practices
Learning Objectives
Plan a phased Silver Peak SD-WAN deployment from pilot to full rollout.
Apply day-2 operational best practices for monitoring, upgrades, and troubleshooting.
Summarize how architecture, models, licensing, and management come together in a real-world enterprise design.
Pre-Reading Check — Deployment Lifecycle
1. An organization wants to "go SD-WAN" but skips measuring its current WAN's latency, loss, jitter, and ticket volume. What capability does it lose most directly?
2. Which set of pilot sites best reflects the recommended selection criteria?
3. Why is MPLS-to-hybrid migration framed as "coexistence then transition" rather than a flag-day cutover?
4. During the parallel-run phase, what mechanisms protect critical-app performance while traffic is shifted onto SD-WAN paths?
5. In the retail cutover example, the technician powers on the appliance, confirms both underlays are green, and runs a four-item checklist. What is the LAST thing they do, and why?
1. Deployment Lifecycle
Key Points
An EdgeConnect deployment is a five-phase lifecycle — discovery/design, lab/PoC, pilot, phased rollout, optimization — not a single flag-day cutover.
Design must be requirements-driven and must baseline the WAN (latency, jitter, loss, ticket volume) so success can be proven with evidence.
Pilot on 3–10 representative (not easiest) sites with diverse connectivity and application mix; refine templates, Business Intent Overlays (BIOs), Zero-Touch Provisioning (ZTP) workflows, and runbooks there.
Migrate MPLS→hybrid as coexistence then transition: run transports in parallel, shift critical apps gradually using path selection, QoS, and FEC.
Never cut over a wave without a documented rollback to legacy routing and DNS, with a named go/no-go owner.
A Silver Peak (now Aruba) EdgeConnect deployment is not a single cutover event. It is a deployment lifecycle: a structured sequence that moves a network from discovery and design, through lab validation and a tightly scoped pilot, to a controlled phased rollout, and finally into steady-state optimization. Treating deployment as a lifecycle rather than a flag-day swap is one of the most reliable ways to keep a project low-drama. Think of it like renovating a house you still live in: you plan, test materials in one room, keep the kitchen working while you redo the bathroom, and always know how to put a wall back.
Design and Pilot Phase
The lifecycle has five recognizable phases. The early phases are about learning and validating; the later phases are about repeating at scale.
| Phase | Purpose | Key activities |
|---|---|---|
| 1. Discovery & design | Understand the environment | Inventory sites, circuits, apps; classify apps by criticality; baseline current WAN (latency, jitter, loss, ticket volume) |
| 2. Lab / PoC | Validate technology in isolation | Stand up Orchestrator + a few appliances; test routing, overlays, security integration, firewall interop |
| 3. Pilot | Validate in production reality | Deploy to 3–10 representative sites; test app performance, failover, policy, runbooks |
| 4. Phased rollout | Industrialize | Roll out in waves using templates and ZTP; review metrics after each wave |
| 5. Optimization | Steady state | Tune Business Intent Overlays, path thresholds, security posture; iterate on operations |
The design phase must be driven by requirements, not technology preference. Before choosing any appliance model, inventory sites, circuits, applications, and performance/security requirements: latency budgets, loss tolerance, availability targets, and regulatory constraints. Classify applications by business criticality and performance sensitivity, because that classification becomes the raw material for your Business Intent Overlays (Chapter 5). Then decide the underlay strategy: keep MPLS; move to Internet-based transports (dedicated Internet access (DIA), broadband, and LTE); or run a hybrid of the two.
Critically, you must baseline the current WAN during design. Capture today's latency, jitter, packet loss, application response times, and ticket volume. Without a baseline you cannot prove the project succeeded. "Voice sounds better now" is an opinion; "mean opinion score rose from 3.6 to 4.2 and failover events dropped 80%" is evidence.
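As a rough illustration of turning a baseline into evidence, the sketch below compares a pre-project snapshot with a post-rollout one and reports the signed percent change per metric. The `WanSnapshot` schema and every number except the MOS values (3.6 and 4.2) and the 80% failover reduction are assumptions made up for the example; they are not measurements or an Orchestrator export format.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class WanSnapshot:
    """One measurement period for the WAN (hypothetical schema)."""
    latency_ms: float
    jitter_ms: float
    loss_pct: float
    voice_mos: float
    failover_events: int
    tickets: int


def percent_change(before: float, after: float) -> float:
    """Signed percent change; a negative value means a reduction."""
    if before == 0:
        raise ValueError("baseline value is zero; percent change is undefined")
    return (after - before) / before * 100.0


def compare(baseline: WanSnapshot, current: WanSnapshot) -> dict:
    """Percent change for every metric present in the snapshot."""
    return {
        f.name: percent_change(getattr(baseline, f.name), getattr(current, f.name))
        for f in fields(baseline)
    }


# Illustrative values only (assumed), apart from the MOS figures from the text.
baseline = WanSnapshot(latency_ms=60.0, jitter_ms=12.0, loss_pct=1.0,
                       voice_mos=3.6, failover_events=50, tickets=40)
current = WanSnapshot(latency_ms=45.0, jitter_ms=6.0, loss_pct=0.25,
                      voice_mos=4.2, failover_events=10, tickets=22)

report = compare(baseline, current)
for metric, change in report.items():
    print(f"{metric:16s} {change:+.1f}%")
```

The point of the sketch is the shape of the argument: every success claim is a delta against a number captured before the project started.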
Figure 10.1 (animated): The five-phase EdgeConnect deployment lifecycle — broad discovery, narrowing to a pilot, then widening rollout waves into steady-state operations. Revealed phase by phase.
The pilot phase is where design assumptions meet reality. Choose 3–10 sites that collectively mirror your environment rather than your easiest sites: low-to-medium business risk but real usage; diverse connectivity (MPLS+DIA, broadband-only, LTE backup); application-mix coverage (a SaaS-heavy site and a latency-sensitive voice/VDI/EMR site); and local champions who can validate user experience. Pilot objectives should be explicit and measurable, tied to the baseline — for example, "improve voice MOS and reduce failover events." The pilot is where you refine the artifacts you will mass-produce: device templates, Business Intent Overlays, Zero-Touch Provisioning workflows, and runbooks.
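One way to keep pilot selection honest is to check a candidate list against the criteria above before committing to it. The following is a minimal sketch; the site records, the label strings such as `"mpls+dia"`, and the function name `pilot_gaps` are all hypothetical and would need to match however your own inventory is stored.

```python
# Coverage the pilot should achieve, using labels assumed for this example.
REQUIRED_CONNECTIVITY = {"mpls+dia", "broadband-only", "lte-backup"}
REQUIRED_APP_PROFILES = {"saas-heavy", "latency-sensitive"}


def pilot_gaps(sites):
    """Return a list of reasons the candidate set is not yet representative."""
    gaps = []
    if not 3 <= len(sites) <= 10:
        gaps.append(f"site count {len(sites)} is outside the 3-10 range")

    connectivity = {s["connectivity"] for s in sites}
    profiles = {p for s in sites for p in s["app_profiles"]}

    for missing in sorted(REQUIRED_CONNECTIVITY - connectivity):
        gaps.append(f"no site with connectivity '{missing}'")
    for missing in sorted(REQUIRED_APP_PROFILES - profiles):
        gaps.append(f"no site with app profile '{missing}'")

    without_champion = [s["name"] for s in sites if not s["has_champion"]]
    if without_champion:
        gaps.append("no local champion at: " + ", ".join(without_champion))
    return gaps


candidates = [
    {"name": "A", "connectivity": "mpls+dia", "app_profiles": ["saas-heavy"], "has_champion": True},
    {"name": "B", "connectivity": "broadband-only", "app_profiles": ["latency-sensitive"], "has_champion": True},
    {"name": "C", "connectivity": "mpls+dia", "app_profiles": ["saas-heavy"], "has_champion": False},
]

gaps = pilot_gaps(candidates)
for gap in gaps:
    print(gap)
```

Here the candidate set looks plausible at a glance but has no LTE-backup site and one site without a champion, which is exactly the kind of gap that an "easiest sites" pilot tends to hide.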
Phased Migration from MPLS to Hybrid WAN
The migration from legacy MPLS to a hybrid WAN — one using MPLS, broadband, and/or LTE simultaneously as underlays — should be treated as coexistence then transition, never an abrupt cutover. Legacy and new transports run in parallel while traffic is gradually shifted. The typical pattern unfolds in four steps:
1. Transport audit & underlay readiness. Validate existing MPLS, DIA, broadband, and LTE circuits; order new circuits early, because ISP lead times are often the longest pole in the tent.
2. Overlay standup. Deploy EdgeConnect over the hybrid underlay, build overlays and BIOs, and steer low-risk traffic across SD-WAN while critical traffic stays on MPLS.
3. Parallel run / gradual traffic shift. Move increasingly critical apps from MPLS to SD-WAN as performance is validated. Dynamic path selection, QoS, and forward error correction (FEC) help protect critical-app performance during the shift, but they do not replace a fallback: keep a per-site fall-back (revert default route to MPLS) active throughout.
4. MPLS optimization or decommission. Retain a reduced MPLS footprint where strict SLAs or regulatory isolation justify it; elsewhere remove MPLS once operations are stable, update routing, and renegotiate contracts.
Cutover and Rollback Planning
Every site wave needs a documented rollback procedure before cutover begins. The core of any EdgeConnect rollback is simple: know how to return default routing and DNS to the legacy WAN. Because MPLS remains in place during coexistence, rollback usually means re-pointing the default route back to the legacy edge router — not re-cabling. A practical per-wave cutover package contains a design/migration plan and site diagram, a test/validation checklist (reachability, hard and soft failover, monitoring), and rollback steps with a clear go/no-go decision point owned by a named person.
Example: A retail chain cutting over 40 stores schedules each for a 30-minute window after closing. The technician powers on the pre-staged EdgeConnect, confirms both underlays come up green in Orchestrator, runs a four-item checklist (POS reachable, voice MOS acceptable, SaaS breakout working, failover to MPLS tested by disabling broadband), and only then removes the temporary route back to the old router. That route is removed last because it is the rollback path: if any check fails, the documented rollback re-enables the MPLS default route and the store is unaffected at open the next morning.
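The go/no-go logic in the retail example can be sketched as a small function: run every check, and commit the cutover only if all of them pass. The probes below are stand-in lambdas with hard-coded results; in practice each would be a real reachability or quality test, and the names used here are assumptions for illustration.

```python
from typing import Callable, List, Tuple

# A check is a label plus a zero-argument probe that returns True on success.
Check = Tuple[str, Callable[[], bool]]


def run_cutover(checks: List[Check]) -> Tuple[str, List[str]]:
    """Run all checks; return ("commit" or "rollback", names of failed checks)."""
    failed = [name for name, probe in checks if not probe()]
    decision = "rollback" if failed else "commit"
    return decision, failed


# Stand-in probes (assumed results); the four items mirror the store checklist.
store_checks = [
    ("POS reachable", lambda: True),
    ("Voice MOS acceptable", lambda: True),
    ("SaaS breakout working", lambda: True),
    ("Failover to MPLS with broadband disabled", lambda: False),
]

decision, failed = run_cutover(store_checks)
print(decision, failed)
```

All checks run even after the first failure, so the rollback ticket records everything that went wrong in one pass. Removing the temporary route to the old router belongs only on the "commit" branch.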
Key Takeaway
Five-phase lifecycle (design → lab → pilot → rollout → optimize). Baseline first, pilot on representative sites, migrate as coexistence-then-transition, and never cut a wave without a documented rollback to legacy routing and DNS.
Pre-Reading Check — Day-2 Operations
1. The two firm rules of an EdgeConnect software upgrade are "Orchestrator first, then appliances" and "appliances in waves." Why must Orchestrator go first?
2. An engineer sees that all tunnels from one site are down, while every other site is healthy. Where should triage begin?
3. A tunnel that worked fine yesterday goes down immediately after an ECOS upgrade, showing no IKE problem on the upgraded peer's logs but a failure to establish. What is the most likely cause?
4. Users report a branch's voice quality is poor. Orchestrator's path view shows the Internet path at 5–10% loss and MPLS at 0.1% loss but higher latency, while the tunnel itself stays up. Which response best matches the chapter's guidance?
5. A capacity review finds an appliance sized exactly to last year's peak bandwidth, now running IPS, Boost, and segmentation. Why is this risky even though raw bandwidth still fits?
2. Day-2 Operations
Key Points
Day-2 work is centralized in Orchestrator — most operations happen there, not by SSH-ing into routers.
Upgrades: Orchestrator first, then appliances in waves, least-critical first, with a backup/snapshot rollback and HA pairs upgraded one appliance at a time. Freeze config during a wave.
Troubleshoot by splitting overlay from underlay: all tunnels down → underlay/site; some overlays down → overlay/policy. Fix underlay reachability before touching tunnel config.
Path issues (tunnel up, but loss/latency/jitter) are diagnosed via per-WAN-label performance and remediated by biasing overlays, enabling FEC/packet duplication, and relaxing overly strict SLA thresholds.
Review capacity and licenses quarterly; size to peak + 20–30% headroom and account for the CPU cost of enabled features.
Once sites are live, work shifts from project mode to day-2 operations — the ongoing tasks of monitoring, upgrading, troubleshooting, and capacity-managing a running SD-WAN. The architecture makes this far more centralized than legacy WAN operations: most day-2 work happens in Orchestrator rather than by SSH-ing into individual routers.
Software Upgrade Strategy via Orchestrator
A software upgrade follows two firm rules: "Orchestrator first, then appliances" and "appliances in waves." Before touching any version, complete the pre-upgrade checklist. Check compatibility: does the target Orchestrator support your current ECOS versions, and do the release notes list schema changes or deprecations? Back up Orchestrator: export config and database, take a VM snapshot, and confirm integrity. Finally, set a change window during which policy changes are frozen.
Appliance upgrades are staged and pushed from Orchestrator: upload the new ECOS image per platform, create upgrade groups by region/role/impact, then schedule waves during low-traffic hours, ordered least-to-most critical. Upgrade non-critical sites → smaller branches → central hubs/DCs. For HA hubs, upgrade one appliance at a time: fail traffic over, upgrade the first, confirm health, then upgrade its partner. Avoid config changes during a wave — keep a single variable in flight.
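The ordering rules above (least-critical first, HA partners never in the same wave) can be sketched as a simple planner. This is an illustration only: the `Appliance` record, the three role labels, and `plan_waves` are hypothetical, and Orchestrator's own upgrade-group scheduling is not modeled here.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import List, Optional

# Lower rank is upgraded earlier (assumed role labels and ranking).
ROLE_ORDER = {"non-critical": 0, "branch": 1, "hub": 2}


@dataclass(frozen=True)
class Appliance:
    name: str
    role: str                       # must be a key of ROLE_ORDER
    ha_group: Optional[str] = None  # appliances sharing a value are HA partners


def plan_waves(appliances: List[Appliance]) -> List[List[str]]:
    """Order waves least-critical first and split HA partners across waves."""
    waves = []
    for role in sorted(ROLE_ORDER, key=ROLE_ORDER.get):
        tier = [a for a in appliances if a.role == role]
        standalone = [a.name for a in tier if a.ha_group is None]
        groups = defaultdict(list)
        for a in tier:
            if a.ha_group is not None:
                groups[a.ha_group].append(a.name)
        depth = max((len(members) for members in groups.values()), default=0)
        for i in range(max(depth, 1 if standalone else 0)):
            # Standalone appliances go in the tier's first wave; the i-th
            # member of each HA group goes in the tier's i-th wave.
            wave = list(standalone) if i == 0 else []
            wave += [members[i] for members in groups.values() if i < len(members)]
            if wave:
                waves.append(wave)
    return waves


fleet = [
    Appliance("dc1-a", "hub", "dc1"),
    Appliance("dc1-b", "hub", "dc1"),
    Appliance("lab-1", "non-critical"),
    Appliance("store-17", "branch"),
    Appliance("store-42", "branch"),
]

waves = plan_waves(fleet)
for number, wave in enumerate(waves, start=1):
    print(number, wave)
```

A real plan would also gate each wave on the health check of the previous one, so that a bad first wave stops the second.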
| Upgrade principle | Why it matters |
|---|---|
| Orchestrator first | Management plane must support the appliance versions it manages |
| Back up before upgrading | Snapshot/config export is the rollback path |
| Waves, least-critical first | Limits blast radius; a bad wave 1 stops wave 2 |
| HA pair: one at a time | Maintains a forwarding path throughout |
| Freeze config during a wave | Isolates upgrade as the only change variable |
After each wave, verify per-appliance that the device is Online with the new version, tunnels are up, and no new alarms appear. If tunnel issues appear only after an upgrade, the usual suspects are a new default behavior (changed cipher suite, path-SLA default, NAT detection) or a template inconsistency where a change did not reach every site.
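The per-appliance verification can be expressed as a small gate that must return no problems before the next wave starts. The record fields and the placeholder version strings (`"new-release"`, `"old-release"`) are assumptions; how you pull this state out of Orchestrator is deliberately left out of the sketch.

```python
def wave_problems(appliances, target_version, alarms_before):
    """Return human-readable problems; an empty list means the wave passed."""
    problems = []
    for a in appliances:
        name = a["name"]
        if not a["online"]:
            problems.append(f"{name}: not Online")
        if a["version"] != target_version:
            problems.append(f"{name}: running {a['version']}, expected {target_version}")
        if a["tunnels_up"] < a["tunnels_total"]:
            problems.append(f"{name}: {a['tunnels_up']}/{a['tunnels_total']} tunnels up")
        # Only alarms that were absent before the wave count against it.
        new_alarms = set(a["alarms"]) - alarms_before.get(name, set())
        if new_alarms:
            problems.append(f"{name}: new alarms {sorted(new_alarms)}")
    return problems


# Hypothetical post-wave state for two stores.
wave = [
    {"name": "store-17", "online": True, "version": "new-release",
     "tunnels_up": 4, "tunnels_total": 4, "alarms": {"license-expiring"}},
    {"name": "store-42", "online": True, "version": "old-release",
     "tunnels_up": 3, "tunnels_total": 4, "alarms": {"tunnel-down"}},
]
alarms_before = {"store-17": {"license-expiring"}, "store-42": set()}

problems = wave_problems(wave, "new-release", alarms_before)
for problem in problems:
    print(problem)
```

Comparing against the pre-wave alarm set matters: an alarm that already existed before the upgrade should not block the next wave, while a new one should.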
Troubleshooting Tunnels and Path Issues
Effective troubleshooting starts with one decision: is this an overlay problem (tunnel config or crypto) or an underlay problem (transport, routing, or path quality)? A fast diagnostic shortcut answers most of it: if all tunnels from a site are down it is an underlay/site problem; if only some overlays are down it is an overlay/policy problem. Verify underlay reachability first (ping the peer WAN IP, check the WAN interface, routing, and UDP 500/4500 through any firewall/NAT) before touching any tunnel configuration.
| Symptom | Likely cause |
|---|---|
| Phase 1 up, Phase 2 down | Transform-set / PFS mismatch, or proxy-ID mismatch |
| No IKE negotiation at all | Firewall or port issue (UDP 500/4500 blocked) |
| Worked before, failed after upgrade | New default cipher unsupported by older peer; PSK/cert not on both ends |
Figure 10.2 (animated): Tunnel-down troubleshooting decision flow. Split overlay from underlay, fix underlay reachability first, then walk IKE Phase 1 / Phase 2 state. The path to root cause highlights step by step.
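The triage shortcut and the symptom list above can be captured in a few lines. This is a sketch of the decision logic only; the function name, the dictionary keys, and the wording of the returned hints are invented for the example and do not correspond to any Orchestrator output.

```python
def triage(tunnels_down: int, tunnels_total: int) -> str:
    """Apply the all-versus-some shortcut to one site's tunnel counts."""
    if tunnels_total <= 0 or not 0 <= tunnels_down <= tunnels_total:
        raise ValueError("tunnel counts are inconsistent")
    if tunnels_down == 0:
        return "no outage: treat as a path-quality question"
    if tunnels_down == tunnels_total:
        return "underlay/site: check WAN interface, routing, peer reachability, UDP 500/4500"
    return "overlay/policy: check the configuration of the affected overlays"


# Symptom-to-cause hints from the table, keyed by hypothetical symptom labels.
LIKELY_CAUSE = {
    "phase1_up_phase2_down": "Transform-set / PFS mismatch, or proxy-ID mismatch",
    "no_ike_negotiation": "Firewall or port issue (UDP 500/4500 blocked)",
    "failed_after_upgrade": "New default cipher unsupported by older peer; PSK/cert not on both ends",
}

print(triage(4, 4))
print(triage(1, 4))
print(LIKELY_CAUSE["phase1_up_phase2_down"])
```

Only once `triage` points away from the underlay does it make sense to consult the IKE symptom hints.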
Path issues are a distinct category: the tunnel stays up, but traffic suffers loss, latency, jitter, or flapping on a specific transport. Diagnose via Orchestrator's per-WAN-label path performance view (loss, latency, jitter, MOS, SLA status). Remediate by biasing the overlay toward a healthier transport, enabling FEC or packet duplication for real-time flows on a lossy Internet path, and raising an ISP ticket with the loss graphs as evidence. Watch for overly strict SLA thresholds, which cause constant path flapping on normal Internet links. A related complaint — "the tunnel is fine but the app is slow" — often points to Boost, which requires an active license on both ends and a symmetric path.
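To see why an overly strict SLA threshold causes flapping, it helps to count how often a path crosses the threshold for the same set of measurements. The sketch below does that for packet loss; the sample values and both thresholds are made-up illustrations, not recommended settings.

```python
def sla_flaps(loss_samples_pct, threshold_pct):
    """Count transitions between 'meets SLA' and 'violates SLA'."""
    violating = [loss > threshold_pct for loss in loss_samples_pct]
    return sum(1 for prev, cur in zip(violating, violating[1:]) if prev != cur)


# Assumed loss readings (percent) from an ordinary Internet path.
samples = [0.2, 0.8, 0.3, 1.1, 0.4, 0.9, 0.2, 0.6, 0.3, 1.2]

strict = sla_flaps(samples, threshold_pct=0.5)   # threshold inside normal noise
relaxed = sla_flaps(samples, threshold_pct=2.0)  # threshold above normal noise

print("strict threshold flaps: ", strict)
print("relaxed threshold flaps:", relaxed)
```

With the strict threshold the path changes state on every sample (9 transitions), while the relaxed threshold never trips. The link is identical in both cases; only the policy differs.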
Capacity Planning and License Review
Capacity planning confirms that appliances, transports, and licenses still fit the load before users feel the squeeze. Appliance capacity is consumed by more than raw bandwidth: tunnel/IPsec count and enabled features (IPS/IDS, Boost, advanced QoS, segmentation, First-Packet iQ) all draw CPU, and effective throughput drops when many features are on. Size to peak WAN throughput plus 20–30% headroom over a 3–5 year horizon, and size up a model tier if you plan to enable many features. A quarterly review should ask whether appliances are above headroom, whether Boost/security licenses are active where the design expects (especially after HA or appliance replacement), and whether upcoming additions fit current Orchestrator sizing.
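The sizing rule reduces to simple arithmetic, sketched below. The `feature_derate` factor is an assumption introduced here to stand for the throughput reduction when many features are enabled; its real value depends on the model and feature set and should come from the current data sheet, not from this example.

```python
def sized_target_mbps(peak_mbps: float, headroom: float) -> float:
    """Peak throughput plus fractional headroom (0.20 to 0.30 per the guidance)."""
    return peak_mbps * (1.0 + headroom)


def has_headroom(rated_mbps: float, peak_mbps: float,
                 headroom: float = 0.25, feature_derate: float = 1.0) -> bool:
    """True if the appliance's effective throughput covers peak plus headroom.

    feature_derate is a hypothetical multiplier (1.0 = no features enabled,
    lower values = more CPU spent on IPS, Boost, segmentation, and so on).
    """
    effective = rated_mbps * feature_derate
    return effective >= sized_target_mbps(peak_mbps, headroom)


peak = 400.0  # assumed peak WAN throughput in Mbps
low, high = sized_target_mbps(peak, 0.20), sized_target_mbps(peak, 0.30)
print(f"target range: {low:.0f} to {high:.0f} Mbps")

# An appliance rated at 500 Mbps fits on paper but not once features derate it.
print(has_headroom(500.0, peak, headroom=0.20))
print(has_headroom(500.0, peak, headroom=0.20, feature_derate=0.8))
```

The second call shows the risk in sizing to last year's bandwidth alone: the raw rating still exceeds the target, but the effective throughput with features enabled does not.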
Key Takeaway
Day-2 work is centralized in Orchestrator. Upgrade Orchestrator first, then appliances least-critical-first with snapshot rollback and one-at-a-time HA upgrades. Split overlay (config/crypto) from underlay (reachability/path): all tunnels down = underlay, some overlays down = policy. Review capacity and licenses quarterly, sizing for peak plus headroom and feature CPU cost.
Pre-Reading Check — Putting It All Together
1. Why does the Meridian Retail reference design use hub-and-spoke for 600 branches but only a selective mesh between regional hubs and a few key sites?
2. In the reference design, each data-center EC-L in an HA pair terminates all transports (MPLS, Internet, LTE) rather than splitting transports between the two appliances. What does this buy?
3. The chapter says the reference diagram is where "the four themes of the book converge." Which mapping is correct?
4. A team rolls out 600 stores but lets each site keep small custom tweaks "just for that location." Which recurring pitfall is this, and what is the fix?
5. Which two best practices does the chapter say "pay off for years," and why?
3. Putting It All Together
Key Points
A multi-site reference design is hub-and-spoke for branches, with selective mesh between regional hubs and a few key sites (full mesh explodes tunnel count).
Appliance models map to role: EC-XS (small branch) through EC-L (data center / large hub).
Hubs and DCs run dual appliances with a dedicated HA link, each terminating all transports so either has a complete forwarding path.
One global Orchestrator drives ZTP, BIOs, segmentation, steering, and monitoring — the operational home of the licensing model (base + Boost + security).
Avoid recurring pitfalls (no baseline, config drift, hard cutover, alarm floods, under-sizing) by standardizing on templates/ZTP and integrating telemetry into NMS/SIEM.
This final section synthesizes the whole book — architecture, models, licensing, and management — into one coherent picture. Consider a fictional enterprise, Meridian Retail: one primary data center, one secondary DC, three regional hubs (Americas, EMEA, APAC), and 600 branch stores.
Topology. A common pattern, and the one Meridian uses, is hub-and-spoke for branches: each branch forms IPsec tunnels to a regional hub, the regional hubs are joined by a full or partial mesh, and a few large or latency-sensitive sites get selective mesh tunnels. The selectivity is deliberate. Meshing every site to every site makes the tunnel count grow roughly with the square of the number of sites, so mesh is reserved for genuine site-to-site traffic (campuses, call centers) or strict low-latency needs.
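The tunnel-count argument is easy to quantify. The sketch below counts site pairs for Meridian's 605 sites; treating the two data centers as hub-class sites that mesh with the three regional hubs is an assumption made for the arithmetic, and each site pair may carry several tunnels (one per underlay), so real tunnel counts are multiples of these figures.

```python
def full_mesh_pairs(sites: int) -> int:
    """Number of site pairs when every site connects to every other site."""
    return sites * (sites - 1) // 2


def hub_and_spoke_pairs(branches: int, hubs: int, hubs_per_branch: int = 1) -> int:
    """Branch-to-hub pairs plus a full mesh among the hub-class sites."""
    return branches * hubs_per_branch + full_mesh_pairs(hubs)


branches = 600
hub_class_sites = 3 + 2  # three regional hubs plus two data centers (assumed grouping)

mesh = full_mesh_pairs(branches + hub_class_sites)
spoke = hub_and_spoke_pairs(branches, hub_class_sites)

print("full mesh site pairs:    ", mesh)   # 182710
print("hub-and-spoke site pairs:", spoke)  # 610
```

Even if every branch homed to two hubs for resilience, the total would stay near 1,200 site pairs, far below the full-mesh figure.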
| Role | Typical model | Notes |
|---|---|---|
| Small branch (kiosk, clinic, small store) | EC-XS | Lowest throughput tier |
| Medium branch | EC-S | - |
| Large branch / regional hub | EC-M | Terminates a region's branch tunnels |
| Data center / large hub | EC-L | DC-grade throughput and tunnel scale |
Redundancy. Hubs and data centers run dual EdgeConnect appliances with a dedicated HA link; tunnels over each underlay can connect to both appliances. At large hubs, each appliance terminates all transports (MPLS, Internet, LTE) so either one has a complete forwarding path. Routing is designed so each hub prefers its local EdgeConnect gateways, avoiding the asymmetric paths and ECMP behavior that break stateful firewalls.
Management and licensing. A single global Orchestrator — SaaS, on-prem VM, or cloud — drives ZTP, Business Intent Overlays, segmentation, path steering, and monitoring, with role-based access control and per-region templates. This is where the licensing model (Chapter 6) shows up operationally: the EdgeConnect base subscription plus Boost plus security tiers are all tracked and assigned centrally.
Figure 10.3 (animated): Meridian Retail reference design building up — Orchestrator manages all; 600 branches in hub-and-spoke to three regional EC-M hubs; hubs meshed to each other and to dual EC-L data centers.
Notice how the four themes converge in this one diagram: architecture (overlay/underlay, hub-and-spoke-plus-mesh), models (EC-XS through EC-L sized to role), licensing (base + Boost + security assigned per appliance), and management (one Orchestrator with ZTP and BIO templates).
Common Pitfalls and Best Practices
| Pitfall | Best practice |
|---|---|
| No WAN baseline | Measure latency, loss, jitter, ticket volume before you start |
| Per-site custom configs / config drift | Use central templates and BIOs; avoid site-specific exceptions |
| Hard MPLS cutover | Coexist then transition; keep per-wave rollback to legacy routing |
| Ignoring ISP lead times | Order circuits early; build last-mile redundancy into the plan |
| Alarm floods | Tune alarms to what matters: tunnel down, SLA breach, appliance unreachable, license fault |
| Overly strict SLA thresholds | Set realistic thresholds for Internet paths to avoid flapping |
| Security bolted on later | Align firewalls, segmentation, and Zero Trust/SASE from the start |
| Sizing to today's bandwidth | Size to peak + 20–30% headroom over 3–5 years, accounting for feature CPU cost |
Two best practices deserve emphasis. First, standardize through templates and ZTP: appliances pre-registered by serial number auto-download their configuration on first boot and join the correct site group, making a 600-store rollout repeatable and low-touch (keep a manual bootstrap fallback for sites behind restrictive firewalls). Second, export telemetry to your existing NMS/SIEM via syslog, SNMP, or API, so SD-WAN events correlate with ISP and router events for long-term trending rather than living only inside Orchestrator.
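Conceptually, the ZTP step described above is a lookup from serial number to a pre-assigned site group and template. The sketch below models only that lookup; the registry contents, serial formats, group names, and the "hold-for-approval" action are invented for illustration, and the manual bootstrap fallback for sites behind restrictive firewalls is not modeled.

```python
# Hypothetical pre-registration table: serial number -> assigned configuration.
PREREGISTERED = {
    "SN-0001": {"site_group": "stores-emea", "template": "small-branch"},
    "SN-0002": {"site_group": "stores-americas", "template": "small-branch"},
}


def on_first_boot(serial: str, registry: dict) -> dict:
    """Decide what a newly booted appliance should receive."""
    entry = registry.get(serial)
    if entry is None:
        # Unknown hardware is never auto-configured in this sketch.
        return {"action": "hold-for-approval", "serial": serial}
    return {"action": "apply-config", "serial": serial, **entry}


known = on_first_boot("SN-0001", PREREGISTERED)
unknown = on_first_boot("SN-9999", PREREGISTERED)
print(known)
print(unknown)
```

Because the site-specific decision is made once, at pre-registration time, the on-site step for each of the 600 stores is the same: cable the appliance and power it on.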
Where to Learn More and Certification Paths
Silver Peak EdgeConnect is now part of HPE Aruba Networking, so authoritative documentation, data sheets, and training live under the Aruba brand. The most useful reference for design and sizing is the official Aruba EdgeConnect solution data sheet. Because throughput and tunnel figures change by generation, always validate sizing against the current data sheet for your hardware. The recommended progression: vendor documentation first; technical deep-dives and design guides; HPE Aruba role-based certifications (associate through expert, including SD-WAN/EdgeConnect tracks); and hands-on labs with virtual Orchestrator and EC-V appliances to practice ZTP, BIO design, upgrades, and tunnel troubleshooting safely.
Key Takeaway
A multi-site reference design is hub-and-spoke for branches, mesh between regional hubs and a few key sites, EC-XS-to-EC-L by role, dual-appliance HA at hubs/DCs with diverse underlays, and one global Orchestrator — the point where architecture, models, licensing, and management become one system. Avoid the recurring pitfalls by standardizing on templates/ZTP and integrating telemetry, and deepen skills through Aruba documentation, design guides, certifications, and virtual-appliance labs.