Study Guide: Deployment, Operations, and Best Practices

Pre-Reading Check — Deployment Lifecycle

1. An organization wants to "go SD-WAN" but skips measuring its current WAN's latency, loss, jitter, and ticket volume. What capability does it lose most directly?

The ability to register appliances by serial number for ZTP.
The ability to prove, with evidence, that the migration actually improved anything.
The ability to build Business Intent Overlays in Orchestrator.
The ability to form IPsec tunnels between branches and hubs.

2. Which set of pilot sites best reflects the recommended selection criteria?

The three highest-revenue stores, to maximize business value early.
An empty back-office site plus two identical broadband-only branches.
An MPLS+DIA site, a broadband-only site, and a SaaS-heavy site with LTE backup.
Whichever sites have the most permissive maintenance windows, regardless of mix.

3. Why is MPLS-to-hybrid migration framed as "coexistence then transition" rather than a flag-day cutover?

Because EdgeConnect cannot run over MPLS and broadband at the same time.
Because running both transports in parallel lets traffic shift gradually with a fall-back path the whole time.
Because Orchestrator can only manage one underlay type per appliance.
Because ISP contracts forbid disconnecting MPLS within the first year.

4. During the parallel-run phase, what mechanisms protect critical-app performance while traffic is shifted onto SD-WAN paths?

Dynamic path selection, QoS, and Forward Error Correction.
Disabling all MPLS circuits to force traffic onto the new paths.
Upgrading every appliance to the newest ECOS image first.
Meshing every branch to every other branch.

5. In the retail cutover example, the technician powers on the appliance, confirms both underlays are green, and runs a four-item checklist. What is the LAST thing they do, and why?

Decommission the MPLS circuit, because SD-WAN is now live.
Remove the temporary route back to the old router, because rollback must stay possible until every check passes.
Push a new BIO template, because cutover is the time to change policy.
Re-cable the appliance into the LAN, because staging only covered power.

1. Deployment Lifecycle

Key Points

An EdgeConnect deployment is a five-phase lifecycle — discovery/design, lab/PoC, pilot, phased rollout, optimization — not a single flag-day cutover.
Design must be requirements-driven and must baseline the WAN (latency, jitter, loss, ticket volume) so success can be proven with evidence.
Pilot on 3–10 representative (not easiest) sites with diverse connectivity and application mix; refine templates, BIOs, ZTP, and runbooks there.
Migrate MPLS→hybrid as coexistence then transition: run transports in parallel, shift critical apps gradually using path selection, QoS, and FEC.
Never cut over a wave without a documented rollback to legacy routing and DNS, with a named go/no-go owner.

A Silver Peak (now Aruba) EdgeConnect deployment is not a single cutover event. It is a deployment lifecycle: a structured sequence that moves a network from discovery and design, through a lab proof of concept and a tightly scoped pilot, to a phased rollout in waves and, finally, ongoing optimization. Treating deployment as a lifecycle rather than a flag-day swap is one of the strongest predictors of a low-drama project. Think of it like renovating a house you still live in: you plan, test materials in one room, keep the kitchen working while you redo the bathroom, and always know how to put a wall back.

Design and Pilot Phase

The lifecycle has five recognizable phases. The early phases are about learning and validating; the later phases are about repeating at scale.

Phase	Purpose	Key activities
1. Discovery & design	Understand the environment	Inventory sites, circuits, apps; classify apps by criticality; baseline current WAN (latency, jitter, loss, ticket volume)
2. Lab / PoC	Validate technology in isolation	Stand up Orchestrator + a few appliances; test routing, overlays, security integration, firewall interop
3. Pilot	Validate in production reality	Deploy to 3–10 representative sites; test app performance, failover, policy, runbooks
4. Phased rollout	Industrialize	Roll out in waves using templates and ZTP; review metrics after each wave
5. Optimization	Steady state	Tune Business Intent Overlays, path thresholds, security posture; iterate on operations

The design phase must be driven by requirements, not technology preference. Before choosing any appliance model, inventory sites, circuits, applications, and performance and security requirements: latency budgets, loss tolerance, availability targets, and regulatory constraints. Classify applications by business criticality and performance sensitivity, because that classification becomes the raw material for your Business Intent Overlays (Chapter 5). Then decide the underlay strategy: keep MPLS; move to Internet-based transports (dedicated Internet access (DIA), broadband, and LTE); or run a hybrid of the two.

Critically, you must baseline the current WAN during design. Capture today's latency, jitter, packet loss, application response times, and ticket volume. Without a baseline you cannot prove the project succeeded. "Voice sounds better now" is an opinion; "mean opinion score rose from 3.6 to 4.2 and failover events dropped 80%" is evidence.
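As a rough illustration of turning a baseline into evidence, the sketch below compares mean values before and after migration. The metric names, the sample values, and the `compare_to_baseline` helper are all hypothetical; they do not reflect any Orchestrator export format.

```python
from statistics import mean

# Metrics where a larger value is the better outcome. For everything else
# (latency, loss, and so on) a smaller value is better.
HIGHER_IS_BETTER = {"mos"}


def compare_to_baseline(baseline, after):
    """Return mean-before, mean-after, percent change and an improved flag per metric."""
    report = {}
    for metric, before_samples in baseline.items():
        before = mean(before_samples)
        now = mean(after[metric])
        improved = now > before if metric in HIGHER_IS_BETTER else now < before
        report[metric] = {
            "before": round(before, 2),
            "after": round(now, 2),
            "change_pct": round((now - before) / before * 100, 1),
            "improved": improved,
        }
    return report


# Made-up sample values for a single site, for illustration only.
baseline = {"latency_ms": [48, 52, 50], "loss_pct": [1.2, 0.8, 1.0], "mos": [3.5, 3.6, 3.7]}
after = {"latency_ms": [41, 39, 40], "loss_pct": [0.2, 0.1, 0.3], "mos": [4.1, 4.2, 4.3]}

for metric, row in compare_to_baseline(baseline, after).items():
    print(metric, row)
```

The point of the sketch is only that every claim in the final report traces back to a number captured before the project started.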

Figure 10.1 (animated): The five-phase EdgeConnect deployment lifecycle — broad discovery, narrowing to a pilot, then widening rollout waves into steady-state operations. Revealed phase by phase.

The pilot phase is where design assumptions meet reality. Choose 3–10 sites that collectively mirror your environment rather than your easiest sites: low-to-medium business risk but real usage; diverse connectivity (MPLS+DIA, broadband-only, LTE backup); application-mix coverage (a SaaS-heavy site and a latency-sensitive voice/VDI/EMR site); and local champions who can validate user experience. Pilot objectives should be explicit and measurable, tied to the baseline — for example, "improve voice MOS and reduce failover events." The pilot is where you refine the artifacts you will mass-produce: device templates, Business Intent Overlays, Zero-Touch Provisioning workflows, and runbooks.
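One way to sanity-check a proposed pilot list against these criteria is a small coverage check. This is a sketch only: the `PilotSite` fields, the transport and profile labels, and the site names are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PilotSite:
    name: str
    transports: frozenset    # e.g. frozenset({"mpls", "dia"})
    app_profiles: frozenset  # e.g. frozenset({"saas_heavy"})
    has_champion: bool


def pilot_gaps(sites):
    """Return a list of selection criteria the proposed pilot set fails to cover."""
    gaps = []
    if not 3 <= len(sites) <= 10:
        gaps.append("site count outside 3-10")
    if not any({"mpls", "dia"} <= s.transports for s in sites):
        gaps.append("no MPLS+DIA site")
    if not any(s.transports == {"broadband"} for s in sites):
        gaps.append("no broadband-only site")
    if not any("lte" in s.transports for s in sites):
        gaps.append("no site with LTE backup")
    for profile in ("saas_heavy", "latency_sensitive"):
        if not any(profile in s.app_profiles for s in sites):
            gaps.append(f"no {profile} site")
    gaps += [f"{s.name}: no local champion" for s in sites if not s.has_champion]
    return gaps


proposed = [
    PilotSite("store-014", frozenset({"mpls", "dia"}), frozenset({"latency_sensitive"}), True),
    PilotSite("store-212", frozenset({"broadband"}), frozenset({"saas_heavy"}), True),
    PilotSite("clinic-03", frozenset({"broadband", "lte"}), frozenset({"saas_heavy"}), True),
]
print(pilot_gaps(proposed))  # an empty list means every criterion is covered
```

A set of easy, identical sites fails several checks at once, which is the signal to pick again before any hardware ships.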

Phased Migration from MPLS to Hybrid WAN

The migration from legacy MPLS to a hybrid WAN — one using MPLS, broadband, and/or LTE simultaneously as underlays — should be treated as coexistence then transition, never an abrupt cutover. Legacy and new transports run in parallel while traffic is gradually shifted. The typical pattern unfolds in four steps:

Transport audit & underlay readiness. Validate existing MPLS, DIA, broadband, and LTE circuits; order new circuits early — ISP lead times are often the longest pole in the tent.
Overlay standup. Deploy EdgeConnect over the hybrid underlay, build overlays and BIOs, steer low-risk traffic across SD-WAN while critical traffic stays on MPLS.
Parallel run / gradual traffic shift. Move increasingly critical apps from MPLS to SD-WAN as performance is validated. Dynamic path selection, QoS, and FEC protect critical-app performance during the shift; keep a per-site fall-back (revert the default route to MPLS) active throughout.
MPLS optimization or decommission. Retain a reduced MPLS footprint where strict SLAs or regulatory isolation justify it; elsewhere remove MPLS once operations are stable, update routing, and renegotiate contracts.
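The gradual shift in the third step can be pictured as a simple gate: move the least critical remaining application next, and only once everything already moved has been validated. The sketch below assumes a hypothetical criticality score per application (higher means more critical); the application names are invented.

```python
def next_app_to_shift(apps, on_sdwan, validated):
    """Pick the next application to move from MPLS to SD-WAN, or None to hold.

    apps:      dict of app name -> criticality score (higher = more critical)
    on_sdwan:  set of apps already steered onto SD-WAN
    validated: set of apps whose SD-WAN performance has been confirmed
    """
    # Hold the shift while anything already moved is still unvalidated.
    if not on_sdwan <= validated:
        return None
    remaining = [name for name in apps if name not in on_sdwan]
    if not remaining:
        return None  # everything has been moved
    return min(remaining, key=lambda name: apps[name])


apps = {"guest_wifi": 1, "saas_office": 2, "voice": 4, "pos": 5}

print(next_app_to_shift(apps, {"guest_wifi"}, {"guest_wifi"}))
print(next_app_to_shift(apps, {"guest_wifi", "saas_office"}, {"guest_wifi"}))
```

The first call proposes the next low-risk application; the second returns `None` because one moved application has not been validated yet.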

Cutover and Rollback Planning

Every site wave needs a documented rollback procedure before cutover begins. The core of any EdgeConnect rollback is simple: know how to return default routing and DNS to the legacy WAN. Because MPLS remains in place during coexistence, rollback usually means re-pointing the default route back to the legacy edge router — not re-cabling. A practical per-wave cutover package contains a design/migration plan and site diagram, a test/validation checklist (reachability, hard and soft failover, monitoring), and rollback steps with a clear go/no-go decision point owned by a named person.

Example: A retail chain cutting over 40 stores schedules each for a 30-minute window after closing. The technician powers on the pre-staged EdgeConnect, confirms both underlays come up green in Orchestrator, runs a four-item checklist (POS reachable, voice MOS acceptable, SaaS breakout working, MPLS failover tested by disabling broadband), and only then removes the temporary route back to the old router. If any check fails, the documented rollback re-enables the MPLS default route and the store is unaffected at open the next morning.
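The go/no-go logic in this example can be sketched as an ordered checklist that stops at the first failure. The check names mirror the four items above; the lambdas are stand-ins for whatever real tests a team would run, and the function name is hypothetical.

```python
def cutover_decision(checks):
    """Run checks in order and return a go/no-go decision.

    checks: dict of check name -> zero-argument callable returning True on pass.
    """
    for name, check in checks.items():
        if not check():
            return {
                "decision": "no-go",
                "failed_check": name,
                "action": "roll back: re-enable the MPLS default route",
            }
    return {
        "decision": "go",
        "failed_check": None,
        "action": "remove the temporary route back to the old router",
    }


# Stand-in results; a real checklist would call actual reachability and failover tests.
store_checks = {
    "pos_reachable": lambda: True,
    "voice_mos_acceptable": lambda: True,
    "saas_breakout_working": lambda: True,
    "mpls_failover_tested": lambda: False,
}
print(cutover_decision(store_checks))
```

Removing the temporary route is the action attached to the "go" branch only, so the rollback path stays intact until every check has passed.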

Key Takeaway

Five-phase lifecycle (design → lab → pilot → rollout → optimize). Baseline first, pilot on representative sites, migrate as coexistence-then-transition, and never cut a wave without a documented rollback to legacy routing and DNS.

Post-Reading Check — Deployment Lifecycle

1. An organization wants to "go SD-WAN" but skips measuring its current WAN's latency, loss, jitter, and ticket volume. What capability does it lose most directly?

2. Which set of pilot sites best reflects the recommended selection criteria?

3. Why is MPLS-to-hybrid migration framed as "coexistence then transition" rather than a flag-day cutover?

4. During the parallel-run phase, what mechanisms protect critical-app performance while traffic is shifted onto SD-WAN paths?

5. In the retail cutover example, the technician powers on the appliance, confirms both underlays are green, and runs a four-item checklist. What is the LAST thing they do, and why?

Pre-Reading Check — Day-2 Operations

1. The two firm rules of an EdgeConnect software upgrade are "Orchestrator first, then appliances" and "appliances in waves." Why must Orchestrator go first?

Appliances cannot form tunnels until Orchestrator is upgraded.
The management plane must support the appliance versions it manages.
Orchestrator upgrades take longer, so doing it first saves the window.
Appliance images are only downloadable after an Orchestrator upgrade.

2. An engineer sees that all tunnels from one site are down, while every other site is healthy. Where should triage begin?

Overlay/policy: compare IPsec suites and proxy-IDs on both ends.
Underlay/site: the circuit, WAN interface, or routing to the peer.
Boost: check that WAN-optimization licenses are active on both ends.
Capacity: the appliance has likely exceeded its tunnel count.

3. A tunnel that worked fine yesterday goes down immediately after an ECOS upgrade, showing no IKE problem on the upgraded peer's logs but a failure to establish. What is the most likely cause?

The WAN circuit was physically disconnected during the upgrade.
A new default cipher suite the older, un-upgraded peer does not support.
The Orchestrator backup snapshot was corrupted.
The appliance ran out of CPU headroom for First-Packet iQ.

4. Users report a branch's voice quality is poor. Orchestrator's path view shows the Internet path at 5–10% loss and MPLS at 0.1% loss but higher latency, while the tunnel itself stays up. Which response best matches the chapter's guidance?

Reboot both appliances to clear the IKE state and re-negotiate.
Bias the voice overlay toward MPLS, enable packet duplication on the Internet path, and open an ISP ticket with the loss graph.
Roll back the last ECOS upgrade, since path loss implies an upgrade defect.
Tighten the SLA threshold so the lossy path fails faster.

5. A capacity review finds an appliance sized exactly to last year's peak bandwidth, now running IPS, Boost, and segmentation. Why is this risky even though raw bandwidth still fits?

Enabled advanced features draw CPU and reduce effective throughput, eroding headroom.
Orchestrator cannot manage an appliance running three features at once.
Boost licenses expire automatically when segmentation is enabled.
Tunnel count is fixed and cannot grow once features are on.

2. Day-2 Operations

Key Points

Day-2 work is centralized in Orchestrator — most operations happen there, not by SSH-ing into routers.
Upgrades: Orchestrator first, then appliances in waves, least-critical first, with a backup/snapshot rollback and HA pairs upgraded one appliance at a time. Freeze config during a wave.
Troubleshoot by splitting overlay from underlay: all tunnels down → underlay/site; some overlays down → overlay/policy. Fix underlay reachability before touching tunnel config.
Path issues (tunnel up, but loss/latency/jitter) are diagnosed via per-WAN-label performance and remediated by biasing overlays, enabling FEC/packet duplication, and relaxing overly strict SLA thresholds.
Review capacity and licenses quarterly; size to peak + 20–30% headroom and account for the CPU cost of enabled features.

Once sites are live, work shifts from project mode to day-2 operations — the ongoing tasks of monitoring, upgrading, troubleshooting, and capacity-managing a running SD-WAN. The architecture makes this far more centralized than legacy WAN operations: most day-2 work happens in Orchestrator rather than by SSH-ing into individual routers.

Software Upgrade Strategy via Orchestrator

A software upgrade follows two firm rules: Orchestrator first, then appliances; and appliances in waves. Before touching any version, complete a three-item pre-upgrade checklist. First, check compatibility: confirm that the target Orchestrator release supports the ECOS versions currently deployed, and read the release notes for schema changes and deprecations. Second, back up Orchestrator: export the configuration and database, take a VM snapshot, and confirm the backup's integrity. Third, set a change window during which policy changes are frozen.

Appliance upgrades are staged and pushed from Orchestrator: upload the new ECOS image per platform, create upgrade groups by region/role/impact, then schedule waves during low-traffic hours, ordered least-to-most critical. Upgrade non-critical sites → smaller branches → central hubs/DCs. For HA hubs, upgrade one appliance at a time: fail traffic over, upgrade the first, confirm health, then upgrade its partner. Avoid config changes during a wave — keep a single variable in flight.

Upgrade principle	Why it matters
Orchestrator first	Management plane must support the appliance versions it manages
Back up before upgrading	Snapshot/config export is the rollback path
Waves, least-critical first	Limits blast radius; a bad wave 1 stops wave 2
HA pair: one at a time	Maintains a forwarding path throughout
Freeze config during a wave	Isolates upgrade as the only change variable

After each wave, verify per-appliance that the device is Online with the new version, tunnels are up, and no new alarms appear. If tunnel issues appear only after an upgrade, the usual suspects are a new default behavior (changed cipher suite, path-SLA default, NAT detection) or a template inconsistency where a change did not reach every site.
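The wave ordering and the post-wave gate can be sketched as two small functions. This is an illustration under stated assumptions, not how Orchestrator schedules upgrades: the tier names, the `ha_partner` field, the status fields, and the version string are all hypothetical.

```python
# Lower number = upgraded earlier. Tier names are illustrative.
TIER_ORDER = {"non_critical": 0, "branch": 1, "hub_dc": 2}


def plan_waves(appliances):
    """Group appliances into waves, least critical tier first.

    HA partners within a tier are split across two consecutive waves so
    one member of the pair keeps forwarding while the other upgrades.
    """
    waves = []
    for tier in sorted(TIER_ORDER, key=TIER_ORDER.get):
        first, second = [], []
        for a in appliances:
            if a["tier"] != tier:
                continue
            if a.get("ha_partner") in first:
                second.append(a["name"])
            else:
                first.append(a["name"])
        waves.extend(w for w in (first, second) if w)
    return waves


def wave_is_healthy(statuses, target_version):
    """Gate for starting the next wave: every appliance must pass all checks."""
    return all(
        s["online"] and s["version"] == target_version
        and s["tunnels_up"] and not s["new_alarms"]
        for s in statuses
    )


fleet = [
    {"name": "hub-a", "tier": "hub_dc", "ha_partner": "hub-b"},
    {"name": "hub-b", "tier": "hub_dc", "ha_partner": "hub-a"},
    {"name": "store-7", "tier": "branch"},
    {"name": "kiosk-1", "tier": "non_critical"},
]
print(plan_waves(fleet))
```

In use, the next wave would start only if `wave_is_healthy` returns `True` for the wave just completed, which is what lets a bad first wave stop the second.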

Troubleshooting Tunnels and Path Issues

Effective troubleshooting starts with one decision: is this an overlay problem (tunnel config or crypto) or an underlay problem (transport, routing, or path quality)? A fast diagnostic shortcut answers most of it: if all tunnels from a site are down it is an underlay/site problem; if only some overlays are down it is an overlay/policy problem. Verify underlay reachability first (ping the peer WAN IP, check the WAN interface, routing, and UDP 500/4500 through any firewall/NAT) before touching any tunnel configuration.

Symptom	Likely cause
Phase 1 up, Phase 2 down	Transform-set / PFS mismatch, or proxy-ID mismatch
No IKE negotiation at all	Firewall or port issue (UDP 500/4500 blocked)
Worked before, failed after upgrade	New default cipher unsupported by older peer; PSK/cert not on both ends
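The diagnostic shortcut and the symptom table can be combined into a first-pass triage helper. This is a sketch of the decision order only; the function name, parameters, and the `ike_state` labels are assumptions chosen for illustration.

```python
def triage(tunnels_down, tunnels_total, peer_wan_ip_pingable, ike_state=None):
    """Suggest where to start looking for a tunnel fault.

    ike_state is one of "none" (no negotiation seen), "phase1_only"
    (Phase 1 up, Phase 2 down), or None if not yet inspected.
    """
    if tunnels_down == 0:
        return "tunnels up: treat as a path-quality issue"
    # All tunnels down, or no underlay reachability: fix the underlay first.
    if tunnels_down == tunnels_total or not peer_wan_ip_pingable:
        return "underlay/site: check circuit, WAN interface, routing, UDP 500/4500"
    if ike_state == "none":
        return "no IKE negotiation: look for a firewall or port block on UDP 500/4500"
    if ike_state == "phase1_only":
        return "overlay/policy: check transform-set, PFS and proxy-ID on both ends"
    return "overlay/policy: compare tunnel config, PSK/cert and cipher defaults on both peers"


print(triage(tunnels_down=6, tunnels_total=6, peer_wan_ip_pingable=False))
print(triage(tunnels_down=2, tunnels_total=6, peer_wan_ip_pingable=True, ike_state="phase1_only"))
```

The ordering matters more than the wording: underlay reachability is evaluated before any IKE state is considered, matching the rule to fix reachability before touching tunnel configuration.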

Figure 10.2 (animated): Tunnel-down troubleshooting decision flow. Split overlay from underlay, fix underlay reachability first, then walk IKE Phase 1 / Phase 2 state. The path to root cause highlights step by step.

Path issues are a distinct category: the tunnel stays up, but traffic suffers loss, latency, jitter, or flapping on a specific transport. Diagnose via Orchestrator's per-WAN-label path performance view (loss, latency, jitter, MOS, SLA status). Remediate by biasing the overlay toward a healthier transport, enabling FEC or packet duplication for real-time flows on a lossy Internet path, and raising an ISP ticket with the loss graphs as evidence. Watch for overly strict SLA thresholds, which cause constant path flapping on normal Internet links. A related complaint — "the tunnel is fine but the app is slow" — often points to Boost, which requires an active license on both ends and a symmetric path.
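To see why an overly strict SLA threshold causes flapping, the sketch below counts how often a path changes between "meets SLA" and "breaches SLA" for the same loss samples under two thresholds. The sample values and both thresholds are invented for illustration and are not product defaults.

```python
def count_flaps(loss_samples_pct, threshold_pct):
    """Count transitions between meeting and breaching a loss threshold."""
    flaps = 0
    previous_ok = None
    for loss in loss_samples_pct:
        ok = loss <= threshold_pct
        if previous_ok is not None and ok != previous_ok:
            flaps += 1
        previous_ok = ok
    return flaps


# Hypothetical loss readings (percent) from an ordinary Internet link.
samples = [0.2, 0.6, 0.3, 0.7, 0.4, 0.8, 0.3]

print(count_flaps(samples, threshold_pct=0.5))  # 6 transitions with the strict threshold
print(count_flaps(samples, threshold_pct=1.0))  # 0 transitions with the relaxed one
```

The link behaves identically in both runs; only the threshold turns normal variation into constant path changes.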

Capacity Planning and License Review

Capacity planning confirms that appliances, transports, and licenses still fit the load before users feel the squeeze. Appliance capacity is consumed by more than raw bandwidth: tunnel/IPsec count and enabled features (IPS/IDS, Boost, advanced QoS, segmentation, First-Packet iQ) all draw CPU, and effective throughput drops when many features are on. Size to peak WAN throughput plus 20–30% headroom over a 3–5 year horizon, and size up a model tier if you plan to enable many features. A quarterly review should ask three questions: has any appliance eaten into its headroom; are Boost and security licenses active wherever the design expects them (especially after an HA event or appliance replacement); and do upcoming additions fit the current Orchestrator sizing?
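The sizing rule reduces to simple arithmetic. In the sketch below, `headroom` comes from the 20–30% guidance above, while `feature_derate` is a purely hypothetical fraction of rated throughput assumed lost to enabled features; real figures should come from the current data sheet for the hardware in question.

```python
def required_rated_mbps(peak_mbps, headroom=0.25, feature_derate=0.0):
    """Rated throughput an appliance needs to carry peak load plus headroom.

    headroom:       fraction added on top of peak (0.20 to 0.30 per the guidance)
    feature_derate: assumed fraction of rated throughput lost to enabled
                    features; an illustrative placeholder, not a vendor figure
    """
    if not 0 <= feature_derate < 1:
        raise ValueError("feature_derate must be in [0, 1)")
    return peak_mbps * (1 + headroom) / (1 - feature_derate)


peak = 400  # Mbps, made-up peak WAN throughput for one site
print(required_rated_mbps(peak))                      # headroom only
print(required_rated_mbps(peak, feature_derate=0.2))  # headroom plus assumed feature cost
```

An appliance sized exactly to last year's peak fails both calculations, and the gap widens as soon as any feature cost is assumed.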

Key Takeaway

Day-2 work is centralized in Orchestrator. Upgrade Orchestrator first, then appliances least-critical-first with snapshot rollback and one-at-a-time HA upgrades. Split overlay (config/crypto) from underlay (reachability/path): all tunnels down = underlay, some overlays down = policy. Review capacity and licenses quarterly, sizing for peak plus headroom and feature CPU cost.

Post-Reading Check — Day-2 Operations

1. The two firm rules of an EdgeConnect software upgrade are "Orchestrator first, then appliances" and "appliances in waves." Why must Orchestrator go first?

2. An engineer sees that all tunnels from one site are down, while every other site is healthy. Where should triage begin?

3. A tunnel that worked fine yesterday goes down immediately after an ECOS upgrade, showing no IKE problem on the upgraded peer's logs but a failure to establish. What is the most likely cause?

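4. Users report a branch's voice quality is poor. Orchestrator's path view shows the Internet path at 5–10% loss and MPLS at 0.1% loss but higher latency, while the tunnel itself stays up. Which response best matches the chapter's guidance?

5. A capacity review finds an appliance sized exactly to last year's peak bandwidth, now running IPS, Boost, and segmentation. Why is this risky even though raw bandwidth still fits?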

Pre-Reading Check — Putting It All Together

1. Why does the Meridian Retail reference design use hub-and-spoke for 600 branches but only a selective mesh between regional hubs and a few key sites?

Hub-and-spoke is the only topology Orchestrator supports for branches.
Full any-to-any mesh explodes the tunnel count, so mesh is reserved for genuine site-to-site or low-latency needs.
Branches cannot run IPsec, so they must rely on a hub to encrypt.
Mesh is cheaper to license than hub-and-spoke at scale.

2. In the reference design, each data-center EC-L in an HA pair terminates all transports (MPLS, Internet, LTE) rather than splitting transports between the two appliances. What does this buy?

It halves the licensing cost per site.
Either appliance alone has a complete forwarding path, so one can fail without losing a transport.
It allows the appliances to skip the dedicated HA link.
It lets branches mesh directly without a hub.

3. The chapter says the reference diagram is where "the four themes of the book converge." Which mapping is correct?

EC-XS-to-EC-L sizing = management; one Orchestrator = architecture.
Hub-and-spoke-plus-mesh = architecture; EC-XS-to-EC-L = models; base+Boost+security = licensing; one Orchestrator with ZTP/BIOs = management.
Boost = architecture; mesh = licensing; ZTP = models.
All four themes map onto the single global Orchestrator and nothing else.

4. A team rolls out 600 stores but lets each site keep small custom tweaks "just for that location." Which recurring pitfall is this, and what is the fix?

Config drift; fix with central templates and BIOs, avoiding site-specific exceptions.
Alarm floods; fix by disabling alarms during rollout.
Under-sizing; fix by upgrading every appliance one tier.
Hard cutover; fix by decommissioning MPLS immediately.

5. Which two best practices does the chapter say "pay off for years," and why?

Tightening SLA thresholds and disabling FEC, to reduce CPU load.
Standardizing via templates/ZTP (repeatable low-touch rollout) and exporting telemetry to NMS/SIEM (cross-correlate events over time).
Meshing all sites and decommissioning MPLS on day one.
Buying the largest appliance everywhere and skipping the pilot.

3. Putting It All Together

Key Points

A multi-site reference design is hub-and-spoke for branches, with selective mesh between regional hubs and a few key sites (full mesh explodes tunnel count).
Appliance models map to role: EC-XS (small branch) through EC-L (data center / large hub).
Hubs and DCs run dual appliances with a dedicated HA link, each terminating all transports so either has a complete forwarding path.
One global Orchestrator drives ZTP, BIOs, segmentation, steering, and monitoring — the operational home of the licensing model (base + Boost + security).
Avoid recurring pitfalls (no baseline, config drift, hard cutover, alarm floods, under-sizing) by standardizing on templates/ZTP and integrating telemetry into NMS/SIEM.

This final section synthesizes the whole book — architecture, models, licensing, and management — into one coherent picture. Consider a fictional enterprise, Meridian Retail: one primary data center, one secondary DC, three regional hubs (Americas, EMEA, APAC), and 600 branch stores.

Topology. Most enterprises use hub-and-spoke for branches, with branches forming IPsec tunnels to a regional hub, plus a full or partial mesh between the regional hubs and selective mesh for a few large or latency-sensitive sites. The selectivity is deliberate: meshing every site to every site explodes the tunnel count, so mesh is reserved for genuine site-to-site traffic (campuses, call centers) or strict low-latency needs.
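The tunnel-count argument is easy to quantify. The sketch below counts site-to-site adjacencies only, assuming each branch homes to a single regional hub and that the hubs and data centers are fully meshed with each other; the real tunnel count is higher, since tunnels are built per underlay and hubs run dual appliances.

```python
def full_mesh_pairs(sites):
    """Number of distinct site pairs in an any-to-any mesh: n(n-1)/2."""
    return sites * (sites - 1) // 2


def hub_and_spoke_pairs(branches, core_sites):
    """One adjacency per branch to its hub, plus a full mesh among core sites."""
    return branches + full_mesh_pairs(core_sites)


# Meridian Retail: 600 branches, 3 regional hubs, 2 data centers.
branches, core = 600, 3 + 2

print(full_mesh_pairs(branches + core))     # 182710
print(hub_and_spoke_pairs(branches, core))  # 610
```

Mesh adjacencies grow with the square of the site count while hub-and-spoke grows linearly, which is why mesh is reserved for the few sites that need it.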

Role	Typical model	Notes
Small branch (kiosk, clinic, small store)	EC-XS	Lowest throughput tier
Medium branch	EC-S	—
Large branch / regional hub	EC-M	Terminates a region's branch tunnels
Data center / large hub	EC-L	DC-grade throughput and tunnel scale

Redundancy. Hubs and data centers run dual EdgeConnect appliances with a dedicated HA link; tunnels over each underlay can connect to both appliances. At large hubs, each appliance terminates all transports (MPLS, Internet, LTE) so either one has a complete forwarding path. Routing is designed so each hub prefers its local EdgeConnect gateways, avoiding the asymmetric paths and ECMP behavior that break stateful firewalls.

Management and licensing. A single global Orchestrator — SaaS, on-prem VM, or cloud — drives ZTP, Business Intent Overlays, segmentation, path steering, and monitoring, with role-based access control and per-region templates. This is where the licensing model (Chapter 6) shows up operationally: the EdgeConnect base subscription plus Boost plus security tiers are all tracked and assigned centrally.

Figure 10.3 (animated): Meridian Retail reference design building up — Orchestrator manages all; 600 branches in hub-and-spoke to three regional EC-M hubs; hubs meshed to each other and to dual EC-L data centers.

Notice how the four themes converge in this one diagram: architecture (overlay/underlay, hub-and-spoke-plus-mesh), models (EC-XS through EC-L sized to role), licensing (base + Boost + security assigned per appliance), and management (one Orchestrator with ZTP and BIO templates).

Common Pitfalls and Best Practices

Pitfall	Best practice
No WAN baseline	Measure latency, loss, jitter, ticket volume before you start
Per-site custom configs / config drift	Use central templates and BIOs; avoid site-specific exceptions
Hard MPLS cutover	Coexist then transition; keep per-wave rollback to legacy routing
Ignoring ISP lead times	Order circuits early; build last-mile redundancy into the plan
Alarm floods	Tune alarms to what matters: tunnel down, SLA breach, appliance unreachable, license fault
Overly strict SLA thresholds	Set realistic thresholds for Internet paths to avoid flapping
Security bolted on later	Align firewalls, segmentation, and Zero Trust/SASE from the start
Sizing to today's bandwidth	Size to peak + 20–30% headroom over 3–5 years, accounting for feature CPU cost

Two best practices deserve emphasis. First, standardize through templates and ZTP: appliances pre-registered by serial number auto-download their configuration on first boot and join the correct site group, making a 600-store rollout repeatable and low-touch (keep a manual bootstrap fallback for sites behind restrictive firewalls). Second, export telemetry to your existing NMS/SIEM via syslog, SNMP, or API, so SD-WAN events correlate with ISP and router events for long-term trending rather than living only inside Orchestrator.
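Standardizing on templates also makes config drift measurable. As a rough sketch, the function below compares each site's effective settings against a central template and reports any differences. The setting names, values, and site names are hypothetical and are not actual Orchestrator template fields.

```python
def find_drift(template, site_configs):
    """Return {site: {setting: (template_value, site_value)}} for every mismatch."""
    drift = {}
    for site, config in site_configs.items():
        diffs = {
            key: (template.get(key), config.get(key))
            for key in template.keys() | config.keys()
            if template.get(key) != config.get(key)
        }
        if diffs:
            drift[site] = diffs
    return drift


# Hypothetical template and per-site settings.
template = {"voice_overlay": "realtime", "sla_loss_pct": 1.0, "syslog_export": True}
sites = {
    "store-001": {"voice_overlay": "realtime", "sla_loss_pct": 1.0, "syslog_export": True},
    "store-002": {"voice_overlay": "realtime", "sla_loss_pct": 0.2, "syslog_export": True,
                  "local_tweak": "custom-acl"},
}
print(find_drift(template, sites))
```

A site that matches the template produces no entry, so an empty result is the steady state a 600-store rollout should aim for.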

Where to Learn More and Certification Paths

Silver Peak EdgeConnect is now part of HPE Aruba Networking, so authoritative documentation, data sheets, and training live under the Aruba brand. The most useful reference for design and sizing is the official Aruba EdgeConnect solution data sheet. Because throughput and tunnel figures change by generation, always validate sizing against the current data sheet for your hardware. The recommended progression: vendor documentation first; technical deep-dives and design guides; HPE Aruba role-based certifications (associate through expert, including SD-WAN/EdgeConnect tracks); and hands-on labs with virtual Orchestrator and EC-V appliances to practice ZTP, BIO design, upgrades, and tunnel troubleshooting safely.

Key Takeaway

A multi-site reference design is hub-and-spoke for branches, mesh between regional hubs and a few key sites, EC-XS-to-EC-L by role, dual-appliance HA at hubs/DCs with diverse underlays, and one global Orchestrator — the point where architecture, models, licensing, and management become one system. Avoid the recurring pitfalls by standardizing on templates/ZTP and integrating telemetry, and deepen skills through Aruba documentation, design guides, certifications, and virtual-appliance labs.

Post-Reading Check — Putting It All Together

1. Why does the Meridian Retail reference design use hub-and-spoke for 600 branches but only a selective mesh between regional hubs and a few key sites?

2. In the reference design, each data-center EC-L in an HA pair terminates all transports (MPLS, Internet, LTE) rather than splitting transports between the two appliances. What does this buy?

3. The chapter says the reference diagram is where "the four themes of the book converge." Which mapping is correct?

4. A team rolls out 600 stores but lets each site keep small custom tweaks "just for that location." Which recurring pitfall is this, and what is the fix?

5. Which two best practices does the chapter say "pay off for years," and why?

Chapter 10: Deployment, Operations, and Best Practices
