Resilience management for the cloud

SEcure Cloud computing
for CRitical Infrastructure IT
Noor Shirazi, Steven Simpson, Andreas Mauthe & David Hutchison
Lancaster University
{n.shirazi, s.simpson, a.mauthe, d.hutchison}
AIT Austrian Institute of Technology • ETRA Investigación y Desarrollo • Fraunhofer Institute for Experimental
Software Engineering IESE • Karlsruhe Institute of Technology • NEC Europe • Lancaster University • Mirasys
• Hellenic Telecommunications Organization OTE • Ayuntamiento de Valencia • Amaris
Critical cloud computing
• Cloud services which
underpin CI
o Cloud computing services which are
used by operators of CI to support
the delivery of their core services, in
cases where the reliability of the
underlying cloud technology is itself
essential to the safe functioning of
the critical service.
• Cloud services which
underpin the digital society
o Cloud computing services which are
critical in themselves, i.e. failure
would have significant impact on
health, safety, security or economic
well-being of citizens or the effective
functioning of EU governments.
Source: “ENISA Incident reporting for cloud computing,
Secure cloud air traffic management
solution (SESAR)
NASDAQ OMX FinQloud for compliance
and surveillance system
A Slovenia-based railway operator has a
cloud-based platform to centralize
passenger, freight and logistic systems
A large oil company in the US has adopted a
cloud solution; 60% of its infrastructure is
Jan 2013, Dropbox suffered a substantial
loss of service for more than 15 hours
affecting all users across the globe.
March 2013, Microsoft email infrastructure
suffered a loss of availability for nearly 16
hours affecting business critical services
August 2013, Amazon Web services
suffered an outage, taking down Vine,
Instagram and other applications for an
Resilience as a cloud need
• Deploying CI services in the cloud increases concerns
about resilience and security
• A resilient system is one that can continue to offer a
satisfactory level of service even in the face of the
challenges it experiences
• We need resilience as a property of the Cloud
(networks and systems (VM)), such that they can
withstand any challenge, whether from misconfigurations, congestion/overloads (including flash
crowds), or attacks (such as DDoS, malware)
We define cloud resilience as “the ability to maintain an acceptable level of system operation and services
even in the presence of challenges”.
Resilience strategy: D²R² + DR
• D²R² + DR → Resilience
• Real-time control loop
o Defend against challenges and threats to normal operation
• reduce the probability of a fault leading to a failure
• reduce the impact of an adverse event or condition
o Detect when an adverse event has occurred
• determine when remediation needs to occur
o Remediate the effects of the adverse event
• minimize the impact of failure
• graceful degradation of performance
o Recover to original and normal operations
once an adverse event has ended
• Off-line control loop
o Diagnose by reflecting on past operational experience
o Refine: aim to improve the design of the system (e.g. the cloud)
Source: James P. G. Sterbenz, David Hutchison, Egemen K. Çetinkaya, Abdul Jabbar, Justin P. Rohrer, Marcus Schöller, and Paul Smith. Resilience and survivability in communication networks: Strategies, principles, and survey of disciplines. Computer Networks, 54(8):1245–1265, 2010.
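As a minimal sketch of how the real-time loop might be driven, the following models it as a two-state machine and records every decision for the off-line loop. The state names, the action strings and the `history` log are assumptions made for illustration; they are not taken from the D²R² + DR paper or from the SECCRIT implementation.

```python
from enum import Enum


class State(Enum):
    NORMAL = "normal"            # defences in place, nothing detected
    REMEDIATING = "remediating"  # an adverse event is being handled


def step(state, anomaly_detected, history):
    """Advance the real-time loop by one observation.

    Returns (next_state, action). Every decision is appended to
    `history` so that an off-line loop could diagnose and refine later.
    """
    if anomaly_detected:
        next_state, action = State.REMEDIATING, "remediate"
    elif state is State.REMEDIATING:
        # The adverse event has ended: return to normal operation.
        next_state, action = State.NORMAL, "recover"
    else:
        next_state, action = State.NORMAL, "defend"
    history.append((state, anomaly_detected, action))
    return next_state, action


def run(observations):
    """Feed a sequence of detector verdicts (booleans) through the loop."""
    state, history, actions = State.NORMAL, [], []
    for detected in observations:
        state, action = step(state, detected, history)
        actions.append(action)
    return actions, history
```

Calling `run([False, True, True, False])` would walk through defend, remediate, remediate and recover in that order, and the returned `history` is the kind of record the Diagnose and Refine stages could reflect on.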
SECCRIT objective mapping to resilience
Cloud resilience management
framework (CRMF)
o Design a joint network- and
system-wide analysis via a
unified resilience
The resilience framework helps
to protect cloud infrastructures
through dynamically observing
the state of the cloud services
and resources, analysing for
threats and remediating against
their effects.
De-constructing D²R² + DR
• Detect
o Implies a monitoring system (Network and VM level)
• Instrument the cloud
• Aim to observe normal behaviour
• Then look for anomalies
o Employ suitable anomaly detection techniques (ADTs) that are migration-aware
• Classify the detected anomalies
• Attempt a root cause analysis
• Remediate
o Policy-based remediation
o Make adjustments as appropriate
e.g. migrate VMs, adjust firewall rules,
sandbox a VM
o Get as much context as possible
• Recover
o Get back to normal behaviour if possible
o Use policies for high-level guidance
• Diagnose & Refine
o Learning phase
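Policy-based remediation, as outlined above, can be pictured as a lookup from a classified anomaly to an action. The sketch below is illustrative only: the anomaly labels, score thresholds and action names are hypothetical, chosen to mirror the examples on the slide (migrate VMs, adjust firewall rules, sandbox a VM), and do not describe the actual CRMF policy language.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Anomaly:
    vm_id: str
    kind: str     # illustrative labels: "ddos", "malware", "overload"
    score: float  # detector output, assumed to lie in [0, 1]


# Hypothetical policy table: (anomaly kind, minimum score, action name).
# Rules are evaluated in order; the first match wins.
POLICIES = [
    ("malware", 0.5, "sandbox_vm"),
    ("ddos", 0.7, "adjust_firewall_rules"),
    ("overload", 0.8, "migrate_vm"),
]


def select_action(anomaly, policies=POLICIES, default="log_only"):
    """Return the action of the first policy that matches the anomaly."""
    for kind, min_score, action in policies:
        if anomaly.kind == kind and anomaly.score >= min_score:
            return action
    # No rule matched: fall back to a non-intrusive default.
    return default
```

Keeping the rules in a data table rather than in code is one way to let operators change high-level guidance without touching the detection logic.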
Anomaly evaluation framework
How to quantify the impact of elasticity such as VM migration on
state-of-the-art ADTs?
Simpson. S, Shirazi. N, Hutchison. D,
and B. Helge, “Anomaly detection
techniques for cloud computing,” Dec.
2013. [Online].
• Anomaly evaluation framework
o Composed of various pre-/post-processing modules (scripts, Perl libraries,
Python and C)
o Attack scripts for volume- and non-volume-based attacks with rate-limiting
o Monitoring scripts based on tcpdump
o Background traffic
o Summary extraction scripts
• Convert traffic into normalized statistical properties on per-packet
o Detector scripts provide reference implementation of ADTs
o Visualization scripts compare anomaly score to threshold and plot ROC
and PRC
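To make the last step concrete, here is a small sketch of how anomaly scores could be compared against a set of thresholds to obtain the points of a ROC curve and a precision-recall curve. The function name and the convention that a score at or above the threshold counts as an alarm are assumptions; the framework's own visualization scripts are not reproduced here.

```python
def roc_pr_points(scores, labels, thresholds):
    """Evaluate a detector at several thresholds.

    scores:     anomaly score per observation (higher = more anomalous)
    labels:     ground truth per observation (1 = attack, 0 = normal)
    thresholds: candidate alarm thresholds

    Returns a list of (threshold, fpr, tpr, precision) tuples, where
    tpr is also the recall used on the precision-recall curve.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    points = []
    for t in thresholds:
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        tpr = tp / positives if positives else 0.0
        fpr = fp / negatives if negatives else 0.0
        # With no alarms raised, precision is taken as 1.0 by convention.
        precision = tp / (tp + fp) if (tp + fp) else 1.0
        points.append((t, fpr, tpr, precision))
    return points
```

Plotting `fpr` against `tpr` gives the ROC curve, and `tpr` against `precision` gives the PRC; repeating the evaluation with and without VM migration in the trace is one way to quantify the effect of elasticity on a detector.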
NW & VM level analysis
Network Level analysis
• 8 different network features such that X is
• Aggregation in 1-second bins
• 40-minute experiment
VM Level analysis
• 33 different system features such that X is
• Aggregation in 3-second bins
• 20-minute experiment
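The aggregation step for both levels can be sketched as grouping timestamped measurements into fixed-width bins. The example below is an assumption-laden illustration: it handles a single numeric value per sample (for instance a packet size) and summarizes each bin by count and sum, whereas the actual analysis uses 8 network features and 33 system features whose definitions are not given here.

```python
from collections import defaultdict


def aggregate_bins(samples, bin_seconds):
    """Group (timestamp, value) samples into fixed-width time bins.

    Returns a list of (bin_start, count, total) tuples sorted by
    bin_start. Bins that received no samples are omitted.
    """
    bins = defaultdict(lambda: [0, 0.0])
    for ts, value in samples:
        # Floor the timestamp to the start of its bin.
        start = int(ts // bin_seconds) * bin_seconds
        bins[start][0] += 1
        bins[start][1] += value
    return [(start, count, total)
            for start, (count, total) in sorted(bins.items())]
```

With `bin_seconds=1` this matches the network-level bin width and with `bin_seconds=3` the VM-level one; each resulting row would correspond to one observation in the feature matrix X.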
• Deploying CI services in the cloud increases concerns
about resilience and security
• There is a need for robust anomaly detection for cloud
environments, especially for CI, that is aware of elastic
behaviour and can work in real settings
• Elasticity has a direct impact on the underlying ADTs
• Policies are the backbone of remediation
• The resilience framework can help to protect cloud
infrastructures through dynamically observing the state
of the cloud services and resources, analysing for
threats and remediating against their effects