Why Cloud IR Is Different
Incident response in cloud environments is not simply traditional IR performed on virtual machines. The cloud introduces fundamental architectural differences that change how investigations are conducted, evidence is collected, and containment is executed. Treating cloud IR as "the same but remote" leads to missed evidence, botched containment, and destroyed forensic artifacts.
Ephemeral infrastructure is the most disruptive difference. In an on-premises environment, a compromised server sits on a rack, available for imaging and analysis indefinitely. In the cloud, instances spin up and down with auto-scaling groups, containers run for minutes before being replaced, and serverless functions execute for milliseconds. The evidence you need may literally cease to exist between the time you detect the incident and the time you begin your investigation. Cloud IR must be fast, and ideally automated, to capture evidence before it disappears.
API-driven operations mean that nearly everything an attacker does — and nearly everything you do in response — is an API call. Creating an instance, modifying a security group, exfiltrating data from a storage bucket, escalating privileges through role assumption — these are all API operations that are (or should be) logged. This is both a challenge and an advantage. The challenge is the sheer volume and complexity of API logs. The advantage is that cloud environments can provide a more complete audit trail than most on-premises environments ever achieve, if logging is properly configured.
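As an illustration, the record below is a synthetic, heavily abbreviated CloudTrail event; real records carry many more fields. The `summarize` helper is a name chosen here for the sketch, not an AWS API — it reduces an event to the who/what/where/when an analyst triages first:

```python
import json

# A synthetic, abbreviated CloudTrail record for illustration only.
event = json.loads("""
{
  "eventTime": "2024-03-01T12:34:56Z",
  "eventName": "AssumeRole",
  "eventSource": "sts.amazonaws.com",
  "awsRegion": "us-east-1",
  "sourceIPAddress": "203.0.113.50",
  "userIdentity": {
    "type": "IAMUser",
    "userName": "ci-deploy",
    "accessKeyId": "AKIAEXAMPLE"
  },
  "requestParameters": {"roleArn": "arn:aws:iam::111122223333:role/Admin"}
}
""")

def summarize(evt):
    """Reduce a CloudTrail record to the fields an analyst triages first."""
    ident = evt.get("userIdentity", {})
    return {
        "when": evt["eventTime"],
        "who": ident.get("userName") or ident.get("type"),
        "key": ident.get("accessKeyId"),
        "what": f'{evt["eventSource"].split(".")[0]}:{evt["eventName"]}',
        "where": evt["awsRegion"],
        "from_ip": evt["sourceIPAddress"],
    }

print(summarize(event))
```

In practice the same reduction is applied across millions of records — the point is that every one of the attacker's moves produces a structured record like this one.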
Identity is the perimeter. In traditional environments, network segmentation and firewalls define security boundaries. In the cloud, identity and access management (IAM) is the primary control plane. An attacker with valid IAM credentials can access resources regardless of network location. Cloud IR investigations therefore focus heavily on identity: which credentials were compromised, what permissions did they grant, what actions were taken, and what resources were accessed.
Multi-region and multi-account complexity adds investigative scope that does not exist in traditional environments. A single compromised IAM user may have accessed resources across multiple regions, multiple accounts, and multiple services. The investigation must span all of these — a compromised access key in us-east-1 may have been used to exfiltrate data from a bucket in eu-west-1 and launch crypto-mining instances in ap-southeast-1.
The Shared Responsibility Model
Before investigating a cloud incident, you must understand what you are responsible for — and what you are not. Every major cloud provider operates under a shared responsibility model that divides security obligations between the provider and the customer. The split varies by service type:
Infrastructure as a Service (IaaS)
For IaaS (EC2, Azure VMs, Compute Engine), the cloud provider is responsible for the physical infrastructure, hypervisor, and host operating system. You are responsible for everything above: the guest operating system, application stack, data, identity management, network configuration, and firewall rules. This means OS-level forensics (file systems, processes, memory) is your responsibility, but you cannot access the hypervisor or underlying hardware. Evidence collection must be done through APIs and in-guest tooling.
Platform as a Service (PaaS)
For PaaS (RDS, Azure App Service, Cloud SQL), the provider additionally manages the runtime, middleware, and operating system. Your forensic visibility is limited to application logs, data, and IAM. You cannot image the underlying OS of a managed database — you can only work with the logs and data access records the service provides.
Software as a Service (SaaS)
For SaaS (Microsoft 365, Google Workspace, Salesforce), the provider manages everything except data and user access. Your investigation scope is limited to audit logs, user activity, and data. Understanding this boundary is critical: you cannot request a memory dump of the server running your SaaS application.
During an incident, you may need to engage the cloud provider's incident response support. AWS, Azure, and GCP all offer IR support through their enterprise support plans, and some offer dedicated security IR teams for major incidents. Know how to reach these teams before an incident occurs.
Critical Log Sources by Provider
The foundation of cloud IR is log data. Each provider offers a distinct set of logging services, and the names, formats, and default configurations differ significantly. Ensure these are enabled and retained before an incident — retroactive enablement does not generate historical data.
Amazon Web Services (AWS)
- CloudTrail: Records API calls across your AWS account. This is the single most important log source for AWS IR. Every API call — who made it, from where, when, and with what parameters — is logged. Ensure CloudTrail is enabled in all regions (attackers frequently operate in regions you do not use), with management events and optionally data events (S3 object-level operations, Lambda invocations) enabled. Trail logs should be delivered to a centralized, immutable S3 bucket with Object Lock enabled to prevent tampering.
- VPC Flow Logs: Capture IP traffic metadata (source/destination IP, port, protocol, bytes, action) for network interfaces in your VPCs. Not packet captures — they contain metadata only. Enable at the VPC level for complete coverage. Essential for detecting data exfiltration, lateral movement, and command-and-control traffic.
- GuardDuty: AWS's managed threat detection service that analyzes CloudTrail, VPC Flow Logs, and DNS logs to generate findings for suspicious activity — reconnaissance, instance compromise, credential compromise, and data exfiltration. GuardDuty findings are often the initial detection point for cloud incidents.
- S3 Access Logs / S3 CloudTrail Data Events: Record access to S3 objects. Critical for investigating data exfiltration. S3 access logs provide detailed request-level information; CloudTrail data events provide API-level records of GetObject, PutObject, and DeleteObject operations.
- CloudWatch Logs: Centralized log aggregation for application logs, Lambda function logs, and custom log streams. EC2 instances require the CloudWatch agent to forward OS-level logs.
- Route 53 DNS Query Logs: Record DNS queries made to Route 53 Resolver. Useful for detecting DNS-based data exfiltration and C2 communication.
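To make the flow-log material concrete, here is a minimal parser and exfiltration heuristic. It assumes the default (version 2) flow-log field order; the sample record, byte threshold, and internal address prefix are illustrative choices, not recommendations:

```python
# Field order for the default AWS VPC Flow Log format (version 2).
# Custom formats require adjusting this list.
FIELDS = ["version", "account_id", "interface_id", "srcaddr", "dstaddr",
          "srcport", "dstport", "protocol", "packets", "bytes",
          "start", "end", "action", "log_status"]

def parse_record(line):
    """Split a space-delimited flow log record into a dict of named fields."""
    rec = dict(zip(FIELDS, line.split()))
    for k in ("srcport", "dstport", "packets", "bytes"):
        rec[k] = int(rec[k]) if rec[k] != "-" else 0
    return rec

def large_outbound(records, internal_prefix="10.", threshold=100_000_000):
    """Flag accepted flows from internal addresses that moved more than
    `threshold` bytes — a rough first-pass exfiltration heuristic."""
    return [r for r in records
            if r["action"] == "ACCEPT"
            and r["srcaddr"].startswith(internal_prefix)
            and r["bytes"] > threshold]

# Synthetic record: ~250 MB from an internal host to an external IP over 443.
sample = ("2 111122223333 eni-0a1b2c3d 10.0.1.5 198.51.100.7 "
          "49152 443 6 900000 250000000 1710000000 1710000600 ACCEPT OK")
print(large_outbound([parse_record(sample)]))
```

Because flow logs contain metadata only, a heuristic like this tells you that a large transfer happened, not what was transferred — correlating with CloudTrail data events or application logs answers the second question.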
Microsoft Azure
- Azure Activity Log: Records control plane operations (resource creation, modification, deletion) across your Azure subscription. Equivalent to CloudTrail management events. Retained for 90 days by default — export to a Log Analytics workspace or storage account for longer retention.
- Entra ID (Azure AD) Sign-in and Audit Logs: Record user authentication events (successful and failed sign-ins, MFA challenges, conditional access evaluations) and directory changes (user creation, role assignments, application registrations). Critical for investigating identity compromise. Sign-in logs include risk detections — impossible travel, anonymous IP, unfamiliar sign-in properties.
- NSG Flow Logs: Network Security Group flow logs, equivalent to VPC Flow Logs. Record allowed and denied traffic at the network security group level. Enable version 2 for additional fields including bytes transferred.
- Microsoft Defender for Cloud: Security posture management and threat detection. Generates security alerts for suspicious activity across Azure resources, similar to GuardDuty.
- Key Vault Audit Logs: Record access to secrets, keys, and certificates. Essential when investigating whether credentials stored in Key Vault were accessed by a compromised identity.
- Storage Analytics Logs: Record read, write, and delete operations on Azure Storage. Enable diagnostic settings on storage accounts to capture these events.
Google Cloud Platform (GCP)
- Cloud Audit Logs — Admin Activity: Records administrative operations (resource creation, IAM policy changes, configuration modifications). Enabled by default, cannot be disabled, retained for 400 days. The equivalent of CloudTrail management events.
- Cloud Audit Logs — Data Access: Records data read/write operations. Not enabled by default for most services (must be explicitly configured). Can generate high volume — enable selectively for sensitive resources. Retained for 30 days by default.
- VPC Flow Logs: Record network traffic metadata for VPC subnets. Similar to AWS VPC Flow Logs. Enable at the subnet level. Configurable sampling rate and aggregation interval.
- Cloud DNS Logs: Record DNS queries. Useful for detecting DNS tunneling and C2 communication.
- Security Command Center: GCP's integrated security and risk management platform. Premium tier includes threat detection (Event Threat Detection, Container Threat Detection) that generates findings for suspicious activity.
Cloud-Specific Attack Patterns
Cloud environments introduce attack vectors that do not exist in traditional infrastructure. Understanding these patterns is essential for effective investigation.
- Compromised IAM credentials: The most common cloud attack vector. Credentials are leaked through code repositories, phishing, metadata service exploitation (SSRF to the instance metadata service at 169.254.169.254), or compromised CI/CD pipelines. Once obtained, the attacker operates with whatever permissions the credentials grant — which, in environments with overly permissive IAM policies, can be devastating.
- Misconfigured storage buckets: Public S3 buckets, Azure Blob containers, and GCS buckets continue to expose sensitive data. The attack may be as simple as listing a publicly accessible bucket and downloading its contents. Check bucket policies, ACLs, and the "Block Public Access" settings (AWS) or equivalent controls.
- Privilege escalation via role chaining: An attacker with limited permissions discovers they can assume a more privileged role, which can in turn assume an even more privileged role, or create new credentials with elevated permissions. In AWS, this manifests as `sts:AssumeRole` chains; in GCP, as service account impersonation chains. The CloudTrail or Audit Log record shows the escalation path.
- Resource hijacking: Compromised cloud accounts are frequently used to launch cryptocurrency mining instances. The attacker launches the largest available instance types across multiple regions. The first indicator is often a spike in the cloud billing dashboard or a billing alert. By the time you investigate, the attacker may have already spun up hundreds of instances.
- Data exfiltration via snapshot or export: Rather than downloading data through network connections (which may trigger DLP or flow log alerts), attackers may share an EBS snapshot, RDS snapshot, or storage bucket with an external account they control. This is an API operation, not a network transfer — it appears only in CloudTrail or equivalent logs, not in VPC Flow Logs.
- Lateral movement through service accounts: Cloud workloads are assigned service accounts or instance profiles with IAM roles. Compromising a workload gives the attacker access to the role's permissions — which often include access to other cloud services, databases, storage, and secrets managers.
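The role-chaining pattern can be sketched as a small CloudTrail analysis. The events below are synthetic (all ARNs and names invented for illustration); the idea is simply to link each AssumeRole event's assuming principal to the role it assumed, then walk the links:

```python
# Synthetic AssumeRole events, abbreviated to the fields the sketch needs.
events = [
    {"eventName": "AssumeRole",
     "userIdentity": {"arn": "arn:aws:iam::111122223333:user/dev-intern"},
     "requestParameters": {"roleArn": "arn:aws:iam::111122223333:role/Builder"}},
    {"eventName": "AssumeRole",
     "userIdentity": {"arn": "arn:aws:sts::111122223333:assumed-role/Builder/sess1"},
     "requestParameters": {"roleArn": "arn:aws:iam::111122223333:role/Admin"}},
]

def escalation_chain(events, start_principal):
    """Follow AssumeRole links from a principal; return the names in order."""
    edges = {}
    for e in events:
        if e["eventName"] != "AssumeRole":
            continue
        assumer = e["userIdentity"]["arn"].split("/")[1]   # user or role name
        assumed = e["requestParameters"]["roleArn"].split("/")[-1]
        edges[assumer] = assumed
    chain, current = [start_principal], start_principal
    while current in edges:
        nxt = edges[current]
        if nxt in chain:   # guard against cyclic assume-role configurations
            break
        chain.append(nxt)
        current = nxt
    return chain

print(escalation_chain(events, "dev-intern"))
```

A real investigation would also key edges on session names and timestamps, since the same role may be assumed by many principals; this sketch only shows the shape of the reconstruction.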
Containment in the Cloud
Cloud containment strategies differ fundamentally from on-premises approaches. You cannot physically disconnect a server. You cannot walk to a rack and pull a network cable. Everything is done through API calls, and every action must be carefully considered for its impact on evidence preservation.
Identity Containment
For compromised IAM credentials, containment focuses on revoking access while preserving the audit trail:
- Revoke active sessions: In AWS, attach an inline policy that denies all actions for sessions issued before the current time (using the `aws:TokenIssueTime` condition key). In Azure, revoke all refresh tokens for the compromised user via Entra ID. In GCP, revoke OAuth tokens for the service account.
- Rotate credentials: Deactivate and replace access keys (do not delete — the key ID is needed for CloudTrail correlation). Reset passwords. Generate new service account keys.
- Restrict IAM policies: Apply a deny-all policy to the compromised entity. Do not delete the user or role — deleting it destroys the IAM policy history and breaks the audit trail. Instead, attach a restrictive policy that overrides all permissions.
- Review and remove persistence: Check for attacker-created access keys, assumed roles, federation configurations, OAuth applications, and service account keys that provide alternative access paths.
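The session-revocation step can be sketched as building the deny policy document itself. This mirrors the shape of the inline policy AWS attaches when you revoke active sessions from the console — treat it as a sketch to adapt, not a drop-in control:

```python
import json
from datetime import datetime, timezone

def revoke_sessions_policy(cutoff=None):
    """Deny all actions for sessions whose tokens were issued before `cutoff`
    (an ISO-8601 UTC timestamp); defaults to the current time."""
    cutoff = cutoff or datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Deny",
            "Action": ["*"],
            "Resource": ["*"],
            # Sessions issued before the cutoff fail every authorization check.
            "Condition": {"DateLessThan": {"aws:TokenIssueTime": cutoff}},
        }],
    }

print(json.dumps(revoke_sessions_policy("2024-03-01T12:00:00Z"), indent=2))
```

Because the policy keys on token issue time rather than identity, legitimate sessions re-established after credential rotation are unaffected, while every session the attacker minted before containment is cut off.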
Network Containment
- Security group modification: Replace the instance's security group with an isolated security group that allows no inbound or outbound traffic (except from your forensic investigation IP). In AWS, note that security group changes take effect immediately but do not terminate existing connections — you may need to restart the network interface.
- Network ACLs: Apply deny rules at the subnet level for broader isolation. NACLs are stateless and take effect immediately, including for existing connections.
- WAF rules: If the attack vector is a web application, deploy WAF rules to block the attacker's IP addresses or malicious request patterns while the investigation proceeds.
Compute Containment
The critical principle for compute containment is: snapshot before you act, and never terminate an instance you have not preserved.
- Create disk snapshots: Before any containment action, snapshot all volumes attached to the compromised instance. This preserves the filesystem state at the time of investigation. In AWS, use `create-snapshot` for each EBS volume. In Azure, create a disk snapshot. In GCP, create a persistent disk snapshot.
- Capture instance metadata: Record the instance's configuration, security groups, IAM role, user data, tags, and network interfaces. This metadata may change during containment or remediation and cannot be recovered later.
- Isolate, do not terminate: Move the instance to a quarantine VPC or subnet with no internet access and no access to other production resources. Do not stop or terminate the instance — volatile evidence (running processes, network connections, memory) is lost when the instance stops. If memory acquisition is needed, install a memory acquisition tool (like LiME for Linux) and capture the memory dump before stopping the instance.
Storage Containment
- Enable versioning: If not already enabled, enable versioning on compromised storage buckets to prevent the attacker from permanently deleting evidence.
- Restrict access policies: Modify bucket policies to deny all access except from the investigation team's IP addresses and IAM identities.
- Enable Object Lock: For S3 buckets containing critical evidence, enable Object Lock in compliance mode to prevent deletion or modification, even by the root account.
Evidence Preservation
Evidence preservation in the cloud faces unique challenges that demand proactive preparation. You cannot retroactively enable logging, extend retention, or recover terminated instances.
Pre-Incident Preparation
- Enable logging everywhere, before you need it. CloudTrail in all regions (including regions you do not use). VPC Flow Logs on all VPCs. DNS query logging. Data access audit logs on sensitive resources. Cloud provider security services (GuardDuty, Defender, SCC). The most common post-incident finding is "the log we needed was not enabled."
- Centralize and protect log storage. Send all logs to a centralized, immutable storage location — an S3 bucket with Object Lock, a Log Analytics workspace with immutable retention, or a GCS bucket with a retention policy. Use a separate, restricted account (a "security" or "log archive" account) that is not accessible from the production environment. If the attacker compromises your production account but not your log archive account, your evidence is preserved.
- Extend default retention periods. Default retention for many cloud logs is 90 days or less. Many breaches are not detected for months. Configure retention periods of at least 365 days for security-relevant logs. The cost of log storage is negligible compared to the cost of a breach investigation with missing evidence.
- Pre-stage IR roles and access. Create dedicated IR IAM roles with break-glass access to investigation-relevant services and resources. These roles should be dormant during normal operations and activated only during incidents. Document the activation process. Test it before you need it.
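A break-glass IR role's trust policy might look like the sketch below. The account ID is invented and the condition set (requiring MFA) is an illustrative assumption — your own layout may add external IDs, source-IP restrictions, or session duration limits:

```python
import json

# Sketch of a trust policy for a dormant break-glass IR role, assumable only
# from a dedicated security account and only with MFA. The account ID and
# condition choices are illustrative, not a prescription.
IR_ROLE_TRUST_POLICY = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        # Hypothetical dedicated security/log-archive account.
        "Principal": {"AWS": "arn:aws:iam::999988887777:root"},
        "Action": "sts:AssumeRole",
        "Condition": {
            # Require MFA so stolen long-lived credentials alone cannot
            # activate the IR role.
            "Bool": {"aws:MultiFactorAuthPresent": "true"},
        },
    }],
}

print(json.dumps(IR_ROLE_TRUST_POLICY, indent=2))
```

Keeping the assuming principals in a separate account means a production-account compromise does not automatically grant the attacker your investigation tooling.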
During-Incident Collection
When collecting evidence during an active incident, follow this priority order to capture the most volatile evidence first:
- Running instance memory — most volatile, lost on stop/terminate
- Running instance network state — active connections, listening ports
- Running instance process state — process listings, open files
- Disk snapshots — persistent but may be modified by the attacker
- Cloud API logs — persistent if properly configured and retained
- IAM policies and configurations — may be modified during incident
- Network configurations — security groups, NACLs, route tables
For each piece of evidence collected, record a chain of custody: what was collected, when, by whom, from where, and how. Use cryptographic hashes (SHA-256) to verify integrity. Store evidence in a dedicated, access-controlled, immutable storage location separate from the compromised environment.
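A minimal chain-of-custody record along these lines can be built with the standard library; the field names, sample file, and identifiers below are invented for illustration:

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path

def custody_record(evidence_path, collected_by, source, method):
    """Build a chain-of-custody entry — what, when, who, from where, how —
    plus a SHA-256 digest so integrity can be verified later."""
    data = Path(evidence_path).read_bytes()
    return {
        "artifact": str(evidence_path),
        "collected_at": datetime.now(timezone.utc).isoformat(),
        "collected_by": collected_by,
        "source": source,
        "method": method,
        "sha256": hashlib.sha256(data).hexdigest(),
    }

# Example with a stand-in for an exported CloudTrail log file.
Path("trail.json").write_bytes(b'{"Records": []}')
record = custody_record("trail.json", "analyst@example.com",
                        "s3://log-archive/trail.json", "aws s3 cp")
print(json.dumps(record, indent=2))
```

Re-hashing the artifact at any later point and comparing against the recorded digest demonstrates the evidence has not been altered since collection.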
Cloud IR Toolkit
A growing ecosystem of tools supports cloud incident response. Building your toolkit before an incident is essential — you do not want to be evaluating tools during a crisis.
AWS
- AWS IR Playbooks (aws-incident-response-playbooks): Open-source playbooks covering common AWS incident scenarios — compromised IAM credentials, exposed S3 buckets, compromised EC2 instances.
- Prowler: Open-source security assessment tool that evaluates your AWS environment against hundreds of checks based on CIS benchmarks, PCI-DSS, and AWS best practices. Useful for identifying misconfigurations that enabled the attack.
- ScoutSuite: Multi-cloud security auditing tool that collects configuration data from AWS, Azure, and GCP and identifies security risks. Generates a comprehensive report of your cloud security posture.
- CloudTrail Lake: AWS's managed CloudTrail query service that enables SQL-based analysis of CloudTrail events. Useful for complex queries across large volumes of audit data during investigations.
Azure
- AzureHound: Open-source tool for mapping Azure AD and Azure Resource Manager relationships, useful for understanding privilege escalation paths and lateral movement opportunities.
- Microsoft Incident Response tools: Microsoft's IR team uses and publicly documents tools for Azure investigation, including KQL queries for Sentinel and Log Analytics.
- Azure Resource Graph: Enables querying resource configurations across subscriptions. Useful for rapidly assessing the scope of an incident across a large Azure environment.
GCP
- Security Command Center (Premium): Provides threat detection (Event Threat Detection, Container Threat Detection), vulnerability scanning, and asset inventory. The investigation workspace supports forensic queries across audit logs.
- GCP Policy Analyzer: Analyzes IAM policies to determine what identities have access to what resources. Essential for understanding the blast radius of a compromised service account.
Cross-Cloud
- Steampipe: Open-source tool that exposes cloud APIs as SQL tables, enabling SQL queries across AWS, Azure, GCP, and dozens of other cloud services. Powerful for multi-cloud investigation queries.
- CloudQuery: Open-source cloud asset inventory that syncs cloud configuration and audit data to a database for analysis. Useful for building a queryable inventory of your cloud environment.
- Cartography: Open-source tool from Lyft that creates a graph database of your cloud infrastructure, mapping relationships between resources, identities, and network paths. Invaluable for understanding lateral movement paths and blast radius analysis.
Building Cloud IR Readiness
The organizations that respond effectively to cloud incidents are those that prepared before the incident occurred. Cloud IR readiness is not a one-time project — it is an ongoing practice that must evolve as your cloud environment grows and changes.
- Enable logging comprehensively. If there is one action you take after reading this article, make it this: audit every cloud account and ensure that all security-relevant logging is enabled, centralized, and retained for at least 365 days. This single step has more impact on your IR capability than any other.
- Maintain cloud architecture diagrams. During an incident, responders need to quickly understand the environment — what accounts exist, what services are deployed, how they connect, where sensitive data lives. Automated tools like CloudQuery, Steampipe, or native cloud asset inventories can generate up-to-date architecture maps.
- Pre-stage automated evidence collection. Write and test scripts that automatically capture disk snapshots, instance metadata, memory dumps, and log exports. When an incident occurs, manual evidence collection is too slow for ephemeral environments. Automation ensures consistent, complete, and timely collection.
- Conduct cloud-specific tabletop exercises. Traditional tabletop exercises often assume on-premises infrastructure. Design exercises that specifically test your team's ability to investigate in cloud environments — compromised IAM credentials, public bucket exposure, crypto-mining in unused regions, cross-account lateral movement.
- Establish cloud provider IR contacts. Know how to reach your cloud provider's security IR team before you need them. For AWS, this is the AWS Customer Incident Response Team (CIRT). For Azure, this is Microsoft DART. For GCP, this is the Google Cloud Incident Response team. Understand what support your contract tier entitles you to.
"In cloud IR, the incident response process starts months before the incident. The logging you enable today, the retention you configure today, and the IR roles you pre-stage today determine whether you can investigate effectively when a breach occurs."
Cloud environments fundamentally change the practice of incident response, but the core principles remain the same: prepare in advance, detect quickly, contain decisively, preserve evidence carefully, and learn from every incident. The organizations that treat cloud IR as a distinct discipline — not an afterthought bolted onto their existing IR program — are the ones that will navigate cloud incidents with the speed and precision these environments demand.