
Scanning Overview

Object Flow

Objects can be placed into buckets in a number of ways: direct upload, the CLI and, most commonly, applications that provide interfaces and workflows for employees, customers and partners. However the objects arrive, Cloud Storage Security sees three main interaction mechanisms with those objects: event driven, API driven and retro driven (looking back at existing objects). Currently, Antivirus for Amazon S3 delivers all three mechanisms: event driven scanning, retro scanning (via on-demand requests as well as schedules) and API driven scanning through a REST interface.

Event Driven Scanning

Event driven scanning leverages an event on the bucket, in this case the All object create event, so that an event is raised any time an object is created or modified within the bucket. Antivirus for Amazon S3 places an event destination / handler onto the protected buckets, which listens for these events to trigger scanning. This allows Antivirus for Amazon S3 to easily plug in to any existing workflow you have without modifications.

So this looks as follows:

  1. An object is added to a protected bucket
  2. An event is raised and sent to an SNS Topic
  3. Antivirus for Amazon S3 provides an SQS Queue which subscribes to the Topic
  4. One or more Antivirus for Amazon S3 Agents are monitoring the queue
  5. Entries are pulled from the queue identifying the object to scan. The object is retrieved and scanned
  6. Objects are handled according to the Scan Settings you have set
    a. All objects are tagged
    b. Infected files are moved to a quarantine bucket (default behavior)
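
Steps 2 through 5 can be sketched from the agent's point of view. When an SNS topic delivers to an SQS queue without raw message delivery, the queue message body is an SNS envelope whose `Message` field holds the S3 event notification as a JSON string. The sketch below is illustrative only and is not the product's agent code: it unwraps that envelope and yields the bucket and key of each created object. The function name `objects_to_scan` is hypothetical, and the envelope layout is an assumption about how the topic and queue are configured.

```python
import json
from urllib.parse import unquote_plus


def objects_to_scan(sqs_body: str):
    """Yield (bucket, key) pairs from an SQS message body.

    Assumes the standard SNS-to-SQS envelope (raw message delivery off)
    wrapping a standard S3 event notification.
    """
    envelope = json.loads(sqs_body)
    s3_event = json.loads(envelope["Message"])
    for record in s3_event.get("Records", []):
        # Only object-created events are relevant to event driven scanning.
        if not record.get("eventName", "").startswith("ObjectCreated:"):
            continue
        bucket = record["s3"]["bucket"]["name"]
        # S3 URL-encodes object keys in event notifications.
        key = unquote_plus(record["s3"]["object"]["key"])
        yield bucket, key


if __name__ == "__main__":
    s3_event = {"Records": [{
        "eventName": "ObjectCreated:Put",
        "s3": {"bucket": {"name": "example-bucket"},
               "object": {"key": "uploads/my+report.pdf"}},
    }]}
    body = json.dumps({"Type": "Notification", "Message": json.dumps(s3_event)})
    for bucket, key in objects_to_scan(body):
        print(bucket, key)
```

Decoding the key matters because spaces and special characters arrive encoded; fetching the object with the raw encoded key would fail for such names.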

Note

This flow and behavior is irrespective of region. Amazon S3 buckets have a global view, but are regionally placed. This flow will be performed local to each region you enable buckets for scanning.

Document Flows

Standard Document Flow

When a bucket is protected, an event listener (SNS Topic) is added to the bucket. This sends S3 Events to the topic, which in turn populates an SQS Queue. From there, everything is scanned in near-real-time.

2 Bucket System Document Flow

The Two Bucket System (2 Bucket System) allows a customer to physically separate the incoming files from the downstream users of the "production buckets". The separation lasts as long as it takes to scan the files and ensure they are clean. In this way, you can ensure nothing other than clean files makes it into your production buckets, so everything in them is safe to consume.
For guided steps on how to set up the 2 Bucket System go here

Retro Scanning

Retro Scanning is the scanning of existing objects. Whether you want to scan all your existing data the first time you set up the product or you have compliance requirements that dictate a regular cadence of data scanning, Retro Scanning allows you to scan and re-scan your existing S3 objects. Unlike event driven scanning, which looks at all "new" objects coming into buckets and is triggered by events, retro scanning leverages S3 crawling to determine the work set. The first time you enable a bucket for event scanning, you'll be prompted to scan existing data; the time window defaults to the beginning of time (March 14, 2006) through the current time. Scanning does not continue past the time of activation, as event scanning takes care of everything going forward. The default time window can be changed. If you trigger retro scanning without a bucket activation, you will be prompted to pick a time slice for scanning. This can still be all time or a window of time of your choosing.
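
The time window selection can be sketched as a filter over a bucket listing. This is an illustration only, not the product's crawler: the `objects` list stands in for entries returned by an S3 listing call (each with a `Key` and a `LastModified` timestamp), `select_for_scan` is a hypothetical name, and treating both window boundaries as inclusive is an assumption. The default window start mirrors the March 14, 2006 date mentioned above.

```python
from datetime import datetime, timezone

# Default window start, taken from the "beginning of time" date above.
BEGINNING_OF_TIME = datetime(2006, 3, 14, tzinfo=timezone.utc)


def select_for_scan(objects, start=BEGINNING_OF_TIME, end=None):
    """Return keys whose LastModified falls inside [start, end].

    Every listed object is examined (crawled), but only those inside the
    window are returned for scanning.
    """
    if end is None:
        end = datetime.now(timezone.utc)
    return [o["Key"] for o in objects if start <= o["LastModified"] <= end]


if __name__ == "__main__":
    listing = [
        {"Key": "old.bin", "LastModified": datetime(2019, 5, 1, tzinfo=timezone.utc)},
        {"Key": "new.bin", "LastModified": datetime(2021, 6, 1, tzinfo=timezone.utc)},
    ]
    window_start = datetime(2021, 1, 1, tzinfo=timezone.utc)
    window_end = datetime(2021, 12, 31, tzinfo=timezone.utc)
    print(select_for_scan(listing, window_start, window_end))  # ['new.bin']
```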

It looks as follows:

  1. You have existing objects within your AWS account(s)
  2. Via the console, you trigger a scan for existing objects (on-demand or through a schedule)

    This can be all objects within the bucket(s) or a subset based on a time window

  3. Antivirus for Amazon S3 spins up one or more Fargate Run Tasks for each on-demand request or schedule. These tasks run the crawling process, which crawls all objects within the bucket(s) and adds entries to a temporary Retro SQS Queue created uniquely for the job

    Amazon S3 doesn't allow for simple searching and segmenting of the objects to process, so all objects must be crawled. Only the objects that match the time window will be added to the queue for processing. If you have 1 million objects in a bucket and the time window covers only 50,000 of them, all 1 million will be crawled, but only the 50,000 will be scanned.

    Note

    Antivirus for Amazon S3 will spin up a number of Run Task scanning agents with the goal of completing the work in ~1 hour. Depending on the volume related to the job (# of buckets, # of objects, size of buckets), this could be a "lot" of agents. Ultimately, this has no effect on the cost, as running 1 agent for 100 hours to complete the job costs the same as running 100 agents for 1 hour.


    You may face service quota limits when running big jobs or multiple jobs at the same time. The default limit on the number of these tasks is 1000 in most accounts. If a job needs more than 1000 tasks, the process will not stop or break; it will just take longer than it would if more tasks could be spun up. If you know your loads will typically require more tasks than that, you can request a limit increase from AWS.

  4. Once crawling has completed for a job, a new set of Fargate Run Tasks is spun up to perform the scanning. Antivirus for Amazon S3 will automatically spin up the number of scanning agents needed to complete the job

    The retro agents are the same agents as the event agents, but run as Fargate Run Tasks. These agents destroy themselves when the job has completed.

  5. Entries are pulled from the queue identifying the object to scan. The object is retrieved and scanned
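
The sizing note in step 3 comes down to simple arithmetic. The sketch below is an assumption-laden illustration, not the product's scheduler: `agents_for_job` is a hypothetical helper, the one-hour target and the 1000-task quota come from the text above, and the total agent-hours figure is something you would have to estimate yourself. It also assumes the work divides evenly across agents.

```python
import math


def agents_for_job(total_agent_hours, target_hours=1.0, task_quota=1000):
    """Return (agents, wall_clock_hours) for a retro scan job.

    agents is the count needed to hit target_hours, capped at task_quota.
    When the cap applies, the job still finishes; it just runs longer.
    """
    wanted = max(1, math.ceil(total_agent_hours / target_hours))
    agents = min(wanted, task_quota)
    wall_clock_hours = total_agent_hours / agents
    return agents, wall_clock_hours


if __name__ == "__main__":
    # 100 agent-hours of work: 100 agents for about 1 hour.
    print(agents_for_job(100))    # (100, 1.0)
    # 2500 agent-hours hits the 1000-task quota and takes about 2.5 hours.
    print(agents_for_job(2500))   # (1000, 2.5)
```

In both cases the product of agents and wall-clock hours equals the total agent-hours, which is why the cost does not depend on how many agents run in parallel.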

On-Demand Scanning

On-Demand Scanning is a user initiated scan from within the management console GUI. Triggered from the Bucket Protection page, a user can on a one-off basis select one or many buckets and a time window to scan objects. This scan will be treated and tracked as a "job" on the Jobs page under the Monitoring section of the console.

Scheduled Scanning

Scheduled Scanning allows you to process new files or existing files based on a schedule. Instead of processing new files as they come in, your workflow may allow them to be scanned once per day. For compliance reasons you may be required to scan all of your files on a quarterly basis and Scheduled Scanning will allow you to do that. Learn more about schedule based scanning. Scheduled scans will be treated and tracked as a "job" on the Jobs page under the Monitoring section of the console.

API Driven Scanning

API driven scanning is the notion of scanning a file and receiving the verdict before it is written anywhere. We see this when your workflow demands a verdict at the time of upload. We often hear from customers that they want the file scanned before it resides in Amazon S3. Or their workflow may simply need an API driven verdict engine, and Amazon S3 may not be in play at all. This is often necessary in applications where users are waiting to be told the upload was successful and the file accepted. APIs allow the application to make a direct handoff of the file to the scanning agent.

Ultimately, API driven scanning provides an API Endpoint verdict engine that can be used inside or out of AWS. You can send files to scan from on-prem or applications residing within AWS or from anywhere you grant access. The API scanning agents sit behind an AWS Load Balancer. You can make the Load Balancer internet-facing or internal depending on your requirements. Learn more about configuring and managing the API endpoint on the API Agent Settings page.

Setup:

  1. Create a user for API use - How to
  2. Set up and configure API Agent Region - How to
  3. Integrate HTTP Post calls into your applications - explore samples below

You can use the programming language of your choice as we only require you to leverage HTTP Post to submit the file for scanning. Below are very simple examples of how to submit a file and the results you will see back.

Steps to making the API call:

  1. Make request for Auth Token
    a. Specify content type of JSON in headers
    b. Capture username and password in JSON
    c. HTTP Post the data block and headers to <baseURL> + /api/Token
    headers = {'Content-type': 'application/json'}
    json_foo = json.dumps({"username": "<username here>", "password": "<pw here>"})
    r = session.post("https://<baseURL to load balancer or friendly URL>/api/Token", data=json_foo, headers=headers)
    
    This will return the following response text:
    {
        "accessToken":"eyJraWQiOiI0Qk41QU1yVXdhWUUrZlBUZ0dhQTZWQUNXUmREMmh2dlMxWFgrUmNmTzd3PSIsImFsZyI6IlJTMjU2In0.eyJzdWIiOiIyMDYyZDQxMC1kMGE0LTRiNTItYjc2Yi03M2FiNWQ5Njk4YWQiLCJjb2duaXRvOmdyb3VwcyI6WyJVc2VycyIsIlByaW1hcnkiXSwiZW1haWxfdmVyaWZpZWQiOnRydWUsImlzcyI6Imh0dHBzOlwvXC9jb2duaXRvLWlkcC51cy1lYXN0LTEuYW1hem9uYXdzLmNvbVwvdXMtZWFzdC0xX1haNWpVNXcwWSIsImN1c3RvbTpoaWRlX3RyaWFsX21zZyI6IjAiLCJjb2duaXRvOnVzZXJuYW1lIjoiZWRjIiwiY3VzdG9tOnVzZXJfZGlzYWJsZWQiOiIwIiwiY3VzdG9tOmF3c19hY2NvdW50X2lkIjoiNzMwMDc5MDI1Njg4IiwiY3VzdG9tOmhpZGVfd2VsY29tZV9tc2ciOiIwIiwiYXVkIjoiNXM2M29raWtodGJxdDR2cTFtMmV1bDk4Z2kiLCJldmVudF9pZCI6IjkwOTJlZjc4LTMzZTMtNDVhNS1hZTlhLTVmN2Y0NGY2NDZmNiIsInRva2VuX3VzZSI6ImlkIiwiYXV0aF90aW1lIjoxNjI1MjgyODU5LCJleHAiOjE2MjUyODY0NTksImlhdCI6MTYyNTI4Mjg1OSwiZW1haWwiOiJzdXBwb3J0QGNsb3Vkc3RvcmFnZXNlYy5jb20ifQ.QehudPO4zTphRq9ch3p6IopzRz7m72D5LquVgnzw8iHfDBbgZLQiAM7uWtkKGQw5fYV5dsB_U0fbcrW6F3ov_U4LcpvLgP88NXk7MR9PprzIQQjvnHRU9z6wy6wavgrK-VdPiqNF7dsKaAJGW6vVZCzFzVIEKaZCThHpqVYbKdiSfVm08nvWsWEM4fxAgCFY8sAr2pNxY5VHydGc_iP4On3H7MSFh1n7ee-lH88Ao8PLWMWQBYlbR6ZFLin7KKi6lhDOE-b4cAGDgPtl4acdw6ha_AWJPxozJILQkSAesl-BbxWquphTJ-oD_jRl7DvJBSbBw3DPNzXcO4w4SMnnLA",
        "tokenType":"Bearer",
        "expiresIn":3600
    }
    
    Save the access token for the next call. It is valid for 1 hour, so you can reuse it within that time.
  2. Send the file for scanning
    a. Specify the headers. At a minimum, the Authorization header must carry the accessToken
    b. Read the file however your language dictates, but submit it as a multipart form upload
    c. HTTP Post the file and headers to <baseURL> + /api/Scan
    headers = {"Prefer": "respond-async", "Content-Type": form.content_type, 'Authorization': 'Bearer ' + accessToken}
    r = session.post("https://<baseURL to load balancer or friendly URL>/api/Scan", headers=headers, data=form, timeout=4000)
    
    This will return the following response text:
    {
        "dateScanned": "2021-07-02T07:04:18.8896831Z",
        "detectedInfections": [],
        "errorMessage": null,
        "result": "Clean"
    }
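
Because the token is valid for an hour, an application that scans many files can cache it between calls. The following is a minimal sketch of one way to do that, not part of the product's samples: `TokenCache`, `fetch_token`, and the 60-second safety margin are hypothetical, and `fetch_token` is assumed to be a callable you supply that performs the /api/Token request and returns the parsed response.

```python
import time


class TokenCache:
    """Cache an access token and refresh it shortly before it expires.

    fetch_token is any callable returning the parsed /api/Token response,
    a dict with "accessToken" and "expiresIn" (seconds).
    """

    def __init__(self, fetch_token, margin_seconds=60, clock=time.monotonic):
        self._fetch_token = fetch_token
        self._margin = margin_seconds
        self._clock = clock
        self._token = None
        self._expires_at = 0.0

    def get(self):
        now = self._clock()
        # Refresh on first use, or once we are within the safety margin of expiry.
        if self._token is None or now >= self._expires_at - self._margin:
            response = self._fetch_token()
            self._token = response["accessToken"]
            self._expires_at = now + response["expiresIn"]
        return self._token


if __name__ == "__main__":
    # Stand-in for the real HTTP call so the sketch runs without a network.
    def fake_fetch():
        return {"accessToken": "example-token", "tokenType": "Bearer", "expiresIn": 3600}

    cache = TokenCache(fake_fetch)
    print(cache.get())  # example-token
```

Refreshing slightly early avoids sending a scan request with a token that expires while a large upload is still in flight.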
    

Both code samples below are simple but commented, so take a closer look to learn more.

Code Samples


This is a simple command line example with the base URL and the file to scan passed in on the command line.

python ./scanWithAPI.py <username> <password> <base-URL> <file-to-scan>

import json
import sys

import requests
from requests_toolbelt.multipart import encoder

# baseURL is the value found on the API Agent Settings page as the Default DNS
# this can also be a friendly URL which you've mapped within your DNS
baseURL = sys.argv[3]

# /api/Token is the API to retrieve the auth token
# the auth token is valid for 1 hour, so you can re-use if your application can manage it
getTokenURL = baseURL + '/api/Token'

# /api/Scan is the API to pass the file to for scanning
scanFileURL = baseURL + '/api/Scan'

# must specify the content type as JSON when retrieving the auth token
headers = {'Content-type': 'application/json'}

# as part of the /api/Token HTTP Post you must pass the username and password for the 
# user created and configured inside of the Antivirus for Amazon S3 console
# the data block must be passed in JSON format
uname = sys.argv[1]
pw = sys.argv[2]
foo = {"username": "", "password": ""}
foo["username"] = uname
foo["password"] = pw
json_foo = json.dumps(foo)

# make the HTTP post now passing in the username/pw data block and headers
session = requests.Session()
r = session.post(getTokenURL, data=json_foo, headers=headers)

# pull the auth token from the response to use in the scan call below
# valid for 1 hour if you want to re-use
r.raise_for_status()  # stop here if the token request failed
jsonResponse = json.loads(r.text)
accessToken = jsonResponse["accessToken"]

# read file in from wherever it is coming from: form upload, in file system, etc
with open(sys.argv[4], 'rb') as f:
    form = encoder.MultipartEncoder({
        "documents": ("my_file", f, "application/octet-stream"),
        "composite": "NONE",
    })
    # set up headers for the /api/Scan HTTP Post.
    # the only thing you really need is 'Authorization': 'Bearer ' followed by the auth token.
    # Depending on how you are reading the file or the language, you may
    # need to pass more values in the header as seen below
    headers = {"Prefer": "respond-async", "Content-Type": form.content_type, 'Authorization': 'Bearer ' + accessToken}
    r = session.post(scanFileURL, headers=headers, data=form, timeout=4000)

# grab the text from the response and check however you will
# below converts the response to JSON for easy formatting and handling
parsed = json.loads(r.text)
print(json.dumps(parsed, indent=4, sort_keys=True))

# do the next portion of your workflow based on what the scan result is
if parsed['result'] == "Clean":
    print("file was clean")
    #do more work in the workflow here

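session.close()

// C# example. The using directives assume the Newtonsoft.Json package is referenced.
using System;
using System.Collections.Generic;
using System.IO;
using System.Net.Http;
using System.Net.Http.Headers;
using System.Text;
using Newtonsoft.Json;
using Newtonsoft.Json.Linq;

// Host URL will either be the default DNS for your Load Balancer or the DNS CNAME you registered on your own domain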
const string hostUrl = "https://your-lb-or-registered-dns-name";
// If you are using this in an on-demand per-user basis, ideally you would have users provide their own username and password
// If you are using this inside an application, it would be recommended to make a console user solely for this application
const string username = "av-console-username";
// Hard-coding the password is not a recommended practice, 
// ideally it would be stored in something like AWS Secrets Manager, and retrieved by the application to use here,
// or provided by the user if being done on an on-demand basis
const string password = "av-console-password"; 
// This handler is only needed if you are using the default load balancer DNS, as your SSL certificate will not be valid for it.
// If you've registered your own DNS CNAME that is valid for your certificate, this is not needed.
HttpClientHandler handler =
    new()
    {
        ClientCertificateOptions = ClientCertificateOption.Manual,
        ServerCertificateCustomValidationCallback = (_, _, _, _) => true
    };
// Timeout is set to infinite as the file upload may take a long time. This could be lowered if desired, just be aware that
// any file uploads taking longer than that will then fail.
HttpClient httpClient = 
    new(handler)
    {
        Timeout = System.Threading.Timeout.InfiniteTimeSpan
    };
HttpResponseMessage resp = 
    await httpClient.PostAsync(
        $"{hostUrl}/api/token",
        new StringContent(
            JsonConvert.SerializeObject(
                new Dictionary<string, string>
                {
                    {"username", username},
                    {"password", password}
                }),
            Encoding.UTF8,
            "application/json"));
string tokenResponse = await resp.Content.ReadAsStringAsync();
JObject responseJson = JObject.Parse(tokenResponse);
string accessToken = responseJson["accessToken"]?.Value<string>();
httpClient.DefaultRequestHeaders.Authorization = new AuthenticationHeaderValue("Bearer", accessToken);
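// This example is using a file that is stored on disk.
// If for example you already have a stream to the file, just pass that to the new StreamContent() constructor
// tmpFile is assumed here to be the path of the file to scan; it is also used as the
// multipart form field name and file name.
const string tmpFile = "path/to/your/file";
FileStream fileStream = File.OpenRead(tmpFile);
MultipartFormDataContent content =
    new()
    {
        {new StreamContent(fileStream), tmpFile, tmpFile}
    };
// /api/Scan is the API to pass the file to for scanning
string scanUrl = $"{hostUrl}/api/Scan";
HttpResponseMessage scanResult =
    await httpClient.PostAsync(
        scanUrl,
        content);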
Console.WriteLine(await scanResult.Content.ReadAsStringAsync());
fileStream.Dispose();

Scan Results - JSON formatted

{
    "dateScanned": "2021-07-02T07:04:18.8896831Z",
    "detectedInfections": [],
    "errorMessage": null,
    "result": "Clean"
}

{
    "dateScanned": "2021-07-02T07:09:53.8972969Z",
    "detectedInfections": [
        {
            "file": "DemoFiles/eicarcom2.zip/eicar_com.zip/eicar.com",
            "infection": "EICAR-AV-Test"
        },
        {
            "file": "DemoFiles/eicar_com.zip/eicar.com",
            "infection": "EICAR-AV-Test"
        },
        {
            "file": "DemoFiles/infected_bill.pdf",
            "infection": "Troj/PDFJs-AIA"
        },
        {
            "file": "DemoFiles/eicar.com",
            "infection": "EICAR-AV-Test"
        },
        {
            "file": "DemoFiles/urgent_payment.pdf",
            "infection": "Troj/PDFJs-AIA"
        },
        {
            "file": "DemoFiles/eicar.com.txt",
            "infection": "EICAR-AV-Test"
        },
        {
            "file": "DemoFiles/eicar-from-vincent/eicar_-_wiki.txt",
            "infection": "EICAR-AV-Test"
        },
        {
            "file": "DemoFiles/eicar-from-vincent/eicar_-_milestone.txt",
            "infection": "EICAR-AV-Test"
        },
        {
            "file": "DemoFiles/eicar-from-vincent/eicar_-_snippet.txt",
            "infection": "EICAR-AV-Test"
        }
    ],
    "errorMessage": null,
    "result": "Infected"
}
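
An application will usually turn these responses into an accept or reject decision. The sketch below shows one possible way to do that and is not part of the product's samples: `summarize_scan` is a hypothetical helper, and it assumes that any result other than "Clean", or any non-null `errorMessage`, should cause the file to be rejected. Only the "Clean" and "Infected" result values are shown in the responses above; other values are treated conservatively.

```python
import json


def summarize_scan(response_text):
    """Return (accepted, details) for a /api/Scan response body."""
    parsed = json.loads(response_text)
    if parsed.get("errorMessage"):
        return False, ["scan error: " + parsed["errorMessage"]]
    if parsed.get("result") == "Clean":
        return True, []
    # Anything other than "Clean" is rejected; list what was detected.
    details = [
        "{}: {}".format(hit["file"], hit["infection"])
        for hit in parsed.get("detectedInfections", [])
    ]
    return False, details or ["result: " + str(parsed.get("result"))]


if __name__ == "__main__":
    infected = json.dumps({
        "dateScanned": "2021-07-02T07:09:53.8972969Z",
        "detectedInfections": [
            {"file": "DemoFiles/eicar.com", "infection": "EICAR-AV-Test"},
        ],
        "errorMessage": None,
        "result": "Infected",
    })
    print(summarize_scan(infected))
```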

Last update: August 27, 2021