Retro Scanning for Pre-Existing Files

Retro Scanning is the scanning of existing objects. Whether you want to scan all of your existing data when you first set up the product or you have compliance requirements that dictate a regular scanning cadence, Retro Scanning lets you scan and re-scan your existing S3 objects. Unlike event-driven scanning, which is triggered by events and looks at all "new" objects coming into buckets, Retro Scanning crawls S3 to determine the work set.

The first time you enable a bucket for event scanning, you'll be prompted to scan existing data. The time window defaults to the beginning of time (March 14, 2006) through the current time, and it can be changed. The scan does not continue past the time of activation because event scanning handles everything from that point forward. If you trigger Retro Scanning without a bucket activation, you will be prompted to pick a time window for the scan. This can be all time or a window of your choosing.
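The default window described above can be sketched in a few lines. This is a minimal illustration, not the product's implementation: the names `S3_EPOCH`, `default_window`, and `in_window` are hypothetical, and the sketch assumes the window is inclusive at both ends and is compared against each object's last-modified timestamp, neither of which is specified here.

```python
from datetime import datetime, timezone

# "Beginning of time" as stated above: March 14, 2006 (midnight UTC is an assumption).
S3_EPOCH = datetime(2006, 3, 14, tzinfo=timezone.utc)


def default_window(activated_at):
    """Return the default (start, end) window for a first-time bucket activation."""
    return (S3_EPOCH, activated_at)


def in_window(last_modified, window):
    """True if the timestamp falls inside the window (inclusive bounds are an assumption)."""
    start, end = window
    return start <= last_modified <= end


# Hypothetical activation time and two hypothetical object timestamps.
activated = datetime(2024, 1, 15, 12, 0, tzinfo=timezone.utc)
window = default_window(activated)
before = datetime(2023, 6, 1, tzinfo=timezone.utc)   # existed before activation
after = datetime(2024, 2, 1, tzinfo=timezone.utc)    # arrives after activation

print(in_window(before, window))  # object is picked up by the retro scan
print(in_window(after, window))   # object is left to event scanning
```

Choosing a narrower window only changes the `(start, end)` pair; the membership check stays the same.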

The process works as follows:

  1. You have existing objects within your AWS account(s)

  2. Via the console, you trigger a scan for existing objects (on-demand or through a schedule). This can be all objects within the bucket(s) or a subset based on a time window.

  3. Antivirus for Amazon S3 spins up one or more Fargate Run Tasks for each on-demand request or schedule. These tasks run the crawling process, which crawls all objects within the bucket(s) and adds entries to a temporary Retro SQS Queue created specifically for the job.

    Amazon S3 doesn't allow for simple searching and segmenting of the objects to process, so all objects must be crawled. Only the objects that match the time window are added to the queue for processing. If you have 1 million objects in a bucket and the time window covers only 50,000 of them, all 1 million will be crawled, but only the 50,000 will be scanned.

    Antivirus for Amazon S3 will spin up enough Run Task scanning agents to complete the work in about 1 hour. Depending on the volume of the job (number of buckets, number of objects, size of buckets), this could be a large number of agents. Ultimately, this has no effect on cost: running 1 agent for 100 hours costs the same as running 100 agents for 1 hour.

    You may hit service quota limits when running large jobs or multiple jobs at the same time. The default limit on the number of these tasks is 1000 in most accounts. If a job needs more than 1000 tasks, the process will not stop or break; it will just take longer than it would if more tasks could be spun up. If you know your loads will typically require more tasks than that, you can ask AWS to increase your limits.

  4. Once crawling has completed for a job, a new set of Fargate Run Tasks is spun up to perform the scanning. Antivirus for Amazon S3 automatically spins up the number of scanning agents needed to complete the scan.

    The retro agents are the same agents as the event agents, but they run as Fargate Run Tasks. These agents destroy themselves when the job has completed.

  5. Entries identifying the objects to scan are pulled from the queue. Each object is retrieved and scanned.
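Steps 3 through 5 can be illustrated with a small, self-contained simulation. It is a sketch under stated assumptions, not the product's code: `crawl`, `agents_needed`, `estimated_hours`, `agent_hours`, the per-agent throughput figures, and the in-memory object list are all hypothetical. Only the one-hour target, the default limit of 1000 tasks, and the crawl-everything, scan-only-matches behavior come from the text above.

```python
import math
from collections import deque
from datetime import datetime, timedelta, timezone

TASK_QUOTA = 1000     # default task limit cited above
TARGET_HOURS = 1.0    # the job aims to finish in about one hour


def crawl(objects, start, end):
    """Visit every object; enqueue only those last modified inside the window."""
    queue = deque()
    crawled = 0
    for key, last_modified in objects:
        crawled += 1
        if start <= last_modified <= end:
            queue.append(key)
    return crawled, queue


def agents_needed(queued, objects_per_agent_hour, quota=TASK_QUOTA):
    """Return (agents wanted for the one-hour target, agents allowed by the quota)."""
    wanted = math.ceil(queued / (objects_per_agent_hour * TARGET_HOURS))
    return wanted, min(wanted, quota)


def estimated_hours(queued, agents, objects_per_agent_hour):
    """Rough duration if the work is spread evenly across the agents."""
    return queued / (agents * objects_per_agent_hour)


def agent_hours(agents, hours):
    """Total agent-hours, assuming cost scales with this product as described above."""
    return agents * hours


# A scaled-down version of the 1 million / 50,000 example: 10,000 objects, 5% in window.
base = datetime(2020, 1, 1, tzinfo=timezone.utc)
objects = [(f"obj-{i}", base + timedelta(hours=i)) for i in range(10_000)]
crawled, queue = crawl(objects, base + timedelta(hours=9_500), base + timedelta(hours=9_999))
print(crawled, len(queue))  # every object is crawled, only the matches are queued

# A job that would want 2000 agents is capped at the quota and simply runs longer.
wanted, allowed = agents_needed(2_000_000, objects_per_agent_hour=1000)
print(wanted, allowed, estimated_hours(2_000_000, allowed, 1000))

# Same total agent-hours either way, hence the same cost under the stated assumption.
print(agent_hours(1, 100) == agent_hours(100, 1))
```

The throughput value passed as `objects_per_agent_hour` is made up for illustration; real throughput depends on object sizes and is not stated in this document.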

On-Demand Scanning

On-Demand Scanning is a user-initiated scan from within the management console GUI. From the Bucket Protection page, a user can select one or many buckets and a time window to scan objects on a one-off basis. This scan is treated and tracked as a "job" on the Jobs page under the Monitoring section of the console.

Scheduled Scanning

Scheduled Scanning allows you to process new or existing files on a schedule. Instead of processing new files as they come in, your workflow may allow them to be scanned once per day. Or, for compliance reasons, you may be required to scan all of your files on a quarterly basis; Scheduled Scanning allows you to do that. Learn more about schedule-based scanning. Scheduled scans are treated and tracked as a "job" on the Jobs page under the Monitoring section of the console.
