Sizing Discussion
File size and the number of files you plan on scanning will impact how you scale your deployment to fit your needs.
To get a feel for scale and performance we ran a series of tests with a number of file sizes: 100KB, 1MB, 10MB, 100MB, 500MB, 1GB, 1.5GB, and 2GB (the current maximum file size) with the ClamAV engine. We tested all the same file sizes with the Sophos engine, but also added 5GB, 10GB, 50GB, 100GB, and 150GB files. The goal here was to find two things:
What was the throughput in GB and object counts?
Is the scale linear as you add scanning agents?
We ran these tests systematically with a fixed number of agents in a controlled environment. We later ran similar tests with auto-scaling in place to watch the scanning agents spin up and down as the load backed off, and we saw similar results with greater numbers of agents.
We tested uniquely generated junk files with the hashing function turned off to ensure that each file would be fully scanned. This allows for more accurate throughput metrics for our scanner agents.
Your Mileage May Vary
These tests are not real world tests with your particular data sets. This is purely to give you a feel for how your environment may behave and allow you to make deployment decisions. Please test with files similar to what you will see in production. We'd love to have you Contact Us to report findings and see how your environment matches up or how we may help you get the most out of it.
Event Driven scanning and Scan Existing (Retro) were both tested and showed similar per-agent, per-hour scanning results. Event Driven Scanning does NOT include time spent copying or uploading to the bucket. Scan Existing does NOT include bucket crawling.
In practice, scan existing will start many agents, usually enough to complete scanning the entire bucket of objects, however large, within an hour or less.
Where 300 GB/hr is reported, we actually observed initial speeds from 200 GB/hr to 600 GB/hr.
After throttling (1 to n hours later) we observed speeds as low as 100 GB/hr (and as high as 300 GB/hr).
Testing was done in us-east-1, but a few tests in us-east-2 ran about 20% faster.
Throughput Table
Here are the averages of the results we observed before throttling:
Throughput results for the CrowdStrike engine will be released in the future. If you have specific questions on throughput for a scanning agent using CrowdStrike please Contact Us.
| File Size | ClamAV Engine S3 Integrated (GB/hr) | ClamAV Engine S3 Integrated (~Files/hr) | Sophos Engine S3 Integrated (GB/hr) | Sophos Engine S3 Integrated (~Files/hr) |
|---|---|---|---|---|
| 100 KB files | ~1.75 | ~17,500 | ~2.25 | ~22,500 |
| 1 MB files | ~6.5 | ~6,500 | ~20 | ~20,000 |
| 10 MB files | ~9 | ~900 | ~100 | ~10,000 |
| 100 MB files | ~9 | ~90 | ~200 | ~2,000 |
| 500 MB files | ~9 | ~18 | ~300* | ~600 |
| 1 GB files | ~9 | ~9 | ~300* | ~300 |
| 2 GB files | ~9 | ~4.5 | ~300* | ~150 |
| 5 GB files | X | X | ~300* | ~60 |
| 10 GB files | X | X | ~300* | ~30 |
| 50 GB files | X | X | ~300* | ~6 |
| 100 GB files | X | X | ~300* | ~3 |
| 150 GB files | X | X | ~300* | ~2 |
Linear Scale Out for S3 Integrated
The table above shows the results of a single scanning agent being bombarded with objects to reach an upper-end, but sustainable, throughput value. We noticed in our testing that as you add scanning agents, you simply increase the throughput by the same values above for each additional agent. Just multiply the GB/hr and Files/hr values by the number of agents to see what throughput looks like with 2 to N scanning agents.
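To make that linear scale-out arithmetic concrete, here is a minimal sketch (Python, with hypothetical names and the approximate ClamAV figures from the table above; these are illustrative assumptions, so measure your own workload before relying on them):

```python
# Approximate single-agent throughput taken from the table above (ClamAV engine, S3 Integrated).
# These values are assumptions for illustration only; your environment will differ.
CLAMAV_GB_PER_HR = {"100KB": 1.75, "1MB": 6.5, "10MB": 9, "100MB": 9, "1GB": 9, "2GB": 9}
CLAMAV_FILES_PER_HR = {"100KB": 17_500, "1MB": 6_500, "10MB": 900, "100MB": 90, "1GB": 9, "2GB": 4.5}

def scaled_throughput(file_size: str, agents: int) -> tuple[float, float]:
    """Linear scale-out: N agents deliver roughly N times the single-agent throughput."""
    return CLAMAV_GB_PER_HR[file_size] * agents, CLAMAV_FILES_PER_HR[file_size] * agents

gb_hr, files_hr = scaled_throughput("1MB", agents=3)
print(f"3 agents on 1MB files: ~{gb_hr} GB/hr, ~{files_hr:,.0f} files/hr")
```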
With the testing we've done on API file scanning, we have seen significant performance increases at all file sizes up through 1.5GB. We have not done extended testing at this time, so we cannot post full results. And your mileage will vary based on network performance and latency.
We are happy to have a discussion with you on these metrics. Please Contact Us if you'd like to learn more.
So what does this mean?
There have always been questions around what is required to meet the business needs when adopting a new solution.
How much infrastructure do I need?
Do I scale up or scale out?
Do I need to run it all the time?
Am I trying to get a certain amount of work done in a particular window of time or can it take as long as it wants?
The answers to these questions can help you determine how you want to run the solution. The simple answer, taken with a "your mileage may vary" consideration, is to look at your environment and see the types of files you deal with and their average size. Apply that to the chart above to get a baseline for the amount of work the scanning agents can achieve.
For example, let's say most of your files are approximately 1MB in size. A single agent can do ~7,000 of those files an hour. How many files per hour or per day are you receiving? Do you need to scan them in "realtime" as they come in throughout the day, or within a certain scan window? How old will you allow an object to get before it is scanned?
Extending the example, let's say 7,001 files come in all at once. A single agent will evaluate those in an hour (2 per second), but many of the files will sit there for tens of minutes, up to a full hour for that 7,001st file. Is that ok? If not, then we have to judge the impact of scaling additional agents in this scenario. Adding a second agent roughly doubles the throughput, so we're now at 14k per hour (4 per second), and you can evaluate the files in ~30 minutes instead of 60. Your oldest file would wait at most 30 minutes before being scanned. Adding a third agent takes you down to ~20 minutes, and so on.
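A minimal sketch of that backlog arithmetic (Python, with a hypothetical function name), assuming throughput scales linearly with agent count:

```python
def backlog_clear_minutes(files_in_burst: int, files_per_hr_per_agent: float, agents: int) -> float:
    """How long (in minutes) the last file in a burst waits before it is scanned."""
    return files_in_burst / (files_per_hr_per_agent * agents) * 60

# The 1MB example above: ~7,000 files/hr per agent, 7,001 files arriving at once.
print(backlog_clear_minutes(7_001, 7_000, agents=1))  # ~60 minutes
print(backlog_clear_minutes(7_001, 7_000, agents=2))  # ~30 minutes
print(backlog_clear_minutes(7_001, 7_000, agents=3))  # ~20 minutes
```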
With that, you can start to think through how you want to drive your system. The main configuration available to you today for this is the Number of Messages in Queue to Trigger Agent Auto-Scaling value set during deployment. It can be modified after the fact if you find your original choice is not allowing you to meet your goals. The way the auto-scaling works is that the queue must hold the number of entries you specified for at least 1 minute to trigger the alarm that then generates the scaling event. Scaling back down works in much the same way, but in the opposite direction.
In the scenario above, how could you ensure no item waits more than 4 minutes? Looking at the numbers, a single agent can scan ~120 of those files per minute and therefore ~480 in 4 minutes. As soon as more than 120 entries sit in the queue for longer than 1 minute, you are starting to fall behind. It isn't until the queue has held ~480 entries for longer than a minute that you may miss that 'at most 4 minutes' scan window. So the queue value you specify could be somewhere between 240 and 360, which allows for the time it takes to spin another agent up. If files are coming in so fast that the queue backs up and sits above 700 entries for a minute, another alarm triggers a scaling event for yet another agent to spin up, and so on. The queue value you pick during deployment is used in multiples of queue entries to trigger scaling events up and down, and this choice should allow scanning agents to spin up on demand to keep serving that 4-minute window. As entries drop below those multiples in the queue, scanning agents will start to spin down.
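Here is a minimal sketch of that queue-threshold reasoning (Python; the 50-75% band is an assumption drawn from the 240-360 suggestion above, not a product setting):

```python
def queue_trigger_band(files_per_min_per_agent: float, max_age_minutes: float,
                       low_fraction: float = 0.5, high_fraction: float = 0.75) -> tuple[int, int]:
    """Suggest a queue-depth band for the auto-scaling trigger.

    hard_limit is the queue depth at which a single agent can no longer scan the
    oldest entry within the target window; triggering somewhere below it leaves
    time for a new agent to spin up."""
    hard_limit = files_per_min_per_agent * max_age_minutes  # ~480 in the example above
    return int(hard_limit * low_fraction), int(hard_limit * high_fraction)

print(queue_trigger_band(120, max_age_minutes=4))  # (240, 360), the band discussed above
```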
In this scenario, you are receiving more than 120 files per minute. If you never have this type of inflow, then a single agent will always keep up and you are always within a few seconds to a minute of scanning. The idea to take away from this section is to evaluate the inflow of objects along with the size of the objects and determine your acceptable scan window. Maybe it isn't 4 minutes, but rather 4 seconds. Thinking through how that changes your deployment allows you to determine the scaling values.
The alternative to good queue choices is to just brute force it by upping the minimum running agents. This will add infrastructure costs, but you'll have the agents ready and waiting for the loads to come in.
As items are peeled off the queue, scaling contractions will happen and the scanning agents will drop off. There is a cool down period, so you may notice they don't immediately drop off, but how AWS manages it seems reasonable.
In the brute force scenario the agents won't contract as you have set the minimum. You'd have to change that value directly if you wanted it reduced.
Other Sample Scenarios (using the slower ClamAV throughputs):

* 100GB of 1GB files (100 files) in 4 hours: 3 agents with an auto-scaling queue of 10
  * Baseline: 1 agent = 10 GB/hr, 3 agents = 30 GB/hr
* 200GB of 100KB files (2,000,000 files) in 3 hours: 34 agents with an auto-scaling queue of 17,000
  * Baseline: 1 agent = 1.77 GB/hr; assume 5 agents = 10 GB/hr, 50 agents = 100 GB/hr
* 1TB of 100MB files (10,000 files) in 2 hours: 50 agents with an auto-scaling queue of 100
  * Baseline: 1 agent = 10 GB/hr, 50 agents = 500 GB/hr
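As a rough cross-check on these scenarios, here is a minimal sketch (Python) that computes the agent count needed to finish a given volume within a deadline, using the rounded baselines stated above:

```python
import math

def agents_for_deadline(total_gb: float, gb_per_hr_per_agent: float, hours: float) -> int:
    """Agents needed to scan total_gb within the deadline, assuming linear scale-out."""
    return math.ceil(total_gb / (gb_per_hr_per_agent * hours))

print(agents_for_deadline(100, 10, 4))    # 100 GB of 1 GB files in 4 hours   -> 3 agents
print(agents_for_deadline(200, 2, 3))     # 200 GB of 100 KB files in 3 hours -> 34 agents (~2 GB/hr per agent)
print(agents_for_deadline(1000, 10, 2))   # 1 TB of 100 MB files in 2 hours   -> 50 agents
```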