Sizing Discussion

Background

To get a feel for scale and performance, we ran a series of tests across a range of file sizes: 100kb, 1mb, 10mb, 100mb, 500mb, 1gb, 1.5gb, and 2gb (the current maximum file size). The goal was to answer two questions: (1) what is the throughput, in GB and in object count, and (2) does throughput scale linearly as you add scanning agents? We tested this systematically with up to 3 agents in a controlled environment. We later ran similar tests with auto-scaling in place, watching agents spin up and then back down as the load backed off, and found similar results with more than 3 agents.

Originally, we simply ran a lot of files through the system and monitored it hour to hour (really minute to minute). That is why the first runs used 100,000 files each at 100kb and 1mb, and ~50,000 files at 10mb. After that, we decided to run a fixed amount (50gb) of each of the larger file sizes to determine overall throughput.

We tested with uniquely generated junk files and with the hashing function turned off, so that every file would be fully scanned. This gives more accurate throughput metrics for our scanning agents.

Your Mileage May Vary

These are not real-world tests against your particular data sets. They are purely meant to give you a feel for how your environment may behave and to help you make deployment decisions. Please test with files similar to what you will see in production. We'd love to have you Contact Us to report your findings, see how your environment matches up, or find out how we can help you get the most out of it.

Throughput Chart

Here are the initial results we found:

| Test run | 1 agent | Files per hour (1 agent) | 2 agents | Files per hour (2 agents) | 3 agents | Files per hour (3 agents) |
|---|---|---|---|---|---|---|
| 10gb 100kb files | 1.77gb/hr | 17,610 | 3.82gb/hr | 38,200 | 6.1gb/hr | 61,020 |
| 100gb 1mb files | 7gb/hr | 7,000 | 15.5gb/hr | 15,500 | 20.83gb/hr | 21,000 |
| 468.71gb 10mb files | 9.44gb/hr | 1,000 | 19.02gb/hr | 2,000 | 28.2gb/hr | 3,000 |
| 50gb 100mb files | ~10gb/hr | 100 | ~20gb/hr | 200 | ~30gb/hr | 300 |
| 50gb 500mb files | ~10gb/hr | 20 | ~20gb/hr | 40 | ~30gb/hr | 60 |
| 50gb 1gb files | ~10gb/hr | 10 | ~20gb/hr | 20 | ~30gb/hr | 30 |
| 50gb 1.5gb files | ~10gb/hr | 7 | ~20gb/hr | 15 | ~30gb/hr | 20 |
| 50gb 2gb files | ~10gb/hr | 5 | ~20gb/hr | 10 | ~30gb/hr | 15 |
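To check the "is the scale linear" question against the chart, you can divide each multi-agent throughput by the single-agent figure. The snippet below is a minimal sketch that does this for the three smallest file sizes, using the GB/hr values copied from the chart. The names `MEASURED_GB_PER_HOUR` and `scaling_factors` are ours, made up for illustration.

```python
# GB/hr for 1, 2, and 3 agents, copied from the throughput chart.
MEASURED_GB_PER_HOUR = {
    "100kb": (1.77, 3.82, 6.1),
    "1mb": (7.0, 15.5, 20.83),
    "10mb": (9.44, 19.02, 28.2),
}


def scaling_factors(throughputs):
    """Return each throughput divided by the single-agent throughput."""
    base = throughputs[0]
    return [round(value / base, 2) for value in throughputs]


for file_size, row in MEASURED_GB_PER_HOUR.items():
    # A perfectly linear result would print [1.0, 2.0, 3.0].
    print(file_size, scaling_factors(row))
```

Factors close to 2 and 3 suggest roughly linear scaling for that file size; the larger file sizes in the chart are already reported as ~10, ~20, and ~30gb/hr.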

So what does this mean?

There have always been questions around what is required to meet business needs when adopting a new solution. How much infrastructure do I need? Do I scale up or scale out? Do I need to run it all the time? Am I trying to get a certain amount of work done in a particular window of time, or can it take as long as it wants? The answers to these questions help you determine how you want to run the solution. The simple answer, taken with the "your mileage may vary" caveat, is to look at your environment and identify the types of files you deal with and their average size. Apply that to the chart above to get a baseline for the amount of work the scanning agents can achieve.

For example, let's say most of your files are approximately 1mb in size. A single agent can do ~7000 of those files an hour. How many files per hour or per day are you receiving? Do you need to scan them in "realtime" as they come in throughout the day, or in a certain scan window? How old will you allow an object to get before it is scanned? Extending the example, let's say 7001 files come in all at once. A single agent will evaluate those in an hour (about 2 per second), but many of the files will sit there for tens of minutes, and the 7001st file will wait a full hour. Is that ok? If not, then we have to judge the impact of adding agents in this scenario. Adding a second agent roughly doubles the throughput, so we're now at 14k per hour (about 4 per second) and you can evaluate the files in ~30 minutes instead of 60. Your oldest file would wait at most 30 minutes before getting scanned. Adding a third agent takes you down to ~20 minutes, and so on.
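The arithmetic in this example can be sketched in a few lines. This assumes throughput scales linearly with the number of agents, which is what the chart roughly shows for 1mb files. The function name `drain_minutes` is ours, for illustration only.

```python
def drain_minutes(files_waiting, files_per_hour_per_agent, agents):
    """Minutes until the last waiting file is scanned, assuming linear scaling."""
    return files_waiting / (files_per_hour_per_agent * agents) * 60


# 7001 files of ~1mb arriving at once, ~7000 files/hr per agent (from the chart).
for agents in (1, 2, 3):
    print(agents, "agent(s):", round(drain_minutes(7001, 7000, agents)), "minutes")
```

Swap in the files-per-hour figure for your own typical file size to see how long the oldest file would wait.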

With that, you can start to think through how you want to drive your system. The main configuration available to you today for this is the Number of Messages in Queue to Trigger Agent Auto-Scaling value, which you set during deployment. It can be modified after the fact if you find your original choice is not allowing you to meet your goals. Auto-scaling works like this: the queue must hold the number of entries you specified during deployment for at least 1 minute to trigger the alarm, which then generates the scaling event. Backing off works in much the same way, in the opposite direction.
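To make the "at least 1 minute" condition concrete, here is a toy model of it. This is not how AWS evaluates alarms internally; it only illustrates the idea that a brief spike above the trigger value does not cause a scaling event, while a sustained backlog does. The function name, the 10-second sampling interval, and the sample data are all assumptions made up for this sketch.

```python
def alarm_fires(depth_samples, trigger_value, sample_seconds=10, sustain_seconds=60):
    """True if queue depth stays at or above trigger_value for sustain_seconds in a row.

    depth_samples is a list of queue depths taken every sample_seconds
    (a hypothetical sampling interval chosen for illustration).
    """
    needed = sustain_seconds // sample_seconds
    run = 0
    for depth in depth_samples:
        # Extend the current run, or reset it when depth dips below the trigger.
        run = run + 1 if depth >= trigger_value else 0
        if run >= needed:
            return True
    return False


sustained = [300, 310, 305, 320, 330, 340]  # above 240 for a full minute
spike = [300, 300, 100, 300, 300, 300]      # dips below 240 partway through
print(alarm_fires(sustained, 240))
print(alarm_fires(spike, 240))
```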

In the scenario above, how could you ensure no item is more than 4 minutes old? Looking at the numbers, a single agent can do ~120 of those files per minute and therefore ~480 in 4 minutes. As soon as you see more than 120 entries in the queue for longer than 1 minute, you are starting to fall behind. It isn't until the queue has held ~480 entries for longer than a minute that you may miss the "at most 4 minutes" scan window. So the queue value you may want to specify could be between 240 and 360, which allows for the time it takes to spin another agent up. If files are coming in so fast that the queue backs up and sits above 700 entries for a minute, another alarm triggers a scaling event for another agent to spin up, and so on. In other words, the queue value you pick during deployment is used in multiples of queue entries to trigger scaling events up and down. This choice should allow scanning agents to spin up on demand and keep serving that 4-minute window. As the number of entries drops below those multiples, scanning agents will start to spin down.
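The numbers in this 4-minute example can be reproduced with a small sketch. The 50% and 75% fractions are assumptions reverse-engineered from the 240 to 360 suggestion, not product settings. The `extra_agents_requested` function is a simplified reading of "multiples of queue entries" that ignores the 1-minute alarm duration and the cool down period. All function names are hypothetical.

```python
def max_backlog(files_per_minute, window_minutes):
    """Largest queue depth one agent can clear inside the scan window."""
    return files_per_minute * window_minutes


def trigger_range(files_per_minute, window_minutes, low=0.5, high=0.75):
    """Candidate range for the queue trigger value.

    low and high are assumed fractions of the maximum backlog, leaving
    headroom for the time it takes a new agent to spin up.
    """
    backlog = max_backlog(files_per_minute, window_minutes)
    return int(backlog * low), int(backlog * high)


def extra_agents_requested(queue_depth, trigger_value):
    """Simplified model: one additional agent per full multiple of the trigger."""
    return queue_depth // trigger_value


print(max_backlog(120, 4))               # 480
print(trigger_range(120, 4))             # (240, 360)
print(extra_agents_requested(700, 350))  # 2
```

For a different scan window, change `window_minutes` and use the files-per-minute rate for your own file size.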

In this scenario, you are receiving more than 120 files per minute. If you never have this type of inflow, then a single agent will always keep up and you are always within a few seconds to a minute of scanning. The idea to take away from this section is to evaluate the inflow of objects along with the size of the objects and determine your acceptable scan window. Maybe it isn't 4 minutes, but rather 4 seconds. Thinking through how that changes your deployment allows you to determine the scaling values.

The alternative to good queue choices is to just brute force it by upping the minimum running agents. This will add infrastructure costs, but you'll have the agents ready and waiting for the loads to come in.

Note

As items are peeled off the queue, scaling contractions will happen and the scanning agents will drop off. There is a cool down period, so you may notice they don't drop off immediately, but how AWS manages it seems reasonable.

In the brute force scenario, the agents won't contract below the minimum you have set. You'd have to change that value directly if you wanted it reduced.

Other Sample Scenarios:

  • 100GB of 1GB files (100 files) in 4 hours: 3 agents with an autoscaling queue of 10
    • Baseline: 1 agent = 10gb/hr, 3 agents = 30gb/hr
  • 200GB of 100kb files (2,000,000 files) in 3 hours: 34 agents with an autoscaling queue of 17,000
    • Baseline: 1 agent = 1.77gb/hr, Assume 5 agents = 10gb/hr, 50 agents = 100gb/hr
  • 1TB of 100MB files (10,000 files) in 2 hours: 50 agents with an autoscaling queue of 100
    • Baseline: 1 agent = 10gb/hr, 50 agents = 500gb/hr
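The agent counts in these scenarios follow from dividing the required GB/hr by the per-agent baseline and rounding up. Here is a minimal sketch, assuming linear scaling and treating 1TB as 1000GB. For the 100kb case it uses the rounded 2gb/hr per agent implied by "5 agents = 10gb/hr". The function name `agents_needed` is ours.

```python
import math


def agents_needed(total_gb, hours, gb_per_hour_per_agent):
    """Smallest number of agents that finishes total_gb within hours."""
    required_gb_per_hour = total_gb / hours
    return math.ceil(required_gb_per_hour / gb_per_hour_per_agent)


print(agents_needed(100, 4, 10))    # 100GB of 1GB files in 4 hours
print(agents_needed(200, 3, 2))     # 200GB of 100kb files in 3 hours
print(agents_needed(1000, 2, 10))   # 1TB of 100MB files in 2 hours
```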

Last update: October 30, 2020