Issues with premature optimization

Sven-Erik Jonsson 26.04.2024 08.18.00

Donald Knuth’s famous remark, “We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil”, is often shortened to just “premature optimization is the root of all evil” and then falsely interpreted as “performance issues are something that can be addressed in later iterations of development”, not something to worry about in the beginning.

I’m arguing that performance is something that can be achieved in part by having a solid understanding of the code that is written and by selecting the correct tools (data structures and algorithms) for the job.

Most of the .NET performance issues that I’ve come across (and sometimes caused myself) stem from misunderstandings of how the underlying code will execute. One solid principle (pun intended) is to not make assumptions about what is optimal and to always measure performance with a profiler or a benchmarking suite. This isn’t always straightforward: measuring improvements requires either multiple sequential executions with manual record keeping between runs, or a setup capable of feature switching between different implementations of the tested code. To be able to use the second approach on more complex classes like services or repositories, following the dependency injection pattern is recommended. If you have made your code fully testable and implemented unit testing, comparative benchmarking comes as a bonus.

When comparative benchmarking is set up, optimizations aren’t going to be implemented based on assumptions anymore. They will be backed up by real data, removing the word “premature” from the equation.

To establish a baseline for how long different operations take in relation to each other, here is a list of read operations with estimated times, for reference.

| Operation | Estimated time | Order of magnitude |
|---|---|---|
| CPU instruction | ~1 nanosecond (ns) | 10⁻⁹ seconds |
| CPU cache (L2, L3) | ~10 nanoseconds (ns) | 10⁻⁸ seconds |
| Memory (RAM) | ~100 nanoseconds (ns) | 10⁻⁷ seconds |
| Fast storage (SSD) | ~1 microsecond (μs) | 10⁻⁶ seconds |
| Hard drive (HDD) | ~1 millisecond (ms) | 10⁻³ seconds |
| Network | ~10-20 milliseconds (ms) | 10⁻² seconds |

Here are some observations that I’ve made when benchmarking. Before reading further you should know that all these examples are taken from actual business logic in customer projects.

Memory allocation and garbage collection

Since .NET is a memory-managed platform whose whole point is to automate memory management, there is less control over how memory is handled by the runtime.

Allocation is mostly done implicitly, on the stack or on the heap depending on whether your variables are value types or reference types (struct/class), or explicitly with expressions like stackalloc for arrays.

Please note that the above statement is a simplification; what ends up on the stack is not always as simple as "value types / structs". Sometimes there isn't even a stack.

Allocating memory on the heap will eventually lead to automated garbage collection, which can be executed at any time the .NET runtime decides that it is necessary. This will be done with either a workstation or a server strategy depending on project configuration. The garbage collector classifies objects in memory as belonging to generation 0, 1 or 2 depending on how many collections they have survived. Due to how the garbage collector compacts the memory, collecting gen 0 is a lot quicker; it can be thought of as sitting “on top” of the heap, with increasing generations further down.
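
As a small illustration (not taken from the customer projects), the runtime exposes both the configured strategy and an object's current generation:

var item = new object();
Console.WriteLine(System.Runtime.GCSettings.IsServerGC); // false under the default workstation GC
Console.WriteLine(GC.GetGeneration(item));               // 0 - new objects start in gen 0

GC.Collect();                                            // forced collection, for demonstration only
Console.WriteLine(GC.GetGeneration(item));               // typically 1 - the surviving, still-rooted object was promoted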

Given this increasing cost of garbage collection, if allocation is needed at all, the following is preferable in decreasing order.

Allocating on the stack

int itemCount = 0;

This can be achieved where the logic only contains simple data structures like value types, spans, or small stack-allocated arrays (arrays are otherwise kept on the heap). The total stack memory on a thread must be kept under 1 MB (the default stack size).

Please note that there are nuances. For example, if a struct is a member of a reference type, is captured inside a closure, is part of a collection, is boxed, or for a number of additional reasons... the value will end up on the heap.

If successful, garbage collection becomes a free operation. Exceeding the stack limit, however, throws a fatal StackOverflowException.
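
A minimal sketch of stack-only allocation with stackalloc and Span<T> (illustrative, not from the project):

Span<int> buffer = stackalloc int[16]; // lives in the current stack frame, invisible to the GC

for (var i = 0; i < buffer.Length; i++)
{
    buffer[i] = i * i;
}

var sum = 0;
foreach (var value in buffer)
{
    sum += value;
}
// The buffer is gone as soon as the method returns - no collection needed.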

Allocating on the heap (no collection)

private static readonly List<T> _items = new();

If the memory allocated on the heap is rooted for the entire execution, the garbage collector will never collect it. A drawback is that this memory is not reclaimed until the application exits. Note that, in excess, this kind of static state can lead to unmaintainable code.

Allocating on the heap (gen 0 collection)

List<T> items = new();

If there are no references to an object, the memory on the heap is reclaimed in the first generation of garbage collection. This is a relatively quick operation that will not noticeably affect overall performance until the amount of memory reaches hundreds of MB. That being said, your business logic will need references to objects in memory, and depending on how much work can be saved, this can be a complex trade-off that is worth some thought, without going too deep.

Use data structures to their full potential

.NET provides a base set of collections in the System.Collections.Generic namespace. These can be used for many purposes, and some ways of using them are better than others.

List<T>

Many might know the generic List<T>, a list that grows automatically, allowing any number of items to be added. Internally the list has an array that is resized as the list needs to grow. The resize happens when the number of items exceeds the Capacity property. If a capacity isn’t set, the list will resize multiple times as items are added. If the number of items is known ahead of time, the capacity can be set in the constructor and multiple resizes can be avoided, as shown below.
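
A sketch of the difference (numbers chosen for illustration):

// Without a capacity the backing array starts empty and is replaced with a
// larger one (4, 8, 16, ...) every time it fills up.
var resized = new List<int>();

// With the count known ahead of time, a single backing array is allocated once.
var presized = new List<int>(capacity: 10_000);
for (var i = 0; i < 10_000; i++)
{
    presized.Add(i);
}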

HashSet<T>

A hash set is an indexed structure that allows rapid testing of whether an item is present or not. The check has O(1) time complexity and scales extremely well with many items. It has a capacity, similar to List<T>, but also uses an internal IEqualityComparer<T> (which can be customized) to determine object equality.
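
For example, with a comparer controlling what counts as equal (illustrative values):

// Membership checks are O(1); the comparer decides equality and hashing.
var allowedUnits = new HashSet<string>(StringComparer.OrdinalIgnoreCase)
{
    "PCS", "BOX", "PAL"
};

var isAllowed = allowedUnits.Contains("pcs"); // true, thanks to the case-insensitive comparer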

Dictionary<TKey, TValue>

A structure that builds on both the list and the hash set, where a value of a given type can be used as a key to find a single item. The item can be found in O(1) time.

Lookup<TKey, TValue>

A structure that is similar to the dictionary but allows multiple items to be stored under a single key. It is created with the LINQ ToLookup extension method, as shown below.
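
A short sketch of both structures, using a hypothetical Price record (not from the project):

var prices = new List<Price> { new("A-1", 10m), new("A-1", 9m), new("B-2", 25m) };

// Dictionary: one value per key, retrieved in O(1) with TryGetValue.
var cheapestByVariant = prices
    .GroupBy(p => p.VariantNumber)
    .ToDictionary(g => g.Key, g => g.Min(p => p.Amount));

if (cheapestByVariant.TryGetValue("A-1", out var cheapest))
{
    Console.WriteLine(cheapest); // 9
}

// Lookup: several values per key, created with ToLookup.
ILookup<string, Price> byVariant = prices.ToLookup(p => p.VariantNumber);
foreach (var price in byVariant["A-1"])
{
    Console.WriteLine(price.Amount); // 10, then 9
}

// Hypothetical type used for the illustration.
record Price(string VariantNumber, decimal Amount);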

Putting the structures to use

All of these collections have specialized methods for accessing their respective features, but they can also be enumerated like any plain sequence, losing all of that potential.

Consider the following search implementation for a method aimed at finding prices stored inside a Dictionary.

public IReadOnlyCollection<KeyValuePair<string, PriceCatalogItem>> SearchPriceCatalogForSearch(ICustomerCalculationRequest request, PriceCatalogSearchReply priceListForSearch)
{
    var requestedKeys = request.Lines.Select(line => $"{line.VariantNumber}|{line.SalesUnit}");

    var matchingItems = from item in priceListForSearch.PriceCatalogItems
                        join key in requestedKeys on item.Key equals key
                        select item;

    var results = matchingItems.ToArray();
    return results;
}

This logic will enumerate all keys in the PriceCatalogItems dictionary. It will also enumerate request.Lines multiple times. Using our previously established knowledge about collections, we can, instead of enumerating the dictionary, use its index via the TryGetValue method to find each item.

private List<KeyValuePair<string, PriceCatalogItem>> SearchPriceCatalogForSearch(IReadOnlyCollection<ICalculationRequestLine> lines, PriceCatalogSearchReply priceListForSearch)
{
    var results = new List<KeyValuePair<string, PriceCatalogItem>>(lines.Count);

    foreach (var line in lines)
    {
        var key = _keyCreatorService.CreatePriceListItemKey(line.VariantNumber, line.SalesUnit);

        if (!priceListForSearch.PriceCatalogItems.TryGetValue(key, out PriceCatalogItem? item))
            continue;

        if (item == null)
            continue;

        results.Add(new KeyValuePair<string, PriceCatalogItem>(key, item));
    }

    return results;
}

| Implementation | Mean | Median | Allocated |
|---|---|---|---|
| Previous | 9,793.1 μs | 9,793.2 μs | 11.26 KB |
| Proposed | 102.7 μs | 100.6 μs | 11.17 KB |

This change led to a roughly 95x speed increase, going from O(n) to O(1) time complexity per lookup (meaning the gain would be even greater if more items were added).

Avoiding LINQ in the hot path

The “hot path” is a set of instructions where the most time is spent. This can be either because of latency (waiting for input) or throughput (many operations). This part of the code is a moving target as when you improve one part, the hot path will likely shift.

I’ve seen examples where LINQ statements are written in query syntax the way a big SQL query would look: lots of join clauses over IEnumerables, written as if the Roslyn compiler were able to do far more than generate straightforward enumerators.

var megaJoinDtos = from priceListItem in input.Items

 join conditions in input.PriceRules on priceListItem.Value.ProductGroup1 equals conditions.Key2
     into joinedProductGroup1Key2

 join conditions in input.PriceRules on priceListItem.Value.ProductGroup2 equals conditions.Key2
     into joinedProductGroup2Key2

 join conditions in input.PriceRules on priceListItem.Value.ProductGroup3 equals conditions.Key2
     into joinedProductGroup3Key2

 join conditions in input.PriceRules on priceListItem.Value.VariantNumber equals conditions.Key2
     into joinedVariantNumberKey2

 join condition in input.PriceRules on priceListItem.Value.VariantNumber equals condition.Key1
     into joinedVariantNumberKey1

 join condition in input.PriceRules on priceListItem.Value.ProductGroup2 equals condition.Key1
     into joinedProductGroup2Key1

 join specialDuty in input.SpecialDuties on priceListItem.Value.SpecialDutyProfileId equals specialDuty.DutyProfile
     into joinedSpecialDutys
 from jSpecialDuty in joinedSpecialDutys.DefaultIfEmpty()

 join environMentalDuty in input.EnvironmentalDuties on priceListItem.Value.EnvironmentalDutyProfileId equals environMentalDuty.EnvironmentalProfile
     into joinedEnvironmentalDutys
 from jEnvironmentalDutys in joinedEnvironmentalDutys.DefaultIfEmpty()

 join quantityprice in input.QuantityPrices on priceListItem.Key equals quantityprice.Key
     into joinedQuantityPrices

 join bestPrice in input.QuantityPricesIncludedForAnyQuantities on priceListItem.Value.VariantNumber equals bestPrice.VariantNumber
     into joinedBestpriceNumber

 join bestPrice in input.QuantityPricesIncludedForAnyQuantities on priceListItem.Value.ProductGroup2 equals bestPrice.ProductGroup
     into joinedBestPriceGroup

 join requestLine in input.CustomerCalculationRequest?.Lines ?? []
     on priceListItem.Key equals _keyCreatorService.CreatePriceCatalogKey(requestLine.VariantNumber, requestLine.SalesUnit)
         into joinedRequestLines
 from jRequestLine in joinedRequestLines.DefaultIfEmpty()

 join preDiscountedPriceListItem in input.PreDiscountedPriceListItems on priceListItem.Key equals preDiscountedPriceListItem.Key
     into joinedPreDiscountedPriceListItems

 select new QueryResult
 {
     ...
 };

For more information about how LINQ works under the hood, check out this two-part video series with Stephen Toub on the subject.

The above example essentially becomes multiple nested loops where if-checks determine whether every item belongs in a result or not. It has O(n) time complexity and probably isn’t the best way of finding the cheapest price.

Instead, the code can be divided into a preprocessing step, where data structures suited for the job are allocated, followed by a loop that reuses those structures over and over.

var requestLines = input.CustomerCalculationRequest?.Lines ?? [];
var requestLineDictionary = requestLines.ToDictionary(x => _keyCreatorService.CreatePriceCatalogKey(x.VariantNumber, x.SalesUnit));

var prediscountedItems = input.PreDiscountedPriceListItems.ToLookup(x => x.Key);
var priceRulesKey1Lookup = input.PriceRules.ToLookup(x => x.Key1);
var priceRulesKey2Lookup = input.PriceRules.ToLookup(x => x.Key2);

var quantityPrices = _quantityPricesRepository.QuantityPrices;
var quantityPricesForAnyVariant = input.QuantityPricesIncludedForAnyQuantities.ToLookup(x => x.VariantNumber);
var quantityPricesForAnyGroup = input.QuantityPricesIncludedForAnyQuantities.ToLookup(x => x.ProductGroup);

// Holds the result objects produced by the loop below.
var results = new List<QueryResult>(input.Items.Count);

foreach (var (key, value) in input.Items)
{
    var productGroup1Key2 = value.ProductGroup1 != null ? priceRulesKey2Lookup[value.ProductGroup1] : Enumerable.Empty<PriceRule>();
    var productGroup2Key2 = value.ProductGroup2 != null ? priceRulesKey2Lookup[value.ProductGroup2] : Enumerable.Empty<PriceRule>();
    var productGroup3Key2 = value.ProductGroup3 != null ? priceRulesKey2Lookup[value.ProductGroup3] : Enumerable.Empty<PriceRule>();
    var productGroup2Key1 = value.ProductGroup2 != null ? priceRulesKey1Lookup[value.ProductGroup2] : Enumerable.Empty<PriceRule>();

    var variantNumberKey1 = priceRulesKey1Lookup[value.VariantNumber];
    var variantNumberKey2 = priceRulesKey2Lookup[value.VariantNumber];

    var specialDuty = GetSpecialDuty(value);
    var environmentalDuty = GetEnvironmentalDuty(value);
    var quantityPriceKey = _keyCreatorService.CreateQuantityPriceKey(value.PriceCatalogKey, value.VariantNumber, value.PriceUnit);
    var quantityPriceKeyValue = GetQuantityPrices(quantityPrices, quantityPriceKey);
    var bestPriceVariant = quantityPricesForAnyVariant[value.VariantNumber];
    var bestPriceGroup = quantityPricesForAnyGroup[value.ProductGroup2];

    requestLineDictionary.TryGetValue(key, out var line);

    var preDiscountedPriceListItems = prediscountedItems[key];

    results.Add(new QueryResult
    {
        ...
    });
}

| Implementation | Mean | Allocated |
|---|---|---|
| Previous | 286.42 ms | 477.51 MB |
| Proposed | 78.08 ms | 98.08 MB |

After this we see a ~3.7x speedup and a ~79% reduction in memory allocation.
Personally, I find this also solves one or two of the readability issues in the code.

Object pooling

Another strategy for reducing allocations and object construction cost is having a preallocated pool of objects, where an object is rented and returned to the pool when it is not needed anymore. This strategy fits especially well with the StringBuilder class, which is expensive to construct, and it can easily be implemented by referencing the NuGet package Microsoft.Extensions.ObjectPool.
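
A pool like the _stringBuilderPool used further down can, for example, be created with the package's built-in StringBuilder policy (one possible setup; the exact configuration in the project may differ):

// ObjectPool<T> and DefaultObjectPoolProvider live in Microsoft.Extensions.ObjectPool.
// A single shared pool: Get() rents a StringBuilder and Return() clears it
// (up to the policy's retained-capacity limit) before handing it back for reuse.
private static readonly ObjectPool<StringBuilder> _stringBuilderPool =
    new DefaultObjectPoolProvider().CreateStringBuilderPool();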

For example, take this code, which builds a comment string.

var sbComment = new StringBuilder();

sbComment.Append($"{condition.PriceRuleDetail.CustomerKeyDescription} ({condition.Key1})");
sbComment.Append(" - ");
sbComment.Append($"{condition.PriceRuleDetail.ProductKeyDescription} ({condition.Key2})");

if (condition.Key3 != null)
{
    sbComment.Append(" - ");
    sbComment.Append($"{condition.PriceRuleDetail.ThirdKeyDescription} ({condition.Key3})");
}

if (listPrice.IsQuantumDependentPrice)
{
    sbComment.Append(" QUANTUM:" + listPrice.Quantity);
}

dto.Comment = sbComment.ToString();

It can be rewritten like this.

var builder = _stringBuilderPool.Get();
var length = CalculateLength(condition);

if (builder.Capacity < length)
{
    builder.Capacity = length;
}

builder.Append(condition.PriceRuleDetail.CustomerKeyDescription);
builder.Append(" (");
builder.Append(condition.Key1);
builder.Append(") - ");
builder.Append(condition.PriceRuleDetail.ProductKeyDescription);
builder.Append(" (");
builder.Append(condition.Key2);
builder.Append(')');

if (condition.Key3 != null)
{
    builder.Append(" - ");
    builder.Append(condition.PriceRuleDetail.ThirdKeyDescription);
    builder.Append(" (");
    builder.Append(condition.Key3);
    builder.Append(')');
}

if (listPrice.IsQuantumDependentPrice)
{
    builder.Append(" QUANTUM:").Append(listPrice.Quantity);
}

var result = builder.ToString();

_stringBuilderPool.Return(builder);

| Implementation | Mean | Allocated |
|---|---|---|
| Previous | 634.2 ms | 1255.65 MB |
| Proposed | 793.3 ms | 966.76 MB |

This change resulted in a ~1.25x slowdown in execution but a 23% decrease in memory allocation. Note that the reduction in allocations doesn’t all come from the use of the object pool. One major factor is calculating the string length ahead of time. A tiny contributor is that the string interpolations were split up, so some appends now contain only string literals, which are automatically interned by the compiler.

The main reason for the allocations in this scenario is a parent method creating an unconstrained, massive number of results. Pooling can also add processing overhead, as in this example where the pool has to synchronize the renting of objects, so it doesn’t end up being faster here. We ended up implementing options for page size and the ability to omit this comment entirely instead.

Another notable pool is the StringPool found in the CommunityToolkit.HighPerformance package.
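
As a sketch (not from the project), StringPool.Shared deduplicates strings created from transient character data instead of allocating a new string for every equal value:

// StringPool lives in CommunityToolkit.HighPerformance.Buffers.
// Extracts the unit code from a key like "A-1|PCS" without allocating
// a new string for each occurrence of the same unit.
ReadOnlySpan<char> key = "A-1|PCS";
ReadOnlySpan<char> unitSpan = key[(key.IndexOf('|') + 1)..];
string unit = StringPool.Shared.GetOrAdd(unitSpan); // returns a cached instance for equal content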

Set up comparative benchmarking

Here is a guide on getting started with benchmarking.
Although there are other libraries out there, we will be using BenchmarkDotNet.

Create a benchmarking project like YourProject.Benchmarks. Open up a console and change to your target folder.

dotnet new console -n YourProject.Benchmarks

cd YourProject.Benchmarks

dotnet add package Microsoft.Extensions.DependencyInjection
dotnet add package BenchmarkDotNet

In the newly created Program.cs, enter the following.

using BenchmarkDotNet.Running;

BenchmarkSwitcher.FromAssembly(typeof(Program).Assembly)
                 .Run(Environment.GetCommandLineArgs());

Create Implementation.cs

namespace YourProject.Benchmarks;

public enum Implementation
{
    A,
    B
}

Create Benchmarks.cs

using BenchmarkDotNet.Attributes;
using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.Logging;
using Microsoft.Extensions.Logging.Abstractions;

namespace YourProject.Benchmarks;

[MemoryDiagnoser]
public class Benchmarks
{
    [Params(Implementation.A, Implementation.B)]
    public Implementation Implementation { get; set; }

    private ServiceProvider? _serviceProvider;

    [GlobalSetup]
    public void Setup()
    {
        var provider = GetServiceProvider(services =>
        {
            if (Implementation == Implementation.A)
            {
                // Register services for A
            }
            else if (Implementation == Implementation.B)
            {
                // Register services for B
            }
        });

        _serviceProvider = provider;
       
        // Fetch services for benchmarks from _serviceProvider
    }

    [GlobalCleanup]
    public void Cleanup()
    {
        _serviceProvider!.Dispose();
        _serviceProvider = null;
    }

    [Benchmark]
    public void ExecuteBenchmarkX()
    {
        // Perform benchmark X
    }

    [Benchmark]
    public void ExecuteBenchmarkY()
    {
        // Perform benchmark Y
    }

    private static ServiceProvider GetServiceProvider(Action<IServiceCollection> configure)
    {
        var services = GetServices();
        configure(services);

        return services.BuildServiceProvider();
    }

    private static IServiceCollection GetServices()
    {
        var services = new ServiceCollection();

        // Add shared services here

        services.AddSingleton(typeof(ILogger<>), typeof(NullLogger<>));

        return services;
    }
}

After populating the Benchmarks.cs file, it is just a matter of running the project and following the prompts.
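
BenchmarkDotNet needs an optimized build, so run the project in Release configuration and pick the benchmarks to execute from the menu it prints:

dotnet run -c Release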

Other ways of using the benchmarking project

During the process of optimization I’ve often found the need to locate the application’s hot path to pin down bottlenecks. This can be done in a hacky way by modifying the Program.cs file slightly to instead contain something like this.

var benchmarks = new Benchmarks();

benchmarks.Setup();

Console.WriteLine("Setup complete");

for (var i = 0; i < 5_000; i++)
{
    benchmarks.ExecuteBenchmarkX();
}

The number of iterations is circumstantial and depends on how fast your code executes. The main idea, when using the Visual Studio Performance Profiler or something like JetBrains dotTrace, is to “start with collection paused” and then activate it after the console writes “Setup complete”, so that the setup isn’t measured.

The importance of determinism

For the benchmarks to be reliable, results need to be relatively deterministic, meaning that given the same parameters, the benchmarks produce a similar result on each execution. If data is randomized, supply a specific seed to your Random instances, as in the sketch below.
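
For example (the data shape here is just an illustration):

// A fixed seed makes the "random" test data identical between runs,
// so every implementation is measured against the same input.
var random = new Random(42);
var quantities = Enumerable.Range(0, 1_000)
    .Select(_ => random.Next(1, 100))
    .ToArray();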

Final thoughts

Focusing too much on optimization can reduce the maintainability and readability of the code, which should be considered when deciding whether it is needed or not. Sometimes the biggest performance gains come simply from using appropriate data structures (as we saw in the optimization examples), since this leads to less work being performed by the CPU and reductions in memory use. Most of the time, readable code is preferable to optimized code, but there is no exact way of knowing without finding out via benchmarking.