SVE microbenchmarks with string operations #4841

Open: wants to merge 5 commits into base: main

jacob-crawley

The SVE .NET APIs were introduced in .NET 9 as an experimental feature.

Currently there are no microbenchmarks for testing the performance of these SVE features.

This PR introduces an initial set of microbenchmarks to measure the performance of SVE in comparison to scalar and Vector128 implementations using BenchmarkDotNet.

This commit includes benchmarks of the following string operations:

  • String length (StrLen)
  • Index of a string (StrIndexOf)
  • String comparison (StrCmp)

Some tests contain two SVE implementations. Those marked with 'Tail' at the end of the test name use fully populated vectors in each iteration, followed by a scalar loop that computes any remaining values.
The purpose of these tests is to compare against the other SVE implementations and highlight the performance impact of using predicate vectors that aren't all set to true.
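
For illustration only, here is a minimal sketch of the 'Tail' loop shape described above: full vectors in the main loop, then a scalar loop for whatever is left. It is not the code from this PR. `Vector<ushort>` is used as a portable stand-in for an SVE vector so the sketch runs on any machine, and the names `TailSketch` and `IndexOf` are hypothetical.

```csharp
using System;
using System.Numerics;

// Hypothetical helper, not code from this PR.
// Vector<ushort> stands in for an SVE vector; no Sve intrinsics are used.
public static class TailSketch
{
    // Returns the index of the first occurrence of `value`, or -1 if absent.
    public static int IndexOf(ReadOnlySpan<ushort> data, ushort value)
    {
        int width = Vector<ushort>.Count;        // lanes per vector on this machine
        var needle = new Vector<ushort>(value);  // broadcast the search value
        int i = 0;

        // Main loop: every vector is fully populated, so no mask is needed.
        for (; i + width <= data.Length; i += width)
        {
            var chunk = new Vector<ushort>(data.Slice(i, width));
            if (Vector.EqualsAny(chunk, needle))
            {
                break; // a match lies in this chunk; the scalar loop locates it
            }
        }

        // Scalar loop: handles the leftover elements and pinpoints any match
        // detected by the vector loop above.
        for (; i < data.Length; i++)
        {
            if (data[i] == value)
            {
                return i;
            }
        }
        return -1;
    }
}
```

For example, `TailSketch.IndexOf(new ushort[] { 1, 2, 3, 4 }, 3)` returns 2. The non-'Tail' variants would instead keep the leftover elements in the vector loop by masking off inactive lanes with a predicate, which is the cost these paired tests are meant to expose.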

Initial results of these tests (run on cobalt-100) are as follows:
StrLen:

| Method          | Size  | Mean           | Error     | StdDev    | Median         | Min            | Max            | Allocated |
|---------------- |------ |---------------:|----------:|----------:|---------------:|---------------:|---------------:|----------:|
| ScalarStrLen    | 15    |     24.7109 ns | 0.0054 ns | 0.0051 ns |     24.7107 ns |     24.7021 ns |     24.7202 ns |         - |
| Vector128StrLen | 15    |      3.8836 ns | 0.0276 ns | 0.0245 ns |      3.8844 ns |      3.8460 ns |      3.9425 ns |         - |
| SveStrLen       | 15    |      0.3774 ns | 0.0006 ns | 0.0006 ns |      0.3773 ns |      0.3765 ns |      0.3785 ns |         - |
| ScalarStrLen    | 127   |    221.2625 ns | 0.0382 ns | 0.0357 ns |    221.2651 ns |    221.1856 ns |    221.3065 ns |         - |
| Vector128StrLen | 127   |      7.7746 ns | 0.0180 ns | 0.0150 ns |      7.7733 ns |      7.7519 ns |      7.8036 ns |         - |
| SveStrLen       | 127   |      7.1787 ns | 0.0067 ns | 0.0063 ns |      7.1769 ns |      7.1673 ns |      7.1922 ns |         - |
| ScalarStrLen    | 527   |    929.3121 ns | 0.1061 ns | 0.0941 ns |    929.3189 ns |    929.0874 ns |    929.4475 ns |         - |
| Vector128StrLen | 527   |     19.2975 ns | 0.0278 ns | 0.0260 ns |     19.2884 ns |     19.2679 ns |     19.3461 ns |         - |
| SveStrLen       | 527   |     32.1379 ns | 0.0240 ns | 0.0224 ns |     32.1314 ns |     32.1033 ns |     32.1711 ns |         - |
| ScalarStrLen    | 10015 | 17,713.4163 ns | 3.3844 ns | 3.0002 ns | 17,713.1484 ns | 17,708.9913 ns | 17,719.9227 ns |         - |
| Vector128StrLen | 10015 |    305.8770 ns | 0.8203 ns | 0.7673 ns |    306.2599 ns |    304.1435 ns |    306.6253 ns |         - |
| SveStrLen       | 10015 |    663.3082 ns | 1.4353 ns | 1.3426 ns |    663.3336 ns |    661.1712 ns |    664.9327 ns |         - |

StrIndexOf:

| Method           | Size  | Mean         | Error     | StdDev    | Median       | Min          | Max          | Allocated |
|----------------- |------ |-------------:|----------:|----------:|-------------:|-------------:|-------------:|----------:|
| ScalarIndexOf    | 15    |     4.813 ns | 0.0018 ns | 0.0017 ns |     4.813 ns |     4.811 ns |     4.817 ns |         - |
| Vector128IndexOf | 15    |    22.264 ns | 0.0190 ns | 0.0159 ns |    22.263 ns |    22.243 ns |    22.301 ns |         - |
| SveIndexOf       | 15    |     2.719 ns | 0.0008 ns | 0.0007 ns |     2.718 ns |     2.718 ns |     2.720 ns |         - |
| SveIndexOfTail   | 15    |     3.221 ns | 0.0007 ns | 0.0007 ns |     3.221 ns |     3.221 ns |     3.223 ns |         - |
| ScalarIndexOf    | 127   |    39.202 ns | 0.0110 ns | 0.0097 ns |    39.200 ns |    39.184 ns |    39.222 ns |         - |
| Vector128IndexOf | 127   |    24.241 ns | 0.0110 ns | 0.0103 ns |    24.243 ns |    24.221 ns |    24.254 ns |         - |
| SveIndexOf       | 127   |    14.152 ns | 0.0023 ns | 0.0022 ns |    14.152 ns |    14.149 ns |    14.156 ns |         - |
| SveIndexOfTail   | 127   |    12.158 ns | 0.0033 ns | 0.0031 ns |    12.157 ns |    12.153 ns |    12.162 ns |         - |
| ScalarIndexOf    | 527   |   173.125 ns | 0.0455 ns | 0.0425 ns |   173.125 ns |   173.034 ns |   173.181 ns |         - |
| Vector128IndexOf | 527   |    40.650 ns | 0.0212 ns | 0.0188 ns |    40.652 ns |    40.603 ns |    40.675 ns |         - |
| SveIndexOf       | 527   |    53.744 ns | 0.0062 ns | 0.0055 ns |    53.743 ns |    53.737 ns |    53.755 ns |         - |
| SveIndexOfTail   | 527   |    44.726 ns | 0.1043 ns | 0.0975 ns |    44.711 ns |    44.580 ns |    44.902 ns |         - |
| ScalarIndexOf    | 10015 | 3,325.136 ns | 1.4973 ns | 1.3273 ns | 3,324.868 ns | 3,322.935 ns | 3,328.274 ns |         - |
| Vector128IndexOf | 10015 |   440.725 ns | 0.1196 ns | 0.1119 ns |   440.749 ns |   440.390 ns |   440.864 ns |         - |
| SveIndexOf       | 10015 | 1,031.137 ns | 0.1276 ns | 0.1131 ns | 1,031.103 ns | 1,030.981 ns | 1,031.333 ns |         - |
| SveIndexOfTail   | 10015 |   849.561 ns | 0.8771 ns | 0.7325 ns |   849.879 ns |   848.295 ns |   850.741 ns |         - |

StrCmp:

| Method          | Size  | Scenario   | Mean         | Error     | StdDev    | Median       | Min          | Max          | Allocated |
|---------------- |------ |----------- |-------------:|----------:|----------:|-------------:|-------------:|-------------:|----------:|
| ScalarStrCmp    | 15    | ChangeArr1 |     5.987 ns | 0.0017 ns | 0.0015 ns |     5.986 ns |     5.984 ns |     5.990 ns |         - |
| Vector128StrCmp | 15    | ChangeArr1 |     7.132 ns | 0.0087 ns | 0.0081 ns |     7.129 ns |     7.119 ns |     7.144 ns |         - |
| SveStrCmp       | 15    | ChangeArr1 |    51.336 ns | 0.0178 ns | 0.0166 ns |    51.341 ns |    51.297 ns |    51.357 ns |         - |
| SveStrCmpTail   | 15    | ChangeArr1 |     7.496 ns | 0.0016 ns | 0.0015 ns |     7.496 ns |     7.494 ns |     7.499 ns |         - |
| ScalarStrCmp    | 15    | ChangeArr2 |    12.372 ns | 0.0057 ns | 0.0053 ns |    12.371 ns |    12.365 ns |    12.381 ns |         - |
| Vector128StrCmp | 15    | ChangeArr2 |    13.779 ns | 0.0076 ns | 0.0071 ns |    13.780 ns |    13.765 ns |    13.792 ns |         - |
| SveStrCmp       | 15    | ChangeArr2 |    51.271 ns | 0.0122 ns | 0.0114 ns |    51.266 ns |    51.261 ns |    51.298 ns |         - |
| SveStrCmpTail   | 15    | ChangeArr2 |    14.268 ns | 0.0064 ns | 0.0060 ns |    14.267 ns |    14.259 ns |    14.277 ns |         - |
| ScalarStrCmp    | 15    | Zero       |    12.404 ns | 0.0352 ns | 0.0329 ns |    12.412 ns |    12.285 ns |    12.419 ns |         - |
| Vector128StrCmp | 15    | Zero       |    13.029 ns | 0.0149 ns | 0.0132 ns |    13.028 ns |    13.009 ns |    13.057 ns |         - |
| SveStrCmp       | 15    | Zero       |    51.156 ns | 0.0061 ns | 0.0057 ns |    51.156 ns |    51.144 ns |    51.165 ns |         - |
| SveStrCmpTail   | 15    | Zero       |    14.074 ns | 0.0035 ns | 0.0032 ns |    14.073 ns |    14.070 ns |    14.081 ns |         - |
| ScalarStrCmp    | 127   | ChangeArr1 |    52.245 ns | 0.0217 ns | 0.0192 ns |    52.243 ns |    52.213 ns |    52.270 ns |         - |
| Vector128StrCmp | 127   | ChangeArr1 |    19.156 ns | 0.0047 ns | 0.0041 ns |    19.157 ns |    19.148 ns |    19.161 ns |         - |
| SveStrCmp       | 127   | ChangeArr1 |    52.898 ns | 0.0108 ns | 0.0095 ns |    52.901 ns |    52.881 ns |    52.913 ns |         - |
| SveStrCmpTail   | 127   | ChangeArr1 |    23.309 ns | 0.0090 ns | 0.0080 ns |    23.310 ns |    23.292 ns |    23.322 ns |         - |
| ScalarStrCmp    | 127   | ChangeArr2 |   104.459 ns | 0.0403 ns | 0.0377 ns |   104.462 ns |   104.412 ns |   104.525 ns |         - |
| Vector128StrCmp | 127   | ChangeArr2 |    21.411 ns | 0.0075 ns | 0.0066 ns |    21.412 ns |    21.391 ns |    21.418 ns |         - |
| SveStrCmp       | 127   | ChangeArr2 |    57.338 ns | 0.0387 ns | 0.0362 ns |    57.333 ns |    57.291 ns |    57.406 ns |         - |
| SveStrCmpTail   | 127   | ChangeArr2 |    27.557 ns | 0.0089 ns | 0.0084 ns |    27.556 ns |    27.546 ns |    27.572 ns |         - |
| ScalarStrCmp    | 127   | Zero       |   104.072 ns | 0.0149 ns | 0.0139 ns |   104.074 ns |   104.043 ns |   104.096 ns |         - |
| Vector128StrCmp | 127   | Zero       |    21.451 ns | 0.0046 ns | 0.0041 ns |    21.451 ns |    21.445 ns |    21.459 ns |         - |
| SveStrCmp       | 127   | Zero       |    56.550 ns | 0.0331 ns | 0.0309 ns |    56.553 ns |    56.499 ns |    56.603 ns |         - |
| SveStrCmpTail   | 127   | Zero       |    26.732 ns | 0.0197 ns | 0.0184 ns |    26.738 ns |    26.694 ns |    26.762 ns |         - |
| ScalarStrCmp    | 527   | ChangeArr1 |   217.308 ns | 0.0887 ns | 0.0829 ns |   217.319 ns |   217.112 ns |   217.414 ns |         - |
| Vector128StrCmp | 527   | ChangeArr1 |    25.675 ns | 0.0101 ns | 0.0094 ns |    25.675 ns |    25.662 ns |    25.694 ns |         - |
| SveStrCmp       | 527   | ChangeArr1 |    75.928 ns | 0.2925 ns | 0.2736 ns |    75.854 ns |    75.416 ns |    76.293 ns |         - |
| SveStrCmpTail   | 527   | ChangeArr1 |    32.373 ns | 0.0108 ns | 0.0101 ns |    32.369 ns |    32.357 ns |    32.390 ns |         - |
| ScalarStrCmp    | 527   | ChangeArr2 |   439.540 ns | 0.0706 ns | 0.0660 ns |   439.554 ns |   439.376 ns |   439.613 ns |         - |
| Vector128StrCmp | 527   | ChangeArr2 |    47.813 ns | 0.0326 ns | 0.0305 ns |    47.819 ns |    47.750 ns |    47.850 ns |         - |
| SveStrCmp       | 527   | ChangeArr2 |   104.090 ns | 0.1739 ns | 0.1627 ns |   104.044 ns |   103.903 ns |   104.369 ns |         - |
| SveStrCmpTail   | 527   | ChangeArr2 |    62.306 ns | 0.0161 ns | 0.0135 ns |    62.305 ns |    62.282 ns |    62.325 ns |         - |
| ScalarStrCmp    | 527   | Zero       |   438.286 ns | 0.0601 ns | 0.0562 ns |   438.307 ns |   438.176 ns |   438.359 ns |         - |
| Vector128StrCmp | 527   | Zero       |    47.914 ns | 0.0307 ns | 0.0287 ns |    47.913 ns |    47.867 ns |    47.981 ns |         - |
| SveStrCmp       | 527   | Zero       |   102.589 ns | 0.1280 ns | 0.1197 ns |   102.603 ns |   102.316 ns |   102.797 ns |         - |
| SveStrCmpTail   | 527   | Zero       |    62.642 ns | 0.0347 ns | 0.0324 ns |    62.634 ns |    62.600 ns |    62.684 ns |         - |
| ScalarStrCmp    | 10015 | ChangeArr1 | 4,141.196 ns | 0.5018 ns | 0.4449 ns | 4,141.221 ns | 4,140.498 ns | 4,142.157 ns |         - |
| Vector128StrCmp | 10015 | ChangeArr1 |   365.715 ns | 0.4989 ns | 0.4666 ns |   365.784 ns |   365.049 ns |   366.471 ns |         - |
| SveStrCmp       | 10015 | ChangeArr1 |   607.617 ns | 0.3152 ns | 0.2948 ns |   607.654 ns |   607.088 ns |   608.232 ns |         - |
| SveStrCmpTail   | 10015 | ChangeArr1 |   477.274 ns | 0.3953 ns | 0.3698 ns |   477.145 ns |   476.792 ns |   477.911 ns |         - |
| ScalarStrCmp    | 10015 | ChangeArr2 | 8,275.683 ns | 0.7596 ns | 0.6733 ns | 8,275.470 ns | 8,275.094 ns | 8,277.086 ns |         - |
| Vector128StrCmp | 10015 | ChangeArr2 |   724.204 ns | 0.9676 ns | 0.9051 ns |   724.469 ns |   722.624 ns |   725.547 ns |         - |
| SveStrCmp       | 10015 | ChangeArr2 | 1,176.387 ns | 2.7769 ns | 2.5975 ns | 1,177.075 ns | 1,169.617 ns | 1,178.888 ns |         - |
| SveStrCmpTail   | 10015 | ChangeArr2 |   940.443 ns | 0.7952 ns | 0.7439 ns |   940.215 ns |   939.558 ns |   941.722 ns |         - |
| ScalarStrCmp    | 10015 | Zero       | 8,271.079 ns | 1.0579 ns | 0.9378 ns | 8,271.230 ns | 8,269.293 ns | 8,272.946 ns |         - |
| Vector128StrCmp | 10015 | Zero       |   724.574 ns | 0.7333 ns | 0.6860 ns |   724.501 ns |   723.211 ns |   725.702 ns |         - |
| SveStrCmp       | 10015 | Zero       | 1,196.635 ns | 6.3129 ns | 5.9051 ns | 1,196.864 ns | 1,183.301 ns | 1,204.401 ns |         - |
| SveStrCmpTail   | 10015 | Zero       |   940.188 ns | 0.6239 ns | 0.5531 ns |   939.974 ns |   939.585 ns |   941.287 ns |         - |

Microbenchmarking tests on string operations (len, indexof, cmp) to
compare the runtimes across scalar, Vector128 and SVE implementations.
@jacob-crawley
Author

@a74nh @kunalspathak @dotnet/arm64-contrib

@jacob-crawley
Author

@dotnet-policy-service agree company="Arm"

@a74nh
Contributor

a74nh commented Apr 23, 2025

Question for the maintainers: Is microbenchmarks the right place for these? Microbenchmarks feels maybe too "small" but it doesn't fit into "real world". "Loops" or "vectorisation" would be a better category, but it doesn't exist.

@kunalspathak
Member

@LoopedBard3 @DrewScoggins

@kunalspathak
Member

Thanks @jacob-crawley for coming up with this. I am wondering if you got a chance to verify the performance behavior of StrLen? For small lengths, SVE seems to be 3X faster, but for larger lengths it is 2X slower than NEON.

[Benchmark]
public unsafe long SveStrCmp()
{
    int i = 0;
Member

We might need to add a check of Sve.IsSupported here because we do have machines in the perf lab that don't support SVE. @LoopedBard3 or @caaavik-msft can suggest the right way to do it.

Author

I've added these checks to the Sve tests:

[Benchmark]
public unsafe long SveStrCmpTail()
{
    if (Sve.IsSupported)
    {

If there's a way of doing these checks as a filter before running the benchmarks, please let me know.

Member

@adamsitnik do you have any recommendations for this type of filtering?

Include a 'Sve.IsSupported' check on each SVE benchmark so they
still run on machines that don't have support for SVE.
@kunalspathak
Member

@caaavik-msft ping

Member

@adamsitnik left a comment

Filtering the benchmarks with SVE enabled could be done by something similar to what we have here:

Defining a custom config type that is applied via an attribute: https://github.com/dotnet/BenchmarkDotNet/blob/ee248c319919ac112eb908394f1e941b78ca6a28/samples/BenchmarkDotNet.Samples/IntroFilters.cs#L9-L20

The filter could look similar to this (I have not tested it):

[Config(typeof(Config))]
public class IntroFilters
{
    private class Config : ManualConfig
    {
        public Config()
        {
            AddFilter(new SimpleFilter(_ => Sve.IsSupported));
        }
    }
}

for (int i = 0; i < Size; i++)
{
    if (_arr1[i] != _arr2[i])
        return _arr1[i] - _arr2[i];
Member

The benchmarks in this repo are used to determine whether there is any performance regression in .NET. Running this scalar benchmark multiple times every day is unlikely to catch any regression, so I would focus purely on the ones that use Sve directly and indirectly (via Vector types, if possible).

Contributor

Thoughts...

  • They give a comparison point, so we can easily show the advantage of using intrinsics. I think it is useful to know that an SVE loop is slightly slower than Vector128 but still massively faster than scalar.
  • If C# started to add loop optimisations / auto-vectorisation, the gap between scalar and intrinsics would start to close.
  • There are some loops (not in this PR) that cannot easily be optimised via Vector128 (e.g. the partition used by a quicksort). For those we definitely want scalar versions.

Member

I agree with @a74nh. The point of adding a scalar version is not to catch any regression in that code, but to compare the improvements we get from using the Vector128/Sve APIs.

Adds a custom filter to each benchmark class which means it can only
be run if SVE is supported on the machine
Addressing upstream comments to rename 'Scenario' param in StrCmp to
'Modify', and all scalar method names have been renamed to 'Scalar()'.

5 participants