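So in my previous post, I showed a bunch of small micro benchmarks, and aside from the actual results, I wasn't really sure what was going on there. Luckily, I know a few perf experts, so I was able to lean on them.
In particular, the changes that were suggested were:
- Don't measure just a single tiny operation. If the operation is too cheap, it is easy to get too much jitter from the setup of the call.
- Be aware of potential data issues. The compiler / JIT can decide to keep something in a register, in which case you are benchmarking the CPU directly, which won't be the case in the real world.
I also learned how to look at the actual assembly being run, which is great. All in all, we get the following benchmark code: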
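As a rough illustration of the first suggestion, here is a minimal hand-rolled sketch that batches four cheap operations per call and divides the elapsed time by the total operation count. The class name, the counters and the `NanosecondsPerOperation` helper are all hypothetical and only stand in for real work; this is not how the benchmark harness used in this post measures things internally.

```csharp
using System;
using System.Diagnostics;

public static class BatchTimingSketch
{
    // Hypothetical counters standing in for the real work being measured.
    static long a, b, c, d;

    const int OperationsPerInvoke = 4;

    // Four cheap operations per call, so the cost of making the call
    // is spread over more work than a single increment.
    static void Batch()
    {
        a++;
        b++;
        c++;
        d++;
    }

    // Times `invocations` calls and reports the average per single operation.
    public static double NanosecondsPerOperation(int invocations)
    {
        var sw = Stopwatch.StartNew();
        for (int i = 0; i < invocations; i++)
            Batch();
        sw.Stop();

        double totalNs = sw.Elapsed.TotalMilliseconds * 1000000.0;
        return totalNs / ((double)invocations * OperationsPerInvoke);
    }

    public static void Main()
    {
        double ns = NanosecondsPerOperation(10000000);
        Console.WriteLine("~" + ns.ToString("F3") + " ns per operation");
    }
}
```

A real harness also handles warmup, repeated runs and statistics, which this sketch deliberately leaves out.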
    // Requires: using System.Runtime.InteropServices; (for Marshal)
    // FooHeader is not defined in this post.
    [BenchmarkTask(platform: BenchmarkPlatform.X86,
                jitVersion: BenchmarkJitVersion.RyuJit)]
    [BenchmarkTask(platform: BenchmarkPlatform.X86,
                jitVersion: BenchmarkJitVersion.LegacyJit)]
    [BenchmarkTask(platform: BenchmarkPlatform.X64,
                jitVersion: BenchmarkJitVersion.LegacyJit)]
    [BenchmarkTask(platform: BenchmarkPlatform.X64,
                jitVersion: BenchmarkJitVersion.RyuJit)]
    public unsafe class ToCastOrNotToCast
    {
        // Raw pointers, cast to FooHeader* on every access in Cast().
        byte* p1, p2, p3, p4;
        // Pre-cast pointers to the same memory, used directly in NoCast().
        FooHeader* h1, h2, h3, h4;

        public ToCastOrNotToCast()
        {
            // Four separate unmanaged buffers (never freed in this benchmark).
            p1 = (byte*)Marshal.AllocHGlobal(1024);
            p2 = (byte*)Marshal.AllocHGlobal(1024);
            p3 = (byte*)Marshal.AllocHGlobal(1024);
            p4 = (byte*)Marshal.AllocHGlobal(1024);
            h1 = (FooHeader*)p1;
            h2 = (FooHeader*)p2;
            h3 = (FooHeader*)p3;
            h4 = (FooHeader*)p4;
        }

        [Benchmark]
        [OperationsPerInvoke(4)]
        public void NoCast()
        {
            h1->PageNumber++;
            h2->PageNumber++;
            h3->PageNumber++;
            h4->PageNumber++;
        }

        [Benchmark]
        [OperationsPerInvoke(4)]
        public void Cast()
        {
            ((FooHeader*)p1)->PageNumber++;
            ((FooHeader*)p2)->PageNumber++;
            ((FooHeader*)p3)->PageNumber++;
            ((FooHeader*)p4)->PageNumber++;
        }
    }
And the following results:
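Interestingly, the NoCast method is faster in pretty much all setups.
Here is the assembly code for LegacyJit in x64:
For RyuJit, the code for the cast version is identical, and the only difference in the no-cast code is that the mov edx, ecx is a mov rdx,rcx in RyuJit.
As an aside, x64 assembly code is much easier to read than x86 assembly code.
In short, casting or not casting has a very minor performance difference, but not casting allows us to save a pointer reference in the object, which means it will be somewhat smaller, and if we are going to have a lot of them, that can be a pretty nice space saving.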
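To put a number on the space argument, here is a minimal sketch comparing a type that stores two pointers with one that stores a single pointer. Everything in it is an assumption made for illustration: the `FooHeader` layout is a hypothetical stand-in, and `TwoPointers`, `OnePointer` and `SavedBytes` are invented names. Structs are used rather than classes only so that `sizeof` can measure them directly.

```csharp
using System;

// Hypothetical stand-in; the real FooHeader layout is not shown in this post.
public struct FooHeader
{
    public long PageNumber;
}

// Stores both the raw pointer and a typed pointer to the same memory.
public unsafe struct TwoPointers
{
    public byte* Ptr;
    public FooHeader* Header;
}

// Stores only one pointer.
public unsafe struct OnePointer
{
    public FooHeader* Header;
}

public static class SizeSketch
{
    // Difference in size between the two layouts, in bytes.
    public static unsafe int SavedBytes()
    {
        return sizeof(TwoPointers) - sizeof(OnePointer);
    }

    public static void Main()
    {
        // One pointer-sized field: equals IntPtr.Size
        // (8 in a 64-bit process, 4 in a 32-bit one).
        Console.WriteLine(SavedBytes());
    }
}
```

The code needs to be compiled with unsafe code enabled. The saving is one pointer-sized field per instance, so it only starts to matter when there are a great many instances alive at once.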