Skylay.cc Review 3 Reasons Why Skylay Doubler is Risky { BEWARE }

When you spend over an hour beautifying a bus station (nicely patterned pavements leading to the interchange stops, lines of colourful bushes, benches, signposts, lovely trees) and then zoom out to admire your work...

I cried. Thanks, Skylines God. You’re a real pal.

Hahahaa too funny
Time for another Meteor Memorial Park methinks ;)

Hahaha I literally screamed when I saw where the meteor was heading. The timing of it is what got me it’s like the game just *KNEW* what i’d been working on for the last hour and a half.

But OH YEAH. I can do that memorial park idea now! There is an upside ;)

Why is this SSE code 6 times slower without VZEROUPPER on Skylake?

I’ve been trying to figure out a performance problem in an application and have finally narrowed it down to a really weird problem. The following piece of code runs 6 times slower on a Skylake CPU (i5-6500) if the VZEROUPPER instruction is commented out. I’ve tested Sandy Bridge and Ivy Bridge CPUs and both versions run at the same speed, with or without VZEROUPPER.

Now I have a fairly good idea of what VZEROUPPER does, and I think it should not matter at all to this code when there are no VEX coded instructions and no calls to any function which might contain them. The fact that it does not matter on other AVX capable CPUs appears to support this. So does table 11-2 in the Intel® 64 and IA-32 Architectures Optimization Reference Manual.

So what is going on?

The only theory I have left is that there’s a bug in the CPU and it’s incorrectly triggering the “save the upper half of the AVX registers” procedure where it shouldn’t. Or something else just as strange.

This is main.cpp:

and this is slow_function.cpp:

The function compiles down to this with clang:

The generated code is different with gcc, but it shows the same problem. An older version of the Intel compiler generates yet another variation of the function, which shows the problem too, but only if main.cpp is not built with the Intel compiler: that compiler inserts calls to initialize some of its own libraries, which probably end up executing VZEROUPPER somewhere.

And of course, if the whole thing is built with AVX support so the intrinsics are turned into VEX coded instructions, there is no problem either.

I’ve tried profiling the code with perf on linux and most of the runtime usually lands on 1-2 instructions but not always the same ones depending on which version of the code I profile (gcc, clang, intel). Shortening the function appears to make the performance difference gradually go away so it looks like several instructions are causing the problem.

EDIT: Here’s a pure assembly version, for linux. Comments below.

Ok, so as suspected in comments, using VEX coded instructions causes the slowdown. Using VZEROUPPER clears it up. But that still does not explain why.

As I understand it, not using VZEROUPPER is supposed to involve a cost to transition to old SSE instructions but not a permanent slowdown of them. Especially not such a large one. Taking loop overhead into account, the ratio is at least 10x, perhaps more.

I have tried messing with the assembly a little and float instructions are just as bad as double ones. I could not pinpoint the problem to a single instruction either.

2 Answers

You are experiencing a penalty for “mixing” non-VEX SSE and VEX-encoded instructions – even though your entire visible application doesn’t obviously use any AVX instructions!

Prior to Skylake, this type of penalty was only a one-time transition penalty, paid when switching from code that used VEX to code that didn’t, or vice versa. That is, you never paid an ongoing penalty for whatever happened in the past unless you were actively mixing VEX and non-VEX. In Skylake, however, there is a state where non-VEX SSE instructions pay a high ongoing execution penalty, even without further mixing.

Straight from the horse’s mouth, here’s Figure 11-1, the old (pre-Skylake) transition diagram:

As you can see, all of the penalties (red arrows) bring you to a new state, at which point there is no longer a penalty for repeating that action. For example, if you get to the dirty upper state by executing some 256-bit AVX, and you then execute legacy SSE, you pay a one-time penalty to transition to the preserved non-INIT upper state, but you don’t pay any penalties after that.

In Skylake, everything is different per Figure 11-2:

There are fewer penalties overall, but critically for your case, one of them is a self-loop: the penalty for executing a legacy SSE instruction in the dirty upper state (Penalty A in Figure 11-2) keeps you in that state. That’s what happens to you: any AVX instruction puts you in the dirty upper state, which slows all further SSE execution down.
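To make the difference between the two diagrams concrete, here is a toy model of the transitions described above. It is only a sketch: the names (Op, State, Arch, step, count_penalties, self_check) are made up for illustration, only the transitions mentioned in this answer are modelled, and treating VZEROUPPER as free from every state is an assumption of the sketch, not something taken from the manual.

```cpp
#include <cassert>
#include <vector>

// Hypothetical names; a simplified model, not Intel's full state machine.
enum class Op { Sse, Avx256, VZeroUpper };
enum class State { Clean, Dirty, Preserved };
enum class Arch { PreSkylake, Skylake };

struct Step {
    State next;    // state after executing the instruction
    bool penalty;  // whether this instruction paid a transition penalty
};

Step step(Arch arch, State s, Op op) {
    switch (op) {
    case Op::VZeroUpper:
        // Assumption of this sketch: always returns to the clean state for free.
        return {State::Clean, false};
    case Op::Avx256:
        // Pre-Skylake: leaving the preserved state costs a one-time penalty.
        return {State::Dirty, arch == Arch::PreSkylake && s == State::Preserved};
    case Op::Sse:
        if (s != State::Dirty) return {s, false};
        // Pre-Skylake: one-time penalty, then move to the preserved state.
        if (arch == Arch::PreSkylake) return {State::Preserved, true};
        // Skylake: the penalty is a self-loop, so the state stays dirty.
        return {State::Dirty, true};
    }
    return {s, false};
}

// Counts how many instructions in the stream pay a penalty, starting clean.
int count_penalties(Arch arch, const std::vector<Op>& ops) {
    State s = State::Clean;
    int penalties = 0;
    for (Op op : ops) {
        const Step r = step(arch, s, op);
        if (r.penalty) ++penalties;
        s = r.next;
    }
    return penalties;
}

// Usage: one AVX instruction followed by three legacy SSE instructions.
void self_check() {
    const std::vector<Op> ops = {Op::Avx256, Op::Sse, Op::Sse, Op::Sse};
    assert(count_penalties(Arch::PreSkylake, ops) == 1);  // paid once
    assert(count_penalties(Arch::Skylake, ops) == 3);     // paid on every SSE op
}
```

In this model a single stray AVX instruction costs one penalty on the older parts, but on Skylake the count grows with every SSE instruction that follows, until a VZEROUPPER resets the state.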

Here’s what Intel says (section 11.3) about the new penalty:

The Skylake microarchitecture implements a different state machine than prior generations to manage the YMM state transition associated with mixing SSE and AVX instructions. It no longer saves the entire upper YMM state when executing an SSE instruction when in “Modified and Unsaved” state, but saves the upper bits of individual register. As a result, mixing SSE and AVX instructions will experience a penalty associated with partial register dependency of the destination registers being used and additional blend operation on the upper bits of the destination registers.

So the penalty is apparently quite large: it has to blend the top bits all the time to preserve them, and it also makes instructions which are apparently independent become dependent, since there is a dependency on the hidden upper bits. For example, xorpd xmm0, xmm0 no longer breaks the dependence on the previous value of xmm0, since the result is actually dependent on the hidden upper bits from ymm0, which aren’t cleared by the xorpd. That latter effect is probably what kills your performance, since you’ll now have very long dependency chains that you wouldn’t expect from the usual analysis.

This is among the worst types of performance pitfall: the behavior/best practice for the prior architecture is essentially the opposite of that for the current architecture. Presumably the hardware architects had a good reason for making the change, but it does just add another “gotcha” to the list of subtle performance issues.

I would file a bug against the compiler or runtime that inserted that AVX instruction and didn’t follow up with a VZEROUPPER .
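For reference, this is roughly what "following up with a VZEROUPPER" looks like in code that uses 256-bit intrinsics. It is a minimal sketch: sum_doubles is a hypothetical helper, not anything from the question. The AVX path is only compiled when the build enables AVX (the __AVX__ macro); otherwise a plain scalar loop is used. Compilers commonly insert VZEROUPPER on their own when AVX is enabled, so the explicit call matters mostly for hand-written assembly or when that automatic insertion is turned off.

```cpp
#include <cassert>
#include <cstddef>
#if defined(__AVX__)
#include <immintrin.h>
#endif

// Hypothetical helper: sums n doubles, using 256-bit AVX when available.
double sum_doubles(const double* data, std::size_t n) {
    assert(data != nullptr || n == 0);
    double total = 0.0;
    std::size_t i = 0;
#if defined(__AVX__)
    __m256d acc = _mm256_setzero_pd();
    for (; i + 4 <= n; i += 4) {
        acc = _mm256_add_pd(acc, _mm256_loadu_pd(data + i));
    }
    double lanes[4];
    _mm256_storeu_pd(lanes, acc);
    total = (lanes[0] + lanes[1]) + (lanes[2] + lanes[3]);
    // Leave the upper halves of the YMM registers clean, so that legacy SSE
    // code running afterwards does not start in the dirty upper state.
    _mm256_zeroupper();
#endif
    // Scalar tail (or the whole array when AVX is not enabled).
    for (; i < n; ++i) {
        total += data[i];
    }
    return total;
}
```

The call sits at the end of the AVX region, before control returns to code that may have been compiled without VEX encoding.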

Update: Per the OP’s comment below, the offending (AVX) code was inserted by the runtime linker ld and a bug already exists.

Skylight

Distance from home: 618 km.

Minimum level: 1

Skylight is a permanent location in the Cold Lands. It can be visited from level 65 onward, after completing the quests at the Meltonville location. Micro-goals: inspect 21 Geysers (reward: 1 Special Gift); inspect 12 Frozen Craters, density 70 units. Each crater yields 10 units of Meteorite Essence (reward: 1 Special Gift). Each horse at the location can be fed 10 times. One feeding requires 30 Carrot Sticks. After each feeding you receive useful items. Prizes are awarded for clearing 30%, 50%, 80% and 100% of the location. The gift for 100% contains a Buyer's Cart with crafts.


