c++ - Sorting 64-bit structs using AVX?
I have a 64-bit struct which represents several pieces of data, one of which is a floating-point value:

struct mystruct { uint16_t a; uint16_t b; float f; };

and I have four of these structs in, let's say, an std::array<mystruct, 4>.

Is it possible to use AVX to sort the array in terms of the float member mystruct::f?
Sorry this answer is messy; it wasn't all written at once, and I'm lazy. There is some duplication.
I have 4 separate ideas:

- Normal sorting, but moving the struct as a 64-bit unit
- Vectorized insertion-sort as a building block for qsort
- Sorting networks, with a comparator implementation using cmpps / blendvpd instead of minps / maxps. The extra overhead might kill the speedup, though.
- Sorting networks: load some structs, then shuffle/blend to get some registers of just floats and some registers of just payload. Use Timothy Furtak's technique of doing a normal minps / maxps comparator, and then cmpeqps min,orig -> masked xor-swap on the payload. This sorts twice as much data per comparator, but does require matching shuffles on two registers between comparators. It also requires re-interleaving when you're done (but that's easy with unpcklps / unpckhps, if you arrange your comparators so those in-lane unpacks put the final data in the right order).

This also avoids potential slowdowns that some CPUs may have when doing FP comparisons on bit patterns in the payload that represent denormals, NaNs, or infinities, without resorting to setting the denormals-are-zero bit in the MXCSR.
Furtak's paper suggests doing a scalar cleanup after getting things mostly sorted with vectors, which would reduce the amount of shuffling a lot.
Normal sorting
There's at least a small speedup to be gained when using normal sorting algorithms, by moving the whole struct around with 64-bit loads/stores, and doing a scalar FP compare on the FP element. For this idea to work as well as possible, order the struct with the float value first; then you could movq the whole struct into an xmm reg, and the float value would be in the low 32 bits, ready for ucomiss. Then you (or maybe a smart compiler) could store the struct with a movq.
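For example, here's a quick sketch of the layout and comparator that paragraph describes (my own example, not from the question; whether a given compiler actually emits movq / ucomiss for it is something you'd have to check in the asm output):

// Sketch only: reordered struct (key first) plus a plain scalar comparator.
#include <algorithm>
#include <array>
#include <cstdint>

struct mystruct_keyfirst { float f; uint16_t a, b; };   // float first: key in the low 32 bits

void sort4(std::array<mystruct_keyfirst, 4>& arr) {
    std::sort(arr.begin(), arr.end(),
              [](const mystruct_keyfirst& x, const mystruct_keyfirst& y) {
                  return x.f < y.f;   // scalar FP compare; the whole 8-byte struct moves as one unit
              });
}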
Looking at the asm output that Kerrek SB linked to, compilers seem to do a rather bad job of efficiently copying structs around: icc seems to movzx the two uint values separately, rather than scooping up the whole struct in a 64-bit load. Maybe it doesn't pack the struct? gcc 5.1 doesn't seem to have that problem most of the time.
Speeding up insertion-sort
Since big sorts divide-and-conquer down to insertion sort for small-enough problems, insertion sort is worth vectorizing. Insertion sort copies array elements over by one, stopping only when we find we've reached the spot where the current element belongs. So we need to compare one element to a sequence of packed elements, stopping if the comparison is true for any of them. Do you smell vectors? I smell vectors.
# rsi points to  struct { float f; uint... payload; } buf[]
# rdi points to the next element to be inserted into the sorted portion
# [ rsi .. rdi ) is sorted, the rest isn't.
##### PROOF OF CONCEPT: debug / finish writing before using! ######

.new_elem:
    vbroadcastsd ymm0, [rdi]      # broadcast the whole 64-bit struct to all 4 qword lanes
    mov     rdx, rdi

.search_loop:
    sub     rdx, 32
    vmovups ymm1, [rdx]           # load some sorted data
    vcmplt_oqps ymm2, ymm0, ymm1  # all-ones in any element where ymm0[i] < ymm1[i] (FP compare, false if either is NaN)
    vmovups [rdx+8], ymm1         # shuffle it over to make space, usual insertion-sort style
    cmp     rdx, rsi
    jbe     .endsearch            # below-or-equal (addresses are unsigned)
    movmskps eax, ymm2
    test    al, 0b01010101        # test only the float-element compare results
    jz      .search_loop          # [rdi] wasn't less than any of the 4 elements

.endsearch:
    # TODO: scalar loop to find out where the new element goes.
    # All we know is that it's less than one of the elements in ymm1, but not which one.

    add     rdi, 8
    vmovsd  [rdx], xmm0
    cmp     rdi, r8               # pointer to the end of buf
    jle     .new_elem

    # worse alternative to movmskps / test:
    # vtestps ymm2, ymm7   # where ymm7 is loaded with 1s in the float elements, and 0s in the payload elements.
    # vtestps is like ptest, but only tests the high bit.  If the struct were in the other order, with the
    # float high, vtestpd against a register of all-1s would work, as that's more convenient to generate.
This is probably full of bugs, and I should have just written it in C with intrinsics.
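Here's roughly what the vector search step might look like with intrinsics (my own untested sketch, using the same layout assumption as the asm: float first, payload second; the scalar part that actually inserts the element is still left out):

#include <immintrin.h>
#include <cstdint>
#include <cstring>

struct elem64 { float f; uint16_t a, b; };   // float in the low 32 bits, like the asm assumes

// Compare one new element against 4 packed structs from the sorted region.
// Returns a nonzero bitmask if new_elem.f is less than any of the 4 keys.
static inline int search_step(const elem64* sorted4, const elem64& new_elem) {
    double as_double;                                    // reinterpret the 64-bit struct as a double
    std::memcpy(&as_double, &new_elem, sizeof as_double);
    __m256 bcast = _mm256_castpd_ps(_mm256_set1_pd(as_double));              // like vbroadcastsd
    __m256 data  = _mm256_loadu_ps(reinterpret_cast<const float*>(sorted4)); // 4 structs = 8 floats
    __m256 lt    = _mm256_cmp_ps(bcast, data, _CMP_LT_OQ);                   // like vcmplt_oqps
    return _mm256_movemask_ps(lt) & 0b01010101;          // keep only the float (even) lanes, like test al, 0b01010101
}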
This insertion sort has more overhead than most, so it would probably lose to a scalar version for very small problem sizes, due to the extra complexity of handling the first few elements (which don't fill a vector), and of figuring out where to put the new element after breaking out of the vector search loop that checked multiple elements.
Probably pipelining the loop, so we haven't stored ymm1 until the next iteration (or after breaking out), would save a redundant store. Doing the compares in registers by shifting / shuffling them, instead of literally doing scalar loads/compares, would probably also be a win. This could end up with way too many unpredictable branches, and I'm not seeing a nice way to end up with the high 4 packed in a register for vmovups, and the low one in another register for vmovsd.
I may have invented an insertion sort that's the worst of both worlds: slow for small arrays because of the extra work after breaking out of the search loop, but still insertion sort: slow for large arrays because of O(n^2). However, if the code outside the search loop can be made non-horrible, this could be useful as the small-array endpoint for qsort / mergesort.
Anyway, if anyone develops this idea into actual debugged and working code, let us know.
update: Timothy Furtak's paper describes an SSE implementation for sorting short arrays (for use as a building block for bigger sorts, like this insertion sort). He suggests producing a partially-ordered result with SSE, and then doing a cleanup with scalar ops. (Insertion-sort on a mostly-sorted array is fast.)
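That scalar cleanup pass would just be an ordinary insertion sort, which is cheap when its input is already mostly sorted. A generic sketch (mine, not Furtak's code):

#include <cstddef>

// Plain insertion sort: fast when `buf` is already nearly sorted, because the
// inner while loop usually runs zero or one iterations.
template <typename T, typename Less>
void insertion_sort_cleanup(T* buf, std::size_t n, Less less) {
    for (std::size_t i = 1; i < n; ++i) {
        T elem = buf[i];
        std::size_t j = i;
        while (j > 0 && less(elem, buf[j - 1])) {
            buf[j] = buf[j - 1];     // shift larger elements up by one
            --j;
        }
        buf[j] = elem;               // drop the element into its spot
    }
}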
Which leads us to:
Sorting networks
There might not be a speedup here. Xiaochen, Rocki, and Suda only report a 3.7x speedup from scalar -> AVX-512 for 32-bit (int) elements, for single-threaded mergesort, on a Xeon Phi card. With wider elements, fewer fit in a vector reg. (That's a factor of 4 for us: 64-bit elements in 256b, vs. 32-bit elements in 512b.) They also take advantage of AVX-512 masks to only compare some lanes, a feature not available in AVX. Plus, with a slower comparator function that competes for the shuffle/blend unit, we're already in worse shape.
Sorting networks can be constructed using SSE/AVX packed-compare instructions. (More usually, with a pair of min/max instructions that effectively do a set of packed 2-element sorts.) Larger sorts can be built up out of an operation that does pairwise sorts. This paper by Tian Xiaochen, Kamil Rocki and Reiji Suda at U of Tokyo has some real AVX code for sorting (without payloads), and a discussion of how it's tricky with vector registers because you can't compare two elements that are in the same register (so the sorting network has to be designed not to require that). They use pshufd to line up elements for the next comparison, to build up a larger sort out of sorting just a few registers full of data.
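To illustrate just the keys-only building block (not the key+payload case we actually want), here's one compare-exchange step on 4 doubles in a ymm register, using min/max plus a shuffle and a blend. This is my own illustration, not code from the paper (which uses pshufd on ints):

#include <immintrin.h>

// One sorting-network step: compare-exchange the pairs (v0,v1) and (v2,v3)
// of a __m256d, so each 128-bit lane ends up as [min, max].
static inline __m256d sort_pairs(__m256d v) {
    __m256d swapped = _mm256_shuffle_pd(v, v, 0b0101);   // swap neighbours within each 128-bit lane
    __m256d lo = _mm256_min_pd(v, swapped);
    __m256d hi = _mm256_max_pd(v, swapped);
    return _mm256_blend_pd(lo, hi, 0b1010);              // min into even lanes, max into odd lanes
}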
Now, the trick is to do a sort of pairs of packed 64-bit elements, based on a comparison of only half of each element (i.e. keeping the payload with the sort key). We could potentially sort other things this way too, e.g. an array of (key, payload) pairs where the payload is an index or a 32-bit pointer (mmap(MAP_32BIT), or the x32 ABI).
So let's build ourselves a comparator. In sorting-network parlance, that's an operation that sorts a pair of inputs. So it either swaps elements between registers, or not.
# AVX comparator for SnB/IvB
# struct { uint16_t a, b; float f; }  inputs in ymm0, ymm1
# note: struct order with f second saves a shuffle to extend the mask

vcmpps    ymm7, ymm0, ymm1, _CMP_LT_OQ  # imm8=17: less-than, ordered, quiet (non-signalling on NaN)
          # ymm7 32-bit elements = 0xFFFFFFFF if ymm0[i] < ymm1[i], else 0
# vblendvpd checks the high bit of each 64-bit element, so the mask *doesn't* need to be extended to the low 32
vblendvpd ymm2, ymm1, ymm0, ymm7
vblendvpd ymm3, ymm0, ymm1, ymm7
# result: !(ymm2[i] > ymm3[i])  (i.e. ymm2[i] < ymm3[i], or they're equal or unordered (NaN))
# UNTESTED
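In intrinsics, that comparator might look something like this (again untested, and the function name is just mine; it assumes the same struct order as the asm, with f in the high 32 bits of each 64-bit lane):

#include <immintrin.h>

// a and b each hold 4 packed { uint16_t a, b; float f; } structs.  vblendvpd
// selects on the high bit of each 64-bit lane, which is exactly the sign bit
// of the key's compare result, so no mask extension is needed.
static inline void comparator(__m256d a, __m256d b, __m256d& mn, __m256d& mx) {
    __m256  lt   = _mm256_cmp_ps(_mm256_castpd_ps(a), _mm256_castpd_ps(b), _CMP_LT_OQ);
    __m256d mask = _mm256_castps_pd(lt);
    mn = _mm256_blendv_pd(b, a, mask);   // take a where a.f < b.f, else b
    mx = _mm256_blendv_pd(a, b, mask);   // the other one
}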
You might need to set the MXCSR to make sure that the int bits don't slow down FP ops if they happen to represent a denormal or NaN float. I'm not sure if that only happens for mul/div, or if it would affect compare.
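If it turns out to matter, the relevant MXCSR bits can be set with the standard intrinsics (this just shows the knob; whether it actually helps for compares is the open question above):

#include <xmmintrin.h>   // _MM_SET_FLUSH_ZERO_MODE
#include <pmmintrin.h>   // _MM_SET_DENORMALS_ZERO_MODE

void set_daz_ftz() {
    _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON);  // treat denormal inputs as zero
    _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);          // flush denormal results to zero
}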
- Intel Haswell: latency: 5 cycles for ymm2 to be ready, 7 cycles for ymm3. Throughput: one per 4 cycles (p5 bottleneck).
- Intel SandyBridge/IvyBridge: latency: 5 cycles for ymm2 to be ready, 6 cycles for ymm3. Throughput: one per 2 cycles (p0/p5 bottleneck).
- AMD Bulldozer/Piledriver (vblendvpd ymm: 2c lat, 2c recip tput): lat: 4c for ymm2, 6c for ymm3. Or worse, with bypass delays between cmpps and blend. tput: one per 4c (bottleneck on vector P1).
- AMD Steamroller (vblendvpd ymm: 2c lat, 1c recip tput): lat: 4c for ymm2, 5c for ymm3. Or maybe one higher because of bypass delays. tput: one per 3c (bottleneck on vector ports P0/1, for cmp and blend).
vblendvpd is 2 uops. (It has 3 reg inputs, so it can't be a single uop :/.) Both uops can only run on shuffle ports. On Haswell, that's only port5. On SnB, that's p0/p5. (IDK why Haswell halved the shuffle / blend throughput compared to SnB/IvB.)
If AMD designs had 256-bit-wide vector units, their lower-latency FP compare and single-macro-op decoding of 3-input instructions would put them ahead.
The usual minps/maxps pair is 3 and 4 cycles latency (for ymm2/3), and one per 2 cycles throughput (Intel). (p1 bottleneck on the FP add/sub/compare unit.) The most fair comparison is probably to sorting 64-bit doubles. The extra latency may hurt if there aren't multiple pairs of independent registers to be compared. The halved throughput on Haswell will cut into any speedups pretty heavily.
Also keep in mind that shuffles are needed between comparator operations to get the right elements lined up for comparison. min/maxps leave the shuffle ports unused, but my cmpps/blendv version saturates them, meaning the shuffling can't overlap with the comparing, except as something to fill gaps left by data dependencies.
With hyperthreading, another thread that can keep the other ports busy (e.g. the port 0/1 FP mul/add units, or integer code) would share a core quite nicely with this blend-bottlenecked version.
I attempted another version for Haswell, which does the blends "manually" using bitwise AND/OR operations. It ended up slower, though, because both sources have to get masked both ways before combining.
# AVX2 comparator for Haswell
# struct { float f; uint16_t a, b; }  inputs in ymm0, ymm1

vcmpps  ymm7, ymm0, ymm1, _CMP_LT_OQ  # imm8=17: less-than, ordered, quiet (non-signalling on NaN)
        # ymm7 32-bit elements = 0xFFFFFFFF if ymm0[i] < ymm1[i], else 0
vshufps ymm7, ymm7, ymm7, mask(0, 0, 2, 2)  # extend the mask to the payload part.  There's no mask() function, I just don't want to work out the result in my head.

vpand   ymm10, ymm7, ymm0       # ymm10 = ymm0 keeping elements where ymm0[i] < ymm1[i]
vpandn  ymm11, ymm7, ymm1       # ymm11 = ymm1 keeping elements where !(ymm0[i] < ymm1[i])
vpor    ymm2, ymm10, ymm11      # ymm2 = min_packed_mystruct(ymm0, ymm1)

vpandn  ymm10, ymm7, ymm0       # ymm10 = ymm0 keeping elements where !(ymm0[i] < ymm1[i])
vpand   ymm11, ymm7, ymm1       # ymm11 = ymm1 keeping elements where ymm0[i] < ymm1[i]
vpor    ymm3, ymm10, ymm11      # ymm3 = max_packed_mystruct(ymm0, ymm1)

# result: !(ymm2[i] > ymm3[i])
# UNTESTED
This is 8 uops, compared to 5 for the blendv version. There's a lot of parallelism in the last 6 and/andn/or instructions. cmpps has 3 cycle latency, though. I think ymm2 will be ready in 6 cycles, while ymm3 is ready in 7. (And it can overlap with operations on ymm2.) The insns following a comparator op will probably be shuffles, to put the data in the right elements for the next compare. There's no forwarding delay to/from the shuffle unit for integer-domain logicals, even for vshufps, but the result should come out in the FP domain, ready for a vcmpps. Using vpand instead of vandps is essential for throughput.
Timothy Furtak's paper suggests an approach for sorting keys with a payload: don't pack the payload pointers with the keys; instead, generate a mask from a compare, and use it on both the keys and the payload the same way. This means you have to separate the payload from the keys, either in your data structure, or every time you load a struct.
See the appendix of his paper (fig. 12). He uses the standard min/max on the keys, and then uses cmpps to see which elements changed. Then he ANDs that mask in the middle of an xor-swap to end up only swapping the payloads for the keys that swapped.
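Here's my attempt at sketching that idea with keys and payloads in separate registers (based on the description above, not a transcription of Furtak's fig. 12; needs AVX2 for the 256-bit integer ops, and is untested):

#include <immintrin.h>

// keys_a/keys_b: 8 float keys each.  pay_a/pay_b: the matching 8 x 32-bit payloads.
static inline void comparator_split(__m256& keys_a, __m256& keys_b,
                                    __m256i& pay_a,  __m256i& pay_b) {
    __m256 mn = _mm256_min_ps(keys_a, keys_b);
    __m256 mx = _mm256_max_ps(keys_a, keys_b);
    // cmpeq(min, orig): all-ones where keys_a[i] was already the min (no swap happened)
    __m256i kept = _mm256_castps_si256(_mm256_cmp_ps(mn, keys_a, _CMP_EQ_OQ));
    // xor-swap the payloads, but AND the mask in the middle so only swapped lanes change
    __m256i diff = _mm256_andnot_si256(kept, _mm256_xor_si256(pay_a, pay_b));
    pay_a = _mm256_xor_si256(pay_a, diff);
    pay_b = _mm256_xor_si256(pay_b, diff);
    keys_a = mn;
    keys_b = mx;
}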