c++ - Sorting 64-bit structs using AVX?
I have a 64-bit struct which represents several pieces of data, one of which is a floating-point value:

struct mystruct { uint16_t a; uint16_t b; float f; };

and I have four of these structs in, let's say, an std::array<mystruct, 4>.

Is it possible to use AVX to sort the array in terms of the float member mystruct::f?
Sorry this answer is messy; it wasn't all written at once, and I'm lazy. There is some duplication.
I have 4 separate ideas:

- Normal sorting, but moving the struct as a 64-bit unit
- Vectorized insertion-sort as a building block for qsort
- Sorting networks, with a comparator implementation using cmpps / blendvpd instead of minps / maxps. The extra overhead might kill the speedup, though.
- Sorting networks: load some structs, then shuffle/blend to get some registers of just floats and some registers of just payload. Use Timothy Furtak's technique of doing a normal minps / maxps comparator, and then cmpeqps min,orig -> masked xor-swap on the payload. This sorts twice as much data per comparator, but does require matching shuffles on two registers between comparators. It also requires re-interleaving when you're done (but that's easy with unpcklps / unpckhps, if you arrange your comparators so those in-lane unpacks put the final data in the right order).

This also avoids potential slowdowns that some CPUs may have when doing FP comparisons on bit patterns in the payload that represent denormals, NaNs, or infinities, without resorting to setting the denormals-are-zero bit in the MXCSR.
Furtak's paper suggests doing a scalar cleanup after getting things mostly sorted with vectors, which would reduce the amount of shuffling a lot.
Normal sorting
There's at least a small speedup to be gained when using normal sorting algorithms, by moving the whole struct around with 64-bit loads/stores, and doing a scalar FP compare on the FP element. For this idea to work as well as possible, order the struct with the float value first; then you could movq the whole struct into an xmm reg, and the float value would be in the low 32 bits, ready for ucomiss. Then you (or maybe a smart compiler) could store the struct with a movq.
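For example, here's a quick sketch of the layout and comparator that paragraph describes (my own example, not from the question; whether a given compiler actually emits movq / ucomiss for it is something you'd have to check in the asm output):

// Sketch only: reordered struct (key first) plus a plain scalar comparator.
#include <algorithm>
#include <array>
#include <cstdint>

struct mystruct_keyfirst { float f; uint16_t a, b; };   // float first: key in the low 32 bits

void sort4(std::array<mystruct_keyfirst, 4>& arr) {
    std::sort(arr.begin(), arr.end(),
              [](const mystruct_keyfirst& x, const mystruct_keyfirst& y) {
                  return x.f < y.f;   // scalar FP compare; the whole 8-byte struct moves as one unit
              });
}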
Looking at the asm output that Kerrek SB linked to, compilers seem to do a rather bad job of efficiently copying structs around: icc seems to movzx the two uint values separately, rather than scooping up the whole struct in a 64-bit load. Maybe it doesn't pack the struct? gcc 5.1 doesn't seem to have that problem most of the time.
Speeding up insertion-sort
Since big sorts divide-and-conquer down to insertion sort for small-enough problems, insertion sort is worth vectorizing. Insertion sort copies array elements over by one, stopping only when we find we've reached the spot where the current element belongs. So we need to compare one element to a sequence of packed elements, stopping if the comparison is true for any of them. Do you smell vectors? I smell vectors.
# rsi points to  struct { float f; uint... payload; } buf[]
# rdi points to the next element to be inserted into the sorted portion
# [ rsi .. rdi ) is sorted, the rest isn't.
##### PROOF OF CONCEPT: debug / finish writing before using! ######

.new_elem:
    vbroadcastsd ymm0, [rdi]      # broadcast the whole 64-bit struct to all 4 qword lanes
    mov     rdx, rdi

.search_loop:
    sub     rdx, 32
    vmovups ymm1, [rdx]           # load some sorted data
    vcmplt_oqps ymm2, ymm0, ymm1  # all-ones in any element where ymm0[i] < ymm1[i] (FP compare, false if either is NaN)
    vmovups [rdx+8], ymm1         # shuffle it over to make space, usual insertion-sort style
    cmp     rdx, rsi
    jbe     .endsearch            # below-or-equal (addresses are unsigned)
    movmskps eax, ymm2
    test    al, 0b01010101        # test only the float-element compare results
    jz      .search_loop          # [rdi] wasn't less than any of the 4 elements

.endsearch:
    # TODO: scalar loop to find out where the new element goes.
    # All we know is that it's less than one of the elements in ymm1, but not which one.

    add     rdi, 8
    vmovsd  [rdx], xmm0
    cmp     rdi, r8               # pointer to the end of buf
    jle     .new_elem

    # worse alternative to movmskps / test:
    # vtestps ymm2, ymm7   # where ymm7 is loaded with 1s in the float elements, and 0s in the payload elements.
    # vtestps is like ptest, but only tests the high bit.  If the struct were in the other order, with the
    # float high, vtestpd against a register of all-1s would work, as that's more convenient to generate.
This is probably full of bugs, and I should have just written it in C with intrinsics.
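Here's roughly what the vector search step might look like with intrinsics (my own untested sketch, using the same layout assumption as the asm: float first, payload second; the scalar part that actually inserts the element is still left out):

#include <immintrin.h>
#include <cstdint>
#include <cstring>

struct elem64 { float f; uint16_t a, b; };   // float in the low 32 bits, like the asm assumes

// Compare one new element against 4 packed structs from the sorted region.
// Returns a nonzero bitmask if new_elem.f is less than any of the 4 keys.
static inline int search_step(const elem64* sorted4, const elem64& new_elem) {
    double as_double;                                    // reinterpret the 64-bit struct as a double
    std::memcpy(&as_double, &new_elem, sizeof as_double);
    __m256 bcast = _mm256_castpd_ps(_mm256_set1_pd(as_double));              // like vbroadcastsd
    __m256 data  = _mm256_loadu_ps(reinterpret_cast<const float*>(sorted4)); // 4 structs = 8 floats
    __m256 lt    = _mm256_cmp_ps(bcast, data, _CMP_LT_OQ);                   // like vcmplt_oqps
    return _mm256_movemask_ps(lt) & 0b01010101;          // keep only the float (even) lanes, like test al, 0b01010101
}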
This insertion sort has more overhead than most, so it would probably lose to a scalar version for very small problem sizes, due to the extra complexity of handling the first few elements (which don't fill a vector), and of figuring out where to put the new element after breaking out of the vector search loop that checked multiple elements.
Probably pipelining the loop, so we haven't stored ymm1 until the next iteration (or after breaking out), would save a redundant store. Doing the compares in registers by shifting / shuffling them, instead of literally doing scalar loads/compares, would probably also be a win. This could end up with way too many unpredictable branches, and I'm not seeing a nice way to end up with the high 4 packed in a register for vmovups, and the low one in another register for vmovsd.
I may have invented an insertion sort that's the worst of both worlds: slow for small arrays because of the extra work after breaking out of the search loop, but still insertion sort: slow for large arrays because of O(n^2). However, if the code outside the search loop can be made non-horrible, this could be useful as the small-array endpoint for qsort / mergesort.
Anyway, if anyone develops this idea into actual debugged and working code, let us know.
update: Timothy Furtak's paper describes an SSE implementation for sorting short arrays (for use as a building block for bigger sorts, like this insertion sort). He suggests producing a partially-ordered result with SSE, and then doing a cleanup with scalar ops. (Insertion-sort on a mostly-sorted array is fast.)
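That scalar cleanup pass would just be an ordinary insertion sort, which is cheap when its input is already mostly sorted. A generic sketch (mine, not Furtak's code):

#include <cstddef>

// Plain insertion sort: fast when `buf` is already nearly sorted, because the
// inner while loop usually runs zero or one iterations.
template <typename T, typename Less>
void insertion_sort_cleanup(T* buf, std::size_t n, Less less) {
    for (std::size_t i = 1; i < n; ++i) {
        T elem = buf[i];
        std::size_t j = i;
        while (j > 0 && less(elem, buf[j - 1])) {
            buf[j] = buf[j - 1];     // shift larger elements up by one
            --j;
        }
        buf[j] = elem;               // drop the element into its spot
    }
}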
Which leads us to:
Sorting networks
There might not be a speedup here. Xiaochen, Rocki, and Suda only report a 3.7x speedup from scalar -> AVX-512 for 32-bit (int) elements, for single-threaded mergesort, on a Xeon Phi card. With wider elements, fewer fit in a vector reg. (That's a factor of 4 for us: 64-bit elements in 256b, vs. 32-bit elements in 512b.) They also take advantage of AVX-512 masks to only compare some lanes, a feature not available in AVX. Plus, with a slower comparator function that competes for the shuffle/blend unit, we're already in worse shape.
Sorting networks can be constructed using SSE/AVX packed-compare instructions. (More usually, with a pair of min/max instructions that effectively do a set of packed 2-element sorts.) Larger sorts can be built up out of an operation that does pairwise sorts. This paper by Tian Xiaochen, Kamil Rocki and Reiji Suda at U of Tokyo has some real AVX code for sorting (without payloads), and a discussion of how it's tricky with vector registers because you can't compare two elements that are in the same register (so the sorting network has to be designed not to require that). They use pshufd to line up elements for the next comparison, to build up a larger sort out of sorting just a few registers full of data.
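To illustrate just the keys-only building block (not the key+payload case we actually want), here's one compare-exchange step on 4 doubles in a ymm register, using min/max plus a shuffle and a blend. This is my own illustration, not code from the paper (which uses pshufd on ints):

#include <immintrin.h>

// One sorting-network step: compare-exchange the pairs (v0,v1) and (v2,v3)
// of a __m256d, so each 128-bit lane ends up as [min, max].
static inline __m256d sort_pairs(__m256d v) {
    __m256d swapped = _mm256_shuffle_pd(v, v, 0b0101);   // swap neighbours within each 128-bit lane
    __m256d lo = _mm256_min_pd(v, swapped);
    __m256d hi = _mm256_max_pd(v, swapped);
    return _mm256_blend_pd(lo, hi, 0b1010);              // min into even lanes, max into odd lanes
}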
Now, the trick is to do a sort of pairs of packed 64-bit elements, based on a comparison of only half of each element (i.e. keeping the payload with the sort key). We could potentially sort other things this way too, e.g. an array of (key, payload) pairs where the payload is an index or a 32-bit pointer (mmap(MAP_32BIT), or the x32 ABI).
So let's build ourselves a comparator. In sorting-network parlance, that's an operation that sorts a pair of inputs. So it either swaps elements between registers, or not.
# AVX comparator for SnB/IvB
# struct { uint16_t a, b; float f; }  inputs in ymm0, ymm1
# note: struct order with f second saves a shuffle to extend the mask

vcmpps    ymm7, ymm0, ymm1, _CMP_LT_OQ  # imm8=17: less-than, ordered, quiet (non-signalling on NaN)
          # ymm7 32-bit elements = 0xFFFFFFFF if ymm0[i] < ymm1[i], else 0
# vblendvpd checks the high bit of each 64-bit element, so the mask *doesn't* need to be extended to the low 32
vblendvpd ymm2, ymm1, ymm0, ymm7
vblendvpd ymm3, ymm0, ymm1, ymm7
# result: !(ymm2[i] > ymm3[i])  (i.e. ymm2[i] < ymm3[i], or they're equal or unordered (NaN))
# UNTESTED
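In intrinsics, that comparator might look something like this (again untested, and the function name is just mine; it assumes the same struct order as the asm, with f in the high 32 bits of each 64-bit lane):

#include <immintrin.h>

// a and b each hold 4 packed { uint16_t a, b; float f; } structs.  vblendvpd
// selects on the high bit of each 64-bit lane, which is exactly the sign bit
// of the key's compare result, so no mask extension is needed.
static inline void comparator(__m256d a, __m256d b, __m256d& mn, __m256d& mx) {
    __m256  lt   = _mm256_cmp_ps(_mm256_castpd_ps(a), _mm256_castpd_ps(b), _CMP_LT_OQ);
    __m256d mask = _mm256_castps_pd(lt);
    mn = _mm256_blendv_pd(b, a, mask);   // take a where a.f < b.f, else b
    mx = _mm256_blendv_pd(a, b, mask);   // the other one
}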
You might need to set the MXCSR to make sure that the int bits don't slow down FP ops if they happen to represent a denormal or NaN float. I'm not sure if that only happens for mul/div, or if it would affect compare.
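If it turns out to matter, the relevant MXCSR bits can be set with the standard intrinsics (this just shows the knob; whether it actually helps for compares is the open question above):

#include <xmmintrin.h>   // _MM_SET_FLUSH_ZERO_MODE
#include <pmmintrin.h>   // _MM_SET_DENORMALS_ZERO_MODE

void set_daz_ftz() {
    _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON);  // treat denormal inputs as zero
    _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);          // flush denormal results to zero
}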
- Intel Haswell: latency: 5 cycles for ymm2 to be ready, 7 cycles for ymm3. Throughput: one per 4 cycles (p5 bottleneck).
- Intel SandyBridge/IvyBridge: latency: 5 cycles for ymm2 to be ready, 6 cycles for ymm3. Throughput: one per 2 cycles (p0/p5 bottleneck).
- AMD Bulldozer/Piledriver (vblendvpd ymm: 2c lat, 2c recip tput): lat: 4c for ymm2, 6c for ymm3. Or worse, with bypass delays between cmpps and blend. tput: one per 4c (bottleneck on vector P1).
- AMD Steamroller (vblendvpd ymm: 2c lat, 1c recip tput): lat: 4c for ymm2, 5c for ymm3. Or maybe one higher because of bypass delays. tput: one per 3c (bottleneck on vector ports P0/1, for cmp and blend).
vblendvpd is 2 uops. (It has 3 reg inputs, so it can't be a single uop :/.) Both uops can only run on shuffle ports. On Haswell, that's only port5. On SnB, that's p0/p5. (IDK why Haswell halved the shuffle / blend throughput compared to SnB/IvB.)
If AMD designs had 256-bit-wide vector units, their lower-latency FP compare and single-macro-op decoding of 3-input instructions would put them ahead.
The usual minps/maxps pair is 3 and 4 cycles latency (for ymm2/3), and one per 2 cycles throughput (Intel). (p1 bottleneck on the FP add/sub/compare unit.) The most fair comparison is probably to sorting 64-bit doubles. The extra latency may hurt if there aren't multiple pairs of independent registers to be compared. The halved throughput on Haswell will cut into any speedups pretty heavily.
Also keep in mind that shuffles are needed between comparator operations to get the right elements lined up for comparison. min/maxps leave the shuffle ports unused, but my cmpps/blendv version saturates them, meaning the shuffling can't overlap with the comparing, except as something to fill gaps left by data dependencies.
With hyperthreading, another thread that can keep the other ports busy (e.g. the port 0/1 FP mul/add units, or integer code) would share a core quite nicely with this blend-bottlenecked version.
I attempted another version for Haswell, which does the blends "manually" using bitwise AND/OR operations. It ended up slower, though, because both sources have to get masked both ways before combining.
# AVX2 comparator for Haswell
# struct { float f; uint16_t a, b; }  inputs in ymm0, ymm1

vcmpps  ymm7, ymm0, ymm1, _CMP_LT_OQ  # imm8=17: less-than, ordered, quiet (non-signalling on NaN)
        # ymm7 32-bit elements = 0xFFFFFFFF if ymm0[i] < ymm1[i], else 0
vshufps ymm7, ymm7, ymm7, mask(0, 0, 2, 2)  # extend the mask to the payload part.  There's no mask() function, I just don't want to work out the result in my head.

vpand   ymm10, ymm7, ymm0       # ymm10 = ymm0 keeping elements where ymm0[i] < ymm1[i]
vpandn  ymm11, ymm7, ymm1       # ymm11 = ymm1 keeping elements where !(ymm0[i] < ymm1[i])
vpor    ymm2, ymm10, ymm11      # ymm2 = min_packed_mystruct(ymm0, ymm1)

vpandn  ymm10, ymm7, ymm0       # ymm10 = ymm0 keeping elements where !(ymm0[i] < ymm1[i])
vpand   ymm11, ymm7, ymm1       # ymm11 = ymm1 keeping elements where ymm0[i] < ymm1[i]
vpor    ymm3, ymm10, ymm11      # ymm3 = max_packed_mystruct(ymm0, ymm1)

# result: !(ymm2[i] > ymm3[i])
# UNTESTED
This is 8 uops, compared to 5 for the blendv version. There's a lot of parallelism in the last 6 and/andn/or instructions. cmpps has 3 cycle latency, though. I think ymm2 will be ready in 6 cycles, while ymm3 is ready in 7. (And it can overlap with operations on ymm2.) The insns following a comparator op will probably be shuffles, to put the data in the right elements for the next compare. There's no forwarding delay to/from the shuffle unit for integer-domain logicals, even for vshufps, but the result should come out in the FP domain, ready for a vcmpps. Using vpand instead of vandps is essential for throughput.
Timothy Furtak's paper suggests an approach for sorting keys with a payload: don't pack the payload pointers with the keys; instead, generate a mask from a compare, and use it on both the keys and the payload the same way. This means you have to separate the payload from the keys, either in your data structure, or every time you load a struct.
See the appendix of his paper (fig. 12). He uses the standard min/max on the keys, and then uses cmpps to see which elements changed. Then he ANDs that mask in the middle of an xor-swap to end up only swapping the payloads for the keys that swapped.
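Here's my attempt at sketching that idea with keys and payloads in separate registers (based on the description above, not a transcription of Furtak's fig. 12; needs AVX2 for the 256-bit integer ops, and is untested):

#include <immintrin.h>

// keys_a/keys_b: 8 float keys each.  pay_a/pay_b: the matching 8 x 32-bit payloads.
static inline void comparator_split(__m256& keys_a, __m256& keys_b,
                                    __m256i& pay_a,  __m256i& pay_b) {
    __m256 mn = _mm256_min_ps(keys_a, keys_b);
    __m256 mx = _mm256_max_ps(keys_a, keys_b);
    // cmpeq(min, orig): all-ones where keys_a[i] was already the min (no swap happened)
    __m256i kept = _mm256_castps_si256(_mm256_cmp_ps(mn, keys_a, _CMP_EQ_OQ));
    // xor-swap the payloads, but AND the mask in the middle so only swapped lanes change
    __m256i diff = _mm256_andnot_si256(kept, _mm256_xor_si256(pay_a, pay_b));
    pay_a = _mm256_xor_si256(pay_a, diff);
    pay_b = _mm256_xor_si256(pay_b, diff);
    keys_a = mn;
    keys_b = mx;
}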