Diffstat (limited to 'src/lib/libcrypto/modes/asm/ghash-x86.pl')
 -rw-r--r--  src/lib/libcrypto/modes/asm/ghash-x86.pl | 1342
 1 file changed, 1342 insertions(+), 0 deletions(-)
diff --git a/src/lib/libcrypto/modes/asm/ghash-x86.pl b/src/lib/libcrypto/modes/asm/ghash-x86.pl
new file mode 100644
index 0000000000..6b09669d47
--- /dev/null
+++ b/src/lib/libcrypto/modes/asm/ghash-x86.pl
@@ -0,0 +1,1342 @@
#!/usr/bin/env perl
#
# ====================================================================
# Written by Andy Polyakov <appro@openssl.org> for the OpenSSL
# project. The module is, however, dual licensed under OpenSSL and
# CRYPTOGAMS licenses depending on where you obtain it. For further
# details see http://www.openssl.org/~appro/cryptogams/.
# ====================================================================
#
# March, May, June 2010
#
# The module implements the "4-bit" GCM GHASH function and the
# underlying single multiplication operation in GF(2^128). "4-bit"
# means that it uses a 256-byte per-key table [+64/128 bytes of fixed
# table]. It has two code paths: vanilla x86 and vanilla MMX. The
# former is executed on 486 and Pentium, the latter on all others.
# MMX GHASH features a so-called "528B" variant of the "4-bit" method
# utilizing an additional 256+16 bytes of per-key storage [+512 bytes
# of shared table]. (One step of the basic method is sketched at the
# end of this note.) Performance results are for the streamed GHASH
# subroutine and are expressed in cycles per processed byte, less is
# better:
#
#		gcc 2.95.3(*)	MMX assembler	x86 assembler
#
# Pentium	105/111(**)	-		50
# PIII		68 /75		12.2		24
# P4		125/125		17.8		84(***)
# Opteron	66 /70		10.1		30
# Core2		54 /67		8.4		18
#
# (*)	gcc 3.4.x was observed to generate a few percent slower code,
#	which is one of the reasons why the 2.95.3 results were chosen;
#	another reason is the lack of 3.4.x results for older CPUs;
#	comparison with the MMX results is not completely fair, because
#	the C results are for the vanilla "256B" implementation, while
#	the assembler results are for "528B";-)
# (**)	the second number is the result for code compiled with the
#	-fPIC flag, which is actually more relevant, because the
#	assembler code is position-independent;
# (***)	see the comment in the non-MMX routine for further details;
#
# To summarize, it's >2-5 times faster than gcc-generated code. To
# anchor it to something else: SHA1 assembler processes one byte in
# 11-13 cycles on contemporary x86 cores. As for the choice of MMX in
# particular, see the comment at the end of the file...
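#
# For reference, one step of the "4-bit" method in the C-like terms of
# gcm128.c (a rough sketch, not the literal C code): for each 4-bit
# nibble n of Xi, processed from the last to the first,
#
#	rem   = Z.lo & 0xf;
#	Z     = Z >> 4;
#	Z.hi ^= rem_4bit[rem];	# fold in the 4 bits shifted out
#	Z    ^= Htable[n];	# accumulate the nibble's share of X*H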

# May 2010
#
# Add PCLMULQDQ version performing at 2.10 cycles per processed byte.
# The question is how close is it to the theoretical limit? The
# pclmulqdq instruction latency appears to be 14 cycles and there
# can't be more than 2 of them executing at any given time. This means
# that a single Karatsuba multiplication would take 28 cycles *plus* a
# few cycles for pre- and post-processing. The multiplication then has
# to be followed by modulo-reduction. Given that the aggregated
# reduction method [see "Carry-less Multiplication and Its Usage for
# Computing the GCM Mode" white paper by Intel] allows you to perform
# the reduction only once in a while, we can assume that the
# asymptotic performance can be estimated as (28+Tmod/Naggr)/16, where
# Tmod is the time to perform the reduction and Naggr is the
# aggregation factor.
#
# Before we proceed to this implementation, let's have a closer look
# at the best-performing code suggested by Intel in their white paper.
# By tracing inter-register dependencies Tmod is estimated as ~19
# cycles and the Naggr chosen by Intel is 4, resulting in 2.05 cycles
# per processed byte. As implied, this is a quite optimistic estimate,
# because it does not account for Karatsuba pre- and post-processing,
# which for a single multiplication is ~5 cycles. Unfortunately Intel
# does not provide performance data for GHASH alone. But benchmarking
# AES_GCM_encrypt ripped out of Fig. 15 of the white paper with aadt
# alone resulted in 2.46 cycles per byte out of a 16KB buffer. Note
# that the result accounts even for pre-computing of the degrees of
# the hash key H, but its portion is negligible at 16KB buffer size.
#
# Moving on to the implementation in question. Tmod is estimated as
# ~13 cycles and Naggr is 2, giving an asymptotic performance of ...
# 2.16. How is it possible that the measured performance is better
# than the optimistic theoretical estimate? There is one thing Intel
# failed to recognize. By serializing GHASH with CTR in the same
# subroutine, the former's performance is really limited to the above
# (Tmul+Tmod/Naggr)/16 equation. But if the GHASH procedure is
# detached, the modulo-reduction can be interleaved with Naggr-1
# multiplications at the instruction level and under ideal conditions
# even disappear from the equation. So the optimistic theoretical
# estimate for this implementation is ... 28/16=1.75, and not 2.16.
# Well, that's probably way too optimistic, at least for such a small
# Naggr. I'd argue that (28+Tproc/Naggr)/16, where Tproc is the time
# required for Karatsuba pre- and post-processing, is a more realistic
# estimate. In this case it gives ... 1.91 cycles. Or in other words,
# depending on how well we can interleave the reduction and one of the
# two multiplications, the performance should be between 1.91 and
# 2.16. As already mentioned, this implementation processes one byte
# out of an 8KB buffer in 2.10 cycles, while the x86_64 counterpart -
# in 2.02. x86_64 performance is better, because the larger register
# bank allows interleaving the reduction and the multiplication
# better.
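#
# For reference, the arithmetic behind the estimates quoted above
# (Tmul=28 throughout):
#
#	Intel's code, Naggr=4, Tmod~19:	(28+19/4)/16 = 2.05
#	this code,    Naggr=2, Tmod~13:	(28+13/2)/16 = 2.16
#	reduction fully hidden:		 28/16       = 1.75
#	only Tproc~5 exposed:		(28+5/2)/16  = 1.91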
#
# Does it make sense to increase Naggr? To start with, it's virtually
# impossible in 32-bit mode, because of the limited register bank
# capacity. Otherwise the improvement has to be weighed against a
# slower setup, as well as code size and complexity increase. As even
# an optimistic estimate doesn't promise a 30% performance
# improvement, there are currently no plans to increase Naggr.
#
# Special thanks to David Woodhouse <dwmw2@infradead.org> for
# providing access to a Westmere-based system on behalf of Intel
# Open Source Technology Centre.

# January 2010
#
# Tweaked to optimize transitions between integer and FP operations
# on the same XMM register, the PCLMULQDQ subroutine was measured to
# process one byte in 2.07 cycles on Sandy Bridge, and in 2.12 - on
# Westmere. The minor regression on Westmere is outweighed by a ~15%
# improvement on Sandy Bridge. Strangely enough, an attempt to modify
# the 64-bit code in a similar manner resulted in almost 20%
# degradation on Sandy Bridge, where the original 64-bit code
# processes one byte in 1.95 cycles.

$0 =~ m/(.*[\/\\])[^\/\\]+$/; $dir=$1;
push(@INC,"${dir}","${dir}../../perlasm");
require "x86asm.pl";

&asm_init($ARGV[0],"ghash-x86.pl",$x86only = $ARGV[$#ARGV] eq "386");

$sse2=0;
for (@ARGV) { $sse2=1 if (/-DOPENSSL_IA32_SSE2/); }

($Zhh,$Zhl,$Zlh,$Zll) = ("ebp","edx","ecx","ebx");
$inp  = "edi";
$Htbl = "esi";

$unroll = 0;	# Affects x86 loop. Folded loop performs ~7% worse
		# than unrolled, which has to be weighed against
		# 2.5x x86-specific code size reduction.

sub x86_loop {
	my $off = shift;
	my $rem = "eax";

	&mov	($Zhh,&DWP(4,$Htbl,$Zll));
	&mov	($Zhl,&DWP(0,$Htbl,$Zll));
	&mov	($Zlh,&DWP(12,$Htbl,$Zll));
	&mov	($Zll,&DWP(8,$Htbl,$Zll));
	&xor	($rem,$rem);	# avoid partial register stalls on PIII

	# shrd practically kills P4, 2.5x deterioration, but P4 has an
	# MMX code-path to execute. shrd runs a tad faster [than twice
	# the shifts, moves and ors] on pre-MMX Pentium (as well as on
	# PIII and Core2), *but* minimizes code size, spares a register
	# and thus allows folding the loop...
	if (!$unroll) {
	my $cnt = $inp;

		&mov	($cnt,15);
		&jmp	(&label("x86_loop"));
		&set_label("x86_loop",16);
		for($i=1;$i<=2;$i++) {
			&mov	(&LB($rem),&LB($Zll));
			&shrd	($Zll,$Zlh,4);
			&and	(&LB($rem),0xf);
			&shrd	($Zlh,$Zhl,4);
			&shrd	($Zhl,$Zhh,4);
			&shr	($Zhh,4);
			&xor	($Zhh,&DWP($off+16,"esp",$rem,4));

			&mov	(&LB($rem),&BP($off,"esp",$cnt));
			if ($i&1) {
				&and	(&LB($rem),0xf0);
			} else {
				&shl	(&LB($rem),4);
			}

			&xor	($Zll,&DWP(8,$Htbl,$rem));
			&xor	($Zlh,&DWP(12,$Htbl,$rem));
			&xor	($Zhl,&DWP(0,$Htbl,$rem));
			&xor	($Zhh,&DWP(4,$Htbl,$rem));

			if ($i&1) {
				&dec	($cnt);
				&js	(&label("x86_break"));
			} else {
				&jmp	(&label("x86_loop"));
			}
		}
		&set_label("x86_break",16);
	} else {
		for($i=1;$i<32;$i++) {
			&comment($i);
			&mov	(&LB($rem),&LB($Zll));
			&shrd	($Zll,$Zlh,4);
			&and	(&LB($rem),0xf);
			&shrd	($Zlh,$Zhl,4);
			&shrd	($Zhl,$Zhh,4);
			&shr	($Zhh,4);
			&xor	($Zhh,&DWP($off+16,"esp",$rem,4));

			if ($i&1) {
				&mov	(&LB($rem),&BP($off+15-($i>>1),"esp"));
				&and	(&LB($rem),0xf0);
			} else {
				&mov	(&LB($rem),&BP($off+15-($i>>1),"esp"));
				&shl	(&LB($rem),4);
			}

			&xor	($Zll,&DWP(8,$Htbl,$rem));
			&xor	($Zlh,&DWP(12,$Htbl,$rem));
			&xor	($Zhl,&DWP(0,$Htbl,$rem));
			&xor	($Zhh,&DWP(4,$Htbl,$rem));
		}
	}
	&bswap	($Zll);
	&bswap	($Zlh);
	&bswap	($Zhl);
	if (!$x86only) {
		&bswap	($Zhh);
	} else {
		&mov	("eax",$Zhh);
		&bswap	("eax");
		&mov	($Zhh,"eax");
	}
}

if ($unroll) {
	&function_begin_B("_x86_gmult_4bit_inner");
	&x86_loop(4);
	&ret	();
	&function_end_B("_x86_gmult_4bit_inner");
}

sub deposit_rem_4bit {
	my $bias = shift;

	&mov	(&DWP($bias+0, "esp"),0x0000<<16);
	&mov	(&DWP($bias+4, "esp"),0x1C20<<16);
	&mov	(&DWP($bias+8, "esp"),0x3840<<16);
	&mov	(&DWP($bias+12,"esp"),0x2460<<16);
	&mov	(&DWP($bias+16,"esp"),0x7080<<16);
	&mov	(&DWP($bias+20,"esp"),0x6CA0<<16);
	&mov	(&DWP($bias+24,"esp"),0x48C0<<16);
	&mov	(&DWP($bias+28,"esp"),0x54E0<<16);
	&mov	(&DWP($bias+32,"esp"),0xE100<<16);
	&mov	(&DWP($bias+36,"esp"),0xFD20<<16);
	&mov	(&DWP($bias+40,"esp"),0xD940<<16);
	&mov	(&DWP($bias+44,"esp"),0xC560<<16);
	&mov	(&DWP($bias+48,"esp"),0x9180<<16);
	&mov	(&DWP($bias+52,"esp"),0x8DA0<<16);
	&mov	(&DWP($bias+56,"esp"),0xA9C0<<16);
	&mov	(&DWP($bias+60,"esp"),0xB5E0<<16);
}
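
# The sixteen words deposited above are carry-less multiples of 0x1C20,
# i.e. of 0xE1 (the leading byte of the GHASH reduction polynomial)
# shifted left by 5. A hypothetical Perl cross-check of the constants:
#
#	for my $i (0..15) {
#		my $r = 0;
#		for my $b (0..3) { $r ^= 0x1C20<<$b if (($i>>$b)&1); }
#		# $r<<16 equals the word stored at $bias+4*$i
#	}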

$suffix = $x86only ? "" : "_x86";

&function_begin("gcm_gmult_4bit".$suffix);
	&stack_push(16+4+1);		# +1 for stack alignment
	&mov	($inp,&wparam(0));	# load Xi
	&mov	($Htbl,&wparam(1));	# load Htable

	&mov	($Zhh,&DWP(0,$inp));	# load Xi[16]
	&mov	($Zhl,&DWP(4,$inp));
	&mov	($Zlh,&DWP(8,$inp));
	&mov	($Zll,&DWP(12,$inp));

	&deposit_rem_4bit(16);

	&mov	(&DWP(0,"esp"),$Zhh);	# copy Xi[16] on stack
	&mov	(&DWP(4,"esp"),$Zhl);
	&mov	(&DWP(8,"esp"),$Zlh);
	&mov	(&DWP(12,"esp"),$Zll);
	&shr	($Zll,20);
	&and	($Zll,0xf0);

	if ($unroll) {
		&call	("_x86_gmult_4bit_inner");
	} else {
		&x86_loop(0);
		&mov	($inp,&wparam(0));
	}

	&mov	(&DWP(12,$inp),$Zll);
	&mov	(&DWP(8,$inp),$Zlh);
	&mov	(&DWP(4,$inp),$Zhl);
	&mov	(&DWP(0,$inp),$Zhh);
	&stack_pop(16+4+1);
&function_end("gcm_gmult_4bit".$suffix);

&function_begin("gcm_ghash_4bit".$suffix);
	&stack_push(16+4+1);		# +1 for 64-bit alignment
	&mov	($Zll,&wparam(0));	# load Xi
	&mov	($Htbl,&wparam(1));	# load Htable
	&mov	($inp,&wparam(2));	# load in
	&mov	("ecx",&wparam(3));	# load len
	&add	("ecx",$inp);
	&mov	(&wparam(3),"ecx");

	&mov	($Zhh,&DWP(0,$Zll));	# load Xi[16]
	&mov	($Zhl,&DWP(4,$Zll));
	&mov	($Zlh,&DWP(8,$Zll));
	&mov	($Zll,&DWP(12,$Zll));

	&deposit_rem_4bit(16);

	&set_label("x86_outer_loop",16);
	&xor	($Zll,&DWP(12,$inp));	# xor with input
	&xor	($Zlh,&DWP(8,$inp));
	&xor	($Zhl,&DWP(4,$inp));
	&xor	($Zhh,&DWP(0,$inp));
	&mov	(&DWP(12,"esp"),$Zll);	# dump it on stack
	&mov	(&DWP(8,"esp"),$Zlh);
	&mov	(&DWP(4,"esp"),$Zhl);
	&mov	(&DWP(0,"esp"),$Zhh);

	&shr	($Zll,20);
	&and	($Zll,0xf0);

	if ($unroll) {
		&call	("_x86_gmult_4bit_inner");
	} else {
		&x86_loop(0);
		&mov	($inp,&wparam(2));
	}
	&lea	($inp,&DWP(16,$inp));
	&cmp	($inp,&wparam(3));
	&mov	(&wparam(2),$inp)	if (!$unroll);
	&jb	(&label("x86_outer_loop"));

	&mov	($inp,&wparam(0));	# load Xi
	&mov	(&DWP(12,$inp),$Zll);
	&mov	(&DWP(8,$inp),$Zlh);
	&mov	(&DWP(4,$inp),$Zhl);
	&mov	(&DWP(0,$inp),$Zhh);
	&stack_pop(16+4+1);
&function_end("gcm_ghash_4bit".$suffix);
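
# The functions emitted above are expected to match the prototypes
# declared in gcm128.c (cf. that file):
#
#	void gcm_gmult_4bit(u64 Xi[2], const u128 Htable[16]);
#	void gcm_ghash_4bit(u64 Xi[2], const u128 Htable[16],
#				const u8 *inp, size_t len);
#
# where Xi holds the running hash value and Htable the per-key table.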

if (!$x86only) {{{

&static_label("rem_4bit");

if (!$sse2) {{	# pure-MMX "May" version...

$S=12;		# shift factor for rem_4bit

&function_begin_B("_mmx_gmult_4bit_inner");
# The MMX version performs 3.5 times better on P4 (see comment in the
# non-MMX routine for further details), 100% better on Opteron, ~70%
# better on Core2 and PIII... In other words the effort is considered
# to be well spent... Since the initial release the loop has been
# unrolled in order to "liberate" the register previously used as the
# loop counter. Instead it's used to optimize the critical path in
# 'Z.hi ^= rem_4bit[Z.lo&0xf]'. The path involves a move of Z.lo from
# MMX to an integer register, effective address calculation and
# finally a merge of the value into Z.hi. The reference to rem_4bit is
# scheduled so late that I had to >>4 the rem_4bit elements. This
# resulted in a 20-45% improvement on contemporary µ-archs.
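# (Hence $S=12 above: the rem_4bit entries are stored pre-shifted
# right by 4 relative to their canonical <<16 position, and the final
# 'shl 4' after the loop compensates for the last, unshifted lookup.)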
{
	my $cnt;
	my $rem_4bit = "eax";
	my @rem = ($Zhh,$Zll);
	my $nhi = $Zhl;
	my $nlo = $Zlh;

	my ($Zlo,$Zhi) = ("mm0","mm1");
	my $tmp = "mm2";

	&xor	($nlo,$nlo);	# avoid partial register stalls on PIII
	&mov	($nhi,$Zll);
	&mov	(&LB($nlo),&LB($nhi));
	&shl	(&LB($nlo),4);
	&and	($nhi,0xf0);
	&movq	($Zlo,&QWP(8,$Htbl,$nlo));
	&movq	($Zhi,&QWP(0,$Htbl,$nlo));
	&movd	($rem[0],$Zlo);

	for ($cnt=28;$cnt>=-2;$cnt--) {
		my $odd = $cnt&1;
		my $nix = $odd ? $nlo : $nhi;

		&shl	(&LB($nlo),4)			if ($odd);
		&psrlq	($Zlo,4);
		&movq	($tmp,$Zhi);
		&psrlq	($Zhi,4);
		&pxor	($Zlo,&QWP(8,$Htbl,$nix));
		&mov	(&LB($nlo),&BP($cnt/2,$inp))	if (!$odd && $cnt>=0);
		&psllq	($tmp,60);
		&and	($nhi,0xf0)			if ($odd);
		&pxor	($Zhi,&QWP(0,$rem_4bit,$rem[1],8)) if ($cnt<28);
		&and	($rem[0],0xf);
		&pxor	($Zhi,&QWP(0,$Htbl,$nix));
		&mov	($nhi,$nlo)			if (!$odd && $cnt>=0);
		&movd	($rem[1],$Zlo);
		&pxor	($Zlo,$tmp);

		push	(@rem,shift(@rem));		# "rotate" registers
	}

	&mov	($inp,&DWP(4,$rem_4bit,$rem[1],8));	# last rem_4bit[rem]

	&psrlq	($Zlo,32);	# lower part of Zlo is already there
	&movd	($Zhl,$Zhi);
	&psrlq	($Zhi,32);
	&movd	($Zlh,$Zlo);
	&movd	($Zhh,$Zhi);
	&shl	($inp,4);	# compensate for rem_4bit[i] being >>4

	&bswap	($Zll);
	&bswap	($Zhl);
	&bswap	($Zlh);
	&xor	($Zhh,$inp);
	&bswap	($Zhh);

	&ret	();
}
&function_end_B("_mmx_gmult_4bit_inner");

&function_begin("gcm_gmult_4bit_mmx");
	&mov	($inp,&wparam(0));	# load Xi
	&mov	($Htbl,&wparam(1));	# load Htable

	&call	(&label("pic_point"));
	&set_label("pic_point");
	&blindpop("eax");
	&lea	("eax",&DWP(&label("rem_4bit")."-".&label("pic_point"),"eax"));

	&movz	($Zll,&BP(15,$inp));

	&call	("_mmx_gmult_4bit_inner");

	&mov	($inp,&wparam(0));	# load Xi
	&emms	();
	&mov	(&DWP(12,$inp),$Zll);
	&mov	(&DWP(4,$inp),$Zhl);
	&mov	(&DWP(8,$inp),$Zlh);
	&mov	(&DWP(0,$inp),$Zhh);
&function_end("gcm_gmult_4bit_mmx");

# The streamed version performs 20% better on P4, 7% on Opteron,
# 10% on Core2 and PIII...
&function_begin("gcm_ghash_4bit_mmx");
	&mov	($Zhh,&wparam(0));	# load Xi
	&mov	($Htbl,&wparam(1));	# load Htable
	&mov	($inp,&wparam(2));	# load in
	&mov	($Zlh,&wparam(3));	# load len

	&call	(&label("pic_point"));
	&set_label("pic_point");
	&blindpop("eax");
	&lea	("eax",&DWP(&label("rem_4bit")."-".&label("pic_point"),"eax"));

	&add	($Zlh,$inp);
	&mov	(&wparam(3),$Zlh);	# len to point at the end of input
	&stack_push(4+1);		# +1 for stack alignment

	&mov	($Zll,&DWP(12,$Zhh));	# load Xi[16]
	&mov	($Zhl,&DWP(4,$Zhh));
	&mov	($Zlh,&DWP(8,$Zhh));
	&mov	($Zhh,&DWP(0,$Zhh));
	&jmp	(&label("mmx_outer_loop"));

	&set_label("mmx_outer_loop",16);
	&xor	($Zll,&DWP(12,$inp));
	&xor	($Zhl,&DWP(4,$inp));
	&xor	($Zlh,&DWP(8,$inp));
	&xor	($Zhh,&DWP(0,$inp));
	&mov	(&wparam(2),$inp);
	&mov	(&DWP(12,"esp"),$Zll);
	&mov	(&DWP(4,"esp"),$Zhl);
	&mov	(&DWP(8,"esp"),$Zlh);
	&mov	(&DWP(0,"esp"),$Zhh);

	&mov	($inp,"esp");
	&shr	($Zll,24);

	&call	("_mmx_gmult_4bit_inner");

	&mov	($inp,&wparam(2));
	&lea	($inp,&DWP(16,$inp));
	&cmp	($inp,&wparam(3));
	&jb	(&label("mmx_outer_loop"));

	&mov	($inp,&wparam(0));	# load Xi
	&emms	();
	&mov	(&DWP(12,$inp),$Zll);
	&mov	(&DWP(4,$inp),$Zhl);
	&mov	(&DWP(8,$inp),$Zlh);
	&mov	(&DWP(0,$inp),$Zhh);

	&stack_pop(4+1);
&function_end("gcm_ghash_4bit_mmx");

}} else {{	# "June" MMX version...
		# ... has a slower "April" gcm_gmult_4bit_mmx with a
		# folded loop. This is done to conserve code size...
$S=16;		# shift factor for rem_4bit

sub mmx_loop() {
# The MMX version performs 2.8 times better on P4 (see comment in the
# non-MMX routine for further details), 40% better on Opteron and
# Core2, 50% better on PIII... In other words the effort is considered
# to be well spent...
	my $inp = shift;
	my $rem_4bit = shift;
	my $cnt = $Zhh;
	my $nhi = $Zhl;
	my $nlo = $Zlh;
	my $rem = $Zll;

	my ($Zlo,$Zhi) = ("mm0","mm1");
	my $tmp = "mm2";

	&xor	($nlo,$nlo);	# avoid partial register stalls on PIII
	&mov	($nhi,$Zll);
	&mov	(&LB($nlo),&LB($nhi));
	&mov	($cnt,14);
	&shl	(&LB($nlo),4);
	&and	($nhi,0xf0);
	&movq	($Zlo,&QWP(8,$Htbl,$nlo));
	&movq	($Zhi,&QWP(0,$Htbl,$nlo));
	&movd	($rem,$Zlo);
	&jmp	(&label("mmx_loop"));

	&set_label("mmx_loop",16);
	&psrlq	($Zlo,4);
	&and	($rem,0xf);
	&movq	($tmp,$Zhi);
	&psrlq	($Zhi,4);
	&pxor	($Zlo,&QWP(8,$Htbl,$nhi));
	&mov	(&LB($nlo),&BP(0,$inp,$cnt));
	&psllq	($tmp,60);
	&pxor	($Zhi,&QWP(0,$rem_4bit,$rem,8));
	&dec	($cnt);
	&movd	($rem,$Zlo);
	&pxor	($Zhi,&QWP(0,$Htbl,$nhi));
	&mov	($nhi,$nlo);
	&pxor	($Zlo,$tmp);
	&js	(&label("mmx_break"));

	&shl	(&LB($nlo),4);
	&and	($rem,0xf);
	&psrlq	($Zlo,4);
	&and	($nhi,0xf0);
	&movq	($tmp,$Zhi);
	&psrlq	($Zhi,4);
	&pxor	($Zlo,&QWP(8,$Htbl,$nlo));
	&psllq	($tmp,60);
	&pxor	($Zhi,&QWP(0,$rem_4bit,$rem,8));
	&movd	($rem,$Zlo);
	&pxor	($Zhi,&QWP(0,$Htbl,$nlo));
	&pxor	($Zlo,$tmp);
	&jmp	(&label("mmx_loop"));

	&set_label("mmx_break",16);
	&shl	(&LB($nlo),4);
	&and	($rem,0xf);
	&psrlq	($Zlo,4);
	&and	($nhi,0xf0);
	&movq	($tmp,$Zhi);
	&psrlq	($Zhi,4);
	&pxor	($Zlo,&QWP(8,$Htbl,$nlo));
	&psllq	($tmp,60);
	&pxor	($Zhi,&QWP(0,$rem_4bit,$rem,8));
	&movd	($rem,$Zlo);
	&pxor	($Zhi,&QWP(0,$Htbl,$nlo));
	&pxor	($Zlo,$tmp);

	&psrlq	($Zlo,4);
	&and	($rem,0xf);
	&movq	($tmp,$Zhi);
	&psrlq	($Zhi,4);
	&pxor	($Zlo,&QWP(8,$Htbl,$nhi));
	&psllq	($tmp,60);
	&pxor	($Zhi,&QWP(0,$rem_4bit,$rem,8));
	&movd	($rem,$Zlo);
	&pxor	($Zhi,&QWP(0,$Htbl,$nhi));
	&pxor	($Zlo,$tmp);

	&psrlq	($Zlo,32);	# lower part of Zlo is already there
	&movd	($Zhl,$Zhi);
	&psrlq	($Zhi,32);
	&movd	($Zlh,$Zlo);
	&movd	($Zhh,$Zhi);

	&bswap	($Zll);
	&bswap	($Zhl);
	&bswap	($Zlh);
	&bswap	($Zhh);
}

&function_begin("gcm_gmult_4bit_mmx");
	&mov	($inp,&wparam(0));	# load Xi
	&mov	($Htbl,&wparam(1));	# load Htable

	&call	(&label("pic_point"));
	&set_label("pic_point");
	&blindpop("eax");
	&lea	("eax",&DWP(&label("rem_4bit")."-".&label("pic_point"),"eax"));

	&movz	($Zll,&BP(15,$inp));

	&mmx_loop($inp,"eax");

	&emms	();
	&mov	(&DWP(12,$inp),$Zll);
	&mov	(&DWP(4,$inp),$Zhl);
	&mov	(&DWP(8,$inp),$Zlh);
	&mov	(&DWP(0,$inp),$Zhh);
&function_end("gcm_gmult_4bit_mmx");

######################################################################
# The subroutine below is the "528B" variant of the "4-bit" GCM GHASH
# function (see gcm128.c for details). It provides a further 20-40%
# performance improvement over the above-mentioned "May" version.

&static_label("rem_8bit");

&function_begin("gcm_ghash_4bit_mmx");
{ my ($Zlo,$Zhi) = ("mm7","mm6");
  my $rem_8bit = "esi";
  my $Htbl = "ebx";

	# parameter block
	&mov	("eax",&wparam(0));	# Xi
	&mov	("ebx",&wparam(1));	# Htable
	&mov	("ecx",&wparam(2));	# inp
	&mov	("edx",&wparam(3));	# len
	&mov	("ebp","esp");		# original %esp
	&call	(&label("pic_point"));
	&set_label("pic_point");
	&blindpop($rem_8bit);
	&lea	($rem_8bit,&DWP(&label("rem_8bit")."-".&label("pic_point"),$rem_8bit));

	&sub	("esp",512+16+16);	# allocate stack frame...
	&and	("esp",-64);		# ...and align it
	&sub	("esp",16);		# place for (u8)(H[]<<4)

	&add	("edx","ecx");		# pointer to the end of input
	&mov	(&DWP(528+16+0,"esp"),"eax");	# save Xi
	&mov	(&DWP(528+16+8,"esp"),"edx");	# save inp+len
	&mov	(&DWP(528+16+12,"esp"),"ebp");	# save original %esp

{ my @lo  = ("mm0","mm1","mm2");
  my @hi  = ("mm3","mm4","mm5");
  my @tmp = ("mm6","mm7");
  my ($off1,$off2,$i) = (0,0);

	&add	($Htbl,128);		# optimize for size
	&lea	("edi",&DWP(16+128,"esp"));
	&lea	("ebp",&DWP(16+256+128,"esp"));

	# decompose Htable (low and high parts are kept separately),
	# generate Htable[]>>4, (u8)(Htable[]<<4), save to stack...
	for ($i=0;$i<18;$i++) {

		&mov	("edx",&DWP(16*$i+8-128,$Htbl))		if ($i<16);
		&movq	($lo[0],&QWP(16*$i+8-128,$Htbl))	if ($i<16);
		&psllq	($tmp[1],60)				if ($i>1);
		&movq	($hi[0],&QWP(16*$i+0-128,$Htbl))	if ($i<16);
		&por	($lo[2],$tmp[1])			if ($i>1);
		&movq	(&QWP($off1-128,"edi"),$lo[1])		if ($i>0 && $i<17);
		&psrlq	($lo[1],4)				if ($i>0 && $i<17);
		&movq	(&QWP($off1,"edi"),$hi[1])		if ($i>0 && $i<17);
		&movq	($tmp[0],$hi[1])			if ($i>0 && $i<17);
		&movq	(&QWP($off2-128,"ebp"),$lo[2])		if ($i>1);
		&psrlq	($hi[1],4)				if ($i>0 && $i<17);
		&movq	(&QWP($off2,"ebp"),$hi[2])		if ($i>1);
		&shl	("edx",4)				if ($i<16);
		&mov	(&BP($i,"esp"),&LB("edx"))		if ($i<16);

		unshift	(@lo,pop(@lo));		# "rotate" registers
		unshift	(@hi,pop(@hi));
		unshift	(@tmp,pop(@tmp));
		$off1 += 8	if ($i>0);
		$off2 += 8	if ($i>1);
	}
}

	&movq	($Zhi,&QWP(0,"eax"));
	&mov	("ebx",&DWP(8,"eax"));
	&mov	("edx",&DWP(12,"eax"));		# load Xi

&set_label("outer",16);
{ my $nlo = "eax";
  my $dat = "edx";
  my @nhi = ("edi","ebp");
  my @rem = ("ebx","ecx");
  my @red = ("mm0","mm1","mm2");
  my $tmp = "mm3";

	&xor	($dat,&DWP(12,"ecx"));		# merge input data
	&xor	("ebx",&DWP(8,"ecx"));
	&pxor	($Zhi,&QWP(0,"ecx"));
	&lea	("ecx",&DWP(16,"ecx"));		# inp+=16
	#&mov	(&DWP(528+12,"esp"),$dat);	# save inp^Xi
	&mov	(&DWP(528+8,"esp"),"ebx");
	&movq	(&QWP(528+0,"esp"),$Zhi);
	&mov	(&DWP(528+16+4,"esp"),"ecx");	# save inp

	&xor	($nlo,$nlo);
	&rol	($dat,8);
	&mov	(&LB($nlo),&LB($dat));
	&mov	($nhi[1],$nlo);
	&and	(&LB($nlo),0x0f);
	&shr	($nhi[1],4);
	&pxor	($red[0],$red[0]);
	&rol	($dat,8);			# next byte
	&pxor	($red[1],$red[1]);
	&pxor	($red[2],$red[2]);

	# Just like in the "May" version, modulo-schedule for the
	# critical path in 'Z.hi ^= rem_8bit[Z.lo&0xff^((u8)H[nhi]<<4)]<<48'.
	# The final 'pxor' is scheduled so late that rem_8bit[] has to
	# be shifted *right* by 16, which is why the last argument to
	# pinsrw is 2, which corresponds to <<32=<<48>>16...
	for ($j=11,$i=0;$i<15;$i++) {

		if ($i>0) {
			&pxor	($Zlo,&QWP(16,"esp",$nlo,8));		# Z^=H[nlo]
			&rol	($dat,8);				# next byte
			&pxor	($Zhi,&QWP(16+128,"esp",$nlo,8));

			&pxor	($Zlo,$tmp);
			&pxor	($Zhi,&QWP(16+256+128,"esp",$nhi[0],8));
			&xor	(&LB($rem[1]),&BP(0,"esp",$nhi[0]));	# rem^(H[nhi]<<4)
		} else {
			&movq	($Zlo,&QWP(16,"esp",$nlo,8));
			&movq	($Zhi,&QWP(16+128,"esp",$nlo,8));
		}

		&mov	(&LB($nlo),&LB($dat));
		&mov	($dat,&DWP(528+$j,"esp"))	if (--$j%4==0);

		&movd	($rem[0],$Zlo);
		&movz	($rem[1],&LB($rem[1]))		if ($i>0);
		&psrlq	($Zlo,8);			# Z>>=8

		&movq	($tmp,$Zhi);
		&mov	($nhi[0],$nlo);
		&psrlq	($Zhi,8);

		&pxor	($Zlo,&QWP(16+256+0,"esp",$nhi[1],8));	# Z^=H[nhi]>>4
		&and	(&LB($nlo),0x0f);
		&psllq	($tmp,56);

		&pxor	($Zhi,$red[1])			if ($i>1);
		&shr	($nhi[0],4);
		&pinsrw	($red[0],&WP(0,$rem_8bit,$rem[1],2),2)	if ($i>0);

		unshift	(@red,pop(@red));	# "rotate" registers
		unshift	(@rem,pop(@rem));
		unshift	(@nhi,pop(@nhi));
	}

	&pxor	($Zlo,&QWP(16,"esp",$nlo,8));		# Z^=H[nlo]
	&pxor	($Zhi,&QWP(16+128,"esp",$nlo,8));
	&xor	(&LB($rem[1]),&BP(0,"esp",$nhi[0]));	# rem^(H[nhi]<<4)

	&pxor	($Zlo,$tmp);
	&pxor	($Zhi,&QWP(16+256+128,"esp",$nhi[0],8));
	&movz	($rem[1],&LB($rem[1]));

	&pxor	($red[2],$red[2]);	# clear 2nd word
	&psllq	($red[1],4);

	&movd	($rem[0],$Zlo);
	&psrlq	($Zlo,4);		# Z>>=4

	&movq	($tmp,$Zhi);
	&psrlq	($Zhi,4);
	&shl	($rem[0],4);		# rem<<4

	&pxor	($Zlo,&QWP(16,"esp",$nhi[1],8));	# Z^=H[nhi]
	&psllq	($tmp,60);
	&movz	($rem[0],&LB($rem[0]));

	&pxor	($Zlo,$tmp);
	&pxor	($Zhi,&QWP(16+128,"esp",$nhi[1],8));

	&pinsrw	($red[0],&WP(0,$rem_8bit,$rem[1],2),2);
	&pxor	($Zhi,$red[1]);

	&movd	($dat,$Zlo);
	&pinsrw	($red[2],&WP(0,$rem_8bit,$rem[0],2),3);	# last is <<48

	&psllq	($red[0],12);		# correct by <<16>>4
	&pxor	($Zhi,$red[0]);
	&psrlq	($Zlo,32);
	&pxor	($Zhi,$red[2]);

	&mov	("ecx",&DWP(528+16+4,"esp"));	# restore inp
	&movd	("ebx",$Zlo);
	&movq	($tmp,$Zhi);		# 01234567
	&psllw	($Zhi,8);		# 1.3.5.7.
	&psrlw	($tmp,8);		# .0.2.4.6
	&por	($Zhi,$tmp);		# 10325476
	&bswap	($dat);
	&pshufw	($Zhi,$Zhi,0b00011011);	# 76543210
	&bswap	("ebx");

	&cmp	("ecx",&DWP(528+16+8,"esp"));	# are we done?
	&jne	(&label("outer"));
}

	&mov	("eax",&DWP(528+16+0,"esp"));	# restore Xi
	&mov	(&DWP(12,"eax"),"edx");
	&mov	(&DWP(8,"eax"),"ebx");
	&movq	(&QWP(0,"eax"),$Zhi);

	&mov	("esp",&DWP(528+16+12,"esp"));	# restore original %esp
	&emms	();
}
&function_end("gcm_ghash_4bit_mmx");
}}

if ($sse2) {{
######################################################################
# PCLMULQDQ version.

$Xip="eax";
$Htbl="edx";
$const="ecx";
$inp="esi";
$len="ebx";

($Xi,$Xhi)=("xmm0","xmm1");	$Hkey="xmm2";
($T1,$T2,$T3)=("xmm3","xmm4","xmm5");
($Xn,$Xhn)=("xmm6","xmm7");

&static_label("bswap");
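
# The interface is expected to match the clmul prototypes declared in
# gcm128.c (cf. that file):
#
#	void gcm_init_clmul(u128 Htable[16], const u64 Xi[2]);
#	void gcm_gmult_clmul(u64 Xi[2], const u128 Htable[16]);
#	void gcm_ghash_clmul(u64 Xi[2], const u128 Htable[16],
#				const u8 *inp, size_t len);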

sub clmul64x64_T2 {	# minimal "register" pressure
my ($Xhi,$Xi,$Hkey)=@_;

	&movdqa	($Xhi,$Xi);		#
	&pshufd	($T1,$Xi,0b01001110);
	&pshufd	($T2,$Hkey,0b01001110);
	&pxor	($T1,$Xi);		#
	&pxor	($T2,$Hkey);

	&pclmulqdq	($Xi,$Hkey,0x00);	#######
	&pclmulqdq	($Xhi,$Hkey,0x11);	#######
	&pclmulqdq	($T1,$T2,0x00);		#######
	&xorps	($T1,$Xi);		#
	&xorps	($T1,$Xhi);		#

	&movdqa	($T2,$T1);		#
	&psrldq	($T1,8);
	&pslldq	($T2,8);		#
	&pxor	($Xhi,$T1);
	&pxor	($Xi,$T2);		#
}
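
# In GF(2)[x] terms, with X = Xh:Xl and H = Hh:Hl as 64-bit halves,
# the subroutine above computes the Karatsuba identity (a sketch of
# the math, not additional code):
#
#	X*H = (Xh*Hh)<<128 ^ (Xl*Hl)
#	    ^ ((Xh^Xl)*(Hh^Hl) ^ Xh*Hh ^ Xl*Hl)<<64
#
# i.e. three pclmulqdq's instead of four, at the cost of a few extra
# xors; the final psrldq/pslldq pair splits the middle term between
# $Xhi and $Xi.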

sub clmul64x64_T3 {
# Even though this subroutine offers visually better ILP, it
# was empirically found to be a tad slower than the above version.
# At least in gcm_ghash_clmul context. But it's just as well,
# because loop modulo-scheduling is possible only thanks to
# minimized "register" pressure...
my ($Xhi,$Xi,$Hkey)=@_;

	&movdqa	($T1,$Xi);		#
	&movdqa	($Xhi,$Xi);
	&pclmulqdq	($Xi,$Hkey,0x00);	#######
	&pclmulqdq	($Xhi,$Hkey,0x11);	#######
	&pshufd	($T2,$T1,0b01001110);	#
	&pshufd	($T3,$Hkey,0b01001110);
	&pxor	($T2,$T1);		#
	&pxor	($T3,$Hkey);
	&pclmulqdq	($T2,$T3,0x00);		#######
	&pxor	($T2,$Xi);		#
	&pxor	($T2,$Xhi);		#

	&movdqa	($T3,$T2);		#
	&psrldq	($T2,8);
	&pslldq	($T3,8);		#
	&pxor	($Xhi,$T2);
	&pxor	($Xi,$T3);		#
}

if (1) {		# Algorithm 9 with <<1 twist.
			# Reduction is shorter and uses only two
			# temporary registers, which makes it a better
			# candidate for interleaving with 64x64
			# multiplication. The pre-modulo-scheduled loop
			# was found to be ~20% faster than Algorithm 5
			# below. Algorithm 9 was therefore chosen for
			# further optimization...

sub reduction_alg9 {	# 17/13 times faster than Intel version
my ($Xhi,$Xi) = @_;

	# 1st phase
	&movdqa	($T1,$Xi);		#
	&psllq	($Xi,1);
	&pxor	($Xi,$T1);		#
	&psllq	($Xi,5);		#
	&pxor	($Xi,$T1);		#
	&psllq	($Xi,57);		#
	&movdqa	($T2,$Xi);		#
	&pslldq	($Xi,8);
	&psrldq	($T2,8);		#
	&pxor	($Xi,$T1);
	&pxor	($Xhi,$T2);		#

	# 2nd phase
	&movdqa	($T2,$Xi);
	&psrlq	($Xi,5);
	&pxor	($Xi,$T2);		#
	&psrlq	($Xi,1);		#
	&pxor	($Xi,$T2);		#
	&pxor	($T2,$Xhi);
	&psrlq	($Xi,1);		#
	&pxor	($Xi,$T2);		#
}
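
# A sketch of why this works: modulo the GCM polynomial we have
# x^128 = x^7 + x^2 + x + 1, so the high half is folded back in with
# shifts by 7, 2, 1 and 0. In the bit-reflected representation used
# here (with the <<1 twist) the 1st phase applies the mirrored left
# shifts 57, 62, 63 - built up above as the cumulative 1+5+57
# sequence - and the 2nd phase folds the remainder back with right
# shifts 7, 2, 1 plus the unshifted value itself.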

&function_begin_B("gcm_init_clmul");
	&mov	($Htbl,&wparam(0));
	&mov	($Xip,&wparam(1));

	&call	(&label("pic"));
	&set_label("pic");
	&blindpop($const);
	&lea	($const,&DWP(&label("bswap")."-".&label("pic"),$const));

	&movdqu	($Hkey,&QWP(0,$Xip));
	&pshufd	($Hkey,$Hkey,0b01001110);	# dword swap

	# <<1 twist
	&pshufd	($T2,$Hkey,0b11111111);	# broadcast uppermost dword
	&movdqa	($T1,$Hkey);
	&psllq	($Hkey,1);
	&pxor	($T3,$T3);		#
	&psrlq	($T1,63);
	&pcmpgtd	($T3,$T2);	# broadcast carry bit
	&pslldq	($T1,8);
	&por	($Hkey,$T1);		# H<<=1

	# magic reduction
	&pand	($T3,&QWP(16,$const));	# 0x1c2_polynomial
	&pxor	($Hkey,$T3);		# if(carry) H^=0x1c2_polynomial
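
	# (A note on the twist: carry-less multiplication of two
	# bit-reflected operands yields the reflected product shifted
	# right by one bit, so H is kept pre-shifted left by one to
	# fold the compensation into the key once and for all; should
	# H<<1 carry out of bit 127, the carry is reduced back in by
	# the conditional xor with the reflected polynomial constant
	# 0x1c2... above.)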

	# calculate H^2
	&movdqa	($Xi,$Hkey);
	&clmul64x64_T2	($Xhi,$Xi,$Hkey);
	&reduction_alg9	($Xhi,$Xi);

	&movdqu	(&QWP(0,$Htbl),$Hkey);	# save H
	&movdqu	(&QWP(16,$Htbl),$Xi);	# save H^2

	&ret	();
&function_end_B("gcm_init_clmul");

&function_begin_B("gcm_gmult_clmul");
	&mov	($Xip,&wparam(0));
	&mov	($Htbl,&wparam(1));

	&call	(&label("pic"));
	&set_label("pic");
	&blindpop($const);
	&lea	($const,&DWP(&label("bswap")."-".&label("pic"),$const));

	&movdqu	($Xi,&QWP(0,$Xip));
	&movdqa	($T3,&QWP(0,$const));
	&movups	($Hkey,&QWP(0,$Htbl));
	&pshufb	($Xi,$T3);

	&clmul64x64_T2	($Xhi,$Xi,$Hkey);
	&reduction_alg9	($Xhi,$Xi);

	&pshufb	($Xi,$T3);
	&movdqu	(&QWP(0,$Xip),$Xi);

	&ret	();
&function_end_B("gcm_gmult_clmul");

&function_begin("gcm_ghash_clmul");
	&mov	($Xip,&wparam(0));
	&mov	($Htbl,&wparam(1));
	&mov	($inp,&wparam(2));
	&mov	($len,&wparam(3));

	&call	(&label("pic"));
	&set_label("pic");
	&blindpop($const);
	&lea	($const,&DWP(&label("bswap")."-".&label("pic"),$const));

	&movdqu	($Xi,&QWP(0,$Xip));
	&movdqa	($T3,&QWP(0,$const));
	&movdqu	($Hkey,&QWP(0,$Htbl));
	&pshufb	($Xi,$T3);

	&sub	($len,0x10);
	&jz	(&label("odd_tail"));

	#######
	# Xi+2 = [H*(Ii+1 + Xi+1)] mod P =
	#	 [(H*Ii+1) + (H*Xi+1)] mod P =
	#	 [(H*Ii+1) + H^2*(Ii+Xi)] mod P
	#
	&movdqu	($T1,&QWP(0,$inp));	# Ii
	&movdqu	($Xn,&QWP(16,$inp));	# Ii+1
	&pshufb	($T1,$T3);
	&pshufb	($Xn,$T3);
	&pxor	($Xi,$T1);		# Ii+Xi

	&clmul64x64_T2	($Xhn,$Xn,$Hkey);	# H*Ii+1
	&movups	($Hkey,&QWP(16,$Htbl));	# load H^2

	&lea	($inp,&DWP(32,$inp));	# i+=2
	&sub	($len,0x20);
	&jbe	(&label("even_tail"));

&set_label("mod_loop");
	&clmul64x64_T2	($Xhi,$Xi,$Hkey);	# H^2*(Ii+Xi)
	&movdqu	($T1,&QWP(0,$inp));	# Ii
	&movups	($Hkey,&QWP(0,$Htbl));	# load H

	&pxor	($Xi,$Xn);		# (H*Ii+1) + H^2*(Ii+Xi)
	&pxor	($Xhi,$Xhn);

	&movdqu	($Xn,&QWP(16,$inp));	# Ii+1
	&pshufb	($T1,$T3);
	&pshufb	($Xn,$T3);

	&movdqa	($T3,$Xn);		#&clmul64x64_TX	($Xhn,$Xn,$Hkey); H*Ii+1
	&movdqa	($Xhn,$Xn);
	&pxor	($Xhi,$T1);		# "Ii+Xi", consume early

	&movdqa	($T1,$Xi);		#&reduction_alg9($Xhi,$Xi); 1st phase
	&psllq	($Xi,1);
	&pxor	($Xi,$T1);		#
	&psllq	($Xi,5);		#
	&pxor	($Xi,$T1);		#
	&pclmulqdq	($Xn,$Hkey,0x00);	#######
	&psllq	($Xi,57);		#
	&movdqa	($T2,$Xi);		#
	&pslldq	($Xi,8);
	&psrldq	($T2,8);		#
	&pxor	($Xi,$T1);
	&pshufd	($T1,$T3,0b01001110);
	&pxor	($Xhi,$T2);		#
	&pxor	($T1,$T3);
	&pshufd	($T3,$Hkey,0b01001110);
	&pxor	($T3,$Hkey);		#

	&pclmulqdq	($Xhn,$Hkey,0x11);	#######
	&movdqa	($T2,$Xi);		# 2nd phase
	&psrlq	($Xi,5);
	&pxor	($Xi,$T2);		#
	&psrlq	($Xi,1);		#
	&pxor	($Xi,$T2);		#
	&pxor	($T2,$Xhi);
	&psrlq	($Xi,1);		#
	&pxor	($Xi,$T2);		#

	&pclmulqdq	($T1,$T3,0x00);		#######
	&movups	($Hkey,&QWP(16,$Htbl));	# load H^2
	&xorps	($T1,$Xn);		#
	&xorps	($T1,$Xhn);		#

	&movdqa	($T3,$T1);		#
	&psrldq	($T1,8);
	&pslldq	($T3,8);		#
	&pxor	($Xhn,$T1);
	&pxor	($Xn,$T3);		#
	&movdqa	($T3,&QWP(0,$const));

	&lea	($inp,&DWP(32,$inp));
	&sub	($len,0x20);
	&ja	(&label("mod_loop"));

&set_label("even_tail");
	&clmul64x64_T2	($Xhi,$Xi,$Hkey);	# H^2*(Ii+Xi)

	&pxor	($Xi,$Xn);		# (H*Ii+1) + H^2*(Ii+Xi)
	&pxor	($Xhi,$Xhn);

	&reduction_alg9	($Xhi,$Xi);

	&test	($len,$len);
	&jnz	(&label("done"));

	&movups	($Hkey,&QWP(0,$Htbl));	# load H
&set_label("odd_tail");
	&movdqu	($T1,&QWP(0,$inp));	# Ii
	&pshufb	($T1,$T3);
	&pxor	($Xi,$T1);		# Ii+Xi

	&clmul64x64_T2	($Xhi,$Xi,$Hkey);	# H*(Ii+Xi)
	&reduction_alg9	($Xhi,$Xi);

&set_label("done");
	&pshufb	($Xi,$T3);
	&movdqu	(&QWP(0,$Xip),$Xi);
&function_end("gcm_ghash_clmul");

} else {		# Algorithm 5. Kept for reference purposes.

sub reduction_alg5 {	# 19/16 times faster than Intel version
my ($Xhi,$Xi)=@_;

	# <<1
	&movdqa	($T1,$Xi);		#
	&movdqa	($T2,$Xhi);
	&pslld	($Xi,1);
	&pslld	($Xhi,1);		#
	&psrld	($T1,31);
	&psrld	($T2,31);		#
	&movdqa	($T3,$T1);
	&pslldq	($T1,4);
	&psrldq	($T3,12);		#
	&pslldq	($T2,4);
	&por	($Xhi,$T3);		#
	&por	($Xi,$T1);
	&por	($Xhi,$T2);		#

	# 1st phase
	&movdqa	($T1,$Xi);
	&movdqa	($T2,$Xi);
	&movdqa	($T3,$Xi);		#
	&pslld	($T1,31);
	&pslld	($T2,30);
	&pslld	($Xi,25);		#
	&pxor	($T1,$T2);
	&pxor	($T1,$Xi);		#
	&movdqa	($T2,$T1);		#
	&pslldq	($T1,12);
	&psrldq	($T2,4);		#
	&pxor	($T3,$T1);

	# 2nd phase
	&pxor	($Xhi,$T3);		#
	&movdqa	($Xi,$T3);
	&movdqa	($T1,$T3);
	&psrld	($Xi,1);		#
	&psrld	($T1,2);
	&psrld	($T3,7);		#
	&pxor	($Xi,$T1);
	&pxor	($Xhi,$T2);
	&pxor	($Xi,$T3);		#
	&pxor	($Xi,$Xhi);		#
}

&function_begin_B("gcm_init_clmul");
	&mov	($Htbl,&wparam(0));
	&mov	($Xip,&wparam(1));

	&call	(&label("pic"));
	&set_label("pic");
	&blindpop($const);
	&lea	($const,&DWP(&label("bswap")."-".&label("pic"),$const));

	&movdqu	($Hkey,&QWP(0,$Xip));
	&pshufd	($Hkey,$Hkey,0b01001110);	# dword swap

	# calculate H^2
	&movdqa	($Xi,$Hkey);
	&clmul64x64_T3	($Xhi,$Xi,$Hkey);
	&reduction_alg5	($Xhi,$Xi);

	&movdqu	(&QWP(0,$Htbl),$Hkey);	# save H
	&movdqu	(&QWP(16,$Htbl),$Xi);	# save H^2

	&ret	();
&function_end_B("gcm_init_clmul");

&function_begin_B("gcm_gmult_clmul");
	&mov	($Xip,&wparam(0));
	&mov	($Htbl,&wparam(1));

	&call	(&label("pic"));
	&set_label("pic");
	&blindpop($const);
	&lea	($const,&DWP(&label("bswap")."-".&label("pic"),$const));

	&movdqu	($Xi,&QWP(0,$Xip));
	&movdqa	($Xn,&QWP(0,$const));
	&movdqu	($Hkey,&QWP(0,$Htbl));
	&pshufb	($Xi,$Xn);

	&clmul64x64_T3	($Xhi,$Xi,$Hkey);
	&reduction_alg5	($Xhi,$Xi);

	&pshufb	($Xi,$Xn);
	&movdqu	(&QWP(0,$Xip),$Xi);

	&ret	();
&function_end_B("gcm_gmult_clmul");

&function_begin("gcm_ghash_clmul");
	&mov	($Xip,&wparam(0));
	&mov	($Htbl,&wparam(1));
	&mov	($inp,&wparam(2));
	&mov	($len,&wparam(3));

	&call	(&label("pic"));
	&set_label("pic");
	&blindpop($const);
	&lea	($const,&DWP(&label("bswap")."-".&label("pic"),$const));

	&movdqu	($Xi,&QWP(0,$Xip));
	&movdqa	($T3,&QWP(0,$const));
	&movdqu	($Hkey,&QWP(0,$Htbl));
	&pshufb	($Xi,$T3);

	&sub	($len,0x10);
	&jz	(&label("odd_tail"));

	#######
	# Xi+2 = [H*(Ii+1 + Xi+1)] mod P =
	#	 [(H*Ii+1) + (H*Xi+1)] mod P =
	#	 [(H*Ii+1) + H^2*(Ii+Xi)] mod P
	#
	&movdqu	($T1,&QWP(0,$inp));	# Ii
	&movdqu	($Xn,&QWP(16,$inp));	# Ii+1
	&pshufb	($T1,$T3);
	&pshufb	($Xn,$T3);
	&pxor	($Xi,$T1);		# Ii+Xi

	&clmul64x64_T3	($Xhn,$Xn,$Hkey);	# H*Ii+1
	&movdqu	($Hkey,&QWP(16,$Htbl));	# load H^2

	&sub	($len,0x20);
	&lea	($inp,&DWP(32,$inp));	# i+=2
	&jbe	(&label("even_tail"));

&set_label("mod_loop");
	&clmul64x64_T3	($Xhi,$Xi,$Hkey);	# H^2*(Ii+Xi)
	&movdqu	($Hkey,&QWP(0,$Htbl));	# load H

	&pxor	($Xi,$Xn);		# (H*Ii+1) + H^2*(Ii+Xi)
	&pxor	($Xhi,$Xhn);

	&reduction_alg5	($Xhi,$Xi);

	#######
	&movdqa	($T3,&QWP(0,$const));
	&movdqu	($T1,&QWP(0,$inp));	# Ii
	&movdqu	($Xn,&QWP(16,$inp));	# Ii+1
	&pshufb	($T1,$T3);
	&pshufb	($Xn,$T3);
	&pxor	($Xi,$T1);		# Ii+Xi

	&clmul64x64_T3	($Xhn,$Xn,$Hkey);	# H*Ii+1
	&movdqu	($Hkey,&QWP(16,$Htbl));	# load H^2

	&sub	($len,0x20);
	&lea	($inp,&DWP(32,$inp));
	&ja	(&label("mod_loop"));

&set_label("even_tail");
	&clmul64x64_T3	($Xhi,$Xi,$Hkey);	# H^2*(Ii+Xi)

	&pxor	($Xi,$Xn);		# (H*Ii+1) + H^2*(Ii+Xi)
	&pxor	($Xhi,$Xhn);

	&reduction_alg5	($Xhi,$Xi);

	&movdqa	($T3,&QWP(0,$const));
	&test	($len,$len);
	&jnz	(&label("done"));

	&movdqu	($Hkey,&QWP(0,$Htbl));	# load H
&set_label("odd_tail");
	&movdqu	($T1,&QWP(0,$inp));	# Ii
	&pshufb	($T1,$T3);
	&pxor	($Xi,$T1);		# Ii+Xi

	&clmul64x64_T3	($Xhi,$Xi,$Hkey);	# H*(Ii+Xi)
	&reduction_alg5	($Xhi,$Xi);

	&movdqa	($T3,&QWP(0,$const));
&set_label("done");
	&pshufb	($Xi,$T3);
	&movdqu	(&QWP(0,$Xip),$Xi);
&function_end("gcm_ghash_clmul");

}

&set_label("bswap",64);
&data_byte(15,14,13,12,11,10,9,8,7,6,5,4,3,2,1,0);
&data_byte(1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0xc2);	# 0x1c2_polynomial
}}	# $sse2

&set_label("rem_4bit",64);
&data_word(0,0x0000<<$S,0,0x1C20<<$S,0,0x3840<<$S,0,0x2460<<$S);
&data_word(0,0x7080<<$S,0,0x6CA0<<$S,0,0x48C0<<$S,0,0x54E0<<$S);
&data_word(0,0xE100<<$S,0,0xFD20<<$S,0,0xD940<<$S,0,0xC560<<$S);
&data_word(0,0x9180<<$S,0,0x8DA0<<$S,0,0xA9C0<<$S,0,0xB5E0<<$S);
&set_label("rem_8bit",64);
&data_short(0x0000,0x01C2,0x0384,0x0246,0x0708,0x06CA,0x048C,0x054E);
&data_short(0x0E10,0x0FD2,0x0D94,0x0C56,0x0918,0x08DA,0x0A9C,0x0B5E);
&data_short(0x1C20,0x1DE2,0x1FA4,0x1E66,0x1B28,0x1AEA,0x18AC,0x196E);
&data_short(0x1230,0x13F2,0x11B4,0x1076,0x1538,0x14FA,0x16BC,0x177E);
&data_short(0x3840,0x3982,0x3BC4,0x3A06,0x3F48,0x3E8A,0x3CCC,0x3D0E);
&data_short(0x3650,0x3792,0x35D4,0x3416,0x3158,0x309A,0x32DC,0x331E);
&data_short(0x2460,0x25A2,0x27E4,0x2626,0x2368,0x22AA,0x20EC,0x212E);
&data_short(0x2A70,0x2BB2,0x29F4,0x2836,0x2D78,0x2CBA,0x2EFC,0x2F3E);
&data_short(0x7080,0x7142,0x7304,0x72C6,0x7788,0x764A,0x740C,0x75CE);
&data_short(0x7E90,0x7F52,0x7D14,0x7CD6,0x7998,0x785A,0x7A1C,0x7BDE);
&data_short(0x6CA0,0x6D62,0x6F24,0x6EE6,0x6BA8,0x6A6A,0x682C,0x69EE);
&data_short(0x62B0,0x6372,0x6134,0x60F6,0x65B8,0x647A,0x663C,0x67FE);
&data_short(0x48C0,0x4902,0x4B44,0x4A86,0x4FC8,0x4E0A,0x4C4C,0x4D8E);
&data_short(0x46D0,0x4712,0x4554,0x4496,0x41D8,0x401A,0x425C,0x439E);
&data_short(0x54E0,0x5522,0x5764,0x56A6,0x53E8,0x522A,0x506C,0x51AE);
&data_short(0x5AF0,0x5B32,0x5974,0x58B6,0x5DF8,0x5C3A,0x5E7C,0x5FBE);
&data_short(0xE100,0xE0C2,0xE284,0xE346,0xE608,0xE7CA,0xE58C,0xE44E);
&data_short(0xEF10,0xEED2,0xEC94,0xED56,0xE818,0xE9DA,0xEB9C,0xEA5E);
&data_short(0xFD20,0xFCE2,0xFEA4,0xFF66,0xFA28,0xFBEA,0xF9AC,0xF86E);
&data_short(0xF330,0xF2F2,0xF0B4,0xF176,0xF438,0xF5FA,0xF7BC,0xF67E);
&data_short(0xD940,0xD882,0xDAC4,0xDB06,0xDE48,0xDF8A,0xDDCC,0xDC0E);
&data_short(0xD750,0xD692,0xD4D4,0xD516,0xD058,0xD19A,0xD3DC,0xD21E);
&data_short(0xC560,0xC4A2,0xC6E4,0xC726,0xC268,0xC3AA,0xC1EC,0xC02E);
&data_short(0xCB70,0xCAB2,0xC8F4,0xC936,0xCC78,0xCDBA,0xCFFC,0xCE3E);
&data_short(0x9180,0x9042,0x9204,0x93C6,0x9688,0x974A,0x950C,0x94CE);
&data_short(0x9F90,0x9E52,0x9C14,0x9DD6,0x9898,0x995A,0x9B1C,0x9ADE);
&data_short(0x8DA0,0x8C62,0x8E24,0x8FE6,0x8AA8,0x8B6A,0x892C,0x88EE);
&data_short(0x83B0,0x8272,0x8034,0x81F6,0x84B8,0x857A,0x873C,0x86FE);
&data_short(0xA9C0,0xA802,0xAA44,0xAB86,0xAEC8,0xAF0A,0xAD4C,0xAC8E);
&data_short(0xA7D0,0xA612,0xA454,0xA596,0xA0D8,0xA11A,0xA35C,0xA29E);
&data_short(0xB5E0,0xB422,0xB664,0xB7A6,0xB2E8,0xB32A,0xB16C,0xB0AE);
&data_short(0xBBF0,0xBA32,0xB874,0xB9B6,0xBCF8,0xBD3A,0xBF7C,0xBEBE);
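
# As with rem_4bit, the rem_8bit words above are carry-less multiples
# of 0x01C2; a hypothetical Perl cross-check of the table:
#
#	for my $i (0..255) {
#		my $r = 0;
#		for my $b (0..7) { $r ^= 0x01C2<<$b if (($i>>$b)&1); }
#		# $r == rem_8bit[$i]
#	}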
}}}	# !$x86only

&asciz("GHASH for x86, CRYPTOGAMS by <appro\@openssl.org>");
&asm_finish();

# A question was raised about the choice of vanilla MMX. Or rather why
# wasn't SSE2 chosen instead? In addition to the fact that MMX runs on
# legacy CPUs such as PIII, the "4-bit" MMX version was observed to
# provide better performance than the *corresponding* SSE2 one even on
# contemporary CPUs. SSE2 results were provided by Peter-Michael Hager.
# He maintains an SSE2 implementation featuring the full range of
# lookup-table sizes, but with per-invocation lookup table setup. The
# latter means that the table size is chosen depending on how much
# data is to be hashed in every given call; more data - larger table.
# The best reported result for Core2 is ~4 cycles per processed byte
# out of a 64KB block. This number accounts even for the 64KB table
# setup overhead. As discussed in gcm128.c, we choose to be more
# conservative with respect to lookup table sizes, but how do the
# results compare? The minimalistic "256B" MMX version delivers ~11
# cycles on the same platform. As also discussed in gcm128.c, the next
# in line "8-bit Shoup's" or "4KB" method should deliver twice the
# performance of the "256B" one, in other words not worse than ~6
# cycles per byte. It should also be noted that in the SSE2 case the
# improvement can be "super-linear," i.e. more than twice, mostly
# because >>8 maps to a single instruction on an SSE2 register. This
# is unlike the "4-bit" case where >>4 maps to the same number of
# instructions in both MMX and SSE2 cases. The bottom line is that the
# switch to SSE2 is considered justifiable only in case we choose to
# implement the "8-bit" method...
