| Commit message | Author | Age | Files | Lines |
|
|
|
|
|
|
|
|
|
| |
This provides a SHA-512 assembly implementation that makes use of the ARM
Cryptographic Extension (CE), which is found on many arm64 CPUs. This gives
a performance gain of up to 2.5x on an Apple M2 (dependent on block size).
If an aarch64 machine does not have SHA512 support, then we'll fall back to
using the existing C implementation.
ok kettenis@ tb@
|
|
|
|
|
|
|
| |
We have code that targets a specific architecture level, hence .arch makes
more sense here than .cpu.
Suggested by kettenis@
|
|
|
|
|
|
|
|
|
|
| |
This provides a SHA-256 assembly implementation that makes use of the ARM
Cryptographic Extension (CE), which is found on many arm64 CPUs. This gives
a performance gain of up to 7.5x on an Apple M2 (dependent on block size).
If an aarch64 machine does not have SHA2 support, then we'll fall back to
using the existing C implementation.
ok kettenis@ tb@
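As a rough illustration of the fallback arrangement (all of the names below
are hypothetical, for illustration only, not the ones used in the tree), the
dispatch amounts to:

#include <stddef.h>

/* Hypothetical names, for illustration only. */
extern int arm_has_sha256;	/* set once from CPU feature detection */

void sha256_block_ce(void *ctx, const void *in, size_t num);
void sha256_block_generic(void *ctx, const void *in, size_t num);

void
sha256_block_data_order(void *ctx, const void *in, size_t num)
{
	if (arm_has_sha256)
		sha256_block_ce(ctx, in, num);
	else
		sha256_block_generic(ctx, in, num);
}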
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Currently, SHA{1,256,512}_ASM defines are used to remove the C
implementation of sha{1,256,512}_block_data_order() when it is provided
by assembly. However, this prevents the C implementation from being used
as a fallback.
Rename the C sha*_block_data_order() to sha*_block_generic() and provide
a sha*_block_data_order() that calls sha*_block_generic(). Replace the
Makefile based SHA*_ASM defines with two HAVE_SHA_* defines that allow
these functions to be compiled in or removed, such that machine specific
versions can be provided. This should effectively be a no-op on any
platform that defined SHA{1,256,512}_ASM.
ok tb@
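In rough terms the arrangement looks like the following sketch (the define
name and the context type are simplified stand-ins, not the exact ones in
the tree):

#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for the real SHA256_CTX. */
typedef struct { uint32_t h[8]; } sha256_ctx;

/* Portable C implementation, always available as the fallback. */
static void
sha256_block_generic(sha256_ctx *ctx, const void *in, size_t num)
{
	/* ... existing C compression function ... */
	(void)ctx; (void)in; (void)num;
}

/*
 * A machine specific build defines HAVE_SHA256_BLOCK_DATA_ORDER (name
 * assumed for illustration) and supplies its own implementation;
 * otherwise this trivial wrapper hands off to the generic code.
 */
#ifndef HAVE_SHA256_BLOCK_DATA_ORDER
void
sha256_block_data_order(sha256_ctx *ctx, const void *in, size_t num)
{
	sha256_block_generic(ctx, in, num);
}
#endif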
|
|
|
|
| |
discussed with jsing
|
| |
|
|
|
|
|
|
|
|
| |
This provides a SHA-1 assembly implementation for amd64, which uses
the Intel SHA Extensions (aka SHA New Instructions or SHA-NI). This
provides a 2-2.5x performance gain on some Intel CPUs and many AMD CPUs.
ok tb@
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
As already done for SHA-256 and SHA-512, replace the perlasm generated
SHA-1 assembly implementation with one that is actually readable. Call the
assembly implementation from a C wrapper that can, in the future, dispatch
to alternate implementations. On a modern CPU the performance is around
5% faster than the base implementation generated by sha1-x86_64.pl, however
it is around 15% slower than the excessively complex SSSE2/AVX version that
is also generated by the same script (a SHA-NI version will greatly
outperform this and is much cleaner/simpler).
ok tb@
|
|
|
|
|
|
|
|
|
|
| |
Rather than having blocks of code that are conditional on
BYTE_ORDER != LITTLE_ENDIAN, use le64toh() and htole64() unconditionally.
In the case of a little endian platform, the compiler will optimise this
away, while on a big endian platform we'll either end up with better code
or the same code as we have currently.
ok tb@
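A small sketch of the idea (the header is <endian.h> on some systems and
<sys/endian.h> on others):

#include <sys/endian.h>
#include <stdint.h>

/*
 * No BYTE_ORDER conditionals: convert unconditionally and let the
 * compiler turn these into plain loads and stores on little endian
 * targets.
 */
static inline uint64_t
load_le64(const uint64_t *p)
{
	return le64toh(*p);
}

static inline void
store_le64(uint64_t *p, uint64_t v)
{
	*p = htole64(v);
}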
|
|
|
|
|
|
|
|
| |
This provides a SHA-256 assembly implementation for amd64, which uses
the Intel SHA Extensions (aka SHA New Instructions or SHA-NI). This
provides a 3-5x performance gain on some Intel CPUs and many AMD CPUs.
ok tb@
|
|
|
|
|
| |
Now that we have replacement SHA-256 and SHA-512 assembly implementations
for amd64, sha512-x86_64.pl can go the way of the dodo.
|
|
|
|
|
|
|
|
| |
Replace the perlasm generated SHA-512 assembly with a more readable
version and the same C wrapper introduced for SHA-256. As for SHA-256,
on a modern CPU the performance is largely the same.
ok tb@
|
|
|
|
| |
Missing sizes spotted by guenther@
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Replace the perlasm generated SHA-256 assembly implementation with one that
is actually readable. Call the assembly implementation from a C wrapper
that can, in the future, dispatch to alternate implementations. Performance
is similar (or even better) on modern CPUs, while somewhat slower on older
CPUs (this is in part due to the wrapper, the impact of which is more
noticeable with small block sizes).
Thanks to gkoehler@ and tb@ for testing.
ok tb@
|
| |
|
|
|
|
| |
requested by jsing on review
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
HMAC() and the one-step digests used to support passing a NULL buffer and
would return the digest in a static buffer. This design is firmly from the
nineties, not thread safe and it saves callers a single line. The few ports
that used to rely on this were fixed with patches sent to non-hostile (and
non-dead) upstreams. It's early enough in the release cycle that remaining
uses hidden from the compiler should be caught, at least the ones that
matter.
There won't be that many since BoringSSL removed this feature in 2017.
https://boringssl-review.googlesource.com/14528
Add non-null attributes to the headers and add a few missing bounded
attributes.
ok beck jsing
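Callers that relied on the static buffer now simply pass their own, e.g.
(a minimal sketch using the standard HMAC() interface; the wrapper name is
made up):

#include <openssl/evp.h>
#include <openssl/hmac.h>

/*
 * The output buffer must always be supplied by the caller; passing NULL
 * to have HMAC() return a pointer to an internal static buffer is no
 * longer supported.
 */
static int
hmac_sha256(const void *key, int key_len, const unsigned char *data,
    size_t data_len, unsigned char out[EVP_MAX_MD_SIZE], unsigned int *out_len)
{
	return HMAC(EVP_sha256(), key, key_len, data, data_len,
	    out, out_len) != NULL;
}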
|
|
|
|
|
|
|
|
|
| |
Replace macros with static inline functions and use names that follow
the spec more closely. Unlike SHA256/SHA512, the functions and constants do
not align with the number of words loaded, which means we cannot easily loop
and just end up unrolling everything.
ok joshua@ tb@
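For reference, the SHA-1 functions from FIPS 180-4 written this way look
roughly like:

#include <stdint.h>

/* SHA-1 round functions, named as in FIPS 180-4 (sketch). */
static inline uint32_t
Ch(uint32_t x, uint32_t y, uint32_t z)
{
	return (x & y) ^ (~x & z);
}

static inline uint32_t
Parity(uint32_t x, uint32_t y, uint32_t z)
{
	return x ^ y ^ z;
}

static inline uint32_t
Maj(uint32_t x, uint32_t y, uint32_t z)
{
	return (x & y) ^ (x & z) ^ (y & z);
}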
|
| |
|
|
|
|
|
|
|
|
| |
Use be32toh(), htobe32() and crypto_{load,store}_htobe32() as appropriate.
Also use the same while() loop that is used for other hash functions.
ok joshua@ tb@
|
|
|
|
|
|
|
|
|
| |
cet.h is needed for other platforms to emit the relevant .gnu.properties
sections that are necessary for them to enable IBT. It also avoids issues
with older toolchains on macOS that explode on encountering endbr64.
based on a diff by kettenis
ok beck kettenis
|
|
|
|
|
| |
Now that we're no longer dependent on md32_common.h, stop including it.
Remove various defines that only existed for md32_common.h usage.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Replace macros with static inline functions, as well as writing out the
variable rotations instead of trying to outsmart the compiler. Also pull
the message schedule update up and complete it prior to commencement of
the round. Also use rotate right, rather than transposed rotate left.
Overall this is more readable and more closely follows the specification.
On some platforms (e.g. aarch64) there is no notable change in
performance, while on others there is a significant improvement (more than
25% on arm).
ok miod@ tb@
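A sketch of the style, using rotate right as the specification documents it:

#include <stdint.h>

static inline uint32_t
ror32(uint32_t x, int n)
{
	return (x >> n) | (x << (32 - n));
}

/* SHA-256 Sigma functions from FIPS 180-4, written as documented. */
static inline uint32_t
Sigma0(uint32_t x)
{
	return ror32(x, 2) ^ ror32(x, 13) ^ ror32(x, 22);
}

static inline uint32_t
Sigma1(uint32_t x)
{
	return ror32(x, 6) ^ ror32(x, 11) ^ ror32(x, 25);
}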
|
|
|
|
|
|
|
|
|
| |
This is a hack that is only enabled on a handful of 64 bit platforms, as
a workaround for poor compiler optimisation. If you're running an archaic
compiler on an archaic architecture, then you can deal with slightly lower
performance.
ok tb@
|
|
|
|
| |
ok tb@
|
| |
|
| |
|
|
|
|
| |
No change to generated assembly.
|
|
|
|
| |
No functional change.
|
| |
|
|
|
|
|
|
|
|
| |
Copy the update, transform and final functions from md32_common.h, manually
expanding the macros for SHA1. This will allow for further clean up to
occur.
No change in generated assembly.
|
|
|
|
|
|
|
| |
If input data is 32 bit aligned use be32toh() directly, otherwise use
crypto_load_be32toh(), cleaning up all of the HOST_c2l() usage.
ok beck@
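In this scheme crypto_load_be32toh() is essentially a memcpy() followed by
be32toh(); a sketch (not the actual implementation):

#include <stdint.h>
#include <string.h>
#include <sys/endian.h>	/* <endian.h> on some systems */

/* Unaligned big endian 32 bit load; the aligned path instead applies
 * be32toh() directly to a normal 32 bit load. */
static inline uint32_t
load_be32_sketch(const uint8_t *in)
{
	uint32_t v;

	memcpy(&v, in, sizeof(v));
	return be32toh(v);
}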
|
|
|
|
|
|
|
| |
Avoid reach around and initialisation outside of the macro, cleaning up
the call sites to remove the initialisation.
ok beck@
|
|
|
|
| |
ok beck@
|
|
|
|
| |
ok beck@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Use static inline functions instead of macros to implement SHA-512. At
the same time, make two key changes - firstly, rather than trying to
outsmart the compiler and shuffle variables around, write the algorithm
the way it is documented and actually swap the variable contents. Secondly,
instead of interleaving the message schedule update and the round, do the
full message schedule update first, then process the round.
Overall, we get safer and more readable code. Additionally, the compiler
can generate smaller and faster code (with a gain of 5-10% across a range
of architectures).
ok beck@ tb@
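A condensed sketch of the resulting structure (FIPS 180-4 notation; K holds
the round constants and W[0..15] has already been loaded from the input
block):

#include <stdint.h>

static inline uint64_t ror64(uint64_t x, int n) { return (x >> n) | (x << (64 - n)); }
static inline uint64_t Ch(uint64_t x, uint64_t y, uint64_t z) { return (x & y) ^ (~x & z); }
static inline uint64_t Maj(uint64_t x, uint64_t y, uint64_t z) { return (x & y) ^ (x & z) ^ (y & z); }
static inline uint64_t Sigma0(uint64_t x) { return ror64(x, 28) ^ ror64(x, 34) ^ ror64(x, 39); }
static inline uint64_t Sigma1(uint64_t x) { return ror64(x, 14) ^ ror64(x, 18) ^ ror64(x, 41); }
static inline uint64_t sigma0(uint64_t x) { return ror64(x, 1) ^ ror64(x, 8) ^ (x >> 7); }
static inline uint64_t sigma1(uint64_t x) { return ror64(x, 19) ^ ror64(x, 61) ^ (x >> 6); }

static void
sha512_compress_sketch(uint64_t state[8], uint64_t W[80], const uint64_t K[80])
{
	uint64_t a, b, c, d, e, f, g, h, T1, T2;
	int i;

	/* Complete the full message schedule first... */
	for (i = 16; i < 80; i++)
		W[i] = sigma1(W[i - 2]) + W[i - 7] + sigma0(W[i - 15]) + W[i - 16];

	a = state[0]; b = state[1]; c = state[2]; d = state[3];
	e = state[4]; f = state[5]; g = state[6]; h = state[7];

	/* ... then run the rounds, actually swapping the working variables
	 * the way the specification writes them. */
	for (i = 0; i < 80; i++) {
		T1 = h + Sigma1(e) + Ch(e, f, g) + K[i] + W[i];
		T2 = Sigma0(a) + Maj(a, b, c);
		h = g; g = f; f = e; e = d + T1;
		d = c; c = b; b = a; a = T1 + T2;
	}

	state[0] += a; state[1] += b; state[2] += c; state[3] += d;
	state[4] += e; state[5] += f; state[6] += g; state[7] += h;
}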
|
| |
|
|
|
|
| |
No change in generated assembly.
|
|
|
|
| |
No intended functional change.
|
| |
|
| |
|
|
|
|
| |
No change to generated assembly.
|
|
|
|
|
|
|
|
|
|
| |
md32_common.h is a typical OpenSSL macro horror show - copy the update,
transform and final functions from md32_common.h, manually expanding the
macros for SHA256. This will allow for further clean up to occur.
No change in generated assembly.
ok beck@ tb@
|
|
|
|
|
|
|
|
|
|
|
| |
This recommits r1.37 of sha512.c, however it uses uint8_t * instead of void *
for the crypto_load_* functions and primarily uses const uint8_t * to track
input, only casting to const SHA_LONG64 * once we know that it is suitably
aligned. This prevents the compiler from implying alignment based on type.
Tested by tb@ and deraadt@ on platforms with gcc and strict alignment.
ok tb@
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
All assembly implementations are required to perform their own alignment
handling. In the case of the C implementation, on strict alignment
platforms, unaligned data will be copied into an aligned buffer. However,
most platforms then perform byte-by-byte reads (via the PULL64 macros).
Instead, remove SHA512_BLOCK_CAN_MANAGE_UNALIGNED_DATA and move the alignment
handling into sha512_block_data_order() - if the data is aligned then simply
perform 64 bit loads and then do endian conversion via be64toh(). If the
data is unaligned then use memcpy() and be64toh() (in the form of
crypto_load_be64toh()). Overall this reduces complexity and can improve
performance (on aarch64 we get a ~10% performance gain with aligned input
and a ~1-2% gain on armv7), while the same movq/bswapq is generated
for amd64 and movl/bswapl for i386.
ok tb@
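Roughly, the load ends up as the following sketch (the real code decides
aligned versus unaligned once per call, not per load):

#include <stdint.h>
#include <string.h>
#include <sys/endian.h>	/* <endian.h> on some systems */

static inline uint64_t
load_be64_sketch(const uint8_t *in)
{
	uint64_t v;

	if (((uintptr_t)in % sizeof(v)) == 0)
		return be64toh(*(const uint64_t *)in);	/* aligned: plain load */

	memcpy(&v, in, sizeof(v));	/* unaligned: crypto_load_be64toh() style */
	return be64toh(v);
}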
|
|
|
|
|
|
|
|
|
|
|
| |
Avoid reach around and initialisation outside of the macro, cleaning up
the call sites to remove the initialisation. Use a T2 variable to more
closely follow the documented algorithm and remove the gorgeous compound
statement X = Y += A + B + C.
There is no change to the clang generated assembly on aarch64.
ok tb@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
We currently have three C implementations for SHA-512 - a version that is
optimised for CPUs with minimal registers (specifically i386), a regular
implementation and a semi-unrolled implementation. Testing on a ~15 year
old i386 CPU, the fastest version is actually the semi-unrolled version
(not to mention that we still currently have an i586 assembly
implementation that is used on i386 instead...).
More decent architectures do not seem to care whether the regular or the
semi-unrolled version is used, presumably since they are effectively doing the
same thing in hardware during execution.
Remove all except the semi-unrolled version.
ok tb@
|