1 files changed, 131 insertions, 0 deletions
diff --git a/src/lib/libcrypto/des/asm/readme b/src/lib/libcrypto/des/asm/readme
new file mode 100644
index 0000000000..1beafe253b
--- /dev/null
+++ b/src/lib/libcrypto/des/asm/readme
@@ -0,0 +1,131 @@
+First up, let me say I don't like writing in assembler.  It is not portable,
+dependant on the particular CPU architecture release and is generally a pig
+to debug and get right.  Having said that, the x86 architecture is probably
+the most important for speed due to number of boxes and since
+it appears to be the worst architecture to to get
+good C compilers for.  So due to this, I have lowered myself to do
+assembler for the inner DES routines in libdes :-).
+The file to implement in assembler is des_enc.c.  Replace the following
+4 functions
+des_encrypt1(DES_LONG data[2],des_key_schedule ks, int encrypt);
+des_encrypt2(DES_LONG data[2],des_key_schedule ks, int encrypt);
+des_encrypt3(DES_LONG data[2],des_key_schedule ks1,ks2,ks3);
+des_decrypt3(DES_LONG data[2],des_key_schedule ks1,ks2,ks3);
+They encrypt/decrypt the 64 bits held in 'data' using
+the 'ks' key schedules.   The only difference between the 4 functions is that
+des_encrypt2() does not perform IP() or FP() on the data (this is an
+optimization for when doing triple DES and des_encrypt3() and des_decrypt3()
+perform triple des.  The triple DES routines are in here because it does
+make a big difference to have them located near the des_encrypt2 function
+at link time..
+Now as we all know, there are lots of different operating systems running on
+x86 boxes, and unfortunately they normally try to make sure their assembler
+formating is not the same as the other peoples.
+The 4 main formats I know of are
+Microsoft       Windows 95/Windows NT
+Elf             Includes Linux and FreeBSD(?).
+a.out           The older Linux.
+Solaris         Same as Elf but different comments :-(.
+Now I was not overly keen to write 4 different copies of the same code,
+so I wrote a few perl routines to output the correct assembler, given
+a target assembler type.  This code is ugly and is just a hack.
+The libraries are x86unix.pl and x86ms.pl.
+des586.pl, des686.pl and des-som[23].pl are the programs to actually
+generate the assembler.
+So to generate elf assembler
+perl des-som3.pl elf >dx86-elf.s
+For Windows 95/NT
+perl des-som2.pl win32 >win32.asm
+[ update 4 Jan 1996 ]
+I have added another way to do things.
+perl des-som3.pl cpp >dx86-cpp.s
+generates a file that will be included by dx86unix.cpp when it is compiled.
+To build for elf, a.out, solaris, bsdi etc,
+cc -E -DELF asm/dx86unix.cpp | as -o asm/dx86-elf.o
+cc -E -DSOL asm/dx86unix.cpp | as -o asm/dx86-sol.o
+cc -E -DOUT asm/dx86unix.cpp | as -o asm/dx86-out.o
+cc -E -DBSDI asm/dx86unix.cpp | as -o asm/dx86bsdi.o
+This was done to cut down the number of files in the distribution.
+Now the ugly part.  I acquired my copy of Intels
+"Optimization's For Intel's 32-Bit Processors" and found a few interesting
+things.  First, the aim of the exersize is to 'extract' one byte at a time
+from a word and do an array lookup.  This involves getting the byte from
+the 4 locations in the word and moving it to a new word and doing the lookup.
+The most obvious way to do this is
+xor     eax,    eax                             # clear word
+movb    al,     cl                              # get low byte
+xor     edi     DWORD PTR 0x100+des_SP[eax]     # xor in word
+movb    al,     ch                              # get next byte
+xor     edi     DWORD PTR 0x300+des_SP[eax]     # xor in word
+shr     ecx     16
+which seems ok.  For the pentium, this system appears to be the best.
+One has to do instruction interleaving to keep both functional units
+operating, but it is basically very efficient.
+Now the crunch.  When a full register is used after a partial write, eg.
+mov     al,     cl
+xor     edi,    DWORD PTR 0x100+des_SP[eax]
+386     - 1 cycle stall
+486     - 1 cycle stall
+586     - 0 cycle stall
+686     - at least 7 cycle stall (page 22 of the above mentioned document).
+So the technique that produces the best results on a pentium, according to
+the documentation, will produce hideous results on a pentium pro.
+To get around this, des686.pl will generate code that is not as fast on
+a pentium, should be very good on a pentium pro.
+mov     eax,    ecx                             # copy word 
+shr     ecx,    8                               # line up next byte
+and     eax,    0fch                            # mask byte
+xor     edi     DWORD PTR 0x100+des_SP[eax]     # xor in array lookup
+mov     eax,    ecx                             # get word
+shr     ecx     8                               # line up next byte
+and     eax,    0fch                            # mask byte
+xor     edi     DWORD PTR 0x300+des_SP[eax]     # xor in array lookup
+Due to the execution units in the pentium, this actually works quite well.
+For a pentium pro it should be very good.  This is the type of output
+Visual C++ generates.
+There is a third option.  instead of using
+mov     al,     ch
+which is bad on the pentium pro, one may be able to use
+movzx   eax,    ch
+which may not incur the partial write penalty.  On the pentium,
+this instruction takes 4 cycles so is not worth using but on the
+pentium pro it appears it may be worth while.  I need access to one to
+experiment :-).
+eric (20 Oct 1996)
+22 Nov 1996 - I have asked people to run the 2 different version on pentium
+pros and it appears that the intel documentation is wrong.  The
+mov al,bh is still faster on a pentium pro, so just use the des586.pl
+install des686.pl
+3 Dec 1996 - I added des_encrypt3/des_decrypt3 because I have moved these
+functions into des_enc.c because it does make a massive performance
+difference on some boxes to have the functions code located close to
+the des_encrypt2() function.
+9 Jan 1997 - des-som2.pl is now the correct perl script to use for
+pentiums.  It contains an inner loop from
+Svend Olaf Mikkelsen <svolaf@inet.uni-c.dk> which does raw ecb DES calls at
+273,000 per second.  He had a previous version at 250,000 and the best
+I was able to get was 203,000.  The content has not changed, this is all
+due to instruction sequencing (and actual instructions choice) which is able
+to keep both functional units of the pentium going.
+We may have lost the ugly register usage restrictions when x86 went 32 bit
+but for the pentium it has been replaced by evil instruction ordering tricks.
+13 Jan 1997 - des-som3.pl, more optimizations from Svend Olaf.
+raw DES at 281,000 per second on a pentium 100.

diff --git a/src/lib/libcrypto/des/asm/readme b/src/lib/libcrypto/des/asm/readme new file mode 100644 index 0000000000..1beafe253b --- /dev/null +++ b/src/lib/libcrypto/des/asm/readme
@@ -0,0 +1,131 @@
	1	First up, let me say I don't like writing in assembler. It is not portable,
	2	dependant on the particular CPU architecture release and is generally a pig
	3	to debug and get right. Having said that, the x86 architecture is probably
	4	the most important for speed due to number of boxes and since
	5	it appears to be the worst architecture to to get
	6	good C compilers for. So due to this, I have lowered myself to do
	7	assembler for the inner DES routines in libdes :-).
	8
	9	The file to implement in assembler is des_enc.c. Replace the following
	10	4 functions
	11	des_encrypt1(DES_LONG data[2],des_key_schedule ks, int encrypt);
	12	des_encrypt2(DES_LONG data[2],des_key_schedule ks, int encrypt);
	13	des_encrypt3(DES_LONG data[2],des_key_schedule ks1,ks2,ks3);
	14	des_decrypt3(DES_LONG data[2],des_key_schedule ks1,ks2,ks3);
	15
	16	They encrypt/decrypt the 64 bits held in 'data' using
	17	the 'ks' key schedules. The only difference between the 4 functions is that
	18	des_encrypt2() does not perform IP() or FP() on the data (this is an
	19	optimization for when doing triple DES and des_encrypt3() and des_decrypt3()
	20	perform triple des. The triple DES routines are in here because it does
	21	make a big difference to have them located near the des_encrypt2 function
	22	at link time..
	23
	24	Now as we all know, there are lots of different operating systems running on
	25	x86 boxes, and unfortunately they normally try to make sure their assembler
	26	formating is not the same as the other peoples.
	27	The 4 main formats I know of are
	28	Microsoft Windows 95/Windows NT
	29	Elf Includes Linux and FreeBSD(?).
	30	a.out The older Linux.
	31	Solaris Same as Elf but different comments :-(.
	32
	33	Now I was not overly keen to write 4 different copies of the same code,
	34	so I wrote a few perl routines to output the correct assembler, given
	35	a target assembler type. This code is ugly and is just a hack.
	36	The libraries are x86unix.pl and x86ms.pl.
	37	des586.pl, des686.pl and des-som[23].pl are the programs to actually
	38	generate the assembler.
	39
	40	So to generate elf assembler
	41	perl des-som3.pl elf >dx86-elf.s
	42	For Windows 95/NT
	43	perl des-som2.pl win32 >win32.asm
	44
	45	[ update 4 Jan 1996 ]
	46	I have added another way to do things.
	47	perl des-som3.pl cpp >dx86-cpp.s
	48	generates a file that will be included by dx86unix.cpp when it is compiled.
	49	To build for elf, a.out, solaris, bsdi etc,
	50	cc -E -DELF asm/dx86unix.cpp \| as -o asm/dx86-elf.o
	51	cc -E -DSOL asm/dx86unix.cpp \| as -o asm/dx86-sol.o
	52	cc -E -DOUT asm/dx86unix.cpp \| as -o asm/dx86-out.o
	53	cc -E -DBSDI asm/dx86unix.cpp \| as -o asm/dx86bsdi.o
	54	This was done to cut down the number of files in the distribution.
	55
	56	Now the ugly part. I acquired my copy of Intels
	57	"Optimization's For Intel's 32-Bit Processors" and found a few interesting
	58	things. First, the aim of the exersize is to 'extract' one byte at a time
	59	from a word and do an array lookup. This involves getting the byte from
	60	the 4 locations in the word and moving it to a new word and doing the lookup.
	61	The most obvious way to do this is
	62	xor eax, eax # clear word
	63	movb al, cl # get low byte
	64	xor edi DWORD PTR 0x100+des_SP[eax] # xor in word
	65	movb al, ch # get next byte
	66	xor edi DWORD PTR 0x300+des_SP[eax] # xor in word
	67	shr ecx 16
	68	which seems ok. For the pentium, this system appears to be the best.
	69	One has to do instruction interleaving to keep both functional units
	70	operating, but it is basically very efficient.
	71
	72	Now the crunch. When a full register is used after a partial write, eg.
	73	mov al, cl
	74	xor edi, DWORD PTR 0x100+des_SP[eax]
	75	386 - 1 cycle stall
	76	486 - 1 cycle stall
	77	586 - 0 cycle stall
	78	686 - at least 7 cycle stall (page 22 of the above mentioned document).
	79
	80	So the technique that produces the best results on a pentium, according to
	81	the documentation, will produce hideous results on a pentium pro.
	82
	83	To get around this, des686.pl will generate code that is not as fast on
	84	a pentium, should be very good on a pentium pro.
	85	mov eax, ecx # copy word
	86	shr ecx, 8 # line up next byte
	87	and eax, 0fch # mask byte
	88	xor edi DWORD PTR 0x100+des_SP[eax] # xor in array lookup
	89	mov eax, ecx # get word
	90	shr ecx 8 # line up next byte
	91	and eax, 0fch # mask byte
	92	xor edi DWORD PTR 0x300+des_SP[eax] # xor in array lookup
	93
	94	Due to the execution units in the pentium, this actually works quite well.
	95	For a pentium pro it should be very good. This is the type of output
	96	Visual C++ generates.
	97
	98	There is a third option. instead of using
	99	mov al, ch
	100	which is bad on the pentium pro, one may be able to use
	101	movzx eax, ch
	102	which may not incur the partial write penalty. On the pentium,
	103	this instruction takes 4 cycles so is not worth using but on the
	104	pentium pro it appears it may be worth while. I need access to one to
	105	experiment :-).
	106
	107	eric (20 Oct 1996)
	108
	109	22 Nov 1996 - I have asked people to run the 2 different version on pentium
	110	pros and it appears that the intel documentation is wrong. The
	111	mov al,bh is still faster on a pentium pro, so just use the des586.pl
	112	install des686.pl
	113
	114	3 Dec 1996 - I added des_encrypt3/des_decrypt3 because I have moved these
	115	functions into des_enc.c because it does make a massive performance
	116	difference on some boxes to have the functions code located close to
	117	the des_encrypt2() function.
	118
	119	9 Jan 1997 - des-som2.pl is now the correct perl script to use for
	120	pentiums. It contains an inner loop from
	121	Svend Olaf Mikkelsen <svolaf@inet.uni-c.dk> which does raw ecb DES calls at
	122	273,000 per second. He had a previous version at 250,000 and the best
	123	I was able to get was 203,000. The content has not changed, this is all
	124	due to instruction sequencing (and actual instructions choice) which is able
	125	to keep both functional units of the pentium going.
	126	We may have lost the ugly register usage restrictions when x86 went 32 bit
	127	but for the pentium it has been replaced by evil instruction ordering tricks.
	128
	129	13 Jan 1997 - des-som3.pl, more optimizations from Svend Olaf.
	130	raw DES at 281,000 per second on a pentium 100.
	131