A pretty straightforward program: just MOV
data from one port to the next. This program shows the theoretical maximum throughput of one value per 2 cycles.
Mostly the same as the optimized version, but let’s take a look at node 0. It spends over 99% of its time in the inner loop:
MOV 638 ACC #4
INNER: SUB 1 #5
JGZ INNER #6
This is a simple loop that runs for 638 iterations, counting down from 638 to 0. It’s almost always more efficient to count from N to 0 than from 0 to N in TIS-100; here’s what the same loop would look like, but counting from 0 to 638:
MOV 0 ACC # Initialize counter to 0.
INNER: SUB 638 # Check if the counter is at 638...
JEZ DONE # ...using subtraction and <0 comparison.
ADD 639 # Reverse the subtraction and add 1.
JMP INNER # Iterate the loop.
DONE:
Since TIS-100’s conditional jump instructions are based on comparing ACC
to zero, on every iteration we have to subtract 638 from the counter, compare the result to 0, then add 638 back again. This takes up 2 extra lines of code and 2 extra cycles per loop iteration. (Note: having the exit command JEZ
in the middle of the loop also makes it easier to introduce off-by-one errors; any parts of the loop above the JEZ
will execute 639 times, while the parts below will execute 638 times. To understand why, try mentally walking through both loops but counting to 2 instead of 638.)
Back to the program:
MOV 2 ACC #1
OUTER: SUB 1 #2
SAV #3
# INNER LOOP... #
SWP #7
JGZ OUTER #8
The inner loop is wrapped inside an outer loop that runs for 2 iterations. The outer loop’s counter is saved to BAK
before the inner loop writes all over ACC
, and then restored once the inner loop is finished.
Finally, we let the node do what it’s supposed to do and…
MOV UP DOWN #9
This program could be further pessimized by changing the constants from 2 and 638 to 999 and 999, for a total running time of about 80 million cycles.
Multiplying a number by 2 is the same as adding it to itself, but we have to read it into ACC
first. The read and the addition take 2 cycles, leaving the input and the downstream nodes idle 50% of the time.
The addition step is a bottleneck, but fortunately it’s simple enough to parallelize easily. If we can split the input values into two streams by sending them alternately to nodes 2 and 4, then we can join the streams back together at node 5 and double our throughput. We could write
MOV UP DOWN
MOV UP RIGHT
at node 1 and
MOV LEFT DOWN
MOV UP DOWN
at node 5 just to make sure the values come out in the right order, but we’ll save two instructions by using ANY
to handle the order for us: since both streams take the same number of cycles, whichever value went in first will still come out first!
Finishing touches by Solomute.
Node 2: calculate IN.B - IN.A
and pass it on
Node 9: pass IN.B - IN.A
to OUT.N
and node 8
Node 8: negate the value from node 9 to get IN.A - IN.B
; pass it to OUT.P
Solution by imamassi.
Node 9 multiplies IN.B - IN.A
by -1 before passing it to node 8, saving an instruction.
Nodes 6, 7, and 8 have nearly identical programs: First, pass the input value along to the next node. If the value is (greater than / equal to / less than) zero, output a one. Otherwise, output a zero and move on to the next input.
Nodes 0 and 4 do some preprocessing on the input, as shown in the table below. The final value going into node 6 will be 1 for any negative input, 3 for a zero input, and 5 or greater for any positive input.
Pipelining these steps instead of waiting until node 6 to do them saves almost 200 cycles.
Step | Operation | |||||
---|---|---|---|---|---|---|
0 | (Input) | -999 | -1 | 0 | 1 | 999 |
1 | SUB 998 |
-999 | -999 | -998 | -997 | 1 |
2 | ADD 999 |
0 | 0 | 1 | 2 | 999 |
3 | ADD 1 |
1 | 1 | 2 | 3 | 999 |
4 | ADD ACC |
2 | 2 | 4 | 6 | 999 |
5 | SUB 1 |
1 | 1 | 3 | 5 | 998 |
Nodes 6 through 8 use the processed input value as an index into the rest of the program, similar to the switch
statement in C. (The labels are just for ease of reading.)
LOOP:
# ... #
JRO ACC
1: MOV 0 DOWN
JMP LOOP
3: MOV 1 DOWN
JMP LOOP
5+: MOV 0 DOWN
Since a positive input can result in a value anywhere from 5 to 998, and a JRO
past the end of the program behaves the same as a JRO
to the end, the 5 “case” needs to be the last instruction in the program.
If we had to choose between a fixed range of values (for example, 1 through 5), we could write a straightforward jump table instead. And in the next problem, we’ll do that.
Speed optimizations by imamassi.
Node 2 contains all the logic. When IN.S
is nonzero, we need to discard one of the values, otherwise IN.A
and IN.B
will get out of sync with each other.
Node 2 calculates IN.S + 2
, which transforms -1 / 0 / 1 into 1 / 2 / 3. Node 6 uses that value as an argument to JRO
, and reads IN.A
and IN.B
from nodes 5 and 7.
Notes on this program:
MOV
the unused input to NIL
in order to get rid of it.JMP
instruction in the case that IN.S
is 1.The trick here is recognizing that instead of having one node that chooses between three output values, we can use two nodes that each make a simpler choice between two output values. When IN.S
is zero, we add both IN.A
and IN.B
to the output; when IN.S
is -1, we only add IN.A
; and when IN.S
is 1, we only add IN.B
.
In C syntax, the behavior we’re looking for is: OUT = ( (IN.S > 0) ? 0 : IN.A ) + ( (IN.S < 0) ? 0 : IN.B )
. The conditional assignments are parallelized between nodes 1 and 3, and node 6 performs the addition.