A pretty straightforward program: just `MOV`

data from one port to the next. This program shows the theoretical maximum throughput of one value per 2 cycles.

Mostly the same as the optimized version, but let’s take a look at node 0. It spends over 99% of its time in the inner loop:

```
MOV 638 ACC #4
INNER: SUB 1 #5
JGZ INNER #6
```

This is a simple loop that runs for 638 iterations, counting down from 638 to 0. It’s almost always more efficient to count from *N* to 0 than from 0 to *N* in TIS-100; here’s what the same loop would look like, but counting from 0 to 638:

```
MOV 0 ACC # Initialize counter to 0.
INNER: SUB 638 # Check if the counter is at 638...
JEZ DONE # ...using subtraction and <0 comparison.
ADD 639 # Reverse the subtraction and add 1.
JMP INNER # Iterate the loop.
DONE:
```

Since TIS-100’s conditional jump instructions are based on comparing `ACC`

to zero, on every iteration we have to subtract 638 from the counter, compare the result to 0, then add 638 back again. This takes up 2 extra lines of code and 2 extra cycles per loop iteration. (Note: having the exit command `JEZ`

in the middle of the loop also makes it easier to introduce off-by-one errors; any parts of the loop above the `JEZ`

will execute 639 times, while the parts below will execute 638 times. To understand why, try mentally walking through both loops but counting to 2 instead of 638.)

Back to the program:

```
MOV 2 ACC #1
OUTER: SUB 1 #2
SAV #3
# INNER LOOP... #
SWP #7
JGZ OUTER #8
```

The inner loop is wrapped inside an outer loop that runs for 2 iterations. The outer loop’s counter is saved to `BAK`

before the inner loop writes all over `ACC`

, and then restored once the inner loop is finished.

Finally, we let the node do what it’s supposed to do and…

```
MOV UP DOWN #9
```

This program could be further pessimized by changing the constants from 2 and 638 to 999 and 999, for a total running time of about 80 million cycles.

Multiplying a number by 2 is the same as adding it to itself, but we have to read it into `ACC`

first. The read and the addition take 2 cycles, leaving the input and the downstream nodes idle 50% of the time.

The addition step is a bottleneck, but fortunately it’s simple enough to parallelize easily. If we can split the input values into two streams by sending them alternately to nodes 2 and 4, then we can join the streams back together at node 5 and double our throughput. We could write

```
MOV UP DOWN
MOV UP RIGHT
```

at node 1 and

```
MOV LEFT DOWN
MOV UP DOWN
```

at node 5 just to make sure the values come out in the right order, but we’ll save two instructions by using `ANY`

to handle the order for us: since both streams take the same number of cycles, whichever value went in first will still come out first!

Finishing touches by Solomute.

Node 2: calculate `IN.B - IN.A`

and pass it on

Node 9: pass `IN.B - IN.A`

to `OUT.N`

and node 8

Node 8: negate the value from node 9 to get `IN.A - IN.B`

; pass it to `OUT.P`

Solution by imamassi.

Node 9 multiplies `IN.B - IN.A`

by -1 before passing it to node 8, saving an instruction.

Nodes 6, 7, and 8 have nearly identical programs: First, pass the input value along to the next node. If the value is (greater than / equal to / less than) zero, output a one. Otherwise, output a zero and move on to the next input.

Nodes 0 and 4 do some preprocessing on the input, as shown in the table below. The final value going into node 6 will be 1 for any negative input, 3 for a zero input, and 5 or greater for any positive input.

Pipelining these steps instead of waiting until node 6 to do them saves almost 200 cycles.

Step | Operation | |||||
---|---|---|---|---|---|---|

0 | (Input) | -999 | -1 | 0 | 1 | 999 |

1 | `SUB 998` |
-999 | -999 | -998 | -997 | 1 |

2 | `ADD 999` |
0 | 0 | 1 | 2 | 999 |

3 | `ADD 1` |
1 | 1 | 2 | 3 | 999 |

4 | `ADD ACC` |
2 | 2 | 4 | 6 | 999 |

5 | `SUB 1` |
1 | 1 | 3 | 5 | 998 |

Nodes 6 through 8 use the processed input value as an index into the rest of the program, similar to the `switch`

statement in C. (The labels are just for ease of reading.)

```
LOOP:
# ... #
JRO ACC
1: MOV 0 DOWN
JMP LOOP
3: MOV 1 DOWN
JMP LOOP
5+: MOV 0 DOWN
```

Since a positive input can result in a value anywhere from 5 to 998, and a `JRO`

past the end of the program behaves the same as a `JRO`

to the end, the 5 “case” needs to be the last instruction in the program.

If we had to choose between a fixed range of values (for example, 1 through 5), we could write a straightforward jump table instead. And in the next problem, we’ll do that.

Speed optimizations by imamassi.

Node 2 contains all the logic. When `IN.S`

is nonzero, we need to discard one of the values, otherwise `IN.A`

and `IN.B`

will get out of sync with each other.

Node 2 calculates `IN.S + 2`

, which transforms -1 / 0 / 1 into 1 / 2 / 3. Node 6 uses that value as an argument to `JRO`

, and reads `IN.A`

and `IN.B`

from nodes 5 and 7.

Notes on this program:

- Both inputs always need to be consumed even if only one is used, so we
`MOV`

the unused input to`NIL`

in order to get rid of it. - The possible cases in node 6 are reordered to 1 / -1 / 0 to avoid an extra
`JMP`

instruction in the case that`IN.S`

is 1.

The trick here is recognizing that instead of having one node that chooses between three output values, we can use *two* nodes that each make a simpler choice between *two* output values. When `IN.S`

is zero, we add both `IN.A`

and `IN.B`

to the output; when `IN.S`

is -1, we only add `IN.A`

; and when `IN.S`

is 1, we only add `IN.B`

.

In C syntax, the behavior we’re looking for is: `OUT = ( (IN.S > 0) ? 0 : IN.A ) + ( (IN.S < 0) ? 0 : IN.B )`

. The conditional assignments are parallelized between nodes 1 and 3, and node 6 performs the addition.