Chapter 4 - Learning about Pipelining

Segment 30647: Sequence Generator

Optimized for size: 156 cycles, 4 nodes, 17 instructions

All the logic is contained in Node 2.

We need to “look at” each input value twice: first for determining which one is larger, then again to create the output stream. Node 1 passes two copies of IN.A to Node 2, and Node 2 stores a copy of IN.B in BAK.

Optimized for speed: 96 cycles, 7 nodes, 20 instructions

Save file

In Chapter 3, Sequence Amplifier taught us to speed up a calculation by using multiple nodes to do the same operations on multiple streams of data. Today we’ll use multiple nodes to do different operations on the same stream of data.

It wasn’t very efficient to have Node 2 doing all the work; in the size-optimized solution, Node 2 is always running while Nodes 6 and 9 are idle almost half the time. We can split its work into three separate tasks.

Node 2 still calculates IN.B - IN.A, but instead of using the result to sort the outputs as before, it just passes the result to Node 6 and grabs another input value as soon as possible.

Node 6 puts the outputs in the correct order, and Node 9 adds a 0 onto the end.

After running the program, look at the path from IN.B to OUT. Node 6 is busiest at 8% idle, but Nodes 2 and 9 are only 17% idle. (The other nodes are less busy, but they’re just for plumbing.) By keeping each node about equally busy with its own step in the pipeline, we cut down on the amount of time spent waiting.

For homework, swap the order of the MOV ACC RIGHT and MOV ACC DOWN instructions in Node 1. What happens? Why?

Optimized for nodes: 108 cycles, 5 nodes, 21 instructions

Save file

Solution by scrat98.

The idea is the following: we need to store the difference IN.B – IN.A and the number IN.A. First, we can determine in what order to display the numbers A and B, and second, we know these two numbers (B = B – A + A).

For a start, we will send the number A into two processors. Fundamentally, to first send to the right processor, and then only to the down(below will be clearer).

In the right CPU, we calculate the difference B – A, while at the down CPU we will only get a top – for parallelization processes.

Next we look at the value of B – A, and left get A twice – first time to determine the number of B(as B is B – A + A), and the second time to display the number of A. Now it becomes clear that logical in the beginning to submit A into the right CPU and then only to the down, because most of the work is here and we fundamentally get the difference B – A before the number A.

Note: If you decide to change the flow of A first at the down processor and then to the right, we get 138 cycles, you can check:)

Segment 31904: Sequence Counter

Neither of these solutions uses the SWP instruction.

Optimized for speed: 220 cycles, 6 nodes, 22 instructions

Save file

Solution by gmnenad.

Node 7 behaves like an extra register for node 8, storing the sum of all the inputs so far.

Node 5 checks whether the input is 0 (signaling the end of a sequence) and passes the result to node 9, which counts the length of each sequence.

Optimized for size: 298 cycles, 4 nodes, 19 instructions

Save file

Solution by CaitSith2.

Node 4 decides whether the input is 0 or not. If it is not 0, the input is passed to Node 8, where it is added, and node 9 gets instructed to increment the ACC. A 0 likewise tells nodes 8 and 9 to output their totals, and reset. This is done JRO style.

Segment 32050: Signal Edge Detector

Optimized for size: 347 cycles, 4 nodes, 14 instructions

Save file

Node 6 subtracts the current input from the previous input stored in ACC, takes the absolute value of the difference, and uses it to determine the output. It then takes a second copy of the current input (generated by node 1) and stores it in ACC.

Optimized for speed: 218 cycles, 6 nodes, 20 instructions

Save file

Optimizations by CaitSith2 and Solomute.

Node 1 splits the input stream in order to stagger it. Node 2 uses the BAK register to continuously supply the previous value to node 5, which subtracts it from the current value to get the difference. To eliminate the need to check for two distinct cases (difference > 10 and difference < -10), node 8 computes the absolute value of the difference. Finally, node 9 checks whether the absolute difference is at least 10.

For homework:

Try to predict how many cycles would be wasted by switching which side is subtracted in node 6.
Swap the MOV ACC RIGHT and MOV ACC DOWN instructions in node 1. How many cycles are wasted?
What happens if you make both changes at once?

Segment 33762: Interrupt Handler

Optimized for speed: 195 cycles, 10 nodes, 44 instructions

Save file

Finishing touches by CaitSith2 and Solomute.

Nodes 0 through 3 output their respective input numbers when their inputs change from zero to one; they output zero for all other conditions. (They are ARMED when the previous input was zero, and DISARMED when the previous input was one.)

Nodes 5, 6, and 9 collect the outputs from nodes 0 through 3. Since we are guaranteed that two interrupts will never change in the same input cycle, adding the output values together results in the correct value for OUT.

Optimzed for size: 427 cycles, 6 nodes, 40 instructions

Save file

Solution by CaitSith2.

Works much like the optimized for speed solution, ecept that node 0 passes the value to node 1; node 1 passes its value and node 0’s value to node 2; node 3 passes its value to node 2; and node 2 passes its value down, along with the values from nodes 0, 1, and 3. Node 6 then adds up the values and passes the result down.

Tessellated Intelligence System Best Practices

Chapter 4 - Learning about Pipelining

Segment 30647: Sequence Generator

Optimized for size: 156 cycles, 4 nodes, 17 instructions

Optimized for speed: 96 cycles, 7 nodes, 20 instructions

Optimized for nodes: 108 cycles, 5 nodes, 21 instructions

Segment 31904: Sequence Counter

Optimized for speed: 220 cycles, 6 nodes, 22 instructions

Optimized for size: 298 cycles, 4 nodes, 19 instructions

Segment 32050: Signal Edge Detector

Optimized for size: 347 cycles, 4 nodes, 14 instructions

Optimized for speed: 218 cycles, 6 nodes, 20 instructions

Segment 33762: Interrupt Handler

Optimized for speed: 195 cycles, 10 nodes, 44 instructions

Optimzed for size: 427 cycles, 6 nodes, 40 instructions