Clock Enable Analysis¶
Consider the following code:
1-- shift in the data and stuff the clock-crossing FIFO.
2-- this occurs in the demodulator's clock domain
3bit_counter: process(ts_clk, rst)
4begin
5 if rising_edge(ts_clk) then
6 cc_write <= '0';
7 packet_done <= '0';
8
9 case shift_state is
10 when WAIT_START =>
11 bitno <= 0;
12 if ts_valid = '1' and ts_sync = '1' then
13 bitno <= 1;
14 shift_state <= SHIFTING;
15 end if;
16
17 when SHIFTING =>
18 -- every 16 bits, send the data to the clock crossing FIFO
19 if bitno mod 16 = 0 then
20 cc_write <= '1';
21 end if;
22
23 -- every 1506 bits (188 bytes) we re-start the counter
24 if bitno = TS_PACKET_BITS then
25 shift_state <= WAIT_END;
26 else
27 bitno <= bitno + 1;
28 end if;
29
30 when WAIT_END =>
31 packet_done <= '1';
32
33 if ts_valid = '0' then
34 shift_state <= WAIT_START;
35 end if;
36 end case;
37 end if;
38
39 if rst = '1' then
40 shift_state <= WAIT_START;
41 bitno <= 0;
42 end if;
43end process;
This theoretically works but is not optimal:
< MatthiasM> when you don't assign the state signal in every state - you will end up with a huge logic cloud which computes the CE for all states combined < MatthiasM> your code basically says "don't change state when the following conditions don't apply: <whole conditions of your FSM qualified by the current state>" < MatthiasM> eg: state.ce <= '0' when state = WAIT_START and (ts_valid = '0' or ts_sync = '0') else ....; < tzanger> so why not put state <= state at the top of the process to "capture" all of that so that state is assigned every time through the process? < MatthiasM> tzanger: this will just replace the CE with a MUX < MatthiasM> if you assign to state a constant it will optimize to a local MUX < MatthiasM> not sure if you could call that a bug/missed optimization in Quartus or not
He goes on to say that what you will likely get is a one-hot coded FSM with the states defined roughly as follows:
state.WAIT_START <= (state.WAIT_START and not (ts_valid = '1' and ts_sync = '1')) or (state.WAIT_END and not ts_valid); state.SHIFTING <= (state.SHIFTING and not (bitno = TS_PACKET_BITS)) or (state.WAIT_START and (ts_valid = '1' and ts_sync = '1')) state.WAIT_END <= (state.WAIT_END and ts_valid) or (state.SHIFTING and (bitno = TS_PACKET_BITS));
I asked him if just putting state <= state
at the top of the clocked process (inside the if rising_edge(ts_clk)
) would alleviate this “huge logic cloud” but it won’t. It’ll generate a MUX instead which led to a discussion on optimization. As he mentioned there if you assign state to a constant all the time Quartus will reduce the logic to a local MUX which is tiny and good.
< tzanger> so it's not so much the state <= default_value that is the issue, but rather that state <= state (i.e. a non-constant) that is the issue < tzanger> it's the constant-ness that is important here *as well as* the signal-is-always-driven-ness < MatthiasM> yep < MatthiasM> btw - your bitno also has a complex CE condition :)
The bitno
signal is a has a complex clock enable because I am only allowing its value to change in the following cases:
state = WAIT_START
state = SHIFTING
andbitno != TS_PACKET_BITS
Specifically that bitno != TS_PACKET_BITS
creates a bigass AND gate since I am comparing a 16-bit value (bitno
) against a constant; all the bits have to be compared. I think that makes 19-term logic... the specific value, the two state checks and the combinatorial logic to get the final true/false out of all of it. FPGAs these days typically have 4-input LUTs, which means I have to burn through a lot of LUTs to get this logic resolved.
If I move the bitno <= bitno + 1
outside of the comparison against TS_PACKET_BITS
then the CE logic for that 16-bit signal decomposes to
bitno.CE <= not state.WAIT_END
(since all the other states end up changing the value of bitno
.)
I had initially suggested just saying bitno <= bitno;
at the start of the clocked process so that bitno
's CE was always active and the input to bitno
was just a 4:1 MUX or somesuch (one for each state, plus reset) – this would also work but if we’re talking about optimization then it’s actually far worse:
- code as-is: bigass logic chain to create
bitno
's CE signal - code with
bitno <= bitno;
: even bigger logic because now there’s a 16-bit 4:1 MUX - code as MatthiasM suggested: no logic at all
I suppose Quartus could optmize the 4:1 MUX to a 3:1 or even 2:1 MUX if it could see the optimization that two of the 4 cases could just drive the reset line of the bitno
FFs but it’s still not as clean as MatthiasM’s solution. His solution works for my particular case as well because as soon as bitno
's value hits TS_PACKET_BITS
I jump out of the state that increments it and the value just stays still until I see the end of the TS packet and jump to the WAIT_START state which zeroes the count anyway.
Very cool little lesson on taking a step back and trying to visualize how the synthesis tool will actually create logic from the gibberish you feed it. Of course you don’t have to try to visualize it; you can always synthesize it and see what the logic looks like in the RTL viewer.