Clock Enable Analysis

Consider the following code:

 1-- shift in the data and stuff the clock-crossing FIFO.
 2-- this occurs in the demodulator's clock domain
 3bit_counter: process(ts_clk, rst)
 5        if rising_edge(ts_clk) then
 6                cc_write <= '0';
 7                packet_done <= '0';
 9                case shift_state is
10                when WAIT_START =>
11                        bitno <= 0;
12                        if ts_valid = '1' and ts_sync = '1' then
13                                bitno <= 1;
14                                shift_state <= SHIFTING;
15                        end if;
17                when SHIFTING =>
18                        -- every 16 bits, send the data to the clock crossing FIFO
19                        if bitno mod 16 = 0 then
20                                cc_write <= '1';
21                        end if;
23                        -- every 1506 bits (188 bytes) we re-start the counter
24                        if bitno = TS_PACKET_BITS then
25                                shift_state <= WAIT_END;
26                        else
27                                bitno <= bitno + 1;
28                        end if;
30                when WAIT_END =>
31                        packet_done <= '1';
33                        if ts_valid = '0' then
34                                shift_state <= WAIT_START;
35                        end if;
36                end case;
37        end if;
39        if rst = '1' then
40                shift_state <= WAIT_START;
41                bitno <= 0;
42        end if;
43end process;

This theoretically works but is not optimal:

< MatthiasM> when you don't assign the state signal in every state - you will end up with a huge logic cloud which computes the CE for all states combined
< MatthiasM> your code basically says "don't change state when the following conditions don't apply: <whole conditions of your FSM qualified by the current state>" 
< MatthiasM> eg: state.ce <= '0' when state = WAIT_START and (ts_valid = '0' or ts_sync = '0') else ....;
< tzanger> so why not put state <= state at the top of the process to "capture" all of that so that state is assigned every time through the process?
< MatthiasM> tzanger: this will just replace the CE with a MUX
< MatthiasM> if you assign to state a constant it will optimize to a local MUX
< MatthiasM> not sure if you could call that a bug/missed optimization in Quartus or not

He goes on to say that what you will likely get is a one-hot coded FSM with the states defined roughly as follows:

state.WAIT_START <= (state.WAIT_START and not (ts_valid = '1' and ts_sync = '1')) or (state.WAIT_END and not ts_valid);
state.SHIFTING <= (state.SHIFTING and not (bitno = TS_PACKET_BITS)) or (state.WAIT_START and (ts_valid = '1' and ts_sync = '1'))
state.WAIT_END <= (state.WAIT_END and ts_valid) or (state.SHIFTING and (bitno = TS_PACKET_BITS));

I asked him if just putting state <= state at the top of the clocked process (inside the if rising_edge(ts_clk)) would alleviate this “huge logic cloud” but it won’t. It’ll generate a MUX instead which led to a discussion on optimization. As he mentioned there if you assign state to a constant all the time Quartus will reduce the logic to a local MUX which is tiny and good.

< tzanger> so it's not so much the state <= default_value that is the issue, but rather that state <= state (i.e. a non-constant) that is the issue
< tzanger> it's the constant-ness that is important here *as well as* the signal-is-always-driven-ness
< MatthiasM> yep
< MatthiasM> btw - your bitno also has a complex CE condition :)

The bitno signal is a has a complex clock enable because I am only allowing its value to change in the following cases:

  • state = WAIT_START
  • state = SHIFTING and bitno != TS_PACKET_BITS

Specifically that bitno != TS_PACKET_BITS creates a bigass AND gate since I am comparing a 16-bit value (bitno) against a constant; all the bits have to be compared. I think that makes 19-term logic... the specific value, the two state checks and the combinatorial logic to get the final true/false out of all of it. FPGAs these days typically have 4-input LUTs, which means I have to burn through a lot of LUTs to get this logic resolved.

If I move the bitno <= bitno + 1 outside of the comparison against TS_PACKET_BITS then the CE logic for that 16-bit signal decomposes to

bitno.CE <= not state.WAIT_END

(since all the other states end up changing the value of bitno.)

I had initially suggested just saying bitno <= bitno; at the start of the clocked process so that bitno's CE was always active and the input to bitno was just a 4:1 MUX or somesuch (one for each state, plus reset) – this would also work but if we’re talking about optimization then it’s actually far worse:

  • code as-is: bigass logic chain to create bitno's CE signal
  • code with bitno <= bitno;: even bigger logic because now there’s a 16-bit 4:1 MUX
  • code as MatthiasM suggested: no logic at all

I suppose Quartus could optmize the 4:1 MUX to a 3:1 or even 2:1 MUX if it could see the optimization that two of the 4 cases could just drive the reset line of the bitno FFs but it’s still not as clean as MatthiasM’s solution. His solution works for my particular case as well because as soon as bitno's value hits TS_PACKET_BITS I jump out of the state that increments it and the value just stays still until I see the end of the TS packet and jump to the WAIT_START state which zeroes the count anyway.

Very cool little lesson on taking a step back and trying to visualize how the synthesis tool will actually create logic from the gibberish you feed it. Of course you don’t have to try to visualize it; you can always synthesize it and see what the logic looks like in the RTL viewer.

Add picture from clipboard (Maximum size: 1 GB)