The attention output for each head is then concatenated and put through a final dense layer. Each multi-head attention block takes a dictionary as input, which consist of query, key and value. Notice that when using Model subclassing with Functional API, the input has to be kept as a single argument, hence we have to […]