NEFF (the number of effective sequences) measures sequence diversity within multiple sequence alignments (MSAs). MSAs are crucial for extracting correlated-mutation information and are essential for tasks such as contact map and structure prediction. NEFF has been shown to correlate strongly with prediction accuracy in models such as AlphaFold.
For an MSA, NEFF can be formulated as:
\[ \mathrm{NEFF} = \frac{1}{\sqrt{L}} \sum_{n=1}^{N} \frac{1}{1 + \sum_{m=1, m \neq n}^{N} I[S_{m,n} \geq thr]} \]
where \(L\) is the number of residues in the sequence, \(N\) is the number of sequences in the MSA, \(S_{m,n}\) is the sequence identity between the \(m\)-th and \(n\)-th sequences, \(thr\) is the cutoff that determines whether two sequences count as similar, and \(I\) is the Iverson bracket, meaning that \(I[S_{m,n} \geq thr]\) equals 1 if \(S_{m,n} \geq thr\) and 0 otherwise.
Note that \(\frac{1}{\sqrt{L}}\) is used as a normalization factor here.
More generally, NEFF can be viewed as a normalized sum of sequence weights over all sequences in an MSA. If \(n_i\) is the number of sequences similar to sequence \(i\), counting sequence \(i\) itself, then its sequence weight is \(\frac{1}{n_i}\). This is exactly the term inside the sum in the formula: the denominator adds 1 for the sequence itself to the count of other sequences whose identity to it reaches the threshold. This approach for calculating NEFF has been widely used in various contact and structure prediction tools, as demonstrated in references [1-6].
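The formula can be illustrated with a minimal Python sketch. It is not the tool's own implementation: the function names `sequence_identity` and `neff` are hypothetical, sequence identity is assumed to be the fraction of aligned columns with identical symbols (gaps treated like any other character), \(L\) is assumed to equal the alignment length, and the default threshold of 0.8 is only a placeholder.

```python
import math


def sequence_identity(a: str, b: str) -> float:
    """Fraction of aligned columns where both sequences carry the same symbol.

    Assumption: gaps are compared like ordinary characters.
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / len(a)


def neff(msa: list[str], thr: float = 0.8) -> float:
    """Sum of per-sequence weights 1/n_i, normalized by sqrt(L).

    Assumption: L is taken as the alignment length (length of the first row).
    """
    if not msa or len(msa[0]) == 0:
        raise ValueError("MSA must contain at least one non-empty sequence")
    length = len(msa[0])
    total = 0.0
    for n, seq_n in enumerate(msa):
        # Count the *other* sequences whose identity to seq_n reaches thr.
        similar = sum(
            1
            for m, seq_m in enumerate(msa)
            if m != n and sequence_identity(seq_m, seq_n) >= thr
        )
        # The +1 accounts for the sequence itself, giving n_i = 1 + similar.
        total += 1.0 / (1 + similar)
    return total / math.sqrt(length)


if __name__ == "__main__":
    toy_msa = ["ACDE", "ACDE", "ACDF", "WYWY"]
    print(neff(toy_msa, thr=0.75))
```

In this toy alignment with a threshold of 0.75, the first three sequences are mutually similar, so each receives weight 1/3, while the fourth is similar to none of the others and receives weight 1. The weights sum to 2, and dividing by \(\sqrt{4}\) gives a NEFF of 1. The pairwise loop is quadratic in \(N\), which is fine for illustration but slow for large alignments.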
For further assistance or inquiries, please contact the developer or create an issue in the GitHub repository.