Validation of the training set for developing the four-body statistical potentials

 

A random subset that is 70% of the size of the original training set (1167 chains) is selected, and the four-body potentials are developed using this subset alone. This experiment is repeated ten times. The following table gives the correlation between the log-likelihoods developed from each 70% subset and the original four-body potentials (developed using the full training set). Every subset yields scores that are highly correlated with the original potentials (0.87 to 0.89, about 0.88 on average).

 

SET          Correlation
70_set1      0.88
70_set2      0.88
70_set3      0.88
70_set4      0.88
70_set5      0.88
70_set6      0.87
70_set7      0.89
70_set8      0.87
70_set9      0.88
70_set10     0.89
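The resampling experiment can be sketched roughly as follows. This is an illustration only: the scoring function `toy_scores` (a log relative frequency with a pseudocount) is a stand-in and not the four-body potential itself, the chains are synthetic lists of integer labels, and all function names are hypothetical. Only the 70% fraction, the ten repeats and the 1167-chain training set size come from the text.

```python
import math
import random
from collections import Counter


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def toy_scores(chains, keys):
    """Stand-in scoring (assumption): log relative frequency, +1 pseudocount.

    It only gives the sketch something to correlate; it is not the
    log-likelihood definition used for the actual potentials.
    """
    counts = Counter(q for chain in chains for q in chain)
    total = sum(counts.values()) + len(keys)
    return [math.log((counts[k] + 1) / total) for k in keys]


def subset_correlations(chains, keys, fraction=0.7, repeats=10, seed=0):
    """Correlate scores from random subsets with scores from the full set."""
    rng = random.Random(seed)
    full = toy_scores(chains, keys)
    size = round(fraction * len(chains))
    correlations = []
    for _ in range(repeats):
        subset = rng.sample(chains, size)  # sampling without replacement
        correlations.append(pearson(toy_scores(subset, keys), full))
    return correlations


# Synthetic data: 1167 "chains", each a list of 40 quadruplet labels (0..49).
data_rng = random.Random(1)
keys = list(range(50))
weights = [1 / (i + 1) for i in keys]
chains = [data_rng.choices(keys, weights=weights, k=40) for _ in range(1167)]

correlations = subset_correlations(chains, keys)
for i, r in enumerate(correlations, start=1):
    print(f"70_set{i}  {r:.2f}")
```

The values printed for this synthetic data say nothing about the real potentials; the sketch only shows the shape of the procedure (sample, rescore, correlate against the full-set scores).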

 

 

The correlation coefficients given above are calculated for the whole 8855 x 5 table of log-likelihoods (8855 possible quadruplet compositions in each of the five classes defined based on backbone chain connectivity). On the other hand, if the calculation is restricted to the log-likelihoods of quadruplets that were observed in both the 70% subset and the whole training set, much higher correlation coefficients are obtained.
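The two ways of computing the correlation can be illustrated with a short sketch. The count 8855 is the number of unordered selections of four residues (with repetition) from the 20 amino acid types, which the code confirms. Everything else is an assumption made for illustration: the function names are hypothetical, the scores are toy values, and the `fill` value used for unobserved quadruplets in the whole-table calculation is a guess, since the text does not say how such entries are treated.

```python
import math
from itertools import combinations_with_replacement

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residue types

# Unordered quadruplet compositions, repetition allowed: C(23, 4) = 8855.
compositions = list(combinations_with_replacement(AMINO_ACIDS, 4))
# One table entry per composition in each of the five connectivity classes.
keys = [(comp, cls) for comp in compositions for cls in range(5)]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def correlation_whole_table(full, subset, keys, fill=0.0):
    """Correlate over every table entry; `fill` (assumption) stands in
    for quadruplets that were never observed."""
    xs = [full.get(k, fill) for k in keys]
    ys = [subset.get(k, fill) for k in keys]
    return pearson(xs, ys)


def correlation_over_shared(full, subset):
    """Correlate only over entries observed in both score tables."""
    shared = sorted(full.keys() & subset.keys())
    return pearson([full[k] for k in shared], [subset[k] for k in shared])


# Toy scores: dicts hold only the quadruplets that were actually observed.
full = {keys[0]: 1.2, keys[1]: -0.4, keys[2]: 0.3, keys[3]: -0.9}
subset = {keys[0]: 1.0, keys[1]: -0.5, keys[2]: 0.4}

r_whole = correlation_whole_table(full, subset, keys)
r_shared = correlation_over_shared(full, subset)
print(len(compositions), len(keys))
print(f"whole table: {r_whole:.2f}, shared entries only: {r_shared:.2f}")
```

Storing only observed quadruplets in each dictionary makes the restricted calculation a simple key intersection, while the whole-table calculation has to commit to some value for the missing entries.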