Modelling a complex cognitive system with limited data: Optimization and generalization in a computational model of reading aloud
Optimizing the Connectionist Dual-Process Model of Reading Aloud (CDP; Perry et al., Journal of Memory and Language, 134, 104468) using large-scale empirical datasets has been shown to enable accurate predictions of independent datasets that were not used for optimization. Here, we investigated CDP’s generalization performance when optimized on small datasets consisting of words, nonwords, or a combination of both. The results showed that CDP’s quantitative performance was similar on small and large datasets, except when it was optimized on small nonword-only datasets. Additionally, CDP’s predictions generally surpassed those derived from regression-based models, suggesting good generalization performance. Using sloppy parameter analyses, we also found that a small number of parameters determined most of CDP’s quantitative performance and that these parameters were similar across both small and large datasets. These findings suggest that CDP does not overfit the data, even when optimized on very small numbers of stimuli. They also give insight into the roles the parameters play in generating psycholinguistic effects. More generally, the findings show that when an underlying cognitive architecture constrains behavior, complex systems like reading may be analyzed and understood using very limited data. This is important as it shows that computational modelling can be used in some situations where data are scarce but understanding the system remains crucial.
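The sloppy parameter analysis mentioned above is typically carried out by examining the eigenvalue spectrum of the cost-function Hessian at the fitted parameters: a few "stiff" eigendirections with large eigenvalues dominate model performance, while the remaining "sloppy" directions barely affect the fit. A minimal sketch of this idea on a toy two-exponential model (not CDP itself; the model, parameter values, and function names are illustrative assumptions) might look like this:

```python
import numpy as np

# Toy "sloppy model": a sum of two exponentials fit to synthetic data.
# This is a standard illustration of sloppiness, not the CDP model.

def model(t, theta):
    a1, a2, k1, k2 = theta
    return a1 * np.exp(-k1 * t) + a2 * np.exp(-k2 * t)

def cost(theta, t, y):
    # Least-squares cost between model predictions and data.
    r = model(t, theta) - y
    return 0.5 * np.sum(r ** 2)

def hessian_fd(f, theta, eps=1e-4):
    """Central finite-difference Hessian of a scalar function f at theta."""
    n = len(theta)
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            tpp = theta.copy(); tpp[i] += eps; tpp[j] += eps
            tpm = theta.copy(); tpm[i] += eps; tpm[j] -= eps
            tmp = theta.copy(); tmp[i] -= eps; tmp[j] += eps
            tmm = theta.copy(); tmm[i] -= eps; tmm[j] -= eps
            H[i, j] = (f(tpp) - f(tpm) - f(tmp) + f(tmm)) / (4 * eps ** 2)
    return H

t = np.linspace(0, 5, 50)
true_theta = np.array([1.0, 0.8, 1.5, 1.6])  # nearly degenerate decay rates
y = model(t, true_theta)

H = hessian_fd(lambda th: cost(th, t, y), true_theta)
eigvals = np.sort(np.linalg.eigvalsh(H))[::-1]
ratio = eigvals[0] / max(abs(eigvals[-1]), 1e-300)
print("Hessian eigenvalue spread:", ratio)
# A spread of many orders of magnitude indicates a sloppy spectrum:
# only the leading (stiff) directions meaningfully constrain the fit.
```

The finding that a small number of parameters carries most of CDP's quantitative performance corresponds, in this picture, to the cost being sensitive along only a few stiff eigendirections of the Hessian.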