
Dr. Fernanda Ferreira
Department of Psychology
University of California, Davis

July 26, 2016

Dear Dr. Ferreira,

Thank you for your and the reviewers’ thoughtful comments on our manuscript entitled “Individual Differences in Statistical Learning: Conceptual and Measurement Issues” by Lucy C. Erickson, Michael P. Kaschak, Erik D. Thiessen, and Cassie A.S. Berry. These comments have allowed us to revise and strengthen the manuscript in response. We have modified the manuscript to include additional analyses suggested by the reviewers, to more clearly elucidate the rationale for selecting the measures and analyses, and to improve the clarity of the arguments throughout the document. In addition, we have corrected minor reporting errors in the Results section that do not substantively change the results.

The actions taken to address the reviewers’ specific concerns are described below. We look forward to hearing from you about the manuscript.

Sincerely,

Lucy C. Erickson
Department of Psychology
University of Maryland
College Park, MD 20742
[email protected]
Phone: (301) 405-4236

Erik D. Thiessen
Department of Psychology
Carnegie Mellon University
Pittsburgh, PA 15213
[email protected]
Phone: (412) 268-6747

Michael P. Kaschak
Department of Psychology
Florida State University
1107 W. Call Street
Tallahassee, FL 32306-4301
[email protected]
Phone: (850) 644-9363

Cassie A.S. Berry
Department of Psychology
Florida State University
1107 W. Call Street
Tallahassee, FL 32306-4301
[email protected]
Phone: (850) 644-9363

Response to Reviews

Reviewer B:

1) General comments and summary of recommendation
Describe your overall impressions and your recommendation, including changes or revisions. Please note that you should pay attention to scientific, methodological, and ethical soundness only, not novelty, topicality, or scope. A checklist of things you may want to consider is below:
- Are the methodologies used appropriate?
- Are any methodological weaknesses addressed?
- Is all statistical analysis sound?
- Does the conclusion (if present) reflect the argument, and is it supported by data/facts?
- Is the article logically structured and succinct, and does the argument flow coherently?
- Are the references adequate and appropriate?

(I reviewed a previous version of this manuscript for a different journal.) This manuscript presents two experiments investigating individual differences in statistical learning (SL). Following recent work showing no correlations between different SL tasks, the authors examined the inter-correlations between tasks within the same modality that have highly similar structure. In Experiment 1, no correlations were found, but this is most likely explained by the low reliability of the tasks used. In Experiment 2, the same pattern of zero-correlations between SL tasks was found, despite the much improved test-retest reliability of (some of) the tasks used in this study. I found the manuscript interesting and clearly written. It addresses important issues (both theoretical and methodological), which are not sufficiently dealt with in the literature. My recommendation is, therefore, that the study be published in the journal, with some simple modifications. Below I review specific points in the manuscript where I believe some additional discussion and analysis can make the paper more impactful and clear.

We thank the reviewer for these comments, and will address the reviewer’s specific suggestions below.

The first issue is that one of the main thrusts of the paper is to argue for the value of aggregating scores across tasks. However, the rationale for calculating the composite scores, both in Experiment 1 and 2, is not stated clearly enough.

If two (or more) tasks are not correlated, what does their composite score measure exactly? As the authors note, one important (yet theoretically uninteresting) reason for improved reliability for the composite scores is the larger number of trials. But other than that, the meaning of a composite score of uncorrelated tasks is unclear, considering that the original tests’ scores are (almost) zero-correlated.

We thank the reviewer for this comment, and agree that the manuscript would be strengthened by more explicitly laying out the rationale for compositing. Although we are also puzzled by the lack of correlation between the different languages, and agree that caution should be observed in creating composites from uncorrelated measures, we believe that this approach is appropriate for two reasons. The first reason is that these are not randomly chosen tasks that we chose to correlate. Instead, they are tasks identical in procedure, and the languages themselves also share key similarities. Consequently, we think that creating composites constitutes a valid approach. The second reason is that although these tasks are quite similar, the languages chosen do differ on a variety of features. These features may play a role in how learners perform. Thus, averaging over multiple statistical learning tasks with slightly different characteristics may provide a fuller picture of an individual learner’s statistical learning abilities across a range of stimuli. This might be thought of as analogous to SAT scores, which are composites created from multiple distinct abilities or skills. Although a score that combines quantitative and verbal ability certainly summarizes over separable abilities, the creation of a single score that can be used to order individuals has large utility in predicting performance in other domains of interest.

We have modified the manuscript to more clearly present the logic for the use of composites in the Results section of Experiment 1 under the heading of Comparing Time 1 and Time 2. Prior to reporting the composite scores, a new sentence reads: “Because each individual measure necessarily involved a small number of unique test items, a compositing approach was used to investigate whether averaging across measures would result in a measure with better psychometric properties than observed for any individual task” (p. 18). We have also added the following text to this section of the manuscript (p. 18): “Although we acknowledge that caution should be observed in averaging across measures that are not correlated, we argue that this approach is warranted based on the high similarity of the materials and the identical procedures of these measures. The lack of correlations between the individual measures may be the result of the low number of unique items, or repeated items, or because successful performance on the different languages taps into different abilities. Regardless, compositing is a useful strategy that may provide a more complete summary of a given individual’s statistical learning abilities.”
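For concreteness, the compositing step could be sketched as follows. This is a minimal illustration only: the file name and column layout (one difference score per language per time point, one row per participant) are hypothetical, not our actual analysis script.

    import pandas as pd
    from scipy.stats import pearsonr

    # Hypothetical wide-format data: one row per participant, with one
    # difference score per language (lang1-lang4) per time point (t1, t2).
    df = pd.read_csv("sl_scores.csv")  # hypothetical file name

    languages = ["lang1", "lang2", "lang3", "lang4"]
    for t in ("t1", "t2"):
        # Composite = mean of the four segmentation-language scores.
        df[f"composite_{t}"] = df[[f"{lang}_{t}" for lang in languages]].mean(axis=1)

    # Test-retest reliability of the composite.
    r, p = pearsonr(df["composite_t1"], df["composite_t2"])
    print(f"Composite test-retest: r = {r:.2f}, p = {p:.3f}")

In this form, the composite’s test-retest correlation can be compared directly against the single-language reliabilities reported in the manuscript.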

Relatedly, an additional analysis that is currently missing from the manuscript is the composite scores within each task, of the 2AFC and rating scores. In contrast to aggregating uncorrelated scores across tasks, these two measures in each task were actually positively correlated. I therefore see promise in calculating the 2AFC/RS composite scores in each task, and predict that these will result in higher reliability coefficients. This can strengthen the authors’ point regarding the usefulness of composite scores, and provide an elegant way of improving tasks’ reliability.

Although we agree with the suggestion that creating composites from the two kinds of assessments may also represent a promising avenue for increased reliability, we argue that this may not be as straightforward as it seems, nor a replacement for compositing over multiple languages. First, if benefits are conferred by compositing over more measurements, a composite created by averaging over rating scale and alternative forced choice measurements will only stem from two measurements rather than four. Second, the lack of unique items means that this creates a composite based on heavy and repetitive assessment of a very small number of unique items, and it is not clear to what extent this might change how a participant might respond. Finally, although these measures are not correlated, it is not as though they are very different measures that might be expected not to correlate – they all involve an identical procedure and stimuli characterized by relatively high similarity, and they have been used previously in the literature to index the learning mechanism of interest. Despite the lack of inter-correlations, any enterprise that seeks to understand how individual differences in statistical learning ability map onto other outcomes in the real world must grapple with the fact that participants may perform differently on these different tasks.

Here are the results for the Rating Scale and 2AFC composites for Experiment 1, created by summing the two values to avoid problems created by averaging the difference scores, which were sometimes negative if the participants endorsed the part-words at higher levels than the words:

Correlation between RS-AFC composites, Time 1 and Time 2:
Language 1: r = 0.45, p < .001
Language 2: r = 0.26, p = .025
Language 3: r = 0.33, p = .005
Language 4: r = 0.28, p = .015

These results are consistent with the possibility that compositing is useful broadly (across different response formats as well as across different languages), and that in general more data points are helpful in creating a more reliable estimate of an individual’s statistical learning ability. The manuscript has been modified to include a paragraph that describes these results (p. 20).

My other concern is that I do not find considerable merit in Experiment 1, which is somewhat redundant. Experiment 1 basically exemplifies the trivial fact that using unreliable tasks would most likely result in zero correlations between them (as the authors themselves note in the discussion of Experiment 1).

Experiment 2, on the other hand, does a much better job of showing that although the SL tasks show high test-retest reliability, they do not correlate with each other. Given the clear advantage of the design of Experiment 2, the contribution of Experiment 1 seems meager. I therefore suggest removing Experiment 1 (which should be treated in hindsight as a pilot using tasks that are not suitable for studying individual differences), and focusing only on Experiment 2. Alternatively, the authors should better argue for the contribution of Experiment 1 (perhaps the AFC-RS composite analysis suggested above, followed by re-examining the correlations between tasks, can provide some novel insights).

Although we agree that features of Experiment 1 complicate its interpretation, we believe it is important to include it for several reasons, many related to the idea that as a field we should be pushing towards greater transparency, open access to data, and the reporting of even those results that might be challenging to interpret. First, it is informative to know the reliability of these measures when deployed as is typical of statistical learning measures, particularly in the context of highlighting the relative improvement when more test trials are included. This is particularly important because although this may be well known in the context of adult work, it is extremely typical to conduct infant and child research (including individual differences work) using procedures that are unfortunately less than ideal for individual differences work. This is an unfortunate reality of working with developmental participant populations, whose limited attention spans prevent extensive testing periods with large numbers of test trials. Because data with children tend to be noisier than adult data, reporting these results can be thought of as representing an upper bound on the attainable reliability.

In addition, although it might be thought obvious that a greater number of test trials is more advantageous for reliability, this is complicated by the fact that statistical learning tasks are necessarily limited in their number of unique items. An additional challenge is that although repeating test trials may mitigate some of these concerns, it creates the added difficulty of potential learning during the test phase, as well as the concern that participants may simply begin responding differently if asked the same question over and over again. The comparison between the number of test trials used in Experiment 1 and Experiment 2 is informative insofar as it shows the effects of increasing test trials in the absence of increasing the number of unique items. For these reasons, although we are sympathetic to the reviewer’s comment, we feel the inclusion of Experiment 1 is warranted. That said, we appreciate the suggestion of examining the RS/2AFC measures, which is now included in the revised manuscript (p. 20).

Minor issues:

• Pages 13-14: For Languages 1 and 3 the rate of presentation is listed in syllables per minute, but for Language 2 it is listed in syllables per second. This should be changed to enable the reader to compare tasks. Also, pitch is reported just for Language 2; please add it for Languages 1 and 3 as well.

We thank the reviewer for pointing this out. The reason for the languages having been described differently lies in their prosodic properties: Languages 1 and 3 were produced in a monotone by a speech synthesizer, so the pitch does not vary.

Language 2 was produced as natural sentences, with multiple tokens of individual items recorded by a speaker and then edited, such that the rate describes averages across the variably produced tokens in sentences. However, we appreciate the advantages of more easily comparing across the measures and have added information that facilitates cross-language comparison (pp. 13-14).

• Page 14: I did not understand the sentence in the description of Language 3: “Two of the words were twice as frequent as the other words which allowed for the creation of test items that were frequency balanced”. Please rephrase.

We have expanded the discussion of frequency-balancing in Language 3, and hope that it is now clearer. In addition, we have included a citation to Aslin, Saffran, & Newport (1998), which includes a more detailed discussion of frequency vs. transitional probabilities (pp. 14-15).

• Pages 13-14: The number of trials is a critical issue here, as later discussed. Please state clearly how many test items were in each of the tests of Languages 1-4.

We had previously reported the number of test trials in the procedures section rather than in the auditory stimuli section for Experiment 1, which is already lengthy. We have modified the wording in the Experiment 1 procedures to more clearly highlight the test trials used for the RS measures. The description is now as follows: “In the 2AFC test, participants heard two items (a word and a part-word) and were asked to identify which item sounded ‘more familiar’ to the exposure stimulus. There were 8 test trials in which words and part-words were exhaustively paired. In the RS test, participants heard each test item individually, and were asked to rate how familiar (on a scale of 1-5) the item sounded to the language; participants rated each item (2 words and 2 part-words) twice for a total of 8 test trials assessing ratings for 4 distinct words and 4 distinct part-words.” (pp. 15-16)

In addition, we have altered the beginning of the auditory stimuli section in Experiment 1 to also highlight the number of test trials: “Testing for each language consisted of two sets of 8 questions (8 RS questions and 8 2AFC questions)” (p. 12).

• I would avoid reporting p-values for test-retest reliability coefficients (or the use of the word ‘significant’ in these analyses), which are misleading (testing the meaningless null hypothesis of reliability of 0) and are mostly a function of the sample size.

We agree that the null hypothesis of 0 is unlikely to be true; however, we feel that this is generally well understood and prefer the transparency inherent in reporting more rather than less information, particularly given that it is possible, albeit unlikely, that a task would show only a marginal test-retest correlation.

• Appendix: For clarity, please add a table to the Appendix summarizing the correlations between SL tasks and other cognitive measures.

We feel that the appendix is already lengthy given its exploratory nature, and note that one of the other reviewers thought that it should be removed entirely. We believe that the appendix adds value to the manuscript, and have thus chosen to leave it in the manuscript. We believe the best compromise is to present the analyses as a series of hypothesis-driven exploratory analyses rather than to lengthen the appendix by including the results in a redundant table.

2) Figures/tables/data availability:
Please comment on the author’s use of tables, charts, figures, if relevant. Please acknowledge that adequate underlying data is available to ensure reproducibility (see open data policies per discipline of Collabra here).: Tables and figures are clear.

We are glad that the tables and figures present the data with adequate clarity.

Please add a summarizing table to the appendix. I suggest also adding scatter plots showing the (zero) correlations between the different SL tasks.

See response above.

3) Ethical approval:
If humans or animals have been used as research subjects, and/or tissue or field sampling, are the necessary statements of ethical approval by a relevant authority present? Where humans have participated in research, informed consent should also be declared. If not, please detail where you think a further ethics approval/statement/follow-up is required.: Informed consent is not declared in the manuscript.

Although it was not mentioned in the manuscript, all of the testing was conducted in compliance with the ethical standards set by the IRB. We have corrected the manuscript to explicitly state this: “Signed consent was obtained for all participants, and testing was conducted in accordance with the ethical standards established by the university’s Institutional Review Board.” (p. 16; p. 26)

4) Language:
Is the text well written and jargon free? Please comment on the quality of English and any need for improvement beyond the scope of this process.: The manuscript is very well written.

We thank the reviewer for this comment.

----------------------------------------------------------------------------------------------------------

Reviewer D:

1) General comments and summary of recommendation
Describe your overall impressions and your recommendation, including changes or revisions. Please note that you should pay attention to scientific, methodological, and ethical soundness only, not novelty, topicality, or scope. A checklist of things you may want to consider is below:
- Are the methodologies used appropriate?
- Are any methodological weaknesses addressed?
- Is all statistical analysis sound?
- Does the conclusion (if present) reflect the argument, and is it supported by data/facts?
- Is the article logically structured and succinct, and does the argument flow coherently?
- Are the references adequate and appropriate?

This paper reports the results of two experiments aimed at assessing whether several tasks measuring statistical learning in adults are reliable, and whether these tasks correlate with each other. The results suggest that the segmentation measures have moderate reliability, and that they tend to be more reliable as the number of test items used to assess learning is increased. There was very little relation between performance on the various tasks assessing statistical learning. The authors suggest that more work needs to be done to draw firmer conclusions about the extent to which tasks assessing statistical learning may provide meaningful information about individual differences, and the extent to which these tasks tap a general statistical learning mechanism. This work has several strengths: it addresses several important and timely questions, and may provide some information about the reliability of the tasks used. The approach taken entails examining whether performance on various tasks, mostly those assessing segmenting auditory streams using TPs, correlates within an individual over time, and also whether performance on the various tasks correlates within an individual. A strength here is that the authors chose tasks that had been used before, and on which adults had shown evidence of learning, which allowed them to determine whether performance in their sample was comparable to such previous work. However, it is not clear what else motivated the use of these specific materials, and how they individually might be (or might not be) well-suited to testing individual differences. In addition, why were 4 tasks chosen? More consideration of this issue, even after the fact, would improve the paper and make the implications much clearer.

We thank you for these comments. We have now more explicitly described the logic behind the selection of the measures in the introduction to Experiment 1 (p. 9).

I also had some questions about the specific analyses that were done to assess reliability and inter-task relations. Generally speaking, the rationale for the approach the authors have taken is very sparely articulated, and thus the analyses read like a list rather than a set of steps carefully designed to address a set of coherent questions.

We have modified the results sections throughout the manuscript to more clearly articulate the reasons for conducting each analysis prior to describing the results.

Furthermore, I’m unclear why performance across tasks at a given time would be combined to form a composite measure to then assess reliability of composites across time. This seems to assume that these tasks measure something in common, but there is no evidence of that.

As described previously in a response to another reviewer, although these measures are not correlated, it is not as though they are very different measures that might be expected not to correlate: they all involve an identical procedure and stimuli characterized by relatively high similarity, and they have been used previously in the literature to index the learning mechanism of interest. In addition, compositing is a strategy that can be useful in providing a summary across a number of measures of interest even in situations when the components used to create composites are distinct (e.g., an SAT score indexes verbal and quantitative ability, and there are advantages in considering overall performance collapsed across these two dimensions). Thus, we think this constitutes sufficient reason to consider the potential benefits of composites. We have more clearly articulated this point in the results section of Experiment 1 prior to the discussion of the analyses using the composite scores (p. 18).

Thus, I’m not sure the authors really have good evidence about reliability. If this really were the case, why not average performance in a given task across time and then look at whether the tasks correlate with each other? Furthermore, if I am reading things correctly, the 2AFC and RS scores for a language within an individual were correlated, at least in Experiment 1. This would seem to suggest that performance was reliable across different assessments, but very little discussion of this result was provided. Finally, the authors suggest that Experiment 2 provides evidence that using more test items improves reliability, but this time they included more test items for each task, and then assessed reliability over time for each language separately. I’m not sure that collapsing across tasks at each time point in Experiment 1 is necessary to make this point about the number of test trials.

Although averaging across time points for a given language and testing whether those measures correlate with each other may have some benefits, there are also some potential drawbacks to this approach. If one of the advantages of compositing is that more data points are better, averaging across 2 data points (a particular language at Time 1 and Time 2) may only be half as effective as averaging across 4 data points (each language at one time point). In addition, if these languages tap into different abilities as a function of their characteristics (e.g., word length), a composite made of several different languages is likely to be a more useful index of general statistical learning ability than a more limited set of measures.

With that said, we agree that other analyses may also be of interest. Below, we describe analyses that use a composite created from the RS and 2AFC measures, which are also described in the manuscript (p. 20).

In sum, I think that this paper would benefit from a more carefully laid out plan to answer specific questions, and perhaps from including a different set of analyses that is better suited to assessing reliability and relations between tasks.

We thank the reviewer for this point. We have modified the results section so that the analyses read less like a list of results and instead are designed to answer a series of hypotheses. In addition, as suggested by another reviewer, we have included an additional set of analyses that we believe supports the idea that composites may confer benefits for reliability. As described above, below are the results for the Rating Scale and 2AFC composites for Experiment 1, created by summing the two values to avoid problems created by averaging the difference scores, which were sometimes negative if the participants endorsed the part-words at higher levels than the words:

Correlation between RS-AFC composites, Time 1 and Time 2:
Language 1: r = 0.45, p < .001
Language 2: r = 0.26, p = .025
Language 3: r = 0.33, p = .005
Language 4: r = 0.28, p = .015

These results are consistent with the possibility that compositing is useful broadly (across different response formats as well as across different languages), and that in general more data points are helpful in creating a more reliable estimate of an individual’s statistical learning ability. The manuscript has been modified to describe these results (p. 20).

2) Figures/tables/data availability:
Please comment on the author’s use of tables, charts, figures, if relevant. Please acknowledge that adequate underlying data is available to ensure reproducibility (see open data policies per discipline of Collabra here).: OK

3) Ethical approval:
If humans or animals have been used as research subjects, and/or tissue or field sampling, are the necessary statements of ethical approval by a relevant authority present? Where humans have participated in research, informed consent should also be declared. If not, please detail where you think a further ethics approval/statement/follow-up is required.: I am not sure that this is explicitly stated in the MS.

As mentioned previously, although it was not mentioned in the manuscript, all of the testing was conducted in compliance with the ethical standards set by the IRB. We have corrected the manuscript to explicitly state this: “Signed consent was obtained for all participants, and testing was conducted in accordance with the ethical standards established by the university’s Institutional Review Board.” (p. 16; p. 26)

4) Language:
Is the text well written and jargon free? Please comment on the quality of English and any need for improvement beyond the scope of this process.:

It is fine.

----------------------------------------------------------------------------------------------------------

Reviewer I:

1) General comments and summary of recommendation
Describe your overall impressions and your recommendation, including changes or revisions. Please note that you should pay attention to scientific, methodological, and ethical soundness only, not novelty, topicality, or scope. A checklist of things you may want to consider is below:
- Are the methodologies used appropriate?
- Are any methodological weaknesses addressed?
- Is all statistical analysis sound?
- Does the conclusion (if present) reflect the argument, and is it supported by data/facts?
- Is the article logically structured and succinct, and does the argument flow coherently?
- Are the references adequate and appropriate?

Thoughts in brief: The study is methodically sound; the experiments are extensive and pretty well designed. The statistical analysis is pretty good, but there are a few extensions I’d like to see in the final version. The paper lays out the problem clearly, gives good background, and discusses the authors’ interpretation of the results fairly and without speculating further than the mixed bag of results would permit. The writing is good, and the paper as a whole gives some useful explanations and raises very interesting questions about the relationship between language acquisition and use and the general concept of “statistical learning”. I think this is a really tough paper to write (and tough to review as well, so sorry if my comments are confusing), because it’s all about trying to figure out why results don’t fit a hypothesis, and why nothing really seems to have a clear pattern. But I think these negative/null results are very interesting, because I think we have a habit of looking at early statistical learning as a problem where we focus on positive evidence and sweep negative evidence kind of under the rug. A lot of early statistical learning results are fragile, and there’s an interesting debate to be had on whether that fragility is an epiphenomenon of poor measurement or a sign that different cognitive processes are being recruited for really similar-looking problems. (And, of course, the consequences if the latter is true.) This paper could advance that conversation substantially.

We thank the reviewer for these comments, and agree that this is an important question; we also hope that these results will help to advance the discussion, even if they raise many more questions regarding the nature of statistical learning mechanisms.

My main thought here is that the focus of the paper should be shifted slightly to focus not on where effects can be found (some of which are likely to be noise from having multiple languages, measures, and timepoints) but rather on where effects consistently aren’t appearing. The experimental design seems empirically sound and the authors have a good deal of experience finding significant effects in non-individual-differences versions of the experiments in this paper, even using the same stimuli (e.g., Language 3 being in Thiessen & Saffran 2003). Also, previous work (esp. Siegelman & Frost 2015, also work like Johnson & Tyler 2010 “Testing the limits of statistical learning for word segmentation”) has shown that the statistical learning effects and their relationship to individual differences are at least somewhat fragile, so the null effects seem informative here.

We have modified the manuscript to include more discussion about the lack of effects (e.g., see below for a discussion of the lack of power). In addition, we have included citations to the suggested work (Johnson & Tyler, 2010; Siegelman & Frost, 2015) regarding the potentially fragile nature of effects (pp. 30-31).

I have some reservations about the statistical analyses, and I think my reservations really stem from them being used to suss out effects that end up being pretty weak rather than analyzing just how weak the effects are. I would like to see some power analyses here to get a sense of what the null results are actually telling us.

We thank the reviewer for this suggestion. We have conducted a series of power analyses that investigate (1) post-hoc power given particular sample sizes and correlations, as well as (2) the sample size required to achieve power of 0.8 given a particular correlation strength. The results are presented below:

Analyses Estimating Sample Size Given Correlation and Desired Power
(statistical test: correlation; significance level = 0.05; desired power = 0.800)

Correlation   Required Sample Size
0.1           782.45
0.2           193.78
0.3            84.75
0.4            46.57
0.5            28.87

Post-hoc Analyses Analyzing Power Given Sample Size and Correlation
(statistical test: correlation; significance level = 0.05; cells show power)

Sample Size   r = 0.1   r = 0.2   r = 0.3   r = 0.4   r = 0.5
50            0.1027    0.2819    0.5639    0.8290    0.9655
55            0.1090    0.3071    0.6071    0.8647    0.9780
60            0.1153    0.3320    0.6471    0.8935    0.9861
65            0.1216    0.3564    0.6839    0.9167    0.9913
70            0.1279    0.3805    0.7175    0.9352    0.9946
75            0.1342    0.4041    0.7482    0.9498    0.9967
80            0.1404    0.4272    0.7760    0.9613    0.9980
85            0.1467    0.4498    0.8012    0.9703    0.9988
90            0.1530    0.4718    0.8239    0.9773    0.9993
95            0.1593    0.4933    0.8443    0.9827    0.9996
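For readers who wish to reproduce or extend these estimates, the following minimal sketch uses the standard Fisher z approximation for correlation power analysis. We note as an assumption that this approximation closely matches, though it may not be identical to, the procedure used to generate the values above (it reproduces them to within rounding, e.g., N ≈ 782 for r = 0.1).

    import numpy as np
    from scipy.stats import norm

    def n_for_correlation(r, power=0.80, alpha=0.05):
        """Approximate N needed to detect correlation r (two-tailed test),
        via the Fisher z transformation."""
        z_r = np.arctanh(r)                  # Fisher z of the target effect
        z_a = norm.ppf(1 - alpha / 2)        # two-tailed critical value
        z_b = norm.ppf(power)
        return ((z_a + z_b) / z_r) ** 2 + 3

    def power_for_n(r, n, alpha=0.05):
        """Approximate post-hoc power of a correlation test at sample size n."""
        z_r = np.arctanh(r)
        z_a = norm.ppf(1 - alpha / 2)
        return norm.cdf(np.sqrt(n - 3) * z_r - z_a)

    for r in (0.1, 0.2, 0.3, 0.4, 0.5):
        print(f"r = {r}: required N ~ {n_for_correlation(r):.1f}")
    print(f"Power at N = 50, r = 0.4: {power_for_n(0.4, 50):.3f}")  # ~0.83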

 

These results indicate that our studies would be reasonably well-powered to detect task-to-task correlations of ~0.5 (a reasonable assumption given that the tasks should purportedly index the same construct) but not small correlations. That is, our tasks were sufficiently powered to detect the kinds of correlations we anticipated. Lack of power may be an additional contributing factor, should real but small inter-task correlations exist. According to the estimates, a sample size of almost 800 would be necessary for power of .80, assuming inter-task correlations of 0.1. We have included two sentences in the general discussion that mention the possibility that lack of power might be a contributing factor, if the effects exist but are very weak: “If these features exert a strong influence on the mechanisms that are engaged during statistical learning, the subtle differences in the tasks might result in weak inter-language correlations, a possibility which is potentially consistent with prior research (e.g., Johnson & Tyler, 2010; Siegelman & Frost, 2016). Such weak correlations could be difficult to detect (a sample size of close to 800 would be necessary to detect a correlation of 0.1 with power of 0.8).” (pp. 30-31)

I also think that multiple-comparison corrections are appropriate in these analyses, especially on the cross-language correlations (e.g., Table 3) and the cross-time comparisons (e.g., Table 1), to account for the 2AFC/RS distinction, the four languages, and the two time points being analyzed mostly separately. I think this is especially important for understanding what it means that so few of the cross-language correlations are significant (and presumably even fewer would be significant with a multiple-comparison correction).

In addition, as described below in a response to another reviewer, although multiple-comparison corrections are important for drawing strong conclusions about results, they also increase the likelihood of missing fragile or weak effects. These effects, should they exist, are likely to be weak. Consequently, any advantages gained through the use of multiple-comparison corrections come at a high cost, particularly in the case of the exploratory analyses reported in the present research. The exploratory nature of the analyses and the lack of multiple-comparison corrections were already described in the introductory section of the appendix (p. 53).

However, we have further modified the appendix to include a note that corrections for multiple comparisons were not used and that many of these effects are likely small and would disappear if they were performed (p. 56).

It’s not critical, but estimating effect sizes would also be a nice addition, especially as the key question is how much of the performance variance is from test-retest unreliability, how much from task/measurement differences, and how much from differences in the cognitive mechanisms being used by the different languages/tasks. Effect sizes aren’t absolutely necessary because the authors aren’t trying to answer that question completely yet, but I think getting a sense of how relatively strong the different influences are would be a very useful addition.

We agree that eventually this question will be of interest. At the moment, however, we believe that it may be a bit premature and that it is important to strike a balance in the amount of information that is presented. We think it is better not to overwhelm readers with too many tables and graphs, given that the manuscript is already lengthy. That said, per Collabra’s policy, we will make the data available and interested readers will be free to explore these questions independently.

I’d also like to see a little bit of restructuring to make the problem and the conclusions clearer; the authors have some very nice statements of the problem and their conclusions at various points, especially in the General Discussion, and I’d like to see them moved forward a bit so that the reader can use them to understand the experiment design and analysis better. In particular, the problem tested in this paper is pretty complex, but the second sentence of the GD summarizes it nicely. I’d like to see the hypothesized sources of variance established like this in the background, and each statistical test tied directly to one (or more) of these sources of variance.

We have added a statement similar to the sentence referenced in the general discussion to the abstract, so that the idea is present immediately. In addition, the same idea was already expressed toward the end of the introduction: “Therefore, an important question that remains unaddressed is whether SL measures do not correlate with each other because SL is a much more fragmented construct than originally thought, or because there is something intrinsic to the tasks themselves (e.g., task demands or psychometric properties) that limit the ability to find correlations across measures. The research reported here was intended to take a step toward addressing this issue.” (p. 8)

Also, because the effects were so fragile in Experiment 1, I’m a bit uncomfortable with talking about “replicating” an effect in Experiment 2 (last para. before GD) because it assumes that these effects should be there, and there just isn’t that much evidence of them.

We have changed the phrasing of the text in question to remove the word “replicate” and to soften the potential expectation that the effects were real but not observed. It now reads:

“It is unclear why the learning effects that were observed in Experiment 1 were not found in Experiment 2, especially as performance at Time 1 was not at ceiling.” (p. 28)

I understand and agree with the claim in the GD (first full para. on pg 29) that due to the uncertainty about the quality of the measures, we should focus on measurement issues first before assuming that SL tasks really aren’t correlated. But I’d like to see a more agnostic framing overall: that the effects may or may not exist and we need to look into whether the measurement issues prevent us from determining whether they do, rather than presuming they exist unless a sufficiently reliable measure tells us otherwise. I think this fits with shifting the focus to how reliable the null effects are, and how much they’re shaped by the unreliability of the measurements.

We agree that it is important to consider the possibility that measurement issues may not be at the heart of the issue. This is a possibility we already mention in the manuscript (p. 32), where we note: “Either SL is a highly fragmentary construct (perhaps to the extent that we might wonder whether it is a useful theoretical construct at all), or the lack of correlation between SL tasks is the result of measurement issues. Given that there has been so little effort to develop and validate measures of SL, it seems most profitable to pursue the latter hypothesis first.” That said, we do think that it makes sense to explore measurement issues first, particularly given that despite these problems, a reasonable number of studies have found correlations between SL and various outcomes of interest, so SL appears at least somewhat promising as a measure of interest, as we argue in the second sentence. To soften the perception that we have not seriously considered the possibility that measurement issues are not at the heart of the null results, we have further modified the manuscript. An additional sentence after the text quoted above now acknowledges this: “However, it is also important to consider the possibility that construct error rather than measurement error is at the heart of these null results, and that SL may not be a useful theoretical construct in the context of individual differences approaches.”

Overall, I think this is important work. I’ve avoided designing research that depends on individual difference measures because of an impression that it has been somewhat inconsistent in the sorts of statistical learning and psycholinguistic tasks I’ve been interested in. I think the conclusions of this paper are tentative, but they represent an important step in laying out how unreliability in the measures of statistical learning limits our analyses of language learning, and they suggest that statistical learning is a far more fragmented construct than most people think. (On that last point, I know it would be speculative, but I’m very interested in the authors’ thoughts on what it means if statistical learning is so fragmented that people are at most weakly correlated even in super-similar segmentation tasks like those in Languages 1 and 3.)

We agree that these results are important, even if they may be somewhat deflationary to current approaches using individual differences to gain insight into statistical learning mechanisms. Although we can only speculate about what it means that even similar statistical learning languages show limited inter-correlations, we note that even though the languages chosen are more similar than the tasks used in previous assessments of the psychometric properties of statistical learning tasks, they still exhibit key differences (e.g., frequency balancing, word length). The perceptual features of the measures may be key in determining whether or not tasks are correlated. An important next step, prior to further speculation, will be to collect measures that are even more perceptually similar on the kinds of dimensions that are potentially important and to test whether languages correlate or fail to correlate with each other in a predictable manner.

One last thing: The appendix presents a potentially interesting experiment, but I thought that the lack of any real conclusions to be drawn from it made it a weak addition to the paper. It seems to me that, with this being exploratory, it’s even more important to perform multiple-comparison corrections than in the rest of the paper, and I’m curious if any of these results would hold then (aside from the Operation Span & Reading Span correlation). As with the core parts of the paper, the really interesting result would seem to be the lack of correlation between the measures and across the aspects of cognition. But to derive useful conclusions from the appendix, I think there have to be some sort of power analyses and multiple-comparison corrections. I’d rather see this as an aside or footnote in the main text stating that multiple cognitive measures were tested as an exploratory part of Experiment 2, but no significant relationships were found between the cognitive measures and the statistical learning performance after correcting for multiple comparisons. (Assuming that’s the final result, of course.)

We agree that it is difficult to draw strong conclusions from the data reported in the appendix. However, we respectfully feel that in the interest of transparency it is important to include as much information as possible. Because we did not wish these data to detract from the main story, we chose to relegate the results to an appendix rather than presenting them within the main body of the manuscript. In addition, as described previously in a response to another reviewer, although multiple-comparison corrections are important for drawing strong conclusions about results, they also increase the likelihood of missing fragile or weak effects. These effects, should they exist, are likely to be weak. Consequently, any advantages gained through the use of multiple-comparison corrections come at a high cost, particularly in the case of the exploratory analyses reported in the present research. The exploratory nature of the analyses and the lack of multiple-comparison corrections were already described in the introductory section of the appendix (p. 53).
However, we have further modified the appendix to include a note that corrections for multiple comparisons were not used and that many of these effects are likely fragile and would disappear if they were performed (p. 56).
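To illustrate what such a correction involves, here is a brief sketch of a Holm step-down correction applied to a set of p-values. The values shown are hypothetical and for illustration only; they are not the p-values from the appendix.

    from statsmodels.stats.multitest import multipletests

    # Hypothetical p-values, e.g., from a family of cross-language correlations.
    pvals = [0.001, 0.025, 0.005, 0.015, 0.21, 0.47]

    # Holm's procedure controls the familywise error rate and is uniformly
    # more powerful than a plain Bonferroni correction.
    reject, p_adj, _, _ = multipletests(pvals, alpha=0.05, method="holm")
    for p, pa, rej in zip(pvals, p_adj, reject):
        print(f"p = {p:.3f} -> adjusted p = {pa:.3f}, significant: {rej}")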

2) Figures/tables/data availability:
Please comment on the author’s use of tables, charts, figures, if relevant. Please acknowledge that adequate underlying data is available to ensure reproducibility (see open data policies per discipline of Collabra here).: Everything necessary seems to be in the tables and plots. I’d think about combining a couple of the tables, particularly 3 & 4, 5 & 6, and 8 & 9; while there are differences in the particular patterns of results between Time 1 and Time 2, I didn’t get the impression that the authors thought there was a particular theoretical reason to expect differences between the times, and a lot of this paper is about the ways that very similar measurements differ, so I would have found it easier having them side-by-side rather than having to flip back and forth.

Although we agree that there are potential advantages to combining some of the tables, we believe that space constraints make it challenging to do so in a way that maintains clarity of presentation. Thus, we believe that maintaining separate tables that are embedded near each other in the text, although not ideal, represents the best strategy for presenting the data in a clear and manageable way.

The authors also did a very clear and complete job in providing the descriptions of the stimuli for the four segmentation languages and the one artificial grammar language, which is critical to the points they are making.

We thank the reviewer for this comment, and agree that complete information about the details of individual statistical learning tasks is important and should be emphasized in the future as we attempt to disentangle the mechanism(s) underlying these paradigms.

3) Ethical approval:
If humans or animals have been used as research subjects, and/or tissue or field sampling, are the necessary statements of ethical approval by a relevant authority present? Where humans have participated in research, informed consent should also be declared. If not, please detail where you think a further ethics approval/statement/follow-up is required.: I’m not finding an explicit declaration of consent, so if that’s part of the Collabra requirements, I think it needs to be made explicit. But this is pretty standard psycholinguistic research, so I’m sure it’s been approved.

As mentioned in previous responses, our research was indeed conducted in compliance with the university’s ethical standards, although we failed to point this out. This has been corrected with the addition of the following sentence in the methods section: “Signed consent was obtained for all participants, and testing was conducted in accordance with the ethical standards established by the university’s Institutional Review Board.” (p. 16; p. 26)

4) Language:

Is the text well written and jargon free? Please comment on the quality of English and any need for improvement beyond the scope of this process.: The language of the article is great. I thought the lit review in the beginning was especially clear, and the goals and designs of the experiments are well explained, which is critical because the actual goals and designs are a little complicated. As I mentioned above, there are some things I’d move forward to better establish the structure.

We thank the reviewer for this comment, and have modified the manuscript in several ways (e.g., putting critical logic from the general discussion earlier; softening language regarding replicating effects from Experiment 1; acknowledging more directly the probable weakness of the effects observed in the exploratory analyses presented in the Appendix) to incorporate many of the suggestions made above.
