Effect sizes from uncontrolled studies have all sorts of confounding factors. For instance, keen teachers often sign up to deliver an intervention whereas the less enthusiastic ones are left in the control. Teachers in the intervention group also know that they are part of an intervention, and often so do their students. This creates a potential placebo effect, where positive expectations about an intervention become a self-fulfilling prophecy. It is for this reason that Hattie chooses a cut-off of d=0.4 for effect sizes. However, this number is quite arbitrary and he applies it equally to both well-controlled trials – such as Sweller’s trials of worked examples – and the standard kind that I have described. He even applies it equally to time-based effects (comparing performance before an intervention with afterwards) and group-based effects (comparing an intervention group with a control).
There are other problems, such as the fact that when the test subjects are quite homogeneous you are likely to generate larger effect sizes. So, if you are testing in a selective school or with a group of engineering undergraduates, your standard deviation is likely to be small relative to the improvement in mean scores. Given that the effect size is the latter divided by the former, you are going to get a big one.
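To make the arithmetic concrete, here is a minimal sketch of the standard calculation (Cohen's d: the difference in means divided by the pooled standard deviation), applied to two hypothetical data sets I have invented for illustration. Both show the same five-point gain, but the homogeneous "selective school" scores produce a far larger effect size than the more varied ones:

```python
from statistics import mean, stdev

def cohens_d(treatment, control):
    """Cohen's d: difference in means divided by the pooled standard deviation."""
    n1, n2 = len(treatment), len(control)
    s1, s2 = stdev(treatment), stdev(control)
    # Pooled SD weights each group's variance by its degrees of freedom
    pooled_sd = (((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)) ** 0.5
    return (mean(treatment) - mean(control)) / pooled_sd

# Hypothetical scores: a varied cohort versus a homogeneous (selective) one.
# Both pairs differ by exactly 5 marks on average.
varied_control   = [50, 60, 70, 80, 90]
varied_treatment = [55, 65, 75, 85, 95]
narrow_control   = [68, 69, 70, 71, 72]
narrow_treatment = [73, 74, 75, 76, 77]

print(round(cohens_d(varied_treatment, varied_control), 2))  # ~0.32: below Hattie's 0.4
print(round(cohens_d(narrow_treatment, narrow_control), 2))  # ~3.16: enormous
```

The intervention is identical in both cases; only the spread of the students' scores differs, yet one result falls below Hattie's cut-off while the other dwarfs it.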
Unlike some, I don’t think that this renders effect sizes completely useless – they are a sincere attempt to enable effects to be compared across study designs – but we do need to bear in mind their limitations.