Written by Sebastian Rushworth M.D.: Health and medical information grounded in science, September 25, 2020
Considering how much misinformation is currently floating around in the area of health and medicine, I thought it might be useful to write an article about how to read and understand scientific studies, so that you can feel comfortable looking at first hand data yourselves and making your own minds up.
Anyone can carry out a study. There is no legal or formal requirement that you have a specific degree or educational background in order to perform a study. All the earliest scientists were hobbyists, who engaged in science in their spare time. Nowadays most studies are carried out by people with some formal training in scientific method. In the area of health and medicine, most studies are carried out by people who are MD’s and/or PhD’s, or people who are in the process of getting these qualifications.
If you want to perform a study on patients, you generally have to get approval from an ethical review board. Additionally, there is an ethical code of conduct that researchers are expected to stick to, known as the Declaration of Helsinki, which was first adopted in 1964 after it became clear that a lot of medical research that had been done up to that point was not very ethical (to put it mildly). The code isn’t legally binding, but if you don’t follow it, you will generally have trouble getting your research published in a serious medical journal.
The most important part of the Helsinki declaration is the requirement that participants be fully informed about the purpose of the study, and given an informed choice as to whether to take part or not. Additionally, participants have to be clearly informed that it is their right to drop out of a study at any point, without having to provide any reason for doing so.
The bigger and higher quality a scientific study is, the more expensive it is. This means that most big, high quality studies are carried out by pharmaceutical companies. Obviously, this is a problem, because the companies have a vested interest in making their products look good. And when companies carry out studies that don’t show their drugs in the best light, they will usually try to bury the data. When they carry out studies that show good results, however, they will try to maximize the attention paid to them.
This contributes to a problem known as publication bias. What publication bias means is that studies which show good effect are much more likely to get published than studies which show no effect. This is both because the people who did the study are more likely to push for it to be published, and because journals are more likely to accept studies that show benefit (because those studies get much more attention than studies that don’t show benefit).
So, one thing to be aware of before you start searching for scientific studies in a field is that the studies you can find on a topic often aren’t all the studies. You are most likely to find the studies that show the strongest effect. The effect of an intervention in the published literature is pretty much always bigger than the effect subsequently seen in the real world. This is one reason why I am skeptical of drugs, like statins, that show an extremely small benefit even in the studies produced by the drug companies themselves.
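The way publication bias inflates the apparent effect can be illustrated with a small simulation. The sketch below is a toy model, not a description of any real trial: the true effect size, the noise level, the number of participants per study, and the rule that only studies with a "significant" positive result get published are all assumptions chosen for illustration.

```python
import random
import statistics

NUMBER_OF_STUDIES = 2000  # hypothetical number of studies run on the same question

def simulate_study(true_effect, participants, rng):
    """Return the observed mean effect and whether it looks 'significant'."""
    # Each participant's outcome is the true effect plus random noise (SD 1).
    outcomes = [rng.gauss(true_effect, 1.0) for _ in range(participants)]
    mean = statistics.fmean(outcomes)
    standard_error = statistics.stdev(outcomes) / participants ** 0.5
    # Rough rule: "significant" if the mean lies more than 1.96 standard errors above zero.
    return mean, mean / standard_error > 1.96

def compare_published_to_all(true_effect=0.1, participants=50, seed=1):
    rng = random.Random(seed)
    results = [simulate_study(true_effect, participants, rng)
               for _ in range(NUMBER_OF_STUDIES)]
    all_means = [mean for mean, _ in results]
    # Assumption: only the "significant" studies make it into a journal.
    published = [mean for mean, significant in results if significant]
    return statistics.fmean(all_means), statistics.fmean(published), len(published)

average_all, average_published, published_count = compare_published_to_all()
print(f"Average effect across all studies:       {average_all:.2f}")
print(f"Average effect across published studies: {average_published:.2f}")
print(f"Studies published: {published_count} of {NUMBER_OF_STUDIES}")
```

In this toy model the average across all studies lands close to the true effect, while the average across the "published" subset is noticeably larger, simply because of which studies were selected.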
There have been efforts in recent years to mitigate this problem. One such effort is the site clinicaltrials.gov. Researchers are expected to post details of their planned study on clinicaltrials.gov in advance of beginning recruitment of participants. This makes it harder to bury studies that subsequently don’t show the wanted results.
Most serious journals have now committed to only publish studies that have been listed on clinicaltrials.gov prior to starting recruitment of participants, which gives the pharmaceutical companies a strong incentive to post their studies there. This is a hugely positive development, since it makes it a little bit harder for the pharmaceutical companies to hide studies that didn’t go as planned.
Once a study is finished, the researchers will usually try to get it published in a peer-reviewed journal. The first scientists, back when modern science was being invented in the 1600’s, mostly wrote books in which they described what they had done and what results they had achieved. Then, after a while, scientific societies started to pop up, and started to produce journals. Gradually science moved from books to journal articles. In the 1700’s the journals started to incorporate the concept of peer-review as a means to ensure quality.
As you can see, journals are an artifact of history. There is actually no technical reason why studies still need to be published in journals in a time when most reading is done on digital devices. It is possible that the journals will disappear with time, to be replaced by on-line science databases.
In recent years, there has been an explosion in the popularity of “pre-print servers”, where scientists can post their studies while waiting to get them into journals. When it comes to medicine, the most popular such server is medRxiv. The main problem with journals is that they charge money for access, and I think most people will agree that scientific knowledge should not be owned by the journals; it should be the public property of humankind.
Peer-review provides a sort of stamp of approval, although it is questionable how much that stamp is worth. Basically, peer-review means that someone who is considered an expert on the subject of the article (but who wasn’t personally involved with it in any way) reads through the article and determines if it is sensible and worth publishing.
Generally the position of peer-reviewer is an unpaid position, and the person engaging in peer-review does it in his or her spare time. He or she might spend an hour or so going through the article before deciding whether it deserves to be published or not. Clearly, this is not a very high bar. Even the most respected journals have published plenty of bad studies containing manipulated and fake data because they didn’t put much effort into making sure the data was correct. As an example, the early part of the covid pandemic saw a ton of bad studies which had to be retracted just a few weeks or months after publication because the data wasn’t properly fact checked before publication.
If the peer reviewer at one journal says no to a scientific study, the researchers will generally move on to another, less prestigious journal, and will keep going like that until they can get the study published. There are so many journals that everything gets published somewhere in the end, no matter how poor the quality.
The whole system of peer-review builds on trust. The guiding principle is the idea that bad studies will be caught out over the long term, because when other people try to replicate the results, they won’t be able to.
There are two big problems with this line of thinking. The first is that scientific studies are expensive, so they often don’t get replicated, especially if they are big studies of drugs. For the most part, no-one but the drug company itself has the cash resources to do a follow-up study to make sure that the results are reliable. And if the drug company has done one study which shows a good effect, it won’t want to risk doing a second study that might show a weaker effect.
The second problem is that follow-up studies aren’t exciting. Being first is cool, and generates lots of media attention. Being second is boring. No-one cares about the people who re-did a study and determined that the results actually held up to scrutiny.
Different types of evidence
In medical science, there are a number of “tiers” of data. The higher tier generally trumps the lower tier, because it is by its nature of higher quality. This means that one good quality randomized controlled trial trumps a hundred observational studies.
The lowest quality type of evidence is anecdote. In medicine this often takes the form of “case reports”, which detail a single interesting case, or “case series”, which detail a few interesting cases. An example could be a case report of someone who developed a rare complication, say baldness, after taking a certain drug.
Anecdotal evidence can generate hypotheses for further research, but it can never say anything about causation. If you take a drug and you lose all your hair a few days later, that could have been caused by the drug, but it could also have been caused by a number of other things. It might well just be coincidence.
After anecdote, we have observational studies. These are studies which take a population and follow it to see what happens to it over time. Usually, this type of study is referred to as a “cohort study”, and often, there will be two cohorts that differ in some significant way.
For example, an observational study might be carried out to figure out the long term effects of smoking. Ideally, you want a group that doesn’t smoke to compare with. So you find 5,000 smokers and 5,000 non-smokers. Since you want to know what the effect of smoking is specifically, you try to make sure that the two cohorts are as similar as possible in all other respects. You do this by making sure that both populations are around the same age, weigh as much, exercise as much, and have similar dietary habits. The purpose of this is to decrease confounding effects.
Confounding is when something that you’re not studying interferes with the thing that you are studying. So, for example, people who smoke might also be less likely to exercise. If you then find that smokers are more likely to develop lung cancer, is it because of the smoking or the lack of exercise? If the two groups vary in some way with regards to exercise, it’s impossible to say for certain. This is why observational studies can never answer the question of causation. They can only ever show a correlation.
This is extremely important to be aware of, because observational studies are constantly being touted in the media as showing that this causes that. For example a tabloid article might claim that a vegetarian diet causes you to live longer, based on an observational study. But observational studies can never answer questions of causation. Observational studies can and should do their best to minimize confounding effects, but they can never get rid of them completely.
The highest tier of evidence is the Randomized Controlled Trial (RCT). In an RCT, you take a group of people, and you randomly select who goes in the intervention group, and who goes in the control group.
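Why random allocation helps with confounding can be sketched in a few lines of Python. Everything here is hypothetical: the population, the 30% share of people who exercise, and the helper names `randomize` and `share_who_exercise` are invented for illustration. The point is only that when chance decides who goes where, a characteristic such as exercise habits tends to end up evenly spread between the two groups, without anyone having to measure it or match on it.

```python
import random

def randomize(participants, rng):
    """Randomly split participants into an intervention and a control group."""
    shuffled = list(participants)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

def share_who_exercise(group):
    """Fraction of a group that exercises regularly."""
    return sum(person["exercises"] for person in group) / len(group)

rng = random.Random(42)
# Hypothetical population in which roughly 30% of people exercise regularly.
population = [{"id": i, "exercises": rng.random() < 0.3} for i in range(10_000)]
intervention, control = randomize(population, rng)

print(f"Exercisers in intervention group: {share_who_exercise(intervention):.1%}")
print(f"Exercisers in control group:      {share_who_exercise(control):.1%}")
```

With groups this large the two percentages should come out very close to each other. With small groups, chance imbalances are more likely, which is one reason the size of a trial matters.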
The people in the control group should ideally get a placebo that is indistinguishable from the intervention. The reason this is important is that the placebo effect is strong. It isn’t uncommon for the placebo effect to contribute more to a drug’s perceived effect than the real effect caused by the drug. Without a control group that gets a placebo it’s impossible to know how much of the perceived benefit from a drug actually comes from the drug itself.
In order for an RCT to get full marks for quality, it needs to be double-blind. This means that neither the participants nor the members of the research team who interact with the participants know who is in which group. This is as important as having a placebo, because if people know they are getting the real intervention, they will behave differently compared to if they know they are getting the placebo. Also, the researchers performing the study might act differently towards the intervention group and the control group in ways that influence the results, if they know who is in which group. If a study isn’t blinded, it is known as an “open label” study.
So, why does anyone bother with observational studies at all? Why not always just do RCT’s? For three reasons. Firstly, RCT’s take a lot of work to organize. Secondly, RCT’s are expensive to run. Thirdly, people aren’t willing to be randomized to a lot of interventions. For example, few people would be willing to be randomized to smoking or not smoking.
There are those who would say that there is another, higher quality form of evidence, above the randomized controlled trial, and that is the systematic review and meta-analysis. This statement is both true, and not true. The systematic review is a review of all studies that have been carried out on a topic. As the name suggests, the review is “systematic”, i.e. a clearly defined method is used to search for studies. This is important, because it allows others to replicate the search strategy, to see if the reviewers have consciously left out certain studies they didn’t like, in order to influence the results in some direction.
The meta-analysis is a systematic review that has gone a step further, and tried to combine the results of several studies into a single “meta”-study, in order to get a higher amount of statistical power.
The reason I say it’s both true and not true that this final tier is higher quality than the RCT is that the quality of systematic reviews and meta-analyses depends entirely on the quality of the studies that are included. I would rather take one large high quality RCT than a meta-analysis done of a hundred observational studies. An adage to remember when it comes to meta-analyses is “garbage in, garbage out” – a meta-analysis is only as good as the studies it includes.
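To give a feel for what “combining” studies means in practice, here is a minimal sketch of one common pooling approach, fixed-effect inverse-variance weighting. This is my choice of method for illustration, not the only way meta-analyses are done, and the three studies and their numbers are invented.

```python
def pool_fixed_effect(estimates):
    """Combine (effect, standard_error) pairs using inverse-variance weights."""
    # More precise studies (smaller standard error) get a larger weight.
    weights = [1 / standard_error ** 2 for _, standard_error in estimates]
    weighted_sum = sum(weight * effect
                       for weight, (effect, _) in zip(weights, estimates))
    pooled_effect = weighted_sum / sum(weights)
    pooled_standard_error = (1 / sum(weights)) ** 0.5
    return pooled_effect, pooled_standard_error

# Hypothetical studies: (estimated effect, standard error).
studies = [(0.30, 0.20), (0.10, 0.10), (0.25, 0.25)]
pooled_effect, pooled_se = pool_fixed_effect(studies)
print(f"Pooled effect: {pooled_effect:.3f} (standard error {pooled_se:.3f})")
```

The pooled standard error is smaller than that of any single study, which is where the extra statistical power comes from. Note that the calculation is pure arithmetic on the numbers fed into it: if the included studies are biased, the pooled result is just a more precise version of the same bias.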
There is one thing I haven’t mentioned so far, and that is animal studies. Generally, animal studies will take the form of RCT’s. There are a few advantages to animal studies. You can do things to animals that you would never be allowed to do to humans, and an RCT with animals is much cheaper than an RCT with humans.
When it comes to drugs, there is in most countries a legal requirement that they be tested on animals before being tested on humans. The main problem with animal studies is several million years of evolution. Most animal studies are done in rats and mice, which are separated from us by over fifty million years of evolution, but even our closest relatives, chimps, are about six million years away from us evolutionarily. It is very common for studies to show one thing in animals, and something completely different when done in humans. For example, studies of fever lowering drugs done in animals find a seriously increased risk of dying of infection, but studies in humans don’t find any increased risk. Animal studies always need to be taken with a big grain of salt.
One very important concept when analyzing studies is the idea of statistical significance. In medicine, a result is considered “statistically significant” if the “p-value” is less than 0,05 (p stands for probability).
This gets a little bit complicated, but please bear with me. To put it as simply as possible, the p-value is the probability of seeing a result at least as extreme as the one observed if the null hypothesis is true. (The null hypothesis is the default assumption that the researchers are trying to disprove. In medicine the null hypothesis is usually the hypothesis that an intervention doesn’t work, for example that statins don’t decrease mortality).
So a p-value of 0,05 means that, if the null hypothesis were true, there would be a 5% chance of seeing a result at least as extreme as the one observed. It is not the probability that the null hypothesis is true.
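As a concrete illustration of what a p-value measures, consider a deliberately simple, non-medical example: the null hypothesis is that a coin is fair, and we observe 60 heads in 100 flips. The sketch below (the function name and the numbers are my own, chosen for illustration) adds up the probability of getting 60 or more heads if the coin really is fair. Python writes decimals with a point rather than a comma.

```python
from math import comb

def one_sided_p_value(successes, trials, null_probability=0.5):
    """Probability of at least `successes` out of `trials` if the null hypothesis is true."""
    return sum(
        comb(trials, k)
        * null_probability ** k
        * (1 - null_probability) ** (trials - k)
        for k in range(successes, trials + 1)
    )

p = one_sided_p_value(60, 100)
print(f"p = {p:.3f}")
print("Below the 0.05 cut-off" if p < 0.05 else "Not below the 0.05 cut-off")
```

The result says how surprising the data would be under the null hypothesis. It says nothing directly about how likely the coin is to be unfair.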
One thing to understand is that 5% is an entirely arbitrary cut-off. The number was chosen in the early twentieth century, and it has stuck. And it leads to a lot of crazy interpretations. If a p-value is 0,049 the researchers who have carried out a study will frequently rejoice, because the result is statistically significant. If the p-value is on the other hand 0,051, then the result will be considered a failure. Anyone can see that this is ridiculous, because there is actually only a 0,002 (0,2%) difference between the two results, and one is really no more statistically significant than the other.
Personally, I think a p-value of 0,05 is a bit too generous. I would have much preferred it if the standard cut-off had been set at 0,01, and I am skeptical of results that show a p-value greater than 0,01. What gets me really excited is when I see a p-value of less than 0,001.
It is especially important to be skeptical of p-values that are higher than 0,01 considering the other things we know about medical science. Firstly, that there is a strong publication bias, which causes studies that don’t show statistical significance to “disappear” at a higher rate than studies that do show statistical significance. Secondly, that studies are often carried out by people with a vested interest in the result, who will do what they can to get the result they want. And thirdly, because the 0,05 cut-off is used inappropriately all the time, for a reason we will now discuss.
The 0,05 limit is only really supposed to apply when you’re looking at a single relationship. If you look at twenty different relationships at the same time, then on average one of those relationships will show statistical significance just by pure chance, even if none of them is real. Is that relationship real? Almost certainly not.
The more variables you look at, the more strictly you should set the limit for statistical significance. But very few studies in medicine do this. They happily report statistical significance with a p-value of 0,05, and act like they’ve shown some meaningful result, even when they look at a hundred different variables. That is bad science, but even big studies, published in prestigious journals, do this.
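The arithmetic behind this is short enough to sketch. The code below assumes that the comparisons are independent of each other and that none of the effects is real. The stricter cut-off it computes is the Bonferroni correction, which is one simple and fairly conservative way of tightening the limit; it is my example, not necessarily the method any particular study should use.

```python
def chance_of_false_positive(comparisons, cutoff=0.05):
    """Probability that at least one of several independent comparisons
    crosses the cut-off purely by chance, when no real effect exists."""
    return 1 - (1 - cutoff) ** comparisons

def bonferroni_cutoff(comparisons, cutoff=0.05):
    """Stricter per-comparison cut-off that keeps the overall
    false positive rate at or below `cutoff`."""
    return cutoff / comparisons

for comparisons in (1, 20, 100):
    risk = chance_of_false_positive(comparisons)
    strict = bonferroni_cutoff(comparisons)
    print(f"{comparisons:>3} comparisons: "
          f"{risk:.1%} chance of a false positive, "
          f"corrected cut-off {strict:.4f}")
```

With twenty comparisons, the chance of at least one spurious “significant” finding is roughly 64%, and with a hundred comparisons it is close to certain.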
That is why researchers are supposed to decide on a “primary end-point” and ideally post that primary end-point on clinicaltrials.gov before they start their study. The primary end-point is the question that the researchers are mainly trying to answer (for example, do statins decrease overall mortality?). Then they can use the 0,05 cut-off for the primary endpoint without cheating. They will usually report any other results as if the 0,05 cut-off applies to them too, but it doesn’t.
The reason researchers are supposed to post the primary endpoint at clinicaltrials.gov before starting a trial is that they can otherwise choose the endpoint that ends up being most statistically significant just by chance, after they have all the results, and make that the primary endpoint. That is of course a form of statistical cheating. But it has happened, many times. Which is why clinicaltrials.gov is so important.
One thing to be aware of is that a large share of studies cannot be successfully replicated. Some studies have found that more than 50% of research cannot be replicated. That is in spite of a cut-off which is supposed to keep chance findings down to 5% of the time. How can that be?
I think the three main reasons are publication bias, vested interests that do what they can to manipulate studies, and inappropriate use of the 5% p-value cut-off. That is why we should never put too much trust in a result that has not been replicated.
Absolute risk vs relative risk
We’ve discussed statistical significance a lot now, but that isn’t really what matters to patients. What patients care about is “clinical significance”, i.e. if they take a drug, will it have a meaningful impact for them. Clinical significance is closely tied to the concepts of absolute risk and relative risk.
Let’s say we have a drug that decreases your five year risk of having a heart attack from 0,2% to 0,1%. We’ll invent a random name for the drug, say, “spatin”. Now, the absolute risk reduction when you take a spatin is 0,1% over five years (0,2 – 0,1 = 0,1). Not very impressive, right? Would you think it was worth taking that drug? Probably not.
What if I told you that spatins actually decreased your risk of heart attack by 50%? Now you’d definitely want to take the drug, right?
How can a spatin only decrease risk by 0,1% and yet at the same time decrease risk by 50%? Because the risk reduction depends on if we are looking at absolute risk or relative risk. Although spatins only cause a 0,1% reduction in absolute risk, they cause a 50% reduction in relative risk (0,1 / 0,2 = 50%).
So, you get the absolute risk reduction by taking the risk without the drug and subtracting the risk with the drug. You get the relative risk reduction by dividing the absolute risk reduction by the risk without the drug. Drug companies will generally focus on relative risk, because it sounds much more impressive. But the clinical significance of a drug that decreases risk from 0,2% to 0,1% is, I would argue, so small that it’s not worth taking the drug, especially if the drug has side effects which might be more common than the probability of seeing a benefit.
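The two calculations can be written out as a short sketch, using the invented spatin numbers from above plus a second made-up example. Risks are expressed as fractions (0.002 means 0,2%), and the function names are my own.

```python
def absolute_risk_reduction(risk_without, risk_with):
    """Risk without the drug minus risk with the drug."""
    return risk_without - risk_with

def relative_risk_reduction(risk_without, risk_with):
    """Absolute risk reduction as a share of the risk without the drug."""
    return (risk_without - risk_with) / risk_without

# The invented "spatin" example: five year risk falls from 0.2% to 0.1%.
arr = absolute_risk_reduction(0.002, 0.001)
rrr = relative_risk_reduction(0.002, 0.001)
print(f"Spatin: absolute {arr:.1%}, relative {rrr:.0%}")

# A second made-up drug: risk falls from 0.4% to 0.1%.
arr_other = absolute_risk_reduction(0.004, 0.001)
rrr_other = relative_risk_reduction(0.004, 0.001)
print(f"Other:  absolute {arr_other:.1%}, relative {rrr_other:.0%}")
```

In both cases the relative figure looks dramatic while the absolute figure stays a fraction of a percent, which is why it matters which of the two an advertisement is quoting.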
When you look at an advertisement for a drug, always look at the fine print. Are they talking about absolute risk or relative risk?
How a journal article is organized
In the last few decades, a standardized format has developed for how scientific articles are supposed to be written. Articles are generally divided into four sections.
The first section is the “Introduction”. In this section, the researchers are supposed to discuss the wider literature around the topic of their study, and how their study fits in with that wider literature. This section is mostly fluff, and you can usually skip through it.
The second section is the “Method”. This is an important section and you should always read it carefully. It describes what the researchers did and how they did it. Pay careful attention to what the study groups were, what the intervention was, what the control was. Was the study blinded or not? And if it was, how did they ensure that the blinding was maintained? Generally, the higher quality a scientific study, the more specific the researchers will be about exactly what they’ve done and how. If they’re not being specific, what are they trying to hide? Try to see if they’ve done anything that doesn’t make sense, and ask yourself why. If any manipulation is happening to make you think you’re seeing one thing when you’re actually seeing something else, it usually happens in the method section.
There are a few methodological tricks that are very common in scientific studies. One is choosing surrogate end points and another is choosing combined end points. I will use statins to exemplify each, since there has been so much methodological trickery in the statin research.
Surrogate end points are alternate endpoints that “stand in” for the thing that actually matters to patients. An example of a surrogate end point is looking at whether a drug lowers LDL cholesterol instead of looking at the thing that actually matters, overall mortality. The use of the surrogate end point in this case is motivated by the cholesterol hypothesis, i.e. the idea that cholesterol lowering drugs lower LDL, which results in a decrease in cardiovascular disease, which results in increased longevity.
By using a surrogate end point, researchers can claim that the drug is successful when they have in fact shown no such thing. As we’ve discussed previously, the cholesterol hypothesis is nonsense, so showing that a drug lowers LDL cholesterol does not say anything about whether it does anything clinically useful.
Another example of a surrogate endpoint is looking at cardiovascular mortality instead of overall mortality. People don’t usually care about which cause of death is listed on their death certificate. What they care about is whether they are alive or dead. It is perfectly possible for a drug to decrease cardiovascular mortality while at the same time increasing overall mortality, so overall mortality is the only thing that matters (at least if the purpose of a drug is to make you live longer).
An example of a combined end point is looking at the combination of overall mortality and frequency of cardiac stenting. Basically, when you have a combined end point, you add two or more end points together to get a bigger total amount of events.
Now, cardiac stenting is a decision made by a doctor. It is not a hard patient oriented outcome. A study might show that there is a statistically significant decrease in the combined end point of overall mortality and cardiac stenting, which most people will interpret as a decrease in mortality, without ever looking more closely to see if the decrease was actually in mortality, or stenting, or a combination of both. In fact, it’s perfectly possible for overall mortality to increase and still have a combined endpoint that shows a decrease.
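A made-up example shows how this can happen. The numbers below are invented for a hypothetical trial with 1,000 participants in each arm, and for simplicity I assume that no participant is counted for both events.

```python
def combined_events(arm):
    """Total number of events counted towards the combined end point."""
    return arm["deaths"] + arm["stentings"]

# Hypothetical results per 1,000 participants; all numbers are invented.
placebo = {"deaths": 20, "stentings": 60}
drug = {"deaths": 25, "stentings": 35}

print("Combined end point:", combined_events(placebo), "->", combined_events(drug))
print("Deaths alone:      ", placebo["deaths"], "->", drug["deaths"])
```

The combined end point falls by a quarter, which makes for a good headline, yet more people died in the drug arm. The whole apparent benefit comes from fewer stenting procedures.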
Another trick is choosing which specific adverse events to follow, or not following any adverse events at all. Adverse events is just another word for side effects. Obviously, if you don’t look for side effects, you won’t find them.
Yet another trick is doing a “per-protocol analysis”. When you do a per-protocol analysis, you only include the results from the people who followed the study through to the end. This means that anyone who dropped out of the study because the treatment wasn’t having any effect or because they had side effects doesn’t get included in the results. Obviously, this will make a treatment look better and safer than it really is.
The alternative to a per-protocol analysis is an “intention to treat” analysis. In this analysis, everyone who started the study is included in the final results, regardless of whether they dropped out or not. This gives a much more accurate understanding of what results can be expected when a patient starts a treatment, and should be standard for all scientific studies in health and medicine. Unfortunately per-protocol analyses are still common, so always be vigilant as to whether the results are being presented in a per-protocol or intention to treat manner.
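The difference between the two analyses is easy to see with invented numbers. In the sketch below, a hypothetical treatment arm has 100 participants: 60 complete the study and improve, 10 complete it without improving, and 30 drop out. As a simplifying assumption, the dropouts are counted as not improved in the intention to treat analysis.

```python
def success_rate(participants, per_protocol):
    """Share of participants who improved, under either type of analysis."""
    if per_protocol:
        # Per-protocol: only people who completed the study are counted.
        participants = [p for p in participants if p["completed"]]
    # Intention to treat: everyone who started the study is counted.
    return sum(p["improved"] for p in participants) / len(participants)

# Hypothetical treatment arm; all numbers are invented.
treatment_arm = (
    [{"completed": True, "improved": True}] * 60
    + [{"completed": True, "improved": False}] * 10
    + [{"completed": False, "improved": False}] * 30
)

per_protocol_rate = success_rate(treatment_arm, per_protocol=True)
intention_to_treat_rate = success_rate(treatment_arm, per_protocol=False)
print(f"Per-protocol:       {per_protocol_rate:.0%}")
print(f"Intention to treat: {intention_to_treat_rate:.0%}")
```

The same trial yields a success rate of about 86% in one analysis and 60% in the other, purely depending on whether the 30 dropouts are counted.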
The third section of a scientific article is the results section, and this is the section that everyone cares most about. This is just a pure tabulation of what results were achieved, and as such it is the least open to manipulation, assuming the researchers haven’t faked the numbers. Faking results has happened, and it’s something to be aware of and watch out for. But in general we have to assume that researchers are being honest. Otherwise the whole basis for evidence based medicine cracks and we might as well give up and go home.
To be fair, I think most researchers are honest. And I think even pharmaceutical companies will in general represent the results honestly (because it would be too destructive for their reputations if they were caught outright inventing data). Pharmaceutical companies engage in lots of trickery when it comes to the method and in the interpretation of the results, but I think it’s uncommon for them to engage in outright lying when it comes to the hard data presented in the results tables.
There is however one blatant manipulation of the results that happens frequently. I am talking about cherry picking of the time point at which a scientific study is ended. This can happen when researchers are allowed to check the results of their study while it is still ongoing. If the results are promising, they will often choose to stop the study at that point, and claim that the results were “so good that it would have been unethical to go on”. The problem is that the results become garbage from a statistical standpoint. Why?
Because of a statistical phenomenon known as “regression to the mean”. Basically, the longer a scientific study goes on for and the more data points that end up being gathered, the closer the result of the study is to the real result. Early on in a study, the results will often swing wildly just due to statistical chance. So studies will tend to show bigger effects early on, and smaller effects towards the end.
This problem is compounded by the fact that if a study at an early point shows a negative result, or a neutral result, or even a result that is positive but not “positive enough”, the researchers will usually continue the study in hopes of getting a better result. But the moment the result goes above a certain point, they stop the study and claim excellent benefit from their treatment.
That is how the time point at which a study is stopped ends up being cherry picked. Which is why the planned length of a study should always be posted in advance on clinicaltrials.gov, and why researchers should always stick to the planned length, and never look at the results until the study has gone on for the planned length. If a study is stopped early at a time point of the researchers’ choosing, the results are not statistically sound no matter what the p-values may show. Never trust the results of a study that stopped early.
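The effect of repeatedly checking the results can be demonstrated with a simulation. The sketch below is a toy model under stated assumptions: the treatment has no effect at all, each measurement is random noise with a known standard deviation of 1, the researchers look at the data 20 times, and “significant” means the result lies more than 1.96 standard errors from zero (the usual two-sided 5% cut-off). All names and settings are my own.

```python
import random

def run_trial(rng, looks=20, batch=25, z_cutoff=1.96):
    """Simulate one trial of a treatment with NO real effect.

    Returns (significant_at_any_look, significant_at_planned_end).
    """
    total = 0.0
    n = 0
    significant_at_any_look = False
    z = 0.0
    for _ in range(looks):
        for _ in range(batch):
            total += rng.gauss(0.0, 1.0)  # pure noise: the true effect is zero
            n += 1
        z = total / n ** 0.5  # mean divided by its standard error (SD known to be 1)
        if abs(z) > z_cutoff:
            significant_at_any_look = True
    return significant_at_any_look, abs(z) > z_cutoff

def false_positive_rates(trials=1000, seed=7):
    rng = random.Random(seed)
    outcomes = [run_trial(rng) for _ in range(trials)]
    peeking = sum(any_look for any_look, _ in outcomes) / trials
    planned = sum(at_end for _, at_end in outcomes) / trials
    return peeking, planned

peeking_rate, planned_rate = false_positive_rates()
print(f"False positives if you stop at the first significant look: {peeking_rate:.1%}")
print(f"False positives if you only analyze at the planned end:    {planned_rate:.1%}")
```

Analyzing only at the planned end should give a false positive rate in the neighbourhood of 5%, while stopping at the first “significant” look produces false positives several times as often, even though the treatment does nothing.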
The fourth section of a scientific article is the discussion section, and like the introduction section it can mostly be skipped through. Considering how competitive the scientific research field is, and how much money is often at stake, researchers will use the discussion section to try to sell the importance of their research, and if they are selling a drug, to make the drug sound as good as possible.
At the bottom of an article, there will generally be a small section (in smaller print than the rest of the study) that details who funded the study, and what conflicts of interest there are. In my opinion, this information should be provided in large, bright orange text at the top of the article, because the rest of the article should always be read in light of who did the study and what motives they had for doing it.
In conclusion, focus on the method section and the results section. The introduction section and the discussion section can for the most part be ignored.
My main take-home is that you should always be skeptical. Never trust a result just because it comes from a scientific study. Most scientific studies are low quality and contribute nothing to the advancement of human knowledge. Always look at the method used. Always look at who funded the study and what conflicts of interest there were.
I hope this article is useful to you. Please let me know if there are more things in terms of scientific methodology that you have been wondering about. I will try to make this article a living document that grows over time.