Sunday, April 2, 2017

Medical Science and Statistics


One thing that most people don't understand is that most medical science is very weak from a statistical and mathematical point of view. 

Another thing that many doctors don't understand at all is the concept of the Predictive Value of a test result.

Let's go through both here.

The medical journals are voluminous. There are many studies done, mostly in a fairly uniform format (such as Title, Abstract, Methods, Results, Conclusion). There is a ton of complicated language that is familiar only to the professors who read and write these studies.

The biggest problem with medical studies is bias, and it comes in many forms. Most studies have some. If scientists wish to prove something, they will tend to design a study that proves their point. It is rare for a study to be free of bias, because a truly neutral scientist is often not motivated to produce a study and article at all. Pharmaceutical companies are biased because they want to make money, and meds are worth billions of dollars.

Here is a publication that describes much of the bias in medical science: 

http://fhs.mcmaster.ca/surgery/documents/HandoutGrimesAssociationforResearch2of07Oct2009.pdf

The same article is here: 

http://thelancet.com/journals/lancet/article/PIIS0140-6736(02)07451-2/abstract

The best way to avoid bias is to gather a study group of people and randomize them; this produces the least selection bias. The two or more groups are then treated differently, and someone analyzes whether the outcomes differ. If the study is "double blinded and randomized," there is strong evidence that the different treatments created the different outcomes, and the knowledge base of the human race has increased.
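If you like to see things concretely, here is a minimal sketch of simple randomization in Python, with 100 made-up participants split into two equal arms. Real trials often use fancier schemes, like blocked or stratified randomization, but the principle is the same:

```python
# A minimal sketch of simple randomization into two trial arms.
# The participant list is hypothetical; real trials often use blocked
# or stratified randomization, but the principle is identical.
import random

participants = [f"subject_{i:03d}" for i in range(1, 101)]  # 100 made-up subjects
random.shuffle(participants)        # chance, not the researcher, picks the groups

treatment_arm = participants[:50]   # one half gets the treatment
control_arm = participants[50:]     # the other half gets the placebo

print(len(treatment_arm), len(control_arm))  # 50 50
```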

But even the best randomized trials still have bias, and one of the strongest is publication bias. You see, studies are designed around a "p value" threshold of 5 percent. A p value of 0.05 means that, if there were truly no effect, there would be only a 5 percent chance of seeing a result this strong by random variation. This kind of p value is easy to calculate for people who are trained in this kind of statistical work, and there is software to calculate p values based on simple input numbers, such as study size, expected effect, and so on.

Here is a summary of how to calculate p value: 

http://www.wikihow.com/Calculate-P-Value

It looks really complicated and confusing, but it makes a ton of sense to people who know how to do it. I don't suggest you try to learn the details. Just know that the p value tells you how likely it is that results this strong would show up by pure chance if there were no real effect. A p value of .05 is the standard, and it means there is only a 5 percent chance that results like these are just randomness.
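For the curious, here is what such a calculation looks like in code, using made-up trial numbers (not from any real study) and the common Fisher exact test:

```python
# A sketch of computing a p value on hypothetical trial results:
# 100 treated patients with 12 bad outcomes vs 100 controls with 25.
from scipy.stats import fisher_exact

table = [[12, 88],   # treated: 12 bad outcomes, 88 good
         [25, 75]]   # control: 25 bad outcomes, 75 good

odds_ratio, p_value = fisher_exact(table)
print(f"p = {p_value:.3f}")  # below 0.05, so this would be called "significant"
```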

But here is the kicker, and it is a huge problem: 

Studies that prove nothing, with elevated p values, generally don't get published. Those "worthless results" are considered a waste of time. But if you run the same study 20 times, on average one of them will give a false positive (one in 20 is 5 percent, a p value of 0.05). That result will appear excellent, but it is wrong: fake, biased, incorrect, garbage, dangerous. You might think this is a trivial criticism of necessary science, but I can tell you it is huge.

There are studies where the data is examined hundreds of different ways, and not all of those analyses get published; only the significant results do. So if a study looks at the same data one hundred different ways at a threshold of p = 0.05, then on average 5 of the results will show a fake but convincing effect.

Worse yet, some studies take an "early look" at the data. This should be condemned entirely. An early look erodes the quality tremendously. Not only is there less data to look at, but two looks at a 0.05 threshold push the overall chance of a false finding toward 10 percent. And if one also slices the data 10 different ways, a false appearance of statistical effect becomes very likely. The study will show a false truth.

This certainly happened with the Women's Health Initiative. There was an early look, and a possible false finding of a cause of breast cancer. That study cost many millions of dollars, and it turned the previous data on its head. Ultimately there was one table in that study that showed the possible truth: the "life table" analysis, which plotted the incidence of breast cancer in the hormone group versus the non-hormone group across time. In that table, the incidence of cancer was higher early on in the estrogen group, but the incidence lines were about to cross, consistent with older data, at the two-year early look. Despite the poor quality of the data, the study was cancelled and the p value was declared secure. And people came to believe that estrogen, a natural, normal female hormone, is toxic. It might take another hundred years before someone does this study properly. The early look and the multiple analyses gigantically eroded the value of the data. And in any case, the effect attributed to estrogen was a few cases in 10,000. It became easy to vilify estrogen to the point of wrecking women's lives.
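A small simulation shows how brutal this effect is. The sketch below is my own illustration, not any published analysis: every dataset in it is pure random noise, with no real effect anywhere, and yet roughly 5 of the 100 analyses will come out "significant."

```python
# 100 analyses of pure noise: how many reach p < 0.05 by chance alone?
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(seed=1)
false_positives = 0

for _ in range(100):                  # "look at the data 100 different ways"
    group_a = rng.normal(0, 1, 50)    # 50 patients, no treatment effect
    group_b = rng.normal(0, 1, 50)    # 50 patients, identical distribution
    _, p = ttest_ind(group_a, group_b)
    if p < 0.05:
        false_positives += 1

print(false_positives)  # around 5: "significant" findings made of nothing
```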

Also, there is the effect of the study group. A study done in one setting will not necessarily apply in another. A study done by midwives can be perfect scientifically, yet not apply to obstetricians, because obstetricians treat their patients differently. A study done on men might not apply to women. A study done in a poor area of Chicago might not apply to Mormons in Utah. For instance, let's say you are doing a vitamin D study and your group is in far northern Canada; they might not get sunshine for half the year. (Vitamin D is created by sunshine on the skin.) That study will certainly not apply in Ecuador, which is named for sitting on the Equator and has near-vertical sunshine year round. There are a lot of vitamin D studies; one should look carefully at the study population to see whether a study applies to yourself or your population.

One example of study population affecting results is C-section closures. At one university, C-section skin-closure techniques were studied: staples versus subcuticular dissolvable suture. In that study the results were found to be equivalent; staples and subcuticular closure had the same scar outcome. But, the kicker is, this university also published a very high surgical skin infection rate, as high as 15 percent if I remember correctly. This study cannot apply to me, because my surgical infection rate might be 15 times lower than that. I practice at a hospital that has infection control procedures down pat, with highly experienced personnel, laminar-flow operating room air, and so on. So that study simply doesn't apply to me; I have to make my own decisions about C-section closures, unless I do my own study. The bottom line is that I will close a C-section in the way the patient finds best; in other words, the patient helps decide. Some don't like staples. Some have had very good results with staples. Some want dissolvable stitches, even though those stitches take weeks to months to fully dissolve, if ever.

Now let's move from statistical medicine to the doctor-patient interaction.

Predictive Value:

Let's say that I ordered a pregnancy test, and it comes back positive. But the test was done on a boy, or a virginal gay woman, or a virginal nun. What is the value of that test result? It will not be valid. It will of course lead to a lot of stress, maybe recriminations, and some terrible feelings, but the value of that test is nearly nil. The predictive value of a positive pregnancy test depends on the population being tested. Let's say for the sake of argument that this particular test is 99 percent accurate. That still leaves a lot of room for error, because there are women who cannot get pregnant; if we test them, all of the positive results are inaccurate, or at least misleading. A test can be falsely positive for a number of reasons: tumors, ovulation, HCG injections given for a number of indications. I've even had patients who were ALWAYS POSITIVE; they've never had a negative pregnancy test in their lives. Sorting that out is a challenge. Let's hope a 14-year-old is not disowned by her father while we figure it out. We might conclude that there was a tiny bit of placenta left over from her own fetal days, stuck somewhere in her body. Wherever it was, it did not seem to harm her, and she wasn't worried.

So to calculate the predictive value of a positive result, the most important factor to consider is the pre-existing chance (the prevalence) of the condition being tested for. A good test has an 80 percent "sensitivity," meaning that when the condition is present, there is an 80 percent chance the test will show it.

Take a look at the Wikipedia page as of today:

https://en.wikipedia.org/wiki/Sensitivity_and_specificity

There is a lot of math there. We don't need to know the math, but we have to know the idea. And if we don't, we mess up.
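Actually, the core idea fits in a few lines of code. Here is a minimal sketch; the example numbers are assumptions for illustration, not real test specifications:

```python
def positive_predictive_value(sensitivity, specificity, prevalence):
    """Chance that a positive result is a true positive (Bayes' rule)."""
    true_positives = sensitivity * prevalence
    false_positives = (1 - specificity) * (1 - prevalence)
    return true_positives / (true_positives + false_positives)

# A "99 percent accurate" test used where the condition is almost impossible,
# e.g. a pregnancy test in a group with a ~1 in 10,000 chance of pregnancy:
print(positive_predictive_value(0.99, 0.99, 0.0001))  # about 0.01
```

Even a 99 percent accurate test, applied to people who essentially cannot have the condition, yields positives that are about 99 percent false. That is the pregnancy-test-on-a-boy problem in numbers.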

For instance, I might order a "Comprehensive Metabolic Panel" from the lab. This panel measures roughly 20 natural chemicals in a person; glucose and sodium (salt) are usually at the top. It is really tempting to order this test, since it gives a ton of good information about a patient's chemical status, and the lab can print it out in minutes to hours. The problem, and it is not a big problem, is that the normal ranges are defined to cover 95 percent of healthy people. That means that if we run 20 such tests, on average one will fall outside the normal range in an otherwise completely normal person. For instance, they ate a jelly donut and their glucose is high. That is a bad example, because most people won't eat a jelly donut right before a lab test, but in an emergency the ER doc might not be able to ask the patient when they last ate. So with about 20 measurements, each with a 95 percent reference range, a normal person averages about one result outside the normal range. That is a completely false positive. And it is normal.
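Back-of-the-envelope, assuming the roughly 20 results are independent (real blood chemistry values are not perfectly independent, so treat this as a sketch):

```python
# With ~20 tests, each using a 95 percent reference range, how often does a
# perfectly healthy person get flagged? (Panel size is an approximation.)
n_tests = 20
p_normal_each = 0.95   # each range covers 95 percent of healthy people

expected_flags = n_tests * (1 - p_normal_each)
p_at_least_one_flag = 1 - p_normal_each ** n_tests

print(expected_flags)                 # 1.0  -> about one abnormal value on average
print(round(p_at_least_one_flag, 2))  # 0.64 -> 64 percent chance of at least one
```

So a completely healthy person is actually more likely than not to have at least one "abnormal" value somewhere on a big panel.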

Notice the parallel to the 95 percent confidence interval, or 5 percent false positive rate, assigned to medical studies. Medical science seems quite attached to the 5 percent/95 percent convention.

Where this gets really complicated is with tests whose positive predictive value is 80 percent or less, and 80 percent is actually high for a screening test. A pap smear in the old days, before HPV testing, had a positive predictive value of roughly 5-10 percent. A glucose screening test in pregnancy has about a 10 percent positive predictive value; in other words, 90 percent of positives are false. So we deal with low predictive values all the time. The tests still have a lot of value, but a single positive result, by itself, means little.

Low Predictive Value:

What is the chance of an 18-year-old getting cervical cancer? It is very low. If we do a pap smear, the chance of a positive pap actually meaning cancer is next to nothing, because pap smears have a high false positive rate in a population that is at very low risk. Back when I used to do paps on 18-year-old women, I intervened only when the biopsies showed severe risk. That did happen, and I kept my interventions very light, like a gentle laser surgery to remove only the surface of the worst areas. But it turns out that even that is unnecessary. The incidence of cancer is so low as to make a positive pap smear nearly worthless; the positive predictive value was near zero. So, per the new protocols published by the ASCCP, I have stopped doing paps in women under 21 years of age. The paps simply don't help; the predictive value is too low. It is like doing pregnancy tests on a boy, or doing screening vaginal sonograms on normal women, which has been shown again and again to be worthless or even dangerous. The net value is most likely below zero; in other words, it harms women.
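Plugging rough numbers into the same predictive-value arithmetic makes the point; the prevalence and test figures below are illustrative guesses, not published statistics:

```python
# Pap-smear-like numbers in a very low-risk group (illustrative guesses only).
sensitivity = 0.80
specificity = 0.95          # i.e., a 5 percent false positive rate
prevalence = 1 / 100_000    # cervical cancer in an 18-year-old: very rare

true_pos = sensitivity * prevalence
false_pos = (1 - specificity) * (1 - prevalence)
ppv = true_pos / (true_pos + false_pos)

print(f"{ppv:.5f}")  # about 0.00016 -- a positive pap almost never means cancer here
```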

Eighteen-year-old women can still get checkups, or get checked for problems, of course. It is just that the pap is not part of the checkup unless there is a specific reason (the reason might be that the woman or her mother really wanted it).

But please, don't assume that sonograms themselves are worthless. In fact, women should have more of them. They should present early and often for pelvic pains, pressures, bloating, or any other symptom. An indicated sonogram can save a life, and we all need to do better at detecting ovarian cancer.

I haven't given up on medical science. But there is still a lot of room for what is called the Art of Medicine: keeping people healthy, preventing disease, eliminating risk and pain, and doing it while keeping people feeling safe, comfortable, and happy. And I do that to the best of my abilities.


Comments are appreciated. And let me know if there are any errors.

Thanks
Blog at doctorjohnmarcus.blogspot.com.



2 comments:

  1. By the way, it is interesting to note that particle physicists use "Sigma" to quantify the chance of statistical error. For instance, if they require a sigma of six to determine truth, the chance of error is about one in a billion, equivalent to a p of about .000000001. Which study is more reliable?
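That conversion is easy to check with a couple of lines (a quick sketch using the one-sided tail of the normal distribution):

```python
# Converting sigma levels to one-sided p values with the normal distribution.
from scipy.stats import norm

for sigma in (2, 3, 5, 6):
    print(sigma, f"{norm.sf(sigma):.1e}")  # sf(x) = 1 - CDF(x), the upper tail
# 5 sigma (the usual particle-physics discovery bar) is p ~ 3e-7;
# 6 sigma is p ~ 1e-9 -- vastly stricter than medicine's 0.05.
```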
