# “Can you really tell the difference?” - My wife, expressing fighting words

This experiment was inspired by an argument with my wife, a stylish but atrocious water filter, and the explosion of start-ups attempting to turn everything you purchase into a subscription service.

About a year ago I was growing tired of our tap water and its overly chlorinated taste. Initially I thought just to buy a Brita, but Brita filters always seemed like something you’d shove in your dorm room mini-fridge and not display on your kitchen counter. I looked around to see if there was anything better out there, and lo and behold, there was a water filter company called Soma with a beautifully designed water filter that seemed to fit the bill. They emphasize how their filters are “plant-based” and “sustainable”, but I just cared about the design. They also put their CEO's head in a circle (along with all their other pictures), which is the universal "Hey Millennials! Our Corporation is Different" indicator, so as a millennial I believed that they believed in their mission statement and just wanted to deliver me an effective and stylish water jug. The initial reviews on Amazon seemed fine, so I went to their website and placed an order.

Figure 1: Pur and Soma filters.

The filter arrived a few days later and it indeed looked great. I followed all the instructions for prepping the water filter, and then turned on the tap to fill it up. Immediately I was struck by how quickly the filter seemed to be “filtering” the water. My experience with other filters was filling the upper chamber and coming back in five minutes after it slowly dripped through, but the water in this filter seemed to be traveling relatively unimpeded from the upper chamber to the lower. Impressed by the speed at which Soma was able to filter their water, I waited for the flow to stop and poured myself a glass.

# It did nothing.

It did nothing. It just tasted like tap water. I emptied the pitcher and filled it again, giving the filter the benefit of the doubt, and again came to the same conclusion. No taste difference. At this point I just decided to throw the pitcher into the fridge (cold water masks poor taste) and use it for a few weeks to see if anything changed. When nothing did, I went online and ordered a Pur filter that had universally good ratings. That filter actually worked, and I put the Soma filter in a corner of the pantry to throw out later. I told my wife we finally had a water filter that worked, to which she replied:

“Can you really tell the difference? I think you’re crazy. They all taste fine to me.”

A statement which I couldn’t refute, as the filter I did have was used and I certainly wasn’t going to order a new one.

A few months later, I got a package in the mail from Soma. It turns out if you order from their site, you agree to sign up to their filter subscription service, where they “helpfully” send and charge you for a new filter every few months. This is part of a larger trend of startups subtly signing you up for a subscription service when you purchase their products, a practice pioneered in the 90s by “8 CDs For A Penny!” Columbia House and now adopted online for everything from lingerie to kids’ clothing to women’s activewear (who is trying to make subscription clothing a thing? Stop trying to make subscription clothing a thing).

Anyway, armed with a fresh Soma filter (one which they touted was improved from their previous filter), I cancelled my subscription and set out to design an experiment to test my ability to distinguish between water types. Hopefully it would show both that I could tell the difference between filtered and unfiltered water and, empirically, how bad the Soma filtered water tasted.

# How do you design an unbiased experiment when you're predisposed to a certain outcome?

I went online, did some research to see what other people had done, and found this post where someone actually tested the chlorine and impurity content of various filters. The important takeaway from that post is that while other similar filters reduced chlorine content by 95%, the Soma filter only reduced it by about half. Their testing methodology was good, but they based their overall decisions on a subjective ranking system that doesn’t emphasize how poorly the Soma filtered water tasted. The Soma filter did indeed filter less chlorine out of the water, but was there a more objective way to show how terrible it tasted? In addition, how do you design an unbiased experiment when you're predisposed to a certain outcome (in this case, that the Soma filter does not change the taste of tap water)?

I decided to perform a series of blind pairwise comparisons between four types of water: Pur filtered, Soma filtered, tap, and bottled water. The goal was to see how distinguishable each type of water was from each other type. I wouldn’t know which types of water were being compared in each round, and the drive to prove to my wife that I could indeed distinguish filtered water from tap (and thus scientifically and indisputably win a marital argument, a rare event) would keep me honest. If I showed I could distinguish between types of water, but couldn’t tell the difference between tap and the Soma water, then I would have objectively shown that the Soma filter did little to change the taste of the water.

Demo Experimental Set-up

| Round | Cups |
|---|---|
| 1 | No Switch |
| 2 | Switch |
| 3 | Switch |
| 4 | No Switch |
| 5 | Switch |
| 6 | No Switch |
| 7 | Switch |
| 8 | No Switch |
| 9 | No Switch |
| 10 | Switch |

Figure 2: Experimental set-up for the water trials. Successes are judged as successfully determining if the water was or was not switched, and the switching was determined by a random number generator.
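
A switch schedule like the one in Figure 2 can be produced in a few lines of R. This is only a sketch: the post does not say which random number generator was used, and the seed and variable names here are made up for illustration.

```r
# Hypothetical sketch: draw a random switch / no-switch schedule.
# The seed and names are illustrative assumptions, not the original set-up.
set.seed(2017)   # arbitrary seed so the schedule can be reproduced

n_runs <- 10     # the demo in Figure 2 uses 10 rounds

# Each round is switched with probability 0.5, independently of the others
switched <- sample(c(TRUE, FALSE), size = n_runs, replace = TRUE)

schedule <- data.frame(
  round = seq_len(n_runs),
  action = ifelse(switched, "Switch", "No Switch"),
  stringsAsFactors = FALSE
)
print(schedule)
```

Because every round is an independent coin flip, the taster gains nothing from counting how many switches have already happened.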

For types of water that were indistinguishable, I would only be able to correctly classify them about 50% of the time, by chance. For types of water that were distinguishable, I should be able to correctly categorize a significantly higher percentage of them. Here, I define that percentage as 75%. Along with an acceptable false negative rate ($$\beta$$) of 0.20 and a false positive rate ($$\alpha$$) of 0.05, this sets my minimum required sample size at 23 runs. You can see this in the following figure, where the power (which is 1-$$\beta$$) crosses the 0.80 threshold at 23 runs. There is no formal industry standard for power (where “industry” here loosely means “researchers”), but 80% is the usual cutoff for an acceptable design size [1]. With an $$\alpha$$ of 0.05, this means we accept four times as many false negatives as false positives, with the idea that a false negative is usually not as bad as a false positive.

Usually, power is a monotonically increasing function of sample size, but for a binomial test you get the odd case where a slightly lower number of runs will occasionally have higher power than a design one or two runs larger. This is due to the discrete nature of the binomial distribution.

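```r
experimentsize = 1:25

alpha = 0.05          # acceptable false positive rate
baseprob = 0.5        # success rate if the waters are indistinguishable
thresholdprob = 0.75  # success rate defined as "distinguishable"

# qbinom() gives the largest success count that is not significant under chance;
# power is the probability of exceeding it when the true success rate is 75%
power = 1 - pbinom(qbinom(1 - alpha, experimentsize, baseprob), experimentsize, thresholdprob)

poweranalysis = data.frame(experimentsize, power)
```

Figure 3: Power as a function of experiment size, for an exact binomial test between 75% and 50%. Power passes the 80% threshold at 23 runs. The higher the power, the better the chance the experiment will see an effect if one is actually present.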

(An easy way to calculate sample size if you don’t want to do it by hand is the tool G*Power, which is free and available on most platforms.)
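
For anyone who would rather stay in R, the same search can be done directly. The sketch below reuses the critical-value formula from the power calculation, so it carries the same assumptions (a one-sided exact binomial test, a 50% chance rate, and a 75% true success rate). The variable names are placeholders chosen for this example.

```r
alpha <- 0.05
base_prob <- 0.5        # success rate when the waters are indistinguishable
threshold_prob <- 0.75  # success rate defined as "distinguishable"
n <- 1:25

# Largest success count that is NOT significant at each experiment size
crit <- qbinom(1 - alpha, n, base_prob)

# Power: probability of exceeding that count when the true rate is 75%
power <- 1 - pbinom(crit, n, threshold_prob)

# Smallest experiment size that reaches 80% power
min_n <- min(n[power >= 0.80])

# Number of correct calls needed for significance at that size
needed <- crit[n == min_n] + 1

# Sizes where adding one run lowers the power (the sawtooth effect)
dips <- n[-1][diff(power) < 0]

cat("minimum runs:", min_n, "\n")
cat("successes needed:", needed, "\n")
cat("sizes with lower power than the size before:", dips, "\n")
```

With these inputs the search should land on 23 runs, with 16 or more correct calls needed for significance. The `dips` vector lists the experiment sizes where the discreteness of the binomial makes a slightly larger design less powerful than the one before it.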

Each round now involves filling up 48 cups of water: 23 with water type A and 23 with water type B, plus a taste calibration cup for each type to sip before the round starts. The cups are lined up side-by-side in 23 pairs, and then a random number generator tells my wife which pairs to switch. Correctly determining which pairs were switched indicates I successfully distinguished the two waters. Here’s an image of the set-up:

Figure 4: Actual image of the experimental set-up, with water types shown.

Here, I controlled for several variables. First, I filled all the pitchers the previous night and let the temperature settle to 73 degrees, and confirmed they were all the same with a laser thermometer before the start of the experiment. I had my wife randomize (with a random number generator) both the cup switches and the order in which the water types were compared. Between each round, I left the room while she filled and arranged the water, and had her hide the pitchers during each round. I dried the cups after each round to remove any traces of the previous water. I also did a control round, where all 48 cups were filled with the same type of water and I was tasked with the (futile) goal of trying to classify them. That way, if two waters tasted the same, I (as the subject) couldn’t be sure which two waters I was drinking. Finally, and most importantly, my wife agreed to participate (she's a good sport).
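
The list of rounds and their order can be laid out with `combn()` and `sample()`. This is a sketch of one way to do it, not a record of what we actually ran: the seed, the names, and the water used for the control round are all placeholders.

```r
# Hypothetical sketch: list every pairwise comparison and shuffle the order.
set.seed(42)  # arbitrary seed, only for reproducibility

water_types <- c("Pur", "Soma", "Tap", "Bottled")

# All unordered pairs of the four water types (choose(4, 2) = 6 comparisons)
pairs <- t(combn(water_types, 2))
comparisons <- data.frame(A = pairs[, 1], B = pairs[, 2],
                          stringsAsFactors = FALSE)

# Add a control round with the same water in every cup
# (which water was used is not stated in the post; "Tap" is a placeholder)
control <- data.frame(A = "Tap", B = "Tap", stringsAsFactors = FALSE)
comparisons <- rbind(comparisons, control)

# Randomize the order in which the rounds are tasted
comparisons <- comparisons[sample(nrow(comparisons)), ]
rownames(comparisons) <- NULL
print(comparisons)
```

Four water types give six real comparisons, so with the control round the taster faces seven rounds in an order they cannot predict.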

And here are the actual results of the experiment, shown in decreasing order of distinguishability:

Figure 5: Results from the experiment. Blue indicates a successful classification; red indicates an unsuccessful classification.

We take this data, count the number of successes in each comparison, and test whether that number is statistically greater than the 50% expected by chance. We calculate the p-values directly, but we can also show this visually with the lower confidence bounds. If the lower confidence bound falls below 50%, we cannot say the two water types are distinguishable. The upper confidence bounds are not shown because it is a one-sided test (they just stretch to 1).

```r
library(dplyr)

# Count the correct classifications in each comparison and test them against chance
results = experiment %>%
  group_by(Comparison) %>%
  filter(Truth == Data) %>%  # keep only the runs that were classified correctly
  summarize(successes = n(),
            pval = binom.test(n(), 23, p = 0.5, alternative = "greater")$p.value,
            lowerci = binom.test(n(), 23, p = 0.5, alternative = "greater")$conf.int[1],
            upperci = binom.test(n(), 23, p = 0.5, alternative = "greater")$conf.int[2])
```

Figure 6: Results of the experiment with 95% confidence intervals, and colored indicating whether the two water types were statistically distinguishable.

Soma filtered water performed the worst, having a taste statistically indistinguishable from tap water. Bottled water performed the best, being distinguishable from tap 100% of the time. Pur did almost as well against tap, with only two misclassifications. It did even better against Soma filtered water, with only one misclassification. Bottled and Pur filtered water were harder to distinguish, but the test shows there is a difference. In this case, I described the Pur filtered water during the test as “smooth” and the bottled water as “slightly alkaline,” and I actually preferred the Pur water’s taste to the bottled. Bottled vs Soma filtered is also not statistically different, but only one more success would have made it so. If you look at the actual data above, you can see the first two runs accounted for 2/8 of the errors made in that round. The first couple of runs come right off the initial calibration cups, so it’s possible that I was not fully “calibrated” to the taste. In addition, I designed the test with an acceptable false negative rate of 20%. If we assume everything but Tap vs Soma is distinguishable, one false negative out of five comparisons is within the designed sensitivity of the experiment.
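
The success counts quoted above can be checked by hand with `binom.test()`. The sketch below uses only the counts stated in this post (23, 22, 21, and 15 correct out of 23), and the labels are shorthand made up for this example. The other two comparisons are left out because their counts only appear in the figure. The last line assumes the rounds are independent and that each truly distinguishable pair is missed at exactly the designed 20% rate.

```r
# One-sided exact binomial test against chance (p = 0.5) for 23 runs
p_value <- function(successes, runs = 23) {
  binom.test(successes, runs, p = 0.5, alternative = "greater")$p.value
}

# Counts of correct calls taken from the text above
successes <- c("Bottled vs Tap" = 23, "Pur vs Soma" = 22,
               "Pur vs Tap" = 21, "Bottled vs Soma" = 15)
print(round(sapply(successes, p_value), 4))

# Bottled vs Soma: is 15 correct significant at 0.05? Would 16 have been?
print(p_value(15) < 0.05)
print(p_value(16) < 0.05)

# Chance of at least one false negative among five truly distinguishable
# comparisons, assuming independent rounds and a 20% false negative rate each
miss_any <- 1 - (1 - 0.20)^5
print(miss_any)
```

Under those assumptions, at least one miss among five distinguishable pairs is more likely than not, which is why a single borderline round is not very surprising.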

# Soma filtered water performed the worst, having a taste statistically indistinguishable from tap water.

Here we look at the actual p-values of all the comparisons:

Figure 7: Map of the p-values of each of the comparisons (rounded to two decimal places). Soma vs Tap was statistically the least distinguishable of the comparisons.

Looking at the tap water row, it’s obvious here that the Soma filter does little to nothing in improving the taste from ordinary tap water. Their marketing campaign spouts how environmentally friendly their product is, but I doubt the environmental worth of a worthless (but pretty) piece of plastic.

## In terms of arbitrary ranking scales: I rate it 0/5 empty water cups. Thanks, Soma.

As for recommendations, the Pur water filter is cheap and the filters themselves don’t cost a lot. It still just looks like a water filter and won’t win any style awards, but it actually does the one job that it’s supposed to do. As for the argument with my wife, she no longer thinks I’m crazy for thinking the Soma water tastes bad; now I’m crazy because I spent three hours on a Sunday night sipping glasses of water. ¯\_(ツ)_/¯


1. Dooper5000
May 2, 2017

Haha thanks! great post, I have been in there offices. Maybe this I
Will create some buzz and they will improve there filters once again, and and auto change you….

• Gb
May 3, 2017

Do you speak English?

2. May 3, 2017

Too bad you couldn’t have also included Brita in the test.

• Tyler Morgan-Wall
May 3, 2017

I would have, but it would have almost doubled the size of the test!

3. tom anderson
May 3, 2017

It seems paradoxical that you could distinguish tap from bottle 100% of the time, but you couldn’t seem to tell the difference between tap/soma or bottle/soma even though you state in the article that water from the soma tastes just like tap water, Surely comparing bottled to soma should give comparable results to bottle vs. tap?
Any thoughts on why this isn’t the case?

• j605
May 3, 2017

The author said researchers found Soma filtered about 50% of the chlorine but that still was enough to make it slightly better than tap water.

• Tyler Morgan-Wall
May 4, 2017

The issue is that water quality is a spectrum, taste isn’t one dimensional, and comparisons are not transitive.

As a simple toy example: if tap water started at an impurity concentration of 100 ppm and a person could distinguish changes in water of 60 ppm, then Soma reducing the concentration of impurities from 100 ppm to 50 ppm would still leave it indistinguishable from tap water. It would also be indistinguishable from bottled, with zero ppm. However, bottled water with zero ppm would clearly be distinguishable from tap. This shows how comparisons are not necessarily transitive.

More importantly, most human senses are not linear in detection thresholds, so in reality the detection threshold changes depending on the base impurity level. The point of this test was to move away from simply comparing impurity concentrations or chlorine levels and instead focus on the taste, which is a complex multidimensional sense that is hard to quantify but easy to understand in terms of comparisons.

And the last point is that the test was designed with an acceptable 20% false negative rate, so out of the 6 rounds one false negative is well within the design tolerance. Doing the test with more comparisons in the future (and thus gaining more statistical power) would help mitigate this issue.

4. Jason Hirschhorn
May 3, 2017

the real crime is that you need to sit there, pour some water in, wait for it to go through filter then keep filling up. any water filter product tha makes you fill it up more than once per container is useless.

5. Emily Hunt
May 8, 2017

Fascinating.
I have one major issue (that you do bring up) and it’s that you can distinguish between tap and bottled water 100% of the time, and can distinguish between tap and Soma a statistically significant amount of time, but not between the bottled and the Soma. Now, you clearly noticed this and brought it up. Your possible explanation is that perhaps you were not yet calibrated as you missed the first two cups. However I do not thing your accuracy changes depending on what cup you are on. If you count the total number of errors across all trials in the first half (1-11) it is actually less than the number of errors in the second half (13-23). I left out #12 since you didn’t have an even number of trials and there was two errors there so even if averaged it would not make a difference.

I’m still pretty impressed with the effort put fourth here but not 100% convinced given this issues.

• Tyler Morgan-Wall
May 8, 2017

Hi Emily,
Glad you enjoyed the read! As to your concern, see my response above to Tom Anderson’s comment. What the issue comes down to is that comparisons are not necessarily transitive, and the test was sized with an allowable 20% false negative rate. My comment about the first two cups being off was not expressing any particular poignant insight, but rather a personal observation that I became more confident in my classifications after a few trials into each round.

6. Elena
May 25, 2017

Skimming through this I didn’t see you list the municipality that provided the tap water. I would be interested to see how this test would vary across the states. For example, I know when my brother was living in Oakland he would always tell me the water tasted good and they would just filter for impurities, but the water here in San Diego is very hard and tastes much more chlorinated IMHO.

• tyler
May 26, 2017