A comparative versus evaluative, double-blind vs. sighted control test

Discussion in 'High End Audio' started by Harry Lavo, Feb 9, 2004.

  1. Harry Lavo

    Harry Lavo Guest

    Hi RAHE'rs -

    I've had many inquiries and some interest in my proposal that before
    comparative dbt'ng is crowned "the" test for audio evaluation, it needs to
    be validated by a control test. While I have sketched such a test in
    several different posts/threads, there seems to be enough confusion over
    what I have said that it is worth outlining here in a definitive post on the
    subject.

    In addition, at the end I will respond to Tom's offer to join together in
    such a test.

    WHAT IS THE ISSUE?

    As I have analyzed my own and others' arguments here for and against
    comparative dbt'ng, it seems to me that the issue has much less to do with
    being blind than it does with being comparative. In other words, does a
    test "forcing" a choice under uncertainty duplicate the results that would
    be obtained by listening and evaluating components at home in a relaxed
    atmosphere, whether blind or sighted? I have accordingly proposed that the
    only way to validate the comparative dbt as the definitive tool is to remove
    this question mark. And it could be done, with enough time and resources
    devoted to it.

    As such, the control test must separate out and test two variables -

    * evaluative (blind) vs. comparative (blind) ... a test of evaluative
    testing versus comparative testing
    * evaluative (blind) vs. evaluative (sighted) ... a test of blind vs.
    sighted testing

    With the answers to these two comparisons, it should be possible to answer
    the following questions:

    * Does blinding give better bias control? (presumably yes)
    * How close can open-ended, relaxed, sighted evaluative testing (the
    traditional home "sighted" tests, which are believed worthless by the
    objectivists) come to duplicating the results of open-ended, relaxed, but
    blinded evaluative testing? Same test technique, but blinded, which
    objectivists presumably would support.
    * Do traditional comparative dbt tests give identical results to more
    relaxed and evaluative dbt tests? (The answer is simply not known, but
    postulated by subjectivists to be "no", on the theory that the test itself
    is different enough to get in the way.)

    Essentially, the blinded (dbt), relaxed, evaluative test is "the missing
    link" between the current dbt camp and the current subjectivist camp, as it
    helps resolve both the "blind" issue and the "comparative vs. evaluative"
    issue, using components playing music, not artifacts or pink noise.

    GENERAL TEST CONDITIONS
    * Participants must take part in all three tests...open-ended sighted,
    open-ended blind, and comparative blind.
    * There have to be enough trials of each type to allow statistical
    evaluation.
    * Musical selections and media must be agreed to in advance by all parties
    as being sufficiently varied to reveal all types of significant audio
    reproduction qualities. (Dynamic range, soundstaging, depth,
    dimensionality, bass quality, treble quality, midrange quality, etc.)
    * Equipment under test must be believed by most participants to sound
    different from one another under sighted conditions, while some degree of
    objectivist skepticism about the same exists.
    * Equipment under test, everything else being equal, should make testing
    under home or similar-to-home conditions as simple as possible, including
    time-synched switching.
    * Tests must be done either in participants' homes, or at a site accessible
    to participants over long periods of time on a sighted basis before test
    ratings are collected.

    EVALUATIVE TEST CONDITIONS
    * Open-ended home listening must replace informal note-taking with formal
    rating of components on an evaluative scale, in order to be able to
    statistically correlate with the blind evaluative testing.
    * Evaluative scale should draw from and reflect all significant variables
    suggested by RAHE participants, reduced to a manageable number by
    consolidating very similar qualities.

    COMPARATIVE TEST CONDITIONS
    * Test should be a-b, rather than a-b-x, in order to better approximate the
    evaluative tests.
    * Test should ask for overall preference and preference on comparative
    versions of the evaluative scales (at least those found significantly
    different in the evaluative testing).

    BLIND TEST CONDITIONS
    * Participants should be allowed substantial "warm up" time on a sighted
    basis to listen to the test equipment using the musical selections to be
    used in the test.
    * Participants should be allowed to control the switching of the test.
    * Participants should ideally be left alone in the room during the test,
    and should "turn in" ratings to an out-of-room proctor who has also
    recorded the actual a-b assignment for each trial.
    * a-b assignments shall be based on random drawings and then adjusted
    slightly, if needed, to assure equal positioning and no chance of order
    bias (a sketch of one such scheme follows below).
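    As a rough illustration only (the function name, labels, and trial count
    are mine, not something anyone has agreed to), here is one way the proctor
    could draw up balanced random a-b assignments in advance, sketched in
    Python:

        # A minimal sketch, assuming sixteen trials and two units labeled "A"
        # and "B"; names and the trial count are illustrative only.
        import random

        def balanced_ab_assignments(n_trials=16, seed=None):
            """Draw a random sequence of orderings, constrained so each unit
            appears first in exactly half the trials (no order bias)."""
            assert n_trials % 2 == 0, "need an even number of trials to balance"
            rng = random.Random(seed)
            orderings = ([("A", "B")] * (n_trials // 2) +
                         [("B", "A")] * (n_trials // 2))
            rng.shuffle(orderings)
            return orderings

        if __name__ == "__main__":
            for trial, order in enumerate(balanced_ab_assignments(seed=2004), 1):
                print(f"trial {trial:2d}: first={order[0]}  second={order[1]}")

    The proctor would keep the resulting assignment sheet out of the room and
    match it against the turned-in ratings afterward.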

    * * * * * * * * * * * * * * * *

    With those general conditions established, I would like to discuss actual
    test implementation practicalities. This is where it gets complicated.

    THE OPEN-ENDED SIGHTED EVALUATIVE TEST
    Essentially, as I described in an earlier post, the typical audiophile puts
    a new piece of equipment in the system, listens open-ended for awhile,
    switches back, does the same, and by doing this a few times over several
    selections of music begins to home in on what characteristics the new
    equipment has in his system versus the old. These may be improvements; they
    may be deficiencies. He continues to do this until a) he has to return the
    equipment, or b) he reaches a definitive preference for one or the other (a
    preference growing organically out of the evaluation and the emergence of
    defining audio characteristics).

    How best to approximate this test on a slightly more structured basis, so
    that results may be compared to later tests?

    The first and probably only thing required, it seems to me, is to
    substitute formal evaluation rating scales for the informal notes taken
    during this process. My suggestion is that the evaluator would have perhaps
    half a dozen interim rating sheets that he/she would use over, let's say,
    six weeks. Then at the end, he/she would review those sheets and put
    together a "final" rating for the two pieces of equipment. These would be
    on an absolute scale for the two pieces. For example, both might be rated
    high on "throws a wide soundstage beyond the outside edges of the
    speakers". One would be rated "5" and the other "4" on a "1" to "5" scale.
    So this score can be used both as a numeric rating and as a comparative
    rating, e.g. both same, or one higher (different, higher) on that
    characteristic. There would also be a similar rating for "overall
    preference" that might be "4" and "3" (different, better), or perhaps "4"
    and "4" (no preference).
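    Just to make that scoring concrete (the attribute names and numbers below
    are invented, not part of any agreed scale), the reduction from two
    absolute 1-to-5 ratings to a comparative outcome could look like this:

        # A small sketch of the scoring idea above: each pair of absolute 1-5
        # ratings is kept as a numeric score and also collapsed into a
        # comparative result. Attribute names and values are invented.

        def comparative_outcome(rating_a, rating_b):
            """Collapse two absolute 1-5 ratings into 'same' or 'different'."""
            if rating_a == rating_b:
                return "same"
            winner = "A" if rating_a > rating_b else "B"
            return f"different, {winner} higher"

        ratings = {
            "wide soundstage beyond the speaker edges": (5, 4),
            "overall preference": (4, 3),
        }

        for attribute, (a, b) in ratings.items():
            print(f"{attribute}: A={a} B={b} -> {comparative_outcome(a, b)}")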

    However, one can immediately see one problem. With a sighted test, there is
    no such thing as doing sixteen independent trials, since presumably, once
    the person "locks in", his future ratings would be very similar, because he
    knows which equipment is which. Even allowing for differences in moods,
    climates, etc., these would not be sixteen independent tests.

    The implication of this is that for the "relaxed, evaluative, sighted"
    versus "relaxed, evaluative, blind" tests, more than one person must be
    tested....probably at least twenty. In the food industry, 100 was the
    smallest test size we considered reasonable. This adds enormously to the
    cost, time, and complexity of running such a test if one is to do it
    in-home.

    It would be a little more manageable doing it out-of-home at a central
    facility and having sixteen audiophiles do it. But this is fraught with
    problems...an unfamiliar system probably requiring more time for each
    respondent to reach a final evaluation, the need to maintain the setup for
    several weeks to allow all respondents multiple exposures before doing so,
    etc.

    Problems, problems.

    THE OPEN-ENDED, BLIND EVALUATIVE TEST
    This test would be very similar to the open-ended sighted test, but
    double-blind. Once a warm-up period of perhaps a few hours was over,
    however, the respondent would take a trial, rate, turn it in, take a break,
    start another trial, etc., up to four in a row. If repeated four days or
    four weeks in a row, this could result in sixteen trials, enough to
    determine the significance of differences in ratings. The ratings would be
    the same as those used in the sighted testing. The results of this test
    would be: were differences between the equipment found, and were they
    statistically significant at the 95% confidence level? What
    characteristics, if any, came through as significantly different?
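    As a sketch of what significance at the 95% confidence level would mean in
    practice (the per-trial numbers below are invented), one simple way to
    score the sixteen blind trials for a single characteristic is a two-sided
    sign test on the per-trial rating differences:

        # A sketch: two-sided sign test on the per-trial difference
        # (unit A's rating minus unit B's rating) for one characteristic.
        # The sixteen differences below are invented for illustration.
        from math import comb

        def sign_test_p(diffs):
            """Probability of a split at least this lopsided if A and B were
            really rated the same (ties are dropped)."""
            pos = sum(1 for d in diffs if d > 0)
            neg = sum(1 for d in diffs if d < 0)
            n, k = pos + neg, max(pos, neg)
            if n == 0:
                return 1.0
            tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
            return min(1.0, 2 * tail)

        trial_diffs = [1, 1, 0, 2, 1, -1, 1, 1, 0, 1, 2, 1, 1, -1, 1, 1]
        p = sign_test_p(trial_diffs)
        print(f"p = {p:.4f} -> "
              f"{'significant' if p < 0.05 else 'not significant'} "
              "at 95% confidence")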

    Once a respondent's results were determined (different, same), overall and
    for each characteristic, they could be compared to the open-ended scores
    and a correlation established (or not). Since the open-ended sighted test
    only yields one score, it would be hard to evaluate the significance of
    these correlations for an individual person, but if done across 20-100
    people, a statistical correlation could be established. For this to be a
    true "scientific" test, it would have to be done across a substantial
    population of audiophiles, as has already been pointed out.
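    A rough sketch of that cross-respondent correlation step, with invented
    scores (a real panel would need the 20-100 respondents mentioned above):

        # Each respondent contributes one sighted score and one blind score
        # for a given characteristic; the correlation is taken across the
        # panel. All scores below are invented for illustration.
        from statistics import mean

        def pearson_r(xs, ys):
            """Plain Pearson correlation coefficient."""
            mx, my = mean(xs), mean(ys)
            cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
            sx = sum((x - mx) ** 2 for x in xs) ** 0.5
            sy = sum((y - my) ** 2 for y in ys) ** 0.5
            return cov / (sx * sy)

        sighted_scores = [5, 4, 4, 3, 5, 4, 2, 5, 4, 3]   # one per respondent
        blind_scores   = [4, 4, 5, 3, 4, 3, 2, 5, 4, 2]

        r = pearson_r(sighted_scores, blind_scores)
        print(f"sighted vs. blind correlation: r = {r:.2f}")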

    THE COMPARATIVE, BLIND TEST
    The main blind (a-b) test would use the evaluative factors of the sighted
    and blind evaluative tests, but on a comparative basis (e.g. which did you
    prefer overall, which had the wider soundstage, etc.). The comparative
    evaluation test could be correlated directly with the blind evaluative
    test, as well as within itself over sixteen trials. Again, these probably
    should be done in groups of four, since they require a fair number of
    ratings.

    Not essential, but of possible interest, would be to do a traditional a-b-x
    test as well, to see if it correlated with the overall-preference a-b test
    (% of respondents noting a difference in each / statistical significance of
    same).
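    Scoring that optional a-b-x add-on is straightforward: count correct
    identifications of "x" over the sixteen trials and compare against chance.
    A sketch, with an invented count:

        # One-sided binomial check of an a-b-x result against coin-flip
        # guessing. The correct count below is invented for illustration.
        from math import comb

        def abx_p_value(correct, trials):
            """Probability of getting at least `correct` right by guessing."""
            return sum(comb(trials, k)
                       for k in range(correct, trials + 1)) / 2 ** trials

        correct, trials = 12, 16
        p = abx_p_value(correct, trials)
        print(f"{correct}/{trials} correct: p = {p:.4f} "
              f"({'difference detected' if p < 0.05 else 'no reliable difference'} "
              "at 95% confidence)")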

    * * * * * * * * * * * * * * * *
    * * * * * * *

    IMPLICATIONS

    As noted, to be truly significant, this test has to be done across a sample
    of audiophiles, probably at least two dozen in-home evaluations and
    subsequent test follow-ups. This would keep Tom and me busy for a year.

    From a practical standpoint, the blind comparative vs. blind evaluative
    tests are easier to do, since multiple trials allow for internal
    statistical validity. I would be willing to develop and be the initial
    testee of such a test along with Tom, whom I would also ask to do the same,
    and perhaps a "neutral" third party.
    I would also do the sighted test, but the results would be strictly
    "anecdotal" until an appropriate database of RAHE participants was built
    up, and I would request that Tom and the "neutral" do the same.

    I would also suggest that a good and most interesting vehicle for this test
    would be a SACD player, using the stereo-mix SACD and CD layers on discs
    and tracks judged appropriate and "identical" in mix. The test would be
    easy to run...two identical side-by-side SACD players into a preamp input,
    with control-box or manual switching, automatically volume matched, no
    impedance problems a la speaker cables, and perhaps some ultimate insight
    into "is there a difference between SACD and CD". I have a SACD player;
    Tom would have to buy one or borrow one; the same for the neutral third
    party.

    If SACD is judged impractical, then I would suggest a CD test between two
    CD players judged likely to be audibly different...say an Arcam 27 versus a
    $300 Sony job. However, the equipment would have to be on long-term loan,
    since the testing would probably take at least six months to complete.

    We would also need neutral proctors to run the test and record scores.

    * * * * * * * * * * * * * * * *
    * * * *

    CONCLUSION
    There would be a fair amount of work needed to get this off the ground, but
    it is doable. In particular, I would want broad agreement within RAHE that
    it was worthwhile doing, I would want input from members on appropriate
    test SACDs or CDs and tracks for testing, and I would want myself, Tom, and
    the other participant to agree on the selections to be used.

    Your comments and suggestions and questions are hereby solicited.

    Harry Lavo
    "it don't mean a thing if it ain't got that swing" - Duke Ellington
     
    Harry Lavo, Feb 9, 2004
    #1

  2. chung

    chung Guest

    Why would DBT duplicate the results of sighted tests?
    Sorry, I don't see the question mark at all.
    You evaluate and compare. They are not mutually exclusive.
    And you still think that it has not been answered?
    Not a question worth answering. We all know that sighted and blind can
    give different results. How close is irrelevant.

    Not a question worth answering since the comparative test, as you put
    it, can be as relaxed and evaluative as you make it.
    Big OSAF. Prove that first. Others claim that DBT's don't work because
    the snippets are too short, the snippets are too long, the switching is
    too quick, the switching is too slow, the system does not have enough
    resolution, etc., etc. Your position as stated is not shared by the
    majority of DBT opponents. Even if you remove your concerns, others will
    have a different set of objections.

    snip
    Given my comments above, I myself don't find it worth doing. Since you
    seem to claim DBT's are not effective for audio, the burden of proof is
    on you. In other words, I don't find it worth doing, but go ahead if you
    think you can learn from doing this. I simply see no sense in wasting my
    effort proving something that has already been proven.
     
    chung, Feb 9, 2004
    #2

  3. outsor

    outsor Guest

    There is no reason to establish what has been demonstrated in all areas of
    human behavior research, including human hearing. But there is a simple,
    direct way to get at the validity of the "evaluation" listening test, more
    often called an audition. Using the traditional Stereophile experience of
    one man in a room with a notepad, a blind test can easily be done. As
    notes are said to have been taken each time listening was done over a
    period of days/weeks, use the notes as the test data. Using the current,
    well-known wire and the new wire to be "auditioned", simply randomly
    insert either wire into the system on each day of the "audition". If, on
    days when the same current wire happened to be re-inserted, remarkable
    differences in the perception of the music were said to be heard, or if
    the same remarkable perceptions were reported whichever wire was in use,
    well ... If there is some real change in perception because of the new
    wire, it should stand out in the notes like a sore thumb, and the same
    goes if the current wire produces it but the new one doesn't. So, are any
    of the "traditional audition" mags up for this test, or individuals for
    that matter?
     
    outsor, Feb 9, 2004
    #3
  4. Bruce Abrams

    Bruce Abrams Guest

    Here's the real question, as I see it. Let's say that two CD players, A
    and B, are being evaluated, and during sighted tests, evaluative or
    comparative, a subject states a preference for unit A, yet when blinded is
    unable to duplicate the sighted result. What conclusion will be drawn? My
    suspicion is that both subjectivist and objectivist alike will wring their
    hands with glee, shouting, "See...just like I told you."
     
    Bruce Abrams, Feb 9, 2004
    #4
  5. Steven Sullivan

    Steven Sullivan Guest

    Rather than all this business about getting consensus on RAHE about
    components and worthwhile-ness and such, why not just have Harry list some
    components/treatments he *already hears differences between*, and test
    *those* claims in a DBT, proctored by Tom? There's no need for an
    'evaluation' step there -- Harry's already done the 'evaluation'.

    --

    -S.

    "They've got God on their side. All we've got is science and reason."
    -- Dawn Hulsey, Talent Director
     
    Steven Sullivan, Feb 10, 2004
    #5
