Psychoacoustics & Digital Audio

How human perception and digital systems shape forensic audio analysis

Roadmap

  1. Psychoacoustics for Forensic Listening
    • Masking, critical bands, loudness, temporal limits, spatial cues
  2. Digital Audio Through a Perceptual Lens
    • Sampling, codecs, artifacts, perceptual vs. signal evidence
  3. Applied Forensic Context
    • When perception matters, courtroom implications, best practices

I. Psychoacoustics for Forensic Listening

Understanding the human auditory system’s capabilities and limitations

Why Psychoacoustics Matters in Forensics

  • Earwitness reliability: Can a witness have physically heard what they claim?
  • Enhancement assessment: Does “clearer” audio actually improve intelligibility?
  • Codec artifact identification: Is that noise original or introduced by compression?
  • Expert testimony: How do you explain perceptual limits to a jury?

Auditory Masking: Overview

Definition: One sound (masker) raises the threshold of hearing for another sound (probe/maskee), potentially making it inaudible

Three types critical for forensics:

  1. Simultaneous (spectral) masking
  2. Temporal masking (pre- and post-masking)
  3. Informational masking

Simultaneous (Spectral) Masking

Simultaneous masking curves illustrating upward spread of masking

  • Mechanism: Cochlear overlap—strong sound creates peak on basilar membrane that "drowns out" weaker nearby signals
  • Upward spread of masking: Low frequencies mask high frequencies much more effectively than the reverse
  • Critical bands: Masking strongest within auditory filter bandwidths
  • Thresholds: Masking is asymmetric. A tonal masker must exceed noise by ~14.5 dB SMR to hide it; a noise masker hides a tone at only ~5 dB SMR

Temporal Masking

Pre-masking (backward masking):

  • Sound masked even if it occurs before the masker
  • The louder masker is processed faster and overtakes the earlier, weaker signal in the auditory pathway
  • Time window: 5–20 ms

Post-masking (forward masking):

  • Sound masked for duration after masker stops
  • Caused by basilar membrane “ringing” and neural recovery time
  • Time window: 100–200 ms (up to 500 ms for some effects)

Informational Masking

Not energetic overlap—higher-level perceptual interference

  • Occurs in central nervous system, not cochlea
  • Caused by:
    • Similarity between target and masker
    • Attention limitations
    • Auditory Scene Analysis (ASA) failures
  • Example: “cocktail party problem,” where listeners fail to separate competing voices even though the signals are physically distinguishable

Forensic Masking: Key Questions

When analyzing audio evidence, ask:

  1. Could the witness have heard this given the noise floor?
  2. Is “inaudible” speech actually below the masking threshold, or just quiet?
  3. Has lossy compression masked subtle background evidence?
  4. Would enhancement shift the masking relationship and make inaudible sounds audible?

Critical Bands and Auditory Filters

Critical bands along the basilar membrane

  • Physiological basis: Basilar membrane is narrow/stiff at base (high freq), wide/flexible at apex (low freq)
  • Tonotopic organization: Spatial frequency mapping—different locations code different frequencies
  • Critical bands: Overlapping bandpass filters; within a band, ear integrates energy as single unit
  • Bandwidth changes: ~100 Hz constant below 500 Hz; above 500 Hz, ~20% of center frequency
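
The bandwidth rule of thumb above can be written as a small helper. This is a minimal sketch that encodes only the slide's approximation (about 100 Hz below 500 Hz, about 20% of center frequency above); the function names are hypothetical, and published auditory-filter models are smoother than this piecewise rule.

```python
def critical_bandwidth_hz(center_hz: float) -> float:
    """Rule-of-thumb critical bandwidth: ~100 Hz below 500 Hz, ~20% of center above."""
    if center_hz <= 0:
        raise ValueError("center frequency must be positive")
    return 100.0 if center_hz < 500.0 else 0.2 * center_hz


def same_critical_band(f1_hz: float, f2_hz: float) -> bool:
    """Rough check: are two tones closer together than one critical bandwidth?

    The band is taken to be centered midway between the tones (an assumption
    made for illustration only).
    """
    center_hz = (f1_hz + f2_hz) / 2.0
    return abs(f1_hz - f2_hz) <= critical_bandwidth_hz(center_hz)
```

Under this rule, tones at 1000 Hz and 1100 Hz fall in one band and are likely to interact strongly, while 1000 Hz and 1500 Hz do not.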

Perceptual Frequency Scales

Scale | Basis | Primary Forensic Use
Bark Scale | 24 critical bands; 1 Bark = 1 critical band | Audio compression (MP3 standard); loudness modeling
Mel Scale | Pitch perception; 1000 mels = 1000 Hz at 40 dB SPL | Speech recognition; speaker identification (MFCCs)
ERB Scale | Equivalent Rectangular Bandwidth; smoother auditory filter refinement | Noise reduction algorithms; high-resolution psychoacoustic research
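
As an illustration of how these scales are computed in practice, the sketch below uses commonly cited closed-form approximations. The specific formulas are assumptions not stated in the slides: O'Shaughnessy's mel formula, the Zwicker and Terhardt Bark approximation, and the Glasberg and Moore ERB expressions. Other variants exist and give slightly different values.

```python
import math


def hz_to_mel(f_hz: float) -> float:
    """Mel scale (O'Shaughnessy form); maps 1000 Hz to roughly 1000 mels."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def hz_to_bark(f_hz: float) -> float:
    """Bark scale (Zwicker and Terhardt approximation)."""
    return 13.0 * math.atan(0.00076 * f_hz) + 3.5 * math.atan((f_hz / 7500.0) ** 2)


def erb_bandwidth_hz(f_hz: float) -> float:
    """Equivalent rectangular bandwidth in Hz (Glasberg and Moore)."""
    return 24.7 * (4.37 * f_hz / 1000.0 + 1.0)


def hz_to_erb_rate(f_hz: float) -> float:
    """ERB-rate scale: number of ERBs below the given frequency."""
    return 21.4 * math.log10(1.0 + 0.00437 * f_hz)
```

With these approximations, 20 kHz lands between 24 and 25 Bark, consistent with the 24 critical bands in the table.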

Equal-Loudness Contours (ISO 226)

ISO 226:2003 equal-loudness contours

  • Phon: Loudness level; 1 phon = 1 dB SPL at 1 kHz
  • Sone: Subjective loudness; 1 sone = 40 phons; +10 phons = ×2 sones
  • Most sensitive: 2–5 kHz (ear canal resonance)
  • Least sensitive: Below 100 Hz and above 10 kHz
  • Level dependency: Curves flatten at high volumes—response more consistent across frequencies
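
The phon and sone relationship above (1 sone at 40 phons, doubling every 10 phons) can be sketched as a pair of conversions. The function names are hypothetical, and the doubling rule is only treated as valid at or above 40 phons here.

```python
import math


def phon_to_sone(phon: float) -> float:
    """Loudness in sones from loudness level in phons (doubling per +10 phons)."""
    if phon < 40.0:
        raise ValueError("the doubling rule is only applied at or above 40 phons")
    return 2.0 ** ((phon - 40.0) / 10.0)


def sone_to_phon(sone: float) -> float:
    """Inverse conversion, for loudness of 1 sone or more."""
    if sone < 1.0:
        raise ValueError("the doubling rule is only applied at or above 1 sone")
    return 40.0 + 10.0 * math.log2(sone)
```

For example, a sound at 60 phons is 4 sones, so it is judged about four times as loud as a 40-phon sound even though the level difference is only 20 dB at 1 kHz.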

Forensic Implications: Equal-Loudness

Gain normalization & playback

  • Audio sounds "thin" at low volumes—bass and treble fall below threshold before midrange
  • AGC or compression used to normalize levels for court playback
  • Caution: excessive normalization can obscure spatial cues (distance, orientation)

A-weighting

  • Standard sound level measurements use A-weighting (approximate inverse of the 40-phon curve)
  • Reflects how environmental noise actually impacts human listeners
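
To make the weighting concrete, the sketch below evaluates the standard analog A-weighting magnitude response. The pole frequencies (20.6, 107.7, 737.9, and 12194 Hz) and the +2.00 dB normalization come from the usual sound level meter definition, not from the slides, and the function name is hypothetical.

```python
import math


def a_weighting_db(f_hz: float) -> float:
    """A-weighting gain in dB at frequency f_hz (about 0 dB at 1 kHz)."""
    if f_hz <= 0:
        raise ValueError("frequency must be positive")
    f2 = f_hz ** 2
    numerator = (12194.0 ** 2) * f2 ** 2
    denominator = (
        (f2 + 20.6 ** 2)
        * math.sqrt((f2 + 107.7 ** 2) * (f2 + 737.9 ** 2))
        * (f2 + 12194.0 ** 2)
    )
    return 20.0 * math.log10(numerator / denominator) + 2.00
```

At 100 Hz the curve attenuates by roughly 19 dB, which is why low-frequency hum contributes little to a dBA reading even when it is prominent in the raw signal.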

Codec artifacts

  • Lossy formats exploit equal-loudness to hide quantization noise in less-sensitive bands
  • "Birdie noise" or artifacts in these bands can be misinterpreted as original evidence

Temporal Resolution and Transients

Loudness integration window: ~200 ms

  • Ear integrates sound energy over ~200 ms
  • Brief sounds shorter than this may seem quieter than actual SPL

Temporal fusion and pitch: ~30 ms

  • Integration window for pitch perception and timbral fusion
  • Sounds separated by <30 ms may fuse into single event

Transient detection thresholds

  • Muzzle blast: 1–3 ms
  • Ballistic shock wave: hundreds of μs
  • Minimum detectable discontinuity: ~2 ms cross-fade can conceal clicks from listeners (but not spectral analysis)

Micro-Edit Detection

Butt splice

  • Abrupt deletion or insertion
  • Creates vertical line across spectrogram (broadband energy)
  • Audible click if during loud passage; visual if during silence

Cross-fade

  • ~2 ms blend smooths samples and eliminates click
  • Background-consistency analysis can still reveal edits

Background forensics

  • Reverb gaps: inserting “dry” speech into a reverberant recording leaves an unnatural gap in the reverberant tail
  • Background shifts: abrupt changes in noise texture or disappearance of continuous tones (e.g., 60 Hz hum)

Spatial Hearing and Localization

Interaural Time Difference (ITD)

  • Difference in arrival time between ears
  • Max ITD: ~0.6 ms (sufficient for full lateral displacement)
  • Most effective below 1.5 kHz (fine-structure phase sensitivity)
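
The maximum ITD figure can be sanity-checked with Woodworth's spherical-head approximation. This is a sketch under stated assumptions: the head radius (0.0875 m), the speed of sound (343 m/s), and the spherical-head model itself are not given in the slides, and the formula is only meant for azimuths from 0 to 90 degrees.

```python
import math


def woodworth_itd_s(
    azimuth_deg: float,
    head_radius_m: float = 0.0875,       # assumed average head radius
    speed_of_sound_m_s: float = 343.0,   # assumed, air at about 20 degrees C
) -> float:
    """Interaural time difference in seconds for a distant source (spherical head)."""
    if not 0.0 <= azimuth_deg <= 90.0:
        raise ValueError("this approximation is used for 0 to 90 degrees azimuth")
    theta = math.radians(azimuth_deg)
    return (head_radius_m / speed_of_sound_m_s) * (theta + math.sin(theta))
```

With these assumed values a source directly to one side (90 degrees) gives an ITD between 0.6 and 0.7 ms, in line with the figure above.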

Interaural Level Difference (ILD/IAD)

  • Head shadowing reduces intensity at the far ear
  • ILD of 10–20 dB moves auditory image to one side
  • Dominant above 1 kHz (wavelength small relative to head)

Cone of confusion

  • Locations with identical ITD and ILD → ambiguous (front/back, above/below)
  • Resolved by spectral cues from pinna (directional bands, notches 5–10 kHz)

Precedence Effect (Haas Effect)

Law of the first wavefront

  • First sound to reach the ear dominates localization perception

Time windows: 1–30 ms

  • If reflection arrives within this window, brain fuses it with direct sound
  • Localization set by first arrival, even if reflection is up to 10 dB louder

Forensic implication

  • Shooter location determined by direct path, even if wall reflections are energetic
  • Multilateration uses measured TDOA (time difference of arrival), not perceived location
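
A measured TDOA is typically obtained by cross-correlating two microphone channels. The sketch below is a brute-force illustration with hypothetical function names; practical systems usually use FFT-based or generalized cross-correlation, and the 343 m/s speed of sound is an assumption.

```python
def estimate_delay_samples(ref, delayed, max_lag):
    """Lag (in samples) that maximizes the cross-correlation.

    A positive result means `delayed` receives the sound later than `ref`.
    """
    best_lag, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for n, x in enumerate(ref):
            m = n + lag
            if 0 <= m < len(delayed):
                score += x * delayed[m]
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


def tdoa_to_path_difference_m(lag_samples, sample_rate_hz, speed_of_sound_m_s=343.0):
    """Convert a delay in samples to a difference in acoustic path length."""
    return lag_samples / sample_rate_hz * speed_of_sound_m_s


# Synthetic example: the same short pulse reaches microphone B 7 samples later.
pulse = [1.0, -0.8, 0.5, -0.3]
mic_a = [0.0] * 64
mic_b = [0.0] * 64
for k, value in enumerate(pulse):
    mic_a[10 + k] = value
    mic_b[17 + k] = 0.6 * value  # attenuated copy

lag = estimate_delay_samples(mic_a, mic_b, max_lag=20)
path_difference_m = tdoa_to_path_difference_m(lag, sample_rate_hz=48000)
```

Each microphone pair yields one path-length difference, and multilateration intersects the resulting hyperbolas to locate the source.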

Stereo Artifacts from Tampering

Talker discontinuity

  • Abrupt changes in perceived level or orientation without logical movement

Reverberant mismatches

  • "Dry" recording inserted into reverberant original lacks reverberant tail
  • Visually obvious on spectrogram, aurally detectable

Binaural unmasking artifacts

  • Uncorrelated quantization noise masked in mono becomes audible in stereo (BMLD)
  • Creates "fizzing" in fade-outs or quiet passages

AGC pumping

  • Background noise audibly "pumps" as AGC tries to keep speech constant

Exercise 1: Critical Listening for Masking (Optional)

Task: Listen to provided audio example with simultaneous masker and probe tones

Questions:

  1. At what Signal-to-Mask Ratio (SMR) does the probe become audible?
  2. Is the masking symmetric or does it show upward spread?
  3. If you were an expert witness, how would you explain this to a jury?

II. Digital Audio Through a Perceptual Lens

How sampling, quantization, and perceptual coding affect forensic evidence

Sampling and the Nyquist Theorem

Nyquist-Shannon Sampling Theorem

  • To avoid loss of information: Fs > 2Fmax (sample rate strictly above twice the highest signal frequency)
  • Nyquist frequency: Fs / 2 (all meaningful components must be below this)

Standard rates

  • 44.1 kHz (CD): captures up to ~22 kHz (covers hearing to ~20 kHz)
  • 48 kHz (professional video): standard for forensic work
  • 96 kHz (high-resolution): used for specialized analysis

Aliasing

  • If signal contains frequencies > Nyquist, they’re misinterpreted as lower “ghost” signals
  • Prevented by anti-aliasing filters before sampling
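
The frequency at which an aliased component appears can be computed by folding around the Nyquist frequency. A minimal sketch, assuming an ideal sampler with no anti-aliasing filter; the function name is hypothetical.

```python
def aliased_frequency_hz(f_hz: float, sample_rate_hz: float) -> float:
    """Apparent frequency of a tone after sampling with no anti-aliasing filter."""
    if sample_rate_hz <= 0:
        raise ValueError("sample rate must be positive")
    folded = f_hz % sample_rate_hz
    # Components in the upper half of the band mirror back below Nyquist.
    return min(folded, sample_rate_hz - folded)
```

For example, a 30 kHz component sampled at 44.1 kHz shows up as a 14.1 kHz "ghost", while a 1 kHz tone is unaffected.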

Quantization and Dynamic Range

6 dB per bit rule

  • Each bit increases dynamic range by ~6 dB
  • 16-bit: 65,536 levels, ~96 dB (CD quality)
  • 24-bit: 16.7 million levels, ~144 dB (professional standard)
  • 32-bit float: >1500 dB effective DR (internal processing, prevents clipping)
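
The 6 dB per bit rule follows from taking 20·log10 of the number of quantization levels. The sketch below reproduces the integer PCM figures above; it ignores the extra 1.76 dB term that appears in the full-scale sine SNR formula, and the function name is hypothetical.

```python
import math


def dynamic_range_db(bits: int) -> float:
    """Ratio of full scale to one quantization step, in dB, for integer PCM."""
    if bits < 1:
        raise ValueError("bit depth must be at least 1")
    return 20.0 * math.log10(2 ** bits)


levels_16 = 2 ** 16               # 65,536 levels
range_16 = dynamic_range_db(16)   # about 96.3 dB
range_24 = dynamic_range_db(24)   # about 144.5 dB
```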

Low bit-depth consequences

  • Quantization noise: error between actual analog value and rounded digital step
  • Low bit depth → “fizzing”/“grainy” quiet passages
  • Extremely low resolution → correlated error causes harmonic distortion

Dither and Noise Shaping

  • Dither: low-level noise added before quantization to randomize rounding error
  • Prevents correlated distortion by decorrelating error from signal
  • Types:
    • Rectangular PDF: basic, effective
    • Triangular PDF (TPDF): better, also eliminates noise modulation (error power no longer varies with the signal)
    • Noise-shaped dither: shapes noise spectrum to minimize audibility (pushes noise to less sensitive frequencies)
  • Noise shaping: redistributes quantization noise to frequency bands where ear is less sensitive (e.g., very low and very high frequencies)
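
The effect of dither can be seen with a signal smaller than one quantization step. In the sketch below (hypothetical helper names, step size of 1 LSB, fixed random seed for repeatability), a constant input of 0.3 LSB always rounds to zero without dither, but with TPDF dither the average of the quantized output recovers the input.

```python
import random


def quantize(x: float, step: float = 1.0) -> float:
    """Round to the nearest quantization step."""
    return step * round(x / step)


def tpdf_dither(rng: random.Random, step: float = 1.0) -> float:
    """Sum of two uniform values in [-step/2, step/2): triangular PDF."""
    return (rng.random() - 0.5) * step + (rng.random() - 0.5) * step


def mean_quantized(x: float, n: int, dithered: bool, seed: int = 0) -> float:
    """Average quantized value of a constant input over n samples."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        d = tpdf_dither(rng) if dithered else 0.0
        total += quantize(x + d)
    return total / n


undithered = mean_quantized(0.3, 20000, dithered=False)  # information lost
dithered = mean_quantized(0.3, 20000, dithered=True)     # close to 0.3 on average
```

The price is a slightly higher but signal-independent noise floor, which is usually preferable to distortion that tracks the signal.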

Lossy Codecs: Psychoacoustic Models

How MP3, AAC, and Opus work:

  1. Split signal into frequency bands with a filter bank (e.g., MDCT)
  2. Psychoacoustic model determines masking threshold
  3. Allocate bits: Keep signal above threshold, discard below
  4. Use noise shaping to prioritize critical frequencies (2–5 kHz)
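
Steps 2 and 3 can be illustrated with a toy bit allocator. This is a deliberately simplified sketch, not how any real encoder is implemented: it assumes per-band signal and masking levels in dB are already known, gives zero bits to bands below the masking threshold, and otherwise applies the ~6 dB per bit rule to the signal-to-mask ratio. Real codecs use iterative rate and distortion loops under a bitrate budget.

```python
import math


def allocate_bits(signal_db, mask_db, db_per_bit=6.02):
    """Toy allocation: enough bits per band to push quantization noise below the mask."""
    bits = []
    for level, threshold in zip(signal_db, mask_db):
        smr = level - threshold          # signal-to-mask ratio in dB
        if smr <= 0:
            bits.append(0)               # band is masked: discard it
        else:
            bits.append(math.ceil(smr / db_per_bit))
    return bits


# Hypothetical per-band levels (dB) for four bands.
band_signal_db = [60, 40, 30, 55]
band_mask_db = [35, 45, 30, 20]
band_bits = allocate_bits(band_signal_db, band_mask_db)
```

The second and third bands receive no bits at all, which is the mechanism by which quiet background detail can vanish from a lossy recording.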

The Hybrid Filter Bank

Advanced Codec Techniques

Technique | Function | Forensic Impact
Spectral Band Replication (SBR) | Removes high frequencies during encoding, reconstructs by transposing low frequencies | Gunshots or high-pitched speech may sound “natural” even if HF data was never recorded
Perceptual Noise Substitution (PNS) | Replaces noise-like bands with random noise parameters | Loss of subtle background “fingerprints” used to identify recording location
Joint Stereo (M/S) | Converts L/R to Sum (M) and Difference (S) to save bits | Can create stereo artifacts; intensity stereo replaces correlated HF with envelope + directional cues

Codec Artifacts: What to Watch For

Pre-echo

  • Quantization noise from sharp transient spreads backward within analysis window
  • Appears as noise before the actual sound
  • Modern codecs use Temporal Noise Shaping (TNS) or shorter windows to minimize
  • Tell-tale sign of lossy processing

Spectral holes (“birdies”)

  • At low bitrates, encoder “runs out of bits”
  • Fails to encode certain spectral lines
  • Tonal whistling/tinkling artifacts that move across spectrum

Aliasing

  • Sample rate too low or filter bank poorly implemented
  • High-frequency components misinterpreted as lower-frequency “ghosts”

Re-encoding buildup

  • Every lossy re-save (e.g., MP3 → edit → MP3) accumulates distortion
  • Can obscure subtle background speech or timestamps

Time-Frequency Tradeoff

Uncertainty Principle: Impossible to achieve arbitrarily high resolution in both time and frequency simultaneously

  • Long window: Good frequency resolution (distinguish close frequencies), poor time resolution (blurred edges)
  • Short window: Good time resolution (sharp edges, transients), poor frequency resolution (can't distinguish close frequencies)
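
For an STFT the tradeoff can be quantified directly from the window length. A minimal sketch with a hypothetical function name; it reports nominal bin spacing and window duration and ignores the additional main-lobe widening caused by the window shape.

```python
def stft_resolution(window_samples: int, sample_rate_hz: float):
    """Return (frequency bin spacing in Hz, window duration in ms)."""
    if window_samples < 1 or sample_rate_hz <= 0:
        raise ValueError("window length and sample rate must be positive")
    freq_res_hz = sample_rate_hz / window_samples
    time_res_ms = 1000.0 * window_samples / sample_rate_hz
    return freq_res_hz, time_res_ms


long_window = stft_resolution(4096, 48000)   # fine in frequency, ~85 ms in time
short_window = stft_resolution(64, 48000)    # 750 Hz bins, ~1.3 ms in time
```

The product of the two resolutions is constant, so a 4096-sample window suits ENF or steady tones while a 64-sample window suits muzzle blasts and splice clicks.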

STFT (uniform)

  • Continuous tones
  • ENF analysis
  • Steady voices

Wavelets (multi-resolution)

  • Gunshot classification
  • Transient onset detection

Auditory filterbanks (non-uniform)

  • Assessing audibility
  • Masking analysis
  • Earwitness evaluation

Perceptually Informed vs. Visually Driven Analysis

The pitfall

  • Over-reliance on spectrograms without auditory verification

Visual-only risks

  • Mistaking codec pre-echo for physical event
  • Isolating sound from context (missing perceptual cues)
  • Misinterpreting “birdie noise” as original evidence
  • Assuming “clearer” always means more intelligible

Best practice

  • Alternate between visual (spectrogram) and aural (critical listening)
  • Use visual analysis to guide listening, not replace it
  • Quality ≠ intelligibility

Exercise 2: Codec Artifact Identification (Optional)

Task: Compare original uncompressed audio with MP3 versions at different bitrates

Analysis:

  1. Identify pre-echo artifacts before sharp transients (use short STFT window)
  2. Locate spectral holes (“birdies”) at low bitrates
  3. Measure noise floor changes across different encodings

Forensic question: If this were evidence, how would you explain these artifacts to a jury? Would you recommend enhancement or caution against it?

III. Applied Forensic Context

When does perception matter? When does signal evidence dominate?

Perceptual Audibility vs. Signal Evidence

Feature | Perceptual Audibility | Signal Evidence
Primary utility | Evaluating earwitness testimony; detecting codec artifacts | Geometric reconstruction (multilateration); calculating speed (Doppler effect)
Limitations | Ear integrates over ~200 ms; very brief sounds seem quieter than they are | High-amplitude sounds can be clipped or distorted by recorders
Forensic conflict | “Quality” (how nice it sounds) ≠ “intelligibility” (understandability) | Processed audio may sound “cleaner” but have lower speech intelligibility

Key principle: Use perceptual analysis for testimony evaluation and enhancement assessment; use signal analysis for geometric and physical reconstruction

Inaudible but Measurable: When It Matters

ENF (Electrical Network Frequency)

  • 60 Hz (US) or 50 Hz (Europe) power grid hum
  • Often inaudible/masked, but fluctuations serve as a timestamp
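
To show how small ENF fluctuations can be measured, the sketch below scans a fine frequency grid around the nominal mains frequency and picks the strongest component. This is a simplified illustration with a hypothetical function name: practical ENF extraction usually band-pass filters the hum, analyzes successive frames, and compares the resulting track against a reference grid database.

```python
import cmath
import math


def estimate_hum_hz(samples, sample_rate_hz, lo_hz=59.0, hi_hz=61.0, step_hz=0.01):
    """Frequency in [lo_hz, hi_hz] with the largest Fourier magnitude."""
    best_f, best_mag = lo_hz, -1.0
    steps = int(round((hi_hz - lo_hz) / step_hz))
    for k in range(steps + 1):
        f = lo_hz + k * step_hz
        w = -2j * math.pi * f / sample_rate_hz
        acc = sum(x * cmath.exp(w * n) for n, x in enumerate(samples))
        if abs(acc) > best_mag:
            best_f, best_mag = f, abs(acc)
    return best_f
```

Running the estimator on consecutive frames of a recording gives a frequency-versus-time track; the search range would be centered on 50 Hz for European recordings.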

Multilateration (TDOA)

  • Reflections perceptually fused with direct sound (precedence effect)
  • Measured TDOA reveals geometry

Ballistic shock waves

  • Supersonic projectile N-wave (hundreds of μs duration)
  • Can be temporally masked but measurable via wavelet analysis

Spectral signatures

  • Revolver cylinder gap impulsive sound
  • May be missed in casual listening but detectable in waveform

Courtroom Implications

“CSI Effect”

  • Juries expect “magical” clarity from poor recordings

Standards

  • Seven Tenets of Authenticity (U.S. v. McKeever, 1958)
  • FBI 12-Step Procedure
  • Watergate Procedure

Expert responsibilities

  • Manage expectations: explain material limitations
  • Use lay language: make acoustics understandable
  • State limitations: what cannot be determined scientifically
  • Avoid “golden ear”: findings must be verifiable/reproducible
  • Neutrality: testify to facts and interpretation only

Forensic Listening Protocols

Laboratory setup

  • Acoustically isolated, quiet environment (ambient noise < 25 dBA SPL)
  • High-quality, spectrally flat headphones
  • Moderate playback levels (avoid acoustic reflex; can reduce sensitivity by up to 20 dB)

Iterative audition

  1. Listen to entire recording for context
  2. Replay specific segments multiple times
  3. Focus on foreground sounds (speech)
  4. Shift to background sounds (room tone, distant sounds)—harder to forge consistently

Cognitive bias mitigation

  • Expectation bias: case knowledge can pre-condition perception
  • Use Linear Sequential Unmasking (LSU) when possible—analyze audio before learning case context

Best Practices for Evidence Presentation

Playback calibration

  • Verify listening environment is appropriate
  • Use calibrated monitoring (not laptop speakers)
  • Provide both original and enhanced versions
  • Document all processing steps

Enhancement caution

  • “Clearer” audio is not necessarily more intelligible
  • Can lend false credibility to a transcript that is wrong
  • Require objective evidence that enhancement improves intelligibility

Hash verification

  • Cryptographic hash (SHA-256 preferred; MD5 is collision-prone) to confirm data integrity
  • Chain of custody documentation for every transfer and access
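
Hash verification is straightforward to script. The sketch below uses Python's standard `hashlib` with SHA-256; the function names are hypothetical, and the recorded hash is assumed to come from the chain-of-custody documentation.

```python
import hashlib


def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Hex SHA-256 digest of a file, read in chunks so large recordings fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_recorded_hash(path: str, recorded_hex: str) -> bool:
    """Compare a file's digest with the hash recorded at acquisition."""
    return sha256_of_file(path) == recorded_hex.strip().lower()
```

Hash the file before any processing and again after each transfer; work only on copies so the original digest remains valid.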

Format standards

  • Uncompressed PCM (WAV), 16-bit minimum, ≥16 kHz sampling
  • Avoid lossy re-encoding during processing

Exercise 3: Forensic Decision-Making (Optional)

Scenario: You receive a 128 kbps MP3 recording of an alleged confession.

Defense transcript: “I kept watching her.”
Prosecution transcript: “I killed Winchester.”

Your tasks

  1. What perceptual and signal analyses would you perform?
  2. How would you assess the reliability of the transcript?
  3. What would you tell the court about the limitations of the evidence?
  4. If asked to enhance, would you recommend it? Why or why not?

Key Takeaways

  1. Auditory masking determines what is perceptually audible (earwitness evaluation, codec analysis)
  2. Critical bands & perceptual scales shape frequency perception (spectral interpretation)
  3. Equal-loudness contours explain “thin” low-volume audio (gain normalization)
  4. Sampling & quantization define digital precision (forensically adequate bit depth/sample rate)
  5. Lossy codecs exploit psychoacoustics (separate artifacts from evidence)
  6. Time–frequency tradeoffs drive tool choice (STFT, wavelets, auditory filterbanks)
  7. Perceptual audibility ≠ signal evidence (use each appropriately)
  8. Experts communicate limits clearly (avoid “golden ear”, manage expectations)

Discussion Questions

  1. When to prioritize perceptual audibility vs. signal evidence?
  2. How to explain temporal masking to a non-technical jury?
  3. Given a 64 kbps MP3, what analyses assess reliability?
  4. Ethical duties when tampering can’t be ruled in/out?
  5. How can psychoacoustics help prevent wrongful convictions?

Further Resources

Psychoacoustics & Perception

  • Zwicker & Fastl (1999). Psychoacoustics: Facts and Models
  • Moore (2012). An Introduction to the Psychology of Hearing

Digital Audio & Codecs

  • Brandenburg (1999). MP3 and AAC Explained
  • Bosi & Goldberg (2003). Introduction to Digital Audio Coding and Standards

Forensic Audio

  • SWGDE Best Practices for Forensic Audio (2022)
  • SWGDE Core Competencies for Forensic Audio (2025)
  • Fraser & Stevenson (2014). The power and persistence of contextual priming

Standards

  • ISO 226:2003 (Equal-loudness contours)
  • ITU-R BS.1770 (Loudness measurement)