whisper : fix VAD processing for skipped audio segments (#3230)
This commit addresses an issue with token timestamps when audio segments
are skipped, in `whisper_exp_compute_token_level_timestamps` related to
the VAD processing and the energy levels.
The motivation for this is that the token timestamps exceed the energy
array bounds due to segment timing misalignment:
```console
(skipped introduction)
↓
Audio segment: [2600ms → 5600ms] (3 seconds of actual audio)
Energy array: [0 → 480652] (samples for 3 seconds)
Token timestamps: [3266ms → 3408ms] (absolute timestamps)
```
So both `s0` and `t1` get clamped to the maximum sample index (480652)
which causes the start/end timestamps to be the same for all the tokens
after a certain point.
This is addressed by using segment-relative timestamps in the
`timestamp_to_sample` and `sample_to_timestamp`.