(Perceiving and recognising pitch values versus perceiving and recognising differences between pitch values ...)
We take relative pitch perception for granted, because most of us have it. And we are amazed at absolute pitch perception, because most people don't have it.
But it is an error to suppose that what is most common is also what is easiest to explain.
To explain absolute pitch perception, all you need is a reasonably direct connection between the sensory neurons in the ear and those parts of the brain responsible for conscious perception.
(Although at least one processing step is required for both relative and absolute pitch perception: it is necessary to calculate the estimated fundamental frequency from the observed harmonic frequencies.)
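The processing step mentioned above can be illustrated with a minimal sketch. It assumes the observed harmonics are already available as a clean list of frequencies in Hz, and it picks the largest candidate fundamental of which every observed frequency is close to an integer multiple. The function name `estimate_fundamental` and the parameters `max_divisor` and `tolerance` are hypothetical, chosen for illustration only. Nothing here is a claim about how the auditory system actually performs the calculation.

```python
def estimate_fundamental(harmonics, max_divisor=10, tolerance=0.03):
    """Estimate a fundamental frequency from observed harmonic frequencies.

    Illustrative assumption: the fundamental is the lowest observed
    frequency divided by some small integer. The first (largest) candidate
    that makes every observed frequency a near-integer multiple is returned.
    """
    lowest = min(harmonics)
    for divisor in range(1, max_divisor + 1):
        candidate = lowest / divisor
        # Distance of each frequency from its nearest integer multiple of
        # the candidate, measured in units of the candidate itself.
        if all(abs(f / candidate - round(f / candidate)) <= tolerance
               for f in harmonics):
            return candidate
    return None  # no candidate fitted within the tolerance


# Harmonics 2, 3 and 4 of a 200 Hz tone; the fundamental itself is absent.
print(estimate_fundamental([400.0, 600.0, 800.0]))
```

In the usage line, no component at 200 Hz is present in the input, yet 200 Hz is the only candidate that fits all three frequencies. This is why some estimation step is needed before either absolute or relative pitch can be perceived.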
To explain relative pitch perception, by contrast, you need quite a lot of perceptual "machinery".
This machinery must process the "raw" frequency perceptions and generate the "relativized" perceptions that underlie our normal perception of both musical melodies and speech melodies.
Firstly, you need machinery to calibrate the four-way comparison between pitch values that is involved in judging two musical intervals to be the same size.
A mechanism of calibration deals with the following question: Given an interval such as the interval from C to E, and a second interval such as the interval from F to A, where both intervals are perceived by me to be the "same" interval, how does my brain know that they are the same?
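To make the four-way comparison concrete, here is a minimal sketch of what the result of the calibration must amount to. It assumes standard equal-tempered frequencies (A4 = 440 Hz) and assumes that interval size can be measured as the logarithm of a frequency ratio. The function names and the `tolerance` parameter are hypothetical. The sketch shows only the target of the computation, not how the brain arrives at it, which is exactly the open question.

```python
import math


def interval_in_semitones(f_low, f_high):
    """Size of the interval between two frequencies, in equal-tempered semitones."""
    return 12.0 * math.log2(f_high / f_low)


def same_interval(a1, a2, b1, b2, tolerance=0.1):
    """Four-way comparison: is the interval a1 -> a2 the same size as b1 -> b2?

    The tolerance (in semitones) is an illustrative assumption.
    """
    difference = interval_in_semitones(a1, a2) - interval_in_semitones(b1, b2)
    return abs(difference) <= tolerance


# Approximate equal-tempered frequencies in Hz.
C4, E4, F4, G4, A4 = 261.63, 329.63, 349.23, 392.00, 440.00

print(same_interval(C4, E4, F4, A4))  # C to E versus F to A
print(same_interval(C4, E4, F4, G4))  # C to E versus F to G
```

With this arithmetic the comparison looks trivial, but only because the code is handed exact frequency values and a logarithm function. A perceptual system has neither for free, so it needs some way to ensure that equal ratios at different positions in the frequency range are registered as equal.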
Secondly, you need machinery to determine a "frame of reference" for a melody, where that frame of reference is a function of the melody itself.
For musical melodies, the frame of reference corresponds to the key (which generally corresponds to a particular scale and a "home" chord), which is typically constant for the duration of a tune. This constancy does not necessarily hold for non-musical speech melodies.
Any theory of relative pitch perception must explain how the frame of reference is determined from the melody in the musical case, where it corresponds to the key, and must also explain how it can be determined for a non-musical speech melody.
(In other words, a theory of frame-of-reference determination cannot assume in general that a melody is constructed entirely from pitch values belonging to a specific musical scale.)
Neither of these two requirements has a simple implementation. Like many other aspects of human cognition, something seems simple because we can "all do it", but when we try to understand in detail how it's done, it doesn't seem so simple anymore.