For years, there was a shift towards avoiding expensive coprocessors and related hardware by having more and more work done by the CPU. The massive growth in single-core speeds, in Intel chips for example, made this sensible. Now that single-core speeds are no longer rising, forcing the move to multi-core, and now that power consumption is becoming more of an issue, a rethink is becoming pertinent. Way back when, mainframes would have things like I/O done by independent hardware subsystems, to avoid using expensive time on the main CPUs, and it seems this approach is now being rediscovered.
Firstly, especially in something like macOS, there has been progress towards offloading more and more of Quartz to the GPU. Many GUI tasks could quite happily be handled by a low-power ARM chip on the GPU itself. Already with programmable shaders, and now Vulkan, we are getting to the place where, for graphics, things are accomplished by sending programs, requests, and data buffers over a high-speed interconnect (usually the PCIe bus). To some degree, network-transparent graphics are being reinvented, though here the 'network' is the PCIe bus rather than 10BASE-T. Having something like an ARM core, with a few specialised bits, handle most drawing operations, and having much of the windowing and drawing live largely at the GPU end of the bus, is one step towards a more efficient architecture: for most of what your PC does, using an Intel Core is overkill and wasteful of power. Getting to a point where the main CPUs can be switched off when idling will save a lot of power. In addition, one can look to the mainframe architecture of old for inspiration.
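To make the 'programs and request buffers over the bus' idea concrete, here is a minimal sketch in C of a shared ring buffer that a host CPU fills and a GPU-side ARM core drains. All the names and layouts here are hypothetical, invented for illustration; no real driver API works exactly like this.

    #include <stdint.h>

    /* One drawing request, written by the host into a shared ring
     * buffer that the GPU-side core polls. (Hypothetical layout.) */
    struct draw_request {
        uint32_t opcode;        /* e.g. DRAW_RECT, BLIT, RUN_SHADER */
        uint32_t shader_handle; /* program previously uploaded to GPU memory */
        uint64_t data_addr;     /* bus address of vertex/pixel data */
        uint32_t data_len;      /* length of that buffer in bytes */
        uint32_t flags;
    };

    /* Ring buffer living in memory visible over the interconnect. */
    struct request_ring {
        volatile uint32_t head;        /* written by the host */
        volatile uint32_t tail;        /* written by the GPU-side core */
        struct draw_request slots[256];
    };

    /* Host side: enqueue a request; the GPU-side core dequeues and
     * executes it without the main CPU ever touching pixels. */
    static int submit(struct request_ring *ring, const struct draw_request *req)
    {
        uint32_t next = (ring->head + 1) % 256;
        if (next == ring->tail)
            return -1;              /* ring full; caller retries later */
        ring->slots[ring->head] = *req;
        ring->head = next;          /* publish; the core sees the new entry */
        return 0;
    }

The point of the sketch is that the main CPU's entire involvement in drawing reduces to a few memory writes; everything after that happens at the far end of the bus.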
Another part of that inspiration is to do similarly with I/O. Moving mounting/unmounting and filesystems off to another subsystem run by a small ARM (or similar) core makes a lot of sense; a sketch of what this looks like from the main CPU's side follows below. To the main CPU, this presents the appearance of a programmable DMA system, to which you merely need to send requests. The small I/O core doing this could be little different to the kind of few-dollar SoC we find in cheap smartphones. Moreover, it does not need the capacity for running arbitrary software (nor should it have it: since its job is more limited, it is more straightforward to lock it down).
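Here is a hedged sketch of that 'programmable DMA' view: the main CPU fills in a request slot and rings a doorbell, and the I/O core, which owns the filesystem, completes the transfer by DMA. Again, every name and field here is invented for illustration.

    #include <stdint.h>

    enum io_op { IO_READ, IO_WRITE, IO_MOUNT, IO_UNMOUNT };

    struct io_request {
        uint32_t op;              /* one of enum io_op */
        uint32_t handle;          /* file handle issued by the I/O core */
        uint64_t offset;          /* byte offset within the file */
        uint64_t buf_addr;        /* physical address for the DMA transfer */
        uint32_t len;             /* bytes to transfer */
        volatile uint32_t status; /* 0 = pending; I/O core writes completion */
    };

    /* Main-CPU side: post a read and wait. In a real system the wait
     * would be an interrupt-driven sleep, not a busy spin. */
    static void read_file(volatile uint32_t *doorbell, struct io_request *slot,
                          uint32_t handle, uint64_t off,
                          uint64_t buf, uint32_t len)
    {
        slot->op = IO_READ;
        slot->handle = handle;
        slot->offset = off;
        slot->buf_addr = buf;
        slot->len = len;
        slot->status = 0;
        *doorbell = 1;             /* notify the I/O core */
        while (slot->status == 0)  /* completion written by the I/O core */
            ;
    }

Notice that nothing in the request format lets the main CPU run arbitrary code on the I/O core: the interface is a small fixed set of operations, which is exactly what makes it easy to lock down.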
This puts you at a point where, especially if you do the 'big-core/little-core' thing within the GPU architecture itself, the system can boot to a usable GUI and command-line interface before the 'main processors' have even started up. Essentially you have something a bit like a Chromebook, with the traditional 'Central Processing Unit' becoming a coprocessor for handling user tasks.
I'd also go so far as to suggest moving what are traditionally the kernel's duties 'out-of-band': on a multi-core CPU, have a small RISC core handle kernel duties, and, so far as hyperthreading is concerned, have this 'out-of-band kernel' able to save/load state from the inactive thread on a hyperthreading core. (Essentially, if you have a 2-thread core, the chip then has a state cache for these threads, into which it can move them, and from there save/load thread state to main memory; importantly, much of the CPU overhead of a context switch is removed.)
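As a purely speculative sketch of that context-switch path, the idea is that the kernel core stages the next runnable thread into the inactive hyperthread's state cache while the active thread keeps running, so the big core never spends cycles spilling and filling registers itself. All structures and the final hardware swap here are hypothetical.

    #include <stdint.h>

    struct thread_state {
        uint64_t gprs[32];      /* general-purpose registers */
        uint64_t pc, sp, flags; /* program counter, stack pointer, flags */
        /* FP/vector state omitted for brevity */
    };

    struct big_core {
        struct thread_state active; /* thread currently executing */
        struct thread_state shadow; /* inactive hyperthread's state cache */
    };

    /* Runs on the small kernel core: drain the previous shadow state to
     * main memory, preload the next thread, then trigger a (hypothetical)
     * hardware swap of active and shadow to resume the new thread. */
    static void switch_to(struct big_core *core,
                          struct thread_state *save_to,    /* in main memory */
                          const struct thread_state *next) /* in main memory */
    {
        *save_to = core->shadow;   /* save the displaced thread's state */
        core->shadow = *next;      /* stage the next thread's state */
        /* hypothetical hardware operation: atomically exchange
         * core->active and core->shadow, resuming the new thread */
    }

The attraction is that the expensive part of a context switch, shuffling register state to and from memory, happens in parallel with useful work on the big core rather than stealing cycles from it.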