The trick is that all operations are packetized. Let's assume the CPU wants to write some data to a device. It forwards the order to the PCI Express bridge which then creates a packet. The packet contains the address and data to be written and is forwarded serially to the targeted device, which de-packetizes the write order and executes it.
What if the CPU wants to read?
Same thing, the bridge forwards a packet to the targeted device, which now has to execute the read, create a return packet and send it to the bridge.
All that is pretty straighforward to do in practice thanks for the help from...
Getting packets flowing reliably along the wires takes some magic. As packets are transmitted serially at very high speed, they have to be de-serialized/assembled, decoded at the destination (remove the 8b/10b encoding), de-interleaved (if multiple lanes are used), and checked against line corruption (CRC checks).
Sounds complicated? It probably is. The thing is, we don't really care because most of the complexity is handled in the "PCI Express stack", composed of three layers.
The first two layers are the ones implemented for us in the PCI Express FPGA core (usually a combination of hard and soft core) and handling all the complexity. As a user, we work only in the transaction layer, where life is easy, the sky is blue and girls are beautiful.
In more details:
Good data? That's great, that's what we want!
Let's see how working in the transaction layer looks like.