Discussion:
[3.16-rcX][pciehp][radeon] PCIe HotPlug conflicts with radeon GPU
Bjorn Helgaas
2014-09-11 22:26:21 UTC
[+cc linux-pci]
Hello devs,
There are two issues I am encountering with the PCIe Hotplug driver on my Lenovo Laptop (W500). I note this goes back further than 3.15.
http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=f244d8b623dae7a7bc695b0336f67729b95a9736
https://bugzilla.kernel.org/show_bug.cgi?id=79701
https://bugzilla.kernel.org/show_bug.cgi?id=77261
1) If I enable the device to use both the integrated and discrete GPU, pciehp force-unloads radeon when the GPU puts itself into a power-saving state, and the system falls back to the Intel integrated GPU. This does not happen if I load radeon.ko with runpm=0 (no power management); then pciehp won't touch it.
2) If the Radeon GPU resets and you use pci_reset=1 as a kernel module option, pciehp force-unloads radeon even though the GPU is trying to set itself up again after failing.
Kernel I am using right now: 3.16.0-0.rc7.git3.1.fc21.x86_64 (about to boot into snapshot kernel-core-3.16.0-0.rc7.git4.1.fc21.x86_64)
Hi Shawn,

Thanks for the report and sorry that it got dropped. But I see you're
cc'd on https://bugzilla.kernel.org/show_bug.cgi?id=79701, so you've
probably seen the work there. If you can try out the patches I just
posted, that would be great.

Bjorn
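
For anyone reproducing item 1 of the report, it may help to confirm which runpm value the loaded radeon module is actually using before and after applying the runpm=0 workaround. The following is a minimal sketch, assuming the parameter is exported under /sys/module/radeon/parameters/ (that depends on the permissions the driver gives the parameter). The helper name read_module_param is made up for this example.

```python
from pathlib import Path
from typing import Optional


def read_module_param(module: str, param: str,
                      sysfs_root: str = "/sys/module") -> Optional[str]:
    """Return the current value of a module parameter, or None if unavailable.

    Hypothetical helper: it only reads <sysfs_root>/<module>/parameters/<param>.
    """
    path = Path(sysfs_root) / module / "parameters" / param
    try:
        return path.read_text().strip()
    except OSError:
        # Module not loaded, parameter not exported, or no read permission.
        return None


if __name__ == "__main__":
    value = read_module_param("radeon", "runpm")
    if value is None:
        print("radeon runpm: not available")
    else:
        print(f"radeon runpm = {value}")
```

The sysfs_root argument exists only so the helper can be pointed at a scratch directory for testing; on a real system the default should be what you want.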
Shawn Starr
2014-09-23 18:53:35 UTC
Hi Bjorn,

I will test this in 3.17-rcX if it lands in 3.17; otherwise I will patch
it in manually.

Thanks,
Shawn
Shawn Starr
2014-10-11 19:37:09 UTC
Hi Bjorn,

For #1) This is fixed in linux-next (tracking the
3.18.0-0.rc0.git1.2.fc22.1.x86_64 nondebug kernel for Fedora). PCIe HotPlug
no longer unloads radeon. We can close the bugzilla report for this one.

#2) This still has weird results, however. radeon.hard_reset=1 is
experimental, and while it attempts to reset the GPU, PCIe HotPlug seems to
interfere with it.

This can be tested by adding radeon.hard_reset=1 to the kernel command line
in grub. Once X has started, trigger a reset with cat
/sys/kernel/debug/dri/#/radeon_gpu_reset. The first cat outputs 0; a second
cat shows 1.

Attempt to drag a window. This will trigger a GPU reset, but it fails to
recover. It's unknown whether PCIe HotPlug is preventing a proper reset, but
there are pciehp calls in the stack trace.
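
The reproduction steps above leave the DRI minor number as "#". A small script can locate the node without guessing that number. This is only a sketch: it assumes debugfs is mounted at /sys/kernel/debug and that you run it as root. The function names and the --trigger flag are invented for this example. Going by the steps above, reading the node is what requests the reset, so the script only lists nodes unless --trigger is passed.

```python
import glob
import os
import sys
from typing import List


def find_reset_nodes(debugfs_root: str = "/sys/kernel/debug") -> List[str]:
    """List every dri/<N>/radeon_gpu_reset file under the given debugfs root."""
    pattern = os.path.join(debugfs_root, "dri", "*", "radeon_gpu_reset")
    return sorted(glob.glob(pattern))


def read_reset_node(path: str) -> str:
    """Read the node once and return its contents without the trailing newline."""
    with open(path) as f:
        return f.read().strip()


if __name__ == "__main__":
    nodes = find_reset_nodes()
    if not nodes:
        print("no radeon_gpu_reset node found (debugfs mounted? running as root?)")
    for node in nodes:
        if "--trigger" in sys.argv[1:]:
            # On real hardware this read is expected to request a GPU reset.
            print(node, "->", read_reset_node(node))
        else:
            print(node)
```

Capturing dmesg right after a triggered run should show whether pciehp messages appear alongside the radeon reset messages.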

Thanks,
Shawn
Bjorn Helgaas
2014-10-13 16:11:28 UTC
[+cc Alex, Christian, dri-devel]
A PCIe device reset usually looks like a hotplug event because the
PCIe link goes down and comes back up. As far as the PCI core is
concerned, it can't tell the difference between (1) a simple reset
where the link bounces and (2) removal of one device followed by
addition of another.

b440bde74f04 ("PCI: Add pci_ignore_hotplug() to ignore hotplug events
for a device") addressed this for some similar cases, but it looks
like we probably need some more calls to pci_ignore_hotplug() in the
radeon driver reset methods.
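
To make the idea concrete, here is a toy model in Python, not kernel code. It assumes, as I understand it, that pci_ignore_hotplug() marks a device so the hotplug driver skips events for it. The Device and Slot classes and the link_down_event() method are invented for illustration and do not correspond to real kernel structures.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Device:
    name: str
    ignore_hotplug: bool = False  # models the mark left by pci_ignore_hotplug()
    bound: bool = True            # whether a driver is still attached


@dataclass
class Slot:
    devices: List[Device] = field(default_factory=list)

    def link_down_event(self) -> str:
        """Handle a link-down notification the way a hotplug driver might."""
        # The slot cannot tell a reset from a removal; it only sees the link drop.
        if any(dev.ignore_hotplug for dev in self.devices):
            return "ignored"
        for dev in self.devices:
            dev.bound = False  # treated as removal: the driver gets unbound
        return "removed"


if __name__ == "__main__":
    gpu = Device("radeon")
    slot = Slot([gpu])
    gpu.ignore_hotplug = True          # driver announces an upcoming reset
    print(slot.link_down_event(), gpu.bound)
```

In this model, a driver that sets the flag before resetting keeps its binding, while one that does not is unbound on the same link bounce, which matches the symptom described in item 2 of the report.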

Can you please open a bugzilla and attach the complete dmesg log,
including the GPU reset and recovery failure?

Bjorn
