Discussion:
[PATCH 1/1] pci/quirks: fix a dmar fault for intel 82599 card
Li, Zhen-Hua
2014-09-30 06:09:54 UTC
On an HP system with an Intel Corporation 82599 Ethernet adapter, when the
kernel crashes and the kdump kernel boots with intel_iommu=on, the adapter
may issue unexpected DMA requests, which cause DMA Remapping faults like:
dmar: DRHD: handling fault status reg 102
dmar: DMAR:[DMA Read] Request device [41:00.0] fault addr fff81000
DMAR:[fault reason 01] Present bit in root entry is clear

Analysis for this bug:

The present bit is set in this function:

static struct context_entry * device_to_context_entry(
		struct intel_iommu *iommu, u8 bus, u8 devfn)
{
	......
	set_root_present(root);
	......
}

Call tree:
ixgbe_open
  ixgbe_setup_tx_resources
    intel_alloc_coherent
      __intel_map_single
        domain_context_mapping
          domain_context_mapping_one
            device_to_context_entry

This means the present bit in the root entry is not set until the device
driver is loaded.

But in the kdump kernel, a hardware device does not know that the OS is now
the second kernel and that its driver has to be loaded again. As a result,
the device can issue unexpected DMA requests before it has been
re-initialized, and these trigger the DMA Remapping errors.

To fix this DMAR fault, we need to reset the bus that the device is on.
Resetting the device itself does not work.

There was also an earlier discussion:
https://lkml.org/lkml/2013/5/14/9

Signed-off-by: Li, Zhen-Hua <zhen-***@hp.com>
---
drivers/pci/quirks.c | 11 +++++++++++
1 file changed, 11 insertions(+)

diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
index 80c2d01..5198af3 100644
--- a/drivers/pci/quirks.c
+++ b/drivers/pci/quirks.c
@@ -25,6 +25,7 @@
 #include <linux/sched.h>
 #include <linux/ktime.h>
 #include <asm/dma.h>	/* isa_dma_bridge_buggy */
+#include <linux/crash_dump.h>
 #include "pci.h"

 /*
@@ -3832,3 +3833,13 @@ void pci_dev_specific_enable_acs(struct pci_dev *dev)
 		}
 	}
 }
+
+#ifdef CONFIG_CRASH_DUMP
+void quirk_reset_buggy_devices(struct pci_dev *dev)
+{
+	if (unlikely(is_kdump_kernel()))
+		pci_try_reset_bus(dev->bus);
+}
+DECLARE_PCI_FIXUP_CLASS_HEADER(PCI_VENDOR_ID_INTEL, 0x10f8,
+			 PCI_CLASS_NETWORK_ETHERNET, 8, quirk_reset_buggy_devices);
+#endif
--
2.0.0-rc0
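
Editor's sketch: the fixup declaration in the patch only fires for devices that match its vendor ID, device ID, and shifted class code. The userspace model below illustrates that matching rule as I understand it from the PCI core; the types model_dev and model_fixup, the ANY_ID macro, and fixup_matches() are hypothetical names invented for this illustration and are not kernel APIs. The values 0x8086 and 0x0200 are the numeric values of PCI_VENDOR_ID_INTEL and PCI_CLASS_NETWORK_ETHERNET.

```c
#include <stdbool.h>
#include <stdint.h>

/* Stands in for the kernel's PCI_ANY_ID wildcard (assumption: all ones). */
#define ANY_ID 0xffffffffu

/* Hypothetical, simplified stand-in for the fields of struct pci_dev used here. */
struct model_dev {
	uint16_t vendor;
	uint16_t device;
	uint32_t class_code;	/* 24 bits: base class, subclass, prog-if */
};

/* Hypothetical, simplified stand-in for a fixup table entry. */
struct model_fixup {
	uint16_t vendor;
	uint16_t device;
	uint32_t class_code;
	unsigned int class_shift;
};

/* A fixup applies only if class, vendor and device all match (or are wildcards). */
static bool fixup_matches(const struct model_fixup *f, const struct model_dev *d)
{
	bool class_ok = f->class_code == ANY_ID ||
			f->class_code == (d->class_code >> f->class_shift);
	bool vendor_ok = f->vendor == (uint16_t)ANY_ID || f->vendor == d->vendor;
	bool device_ok = f->device == (uint16_t)ANY_ID || f->device == d->device;

	return class_ok && vendor_ok && device_ok;
}

/* Mirrors the arguments of the declaration in the patch:
 * Intel vendor ID, device 0x10f8, Ethernet class 0x0200, shift 8. */
static const struct model_fixup patch_fixup = { 0x8086, 0x10f8, 0x0200, 8 };
```

Under this model, an Intel Ethernet function with device ID 0x10f8 matches, while any other device ID does not, so the quirk as posted would not cover other adapters on the same system.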
Li, ZhenHua
2014-09-30 06:15:53 UTC
Adding Joerg to the CC list, since this is also related to the IOMMU module.

Joerg,
There was an earlier attempt to fix this DMAR fault:
https://lkml.org/lkml/2014/8/18/118

This patch tries to fix the same thing.


Zhenhua
Bjorn Helgaas
2014-10-02 15:09:50 UTC
This seems like something that could happen with *any* device, not
just the 82599 NIC. Or is there something in the "kernel crash ->
kexec -> kdump kernel" path that stops DMA for most devices, but not
for the 82599?
Alexander Duyck
2014-10-03 14:28:50 UTC
Post by Bjorn Helgaas
This seems like something that could happen with *any* device, not
just the 82599 NIC. Or is there something in the "kernel crash ->
kexec -> kdump kernel" path that stops DMA for most devices, but not
for the 82599?
This is an *any* device problem. Specifically any device that is doing
active DMA when a kdump kernel is triggered will cause this issue since
the IOMMU will not have valid mappings for the DMA events until the
device driver itself is loaded and resets the device.

Thanks,

Alex
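
Editor's sketch: the failure described in this thread can be modeled in userspace as a root table whose per-bus present bit stays clear until a driver maps the device. This is a deliberately simplified illustration, not the intel-iommu implementation: model_iommu, model_map_device() and model_dma_request() are hypothetical names, and the model stops at the root entry, whereas real hardware would go on to check context entries and page tables.

```c
#include <stdint.h>
#include <string.h>

#define MODEL_ROOT_ENTRIES 256	/* one root entry per PCI bus number */

/* Simplified 128-bit root entry; bit 0 of 'lo' is treated as the present bit. */
struct model_root_entry {
	uint64_t lo;
	uint64_t hi;
};

struct model_iommu {
	struct model_root_entry root[MODEL_ROOT_ENTRIES];
};

enum {
	MODEL_FAULT_NONE = 0,
	MODEL_FAULT_ROOT_NOT_PRESENT = 1	/* "fault reason 01" in the log */
};

/* A freshly booted kdump kernel starts with an empty root table. */
static void model_iommu_init(struct model_iommu *iommu)
{
	memset(iommu, 0, sizeof(*iommu));
}

/* Models what happens when a driver first maps DMA for a device on 'bus'. */
static void model_map_device(struct model_iommu *iommu, uint8_t bus)
{
	iommu->root[bus].lo |= 1;
}

/* Models the root-entry check for an incoming DMA request from 'bus'. */
static int model_dma_request(const struct model_iommu *iommu, uint8_t bus)
{
	if (!(iommu->root[bus].lo & 1))
		return MODEL_FAULT_ROOT_NOT_PRESENT;
	return MODEL_FAULT_NONE;
}
```

In this model, a request from bus 0x41 faults until model_map_device() has run for that bus, regardless of which device issued it. This matches the point above that the problem is not specific to one adapter.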
Li, ZhenHua
2014-10-08 01:46:04 UTC
Well, then I will create a patch for ALL PCIe devices.