Discussion:
[PATCH 1/8] iommu/vt-d: Fix a few existing lines for checkpatch.pl
Bill Sumner
2014-04-25 00:36:31 UTC
Permalink
Things like:
1. no space before tabs
2. no initialization of static variables to 0 or NULL
3. no extra parentheses around the value of a 'return' statement
4. no lines over 80 characters

The first three patches in this set enable the intel-iommu driver to consist
of multiple *.c source files by moving many of the existing definitions,
prototypes, and structure definitions from the front of intel-iommu.c into
a new intel-iommu-private.h file -- and replacing them with a #include
of that file.


The last five patches in this set use the above enablement to implement,
within the new source file intel-iommu-kdump.c, a fix for:

a long-standing kdump problem with DMA: when a kernel panics and boots into
the kdump kernel, DMA that was started by the panicked kernel is not stopped
before the kdump kernel boots, and the kdump kernel then disables the IOMMU
while this DMA is still in flight.
This causes the IOMMU to stop translating the DMA addresses as IOVAs and to
treat them instead as physical memory addresses -- which causes the DMA to either:
1. generate DMAR errors or
2. generate PCI SERR errors or
3. transfer data to or from incorrect areas of memory.
Often this causes the dump to fail.

This patch set modifies the behavior of the Intel iommu in the crashdump kernel:
1. to accept the iommu hardware in an active state,
2. to leave the current translations in place so that legacy DMA will continue
using its current buffers until the device drivers in the crashdump kernel
initialize their devices,
3. to use different portions of the iova address ranges for the device drivers
in the crashdump kernel than the iova ranges that were in-use at the time
of the panic.
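
In rough outline (an illustrative sketch only, using the helpers that
later patches in this series add):

    if (is_kdump_kernel()) {
            /* 1. Accept the iommu hardware in an active state:
             *    skip iommu_disable_translation().
             */

            /* 2. Copy the panicked kernel's translation tables so that
             *    in-flight legacy DMA keeps hitting its old buffers.
             */
            ret = intel_iommu_copy_translation_tables(drhd,
                            &root_old_phys, &root_new_virt,
                            g_num_of_iommus);

            /* 3. Mark the old kernel's domain-ids as in-use and reserve
             *    its iova ranges so that new mappings for the crashdump
             *    kernel's drivers land elsewhere.
             */
            intel_iommu_get_dids_from_old_kernel(iommu);
    }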

Advantages of this approach:
1. All manipulation of the IO-device is done by the Linux device-driver
for that device.
2. This approach behaves in a manner very similar to operation without an
active iommu.
3. Any activity between the IO-device and its RMRR areas is handled by the
device-driver in the same manner as during a non-kdump boot.
4. If an IO-device has no driver in the kdump kernel, it is simply left alone.
This supports the practice of creating a special kdump kernel without
drivers for any devices that are not required for taking a crashdump.
5. Minimal changes to the existing mainline intel-iommu code.


Signed-off-by: Bill Sumner <bill.sumner-***@public.gmane.org>
---
drivers/iommu/intel-iommu.c | 15 ++++++++-------
1 file changed, 8 insertions(+), 7 deletions(-)

diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
index 69fa7da..f80b4bb 100644
--- a/drivers/iommu/intel-iommu.c
+++ b/drivers/iommu/intel-iommu.c
@@ -69,7 +69,8 @@
to match. That way, we can use 'unsigned long' for PFNs with impunity. */
#define DOMAIN_MAX_PFN(gaw) ((unsigned long) min_t(uint64_t, \
__DOMAIN_MAX_PFN(gaw), (unsigned long)-1))
-#define DOMAIN_MAX_ADDR(gaw) (((uint64_t)__DOMAIN_MAX_PFN(gaw)) << VTD_PAGE_SHIFT)
+#define DOMAIN_MAX_ADDR(gaw) (((uint64_t)__DOMAIN_MAX_PFN(gaw)) << \
+ VTD_PAGE_SHIFT)

#define IOVA_PFN(addr) ((addr) >> PAGE_SHIFT)
#define DMA_32BIT_PFN IOVA_PFN(DMA_BIT_MASK(32))
@@ -172,7 +173,7 @@ static int rwbf_quirk;
* set to 1 to panic kernel if can't successfully enable VT-d
* (used when kernel is launched w/ TXT)
*/
-static int force_on = 0;
+static int force_on;

/*
* 0: Present
@@ -187,7 +188,7 @@ struct root_entry {
#define ROOT_ENTRY_NR (VTD_PAGE_SIZE/sizeof(struct root_entry))
static inline bool root_present(struct root_entry *root)
{
- return (root->val & 1);
+ return root->val & 1;
}
static inline void set_root_present(struct root_entry *root)
{
@@ -225,7 +226,7 @@ struct context_entry {

static inline bool context_present(struct context_entry *context)
{
- return (context->lo & 1);
+ return context->lo & 1;
}
static inline void context_set_present(struct context_entry *context)
{
@@ -303,7 +304,7 @@ static inline bool dma_pte_present(struct dma_pte *pte)

static inline bool dma_pte_superpage(struct dma_pte *pte)
{
- return (pte->val & (1 << 7));
+ return pte->val & (1 << 7);
}

static inline int first_pte_in_page(struct dma_pte *pte)
@@ -314,7 +315,7 @@ static inline int first_pte_in_page(struct dma_pte *pte)
/*
* This domain is a statically identity mapping domain.
* 1. This domain creats a static 1:1 mapping to all usable memory.
- * 2. It maps to each iommu if successful.
+ * 2. It maps to each iommu if successful.
* 3. Each iommu mapps to this domain if successful.
*/
static struct dmar_domain *si_domain;
@@ -344,7 +345,7 @@ struct dmar_domain {
DECLARE_BITMAP(iommu_bmp, IOMMU_UNITS_SUPPORTED);
/* bitmap of iommus this domain uses*/

- struct list_head devices; /* all devices' list */
+ struct list_head devices; /* all devices' list */
struct iova_domain iovad; /* iova's that belong to this domain */

struct dma_pte *pgd; /* virtual address */
--
Bill Sumner <bill.sumner-***@public.gmane.org>
Bill Sumner
2014-04-25 00:36:32 UTC
Permalink
In intel-iommu.c, move downward the few lines near the
front that should not move to an intel-iommu-private.h
file (mostly data-item definitions). Also move upward
a couple of inline functions that allocate and free pages.
This leaves at the front of the file a consolidated block
of the lines that will move to an intel-iommu-private.h file.

Signed-off-by: Bill Sumner <bill.sumner-***@public.gmane.org>
---
drivers/iommu/intel-iommu.c | 74 +++++++++++++++++++++++----------------------
1 file changed, 38 insertions(+), 36 deletions(-)

diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
index f80b4bb..49fdac5 100644
--- a/drivers/iommu/intel-iommu.c
+++ b/drivers/iommu/intel-iommu.c
@@ -163,17 +163,6 @@ static inline unsigned long virt_to_dma_pfn(void *p)
return page_to_dma_pfn(virt_to_page(p));
}

-/* global iommu list, set NULL for ignored DMAR units */
-static struct intel_iommu **g_iommus;
-
-static void __init check_tylersburg_isoch(void);
-static int rwbf_quirk;
-
-/*
- * set to 1 to panic kernel if can't successfully enable VT-d
- * (used when kernel is launched w/ TXT)
- */
-static int force_on;

/*
* 0: Present
@@ -312,15 +301,6 @@ static inline int first_pte_in_page(struct dma_pte *pte)
return !((unsigned long)pte & ~VTD_PAGE_MASK);
}

-/*
- * This domain is a statically identity mapping domain.
- * 1. This domain creats a static 1:1 mapping to all usable memory.
- * 2. It maps to each iommu if successful.
- * 3. Each iommu mapps to this domain if successful.
- */
-static struct dmar_domain *si_domain;
-static int hw_pass_through = 1;
-
/* devices under the same p2p bridge are owned in one domain */
#define DOMAIN_FLAG_P2P_MULTIPLE_DEVICES (1 << 0)

@@ -394,12 +374,50 @@ struct dmar_atsr_unit {
u8 include_all:1; /* include all ports */
};

+static inline void *alloc_pgtable_page(int node)
+{
+ struct page *page;
+ void *vaddr = NULL;
+
+ page = alloc_pages_node(node, GFP_ATOMIC | __GFP_ZERO, 0);
+ if (page)
+ vaddr = page_address(page);
+ return vaddr;
+}
+
+static inline void free_pgtable_page(void *vaddr)
+{
+ free_page((unsigned long)vaddr);
+}
+
static LIST_HEAD(dmar_atsr_units);
static LIST_HEAD(dmar_rmrr_units);

#define for_each_rmrr_units(rmrr) \
list_for_each_entry(rmrr, &dmar_rmrr_units, list)

+
+static void __init check_tylersburg_isoch(void);
+
+/* global iommu list, set NULL for ignored DMAR units */
+static struct intel_iommu **g_iommus;
+static int rwbf_quirk;
+
+/*
+ * set to 1 to panic kernel if can't successfully enable VT-d
+ * (used when kernel is launched w/ TXT)
+ */
+static int force_on;
+
+/*
+ * This domain is a statically identity mapping domain.
+ * 1. This domain creats a static 1:1 mapping to all usable memory.
+ * 2. It maps to each iommu if successful.
+ * 3. Each iommu mapps to this domain if successful.
+ */
+static struct dmar_domain *si_domain;
+static int hw_pass_through = 1;
+
static void flush_unmaps_timeout(unsigned long data);

static DEFINE_TIMER(unmap_timer, flush_unmaps_timeout, 0, 0);
@@ -494,22 +512,6 @@ static struct kmem_cache *iommu_domain_cache;
static struct kmem_cache *iommu_devinfo_cache;
static struct kmem_cache *iommu_iova_cache;

-static inline void *alloc_pgtable_page(int node)
-{
- struct page *page;
- void *vaddr = NULL;
-
- page = alloc_pages_node(node, GFP_ATOMIC | __GFP_ZERO, 0);
- if (page)
- vaddr = page_address(page);
- return vaddr;
-}
-
-static inline void free_pgtable_page(void *vaddr)
-{
- free_page((unsigned long)vaddr);
-}
-
static inline void *alloc_domain_mem(void)
{
return kmem_cache_alloc(iommu_domain_cache, GFP_ATOMIC);
--
Bill Sumner <bill.sumner-***@public.gmane.org>
Bill Sumner
2014-04-25 00:36:34 UTC
Permalink
Allow specification of the domain-id for the new domain.
This patch only adds the 'did' parameter to iommu_attach_domain()
and modifies all of its callers to pass the default value of -1,
which means "no did specified, allocate a new one".

There is no functional change from current behaviour; this just
enables a functional change to be made in a later patch.
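
For illustration, the resulting call sites look like this (a sketch based
on the hunks below; 'did' only becomes meaningful in a later patch):

    /* normal case: let iommu_attach_domain() pick a free domain-id */
    ret = iommu_attach_domain(domain, iommu, -1);

    /* crashdump path (later patch): reuse the domain-id found in the
     * old kernel's context entry for this device
     */
    ret = iommu_attach_domain(domain, iommu, did);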

Signed-off-by: Bill Sumner <bill.sumner-***@public.gmane.org>
---
drivers/iommu/intel-iommu.c | 23 ++++++++++++++---------
1 file changed, 14 insertions(+), 9 deletions(-)

diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
index 4116377..226c1d0 100644
--- a/drivers/iommu/intel-iommu.c
+++ b/drivers/iommu/intel-iommu.c
@@ -1181,7 +1181,8 @@ static struct dmar_domain *alloc_domain(bool vm)
}

static int iommu_attach_domain(struct dmar_domain *domain,
- struct intel_iommu *iommu)
+ struct intel_iommu *iommu,
+ int domain_number)
{
int num;
unsigned long ndomains;
@@ -1191,12 +1192,15 @@ static int iommu_attach_domain(struct dmar_domain *domain,

spin_lock_irqsave(&iommu->lock, flags);

- num = find_first_zero_bit(iommu->domain_ids, ndomains);
- if (num >= ndomains) {
- spin_unlock_irqrestore(&iommu->lock, flags);
- printk(KERN_ERR "IOMMU: no free domain ids\n");
- return -ENOMEM;
- }
+ if (domain_number < 0) {
+ num = find_first_zero_bit(iommu->domain_ids, ndomains);
+ if (num >= ndomains) {
+ spin_unlock_irqrestore(&iommu->lock, flags);
+ pr_err("IOMMU: no free domain ids\n");
+ return -ENOMEM;
+ }
+ } else
+ num = domain_number;

domain->id = num;
domain->iommu_count++;
@@ -1875,6 +1879,7 @@ static struct dmar_domain *get_domain_for_dev(struct device *dev, int gaw)
struct pci_dev *dev_tmp = NULL;
unsigned long flags;
u8 bus, devfn, bridge_bus, bridge_devfn;
+ int did = -1; /* Default to "no domain_id supplied" */

domain = find_domain(dev);
if (domain)
@@ -1915,7 +1920,7 @@ static struct dmar_domain *get_domain_for_dev(struct device *dev, int gaw)
domain = alloc_domain(false);
if (!domain)
goto error;
- if (iommu_attach_domain(domain, iommu)) {
+ if (iommu_attach_domain(domain, iommu, did)) {
free_domain_mem(domain);
domain = NULL;
goto error;
@@ -2083,7 +2088,7 @@ static int __init si_domain_init(int hw)
si_domain->flags = DOMAIN_FLAG_STATIC_IDENTITY;

for_each_active_iommu(iommu, drhd) {
- ret = iommu_attach_domain(si_domain, iommu);
+ ret = iommu_attach_domain(si_domain, iommu, (int) -1);
if (ret) {
domain_exit(si_domain);
return -EFAULT;
--
Bill Sumner <bill.sumner-***@public.gmane.org>
Bill Sumner
2014-04-25 00:36:33 UTC
Permalink
Bill Sumner
2014-04-25 00:36:35 UTC
Permalink
Add data declarations and prototypes needed for kdump.

Signed-off-by: Bill Sumner <bill.sumner-***@public.gmane.org>
---
drivers/iommu/intel-iommu-private.h | 94 +++++++++++++++++++++++++++++++++++++
1 file changed, 94 insertions(+)

diff --git a/drivers/iommu/intel-iommu-private.h b/drivers/iommu/intel-iommu-private.h
index 480399c..3f20234 100644
--- a/drivers/iommu/intel-iommu-private.h
+++ b/drivers/iommu/intel-iommu-private.h
@@ -361,3 +361,97 @@ static inline void free_pgtable_page(void *vaddr)
{
free_page((unsigned long)vaddr);
}
+
+#ifdef CONFIG_CRASH_DUMP
+/*
+ * Fix Crashdump failure caused by leftover DMA through a hardware IOMMU
+ *
+ * Fixes the crashdump kernel to deal with an active iommu and legacy
+ * DMA from the (old) panicked kernel in a manner similar to how legacy
+ * DMA is handled when no hardware iommu was in use by the old kernel --
+ * allow the legacy DMA to continue into its current buffers.
+ *
+ * In the crashdump kernel, this code:
+ * 1. skips disabling the IOMMU's translating of IO Virtual Addresses (IOVA)
+ * 2. leaves the current translations in-place so that legacy DMA will
+ * continue to use its current buffers,
+ * 3. allocates to the device drivers in the crashdump kernel
+ * portions of the iova address ranges that are different
+ * from the iova address ranges that were being used by the old kernel
+ * at the time of the panic.
+ *
+ */
+
+struct domain_values_entry {
+ struct list_head link; /* link entries into a list */
+ struct iova_domain iovad; /* iova's that belong to this domain */
+ struct dma_pte *pgd; /* virtual address */
+ int did; /* domain id */
+ int gaw; /* max guest address width */
+ int iommu_superpage; /* Level of superpages supported:
+ 0 == 4KiB (no superpages), 1 == 2MiB,
+ 2 == 1GiB, 3 == 512GiB, 4 == 1TiB */
+};
+
+
+
+/* ------------------------------------------------------------------------
+ * Prototypes of interface functions
+ * ------------------------------------------------------------------------
+ */
+
+extern int
+intel_iommu_copy_translation_tables(struct dmar_drhd_unit *drhd,
+ struct root_entry **root_old_p, struct root_entry **root_new_p,
+ int g_num_of_iommus);
+
+extern int
+intel_iommu_get_dids_from_old_kernel(struct intel_iommu *iommu);
+
+extern struct domain_values_entry *
+intel_iommu_did_to_domain_values_entry(int did, struct intel_iommu *iommu);
+
+
+/* ------------------------------------------------------------------------
+ * Utility functions for accessing the iommu Translation Tables
+ * ------------------------------------------------------------------------
+ */
+
+static inline struct context_entry *
+get_context_phys_from_root(struct root_entry *root)
+{
+ return (struct context_entry *)
+ (root_present(root) ? (void *) (root->val & VTD_PAGE_MASK)
+ : NULL);
+}
+
+static inline int
+context_get_p(struct context_entry *c) {return((c->lo >> 0) & 0x1); }
+
+static inline int
+context_get_fpdi(struct context_entry *c) {return((c->lo >> 1) & 0x1); }
+
+static inline int
+context_get_t(struct context_entry *c) {return((c->lo >> 2) & 0x3); }
+
+static inline u64
+context_get_asr(struct context_entry *c) {return((c->lo >> 12)); }
+
+static inline int
+context_get_aw(struct context_entry *c) {return((c->hi >> 0) & 0x7); }
+
+static inline int
+context_get_aval(struct context_entry *c) {return((c->hi >> 3) & 0xf); }
+
+static inline int
+context_get_did(struct context_entry *c) {return((c->hi >> 8) & 0xffff); }
+
+static inline void
+context_put_asr(struct context_entry *c, unsigned long asr)
+{
+ c->lo &= (~VTD_PAGE_MASK);
+ c->lo |= (asr << VTD_PAGE_SHIFT);
+}
+
+#endif /* CONFIG_CRASH_DUMP */
+
--
Bill Sumner <bill.sumner-***@public.gmane.org>
Bill Sumner
2014-04-25 00:36:36 UTC
Permalink
Create intel-iommu-kdump.c
Populate it with support functions to copy iommu translation tables
from the panicked kernel into the kdump kernel in the event of a crash.

Update Makefile to build intel-iommu-kdump.o

Signed-off-by: Bill Sumner <bill.sumner-***@public.gmane.org>
---
drivers/iommu/Makefile | 2 +-
drivers/iommu/intel-iommu-kdump.c | 590 ++++++++++++++++++++++++++++++++++++++
2 files changed, 591 insertions(+), 1 deletion(-)
create mode 100644 drivers/iommu/intel-iommu-kdump.c

diff --git a/drivers/iommu/Makefile b/drivers/iommu/Makefile
index 5d58bf1..bd61452 100644
--- a/drivers/iommu/Makefile
+++ b/drivers/iommu/Makefile
@@ -6,7 +6,7 @@ obj-$(CONFIG_AMD_IOMMU) += amd_iommu.o amd_iommu_init.o
obj-$(CONFIG_AMD_IOMMU_V2) += amd_iommu_v2.o
obj-$(CONFIG_ARM_SMMU) += arm-smmu.o
obj-$(CONFIG_DMAR_TABLE) += dmar.o
-obj-$(CONFIG_INTEL_IOMMU) += iova.o intel-iommu.o
+obj-$(CONFIG_INTEL_IOMMU) += iova.o intel-iommu.o intel-iommu-kdump.o
obj-$(CONFIG_IRQ_REMAP) += intel_irq_remapping.o irq_remapping.o
obj-$(CONFIG_OMAP_IOMMU) += omap-iommu.o
obj-$(CONFIG_OMAP_IOMMU) += omap-iommu2.o
diff --git a/drivers/iommu/intel-iommu-kdump.c b/drivers/iommu/intel-iommu-kdump.c
new file mode 100644
index 0000000..4e653e048
--- /dev/null
+++ b/drivers/iommu/intel-iommu-kdump.c
@@ -0,0 +1,590 @@
+/*
+ * Copyright (C) 2014 Hewlett-Packard Development Company, L.P.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for
+ * more details.
+ *
+ * Copyright (C) 2014 Hewlett-Packard Development Company, L.P.
+ * Author: Bill Sumner <bill.sumner-***@public.gmane.org>
+ */
+#include <linux/init.h>
+#include <linux/bitmap.h>
+#include <linux/debugfs.h>
+#include <linux/export.h>
+#include <linux/slab.h>
+#include <linux/irq.h>
+#include <linux/interrupt.h>
+#include <linux/spinlock.h>
+#include <linux/pci.h>
+#include <linux/dmar.h>
+#include <linux/dma-mapping.h>
+#include <linux/mempool.h>
+#include <linux/timer.h>
+#include <linux/iova.h>
+#include <linux/iommu.h>
+#include <linux/intel-iommu.h>
+#include <linux/syscore_ops.h>
+#include <linux/tboot.h>
+#include <linux/dmi.h>
+#include <linux/pci-ats.h>
+#include <linux/memblock.h>
+#include <asm/irq_remapping.h>
+#include <asm/cacheflush.h>
+#include <linux/iommu.h>
+
+#include "irq_remapping.h"
+#include "pci.h"
+#include "intel-iommu-private.h"
+#include <linux/crash_dump.h>
+
+#ifdef CONFIG_CRASH_DUMP
+
+
+/* Lists of domain_values_entry to hold domain values found during the copy.
+ * One list for each iommu in g_number_of_iommus.
+ */
+static struct list_head *domain_values_list;
+
+
+/* ========================================================================
+ * Copy iommu translation tables from old kernel into new kernel.
+ * Entry to this set of functions is: intel_iommu_copy_translation_tables()
+ * ------------------------------------------------------------------------
+ */
+#define RET_BADCOPY -1 /* Return-code: Cannot copy translate tables */
+
+/*
+ * Copy memory from a physically-addressed area into a virtually-addressed area
+ */
+static int oldcopy(void *to, void *from, int size)
+{
+ size_t ret = 0; /* Length copied */
+ unsigned long pfn; /* Page Frame Number */
+ char *buf = to; /* Adr(Output buffer) */
+ size_t csize = (size_t)size; /* Num(bytes to copy) */
+ unsigned long offset; /* Lower 12 bits of from */
+ int userbuf = 0; /* to is in kernel space */
+
+
+ pfn = ((unsigned long) from) >> VTD_PAGE_SHIFT;
+ offset = ((unsigned long) from) & (~VTD_PAGE_MASK);
+ ret = copy_oldmem_page(pfn, buf, csize, offset, userbuf);
+
+ return (int) ret;
+}
+
+
+
+/*
+ * Struct copy_page_addr_parms is used to allow copy_page_addr()
+ * to accumulate values across multiple calls and returns.
+ */
+struct copy_page_addr_parms {
+ u32 first; /* flag: first-time */
+ u32 last; /* flag: last-time */
+ u32 bus; /* last bus number we saw */
+ u32 devfn; /* last devfn we saw */
+ u32 shift; /* last shift we saw */
+ u64 pte; /* Page Table Entry */
+ u64 next_addr; /* next-expected page_addr */
+
+ u64 page_addr; /* page_addr accumulating size */
+ u64 page_size; /* page_size accumulated */
+
+ struct domain_values_entry *dve; /* to accumulate iova ranges */
+};
+
+/*
+ * constant for initializing instances of copy_page_addr_parms properly.
+ */
+static struct copy_page_addr_parms copy_page_addr_parms_init = {1, 0};
+
+
+
+/*
+ * Lowest-level function in the 'Copy Page Tables' set
+ * Called once for each page_addr present in an iommu page-address table.
+ *
+ * Because of the depth-first traversal of the page-tables by the
+ * higher-level functions that call 'copy_page_addr', all pages
+ * of a domain will be presented in ascending order of IO Virtual Address.
+ *
+ * This function accumulates each contiguous range of these IOVAs and
+ * reserves it within the proper domain in the crashdump kernel when a
+ * non-contiguous range is detected, as determined by any of the following:
+ * 1. a change in the bus or device owning the presented page
+ * 2. a change in the page-size of the presented page (parameter shift)
+ * 3. a change in the page-table entry of the presented page
+ * 4. a presented IOVA that does not match the expected next-page address
+ * 5. the 'last' flag is set, indicating that all IOVAs have been seen.
+ */
+static int copy_page_addr(u64 page_addr, u32 shift, u32 bus, u32 devfn,
+ u64 pte, struct domain_values_entry *dve,
+ void *parms)
+{
+ struct copy_page_addr_parms *ppap = parms;
+
+ u64 page_size = ((u64)1 << shift); /* page_size */
+ u64 pfn_lo; /* For reserving IOVA range */
+ u64 pfn_hi; /* For reserving IOVA range */
+ struct iova *iova_p; /* For reserving IOVA range */
+
+ if (!ppap) {
+ pr_err("ERROR: ppap is NULL: 0x%3.3x(%3.3d) DevFn: 0x%3.3x(%3.3d) Page: 0x%16.16llx Size: 0x%16.16llx(%lld)\n",
+ bus, bus, devfn, devfn, page_addr,
+ page_size, page_size);
+ return 0;
+ }
+
+ /* If (only extending current addr range) */
+ if (ppap->first == 0 &&
+ ppap->last == 0 &&
+ ppap->bus == bus &&
+ ppap->devfn == devfn &&
+ ppap->shift == shift &&
+ (ppap->pte & ~VTD_PAGE_MASK) == (pte & ~VTD_PAGE_MASK) &&
+ ppap->next_addr == page_addr) {
+
+ /* Update page size and next-expected address */
+ ppap->next_addr += page_size;
+ ppap->page_size += page_size;
+ return 0;
+ }
+
+ if (!ppap->first) {
+ /* Close-out the accumulated IOVA address range */
+
+ if (!ppap->dve) {
+ pr_err("%s ERROR: ppap->dve is NULL -- needed to reserve range for B:D:F=%2.2x:%2.2x:%1.1x\n",
+ __func__,
+ ppap->bus, ppap->devfn >> 3, ppap->devfn & 0x7);
+ return RET_BADCOPY;
+ }
+ pfn_lo = IOVA_PFN(ppap->page_addr);
+ pfn_hi = IOVA_PFN(ppap->page_addr + ppap->page_size);
+ iova_p = reserve_iova(&ppap->dve->iovad, pfn_lo, pfn_hi);
+ }
+
+ /* Prepare for a new IOVA address range */
+ ppap->first = 0; /* Not first-time anymore */
+ ppap->bus = bus;
+ ppap->devfn = devfn;
+ ppap->shift = shift;
+ ppap->pte = pte;
+ ppap->next_addr = page_addr + page_size; /* Next-expected page_addr */
+
+ ppap->page_addr = page_addr; /* Addr(new page) */
+ ppap->page_size = page_size; /* Size(new page) */
+
+ ppap->dve = dve; /* adr(device_values_entry for new range) */
+
+ return 0;
+}
+
+/*
+ * Recursive function to copy the tree of page tables (max 6 recursions)
+ * Parameter 'shift' controls the recursion
+ */
+static int copy_page_table(struct dma_pte **dma_pte_new_p,
+ struct dma_pte *dma_pte_phys,
+ u32 shift, u64 page_addr,
+ struct intel_iommu *iommu,
+ u32 bus, u32 devfn,
+ struct domain_values_entry *dve, void *ppap)
+{
+ int ret; /* Integer return code */
+ struct dma_pte *p; /* Physical adr(each entry) iterator */
+ struct dma_pte *pgt_new_virt; /* Adr(dma_pte in new kernel) */
+ struct dma_pte *dma_pte_next; /* Adr(next table down) */
+ u64 u; /* index(each entry in page_table) */
+
+
+ /* If (already done all levels -- problem) */
+ if (shift < 12) {
+ pr_err("ERROR %s shift < 12 %p\n", __func__, dma_pte_phys);
+ pr_err("shift %d, page_addr %16.16llu bus %3.3u devfn %3.3u\n",
+ shift, page_addr, bus, devfn);
+ return RET_BADCOPY;
+ }
+
+ /* allocate a page table in the new kernel
+ * copy contents from old kernel
+ * then update each entry in the table in the new kernel
+ */
+
+ pgt_new_virt = (struct dma_pte *)alloc_pgtable_page(iommu->node);
+ if (!pgt_new_virt)
+ return -ENOMEM;
+
+ ret = oldcopy(pgt_new_virt, dma_pte_phys, VTD_PAGE_SIZE);
+ if (ret <= 0)
+ return ret;
+
+ for (u = 0, p = pgt_new_virt; u < 512; u++, p++) {
+
+ if (((p->val & DMA_PTE_READ) == 0) &&
+ ((p->val & DMA_PTE_WRITE) == 0))
+ continue;
+
+ if (dma_pte_superpage(p) || (shift == 12)) {
+
+ ret = copy_page_addr(page_addr | (u << shift),
+ shift, bus, devfn, p->val, dve, ppap);
+ if (ret)
+ return ret;
+ continue;
+ }
+
+ ret = copy_page_table(&dma_pte_next,
+ (struct dma_pte *)(p->val & VTD_PAGE_MASK),
+ shift-9, page_addr | (u << shift),
+ iommu, bus, devfn, dve, ppap);
+ if (ret)
+ return ret;
+
+ p->val &= ~VTD_PAGE_MASK; /* Clear old and set new pgd */
+ p->val |= ((u64)dma_pte_next & VTD_PAGE_MASK);
+ }
+
+ *dma_pte_new_p = (struct dma_pte *)virt_to_phys(pgt_new_virt);
+ __iommu_flush_cache(iommu, pgt_new_virt, VTD_PAGE_SIZE);
+
+ return 0;
+}
+
+
+/*
+ * Called once for each context_entry found in a copied context_entry_table
+ * Each context_entry represents one PCIe device handled by the IOMMU.
+ *
+ * The 'domain_values_list' contains one 'domain_values_entry' for each
+ * unique domain-id found while copying the context entries for each iommu.
+ *
+ * The Intel-iommu spec. requires that every context_entry that contains
+ * the same domain-id point to the same set of page translation tables.
+ * The hardware uses this to improve the use of its translation cache.
+ * In order to insure that the copied translate tables abide by this
+ * requirement, this function keeps a list of domain-ids (dids) that
+ * have already been seen for this iommu. This function checks each entry
+ * already on the list for a domain-id that matches the domain-id in this
+ * context_entry. If found, this function places the address of the previous
+ * context's tree of page translation tables into this context_entry.
+ * If a matching previous entry is not found, a new 'domain_values_entry'
+ * structure is created for the domain-id in this context_entry and
+ * copy_page_table is called to duplicate its tree of page tables.
+ */
+
+enum returns_from_copy_context_entry {
+RET_CCE_NOT_PRESENT = 1,
+RET_CCE_NEW_PAGE_TABLES,
+RET_CCE_PASS_THROUGH_1,
+RET_CCE_PASS_THROUGH_2,
+RET_CCE_RESERVED_VALUE,
+RET_CCE_PREVIOUS_DID
+};
+static int copy_context_entry(struct intel_iommu *iommu, u32 bus, u32 devfn,
+ void *ppap, struct context_entry *ce)
+{
+ int ret = 0; /* Integer Return Code */
+ u32 shift = 0; /* bits to shift page_addr */
+ u64 page_addr = 0; /* Address of translated page */
+ struct dma_pte *pgt_old_phys; /* Adr(page_table in the old kernel) */
+ struct dma_pte *pgt_new_phys; /* Adr(page_table in the new kernel) */
+ unsigned long asr; /* New asr value for new context */
+ u8 t; /* Translation-type from context */
+ u8 aw; /* Address-width from context */
+ u32 aw_shift[8] = {
+ 12+9+9, /* [000b] 30-bit AGAW (2-level page table) */
+ 12+9+9+9, /* [001b] 39-bit AGAW (3-level page table) */
+ 12+9+9+9+9, /* [010b] 48-bit AGAW (4-level page table) */
+ 12+9+9+9+9+9, /* [011b] 57-bit AGAW (5-level page table) */
+ 12+9+9+9+9+9+9, /* [100b] 64-bit AGAW (6-level page table) */
+ 0, /* [101b] Reserved */
+ 0, /* [110b] Reserved */
+ 0, /* [111b] Reserved */
+ };
+
+ struct domain_values_entry *dve = NULL;
+
+
+ if (!context_get_p(ce)) { /* If (context not present) */
+ ret = RET_CCE_NOT_PRESENT; /* Skip it */
+ goto exit;
+ }
+
+ t = context_get_t(ce);
+
+ /* If we have seen this domain-id before on this iommu,
+ * give this context the same page-tables and we are done.
+ */
+ list_for_each_entry(dve, &domain_values_list[iommu->seq_id], link) {
+ if (dve->did == (int) context_get_did(ce)) {
+ switch (t) {
+ case 0: /* page tables */
+ case 1: /* page tables */
+ asr = virt_to_phys(dve->pgd) >> VTD_PAGE_SHIFT;
+ context_put_asr(ce, asr);
+ ret = RET_CCE_PREVIOUS_DID;
+ break;
+
+ case 2: /* Pass through */
+ if (dve->pgd == NULL)
+ ret = RET_CCE_PASS_THROUGH_2;
+ else
+ ret = RET_BADCOPY;
+ break;
+
+ default: /* Bad value of 't'*/
+ ret = RET_BADCOPY;
+ break;
+ }
+ goto exit;
+ }
+ }
+
+ /* Since we now know that this is a new domain-id for this iommu,
+ * create a new entry, add it to the list, and handle its
+ * page tables.
+ */
+
+ dve = kcalloc(1, sizeof(struct domain_values_entry), GFP_KERNEL);
+ if (!dve) {
+ ret = -ENOMEM;
+ goto exit;
+ }
+
+ dve->did = (int) context_get_did(ce);
+ dve->gaw = (int) agaw_to_width(context_get_aw(ce));
+ dve->pgd = NULL;
+ init_iova_domain(&dve->iovad, DMA_32BIT_PFN);
+
+ list_add(&dve->link, &domain_values_list[iommu->seq_id]);
+
+
+ if (t == 0 || t == 1) { /* If (context has page tables) */
+ aw = context_get_aw(ce);
+ shift = aw_shift[aw];
+
+ pgt_old_phys = (struct dma_pte *)(context_get_asr(ce) << 12);
+
+ ret = copy_page_table(&pgt_new_phys, pgt_old_phys,
+ shift-9, page_addr, iommu, bus, devfn, dve, ppap);
+
+ if (ret) /* if (problem) bail out */
+ goto exit;
+
+ asr = ((unsigned long)(pgt_new_phys)) >> VTD_PAGE_SHIFT;
+ context_put_asr(ce, asr);
+ dve->pgd = phys_to_virt((unsigned long)pgt_new_phys);
+ ret = RET_CCE_NEW_PAGE_TABLES;
+ goto exit;
+ }
+
+ if (t == 2) { /* If (Identity mapped pass-through) */
+ ret = RET_CCE_PASS_THROUGH_1; /* REVISIT: Skip for now */
+ goto exit;
+ }
+
+ ret = RET_CCE_RESERVED_VALUE; /* Else ce->t is a Reserved value */
+ /* Note fall-through */
+
+exit: /* all returns come through here to insure good clean-up */
+ return ret;
+}
+
+
+/*
+ * Called once for each context_entry_table found in the root_entry_table
+ */
+static int copy_context_entry_table(struct intel_iommu *iommu,
+ u32 bus, void *ppap,
+ struct context_entry **context_new_p,
+ struct context_entry *context_old_phys)
+{
+ int ret = 0; /* Integer return code */
+ struct context_entry *ce; /* Iterator */
+ struct context_entry *context_new_phys; /* adr(table in new kernel) */
+ struct context_entry *context_new_virt; /* adr(table in new kernel) */
+ u32 devfn = 0; /* PCI Device & function */
+
+ /* allocate a context-entry table in the new kernel
+ * copy contents from old kernel
+ * then update each entry in the table in the new kernel
+ */
+ context_new_virt =
+ (struct context_entry *)alloc_pgtable_page(iommu->node);
+ if (!context_new_virt)
+ return -ENOMEM;
+
+ context_new_phys =
+ (struct context_entry *)virt_to_phys(context_new_virt);
+
+ oldcopy(context_new_virt, context_old_phys, VTD_PAGE_SIZE);
+
+ for (devfn = 0, ce = context_new_virt; devfn < 256; devfn++, ce++) {
+
+ if (!context_get_p(ce)) /* If (context not present) */
+ continue; /* Skip it */
+
+ ret = copy_context_entry(iommu, bus, devfn, ppap, ce);
+ if (ret < 0) /* if (problem) */
+ return RET_BADCOPY;
+
+ switch (ret) {
+ case RET_CCE_NOT_PRESENT:
+ continue;
+ case RET_CCE_NEW_PAGE_TABLES:
+ continue;
+ case RET_CCE_PASS_THROUGH_1:
+ continue;
+ case RET_CCE_PASS_THROUGH_2:
+ continue;
+ case RET_CCE_RESERVED_VALUE:
+ return RET_BADCOPY;
+ case RET_CCE_PREVIOUS_DID:
+ continue;
+ default:
+ return RET_BADCOPY;
+ };
+ }
+
+ *context_new_p = context_new_phys;
+ __iommu_flush_cache(iommu, context_new_virt, VTD_PAGE_SIZE);
+ return 0;
+}
+
+
+/*
+ * Highest-level function in the 'copy translation tables' set of functions
+ */
+static int copy_root_entry_table(struct intel_iommu *iommu, void *ppap,
+ struct root_entry **root_new_virt_p,
+ struct root_entry *root_old_phys)
+{
+ int ret = 0; /* Integer return code */
+ u32 bus; /* Index: root-entry-table */
+ struct root_entry *re; /* Virt(iterator: new table) */
+ struct root_entry *root_new_virt; /* Virt(table in new kernel) */
+ struct context_entry *context_old_phys; /* Phys(context table entry) */
+ struct context_entry *context_new_phys; /* Phys(new context_entry) */
+
+ /*
+ * allocate a root-entry table in the new kernel
+ * copy contents from old kernel
+ * then update each entry in the table in the new kernel
+ */
+
+ root_new_virt = (struct root_entry *)alloc_pgtable_page(iommu->node);
+ if (!root_new_virt)
+ return -ENOMEM;
+
+ oldcopy(root_new_virt, root_old_phys, VTD_PAGE_SIZE);
+
+ for (bus = 0, re = root_new_virt; bus < 256; bus += 1, re += 1) {
+
+ if (!root_present(re))
+ continue;
+
+ context_old_phys = get_context_phys_from_root(re);
+
+ if (!context_old_phys)
+ continue;
+
+ ret = copy_context_entry_table(iommu, bus, ppap,
+ &context_new_phys,
+ context_old_phys);
+ if (ret)
+ return ret;
+
+ re->val &= ~VTD_PAGE_MASK;
+ set_root_value(re, (unsigned long)context_new_phys);
+ }
+
+ *root_new_virt_p = root_new_virt;
+ __iommu_flush_cache(iommu, root_new_virt, VTD_PAGE_SIZE);
+ return 0;
+}
+
+/*
+ * Interface to the "copy translation tables" set of functions
+ * from mainline code.
+ */
+int intel_iommu_copy_translation_tables(struct dmar_drhd_unit *drhd,
+ struct root_entry **root_old_phys_p,
+ struct root_entry **root_new_virt_p,
+ int g_num_of_iommus)
+{
+ struct intel_iommu *iommu; /* Virt(iommu hardware registers) */
+ unsigned long long q; /* quadword scratch */
+ struct root_entry *root_phys; /* Phys(table in old kernel) */
+ struct root_entry *root_new; /* Virt(table in new kernel) */
+ int ret = 0; /* Integer return code */
+ int i = 0; /* Loop index */
+
+ /* Structure so copy_page_addr() can accumulate things
+ * over multiple calls and returns
+ */
+ struct copy_page_addr_parms ppa_parms = copy_page_addr_parms_init;
+ struct copy_page_addr_parms *ppap = &ppa_parms;
+
+
+ iommu = drhd->iommu;
+ q = readq(iommu->reg + DMAR_RTADDR_REG);
+
+ if (!q)
+ return -1;
+
+ *root_old_phys_p = (struct root_entry *)q; /* Returned to caller */
+
+ /* If (list needs initializing) do it here */
+ if (!domain_values_list) {
+ domain_values_list =
+ kcalloc(g_num_of_iommus, sizeof(struct list_head),
+ GFP_KERNEL);
+
+ if (!domain_values_list) {
+ pr_err("Allocation failed for domain_values_list array\n");
+ return -ENOMEM;
+ }
+ for (i = 0; i < g_num_of_iommus; i++)
+ INIT_LIST_HEAD(&domain_values_list[i]);
+ }
+
+ /* Copy the root-entry table from the old kernel
+ * foreach context_entry_table in root_entry
+ * foreach context_entry in context_entry_table
+ * foreach level-1 page_table_entry in context_entry
+ * foreach level-2 page_table_entry in level 1 page_table_entry
+ * Above pattern continues up to 6 levels of page tables
+ * Sanity-check the entry
+ * Process the bus, devfn, page_address, page_size
+ */
+
+ root_phys = (struct root_entry *)q;
+ ret = copy_root_entry_table(iommu, ppap, &root_new, root_phys);
+ if (ret)
+ return ret;
+
+
+ ppa_parms.last = 1;
+ copy_page_addr(0, 0, 0, 0, 0, NULL, ppap);
+ *root_new_virt_p = root_new; /* Returned to caller */
+
+ /* The translation tables in the new kernel should now contain
+ * the same translations as the tables in the old kernel.
+ * This will allow us to update the iommu hdw to use the new tables.
+ *
+ * NOTE: Neither the iommu hardware nor the iommu->root_entry
+ * struct-value is updated herein.
+ * These are left for the caller to do.
+ */
+
+ return 0;
+}
+#endif /* CONFIG_CRASH_DUMP */
--
Bill Sumner <bill.sumner-***@public.gmane.org>
Bill Sumner
2014-04-25 00:36:38 UTC
Permalink
Add "#include <linux/crash_dump.h>" to gain access to is_kdump_kernel()

Modify the operation of the following functions when called during crash dump:
device_to_domain_id
get_domain_for_dev
init_dmars
intel_iommu_init

Signed-off-by: Bill Sumner <bill.sumner-***@public.gmane.org>
---
drivers/iommu/intel-iommu.c | 133 +++++++++++++++++++++++++++++++++++++++-----
1 file changed, 120 insertions(+), 13 deletions(-)

diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
index 226c1d0..8b3e509 100644
--- a/drivers/iommu/intel-iommu.c
+++ b/drivers/iommu/intel-iommu.c
@@ -46,6 +46,7 @@
#include "irq_remapping.h"
#include "pci.h"
#include "intel-iommu-private.h"
+#include <linux/crash_dump.h>

static LIST_HEAD(dmar_atsr_units);
static LIST_HEAD(dmar_rmrr_units);
@@ -53,6 +54,7 @@ static LIST_HEAD(dmar_rmrr_units);
#define for_each_rmrr_units(rmrr) \
list_for_each_entry(rmrr, &dmar_rmrr_units, list)

+static int intel_iommu_in_crashdump;

static void __init check_tylersburg_isoch(void);

@@ -1796,6 +1798,24 @@ static void domain_remove_dev_info(struct dmar_domain *domain)
spin_unlock_irqrestore(&device_domain_lock, flags);
}

+#ifdef CONFIG_CRASH_DUMP
+static int device_to_domain_id(struct intel_iommu *iommu, u8 bus, u8 devfn)
+{
+ int did = -1; /* domain-id returned */
+ struct root_entry *root;
+ struct context_entry *context;
+ unsigned long flags;
+
+ spin_lock_irqsave(&iommu->lock, flags);
+ root = &iommu->root_entry[bus];
+ context = get_context_addr_from_root(root);
+ if (context && context_present(context+devfn))
+ did = context_get_did(context+devfn);
+ spin_unlock_irqrestore(&iommu->lock, flags);
+ return did;
+}
+#endif /* CONFIG_CRASH_DUMP */
+
/*
* find_domain
* Note: we use struct device->archdata.iommu stores the info
@@ -1880,6 +1900,9 @@ static struct dmar_domain *get_domain_for_dev(struct device *dev, int gaw)
unsigned long flags;
u8 bus, devfn, bridge_bus, bridge_devfn;
int did = -1; /* Default to "no domain_id supplied" */
+#ifdef CONFIG_CRASH_DUMP
+ struct domain_values_entry *dve = NULL;
+#endif /* CONFIG_CRASH_DUMP */

domain = find_domain(dev);
if (domain)
@@ -1920,6 +1943,24 @@ static struct dmar_domain *get_domain_for_dev(struct device *dev, int gaw)
domain = alloc_domain(false);
if (!domain)
goto error;
+
+#ifdef CONFIG_CRASH_DUMP
+ if (intel_iommu_in_crashdump) {
+ /*
+ * if this device had a did in the old kernel
+ * use its values instead of generating new ones
+ */
+ did = device_to_domain_id(iommu, bus, devfn);
+ if (did > 0 || (did == 0 && !cap_caching_mode(iommu->cap)))
+ dve = intel_iommu_did_to_domain_values_entry(did,
+ iommu);
+ if (dve)
+ gaw = dve->gaw;
+ else
+ did = -1;
+ }
+#endif /* CONFIG_CRASH_DUMP */
+
if (iommu_attach_domain(domain, iommu, did)) {
free_domain_mem(domain);
domain = NULL;
@@ -1929,6 +1970,18 @@ static struct dmar_domain *get_domain_for_dev(struct device *dev, int gaw)
if (domain_init(domain, gaw))
goto error;

+#ifdef CONFIG_CRASH_DUMP
+ if (intel_iommu_in_crashdump && dve) {
+
+ if (domain->pgd)
+ free_pgtable_page(domain->pgd);
+
+ domain->pgd = dve->pgd;
+
+ copy_reserved_iova(&dve->iovad, &domain->iovad);
+ }
+#endif /* CONFIG_CRASH_DUMP */
+
/* register pcie-to-pci device */
if (dev_tmp) {
domain = dmar_insert_dev_info(iommu, bridge_bus, bridge_devfn,
@@ -2331,6 +2384,10 @@ static int __init init_dmars(void)
struct device *dev;
struct intel_iommu *iommu;
int i, ret;
+#ifdef CONFIG_CRASH_DUMP
+ struct root_entry *root_old_phys;
+ struct root_entry *root_new_virt;
+#endif /* CONFIG_CRASH_DUMP */

/*
* for each drhd
@@ -2374,16 +2431,40 @@ static int __init init_dmars(void)
if (ret)
goto free_iommu;

- /*
- * TBD:
- * we could share the same root & context tables
- * among all IOMMU's. Need to Split it later.
- */
- ret = iommu_alloc_root_entry(iommu);
- if (ret) {
- printk(KERN_ERR "IOMMU: allocate root entry failed\n");
- goto free_iommu;
+#ifdef CONFIG_CRASH_DUMP
+ if (intel_iommu_in_crashdump) {
+ pr_info("IOMMU Copying translate tables from panicked kernel\n");
+ ret = intel_iommu_copy_translation_tables(drhd,
+ &root_old_phys, &root_new_virt,
+ g_num_of_iommus);
+ if (ret) {
+ pr_err("IOMMU: Copy translate tables failed\n");
+
+ /* Best to stop trying */
+ intel_iommu_in_crashdump = false;
+ goto error;
+ }
+ iommu->root_entry = root_new_virt;
+ pr_info("IOMMU: root_new_virt:0x%12.12llx phys:0x%12.12llx\n",
+ (u64)root_new_virt,
+ virt_to_phys(root_new_virt));
+ intel_iommu_get_dids_from_old_kernel(iommu);
+ } else {
+#endif /* CONFIG_CRASH_DUMP */
+ /*
+ * TBD:
+ * we could share the same root & context tables
+ * among all IOMMU's. Need to Split it later.
+ */
+ ret = iommu_alloc_root_entry(iommu);
+ if (ret) {
+ printk(KERN_ERR "IOMMU: allocate root entry failed\n");
+ goto free_iommu;
+ }
+#ifdef CONFIG_CRASH_DUMP
}
+#endif /* CONFIG_CRASH_DUMP */
+
if (!ecap_pass_through(iommu->ecap))
hw_pass_through = 0;
}
@@ -2442,6 +2523,16 @@ static int __init init_dmars(void)

check_tylersburg_isoch();

+#ifdef CONFIG_CRASH_DUMP
+ /*
+ * In the crashdump kernel: Skip setting-up new domains for
+ * si, rmrr, and the isa bus on the expectation that these
+ * translations were copied from the old kernel.
+ */
+ if (intel_iommu_in_crashdump)
+ goto skip_new_domains_for_si_rmrr_isa;
+#endif /* CONFIG_CRASH_DUMP */
+
/*
* If pass through is not set or not enabled, setup context entries for
* identity mappings for rmrr, gfx, and isa and may fall back to static
@@ -2482,6 +2573,10 @@ static int __init init_dmars(void)

iommu_prepare_isa();

+#ifdef CONFIG_CRASH_DUMP
+skip_new_domains_for_si_rmrr_isa:;
+#endif /* CONFIG_CRASH_DUMP */
+
/*
* for each drhd
* enable fault log
@@ -3624,12 +3719,24 @@ int __init intel_iommu_init(void)
goto out_free_dmar;
}

+#ifdef CONFIG_CRASH_DUMP
/*
- * Disable translation if already enabled prior to OS handover.
+ * If (This is the crash kernel)
+ * Set: copy iommu translate tables from old kernel
+ * Skip disabling the iommu hardware translations
*/
- for_each_active_iommu(iommu, drhd)
- if (iommu->gcmd & DMA_GCMD_TE)
- iommu_disable_translation(iommu);
+ if (is_kdump_kernel()) {
+ intel_iommu_in_crashdump = true;
+ pr_info("IOMMU intel_iommu_in_crashdump = true\n");
+ pr_info("IOMMU Skip disabling iommu hardware translations\n");
+ } else
+#endif /* CONFIG_CRASH_DUMP */
+ /*
+ * Disable translation if already enabled prior to OS handover.
+ */
+ for_each_active_iommu(iommu, drhd)
+ if (iommu->gcmd & DMA_GCMD_TE)
+ iommu_disable_translation(iommu);

if (dmar_dev_scope_init() < 0) {
if (force_on)
--
Bill Sumner <bill.sumner-***@public.gmane.org>
Bill Sumner
2014-04-25 00:36:37 UTC
Permalink
Interfaces for when a new domain in the crashdump kernel needs some
values from the panicked kernel's context entries.

Signed-off-by: Bill Sumner <bill.sumner-***@public.gmane.org>
---
drivers/iommu/intel-iommu-kdump.c | 39 +++++++++++++++++++++++++++++++++++++++
1 file changed, 39 insertions(+)

diff --git a/drivers/iommu/intel-iommu-kdump.c b/drivers/iommu/intel-iommu-kdump.c
index 4e653e048..f3c777e 100644
--- a/drivers/iommu/intel-iommu-kdump.c
+++ b/drivers/iommu/intel-iommu-kdump.c
@@ -53,6 +53,45 @@ static struct list_head *domain_values_list;


/* ========================================================================
+ * Interfaces for when a new domain in the crashdump kernel needs some
+ * values from the panicked kernel's context entries
+ * ------------------------------------------------------------------------
+ */
+
+
+struct domain_values_entry *intel_iommu_did_to_domain_values_entry(int did,
+ struct intel_iommu *iommu)
+{
+ struct domain_values_entry *dve; /* iterator */
+
+ list_for_each_entry(dve, &domain_values_list[iommu->seq_id], link)
+ if (dve->did == did)
+ return dve;
+ return NULL;
+}
+
+
+/* Mark domain-id's from old kernel as in-use on this iommu so that a new
+ * domain-id is allocated in the case where there is a device in the new kernel
+ * that was not in the old kernel -- and therefore a new domain-id is needed.
+ */
+int intel_iommu_get_dids_from_old_kernel(struct intel_iommu *iommu)
+{
+ struct domain_values_entry *dve; /* iterator */
+
+ pr_info("IOMMU:%d Domain ids from panicked kernel:\n", iommu->seq_id);
+
+ list_for_each_entry(dve, &domain_values_list[iommu->seq_id], link) {
+ set_bit(dve->did, iommu->domain_ids);
+ pr_info("DID did:%d(0x%4.4x)\n", dve->did, dve->did);
+ }
+
+ pr_info("----------------------------------------\n");
+ return 0;
+}
+
+
+/* ========================================================================
* Copy iommu translation tables from old kernel into new kernel.
* Entry to this set of functions is: intel_iommu_copy_translation_tables()
* ------------------------------------------------------------------------
--
Bill Sumner <bill.sumner-***@public.gmane.org>
David Woodhouse
2014-04-30 10:49:33 UTC
Permalink
Post by Bill Sumner
1. to accept the iommu hardware in an active state,
2. to leave the current translations in-place so that legacy DMA will continue
using its current buffers until the device drivers in the crashdump kernel
initialize and initialize their devices,
3. to use different portions of the iova address ranges for the device drivers
in the crashdump kernel than the iova ranges that were in-use at the time
of the panic.
There could be all kinds of existing mappings in the DMA page tables,
and I'm not sure it's safe to preserve them. What prevents the crashdump
kernel from trying to use any of the physical pages which are
accessible, and which could thus be corrupted by stray DMA?

In fact, the old kernel could even have set up 1:1 passthrough mappings
for some devices, which would then be able to DMA *anywhere*. Surely we
need to prevent that?

After the last round of this patchset, we discussed a potential
improvement where you point every virtual bus address at the *same*
physical scratch page.

That way, we allow the "rogue" DMA to continue to the same virtual bus
addresses, but it can only ever affect one piece of physical memory and
can't have detrimental effects elsewhere.

Was that option considered and discounted for some reason? It seems like
it would make sense...
--
David Woodhouse Open Source Technology Centre
David.Woodhouse-***@public.gmane.org Intel Corporation
Jerry Hoemann
2014-05-02 20:13:07 UTC
Permalink
On Wed, Apr 30, 2014 at 11:49:33AM +0100, David Woodhouse wrote:

Hi David,

As you may know, Bill has retired and I am picking up this work.
I am still coming up to speed in this area so my goal is to understand
your concerns and research them as I dig through code and specs.

My apologies for the delay in replying and for our having missed your
earlier questions.
Post by David Woodhouse
Post by Bill Sumner
1. to accept the iommu hardware in an active state,
2. to leave the current translations in-place so that legacy DMA will continue
using its current buffers until the device drivers in the crashdump kernel
initialize and initialize their devices,
3. to use different portions of the iova address ranges for the device drivers
in the crashdump kernel than the iova ranges that were in-use at the time
of the panic.
There could be all kinds of existing mappings in the DMA page tables,
and I'm not sure it's safe to preserve them. What prevents the crashdump
kernel from trying to use any of the physical pages which are
accessible, and which could thus be corrupted by stray DMA?
In kdump, we switch to, and execute from, the capture kernel (AKA 2nd kernel,
crash kernel). This is a separate, distinct instance of Linux. One of
the intents of this switch, quoting kdump.txt, is:

"This ensures that ongoing Direct Memory Access
(DMA) from the system kernel does not corrupt the dump-capture kernel.
The kexec -p command loads the dump-capture kernel into this reserved
memory."

As the capture kernel's memory is reserved early in boot, we shouldn't have
DMA targeted to it once the capture kernel is loaded.

Now, the capture kernel will try to access 1st kernel memory via /proc/vmcore
after it boots and runs makedumpfile. Is it this access that you
are concerned with?
Post by David Woodhouse
In fact, the old kernel could even have set up 1:1 passthrough mappings
for some devices, which would then be able to DMA *anywhere*. Surely we
need to prevent that?
From prior patch version comments, I know Bill was aware of the
issue of pass-through, but don't know to what extent he tested with
the feature enabled. E.g. in Jan and prior versions he stated he
had not tested w/ pass through. He subsequently dropped this statement.

The approach of the patch is to just allow the outstanding DMA
to complete. Assuming the targeted address of the pass through
was sane, does this differ greatly from the non pass through case?
Now, if the DMA was truly going to random places (like the capture
kernel itself) I'm not sure what we would do. Suggestions?
Post by David Woodhouse
After the last round of this patchset, we discussed a potential
improvement where you point every virtual bus address at the *same*
physical scratch page.
That way, we allow the "rogue" DMA to continue to the same virtual bus
addresses, but it can only ever affect one piece of physical memory and
can't have detrimental effects elsewhere.
Was that option considered and discounted for some reason? It seems like
it would make sense...
I don't know if this was considered.

I will need time to go through code and the spec to understand
implications better.

Thanks

Jerry
--
----------------------------------------------------------------------------
Jerry Hoemann Software Engineer Hewlett-Packard

3404 E Harmony Rd. MS 57 phone: (970) 898-1022
Ft. Collins, CO 80528 FAX: (970) 898-XXXX
email: jerry.hoemann-***@public.gmane.org
----------------------------------------------------------------------------
Jerry Hoemann
2014-05-07 18:25:15 UTC
Permalink
David,

I received the following email from Bill Sumner addressing your
earlier email.

Jerry
Post by David Woodhouse
Was that option considered and discounted for some reason? It seems like
it would make sense.
Considered ?
Yes. It is an interesting idea. See technical discussion below.

Discounted for some reason ?
Not really. Only for lack of time. With only a limited amount of time,
I focused on providing a clean set of patches on a recent baseline.
Post by David Woodhouse
Post by Bill Sumner
1. to accept the iommu hardware in an active state,
2. to leave the current translations in-place so that legacy DMA will continue
using its current buffers until the device drivers in the crashdump kernel
initialize and initialize their devices,
3. to use different portions of the iova address ranges for the device drivers
in the crashdump kernel than the iova ranges that were in-use at the time
of the panic.
There could be all kinds of existing mappings in the DMA page tables,
and I'm not sure it's safe to preserve them.
Actually I think it is safe to preserve them in a great number of cases,
and there is some percentage of cases where something else will work better.

Fortunately, history shows that the panicked kernel generates bad
translations rarely enough that many crashdumps on systems
without iommu hardware still succeed by simply allowing the DMA/IO
to continue into the existing buffers. The patch set uses the same
technique when the iommu is active. So I think the odds start
out in our favor -- enough in our favor that we should implement
the current patch set into Linux and then begin work to improve it.

Since the patch set is currently available, and its technique has already been
somewhat tested by three companies, the remaining path for including it
into Linux is short. This would significantly improve the crashdump success
rate when the Intel iommu is active. It would also provide a foundation for
investigations into more advanced techniques that would further increase
the success rate.

Bad translations do still happen -- bad programming, tables getting hosed,
etc. We would like to find a way to get a good dump in this extra percentage
of cases. It sounds like you have some good ideas in this area.

The iommu hardware guarantees that DMA/IO can only access the memory
areas described in the DMA page tables. We can be quite sure that the only
physical memory areas that any DMA/IO can access are ones that are described
by the contents of the translation tables. This comes with a few caveats:
1. hw-passthru and the 'si' domain -- see discussion under a later question.
The problem with these is that they allow DMA/IO access to any of
physical memory.
2. The iommu hardware will use valid translation entries to the "wrong" place
just as readily as it will use ones to the "right" place. Having the
kdump kernel check-out the memory areas described by the tables before
using them seems like a good idea. For instance: any DMA buffers from the
panicked kernel that point into the kdump kernel's area would be highly
suspect.
3. Random writes into the translate tables may well write known-bad values
into fields or reserved areas -- values which will cause the iommu
hardware to reject that entry. Unfortunately we cannot count on this
happening, but it felt like a bright-spot worth mentioning.
Post by David Woodhouse
What prevents the crashdump
kernel from trying to use any of the physical pages which are
accessible, and which could thus be corrupted by stray DMA?
DMA into the kdump area will corrupt the kdump and cause loss of the dump.
Note that this was the original problem with disabling the iommu at the
beginning of the kdump kernel which forced DMA that was going to
its original (good) buffers to begin going into essentially random
places -- almost all of them "wrong".

However, I believe that the kdump kernel itself will not be the problem
for the following reasons:

As I understand the kdump architecture, the kdump kernel is restricted
to the physical memory area that was reserved for it by the platform
kernel during its initialization (when the platform kernel presumably was
still healthy). The kdump kernel is assumed to be clean and healthy,
so it will not be attempting to use any memory outside of what it is
assigned -- except for reading pages of the panicked kernel in order to
write them to the dump file.

Assuming that the DMA page tables were checked to ensure that no DMA page-table
entry points into the kdump kernel's reserved area, no stray DMA/IO will
affect the kdump kernel.
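
Purely as an illustration of such a check (it is not something the current
patches implement), the copy code could flag any leaf entry whose target
falls inside the crashkernel reservation, using crashk_res from
<linux/kexec.h>:

    static bool pte_targets_crashkernel(u64 pte_val)
    {
            u64 phys = pte_val & VTD_PAGE_MASK;

            return phys >= crashk_res.start && phys <= crashk_res.end;
    }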
Post by David Woodhouse
In fact, the old kernel could even have set up 1:1 passthrough mappings
for some devices, which would then be able to DMA *anywhere*. Surely we
need to prevent that?
Yes, I agree.
The 1:1 passthrough mappings seem to be problematic -- both the
use of hw-passthrough by the iommu and the 'si' domain set up in the DMA
page tables. These mappings completely bypass one of the basic reasons
for using the iommu hardware -- to restrict DMA access to known-safe
areas of memory.

I would prefer that Linux not use either of these mechanisms unless
it is absolutely necessary -- in which case it could be explicitly
enabled. After all, there are probably still some (hopefully few) devices
that absolutely require it. Also, there may be circumstances where a
performance gain outweighs the additional risk to the crashdump.

If the kdump kernel finds a 1:1 passthrough domain among the DMA page tables,
the real issue comes if we also need that device for taking the crashdump.
If we do not need it, then pointing all of that device's IOVAs at a safe
buffer -- as you recommend -- looks like a good solution.

If kdump does need it, I can think of two ways to handle things:
1. Just leave it. This is what happens when there is no hardware iommu
active, and this has worked OK there for a long time. This option
clearly depends upon the 1:1 passthrough device not being the problem.
This is also what my patches do, since they are modeled on handling
the DMA buffers in the same manner as when there is no iommu active.

2. As you suggest, create a safe buffer and force all of this device's
IOVAs into it. Then begin mapping real buffers when the kdump kernel
begins using the device.
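
A very rough sketch of option 2 (the helper name here is invented; it only
assumes alloc_pgtable_page() and struct dma_pte, which the patch set
already uses):

    static unsigned long scratch_page_phys;     /* one shared scratch page */

    /* While copying the old page tables, point a pass-through device's
     * leaf entry at the single scratch page instead of copying its
     * original target.
     */
    static void redirect_pte_to_scratch(struct dma_pte *pte)
    {
            if (!scratch_page_phys)
                    scratch_page_phys =
                            virt_to_phys(alloc_pgtable_page(-1));

            pte->val &= ~VTD_PAGE_MASK;         /* keep the R/W bits */
            pte->val |= scratch_page_phys & VTD_PAGE_MASK;
    }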
Post by David Woodhouse
After the last round of this patchset, we discussed a potential
improvement where you point every virtual bus address at the *same*
physical scratch page.
That way, we allow the "rogue" DMA to continue to the same virtual bus
addresses, but it can only ever affect one piece of physical memory and
can't have detrimental effects elsewhere.
Just a few technical observations and questions that hopefully will help
implement this enhancement:

Since each device may eventually be used by the kdump kernel, each
device will need its own domain-id and its own set of DMA page tables
so that the IOVAs requested by the kdump kernel can be mapped to that
device's buffers.

As IO devices have grown smarter, many of them, particularly NICs and
storage interfaces, use DMA for work queues and status-reporting
vectors in addition to buffers of data to be transferred.
Some experimenting and testing may be necessary to determine how
these devices behave when the translation for the work queue is
switched to a safe-buffer which does not contain valid entries for
that device.

Questions that came to mind as I thought about this proposal:
1. Does the iommu need to know when the device driver has reset the device
and that it is safe to add translations to the DMA page tables?

2. If it needs to know, how does it know, since the device driver asking
for an IOVA via the DMA subsystem is usually the first indication to
the iommu driver about the device and this may not guarantee that the
device driver has already reset the device at that point?

3. For any given device, which IOVAs will be mapped to the safe buffer?
a. Only the IOVAs active at the time of the panic, which would require
scanning the existing DMA page tables to find them (a small model of
such a scan is sketched after this list).
b. All possible IOVAs? This would seem to be a very large number of
pages for the page tables -- especially since each device may need
its own set of DMA page tables. There could still be only one
"safe data buffer" with a lot of page tables pointing to it.
c. Determine these "on the fly" by capturing DMAR faults or some similar
mechanism?
d. Other possibilities?
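
As a small illustration of option (a), here is a toy, user-space model of such
a scan (the two-level layout and all names are assumptions for the sketch, not
driver code); it walks a leftover I/O page table and reports which IOVAs still
have a present mapping:

/*
 * Toy model: walk a two-level I/O page table and print every IOVA that
 * still has a present mapping.
 */
#include <stdint.h>
#include <stdio.h>

#define ENTRIES     512
#define PTE_PRESENT 0x1ULL
#define PAGE_SHIFT  12

static uint64_t leaf_table[ENTRIES];   /* second (leaf) level     */
static uint64_t *dir_table[ENTRIES];   /* first (directory) level */

static void scan_active_iovas(uint64_t **dir)
{
    unsigned int d, l;

    for (d = 0; d < ENTRIES; d++) {
        if (!dir[d])
            continue;                  /* no leaf table behind this slot */
        for (l = 0; l < ENTRIES; l++) {
            if (dir[d][l] & PTE_PRESENT)
                printf("active IOVA 0x%llx\n",
                       ((unsigned long long)d * ENTRIES + l) << PAGE_SHIFT);
        }
    }
}

int main(void)
{
    /* Pretend the old kernel left one active mapping behind. */
    dir_table[1] = leaf_table;
    leaf_table[7] = (0xabcdeULL << PAGE_SHIFT) | PTE_PRESENT;

    scan_active_iovas(dir_table);
    return 0;
}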
Post by David Woodhouse
Was that option considered and discounted for some reason? It seems like
it would make sense.
--
Bill Sumner


Forwarded by Jerry Hoemann
--
----------------------------------------------------------------------------
Jerry Hoemann Software Engineer Hewlett-Packard

3404 E Harmony Rd. MS 57 phone: (970) 898-1022
Ft. Collins, CO 80528 FAX: (970) 898-XXXX
email: jerry.hoemann-***@public.gmane.org
----------------------------------------------------------------------------
Joerg Roedel
2014-07-02 13:32:59 UTC
Permalink
Hi David,
Post by David Woodhouse
There could be all kinds of existing mappings in the DMA page tables,
and I'm not sure it's safe to preserve them. What prevents the crashdump
kernel from trying to use any of the physical pages which are
accessible, and which could thus be corrupted by stray DMA?
In fact, the old kernel could even have set up 1:1 passthrough mappings
for some devices, which would then be able to DMA *anywhere*. Surely we
need to prevent that?
Ideally we would prevent that, yes. But the problem is that a failed DMA
transaction might put the device into an unrecoverable state. Usually
any in-flight DMA transactions should only target buffers set up by the
previous kernel and not corrupt any data.
Post by David Woodhouse
After the last round of this patchset, we discussed a potential
improvement where you point every virtual bus address at the *same*
physical scratch page.
That is a solution to prevent the in-flight DMA failures. But what
happens when there is some in-flight DMA to a disk to write some inodes
or a new superblock. Then this scratch address-space may cause
filesystem corruption at worst.

So with this in mind I would prefer initially taking over the
page-tables from the old kernel before the device drivers re-initialize
the devices.


Joerg
Jerry Hoemann
2014-07-11 16:27:45 UTC
Permalink
Post by Jerry Hoemann
Hi David,
Post by David Woodhouse
There could be all kinds of existing mappings in the DMA page tables,
and I'm not sure it's safe to preserve them. What prevents the crashdump
kernel from trying to use any of the physical pages which are
accessible, and which could thus be corrupted by stray DMA?
In fact, the old kernel could even have set up 1:1 passthrough mappings
for some devices, which would then be able to DMA *anywhere*. Surely we
need to prevent that?
Ideally we would prevent that, yes. But the problem is that a failed DMA
transaction might put the device into an unrecoverable state. Usually
any in-flight DMA transactions should only target buffers set up by the
previous kernel and not corrupt any data.
Post by David Woodhouse
After the last round of this patchset, we discussed a potential
improvement where you point every virtual bus address at the *same*
physical scratch page.
That is a solution to prevent the in-flight DMA failures. But what
happens when there is some in-flight DMA to a disk to write some inodes
or a new superblock. Then this scratch address-space may cause
filesystem corruption at worst.
So with this in mind I would prefer initially taking over the
page-tables from the old kernel before the device drivers re-initialize
the devices.
Joerg
David, Joerg,

What do you think here? Do you want me to update the patch set for 3.17?

Jerry
--
----------------------------------------------------------------------------
Jerry Hoemann Software Engineer Hewlett-Packard

3404 E Harmony Rd. MS 57 phone: (970) 898-1022
Ft. Collins, CO 80528 FAX: (970) 898-XXXX
email: jerry.hoemann-***@public.gmane.org
----------------------------------------------------------------------------
Li, ZhenHua
2014-10-15 08:10:02 UTC
Permalink
David, Joerg,
I plan to merge this patch set with the 3.17 stable kernel and split it
into two parts:
1. The core part, including the changed functions, like [Patch 4/8],
[Patch 8/8].
2. The formatting issues, like [Patch 1/8], [Patch 3/8], including the
code-formatting changes and the creation of the new files
intel-iommu-kdump.c and intel-iommu-private.h.

I believe this will make the patch set clearer to read and understand.

What are your suggestions?

Thanks
Zhenhua
Post by Jerry Hoemann
Post by Jerry Hoemann
Hi David,
Post by David Woodhouse
There could be all kinds of existing mappings in the DMA page tables,
and I'm not sure it's safe to preserve them. What prevents the crashdump
kernel from trying to use any of the physical pages which are
accessible, and which could thus be corrupted by stray DMA?
In fact, the old kernel could even have set up 1:1 passthrough mappings
for some devices, which would then be able to DMA *anywhere*. Surely we
need to prevent that?
Ideally we would prevent that, yes. But the problem is that a failed DMA
transaction might put the device into an unrecoverable state. Usually
any in-flight DMA transactions should only target buffers set up by the
previous kernel and not corrupt any data.
Post by David Woodhouse
After the last round of this patchset, we discussed a potential
improvement where you point every virtual bus address at the *same*
physical scratch page.
That is a solution to prevent the in-flight DMA failures. But what
happens when there is some in-flight DMA to a disk to write some inodes
or a new superblock. Then this scratch address-space may cause
filesystem corruption at worst.
So with this in mind I would prefer initially taking over the
page-tables from the old kernel before the device drivers re-initialize
the devices.
Joerg
David, Joerg,
What do you think here? Do you want me to update the patch set for 3.17?
Jerry
Li, ZhenHua
2014-10-15 08:45:27 UTC
Permalink
Add Tom to CC list.
Post by Jerry Hoemann
David, Joerg,
I plan to merge this patch set with the 3.17 stable kernel and split it
1. The core part, including the changed functions, like [Patch 4/8],
[Patch 8/8].
2. The formatting issues, like [Patch 1/8], [Patch 3/8], including the
code-formatting changes and the creation of the new files
intel-iommu-kdump.c and intel-iommu-private.h.
I believe this will make the patch set clearer to read and understand.
What are your suggestions?
Thanks
Zhenhua
Post by Jerry Hoemann
Post by Jerry Hoemann
Hi David,
Post by David Woodhouse
There could be all kinds of existing mappings in the DMA page tables,
and I'm not sure it's safe to preserve them. What prevents the crashdump
kernel from trying to use any of the physical pages which are
accessible, and which could thus be corrupted by stray DMA?
In fact, the old kernel could even have set up 1:1 passthrough mappings
for some devices, which would then be able to DMA *anywhere*. Surely we
need to prevent that?
Ideally we would prevent that, yes. But the problem is that a failed DMA
transaction might put the device into an unrecoverable state. Usually
any in-flight DMA transactions should only target buffers set up by the
previous kernel and not corrupt any data.
Post by David Woodhouse
After the last round of this patchset, we discussed a potential
improvement where you point every virtual bus address at the *same*
physical scratch page.
That is a solution to prevent the in-flight DMA failures. But what
happens when there is some in-flight DMA to a disk to write some inodes
or a new superblock. Then this scratch address-space may cause
filesystem corruption at worst.
So with this in mind I would prefer initially taking over the
page-tables from the old kernel before the device drivers re-initialize
the devices.
Joerg
David, Joerg,
What do you think here? Do you want me to update the patch set for 3.17?
Jerry
Bjorn Helgaas
2014-10-22 02:16:46 UTC
Permalink
[-cc Bill, +cc Zhen-Hua, Eric, Tom, Jerry]

Hi Joerg,

I was looking at Zhen-Hua's recent patches, trying to figure out if I
need to do anything with them. Resetting devices in the old kernel
seems like a non-starter. Resetting devices in the new kernel, ...,
well, maybe. It seems ugly, and it seems like the sort of problem
that IOMMUs are designed to solve. Anyway, I found this old
Post by Joerg Roedel
Post by David Woodhouse
After the last round of this patchset, we discussed a potential
improvement where you point every virtual bus address at the *same*
physical scratch page.
That is a solution to prevent the in-flight DMA failures. But what
happens when there is some in-flight DMA to a disk to write some inodes
or a new superblock. Then this scratch address-space may cause
filesystem corruption at worst.
This in-flight DMA is from a device programmed by the old kernel, and
it would be reading data from the old kernel's buffers. I think
you're suggesting that we might want that DMA read to complete so the
device can update filesystem metadata?

I don't really understand that argument. Don't we usually want to
stop any data from escaping the machine after a crash, on the theory
that the old kernel is crashing because something is catastrophically
wrong and we may have already corrupted things in memory? If so,
allowing this old DMA to complete is just as likely to make things
worse as to make them better.

Without kdump, we likely would reboot through the BIOS and the device
would get reset and the DMA would never happen at all. So if we made
the dump kernel program the IOMMU to prevent the DMA, that seems like
a similar situation.
Post by Joerg Roedel
So with this in mind I would prefer initially taking over the
page-tables from the old kernel before the device drivers re-initialize
the devices.
This makes the dump kernel more dependent on data from the old kernel,
which we obviously want to avoid when possible.

I didn't find the previous discussion where pointing every virtual bus
address at the same physical scratch page was proposed. Why was that
better than programming the IOMMU to reject every DMA?

Bjorn
Eric W. Biederman
2014-10-22 02:47:22 UTC
Permalink
Post by Bjorn Helgaas
[-cc Bill, +cc Zhen-Hua, Eric, Tom, Jerry]
Hi Joerg,
I was looking at Zhen-Hua's recent patches, trying to figure out if I
need to do anything with them. Resetting devices in the old kernel
seems like a non-starter. Resetting devices in the new kernel, ...,
well, maybe. It seems ugly, and it seems like the sort of problem
that IOMMUs are designed to solve. Anyway, I found this old
For context here is the kexec on panic design, and what I know from
previous rounds of similar conversations.

The way kexec on panic aka kdump is designed to work is that the
recovery kernel lives in a piece of memory reserved at boot time and
known not to be in use by any driver (because we never ever use it for
DMA). If DMAs continue from any source, the old kernel may be a little
more corrupted, but our currently running kernel should not be.

Device drivers that we use in the recovery kernel are required to be
able to initialize their devices from an arbitrary state or fail to
initialize their devices.

We have discussed things on various occasions, but IOMMUs all have their
own individual idiosyncrasies and came late to the party, so it
is hard to generalize.

The reserved region is generally low enough in memory that simply
not using IOMMUs works.

The major challenge with initializing an IOMMU would be that there are
potentially devices whose driver is not loaded in the recovery kernel
but which have on-going DMA sessions (perhaps a NIC responding to a
network packet).

Which essentially means that if you are going to use an IOMMU slot in a
recovery kernel, you either have to know that the IOMMU slot was reserved for
the recovery kernel (which has always felt like the easiest way to me),
or you have to know that everything that could target that IOMMU slot has
been reset or has its driver loaded.

I have always thought the simplest and easiest solution would be to
reserve a few IOMMU slots for the kexec on panic kernel. But if folks
can find other ways to guarantee that an on-going DMA isn't targeting
an IOMMU slot (such as resetting everything downstream from that
IOMMU slot), more power to you.
Post by Bjorn Helgaas
Post by Joerg Roedel
Post by David Woodhouse
After the last round of this patchset, we discussed a potential
improvement where you point every virtual bus address at the *same*
physical scratch page.
That is a solution to prevent the in-flight DMA failures. But what
happens when there is some in-flight DMA to a disk to write some inodes
or a new superblock. Then this scratch address-space may cause
filesystem corruption at worst.
This in-flight DMA is from a device programmed by the old kernel, and
it would be reading data from the old kernel's buffers. I think
you're suggesting that we might want that DMA read to complete so the
device can update filesystem metadata?
I don't really understand that argument. Don't we usually want to
stop any data from escaping the machine after a crash, on the theory
that the old kernel is crashing because something is catastrophically
wrong and we may have already corrupted things in memory? If so,
allowing this old DMA to complete is just as likely to make things
worse as to make them better.
Without kdump, we likely would reboot through the BIOS and the device
would get reset and the DMA would never happen at all. So if we made
the dump kernel program the IOMMU to prevent the DMA, that seems like
a similar situation.
Post by Joerg Roedel
So with this in mind I would prefer initially taking over the
page-tables from the old kernel before the device drivers re-initialize
the devices.
This makes the dump kernel more dependent on data from the old kernel,
which we obviously want to avoid when possible.
I didn't find the previous discussion where pointing every virtual bus
address at the same physical scratch page was proposed. Why was that
better than programming the IOMMU to reject every DMA?
Bjorn
Eric
Li, ZhenHua
2014-10-22 03:08:36 UTC
Permalink
I need more time to read and think about these mails. I just want to
clarify one thing: Bill has left HP, and I have now inherited his work.
That's why I sent an update of his patch:
https://lkml.org/lkml/2014/10/21/134
Post by Eric W. Biederman
Post by Bjorn Helgaas
[-cc Bill, +cc Zhen-Hua, Eric, Tom, Jerry]
Hi Joerg,
I was looking at Zhen-Hua's recent patches, trying to figure out if I
need to do anything with them. Resetting devices in the old kernel
seems like a non-starter. Resetting devices in the new kernel, ...,
well, maybe. It seems ugly, and it seems like the sort of problem
that IOMMUs are designed to solve. Anyway, I found this old
For context here is the kexec on panic design, and what I know from
previous rounds of similar conversations.
The way kexec on panic aka kdump is designed to work is that the
recovery kernel lives in a piece of memory reserved at boot time and
known not to be in use by any driver (because we never ever use it for
DMA). If DMAs continue from any source, the old kernel may be a little
more corrupted, but our currently running kernel should not be.
Device drivers that we use in the recovery kernel are required to be
able to initialize their devices from an arbitrary state or fail to
initialize their devices.
We have discussed things on various occasions, but IOMMUs all have their
own individual idiosyncrasies and came late to the party, so it
is hard to generalize.
The reserved region is generally low enough in memory that simply
not using IOMMUs works.
The major challenge with initializing an IOMMU would be that there are
potentially devices whose driver is not loaded in the recovery kernel
but which have on-going DMA sessions (perhaps a NIC responding to a
network packet).
Which essentially means that if you are going to use an IOMMU slot in a
recovery kernel, you either have to know that the IOMMU slot was reserved for
the recovery kernel (which has always felt like the easiest way to me),
or you have to know that everything that could target that IOMMU slot has
been reset or has its driver loaded.
I have always thought the simplest and easiest solution would be to
reserve a few IOMMU slots for the kexec on panic kernel. But if folks
can find other ways to guarantee that an on-going DMA isn't targeting
an IOMMU slot (such as resetting everything downstream from that
IOMMU slot), more power to you.
Post by Bjorn Helgaas
Post by Joerg Roedel
Post by David Woodhouse
After the last round of this patchset, we discussed a potential
improvement where you point every virtual bus address at the *same*
physical scratch page.
That is a solution to prevent the in-flight DMA failures. But what
happens when there is some in-flight DMA to a disk to write some inodes
or a new superblock. Then this scratch address-space may cause
filesystem corruption at worst.
This in-flight DMA is from a device programmed by the old kernel, and
it would be reading data from the old kernel's buffers. I think
you're suggesting that we might want that DMA read to complete so the
device can update filesystem metadata?
I don't really understand that argument. Don't we usually want to
stop any data from escaping the machine after a crash, on the theory
that the old kernel is crashing because something is catastrophically
wrong and we may have already corrupted things in memory? If so,
allowing this old DMA to complete is just as likely to make things
worse as to make them better.
Without kdump, we likely would reboot through the BIOS and the device
would get reset and the DMA would never happen at all. So if we made
the dump kernel program the IOMMU to prevent the DMA, that seems like
a similar situation.
Post by Joerg Roedel
So with this in mind I would prefer initially taking over the
page-tables from the old kernel before the device drivers re-initialize
the devices.
This makes the dump kernel more dependent on data from the old kernel,
which we obviously want to avoid when possible.
I didn't find the previous discussion where pointing every virtual bus
address at the same physical scratch page was proposed. Why was that
better than programming the IOMMU to reject every DMA?
Bjorn
Eric
Joerg Roedel
2014-10-22 13:21:58 UTC
Permalink
Hi Bjorn,
Post by Bjorn Helgaas
I was looking at Zhen-Hua's recent patches, trying to figure out if I
need to do anything with them. Resetting devices in the old kernel
seems like a non-starter. Resetting devices in the new kernel, ...,
well, maybe. It seems ugly, and it seems like the sort of problem
that IOMMUs are designed to solve.
Actually resetting the devices in the kdump kernel would be one of the
better solutions for this problem. When we have a generic way to stop
all in-flight DMA from the PCI endpoints we could safely disable and
then re-enable the IOMMU.
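As a rough idea of what such quiescing could look like from the kdump kernel's
side, here is a hedged sketch built around the existing pci_reset_function()
helper (a best-effort loop, not the generic, reliable mechanism asked for
above; devices that expose no usable reset method simply fail the call and are
skipped):

/*
 * Hedged sketch only: try to reset every PCI function before the IOMMU
 * is re-programmed, to stop as much in-flight DMA as possible.
 */
#include <linux/pci.h>
#include <linux/printk.h>

static void kdump_quiesce_pci_endpoints(void)
{
    struct pci_dev *pdev = NULL;

    for_each_pci_dev(pdev) {
        if (pci_reset_function(pdev))
            pr_warn("kdump: could not reset %s\n", pci_name(pdev));
    }
}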
Post by Bjorn Helgaas
Post by Joerg Roedel
That is a solution to prevent the in-flight DMA failures. But what
happens when there is some in-flight DMA to a disk to write some inodes
or a new superblock. Then this scratch address-space may cause
filesystem corruption at worst.
This in-flight DMA is from a device programmed by the old kernel, and
it would be reading data from the old kernel's buffers. I think
you're suggesting that we might want that DMA read to complete so the
device can update filesystem metadata?
Well, it is not about updating filesystem metadata. In the kdump kernel
we have these options:

1) Disable the IOMMU. The problem here is that DMA is now
untranslated, so any in-flight DMA might read from or write to
a random location in memory, corrupting the kdump or
even the new kexec kernel memory. So this is a non-starter.

2) Re-program the IOMMU to block all DMA. This is safer, as it
does not corrupt any data in memory. But some devices react
very poorly to a master abort from the IOMMU, so badly that the
driver in the kdump kernel fails to initialize that device.
In this case taking the dump itself might fail (and often does,
according to reports).

3) To prevent master aborts like in option (2), David suggested
mapping the whole DMA address space to a scratch page. This
also prevents system memory corruption and the master aborts.
But the problem is that in-flight DMA will now read all
zeros. This can corrupt disk and network data; at worst it
nulls out the superblocks of your filesystem and you lose all
data. So this is not a good option either.

What we currently do in the VT-d driver is a mixture of (1) and (2). The
VT-d driver disables the IOMMU hardware (opening a race window for
memory data corruption), re-initializes it to reject any ongoing DMA
(which causes master aborts for in-flight DMA) and re-enables the IOMMU
hardware.

But this strategy fails quite often in heavy IO environments, and we are
looking into ways to solve the problem, or at least improve the current
situation as well as we can.

I talked to David about this at LPC and we came up with basically this
procedure:

1. If the VT-d driver finds the IOMMU enabled, it reuses its
root-context table. This way the IOMMU does not have to be disabled
and re-enabled, eliminating the race we have now.
Other data structures like the context entries are copied
over from the old kernel. We basically keep all mappings
from the old kernel, allowing any in-flight DMA to succeed.
No memory or disk data corruption.
The issue here is that we modify data from the old kernel,
which is about to be dumped. But there are ways to work
around that.

2. When a device driver issues the first dma_map command for a
device, we assign a new, empty page table, thus removing
all of the old kernel's mappings for the device.
The rationale is that at this point the device driver should
have reset the device to a point where all in-flight DMA is
canceled.
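
For concreteness, here is a minimal sketch of the two steps just described
(a sketch only, not the actual patches). It assumes the structure definitions
and the driver's internal alloc_pgtable_page() allocator are visible through
the proposed intel-iommu-private.h, uses memremap() on the assumption that the
old root table lies outside the kdump kernel's direct mapping, and
device_needs_fresh_pgtable() / install_domain_page_table() are hypothetical
helpers standing in for the per-device bookkeeping:

/* Hedged sketch; assumes the proposed intel-iommu-private.h is visible. */
#include <linux/types.h>
#include <linux/errno.h>
#include <linux/device.h>
#include <linux/io.h>           /* memremap(), MEMREMAP_WB */
#include "intel-iommu-private.h"

/* Hypothetical helpers, standing in for the per-device bookkeeping. */
extern bool device_needs_fresh_pgtable(struct device *dev);
extern void install_domain_page_table(struct dmar_domain *domain, void *pgtable);

/*
 * Step 1: instead of allocating a fresh root table, pick up the one the
 * old kernel left programmed in the hardware, so translation never has
 * to be switched off.
 */
static int reuse_old_root_table(struct intel_iommu *iommu)
{
    u64 old_rta = dmar_readq(iommu->reg + DMAR_RTADDR_REG);

    iommu->root_entry = memremap(old_rta & VTD_PAGE_MASK,
                                 VTD_PAGE_SIZE, MEMREMAP_WB);
    return iommu->root_entry ? 0 : -ENOMEM;
}

/*
 * Step 2: on the first dma_map call for a device, hand it a brand-new,
 * empty page table so the old kernel's mappings for that device go away.
 */
static void maybe_switch_to_new_page_table(struct dmar_domain *domain,
                                           struct device *dev)
{
    if (device_needs_fresh_pgtable(dev))
        install_domain_page_table(domain, alloc_pgtable_page(domain->nid));
}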

This approach goes in the same direction as Bill Sumner's patch set,
which Li took over, but it does not go as far as keeping the old mappings
while the kdump kernel is still working with the devices (which might
introduce new issues and corner cases).
Post by Bjorn Helgaas
Post by Joerg Roedel
So with this in mind I would prefer initially taking over the
page-tables from the old kernel before the device drivers re-initialize
the devices.
This makes the dump kernel more dependent on data from the old kernel,
which we obviously want to avoid when possible.
Sure, but this is not really possible here (unless we have a generic and
reliable way to reset all PCI endpoint devices and cancel all in-flight
DMA before we disable the IOMMU in the kdump kernel).
Otherwise we always risk data corruption somewhere, in system memory or
on disk.
Post by Bjorn Helgaas
I didn't find the previous discussion where pointing every virtual bus
address at the same physical scratch page was proposed. Why was that
better than programming the IOMMU to reject every DMA?
As I said, the problem is that this causes master aborts which some
devices can't recover from, so that the device driver in the kdump
kernel fails to initialize the device.


Joerg
Bjorn Helgaas
2014-10-22 18:26:07 UTC
Permalink
Post by Joerg Roedel
Hi Bjorn,
Post by Bjorn Helgaas
I was looking at Zhen-Hua's recent patches, trying to figure out if I
need to do anything with them. Resetting devices in the old kernel
seems like a non-starter. Resetting devices in the new kernel, ...,
well, maybe. It seems ugly, and it seems like the sort of problem
that IOMMUs are designed to solve.
Actually resetting the devices in the kdump kernel would be one of the
better solutions for this problem. When we have a generic way to stop
all in-flight DMA from the PCI endpoints we could safely disable and
then re-enable the IOMMU.
Post by Bjorn Helgaas
Post by Joerg Roedel
That is a solution to prevent the in-flight DMA failures. But what
happens when there is some in-flight DMA to a disk to write some inodes
or a new superblock. Then this scratch address-space may cause
filesystem corruption at worst.
This in-flight DMA is from a device programmed by the old kernel, and
it would be reading data from the old kernel's buffers. I think
you're suggesting that we might want that DMA read to complete so the
device can update filesystem metadata?
Well, it is not about updating filesystem metadata. In the kdump kernel
1) Disable the IOMMU. The problem here is that DMA is now
untranslated, so any in-flight DMA might read from or write to
a random location in memory, corrupting the kdump or
even the new kexec kernel memory. So this is a non-starter.
Agreed (at least if the IOMMU was enabled in the crashed kernel).
Post by Joerg Roedel
2) Re-program the IOMMU to block all DMA. This is safer, as it
does not corrupt any data in memory. But some devices react
very poorly to a master abort from the IOMMU, so badly that the
driver in the kdump kernel fails to initialize that device.
In this case taking the dump itself might fail (and often does,
according to reports).
Sounds like an option, even though broken devices work poorly.
Post by Joerg Roedel
3) To prevent master aborts like in option (2), David suggested
mapping the whole DMA address space to a scratch page. This
also prevents system memory corruption and the master aborts.
But the problem is that in-flight DMA will now read all
zeros. This can corrupt disk and network data; at worst it
nulls out the superblocks of your filesystem and you lose all
data. So this is not a good option either.
Ah, yes, I see your point now. This allows corrupted data, e.g., all
zeroes, to be written to disk or network after the kernel crash. I
agree; this doesn't sound like a good option.

And the proposal below is a 4th option (leave IOMMU enabled, reusing
crashed kernel's mappings until drivers make new mappings).
Post by Joerg Roedel
What we currently do in the VT-d driver is a mixture of (1) and (2). The
VT-d driver disables the IOMMU hardware (opening a race window for
memory data corruption), re-initializes it to reject any ongoing DMA
(which causes master aborts for in-flight DMA) and re-enables the IOMMU
hardware.
But this strategy fails quite often in heavy IO environments, and we are
looking into ways to solve the problem, or at least improve the current
situation as well as we can.
I talked to David about this at LPC and we came up with basically this
1. If the VT-d driver finds the IOMMU enabled, it reuses its
root-context table. This way the IOMMU does not have to be disabled
and re-enabled, eliminating the race we have now.
Other data structures like the context entries are copied
over from the old kernel. We basically keep all mappings
from the old kernel, allowing any in-flight DMA to succeed.
No memory or disk data corruption.
If the crashed kernel had corrupted memory, couldn't an in-flight DMA
read that corrupted data from memory and write it to disk?

I guess you could argue that this is merely a race, and the in-flight
DMA could just as easily have happened before the kernel crash, so
there's always a window and the only question is whether it closes
when the IOMMU driver starts up or when the device driver starts up.
Post by Joerg Roedel
The issue here is that we modify data from the old kernel,
which is about to be dumped. But there are ways to work
around that.
2. When a device driver issues the first dma_map command for a
device, we assign a new, empty page table, thus removing
all of the old kernel's mappings for the device.
The rationale is that at this point the device driver should
have reset the device to a point where all in-flight DMA is
canceled.
This approach goes in the same direction as Bill Sumner's patch set,
which Li took over, but it does not go as far as keeping the old mappings
while the kdump kernel is still working with the devices (which might
introduce new issues and corner cases).
Post by Bjorn Helgaas
Post by Joerg Roedel
So with this in mind I would prefer initially taking over the
page-tables from the old kernel before the device drivers re-initialize
the devices.
This makes the dump kernel more dependent on data from the old kernel,
which we obviously want to avoid when possible.
Sure, but this is not really possible here (unless we have a generic and
reliable way to reset all PCI endpoint devices and cancel all in-flight
DMA before we disable the IOMMU in the kdump kernel).
Otherwise we always risk data corruption somewhere, in system memory or
on disk.
Post by Bjorn Helgaas
I didn't find the previous discussion where pointing every virtual bus
address at the same physical scratch page was proposed. Why was that
better than programming the IOMMU to reject every DMA?
As I said, the problem is that this causes master aborts which some
devices can't recover from, so that the device driver in the kdump
kernel fails to initialize the device.
Yes, thanks for making that explicit again.

Bjorn
