Discussion:
[PATCH 0/4] VFIO Misc fixes
Gavin Shan
2014-05-19 03:01:06 UTC
Changelog
=========
v1 -> v2:
* Change the comments and commit log in PATCH[4/4] (Alex).
* Export 2 MSI relevant functions (Alex).
v2 -> v3:
* Add missed header file in PATCH[4/4].

Gavin Shan (4):
PCI: Export MSI message relevant functions
drivers/vfio: Rework offsetofend()
drivers/vfio/pci: Fix wrong MSI interrupt count
vfio/pci: Restore MSIx message prior to enabling

drivers/pci/msi.c | 2 ++
drivers/vfio/pci/vfio_pci.c | 3 +--
drivers/vfio/pci/vfio_pci_intrs.c | 15 +++++++++++++++
include/linux/vfio.h | 5 ++---
4 files changed, 20 insertions(+), 5 deletions(-)
--
1.8.3.2
Gavin Shan
2014-05-19 03:01:08 UTC
The macro offsetofend() introduces an unnecessary temporary variable
"tmp". The patch avoids that and saves a bit of stack memory.

Signed-off-by: Gavin Shan <***@linux.vnet.ibm.com>
---
include/linux/vfio.h | 5 ++---
1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/include/linux/vfio.h b/include/linux/vfio.h
index 81022a52..8ec980b 100644
--- a/include/linux/vfio.h
+++ b/include/linux/vfio.h
@@ -86,9 +86,8 @@ extern void vfio_unregister_iommu_driver(
* from user space. This allows us to easily determine if the provided
* structure is sized to include various fields.
*/
-#define offsetofend(TYPE, MEMBER) ({ \
- TYPE tmp; \
- offsetof(TYPE, MEMBER) + sizeof(tmp.MEMBER); }) \
+#define offsetofend(TYPE, MEMBER) \
+ (offsetof(TYPE, MEMBER) + sizeof(((TYPE *)0)->MEMBER))

/*
* External user API
--
1.8.3.2
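
For readers who want to try the reworked macro outside the kernel, here is a
minimal userspace sketch. The macro body is the one from the patch above;
struct demo_info and the two demo_covers_*() helpers are hypothetical names
made up for illustration and are not part of VFIO. The sketch assumes a C11
compiler for static_assert.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Same shape as the reworked macro in the patch: no temporary object. */
#define offsetofend(TYPE, MEMBER) \
	(offsetof(TYPE, MEMBER) + sizeof(((TYPE *)0)->MEMBER))

/* Hypothetical ioctl argument, for illustration only (not a VFIO struct). */
struct demo_info {
	uint32_t argsz;   /* size the caller claims to have provided */
	uint32_t flags;
	uint64_t extra;   /* imagine this field was added in a later revision */
};

/* The expression is a compile-time constant, so it can be checked statically. */
static_assert(offsetofend(struct demo_info, argsz) == sizeof(uint32_t),
	      "argsz is the first member and ends at byte 4");

/* Returns 1 if a caller-supplied argsz is large enough to include 'flags'. */
static int demo_covers_flags(uint32_t argsz)
{
	return argsz >= offsetofend(struct demo_info, flags);
}

/* Returns 1 if argsz also includes the later 'extra' field. */
static int demo_covers_extra(uint32_t argsz)
{
	return argsz >= offsetofend(struct demo_info, extra);
}
```

A caller that passes argsz == 8 would satisfy demo_covers_flags() but not
demo_covers_extra(), which is the kind of "is the structure sized to include
this field" check the comment in vfio.h describes.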
Gavin Shan
2014-05-19 03:01:07 UTC
The patch exports 2 MSI message relevant functions, which will be
used by VFIO PCI driver. The VFIO PCI driver would be built as
a module.

Signed-off-by: Gavin Shan <***@linux.vnet.ibm.com>
---
drivers/pci/msi.c | 2 ++
1 file changed, 2 insertions(+)

diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c
index 955ab79..2350271 100644
--- a/drivers/pci/msi.c
+++ b/drivers/pci/msi.c
@@ -324,6 +324,7 @@ void get_cached_msi_msg(unsigned int irq, struct msi_msg *msg)

__get_cached_msi_msg(entry, msg);
}
+EXPORT_SYMBOL_GPL(get_cached_msi_msg);

void __write_msi_msg(struct msi_desc *entry, struct msi_msg *msg)
{
@@ -368,6 +369,7 @@ void write_msi_msg(unsigned int irq, struct msi_msg *msg)

__write_msi_msg(entry, msg);
}
+EXPORT_SYMBOL_GPL(write_msi_msg);

static void free_msi_irqs(struct pci_dev *dev)
{
--
1.8.3.2

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Gavin Shan
2014-05-22 05:10:14 UTC
Post by Gavin Shan
The patch exports 2 MSI message relevant functions, which will be
used by VFIO PCI driver. The VFIO PCI driver would be built as
a module.
Bjorn, could you ack it if you have no objection to it?
I guess Alex is probably waiting to merge the subsequent patch,
which depends on this one.
Post by Gavin Shan
---
drivers/pci/msi.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c
index 955ab79..2350271 100644
--- a/drivers/pci/msi.c
+++ b/drivers/pci/msi.c
@@ -324,6 +324,7 @@ void get_cached_msi_msg(unsigned int irq, struct msi_msg *msg)
__get_cached_msi_msg(entry, msg);
}
+EXPORT_SYMBOL_GPL(get_cached_msi_msg);
void __write_msi_msg(struct msi_desc *entry, struct msi_msg *msg)
{
@@ -368,6 +369,7 @@ void write_msi_msg(unsigned int irq, struct msi_msg *msg)
__write_msi_msg(entry, msg);
}
+EXPORT_SYMBOL_GPL(write_msi_msg);
static void free_msi_irqs(struct pci_dev *dev)
{
Thanks,
Gavin
Bjorn Helgaas
2014-09-04 22:57:36 UTC
Post by Gavin Shan
The patch exports 2 MSI message relevant functions, which will be
used by VFIO PCI driver. The VFIO PCI driver would be built as
a module.
Acked-by: Bjorn Helgaas <***@google.com>

I think Alex will merge this along with the other ones. Sorry this
took so long. I don't really like this, but I just can't figure out
any solution that's better.
Post by Gavin Shan
---
drivers/pci/msi.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c
index 955ab79..2350271 100644
--- a/drivers/pci/msi.c
+++ b/drivers/pci/msi.c
@@ -324,6 +324,7 @@ void get_cached_msi_msg(unsigned int irq, struct msi_msg *msg)
__get_cached_msi_msg(entry, msg);
}
+EXPORT_SYMBOL_GPL(get_cached_msi_msg);
void __write_msi_msg(struct msi_desc *entry, struct msi_msg *msg)
{
@@ -368,6 +369,7 @@ void write_msi_msg(unsigned int irq, struct msi_msg *msg)
__write_msi_msg(entry, msg);
}
+EXPORT_SYMBOL_GPL(write_msi_msg);
static void free_msi_irqs(struct pci_dev *dev)
{
--
1.8.3.2
Gavin Shan
2014-09-05 00:15:55 UTC
Post by Bjorn Helgaas
Post by Gavin Shan
The patch exports 2 MSI message relevant functions, which will be
used by VFIO PCI driver. The VFIO PCI driver would be built as
a module.
I think Alex will merge this along with the other ones. Sorry this
took so long. I don't really like this, but I just can't figure out
any solution that's better.
Thanks, Bjorn. I thought you might have forgotten this. Let's get it in first,
and I'll investigate further later to see if I can figure out something
better.

Thanks,
Gavin
Post by Bjorn Helgaas
Post by Gavin Shan
---
drivers/pci/msi.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c
index 955ab79..2350271 100644
--- a/drivers/pci/msi.c
+++ b/drivers/pci/msi.c
@@ -324,6 +324,7 @@ void get_cached_msi_msg(unsigned int irq, struct msi_msg *msg)
__get_cached_msi_msg(entry, msg);
}
+EXPORT_SYMBOL_GPL(get_cached_msi_msg);
void __write_msi_msg(struct msi_desc *entry, struct msi_msg *msg)
{
@@ -368,6 +369,7 @@ void write_msi_msg(unsigned int irq, struct msi_msg *msg)
__write_msi_msg(entry, msg);
}
+EXPORT_SYMBOL_GPL(write_msi_msg);
static void free_msi_irqs(struct pci_dev *dev)
{
--
1.8.3.2
Gavin Shan
2014-05-19 03:01:10 UTC
The MSIx vector table lives in device memory, which may be cleared as
part of a backdoor device reset. This is the case on the IBM IPR HBA
when the BIST is run on the device. When assigned to a QEMU guest,
the guest driver does a pci_save_state(), issues a BIST, then does a
pci_restore_state(). The BIST clears the MSIx vector table, but due
to the way interrupts are configured the pci_restore_state() does not
restore the vector table as expected. Eventually this results in an
EEH error on Power platforms when the device attempts to signal an
interrupt with the zero'd table entry.

Fix the problem by restoring the host cached MSI message prior to
enabling each vector.

Reported-by: Wen Xiong <***@linux.vnet.ibm.com>
Signed-off-by: Gavin Shan <***@linux.vnet.ibm.com>
Signed-off-by: Alex Williamson <***@redhat.com>
---
drivers/vfio/pci/vfio_pci_intrs.c | 15 +++++++++++++++
1 file changed, 15 insertions(+)

diff --git a/drivers/vfio/pci/vfio_pci_intrs.c b/drivers/vfio/pci/vfio_pci_intrs.c
index 9dd49c9..553212f 100644
--- a/drivers/vfio/pci/vfio_pci_intrs.c
+++ b/drivers/vfio/pci/vfio_pci_intrs.c
@@ -16,6 +16,7 @@
#include <linux/device.h>
#include <linux/interrupt.h>
#include <linux/eventfd.h>
+#include <linux/msi.h>
#include <linux/pci.h>
#include <linux/file.h>
#include <linux/poll.h>
@@ -548,6 +549,20 @@ static int vfio_msi_set_vector_signal(struct vfio_pci_device *vdev,
return PTR_ERR(trigger);
}

+ /*
+ * The MSIx vector table resides in device memory which may be cleared
+ * via backdoor resets. We don't allow direct access to the vector
+ * table so even if a userspace driver attempts to save/restore around
+ * such a reset it would be unsuccessful. To avoid this, restore the
+ * cached value of the message prior to enabling.
+ */
+ if (msix) {
+ struct msi_msg msg;
+
+ get_cached_msi_msg(irq, &msg);
+ write_msi_msg(irq, &msg);
+ }
+
ret = request_irq(irq, vfio_msihandler, 0,
vdev->ctx[vector].name, trigger);
if (ret) {
--
1.8.3.2
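
The idea behind the fix can be sketched as a small userspace model. Nothing
below is kernel code: struct demo_msg only loosely mirrors the fields of the
kernel's struct msi_msg, and struct demo_dev and the demo_*() helpers are
hypothetical names. The model assumes, as the commit message describes, that
the host keeps a cached copy of each message while the vector table itself
sits in device memory.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define DEMO_NVEC 4

/* Hypothetical stand-in for an MSI message (address + data). */
struct demo_msg {
	uint32_t address_lo;
	uint32_t address_hi;
	uint32_t data;
};

struct demo_dev {
	struct demo_msg table[DEMO_NVEC];   /* models the table in device memory */
	struct demo_msg cached[DEMO_NVEC];  /* models the host-side cached copy */
};

/* Program a vector: update both the device table and the host cache. */
static void demo_setup(struct demo_dev *d, unsigned int v, struct demo_msg m)
{
	assert(v < DEMO_NVEC);
	d->table[v] = m;
	d->cached[v] = m;
}

/* A backdoor reset (e.g. a BIST) wipes device memory but not the cache. */
static void demo_backdoor_reset(struct demo_dev *d)
{
	memset(d->table, 0, sizeof(d->table));
}

/* Models the get_cached_msi_msg() + write_msi_msg() pair from the patch. */
static void demo_restore_before_enable(struct demo_dev *d, unsigned int v)
{
	assert(v < DEMO_NVEC);
	struct demo_msg msg = d->cached[v];   /* read the cached message */
	d->table[v] = msg;                    /* write it back to the device */
}

/* Returns 1 if the entry is zeroed, i.e. the device would signal to 0x0. */
static int demo_entry_is_zero(const struct demo_dev *d, unsigned int v)
{
	assert(v < DEMO_NVEC);
	const struct demo_msg *m = &d->table[v];
	return m->address_lo == 0 && m->address_hi == 0 && m->data == 0;
}
```

In this model, an entry that was set up and then hit by demo_backdoor_reset()
reads back as zero until demo_restore_before_enable() runs, which corresponds
to the window in which the real device would signal with a zeroed entry.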

Bjorn Helgaas
2014-05-30 22:12:32 UTC
Post by Gavin Shan
The MSIx vector table lives in device memory, which may be cleared as
part of a backdoor device reset. This is the case on the IBM IPR HBA
when the BIST is run on the device. When assigned to a QEMU guest,
the guest driver does a pci_save_state(), issues a BIST, then does a
pci_restore_state(). The BIST clears the MSIx vector table, but due
to the way interrupts are configured the pci_restore_state() does not
restore the vector table as expected. Eventually this results in an
EEH error on Power platforms when the device attempts to signal an
interrupt with the zero'd table entry.
Fix the problem by restoring the host cached MSI message prior to
enabling each vector.
---
drivers/vfio/pci/vfio_pci_intrs.c | 15 +++++++++++++++
1 file changed, 15 insertions(+)
diff --git a/drivers/vfio/pci/vfio_pci_intrs.c b/drivers/vfio/pci/vfio_pci_intrs.c
index 9dd49c9..553212f 100644
--- a/drivers/vfio/pci/vfio_pci_intrs.c
+++ b/drivers/vfio/pci/vfio_pci_intrs.c
@@ -16,6 +16,7 @@
#include <linux/device.h>
#include <linux/interrupt.h>
#include <linux/eventfd.h>
+#include <linux/msi.h>
#include <linux/pci.h>
#include <linux/file.h>
#include <linux/poll.h>
@@ -548,6 +549,20 @@ static int vfio_msi_set_vector_signal(struct vfio_pci_device *vdev,
return PTR_ERR(trigger);
}
+ /*
+ * The MSIx vector table resides in device memory which may be cleared
+ * via backdoor resets. We don't allow direct access to the vector
+ * table so even if a userspace driver attempts to save/restore around
+ * such a reset it would be unsuccessful. To avoid this, restore the
+ * cached value of the message prior to enabling.
+ */
+ if (msix) {
+ struct msi_msg msg;
+
+ get_cached_msi_msg(irq, &msg);
+ write_msi_msg(irq, &msg);
+ }
I think this is pretty ugly. Drivers should not be writing to the
MSI-X vector table, so I don't really want to export these internal
implementation functions if we can avoid it.

I chatted with Alex about this last week on IRC, trying to understand
what's going on here, but I'm afraid I didn't get very far.

I think I understand what happens when there's no virtualization
involved. The driver enables MSI-X and writes the vector table via
this path:

  pci_enable_msix
    msix_capability_init
      arch_setup_msi_irqs
        native_setup_msi_irqs     # .setup_msi_irqs (on x86)
          setup_msi_irq
            write_msi_msg
              __write_msi_msg     # write vector table

When a device is reset, its MSI-X vector table is cleared. The type
of reset (FLR, "backdoor", etc.) doesn't really matter.

After a device reset, the driver would use this path to restore the
vector table:

  pci_restore_state
    pci_restore_msi_state
      __pci_restore_msix_state
        arch_restore_msi_irqs
          default_restore_msi_irqs     # .restore_msi_irqs (on x86)
            default_restore_msi_irq
              write_msi_msg
                __write_msi_msg        # write vector table

This rewrites the MSI-X vector table (it doesn't use any data that was
saved by pci_save_state(), so it's not really a "restore" in that
sense; it writes the vector table from scratch based on the data
structures maintained by the MSI core).

If the same driver is running in a qemu guest, it still calls
pci_enable_msix() and pci_restore_state(), but apparently the restore
path doesn't work. Alex mentioned that qemu virtualizes the vector
table, so I assume it traps the writel() to the vector table when
enabling MSI-X? And I assume qemu would also trap the writel() in the
restore path, but it sounded like it ignores the write because we're
writing the same data qemu believes to be there?

I'd like to understand more details about how those writel()s
performed by the guest kernel are handled. Alex mentioned that the
vector table is inaccessible to the guest, and I see code in
vfio_pci_bar_rw() that looks like it excludes the table area, so I
assume that is involved somehow, but I don't know how to connect the
dots. Obviously the enable path must be handled differently from the
restore path somehow, because if the enable used vfio_pci_bar_rw(),
that write would just be dropped, too, and it's not.
Post by Gavin Shan
ret = request_irq(irq, vfio_msihandler, 0,
vdev->ctx[vector].name, trigger);
if (ret) {
--
1.8.3.2
Gavin Shan
2014-05-31 11:42:52 UTC
Post by Bjorn Helgaas
Post by Gavin Shan
The MSIx vector table lives in device memory, which may be cleared as
part of a backdoor device reset. This is the case on the IBM IPR HBA
when the BIST is run on the device. When assigned to a QEMU guest,
the guest driver does a pci_save_state(), issues a BIST, then does a
pci_restore_state(). The BIST clears the MSIx vector table, but due
to the way interrupts are configured the pci_restore_state() does not
restore the vector table as expected. Eventually this results in an
EEH error on Power platforms when the device attempts to signal an
interrupt with the zero'd table entry.
Fix the problem by restoring the host cached MSI message prior to
enabling each vector.
---
drivers/vfio/pci/vfio_pci_intrs.c | 15 +++++++++++++++
1 file changed, 15 insertions(+)
diff --git a/drivers/vfio/pci/vfio_pci_intrs.c b/drivers/vfio/pci/vfio_pci_intrs.c
index 9dd49c9..553212f 100644
--- a/drivers/vfio/pci/vfio_pci_intrs.c
+++ b/drivers/vfio/pci/vfio_pci_intrs.c
@@ -16,6 +16,7 @@
#include <linux/device.h>
#include <linux/interrupt.h>
#include <linux/eventfd.h>
+#include <linux/msi.h>
#include <linux/pci.h>
#include <linux/file.h>
#include <linux/poll.h>
@@ -548,6 +549,20 @@ static int vfio_msi_set_vector_signal(struct vfio_pci_device *vdev,
return PTR_ERR(trigger);
}
+ /*
+ * The MSIx vector table resides in device memory which may be cleared
+ * via backdoor resets. We don't allow direct access to the vector
+ * table so even if a userspace driver attempts to save/restore around
+ * such a reset it would be unsuccessful. To avoid this, restore the
+ * cached value of the message prior to enabling.
+ */
+ if (msix) {
+ struct msi_msg msg;
+
+ get_cached_msi_msg(irq, &msg);
+ write_msi_msg(irq, &msg);
+ }
I think this is pretty ugly. Drivers should not be writing to the
MSI-X vector table, so I don't really want to export these internal
implementation functions if we can avoid it.
I agree that it's ugly, and I need to discuss the potential solutions with
Alex: fix the issue either in the guest or in QEMU.

- If the "reset" is a special backdoor for some devices, the device driver
on the guest side should do something like: disable the MSIx entries that
have been enabled (updating the MSIx entries maintained by QEMU),
pci_save_state(), reset(), pci_restore_state(), enable the MSIx entries
(updating the MSIx entries maintained by QEMU). The disadvantage is that the
guest driver has to accommodate QEMU, which sounds bad.

- In QEMU, we could add a per-device quirk that traps writes to the reset
registers and, from there, clears the MSIx entries maintained by QEMU. That
is similar to what is done for an FLR reset. We would need a separate quirk
for every kind of device.

- The last one is what we had. However, it's really a "hack".
Post by Bjorn Helgaas
I chatted with Alex about this last week on IRC, trying to understand
what's going on here, but I'm afraid I didn't get very far.
I think I understand what happens when there's no virtualization
involved. The driver enables MSI-X and writes the vector table via
this path:
  pci_enable_msix
    msix_capability_init
      arch_setup_msi_irqs
        native_setup_msi_irqs     # .setup_msi_irqs (on x86)
          setup_msi_irq
            write_msi_msg
              __write_msi_msg     # write vector table
When a device is reset, its MSI-X vector table is cleared. The type
of reset (FLR, "backdoor", etc.) doesn't really matter.
After a device reset, the driver would use this path to restore the
vector table:
  pci_restore_state
    pci_restore_msi_state
      __pci_restore_msix_state
        arch_restore_msi_irqs
          default_restore_msi_irqs     # .restore_msi_irqs (on x86)
            default_restore_msi_irq
              write_msi_msg
                __write_msi_msg        # write vector table
This rewrites the MSI-X vector table (it doesn't use any data that was
saved by pci_save_state(), so it's not really a "restore" in that
sense; it writes the vector table from scratch based on the data
structures maintained by the MSI core).
If the same driver is running in a qemu guest, it still calls
pci_enable_msix() and pci_restore_state(), but apparently the restore
path doesn't work. Alex mentioned that qemu virtualizes the vector
table, so I assume it traps the writel() to the vector table when
enabling MSI-X? And I assume qemu would also trap the writel() in the
restore path, but it sounded like it ignores the write because we're
writing the same data qemu believes to be there?
I'd like to understand more details about how those writel()s
performed by the guest kernel are handled. Alex mentioned that the
vector table is inaccessible to the guest, and I see code in
vfio_pci_bar_rw() that looks like it excludes the table area, so I
assume that is involved somehow, but I don't know how to connect the
dots. Obviously the enable path must be handled differently from the
restore path somehow, because if the enable used vfio_pci_bar_rw(),
that write would just be dropped, too, and it's not.
The problem is basically that the MSIx entries maintained in QEMU no longer
match those in hardware (host kernel), which is caused by the backdoor "reset":

- The guest driver enables the MSIx entries. They are marked as "enabled"
in hardware, QEMU and the guest.
- The guest driver calls pci_save_state() and then issues the backdoor reset.
We lose everything in the MSIx table in hardware, but QEMU still holds the
MSIx entries as "enabled".
- The guest driver calls pci_restore_state() and tries to enable the MSIx
entries. The writes to the MSIx entries are trapped by QEMU, which won't
update the MSIx entries in hardware because they are already marked as
"enabled" in QEMU.

Thanks,
Gavin
Bjorn Helgaas
2014-06-02 16:57:05 UTC
Post by Gavin Shan
Post by Bjorn Helgaas
Post by Gavin Shan
The MSIx vector table lives in device memory, which may be cleared as
part of a backdoor device reset. This is the case on the IBM IPR HBA
when the BIST is run on the device. When assigned to a QEMU guest,
the guest driver does a pci_save_state(), issues a BIST, then does a
pci_restore_state(). The BIST clears the MSIx vector table, but due
to the way interrupts are configured the pci_restore_state() does not
restore the vector table as expected. Eventually this results in an
EEH error on Power platforms when the device attempts to signal an
interrupt with the zero'd table entry.
Fix the problem by restoring the host cached MSI message prior to
enabling each vector.
---
drivers/vfio/pci/vfio_pci_intrs.c | 15 +++++++++++++++
1 file changed, 15 insertions(+)
diff --git a/drivers/vfio/pci/vfio_pci_intrs.c b/drivers/vfio/pci/vfio_pci_intrs.c
index 9dd49c9..553212f 100644
--- a/drivers/vfio/pci/vfio_pci_intrs.c
+++ b/drivers/vfio/pci/vfio_pci_intrs.c
@@ -16,6 +16,7 @@
#include <linux/device.h>
#include <linux/interrupt.h>
#include <linux/eventfd.h>
+#include <linux/msi.h>
#include <linux/pci.h>
#include <linux/file.h>
#include <linux/poll.h>
@@ -548,6 +549,20 @@ static int vfio_msi_set_vector_signal(struct vfio_pci_device *vdev,
return PTR_ERR(trigger);
}
+ /*
+ * The MSIx vector table resides in device memory which may be cleared
+ * via backdoor resets. We don't allow direct access to the vector
+ * table so even if a userspace driver attempts to save/restore around
+ * such a reset it would be unsuccessful. To avoid this, restore the
+ * cached value of the message prior to enabling.
+ */
+ if (msix) {
+ struct msi_msg msg;
+
+ get_cached_msi_msg(irq, &msg);
+ write_msi_msg(irq, &msg);
+ }
I think this is pretty ugly. Drivers should not be writing to the
MSI-X vector table, so I don't really want to export these internal
implementation functions if we can avoid it.
I agree that it's ugly, and I need to discuss the potential solutions with
Alex: fix the issue either in the guest or in QEMU.
- If the "reset" is a special backdoor for some devices, the device driver
on the guest side should do something like: disable the MSIx entries that
have been enabled (updating the MSIx entries maintained by QEMU),
pci_save_state(), reset(), pci_restore_state(), enable the MSIx entries
(updating the MSIx entries maintained by QEMU). The disadvantage is that the
guest driver has to accommodate QEMU, which sounds bad.
I agree, this sounds even worse.
Post by Gavin Shan
- In QEMU, we could add a per-device quirk that traps writes to the reset
registers and, from there, clears the MSIx entries maintained by QEMU. That
is similar to what is done for an FLR reset. We would need a separate quirk
for every kind of device.
This also sounds bad.
Post by Gavin Shan
- The last one is what we had. However, it's really a "hack".
Post by Bjorn Helgaas
I chatted with Alex about this last week on IRC, trying to understand
what's going on here, but I'm afraid I didn't get very far.
I think I understand what happens when there's no virtualization
involved. The driver enables MSI-X and writes the vector table via
this path:
  pci_enable_msix
    msix_capability_init
      arch_setup_msi_irqs
        native_setup_msi_irqs     # .setup_msi_irqs (on x86)
          setup_msi_irq
            write_msi_msg
              __write_msi_msg     # write vector table
When a device is reset, its MSI-X vector table is cleared. The type
of reset (FLR, "backdoor", etc.) doesn't really matter.
After a device reset, the driver would use this path to restore the
vector table:
  pci_restore_state
    pci_restore_msi_state
      __pci_restore_msix_state
        arch_restore_msi_irqs
          default_restore_msi_irqs     # .restore_msi_irqs (on x86)
            default_restore_msi_irq
              write_msi_msg
                __write_msi_msg        # write vector table
This rewrites the MSI-X vector table (it doesn't use any data that was
saved by pci_save_state(), so it's not really a "restore" in that
sense; it writes the vector table from scratch based on the data
structures maintained by the MSI core).
If the same driver is running in a qemu guest, it still calls
pci_enable_msix() and pci_restore_state(), but apparently the restore
path doesn't work. Alex mentioned that qemu virtualizes the vector
table, so I assume it traps the writel() to the vector table when
enabling MSI-X? And I assume qemu would also trap the writel() in the
restore path, but it sounded like it ignores the write because we're
writing the same data qemu believes to be there?
I'd like to understand more details about how those writel()s
performed by the guest kernel are handled. Alex mentioned that the
vector table is inaccessible to the guest, and I see code in
vfio_pci_bar_rw() that looks like it excludes the table area, so I
assume that is involved somehow, but I don't know how to connect the
dots. Obviously the enable path must be handled differently from the
restore path somehow, because if the enable used vfio_pci_bar_rw(),
that write would just be dropped, too, and it's not.
The problem is basically that the MSIx entries maintained in QEMU no longer
match those in hardware (host kernel), which is caused by the backdoor "reset":
- The guest driver enables the MSIx entries. They are marked as "enabled"
in hardware, QEMU and the guest.
- The guest driver calls pci_save_state() and then issues the backdoor reset.
We lose everything in the MSIx table in hardware, but QEMU still holds the
MSIx entries as "enabled".
- The guest driver calls pci_restore_state() and tries to enable the MSIx
entries. The writes to the MSIx entries are trapped by QEMU, which won't
update the MSIx entries in hardware because they are already marked as
"enabled" in QEMU.
It sounds like QEMU assumes the MSIx entries can't be changed by
anything other than the writes it traps. This assumption is false
(the entries are cleared when the driver resets the device, and QEMU
doesn't know about the reset).

Why can't QEMU trap the write from pci_restore_state() and update the
hardware, even if it thinks nothing has changed?

Bjorn
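
The behaviour being asked about here can be illustrated with a toy model. This
is not QEMU code and makes no claim about how QEMU is implemented: struct
toy_vm, toy_trap_write() and toy_untracked_reset() are hypothetical names, and
the "skip the write when the shadow copy already matches" policy is only the
assumption stated in the message above. The always_forward flag models the
alternative the question proposes, namely updating the hardware even when
nothing appears to have changed.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_NVEC 4

struct toy_vm {
	uint32_t shadow[TOY_NVEC];  /* what the trapping layer believes is set */
	uint32_t hw[TOY_NVEC];      /* what the device table actually holds */
	int always_forward;         /* 0: skip writes that match the shadow */
};

/*
 * Trapped guest write to a vector-table entry.
 * Returns 1 if the write was forwarded to "hardware", 0 if it was skipped.
 */
static int toy_trap_write(struct toy_vm *vm, unsigned int v, uint32_t val)
{
	assert(v < TOY_NVEC);
	if (!vm->always_forward && vm->shadow[v] == val)
		return 0;   /* "nothing changed": hardware is left alone */
	vm->shadow[v] = val;
	vm->hw[v] = val;
	return 1;
}

/* A device reset that the trapping layer never observes. */
static void toy_untracked_reset(struct toy_vm *vm)
{
	for (unsigned int v = 0; v < TOY_NVEC; v++)
		vm->hw[v] = 0;
}
```

With always_forward left at 0, rewriting the same value after
toy_untracked_reset() is skipped and the hardware entry stays zeroed; with
always_forward set to 1, the same write reaches the hardware again.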
Gavin Shan
2014-06-05 05:51:34 UTC
.../...

[ Remove the confusing description ]
Post by Bjorn Helgaas
It sounds like QEMU assumes the MSIx entries can't be changed by
anything other than the writes it traps. This assumption is false
(the entries are cleared when the driver resets the device, and QEMU
doesn't know about the reset).
If I understand correctly, QEMU disallows direct access to the MSIx table
in HW. Accesses are trapped by QEMU and, in most cases, terminated there,
so the MSIx message can't be written to HW.
Post by Bjorn Helgaas
Why can't QEMU trap the write from pci_restore_state() and update the
hardware, even if it thinks nothing has changed?
For MSIx messages, pci_restore_state() restores what the device got
from QEMU. I think that MSIx message isn't the one expected by HW (more
details below).

Sorry, Bjorn. I think my last reply must have confused you, as it was not
correct. The problem and the tentative fix have been around for some time
and I had almost forgotten the details. I rechecked the earlier discussion
of the topic, and it's not what I described in my last reply:

http://comments.gmane.org/gmane.comp.emulators.kvm.devel/119689

Let me correct it like this. Alex.W in the cc list is the VFIO expert.
I might have something wrong about VFIO, and Alex can correct me :-)

1) Guest: the PCI device works fine in the guest.
2) QEMU: it keeps an MSIx entry cache (unmasked). It seems the MSIx message
maintained by QEMU is worked out by QEMU itself and is inconsistent with the
one in HW (host kernel). That's a separate (potential) issue. So QEMU and the
host don't exchange MSIx messages with each other.
3) Guest: the PCI device driver calls pci_save_state(), issues the reset,
then calls pci_restore_state().
4) QEMU: the access is trapped and QEMU notifies the VFIO PCI device to start
the MSIx interrupt, which is done by an ioctl() to the VFIO PCI device on the
host side. It seems that the VFIO device driver does request_irq() and sets
up the irqfd so that the interrupt can be propagated to QEMU.

The problem is that the MSIx message was lost because of the reset.
Unfortunately, nobody tried restoring the message to hardware.
Eventually, the PCI device sends DMA traffic (for the MSIx interrupt) with
0x0 as address/data, which isn't allowed on the Power platform and causes
an EEH error.

Since the MSIx messages that QEMU and the host own are different, and QEMU
has an invalid message, it doesn't make sense to update the hardware with
QEMU's cached message. On the other hand, the message data should be restored
to HW by somebody, and the scenario is related to VFIO PCI. It sounds
fair to have the VFIO PCI driver restore the message as we did. As you said,
it's ugly for a driver to write the MSIx message. I'm not sure.

From the guest itself, the PCI code is consistent and I don't think there is
anything we need to improve: pci_save_state(), reset, pci_restore_state()
should work fine.

From the host side, we probably could restore the MSIx message in
request_irq(). In the IRQ chip callbacks (e.g. startup, unmask), we would have
the overhead of restoring the MSIx message. However, that's totally
unnecessary for the host itself.

Hopefully, I've made myself clear this time :-)

Thanks,
Gavin

Gavin Shan
2014-09-10 08:13:42 UTC
Post by Gavin Shan
The MSIx vector table lives in device memory, which may be cleared as
part of a backdoor device reset. This is the case on the IBM IPR HBA
when the BIST is run on the device. When assigned to a QEMU guest,
the guest driver does a pci_save_state(), issues a BIST, then does a
pci_restore_state(). The BIST clears the MSIx vector table, but due
to the way interrupts are configured the pci_restore_state() does not
restore the vector table as expected. Eventually this results in an
EEH error on Power platforms when the device attempts to signal an
interrupt with the zero'd table entry.
Fix the problem by restoring the host cached MSI message prior to
enabling each vector.
Alex, please let me know if I need to resend this one to you. The patch
has been pending for a long time, and I'm not sure if you can still grab
it from somewhere.

As you might have seen, Bjorn will take that one with the PCI changes. This
patch depends on those changes.

Thanks,
Gavin
Post by Gavin Shan
---
drivers/vfio/pci/vfio_pci_intrs.c | 15 +++++++++++++++
1 file changed, 15 insertions(+)
diff --git a/drivers/vfio/pci/vfio_pci_intrs.c b/drivers/vfio/pci/vfio_pci_intrs.c
index 9dd49c9..553212f 100644
--- a/drivers/vfio/pci/vfio_pci_intrs.c
+++ b/drivers/vfio/pci/vfio_pci_intrs.c
@@ -16,6 +16,7 @@
#include <linux/device.h>
#include <linux/interrupt.h>
#include <linux/eventfd.h>
+#include <linux/msi.h>
#include <linux/pci.h>
#include <linux/file.h>
#include <linux/poll.h>
@@ -548,6 +549,20 @@ static int vfio_msi_set_vector_signal(struct vfio_pci_device *vdev,
return PTR_ERR(trigger);
}
+ /*
+ * The MSIx vector table resides in device memory which may be cleared
+ * via backdoor resets. We don't allow direct access to the vector
+ * table so even if a userspace driver attempts to save/restore around
+ * such a reset it would be unsuccessful. To avoid this, restore the
+ * cached value of the message prior to enabling.
+ */
+ if (msix) {
+ struct msi_msg msg;
+
+ get_cached_msi_msg(irq, &msg);
+ write_msi_msg(irq, &msg);
+ }
+
ret = request_irq(irq, vfio_msihandler, 0,
vdev->ctx[vector].name, trigger);
if (ret) {
--
1.8.3.2
Gavin Shan
2014-09-26 03:19:58 UTC
Post by Gavin Shan
Post by Gavin Shan
The MSIx vector table lives in device memory, which may be cleared as
part of a backdoor device reset. This is the case on the IBM IPR HBA
when the BIST is run on the device. When assigned to a QEMU guest,
the guest driver does a pci_save_state(), issues a BIST, then does a
pci_restore_state(). The BIST clears the MSIx vector table, but due
to the way interrupts are configured the pci_restore_state() does not
restore the vector table as expected. Eventually this results in an
EEH error on Power platforms when the device attempts to signal an
interrupt with the zero'd table entry.
Fix the problem by restoring the host cached MSI message prior to
enabling each vector.
Alex, please let me know if I need to resend this one to you. The patch
has been pending for a long time, and I'm not sure if you can still grab
it from somewhere.
As you might have seen, Bjorn will take that one with the PCI changes. This
patch depends on those changes.
Alex, I guess you probably missed my last reply. Bjorn acked the first
patch, and you can pick up both of them if I understand correctly. Please
let me know if I need to resend those 2 patches.

Thanks,
Gavin
Post by Gavin Shan
Thanks,
Gavin
Post by Gavin Shan
---
drivers/vfio/pci/vfio_pci_intrs.c | 15 +++++++++++++++
1 file changed, 15 insertions(+)
diff --git a/drivers/vfio/pci/vfio_pci_intrs.c b/drivers/vfio/pci/vfio_pci_intrs.c
index 9dd49c9..553212f 100644
--- a/drivers/vfio/pci/vfio_pci_intrs.c
+++ b/drivers/vfio/pci/vfio_pci_intrs.c
@@ -16,6 +16,7 @@
#include <linux/device.h>
#include <linux/interrupt.h>
#include <linux/eventfd.h>
+#include <linux/msi.h>
#include <linux/pci.h>
#include <linux/file.h>
#include <linux/poll.h>
@@ -548,6 +549,20 @@ static int vfio_msi_set_vector_signal(struct vfio_pci_device *vdev,
return PTR_ERR(trigger);
}
+ /*
+ * The MSIx vector table resides in device memory which may be cleared
+ * via backdoor resets. We don't allow direct access to the vector
+ * table so even if a userspace driver attempts to save/restore around
+ * such a reset it would be unsuccessful. To avoid this, restore the
+ * cached value of the message prior to enabling.
+ */
+ if (msix) {
+ struct msi_msg msg;
+
+ get_cached_msi_msg(irq, &msg);
+ write_msi_msg(irq, &msg);
+ }
+
ret = request_irq(irq, vfio_msihandler, 0,
vdev->ctx[vector].name, trigger);
if (ret) {
--
1.8.3.2
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
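The failure mode described in the commit log can be modelled in a few lines of plain C. This is only a user-space sketch under stated assumptions, not the kernel code path: the struct and function names below (msi_msg_model, msix_model, model_*) and the table size are hypothetical, and the three message fields simply mirror address_lo, address_hi and data from the kernel's struct msi_msg. It shows why a host-side cached copy survives a backdoor reset that clears device memory, and why rewriting the entry before enabling the vector puts the table back in a usable state.

```c
#include <stdint.h>
#include <string.h>

#define NVEC 4 /* hypothetical table size for this sketch */

/* Mirrors the address_lo/address_hi/data fields of the kernel's struct msi_msg */
struct msi_msg_model {
	uint32_t address_lo;
	uint32_t address_hi;
	uint32_t data;
};

/* Hypothetical model: the device-resident table plus the host's cached copy */
struct msix_model {
	struct msi_msg_model table[NVEC];  /* lives in device memory */
	struct msi_msg_model cached[NVEC]; /* lives in host memory */
};

/* Host programs a vector: write the device table and remember the message */
static void model_program(struct msix_model *m, unsigned int vec,
			  const struct msi_msg_model *msg)
{
	m->table[vec] = *msg;
	m->cached[vec] = *msg;
}

/* Backdoor reset (for example a BIST): clears device memory only */
static void model_backdoor_reset(struct msix_model *m)
{
	memset(m->table, 0, sizeof(m->table));
}

/* Analogue of get_cached_msi_msg() + write_msi_msg() ahead of request_irq() */
static void model_restore_before_enable(struct msix_model *m, unsigned int vec)
{
	m->table[vec] = m->cached[vec];
}
```

After model_backdoor_reset() the table entry reads back as zero while the cached copy is intact; calling model_restore_before_enable() for the vector rewrites the entry from the cache, which is the step the patch adds in vfio_msi_set_vector_signal().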
Alex Williamson
2014-09-26 03:46:44 UTC
Post by Gavin Shan
Post by Gavin Shan
Post by Gavin Shan
The MSIx vector table lives in device memory, which may be cleared as
part of a backdoor device reset. This is the case on the IBM IPR HBA
when the BIST is run on the device. When assigned to a QEMU guest,
the guest driver does a pci_save_state(), issues a BIST, then does a
pci_restore_state(). The BIST clears the MSIx vector table, but due
to the way interrupts are configured the pci_restore_state() does not
restore the vector table as expected. Eventually this results in an
EEH error on Power platforms when the device attempts to signal an
interrupt with the zero'd table entry.
Fix the problem by restoring the host cached MSI message prior to
enabling each vector.
Alex, please let me know if I need resend this one to you. The patch
has been pending for long time, I'm not sure if you still can grab
it somewhere.
As you might see, Bjorn will take that one with PCI changes. This patch
depends on the changes.
Alex, I guess you probably missed last reply. Bjorn acked the first
patch and you can pick both of them if I understand correctly. Please
let me know if I need resend those 2 patches?
Please update the patches, add Bjorn's ACK, test and resend. I'd like
to at least know that it still applies and resolves the problem on the
current code base since the patch is 4 months old. Thanks,

Alex
Gavin Shan
2014-09-27 05:33:07 UTC
Post by Alex Williamson
Post by Gavin Shan
Post by Gavin Shan
Post by Gavin Shan
The MSIx vector table lives in device memory, which may be cleared as
part of a backdoor device reset. This is the case on the IBM IPR HBA
when the BIST is run on the device. When assigned to a QEMU guest,
the guest driver does a pci_save_state(), issues a BIST, then does a
pci_restore_state(). The BIST clears the MSIx vector table, but due
to the way interrupts are configured the pci_restore_state() does not
restore the vector table as expected. Eventually this results in an
EEH error on Power platforms when the device attempts to signal an
interrupt with the zero'd table entry.
Fix the problem by restoring the host cached MSI message prior to
enabling each vector.
Alex, please let me know if I need resend this one to you. The patch
has been pending for long time, I'm not sure if you still can grab
it somewhere.
As you might see, Bjorn will take that one with PCI changes. This patch
depends on the changes.
Alex, I guess you probably missed last reply. Bjorn acked the first
patch and you can pick both of them if I understand correctly. Please
let me know if I need resend those 2 patches?
Please update the patches, add Bjorn's ACK, test and resend. I'd like
to at least know that it still applies and resolves the problem on the
current code base since the patch is 4 months old. Thanks,
Retested, and it helps avoid the unexpected EEH error as before, though
the error caused by the lost MSIx message is eventually propagated to the
guest and the adapter is recovered successfully by the "EEH support for
guest" feature. I'll resend it with Bjorn's ack.

Thanks,
Gavin
Gavin Shan
2014-05-19 03:01:09 UTC
According to the PCI local bus specification, the Message Control
register for MSI (offset: 2, length: 2) uses bit#0 to enable or
disable MSI logic, so that bit shouldn't contribute to the
calculation of the MSI interrupt count. The patch fixes the issue.

Signed-off-by: Gavin Shan <***@linux.vnet.ibm.com>
---
drivers/vfio/pci/vfio_pci.c | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
index 7ba0424..6b8cd07 100644
--- a/drivers/vfio/pci/vfio_pci.c
+++ b/drivers/vfio/pci/vfio_pci.c
@@ -196,8 +196,7 @@ static int vfio_pci_get_irq_count(struct vfio_pci_device *vdev, int irq_type)
if (pos) {
pci_read_config_word(vdev->pdev,
pos + PCI_MSI_FLAGS, &flags);
-
- return 1 << (flags & PCI_MSI_FLAGS_QMASK);
+ return 1 << ((flags & PCI_MSI_FLAGS_QMASK) >> 1);
}
} else if (irq_type == VFIO_PCI_MSIX_IRQ_INDEX) {
u8 pos;
--
1.8.3.2
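
For readers following the arithmetic, here is a small standalone sketch of the two calculations in the diff above. The two constants carry the values from the kernel's pci_regs.h; the helper names msi_count_old() and msi_count_new() are hypothetical and exist only for this illustration. The Multiple Message Capable field occupies bits 3:1 of Message Control, so after masking with PCI_MSI_FLAGS_QMASK the value still has to be shifted right by one before it is used as a power-of-two exponent.

```c
#include <stdint.h>

/* Values as defined in the kernel's include/uapi/linux/pci_regs.h */
#define PCI_MSI_FLAGS_ENABLE 0x0001 /* bit 0: MSI enable */
#define PCI_MSI_FLAGS_QMASK  0x000e /* bits 3:1: Multiple Message Capable */

/* Hypothetical helper: the pre-patch calculation (mask, no shift) */
static unsigned int msi_count_old(uint16_t flags)
{
	return 1u << (flags & PCI_MSI_FLAGS_QMASK);
}

/* Hypothetical helper: the post-patch calculation (mask, then shift) */
static unsigned int msi_count_new(uint16_t flags)
{
	return 1u << ((flags & PCI_MSI_FLAGS_QMASK) >> 1);
}
```

With a Message Control value of 0x0005 (Multiple Message Capable encoded as 2, MSI enabled), msi_count_old() yields 16 while msi_count_new() yields 4, the number of vectors the device actually advertises.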

Alex Williamson
2014-05-19 21:37:19 UTC
Post by Gavin Shan
Changelog
=========
v1 -> v2:
* Change the comments and commit log in PATCH[4/4] (Alex).
* Export 2 MSI relevant functions (Alex).
v2 -> v3:
* Add missed header file in PATCH[4/4].
Please specify the version in the Subject for all patches.

It would probably be a good idea to separate this series. Patches 2 & 3
stand on their own. Send them that way so we can actually get them in.
Patch 4 is dependent on Patch 1, so send them as a series.

Bjorn, if you approve of the symbol export in 1/4, I'm happy to either
take it with your Ack or let you also take 4/4. Otherwise please let us
know so Gavin can try something else. Thanks,

Alex
Post by Gavin Shan
PCI: Export MSI message relevant functions
drivers/vfio: Rework offsetofend()
drivers/vfio/pci: Fix wrong MSI interrupt count
vfio/pci: Restore MSIx message prior to enabling
drivers/pci/msi.c | 2 ++
drivers/vfio/pci/vfio_pci.c | 3 +--
drivers/vfio/pci/vfio_pci_intrs.c | 15 +++++++++++++++
include/linux/vfio.h | 5 ++---
4 files changed, 20 insertions(+), 5 deletions(-)
Alex Williamson
2014-05-30 21:06:42 UTC
Post by Alex Williamson
Post by Gavin Shan
Changelog
=========
* Change the comments and commit log in PATCH[4/4] (Alex).
* Export 2 MSI relevant functions (Alex).
* Add missed header file in PATCH[4/4].
Please specify the version in the Subject for all patches.
It would probably be a good idea to separate this series. Patches 2 & 3
stand on their own. Send them that way so we can actually get them in.
Patch 4 is dependent on Patch 1, so send them as a series.
I'm going to go ahead and grab patches 2 & 3 here because I don't want
to see them miss another kernel. Thanks,

Alex