Discussion:
[RESEND 0/5] PCIe, AER: Misc cleanup
Chen, Gong
2014-08-13 06:22:36 UTC
No response since the last posting, so I am resending this to a wider audience.

This patch series is for AER related cleanup & update based on PCIe
SPEC r3.0.
Chen, Gong
2014-08-13 06:22:37 UTC
The previous definitions use the BIT(...) macro, which is not very
maintainable, as Bjorn mentioned before:
"I'd like to see all those "BIT(...)" things changed to use the #defines
that already exist in include/uapi/linux/pci_regs.h, e.g.,
PCI_ERR_COR_RCVR. That way grep will find these uses, which will make
maintenance easier."

Now here it is.

Signed-off-by: Chen, Gong <***@linux.intel.com>
---
include/ras/ras_event.h | 33 +++++++++++++++++----------------
1 file changed, 17 insertions(+), 16 deletions(-)

diff --git a/include/ras/ras_event.h b/include/ras/ras_event.h
index 47da53c27ffa..0f2cca4ccbf0 100644
--- a/include/ras/ras_event.h
+++ b/include/ras/ras_event.h
@@ -8,6 +8,7 @@
#include <linux/tracepoint.h>
#include <linux/edac.h>
#include <linux/ktime.h>
+#include <linux/pci.h>
#include <linux/aer.h>
#include <linux/cper.h>

@@ -174,24 +175,24 @@ TRACE_EVENT(mc_event,
*/

#define aer_correctable_errors \
- {BIT(0), "Receiver Error"}, \
- {BIT(6), "Bad TLP"}, \
- {BIT(7), "Bad DLLP"}, \
- {BIT(8), "RELAY_NUM Rollover"}, \
- {BIT(12), "Replay Timer Timeout"}, \
- {BIT(13), "Advisory Non-Fatal"}
+ {PCI_ERR_COR_RCVR, "Receiver Error"}, \
+ {PCI_ERR_COR_BAD_TLP, "Bad TLP"}, \
+ {PCI_ERR_COR_BAD_DLLP, "Bad DLLP"}, \
+ {PCI_ERR_COR_REP_ROLL, "RELAY_NUM Rollover"}, \
+ {PCI_ERR_COR_REP_TIMER, "Replay Timer Timeout"},\
+ {PCI_ERR_COR_ADV_NFAT, "Advisory Non-Fatal"}

#define aer_uncorrectable_errors \
- {BIT(4), "Data Link Protocol"}, \
- {BIT(12), "Poisoned TLP"}, \
- {BIT(13), "Flow Control Protocol"}, \
- {BIT(14), "Completion Timeout"}, \
- {BIT(15), "Completer Abort"}, \
- {BIT(16), "Unexpected Completion"}, \
- {BIT(17), "Receiver Overflow"}, \
- {BIT(18), "Malformed TLP"}, \
- {BIT(19), "ECRC"}, \
- {BIT(20), "Unsupported Request"}
+ {PCI_ERR_UNC_DLP, "Data Link Protocol"}, \
+ {PCI_ERR_UNC_POISON_TLP,"Poisoned TLP"}, \
+ {PCI_ERR_UNC_FCP, "Flow Control Protocol"}, \
+ {PCI_ERR_UNC_COMP_TIME, "Completion Timeout"}, \
+ {PCI_ERR_UNC_COMP_ABORT,"Completer Abort"}, \
+ {PCI_ERR_UNC_UNX_COMP, "Unexpected Completion"}, \
+ {PCI_ERR_UNC_RX_OVER, "Receiver Overflow"}, \
+ {PCI_ERR_UNC_MALF_TLP, "Malformed TLP"}, \
+ {PCI_ERR_UNC_ECRC, "ECRC"}, \
+ {PCI_ERR_UNC_UNSUP, "Unsupported Request"}

TRACE_EVENT(aer_event,
TP_PROTO(const char *dev_name,
--
2.0.0.rc2
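The substitution in the patch above is only safe if every BIT(n) it removes has the same value as the named mask that replaces it. A minimal sketch of that check follows. It is not part of the posted series; the macro values are copied locally so the file builds on its own (in the kernel they come from <linux/bitops.h> and include/uapi/linux/pci_regs.h).

```c
/*
 * Standalone copies for illustration only. In the kernel, BIT() is
 * provided by <linux/bitops.h> and the PCI_ERR_* masks by
 * include/uapi/linux/pci_regs.h.
 */
#define BIT(n)                  (1UL << (n))

/* Correctable Error Status bits */
#define PCI_ERR_COR_RCVR        0x00000001
#define PCI_ERR_COR_BAD_TLP     0x00000040
#define PCI_ERR_COR_BAD_DLLP    0x00000080
#define PCI_ERR_COR_REP_ROLL    0x00000100
#define PCI_ERR_COR_REP_TIMER   0x00001000
#define PCI_ERR_COR_ADV_NFAT    0x00002000

/* Uncorrectable Error Status bits */
#define PCI_ERR_UNC_DLP         0x00000010
#define PCI_ERR_UNC_POISON_TLP  0x00001000
#define PCI_ERR_UNC_FCP         0x00002000
#define PCI_ERR_UNC_COMP_TIME   0x00004000
#define PCI_ERR_UNC_COMP_ABORT  0x00008000
#define PCI_ERR_UNC_UNX_COMP    0x00010000
#define PCI_ERR_UNC_RX_OVER     0x00020000
#define PCI_ERR_UNC_MALF_TLP    0x00040000
#define PCI_ERR_UNC_ECRC        0x00080000
#define PCI_ERR_UNC_UNSUP       0x00100000

/* Each line fails the build if a BIT(n) and its named mask disagree. */
_Static_assert(BIT(0)  == PCI_ERR_COR_RCVR,       "Receiver Error");
_Static_assert(BIT(6)  == PCI_ERR_COR_BAD_TLP,    "Bad TLP");
_Static_assert(BIT(7)  == PCI_ERR_COR_BAD_DLLP,   "Bad DLLP");
_Static_assert(BIT(8)  == PCI_ERR_COR_REP_ROLL,   "Rollover");
_Static_assert(BIT(12) == PCI_ERR_COR_REP_TIMER,  "Replay Timer Timeout");
_Static_assert(BIT(13) == PCI_ERR_COR_ADV_NFAT,   "Advisory Non-Fatal");

_Static_assert(BIT(4)  == PCI_ERR_UNC_DLP,        "Data Link Protocol");
_Static_assert(BIT(12) == PCI_ERR_UNC_POISON_TLP, "Poisoned TLP");
_Static_assert(BIT(13) == PCI_ERR_UNC_FCP,        "Flow Control Protocol");
_Static_assert(BIT(14) == PCI_ERR_UNC_COMP_TIME,  "Completion Timeout");
_Static_assert(BIT(15) == PCI_ERR_UNC_COMP_ABORT, "Completer Abort");
_Static_assert(BIT(16) == PCI_ERR_UNC_UNX_COMP,   "Unexpected Completion");
_Static_assert(BIT(17) == PCI_ERR_UNC_RX_OVER,    "Receiver Overflow");
_Static_assert(BIT(18) == PCI_ERR_UNC_MALF_TLP,   "Malformed TLP");
_Static_assert(BIT(19) == PCI_ERR_UNC_ECRC,       "ECRC");
_Static_assert(BIT(20) == PCI_ERR_UNC_UNSUP,      "Unsupported Request");
```

Because the checks run at compile time, a mismatch would show up as a build error rather than as a wrong string in a trace.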
Chen, Gong
2014-08-13 06:22:41 UTC
In PCI-e SPEC r3.0, BIT 0 of Uncorrectable Error Status Register
is redefined and it has an explicit requirement that when writing
this field, a value of 1b is the only choice. So change previous
initial mask from 0 to 1.

Signed-off-by: Chen, Gong <***@linux.intel.com>
---
NOTE: After checking all use cases, this is the most obvious one
that violates the SPEC. Most use cases just read the register first and
then write the value back to clear it. Even so, such a fix is obviously
not compatible with the previous SPEC definition. Do we need a dirty
hack?

arch/mips/pci/pci-octeon.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/mips/pci/pci-octeon.c b/arch/mips/pci/pci-octeon.c
index 59cccd95688b..f1bfdc201297 100644
--- a/arch/mips/pci/pci-octeon.c
+++ b/arch/mips/pci/pci-octeon.c
@@ -134,7 +134,7 @@ int pcibios_plat_dev_init(struct pci_dev *dev)
dconfig);
/* Enable reporting of all uncorrectable errors */
/* Uncorrectable Error Mask - turned on bits disable errors */
- pci_write_config_dword(dev, pos + PCI_ERR_UNCOR_MASK, 0);
+ pci_write_config_dword(dev, pos + PCI_ERR_UNCOR_MASK, 1);
/*
* Leave severity at HW default. This only controls if
* errors are reported as uncorrectable or
--
2.0.0.rc2
Bjorn Helgaas
2014-09-05 23:34:21 UTC
Post by Chen, Gong
In PCI-e SPEC r3.0, BIT 0 of Uncorrectable Error Status Register
is redefined and it has an explicit requirement that when writing
this field, a value of 1b is the only choice. So change previous
initial mask from 0 to 1.
---
NOTE: After checking all use cases, this is the most obvious one
that violates the SPEC. Most use cases just read the register first and
then write the value back to clear it. Even so, such a fix is obviously
not compatible with the previous SPEC definition. Do we need a dirty
hack?
arch/mips/pci/pci-octeon.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/arch/mips/pci/pci-octeon.c b/arch/mips/pci/pci-octeon.c
index 59cccd95688b..f1bfdc201297 100644
--- a/arch/mips/pci/pci-octeon.c
+++ b/arch/mips/pci/pci-octeon.c
@@ -134,7 +134,7 @@ int pcibios_plat_dev_init(struct pci_dev *dev)
dconfig);
/* Enable reporting of all uncorrectable errors */
/* Uncorrectable Error Mask - turned on bits disable errors */
- pci_write_config_dword(dev, pos + PCI_ERR_UNCOR_MASK, 0);
+ pci_write_config_dword(dev, pos + PCI_ERR_UNCOR_MASK, 1);
I see the text in the spec that says we should only write 1 to bit 0 (sec
7.10.3, for anybody following along). It looks like that change was made
between PCIe r1.0 and r1.1. It would really be nice to have more context
about why the change was made, because if there's hardware in the field
that implements r1.0 behavior, this patch will change the way it works, and
I don't know how to verify that is safe.

Does this change actually fix a problem? If it fixes a problem that
happens on real hardware, that's a much better reason to make a change than
just to comply with the spec.

Sec 7.10.2 also says we should ignore the value of bit 0 in the
Uncorrectable Error Status register, and I don't see any place where we
follow that advice.

Bjorn
Post by Chen, Gong
* Leave severity at HW default. This only controls if
* errors are reported as uncorrectable or
--
2.0.0.rc2
Chen, Gong
2014-09-09 07:12:52 UTC
Date: Fri, 5 Sep 2014 17:34:21 -0600
Subject: Re: [RESEND RFC 5/5] PCIe, AER: Update initial value of UC error
mask
User-Agent: Mutt/1.5.21 (2010-09-15)
Post by Chen, Gong
In PCI-e SPEC r3.0, BIT 0 of Uncorrectable Error Status Register
is redefined and it has an explicit requirement that when writing
this field, a value of 1b is the only choice. So change previous
initial mask from 0 to 1.
---
NOTE: After checking all use cases, this is the most obvious one
that violates the SPEC. Most use cases just read the register first and
then write the value back to clear it. Even so, such a fix is obviously
not compatible with the previous SPEC definition. Do we need a dirty
hack?
arch/mips/pci/pci-octeon.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/arch/mips/pci/pci-octeon.c b/arch/mips/pci/pci-octeon.c
index 59cccd95688b..f1bfdc201297 100644
--- a/arch/mips/pci/pci-octeon.c
+++ b/arch/mips/pci/pci-octeon.c
@@ -134,7 +134,7 @@ int pcibios_plat_dev_init(struct pci_dev *dev)
dconfig);
/* Enable reporting of all uncorrectable errors */
/* Uncorrectable Error Mask - turned on bits disable errors */
- pci_write_config_dword(dev, pos + PCI_ERR_UNCOR_MASK, 0);
+ pci_write_config_dword(dev, pos + PCI_ERR_UNCOR_MASK, 1);
I see the text in the spec that says we should only write 1 to bit 0 (sec
7.10.3, for anybody following along). It looks like that change was made
between PCIe r1.0 and r1.1. It would really be nice to have more context
about why the change was made, because if there's hardware in the field
that implements r1.0 behavior, this patch will change the way it works, and
I don't know how to verify that is safe.
Does this change actually fix a problem? If it fixes a problem that
happens on real hardware, that's a much better reason to make a change than
just to comply with the spec.
Sec 7.10.2 also says we should ignore the value of bit 0 in the
Uncorrectable Error Status register, and I don't see any place where we
follow that advice.
That's why I marked this patch as RFC. As you mentioned above, these are my
concerns, too. I submitted this patch not for merging but to raise a potential
issue. As I noted above, I don't know whether it is worth fixing all affected
places to comply with the spec change. After all, no one has reported such an
issue (or maybe it has happened :-))
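For readers following the bit 0 discussion, the two rules cited from the spec (write only 1b to bit 0 of the mask register, ignore bit 0 when reading the status register) can be sketched as two pure helpers. This is only an illustration of the r1.1-and-later reading. The helper names and AER_UNC_BIT0 are hypothetical, and, as noted in the thread, the effect on hardware built to the r1.0 definition has not been verified.

```c
#include <stdint.h>

/* Hypothetical local name for bit 0 of the uncorrectable registers. */
#define AER_UNC_BIT0  UINT32_C(0x00000001)

/*
 * Value to write to the Uncorrectable Error Mask register: keep the
 * mask bits the caller asked for and force bit 0 to 1b.
 */
static inline uint32_t aer_uncor_mask_to_write(uint32_t wanted_mask)
{
	return wanted_mask | AER_UNC_BIT0;
}

/*
 * Status bits that software should act on: bit 0 reads back as an
 * undefined value, so it is dropped before any further decoding.
 */
static inline uint32_t aer_uncor_status_meaningful(uint32_t raw_status)
{
	return raw_status & ~AER_UNC_BIT0;
}
```

With these helpers, "unmask everything" becomes aer_uncor_mask_to_write(0), which yields 1, the same constant the RFC patch writes directly.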
Chen, Gong
2014-08-13 06:22:38 UTC
Since commit 6c2b374d was committed, the capability of PCI-e AER
has changed a lot. This patch adds all the CE/UC error bits defined
in PCI-e SPEC r3.0 that were missing. Meanwhile, adjust the code format
to make it simpler to read/maintain.

Signed-off-by: Chen, Gong <***@linux.intel.com>
---
drivers/pci/pcie/aer/aerdrv_errprint.c | 60 ++++++++++++++--------------------
1 file changed, 25 insertions(+), 35 deletions(-)

diff --git a/drivers/pci/pcie/aer/aerdrv_errprint.c b/drivers/pci/pcie/aer/aerdrv_errprint.c
index 35d06e177917..5c4f7e252e5e 100644
--- a/drivers/pci/pcie/aer/aerdrv_errprint.c
+++ b/drivers/pci/pcie/aer/aerdrv_errprint.c
@@ -75,44 +75,34 @@ static const char *aer_error_layer[] = {
};

static const char *aer_correctable_error_string[] = {
- "Receiver Error", /* Bit Position 0 */
- NULL,
- NULL,
- NULL,
- NULL,
- NULL,
- "Bad TLP", /* Bit Position 6 */
- "Bad DLLP", /* Bit Position 7 */
- "RELAY_NUM Rollover", /* Bit Position 8 */
- NULL,
- NULL,
- NULL,
- "Replay Timer Timeout", /* Bit Position 12 */
- "Advisory Non-Fatal", /* Bit Position 13 */
+ [0] = "Receiver Error",
+ [6] = "Bad TLP",
+ [7] = "Bad DLLP",
+ [8] = "RELAY_NUM Rollover",
+ [12] = "Replay Timer Timeout",
+ [13] = "Advisory Non-Fatal Error",
+ [14] = "Corrected Internal Error",
+ [15] = "Header Log Overflow",
};

static const char *aer_uncorrectable_error_string[] = {
- NULL,
- NULL,
- NULL,
- NULL,
- "Data Link Protocol", /* Bit Position 4 */
- NULL,
- NULL,
- NULL,
- NULL,
- NULL,
- NULL,
- NULL,
- "Poisoned TLP", /* Bit Position 12 */
- "Flow Control Protocol", /* Bit Position 13 */
- "Completion Timeout", /* Bit Position 14 */
- "Completer Abort", /* Bit Position 15 */
- "Unexpected Completion", /* Bit Position 16 */
- "Receiver Overflow", /* Bit Position 17 */
- "Malformed TLP", /* Bit Position 18 */
- "ECRC", /* Bit Position 19 */
- "Unsupported Request", /* Bit Position 20 */
+ [0] = "Undefined",
+ [4] = "Data Link Protocol Error",
+ [5] = "Surprise Down Error",
+ [12] = "Poisoned TLP",
+ [13] = "Flow Control Protocol Error",
+ [14] = "Completion Timeout",
+ [15] = "Completer Abort",
+ [16] = "Unexpected Completion",
+ [17] = "Receiver Overflow",
+ [18] = "Malformed TLP",
+ [19] = "ECRC Error",
+ [20] = "Unsupported Request Error",
+ [21] = "ACS Violation",
+ [22] = "Uncorrectable Internal Error",
+ [23] = "MC Blocked TLP",
+ [24] = "AtomicOp Egress Blocked",
+ [25] = "TLP Prefix Blocked Error",
};

static const char *aer_agent_string[] = {
--
2.0.0.rc2
Bjorn Helgaas
2014-09-05 23:15:43 UTC
Post by Chen, Gong
Since commit 6c2b374d was committed, the capability of PCI-e AER
has changed a lot. This patch adds all the CE/UC error bits defined
in PCI-e SPEC r3.0 that were missing. Meanwhile, adjust the code format
to make it simpler to read/maintain.
---
drivers/pci/pcie/aer/aerdrv_errprint.c | 60 ++++++++++++++--------------------
1 file changed, 25 insertions(+), 35 deletions(-)
diff --git a/drivers/pci/pcie/aer/aerdrv_errprint.c b/drivers/pci/pcie/aer/aerdrv_errprint.c
index 35d06e177917..5c4f7e252e5e 100644
--- a/drivers/pci/pcie/aer/aerdrv_errprint.c
+++ b/drivers/pci/pcie/aer/aerdrv_errprint.c
@@ -75,44 +75,34 @@ static const char *aer_error_layer[] = {
};
static const char *aer_correctable_error_string[] = {
- "Receiver Error", /* Bit Position 0 */
- NULL,
- NULL,
- NULL,
- NULL,
- NULL,
- "Bad TLP", /* Bit Position 6 */
- "Bad DLLP", /* Bit Position 7 */
- "RELAY_NUM Rollover", /* Bit Position 8 */
- NULL,
- NULL,
- NULL,
- "Replay Timer Timeout", /* Bit Position 12 */
- "Advisory Non-Fatal", /* Bit Position 13 */
+ [0] = "Receiver Error",
+ [6] = "Bad TLP",
+ [7] = "Bad DLLP",
+ [8] = "RELAY_NUM Rollover",
+ [12] = "Replay Timer Timeout",
+ [13] = "Advisory Non-Fatal Error",
+ [14] = "Corrected Internal Error",
+ [15] = "Header Log Overflow",
You replaced bare numbers with the existing #defines in the previous patch
(thank you), but now we're adding them here. I'm pretty sure you can use
the #defines here, e.g.,

[PCI_ERR_COR_RCVR] = "Receiver Error",

In fact, it would be really nice if you could figure out a way to have only
one set of these strings. Right now, we have the set in
include/ras/ras_event.h, and then another set here in aerdrv_errprint.c,
and they contain exactly the same information.

Bjorn
Post by Chen, Gong
};
static const char *aer_uncorrectable_error_string[] = {
- NULL,
- NULL,
- NULL,
- NULL,
- "Data Link Protocol", /* Bit Position 4 */
- NULL,
- NULL,
- NULL,
- NULL,
- NULL,
- NULL,
- NULL,
- "Poisoned TLP", /* Bit Position 12 */
- "Flow Control Protocol", /* Bit Position 13 */
- "Completion Timeout", /* Bit Position 14 */
- "Completer Abort", /* Bit Position 15 */
- "Unexpected Completion", /* Bit Position 16 */
- "Receiver Overflow", /* Bit Position 17 */
- "Malformed TLP", /* Bit Position 18 */
- "ECRC", /* Bit Position 19 */
- "Unsupported Request", /* Bit Position 20 */
+ [0] = "Undefined",
+ [4] = "Data Link Protocol Error",
+ [5] = "Surprise Down Error",
+ [12] = "Poisoned TLP",
+ [13] = "Flow Control Protocol Error",
+ [14] = "Completion Timeout",
+ [15] = "Completer Abort",
+ [16] = "Unexpected Completion",
+ [17] = "Receiver Overflow",
+ [18] = "Malformed TLP",
+ [19] = "ECRC Error",
+ [20] = "Unsupported Request Error",
+ [21] = "ACS Violation",
+ [22] = "Uncorrectable Internal Error",
+ [23] = "MC Blocked TLP",
+ [24] = "AtomicOp Egress Blocked",
+ [25] = "TLP Prefix Blocked Error",
};
static const char *aer_agent_string[] = {
--
2.0.0.rc2
Chen, Gong
2014-09-09 07:03:22 UTC
Date: Fri, 5 Sep 2014 17:15:43 -0600
Subject: Re: [RESEND 2/5] PCIe, AER: Replenish missed AER status bits for
AER driver
User-Agent: Mutt/1.5.21 (2010-09-15)
Post by Chen, Gong
Since commit 6c2b374d was committed, the capability of PCI-e AER
has changed a lot. This patch adds all the CE/UC error bits defined
in PCI-e SPEC r3.0 that were missing. Meanwhile, adjust the code format
to make it simpler to read/maintain.
---
drivers/pci/pcie/aer/aerdrv_errprint.c | 60 ++++++++++++++--------------------
1 file changed, 25 insertions(+), 35 deletions(-)
diff --git a/drivers/pci/pcie/aer/aerdrv_errprint.c b/drivers/pci/pcie/aer/aerdrv_errprint.c
index 35d06e177917..5c4f7e252e5e 100644
--- a/drivers/pci/pcie/aer/aerdrv_errprint.c
+++ b/drivers/pci/pcie/aer/aerdrv_errprint.c
@@ -75,44 +75,34 @@ static const char *aer_error_layer[] = {
};
static const char *aer_correctable_error_string[] = {
- "Receiver Error", /* Bit Position 0 */
- NULL,
- NULL,
- NULL,
- NULL,
- NULL,
- "Bad TLP", /* Bit Position 6 */
- "Bad DLLP", /* Bit Position 7 */
- "RELAY_NUM Rollover", /* Bit Position 8 */
- NULL,
- NULL,
- NULL,
- "Replay Timer Timeout", /* Bit Position 12 */
- "Advisory Non-Fatal", /* Bit Position 13 */
+ [0] = "Receiver Error",
+ [6] = "Bad TLP",
+ [7] = "Bad DLLP",
+ [8] = "RELAY_NUM Rollover",
+ [12] = "Replay Timer Timeout",
+ [13] = "Advisory Non-Fatal Error",
+ [14] = "Corrected Internal Error",
+ [15] = "Header Log Overflow",
You replaced bare numbers with the existing #defines in the previous patch
(thank you), but now we're adding them here. I'm pretty sure you can use
the #defines here, e.g.,
[PCI_ERR_COR_RCVR] = "Receiver Error",
Considering that the PCI_ERR_COR_* values are masks, not bit offsets, I need a conversion like
[ilog2(PCI_ERR_COR_RCVR)] = "xxx". But in ras_event.h I would need the same
conversion, like aer_correctable_error_string[ilog2(PCI_ERR_COR_RCVR)]. That looks
a little clumsy and suboptimal. I can add extra BIT definitions in
include/uapi/linux/pci_regs.h like below:

#define PCI_ERR_COR_RCVR 0x00000001 /* Receiver Error Status */
+#define PCI_ERR_COR_RCVR_BIT ilog2(PCI_ERR_COR_RCVR)

or, more directly:
#define PCI_ERR_COR_RCVR 0x00000001 /* Receiver Error Status */
+#define PCI_ERR_COR_RCVR_BIT 0

I can't find a better method for now.
In fact, it would be really nice if you could figure out a way to have only
one set of these strings. Right now, we have the set in
As implied above, I can export aer_correctable_error_string etc.
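The mask-to-index conversion discussed above can be sketched in plain C without the kernel's ilog2(). The macro below, AER_BIT_INDEX, is a hypothetical stand-in that covers only the low 16 bits used by the Correctable Error Status register. It expands to an integer constant expression, so it can be used as an array designator. The mask values are local copies of the ones in include/uapi/linux/pci_regs.h, and the strings are the ones from the patch under review.

```c
#include <stddef.h>

/* Local copies of the masks from include/uapi/linux/pci_regs.h. */
#define PCI_ERR_COR_RCVR       0x00000001
#define PCI_ERR_COR_BAD_TLP    0x00000040
#define PCI_ERR_COR_BAD_DLLP   0x00000080
#define PCI_ERR_COR_REP_ROLL   0x00000100
#define PCI_ERR_COR_REP_TIMER  0x00001000
#define PCI_ERR_COR_ADV_NFAT   0x00002000
#define PCI_ERR_COR_INTERNAL   0x00004000
#define PCI_ERR_COR_LOG_OVER   0x00008000

/*
 * Hypothetical helper: map a single-bit mask (bits 0..15) to its bit
 * index at compile time. Any other value yields -1, and a negative
 * array designator is rejected by the compiler.
 */
#define AER_BIT_INDEX(m) ( \
	(m) == 0x00000001 ?  0 : (m) == 0x00000002 ?  1 : \
	(m) == 0x00000004 ?  2 : (m) == 0x00000008 ?  3 : \
	(m) == 0x00000010 ?  4 : (m) == 0x00000020 ?  5 : \
	(m) == 0x00000040 ?  6 : (m) == 0x00000080 ?  7 : \
	(m) == 0x00000100 ?  8 : (m) == 0x00000200 ?  9 : \
	(m) == 0x00000400 ? 10 : (m) == 0x00000800 ? 11 : \
	(m) == 0x00001000 ? 12 : (m) == 0x00002000 ? 13 : \
	(m) == 0x00004000 ? 14 : (m) == 0x00008000 ? 15 : -1)

/* Slots without a designator are implicitly NULL. */
static const char *const aer_cor_strings[] = {
	[AER_BIT_INDEX(PCI_ERR_COR_RCVR)]      = "Receiver Error",
	[AER_BIT_INDEX(PCI_ERR_COR_BAD_TLP)]   = "Bad TLP",
	[AER_BIT_INDEX(PCI_ERR_COR_BAD_DLLP)]  = "Bad DLLP",
	[AER_BIT_INDEX(PCI_ERR_COR_REP_ROLL)]  = "RELAY_NUM Rollover",
	[AER_BIT_INDEX(PCI_ERR_COR_REP_TIMER)] = "Replay Timer Timeout",
	[AER_BIT_INDEX(PCI_ERR_COR_ADV_NFAT)]  = "Advisory Non-Fatal Error",
	[AER_BIT_INDEX(PCI_ERR_COR_INTERNAL)]  = "Corrected Internal Error",
	[AER_BIT_INDEX(PCI_ERR_COR_LOG_OVER)]  = "Header Log Overflow",
};

/* Return the string for a bit position, or NULL if none is defined. */
static const char *aer_cor_name(unsigned int bit)
{
	size_t n = sizeof(aer_cor_strings) / sizeof(aer_cor_strings[0]);

	return bit < n ? aer_cor_strings[bit] : NULL;
}
```

This keeps the named masks as the single source of the bit positions, at the cost of one extra macro; whether that is less clumsy than separate *_BIT definitions is the judgment call being debated here.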
Bjorn Helgaas
2014-09-25 15:51:00 UTC
[+cc Heather]
Post by Chen, Gong
Since commit 6c2b374d was committed, the capability of PCI-e AER
has changed a lot. This patch adds all the CE/UC error bits defined
in PCI-e SPEC r3.0 that were missing. Meanwhile, adjust the code format
to make it simpler to read/maintain.
---
drivers/pci/pcie/aer/aerdrv_errprint.c | 60 ++++++++++++++--------------------
1 file changed, 25 insertions(+), 35 deletions(-)
diff --git a/drivers/pci/pcie/aer/aerdrv_errprint.c b/drivers/pci/pcie/aer/aerdrv_errprint.c
index 35d06e177917..5c4f7e252e5e 100644
--- a/drivers/pci/pcie/aer/aerdrv_errprint.c
+++ b/drivers/pci/pcie/aer/aerdrv_errprint.c
@@ -75,44 +75,34 @@ static const char *aer_error_layer[] = {
};
static const char *aer_correctable_error_string[] = {
- "Receiver Error", /* Bit Position 0 */
- NULL,
- NULL,
- NULL,
- NULL,
- NULL,
- "Bad TLP", /* Bit Position 6 */
- "Bad DLLP", /* Bit Position 7 */
- "RELAY_NUM Rollover", /* Bit Position 8 */
- NULL,
- NULL,
- NULL,
- "Replay Timer Timeout", /* Bit Position 12 */
- "Advisory Non-Fatal", /* Bit Position 13 */
+ [0] = "Receiver Error",
+ [6] = "Bad TLP",
+ [7] = "Bad DLLP",
+ [8] = "RELAY_NUM Rollover",
+ [12] = "Replay Timer Timeout",
+ [13] = "Advisory Non-Fatal Error",
+ [14] = "Corrected Internal Error",
+ [15] = "Header Log Overflow",
This patch does two things at once: (1) adds new error strings and (2)
converts to the designated initializer style. The first is useful but I
don't think the second really helps anything.

We still have to manually match up the array index, e.g., "14", with the
#define, PCI_ERR_COR_INTERNAL, and then count bits to make sure it
matches the constant 0x00004000.

I'm still holding out for a change that solves that problem. I would also
like to avoid duplicating all the strings between include/ras/ras_event.h
and drivers/pci/pcie/aer/aerdrv_errprint.c.

In the meantime, I applied the patch below, which does just (1).

Bjorn


commit d179111767aa2a1d594023ce65abf9c81bfbb0cf
Author: Chen, Gong <***@linux.intel.com>
Date: Thu Sep 25 09:36:43 2014 -0600

PCI/AER: Add additional PCIe AER error strings

Add strings for all AER error bits defined in PCIe r3.0.

[bhelgaas: changelog, drop designated initializer change]
Signed-off-by: Chen, Gong <***@linux.intel.com>
Signed-off-by: Bjorn Helgaas <***@google.com>

diff --git a/drivers/pci/pcie/aer/aerdrv_errprint.c b/drivers/pci/pcie/aer/aerdrv_errprint.c
index 35d06e177917..c6849d9e86ce 100644
--- a/drivers/pci/pcie/aer/aerdrv_errprint.c
+++ b/drivers/pci/pcie/aer/aerdrv_errprint.c
@@ -89,15 +89,17 @@ static const char *aer_correctable_error_string[] = {
NULL,
"Replay Timer Timeout", /* Bit Position 12 */
"Advisory Non-Fatal", /* Bit Position 13 */
+ "Corrected Internal Error", /* Bit Position 14 */
+ "Header Log Overflow", /* Bit Position 15 */
};

static const char *aer_uncorrectable_error_string[] = {
- NULL,
+ "Undefined", /* Bit Position 0 */
NULL,
NULL,
NULL,
"Data Link Protocol", /* Bit Position 4 */
- NULL,
+ "Surprise Down Error", /* Bit Position 5 */
NULL,
NULL,
NULL,
@@ -113,6 +115,11 @@ static const char *aer_uncorrectable_error_string[] = {
"Malformed TLP", /* Bit Position 18 */
"ECRC", /* Bit Position 19 */
"Unsupported Request", /* Bit Position 20 */
+ "ACS Violation", /* Bit Position 21 */
+ "Uncorrectable Internal Error", /* Bit Position 22 */
+ "MC Blocked TLP", /* Bit Position 23 */
+ "AtomicOp Egress Blocked", /* Bit Position 24 */
+ "TLP Prefix Blocked Error", /* Bit Position 25 */
};

static const char *aer_agent_string[] = {
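To show how a positional table like the one in the applied commit gets consumed, here is a minimal user-space sketch. The table contents mirror the strings in the commit above; the function names, the "Unknown Error Bit" fallback text, and the output format are assumptions made for illustration and are not taken from the driver.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

#define ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))

/* Indexed by bit position; NULL marks bits with no defined string. */
static const char *const aer_unc_strings[] = {
	"Undefined",                    /* Bit Position 0  */
	NULL, NULL, NULL,
	"Data Link Protocol",           /* Bit Position 4  */
	"Surprise Down Error",          /* Bit Position 5  */
	NULL, NULL, NULL, NULL, NULL, NULL,
	"Poisoned TLP",                 /* Bit Position 12 */
	"Flow Control Protocol",        /* Bit Position 13 */
	"Completion Timeout",           /* Bit Position 14 */
	"Completer Abort",              /* Bit Position 15 */
	"Unexpected Completion",        /* Bit Position 16 */
	"Receiver Overflow",            /* Bit Position 17 */
	"Malformed TLP",                /* Bit Position 18 */
	"ECRC",                         /* Bit Position 19 */
	"Unsupported Request",          /* Bit Position 20 */
	"ACS Violation",                /* Bit Position 21 */
	"Uncorrectable Internal Error", /* Bit Position 22 */
	"MC Blocked TLP",               /* Bit Position 23 */
	"AtomicOp Egress Blocked",      /* Bit Position 24 */
	"TLP Prefix Blocked Error",     /* Bit Position 25 */
};

/* Bounds-checked lookup with a fallback for holes and high bits. */
static const char *aer_unc_bit_name(unsigned int bit)
{
	if (bit < ARRAY_SIZE(aer_unc_strings) && aer_unc_strings[bit])
		return aer_unc_strings[bit];
	return "Unknown Error Bit";
}

/* Print one line per set status bit; return how many bits were set. */
static int aer_print_unc_status(FILE *out, uint32_t status)
{
	int count = 0;

	for (unsigned int bit = 0; bit < 32; bit++) {
		if (!(status & (UINT32_C(1) << bit)))
			continue;
		fprintf(out, "   [%2u] %s\n", bit, aer_unc_bit_name(bit));
		count++;
	}
	return count;
}
```

The sketch also makes the maintenance concern concrete: the only thing tying "Surprise Down Error" to bit 5 is its position in the initializer.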
Chen, Gong
2014-08-13 06:22:40 UTC
In PCI-e SPEC r3.0, BIT 0 of Uncorrectable Error Status Register
has been redefined for a different purpose.

BIT 0:
Undefined – The value read from this bit is undefined. In
previous versions of this specification, this
bit was used to mask a Link Training Error.
System software must ignore the value read from
this bit. System software must only write a value
of 1b to this bit.

Update related MACRO definition to reflect this change.

Signed-off-by: Chen, Gong <***@linux.intel.com>
---
drivers/vfio/pci/vfio_pci_config.c | 2 +-
include/ras/ras_event.h | 2 +-
include/uapi/linux/pci_regs.h | 2 +-
3 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/vfio/pci/vfio_pci_config.c b/drivers/vfio/pci/vfio_pci_config.c
index e50790e91f76..1de3f94aa7de 100644
--- a/drivers/vfio/pci/vfio_pci_config.c
+++ b/drivers/vfio/pci/vfio_pci_config.c
@@ -727,7 +727,7 @@ static int __init init_pci_ext_cap_err_perm(struct perm_bits *perm)
p_setd(perm, 0, ALL_VIRT, NO_WRITE);

/* Writable bits mask */
- mask = PCI_ERR_UNC_TRAIN | /* Training */
+ mask = PCI_ERR_UNC_UND | /* Undefined */
PCI_ERR_UNC_DLP | /* Data Link Protocol */
PCI_ERR_UNC_SURPDN | /* Surprise Down */
PCI_ERR_UNC_POISON_TLP | /* Poisoned TLP */
diff --git a/include/ras/ras_event.h b/include/ras/ras_event.h
index 0f04a9755d1e..79abb9c71772 100644
--- a/include/ras/ras_event.h
+++ b/include/ras/ras_event.h
@@ -185,7 +185,7 @@ TRACE_EVENT(mc_event,
{PCI_ERR_COR_LOG_OVER, "Header Log Overflow"}

#define aer_uncorrectable_errors \
- {PCI_ERR_UNC_TRAIN, "Undefined"}, \
+ {PCI_ERR_UNC_UND, "Undefined"}, \
{PCI_ERR_UNC_DLP, "Data Link Protocol Error"}, \
{PCI_ERR_UNC_SURPDN, "Surprise Down Error"}, \
{PCI_ERR_UNC_POISON_TLP,"Poisoned TLP"}, \
diff --git a/include/uapi/linux/pci_regs.h b/include/uapi/linux/pci_regs.h
index 30db069bce62..99e3182f2c96 100644
--- a/include/uapi/linux/pci_regs.h
+++ b/include/uapi/linux/pci_regs.h
@@ -630,7 +630,7 @@

/* Advanced Error Reporting */
#define PCI_ERR_UNCOR_STATUS 4 /* Uncorrectable Error Status */
-#define PCI_ERR_UNC_TRAIN 0x00000001 /* Training */
+#define PCI_ERR_UNC_UND 0x00000001 /* Undefined */
#define PCI_ERR_UNC_DLP 0x00000010 /* Data Link Protocol */
#define PCI_ERR_UNC_SURPDN 0x00000020 /* Surprise Down */
#define PCI_ERR_UNC_POISON_TLP 0x00001000 /* Poisoned TLP */
--
2.0.0.rc2
Chen, Gong
2014-08-13 06:22:39 UTC
This patch adds all the AER error bits (CE & UC) defined in PCI-e
SPEC r3.0 that were missing from the trace interface.

Signed-off-by: Chen, Gong <***@linux.intel.com>
---
include/ras/ras_event.h | 35 ++++++++++++++++++++++-------------
1 file changed, 22 insertions(+), 13 deletions(-)

diff --git a/include/ras/ras_event.h b/include/ras/ras_event.h
index 0f2cca4ccbf0..0f04a9755d1e 100644
--- a/include/ras/ras_event.h
+++ b/include/ras/ras_event.h
@@ -174,25 +174,34 @@ TRACE_EVENT(mc_event,
* u8 severity - error severity 0:NONFATAL 1:FATAL 2:CORRECTED
*/

-#define aer_correctable_errors \
- {PCI_ERR_COR_RCVR, "Receiver Error"}, \
- {PCI_ERR_COR_BAD_TLP, "Bad TLP"}, \
- {PCI_ERR_COR_BAD_DLLP, "Bad DLLP"}, \
- {PCI_ERR_COR_REP_ROLL, "RELAY_NUM Rollover"}, \
- {PCI_ERR_COR_REP_TIMER, "Replay Timer Timeout"},\
- {PCI_ERR_COR_ADV_NFAT, "Advisory Non-Fatal"}
-
-#define aer_uncorrectable_errors \
- {PCI_ERR_UNC_DLP, "Data Link Protocol"}, \
+#define aer_correctable_errors \
+ {PCI_ERR_COR_RCVR, "Receiver Error"}, \
+ {PCI_ERR_COR_BAD_TLP, "Bad TLP"}, \
+ {PCI_ERR_COR_BAD_DLLP, "Bad DLLP"}, \
+ {PCI_ERR_COR_REP_ROLL, "RELAY_NUM Rollover"}, \
+ {PCI_ERR_COR_REP_TIMER, "Replay Timer Timeout"}, \
+ {PCI_ERR_COR_ADV_NFAT, "Advisory Non-Fatal Error"}, \
+ {PCI_ERR_COR_INTERNAL, "Corrected Internal Error"}, \
+ {PCI_ERR_COR_LOG_OVER, "Header Log Overflow"}
+
+#define aer_uncorrectable_errors \
+ {PCI_ERR_UNC_TRAIN, "Undefined"}, \
+ {PCI_ERR_UNC_DLP, "Data Link Protocol Error"}, \
+ {PCI_ERR_UNC_SURPDN, "Surprise Down Error"}, \
{PCI_ERR_UNC_POISON_TLP,"Poisoned TLP"}, \
- {PCI_ERR_UNC_FCP, "Flow Control Protocol"}, \
+ {PCI_ERR_UNC_FCP, "Flow Control Protocol Error"}, \
{PCI_ERR_UNC_COMP_TIME, "Completion Timeout"}, \
{PCI_ERR_UNC_COMP_ABORT,"Completer Abort"}, \
{PCI_ERR_UNC_UNX_COMP, "Unexpected Completion"}, \
{PCI_ERR_UNC_RX_OVER, "Receiver Overflow"}, \
{PCI_ERR_UNC_MALF_TLP, "Malformed TLP"}, \
- {PCI_ERR_UNC_ECRC, "ECRC"}, \
- {PCI_ERR_UNC_UNSUP, "Unsupported Request"}
+ {PCI_ERR_UNC_ECRC, "ECRC Error"}, \
+ {PCI_ERR_UNC_UNSUP, "Unsupported Request Error"}, \
+ {PCI_ERR_UNC_ACSV, "ACS Violation"}, \
+ {PCI_ERR_UNC_INTN, "Uncorrectable Internal Error"},\
+ {PCI_ERR_UNC_MCBTLP, "MC Blocked TLP"}, \
+ {PCI_ERR_UNC_ATOMEG, "AtomicOp Egress Blocked"}, \
+ {PCI_ERR_UNC_TLPPRE, "TLP Prefix Blocked Error"}

TRACE_EVENT(aer_event,
TP_PROTO(const char *dev_name,
--
2.0.0.rc2
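The {mask, string} pairs in this header are in the form used by flag-decoding helpers such as the tracing macro __print_flags(). As a rough user-space illustration of that decoding, and not the tracing implementation itself, the sketch below joins the names of all set bits with a '|' delimiter. The struct, table, and function names are hypothetical, and the masks are local copies of the values in include/uapi/linux/pci_regs.h.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical pair type mirroring the {mask, "name"} entries. */
struct aer_flag_name {
	uint32_t    mask;
	const char *name;
};

/* Correctable error pairs, in the same order as the patch above. */
static const struct aer_flag_name aer_cor_flags[] = {
	{ 0x00000001, "Receiver Error" },
	{ 0x00000040, "Bad TLP" },
	{ 0x00000080, "Bad DLLP" },
	{ 0x00000100, "RELAY_NUM Rollover" },
	{ 0x00001000, "Replay Timer Timeout" },
	{ 0x00002000, "Advisory Non-Fatal Error" },
	{ 0x00004000, "Corrected Internal Error" },
	{ 0x00008000, "Header Log Overflow" },
};

/*
 * Write the names of all set bits into buf, separated by '|'.
 * The result is always NUL-terminated when len > 0; names that do
 * not fit are cut off. Returns buf for convenience.
 */
static char *aer_format_cor_flags(char *buf, size_t len, uint32_t status)
{
	size_t used = 0;
	size_t n = sizeof(aer_cor_flags) / sizeof(aer_cor_flags[0]);

	if (len == 0)
		return buf;
	buf[0] = '\0';

	for (size_t i = 0; i < n; i++) {
		int w;

		if (!(status & aer_cor_flags[i].mask))
			continue;
		w = snprintf(buf + used, len - used, "%s%s",
			     used ? "|" : "", aer_cor_flags[i].name);
		if (w < 0 || (size_t)w >= len - used)
			break;	/* output truncated, stop here */
		used += (size_t)w;
	}
	return buf;
}
```

Because the lookup keys are the masks themselves, this form needs no bit-position arithmetic, which is why the trace header could switch to the PCI_ERR_* names directly.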
Bjorn Helgaas
2014-08-13 13:52:45 UTC
Post by Chen, Gong
No response since the last posting, so I am resending this to a wider audience.
This patch series is for AER related cleanup & update based on PCIe
SPEC r3.0.
I haven't responded because I've been on vacation for the past three
weeks. If there's no change in the patches themselves, and if they
are still in http://patchwork.ozlabs.org/project/linux-pci/list, the
only thing reposting them does is make more work for me.

Bjorn
Chen, Gong
2014-08-14 01:52:36 UTC
Post by Bjorn Helgaas
I haven't responded because I've been on vacation for the past three
weeks. If there's no change in the patches themselves, and if they
are still in http://patchwork.ozlabs.org/project/linux-pci/list, the
only thing reposting them does is make more work for me.
There is one difference in Patch 1: I added more explanation in
the comments, as Boris suggested.
Chen, Gong
2014-09-02 01:31:59 UTC
Date: Wed, 13 Aug 2014 21:52:36 -0400
Subject: Re: [RESEND 0/5] PCIe, AER: Misc cleanup
User-Agent: Mutt/1.5.23 (2014-03-12)
Post by Bjorn Helgaas
I haven't responded because I've been on vacation for the past three
weeks. If there's no change in the patches themselves, and if they
are still in http://patchwork.ozlabs.org/project/linux-pci/list, the
only thing reposting them does is make more work for me.
There is one difference in Patch 1: I added more explanation in
the comments, as Boris suggested.
Hi, Bjorn

Any comments?
Bjorn Helgaas
2014-09-25 15:54:02 UTC
[+cc Heather]
Post by Chen, Gong
No response since the last posting, so I am resending this to a wider audience.
This patch series is for AER related cleanup & update based on PCIe
SPEC r3.0.
I applied patches 1, 2 (modified as described in my response to it), 3, and
4 to pci/aer for v3.19, thanks!

I didn't apply patch 5 because it apparently doesn't fix a reported bug,
and it has the potential to break hardware that used the PCIe r1.0
definition of the "Training Error" bit.

Bjorn
