Discussion:
[PATCH] pci, add sysfs numa_node write function
Prarit Bhargava
2014-10-15 19:05:24 UTC
Permalink
Consider a multi-node, multiple pci root bridge system which can be
configured into one large node or one node/socket. When configuring the
system the numa_node value for each PCI root bridge is always set
incorrectly to -1, or NUMA_NO_NODE, rather than to the node value of each
socket. Each PCI device inherits the numa value directly from it's parent
device, so that the NUMA_NO_NODE value is passed through the entire PCI
tree.

Some new drivers, such as the Intel QAT driver, drivers/crypto/qat,
require that a specific node be assigned to the device in order to
achieve maximum performance for the device, and will fail to load if the
device has NUMA_NO_NODE. The driver would load if the numa_node value
was equal to or greater than -1 and quickly hacking the driver results in
a functional QAT driver.

Using lspci and numactl it is easy to determine what the numa value should
be. The problem is that there is no way to set it. This patch adds a
store function for the PCI device's numa_node value.

To use this, one can do

echo 3 > /sys/devices/pci0000:ff/0000:ff:1f.3/numa_node

to set the numa node for PCI device 0000:ff:1f.3.

Cc: Myron Stowe <***@redhat.com>
Cc: Bjorn Helgaas <***@google.com>
Cc: linux-***@vger.kernel.org
Signed-off-by: Prarit Bhargava <***@redhat.com>
---
drivers/pci/pci-sysfs.c | 23 ++++++++++++++++++++++-
1 file changed, 22 insertions(+), 1 deletion(-)

diff --git a/drivers/pci/pci-sysfs.c b/drivers/pci/pci-sysfs.c
index 92b6d9a..c05ed30 100644
--- a/drivers/pci/pci-sysfs.c
+++ b/drivers/pci/pci-sysfs.c
@@ -221,12 +221,33 @@ static ssize_t enabled_show(struct device *dev, struct device_attribute *attr,
static DEVICE_ATTR_RW(enabled);

#ifdef CONFIG_NUMA
+static ssize_t numa_node_store(struct device *dev,
+ struct device_attribute *attr,
+ const char *buf, size_t count)
+{
+ int node, ret;
+
+ if (!capable(CAP_SYS_ADMIN))
+ return -EPERM;
+
+ ret = kstrtoint(buf, 0, &node);
+ if (ret)
+ return ret;
+
+ if (!node_online(node))
+ return -EINVAL;
+
+ dev->numa_node = node;
+
+ return count;
+}
+
static ssize_t numa_node_show(struct device *dev, struct device_attribute *attr,
char *buf)
{
return sprintf(buf, "%d\n", dev->numa_node);
}
-static DEVICE_ATTR_RO(numa_node);
+static DEVICE_ATTR_RW(numa_node);
#endif

static ssize_t dma_mask_bits_show(struct device *dev,
--
1.7.9.3
Bjorn Helgaas
2014-10-15 19:23:32 UTC
Permalink
Hi Prarit,
Post by Prarit Bhargava
Consider a multi-node, multiple pci root bridge system which can be
configured into one large node or one node/socket. When configuring the
system the numa_node value for each PCI root bridge is always set
incorrectly to -1, or NUMA_NO_NODE, rather than to the node value of each
socket. Each PCI device inherits the numa value directly from it's parent
device, so that the NUMA_NO_NODE value is passed through the entire PCI
tree.
Some new drivers, such as the Intel QAT driver, drivers/crypto/qat,
require that a specific node be assigned to the device in order to
achieve maximum performance for the device, and will fail to load if the
device has NUMA_NO_NODE.
It seems ... unfriendly for a driver to fail to load just because it
can't guarantee maximum performance. Out of curiosity, where does
this actually happen? I had a quick look for NUMA_NO_NODE and
module_init() functions in drivers/crypto/qat, and I didn't see the
spot.
Post by Prarit Bhargava
The driver would load if the numa_node value
was equal to or greater than -1 and quickly hacking the driver results in
a functional QAT driver.
Using lspci and numactl it is easy to determine what the numa value should
be. The problem is that there is no way to set it. This patch adds a
store function for the PCI device's numa_node value.
I'm not familiar with numactl. It sounds like it can show you the
NUMA topology? Where does that information come from?
Post by Prarit Bhargava
To use this, one can do
echo 3 > /sys/devices/pci0000:ff/0000:ff:1f.3/numa_node
to set the numa node for PCI device 0000:ff:1f.3.
It definitely seems wrong that we don't set the node number correctly.
pci_acpi_scan_root() sets the node number by looking for a _PXM method
that applies to the host bridge. Why does that not work in this case?
Does the BIOS not supply _PXM?

If there's information that numactl uses, maybe the kernel should use that, too?

A sysfs interface might be a useful workaround, but obviously it would
be far better if we could fix the BIOS and/or kernel so the workaround
isn't necessary in the first place.

Bjorn
Post by Prarit Bhargava
---
drivers/pci/pci-sysfs.c | 23 ++++++++++++++++++++++-
1 file changed, 22 insertions(+), 1 deletion(-)
diff --git a/drivers/pci/pci-sysfs.c b/drivers/pci/pci-sysfs.c
index 92b6d9a..c05ed30 100644
--- a/drivers/pci/pci-sysfs.c
+++ b/drivers/pci/pci-sysfs.c
@@ -221,12 +221,33 @@ static ssize_t enabled_show(struct device *dev, struct device_attribute *attr,
static DEVICE_ATTR_RW(enabled);
#ifdef CONFIG_NUMA
+static ssize_t numa_node_store(struct device *dev,
+ struct device_attribute *attr,
+ const char *buf, size_t count)
+{
+ int node, ret;
+
+ if (!capable(CAP_SYS_ADMIN))
+ return -EPERM;
+
+ ret = kstrtoint(buf, 0, &node);
+ if (ret)
+ return ret;
+
+ if (!node_online(node))
+ return -EINVAL;
+
+ dev->numa_node = node;
+
+ return count;
+}
+
static ssize_t numa_node_show(struct device *dev, struct device_attribute *attr,
char *buf)
{
return sprintf(buf, "%d\n", dev->numa_node);
}
-static DEVICE_ATTR_RO(numa_node);
+static DEVICE_ATTR_RW(numa_node);
#endif
static ssize_t dma_mask_bits_show(struct device *dev,
--
1.7.9.3
Prarit Bhargava
2014-10-15 19:47:59 UTC
Permalink
Post by Bjorn Helgaas
Hi Prarit,
Post by Prarit Bhargava
Consider a multi-node, multiple pci root bridge system which can be
configured into one large node or one node/socket. When configuring the
system the numa_node value for each PCI root bridge is always set
incorrectly to -1, or NUMA_NO_NODE, rather than to the node value of each
socket. Each PCI device inherits the numa value directly from it's parent
device, so that the NUMA_NO_NODE value is passed through the entire PCI
tree.
Some new drivers, such as the Intel QAT driver, drivers/crypto/qat,
require that a specific node be assigned to the device in order to
achieve maximum performance for the device, and will fail to load if the
device has NUMA_NO_NODE.
It seems ... unfriendly for a driver to fail to load just because it
can't guarantee maximum performance. Out of curiosity, where does
this actually happen? I had a quick look for NUMA_NO_NODE and
module_init() functions in drivers/crypto/qat, and I didn't see the
spot.
The whole point of the Intel QAT driver is to guarantee max performance. If
that is not possible the driver should not load (according to the thread
mentioned below)
Post by Bjorn Helgaas
Post by Prarit Bhargava
The driver would load if the numa_node value
was equal to or greater than -1 and quickly hacking the driver results in
a functional QAT driver.
Using lspci and numactl it is easy to determine what the numa value should
be. The problem is that there is no way to set it. This patch adds a
store function for the PCI device's numa_node value.
I'm not familiar with numactl. It sounds like it can show you the
NUMA topology? Where does that information come from?
You can map at least what nodes are available (although I suppose you can get
the same information from dmesg). You have to do a bit of hunting through the
PCI tree to determine the root PCI devices, but you can determine which root
device is connected to which node.
Post by Bjorn Helgaas
Post by Prarit Bhargava
To use this, one can do
echo 3 > /sys/devices/pci0000:ff/0000:ff:1f.3/numa_node
to set the numa node for PCI device 0000:ff:1f.3.
It definitely seems wrong that we don't set the node number correctly.
pci_acpi_scan_root() sets the node number by looking for a _PXM method
that applies to the host bridge. Why does that not work in this case?
Does the BIOS not supply _PXM?
Yeah ... unfortunately the BIOS is broken in this case. And I know what you're
thinking ;) -- why not get the BIOS fixed? I'm through relying on BIOS fixes
which can take six months to a year to appear in a production version... I've
been bitten too many times by promises of BIOS fixes that never materialize.

We have systems that only have a support cycle of 3 years, and things like ACPI
_PXM updates are at the bottom of the list :/.

FWIW, on this particular system I have a filed a bug with the vendor.
Post by Bjorn Helgaas
If there's information that numactl uses, maybe the kernel should use that, too?
A sysfs interface might be a useful workaround, but obviously it would
be far better if we could fix the BIOS and/or kernel so the workaround
isn't necessary in the first place.
Yep ... but like I said, I don't think anyone wants to wait a year. What if we
never see a fix?

Side issue: While investigating this I noticed that plain kmalloc() is used in
the setup code. Is there a reason we don't use kmalloc_node() in
pci_alloc_dev(), and other allocation functions? It seems like we should be to
optimize system performance. OTOH ... I haven't done any measurements to see if
it actually makes a difference :)

P.
Post by Bjorn Helgaas
Bjorn
Post by Prarit Bhargava
---
drivers/pci/pci-sysfs.c | 23 ++++++++++++++++++++++-
1 file changed, 22 insertions(+), 1 deletion(-)
diff --git a/drivers/pci/pci-sysfs.c b/drivers/pci/pci-sysfs.c
index 92b6d9a..c05ed30 100644
--- a/drivers/pci/pci-sysfs.c
+++ b/drivers/pci/pci-sysfs.c
@@ -221,12 +221,33 @@ static ssize_t enabled_show(struct device *dev, struct device_attribute *attr,
static DEVICE_ATTR_RW(enabled);
#ifdef CONFIG_NUMA
+static ssize_t numa_node_store(struct device *dev,
+ struct device_attribute *attr,
+ const char *buf, size_t count)
+{
+ int node, ret;
+
+ if (!capable(CAP_SYS_ADMIN))
+ return -EPERM;
+
+ ret = kstrtoint(buf, 0, &node);
+ if (ret)
+ return ret;
+
+ if (!node_online(node))
+ return -EINVAL;
+
+ dev->numa_node = node;
+
+ return count;
+}
+
static ssize_t numa_node_show(struct device *dev, struct device_attribute *attr,
char *buf)
{
return sprintf(buf, "%d\n", dev->numa_node);
}
-static DEVICE_ATTR_RO(numa_node);
+static DEVICE_ATTR_RW(numa_node);
#endif
static ssize_t dma_mask_bits_show(struct device *dev,
--
1.7.9.3
Prarit Bhargava
2014-10-15 19:51:28 UTC
Permalink
Post by Prarit Bhargava
Post by Bjorn Helgaas
Hi Prarit,
Post by Prarit Bhargava
Consider a multi-node, multiple pci root bridge system which can be
configured into one large node or one node/socket. When configuring the
system the numa_node value for each PCI root bridge is always set
incorrectly to -1, or NUMA_NO_NODE, rather than to the node value of each
socket. Each PCI device inherits the numa value directly from it's parent
device, so that the NUMA_NO_NODE value is passed through the entire PCI
tree.
Some new drivers, such as the Intel QAT driver, drivers/crypto/qat,
require that a specific node be assigned to the device in order to
achieve maximum performance for the device, and will fail to load if the
device has NUMA_NO_NODE.
It seems ... unfriendly for a driver to fail to load just because it
can't guarantee maximum performance. Out of curiosity, where does
this actually happen? I had a quick look for NUMA_NO_NODE and
module_init() functions in drivers/crypto/qat, and I didn't see the
spot.
The whole point of the Intel QAT driver is to guarantee max performance. If
that is not possible the driver should not load (according to the thread
mentioned below)
Whups. This thread:

http://marc.info/?l=linux-crypto-vger&m=141279031626999&w=2
Post by Prarit Bhargava
Post by Bjorn Helgaas
Post by Prarit Bhargava
The driver would load if the numa_node value
was equal to or greater than -1 and quickly hacking the driver results in
a functional QAT driver.
Using lspci and numactl it is easy to determine what the numa value should
be. The problem is that there is no way to set it. This patch adds a
store function for the PCI device's numa_node value.
I'm not familiar with numactl. It sounds like it can show you the
NUMA topology? Where does that information come from?
You can map at least what nodes are available (although I suppose you can get
the same information from dmesg). You have to do a bit of hunting through the
PCI tree to determine the root PCI devices, but you can determine which root
device is connected to which node.
Post by Bjorn Helgaas
Post by Prarit Bhargava
To use this, one can do
echo 3 > /sys/devices/pci0000:ff/0000:ff:1f.3/numa_node
to set the numa node for PCI device 0000:ff:1f.3.
It definitely seems wrong that we don't set the node number correctly.
pci_acpi_scan_root() sets the node number by looking for a _PXM method
that applies to the host bridge. Why does that not work in this case?
Does the BIOS not supply _PXM?
Yeah ... unfortunately the BIOS is broken in this case. And I know what you're
thinking ;) -- why not get the BIOS fixed? I'm through relying on BIOS fixes
which can take six months to a year to appear in a production version... I've
been bitten too many times by promises of BIOS fixes that never materialize.
We have systems that only have a support cycle of 3 years, and things like ACPI
_PXM updates are at the bottom of the list :/.
FWIW, on this particular system I have a filed a bug with the vendor.
Post by Bjorn Helgaas
If there's information that numactl uses, maybe the kernel should use that, too?
A sysfs interface might be a useful workaround, but obviously it would
be far better if we could fix the BIOS and/or kernel so the workaround
isn't necessary in the first place.
Yep ... but like I said, I don't think anyone wants to wait a year. What if we
never see a fix?
Side issue: While investigating this I noticed that plain kmalloc() is used in
the setup code. Is there a reason we don't use kmalloc_node() in
pci_alloc_dev(), and other allocation functions? It seems like we should be to
optimize system performance. OTOH ... I haven't done any measurements to see if
it actually makes a difference :)
P.
Post by Bjorn Helgaas
Bjorn
Post by Prarit Bhargava
---
drivers/pci/pci-sysfs.c | 23 ++++++++++++++++++++++-
1 file changed, 22 insertions(+), 1 deletion(-)
diff --git a/drivers/pci/pci-sysfs.c b/drivers/pci/pci-sysfs.c
index 92b6d9a..c05ed30 100644
--- a/drivers/pci/pci-sysfs.c
+++ b/drivers/pci/pci-sysfs.c
@@ -221,12 +221,33 @@ static ssize_t enabled_show(struct device *dev, struct device_attribute *attr,
static DEVICE_ATTR_RW(enabled);
#ifdef CONFIG_NUMA
+static ssize_t numa_node_store(struct device *dev,
+ struct device_attribute *attr,
+ const char *buf, size_t count)
+{
+ int node, ret;
+
+ if (!capable(CAP_SYS_ADMIN))
+ return -EPERM;
+
+ ret = kstrtoint(buf, 0, &node);
+ if (ret)
+ return ret;
+
+ if (!node_online(node))
+ return -EINVAL;
+
+ dev->numa_node = node;
+
+ return count;
+}
+
static ssize_t numa_node_show(struct device *dev, struct device_attribute *attr,
char *buf)
{
return sprintf(buf, "%d\n", dev->numa_node);
}
-static DEVICE_ATTR_RO(numa_node);
+static DEVICE_ATTR_RW(numa_node);
#endif
static ssize_t dma_mask_bits_show(struct device *dev,
--
1.7.9.3
Bjorn Helgaas
2014-10-15 21:20:30 UTC
Permalink
Post by Prarit Bhargava
Post by Bjorn Helgaas
Hi Prarit,
Post by Prarit Bhargava
Consider a multi-node, multiple pci root bridge system which can be
configured into one large node or one node/socket. When configuring the
system the numa_node value for each PCI root bridge is always set
incorrectly to -1, or NUMA_NO_NODE, rather than to the node value of each
socket. Each PCI device inherits the numa value directly from it's parent
device, so that the NUMA_NO_NODE value is passed through the entire PCI
tree.
Some new drivers, such as the Intel QAT driver, drivers/crypto/qat,
require that a specific node be assigned to the device in order to
achieve maximum performance for the device, and will fail to load if the
device has NUMA_NO_NODE.
It seems ... unfriendly for a driver to fail to load just because it
can't guarantee maximum performance. Out of curiosity, where does
this actually happen? I had a quick look for NUMA_NO_NODE and
module_init() functions in drivers/crypto/qat, and I didn't see the
spot.
The whole point of the Intel QAT driver is to guarantee max performance. If
that is not possible the driver should not load (according to the thread
mentioned below)
Post by Bjorn Helgaas
Post by Prarit Bhargava
The driver would load if the numa_node value
was equal to or greater than -1 and quickly hacking the driver results in
a functional QAT driver.
Using lspci and numactl it is easy to determine what the numa value should
be. The problem is that there is no way to set it. This patch adds a
store function for the PCI device's numa_node value.
I'm not familiar with numactl. It sounds like it can show you the
NUMA topology? Where does that information come from?
You can map at least what nodes are available (although I suppose you can get
the same information from dmesg). You have to do a bit of hunting through the
PCI tree to determine the root PCI devices, but you can determine which root
device is connected to which node.
Is numactl reading SRAT? SLIT? SMBIOS tables? Presumably the kernel
has access to whatever information you're getting from numactl and
lspci, and if so, maybe we can do the workaround automatically in the
kernel.
Post by Prarit Bhargava
Post by Bjorn Helgaas
Post by Prarit Bhargava
To use this, one can do
echo 3 > /sys/devices/pci0000:ff/0000:ff:1f.3/numa_node
to set the numa node for PCI device 0000:ff:1f.3.
It definitely seems wrong that we don't set the node number correctly.
pci_acpi_scan_root() sets the node number by looking for a _PXM method
that applies to the host bridge. Why does that not work in this case?
Does the BIOS not supply _PXM?
Yeah ... unfortunately the BIOS is broken in this case. And I know what you're
thinking ;) -- why not get the BIOS fixed? I'm through relying on BIOS fixes
which can take six months to a year to appear in a production version... I've
been bitten too many times by promises of BIOS fixes that never materialize.
Yep, I understand. The question is how we implement a workaround so
it doesn't become the accepted way to do things. Obviously we don't
want people manually grubbing through numactl/lspci output or writing
shell scripts to do things that *should* happen automatically.
Post by Prarit Bhargava
We have systems that only have a support cycle of 3 years, and things like ACPI
_PXM updates are at the bottom of the list :/.
Something's wrong with this picture. If vendors are building systems
where node information is important, and the platform doesn't tell the
OS what the node numbers ARE, then in my opinion, the vendor is
essentially asking for low performance and is getting what he asked
for, and his customers should learn that the answer is to shop
elsewhere.

Somewhere in the picture there needs to be a feedback loop that
encourages the vendor to fix the problem. I don't see that happening
yet. Having QAT fail because the platform didn't supply the
information required to make it work would be a nice loop. I don't
want to completely paper over the problem without providing some other
kind of feedback at the same time.

You're probably aware of [1], which was the same problem. Apparently
it was originally reported to RedHat as [2] (which is private, so I
can't read it). That led to a workaround hack for some AMD systems
[3, 4].

[1] https://bugzilla.kernel.org/show_bug.cgi?id=72051
[2] https://bugzilla.redhat.com/show_bug.cgi?id=1040440
[3] http://lkml.kernel.org/r/1394053603-3724-1-git-send-email-***@amd.com
[4] http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=94d4bb5b1397
Post by Prarit Bhargava
FWIW, on this particular system I have a filed a bug with the vendor.
Post by Bjorn Helgaas
If there's information that numactl uses, maybe the kernel should use that, too?
A sysfs interface might be a useful workaround, but obviously it would
be far better if we could fix the BIOS and/or kernel so the workaround
isn't necessary in the first place.
Yep ... but like I said, I don't think anyone wants to wait a year. What if we
never see a fix?
Side issue: While investigating this I noticed that plain kmalloc() is used in
the setup code. Is there a reason we don't use kmalloc_node() in
pci_alloc_dev(), and other allocation functions? It seems like we should be to
optimize system performance. OTOH ... I haven't done any measurements to see if
it actually makes a difference :)
Yeah, that probably would make sense. We do use kmalloc_node() for
some of the host bridge stuff, thanks to 965cd0e4a5e5 ("x86, PCI,
ACPI: Use kmalloc_node() to optimize for performance"). But it
probably would make sense to extend that farther down, too.
Post by Prarit Bhargava
Post by Bjorn Helgaas
Post by Prarit Bhargava
---
drivers/pci/pci-sysfs.c | 23 ++++++++++++++++++++++-
1 file changed, 22 insertions(+), 1 deletion(-)
diff --git a/drivers/pci/pci-sysfs.c b/drivers/pci/pci-sysfs.c
index 92b6d9a..c05ed30 100644
--- a/drivers/pci/pci-sysfs.c
+++ b/drivers/pci/pci-sysfs.c
@@ -221,12 +221,33 @@ static ssize_t enabled_show(struct device *dev, struct device_attribute *attr,
static DEVICE_ATTR_RW(enabled);
#ifdef CONFIG_NUMA
+static ssize_t numa_node_store(struct device *dev,
+ struct device_attribute *attr,
+ const char *buf, size_t count)
+{
+ int node, ret;
+
+ if (!capable(CAP_SYS_ADMIN))
+ return -EPERM;
+
+ ret = kstrtoint(buf, 0, &node);
+ if (ret)
+ return ret;
+
+ if (!node_online(node))
+ return -EINVAL;
+
+ dev->numa_node = node;
+
+ return count;
+}
+
static ssize_t numa_node_show(struct device *dev, struct device_attribute *attr,
char *buf)
{
return sprintf(buf, "%d\n", dev->numa_node);
}
-static DEVICE_ATTR_RO(numa_node);
+static DEVICE_ATTR_RW(numa_node);
#endif
static ssize_t dma_mask_bits_show(struct device *dev,
--
1.7.9.3
Prarit Bhargava
2014-10-16 12:32:47 UTC
Permalink
Post by Bjorn Helgaas
Post by Prarit Bhargava
Post by Bjorn Helgaas
Hi Prarit,
Post by Prarit Bhargava
Consider a multi-node, multiple pci root bridge system which can be
configured into one large node or one node/socket. When configuring the
system the numa_node value for each PCI root bridge is always set
incorrectly to -1, or NUMA_NO_NODE, rather than to the node value of each
socket. Each PCI device inherits the numa value directly from it's parent
device, so that the NUMA_NO_NODE value is passed through the entire PCI
tree.
Some new drivers, such as the Intel QAT driver, drivers/crypto/qat,
require that a specific node be assigned to the device in order to
achieve maximum performance for the device, and will fail to load if the
device has NUMA_NO_NODE.
It seems ... unfriendly for a driver to fail to load just because it
can't guarantee maximum performance. Out of curiosity, where does
this actually happen? I had a quick look for NUMA_NO_NODE and
module_init() functions in drivers/crypto/qat, and I didn't see the
spot.
The whole point of the Intel QAT driver is to guarantee max performance. If
that is not possible the driver should not load (according to the thread
mentioned below)
Post by Bjorn Helgaas
Post by Prarit Bhargava
The driver would load if the numa_node value
was equal to or greater than -1 and quickly hacking the driver results in
a functional QAT driver.
Using lspci and numactl it is easy to determine what the numa value should
be. The problem is that there is no way to set it. This patch adds a
store function for the PCI device's numa_node value.
I'm not familiar with numactl. It sounds like it can show you the
NUMA topology? Where does that information come from?
You can map at least what nodes are available (although I suppose you can get
the same information from dmesg). You have to do a bit of hunting through the
PCI tree to determine the root PCI devices, but you can determine which root
device is connected to which node.
Is numactl reading SRAT? SLIT? SMBIOS tables? Presumably the kernel
has access to whatever information you're getting from numactl and
lspci, and if so, maybe we can do the workaround automatically in the
kernel.
I'll go figure that out ...
Post by Bjorn Helgaas
Post by Prarit Bhargava
Post by Bjorn Helgaas
Post by Prarit Bhargava
To use this, one can do
echo 3 > /sys/devices/pci0000:ff/0000:ff:1f.3/numa_node
to set the numa node for PCI device 0000:ff:1f.3.
It definitely seems wrong that we don't set the node number correctly.
pci_acpi_scan_root() sets the node number by looking for a _PXM method
that applies to the host bridge. Why does that not work in this case?
Does the BIOS not supply _PXM?
Yeah ... unfortunately the BIOS is broken in this case. And I know what you're
thinking ;) -- why not get the BIOS fixed? I'm through relying on BIOS fixes
which can take six months to a year to appear in a production version... I've
been bitten too many times by promises of BIOS fixes that never materialize.
Yep, I understand. The question is how we implement a workaround so
it doesn't become the accepted way to do things. Obviously we don't
want people manually grubbing through numactl/lspci output or writing
shell scripts to do things that *should* happen automatically.
Post by Prarit Bhargava
We have systems that only have a support cycle of 3 years, and things like ACPI
_PXM updates are at the bottom of the list :/.
Somewhere in the picture there needs to be a feedback loop that
encourages the vendor to fix the problem. I don't see that happening
yet. Having QAT fail because the platform didn't supply the
information required to make it work would be a nice loop. I don't
want to completely paper over the problem without providing some other
kind of feedback at the same time.
Okay -- I see what you're after here and I completely agree with it. But
sometimes I feel like I banging on a silent drum with some of these companies
about this stuff :( My frustration with these companies is starting to show I
guess...
Post by Bjorn Helgaas
You're probably aware of [1], which was the same problem. Apparently
it was originally reported to RedHat as [2] (which is private, so I
can't read it). That led to a workaround hack for some AMD systems
[3, 4].
Yeah ... part of me was thinking that maybe I should do something like
the above but I didn't know how you'd feel about expanding that hack. I'll look
into it. I'd prefer it to be opt-in with a kernel parameter.

P.
Alexander Duyck
2014-10-16 14:44:02 UTC
Permalink
Post by Prarit Bhargava
Post by Bjorn Helgaas
Post by Prarit Bhargava
Post by Bjorn Helgaas
Hi Prarit,
Post by Prarit Bhargava
Consider a multi-node, multiple pci root bridge system which can be
configured into one large node or one node/socket. When configuring the
system the numa_node value for each PCI root bridge is always set
incorrectly to -1, or NUMA_NO_NODE, rather than to the node value of each
socket. Each PCI device inherits the numa value directly from it's parent
device, so that the NUMA_NO_NODE value is passed through the entire PCI
tree.
Some new drivers, such as the Intel QAT driver, drivers/crypto/qat,
require that a specific node be assigned to the device in order to
achieve maximum performance for the device, and will fail to load if the
device has NUMA_NO_NODE.
It seems ... unfriendly for a driver to fail to load just because it
can't guarantee maximum performance. Out of curiosity, where does
this actually happen? I had a quick look for NUMA_NO_NODE and
module_init() functions in drivers/crypto/qat, and I didn't see the
spot.
The whole point of the Intel QAT driver is to guarantee max performance. If
that is not possible the driver should not load (according to the thread
mentioned below)
This is just short-sighted thinking. The fact that the PCI device
advertises -1 means that either the BIOS isn't configured, or the PCI
slots are shared as was the case on some Nehalem systems where the IOH
was shared between two sockets. I suspect that this driver doesn't even
take that into account as it was likely only written for Sandy Bridge
architectures.
Post by Prarit Bhargava
Post by Bjorn Helgaas
Post by Prarit Bhargava
Post by Bjorn Helgaas
Post by Prarit Bhargava
To use this, one can do
echo 3 > /sys/devices/pci0000:ff/0000:ff:1f.3/numa_node
to set the numa node for PCI device 0000:ff:1f.3.
It definitely seems wrong that we don't set the node number correctly.
pci_acpi_scan_root() sets the node number by looking for a _PXM method
that applies to the host bridge. Why does that not work in this case?
Does the BIOS not supply _PXM?
Yeah ... unfortunately the BIOS is broken in this case. And I know what you're
thinking ;) -- why not get the BIOS fixed? I'm through relying on BIOS fixes
which can take six months to a year to appear in a production version... I've
been bitten too many times by promises of BIOS fixes that never materialize.
Yep, I understand. The question is how we implement a workaround so
it doesn't become the accepted way to do things. Obviously we don't
want people manually grubbing through numactl/lspci output or writing
shell scripts to do things that *should* happen automatically.
I'd say if nothing else we should flag the system as tainted as soon as
we start overwriting BIOS/ACPI configured values with sysfs. This is one
of the reasons for the TAINT_FIRMWARE_WORKAROUND even existing.
Post by Prarit Bhargava
Post by Bjorn Helgaas
Somewhere in the picture there needs to be a feedback loop that
encourages the vendor to fix the problem. I don't see that happening
yet. Having QAT fail because the platform didn't supply the
information required to make it work would be a nice loop. I don't
want to completely paper over the problem without providing some other
kind of feedback at the same time.
Okay -- I see what you're after here and I completely agree with it. But
sometimes I feel like I banging on a silent drum with some of these companies
about this stuff :( My frustration with these companies is starting to show I
guess...
Just how visible is the QAT driver load failure? I has a similar issue
with DCA not being configured in a number of BIOSes and it wasn't until
I made the issue painfully visible with TAINT_FIRMWARE_WORKAROUND that I
started to see any traction on getting this fixed in the BIOSes.

We would need to sort out the systems that actually have bad BIOSes
versus just being configured without PCI slots directly associated with
any given NUMA node since there are systems where that is a valid
configuration.
Post by Prarit Bhargava
Post by Bjorn Helgaas
You're probably aware of [1], which was the same problem. Apparently
it was originally reported to RedHat as [2] (which is private, so I
can't read it). That led to a workaround hack for some AMD systems
[3, 4].
Yeah ... part of me was thinking that maybe I should do something like
the above but I didn't know how you'd feel about expanding that hack. I'll look
into it. I'd prefer it to be opt-in with a kernel parameter.
P.
Are you thinking something like a "pci=assign-numa"? The problem is
there doesn't seem to be a good way to currently determine the NUMA
layout without the information being provided by the BIOS/ACPI tables,
and we probably don't want to be creating a definition of the NUMA
layout per platform.

Thanks,

Alex
Prarit Bhargava
2014-10-16 16:07:18 UTC
Permalink
Post by Prarit Bhargava
Post by Bjorn Helgaas
Post by Prarit Bhargava
Post by Bjorn Helgaas
Hi Prarit,
Post by Prarit Bhargava
Consider a multi-node, multiple pci root bridge system which can be
configured into one large node or one node/socket. When configuring the
system the numa_node value for each PCI root bridge is always set
incorrectly to -1, or NUMA_NO_NODE, rather than to the node value of each
socket. Each PCI device inherits the numa value directly from it's parent
device, so that the NUMA_NO_NODE value is passed through the entire PCI
tree.
Some new drivers, such as the Intel QAT driver, drivers/crypto/qat,
require that a specific node be assigned to the device in order to
achieve maximum performance for the device, and will fail to load if the
device has NUMA_NO_NODE.
It seems ... unfriendly for a driver to fail to load just because it
can't guarantee maximum performance. Out of curiosity, where does
this actually happen? I had a quick look for NUMA_NO_NODE and
module_init() functions in drivers/crypto/qat, and I didn't see the
spot.
The whole point of the Intel QAT driver is to guarantee max performance. If
that is not possible the driver should not load (according to the thread
mentioned below)
This is just short-sighted thinking. The fact that the PCI device advertises -1
means that either the BIOS isn't configured, or the PCI slots are shared as was
the case on some Nehalem systems where the IOH was shared between two sockets.
Yes ...
I suspect that this driver doesn't even take that into account as it was likely
only written for Sandy Bridge architectures.
Nope. New hardware. The issue is that there is only a performance impact if
local node memory is used, o/w the claim is that the performance drops to that
of doing software crypto.
Post by Prarit Bhargava
Post by Bjorn Helgaas
Post by Prarit Bhargava
Post by Bjorn Helgaas
Post by Prarit Bhargava
To use this, one can do
echo 3 > /sys/devices/pci0000:ff/0000:ff:1f.3/numa_node
to set the numa node for PCI device 0000:ff:1f.3.
It definitely seems wrong that we don't set the node number correctly.
pci_acpi_scan_root() sets the node number by looking for a _PXM method
that applies to the host bridge. Why does that not work in this case?
Does the BIOS not supply _PXM?
Yeah ... unfortunately the BIOS is broken in this case. And I know what you're
thinking ;) -- why not get the BIOS fixed? I'm through relying on BIOS fixes
which can take six months to a year to appear in a production version... I've
been bitten too many times by promises of BIOS fixes that never materialize.
Yep, I understand. The question is how we implement a workaround so
it doesn't become the accepted way to do things. Obviously we don't
want people manually grubbing through numactl/lspci output or writing
shell scripts to do things that *should* happen automatically.
I'd say if nothing else we should flag the system as tainted as soon as we start
overwriting BIOS/ACPI configured values with sysfs. This is one of the reasons
for the TAINT_FIRMWARE_WORKAROUND even existing.
I was thinking that I could modify the patch to do this, but I'd rather
investigate Bjorn's suggestion first. I think his approach has some merits, but
I will definitely TAINT if I go with that approach too.
Post by Prarit Bhargava
Post by Bjorn Helgaas
Somewhere in the picture there needs to be a feedback loop that
encourages the vendor to fix the problem. I don't see that happening
yet. Having QAT fail because the platform didn't supply the
information required to make it work would be a nice loop. I don't
want to completely paper over the problem without providing some other
kind of feedback at the same time.
Okay -- I see what you're after here and I completely agree with it. But
sometimes I feel like I banging on a silent drum with some of these companies
about this stuff :( My frustration with these companies is starting to show I
guess...
Just how visible is the QAT driver load failure? I has a similar issue with DCA
not being configured in a number of BIOSes and it wasn't until I made the issue
painfully visible with TAINT_FIRMWARE_WORKAROUND that I started to see any
traction on getting this fixed in the BIOSes.
Just a couple of printks to the screen.
We would need to sort out the systems that actually have bad BIOSes versus just
being configured without PCI slots directly associated with any given NUMA node
since there are systems where that is a valid configuration.
The problem is NUMA_NO_NODE, which as you point out can be a valid
configuration. So in some cases system designers may have intentionally done
this (Can anyone think of a valid reason to leave off the _PXM, or have it
assigned to NUMA_NO_NODE?), so the previous statement about having an opt-in,
then attempting to calculate the node location, and now TAINTING might be a good
direction to move in.

OTOH ... what do we do about older unsupported hardware that won't have new BIOS
releases? Those would basically say "Go fix your BIOS" and there's nothing that
could be done :/. All those users see is a loud warning...
Post by Prarit Bhargava
Post by Bjorn Helgaas
You're probably aware of [1], which was the same problem. Apparently
it was originally reported to RedHat as [2] (which is private, so I
can't read it). That led to a workaround hack for some AMD systems
[3, 4].
Yeah ... part of me was thinking that maybe I should do something like
the above but I didn't know how you'd feel about expanding that hack. I'll look
into it. I'd prefer it to be opt-in with a kernel parameter.
P.
Are you thinking something like a "pci=assign-numa"? The problem is there
doesn't seem to be a good way to currently determine the NUMA layout without the
information being provided by the BIOS/ACPI tables, and we probably don't want
to be creating a definition of the NUMA layout per platform.
Well ... let me think about this for a bit. The big issue is what happens
during socket hot-add events and the PCI root bridges that are added at that
time. It may not be possible to come up with a correct calculation :( but let
me give it a shot. IIRC the node-to-socket map should be static...

P.
Thanks,
Alex
Alexander Duyck
2014-10-16 17:04:05 UTC
Permalink
Post by Prarit Bhargava
Post by Prarit Bhargava
Post by Bjorn Helgaas
Post by Prarit Bhargava
Post by Bjorn Helgaas
Hi Prarit,
Post by Prarit Bhargava
Consider a multi-node, multiple pci root bridge system which can be
configured into one large node or one node/socket. When configuring the
system the numa_node value for each PCI root bridge is always set
incorrectly to -1, or NUMA_NO_NODE, rather than to the node value of each
socket. Each PCI device inherits the numa value directly from it's parent
device, so that the NUMA_NO_NODE value is passed through the entire PCI
tree.
Some new drivers, such as the Intel QAT driver, drivers/crypto/qat,
require that a specific node be assigned to the device in order to
achieve maximum performance for the device, and will fail to load if the
device has NUMA_NO_NODE.
It seems ... unfriendly for a driver to fail to load just because it
can't guarantee maximum performance. Out of curiosity, where does
this actually happen? I had a quick look for NUMA_NO_NODE and
module_init() functions in drivers/crypto/qat, and I didn't see the
spot.
The whole point of the Intel QAT driver is to guarantee max performance. If
that is not possible the driver should not load (according to the thread
mentioned below)
This is just short-sighted thinking. The fact that the PCI device advertises -1
means that either the BIOS isn't configured, or the PCI slots are shared as was
the case on some Nehalem systems where the IOH was shared between two sockets.
Yes ...
I suspect that this driver doesn't even take that into account as it was likely
only written for Sandy Bridge architectures.
Nope. New hardware. The issue is that there is only a performance impact if
local node memory is used, o/w the claim is that the performance drops to that
of doing software crypto.
Yes, but the problem is what is considered a local node? In the case of
the Nehalem architecture it wasn't uncommon to have two sockets share an
IOH. As a result each node would have its own memory, but the PCI
devices were shared. That platform would fail this test, but each node
has its own memory and I suspect the QPI overhead wouldn't be that bad
since the link is dedicated to the IOH connection.

What I am getting at is that this driver was likely intended for
specifically the current and follow-on generations of Xeon processors.
As such it deciding to reject all NUMA_NO_NODE configuration may be
acceptable for it since they have a very specific architecture and CPU
in mind. For the workaround if you are going to try and do something as
generic, you will need to be mindful of how the other architectures deal
with NUMA and PCI.
Post by Prarit Bhargava
Post by Prarit Bhargava
Post by Bjorn Helgaas
Post by Prarit Bhargava
Post by Bjorn Helgaas
Post by Prarit Bhargava
To use this, one can do
echo 3 > /sys/devices/pci0000:ff/0000:ff:1f.3/numa_node
to set the numa node for PCI device 0000:ff:1f.3.
It definitely seems wrong that we don't set the node number correctly.
pci_acpi_scan_root() sets the node number by looking for a _PXM method
that applies to the host bridge. Why does that not work in this case?
Does the BIOS not supply _PXM?
Yeah ... unfortunately the BIOS is broken in this case. And I know what you're
thinking ;) -- why not get the BIOS fixed? I'm through relying on BIOS fixes
which can take six months to a year to appear in a production version... I've
been bitten too many times by promises of BIOS fixes that never materialize.
Yep, I understand. The question is how we implement a workaround so
it doesn't become the accepted way to do things. Obviously we don't
want people manually grubbing through numactl/lspci output or writing
shell scripts to do things that *should* happen automatically.
I'd say if nothing else we should flag the system as tainted as soon as we start
overwriting BIOS/ACPI configured values with sysfs. This is one of the reasons
for the TAINT_FIRMWARE_WORKAROUND even existing.
I was thinking that I could modify the patch to do this, but I'd rather
investigate Bjorn's suggestion first. I think his approach has some merits, but
I will definitely TAINT if I go with that approach too.
If we are doing any sort of workaround we should mark the system as
being tainted. Otherwise we are just going to let whatever workaround
you implement be accepted and nobody will ever bother to fix the BIOS in
this generation or the next.
Post by Prarit Bhargava
Post by Prarit Bhargava
Post by Bjorn Helgaas
Somewhere in the picture there needs to be a feedback loop that
encourages the vendor to fix the problem. I don't see that happening
yet. Having QAT fail because the platform didn't supply the
information required to make it work would be a nice loop. I don't
want to completely paper over the problem without providing some other
kind of feedback at the same time.
Okay -- I see what you're after here and I completely agree with it. But
sometimes I feel like I banging on a silent drum with some of these companies
about this stuff :( My frustration with these companies is starting to show I
guess...
Just how visible is the QAT driver load failure? I has a similar issue with DCA
not being configured in a number of BIOSes and it wasn't until I made the issue
painfully visible with TAINT_FIRMWARE_WORKAROUND that I started to see any
traction on getting this fixed in the BIOSes.
Just a couple of printks to the screen.
Yeah, taint would definitely be preferred. Then if something has a
panic later because we didn't work around this correctly we should be
able to see that this is a tainted system in the panic dump. I did the
same thing for DSA.
Post by Prarit Bhargava
We would need to sort out the systems that actually have bad BIOSes versus just
being configured without PCI slots directly associated with any given NUMA node
since there are systems where that is a valid configuration.
The problem is NUMA_NO_NODE, which as you point out can be a valid
configuration. So in some cases system designers may have intentionally done
this (Can anyone think of a valid reason to leave off the _PXM, or have it
assigned to NUMA_NO_NODE?), so the previous statement about having an opt-in,
then attempting to calculate the node location, and now TAINTING might be a good
direction to move in.
That is my thought. We would just need to come up with a good way of
sorting these platforms out and that can be the challenge. From past
experience I have even seen some systems where the memory layout wasn't
configured correctly in the BIOS/ACPI tables so this is pretty much just
a crapshoot as to if we can figure out the actual layout or not via
software. Maybe this could be identified with
TAINT_OVERRIDDEN_ACPI_TABLE since we are having to hand code an ACPI table.
Post by Prarit Bhargava
OTOH ... what do we do about older unsupported hardware that won't have new BIOS
releases? Those would basically say "Go fix your BIOS" and there's nothing that
could be done :/. All those users see is a loud warning...
I'm not saying we cannot provide a workaround for it, but at the same
time as you have pointed out this is mostly just a performance issue due
to the BIOS and we could be introducing new issues by trying to fix it
in the kernel.

The loud warning is both for the users and those of us that are stuck
debugging the issues they find when their platform doesn't run as
expected and triggers a panic. You have to keep in mind that by working
around things we run the risk of introducing new issues and we have to
know when we have entered one of those cases.
Post by Prarit Bhargava
Post by Prarit Bhargava
Post by Bjorn Helgaas
You're probably aware of [1], which was the same problem. Apparently
it was originally reported to RedHat as [2] (which is private, so I
can't read it). That led to a workaround hack for some AMD systems
[3, 4].
Yeah ... part of me was thinking that maybe I should do something like
the above but I didn't know how you'd feel about expanding that hack. I'll look
into it. I'd prefer it to be opt-in with a kernel parameter.
P.
Are you thinking something like a "pci=assign-numa"? The problem is there
doesn't seem to be a good way to currently determine the NUMA layout without the
information being provided by the BIOS/ACPI tables, and we probably don't want
to be creating a definition of the NUMA layout per platform.
Well ... let me think about this for a bit. The big issue is what happens
during socket hot-add events and the PCI root bridges that are added at that
time. It may not be possible to come up with a correct calculation :( but let
me give it a shot. IIRC the node-to-socket map should be static...
P.
I don't know if there is a good way to resolve all this programmatically
since I have seen a number of issues where even the NUMA memory layout
is messed up. The sysfs might be a good approach however I would
suggest tainting the kernel at that point as you are overwritting
pre-defined BIOS/ACPI table values and that may have certain side
effects if things don't get configured correctly.

Thanks,

Alex
Myron Stowe
2014-10-16 19:45:23 UTC
Permalink
Post by Prarit Bhargava
Post by Bjorn Helgaas
Post by Prarit Bhargava
Post by Bjorn Helgaas
Hi Prarit,
Post by Prarit Bhargava
Consider a multi-node, multiple pci root bridge system which can be
configured into one large node or one node/socket. When configuring the
system the numa_node value for each PCI root bridge is always set
incorrectly to -1, or NUMA_NO_NODE, rather than to the node value of each
socket. Each PCI device inherits the numa value directly from it's parent
device, so that the NUMA_NO_NODE value is passed through the entire PCI
tree.
Some new drivers, such as the Intel QAT driver, drivers/crypto/qat,
require that a specific node be assigned to the device in order to
achieve maximum performance for the device, and will fail to load if the
device has NUMA_NO_NODE.
It seems ... unfriendly for a driver to fail to load just because it
can't guarantee maximum performance. Out of curiosity, where does
this actually happen? I had a quick look for NUMA_NO_NODE and
module_init() functions in drivers/crypto/qat, and I didn't see the
spot.
The whole point of the Intel QAT driver is to guarantee max performance. If
that is not possible the driver should not load (according to the thread
mentioned below)
Post by Bjorn Helgaas
Post by Prarit Bhargava
The driver would load if the numa_node value
was equal to or greater than -1 and quickly hacking the driver results in
a functional QAT driver.
Using lspci and numactl it is easy to determine what the numa value should
be. The problem is that there is no way to set it. This patch adds a
store function for the PCI device's numa_node value.
I'm not familiar with numactl. It sounds like it can show you the
NUMA topology? Where does that information come from?
You can map at least what nodes are available (although I suppose you can get
the same information from dmesg). You have to do a bit of hunting through the
PCI tree to determine the root PCI devices, but you can determine which root
device is connected to which node.
Is numactl reading SRAT? SLIT? SMBIOS tables? Presumably the kernel
has access to whatever information you're getting from numactl and
lspci, and if so, maybe we can do the workaround automatically in the
kernel.
I'll go figure that out ...
Post by Bjorn Helgaas
Post by Prarit Bhargava
Post by Bjorn Helgaas
Post by Prarit Bhargava
To use this, one can do
echo 3 > /sys/devices/pci0000:ff/0000:ff:1f.3/numa_node
to set the numa node for PCI device 0000:ff:1f.3.
It definitely seems wrong that we don't set the node number correctly.
pci_acpi_scan_root() sets the node number by looking for a _PXM method
that applies to the host bridge. Why does that not work in this case?
Does the BIOS not supply _PXM?
Yeah ... unfortunately the BIOS is broken in this case. And I know what you're
thinking ;) -- why not get the BIOS fixed? I'm through relying on BIOS fixes
which can take six months to a year to appear in a production version... I've
been bitten too many times by promises of BIOS fixes that never materialize.
Yep, I understand. The question is how we implement a workaround so
it doesn't become the accepted way to do things. Obviously we don't
want people manually grubbing through numactl/lspci output or writing
shell scripts to do things that *should* happen automatically.
Post by Prarit Bhargava
We have systems that only have a support cycle of 3 years, and things like ACPI
_PXM updates are at the bottom of the list :/.
Somewhere in the picture there needs to be a feedback loop that
encourages the vendor to fix the problem. I don't see that happening
yet. Having QAT fail because the platform didn't supply the
information required to make it work would be a nice loop. I don't
want to completely paper over the problem without providing some other
kind of feedback at the same time.
Okay -- I see what you're after here and I completely agree with it. But
sometimes I feel like I banging on a silent drum with some of these companies
about this stuff :( My frustration with these companies is starting to show I
guess...
Post by Bjorn Helgaas
You're probably aware of [1], which was the same problem. Apparently
it was originally reported to RedHat as [2] (which is private, so I
can't read it). That led to a workaround hack for some AMD systems
[3, 4].
Yeah ... part of me was thinking that maybe I should do something like
the above but I didn't know how you'd feel about expanding that hack. I'll look
into it. I'd prefer it to be opt-in with a kernel parameter.
I believe the point was to not consider expanding upon the AMD chip-set
specific driver, or some similar tactic, and to investigate how
'numactl' obtains such information in order to see if that is a possible
solution that could be used going forward. It's good to see that the
latest discussions with Alexander seem to suggest that is the direction
being taken now.


Pursuing something like the AMD scenario, where a chip-set specific
driver reads internal registers, is an untenable dead end. It leads to
a perpetual cycle in which the driver has to constantly be updated as
new platform chip-sets are introduced. To solve such situations,
architecture independent mechanisms are thought up, formalized, and put
in place - in this case ACPI's _PXM methods and/or SRAT, SLIT tables.

Chip-set specific drivers like 'amd_bus.c' perpetuate circumventing
architected solutions, more often than not introducing subsequent issues
themselves.

The previous NUMA info issue that Bjorn referenced is a prime example of
both: it "snoops" out proximity info from the chip-set when the BIOS has
not supplied ACPI _PXM method(s) (i.e. the circumvention) and in doing
such, had at least three sub-issues; all of which can be understood by
carefully reading the chain of events from the PCI list's thread [1].

As a result, the 'amd_bus.c' patch was pared down from a substantial
re-factoring and update effort to its bare minimum - at one point there
was even discussion about removing the driver completely as was done
with 'intel_bus.c' (note that the NUMA related proximity "snooping"
capability only exists for AMD based platforms, there is no equivalent
for Intel based platforms).

In that particular instance the end result was that the customer impact
was significant enough financially that it forced the vendor to updated
there BIOS to include the necessary _PXM methods.


Introducing kernel parameters is a similar scenario. Customers should
not need to know about, nor utilize, kernel parameters in their normal
day-to-day activities. Yes, such capabilities are invaluable for those
of us that work with the kernel but customers just should not be
expected to have to resort to such.


[1] https://lkml.org/lkml/2014/3/5/898


Myron
Post by Prarit Bhargava
P.
Prarit Bhargava
2014-10-17 11:59:56 UTC
Permalink
Post by Prarit Bhargava
Post by Bjorn Helgaas
Hi Prarit,
Post by Prarit Bhargava
Consider a multi-node, multiple pci root bridge system which can be
configured into one large node or one node/socket. When configuring the
system the numa_node value for each PCI root bridge is always set
incorrectly to -1, or NUMA_NO_NODE, rather than to the node value of each
socket. Each PCI device inherits the numa value directly from it's parent
device, so that the NUMA_NO_NODE value is passed through the entire PCI
tree.
Some new drivers, such as the Intel QAT driver, drivers/crypto/qat,
require that a specific node be assigned to the device in order to
achieve maximum performance for the device, and will fail to load if the
device has NUMA_NO_NODE.
It seems ... unfriendly for a driver to fail to load just because it
can't guarantee maximum performance. Out of curiosity, where does
this actually happen? I had a quick look for NUMA_NO_NODE and
module_init() functions in drivers/crypto/qat, and I didn't see the
spot.
The whole point of the Intel QAT driver is to guarantee max performance. If
that is not possible the driver should not load (according to the thread
mentioned below)
Post by Bjorn Helgaas
Post by Prarit Bhargava
The driver would load if the numa_node value
was equal to or greater than -1 and quickly hacking the driver results in
a functional QAT driver.
Using lspci and numactl it is easy to determine what the numa value should
be. The problem is that there is no way to set it. This patch adds a
store function for the PCI device's numa_node value.
I'm not familiar with numactl. It sounds like it can show you the
NUMA topology? Where does that information come from?
numactl simply determines the number of nodes on the system by traversing

/sys/devices/system/node

For example, the output of 'numactl --hardware'

available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 24 25 26 27 28 29 30 31 32 33 34 35
node 0 size: 32075 MB
node 0 free: 30862 MB
node 1 cpus: 12 13 14 15 16 17 18 19 20 21 22 23 36 37 38 39 40 41 42 43 44 45 46 47
node 1 size: 32239 MB
node 1 free: 31251 MB
node distances:
node 0 1
0: 10 21
1: 21 10

I then use lpsci to determine what pci root nodes there are, and knowing a
little bit about the system architecture (are the IOHs shared between sockets,
are they 1/socket, etc.) I can make an educated guess on what the numa_node
value should be.

I thought about hardcoding something similar in the kernel ... but there are
some significant issues (which Myron points out as well). The code would have
to be constantly maintained. That is not something we can do long term. Every
new processor & chipset would have to be added to the code to determine exactly
how the pci root bridge was connected. Even then, it is entirely possible (as
in the case with some cpu hotplug systems) that IOH-less sockets were populated
[1] which would severely hamper any assumptions in a calculation.

That also doesn't take into account systems where the manufacturer *wants*
NUMA_NO_NODE, and doesn't want per-node allocations.

Alexander has suggested that modifying my original patch to do (sorry for the
cut-and-paste)

static ssize_t numa_node_store(struct device *dev,
struct device_attribute *attr,
const char *buf, size_t count)
{
int node, ret;

if (!capable(CAP_SYS_ADMIN))
return -EPERM;

ret = kstrtoint(buf, 0, &node);
if (ret)
return ret;

if (!node_online(node))
return -EINVAL;

WARN_TAINT(1, TAINT_FIRMWARE_WORKAROUND,
FW_BUG "ACPI _PXM should have set this device's root bridge
numa_node to %d but set it to %d. Please contact your hardware vendor for
updates.",
node, dev->numa_node);

dev->numa_node = node;

return count;
}

*might* be amenable to everyone. I think it nicely handles Bjorn's concern
about being loud when overriding the _PXM entries ...

P.

[1] These sockets really aren't IOH-less. The IOH has been disabled in BIOS.
Jiang Liu
2014-10-19 11:35:15 UTC
Permalink
Post by Prarit Bhargava
Post by Prarit Bhargava
Post by Bjorn Helgaas
Hi Prarit,
Post by Prarit Bhargava
Consider a multi-node, multiple pci root bridge system which can be
configured into one large node or one node/socket. When configuring the
system the numa_node value for each PCI root bridge is always set
incorrectly to -1, or NUMA_NO_NODE, rather than to the node value of each
socket. Each PCI device inherits the numa value directly from it's parent
device, so that the NUMA_NO_NODE value is passed through the entire PCI
tree.
Some new drivers, such as the Intel QAT driver, drivers/crypto/qat,
require that a specific node be assigned to the device in order to
achieve maximum performance for the device, and will fail to load if the
device has NUMA_NO_NODE.
It seems ... unfriendly for a driver to fail to load just because it
can't guarantee maximum performance. Out of curiosity, where does
this actually happen? I had a quick look for NUMA_NO_NODE and
module_init() functions in drivers/crypto/qat, and I didn't see the
spot.
The whole point of the Intel QAT driver is to guarantee max performance. If
that is not possible the driver should not load (according to the thread
mentioned below)
Post by Bjorn Helgaas
Post by Prarit Bhargava
The driver would load if the numa_node value
was equal to or greater than -1 and quickly hacking the driver results in
a functional QAT driver.
Using lspci and numactl it is easy to determine what the numa value should
be. The problem is that there is no way to set it. This patch adds a
store function for the PCI device's numa_node value.
I'm not familiar with numactl. It sounds like it can show you the
NUMA topology? Where does that information come from?
numactl simply determines the number of nodes on the system by traversing
/sys/devices/system/node
For example, the output of 'numactl --hardware'
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 24 25 26 27 28 29 30 31 32 33 34 35
node 0 size: 32075 MB
node 0 free: 30862 MB
node 1 cpus: 12 13 14 15 16 17 18 19 20 21 22 23 36 37 38 39 40 41 42 43 44 45 46 47
node 1 size: 32239 MB
node 1 free: 31251 MB
node 0 1
0: 10 21
1: 21 10
I then use lpsci to determine what pci root nodes there are, and knowing a
little bit about the system architecture (are the IOHs shared between sockets,
are they 1/socket, etc.) I can make an educated guess on what the numa_node
value should be.
Hi Prarit,
It may be a little hard for normal users to figure out NUMA node
id from [segment:bus]. [segment:bus] for PCIe host bridge are under
control of BIOS/firmware. For example, one a four socket system, I have
observed one vender putting all four PCIe host brdiges into PCI segment
0, another vender assigning different segment for each PCIe host bridge.
And socket hotplug may also make the scenario more complex.
Actually I ran into the same issue when trying to optimize
interrupt vector allocation according to NUMA affinity. So it will
be great if we could find an acceptable solution here:)
Regard!
Gerry
Post by Prarit Bhargava
I thought about hardcoding something similar in the kernel ... but there are
some significant issues (which Myron points out as well). The code would have
to be constantly maintained. That is not something we can do long term. Every
new processor & chipset would have to be added to the code to determine exactly
how the pci root bridge was connected. Even then, it is entirely possible (as
in the case with some cpu hotplug systems) that IOH-less sockets were populated
[1] which would severely hamper any assumptions in a calculation.
That also doesn't take into account systems where the manufacturer *wants*
NUMA_NO_NODE, and doesn't want per-node allocations.
Alexander has suggested that modifying my original patch to do (sorry for the
cut-and-paste)
static ssize_t numa_node_store(struct device *dev,
struct device_attribute *attr,
const char *buf, size_t count)
{
int node, ret;
if (!capable(CAP_SYS_ADMIN))
return -EPERM;
ret = kstrtoint(buf, 0, &node);
if (ret)
return ret;
if (!node_online(node))
return -EINVAL;
WARN_TAINT(1, TAINT_FIRMWARE_WORKAROUND,
FW_BUG "ACPI _PXM should have set this device's root bridge
numa_node to %d but set it to %d. Please contact your hardware vendor for
updates.",
node, dev->numa_node);
dev->numa_node = node;
return count;
}
*might* be amenable to everyone. I think it nicely handles Bjorn's concern
about being loud when overriding the _PXM entries ...
P.
[1] These sockets really aren't IOH-less. The IOH has been disabled in BIOS.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Loading...