[PATCH] pci, add sysfs numa_node write function

Discussion:

Prarit Bhargava

2014-10-15 19:05:24 UTC

Consider a multi-node, multiple pci root bridge system which can be
configured into one large node or one node/socket. When configuring the
system the numa_node value for each PCI root bridge is always set
incorrectly to -1, or NUMA_NO_NODE, rather than to the node value of each
socket. Each PCI device inherits the numa value directly from it's parent
device, so that the NUMA_NO_NODE value is passed through the entire PCI
tree.

Some new drivers, such as the Intel QAT driver, drivers/crypto/qat,
require that a specific node be assigned to the device in order to
achieve maximum performance for the device, and will fail to load if the
device has NUMA_NO_NODE. The driver would load if the numa_node value
was equal to or greater than -1 and quickly hacking the driver results in
a functional QAT driver.

Using lspci and numactl it is easy to determine what the numa value should
be. The problem is that there is no way to set it. This patch adds a
store function for the PCI device's numa_node value.

To use this, one can do

echo 3 > /sys/devices/pci0000:ff/0000:ff:1f.3/numa_node

to set the numa node for PCI device 0000:ff:1f.3.

Cc: Myron Stowe <***@redhat.com>
Cc: Bjorn Helgaas <***@google.com>
Cc: linux-***@vger.kernel.org
Signed-off-by: Prarit Bhargava <***@redhat.com>
---
drivers/pci/pci-sysfs.c | 23 ++++++++++++++++++++++-
1 file changed, 22 insertions(+), 1 deletion(-)

diff --git a/drivers/pci/pci-sysfs.c b/drivers/pci/pci-sysfs.c
index 92b6d9a..c05ed30 100644
--- a/drivers/pci/pci-sysfs.c
+++ b/drivers/pci/pci-sysfs.c
@@ -221,12 +221,33 @@ static ssize_t enabled_show(struct device *dev, struct device_attribute *attr,
static DEVICE_ATTR_RW(enabled);

#ifdef CONFIG_NUMA
+static ssize_t numa_node_store(struct device *dev,
+ struct device_attribute *attr,
+ const char *buf, size_t count)
+{
+ int node, ret;
+
+ if (!capable(CAP_SYS_ADMIN))
+ return -EPERM;
+
+ ret = kstrtoint(buf, 0, &node);
+ if (ret)
+ return ret;
+
+ if (!node_online(node))
+ return -EINVAL;
+
+ dev->numa_node = node;
+
+ return count;
+}
+
static ssize_t numa_node_show(struct device *dev, struct device_attribute *attr,
char *buf)
{
return sprintf(buf, "%d\n", dev->numa_node);
}
-static DEVICE_ATTR_RO(numa_node);
+static DEVICE_ATTR_RW(numa_node);
#endif

static ssize_t dma_mask_bits_show(struct device *dev,

--
1.7.9.3

Bjorn Helgaas

2014-10-15 19:23:32 UTC

Permalink

Hi Prarit,

Post by Prarit Bhargava
Consider a multi-node, multiple pci root bridge system which can be
configured into one large node or one node/socket. When configuring the
system the numa_node value for each PCI root bridge is always set
incorrectly to -1, or NUMA_NO_NODE, rather than to the node value of each
socket. Each PCI device inherits the numa value directly from it's parent
device, so that the NUMA_NO_NODE value is passed through the entire PCI
tree.
Some new drivers, such as the Intel QAT driver, drivers/crypto/qat,
require that a specific node be assigned to the device in order to
achieve maximum performance for the device, and will fail to load if the
device has NUMA_NO_NODE.

It seems ... unfriendly for a driver to fail to load just because it
can't guarantee maximum performance. Out of curiosity, where does
this actually happen? I had a quick look for NUMA_NO_NODE and
module_init() functions in drivers/crypto/qat, and I didn't see the
spot.

Post by Prarit Bhargava
The driver would load if the numa_node value
was equal to or greater than -1 and quickly hacking the driver results in
a functional QAT driver.
Using lspci and numactl it is easy to determine what the numa value should
be. The problem is that there is no way to set it. This patch adds a
store function for the PCI device's numa_node value.

I'm not familiar with numactl. It sounds like it can show you the
NUMA topology? Where does that information come from?

Post by Prarit Bhargava
To use this, one can do
echo 3 > /sys/devices/pci0000:ff/0000:ff:1f.3/numa_node
to set the numa node for PCI device 0000:ff:1f.3.

It definitely seems wrong that we don't set the node number correctly.
pci_acpi_scan_root() sets the node number by looking for a _PXM method
that applies to the host bridge. Why does that not work in this case?
Does the BIOS not supply _PXM?

If there's information that numactl uses, maybe the kernel should use that, too?

A sysfs interface might be a useful workaround, but obviously it would
be far better if we could fix the BIOS and/or kernel so the workaround
isn't necessary in the first place.

Bjorn

Post by Prarit Bhargava
---
drivers/pci/pci-sysfs.c | 23 ++++++++++++++++++++++-
1 file changed, 22 insertions(+), 1 deletion(-)
diff --git a/drivers/pci/pci-sysfs.c b/drivers/pci/pci-sysfs.c
index 92b6d9a..c05ed30 100644
--- a/drivers/pci/pci-sysfs.c
+++ b/drivers/pci/pci-sysfs.c
@@ -221,12 +221,33 @@ static ssize_t enabled_show(struct device *dev, struct device_attribute *attr,
static DEVICE_ATTR_RW(enabled);
#ifdef CONFIG_NUMA
+static ssize_t numa_node_store(struct device *dev,
+ struct device_attribute *attr,
+ const char *buf, size_t count)
+{
+ int node, ret;
+
+ if (!capable(CAP_SYS_ADMIN))
+ return -EPERM;
+
+ ret = kstrtoint(buf, 0, &node);
+ if (ret)
+ return ret;
+
+ if (!node_online(node))
+ return -EINVAL;
+
+ dev->numa_node = node;
+
+ return count;
+}
+
static ssize_t numa_node_show(struct device *dev, struct device_attribute *attr,
char *buf)
{
return sprintf(buf, "%d\n", dev->numa_node);
}
-static DEVICE_ATTR_RO(numa_node);
+static DEVICE_ATTR_RW(numa_node);
#endif
static ssize_t dma_mask_bits_show(struct device *dev,
--
1.7.9.3

Prarit Bhargava

2014-10-15 19:47:59 UTC

Permalink

Post by Bjorn Helgaas
Hi Prarit,

The whole point of the Intel QAT driver is to guarantee max performance. If
that is not possible the driver should not load (according to the thread
mentioned below)

Post by Bjorn Helgaas

I'm not familiar with numactl. It sounds like it can show you the
NUMA topology? Where does that information come from?

You can map at least what nodes are available (although I suppose you can get
the same information from dmesg). You have to do a bit of hunting through the
PCI tree to determine the root PCI devices, but you can determine which root
device is connected to which node.

Post by Bjorn Helgaas

Post by Prarit Bhargava
To use this, one can do
echo 3 > /sys/devices/pci0000:ff/0000:ff:1f.3/numa_node
to set the numa node for PCI device 0000:ff:1f.3.

Yeah ... unfortunately the BIOS is broken in this case. And I know what you're
thinking ;) -- why not get the BIOS fixed? I'm through relying on BIOS fixes
which can take six months to a year to appear in a production version... I've
been bitten too many times by promises of BIOS fixes that never materialize.

We have systems that only have a support cycle of 3 years, and things like ACPI
_PXM updates are at the bottom of the list :/.

FWIW, on this particular system I have a filed a bug with the vendor.

Post by Bjorn Helgaas
If there's information that numactl uses, maybe the kernel should use that, too?
A sysfs interface might be a useful workaround, but obviously it would
be far better if we could fix the BIOS and/or kernel so the workaround
isn't necessary in the first place.

Yep ... but like I said, I don't think anyone wants to wait a year. What if we
never see a fix?

Side issue: While investigating this I noticed that plain kmalloc() is used in
the setup code. Is there a reason we don't use kmalloc_node() in
pci_alloc_dev(), and other allocation functions? It seems like we should be to
optimize system performance. OTOH ... I haven't done any measurements to see if
it actually makes a difference :)

P.

Post by Bjorn Helgaas
Bjorn

Prarit Bhargava

2014-10-15 19:51:28 UTC