From ipoverib-bounces@ietf.org Tue Nov 16 16:19:29 2004
From: "Hal Rosenstock"
To: "IPoverIB"
Date: Tue, 16 Nov 2004 16:11:25 -0500
Subject: [Ipoverib] A Couple of IPoIB Questions

Hi,

I have a couple of questions relative to IPoIB:

1. draft-ietf-ipoib-ip-over-infiniband-07.txt states:
"Every IPoIB interface MUST "FullMember" join the IB multicast group defined by the broadcast-GID."

Isn't the broadcast group for IPv4? When the IPoIB interface is IPv6 only, does this group still need to be joined?
If not, where do the parameters for any IPv6 groups come from? I am presuming that this group needs to be joined in
the IPv6-only case. I just want to be sure.

2. Also, what is the latest status of Vivek's connected mode draft? Will it be moving forward?

Thanks.

-- Hal

From ipoverib-bounces@ietf.org Tue Nov 16 17:38:17 2004
From: Kanoj Sarcar
To: Hal Rosenstock
Cc: IPoverIB
Date: Tue, 16 Nov 2004 13:33:46 -0800
Subject: Re: [Ipoverib] A Couple of IPoIB Questions

Hal Rosenstock wrote:
> Hi,

Hi,

> I have a couple of questions relative to IPoIB:
>
> 1. draft-ietf-ipoib-ip-over-infiniband-07.txt states:
> "Every IPoIB interface MUST "FullMember" join the IB multicast group
> defined by the broadcast-GID."
>
> Isn't the broadcast group for IPv4? When the IPoIB interface is IPv6
> only, does this group still need to be joined?
> If not, where do the parameters for any IPv6 groups come from? I am
> presuming that this group needs to be joined in
> the IPv6-only case. I just want to be sure.

Previously on the WG, we went through a discussion on this, and the
consensus was that all interfaces (irrespective of IPv4 only, IPv6 only,
or IPv4 and IPv6) MUST join the broadcast-GID and obtain parameters for
all IPv4 and IPv6 groups from this one single broadcast-GID.

We further discussed changing the signature part of the address of the
broadcast group to reflect that it was IPv4 and IPv6 agnostic, but
maintained the IPv4 signature to make it easier for current
implementations to make any required changes to adapt to this rule.

> 2. Also, what is the latest status of Vivek's connected mode draft?
> Will it be moving forward?

Thanks.

Kanoj
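
A minimal sketch, in Python, of the idea Kanoj describes: reuse the flags/scope and P_Key carried in the broadcast-GID when building the MGID for an IP multicast group. The 16-byte MGID layout and the 0x401B/0x601B signature values are assumptions taken from the transmission draft rather than from this thread, and the remaining join parameters (Q_Key, MTU, rate, and so on) would come from the broadcast group's MCMemberRecord, which is not modeled here.

# Illustrative sketch only: reuse the flags/scope byte and the P_Key carried
# in the broadcast-GID when forming the MGID for an IPv6 multicast group.
# Layout and signature values are assumptions based on the transmission draft.

IPV6_SIGNATURE = 0x601B   # assumed IPv6 signature (0x401B would be IPv4)

def mgid_for_ipv6_group(broadcast_gid: bytes, ipv6_mcast: bytes) -> bytes:
    """Build a 16-byte IB MGID for an IPv6 multicast address, copying the
    flags/scope byte and the P_Key from the already-joined broadcast-GID."""
    assert len(broadcast_gid) == 16 and len(ipv6_mcast) == 16
    flags_scope = broadcast_gid[1]    # same flags/scope as the broadcast group
    pkey = broadcast_gid[4:6]         # partition key copied verbatim
    group_id = ipv6_mcast[-10:]       # low-order 80 bits of the IPv6 group ID (assumed)
    return bytes([0xFF, flags_scope]) + IPV6_SIGNATURE.to_bytes(2, "big") + pkey + group_id

# Hypothetical example: broadcast-GID ff12:401b:ffff::ffff:ffff and ff02::1.
bcast = bytes.fromhex("ff12401bffff000000000000ffffffff")
all_nodes = bytes.fromhex("ff020000000000000000000000000001")
print(mgid_for_ipv6_group(bcast, all_nodes).hex())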

From ipoverib-bounces@ietf.org Tue Nov 16 17:55:27 2004
From: Vivek Kashyap
To: Hal Rosenstock
Cc: IPoverIB
Date: Tue, 16 Nov 2004 14:31:27 -0800
Subject: Re: [Ipoverib] A Couple of IPoIB Questions

See below in <VK>

Vivek
--
Vivek Kashyap
Linux Technology Center, IBM
vivk@us.ibm.com
kashyapv@us.ibm.com
Ph: 503 578 3422 T/L: 775 3422



"Hal Rosenstock" <hnrose@earthlink.net>
Sent by: ipoverib-bounces@ietf.org

11/16/2004 01:11 PM
Please respond to Hal Rosenstock

       
        To:        "IPoverIB" <ipoverib@ietf.org>
        cc:        
        Subject:        [Ipoverib] A Couple of IPoIB Questions



Hi,
 
I have a couple of questions relative to IPoIB:
 
1. draft-ietf-ipoib-ip-over-infiniband-07.txt states:
"Every IPoIB interface MUST "FullMember" join the IB multicast group defined by the broadcast-GID."
 
Isn't the broadcast group for IPv4? When the IPoIB interface is IPv6 only, does this group still need to be joined?
If not, where do the parameters for any IPv6 groups come from? I am presuming that this group needs to be joined in
the IPv6-only case. I just want to be sure.
 
<VK> Yes, the broadcast-GID is at the InfiniBand layer and MUST be joined whether you are running at v4 or v6 layer. <VK>

2. Also, what is the latest status of Vivek's connected mode draft? Will it be moving forward?

<VK> I'll be submitting it as draft-ietf-ipoib-connected-mode-00.txt by the end of the month. There were some interesting suggestions made during the IETF WG meeting. Two of the suggestions of consequence are given below. The others we can discuss when the minutes are published (they include some additional requests for clarification on the transmission draft too).

a. The current draft makes the various modes mutually exclusive i.e. RC, UC and UD are not allowed simultaneously in the same IP subnet. The thought is that it is a link characteristic and hence different per connection mode. It was suggested that one be allowed to mix up RC/UC. This goes back to the original suggestion in the first draft which was:

IPoIB-UD must always be supported. Additionally, the interface can also support either both of RC and UC, or one of them. Or neither of them.
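
As a minimal sketch of the combination rule just stated (the class and names below are hypothetical, not from the draft): UD is unconditional, while RC and UC are each independently optional.

# Hypothetical sketch of the rule above: every interface supports UD; RC and
# UC are optional in any combination.  Names are illustrative only.

from dataclasses import dataclass

@dataclass(frozen=True)
class IPoIBCapabilities:
    rc: bool = False
    uc: bool = False

    def modes(self) -> set:
        supported = {"UD"}     # mandatory for every IPoIB interface
        if self.rc:
            supported.add("RC")
        if self.uc:
            supported.add("UC")
        return supported

# The four allowed combinations: UD only, UD+RC, UD+UC, UD+RC+UC.
for caps in (IPoIBCapabilities(), IPoIBCapabilities(rc=True),
             IPoIBCapabilities(uc=True), IPoIBCapabilities(rc=True, uc=True)):
    print(sorted(caps.modes()))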

b. Another suggestion was to allow multiple connected mode links (i.e. at IB UC/RC level) between peers.

One thought can be 'yes, but user beware': The IB connections are made using the service ID that is derived from the QPN as described in the draft. If a second attempt succeeds then there are two links. It is up to the implementation to either allow or disallow multiple links.

Thoughts?

<VK>


 
Thanks.
 
-- Hal


From ipoverib-bounces@ietf.org Tue Nov 16 23:15:59 2004
From: Michael Krause
To: Vivek Kashyap, Hal Rosenstock
Cc: IPoverIB
Date: Tue, 16 Nov 2004 20:08:25 -0800
Subject: Re: [Ipoverib] A Couple of IPoIB Questions

At 02:31 PM 11/16/2004, Vivek Kashyap wrote:

See below in <VK>

Vivek
--
Vivek Kashyap
Linux Technology Center, IBM
vivk@us.ibm.com
kashyapv@us.ibm.com
Ph: 503 578 3422 T/L: 775 3422



"Hal Rosenstock" <hnrose@earthlink.net>
Sent by: ipoverib-bounces@ietf.org

11/16/2004 01:11 PM
Please respond to Hal Rosenstock
       
        To:        "IPoverIB" <ipoverib@ietf.org>
        cc:       
        Subject:        [Ipoverib] A Couple of IPoIB Questions



Hi,
 
I have a couple of questions relative to IPoIB:
 
1. draft-ietf-ipoib-ip-over-infiniband-07.txt states:
"Every IPoIB interface MUST "FullMember" join the IB multicast group defined by the broadcast-GID."
 
Isn't the broadcast group for IPv4? When the IPoIB interface is IPv6 only, does this group still need to be joined?
If not, where do the parameters for any IPv6 groups come from? I am presuming that this group needs to be joined in
the IPv6-only case. I just want to be sure.
 
<VK> Yes, the broadcast-GID is at the InfiniBand layer and MUST be joined whether you are running at v4 or v6 layer. <VK>

2. Also, what is the latest status of Vivek's connected mode draft? Will it be moving forward?

<VK> I'll be submitting it as draft-ietf-ipoib-connected-mode-00.txt by the end of the month. There were some interesting suggestions that were made during the IETF WG meeting. Two of the suggestions of consequence are given below. The others we can discuss when the minutes are published (they include some additional requests on clarification on the transmission draft too).

a. The current draft makes the various modes mutually exclusive i.e. RC, UC and UD are not allowed simultaneously in the same IP subnet. The thought is that it is a link characteristic and hence different per connection mode. It was suggested that one be allowed to mix up RC/UC. This goes back to the original suggestion in the first draft which was:

IPoIB-UD must always be supported. Additionally, the interface can also support either both of RC and UC, or one of them. Or neither of them.

UD MUST always be supported.  I personally don't care whether one does RC or UC but I don't think both are required as a MAY option.  The advantage of RC is the send credit algorithm.  The advantage of UC is the lack of ACK packets.  ACK is noise in the fabric while send credits provide a simple method to maintain bandwidth / injection control on a per flow basis.

I see no problems with supporting both UD and *C on the same subnet; it is rather foolish to attempt to mandate these be on separate subnets.

b. Another suggestion was to allow multiple connected mode links (i.e. at IB UC/RC level) between peers.

One thought can be 'yes, but user beware': The IB connections are made using the service ID that is derived from the QPN as described in the draft. If a second attempt succeeds then there are two links. It is up to the implementation to either allow or disallow multiple links.

Again, this has been suggested in the past (though most who were involved in the original discussions years ago are likely gone, since much of this discussion occurred before the IETF workgroup was established).  There is obvious benefit to supporting multiple RC per endnode pair.  I do not see any technical reason to oppose, nor any issue from an interoperability perspective.  There is no reason for a "user beware".  The work is rather straightforward to do and implement, and the benefit to customers is, again, rather obvious when one considers what the IB fabric offers and how connections can enable flows through multipath as well as transparent fail-over, flow scheduling, mapping of DiffServ to different arbitration / paths, etc.

Mike

From ipoverib-bounces@ietf.org Wed Nov 17 02:48:46 2004
From: Vivek Kashyap
To: Michael Krause
Cc: Hal Rosenstock, IPoverIB
Date: Tue, 16 Nov 2004 23:38:29 -0800
Subject: Re: [Ipoverib] A Couple of IPoIB Questions



      Hi,


      I have a couple of questions relative to IPoIB:


      1. draft-ietf-ipoib-ip-over-infiniband-07.txt states:

      "Every IPoIB interface MUST "FullMember" join the IB mul= ticast group defined by the broadcast-GID."

      Isn't the broadcast group for IPv4 ? When the IPoIB interface is IPv6 o= nly, does this group still need be joined ?

      If not, where do the parameters for any IPv6 groups come from ? I am pr= esuming that this group needs to be joined in

      the IPv6 only case. I just want to be su= re.

      <VK> Yes, the broadcast-GID is at the InfiniBand layer and MUST b= e joined whether you are running at v4 or v6 layer. <VK>

      2. ALso, what is the latest status of the Vivek's connected mode draft = ? Will it be moving forward ?


      <VK> I'll be submitting it as draft-ietf-ipoib-connected-mode-00.txt by the end of the month. There were some interesting suggestions made during the IETF WG meeting. Two of the suggestions of consequence are given below. The others we can discuss when the minutes are published (they include some additional requests for clarification on the transmission draft too).


      a. The current draft makes the various modes mutually exclusive i.e. RC, UC and UD are not allowed simultaneously in the same IP subnet. The thought is that it is a link characteristic and hence different per connection mode. It was suggested that one be allowed to mix up RC/UC. This goes back to the original suggestion in the first draft which was:

      IPoIB-UD must always be supported. Additionally, the interface can also support either both of RC and UC, or one of them. Or neither of them.


UD MUST always be supported.


<VK> That is and has always been the requirement right from the first draft. <VK>

I personally don't care whether one does RC or UC but I don't think both are required as a MAY option. The advantage of RC is the send credit algorithm. The advantage of UC is the lack of ACK packets. ACK is noise in the fabric while send credits provide a simple method to maintain bandwidth / injection control on a per flow basis.

I see no problems with supporting both UD and *C on the same subnet; it is rather foolish to attempt to mandate these be on separate subnets.
      <VK> As per the connected-mode draft the UD mechanism is *always* required; address resolution depends on it.

      The only point of discussion is whether all nodes must support the same link characteristics in the subnet, i.e. all are RC (and UD), or all are UC (and UD), or all are UD only. The alternative is to allow the nodes to be mixed, with some nodes being RC/UD, others UC/UD, a third set UD only, and yet others probably supporting all, within the same IP subnet. [Can the same serviceID be used by both RC and UC?]

      The third alternative is to associate UD only, or UD + one of RC or UC, with the same interface. In such a case, if mismatched/unsupported connected modes are supported by two nodes then they fall back to UD. This option is not too different from the UD QP + RC or UC mechanism.

      <VK>
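
To make the fallback rule above concrete, a small sketch (a hypothetical helper, not text from the draft): a connected mode is used only when both peers support the same one, otherwise communication stays on UD. The RC-before-UC preference is an arbitrary local policy assumed for the example.

# Hypothetical sketch of the fallback discussed above: pick a connected mode
# only if both sides support the same one; otherwise fall back to UD.

def choose_mode(local_modes: set, remote_modes: set) -> str:
    """Both sets always contain 'UD'; 'RC' and 'UC' are optional."""
    for mode in ("RC", "UC"):                     # assumed local preference order
        if mode in local_modes and mode in remote_modes:
            return mode
    return "UD"                                   # mismatch or UD-only peer

print(choose_mode({"UD", "RC"}, {"UD", "RC", "UC"}))   # -> RC
print(choose_mode({"UD", "RC"}, {"UD", "UC"}))         # -> UD (fallback)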

        b. Another suggestion was to allow multiple connected mode links (i.e. at IB UC/RC level) between peers.

        One thought can be 'yes, but user beware': The IB connections are made using the service ID that is derived from the QPN as described in the draft. If a second attempt succeeds then there are two links. It is up to the implementation to either allow or disallow multiple links.

    Again, this has been suggested in the past (though most who were involved in the original discussions years ago are likely gone since much of this discussion occurred before the IETF workgroup was established).


    <VK> I'm one of the vestiges of those early times along with you and a few others...so we have hope :). <VK>

    There is obvious benefit to supporting multiple RC per endnode pair. I do not see any technical reason to oppose nor any issue from an interoperability perspective. There is no reason for a "user beware".

    <VK> It is not opposed. The 'user beware' is only underscoring that the peer interface might not support multiple links - it might enforce a limited number of connections (maybe only one) between a pair of GIDs. Similarly, an implementation not wanting to support multiple links MUST take steps to deny multiple requests.

    <VK>

    The work is rather straightforward to do and implement, and the benefit to customers is, again, rather obvious when one considers what the IB fabric offers and how connections can enable flows through multipath as well as transparent fail-over, flow scheduling, mapping of DiffServ to different arbitration / paths, etc.

    <VK> In addition, Large MTU and APM are two of the main reasons why I've been proposing IPoIB-connected mode for so long. In terms of IPoIB itself, except for the Large MTU, the parameters are hidden from it. <VK>

    Mike

From ipoverib-bounces@ietf.org Wed Nov 17 20:48:13 2004
From: Michael Krause
To: Vivek Kashyap
Cc: IPoverIB
Date: Wed, 17 Nov 2004 16:46:52 -0800
Subject: Re: [Ipoverib] A Couple of IPoIB Questions

At 11:38 PM 11/16/2004, Vivek Kashyap wrote:



        Hi, I have a couple of questions relative to IPoIB:
        1. draft-ietf-ipoib-ip-over-infiniband-07.txt states: "Every IPoIB interface MUST "FullMember" join the IB multicast group defined by the broadcast-GID."
        Isn't the broadcast group for IPv4? When the IPoIB interface is IPv6 only, does this group still need to be joined? If not, where do the parameters for any IPv6 groups come from? I am presuming that this group needs to be joined in the IPv6-only case. I just want to be sure.
        <VK> Yes, the broadcast-GID is at the InfiniBand layer and MUST be joined whether you are running at v4 or v6 layer. <VK>
        2. Also, what is the latest status of Vivek's connected mode draft? Will it be moving forward?
        <VK> I'll be submitting it as draft-ietf-ipoib-connected-mode-00.txt by the end of the month. There were some interesting suggestions made during the IETF WG meeting. Two of the suggestions of consequence are given below. The others we can discuss when the minutes are published (they include some additional requests for clarification on the transmission draft too).
        a. The current draft makes the various modes mutually exclusive i.e. RC, UC and UD are not allowed simultaneously in the same IP subnet. The thought is that it is a link characteristic and hence different per connection mode. It was suggested that one be allowed to mix up RC/UC. This goes back to the original suggestion in the first draft which was: IPoIB-UD must always be supported. Additionally, the interface can also support either both of RC and UC, or one of them. Or neither of them.

    UD MUST always be supported.


    <VK> That is and has always been the requirement right from the first draft. <VK>

    I personally don't care whether one does RC or UC but I don't think both are required as a MAY option. The advantage of RC is the send credit algorithm. The advantage of UC is the lack of ACK packets. ACK is noise in the fabric while send credits provide a simple method to maintain bandwidth / injection control on a per flow basis.

    I see no problems with supporting both UD and *C on the same subnet; it is rather foolish to attempt to mandate these be on separate subnets.
      <VK> As per the connected-mode draft the UD mechanism is *always* required; address resolution depends on it.

      The only point of discussion is whether all nodes must support the same link characteristics in the subnet, i.e. all are RC (and UD), or all are UC (and UD), or all are UD only.

    Obviously I would oppose such a solution as it creates artificial constraints with little benefit.

      The alternative is to allow the nodes to be mixed, with some nodes being RC/UD, others UC/UD, a third set UD only, and yet others probably supporting all, within the same IP subnet. [Can the same serviceID be used by both RC and UC?]

      The third alternative is to associate UD only, or UD + one of RC or UC, with the same interface. In such a case, if mismatched/unsupported connected modes are supported by two nodes then they fall back to UD. This option is not too different from the UD QP + RC or UC mechanism.

    KISS:

    - UD universal
    - *C opportunistic
            - Local management issue to control what is sent on the *C interface.  No need to specify
            - Advertise whether one or more ports are supported by UD or *C
            - Advertise whether one or more QP are supported by UD or *C
            - Let local management determine policy for what services are mapped where - no need to specify

    This is both an interoperable approach and simple to implement.  There may be some desire to add a policy interface to state preference for specific types of traffic over a given QP.  I would not oppose this but would view this as a separate draft once the basics are worked out.



      <VK>
        b. Another suggestion was to allow multiple connected mode links (i.e. at IB UC/RC level) between peers.
        One thought can be 'yes, but user beware': The IB connections are made using the service ID that is derived from the QPN as described in the draft. If a second attempt succeeds then there are two links. It is up to the implementation to either allow or disallow multiple links.

    Again, this has been suggested in the past (though most who were involved in the original discussions years gone by are likely gone since much of this discussion occurred before the IETF workgroup was established).


    <VK> I'm one of the vestiges of those early times along with you and a few others...so we have hope :). <VK>

    There is obvious benefit to supporting multiple RC per endnode pair. I do not see any technical reason to oppose nor any issue from an interoperability perspective. There is no reason for a "user beware".

    <VK> It is not opposed. The 'user beware' is only underscoring that the peer interface might not support multiple links - it might enforce a limited number of connections (maybe only one) between a pair of GIDs. Similarly, an implementation not wanting to support multiple links MUST take steps to deny multiple requests.

    *C requires CM to operate thus it is a local issue whether additional CM operations are accepted or not.  A given requester node may issue N and a given responder may state 0-N as an implementation may limit the number of *C available for IP traffic.
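
A rough sketch of the local-policy point above (the names and the limit are hypothetical; nothing here is mandated by the draft): the responder simply counts established connected-mode links per remote GID and rejects CM requests beyond its own limit, while a requester is free to attempt more.

# Hypothetical responder-side sketch: accept or reject incoming connected-mode
# (RC/UC) requests based on a purely local per-peer connection limit.

from collections import defaultdict

class ConnectionPolicy:
    def __init__(self, max_links_per_peer: int = 1):
        self.max_links_per_peer = max_links_per_peer
        self.active = defaultdict(int)        # remote GID -> established links

    def on_cm_request(self, remote_gid: bytes) -> bool:
        """Return True to accept the CM REQ, False to reject it."""
        if self.active[remote_gid] >= self.max_links_per_peer:
            return False                      # local policy: quota reached
        self.active[remote_gid] += 1
        return True

    def on_disconnect(self, remote_gid: bytes) -> None:
        self.active[remote_gid] = max(0, self.active[remote_gid] - 1)

# A requester may issue several REQs; this responder accepts at most one.
policy = ConnectionPolicy(max_links_per_peer=1)
peer = bytes.fromhex("fe800000000000000002c90200004711")   # hypothetical GID
print(policy.on_cm_request(peer))   # True  - first link accepted
print(policy.on_cm_request(peer))   # False - further links denied locally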


    <VK>

    The work is rather straightforward to do and implement, and the benefit to customers is, again, rather obvious when one considers what the IB fabric offers and how connections can enable flows through multipath as well as transparent fail-over, flow scheduling, mapping of DiffServ to different arbitration / paths, etc.

    <VK> In addition, Large MTU and APM are two of the main reasons why I've been proposing IPoIB-connected mode for so long. In terms of IPoIB itself, except for the Large MTU, the parameters are hidden from it. <VK>

    Mike

From ipoverib-bounces@ietf.org Thu Nov 18 01:58:22 2004
From: Vivek Kashyap
To: Michael Krause
Cc: IPoverIB
Date: Wed, 17 Nov 2004 22:46:49 -0800
Subject: Re: [Ipoverib] A Couple of IPoIB Questions

Mike, the format is really off in the last mail from you, making it difficult
to follow. Other than that, let us discuss in the context of the draft.

The draft is built upon the following:

1. IPoIB-RC and IPoIB-UC are optional.

2. IPoIB connected mode depends on a UD QP for address resolution and
   multicast.

As far as I know, there has been agreement on these since the earliest
connected mode draft I posted. I'd like the WG to give input on the
following issues:

3. Where does the UD QP come from? Choose one of:
   a. It is a UD QP that is associated with the interface at startup.
   b. It is a UD QP that is shared with IPoIB-UD.

   3a is more generic; it can be considered to include case 3b. The
   original proposal was limited to 3b.
4. Link characteristics

   The broadcast domain for IPoIB-RC/UC is determined exactly as in the
   IPoIB-UD case, i.e. through the broadcast-GID. A UD QP as per 3 is used
   in this step.

   Do all interfaces in IPoIB connected mode (CM) have the same link
   characteristics? i.e.

   a. All are either IPoIB-RC or IPoIB-UC.
      -- There is also a UD QP associated. The UD QP will be either 3a or 3b
         based on WG consensus.
      -- All unicast transmission is in the connected mode, i.e. RC or UC.

   b. All are IPoIB-UD. Additionally they can be one of IPoIB-RC or IPoIB-UC,
      or both.
      -- The presence of the flags indicates the type of communication
         possible.
      -- The decision to communicate using a specific mode is determined by
         the supported modes and the local policy. Note that incompatible
         policies imply that the fallback is communication over UD.
      -- The fallback mode of communication is UD.

   4b adds a lot of flexibility at the expense of a simple decision. 4a, by
   contrast, is straightforward.

5. MTU negotiation

   In the private data field of the CM message the desired MTU is included.
   It was suggested during the IPoIB meeting at IETF that it need not be
   symmetric. That is a good idea. Thus each peer declares the max MTU it
   prefers (a sketch follows this message):

   REQ:
   REP:
   RTU:

6. Multiple connections for the same IP address

   Local decision. Note that the peer might choose to not honour multiple
   connections.

Vivek

__
Vivek Kashyap
Linux Technology Center, IBM
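
The sketch referred to in item 5 above: each peer places the largest MTU it is prepared to receive in the CM private data (REQ for the active side, REP for the passive side), so the MTU can differ per direction. The 4-byte big-endian field used here is a purely hypothetical encoding; the connected-mode draft will define the actual private-data format.

# Hypothetical sketch of item 5: asymmetric MTU declaration via CM private
# data.  The 4-byte big-endian field is an assumed encoding, not the draft's.

import struct

def encode_private_data(preferred_recv_mtu: int) -> bytes:
    return struct.pack(">I", preferred_recv_mtu)

def decode_private_data(data: bytes) -> int:
    return struct.unpack(">I", data[:4])[0]

req = encode_private_data(65520)     # active side: "I can receive up to 65520"
rep = encode_private_data(16384)     # passive side prefers a smaller receive MTU

active_send_mtu = decode_private_data(rep)    # active side sends at most 16384
passive_send_mtu = decode_private_data(req)   # passive side sends at most 65520
print(active_send_mtu, passive_send_mtu)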
> > > >I see no problems with supporting both UD and *C on the same subnet; it is > >rather foolish to attempt to mandate these be on separate subnets.b > > As per the connected-mode draft the UD mechanism is *always* > >required; address resolutoin depends on it. > > > >The only point of discussion is whether all nodes must support the same > >link characteristics in the subnet i.e. all are RC (and UD), or all or UC > >(and UD), or all are UD only. > > Obviously I would oppose such a solution as it creates artificial > constraints with little benefit. > > >The alternative is to allow all the nodes to be mixed up with some nodes > >being RC/UD, others UC/UD and a third set UD only and yet others probably > >supporting all. within the same IP subnet. [Can the same serviceID be used > >by both RC and UC ?] > > > >The third alternative is to associating UD only or UD + one of RC or UC on > >the same interface. In such a case if mismatched/unsupported connected > >modes are supported by two nodes then the fall back to UD. This option is > >not too different from UD QP + RC or UC mechanism. > > KISS: > > - UD universal > - *C opportunistic > - Local management issue to control what is sent on the *C > interface. No need to specify > - Advertise whether one or more ports are supported by UD or *C > - Advertise whether one or more QP are supported by UD or *C > - Let local management determine policy for what services are > mapped where - no need to specify > > This is both an interoperable approach and simple to implement. There may > be some desire to add a policy interface to state preference for specific > types of traffic over a given QP. I would not oppose this but would view > this as a separate draft once the basics are worked out. > > > > > > >b. Another suggestion was to allow multiple connected mode links (i.e. at > >IB UC/RC level) between peers. One thought can be 'yes, but user beware': > >The IB connections are made using the service ID that is derived from the > >QPN as described in the draft. If a second attempt succeeds then there are > >two links. It is up to the implementation to either allow or disallow > >multiple links. > > > >Again, this has been suggested in the past (though most who were involved > >in the original discussions years gone by are likely gone since much of > >this discussion occurred before the IETF workgroup was established). > > > > I'm one of the vestiges of those early times along with you and a few > >others...so we have hope :). > > > >There is obvious benefit to supporting multiple RC per endnode pair. I do > >not see any technical reason to oppose nor any issue from an > >interoperability perspective. There is no reason for a "user beware". > > > > It is not opposed. The 'user beware' is only underscoring that the > >the peer interface might not support multiple links- it might enforce a > >limited number of connections (maybe only one) between a pair of GIDs. > >Similarly, an implementation not wanting to support multiple links MUST > >take steps to deny multiple requests. > > *C requires CM to operate thus it is a local issue whether additional CM > operations are accepted or not. A given requester node may issue N and a > given responder may state 0-N as an implementation may limit the number of > *C available for IP traffic. 
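As a concrete illustration of the point above that accepting additional connected-mode QPs is purely a local matter, a responder could simply cap the number of RC/UC connections it accepts per remote GID and reject anything beyond that, leaving the peer on the UD QP. The C sketch below is only one possible policy, not anything specified by the draft; the names (ipoib_cm_peer, IPOIB_CM_MAX_CONN_PER_GID, ipoib_cm_may_accept) and the limit of one connection are hypothetical.

    /* Hypothetical responder-side policy: accept at most N connected-mode
     * (RC/UC) QPs per remote GID and reject the rest, so the requester
     * keeps using the UD QP for that neighbour.  Illustrative only.
     */
    #include <stdint.h>
    #include <string.h>

    #define IPOIB_CM_MAX_CONN_PER_GID 1   /* local policy: 0..N connections */

    struct ipoib_cm_peer {
            uint8_t      gid[16];     /* remote port GID                    */
            unsigned int nr_conns;    /* established RC/UC QPs to that GID  */
    };

    /* Return nonzero if a new CM REQ from 'gid' should be accepted. */
    static int ipoib_cm_may_accept(const struct ipoib_cm_peer *peers,
                                   unsigned int nr_peers,
                                   const uint8_t gid[16])
    {
            unsigned int i;

            for (i = 0; i < nr_peers; i++)
                    if (memcmp(peers[i].gid, gid, 16) == 0)
                            return peers[i].nr_conns < IPOIB_CM_MAX_CONN_PER_GID;

            return 1;   /* unknown peer: first connection is allowed */
    }

A requester whose extra REQ is turned down simply keeps using the UD QP for that neighbour, which is the fallback assumed throughout this thread.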
> > > > > > > >The work is rather straight to do and implement and the benefit to > >customers, is again, rather obvious when one considers what the IB fabric > >offers and how connections can be enable flows through multipath as well > >as transparent fail-over, flow scheduling, mapping of DiffServ to > >different arbitration / paths, etc. > > > > In addition Large MTU and APM are two of the main reasons why I've > >been proposing IPoIB-connected mode for so long. In terms of IPoIB itself, > >except for the Large MTU, the parameters are hidden from it. > > Mike __ Vivek Kashyap Linux Technology Center, IBM _______________________________________________ IPoverIB mailing list IPoverIB@ietf.org https://www1.ietf.org/mailman/listinfo/ipoverib From ipoverib-bounces@ietf.org Thu Nov 18 10:06:13 2004 Received: from megatron.ietf.org (megatron.ietf.org [132.151.6.71]) by ietf.org (8.9.1a/8.9.1a) with ESMTP id KAA01001 for ; Thu, 18 Nov 2004 10:06:13 -0500 (EST) Received: from localhost.localdomain ([127.0.0.1] helo=megatron.ietf.org) by megatron.ietf.org with esmtp (Exim 4.32) id 1CUno1-0000mi-Es; Thu, 18 Nov 2004 10:02:41 -0500 Received: from odin.ietf.org ([132.151.1.176] helo=ietf.org) by megatron.ietf.org with esmtp (Exim 4.32) id 1CUniY-0006NC-S2 for ipoverib@megatron.ietf.org; Thu, 18 Nov 2004 09:57:02 -0500 Received: from ietf-mx.ietf.org (ietf-mx.ietf.org [132.151.6.1]) by ietf.org (8.9.1a/8.9.1a) with ESMTP id JAA00093 for ; Thu, 18 Nov 2004 09:57:00 -0500 (EST) Received: from umhlanga.stratnet.net ([12.162.17.40]) by ietf-mx.ietf.org with esmtp (Exim 4.33) id 1CUnlC-00036Q-Aw for ipoverib@ietf.org; Thu, 18 Nov 2004 09:59:46 -0500 Received: from exch-1.topspincom.com ([12.162.17.3]) by umhlanga.STRATNET.NET with Microsoft SMTPSVC(5.0.2195.5329); Thu, 18 Nov 2004 06:57:00 -0800 Received: from eddore ([10.10.253.169]) by exch-1.topspincom.com with Microsoft SMTPSVC(5.0.2195.5329); Thu, 18 Nov 2004 06:57:00 -0800 Received: from roland by eddore with local (Exim 4.34) id 1CUniQ-00075r-OC; Thu, 18 Nov 2004 06:57:00 -0800 To: Vivek Kashyap X-Message-Flag: Warning: May contain useful information References: From: Roland Dreier Date: Thu, 18 Nov 2004 06:56:54 -0800 In-Reply-To: (Vivek Kashyap's message of "Wed, 17 Nov 2004 22:46:49 -0800 (Pacific Standard Time)") Message-ID: <52oehvmci1.fsf@topspin.com> User-Agent: Gnus/5.1006 (Gnus v5.10.6) XEmacs/21.4 (Security Through Obscurity, linux) MIME-Version: 1.0 X-SA-Exim-Connect-IP: X-SA-Exim-Mail-From: roland@topspin.com Subject: Re: [Ipoverib] A Couple of IPoIB Questions Content-Type: text/plain; charset=us-ascii X-Spam-Checker-Version: SpamAssassin 2.64 (2004-01-11) on eddore X-Spam-Status: No, hits=0.1 required=5.0 tests=AWL autolearn=ham version=2.64 X-SA-Exim-Version: 4.1 (built Tue, 17 Aug 2004 11:06:07 +0200) X-SA-Exim-Scanned: Yes (on eddore) X-OriginalArrivalTime: 18 Nov 2004 14:57:00.0185 (UTC) FILETIME=[DB4A0090:01C4CD7E] X-Spam-Score: 0.0 (/) X-Scan-Signature: 30ac594df0e66ffa5a93eb4c48bcb014 Cc: Michael Krause , Vivek Kashyap , IPoverIB X-BeenThere: ipoverib@ietf.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: IP over InfiniBand WG Discussion List List-Unsubscribe: , List-Post: List-Help: List-Subscribe: , Sender: ipoverib-bounces@ietf.org Errors-To: ipoverib-bounces@ietf.org > Mike the format is really off in the last mail from you - > making it difficult to follow. 
Vivek, I think that if you used standard quoting in your replies instead of your own "" format, it would be much easier to follow email threads involving your replies. Thanks, Roland _______________________________________________ IPoverIB mailing list IPoverIB@ietf.org https://www1.ietf.org/mailman/listinfo/ipoverib From ipoverib-bounces@ietf.org Thu Nov 18 10:34:22 2004 Received: from megatron.ietf.org (megatron.ietf.org [132.151.6.71]) by ietf.org (8.9.1a/8.9.1a) with ESMTP id KAA04040 for ; Thu, 18 Nov 2004 10:34:21 -0500 (EST) Received: from localhost.localdomain ([127.0.0.1] helo=megatron.ietf.org) by megatron.ietf.org with esmtp (Exim 4.32) id 1CUoFd-0003WP-JE; Thu, 18 Nov 2004 10:31:13 -0500 Received: from odin.ietf.org ([132.151.1.176] helo=ietf.org) by megatron.ietf.org with esmtp (Exim 4.32) id 1CUnxQ-0005fg-DX for ipoverib@megatron.ietf.org; Thu, 18 Nov 2004 10:12:25 -0500 Received: from ietf-mx.ietf.org (ietf-mx.ietf.org [132.151.6.1]) by ietf.org (8.9.1a/8.9.1a) with ESMTP id KAA02076 for ; Thu, 18 Nov 2004 10:12:21 -0500 (EST) Received: from palrel10.hp.com ([156.153.255.245]) by ietf-mx.ietf.org with esmtp (Exim 4.33) id 1CUnzs-0003TM-T9 for ipoverib@ietf.org; Thu, 18 Nov 2004 10:15:08 -0500 Received: from esmail.cup.hp.com (esmail.cup.hp.com [15.0.65.164]) by palrel10.hp.com (Postfix) with ESMTP id 21DD49957D for ; Thu, 18 Nov 2004 07:12:11 -0800 (PST) Received: from MK73191c.cup.hp.com ([15.244.202.164]) by esmail.cup.hp.com (8.9.3 (PHNE_29774)/8.8.6) with ESMTP id HAA24105 for ; Thu, 18 Nov 2004 07:09:42 -0800 (PST) Message-Id: <6.1.2.0.2.20041118065847.0208dd40@esmail.cup.hp.com> X-Sender: krause@esmail.cup.hp.com X-Mailer: QUALCOMM Windows Eudora Version 6.1.2.0 Date: Thu, 18 Nov 2004 07:09:47 -0800 To: IPoverIB From: Michael Krause Subject: Re: [Ipoverib] A Couple of IPoIB Questions In-Reply-To: References: <6.1.2.0.2.20041117164050.01df1290@esmail.cup.hp.com> Mime-Version: 1.0 X-Spam-Score: 0.0 (/) X-Scan-Signature: 645960076aa293effd9740db2f975dc3 X-BeenThere: ipoverib@ietf.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: IP over InfiniBand WG Discussion List List-Unsubscribe: , List-Post: List-Help: List-Subscribe: , Content-Type: multipart/mixed; boundary="===============0433829689==" Sender: ipoverib-bounces@ietf.org Errors-To: ipoverib-bounces@ietf.org --===============0433829689== Content-Type: multipart/alternative; boundary="=====================_55855235==.ALT" --=====================_55855235==.ALT Content-Type: text/plain; charset="us-ascii"; format=flowed At 10:46 PM 11/17/2004, Vivek Kashyap wrote: >Mike the format is really off in the last mail from you - making it difficult >to follow. > > >Other than that let us discuss in the context of the draft. The draft is >built upon the following: > >1. IPoIB-RC and IPoIB-UC are optional. I would prefer only one be used - either RC or UC. I've provided some logic for either one as a preference but don't see a reason to have both. Both just leads to options which leads to interoperability problems. >2. IPoIB connected mode depends on a UD QP for address resolution and >multicast. > >As far as I know, there has been an agreement since the earliest connected >mode >draft I posted. > > >I'd like the WG to give input on the following issues: > >3. Where does the UD QP come from? Choose one of: > >a. It is a UD QP that is associated with the interface at startup. > >b. It is a UD QP that is shared with IPoIB-UD. > > >3a is more generic. It can be considered to include the case 3b. 
The original >proposal was limited to 3b. From an implementation point of view, all of this will be hidden within the driver below IP. As such, the driver will maintain the associations. Currently, each driver "instance" (may be multiple per IB port) will have at least 1 UD QP. Given the existing protocol already defines how to share this QP with other nodes, why not just re-use it and avoid doing more work? The driver can then map on a per endnode pair basis what *C QP go with what the UD QP and the spec remains largely silent on how this is accomplished. >4. Link characteristics > >The broadcast domain for IPoIB-RC/UC is determined exactly as the >IPoIB-UD case i.e. through the broadcast-GID. A UD as per 3 is used in this >step. > >Do all interfaces in the IPoIB-conneced mode(CM) have the same link >characteristics? i.e. From an implementation perspective, this is generally simplest. >a. all are either IPoIB-RC or IPoIB-UC. Preference is only 1 to be defined. > -- There is also a UD QP associated. The UD QP will be either 3a > or 3b > based on WG concensus. > > -- All unicast transmission is on the IPoIB mode i.e. RC or UC. For a given endnode pair, the policy of which QP is used for a given unicast IP datagram is really a local issue. I see some merit in the attempt to bifurcate this to multicast / broadcast to the UD QP and unicast to the *C QP. However, if the datagram fits in the PMTU of the UD QP, then either could be used. The driver would work either case. Please keep in mind that multiple *C QP can be used and their usage needs to be a local issue and not defined within the spec. >b. all are IPoIB-UD. Additionally they can be one of IPoIB-RC or IPoIB-UC >or both. > > -- The presence of the flags indicate the type of communication > possible. > -- The decision of communicating using a specific mode is > determined by > the supported modes and the local policy. Note that incompatible > policies imply that the fallback is communication over UD. > -- fallback mode of communication is UD > > >4b adds a lot of flexibility at the expense of a simple decision. 4a. by >contrast is straightforward. > > >5. MTU negotiation > > In the private data field of the CM message the desired MTU is > included. > > It was suggested during the IPoIB meeting at IETF that it need not be > symmetric. That is a good idea. Thus each peer declares the max > MTU it > prefers > > > REQ: > REP: > RTU: Rephrase this as maximum logical MTU to avoid confusion with the IB link MTU. If you start down this path, then you may need to also consider an exchange of what range of DiffServ code points to use as well. Not clear that anyone needs to deal with any latency or bandwidth guarantees but the "camel's nose is starting to enter the tent" as the saying goes. >6. Multiple connections for the same IP address > > Local decision. Note that the peer might choose to not honour > multiple > connections. Agreed. Mike >Vivek > > > > > >On Wed, 17 Nov 2004, Michael Krause wrote: > > > At 11:38 PM 11/16/2004, Vivek Kashyap wrote: > > > > > > > > >Hi, I have a couple of questions relative to IPoIB: 1. > > >draft-ietf-ipoib-ip-over-infiniband-07.txt states: "Every IPoIB interface > > >MUST "FullMember" join the IB multicast group defined by the > > >broadcast-GID." Isn't the broadcast group for IPv4 ? When the IPoIB > > >interface is IPv6 only, does this group still need be joined ? If not, > > >where do the parameters for any IPv6 groups come from ? I am presuming > > >that this group needs to be joined in the IPv6 only case. 
I just want to > > >be sure. > > > Yes, the broadcast-GID is at the InfiniBand layer and MUST be joined > > >whether you are running at v4 or v6 layer. 2. ALso, what is the > > >latest status of the Vivek's connected mode draft ? Will it be moving > > >forward ? I'll be submitting it as > > >draft-ietf-ipoib-connected-mode-00.txt by the end of the month. There were > > >some interesting suggestions that were made during the IETF WG meeting. > > >Two of the suggestions of consequence are given below. The others we can > > >discuss when the minutes are published (they include some additional > > >requests on clarification on the transmission draft too). a. The current > > >draft makes the various modes mutually exclusive i.e. RC, UC and UD are > > >not allowed simultaneously in the same IP subnet. The thought is that it > > >is a link characteristic and hence different per connection mode. It was > > >suggested that one be allowed to mix up RC/UC. This goes back to the > > >original suggestion in the first draft which was: IPoIB-UD must always be > > >supported. Additionally, the interface can also support either both of RC > > >and UC, or one of them. Or neither of them. > > > > > >UD MUST always be supported. > > > > > > That is and has always been the requirement right from the first > > >draft. > > > > > >I personally don't care whether one does RC or UC but I don't think both > > >are required as a MAY option. The advantage of RC is the send credit > > >algorithm. The advantage of UC is the lack of ACK packets. ACK is noise in > > >the fabric while send credits provide a simple method to maintain > > >bandwidth / injection control on a per flow basis. > > > > > >I see no problems with supporting both UD and *C on the same subnet; it is > > >rather foolish to attempt to mandate these be on separate subnets.b > > > As per the connected-mode draft the UD mechanism is *always* > > >required; address resolutoin depends on it. > > > > > >The only point of discussion is whether all nodes must support the same > > >link characteristics in the subnet i.e. all are RC (and UD), or all or UC > > >(and UD), or all are UD only. > > > > Obviously I would oppose such a solution as it creates artificial > > constraints with little benefit. > > > > >The alternative is to allow all the nodes to be mixed up with some nodes > > >being RC/UD, others UC/UD and a third set UD only and yet others probably > > >supporting all. within the same IP subnet. [Can the same serviceID be used > > >by both RC and UC ?] > > > > > >The third alternative is to associating UD only or UD + one of RC or UC on > > >the same interface. In such a case if mismatched/unsupported connected > > >modes are supported by two nodes then the fall back to UD. This option is > > >not too different from UD QP + RC or UC mechanism. > > > > KISS: > > > > - UD universal > > - *C opportunistic > > - Local management issue to control what is sent on the *C > > interface. No need to specify > > - Advertise whether one or more ports are supported by UD or *C > > - Advertise whether one or more QP are supported by UD or *C > > - Let local management determine policy for what services are > > mapped where - no need to specify > > > > This is both an interoperable approach and simple to implement. There may > > be some desire to add a policy interface to state preference for specific > > types of traffic over a given QP. I would not oppose this but would view > > this as a separate draft once the basics are worked out. > > > > > > > > > > > >b. 
Another suggestion was to allow multiple connected mode links (i.e. at > > >IB UC/RC level) between peers. One thought can be 'yes, but user beware': > > >The IB connections are made using the service ID that is derived from the > > >QPN as described in the draft. If a second attempt succeeds then there are > > >two links. It is up to the implementation to either allow or disallow > > >multiple links. > > > > > >Again, this has been suggested in the past (though most who were involved > > >in the original discussions years gone by are likely gone since much of > > >this discussion occurred before the IETF workgroup was established). > > > > > > I'm one of the vestiges of those early times along with you and a few > > >others...so we have hope :). > > > > > >There is obvious benefit to supporting multiple RC per endnode pair. I do > > >not see any technical reason to oppose nor any issue from an > > >interoperability perspective. There is no reason for a "user beware". > > > > > > It is not opposed. The 'user beware' is only underscoring that the > > >the peer interface might not support multiple links- it might enforce a > > >limited number of connections (maybe only one) between a pair of GIDs. > > >Similarly, an implementation not wanting to support multiple links MUST > > >take steps to deny multiple requests. > > > > *C requires CM to operate thus it is a local issue whether additional CM > > operations are accepted or not. A given requester node may issue N and a > > given responder may state 0-N as an implementation may limit the number of > > *C available for IP traffic. > > > > > > > > > > > > >The work is rather straight to do and implement and the benefit to > > >customers, is again, rather obvious when one considers what the IB fabric > > >offers and how connections can be enable flows through multipath as well > > >as transparent fail-over, flow scheduling, mapping of DiffServ to > > >different arbitration / paths, etc. > > > > > > In addition Large MTU and APM are two of the main reasons why I've > > >been proposing IPoIB-connected mode for so long. In terms of IPoIB itself, > > >except for the Large MTU, the parameters are hidden from it. > > > > Mike > >__ > >Vivek Kashyap >Linux Technology Center, IBM > > >_______________________________________________ >IPoverIB mailing list >IPoverIB@ietf.org >https://www1.ietf.org/mailman/listinfo/ipoverib --=====================_55855235==.ALT Content-Type: text/html; charset="us-ascii" At 10:46 PM 11/17/2004, Vivek Kashyap wrote:
    --=====================_55855235==.ALT-- --===============0433829689== Content-Type: text/plain; charset="us-ascii" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit Content-Disposition: inline Content-Transfer-Encoding: 7bit _______________________________________________ IPoverIB mailing list IPoverIB@ietf.org https://www1.ietf.org/mailman/listinfo/ipoverib --===============0433829689==-- From ipoverib-bounces@ietf.org Thu Nov 18 14:44:29 2004 Received: from megatron.ietf.org (megatron.ietf.org [132.151.6.71]) by ietf.org (8.9.1a/8.9.1a) with ESMTP id OAA25723 for ; Thu, 18 Nov 2004 14:44:29 -0500 (EST) Received: from localhost.localdomain ([127.0.0.1] helo=megatron.ietf.org) by megatron.ietf.org with esmtp (Exim 4.32) id 1CUs2f-0005U8-Vx; Thu, 18 Nov 2004 14:34:05 -0500 Received: from odin.ietf.org ([132.151.1.176] helo=ietf.org) by megatron.ietf.org with esmtp (Exim 4.32) id 1CUry2-0003nU-8R for ipoverib@megatron.ietf.org; Thu, 18 Nov 2004 14:29:18 -0500 Received: from ietf-mx.ietf.org (ietf-mx.ietf.org [132.151.6.1]) by ietf.org (8.9.1a/8.9.1a) with ESMTP id OAA24295 for ; Thu, 18 Nov 2004 14:29:16 -0500 (EST) Received: from nwkea-mail-1.sun.com ([192.18.42.13]) by ietf-mx.ietf.org with esmtp (Exim 4.33) id 1CUs0h-0001GP-U7 for ipoverib@ietf.org; Thu, 18 Nov 2004 14:32:04 -0500 Received: from jurassic.eng.sun.com ([129.146.85.105]) by nwkea-mail-1.sun.com (8.12.10/8.12.9) with ESMTP id iAIJTF6O028050 for ; Thu, 18 Nov 2004 11:29:15 -0800 (PST) Received: from taipei (taipei.SFBay.Sun.COM [129.146.85.178]) by jurassic.eng.sun.com (8.13.1+Sun/8.13.1) with SMTP id iAIJTE6R162628 for ; Thu, 18 Nov 2004 11:29:15 -0800 (PST) Message-Id: <200411181929.iAIJTE6R162628@jurassic.eng.sun.com> Date: Thu, 18 Nov 2004 11:27:41 -0800 (PST) From: "H.K. Jerry Chu" To: ipoverib@ietf.org MIME-Version: 1.0 Content-Type: TEXT/plain; charset=us-ascii Content-MD5: BGw7MK9FivR50WJyLLgglw== X-Mailer: dtmail 1.3.0 @(#)CDE Version 1.6_68 SunOS 5.10 sun4u sparc X-Spam-Score: 0.0 (/) X-Scan-Signature: 8b431ad66d60be2d47c7bfeb879db82c Subject: [Ipoverib] comments on draft-kashyap-ipoib-connected-mode-02.txt X-BeenThere: ipoverib@ietf.org X-Mailman-Version: 2.1.5 Precedence: list Reply-To: "H.K. Jerry Chu" List-Id: IP over InfiniBand WG Discussion List List-Unsubscribe: , List-Post: List-Help: List-Subscribe: , Sender: ipoverib-bounces@ietf.org Errors-To: ipoverib-bounces@ietf.org In the last IETF61 IPoIB meeting I made several comments on the connected mode draft. I'm sending them to the list for a general discussion. (Yes I saw some disucssion on the connected mode draft already. I'll try to catch up with the thread after this mail.) 1. The draft makes a distinction between IPoIB-CM interfaces and IPoIB-UD interfaces, and portrays IPoIB-UC or IPoIB-RC as separate subnets superimposed on top of an IPoIB-UD subnet. For the above to work, due to a lack of multicast support, a fully connected network by itself can't meet the requirement of an IP link unless multicast is fully emulated through the use of multiple unicasts. The latter is complex and cumbersome. A much simpler model, which I think was presented in earlier drafts, is to fold the use of IB connections fully into a regular IPoIB-UD subnet, allowing any two IPoIB nodes to optionally negotiate the use of IB connection between themselves. This much simplified model is not without its drawback. Some nice IP link attributes are no longer unique within a link. E.g., the link MTU now becomes per-node-pair MTU. 
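To make the per-node-pair MTU point concrete: the sender ends up consulting a per-neighbour value learned during connection setup instead of a single link-wide MTU, while multicast stays at the UD MTU. The C sketch below shows one possible shape of that bookkeeping; the structure and names (ipoib_neigh, cm_mtu, ipoib_tx_mtu) are hypothetical and not taken from any of the drafts.

    /* Hypothetical per-neighbour MTU bookkeeping for IPoIB connected mode.
     * Unicast to a peer with an established RC/UC connection uses the MTU
     * that peer advertised in the CM private data; multicast and peers
     * without a connection stay at the UD MTU.
     */
    #include <stddef.h>

    struct ipoib_neigh {
            int          has_cm_conn;   /* RC/UC connection established?   */
            unsigned int cm_mtu;        /* MTU the peer advertised via CM  */
    };

    /* Pick the MTU for an outgoing packet to 'neigh' (NULL if unknown). */
    static unsigned int ipoib_tx_mtu(const struct ipoib_neigh *neigh,
                                     unsigned int ud_mtu,
                                     int is_multicast)
    {
            if (is_multicast || neigh == NULL || !neigh->has_cm_conn)
                    return ud_mtu;          /* the ordinary per-link UD MTU */

            return neigh->cm_mtu;           /* the per-node-pair MTU        */
    }

A node with no connected-mode peers then behaves exactly like plain IPoIB-UD, since every lookup falls back to the UD MTU.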
Moreover, the MTU size for multicast will be different from the MTU size for unicast if IB connections are used. IB UC/RC may exhibit different RAS, flow control, QoS or other link characteristics than UD. But I consider these problems a reasonable price to pay for a seamless support of UC/RC mode in an IPoIB link defined by UD. 2. The negotiation of the per-connection MTU seems more complicated than necessary. I think all is needed is for a node to advertise its own "receive MTU". That is, the MTU size its peer should never go over when sending packets to the local interface. Yes this may break the traditional concept of "symmetric" MTUs. But we're already breaking the notion of per-link MTU, requring a lot of changes in the host stack anyway. This additonal breakage doesn't seem much. I haven't verified if this asymmetric MTU matches well with IBA connections though. 3. Regarding allowing multiple IB connections between a node pair, since given an IP address there is only one link-address for it implying one QPN, hence one service-ID, if a single service-ID can be used to create multiple IB connections then this can happen transparently. Otherwise we've got a problem. Jerry _______________________________________________ IPoverIB mailing list IPoverIB@ietf.org https://www1.ietf.org/mailman/listinfo/ipoverib From ipoverib-bounces@ietf.org Thu Nov 18 14:49:01 2004 Received: from megatron.ietf.org (megatron.ietf.org [132.151.6.71]) by ietf.org (8.9.1a/8.9.1a) with ESMTP id OAA26105 for ; Thu, 18 Nov 2004 14:49:01 -0500 (EST) Received: from localhost.localdomain ([127.0.0.1] helo=megatron.ietf.org) by megatron.ietf.org with esmtp (Exim 4.32) id 1CUs8m-0006a2-HP; Thu, 18 Nov 2004 14:40:24 -0500 Received: from odin.ietf.org ([132.151.1.176] helo=ietf.org) by megatron.ietf.org with esmtp (Exim 4.32) id 1CUs24-00056x-EB for ipoverib@megatron.ietf.org; Thu, 18 Nov 2004 14:33:28 -0500 Received: from ietf-mx.ietf.org (ietf-mx.ietf.org [132.151.6.1]) by ietf.org (8.9.1a/8.9.1a) with ESMTP id OAA24565 for ; Thu, 18 Nov 2004 14:33:26 -0500 (EST) Received: from e34.co.us.ibm.com ([32.97.110.132]) by ietf-mx.ietf.org with esmtp (Exim 4.33) id 1CUs4e-0001Km-Ru for ipoverib@ietf.org; Thu, 18 Nov 2004 14:36:14 -0500 Received: from westrelay02.boulder.ibm.com (westrelay02.boulder.ibm.com [9.17.195.11]) by e34.co.us.ibm.com (8.12.10/8.12.9) with ESMTP id iAIJWkAD544024 for ; Thu, 18 Nov 2004 14:32:46 -0500 Received: from d03av02.boulder.ibm.com (d03av02.boulder.ibm.com [9.17.195.168]) by westrelay02.boulder.ibm.com (8.12.10/NCO/VER6.6) with ESMTP id iAIJWkCQ220310 for ; Thu, 18 Nov 2004 12:32:46 -0700 Received: from d03av02.boulder.ibm.com (loopback [127.0.0.1]) by d03av02.boulder.ibm.com (8.12.11/8.12.11) with ESMTP id iAIJWjvL022516 for ; Thu, 18 Nov 2004 12:32:45 -0700 Received: from DYN319548.beaverton.ibm.com (DYN319548.beaverton.ibm.com [9.47.22.85]) by d03av02.boulder.ibm.com (8.12.11/8.12.11) with ESMTP id iAIJWivB022486; Thu, 18 Nov 2004 12:32:45 -0700 Date: Thu, 18 Nov 2004 11:33:40 -0800 (PST) From: Vivek Kashyap X-X-Sender: kashyapv@dyn319548.beaverton.ibm.com To: Michael Krause Subject: Re: [Ipoverib] A Couple of IPoIB Questions In-Reply-To: <6.1.2.0.2.20041118065847.0208dd40@esmail.cup.hp.com> Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII X-Spam-Score: 0.0 (/) X-Scan-Signature: 24d000849df6f171c5ec1cca2ea21b82 Cc: IPoverIB X-BeenThere: ipoverib@ietf.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: IP over InfiniBand WG Discussion List List-Unsubscribe: , 
List-Post: List-Help: List-Subscribe: , Sender: ipoverib-bounces@ietf.org Errors-To: ipoverib-bounces@ietf.org On Thu, 18 Nov 2004, Michael Krause wrote: > At 10:46 PM 11/17/2004, Vivek Kashyap wrote: > >Mike the format is really off in the last mail from you - making it difficult > >to follow. > > > > > >Other than that let us discuss in the context of the draft. The draft is > >built upon the following: > > > >1. IPoIB-RC and IPoIB-UC are optional. > > I would prefer only one be used - either RC or UC. I've provided some > logic for either one as a preference but don't see a reason to have > both. Both just leads to options which leads to interoperability problems. ok. See section 3.1 of the draft draft-kashyap-ipoib-connected-mode-02.txt. It states that the RC and UC are mutually exclusive flags. > > >2. IPoIB connected mode depends on a UD QP for address resolution and > >multicast. > > > >As far as I know, there has been an agreement since the earliest connected > >mode > >draft I posted. > > > > > >I'd like the WG to give input on the following issues: > > > >3. Where does the UD QP come from? Choose one of: > > > >a. It is a UD QP that is associated with the interface at startup. > > > >b. It is a UD QP that is shared with IPoIB-UD. > > > > > >3a is more generic. It can be considered to include the case 3b. The original > >proposal was limited to 3b. > > From an implementation point of view, all of this will be hidden within > the driver below IP. As such, the driver will maintain the > associations. Currently, each driver "instance" (may be multiple per IB > port) will have at least 1 UD QP. Given the existing protocol already > defines how to share this QP with other nodes, why not just re-use it and > avoid doing more work? The driver can then map on a per endnode pair basis > what *C QP go with what the UD QP and the spec remains largely silent on > how this is accomplished. The draft at present states that 'IPoIB-CM implementation MAY use the same UD QP as used by IPoIB-UD...'. See section 3.0. I believe it covers what you are stating. > >4. Link characteristics > > > >The broadcast domain for IPoIB-RC/UC is determined exactly as the > >IPoIB-UD case i.e. through the broadcast-GID. A UD as per 3 is used in this > >step. > > > >Do all interfaces in the IPoIB-conneced mode(CM) have the same link > >characteristics? i.e. > > From an implementation perspective, this is generally simplest. > > >a. all are either IPoIB-RC or IPoIB-UC. > > Preference is only 1 to be defined. > > > > -- There is also a UD QP associated. The UD QP will be either 3a > > or 3b > > based on WG concensus. > > > > -- All unicast transmission is on the IPoIB mode i.e. RC or UC. > > For a given endnode pair, the policy of which QP is used for a given > unicast IP datagram is really a local issue. I see some merit in the Not if an implementation chooses to only receive unicast on the CM modes in an IPoIB-CM subnet. I think the WG must either mandate that between two IP address all unicast communication can be over either UD or the supported CM, or state that all unicast communication must be over IPoIB-CM. Hence my attempt at a detailed discussion on these issues. Issues such as in order delivery need to be considered: e.g. if RC and UD are used to mix up the traffic, say of TCP segments of the same connection, they may no longer be received in order. > attempt to bifurcate this to multicast / broadcast to the UD QP and unicast > to the *C QP. 
However, if the datagram fits in the PMTU of the UD QP, then > either could be used. The driver would work either case. Please keep in > mind that multiple *C QP can be used and their usage needs to be a local > issue and not defined within the spec. > > >b. all are IPoIB-UD. Additionally they can be one of IPoIB-RC or IPoIB-UC > >or both. > > > > -- The presence of the flags indicate the type of communication > > possible. > > -- The decision of communicating using a specific mode is > > determined by > > the supported modes and the local policy. Note that incompatible > > policies imply that the fallback is communication over UD. > > -- fallback mode of communication is UD > > > > > >4b adds a lot of flexibility at the expense of a simple decision. 4a. by > >contrast is straightforward. > > > > > >5. MTU negotiation > > > > In the private data field of the CM message the desired MTU is > > included. > > > > It was suggested during the IPoIB meeting at IETF that it need not be > > symmetric. That is a good idea. Thus each peer declares the max > > MTU it > > prefers > > > > > > REQ: > > REP: > > RTU: > > Rephrase this as maximum logical MTU to avoid confusion with the IB link It is covered in section 5.1 of the draft. > MTU. If you start down this path, then you may need to also consider an > exchange of what range of DiffServ code points to use as well. Not clear > that anyone needs to deal with any latency or bandwidth guarantees but the > "camel's nose is starting to enter the tent" as the saying goes. The camel comes along if Diffserv etc. as listed above are included. Hence they are not in the draft. > > > >6. Multiple connections for the same IP address > > > > Local decision. Note that the peer might choose to not honour > > multiple > > connections. > > Agreed. > > Mike > > > > > >Vivek > > > > > > > > > > > >On Wed, 17 Nov 2004, Michael Krause wrote: > > > > > At 11:38 PM 11/16/2004, Vivek Kashyap wrote: > > > > > > > > > > > > >Hi, I have a couple of questions relative to IPoIB: 1. > > > >draft-ietf-ipoib-ip-over-infiniband-07.txt states: "Every IPoIB interface > > > >MUST "FullMember" join the IB multicast group defined by the > > > >broadcast-GID." Isn't the broadcast group for IPv4 ? When the IPoIB > > > >interface is IPv6 only, does this group still need be joined ? If not, > > > >where do the parameters for any IPv6 groups come from ? I am presuming > > > >that this group needs to be joined in the IPv6 only case. I just want to > > > >be sure. > > > > Yes, the broadcast-GID is at the InfiniBand layer and MUST be joined > > > >whether you are running at v4 or v6 layer. 2. ALso, what is the > > > >latest status of the Vivek's connected mode draft ? Will it be moving > > > >forward ? I'll be submitting it as > > > >draft-ietf-ipoib-connected-mode-00.txt by the end of the month. There were > > > >some interesting suggestions that were made during the IETF WG meeting. > > > >Two of the suggestions of consequence are given below. The others we can > > > >discuss when the minutes are published (they include some additional > > > >requests on clarification on the transmission draft too). a. The current > > > >draft makes the various modes mutually exclusive i.e. RC, UC and UD are > > > >not allowed simultaneously in the same IP subnet. The thought is that it > > > >is a link characteristic and hence different per connection mode. It was > > > >suggested that one be allowed to mix up RC/UC. 
This goes back to the > > > >original suggestion in the first draft which was: IPoIB-UD must always be > > > >supported. Additionally, the interface can also support either both of RC > > > >and UC, or one of them. Or neither of them. > > > > > > > >UD MUST always be supported. > > > > > > > > That is and has always been the requirement right from the first > > > >draft. > > > > > > > >I personally don't care whether one does RC or UC but I don't think both > > > >are required as a MAY option. The advantage of RC is the send credit > > > >algorithm. The advantage of UC is the lack of ACK packets. ACK is noise in > > > >the fabric while send credits provide a simple method to maintain > > > >bandwidth / injection control on a per flow basis. > > > > > > > >I see no problems with supporting both UD and *C on the same subnet; it is > > > >rather foolish to attempt to mandate these be on separate subnets.b > > > > As per the connected-mode draft the UD mechanism is *always* > > > >required; address resolutoin depends on it. > > > > > > > >The only point of discussion is whether all nodes must support the same > > > >link characteristics in the subnet i.e. all are RC (and UD), or all or UC > > > >(and UD), or all are UD only. > > > > > > Obviously I would oppose such a solution as it creates artificial > > > constraints with little benefit. > > > > > > >The alternative is to allow all the nodes to be mixed up with some nodes > > > >being RC/UD, others UC/UD and a third set UD only and yet others probably > > > >supporting all. within the same IP subnet. [Can the same serviceID be used > > > >by both RC and UC ?] > > > > > > > >The third alternative is to associating UD only or UD + one of RC or UC on > > > >the same interface. In such a case if mismatched/unsupported connected > > > >modes are supported by two nodes then the fall back to UD. This option is > > > >not too different from UD QP + RC or UC mechanism. > > > > > > KISS: > > > > > > - UD universal > > > - *C opportunistic > > > - Local management issue to control what is sent on the *C > > > interface. No need to specify > > > - Advertise whether one or more ports are supported by UD or *C > > > - Advertise whether one or more QP are supported by UD or *C > > > - Let local management determine policy for what services are > > > mapped where - no need to specify > > > > > > This is both an interoperable approach and simple to implement. There may > > > be some desire to add a policy interface to state preference for specific > > > types of traffic over a given QP. I would not oppose this but would view > > > this as a separate draft once the basics are worked out. > > > > > > > > > > > > > > > > >b. Another suggestion was to allow multiple connected mode links (i.e. at > > > >IB UC/RC level) between peers. One thought can be 'yes, but user beware': > > > >The IB connections are made using the service ID that is derived from the > > > >QPN as described in the draft. If a second attempt succeeds then there are > > > >two links. It is up to the implementation to either allow or disallow > > > >multiple links. > > > > > > > >Again, this has been suggested in the past (though most who were involved > > > >in the original discussions years gone by are likely gone since much of > > > >this discussion occurred before the IETF workgroup was established). > > > > > > > > I'm one of the vestiges of those early times along with you and a few > > > >others...so we have hope :). 
> > > > > > > >There is obvious benefit to supporting multiple RC per endnode pair. I do > > > >not see any technical reason to oppose nor any issue from an > > > >interoperability perspective. There is no reason for a "user beware". > > > > > > > > It is not opposed. The 'user beware' is only underscoring that the > > > >the peer interface might not support multiple links- it might enforce a > > > >limited number of connections (maybe only one) between a pair of GIDs. > > > >Similarly, an implementation not wanting to support multiple links MUST > > > >take steps to deny multiple requests. > > > > > > *C requires CM to operate thus it is a local issue whether additional CM > > > operations are accepted or not. A given requester node may issue N and a > > > given responder may state 0-N as an implementation may limit the number of > > > *C available for IP traffic. > > > > > > > > > > > > > > > > > >The work is rather straight to do and implement and the benefit to > > > >customers, is again, rather obvious when one considers what the IB fabric > > > >offers and how connections can be enable flows through multipath as well > > > >as transparent fail-over, flow scheduling, mapping of DiffServ to > > > >different arbitration / paths, etc. > > > > > > > > In addition Large MTU and APM are two of the main reasons why I've > > > >been proposing IPoIB-connected mode for so long. In terms of IPoIB itself, > > > >except for the Large MTU, the parameters are hidden from it. > > > > > > Mike > > > >__ > > > >Vivek Kashyap > >Linux Technology Center, IBM > > > > > >_______________________________________________ > >IPoverIB mailing list > >IPoverIB@ietf.org > >https://www1.ietf.org/mailman/listinfo/ipoverib > _______________________________________________ IPoverIB mailing list IPoverIB@ietf.org https://www1.ietf.org/mailman/listinfo/ipoverib From ipoverib-bounces@ietf.org Thu Nov 18 16:16:02 2004 Received: from megatron.ietf.org (megatron.ietf.org [132.151.6.71]) by ietf.org (8.9.1a/8.9.1a) with ESMTP id QAA15373 for ; Thu, 18 Nov 2004 16:16:01 -0500 (EST) Received: from localhost.localdomain ([127.0.0.1] helo=megatron.ietf.org) by megatron.ietf.org with esmtp (Exim 4.32) id 1CUsoT-0004S1-1s; Thu, 18 Nov 2004 15:23:29 -0500 Received: from odin.ietf.org ([132.151.1.176] helo=ietf.org) by megatron.ietf.org with esmtp (Exim 4.32) id 1CUskI-00014N-5S for ipoverib@megatron.ietf.org; Thu, 18 Nov 2004 15:19:10 -0500 Received: from ietf-mx.ietf.org (ietf-mx.ietf.org [132.151.6.1]) by ietf.org (8.9.1a/8.9.1a) with ESMTP id PAA02128 for ; Thu, 18 Nov 2004 15:19:07 -0500 (EST) Received: from taurus.voltaire.com ([212.143.27.73]) by ietf-mx.ietf.org with esmtp (Exim 4.33) id 1CUsmy-0002d4-HZ for ipoverib@ietf.org; Thu, 18 Nov 2004 15:21:57 -0500 X-MimeOLE: Produced By Microsoft Exchange V6.5.7226.0 Content-class: urn:content-classes:message MIME-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: quoted-printable Subject: [Ipoverib] IPoIB-RC and Checksums Date: Thu, 18 Nov 2004 22:18:25 +0200 Message-ID: <35EA21F54A45CB47B879F21A91F4862F2CBC58@taurus.voltaire.com> Thread-Topic: [Ipoverib] IPoIB-RC and Checksums Thread-Index: AcTNp5+apRItYwv3SJuqKkSsHwbzhwAAsDuQ From: "Yaron Haviv" To: "IPoverIB" X-Spam-Score: 0.0 (/) X-Scan-Signature: 93238566e09e6e262849b4f805833007 Content-Transfer-Encoding: quoted-printable X-BeenThere: ipoverib@ietf.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: IP over InfiniBand WG Discussion List List-Unsubscribe: , List-Post: List-Help: 
List-Subscribe: , Sender: ipoverib-bounces@ietf.org Errors-To: ipoverib-bounces@ietf.org Content-Transfer-Encoding: quoted-printable In GbE usually the NIC Tx Segmentation (large send) capability comes hand in hand with Checksum offload for greater efficiency (and zero copy) On UD we decided not to address checksum offloading, since we cannot guarantee that the node will not forward an un-checked packet=20 Where as in RC we can have examples of devices that can guarantee checksum=20 One example is an IB-IP gateway that always checksum outgoing and incoming packets, and can act as a remote IP NIC to the Host =20 I suggest we include a checksum option in the CM Exchange=20 Where a node can request that its peer will not checksum the packet for it And also signal that he sends packets that are already checked=20 That can help improve performance of IPoIB RC P.S. another note, we discussed in IETF was that we may want to mention/suggest (not mandate) use of SRQ for IPoIB-RC in order to preserve memory=20 Yaron _______________________________________________ IPoverIB mailing list IPoverIB@ietf.org https://www1.ietf.org/mailman/listinfo/ipoverib From ipoverib-bounces@ietf.org Thu Nov 18 17:44:20 2004 Received: from megatron.ietf.org (megatron.ietf.org [132.151.6.71]) by ietf.org (8.9.1a/8.9.1a) with ESMTP id RAA27799 for ; Thu, 18 Nov 2004 17:44:20 -0500 (EST) Received: from localhost.localdomain ([127.0.0.1] helo=megatron.ietf.org) by megatron.ietf.org with esmtp (Exim 4.32) id 1CUulv-0006oN-1m; Thu, 18 Nov 2004 17:28:59 -0500 Received: from odin.ietf.org ([132.151.1.176] helo=ietf.org) by megatron.ietf.org with esmtp (Exim 4.32) id 1CUud6-0003uQ-Ju for ipoverib@megatron.ietf.org; Thu, 18 Nov 2004 17:19:52 -0500 Received: from ietf-mx.ietf.org (ietf-mx.ietf.org [132.151.6.1]) by ietf.org (8.9.1a/8.9.1a) with ESMTP id RAA24897 for ; Thu, 18 Nov 2004 17:19:49 -0500 (EST) Received: from atorelbas01.hp.com ([156.153.255.245] helo=palrel10.hp.com) by ietf-mx.ietf.org with esmtp (Exim 4.33) id 1CUufb-0000hI-QS for ipoverib@ietf.org; Thu, 18 Nov 2004 17:22:40 -0500 Received: from esmail.cup.hp.com (esmail.cup.hp.com [15.0.65.164]) by palrel10.hp.com (Postfix) with ESMTP id F3BA41CD7D; Thu, 18 Nov 2004 14:19:38 -0800 (PST) Received: from MK73191c.cup.hp.com ([15.244.201.129]) by esmail.cup.hp.com (8.9.3 (PHNE_29774)/8.8.6) with ESMTP id OAA21563; Thu, 18 Nov 2004 14:16:59 -0800 (PST) Message-Id: <6.1.2.0.2.20041118132352.0c98a550@esmail.cup.hp.com> X-Sender: krause@esmail.cup.hp.com X-Mailer: QUALCOMM Windows Eudora Version 6.1.2.0 Date: Thu, 18 Nov 2004 13:26:43 -0800 To: Vivek Kashyap From: Michael Krause Subject: Re: [Ipoverib] A Couple of IPoIB Questions In-Reply-To: References: <6.1.2.0.2.20041118065847.0208dd40@esmail.cup.hp.com> Mime-Version: 1.0 X-Spam-Score: 0.0 (/) X-Scan-Signature: b92e72fc2b623ddd11e6d81413fb81b2 Cc: IPoverIB X-BeenThere: ipoverib@ietf.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: IP over InfiniBand WG Discussion List List-Unsubscribe: , List-Post: List-Help: List-Subscribe: , Content-Type: multipart/mixed; boundary="===============0941738355==" Sender: ipoverib-bounces@ietf.org Errors-To: ipoverib-bounces@ietf.org --===============0941738355== Content-Type: multipart/alternative; boundary="=====================_81454966==.ALT" --=====================_81454966==.ALT Content-Type: text/plain; charset="us-ascii"; format=flowed At 11:33 AM 11/18/2004, Vivek Kashyap wrote: >On Thu, 18 Nov 2004, Michael Krause wrote: > > > At 10:46 PM 11/17/2004, Vivek Kashyap wrote: > 
    On Thu, 18 Nov 2004, Michael Krause wrote:

    > At 10:46 PM 11/17/2004, Vivek Kashyap wrote:
    > >Mike the format is really off in the last mail from you - making it difficult
    > >to follow.
    > >
    > >
    > >Other than that let us discuss in the context of the draft. The draft is
    > >built upon the following:
    > >
    > >1. IPoIB-RC and IPoIB-UC are optional.
    >
    > I would prefer only one be used - either RC or UC.  I've provided some
    > logic for either one as a preference but don't see a reason to have
    > both.  Both just lead to options, which lead to interoperability problems.

    ok.
    See section 3.1 of the draft draft-kashyap-ipoib-connected-mode-02.txt.
    It states that the RC and UC are mutually exclusive flags.

    My preference is to support only one of the two in the spec, not to have flags to indicate what is implemented.  The benefits of connected-mode operation should come from a single form of communication, not two.


    >
    > >2. IPoIB connected mode depends on a UD QP for address resolution and
    > >multicast.
    > >
    > >As far as I know, there has been an agreement since the earliest connected
    > >mode
    > >draft I posted.
    > >
    > >
    > >I'd like the WG to give input on the following issues:
    > >
    > >3. Where does the UD QP come from?  Choose one of:
    > >
    > >a. It is a UD QP that is associated with the interface at startup.
    > >
    > >b. It is a UD QP that is shared with IPoIB-UD.
    > >
    > >
    > >3a is more generic. It can be considered to include the case 3b.  The original
    > >proposal was limited to 3b.
    >
    >  From an implementation point of view, all of this will be hidden within
    > the driver below IP.  As such, the driver will maintain the
    > associations.  Currently, each driver "instance" (may be multiple per IB
    > port) will have at least 1 UD QP.  Given the existing protocol already
    > defines how to share this QP with other nodes, why not just re-use it and
    > avoid doing more work?  The driver can then map on a per endnode pair basis
    > what *C QP go with what the UD QP and the spec remains largely silent on
    > how this is accomplished.

    The draft at present states that 'IPoIB-CM implementation MAY use the same UD
    QP as used by IPoIB-UD...'. See section 3.0. I believe it covers what you
    are stating.

    > >4. Link characteristics
    > >
    > >The broadcast domain for IPoIB-RC/UC is determined exactly as the
    > >IPoIB-UD case i.e. through the broadcast-GID. A UD as per 3 is used in this
    > >step.
    > >
    > >Do all interfaces in the IPoIB-connected mode (CM) have the same link
    > >characteristics? i.e.
    >
    >  From an implementation perspective, this is generally simplest.
    >
    > >a. all are either IPoIB-RC or IPoIB-UC.
    >
    > Preference is only 1 to be defined.
    >
    >
    > >         -- There is also a UD QP associated. The UD QP will be either 3a
    > > or 3b
    > >            based on WG concensus.
    > >
    > >         -- All unicast transmission is on the IPoIB mode i.e. RC or UC.
    >
    > For a given endnode pair, the policy of which QP is used for a given
    > unicast IP datagram is really a local issue.  I see some merit in the

    Not if an implementation chooses to only receive unicast on the CM modes in
     an IPoIB-CM subnet. I think the WG must either mandate that between two
    IP addresses all unicast communication can be over either UD or the supported CM,
    or state that all unicast communication must be over IPoIB-CM. Hence my
    attempt at a detailed discussion on these issues.

    Issues such as in order delivery need to be considered: e.g. if RC and UD are
    used to mix up the traffic, say of TCP segments of the same connection, they
    may no longer be received in order.

    If a designer is stupid, they may do this. However, one would expect some intelligence here: specific data flows, DiffServ code points, or similar criteria can be used to determine which connection or which UD QP carries a flow, and an intelligent, predictable algorithm would then ensure that mix-and-match does not occur for a given TCP connection.  Given that multiple *C QPs can be supported, it is not tenable to state that all unicast must go over a given QP or that no unicast can occur on a UD QP.
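
To make that kind of predictable mapping concrete, here is a minimal sketch in C. It is purely illustrative (the structure, hash, and function names are assumptions, not anything from the draft): each 5-tuple flow is hashed to exactly one of the established connected QPs, so segments of a single TCP connection are never split across the UD QP and a connected QP.

    #include <stdint.h>
    #include <stddef.h>

    /* Illustrative driver-local view of the QPs available to reach one peer. */
    #define MAX_CONN_QPS 4

    struct peer_qps {
        uint32_t conn_qpn[MAX_CONN_QPS]; /* established RC/UC QPs to this peer */
        unsigned num_conn;               /* 0 => only the UD QP is available   */
        uint32_t ud_qpn;                 /* the UD QP (always present)         */
    };

    /* Stable FNV-1a style hash of the 5-tuple; any deterministic hash works. */
    static uint32_t flow_hash(uint32_t saddr, uint32_t daddr,
                              uint16_t sport, uint16_t dport, uint8_t proto)
    {
        uint32_t words[4] = { saddr, daddr,
                              ((uint32_t)sport << 16) | dport, proto };
        uint32_t h = 2166136261u;
        for (size_t i = 0; i < 4; i++) {
            h ^= words[i];
            h *= 16777619u;
        }
        return h;
    }

    /* Every packet of a given flow maps to the same QP, so per-connection
     * ordering is preserved even though several QPs exist to the peer.    */
    uint32_t select_tx_qpn(const struct peer_qps *p,
                           uint32_t saddr, uint32_t daddr,
                           uint16_t sport, uint16_t dport, uint8_t proto)
    {
        if (p->num_conn == 0)
            return p->ud_qpn;
        return p->conn_qpn[flow_hash(saddr, daddr, sport, dport, proto)
                           % p->num_conn];
    }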


    > attempt to bifurcate this to multicast / broadcast to the UD QP and unicast
    > to the *C QP.  However, if the datagram fits in the PMTU of the UD QP, then
    > either could be used.  The driver would work either case.  Please keep in
    > mind that multiple *C QP can be used and their usage needs to be a local
    > issue and not defined within the spec.
    >
    > >b. all are IPoIB-UD. Additionally they can be one of IPoIB-RC or IPoIB-UC
    > >or both.
    > >
    > >         -- The presence of the flags indicate the type of communication
    > > possible.
    > >         -- The decision of communicating using a specific mode is
    > > determined by
    > >            the supported modes and the local policy. Note that incompatible
    > >            policies imply that the fallback is communication over UD.
    > >         -- fallback mode of communication is UD
    > >
    > >
    > >4b adds a lot of flexibility at the expense of a simple decision. 4a. by
    > >contrast is straightforward.
    > >
    > >
    > >5. MTU negotiation
    > >
    > >         In the private data field of the CM message the desired MTU is
    > >         included.
    > >
    > >         It was suggested during the IPoIB meeting at IETF that it need not be
    > >         symmetric. That is a good idea. Thus each peer declares the max
    > > MTU it
    > >         prefers
    > >
    > >
    > >         REQ: <my desired MTU>
    > >         REP: <my desired MTU>
    > >         RTU:
    >
    > Rephrase this as maximum logical MTU to avoid confusion with the IB link

    It is covered in section 5.1 of the draft.

    > MTU.  If you start down this path, then you may need to also consider an
    > exchange of what range of DiffServ code points to use as well.  Not clear
    > that anyone needs to deal with any latency or bandwidth guarantees but the
    > "camel's nose is starting to enter the tent" as the saying goes.

    The camel comes along if DiffServ etc., as listed above, are
    included. Hence they are not in the draft.
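
For illustration only, a sketch of what carrying the desired MTU in the CM REQ/REP private data might look like. The field layout, sizes, and version value below are assumptions, not the draft's actual private-data format; the point is simply that each side advertises the largest IP datagram it is willing to receive, which is what makes an asymmetric MTU straightforward.

    #include <stdint.h>
    #include <string.h>
    #include <arpa/inet.h>   /* htonl/ntohl */

    /* Hypothetical REQ/REP private-data layout (illustrative only). */
    struct ipoib_cm_priv {
        uint32_t version;    /* format version, network byte order        */
        uint32_t recv_mtu;   /* largest IP datagram this side will accept */
    };

    void encode_priv(uint8_t buf[sizeof(struct ipoib_cm_priv)],
                     uint32_t my_recv_mtu)
    {
        struct ipoib_cm_priv p;
        p.version  = htonl(1);
        p.recv_mtu = htonl(my_recv_mtu);
        memcpy(buf, &p, sizeof(p));
    }

    /* The MTU we may use toward the peer is whatever the peer advertised,
     * independent of what we advertised: the two need not be symmetric.  */
    uint32_t decode_peer_recv_mtu(const uint8_t buf[sizeof(struct ipoib_cm_priv)])
    {
        struct ipoib_cm_priv p;
        memcpy(&p, buf, sizeof(p));
        return ntohl(p.recv_mtu);
    }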

    >
    >
    > >6. Multiple connections for the same IP address
    > >
    > >         Local decision. Note that the peer might choose to not honour
    > > multiple
    > >         connections.
    >
    > Agreed.
    >
    > Mike
    >
    >
    >
    >
    > >Vivek
    > >
    > >
    > >
    > >
    > >
    > >On Wed, 17 Nov 2004, Michael Krause wrote:
    > >
    > > > At 11:38 PM 11/16/2004, Vivek Kashyap wrote:
    > > >
    > > >
    > > >
    > > > >Hi,  I have a couple of questions relative to IPoIB:  1.
    > > > >draft-ietf-ipoib-ip-over-infiniband-07.txt states: "Every IPoIB interface
    > > > >MUST "FullMember" join the IB multicast group defined by the
    > > > >broadcast-GID."  Isn't the broadcast group for IPv4 ? When the IPoIB
    > > > >interface is IPv6 only, does this group still need be joined ?  If not,
    > > > >where do the parameters for any IPv6 groups come from ? I am presuming
    > > > >that this group needs to be joined in  the IPv6 only case. I just want to
    > > > >be sure.
    > > > ><VK> Yes, the broadcast-GID is at the InfiniBand layer and MUST be joined
    > > > >whether you are running at v4 or v6 layer. <VK>  2. ALso, what is the
    > > > >latest status of the Vivek's connected mode draft ? Will it be moving
    > > > >forward ?  <VK> I'll be submitting it as
    > > > >draft-ietf-ipoib-connected-mode-00.txt by the end of the month. There were
    > > > >some interesting suggestions that were made during the IETF WG meeting.
    > > > >Two of the suggestions of consequence are given below. The others we can
    > > > >discuss when the minutes are published (they include some additional
    > > > >requests on clarification on the transmission draft too).  a. The current
    > > > >draft makes the various modes mutually exclusive i.e. RC, UC and UD are
    > > > >not allowed simultaneously in the same IP subnet. The thought is that it
    > > > >is a link characteristic and hence different per connection mode. It was
    > > > >suggested that one be allowed to mix up RC/UC. This goes back to the
    > > > >original suggestion in the first draft which was:  IPoIB-UD must always be
    > > > >supported. Additionally, the interface can also support either both of RC
    > > > >and UC, or one of them. Or neither of them.
    > > > >
    > > > >UD MUST always be supported.
    > > > >
    > > > ><VK> That is and has always been the requirement right from the first
    > > > >draft. <VK>
    > > > >
    > > > >I personally don't care whether one does RC or UC but I don't think both
    > > > >are required as a MAY option. The advantage of RC is the send credit
    > > > >algorithm. The advantage of UC is the lack of ACK packets. ACK is noise in
    > > > >the fabric while send credits provide a simple method to maintain
    > > > >bandwidth / injection control on a per flow basis.
    > > > >
    > > > >I see no problems with supporting both UD and *C on the same subnet; it is
    > > > >rather foolish to attempt to mandate these be on separate subnets.
    > > > ><VK> As per the connected-mode draft the UD mechanism is *always*
    > > > >required; address resolution depends on it.
    > > > >
    > > > >The only point of discussion is whether all nodes must support the same
    > > > >link characteristics in the subnet i.e. all are RC (and UD), or all are UC
    > > > >(and UD), or all are UD only.
    > > >
    > > > Obviously I would oppose such a solution as it creates artificial
    > > > constraints with little benefit.
    > > >
    > > > >The alternative is to allow all the nodes to be mixed up with some nodes
    > > > >being RC/UD, others UC/UD and a third set UD only and yet others probably
    > > > >supporting all, within the same IP subnet. [Can the same serviceID be used
    > > > >by both RC and UC ?]
    > > > >
    > > > >The third alternative is to associate UD only, or UD + one of RC or UC, with
    > > > >the same interface. In such a case, if mismatched/unsupported connected
    > > > >modes are supported by two nodes then they fall back to UD. This option is
    > > > >not too different from the UD QP + RC or UC mechanism.
    > > >
    > > > KISS:
    > > >
    > > > - UD universal
    > > > - *C opportunistic
    > > >          - Local management issue to control what is sent on the *C
    > > > interface.  No need to specify
    > > >          - Advertise whether one or more ports are supported by UD or *C
    > > >          - Advertise whether one or more QP are supported by UD or *C
    > > >          - Let local management determine policy for what services are
    > > > mapped where - no need to specify
    > > >
    > > > This is both an interoperable approach and simple to implement.  There may
    > > > be some desire to add a policy interface to state preference for specific
    > > > types of traffic over a given QP.  I would not oppose this but would view
    > > > this as a separate draft once the basics are worked out.
    > > >
    > > >
    > > >
    > > > ><VK>
    > > > >b. Another suggestion was to allow multiple connected mode links (i.e. at
    > > > >IB UC/RC level) between peers.  One thought can be 'yes, but user beware':
    > > > >The IB connections are made using the service ID that is derived from the
    > > > >QPN as described in the draft. If a second attempt succeeds then there are
    > > > >two links. It is up to the implementation to either allow or disallow
    > > > >multiple links.
    > > > >
    > > > >Again, this has been suggested in the past (though most who were involved
    > > > >in the original discussions years gone by are likely gone since much of
    > > > >this discussion occurred before the IETF workgroup was established).
    > > > >
    > > > ><VK> I'm one of the vestiges of those early times along with you and a few
    > > > >others...so we have hope :). <VK>
    > > > >
    > > > >There is obvious benefit to supporting multiple RC per endnode pair. I do
    > > > >not see any technical reason to oppose nor any issue from an
    > > > >interoperability perspective. There is no reason for a "user beware".
    > > > >
    > > > ><VK> It is not opposed. The 'user beware' is only underscoring that the
    > > > >peer interface might not support multiple links - it might enforce a
    > > > >limited number of connections (maybe only one) between a pair of GIDs.
    > > > >Similarly, an implementation not wanting to support multiple links MUST
    > > > >take steps to deny multiple requests.
    > > >
    > > > *C requires CM to operate thus it is a local issue whether additional CM
    > > > operations are accepted or not.  A given requester node may issue N and a
    > > > given responder may state 0-N as an implementation may limit the number of
    > > > *C available for IP traffic.
    > > >
    > > >
    > > > ><VK>
    > > > >
    > > > >The work is rather straightforward to specify and implement, and the benefit to
    > > > >customers is, again, rather obvious when one considers what the IB fabric
    > > > >offers and how connections can enable flows through multipath as well
    > > > >as transparent fail-over, flow scheduling, mapping of DiffServ to
    > > > >different arbitration / paths, etc.
    > > > >
    > > > ><VK> In addition Large MTU and APM are two of the main reasons why I've
    > > > >been proposing IPoIB-connected mode for so long. In terms of IPoIB itself,
    > > > >except for the Large MTU, the parameters are hidden from it.<VK>
    > > >
    > > > Mike
    > >
    > >__
    > >
    > >Vivek Kashyap
    > >Linux Technology Center, IBM
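
On the service ID derived from the QPN (mentioned above): a minimal sketch of the idea, with the caveat that the prefix constant and bit layout here are illustrative placeholders rather than the draft's actual encoding. Embedding the advertised UD QPN in the 64-bit CM service ID lets the passive side associate an incoming REQ with the IPoIB interface that owns that QPN.

    #include <stdint.h>

    /* Illustrative placeholder prefix; the actual prefix and bit layout
     * defined by the draft are not reproduced here. */
    #define EXAMPLE_IPOIB_CM_SID_PREFIX 0x1000000000000000ULL

    static inline uint64_t ipoib_cm_service_id(uint32_t ud_qpn)
    {
        /* 24-bit QPN carried in the low bits of the service ID */
        return EXAMPLE_IPOIB_CM_SID_PREFIX | (uint64_t)(ud_qpn & 0x00FFFFFFu);
    }

    static inline uint32_t ipoib_cm_qpn_from_service_id(uint64_t sid)
    {
        return (uint32_t)(sid & 0x00FFFFFFu);
    }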

From ipoverib-bounces@ietf.org Thu Nov 18 18:12:35 2004
From: Michael Krause
To: IPoverIB
Date: Thu, 18 Nov 2004 14:50:50 -0800
Subject: Re: [Ipoverib] IPoIB-RC and Checksums

At 12:18 PM 11/18/2004, Yaron Haviv wrote:
    In GbE usually the NIC Tx Segmentation (large send) capability comes
    hand in hand with Checksum offload for greater efficiency (and zero
    copy)

    On UD we decided not to address checksum offloading, since we cannot
    guarantee that the node will not forward an un-checked packet

    Whereas in RC we can have examples of devices that can guarantee the
    checksum. One example is an IB-IP gateway that always checksums outgoing and
    incoming packets, and can act as a remote IP NIC to the host.

    I suggest we include a checksum option in the CM exchange, where a node can request that its peer not checksum packets for it, and also signal that the packets it sends are already checked.  That can help improve the performance of IPoIB-RC.

    P.S. Another point we discussed at the IETF was that we may want to
    mention/suggest (not mandate) use of SRQ for IPoIB-RC in order to
    conserve memory.

    I remain opposed to disabling checksums under any circumstances.  I do not believe there is a method to guarantee that a packet will not be routed by higher layers within the network stack without sticking one's nose way into the packet.  There is nothing that precludes an IB HCA from providing checksum off-load today through a private interface much as what is done with Ethernet today.   It isn't hard and presents no interoperability issues as it is a local optimization.  Attempting to do this as an optimization between endnode pairs makes this more complex and requires knowledge that may not be available between all combinations of endnode pairs.

    Mike
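
For reference, the checksum at issue is the 16-bit one's-complement Internet checksum (RFC 1071) carried by IPv4, TCP, and UDP. The sketch below is a plain, unoptimized version of the computation that a host stack, an offloading HCA, or an IB-IP gateway would have to perform; it is not taken from any of the drafts.

    #include <stdint.h>
    #include <stddef.h>

    /* RFC 1071 Internet checksum over 'len' bytes of 'data'.
     * Returns the 16-bit one's-complement result in host byte order. */
    uint16_t inet_checksum(const void *data, size_t len)
    {
        const uint8_t *p = data;
        uint32_t sum = 0;

        while (len > 1) {                      /* sum 16-bit words */
            sum += ((uint32_t)p[0] << 8) | p[1];
            p   += 2;
            len -= 2;
        }
        if (len == 1)                          /* pad a trailing odd byte */
            sum += (uint32_t)p[0] << 8;

        while (sum >> 16)                      /* fold the carries back in */
            sum = (sum & 0xFFFFu) + (sum >> 16);

        return (uint16_t)~sum;
    }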

From ipoverib-bounces@ietf.org Thu Nov 18 18:30:47 2004
From: Bill Strahm
To: Yaron Haviv
Cc: IPoverIB
Date: Thu, 18 Nov 2004 15:10:56 -0800
Subject: Re: [Ipoverib] IPoIB-RC and Checksums

Let me try to be clear about my understanding of the IESG's position. Not sending IP/TCP header checksums in an IP packet is a non-starter. Using checksum-offload technologies to accelerate these computations is one thing; sending packets with a checksum value of 0 and not checking on receive is another.

I talked with Allison Mankin in Washington D.C., and she was terrified of "raw" IB packets (i.e. RC/UD) getting out on the Internet and messing things up, because these protocols do not have congestion controls built in that will behave correctly with IP.

I would caution the group against removing checksumming from packets when it is relatively cheap to add hardware to HCAs/HBAs that can calculate the checksum before sending it on the wire.

Some comments inline.
Bill

On Thu, 2004-11-18 at 22:18 +0200, Yaron Haviv wrote:
> In GbE usually the NIC Tx Segmentation (large send) capability comes
> hand in hand with Checksum offload for greater efficiency (and zero
> copy)
>
> On UD we decided not to address checksum offloading, since we cannot
> guarantee that the node will not forward an un-checked packet

I do not believe either the IEEE or the IETF has ever addressed checksum offloading. I am not sure that there is a protocol piece to do here - it is an implementation issue between the OS and the hardware.

> Where as in RC we can have examples of devices that can guarantee
> checksum
> One example is an IB-IP gateway that always checksum outgoing and
> incoming packets, and can act as a remote IP NIC to the Host

Here you are talking about a different device. And again - I am not sure that there is an IETF standard here. Much as the IETF does not want to standardize iSER over IB (with no IP in the middle), I don't believe it wants to standardize Host/OS <--> NIC interactions. The device you are proposing does not have (require might be a better word) an IP interaction between the host and an IP offload NIC (I have heard of several implementations of things called a VNIC, or virtual NIC). I do not believe there is a proposal to standardize a VNIC protocol - and if there were, I do not believe it would be IETF work.

> I suggest we include a checksum option in the CM Exchange
> Where a node can request that its peer will not checksum the packet for
> it
> And also signal that he sends packets that are already checked
> That can help improve performance of IPoIB RC

I believe this is a non-starter in the IESG - Margaret, can you confirm this?

> P.S. another note, we discussed in IETF was that we may want to
> mention/suggest (not mandate) use of SRQ for IPoIB-RC in order to
> preserve memory

Again, in the spirit of wire protocol vs. implementation, I think this is an implementation issue that will not change the wire protocol at all. Is there a point where using SRQ vs. not using SRQ would have to change the wire protocol? If not, let's not say anything. If there is, I would be very interested in understanding it.

Bill
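
To illustrate why SRQ is purely a local matter, here is a sketch of creating one shared receive queue and attaching an RC QP to it. It uses the later OpenFabrics libibverbs API, which is an assumption of convenience (it is not what 2004-era stacks exposed); nothing in it is visible on the wire.

    #include <infiniband/verbs.h>

    /* Sketch: one SRQ shared by the IPoIB-RC QPs of an interface. */
    int create_rc_qp_with_srq(struct ibv_pd *pd, struct ibv_cq *cq,
                              struct ibv_srq **srq_out, struct ibv_qp **qp_out)
    {
        struct ibv_srq_init_attr srq_attr = {
            .attr = { .max_wr = 4096, .max_sge = 1 }  /* shared receive pool */
        };
        struct ibv_srq *srq = ibv_create_srq(pd, &srq_attr);
        if (!srq)
            return -1;

        struct ibv_qp_init_attr qp_attr = {
            .send_cq = cq,
            .recv_cq = cq,
            .srq     = srq,   /* receives are satisfied from the shared queue */
            .cap     = { .max_send_wr = 256, .max_send_sge = 1 },
            .qp_type = IBV_QPT_RC,
        };
        struct ibv_qp *qp = ibv_create_qp(pd, &qp_attr);
        if (!qp) {
            ibv_destroy_srq(srq);
            return -1;
        }

        *srq_out = srq;
        *qp_out  = qp;
        return 0;
    }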

From ipoverib-bounces@ietf.org Thu Nov 18 19:27:08 2004
From: Vivek Kashyap
To: Michael Krause
Cc: IPoverIB
Date: Thu, 18 Nov 2004 16:12:23 -0800 (PST)
Subject: Re: [Ipoverib] A Couple of IPoIB Questions

On Thu, 18 Nov 2004, Michael Krause wrote:

> At 11:33 AM 11/18/2004, Vivek Kashyap wrote:
> >See section 3.1 of the draft draft-kashyap-ipoib-connected-mode-02.txt.
> >It states that the RC and UC are mutually exclusive flags.
>
> My preference is to support only one of the two in the spec, not to have
> flags to indicate what is implemented.  The benefits of connected-mode
> operation should come from a single form of communication, not two.
A given subnet will support only one of the two, not both simultaneously. The
flag only indicates which type it is. RC and UC are both useful to different
people and implementations, so both are allowed. I suggest, though, that both
not be allowed in the same IPoIB subnet.

> >Issues such as in order delivery need to be considered: e.g. if RC and UD are
> >used to mix up the traffic, say of TCP segments of the same connection, they
> >may no longer be received in order.
>
> If a designer is stupid, they may do this. However, one would expect some
> intelligence here: specific data flows, DiffServ code points, or similar
> criteria can be used to determine which connection or which UD QP carries a
> flow, and an intelligent, predictable algorithm would then ensure that
> mix-and-match does not occur for a given TCP connection.  Given that multiple
> *C QPs can be supported, it is not tenable to state that all unicast must go
> over a given QP or that no unicast can occur on a UD QP.

You missed my point, which was that the specification cannot be silent on this
and say it is a local issue. That can lead to interoperability failure. The
specification must either support or disallow unicast communication over the
UD QP in an IPoIB-CM subnet.

You prefer that such communication be supported. That works. Any other
thoughts?

From ipoverib-bounces@ietf.org Thu Nov 18 19:33:53 2004
From: Vivek Kashyap
To: H.K. Jerry Chu
Cc: ipoverib@ietf.org
Date: Thu, 18 Nov 2004 15:59:50 -0800 (PST)
Subject: Re: [Ipoverib] comments on draft-kashyap-ipoib-connected-mode-02.txt

On Thu, 18 Nov 2004, H.K. Jerry Chu wrote:

> In the last IETF61 IPoIB meeting I made several comments on the
> connected mode draft. I'm sending them to the list for a general
> discussion. (Yes, I saw some discussion on the connected mode
> draft already. I'll try to catch up with the thread after this
> mail.)
>
> 1. The draft makes a distinction between IPoIB-CM interfaces
> and IPoIB-UD interfaces, and portrays IPoIB-UC or IPoIB-RC as
> separate subnets superimposed on top of an IPoIB-UD subnet.
>
> For the above to work, due to a lack of multicast support, a fully
> connected network by itself can't meet the requirement of an IP
> link unless multicast is fully emulated through the use of
> multiple unicasts. The latter is complex and cumbersome.

Exactly. The current draft also continues to use UD for multicast.

> A much simpler model, which I think was presented in earlier
> drafts, is to fold the use of IB connections fully into a
> regular IPoIB-UD subnet, allowing any two IPoIB nodes to
> optionally negotiate the use of an IB connection between themselves.

The difference between the earlier draft and this one is that I modified the
requirement on the UD QP. That is, it need not be that IPoIB-CM and IPoIB-UD
share a QP; any UD QP will do for IPoIB-CM. In effect an implementation can
still share the UD QP.

The only issue is whether the same IP subnet can contain pure IPoIB-UD nodes
mixed in with IPoIB-CM nodes, or whether all nodes must be of the same type:
 - all IPoIB-UD, or
 - all IPoIB-RC, or
 - all IPoIB-UC.

I believe "all of the same type" is a good option to choose.

> This much simplified model is not without its drawbacks. Some
> nice IP link attributes are no longer unique within a link.
> E.g., the link MTU now becomes a per-node-pair MTU. Moreover,
> the MTU size for multicast will be different from the MTU size
> for unicast if IB connections are used. IB UC/RC may exhibit
> different RAS, flow control, QoS or other link characteristics
> than UD. But I consider these problems a reasonable price to
> pay for seamless support of UC/RC mode in an IPoIB link
> defined by UD.
>
> 2. The negotiation of the per-connection MTU seems more
> complicated than necessary. I think all that is needed is for a
> node to advertise its own "receive MTU", that is, the MTU
> size its peer should never go over when sending packets
> to the local interface. Yes, this may break the traditional
> concept of "symmetric" MTUs. But we're already breaking the
> notion of per-link MTU, requiring a lot of changes in the host
> stack anyway. This additional breakage doesn't seem like much.
>
> I haven't verified if this asymmetric MTU matches well with
> IBA connections though. How about:

The MTU, I would think, is exchanged at the IB level during IPoIB-CM
connection setup. The IP layer at both ends keeps a per-connection MTU if the
implementation permits it. At the link layer the connection will not send
messages larger than the size requested by the peer.

> 3. Regarding allowing multiple IB connections between a node
> pair: since for a given IP address there is only one link-address,
> implying one QPN and hence one service-ID, if a single
> service-ID can be used to create multiple IB connections
> then this can happen transparently. Otherwise we've got a
> problem.
>
> Jerry
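
A minimal sketch of the per-node-pair state this model implies; the structure and field names are illustrative assumptions, not from the draft. The driver records, per destination, the receive MTU the peer advertised at connection setup and clamps unicast transmissions to it, while multicast continues to use the UD MTU derived from the broadcast group.

    #include <stdbool.h>
    #include <stdint.h>

    /* Illustrative per-destination entry kept by the driver below IP. */
    struct ipoib_peer {
        uint8_t  gid[16];        /* peer's IB GID                           */
        uint32_t remote_qpn;     /* connected QP number, once established   */
        bool     connected;      /* is an RC/UC connection up to this peer? */
        uint32_t peer_recv_mtu;  /* MTU the peer said it can receive        */
    };

    /* MTU to use for a unicast send to this peer.  'ud_mtu' is the payload
     * limit of the UD QP (from the broadcast group), also used for multicast. */
    uint32_t tx_mtu_for(const struct ipoib_peer *p, uint32_t ud_mtu)
    {
        if (p->connected && p->peer_recv_mtu != 0)
            return p->peer_recv_mtu;   /* asymmetric: the peer's limit, not ours */
        return ud_mtu;                 /* fall back to the UD path               */
    }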
> > Jerry > > > _______________________________________________ > IPoverIB mailing list > IPoverIB@ietf.org > https://www1.ietf.org/mailman/listinfo/ipoverib > > _______________________________________________ IPoverIB mailing list IPoverIB@ietf.org https://www1.ietf.org/mailman/listinfo/ipoverib From ipoverib-bounces@ietf.org Thu Nov 18 19:34:22 2004 Received: from megatron.ietf.org (megatron.ietf.org [132.151.6.71]) by ietf.org (8.9.1a/8.9.1a) with ESMTP id TAA09590 for ; Thu, 18 Nov 2004 19:34:22 -0500 (EST) Received: from localhost.localdomain ([127.0.0.1] helo=megatron.ietf.org) by megatron.ietf.org with esmtp (Exim 4.32) id 1CUwaU-0004ss-51; Thu, 18 Nov 2004 19:25:18 -0500 Received: from odin.ietf.org ([132.151.1.176] helo=ietf.org) by megatron.ietf.org with esmtp (Exim 4.32) id 1CUwZg-0004AU-Fh for ipoverib@megatron.ietf.org; Thu, 18 Nov 2004 19:24:29 -0500 Received: from ietf-mx.ietf.org (ietf-mx.ietf.org [132.151.6.1]) by ietf.org (8.9.1a/8.9.1a) with ESMTP id TAA08446 for ; Thu, 18 Nov 2004 19:24:24 -0500 (EST) Received: from palrel13.hp.com ([156.153.255.238]) by ietf-mx.ietf.org with esmtp (Exim 4.33) id 1CUwcE-0003qD-AE for ipoverib@ietf.org; Thu, 18 Nov 2004 19:27:17 -0500 Received: from esmail.cup.hp.com (esmail.cup.hp.com [15.0.65.164]) by palrel13.hp.com (Postfix) with ESMTP id 31D3E1C02E95; Thu, 18 Nov 2004 16:24:16 -0800 (PST) Received: from MK73191c.cup.hp.com ([15.244.201.129]) by esmail.cup.hp.com (8.9.3 (PHNE_29774)/8.8.6) with ESMTP id QAA29957; Thu, 18 Nov 2004 16:21:47 -0800 (PST) Message-Id: <6.1.2.0.2.20041118161705.0cbc5900@esmail.cup.hp.com> X-Sender: krause@esmail.cup.hp.com X-Mailer: QUALCOMM Windows Eudora Version 6.1.2.0 Date: Thu, 18 Nov 2004 16:22:35 -0800 To: Vivek Kashyap From: Michael Krause Subject: Re: [Ipoverib] A Couple of IPoIB Questions In-Reply-To: References: <6.1.2.0.2.20041118132352.0c98a550@esmail.cup.hp.com> Mime-Version: 1.0 X-Spam-Score: 0.0 (/) X-Scan-Signature: 0e9ebc0cbd700a87c0637ad0e2c91610 Cc: IPoverIB X-BeenThere: ipoverib@ietf.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: IP over InfiniBand WG Discussion List List-Unsubscribe: , List-Post: List-Help: List-Subscribe: , Content-Type: multipart/mixed; boundary="===============1908633025==" Sender: ipoverib-bounces@ietf.org Errors-To: ipoverib-bounces@ietf.org --===============1908633025== Content-Type: multipart/alternative; boundary="=====================_88959486==.ALT" --=====================_88959486==.ALT Content-Type: text/plain; charset="us-ascii"; format=flowed At 04:12 PM 11/18/2004, Vivek Kashyap wrote: >On Thu, 18 Nov 2004, Michael Krause wrote: > > > At 11:33 AM 11/18/2004, Vivek Kashyap wrote: > > >On Thu, 18 Nov 2004, Michael Krause wrote: > > > > > > > At 10:46 PM 11/17/2004, Vivek Kashyap wrote: > > > > >Mike the format is really off in the last mail from you - making it > > > difficult > > > > >to follow. > > > > > > > > > > > > > > >Other than that let us discuss in the context of the draft. The > draft is > > > > >built upon the following: > > > > > > > > > >1. IPoIB-RC and IPoIB-UC are optional. > > > > > > > > I would prefer only one be used - either RC or UC. I've provided some > > > > logic for either one as a preference but don't see a reason to have > > > > both. Both just leads to options which leads to interoperability > problems. > > > > > >ok. > > >See section 3.1 of the draft draft-kashyap-ipoib-connected-mode-02.txt. > > >It states that the RC and UC are mutually exclusive flags. 
> > > > My preference is to only support one of the two in a spec not to have > flags > > to indicate what is implemented. The benefits of connected mode operation > > should be done with only one form of communication not two. > >A given subnet will support only one of the two. Not both simultaneously. The >flag only indicates which type it is. RC and UC are both useful to different >people and implementations so both are allowed. I suggest that both not be >allowed in the same IPoIB subnet though. To be explicit, I think there is benefit in implementing one and only one of the two. Having two options serves no purpose and adds unnecessary complexity. Interoperability will end up requiring both to be done if customers are to not get upset. Let's just pick one of the two and apply KISS. To get this started, I'll propose RC as that is a bit nicer to the fabric than UC and is already implemented in most OS and CA drivers today so it makes it faster to adopt with minimal driver software update. > > If a designer is stupid, they may do this. However, one would expect some > > intelligence here and one may prefer to have specific data flows or > > DiffServ code points or whatever used to determine which connection or > > which UD QP and that one would again apply an intelligent and predictable > > algorithm such that mix-n-match for a given TCP connection does not > > occur. Given multiple *C QP can be supported, it is not tenable to state > > that all unicast must go over a given QP or that no unicast can occur on a > > UD QP. > > > >You mised my point which was that the specification cannot be silent on this >and say it is a local issue. That can lead to interoperability failure. The >specification must support or disallow unicast communication over UD QP >in an >IPoIB-CM. > >You prefer that such communication be supported. That works. Any other >thoughts? I prefer that guidance be provided and that it remain a local implementation issue as to what QP is used for a given flow. I do not see interoperability issues only potential performance if people are stupid. The industry has a way to deal with stupidity and too much time is spent on preventing people from being stupid. Even a so-so intelligent implementation could have a simple flag for a given target IP address that states which QP to target for all or a subset of the flows with minimal cost to implement and troubleshoot / validate. Mike --=====================_88959486==.ALT Content-Type: text/html; charset="us-ascii" At 04:12 PM 11/18/2004, Vivek Kashyap wrote:
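To make the "simple flag for a given target IP address" idea above concrete, here is a minimal sketch in C of the per-destination state a sender might keep. The structure and names are hypothetical and not taken from the connected-mode draft; the policy shown (prefer the connection when one exists, otherwise fall back to the UD QP) is only one of the local choices the thread leaves open.

    /*
     * Hypothetical per-destination state for an IPoIB-CM sender.  Not from
     * the draft: names and fields are illustrative only.  The point is the
     * "simple flag" per target IP address selecting which QP carries
     * unicast toward that peer.
     */
    #include <stddef.h>
    #include <stdint.h>

    enum ipoib_path_mode {
        IPOIB_PATH_UD,        /* send on the shared UD QP (always available)   */
        IPOIB_PATH_CONNECTED  /* send on the RC/UC connection, if established  */
    };

    struct ipoib_neigh {
        uint32_t             remote_qpn;  /* link-layer address: peer's UD QPN  */
        enum ipoib_path_mode mode;        /* the per-destination flag           */
        void                *conn;        /* opaque handle to the CM connection */
    };

    /* One possible local policy: use the connection when it exists, fall
     * back to UD otherwise (e.g. before the connection has been set up). */
    static enum ipoib_path_mode ipoib_select_path(const struct ipoib_neigh *n)
    {
        if (n->mode == IPOIB_PATH_CONNECTED && n->conn != NULL)
            return IPOIB_PATH_CONNECTED;
        return IPOIB_PATH_UD;
    }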
    Mike --=====================_88959486==.ALT-- --===============1908633025== Content-Type: text/plain; charset="us-ascii" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit Content-Disposition: inline Content-Transfer-Encoding: 7bit _______________________________________________ IPoverIB mailing list IPoverIB@ietf.org https://www1.ietf.org/mailman/listinfo/ipoverib --===============1908633025==-- From ipoverib-bounces@ietf.org Thu Nov 18 20:18:22 2004 Received: from megatron.ietf.org (megatron.ietf.org [132.151.6.71]) by ietf.org (8.9.1a/8.9.1a) with ESMTP id UAA13345 for ; Thu, 18 Nov 2004 20:18:22 -0500 (EST) Received: from localhost.localdomain ([127.0.0.1] helo=megatron.ietf.org) by megatron.ietf.org with esmtp (Exim 4.32) id 1CUxMe-00025K-NN; Thu, 18 Nov 2004 20:15:04 -0500 Received: from odin.ietf.org ([132.151.1.176] helo=ietf.org) by megatron.ietf.org with esmtp (Exim 4.32) id 1CUxLa-0001MC-9l for ipoverib@megatron.ietf.org; Thu, 18 Nov 2004 20:13:58 -0500 Received: from ietf-mx.ietf.org (ietf-mx.ietf.org [132.151.6.1]) by ietf.org (8.9.1a/8.9.1a) with ESMTP id UAA12878 for ; Thu, 18 Nov 2004 20:13:56 -0500 (EST) Received: from e33.co.us.ibm.com ([32.97.110.131]) by ietf-mx.ietf.org with esmtp (Exim 4.33) id 1CUxOJ-0004uo-2G for ipoverib@ietf.org; Thu, 18 Nov 2004 20:16:47 -0500 Received: from d03relay04.boulder.ibm.com (d03relay04.boulder.ibm.com [9.17.195.106]) by e33.co.us.ibm.com (8.12.10/8.12.9) with ESMTP id iAJ1DQJT729970 for ; Thu, 18 Nov 2004 20:13:26 -0500 Received: from d03av04.boulder.ibm.com (d03av04.boulder.ibm.com [9.17.195.170]) by d03relay04.boulder.ibm.com (8.12.10/NCO/VER6.6) with ESMTP id iAJ1DPmo134128 for ; Thu, 18 Nov 2004 18:13:25 -0700 Received: from d03av04.boulder.ibm.com (loopback [127.0.0.1]) by d03av04.boulder.ibm.com (8.12.11/8.12.11) with ESMTP id iAJ1DPVX012032 for ; Thu, 18 Nov 2004 18:13:25 -0700 Received: from DYN319548.beaverton.ibm.com (DYN319548.beaverton.ibm.com [9.47.22.85]) by d03av04.boulder.ibm.com (8.12.11/8.12.11) with ESMTP id iAJ1DO1B012019; Thu, 18 Nov 2004 18:13:25 -0700 Date: Thu, 18 Nov 2004 17:14:22 -0800 (PST) From: Vivek Kashyap X-X-Sender: kashyapv@dyn319548.beaverton.ibm.com To: Michael Krause Subject: Re: [Ipoverib] A Couple of IPoIB Questions In-Reply-To: <6.1.2.0.2.20041118161705.0cbc5900@esmail.cup.hp.com> Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII X-Spam-Score: 0.0 (/) X-Scan-Signature: 10d3e4e3c32e363f129e380e644649be Cc: IPoverIB X-BeenThere: ipoverib@ietf.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: IP over InfiniBand WG Discussion List List-Unsubscribe: , List-Post: List-Help: List-Subscribe: , Sender: ipoverib-bounces@ietf.org Errors-To: ipoverib-bounces@ietf.org On Thu, 18 Nov 2004, Michael Krause wrote: > At 04:12 PM 11/18/2004, Vivek Kashyap wrote: > >On Thu, 18 Nov 2004, Michael Krause wrote: > > > > > At 11:33 AM 11/18/2004, Vivek Kashyap wrote: > > > >On Thu, 18 Nov 2004, Michael Krause wrote: > > > > > > > > > At 10:46 PM 11/17/2004, Vivek Kashyap wrote: > > > > > >Mike the format is really off in the last mail from you - making it > > > > difficult > > > > > >to follow. > > > > > > > > > > > > > > > > > >Other than that let us discuss in the context of the draft. The > > draft is > > > > > >built upon the following: > > > > > > > > > > > >1. IPoIB-RC and IPoIB-UC are optional. > > > > > > > > > > I would prefer only one be used - either RC or UC. I've provided some > > > > > logic for either one as a preference but don't see a reason to have > > > > > both. 
Both just leads to options which leads to interoperability > > problems. > > > > > > > >ok. > > > >See section 3.1 of the draft draft-kashyap-ipoib-connected-mode-02.txt. > > > >It states that the RC and UC are mutually exclusive flags. > > > > > > My preference is to only support one of the two in a spec not to have > > flags > > > to indicate what is implemented. The benefits of connected mode operation > > > should be done with only one form of communication not two. > > > >A given subnet will support only one of the two. Not both simultaneously. The > >flag only indicates which type it is. RC and UC are both useful to different > >people and implementations so both are allowed. I suggest that both not be > >allowed in the same IPoIB subnet though. > RC and UC both have benefits. There is almost no difference other than the connection flag between the two. > To be explicit, I think there is benefit in implementing one and only one > of the two. Having two options serves no purpose and adds unnecessary > complexity. Interoperability will end up requiring both to be done if > customers are to not get upset. Let's just pick one of the two and apply > KISS. To get this started, I'll propose RC as that is a bit nicer to the > fabric than UC and is already implemented in most OS and CA drivers today > so it makes it faster to adopt with minimal driver software update. > > > > > > If a designer is stupid, they may do this. However, one would expect some > > > intelligence here and one may prefer to have specific data flows or > > > DiffServ code points or whatever used to determine which connection or > > > which UD QP and that one would again apply an intelligent and predictable > > > algorithm such that mix-n-match for a given TCP connection does not > > > occur. Given multiple *C QP can be supported, it is not tenable to state > > > that all unicast must go over a given QP or that no unicast can occur on a > > > UD QP. > > > > > > >You mised my point which was that the specification cannot be silent on this > >and say it is a local issue. That can lead to interoperability failure. The > >specification must support or disallow unicast communication over UD QP > >in an > >IPoIB-CM. > > > >You prefer that such communication be supported. That works. Any other > >thoughts? > > I prefer that guidance be provided and that it remain a local > implementation issue as to what QP is used for a given flow. I do not see > interoperability issues only potential performance if people are > stupid. The industry has a way to deal with stupidity and too much time is > spent on preventing people from being stupid. Even a so-so intelligent > implementation could have a simple flag for a given target IP address that > states which QP to target for all or a subset of the flows with minimal > cost to implement and troubleshoot / validate. If something is left unspecified there is every chance that incompatible implementations result - this has nothing to do with the mental faculties of the implementors. Therefore I'll add relevant text. 
> > Mike _______________________________________________ IPoverIB mailing list IPoverIB@ietf.org https://www1.ietf.org/mailman/listinfo/ipoverib From ipoverib-bounces@ietf.org Thu Nov 18 20:56:36 2004 Received: from megatron.ietf.org (megatron.ietf.org [132.151.6.71]) by ietf.org (8.9.1a/8.9.1a) with ESMTP id UAA16397 for ; Thu, 18 Nov 2004 20:56:36 -0500 (EST) Received: from localhost.localdomain ([127.0.0.1] helo=megatron.ietf.org) by megatron.ietf.org with esmtp (Exim 4.32) id 1CUxvN-0001W0-Fg; Thu, 18 Nov 2004 20:50:57 -0500 Received: from odin.ietf.org ([132.151.1.176] helo=ietf.org) by megatron.ietf.org with esmtp (Exim 4.32) id 1CUxr5-0000ae-0H for ipoverib@megatron.ietf.org; Thu, 18 Nov 2004 20:46:31 -0500 Received: from ietf-mx.ietf.org (ietf-mx.ietf.org [132.151.6.1]) by ietf.org (8.9.1a/8.9.1a) with ESMTP id UAA15725 for ; Thu, 18 Nov 2004 20:46:28 -0500 (EST) Received: from palrel12.hp.com ([156.153.255.237]) by ietf-mx.ietf.org with esmtp (Exim 4.33) id 1CUxtj-0005cu-Rz for ipoverib@ietf.org; Thu, 18 Nov 2004 20:49:17 -0500 Received: from esmail.cup.hp.com (esmail.cup.hp.com [15.0.65.164]) by palrel12.hp.com (Postfix) with ESMTP id E7AA3406C11; Thu, 18 Nov 2004 17:46:24 -0800 (PST) Received: from MK73191c.cup.hp.com ([15.244.201.129]) by esmail.cup.hp.com (8.9.3 (PHNE_29774)/8.8.6) with ESMTP id RAA05281; Thu, 18 Nov 2004 17:43:55 -0800 (PST) Message-Id: <6.1.2.0.2.20041118174214.0cbee5a8@esmail.cup.hp.com> X-Sender: krause@esmail.cup.hp.com X-Mailer: QUALCOMM Windows Eudora Version 6.1.2.0 Date: Thu, 18 Nov 2004 17:45:09 -0800 To: Vivek Kashyap From: Michael Krause Subject: Re: [Ipoverib] A Couple of IPoIB Questions In-Reply-To: References: <6.1.2.0.2.20041118161705.0cbc5900@esmail.cup.hp.com> Mime-Version: 1.0 X-Spam-Score: 0.0 (/) X-Scan-Signature: 1449ead51a2ff026dcb23465f5379250 Cc: IPoverIB X-BeenThere: ipoverib@ietf.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: IP over InfiniBand WG Discussion List List-Unsubscribe: , List-Post: List-Help: List-Subscribe: , Content-Type: multipart/mixed; boundary="===============0416962128==" Sender: ipoverib-bounces@ietf.org Errors-To: ipoverib-bounces@ietf.org --===============0416962128== Content-Type: multipart/alternative; boundary="=====================_93881494==.ALT" --=====================_93881494==.ALT Content-Type: text/plain; charset="us-ascii"; format=flowed At 05:14 PM 11/18/2004, Vivek Kashyap wrote: >RC and UC both have benefits. There is almost no difference other than >the connection flag between the two. Many host OS implementations do not support UC as RC and UD are all that is really required within the industry. The ACK overhead associated with RC is truly noise and the end-to-end credits are very nice as IB now supports three signaling rates combined with 4 link widths (though only three are really being implemented). Such a permutation in bandwidth capability makes RC a more tenable / good citizen as we designed it to be so I'd prefer RC. > > To be explicit, I think there is benefit in implementing one and only one > > of the two. Having two options serves no purpose and adds unnecessary > > complexity. Interoperability will end up requiring both to be done if > > customers are to not get upset. Let's just pick one of the two and apply > > KISS. To get this started, I'll propose RC as that is a bit nicer to the > > fabric than UC and is already implemented in most OS and CA drivers today > > so it makes it faster to adopt with minimal driver software update. 
> > > > > > > > > > If a designer is stupid, they may do this. However, one would > expect some > > > > intelligence here and one may prefer to have specific data flows or > > > > DiffServ code points or whatever used to determine which connection or > > > > which UD QP and that one would again apply an intelligent and > predictable > > > > algorithm such that mix-n-match for a given TCP connection does not > > > > occur. Given multiple *C QP can be supported, it is not tenable to > state > > > > that all unicast must go over a given QP or that no unicast can > occur on a > > > > UD QP. > > > > > > > > > >You mised my point which was that the specification cannot be silent > on this > > >and say it is a local issue. That can lead to interoperability > failure. The > > >specification must support or disallow unicast communication over UD QP > > >in an > > >IPoIB-CM. > > > > > >You prefer that such communication be supported. That works. Any other > > >thoughts? > > > > I prefer that guidance be provided and that it remain a local > > implementation issue as to what QP is used for a given flow. I do not see > > interoperability issues only potential performance if people are > > stupid. The industry has a way to deal with stupidity and too much > time is > > spent on preventing people from being stupid. Even a so-so intelligent > > implementation could have a simple flag for a given target IP address that > > states which QP to target for all or a subset of the flows with minimal > > cost to implement and troubleshoot / validate. > >If something is left unspecified there is every chance that incompatible >implementations result - this has nothing to do with the mental faculties >of the implementors. Therefore I'll add relevant text. One can just provide an implementation note to avoid any mental short comings and still avoid specifying this. That should be sufficient while maintaining KISS. Mike --=====================_93881494==.ALT Content-Type: text/html; charset="us-ascii" At 05:14 PM 11/18/2004, Vivek Kashyap wrote:
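As an aside on how small the RC/UC difference is at the verbs level: in the later libibverbs API (which post-dates this thread and is used here purely for illustration, not as the API the draft assumes), creating a UC queue pair instead of an RC one changes nothing but the qp_type constant; the protocol-level differences in acknowledgements, retries, and end-to-end credits all sit behind that one field.

    /* Illustration only, using libibverbs (a later API than those in use
     * when this thread was written): an RC and a UC QP are created the
     * same way except for qp_type. */
    #include <infiniband/verbs.h>

    static struct ibv_qp *create_connected_qp(struct ibv_pd *pd,
                                              struct ibv_cq *cq,
                                              int use_rc /* nonzero => RC */)
    {
        struct ibv_qp_init_attr attr = {
            .send_cq = cq,
            .recv_cq = cq,
            .cap     = {
                .max_send_wr  = 256,
                .max_recv_wr  = 256,
                .max_send_sge = 1,
                .max_recv_sge = 1,
            },
            .qp_type = use_rc ? IBV_QPT_RC : IBV_QPT_UC,
        };
        return ibv_create_qp(pd, &attr);   /* NULL on failure */
    }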


    --=====================_93881494==.ALT-- --===============0416962128== Content-Type: text/plain; charset="us-ascii" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit Content-Disposition: inline Content-Transfer-Encoding: 7bit _______________________________________________ IPoverIB mailing list IPoverIB@ietf.org https://www1.ietf.org/mailman/listinfo/ipoverib --===============0416962128==-- From ipoverib-bounces@ietf.org Fri Nov 19 13:31:45 2004 Received: from megatron.ietf.org (megatron.ietf.org [132.151.6.71]) by ietf.org (8.9.1a/8.9.1a) with ESMTP id NAA29827 for ; Fri, 19 Nov 2004 13:31:44 -0500 (EST) Received: from localhost.localdomain ([127.0.0.1] helo=megatron.ietf.org) by megatron.ietf.org with esmtp (Exim 4.32) id 1CVDUU-0007XN-8m; Fri, 19 Nov 2004 13:28:14 -0500 Received: from odin.ietf.org ([132.151.1.176] helo=ietf.org) by megatron.ietf.org with esmtp (Exim 4.32) id 1CVDOO-0005wq-B2 for ipoverib@megatron.ietf.org; Fri, 19 Nov 2004 13:21:56 -0500 Received: from ietf-mx.ietf.org (ietf-mx.ietf.org [132.151.6.1]) by ietf.org (8.9.1a/8.9.1a) with ESMTP id NAA28698 for ; Fri, 19 Nov 2004 13:21:53 -0500 (EST) Received: from nwkea-mail-2.sun.com ([192.18.42.14]) by ietf-mx.ietf.org with esmtp (Exim 4.33) id 1CVDRD-0002yy-5B for ipoverib@ietf.org; Fri, 19 Nov 2004 13:24:54 -0500 Received: from jurassic.eng.sun.com ([129.146.89.31]) by nwkea-mail-2.sun.com (8.12.10/8.12.9) with ESMTP id iAJILppv024996; Fri, 19 Nov 2004 10:21:51 -0800 (PST) Received: from taipei (taipei.SFBay.Sun.COM [129.146.85.178]) by jurassic.eng.sun.com (8.13.1+Sun/8.13.1) with SMTP id iAJILoCl397285; Fri, 19 Nov 2004 10:21:51 -0800 (PST) Message-Id: <200411191821.iAJILoCl397285@jurassic.eng.sun.com> Date: Fri, 19 Nov 2004 10:20:16 -0800 (PST) From: "H.K. Jerry Chu" Subject: Re: [Ipoverib] comments on draft-kashyap-ipoib-connected-mode-02.txt To: kashyapv@us.ibm.com MIME-Version: 1.0 Content-Type: TEXT/plain; charset=us-ascii Content-MD5: V65NekPioitPmWqWw9llYg== X-Mailer: dtmail 1.3.0 @(#)CDE Version 1.6_68 SunOS 5.10 sun4u sparc X-Spam-Score: 0.0 (/) X-Scan-Signature: 5011df3e2a27abcc044eaa15befcaa87 Cc: ipoverib@ietf.org X-BeenThere: ipoverib@ietf.org X-Mailman-Version: 2.1.5 Precedence: list Reply-To: "H.K. Jerry Chu" List-Id: IP over InfiniBand WG Discussion List List-Unsubscribe: , List-Post: List-Help: List-Subscribe: , Sender: ipoverib-bounces@ietf.org Errors-To: ipoverib-bounces@ietf.org >> A much simpler model, which I think was presented in earlier >> drafts, is to fold the use of IB connections fully into a >> regular IPoIB-UD subnet, allowing any two IPoIB nodes to >> optionally negotiate the use of IB connection between themselves. > >The difference in the earlier draft and this one is that >I modified the requirement on the UD QP. That is, it need not be that >IPoIB-CM and IPoIB-UD share a QP but that any UD QP will do for IPoIB-CM. >In effect an implementation can still share the UD QP. > >The only issue is whether the same IP subnet can contain pure >IPoIB-UD mixed in with IPoIB-CM nodes or, all nodes must be of the same type. > - all IPoIB-UD >or > - all IPoIB-RC > >or -- all IPoIB-UC > >I beleive all of the same type is a good option to choose. I don't see a clear benefit for this restriction. E.g., even in all IPoIB-RC or IPoIB-UC, the nice per-link MTU property is no longer there due to multicast supported through UD. 
Also this restriction will require those implementations that don't support IPoIB over UC or RC to form a different subnet in order to talk IPoIB, hence forcing the adminstrator to maintain at least two IP subnets with one fully contained within another. I don't see why this is needed. > >> >> This much simplified model is not without its drawback. Some >> nice IP link attributes are no longer unique within a link. >> E.g., the link MTU now becomes per-node-pair MTU. Moreover, >> the MTU size for multicast will be different from the MTU size >> for unicast if IB connections are used. IB UC/RC may exhibit >> different RAS, flow control, QoS or other link characteristics >> than UD. But I consider these problems a reasonable price to >> pay for a seamless support of UC/RC mode in an IPoIB link >> defined by UD. >> >> 2. The negotiation of the per-connection MTU seems more >> complicated than necessary. I think all is needed is for a >> node to advertise its own "receive MTU". That is, the MTU >> size its peer should never go over when sending packets >> to the local interface. Yes this may break the traditional >> concept of "symmetric" MTUs. But we're already breaking the >> notion of per-link MTU, requring a lot of changes in the host >> stack anyway. This additonal breakage doesn't seem much. >> >> I haven't verified if this asymmetric MTU matches well with >> IBA connections though. > >How about: > >The MTU I would think is exchanged at the IB level during the >IPoIB-CM connection setup. The IP layer at both ends keeps a per connection >MTU if the implementation permits it. At the link layer the connection will >not send messages larger than that requested by the peer. Not quite understand the above. I'm suggesting to simplify the MTU negotiation at the IPoIB-CM connection setup time by each side advertising the "receive MTU" it can take. The peer must not send more than that size in each post_send(). E.g., if node A advertizes 32KB as its receive MTU and node B advertizes 64KB as its receive MTU, node B must not send any IP pkt through IPoIB-CM to node A that is larger than 32KB. Node A is free to send IP pkts of up to 64KB in size to node B. (But if node A decides to restrict its outbound MTU to 32KB, that's fine too. Node B doesn't need to know about it.) I'm not sure what you mean by the last two sentences above. MTU value must be made known to the IP layer so that latter won't send anything larger than that. Otherwise the pkt will get dropped by the IB layer (unless the latter performs SAR, which is a bad idea). Jerry > > >> >> 3. Regarding allowing multiple IB connections between a node >> pair, since given an IP address there is only one link-address >> for it implying one QPN, hence one service-ID, if a single >> service-ID can be used to create multiple IB connections >> then this can happen transparently. Otherwise we've got a >> problem. 
>> >> Jerry >> >> >> _______________________________________________ >> IPoverIB mailing list >> IPoverIB@ietf.org >> https://www1.ietf.org/mailman/listinfo/ipoverib >> >> > > > _______________________________________________ IPoverIB mailing list IPoverIB@ietf.org https://www1.ietf.org/mailman/listinfo/ipoverib From ipoverib-bounces@ietf.org Fri Nov 19 22:51:10 2004 Received: from megatron.ietf.org (megatron.ietf.org [132.151.6.71]) by ietf.org (8.9.1a/8.9.1a) with ESMTP id WAA26309 for ; Fri, 19 Nov 2004 22:51:10 -0500 (EST) Received: from localhost.localdomain ([127.0.0.1] helo=megatron.ietf.org) by megatron.ietf.org with esmtp (Exim 4.32) id 1CVMBo-0001FZ-2t; Fri, 19 Nov 2004 22:45:32 -0500 Received: from odin.ietf.org ([132.151.1.176] helo=ietf.org) by megatron.ietf.org with esmtp (Exim 4.32) id 1CVM4F-0007Q2-N6 for ipoverib@megatron.ietf.org; Fri, 19 Nov 2004 22:37:43 -0500 Received: from ietf-mx.ietf.org (ietf-mx.ietf.org [132.151.6.1]) by ietf.org (8.9.1a/8.9.1a) with ESMTP id WAA25297 for ; Fri, 19 Nov 2004 22:37:41 -0500 (EST) Received: from atorelbas04.hp.com ([156.153.255.238] helo=palrel13.hp.com) by ietf-mx.ietf.org with esmtp (Exim 4.33) id 1CVM7C-0001ld-MB for ipoverib@ietf.org; Fri, 19 Nov 2004 22:40:47 -0500 Received: from esmail.cup.hp.com (esmail.cup.hp.com [15.0.65.164]) by palrel13.hp.com (Postfix) with ESMTP id A64281C00417 for ; Fri, 19 Nov 2004 19:37:41 -0800 (PST) Received: from MK73191c.cup.hp.com (mk731916.cup.hp.com [15.8.80.134]) by esmail.cup.hp.com (8.9.3 (PHNE_29774)/8.8.6) with ESMTP id TAA02474 for ; Fri, 19 Nov 2004 19:35:20 -0800 (PST) Message-Id: <6.1.2.0.2.20041119192315.04b9bab0@esmail.cup.hp.com> X-Sender: krause@esmail.cup.hp.com X-Mailer: QUALCOMM Windows Eudora Version 6.1.2.0 Date: Fri, 19 Nov 2004 19:30:36 -0800 To: "IPoverIB" From: Michael Krause Subject: Re: [Ipoverib] comments on draft-kashyap-ipoib-connected-mode-02.txt In-Reply-To: <200411191821.iAJILoCl397285@jurassic.eng.sun.com> References: <200411191821.iAJILoCl397285@jurassic.eng.sun.com> Mime-Version: 1.0 X-Spam-Score: 0.0 (/) X-Scan-Signature: 3f3e54d3c03ed638c06aa9fa6861237e X-BeenThere: ipoverib@ietf.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: IP over InfiniBand WG Discussion List List-Unsubscribe: , List-Post: List-Help: List-Subscribe: , Content-Type: multipart/mixed; boundary="===============1141799541==" Sender: ipoverib-bounces@ietf.org Errors-To: ipoverib-bounces@ietf.org --===============1141799541== Content-Type: multipart/alternative; boundary="=====================_47754637==.ALT" --=====================_47754637==.ALT Content-Type: text/plain; charset="us-ascii"; format=flowed At 10:20 AM 11/19/2004, H.K. Jerry Chu wrote: > > > >> A much simpler model, which I think was presented in earlier > >> drafts, is to fold the use of IB connections fully into a > >> regular IPoIB-UD subnet, allowing any two IPoIB nodes to > >> optionally negotiate the use of IB connection between themselves. > > > >The difference in the earlier draft and this one is that > >I modified the requirement on the UD QP. That is, it need not be that > >IPoIB-CM and IPoIB-UD share a QP but that any UD QP will do for IPoIB-CM. > >In effect an implementation can still share the UD QP. > > > >The only issue is whether the same IP subnet can contain pure > >IPoIB-UD mixed in with IPoIB-CM nodes or, all nodes must be of the same > type. > > - all IPoIB-UD > >or > > - all IPoIB-RC > > > >or -- all IPoIB-UC > > > >I beleive all of the same type is a good option to choose. 
> >I don't see a clear benefit for this restriction. E.g., even in all >IPoIB-RC or >IPoIB-UC, the nice per-link MTU property is no longer there due to multicast >supported through UD. Also this restriction will require those implementations >that don't support IPoIB over UC or RC to form a different subnet in order to >talk IPoIB, hence forcing the adminstrator to maintain at least two IP subnets >with one fully contained within another. I don't see why this is needed. I maintain that *C and UD can co-exist in the same IP subnet and there is no reason to restrict this. Endnode pairs will establish their communication paths and take the appropriate QP to reach a given destination. This is all a local issue in the end sans the all unicast debate in an earlier string. > > > >> > >> This much simplified model is not without its drawback. Some > >> nice IP link attributes are no longer unique within a link. > >> E.g., the link MTU now becomes per-node-pair MTU. Moreover, > >> the MTU size for multicast will be different from the MTU size > >> for unicast if IB connections are used. IB UC/RC may exhibit > >> different RAS, flow control, QoS or other link characteristics > >> than UD. But I consider these problems a reasonable price to > >> pay for a seamless support of UC/RC mode in an IPoIB link > >> defined by UD. > >> > >> 2. The negotiation of the per-connection MTU seems more > >> complicated than necessary. I think all is needed is for a > >> node to advertise its own "receive MTU". That is, the MTU > >> size its peer should never go over when sending packets > >> to the local interface. Yes this may break the traditional > >> concept of "symmetric" MTUs. But we're already breaking the > >> notion of per-link MTU, requring a lot of changes in the host > >> stack anyway. This additonal breakage doesn't seem much. > >> > >> I haven't verified if this asymmetric MTU matches well with > >> IBA connections though. > > > >How about: > > > >The MTU I would think is exchanged at the IB level during the > >IPoIB-CM connection setup. The IP layer at both ends keeps a per connection > >MTU if the implementation permits it. At the link layer the connection will > >not send messages larger than that requested by the peer. > >Not quite understand the above. I'm suggesting to simplify the MTU negotiation >at the IPoIB-CM connection setup time by each side advertising the "receive >MTU" it can take. The peer must not send more than that size in each >post_send(). >E.g., if node A advertizes 32KB as its receive MTU and node B advertizes 64KB >as its receive MTU, node B must not send any IP pkt through IPoIB-CM to node >A that is larger than 32KB. Node A is free to send IP pkts of up to 64KB in >size to node B. (But if node A decides to restrict its outbound MTU to 32KB, >that's fine too. Node B doesn't need to know about it.) > >I'm not sure what you mean by the last two sentences above. MTU value must >be made known to the IP layer so that latter won't send anything larger >than that. Otherwise the pkt will get dropped by the IB layer (unless the >latter performs SAR, which is a bad idea). One might argue that *C is focused on an equivalence to TSO (large send) thus the logical MTU is not required. One might argue that the logical MTU represents an asymmetric maximum receive buffer that will be posted thus messages must be sent that do not exceed this maximum. One might argue that having a single buffer size independent of the *C / UD being used maximizes KISS. 
I'm open to exploring these options but do not believe all must be supported. > >> 3. Regarding allowing multiple IB connections between a node > >> pair, since given an IP address there is only one link-address > >> for it implying one QPN, hence one service-ID, if a single > >> service-ID can be used to create multiple IB connections > >> then this can happen transparently. Otherwise we've got a > >> problem. A service ID can be used to establish multiple connections thus the creation process should be left as an implementation detail in terms of how many, etc. as I've noted in a previous response. The local endnodes will determine what is allowed per endnode pair and there are no interoperability issues that arise as a result. Mike --=====================_47754637==.ALT Content-Type: text/html; charset="us-ascii" At 10:20 AM 11/19/2004, H.K. Jerry Chu wrote:
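To restate Jerry's asymmetric "receive MTU" proposal in code form, here is a minimal sketch. The private-data layout and names are hypothetical (nothing about the draft or the CM REQ/REP format is implied); it only captures the rule that each side advertises what it will accept and a sender never exceeds the peer's advertised value.

    /* Hypothetical sketch of the asymmetric receive-MTU rule discussed
     * above.  Field names and sizes are illustrative only. */
    #include <stdbool.h>
    #include <stdint.h>

    struct ipoib_cm_priv_data {
        uint32_t recv_mtu;          /* largest message this node will accept */
    };

    struct ipoib_cm_conn {
        uint32_t local_recv_mtu;    /* value we advertised at connection setup */
        uint32_t peer_recv_mtu;     /* value the peer advertised to us         */
    };

    /* The two directions are independent: a sender only has to respect the
     * peer's advertised receive MTU.  In the thread's example, A advertises
     * 32KB and B advertises 64KB, so B may send A at most 32KB per message
     * while A may send B up to 64KB. */
    static bool ipoib_cm_can_send(const struct ipoib_cm_conn *c, uint32_t len)
    {
        return len <= c->peer_recv_mtu;
    }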
    --=====================_47754637==.ALT-- --===============1141799541== Content-Type: text/plain; charset="us-ascii" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit Content-Disposition: inline Content-Transfer-Encoding: 7bit _______________________________________________ IPoverIB mailing list IPoverIB@ietf.org https://www1.ietf.org/mailman/listinfo/ipoverib --===============1141799541==-- From ipoverib-bounces@ietf.org Sat Nov 20 12:08:06 2004 Received: from megatron.ietf.org (megatron.ietf.org [132.151.6.71]) by ietf.org (8.9.1a/8.9.1a) with ESMTP id MAA08219 for ; Sat, 20 Nov 2004 12:08:06 -0500 (EST) Received: from localhost.localdomain ([127.0.0.1] helo=megatron.ietf.org) by megatron.ietf.org with esmtp (Exim 4.32) id 1CVYXC-0002cm-RG; Sat, 20 Nov 2004 11:56:26 -0500 Received: from odin.ietf.org ([132.151.1.176] helo=ietf.org) by megatron.ietf.org with esmtp (Exim 4.32) id 1CVYSn-0001zF-BC for ipoverib@megatron.ietf.org; Sat, 20 Nov 2004 11:51:53 -0500 Received: from ietf-mx.ietf.org (ietf-mx.ietf.org [132.151.6.1]) by ietf.org (8.9.1a/8.9.1a) with ESMTP id LAA07153 for ; Sat, 20 Nov 2004 11:51:50 -0500 (EST) Received: from mail.mellanox.co.il ([194.90.237.34] helo=mtlex01.yok.mtl.com) by ietf-mx.ietf.org with esmtp (Exim 4.33) id 1CVYVp-0000db-Jh for ipoverib@ietf.org; Sat, 20 Nov 2004 11:55:04 -0500 Received: by mtlex01.yok.mtl.com with Internet Mail Service (5.5.2653.19) id ; Sat, 20 Nov 2004 18:49:17 +0200 Message-ID: <506C3D7B14CDD411A52C00025558DED6067488EB@mtlex01.yok.mtl.com> From: Dror Goldenberg To: Bill.Strahm@Sun.COM, Yaron Haviv Subject: RE: [Ipoverib] IPoIB-RC and Checksums Date: Sat, 20 Nov 2004 18:49:09 +0200 MIME-Version: 1.0 X-Mailer: Internet Mail Service (5.5.2653.19) X-Spam-Score: 0.8 (/) X-Scan-Signature: 6ffdee8af20de249c24731d8414917d3 Cc: IPoverIB X-BeenThere: ipoverib@ietf.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: IP over InfiniBand WG Discussion List List-Unsubscribe: , List-Post: List-Help: List-Subscribe: , Content-Type: multipart/mixed; boundary="===============1955679407==" Sender: ipoverib-bounces@ietf.org Errors-To: ipoverib-bounces@ietf.org This message is in MIME format. Since your mail reader does not understand this format, some or all of this message may not be legible. --===============1955679407== Content-Type: multipart/alternative; boundary="----_=_NextPart_001_01C4CF20.DB00EAB0" This message is in MIME format. Since your mail reader does not understand this format, some or all of this message may not be legible. ------_=_NextPart_001_01C4CF20.DB00EAB0 Content-Type: text/plain > -----Original Message----- > From: Bill Strahm [mailto:Bill.Strahm@Sun.COM] > Sent: Friday, November 19, 2004 1:11 AM > > On Thu, 2004-11-18 at 22:18 +0200, Yaron Haviv wrote: > > P.S. another note, we discussed in IETF was that we may want to > > mention/suggest (not mandate) use of SRQ for IPoIB-RC in order to > > preserve memory > > > Again, in the spirit of Wire protocol vs. Implementation. I > think this is an implementation issue that will not change > wire protocols at all. Is there a point where using SRQ vs. > Not Using SRQ would have to change the wire protocol ? > > If not - lets not say anything. > If there is - I would be very interested in understanding. Subtle difference: In UC, I don't believe that there is a difference. In RC, the ACKs will be sent with valid end to end credits by the responder HCA if it's connected to a regular QP, and with invalid end to end credits if the QP is connected to a SRQ. 
This can be observed on the wire with an analyzer. However, it never propagates to the SW above the IB verbs. -Dror > > Bill > > > _______________________________________________ > IPoverIB mailing list > IPoverIB@ietf.org https://www1.ietf.org/mailman/listinfo/ipoverib > ------_=_NextPart_001_01C4CF20.DB00EAB0 Content-Type: text/html Content-Transfer-Encoding: quoted-printable RE: [Ipoverib] IPoIB-RC and Checksums
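For readers following the SRQ aside, the sketch below shows what "connected QPs fed from a shared receive queue" looks like, using the later libibverbs API purely for illustration (it is not the interface in use when this was written). The end-to-end-credit difference Dror describes is a side effect of this attachment and, as he notes, stays invisible above the verbs layer.

    /* Illustration only (libibverbs post-dates this thread): receive
     * buffers are pooled in one shared receive queue instead of being
     * posted per RC connection, which is the memory saving mentioned. */
    #include <infiniband/verbs.h>

    static struct ibv_srq *create_shared_rq(struct ibv_pd *pd)
    {
        struct ibv_srq_init_attr attr = {
            .attr = { .max_wr = 4096, .max_sge = 1 },
        };
        return ibv_create_srq(pd, &attr);       /* NULL on failure */
    }

    static struct ibv_qp *create_rc_qp_on_srq(struct ibv_pd *pd,
                                              struct ibv_cq *cq,
                                              struct ibv_srq *srq)
    {
        struct ibv_qp_init_attr attr = {
            .send_cq = cq,
            .recv_cq = cq,
            .srq     = srq,   /* all receives are taken from the shared pool */
            .cap     = { .max_send_wr = 256, .max_send_sge = 1 },
            .qp_type = IBV_QPT_RC,
        };
        return ibv_create_qp(pd, &attr);
    }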

    ------_=_NextPart_001_01C4CF20.DB00EAB0-- --===============1955679407== Content-Type: text/plain; charset="us-ascii" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit Content-Disposition: inline Content-Transfer-Encoding: 7bit _______________________________________________ IPoverIB mailing list IPoverIB@ietf.org https://www1.ietf.org/mailman/listinfo/ipoverib --===============1955679407==-- From ipoverib-bounces@ietf.org Sat Nov 20 12:08:48 2004 Received: from megatron.ietf.org (megatron.ietf.org [132.151.6.71]) by ietf.org (8.9.1a/8.9.1a) with ESMTP id MAA08292 for ; Sat, 20 Nov 2004 12:08:48 -0500 (EST) Received: from localhost.localdomain ([127.0.0.1] helo=megatron.ietf.org) by megatron.ietf.org with esmtp (Exim 4.32) id 1CVYXD-0002cx-3p; Sat, 20 Nov 2004 11:56:27 -0500 Received: from odin.ietf.org ([132.151.1.176] helo=ietf.org) by megatron.ietf.org with esmtp (Exim 4.32) id 1CVYSo-0001zH-Gu for ipoverib@megatron.ietf.org; Sat, 20 Nov 2004 11:51:54 -0500 Received: from ietf-mx.ietf.org (ietf-mx.ietf.org [132.151.6.1]) by ietf.org (8.9.1a/8.9.1a) with ESMTP id LAA07158 for ; Sat, 20 Nov 2004 11:51:51 -0500 (EST) Received: from mail.mellanox.co.il ([194.90.237.34] helo=mtlex01.yok.mtl.com) by ietf-mx.ietf.org with esmtp (Exim 4.33) id 1CVYVp-0000dZ-IA for ipoverib@ietf.org; Sat, 20 Nov 2004 11:55:05 -0500 Received: by mtlex01.yok.mtl.com with Internet Mail Service (5.5.2653.19) id ; Sat, 20 Nov 2004 18:49:17 +0200 Message-ID: <506C3D7B14CDD411A52C00025558DED6067488EA@mtlex01.yok.mtl.com> From: Dror Goldenberg To: Michael Krause , Vivek Kashyap Subject: RE: [Ipoverib] A Couple of IPoIB Questions Date: Sat, 20 Nov 2004 18:49:08 +0200 MIME-Version: 1.0 X-Mailer: Internet Mail Service (5.5.2653.19) X-Spam-Score: 0.9 (/) X-Scan-Signature: 932cba6e0228cc603da43d861a7e09d8 Cc: IPoverIB X-BeenThere: ipoverib@ietf.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: IP over InfiniBand WG Discussion List List-Unsubscribe: , List-Post: List-Help: List-Subscribe: , Content-Type: multipart/mixed; boundary="===============2046247154==" Sender: ipoverib-bounces@ietf.org Errors-To: ipoverib-bounces@ietf.org This message is in MIME format. Since your mail reader does not understand this format, some or all of this message may not be legible. --===============2046247154== Content-Type: multipart/alternative; boundary="----_=_NextPart_001_01C4CF20.DAA84360" This message is in MIME format. Since your mail reader does not understand this format, some or all of this message may not be legible. ------_=_NextPart_001_01C4CF20.DAA84360 Content-Type: text/plain -----Original Message----- From: Michael Krause [mailto:krause@cup.hp.com] Sent: Friday, November 19, 2004 3:45 AM To: Vivek Kashyap Cc: IPoverIB Subject: Re: [Ipoverib] A Couple of IPoIB Questions At 05:14 PM 11/18/2004, Vivek Kashyap wrote: RC and UC both have benefits. There is almost no difference other than the connection flag between the two. Many host OS implementations do not support UC as RC and UD are all that is really required within the industry. The ACK overhead associated with RC is truly noise and the end-to-end credits are very nice as IB now supports three signaling rates combined with 4 link widths (though only three are really being implemented). Such a permutation in bandwidth capability makes RC a more tenable / good citizen as we designed it to be so I'd prefer RC. [DG] Mike, A few reasons I think that the end to end credits / RNR in an RC connection is a problem. 
It may be worth discussing it: 1) Lack of receive WQEs in the responder implies a slow responder. Getting the message dropped in this case is desirable for protocols that have injection control, such as TCP: in that case the sender is supposed to back off and restart more slowly. While UC/UD result in a similar behavior of messages being dropped at the receiver when it is slow, RC does not. Instead, there is persistence in getting the message transmitted, and the receiver won't be able to tell the requester that it is being slow. 2) How would you configure the RNR retry counters? Would they be configured to infinity? That doesn't sound good. Would they be configured to a finite value (it should be <7), in which case, with a slow receiver, you'd end up recreating connections that ran into end-to-end credit problems, which is a real overhead on the protocol. 3) What happens with implementations that don't support RNR NAK generation? That poses more difficulties for (2).
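To anchor point (2) in something concrete: in the later libibverbs API (used here only for illustration; the thread predates it) the two RNR knobs are programmed during the normal RC connection bring-up. An RNR retry count of 7 is the "retry forever" encoding, and anything below 7 is a finite count after which the connection errors out, which is exactly the trade-off being questioned.

    /* Illustration only, assuming the QP is already in the INIT state and
     * the address/PSN parameters have been obtained out of band. */
    #include <infiniband/verbs.h>
    #include <stdint.h>
    #include <string.h>

    static int bring_up_rc_qp(struct ibv_qp *qp, uint16_t dlid, uint32_t dqpn,
                              uint32_t rq_psn, uint32_t sq_psn, uint8_t port)
    {
        struct ibv_qp_attr attr;
        int ret;

        /* INIT -> RTR: the responder-side minimum RNR NAK timer is set here
         * (how long a requester is told to wait after an RNR NAK). */
        memset(&attr, 0, sizeof(attr));
        attr.qp_state           = IBV_QPS_RTR;
        attr.path_mtu           = IBV_MTU_2048;
        attr.dest_qp_num        = dqpn;
        attr.rq_psn             = rq_psn;
        attr.max_dest_rd_atomic = 1;
        attr.min_rnr_timer      = 12;        /* encoded value, ~0.64 ms */
        attr.ah_attr.dlid       = dlid;
        attr.ah_attr.port_num   = port;
        ret = ibv_modify_qp(qp, &attr,
                            IBV_QP_STATE | IBV_QP_AV | IBV_QP_PATH_MTU |
                            IBV_QP_DEST_QPN | IBV_QP_RQ_PSN |
                            IBV_QP_MAX_DEST_RD_ATOMIC | IBV_QP_MIN_RNR_TIMER);
        if (ret)
            return ret;

        /* RTR -> RTS: the requester-side RNR retry count is set here.
         * 7 means retry indefinitely; 0..6 give up after that many RNR NAKs. */
        memset(&attr, 0, sizeof(attr));
        attr.qp_state      = IBV_QPS_RTS;
        attr.timeout       = 14;
        attr.retry_cnt     = 7;
        attr.rnr_retry     = 7;
        attr.sq_psn        = sq_psn;
        attr.max_rd_atomic = 1;
        return ibv_modify_qp(qp, &attr,
                             IBV_QP_STATE | IBV_QP_TIMEOUT | IBV_QP_RETRY_CNT |
                             IBV_QP_RNR_RETRY | IBV_QP_SQ_PSN |
                             IBV_QP_MAX_QP_RD_ATOMIC);
    }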
     
From ipoverib-bounces@ietf.org Mon Nov 22 13:57:42 2004
From: Michael Krause
To: IPoverIB
Date: Mon, 22 Nov 2004 10:36:56 -0800
Subject: RE: [Ipoverib] A Couple of IPoIB Questions

At 08:49 AM 11/20/2004, Dror Goldenberg wrote:

>-----Original Message-----
>From: Michael Krause [mailto:krause@cup.hp.com]
>Sent: Friday, November 19, 2004 3:45 AM
>To: Vivek Kashyap
>Cc: IPoverIB
>Subject: Re: [Ipoverib] A Couple of IPoIB Questions
>
>At 05:14 PM 11/18/2004, Vivek Kashyap wrote:
>
>>RC and UC both have benefits. There is almost no difference other than
>>the connection flag between the two.
>
>Many host OS implementations do not support UC, as RC and UD are all that
>is really required within the industry. The ACK overhead associated with
>RC is truly noise, and the end-to-end credits are very nice now that IB
>supports three signaling rates combined with four link widths (though only
>three are really being implemented). Such a permutation in bandwidth
>capability makes RC the more tenable / good-citizen choice, as we designed
>it to be, so I'd prefer RC.
>
>[DG] Mike,
>A few reasons why I think the end-to-end credits / RNR in an RC connection
>are a problem. It may be worth discussing:
>1) A lack of receive WQEs in the responder implies a slow responder.
>   Getting the message dropped in this case is desirable for protocols
>   that have injection control, such as TCP, which is then supposed to
>   back off and restart more slowly. While UC/UD result in similar
>   behavior (messages being dropped at the receiver when it is slow),
>   RC does not. Instead, there is persistence in getting the message
>   transmitted, and the receiver cannot tell the requester that it is
>   being slow.

TCP on the sending side will regulate due to the lack of window-update
credits. Hence, there is no need to restart the large messages that are
put forth as the reason for using *C instead of UD.

>2) How would you configure the RNR retry counters? Would they be
>   configured to infinity? That doesn't sound good. Would they be
>   configured to a finite value (it would have to be < 7)? In that case,
>   with a slow receiver you would end up recreating connections that had
>   end-to-end credit problems, which is real overhead on the protocol.

RNR would be no different for IP over IB than for any other IB RC instance.

>3) What happens with implementations that don't support RNR NAK
>   generation? That poses more difficulties for (2).

An HCA is required to support RNR NAK. A TCA has the option. If you don't
support RC, then use UD. Where is the real problem, given that nothing
shown here on either side is more than speculation?

Mike

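For concreteness, the RNR knobs being debated are per-QP attributes set when
an RC queue pair is moved to RTR and RTS. A minimal sketch, assuming the
OpenFabrics verbs API (libibverbs); the numeric values are illustrative, not
a recommendation from this thread, and the other mandatory attributes for
each transition are omitted:

    #include <string.h>
    #include <infiniband/verbs.h>

    /* Illustrative only: where the RNR-related attributes live on an RC QP.
     * rnr_retry = 7 means "retry forever"; 0..6 give a finite count, after
     * which the QP drops into the error state -- the trade-off Dror raises. */
    static int set_rc_rnr_attrs(struct ibv_qp *qp)
    {
        struct ibv_qp_attr attr;
        int ret;

        /* INIT -> RTR: the minimum RNR NAK timer this responder reports. */
        memset(&attr, 0, sizeof(attr));
        attr.qp_state      = IBV_QPS_RTR;
        attr.min_rnr_timer = 12;     /* encoded per the IB RNR timer table */
        /* path, MTU, destination QPN, PSN, etc. omitted for brevity */
        ret = ibv_modify_qp(qp, &attr,
                            IBV_QP_STATE | IBV_QP_MIN_RNR_TIMER /* | ... */);
        if (ret)
            return ret;

        /* RTR -> RTS: how persistently this requester retries on RNR NAK. */
        memset(&attr, 0, sizeof(attr));
        attr.qp_state  = IBV_QPS_RTS;
        attr.rnr_retry = 6;          /* finite (< 7): the connection must be
                                        re-established once retries run out */
        attr.retry_cnt = 7;
        attr.timeout   = 14;
        /* SQ PSN, etc. omitted for brevity */
        return ibv_modify_qp(qp, &attr,
                             IBV_QP_STATE | IBV_QP_RNR_RETRY |
                             IBV_QP_RETRY_CNT | IBV_QP_TIMEOUT /* | ... */);
    }
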
From ipoverib-bounces@ietf.org Mon Nov 22 15:07:54 2004
From: Vivek Kashyap
To: "H.K. Jerry Chu"
Cc: ipoverib@ietf.org
Date: Mon, 22 Nov 2004 11:57:06 -0800 (PST)
Subject: Re: [Ipoverib] comments on draft-kashyap-ipoib-connected-mode-02.txt

On Fri, 19 Nov 2004, H.K. Jerry Chu wrote:

> >> A much simpler model, which I think was presented in earlier
> >> drafts, is to fold the use of IB connections fully into a
> >> regular IPoIB-UD subnet, allowing any two IPoIB nodes to
> >> optionally negotiate the use of an IB connection between themselves.
> >
> >The difference between the earlier draft and this one is that
> >I modified the requirement on the UD QP. That is, it need not be that
> >IPoIB-CM and IPoIB-UD share a QP; any UD QP will do for IPoIB-CM.
> >In effect an implementation can still share the UD QP.
> >
> >The only issue is whether the same IP subnet can contain pure
> >IPoIB-UD mixed in with IPoIB-CM nodes, or whether all nodes must be of
> >the same type:
> > - all IPoIB-UD
> >or
> > - all IPoIB-RC
> >or
> > - all IPoIB-UC
> >
> >I believe all of the same type is a good option to choose.
>
> I don't see a clear benefit for this restriction. E.g., even in all
> IPoIB-RC or IPoIB-UC, the nice per-link MTU property is no longer there,
> due to multicast being supported through UD. Also, this restriction will
> require those implementations that don't support IPoIB over UC or RC to
> form a different subnet in order to talk IPoIB, hence forcing the
> administrator to maintain at least two IP subnets, with one fully
> contained within another. I don't see why this is needed.

OK. Let me posit what seems to me to be the summary (looking for more
comments from WG members here). I'm more or less reverting to the earlier
version of the draft. In an IPoIB subnet:

- Every interface MUST support IPoIB-UD.

- An interface MAY optionally also support IPoIB-CM (one or both modes),
  i.e. the mutually exclusive restriction on RC/UC is removed.
  Note: IIRC, the same serviceID can be used for both RC and UC. If not,
  then they have to stay mutually exclusive.

- Interoperability is maintained by all nodes supporting IPoIB-UD. Any two
  interfaces that do not have a connection mode in common will fall back
  to IPoIB-UD.

- The support of any particular IB mode is indicated by the flags in the
  link-layer address.
  Note: IPoIB-UD is always supported, hence there are no flags to indicate
  UD support.

- An interface completes the IPoIB-UD address resolution and then
  optionally MAY set up RC/UC connections based on local support and the
  received flags.

- A pure IPoIB-UD implementation ignores the RC/UC flags in the link-layer
  address of received packets. It zeroes them on transmit.

- Every implementation MUST accept all unicast transmissions received over
  any of the IPoIB modes it supports. Multicast/broadcast, by their nature,
  will be transmitted and received over IPoIB-UD only.
  ***This implies that an interface MAY transmit/receive a packet over any
  of RC, UC or UD, depending on the modes supported between the peer IP
  and itself.***

- It is an implementation's decision to connect, or to retry a connect on
  failure, in the CM modes. This decision is made independently per
  transmission or per reception of a connection request.

- An implementation MAY make multiple connections to a peer. This is a
  local decision, as is the peer's decision to refuse such a connection.
  The serviceID, link setup, the link-address flags, MTU negotiation, etc.
  are covered in the draft.

- MTU -- we need to discuss this more, as below.

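Before turning to the MTU question below, here is a minimal sketch of the
sender-side fallback rule just summarized. The 20-octet link-layer address
layout and the flag bit positions are placeholders (the draft defines the
real encoding); only the decision logic is taken from the summary:

    #include <stdint.h>
    #include <stdbool.h>

    /* Assumed layout, for illustration only: a 20-octet IPoIB hardware
     * address with flags + QPN in the first 4 octets and the 16-octet GID
     * after them.  The flag bit positions below are hypothetical. */
    #define IPOIB_FLAG_RC  0x80   /* peer advertises IPoIB-CM over RC */
    #define IPOIB_FLAG_UC  0x40   /* peer advertises IPoIB-CM over UC */

    enum ipoib_mode { IPOIB_MODE_UD, IPOIB_MODE_RC, IPOIB_MODE_UC };

    struct ipoib_hw_addr {
        uint8_t flags_qpn[4];     /* flags in high bits of octet 0, QPN below */
        uint8_t gid[16];
    };

    /* Pick the transmit mode for a peer: use a connected mode only if both
     * sides support it, otherwise fall back to IPoIB-UD.  Multicast and
     * broadcast always use UD. */
    static enum ipoib_mode
    ipoib_select_mode(const struct ipoib_hw_addr *peer,
                      bool local_rc, bool local_uc, bool is_multicast)
    {
        uint8_t flags = peer->flags_qpn[0];

        if (is_multicast)
            return IPOIB_MODE_UD;
        if (local_rc && (flags & IPOIB_FLAG_RC))
            return IPOIB_MODE_RC;
        if (local_uc && (flags & IPOIB_FLAG_UC))
            return IPOIB_MODE_UC;
        return IPOIB_MODE_UD;     /* no connected mode in common */
    }

A pure IPoIB-UD node never sets the connected-mode flags on transmit and
ignores them on receive, which is what keeps the two populations
interoperable under the rules above.
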
> >> This much simplified model is not without its drawbacks. Some
> >> nice IP link attributes are no longer unique within a link.
> >> E.g., the link MTU now becomes a per-node-pair MTU. Moreover,
> >> the MTU size for multicast will be different from the MTU size
> >> for unicast if IB connections are used. IB UC/RC may exhibit
> >> different RAS, flow control, QoS or other link characteristics
> >> than UD. But I consider these problems a reasonable price to
> >> pay for seamless support of UC/RC mode in an IPoIB link
> >> defined by UD.
> >>
> >> 2. The negotiation of the per-connection MTU seems more
> >> complicated than necessary. I think all that is needed is for a
> >> node to advertise its own "receive MTU", that is, the MTU
> >> size its peer should never exceed when sending packets
> >> to the local interface. Yes, this may break the traditional
> >> concept of "symmetric" MTUs. But we're already breaking the
> >> notion of per-link MTU, requiring a lot of changes in the host
> >> stack anyway. This additional breakage doesn't seem like much.
> >>
> >> I haven't verified whether this asymmetric MTU matches well with
> >> IBA connections, though.
> >
> >How about:
> >
> >The MTU, I would think, is exchanged at the IB level during the
> >IPoIB-CM connection setup. The IP layer at both ends keeps a
> >per-connection MTU if the implementation permits it. At the link layer
> >the connection will not send messages larger than that requested by the
> >peer.
>
> I don't quite understand the above. I'm suggesting we simplify the MTU
> negotiation at IPoIB-CM connection setup time by each side advertising
> the "receive MTU" it can take. The peer must not send more than that size
> in each post_send(). E.g., if node A advertises 32KB as its receive MTU
> and node B advertises 64KB as its receive MTU, node B must not send any
> IP packet through IPoIB-CM to node A that is larger than 32KB. Node A is
> free to send IP packets of up to 64KB in size to node B. (But if node A
> decides to restrict its outbound MTU to 32KB, that's fine too. Node B
> doesn't need to know about it.)
>
> I'm not sure what you mean by the last two sentences above. The MTU value
> must be made known to the IP layer so that the latter won't send anything
> larger than that. Otherwise the packet will get dropped by the IB layer
> (unless the latter performs SAR, which is a bad idea).

A model for an implementation can be that the IP MTU is set large, but the
actual packet or segment size is determined by an internal mechanism that
determines the MTU for the relevant connection/UD path. An implementation
is also allowed to just drop down to the IPoIB-UD MTU and use that always.
The interface MTUs at the peers need not be the same at the IP or IB
layers.

I agree with the concept of just exchanging the max receive MTU at the IB
connection setup.

Vivek

> Jerry
>
> >> 3. Regarding allowing multiple IB connections between a node
> >> pair: since a given IP address has only one link-address for it,
> >> implying one QPN and hence one service-ID, if a single
> >> service-ID can be used to create multiple IB connections,
> >> then this can happen transparently. Otherwise we've got a
> >> problem.
> >>
> >> Jerry

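A sketch of the asymmetric receive-MTU idea, using the 32KB/64KB example
above. That the value travels in the IPoIB-CM connection setup (e.g. in the
REQ/REP private data) is an assumption here; the thread does not fix the
encoding:

    #include <stdint.h>

    #define IPOIB_UD_MTU 2044u   /* illustrative UD payload size; the real
                                    value comes from the broadcast group */

    /* Each side advertises only what it is willing to receive; the value is
     * assumed to be exchanged during IPoIB-CM connection setup. */
    struct ipoib_cm_conn {
        uint32_t local_recv_mtu;   /* the cap we told the peer */
        uint32_t peer_recv_mtu;    /* the cap the peer told us */
    };

    /* Largest IP packet we may hand to post_send() on this connection. */
    static uint32_t ipoib_cm_tx_mtu(const struct ipoib_cm_conn *c)
    {
        return c->peer_recv_mtu;   /* never exceed the peer's receive MTU */
    }

    /* Send-path check: nonzero if the packet fits this connection.
     * Example from the thread: A advertises 32KB, B advertises 64KB;
     * A->B traffic may be up to 64KB, B->A traffic must stay within 32KB.
     * A node that wants to keep things simple may clamp everything to
     * IPOIB_UD_MTU instead. */
    static int ipoib_cm_fits(const struct ipoib_cm_conn *c, uint32_t pkt_len)
    {
        return pkt_len <= c->peer_recv_mtu;
    }
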
From ipoverib-bounces@ietf.org Mon Nov 22 17:03:22 2004
From: Dror Goldenberg
To: Michael Krause, IPoverIB
Date: Mon, 22 Nov 2004 23:36:25 +0200
Subject: RE: [Ipoverib] A Couple of IPoIB Questions

Hi Mike,

Please see below.

Thanks
Dror

-----Original Message-----
From: Michael Krause [mailto:krause@cup.hp.com]
Sent: Monday, November 22, 2004 8:37 PM
To: IPoverIB
Subject: RE: [Ipoverib] A Couple of IPoIB Questions

At 08:49 AM 11/20/2004, Dror Goldenberg wrote:

  -----Original Message-----
  From: Michael Krause [mailto:krause@cup.hp.com]
  Sent: Friday, November 19, 2004 3:45 AM
  To: Vivek Kashyap
  Cc: IPoverIB
  Subject: Re: [Ipoverib] A Couple of IPoIB Questions

  At 05:14 PM 11/18/2004, Vivek Kashyap wrote:

    RC and UC both have benefits. There is almost no difference other than
    the connection flag between the two.

  Many host OS implementations do not support UC, as RC and UD are all
  that is really required within the industry. The ACK overhead associated
  with RC is truly noise, and the end-to-end credits are very nice now
  that IB supports three signaling rates combined with four link widths
  (though only three are really being implemented). Such a permutation in
  bandwidth capability makes RC the more tenable / good-citizen choice, as
  we designed it to be, so I'd prefer RC.

  [DG] Mike,
  A few reasons why I think the end-to-end credits / RNR in an RC
  connection are a problem. It may be worth discussing:
  1) A lack of receive WQEs in the responder implies a slow responder.
     Getting the message dropped in this case is desirable for protocols
     that have injection control, such as TCP, which is then supposed to
     back off and restart more slowly. While UC/UD result in similar
     behavior (messages being dropped at the receiver when it is slow),
     RC does not. Instead, there is persistence in getting the message
     transmitted, and the receiver cannot tell the requester that it is
     being slow.

TCP on the sending side will regulate due to the lack of window-update
credits. Hence, there is no need to restart the large messages that are
put forth as the reason for using *C instead of UD.

[dg] I think it'll be common to find very large TCP windows being
advertised. Therefore, when you work against a very slow receiver, I think
it makes sense to activate the TCP congestion mechanism rather than to
rely on the TCP window, which is not intended to take care of congestion.
Typically, the overall advertised TCP window (from all connections
together) is much larger than what is actually posted on the IPoIB QP
receive queue. With a slow receiver, the replenishment pace of receive
WQEs is slow, and you'd want remote senders to slow down when trying to
fill its TCP windows.

  2) How would you configure the RNR retry counters? Would they be
     configured to infinity? That doesn't sound good. Would they be
     configured to a finite value (it would have to be < 7)? In that case,
     with a slow receiver you would end up recreating connections that had
     end-to-end credit problems, which is real overhead on the protocol.

RNR would be no different for IP over IB than for any other IB RC instance.

[dg] Example ULPs such as SDP and SRP use SW-level flow control and do not
rely on RNR NAKs. What I am trying to say is that if you configure your QP
for finite retries and a reasonable timeout, then when the receiver is
slow you'd often get the QP into the error state after RNR retries are
exhausted. The overhead of re-establishing a new connection each time the
QP gets into the error state is high. If you use UC, then this is not a
problem, because none of this happens.

  3) What happens with implementations that don't support RNR NAK
     generation? That poses more difficulties for (2).

An HCA is required to support RNR NAK. A TCA has the option. If you don't
support RC, then use UD. Where is the real problem, given that nothing
shown here on either side is more than speculation?

[dg] Agree, there isn't a real problem here.

Mike

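The SDP/SRP-style software flow control Dror mentions replaces reliance on
RNR NAKs with explicit buffer credits. A minimal sketch of the idea; the
structure and field names are illustrative and not taken from either
protocol:

    #include <stdint.h>
    #include <stdbool.h>

    /* Illustrative software credit scheme: the receiver piggybacks, on each
     * message it sends, the number of receive buffers it has posted since
     * the last update; the sender consumes one credit per send and stops
     * when it runs out, instead of provoking RNR NAKs at the responder. */
    struct sw_flow_ctrl {
        uint32_t tx_credits;   /* sends we may still post to the peer */
        uint32_t rx_posted;    /* receive buffers posted, not yet advertised */
    };

    /* Receiver side: after posting new receive WQEs, record them so the
     * next outbound message can advertise them to the peer. */
    static void fc_rx_buffers_posted(struct sw_flow_ctrl *fc, uint32_t n)
    {
        fc->rx_posted += n;
    }

    /* Sender side: called before posting a send; returns false if we must
     * wait for a credit update from the peer. */
    static bool fc_try_consume_credit(struct sw_flow_ctrl *fc)
    {
        if (fc->tx_credits == 0)
            return false;      /* back off in software, no RNR NAK involved */
        fc->tx_credits--;
        return true;
    }

    /* Sender side: apply a credit update carried in a received message. */
    static void fc_credits_received(struct sw_flow_ctrl *fc, uint32_t n)
    {
        fc->tx_credits += n;
    }
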
From ipoverib-bounces@ietf.org Mon Nov 22 17:56:01 2004
From: Michael Krause
To: IPoverIB
Date: Mon, 22 Nov 2004 14:26:36 -0800
Subject: RE: [Ipoverib] A Couple of IPoIB Questions

At 01:36 PM 11/22/2004, Dror Goldenberg wrote:
>Hi Mike,
>
>Please see below.
>
>Thanks
>Dror
>-----Original Message-----
>From: Michael Krause [mailto:krause@cup.hp.com]
>Sent: Monday, November 22, 2004 8:37 PM
>To: IPoverIB
>Subject: RE: [Ipoverib] A Couple of IPoIB Questions
>
>At 08:49 AM 11/20/2004, Dror Goldenberg wrote:
>>-----Original Message-----
>>From: Michael Krause [mailto:krause@cup.hp.com]
>>Sent: Friday, November 19, 2004 3:45 AM
>>To: Vivek Kashyap
>>Cc: IPoverIB
>>Subject: Re: [Ipoverib] A Couple of IPoIB Questions
>>
>>At 05:14 PM 11/18/2004, Vivek Kashyap wrote:
>>
>>>RC and UC both have benefits. There is almost no difference other than
>>>the connection flag between the two.
>>
>>Many host OS implementations do not support UC, as RC and UD are all
>>that is really required within the industry. The ACK overhead associated
>>with RC is truly noise, and the end-to-end credits are very nice now
>>that IB supports three signaling rates combined with four link widths
>>(though only three are really being implemented). Such a permutation in
>>bandwidth capability makes RC the more tenable / good-citizen choice, as
>>we designed it to be, so I'd prefer RC.
>>
>>[DG] Mike,
>>A few reasons why I think the end-to-end credits / RNR in an RC
>>connection are a problem. It may be worth discussing:
>>1) A lack of receive WQEs in the responder implies a slow responder.
>>   Getting the message dropped in this case is desirable for protocols
>>   that have injection control, such as TCP, which is then supposed to
>>   back off and restart more slowly. While UC/UD result in similar
>>   behavior (messages being dropped at the receiver when it is slow),
>>   RC does not. Instead, there is persistence in getting the message
>>   transmitted, and the receiver cannot tell the requester that it is
>>   being slow.
>
>TCP on the sending side will regulate due to the lack of window-update
>credits. Hence, there is no need to restart the large messages that are
>put forth as the reason for using *C instead of UD.
>
>[dg] I think it'll be common to find very large TCP windows being
>advertised.

A TCP window that is advertised is required to have the associated
buffering available. While some implementations assume statistical
provisioning in the kernel, they assume that the application buffers are
available and that the only problem is being able to move kernel buffers
quickly enough to application buffers, which is a transient issue.

>[dg] Therefore, when you work against a very slow receiver, I think it
>makes sense to activate the TCP congestion mechanism rather than to rely
>on the TCP window, which is not intended to take care of congestion.
>Typically, the overall advertised TCP window (from all connections
>together) is much larger than what is actually posted on the IPoIB QP
>receive queue. With a slow receiver, the replenishment pace of receive
>WQEs is slow, and you'd want remote senders to slow down when trying to
>fill its TCP windows.

Dropping a buffer is fine, but that should be a TCP/IP-level decision and
not a driver decision. A driver should have sufficient buffers to avoid
wasting network bandwidth; hence, it should be posting enough buffers to
keep up with the workload, which may span multiple connections /
datagrams. Use of UC or RC does not change anything in this regard. A drop
using UC would simply waste IB network bandwidth, consume HCA resources
flushing the work (the transmitter would continue to transmit, so nothing
is saved there), etc., and only impact one connection at a time. It does
nothing for the rest of the connections. So while one might get a bit of
benefit akin to a RED scheme, if the endnode pairs are operating at a high
workload, all one gets with UC is the ability of one endnode to flood
another with no push-back except on random connections. This would lead to
bursty behavior and unpredictable application responsiveness. RC leads to
smoother performance between the endnode pair, and with the use of
multiple RC QPs one can differentiate traffic for QoS purposes, which is
something that will benefit applications.

>>2) How would you configure the RNR retry counters? Would they be
>>   configured to infinity? That doesn't sound good. Would they be
>>   configured to a finite value (it would have to be < 7)? In that case,
>>   with a slow receiver you would end up recreating connections that had
>>   end-to-end credit problems, which is real overhead on the protocol.
>
>RNR would be no different for IP over IB than for any other IB RC instance.
>
>[dg] Example ULPs such as SDP and SRP use SW-level flow control and do
>not rely on RNR NAKs.

These are also not IP-based ULPs.

>[dg] What I am trying to say is that if you configure your QP for finite
>retries and a reasonable timeout, then when the receiver is slow you'd
>often get the QP into the error state after RNR retries are exhausted.
>The overhead of re-establishing a new connection each time the QP gets
>into the error state is high. If you use UC, then this is not a problem,
>because none of this happens.

Given that RC uses send credits, and therefore should not see a new
message unless there is an associated buffer available which increments
the credit count, one should not get an RNR NAK at all. The reason for RNR
NAK was to deal with a resource other than a receive buffer missing, e.g.
a QP context or V-to-P translation not being chip-resident, where some
time is required to refresh without going into the error state. Given that
RC is still send/receive based, there should not be any reason for an RNR
NAK, and no SEND will occur unless a credit is provided.

>>3) What happens with implementations that don't support RNR NAK
>>   generation? That poses more difficulties for (2).
>
>An HCA is required to support RNR NAK. A TCA has the option. If you don't
>support RC, then use UD. Where is the real problem, given that nothing
>shown here on either side is more than speculation?
>[dg] Agree, there isn't a real problem here.

Mike

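Mike's point that "the driver should be posting sufficient buffers" is, in
code, a receive-ring repost path. A minimal sketch with the OpenFabrics
verbs API; ring management and error handling are left out, and the
function name is illustrative:

    #include <stdint.h>
    #include <infiniband/verbs.h>

    /* Post (or repost) one receive buffer.  In a driver this is called once
     * per slot at startup and again from the completion handler for each
     * consumed WQE, so the responder never runs dry of receive WQEs (and,
     * on RC, keeps granting end-to-end credits to the requester). */
    static int ipoib_post_rx(struct ibv_qp *qp, struct ibv_mr *mr,
                             char *buf, uint32_t buf_len, uint64_t slot)
    {
        struct ibv_sge sge = {
            .addr   = (uintptr_t)buf,
            .length = buf_len,
            .lkey   = mr->lkey,
        };
        struct ibv_recv_wr wr = {
            .wr_id   = slot,   /* lets the CQ handler find the buffer to repost */
            .sg_list = &sge,
            .num_sge = 1,
        };
        struct ibv_recv_wr *bad_wr;

        return ibv_post_recv(qp, &wr, &bad_wr);
    }
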
From ipoverib-bounces@ietf.org Mon Nov 22 21:50:52 2004
From: "H.K. Jerry Chu"
To: krause@cup.hp.com
Cc: ipoverib@ietf.org
Date: Mon, 22 Nov 2004 18:43:04 -0800 (PST)
Subject: Re: [Ipoverib] comments on draft-kashyap-ipoib-connected-mode-02.txt

>>I'm not sure what you mean by the last two sentences above. The MTU
>>value must be made known to the IP layer so that the latter won't send
>>anything larger than that. Otherwise the packet will get dropped by the
>>IB layer (unless the latter performs SAR, which is a bad idea).
>
>One might argue that *C is focused on an equivalence to TSO (large send),
>thus the logical MTU is not required.

Interesting idea. But I thought that for the *C modes it's an error if one
end posts a send that is larger than the WQE posted to the receive queue.
If this is true, then unless the receiver always posts 2**32-byte buffers,
you'll still need the MTU concept to put a reasonable cap on the buffer
size the receive side must prepare.

>One might argue that the logical MTU
>represents an asymmetric maximum receive buffer that will be posted, thus
>messages must be sent that do not exceed this maximum. One might argue
>that having a single buffer size independent of the *C / UD being used
>maximizes KISS.

This limits the MTU down to the UD MTU, which defeats one of the main
benefits of using the *C modes.

>I'm open to exploring these options but do not believe all
>must be supported.

So it looks to me like only the middle one makes sense.

Jerry

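To put numbers on the "reasonable cap" point (an illustrative calculation,
not figures from the thread): a node that advertises a 64 KB receive MTU
and keeps 256 receive WQEs posted on a connection pins 256 x 64 KB = 16 MB
of receive buffers for that single connection, versus 256 x 2 KB = 0.5 MB
for the same ring depth at a typical IPoIB-UD MTU. The advertised receive
MTU is what lets each receiver decide where on that curve it wants to sit.
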
From ipoverib-bounces@ietf.org Mon Nov 22 22:25:27 2004
From: "H.K. Jerry Chu"
To: krause@cup.hp.com, kashyapv@us.ibm.com
Cc: ipoverib@ietf.org
Date: Mon, 22 Nov 2004 19:13:52 -0800 (PST)
Subject: Re: [Ipoverib] A Couple of IPoIB Questions

>On Thu, 18 Nov 2004, Michael Krause wrote:
>
>> At 04:12 PM 11/18/2004, Vivek Kashyap wrote:
>> >On Thu, 18 Nov 2004, Michael Krause wrote:
>> >
>> > > At 11:33 AM 11/18/2004, Vivek Kashyap wrote:
>> > > >On Thu, 18 Nov 2004, Michael Krause wrote:
>> > > >
>> > > > > At 10:46 PM 11/17/2004, Vivek Kashyap wrote:
>> > > > > >Mike, the format is really off in the last mail from you -
>> > > > > >making it difficult to follow.
>> > > > > >
>> > > > > >Other than that, let us discuss in the context of the draft.
>> > > > > >The draft is built upon the following:
>> > > > > >
>> > > > > >1. IPoIB-RC and IPoIB-UC are optional.
>> > > > >
>> > > > > I would prefer only one be used - either RC or UC. I've provided
>> > > > > some logic for either one as a preference but don't see a reason
>> > > > > to have both. Both just leads to options, which leads to
>> > > > > interoperability problems.
>> > > >
>> > > >ok.
>> > > >See section 3.1 of the draft draft-kashyap-ipoib-connected-mode-02.txt.
>> > > >It states that the RC and UC are mutually exclusive flags.
>> > >
>> > > My preference is to only support one of the two in a spec, not to
>> > > have flags to indicate what is implemented.
>> > > The benefits of connected mode operation should be done with only
>> > > one form of communication, not two.
>> >
>> >A given subnet will support only one of the two, not both
>> >simultaneously. The flag only indicates which type it is. RC and UC are
>> >both useful to different people and implementations, so both are
>> >allowed. I suggest that both not be allowed in the same IPoIB subnet
>> >though.

Is there any reason for this somewhat arbitrary restriction (of supporting
only one type per subnet)? It seems to introduce complexity, not reduce
it. How would the preferred type be determined?

>RC and UC both have benefits. There is almost no difference other than
>the connection flag between the two.
>
>> To be explicit, I think there is benefit in implementing one and only
>> one of the two. Having two options serves no purpose and adds
>> unnecessary complexity. Interoperability will end up requiring both to
>> be done if customers are to not get upset.

Not sure why this would affect interoperability. All implementations must
always support UD as the basic fallback plan.

>> Let's just pick one of the two and apply KISS. To get this started,
>> I'll propose RC as that is a bit nicer to the fabric than UC and is
>> already implemented in most OS and CA drivers today, so it makes it
>> faster to adopt with minimal driver software update.

Forcing a choice here can be a problem unless there is a clear winner or
consensus. That doesn't seem to be the case at this point.

>> > > If a designer is stupid, they may do this. However, one would expect
>> > > some intelligence here, and one may prefer to have specific data
>> > > flows or DiffServ code points or whatever used to determine which
>> > > connection or which UD QP, and that one would again apply an
>> > > intelligent and predictable algorithm such that mix-n-match for a
>> > > given TCP connection does not occur. Given multiple *C QPs can be
>> > > supported, it is not tenable to state that all unicast must go over
>> > > a given QP or that no unicast can occur on a UD QP.
>> >
>> >You missed my point, which was that the specification cannot be silent
>> >on this and say it is a local issue. That can lead to interoperability
>> >failure. The specification must support or disallow unicast
>> >communication over the UD QP in an IPoIB-CM.
>> >
>> >You prefer that such communication be supported. That works. Any other
>> >thoughts?
If people don't think the above is obvious hence some explicit wording is needed that's fine with me too. Jerry > >> >> Mike > > >_______________________________________________ >IPoverIB mailing list >IPoverIB@ietf.org >https://www1.ietf.org/mailman/listinfo/ipoverib _______________________________________________ IPoverIB mailing list IPoverIB@ietf.org https://www1.ietf.org/mailman/listinfo/ipoverib From ipoverib-bounces@ietf.org Tue Nov 23 02:47:54 2004 Received: from megatron.ietf.org (megatron.ietf.org [132.151.6.71]) by ietf.org (8.9.1a/8.9.1a) with ESMTP id CAA10284 for ; Tue, 23 Nov 2004 02:47:51 -0500 (EST) Received: from localhost.localdomain ([127.0.0.1] helo=megatron.ietf.org) by megatron.ietf.org with esmtp (Exim 4.32) id 1CWVND-0005ER-2e; Tue, 23 Nov 2004 02:46:03 -0500 Received: from odin.ietf.org ([132.151.1.176] helo=ietf.org) by megatron.ietf.org with esmtp (Exim 4.32) id 1CWVIO-0004H0-OF for ipoverib@megatron.ietf.org; Tue, 23 Nov 2004 02:41:05 -0500 Received: from ietf-mx.ietf.org (ietf-mx.ietf.org [132.151.6.1]) by ietf.org (8.9.1a/8.9.1a) with ESMTP id CAA09560 for ; Tue, 23 Nov 2004 02:40:54 -0500 (EST) Received: from e35.co.us.ibm.com ([32.97.110.133]) by ietf-mx.ietf.org with esmtp (Exim 4.33) id 1CWVLs-0004eY-28 for ipoverib@ietf.org; Tue, 23 Nov 2004 02:44:40 -0500 Received: from westrelay03.boulder.ibm.com (westrelay03.boulder.ibm.com [9.17.195.12]) by e35.co.us.ibm.com (8.12.10/8.12.9) with ESMTP id iAN7eJQf291278 for ; Tue, 23 Nov 2004 02:40:19 -0500 Received: from d03av01.boulder.ibm.com (d03av01.boulder.ibm.com [9.17.195.167]) by westrelay03.boulder.ibm.com (8.12.10/NCO/VER6.6) with ESMTP id iAN7eJcu218034 for ; Tue, 23 Nov 2004 00:40:19 -0700 Received: from d03av01.boulder.ibm.com (loopback [127.0.0.1]) by d03av01.boulder.ibm.com (8.12.11/8.12.11) with ESMTP id iAN7eJgG007806 for ; Tue, 23 Nov 2004 00:40:19 -0700 Received: from w-vkashyap95.des.sequent.com (sig-9-65-35-142.mts.ibm.com [9.65.35.142]) by d03av01.boulder.ibm.com (8.12.11/8.12.11) with ESMTP id iAN7e90Z007581; Tue, 23 Nov 2004 00:40:18 -0700 Date: Mon, 22 Nov 2004 23:39:27 -0800 (Pacific Standard Time) From: Vivek Kashyap To: "H.K. Jerry Chu" Subject: Re: [Ipoverib] A Couple of IPoIB Questions In-Reply-To: <200411230315.iAN3FTqu108712@jurassic.eng.sun.com> Message-ID: X-X-Sender: kashyapv@imap.linux.ibm.com MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII X-Spam-Score: 0.0 (/) X-Scan-Signature: 6ba8aaf827dcb437101951262f69b3de Cc: krause@cup.hp.com, ipoverib@ietf.org X-BeenThere: ipoverib@ietf.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: IP over InfiniBand WG Discussion List List-Unsubscribe: , List-Post: List-Help: List-Subscribe: , Sender: ipoverib-bounces@ietf.org Errors-To: ipoverib-bounces@ietf.org On Mon, 22 Nov 2004, H.K. Jerry Chu wrote: > >On Thu, 18 Nov 2004, Michael Krause wrote: > > > >> At 04:12 PM 11/18/2004, Vivek Kashyap wrote: > >> >On Thu, 18 Nov 2004, Michael Krause wrote: > >> > > >> > > At 11:33 AM 11/18/2004, Vivek Kashyap wrote: > >> > > >On Thu, 18 Nov 2004, Michael Krause wrote: > >> > > > > >> > > > > At 10:46 PM 11/17/2004, Vivek Kashyap wrote: > >> > > > > >Mike the format is really off in the last mail from you - making it > >> > > > difficult > >> > > > > >to follow. > >> > > > > > > >> > > > > > > >> > > > > >Other than that let us discuss in the context of the draft. The > >> > draft is > >> > > > > >built upon the following: > >> > > > > > > >> > > > > >1. IPoIB-RC and IPoIB-UC are optional. 
> >> > > > >
> >> > > > > I would prefer only one be used - either RC or UC. I've
> >> > > > > provided some logic for either one as a preference but don't
> >> > > > > see a reason to have both. Both just leads to options, which
> >> > > > > leads to interoperability problems.
> >> > > >
> >> > > >ok.
> >> > > >See section 3.1 of the draft draft-kashyap-ipoib-connected-mode-02.txt.
> >> > > >It states that the RC and UC are mutually exclusive flags.
> >> > >
> >> > > My preference is to only support one of the two in a spec, not to
> >> > > have flags to indicate what is implemented. The benefits of
> >> > > connected mode operation should be done with only one form of
> >> > > communication, not two.
> >> >
> >> >A given subnet will support only one of the two, not both
> >> >simultaneously. The flag only indicates which type it is. RC and UC
> >> >are both useful to different people and implementations, so both are
> >> >allowed. I suggest that both not be allowed in the same IPoIB subnet
> >> >though.
>
> Is there any reason for this somewhat arbitrary restriction (of
> supporting only one type per subnet)? It seems to introduce complexity,
> not reduce it. How would the preferred type be determined?

The idea had been to simplify -- the link characteristics within a subnet
are then the same. However, it appears that the original suggestion of an
intermixed set of IPoIB modes is preferable. I'm fine with that. I've
summarised the current view in my reply to your comment in the 'comments
on draft-kashyap-ipoib...' thread.

> >RC and UC both have benefits. There is almost no difference other than
> >the connection flag between the two.
> >
> >> To be explicit, I think there is benefit in implementing one and only
> >> one of the two. Having two options serves no purpose and adds
> >> unnecessary complexity. Interoperability will end up requiring both
> >> to be done if customers are to not get upset.
>
> Not sure why this would affect interoperability. All implementations
> must always support UD as the basic fallback plan.
>
> >> Let's just pick one of the two and apply KISS. To get this started,
> >> I'll propose RC as that is a bit nicer to the fabric than UC and is
> >> already implemented in most OS and CA drivers today, so it makes it
> >> faster to adopt with minimal driver software update.
>
> Forcing a choice here can be a problem unless there is a clear winner or
> consensus. That doesn't seem to be the case at this point.
>
> >> > > If a designer is stupid, they may do this. However, one would
> >> > > expect some intelligence here, and one may prefer to have specific
> >> > > data flows or DiffServ code points or whatever used to determine
> >> > > which connection or which UD QP, and that one would again apply an
> >> > > intelligent and predictable algorithm such that mix-n-match for a
> >> > > given TCP connection does not occur. Given multiple *C QPs can be
> >> > > supported, it is not tenable to state that all unicast must go
> >> > > over a given QP or that no unicast can occur on a UD QP.
> >> >
> >> >You missed my point, which was that the specification cannot be
> >> >silent on this and say it is a local issue. That can lead to
> >> >interoperability failure. The specification must support or disallow
> >> >unicast communication over the UD QP in an IPoIB-CM.
> >> >
> >> >You prefer that such communication be supported. That works. Any
> >> >other thoughts?
> >> I prefer that guidance be provided and that it remain a local
> >> implementation issue as to what QP is used for a given flow. I do not
> >> see interoperability issues, only potential performance issues if
> >> people are stupid. The industry has a way to deal with stupidity, and
> >> too much time is spent on preventing people from being stupid. Even a
> >> so-so intelligent implementation could have a simple flag for a given
> >> target IP address that states which QP to target for all or a subset
> >> of the flows, with minimal cost to implement and troubleshoot /
> >> validate.
> >
> >If something is left unspecified there is every chance that incompatible
> >implementations result - this has nothing to do with the mental
> >faculties of the implementors. Therefore I'll add relevant text.
>
> There has always been an implicit assumption in my mind that if an
> endpoint allows a link-layer communication channel to be established, it
> will commit itself to tune in. Otherwise what's the point? How the send
> side utilizes multiple communication channels can be completely left to
> the implementation.
>
> If people don't think the above is obvious, and hence some explicit
> wording is needed, that's fine with me too.

I was addressing the interoperability requirement. If a sender is allowed
to send a datagram over any of the modes, then the receiver needs to be
instructed to receive datagrams over all of the modes it supports.
Otherwise we are likely to have communication failure between nodes. It is
preferable to specify this.

With respect to tuning or optimisations, including implementation caveats
in the draft is helpful though not required.

Vivek

__
Vivek Kashyap
Linux Technology Center, IBM

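Vivek's interoperability rule amounts to: poll every QP the interface
supports and hand whatever arrives to the same IP input path. A minimal
sketch, assuming the OpenFabrics verbs API; ip_input_from_wc is a
hypothetical helper standing in for the stack's receive entry point:

    #include <infiniband/verbs.h>

    /* Hypothetical: strips the IPoIB encapsulation from a completed receive
     * and passes the IP datagram up the stack. */
    void ip_input_from_wc(struct ibv_wc *wc);

    /* Illustrative receive loop: poll the UD QP's CQ and every connected-
     * mode QP's CQ, and feed all successfully received datagrams to the
     * same IP input path, so the sender's choice of UD, RC or UC never
     * matters for interoperability. */
    static void ipoib_poll_all(struct ibv_cq **cqs, int ncqs)
    {
        struct ibv_wc wc;

        for (int i = 0; i < ncqs; i++) {       /* cqs[0] = UD, others = RC/UC */
            while (ibv_poll_cq(cqs[i], 1, &wc) > 0) {
                if (wc.status == IBV_WC_SUCCESS && wc.opcode == IBV_WC_RECV)
                    ip_input_from_wc(&wc);
            }
        }
    }
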
List List-Unsubscribe: , List-Post: List-Help: List-Subscribe: , Content-Type: multipart/mixed; boundary="===============0356602645==" Sender: ipoverib-bounces@ietf.org Errors-To: ipoverib-bounces@ietf.org --===============0356602645== Content-Type: multipart/alternative; boundary="----_=_NextPart_001_01C4D131.9F9E2240" ------_=_NextPart_001_01C4D131.9F9E2240 Content-Type: text/plain

Hi Mike,
My comments below.
-Dror

-----Original Message-----
From: Michael Krause [mailto:krause@cup.hp.com]
Sent: Tuesday, November 23, 2004 12:27 AM
To: IPoverIB
Subject: RE: [Ipoverib] A Couple of IPoIB Questions

At 01:36 PM 11/22/2004, Dror Goldenberg wrote:
Hi Mike,
Please see below.
Thanks
Dror

-----Original Message-----
From: Michael Krause [mailto:krause@cup.hp.com]
Sent: Monday, November 22, 2004 8:37 PM
To: IPoverIB
Subject: RE: [Ipoverib] A Couple of IPoIB Questions

At 08:49 AM 11/20/2004, Dror Goldenberg wrote:
-----Original Message-----
From: Michael Krause [mailto:krause@cup.hp.com]
Sent: Friday, November 19, 2004 3:45 AM
To: Vivek Kashyap
Cc: IPoverIB
Subject: Re: [Ipoverib] A Couple of IPoIB Questions

At 05:14 PM 11/18/2004, Vivek Kashyap wrote:
RC and UC both have benefits. There is almost no difference other than the connection flag between the two.

Many host OS implementations do not support UC, as RC and UD are all that is really required within the industry. The ACK overhead associated with RC is truly noise, and the end-to-end credits are very nice as IB now supports three signaling rates combined with 4 link widths (though only three are really being implemented). Such a permutation in bandwidth capability makes RC a more tenable / good citizen as we designed it to be, so I'd prefer RC.

[DG] Mike, a few reasons I think that the end to end credits / RNR in an RC connection is a problem. It may be worth discussing it:
1) Lack of receive WQEs in the responder implies a slow responder. Getting the message dropped in this case is desirable for protocols that have injection control such as TCP. In this case it is supposed to back off and restart more slowly. While UC/UD result in a similar behavior of messages being dropped at the receiver when it's slow, RC does not. Instead, there is persistence in getting the message transmitted and the receiver won't be able to tell the requester that it's being slow.

TCP on the sending side will regulate due to lack of update window credits. Hence, there is no need to restart the large messages that are put forth as the reason for using *C instead of UD.

[dg] I think it'll be common to find very large TCP windows being advertised.

A TCP window that is advertised is required to have the associated buffering available. While some implementations assume statistical provisioning in the kernel, they assume that the application buffers are available and the only problem is being able to move kernel buffers quickly enough to application buffers, which is a transient issue.

[dg] Right, this is exactly the kind of implementations I refer to. These tend to oversubscribe buffers both at the NIC level (i.e. there are far fewer buffers posted to the RX of the NIC than the sum of TCP windows), and at the TCP level. When such a machine is busy, packets start dropping.
Therefore, when you work against a very slow receiver, I think that it makes sense to activate the TCP congestion mechanism rather than to rely on the TCP window which is not intended to take care of congestion. Typically, the overall advertised TCP windows (from all connections together) is much more than actually being posted on the IPoIB QP receive queue. In a slow receiver, the replenishment pace on receive WQEs is slow, and you'd want remote senders to slow down when trying to fill its TCP Windows. Dropping a buffer is fine but that should be at the TCP/IP level and not a driver decision. A driver should have sufficient buffers to avoid having wasted the network bandwidth. Hence, the driver should be posting sufficient buffers to keep up with the workload which may span multiple connections / datagrams. Use of UC or RC does not change anything in this regard. A drop using UC would simply waste IB network bandwidth, consume HCA resources flushing the work (the transmitter would continue to transmit so nothing is saved there), etc. and only impact one connection at a time. It does nothing for the rest of the connections. So while one might get a bit of benefit akin to a RED scheme, if the endnode pairs are operating at a high workload, all one gets with UC is the ability of one endnode to flood another with no push back except on random connections. This would lead to bursty behavior and unpredictable application responsiveness. RC leads to smoother performance between the endnode pair and with the use of multiple RC QP, one can differentiate traffic for QoS purposes which is something that will benefit applications. [dg] If you work with RC, then in the slow receiver case, backpressure will propagate into the sender (RQ is full, no end to end credits are reflected, peer SQ becomes full and you're out of SQ WQEs). In this case, what will you do in the requester side ? - Tell the upper TCP/IP layers that the NIC TX ring is full - this will cause OS not to post buffers to ANY of current RC connections. I don't think it's desirable, it'll slow down / block your connections with the other remote peers - Pretend as if there is still room in the SQ - but when OS posts to the full SQ, you'll drop the packet -> this will be just the same as the UC case, except that you do it in the sender instead of the receiver - Pretend as if there is still room in the SQ - but when OS posts to the full SQ, you'll queue it in SW. I think it'll risk shared resources. What I am trying to say, is that we need to understand what happens in the case of the slow receiver. I think that in RC what you'll end up having is the peer requester dropping the packets. In UC, you'll get the responder dropping the packets. As of how much you flood the IB fabric, see my comment on the second question. 2) How would you configure the RNR retry counters. Would they be configured to infinity ? Doesn't sound good. Would they be configured to a finite value (should be <7), in which case, in the case of a slow receiver you'd end up recreating connections that had end to end credits problem, which is a real overhead on the protocol. RNR would be no different for IP over IB than for any other IB RC instance. [dg] Example ULPs such as SDP and SRP use SW level flow control and do not rely on RNR NAKs. These are also not IP based ULP. What I am trying to say is if you configure your QP for finite retries and a reasonable timeout, then when the receiver is slow, you'd often get the QP into the error state, after RNR retries are exhausted. 
The overhead of reestablishing a new connection each time the QP gets into the error state is high. If you use UC, then this is not a problem, because none of this happens. Given RC uses send credits and therefore should not see a new message unless there is an associated buffer available which increments the credit count, one should not get a RNR NAK ever. The reason for RNR NAK was to deal with a resource other than a receive buffer missing, e.g. QP context or V-to-P translation or whatever not being chip resident and some time would be required to refresh without going into the error state. Given RC is still send-receive based, there should not be any reason for a RNR NAK and no SEND will occur unless a credit is provided. [dg] yes and no. If you work with regular RC, then when RQ is empty, then the peer SQ will send probing packets (e.g. send first/send only) to see if credits became available. In this case you will see RNR Nak, but what you inject to the fabric before getting it is a single packet. So I agree that you don't flood the IB fabric in this case. The issue with e2e credits reflection is when one wants to use SRQ, instead of posting to each RQ separately and consuming many resources. In this case, e2e credits are no longer reflected by ACK packets. And you're going to send messages to the remote side without any flow control, and get RNR Naks when peer RQ is empty. If most implementations use SRQ, then fabric is going to be flooded anyways because of slow receivers. I know that SRQ today is not allowed on UC, but that's a different story... Mike ------_=_NextPart_001_01C4D131.9F9E2240 Content-Type: text/html Message
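The RNR retry question being argued above maps onto a small set of ordinary RC QP attributes. Purely as a rough sketch, written against the later OpenFabrics libibverbs API for illustration (the thread predates it, and none of the numeric values below are mandated by the IPoIB drafts), the sender-side policy is chosen when the QP is moved to the ready-to-send state:

  #include <stdint.h>
  #include <string.h>
  #include <infiniband/verbs.h>

  /* Illustrative only: move an RC QP that is already in RTR to RTS, picking
   * the RNR policy debated above.  rnr_retry is a 3-bit field: 0..6 means the
   * QP drops into the error state after that many RNR NAK retries, while 7
   * means retry forever (the "infinity" option in the thread). */
  static int rc_qp_to_rts(struct ibv_qp *qp, uint32_t sq_psn, uint8_t rnr_retry)
  {
          struct ibv_qp_attr attr;

          memset(&attr, 0, sizeof(attr));
          attr.qp_state      = IBV_QPS_RTS;
          attr.timeout       = 14;         /* local ACK timeout: 4.096us * 2^14, ~67 ms */
          attr.retry_cnt     = 7;          /* transport retries for lost/unacked packets */
          attr.rnr_retry     = rnr_retry;  /* the "<7 or infinite" choice */
          attr.sq_psn        = sq_psn;
          attr.max_rd_atomic = 1;

          return ibv_modify_qp(qp, &attr,
                               IBV_QP_STATE | IBV_QP_TIMEOUT | IBV_QP_RETRY_CNT |
                               IBV_QP_RNR_RETRY | IBV_QP_SQ_PSN |
                               IBV_QP_MAX_QP_RD_ATOMIC);
  }

The receiver-side half of the same policy is the min_rnr_timer attribute (set with IBV_QP_MIN_RNR_TIMER during the transition to RTR), which determines the back-off interval advertised in the RNR NAKs the responder returns.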
    ------_=_NextPart_001_01C4D131.9F9E2240-- --===============0356602645== Content-Type: text/plain; charset="us-ascii" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit Content-Disposition: inline Content-Transfer-Encoding: 7bit _______________________________________________ IPoverIB mailing list IPoverIB@ietf.org https://www1.ietf.org/mailman/listinfo/ipoverib --===============0356602645==-- From ipoverib-bounces@ietf.org Tue Nov 23 10:46:39 2004 Received: from megatron.ietf.org (megatron.ietf.org [132.151.6.71]) by ietf.org (8.9.1a/8.9.1a) with ESMTP id KAA23298 for ; Tue, 23 Nov 2004 10:46:38 -0500 (EST) Received: from localhost.localdomain ([127.0.0.1] helo=megatron.ietf.org) by megatron.ietf.org with esmtp (Exim 4.32) id 1CWcjQ-0004BN-AE; Tue, 23 Nov 2004 10:37:28 -0500 Received: from odin.ietf.org ([132.151.1.176] helo=ietf.org) by megatron.ietf.org with esmtp (Exim 4.32) id 1CWcVp-0001WK-Kp for ipoverib@megatron.ietf.org; Tue, 23 Nov 2004 10:23:26 -0500 Received: from ietf-mx.ietf.org (ietf-mx.ietf.org [132.151.6.1]) by ietf.org (8.9.1a/8.9.1a) with ESMTP id KAA20769 for ; Tue, 23 Nov 2004 10:23:22 -0500 (EST) Received: from palrel11.hp.com ([156.153.255.246]) by ietf-mx.ietf.org with esmtp (Exim 4.33) id 1CWcZU-0007KI-Ll for ipoverib@ietf.org; Tue, 23 Nov 2004 10:27:13 -0500 Received: from esmail.cup.hp.com (esmail.cup.hp.com [15.0.65.164]) by palrel11.hp.com (Postfix) with ESMTP id 08ABD30FE9 for ; Tue, 23 Nov 2004 07:23:20 -0800 (PST) Received: from MK73191c.cup.hp.com ([15.244.203.228]) by esmail.cup.hp.com (8.9.3 (PHNE_29774)/8.8.6) with ESMTP id HAA08523 for ; Tue, 23 Nov 2004 07:20:50 -0800 (PST) Message-Id: <6.1.2.0.2.20041123072036.05121170@esmail.cup.hp.com> X-Sender: krause@esmail.cup.hp.com X-Mailer: QUALCOMM Windows Eudora Version 6.1.2.0 Date: Tue, 23 Nov 2004 07:20:59 -0800 To: IPoverIB From: Michael Krause Subject: RE: [Ipoverib] A Couple of IPoIB Questions Mime-Version: 1.0 X-Spam-Score: 0.3 (/) X-Scan-Signature: f8ee348dcc4be4a59bc395f7cd6343ad X-BeenThere: ipoverib@ietf.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: IP over InfiniBand WG Discussion List List-Unsubscribe: , List-Post: List-Help: List-Subscribe: , Content-Type: multipart/mixed; boundary="===============0306016151==" Sender: ipoverib-bounces@ietf.org Errors-To: ipoverib-bounces@ietf.org --===============0306016151== Content-Type: multipart/alternative; boundary="=====================_244073719==.ALT" --=====================_244073719==.ALT Content-Type: text/plain; charset="us-ascii"; format=flowed At 11:54 PM 11/22/2004, Dror Goldenberg wrote: >Hi Mike, >My comments below. >-Dror >Dropping a buffer is fine but that should be at the TCP/IP level and not a >driver decision. A driver should have sufficient buffers to avoid having >wasted the network bandwidth. Hence, the driver should be posting >sufficient buffers to keep up with the workload which may span multiple >connections / datagrams. Use of UC or RC does not change anything in this >regard. A drop using UC would simply waste IB network bandwidth, consume >HCA resources flushing the work (the transmitter would continue to >transmit so nothing is saved there), etc. and only impact one connection >at a time. It does nothing for the rest of the connections. So while one >might get a bit of benefit akin to a RED scheme, if the endnode pairs are >operating at a high workload, all one gets with UC is the ability of one >endnode to flood another with no push back except on random >connections. 
This would lead to bursty behavior and unpredictable >application responsiveness. RC leads to smoother performance between the >endnode pair and with the use of multiple RC QP, one can differentiate >traffic for QoS purposes which is something that will benefit applications. > >[dg] If you work with RC, then in the slow receiver case, backpressure >will propagate into the sender (RQ is full, no end to end credits are >reflected, peer SQ becomes full and you're out of SQ WQEs). In this case, >what will you do in the requester side ?

The SQ can become full and the send side driver can start to drop datagrams just like one does within any driver below IP. This only impacts the one RC QP and not others (one of my reasons for wanting multiple RC between endnode pairs if performance is critical). BTW, the same issue occurs if one were using Ethernet pause functionality and forward progress could not occur.

>- Tell the upper TCP/IP layers that the NIC TX ring is full - this will >cause OS not to > post buffers to ANY of current RC connections. I don't think it's > desirable, it'll slow down / block > your connections with the other remote peers

It is treated no differently than today's solutions.

>- Pretend as if there is still room in the SQ - but when OS posts to the >full SQ, you'll drop > the packet -> this will be just the same as the UC case, except that > you do it in the sender > instead of the receiver

What is wrong with this? It aligns with today's solutions.

>- Pretend as if there is still room in the SQ - but when OS posts to the >full SQ, you'll queue > it in SW. I think it'll risk shared resources.

This is a local implementation choice and one that has been implemented in some OS. This deals with thin hardware resources on a given device and works reasonably well under bursty traffic.

> >What I am trying to say, is that we need to understand what happens in the >case of the slow receiver. I think that in RC what you'll end up having is >the peer requester dropping the packets. In UC, you'll get the responder >dropping the packets. As of how much you flood the IB fabric, see my >comment on the second question.

From what I know, it has always been the ph

> >>>2) How would you configure the RNR retry counters. Would they be >>>configured to infinity ? Doesn't sound >>> good. Would they be configured to a finite value (should be <7), in >>> which case, in the case of a slow >>> receiver you'd end up recreating connections that had end to end >>> credits problem, which is a real >>> overhead on the protocol. >>RNR would be no different for IP over IB than for any other IB RC >>instance. >>[dg] Example ULPs such as SDP and SRP use SW level flow control and do >>not rely on RNR NAKs. >These are also not IP based ULP. > >> What I am trying to say is if you configure your QP for finite retries >> and a reasonable timeout, then when the receiver is slow, you'd often >> get the QP into the error state, after RNR retries are exhausted. The >> overhead of reestablishing a new connection each time the QP gets into >> the error state is high. If you use UC, then this is not a problem, >> because none of this happens. >Given RC uses send credits and therefore should not see a new message >unless there is an associated buffer available which increments the credit >count, one should not get a RNR NAK ever. The reason for RNR NAK was to >deal with a resource other than a receive buffer missing, e.g.
QP context >or V-to-P translation or whatever not being chip resident and some time >would be required to refresh without going into the error state. Given RC >is still send-receive based, there should not be any reason for a RNR NAK >and no SEND will occur unless a credit is provided. > >[dg] yes and no. If you work with regular RC, then when RQ is empty, then >the peer SQ will send probing packets (e.g. send first/send only) to see >if credits became available. In this case you will see RNR Nak, but what >you inject to the fabric before getting it is a single packet. So I agree >that you don't flood the IB fabric in this case. >The issue with e2e credits reflection is when one wants to use SRQ, >instead of posting to each RQ separately and consuming many resources. In >this case, e2e credits are no longer reflected by ACK packets. And you're >going to send messages to the remote side without any flow control, and >get RNR Naks when peer RQ is empty. If most implementations use SRQ, then >fabric is going to be flooded anyways because of slow receivers. >I know that SRQ today is not allowed on UC, but that's a different story...

So you are arguing for SRQ, which is not supported by UC. I don't know if SRQ has value or not, but if you want to discuss SRQ, then let's discuss that and not RC vs. UC, as I don't think the counter arguments against RC are significant.

Mike
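The SRQ point in this exchange is easiest to see at the verbs level. Again only as an illustrative sketch in the later OpenFabrics libibverbs API (nothing here is required by the draft, and the sizes are arbitrary): once receive buffers come from one shared receive queue instead of each connection's own RQ, the per-QP end-to-end credit advertisement goes away, so a slow receiver is visible to requesters only through RNR NAKs.

  #include <string.h>
  #include <infiniband/verbs.h>

  /* Illustrative sketch: one SRQ shared by many RC QPs. */
  static struct ibv_srq *make_shared_rq(struct ibv_pd *pd)
  {
          struct ibv_srq_init_attr init;

          memset(&init, 0, sizeof(init));
          init.attr.max_wr  = 4096;   /* size of the shared buffer pool (example value) */
          init.attr.max_sge = 1;

          return ibv_create_srq(pd, &init);
  }

  static struct ibv_qp *make_rc_qp_on_srq(struct ibv_pd *pd, struct ibv_cq *cq,
                                          struct ibv_srq *srq)
  {
          struct ibv_qp_init_attr init;

          memset(&init, 0, sizeof(init));
          init.send_cq          = cq;
          init.recv_cq          = cq;
          init.srq              = srq;        /* receives come from the shared pool  */
          init.qp_type          = IBV_QPT_RC; /* per the thread, SRQ is not allowed on UC */
          init.cap.max_send_wr  = 256;
          init.cap.max_send_sge = 1;

          return ibv_create_qp(pd, &init);
  }

Buffers are then replenished with ibv_post_srq_recv() on the SRQ rather than with per-QP ibv_post_recv() calls, which is exactly the resource saving, and the loss of per-connection flow control, that the two posters are weighing against each other.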
    Mike --=====================_244073719==.ALT-- --===============0306016151== Content-Type: text/plain; charset="us-ascii" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit Content-Disposition: inline Content-Transfer-Encoding: 7bit _______________________________________________ IPoverIB mailing list IPoverIB@ietf.org https://www1.ietf.org/mailman/listinfo/ipoverib --===============0306016151==-- From ipoverib-bounces@ietf.org Mon Nov 29 16:25:32 2004 Received: from megatron.ietf.org (megatron.ietf.org [132.151.6.71]) by ietf.org (8.9.1a/8.9.1a) with ESMTP id QAA07701 for ; Mon, 29 Nov 2004 16:25:31 -0500 (EST) Received: from localhost.localdomain ([127.0.0.1] helo=megatron.ietf.org) by megatron.ietf.org with esmtp (Exim 4.32) id 1CYs1P-0000Vp-6T; Mon, 29 Nov 2004 15:21:19 -0500 Received: from odin.ietf.org ([132.151.1.176] helo=ietf.org) by megatron.ietf.org with esmtp (Exim 4.32) id 1CYrwp-0006FT-Vu; Mon, 29 Nov 2004 15:16:36 -0500 Received: from CNRI.Reston.VA.US (localhost [127.0.0.1]) by ietf.org (8.9.1a/8.9.1a) with ESMTP id PAA27045; Mon, 29 Nov 2004 15:16:34 -0500 (EST) Message-Id: <200411292016.PAA27045@ietf.org> Mime-Version: 1.0 Content-Type: Multipart/Mixed; Boundary="NextPart" To: i-d-announce@ietf.org From: Internet-Drafts@ietf.org Date: Mon, 29 Nov 2004 15:16:34 -0500 Cc: ipoverib@ietf.org Subject: [Ipoverib] I-D ACTION:draft-ietf-ipoib-dhcp-over-infiniband-07.txt X-BeenThere: ipoverib@ietf.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: IP over InfiniBand WG Discussion List List-Unsubscribe: , List-Post: List-Help: List-Subscribe: , Sender: ipoverib-bounces@ietf.org Errors-To: ipoverib-bounces@ietf.org --NextPart A New Internet-Draft is available from the on-line Internet-Drafts directories. This draft is a work item of the IP over InfiniBand Working Group of the IETF. Title : DHCP over InfiniBand Author(s) : V. Kashyap Filename : draft-ietf-ipoib-dhcp-over-infiniband-07.txt Pages : 7 Date : 2004-11-29 An InfiniBand network uses a link-layer addressing scheme that is 20-octets long. This is larger than the 16-octets reserved for the hardware address in DHCP/BOOTP message. The above inequality imposes restrictions on the use of the DHCP message fields when used over an IP over InfiniBand(IPoIB) network. This document describes the use of DHCP message fields when implementing DHCP over IPoIB. A URL for this Internet-Draft is: http://www.ietf.org/internet-drafts/draft-ietf-ipoib-dhcp-over-infiniband-07.txt To remove yourself from the I-D Announcement list, send a message to i-d-announce-request@ietf.org with the word unsubscribe in the body of the message. You can also visit https://www1.ietf.org/mailman/listinfo/I-D-announce to change your subscription settings. Internet-Drafts are also available by anonymous FTP. Login with the username "anonymous" and a password of your e-mail address. After logging in, type "cd internet-drafts" and then "get draft-ietf-ipoib-dhcp-over-infiniband-07.txt". A list of Internet-Drafts directories can be found in http://www.ietf.org/shadow.html or ftp://ftp.ietf.org/ietf/1shadow-sites.txt Internet-Drafts can also be obtained by e-mail. Send a message to: mailserv@ietf.org. In the body type: "FILE /internet-drafts/draft-ietf-ipoib-dhcp-over-infiniband-07.txt". NOTE: The mail server at ietf.org can return the document in MIME-encoded form by using the "mpack" utility. To use this feature, insert the command "ENCODING mime" before the "FILE" command. To decode the response(s), you will need "munpack" or a MIME-compliant mail reader. 
Different MIME-compliant mail readers exhibit different behavior, especially when dealing with "multipart" MIME messages (i.e. documents which have been split up into multiple messages), so check your local documentation on how to manipulate these messages. Below is the data which will enable a MIME compliant mail reader implementation to automatically retrieve the ASCII version of the Internet-Draft. --NextPart Content-Type: Multipart/Alternative; Boundary="OtherAccess" --OtherAccess Content-Type: Message/External-body; access-type="mail-server"; server="mailserv@ietf.org" Content-Type: text/plain Content-ID: <2004-11-29152206.I-D@ietf.org> ENCODING mime FILE /internet-drafts/draft-ietf-ipoib-dhcp-over-infiniband-07.txt --OtherAccess Content-Type: Message/External-body; name="draft-ietf-ipoib-dhcp-over-infiniband-07.txt"; site="ftp.ietf.org"; access-type="anon-ftp"; directory="internet-drafts" Content-Type: text/plain Content-ID: <2004-11-29152206.I-D@ietf.org> --OtherAccess-- --NextPart Content-Type: text/plain; charset="us-ascii" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit Content-Disposition: inline Content-Transfer-Encoding: 7bit _______________________________________________ IPoverIB mailing list IPoverIB@ietf.org https://www1.ietf.org/mailman/listinfo/ipoverib --NextPart-- From ipoverib-bounces@ietf.org Tue Nov 30 16:20:27 2004 Received: from megatron.ietf.org (megatron.ietf.org [132.151.6.71]) by ietf.org (8.9.1a/8.9.1a) with ESMTP id QAA10096 for ; Tue, 30 Nov 2004 16:20:27 -0500 (EST) Received: from localhost.localdomain ([127.0.0.1] helo=megatron.ietf.org) by megatron.ietf.org with esmtp (Exim 4.32) id 1CZEe3-00022V-5b; Tue, 30 Nov 2004 15:30:43 -0500 Received: from odin.ietf.org ([132.151.1.176] helo=ietf.org) by megatron.ietf.org with esmtp (Exim 4.32) id 1CZEKc-0007Vc-RI; Tue, 30 Nov 2004 15:10:39 -0500 Received: from CNRI.Reston.VA.US (localhost [127.0.0.1]) by ietf.org (8.9.1a/8.9.1a) with ESMTP id PAA28790; Tue, 30 Nov 2004 15:10:36 -0500 (EST) Message-Id: <200411302010.PAA28790@ietf.org> Mime-Version: 1.0 Content-Type: Multipart/Mixed; Boundary="NextPart" To: i-d-announce@ietf.org From: Internet-Drafts@ietf.org Date: Tue, 30 Nov 2004 15:10:36 -0500 Cc: ipoverib@ietf.org Subject: [Ipoverib] I-D ACTION:draft-ietf-ipoib-ibmib-tc-mib-06.txt X-BeenThere: ipoverib@ietf.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: IP over InfiniBand WG Discussion List List-Unsubscribe: , List-Post: List-Help: List-Subscribe: , Sender: ipoverib-bounces@ietf.org Errors-To: ipoverib-bounces@ietf.org --NextPart A New Internet-Draft is available from the on-line Internet-Drafts directories. This draft is a work item of the IP over InfiniBand Working Group of the IETF. Title : Definition of Textual Conventions and OBJECT-IDENTITIES for IP Over InfiniBand (IPOVERIB) Management Author(s) : S. Harnedy Filename : draft-ietf-ipoib-ibmib-tc-mib-06.txt Pages : 12 Date : 2004-11-30 This memo defines a Management Information Base (MIB) module that contains Textual Conventions and OBJECT-IDENTITIES for use in definitions of management information for IP Over InfiniBand (IPOVERIB) networks. The intent is that these TEXTUAL CONVENTIONs (TCs) will be imported and used in IPOVERIB related MIB modules. A URL for this Internet-Draft is: http://www.ietf.org/internet-drafts/draft-ietf-ipoib-ibmib-tc-mib-06.txt To remove yourself from the I-D Announcement list, send a message to i-d-announce-request@ietf.org with the word unsubscribe in the body of the message. 
You can also visit https://www1.ietf.org/mailman/listinfo/I-D-announce to change your subscription settings. Internet-Drafts are also available by anonymous FTP. Login with the username "anonymous" and a password of your e-mail address. After logging in, type "cd internet-drafts" and then "get draft-ietf-ipoib-ibmib-tc-mib-06.txt". A list of Internet-Drafts directories can be found in http://www.ietf.org/shadow.html or ftp://ftp.ietf.org/ietf/1shadow-sites.txt Internet-Drafts can also be obtained by e-mail. Send a message to: mailserv@ietf.org. In the body type: "FILE /internet-drafts/draft-ietf-ipoib-ibmib-tc-mib-06.txt". NOTE: The mail server at ietf.org can return the document in MIME-encoded form by using the "mpack" utility. To use this feature, insert the command "ENCODING mime" before the "FILE" command. To decode the response(s), you will need "munpack" or a MIME-compliant mail reader. Different MIME-compliant mail readers exhibit different behavior, especially when dealing with "multipart" MIME messages (i.e. documents which have been split up into multiple messages), so check your local documentation on how to manipulate these messages. Below is the data which will enable a MIME compliant mail reader implementation to automatically retrieve the ASCII version of the Internet-Draft. --NextPart Content-Type: Multipart/Alternative; Boundary="OtherAccess" --OtherAccess Content-Type: Message/External-body; access-type="mail-server"; server="mailserv@ietf.org" Content-Type: text/plain Content-ID: <2004-11-30111228.I-D@ietf.org> ENCODING mime FILE /internet-drafts/draft-ietf-ipoib-ibmib-tc-mib-06.txt --OtherAccess Content-Type: Message/External-body; name="draft-ietf-ipoib-ibmib-tc-mib-06.txt"; site="ftp.ietf.org"; access-type="anon-ftp"; directory="internet-drafts" Content-Type: text/plain Content-ID: <2004-11-30111228.I-D@ietf.org> --OtherAccess-- --NextPart Content-Type: text/plain; charset="us-ascii" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit Content-Disposition: inline Content-Transfer-Encoding: 7bit _______________________________________________ IPoverIB mailing list IPoverIB@ietf.org https://www1.ietf.org/mailman/listinfo/ipoverib --NextPart-- From ipoverib-bounces@ietf.org Tue Nov 30 16:22:27 2004 Received: from megatron.ietf.org (megatron.ietf.org [132.151.6.71]) by ietf.org (8.9.1a/8.9.1a) with ESMTP id QAA10684 for ; Tue, 30 Nov 2004 16:22:26 -0500 (EST) Received: from localhost.localdomain ([127.0.0.1] helo=megatron.ietf.org) by megatron.ietf.org with esmtp (Exim 4.32) id 1CZEeA-00026X-Ou; Tue, 30 Nov 2004 15:30:50 -0500 Received: from odin.ietf.org ([132.151.1.176] helo=ietf.org) by megatron.ietf.org with esmtp (Exim 4.32) id 1CZEKg-0007WE-KW; Tue, 30 Nov 2004 15:10:47 -0500 Received: from CNRI.Reston.VA.US (localhost [127.0.0.1]) by ietf.org (8.9.1a/8.9.1a) with ESMTP id PAA28804; Tue, 30 Nov 2004 15:10:40 -0500 (EST) Message-Id: <200411302010.PAA28804@ietf.org> Mime-Version: 1.0 Content-Type: Multipart/Mixed; Boundary="NextPart" To: i-d-announce@ietf.org From: Internet-Drafts@ietf.org Date: Tue, 30 Nov 2004 15:10:40 -0500 Cc: ipoverib@ietf.org Subject: [Ipoverib] I-D ACTION:draft-ietf-ipoib-ip-over-infiniband-08.txt X-BeenThere: ipoverib@ietf.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: IP over InfiniBand WG Discussion List List-Unsubscribe: , List-Post: List-Help: List-Subscribe: , Sender: ipoverib-bounces@ietf.org Errors-To: ipoverib-bounces@ietf.org --NextPart A New Internet-Draft is available from the on-line Internet-Drafts directories. 
This draft is a work item of the IP over InfiniBand Working Group of the IETF. Title : Transmission of IP over InfiniBand Author(s) : H. Chu, V. Kashyap Filename : draft-ietf-ipoib-ip-over-infiniband-08.txt Pages : 21 Date : 2004-11-30 This document specifies a method for encapsulating and transmitting IPv4/IPv6 and Address Resolution Protocol (ARP) packets over InfiniBand (IB). It describes the link layer address to be used when resolving the IP addresses in 'IP over InfiniBand (IPoIB)' subnets. The document also describes the mapping from IP multicast addresse to InfiniBand multicast addresses. Additionally this document defines the setup and configuration of IPoIB links. A URL for this Internet-Draft is: http://www.ietf.org/internet-drafts/draft-ietf-ipoib-ip-over-infiniband-08.txt To remove yourself from the I-D Announcement list, send a message to i-d-announce-request@ietf.org with the word unsubscribe in the body of the message. You can also visit https://www1.ietf.org/mailman/listinfo/I-D-announce to change your subscription settings. Internet-Drafts are also available by anonymous FTP. Login with the username "anonymous" and a password of your e-mail address. After logging in, type "cd internet-drafts" and then "get draft-ietf-ipoib-ip-over-infiniband-08.txt". A list of Internet-Drafts directories can be found in http://www.ietf.org/shadow.html or ftp://ftp.ietf.org/ietf/1shadow-sites.txt Internet-Drafts can also be obtained by e-mail. Send a message to: mailserv@ietf.org. In the body type: "FILE /internet-drafts/draft-ietf-ipoib-ip-over-infiniband-08.txt". NOTE: The mail server at ietf.org can return the document in MIME-encoded form by using the "mpack" utility. To use this feature, insert the command "ENCODING mime" before the "FILE" command. To decode the response(s), you will need "munpack" or a MIME-compliant mail reader. Different MIME-compliant mail readers exhibit different behavior, especially when dealing with "multipart" MIME messages (i.e. documents which have been split up into multiple messages), so check your local documentation on how to manipulate these messages. Below is the data which will enable a MIME compliant mail reader implementation to automatically retrieve the ASCII version of the Internet-Draft. 
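Both this draft and the DHCP-over-InfiniBand draft announced above revolve around the same 20-octet IPoIB link-layer address. A minimal sketch of its layout as the announcements describe it (the field names are illustrative, not taken from the drafts):

  #include <stdint.h>

  /* Sketch of the 20-octet IPoIB link-layer address: one reserved/flags octet
   * and a 24-bit QPN, followed by the 16-octet port GID.  Being 20 octets, it
   * cannot fit in the 16-octet 'chaddr' field of a DHCP/BOOTP message, which
   * is the restriction the DHCP-over-InfiniBand draft addresses. */
  struct ipoib_hw_addr {
          uint8_t flags;      /* reserved bits defined by the IPoIB spec */
          uint8_t qpn[3];     /* queue pair number, network byte order   */
          uint8_t gid[16];    /* port GID                                */
  };                          /* 20 octets total, no padding needed      */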
--NextPart Content-Type: Multipart/Alternative; Boundary="OtherAccess" --OtherAccess Content-Type: Message/External-body; access-type="mail-server"; server="mailserv@ietf.org" Content-Type: text/plain Content-ID: <2004-11-30111239.I-D@ietf.org> ENCODING mime FILE /internet-drafts/draft-ietf-ipoib-ip-over-infiniband-08.txt --OtherAccess Content-Type: Message/External-body; name="draft-ietf-ipoib-ip-over-infiniband-08.txt"; site="ftp.ietf.org"; access-type="anon-ftp"; directory="internet-drafts" Content-Type: text/plain Content-ID: <2004-11-30111239.I-D@ietf.org> --OtherAccess-- --NextPart Content-Type: text/plain; charset="us-ascii" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit Content-Disposition: inline Content-Transfer-Encoding: 7bit _______________________________________________ IPoverIB mailing list IPoverIB@ietf.org https://www1.ietf.org/mailman/listinfo/ipoverib --NextPart-- From ipoverib-bounces@ietf.org Tue Nov 30 18:34:48 2004 Received: from megatron.ietf.org (megatron.ietf.org [132.151.6.71]) by ietf.org (8.9.1a/8.9.1a) with ESMTP id SAA00448 for ; Tue, 30 Nov 2004 18:34:48 -0500 (EST) Received: from localhost.localdomain ([127.0.0.1] helo=megatron.ietf.org) by megatron.ietf.org with esmtp (Exim 4.32) id 1CZHTm-0002l4-9H; Tue, 30 Nov 2004 18:32:18 -0500 Received: from odin.ietf.org ([132.151.1.176] helo=ietf.org) by megatron.ietf.org with esmtp (Exim 4.32) id 1CZHOE-0000ua-Ev for ipoverib@megatron.ietf.org; Tue, 30 Nov 2004 18:26:34 -0500 Received: from ietf-mx.ietf.org (ietf-mx.ietf.org [132.151.6.1]) by ietf.org (8.9.1a/8.9.1a) with ESMTP id SAA29945 for ; Tue, 30 Nov 2004 18:26:31 -0500 (EST) Received: from volter-fw.ser.netvision.net.il ([212.143.107.30] helo=taurus.voltaire.com) by ietf-mx.ietf.org with esmtp (Exim 4.33) id 1CZHTQ-0006ul-Rd for ipoverib@ietf.org; Tue, 30 Nov 2004 18:31:57 -0500 X-MimeOLE: Produced By Microsoft Exchange V6.5.7226.0 Content-class: urn:content-classes:message MIME-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: quoted-printable Subject: [Ipoverib] MAC Fake with IPoIB Date: Wed, 1 Dec 2004 01:25:44 +0200 Message-ID: <35EA21F54A45CB47B879F21A91F4862F2CC1CD@taurus.voltaire.com> Thread-Topic: [Ipoverib] MAC Fake with IPoIB Thread-Index: AcTNp5+apRItYwv3SJuqKkSsHwbzhwAAaidw From: "Yaron Haviv" To: "IPoverIB" X-Spam-Score: 0.0 (/) X-Scan-Signature: b19722fc8d3865b147c75ae2495625f2 Content-Transfer-Encoding: quoted-printable X-BeenThere: ipoverib@ietf.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: IP over InfiniBand WG Discussion List List-Unsubscribe: , List-Post: List-Help: List-Subscribe: , Sender: ipoverib-bounces@ietf.org Errors-To: ipoverib-bounces@ietf.org Content-Transfer-Encoding: quoted-printable Recently we have seen several applications that required=20 An equivalent to Ethernet MAC faking, in order to implement fail-over between two nodes, Currently the way IB is implemented you cannot implement such a capability with IPoIB, and a node cannot take over another node's MAC (GID/LID+QPN). Few possible solutions can be: 1. Implement gratuities ARP's=20 That solves the problems for only some of the applications, and cannot help in cases with Active/Active, >2 configurations Or another example is that it won't work with VRRP=20 So it doesn't solve the problem=20 2. Define that an IPoIB driver should also listen on GID_out traps, and clear those GID's from its ARP cache when they go down. 
This can be used if it is possible to bring the port down in case of a failure (or make the SM issue the GID_out trap).
3. Require IPoIB drivers to support the UNARP RFC (RFC 1868); this allows a node (taking over) to ask a remote node to clear certain ARP entries. There is precedent for using UNARP in IP over SONET/SDH (RFC 2176) for the same reasons.
4. Any other methods you guys can come up with.

Your thoughts/suggestions?

Yaron

_______________________________________________ IPoverIB mailing list IPoverIB@ietf.org https://www1.ietf.org/mailman/listinfo/ipoverib
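For concreteness, option 1 above amounts to sending a gratuitous ARP whose sender hardware address is the new owner's 20-octet IPoIB address. The sketch below only builds the ARP payload, assuming the encapsulation described in the ip-over-infiniband draft (hardware type 32, 20-octet hardware addresses); the structure and helper names are hypothetical, and handing the frame to the IPoIB interface (e.g. via a packet socket) is omitted.

  #include <stdint.h>
  #include <string.h>
  #include <arpa/inet.h>

  #define ARPHRD_INFINIBAND  32     /* IANA ARP hardware type for InfiniBand */
  #define IPOIB_HW_ADDR_LEN  20     /* flags+QPN (4 octets) + GID (16 octets) */

  struct ipoib_arp {
          uint16_t ar_hrd;                  /* 32 = InfiniBand            */
          uint16_t ar_pro;                  /* 0x0800 = IPv4              */
          uint8_t  ar_hln;                  /* 20                         */
          uint8_t  ar_pln;                  /* 4                          */
          uint16_t ar_op;                   /* 1 = request                */
          uint8_t  sha[IPOIB_HW_ADDR_LEN];  /* new owner's QPN + GID      */
          uint8_t  spa[4];
          uint8_t  tha[IPOIB_HW_ADDR_LEN];  /* left zero in a gratuitous ARP */
          uint8_t  tpa[4];
  } __attribute__((packed));

  /* Illustrative only: fill in a gratuitous ARP for the taken-over IP. */
  static void build_gratuitous_arp(struct ipoib_arp *arp,
                                   const uint8_t hw_addr[IPOIB_HW_ADDR_LEN],
                                   uint32_t ip_be /* network byte order */)
  {
          memset(arp, 0, sizeof(*arp));
          arp->ar_hrd = htons(ARPHRD_INFINIBAND);
          arp->ar_pro = htons(0x0800);
          arp->ar_hln = IPOIB_HW_ADDR_LEN;
          arp->ar_pln = 4;
          arp->ar_op  = htons(1);
          memcpy(arp->sha, hw_addr, IPOIB_HW_ADDR_LEN);
          memcpy(arp->spa, &ip_be, 4);      /* sender IP == target IP ...      */
          memcpy(arp->tpa, &ip_be, 4);      /* ...which marks it as gratuitous */
  }

As the message above notes, this only helps if every peer's ARP implementation is willing to overwrite an existing cache entry on an unsolicited request, which is why the GID_out trap and UNARP (RFC 1868) alternatives are on the table.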