From ipoverib-bounces@ietf.org Tue Nov 16 16:19:29 2004
From: "Hal Rosenstock"
To: "IPoverIB"
Date: Tue, 16 Nov 2004 16:11:25 -0500
Subject: [Ipoverib] A Couple of IPoIB Questions

Hi,

I have a couple of questions relative to IPoIB:

1. draft-ietf-ipoib-ip-over-infiniband-07.txt states:
"Every IPoIB interface MUST "FullMember" join the IB multicast group defined by the broadcast-GID."

Isn't the broadcast group for IPv4? When the IPoIB interface is IPv6 only, does this group still need to be joined?
If not, where do the parameters for any IPv6 groups come from? I am presuming that this group needs to be joined in
the IPv6-only case. I just want to be sure.

2. Also, what is the latest status of Vivek's connected mode draft? Will it be moving forward?

Thanks.

-- Hal

From ipoverib-bounces@ietf.org Tue Nov 16 17:38:17 2004
From: Kanoj Sarcar
To: Hal Rosenstock
Cc: IPoverIB
Date: Tue, 16 Nov 2004 13:33:46 -0800
Subject: Re: [Ipoverib] A Couple of IPoIB Questions

Hal Rosenstock wrote:
> Hi,

Hi,

> I have a couple of questions relative to IPoIB:
>
> 1. draft-ietf-ipoib-ip-over-infiniband-07.txt states:
> "Every IPoIB interface MUST "FullMember" join the IB multicast group
> defined by the broadcast-GID."
>
> Isn't the broadcast group for IPv4? When the IPoIB interface is IPv6
> only, does this group still need to be joined?
> If not, where do the parameters for any IPv6 groups come from? I am
> presuming that this group needs to be joined in
> the IPv6-only case. I just want to be sure.

Previously on the WG, we went through a discussion on this, and the
consensus was that all interfaces (irrespective of IPv4 only, IPv6 only,
or IPv4 and IPv6) MUST join the broadcast-GID and obtain parameters for
all IPv4 and IPv6 groups from this one single broadcast-GID.

We further discussed changing the signature part of the address of the
broadcast group to reflect that it was IPv4 and IPv6 agnostic, but
maintained the IPv4 signature to make it easier for current
implementations to make any required changes to adapt to this rule.

> 2. Also, what is the latest status of Vivek's connected mode draft?
> Will it be moving forward?

Thanks.

Kanoj
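
A minimal sketch, in Python, of the idea Kanoj describes: reuse the flags/scope and P_Key carried in the broadcast-GID when building the MGID for an IP multicast group. The 16-byte MGID layout and the 0x401B/0x601B signature values are assumptions taken from the transmission draft rather than from this thread, and the remaining join parameters (Q_Key, MTU, rate, and so on) would come from the broadcast group's MCMemberRecord, which is not modeled here.

# Illustrative sketch only: reuse the flags/scope byte and the P_Key carried
# in the broadcast-GID when forming the MGID for an IPv6 multicast group.
# Layout and signature values are assumptions based on the transmission draft.

IPV6_SIGNATURE = 0x601B   # assumed IPv6 signature (0x401B would be IPv4)

def mgid_for_ipv6_group(broadcast_gid: bytes, ipv6_mcast: bytes) -> bytes:
    """Build a 16-byte IB MGID for an IPv6 multicast address, copying the
    flags/scope byte and the P_Key from the already-joined broadcast-GID."""
    assert len(broadcast_gid) == 16 and len(ipv6_mcast) == 16
    flags_scope = broadcast_gid[1]    # same flags/scope as the broadcast group
    pkey = broadcast_gid[4:6]         # partition key copied verbatim
    group_id = ipv6_mcast[-10:]       # low-order 80 bits of the IPv6 group ID (assumed)
    return bytes([0xFF, flags_scope]) + IPV6_SIGNATURE.to_bytes(2, "big") + pkey + group_id

# Hypothetical example: broadcast-GID ff12:401b:ffff::ffff:ffff and ff02::1.
bcast = bytes.fromhex("ff12401bffff000000000000ffffffff")
all_nodes = bytes.fromhex("ff020000000000000000000000000001")
print(mgid_for_ipv6_group(bcast, all_nodes).hex())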

From ipoverib-bounces@ietf.org Tue Nov 16 17:55:27 2004
From: Vivek Kashyap
To: Hal Rosenstock
Cc: IPoverIB
Date: Tue, 16 Nov 2004 14:31:27 -0800
Subject: Re: [Ipoverib] A Couple of IPoIB Questions

See below in <VK>

Vivek
--
Vivek Kashyap
Linux Technology Center, IBM
vivk@us.ibm.com
kashyapv@us.ibm.com
Ph: 503 578 3422 T/L: 775 3422



"Hal Rosenstock" <hnrose@earthlink.net>
Sent by: ipoverib-bounces@ietf.org

11/16/2004 01:11 PM
Please respond to Hal Rosenstock

       
        To:        "IPoverIB" <ipoverib@ietf.org>
        cc:        
        Subject:        [Ipoverib] A Couple of IPoIB Questions



Hi,
 
I have a couple of questions relative to IPoIB:
 
1. draft-ietf-ipoib-ip-over-infiniband-07.txt states:
"Every IPoIB interface MUST "FullMember" join the IB multicast group defined by the broadcast-GID."
 
Isn't the broadcast group for IPv4? When the IPoIB interface is IPv6 only, does this group still need to be joined?
If not, where do the parameters for any IPv6 groups come from? I am presuming that this group needs to be joined in
the IPv6-only case. I just want to be sure.
 
<VK> Yes, the broadcast-GID is at the InfiniBand layer and MUST be joined whether you are running at v4 or v6 layer. <VK>

2. Also, what is the latest status of Vivek's connected mode draft? Will it be moving forward?

<VK> I'll be submitting it as draft-ietf-ipoib-connected-mode-00.txt by the end of the month. There were some interesting suggestions made during the IETF WG meeting. Two of the suggestions of consequence are given below. The others we can discuss when the minutes are published (they include some additional requests for clarification on the transmission draft too).

a. The current draft makes the various modes mutually exclusive i.e. RC, UC and UD are not allowed simultaneously in the same IP subnet. The thought is that it is a link characteristic and hence different per connection mode. It was suggested that one be allowed to mix up RC/UC. This goes back to the original suggestion in the first draft which was:

IPoIB-UD must always be supported. Additionally, the interface can also support either both of RC and UC, or one of them. Or neither of them.
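
As a minimal sketch of the combination rule just stated (the class and names below are hypothetical, not from the draft): UD is unconditional, while RC and UC are each independently optional.

# Hypothetical sketch of the rule above: every interface supports UD; RC and
# UC are optional in any combination.  Names are illustrative only.

from dataclasses import dataclass

@dataclass(frozen=True)
class IPoIBCapabilities:
    rc: bool = False
    uc: bool = False

    def modes(self) -> set:
        supported = {"UD"}     # mandatory for every IPoIB interface
        if self.rc:
            supported.add("RC")
        if self.uc:
            supported.add("UC")
        return supported

# The four allowed combinations: UD only, UD+RC, UD+UC, UD+RC+UC.
for caps in (IPoIBCapabilities(), IPoIBCapabilities(rc=True),
             IPoIBCapabilities(uc=True), IPoIBCapabilities(rc=True, uc=True)):
    print(sorted(caps.modes()))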

b. Another suggestion was to allow multiple connected mode links (i.e. at IB UC/RC level) between peers.

One thought can be 'yes, but user beware': The IB connections are made using the service ID that is derived from the QPN as described in the draft. If a second attempt succeeds then there are two links. It is up to the implementation to either allow or disallow multiple links.

Thoughts?

<VK>


 
Thanks.
 
-- Hal


From ipoverib-bounces@ietf.org Tue Nov 16 23:15:59 2004
From: Michael Krause
To: Vivek Kashyap, Hal Rosenstock
Cc: IPoverIB
Date: Tue, 16 Nov 2004 20:08:25 -0800
Subject: Re: [Ipoverib] A Couple of IPoIB Questions

At 02:31 PM 11/16/2004, Vivek Kashyap wrote:

See below in <VK>

Vivek
--
Vivek Kashyap
Linux Technology Center, IBM
vivk@us.ibm.com
kashyapv@us.ibm.com
Ph: 503 578 3422 T/L: 775 3422



"Hal Rosenstock" <hnrose@earthlink.net>
Sent by: ipoverib-bounces@ietf.org

11/16/2004 01:11 PM
Please respond to Hal Rosenstock
       
        To:        "IPoverIB" <ipoverib@ietf.org>
        cc:       
        Subject:        [Ipoverib] A Couple of IPoIB Questions



Hi,
 
I have a couple of questions relative to IPoIB:
 
1. draft-ietf-ipoib-ip-over-infiniband-07.txt states:
"Every IPoIB interface MUST "FullMember" join the IB multicast group defined by the broadcast-GID."
 
Isn't the broadcast group for IPv4? When the IPoIB interface is IPv6 only, does this group still need to be joined?
If not, where do the parameters for any IPv6 groups come from? I am presuming that this group needs to be joined in
the IPv6-only case. I just want to be sure.
 
<VK> Yes, the broadcast-GID is at the InfiniBand layer and MUST be joined whether you are running at v4 or v6 layer. <VK>

2. Also, what is the latest status of Vivek's connected mode draft? Will it be moving forward?

<VK> I'll be submitting it as draft-ietf-ipoib-connected-mode-00.txt by the end of the month. There were some interesting suggestions that were made during the IETF WG meeting. Two of the suggestions of consequence are given below. The others we can discuss when the minutes are published (they include some additional requests on clarification on the transmission draft too).

a. The current draft makes the various modes mutually exclusive i.e. RC, UC and UD are not allowed simultaneously in the same IP subnet. The thought is that it is a link characteristic and hence different per connection mode. It was suggested that one be allowed to mix up RC/UC. This goes back to the original suggestion in the first draft which was:

IPoIB-UD must always be supported. Additionally, the interface can also support either both of RC and UC, or one of them. Or neither of them.

UD MUST always be supported.  I personally don't care whether one does RC or UC but I don't think both are required as a MAY option.  The advantage of RC is the send credit algorithm.  The advantage of UC is the lack of ACK packets.  ACK is noise in the fabric while send credits provide a simple method to maintain bandwidth / injection control on a per flow basis.

I see no problems with supporting both UD and *C on the same subnet; it is rather foolish to attempt to mandate these be on separate subnets.

b. Another suggestion was to allow multiple connected mode links (i.e. at IB UC/RC level) between peers.

One thought can be 'yes, but user beware': The IB connections are made using the service ID that is derived from the QPN as described in the draft. If a second attempt succeeds then there are two links. It is up to the implementation to either allow or disallow multiple links.

Again, this has been suggested in the past (though most who were involved in the original discussions years ago are likely gone, since much of this discussion occurred before the IETF workgroup was established).  There is obvious benefit to supporting multiple RC per endnode pair.  I do not see any technical reason to oppose, nor any issue from an interoperability perspective.  There is no reason for a "user beware".  The work is rather straightforward to do and implement, and the benefit to customers is, again, rather obvious when one considers what the IB fabric offers and how connections can enable flows through multipath as well as transparent fail-over, flow scheduling, mapping of DiffServ to different arbitration / paths, etc.

Mike

From ipoverib-bounces@ietf.org Wed Nov 17 02:48:46 2004
From: Vivek Kashyap
To: Michael Krause
Cc: Hal Rosenstock, IPoverIB
Date: Tue, 16 Nov 2004 23:38:29 -0800
Subject: Re: [Ipoverib] A Couple of IPoIB Questions



      Hi,


      I have a couple of questions relative to IPoIB:


      1. draft-ietf-ipoib-ip-over-infiniband-07.txt states:

      "Every IPoIB interface MUST "FullMember" join the IB mul= ticast group defined by the broadcast-GID."

      Isn't the broadcast group for IPv4 ? When the IPoIB interface is IPv6 o= nly, does this group still need be joined ?

      If not, where do the parameters for any IPv6 groups come from ? I am pr= esuming that this group needs to be joined in

      the IPv6 only case. I just want to be su= re.

      <VK> Yes, the broadcast-GID is at the InfiniBand layer and MUST b= e joined whether you are running at v4 or v6 layer. <VK>

      2. ALso, what is the latest status of the Vivek's connected mode draft = ? Will it be moving forward ?


      <VK> I'll be submitting it as draft-ietf-ipoib-connected-mode-00.txt by the end of the month. There were some interesting suggestions made during the IETF WG meeting. Two of the suggestions of consequence are given below. The others we can discuss when the minutes are published (they include some additional requests for clarification on the transmission draft too).


      a. The current draft makes the various modes mutually exclusive i.e. RC, UC and UD are not allowed simultaneously in the same IP subnet. The thought is that it is a link characteristic and hence different per connection mode. It was suggested that one be allowed to mix up RC/UC. This goes back to the original suggestion in the first draft which was:

      IPoIB-UD must always be supported. Additionally, the interface can also support either both of RC and UC, or one of them. Or neither of them.


UD MUST always be supported.


<VK> That is and has always been the requirement right from the first draft. <VK>

I personally don't care whether one does RC or UC but I don't think both are required as a MAY option. The advantage of RC is the send credit algorithm. The advantage of UC is the lack of ACK packets. ACK is noise in the fabric while send credits provide a simple method to maintain bandwidth / injection control on a per flow basis.

I see no problems with supporting both UD and *C on the same subnet; it is rather foolish to attempt to mandate these be on separate subnets.
      <VK> As per the connected-mode draft the UD mechanism is *always* required; address resolution depends on it.

      The only point of discussion is whether all nodes must support the same link characteristics in the subnet, i.e. all are RC (and UD), or all are UC (and UD), or all are UD only. The alternative is to allow the nodes to be mixed, with some nodes being RC/UD, others UC/UD, a third set UD only, and yet others probably supporting all, within the same IP subnet. [Can the same serviceID be used by both RC and UC?]

      The third alternative is to associate UD only, or UD + one of RC or UC, with the same interface. In such a case, if mismatched/unsupported connected modes are supported by two nodes then they fall back to UD. This option is not too different from the UD QP + RC or UC mechanism.

      <VK>
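
To make the fallback rule above concrete, a small sketch (a hypothetical helper, not text from the draft): a connected mode is used only when both peers support the same one, otherwise communication stays on UD. The RC-before-UC preference is an arbitrary local policy assumed for the example.

# Hypothetical sketch of the fallback discussed above: pick a connected mode
# only if both sides support the same one; otherwise fall back to UD.

def choose_mode(local_modes: set, remote_modes: set) -> str:
    """Both sets always contain 'UD'; 'RC' and 'UC' are optional."""
    for mode in ("RC", "UC"):                     # assumed local preference order
        if mode in local_modes and mode in remote_modes:
            return mode
    return "UD"                                   # mismatch or UD-only peer

print(choose_mode({"UD", "RC"}, {"UD", "RC", "UC"}))   # -> RC
print(choose_mode({"UD", "RC"}, {"UD", "UC"}))         # -> UD (fallback)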

        b. Another suggestion was to allow multiple connected mode links (i.e. at IB UC/RC level) between peers.

        One thought can be 'yes, but user beware': The IB connections are made using the service ID that is derived from the QPN as described in the draft. If a second attempt succeeds then there are two links. It is up to the implementation to either allow or disallow multiple links.

    Again, this has been suggested in the past (though most who were involved in the original discussions years ago are likely gone since much of this discussion occurred before the IETF workgroup was established).


    <VK> I'm one of the vestiges of those early times along with you and a few others...so we have hope :). <VK>

    There is obvious benefit to supporting multiple RC per endnode pair. I do not see any technical reason to oppose nor any issue from an interoperability perspective. There is no reason for a "user beware".

    <VK> It is not opposed. The 'user beware' is only underscoring that the peer interface might not support multiple links - it might enforce a limited number of connections (maybe only one) between a pair of GIDs. Similarly, an implementation not wanting to support multiple links MUST take steps to deny multiple requests.

    <VK>

    The work is rather straightforward to do and implement, and the benefit to customers is, again, rather obvious when one considers what the IB fabric offers and how connections can enable flows through multipath as well as transparent fail-over, flow scheduling, mapping of DiffServ to different arbitration / paths, etc.

    <VK> In addition, Large MTU and APM are two of the main reasons why I've been proposing IPoIB-connected mode for so long. In terms of IPoIB itself, except for the Large MTU, the parameters are hidden from it. <VK>

    Mike

From ipoverib-bounces@ietf.org Wed Nov 17 20:48:13 2004
From: Michael Krause
To: Vivek Kashyap
Cc: IPoverIB
Date: Wed, 17 Nov 2004 16:46:52 -0800
Subject: Re: [Ipoverib] A Couple of IPoIB Questions

At 11:38 PM 11/16/2004, Vivek Kashyap wrote:



        Hi, I have a couple of questions relative to IPoIB:
        1. draft-ietf-ipoib-ip-over-infiniband-07.txt states: "Every IPoIB interface MUST "FullMember" join the IB multicast group defined by the broadcast-GID."
        Isn't the broadcast group for IPv4? When the IPoIB interface is IPv6 only, does this group still need to be joined? If not, where do the parameters for any IPv6 groups come from? I am presuming that this group needs to be joined in the IPv6-only case. I just want to be sure.
        <VK> Yes, the broadcast-GID is at the InfiniBand layer and MUST be joined whether you are running at v4 or v6 layer. <VK>
        2. Also, what is the latest status of Vivek's connected mode draft? Will it be moving forward?
        <VK> I'll be submitting it as draft-ietf-ipoib-connected-mode-00.txt by the end of the month. There were some interesting suggestions made during the IETF WG meeting. Two of the suggestions of consequence are given below. The others we can discuss when the minutes are published (they include some additional requests for clarification on the transmission draft too).
        a. The current draft makes the various modes mutually exclusive i.e. RC, UC and UD are not allowed simultaneously in the same IP subnet. The thought is that it is a link characteristic and hence different per connection mode. It was suggested that one be allowed to mix up RC/UC. This goes back to the original suggestion in the first draft which was: IPoIB-UD must always be supported. Additionally, the interface can also support either both of RC and UC, or one of them. Or neither of them.

    UD MUST always be supported.


    <VK> That is and has always been the requirement right from the first draft. <VK>

    I personally don't care whether one does RC or UC but I don't think both are required as a MAY option. The advantage of RC is the send credit algorithm. The advantage of UC is the lack of ACK packets. ACK is noise in the fabric while send credits provide a simple method to maintain bandwidth / injection control on a per flow basis.

    I see no problems with supporting both UD and *C on the same subnet; it is rather foolish to attempt to mandate these be on separate subnets.
      <VK> As per the connected-mode draft the UD mechanism is *always* required; address resolution depends on it.

      The only point of discussion is whether all nodes must support the same link characteristics in the subnet, i.e. all are RC (and UD), or all are UC (and UD), or all are UD only.

    Obviously I would oppose such a solution as it creates artificial constraints with little benefit.

      The alternative is to allow the nodes to be mixed, with some nodes being RC/UD, others UC/UD, a third set UD only, and yet others probably supporting all, within the same IP subnet. [Can the same serviceID be used by both RC and UC?]

      The third alternative is to associate UD only, or UD + one of RC or UC, with the same interface. In such a case, if mismatched/unsupported connected modes are supported by two nodes then they fall back to UD. This option is not too different from the UD QP + RC or UC mechanism.

    KISS:

    - UD universal
    - *C opportunistic
            - Local management issue to control what is sent on the *C interface.  No need to specify
            - Advertise whether one or more ports are supported by UD or *C
            - Advertise whether one or more QP are supported by UD or *C
            - Let local management determine policy for what services are mapped where - no need to specify

    This is both an interoperable approach and simple to implement.  There may be some desire to add a policy interface to state preference for specific types of traffic over a given QP.  I would not oppose this but would view this as a separate draft once the basics are worked out.



      <VK>
        b. Another suggestion was to allow multiple connected mode links (i.e. at IB UC/RC level) between peers.
        One thought can be 'yes, but user beware': The IB connections are made using the service ID that is derived from the QPN as described in the draft. If a second attempt succeeds then there are two links. It is up to the implementation to either allow or disallow multiple links.

    Again, this has been suggested in the past (though most who were involved in the original discussions years gone by are likely gone since much of this discussion occurred before the IETF workgroup was established).


    <VK> I'm one of the vestiges of those early times along with you and a few others...so we have hope :). <VK>

    There is obvious benefit to supporting multiple RC per endnode pair. I do not see any technical reason to oppose nor any issue from an interoperability perspective. There is no reason for a "user beware".

    <VK> It is not opposed. The 'user beware' is only underscoring that the peer interface might not support multiple links - it might enforce a limited number of connections (maybe only one) between a pair of GIDs. Similarly, an implementation not wanting to support multiple links MUST take steps to deny multiple requests.

    *C requires CM to operate thus it is a local issue whether additional CM operations are accepted or not.  A given requester node may issue N and a given responder may state 0-N as an implementation may limit the number of *C available for IP traffic.
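
A rough sketch of the local-policy point above (the names and the limit are hypothetical; nothing here is mandated by the draft): the responder simply counts established connected-mode links per remote GID and rejects CM requests beyond its own limit, while a requester is free to attempt more.

# Hypothetical responder-side sketch: accept or reject incoming connected-mode
# (RC/UC) requests based on a purely local per-peer connection limit.

from collections import defaultdict

class ConnectionPolicy:
    def __init__(self, max_links_per_peer: int = 1):
        self.max_links_per_peer = max_links_per_peer
        self.active = defaultdict(int)        # remote GID -> established links

    def on_cm_request(self, remote_gid: bytes) -> bool:
        """Return True to accept the CM REQ, False to reject it."""
        if self.active[remote_gid] >= self.max_links_per_peer:
            return False                      # local policy: quota reached
        self.active[remote_gid] += 1
        return True

    def on_disconnect(self, remote_gid: bytes) -> None:
        self.active[remote_gid] = max(0, self.active[remote_gid] - 1)

# A requester may issue several REQs; this responder accepts at most one.
policy = ConnectionPolicy(max_links_per_peer=1)
peer = bytes.fromhex("fe800000000000000002c90200004711")   # hypothetical GID
print(policy.on_cm_request(peer))   # True  - first link accepted
print(policy.on_cm_request(peer))   # False - further links denied locally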


    <VK>

    The work is rather straightforward to do and implement, and the benefit to customers is, again, rather obvious when one considers what the IB fabric offers and how connections can enable flows through multipath as well as transparent fail-over, flow scheduling, mapping of DiffServ to different arbitration / paths, etc.

    <VK> In addition, Large MTU and APM are two of the main reasons why I've been proposing IPoIB-connected mode for so long. In terms of IPoIB itself, except for the Large MTU, the parameters are hidden from it. <VK>

    Mike

From ipoverib-bounces@ietf.org Thu Nov 18 01:58:22 2004
From: Vivek Kashyap
To: Michael Krause
Cc: IPoverIB
Date: Wed, 17 Nov 2004 22:46:49 -0800
Subject: Re: [Ipoverib] A Couple of IPoIB Questions

Mike, the format is really off in the last mail from you, making it difficult
to follow. Other than that, let us discuss in the context of the draft.

The draft is built upon the following:

1. IPoIB-RC and IPoIB-UC are optional.

2. IPoIB connected mode depends on a UD QP for address resolution and
   multicast.

As far as I know, there has been agreement on these since the earliest
connected mode draft I posted. I'd like the WG to give input on the
following issues:

3. Where does the UD QP come from? Choose one of:
   a. It is a UD QP that is associated with the interface at startup.
   b. It is a UD QP that is shared with IPoIB-UD.

   3a is more generic; it can be considered to include case 3b. The
   original proposal was limited to 3b.
4. Link characteristics

   The broadcast domain for IPoIB-RC/UC is determined exactly as in the
   IPoIB-UD case, i.e. through the broadcast-GID. A UD QP as per 3 is used
   in this step.

   Do all interfaces in IPoIB connected mode (CM) have the same link
   characteristics? i.e.

   a. All are either IPoIB-RC or IPoIB-UC.
      -- There is also a UD QP associated. The UD QP will be either 3a or 3b
         based on WG consensus.
      -- All unicast transmission is in the connected mode, i.e. RC or UC.

   b. All are IPoIB-UD. Additionally they can be one of IPoIB-RC or IPoIB-UC,
      or both.
      -- The presence of the flags indicates the type of communication
         possible.
      -- The decision to communicate using a specific mode is determined by
         the supported modes and the local policy. Note that incompatible
         policies imply that the fallback is communication over UD.
      -- The fallback mode of communication is UD.

   4b adds a lot of flexibility at the expense of a simple decision. 4a, by
   contrast, is straightforward.

5. MTU negotiation

   In the private data field of the CM message the desired MTU is included.
   It was suggested during the IPoIB meeting at IETF that it need not be
   symmetric. That is a good idea. Thus each peer declares the max MTU it
   prefers (a sketch follows this message):

   REQ:
   REP:
   RTU:

6. Multiple connections for the same IP address

   Local decision. Note that the peer might choose to not honour multiple
   connections.

Vivek

__
Vivek Kashyap
Linux Technology Center, IBM
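
The sketch referred to in item 5 above: each peer places the largest MTU it is prepared to receive in the CM private data (REQ for the active side, REP for the passive side), so the MTU can differ per direction. The 4-byte big-endian field used here is a purely hypothetical encoding; the connected-mode draft will define the actual private-data format.

# Hypothetical sketch of item 5: asymmetric MTU declaration via CM private
# data.  The 4-byte big-endian field is an assumed encoding, not the draft's.

import struct

def encode_private_data(preferred_recv_mtu: int) -> bytes:
    return struct.pack(">I", preferred_recv_mtu)

def decode_private_data(data: bytes) -> int:
    return struct.unpack(">I", data[:4])[0]

req = encode_private_data(65520)     # active side: "I can receive up to 65520"
rep = encode_private_data(16384)     # passive side prefers a smaller receive MTU

active_send_mtu = decode_private_data(rep)    # active side sends at most 16384
passive_send_mtu = decode_private_data(req)   # passive side sends at most 65520
print(active_send_mtu, passive_send_mtu)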
> > > >I see no problems with supporting both UD and *C on the same subnet; it is > >rather foolish to attempt to mandate these be on separate subnets.b > > As per the connected-mode draft the UD mechanism is *always* > >required; address resolutoin depends on it. > > > >The only point of discussion is whether all nodes must support the same > >link characteristics in the subnet i.e. all are RC (and UD), or all or UC > >(and UD), or all are UD only. > > Obviously I would oppose such a solution as it creates artificial > constraints with little benefit. > > >The alternative is to allow all the nodes to be mixed up with some nodes > >being RC/UD, others UC/UD and a third set UD only and yet others probably > >supporting all. within the same IP subnet. [Can the same serviceID be used > >by both RC and UC ?] > > > >The third alternative is to associating UD only or UD + one of RC or UC on > >the same interface. In such a case if mismatched/unsupported connected > >modes are supported by two nodes then the fall back to UD. This option is > >not too different from UD QP + RC or UC mechanism. > > KISS: > > - UD universal > - *C opportunistic > - Local management issue to control what is sent on the *C > interface. No need to specify > - Advertise whether one or more ports are supported by UD or *C > - Advertise whether one or more QP are supported by UD or *C > - Let local management determine policy for what services are > mapped where - no need to specify > > This is both an interoperable approach and simple to implement. There may > be some desire to add a policy interface to state preference for specific > types of traffic over a given QP. I would not oppose this but would view > this as a separate draft once the basics are worked out. > > > > > > >b. Another suggestion was to allow multiple connected mode links (i.e. at > >IB UC/RC level) between peers. One thought can be 'yes, but user beware': > >The IB connections are made using the service ID that is derived from the > >QPN as described in the draft. If a second attempt succeeds then there are > >two links. It is up to the implementation to either allow or disallow > >multiple links. > > > >Again, this has been suggested in the past (though most who were involved > >in the original discussions years gone by are likely gone since much of > >this discussion occurred before the IETF workgroup was established). > > > > I'm one of the vestiges of those early times along with you and a few > >others...so we have hope :). > > > >There is obvious benefit to supporting multiple RC per endnode pair. I do > >not see any technical reason to oppose nor any issue from an > >interoperability perspective. There is no reason for a "user beware". > > > > It is not opposed. The 'user beware' is only underscoring that the > >the peer interface might not support multiple links- it might enforce a > >limited number of connections (maybe only one) between a pair of GIDs. > >Similarly, an implementation not wanting to support multiple links MUST > >take steps to deny multiple requests. > > *C requires CM to operate thus it is a local issue whether additional CM > operations are accepted or not. A given requester node may issue N and a > given responder may state 0-N as an implementation may limit the number of > *C available for IP traffic. 
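As a concrete illustration of the point above that accepting additional connected-mode QPs is purely a local matter, a responder could simply cap the number of RC/UC connections it accepts per remote GID and reject anything beyond that, leaving the peer on the UD QP. The C sketch below is only one possible policy, not anything specified by the draft; the names (ipoib_cm_peer, IPOIB_CM_MAX_CONN_PER_GID, ipoib_cm_may_accept) and the limit of one connection are hypothetical.

    /* Hypothetical responder-side policy: accept at most N connected-mode
     * (RC/UC) QPs per remote GID and reject the rest, so the requester
     * keeps using the UD QP for that neighbour.  Illustrative only.
     */
    #include <stdint.h>
    #include <string.h>

    #define IPOIB_CM_MAX_CONN_PER_GID 1   /* local policy: 0..N connections */

    struct ipoib_cm_peer {
            uint8_t      gid[16];     /* remote port GID                    */
            unsigned int nr_conns;    /* established RC/UC QPs to that GID  */
    };

    /* Return nonzero if a new CM REQ from 'gid' should be accepted. */
    static int ipoib_cm_may_accept(const struct ipoib_cm_peer *peers,
                                   unsigned int nr_peers,
                                   const uint8_t gid[16])
    {
            unsigned int i;

            for (i = 0; i < nr_peers; i++)
                    if (memcmp(peers[i].gid, gid, 16) == 0)
                            return peers[i].nr_conns < IPOIB_CM_MAX_CONN_PER_GID;

            return 1;   /* unknown peer: first connection is allowed */
    }

A requester whose extra REQ is turned down simply keeps using the UD QP for that neighbour, which is the fallback assumed throughout this thread.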
> > > > > > > >The work is rather straight to do and implement and the benefit to > >customers, is again, rather obvious when one considers what the IB fabric > >offers and how connections can be enable flows through multipath as well > >as transparent fail-over, flow scheduling, mapping of DiffServ to > >different arbitration / paths, etc. > > > > In addition Large MTU and APM are two of the main reasons why I've > >been proposing IPoIB-connected mode for so long. In terms of IPoIB itself, > >except for the Large MTU, the parameters are hidden from it. > > Mike __ Vivek Kashyap Linux Technology Center, IBM _______________________________________________ IPoverIB mailing list IPoverIB@ietf.org https://www1.ietf.org/mailman/listinfo/ipoverib From ipoverib-bounces@ietf.org Thu Nov 18 10:06:13 2004 Received: from megatron.ietf.org (megatron.ietf.org [132.151.6.71]) by ietf.org (8.9.1a/8.9.1a) with ESMTP id KAA01001 for ; Thu, 18 Nov 2004 10:06:13 -0500 (EST) Received: from localhost.localdomain ([127.0.0.1] helo=megatron.ietf.org) by megatron.ietf.org with esmtp (Exim 4.32) id 1CUno1-0000mi-Es; Thu, 18 Nov 2004 10:02:41 -0500 Received: from odin.ietf.org ([132.151.1.176] helo=ietf.org) by megatron.ietf.org with esmtp (Exim 4.32) id 1CUniY-0006NC-S2 for ipoverib@megatron.ietf.org; Thu, 18 Nov 2004 09:57:02 -0500 Received: from ietf-mx.ietf.org (ietf-mx.ietf.org [132.151.6.1]) by ietf.org (8.9.1a/8.9.1a) with ESMTP id JAA00093 for ; Thu, 18 Nov 2004 09:57:00 -0500 (EST) Received: from umhlanga.stratnet.net ([12.162.17.40]) by ietf-mx.ietf.org with esmtp (Exim 4.33) id 1CUnlC-00036Q-Aw for ipoverib@ietf.org; Thu, 18 Nov 2004 09:59:46 -0500 Received: from exch-1.topspincom.com ([12.162.17.3]) by umhlanga.STRATNET.NET with Microsoft SMTPSVC(5.0.2195.5329); Thu, 18 Nov 2004 06:57:00 -0800 Received: from eddore ([10.10.253.169]) by exch-1.topspincom.com with Microsoft SMTPSVC(5.0.2195.5329); Thu, 18 Nov 2004 06:57:00 -0800 Received: from roland by eddore with local (Exim 4.34) id 1CUniQ-00075r-OC; Thu, 18 Nov 2004 06:57:00 -0800 To: Vivek Kashyap X-Message-Flag: Warning: May contain useful information References: From: Roland Dreier Date: Thu, 18 Nov 2004 06:56:54 -0800 In-Reply-To: (Vivek Kashyap's message of "Wed, 17 Nov 2004 22:46:49 -0800 (Pacific Standard Time)") Message-ID: <52oehvmci1.fsf@topspin.com> User-Agent: Gnus/5.1006 (Gnus v5.10.6) XEmacs/21.4 (Security Through Obscurity, linux) MIME-Version: 1.0 X-SA-Exim-Connect-IP: X-SA-Exim-Mail-From: roland@topspin.com Subject: Re: [Ipoverib] A Couple of IPoIB Questions Content-Type: text/plain; charset=us-ascii X-Spam-Checker-Version: SpamAssassin 2.64 (2004-01-11) on eddore X-Spam-Status: No, hits=0.1 required=5.0 tests=AWL autolearn=ham version=2.64 X-SA-Exim-Version: 4.1 (built Tue, 17 Aug 2004 11:06:07 +0200) X-SA-Exim-Scanned: Yes (on eddore) X-OriginalArrivalTime: 18 Nov 2004 14:57:00.0185 (UTC) FILETIME=[DB4A0090:01C4CD7E] X-Spam-Score: 0.0 (/) X-Scan-Signature: 30ac594df0e66ffa5a93eb4c48bcb014 Cc: Michael Krause , Vivek Kashyap , IPoverIB X-BeenThere: ipoverib@ietf.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: IP over InfiniBand WG Discussion List List-Unsubscribe: , List-Post: List-Help: List-Subscribe: , Sender: ipoverib-bounces@ietf.org Errors-To: ipoverib-bounces@ietf.org > Mike the format is really off in the last mail from you - > making it difficult to follow. 
Vivek, I think that if you used standard quoting in your replies instead of your own "" format, it would be much easier to follow email threads involving your replies. Thanks, Roland _______________________________________________ IPoverIB mailing list IPoverIB@ietf.org https://www1.ietf.org/mailman/listinfo/ipoverib From ipoverib-bounces@ietf.org Thu Nov 18 10:34:22 2004 Received: from megatron.ietf.org (megatron.ietf.org [132.151.6.71]) by ietf.org (8.9.1a/8.9.1a) with ESMTP id KAA04040 for ; Thu, 18 Nov 2004 10:34:21 -0500 (EST) Received: from localhost.localdomain ([127.0.0.1] helo=megatron.ietf.org) by megatron.ietf.org with esmtp (Exim 4.32) id 1CUoFd-0003WP-JE; Thu, 18 Nov 2004 10:31:13 -0500 Received: from odin.ietf.org ([132.151.1.176] helo=ietf.org) by megatron.ietf.org with esmtp (Exim 4.32) id 1CUnxQ-0005fg-DX for ipoverib@megatron.ietf.org; Thu, 18 Nov 2004 10:12:25 -0500 Received: from ietf-mx.ietf.org (ietf-mx.ietf.org [132.151.6.1]) by ietf.org (8.9.1a/8.9.1a) with ESMTP id KAA02076 for ; Thu, 18 Nov 2004 10:12:21 -0500 (EST) Received: from palrel10.hp.com ([156.153.255.245]) by ietf-mx.ietf.org with esmtp (Exim 4.33) id 1CUnzs-0003TM-T9 for ipoverib@ietf.org; Thu, 18 Nov 2004 10:15:08 -0500 Received: from esmail.cup.hp.com (esmail.cup.hp.com [15.0.65.164]) by palrel10.hp.com (Postfix) with ESMTP id 21DD49957D for ; Thu, 18 Nov 2004 07:12:11 -0800 (PST) Received: from MK73191c.cup.hp.com ([15.244.202.164]) by esmail.cup.hp.com (8.9.3 (PHNE_29774)/8.8.6) with ESMTP id HAA24105 for ; Thu, 18 Nov 2004 07:09:42 -0800 (PST) Message-Id: <6.1.2.0.2.20041118065847.0208dd40@esmail.cup.hp.com> X-Sender: krause@esmail.cup.hp.com X-Mailer: QUALCOMM Windows Eudora Version 6.1.2.0 Date: Thu, 18 Nov 2004 07:09:47 -0800 To: IPoverIB From: Michael Krause Subject: Re: [Ipoverib] A Couple of IPoIB Questions In-Reply-To: References: <6.1.2.0.2.20041117164050.01df1290@esmail.cup.hp.com> Mime-Version: 1.0 X-Spam-Score: 0.0 (/) X-Scan-Signature: 645960076aa293effd9740db2f975dc3 X-BeenThere: ipoverib@ietf.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: IP over InfiniBand WG Discussion List List-Unsubscribe: , List-Post: List-Help: List-Subscribe: , Content-Type: multipart/mixed; boundary="===============0433829689==" Sender: ipoverib-bounces@ietf.org Errors-To: ipoverib-bounces@ietf.org --===============0433829689== Content-Type: multipart/alternative; boundary="=====================_55855235==.ALT" --=====================_55855235==.ALT Content-Type: text/plain; charset="us-ascii"; format=flowed At 10:46 PM 11/17/2004, Vivek Kashyap wrote: >Mike the format is really off in the last mail from you - making it difficult >to follow. > > >Other than that let us discuss in the context of the draft. The draft is >built upon the following: > >1. IPoIB-RC and IPoIB-UC are optional. I would prefer only one be used - either RC or UC. I've provided some logic for either one as a preference but don't see a reason to have both. Both just leads to options which leads to interoperability problems. >2. IPoIB connected mode depends on a UD QP for address resolution and >multicast. > >As far as I know, there has been an agreement since the earliest connected >mode >draft I posted. > > >I'd like the WG to give input on the following issues: > >3. Where does the UD QP come from? Choose one of: > >a. It is a UD QP that is associated with the interface at startup. > >b. It is a UD QP that is shared with IPoIB-UD. > > >3a is more generic. It can be considered to include the case 3b. 
The original >proposal was limited to 3b. From an implementation point of view, all of this will be hidden within the driver below IP. As such, the driver will maintain the associations. Currently, each driver "instance" (may be multiple per IB port) will have at least 1 UD QP. Given the existing protocol already defines how to share this QP with other nodes, why not just re-use it and avoid doing more work? The driver can then map on a per endnode pair basis what *C QP go with what the UD QP and the spec remains largely silent on how this is accomplished. >4. Link characteristics > >The broadcast domain for IPoIB-RC/UC is determined exactly as the >IPoIB-UD case i.e. through the broadcast-GID. A UD as per 3 is used in this >step. > >Do all interfaces in the IPoIB-conneced mode(CM) have the same link >characteristics? i.e. From an implementation perspective, this is generally simplest. >a. all are either IPoIB-RC or IPoIB-UC. Preference is only 1 to be defined. > -- There is also a UD QP associated. The UD QP will be either 3a > or 3b > based on WG concensus. > > -- All unicast transmission is on the IPoIB mode i.e. RC or UC. For a given endnode pair, the policy of which QP is used for a given unicast IP datagram is really a local issue. I see some merit in the attempt to bifurcate this to multicast / broadcast to the UD QP and unicast to the *C QP. However, if the datagram fits in the PMTU of the UD QP, then either could be used. The driver would work either case. Please keep in mind that multiple *C QP can be used and their usage needs to be a local issue and not defined within the spec. >b. all are IPoIB-UD. Additionally they can be one of IPoIB-RC or IPoIB-UC >or both. > > -- The presence of the flags indicate the type of communication > possible. > -- The decision of communicating using a specific mode is > determined by > the supported modes and the local policy. Note that incompatible > policies imply that the fallback is communication over UD. > -- fallback mode of communication is UD > > >4b adds a lot of flexibility at the expense of a simple decision. 4a. by >contrast is straightforward. > > >5. MTU negotiation > > In the private data field of the CM message the desired MTU is > included. > > It was suggested during the IPoIB meeting at IETF that it need not be > symmetric. That is a good idea. Thus each peer declares the max > MTU it > prefers > > > REQ: > REP: > RTU: Rephrase this as maximum logical MTU to avoid confusion with the IB link MTU. If you start down this path, then you may need to also consider an exchange of what range of DiffServ code points to use as well. Not clear that anyone needs to deal with any latency or bandwidth guarantees but the "camel's nose is starting to enter the tent" as the saying goes. >6. Multiple connections for the same IP address > > Local decision. Note that the peer might choose to not honour > multiple > connections. Agreed. Mike >Vivek > > > > > >On Wed, 17 Nov 2004, Michael Krause wrote: > > > At 11:38 PM 11/16/2004, Vivek Kashyap wrote: > > > > > > > > >Hi, I have a couple of questions relative to IPoIB: 1. > > >draft-ietf-ipoib-ip-over-infiniband-07.txt states: "Every IPoIB interface > > >MUST "FullMember" join the IB multicast group defined by the > > >broadcast-GID." Isn't the broadcast group for IPv4 ? When the IPoIB > > >interface is IPv6 only, does this group still need be joined ? If not, > > >where do the parameters for any IPv6 groups come from ? I am presuming > > >that this group needs to be joined in the IPv6 only case. 
I just want to > > >be sure. > > > Yes, the broadcast-GID is at the InfiniBand layer and MUST be joined > > >whether you are running at v4 or v6 layer. 2. ALso, what is the > > >latest status of the Vivek's connected mode draft ? Will it be moving > > >forward ? I'll be submitting it as > > >draft-ietf-ipoib-connected-mode-00.txt by the end of the month. There were > > >some interesting suggestions that were made during the IETF WG meeting. > > >Two of the suggestions of consequence are given below. The others we can > > >discuss when the minutes are published (they include some additional > > >requests on clarification on the transmission draft too). a. The current > > >draft makes the various modes mutually exclusive i.e. RC, UC and UD are > > >not allowed simultaneously in the same IP subnet. The thought is that it > > >is a link characteristic and hence different per connection mode. It was > > >suggested that one be allowed to mix up RC/UC. This goes back to the > > >original suggestion in the first draft which was: IPoIB-UD must always be > > >supported. Additionally, the interface can also support either both of RC > > >and UC, or one of them. Or neither of them. > > > > > >UD MUST always be supported. > > > > > > That is and has always been the requirement right from the first > > >draft. > > > > > >I personally don't care whether one does RC or UC but I don't think both > > >are required as a MAY option. The advantage of RC is the send credit > > >algorithm. The advantage of UC is the lack of ACK packets. ACK is noise in > > >the fabric while send credits provide a simple method to maintain > > >bandwidth / injection control on a per flow basis. > > > > > >I see no problems with supporting both UD and *C on the same subnet; it is > > >rather foolish to attempt to mandate these be on separate subnets.b > > > As per the connected-mode draft the UD mechanism is *always* > > >required; address resolutoin depends on it. > > > > > >The only point of discussion is whether all nodes must support the same > > >link characteristics in the subnet i.e. all are RC (and UD), or all or UC > > >(and UD), or all are UD only. > > > > Obviously I would oppose such a solution as it creates artificial > > constraints with little benefit. > > > > >The alternative is to allow all the nodes to be mixed up with some nodes > > >being RC/UD, others UC/UD and a third set UD only and yet others probably > > >supporting all. within the same IP subnet. [Can the same serviceID be used > > >by both RC and UC ?] > > > > > >The third alternative is to associating UD only or UD + one of RC or UC on > > >the same interface. In such a case if mismatched/unsupported connected > > >modes are supported by two nodes then the fall back to UD. This option is > > >not too different from UD QP + RC or UC mechanism. > > > > KISS: > > > > - UD universal > > - *C opportunistic > > - Local management issue to control what is sent on the *C > > interface. No need to specify > > - Advertise whether one or more ports are supported by UD or *C > > - Advertise whether one or more QP are supported by UD or *C > > - Let local management determine policy for what services are > > mapped where - no need to specify > > > > This is both an interoperable approach and simple to implement. There may > > be some desire to add a policy interface to state preference for specific > > types of traffic over a given QP. I would not oppose this but would view > > this as a separate draft once the basics are worked out. > > > > > > > > > > > >b. 
Another suggestion was to allow multiple connected mode links (i.e. at > > >IB UC/RC level) between peers. One thought can be 'yes, but user beware': > > >The IB connections are made using the service ID that is derived from the > > >QPN as described in the draft. If a second attempt succeeds then there are > > >two links. It is up to the implementation to either allow or disallow > > >multiple links. > > > > > >Again, this has been suggested in the past (though most who were involved > > >in the original discussions years gone by are likely gone since much of > > >this discussion occurred before the IETF workgroup was established). > > > > > > I'm one of the vestiges of those early times along with you and a few > > >others...so we have hope :). > > > > > >There is obvious benefit to supporting multiple RC per endnode pair. I do > > >not see any technical reason to oppose nor any issue from an > > >interoperability perspective. There is no reason for a "user beware". > > > > > > It is not opposed. The 'user beware' is only underscoring that the > > >the peer interface might not support multiple links- it might enforce a > > >limited number of connections (maybe only one) between a pair of GIDs. > > >Similarly, an implementation not wanting to support multiple links MUST > > >take steps to deny multiple requests. > > > > *C requires CM to operate thus it is a local issue whether additional CM > > operations are accepted or not. A given requester node may issue N and a > > given responder may state 0-N as an implementation may limit the number of > > *C available for IP traffic. > > > > > > > > > > > > >The work is rather straight to do and implement and the benefit to > > >customers, is again, rather obvious when one considers what the IB fabric > > >offers and how connections can be enable flows through multipath as well > > >as transparent fail-over, flow scheduling, mapping of DiffServ to > > >different arbitration / paths, etc. > > > > > > In addition Large MTU and APM are two of the main reasons why I've > > >been proposing IPoIB-connected mode for so long. In terms of IPoIB itself, > > >except for the Large MTU, the parameters are hidden from it. > > > > Mike > >__ > >Vivek Kashyap >Linux Technology Center, IBM > > >_______________________________________________ >IPoverIB mailing list >IPoverIB@ietf.org >https://www1.ietf.org/mailman/listinfo/ipoverib --=====================_55855235==.ALT Content-Type: text/html; charset="us-ascii" At 10:46 PM 11/17/2004, Vivek Kashyap wrote:
    --=====================_55855235==.ALT-- --===============0433829689== Content-Type: text/plain; charset="us-ascii" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit Content-Disposition: inline Content-Transfer-Encoding: 7bit _______________________________________________ IPoverIB mailing list IPoverIB@ietf.org https://www1.ietf.org/mailman/listinfo/ipoverib --===============0433829689==-- From ipoverib-bounces@ietf.org Thu Nov 18 14:44:29 2004 Received: from megatron.ietf.org (megatron.ietf.org [132.151.6.71]) by ietf.org (8.9.1a/8.9.1a) with ESMTP id OAA25723 for ; Thu, 18 Nov 2004 14:44:29 -0500 (EST) Received: from localhost.localdomain ([127.0.0.1] helo=megatron.ietf.org) by megatron.ietf.org with esmtp (Exim 4.32) id 1CUs2f-0005U8-Vx; Thu, 18 Nov 2004 14:34:05 -0500 Received: from odin.ietf.org ([132.151.1.176] helo=ietf.org) by megatron.ietf.org with esmtp (Exim 4.32) id 1CUry2-0003nU-8R for ipoverib@megatron.ietf.org; Thu, 18 Nov 2004 14:29:18 -0500 Received: from ietf-mx.ietf.org (ietf-mx.ietf.org [132.151.6.1]) by ietf.org (8.9.1a/8.9.1a) with ESMTP id OAA24295 for ; Thu, 18 Nov 2004 14:29:16 -0500 (EST) Received: from nwkea-mail-1.sun.com ([192.18.42.13]) by ietf-mx.ietf.org with esmtp (Exim 4.33) id 1CUs0h-0001GP-U7 for ipoverib@ietf.org; Thu, 18 Nov 2004 14:32:04 -0500 Received: from jurassic.eng.sun.com ([129.146.85.105]) by nwkea-mail-1.sun.com (8.12.10/8.12.9) with ESMTP id iAIJTF6O028050 for ; Thu, 18 Nov 2004 11:29:15 -0800 (PST) Received: from taipei (taipei.SFBay.Sun.COM [129.146.85.178]) by jurassic.eng.sun.com (8.13.1+Sun/8.13.1) with SMTP id iAIJTE6R162628 for ; Thu, 18 Nov 2004 11:29:15 -0800 (PST) Message-Id: <200411181929.iAIJTE6R162628@jurassic.eng.sun.com> Date: Thu, 18 Nov 2004 11:27:41 -0800 (PST) From: "H.K. Jerry Chu" To: ipoverib@ietf.org MIME-Version: 1.0 Content-Type: TEXT/plain; charset=us-ascii Content-MD5: BGw7MK9FivR50WJyLLgglw== X-Mailer: dtmail 1.3.0 @(#)CDE Version 1.6_68 SunOS 5.10 sun4u sparc X-Spam-Score: 0.0 (/) X-Scan-Signature: 8b431ad66d60be2d47c7bfeb879db82c Subject: [Ipoverib] comments on draft-kashyap-ipoib-connected-mode-02.txt X-BeenThere: ipoverib@ietf.org X-Mailman-Version: 2.1.5 Precedence: list Reply-To: "H.K. Jerry Chu" List-Id: IP over InfiniBand WG Discussion List List-Unsubscribe: , List-Post: List-Help: List-Subscribe: , Sender: ipoverib-bounces@ietf.org Errors-To: ipoverib-bounces@ietf.org In the last IETF61 IPoIB meeting I made several comments on the connected mode draft. I'm sending them to the list for a general discussion. (Yes I saw some disucssion on the connected mode draft already. I'll try to catch up with the thread after this mail.) 1. The draft makes a distinction between IPoIB-CM interfaces and IPoIB-UD interfaces, and portrays IPoIB-UC or IPoIB-RC as separate subnets superimposed on top of an IPoIB-UD subnet. For the above to work, due to a lack of multicast support, a fully connected network by itself can't meet the requirement of an IP link unless multicast is fully emulated through the use of multiple unicasts. The latter is complex and cumbersome. A much simpler model, which I think was presented in earlier drafts, is to fold the use of IB connections fully into a regular IPoIB-UD subnet, allowing any two IPoIB nodes to optionally negotiate the use of IB connection between themselves. This much simplified model is not without its drawback. Some nice IP link attributes are no longer unique within a link. E.g., the link MTU now becomes per-node-pair MTU. 
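To make the per-node-pair MTU point concrete: the sender ends up consulting a per-neighbour value learned during connection setup instead of a single link-wide MTU, while multicast stays at the UD MTU. The C sketch below shows one possible shape of that bookkeeping; the structure and names (ipoib_neigh, cm_mtu, ipoib_tx_mtu) are hypothetical and not taken from any of the drafts.

    /* Hypothetical per-neighbour MTU bookkeeping for IPoIB connected mode.
     * Unicast to a peer with an established RC/UC connection uses the MTU
     * that peer advertised in the CM private data; multicast and peers
     * without a connection stay at the UD MTU.
     */
    #include <stddef.h>

    struct ipoib_neigh {
            int          has_cm_conn;   /* RC/UC connection established?   */
            unsigned int cm_mtu;        /* MTU the peer advertised via CM  */
    };

    /* Pick the MTU for an outgoing packet to 'neigh' (NULL if unknown). */
    static unsigned int ipoib_tx_mtu(const struct ipoib_neigh *neigh,
                                     unsigned int ud_mtu,
                                     int is_multicast)
    {
            if (is_multicast || neigh == NULL || !neigh->has_cm_conn)
                    return ud_mtu;          /* the ordinary per-link UD MTU */

            return neigh->cm_mtu;           /* the per-node-pair MTU        */
    }

A node with no connected-mode peers then behaves exactly like plain IPoIB-UD, since every lookup falls back to the UD MTU.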
Moreover, the MTU size for multicast will be different from the MTU size for unicast if IB connections are used. IB UC/RC may exhibit different RAS, flow control, QoS or other link characteristics than UD. But I consider these problems a reasonable price to pay for a seamless support of UC/RC mode in an IPoIB link defined by UD. 2. The negotiation of the per-connection MTU seems more complicated than necessary. I think all is needed is for a node to advertise its own "receive MTU". That is, the MTU size its peer should never go over when sending packets to the local interface. Yes this may break the traditional concept of "symmetric" MTUs. But we're already breaking the notion of per-link MTU, requring a lot of changes in the host stack anyway. This additonal breakage doesn't seem much. I haven't verified if this asymmetric MTU matches well with IBA connections though. 3. Regarding allowing multiple IB connections between a node pair, since given an IP address there is only one link-address for it implying one QPN, hence one service-ID, if a single service-ID can be used to create multiple IB connections then this can happen transparently. Otherwise we've got a problem. Jerry _______________________________________________ IPoverIB mailing list IPoverIB@ietf.org https://www1.ietf.org/mailman/listinfo/ipoverib From ipoverib-bounces@ietf.org Thu Nov 18 14:49:01 2004 Received: from megatron.ietf.org (megatron.ietf.org [132.151.6.71]) by ietf.org (8.9.1a/8.9.1a) with ESMTP id OAA26105 for ; Thu, 18 Nov 2004 14:49:01 -0500 (EST) Received: from localhost.localdomain ([127.0.0.1] helo=megatron.ietf.org) by megatron.ietf.org with esmtp (Exim 4.32) id 1CUs8m-0006a2-HP; Thu, 18 Nov 2004 14:40:24 -0500 Received: from odin.ietf.org ([132.151.1.176] helo=ietf.org) by megatron.ietf.org with esmtp (Exim 4.32) id 1CUs24-00056x-EB for ipoverib@megatron.ietf.org; Thu, 18 Nov 2004 14:33:28 -0500 Received: from ietf-mx.ietf.org (ietf-mx.ietf.org [132.151.6.1]) by ietf.org (8.9.1a/8.9.1a) with ESMTP id OAA24565 for ; Thu, 18 Nov 2004 14:33:26 -0500 (EST) Received: from e34.co.us.ibm.com ([32.97.110.132]) by ietf-mx.ietf.org with esmtp (Exim 4.33) id 1CUs4e-0001Km-Ru for ipoverib@ietf.org; Thu, 18 Nov 2004 14:36:14 -0500 Received: from westrelay02.boulder.ibm.com (westrelay02.boulder.ibm.com [9.17.195.11]) by e34.co.us.ibm.com (8.12.10/8.12.9) with ESMTP id iAIJWkAD544024 for ; Thu, 18 Nov 2004 14:32:46 -0500 Received: from d03av02.boulder.ibm.com (d03av02.boulder.ibm.com [9.17.195.168]) by westrelay02.boulder.ibm.com (8.12.10/NCO/VER6.6) with ESMTP id iAIJWkCQ220310 for ; Thu, 18 Nov 2004 12:32:46 -0700 Received: from d03av02.boulder.ibm.com (loopback [127.0.0.1]) by d03av02.boulder.ibm.com (8.12.11/8.12.11) with ESMTP id iAIJWjvL022516 for ; Thu, 18 Nov 2004 12:32:45 -0700 Received: from DYN319548.beaverton.ibm.com (DYN319548.beaverton.ibm.com [9.47.22.85]) by d03av02.boulder.ibm.com (8.12.11/8.12.11) with ESMTP id iAIJWivB022486; Thu, 18 Nov 2004 12:32:45 -0700 Date: Thu, 18 Nov 2004 11:33:40 -0800 (PST) From: Vivek Kashyap X-X-Sender: kashyapv@dyn319548.beaverton.ibm.com To: Michael Krause Subject: Re: [Ipoverib] A Couple of IPoIB Questions In-Reply-To: <6.1.2.0.2.20041118065847.0208dd40@esmail.cup.hp.com> Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII X-Spam-Score: 0.0 (/) X-Scan-Signature: 24d000849df6f171c5ec1cca2ea21b82 Cc: IPoverIB X-BeenThere: ipoverib@ietf.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: IP over InfiniBand WG Discussion List List-Unsubscribe: , 
List-Post: List-Help: List-Subscribe: , Sender: ipoverib-bounces@ietf.org Errors-To: ipoverib-bounces@ietf.org On Thu, 18 Nov 2004, Michael Krause wrote: > At 10:46 PM 11/17/2004, Vivek Kashyap wrote: > >Mike the format is really off in the last mail from you - making it difficult > >to follow. > > > > > >Other than that let us discuss in the context of the draft. The draft is > >built upon the following: > > > >1. IPoIB-RC and IPoIB-UC are optional. > > I would prefer only one be used - either RC or UC. I've provided some > logic for either one as a preference but don't see a reason to have > both. Both just leads to options which leads to interoperability problems. ok. See section 3.1 of the draft draft-kashyap-ipoib-connected-mode-02.txt. It states that the RC and UC are mutually exclusive flags. > > >2. IPoIB connected mode depends on a UD QP for address resolution and > >multicast. > > > >As far as I know, there has been an agreement since the earliest connected > >mode > >draft I posted. > > > > > >I'd like the WG to give input on the following issues: > > > >3. Where does the UD QP come from? Choose one of: > > > >a. It is a UD QP that is associated with the interface at startup. > > > >b. It is a UD QP that is shared with IPoIB-UD. > > > > > >3a is more generic. It can be considered to include the case 3b. The original > >proposal was limited to 3b. > > From an implementation point of view, all of this will be hidden within > the driver below IP. As such, the driver will maintain the > associations. Currently, each driver "instance" (may be multiple per IB > port) will have at least 1 UD QP. Given the existing protocol already > defines how to share this QP with other nodes, why not just re-use it and > avoid doing more work? The driver can then map on a per endnode pair basis > what *C QP go with what the UD QP and the spec remains largely silent on > how this is accomplished. The draft at present states that 'IPoIB-CM implementation MAY use the same UD QP as used by IPoIB-UD...'. See section 3.0. I believe it covers what you are stating. > >4. Link characteristics > > > >The broadcast domain for IPoIB-RC/UC is determined exactly as the > >IPoIB-UD case i.e. through the broadcast-GID. A UD as per 3 is used in this > >step. > > > >Do all interfaces in the IPoIB-conneced mode(CM) have the same link > >characteristics? i.e. > > From an implementation perspective, this is generally simplest. > > >a. all are either IPoIB-RC or IPoIB-UC. > > Preference is only 1 to be defined. > > > > -- There is also a UD QP associated. The UD QP will be either 3a > > or 3b > > based on WG concensus. > > > > -- All unicast transmission is on the IPoIB mode i.e. RC or UC. > > For a given endnode pair, the policy of which QP is used for a given > unicast IP datagram is really a local issue. I see some merit in the Not if an implementation chooses to only receive unicast on the CM modes in an IPoIB-CM subnet. I think the WG must either mandate that between two IP address all unicast communication can be over either UD or the supported CM, or state that all unicast communication must be over IPoIB-CM. Hence my attempt at a detailed discussion on these issues. Issues such as in order delivery need to be considered: e.g. if RC and UD are used to mix up the traffic, say of TCP segments of the same connection, they may no longer be received in order. > attempt to bifurcate this to multicast / broadcast to the UD QP and unicast > to the *C QP. 
However, if the datagram fits in the PMTU of the UD QP, then > either could be used. The driver would work either case. Please keep in > mind that multiple *C QP can be used and their usage needs to be a local > issue and not defined within the spec. > > >b. all are IPoIB-UD. Additionally they can be one of IPoIB-RC or IPoIB-UC > >or both. > > > > -- The presence of the flags indicate the type of communication > > possible. > > -- The decision of communicating using a specific mode is > > determined by > > the supported modes and the local policy. Note that incompatible > > policies imply that the fallback is communication over UD. > > -- fallback mode of communication is UD > > > > > >4b adds a lot of flexibility at the expense of a simple decision. 4a. by > >contrast is straightforward. > > > > > >5. MTU negotiation > > > > In the private data field of the CM message the desired MTU is > > included. > > > > It was suggested during the IPoIB meeting at IETF that it need not be > > symmetric. That is a good idea. Thus each peer declares the max > > MTU it > > prefers > > > > > > REQ: > > REP: > > RTU: > > Rephrase this as maximum logical MTU to avoid confusion with the IB link It is covered in section 5.1 of the draft. > MTU. If you start down this path, then you may need to also consider an > exchange of what range of DiffServ code points to use as well. Not clear > that anyone needs to deal with any latency or bandwidth guarantees but the > "camel's nose is starting to enter the tent" as the saying goes. The camel comes along if Diffserv etc. as listed above are included. Hence they are not in the draft. > > > >6. Multiple connections for the same IP address > > > > Local decision. Note that the peer might choose to not honour > > multiple > > connections. > > Agreed. > > Mike > > > > > >Vivek > > > > > > > > > > > >On Wed, 17 Nov 2004, Michael Krause wrote: > > > > > At 11:38 PM 11/16/2004, Vivek Kashyap wrote: > > > > > > > > > > > > >Hi, I have a couple of questions relative to IPoIB: 1. > > > >draft-ietf-ipoib-ip-over-infiniband-07.txt states: "Every IPoIB interface > > > >MUST "FullMember" join the IB multicast group defined by the > > > >broadcast-GID." Isn't the broadcast group for IPv4 ? When the IPoIB > > > >interface is IPv6 only, does this group still need be joined ? If not, > > > >where do the parameters for any IPv6 groups come from ? I am presuming > > > >that this group needs to be joined in the IPv6 only case. I just want to > > > >be sure. > > > > Yes, the broadcast-GID is at the InfiniBand layer and MUST be joined > > > >whether you are running at v4 or v6 layer. 2. ALso, what is the > > > >latest status of the Vivek's connected mode draft ? Will it be moving > > > >forward ? I'll be submitting it as > > > >draft-ietf-ipoib-connected-mode-00.txt by the end of the month. There were > > > >some interesting suggestions that were made during the IETF WG meeting. > > > >Two of the suggestions of consequence are given below. The others we can > > > >discuss when the minutes are published (they include some additional > > > >requests on clarification on the transmission draft too). a. The current > > > >draft makes the various modes mutually exclusive i.e. RC, UC and UD are > > > >not allowed simultaneously in the same IP subnet. The thought is that it > > > >is a link characteristic and hence different per connection mode. It was > > > >suggested that one be allowed to mix up RC/UC. 
This goes back to the > > > >original suggestion in the first draft which was: IPoIB-UD must always be > > > >supported. Additionally, the interface can also support either both of RC > > > >and UC, or one of them. Or neither of them. > > > > > > > >UD MUST always be supported. > > > > > > > > That is and has always been the requirement right from the first > > > >draft. > > > > > > > >I personally don't care whether one does RC or UC but I don't think both > > > >are required as a MAY option. The advantage of RC is the send credit > > > >algorithm. The advantage of UC is the lack of ACK packets. ACK is noise in > > > >the fabric while send credits provide a simple method to maintain > > > >bandwidth / injection control on a per flow basis. > > > > > > > >I see no problems with supporting both UD and *C on the same subnet; it is > > > >rather foolish to attempt to mandate these be on separate subnets.b > > > > As per the connected-mode draft the UD mechanism is *always* > > > >required; address resolutoin depends on it. > > > > > > > >The only point of discussion is whether all nodes must support the same > > > >link characteristics in the subnet i.e. all are RC (and UD), or all or UC > > > >(and UD), or all are UD only. > > > > > > Obviously I would oppose such a solution as it creates artificial > > > constraints with little benefit. > > > > > > >The alternative is to allow all the nodes to be mixed up with some nodes > > > >being RC/UD, others UC/UD and a third set UD only and yet others probably > > > >supporting all. within the same IP subnet. [Can the same serviceID be used > > > >by both RC and UC ?] > > > > > > > >The third alternative is to associating UD only or UD + one of RC or UC on > > > >the same interface. In such a case if mismatched/unsupported connected > > > >modes are supported by two nodes then the fall back to UD. This option is > > > >not too different from UD QP + RC or UC mechanism. > > > > > > KISS: > > > > > > - UD universal > > > - *C opportunistic > > > - Local management issue to control what is sent on the *C > > > interface. No need to specify > > > - Advertise whether one or more ports are supported by UD or *C > > > - Advertise whether one or more QP are supported by UD or *C > > > - Let local management determine policy for what services are > > > mapped where - no need to specify > > > > > > This is both an interoperable approach and simple to implement. There may > > > be some desire to add a policy interface to state preference for specific > > > types of traffic over a given QP. I would not oppose this but would view > > > this as a separate draft once the basics are worked out. > > > > > > > > > > > > > > > > >b. Another suggestion was to allow multiple connected mode links (i.e. at > > > >IB UC/RC level) between peers. One thought can be 'yes, but user beware': > > > >The IB connections are made using the service ID that is derived from the > > > >QPN as described in the draft. If a second attempt succeeds then there are > > > >two links. It is up to the implementation to either allow or disallow > > > >multiple links. > > > > > > > >Again, this has been suggested in the past (though most who were involved > > > >in the original discussions years gone by are likely gone since much of > > > >this discussion occurred before the IETF workgroup was established). > > > > > > > > I'm one of the vestiges of those early times along with you and a few > > > >others...so we have hope :). 
> > > > > > > >There is obvious benefit to supporting multiple RC per endnode pair. I do > > > >not see any technical reason to oppose nor any issue from an > > > >interoperability perspective. There is no reason for a "user beware". > > > > > > > > It is not opposed. The 'user beware' is only underscoring that the > > > >the peer interface might not support multiple links- it might enforce a > > > >limited number of connections (maybe only one) between a pair of GIDs. > > > >Similarly, an implementation not wanting to support multiple links MUST > > > >take steps to deny multiple requests. > > > > > > *C requires CM to operate thus it is a local issue whether additional CM > > > operations are accepted or not. A given requester node may issue N and a > > > given responder may state 0-N as an implementation may limit the number of > > > *C available for IP traffic. > > > > > > > > > > > > > > > > > >The work is rather straight to do and implement and the benefit to > > > >customers, is again, rather obvious when one considers what the IB fabric > > > >offers and how connections can be enable flows through multipath as well > > > >as transparent fail-over, flow scheduling, mapping of DiffServ to > > > >different arbitration / paths, etc. > > > > > > > > In addition Large MTU and APM are two of the main reasons why I've > > > >been proposing IPoIB-connected mode for so long. In terms of IPoIB itself, > > > >except for the Large MTU, the parameters are hidden from it. > > > > > > Mike > > > >__ > > > >Vivek Kashyap > >Linux Technology Center, IBM > > > > > >_______________________________________________ > >IPoverIB mailing list > >IPoverIB@ietf.org > >https://www1.ietf.org/mailman/listinfo/ipoverib > _______________________________________________ IPoverIB mailing list IPoverIB@ietf.org https://www1.ietf.org/mailman/listinfo/ipoverib From ipoverib-bounces@ietf.org Thu Nov 18 16:16:02 2004 Received: from megatron.ietf.org (megatron.ietf.org [132.151.6.71]) by ietf.org (8.9.1a/8.9.1a) with ESMTP id QAA15373 for ; Thu, 18 Nov 2004 16:16:01 -0500 (EST) Received: from localhost.localdomain ([127.0.0.1] helo=megatron.ietf.org) by megatron.ietf.org with esmtp (Exim 4.32) id 1CUsoT-0004S1-1s; Thu, 18 Nov 2004 15:23:29 -0500 Received: from odin.ietf.org ([132.151.1.176] helo=ietf.org) by megatron.ietf.org with esmtp (Exim 4.32) id 1CUskI-00014N-5S for ipoverib@megatron.ietf.org; Thu, 18 Nov 2004 15:19:10 -0500 Received: from ietf-mx.ietf.org (ietf-mx.ietf.org [132.151.6.1]) by ietf.org (8.9.1a/8.9.1a) with ESMTP id PAA02128 for ; Thu, 18 Nov 2004 15:19:07 -0500 (EST) Received: from taurus.voltaire.com ([212.143.27.73]) by ietf-mx.ietf.org with esmtp (Exim 4.33) id 1CUsmy-0002d4-HZ for ipoverib@ietf.org; Thu, 18 Nov 2004 15:21:57 -0500 X-MimeOLE: Produced By Microsoft Exchange V6.5.7226.0 Content-class: urn:content-classes:message MIME-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: quoted-printable Subject: [Ipoverib] IPoIB-RC and Checksums Date: Thu, 18 Nov 2004 22:18:25 +0200 Message-ID: <35EA21F54A45CB47B879F21A91F4862F2CBC58@taurus.voltaire.com> Thread-Topic: [Ipoverib] IPoIB-RC and Checksums Thread-Index: AcTNp5+apRItYwv3SJuqKkSsHwbzhwAAsDuQ From: "Yaron Haviv" To: "IPoverIB" X-Spam-Score: 0.0 (/) X-Scan-Signature: 93238566e09e6e262849b4f805833007 Content-Transfer-Encoding: quoted-printable X-BeenThere: ipoverib@ietf.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: IP over InfiniBand WG Discussion List List-Unsubscribe: , List-Post: List-Help: 
List-Subscribe: , Sender: ipoverib-bounces@ietf.org Errors-To: ipoverib-bounces@ietf.org Content-Transfer-Encoding: quoted-printable In GbE usually the NIC Tx Segmentation (large send) capability comes hand in hand with Checksum offload for greater efficiency (and zero copy) On UD we decided not to address checksum offloading, since we cannot guarantee that the node will not forward an un-checked packet=20 Where as in RC we can have examples of devices that can guarantee checksum=20 One example is an IB-IP gateway that always checksum outgoing and incoming packets, and can act as a remote IP NIC to the Host =20 I suggest we include a checksum option in the CM Exchange=20 Where a node can request that its peer will not checksum the packet for it And also signal that he sends packets that are already checked=20 That can help improve performance of IPoIB RC P.S. another note, we discussed in IETF was that we may want to mention/suggest (not mandate) use of SRQ for IPoIB-RC in order to preserve memory=20 Yaron _______________________________________________ IPoverIB mailing list IPoverIB@ietf.org https://www1.ietf.org/mailman/listinfo/ipoverib From ipoverib-bounces@ietf.org Thu Nov 18 17:44:20 2004 Received: from megatron.ietf.org (megatron.ietf.org [132.151.6.71]) by ietf.org (8.9.1a/8.9.1a) with ESMTP id RAA27799 for ; Thu, 18 Nov 2004 17:44:20 -0500 (EST) Received: from localhost.localdomain ([127.0.0.1] helo=megatron.ietf.org) by megatron.ietf.org with esmtp (Exim 4.32) id 1CUulv-0006oN-1m; Thu, 18 Nov 2004 17:28:59 -0500 Received: from odin.ietf.org ([132.151.1.176] helo=ietf.org) by megatron.ietf.org with esmtp (Exim 4.32) id 1CUud6-0003uQ-Ju for ipoverib@megatron.ietf.org; Thu, 18 Nov 2004 17:19:52 -0500 Received: from ietf-mx.ietf.org (ietf-mx.ietf.org [132.151.6.1]) by ietf.org (8.9.1a/8.9.1a) with ESMTP id RAA24897 for ; Thu, 18 Nov 2004 17:19:49 -0500 (EST) Received: from atorelbas01.hp.com ([156.153.255.245] helo=palrel10.hp.com) by ietf-mx.ietf.org with esmtp (Exim 4.33) id 1CUufb-0000hI-QS for ipoverib@ietf.org; Thu, 18 Nov 2004 17:22:40 -0500 Received: from esmail.cup.hp.com (esmail.cup.hp.com [15.0.65.164]) by palrel10.hp.com (Postfix) with ESMTP id F3BA41CD7D; Thu, 18 Nov 2004 14:19:38 -0800 (PST) Received: from MK73191c.cup.hp.com ([15.244.201.129]) by esmail.cup.hp.com (8.9.3 (PHNE_29774)/8.8.6) with ESMTP id OAA21563; Thu, 18 Nov 2004 14:16:59 -0800 (PST) Message-Id: <6.1.2.0.2.20041118132352.0c98a550@esmail.cup.hp.com> X-Sender: krause@esmail.cup.hp.com X-Mailer: QUALCOMM Windows Eudora Version 6.1.2.0 Date: Thu, 18 Nov 2004 13:26:43 -0800 To: Vivek Kashyap From: Michael Krause Subject: Re: [Ipoverib] A Couple of IPoIB Questions In-Reply-To: References: <6.1.2.0.2.20041118065847.0208dd40@esmail.cup.hp.com> Mime-Version: 1.0 X-Spam-Score: 0.0 (/) X-Scan-Signature: b92e72fc2b623ddd11e6d81413fb81b2 Cc: IPoverIB X-BeenThere: ipoverib@ietf.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: IP over InfiniBand WG Discussion List List-Unsubscribe: , List-Post: List-Help: List-Subscribe: , Content-Type: multipart/mixed; boundary="===============0941738355==" Sender: ipoverib-bounces@ietf.org Errors-To: ipoverib-bounces@ietf.org --===============0941738355== Content-Type: multipart/alternative; boundary="=====================_81454966==.ALT" --=====================_81454966==.ALT Content-Type: text/plain; charset="us-ascii"; format=flowed At 11:33 AM 11/18/2004, Vivek Kashyap wrote: >On Thu, 18 Nov 2004, Michael Krause wrote: > > > At 10:46 PM 11/17/2004, Vivek Kashyap wrote: > 
    On Thu, 18 Nov 2004, Michael Krause wrote:

    > At 10:46 PM 11/17/2004, Vivek Kashyap wrote:
    > >Mike the format is really off in the last mail from you - making it difficult
    > >to follow.
    > >
    > >
    > >Other than that let us discuss in the context of the draft. The draft is
    > >built upon the following:
    > >
    > >1. IPoIB-RC and IPoIB-UC are optional.
    >
    > I would prefer only one be used - either RC or UC.  I've provided some
    > logic for either one as a preference but don't see a reason to have
    > both.  Both just lead to options, which lead to interoperability problems.

    ok.
    See section 3.1 of the draft draft-kashyap-ipoib-connected-mode-02.txt.
    It states that the RC and UC are mutually exclusive flags.

    My preference is to support only one of the two in the spec, not to have flags to indicate what is implemented.  The benefits of connected-mode operation should come from a single form of communication, not two.


    >
    > >2. IPoIB connected mode depends on a UD QP for address resolution and
    > >multicast.
    > >
    > >As far as I know, there has been an agreement since the earliest connected
    > >mode
    > >draft I posted.
    > >
    > >
    > >I'd like the WG to give input on the following issues:
    > >
    > >3. Where does the UD QP come from?  Choose one of:
    > >
    > >a. It is a UD QP that is associated with the interface at startup.
    > >
    > >b. It is a UD QP that is shared with IPoIB-UD.
    > >
    > >
    > >3a is more generic. It can be considered to include the case 3b.  The original
    > >proposal was limited to 3b.
    >
    >  From an implementation point of view, all of this will be hidden within
    > the driver below IP.  As such, the driver will maintain the
    > associations.  Currently, each driver "instance" (may be multiple per IB
    > port) will have at least 1 UD QP.  Given the existing protocol already
    > defines how to share this QP with other nodes, why not just re-use it and
    > avoid doing more work?  The driver can then map on a per endnode pair basis
    > what *C QP go with what the UD QP and the spec remains largely silent on
    > how this is accomplished.

    The draft at present states that 'IPoIB-CM implementation MAY use the same UD
    QP as used by IPoIB-UD...'. See section 3.0. I believe it covers what you
    are stating.

    > >4. Link characteristics
    > >
    > >The broadcast domain for IPoIB-RC/UC is determined exactly as the
    > >IPoIB-UD case i.e. through the broadcast-GID. A UD as per 3 is used in this
    > >step.
    > >
    > >Do all interfaces in the IPoIB-connected mode (CM) have the same link
    > >characteristics? i.e.
    >
    >  From an implementation perspective, this is generally simplest.
    >
    > >a. all are either IPoIB-RC or IPoIB-UC.
    >
    > Preference is only 1 to be defined.
    >
    >
    > >         -- There is also a UD QP associated. The UD QP will be either 3a
    > > or 3b
    > >            based on WG concensus.
    > >
    > >         -- All unicast transmission is on the IPoIB mode i.e. RC or UC.
    >
    > For a given endnode pair, the policy of which QP is used for a given
    > unicast IP datagram is really a local issue.  I see some merit in the

    Not if an implementation chooses to only receive unicast on the CM modes in
     an IPoIB-CM subnet. I think the WG must either mandate that between two
    IP addresses all unicast communication can be over either UD or the supported CM,
    or state that all unicast communication must be over IPoIB-CM. Hence my
    attempt at a detailed discussion on these issues.

    Issues such as in order delivery need to be considered: e.g. if RC and UD are
    used to mix up the traffic, say of TCP segments of the same connection, they
    may no longer be received in order.

    If a designer is stupid, they may do this. However, one would expect some intelligence here: specific data flows, DiffServ code points, or similar criteria can be used to determine which connection or which UD QP carries a flow, and an intelligent, predictable algorithm would then ensure that mix-and-match does not occur for a given TCP connection.  Given that multiple *C QPs can be supported, it is not tenable to state that all unicast must go over a given QP or that no unicast can occur on a UD QP.
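
To make that kind of predictable mapping concrete, here is a minimal sketch in C. It is purely illustrative (the structure, hash, and function names are assumptions, not anything from the draft): each 5-tuple flow is hashed to exactly one of the established connected QPs, so segments of a single TCP connection are never split across the UD QP and a connected QP.

    #include <stdint.h>
    #include <stddef.h>

    /* Illustrative driver-local view of the QPs available to reach one peer. */
    #define MAX_CONN_QPS 4

    struct peer_qps {
        uint32_t conn_qpn[MAX_CONN_QPS]; /* established RC/UC QPs to this peer */
        unsigned num_conn;               /* 0 => only the UD QP is available   */
        uint32_t ud_qpn;                 /* the UD QP (always present)         */
    };

    /* Stable FNV-1a style hash of the 5-tuple; any deterministic hash works. */
    static uint32_t flow_hash(uint32_t saddr, uint32_t daddr,
                              uint16_t sport, uint16_t dport, uint8_t proto)
    {
        uint32_t words[4] = { saddr, daddr,
                              ((uint32_t)sport << 16) | dport, proto };
        uint32_t h = 2166136261u;
        for (size_t i = 0; i < 4; i++) {
            h ^= words[i];
            h *= 16777619u;
        }
        return h;
    }

    /* Every packet of a given flow maps to the same QP, so per-connection
     * ordering is preserved even though several QPs exist to the peer.    */
    uint32_t select_tx_qpn(const struct peer_qps *p,
                           uint32_t saddr, uint32_t daddr,
                           uint16_t sport, uint16_t dport, uint8_t proto)
    {
        if (p->num_conn == 0)
            return p->ud_qpn;
        return p->conn_qpn[flow_hash(saddr, daddr, sport, dport, proto)
                           % p->num_conn];
    }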


    > attempt to bifurcate this to multicast / broadcast to the UD QP and unicast
    > to the *C QP.  However, if the datagram fits in the PMTU of the UD QP, then
    > either could be used.  The driver would work either case.  Please keep in
    > mind that multiple *C QP can be used and their usage needs to be a local
    > issue and not defined within the spec.
    >
    > >b. all are IPoIB-UD. Additionally they can be one of IPoIB-RC or IPoIB-UC
    > >or both.
    > >
    > >         -- The presence of the flags indicate the type of communication
    > > possible.
    > >         -- The decision of communicating using a specific mode is
    > > determined by
    > >            the supported modes and the local policy. Note that incompatible
    > >            policies imply that the fallback is communication over UD.
    > >         -- fallback mode of communication is UD
    > >
    > >
    > >4b adds a lot of flexibility at the expense of a simple decision. 4a. by
    > >contrast is straightforward.
    > >
    > >
    > >5. MTU negotiation
    > >
    > >         In the private data field of the CM message the desired MTU is
    > >         included.
    > >
    > >         It was suggested during the IPoIB meeting at IETF that it need not be
    > >         symmetric. That is a good idea. Thus each peer declares the max
    > > MTU it
    > >         prefers
    > >
    > >
    > >         REQ: <my desired MTU>
    > >         REP: <my desired MTU>
    > >         RTU:
    >
    > Rephrase this as maximum logical MTU to avoid confusion with the IB link

    It is covered in section 5.1 of the draft.

    > MTU.  If you start down this path, then you may need to also consider an
    > exchange of what range of DiffServ code points to use as well.  Not clear
    > that anyone needs to deal with any latency or bandwidth guarantees but the
    > "camel's nose is starting to enter the tent" as the saying goes.

    The camel comes along if DiffServ etc., as listed above, are
    included. Hence they are not in the draft.
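
For illustration only, a sketch of what carrying the desired MTU in the CM REQ/REP private data might look like. The field layout, sizes, and version value below are assumptions, not the draft's actual private-data format; the point is simply that each side advertises the largest IP datagram it is willing to receive, which is what makes an asymmetric MTU straightforward.

    #include <stdint.h>
    #include <string.h>
    #include <arpa/inet.h>   /* htonl/ntohl */

    /* Hypothetical REQ/REP private-data layout (illustrative only). */
    struct ipoib_cm_priv {
        uint32_t version;    /* format version, network byte order        */
        uint32_t recv_mtu;   /* largest IP datagram this side will accept */
    };

    void encode_priv(uint8_t buf[sizeof(struct ipoib_cm_priv)],
                     uint32_t my_recv_mtu)
    {
        struct ipoib_cm_priv p;
        p.version  = htonl(1);
        p.recv_mtu = htonl(my_recv_mtu);
        memcpy(buf, &p, sizeof(p));
    }

    /* The MTU we may use toward the peer is whatever the peer advertised,
     * independent of what we advertised: the two need not be symmetric.  */
    uint32_t decode_peer_recv_mtu(const uint8_t buf[sizeof(struct ipoib_cm_priv)])
    {
        struct ipoib_cm_priv p;
        memcpy(&p, buf, sizeof(p));
        return ntohl(p.recv_mtu);
    }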

    >
    >
    > >6. Multiple connections for the same IP address
    > >
    > >         Local decision. Note that the peer might choose to not honour
    > > multiple
    > >         connections.
    >
    > Agreed.
    >
    > Mike
    >
    >
    >
    >
    > >Vivek
    > >
    > >
    > >
    > >
    > >
    > >On Wed, 17 Nov 2004, Michael Krause wrote:
    > >
    > > > At 11:38 PM 11/16/2004, Vivek Kashyap wrote:
    > > >
    > > >
    > > >
    > > > >Hi,  I have a couple of questions relative to IPoIB:  1.
    > > > >draft-ietf-ipoib-ip-over-infiniband-07.txt states: "Every IPoIB interface
    > > > >MUST "FullMember" join the IB multicast group defined by the
    > > > >broadcast-GID."  Isn't the broadcast group for IPv4 ? When the IPoIB
    > > > >interface is IPv6 only, does this group still need be joined ?  If not,
    > > > >where do the parameters for any IPv6 groups come from ? I am presuming
    > > > >that this group needs to be joined in  the IPv6 only case. I just want to
    > > > >be sure.
    > > > ><VK> Yes, the broadcast-GID is at the InfiniBand layer and MUST be joined
    > > > >whether you are running at v4 or v6 layer. <VK>  2. ALso, what is the
    > > > >latest status of the Vivek's connected mode draft ? Will it be moving
    > > > >forward ?  <VK> I'll be submitting it as
    > > > >draft-ietf-ipoib-connected-mode-00.txt by the end of the month. There were
    > > > >some interesting suggestions that were made during the IETF WG meeting.
    > > > >Two of the suggestions of consequence are given below. The others we can
    > > > >discuss when the minutes are published (they include some additional
    > > > >requests on clarification on the transmission draft too).  a. The current
    > > > >draft makes the various modes mutually exclusive i.e. RC, UC and UD are
    > > > >not allowed simultaneously in the same IP subnet. The thought is that it
    > > > >is a link characteristic and hence different per connection mode. It was
    > > > >suggested that one be allowed to mix up RC/UC. This goes back to the
    > > > >original suggestion in the first draft which was:  IPoIB-UD must always be
    > > > >supported. Additionally, the interface can also support either both of RC
    > > > >and UC, or one of them. Or neither of them.
    > > > >
    > > > >UD MUST always be supported.
    > > > >
    > > > ><VK> That is and has always been the requirement right from the first
    > > > >draft. <VK>
    > > > >
    > > > >I personally don't care whether one does RC or UC but I don't think both
    > > > >are required as a MAY option. The advantage of RC is the send credit
    > > > >algorithm. The advantage of UC is the lack of ACK packets. ACK is noise in
    > > > >the fabric while send credits provide a simple method to maintain
    > > > >bandwidth / injection control on a per flow basis.
    > > > >
    > > > >I see no problems with supporting both UD and *C on the same subnet; it is
    > > > >rather foolish to attempt to mandate these be on separate subnets.
    > > > ><VK> As per the connected-mode draft the UD mechanism is *always*
    > > > >required; address resolution depends on it.
    > > > >
    > > > >The only point of discussion is whether all nodes must support the same
    > > > >link characteristics in the subnet i.e. all are RC (and UD), or all are UC
    > > > >(and UD), or all are UD only.
    > > >
    > > > Obviously I would oppose such a solution as it creates artificial
    > > > constraints with little benefit.
    > > >
    > > > >The alternative is to allow all the nodes to be mixed up with some nodes
    > > > >being RC/UD, others UC/UD and a third set UD only and yet others probably
    > > > >supporting all, within the same IP subnet. [Can the same serviceID be used
    > > > >by both RC and UC ?]
    > > > >
    > > > >The third alternative is to associate UD only, or UD + one of RC or UC, with
    > > > >the same interface. In such a case, if mismatched/unsupported connected
    > > > >modes are supported by two nodes then they fall back to UD. This option is
    > > > >not too different from the UD QP + RC or UC mechanism.
    > > >
    > > > KISS:
    > > >
    > > > - UD universal
    > > > - *C opportunistic
    > > >          - Local management issue to control what is sent on the *C
    > > > interface.  No need to specify
    > > >          - Advertise whether one or more ports are supported by UD or *C
    > > >          - Advertise whether one or more QP are supported by UD or *C
    > > >          - Let local management determine policy for what services are
    > > > mapped where - no need to specify
    > > >
    > > > This is both an interoperable approach and simple to implement.  There may
    > > > be some desire to add a policy interface to state preference for specific
    > > > types of traffic over a given QP.  I would not oppose this but would view
    > > > this as a separate draft once the basics are worked out.
    > > >
    > > >
    > > >
    > > > ><VK>
    > > > >b. Another suggestion was to allow multiple connected mode links (i.e. at
    > > > >IB UC/RC level) between peers.  One thought can be 'yes, but user beware':
    > > > >The IB connections are made using the service ID that is derived from the
    > > > >QPN as described in the draft. If a second attempt succeeds then there are
    > > > >two links. It is up to the implementation to either allow or disallow
    > > > >multiple links.
    > > > >
    > > > >Again, this has been suggested in the past (though most who were involved
    > > > >in the original discussions years gone by are likely gone since much of
    > > > >this discussion occurred before the IETF workgroup was established).
    > > > >
    > > > ><VK> I'm one of the vestiges of those early times along with you and a few
    > > > >others...so we have hope :). <VK>
    > > > >
    > > > >There is obvious benefit to supporting multiple RC per endnode pair. I do
    > > > >not see any technical reason to oppose nor any issue from an
    > > > >interoperability perspective. There is no reason for a "user beware".
    > > > >
    > > > ><VK> It is not opposed. The 'user beware' is only underscoring that the
    > > > >peer interface might not support multiple links - it might enforce a
    > > > >limited number of connections (maybe only one) between a pair of GIDs.
    > > > >Similarly, an implementation not wanting to support multiple links MUST
    > > > >take steps to deny multiple requests.
    > > >
    > > > *C requires CM to operate thus it is a local issue whether additional CM
    > > > operations are accepted or not.  A given requester node may issue N and a
    > > > given responder may state 0-N as an implementation may limit the number of
    > > > *C available for IP traffic.
    > > >
    > > >
    > > > ><VK>
    > > > >
    > > > >The work is rather straightforward to specify and implement, and the benefit to
    > > > >customers is, again, rather obvious when one considers what the IB fabric
    > > > >offers and how connections can enable flows through multipath as well
    > > > >as transparent fail-over, flow scheduling, mapping of DiffServ to
    > > > >different arbitration / paths, etc.
    > > > >
    > > > ><VK> In addition Large MTU and APM are two of the main reasons why I've
    > > > >been proposing IPoIB-connected mode for so long. In terms of IPoIB itself,
    > > > >except for the Large MTU, the parameters are hidden from it.<VK>
    > > >
    > > > Mike
    > >
    > >__
    > >
    > >Vivek Kashyap
    > >Linux Technology Center, IBM
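
On the service ID derived from the QPN (mentioned above): a minimal sketch of the idea, with the caveat that the prefix constant and bit layout here are illustrative placeholders rather than the draft's actual encoding. Embedding the advertised UD QPN in the 64-bit CM service ID lets the passive side associate an incoming REQ with the IPoIB interface that owns that QPN.

    #include <stdint.h>

    /* Illustrative placeholder prefix; the actual prefix and bit layout
     * defined by the draft are not reproduced here. */
    #define EXAMPLE_IPOIB_CM_SID_PREFIX 0x1000000000000000ULL

    static inline uint64_t ipoib_cm_service_id(uint32_t ud_qpn)
    {
        /* 24-bit QPN carried in the low bits of the service ID */
        return EXAMPLE_IPOIB_CM_SID_PREFIX | (uint64_t)(ud_qpn & 0x00FFFFFFu);
    }

    static inline uint32_t ipoib_cm_qpn_from_service_id(uint64_t sid)
    {
        return (uint32_t)(sid & 0x00FFFFFFu);
    }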

From ipoverib-bounces@ietf.org Thu Nov 18 18:12:35 2004
From: Michael Krause
To: IPoverIB
Date: Thu, 18 Nov 2004 14:50:50 -0800
Subject: Re: [Ipoverib] IPoIB-RC and Checksums

At 12:18 PM 11/18/2004, Yaron Haviv wrote:
    In GbE usually the NIC Tx Segmentation (large send) capability comes
    hand in hand with Checksum offload for greater efficiency (and zero
    copy)

    On UD we decided not to address checksum offloading, since we cannot
    guarantee that the node will not forward an un-checked packet

    Whereas in RC we can have examples of devices that can guarantee the
    checksum. One example is an IB-IP gateway that always checksums outgoing and
    incoming packets, and can act as a remote IP NIC to the host.

    I suggest we include a checksum option in the CM exchange, where a node can request that its peer not checksum packets for it, and also signal that the packets it sends are already checked.  That can help improve the performance of IPoIB-RC.

    P.S. Another point we discussed at the IETF was that we may want to
    mention/suggest (not mandate) use of SRQ for IPoIB-RC in order to
    conserve memory.

    I remain opposed to disabling checksums under any circumstances.  I do not believe there is a method to guarantee that a packet will not be routed by higher layers within the network stack without sticking one's nose way into the packet.  There is nothing that precludes an IB HCA from providing checksum off-load today through a private interface much as what is done with Ethernet today.   It isn't hard and presents no interoperability issues as it is a local optimization.  Attempting to do this as an optimization between endnode pairs makes this more complex and requires knowledge that may not be available between all combinations of endnode pairs.

    Mike
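
For reference, the checksum at issue is the 16-bit one's-complement Internet checksum (RFC 1071) carried by IPv4, TCP, and UDP. The sketch below is a plain, unoptimized version of the computation that a host stack, an offloading HCA, or an IB-IP gateway would have to perform; it is not taken from any of the drafts.

    #include <stdint.h>
    #include <stddef.h>

    /* RFC 1071 Internet checksum over 'len' bytes of 'data'.
     * Returns the 16-bit one's-complement result in host byte order. */
    uint16_t inet_checksum(const void *data, size_t len)
    {
        const uint8_t *p = data;
        uint32_t sum = 0;

        while (len > 1) {                      /* sum 16-bit words */
            sum += ((uint32_t)p[0] << 8) | p[1];
            p   += 2;
            len -= 2;
        }
        if (len == 1)                          /* pad a trailing odd byte */
            sum += (uint32_t)p[0] << 8;

        while (sum >> 16)                      /* fold the carries back in */
            sum = (sum & 0xFFFFu) + (sum >> 16);

        return (uint16_t)~sum;
    }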

From ipoverib-bounces@ietf.org Thu Nov 18 18:30:47 2004
From: Bill Strahm
To: Yaron Haviv
Cc: IPoverIB
Date: Thu, 18 Nov 2004 15:10:56 -0800
Subject: Re: [Ipoverib] IPoIB-RC and Checksums

Let me try to be clear about my understanding of the IESG's position. Not sending IP/TCP header checksums in an IP packet is a non-starter. Using checksum-offload technologies to accelerate these computations is one thing; sending packets with a checksum value of 0 and not checking on receive is another.

I talked with Allison Mankin in Washington D.C., and she was terrified of "raw" IB packets (i.e. RC/UD) getting out on the Internet and messing things up, because these protocols do not have congestion controls built in that will behave correctly with IP.

I would caution the group against removing checksumming from packets when it is relatively cheap to add hardware to HCAs/HBAs that can calculate the checksum before sending it on the wire.

Some comments inline.
Bill

On Thu, 2004-11-18 at 22:18 +0200, Yaron Haviv wrote:
> In GbE usually the NIC Tx Segmentation (large send) capability comes
> hand in hand with Checksum offload for greater efficiency (and zero
> copy)
>
> On UD we decided not to address checksum offloading, since we cannot
> guarantee that the node will not forward an un-checked packet

I do not believe either the IEEE or the IETF has ever addressed checksum offloading. I am not sure that there is a protocol piece to do here - it is an implementation issue between the OS and the hardware.

> Where as in RC we can have examples of devices that can guarantee
> checksum
> One example is an IB-IP gateway that always checksum outgoing and
> incoming packets, and can act as a remote IP NIC to the Host

Here you are talking about a different device. And again - I am not sure that there is an IETF standard here. Much as the IETF does not want to standardize iSER over IB (with no IP in the middle), I don't believe it wants to standardize Host/OS <--> NIC interactions. The device you are proposing does not have (require might be a better word) an IP interaction between the host and an IP offload NIC (I have heard of several implementations of things called a VNIC, or virtual NIC). I do not believe there is a proposal to standardize a VNIC protocol - and if there were, I do not believe it would be IETF work.

> I suggest we include a checksum option in the CM Exchange
> Where a node can request that its peer will not checksum the packet for
> it
> And also signal that he sends packets that are already checked
> That can help improve performance of IPoIB RC

I believe this is a non-starter in the IESG - Margaret, can you confirm this?

> P.S. another note, we discussed in IETF was that we may want to
> mention/suggest (not mandate) use of SRQ for IPoIB-RC in order to
> preserve memory

Again, in the spirit of wire protocol vs. implementation, I think this is an implementation issue that will not change the wire protocol at all. Is there a point where using SRQ vs. not using SRQ would have to change the wire protocol? If not, let's not say anything. If there is, I would be very interested in understanding it.

Bill
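
To illustrate why SRQ is purely a local matter, here is a sketch of creating one shared receive queue and attaching an RC QP to it. It uses the later OpenFabrics libibverbs API, which is an assumption of convenience (it is not what 2004-era stacks exposed); nothing in it is visible on the wire.

    #include <infiniband/verbs.h>

    /* Sketch: one SRQ shared by the IPoIB-RC QPs of an interface. */
    int create_rc_qp_with_srq(struct ibv_pd *pd, struct ibv_cq *cq,
                              struct ibv_srq **srq_out, struct ibv_qp **qp_out)
    {
        struct ibv_srq_init_attr srq_attr = {
            .attr = { .max_wr = 4096, .max_sge = 1 }  /* shared receive pool */
        };
        struct ibv_srq *srq = ibv_create_srq(pd, &srq_attr);
        if (!srq)
            return -1;

        struct ibv_qp_init_attr qp_attr = {
            .send_cq = cq,
            .recv_cq = cq,
            .srq     = srq,   /* receives are satisfied from the shared queue */
            .cap     = { .max_send_wr = 256, .max_send_sge = 1 },
            .qp_type = IBV_QPT_RC,
        };
        struct ibv_qp *qp = ibv_create_qp(pd, &qp_attr);
        if (!qp) {
            ibv_destroy_srq(srq);
            return -1;
        }

        *srq_out = srq;
        *qp_out  = qp;
        return 0;
    }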

From ipoverib-bounces@ietf.org Thu Nov 18 19:27:08 2004
From: Vivek Kashyap
To: Michael Krause
Cc: IPoverIB
Date: Thu, 18 Nov 2004 16:12:23 -0800 (PST)
Subject: Re: [Ipoverib] A Couple of IPoIB Questions

On Thu, 18 Nov 2004, Michael Krause wrote:

> At 11:33 AM 11/18/2004, Vivek Kashyap wrote:
> >See section 3.1 of the draft draft-kashyap-ipoib-connected-mode-02.txt.
> >It states that the RC and UC are mutually exclusive flags.
>
> My preference is to support only one of the two in the spec, not to have
> flags to indicate what is implemented.  The benefits of connected-mode
> operation should come from a single form of communication, not two.
A given subnet will support only one of the two, not both simultaneously. The
flag only indicates which type it is. RC and UC are both useful to different
people and implementations, so both are allowed. I suggest, though, that both
not be allowed in the same IPoIB subnet.

> >Issues such as in order delivery need to be considered: e.g. if RC and UD are
> >used to mix up the traffic, say of TCP segments of the same connection, they
> >may no longer be received in order.
>
> If a designer is stupid, they may do this. However, one would expect some
> intelligence here: specific data flows, DiffServ code points, or similar
> criteria can be used to determine which connection or which UD QP carries a
> flow, and an intelligent, predictable algorithm would then ensure that
> mix-and-match does not occur for a given TCP connection.  Given that multiple
> *C QPs can be supported, it is not tenable to state that all unicast must go
> over a given QP or that no unicast can occur on a UD QP.

You missed my point, which was that the specification cannot be silent on this
and say it is a local issue. That can lead to interoperability failure. The
specification must either support or disallow unicast communication over the
UD QP in an IPoIB-CM subnet.

You prefer that such communication be supported. That works. Any other
thoughts?

From ipoverib-bounces@ietf.org Thu Nov 18 19:33:53 2004
From: Vivek Kashyap
To: H.K. Jerry Chu
Cc: ipoverib@ietf.org
Date: Thu, 18 Nov 2004 15:59:50 -0800 (PST)
Subject: Re: [Ipoverib] comments on draft-kashyap-ipoib-connected-mode-02.txt

On Thu, 18 Nov 2004, H.K. Jerry Chu wrote:

> In the last IETF61 IPoIB meeting I made several comments on the
> connected mode draft. I'm sending them to the list for a general
> discussion. (Yes, I saw some discussion on the connected mode
> draft already. I'll try to catch up with the thread after this
> mail.)
>
> 1. The draft makes a distinction between IPoIB-CM interfaces
> and IPoIB-UD interfaces, and portrays IPoIB-UC or IPoIB-RC as
> separate subnets superimposed on top of an IPoIB-UD subnet.
>
> For the above to work, due to a lack of multicast support, a fully
> connected network by itself can't meet the requirement of an IP
> link unless multicast is fully emulated through the use of
> multiple unicasts. The latter is complex and cumbersome.

Exactly. The current draft also continues to use UD for multicast.

> A much simpler model, which I think was presented in earlier
> drafts, is to fold the use of IB connections fully into a
> regular IPoIB-UD subnet, allowing any two IPoIB nodes to
> optionally negotiate the use of an IB connection between themselves.

The difference between the earlier draft and this one is that I modified the
requirement on the UD QP. That is, it need not be that IPoIB-CM and IPoIB-UD
share a QP; any UD QP will do for IPoIB-CM. In effect an implementation can
still share the UD QP.

The only issue is whether the same IP subnet can contain pure IPoIB-UD nodes
mixed in with IPoIB-CM nodes, or whether all nodes must be of the same type:
 - all IPoIB-UD, or
 - all IPoIB-RC, or
 - all IPoIB-UC.

I believe "all of the same type" is a good option to choose.

> This much simplified model is not without its drawbacks. Some
> nice IP link attributes are no longer unique within a link.
> E.g., the link MTU now becomes a per-node-pair MTU. Moreover,
> the MTU size for multicast will be different from the MTU size
> for unicast if IB connections are used. IB UC/RC may exhibit
> different RAS, flow control, QoS or other link characteristics
> than UD. But I consider these problems a reasonable price to
> pay for seamless support of UC/RC mode in an IPoIB link
> defined by UD.
>
> 2. The negotiation of the per-connection MTU seems more
> complicated than necessary. I think all that is needed is for a
> node to advertise its own "receive MTU", that is, the MTU
> size its peer should never go over when sending packets
> to the local interface. Yes, this may break the traditional
> concept of "symmetric" MTUs. But we're already breaking the
> notion of per-link MTU, requiring a lot of changes in the host
> stack anyway. This additional breakage doesn't seem like much.
>
> I haven't verified if this asymmetric MTU matches well with
> IBA connections though. How about:

The MTU, I would think, is exchanged at the IB level during IPoIB-CM
connection setup. The IP layer at both ends keeps a per-connection MTU if the
implementation permits it. At the link layer the connection will not send
messages larger than the size requested by the peer.

> 3. Regarding allowing multiple IB connections between a node
> pair: since for a given IP address there is only one link-address,
> implying one QPN and hence one service-ID, if a single
> service-ID can be used to create multiple IB connections
> then this can happen transparently. Otherwise we've got a
> problem.
>
> Jerry
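
A minimal sketch of the per-node-pair state this model implies; the structure and field names are illustrative assumptions, not from the draft. The driver records, per destination, the receive MTU the peer advertised at connection setup and clamps unicast transmissions to it, while multicast continues to use the UD MTU derived from the broadcast group.

    #include <stdbool.h>
    #include <stdint.h>

    /* Illustrative per-destination entry kept by the driver below IP. */
    struct ipoib_peer {
        uint8_t  gid[16];        /* peer's IB GID                           */
        uint32_t remote_qpn;     /* connected QP number, once established   */
        bool     connected;      /* is an RC/UC connection up to this peer? */
        uint32_t peer_recv_mtu;  /* MTU the peer said it can receive        */
    };

    /* MTU to use for a unicast send to this peer.  'ud_mtu' is the payload
     * limit of the UD QP (from the broadcast group), also used for multicast. */
    uint32_t tx_mtu_for(const struct ipoib_peer *p, uint32_t ud_mtu)
    {
        if (p->connected && p->peer_recv_mtu != 0)
            return p->peer_recv_mtu;   /* asymmetric: the peer's limit, not ours */
        return ud_mtu;                 /* fall back to the UD path               */
    }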
> > Jerry > > > _______________________________________________ > IPoverIB mailing list > IPoverIB@ietf.org > https://www1.ietf.org/mailman/listinfo/ipoverib > > _______________________________________________ IPoverIB mailing list IPoverIB@ietf.org https://www1.ietf.org/mailman/listinfo/ipoverib From ipoverib-bounces@ietf.org Thu Nov 18 19:34:22 2004 Received: from megatron.ietf.org (megatron.ietf.org [132.151.6.71]) by ietf.org (8.9.1a/8.9.1a) with ESMTP id TAA09590 for ; Thu, 18 Nov 2004 19:34:22 -0500 (EST) Received: from localhost.localdomain ([127.0.0.1] helo=megatron.ietf.org) by megatron.ietf.org with esmtp (Exim 4.32) id 1CUwaU-0004ss-51; Thu, 18 Nov 2004 19:25:18 -0500 Received: from odin.ietf.org ([132.151.1.176] helo=ietf.org) by megatron.ietf.org with esmtp (Exim 4.32) id 1CUwZg-0004AU-Fh for ipoverib@megatron.ietf.org; Thu, 18 Nov 2004 19:24:29 -0500 Received: from ietf-mx.ietf.org (ietf-mx.ietf.org [132.151.6.1]) by ietf.org (8.9.1a/8.9.1a) with ESMTP id TAA08446 for ; Thu, 18 Nov 2004 19:24:24 -0500 (EST) Received: from palrel13.hp.com ([156.153.255.238]) by ietf-mx.ietf.org with esmtp (Exim 4.33) id 1CUwcE-0003qD-AE for ipoverib@ietf.org; Thu, 18 Nov 2004 19:27:17 -0500 Received: from esmail.cup.hp.com (esmail.cup.hp.com [15.0.65.164]) by palrel13.hp.com (Postfix) with ESMTP id 31D3E1C02E95; Thu, 18 Nov 2004 16:24:16 -0800 (PST) Received: from MK73191c.cup.hp.com ([15.244.201.129]) by esmail.cup.hp.com (8.9.3 (PHNE_29774)/8.8.6) with ESMTP id QAA29957; Thu, 18 Nov 2004 16:21:47 -0800 (PST) Message-Id: <6.1.2.0.2.20041118161705.0cbc5900@esmail.cup.hp.com> X-Sender: krause@esmail.cup.hp.com X-Mailer: QUALCOMM Windows Eudora Version 6.1.2.0 Date: Thu, 18 Nov 2004 16:22:35 -0800 To: Vivek Kashyap From: Michael Krause Subject: Re: [Ipoverib] A Couple of IPoIB Questions In-Reply-To: References: <6.1.2.0.2.20041118132352.0c98a550@esmail.cup.hp.com> Mime-Version: 1.0 X-Spam-Score: 0.0 (/) X-Scan-Signature: 0e9ebc0cbd700a87c0637ad0e2c91610 Cc: IPoverIB X-BeenThere: ipoverib@ietf.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: IP over InfiniBand WG Discussion List List-Unsubscribe: , List-Post: List-Help: List-Subscribe: , Content-Type: multipart/mixed; boundary="===============1908633025==" Sender: ipoverib-bounces@ietf.org Errors-To: ipoverib-bounces@ietf.org --===============1908633025== Content-Type: multipart/alternative; boundary="=====================_88959486==.ALT" --=====================_88959486==.ALT Content-Type: text/plain; charset="us-ascii"; format=flowed At 04:12 PM 11/18/2004, Vivek Kashyap wrote: >On Thu, 18 Nov 2004, Michael Krause wrote: > > > At 11:33 AM 11/18/2004, Vivek Kashyap wrote: > > >On Thu, 18 Nov 2004, Michael Krause wrote: > > > > > > > At 10:46 PM 11/17/2004, Vivek Kashyap wrote: > > > > >Mike the format is really off in the last mail from you - making it > > > difficult > > > > >to follow. > > > > > > > > > > > > > > >Other than that let us discuss in the context of the draft. The > draft is > > > > >built upon the following: > > > > > > > > > >1. IPoIB-RC and IPoIB-UC are optional. > > > > > > > > I would prefer only one be used - either RC or UC. I've provided some > > > > logic for either one as a preference but don't see a reason to have > > > > both. Both just leads to options which leads to interoperability > problems. > > > > > >ok. > > >See section 3.1 of the draft draft-kashyap-ipoib-connected-mode-02.txt. > > >It states that the RC and UC are mutually exclusive flags. 
> > > > My preference is to only support one of the two in a spec not to have > flags > > to indicate what is implemented. The benefits of connected mode operation > > should be done with only one form of communication not two. > >A given subnet will support only one of the two. Not both simultaneously. The >flag only indicates which type it is. RC and UC are both useful to different >people and implementations so both are allowed. I suggest that both not be >allowed in the same IPoIB subnet though. To be explicit, I think there is benefit in implementing one and only one of the two. Having two options serves no purpose and adds unnecessary complexity. Interoperability will end up requiring both to be done if customers are to not get upset. Let's just pick one of the two and apply KISS. To get this started, I'll propose RC as that is a bit nicer to the fabric than UC and is already implemented in most OS and CA drivers today so it makes it faster to adopt with minimal driver software update. > > If a designer is stupid, they may do this. However, one would expect some > > intelligence here and one may prefer to have specific data flows or > > DiffServ code points or whatever used to determine which connection or > > which UD QP and that one would again apply an intelligent and predictable > > algorithm such that mix-n-match for a given TCP connection does not > > occur. Given multiple *C QP can be supported, it is not tenable to state > > that all unicast must go over a given QP or that no unicast can occur on a > > UD QP. > > > >You mised my point which was that the specification cannot be silent on this >and say it is a local issue. That can lead to interoperability failure. The >specification must support or disallow unicast communication over UD QP >in an >IPoIB-CM. > >You prefer that such communication be supported. That works. Any other >thoughts? I prefer that guidance be provided and that it remain a local implementation issue as to what QP is used for a given flow. I do not see interoperability issues only potential performance if people are stupid. The industry has a way to deal with stupidity and too much time is spent on preventing people from being stupid. Even a so-so intelligent implementation could have a simple flag for a given target IP address that states which QP to target for all or a subset of the flows with minimal cost to implement and troubleshoot / validate. Mike --=====================_88959486==.ALT Content-Type: text/html; charset="us-ascii" At 04:12 PM 11/18/2004, Vivek Kashyap wrote:
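To make the "simple flag for a given target IP address" idea above concrete, here is a minimal sketch in C of the per-destination state a sender might keep. The structure and names are hypothetical and not taken from the connected-mode draft; the policy shown (prefer the connection when one exists, otherwise fall back to the UD QP) is only one of the local choices the thread leaves open.

    /*
     * Hypothetical per-destination state for an IPoIB-CM sender.  Not from
     * the draft: names and fields are illustrative only.  The point is the
     * "simple flag" per target IP address selecting which QP carries
     * unicast toward that peer.
     */
    #include <stddef.h>
    #include <stdint.h>

    enum ipoib_path_mode {
        IPOIB_PATH_UD,        /* send on the shared UD QP (always available)   */
        IPOIB_PATH_CONNECTED  /* send on the RC/UC connection, if established  */
    };

    struct ipoib_neigh {
        uint32_t             remote_qpn;  /* link-layer address: peer's UD QPN  */
        enum ipoib_path_mode mode;        /* the per-destination flag           */
        void                *conn;        /* opaque handle to the CM connection */
    };

    /* One possible local policy: use the connection when it exists, fall
     * back to UD otherwise (e.g. before the connection has been set up). */
    static enum ipoib_path_mode ipoib_select_path(const struct ipoib_neigh *n)
    {
        if (n->mode == IPOIB_PATH_CONNECTED && n->conn != NULL)
            return IPOIB_PATH_CONNECTED;
        return IPOIB_PATH_UD;
    }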
    Mike --=====================_88959486==.ALT-- --===============1908633025== Content-Type: text/plain; charset="us-ascii" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit Content-Disposition: inline Content-Transfer-Encoding: 7bit _______________________________________________ IPoverIB mailing list IPoverIB@ietf.org https://www1.ietf.org/mailman/listinfo/ipoverib --===============1908633025==-- From ipoverib-bounces@ietf.org Thu Nov 18 20:18:22 2004 Received: from megatron.ietf.org (megatron.ietf.org [132.151.6.71]) by ietf.org (8.9.1a/8.9.1a) with ESMTP id UAA13345 for ; Thu, 18 Nov 2004 20:18:22 -0500 (EST) Received: from localhost.localdomain ([127.0.0.1] helo=megatron.ietf.org) by megatron.ietf.org with esmtp (Exim 4.32) id 1CUxMe-00025K-NN; Thu, 18 Nov 2004 20:15:04 -0500 Received: from odin.ietf.org ([132.151.1.176] helo=ietf.org) by megatron.ietf.org with esmtp (Exim 4.32) id 1CUxLa-0001MC-9l for ipoverib@megatron.ietf.org; Thu, 18 Nov 2004 20:13:58 -0500 Received: from ietf-mx.ietf.org (ietf-mx.ietf.org [132.151.6.1]) by ietf.org (8.9.1a/8.9.1a) with ESMTP id UAA12878 for ; Thu, 18 Nov 2004 20:13:56 -0500 (EST) Received: from e33.co.us.ibm.com ([32.97.110.131]) by ietf-mx.ietf.org with esmtp (Exim 4.33) id 1CUxOJ-0004uo-2G for ipoverib@ietf.org; Thu, 18 Nov 2004 20:16:47 -0500 Received: from d03relay04.boulder.ibm.com (d03relay04.boulder.ibm.com [9.17.195.106]) by e33.co.us.ibm.com (8.12.10/8.12.9) with ESMTP id iAJ1DQJT729970 for ; Thu, 18 Nov 2004 20:13:26 -0500 Received: from d03av04.boulder.ibm.com (d03av04.boulder.ibm.com [9.17.195.170]) by d03relay04.boulder.ibm.com (8.12.10/NCO/VER6.6) with ESMTP id iAJ1DPmo134128 for ; Thu, 18 Nov 2004 18:13:25 -0700 Received: from d03av04.boulder.ibm.com (loopback [127.0.0.1]) by d03av04.boulder.ibm.com (8.12.11/8.12.11) with ESMTP id iAJ1DPVX012032 for ; Thu, 18 Nov 2004 18:13:25 -0700 Received: from DYN319548.beaverton.ibm.com (DYN319548.beaverton.ibm.com [9.47.22.85]) by d03av04.boulder.ibm.com (8.12.11/8.12.11) with ESMTP id iAJ1DO1B012019; Thu, 18 Nov 2004 18:13:25 -0700 Date: Thu, 18 Nov 2004 17:14:22 -0800 (PST) From: Vivek Kashyap X-X-Sender: kashyapv@dyn319548.beaverton.ibm.com To: Michael Krause Subject: Re: [Ipoverib] A Couple of IPoIB Questions In-Reply-To: <6.1.2.0.2.20041118161705.0cbc5900@esmail.cup.hp.com> Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII X-Spam-Score: 0.0 (/) X-Scan-Signature: 10d3e4e3c32e363f129e380e644649be Cc: IPoverIB X-BeenThere: ipoverib@ietf.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: IP over InfiniBand WG Discussion List List-Unsubscribe: , List-Post: List-Help: List-Subscribe: , Sender: ipoverib-bounces@ietf.org Errors-To: ipoverib-bounces@ietf.org On Thu, 18 Nov 2004, Michael Krause wrote: > At 04:12 PM 11/18/2004, Vivek Kashyap wrote: > >On Thu, 18 Nov 2004, Michael Krause wrote: > > > > > At 11:33 AM 11/18/2004, Vivek Kashyap wrote: > > > >On Thu, 18 Nov 2004, Michael Krause wrote: > > > > > > > > > At 10:46 PM 11/17/2004, Vivek Kashyap wrote: > > > > > >Mike the format is really off in the last mail from you - making it > > > > difficult > > > > > >to follow. > > > > > > > > > > > > > > > > > >Other than that let us discuss in the context of the draft. The > > draft is > > > > > >built upon the following: > > > > > > > > > > > >1. IPoIB-RC and IPoIB-UC are optional. > > > > > > > > > > I would prefer only one be used - either RC or UC. I've provided some > > > > > logic for either one as a preference but don't see a reason to have > > > > > both. 
Both just leads to options which leads to interoperability > > problems. > > > > > > > >ok. > > > >See section 3.1 of the draft draft-kashyap-ipoib-connected-mode-02.txt. > > > >It states that the RC and UC are mutually exclusive flags. > > > > > > My preference is to only support one of the two in a spec not to have > > flags > > > to indicate what is implemented. The benefits of connected mode operation > > > should be done with only one form of communication not two. > > > >A given subnet will support only one of the two. Not both simultaneously. The > >flag only indicates which type it is. RC and UC are both useful to different > >people and implementations so both are allowed. I suggest that both not be > >allowed in the same IPoIB subnet though. > RC and UC both have benefits. There is almost no difference other than the connection flag between the two. > To be explicit, I think there is benefit in implementing one and only one > of the two. Having two options serves no purpose and adds unnecessary > complexity. Interoperability will end up requiring both to be done if > customers are to not get upset. Let's just pick one of the two and apply > KISS. To get this started, I'll propose RC as that is a bit nicer to the > fabric than UC and is already implemented in most OS and CA drivers today > so it makes it faster to adopt with minimal driver software update. > > > > > > If a designer is stupid, they may do this. However, one would expect some > > > intelligence here and one may prefer to have specific data flows or > > > DiffServ code points or whatever used to determine which connection or > > > which UD QP and that one would again apply an intelligent and predictable > > > algorithm such that mix-n-match for a given TCP connection does not > > > occur. Given multiple *C QP can be supported, it is not tenable to state > > > that all unicast must go over a given QP or that no unicast can occur on a > > > UD QP. > > > > > > >You mised my point which was that the specification cannot be silent on this > >and say it is a local issue. That can lead to interoperability failure. The > >specification must support or disallow unicast communication over UD QP > >in an > >IPoIB-CM. > > > >You prefer that such communication be supported. That works. Any other > >thoughts? > > I prefer that guidance be provided and that it remain a local > implementation issue as to what QP is used for a given flow. I do not see > interoperability issues only potential performance if people are > stupid. The industry has a way to deal with stupidity and too much time is > spent on preventing people from being stupid. Even a so-so intelligent > implementation could have a simple flag for a given target IP address that > states which QP to target for all or a subset of the flows with minimal > cost to implement and troubleshoot / validate. If something is left unspecified there is every chance that incompatible implementations result - this has nothing to do with the mental faculties of the implementors. Therefore I'll add relevant text. 
> > Mike _______________________________________________ IPoverIB mailing list IPoverIB@ietf.org https://www1.ietf.org/mailman/listinfo/ipoverib From ipoverib-bounces@ietf.org Thu Nov 18 20:56:36 2004 Received: from megatron.ietf.org (megatron.ietf.org [132.151.6.71]) by ietf.org (8.9.1a/8.9.1a) with ESMTP id UAA16397 for ; Thu, 18 Nov 2004 20:56:36 -0500 (EST) Received: from localhost.localdomain ([127.0.0.1] helo=megatron.ietf.org) by megatron.ietf.org with esmtp (Exim 4.32) id 1CUxvN-0001W0-Fg; Thu, 18 Nov 2004 20:50:57 -0500 Received: from odin.ietf.org ([132.151.1.176] helo=ietf.org) by megatron.ietf.org with esmtp (Exim 4.32) id 1CUxr5-0000ae-0H for ipoverib@megatron.ietf.org; Thu, 18 Nov 2004 20:46:31 -0500 Received: from ietf-mx.ietf.org (ietf-mx.ietf.org [132.151.6.1]) by ietf.org (8.9.1a/8.9.1a) with ESMTP id UAA15725 for ; Thu, 18 Nov 2004 20:46:28 -0500 (EST) Received: from palrel12.hp.com ([156.153.255.237]) by ietf-mx.ietf.org with esmtp (Exim 4.33) id 1CUxtj-0005cu-Rz for ipoverib@ietf.org; Thu, 18 Nov 2004 20:49:17 -0500 Received: from esmail.cup.hp.com (esmail.cup.hp.com [15.0.65.164]) by palrel12.hp.com (Postfix) with ESMTP id E7AA3406C11; Thu, 18 Nov 2004 17:46:24 -0800 (PST) Received: from MK73191c.cup.hp.com ([15.244.201.129]) by esmail.cup.hp.com (8.9.3 (PHNE_29774)/8.8.6) with ESMTP id RAA05281; Thu, 18 Nov 2004 17:43:55 -0800 (PST) Message-Id: <6.1.2.0.2.20041118174214.0cbee5a8@esmail.cup.hp.com> X-Sender: krause@esmail.cup.hp.com X-Mailer: QUALCOMM Windows Eudora Version 6.1.2.0 Date: Thu, 18 Nov 2004 17:45:09 -0800 To: Vivek Kashyap From: Michael Krause Subject: Re: [Ipoverib] A Couple of IPoIB Questions In-Reply-To: References: <6.1.2.0.2.20041118161705.0cbc5900@esmail.cup.hp.com> Mime-Version: 1.0 X-Spam-Score: 0.0 (/) X-Scan-Signature: 1449ead51a2ff026dcb23465f5379250 Cc: IPoverIB X-BeenThere: ipoverib@ietf.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: IP over InfiniBand WG Discussion List List-Unsubscribe: , List-Post: List-Help: List-Subscribe: , Content-Type: multipart/mixed; boundary="===============0416962128==" Sender: ipoverib-bounces@ietf.org Errors-To: ipoverib-bounces@ietf.org --===============0416962128== Content-Type: multipart/alternative; boundary="=====================_93881494==.ALT" --=====================_93881494==.ALT Content-Type: text/plain; charset="us-ascii"; format=flowed At 05:14 PM 11/18/2004, Vivek Kashyap wrote: >RC and UC both have benefits. There is almost no difference other than >the connection flag between the two. Many host OS implementations do not support UC as RC and UD are all that is really required within the industry. The ACK overhead associated with RC is truly noise and the end-to-end credits are very nice as IB now supports three signaling rates combined with 4 link widths (though only three are really being implemented). Such a permutation in bandwidth capability makes RC a more tenable / good citizen as we designed it to be so I'd prefer RC. > > To be explicit, I think there is benefit in implementing one and only one > > of the two. Having two options serves no purpose and adds unnecessary > > complexity. Interoperability will end up requiring both to be done if > > customers are to not get upset. Let's just pick one of the two and apply > > KISS. To get this started, I'll propose RC as that is a bit nicer to the > > fabric than UC and is already implemented in most OS and CA drivers today > > so it makes it faster to adopt with minimal driver software update. 
> > > > > > > > > > If a designer is stupid, they may do this. However, one would > expect some > > > > intelligence here and one may prefer to have specific data flows or > > > > DiffServ code points or whatever used to determine which connection or > > > > which UD QP and that one would again apply an intelligent and > predictable > > > > algorithm such that mix-n-match for a given TCP connection does not > > > > occur. Given multiple *C QP can be supported, it is not tenable to > state > > > > that all unicast must go over a given QP or that no unicast can > occur on a > > > > UD QP. > > > > > > > > > >You mised my point which was that the specification cannot be silent > on this > > >and say it is a local issue. That can lead to interoperability > failure. The > > >specification must support or disallow unicast communication over UD QP > > >in an > > >IPoIB-CM. > > > > > >You prefer that such communication be supported. That works. Any other > > >thoughts? > > > > I prefer that guidance be provided and that it remain a local > > implementation issue as to what QP is used for a given flow. I do not see > > interoperability issues only potential performance if people are > > stupid. The industry has a way to deal with stupidity and too much > time is > > spent on preventing people from being stupid. Even a so-so intelligent > > implementation could have a simple flag for a given target IP address that > > states which QP to target for all or a subset of the flows with minimal > > cost to implement and troubleshoot / validate. > >If something is left unspecified there is every chance that incompatible >implementations result - this has nothing to do with the mental faculties >of the implementors. Therefore I'll add relevant text. One can just provide an implementation note to avoid any mental short comings and still avoid specifying this. That should be sufficient while maintaining KISS. Mike --=====================_93881494==.ALT Content-Type: text/html; charset="us-ascii" At 05:14 PM 11/18/2004, Vivek Kashyap wrote:
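As an aside on how small the RC/UC difference is at the verbs level: in the later libibverbs API (which post-dates this thread and is used here purely for illustration, not as the API the draft assumes), creating a UC queue pair instead of an RC one changes nothing but the qp_type constant; the protocol-level differences in acknowledgements, retries, and end-to-end credits all sit behind that one field.

    /* Illustration only, using libibverbs (a later API than those in use
     * when this thread was written): an RC and a UC QP are created the
     * same way except for qp_type. */
    #include <infiniband/verbs.h>

    static struct ibv_qp *create_connected_qp(struct ibv_pd *pd,
                                              struct ibv_cq *cq,
                                              int use_rc /* nonzero => RC */)
    {
        struct ibv_qp_init_attr attr = {
            .send_cq = cq,
            .recv_cq = cq,
            .cap     = {
                .max_send_wr  = 256,
                .max_recv_wr  = 256,
                .max_send_sge = 1,
                .max_recv_sge = 1,
            },
            .qp_type = use_rc ? IBV_QPT_RC : IBV_QPT_UC,
        };
        return ibv_create_qp(pd, &attr);   /* NULL on failure */
    }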


    --=====================_93881494==.ALT-- --===============0416962128== Content-Type: text/plain; charset="us-ascii" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit Content-Disposition: inline Content-Transfer-Encoding: 7bit _______________________________________________ IPoverIB mailing list IPoverIB@ietf.org https://www1.ietf.org/mailman/listinfo/ipoverib --===============0416962128==-- From ipoverib-bounces@ietf.org Fri Nov 19 13:31:45 2004 Received: from megatron.ietf.org (megatron.ietf.org [132.151.6.71]) by ietf.org (8.9.1a/8.9.1a) with ESMTP id NAA29827 for ; Fri, 19 Nov 2004 13:31:44 -0500 (EST) Received: from localhost.localdomain ([127.0.0.1] helo=megatron.ietf.org) by megatron.ietf.org with esmtp (Exim 4.32) id 1CVDUU-0007XN-8m; Fri, 19 Nov 2004 13:28:14 -0500 Received: from odin.ietf.org ([132.151.1.176] helo=ietf.org) by megatron.ietf.org with esmtp (Exim 4.32) id 1CVDOO-0005wq-B2 for ipoverib@megatron.ietf.org; Fri, 19 Nov 2004 13:21:56 -0500 Received: from ietf-mx.ietf.org (ietf-mx.ietf.org [132.151.6.1]) by ietf.org (8.9.1a/8.9.1a) with ESMTP id NAA28698 for ; Fri, 19 Nov 2004 13:21:53 -0500 (EST) Received: from nwkea-mail-2.sun.com ([192.18.42.14]) by ietf-mx.ietf.org with esmtp (Exim 4.33) id 1CVDRD-0002yy-5B for ipoverib@ietf.org; Fri, 19 Nov 2004 13:24:54 -0500 Received: from jurassic.eng.sun.com ([129.146.89.31]) by nwkea-mail-2.sun.com (8.12.10/8.12.9) with ESMTP id iAJILppv024996; Fri, 19 Nov 2004 10:21:51 -0800 (PST) Received: from taipei (taipei.SFBay.Sun.COM [129.146.85.178]) by jurassic.eng.sun.com (8.13.1+Sun/8.13.1) with SMTP id iAJILoCl397285; Fri, 19 Nov 2004 10:21:51 -0800 (PST) Message-Id: <200411191821.iAJILoCl397285@jurassic.eng.sun.com> Date: Fri, 19 Nov 2004 10:20:16 -0800 (PST) From: "H.K. Jerry Chu" Subject: Re: [Ipoverib] comments on draft-kashyap-ipoib-connected-mode-02.txt To: kashyapv@us.ibm.com MIME-Version: 1.0 Content-Type: TEXT/plain; charset=us-ascii Content-MD5: V65NekPioitPmWqWw9llYg== X-Mailer: dtmail 1.3.0 @(#)CDE Version 1.6_68 SunOS 5.10 sun4u sparc X-Spam-Score: 0.0 (/) X-Scan-Signature: 5011df3e2a27abcc044eaa15befcaa87 Cc: ipoverib@ietf.org X-BeenThere: ipoverib@ietf.org X-Mailman-Version: 2.1.5 Precedence: list Reply-To: "H.K. Jerry Chu" List-Id: IP over InfiniBand WG Discussion List List-Unsubscribe: , List-Post: List-Help: List-Subscribe: , Sender: ipoverib-bounces@ietf.org Errors-To: ipoverib-bounces@ietf.org >> A much simpler model, which I think was presented in earlier >> drafts, is to fold the use of IB connections fully into a >> regular IPoIB-UD subnet, allowing any two IPoIB nodes to >> optionally negotiate the use of IB connection between themselves. > >The difference in the earlier draft and this one is that >I modified the requirement on the UD QP. That is, it need not be that >IPoIB-CM and IPoIB-UD share a QP but that any UD QP will do for IPoIB-CM. >In effect an implementation can still share the UD QP. > >The only issue is whether the same IP subnet can contain pure >IPoIB-UD mixed in with IPoIB-CM nodes or, all nodes must be of the same type. > - all IPoIB-UD >or > - all IPoIB-RC > >or -- all IPoIB-UC > >I beleive all of the same type is a good option to choose. I don't see a clear benefit for this restriction. E.g., even in all IPoIB-RC or IPoIB-UC, the nice per-link MTU property is no longer there due to multicast supported through UD. 
Also this restriction will require those implementations that don't support IPoIB over UC or RC to form a different subnet in order to talk IPoIB, hence forcing the adminstrator to maintain at least two IP subnets with one fully contained within another. I don't see why this is needed. > >> >> This much simplified model is not without its drawback. Some >> nice IP link attributes are no longer unique within a link. >> E.g., the link MTU now becomes per-node-pair MTU. Moreover, >> the MTU size for multicast will be different from the MTU size >> for unicast if IB connections are used. IB UC/RC may exhibit >> different RAS, flow control, QoS or other link characteristics >> than UD. But I consider these problems a reasonable price to >> pay for a seamless support of UC/RC mode in an IPoIB link >> defined by UD. >> >> 2. The negotiation of the per-connection MTU seems more >> complicated than necessary. I think all is needed is for a >> node to advertise its own "receive MTU". That is, the MTU >> size its peer should never go over when sending packets >> to the local interface. Yes this may break the traditional >> concept of "symmetric" MTUs. But we're already breaking the >> notion of per-link MTU, requring a lot of changes in the host >> stack anyway. This additonal breakage doesn't seem much. >> >> I haven't verified if this asymmetric MTU matches well with >> IBA connections though. > >How about: > >The MTU I would think is exchanged at the IB level during the >IPoIB-CM connection setup. The IP layer at both ends keeps a per connection >MTU if the implementation permits it. At the link layer the connection will >not send messages larger than that requested by the peer. Not quite understand the above. I'm suggesting to simplify the MTU negotiation at the IPoIB-CM connection setup time by each side advertising the "receive MTU" it can take. The peer must not send more than that size in each post_send(). E.g., if node A advertizes 32KB as its receive MTU and node B advertizes 64KB as its receive MTU, node B must not send any IP pkt through IPoIB-CM to node A that is larger than 32KB. Node A is free to send IP pkts of up to 64KB in size to node B. (But if node A decides to restrict its outbound MTU to 32KB, that's fine too. Node B doesn't need to know about it.) I'm not sure what you mean by the last two sentences above. MTU value must be made known to the IP layer so that latter won't send anything larger than that. Otherwise the pkt will get dropped by the IB layer (unless the latter performs SAR, which is a bad idea). Jerry > > >> >> 3. Regarding allowing multiple IB connections between a node >> pair, since given an IP address there is only one link-address >> for it implying one QPN, hence one service-ID, if a single >> service-ID can be used to create multiple IB connections >> then this can happen transparently. Otherwise we've got a >> problem. 
>> >> Jerry >> >> >> _______________________________________________ >> IPoverIB mailing list >> IPoverIB@ietf.org >> https://www1.ietf.org/mailman/listinfo/ipoverib >> >> > > > _______________________________________________ IPoverIB mailing list IPoverIB@ietf.org https://www1.ietf.org/mailman/listinfo/ipoverib From ipoverib-bounces@ietf.org Fri Nov 19 22:51:10 2004 Received: from megatron.ietf.org (megatron.ietf.org [132.151.6.71]) by ietf.org (8.9.1a/8.9.1a) with ESMTP id WAA26309 for ; Fri, 19 Nov 2004 22:51:10 -0500 (EST) Received: from localhost.localdomain ([127.0.0.1] helo=megatron.ietf.org) by megatron.ietf.org with esmtp (Exim 4.32) id 1CVMBo-0001FZ-2t; Fri, 19 Nov 2004 22:45:32 -0500 Received: from odin.ietf.org ([132.151.1.176] helo=ietf.org) by megatron.ietf.org with esmtp (Exim 4.32) id 1CVM4F-0007Q2-N6 for ipoverib@megatron.ietf.org; Fri, 19 Nov 2004 22:37:43 -0500 Received: from ietf-mx.ietf.org (ietf-mx.ietf.org [132.151.6.1]) by ietf.org (8.9.1a/8.9.1a) with ESMTP id WAA25297 for ; Fri, 19 Nov 2004 22:37:41 -0500 (EST) Received: from atorelbas04.hp.com ([156.153.255.238] helo=palrel13.hp.com) by ietf-mx.ietf.org with esmtp (Exim 4.33) id 1CVM7C-0001ld-MB for ipoverib@ietf.org; Fri, 19 Nov 2004 22:40:47 -0500 Received: from esmail.cup.hp.com (esmail.cup.hp.com [15.0.65.164]) by palrel13.hp.com (Postfix) with ESMTP id A64281C00417 for ; Fri, 19 Nov 2004 19:37:41 -0800 (PST) Received: from MK73191c.cup.hp.com (mk731916.cup.hp.com [15.8.80.134]) by esmail.cup.hp.com (8.9.3 (PHNE_29774)/8.8.6) with ESMTP id TAA02474 for ; Fri, 19 Nov 2004 19:35:20 -0800 (PST) Message-Id: <6.1.2.0.2.20041119192315.04b9bab0@esmail.cup.hp.com> X-Sender: krause@esmail.cup.hp.com X-Mailer: QUALCOMM Windows Eudora Version 6.1.2.0 Date: Fri, 19 Nov 2004 19:30:36 -0800 To: "IPoverIB" From: Michael Krause Subject: Re: [Ipoverib] comments on draft-kashyap-ipoib-connected-mode-02.txt In-Reply-To: <200411191821.iAJILoCl397285@jurassic.eng.sun.com> References: <200411191821.iAJILoCl397285@jurassic.eng.sun.com> Mime-Version: 1.0 X-Spam-Score: 0.0 (/) X-Scan-Signature: 3f3e54d3c03ed638c06aa9fa6861237e X-BeenThere: ipoverib@ietf.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: IP over InfiniBand WG Discussion List List-Unsubscribe: , List-Post: List-Help: List-Subscribe: , Content-Type: multipart/mixed; boundary="===============1141799541==" Sender: ipoverib-bounces@ietf.org Errors-To: ipoverib-bounces@ietf.org --===============1141799541== Content-Type: multipart/alternative; boundary="=====================_47754637==.ALT" --=====================_47754637==.ALT Content-Type: text/plain; charset="us-ascii"; format=flowed At 10:20 AM 11/19/2004, H.K. Jerry Chu wrote: > > > >> A much simpler model, which I think was presented in earlier > >> drafts, is to fold the use of IB connections fully into a > >> regular IPoIB-UD subnet, allowing any two IPoIB nodes to > >> optionally negotiate the use of IB connection between themselves. > > > >The difference in the earlier draft and this one is that > >I modified the requirement on the UD QP. That is, it need not be that > >IPoIB-CM and IPoIB-UD share a QP but that any UD QP will do for IPoIB-CM. > >In effect an implementation can still share the UD QP. > > > >The only issue is whether the same IP subnet can contain pure > >IPoIB-UD mixed in with IPoIB-CM nodes or, all nodes must be of the same > type. > > - all IPoIB-UD > >or > > - all IPoIB-RC > > > >or -- all IPoIB-UC > > > >I beleive all of the same type is a good option to choose. 
> >I don't see a clear benefit for this restriction. E.g., even in all >IPoIB-RC or >IPoIB-UC, the nice per-link MTU property is no longer there due to multicast >supported through UD. Also this restriction will require those implementations >that don't support IPoIB over UC or RC to form a different subnet in order to >talk IPoIB, hence forcing the adminstrator to maintain at least two IP subnets >with one fully contained within another. I don't see why this is needed. I maintain that *C and UD can co-exist in the same IP subnet and there is no reason to restrict this. Endnode pairs will establish their communication paths and take the appropriate QP to reach a given destination. This is all a local issue in the end sans the all unicast debate in an earlier string. > > > >> > >> This much simplified model is not without its drawback. Some > >> nice IP link attributes are no longer unique within a link. > >> E.g., the link MTU now becomes per-node-pair MTU. Moreover, > >> the MTU size for multicast will be different from the MTU size > >> for unicast if IB connections are used. IB UC/RC may exhibit > >> different RAS, flow control, QoS or other link characteristics > >> than UD. But I consider these problems a reasonable price to > >> pay for a seamless support of UC/RC mode in an IPoIB link > >> defined by UD. > >> > >> 2. The negotiation of the per-connection MTU seems more > >> complicated than necessary. I think all is needed is for a > >> node to advertise its own "receive MTU". That is, the MTU > >> size its peer should never go over when sending packets > >> to the local interface. Yes this may break the traditional > >> concept of "symmetric" MTUs. But we're already breaking the > >> notion of per-link MTU, requring a lot of changes in the host > >> stack anyway. This additonal breakage doesn't seem much. > >> > >> I haven't verified if this asymmetric MTU matches well with > >> IBA connections though. > > > >How about: > > > >The MTU I would think is exchanged at the IB level during the > >IPoIB-CM connection setup. The IP layer at both ends keeps a per connection > >MTU if the implementation permits it. At the link layer the connection will > >not send messages larger than that requested by the peer. > >Not quite understand the above. I'm suggesting to simplify the MTU negotiation >at the IPoIB-CM connection setup time by each side advertising the "receive >MTU" it can take. The peer must not send more than that size in each >post_send(). >E.g., if node A advertizes 32KB as its receive MTU and node B advertizes 64KB >as its receive MTU, node B must not send any IP pkt through IPoIB-CM to node >A that is larger than 32KB. Node A is free to send IP pkts of up to 64KB in >size to node B. (But if node A decides to restrict its outbound MTU to 32KB, >that's fine too. Node B doesn't need to know about it.) > >I'm not sure what you mean by the last two sentences above. MTU value must >be made known to the IP layer so that latter won't send anything larger >than that. Otherwise the pkt will get dropped by the IB layer (unless the >latter performs SAR, which is a bad idea). One might argue that *C is focused on an equivalence to TSO (large send) thus the logical MTU is not required. One might argue that the logical MTU represents an asymmetric maximum receive buffer that will be posted thus messages must be sent that do not exceed this maximum. One might argue that having a single buffer size independent of the *C / UD being used maximizes KISS. 
I'm open to exploring these options but do not believe all must be supported. > >> 3. Regarding allowing multiple IB connections between a node > >> pair, since given an IP address there is only one link-address > >> for it implying one QPN, hence one service-ID, if a single > >> service-ID can be used to create multiple IB connections > >> then this can happen transparently. Otherwise we've got a > >> problem. A service ID can be used to establish multiple connections thus the creation process should be left as an implementation detail in terms of how many, etc. as I've noted in a previous response. The local endnodes will determine what is allowed per endnode pair and there are no interoperability issues that arise as a result. Mike --=====================_47754637==.ALT Content-Type: text/html; charset="us-ascii" At 10:20 AM 11/19/2004, H.K. Jerry Chu wrote:
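To restate Jerry's asymmetric "receive MTU" proposal in code form, here is a minimal sketch. The private-data layout and names are hypothetical (nothing about the draft or the CM REQ/REP format is implied); it only captures the rule that each side advertises what it will accept and a sender never exceeds the peer's advertised value.

    /* Hypothetical sketch of the asymmetric receive-MTU rule discussed
     * above.  Field names and sizes are illustrative only. */
    #include <stdbool.h>
    #include <stdint.h>

    struct ipoib_cm_priv_data {
        uint32_t recv_mtu;          /* largest message this node will accept */
    };

    struct ipoib_cm_conn {
        uint32_t local_recv_mtu;    /* value we advertised at connection setup */
        uint32_t peer_recv_mtu;     /* value the peer advertised to us         */
    };

    /* The two directions are independent: a sender only has to respect the
     * peer's advertised receive MTU.  In the thread's example, A advertises
     * 32KB and B advertises 64KB, so B may send A at most 32KB per message
     * while A may send B up to 64KB. */
    static bool ipoib_cm_can_send(const struct ipoib_cm_conn *c, uint32_t len)
    {
        return len <= c->peer_recv_mtu;
    }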
    --=====================_47754637==.ALT-- --===============1141799541== Content-Type: text/plain; charset="us-ascii" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit Content-Disposition: inline Content-Transfer-Encoding: 7bit _______________________________________________ IPoverIB mailing list IPoverIB@ietf.org https://www1.ietf.org/mailman/listinfo/ipoverib --===============1141799541==-- From ipoverib-bounces@ietf.org Sat Nov 20 12:08:06 2004 Received: from megatron.ietf.org (megatron.ietf.org [132.151.6.71]) by ietf.org (8.9.1a/8.9.1a) with ESMTP id MAA08219 for ; Sat, 20 Nov 2004 12:08:06 -0500 (EST) Received: from localhost.localdomain ([127.0.0.1] helo=megatron.ietf.org) by megatron.ietf.org with esmtp (Exim 4.32) id 1CVYXC-0002cm-RG; Sat, 20 Nov 2004 11:56:26 -0500 Received: from odin.ietf.org ([132.151.1.176] helo=ietf.org) by megatron.ietf.org with esmtp (Exim 4.32) id 1CVYSn-0001zF-BC for ipoverib@megatron.ietf.org; Sat, 20 Nov 2004 11:51:53 -0500 Received: from ietf-mx.ietf.org (ietf-mx.ietf.org [132.151.6.1]) by ietf.org (8.9.1a/8.9.1a) with ESMTP id LAA07153 for ; Sat, 20 Nov 2004 11:51:50 -0500 (EST) Received: from mail.mellanox.co.il ([194.90.237.34] helo=mtlex01.yok.mtl.com) by ietf-mx.ietf.org with esmtp (Exim 4.33) id 1CVYVp-0000db-Jh for ipoverib@ietf.org; Sat, 20 Nov 2004 11:55:04 -0500 Received: by mtlex01.yok.mtl.com with Internet Mail Service (5.5.2653.19) id ; Sat, 20 Nov 2004 18:49:17 +0200 Message-ID: <506C3D7B14CDD411A52C00025558DED6067488EB@mtlex01.yok.mtl.com> From: Dror Goldenberg To: Bill.Strahm@Sun.COM, Yaron Haviv Subject: RE: [Ipoverib] IPoIB-RC and Checksums Date: Sat, 20 Nov 2004 18:49:09 +0200 MIME-Version: 1.0 X-Mailer: Internet Mail Service (5.5.2653.19) X-Spam-Score: 0.8 (/) X-Scan-Signature: 6ffdee8af20de249c24731d8414917d3 Cc: IPoverIB X-BeenThere: ipoverib@ietf.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: IP over InfiniBand WG Discussion List List-Unsubscribe: , List-Post: List-Help: List-Subscribe: , Content-Type: multipart/mixed; boundary="===============1955679407==" Sender: ipoverib-bounces@ietf.org Errors-To: ipoverib-bounces@ietf.org This message is in MIME format. Since your mail reader does not understand this format, some or all of this message may not be legible. --===============1955679407== Content-Type: multipart/alternative; boundary="----_=_NextPart_001_01C4CF20.DB00EAB0" This message is in MIME format. Since your mail reader does not understand this format, some or all of this message may not be legible. ------_=_NextPart_001_01C4CF20.DB00EAB0 Content-Type: text/plain > -----Original Message----- > From: Bill Strahm [mailto:Bill.Strahm@Sun.COM] > Sent: Friday, November 19, 2004 1:11 AM > > On Thu, 2004-11-18 at 22:18 +0200, Yaron Haviv wrote: > > P.S. another note, we discussed in IETF was that we may want to > > mention/suggest (not mandate) use of SRQ for IPoIB-RC in order to > > preserve memory > > > Again, in the spirit of Wire protocol vs. Implementation. I > think this is an implementation issue that will not change > wire protocols at all. Is there a point where using SRQ vs. > Not Using SRQ would have to change the wire protocol ? > > If not - lets not say anything. > If there is - I would be very interested in understanding. Subtle difference: In UC, I don't believe that there is a difference. In RC, the ACKs will be sent with valid end to end credits by the responder HCA if it's connected to a regular QP, and with invalid end to end credits if the QP is connected to a SRQ. 
This can be observed on the wire with an analyzer. However, it never propagates to the SW above the IB verbs. -Dror > > Bill > > > _______________________________________________ > IPoverIB mailing list > IPoverIB@ietf.org https://www1.ietf.org/mailman/listinfo/ipoverib > ------_=_NextPart_001_01C4CF20.DB00EAB0 Content-Type: text/html Content-Transfer-Encoding: quoted-printable RE: [Ipoverib] IPoIB-RC and Checksums
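For readers following the SRQ aside, the sketch below shows what "connected QPs fed from a shared receive queue" looks like, using the later libibverbs API purely for illustration (it is not the interface in use when this was written). The end-to-end-credit difference Dror describes is a side effect of this attachment and, as he notes, stays invisible above the verbs layer.

    /* Illustration only (libibverbs post-dates this thread): receive
     * buffers are pooled in one shared receive queue instead of being
     * posted per RC connection, which is the memory saving mentioned. */
    #include <infiniband/verbs.h>

    static struct ibv_srq *create_shared_rq(struct ibv_pd *pd)
    {
        struct ibv_srq_init_attr attr = {
            .attr = { .max_wr = 4096, .max_sge = 1 },
        };
        return ibv_create_srq(pd, &attr);       /* NULL on failure */
    }

    static struct ibv_qp *create_rc_qp_on_srq(struct ibv_pd *pd,
                                              struct ibv_cq *cq,
                                              struct ibv_srq *srq)
    {
        struct ibv_qp_init_attr attr = {
            .send_cq = cq,
            .recv_cq = cq,
            .srq     = srq,   /* all receives are taken from the shared pool */
            .cap     = { .max_send_wr = 256, .max_send_sge = 1 },
            .qp_type = IBV_QPT_RC,
        };
        return ibv_create_qp(pd, &attr);
    }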

    ------_=_NextPart_001_01C4CF20.DB00EAB0-- --===============1955679407== Content-Type: text/plain; charset="us-ascii" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit Content-Disposition: inline Content-Transfer-Encoding: 7bit _______________________________________________ IPoverIB mailing list IPoverIB@ietf.org https://www1.ietf.org/mailman/listinfo/ipoverib --===============1955679407==-- From ipoverib-bounces@ietf.org Sat Nov 20 12:08:48 2004 Received: from megatron.ietf.org (megatron.ietf.org [132.151.6.71]) by ietf.org (8.9.1a/8.9.1a) with ESMTP id MAA08292 for ; Sat, 20 Nov 2004 12:08:48 -0500 (EST) Received: from localhost.localdomain ([127.0.0.1] helo=megatron.ietf.org) by megatron.ietf.org with esmtp (Exim 4.32) id 1CVYXD-0002cx-3p; Sat, 20 Nov 2004 11:56:27 -0500 Received: from odin.ietf.org ([132.151.1.176] helo=ietf.org) by megatron.ietf.org with esmtp (Exim 4.32) id 1CVYSo-0001zH-Gu for ipoverib@megatron.ietf.org; Sat, 20 Nov 2004 11:51:54 -0500 Received: from ietf-mx.ietf.org (ietf-mx.ietf.org [132.151.6.1]) by ietf.org (8.9.1a/8.9.1a) with ESMTP id LAA07158 for ; Sat, 20 Nov 2004 11:51:51 -0500 (EST) Received: from mail.mellanox.co.il ([194.90.237.34] helo=mtlex01.yok.mtl.com) by ietf-mx.ietf.org with esmtp (Exim 4.33) id 1CVYVp-0000dZ-IA for ipoverib@ietf.org; Sat, 20 Nov 2004 11:55:05 -0500 Received: by mtlex01.yok.mtl.com with Internet Mail Service (5.5.2653.19) id ; Sat, 20 Nov 2004 18:49:17 +0200 Message-ID: <506C3D7B14CDD411A52C00025558DED6067488EA@mtlex01.yok.mtl.com> From: Dror Goldenberg To: Michael Krause , Vivek Kashyap Subject: RE: [Ipoverib] A Couple of IPoIB Questions Date: Sat, 20 Nov 2004 18:49:08 +0200 MIME-Version: 1.0 X-Mailer: Internet Mail Service (5.5.2653.19) X-Spam-Score: 0.9 (/) X-Scan-Signature: 932cba6e0228cc603da43d861a7e09d8 Cc: IPoverIB X-BeenThere: ipoverib@ietf.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: IP over InfiniBand WG Discussion List List-Unsubscribe: , List-Post: List-Help: List-Subscribe: , Content-Type: multipart/mixed; boundary="===============2046247154==" Sender: ipoverib-bounces@ietf.org Errors-To: ipoverib-bounces@ietf.org This message is in MIME format. Since your mail reader does not understand this format, some or all of this message may not be legible. --===============2046247154== Content-Type: multipart/alternative; boundary="----_=_NextPart_001_01C4CF20.DAA84360" This message is in MIME format. Since your mail reader does not understand this format, some or all of this message may not be legible. ------_=_NextPart_001_01C4CF20.DAA84360 Content-Type: text/plain -----Original Message----- From: Michael Krause [mailto:krause@cup.hp.com] Sent: Friday, November 19, 2004 3:45 AM To: Vivek Kashyap Cc: IPoverIB Subject: Re: [Ipoverib] A Couple of IPoIB Questions At 05:14 PM 11/18/2004, Vivek Kashyap wrote: RC and UC both have benefits. There is almost no difference other than the connection flag between the two. Many host OS implementations do not support UC as RC and UD are all that is really required within the industry. The ACK overhead associated with RC is truly noise and the end-to-end credits are very nice as IB now supports three signaling rates combined with 4 link widths (though only three are really being implemented). Such a permutation in bandwidth capability makes RC a more tenable / good citizen as we designed it to be so I'd prefer RC. [DG] Mike, A few reasons I think that the end to end credits / RNR in an RC connection is a problem. 
It may be worth discussing it: 1) Lack of receive WQEs in the responder implies a slow responder. Getting the message dropped in this case is desirable for protocols that have injection control, such as TCP: in that case the sender is supposed to back off and restart more slowly. While UC/UD result in a similar behavior of messages being dropped at the receiver when it is slow, RC does not. Instead, there is persistence in getting the message transmitted, and the receiver won't be able to tell the requester that it is being slow. 2) How would you configure the RNR retry counters? Would they be configured to infinity? That doesn't sound good. Would they be configured to a finite value (it should be <7), in which case, with a slow receiver, you'd end up recreating connections that ran into end-to-end credit problems, which is a real overhead on the protocol. 3) What happens with implementations that don't support RNR NAK generation? That poses more difficulties for (2).
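To anchor point (2) in something concrete: in the later libibverbs API (used here only for illustration; the thread predates it) the two RNR knobs are programmed during the normal RC connection bring-up. An RNR retry count of 7 is the "retry forever" encoding, and anything below 7 is a finite count after which the connection errors out, which is exactly the trade-off being questioned.

    /* Illustration only, assuming the QP is already in the INIT state and
     * the address/PSN parameters have been obtained out of band. */
    #include <infiniband/verbs.h>
    #include <stdint.h>
    #include <string.h>

    static int bring_up_rc_qp(struct ibv_qp *qp, uint16_t dlid, uint32_t dqpn,
                              uint32_t rq_psn, uint32_t sq_psn, uint8_t port)
    {
        struct ibv_qp_attr attr;
        int ret;

        /* INIT -> RTR: the responder-side minimum RNR NAK timer is set here
         * (how long a requester is told to wait after an RNR NAK). */
        memset(&attr, 0, sizeof(attr));
        attr.qp_state           = IBV_QPS_RTR;
        attr.path_mtu           = IBV_MTU_2048;
        attr.dest_qp_num        = dqpn;
        attr.rq_psn             = rq_psn;
        attr.max_dest_rd_atomic = 1;
        attr.min_rnr_timer      = 12;        /* encoded value, ~0.64 ms */
        attr.ah_attr.dlid       = dlid;
        attr.ah_attr.port_num   = port;
        ret = ibv_modify_qp(qp, &attr,
                            IBV_QP_STATE | IBV_QP_AV | IBV_QP_PATH_MTU |
                            IBV_QP_DEST_QPN | IBV_QP_RQ_PSN |
                            IBV_QP_MAX_DEST_RD_ATOMIC | IBV_QP_MIN_RNR_TIMER);
        if (ret)
            return ret;

        /* RTR -> RTS: the requester-side RNR retry count is set here.
         * 7 means retry indefinitely; 0..6 give up after that many RNR NAKs. */
        memset(&attr, 0, sizeof(attr));
        attr.qp_state      = IBV_QPS_RTS;
        attr.timeout       = 14;
        attr.retry_cnt     = 7;
        attr.rnr_retry     = 7;
        attr.sq_psn        = sq_psn;
        attr.max_rd_atomic = 1;
        return ibv_modify_qp(qp, &attr,
                             IBV_QP_STATE | IBV_QP_TIMEOUT | IBV_QP_RETRY_CNT |
                             IBV_QP_RNR_RETRY | IBV_QP_SQ_PSN |
                             IBV_QP_MAX_QP_RD_ATOMIC);
    }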
     
From ipoverib-bounces@ietf.org Mon Nov 22 13:57:42 2004
From: Michael Krause
To: IPoverIB
Date: Mon, 22 Nov 2004 10:36:56 -0800
Subject: RE: [Ipoverib] A Couple of IPoIB Questions

At 08:49 AM 11/20/2004, Dror Goldenberg wrote:

>-----Original Message-----
>From: Michael Krause [mailto:krause@cup.hp.com]
>Sent: Friday, November 19, 2004 3:45 AM
>To: Vivek Kashyap
>Cc: IPoverIB
>Subject: Re: [Ipoverib] A Couple of IPoIB Questions
>
>At 05:14 PM 11/18/2004, Vivek Kashyap wrote:
>
>>RC and UC both have benefits. There is almost no difference other than
>>the connection flag between the two.
>
>Many host OS implementations do not support UC, as RC and UD are all that
>is really required within the industry. The ACK overhead associated with
>RC is truly noise, and the end-to-end credits are very nice now that IB
>supports three signaling rates combined with four link widths (though only
>three are really being implemented). Such a permutation in bandwidth
>capability makes RC the more tenable / good-citizen choice, as we designed
>it to be, so I'd prefer RC.
>
>[DG] Mike,
>A few reasons why I think the end-to-end credits / RNR in an RC connection
>are a problem. It may be worth discussing:
>1) A lack of receive WQEs in the responder implies a slow responder.
>   Getting the message dropped in this case is desirable for protocols
>   that have injection control, such as TCP, which is then supposed to
>   back off and restart more slowly. While UC/UD result in similar
>   behavior (messages being dropped at the receiver when it is slow),
>   RC does not. Instead, there is persistence in getting the message
>   transmitted, and the receiver cannot tell the requester that it is
>   being slow.

TCP on the sending side will regulate due to the lack of window-update
credits. Hence, there is no need to restart the large messages that are
put forth as the reason for using *C instead of UD.

>2) How would you configure the RNR retry counters? Would they be
>   configured to infinity? That doesn't sound good. Would they be
>   configured to a finite value (it would have to be < 7)? In that case,
>   with a slow receiver you would end up recreating connections that had
>   end-to-end credit problems, which is real overhead on the protocol.

RNR would be no different for IP over IB than for any other IB RC instance.

>3) What happens with implementations that don't support RNR NAK
>   generation? That poses more difficulties for (2).

An HCA is required to support RNR NAK. A TCA has the option. If you don't
support RC, then use UD. Where is the real problem, given that nothing
shown here on either side is more than speculation?

Mike

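For concreteness, the RNR knobs being debated are per-QP attributes set when
an RC queue pair is moved to RTR and RTS. A minimal sketch, assuming the
OpenFabrics verbs API (libibverbs); the numeric values are illustrative, not
a recommendation from this thread, and the other mandatory attributes for
each transition are omitted:

    #include <string.h>
    #include <infiniband/verbs.h>

    /* Illustrative only: where the RNR-related attributes live on an RC QP.
     * rnr_retry = 7 means "retry forever"; 0..6 give a finite count, after
     * which the QP drops into the error state -- the trade-off Dror raises. */
    static int set_rc_rnr_attrs(struct ibv_qp *qp)
    {
        struct ibv_qp_attr attr;
        int ret;

        /* INIT -> RTR: the minimum RNR NAK timer this responder reports. */
        memset(&attr, 0, sizeof(attr));
        attr.qp_state      = IBV_QPS_RTR;
        attr.min_rnr_timer = 12;     /* encoded per the IB RNR timer table */
        /* path, MTU, destination QPN, PSN, etc. omitted for brevity */
        ret = ibv_modify_qp(qp, &attr,
                            IBV_QP_STATE | IBV_QP_MIN_RNR_TIMER /* | ... */);
        if (ret)
            return ret;

        /* RTR -> RTS: how persistently this requester retries on RNR NAK. */
        memset(&attr, 0, sizeof(attr));
        attr.qp_state  = IBV_QPS_RTS;
        attr.rnr_retry = 6;          /* finite (< 7): the connection must be
                                        re-established once retries run out */
        attr.retry_cnt = 7;
        attr.timeout   = 14;
        /* SQ PSN, etc. omitted for brevity */
        return ibv_modify_qp(qp, &attr,
                             IBV_QP_STATE | IBV_QP_RNR_RETRY |
                             IBV_QP_RETRY_CNT | IBV_QP_TIMEOUT /* | ... */);
    }
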
From ipoverib-bounces@ietf.org Mon Nov 22 15:07:54 2004
From: Vivek Kashyap
To: "H.K. Jerry Chu"
Cc: ipoverib@ietf.org
Date: Mon, 22 Nov 2004 11:57:06 -0800 (PST)
Subject: Re: [Ipoverib] comments on draft-kashyap-ipoib-connected-mode-02.txt

On Fri, 19 Nov 2004, H.K. Jerry Chu wrote:

> >> A much simpler model, which I think was presented in earlier
> >> drafts, is to fold the use of IB connections fully into a
> >> regular IPoIB-UD subnet, allowing any two IPoIB nodes to
> >> optionally negotiate the use of an IB connection between themselves.
> >
> >The difference between the earlier draft and this one is that
> >I modified the requirement on the UD QP. That is, it need not be that
> >IPoIB-CM and IPoIB-UD share a QP; any UD QP will do for IPoIB-CM.
> >In effect an implementation can still share the UD QP.
> >
> >The only issue is whether the same IP subnet can contain pure
> >IPoIB-UD mixed in with IPoIB-CM nodes, or whether all nodes must be of
> >the same type:
> > - all IPoIB-UD
> >or
> > - all IPoIB-RC
> >or
> > - all IPoIB-UC
> >
> >I believe all of the same type is a good option to choose.
>
> I don't see a clear benefit for this restriction. E.g., even in all
> IPoIB-RC or IPoIB-UC, the nice per-link MTU property is no longer there,
> due to multicast being supported through UD. Also, this restriction will
> require those implementations that don't support IPoIB over UC or RC to
> form a different subnet in order to talk IPoIB, hence forcing the
> administrator to maintain at least two IP subnets, with one fully
> contained within another. I don't see why this is needed.

OK. Let me posit what seems to me to be the summary (looking for more
comments from WG members here). I'm more or less reverting to the earlier
version of the draft. In an IPoIB subnet:

- Every interface MUST support IPoIB-UD.

- An interface MAY optionally also support IPoIB-CM (one or both modes),
  i.e. the mutually exclusive restriction on RC/UC is removed.
  Note: IIRC, the same serviceID can be used for both RC and UC. If not,
  then they have to stay mutually exclusive.

- Interoperability is maintained by all nodes supporting IPoIB-UD. Any two
  interfaces that do not have a connection mode in common will fall back
  to IPoIB-UD.

- The support of any particular IB mode is indicated by the flags in the
  link-layer address.
  Note: IPoIB-UD is always supported, hence there are no flags to indicate
  UD support.

- An interface completes the IPoIB-UD address resolution and then
  optionally MAY set up RC/UC connections based on local support and the
  received flags.

- A pure IPoIB-UD implementation ignores the RC/UC flags in the link-layer
  address of received packets. It zeroes them on transmit.

- Every implementation MUST accept all unicast transmissions received over
  any of the IPoIB modes it supports. Multicast/broadcast, by their nature,
  will be transmitted and received over IPoIB-UD only.
  ***This implies that an interface MAY transmit/receive a packet over any
  of RC, UC or UD, depending on the modes supported between the peer IP
  and itself.***

- It is an implementation's decision to connect, or to retry a connect on
  failure, in the CM modes. This decision is made independently per
  transmission or per reception of a connection request.

- An implementation MAY make multiple connections to a peer. This is a
  local decision, as is the peer's decision to refuse such a connection.
  The serviceID, link setup, the link-address flags, MTU negotiation, etc.
  are covered in the draft.

- MTU -- we need to discuss this more, as below.

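Before turning to the MTU question below, here is a minimal sketch of the
sender-side fallback rule just summarized. The 20-octet link-layer address
layout and the flag bit positions are placeholders (the draft defines the
real encoding); only the decision logic is taken from the summary:

    #include <stdint.h>
    #include <stdbool.h>

    /* Assumed layout, for illustration only: a 20-octet IPoIB hardware
     * address with flags + QPN in the first 4 octets and the 16-octet GID
     * after them.  The flag bit positions below are hypothetical. */
    #define IPOIB_FLAG_RC  0x80   /* peer advertises IPoIB-CM over RC */
    #define IPOIB_FLAG_UC  0x40   /* peer advertises IPoIB-CM over UC */

    enum ipoib_mode { IPOIB_MODE_UD, IPOIB_MODE_RC, IPOIB_MODE_UC };

    struct ipoib_hw_addr {
        uint8_t flags_qpn[4];     /* flags in high bits of octet 0, QPN below */
        uint8_t gid[16];
    };

    /* Pick the transmit mode for a peer: use a connected mode only if both
     * sides support it, otherwise fall back to IPoIB-UD.  Multicast and
     * broadcast always use UD. */
    static enum ipoib_mode
    ipoib_select_mode(const struct ipoib_hw_addr *peer,
                      bool local_rc, bool local_uc, bool is_multicast)
    {
        uint8_t flags = peer->flags_qpn[0];

        if (is_multicast)
            return IPOIB_MODE_UD;
        if (local_rc && (flags & IPOIB_FLAG_RC))
            return IPOIB_MODE_RC;
        if (local_uc && (flags & IPOIB_FLAG_UC))
            return IPOIB_MODE_UC;
        return IPOIB_MODE_UD;     /* no connected mode in common */
    }

A pure IPoIB-UD node never sets the connected-mode flags on transmit and
ignores them on receive, which is what keeps the two populations
interoperable under the rules above.
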
> >> This much simplified model is not without its drawbacks. Some
> >> nice IP link attributes are no longer unique within a link.
> >> E.g., the link MTU now becomes a per-node-pair MTU. Moreover,
> >> the MTU size for multicast will be different from the MTU size
> >> for unicast if IB connections are used. IB UC/RC may exhibit
> >> different RAS, flow control, QoS or other link characteristics
> >> than UD. But I consider these problems a reasonable price to
> >> pay for seamless support of UC/RC mode in an IPoIB link
> >> defined by UD.
> >>
> >> 2. The negotiation of the per-connection MTU seems more
> >> complicated than necessary. I think all that is needed is for a
> >> node to advertise its own "receive MTU", that is, the MTU
> >> size its peer should never exceed when sending packets
> >> to the local interface. Yes, this may break the traditional
> >> concept of "symmetric" MTUs. But we're already breaking the
> >> notion of per-link MTU, requiring a lot of changes in the host
> >> stack anyway. This additional breakage doesn't seem like much.
> >>
> >> I haven't verified whether this asymmetric MTU matches well with
> >> IBA connections, though.
> >
> >How about:
> >
> >The MTU, I would think, is exchanged at the IB level during the
> >IPoIB-CM connection setup. The IP layer at both ends keeps a
> >per-connection MTU if the implementation permits it. At the link layer
> >the connection will not send messages larger than that requested by the
> >peer.
>
> I don't quite understand the above. I'm suggesting we simplify the MTU
> negotiation at IPoIB-CM connection setup time by each side advertising
> the "receive MTU" it can take. The peer must not send more than that size
> in each post_send(). E.g., if node A advertises 32KB as its receive MTU
> and node B advertises 64KB as its receive MTU, node B must not send any
> IP packet through IPoIB-CM to node A that is larger than 32KB. Node A is
> free to send IP packets of up to 64KB in size to node B. (But if node A
> decides to restrict its outbound MTU to 32KB, that's fine too. Node B
> doesn't need to know about it.)
>
> I'm not sure what you mean by the last two sentences above. The MTU value
> must be made known to the IP layer so that the latter won't send anything
> larger than that. Otherwise the packet will get dropped by the IB layer
> (unless the latter performs SAR, which is a bad idea).

A model for an implementation can be that the IP MTU is set large, but the
actual packet or segment size is determined by an internal mechanism that
determines the MTU for the relevant connection/UD path. An implementation
is also allowed to just drop down to the IPoIB-UD MTU and use that always.
The interface MTUs at the peers need not be the same at the IP or IB
layers.

I agree with the concept of just exchanging the max receive MTU at the IB
connection setup.

Vivek

> Jerry
>
> >> 3. Regarding allowing multiple IB connections between a node
> >> pair: since a given IP address has only one link-address for it,
> >> implying one QPN and hence one service-ID, if a single
> >> service-ID can be used to create multiple IB connections,
> >> then this can happen transparently. Otherwise we've got a
> >> problem.
> >>
> >> Jerry

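A sketch of the asymmetric receive-MTU idea, using the 32KB/64KB example
above. That the value travels in the IPoIB-CM connection setup (e.g. in the
REQ/REP private data) is an assumption here; the thread does not fix the
encoding:

    #include <stdint.h>

    #define IPOIB_UD_MTU 2044u   /* illustrative UD payload size; the real
                                    value comes from the broadcast group */

    /* Each side advertises only what it is willing to receive; the value is
     * assumed to be exchanged during IPoIB-CM connection setup. */
    struct ipoib_cm_conn {
        uint32_t local_recv_mtu;   /* the cap we told the peer */
        uint32_t peer_recv_mtu;    /* the cap the peer told us */
    };

    /* Largest IP packet we may hand to post_send() on this connection. */
    static uint32_t ipoib_cm_tx_mtu(const struct ipoib_cm_conn *c)
    {
        return c->peer_recv_mtu;   /* never exceed the peer's receive MTU */
    }

    /* Send-path check: nonzero if the packet fits this connection.
     * Example from the thread: A advertises 32KB, B advertises 64KB;
     * A->B traffic may be up to 64KB, B->A traffic must stay within 32KB.
     * A node that wants to keep things simple may clamp everything to
     * IPOIB_UD_MTU instead. */
    static int ipoib_cm_fits(const struct ipoib_cm_conn *c, uint32_t pkt_len)
    {
        return pkt_len <= c->peer_recv_mtu;
    }
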
From ipoverib-bounces@ietf.org Mon Nov 22 17:03:22 2004
From: Dror Goldenberg
To: Michael Krause, IPoverIB
Date: Mon, 22 Nov 2004 23:36:25 +0200
Subject: RE: [Ipoverib] A Couple of IPoIB Questions

Hi Mike,

Please see below.

Thanks
Dror

-----Original Message-----
From: Michael Krause [mailto:krause@cup.hp.com]
Sent: Monday, November 22, 2004 8:37 PM
To: IPoverIB
Subject: RE: [Ipoverib] A Couple of IPoIB Questions

At 08:49 AM 11/20/2004, Dror Goldenberg wrote:

  -----Original Message-----
  From: Michael Krause [mailto:krause@cup.hp.com]
  Sent: Friday, November 19, 2004 3:45 AM
  To: Vivek Kashyap
  Cc: IPoverIB
  Subject: Re: [Ipoverib] A Couple of IPoIB Questions

  At 05:14 PM 11/18/2004, Vivek Kashyap wrote:

    RC and UC both have benefits. There is almost no difference other than
    the connection flag between the two.

  Many host OS implementations do not support UC, as RC and UD are all
  that is really required within the industry. The ACK overhead associated
  with RC is truly noise, and the end-to-end credits are very nice now
  that IB supports three signaling rates combined with four link widths
  (though only three are really being implemented). Such a permutation in
  bandwidth capability makes RC the more tenable / good-citizen choice, as
  we designed it to be, so I'd prefer RC.

  [DG] Mike,
  A few reasons why I think the end-to-end credits / RNR in an RC
  connection are a problem. It may be worth discussing:
  1) A lack of receive WQEs in the responder implies a slow responder.
     Getting the message dropped in this case is desirable for protocols
     that have injection control, such as TCP, which is then supposed to
     back off and restart more slowly. While UC/UD result in similar
     behavior (messages being dropped at the receiver when it is slow),
     RC does not. Instead, there is persistence in getting the message
     transmitted, and the receiver cannot tell the requester that it is
     being slow.

TCP on the sending side will regulate due to the lack of window-update
credits. Hence, there is no need to restart the large messages that are
put forth as the reason for using *C instead of UD.

[dg] I think it'll be common to find very large TCP windows being
advertised. Therefore, when you work against a very slow receiver, I think
it makes sense to activate the TCP congestion mechanism rather than to
rely on the TCP window, which is not intended to take care of congestion.
Typically, the overall advertised TCP window (from all connections
together) is much larger than what is actually posted on the IPoIB QP
receive queue. With a slow receiver, the replenishment pace of receive
WQEs is slow, and you'd want remote senders to slow down when trying to
fill its TCP windows.

  2) How would you configure the RNR retry counters? Would they be
     configured to infinity? That doesn't sound good. Would they be
     configured to a finite value (it would have to be < 7)? In that case,
     with a slow receiver you would end up recreating connections that had
     end-to-end credit problems, which is real overhead on the protocol.

RNR would be no different for IP over IB than for any other IB RC instance.

[dg] Example ULPs such as SDP and SRP use SW-level flow control and do not
rely on RNR NAKs. What I am trying to say is that if you configure your QP
for finite retries and a reasonable timeout, then when the receiver is
slow you'd often get the QP into the error state after RNR retries are
exhausted. The overhead of re-establishing a new connection each time the
QP gets into the error state is high. If you use UC, then this is not a
problem, because none of this happens.

  3) What happens with implementations that don't support RNR NAK
     generation? That poses more difficulties for (2).

An HCA is required to support RNR NAK. A TCA has the option. If you don't
support RC, then use UD. Where is the real problem, given that nothing
shown here on either side is more than speculation?

[dg] Agree, there isn't a real problem here.

Mike

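The SDP/SRP-style software flow control Dror mentions replaces reliance on
RNR NAKs with explicit buffer credits. A minimal sketch of the idea; the
structure and field names are illustrative and not taken from either
protocol:

    #include <stdint.h>
    #include <stdbool.h>

    /* Illustrative software credit scheme: the receiver piggybacks, on each
     * message it sends, the number of receive buffers it has posted since
     * the last update; the sender consumes one credit per send and stops
     * when it runs out, instead of provoking RNR NAKs at the responder. */
    struct sw_flow_ctrl {
        uint32_t tx_credits;   /* sends we may still post to the peer */
        uint32_t rx_posted;    /* receive buffers posted, not yet advertised */
    };

    /* Receiver side: after posting new receive WQEs, record them so the
     * next outbound message can advertise them to the peer. */
    static void fc_rx_buffers_posted(struct sw_flow_ctrl *fc, uint32_t n)
    {
        fc->rx_posted += n;
    }

    /* Sender side: called before posting a send; returns false if we must
     * wait for a credit update from the peer. */
    static bool fc_try_consume_credit(struct sw_flow_ctrl *fc)
    {
        if (fc->tx_credits == 0)
            return false;      /* back off in software, no RNR NAK involved */
        fc->tx_credits--;
        return true;
    }

    /* Sender side: apply a credit update carried in a received message. */
    static void fc_credits_received(struct sw_flow_ctrl *fc, uint32_t n)
    {
        fc->tx_credits += n;
    }
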
From ipoverib-bounces@ietf.org Mon Nov 22 17:56:01 2004
From: Michael Krause
To: IPoverIB
Date: Mon, 22 Nov 2004 14:26:36 -0800
Subject: RE: [Ipoverib] A Couple of IPoIB Questions

At 01:36 PM 11/22/2004, Dror Goldenberg wrote:
>Hi Mike,
>
>Please see below.
>
>Thanks
>Dror
>-----Original Message-----
>From: Michael Krause [mailto:krause@cup.hp.com]
>Sent: Monday, November 22, 2004 8:37 PM
>To: IPoverIB
>Subject: RE: [Ipoverib] A Couple of IPoIB Questions
>
>At 08:49 AM 11/20/2004, Dror Goldenberg wrote:
>>-----Original Message-----
>>From: Michael Krause [mailto:krause@cup.hp.com]
>>Sent: Friday, November 19, 2004 3:45 AM
>>To: Vivek Kashyap
>>Cc: IPoverIB
>>Subject: Re: [Ipoverib] A Couple of IPoIB Questions
>>
>>At 05:14 PM 11/18/2004, Vivek Kashyap wrote:
>>
>>>RC and UC both have benefits. There is almost no difference other than
>>>the connection flag between the two.
>>
>>Many host OS implementations do not support UC, as RC and UD are all
>>that is really required within the industry. The ACK overhead associated
>>with RC is truly noise, and the end-to-end credits are very nice now
>>that IB supports three signaling rates combined with four link widths
>>(though only three are really being implemented). Such a permutation in
>>bandwidth capability makes RC the more tenable / good-citizen choice, as
>>we designed it to be, so I'd prefer RC.
>>
>>[DG] Mike,
>>A few reasons why I think the end-to-end credits / RNR in an RC
>>connection are a problem. It may be worth discussing:
>>1) A lack of receive WQEs in the responder implies a slow responder.
>>   Getting the message dropped in this case is desirable for protocols
>>   that have injection control, such as TCP, which is then supposed to
>>   back off and restart more slowly. While UC/UD result in similar
>>   behavior (messages being dropped at the receiver when it is slow),
>>   RC does not. Instead, there is persistence in getting the message
>>   transmitted, and the receiver cannot tell the requester that it is
>>   being slow.
>
>TCP on the sending side will regulate due to the lack of window-update
>credits. Hence, there is no need to restart the large messages that are
>put forth as the reason for using *C instead of UD.
>
>[dg] I think it'll be common to find very large TCP windows being
>advertised.

A TCP window that is advertised is required to have the associated
buffering available. While some implementations assume statistical
provisioning in the kernel, they assume that the application buffers are
available and that the only problem is being able to move kernel buffers
quickly enough to application buffers, which is a transient issue.

>[dg] Therefore, when you work against a very slow receiver, I think it
>makes sense to activate the TCP congestion mechanism rather than to rely
>on the TCP window, which is not intended to take care of congestion.
>Typically, the overall advertised TCP window (from all connections
>together) is much larger than what is actually posted on the IPoIB QP
>receive queue. With a slow receiver, the replenishment pace of receive
>WQEs is slow, and you'd want remote senders to slow down when trying to
>fill its TCP windows.

Dropping a buffer is fine, but that should be a TCP/IP-level decision and
not a driver decision. A driver should have sufficient buffers to avoid
wasting network bandwidth; hence, it should be posting enough buffers to
keep up with the workload, which may span multiple connections /
datagrams. Use of UC or RC does not change anything in this regard. A drop
using UC would simply waste IB network bandwidth, consume HCA resources
flushing the work (the transmitter would continue to transmit, so nothing
is saved there), etc., and only impact one connection at a time. It does
nothing for the rest of the connections. So while one might get a bit of
benefit akin to a RED scheme, if the endnode pairs are operating at a high
workload, all one gets with UC is the ability of one endnode to flood
another with no push-back except on random connections. This would lead to
bursty behavior and unpredictable application responsiveness. RC leads to
smoother performance between the endnode pair, and with the use of
multiple RC QPs one can differentiate traffic for QoS purposes, which is
something that will benefit applications.

>>2) How would you configure the RNR retry counters? Would they be
>>   configured to infinity? That doesn't sound good. Would they be
>>   configured to a finite value (it would have to be < 7)? In that case,
>>   with a slow receiver you would end up recreating connections that had
>>   end-to-end credit problems, which is real overhead on the protocol.
>
>RNR would be no different for IP over IB than for any other IB RC instance.
>
>[dg] Example ULPs such as SDP and SRP use SW-level flow control and do
>not rely on RNR NAKs.

These are also not IP-based ULPs.

>[dg] What I am trying to say is that if you configure your QP for finite
>retries and a reasonable timeout, then when the receiver is slow you'd
>often get the QP into the error state after RNR retries are exhausted.
>The overhead of re-establishing a new connection each time the QP gets
>into the error state is high. If you use UC, then this is not a problem,
>because none of this happens.

Given that RC uses send credits, and therefore should not see a new
message unless there is an associated buffer available which increments
the credit count, one should not get an RNR NAK at all. The reason for RNR
NAK was to deal with a resource other than a receive buffer missing, e.g.
a QP context or V-to-P translation not being chip-resident, where some
time is required to refresh without going into the error state. Given that
RC is still send/receive based, there should not be any reason for an RNR
NAK, and no SEND will occur unless a credit is provided.

>>3) What happens with implementations that don't support RNR NAK
>>   generation? That poses more difficulties for (2).
>
>An HCA is required to support RNR NAK. A TCA has the option. If you don't
>support RC, then use UD. Where is the real problem, given that nothing
>shown here on either side is more than speculation?
>[dg] Agree, there isn't a real problem here.

Mike

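Mike's point that "the driver should be posting sufficient buffers" is, in
code, a receive-ring repost path. A minimal sketch with the OpenFabrics
verbs API; ring management and error handling are left out, and the
function name is illustrative:

    #include <stdint.h>
    #include <infiniband/verbs.h>

    /* Post (or repost) one receive buffer.  In a driver this is called once
     * per slot at startup and again from the completion handler for each
     * consumed WQE, so the responder never runs dry of receive WQEs (and,
     * on RC, keeps granting end-to-end credits to the requester). */
    static int ipoib_post_rx(struct ibv_qp *qp, struct ibv_mr *mr,
                             char *buf, uint32_t buf_len, uint64_t slot)
    {
        struct ibv_sge sge = {
            .addr   = (uintptr_t)buf,
            .length = buf_len,
            .lkey   = mr->lkey,
        };
        struct ibv_recv_wr wr = {
            .wr_id   = slot,   /* lets the CQ handler find the buffer to repost */
            .sg_list = &sge,
            .num_sge = 1,
        };
        struct ibv_recv_wr *bad_wr;

        return ibv_post_recv(qp, &wr, &bad_wr);
    }
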
From ipoverib-bounces@ietf.org Mon Nov 22 21:50:52 2004
From: "H.K. Jerry Chu"
To: krause@cup.hp.com
Cc: ipoverib@ietf.org
Date: Mon, 22 Nov 2004 18:43:04 -0800 (PST)
Subject: Re: [Ipoverib] comments on draft-kashyap-ipoib-connected-mode-02.txt

>>I'm not sure what you mean by the last two sentences above. The MTU
>>value must be made known to the IP layer so that the latter won't send
>>anything larger than that. Otherwise the packet will get dropped by the
>>IB layer (unless the latter performs SAR, which is a bad idea).
>
>One might argue that *C is focused on an equivalence to TSO (large send),
>thus the logical MTU is not required.

Interesting idea. But I thought that for the *C modes it's an error if one
end posts a send that is larger than the WQE posted to the receive queue.
If this is true, then unless the receiver always posts 2**32-byte buffers,
you'll still need the MTU concept to put a reasonable cap on the buffer
size the receive side must prepare.

>One might argue that the logical MTU
>represents an asymmetric maximum receive buffer that will be posted, thus
>messages must be sent that do not exceed this maximum. One might argue
>that having a single buffer size independent of the *C / UD being used
>maximizes KISS.

This limits the MTU down to the UD MTU, which defeats one of the main
benefits of using the *C modes.

>I'm open to exploring these options but do not believe all
>must be supported.

So it looks to me like only the middle one makes sense.

Jerry

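To put numbers on the "reasonable cap" point (an illustrative calculation,
not figures from the thread): a node that advertises a 64 KB receive MTU
and keeps 256 receive WQEs posted on a connection pins 256 x 64 KB = 16 MB
of receive buffers for that single connection, versus 256 x 2 KB = 0.5 MB
for the same ring depth at a typical IPoIB-UD MTU. The advertised receive
MTU is what lets each receiver decide where on that curve it wants to sit.
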
From ipoverib-bounces@ietf.org Mon Nov 22 22:25:27 2004
From: "H.K. Jerry Chu"
To: krause@cup.hp.com, kashyapv@us.ibm.com
Cc: ipoverib@ietf.org
Date: Mon, 22 Nov 2004 19:13:52 -0800 (PST)
Subject: Re: [Ipoverib] A Couple of IPoIB Questions

>On Thu, 18 Nov 2004, Michael Krause wrote:
>
>> At 04:12 PM 11/18/2004, Vivek Kashyap wrote:
>> >On Thu, 18 Nov 2004, Michael Krause wrote:
>> >
>> > > At 11:33 AM 11/18/2004, Vivek Kashyap wrote:
>> > > >On Thu, 18 Nov 2004, Michael Krause wrote:
>> > > >
>> > > > > At 10:46 PM 11/17/2004, Vivek Kashyap wrote:
>> > > > > >Mike, the format is really off in the last mail from you -
>> > > > > >making it difficult to follow.
>> > > > > >
>> > > > > >Other than that, let us discuss in the context of the draft.
>> > > > > >The draft is built upon the following:
>> > > > > >
>> > > > > >1. IPoIB-RC and IPoIB-UC are optional.
>> > > > >
>> > > > > I would prefer only one be used - either RC or UC. I've provided
>> > > > > some logic for either one as a preference but don't see a reason
>> > > > > to have both. Both just leads to options, which leads to
>> > > > > interoperability problems.
>> > > >
>> > > >ok.
>> > > >See section 3.1 of the draft draft-kashyap-ipoib-connected-mode-02.txt.
>> > > >It states that the RC and UC are mutually exclusive flags.
>> > >
>> > > My preference is to only support one of the two in a spec, not to
>> > > have flags to indicate what is implemented.
>> > > The benefits of connected mode operation should be done with only
>> > > one form of communication, not two.
>> >
>> >A given subnet will support only one of the two, not both
>> >simultaneously. The flag only indicates which type it is. RC and UC are
>> >both useful to different people and implementations, so both are
>> >allowed. I suggest that both not be allowed in the same IPoIB subnet
>> >though.

Is there any reason for this somewhat arbitrary restriction (of supporting
only one type per subnet)? It seems to introduce complexity, not reduce
it. How would the preferred type be determined?

>RC and UC both have benefits. There is almost no difference other than
>the connection flag between the two.
>
>> To be explicit, I think there is benefit in implementing one and only
>> one of the two. Having two options serves no purpose and adds
>> unnecessary complexity. Interoperability will end up requiring both to
>> be done if customers are to not get upset.

Not sure why this would affect interoperability. All implementations must
always support UD as the basic fallback plan.

>> Let's just pick one of the two and apply KISS. To get this started,
>> I'll propose RC as that is a bit nicer to the fabric than UC and is
>> already implemented in most OS and CA drivers today, so it makes it
>> faster to adopt with minimal driver software update.

Forcing a choice here can be a problem unless there is a clear winner or
consensus. That doesn't seem to be the case at this point.

>> > > If a designer is stupid, they may do this. However, one would expect
>> > > some intelligence here, and one may prefer to have specific data
>> > > flows or DiffServ code points or whatever used to determine which
>> > > connection or which UD QP, and that one would again apply an
>> > > intelligent and predictable algorithm such that mix-n-match for a
>> > > given TCP connection does not occur. Given multiple *C QPs can be
>> > > supported, it is not tenable to state that all unicast must go over
>> > > a given QP or that no unicast can occur on a UD QP.
>> >
>> >You missed my point, which was that the specification cannot be silent
>> >on this and say it is a local issue. That can lead to interoperability
>> >failure. The specification must support or disallow unicast
>> >communication over the UD QP in an IPoIB-CM.
>> >
>> >You prefer that such communication be supported. That works. Any other
>> >thoughts?
If people don't think the above is obvious hence some explicit wording is needed that's fine with me too. Jerry > >> >> Mike > > >_______________________________________________ >IPoverIB mailing list >IPoverIB@ietf.org >https://www1.ietf.org/mailman/listinfo/ipoverib _______________________________________________ IPoverIB mailing list IPoverIB@ietf.org https://www1.ietf.org/mailman/listinfo/ipoverib From ipoverib-bounces@ietf.org Tue Nov 23 02:47:54 2004 Received: from megatron.ietf.org (megatron.ietf.org [132.151.6.71]) by ietf.org (8.9.1a/8.9.1a) with ESMTP id CAA10284 for ; Tue, 23 Nov 2004 02:47:51 -0500 (EST) Received: from localhost.localdomain ([127.0.0.1] helo=megatron.ietf.org) by megatron.ietf.org with esmtp (Exim 4.32) id 1CWVND-0005ER-2e; Tue, 23 Nov 2004 02:46:03 -0500 Received: from odin.ietf.org ([132.151.1.176] helo=ietf.org) by megatron.ietf.org with esmtp (Exim 4.32) id 1CWVIO-0004H0-OF for ipoverib@megatron.ietf.org; Tue, 23 Nov 2004 02:41:05 -0500 Received: from ietf-mx.ietf.org (ietf-mx.ietf.org [132.151.6.1]) by ietf.org (8.9.1a/8.9.1a) with ESMTP id CAA09560 for ; Tue, 23 Nov 2004 02:40:54 -0500 (EST) Received: from e35.co.us.ibm.com ([32.97.110.133]) by ietf-mx.ietf.org with esmtp (Exim 4.33) id 1CWVLs-0004eY-28 for ipoverib@ietf.org; Tue, 23 Nov 2004 02:44:40 -0500 Received: from westrelay03.boulder.ibm.com (westrelay03.boulder.ibm.com [9.17.195.12]) by e35.co.us.ibm.com (8.12.10/8.12.9) with ESMTP id iAN7eJQf291278 for ; Tue, 23 Nov 2004 02:40:19 -0500 Received: from d03av01.boulder.ibm.com (d03av01.boulder.ibm.com [9.17.195.167]) by westrelay03.boulder.ibm.com (8.12.10/NCO/VER6.6) with ESMTP id iAN7eJcu218034 for ; Tue, 23 Nov 2004 00:40:19 -0700 Received: from d03av01.boulder.ibm.com (loopback [127.0.0.1]) by d03av01.boulder.ibm.com (8.12.11/8.12.11) with ESMTP id iAN7eJgG007806 for ; Tue, 23 Nov 2004 00:40:19 -0700 Received: from w-vkashyap95.des.sequent.com (sig-9-65-35-142.mts.ibm.com [9.65.35.142]) by d03av01.boulder.ibm.com (8.12.11/8.12.11) with ESMTP id iAN7e90Z007581; Tue, 23 Nov 2004 00:40:18 -0700 Date: Mon, 22 Nov 2004 23:39:27 -0800 (Pacific Standard Time) From: Vivek Kashyap To: "H.K. Jerry Chu" Subject: Re: [Ipoverib] A Couple of IPoIB Questions In-Reply-To: <200411230315.iAN3FTqu108712@jurassic.eng.sun.com> Message-ID: X-X-Sender: kashyapv@imap.linux.ibm.com MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII X-Spam-Score: 0.0 (/) X-Scan-Signature: 6ba8aaf827dcb437101951262f69b3de Cc: krause@cup.hp.com, ipoverib@ietf.org X-BeenThere: ipoverib@ietf.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: IP over InfiniBand WG Discussion List List-Unsubscribe: , List-Post: List-Help: List-Subscribe: , Sender: ipoverib-bounces@ietf.org Errors-To: ipoverib-bounces@ietf.org On Mon, 22 Nov 2004, H.K. Jerry Chu wrote: > >On Thu, 18 Nov 2004, Michael Krause wrote: > > > >> At 04:12 PM 11/18/2004, Vivek Kashyap wrote: > >> >On Thu, 18 Nov 2004, Michael Krause wrote: > >> > > >> > > At 11:33 AM 11/18/2004, Vivek Kashyap wrote: > >> > > >On Thu, 18 Nov 2004, Michael Krause wrote: > >> > > > > >> > > > > At 10:46 PM 11/17/2004, Vivek Kashyap wrote: > >> > > > > >Mike the format is really off in the last mail from you - making it > >> > > > difficult > >> > > > > >to follow. > >> > > > > > > >> > > > > > > >> > > > > >Other than that let us discuss in the context of the draft. The > >> > draft is > >> > > > > >built upon the following: > >> > > > > > > >> > > > > >1. IPoIB-RC and IPoIB-UC are optional. 
> >> > > > >
> >> > > > > I would prefer only one be used - either RC or UC. I've
> >> > > > > provided some logic for either one as a preference but don't
> >> > > > > see a reason to have both. Both just leads to options, which
> >> > > > > leads to interoperability problems.
> >> > > >
> >> > > >ok.
> >> > > >See section 3.1 of the draft draft-kashyap-ipoib-connected-mode-02.txt.
> >> > > >It states that the RC and UC are mutually exclusive flags.
> >> > >
> >> > > My preference is to only support one of the two in a spec, not to
> >> > > have flags to indicate what is implemented. The benefits of
> >> > > connected mode operation should be done with only one form of
> >> > > communication, not two.
> >> >
> >> >A given subnet will support only one of the two, not both
> >> >simultaneously. The flag only indicates which type it is. RC and UC
> >> >are both useful to different people and implementations, so both are
> >> >allowed. I suggest that both not be allowed in the same IPoIB subnet
> >> >though.
>
> Is there any reason for this somewhat arbitrary restriction (of
> supporting only one type per subnet)? It seems to introduce complexity,
> not reduce it. How would the preferred type be determined?

The idea had been to simplify -- the link characteristics within a subnet
are then the same. However, it appears that the original suggestion of an
intermixed set of IPoIB modes is preferable. I'm fine with that. I've
summarised the current view in my reply to your comment in the 'comments
on draft-kashyap-ipoib...' thread.

> >RC and UC both have benefits. There is almost no difference other than
> >the connection flag between the two.
> >
> >> To be explicit, I think there is benefit in implementing one and only
> >> one of the two. Having two options serves no purpose and adds
> >> unnecessary complexity. Interoperability will end up requiring both
> >> to be done if customers are to not get upset.
>
> Not sure why this would affect interoperability. All implementations
> must always support UD as the basic fallback plan.
>
> >> Let's just pick one of the two and apply KISS. To get this started,
> >> I'll propose RC as that is a bit nicer to the fabric than UC and is
> >> already implemented in most OS and CA drivers today, so it makes it
> >> faster to adopt with minimal driver software update.
>
> Forcing a choice here can be a problem unless there is a clear winner or
> consensus. That doesn't seem to be the case at this point.
>
> >> > > If a designer is stupid, they may do this. However, one would
> >> > > expect some intelligence here, and one may prefer to have specific
> >> > > data flows or DiffServ code points or whatever used to determine
> >> > > which connection or which UD QP, and that one would again apply an
> >> > > intelligent and predictable algorithm such that mix-n-match for a
> >> > > given TCP connection does not occur. Given multiple *C QPs can be
> >> > > supported, it is not tenable to state that all unicast must go
> >> > > over a given QP or that no unicast can occur on a UD QP.
> >> >
> >> >You missed my point, which was that the specification cannot be
> >> >silent on this and say it is a local issue. That can lead to
> >> >interoperability failure. The specification must support or disallow
> >> >unicast communication over the UD QP in an IPoIB-CM.
> >> >
> >> >You prefer that such communication be supported. That works. Any
> >> >other thoughts?
> >> I prefer that guidance be provided and that it remain a local
> >> implementation issue as to what QP is used for a given flow. I do not
> >> see interoperability issues, only potential performance issues if
> >> people are stupid. The industry has a way to deal with stupidity, and
> >> too much time is spent on preventing people from being stupid. Even a
> >> so-so intelligent implementation could have a simple flag for a given
> >> target IP address that states which QP to target for all or a subset
> >> of the flows, with minimal cost to implement and troubleshoot /
> >> validate.
> >
> >If something is left unspecified there is every chance that incompatible
> >implementations result - this has nothing to do with the mental
> >faculties of the implementors. Therefore I'll add relevant text.
>
> There has always been an implicit assumption in my mind that if an
> endpoint allows a link-layer communication channel to be established, it
> will commit itself to tune in. Otherwise what's the point? How the send
> side utilizes multiple communication channels can be completely left to
> the implementation.
>
> If people don't think the above is obvious, and hence some explicit
> wording is needed, that's fine with me too.

I was addressing the interoperability requirement. If a sender is allowed
to send a datagram over any of the modes, then the receiver needs to be
instructed to receive datagrams over all of the modes it supports.
Otherwise we are likely to have communication failure between nodes. It is
preferable to specify this.

With respect to tuning or optimisations, including implementation caveats
in the draft is helpful though not required.

Vivek

__
Vivek Kashyap
Linux Technology Center, IBM

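Vivek's interoperability rule amounts to: poll every QP the interface
supports and hand whatever arrives to the same IP input path. A minimal
sketch, assuming the OpenFabrics verbs API; ip_input_from_wc is a
hypothetical helper standing in for the stack's receive entry point:

    #include <infiniband/verbs.h>

    /* Hypothetical: strips the IPoIB encapsulation from a completed receive
     * and passes the IP datagram up the stack. */
    void ip_input_from_wc(struct ibv_wc *wc);

    /* Illustrative receive loop: poll the UD QP's CQ and every connected-
     * mode QP's CQ, and feed all successfully received datagrams to the
     * same IP input path, so the sender's choice of UD, RC or UC never
     * matters for interoperability. */
    static void ipoib_poll_all(struct ibv_cq **cqs, int ncqs)
    {
        struct ibv_wc wc;

        for (int i = 0; i < ncqs; i++) {       /* cqs[0] = UD, others = RC/UC */
            while (ibv_poll_cq(cqs[i], 1, &wc) > 0) {
                if (wc.status == IBV_WC_SUCCESS && wc.opcode == IBV_WC_RECV)
                    ip_input_from_wc(&wc);
            }
        }
    }
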
List List-Unsubscribe: , List-Post: List-Help: List-Subscribe: , Content-Type: multipart/mixed; boundary="===============0356602645==" Sender: ipoverib-bounces@ietf.org Errors-To: ipoverib-bounces@ietf.org --===============0356602645== Content-Type: multipart/alternative; boundary="----_=_NextPart_001_01C4D131.9F9E2240" ------_=_NextPart_001_01C4D131.9F9E2240 Content-Type: text/plain

Hi Mike,
My comments below.
-Dror

-----Original Message-----
From: Michael Krause [mailto:krause@cup.hp.com]
Sent: Tuesday, November 23, 2004 12:27 AM
To: IPoverIB
Subject: RE: [Ipoverib] A Couple of IPoIB Questions

At 01:36 PM 11/22/2004, Dror Goldenberg wrote:
Hi Mike,
Please see below.
Thanks
Dror

-----Original Message-----
From: Michael Krause [mailto:krause@cup.hp.com]
Sent: Monday, November 22, 2004 8:37 PM
To: IPoverIB
Subject: RE: [Ipoverib] A Couple of IPoIB Questions

At 08:49 AM 11/20/2004, Dror Goldenberg wrote:
-----Original Message-----
From: Michael Krause [mailto:krause@cup.hp.com]
Sent: Friday, November 19, 2004 3:45 AM
To: Vivek Kashyap
Cc: IPoverIB
Subject: Re: [Ipoverib] A Couple of IPoIB Questions

At 05:14 PM 11/18/2004, Vivek Kashyap wrote:
RC and UC both have benefits. There is almost no difference other than the connection flag between the two.

Many host OS implementations do not support UC, as RC and UD are all that is really required within the industry. The ACK overhead associated with RC is truly noise, and the end-to-end credits are very nice as IB now supports three signaling rates combined with 4 link widths (though only three are really being implemented). Such a permutation in bandwidth capability makes RC a more tenable / good citizen as we designed it to be, so I'd prefer RC.

[DG] Mike, a few reasons I think that the end to end credits / RNR in an RC connection is a problem. It may be worth discussing it:
1) Lack of receive WQEs in the responder implies a slow responder. Getting the message dropped in this case is desirable for protocols that have injection control such as TCP. In this case it is supposed to back off and restart more slowly. While UC/UD result in a similar behavior of messages being dropped at the receiver when it's slow, RC does not. Instead, there is persistence in getting the message transmitted and the receiver won't be able to tell the requester that it's being slow.

TCP on the sending side will regulate due to lack of update window credits. Hence, there is no need to restart the large messages that are put forth as the reason for using *C instead of UD.

[dg] I think it'll be common to find very large TCP windows being advertised.

A TCP window that is advertised is required to have the associated buffering available. While some implementations assume statistical provisioning in the kernel, they assume that the application buffers are available and the only problem is being able to move kernel buffers quickly enough to application buffers, which is a transient issue.

[dg] Right, this is exactly the kind of implementations I refer to. These tend to oversubscribe buffers both at the NIC level (i.e. there are far fewer buffers posted to the RX of the NIC than the sum of TCP windows), and at the TCP level. When such a machine is busy, packets start dropping.
Therefore, when you work against a very slow receiver, I think that it makes sense to activate the TCP congestion mechanism rather than to rely on the TCP window which is not intended to take care of congestion. Typically, the overall advertised TCP windows (from all connections together) is much more than actually being posted on the IPoIB QP receive queue. In a slow receiver, the replenishment pace on receive WQEs is slow, and you'd want remote senders to slow down when trying to fill its TCP Windows. Dropping a buffer is fine but that should be at the TCP/IP level and not a driver decision. A driver should have sufficient buffers to avoid having wasted the network bandwidth. Hence, the driver should be posting sufficient buffers to keep up with the workload which may span multiple connections / datagrams. Use of UC or RC does not change anything in this regard. A drop using UC would simply waste IB network bandwidth, consume HCA resources flushing the work (the transmitter would continue to transmit so nothing is saved there), etc. and only impact one connection at a time. It does nothing for the rest of the connections. So while one might get a bit of benefit akin to a RED scheme, if the endnode pairs are operating at a high workload, all one gets with UC is the ability of one endnode to flood another with no push back except on random connections. This would lead to bursty behavior and unpredictable application responsiveness. RC leads to smoother performance between the endnode pair and with the use of multiple RC QP, one can differentiate traffic for QoS purposes which is something that will benefit applications. [dg] If you work with RC, then in the slow receiver case, backpressure will propagate into the sender (RQ is full, no end to end credits are reflected, peer SQ becomes full and you're out of SQ WQEs). In this case, what will you do in the requester side ? - Tell the upper TCP/IP layers that the NIC TX ring is full - this will cause OS not to post buffers to ANY of current RC connections. I don't think it's desirable, it'll slow down / block your connections with the other remote peers - Pretend as if there is still room in the SQ - but when OS posts to the full SQ, you'll drop the packet -> this will be just the same as the UC case, except that you do it in the sender instead of the receiver - Pretend as if there is still room in the SQ - but when OS posts to the full SQ, you'll queue it in SW. I think it'll risk shared resources. What I am trying to say, is that we need to understand what happens in the case of the slow receiver. I think that in RC what you'll end up having is the peer requester dropping the packets. In UC, you'll get the responder dropping the packets. As of how much you flood the IB fabric, see my comment on the second question. 2) How would you configure the RNR retry counters. Would they be configured to infinity ? Doesn't sound good. Would they be configured to a finite value (should be <7), in which case, in the case of a slow receiver you'd end up recreating connections that had end to end credits problem, which is a real overhead on the protocol. RNR would be no different for IP over IB than for any other IB RC instance. [dg] Example ULPs such as SDP and SRP use SW level flow control and do not rely on RNR NAKs. These are also not IP based ULP. What I am trying to say is if you configure your QP for finite retries and a reasonable timeout, then when the receiver is slow, you'd often get the QP into the error state, after RNR retries are exhausted. 
The overhead of reestablishing a new connection each time the QP gets into the error state is high. If you use UC, then this is not a problem, because none of this happens. Given RC uses send credits and therefore should not see a new message unless there is an associated buffer available which increments the credit count, one should not get a RNR NAK ever. The reason for RNR NAK was to deal with a resource other than a receive buffer missing, e.g. QP context or V-to-P translation or whatever not being chip resident and some time would be required to refresh without going into the error state. Given RC is still send-receive based, there should not be any reason for a RNR NAK and no SEND will occur unless a credit is provided. [dg] yes and no. If you work with regular RC, then when RQ is empty, then the peer SQ will send probing packets (e.g. send first/send only) to see if credits became available. In this case you will see RNR Nak, but what you inject to the fabric before getting it is a single packet. So I agree that you don't flood the IB fabric in this case. The issue with e2e credits reflection is when one wants to use SRQ, instead of posting to each RQ separately and consuming many resources. In this case, e2e credits are no longer reflected by ACK packets. And you're going to send messages to the remote side without any flow control, and get RNR Naks when peer RQ is empty. If most implementations use SRQ, then fabric is going to be flooded anyways because of slow receivers. I know that SRQ today is not allowed on UC, but that's a different story... Mike ------_=_NextPart_001_01C4D131.9F9E2240 Content-Type: text/html Message
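The RNR retry question being argued above maps onto a small set of ordinary RC QP attributes. Purely as a rough sketch, written against the later OpenFabrics libibverbs API for illustration (the thread predates it, and none of the numeric values below are mandated by the IPoIB drafts), the sender-side policy is chosen when the QP is moved to the ready-to-send state:

  #include <stdint.h>
  #include <string.h>
  #include <infiniband/verbs.h>

  /* Illustrative only: move an RC QP that is already in RTR to RTS, picking
   * the RNR policy debated above.  rnr_retry is a 3-bit field: 0..6 means the
   * QP drops into the error state after that many RNR NAK retries, while 7
   * means retry forever (the "infinity" option in the thread). */
  static int rc_qp_to_rts(struct ibv_qp *qp, uint32_t sq_psn, uint8_t rnr_retry)
  {
          struct ibv_qp_attr attr;

          memset(&attr, 0, sizeof(attr));
          attr.qp_state      = IBV_QPS_RTS;
          attr.timeout       = 14;         /* local ACK timeout: 4.096us * 2^14, ~67 ms */
          attr.retry_cnt     = 7;          /* transport retries for lost/unacked packets */
          attr.rnr_retry     = rnr_retry;  /* the "<7 or infinite" choice */
          attr.sq_psn        = sq_psn;
          attr.max_rd_atomic = 1;

          return ibv_modify_qp(qp, &attr,
                               IBV_QP_STATE | IBV_QP_TIMEOUT | IBV_QP_RETRY_CNT |
                               IBV_QP_RNR_RETRY | IBV_QP_SQ_PSN |
                               IBV_QP_MAX_QP_RD_ATOMIC);
  }

The receiver-side half of the same policy is the min_rnr_timer attribute (set with IBV_QP_MIN_RNR_TIMER during the transition to RTR), which determines the back-off interval advertised in the RNR NAKs the responder returns.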
    ------_=_NextPart_001_01C4D131.9F9E2240-- --===============0356602645== Content-Type: text/plain; charset="us-ascii" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit Content-Disposition: inline Content-Transfer-Encoding: 7bit _______________________________________________ IPoverIB mailing list IPoverIB@ietf.org https://www1.ietf.org/mailman/listinfo/ipoverib --===============0356602645==-- From ipoverib-bounces@ietf.org Tue Nov 23 10:46:39 2004 Received: from megatron.ietf.org (megatron.ietf.org [132.151.6.71]) by ietf.org (8.9.1a/8.9.1a) with ESMTP id KAA23298 for ; Tue, 23 Nov 2004 10:46:38 -0500 (EST) Received: from localhost.localdomain ([127.0.0.1] helo=megatron.ietf.org) by megatron.ietf.org with esmtp (Exim 4.32) id 1CWcjQ-0004BN-AE; Tue, 23 Nov 2004 10:37:28 -0500 Received: from odin.ietf.org ([132.151.1.176] helo=ietf.org) by megatron.ietf.org with esmtp (Exim 4.32) id 1CWcVp-0001WK-Kp for ipoverib@megatron.ietf.org; Tue, 23 Nov 2004 10:23:26 -0500 Received: from ietf-mx.ietf.org (ietf-mx.ietf.org [132.151.6.1]) by ietf.org (8.9.1a/8.9.1a) with ESMTP id KAA20769 for ; Tue, 23 Nov 2004 10:23:22 -0500 (EST) Received: from palrel11.hp.com ([156.153.255.246]) by ietf-mx.ietf.org with esmtp (Exim 4.33) id 1CWcZU-0007KI-Ll for ipoverib@ietf.org; Tue, 23 Nov 2004 10:27:13 -0500 Received: from esmail.cup.hp.com (esmail.cup.hp.com [15.0.65.164]) by palrel11.hp.com (Postfix) with ESMTP id 08ABD30FE9 for ; Tue, 23 Nov 2004 07:23:20 -0800 (PST) Received: from MK73191c.cup.hp.com ([15.244.203.228]) by esmail.cup.hp.com (8.9.3 (PHNE_29774)/8.8.6) with ESMTP id HAA08523 for ; Tue, 23 Nov 2004 07:20:50 -0800 (PST) Message-Id: <6.1.2.0.2.20041123072036.05121170@esmail.cup.hp.com> X-Sender: krause@esmail.cup.hp.com X-Mailer: QUALCOMM Windows Eudora Version 6.1.2.0 Date: Tue, 23 Nov 2004 07:20:59 -0800 To: IPoverIB From: Michael Krause Subject: RE: [Ipoverib] A Couple of IPoIB Questions Mime-Version: 1.0 X-Spam-Score: 0.3 (/) X-Scan-Signature: f8ee348dcc4be4a59bc395f7cd6343ad X-BeenThere: ipoverib@ietf.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: IP over InfiniBand WG Discussion List List-Unsubscribe: , List-Post: List-Help: List-Subscribe: , Content-Type: multipart/mixed; boundary="===============0306016151==" Sender: ipoverib-bounces@ietf.org Errors-To: ipoverib-bounces@ietf.org --===============0306016151== Content-Type: multipart/alternative; boundary="=====================_244073719==.ALT" --=====================_244073719==.ALT Content-Type: text/plain; charset="us-ascii"; format=flowed At 11:54 PM 11/22/2004, Dror Goldenberg wrote: >Hi Mike, >My comments below. >-Dror >Dropping a buffer is fine but that should be at the TCP/IP level and not a >driver decision. A driver should have sufficient buffers to avoid having >wasted the network bandwidth. Hence, the driver should be posting >sufficient buffers to keep up with the workload which may span multiple >connections / datagrams. Use of UC or RC does not change anything in this >regard. A drop using UC would simply waste IB network bandwidth, consume >HCA resources flushing the work (the transmitter would continue to >transmit so nothing is saved there), etc. and only impact one connection >at a time. It does nothing for the rest of the connections. So while one >might get a bit of benefit akin to a RED scheme, if the endnode pairs are >operating at a high workload, all one gets with UC is the ability of one >endnode to flood another with no push back except on random >connections. 
This would lead to bursty behavior and unpredictable >application responsiveness. RC leads to smoother performance between the >endnode pair and with the use of multiple RC QP, one can differentiate >traffic for QoS purposes which is something that will benefit applications. > >[dg] If you work with RC, then in the slow receiver case, backpressure >will propagate into the sender (RQ is full, no end to end credits are >reflected, peer SQ becomes full and you're out of SQ WQEs). In this case, >what will you do in the requester side ?

The SQ can become full and the send side driver can start to drop datagrams just like one does within any driver below IP. This only impacts the one RC QP and not others (one of my reasons for wanting multiple RC between endnode pairs if performance is critical). BTW, the same issue occurs if one were using Ethernet pause functionality and forward progress could not occur.

>- Tell the upper TCP/IP layers that the NIC TX ring is full - this will >cause OS not to > post buffers to ANY of current RC connections. I don't think it's > desirable, it'll slow down / block > your connections with the other remote peers

It is treated no differently than today's solutions.

>- Pretend as if there is still room in the SQ - but when OS posts to the >full SQ, you'll drop > the packet -> this will be just the same as the UC case, except that > you do it in the sender > instead of the receiver

What is wrong with this? It aligns with today's solutions.

>- Pretend as if there is still room in the SQ - but when OS posts to the >full SQ, you'll queue > it in SW. I think it'll risk shared resources.

This is a local implementation choice and one that has been implemented in some OS. This deals with thin hardware resources on a given device and works reasonably well under bursty traffic.

> >What I am trying to say, is that we need to understand what happens in the >case of the slow receiver. I think that in RC what you'll end up having is >the peer requester dropping the packets. In UC, you'll get the responder >dropping the packets. As of how much you flood the IB fabric, see my >comment on the second question.

From what I know, it has always been the ph

> >>>2) How would you configure the RNR retry counters. Would they be >>>configured to infinity ? Doesn't sound >>> good. Would they be configured to a finite value (should be <7), in >>> which case, in the case of a slow >>> receiver you'd end up recreating connections that had end to end >>> credits problem, which is a real >>> overhead on the protocol. >>RNR would be no different for IP over IB than for any other IB RC >>instance. >>[dg] Example ULPs such as SDP and SRP use SW level flow control and do >>not rely on RNR NAKs. >These are also not IP based ULP. > >> What I am trying to say is if you configure your QP for finite retries >> and a reasonable timeout, then when the receiver is slow, you'd often >> get the QP into the error state, after RNR retries are exhausted. The >> overhead of reestablishing a new connection each time the QP gets into >> the error state is high. If you use UC, then this is not a problem, >> because none of this happens. >Given RC uses send credits and therefore should not see a new message >unless there is an associated buffer available which increments the credit >count, one should not get a RNR NAK ever. The reason for RNR NAK was to >deal with a resource other than a receive buffer missing, e.g.
QP context >or V-to-P translation or whatever not being chip resident and some time >would be required to refresh without going into the error state. Given RC >is still send-receive based, there should not be any reason for a RNR NAK >and no SEND will occur unless a credit is provided. > >[dg] yes and no. If you work with regular RC, then when RQ is empty, then >the peer SQ will send probing packets (e.g. send first/send only) to see >if credits became available. In this case you will see RNR Nak, but what >you inject to the fabric before getting it is a single packet. So I agree >that you don't flood the IB fabric in this case. >The issue with e2e credits reflection is when one wants to use SRQ, >instead of posting to each RQ separately and consuming many resources. In >this case, e2e credits are no longer reflected by ACK packets. And you're >going to send messages to the remote side without any flow control, and >get RNR Naks when peer RQ is empty. If most implementations use SRQ, then >fabric is going to be flooded anyways because of slow receivers. >I know that SRQ today is not allowed on UC, but that's a different story...

So you are arguing for SRQ, which is not supported by UC. I don't know if SRQ has value or not, but if you want to discuss SRQ, then let's discuss that and not RC vs. UC, as I don't think the counter arguments against RC are significant.

Mike
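The SRQ point in this exchange is easiest to see at the verbs level. Again only as an illustrative sketch in the later OpenFabrics libibverbs API (nothing here is required by the draft, and the sizes are arbitrary): once receive buffers come from one shared receive queue instead of each connection's own RQ, the per-QP end-to-end credit advertisement goes away, so a slow receiver is visible to requesters only through RNR NAKs.

  #include <string.h>
  #include <infiniband/verbs.h>

  /* Illustrative sketch: one SRQ shared by many RC QPs. */
  static struct ibv_srq *make_shared_rq(struct ibv_pd *pd)
  {
          struct ibv_srq_init_attr init;

          memset(&init, 0, sizeof(init));
          init.attr.max_wr  = 4096;   /* size of the shared buffer pool (example value) */
          init.attr.max_sge = 1;

          return ibv_create_srq(pd, &init);
  }

  static struct ibv_qp *make_rc_qp_on_srq(struct ibv_pd *pd, struct ibv_cq *cq,
                                          struct ibv_srq *srq)
  {
          struct ibv_qp_init_attr init;

          memset(&init, 0, sizeof(init));
          init.send_cq          = cq;
          init.recv_cq          = cq;
          init.srq              = srq;        /* receives come from the shared pool  */
          init.qp_type          = IBV_QPT_RC; /* per the thread, SRQ is not allowed on UC */
          init.cap.max_send_wr  = 256;
          init.cap.max_send_sge = 1;

          return ibv_create_qp(pd, &init);
  }

Buffers are then replenished with ibv_post_srq_recv() on the SRQ rather than with per-QP ibv_post_recv() calls, which is exactly the resource saving, and the loss of per-connection flow control, that the two posters are weighing against each other.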
    Mike --=====================_244073719==.ALT-- --===============0306016151== Content-Type: text/plain; charset="us-ascii" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit Content-Disposition: inline Content-Transfer-Encoding: 7bit _______________________________________________ IPoverIB mailing list IPoverIB@ietf.org https://www1.ietf.org/mailman/listinfo/ipoverib --===============0306016151==-- From ipoverib-bounces@ietf.org Mon Nov 29 16:25:32 2004 Received: from megatron.ietf.org (megatron.ietf.org [132.151.6.71]) by ietf.org (8.9.1a/8.9.1a) with ESMTP id QAA07701 for ; Mon, 29 Nov 2004 16:25:31 -0500 (EST) Received: from localhost.localdomain ([127.0.0.1] helo=megatron.ietf.org) by megatron.ietf.org with esmtp (Exim 4.32) id 1CYs1P-0000Vp-6T; Mon, 29 Nov 2004 15:21:19 -0500 Received: from odin.ietf.org ([132.151.1.176] helo=ietf.org) by megatron.ietf.org with esmtp (Exim 4.32) id 1CYrwp-0006FT-Vu; Mon, 29 Nov 2004 15:16:36 -0500 Received: from CNRI.Reston.VA.US (localhost [127.0.0.1]) by ietf.org (8.9.1a/8.9.1a) with ESMTP id PAA27045; Mon, 29 Nov 2004 15:16:34 -0500 (EST) Message-Id: <200411292016.PAA27045@ietf.org> Mime-Version: 1.0 Content-Type: Multipart/Mixed; Boundary="NextPart" To: i-d-announce@ietf.org From: Internet-Drafts@ietf.org Date: Mon, 29 Nov 2004 15:16:34 -0500 Cc: ipoverib@ietf.org Subject: [Ipoverib] I-D ACTION:draft-ietf-ipoib-dhcp-over-infiniband-07.txt X-BeenThere: ipoverib@ietf.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: IP over InfiniBand WG Discussion List List-Unsubscribe: , List-Post: List-Help: List-Subscribe: , Sender: ipoverib-bounces@ietf.org Errors-To: ipoverib-bounces@ietf.org --NextPart A New Internet-Draft is available from the on-line Internet-Drafts directories. This draft is a work item of the IP over InfiniBand Working Group of the IETF. Title : DHCP over InfiniBand Author(s) : V. Kashyap Filename : draft-ietf-ipoib-dhcp-over-infiniband-07.txt Pages : 7 Date : 2004-11-29 An InfiniBand network uses a link-layer addressing scheme that is 20-octets long. This is larger than the 16-octets reserved for the hardware address in DHCP/BOOTP message. The above inequality imposes restrictions on the use of the DHCP message fields when used over an IP over InfiniBand(IPoIB) network. This document describes the use of DHCP message fields when implementing DHCP over IPoIB. A URL for this Internet-Draft is: http://www.ietf.org/internet-drafts/draft-ietf-ipoib-dhcp-over-infiniband-07.txt To remove yourself from the I-D Announcement list, send a message to i-d-announce-request@ietf.org with the word unsubscribe in the body of the message. You can also visit https://www1.ietf.org/mailman/listinfo/I-D-announce to change your subscription settings. Internet-Drafts are also available by anonymous FTP. Login with the username "anonymous" and a password of your e-mail address. After logging in, type "cd internet-drafts" and then "get draft-ietf-ipoib-dhcp-over-infiniband-07.txt". A list of Internet-Drafts directories can be found in http://www.ietf.org/shadow.html or ftp://ftp.ietf.org/ietf/1shadow-sites.txt Internet-Drafts can also be obtained by e-mail. Send a message to: mailserv@ietf.org. In the body type: "FILE /internet-drafts/draft-ietf-ipoib-dhcp-over-infiniband-07.txt". NOTE: The mail server at ietf.org can return the document in MIME-encoded form by using the "mpack" utility. To use this feature, insert the command "ENCODING mime" before the "FILE" command. To decode the response(s), you will need "munpack" or a MIME-compliant mail reader. 
Different MIME-compliant mail readers exhibit different behavior, especially when dealing with "multipart" MIME messages (i.e. documents which have been split up into multiple messages), so check your local documentation on how to manipulate these messages. Below is the data which will enable a MIME compliant mail reader implementation to automatically retrieve the ASCII version of the Internet-Draft. --NextPart Content-Type: Multipart/Alternative; Boundary="OtherAccess" --OtherAccess Content-Type: Message/External-body; access-type="mail-server"; server="mailserv@ietf.org" Content-Type: text/plain Content-ID: <2004-11-29152206.I-D@ietf.org> ENCODING mime FILE /internet-drafts/draft-ietf-ipoib-dhcp-over-infiniband-07.txt --OtherAccess Content-Type: Message/External-body; name="draft-ietf-ipoib-dhcp-over-infiniband-07.txt"; site="ftp.ietf.org"; access-type="anon-ftp"; directory="internet-drafts" Content-Type: text/plain Content-ID: <2004-11-29152206.I-D@ietf.org> --OtherAccess-- --NextPart Content-Type: text/plain; charset="us-ascii" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit Content-Disposition: inline Content-Transfer-Encoding: 7bit _______________________________________________ IPoverIB mailing list IPoverIB@ietf.org https://www1.ietf.org/mailman/listinfo/ipoverib --NextPart-- From ipoverib-bounces@ietf.org Tue Nov 30 16:20:27 2004 Received: from megatron.ietf.org (megatron.ietf.org [132.151.6.71]) by ietf.org (8.9.1a/8.9.1a) with ESMTP id QAA10096 for ; Tue, 30 Nov 2004 16:20:27 -0500 (EST) Received: from localhost.localdomain ([127.0.0.1] helo=megatron.ietf.org) by megatron.ietf.org with esmtp (Exim 4.32) id 1CZEe3-00022V-5b; Tue, 30 Nov 2004 15:30:43 -0500 Received: from odin.ietf.org ([132.151.1.176] helo=ietf.org) by megatron.ietf.org with esmtp (Exim 4.32) id 1CZEKc-0007Vc-RI; Tue, 30 Nov 2004 15:10:39 -0500 Received: from CNRI.Reston.VA.US (localhost [127.0.0.1]) by ietf.org (8.9.1a/8.9.1a) with ESMTP id PAA28790; Tue, 30 Nov 2004 15:10:36 -0500 (EST) Message-Id: <200411302010.PAA28790@ietf.org> Mime-Version: 1.0 Content-Type: Multipart/Mixed; Boundary="NextPart" To: i-d-announce@ietf.org From: Internet-Drafts@ietf.org Date: Tue, 30 Nov 2004 15:10:36 -0500 Cc: ipoverib@ietf.org Subject: [Ipoverib] I-D ACTION:draft-ietf-ipoib-ibmib-tc-mib-06.txt X-BeenThere: ipoverib@ietf.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: IP over InfiniBand WG Discussion List List-Unsubscribe: , List-Post: List-Help: List-Subscribe: , Sender: ipoverib-bounces@ietf.org Errors-To: ipoverib-bounces@ietf.org --NextPart A New Internet-Draft is available from the on-line Internet-Drafts directories. This draft is a work item of the IP over InfiniBand Working Group of the IETF. Title : Definition of Textual Conventions and OBJECT-IDENTITIES for IP Over InfiniBand (IPOVERIB) Management Author(s) : S. Harnedy Filename : draft-ietf-ipoib-ibmib-tc-mib-06.txt Pages : 12 Date : 2004-11-30 This memo defines a Management Information Base (MIB) module that contains Textual Conventions and OBJECT-IDENTITIES for use in definitions of management information for IP Over InfiniBand (IPOVERIB) networks. The intent is that these TEXTUAL CONVENTIONs (TCs) will be imported and used in IPOVERIB related MIB modules. A URL for this Internet-Draft is: http://www.ietf.org/internet-drafts/draft-ietf-ipoib-ibmib-tc-mib-06.txt To remove yourself from the I-D Announcement list, send a message to i-d-announce-request@ietf.org with the word unsubscribe in the body of the message. 
You can also visit https://www1.ietf.org/mailman/listinfo/I-D-announce to change your subscription settings. Internet-Drafts are also available by anonymous FTP. Login with the username "anonymous" and a password of your e-mail address. After logging in, type "cd internet-drafts" and then "get draft-ietf-ipoib-ibmib-tc-mib-06.txt". A list of Internet-Drafts directories can be found in http://www.ietf.org/shadow.html or ftp://ftp.ietf.org/ietf/1shadow-sites.txt Internet-Drafts can also be obtained by e-mail. Send a message to: mailserv@ietf.org. In the body type: "FILE /internet-drafts/draft-ietf-ipoib-ibmib-tc-mib-06.txt". NOTE: The mail server at ietf.org can return the document in MIME-encoded form by using the "mpack" utility. To use this feature, insert the command "ENCODING mime" before the "FILE" command. To decode the response(s), you will need "munpack" or a MIME-compliant mail reader. Different MIME-compliant mail readers exhibit different behavior, especially when dealing with "multipart" MIME messages (i.e. documents which have been split up into multiple messages), so check your local documentation on how to manipulate these messages. Below is the data which will enable a MIME compliant mail reader implementation to automatically retrieve the ASCII version of the Internet-Draft. --NextPart Content-Type: Multipart/Alternative; Boundary="OtherAccess" --OtherAccess Content-Type: Message/External-body; access-type="mail-server"; server="mailserv@ietf.org" Content-Type: text/plain Content-ID: <2004-11-30111228.I-D@ietf.org> ENCODING mime FILE /internet-drafts/draft-ietf-ipoib-ibmib-tc-mib-06.txt --OtherAccess Content-Type: Message/External-body; name="draft-ietf-ipoib-ibmib-tc-mib-06.txt"; site="ftp.ietf.org"; access-type="anon-ftp"; directory="internet-drafts" Content-Type: text/plain Content-ID: <2004-11-30111228.I-D@ietf.org> --OtherAccess-- --NextPart Content-Type: text/plain; charset="us-ascii" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit Content-Disposition: inline Content-Transfer-Encoding: 7bit _______________________________________________ IPoverIB mailing list IPoverIB@ietf.org https://www1.ietf.org/mailman/listinfo/ipoverib --NextPart-- From ipoverib-bounces@ietf.org Tue Nov 30 16:22:27 2004 Received: from megatron.ietf.org (megatron.ietf.org [132.151.6.71]) by ietf.org (8.9.1a/8.9.1a) with ESMTP id QAA10684 for ; Tue, 30 Nov 2004 16:22:26 -0500 (EST) Received: from localhost.localdomain ([127.0.0.1] helo=megatron.ietf.org) by megatron.ietf.org with esmtp (Exim 4.32) id 1CZEeA-00026X-Ou; Tue, 30 Nov 2004 15:30:50 -0500 Received: from odin.ietf.org ([132.151.1.176] helo=ietf.org) by megatron.ietf.org with esmtp (Exim 4.32) id 1CZEKg-0007WE-KW; Tue, 30 Nov 2004 15:10:47 -0500 Received: from CNRI.Reston.VA.US (localhost [127.0.0.1]) by ietf.org (8.9.1a/8.9.1a) with ESMTP id PAA28804; Tue, 30 Nov 2004 15:10:40 -0500 (EST) Message-Id: <200411302010.PAA28804@ietf.org> Mime-Version: 1.0 Content-Type: Multipart/Mixed; Boundary="NextPart" To: i-d-announce@ietf.org From: Internet-Drafts@ietf.org Date: Tue, 30 Nov 2004 15:10:40 -0500 Cc: ipoverib@ietf.org Subject: [Ipoverib] I-D ACTION:draft-ietf-ipoib-ip-over-infiniband-08.txt X-BeenThere: ipoverib@ietf.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: IP over InfiniBand WG Discussion List List-Unsubscribe: , List-Post: List-Help: List-Subscribe: , Sender: ipoverib-bounces@ietf.org Errors-To: ipoverib-bounces@ietf.org --NextPart A New Internet-Draft is available from the on-line Internet-Drafts directories. 
This draft is a work item of the IP over InfiniBand Working Group of the IETF. Title : Transmission of IP over InfiniBand Author(s) : H. Chu, V. Kashyap Filename : draft-ietf-ipoib-ip-over-infiniband-08.txt Pages : 21 Date : 2004-11-30 This document specifies a method for encapsulating and transmitting IPv4/IPv6 and Address Resolution Protocol (ARP) packets over InfiniBand (IB). It describes the link layer address to be used when resolving the IP addresses in 'IP over InfiniBand (IPoIB)' subnets. The document also describes the mapping from IP multicast addresse to InfiniBand multicast addresses. Additionally this document defines the setup and configuration of IPoIB links. A URL for this Internet-Draft is: http://www.ietf.org/internet-drafts/draft-ietf-ipoib-ip-over-infiniband-08.txt To remove yourself from the I-D Announcement list, send a message to i-d-announce-request@ietf.org with the word unsubscribe in the body of the message. You can also visit https://www1.ietf.org/mailman/listinfo/I-D-announce to change your subscription settings. Internet-Drafts are also available by anonymous FTP. Login with the username "anonymous" and a password of your e-mail address. After logging in, type "cd internet-drafts" and then "get draft-ietf-ipoib-ip-over-infiniband-08.txt". A list of Internet-Drafts directories can be found in http://www.ietf.org/shadow.html or ftp://ftp.ietf.org/ietf/1shadow-sites.txt Internet-Drafts can also be obtained by e-mail. Send a message to: mailserv@ietf.org. In the body type: "FILE /internet-drafts/draft-ietf-ipoib-ip-over-infiniband-08.txt". NOTE: The mail server at ietf.org can return the document in MIME-encoded form by using the "mpack" utility. To use this feature, insert the command "ENCODING mime" before the "FILE" command. To decode the response(s), you will need "munpack" or a MIME-compliant mail reader. Different MIME-compliant mail readers exhibit different behavior, especially when dealing with "multipart" MIME messages (i.e. documents which have been split up into multiple messages), so check your local documentation on how to manipulate these messages. Below is the data which will enable a MIME compliant mail reader implementation to automatically retrieve the ASCII version of the Internet-Draft. 
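Both this draft and the DHCP-over-InfiniBand draft announced above revolve around the same 20-octet IPoIB link-layer address. A minimal sketch of its layout as the announcements describe it (the field names are illustrative, not taken from the drafts):

  #include <stdint.h>

  /* Sketch of the 20-octet IPoIB link-layer address: one reserved/flags octet
   * and a 24-bit QPN, followed by the 16-octet port GID.  Being 20 octets, it
   * cannot fit in the 16-octet 'chaddr' field of a DHCP/BOOTP message, which
   * is the restriction the DHCP-over-InfiniBand draft addresses. */
  struct ipoib_hw_addr {
          uint8_t flags;      /* reserved bits defined by the IPoIB spec */
          uint8_t qpn[3];     /* queue pair number, network byte order   */
          uint8_t gid[16];    /* port GID                                */
  };                          /* 20 octets total, no padding needed      */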
--NextPart Content-Type: Multipart/Alternative; Boundary="OtherAccess" --OtherAccess Content-Type: Message/External-body; access-type="mail-server"; server="mailserv@ietf.org" Content-Type: text/plain Content-ID: <2004-11-30111239.I-D@ietf.org> ENCODING mime FILE /internet-drafts/draft-ietf-ipoib-ip-over-infiniband-08.txt --OtherAccess Content-Type: Message/External-body; name="draft-ietf-ipoib-ip-over-infiniband-08.txt"; site="ftp.ietf.org"; access-type="anon-ftp"; directory="internet-drafts" Content-Type: text/plain Content-ID: <2004-11-30111239.I-D@ietf.org> --OtherAccess-- --NextPart Content-Type: text/plain; charset="us-ascii" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit Content-Disposition: inline Content-Transfer-Encoding: 7bit _______________________________________________ IPoverIB mailing list IPoverIB@ietf.org https://www1.ietf.org/mailman/listinfo/ipoverib --NextPart-- From ipoverib-bounces@ietf.org Tue Nov 30 18:34:48 2004 Received: from megatron.ietf.org (megatron.ietf.org [132.151.6.71]) by ietf.org (8.9.1a/8.9.1a) with ESMTP id SAA00448 for ; Tue, 30 Nov 2004 18:34:48 -0500 (EST) Received: from localhost.localdomain ([127.0.0.1] helo=megatron.ietf.org) by megatron.ietf.org with esmtp (Exim 4.32) id 1CZHTm-0002l4-9H; Tue, 30 Nov 2004 18:32:18 -0500 Received: from odin.ietf.org ([132.151.1.176] helo=ietf.org) by megatron.ietf.org with esmtp (Exim 4.32) id 1CZHOE-0000ua-Ev for ipoverib@megatron.ietf.org; Tue, 30 Nov 2004 18:26:34 -0500 Received: from ietf-mx.ietf.org (ietf-mx.ietf.org [132.151.6.1]) by ietf.org (8.9.1a/8.9.1a) with ESMTP id SAA29945 for ; Tue, 30 Nov 2004 18:26:31 -0500 (EST) Received: from volter-fw.ser.netvision.net.il ([212.143.107.30] helo=taurus.voltaire.com) by ietf-mx.ietf.org with esmtp (Exim 4.33) id 1CZHTQ-0006ul-Rd for ipoverib@ietf.org; Tue, 30 Nov 2004 18:31:57 -0500 X-MimeOLE: Produced By Microsoft Exchange V6.5.7226.0 Content-class: urn:content-classes:message MIME-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: quoted-printable Subject: [Ipoverib] MAC Fake with IPoIB Date: Wed, 1 Dec 2004 01:25:44 +0200 Message-ID: <35EA21F54A45CB47B879F21A91F4862F2CC1CD@taurus.voltaire.com> Thread-Topic: [Ipoverib] MAC Fake with IPoIB Thread-Index: AcTNp5+apRItYwv3SJuqKkSsHwbzhwAAaidw From: "Yaron Haviv" To: "IPoverIB" X-Spam-Score: 0.0 (/) X-Scan-Signature: b19722fc8d3865b147c75ae2495625f2 Content-Transfer-Encoding: quoted-printable X-BeenThere: ipoverib@ietf.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: IP over InfiniBand WG Discussion List List-Unsubscribe: , List-Post: List-Help: List-Subscribe: , Sender: ipoverib-bounces@ietf.org Errors-To: ipoverib-bounces@ietf.org Content-Transfer-Encoding: quoted-printable Recently we have seen several applications that required=20 An equivalent to Ethernet MAC faking, in order to implement fail-over between two nodes, Currently the way IB is implemented you cannot implement such a capability with IPoIB, and a node cannot take over another node's MAC (GID/LID+QPN). Few possible solutions can be: 1. Implement gratuities ARP's=20 That solves the problems for only some of the applications, and cannot help in cases with Active/Active, >2 configurations Or another example is that it won't work with VRRP=20 So it doesn't solve the problem=20 2. Define that an IPoIB driver should also listen on GID_out traps, and clear those GID's from its ARP cache when they go down. 
This can be used if it is possible to bring the port down in case of a failure (or make the SM issue the GID_out trap).
3. Require IPoIB drivers to support the UNARP RFC (RFC 1868); this allows a node (taking over) to ask a remote node to clear certain ARP entries. There is precedent for using UNARP in IP over SONET/SDH (RFC 2176) for the same reasons.
4. Any other methods you guys can come up with.

Your thoughts/suggestions?

Yaron

_______________________________________________ IPoverIB mailing list IPoverIB@ietf.org https://www1.ietf.org/mailman/listinfo/ipoverib
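For concreteness, option 1 above amounts to sending a gratuitous ARP whose sender hardware address is the new owner's 20-octet IPoIB address. The sketch below only builds the ARP payload, assuming the encapsulation described in the ip-over-infiniband draft (hardware type 32, 20-octet hardware addresses); the structure and helper names are hypothetical, and handing the frame to the IPoIB interface (e.g. via a packet socket) is omitted.

  #include <stdint.h>
  #include <string.h>
  #include <arpa/inet.h>

  #define ARPHRD_INFINIBAND  32     /* IANA ARP hardware type for InfiniBand */
  #define IPOIB_HW_ADDR_LEN  20     /* flags+QPN (4 octets) + GID (16 octets) */

  struct ipoib_arp {
          uint16_t ar_hrd;                  /* 32 = InfiniBand            */
          uint16_t ar_pro;                  /* 0x0800 = IPv4              */
          uint8_t  ar_hln;                  /* 20                         */
          uint8_t  ar_pln;                  /* 4                          */
          uint16_t ar_op;                   /* 1 = request                */
          uint8_t  sha[IPOIB_HW_ADDR_LEN];  /* new owner's QPN + GID      */
          uint8_t  spa[4];
          uint8_t  tha[IPOIB_HW_ADDR_LEN];  /* left zero in a gratuitous ARP */
          uint8_t  tpa[4];
  } __attribute__((packed));

  /* Illustrative only: fill in a gratuitous ARP for the taken-over IP. */
  static void build_gratuitous_arp(struct ipoib_arp *arp,
                                   const uint8_t hw_addr[IPOIB_HW_ADDR_LEN],
                                   uint32_t ip_be /* network byte order */)
  {
          memset(arp, 0, sizeof(*arp));
          arp->ar_hrd = htons(ARPHRD_INFINIBAND);
          arp->ar_pro = htons(0x0800);
          arp->ar_hln = IPOIB_HW_ADDR_LEN;
          arp->ar_pln = 4;
          arp->ar_op  = htons(1);
          memcpy(arp->sha, hw_addr, IPOIB_HW_ADDR_LEN);
          memcpy(arp->spa, &ip_be, 4);      /* sender IP == target IP ...      */
          memcpy(arp->tpa, &ip_be, 4);      /* ...which marks it as gratuitous */
  }

As the message above notes, this only helps if every peer's ARP implementation is willing to overwrite an existing cache entry on an unsolicited request, which is why the GID_out trap and UNARP (RFC 1868) alternatives are on the table.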