A collaborative project to demonstrate end-to-end traffic management and high performance data transport applications required for Grid operations

Investigators:
C. Cooper, D. Salmon (CLRC, RAL)
R. Tasker (CLRC, Daresbury Laboratory)
P. Clarke* (University College London - Physics)
S. Bhatti, J. Crowcroft (University College London - Computer Science)
R. Hughes-Jones (Manchester University - Physics)
J. Sharp, R. Samani (UKERNA)

Introduction

There are at present many initiatives in the research community to develop "computing Grids" for scientific data processing across a wide variety of applications. Grids are based upon middleware layers that interconnect data and applications in a seamless way across a widely distributed environment. Core to the Grid fabric is a high capacity network offering the services needed to manage the different classes of traffic flow required for Grid operations. Excluding the planning and provision of the infrastructure itself, Grid networking issues can broadly be broken down into the following areas:
· Network services based upon traffic engineering, such as managed bandwidth and Quality-of-Service (QoS) provision.
· High rate, high volume, robust data transport applications.
· Network information services.
· Middleware access to data replication services.
All of these areas are manifestly crucial to PPARC science goals, and therefore feature prominently in the consortium proposal for a DataGrid for Particle Physics (PP-DataGrid), where support is sought for work in all of the application specific areas. This proposal is submitted in concert with the PP-DataGrid proposal but is complementary: it focuses upon the first two items, which embody generic core e-science aspects of networking (i.e. those areas common to all e-science operations).
Traffic management & QoS

Requirements from the PP-DataGrid application perspective include:
· Short-term data set replication: to be able to replicate data sets (~1-100 TBytes) between Tier-N sites in ~1 day (0.1-10 Gbit/s). This leads to the requirement for short-term end-to-end managed bandwidth reservation on demand.
· Pseudo-continuous data replication: Tier-N sites need to replicate processed data to Tier-M sites and vice versa, at a continuous equivalent rate of typically 1-500 Mbit/s. This leads to the need for managed bandwidth services.
· QoS: in the longer term more sophisticated services are required to differentiate classes of traffic, ranging from high quality interactive, control and video applications through to less time critical applications. This leads to the need for production Quality of Service provision.
· Traffic classification: to be able to configure packet classification at ingress to a domain based upon client requirements.
· Class based routing and forwarding behaviour: the ability to handle packets within routers differently according to the associated traffic class.
· Inter-domain: development of "Service Level Agreement" mechanisms to operate at inter-domain interfaces.
· Scalability: solutions which scale easily with the complexity of the network and the number of clients.
Many techniques are potentially available for implementing traffic management:
- Packets can be examined at ingress to a network and classified using various pieces of information from the IP header.
- Diffserv allows different next-hop behaviour for each traffic class.
- RSVP can be used to reserve resources within a network.
- Congestion control can be addressed using intelligent packet dropping algorithms, such as weighted random early detection (WRED), to signal back to applications.
- Explicit congestion notification (ECN) is designed to interact with suitably aware clients and servers.
- Multi Protocol Label Switching (MPLS) is one of the important technologies that is likely to be used for, or as part of, the underlying traffic engineering.
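The first and fourth techniques above can be sketched in a few lines of code. The following Python fragment is purely illustrative: the traffic classes, port mappings and thresholds are invented for this sketch and are not taken from any router configuration or from this proposal.

```python
"""Sketch of DiffServ-style ingress classification and a WRED drop decision.
All class names, ports and thresholds here are illustrative assumptions."""
import random

# Per-class WRED profiles: (min_threshold, max_threshold, max_drop_probability),
# thresholds in packets of average queue depth.
WRED_PROFILES = {
    "EF":  (60, 80, 0.02),   # expedited forwarding: drop late and rarely
    "AF1": (30, 60, 0.10),   # assured forwarding: intermediate treatment
    "BE":  (10, 40, 0.30),   # best effort: drop early and aggressively
}

def classify(src_port: int, dst_port: int) -> str:
    """Toy ingress classifier using IP/transport header fields (here, ports)."""
    if dst_port == 5004:          # e.g. an RTP media flow (assumed port)
        return "EF"
    if dst_port in (20, 21):      # bulk FTP-style transfer
        return "AF1"
    return "BE"

def wred_drop(avg_queue: float, traffic_class: str) -> bool:
    """WRED: below the min threshold never drop; above the max always drop;
    in between, drop probability rises linearly up to max_drop_probability."""
    lo, hi, p_max = WRED_PROFILES[traffic_class]
    if avg_queue < lo:
        return False
    if avg_queue >= hi:
        return True
    p = p_max * (avg_queue - lo) / (hi - lo)
    return random.random() < p
```

The point of the per-class profiles is exactly the "class based forwarding behaviour" requirement above: the same queue depth that guarantees delivery for the EF class is already dropping best-effort traffic.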
Project aims

The mechanisms mentioned above are themselves not new, and have already been demonstrated within limited environments and upon homogeneous domains (the same supplier's routers within a single administrative domain). However, their use for end-to-end services across a heterogeneous WAN consisting of multiple domains is not straightforward, and at present there is no "production" method with which these services can be provided for live end-to-end applications. The thrust of this project is therefore to meld existing knowledge with a focused application (the PP-DataGrid) in order to demonstrate the end-to-end services needed for Grid operations across the WAN, both within the UK and to both the US and Europe. The PP-DataGrid will be the specific driving application, although the results will be relevant to all other Grid applications. We believe that having such a clear and high profile application is crucial to provide the focus for concrete deliverable targets. The project will make extensive use of the SuperJANET4 Development Network (SJDN), which is intimately connected to the SuperJANET academic network (ac.uk) upon which live Grid applications will depend. This is a substantial infrastructure which will be available to us through the involvement of UKERNA directly in this project. In particular, UKERNA have procured routing equipment which is MPLS capable specifically with this work in mind. In more detail, the project aims are:
· To understand the applicability and limitations of various traffic engineering tools for implementing the traffic management services required by the Grid.
· To specifically understand the use of MPLS in this context.
· To address the inter-domain interface problems.
· To demonstrate end-to-end network services over several domains within the UK.
· Where possible, to demonstrate end-to-end network services to the USA in collaboration with leading US Grid groups.
· Where possible, to demonstrate end-to-end network services to CERN as part of our EU-DataGrid commitments.
The detailed breakdown of tasks and deliverables is given in the appendices. The infrastructure and routing equipment which will be required for this project are described in detail in the appendices, along with a detailed costing.
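The "managed bandwidth" services referred to in these aims ultimately come down to admitting traffic at an agreed rate at an ingress point. A token bucket is the standard way to express such a contract; the sketch below is illustrative only (the class, rates and packet sizes are invented for this example, not part of any deliverable).

```python
"""Minimal token-bucket sketch of a managed-bandwidth contract: traffic is
admitted at up to `rate` bytes/s, with bursts of up to `burst` bytes.
Caller supplies a monotonically increasing clock via `now` (seconds)."""

class TokenBucket:
    def __init__(self, rate: float, burst: float):
        self.rate = rate      # sustained rate, bytes per second
        self.burst = burst    # bucket depth (burst allowance), bytes
        self.tokens = burst   # bucket starts full
        self.last = 0.0       # time of previous decision

    def allow(self, nbytes: int, now: float) -> bool:
        """Admit a packet of nbytes at time `now` if enough tokens remain."""
        # Refill tokens for the elapsed time, capped at the bucket depth.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if nbytes <= self.tokens:
            self.tokens -= nbytes
            return True
        return False

# Example: a flow contracted at 1 Mbyte/s with a 3000-byte burst allowance
# admits two back-to-back 1500-byte packets, then must wait for refill.
tb = TokenBucket(rate=1_000_000, burst=3000)
print(tb.allow(1500, now=0.0))   # True  (burst credit)
print(tb.allow(1500, now=0.0))   # True  (burst exhausted)
print(tb.allow(1500, now=0.0))   # False (must wait for tokens)
```

The same shape of contract, expressed per traffic class, is what an inter-domain Service Level Agreement would have to pin down at each domain boundary.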
Transport Applications
Grid data rates will exceed 100s of Mbit/s over long latency routes. Such transfers will rely upon the availability of high rate, high volume, reliable data transport applications. All three elements are key: the combination of rate and volume is clear, but it is also essential that such applications can recover from faults and complete the data transfer. The transport protocols will also have to deal with efficient replication and update to multiple sites. It is already apparent that protocols such as "standard" TCP based FTP are unlikely to be adequate, and therefore today we do not know how to satisfy the DataGrid demands. This is an area where significant work has been done in the CS community based upon modelling and controlled measurements; however, it is vital to understand these effects in the context of real Grid traffic patterns running on the WAN. The issues of transport applications are directly related to those of traffic management, since applications need to interact with QoS services.

Project aims

The aims of this part of the project are:
· To investigate a variety of high performance data transport mechanisms, including advanced TCP and non-TCP applications.
· To demonstrate high performance data transport in a live Grid context, aiming for > 1 Gbit/s.
· To understand how to integrate suitable applications with the traffic management services previously described and the higher Grid middleware layer.

Industrial Collaboration

UKERNA

UKERNA provision and manage the SuperJANET4 academic network in the UK. UKERNA are already highly involved with e-science activities and have made clear their wish to support Grid operations upon the SJ4 backbone. In the wider sense UKERNA have identified the need to develop QoS services and, for example, have constituted a working group to bring together expertise from different organisations in this area. As part of this strategy UKERNA have great interest in the development of traffic engineering based upon MPLS. The routing equipment which is now deployed for SJ4 was procured to be MPLS capable for just this purpose. UKERNA fully support the opportunity to develop this in collaboration with the PP-DataGrid project. UKERNA will contribute the following to this project:
- The routing equipment required and the associated maintenance support.
- The annual costs of the testbed fibre infrastructure.
- Engineering effort from within the JANET Network Operations and Service Centre (NOSC).
- Project management effort from within the UKERNA Strategic Technologies Group, specifically Jeremy Sharp and Rina Samani.

CISCO

<Awaiting words from Jane Butler>

Links and external collaborations

Request for support from PPARC

Staff posts:
- We request 2.0 FTE staff posts to work on traffic management and QoS for two years. 1.5 FTE will be used specifically to underpin the development of end-to-end services using MPLS, in line with the specific objectives of TM1 and TM2, which embody the immediate strategic interests of UKERNA and CISCO. A further 0.5 FTE will be used to support other QoS work, such as Diffserv, and the international collaborative work.
- We request a 1.0 FTE staff post to work on data transport applications.
Equipment:
- RAL Nortel Edge router (fractional cost to this project): £3,000
- PC ancillary equipment and internet access, £2,000/site: £8,000
- Specialised equipment, £3,100/site: £12,000
- Test equipment at three C-PoPs: £17,000
- Edge and backbone domain routers.
- Router support at sites and software updates: £25,000
- Travel and subsistence: £4,000
Total request to PPARC: £319K

Infrastructure for local loops:

We do not request significant resources for local loops (connections between SJDN and sites) in this proposal. This is for two reasons:
- The work proposed here is aimed at developing end-to-end mechanisms. Whilst this work does rely upon suitable routing equipment being available, it does not inherently rely upon local loops in order to meet its objectives, although having such loops makes demonstrations both easier and more convincing. Nevertheless, we do not believe that it is reasonable for local loop costs to dominate the equipment costs in this proposal.
- It is likely that such local loops will in any case be needed in the wider context of the PP-DataGrid testbed deliverables. We therefore assume that some infrastructure will be available from that project.
The total cost of local loops is £800,000 at full price, although we expect significant discounts. We will seek the majority of this elsewhere. If difficulties still remain then (i) the test domains can be centred at the SJDN C-PoPs, hence avoiding much of the cost, and/or (ii) local loops may be rented for limited periods corresponding to the end-to-end tests. Therefore in this proposal we request only minimal costs of £50,000 over two years.
Summary of resources requested in this proposal. Relation to other proposals.

The resources sought here are targeted at the development and demonstration of core networking services. As such, the work described here can form the basis for a programme to be submitted to the OST/DTI generic e-science support lines. We believe that such a submission would be significantly enhanced by the demonstration of PPARC support for both the generic and collaborative aspects of this proposal. The resources sought here are complementary to those sought through the PP-DataGrid proposal, which would support the application specific areas of networking (testbed application traffic, support at Grid sites, use of services through middleware).

Appendix: Expertise and resources within the proposing collaboration
UCL

Peter Clarke is a Reader at UCL and leads the LHC experimental group, working primarily on computing requirements of the LHC programme. His group interests are OO software design, Grid data transport applications, Grid traffic management and network information services. He chairs a PPARC networking committee (PPNCG). Within the EU-DataGrid project he is a member of the Project Technical Board and represents the UK in the network work-package (WP7). Within the UK PP-DataGrid project he is responsible for networking coordination, and is a member of the interim project management board. He will dedicate at least 50% of his research time to this project, and possibly more if circumstances allow.

Saleem Bhatti is a Lecturer in the Department of Computer Science at UCL. His areas of research include QoS (applications and networks), network management, network security and mobile systems. He was on the programme committee of the 7th International Workshop on Quality of Service (IWQoS99) and is on the programme committee of Networked Group Communication 2001 (NGC2001).

J. Crowcroft is a Professor in the Department of Computer Science at UCL. His research is in multimedia communications. He is a member of the ACM, a member of the British Computer Society, a fellow of the IEE and of the Royal Academy of Engineering, and a Senior Member of the IEEE, as well as a member of the editorial teams for Computer Networks, Transactions on Networking, IEEE Networks, Monet and Cluster.

We anticipate an additional Grid networking post through PP-DataGrid direct support. Approximately 50% FTE of this will be directly connected to the application side of this work. We have applied for at least one studentship to be provided through the PPARC industrial collaboration scheme (CASE), and/or the e-science studentship scheme.

Manchester University
Richard Hughes-Jones is a researcher in the Physics department. His interests are in areas of computing and networking within the context of the LHC programme, including the performance, network management and modelling of Gigabit Ethernet switches. He is secretary of the PPNCG. Within the UK PP-DataGrid project he is a member of the management board and a member of the networking workpackage (WP7). He is currently investigating the performance of LANs, MANs and SuperJANET4 using UDP and TCP flows. He will dedicate approximately 50% FTE to this project.

We anticipate an additional Grid networking post through PP-DataGrid direct support. At least 50% FTE of this will be directly connected to investigating the performance of the network under different QoS conditions in relation to the high performance data transport mechanisms. We have applied for at least one studentship to be provided through the PPARC e-science studentship scheme, and would expect the student to contribute to this work.

Central Laboratory of the Research Councils (CLRC)
Chris Cooper is currently network strategist at the Rutherford Appleton Laboratory. He is a consultant to UKERNA and chairs a 'think tank' on the introduction of QoS into SuperJANET. He holds a visiting professorship at Oxford Brookes University, where he teaches masters courses in networking. His research interests are in all aspects of multiservice networking, recently extended to multiservice middleware, in which context he is a co-investigator in the EPSRC project 'Visual Beans'.

David Salmon works in the Scientific Computing Support group of the Information Technology Department of the Rutherford Appleton Laboratory. Prior to this he worked for UKERNA, where latterly he was Operations Manager for the European academic and research backbone network TEN-34, a post contracted by DANTE. He is currently involved in network related activities for the DataGrid project and is investigating MPLS in a small lab testbed based on Linux systems. He attends the TERENA TF-NGN meetings as a representative of UKERNA. He is a member of the PPNCG and a member of the EU-DataGrid networking workpackage (WP7).

Robin Tasker is Head of Network Development at the Daresbury Laboratory. His research and development interests include wide area QoS across the Internet, the lower-layer (switching) environment and network monitoring. He has been a voting member of the IEEE 802.1 (Internetworking) working group since the late 1980s, and since 1997 has led the UK representation to ISO/IEC JTC1 SC6 (Data Communications) meetings. He is a member of the PPNCG. Within the EU-DataGrid he is a member of the networking workpackage (WP7), where he is managing the monitoring activity.

C. Cooper and R. Tasker will provide expertise on QoS and traffic engineering in both IP and lower-layer network environments. D. Salmon is already active on a small-scale MPLS pilot project using Linux systems as routers to emulate a backbone network with both core and edge routing elements. The initial aim is to understand label based routing and to combine this with existing QoS/CoS techniques for traffic classification, prioritisation and rate control to implement a protected bandwidth path across the network. Bulk data-transfer applications will be tested over the protected path, and experience gained here will be extended to the wider area tests once the SuperJANET development network has been suitably equipped and commissioned.

UKERNA

Jeremy Sharp is manager of the UKERNA Strategic Technologies Group, which is part of the Network Development Division. The role of the Strategic Technologies Group is to provide a view of how network technologies and applications will shape the future of JANET and SuperJANET. In particular it is responsible for the development and implementation of initiatives (such as the present SuperJANET4 Development Strategy and its implementation) that lead to the development of specific new services to the community. Prior to joining UKERNA in 1992, Jeremy worked in the Telecommunications section of the Rutherford Appleton Laboratory.

Rina Samani is the Technology Development Manager within the Strategic Technologies Group in the Network Development Division. She is responsible for managing specific strategic programmes and tracking emerging application areas and technologies, such as Internet2 developments, and for managing development programmes to ensure that the underlying JANET network service is able to meet any novel demands of new applications.

Appendix: Tasks & Deliverables

Traffic management
Task TM1: To understand the use of MPLS as a traffic engineering tool within the core SJDN.
Description: In this phase we will understand the use of MPLS running on a variety of CISCO routers. We will develop suitable IP-to-label mapping at ingress and use MPLS as a basis to configure traffic management for "guaranteed" bandwidth and QoS. The usefulness of MPLS in this context will be assessed. Test equipment is to be situated at participating sites and connected to the C-PoPs by local loops.
Deliverables:
- Month 3: Procurement and installation of equipment.
- Month 6: Initial demonstration of sustained throughput for different traffic classes.
- Month 9: Completion of work.
- Month 12: Final report.
Risks: Timely procurement of SJDN routing equipment. Procurement of local loops and routers to connect to C-PoPs.

Task TM2: To demonstrate end-to-end traffic management across multiple domains using live Grid traffic.
Description: This is the main phase of the project within the UK, necessitating the solution of inter-domain issues and leading to a demonstration of end-to-end services between Grid user sites. In this phase we extend the configuration to three or more additional independent test-domains peered with SJDN. These test-domains will connect to Grid site end points, and hence act as the entry point for Grid application traffic. Traffic management will be configured on all domains. Inter-domain traffic management issues will then be addressed in order to configure end-to-end services. Demonstrations will use live Grid traffic where possible. If and where possible we will also negotiate to include suitable MANs between SJDN and sites. The investigation will focus upon the use of MPLS early on, in accordance with the strategic aims of the industrial collaborators. The work will be widened to address further QoS issues in the latter stages, including Diffserv, WRED and ECN.
Deliverables:
- Month 12: Initial demonstration of end-to-end guaranteed bandwidth and QoS. Interim report. Presentation of results at network venues.
- Month 18: Advanced demonstration including use of other QoS techniques.
- Month 24: Final report. Presentation of results at networking venues.
Risks: Timely procurement of routing equipment for "test" domains at end user sites. Agreement by Grid applications to route traffic over the test system. Multi-vendor issues. Capabilities of MANs.

Task TM3: To demonstrate end-to-end traffic management to the USA in collaboration with leading US Grid groups.
Description: In this phase we will be collaborating with leading Grid development groups in the USA. We will seek to configure the same types of traffic management as per TM2 between end points in the different regions. The scope of the work will explicitly include QoS issues. The main issue will be the availability of suitable transatlantic connections upon which such development can take place.
Deliverables: Deliverable dates must necessarily be less concrete at present, until the availability of a suitable transatlantic connection is established.
- Month 12: Interim report on progress and tests made to date.
- Month 24: Final report.
Risks: Transatlantic interconnection between SJDN and ESnet (or Abilene) upon which such routing development work can take place.

Task TM4: To demonstrate end-to-end traffic management to CERN and other European sites.
Description: We will seek to configure the same types of traffic management as per TM3 to CERN or other European sites. The main issue here is the availability of suitable infrastructure within the Geant network and then into the CERN site. This task is within the scope of the EU-DataGrid project.
Deliverables: Deliverable dates must necessarily be less concrete at present, until the availability of a suitable connection is established.
- Month 12: Interim report on progress and tests made to date.
- Month 24: Final report.
Risks: Availability of suitable infrastructure within Europe.

Transport Applications:
Task TP1: Demonstrate high performance transport applications across the WAN in a live Grid context, with a target of 1 Gbit/s.
Description: Characterise the performance of standard FTP and multiple TCP stream FTP applications over the WAN in the context of Grid traffic. Deploy the TCP modifications needed for high bandwidth-delay routes (much work already exists); the focus will be on utilisation in a Grid context. Investigate strategies for reliable transport. Investigate non-TCP based transport applications. Provide an interface for such new applications to the Grid middleware layer.
Deliverables:
- Month 9: Demonstration of reliable transport at > 100 Mbit/s over the WAN.
- Month 18: Demonstration of reliable transport at > 1 Gbit/s.
- Month 24: Final report.
Risks: None known.

Appendix: The SuperJANET4 development network

The diagram shows the configuration of the SuperJANET4 development network (SJDN) and the proposed links to the sites involved in this proposal. The C-PoP sites refer to the SuperJANET Core Point-of-Presence sites within the MCI-Worldcom backbone.

Appendix: Technical layout and costing

Equipment Configuration and Costing

This section describes the routers and test equipment required and then presents the current costs for each item. The topology of the external MPLS domains at the test sites and the SuperJANET Development Network core MPLS routers located at the Leeds, London, Reading and Warrington PoPs is shown in Figure A.1. Some of the tests proposed in the investigation require the injection of test and background load traffic into the core MPLS routers at the PoPs, in order to suitably load the core network. In addition, all the test equipment at the PoPs requires IP access from the production SuperJANET4 network. Figure A.2 gives a more detailed diagram of the equipment and interfaces required for this.
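To put Task TP1's 1 Gbit/s target in perspective, a back-of-envelope bandwidth-delay-product calculation shows why "standard" TCP FTP struggles over long latency routes and why enlarged windows or multiple parallel streams are needed. The RTT figure below is an assumed, illustrative transatlantic value, not a measurement from this proposal.

```python
"""Back-of-envelope figures behind Task TP1: the TCP window a single stream
needs to sustain 1 Gbit/s over a long-latency path. Numbers are illustrative
assumptions, not measurements."""

def bdp_bytes(rate_bits_per_s: float, rtt_s: float) -> float:
    """Bandwidth-delay product: bytes in flight needed to keep the pipe full."""
    return rate_bits_per_s * rtt_s / 8

rtt = 0.120                      # ~120 ms UK-US round trip (assumed)
target = 1e9                     # 1 Gbit/s target from TP1
window = bdp_bytes(target, rtt)  # TCP window / buffering needed, in bytes
print(f"window needed: {window/1e6:.0f} MB")          # 15 MB

# With a classic 64 KB TCP window, one stream is throughput-limited to:
one_stream = 64 * 1024 * 8 / rtt                      # ~4.4 Mbit/s
streams = target / one_stream                         # streams to reach 1 Gbit/s
print(f"parallel 64 KB streams needed: {streams:.0f}")
```

A 15 MB window is far beyond unscaled TCP's 64 KB limit, which is why the task investigates both advanced TCP (large, scaled windows) and multiple-stream FTP as routes to the target rate.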
© 2001-2003, Yee-Ting Li, email: ytl@hep.ucl.ac.uk, Tel: +44 (0) 20 7679 1376, Fax: +44 (0) 20 7679 7145, Room D14, High Energy Particle Physics, Dept. of Physics & Astronomy, UCL, Gower St, London, WC1E 6BT