
GridFTP Page

GridFTP is the protocol proposed for all data transfers on the Grid. It extends the standard FTP protocol with facilities such as multi-stream transfer, automatic tuning and Globus-based security.

Extracted from "Secure, Efficient Data Transport and Replica Management for High-Performance Data-Intensive Computing", Allcock et al.:

Data-intensive scientific and engineering applications require both transfers of large amounts of data (terabytes or petabytes) between storage systems and access to large amounts of data (gigabytes or terabytes) by many geographically distributed applications and users for analysis, visualization, etc.
There are already a number of storage systems in use by the Grid community, each of which was designed to satisfy specific needs and requirements for storing, transferring and accessing large datasets. These include the Distributed Parallel Storage System (DPSS) and the High Performance Storage System (HPSS), which provide high-performance access to data and utilize parallel data transfer and/or striping across multiple servers to improve performance [1][2]. The Distributed File System (DFS) supports high-volume usage, dataset replication and local caching. The Storage Resource Broker (SRB) connects heterogeneous data collections, provides a uniform client interface to storage repositories, and provides a metadata catalog for describing and locating data within the storage system [4]. Other systems allow clients to access structured data from a variety of underlying storage systems (e.g., HDF5 [5]).


Unfortunately, most of these storage systems utilize incompatible and often unpublished protocols for accessing data, and therefore require the use of their own client libraries to access data. These incompatible protocols and client libraries effectively partition the datasets available on the grid. Applications that require access to data stored in different storage systems must use multiple access methods. To overcome these incompatible protocols, we have proposed a universal grid data transfer and access protocol called GridFTP that provides secure, efficient data movement in Grid environments. This protocol, which extends the standard FTP protocol, provides a superset of the features offered by the various Grid storage systems currently in use. We argue that using GridFTP as a common data access protocol would be mutually advantageous to grid storage providers and users. Storage providers would gain a broader user base, because their data would be available to any client, while storage users would gain access to a broader range of storage systems and data.


We chose to extend the FTP protocol because we observed that FTP is the protocol most commonly used for data transfer on the Internet and the most likely candidate for meeting the Grid’s needs. The FTP protocol is an attractive choice for several reasons. First, FTP is a widely implemented and well-understood IETF standard protocol. As a result, there is a large base of code and expertise from which to build. Second, the FTP protocol provides a well-defined architecture for protocol extensions and supports dynamic discovery of the extensions supported by a particular implementation. Third, numerous groups have added extensions through the IETF, and some of these extensions will be particularly useful in the Grid. Finally, in addition to client/server transfers, the FTP protocol also supports transfers directly between two servers, mediated by a third party client (i.e. “third party transfer”).
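In standard FTP, the dynamic discovery of extensions mentioned above is done with the FEAT command (RFC 2389), whose reply lists one supported feature per indented line. The following is a minimal sketch of parsing such a reply. The function name parse_feat and the sample reply text are assumptions made for illustration, not output captured from a GridFTP server.

```python
def parse_feat(reply: str) -> dict[str, str]:
    """Parse a multi-line FEAT reply into {feature_name: parameters}."""
    features = {}
    for line in reply.splitlines():
        # Feature lines start with a single space; the first and last
        # lines ("211-..." and "211 End") only frame the reply.
        if not line.startswith(" "):
            continue
        name, _, params = line.strip().partition(" ")
        features[name.upper()] = params
    return features


# Hypothetical reply text, made up for illustration.
SAMPLE_REPLY = (
    "211-Extensions supported:\r\n"
    " MDTM\r\n"
    " REST STREAM\r\n"
    " SIZE\r\n"
    "211 End\r\n"
)

features = parse_feat(SAMPLE_REPLY)
print(sorted(features))
```

A client can consult the resulting dictionary before using an extension and fall back to plain FTP behaviour when a feature is absent.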


4.1 Features of GridFTP
Next, we describe the protocol extensions in GridFTP. Some of these features are supported by FTP extensions that have already been standardized in the IETF, but which are currently seldom implemented. Other features are new extensions to FTP.


4.1.1 Grid Security Infrastructure (GSI) and Kerberos support
Robust and flexible authentication, integrity, and confidentiality features are critical when transferring or accessing files. GridFTP must support Grid Security Infrastructure (GSI) and Kerberos authentication, with user controlled setting of various levels of data integrity and/or confidentiality. GridFTP provides this capability by implementing the GSSAPI authentication mechanisms defined by RFC 2228, “FTP Security Extensions”.
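RFC 2228 defines the control-channel commands AUTH, ADAT, PBSZ and PROT for this purpose. The sketch below only formats the command strings a client might send. The token is a placeholder for the bytes a real GSSAPI library would produce, and the helper name security_commands and the default buffer size are assumptions. A real exchange may need several ADAT round trips interleaved with server replies, which this sketch does not model.

```python
import base64

# Data channel protection levels defined by RFC 2228 for the PROT command.
PROT_LEVELS = {
    "clear": "C",         # no protection
    "safe": "S",          # integrity only
    "confidential": "E",  # confidentiality only
    "private": "P",       # integrity and confidentiality
}


def security_commands(token: bytes, level: str, buffer_size: int = 1048576) -> list[str]:
    """Format the commands that start a GSSAPI session and pick a protection level."""
    return [
        "AUTH GSSAPI",
        # Security tokens travel base64-encoded in the ADAT argument.
        "ADAT " + base64.b64encode(token).decode("ascii"),
        f"PBSZ {buffer_size}",
        f"PROT {PROT_LEVELS[level]}",
    ]


# b"hello" stands in for an opaque GSSAPI token.
commands = security_commands(b"hello", "private")
print("\n".join(commands))
```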


4.1.2 Third-party control of data transfer
To manage large data sets for distributed communities, we must provide authenticated third-party control of data transfers between storage servers. A third-party operation allows a “third-party” user or application at one site to initiate, monitor and control a data transfer operation between two other “parties”: the source and destination sites for the data transfer. Our implementation adds GSSAPI security to the existing third-party transfer capability defined in the FTP standard. The “third-party” authenticates itself on a local machine, and GSSAPI operations authenticate the third party to the source and destination machines for the data transfer.
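The underlying third-party mechanism in standard FTP (RFC 959) works by putting one server into passive mode and handing its listening address to the other server with PORT. The sketch below builds that command sequence without the GSSAPI steps. The function names, file paths and the sample 227 reply are illustrative assumptions, and making the destination the passive side is one possible choice.

```python
import re


def parse_pasv(reply: str) -> str:
    """Extract the 'h1,h2,h3,h4,p1,p2' address from a 227 reply."""
    match = re.search(r"\((\d+(?:,\d+){5})\)", reply)
    if match is None:
        raise ValueError(f"not a PASV reply: {reply!r}")
    return match.group(1)


def third_party_plan(pasv_reply: str, src_path: str, dst_path: str) -> list[tuple[str, str]]:
    """Return (server, command) pairs for a server-to-server transfer.

    pasv_reply is the destination's answer to the first command; it is
    passed in because it is only known after PASV has been sent.
    """
    address = parse_pasv(pasv_reply)
    return [
        ("destination", "PASV"),
        ("source", f"PORT {address}"),
        ("destination", f"STOR {dst_path}"),
        ("source", f"RETR {src_path}"),
    ]


# Hypothetical reply and paths, for illustration only.
plan = third_party_plan(
    "227 Entering Passive Mode (192,0,2,10,19,137)",
    "/data/run1.dat",
    "/archive/run1.dat",
)
for server, command in plan:
    print(f"{server:>11}: {command}")
```

The data then flows directly between the two servers, while the initiating client only sees the control-channel replies.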


4.1.3 Parallel data transfer
On wide-area links, using multiple TCP streams in parallel (even between the same
source and destination) can improve aggregate bandwidth over using a single TCP
stream. GridFTP supports parallel data transfer through FTP command extensions and
data channel extensions.
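As a rough illustration of the idea, the sketch below divides a payload into contiguous byte ranges and copies each range with its own worker thread, where each worker stands in for one TCP stream. It does not speak the GridFTP data channel protocol. The names split_ranges and parallel_copy are assumptions made for this example.

```python
from concurrent.futures import ThreadPoolExecutor


def split_ranges(size: int, streams: int) -> list[tuple[int, int]]:
    """Divide [0, size) into `streams` contiguous (offset, length) ranges."""
    base, extra = divmod(size, streams)
    ranges, offset = [], 0
    for i in range(streams):
        # The first `extra` ranges carry one additional byte each.
        length = base + (1 if i < extra else 0)
        ranges.append((offset, length))
        offset += length
    return ranges


def parallel_copy(source: bytes, streams: int) -> bytes:
    """Copy `source` with one worker per range (a stand-in for one TCP stream)."""
    target = bytearray(len(source))

    def send(rng: tuple[int, int]) -> None:
        offset, length = rng
        # Ranges are disjoint, so workers never write to the same bytes.
        target[offset:offset + length] = source[offset:offset + length]

    with ThreadPoolExecutor(max_workers=streams) as pool:
        list(pool.map(send, split_ranges(len(source), streams)))
    return bytes(target)


payload = bytes(range(256)) * 4
copied = parallel_copy(payload, streams=4)
print(split_ranges(10, 4))
```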


4.1.4 Striped data transfer
Data may be striped or interleaved across multiple servers, as in a DPSS network disk cache or a striped file system. GridFTP includes extensions that initiate striped transfers, which use multiple TCP streams to transfer data that is partitioned among multiple servers. Striped transfers provide further bandwidth improvements over those achieved with parallel transfers. We have defined GridFTP protocol extensions that support striped data transfers.
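One way to picture striping is round-robin placement of fixed-size blocks on several servers, with each block tagged by its offset so that the receiver can reassemble the file regardless of arrival order. The sketch below models only that layout in memory. The block size, the function names and the round-robin policy are assumptions for illustration, not the layout used by any particular storage system.

```python
def stripe(data: bytes, servers: int, block_size: int) -> list[list[tuple[int, bytes]]]:
    """Assign fixed-size blocks to servers round-robin, keeping each block's offset."""
    layout: list[list[tuple[int, bytes]]] = [[] for _ in range(servers)]
    for index, offset in enumerate(range(0, len(data), block_size)):
        layout[index % servers].append((offset, data[offset:offset + block_size]))
    return layout


def reassemble(layout: list[list[tuple[int, bytes]]], size: int) -> bytes:
    """Rebuild the original data from offset-tagged blocks held by each server."""
    out = bytearray(size)
    for blocks in layout:
        for offset, block in blocks:
            out[offset:offset + len(block)] = block
    return bytes(out)


data = bytes(range(10))
layout = stripe(data, servers=3, block_size=4)
restored = reassemble(layout, len(data))
print([[offset for offset, _ in blocks] for blocks in layout])
```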


4.1.5 Partial file transfer
Many applications would benefit from transferring portions of files rather than complete files. This is particularly important for applications like high-energy physics analysis that require access to relatively small subsets of massive, object-oriented physics database files. The standard FTP protocol requires applications to transfer entire files, or the remainder of a file starting at a particular offset. GridFTP introduces new FTP commands to support transfers of subsets or regions of a file.
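The difference from standard FTP can be pictured locally: instead of reading from an offset to the end of the file, a client names an (offset, length) region and receives only those bytes. The sketch below shows that access pattern against an in-memory stream. It does not implement the GridFTP commands themselves, and the function names are assumptions.

```python
import io


def read_region(stream: io.BufferedIOBase, offset: int, length: int) -> bytes:
    """Return up to `length` bytes starting at `offset` from a seekable binary stream."""
    stream.seek(offset)
    return stream.read(length)


def read_regions(stream: io.BufferedIOBase, regions: list[tuple[int, int]]) -> list[bytes]:
    """Fetch several (offset, length) regions, e.g. a few objects from a large file."""
    return [read_region(stream, offset, length) for offset, length in regions]


# In-memory stand-in for a large database file.
database = io.BytesIO(b"0123456789abcdef")
chunks = read_regions(database, [(4, 3), (10, 2)])
print(chunks)
```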


4.1.6 Automatic negotiation of TCP buffer/window sizes
Using optimal settings for TCP buffer/window sizes can have a dramatic impact on data transfer performance. However, manually setting TCP buffer/window sizes is an error-prone process (particularly for non-experts) and is often simply not done. GridFTP extends the standard FTP command set and data channel protocol to support both manual setting and automatic negotiation of TCP buffer sizes for large files and for large sets of small files.
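A common rule of thumb for manual tuning, not taken from the paper, is to size the buffer to the bandwidth-delay product of the path: bandwidth multiplied by round-trip time. The sketch below computes that value and applies it to a socket with the standard SO_SNDBUF and SO_RCVBUF options. The function names and the example link figures are assumptions, and the operating system may clamp the requested size.

```python
import socket


def bdp_bytes(bandwidth_bits_per_s: float, rtt_s: float) -> int:
    """Bandwidth-delay product in bytes: the data in flight on a full pipe."""
    return round(bandwidth_bits_per_s * rtt_s / 8)


def make_tuned_socket(buffer_bytes: int) -> socket.socket:
    """Create a TCP socket and request the given send and receive buffer sizes."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Buffers should be set before connecting so the window can be negotiated.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, buffer_bytes)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, buffer_bytes)
    return sock


# Example figures only: a 100 Mbit/s path with a 50 ms round-trip time.
wanted = bdp_bytes(100e6, 0.05)
print(wanted)
```

With these example figures the product is 625000 bytes, well above the default buffer sizes of many systems, which is why an untuned single stream often cannot fill a long wide-area link.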


4.1.7 Support for reliable and restartable data transfer
Reliable transfer is important for many applications that manage data. Fault recovery methods for handling transient network failures, server outages, etc. are needed. The FTP standard includes basic features for restarting failed transfers that are not widely implemented. The GridFTP protocol exploits these features and extends them to cover the new data channel protocol.
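When data arrives over several streams, a single restart offset is not enough, because the bytes received before a failure need not form one contiguous prefix. A minimal sketch of the bookkeeping this implies, assuming the receiver records the half-open byte ranges it has stored, is shown below. The function name missing_ranges and the range representation are illustrative assumptions, not the restart marker format of the protocol.

```python
def missing_ranges(size: int, received: list[tuple[int, int]]) -> list[tuple[int, int]]:
    """Return the half-open (start, end) ranges of [0, size) not yet received."""
    gaps, cursor = [], 0
    for start, end in sorted(received):
        if start > cursor:
            # Nothing received between the covered prefix and this range.
            gaps.append((cursor, start))
        cursor = max(cursor, end)
    if cursor < size:
        gaps.append((cursor, size))
    return gaps


# Ranges stored before a simulated failure; they may overlap or be unordered.
to_resend = missing_ranges(100, [(0, 10), (50, 60), (5, 20)])
print(to_resend)
```

After a failure, only the returned ranges need to be requested again, which ties restart to the partial file transfer feature described earlier.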


4.2 The GridFTP Protocol Implementation
In this section, we briefly present the implementation of the GridFTP protocol in the
Globus Grid computing environment. Our current implementation is an alpha release of
the GridFTP libraries, available with limited support to a small number of users. The
current implementation supports partial file transfers, third-party transfers, parallel
transfers and striped transfers. This implementation does not yet support automatic
negotiation of TCP buffer/window sizes.


The implementation consists of two main libraries implemented in C: the globus_ftp_control_library and the globus_ftp_client_library. The globus_ftp_control_library implements the control channel API. This API provides routines for managing a GridFTP connection, including authentication, creation of control and data channels, and reading and writing data over data channels. Having separate control and data channels, as defined in the FTP protocol standard, greatly facilitates the support of such features as parallel transfers, striped transfers and third-party data transfers. For parallel and striped transfers, the control channel is used to specify a put or get operation; multiple parallel TCP data channels provide concurrent transfers. In third-party transfers, the initiator monitors or aborts transfers via the control channel, while data is transferred over one or more data channels between source and destination sites.
The globus_ftp_client_library implements the GridFTP client API. This API provides
higher-level client features on top of the globus_ftp_control library, including complete file get and put operations, calls to set the level of parallelism for parallel data transfers, partial file transfer operations, third-party transfers, and eventually, functions to set TCP buffer sizes.

 

 

Papers

GridFTP: http://grid-data-management.web.cern.ch/grid-data-management/docs/GridFTP-rfio-report.pdf

GridFTP Update: http://www.niknef.nl/user/templon/GridFTP.pdf

 


 

Work using GridFTP

Predicting the Performance of Wide Area Data Transfers, S. Vazhkudai, J. Schopf and I. Foster, 2002

Data Transfer Test Results using scp and gsincftp, Shahzad Muzaffar

 

 

Wed, 13 February, 2002 20:07
© 2001-2003, Yee-Ting Li, email: ytl@hep.ucl.ac.uk, Tel: +44 (0) 20 7679 1376, Fax: +44 (0) 20 7679 7145
Room D14, High Energy Particle Physics, Dept. of Physics & Astronomy, UCL, Gower St, London, WC1E 6BT