GridFTP Page
GridFTP is the protocol proposed for all data transfers on the Grid.
It extends the standard FTP protocol with facilities such as multistreamed
transfer, autotuning and Globus-based security.
Extracted from "Secure, Efficient Data Transport and Replica Management
for High-Performance Data-Intensive Computing", Allcock et al.:
Data-intensive scientific and engineering applications require both
transfers of large amounts of data (terabytes or petabytes) between storage
systems and access to large amounts of data (gigabytes or terabytes) by
many geographically distributed applications and users for analysis,
visualization, etc.
There are already a number of storage systems in use by the Grid community,
each of which was designed to satisfy specific needs and requirements
for storing, transferring and accessing large datasets. These include
the Distributed Parallel Storage System (DPSS) and the High Performance
Storage System (HPSS), which provide high-performance access to data and
utilize parallel data transfer and/or striping across multiple servers
to improve performance [1][2]. The Distributed File System (DFS) supports
high-volume usage, dataset replication and local caching. The Storage
Resource Broker (SRB) connects heterogeneous data collections, provides
a uniform client interface to storage repositories, and provides a metadata
catalog for describing and locating data within the storage system [4].
Other systems allow clients to access structured data from a variety of
underlying storage systems (e.g., HDF5 [5]).
Unfortunately, most of these storage systems utilize incompatible and
often unpublished protocols for accessing data, and therefore require
the use of their own client libraries to access data. These incompatible
protocols and client libraries effectively partition the datasets available
on the grid. Applications that require access to data stored in different
storage systems must use multiple access methods. To overcome these incompatible
protocols, we have proposed a universal grid data transfer and access
protocol called GridFTP that provides secure, efficient data movement
in Grid environments. This protocol, which extends the standard FTP
protocol, provides a superset of the features offered by the various Grid
storage systems currently in use. We argue that using GridFTP as a common
data access protocol would be mutually advantageous to grid storage providers
and users. Storage providers would gain a broader user base, because their
data would be available to any client, while storage users would gain
access to a broader range of storage systems and data.
We chose to extend the FTP protocol because we observed that FTP is the
protocol most commonly used for data transfer on the Internet and the
most likely candidate for meeting the Grid's needs. The FTP protocol
is an attractive choice for several reasons. First, FTP is a widely implemented
and well-understood IETF standard protocol. As a result, there is a large
base of code and expertise from which to build. Second, the FTP protocol
provides a well-defined architecture for protocol extensions and supports
dynamic discovery of the extensions supported by a particular implementation.
Third, numerous groups have added extensions through the IETF, and some
of these extensions will be particularly useful in the Grid. Finally,
in addition to client/server transfers, the FTP protocol also supports
transfers directly between two servers, mediated by a third-party client
(i.e., third-party transfer).
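The dynamic discovery of extensions mentioned above is done in standard FTP with the FEAT command (RFC 2389), whose multi-line reply lists one feature per line, each prefixed by a single space. The following is a minimal sketch of how a client might parse such a reply. The function name and the sample reply are illustrative assumptions; the sample lists only ordinary FTP features, not the features any particular GridFTP server advertises.

```python
def parse_feat_reply(reply: str) -> dict[str, str]:
    """Parse a multi-line FEAT reply into {FEATURE_NAME: parameters}."""
    features = {}
    for line in reply.splitlines():
        # Feature lines start with a single space; the "211" status lines do not.
        if not line.startswith(" "):
            continue
        name, _, params = line.strip().partition(" ")
        if name:
            features[name.upper()] = params
    return features


# Illustrative reply only (an assumption, not captured from a real server).
SAMPLE_REPLY = (
    "211-Extensions supported:\r\n"
    " SIZE\r\n"
    " MDTM\r\n"
    " REST STREAM\r\n"
    "211 End\r\n"
)

features = parse_feat_reply(SAMPLE_REPLY)
print(sorted(features))
```

A client can then check `"REST" in features` before relying on an extension, instead of probing the server by trial and error.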
4.1 Features of GridFTP
Next, we describe the protocol extensions in GridFTP. Some of these features
are supported by FTP extensions that have already been standardized in the
IETF, but which are currently seldom implemented. Other features are new
extensions to FTP.
4.1.1 Grid Security Infrastructure (GSI) and Kerberos support
Robust and flexible authentication, integrity, and confidentiality features
are critical when transferring or accessing files. GridFTP must support
Grid Security Infrastructure (GSI) and Kerberos authentication, with
user-controlled setting of various levels of data integrity and/or confidentiality.
GridFTP provides this capability by implementing the GSSAPI authentication
mechanisms defined by RFC 2228, FTP Security Extensions.
4.1.2 Third-party control of data transfer
To manage large data sets for distributed communities, we must provide
authenticated third-party control of data transfers between storage servers.
A third-party operation allows a third-party user or application
at one site to initiate, monitor and control a data transfer operation
between two other parties: the source and destination sites
for the data transfer. Our implementation adds GSSAPI security to the
existing third-party transfer capability defined in the FTP standard.
The third party authenticates itself on a local machine, and
GSSAPI operations authenticate the third party to the source and destination
machines for the data transfer.
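The existing third-party capability of standard FTP (RFC 959) works by asking one server to listen (PASV) and telling the other server to connect to that address (PORT). The sketch below shows only this unauthenticated command sequence; the GSSAPI security that GridFTP adds is not modelled. The helper names, paths and sample reply are hypothetical.

```python
import re


def parse_pasv_reply(reply: str) -> tuple[str, int]:
    """Extract (host, port) from a '227 ... (h1,h2,h3,h4,p1,p2)' reply."""
    match = re.search(r"\((\d+),(\d+),(\d+),(\d+),(\d+),(\d+)\)", reply)
    if match is None:
        raise ValueError(f"not a PASV reply: {reply!r}")
    h1, h2, h3, h4, p1, p2 = (int(x) for x in match.groups())
    return f"{h1}.{h2}.{h3}.{h4}", p1 * 256 + p2


def port_argument(host: str, port: int) -> str:
    """Encode host and port as the comma-separated PORT command argument."""
    return ",".join(host.split(".") + [str(port // 256), str(port % 256)])


def third_party_commands(pasv_reply_from_destination: str,
                         source_path: str,
                         destination_path: str) -> list[tuple[str, str]]:
    """Commands the third party sends after PASV was issued to the destination."""
    host, port = parse_pasv_reply(pasv_reply_from_destination)
    return [
        ("source", f"PORT {port_argument(host, port)}"),
        ("destination", f"STOR {destination_path}"),
        ("source", f"RETR {source_path}"),
    ]


# Hypothetical reply from the destination server's control channel.
plan = third_party_commands(
    "227 Entering Passive Mode (192,0,2,10,195,80)", "/data/in.dat", "/data/out.dat"
)
for server, command in plan:
    print(server, command)
```

The data itself never passes through the third party: it only exchanges short commands on the two control channels.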
4.1.3 Parallel data transfer
On wide-area links, using multiple TCP streams in parallel (even between
the same source and destination) can improve aggregate bandwidth over using
a single TCP stream. GridFTP supports parallel data transfer through FTP
command extensions and data channel extensions.
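One way to picture parallel transfer is to cut the file into blocks that each carry their own offset, so the receiver can place them correctly no matter which stream delivers them or in what order. The sketch below simulates this in memory, with lists standing in for TCP streams. It is an illustration of the idea under that assumption, not the GridFTP data channel format, and all names are hypothetical.

```python
def split_into_blocks(data: bytes, block_size: int) -> list[tuple[int, bytes]]:
    """Cut data into (offset, payload) blocks of at most block_size bytes."""
    return [(off, data[off:off + block_size])
            for off in range(0, len(data), block_size)]


def assign_round_robin(blocks, n_streams: int):
    """Distribute blocks over n_streams simulated streams in round-robin order."""
    streams = [[] for _ in range(n_streams)]
    for index, block in enumerate(blocks):
        streams[index % n_streams].append(block)
    return streams


def reassemble(streams, total_size: int) -> bytes:
    """Write each block at its own offset; arrival order does not matter."""
    out = bytearray(total_size)
    for stream in streams:
        for offset, payload in stream:
            out[offset:offset + len(payload)] = payload
    return bytes(out)


data = bytes(range(256)) * 4
streams = assign_round_robin(split_into_blocks(data, 100), n_streams=3)
# Deliver the streams in reverse order to show that ordering is irrelevant.
restored = reassemble(list(reversed(streams)), len(data))
print(restored == data)
```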
4.1.4 Striped data transfer
Data may be striped or interleaved across multiple servers, as in a DPSS
network disk cache or a striped file system. GridFTP includes extensions
that initiate striped transfers, which use multiple TCP streams to transfer
data that is partitioned among multiple servers. Striped transfers provide
further bandwidth improvements over those achieved with parallel transfers.
We have defined GridFTP protocol extensions that support striped data transfers.
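To make the notion of data partitioned among servers concrete, the sketch below assumes a simple round-robin layout in which consecutive fixed-size stripes live on consecutive servers, and computes which server must supply each piece of a requested byte range. The layout, the stripe size and the function name are assumptions for illustration; the text does not specify how any given storage system lays out its stripes.

```python
def stripe_ranges(offset: int, length: int, stripe_size: int, n_servers: int):
    """Return (server_index, start, end) pieces covering [offset, offset + length)."""
    pieces = []
    pos = offset
    end = offset + length
    while pos < end:
        stripe_index = pos // stripe_size
        stripe_end = (stripe_index + 1) * stripe_size
        piece_end = min(end, stripe_end)
        # Round-robin layout (assumed): stripe k is stored on server k mod n.
        pieces.append((stripe_index % n_servers, pos, piece_end))
        pos = piece_end
    return pieces


for server, start, stop in stripe_ranges(150, 300, stripe_size=100, n_servers=3):
    print(f"server {server}: bytes {start}..{stop - 1}")
```

Each piece can then be fetched over its own TCP stream from the server that holds it.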
4.1.5 Partial file transfer
Many applications would benefit from transferring portions of files rather
than complete files. This is particularly important for applications like
high-energy physics analysis that require access to relatively small subsets
of massive, object-oriented physics database files. The standard FTP protocol
requires applications to transfer entire files, or the remainder of a
file starting at a particular offset. GridFTP introduces new FTP commands
to support transfers of subsets or regions of a file.
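The difference between the two behaviours can be illustrated locally, with an in-memory file standing in for the remote one. In this sketch, `read_remainder` mimics what a standard FTP restart at an offset delivers and `read_region` mimics a transfer of a bounded region. Both names and the 1 MiB file are hypothetical, and no actual GridFTP command is shown.

```python
import io


def read_remainder(source, offset: int) -> bytes:
    """Standard FTP restart semantics: everything from offset to end of file."""
    source.seek(offset)
    return source.read()


def read_region(source, offset: int, length: int) -> bytes:
    """Region semantics: only `length` bytes starting at `offset`."""
    source.seek(offset)
    return source.read(length)


# Hypothetical 1 MiB "database file" held in memory for illustration.
database = io.BytesIO(bytes(1024 * 1024))
remainder = read_remainder(database, 4096)
region = read_region(database, 4096, 512)
print(len(remainder), len(region))
```

For an application that needs only a small record, the region form moves far fewer bytes than restarting at an offset and discarding the rest.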
4.1.6 Automatic negotiation of TCP buffer/window sizes
Using optimal settings for TCP buffer/window sizes can have a dramatic
impact on data transfer performance. However, manually setting TCP buffer/window
sizes is an error-prone process (particularly for non-experts) and is often
simply not done. GridFTP extends the standard FTP command set and data
channel protocol to support both manual setting and automatic negotiation
of TCP buffer sizes for large files and for large sets of small files.
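A common rule of thumb for manual tuning is to size the TCP buffer to the bandwidth-delay product of the path, that is, bandwidth multiplied by round-trip time. The sketch below computes that value and applies it to a socket with the standard `SO_SNDBUF` and `SO_RCVBUF` options. It illustrates manual setting only; the negotiation procedure GridFTP uses is not described here, and the link figures are made-up examples.

```python
import socket


def bandwidth_delay_product(bandwidth_bits_per_s: int, rtt_ms: int) -> int:
    """Bytes in flight needed to keep the path full: bandwidth x round-trip time."""
    return bandwidth_bits_per_s * rtt_ms // 8000  # bits -> bytes, ms -> s


def make_tuned_socket(buffer_bytes: int) -> socket.socket:
    """Create a TCP socket and request the given send/receive buffer sizes.

    The operating system may round or cap the requested value.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, buffer_bytes)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, buffer_bytes)
    return sock


# Example figures (assumed): a 1 Gbit/s path with a 100 ms round-trip time.
buffer_bytes = bandwidth_delay_product(1_000_000_000, 100)
print(buffer_bytes)
```

With these example figures the buffer comes to 12.5 million bytes, far above typical operating system defaults, which is why untuned transfers over long, fast links often underperform.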
4.1.7 Support for reliable and restartable data transfer
Reliable transfer is important for many applications that manage data.
Fault recovery
methods for handling transient network failures, server outages, etc.
are needed. The FTP standard includes basic features for restarting failed
transfers that are not widely implemented. The GridFTP protocol exploits
these features and extends them to cover the new data channel protocol.
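Restarting a transfer that used several streams requires knowing which byte ranges actually arrived, since they need not form one contiguous prefix. The following sketch keeps a list of received half-open ranges and derives what still has to be requested. It shows the bookkeeping idea only; the function names are hypothetical and the restart markers exchanged by the protocol are not modelled.

```python
def merge_ranges(ranges):
    """Merge overlapping or adjacent half-open (start, end) ranges."""
    merged = []
    for start, end in sorted(ranges):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def missing_ranges(received, total_size: int):
    """Return the half-open ranges of the file not yet covered by `received`."""
    missing = []
    pos = 0
    for start, end in merge_ranges(received):
        if start > pos:
            missing.append((pos, start))
        pos = max(pos, end)
    if pos < total_size:
        missing.append((pos, total_size))
    return missing


# Ranges recorded before a simulated failure, in arrival order.
received = [(0, 100), (300, 400), (100, 200)]
print(missing_ranges(received, total_size=500))
```

After a failure, only the missing ranges need to be transferred again.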
4.2 The GridFTP Protocol Implementation
In this section, we briefly present the implementation of the GridFTP
protocol in the Globus Grid computing environment. Our current implementation
is an alpha release of the GridFTP libraries, available with limited support
to a small number of users. The current implementation supports partial
file transfers, third-party transfers, parallel transfers and striped
transfers. This implementation does not yet support automatic negotiation
of TCP buffer/window sizes.
The implementation consists of two main libraries implemented in C: the
globus_ftp_control_library and the globus_ftp_client_library. The globus_ftp_control_library
implements the control channel API. This API provides routines for managing
a GridFTP connection, including authentication, creation
of control and data channels, and reading and writing data over data channels.
Having separate control and data channels, as defined in the FTP protocol
standard, greatly facilitates the support of such features as parallel
transfers, striped transfers and third-party data transfers. For parallel
and striped transfers, the control channel is used to specify a put or
get operation; multiple parallel TCP data channels provide concurrent
transfers. In third-party transfers, the initiator monitors or aborts
transfers via the control channel, while data is transferred over one
or more data channels between source and destination sites.
The globus_ftp_client_library implements the GridFTP client API. This
API provides higher-level client features on top of the globus_ftp_control_library,
including complete file get and put operations, calls to set the level
of parallelism for parallel data transfers, partial file transfer operations,
third-party transfers, and eventually, functions to set TCP buffer sizes.
Papers
GridFTP: http://grid-data-management.web.cern.ch/grid-data-management/docs/GridFTP-rfio-report.pdf
GridFTP Update: http://www.niknef.nl/user/templon/GridFTP.pdf
Work using GridFTP
Predicting the Performance of Wide Area Data Transfers, S. Vazhkudai, J. Schopf
and I. Foster, 2002
Data Transfer Test Results using scp and gsincftp, Shahzad Muzaffar
Wed, 13 February, 2002 20:07