2004-07-15
==========

Generic applications
======

Distributed a questionnaire.

  Geophysics
  MAGIC (complex chemical simulations)
  Monte-Carlo stuff (CORSIKA)

Other interests: GRACE, EU Space Agency (GAIA satellite sim), Planck
satellite astrophys community, ITC-IRST, CSP (Torino, Italy; business
apps), SIMDAT project (industrial apps).

JRA1 overview
======

Security: based on a model: PDP (policy decision point) & PEP (policy
enforcement point). Tries to be as flexible as possible.

R-GMA peeps are after use-cases and prototype applications monitoring.
Job wrappers hold an MPP (Memory Primary Producer); these feed a DbSP
(Database Secondary Producer).

Keywords: PTF

Architecture document: https://edms.cern.ch/document/476451/
Release Plan: https://edms.cern.ch/document/468699/
Prototype installation: http://egee-jra1.web.cern.ch/egee-jra1/Prototype/testbed.htm

ARDA
======

Input from generic (?)
======

Message passing interface

There is no MPI on separate nodes. The solution appears to be to just
copy the appropriate files with RSync or SCP.

Meh, well, the point was that MPI could be made to work by just
reconfiguring the site, and not having to worry about the middleware.
The file copying stuff was clearly to do with getting the same setup on
all the different nodes.

Things we'd like to know from NA4
======

We need to know if there are any obvious user requirements that I've
missed and will not actually be middleware requirements.

Interactive jobs: are there interactive jobs that also require fairly
intensive network use for the interactive part? That is, a command-line
interface means that almost any network is acceptable. However,
real-time video requires good bandwidth and low jitter. Some things
might need low latency and low jitter, maybe?
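
Aside: a rough sketch of how "latency" and "jitter" could be put into
numbers when asking NA4 for requirements. Everything here is an
assumption for illustration: the function name, the sample values, and
the choice to define jitter as the mean absolute difference between
consecutive round-trip-time samples (other definitions exist, and none
was agreed in the meeting).

```python
from statistics import mean


def latency_and_jitter(rtts_ms):
    """Return (mean latency, jitter) for a list of RTT samples in ms.

    Jitter is taken as the mean absolute difference between consecutive
    samples. This is one simple definition, assumed for illustration.
    """
    if len(rtts_ms) < 2:
        raise ValueError("need at least two RTT samples")
    diffs = [abs(b - a) for a, b in zip(rtts_ms, rtts_ms[1:])]
    return mean(rtts_ms), mean(diffs)


# Hypothetical samples: two links with the same mean latency.
steady = [20.0, 21.0, 20.0, 21.0]   # fine for real-time video
jittery = [5.0, 36.0, 5.0, 36.0]    # same mean, much worse for video

print(latency_and_jitter(steady))
print(latency_and_jitter(jittery))
```

The point of the two sample sets is that mean latency alone does not
separate them; only the jitter figure does, which is why the question to
NA4 needs to ask about both.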
2004-07-16
==========

NA4/SA2/JRA4 meeting
======

Networking Tutorial
------

Introduce TCP/IP
Introduce Diffserv
Introduce Premium IP:
  -> Premium IP gets 200Mb under Geant
Introduce Less-than Best Effort
  -> allows users to use the remaining bandwidth without worrying about
     stepping on others' toes
Introduce MultiProtocol Label Switching
  -> allows you to build a circuit within the network
Introduce Multicast

Deployment of the presented services may be a problem. There are 41
networks involved with EGEE. Lots don't tell EGEE whether they implement
the various services or not.

Questions:

Q: You mainly concentrated on the network situation in Europe.
A: Didn't present some technologies, because there is no info on when
they will be deployed.

Q: Clarification of why you need to make changes to run IPv6 apps.

Q: Why do applications ever need to touch the network? Most of the stuff
they use is encapsulating stuff like SOAP etc.
A: Well, just to make sure, yes? We realise it's mostly the middleware
that will have to deal with protocols.

JRA4: End-Users' Perspective
------

Sub activities:
  Bandwidth allocation and reservation
    -> supply this technology to applications.
  Performance monitoring and Diagnostic tools

Components and Requirements
  -> network element is not really a part of things yet

Possible middleware requirements

NA4 Network requirements

There are three network requirements from them:
  -> Comms encryption
     We think it ought to be done at the application/middleware level.
     We can't really do it in the network.
  -> Outbound connectivity
     We want to understand this better.
  -> Guaranteed bandwidth
     Need some use-cases, please.

Q (Yannick Patios): This is the numbering that was in the first written
document. These numbers have changed now that we've changed to the
online thing.

Q (Yannick): Data encryption: understand that it's hard to do at the
network level. Disagree that it should be done at the application level.
It should be a middleware service; then it can be done with middleware
or on the network.

Q: Must be careful not to make encryption mandatory.

Q: We should provide a service to make the encryption service easy to
use. Doing encryption at the network layer doesn't ensure that data is
kept secure, as you don't have control over the network.

Yannick: This stuff is being looked at by the Security group.

--

Q (Yannick): Outbound connectivity: there are some apps that need
outbound connectivity.

Q (?): From a security PoV this is an anti-requirement. The application
*actually* must be able to access outside stuff via well-known
software. It's a middleware and applications requirement. The
applications want the functionality, but someone has to provide it. We
*must not* work around the problem by allowing WNs access to the
internet.

--

Q (Massimo): Guaranteed bandwidth: there are use-cases. From an
operations point of view there are clear cases where guaranteed
bandwidth is required.
A: We have some idea, but we would like as many as possible written down
concretely.

Q: What is the meaning of guaranteed bandwidth? Does it require special
hardware?
A: It does need special hardware, but all the NRENs etc have hardware
that works.

Q (Cal Loomis): Call for more use-cases is a good thing, because
different apps have different precise needs.

Detailed NA4 use cases
------

Roberto Barbera

When starting a year ago the objectives were:
  * See if the bandwidths can cope with the needs of ALICE
  * Spot possible bottlenecks (IO<>LAN<>WAN<>LAN<>IO)

Started with a standard configuration of the TCP stack.

Made sure all the machines had each others' SSH keys, so that secure
transfers could be done without typing passwords.

Automatic procedure.
  -> checks if there is any bit loss or whatever (meh? this is TCP?)

They chose sites so that there were lots of different RTTs.

Lots of people not very happy with them, because they managed to
saturate the network for several days.
Then tried to do the tests in a way that looked like a multi-tier
system (as opposed to an entirely peer-based one). Lots of transferring
data from tier 1 to all sites, tier 2 to tier 1, and tier 2 to all
sites.

Really managed to saturate the networks. With not many parallel streams,
either (single figures).

With very large buffer sizes, the bottleneck was the IO system.

Conclusions:
  -> First real stress test of Italian GARR system
  -> The available bandwidths are inadequate
  -> Saw limits of NFS
  -> Big problems for tier 1 and tier 2, but the tier 3 sites are fine.

This year: ALICE is doing a data challenge.

Conclusions: VOs' planned and unplanned activities have big impacts on
networks.

Q (Javier): You mentioned that the network means not only high
bandwidth but also reliable e2e connections. Do you think this can be
solved by just supplying enough hardware? Do you think that some sort of
software-based bandwidth-on-demand would help?
A: Problem is that network is expensive. If you think about
productions: you know the sites (tier 1, tier 2 etc). Therefore you can
maximise the link so that the flow of data is fine (tier 1->tier
2->tier 3). For analysis, though, you get files going in all
directions. This is very dynamic. Bandwidth on demand is therefore
crucial. Things change, too.

BioMed use-cases
------

G Romier

Three apps with different requirements:

GATE (Geant4 application for tomographic emission)
EGEE pilot application
  * Dedicated to nuclear medicine, radiotherapy, brachytherapy
  * Monte Carlo platform.

GATE typical values:
  -> Image size: 40MB
  -> Computation time: 3-8 hours on 1 node
  -> Several hours on 20-50 nodes
  -> More nodes, faster

GATE target values:
Need to be able to submit jobs and receive results from a hospital site
connected through a conventional network.
Requires that the complete execution of the job occurs in a given time,
e.g. 8 hours.
Guarantee data integrity. No data loss during network transfers.
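
Aside: one common way to check the "data integrity" requirement end to
end is to compare a checksum of the file at the hospital site with a
checksum of the copy that arrives on the grid side. A minimal sketch
follows, assuming SHA-256 and hypothetical function names; nothing in
the talk says this is how GATE or the middleware actually does it.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, read in 1 MiB chunks.

    Reading in chunks keeps memory use flat for a ~40MB image.
    """
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def transfer_is_intact(source_path, received_path):
    """True if the received copy has the same digest as the source."""
    return sha256_of(source_path) == sha256_of(received_path)
```

In practice the two digests would be computed on different machines and
only the short hex string would be sent alongside the data; the
two-path helper above is just the simplest form to test locally.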
Application 2: PTM 3D / G-PDM 3D

App using LCG2 middleware
Treats radio images
To be deployed in March or April

Main part of the computation is 100-1000 independent computations of
1-10 seconds each.
Uses data interactively produced by the client on the hospital network.
30 Kb data for each subcomputation (input or output).
Impossible to store in advance on a SE - interactive
Sequential throughput 3-30 Kb/s
Parallelism from 16-64

Target values:
Needs to be able to submit and control jobs from a conventional network.
Data must be streamed to and from the job in a secure manner.
Guarantee of completion of job once it's accepted. You must know
immediately.

App 3: SMRI 3D MRI simulator, parallelised with MPI

Test was done on 100Mbit/s switched cluster of 18 machines.
Master aggregates a global rate of 1-2Mbit/s with 2D, 50Mbit/s with 3D.
The algorithm is limited by the network.

Other ideas:
Video conferencing service would be a useful tool.
In the future, patient data will be shared between hospitals.
Traceability will be required.

Requirements:
QoS: limited duration variations, data integrity, computation
completion, traceability of allocated resources.

Q (Massimo): Is security not a requirement?
A: No, data is anonymised before it is sent out, so there's less need
for security.

Q (Javier): Why is connectivity to EGEE a network requirement?
A: Because small steering packets are sent back and forth over the
network. The UI machine can't be an EGEE machine.
Javier: So it's a problem for security?

Q (?): The problems are not the sheer volume of data, but moving it
around into sensible places. I don't think it's that hard a problem, as
there are lots of ways to move data around using well-known middleware
and bits.

Q (Yannick): For this application, the solution they used was a proxy to
allow them to work with site policies and so on. They were not an
enormous problem.

Q (?): The solution is probably all there in EGEE.
A project called DILIGENT assumes that there are a large number of
resources available, and that they are made use of at a lower level.

Summary of requirements SA2-JRA4
------

Mathieu Goutelle

HEP use case
Lots of big numbers. See the slides.

Biomedical case:

HEP requirements:
Relatively simple -> Big fat pipe!
Just needs as much bandwidth as the network can provide.
Also, maybe, guaranteed bandwidth.

Biomed:
Critical:
  Interactivity (guaranteed bandwidth?)
  Availability (bw and robustness)
  Deadline (transfer scheduling and bandwidth reservation)
  Access from clients outside EGEE (outbound connectivity)

End-to-end problem:
  Performance of the total path not under EGEE/GEANT/NRENs control
  Services continuity.

Discussion
------

Javier: Would really like a long discussion, can't have it here.
?: New apps?
Javier: see our webby
G Romier: also, we want that sort of info (NA4)
Yannick: the use cases they have are being put on their website.

Final Discussion
================

Non biomed/non HEP requirements:
----

Licensing for IDL?