Syncing to Archival Storage

This is a user-level guide for syncing a directory to CAC Archival Storage using Globus.

Prerequisites

  • You know how to log into Globus.
  • You are a user of a CAC project with the archival storage service enabled. In this document, <CACProject> denotes your CAC project name.
  • On the Linux host from which you want to run sync commands (either one time or on a regular schedule), install the Globus CLI client. The syncing script is a bash shell script, so only Linux is supported.
    • Tip: If the pip3 install globus-cli command works for you, you can skip the rest of the Globus CLI client installation documentation.
  • If the source directory is not located on an existing Globus Connect Server endpoint, install Globus Connect Personal for Linux, macOS, or Windows on the host where the source directory is located.
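
For the Globus CLI prerequisite, the following is a minimal sketch of a per-user install and a quick check that the command is reachable. The function names install_globus_cli and globus_cli_available are hypothetical helpers introduced here; the sketch assumes pip3 is already present and that pip's per-user script directory (normally ~/.local/bin on Linux) is on your PATH.

```shell
# Install the Globus CLI for the current user only (no root access needed).
install_globus_cli() {
    pip3 install --user globus-cli
}

# Succeed only if a `globus` command can be found by the shell.
globus_cli_available() {
    command -v globus >/dev/null 2>&1
}

if globus_cli_available; then
    # Print the installed CLI version as a sanity check.
    globus version
else
    echo "globus not found; run install_globus_cli and check that ~/.local/bin is on PATH" >&2
fi
```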

Log into Globus using CLI

On the Linux host from which you want to run sync commands:

  • Log into Globus using Globus CLI:

    $ globus login
    Please authenticate with Globus here:
    ------------------------------------
    https://auth.globus.org/v2/oauth2/authorize?...........
    ------------------------------------
    
    Enter the resulting Authorization Code here:
    
  • Copy and paste the URL https://auth.globus.org/v2/oauth2/authorize?........... into a web browser and log into Globus as instructed. After logging in, copy the authorization code, paste it into the session where you ran the globus login command, and press Enter. On success, the CLI prints:

    You have successfully logged in to the Globus CLI!
    
    You can check your primary identity with
        globus whoami
    
    For information on which of your identities are in session use
        globus session show
    
    Logout of the Globus CLI with
        globus logout
    
  • Verify that you are logged in by running the globus whoami command. The output should show your Globus ID:

    $ globus whoami
    shl1@cornell.edu
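
If you plan to run sync commands unattended, it can help to check the login state before starting a transfer. The sketch below wraps globus whoami in a hypothetical helper named require_globus_login. It assumes that globus whoami exits with a non-zero status when no login is available, which is worth confirming against your installed CLI version.

```shell
# Succeed and print the identity if the Globus CLI has a login;
# otherwise print a hint to stderr and fail.
require_globus_login() {
    local identity
    # Assumption: `globus whoami` returns non-zero when not logged in.
    if ! identity=$(globus whoami 2>/dev/null); then
        echo "Not logged in to Globus; run 'globus login' first." >&2
        return 1
    fi
    echo "Logged in as ${identity}"
}
```

A wrapper script could call require_globus_login || exit 1 before doing any work.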
    

Make a Guest Collection on CAC Archive

  • In a web browser, log into Globus. Under File Manager, open the cac#archive02 collection and navigate to the /<CACProject> directory. Optionally, create a new directory there to receive the data copied from the source directory.

  • Follow the documentation on How To Share Data Using Globus to make the destination directory (the newly created one, if you made one) a guest collection.

Configure the Source

  • If your source directory is located on an existing Globus Connect Server endpoint, you will need to make it a guest collection just as you did for the destination directory on CAC Archive.

  • If the source directory is not located on an existing Globus Connect Server endpoint, install Globus Connect Personal for Linux, macOS, or Windows on the host where the source directory is located, then start the Globus Connect Personal endpoint on that host.

Locate Source and Destination

  • Back in the Globus CLI, find the IDs of the source and destination endpoints using the globus endpoint search --filter-scope my-endpoints command:

    $ globus endpoint search --filter-scope my-endpoints
    ID                                   | Owner            | Display Name          
    ------------------------------------ | ---------------- | ----------------------
    4c8b5dda-389e-11ea-9710-021304b0cca7 | shl1@cornell.edu | my_source_endpoint
    606579ae-5b03-11e9-bf32-0edbf3a4e7ee | shl1@cornell.edu | cac_archive_endpoint
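
Rather than copying IDs by hand, you can ask the CLI to print just the ID for a given display name. This is a sketch using the CLI's --jmespath and --format unix output options; endpoint_id_by_name is a hypothetical helper, and the display names are the ones from the sample output above. It assumes the JSON output lists endpoints under a top-level DATA array with id and display_name fields.

```shell
# Print the ID of the first endpoint you own whose display name equals $1.
endpoint_id_by_name() {
    local name="$1"
    globus endpoint search --filter-scope my-endpoints \
        --jmespath "DATA[?display_name=='${name}'].id | [0]" \
        --format unix
}

# Example usage (requires a logged-in Globus CLI):
#   SOURCE_ENDPOINT=$(endpoint_id_by_name my_source_endpoint)
#   DESTINATION_ENDPOINT=$(endpoint_id_by_name cac_archive_endpoint)
```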
    

Install the cli-sync.sh script

  • Download the cli-sync.sh script onto your Linux host.
  • Open the cli-sync.sh file and set the following variables to appropriate values:

    • SOURCE_ENDPOINT: ID of your source endpoint
    • DESTINATION_ENDPOINT: ID of your destination endpoint
    • SOURCE_PATH: Should probably be "/"
    • DESTINATION_PATH: Should probably be "/"
    • SYNCTYPE: Read the comments in the script and choose carefully. checksum is the safest but slowest option because it makes the destination host (CAC Archive) read the copied files back from disk to verify their checksums.
  • You can now run the cli-sync.sh script directly from the shell or as a cron job for scheduled archival.
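
The contents of cli-sync.sh are not reproduced in this guide. As a rough illustration of how the variables above fit together, the sketch below shows one way a recursive sync can be expressed with globus transfer and its --sync-level option. It is a hypothetical stand-in, not the actual script: the run_sync function, the archive-sync label, and the paths in the crontab comment are assumptions, and the endpoint IDs are the sample values from the search output above.

```shell
#!/bin/bash
# Hypothetical stand-in for the core of a sync script; not the real cli-sync.sh.
SOURCE_ENDPOINT="4c8b5dda-389e-11ea-9710-021304b0cca7"
DESTINATION_ENDPOINT="606579ae-5b03-11e9-bf32-0edbf3a4e7ee"
SOURCE_PATH="/"
DESTINATION_PATH="/"
SYNCTYPE="checksum"   # globus transfer accepts: exists, size, mtime, checksum

# Submit a recursive transfer that copies only files the sync level
# considers missing or changed at the destination.
run_sync() {
    globus transfer --recursive --sync-level "$SYNCTYPE" \
        --label "archive-sync" \
        "${SOURCE_ENDPOINT}:${SOURCE_PATH}" \
        "${DESTINATION_ENDPOINT}:${DESTINATION_PATH}"
}

# Example crontab entry (hypothetical paths) to run the script daily at 02:00:
#   0 2 * * * /home/user/cli-sync.sh >> /home/user/cli-sync.log 2>&1
```

Calling run_sync after a successful globus login submits the transfer task; the transfer itself then runs asynchronously in the Globus service.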