5th May, 2015
Last month the NIH issued a position statement on using the cloud for the storage and analysis of controlled-access data, which is subject to the NIH Genomic Data Sharing Policy. The statement is worth a read, as is the blog post by Vivien at the NIH.
“One small step for NIH, one giant leap forward for the community”
If we’re talking about controlled-access datasets at the NIH, we’re really talking about dbGaP, the database of Genotypes and Phenotypes, a collection of data relating genotypes to phenotypes. Since it contains patient phenotypic information, access to the dataset is restricted to scientific research consistent with the informed consent agreements provided by individual research participants. A look at the best practices for working with the data shows just how seriously the NIH and PIs take access (quite rightly): no direct access to the internet, password requirements, principle of least privilege, data destruction policies. The list goes on.
With the updated guidance from the NIH, researchers can now meet these requirements while storing and analyzing this important dataset in the cloud. That’s a big deal: the cloud is effectively custom-made for large datasets like this, which often come with complex, multi-stage analysis pipelines.
How To Store dbGaP on AWS
This announcement, and the discussions which will take place at Bio-IT World this week, seem like a good time to walk through how to load and secure a genomics dataset such as dbGaP on AWS.
The NIH announcement contains specific requirements on how to properly secure your environment in the cloud. Below, I’m going to follow the guidance in the excellent Architecting for Genomic Data Security and Compliance in AWS white paper (hat tip to Angel and Chris); you’ll be up and running and ready to store genomic data in alignment with this guidance on AWS in just a few minutes.
Roles and instances and Aspera, oh my!
In addition to an AWS account, you’ll need some basic command-line chops to get up and running. Although this list of steps may seem a little intimidating, it’s relatively straightforward even if you’re new to AWS.
We’ll use the AWS platform to create an encrypted, controlled access environment with usage audit trails, and then download and store dbGaP into that environment.
Step 0: Securing your Root Account and Adding Usage Auditing
If this is your first AWS account, or you’re still using your root credentials, we need to do some initial setup to make sure your root AWS account is protected. This is such a common part of securing your data in the cloud that the guidance and checks are built right into the AWS Management Console.
Once you are logged into AWS, start securing the account by clicking on the Identity & Access Management button on the AWS console.
From here, you will see four security status checks that should be completed before trying to move any controlled data into AWS. Take the time to work with your internal security and compliance teams on this step - it’s best practice to create separate groups for developers and administrators, create security auditor accounts, and enable MFA on the administrator accounts. All IAM user permissions should be designed to align with the permissions assigned in the DAR (Data Access Request).
Once all four of the security status checks are green, we can meet the requirement for usage of controlled-access data to be audited by activating CloudTrail for your account.
Simply choose CloudTrail from the AWS console, switch logging to “ON”, and tell CloudTrail which S3 bucket to store the audit logs in.
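If you prefer to script this, the same CloudTrail setup can be sketched with the AWS CLI. The trail and bucket names below are placeholders, and unlike the console flow, the target bucket must already carry a policy allowing CloudTrail to write to it:

```shell
# Create a trail that delivers audit logs to an existing S3 bucket.
# "my-audit-logs" is a placeholder; the bucket must already grant
# CloudTrail write access via its bucket policy.
aws cloudtrail create-trail \
    --name dbgap-audit-trail \
    --s3-bucket-name my-audit-logs

# Creating a trail does not start logging; turn it on explicitly.
aws cloudtrail start-logging --name dbgap-audit-trail
```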
Step 1: Identity and Access
First, using the IAM service, let’s create an IAM Role that will be used to control access to our controlled-access dataset.
- From the AWS Console, choose ‘Identity & Access Management’.
- From the left hand side, choose ‘Roles’.
- Click the button Create New Role from the top of the page.
- Enter a Role Name (such as r-dbsrv1).
- Select the Amazon EC2 role type under ‘AWS Service Roles’.
- Under ‘Attach Policy’ simply click ‘Next Step’ without assigning any policies. We will manually assign permissions to our role later.
- On the review page, make a note of the ‘Role ARN’. This is an Amazon Resource Name that uniquely identifies our resource. We will use this ARN to identify our role in the next step.
- Click ‘Create Role’ to finish.
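The same role can be sketched with the AWS CLI. The trust policy below is the standard document that lets EC2 assume a role; the role name matches the walkthrough, and the `aws iam` calls are left commented since they need live credentials:

```shell
# Trust policy letting EC2 instances assume the role.
cat > trust.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "Service": "ec2.amazonaws.com" },
      "Action": "sts:AssumeRole"
    }
  ]
}
EOF

# Create the role and an instance profile so EC2 can use it
# (requires live AWS credentials):
# aws iam create-role --role-name r-dbsrv1 \
#     --assume-role-policy-document file://trust.json
# aws iam create-instance-profile --instance-profile-name r-dbsrv1
# aws iam add-role-to-instance-profile \
#     --instance-profile-name r-dbsrv1 --role-name r-dbsrv1
```

Note that the console creates the instance profile for you behind the scenes; via the CLI it’s a separate object that the role gets attached to, which is why there are three calls.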
Step 2: Setting up our storage
Next we will create a bucket in S3 to store our controlled access datasets. dbGaP will ultimately be stored in S3 (but we’ll load it from the NIH into S3 via EC2 in the next step).
- From the AWS Console, click on S3.
- Click ‘Create Bucket’.
- Choose a name for the bucket that will hold the controlled-access data.
- Once the bucket is created, click on the bucket name to be taken into the empty bucket and then choose Properties. From the properties menu, choose ‘Add bucket policy’ from the Permissions section.
- Add a bucket policy (a JSON document) that grants the role we created in the first step the right to add encrypted objects to our bucket. Be sure to use your own bucket name and role ARN.
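As a sketch, a minimal bucket policy along these lines grants the role write access while requiring AES-256 server-side encryption on every upload. The account ID, bucket name (db-gap), and role name (r-dbsrv1) are placeholders; substitute your own:

```shell
# Write a minimal bucket policy that lets the role add objects,
# but only when server-side encryption is requested.
# 123456789012 is a placeholder account ID.
cat > policy.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowRoleEncryptedPuts",
      "Effect": "Allow",
      "Principal": { "AWS": "arn:aws:iam::123456789012:role/r-dbsrv1" },
      "Action": "s3:PutObject",
      "Resource": "arn:aws:s3:::db-gap/*",
      "Condition": {
        "StringEquals": { "s3:x-amz-server-side-encryption": "AES256" }
      }
    }
  ]
}
EOF

# Attach it to the bucket (requires live AWS credentials):
# aws s3api put-bucket-policy --bucket db-gap --policy file://policy.json
```

The Condition block is what does the enforcement: an upload from the role that doesn’t request server-side encryption simply isn’t authorized by this statement.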
Step 3: Moving dbGaP from the NIH to AWS
Now that we have a home for dbGaP in S3, we’ll load it from the NIH into AWS using EC2, using Aspera Connect to speed up the data transfer.
- From the EC2 console, choose ‘Launch Instance’, select an m3.medium instance type, and follow the launch wizard until…
- On the ‘configure instance details’ step, select the IAM Role Name we’ve been using. This will enable our EC2 instance to securely invoke the permissions given to the role without requiring us to exchange keys or passwords.
- By default, the wizard will place our instance in a network configuration that exposes it to an Internet Gateway. Since we will only be using this instance for a one-time data transfer, our exposure is limited, but best security practice is to logically isolate instances from an Internet Gateway and route traffic through a Network Address Translation (NAT) server.
- On the “Add Storage” step, make sure you have a root volume for booting the instance as well as an EBS volume marked for encryption, which will be used to store the controlled datasets. Be sure to choose a size in GiB for the EBS volume that is large enough to hold your dbGaP dataset.
- “Security Groups” act as firewall rules controlling incoming traffic to your instance. It is possible to bootstrap all the commands needed to install Aspera Connect and use the command-line ascp utility to automatically pull from the NIH GDS Data Repository, avoiding any inbound rights in the security group at all. For this walkthrough, however, we will leave either SSH (if using Linux) or RDP (if using Windows) in our security group, but limit the source to our IP address or IP address range. We will also name our security group sg-appsrv1.
- You are now ready to review all instance settings, launch the instance and generate a secure key used to connect to the instance.
- Connect to your instance and follow the instructions in the dbGaP FAQ Archive for installing the Aspera software and downloading the controlled data. When downloading the dataset, be sure to store the data on the encrypted, persistent EBS volume and not the root volume.
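Once you’re connected to a Linux instance, the extra encrypted EBS volume needs a filesystem and a mount point before Aspera can write to it. A sketch, assuming the volume shows up as /dev/xvdb (verify with lsblk, and adjust the user name for your AMI):

```shell
# Find the extra EBS volume; on many Linux AMIs it appears as /dev/xvdb.
lsblk

# Create a filesystem on the encrypted volume and mount it.
sudo mkfs -t ext4 /dev/xvdb
sudo mkdir -p /data
sudo mount /dev/xvdb /data

# Let the login user (ec2-user on Amazon Linux) write to it,
# then point the Aspera download at a directory under /data.
sudo chown ec2-user:ec2-user /data
```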
Step 4: Copy dbGaP to S3
Now that you have downloaded the controlled dataset onto your encrypted EBS volume, you will want to move that data over to your S3 bucket to make it accessible from any application or server to which you give permission.
- You will need the AWS command line tools to move the controlled data to S3. If you chose an Amazon Linux operating system, these tools are pre-installed. Otherwise, you will need to obtain and install them using the instructions specific to the operating system you chose, from here.
- Run the command: aws s3 cp downloaded_data/ s3://db-gap --recursive --sse
- This command will upload all the files in your downloaded_data directory to your db-gap S3 bucket using AES-256 server-side encryption. Be sure to change the bucket and directory names to the ones you used.
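To sanity-check the transfer, you can list the bucket and confirm an object carries server-side encryption; “some_file” below is a placeholder for one of your uploaded keys:

```shell
# List everything that landed in the bucket.
aws s3 ls s3://db-gap --recursive

# Confirm a given object was stored with server-side encryption;
# this should report AES256. "some_file" is a placeholder key.
aws s3api head-object --bucket db-gap --key some_file \
    --query ServerSideEncryption
```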
Step 5: Tidy up
After you have verified that the controlled-access dbGaP data has been successfully moved into S3, the final step is to stop our running Aspera Connect instance on EC2. Since the server is only needed when we pull data from the NCBI repository into AWS, we can now stop it and avoid paying for compute time until we need it again. From the EC2 console, right-click on the Aspera Connect instance and choose ‘Stop’.
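The same stop can be scripted; the instance ID below is a placeholder. Keep in mind that a stopped instance no longer accrues compute charges, but its attached EBS volumes still bill for the storage they occupy:

```shell
# Stop the transfer instance (i-0123456789abcdef0 is a placeholder).
aws ec2 stop-instances --instance-ids i-0123456789abcdef0
```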
You should now have your controlled dbGaP data moved into and stored in S3, ready for analysis (another story, for another post, another day).