Controlled-Access Genomics Data in the Cloud

5th May, 2015

Last month the NIH issued a position statement on using the cloud for the storage and analysis of controlled-access data, which is subject to the NIH Genomic Data Sharing Policy. The statement is worth a read, as is the blog post by Vivien at the NIH.

“One small step for NIH, one giant leap forward for the community”

If we’re talking about controlled-access datasets at the NIH, we’re really talking about dbGaP, the database of Genotypes and Phenotypes, which catalogs the interaction between genotype and phenotype. Since it contains patient phenotypic information, access to the dataset is restricted to scientific research that is consistent with the informed consent agreements provided by individual research participants. A look at the best practices for working with the data shows just how seriously the NIH and PIs take access (quite rightly): no direct access to the internet, password requirements, principle of least privilege, data destruction policies. The list goes on.

With the updated guidance from the NIH, researchers can now meet these requirements and store and analyze this important dataset in the cloud - a big deal, since the cloud is effectively custom-made for working with large datasets like this, which often have complex, multi-stage analysis pipelines.

How To Store dbGaP on AWS

This announcement, and the discussions taking place at Bio-IT World this week, make this a good time to walk through how to load and secure a genomics dataset such as dbGaP on AWS.

The NIH announcement contains specific requirements for properly securing your environment in the cloud. Below I’m going to follow the guidance in the excellent Architecting for Genomic Data Security and Compliance in AWS white paper (hat tip to Angel and Chris), and you’ll be up and running on AWS, ready to store genomic data in alignment with this guidance, in just a few minutes.

Roles and instances and Aspera, oh my!

In addition to an AWS account, you’ll need some basic command line chops to get up and running. Although the list of steps may seem a little intimidating, it’s relatively straightforward even if you’re new to AWS.

We’ll use the AWS platform to create an encrypted, controlled-access environment with usage audit trails, and then download dbGaP into that environment for storage.

Step 0: Securing your Root Account and Adding Usage Auditing

If this is your first AWS account, or you’re still using your root credentials, we need to do some initial setup to make sure your root AWS account is protected. This is a common part of securing your data in the cloud - so common, in fact, that the guidance and checks are built right into the AWS Management Console.

Once you are logged into AWS, start securing the account by clicking on the Identity & Access Management button on the AWS console.

From here, you will see four security status checks that should be completed before trying to move any controlled data into AWS. Take the time to work with your internal security and compliance teams on this step - best practice is to create separate groups for developers and administrators, create security auditor accounts, and enable MFA on the administrator accounts. All IAM user permissions should be designed to align with the permissions assigned in the Data Access Request (DAR).
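If you prefer to script this setup rather than click through the console, here’s a minimal sketch using boto3, the AWS SDK for Python. The group names and managed policy choices are illustrative, not prescribed by the white paper, and MFA devices still need to be enrolled per user:

```python
import boto3

iam = boto3.client('iam')

# Separate groups for administrators and developers (least privilege).
iam.create_group(GroupName='Administrators')
iam.create_group(GroupName='Developers')

# Administrators get the AWS managed administrator policy.
iam.attach_group_policy(
    GroupName='Administrators',
    PolicyArn='arn:aws:iam::aws:policy/AdministratorAccess'
)

# Developers start read-only; widen permissions only as the DAR allows.
iam.attach_group_policy(
    GroupName='Developers',
    PolicyArn='arn:aws:iam::aws:policy/ReadOnlyAccess'
)
```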

Once all four of the security status checks are green, we can meet the requirement for usage of controlled-access data to be audited by activating CloudTrail for your account.

Simply choose CloudTrail from the AWS console, switch logging to “ON”, and tell CloudTrail which S3 bucket to store the audit logs in.
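The same two actions can be scripted with boto3; the trail and bucket names here are illustrative, and the bucket must already exist with a policy that allows CloudTrail to write to it:

```python
import boto3

cloudtrail = boto3.client('cloudtrail')

# Create a trail that writes audit logs to an existing S3 bucket.
cloudtrail.create_trail(
    Name='dbgap-audit-trail',
    S3BucketName='my-audit-log-bucket'
)

# Flip the logging switch to "ON".
cloudtrail.start_logging(Name='dbgap-audit-trail')
```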

Step 1: Identity and Access

First, using the IAM service, let’s create an IAM Role that will be used to control access to our controlled-access dataset.
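You can do this in the IAM console, or script it. The sketch below, again using boto3, creates a role that EC2 instances can assume, scopes it to a hypothetical bucket (my-dbgap-bucket, which we create in the next step), and wraps it in an instance profile so we can attach it to our transfer instance later; all names are illustrative:

```python
import json
import boto3

iam = boto3.client('iam')

# Trust policy letting EC2 instances assume the role.
assume_role_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "ec2.amazonaws.com"},
        "Action": "sts:AssumeRole"
    }]
}

iam.create_role(
    RoleName='dbgap-access-role',
    AssumeRolePolicyDocument=json.dumps(assume_role_policy)
)

# Inline policy scoping the role to the controlled-access bucket only.
s3_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": ["s3:GetObject", "s3:PutObject", "s3:ListBucket"],
        "Resource": [
            "arn:aws:s3:::my-dbgap-bucket",
            "arn:aws:s3:::my-dbgap-bucket/*"
        ]
    }]
}

iam.put_role_policy(
    RoleName='dbgap-access-role',
    PolicyName='dbgap-s3-access',
    PolicyDocument=json.dumps(s3_policy)
)

# An instance profile lets us attach the role to an EC2 instance in Step 3.
iam.create_instance_profile(InstanceProfileName='dbgap-access-role')
iam.add_role_to_instance_profile(
    InstanceProfileName='dbgap-access-role',
    RoleName='dbgap-access-role'
)
```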

Step 2: Setting up our storage

Next we will create a bucket in S3 to store our controlled-access datasets. dbGaP will ultimately be stored in S3 (but we’ll load it from the NIH into S3 via EC2 in the next step).
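Here’s the same setup as a boto3 sketch. The bucket name is illustrative (and must be globally unique), and the bucket policy refuses any upload that doesn’t request server-side encryption, which keeps us honest in Step 4:

```python
import json
import boto3

s3 = boto3.client('s3')

# Create the bucket in your region of choice.
# (Omit CreateBucketConfiguration if you use us-east-1.)
s3.create_bucket(
    Bucket='my-dbgap-bucket',
    CreateBucketConfiguration={'LocationConstraint': 'us-west-2'}
)

# Deny any PutObject request that doesn't ask for server-side encryption.
policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Deny",
        "Principal": "*",
        "Action": "s3:PutObject",
        "Resource": "arn:aws:s3:::my-dbgap-bucket/*",
        "Condition": {
            "StringNotEquals": {"s3:x-amz-server-side-encryption": "AES256"}
        }
    }]
}
s3.put_bucket_policy(Bucket='my-dbgap-bucket', Policy=json.dumps(policy))
```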

Step 3: Moving dbGaP from the NIH to AWS

Now that we have a home for dbGaP in S3, we’ll load it from the NIH into AWS on an EC2 instance, with Aspera Connect to speed up the data transfer.
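Launching the transfer instance can also be scripted. This sketch assumes the instance profile from Step 1 and attaches an encrypted EBS volume to hold the download; the AMI, key pair, and security group IDs are placeholders for your own:

```python
import boto3

ec2 = boto3.client('ec2')

response = ec2.run_instances(
    ImageId='ami-xxxxxxxx',            # placeholder: an Amazon Linux AMI in your region
    InstanceType='m3.xlarge',
    KeyName='my-key-pair',             # placeholder
    SecurityGroupIds=['sg-xxxxxxxx'],  # placeholder: allow SSH from your IP only
    MinCount=1,
    MaxCount=1,
    IamInstanceProfile={'Name': 'dbgap-access-role'},
    BlockDeviceMappings=[{
        'DeviceName': '/dev/xvdb',
        'Ebs': {'VolumeSize': 500, 'VolumeType': 'gp2', 'Encrypted': True}
    }]
)
print('Launched', response['Instances'][0]['InstanceId'])
```

Once the instance is up, SSH in, format and mount the encrypted volume, install Aspera Connect, and use it with your dbGaP download credentials to pull the data from NCBI onto that volume.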

Step 4: Copy dbGaP to S3

Now that you have downloaded the controlled dataset onto your encrypted EBS volume, you will want to move that data over to your S3 bucket to make it accessible from any application or server to which you give permission.
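A minimal sketch of the copy, run from the EC2 instance itself (the paths and bucket name are illustrative). We request AES256 server-side encryption on each object, which is exactly what the bucket policy from Step 2 demands; note that put_object tops out at 5 GB per object, so very large files would need a multipart upload instead:

```python
import os
import boto3

s3 = boto3.client('s3')

local_dir = '/data/dbgap'     # illustrative mount point of the encrypted EBS volume
bucket = 'my-dbgap-bucket'    # the bucket from Step 2

# Walk the download directory and upload each file with server-side encryption.
for root, _, files in os.walk(local_dir):
    for name in files:
        path = os.path.join(root, name)
        key = os.path.relpath(path, local_dir)
        with open(path, 'rb') as body:
            s3.put_object(
                Bucket=bucket,
                Key=key,
                Body=body,
                ServerSideEncryption='AES256'
            )
        print('Uploaded', key)
```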

Step 5: Tidy up

After you have verified that the controlled-access dbGaP data has been successfully moved into S3, the final step is to stop the running Aspera Connect instance on EC2. Since the server is only needed when we pull data from the NCBI repository into AWS, we can stop it now and avoid paying for it until we need it again. From the EC2 console, right-click on the Aspera Connect instance and choose ‘stop’.
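Or, in one line of boto3 (the instance ID is a placeholder for your Aspera Connect instance):

```python
import boto3

# Stop (not terminate) the transfer instance so we can reuse it later.
boto3.client('ec2').stop_instances(InstanceIds=['i-xxxxxxxx'])
```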

Job done

You should now have your controlled-access dbGaP data moved into and stored in S3, ready for analysis (another story, for another post, another day).

With thanks to Chris and Angel for their help with this post.