Using an Amazon Machine Image for analysing samples with Kraken2
We have created an Amazon Machine Image (AMI) that can be used to launch Amazon Web Services server instances to run Kraken2. We aim to provide instructions here that will allow a bioinformatics beginner to use this to run Kraken2 on metagenome samples, but this is currently a work in progress. Please get in touch with Robyn Wright with any questions. There are also general metagenome and other microbiome analysis tutorials on Microbiome Helper, and a Microbiome Helper Google group where you can ask questions.
Please note that we are not affiliated with Amazon in any way and will not be paid if you follow these instructions or use a server instance that requires payment. Creating the AMI through Amazon Web Services was the easiest way for us to make a server available that has everything necessary pre-installed, and it will allow you to have the RAM needed for running some of the larger databases. Also note that you will need to request an increase to the default resource limits if you wish to run the largest NCBI RefSeq Complete V205 database. My best recommendation is to first try following the instructions as they are here, and to contact Amazon to request any increases required - I have found them to be helpful, knowledgeable and quick to respond. You can contact your AWS Solutions Architect or AWS account team for help with this.
If this is the first time you are trying to use a Terminal window or analyse samples then I strongly recommend going through some of the tutorials and pages at Microbiome Helper first so that you don't pay for more time than you need to.
Click here to create a free account. I think that you will need to add payment details in order to set up the account and be able to launch a server. You are able to launch the instance for free, but if you want to be able to run samples then you will need to launch a paid instance. In the account settings you should be able to set up alerts so that you know if you will be charged for something, and the costs associated with launching the instances are clearly marked.
In addition to following the instructions that we've made below, I also recommend having a read through some of the Amazon tutorials and information pages in order to familiarise yourself with what the overall process is, as well as the terminology used.
- Get started with AWS EC2 Linux instances
- Set up to use Amazon EC2
- Create a key pair
- Create a security group
- Launch an AWS instance
- Connect to an instance
- Open the Amazon EC2 console.
- In the navigation pane, choose Network & Security > Key Pairs.
- Choose "Create key pair" (top right orange button).
- Enter a name for the key pair.
- Choose RSA for the key pair type.
- Choose .pem if you are going to be logging in from Mac/Linux and .ppk if you will be logging in from Windows.
- Select "Create key pair".
- The key pair file will be downloaded by your browser and you can move it to whichever location you prefer (note that you will need to know the location of the file).
- If you will use Mac/Linux then go to the Terminal and enter the following command:
chmod 400 key-pair-name.pem
You will need to change key-pair-name.pem to whatever name your key file has.
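As a quick sanity check (assuming the key file is in your current directory and keeps the placeholder name used above), you can confirm that the permissions change took effect:

```shell
# Restrict the key so that only you can read it; ssh refuses to use
# a key with looser permissions ("UNPROTECTED PRIVATE KEY FILE").
chmod 400 key-pair-name.pem
# The permissions column should now read -r--------
ls -l key-pair-name.pem
```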
- Open the Amazon EC2 console.
- In the navigation pane, choose Network & Security > Security Groups.
- Choose "Create security group" (top right orange button).
- Enter a name for the security group. Amazon recommends that you add the region that you will use to this name. For example, I have been using the US East region and named my group RW_SG_useast1.
- Select the default VPC for the region.
- You now have to add some rules so that you will be able to access an instance that you create with this security group. There is guidance on the suggested rules for different use cases here, but I used the following:
- Choose HTTP from the drop down list. In source, choose "Anywhere". Click add rule.
- Choose HTTPS from the drop down list. In source, choose "Anywhere". Click add rule.
- Choose SSH from the drop down list. In source click "My IP" - this should automatically add your IP address.
- Click "Create security group".
This section is modified from the AWS instructions on this topic.
- Open the Amazon EC2 console. Amazon notes that you should select the right region for this - I have been using US East (N. Virginia), but you should choose this to be the same as your security group, selected above. Sometimes different regions may have different availability.
- From the navigation bar, choose AMIs.
- Find the AMI that we'll use: open the menu next to the search bar and select Public images. Type ami-04ae7dc734c4934ec into the search bar and press enter. This should bring up an AMI with the name MicrobiomeHelper2.
- Select the AMI and then choose "Launch instance from AMI" (orange button on the top right).
- First, give your instance a name (this can be anything that you like).
- The application and OS images should already show the MicrobiomeHelper2 image.
- Now we will choose the instance type. If you want to be able to run the NCBI RefSeq Complete V205 database then I recommend the x2idn.24xlarge instance type, which has 1536 GB memory and costs ~$10 (USD) per hour (as of writing this tutorial). Note that I was not able to launch this immediately after creating my account because I did not have access to enough vCPUs: my limit was increased immediately to 16 vCPUs by submitting a support request, but only to 96 after speaking with my AWS Account Solutions Architect. You can contact your AWS Solutions Architect or AWS account team for help with this. If you run into issues, contact me and I can put you in touch with my AWS Account Solutions Architect, who can redirect you to the right person for your region/institution. With 16 vCPUs, you will be able to launch the x1e.4xlarge instance, which has 488 GB memory and costs ~$3.33 (USD) per hour. This gives you plenty of memory for running one of the slightly smaller databases (e.g. the ChocoPhlAn 3-equivalent database).
- Select a key pair for login. If you already made this above, then select it from the dropdown menu. If you didn't make a key pair yet then follow those instructions now.
- In the network settings, under firewall, click "Select existing security group". Here, you should choose the security group that you made above. If you didn't make a security group yet then follow those instructions now.
- Now go to Configure storage. You will need to add enough storage here to hold the database (about 1200 GB if you are using the NCBI RefSeq Complete database; for the sizes of the others, check the methods section of the paper) as well as your samples and the Kraken2 output files. I would add enough storage for: size of database + (size of samples * 2), to ensure that you have enough space for the Kraken2 output files.
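As a worked example of that sizing rule (the 100 GB of samples here is a hypothetical figure; substitute your own):

```shell
# storage needed = size of database + (size of samples * 2), in GB
DB_GB=1200        # NCBI RefSeq Complete V205 database
SAMPLES_GB=100    # hypothetical total size of your samples
echo "$(( DB_GB + SAMPLES_GB * 2 )) GB"   # prints "1400 GB"
```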
- You can review the details of the instance on the right hand side of the page, and then click "Launch instance" (orange button on bottom right of screen).
- Choose View all instances to check the status of your instance. It may take a few minutes for the instance to launch. If this didn't work then check the error message. You may need to select an instance with a lower number of vCPUs, contact AWS support to increase your available vCPUs, etc. Please get in touch if you are unsure.
- When the instance has launched, the status will change from "Pending" to "Running". Now you can log in to the instance.
Assuming that you have followed the above instructions to launch an instance, now you can log into it. This section is modified from the Connect to your Linux instance AWS instructions. We will use the ssh instructions here, but if you wish to do this a different way, please follow the AWS instructions at the link.
- Open a Terminal window (if you are not comfortable using the Terminal yet, see the Microbiome Helper tutorials mentioned above).
- Log in with the following information:
ssh -i /path/key-pair-name.pem ec2-user@instance-public-dns-name
Your /path/key-pair-name.pem should be the location of your key file and instance-public-dns-name can be found under Public IPv4 DNS in the Instance details. One of the times that I logged into an AWS instance, the command I used here looked like this:
ssh -i /Users/robynwright/Dropbox/Langille_Lab_postdoc/AWS/mh-aws.pem ec2-user@ec2-54-86-134-123.compute-1.amazonaws.com
- You may need to type "yes" to say that you want to continue connecting, and then you should see that you have been logged into the instance.
Now you are ready to run some of your data!
For this, you should either have some fasta/fastq files of your own for analysis, or you can use a subset of the samples that we used in the paper for test purposes. Use either step 1 or step 2 below. Note that if you are using your own samples, then you should run the precursor steps to Kraken2 shown in the tutorial here.
- Upload your own data (do this from a new terminal window that is logged into your computer):
scp -i /path/key-pair-name.pem -r folder_to_upload ec2-user@instance-public-dns-name:~/.
As above, you should replace /path/key-pair-name.pem with your key file and instance-public-dns-name with your Public IPv4 DNS.
An example command for me looked like this:
scp -i /Users/robynwright/Dropbox/Langille_Lab_postdoc/AWS/mh-aws.pem -r IMR_sample/ ec2-user@ec2-54-86-134-123.compute-1.amazonaws.com:~/.
- Get the test sample files:
mkdir test_samples
cd test_samples
wget https://www.dropbox.com/sh/lvlz2wpsssvsrad/AABvggSeIoCuOZKhKVA-1C-1a/datasets/simulated_mock_samples/ani100_cHIGH_stFalse_r0.fastq.gz?dl=1 -O ani100_cHIGH_stFalse_r0.fastq.gz
wget https://www.dropbox.com/sh/lvlz2wpsssvsrad/AAAlF4IDuh_DGg99Qrspf1Hma/datasets/simulated_mock_samples/ani95_cLOW_stTrue_r1.fastq.gz?dl=1 -O ani95_cLOW_stTrue_r1.fastq.gz
wget https://www.dropbox.com/sh/lvlz2wpsssvsrad/AADPpsypaYGAymr7XnEGVcmVa/datasets/simulated_mock_samples/ani99_cHIGH_stFalse_r7.fastq.gz?dl=1 -O ani99_cHIGH_stFalse_r7.fastq.gz
wget https://www.dropbox.com/sh/lvlz2wpsssvsrad/AAB8o5EUa-zwPXoMza3nCtQta/datasets/simulated_mock_samples/ani100_cHIGH_stTrue_r4.fastq.gz?dl=1 -O ani100_cHIGH_stTrue_r4.fastq.gz
wget https://www.dropbox.com/sh/lvlz2wpsssvsrad/AABLMORfvsFDqJfcDtWJ4DJea/datasets/simulated_mock_samples/ani97_cHIGH_stFalse_r0.fastq.gz?dl=1 -O ani97_cHIGH_stFalse_r0.fastq.gz
wget https://www.dropbox.com/sh/lvlz2wpsssvsrad/AADNwYH_ZiBTR2T65mykZa6Aa/datasets/simulated_mock_samples/ani99_cHIGH_stTrue_r1.fastq.gz?dl=1 -O ani99_cHIGH_stTrue_r1.fastq.gz
wget https://www.dropbox.com/sh/lvlz2wpsssvsrad/AADH3i9TZ-HhxNCNeOrRFHB0a/datasets/simulated_mock_samples/ani100_cLOW_stFalse_r2.fastq.gz?dl=1 -O ani100_cLOW_stFalse_r2.fastq.gz
wget https://www.dropbox.com/sh/lvlz2wpsssvsrad/AACILPTmgvgUdB32O9qpOGhWa/datasets/simulated_mock_samples/ani97_cHIGH_stTrue_r4.fastq.gz?dl=1 -O ani97_cHIGH_stTrue_r4.fastq.gz
wget https://www.dropbox.com/sh/lvlz2wpsssvsrad/AAAYLUx5CcwtKguI8WS2D-aYa/datasets/simulated_mock_samples/ani99_cLOW_stFalse_r2.fastq.gz?dl=1 -O ani99_cLOW_stFalse_r2.fastq.gz
wget https://www.dropbox.com/sh/lvlz2wpsssvsrad/AACBK9DjE7Ud5JQxiPaPEjQaa/datasets/simulated_mock_samples/ani100_cLOW_stTrue_r8.fastq.gz?dl=1 -O ani100_cLOW_stTrue_r8.fastq.gz
wget https://www.dropbox.com/sh/lvlz2wpsssvsrad/AABT46ucHomePQEZP96DV0AAa/datasets/simulated_mock_samples/ani97_cLOW_stFalse_r7.fastq.gz?dl=1 -O ani97_cLOW_stFalse_r7.fastq.gz
wget https://www.dropbox.com/sh/lvlz2wpsssvsrad/AADEqoos3l85FNS6OYHO_7cma/datasets/simulated_mock_samples/ani99_cLOW_stTrue_r6.fastq.gz?dl=1 -O ani99_cLOW_stTrue_r6.fastq.gz
wget https://www.dropbox.com/sh/lvlz2wpsssvsrad/AADQE0oUYmOCsZ62T_E5qehca/datasets/simulated_mock_samples/ani95_cLOW_stFalse_r0.fastq.gz?dl=1 -O ani95_cLOW_stFalse_r0.fastq.gz
wget https://www.dropbox.com/sh/lvlz2wpsssvsrad/AAC4w3sEJb4HnsoSAYISndKSa/datasets/simulated_mock_samples/ani97_cLOW_stTrue_r7.fastq.gz?dl=1 -O ani97_cLOW_stTrue_r7.fastq.gz
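Any of these downloads could be interrupted part-way, so it is worth checking that each file is a complete gzip archive before going further; a truncated download will fail this test. This is an optional sanity check, run from inside the test_samples directory:

```shell
# gzip -t decompresses each file in memory and reports whether
# the archive is intact, without writing anything to disk.
for f in *.fastq.gz; do
    if gzip -t "$f" 2>/dev/null; then
        echo "OK: $f"
    else
        echo "CORRUPT: $f"
    fi
done
```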
Now you should choose which database to download based on how much memory/storage space you have. These commands are for downloading some of the databases that we made, but you can choose to download any database now, or make your own if you wish. You can see the other databases that we made here. Because you could become disconnected during some of this, I recommend using tmux or screen or something similar. These allow you to disconnect from the instance while leaving things running in the background, or run multiple things at once. Please see tutorials on these programs if you are unfamiliar.
NCBI RefSeq Complete V205 database (the largest and best-performing Kraken2 database)
mkdir kraken2_RefSeqV205_Complete
cd kraken2_RefSeqV205_Complete
aws s3 cp --recursive --no-sign-request s3://kraken2-ncbi-refseq-complete-v205/Kraken2_RefSeqCompleteV205 .
This database is now hosted on AWS, so these instructions have been changed from previous versions to reflect this.
ChocoPhlAn 3-equivalent database (approx. 73 GB database with good trade-off between small size and reasonable classification accuracy)
mkdir kraken2_chocophlanV30-201901
cd kraken2_chocophlanV30-201901
wget https://www.dropbox.com/sh/lvlz2wpsssvsrad/AABACCxIhYU0Elg_2ev3e17za/databases/kraken2_chocophlanV30-201901/database150mers.kmer_distrib?dl=1 -O database150mers.kmer_distrib
wget https://www.dropbox.com/sh/lvlz2wpsssvsrad/AACu5uBobAFTXS37sohtQKgAa/databases/kraken2_chocophlanV30-201901/database150mers.kraken?dl=1 -O database150mers.kraken
wget https://www.dropbox.com/sh/lvlz2wpsssvsrad/AADXUzvIvZLGgIK6UItKNH0pa/databases/kraken2_chocophlanV30-201901/opts.k2d?dl=1 -O opts.k2d
wget https://www.dropbox.com/sh/lvlz2wpsssvsrad/AAB2ZqNc5zuA2ETTLcxpUtSba/databases/kraken2_chocophlanV30-201901/taxo.k2d?dl=1 -O taxo.k2d
wget https://www.dropbox.com/sh/lvlz2wpsssvsrad/AACVEkoq0hoz2usDaU_eLRtqa/databases/kraken2_chocophlanV30-201901/hash.k2d?dl=1 -O hash.k2d
wget https://www.dropbox.com/sh/lvlz2wpsssvsrad/AADyAmDrH-dYLEDFBVTKsQVma/databases/kraken2_chocophlanV30-201901/genomes_added.txt?dl=1 -O genomes_added.txt
wget https://www.dropbox.com/sh/lvlz2wpsssvsrad/AADVcKnDoMEnu1XR5hE8Kn63a/databases/kraken2_chocophlanV30-201901/seqid2taxid.map?dl=1 -O seqid2taxid.map
You can use the md5 files on Dropbox to check that the databases copied across correctly.
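For example, with md5sum (the checksum filename below is hypothetical; use whatever names the md5 files on Dropbox actually have, downloaded into the same directory as the database files):

```shell
# hash.k2d.md5 is assumed to contain a line of the form:
#   <checksum>  hash.k2d
# md5sum -c recomputes the checksum and compares it to that line.
md5sum -c hash.k2d.md5
```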
Optionally, you can copy the database into a ramdisk (a filesystem held in memory). This means that you won't need to load the database into memory before running each sample against it, which is likely to speed things up, unless you are only running a single sample. You can make the ramdisk with the following commands:
sudo mkdir /mnt/ramdisk
sudo mount -t ramfs -o size=1250g ramfs /mnt/ramdisk
Copy the database into the ramdisk (change the database name if you used a different one):
sudo cp -r kraken2_RefSeqV205_Complete /mnt/ramdisk/
You can check whether it has copied across into memory by running free -g. If it has copied, then the amount of "available" memory should be approximately your instance's total memory minus the size of your database; in my case, I have 375 GB available after loading the database.
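The check itself is just the following (the exact numbers will depend on your instance type and database):

```shell
# Report memory in gigabytes; after copying the database into the
# ramdisk, the "available" column should drop by roughly its size.
free -g
```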
You can follow the tutorial here for classifying reads with Kraken2 and Bracken.
If you are using the test samples then you can follow the instructions here.
If you are going to terminate the instance (as suggested below, to avoid incurring additional computing and storage charges), ensure that you first copy your Kraken2 files to either your regular server or your laptop. If you want to, you can instead continue analysis on this instance, following the rest of the Microbiome Helper SOP.
You should terminate your Amazon EC2 instance so that you do not incur additional charges.
- Go to the EC2 console.
- Select the instance and click the "Instance state" drop down box.
- Select "Terminate instance".