Created by: Marta Karas (martakarass) on 2022-02-18

Credits: Eli Jones (biblicabeebli)

Working with Forest on AWS

This page provides a quick start for using AWS EC2 and EBS to work with Beiwe and other collaborators’ data. In particular, it includes guidelines for leveraging the AWS multicore computing environment and for running a Python file as multiple concurrent jobs in the background.

Amazon EC2 and EBS

Amazon Web Services (AWS) is one of many cloud server providers. AWS offers various services, including Amazon Elastic Compute Cloud (Amazon EC2) and Amazon Elastic Block Store (Amazon EBS).

EC2 provides virtual computing environments, known as instances, that come with preconfigured templates (including the operating system and additional software) and allow for various configurations of CPU, memory, storage, and networking capacity.
EBS provides storage volumes for use with EC2 instances. EBS is recommended for data that must be quickly accessible and requires long-term persistence (does not get deleted when you stop, hibernate, or terminate your instance).

Starting setup

In this wiki page, we assume that:

an EC2 instance with Linux (Amazon Linux 2 distribution) has been spun up,
an EBS volume has been attached to the EC2 instance,
a PEM private key to access the EC2 instance has been shared with you,
the user name and EC2 public DNS URL to be used have been shared with you; for example, <user name>@<EC2 public DNS URL> could look like ec2-user@ec2-1-1-1-1.compute-1.amazonaws.com.

At the Onnela lab, all four of these steps have likely been done by JP.

SSH to an EC2 instance using a PEM key

To access the EC2 instance through SSH, use a private key PEM file. Here, the file is called "DP-four-user.pem". The PEM file has likely been shared with you via Harvard Secure file transfer.

Once you download the PEM file, you first need to modify permission settings of that PEM file by typing the following in a terminal (note the below line assumes that the PEM file is in the current working directory):

chmod 400 DP-four-user.pem
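SSH refuses to use a private key file whose permissions allow access by the group or by other users, which is why the chmod step above is needed. As a minimal sketch, the same condition can be checked from Python; the helper name key_is_private is hypothetical and not part of any tool mentioned on this page.

```python
import os
import stat


def key_is_private(path):
    """Return True if only the file owner has any access to `path`."""
    mode = stat.S_IMODE(os.stat(path).st_mode)  # keep the permission bits only
    # 0o077 masks the group and other bits; all of them must be unset.
    return mode & 0o077 == 0
```

For example, key_is_private("DP-four-user.pem") should return True once the file permissions have been set to 400.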

Then, you can access the instance by typing the following in a terminal (note the below line assumes that the PEM file is in the current working directory):

ssh -i DP-four-user.pem ec2-user@ec2-1-1-1-1.compute-1.amazonaws.com

where ec2-user is a username and ec2-1-1-1-1.compute-1.amazonaws.com is an EC2 public DNS URL. This particular DNS URL is made up for security reasons, and we will use it throughout this wiki page.

Once connected, you can type lscpu to see the CPU capacity and lsmem to see the memory of the EC2 instance.

Make an EBS volume available for use on EC2 Linux instance

References used:

To make the newly attached EBS volume available for use, we format it to set up a file system on it and then mount it, i.e. make it accessible from EC2 as if it were a normal local disk. We follow the steps from reference 1. Root user privileges are needed to perform these steps.

Create a file system

SSH to the EC2 instance as a user with root privileges. Each Linux instance launches with a default Linux system user account with root privileges. For Amazon Linux 2, the user name is ec2-user.

ssh -i DP-four-user.pem ec2-user@ec2-1-1-1-1.compute-1.amazonaws.com

Determine whether there is a file system on the volume. The FSTYPE column shows the file system type for each device.

sudo lsblk -f

NAME          FSTYPE LABEL UUID                                 MOUNTPOINT
nvme0n1                                                         
├─nvme0n1p1   xfs    /     1a1a1a1a-1a1a-1a1a-1a1a-1a1a1a1a1a1a /
└─nvme0n1p128                                                   
nvme1n1       

Here, 1a1a1a1a(...) is a made-up security replacement for a passage of alphanumeric characters. There are two devices attached to the instance: nvme0n1 and nvme1n1. Device nvme1n1 does not have a file system. Partition nvme0n1p1 on device nvme0n1 is formatted using the XFS file system.

We determine that nvme1n1 is the newly attached EBS volume.
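The same inspection can be scripted. The sketch below is only an illustration: it assumes the installed lsblk supports the --json option and that each device entry carries the keys "name", "fstype" and, for partitioned devices, "children". The function names are hypothetical.

```python
import json
import subprocess


def devices_without_filesystem(lsblk_tree):
    """List top-level devices that have no file system and no partitions."""
    return [
        dev["name"]
        for dev in lsblk_tree["blockdevices"]
        if dev.get("fstype") is None and not dev.get("children")
    ]


def read_lsblk():
    """Run `lsblk -f --json` and return the parsed output (Linux only)."""
    out = subprocess.run(
        ["lsblk", "-f", "--json"], check=True, capture_output=True, text=True
    ).stdout
    return json.loads(out)
```

On the instance described above, devices_without_filesystem(read_lsblk()) would be expected to list nvme1n1 before the volume is formatted.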

Create a file system on EBS volume.

sudo mkfs -t xfs /dev/nvme1n1

Mount the volume

Use the mkdir command to create a mount point directory for the EBS volume. The mount point is where the volume is located in the file system tree and where you read and write files after you mount the volume. Here, the mount point directory is arbitrarily named dv1-mount-point.

sudo mkdir /dv1-mount-point

Mount EBS volume at the directory created in the previous step.

sudo mount /dev/nvme1n1 /dv1-mount-point

After formatting and mounting, you should be able to see the following output:

sudo lsblk -f

NAME          FSTYPE LABEL UUID                                 MOUNTPOINT
nvme0n1                                                         
├─nvme0n1p1   xfs    /     1a1a1a1a-1a1a-1a1a-1a1a-1a1a1a1a1a1a /
└─nvme0n1p128                                                   
nvme1n1       xfs          2b2b2b2b-2b2b-2b2b-2b2b-2b2b2b2b2b2b /dv1-mount-point

Here, 2b2b2b2b(...) is a made-up security replacement for a passage of alphanumeric characters.

Create new users at EC2

We assume that each individual who uses EC2/EBS generally accesses the EC2 instance via their own non-root user account and uses the root-privileged account only when needed. Here, we show how to add a new user named marta.

Assume we are ssh-ed to the EC2 instance as user with root privileges, ec2-user. Create the user account and add it to the system. Then switch to the new account.

sudo adduser marta
sudo su - marta

Add the SSH public key to the user account. General guidance on generating a key pair (public and private key) for use with AWS can be found in reference 3. If you prefer user marta to use the same key pair as ec2-user, you can retrieve the public key from the ec2-user’s private PEM key using the command below. Run it on your local machine, where the PEM file is stored; pbcopy is a macOS utility that places the public key output in the clipboard (so you can paste it later into the authorized_keys file).

ssh-keygen -f DP-four-user.pem -y | pbcopy

First, create a directory in the user’s home directory for the SSH key file. After the switch, the current directory should be the user’s home directory (you can check it using pwd command). Change file permissions for this newly created directory to 700 (only the owner can read, write, or open the directory).

mkdir .ssh 
chmod 700 .ssh

Create a file named authorized_keys in the .ssh directory and change its file permissions to 600 (only the owner can read or write to the file).

touch .ssh/authorized_keys
chmod 600 .ssh/authorized_keys

Open the authorized_keys file and paste there a public key specific to user marta. I personally use vi text editor and would open the file using

vi .ssh/authorized_keys

then type the letter i to enter INSERT mode, paste the public key, press Escape to exit INSERT mode, and type :wq! to save and exit the file.

To switch back from marta to ec2-user user, use

exit

Once an individual receives their private PEM key, they can modify permission settings of the PEM file (see “SSH to an EC2 instance using a PEM key” section) and ssh to the EC2 instance

ssh -i <PEM key file name>.pem marta@ec2-1-1-1-1.compute-1.amazonaws.com

where <PEM key file name>.pem is a file name of the private PEM key for user marta, possibly different than DP-four-user.pem of ec2-user.

Review the file permissions of the new EBS volume mount

We review the file permissions of the new EBS volume mount to make sure that all users (root and non-root) can write to the volume. For more information about file permissions, see reference 4 or similar online resources.

Assume we are ssh-ed to the EC2 instance as a user with root privileges, ec2-user. The command below grants read-write-execute privileges to all users on all EC2 instances that have the file system mounted.

sudo chmod 777 /dv1-mount-point

You may want to create a subdirectory under dv1-mount-point for user marta. The commands below create the subdirectory, make user marta its owner, and set permissions (recursively) so that only the owner can write, while everybody else can read and execute files in the directory.

sudo mkdir /dv1-mount-point/marta
sudo chown marta:marta /dv1-mount-point/marta
sudo chmod -R 755 /dv1-mount-point/marta
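The octal modes used on this page (400, 600, 700, 755, 777) each encode three permission triplets: owner, group, and others. As a small illustration, Python's standard stat module can render a mode the way ls -l displays it; the helper name describe_mode is hypothetical.

```python
import stat


def describe_mode(octal_mode, is_dir=False):
    """Render an octal permission mode the way `ls -l` displays it."""
    file_type = stat.S_IFDIR if is_dir else stat.S_IFREG
    return stat.filemode(file_type | octal_mode)


# Modes that appear on this page, paired with whether they are applied to a directory.
for mode, is_dir in [(0o400, False), (0o600, False), (0o700, True),
                     (0o755, True), (0o777, True)]:
    print(oct(mode), describe_mode(mode, is_dir))
```

For instance, describe_mode(0o755, is_dir=True) gives drwxr-xr-x: the owner can read, write, and enter the directory, while group members and others can only read and enter it.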

Automatically mount an attached EBS volume after reboot

Following reference 1, we set up an automated mounting of the EBS volume in case of a system reboot.

Assume we are ssh-ed to the EC2 instance as user with root privileges, ec2-user. Create a backup of /etc/fstab file to be used in case it is accidentally destroyed or deleted while being edited.

sudo cp /etc/fstab /etc/fstab.orig

Use lsblk command to learn the UUID of the device that you want to mount after reboot.

sudo lsblk -f

NAME          FSTYPE LABEL UUID                                 MOUNTPOINT
nvme0n1                                                         
├─nvme0n1p1   xfs    /     1a1a1a1a-1a1a-1a1a-1a1a-1a1a1a1a1a1a /
└─nvme0n1p128                                                   
nvme1n1       xfs          2b2b2b2b-2b2b-2b2b-2b2b-2b2b2b2b2b2b /dv1-mount-point

Here, it is device nvme1n1 and the UUID of interest is 2b2b2b2b-2b2b-2b2b-2b2b-2b2b2b2b2b2b.

Open the /etc/fstab using any text editor (I personally use vi text editor; see above) and add the following entry, save and exit the file.

UUID=2b2b2b2b-2b2b-2b2b-2b2b-2b2b2b2b2b2b  /dv1-mount-point  xfs  defaults,nofail  0  2

See reference 1 for optional steps to check whether the above procedure worked.
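An /etc/fstab entry consists of six whitespace-separated fields: the device (here identified by UUID), the mount point, the file system type, the mount options, the dump flag, and the fsck pass order. The nofail option lets the instance boot even if the volume cannot be mounted. The sketch below only assembles such a line for the mount point created earlier on this page; the function name fstab_entry and its defaults are illustrative assumptions, and the result still has to be added to /etc/fstab by hand.

```python
def fstab_entry(uuid, mount_point, fstype="xfs",
                options="defaults,nofail", dump=0, fsck_order=2):
    """Build one /etc/fstab line from its six whitespace-separated fields."""
    fields = [f"UUID={uuid}", mount_point, fstype, options,
              str(dump), str(fsck_order)]
    return "  ".join(fields)


# The second field must match the directory where the volume is mounted.
print(fstab_entry("2b2b2b2b-2b2b-2b2b-2b2b-2b2b2b2b2b2b", "/dv1-mount-point"))
```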

Transfer files between local machine and EC2/EBS

There are several ways of transferring files between a local machine and EC2/EBS.

SFTP (Secure File Transfer Protocol; SSH File Transfer Protocol)

On your local machine, go to the directory where your EC2 access private PEM key is stored. The following command opens an SFTP connection to EC2 using user marta credentials.

sftp -i DP-four-user.pem marta@ec2-1-1-1-1.compute-1.amazonaws.com

where DP-four-user.pem is the name of the private PEM key of user marta.

To get a list of all available SFTP commands, type help, or ?. These include:

get to download a file from the remote server to the local machine,
put to send a file from the local machine to the remote server.

The following command sends a file to marta’s user directory at EC2 instance:

put /Users/martakaras/Desktop/dropbox.py /home/marta/

The following command sends a file to marta’s user directory at EBS:

put /Users/martakaras/Desktop/dropbox.py /dv1-mount-point/marta/
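When several files need to be uploaded, the put commands can also be run non-interactively through sftp's batch-mode option (-b), which reads commands from a file. The following is a sketch under stated assumptions: the helper names are hypothetical, and paths are assumed to contain no spaces or quotes.

```python
def sftp_batch_lines(uploads):
    """Turn (local_path, remote_dir) pairs into sftp `put` commands."""
    return [f"put {local} {remote}" for local, remote in uploads]


def sftp_command(key_file, user, host, batch_file):
    """Build the argument list for a non-interactive sftp run."""
    return ["sftp", "-i", key_file, "-b", batch_file, f"{user}@{host}"]


lines = sftp_batch_lines([
    ("/Users/martakaras/Desktop/dropbox.py", "/dv1-mount-point/marta/"),
])
print("\n".join(lines))
print(sftp_command("DP-four-user.pem", "marta",
                   "ec2-1-1-1-1.compute-1.amazonaws.com", "uploads.txt"))
```

After writing the lines to uploads.txt (a hypothetical file name), the command list can be passed to subprocess.run to perform the transfer.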

Cyberduck

Cyberduck is an open-source application that allows browsing and sending/downloading files to/from a remote server.

To use Cyberduck, download the Cyberduck application, open it and select Open Connection button in top-left. A connection wizard window should open. To connect with EC2,

specify SFTP (SSH File Transfer Protocol),
set Server to EC2 public DNS URL (e.g., ec2-1-1-1-1.compute-1.amazonaws.com),
set Port to 22,
set Username of choice (e.g., marta),
set Password to the password corresponding to the username of choice (or keep the field empty if there is no password set for a particular EC2 user),
for SSH Private Key, navigate to and select the private PEM key file used to SSH to the EC2 instance (e.g., DP-four-user.pem),
select Add to Keychain,
click Connect.

If the connection is successful, a new Cyberduck window appears that lets you browse the remote directory and transfer files from/to a local machine via drag-and-drop.

GitHub repo

A possible solution is to maintain a GitHub repository that is cloned on both the local machine and the remote server. Note that even if the GitHub repository is private, one should never push any protected information to the repository.

Headless Dropbox

A possible solution is to maintain a Dropbox directory that is synced with a remote server. This approach is discussed in the next section.

Transfer files between Dropbox and EC2/EBS

References:

This section addresses the specific scenario in which a collaborator shares data with us via Dropbox (e.g., from their Dropbox Business account with MGB), the “target destination” of the data is the EBS volume, but the data is too large to be first downloaded locally and then uploaded to the EBS volume via the approaches listed earlier. (For example, Cyberduck does allow transferring files from the Dropbox server to EC2/EBS without an intermediate step of downloading to a local machine, but the approach is not feasible for large files due to very long transfer times and issues resulting from any internet connection interruptions.)

In my experience as of today, installing a “headless” Dropbox on EC2/EBS and syncing the Dropbox account with which the data has been shared is an efficient way of transferring files in such a case.

Note regarding Dropbox availability for use on EC2/EBS for all users

Many instructions to be found on the web provide a default way of installing Dropbox via command line that installs the application in the home directory of EC2 user that is conducting the installation (e.g., at /home/marta). Even if the Dropbox daemon installer is downloaded and launched in a directory shared among all users, the Dropbox daemon will create Dropbox directory at, here, /home/marta/ and sync the files there. This is undesired if the files are to be accessible by all EC2 users (not only user marta).

The instructions below borrow from reference 2. and a few others to force Dropbox daemon to create Dropbox directory and sync files at a particular location on EBS.

Note regarding syncing all files initially

Some comments on the web report that users were unable to sync selectively before the syncing of all the files in their Dropbox account had been initiated when the Dropbox daemon started for the first time. If the directory with the collaborators’ data you want to download to EC2/EBS is one of many directories in your Dropbox account, some of the other files may, undesirably, start getting synced.

A way to avoid that is to create a new Dropbox account with which only the collaborators’ directory is shared, and use that account for EC2/EBS syncing.

Install and use Dropbox daemon

Download the Python script dropbox.py linked in the reference 1. website. (Ctrl+F python on the website to locate the link). Upload the script to a location that is accessible to all users. Here, I used

put /Users/martakaras/Desktop/dropbox.py  /dv1-mount-point/shared/apps

where /dv1-mount-point/shared/apps is a directory created earlier to which all users have full permissions.

Set full permissions for all users to dropbox.py file.

chmod 777 /dv1-mount-point/shared/apps/dropbox.py

Use dropbox.py to download, install, and start the Dropbox daemon. In the below command, the HOME=/dv1-mount-point/shared part indicates where the Dropbox directory will be located. Using this each time we use dropbox.py is the key to having Dropbox directory syncing in a location other than the default Dropbox daemon location (see “Note regarding Dropbox availability for use on EC2/EBS for all users” above).

HOME=/dv1-mount-point/shared /dv1-mount-point/shared/apps/dropbox.py start -i

where option -i auto-installs the Dropbox daemon if it is not yet available on the system.

Follow the instructions from the console (including authorizing access to your Dropbox account).
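The HOME=... prefix used above sets the variable for that single command only; the shell's own HOME is left unchanged, which is why the prefix has to be repeated on every dropbox.py call. The sketch below illustrates the same mechanism from Python with a stand-in child process; the helper name run_with_home is hypothetical.

```python
import os
import subprocess
import sys


def run_with_home(command, home):
    """Run `command` with HOME overridden for that child process only."""
    env = dict(os.environ, HOME=home)  # a copy; the parent environment is untouched
    result = subprocess.run(command, env=env, check=True,
                            capture_output=True, text=True)
    return result.stdout


# Stand-in child: a Python process that prints the HOME value it sees.
child = [sys.executable, "-c", "import os; print(os.environ['HOME'])"]
print(run_with_home(child, "/dv1-mount-point/shared"))
```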

Once installation is completed, the files sync should start automatically. For reference, syncing approx. 600 GB of data took me approx. 2-3h. You can monitor the Dropbox daemon status, including the syncing status, via status function.

HOME=/dv1-mount-point/shared /dv1-mount-point/shared/apps/dropbox.py status

A list of all functions available in dropbox.py to manipulate the Dropbox daemon can be found in reference 1.

Finally, the command below sets permissions to users for the Dropbox directory, recursively.

chmod -R 755 /dv1-mount-point/shared/Dropbox

Install and use Anaconda on Linux instance

References:

Anaconda is a distribution of the Python and R programming languages that aims to simplify package management. Among other things, Anaconda lets you create multiple so-called environments, each of which may have its own version of Python and its own versions of Python packages.

Install Anaconda

Assume we are ssh-ed to the EC2 instance as a user with root privileges, ec2-user. Before installing Anaconda, make sure the following packages are installed on the EC2 instance.

sudo yum install libXcomposite libXcursor libXi libXtst libXrandr alsa-lib mesa-libEGL libXdamage mesa-libGL libXScrnSaver -y 

Navigate to a directory that every user has access to. Here, I used the / directory.

cd /

Download the installer script for the latest version of Anaconda. Go to https://www.anaconda.com/products/individual, find the one for Linux, right-click on the download button and use “Copy Link Address” to copy the download address link. Download the file using the link, e.g.

wget https://repo.anaconda.com/archive/Anaconda3-2021.11-Linux-x86_64.sh

Run the downloaded script.

When prompted for the installation location, use one that all users will have access to. I used /opt/anaconda (I later noticed that online guides quite often suggest /opt/anaconda/anaconda3 instead).
When prompted Do you wish the installer to initialize Anaconda3 by running conda init?, select yes.

sudo sh Anaconda3-2021.11-Linux-x86_64.sh

After installation completes, set the following permissions on the directory where anaconda was installed.

sudo chmod ugo+w /opt/anaconda

Add the following passage to user-specific .bashrc file located at user’s home directory – for any user who should have access to that installation of Anaconda.

To learn the home directory of a user, use echo $HOME. For example, echo $HOME prints /home/ec2-user if I am currently ssh-ed as ec2-user user, so I will paste the below passage into /home/ec2-user/.bashrc.
We may want to repeat this pasting procedure for every non-root user too (e.g., user marta).
Note the passage below may differ depending on the Anaconda installation location you chose. Assuming you were installing with the ec2-user user using sudo, your specific passage can be found in /root/.bashrc.

# >>> conda initialize >>>
# !! Contents within this block are managed by 'conda init' !!
__conda_setup="$('/opt/anaconda/bin/conda' 'shell.bash' 'hook' 2> /dev/null)"
if [ $? -eq 0 ]; then
    eval "$__conda_setup"
else
    if [ -f "/opt/anaconda/etc/profile.d/conda.sh" ]; then
        . "/opt/anaconda/etc/profile.d/conda.sh"
    else
        export PATH="/opt/anaconda/bin:$PATH"
    fi
fi
unset __conda_setup
# <<< conda initialize <<<

For the change to take effect immediately (e.g., to be able to run conda functions), source the user’s .bashrc profile.

source $HOME/.bashrc

From now on, you should be able to use conda functions, e.g.

conda list

Use Anaconda on Linux Instance

A neat thing about the setup that follows from the above installation is that while the Anaconda installation and base environment are shared among all users, any environment a user creates remains by default visible only to that user.

Assume we are ssh-ed to the EC2 instance.

To create a new environment named forest_main with Python version 3.12 use

conda create --name forest_main python=3.12

To activate an environment named forest_main use

conda activate forest_main

To deactivate an active environment use

conda deactivate

To permanently delete an environment named forest_main use

conda env remove -n forest_main

To look up environment locations use

conda info -e

# conda environments:
#
forest_main              /home/marta/.conda/envs/forest_main
base                  *  /opt/anaconda

Here, two Anaconda environments are available: base (currently active, as denoted by *) and forest_main (not currently active).

Set directories for Anaconda environments and packages installation

Following the above installation setup, by default, Anaconda installs new packages in /opt/anaconda/pkgs and creates new environments within a user-specific home directory (here, in /home/marta/.conda/envs for user marta). Both these directories are located on the EC2 instance's own storage. As new packages get installed, the space on the EC2 instance (here, 8 GB) may run out. You can look up the default locations for Anaconda environments and packages by typing

conda info

# (...)
          package cache : /opt/anaconda/pkgs
                          /home/marta/.conda/pkgs
       envs directories : /home/marta/.conda/envs
                          /opt/anaconda/envs
# (...)

As a fix, for my user marta I configured Anaconda pkgs_dirs and envs_dirs variables to point to (previously created by me) specific directories on EBS volume:

conda config --add pkgs_dirs '/dv1-mount-point/marta/apps/conda/pkgs'
conda config --add envs_dirs '/dv1-mount-point/marta/apps/conda/envs'

After the above, creating new environments and installing new packages no longer takes up space on EC2. I also see now:

conda info

# (...)
          package cache : /dv1-mount-point/marta/apps/conda/pkgs
       envs directories : /dv1-mount-point/marta/apps/conda/envs
                          /home/marta/.conda/envs
                          /opt/anaconda/envs
# (...)

and an environment named forest_main created after above setup will yield:

conda info -e

# conda environments:
#
forest_main              /dv1-mount-point/marta/apps/conda/envs/forest_main
base                  *  /opt/anaconda

Install `forest` Python library

Forest is a Python library for analyzing smartphone-based high-throughput digital phenotyping data collected with the Beiwe platform. Forest implements methods as a Python 3.12 package. Forest is integrated into the Beiwe back-end on AWS but can also be run locally.

Assume we are ssh-ed to the EC2 instance. Use the commands below to activate the Anaconda environment of choice (here, forest_main, which has Python 3.12 installed) and install git and pip.

conda activate forest_main
conda install git pip

Install forest Python library. Use @<branch name> at the end to specify the code branch of choice. Here, we install forest from main branch.

pip install git+https://github.com/onnela-lab/forest.git@main
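A quick way to confirm that the installation landed in the active environment is to check whether the package can be located by Python. The sketch below assumes the package is importable under the name forest; the helper name is_importable is hypothetical.

```python
import importlib.util


def is_importable(module_name):
    """Return True if a top-level module can be found in the active environment."""
    return importlib.util.find_spec(module_name) is not None


# Assumption: the library installs a top-level module named "forest".
print("forest importable:", is_importable("forest"))
```

Run this with the forest_main environment activated; running it in another environment checks that environment instead.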

Run a Python file from multiple concurrent jobs running in the background

References:

How to run multiple background jobs in linux?

Running a Python file from a command line

Assume we are ssh-ed to the EC2 instance. The below commands activate Anaconda environment of choice (here, forest_main), navigate to the Python script location, run an exemplary Python file, and deactivate the environment.

cd /home/marta/_PROJECTS/beiwe_Nock_U01/py
conda activate forest_main
python download_data_gps_AWS.py 0
conda deactivate

where 0 is a value of an argument passed to download_data_gps_AWS.py code that is read inside the Python file as

# (...)
job_id = int(sys.argv[1])
# (...)

Wrap a Python file run into a bash script

The commands from the above subsection “Running a Python file from a command line” can be wrapped into the following bash script, named e.g. run_download_data_gps_AWS.sh:

#!/bin/bash

# activate a specific conda environment
source /opt/anaconda/etc/profile.d/conda.sh
conda activate forest_main

# run script of interest
python download_data_gps_AWS.py "$1"

# deactivate a specific conda environment
conda deactivate

Assuming the above bash script is placed in the same directory as the download_data_gps_AWS.py file, we can use it to run download_data_gps_AWS.py via:

cd /home/marta/_PROJECTS/beiwe_Nock_U01/py
sh run_download_data_gps_AWS.sh 0

where 0 is a value of an argument passed to run_download_data_gps_AWS.sh (and then to download_data_gps_AWS.py via $1).

Run multiple bash scripts concurrently in the background

The motivation for wrapping a Python file run, together with Anaconda environment activation/deactivation, into a bash script is that it allows us to run the Python file multiple times concurrently in the background with one compact command. To accomplish this, consider the following command (note it is long but essentially one line of code).

sh run_download_data_gps_AWS.sh 0 & sh run_download_data_gps_AWS.sh 1 & sh run_download_data_gps_AWS.sh 2 & sh run_download_data_gps_AWS.sh 3 & sh run_download_data_gps_AWS.sh 4 & sh run_download_data_gps_AWS.sh 5 & sh run_download_data_gps_AWS.sh 6 & sh run_download_data_gps_AWS.sh 7 & sh run_download_data_gps_AWS.sh 8 & sh run_download_data_gps_AWS.sh 9 & 

where & is used to start multiple background jobs. Then, in my workflow, I’d put the above line of code into another bash script named e.g. run_download_data_gps_MANY_AWS.sh, and run it via

nohup sh run_download_data_gps_MANY_AWS.sh

where nohup ensures that the jobs keep running in the background even after we exit the terminal we used to run the above.

The above example launches 10 concurrent background jobs, each with a distinct value that is further used to define job_id variable value in our download_data_gps_AWS.py Python script.
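The same fan-out can be sketched in Python as an alternative to the long one-line command. This is not the workflow used above, only an illustration with stand-in child processes; the function name run_concurrently is hypothetical. The launcher itself would still need to be started under nohup to survive closing the terminal.

```python
import subprocess
import sys


def run_concurrently(commands):
    """Start every command at once, wait for all, and return their exit codes."""
    processes = [subprocess.Popen(cmd) for cmd in commands]
    return [proc.wait() for proc in processes]


if __name__ == "__main__":
    # Stand-in jobs: each child exits immediately with status 0. In the workflow
    # above, each command would be ["sh", "run_download_data_gps_AWS.sh", str(i)].
    jobs = [[sys.executable, "-c", "pass"] for _ in range(10)]
    print(run_concurrently(jobs))
```

Because all processes are started before any of them is waited on, the jobs run side by side and can use multiple cores.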

To see a potential use case, consider the case where a list of, say, N=100 Beiwe IDs is available. The above workflow could be employed so that each job processes, in a loop, a subset of 1/10 of the N=100 Beiwe IDs using some Python-implemented procedure. Here, a non-overlapping subset with 1/10 of the N=100 Beiwe IDs could be defined inside download_data_gps_AWS.py based on the job_id variable value.
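One possible way to define such non-overlapping subsets is slicing with a stride equal to the number of jobs. This is a sketch, not necessarily how download_data_gps_AWS.py does it; the function name ids_for_job and the placeholder IDs are hypothetical.

```python
def ids_for_job(all_ids, job_id, n_jobs=10):
    """Return the non-overlapping subset of `all_ids` assigned to `job_id`."""
    if not 0 <= job_id < n_jobs:
        raise ValueError("job_id must be in range(n_jobs)")
    # Striding gives every n_jobs-th ID to the same job, so the subsets
    # never overlap and their sizes differ by at most one.
    return all_ids[job_id::n_jobs]


beiwe_ids = [f"id{i:03d}" for i in range(100)]  # placeholder IDs, not real ones
subset = ids_for_job(beiwe_ids, job_id=0)
print(len(subset), subset[:3])
```

With N=100 IDs and 10 jobs, each job receives exactly 10 IDs, and together the 10 subsets cover the full list.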
