Home

Forest is a library for analyzing smartphone-based high-throughput digital phenotyping data collected with the Beiwe platform. Forest implements methods as a Python 3.11 package. Forest is integrated into the Beiwe back-end on AWS but can also be run locally.

Forest trees

Forest is organized into subpackages, or trees, each of which addresses a specific area of analytics.

  • Bonsai:

    • Generation of location and call/text log data for a synthetic population

    • Input: simulation parameters

    • Current maintainer: Georgios Efstathiadis

  • Jasmine:

    • Location data imputation and mobility summary statistics

    • Input: GPS data

    • Current maintainer: Georgios Efstathiadis

  • Willow:

    • Call log and text log summary statistics

    • Input: call/text log (from Android phones only)

    • Current maintainer: Patrick Emedom-Nnamdi

  • Poplar: documentation and example code

    • Various common functions for data preparation, mainly used by other trees (e.g. time zone conversion, reading/writing)

  • Sycamore:

    • Survey data summary statistics and preprocessed data

    • Input: survey_timings and survey_answers data

    • Current maintainer: Zachary Clement

  • Oak:

    • Gait time, cadence, and step count statistics

    • Input: accelerometer data

    • Current maintainer: Marcin Straczkiewicz

Expected input data

Input data should be included in one directory. This directory cannot be on a cloud server. Inside that directory, there should be a direct subdirectory corresponding to each participant ID. Inside each Beiwe ID subdirectory, there should be a direct subdirectory corresponding to each downloaded data stream.
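As a quick check of this layout before running any tree, a minimal sketch along the following lines could list what is on disk. The function name `list_participant_streams` and the example IDs and stream names are illustrative assumptions, not part of Forest.

```python
from pathlib import Path


def list_participant_streams(study_dir):
    """Map each participant subdirectory to the data stream subdirectories inside it.

    Hypothetical helper for inspecting the expected layout; not a Forest function.
    """
    layout = {}
    for participant_dir in sorted(Path(study_dir).iterdir()):
        if not participant_dir.is_dir():
            continue  # skip stray files at the top level
        # Each direct subdirectory is assumed to be one downloaded data stream.
        streams = sorted(p.name for p in participant_dir.iterdir() if p.is_dir())
        layout[participant_dir.name] = streams
    return layout
```

Calling `list_participant_streams("path/to/data")` returns a dictionary such as `{"abc12345": ["accelerometer", "gps"]}`, which makes it easy to spot participants with missing streams.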

Methods are designed to work on data collected using the Beiwe app, with the types of sensor on/off cycles run by the Beiwe app and with csv files containing the columns generated by Beiwe, or on generated data that matches the data returned from the Beiwe app. Some methods included in Forest are compatible with other data collection environments, but code changes would be required to make Forest work well with those environments.

Did you know you can use Python to download Beiwe data even more conveniently than from the portal? Download mano and follow the instructions at https://github.com/harvard-nrg/mano

Output & available summary statistics

GPS

The outputs of the GPS module are:

  1. summary statistics for all specified participants (.csv);

  2. imputed trajectories (.csv), given as timestamp, latitude, and longitude. By default, this output is set to FALSE;

  3. all_BV_set (.pkl), which is a dictionary with the user ID as the key and a numpy array as the value, where the columns of the array are [start_timestamp, start_latitude, start_longitude, end_timestamp, end_latitude, end_longitude]. If this is your first time running the code, it is set to NULL by default. If you want to continue your analysis from here in the future, all_BV_set is expected to be an input to your new analysis, and it will be updated in that run. The size of the file should stay fixed over time;

  4. all_memory_dict (.pkl), which is also a dictionary, with the user ID as the key and a numpy array of other parameters for the user as the value. If this is your first time running the code, it is set to NULL by default. If you want to continue your analysis from here in the future, all_memory_dict is expected to be an input to your new analysis, and it will be updated in that run. The size of the file should stay fixed over time;

  5. locations_log (.json), a JSON file created if save_osm_log is set to True. It contains information on the places visited by the user, their tags, and the time of each visit.
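To inspect a saved all_BV_set file, a minimal sketch such as the one below could label each row with the column names listed above. The helper name `load_bv_records` is an assumption made for illustration and is not part of Forest. If the file stores numpy arrays, numpy must be installed for unpickling to succeed, and pickle files should only be loaded from trusted sources.

```python
import pickle

# Column order as described for all_BV_set in the list above.
BV_COLUMNS = (
    "start_timestamp", "start_latitude", "start_longitude",
    "end_timestamp", "end_latitude", "end_longitude",
)


def load_bv_records(pkl_path):
    """Return {user_id: [one dict per row]} from an all_BV_set-style pickle.

    Hypothetical helper; assumes each value is a 2-D array (or list of rows)
    with six columns in the order given by BV_COLUMNS.
    """
    with open(pkl_path, "rb") as f:
        bv_set = pickle.load(f)
    records = {}
    for user_id, rows in bv_set.items():
        # Iterating over a 2-D numpy array or a list of lists yields one row at a time.
        records[user_id] = [
            dict(zip(BV_COLUMNS, (float(v) for v in row))) for row in rows
        ]
    return records
```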

List of summary statistics

The summary statistics that are generated are listed below:

| Variable | Type | Description of Variable | Description of What it Measures |
| --- | --- | --- | --- |
| obs_duration | Float | The total time when the GPS is on | This variable quantifies the missingness and the uncertainty in all other estimates |
| obs_day | Float | The total time when the GPS is on from 8AM to 8PM | This variable quantifies the missingness in daytime and the majority of uncertainty in all other estimates |
| obs_night | Float | The total time when the GPS is on from 8PM to 8AM | This variable quantifies the missingness at night and the minority of uncertainty in all other estimates, since the user is most likely at home |
| home_time | Float | Time spent at home over the course of a day (in hours) | Note that a person can have non-zero Distance traveled when at home, which would indicate they are moving within their home. When Home time is zero, this indicates the person has spent the day away from home. “Home” is the most frequently visited significant location for a person between the hours of 8PM and 8AM each day over the course of follow-up. |
| dist_traveled | Float | Total distance traveled over the course of a day (in km) | The sum of the lengths of all flights. A flight is defined to be a longest straight-line trip of a particle from one location to another without a directional change or pause. Please find the technical details here. |
| radius | Float | Average radius that a person travels from their center over the course of a day (in km) | Centroid = the average of each ‘place visited’ (see definition ‘significant location’) over the course of a day, with weights proportional to the amount of time spent in the location. The radius of gyration is calculated using a time-weighted average of the distance between each place and the centroid, where weights are measured in the same way. |
| diameter | Float | Largest distance between any two places that a person visited in a day (in km) | Please find the technical details on how a place is defined here |
| max_dist_home | Float | Largest distance between any place that a person visited in a day and their home (in km) | |
| num_sig_places | int | Number of significant locations visited at any point over the course of a day | Significant locations are distinct pauses which are at least 15 minutes long and 50 meters apart. They are determined using K-means clustering on locations that a patient visits over the course of follow-up. Set K=K+1 and repeat clustering until two significant locations are within 100 meters of one another. Then use the results from the previous step (K-1) as the total number of significant locations. |
| total_flight_time | Float | Total time spent in flight over the course of a day (in hours) | A flight is defined to be a longest straight-line trip of a particle from one location to another without a directional change or pause. |
| av_flight_length | Float | Average of the length of all flights (straight line movement) that took place over the course of a day (in km) | GPS is converted into a sequence of flights (straight line movement) and pauses (time spent stationary). A flight is defined to be a longest straight-line trip of a particle from one location to another without a directional change or pause. Note that a long flight could be composed of several short flights with different directions; the average is taken over those short flights. Please find the technical details here. |
| sd_flight_length | Float | Standard deviation of the length of all flights (straight line movement) that took place over the course of a day (in km) | GPS is converted into a sequence of flights (straight line movement) and pauses (time spent stationary). The standard deviation of the length of flights of the day is reported. |
| av_flight_duration | Float | Average of the duration of all flights (straight line movement) that took place over the course of a day (in hours) | GPS is converted into a sequence of flights (straight line movement) and pauses (time spent stationary). The average of the duration of flights of the day is reported. |
| sd_flight_duration | Float | Standard deviation of the duration of all flights (straight line movement) that took place over the course of a day (in hours) | GPS is converted into a sequence of flights (straight line movement) and pauses (time spent stationary). The standard deviation of the duration of flights of the day is reported. |
| total_pause_time | Float | Total time spent in pause over the course of a day (in hours) | A pause is defined to be a longest period of time spent stationary without a directional change or flight. |
| av_pause_duration | Float | Average of the duration of all pauses that took place over the course of a day (in hours) | A participant is considered to have a pause if the distance they have moved during a 30-s period is less than r m. By default, r=10. |
| sd_pause_duration | Float | Standard deviation of the duration of all pauses that took place over the course of a day (in hours) | GPS is converted into a sequence of flights (straight line movement) and pauses (time spent stationary). The standard deviation of the duration of pauses over the course of a day is reported. |
| entropy | Float | Entropy measure based on the proportion of time spent at significant locations over the course of a day | Letting p_i be the proportion of the day spent at significant location i, significant location entropy is calculated as $-\sum_{i} p_i \log(p_i)$, where the sum is taken over all non-zero p_i for that day. |
| mis_duration | Not Available | Number of hours of GPS data missing over the course of a day | |
| Physical circadian rhythm | Float | A continuous measurement of routine in the interval [0,1] that scores a day with 0 if there was a complete break from routine and 1 if the person followed the exact same routine as they have on every other day of follow-up | For a detailed description of how this measure is calculated, see Canzian and Musolesi’s 2015 paper in the Proceedings of the 2015 ACM International Joint Conference on Pervasive and Ubiquitous Computing, titled “Trajectories of depression: unobtrusive monitoring of depressive states by means of smartphone mobility traces analysis.” Their procedure was followed using 30-min increments as a bin size. |
| Physical circadian rhythm stratified | Float | A continuous measurement of routine in the interval [0,1] that scores a day with 0 if there was a complete break from routine and 1 if the person followed the exact same routine as they have on every other day of follow-up | Calculated in the same way as Physical circadian rhythm, except the procedure is repeated separately for weekends and weekdays. |
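The entropy statistic is the easiest of these to reproduce by hand. The sketch below applies the formula from the table to a list of proportions p_i. The function name `location_entropy` is hypothetical, and the use of the natural logarithm is an assumption, since the description above does not state the base.

```python
import math


def location_entropy(proportions):
    """Significant location entropy for one day: -sum(p_i * log(p_i)).

    `proportions` holds p_i, the proportion of the day spent at each
    significant location. Zero proportions are skipped, as in the table.
    Assumes the natural logarithm; the base is not stated in the source.
    """
    return -sum(p * math.log(p) for p in proportions if p > 0)
```

A day split evenly between two significant locations gives `location_entropy([0.5, 0.5])`, which equals log(2), while a day spent entirely in one location gives zero.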

Call & text logs

List of summary statistics

| Variable | Type | Description of Variable |
| --- | --- | --- |
| num_in_call | int | The total number of incoming calls. |
| num_out_call | int | The total number of outgoing calls. |
| num_mis_call | int | The total number of missed calls. |
| num_in_caller | int | The total number of unique individuals who called the subject. |
| num_out_caller | int | The total number of unique individuals called by the subject. |
| num_mis_caller | int | The total number of unique individuals who called the subject without the call being answered. |
| total_mins_in_calls | float | The total duration (in minutes) of all incoming calls. |
| total_mins_out_call | float | The total duration (in minutes) of all outgoing calls. |
| num_uniq_individuals_call_or_text | int | The total number of unique individuals who called or texted the subject, or whom the subject called or texted; that is, the total number of individuals with whom the subject had any kind of communication. |
| num_s | int | The total number of sent SMS. |
| num_r | int | The total number of received SMS. |
| num_mms_s | int | The total number of sent MMS. |
| num_mms_r | int | The total number of received MMS. |
| num_s_tel | int | The total number of unique phone numbers that sent messages to the subject. |
| num_r_tel | int | The total number of unique phone numbers that received messages from the subject. |
| total_char_s | int | The total number of characters in all sent messages. |
| total_char_r | int | The total number of characters in all received messages. |
| text_reciprocity_incoming | int | The total number of unique phone numbers that sent messages to the subject but did not receive a reply. |
| text_reciprocity_outgoing | int | The total number of unique phone numbers that received messages from the subject but did not reply. |
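The two reciprocity counts can be read as set differences between the numbers the subject texted and the numbers that texted the subject. The sketch below illustrates that reading only. The function name `text_reciprocity` is hypothetical, and the sketch ignores message timing, which an actual implementation may take into account.

```python
def text_reciprocity(sent_to, received_from):
    """Count one-sided text contacts within a time window.

    sent_to:        phone numbers the subject sent messages to
    received_from:  phone numbers that sent messages to the subject
    Returns (incoming, outgoing) in the sense of the table above.
    Illustrative only: message order and timing are ignored.
    """
    sent, received = set(sent_to), set(received_from)
    incoming = len(received - sent)  # texted the subject, got no reply
    outgoing = len(sent - received)  # texted by the subject, did not reply
    return incoming, outgoing
```

For example, if the subject texted numbers A and B and received texts from B and C, one incoming contact (C) and one outgoing contact (A) are unreciprocated.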

Surveys

Note

For best performance with Sycamore, do not change survey questions and answers after the study period has started.

The following variables are created in the “submits_summary.csv” file. This file will only be generated if the config file and intervention timings file are provided. The submits_summary_daily.csv and submits_summary_hourly.csv files contain the same columns, but aggregated at the daily or hourly level rather than at the user level.

| Variable | Type | Description |
| --- | --- | --- |
| survey id | str | ID of the survey to which this row applies. Note: If submits_by_survey_id is False, surveys will not be aggregated at the survey level (they will only be aggregated by user), so this column will not appear. |
| year | int | Year of the time period over which submits/deliveries are being aggregated. This is only included in submits_summary_daily.csv and submits_summary_hourly.csv |
| month | int | Month of the time period over which submits/deliveries are being aggregated. This is only included in submits_summary_daily.csv and submits_summary_hourly.csv |
| day | int | Day over which submits/deliveries are being aggregated. This is only included in submits_summary_daily.csv and submits_summary_hourly.csv |
| hour | int | Hour over which submits/deliveries are being aggregated. This is only included in submits_summary_hourly.csv |
| num_surveys | int | Number of surveys scheduled for delivery to the individual during the period |
| num_submitted_surveys | int | Number of surveys submitted during the period (i.e. the user hit submit on all surveys) |
| num_opened_surveys | int | Number of surveys opened by the individual during the time period (i.e. the user answered at least one question) |
| avg_time_to_submit | float | Average time between survey delivery and survey submission, in seconds, for complete surveys |
| avg_time_to_open | float | Average time between survey delivery and survey opening, in seconds. This is averaged over survey responses where a survey_timings file was available, because no information about survey opening exists for responses where a survey_timings file is missing. |
| avg_duration | float | Average time between survey opening and survey submission, in seconds. This is averaged over survey responses where a survey_timings file was available, because no information about survey opening exists for responses where a survey_timings file is missing. |

The following variables are created in the “submits_and_deliveries.csv” file. This file will only be generated if the config file and intervention timings file are provided.

| Variable | Type | Description |
| --- | --- | --- |
| survey id | str | ID of the survey |
| delivery_time | str | A scheduled delivery time. If surveys are weekly, delivery times will be generated for each week between start_date and end_date |
| submit_flg | int | Whether a survey was submitted between this delivery time and the next delivery time |
| submit_time | str | Either the time when the user hit submit or the time when the individual stopped interacting with the survey for that session |
| time_to_submit | float | Time between survey delivery and survey submission, in seconds. If a survey was incomplete, this will be blank. |
| time_to_open | float | Time between survey delivery time and the first recorded survey answer, in seconds (for responses where a survey_timings file was available; if only a survey_answers file was available, this will be 0) |
| survey_duration | float | Time between the first recorded survey answer and the survey submission, in seconds (for responses where a survey_timings file was available; if only a survey_answers file was available, this will be NA) |
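The submit_flg definition (a submission falling between one delivery time and the next) can be sketched as follows. The function name `flag_submissions` is hypothetical. The sketch assumes timezone-naive datetime objects and treats the window after the last delivery as open-ended, which may differ from how the study end_date is handled in practice.

```python
from bisect import bisect_left
from datetime import datetime


def flag_submissions(delivery_times, submit_times):
    """Return one 0/1 flag per delivery, in chronological order.

    A delivery is flagged 1 if some submission falls in
    [this delivery time, next delivery time).
    Assumption: the last window has no upper bound.
    """
    deliveries = sorted(delivery_times)
    submits = sorted(submit_times)
    flags = []
    for i, start in enumerate(deliveries):
        end = deliveries[i + 1] if i + 1 < len(deliveries) else datetime.max
        # Index of the first submission at or after this delivery time.
        first = bisect_left(submits, start)
        flags.append(1 if first < len(submits) and submits[first] < end else 0)
    return flags
```

With weekly deliveries on 1, 8, and 15 January and submissions on 2 and 16 January, the flags are 1, 0, and 1.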

The following variables are created in the “answers_data.csv” file. This file will be generated if a survey config file is available.

| Variable | Type | Description |
| --- | --- | --- |
| survey id | str | ID of the survey |
| beiwe_id | str | The participant’s Beiwe ID |
| question id | str | The ID of the question for this line |
| question text | str | The question text corresponding to the answer |
| question type | str | The type of question (radio button, free response, etc.) corresponding to the answer |
| question answer options | str | The answer options presented to the user (applicable for check box or radio button surveys) |
| timestamp | str | The Unix timestamp corresponding to the latest time the user was on the question |
| Local time | str | The local time corresponding to the latest time the user was on the question |
| last_answer | str | The last answer the user had selected before moving on to the next question or submitting |
| all_answers | str | A list of all answers the user selected |
| num_answers | int | The number of different answers selected by the user (the length of the list in all_answers) |
| first_time | str | The local time corresponding to the earliest time the user was on the question |
| last_time | str | The local time corresponding to the latest time the user was on the question |
| time_to_answer | float | The time that the user spent on the question |

The following variables are created in the “answers_summary.csv” file. This file will only be generated if the config file and intervention timings file are provided.

| Variable | Type | Description |
| --- | --- | --- |
| survey id | str | ID of the survey |
| beiwe_id | str | The participant’s Beiwe ID |
| question id | str | The ID of the question for this line |
| num_answers | int | The number of times the question is answered in the given data |
| average_time_to_answer | float | The average number of seconds the user takes to answer the question |
| average_number_of_answers | float | Average number of answers selected for a question. This indicates whether a user changed an answer before submitting it. |
| most_common_answer | str | A user’s most common answer to a question |
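To illustrate how the per-question summary relates to the row-level answers_data.csv columns, the sketch below aggregates the rows for one participant and one question. The function name `summarize_question` is hypothetical, and basing most_common_answer on last_answer is an assumption for illustration rather than a statement of how Sycamore computes it.

```python
from collections import Counter
from statistics import mean


def summarize_question(rows):
    """Aggregate answers_data-style rows for one user and one question.

    Each row is a dict with 'last_answer', 'num_answers', and
    'time_to_answer' keys, named after the answers_data.csv columns.
    Assumption: the most common answer is taken over last_answer values.
    """
    last_answers = [row["last_answer"] for row in rows]
    return {
        "num_answers": len(rows),  # times the question was answered
        "average_time_to_answer": mean(row["time_to_answer"] for row in rows),
        "average_number_of_answers": mean(row["num_answers"] for row in rows),
        "most_common_answer": Counter(last_answers).most_common(1)[0][0],
    }
```

An average_number_of_answers above 1 in this summary means the user changed their selection at least once before moving on.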

The following variables are created in the “submits_only.csv” file. This file will always be generated.

| Variable | Type | Description |
| --- | --- | --- |
| survey id | str | ID of the survey |
| beiwe_id | str | The participant’s Beiwe ID |
| surv_inst_flg | int | A “submission flag” which distinguishes submissions that are done by the same individual on the same survey |
| max_time | str | Either the time when the user hit submit or the time when the individual stopped interacting with the survey for that session |
| min_time | str | The earliest time the individual was interacting with the survey that session |
| time_to_complete | float | Time between min_time and max_time, in seconds (for responses where a survey_timings file was available) |

The following variables are created in a csv file for each survey.

| Variable | Type | Description |
| --- | --- | --- |
| start_time | str | Time this survey submission was started |
| end_time | str | Time this survey submission was ended |
| survey_duration | float | Difference between start and end time, in seconds (for surveys where a survey_timings file was available) |
| question_1, question_2, … | str | Responses to each question in the survey |

Accelerometer

The output of the accelerometer module contains gait summary statistics for each specified participant in daily (_gait_daily.csv) or hourly (_gait_hourly.csv) windows.

The following variables are created in each of these csv files.

| Variable | Type | Description |
| --- | --- | --- |
| date | str | Time of observation (_gait_daily.csv format: yyyy-mm-dd; _gait_hourly.csv format: yyyy-mm-dd HH:MM:SS) |
| walking_time | int | Total walking time (in seconds) |
| steps | int | Total steps taken |
| cadence | float | Average cadence in time window (daily or hourly) |
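As an example of working with these files, the sketch below totals walking_time and steps across all windows of a gait summary file and derives an overall steps-per-minute figure. The function name `total_gait` is hypothetical, and the derived rate is a simple illustration, not necessarily the same quantity or unit as the cadence column. Windows with missing or non-numeric values are skipped, which is an assumption about how missing data appears.

```python
import csv
import math


def total_gait(csv_file):
    """Sum walking_time (seconds) and steps over all rows of a gait summary file.

    `csv_file` is an open text file with 'walking_time' and 'steps' columns.
    Returns (total_seconds, total_steps, steps_per_minute_of_walking).
    """
    total_seconds = 0.0
    total_steps = 0.0
    for row in csv.DictReader(csv_file):
        try:
            seconds = float(row["walking_time"])
            steps = float(row["steps"])
        except (TypeError, ValueError):
            continue  # empty or non-numeric cell: skip this window
        if math.isnan(seconds) or math.isnan(steps):
            continue  # explicit NaN values are also treated as missing
        total_seconds += seconds
        total_steps += steps
    rate = total_steps / (total_seconds / 60) if total_seconds else None
    return total_seconds, total_steps, rate
```

The function can be used as `with open(path, newline="") as f: total_gait(f)`, where `path` points to one participant's daily or hourly gait file.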

Based on multiple data streams

Other resources

You may consider various other resources, for example if you: