Home

Forest is a library for analyzing smartphone-based high-throughput digital phenotyping data collected with the Beiwe platform. Forest implements methods as a Python 3.8 package. Forest is integrated into the Beiwe back-end on AWS but can also be run locally.

Table of Contents

Forest trees
Expected input data
Output & available summary statistics
Other resources

Forest trees 

Forest is organized into subpackages, or trees, each of which addresses a specific area of analytics.

Bonsai:
- Location and call/text log data generation of a synthetic population
- Input: simulation parameters
- Current maintainer: Georgios Efstathiadis
Jasmine:
- Location data imputation and mobility summary statistics
- Input: GPS data
- Current maintainer: Georgios Efstathiadis
Willow:
- Call log and text log summary statistics
- Input: call/text log (from Android phones only)
- Current maintainer: Patrick Emedom-Nnamdi
Poplar: documentation and example code
- Various common functions for data preparation, mainly used by other trees (e.g. time zone conversion, reading/writing)
Sycamore:
- Survey data summary statistics and preprocessed data
- Input: survey_timings and survey_answers data
- Current maintainer: Zachary Clement
Oak:
- Gait time, cadence, and step count statistics
- Input: accelerometer data
- Current maintainer: Marcin Straczkiewicz

Expected input data 

Input data should be included in one directory. This directory cannot be on a cloud server. Inside that directory, there should be a direct subdirectory corresponding to each participant ID. Inside each Beiwe ID subdirectory, there should be a direct subdirectory corresponding to each downloaded data stream.
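
The layout described above can be checked before running any tree. The following is a minimal sketch using only the Python standard library; the function name `list_participant_streams` is hypothetical and is not part of Forest.

```python
from pathlib import Path


def list_participant_streams(study_dir):
    """Map each participant ID folder to the data stream folders inside it."""
    layout = {}
    for participant_dir in sorted(Path(study_dir).iterdir()):
        if not participant_dir.is_dir():
            continue  # skip stray files at the top level
        streams = sorted(
            child.name for child in participant_dir.iterdir() if child.is_dir()
        )
        layout[participant_dir.name] = streams
    return layout
```

A participant that maps to an empty list has no downloaded data streams, which usually points to an incomplete download or a misplaced directory.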

Methods are designed to work on data collected using the Beiwe app, with the types of sensor on/off cycles run by the Beiwe app and with csv files containing the columns generated by Beiwe, or on generated data that matches the data returned by the Beiwe app. Some methods included in Forest are compatible with other data collection environments, but code changes would be required to make Forest work well with those environments.

Did you know you can use Python to download Beiwe data even more conveniently than from the portal? Download mano and follow the instructions at https://github.com/harvard-nrg/mano

Output & available summary statistics 

GPS 

The output of the GPS module contains:

summary statistics for all specified participants (.csv);
imputed trajectories (.csv) in terms of timestamp, latitude and longitude. By default, this output is set to FALSE;
all_BV_set (.pkl), which is a dictionary, with the key as the user ID and the value as a numpy array, where each column represents [start_timestamp, start_latitude, start_longitude, end_timestamp, end_latitude, end_longitude]. If this is your first time running the code, it is set to NULL by default. If you want to continue your analysis from here in the future, all_BV_set is expected to be an input in your new analysis and it will be updated in that run. The size of the file should stay fixed over time;
all_memory_dict (.pkl), which is also a dictionary, with the key as the user ID and the value as a numpy array of other parameters for the user. If this is your first time running the code, it is set to NULL by default. If you want to continue your analysis from here in the future, all_memory_dict is expected to be an input in your new analysis and it will be updated in that run. The size of the file should stay fixed over time;
locations_log (.json), a JSON file created if save_osm_log is set to True. It contains information on the places visited by the user, their tags and the time of visit.
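
When continuing an earlier analysis, the two .pkl dictionaries have to be read back in first. The sketch below shows one way to do that with the standard `pickle` module. The helper name `load_previous_state` and the file names `all_BV_set.pkl` and `all_memory_dict.pkl` are assumptions for illustration; how the loaded dictionaries are then passed to Forest is not covered here.

```python
import pickle
from pathlib import Path


def load_previous_state(output_dir, name):
    """Return the pickled dictionary saved by an earlier run, or None on a first run."""
    path = Path(output_dir) / f"{name}.pkl"  # assumed file naming
    if not path.exists():
        return None  # first run: nothing to continue from
    with path.open("rb") as handle:
        return pickle.load(handle)
```

Because the dictionary values are numpy arrays, numpy needs to be installed in the environment that unpickles the files. Only unpickle files that you created yourself.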

List of summary statistics

The summary statistics that are generated are listed below:

Variable	Type	Description of Variable	Description of What it Measures
obs_duration	Float	The total time when the GPS is on	This variable quantifies the missingness and the uncertainty in all other estimates
obs_day	Float	The total time when the GPS is on from 8AM to 8PM	This variable quantifies the missingness in daytime and the majority of uncertainty in all other estimates
obs_night	Float	The total time when the GPS is on from 8PM to 8AM	This variable quantifies the missingness at night and the minority of uncertainty in all other estimates since the user is most likely at home
home_time	Float	Time spent at home over the course of a day (in hours)	Note that a person can have non-zero `Distance traveled` when at home, which would indicate they are moving within their home. When `Home time` is zero, this indicates the person has spent the day away from home. “Home” is the most frequently visited significant location for a person between the hours of 8pm and 8am each day over the course of follow up.
dist_traveled	Float	Total distance travelled over the course of a day (in km)	The sum of lengths of all flights. A flight is defined to be a longest straight-line trip of a particle from one location to another without a directional change or pause. Please find the technical details here.
radius	Float	Average radius that a person travels from their center over the course of a day (in km)	Centroid = the average of each ‘place visited’ (see definition ‘significant location’) over the course of a day, with weights proportional to the amount of time spent in the location. The radius of gyration is calculated using a time-weighted average of the distance between each place and the centroid, where weights are measured in the same way.
diameter	Float	Largest distance between any two places that a person visited in a day (in km)	Please find the technical details on how a place is defined here
max_dist_home	Float	Largest distance between any places that a person visited in a day and their home (in km)
num_sig_places	int	Number of significant places visited at any point over the course of a day	Significant locations are distinct pauses which are at least 15 minutes long and 50 meters apart. They are determined using K-means clustering on locations that a patient visits over the course of follow up. K is increased by one (K=K+1) and the clustering is repeated until two significant locations are within 100 meters of one another. The results from the previous step (K-1) are then used as the total number of significant locations.
total_flight_time	Float	Total time spent in flight over the course of a day (in hours)	A flight is defined to be a longest straight-line trip of a particle from one location to another without a directional change or pause.
av_flight_length	Float	Average of the length of all flights (straight line movement) that took place over the course of a day (in km)	GPS is converted into a sequence of flights (straight line movement) and pauses (time spent stationary). A flight is defined to be a longest straight-line trip of a particle from one location to another without a directional change or pause. Note that a long flight could be composed of several short flights with different directions, but when calculating the average, it is the mean of those short flights. Please find the technical details here.
sd_flight_length	Float	Standard deviation of the length of all flights (straight line movement) that took place over the course of a day (in km)	GPS is converted into a sequence of flights (straight line movement) and pauses (time spent stationary). The standard deviation of flights of the day is reported.
av_flight_duration	Float	Average of the duration of all flights (straight line movement) that took place over the course of a day (in hours)	GPS is converted into a sequence of flights (straight line movement) and pauses (time spent stationary). The average of the duration of flights of the day is reported.
sd_flight_duration	Float	Standard deviation of the duration of all flights (straight line movement) that took place over the course of a day (in hours)	GPS is converted into a sequence of flights (straight line movement) and pauses (time spent stationary). The standard deviation of the duration of flights of the day is reported.
total_pause_time	Float	Total time spent in pause over the course of a day (in hours)	A pause is defined to be a longest period of time spent stationary without a directional change or flight.
av_pause_duration	Float	Average of the duration of all pauses that took place over the course of a day (in hours)	We consider that a participant has a pause if the distance they have moved during a 30-s period is less than `r` m. By default, `r`=10.
sd_pause_duration	Float	Standard deviation of the duration of all pauses that took place over the course of a day (in hours)	GPS is converted into a sequence of flights (straight line movement) and pauses (time spent stationary). The standard deviation of the duration of pauses over the course of a day is reported.
entropy	Float	Entropy measure based on the proportion of time spent at significant locations over the course of a day	Letting p_i be the proportion of the day spent at significant location i, significant location entropy is calculated as -\sum_{i} p_i \log(p_i), where the sum is taken over all non-zero p_i for that day.
mis_duration	Not Available	Number of hours of GPS data missing over the course of a day
Physical circadian rhythm	Float	A continuous measurement of routine in the interval [0,1] that scores a day with 0 if there was a complete break from routine and 1 if the person followed the exact same routine as they have on every other day of follow up	For a detailed description of how this measure is calculated, see Canzian and Musolesi’s 2015 paper in the Proceedings of the 2015 ACM International Joint Conference on Pervasive and Ubiquitous Computing, titled “Trajectories of depression: unobtrusive monitoring of depressive states by means of smartphone mobility traces analysis.” Their procedure was followed using 30-min increments as a bin size.
Physical circadian rhythm stratified	Float	A continuous measurement of routine in the interval [0,1] that scores a day with 0 if there was a complete break from routine and 1 if the person followed the exact same routine as they have on every other day of follow up	Calculated in the same way as Physical circadian rhythm, except the procedure is repeated separately for weekends and weekdays.
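
The definitions of `entropy` and `radius` in the table can be illustrated with a short sketch. This is not Forest's implementation. It assumes the natural logarithm, a haversine great-circle distance with an Earth radius of 6371 km, and a centroid computed as a simple weighted average of latitude and longitude, which is a reasonable approximation only over short distances. All function names are hypothetical.

```python
import math


def location_entropy(proportions):
    """Entropy -sum(p_i * log(p_i)) over the non-zero proportions p_i."""
    return -sum(p * math.log(p) for p in proportions if p > 0)


def great_circle_km(lat1, lon1, lat2, lon2, earth_radius_km=6371.0):
    """Haversine distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (
        math.sin(d_phi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2
    )
    return 2 * earth_radius_km * math.asin(math.sqrt(a))


def radius_of_gyration_km(places):
    """Time-weighted mean distance from the time-weighted centroid.

    `places` is a list of (latitude, longitude, hours_spent) tuples.
    """
    total_hours = sum(hours for _, _, hours in places)
    # Centroid: average of the places, weighted by time spent in each.
    centroid_lat = sum(lat * hours for lat, _, hours in places) / total_hours
    centroid_lon = sum(lon * hours for _, lon, hours in places) / total_hours
    # Radius: distances to the centroid, weighted in the same way.
    return sum(
        hours * great_circle_km(lat, lon, centroid_lat, centroid_lon)
        for lat, lon, hours in places
    ) / total_hours
```

A day spent entirely at one significant location gives an entropy of 0 and a radius of 0. Splitting the day evenly between two locations gives an entropy of log(2).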

Call & text logs 

List of summary statistics

Variable	Type	Description of Variable
num_in_call	int	The total number of incoming calls.
num_out_call	int	The total number of outgoing calls.
num_mis_call	int	The total number of missed calls.
num_in_caller	int	The total number of unique individuals who called the subject.
num_out_caller	int	The total number of unique individuals called by the subject.
num_mis_caller	int	The total number of unique individuals who called the subject but whose calls were not answered.
total_mins_in_calls	float	The total duration (in minutes) of all incoming calls.
total_mins_out_call	float	The total duration (in minutes) of all outgoing calls.
num_uniq_individuals_call_or_text	int	The total number of unique individuals who called or texted the subject, or whom the subject called or texted, i.e. the total number of individuals with whom the subject had any kind of communication.
num_s	int	The total number of sent SMS.
num_r	int	The total number of received SMS.
num_mms_s	int	The total number of sent MMS.
num_mms_r	int	The total number of received MMS.
num_s_tel	int	The total number of unique phone numbers that sent messages to the subject.
num_r_tel	int	The total number of unique phone numbers that received messages from the subject.
total_char_s	int	The total number of characters in all sent messages.
total_char_r	int	The total number of characters in all received messages.
text_reciprocity_incoming	int	The total number of unique phone numbers that sent messages to the subject but did not receive a reply.
text_reciprocity_outgoing	int	The total number of unique phone numbers that received messages from the subject but didn’t reply.
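
To make the counting rules concrete, here is a minimal sketch of how a few of these statistics could be derived from a list of log records. The record keys (`"number"`, `"type"`, `"direction"`) and their values are hypothetical and do not correspond to Beiwe column names. The sketch also treats any message in the opposite direction within the same window as a reply, regardless of order, which is a simplifying assumption.

```python
def call_counts(calls):
    """Count incoming calls and unique incoming callers.

    `calls` is a list of dicts with hypothetical keys "number" and "type".
    """
    incoming = [c for c in calls if c["type"] == "incoming"]
    return {
        "num_in_call": len(incoming),
        "num_in_caller": len({c["number"] for c in incoming}),
    }


def text_reciprocity(texts):
    """Count phone numbers with one-directional texting, in each direction.

    `texts` is a list of dicts with hypothetical keys "number" and
    "direction" ("sent" or "received", from the subject's point of view).
    """
    sent_to = {t["number"] for t in texts if t["direction"] == "sent"}
    received_from = {t["number"] for t in texts if t["direction"] == "received"}
    return {
        # Numbers that texted the subject and never got a message back.
        "text_reciprocity_incoming": len(received_from - sent_to),
        # Numbers the subject texted that never sent a message back.
        "text_reciprocity_outgoing": len(sent_to - received_from),
    }
```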

Surveys 

Note

For best performance with Sycamore, do not change survey questions or answer options after the study period has started.

The following variables are created in the “submits_summary.csv” file. This file will only be generated if the config file and intervention timings file are provided. The submits_summary_daily.csv and submits_summary_hourly.csv files contain the same columns, but aggregated at the daily or hourly level rather than at the user level.

Variable	Type	Description
survey id	str	ID of the survey to which this row applies. Note: If `submits_by_survey_id` is False, surveys will not be aggregated at the survey level (they will only be aggregated by user) so this column will not appear.
year	int	Year of the time period at which submits/deliveries are being aggregated. This is only included in `submits_summary_daily.csv` and `submits_summary_hourly.csv`
month	int	Month of the time period at which submits/deliveries are being aggregated. This is only included in `submits_summary_daily.csv` and `submits_summary_hourly.csv`
day	int	Day over which submits/deliveries are being aggregated. This is only included in `submits_summary_daily.csv` and `submits_summary_hourly.csv`
hour	int	Hour over which submits/deliveries are being aggregated. This is only included in `submits_summary_hourly.csv`
num_surveys	int	Number of surveys scheduled for delivery to the individual during the period
num_submitted_surveys	int	Number of surveys submitted during the period (i.e. the user hit submit on all surveys)
num_opened_surveys	int	Number of surveys opened by the individual during the time period (i.e. the user answered at least one question)
avg_time_to_submit	float	Average time between survey delivery and survey submission, in seconds, for complete surveys
avg_time_to_open	float	Average time between survey delivery and survey opening, in seconds. This is averaged over survey responses where a survey_timings file was available because we do not have information about survey opening in responses where a survey_timings file is missing.
avg_duration	float	Average time between survey opening and survey submission, in seconds. This is averaged over survey responses where a survey_timings file was available because we do not have information about survey opening in responses where a survey_timings file is missing.

The following variables are created in the “submits_and_deliveries.csv” file. This file will only be generated if the config file and intervention timings file are provided.

Variable	Type	Description
survey id	str	ID of the survey
delivery_time	str	A scheduled delivery time. If surveys are weekly, delivery times will be generated for each week between start_date and end_date
submit_flg	int	Whether a survey was submitted between this delivery time and the next delivery time
submit_time	str	Either the time when the user hit submit or the time when the individual stopped interacting with the survey for that session
time_to_submit	float	Time between survey delivery and survey submission, in seconds. If a survey was incomplete, this will be blank.
time_to_open	float	Time between survey delivery time and the first recorded survey answer, in seconds (for responses where a survey_timings file was available; if only a survey_answers file was available, this will be 0)
survey_duration	float	Time between the first recorded survey answer and the survey submission, in seconds (for responses where a survey_timings file was available; if only a survey_answers file was available, this will be NA)
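
The `submit_flg` rule (a submission between this delivery time and the next one) can be sketched as follows. This is an illustration rather than Sycamore's code: it assumes numeric timestamps such as Unix seconds, treats the window as inclusive of the delivery time and exclusive of the next delivery time, and leaves the window after the last delivery open-ended. The function name is hypothetical.

```python
from bisect import bisect_left


def flag_submissions(delivery_times, submit_times):
    """Return one 0/1 flag per delivery: was a survey submitted before the next delivery?

    Both arguments are lists of numeric timestamps (e.g. Unix seconds).
    """
    deliveries = sorted(delivery_times)
    submits = sorted(submit_times)
    flags = []
    for index, delivered in enumerate(deliveries):
        if index + 1 < len(deliveries):
            next_delivery = deliveries[index + 1]
        else:
            next_delivery = float("inf")  # no later delivery: open-ended window
        first = bisect_left(submits, delivered)  # first submit at or after delivery
        flags.append(int(first < len(submits) and submits[first] < next_delivery))
    return flags
```

For example, with deliveries at 0, 100 and 200 seconds and submissions at 50 and 250 seconds, the second delivery is the only one left without a submission.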

The following variables are created in the “answers_data.csv” file. This file will be generated if a survey config file is available.

Variable	Type	Description
survey id	str	ID of the survey
beiwe_id	str	The participant’s Beiwe ID
question id	str	The ID of the question for this line
question text	str	The question text corresponding to the answer
question type	str	The type of question (radio button, free response, etc.) corresponding to the answer
question answer options	str	The answer options presented to the user (applicable for check box or radio button surveys)
timestamp	str	The Unix timestamp corresponding to the latest time the user was on the question
Local time	str	The local time corresponding to the latest time the user was on the question
last_answer	str	The last answer the user had selected before moving on to the next question or submitting
all_answers	str	A list of all answers the user selected
num_answers	int	The number of different answers selected by the user (the length of the list in all_answers)
first_time	str	The local time corresponding to the earliest time the user was on the question
last_time	str	The local time corresponding to the latest time the user was on the question
time_to_answer	float	The time that the user spent on the question

The following variables are created in the “answers_summary.csv” file. This file will only be generated if the config file and intervention timings file are provided.

Variable	Type	Description
survey id	str	ID of the survey
beiwe_id	str	The participant’s Beiwe ID
question id	str	The ID of the question for this line
num_answers	int	The number of times in the given data the question is answered
average_time_to_answer	float	The average number of seconds the user takes to answer the question
average_number_of_answers	float	Average number of answers selected for a question. This indicates whether a user changed an answer before submitting it.
most_common_answer	str	A user’s most common answer to a question
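
The relationship between the per-response columns of answers_data.csv and the aggregated columns above can be sketched like this. The function is hypothetical and assumes the rows passed in all belong to one user and one question, and that `most_common_answer` is based on `last_answer`; Sycamore may aggregate differently.

```python
from collections import Counter


def summarize_answers(rows):
    """Aggregate per-response rows into answers_summary-style statistics.

    Each row is a dict with the keys "last_answer", "num_answers" and
    "time_to_answer", mirroring columns of answers_data.csv.
    """
    count = len(rows)
    answer_counts = Counter(row["last_answer"] for row in rows)
    return {
        "num_answers": count,
        "average_time_to_answer": sum(r["time_to_answer"] for r in rows) / count,
        "average_number_of_answers": sum(r["num_answers"] for r in rows) / count,
        "most_common_answer": answer_counts.most_common(1)[0][0],
    }
```

An `average_number_of_answers` above 1 means that, on at least one occasion, the user selected more than one answer before moving on.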

The following variables are created in the “submits_only.csv” file. This file will always be generated.

Variable	Type	Description
survey id	str	ID of the survey
beiwe_id	str	The participant’s Beiwe ID
surv_inst_flg	int	A “submission flag” which distinguishes submissions that are done by the same individual on the same survey
max_time	str	Either the time when the user hit submit or the time when the individual stopped interacting with the survey for that session
min_time	str	The earliest time the individual was interacting with the survey that session
time_to_complete	float	Time between min_time and max_time, in seconds (for responses where a survey_timings file was available)

The following variables are created in a csv file for each survey.

Variable	Type	Description
start_time	str	Time this survey submission was started
end_time	str	Time this survey submission was ended
survey_duration	float	Difference between start and end time, in seconds (for surveys where a survey_timings file was available)
question_1, question_2, …	str	Responses to each question in the survey

Accelerometer 

The output of the accelerometer module contains gait summary statistics for each specified participant in daily (_gait_daily.csv) or hourly (_gait_hourly.csv) windows.

The following variables are created in each csv file.

Variable	Type	Description
date	str	Time of observation (_gait_daily.csv format: yyyy-mm-dd; _gait_hourly.csv format: yyyy-mm-dd HH:MM:SS)
walking_time	int	Total walking time (in seconds)
steps	int	Total steps taken
cadence	float	Average cadence in time window (daily or hourly)
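
The two date formats above map directly onto `datetime.strptime` patterns. The following sketch reads a gait summary file with the standard library; the function name is hypothetical, and it assumes that every row is complete, so files with missing values would need extra handling.

```python
import csv
from datetime import datetime


def read_gait_file(path, hourly=False):
    """Read a gait summary csv into a list of dicts with typed values."""
    # yyyy-mm-dd for daily files, yyyy-mm-dd HH:MM:SS for hourly files.
    date_format = "%Y-%m-%d %H:%M:%S" if hourly else "%Y-%m-%d"
    rows = []
    with open(path, newline="") as handle:
        for record in csv.DictReader(handle):
            rows.append({
                "date": datetime.strptime(record["date"], date_format),
                # float() first so values written as "120.0" are accepted too.
                "walking_time": int(float(record["walking_time"])),
                "steps": int(float(record["steps"])),
                "cadence": float(record["cadence"]),
            })
    return rows
```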

Based on multiple data streams 

Other resources 

You may consider various other resources, for example if you:

Want to know more about the Beiwe platform for smartphone data collection, see the Beiwe Wiki
Want to read our blog and find out about new Forest features, see the link here