Home
Forest is a library for analyzing smartphone-based high-throughput digital phenotyping data collected with the Beiwe platform. Forest implements methods as a Python 3.11 package. Forest is integrated into the Beiwe back-end on AWS but can also be run locally.
Forest trees
Forest is organized into subpackages, or trees, each of which addresses a specific area of analytics.
-
Location and call/text log data generation of a synthetic population
Input: simulation parameters
Current maintainer: Georgios Efstathiadis
-
Location data imputation and mobility summary statistics
Input: GPS data
Current maintainer: Georgios Efstathiadis
-
Call log and text log summary statistics
Input: call/text log (from Android phones only)
Current maintainer: Patrick Emedom-Nnamdi
Poplar: documentation and example code
Various common functions for data preparation, mainly used by other trees (e.g. time zone conversion, reading/writing)
Sycamore:
Survey data summary statistics and preprocessed data
Input: survey_timings and survey_answers data
Current maintainer: Zachary Clement
Oak:
Gait time, cadence, and step count statistics
Input: accelerometer data
Current maintainer: Marcin Straczkiewicz
Expected input data
Input data should be included in one directory. This directory cannot be on a cloud server. Inside that directory, there should be a direct subdirectory corresponding to each participant ID. Inside each Beiwe ID subdirectory, there should be a direct subdirectory corresponding to each downloaded data stream.
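The layout described above can be checked before running any tree. The following is a minimal sketch and is not part of Forest: the function name `list_streams` is a hypothetical helper, and it only assumes the study directory / participant ID / data stream nesting described here.

```python
from pathlib import Path


def list_streams(study_dir):
    """Map each participant ID to the data streams found under it.

    Assumes the layout study_dir/<participant ID>/<data stream>/...
    """
    root = Path(study_dir)
    layout = {}
    for participant in sorted(p for p in root.iterdir() if p.is_dir()):
        # Each direct subdirectory of a participant folder is one data stream.
        layout[participant.name] = sorted(
            s.name for s in participant.iterdir() if s.is_dir()
        )
    return layout
```

Calling `list_streams` on the study directory returns a dictionary such as one participant ID mapped to its list of stream folder names, which makes it easy to spot a participant with a missing stream.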
Methods are designed to work on data collected using the Beiwe app, with the types of sensor on/off cycles run by the Beiwe app, and with csv files containing the columns generated by Beiwe, or on generated data that match the data returned by the Beiwe app. Some methods included in Forest are compatible with other data collection environments, but code changes would be required to make Forest work well with those environments.
Did you know you can use Python to download Beiwe data even more conveniently than from the portal? Download mano and follow the instructions at https://github.com/harvard-nrg/mano
Output & available summary statistics
GPS
The output of the GPS module contains:
summary statistics for all specified participants (.csv);
imputed trajectories (.csv) in terms of timestamp, latitude and longitude. By default, saving these trajectories is set to FALSE;
all_BV_set (.pkl), which is a dictionary, with the key as the user ID and the value as a numpy array, where each column represents [start_timestamp, start_latitude, start_longitude, end_timestamp, end_latitude, end_longitude]. If this is the first time you run the code, it is set to NULL by default. If you want to continue your analysis from here in the future, all_BV_set is expected to be an input to your new analysis and it will be updated in that run. The size of the file should stay fixed over time;
all_memory_dict (.pkl), which is also a dictionary, with the key as the user ID and the value as a numpy array of other parameters for the user. If this is the first time you run the code, it is set to NULL by default. If you want to continue your analysis from here in the future, all_memory_dict is expected to be an input to your new analysis and it will be updated in that run. The size of the file should stay fixed over time;
locations_log (.json), a json file created if save_osm_log is set to True. It contains information on the places visited by the user, their tags and the time of visit.
List of summary statistics
The summary statistics that are generated are listed below:
Variable | Type | Description of Variable | Description of What it Measures |
---|---|---|---|
obs_duration | Float | The total time when the GPS is on | This variable quantifies the missingness and the uncertainty in all other estimates |
obs_day | Float | The total time when the GPS is on from 8AM to 8PM | This variable quantifies the missingness in daytime and the majority of uncertainty in all other estimates |
obs_night | Float | The total time when the GPS is on from 8PM to 8AM | This variable quantifies the missingness at night and the minority of uncertainty in all other estimates since the user is most likely at home |
home_time | Float | Time spent at home over the course of a day (in hours) | Note that a person can have non-zero |
dist_traveled | Float | Total distance travelled over the course of a day (in km) | The sum of lengths of all flights. A flight is defined to be a longest straight-line trip of a particle from one location to another without a directional change or pause. Please find the technical details here. |
radius | Float | Average radius that a person travels from their center over the course of a day (in km) | Centroid = the average of each ‘place visited’ (see definition ‘significant location’) over the course of a day, with weights proportional to the amount of time spent in the location. The radius of gyration is calculated using a time-weighted average of the distance between each place and the centroid, where weights are measured in the same way. |
diameter | Float | Largest distance between any two places that a person visited in a day (in km) | Please find the technical details on how a place is defined here |
max_dist_home | Float | Largest distance between any places that a person visited in a day and their home (in km) | |
num_sig_places | int | Number of significant locations visited at any point over the course of a day | Significant locations are distinct pauses which are at least 15 minutes long and 50 meters apart. They are determined using K-means clustering on locations that a patient visits over the course of follow up. Set K=K+1 and repeat clustering until two significant locations are within 100 meters of one another. Then use the results from the previous step (K-1) as the total number of significant locations. |
total_flight_time | Float | Total time spent in flight over the course of a day (in hours) | A flight is defined to be a longest straight-line trip of a particle from one location to another without a directional change or pause. |
av_flight_length | Float | Average of the length of all flights (straight line movement) that took place over the course of a day (in km) | GPS is converted into a sequence of flights (straight line movement) and pauses (time spent stationary). A flight is defined to be a longest straight-line trip of a particle from one location to another without a directional change or pause. Note that a long flight could be composed of several short flights with different directions, but when calculating the average, it is the mean of those short flights. Please find the technical details here. |
sd_flight_length | Float | Standard deviation of the length of all flights (straight line movement) that took place over the course of a day (in km) | GPS is converted into a sequence of flights (straight line movement) and pauses (time spent stationary). The standard deviation of flights of the day is reported. |
av_flight_duration | Float | Average of the duration of all flights (straight line movement) that took place over the course of a day (in hours) | GPS is converted into a sequence of flights (straight line movement) and pauses (time spent stationary). The average of the duration of flights of the day is reported. |
sd_flight_duration | Float | Standard deviation of the duration of all flights (straight line movement) that took place over the course of a day (in hours) | GPS is converted into a sequence of flights (straight line movement) and pauses (time spent stationary). The standard deviation of the duration of flights of the day is reported. |
total_pause_time | Float | Total time spent in pause over the course of a day (in hours) | A pause is defined to be a longest period of time spent stationary without a directional change or flight. |
av_pause_duration | Float | Average of the duration of all pauses that took place over the course of a day (in hours) | We consider that a participant has a pause if the distance that he has moved during a 30-s period is less than |
sd_pause_duration | Float | Standard deviation of the duration of all pauses that took place over the course of a day (in hours) | GPS is converted into a sequence of flights (straight line movement) and pauses (time spent stationary). The standard deviation of duration of pauses over the course of a day is reported. |
entropy | Float | Entropy measure based on the proportion of time spent at significant locations over the course of a day | Letting p_i be the proportion of the day spent at significant location i, significant location entropy is calculated as -\sum_{i} p_i*log(p_i), where the sum occurs over all non-zero p_i for that day. |
mis_duration | Not Available | Number of hours of GPS data missing over the course of a day | |
Physical circadian rhythm | Float | A continuous measurement of routine in the interval [0,1] that scores a day with 0 if there was a complete break from routine and 1 if the person followed the exact same routine as on every other day of follow up | For a detailed description of how this measure is calculated, see Canzian and Musolesi’s 2015 paper in the Proceedings of the 2015 ACM International Joint Conference on Pervasive and Ubiquitous Computing, titled “Trajectories of depression: unobtrusive monitoring of depressive states by means of smartphone mobility traces analysis.” Their procedure was followed using 30-min increments as a bin size. |
Physical circadian rhythm stratified | Float | A continuous measurement of routine in the interval [0,1] that scores a day with 0 if there was a complete break from routine and 1 if the person followed the exact same routine as on every other day of follow up | Calculated in the same way as Physical circadian rhythm, except the procedure is repeated separately for weekends and weekdays. |
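Two of the definitions in the table, entropy and radius, translate directly into a few lines of code. The sketch below is illustrative only and is not the GPS module's implementation: the function names are hypothetical, the natural logarithm is an assumption (the table only says "log"), and places are given in planar kilometre coordinates rather than latitude and longitude.

```python
import math


def location_entropy(proportions):
    """Entropy -sum(p_i * log(p_i)) over the non-zero proportions p_i.

    Natural log is an assumption of this sketch.
    """
    return -sum(p * math.log(p) for p in proportions if p > 0)


def radius_of_gyration(places):
    """Time-weighted average distance of each place from the centroid.

    `places` is a list of (x_km, y_km, hours) tuples. Planar coordinates
    are an assumption; real GPS data would need a projection or a
    great-circle distance.
    """
    total = sum(hours for _, _, hours in places)
    # Centroid: average of the places, weighted by time spent at each.
    cx = sum(x * hours for x, _, hours in places) / total
    cy = sum(y * hours for _, y, hours in places) / total
    # Radius: distances to the centroid, weighted in the same way.
    return sum(
        math.hypot(x - cx, y - cy) * hours for x, y, hours in places
    ) / total
```

For example, a day split evenly between two places 4 km apart gives a centroid halfway between them and a radius of 2 km, and an entropy equal to log(2).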
Call & text logs
List of summary statistics
Variable | Type | Description of Variable |
---|---|---|
num_in_call | int | The total number of incoming calls. |
num_out_call | int | The total number of outgoing calls. |
num_mis_call | int | The total number of missed calls. |
num_in_caller | int | The total number of unique individuals who called the subject. |
num_out_caller | int | The total number of unique individuals called by the subject. |
num_mis_caller | int | The total number of unique individuals whose calls to the subject went unanswered. |
total_mins_in_calls | float | The total duration (in minutes) of all incoming calls. |
total_mins_out_call | float | The total duration (in minutes) of all outgoing calls. |
num_uniq_individuals_call_or_text | int | The total number of unique individuals who called or texted the subject, or whom the subject called or texted. That is, the total number of individuals with whom the subject had any kind of communication. |
num_s | int | The total number of sent SMS. |
num_r | int | The total number of received SMS. |
num_mms_s | int | The total number of sent MMS. |
num_mms_r | int | The total number of received MMS. |
num_s_tel | int | The total number of unique phone numbers that sent messages to the subject. |
num_r_tel | int | The total number of unique phone numbers that received messages from the subject. |
total_char_s | int | The total number of characters in all sent messages. |
total_char_r | int | The total number of characters in all received messages. |
text_reciprocity_incoming | int | The total number of unique phone numbers that sent messages to the subject but did not receive a reply. |
text_reciprocity_outgoing | int | The total number of unique phone numbers that received messages from the subject but didn’t reply. |
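The two reciprocity counts at the end of the table are set differences between the numbers that texted the subject and the numbers the subject texted. The sketch below shows that idea under stated assumptions: the `(phone_number, direction)` record format and the function name are hypothetical, message order is ignored (any message in the other direction counts as a reply), and this is not the call/text tree's actual code.

```python
def text_reciprocity(messages):
    """Count one-sided text contacts.

    `messages` is a list of (phone_number, direction) tuples, where
    direction is "received" (sent to the subject) or "sent" (sent by the
    subject). This record format is hypothetical.
    """
    senders = {number for number, direction in messages if direction == "received"}
    recipients = {number for number, direction in messages if direction == "sent"}
    return {
        # Numbers that texted the subject and never got a message back.
        "text_reciprocity_incoming": len(senders - recipients),
        # Numbers the subject texted that never texted the subject.
        "text_reciprocity_outgoing": len(recipients - senders),
    }
```

Using sets means repeated messages to or from the same number are counted once, which matches the "unique phone numbers" wording in the table.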
Surveys
Note
For best performance with Sycamore, do not change survey questions or answers after the study period has started.
The following variables are created in the “submits_summary.csv” file. This file will only be generated if the config file and intervention timings file are provided. The submits_summary_daily.csv and submits_summary_hourly.csv files contain the same columns, but with additional granularity at the daily or hourly level rather than at the user level.
Variable | Type | Description |
---|---|---|
survey id | str | ID of the survey to which this row applies. Note: If |
year | int | Year of the time period at which submits/deliveries are being aggregated. This is only included in |
month | int | Month of the time period at which submits/deliveries are being aggregated. This is only included in |
day | int | Day over which submits/deliveries are being aggregated. This is only included in |
hour | int | Hour over which submits/deliveries are being aggregated. This is only included in |
num_surveys | int | Number of surveys scheduled for delivery to the individual during the period |
num_submitted_surveys | int | Number of surveys submitted during the period (i.e. the user hit submit on all surveys) |
num_opened_surveys | int | Number of surveys opened by the individual during the time period (i.e. the user answered at least one question) |
avg_time_to_submit | float | Average time between survey delivery and survey submission, in seconds, for complete surveys |
avg_time_to_open | float | Average time between survey delivery and survey opening, in seconds. This is averaged over survey responses where a survey_timings file was available because we do not have information about survey opening in responses where a survey_timings file is missing. |
avg_duration | float | Average time between survey opening and survey submission, in seconds. This is averaged over survey responses where a survey_timings file was available because we do not have information about survey opening in responses where a survey_timings file is missing. |
The following variables are created in the “submits_and_deliveries.csv” file. This file will only be generated if the config file and intervention timings file are provided.
Variable | Type | Description |
---|---|---|
survey id | str | ID of the survey |
delivery_time | str | A scheduled delivery time. If surveys are weekly, delivery times will be generated for each week between start_date and end_date |
submit_flg | int | Whether a survey was submitted between this delivery time and the next delivery time |
submit_time | str | Either the time when the user hit submit or the time when the individual stopped interacting with the survey for that session |
time_to_submit | float | Time between survey delivery and survey submission, in seconds. If a survey was incomplete, this will be blank. |
time_to_open | float | Time between survey delivery time and the first recorded survey answer, in seconds (for responses where a survey_timings file was available; if only a survey_answers file was available, this will be 0) |
survey_duration | float | Time between the first recorded survey answer and the survey submission, in seconds (for responses where a survey_timings file was available; if only a survey_answers file was available, this will be NA) |
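The relationship between delivery_time, submit_flg and time_to_submit can be sketched as a window match: a submission belongs to a delivery if it falls between that delivery and the next one. The code below is a minimal illustration under that reading of the table. The function name is hypothetical, only the first submission in each window is kept (an assumption), and it is not the survey tree's implementation.

```python
from datetime import datetime


def match_submissions(delivery_times, submit_times):
    """Pair each delivery with the first submission before the next delivery.

    Both arguments are lists of datetime objects. Returns one dict per
    delivery with submit_flg, submit_time and time_to_submit (seconds).
    """
    deliveries = sorted(delivery_times)
    submits = sorted(submit_times)
    rows = []
    for i, delivered in enumerate(deliveries):
        # The window closes at the next delivery; the last window is open-ended.
        window_end = deliveries[i + 1] if i + 1 < len(deliveries) else None
        hits = [
            s for s in submits
            if s >= delivered and (window_end is None or s < window_end)
        ]
        first = hits[0] if hits else None
        rows.append({
            "delivery_time": delivered,
            "submit_flg": int(first is not None),
            "submit_time": first,
            # Left as None (blank) when nothing was submitted in the window.
            "time_to_submit": (
                (first - delivered).total_seconds() if first is not None else None
            ),
        })
    return rows


# Example: two weekly deliveries, one submission 30 minutes after the first.
deliveries = [datetime(2024, 1, 1, 9), datetime(2024, 1, 8, 9)]
submits = [datetime(2024, 1, 1, 9, 30)]
rows = match_submissions(deliveries, submits)
```

Averaging the non-blank time_to_submit values over a period would give a quantity analogous to avg_time_to_submit in the summary file.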
The following variables are created in the “answers_data.csv” file. This file will be generated if a survey config file is available.
Variable | Type | Description |
---|---|---|
survey id | str | ID of the survey |
beiwe_id | str | The participant’s Beiwe ID |
question id | str | The ID of the question for this line |
question text | str | The question text corresponding to the answer |
question type | str | The type of question (radio button, free response, etc.) corresponding to the answer |
question answer options | str | The answer options presented to the user (applicable for check box or radio button surveys) |
timestamp | str | The Unix timestamp corresponding to the latest time the user was on the question |
Local time | str | The local time corresponding to the latest time the user was on the question |
last_answer | str | The last answer the user had selected before moving on to the next question or submitting |
all_answers | str | A list of all answers the user selected |
num_answers | int | The number of different answers selected by the user (the length of the list in all_answers) |
first_time | str | The local time corresponding to the earliest time the user was on the question |
last_time | str | The local time corresponding to the latest time the user was on the question |
time_to_answer | float | The time that the user spent on the question |
The following variables are created in the “answers_summary.csv” file. This file will only be generated if the config file and intervention timings file are provided.
Variable | Type | Description |
---|---|---|
survey id | str | ID of the survey |
beiwe_id | str | The participant’s Beiwe ID |
question id | str | The ID of the question for this line |
num_answers | int | The number of times the question is answered in the given data |
average_time_to_answer | float | The average number of seconds the user takes to answer the question |
average_number_of_answers | float | Average number of answers selected for a question. This indicates whether a user changed an answer before submitting it. |
most_common_answer | str | A user’s most common answer to a question |
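The summary columns above can be read as simple aggregates over the per-response rows of answers_data.csv for one participant and one question. The following sketch shows one plausible way to compute them. It is not the survey tree's code: the function name is hypothetical, and it assumes that time_to_answer is in seconds and that most_common_answer is based on the last_answer column.

```python
from collections import Counter
from statistics import mean


def summarize_question(rows):
    """Summarize one participant's responses to one question.

    `rows` is a list of dicts with the keys last_answer, num_answers and
    time_to_answer (assumed seconds), mirroring columns of answers_data.csv.
    """
    answer_counts = Counter(r["last_answer"] for r in rows)
    return {
        # Number of times the question was answered.
        "num_answers": len(rows),
        "average_time_to_answer": mean(r["time_to_answer"] for r in rows),
        # Values above 1 suggest the user changed answers before moving on.
        "average_number_of_answers": mean(r["num_answers"] for r in rows),
        "most_common_answer": answer_counts.most_common(1)[0][0],
    }
```

Note that `num_answers` means different things in the two files: per response it is the number of answers selected, while in the summary it is the number of responses.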
The following variables are created in the “submits_only.csv” file. This file will always be generated.
Variable | Type | Description |
---|---|---|
survey id | str | ID of the survey |
beiwe_id | str | The participant’s Beiwe ID |
surv_inst_flg | int | A “submission flag” which distinguishes submissions that are done by the same individual on the same survey |
max_time | str | Either the time when the user hit submit or the time when the individual stopped interacting with the survey for that session |
min_time | str | The earliest time the individual was interacting with the survey that session |
time_to_complete | float | Time between min_time and max_time, in seconds (for responses where a survey_timings file was available) |
The following variables are created in a csv file for each survey.
Variable | Type | Description |
---|---|---|
start_time | str | Time this survey submission was started |
end_time | str | Time this survey submission was ended |
survey_duration | float | Difference between start and end time, in seconds (for surveys where a survey_timings file was available) |
question_1, question_2, … | str | Responses to each question in the survey |
Accelerometer
The output of the accelerometer module contains gait summary statistics for each specified participant in daily (_gait_daily.csv) or hourly (_gait_hourly.csv) windows.
The following variables are created in each of these csv files.
Variable | Type | Description |
---|---|---|
date | str | Time of observation (_gait_daily.csv format: yyyy-mm-dd; _gait_hourly.csv format: yyyy-mm-dd HH:MM:SS) |
walking_time | int | Total walking time (in seconds) |
steps | int | Total steps taken |
cadence | float | Average cadence in time window (daily or hourly) |
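Because walking_time and steps are totals, hourly rows can be rolled up to daily totals by summing over each calendar day. The sketch below illustrates this with the hourly date format listed in the table. The function name and the tuple input are hypothetical. Cadence is deliberately left out, since the table does not say how the average is weighted.

```python
from collections import defaultdict
from datetime import datetime


def hourly_to_daily(hourly_rows):
    """Sum hourly walking_time and steps into daily totals.

    `hourly_rows` is a list of (date, walking_time, steps) tuples, with the
    date in the hourly format 'yyyy-mm-dd HH:MM:SS'.
    """
    daily = defaultdict(lambda: {"walking_time": 0, "steps": 0})
    for stamp, walking_time, steps in hourly_rows:
        # Reduce the hourly timestamp to its calendar day (yyyy-mm-dd).
        day = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S").date().isoformat()
        daily[day]["walking_time"] += walking_time
        daily[day]["steps"] += steps
    return dict(daily)
```

The keys of the returned dictionary use the yyyy-mm-dd format of the daily file, so the result can be compared against _gait_daily.csv as a consistency check.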
Based on multiple data streams
Other resources
You may consider various other resources, for example if you:
Want to know more about the Beiwe platform for smartphone data collection, see the Beiwe Wiki
Want to read our blog and find out about new Forest features link here