Sycamore

Executive Summary

Use sycamore to process and analyze Beiwe survey data.

Installation

Before using Sycamore, install the librosa dependencies ffmpeg and libsndfile1, which are required to process audio survey files.

To install these dependencies on Ubuntu, run:
sudo apt-get install -y ffmpeg libsndfile1

For more information, see the librosa documentation.
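
Before running Sycamore on audio surveys, it can help to confirm that the ffmpeg executable is discoverable. The following is a minimal sketch using only the Python standard library. The helper name missing_binaries is hypothetical and not part of Sycamore, and the check does not cover libsndfile1, which is a shared library rather than an executable.

```python
import shutil


def missing_binaries(names=("ffmpeg",)):
    """Return the names of executables that cannot be found on PATH."""
    return [name for name in names if shutil.which(name) is None]


if __name__ == "__main__":
    missing = missing_binaries()
    if missing:
        print("Missing executables:", ", ".join(missing))
    else:
        print("ffmpeg found on PATH")
```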

Import

User-facing functions can be imported directly from sycamore.

Main Function

from forest.sycamore import compute_survey_stats

Less commonly used functions

from forest.sycamore import aggregate_surveys_config
from forest.sycamore import survey_submits
from forest.sycamore import survey_submits_no_config
from forest.sycamore import agg_changed_answers_summary

Note: Most users will only need compute_survey_stats. The other functions are listed for users interested in code development and for users running Sycamore on studies with a very large number of surveys. In such studies, compute_survey_stats, which runs all of the other functions, may take a long time even when a researcher is only interested in one specific output.

Usage

Download raw data from your Beiwe server and use this package to process survey data generated by the Beiwe app. Summary data provides metrics around survey submissions and survey question completion. Sycamore also takes various auxiliary files which can be downloaded from the Beiwe website to ensure accurate output.

Data

Methods are designed for use on the survey_timings, survey_answers, and audio_recordings data from the Beiwe app.

The survey_timings and survey_answers data streams are required for optimal data processing. The survey_timings stream is the best source of survey data because it has information on when a user responded to each question. Because survey files are not always uploaded to the Beiwe server, the survey_answers data stream is used as a backup to the survey_timings stream. The survey_answers stream only contains information about survey responses and the time of the survey’s final submission, so the survey_answers stream alone shouldn’t be used for survey processing.
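
To see which participants have both streams available, you could scan the data directory before running Sycamore. This is a rough sketch that assumes the raw data is laid out as data_dir/<beiwe_id>/<stream name>/; that layout and the helper name streams_by_user are assumptions made for illustration, not something Sycamore provides.

```python
from pathlib import Path

SURVEY_STREAMS = ("survey_timings", "survey_answers", "audio_recordings")


def streams_by_user(data_dir):
    """Map each participant folder to the survey-related streams it contains."""
    found = {}
    for user_dir in sorted(Path(data_dir).iterdir()):
        if user_dir.is_dir():
            found[user_dir.name] = [
                stream for stream in SURVEY_STREAMS
                if (user_dir / stream).is_dir()
            ]
    return found
```

Participants whose list contains survey_answers but not survey_timings are the ones for whom opening times and durations will be unavailable.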

The audio_recordings data stream can also be included in survey summary outputs. Sycamore does not process the audio data returned as part of audio surveys, but it can generate summaries with submission frequencies and survey duration for audio surveys.

Auxiliary files

Sycamore requires users to manually download files from the Beiwe website to create some outputs. These files can be downloaded by clicking “Edit this Study” on the study page, and clicking on the relevant file.

The file supplied to config_path can be downloaded by clicking “Export study settings JSON file” under “Export/Import study settings” on the study settings page. If the config_path argument is not supplied, the submits_and_deliveries.csv, submits_summary_daily.csv, and submits_summary_hourly.csv files will not be generated. This is because these files rely on an estimate of when surveys were delivered, and Sycamore gets information about when survey deliveries are made from the study configuration file.

The file supplied to interventions_filepath can be downloaded by clicking “Download Interventions” next to “Intervention Data” on the study settings page. If the interventions_filepath argument is not supplied, and if your study used relative surveys (e.g. surveys delivered 12 days after a participant’s start date), the submits_and_deliveries.csv, submits_summary_daily.csv, and submits_summary_hourly.csv files will not be generated. This is because the interventions file contains each Beiwe user’s intervention date, and Sycamore cannot infer a user’s intervention date from survey data alone. When running Sycamore, be sure to use an up-to-date version of the interventions file that contains intervention dates for users recently added to your study.

The file supplied to history_path can be downloaded by clicking “Download Surveys” next to “Survey History” on the study settings page. If this file is not supplied, Sycamore will not be able to provide prompts corresponding to audio surveys in output files. In addition, if this file is not supplied, and if the text of survey questions was changed during the study, surveys recovered from survey_answers files may not have the correct question IDs.
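
The rules above about config_path and interventions_filepath can be summarized in a few lines. This sketch only restates the rules from this section; it is not Sycamore's actual control flow, and the function name and the uses_relative_surveys flag are hypothetical.

```python
DELIVERY_OUTPUTS = (
    "submits_and_deliveries.csv",
    "submits_summary_daily.csv",
    "submits_summary_hourly.csv",
)


def delivery_outputs_expected(config_path=None, interventions_filepath=None,
                              uses_relative_surveys=False):
    """Return the delivery-based files that the rules above say will be written."""
    if config_path is None:
        # Without the study config, delivery times cannot be estimated.
        return ()
    if uses_relative_surveys and interventions_filepath is None:
        # Relative schedules need each participant's intervention date.
        return ()
    return DELIVERY_OUTPUTS
```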


Functions


sycamore.base.compute_survey_stats

compute_survey_stats runs aggregate_surveys_config, survey_submits, survey_submits_no_config, and agg_changed_answers_summary, and writes their output to csv files. compute_survey_stats takes the following arguments:

data_dir: the path to the directory where Beiwe data is stored.

output_dir: the path to the directory where output is to be written.

tz_str: the time zone where the study was conducted (defaults to ‘UTC’ if not specified). You can see a list of all possible time zone names by importing pytz and using pytz.all_timezones.

beiwe_ids: the list of Beiwe IDs to run Sycamore on. If this is not specified, Sycamore will run on all users in the data_dir directory.

config_path: the filepath to your downloaded survey config file. See above for explanations about downloading auxiliary files.

interventions_filepath: the filepath to your downloaded interventions timing file.

history_path: the filepath to your downloaded survey history file.

start_date: the earliest date for which you might want survey information. Sycamore will generate survey deliveries starting at this date, and it will not include any surveys taken prior to this date in any outputs.

end_date: the latest date for which you might want survey information. Sycamore will generate survey deliveries ending at this date, and it will not include any surveys taken after this date in any outputs.

submits_timeframe: the timeframe over which to generate submission summaries. This must be one of the frequencies specified in forest.constants.Frequency. It determines whether submits_summary_daily.csv or submits_summary_hourly.csv (which aggregate survey deliveries and submissions at the daily or hourly levels) get generated. The default is to generate both hourly and daily summaries, so you can usually leave this argument alone and delete any unwanted files. If you prefer, you can specify a single timeframe.
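
Because start_date and end_date bound both the generated deliveries and the surveys included in outputs, it can be worth validating them before a long run. The sketch below assumes dates are given as "YYYY-MM-DD" strings, as in the examples in this document; check_date_range is a hypothetical helper, not a Sycamore function.

```python
from datetime import date


def check_date_range(start_date, end_date):
    """Parse two YYYY-MM-DD strings and confirm start is not after end."""
    start = date.fromisoformat(start_date)  # raises ValueError if malformed
    end = date.fromisoformat(end_date)
    if start > end:
        raise ValueError(f"start_date {start} is after end_date {end}")
    return start, end
```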

Example (without config file)

from forest.sycamore import compute_survey_stats

study_dir = "path/to/data"
output_dir = "path/to/output"
beiwe_ids = ["id_1", "id_2"]  # placeholders; replace with the Beiwe IDs in study_dir
start_date = "2022-01-01"
end_date = "2022-06-04"
tz_str = "America/New_York"

compute_survey_stats(
    study_dir, output_dir, tz_str, beiwe_ids, start_date=start_date, 
    end_date=end_date
)

Example (with config file)

from forest.constants import Frequency
from forest.sycamore import compute_survey_stats

config_path = "path/to/config file"
interventions_filepath = "path/to/interventions file"
history_path = "path/to/history/file"
study_dir = "path/to/data"
output_dir = "path/to/output"
beiwe_ids = ["id_1", "id_2"]  # placeholders; replace with the Beiwe IDs in study_dir
start_date = "2022-01-01"
end_date = "2022-06-04"
tz_str = "America/New_York"
submits_timeframe = Frequency.HOURLY_AND_DAILY

compute_survey_stats(
    study_dir, output_dir, tz_str, beiwe_ids, start_date=start_date,
    end_date=end_date, config_path=config_path,
    interventions_filepath=interventions_filepath,
    history_path=history_path, submits_timeframe=submits_timeframe
)

Most users should be able to use compute_survey_stats for all of their survey processing needs. However, if a study has collected a very large number of surveys, subprocesses are also exposed to reduce processing time.


sycamore.common.aggregate_surveys_config

Aggregates all survey information from a study, using the config file to infer information about surveys.

Example

from forest.sycamore import aggregate_surveys_config

agg_data = aggregate_surveys_config(study_dir, config_path, study_tz, history_path=history_path)

sycamore.submits.survey_submits

Extracts and summarizes delivery and submission times.

Example

from forest.sycamore import aggregate_surveys_config
from forest.sycamore.submits import survey_submits

config_path = "path/to/config file"
interventions_path = "path/to/interventions file"
history_path = "path/to/history/file"
study_dir = "path/to/data"
output_dir = "path/to/output"
beiwe_ids = ["id_1", "id_2"]  # placeholders; replace with the Beiwe IDs in study_dir
start_date = "2022-01-01"
end_date = "2022-06-04"
tz_str = "America/New_York"

agg_data = aggregate_surveys_config(study_dir, config_path, tz_str)

submits_detail, submits_summary = survey_submits(
    config_path, start_date, end_date, beiwe_ids, interventions_path, agg_data,
    history_path
)

sycamore.submits.survey_submits_no_config

Extracts an alternative survey submits table that does not include delivery times.

Example

from forest.sycamore import survey_submits_no_config

study_dir = "path/to/data"

submits_tbl = survey_submits_no_config(study_dir)

sycamore.responses.agg_changed_answers_summary

Extracts data summarizing user responses.

Example

from forest.sycamore import agg_changed_answers_summary
from forest.sycamore import aggregate_surveys_config

config_path = "path/to/config file"
history_path = "path/to/history/file"
study_dir = "path/to/data"
study_tz = "UTC"  # time zone of the study; defaults to 'UTC' if not defined

agg_data = aggregate_surveys_config(study_dir, config_path, study_tz, history_path=history_path)

ca_detail, ca_summary = agg_changed_answers_summary(config_path, agg_data)

FAQ

In the submits_summary.csv file, there are some rows where num_submitted_surveys is greater than num_surveys. How could a user have submitted more surveys than were delivered to them?

Sycamore doesn’t know exactly when surveys were delivered to users. Survey delivery times are estimated using the study configuration file which you enter when you run the code. For example, imagine that you started running a study in March with survey deliveries happening daily, and in April you decided to switch your surveys to be delivered weekly. If you ran Sycamore in April, your config file would tell Sycamore that surveys were delivered weekly throughout the whole study. So, if you had a user submitting surveys daily during March, they would have ~30 survey submissions, but Sycamore would think that only ~5 surveys had been delivered during that time.

In addition, this may happen if a researcher manually re-sends surveys, because Sycamore has no information about manual (unscheduled) deliveries.
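
The March/April scenario above can be reproduced with simple date arithmetic. This is an illustrative sketch only; the year 2022 and the helper scheduled_dates are assumptions, and Sycamore's own delivery estimation is driven by the study configuration file.

```python
from datetime import date, timedelta


def scheduled_dates(first, last, every_days):
    """List delivery dates from first to last (inclusive), every_days apart."""
    dates = []
    current = first
    while current <= last:
        dates.append(current)
        current += timedelta(days=every_days)
    return dates


march_start, march_end = date(2022, 3, 1), date(2022, 3, 31)
daily = scheduled_dates(march_start, march_end, 1)   # what the user actually received
weekly = scheduled_dates(march_start, march_end, 7)  # what the April config implies
print(len(daily), len(weekly))  # 31 5
```

A user who answered every daily survey in March would therefore show about 31 submissions against only 5 estimated deliveries.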

In the submits_and_deliveries.csv file, there are a ton of rows with deliveries but no submissions. Why is this happening?

If surveys are sent on a weekly schedule, Sycamore assumes that there is a survey delivered every week between the start_date and end_date which you entered. If you want there to be fewer empty rows in your output, you can move start_date and end_date to be closer to the actual start and end dates of your study.

What does surv_inst_flg mean in the outputs?

surv_inst_flg is a unique identifying number to distinguish different times when the same individual took the same survey. This column is useful for joining outputs together.

List of summary statistics

The following variables are created in the “submits_summary.csv” file. This file will only be generated if the config file and intervention timings file are provided. The submits_summary_daily.csv and submits_summary_hourly.csv files contain the same columns, but aggregated at the daily or hourly level rather than at the user level.

| Variable | Type | Description of Variable |
| --- | --- | --- |
| survey id | str | ID of the survey to which this row applies. Note: If submits_by_survey_id is False, surveys will not be aggregated at the survey level (they will only be aggregated by user), so this column will not appear. |
| year | int | Year of the time period at which submits/deliveries are being aggregated. This is only included in submits_summary_daily.csv and submits_summary_hourly.csv |
| month | int | Month of the time period at which submits/deliveries are being aggregated. This is only included in submits_summary_daily.csv and submits_summary_hourly.csv |
| day | int | Day over which submits/deliveries are being aggregated. This is only included in submits_summary_daily.csv and submits_summary_hourly.csv |
| hour | int | Hour over which submits/deliveries are being aggregated. This is only included in submits_summary_hourly.csv |
| num_surveys | int | Number of surveys scheduled for delivery to the individual during the period |
| num_submitted_surveys | int | Number of surveys submitted during the period (i.e. the user hit submit on all surveys) |
| num_opened_surveys | int | Number of surveys opened by the individual during the time period (i.e. the user answered at least one question) |
| avg_time_to_submit | float | Average time between survey delivery and survey submission, in seconds, for complete surveys |
| avg_time_to_open | float | Average time between survey delivery and survey opening, in seconds. This is averaged over survey responses where a survey_timings file was available because we do not have information about survey opening in responses where a survey_timings file is missing. |
| avg_duration | float | Average time between survey opening and survey submission, in seconds. This is averaged over survey responses where a survey_timings file was available because we do not have information about survey opening in responses where a survey_timings file is missing. |
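
One common use of these columns is a submission rate per row. The sketch below reads rows with the standard csv module and divides num_submitted_surveys by num_surveys. The sample rows are made up for illustration and are not real study data, and submission_rates is a hypothetical helper rather than part of Sycamore.

```python
import csv
import io

# Made-up rows using column names from the table above; not real study data.
SAMPLE = """survey id,num_surveys,num_submitted_surveys
abc123,10,8
def456,0,0
ghi789,5,7
"""


def submission_rates(csv_file):
    """Yield (survey id, rate); rate is None when no surveys were scheduled."""
    for row in csv.DictReader(csv_file):
        scheduled = float(row["num_surveys"])
        submitted = float(row["num_submitted_surveys"])
        rate = submitted / scheduled if scheduled else None
        yield row["survey id"], rate


rates = dict(submission_rates(io.StringIO(SAMPLE)))
```

A rate above 1, as in the last sample row, corresponds to the situation described in the FAQ where more surveys were submitted than Sycamore estimated were delivered.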


The following variables are created in the “submits_and_deliveries.csv” file. This file will only be generated if the config file and intervention timings file are provided.

| Variable | Type | Description of Variable |
| --- | --- | --- |
| survey id | str | ID of the survey |
| delivery_time | str | A scheduled delivery time. If surveys are weekly, delivery times will be generated for each week between start_date and end_date |
| submit_flg | str | Either the time when the user hit submit or the time when the individual stopped interacting with the survey for that session |
| time_to_submit | float | Time between survey delivery and survey submission, in seconds. If a survey was incomplete, this will be blank. |
| time_to_open | float | Time between survey delivery time and the first recorded survey answer, in seconds (for responses where a survey_timings file was available; if only a survey_answers file was available, this will be 0) |
| survey_duration | float | Time between the first recorded survey answer and the survey submission, in seconds (for responses where a survey_timings file was available; if only a survey_answers file was available, this will be NA) |


The following variables are created in the “answers_data.csv” file. This file will be generated if a survey config file is available.

| Variable | Type | Description of Variable |
| --- | --- | --- |
| survey id | str | ID of the survey |
| beiwe_id | str | The participant’s Beiwe ID |
| question id | str | The ID of the question for this line |
| question text | str | The question text corresponding to the answer |
| question type | str | The type of question (radio button, free response, etc.) corresponding to the answer |
| question answer options | str | The answer options presented to the user (applicable for check box or radio button surveys) |
| timestamp | str | The Unix timestamp corresponding to the latest time the user was on the question |
| Local time | str | The local time corresponding to the latest time the user was on the question |
| last_answer | str | The last answer the user had selected before moving on to the next question or submitting |
| all_answers | str | A list of all answers the user selected |
| num_answers | int | The number of different answers selected by the user (the length of the list in all_answers) |
| first_time | str | The local time corresponding to the earliest time the user was on the question |
| last_time | str | The local time corresponding to the latest time the user was on the question |
| time_to_answer | float | The time that the user spent on the question |


The following variables are created in the “answers_summary.csv” file. This file will only be generated if the config file and intervention timings file are provided.

| Variable | Type | Description of Variable |
| --- | --- | --- |
| survey id | str | ID of the survey |
| beiwe_id | str | The participant’s Beiwe ID |
| question id | str | The ID of the question for this line |
| num_answers | int | The number of times the question is answered in the given data |
| average_time_to_answer | float | The average number of seconds the user takes to answer the question |
| average_number_of_answers | float | Average number of answers selected for a question. This indicates whether a user changed an answer before submitting it. |
| most_common_answer | str | A user’s most common answer to a question |


The following variables are created in the “submits_only.csv” file. This file will always be generated.

| Variable | Type | Description of Variable |
| --- | --- | --- |
| survey id | str | ID of the survey |
| beiwe_id | str | The participant’s Beiwe ID |
| surv_inst_flg | int | A “submission flag” which distinguishes submissions that are done by the same individual on the same survey |
| max_time | str | Either the time when the user hit submit or the time when the individual stopped interacting with the survey for that session |
| min_time | str | The earliest time the individual was interacting with the survey that session |
| time_to_complete | float | Time between min_time and max_time, in seconds (for responses where a survey_timings file was available) |
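
As a sanity check, time_to_complete can be recomputed from min_time and max_time. The sketch below assumes the two columns are strings in "YYYY-MM-DD HH:MM:SS" format; that format is an assumption, so adjust fmt to match your output files. seconds_between is a hypothetical helper.

```python
from datetime import datetime


def seconds_between(min_time, max_time, fmt="%Y-%m-%d %H:%M:%S"):
    """Return the number of seconds from min_time to max_time."""
    start = datetime.strptime(min_time, fmt)
    end = datetime.strptime(max_time, fmt)
    return (end - start).total_seconds()


print(seconds_between("2022-03-01 09:00:00", "2022-03-01 09:02:30"))  # 150.0
```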


The following variables are created in a csv file for each survey.

| Variable | Type | Description of Variable |
| --- | --- | --- |
| start_time | str | Time this survey submission was started |
| end_time | str | Time this survey submission was ended |
| survey_duration | float | Difference between start and end time, in seconds (for surveys where a survey_timings file was available) |
| question_1, question_2, … | str | Responses to each question in the survey |