Sycamore

Executive Summary

Use Sycamore to process and analyze Beiwe survey data.

Installation

Before using Sycamore, install the librosa dependencies (ffmpeg and libsndfile1), which are needed to process audio survey files.

To install these dependencies on Ubuntu, run:
sudo apt-get install -y ffmpeg libsndfile1

For more information, see the librosa documentation.
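
Before running Sycamore on audio surveys, it can help to confirm that both system dependencies are visible from Python. The following is a minimal sketch using only the standard library; the helper name check_audio_dependencies is hypothetical and is not part of Forest.

```python
import ctypes.util
import shutil


def check_audio_dependencies():
    """Report whether ffmpeg and libsndfile appear to be installed.

    Hypothetical helper for illustration; not part of Forest or librosa.
    """
    return {
        # shutil.which returns None when the executable is not on PATH
        "ffmpeg": shutil.which("ffmpeg") is not None,
        # find_library returns None when the shared library cannot be located
        "libsndfile": ctypes.util.find_library("sndfile") is not None,
    }


if __name__ == "__main__":
    for name, found in check_audio_dependencies().items():
        print(f"{name}: {'found' if found else 'missing'}")
```

If either entry is reported as missing, rerun the apt-get command above before processing audio surveys.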

Import

User-facing functions can be imported directly from forest.sycamore.

Main Function

from forest.sycamore import compute_survey_stats

Less commonly used functions

from forest.sycamore import aggregate_surveys_config
from forest.sycamore import survey_submits
from forest.sycamore import survey_submits_no_config
from forest.sycamore import agg_changed_answers_summary

Note: Most users will only need compute_survey_stats. The other functions are listed for users interested in code development and for users running Sycamore on studies with a very large number of surveys. In such studies, the main function (compute_survey_stats), which runs all the other functions, may take a long time even when a researcher is only interested in one specific output.

Usage

Download raw data from your Beiwe server and use this package to process survey data generated by the Beiwe app. Summary data provides metrics around survey submissions and survey question completion. Sycamore also takes various auxiliary files which can be downloaded from the Beiwe website to ensure accurate output.

Data

Methods are designed for use on the survey_timings, survey_answers, and audio_recordings data from the Beiwe app.

The survey_timings and survey_answers data streams are required for optimal data processing. The survey_timings stream is the best source of survey data because it has information on when a user responded to each question. Because survey files are not always uploaded to the Beiwe server, the survey_answers data stream is used as a backup to the survey_timings stream. The survey_answers stream only contains information about survey responses and the time of the survey’s final submission, so the survey_answers stream alone shouldn’t be used for survey processing.
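
To see which participants have both streams available, the data directory can be scanned before running Sycamore. This is a rough sketch that assumes the raw data is laid out as data_dir/<beiwe_id>/<stream>/; that layout and the helper name find_survey_streams are assumptions for illustration, not something Sycamore provides.

```python
from pathlib import Path

SURVEY_STREAMS = ("survey_timings", "survey_answers", "audio_recordings")


def find_survey_streams(data_dir):
    """Map each Beiwe ID folder in data_dir to the survey streams it contains.

    Assumes the layout data_dir/<beiwe_id>/<stream>/ (an assumption about how
    the raw data was downloaded). Hypothetical helper, not part of Forest.
    """
    streams_by_user = {}
    for user_dir in sorted(Path(data_dir).iterdir()):
        if not user_dir.is_dir():
            continue  # skip stray files at the top level
        streams_by_user[user_dir.name] = [
            stream for stream in SURVEY_STREAMS if (user_dir / stream).is_dir()
        ]
    return streams_by_user
```

Participants listed with survey_answers but without survey_timings are the ones whose outputs will rely on the backup stream alone.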

The audio_recordings data stream can also be included in survey summary outputs. Sycamore does not process the audio data returned as part of audio surveys, but it can generate summaries with submission frequencies and survey duration for audio surveys.

Auxiliary files

Sycamore requires users to manually download files from the Beiwe website to create some outputs. These files can be downloaded by clicking “Edit this Study” on the study page, and clicking on the relevant file.

The file supplied to config_path can be downloaded by clicking “Export study settings JSON file” under “Export/Import study settings” on the study settings page. If the config_path argument is not supplied, the submits_and_deliveries.csv, submits_summary_daily.csv, and submits_summary_hourly.csv files will not be generated. This is because these files rely on an estimate of when surveys were delivered, and Sycamore gets information about when survey deliveries are made from the study configuration file.

The file supplied to interventions_filepath can be downloaded by clicking “Download Interventions” next to “Intervention Data” on the study settings page. If the interventions_filepath argument is not supplied, and if your study used relative surveys (e.g. surveys delivered 12 days after a participant’s start date), the submits_and_deliveries.csv, submits_summary_daily.csv, and submits_summary_hourly.csv files will not be generated. This is because the interventions file contains information about each Beiwe user’s intervention date, and Sycamore cannot guess a user’s intervention date from survey data alone. When running Sycamore, be sure to use an up-to-date version of the interventions file which contains intervention dates for users recently added to your study.

The file supplied to history_path can be downloaded by clicking “Download Surveys” next to “Survey History” on the study settings page. If this file is not supplied, Sycamore will not be able to provide prompts corresponding to audio surveys in output files. In addition, if this file is not supplied, and if the text of survey questions was changed during the study, surveys recovered from survey_answers files may not have the correct question IDs.
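
The rules above for the delivery-based outputs can be summarized as a small decision function. This is only a sketch of the rules as described in this section, not Sycamore's own logic; the function name and the uses_relative_surveys flag are hypothetical, and the effect of submits_timeframe is ignored.

```python
DELIVERY_OUTPUTS = (
    "submits_and_deliveries.csv",
    "submits_summary_daily.csv",
    "submits_summary_hourly.csv",
)


def delivery_outputs_expected(config_path=None, interventions_filepath=None,
                              uses_relative_surveys=False):
    """Return the delivery-based files expected for the supplied inputs.

    Hypothetical helper that restates the documented rules; it does not
    call Sycamore and does not account for submits_timeframe.
    """
    if config_path is None:
        # Delivery times are estimated from the study configuration file
        return ()
    if uses_relative_surveys and interventions_filepath is None:
        # Relative schedules need each user's intervention date
        return ()
    return DELIVERY_OUTPUTS


print(delivery_outputs_expected(config_path="path/to/config"))
```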

Functions

sycamore.base.compute_survey_stats
sycamore.common.aggregate_surveys_config
sycamore.submits.survey_submits
sycamore.submits.survey_submits_no_config
sycamore.responses.agg_changed_answers_summary

`sycamore.base.compute_survey_stats`

compute_survey_stats runs aggregate_surveys_config, survey_submits, survey_submits_no_config, and agg_changed_answers_summary, and writes their output to csv files. compute_survey_stats takes the following arguments:

data_dir: the path to the directory where Beiwe data is stored.
output_dir: the path to the directory where output is to be written.
tz_str: the time zone where the study was conducted (defaults to ‘UTC’ if not defined). You can see a list of all possible time zone names by importing pytz and using pytz.all_timezones.
beiwe_ids: the list of Beiwe IDs to run Sycamore on. If this is not specified, Sycamore will run on all users in the data_dir directory.
config_path: the filepath to your downloaded survey config file. See above for explanations about downloading auxiliary files.
interventions_filepath: the filepath to your downloaded interventions timing file.
history_path: the filepath to your downloaded survey history file.
start_date: the earliest date for which you might want survey information. Sycamore will generate survey deliveries starting at this date, and it will not include any surveys taken prior to this date in any outputs.
end_date: the latest date for which you might want survey information. Sycamore will generate survey deliveries ending at this date, and it will not include any surveys taken after this date in any outputs.
submits_timeframe: the timeframe over which to generate submission summaries. This must be one of the frequencies specified in forest.constants.Frequency. It determines whether submits_summary_daily.csv or submits_summary_hourly.csv (which aggregate survey deliveries and submissions at the daily or hourly level) are generated. By default, both hourly and daily summaries are generated, so you can usually leave this argument alone and delete any unwanted files. If you prefer, you can specify a single timeframe.
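
Since start_date and end_date bound every output, it may be worth checking them before a long run. The sketch below assumes the "YYYY-MM-DD" string format used in the examples in this document (Sycamore itself may accept other formats), and check_date_range is a hypothetical helper rather than a Forest function.

```python
from datetime import datetime


def check_date_range(start_date, end_date, fmt="%Y-%m-%d"):
    """Parse two date strings and confirm start_date is not after end_date.

    Hypothetical helper; the default format is an assumption based on the
    example values in this document.
    """
    start = datetime.strptime(start_date, fmt)
    end = datetime.strptime(end_date, fmt)
    if start > end:
        raise ValueError(f"start_date {start_date} is after end_date {end_date}")
    return start, end


start, end = check_date_range("2022-01-01", "2022-06-04")
print((end - start).days)  # 154 days between the two example dates
```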

Example (without config file)

from forest.sycamore import compute_survey_stats

# Replace the placeholder paths and IDs below with your own values
study_dir = "path/to/data"
output_dir = "path/to/output"
beiwe_ids = ["id_1", "id_2"]  # placeholder list of Beiwe IDs in study_dir
start_date = "2022-01-01"
end_date = "2022-06-04"
tz_str = "America/New_York"

compute_survey_stats(
    study_dir, output_dir, tz_str, beiwe_ids, start_date=start_date,
    end_date=end_date
)

Example (with config file)

from forest.constants import Frequency
from forest.sycamore import compute_survey_stats

# Replace the placeholder paths and IDs below with your own values
config_path = "path/to/config_file"
interventions_filepath = "path/to/interventions_file"
history_path = "path/to/history/file"
study_dir = "path/to/data"
output_dir = "path/to/output"
beiwe_ids = ["id_1", "id_2"]  # placeholder list of Beiwe IDs in study_dir
start_date = "2022-01-01"
end_date = "2022-06-04"
tz_str = "America/New_York"
submits_timeframe = Frequency.HOURLY_AND_DAILY

compute_survey_stats(
    study_dir, output_dir, tz_str, beiwe_ids, start_date=start_date,
    end_date=end_date, config_path=config_path,
    interventions_filepath=interventions_filepath,
    history_path=history_path, submits_timeframe=submits_timeframe
)

Most users should be able to use compute_survey_stats for all of their survey processing needs. However, if a study has collected a very large number of surveys, the individual functions that compute_survey_stats calls are also exposed, so that a single output can be generated without running everything.

`sycamore.common.aggregate_surveys_config`

Aggregates all survey information from a study, using the config file to infer information about surveys.

Example

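from forest.sycamore import aggregate_surveys_config

# study_dir, config_path, tz_str, and history_path are defined as in the
# compute_survey_stats example above
agg_data = aggregate_surveys_config(study_dir, config_path, tz_str, history_path=history_path)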

`sycamore.submits.survey_submits`

Extracts and summarizes delivery and submission times.

Example

from forest.sycamore import aggregate_surveys_config
from forest.sycamore import survey_submits

# Replace the placeholder paths and IDs below with your own values
config_path = "path/to/config_file"
interventions_path = "path/to/interventions_file"
history_path = "path/to/history/file"
study_dir = "path/to/data"
output_dir = "path/to/output"
beiwe_ids = ["id_1", "id_2"]  # placeholder list of Beiwe IDs in study_dir
start_date = "2022-01-01"
end_date = "2022-06-04"
tz_str = "America/New_York"

agg_data = aggregate_surveys_config(study_dir, config_path, tz_str)

submits_detail, submits_summary = survey_submits(
    config_path, start_date, end_date, beiwe_ids, interventions_path, agg_data,
    history_path
)

`sycamore.submits.survey_submits_no_config`

Extracts an alternative survey submits table that does not include delivery times.

Example

from forest.sycamore import survey_submits_no_config

study_dir = "path/to/data"

submits_tbl = survey_submits_no_config(study_dir)

`sycamore.responses.agg_changed_answers_summary`

Extracts data summarizing user responses.

Example

from forest.sycamore import agg_changed_answers_summary
from forest.sycamore import aggregate_surveys_config

# Replace the placeholder paths below with your own values
config_path = "path/to/config_file"
history_path = "path/to/history/file"
study_dir = "path/to/data"
study_tz = "America/New_York"  # time zone of the study (defaults to "UTC" if not defined)

agg_data = aggregate_surveys_config(study_dir, config_path, study_tz, history_path=history_path)

ca_detail, ca_summary = agg_changed_answers_summary(config_path, agg_data)

FAQ

In the submits_summary.csv file, there are some rows where num_submitted_surveys is greater than num_surveys. How could a user have submitted more surveys than were delivered to them?

Sycamore doesn’t know exactly when surveys were delivered to users. Survey delivery times are estimated using the study configuration file which you enter when you run the code. For example, imagine that you started running a study in March with survey deliveries happening daily, and in April you decided to switch your surveys to be delivered weekly. If you ran Sycamore in April, your config file would tell Sycamore that surveys were delivered weekly throughout the whole study. So, if you had a user submitting surveys daily during March, they would have ~30 survey submissions, but Sycamore would think that only ~5 surveys had been delivered during that time.

In addition, this may happen if a researcher manually re-sends surveys, because Sycamore has no information about manual (unscheduled) deliveries.
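
The mismatch in the March example can be reproduced with a short calculation. This sketch only counts evenly spaced dates; it is not how Sycamore builds its delivery schedule, and the year 2022 and the first delivery on March 1 are assumptions chosen for illustration.

```python
from datetime import date, timedelta


def count_deliveries(start, end, every_n_days):
    """Count deliveries from start to end (inclusive), one every n days."""
    count = 0
    current = start
    while current <= end:
        count += 1
        current += timedelta(days=every_n_days)
    return count


march_start, march_end = date(2022, 3, 1), date(2022, 3, 31)
daily = count_deliveries(march_start, march_end, 1)   # actual schedule in March
weekly = count_deliveries(march_start, march_end, 7)  # schedule in the April config
print(daily, weekly)  # 31 5
```

A participant who answered every daily survey would therefore show about 31 submissions against about 5 estimated deliveries for March.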

In the submits_and_deliveries.csv file, there are a ton of rows with deliveries but no submissions. Why is this happening?

If surveys are sent on a weekly schedule, Sycamore assumes that there is a survey delivered every week between the start_date and end_date which you entered. If you want there to be fewer empty rows in your output, you can move start_date and end_date to be closer to the actual start and end dates of your study.

What does surv_inst_flg mean in the outputs?

surv_inst_flg is a unique identifying number to distinguish different times when the same individual took the same survey. This column is useful for joining outputs together.
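
As an illustration of such a join, the sketch below matches rows from two outputs on the participant, the survey, and surv_inst_flg. It assumes both tables have been read into lists of dictionaries (for example with csv.DictReader) and that both carry the beiwe_id, survey id, and surv_inst_flg columns; check the columns of your own files first. The function name is hypothetical.

```python
def join_on_submission(submits_rows, other_rows):
    """Attach submission fields to rows of another output, per submission.

    Hypothetical helper. Assumes every row has the three key columns below.
    """
    key_cols = ("beiwe_id", "survey id", "surv_inst_flg")
    submits_by_key = {
        tuple(row[col] for col in key_cols): row for row in submits_rows
    }
    joined = []
    for row in other_rows:
        match = submits_by_key.get(tuple(row[col] for col in key_cols))
        if match is not None:
            # Values from other_rows win when a column name appears in both
            joined.append({**match, **row})
    return joined
```

Rows without a matching submission are dropped, which corresponds to an inner join.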

List of summary statistics

The following variables are created in the “submits_summary.csv” file. This file will only be generated if the config file and intervention timings file are provided. The submits_summary_daily.csv and submits_summary_hourly.csv files contain the same columns, but aggregated at the daily or hourly level rather than at the user level.

Variable	Type	Description of Variable
survey id	str	ID of the survey to which this row applies. Note: If `submits_by_survey_id` is False, surveys will not be aggregated at the survey level (they will only be aggregated by user), so this column will not appear.
year	int	Year of the time period at which submits/deliveries are being aggregated. This is only included in `submits_summary_daily.csv` and `submits_summary_hourly.csv`
month	int	Month of the time period at which submits/deliveries are being aggregated. This is only included in `submits_summary_daily.csv` and `submits_summary_hourly.csv`
day	int	Day over which submits/deliveries are being aggregated. This is only included in `submits_summary_daily.csv` and `submits_summary_hourly.csv`
hour	int	Hour over which submits/deliveries are being aggregated. This is only included in `submits_summary_hourly.csv`
num_surveys	int	Number of surveys scheduled for delivery to the individual during the period
num_submitted_surveys	int	Number of surveys submitted during the period (i.e. surveys on which the user hit submit)
num_opened_surveys	int	Number of surveys opened by the individual during the time period (i.e. the user answered at least one question)
avg_time_to_submit	float	Average time between survey delivery and survey submission, in seconds, for complete surveys
avg_time_to_open	float	Average time between survey delivery and survey opening, in seconds. This is averaged over survey responses where a survey_timings file was available because we do not have information about survey opening in responses where a survey_timings file is missing.
avg_duration	float	Average time between survey opening and survey submission, in seconds. This is averaged over survey responses where a survey_timings file was available because we do not have information about survey opening in responses where a survey_timings file is missing.
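
A common use of these columns is a submission rate per row. The following is a minimal sketch that reads a summary file with the standard csv module and divides num_submitted_surveys by num_surveys; the function name is hypothetical, and the file is assumed to be comma-separated with the column names listed above.

```python
import csv


def submission_rates(summary_csv_path):
    """Return one submitted/scheduled ratio per row of a submits summary file.

    Hypothetical helper; assumes a comma-separated file with the
    num_surveys and num_submitted_surveys columns.
    """
    rates = []
    with open(summary_csv_path, newline="") as f:
        for row in csv.DictReader(f):
            scheduled = float(row["num_surveys"])
            submitted = float(row["num_submitted_surveys"])
            # The ratio is undefined when no surveys were scheduled
            rates.append(submitted / scheduled if scheduled else None)
    return rates
```

Ratios above 1 are possible for the reasons given in the FAQ, so they are left as they are rather than capped.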

The following variables are created in the “submits_and_deliveries.csv” file. This file will only be generated if the config file and intervention timings file are provided.

Variable	Type	Description of Variable
survey id	str	ID of the survey
delivery_time	str	A scheduled delivery time. If surveys are weekly, delivery times will be generated for each week between start_date and end_date
submit_flg	str	Either the time when the user hit submit or the time when the individual stopped interacting with the survey for that session
time_to_submit	float	Time between survey delivery and survey submission, in seconds. If a survey was incomplete, this will be blank.
time_to_open	float	Time between survey delivery time and the first recorded survey answer, in seconds (for responses where a survey_timings file was available; if only a survey_answers file was available, this will be 0)
survey_duration	float	Time between the first recorded survey answer and the survey submission, in seconds (for responses where a survey_timings file was available; if only a survey_answers file was available, this will be NA)
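
Because time_to_submit is blank for incomplete surveys, any average taken over this column has to skip those rows. The sketch below shows one way to do that for rows read with csv.DictReader, where a blank cell arrives as an empty string; that representation is an assumption, and this is not presented as Sycamore's own calculation of avg_time_to_submit.

```python
def mean_time_to_submit(rows):
    """Average time_to_submit (seconds) over rows where it is not blank.

    Hypothetical helper; blank values are assumed to be "" or None.
    """
    times = [
        float(row["time_to_submit"])
        for row in rows
        if row["time_to_submit"] not in ("", None)
    ]
    return sum(times) / len(times) if times else None


example_rows = [
    {"time_to_submit": "3600"},
    {"time_to_submit": ""},      # incomplete survey, skipped
    {"time_to_submit": "1800"},
]
print(mean_time_to_submit(example_rows))  # 2700.0
```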

The following variables are created in the “answers_data.csv” file. This file will be generated if a survey config file is available.

Variable	Type	Description of Variable
survey id	str	ID of the survey
beiwe_id	str	The participant’s Beiwe ID
question id	str	The ID of the question for this line
question text	str	The question text corresponding to the answer
question type	str	The type of question (radio button, free response, etc.) corresponding to the answer
question answer options	str	The answer options presented to the user (applicable for check box or radio button surveys)
timestamp	str	The Unix timestamp corresponding to the latest time the user was on the question
Local time	str	The local time corresponding to the latest time the user was on the question
last_answer	str	The last answer the user had selected before moving on to the next question or submitting
all_answers	str	A list of all answers the user selected
num_answers	int	The number of different answers selected by the user (the length of the list in all_answers)
first_time	str	The local time corresponding to the earliest time the user was on the question
last_time	str	The local time corresponding to the latest time the user was on the question
time_to_answer	float	The time that the user spent on the question

The following variables are created in the “answers_summary.csv” file. This file will only be generated if the config file and intervention timings file are provided.

Variable	Type	Description of Variable
survey id	str	ID of the survey
beiwe_id	str	The participant’s Beiwe ID
question id	str	The ID of the question for this line
num_answers	int	The number of times the question is answered in the given data
average_time_to_answer	float	The average number of seconds the user takes to answer the question
average_number_of_answers	float	Average number of answers selected for a question. This indicates whether a user changed an answer before submitting it.
most_common_answer	str	A user’s most common answer to a question

The following variables are created in the “submits_only.csv” file. This file will always be generated.

Variable	Type	Description of Variable
survey id	str	ID of the survey
beiwe_id	str	The participant’s Beiwe ID
surv_inst_flg	int	A “submission flag” which distinguishes submissions that are done by the same individual on the same survey
max_time	str	Either the time when the user hit submit or the time when the individual stopped interacting with the survey for that session
min_time	str	The earliest time the individual was interacting with the survey that session
time_to_complete	float	Time between min_time and max_time, in seconds (for responses where a survey_timings file was available)
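
The relationship between min_time, max_time, and time_to_complete can be checked by hand. This sketch assumes the time columns are strings in "YYYY-MM-DD HH:MM:SS" format, which is an assumption about the file contents; adjust the format string to match your output.

```python
from datetime import datetime


def time_to_complete(min_time, max_time, fmt="%Y-%m-%d %H:%M:%S"):
    """Return the seconds elapsed between min_time and max_time.

    Hypothetical helper; the default timestamp format is assumed.
    """
    started = datetime.strptime(min_time, fmt)
    ended = datetime.strptime(max_time, fmt)
    return (ended - started).total_seconds()


print(time_to_complete("2022-03-01 09:00:00", "2022-03-01 09:02:30"))  # 150.0
```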

The following variables are created in a csv file for each survey.

Variable	Type	Description of Variable
start_time	str	Time this survey submission was started
end_time	str	Time this survey submission was ended
survey_duration	float	Difference between start and end time, in seconds (for surveys where a survey_timings file was available)
question_1, question_2, …	str	Responses to each question in the survey
