Sycamore
Executive Summary
Use sycamore to process and analyze Beiwe survey data.
Installation
Before using sycamore, install the librosa dependencies ffmpeg and libsndfile1, which are required to process audio survey files.
To install these dependencies on Ubuntu, run:
sudo apt-get install -y ffmpeg libsndfile1
For more information, see the librosa documentation.
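To confirm that both dependencies are visible from Python before processing audio surveys, a check along the following lines may help. This is a minimal sketch using only the standard library; the helper name check_audio_dependencies is hypothetical and is not part of Forest or librosa.

```python
import ctypes.util
import shutil


def check_audio_dependencies():
    """Report whether ffmpeg and libsndfile appear to be installed.

    Hypothetical helper for illustration; not part of Forest or librosa.
    """
    return {
        # shutil.which returns the executable's path, or None if not on PATH
        "ffmpeg": shutil.which("ffmpeg") is not None,
        # find_library returns a library name or path, or None if not found
        "libsndfile": ctypes.util.find_library("sndfile") is not None,
    }


for name, found in check_audio_dependencies().items():
    print(f"{name}: {'found' if found else 'missing'}")
```

If either entry is reported as missing, rerun the apt-get command above before processing audio survey files.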
Import
User-facing functions can be imported directly from sycamore.
Main Function
from forest.sycamore import compute_survey_stats
Less commonly used functions
from forest.sycamore import aggregate_surveys_config
from forest.sycamore import survey_submits
from forest.sycamore import survey_submits_no_config
from forest.sycamore import agg_changed_answers_summary
Note: Most users will only use compute_survey_stats. The other functions are listed for users interested in code development, or for users running Sycamore on studies with a very large number of surveys. For such studies, the main function (compute_survey_stats), which runs all the other functions, may take a long time even when a researcher is only interested in a specific output.
Usage
Download raw data from your Beiwe server and use this package to process survey data generated by the Beiwe app. Summary data provides metrics around survey submissions and survey question completion. Sycamore also takes various auxiliary files which can be downloaded from the Beiwe website to ensure accurate output.
Data
Methods are designed for use on the survey_timings, survey_answers, and audio_recordings data from the Beiwe app.
The survey_timings and survey_answers data streams are required for optimal data processing. The survey_timings stream is the best source of survey data because it has information on when a user responded to each question. Because survey files are not always uploaded to the Beiwe server, the survey_answers data stream is used as a backup to the survey_timings stream. The survey_answers stream only contains information about survey responses and the time of the survey’s final submission, so the survey_answers stream alone shouldn’t be used for survey processing.
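Before running Sycamore, it can be useful to see which of these streams each participant actually has. The sketch below assumes the raw data is laid out as data_dir/<beiwe_id>/<stream_name>/; that layout and the helper name available_streams are assumptions for illustration, so check them against your own download.

```python
from pathlib import Path

STREAMS = ("survey_timings", "survey_answers", "audio_recordings")


def available_streams(data_dir):
    """Map each participant folder in data_dir to the survey streams it contains.

    Assumes the layout data_dir/<beiwe_id>/<stream_name>/ (an assumption,
    not something this helper verifies).
    """
    result = {}
    for user_dir in sorted(Path(data_dir).iterdir()):
        if user_dir.is_dir():
            result[user_dir.name] = [
                stream for stream in STREAMS if (user_dir / stream).is_dir()
            ]
    return result
```

Participants that have only survey_answers listed are the ones for which timing-based outputs will be least informative.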
The audio_recordings data stream can also be included in survey summary outputs. Sycamore does not process the audio data returned as part of audio surveys, but it can generate summaries with submission frequencies and survey duration for audio surveys.
Auxiliary files
Sycamore requires users to manually download files from the Beiwe website to create some outputs. These files can be downloaded by clicking “Edit this Study” on the study page, and clicking on the relevant file.
The file supplied to config_path can be downloaded by clicking “Export study settings JSON file” under “Export/Import study settings” on the study settings page. If the config_path argument is not supplied, the submits_and_deliveries.csv, submits_summary_daily.csv, and submits_summary_hourly.csv files will not be generated. This is because these files rely on an estimate of when surveys were delivered, and Sycamore gets information about when survey deliveries are made from the study configuration file.
The file supplied to interventions_filepath can be downloaded by clicking “Download Interventions” next to “Intervention Data” on the study settings page. If the interventions_filepath argument is not supplied, and if your study used relative surveys (e.g. surveys are delivered 12 days after a participant’s start date), the submits_and_deliveries.csv, submits_summary_daily.csv, and submits_summary_hourly.csv files will not be generated. This is because the interventions file contains information about each Beiwe user’s intervention date, and Sycamore cannot guess a user’s intervention date from survey data alone. When running Sycamore, be sure to use an up-to-date version of the interventions file which contains intervention dates for users recently added to your study.
The file supplied to history_path can be downloaded by clicking “Download Surveys” next to “Survey History” on the study settings page. If this file is not supplied, Sycamore will not be able to provide prompts corresponding to audio surveys in output files. In addition, if this file is not supplied, and if the text of survey questions was changed during the study, surveys recovered from survey_answers files may not have the correct question IDs.
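The rules above for the delivery-based outputs can be summarized in a few lines. This is only a sketch of the behavior described in this section, not Sycamore's own logic; the function name skipped_delivery_outputs and the uses_relative_surveys flag are hypothetical.

```python
DELIVERY_OUTPUTS = [
    "submits_and_deliveries.csv",
    "submits_summary_daily.csv",
    "submits_summary_hourly.csv",
]


def skipped_delivery_outputs(config_path=None, interventions_filepath=None,
                             uses_relative_surveys=False):
    """Return the delivery-based output files that would not be generated.

    Encodes only the rules described in the text above; hypothetical helper.
    """
    # Without the study config, delivery times cannot be estimated at all.
    if config_path is None:
        return list(DELIVERY_OUTPUTS)
    # Relative surveys also need each participant's intervention date.
    if uses_relative_surveys and interventions_filepath is None:
        return list(DELIVERY_OUTPUTS)
    return []
```

For example, a study with relative surveys that supplies a config file but no interventions file would still be missing all three delivery-based files.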
Functions
sycamore.base.compute_survey_stats
compute_survey_stats runs aggregate_surveys_config, survey_submits, survey_submits_no_config, and agg_changed_answers_summary, and writes their output to csv files. compute_survey_stats takes the following arguments:
data_dir: the path to the directory where Beiwe data is stored.
output_dir: the path to the directory where output is to be written.
tz_str: the time zone where the study was conducted (defaults to ‘UTC’ if not defined). You can see a list of all possible time zone names by importing pytz and using pytz.all_timezones.
beiwe_ids: the list of Beiwe IDs to run Sycamore on. If this is not specified, Sycamore will run on all users in the data_dir directory.
config_path: the filepath to your downloaded survey config file. See above for explanations about downloading auxiliary files.
interventions_filepath: the filepath to your downloaded interventions timing file.
history_path: the filepath to your downloaded survey history file.
start_date: the earliest date for which you want survey information. Sycamore will generate survey deliveries starting at this date, and it will not include any surveys taken prior to this date in any outputs.
end_date: the latest date for which you want survey information. Sycamore will generate survey deliveries ending at this date, and it will not include any surveys taken after this date in any outputs.
submits_timeframe: the timeframe for which to generate submission summaries. This must be one of the frequencies specified in forest.constants.Frequency. It determines whether submits_summary_daily.csv or submits_summary_hourly.csv (which aggregate survey submissions and deliveries at the daily or hourly levels) get generated. The default is to generate both hourly and daily summaries, so you can usually leave this argument alone and delete any unwanted files. If you prefer, you can specify a single timeframe.
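Because start_date and end_date bound every output, it may be worth validating them before a long run. The sketch below assumes the "YYYY-MM-DD" string format used in the examples in this document; the helper name parse_date_window is hypothetical and not part of Forest.

```python
from datetime import datetime


def parse_date_window(start_date, end_date):
    """Parse two "YYYY-MM-DD" strings and check that they are in order.

    The date format is assumed from the examples in this document.
    """
    start = datetime.strptime(start_date, "%Y-%m-%d")
    end = datetime.strptime(end_date, "%Y-%m-%d")
    if start > end:
        raise ValueError(f"start_date {start_date} is after end_date {end_date}")
    return start, end


start, end = parse_date_window("2022-01-01", "2022-06-04")
print((end - start).days)  # prints 154
```

A malformed date or a reversed window raises ValueError immediately rather than partway through processing.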
Example (without config file)
from forest.sycamore import compute_survey_stats
study_dir = "path/to/data"
output_dir = "path/to/output"
beiwe_ids = ["id_1", "id_2"]  # placeholder: list of ids in study_dir
start_date = "2022-01-01"
end_date = "2022-06-04"
tz_str = "America/New_York"
compute_survey_stats(
    study_dir, output_dir, tz_str, beiwe_ids, start_date=start_date,
    end_date=end_date
)
Example (with config file)
from forest.constants import Frequency
from forest.sycamore import compute_survey_stats
config_path = "path/to/config file"
interventions_filepath = "path/to/interventions file"
history_path = "path/to/history/file"
study_dir = "path/to/data"
output_dir = "path/to/output"
beiwe_ids = ["id_1", "id_2"]  # placeholder: list of ids in study_dir
start_date = "2022-01-01"
end_date = "2022-06-04"
tz_str = "America/New_York"
submits_timeframe = Frequency.HOURLY_AND_DAILY
compute_survey_stats(
    study_dir, output_dir, tz_str, beiwe_ids, start_date=start_date,
    end_date=end_date, config_path=config_path,
    interventions_filepath=interventions_filepath,
    history_path=history_path, submits_timeframe=submits_timeframe
)
Most users should be able to use compute_survey_stats for all of their survey processing needs. However, if a study has collected a very large number of surveys, subprocesses are also exposed to reduce processing time.
sycamore.common.aggregate_surveys_config
Aggregate all survey information from a study, using the config file to infer information about surveys.
Example
from forest.sycamore import aggregate_surveys_config
# study_dir, config_path, tz_str, and history_path are defined as in the compute_survey_stats example above
agg_data = aggregate_surveys_config(study_dir, config_path, tz_str, history_path=history_path)
sycamore.submits.survey_submits
Extract and summarize delivery and submission times.
Example
from forest.sycamore import aggregate_surveys_config
from forest.sycamore.submits import survey_submits
config_path = "path/to/config file"
interventions_path = "path/to/interventions file"
history_path = "path/to/history/file"
study_dir = "path/to/data"
output_dir = "path/to/output"
beiwe_ids = ["id_1", "id_2"]  # placeholder: list of ids in study_dir
start_date = "2022-01-01"
end_date = "2022-06-04"
tz_str = "America/New_York"
agg_data = aggregate_surveys_config(study_dir, config_path, tz_str)
submits_detail, submits_summary = survey_submits(
    config_path, start_date, end_date, beiwe_ids, interventions_path, agg_data,
    history_path
)
sycamore.submits.survey_submits_no_config
Used to extract an alternative survey submits table that does not include delivery times.
Example
from forest.sycamore import survey_submits_no_config
study_dir = "path/to/data"
submits_tbl = survey_submits_no_config(study_dir)
sycamore.responses.agg_changed_answers_summary
Used to extract data summarizing user responses.
Example
from forest.sycamore import aggregate_surveys_config, agg_changed_answers_summary
config_path = "path/to/config file"
history_path = "path/to/history/file"
study_dir = "path/to/data"
output_dir = "path/to/output"
beiwe_ids = ["id_1", "id_2"]  # placeholder: list of ids in study_dir
time_start = "2022-01-01"  # placeholder start time
time_end = "2022-06-04"  # placeholder end time
study_tz = "America/New_York"  # time zone of the study (defaults to 'UTC' if not defined)
agg_data = aggregate_surveys_config(study_dir, config_path, study_tz, history_path=history_path)
ca_detail, ca_summary = agg_changed_answers_summary(config_path, agg_data)
FAQ
In the submits_summary.csv file, there are some rows where num_submitted_surveys is greater than num_surveys. How could a user have submitted more surveys than were delivered to them?
Sycamore doesn’t know exactly when surveys were delivered to users. Survey delivery times are estimated using the study configuration file which you enter when you run the code. For example, imagine that you started running a study in March with survey deliveries happening daily, and in April you decided to switch your surveys to be delivered weekly. If you ran Sycamore in April, your config file would tell Sycamore that surveys were delivered weekly throughout the whole study. So, if you had a user submitting surveys daily during March, they would have ~30 survey submissions, but Sycamore would think that only ~5 surveys had been delivered during that time.
In addition, this may happen if a researcher manually re-sends surveys, because Sycamore has no information about manual (unscheduled) deliveries.
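The arithmetic in the March example can be checked directly. The sketch below counts scheduled deliveries under a daily and a weekly schedule; the year 2022, the March 1 start, and the helper name count_deliveries are illustrative assumptions, and this is not how Sycamore itself estimates deliveries.

```python
from datetime import date, timedelta


def count_deliveries(start, end, every_n_days):
    """Count scheduled deliveries from start to end (inclusive), one every n days."""
    count = 0
    current = start
    while current <= end:
        count += 1
        current += timedelta(days=every_n_days)
    return count


march_start, march_end = date(2022, 3, 1), date(2022, 3, 31)
daily = count_deliveries(march_start, march_end, 1)   # the schedule actually used in March
weekly = count_deliveries(march_start, march_end, 7)  # the schedule the April config implies
print(daily, weekly)  # prints "31 5"
```

A participant who answered every daily survey would therefore show about 31 submissions against only 5 estimated deliveries.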
In the submits_and_deliveries.csv file, there are a ton of rows with deliveries but no submissions. Why is this happening?
If surveys are sent on a weekly schedule, Sycamore assumes that there is a survey delivered every week between the start_date and end_date which you entered. If you want there to be fewer empty rows in your output, you can move start_date and end_date to be closer to the actual start and end dates of your study.
What does surv_inst_flg mean in the outputs?
surv_inst_flg is a unique identifying number to distinguish different times when the same individual took the same survey. This column is useful for joining outputs together.
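As a rough illustration of using surv_inst_flg as part of a join key, the sketch below joins two small in-memory tables on (beiwe_id, survey id, surv_inst_flg). The rows are made up, the second table is hypothetical, and the helper name join_on_instance is not part of Forest; check which of your output files actually contain the column before joining them.

```python
def join_on_instance(left_rows, right_rows):
    """Inner-join two lists of dict rows on (beiwe_id, survey id, surv_inst_flg)."""
    key_cols = ("beiwe_id", "survey id", "surv_inst_flg")
    # Index the right-hand rows by their key for quick lookup.
    index = {}
    for row in right_rows:
        index.setdefault(tuple(row[c] for c in key_cols), []).append(row)
    joined = []
    for row in left_rows:
        for match in index.get(tuple(row[c] for c in key_cols), []):
            joined.append({**row, **match})
    return joined


# Made-up rows for illustration; real outputs have more columns.
submits = [
    {"beiwe_id": "abc123", "survey id": "s1", "surv_inst_flg": 1, "time_to_complete": 42.0},
    {"beiwe_id": "abc123", "survey id": "s1", "surv_inst_flg": 2, "time_to_complete": 37.5},
]
answers = [
    {"beiwe_id": "abc123", "survey id": "s1", "surv_inst_flg": 2, "last_answer": "3"},
]
joined = join_on_instance(submits, answers)
print(joined)
```

Only the second submission appears in the result, because it is the only one whose surv_inst_flg matches a row in the other table.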
List of summary statistics
The following variables are created in the “submits_summary.csv” file. This file will only be generated if the config file and intervention timings file are provided. The submits_summary_daily.csv and submits_summary_hourly.csv files contain the same columns, but with additional granularity at the daily or hourly level rather than at the user level.
| Variable | Type | Description of Variable |
|---|---|---|
| survey id | str | ID of the survey for which this row applies to. Note: If |
| year | int | Year of the time period at which submits/deliveries are being aggregated. This is only included in |
| month | int | Month of the time period at which submits/deliveries are being aggregated. This is only included in |
| day | int | Day over which submits/deliveries are being aggregated. This is only included in |
| hour | int | Hour over which submits/deliveries are being aggregated. This is only included in |
| num_surveys | int | Number of surveys scheduled for delivery to the individual during the period |
| num_submitted_surveys | int | Number of surveys submitted during the period (i.e. the user hit submit on all surveys) |
| num_opened_surveys | int | Number of surveys opened by the individual during the time period (i.e. the user answered at least one question) |
| avg_time_to_submit | float | Average time between survey delivery and survey submission, in seconds, for complete surveys |
| avg_time_to_open | float | Average time between survey delivery and survey opening, in seconds. This is averaged over survey responses where a survey_timings file was available because we do not have information about survey opening in responses where a survey_timings file is missing. |
| avg_duration | float | Average time between survey opening and survey submission, in seconds. This is averaged over survey responses where a survey_timings file was available because we do not have information about survey opening in responses where a survey_timings file is missing. |
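To make the relationship between per-delivery rows and these summary columns concrete, the sketch below derives num_surveys, num_submitted_surveys, and avg_time_to_submit from a handful of made-up delivery rows. It is an illustration of the definitions above, not Sycamore's implementation; the helper name summarize_deliveries is hypothetical, and a missing time_to_submit is represented here as None.

```python
from statistics import mean


def summarize_deliveries(rows):
    """Summarize delivery rows; time_to_submit is None when a survey was not submitted."""
    submit_times = [r["time_to_submit"] for r in rows if r["time_to_submit"] is not None]
    return {
        "num_surveys": len(rows),
        "num_submitted_surveys": len(submit_times),
        # Averaged over complete surveys only, as in the table above.
        "avg_time_to_submit": mean(submit_times) if submit_times else None,
    }


# Three scheduled deliveries, two of which were submitted (times in seconds).
rows = [
    {"time_to_submit": 120.0},
    {"time_to_submit": None},
    {"time_to_submit": 300.0},
]
summary = summarize_deliveries(rows)
print(summary)
```

Here the unsubmitted delivery counts toward num_surveys but is excluded from the average, giving 3 deliveries, 2 submissions, and an average of 210.0 seconds.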
The following variables are created in the “submits_and_deliveries.csv” file. This file will only be generated if the config file and intervention timings file are provided.
| Variable | Type | Description of Variable |
|---|---|---|
| survey id | str | ID of the survey |
| delivery_time | str | A scheduled delivery time. If surveys are weekly, delivery times will be generated for each week between start_date and end_date |
| submit_flg | str | Either the time when the user hit submit or the time when the individual stopped interacting with the survey for that session |
| time_to_submit | float | Time between survey delivery and survey submission, in seconds. If a survey was incomplete, this will be blank. |
| time_to_open | float | Time between survey delivery time and the first recorded survey answer, in seconds (for responses where a survey_timings file was available; if only a survey_answers file was available, this will be 0) |
| survey_duration | float | Time between the first recorded survey answer and the survey submission, in seconds (for responses where a survey_timings file was available; if only a survey_answers file was available, this will be NA) |
The following variables are created in the “answers_data.csv” file. This file will be generated if a survey config file is available.
| Variable | Type | Description of Variable |
|---|---|---|
| survey id | str | ID of the survey |
| beiwe_id | str | The participant’s Beiwe ID |
| question id | str | The ID of the question for this line |
| question text | str | The question text corresponding to the answer |
| question type | str | The type of question (radio button, free response, etc.) corresponding to the answer |
| question answer options | str | The answer options presented to the user (applicable for check box or radio button surveys) |
| timestamp | str | The Unix timestamp corresponding to the latest time the user was on the question |
| Local time | str | The local time corresponding to the latest time the user was on the question |
| last_answer | str | The last answer the user had selected before moving on to the next question or submitting |
| all_answers | str | A list of all answers the user selected |
| num_answers | int | The number of different answers selected by the user (the length of the list in all_answers) |
| first_time | str | The local time corresponding to the earliest time the user was on the question |
| last_time | str | The local time corresponding to the latest time the user was on the question |
| time_to_answer | float | The time that the user spent on the question |
The following variables are created in the “answers_summary.csv” file. This file will only be generated if the config file and intervention timings file are provided.
| Variable | Type | Description of Variable |
|---|---|---|
| survey id | str | ID of the survey |
| beiwe_id | str | The participant’s Beiwe ID |
| question id | str | The ID of the question for this line |
| num_answers | int | The number of times the question is answered in the given data |
| average_time_to_answer | float | The average number of seconds the user takes to answer the question |
| average_number_of_answers | float | Average number of answers selected for a question. This indicates whether a user changed an answer before submitting it. |
| most_common_answer | str | A user’s most common answer to a question |
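The per-question summary columns can be illustrated with a small aggregation over one participant's responses to one question. This sketch is not Sycamore's implementation: the helper name summarize_question is hypothetical, the input rows are made up, and taking most_common_answer from the last_answer values is an assumption.

```python
from collections import Counter
from statistics import mean


def summarize_question(responses):
    """Summarize one participant's responses to one question.

    Each response is a dict with "last_answer" and "num_answers" keys, named
    after the answers_data.csv columns. Requires at least one response.
    """
    last_answers = [r["last_answer"] for r in responses]
    return {
        "num_answers": len(responses),
        "average_number_of_answers": mean(r["num_answers"] for r in responses),
        # Assumption: the most common answer is taken over final answers.
        "most_common_answer": Counter(last_answers).most_common(1)[0][0],
    }


responses = [
    {"last_answer": "Often", "num_answers": 1},
    {"last_answer": "Rarely", "num_answers": 3},
    {"last_answer": "Often", "num_answers": 2},
]
question_summary = summarize_question(responses)
print(question_summary)
```

An average_number_of_answers above 1, as in this made-up data, suggests the participant sometimes changed an answer before moving on.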
The following variables are created in the “submits_only.csv” file. This file will always be generated.
| Variable | Type | Description of Variable |
|---|---|---|
| survey id | str | ID of the survey |
| beiwe_id | str | The participant’s Beiwe ID |
| surv_inst_flg | int | A “submission flag” which distinguishes submissions that are done by the same individual on the same survey |
| max_time | str | Either the time when the user hit submit or the time when the individual stopped interacting with the survey for that session |
| min_time | str | The earliest time the individual was interacting with the survey that session |
| time_to_complete | float | Time between min_time and max_time, in seconds (for responses where a survey_timings file was available) |
The following variables are created in a csv file for each survey.
| Variable | Type | Description of Variable |
|---|---|---|
| start_time | str | Time this survey submission was started |
| end_time | str | Time this survey submission was ended |
| survey_duration | float | Difference between start and end time, in seconds (for surveys where a survey_timings file was available) |
| question_1, question_2, … | str | Responses to each question in the survey |