# Sycamore

## Executive Summary

Use `sycamore` to process and analyze Beiwe survey data.

## Installation

Before using sycamore, install the librosa dependencies (ffmpeg and libsndfile1), which are needed to process audio survey files.  

To install these dependencies on Ubuntu, run:  
`sudo apt-get install -y ffmpeg libsndfile1`  

For more information, see the [librosa documentation](https://librosa.org/doc/latest/install.html).

## Import

User-facing functions can be imported directly from sycamore. 

### Main Function
`from forest.sycamore import compute_survey_stats`  

### Less commonly used functions
```python
from forest.sycamore import aggregate_surveys_config
from forest.sycamore import survey_submits
from forest.sycamore import survey_submits_no_config
from forest.sycamore import agg_changed_answers_summary
```

Note: Most users will only need `compute_survey_stats`. The other functions are listed for users interested in code development, or for users running Sycamore on studies with a very large number of surveys. In such studies, the main function (`compute_survey_stats`), which runs all the other functions, may take a long time even though a researcher may only be interested in one specific output.

## Usage
Download raw data from your Beiwe server and use this package to process survey data generated by the Beiwe app. Summary data provides metrics around survey submissions and survey question completion. Sycamore also takes various auxiliary files which can be downloaded from the Beiwe website to ensure accurate output.

## Data
Methods are designed for use on the `survey_timings`, `survey_answers`, and `audio_recordings` data from the Beiwe app.

The `survey_timings` and `survey_answers` data streams are required for optimal data processing. The `survey_timings` stream is the best source of survey data because it has information on when a user responded to each question. Because survey files are not always uploaded to the Beiwe server, the `survey_answers` data stream is used as a backup to the `survey_timings` stream. The `survey_answers` stream only contains information about survey responses and the time of the survey's final submission, so the `survey_answers` stream alone shouldn't be used for survey processing. 

The `audio_recordings` data stream can also be included in survey summary outputs. Sycamore does not process the audio data returned as part of audio surveys, but it can generate summaries with submission frequencies and survey duration for audio surveys. 
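
Before running Sycamore, it can help to see which of these streams each participant actually has. The sketch below is not part of Sycamore: `available_streams` and `users_missing_timings` are hypothetical helpers, and the directory layout `data_dir/<beiwe_id>/<stream_name>/` is an assumption about how the downloaded data is organized.

```python
from pathlib import Path

# Streams named in this README; the directory layout is an assumption.
SURVEY_STREAMS = ("survey_timings", "survey_answers", "audio_recordings")


def available_streams(data_dir):
    """Map each user directory to the survey-related streams found inside it.

    Assumes a layout of data_dir/<beiwe_id>/<stream_name>/.
    """
    result = {}
    for user_dir in sorted(Path(data_dir).iterdir()):
        if not user_dir.is_dir():
            continue
        result[user_dir.name] = [
            stream for stream in SURVEY_STREAMS if (user_dir / stream).is_dir()
        ]
    return result


def users_missing_timings(data_dir):
    """List users that have no survey_timings directory."""
    return [
        user
        for user, streams in available_streams(data_dir).items()
        if "survey_timings" not in streams
    ]
```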

## Auxiliary files
Sycamore requires users to manually download files from the Beiwe website to create some outputs. These files can be downloaded by clicking "Edit this Study" on the study page, and clicking on the relevant file. 

The file supplied to `config_path` can be downloaded by clicking "Export study settings JSON file" under "Export/Import study settings" on the study settings page.  If the `config_path` argument is not supplied, the `submits_and_deliveries.csv`, `submits_summary_daily.csv`, and `submits_summary_hourly.csv` files will not be generated. This is because these files rely on an estimate of when surveys were delivered, and Sycamore gets information about when survey deliveries are made from the study configuration file.  

The file supplied to `interventions_filepath` can be downloaded by clicking "Download Interventions" next to "Intervention Data" on the study settings page. If the `interventions_filepath` argument is not supplied, and if your study used relative surveys (for example, surveys delivered 12 days after a participant's start date), the `submits_and_deliveries.csv`, `submits_summary_daily.csv`, and `submits_summary_hourly.csv` files will not be generated. This is because the interventions file contains information about each Beiwe user's intervention date, and Sycamore cannot guess a user's intervention date from survey data alone. When running Sycamore, be sure to use an up-to-date version of the interventions file which contains intervention dates for users recently added to your study.  

The file supplied to `history_path` can be downloaded by clicking "Download Surveys" next to "Survey History" on the study settings page. If this file is not supplied, Sycamore will not be able to provide prompts corresponding to audio surveys in output files. In addition, if this file is not supplied, and if the text of survey questions was changed during the study, surveys recovered from `survey_answers` files may not have the correct question IDs.
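
The rules above about which delivery-based files are skipped can be summarized in a few lines. This is only a restatement of the text as code, not Sycamore's own logic: `skipped_delivery_outputs` and the `uses_relative_surveys` flag are hypothetical names introduced for illustration.

```python
# Output files that depend on estimated delivery times, as listed above.
DELIVERY_OUTPUTS = (
    "submits_and_deliveries.csv",
    "submits_summary_daily.csv",
    "submits_summary_hourly.csv",
)


def skipped_delivery_outputs(
    config_path=None, interventions_filepath=None, uses_relative_surveys=False
):
    """Return the delivery-based outputs that will not be generated.

    Hypothetical helper that mirrors the rules described in this section.
    """
    # Without the study config, delivery times cannot be estimated at all.
    if config_path is None:
        return list(DELIVERY_OUTPUTS)
    # Relative surveys also need each user's intervention date.
    if uses_relative_surveys and interventions_filepath is None:
        return list(DELIVERY_OUTPUTS)
    return []
```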

___
## Functions

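* [sycamore.base.compute_survey_stats](#sycamorebasecompute_survey_stats)
* [sycamore.common.aggregate_surveys_config](#sycamorecommonaggregate_surveys_config)
* [sycamore.submits.survey_submits](#sycamoresubmitssurvey_submits)
* [sycamore.submits.survey_submits_no_config](#sycamoresubmitssurvey_submits_no_config)
* [sycamore.responses.agg_changed_answers_summary](#sycamoreresponsesagg_changed_answers_summary)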
___

### `sycamore.base.compute_survey_stats`

`compute_survey_stats` runs `aggregate_surveys_config`, `survey_submits`, `survey_submits_no_config`, and `agg_changed_answers_summary`, and writes their output to CSV files. It takes the following arguments:

* `data_dir`: the path to the directory where Beiwe data is stored.
* `output_dir`: the path to the directory where output is to be written.
* `tz_str`: the time zone where the study was conducted (defaults to 'UTC' if not defined). You can see a list of all possible time zone names by importing `pytz` and inspecting `pytz.all_timezones`.
* `beiwe_ids`: the list of Beiwe IDs to run Forest on. If this is not specified, Sycamore will run on all users in the `data_dir` directory.
* `config_path`: the filepath to your downloaded survey config file. See above for explanations about downloading auxiliary files.
* `interventions_filepath`: the filepath to your downloaded interventions timing file.
* `history_path`: the filepath to your downloaded survey history file.
* `start_date`: the earliest date for which you might want survey information. Sycamore will generate survey deliveries starting at this date, and it will not include any surveys taken prior to this date in any outputs.
* `end_date`: the latest date for which you might want survey information. Sycamore will generate survey deliveries ending at this date, and it will not include any surveys taken after this date in any outputs.
* `submits_timeframe`: the timeframe over which to generate submission summaries. This must be one of the frequencies specified in `forest.constants.Frequency`. It determines whether `submits_summary_daily.csv` or `submits_summary_hourly.csv` (which aggregate survey deliveries and submissions at the daily or hourly level) get generated. By default, both hourly and daily summaries are generated, so you can usually leave this argument alone and delete any unwanted files. If you prefer, you can specify a single timeframe.
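
A misspelled time zone name is easy to miss, so it may be worth checking `tz_str` before a long run. The text above points to `pytz.all_timezones`; the sketch below instead uses the standard-library `zoneinfo` module (Python 3.9+), on the assumption that both rely on the same IANA time zone names. `is_known_timezone` is a hypothetical helper, not a Sycamore function.

```python
from zoneinfo import ZoneInfo, ZoneInfoNotFoundError


def is_known_timezone(tz_str):
    """Return True if tz_str names a time zone in the local tz database."""
    try:
        ZoneInfo(tz_str)
    except (ZoneInfoNotFoundError, ValueError):
        # Unknown key, or a malformed key such as a path outside the database.
        return False
    return True


# A typo such as "America/New_Yrok" is rejected before any data is processed.
tz_str = "America/New_Yrok"
if not is_known_timezone(tz_str):
    print(f"Unknown time zone name: {tz_str}")
```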

*Example (without config file)*    
```python
from forest.sycamore import compute_survey_stats

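study_dir = "path/to/data"  # directory containing the raw Beiwe data
output_dir = "path/to/output"  # directory where output files are written
beiwe_ids = ["beiwe_id_1", "beiwe_id_2"]  # placeholder IDs; use IDs found in study_dir
start_date = "2022-01-01"
end_date = "2022-06-04"
tz_str = "America/New_York"

compute_survey_stats(
    study_dir, output_dir, tz_str, beiwe_ids, start_date=start_date,
    end_date=end_date
)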
```

*Example (with config file)* 
```python
from forest.constants import Frequency
from forest.sycamore import compute_survey_stats

config_path = "path/to/config_file"
interventions_filepath = "path/to/interventions_file"
history_path = "path/to/history/file"
study_dir = "path/to/data"
output_dir = "path/to/output"
beiwe_ids = ["beiwe_id_1", "beiwe_id_2"]  # placeholder IDs; use IDs found in study_dir
start_date = "2022-01-01"
end_date = "2022-06-04"
tz_str = "America/New_York"
submits_timeframe = Frequency.HOURLY_AND_DAILY

compute_survey_stats(
    study_dir, output_dir, tz_str, beiwe_ids, start_date=start_date,
    end_date=end_date, config_path=config_path,
    interventions_filepath=interventions_filepath,
    history_path=history_path, submits_timeframe=submits_timeframe
)
```

Most users should be able to use `compute_survey_stats` for all of their survey processing needs. However, if a study has collected a very large number of surveys, subprocesses are also exposed to reduce processing time. 

___
### `sycamore.common.aggregate_surveys_config`

Aggregates all survey information from a study, using the config file to infer information about surveys.

*Example*  
```python
from forest.sycamore import aggregate_surveys_config

agg_data = aggregate_surveys_config(study_dir, config_path, study_tz, history_path=history_path)
```
___
### `sycamore.submits.survey_submits`

Extracts and summarizes delivery and submission times.

*Example*  
```python
from forest.sycamore import aggregate_surveys_config
from forest.sycamore.submits import survey_submits

config_path = "path/to/config_file"
interventions_path = "path/to/interventions_file"
history_path = "path/to/history/file"
study_dir = "path/to/data"
output_dir = "path/to/output"
beiwe_ids = ["beiwe_id_1", "beiwe_id_2"]  # placeholder IDs; use IDs found in study_dir
start_date = "2022-01-01"
end_date = "2022-06-04"
tz_str = "America/New_York"

# Aggregate the survey data first; survey_submits takes the result as input.
agg_data = aggregate_surveys_config(study_dir, config_path, tz_str)

submits_detail, submits_summary = survey_submits(
    config_path, start_date, end_date, beiwe_ids, interventions_path, agg_data,
    history_path
)
```
___
### `sycamore.submits.survey_submits_no_config`
Used to extract an alternative survey submits table that does not include delivery times

*Example*  
```python
from forest.sycamore import survey_submits_no_config

study_dir = "path/to/data"

submits_tbl = survey_submits_no_config(study_dir)
```
___
### `sycamore.responses.agg_changed_answers_summary`
Used to extract data summarizing user responses
 
*Example*  
```python
from forest.sycamore import agg_changed_answers_summary
from forest.sycamore import aggregate_surveys_config

config_path = "path/to/config_file"
history_path = "path/to/history/file"
study_dir = "path/to/data"
study_tz = "America/New_York"  # time zone of the study

# Aggregate the survey data first; the summary is computed from the result.
agg_data = aggregate_surveys_config(study_dir, config_path, study_tz, history_path=history_path)

ca_detail, ca_summary = agg_changed_answers_summary(config_path, agg_data)
```

## FAQ

**In the `submits_summary.csv` file, there are some rows where `num_submitted_surveys` is greater than `num_surveys`. How could a user have submitted more surveys than were delivered to them?**   

Sycamore doesn't know exactly when surveys were delivered to users. Survey delivery times are estimated using the study configuration file which you enter when you run the code. For example, imagine that you started running a study in March with survey deliveries happening daily, and in April you decided to switch your surveys to be delivered weekly. If you ran Sycamore in April, your config file would tell Sycamore that surveys were delivered weekly throughout the whole study. So, if you had a user submitting surveys daily during March, they would have ~30 survey submissions, but Sycamore would think that only ~5 surveys had been delivered during that time.   

In addition, this may happen if a researcher manually re-sends surveys, because Sycamore has no information about manual (unscheduled) deliveries.   
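
The March example above can be worked through with a few lines of date arithmetic. This is only an illustration of the mismatch, not how Sycamore estimates deliveries: `delivery_dates` is a hypothetical helper, the year 2022 is arbitrary, and the weekly schedule is assumed to start on the first day of the month.

```python
from datetime import date, timedelta


def delivery_dates(start, end, every_days):
    """Dates from start to end (inclusive), spaced every_days apart."""
    dates = []
    current = start
    while current <= end:
        dates.append(current)
        current += timedelta(days=every_days)
    return dates


march_start, march_end = date(2022, 3, 1), date(2022, 3, 31)

# What the participant actually did: one submission per day in March.
num_submitted_surveys = len(delivery_dates(march_start, march_end, every_days=1))
# What a weekly config implies was delivered over the same period.
num_surveys = len(delivery_dates(march_start, march_end, every_days=7))

print(num_submitted_surveys, num_surveys)  # 31 5
```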

**In the `submits_and_deliveries.csv` file, there are a ton of rows with deliveries but no submissions. Why is this happening?**    

If surveys are sent on a weekly schedule, Sycamore assumes that there is a survey delivered every week between the `start_date` and `end_date` which you entered. If you want there to be fewer empty rows in your output, you can move `start_date` and `end_date` to be closer to the actual start and end dates of your study.   
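
One way to pick a tighter window is to derive it from the submission dates you have already observed. The helper below is a hypothetical sketch, not a Sycamore function; it returns dates in the `YYYY-MM-DD` form used in the examples in this README.

```python
from datetime import date


def suggested_window(submission_dates):
    """Return (start_date, end_date) strings spanning the observed submissions."""
    if not submission_dates:
        raise ValueError("no submission dates supplied")
    return min(submission_dates).isoformat(), max(submission_dates).isoformat()


# Made-up submission dates for illustration.
observed = [date(2022, 2, 3), date(2022, 1, 15), date(2022, 5, 20)]
start_date, end_date = suggested_window(observed)
print(start_date, end_date)  # 2022-01-15 2022-05-20
```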

**What does `surv_inst_flg` mean in the outputs?**   

`surv_inst_flg` is a unique identifying number to distinguish different times when the same individual took the same survey. This column is useful for joining outputs together.  
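
As a rough illustration of such a join, the sketch below matches rows from two outputs on the participant, the survey, and `surv_inst_flg`. It assumes that both outputs carry the `beiwe_id`, `survey id`, and `surv_inst_flg` columns, which this README does not confirm for every file; `join_on_instance` and the sample rows are made up for the example.

```python
KEYS = ("beiwe_id", "survey id", "surv_inst_flg")


def join_on_instance(left_rows, right_rows, keys=KEYS):
    """Inner-join two lists of row dicts on the given key columns."""
    index = {}
    for row in right_rows:
        index.setdefault(tuple(row[k] for k in keys), []).append(row)
    joined = []
    for row in left_rows:
        for match in index.get(tuple(row[k] for k in keys), []):
            # Values from the left row win if both rows share a column name.
            joined.append({**match, **row})
    return joined


# Made-up rows standing in for two Sycamore outputs.
submits = [
    {"beiwe_id": "abc123", "survey id": "s1", "surv_inst_flg": 1, "time_to_complete": 42.0},
    {"beiwe_id": "abc123", "survey id": "s1", "surv_inst_flg": 2, "time_to_complete": 30.0},
]
answers = [
    {"beiwe_id": "abc123", "survey id": "s1", "surv_inst_flg": 2, "last_answer": "Yes"},
]

joined = join_on_instance(answers, submits)
```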


## List of summary statistics

The following variables are created in the `submits_summary.csv` file. This file will only be generated if the config file and intervention timings file are provided. The `submits_summary_daily.csv` and `submits_summary_hourly.csv` files contain the same columns, but with additional granularity at the daily or hourly level rather than at the user level.


|     Variable                          	|     Type     	|     Description of Variable                                                                                 	|
|---------------------------------------	|--------------	|-------------------------------------------------------------------------------------------------------------	|
|     survey id                      	|       str       	|     ID of the survey that this row applies to. Note: If `submits_by_survey_id` is False, surveys will not be aggregated at the survey level (they will only be aggregated by user), so this column will not appear.                                                  	|
|     year                	|        int      	|     Year of the time period at which submits/deliveries are being aggregated. This is only included in `submits_summary_daily.csv` and `submits_summary_hourly.csv`                       	|
|     month                           	|       int       	|     Month of the time period at which submits/deliveries are being aggregated. This is only included in `submits_summary_daily.csv` and `submits_summary_hourly.csv`                        	|
|     day |        int      	|     Day over which submits/deliveries are being aggregated. This is only included in `submits_summary_daily.csv` and `submits_summary_hourly.csv`  |
|     hour                      	|        int      	|     Hour over which submits/deliveries are being aggregated. This is only included in `submits_summary_hourly.csv`                           	|
|     num_surveys                 	|        int      	|    Number of surveys scheduled for delivery to the individual during the period          	|
|     num_submitted_surveys              	|      int        	|     Number of surveys submitted during the period (i.e. the user hit submit on all surveys)   	|
|     num_opened_surveys           	|       int       	|     Number of surveys opened by the individual during the time period (i.e. the user answered at least one question)                                      	|
|     avg_time_to_submit          	|        float      	|     Average time between survey delivery and survey submission, in seconds, for complete surveys                                        	|
|     avg_time_to_open      	|        float      	|     Average time between survey delivery and survey opening, in seconds. This is averaged over survey responses where a survey_timings file was available because we do not have information about survey opening in responses where a survey_timings file is missing.                                   	|
|     avg_duration                   	|      float        	|     Average time between survey opening and survey submission, in seconds. This is averaged over survey responses where a survey_timings file was available because we do not have information about survey opening in responses where a survey_timings file is missing.                                            	|

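To make the "complete surveys" qualifier on `avg_time_to_submit` concrete, the sketch below averages only the deliveries that have a submission time, skipping blank values. It illustrates the definition in the table and is not Sycamore's implementation; `average_time_to_submit` is a hypothetical helper.

```python
def average_time_to_submit(times_to_submit):
    """Mean of the non-blank time_to_submit values, or None if there are none."""
    complete = [float(t) for t in times_to_submit if t not in ("", None)]
    if not complete:
        return None
    return sum(complete) / len(complete)


# Two complete surveys (120 s and 60 s) and two incomplete ones (blank).
print(average_time_to_submit([120.0, "", 60.0, None]))  # 90.0
```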
<br>
The following variables are created in the “submits_and_deliveries.csv” file. This file will only be generated if the config file and intervention timings file are provided.

|     Variable                          	|     Type     	|     Description of Variable                                                                                 	|
|---------------------------------------	|--------------	|-------------------------------------------------------------------------------------------------------------	|
|     survey id          	|       str      	|     ID of the survey                           	|
|     delivery_time            	|       str       	|    A scheduled delivery time. If surveys are weekly, delivery times will be generated for each week between start_date and end_date                	|
|     submit_flg    	|        str      	|     Either the time when the user hit submit or the time when the individual stopped interacting with the survey for that session      |
|     time_to_submit                   	|      float        	|     Time between survey delivery and survey submission, in seconds. If a survey was incomplete, this will be blank.            	|
|     time_to_open                   	|      float        	|     Time between survey delivery time and the first recorded survey answer, in seconds (for responses where a survey_timings file was available; if only a survey_answers file was available, this will be 0)     	|
|     survey_duration               	|        float      	|     Time between the first recorded survey answer and the survey submission, in seconds (for responses where a survey_timings file was available; if only a survey_answers file was available, this will be NA)|                                   	

<br>
The following variables are created in the “answers_data.csv” file. This file will be generated if a survey config file is available.

|     Variable                          	|     Type     	|     Description of Variable                                                                                 	|
|---------------------------------------	|--------------	|-------------------------------------------------------------------------------------------------------------	|
|     survey id |      str        	|     ID of the survey  |
|     beiwe_id |      str        	|     The participant’s Beiwe ID   	|
|     question id                  	|      str        	|     The ID of the question for this line    	|
|     question text |      str        	|     The question text corresponding to the answer    	|
|     question type  |      str        	|    The type of question (radio button, free response, etc.) corresponding to the answer    	|
|     question answer options |      str        	|    The answer options presented to the user (applicable for check box or radio button surveys)       	|
|     timestamp |      str        	|     The Unix timestamp corresponding to the latest time the user was on the question  |
|     Local time |      str        	|     The local time corresponding to the latest time the user was on the question   	|
|     last_answer    	|      str        	|     The last answer the user had selected before moving on to the next question or submitting    	|
|     all_answers |      str        	|     A list of all answers the user selected    	|
|     num_answers |      int        	|    The number of different answers selected by the user (the length of the list in all_answers)   	|
|     first_time |      str        	|    The local time corresponding to the earliest time the user was on the question       	|
|     last_time |      str        	|     The local time corresponding to the latest time the user was on the question   	|
|     time_to_answer       	|      float       	|     The time that the user spent on the question    	|

<br>
The following variables are created in the “answers_summary.csv” file. This file will only be generated if the config file and intervention timings file are provided.

|     Variable                          	|     Type     	|     Description of Variable                                                                                 	|
|---------------------------------------	|--------------	|-------------------------------------------------------------------------------------------------------------	|
|     survey id |      str        	|     ID of the survey  |
|     beiwe_id |      str        	|     The participant’s Beiwe ID   	|
|     question id                  	|      str        	|     The ID of the question for this line    	|
|     num_answers |      int        	|    The number of times the question is answered in the given data   	|
|     average_time_to_answer  |      float        	|    The average number of seconds the user takes to answer the question       	|
|     average_number_of_answers |      float       	|     Average number of answers selected for a question. This indicates whether a user changed an answer before submitting it.   	|
|     most_common_answer       	|      str       	|     A user’s most common answer to a question   	|

<br>
The following variables are created in the “submits_only.csv” file. This file will always be generated.

|     Variable                          	|     Type     	|     Description of Variable                                                                                 	|
|---------------------------------------	|--------------	|-------------------------------------------------------------------------------------------------------------	|
|     survey id |      str        	|     ID of the survey  |
|     beiwe_id |      str        	|     The participant’s Beiwe ID   	|
|     surv_inst_flg            	|      int        	|     A “submission flag” which distinguishes submissions that are done by the same individual on the same survey    	|
|     max_time |      str        	|    Either the time when the user hit submit or the time when the individual stopped interacting with the survey for that session   	|
|     min_time  |      str        	|    The earliest time the individual was interacting with the survey that session       	|
|     time_to_complete |      float       	|     Time between min_time and max_time, in seconds (for responses where a survey_timings file was available)   	|

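The `time_to_complete` column is described as the number of seconds between `min_time` and `max_time`. A minimal sketch of that calculation follows; the timestamp format is an assumption, since this README does not specify how the times are written, and the function is illustrative rather than Sycamore's own code.

```python
from datetime import datetime


def time_to_complete(min_time, max_time, fmt="%Y-%m-%d %H:%M:%S"):
    """Seconds between min_time and max_time, given as formatted strings."""
    start = datetime.strptime(min_time, fmt)
    end = datetime.strptime(max_time, fmt)
    return (end - start).total_seconds()


print(time_to_complete("2022-01-01 09:00:00", "2022-01-01 09:02:30"))  # 150.0
```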
<br>
The following variables are created in a csv file for each survey.

|     Variable                          	|     Type     	|     Description of Variable                                                                                 	|
|---------------------------------------	|--------------	|-------------------------------------------------------------------------------------------------------------	|
|     start_time |      str        	|     Time this survey submission was started  |
|     end_time |      str        	|     Time this survey submission was ended   	|
|     survey_duration     	|      float   	|     Difference between start and end time, in seconds (for surveys where a survey_timings file was available)    	|
|     question_1, question_2, … |      str        	|    Responses to each question in the survey   	|
<br>