Volunteers could withdraw from participation and demand the deletion of their data as long as their reidentification was possible. Depending on the study, we provided different rewards for participation.

We excluded data from volunteers with fewer than 15 d of logging data, no app usage, or missing questionnaire data. Study procedures differed somewhat across the three studies.
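
The exclusion rule above can be sketched as a simple filter. This is a minimal illustration only: the record layout and all field names are hypothetical, and only the 15-d threshold is taken from the text.

```python
from dataclasses import dataclass

MIN_LOGGING_DAYS = 15  # threshold stated in the text


@dataclass
class Volunteer:
    # Hypothetical record layout; field names are illustrative assumptions.
    volunteer_id: str
    logging_days: int
    app_usage_events: int
    questionnaire_complete: bool


def is_eligible(v: Volunteer) -> bool:
    """Return True if the volunteer passes all three exclusion criteria."""
    return (
        v.logging_days >= MIN_LOGGING_DAYS
        and v.app_usage_events > 0
        and v.questionnaire_complete
    )


def apply_exclusions(volunteers):
    """Keep only volunteers that meet every inclusion requirement."""
    return [v for v in volunteers if is_eligible(v)]
```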

However, in all three studies, Big Five personality trait levels were measured with the German version of the Big Five Structure Inventory (BFSI) and naturalistic usage in the field was automatically recorded over a period of 30 d.

Data were regularly transferred to our encrypted server using Secure Sockets Layer (SSL) encryption whenever phones were connected to WiFi.

In study 2, volunteers had to answer experience sampling questionnaires during the data collection period on their smartphones. Volunteers in studies 2 and 3 completed the demographic and BFSI personality questionnaires via smartphone at a later time.

In cases where volunteers turned off location services, they were reminded to reactivate them. At the end of mobile data collection, volunteers were instructed to contact the research staff to receive compensation (studies 1 to 3) and to schedule a final laboratory session (study 2).

Big Five personality dimensions were assessed with the German version of the BFSI. The test consists of 300 items and measures the Big Five personality dimensions (openness to experience, conscientiousness, extraversion, agreeableness, and emotional stability) on five domains and 30 facets.

Participants indicated their agreement with items on a four-point Likert scale ranging from untypical for me to typical for me. Additionally, we collected age, gender, highest completed education, and a number of other questionnaires that were used in other research projects.

Questionnaires were administered either via desktop computer (studies 1 and 2) or via smartphone (studies 2 and 3). We used the laboratory version scores from study 2 in this study. Initially, activities were recorded in the form of time-stamped logs. Additionally, the character length of text messages and technical characteristics were collected.

Irreversibly hash-encoded versions of contacts and phone numbers were collected to enable us to measure the number of distinct contacts while preventing the possibility of reidentification.
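
The idea of counting distinct contacts from hash-encoded identifiers can be illustrated as follows. The concrete scheme (salted SHA-256, whitespace and case normalization) is an assumption made for this sketch and is not specified in the text; the function names are hypothetical.

```python
import hashlib
import secrets


def make_salt() -> bytes:
    """Generate a random salt (assumption: one secret salt per device)."""
    return secrets.token_bytes(16)


def pseudonymize(identifier: str, salt: bytes) -> str:
    """Map a contact identifier to a one-way hash digest."""
    # Normalize so that "+49 170 1234567" and "+491701234567" collapse
    # to the same digest (illustrative normalization only).
    normalized = "".join(identifier.split()).lower()
    return hashlib.sha256(salt + normalized.encode("utf-8")).hexdigest()


def count_distinct_contacts(identifiers, salt: bytes) -> int:
    """Count unique contacts using only their hashed representations."""
    return len({pseudonymize(i, salt) for i in identifiers})
```

Because phone numbers have little entropy, an unsalted hash could be reversed by enumeration; in a scheme like this one, irreversibility depends on the salt staying secret or being discarded.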

Information such as names, phone numbers, and the content of messages and calls was not recorded.

The dataset consisted of 1,821 behavioral predictors and 35 personality criteria (five domains and 30 facets). Gender, age, and education were used solely for descriptive statistics and were not included as predictors in the models. In a first step, we extracted 15,692 variables from the raw dataset. The extracted variables roughly correspond to the aforementioned behavioral classes of app usage, music consumption, communication and social behavior, mobility, overall phone activity, and day- and nighttime dependency.
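
To make the step from time-stamped logs to behavioral variables concrete, the sketch below aggregates events into mean daily counts per behavioral class, split by day and night. It is not the extraction pipeline used in the study: the night window (22:00 to 06:00), the event layout, and the function names are all assumptions.

```python
from collections import Counter
from datetime import datetime

NIGHT_START_HOUR = 22  # assumption: night begins at 22:00
NIGHT_END_HOUR = 6     # assumption: night ends at 06:00


def is_night(ts: datetime) -> bool:
    """Classify a timestamp as nighttime under the assumed window."""
    return ts.hour >= NIGHT_START_HOUR or ts.hour < NIGHT_END_HOUR


def daily_event_rates(events):
    """Compute mean events per logged day for each (class, period) pair.

    events: iterable of (timestamp, behavior_class) tuples.
    Returns a dict keyed by (behavior_class, "day" or "night").
    """
    counts = Counter()
    days = set()
    for ts, behavior_class in events:
        days.add(ts.date())
        period = "night" if is_night(ts) else "day"
        counts[(behavior_class, period)] += 1
    n_days = len(days)
    if n_days == 0:
        return {}
    return {key: n / n_days for key, n in counts.items()}
```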

Variables with regard to day and night dependency were not computed for music consumption behaviors. These variables contained information about specific data types. Because the volume of data made manual outlier checks infeasible, we used robust estimators.
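
As an illustration of why robust estimators help when outliers cannot be inspected manually, the sketch below contrasts the mean with the median and the scaled median absolute deviation (MAD). The text does not state which robust estimators were used, so this particular choice is an assumption.

```python
import statistics


def mad(values, scale=1.4826):
    """Scaled median absolute deviation.

    The factor 1.4826 makes the MAD comparable to the standard
    deviation for normally distributed data.
    """
    med = statistics.median(values)
    deviations = [abs(v - med) for v in values]
    return scale * statistics.median(deviations)


def robust_summary(values):
    """Return location and spread estimates that resist outliers."""
    return {"median": statistics.median(values), "mad": mad(values)}


# A single extreme value shifts the mean but barely moves the median.
durations = [1, 2, 3, 4, 100]
summary = robust_summary(durations)
naive_mean = statistics.fmean(durations)
```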

We compared the predictive performance of elastic net regularized linear regression models with those of nonlinear tree-based random forest models and a baseline model.

The baseline model predicted the mean of the respective training set for all cases in a test set.
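
A baseline of this kind can be sketched in a few lines. The class name and interface are hypothetical and chosen only for illustration.

```python
from statistics import fmean


class MeanBaseline:
    """Predict the training-set mean for every test case."""

    def fit(self, y_train):
        # Store the mean of the criterion in the training set.
        self.mean_ = fmean(y_train)
        return self

    def predict(self, n_cases):
        # The prediction ignores all predictors by construction.
        return [self.mean_] * n_cases
```

Because its predictions are constant within a test set, such a model marks the performance that any predictor-based model must exceed to be considered informative.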

Furthermore, the use of random forest models allowed us to include nonlinear predictor effects and high-dimensional interactions in the models. We evaluated the predictive performance of the models based on the Pearson correlation (r) and the coefficient of determination (R²). We compared the predicted values from our models with latent person-parameter trait estimates from the self-reported values of the personality trait measures.
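
The two performance measures can be computed on a test set as sketched below. One detail is an assumption: the total sum of squares here uses the mean of the test-set criterion values, and the text does not state which reference mean was used.

```python
import math
from statistics import fmean


def pearson_r(y_true, y_pred):
    """Pearson correlation between observed and predicted values."""
    mt, mp = fmean(y_true), fmean(y_pred)
    cov = sum((t - mt) * (p - mp) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mt) ** 2 for t in y_true)
    var_p = sum((p - mp) ** 2 for p in y_pred)
    if var_t == 0 or var_p == 0:
        # Undefined for constant input, e.g. mean-only predictions.
        return float("nan")
    return cov / math.sqrt(var_t * var_p)


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mt = fmean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mt) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot
```

Unlike r, this out-of-sample R² can be negative, which indicates predictions that are worse than predicting the mean.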

Because the personality scores in our analyses already represent latent trait estimates, correlation measures were not adjusted for the reliability of the personality trait scales. Thus, the absolute size of the correlations is limited by the reliability of the personality trait measures. Disattenuated correlation coefficients are provided in SI Appendix, Table S5.

We computed performance measures within each fold of the cross-validation procedure and averaged across all outer resampling folds for a single prediction model. To determine whether a model was predictive at all, we carried out t tests by comparing the R² measures of the random forest model with those of the baseline model. The t tests were based on 10-times repeated 10-fold cross-validation and used a variance correction to specifically account for the dependency structure of cross-validation experiments.
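
One common variance correction for repeated cross-validation is the corrected resampled t statistic, which inflates the variance term by the ratio of test to training set size. The sketch below assumes this form of the correction; the text does not give the exact formula, and the function name is hypothetical.

```python
import math
from statistics import fmean, variance


def corrected_resampled_t(diffs, test_train_ratio):
    """Corrected resampled t statistic for repeated k-fold CV.

    diffs: per-fold performance differences (model minus baseline),
           one value per fold and repetition (100 for 10x10 CV).
    test_train_ratio: n_test / n_train (1/9 for 10-fold CV).
    Returns (t, degrees_of_freedom).
    """
    n = len(diffs)
    if n < 2:
        raise ValueError("need at least two fold-wise differences")
    var_d = variance(diffs)  # sample variance with n - 1 denominator
    if var_d == 0:
        raise ValueError("differences have zero variance")
    # The naive test divides var_d by n only; the added ratio widens the
    # standard error because training sets overlap across folds.
    t = fmean(diffs) / math.sqrt((1 / n + test_train_ratio) * var_d)
    return t, n - 1
```

The resulting statistic is referred to a t distribution with the returned degrees of freedom. It is always smaller in magnitude than the uncorrected statistic, which makes the test more conservative.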

Specifically, we used permutation strategies to determine the unique contributions of each respective behavioral class and the importance of a class within the context of all other classes.
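
The general idea behind permutation-based importance for a group of variables can be sketched as follows: all columns belonging to one behavioral class are shuffled jointly across cases, and the resulting drop in performance is recorded. This covers only one simple variant and is not the exact procedure of the study; `predict`, `score`, and the column indices are placeholders supplied by the caller.

```python
import random


def permute_group(rows, group_cols, rng):
    """Shuffle the given columns jointly across rows.

    The same row permutation is applied to every column in the group,
    so associations within the group are preserved while the link to
    the criterion and to the other classes is broken.
    """
    order = list(range(len(rows)))
    rng.shuffle(order)
    permuted = [list(r) for r in rows]
    for i, src in enumerate(order):
        for c in group_cols:
            permuted[i][c] = rows[src][c]
    return permuted


def group_importance(predict, score, rows, y, group_cols,
                     n_repeats=20, seed=0):
    """Mean drop in score after permuting one group of columns."""
    rng = random.Random(seed)
    base = score(y, predict(rows))
    drops = []
    for _ in range(n_repeats):
        permuted = permute_group(rows, group_cols, rng)
        drops.append(base - score(y, predict(permuted)))
    return sum(drops) / n_repeats
```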

These effects were also tested for significance and adjusted for multiple comparisons. This procedure allowed us to determine the effects of each behavioral class on the average prediction performance across all personality trait dimensions.

P values in the linear mixed models were adjusted for multiple testing with the Holm method. All procedures were performed on domain and facet levels, separately.
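
For reference, the Holm step-down adjustment can be written compactly as below. This is a generic sketch of the method rather than the code used in the analyses.

```python
def holm_adjust(p_values):
    """Return Holm-adjusted P values in the original order."""
    m = len(p_values)
    # Indices of the P values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The smallest P value is multiplied by m, the next by m - 1, ...
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Enforce monotonicity so adjusted values never decrease.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted
```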

Due to the high computational load of the machine-learning procedures, we parallelized the computations on the Linux Cluster of the LRZ-Supercomputing Center, in Garching, near Munich, Germany. For computations on the cluster, R-version 3. We used R 3.



