|
1 | 1 | #' Model Specification Information |
2 | | -#' |
3 | | -#' |
| 2 | +#' |
| 3 | +#' |
4 | 4 | #' An object with class "model_spec" is a container for |
5 | 5 | #' information about a model that will be fit. |
6 | | -#' |
| 6 | +#' |
7 | 7 | #' The main elements of the object are: |
8 | | -#' |
9 | | -#' * `args`: A vector of the main arguments for the model. The |
| 8 | +#' |
| 9 | +#' * `args`: A vector of the main arguments for the model. The |
10 | 10 | #' names of these arguments may be different from their |
11 | 11 | #' counterparts in the underlying model function. For example, for a |
12 | 12 | #' `glmnet` model, the argument name for the amount of the penalty |
13 | | -#' is called "penalty" instead of "lambda" to make it more |
14 | | -#' general and usable across different types of models (and to not |
15 | | -#' be specific to a particular model function). The elements of |
16 | | -#' `args` can be quoted expressions or `varying()`. If left to |
17 | | -#' their defaults (`NULL`), the arguments will use the underlying |
18 | | -#' model functions default value. |
19 | | -#' |
20 | | -#' * `other`: An optional vector of model-function-specific |
21 | | -#' parameters. As with `args`, these can also be quoted or |
| 13 | +#' is called "penalty" instead of "lambda" to make it more general |
| 14 | +#' and usable across different types of models (and to not be |
| 15 | +#' specific to a particular model function). The elements of `args` |
| 16 | +#' can be `varying()`. If left to their defaults (`NULL`), the |
| 17 | +#' arguments will use the underlying model function's default value. |
| 18 | +#' As discussed below, the arguments in `args` are captured as |
| 19 | +#' quosures and are not immediately executed. |
| 20 | +#' |
| 21 | +#' * `...`: Optional model-function-specific |
| 22 | +#' parameters. As with `args`, these will be quosures and can be |
22 | 23 | #' `varying()`. |
23 | | -#' |
| 24 | +#' |
24 | 25 | #' * `mode`: The type of model, such as "regression" or |
25 | 26 | #' "classification". Other modes will be added once the package |
26 | 27 | #' adds more functionality. |
27 | | - |
28 | | -#' |
| 28 | +#' |
29 | 29 | #' * `method`: This is a slot that is filled in later by the |
30 | 30 | #' model's constructor function. It generally contains lists of |
31 | 31 | #' information that are used to create the fit and prediction code |
32 | 32 | #' as well as required packages and similar data. |
33 | | -#' |
| 33 | +#' |
34 | 34 | #' * `engine`: This character string declares exactly what |
35 | 35 | #' software will be used. It can be a package name or a technology |
36 | 36 | #' type. |
37 | | -#' |
| 37 | +#' |
38 | 38 | #' This class and structure is the basis for how \pkg{parsnip} |
39 | 39 | #' stores model objects prior to seeing the data. |
40 | | -#' @rdname model_spec |
| 40 | +#' |
| 41 | +#' @section Argument Details: |
| 42 | +#' |
| 43 | +#' An important detail to understand when creating model |
| 44 | +#' specifications is that they are intended to be functionally |
| 45 | +#' independent of the data. While it is true that some tuning |
| 46 | +#' parameters are _data dependent_, the model specification does |
| 47 | +#' not interact with the data at all. |
| 48 | +#' |
| 49 | +#' Most R functions immediately evaluate their |
| 50 | +#' arguments. For example, when calling `mean(dat_vec)`, the object |
| 51 | +#' `dat_vec` is immediately evaluated inside of the function. |
| 52 | +#' |
| 53 | +#' `parsnip` model functions do not do this. For example, using |
| 54 | +#' |
| 55 | +#'\preformatted{ |
| 56 | +#' rand_forest(mtry = ncol(iris) - 1) |
| 57 | +#' } |
| 58 | +#' |
| 59 | +#' **does not** execute `ncol(iris) - 1` when creating the specification. |
| 60 | +#' This can be seen in the output: |
| 61 | +#' |
| 62 | +#'\preformatted{ |
| 63 | +#' > rand_forest(mtry = ncol(iris) - 1) |
| 64 | +#' Random Forest Model Specification (unknown) |
| 65 | +#' |
| 66 | +#' Main Arguments: |
| 67 | +#' mtry = ncol(iris) - 1 |
| 68 | +#'} |
| 69 | +#' |
| 70 | +#' The model functions save the argument _expressions_ and their |
| 71 | +#' associated environments (a.k.a. a quosure) to be evaluated later |
| 72 | +#' when either [fit()] or [fit_xy()] is called with the actual |
| 73 | +#' data. |
| 74 | +#' |
| 75 | +#' The consequence of this strategy is that any data required to |
| 76 | +#' get the parameter values must be available when the model is |
| 77 | +#' fit. This can fail in two main ways: |
| 78 | +#' |
| 79 | +#' \enumerate{ |
| 80 | +#' \item The data have been modified between the creation of the |
| 81 | +#' model specification and when the model fit function is invoked. |
| 82 | +#' |
| 83 | +#' \item The model specification is saved and loaded into a new |
| 84 | +#' session where those same data objects do not exist. |
| 85 | +#' } |
| 86 | +#' |
| 87 | +#' The best way to avoid these issues is to not reference any data |
| 88 | +#' objects in the global environment but to use data descriptors |
| 89 | +#' such as `.cols()`. Another way of writing the previous |
| 90 | +#' specification is |
| 91 | +#' |
| 92 | +#'\preformatted{ |
| 93 | +#' rand_forest(mtry = .cols() - 1) |
| 94 | +#' } |
| 95 | +#' |
| 96 | +#' This is not dependent on any specific data object and |
| 97 | +#' is evaluated immediately before the model fitting process begins. |
| 98 | +#' |
| 99 | +#' One less advantageous approach to solving this issue is to use |
| 100 | +#' quasiquotation. This would insert the actual R object into the |
| 101 | +#' model specification and might be the best idea when the data |
| 102 | +#' object is small. For example, using |
| 103 | +#' |
| 104 | +#'\preformatted{ |
| 105 | +#' rand_forest(mtry = ncol(!!iris) - 1) |
| 106 | +#' } |
| 107 | +#' |
| 108 | +#' would work (and be reproducible between sessions) but embeds |
| 109 | +#' the entire iris data set into the `mtry` expression: |
| 110 | +#' |
| 111 | +#'\preformatted{ |
| 112 | +#' > rand_forest(mtry = ncol(!!iris) - 1) |
| 113 | +#' Random Forest Model Specification (unknown) |
| 114 | +#' |
| 115 | +#' Main Arguments: |
| 116 | +#' mtry = ncol(structure(list(Sepal.Length = c(5.1, 4.9, 4.7, 4.6, 5, <snip> |
| 117 | +#'} |
| 118 | +#' |
| 119 | +#' However, if there were an object with the number of columns in |
| 120 | +#' it, this wouldn't be too bad: |
| 121 | +#' |
| 122 | +#'\preformatted{ |
| 123 | +#' > mtry_val <- ncol(iris) - 1 |
| 124 | +#' > mtry_val |
| 125 | +#' [1] 4 |
| 126 | +#' > rand_forest(mtry = !!mtry_val) |
| 127 | +#' Random Forest Model Specification (unknown) |
| 128 | +#' |
| 129 | +#' Main Arguments: |
| 130 | +#' mtry = 4 |
| 131 | +#'} |
| 132 | +#' |
| 133 | +#' More information on quosures and quasiquotation can be found at |
| 134 | +#' \url{https://tidyeval.tidyverse.org}. |
| 135 | +#' |
| 136 | +#' @rdname model_spec |
41 | 137 | #' @name model_spec |
42 | 138 | NULL |
43 | 139 |
|
44 | 140 | #' Model Fit Object Information |
45 | | -#' |
46 | | -#' |
| 141 | +#' |
| 142 | +#' |
47 | 143 | #' An object with class "model_fit" is a container for |
48 | 144 | #' information about a model that has been fit to the data. |
49 | | -#' |
| 145 | +#' |
50 | 146 | #' The main elements of the object are: |
51 | | -#' |
52 | | -#' * `lvl`: A vector of factor levels when the outcome is |
| 147 | +#' |
| 148 | +#' * `lvl`: A vector of factor levels when the outcome is |
53 | 149 | #' a factor. This is `NULL` when the outcome is not a factor |
54 | | -#' vector. |
55 | | -#' |
| 150 | +#' vector. |
| 151 | +#' |
56 | 152 | #' * `spec`: A `model_spec` object. |
57 | | -#' |
| 153 | +#' |
58 | 154 | #' * `fit`: The object produced by the fitting function. |
59 | | -#' |
| 155 | +#' |
60 | 156 | #' * `preproc`: This contains any data-specific information |
61 | 157 | #' required to process a new sample point for prediction. For |
62 | 158 | #' example, if the underlying model function requires arguments `x` |
63 | 159 | #' and `y` and the user passed a formula to `fit`, the `preproc` |
64 | 160 | #' object would contain items such as the terms object and so on. |
65 | 161 | #' When no information is required, this is `NA`. |
66 | | -#' |
67 | | -#' |
| 162 | +#' |
| 163 | +#' As discussed in the documentation for [`model_spec`], the |
| 164 | +#' original arguments to the specification are saved as quosures. |
| 165 | +#' These are evaluated for the `model_fit` object prior to fitting. |
| 166 | +#' If the resulting model object prints its call, any user-defined |
| 167 | +#' options are shown in the call preceded by a tilde (see the |
| 168 | +#' example below). This is a result of the use of quosures in the |
| 169 | +#' specification. |
| 170 | +#' |
68 | 171 | #' This class and structure is the basis for how \pkg{parsnip} |
69 | 172 | #' stores model objects after seeing the data and applying a model. |
70 | | -#' @rdname model_fit |
| 173 | +#' @rdname model_fit |
71 | 174 | #' @name model_fit |
| 175 | +#' @examples |
| 176 | +#' |
| 177 | +#' # Keep the `x` matrix if the data are not too big. |
| 178 | +#' spec_obj <- linear_reg(x = ifelse(.obs() < 500, TRUE, FALSE)) |
| 179 | +#' spec_obj |
| 180 | +#' |
| 181 | +#' fit_obj <- fit(spec_obj, mpg ~ ., data = mtcars, engine = "lm") |
| 182 | +#' fit_obj |
| 183 | +#' |
| 184 | +#' nrow(fit_obj$fit$x) |
72 | 185 | NULL |
73 | 186 |
|