Commit ff335de

switched .preds and .cols; documentation updates too
1 parent dd22490 commit ff335de

56 files changed, +2902 -2030 lines changed

R/descriptors.R

Lines changed: 14 additions & 14 deletions
@@ -10,10 +10,10 @@
 #' Existing functions:
 #' \itemize{
 #' \item `.obs()`: The current number of rows in the data set.
-#' \item `.cols()`: The number of columns in the data set that are
+#' \item `.preds()`: The number of columns in the data set that are
 #' associated with the predictors prior to dummy variable creation.
-#' \item `.preds()`: The number of predictors after dummy variables
-#' are created (if any).
+#' \item `.cols()`: The number of predictor columns available after dummy
+#' variables are created (if any).
 #' \item `.facts()`: The number of factor predictors in the data set.
 #' \item `.lvls()`: If the outcome is a factor, this is a table
 #' with the counts for each level (and `NA` otherwise).
@@ -29,8 +29,8 @@
 #' For example, if you use the model formula `Sepal.Width ~ .` with the `iris`
 #' data, the values would be
 #' \preformatted{
-#' .cols() = 4 (the 4 columns in `iris`)
-#' .preds() = 5 (3 numeric columns + 2 from Species dummy variables)
+#' .preds() = 4 (the 4 columns in `iris`)
+#' .cols() = 5 (3 numeric columns + 2 from Species dummy variables)
 #' .obs() = 150
 #' .lvls() = NA (no factor outcome)
 #' .facts() = 1 (the Species predictor)
@@ -41,8 +41,8 @@
 #'
 #' If the formula `Species ~ .` were used:
 #' \preformatted{
-#' .cols() = 4 (the 4 numeric columns in `iris`)
-#' .preds() = 4 (same)
+#' .preds() = 4 (the 4 numeric columns in `iris`)
+#' .cols() = 4 (same)
 #' .obs() = 150
 #' .lvls() = c(setosa = 50, versicolor = 50, virginica = 50)
 #' .facts() = 0
@@ -121,11 +121,11 @@ get_descr_df <- function(formula, data) {
 }
 } else .lvls <- function() { NA }

-.cols <- function() {
+.preds <- function() {
 ncol(tmp_dat$x)
 }

-.preds <- function() {
+.cols <- function() {
 ncol(convert_form_to_xy_fit(formula, data, indicators = TRUE)$x)
 }

@@ -233,8 +233,8 @@ get_descr_spark <- function(formula, data) {

 obs <- dplyr::tally(data) %>% dplyr::pull()

-.cols <- function() length(f_term_labels)
-.preds <- function() all_preds
+.cols <- function() all_preds
+.preds <- function() length(f_term_labels)
 .obs <- function() obs
 .lvls <- function() y_vals
 .facts <- function() factor_pred
@@ -419,8 +419,8 @@ descr_env <- rlang::new_environment(
 .obs = function() abort("Descriptor context not set"),
 .lvls = function() abort("Descriptor context not set"),
 .facts = function() abort("Descriptor context not set"),
-.x = function() abort("Descriptor context not set"),
-.y = function() abort("Descriptor context not set"),
-.dat = function() abort("Descriptor context not set")
+.x = function() abort("Descriptor context not set"),
+.y = function() abort("Descriptor context not set"),
+.dat = function() abort("Descriptor context not set")
 )
 )
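
To make the renamed descriptors concrete, here is a minimal base R sketch that reproduces the counts quoted in the documentation for `iris`. It is an illustration only and not parsnip's internal code: `count_preds()` and `count_cols()` are hypothetical helper names, and the "after dummy variables" count is assumed to be the number of `model.matrix()` columns minus the intercept.

```r
# Hypothetical stand-ins for the descriptors, written with base R only.
count_preds <- function(formula, data) {
  # Predictor columns before dummy variable creation (what `.preds()` describes)
  length(attr(terms(formula, data = data), "term.labels"))
}

count_cols <- function(formula, data) {
  # Predictor columns after dummy variable creation (what `.cols()` describes);
  # the intercept column of the model matrix is not counted.
  ncol(model.matrix(formula, data = data)) - 1
}

count_preds(Sepal.Width ~ ., iris)  # 4
count_cols(Sepal.Width ~ ., iris)   # 5 (3 numeric columns + 2 Species dummies)
count_preds(Species ~ ., iris)      # 4
count_cols(Species ~ ., iris)       # 4
```

After this commit, the smaller "before dummies" number is `.preds()` and the expanded number is `.cols()`.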

R/model_object_docs.R

Lines changed: 145 additions & 32 deletions
@@ -1,73 +1,186 @@
 #' Model Specification Information
-#'
-#'
+#'
+#'
 #' An object with class "model_spec" is a container for
 #' information about a model that will be fit.
-#'
+#'
 #' The main elements of the object are:
-#'
-#' * `args`: A vector of the main arguments for the model. The
+#'
+#' * `args`: A vector of the main arguments for the model. The
 #' names of these arguments may be different from their
 #' counterparts in the underlying model function. For example, for a
 #' `glmnet` model, the argument name for the amount of the penalty
-#' is called "penalty" instead of "lambda" to make it more
-#' general and usable across different types of models (and to not
-#' be specific to a particular model function). The elements of
-#' `args` can be quoted expressions or `varying()`. If left to
-#' their defaults (`NULL`), the arguments will use the underlying
-#' model functions default value.
-#'
-#' * `other`: An optional vector of model-function-specific
-#' parameters. As with `args`, these can also be quoted or
+#' is called "penalty" instead of "lambda" to make it more general
+#' and usable across different types of models (and to not be
+#' specific to a particular model function). The elements of `args`
+#' can be `varying()`. If left to their defaults (`NULL`), the
+#' arguments will use the underlying model function's default value.
+#' As discussed below, the arguments in `args` are captured as
+#' quosures and are not immediately executed.
+#'
+#' * `...`: Optional model-function-specific
+#' parameters. As with `args`, these will be quosures and can be
 #' `varying()`.
-#'
+#'
 #' * `mode`: The type of model, such as "regression" or
 #' "classification". Other modes will be added once the package
 #' adds more functionality.
-
-#'
+#'
 #' * `method`: This is a slot that is filled in later by the
 #' model's constructor function. It generally contains lists of
 #' information that are used to create the fit and prediction code
 #' as well as required packages and similar data.
-#'
+#'
 #' * `engine`: This character string declares exactly what
 #' software will be used. It can be a package name or a technology
 #' type.
-#'
+#'
 #' This class and structure is the basis for how \pkg{parsnip}
 #' stores model objects prior to seeing the data.
-#' @rdname model_spec
+#'
+#' @section Argument Details:
+#'
+#' An important detail to understand when creating model
+#' specifications is that they are intended to be functionally
+#' independent of the data. While it is true that some tuning
+#' parameters are _data dependent_, the model specification does
+#' not interact with the data at all.
+#'
+#' Most R functions immediately evaluate their
+#' arguments. For example, when calling `mean(dat_vec)`, the object
+#' `dat_vec` is immediately evaluated inside of the function.
+#'
+#' `parsnip` model functions do not do this. For example, using
+#'
+#'\preformatted{
+#' rand_forest(mtry = ncol(iris) - 1)
+#' }
+#'
+#' **does not** execute `ncol(iris) - 1` when creating the specification.
+#' This can be seen in the output:
+#'
+#'\preformatted{
+#' > rand_forest(mtry = ncol(iris) - 1)
+#' Random Forest Model Specification (unknown)
+#'
+#' Main Arguments:
+#' mtry = ncol(iris) - 1
+#'}
+#'
+#' The model functions save the argument _expressions_ and their
+#' associated environments (a.k.a. a quosure) to be evaluated later
+#' when either [fit()] or [fit_xy()] is called with the actual
+#' data.
+#'
+#' The consequence of this strategy is that any data required to
+#' get the parameter values must be available when the model is
+#' fit. The two main ways that this can fail are:
+#'
+#' \enumerate{
+#' \item The data have been modified between the creation of the
+#' model specification and when the model fit function is invoked.
+#'
+#' \item The model specification is saved and loaded into a new
+#' session where those same data objects do not exist.
+#' }
+#'
+#' The best way to avoid these issues is to not reference any data
+#' objects in the global environment but to use data descriptors
+#' such as `.cols()`. Another way of writing the previous
+#' specification is
+#'
+#'\preformatted{
+#' rand_forest(mtry = .cols() - 1)
+#' }
+#'
+#' This is not dependent on any specific data object and
+#' is evaluated immediately before the model fitting process begins.
+#'
+#' One less advantageous approach to solving this issue is to use
+#' quasiquotation. This would insert the actual R object into the
+#' model specification and might be the best idea when the data
+#' object is small. For example, using
+#'
+#'\preformatted{
+#' rand_forest(mtry = ncol(!!iris) - 1)
+#' }
+#'
+#' would work (and be reproducible between sessions) but embeds
+#' the entire iris data set into the `mtry` expression:
+#'
+#'\preformatted{
+#' > rand_forest(mtry = ncol(!!iris) - 1)
+#' Random Forest Model Specification (unknown)
+#'
+#' Main Arguments:
+#' mtry = ncol(structure(list(Sepal.Length = c(5.1, 4.9, 4.7, 4.6, 5, <snip>
+#'}
+#'
+#' However, if there were an object with the number of columns in
+#' it, this wouldn't be too bad:
+#'
+#'\preformatted{
+#' > mtry_val <- ncol(iris) - 1
+#' > mtry_val
+#' [1] 4
+#' > rand_forest(mtry = !!mtry_val)
+#' Random Forest Model Specification (unknown)
+#'
+#' Main Arguments:
+#' mtry = 4
+#'}
+#'
+#' More information on quosures and quasiquotation can be found at
+#' \url{https://tidyeval.tidyverse.org}.
+#'
+#' @rdname model_spec
 #' @name model_spec
 NULL

 #' Model Fit Object Information
-#'
-#'
+#'
+#'
 #' An object with class "model_fit" is a container for
 #' information about a model that has been fit to the data.
-#'
+#'
 #' The main elements of the object are:
-#'
-#' * `lvl`: A vector of factor levels when the outcome is
+#'
+#' * `lvl`: A vector of factor levels when the outcome is
 #' a factor. This is `NULL` when the outcome is not a factor
-#' vector.
-#'
+#' vector.
+#'
 #' * `spec`: A `model_spec` object.
-#'
+#'
 #' * `fit`: The object produced by the fitting function.
-#'
+#'
 #' * `preproc`: This contains any data-specific information
 #' required to process a new sample point for prediction. For
 #' example, if the underlying model function requires arguments `x`
 #' and `y` and the user passed a formula to `fit`, the `preproc`
 #' object would contain items such as the terms object and so on.
 #' When no information is required, this is `NA`.
-#'
-#'
+#'
+#' As discussed in the documentation for [`model_spec`], the
+#' original arguments to the specification are saved as quosures.
+#' These are evaluated for the `model_fit` object prior to fitting.
+#' If the resulting model object prints its call, any user-defined
+#' options are shown in the call preceded by a tilde (see the
+#' example below). This is a result of the use of quosures in the
+#' specification.
+#'
 #' This class and structure is the basis for how \pkg{parsnip}
 #' stores model objects after seeing the data and applying a model.
-#' @rdname model_fit
+#' @rdname model_fit
 #' @name model_fit
+#' @examples
+#'
+#' # Keep the `x` matrix if the data are not too big.
+#' spec_obj <- linear_reg(x = ifelse(.obs() < 500, TRUE, FALSE))
+#' spec_obj
+#'
+#' fit_obj <- fit(spec_obj, mpg ~ ., data = mtcars, engine = "lm")
+#' fit_obj
+#'
+#' nrow(fit_obj$fit$x)
 NULL
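
The deferred evaluation described in the new "Argument Details" section can be imitated in base R. The sketch below is an analogy under stated assumptions, not parsnip's implementation: parsnip captures rlang quosures, whereas here `substitute()` and `eval()` stand in for them, `make_spec()` and `fit_spec()` are hypothetical helpers, and `.cols()` is simplified to "all columns except one outcome column", which only holds when every predictor is numeric.

```r
# Capture the argument expression and its environment instead of evaluating it
# (a base R stand-in for a quosure).
make_spec <- function(mtry) {
  list(expr = substitute(mtry), env = parent.frame())
}

# Evaluate the stored expression only when the data are supplied.
fit_spec <- function(spec, data) {
  # Simplified descriptor: assumes one outcome column and numeric predictors.
  descriptors <- list(.cols = function() ncol(data) - 1)
  eval(spec$expr, descriptors, spec$env)
}

dat <- mtcars
spec_global <- make_spec(mtry = ncol(dat) - 1)  # refers to the object `dat`
spec_descr  <- make_spec(mtry = .cols() - 1)    # refers to a descriptor

dat <- dat[, 1:5]  # the data object changes after the specification is created

val_global <- fit_spec(spec_global, mtcars)  # 4: uses the modified `dat`
val_descr  <- fit_spec(spec_descr, mtcars)   # 9: uses the data given at fit time
```

The first specification silently picks up the modified `dat`, which is the first failure mode listed in the documentation; the descriptor-based specification depends only on the data passed at fit time.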
docs/articles/articles/Classification.html

Lines changed: 29 additions & 20 deletions
