From e245bd708ee528e4d68f9b5e2f8a1ea54f4e65e9 Mon Sep 17 00:00:00 2001 From: jverzani Date: Wed, 3 Sep 2025 16:02:38 -0400 Subject: [PATCH 1/3] --- for em-dash --- EDA/bivariate-julia.qmd | 6 +++--- EDA/categorical-data-julia.qmd | 2 +- EDA/makie.qmd | 2 +- EDA/tabular-data-julia.qmd | 12 +++++------ EDA/univariate-julia.qmd | 32 +++++++++++++++--------------- Inference/distributions.qmd | 8 ++++---- Inference/inference.qmd | 14 ++++++------- LinearModels/linear-regression.qmd | 4 ++-- _quarto.yml | 4 ++-- index.qmd | 18 ++++++++--------- 10 files changed, 50 insertions(+), 52 deletions(-) diff --git a/EDA/bivariate-julia.qmd b/EDA/bivariate-julia.qmd index 98926b8..606703b 100644 --- a/EDA/bivariate-julia.qmd +++ b/EDA/bivariate-julia.qmd @@ -88,7 +88,7 @@ Putting the categorical variable first, presents a graphic (@fig-grouped-dotplot -Regardless of how the graphic is produced, there appears to be a difference in the centers based on the species, as would be expected -- different species have different sizes. +Regardless of how the graphic is produced, there appears to be a difference in the centers based on the species, as would be expected---different species have different sizes. @@ -207,7 +207,7 @@ x_{1}, & x_{2}, \dots, x_{n}\\ y_{1}, & y_{2}, \dots, y_{n} \end{align*} -Or -- to emphasize how the data is paired off -- as $(x_1, y_1), (x_2, y_2), \dots, (x_n, y_n)$. +Or---to emphasize how the data is paired off---as $(x_1, y_1), (x_2, y_2), \dots, (x_n, y_n)$. ### Numeric summaries @@ -797,7 +797,7 @@ First, suppose we simply adjust the fitted lines up or down for each cluster. Th m2 = lm(@formula(PetalLength ~ PetalWidth + Species), iris) ``` -The second row in the output of `m2` has an identical interpretation as for `m1` -- it is the slope of the regression line. The first line of the output in `m1` is the $x$-intercept, which moves the line up or down. Whereas the first of `m2` is the $x$ intercept for a line that describes *just one* of the species, in this case `setosa`. (A coding for the regression model with a categorical variable chooses one reference level, in this case "setosa."). The 3rd and 4th lines are the slopes for the other two species. +The second row in the output of `m2` has an identical interpretation as for `m1`---it is the slope of the regression line. The first line of the output in `m1` is the $x$-intercept, which moves the line up or down. Whereas the first of `m2` is the $x$ intercept for a line that describes *just one* of the species, in this case `setosa`. (A coding for the regression model with a categorical variable chooses one reference level, in this case "setosa."). The 3rd and 4th lines are the slopes for the other two species. We can plot these individually, one-by-one, in a similar manner as before, however when we call `predict` we include a level for `:Species`. The result is the middle figure in @fig-iris-scatterplot-regression. diff --git a/EDA/categorical-data-julia.qmd b/EDA/categorical-data-julia.qmd index 60656cf..5edb375 100644 --- a/EDA/categorical-data-julia.qmd +++ b/EDA/categorical-data-julia.qmd @@ -294,7 +294,7 @@ plot(p1, p2, layout = (@layout [a b])) As seen in the left graphic of @fig-grouped-barchart, there are groups of bars for each level of the first variable (`:Sex`); the groups represent the variable passed to the `group` keyword argument. The values are looked up in the data frame with the computed column that was named `:value` through the `combine` function. 
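As a point of reference, the pieces just described might fit together as in the following sketch. This is not the chapter's exact code; it assumes the `survey` data frame with `:Sex` and `:Smoke` columns and that `groupedbar` comes from the loaded plotting package, as above:

```{julia}
#| eval: false
using DataFrames, StatsPlots
# Tally counts for each (Sex, Smoke) combination, then plot grouped bars.
counts = combine(groupby(survey, [:Sex, :Smoke]), nrow => :value)
groupedbar(counts.Smoke, counts.value; group = counts.Sex,
           xlabel = "Smoke", ylabel = "count")
```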
-The same graphic on the left -- without the labeling -- is also made more directly with `groupedbar(freqtable(survey, :Sex, :Smoke))` +The same graphic on the left---without the labeling---is also made more directly with `groupedbar(freqtable(survey, :Sex, :Smoke))` #### Andrews plot diff --git a/EDA/makie.qmd b/EDA/makie.qmd index 2a33e6c..a0231cd 100644 --- a/EDA/makie.qmd +++ b/EDA/makie.qmd @@ -72,7 +72,7 @@ Both the `mapping` and `visual` calls can be used to set attributes: The attributes are those for the underlying plotting function. For `visual(BoxPlot)`, these can be seen at the help page for `boxplot`, displayed with the command `?boxplot`. -The `mapping` calls shows two uses of the mini language for data manipulation. The basic form is `source => function => target` and works very much like the DataFrames mini language does for `select` or `transform`, but unlike those, the function is *always* applied by row. This makes some transformations, such as $z$-scores not possible within this call -- transformations requiring the entire column need to be done within the values passed to `data`. The abbreviated forms are just `source`, as used with the `color=:species` argument; `source => function`; and `source => target`, such as `:bill_length_mm => "bill length (mm)"` used to rename the variable for labeling purposes. When the source involves more than one column selector, tuples should be used to group them. +The `mapping` calls shows two uses of the mini language for data manipulation. The basic form is `source => function => target` and works very much like the DataFrames mini language does for `select` or `transform`, but unlike those, the function is *always* applied by row. This makes some transformations, such as $z$-scores not possible within this call---transformations requiring the entire column need to be done within the values passed to `data`. The abbreviated forms are just `source`, as used with the `color=:species` argument; `source => function`; and `source => target`, such as `:bill_length_mm => "bill length (mm)"` used to rename the variable for labeling purposes. When the source involves more than one column selector, tuples should be used to group them. A few functions are provided to bypass the usual mapping of the data. (For example, `color` maps levels of a factor to a color ramp behind the scenes.) Among these are `nonnumeric` to pass a numeric variable to a value expecting a categorical variable and `verbatim` to avoid this mapping. The latter, `=> verbatim`, will be necessary to add when annotating a figure. diff --git a/EDA/tabular-data-julia.qmd b/EDA/tabular-data-julia.qmd index 8eed7b1..fe39f11 100644 --- a/EDA/tabular-data-julia.qmd +++ b/EDA/tabular-data-julia.qmd @@ -33,7 +33,7 @@ There are different ways to construct a data frame. Consider the task of the Wirecutter in trying to select the best [carry on travel bag](https://www.nytimes.com/wirecutter/reviews/best-carry-on-travel-bags/#how-we-picked-and-tested). After compiling a list of possible models by scouring travel blogs etc., they select some criteria (capacity, compartment design, aesthetics, comfort, ...) and compile data, similar to what one person collected in a [spreadsheet](https://docs.google.com/spreadsheets/d/1fSt_sO1s7moXPHbxBCD3JIKPa8QIZxtKWYUjD6ElZ-c/edit#gid=744941088). -Here we create a much simplified spreadsheet for 3 listed bags with measurements of volume, price, laptop compatibility, loading style, and a last-checked date -- as this market improves constantly. 
+Here we create a much simplified spreadsheet for 3 listed bags with measurements of volume, price, laptop compatibility, loading style, and a last-checked date---as this market improves constantly. ``` product v p l loads checked @@ -140,7 +140,7 @@ push!(d, Dict(:b => "Genius", :v => 25, :p => 228, :lap => "Y", :load => "clamshell", :d => Date("2022-10-01"))) ``` -(A dictionary is a `key => value` container like a named tuple, but keys may be arbitrary `Julia` objects -- not always symbols -- so we explicitly use symbols in the above command.) +(A dictionary is a `key => value` container like a named tuple, but keys may be arbitrary `Julia` objects---not always symbols---so we explicitly use symbols in the above command.) ::: {.callout-note} ##### The `Tables` interface @@ -185,7 +185,7 @@ The filename, may be more general. For example, it could be `download(url)` for ::: {.callout-note} ##### Read and write -The methods `read` and `write` are qualified in the above usage with the `CSV` module. In the `Julia` ecosystem, the `FileIO` package provides a common framework for reading and writing files; it uses the verbs `load` and `save`. This can also be used with `DataFrames`, though it works through the `CSVFiles` package -- and not `CSV`, as illustrated above. The read command would look like `DataFrame(load(fname))` and the write command like `save(fname, df)`. Here `fname` would have a ".csv" extension so that the type of file could be detected. +The methods `read` and `write` are qualified in the above usage with the `CSV` module. In the `Julia` ecosystem, the `FileIO` package provides a common framework for reading and writing files; it uses the verbs `load` and `save`. This can also be used with `DataFrames`, though it works through the `CSVFiles` package---and not `CSV`, as illustrated above. The read command would look like `DataFrame(load(fname))` and the write command like `save(fname, df)`. Here `fname` would have a ".csv" extension so that the type of file could be detected. ::: | Command | Description | @@ -266,7 +266,7 @@ can be very complicated, but here we only assume that `r"name"` will match "name" somewhere in the string; `r"^name"` and `r"name$"` will match "name" at the beginning and ending of a string. Using a regular expression will return a data frame row (when a row index is -specified) -- not a value -- as it is possible to return 0, 1 or more +specified)---not a value---as it is possible to return 0, 1 or more columns in the selection. @@ -395,7 +395,7 @@ For the `cars` data set, the latter can be used to extract the Volkswagen models cars[cars.Manufacturer .== "Volkswagen", :] ``` -This approach lends itself to the description "find all rows matching some value" then "extract the identified rows," -- written as two steps to emphasize there are two passes through the data. Another mental model would be loop over the rows, and keep those that match the query. This is done generically by the `filter` function for collections in `Julia` or by the `subset` function of `DataFrames`. +This approach lends itself to the description "find all rows matching some value" then "extract the identified rows,"---written as two steps to emphasize there are two passes through the data. Another mental model would be loop over the rows, and keep those that match the query. This is done generically by the `filter` function for collections in `Julia` or by the `subset` function of `DataFrames`. 
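As a quick preview of the second option, a hedged sketch using `subset` on the `cars` data set from above might look like:

```{julia}
#| eval: false
using DataFrames
# `ByRow` wraps the test so it is applied to each row's value of `:Manufacturer`.
subset(cars, :Manufacturer => ByRow(==("Volkswagen")))
```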
The `filter(predicate, collection)` function is used to identify just the values in the collection for which the predicate function returns `true`. When a data frame is used with `filter`, the iteration is over the rows, so the wrapping `eachrow` iterator is not needed. We need a predicate function to replace the `.==` above. One follows. It doesn't need `.==`, as `r` is a data frame row and access produces a value not a vector: @@ -796,7 +796,7 @@ legos.youngest_age = categorical(legos.youngest_age, ordered=true) first(legos[:,r"age"], 2) ``` -With that ordering, an expected pattern becomes clear -- kits for older users have on average more pieces -- though there are unexpected exceptions: +With that ordering, an expected pattern becomes clear---kits for older users have on average more pieces---though there are unexpected exceptions: ```{julia} @chain legos begin diff --git a/EDA/univariate-julia.qmd b/EDA/univariate-julia.qmd index 39e32f5..b8cea39 100644 --- a/EDA/univariate-julia.qmd +++ b/EDA/univariate-julia.qmd @@ -109,7 +109,7 @@ The `missing` value propagates through computations: sum(hip_cost) ``` -In particular, all of these combinations with `missing` yield `missing`, as they should -- if data is not available, combinations based on that data are still not available: +In particular, all of these combinations with `missing` yield `missing`, as they should---if data is not available, combinations based on that data are still not available: ```{julia} 1 + missing, 1 - missing, 1*missing, 1/missing, missing^2, missing == true @@ -167,7 +167,7 @@ For the many purposes, tuples can be exchanged for vectors, as both are iterable sum(whale), mean(whale), length(whale) ``` -Unlike vectors, but like numbers, tuples can not be *modified* after construction. This allows tuples to be quite useful -- and performant -- for programming purposes. +Unlike vectors, but like numbers, tuples can not be *modified* after construction. This allows tuples to be quite useful---and performant---for programming purposes. Tuples can also have names. The basic construction uses "key=value" pairs: @@ -257,7 +257,7 @@ The range `1:1` specifies the value `1`, as does just `1`, but for indexing the whale[1], whale[1:1] ``` -The design is indexing by a scalar -- like `1` -- can drop dimensions (the vector becomes a scalar), whereas indexing by a container -- like `1:1` or, say, `[1]` -- does not drop dimensions. The documentation for indexing of an array has: "If all the indices are scalars, then the result, `X`, is a single element from the array, `A`. Otherwise, `X` is an array with the same number of dimensions as the sum of the dimensionalities of all the indices." +The design is indexing by a scalar---like `1`---can drop dimensions (the vector becomes a scalar), whereas indexing by a container---like `1:1` or, say, `[1]`---does not drop dimensions. The documentation for indexing of an array has: "If all the indices are scalars, then the result, `X`, is a single element from the array, `A`. Otherwise, `X` is an array with the same number of dimensions as the sum of the dimensionalities of all the indices." ::: @@ -295,7 +295,7 @@ whale_copy = whale show(whale_copy) ``` -the container is copied, but -- unlike if we had used `copy(whale)` -- the two variables point to the same container. When a vector is passed into a function, the function works with the container, not a copy. +the container is copied, but---unlike if we had used `copy(whale)`---the two variables point to the same container. 
When a vector is passed into a function, the function works with the container, not a copy. ### Modification @@ -363,7 +363,7 @@ whale .= [74, 122, 235, 111, 292, 111, 211, 133, 156, 79] show(whale) ``` -But `whale` is not a new object -- as it would be without that dot -- but rather, these values are placed into the container `whale` already refers to -- which also is the container `whale_copy` points at: +But `whale` is not a new object---as it would be without that dot---but rather, these values are placed into the container `whale` already refers to---which also is the container `whale_copy` points at: ```{julia} show(whale_copy) @@ -446,7 +446,7 @@ This assigns `whale` to a *new* container which accepts floating point values, a Though cumbersome, this is not typical usage, as the constructor used to create the data set will promote to a common type, so it would only matter when adjusting the initial values. -For the special case of assigning a *missing* value, the `allowmissing` function from the `DataFrames` package^[The `DataFrames` package is almost always utilized, as it provides a fundamental type to work with tabular data, the data frame. With data frames, the `allowmissing!` function is used, as well, for this task.] creates a vector with a type that allows -- as well -- missing values^[The `allowmissing` function creates a union type with `Missing` in addition to the original data type. It some printouts, a `?` is appended to the type name to indicate this addition.]. Again, re-assignment is necessary: +For the special case of assigning a *missing* value, the `allowmissing` function from the `DataFrames` package^[The `DataFrames` package is almost always utilized, as it provides a fundamental type to work with tabular data, the data frame. With data frames, the `allowmissing!` function is used, as well, for this task.] creates a vector with a type that allows---as well---missing values^[The `allowmissing` function creates a union type with `Missing` in addition to the original data type. It some printouts, a `?` is appended to the type name to indicate this addition.]. Again, re-assignment is necessary: ```{julia} using DataFrames @@ -456,7 +456,7 @@ whale[1] = missing #### Broadcasting -As seen, the functions `length` and `sum` are reductions -- in this case, returning a single number, a scalar, from a vector of numbers. To compute a *sample standard deviation*, say, we follow the formula: +As seen, the functions `length` and `sum` are reductions---in this case, returning a single number, a scalar, from a vector of numbers. To compute a *sample standard deviation*, say, we follow the formula: $$ s = \sqrt{\frac{\sum_i (x_i - \bar{x})^2 }{n-1}}. @@ -474,7 +474,7 @@ To do so we would need to: * Divide a number by another number and take the square root. Embarking on this punch list with a naive attempt at the first -- -`whale - mean(whale)` -- will fail. +`whale - mean(whale)`---will fail. The subtraction of a scalar value from a vector value is not defined, as `Julia` is not implicitly vectorized. Rather the user must be explicit. For this, the concept of broadcasting is useful. In this context, broadcasting will expand the scalar to match the size of the vector and then use vector subtraction to find the result. Broadcasting is done simply by adding a "." (the dot) to the function. For infix operations like `-` this is *before* the operator: @@ -568,7 +568,7 @@ There are different options available for the storage of categorical data. 
#### Character data -The `String` type in `Julia` is the default type for holding character data. Strings are created with matching single quotes *or* -- for multiline strings -- with matching triple quotes: +The `String` type in `Julia` is the default type for holding character data. Strings are created with matching single quotes *or*---for multiline strings---with matching triple quotes: ```{julia} s = "The quick brown fox ..." @@ -654,7 +654,7 @@ job_title = ["Data Scientist", "Machine Learning Scientist", "Big Data Engineer" #### Symbols -`Julia` as a language can be used to represent the language's code as a data structure in the [language](https://stackoverflow.com/questions/23480722/what-is-a-symbol-in-julia). Symbols are needed to refer to the name, or identifier, of a variable as opposed to the values in the variable. Symbols, being part of the language, are used for other purposes, such as keys for a named tuple of flags for an argument. The access pattern `nt.a` has been mentioned; this is a convenience for `getfield(nt, :a)`, the symbol being used as a key. When data frames are introduced -- essentially a collection of matched data vectors -- symbols will be used to reference the individual variables. +`Julia` as a language can be used to represent the language's code as a data structure in the [language](https://stackoverflow.com/questions/23480722/what-is-a-symbol-in-julia). Symbols are needed to refer to the name, or identifier, of a variable as opposed to the values in the variable. Symbols, being part of the language, are used for other purposes, such as keys for a named tuple of flags for an argument. The access pattern `nt.a` has been mentioned; this is a convenience for `getfield(nt, :a)`, the symbol being used as a key. When data frames are introduced---essentially a collection of matched data vectors---symbols will be used to reference the individual variables. The simple constructor for a symbol is `:`, as in `:some_symbol`. (The `:` constructor makes *expressions*, but in this use, these are interpreted as symbols.) The `Symbol` constructor can also be used to create symbols with spaces, e.g., `Symbol("some symbol")`; the string macro `var"..."` is a convenience. @@ -785,7 +785,7 @@ whale = [74, 122, 235, 111, 292, 111, 211, 133, 156, 79] whale[ whale .>= 200 ] ``` -Or, this example -- which shows the mathematically natural chaining of comparison operators -- filters out only the values in $[100, 125)$: +Or, this example---which shows the mathematically natural chaining of comparison operators--- filters out only the values in $[100, 125)$: ```{julia} whale[ 100 .<= whale .< 125 ] @@ -858,7 +858,7 @@ Structured data may not represent statistical data, but is useful nonetheless, e For a vector of all ones or all zeros, the `ones` and `zeros` functions are useful. The command `ones(n)` will return a vector of `n` zeros using the default `Float64` type. To specify a different type, such as `Int64`, the two-argument form, `ones(T, n)`, is available. Similarly, `zeros` is used to create a vector of zeros. The singular `one()`, `zero()` (one `one(T)` and `zero(T)`) are useful for generic programming. For example, we use the idiom `one.(x)` to create a vector of all `1`s with the length and type of the vector `x`. -Arithmetic sequences, $a, a+h, a+2h, \dots, b$ can be created with the colon operator `a:h:b` or `a:b` when `h` is `1`. This operator returns a recipe for generating the sequence, it is lazy -- it does not generate the sequence. 
The precedence is such that simple arithmetic operations do not need parentheses. That is `a+1:b-1` represents the sequence $a+1, a+2, \dots, b-1$. Arithmetic sequences prove useful for indexing into a vector. +Arithmetic sequences, $a, a+h, a+2h, \dots, b$ can be created with the colon operator `a:h:b` or `a:b` when `h` is `1`. This operator returns a recipe for generating the sequence, it is lazy---it does not generate the sequence. The precedence is such that simple arithmetic operations do not need parentheses. That is `a+1:b-1` represents the sequence $a+1, a+2, \dots, b-1$. Arithmetic sequences prove useful for indexing into a vector. The colon operator for floating point values may or may not stop at `b`. Programming this is harder than it seems. The simple example of `1/10:1/10:3/10` should be $1/10, 2/10, 3/10$, but it turns out that on the computer `1/10 + 2*1/10` is actually *just larger* than `3/10`. See the value of `3/10 - (1/10 + 1/10 + 1/10)` to investigate. However, the algorithm of `:` does produce the result with $3$ values here. @@ -954,7 +954,7 @@ mad(whale; center=mean(whale)) (We use a semicolon to separate positional arguments from keyword arguments, as that is needed to define a keyword, but commas can be used to call a keyword argument. What is important is that any keywords come last when a function is called.) -Functions, as defined above, are methods of a `generic function`. That is, there can be more than one method for a given name. (There are over 200 methods for the generic function named `+` in base `Julia` -- and packages can extend this even more.) To direct or *dispatch* a call to the appropriate method, `Julia` considers the number and types of its positional arguments. That is, like `+`, functions can be defined differently for integers and floating point values. +Functions, as defined above, are methods of a `generic function`. That is, there can be more than one method for a given name. (There are over 200 methods for the generic function named `+` in base `Julia`---and packages can extend this even more.) To direct or *dispatch* a call to the appropriate method, `Julia` considers the number and types of its positional arguments. That is, like `+`, functions can be defined differently for integers and floating point values. We might like our `MAD` function to be more graceful than to throw a `MethodError` if a vector of strings is passed to it. A vector of strings has type `Vector{String}` so we could make a method just for that type:^[To extend the `mad` function from `StatsBase` requires the extra step of *importing* that function *or* qualifying it with its module, as in `StatsBase.mad(...) = ...`.] `Julia makes adding methods easy, but the types that are used to extend the function shouldn't be owned by other packages, as this is considered **type-piracy**. (Failing to do so may prompt a request for a *letter of marque*.) @@ -1037,7 +1037,7 @@ A *comprehension* is a good alternative to a for loop when accumulation is requi [xi - mean(x) for xi in x] ``` -Comprehensions use *generators*, which can also be used for many other functions, such as `sum`, which was previously illustrated. This form has the advantage of not needing to allocate temporary space to compute. For example `sum([xi for xi in x])` would have to find space for the vector created by the comprehension, but the similar -- and easier to type -- `sum(xi for xi in x)` would not. 
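One way to see the difference is to compare allocations; a rough sketch follows (the exact byte counts depend on the machine and the `Julia` version):

```{julia}
#| eval: false
x = rand(10_000)
@allocated sum([xi for xi in x])   # the comprehension builds a temporary vector
@allocated sum(xi for xi in x)     # the generator avoids that temporary
```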
+Comprehensions use *generators*, which can also be used for many other functions, such as `sum`, which was previously illustrated. This form has the advantage of not needing to allocate temporary space to compute. For example `sum([xi for xi in x])` would have to find space for the vector created by the comprehension, but the similar---and easier to type--- `sum(xi for xi in x)` would not. In this example, we sum the squared differences, passing an optional function to `sum`: @@ -1126,9 +1126,9 @@ $$ A property of the mean is when the data is *centered* by the mean, i.e. take the transformation $y_i = x_i - \bar{x}$, then the mean of the $y_i$ is just $0$. Put another way, the mean is the point where the differences to the left and right of the mean even out when added. -The standard deviation, as a sense of scale, has the property that if the data is centered and *then scaled* by the standard deviation, the center will be $0$ and the scale will be $1$. That is, the $z$-scores have mean $0$ (as they data is centered) and standard deviation $1$ (as the data is scaled) -- the $z$ scores speak to the "shape" or distribution of values of a data set. +The standard deviation, as a sense of scale, has the property that if the data is centered and *then scaled* by the standard deviation, the center will be $0$ and the scale will be $1$. That is, the $z$-scores have mean $0$ (as they data is centered) and standard deviation $1$ (as the data is scaled)---the $z$ scores speak to the "shape" or distribution of values of a data set. -These measures are sensitive -- or not *resistant* -- to one or more *outlying* values. For example, the average wealth of people in a bar changes dramatically if someone like a pre-crash Elon Musk walks in. The standard deviation is similar. So measures based on position are better when data is *skewed*, especially if heavily skewed. For these, the median and IQR are not impacted greatly by one large value. That is they are resistant to outliers. The extreme values (the minimum and maximum) are less so. As such, the range of the data is a poor measure of spread, the range of the middle quartiles (the IQR) a much more resistant measure of spread. +These measures are sensitive---or not *resistant*---to one or more *outlying* values. For example, the average wealth of people in a bar changes dramatically if someone like a pre-crash Elon Musk walks in. The standard deviation is similar. So measures based on position are better when data is *skewed*, especially if heavily skewed. For these, the median and IQR are not impacted greatly by one large value. That is they are resistant to outliers. The extreme values (the minimum and maximum) are less so. As such, the range of the data is a poor measure of spread, the range of the middle quartiles (the IQR) a much more resistant measure of spread. ::: diff --git a/Inference/distributions.qmd b/Inference/distributions.qmd index 0e713fe..6d35a82 100644 --- a/Inference/distributions.qmd +++ b/Inference/distributions.qmd @@ -103,7 +103,7 @@ The transformation $Z = (X - \mu)/\sigma$, following the $z$-score, centers and For a *random sample* $X_1, X_2, \dots, X_n$ the sum $S = \sum_k X_k$ has the property that $E(S) = \sum_k E(X_k)$ ("expectations add"). This is true even if the sample is not iid. For example, finding the average number of heads in $100$ tosses of a fair coin is easy, it being $100 \cdot (1/2) = 50$, the $1/2$ being the expectation of a single heads where $X_i=1$ if heads, and $X_i=0$ if tails. 
-While $E(X_1 + X_2) = E(X_1) + E(X_2)$, as expectations are *linear* and satisfy $E(aX+bY) = aE(X)+bE(Y)$, it is not the case that $E(X_1 \cdot X_2) = E(X_1) \cdot E(X_2)$ in general -- though it is true when the two random variables are independent. As such, the variance of $S= \sum_k X_k$ is: +While $E(X_1 + X_2) = E(X_1) + E(X_2)$, as expectations are *linear* and satisfy $E(aX+bY) = aE(X)+bE(Y)$, it is not the case that $E(X_1 \cdot X_2) = E(X_1) \cdot E(X_2)$ in general---though it is true when the two random variables are independent. As such, the variance of $S= \sum_k X_k$ is: $$ VAR(\sum_k X_k) = \sum_k VAR(X_k) + 2 \sum_{i < j} COV(X_i, X_j), @@ -380,7 +380,7 @@ Z = Normal(0, 1) mean(Z), std(Z) ``` -There are many facts about standard normals that are useful to know. First we have the three rules of thumb -- $68$, $95$, $99.7$ -- describing the amount of area above $[-1,1]$, $[-2,2]$, and $[-3,3]$. We can see these from the `cdf` with: +There are many facts about standard normals that are useful to know. First we have the three rules of thumb---$68$, $95$, $99.7$---describing the amount of area above $[-1,1]$, $[-2,2]$, and $[-3,3]$. We can see these from the `cdf` with: ```{julia} between(Z, a, b) = cdf(Z,b) - cdf(Z,a) @@ -453,7 +453,7 @@ maximum(abs, rand(T5, 100)), maximum(abs, rand(Z, 100)) @fig-qqplots-distributions shows a quantile-normal plot of $T(3)$ in the lower-left graphic. The long tails cause deviations in the pattern of points from a straight line. -The `skewness` method measures asymmetry. For both the $T$ and the normal distributions -- both bell shaped -- this is $0$. The `kurtosis` function measures *excess kurtosis* a measure of the size of the tails *as compared* to the normal. We can see: +The `skewness` method measures asymmetry. For both the $T$ and the normal distributions---both bell shaped---this is $0$. The `kurtosis` function measures *excess kurtosis* a measure of the size of the tails *as compared* to the normal. We can see: ```{julia} skewness(T5), skewness(Z), kurtosis(T5), kurtosis(Z) @@ -480,7 +480,7 @@ In @fig-qqplots-distributions the lower-right graphic is of the uniform distrib ```{julia} #| echo: false #| label: fig-qqplots-distributions -#| fig-cap: Quantile-normal plots for different distributions -- $T_3$ is leptokurtic; $T_{100}$ is approximately normal; $U$ is platykurtic; and $E$ is skewed. +#| fig-cap: Quantile-normal plots for different distributions---$T_3$ is leptokurtic; $T_{100}$ is approximately normal; $U$ is platykurtic; and $E$ is skewed. probs = range(0.01, 0.99, 40) Ds = (N = Normal(0,1), T₃ = TDist(3), T₁₀₀=TDist(100), diff --git a/Inference/inference.qmd b/Inference/inference.qmd index 001906e..d165435 100644 --- a/Inference/inference.qmd +++ b/Inference/inference.qmd @@ -166,7 +166,7 @@ Since a data set is a single realization, and probability speaks to the frequenc > For a data set drawn from iid random sample with a $Normal(\mu,\sigma)$ population a $(1-\alpha)\cdot 100$% confidence interval is given by $\bar{x} \pm z_{1-\alpha/2}\cdot \sigma/\sqrt{n}$, where $z_{1-\alpha/2}=-z_{\alpha/2}$ satisfies $P(z_{\alpha/2} < Z < z_{1-\alpha/2})$, $Z$ being a standard normal random variable. -@fig-ci-works illustrates confidence intervals based on several independent random samples. Occasionally -- with a probability controlled by $\alpha$ -- the intervals do not cover the true population mean. +@fig-ci-works illustrates confidence intervals based on several independent random samples. 
Occasionally---with a probability controlled by $\alpha$---the intervals do not cover the true population mean. ```{julia} @@ -241,7 +241,7 @@ Under the assumptions above (iid sample, normal population), the standard error ::: {#exm-t-test} ##### Confidence interval for the mean, no assumption on $\sigma$ -Returning to the coffee-dispenser technician, a cappuccino dispenser has two sources of variance for the amount poured -- the coffee and the foam. This is harder to engineer precisely, so is assumed unknown in the calibration process. Suppose the technician again took $6$ samples to gauge the value of $\mu$. +Returning to the coffee-dispenser technician, a cappuccino dispenser has two sources of variance for the amount poured---the coffee and the foam. This is harder to engineer precisely, so is assumed unknown in the calibration process. Suppose the technician again took $6$ samples to gauge the value of $\mu$. With no assumptions on the value of $\mu$ *or* $\sigma$. A $95$% confidence interval for $\mu$ would be computed by: @@ -256,7 +256,7 @@ SE = s / sqrt(n) ``` -These computations -- and many others -- are carried out by functions in the `HypothesisTests` package. For example, we could have computed the values above through: +These computations---and many others---are carried out by functions in the `HypothesisTests` package. For example, we could have computed the values above through: ```{julia} using HypothesisTests @@ -641,7 +641,7 @@ We can compare the confidence interval identified for $\mu$ to that identified t confint(OneSampleTTest(ys), level = 0.95) ``` -The two differ -- they use different sampling distributions and methods -- though simulations will show both manners create CIs capturing the true mean at the rate of the confidence level. +The two differ---they use different sampling distributions and methods---though simulations will show both manners create CIs capturing the true mean at the rate of the confidence level. The above example does not showcase the advantage of the maximum likelihood methods, but hints at a systematic way to find confidence intervals, which for some cases is optimal, and is more systematic then finding some pivotal quantity (e.g. the $T$-statistic under a normal population assumption). @@ -651,7 +651,7 @@ The above example does not showcase the advantage of the maximum likelihood meth A confidence interval is a means to estimate a population parameter with an appreciation for the variability involved in random sampling. However, sometimes a different language is sought. For example, we might hear a newer product is *better* than an old one, or amongst two different treatments there is a *difference*. These aren't comments about the specific value. The language of hypothesis tests affords the flexibility to incorporate this language. -The basic setup is similar to a courtroom trial in the United States -- as seen on TV: +The basic setup is similar to a courtroom trial in the United States---as seen on TV: * a defendant is judged by a jury with an *assumption of innocence* * presentation of evidence is given @@ -746,7 +746,7 @@ A default value for the null was chosen (`h_0`), a default level for the confide #### Equivalence between hypothesis tests and confidence intervals -For many tests, such as the one-sample $T$-test, there is an equivalence between a two-sided significance test with significance level $\alpha$ and a $(1-\alpha)\cdot 100$% confidence interval -- no surprise given the same $T$-statistic is employed by both. 
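Numerically, the equivalence is easy to check with the `HypothesisTests` functions already introduced; in this sketch `xs` and `μ₀` are placeholders for a data vector and the hypothesized mean:

```{julia}
#| eval: false
using HypothesisTests
t = OneSampleTTest(xs, μ₀)
lo, hi = confint(t; level = 0.95)
# The p-value falls below α exactly when μ₀ lies outside the (1-α)⋅100% interval:
(pvalue(t) < 0.05) == !(lo <= μ₀ <= hi)   # expected to be `true`
```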
+For many tests, such as the one-sample $T$-test, there is an equivalence between a two-sided significance test with significance level $\alpha$ and a $(1-\alpha)\cdot 100$% confidence interval---no surprise given the same $T$-statistic is employed by both. For example, consider a two-sided significance test of $H_0: \mu=\mu_0$. Let $\alpha$ be the level of significance: we "reject" the null hypothesis if the $p$-value is less than $\alpha$. An iid random sample is summarized by the observed value of the $T$-statistic, $(\bar{x} - \mu_0)/(s/\sqrt{n})$. Let $t^* = t_{1-\alpha/2}$ be the quantile, then if the $p$-value is less than $\alpha$, we must have the observed value in absolute value is greater than $t^*$. That is @@ -1275,7 +1275,7 @@ ps /= sum(ps) # make sum to exactly 1 χ².stat ``` -We don't show the $p$ value -- yet -- as we need to consider an adjusted degrees of freedom. There are $s=2$ parameters estimated from the data, so the degrees of freedom of this test statistic are $18 - 2 - 1 = 15$. The $p$-value is found by computing the area to the *right* of the observed value, giving: +We don't show the $p$ value---yet---as we need to consider an adjusted degrees of freedom. There are $s=2$ parameters estimated from the data, so the degrees of freedom of this test statistic are $18 - 2 - 1 = 15$. The $p$-value is found by computing the area to the *right* of the observed value, giving: ```{julia} 1 - cdf(Chisq(18 - 2 - 1), χ².stat) diff --git a/LinearModels/linear-regression.qmd b/LinearModels/linear-regression.qmd index 1fef551..5bff9ff 100644 --- a/LinearModels/linear-regression.qmd +++ b/LinearModels/linear-regression.qmd @@ -307,7 +307,7 @@ res = lm(fm, cereal) ``` -The output shows what might have been anticipated: there appears to be no connection between `Sodium` and `Calories`, though were this data on dinner foods that might not be the case. The $T$-test displayed for `Sodium` is a test of whether the slope based on `Sodium` is $0$ -- holding the other variables constant -- and the large $p$-value would lead us to accept that hypotheses. +The output shows what might have been anticipated: there appears to be no connection between `Sodium` and `Calories`, though were this data on dinner foods that might not be the case. The $T$-test displayed for `Sodium` is a test of whether the slope based on `Sodium` is $0$---holding the other variables constant---and the large $p$-value would lead us to accept that hypotheses. We drop this variable from the model and refit: @@ -392,7 +392,7 @@ EqualVarianceTTest(y2, y1) However, some comments are warranted. We would have found a slightly different answer (a different sign) had we done `EqualVarianceTTest(y1, y2)`. This is because a choice is made if we consider $\bar{y}_1-\bar{y}_2$ or $\bar{y}_2 - \bar{y}_1$ in the statistic. -In the use of the linear model, there is a new subtlety -- the `group` variable is *categorical* and not numeric. A peek at the *model matrix* (`modelmatrix(res)`) will show that the categorical variable was *coded* with a $0$ for each `g1` and $1$ for each `g2`. The details are handled by the underlying `StatsModels` package which first creates a `ModelFrame` which takes a formula and the data; `ModelMatrix` then creates the matrix, $X$. The call to `ModelFrame` allows a specification of *contrasts*. 
The above uses the `DummyCoding`, which picks a base level (`"g1"` in this case) and then creates a variable for *each* other level, these variables having values either being `0` or `1`, and `1` only when the factor has that level. Using the notation $1_{j}(x_i)$ for this, we have the above call to `lm` fits the model $y_i = \beta_0 + \beta_1 \cdot 1_{\text{g2}}(x_i) + e_i$ and the model matrix shows this (2nd row below): +In the use of the linear model, there is a new subtlety---the `group` variable is *categorical* and not numeric. A peek at the *model matrix* (`modelmatrix(res)`) will show that the categorical variable was *coded* with a $0$ for each `g1` and $1$ for each `g2`. The details are handled by the underlying `StatsModels` package which first creates a `ModelFrame` which takes a formula and the data; `ModelMatrix` then creates the matrix, $X$. The call to `ModelFrame` allows a specification of *contrasts*. The above uses the `DummyCoding`, which picks a base level (`"g1"` in this case) and then creates a variable for *each* other level, these variables having values either being `0` or `1`, and `1` only when the factor has that level. Using the notation $1_{j}(x_i)$ for this, we have the above call to `lm` fits the model $y_i = \beta_0 + \beta_1 \cdot 1_{\text{g2}}(x_i) + e_i$ and the model matrix shows this (2nd row below): ```{julia} modelmatrix(res) |> permutedims # turned on side to save page space diff --git a/_quarto.yml b/_quarto.yml index 6c29c12..ea13a20 100644 --- a/_quarto.yml +++ b/_quarto.yml @@ -1,4 +1,4 @@ -version: "0.0.4" +version: "0.0.5" engines: ['julia'] project: @@ -18,7 +18,7 @@ book: # downloads: pdf page-footer: right: | - © Copyright 2023, John Verzani. All rights reserved. + © Copyright 2023-25, John Verzani. All rights reserved. chapters: - index.qmd - EDA/univariate-julia.qmd diff --git a/index.qmd b/index.qmd index 3fd18a8..3ff27d7 100644 --- a/index.qmd +++ b/index.qmd @@ -2,9 +2,11 @@ This is a collection of notes for using `Julia` for introductory statistics. -In case you haven't heard, [Julia](https://julialang.org/) is an open-source programming language suitable for many tasks, like scientific programming. It is designed for high performance -- Julia programs compile on the fly to efficient native code. Julia has a relatively easy to learn syntax for many tasks, certainly no harder to pick up than `R` and `Python`, widely used scripting languages for the tasks illustrated herein. +In case you haven't heard, [Julia](https://julialang.org/) is an open-source programming language suitable for many tasks, like scientific programming. It is designed for high performance---Julia programs compile on the fly to efficient native code. Julia has a relatively easy to learn syntax for many tasks, certainly no harder to pick up than `R` and `Python`, widely used scripting languages for the tasks illustrated herein. -Why these notes on introductory statistics? No compelling reason save I had done something similar for `R` when `R` was a fledgling `S-Plus` clone. No more, `R` is a juggernaut, and it is almost certain `Julia` will never replace `R` as the programming language of choice for statistics. Besides, `Julia` users can already interface with `R` quite easily through `RCall`. *However*, there are some reasons that `Julia` could be a useful language when learning basic inferential statistics, especially if other real strengths of the `Julia` ecosystem were needed. 
So these notes show how `Julia` can be used for these tasks, and, hopefully, shows that it works pretty well. +Why these notes on introductory statistics? No compelling reason save I had done something similar for `R` when `R` was a fledgling `S-Plus` clone. No more, `R` is a juggernaut, and it is almost certain `Julia` will never replace `R` as the programming language of choice for statistics. Besides, `Julia` users can already interface with `R` quite easily through `RCall`. + +*However*, there are some reasons that `Julia` could be a useful language when learning basic inferential statistics, especially if other real strengths of the `Julia` ecosystem were needed. So these notes show how `Julia` can be used for these tasks, and, hopefully, shows that it works pretty well. There are some great books published about using `Julia` [@bezanson2017julia] with data science, within which much of this material is covered. For example, @@ -24,7 +26,7 @@ These notes are a work in progress. Feel free to click the "edit this page" butt ## Installing and running Julia -`Julia` can be downloaded from [julialang.org](https://julialang.org/). The language is evolving rapidly. The latest official release is recommended. These notes should work with any version since `v"1.6.0"`. It is recommended to use a version `v"1.9.0"` or later, as there are significant speedups with external packages that make the user experience even better. +`Julia` can be downloaded from [julialang.org](https://julialang.org/) using `juliaup`. The language is evolving rapidly. The latest official release is recommended. These notes should work with any version since `v"1.10.0"`. It is recommended to use the latest version, as the language had general improvement with each new release and some external packages take advantage of these. Once downloaded and installed the `Julia` installation will provide a *command line* for interactive usage and a binary to run scripts. It is envisioned most users will use an alternative interface, though `Julia` has an excellent REPL for command-line usage. @@ -45,9 +47,9 @@ These notes use `quarto` to organize the mix of text, code, and graphics. The `q This section gives a quick orientation for using `Julia`. See this compiled collection of [tutorials](https://julialang.org/learning/tutorials/) for more comprehensive introductions. -As will be seen, `Julia` use *multiple dispatch* (as does `R`) where different function methods can be called using the same generic name. Different methods are dispatched depending on the type and number of the arguments. The `+` sign above, is actually a function call to the `+` function, which in base `Julia` has over 200 different methods, as there are many different implementations for addition. For a beginner this is great -- fewer new function names to remember. +As will be seen, `Julia` use *multiple dispatch* (as does `R`) where different function methods can be called using the same generic name. Different methods are dispatched depending on the type and number of the arguments. The `+` sign above, is actually a function call to the `+` function, which in base `Julia` has over $150$ different methods, as there are many different implementations for addition. For a beginner this is great---fewer new function names to remember; for a power user this is greate---new methods can be specialized for performance purposes. 
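A quick way to see this is to count the methods; the exact number varies with the `Julia` version and which packages are loaded, so the sketch below is illustrative:

```{julia}
#| eval: false
length(methods(+))   # how many methods the generic function `+` currently has
using Dates          # a standard library that defines additional `+` methods
length(methods(+))   # typically a larger count now
```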
-`Julia` is a *dynamically typed* language, like `R` and `Python`, meaning variables can be reassigned to different values and with different types.^[With the one caveat that generic function names can not be reassigned as variables or vice versa.] Dynamicness makes interactive usage at the REPL or through a notebook much easier. +`Julia` is a *dynamically typed* language, like `R` and `Python`, meaning variables can be reassigned to different values and with different types.^[With the one caveat that generic function names can not be reassigned as variables or vice versa within a session.] Dynamicness makes interactive usage at the REPL or through a notebook much easier. Julia supports the usual mathematical operations familiar to users of a calculator, such as `+`, `-`, `*`, `/`, and `^`. In addition, there a numerous built in functions such as mathematical ones like `sqrt` or programming oriented ones, like `map`. @@ -75,7 +77,7 @@ xs = (1, 2, 3, 7, 9) # tuple sum(xs) / length(xs) ``` -The takeaway -- we can focus more on what the computations mean, and less on how to program a particular computation. +The takeaway---we can focus more on what the computations mean, and less on how to program a particular computation. ## Add-on packages @@ -108,7 +110,3 @@ These notes will utilize numerous add-on packages including: * [`GLM`](https://github.com/JuliaStats/GLM.jl), [`Loess`](https://github.com/JuliaStats/Loess.jl), and [`RobustModels`](https://github.com/getzze/RobustModels.jl), for statistical modeling. Most of these are maintained by the `StatsBase` organization, which provides the `StatsKit` package to load all these with a single command, though we don't illustrate that. - ----- - -Copyright 2023, John Verzani. All rights reserved. From 4d8a0f196c880e9be4c8e594054fc3beee9e6d2a Mon Sep 17 00:00:00 2001 From: jverzani Date: Wed, 3 Sep 2025 16:03:48 -0400 Subject: [PATCH 2/3] clean up dirs --- .gitignore | 1 + 1 file changed, 1 insertion(+) diff --git a/.gitignore b/.gitignore index 0922bf4..595d0f3 100644 --- a/.gitignore +++ b/.gitignore @@ -4,6 +4,7 @@ /site_libs/ /_freeze/ /*/*_files/ +/*/*.html /*/*.ipynb/ TODO.md Manifest.toml From 5449caaacd779effa68f26522035fcda3d1574d8 Mon Sep 17 00:00:00 2001 From: jverzani Date: Wed, 3 Sep 2025 16:56:39 -0400 Subject: [PATCH 3/3] add table figure environments --- EDA/bivariate-julia.qmd | 5 +++++ EDA/makie.qmd | 10 +++++++++- EDA/tabular-data-julia.qmd | 4 +++- EDA/univariate-julia.qmd | 9 ++++++--- Inference/inference.qmd | 8 ++++++-- 5 files changed, 29 insertions(+), 7 deletions(-) diff --git a/EDA/bivariate-julia.qmd b/EDA/bivariate-julia.qmd index 606703b..1483ae3 100644 --- a/EDA/bivariate-julia.qmd +++ b/EDA/bivariate-julia.qmd @@ -448,12 +448,17 @@ lm(@formula(PetalWidth ~ PetalLength), d) The output has more detail to be explained later. For now, we only need to know that the method `coef` will extract the coefficients (in the first column) as a vector of length 2, which we assign to the values `bhat0` and `bhat1` below: +::: {#fig-regression-jitter} + ```{julia} scatter(jitter(l), jitter(w); legend=false) # spread out values bhat0, bhat1 = coef(res) # the coefficients plot!(x -> bhat0 + bhat1 * x) # `predict` does this generically ``` +Scatter plot with computed regression line +::: + ::: {.callout-note} ##### A constant model diff --git a/EDA/makie.qmd b/EDA/makie.qmd index a0231cd..993b7d7 100644 --- a/EDA/makie.qmd +++ b/EDA/makie.qmd @@ -337,7 +337,9 @@ Quantile-quantile plots. 
The left graphic shows `QQPlot` used to compare the dis A scatter plot shows $x$ and $y$ pairs as points, a line plot connects these points. There are numerous ways to draw lines with the `AlgebraOfGraphics` including: `visual(Lines)`, for connect-the-dots lines; `visual(LinesFill)`, for shading; `visual(HLines)` and `visual(VLines)`, for horizontal and vertical lines; `visual(Rangebars)` to draw vertical or horizontal line segments. -The graph of a function can be drawn using `Lines`, as in this example, where we add in different range bars to emphasize the role that the two parameters play in this function's graph: +The graph of a function can be drawn using `Lines`, as in the example shown in @fig-line-plot, where we add in different range bars to emphasize the role that the two parameters play in the function's graph. + +::: {#fig-line-plot} ```{julia} ϕ(x; μ=0, σ=1) = 1/sqrt(2*pi*σ^2) * exp(-(1/(2σ)) * (x - μ)^2) @@ -358,6 +360,9 @@ c += data((x=[1/10, 1/2], y=[0, ϕ(1)], label=["μ", "σ"])) * draw(c) ``` +Density of standard normal distribution with annotations +::: + The `Rangebars` visual has a `direction` argument, used above to make a horizontal range bar. The annotation has two subtleties: the qualification of `Makie.Text` is needed, as there is a `Text` type in base `Julia`. More idiosyncratically, the use of `verbatim` in `mapping` is needed to avoid an attempt to map the labels to a glyph, such as a pre-defined marker. @@ -416,6 +421,7 @@ f A corner plot, as produced by the `PairPlots` package through its `pairplot` function, is a quick plot to show pair-wise relations amongst multiple numeric values. The graphic uses the lower part of a grid to show paired scatterplots with, by default, contour lines highlighting the relationship. On the diagonal are univariate density plots. +::: {#fig-pairplot} ```{julia} using PairPlots nms = names(penguins, 3:5) @@ -423,6 +429,8 @@ p = select(penguins, nms .=> replace.(nms, "_mm" => "", "_" => " ")) # adjust na pairplot(p) ``` +Corner plot produced by the `PairPlots` package +::: ### 3D scatterplots diff --git a/EDA/tabular-data-julia.qmd b/EDA/tabular-data-julia.qmd index fe39f11..9b86740 100644 --- a/EDA/tabular-data-julia.qmd +++ b/EDA/tabular-data-julia.qmd @@ -188,6 +188,7 @@ The filename, may be more general. For example, it could be `download(url)` for The methods `read` and `write` are qualified in the above usage with the `CSV` module. In the `Julia` ecosystem, the `FileIO` package provides a common framework for reading and writing files; it uses the verbs `load` and `save`. This can also be used with `DataFrames`, though it works through the `CSVFiles` package---and not `CSV`, as illustrated above. The read command would look like `DataFrame(load(fname))` and the write command like `save(fname, df)`. Here `fname` would have a ".csv" extension so that the type of file could be detected. ::: + | Command | Description | |---------|-------------| | `CSV.read(file_name, DataFrame)` | Read csv file from file with given name | @@ -197,7 +198,8 @@ The methods `read` and `write` are qualified in the above usage with the `CSV` m | `DataFrame(load(file_name))` | Read csv file from file with given name using `CSVFiles` | | `save(file_name, df)` | Write data frame `df` to a csv file using `CSVFiles` | -: Basic usage to read/write `.csv` file into a data frame. +: Basic usage to read/write `.csv` file into a data frame. 
{#tbl-read-write-data-frame} + #### TableScraper diff --git a/EDA/univariate-julia.qmd b/EDA/univariate-julia.qmd index b8cea39..18feb7a 100644 --- a/EDA/univariate-julia.qmd +++ b/EDA/univariate-julia.qmd @@ -397,7 +397,7 @@ show(whale) |`x[[2,3]] = [4,5]`| Assign values to second and third elements of `x`. |`x[:] = [1, 2, 3]` | In-place assignment. Size *and type* of right-hand side must match left-hand side | -: Various indexing patterns. {tbl-colwidths="[30,70]"} +: Various indexing patterns. {#tbl-indexing-patterns tbl-colwidths="[30,70]"} @@ -1057,7 +1057,7 @@ sum(f, xi - mean(whale) for xi in whale) | `eachcol` | for tabular data, iterate over the columns | | `eachrow` | for tabular data, iterate over the rows | -: Various convenient iterators in `Julia`. {tbl-colwidths="[25,75]"} +: Various convenient iterators in `Julia` {#tbl-iterators tbl-colwidths="[25,75]"} @@ -1133,6 +1133,8 @@ These measures are sensitive---or not *resistant*---to one or more *outlying* va ::: + + | Measure | Type | Description | |-------------|--------|---------------| | `mean` | center | Average value | @@ -1147,7 +1149,8 @@ These measures are sensitive---or not *resistant*---to one or more *outlying* va | `summarystats` | position | The quartiles and extrema | | `zscore` | position | Standardizing transformation | -: Various measures of center, spread, and position in a univariate data set. {tbl-colwidths="[25,25,50]"} +: Various measures of center, spread, and position in a univariate data set. {#tbl-measures-center-spread-position tbl-colwidths="[25,25,50]"} + ## Shape {#sec-shape} diff --git a/Inference/inference.qmd b/Inference/inference.qmd index d165435..eaf5fcc 100644 --- a/Inference/inference.qmd +++ b/Inference/inference.qmd @@ -699,7 +699,7 @@ To test this, we characterize the probability a sample mean would be even more t For this problem the $p$-value is the probability the sample mean is `mean(xs)` or more assuming the *null hypothesis*. Centering by $\mu$ and scaling by the standard error, this is: $$ -P(\frac{\bar{X} - \mu}{SE(\bar{X})} > \frac{\bar{x} - \mu}{s/\sqrt{n}} \mid H_0). +P\left(\frac{\bar{X} - \mu}{SE(\bar{X})} > \frac{\bar{x} - \mu}{s/\sqrt{n}} \mid H_0\right). $$ That is, large values of $\bar{X}$ are not enough to be considered statistically significant, rather large values measured in terms of the number of standard errors are. @@ -1513,8 +1513,9 @@ prof = profile(prob, sol; param_ranges=param_ranges) prof[:β₁] ``` -We can visualize the log-likelihood over the value in the $95$% confidence interval with the following: +We can visualize the log-likelihood over the value in the $95$% confidence interval (cf. @fig-log-likelihood). +::: {#fig-log-likelihood} ```{julia} xs = range(-0.02, 0.02, length=100); ys = prof[:β₁].(xs) p = data((x=xs, y=ys)) * visual(Lines) * mapping(:x, :y) @@ -1524,6 +1525,9 @@ p += data((x=ci, y = prof[:β₁].(ci))) * visual(Lines) * mapping(:x, :y) draw(p) ``` +A 95% confidence interval for the log-likelihood +::: + Were a significance test desired, the test statistic requires one more optimization calculuation, this time the maximum log likelihood under $H_0$, which assumes a fixed value of $\beta_1$: ```{julia}