From 4b1527ddada66ff1e4085aed4cb86f5480fc9be7 Mon Sep 17 00:00:00 2001 From: jverzani Date: Wed, 3 Sep 2025 15:16:02 -0400 Subject: [PATCH 1/2] add spellcheck --- .github/workflows/SpellCheck.yml | 13 +++++++++++++ 1 file changed, 13 insertions(+) create mode 100644 .github/workflows/SpellCheck.yml diff --git a/.github/workflows/SpellCheck.yml b/.github/workflows/SpellCheck.yml new file mode 100644 index 0000000..3d62423 --- /dev/null +++ b/.github/workflows/SpellCheck.yml @@ -0,0 +1,13 @@ +name: Spell Check + +on: [pull_request] + +jobs: + typos-check: + name: Spell Check with Typos + runs-on: ubuntu-latest + steps: + - name: Checkout Actions Repository + uses: actions/checkout@v4 + - name: Check spelling + uses: crate-ci/typos@master \ No newline at end of file From 456fe25d4d02d17d7e6280b80bd9effd56f6c6f8 Mon Sep 17 00:00:00 2001 From: jverzani Date: Wed, 3 Sep 2025 15:27:58 -0400 Subject: [PATCH 2/2] typos --- EDA/bivariate-julia.qmd | 6 +++--- EDA/tabular-data-julia.qmd | 18 +++++++++--------- EDA/univariate-julia.qmd | 2 +- Inference/distributions.qmd | 8 ++++---- Inference/inference.qmd | 6 +++--- _typos.toml | 2 ++ index.qmd | 6 +++--- 7 files changed, 25 insertions(+), 23 deletions(-) diff --git a/EDA/bivariate-julia.qmd b/EDA/bivariate-julia.qmd index 1357824..98926b8 100644 --- a/EDA/bivariate-julia.qmd +++ b/EDA/bivariate-julia.qmd @@ -150,7 +150,7 @@ plot(p1, p2, layout=(@layout [a b]))#, size=fig_size_2) ### Histograms across groups -The density plot is, perhaps, less familiar than a histogram, but makes a better graphic when comparing two or more distributions, as the individual graphics don't overlap. There are attempts to use the histogram, and they can be effective. 
In @fig-xword-histograms, taken from an article on [fivethirtyeight.com](https://fivethirtyeight.com/features/dan-feyer-american-crossword-puzzle-tournament/) on the time to solve cross word puzzles broken out by day of week, we see a stacked dotplot, which is visually very similar to a histogram, presented for each day of the week. The use of color allows one to distinguish the day, but the overalpping aspect of the graphic inhibits part of the distribution of most days, and only effectively shows the longer tails as the week progresses. In `StatsPlots`, the `grouped hist` function can produce similar graphics. +The density plot is, perhaps, less familiar than a histogram, but makes a better graphic when comparing two or more distributions, as the individual graphics don't overlap. There are attempts to use the histogram, and they can be effective. In @fig-xword-histograms, taken from an article on [fivethirtyeight.com](https://fivethirtyeight.com/features/dan-feyer-american-crossword-puzzle-tournament/) on the time to solve crossword puzzles broken out by day of week, we see a stacked dotplot, which is visually very similar to a histogram, presented for each day of the week. The use of color allows one to distinguish the day, but the overlapping aspect of the graphic inhibits part of the distribution of most days, and only effectively shows the longer tails as the week progresses. In `StatsPlots`, the `groupedhist` function can produce similar graphics. ::: {#fig-xword-histograms} @@ -266,7 +266,7 @@ vline!([mean(l)], linestyle=:dash) hline!([mean(w)], linestyle=:dash) ``` -@fig-scatterplot-l-w shows the length and width data in a scatter plot. Jittering would be helpful to show all the data, as it has been discretized and many points are overplotte. The dashed lines are centered at the means of the respective variables. If the mean is the center of a single variable, then $(\bar{x}, \bar{y})$ may be thought of as the center of the paired data. 
Thinking of the dashed lines meeting at the origin, four quadrants are formed. The correlation can be viewed as a measure of how much the data sits in opposite quadrants. In the figure, there seems to be more data in quadrants I and III then II and IV, which sugests a *positive* correlation, as confirmed numerically. +@fig-scatterplot-l-w shows the length and width data in a scatter plot. Jittering would be helpful to show all the data, as it has been discretized and many points are overplotted. The dashed lines are centered at the means of the respective variables. If the mean is the center of a single variable, then $(\bar{x}, \bar{y})$ may be thought of as the center of the paired data. Thinking of the dashed lines meeting at the origin, four quadrants are formed. The correlation can be viewed as a measure of how much the data sits in opposite quadrants. In the figure, there seems to be more data in quadrants I and III than II and IV, which suggests a *positive* correlation, as confirmed numerically. By writing the correlation in terms of $z$-scores, the product in that formula is *positive* if the point is in quadrant I or III and negative if in II or IV. So, for example, a big positive number suggests data is concentrated in quadrants I and III or that there is a strong association between the variables. The scaling by the standard deviations, leaves the mathematical constraint that the correlation is between $-1$ and $1$. @@ -816,7 +816,7 @@ for (k,d) ∈ pairs(gdf) # GroupKey, SubDataFrame end ``` -Now we identify different regression lines (slope and intercepts) for each cluster. This is done throuh a *multiplicative* model and is specified in the model formula of `StatsModels` with a `*`: +Now we identify different regression lines (slope and intercepts) for each cluster. 
This is done through a *multiplicative* model and is specified in the model formula of `StatsModels` with a `*`: ```{julia} m3 = lm(@formula(PetalLength ~ PetalWidth * Species), iris) diff --git a/EDA/tabular-data-julia.qmd b/EDA/tabular-data-julia.qmd index 8ff7a24..2fabe21 100644 --- a/EDA/tabular-data-julia.qmd +++ b/EDA/tabular-data-julia.qmd @@ -33,7 +33,7 @@ There are different ways to construct a data frame. Consider the task of the Wirecutter in trying to select the best [carry on travel bag](https://www.nytimes.com/wirecutter/reviews/best-carry-on-travel-bags/#how-we-picked-and-tested). After compiling a list of possible models by scouring travel blogs etc., they select some criteria (capacity, compartment design, aesthetics, comfort, ...) and compile data, similar to what one person collected in a [spreadsheet](https://docs.google.com/spreadsheets/d/1fSt_sO1s7moXPHbxBCD3JIKPa8QIZxtKWYUjD6ElZ-c/edit#gid=744941088). -Here we create a much simplified spreadsheet for 3 listed bags with measurements of volume, price, laptop compatability, loading style, and a last-checked date -- as this market improves constantly. +Here we create a much simplified spreadsheet for 3 listed bags with measurements of volume, price, laptop compatibility, loading style, and a last-checked date -- as this market improves constantly. ``` product v p l loads checked @@ -42,7 +42,7 @@ Minaal 3.0 35 349 Y front panel 2022-09 Genius 25 228 Y clamshell 2022-10 ``` -We see that product is a character, volume and price numeric, laptop compatability a Boolean value, load style one of a few levels, and the last checked date, a year-month date. +We see that product is a character, volume and price numeric, laptop compatibility a Boolean value, load style one of a few levels, and the last checked date, a year-month date. We create vectors to hold each. 
We load the `CategoricalArrays` and `Dates` packages for a few of the variables: @@ -51,7 +51,7 @@ using CategoricalArrays, Dates product = ["Goruck GR2", "Minaal 3.0", "Genius"] volume = [40, 35, 25] price = [395, 349, 228] -laptop_compatability = categorical(["Y","Y","Y"]) +laptop_compatibility = categorical(["Y","Y","Y"]) loading_style = categorical(["front panel", "front panel", "clamshell"]) date_checked = Date.(2022, [9,9,10]) ``` @@ -60,7 +60,7 @@ With this, we use the `DataFrame` constructor to combine these into one data set ```{julia} d = DataFrame(product = product, volume=volume, price=price, - var"laptop compatability"=laptop_compatability, + var"laptop compatibility"=laptop_compatibility, var"loading style"=loading_style, var"date checked"=date_checked) ``` @@ -79,7 +79,7 @@ In the above construction, we repeated the names of the variables to the constr ```{julia} d = DataFrame(; product, volume, price, - var"laptop compatability"=laptop_compatability, + var"laptop compatibility"=laptop_compatibility, var"loading style"=loading_style, var"date checked"=date_checked) ``` @@ -94,7 +94,7 @@ d = DataFrame() # empty data frame d.product = product d.volume = volume d.price = price -d."laptop compatability" = laptop_compatability +d."laptop compatibility" = laptop_compatibility d."loading style" = loading_style d."date checked" = date_checked d @@ -225,7 +225,7 @@ The `rename!` function allows the names to be changed in-place (without returnin ### Indexing and assignment -The values in a data frame can be referenced programatically by a row number and column number, both 1-based. For example, the 2nd row and 3rd column of `d` can be seen to be `349` by observation +The values in a data frame can be referenced programmatically by a row number and column number, both 1-based. 
For example, the 2nd row and 3rd column of `d` can be seen to be `349` by observation ```{julia} d @@ -441,7 +441,7 @@ cars1 = filter(:Manufacturer => ==("Volkswagen"), cars) cars2 = filter(:MPGCity => >=(20), cars1) ``` -The above required the introduction of an intermediate data frame to store the result of the first `filter` call to pass to the second. This threading through of the modified data is quite common in processing pipelines. The first two approaches with complicated predicate functions can grow unwieldly, so staged modification is common. To support that, the chaining or piping operation (`|>`) is often used: +The above required the introduction of an intermediate data frame to store the result of the first `filter` call to pass to the second. This threading through of the modified data is quite common in processing pipelines. The first two approaches with complicated predicate functions can grow unwieldy, so staged modification is common. To support that, the chaining or piping operation (`|>`) is often used: ```{julia} filter(:Manufacturer => ==("Volkswagen"), cars) |> @@ -590,7 +590,7 @@ When `AsTable` is used on the source columns, as in `AsTable([:p,:v])` then the #### Transform -Extending the columns in the data frame by `select` is common enough that the function `transform` is supplied which always keeps the columns of the original data frame, though they can also be modified through the mini language. The use of transfrom is equivalent to `select(df, :, args...)`. +Extending the columns in the data frame by `select` is common enough that the function `transform` is supplied which always keeps the columns of the original data frame, though they can also be modified through the mini language. The use of transform is equivalent to `select(df, :, args...)`. 
::: {.callout-note} diff --git a/EDA/univariate-julia.qmd b/EDA/univariate-julia.qmd index d47f4a4..39e32f5 100644 --- a/EDA/univariate-julia.qmd +++ b/EDA/univariate-julia.qmd @@ -324,7 +324,7 @@ When a vector is passed to a function, if there is no copy made (as opposed to a ::: -Multiple values can be assigned at once. For example, if the data was mis-arranged chronologically, we might have: +Multiple values can be assigned at once. For example, if the data was misarranged chronologically, we might have: ```{julia} whale[ [1,2,3] ] = [235, 74, 122] diff --git a/Inference/distributions.qmd b/Inference/distributions.qmd index 75d17a1..980cafe 100644 --- a/Inference/distributions.qmd +++ b/Inference/distributions.qmd @@ -13,7 +13,7 @@ using CairoMakie, AlgebraOfGraphics This section quickly reviews the basic concepts of probability. -Mathematically a probability is an assignment of numbers to a collection of events (sets) of a probability space. These values may be understood from a model or through long term frequencies. For example, consider the tossing of a *fair* coin. By writing "fair" the assumption is implicitly made that each side (heads or tails) is equally likely to occur on a given toss. That is a mathematical assumption. This can be reaffirmed by tossing the coin *many* times and counting the frequency of a heads occuring. If the coin is fair, the expectation is that heads will occur in about half the tosses. +Mathematically a probability is an assignment of numbers to a collection of events (sets) of a probability space. These values may be understood from a model or through long term frequencies. For example, consider the tossing of a *fair* coin. By writing "fair" the assumption is implicitly made that each side (heads or tails) is equally likely to occur on a given toss. That is a mathematical assumption. This can be reaffirmed by tossing the coin *many* times and counting the frequency of a heads occurring. 
If the coin is fair, the expectation is that heads will occur in about half the tosses. The mathematical model involves a formalism of sample spaces and events. There are some subtleties due to infinite sets, but we limit our use of events to subsets of finite or countably infinite sets or intervals of the real line. A probability *measure* is a function $P$ which assigns each event $E$ a number with: @@ -75,7 +75,7 @@ A **discrete** random variable is one which has $P(X = k) > 0$ for at most a fin A **continuous** random variable is described by a function $f(x)$ where $P(X \leq a)$ is given by the *area* under $f(x)$ between $-\infty$ and $a$. The function $f(x)$ is called the pdf (probability density function). An immediate consequence is the *total* area under $f(x)$ is $1$ and $f(x) \geq 0$. -When defined, the pdf is the basic description of the distribution of a random variable. It says what is *possible* and *how likely* possible things are. For the two cases above, this is done differently. In the discrete case, the possible values are all $k$ where $f(k) =P(X=k) > 0$, but not all values are equally likely unless $f(k)$ is a constant. For the continuous case there are **no** values with $P(X=k) > 0$, as probabilities are assigned to area, and the corresponding area to this event, for any $k$, is $0$. Rather, values can only appear in itervals with positive area ($f(x) > 0$ within this interval) and for equal-length intervals, those with more area above them are more likely to contain values. +When defined, the pdf is the basic description of the distribution of a random variable. It says what is *possible* and *how likely* possible things are. For the two cases above, this is done differently. In the discrete case, the possible values are all $k$ where $f(k) =P(X=k) > 0$, but not all values are equally likely unless $f(k)$ is a constant. 
For the continuous case there are **no** values with $P(X=k) > 0$, as probabilities are assigned to area, and the corresponding area to this event, for any $k$, is $0$. Rather, values can only appear in intervals with positive area ($f(x) > 0$ within this interval) and for equal-length intervals, those with more area above them are more likely to contain values. A data set in statistics, $x_1, x_2, \dots, x_n$, is typically modeled by a collection of random variables, $X_1, X_2, \dots, X_n$. That is, the random variables describe the *possible* values that can be collected, the values ($x_1, x_2,\dots$) describe the actual values that were collected. Put differently, random variables describe what can happen *before* a measurement, the values are the result of the measurement. @@ -147,7 +147,7 @@ Statistical inference makes statements using the language of probability about t An intuitive example is the tossing of a fair coin modeling heads by a $1$ and tails by a $0$ then we can *parameterize* the distribution by $f(1) = P(X=1) = p$ and $f(0) = P(X=0) = 1 - P(X=1) = 1-p$. This distribution is summarized by $\mu=p$, $\sigma = \sqrt{p(1-p)}$. A *fair* coin would have $p=1/2$. A sequence of coin tosses, say H,T,T,H,H might be modeled by a sequence of iid random variables, each having this distribution. Then we might expect a few things, where $\hat{p}$ below is the proportion of heads in the $n$ tosses: -* A given data set is not random, but it may be viewed as the result of a random process and had that process been run again would likely result in a different outcome. These different outcomes may be described probabalistically in terms of a distribution. +* A given data set is not random, but it may be viewed as the result of a random process and had that process been run again would likely result in a different outcome. These different outcomes may be described probabilistically in terms of a distribution. 
* If $n$ is large enough, the sample proportion $\hat{p}$ should be *close* to the population proportion $p$. * Were the sampling repeated, the variation in the values of $\hat{p}$ should be smaller for larger sample sizes, $n$. @@ -345,7 +345,7 @@ draw(p) ``` -In `Distributions` the `Categorical` type can alse have been used to construct this distribution, it being a special case of `DiscreteNonParametric` with the `xs` being $1, \dots, k$. +In `Distributions` the `Categorical` type can also have been used to construct this distribution, it being a special case of `DiscreteNonParametric` with the `xs` being $1, \dots, k$. The multinomial distribution is the distribution of counts for a sequence of $n$ iid random variables from a `Categorical` distribution. This generalizes the binomial distribution. Let $X_i$ be the number of type $i$ in $n$ samples. Then $X_1 + X_2 + \cdots + X_k = n$, so these are not independent. They have mean $E(X_i)=np_i$, variance $VAR(X_i) = np_i (1-p_i)$, like the binomial, but covariance $COV(X_i, X_j) = -np_i p_j, i \neq j$. (Negative, as large values for $X_i$ correlate with smaller values for $X_j$ when $i \neq j$.) diff --git a/Inference/inference.qmd b/Inference/inference.qmd index 002f14d..bd93e60 100644 --- a/Inference/inference.qmd +++ b/Inference/inference.qmd @@ -643,7 +643,7 @@ confint(OneSampleTTest(ys), level = 0.95) The two differ -- they use different sampling distributions and methods -- though simulations will show both manners create CIs capturing the true mean at the rate of the confidence level. -The above example does not showcase the advantage of the maximimum likelihood methods, but hints at a systematic way to find confidence intervals, which for some cases is optimal, and is more systematic then finding some pivotal quantity (e.g. the $T$-statistic under a normal population assumption). 
+The above example does not showcase the advantage of the maximum likelihood methods, but hints at a systematic way to find confidence intervals, which for some cases is optimal, and is more systematic than finding some pivotal quantity (e.g. the $T$-statistic under a normal population assumption). @@ -656,7 +656,7 @@ The basic setup is similar to a courtroom trial in the United States -- as seen * a defendant is judged by a jury with an *assumption of innocence* * presentation of evidence is given * the jury weighs the evidence *assuming* the defendant is innocent. -* If it is a civil trial a *preponderence of evidence* is enough for the jury to say the defendent is guilty (not innocent); if a criminal trial the standard is if the evidence is "beyond a reasonable doubt" then the defendent is deemed not innocent. Otherwise the defendant is said to be "not guilty," though really it should be that they weren't "proven" to be guilty. +* If it is a civil trial, a *preponderance of evidence* is enough for the jury to say the defendant is guilty (not innocent); if a criminal trial, the standard is if the evidence is "beyond a reasonable doubt" then the defendant is deemed not innocent. Otherwise the defendant is said to be "not guilty," though really it should be that they weren't "proven" to be guilty. In a hypothesis or significance test for parameters, the setup is similar: @@ -1388,7 +1388,7 @@ $$ H_0: \mu = \mu_0, \quad H_A: \mu = \mu_1 $$ -Suppose the population is $Normal(\mu, \sigma)$. 
We had a similar setup in the discussion on power, where for a $T$-test specifying three of a $\alpha$, $\beta$, $n$, or an effect size allows the solving of the fourth using known facts about the $T$-statistic. The Neyman-Pearson lemma speaks to the *uniformly most powerful* test under this scenario with a **single** unknown parameter (the mean above, but it could also have been the standard devation, etc.). +Suppose the population is $Normal(\mu, \sigma)$. We had a similar setup in the discussion on power, where for a $T$-test specifying three of $\alpha$, $\beta$, $n$, or an effect size allows the solving of the fourth using known facts about the $T$-statistic. The Neyman-Pearson lemma speaks to the *uniformly most powerful* test under this scenario with a **single** unknown parameter (the mean above, but it could also have been the standard deviation, etc.). This test can be realized as a likelihood ratio test, which also covers tests of more generality. Suppose the parameters being tested are called $\theta$ which sit in some subset $\Theta_0 \subset \Theta$. The non-directional alternative would be $\theta$ is in $\Theta \setminus \Theta_0$. diff --git a/_typos.toml b/_typos.toml index 1c651b5..b0446bc 100644 --- a/_typos.toml +++ b/_typos.toml @@ -1,2 +1,4 @@ [default.extend-words] Pn = "Pn" +annote = "annote" +Annote = "Annote" diff --git a/index.qmd b/index.qmd index 0ed87a3..6eabb32 100644 --- a/index.qmd +++ b/index.qmd @@ -4,7 +4,7 @@ This is a collection of notes for using `Julia` for introductory statistics. In case you haven't heard, [Julia](https://julialang.org/) is an open-source programming language suitable for many tasks, like scientific programming. It is designed for high performance -- Julia programs compile on the fly to efficient native code. Julia has a relatively easy to learn syntax for many tasks, certainly no harder to pick up than `R` and `Python`, widely used scripting languages for the tasks illustrated herein. -Why these notes on introductory statistics? No compelling reason save I had done something similar for `R` when `R` was a fledgling `S-Plus` clone. No more, `R` is a juggernaut, and it is almost certain `Julia` will never replace `R` as the programming langauage of choice for statistics. Besides, `Julia` users can already interface with `R` quite easily through `RCall`. 
*However*, there are some reasons that `Julia` could be a useful language when learning basic inferential statistics, especially if other real strengths of the `Julia` ecosystem were needed. So these notes show how `Julia` can be used for these tasks, and, hopefully, shows that it works pretty well. +Why these notes on introductory statistics? No compelling reason save I had done something similar for `R` when `R` was a fledgling `S-Plus` clone. No more, `R` is a juggernaut, and it is almost certain `Julia` will never replace `R` as the programming language of choice for statistics. Besides, `Julia` users can already interface with `R` quite easily through `RCall`. *However*, there are some reasons that `Julia` could be a useful language when learning basic inferential statistics, especially if other real strengths of the `Julia` ecosystem were needed. So these notes show how `Julia` can be used for these tasks, and, hopefully, show that it works pretty well. There are some great books published about using `Julia` [@bezanson2017julia] with data science, within which much of this material is covered. For example, @@ -29,11 +29,11 @@ Once downloaded and installed the `Julia` installation will provide a *command l Some alternatives to the REPL for interacting with `Julia` are: -* [IJulia](https://github.com/JuliaLang/IJulia.jl): This is a means to use the Jupyter interactive environment to interact with `Julia` through notebooks. It is made available by installing the package `IJulia` (details on package installation follow below). 
This relies on `Julia`'s seamless interaction with `Python` and leverages many technologies developed for that language. -* [Pluto](https://githuhttps://plutojl.org/): The Pluto environment provides a notebook interface for `Julia` written in `Julia` leveraging many JavaScript technologies for the browser. It has the feature of being reactive, making it well suited for many exploratory tasks and pedagogical demonstrations. +* [Pluto](https://plutojl.org/): The Pluto environment provides a notebook interface for `Julia` written in `Julia` leveraging many JavaScript technologies for the browser. It has the feature of being reactive, making it well suited for many exploratory tasks and pedagogical demonstrations. * [Visual Studio Code](https://www.julia-vscode.org/): `Julia` is a supported language for the Visual Studio Code editor of Microsoft, a programmer's IDE. -These notes use `quarto` to organize the mix of text, code, and graphics. The `quarto` publishing system is developed by [Posit](https://posit.co/), the developers of the wildly sucessful `RStudio` interface for `R`. The code snippets are run as blocks (within `IJulia`) and the last command executed is shown. (If code is copy-and-pasted into the REPL, each line's output will be displayed.) The code display occurs below the cell, as here, where we show that `Julia` can handle basic addition: +These notes use `quarto` to organize the mix of text, code, and graphics. The `quarto` publishing system is developed by [Posit](https://posit.co/), the developers of the wildly successful `RStudio` interface for `R`. The code snippets are run as blocks (within `IJulia`) and the last command executed is shown. (If code is copy-and-pasted into the REPL, each line's output will be displayed.) The code display occurs below the cell, as here, where we show that `Julia` can handle basic addition: ```{julia} 2 + 2