diff --git a/EDA/tabular-data-julia.qmd b/EDA/tabular-data-julia.qmd
index 2fabe21..8eed7b1 100644
--- a/EDA/tabular-data-julia.qmd
+++ b/EDA/tabular-data-julia.qmd
@@ -673,17 +673,27 @@ flatten(dxy, :value)
The split-apply-combine strategy for wrangling data frames was popularized in the influential paper "*The Split-Apply-Combine Strategy for Data Analysis*" [@JSSv040i01] by H. Wickham, which introduces this description: "where you break up a big problem into manageable pieces, operate on each piece independently and then put all the pieces back together."
+#### Grouping data
The `groupby` function for data frames does the breaking up, the `DataFrames` mini language can operate on each piece, and the `combine` function can put things together again.
-The `groupby` function splits a data frame into pieces of type `GroupedDataFrame`. The basic signature is `groupby(df, cols; sort=nothing, skipmissing=false)`. The `cols` are one or more column selectors. The value of `sort` defaults to `nothing` leaving the order of the groups in the result undefined, but the grouping uses the fastest identified algorithm. The `skipmissing` argument can instruct the skipping of rows which have missing values in the selected columns.
+Grouping takes one or more *categorical variables* and creates a "grouped" data frame with one sub data frame for each combination of levels that appears in the data.
+The `groupby` function splits the data frame into pieces of type `GroupedDataFrame`. The basic signature is `groupby(df, cols; sort=nothing, skipmissing=false)`. The `cols` are one or more column selectors. The value of `sort` defaults to `nothing`, leaving the order of the groups in the result undefined but allowing the fastest grouping algorithm to be chosen. When `skipmissing=true`, rows with missing values in the selected columns are skipped.
-For example, grouping by loading style, returns two sub data frames.
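Conceptually, grouping collects row indices by the observed key, and `skipmissing=true` drops rows whose key is `missing`. A minimal sketch in base `Julia` (a toy illustration with made-up loading styles, not the `DataFrames` implementation) might be:

```julia
# Toy sketch of grouping: collect row indices by level,
# dropping rows whose grouping value is `missing`.
loads = ["pre", missing, "TSA", "pre"]

groups = Dict{String, Vector{Int}}()
for (i, l) in enumerate(loads)
    ismissing(l) && continue          # the `skipmissing=true` behavior
    push!(get!(groups, l, Int[]), i)  # append this row's index to its group
end

sort(collect(keys(groups)))   # ["TSA", "pre"]: two groups remain
```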
+For example, grouping the baggage data by loading style returns two sub data frames.
```{julia}
gdf = groupby(d, :load)
```
+Grouping by more than one variable is achieved by passing a selector with multiple variables. The following also creates two sub data frames, but only because the observed combinations of load and date are the same as the levels of load alone:
+
+```{julia}
+gdf = groupby(d, [:load, :d])
+```
+
+
 Indices may be used to get each sub data frame. Moreover, the `keys` method returns keys, like indices, to efficiently look up the different sub data frames in `gdf`. The levels for the keys are found in the property `.load`, as with `keys(gdf)[1].load`. Grouped data frames may be iterated over; the `pairs` iterator iterates over both keys and values. The `select` and `subset` functions work over `GroupedDataFrame` objects; here we see the returned values are combined:
@@ -694,12 +704,18 @@ subset(gdf, :p => ByRow(<=(350)))
(The command `[subset(g, :p => ByRow(<=(350))) for g in gdf]` avoids the `combine` step.)
+#### Group and combine
+
We can see that `combine` also works over `GroupedDataFrame` objects. This example shows how to find the mean city mileage of groups formed by the value of `:Origin` in the `cars` data set:
```{julia}
combine(groupby(cars, :Origin), :MPGCity => mean)
```
+Typically, the function applied to a column is a reduction, which results in *aggregated data*.
+
+By default, the keys used to group (`:Origin`) are kept in the output of `combine`; the `keepkeys` argument to `combine`, when applied to grouped data, controls this feature.
+
More than one column can be used to group the data. This example first groups by the number of cylinders, then groups by origin. For each of the 9 resulting groups, the group mean for city mileage is taken:
```{julia}
@@ -857,7 +873,7 @@ end
#### The SplitApplyCombine package
There is support for the split-apply-combine paradigm outside of the `DataFrames` package in the package `SplitApplyCombine`.
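As an aside, the whole split-apply-combine pipeline can be mimicked in base `Julia` with only the `Statistics` standard library. This sketch (the origins and mileages are invented, not the `cars` data) mirrors the grouped mean computed above:

```julia
using Statistics: mean

# Invented data: (origin, city mpg) pairs
rows = [("USA", 22), ("Japan", 30), ("USA", 18), ("Japan", 34)]

# split: gather the mileages by origin
bykey = Dict{String, Vector{Int}}()
for (origin, mpg) in rows
    push!(get!(bykey, origin, Int[]), mpg)
end

# apply and combine: reduce each group with `mean`
avgs = Dict(k => mean(v) for (k, v) in bykey)
avgs["USA"], avgs["Japan"]   # (20.0, 32.0)
```

Packages such as `SplitApplyCombine` and `DataFrames` wrap this pattern behind higher-level verbs.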
This package is much lighter weight than `DataFrames`.
-This support targets other representations of tabular data, such as a vector of nested tuples:
+This support targets other representations of tabular data, such as a vector of named tuples:
```{julia}
@@ -871,13 +887,13 @@ tbl = [
Many of the main verbs are generic functions: `filter`, `map`, `reduce`; `group` does grouping. There are some others. A few examples follow.
-For indexing this structure, we can't index by `[row, col]`, rather we can use `[row][col]` notation, the first to extract the row, the second to access the entries in the column. For example, he we index a selected row by position, by name, and then with multiple names:
+For indexing this structure, we can't index by `[row, col]`; rather, we use `[row][col]` notation: the first bracket extracts the row, the second accesses the entries within it. For example, here we index the entries of a selected row by position, by name, and then with multiple names:
```{julia}
tbl[2][2], tbl[2].v, tbl[2][ [:v, :p] ]
```
-The `invert` function reverses the access pattern, allowing in this case, `[col][row]` access, with:
+The `invert` function from `SplitApplyCombine` reverses the access pattern, allowing, in this case, `[col][row]` access, with:
```{julia}
itbl = invert(tbl)
@@ -1059,12 +1075,28 @@ cars_price = select(cars, 1:3, r"Price" => ByRow(_mean) => :price)
first(cars_price, 3)
```
-To aggregate over columns, the `groupby` construct can be used to specify the groupings to aggregate over. In the following this is the car type:
+To aggregate a column over groups of rows, the `groupby` construct can be used to specify the grouping. In the following, this is the car type.
The `Chain` style is used to make the steps more explicit:
```{julia}
-combine(groupby(cars_price, :Type), :price => mean => :avg_price)
+@chain cars_price begin
+    groupby(:Type)
+    combine(:price => mean => :avg_price)
+end
```
+As `mean` is a reduction, the result has one row for each unique combination of levels of the columns passed to `groupby`. Again, that these columns appear in the output is due to the default value of `keepkeys`.
+
+A similar example groups by two variables and then aggregates highway miles per gallon over each group:
+
+```{julia}
+@chain cars begin
+    groupby([:Type, :DriveTrain])
+    combine(:MPGHighway => mean => :avg_MPG)
+    sort(:avg_MPG, rev=true)
+end
+```
+
+
### Tidy data; stack, unstack
@@ -1078,7 +1110,35 @@ These terms are defined by
> A dataset is a collection of values, usually either numbers (if quantitative) or strings (if qualitative). Values are organised in two ways. Every value belongs to a variable and an observation. A variable contains all values that measure the same underlying attribute (like height, temperature, duration) across units. An observation contains all values measured on the same unit (like a person, or a day, or a race) across attributes.
-In addition to `flatten`, the `DataFrames` package provides the functions `stack` and `unstack` for tidying data. We illustrate with an abbreviated version of an example in the `tidyr` vignette.
+In addition to `flatten`, the `DataFrames` package provides the functions `stack` and `unstack` for tidying data.
+
+::: {.callout-note}
+
+### Stack and unstack
+The `stack` and `unstack` commands move data from "wide" format, where related values are stored in multiple columns whose names represent some level, to "long" format, where related values are stored in one column and their categorical levels in another, and back again.
+
+For example, this data frame is in wide format, as the attributes are in columns `:A` and `:B` and the values in columns `:C` and `:D`.
+
```{julia}
wide = DataFrame(A=["A","B"], B=["a","b"], C=["c1","c2"], D=["d1","d2"])
```

With the `stack` function, the values are placed into a single column (by default `:value`) and the variable names into a single column (by default `:variable`):

```{julia}
long = stack(wide, [:C, :D])
```

The values in the `:variable` column come from the column names of the wide data frame.

With `unstack`, the data can be spread back out into "wide" format. We pass the two columns holding the variable names and the values to `unstack`:

```{julia}
unstack(long, :variable, :value)
```

:::

We illustrate this further with an abbreviated version of an example in the `tidyr` vignette. Consider this data from the Global Historical Climatology Network for one weather station (MX17004) in Mexico for five months in 2010. The full data has 31 days per month; this abbreviated data works with just the first 8:
diff --git a/Inference/distributions.qmd b/Inference/distributions.qmd
index 980cafe..0e713fe 100644
--- a/Inference/distributions.qmd
+++ b/Inference/distributions.qmd
@@ -157,7 +157,7 @@ These imprecise statements can be made more precise, as will be seen. Figure @fi
```{julia}
#| echo: false
#| label: fig-pop-parameters-sample
-#| fig-cap: The top-left figure shows a population, the lower left parameters associated with the population. The top-right shows one sample from the population in red, other possible samples in blue with sample means summarized in the bottom most row. A density estimate of these sample means appears in the lower right figure showing how it is centered on the population, but more concentrated. Inference uses the language of probability to characterize a population parameter based on a single random sample.
+#| fig-cap: The top-left figure shows a population; the lower-left, the parameters associated with the population.
The top-right shows one sample from the population in red and other possible samples in blue, with sample means summarized in the bottommost row. A density estimate of these sample means appears in the lower-right figure, showing how it is centered at the population mean but more concentrated. Inference uses the language of probability to characterize a population parameter based on a single random sample.
μ, σ = 0, 1
P = Normal(μ, σ)
xs = range(-3, 3, length=251)
@@ -483,16 +483,14 @@ In @fig-qqplots-distributions the lower-right graphic is of the uniform distrib
#| fig-cap: Quantile-normal plots for different distributions -- $T_3$ is leptokurtic; $T_{100}$ is approximately normal; $U$ is platykurtic; and $E$ is skewed.
probs = range(0.01, 0.99, 40)
-Ds = (T₃ = TDist(3), T₁₀₀=TDist(100), U=Uniform(0,1), E=Exponential(1))
-d = DataFrame(D=collect(keys(Ds)), qs = [quantile.(D,probs) for D in Ds])
-d = flatten(d, :qs)
-#data(d) * visual(QQNorm, qqline=:fit) * mapping(:qs, main=:D, color=:D, layout=:D)
-
-d.Z = rand(Normal(0,1), size(d,1))
-data(d) * visual(QQPlot, qqline=:fit) * mapping(:Z, :qs, layout=:D) |> draw
+Ds = (N = Normal(0,1), T₃ = TDist(3), T₁₀₀=TDist(100),
+      U=Uniform(0,1), E=Exponential(1))
+m = DataFrame([quantile.(D, probs) for D in Ds], collect(keys(Ds)))
+d = stack(m, Not(:N)) # stack all but :N; QQPlot compares each against the N column
+data(d) * mapping(:N, :value, layout=:variable => nonnumeric) *
+    (visual(QQPlot, qqline=:fit)) |> draw
```
-
Consider now two independent Chi-squared random variables $Y_1$ and $Y_2$ with $\nu_1$ and $\nu_2$ degrees of freedom. The ratio
$$
@@ -511,14 +509,18 @@ kurtosis(FDist(5, 10)), kurtosis(FDist(5,100)), kurtosis(FDist(100,100))
#| echo: false
#| label: fig-f-distributions
#| fig-cap: Quantile-normal plots of the $F$ distribution for different degrees of freedom. As the degrees of freedom get bigger, the distribution becomes more normal.
-νs = [(1, 10), (5,10), (5, 100), (100, 100)]
+
probs = range(0.01, 0.99, 40)
-d = DataFrame(D = ["F($ν₁,$ν₂)" for (ν₁, ν₂) ∈ νs],
-              qs = [quantile.(FDist(ν...), probs) for ν ∈ νs])
-d = flatten(d, :qs)
-#data(d) * visual(QQNorm, qqline=:fit) * mapping(:qs, main=:D, color=:D, layout=:D) |> draw
-d.Z = rand(Normal(0,1), size(d,1))
-data(d) * visual(QQPlot, qqline=:fit) * mapping(:Z, :qs, layout=:D) |> draw
+Ds = (N = Normal(0,1),
+      var"F(1,10)" = FDist(1,10),
+      var"F(5,10)" = FDist(5,10),
+      var"F(5,100)" = FDist(5,100),
+      var"F(100,100)" = FDist(100,100)
+      )
+m = DataFrame([quantile.(D, probs) for D in Ds], collect(keys(Ds)))
+d = stack(m, Not(:N)) # need to use QQPlot, so replicate the N
+data(d) * mapping(:N, :value, layout=:variable => nonnumeric) *
+    (visual(QQPlot, qqline=:fit)) |> draw
```
@@ -557,18 +559,19 @@ In general, how large $n$ needs to be depends on the underlying shape of the *po
xbar(D, n, N=200) = [mean(rand(D, n)) for _ in 1:N]
sxbar(D, n, N=200) = (xbar(D,n,N) .- mean(D)) / (std(D)/sqrt(n))
-Ds = (normal = Normal(0,1),
+Ds = (N = Normal(0,1),
+      normal = Normal(0,1),
      leptokurtic = TDist(3),
      skewed = FDist(2,5),
      platykurtic = Uniform(0,1))
-df = DataFrame(D=collect(keys(Ds)), zs = [sxbar(D, 10) for D in values(Ds)])
-d = flatten(df, :zs)
-#p = data(d) * visual(QQNorm, qqline=:fit) *
-#    mapping(:zs, layout = :D, main=:D, color=:D)
-# draw(p)
-d.Z = rand(Normal(0,1), size(d,1))
-data(d) * visual(QQPlot, qqline=:fit) * mapping(:Z, :zs, layout=:D) |> draw
+probs = range(0.01, 0.99, 200)
+m = DataFrame([quantile.((sxbar(D,10),), probs) for D in Ds], collect(keys(Ds)))
+d = stack(m, Not(:N)) # need to use QQPlot
+p = data(d) * mapping(:N, :value, layout=:variable => nonnumeric) *
+    (visual(QQPlot, qqline=:fit))
+
+draw(p)
```
@@ -670,8 +673,10 @@ Pops = (var"long-tailed"=TDist(3), skewed=Exponential(1))
df = DataFrame([(D="$SDk $Popk", x=quantile(Popv, ps), y = quantile(rand(SDv, N), ps))
                for (SDk,SDv) in pairs(SDs) for (Popk, Popv) in pairs(Pops)])
-p =
data(flatten(df, [:x,:y])) * (visual(Scatter) + linear(;interval=nothing)) *
-    mapping(:x, :y,layout=:D, color=:D)
+d = data(flatten(df, [:x,:y]))
+p = d * (visual(Scatter) + linear(; interval=nothing)) *
+    mapping(:x, :y; layout=:D, color=:D)
+
draw(p, facet=(; linkxaxes=:none, linkyaxes=:none))
```
diff --git a/Inference/inference.qmd b/Inference/inference.qmd
index bd93e60..001906e 100644
--- a/Inference/inference.qmd
+++ b/Inference/inference.qmd
@@ -37,7 +37,7 @@ xs = rand(Y, n); ys = zeros(n);
d = data((x=xs, y=ys))
p = deepcopy(p₀)
p = p + d * visual(Scatter) * mapping(:x,:y)
-p = p + d * visual(Density; color=:black, alpha=0.15) * mapping(:x)
+p = p + d * visual(Density; colormap=(:gray10, 0.15)) * mapping(:x)
draw(p; axis=(; title="n = $n"))
```
@@ -49,7 +49,7 @@ xs = rand(Y, n); ys = zeros(n);
d = data((x=xs, y=ys))
p = deepcopy(p₀)
p = p + d * visual(Scatter) * mapping(:x,:y)
-p = p + d * visual(Density; color=:black, alpha=0.15) * mapping(:x)
+p = p + d * visual(Density; colormap=(:gray10, 0.15)) * mapping(:x)
draw(p; axis=(; title="n = $n"))
```
@@ -61,7 +61,7 @@ xs = rand(Y, n); ys = zeros(n);
d = data((x=xs, y=ys))
p = deepcopy(p₀)
p = p + d * visual(Scatter) * mapping(:x,:y)
-p = p + d * visual(Density; color=:black, alpha=0.15) * mapping(:x)
+p = p + d * visual(Density; colormap=(:gray10, 0.15)) * mapping(:x)
draw(p; axis=(; title="n = $n"))
```
@@ -73,7 +73,7 @@ xs = rand(Y, n); ys = zeros(n);
d = data((x=xs, y=ys))
p = deepcopy(p₀)
p = p + d * visual(Scatter) * mapping(:x,:y)
-p = p + d * visual(Density; color=:black, alpha=0.15) * mapping(:x)
+p = p + d * visual(Density; colormap=(:gray10, 0.15)) * mapping(:x)
draw(p; axis=(; title="n = $n"))
```
diff --git a/README-quarto.md b/README-quarto.md
new file mode 100644
index 0000000..073bfd9
--- /dev/null
+++ b/README-quarto.md
@@ -0,0 +1,7 @@
+To get this published:
+
+```
+quarto render # check for Errors...
+quarto publish gh-pages --no-render
+
+```
diff --git a/_quarto.yml b/_quarto.yml
index 6f81379..6c29c12 100644
--- a/_quarto.yml
+++ b/_quarto.yml
@@ -3,6 +3,8 @@ engines: ['julia']
project:
  type: book
+  output-dir: _book
+
book:
  title: "Using Julia for Introductory Statistics"
@@ -26,9 +28,10 @@ book:
    - EDA/makie.qmd
    - Inference/distributions.qmd
    - Inference/inference.qmd
+#    - Inference/permutation_tests.qmd
    - LinearModels/linear-regression.qmd
-    - references.qmd
#    - LinearModels/glm.qmd
+    - references.qmd
bibliography: references.bib
csl: the-american-statistician.csl
@@ -61,6 +64,6 @@ format:
execute:
  warning: false
+  freeze: false # auto
  error: false
-#  freeze: auto
  cache: false
diff --git a/check_version.jl b/check_version.jl
deleted file mode 100644
index c41c4f3..0000000
--- a/check_version.jl
+++ /dev/null
@@ -1,7 +0,0 @@
-using Purl
-
-chapters = ("EDA","Inference","LinearModels")
-file_extension(file::String) = file[findlast(==('.'), file)+1:end]
-
-for ch ∈ chapters
-    qmd = filter(x -> file_extension(x) == "qmd",
diff --git a/index.qmd b/index.qmd
index 6eabb32..3fd18a8 100644
--- a/index.qmd
+++ b/index.qmd
@@ -12,6 +12,7 @@ science, within which much of this material is covered. For example,
[@storopolihuijzeralonso2021juliadatascience] is very well done,
[@nazarathy2021statisticsjulia] covers topics here (cf. the [JuliaCon Workshop](https://www.youtube.com/watch?v=IlPoU5Yr2QI)).
+[@RomeoAndJulia] covers many of the topics here from a biologist's viewpoint.
The quarto book [Embrace Uncertainty](https://juliamixedmodels.github.io/EmbraceUncertainty/) [@Embrace-Uncertainty-Fitting-Mixed-Effects-Models-with-Julia] covers the more advanced topic of Mixed-effects Models in `Julia`. Nothing here couldn't be found in those resources; these notes are just an introduction.
diff --git a/references.bib b/references.bib index 5f550bd..cce369d 100644 --- a/references.bib +++ b/references.bib @@ -117,6 +117,14 @@ @Book{JuliaForDataAnalysis year = 2022, url = {https://www.manning.com/books/julia-for-data-analysis}} +@book{RomeoAndJulia, + title = {Romeo and Julia where Romeo is Basic Statistics}, + author = {B Lukaszuk}, + url = {https://b-lukaszuk.github.io/RJ_BS_eng/}, + year = {2023}, +} + + @article{JSSv107i04, title={DataFrames.jl: Flexible and Fast Tabular Data in Julia}, volume={107},