
R compatibility notes


This package aims to reproduce the numerical results of the R moderndive and infer packages, not just their API. In several places that means making a deliberate choice that differs from the “default” or “typical” Python/SciPy/statsmodels behavior. These choices are collected here so the differences are intentional and discoverable.


Statistics

  • pop_sd divides by n (population SD, ddof=0), unlike the sample sd/calculate(stat="sd"), which divide by n − 1. This is the same divisor as numpy.std’s default (ddof=0) and matches R moderndive::pop_sd.

  • get_regression_summaries: mse is the mean squared residual using n in the denominator (so rmse = sqrt(mse)), while sigma is the residual standard error using n − p. Both match the R package; note mse is not statsmodels’ mse_resid (which uses n − p).

  • GLM summaries use the log-likelihood-based BIC (bic_llf), not statsmodels’ deviance-based bic, to align with broom::glance.

  • get_p_value: the two-sided p-value is 2 × min(left, right), capped at 1. This is the infer convention, which can differ slightly from a symmetric-tail p-value.

  • F and Chisq are treated as inherently one-sided (right tail) for p-values regardless of the direction argument, matching infer.
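
The divisors and the two-sided rule described above can be checked by hand. The following is a minimal sketch using only the standard library; mse_and_sigma and two_sided_p_value are hypothetical helper names written for illustration, not this package's functions, and the tail definitions (<= and >=) are an assumption about how the left and right proportions are counted.

```python
import math
import statistics

x = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

# Population SD divides by n (equivalent to numpy.std(x, ddof=0)).
pop_sd = statistics.pstdev(x)
# Sample SD divides by n - 1 (equivalent to numpy.std(x, ddof=1)).
sample_sd = statistics.stdev(x)
print(pop_sd, sample_sd)


def mse_and_sigma(residuals, n_params):
    """Return (mse, sigma) for a fitted model with n_params coefficients.

    mse divides the residual sum of squares by n; sigma is the square
    root of the residual sum of squares divided by n - n_params.
    """
    n = len(residuals)
    rss = sum(r * r for r in residuals)
    return rss / n, math.sqrt(rss / (n - n_params))


def two_sided_p_value(null_stats, observed):
    """Two-sided p-value as 2 * min(left, right), capped at 1."""
    n = len(null_stats)
    left = sum(s <= observed for s in null_stats) / n   # assumed tail rule
    right = sum(s >= observed for s in null_stats) / n  # assumed tail rule
    return min(1.0, 2.0 * min(left, right))


mse, sigma = mse_and_sigma([1.0, -1.0, 1.0, -1.0], n_params=2)
print(mse, sigma)
print(two_sided_p_value([-2, -1, 0, 1, 2], observed=2))
```

The cap matters when the observed statistic sits near the middle of the null distribution: both tails can then exceed one half, and doubling the smaller one would otherwise give a value above 1.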


prop_test (mirrors R’s prop.test, not a plain z-test)

  • Chi-square statistic by default (with a chisq_df column), like R’s prop.test. Pass z=True for the signed z-statistic that a “typical” Python two-proportion test would report.

  • Yates’ continuity correction is on by default (correct=True), as in R.

  • Confidence intervals match R: a Wilson score interval for one proportion (not the Wald interval statsmodels returns by default), and a Wald interval widened by the continuity correction for a two-proportion difference.
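
To make the differences above concrete, here is a rough standard-library sketch of the textbook formulas involved. All function names are hypothetical and written for illustration; the Wilson interval below omits any continuity correction, so it is not claimed to reproduce prop_test's output digit for digit.

```python
import math
from statistics import NormalDist


def wald_interval(successes, n, conf_level=0.95):
    """Wald interval: p_hat +/- z * sqrt(p_hat * (1 - p_hat) / n)."""
    z = NormalDist().inv_cdf(0.5 + conf_level / 2)
    p = successes / n
    half = z * math.sqrt(p * (1 - p) / n)
    return p - half, p + half


def wilson_interval(successes, n, conf_level=0.95):
    """Wilson score interval without continuity correction."""
    z = NormalDist().inv_cdf(0.5 + conf_level / 2)
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half


def two_prop_z(x1, n1, x2, n2):
    """Signed pooled z-statistic for p1 - p2."""
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (x1 / n1 - x2 / n2) / se


def two_prop_chisq(x1, n1, x2, n2, correct=True):
    """Pearson chi-square on the 2x2 table, optionally Yates-corrected."""
    pooled = (x1 + x2) / (n1 + n2)
    observed = [x1, n1 - x1, x2, n2 - x2]
    expected = [n1 * pooled, n1 * (1 - pooled), n2 * pooled, n2 * (1 - pooled)]
    # The correction never exceeds the smallest |observed - expected|.
    yates = min(0.5, min(abs(o - e) for o, e in zip(observed, expected))) if correct else 0.0
    return sum((abs(o - e) - yates) ** 2 / e for o, e in zip(observed, expected))


print(wald_interval(0, 10))    # degenerate: both ends are 0
print(wilson_interval(0, 10))  # upper end stays above 0
print(two_prop_z(15, 50, 25, 50) ** 2, two_prop_chisq(15, 50, 25, 50, correct=False))
print(two_prop_chisq(15, 50, 25, 50, correct=True))
```

Without the continuity correction the chi-square statistic is the square of the pooled z-statistic, which is why the two reports agree on the p-value but not on the sign or scale of the statistic.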


Correlation

  • get_correlation drops nulls by default (na_rm=True) so beginners get a number rather than nan. R’s na.rm defaults to FALSE; pass na_rm=False to match R exactly.

  • method="spearman"/"kendall" use the SciPy implementations and match R’s cor(method=...).
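
The effect of the na_rm default can be illustrated with a small standalone sketch. pearson_with_na_rm is a hypothetical helper, not the package's get_correlation; it treats only float NaN as missing and drops incomplete pairs, which is an assumption about what "drops nulls" means here.

```python
import math


def pearson_with_na_rm(x, y, na_rm=True):
    """Pearson correlation; drop pairs containing NaN when na_rm is True."""
    pairs = list(zip(x, y))
    has_missing = any(math.isnan(a) or math.isnan(b) for a, b in pairs)
    if has_missing and not na_rm:
        # Mirrors R's behaviour with na.rm = FALSE: missing in, missing out.
        return math.nan
    pairs = [(a, b) for a, b in pairs if not (math.isnan(a) or math.isnan(b))]
    n = len(pairs)
    mean_x = sum(a for a, _ in pairs) / n
    mean_y = sum(b for _, b in pairs) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in pairs)
    sxx = sum((a - mean_x) ** 2 for a, _ in pairs)
    syy = sum((b - mean_y) ** 2 for _, b in pairs)
    return sxy / math.sqrt(sxx * syy)


x = [1.0, 2.0, 3.0, math.nan]
y = [2.0, 4.0, 6.0, 8.0]
print(pearson_with_na_rm(x, y))               # uses the three complete pairs
print(pearson_with_na_rm(x, y, na_rm=False))  # nan
```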


Regression points / tables

  • In-formula transformations are reshaped to match R: a transformed outcome like np.log(mpg) is shown on the model scale as log_mpg/log_mpg_hat, and transformed predictors (poly, scale, I) show their original columns rather than the patsy basis matrix.

  • get_regression_table prettifies categorical term names (income[T.High] → income: High) by default; default_categorical_levels=True keeps the raw statsmodels names.
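
As an illustration of the term-name rewriting, the sketch below converts a treatment-coded term of the form income[T.High] into income: High with a regular expression. prettify_term is a hypothetical name, and the pattern covers only simple column names; how the package handles wrapped terms such as C(income)[T.High] is not stated here, so those pass through unchanged.

```python
import re

# Matches "name[T.level]" where name is a plain identifier.
_TREATMENT_TERM = re.compile(r"(\w+)\[T\.([^\]]+)\]")


def prettify_term(term):
    """Rewrite 'income[T.High]' as 'income: High'; leave other terms alone."""
    return _TREATMENT_TERM.sub(r"\1: \2", term)


for raw in ["Intercept", "income[T.High]", "age"]:
    print(raw, "->", prettify_term(raw))
```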


Datasets

  • Datetime columns are stored in UTC. R’s nycflights time_hour is stored in America/New_York; the bundled Parquet stores the identical instants in UTC, so a displayed hour differs by the UTC offset (the integer hour column matches R).

  • early_january_2023_weather is derived from weather. The R dataset ships temp/dewp/humid/pressure as all-NA; this package recomputes the table from weather (Newark, first 15 days of Jan 2023) so those columns hold real values.
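
The UTC point is easiest to see with a single timestamp. This sketch uses only the standard library and a made-up instant (not a row from the bundled data) to show that converting to America/New_York changes the displayed hour but not the instant itself. It assumes the system has time zone data available to zoneinfo.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

# A timezone-aware instant in UTC, as the bundled data is described as storing.
stored_utc = datetime(2023, 1, 1, 5, 0, tzinfo=timezone.utc)

# The same instant displayed in the zone R uses for time_hour.
new_york = stored_utc.astimezone(ZoneInfo("America/New_York"))

print(stored_utc.hour)         # 5
print(new_york.hour)           # 0 (EST is UTC-5 in January)
print(stored_utc == new_york)  # True: aware datetimes compare by instant
```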


Plotting

  • plotly is the default engine (the book is moving to interactive plots); pass engine="plotnine" anywhere for grammar-of-graphics output. R returns ggplot2 objects.

  • visualize’s default bins is 20 (R’s is 15).

  • Two-sided p-value shading mirrors the observed statistic about 0, matching infer’s shade_p_value.


Reproducibility

  • Pass seed= to generate() / rep_slice_sample() for reproducible draws (R uses set.seed()). Identical seeds will not reproduce R’s exact random draws: only the statistical behavior matches, not the specific RNG stream.
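
The idea behind a seed= argument can be sketched without the package. bootstrap_means below is a hypothetical stand-in, not generate() or rep_slice_sample(); it shows that a seeded generator makes resampling repeatable within Python, which is a separate matter from matching R's stream.

```python
import random
import statistics


def bootstrap_means(values, reps, seed=None):
    """Resample with replacement `reps` times and return each resample's mean."""
    rng = random.Random(seed)  # private generator; global state is untouched
    n = len(values)
    return [statistics.fmean(rng.choices(values, k=n)) for _ in range(reps)]


data = [3.1, 4.7, 5.0, 5.9, 6.4, 7.2]
first = bootstrap_means(data, reps=5, seed=42)
second = bootstrap_means(data, reps=5, seed=42)
print(first == second)  # True: the same seed gives the same draws
```

Using a private random.Random instance rather than reseeding the module-level generator keeps one call's seed from affecting unrelated code.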
