Karol Flisikowski – Statistics, data analysis and more

Welcome
About me
Teaching
Why R?
E-learning
Skippy

Hi there, thanks for visiting my website!

You can find here information about my scientific and didactic work
(courses, software used in my data analysis labs, topics of my diploma theses)
and about students’ scientific Data Science Club.

Currently I’m employed at Gdansk University of Technology – Department of Statistics and Econometrics and at Chongqing Technology and Business University – International College (as Associate Professor).

I’m responsible for lecturing descriptive and mathematical statistics (Statistics I-II) and Data Analysis with R/Python. Recently my research interests focus on the studies of credit constraints.

In June, 2020 I received “Masters of Didactics” grant from the Ministry of Higher Education and Science for years 2020-2023 (GUT). Since May, 2022, together with Data Science Club I am working on two research grants (IDUB Plutonium and Technetium).

“Statistics is the grammar of science“
— Karl Pearson

Gdansk University of Technology (2007-now) – Department of Statistics & Econometrics – Lecturer of various undergraduate, graduate and postgraduate courses in statistics (i. e. Mathematical Statistics, Descriptive Statistics, Data Analysis).

I-Shou University (2018-2019) – International College / Department of International Finance – Lecturing: descriptive and mathematical statistics (Statistics I-II), Econometrics, Economics I, Information Systems in Management, Information Systems in Finance I-II.

Chongqing Technology and Business University (2020-now) – International College – Lecturing: Big Data Analysis with Applications.

First, don’t forget to install R & R-Studio. If you are new to R – please read the complete tutorial on R here. If you are more familiar with Python, you can use it in the lab as well. I really recommend to use this e-book: “Introduction to Probability and Statistics Using R” or textbook: “Practical Statistics for Data Scientists 50+ Essential Concepts Using R and Python“.
Recommended DataCamp courses for descriptive statistics: “Exploratory Data Analysis in R“; “Intermediate Data Visualization with ggplot2“; “Introduction to Statistical Modeling in R“; “Introduction to Regression in R“; “Correlation and Regression in R“; “Time Series Analysis in R” [ask me for free access or contact our Data Science Club]
The R-Bootcamp is a gentle and gradual introduction to manipulating and visualizing data in R. A collection of over 250 (mostly) free R programming books – “Big Book of R“.
If you are looking for help, use search engines like R-Seek, Search-R & forum Stack Overflow or R-Bloggers.
I also recommend “cheat-sheets” – posters that will guide you through some R external packages – the whole collection of them is available here.
For those who are more interested in data analysis, I recommend joining our “Data Science Club” – details on our fanpage, FB group, website or Discord channel.
Recommended DataCamp courses for statistical inference: “Foundations of Inference“; “Foundations of Probability in R“; “Inference for Numerical Data in R“; “Inference for Categorical Data in R“; “A/B Testing in R“; “Hypothesis Testing in R“; “ARIMA Models in R” [ask me for free access or contact our Data Science Club]
The R-Bootcamp is a gentle and gradual introduction to manipulating and visualizing data in R. A collection of over 250 (mostly) free R programming books – “Big Book of R“.

In my opinion, this claim "Why not use Python for statistical analysis" is rooted in:

the fact that R (and originally S), was created as a dedicated tool just for statisticians, as a specialized purpose language, Domain-Specific Language (Python is a General-Purpose Language). It was a simplified language, while giving access to a specific data structure, like a frame (data.frame). For about 40 years, as multitudes of statisticians "grew up" on it, it became the lingua franca of statistics. New specialized methods were implemented in it , knowledge was exchanged by illustrating them with programs in R (CrossValidated, for example), articles and books using it were published. It's a bit like Latin in the past, English now and Turbo-Pascal in numerical methods in the 70-90s :)

the availability of specialized statistical methods in R. While in the field of modern methods of so-called ML (Machine Learning; machine learning) Python seems to become the undisputed leader, in the field of classical (small-data) statistics, i.e. basic tests and models Python "makes it" (statsmodels is quite OK), already in the case of specialized statistics, such as clinical biostatistics, Python does not offer much.

In this field, among OpenSource tools, only R is a reasonable alternative (because not competition) to SAS (SAS has been number one in clinical research since 1980 and is likely to be so for 2 decades[1] ). Probably for this reason, among others, Python is virtually non-existent in clinical research. One would also have to ask actuaries and econometricians - how about Python with coverage of the methods they use.

To confirm the second point, I have prepared a list of missing statistical tools in Python (as of 2020):

Lack of modern non-parametric methods for factorial and longitudinal (repeated measures) systems, including methods such as ATS (ANOVA-Type Statistic), WTS (Wald-Type Statistic), ART (Aligned-Rank Transform).

No frequentist procedure (there is only Bayesian, and in clinical research this is not the standard) of generalized and nonlinear mixed models, with a user-defined residual correlation structure. And this means little usefulness when analyzing data in repeated observations. In R, we have glmmTMB, glmmPQL, nlme (only general linear model) for this. The lme4 package, unfortunately, does not allow for this and is, on average, suitable for more serious analysis in this area, unless you fit "unstructured" covariance (convergence problems from lack of degrees of freedom) or "compound symmetry" mastics (but this is not a very realistic assumption). The successor to lme4 is precisely glmmTMB.

No vector generalized linear and additive models (VGLM, VGAM), and this is one of the basic classes in this field.

Lack of quantile linear mixed models (quantile linear mixed models). Lack of procedures to determine reference ranges.

Lack of confidence intervals for proportion differences (important vs. non-inferiority studies) Farrington and Manning's, Gart and Nam's, Miettinen and Nurminen's.

No advanced methods for determining sample size and power for clinical trials. Only basic, "textbook" methods available, practically of little use except in the simplest cases. What is really needed is missing, including 3-arm "Gold Standard" type studies, Williams MED, adaptive design, longitudinal studies (mixed models, GEE, WGEE).

Lack of advanced study design methods (adaptive, multi-arm (MAMS). Conditional power (conditional power) is lacking. This is an absolute foundation in modern research and at the same time one of the most complex issues in the field.

Lack of more sophisticated methods in survival analysis (multiple events - e.g. Andersen-Gill, competing risks (let's call them shorter, co-risks) - e.g. Fine-Gray, Cause-Specific Hazards). The Kaplan-Meier method does not have "confidence bands" (not to be confused with confidence intervals). I did not find the Beta Product Confidence Procedure (BPCP).

The Turnbull estimator for "interval-censored" observations (interval-censored) and Cox for such data (IntCox) are missing. The Pepe-Mori test for correlated co-risks is missing. No Wilcoxon-Prentice test for dependent data.

Lack of "joint-frailty" models (I don't undertake the translation) that allow for analysis of multiple events (e.g., recurrence) and co-risks (e.g., death) together in survival analysis.

I did not find multi-state (multi-risk) models, commonly used in survival analysis in clinical trials.

No "tipping point" type analysis - "tipping point" (sensitivity analysis in the context of missing data)

Lack of comparably (to R) advanced meta-analysis.

Lack of more advanced confidence intervals for the median, such as Nyblon, as well as an exact method (there is only a basic one, based on a binomial distribution).

Lack of more complex contrasts (relevant to clinical trials): Sequen, AVE, Changepoint, Williams and Umbrella-protected Williams, Marcus, Mc7Dermott.

I haven't found a comparably advanced framework for permutation linear models (in R it's e.g. lmPerm, permuco), although there are some first steps in scikit-bio.

I haven't found an implementation of LS-means (in R called EM-means; estimated marginal means), which are the absolute standard in clinical trials for modeling and reporting effects.

I'm not sure if marginal effects are available for the entire GLM class or only for logistic regression, but before performing analyses using them, I would suggest making sure in advance (or counting them yourself).

I have not found more sophisticated methods of multiple comparisons, especially those relevant to clinical trials: hierarchical hypothesis testing (relevant when we have hypotheses ordered according to a certain hierarchy, e.g. non-inferiority → superiority); Fixed sequence procedure; Fallback procedure; Gatekeeping (serial, parallel), e.g. Truncated Hochberg, tr. Holm, tr. Hommel so-called sequential online testing graphical approach.

Non-parametric methods for these contrasts: Tukey, Dunnett, Sequen, Williams, Changepoint, McDermott, Marcus, Marcuss-Umbrella protected.

Some of the items on this list could be implemented fairly quickly, but many of them are very complex, requiring a lot of time, expertise and access to other specialized packages, numerical validation. These are all tasks for multi-person teams of specialists, and for a long time (especially if it's an after-hours job). Besides, there are problems with this in R itself. Not to mention validation (required by regulations) against SAS (reference software in pharma) and possibly Stata.

To be honest, I would not be able to work effectively in my profession using Python alone. I would either have to implement the missing methods myself, or refer to R.

If a person does not need any of the above methods to be happy, he can successfully use Python's arsenal. However, it can be seen that, at least in terms of specialized (non-ML) methods, R offers much more.

Note, this does not mean that Python is generally "bad" in this area, it will successfully fulfill its role in many applications. It's just - there are areas where R provides a larger set of tools, and since, many of these tools are "interdisciplinary", hence the general statement, "R is better for statisticians", but no longer for "data scientists" (I don't know how to translate it sensibly :) , who use other methods (ML, AI) on a daily basis. The latter seem to strongly prefer Python.

GET IN TOUCH

Schedule a Visit

Contact us

As dean’s proxy for e-learning development I’m providing support for the faculty members, organizing trainings, monitoring online courses and evaluating their quality.

As the university coordinator for e-learning I’m involved in the process of certification of lecturers (according to the current ministry-level regulations its obligatory for each faculty member) issuing a certificate confirming skills in the field of e-course design. Since 2015 I work also as university moodle technical administrator.

I am also holding several ICT in higher education, e-teacher, e-methodology and AACSB granted certificates. In December, 2022 I received “Masters of Didactics” certificate from the Ministry of Higher Education and Science and Rijksuniversiteit Groningen (Netherlands). Since Jan, 2023 I am also certified tutor (Collegium Wratislaviense, Poland).

“I know that I’m just a dog but…

If you feel sad, I’ll be your smile.

If you cry, I’ll be your comfort.

And if someone breaks your heart,

we can use mine to live.

E-courses

Moodle e-courses moderated by me (Data Analysis with R/Python -> Statistics).

Data Science Club

Please ask to join our students’ club.

DataCamp

For active students – free access to DataCamp platform.

A Handbook for Statistical Analysis with R and R-Studio

Statistics with R

Users who believe Python is the best tool for statistical analyses, probably might see some R advantages in the end of the study. We will start with preparing infrastructure for the analysis. Then, the quick tour through the R programming basics. Finally, two most important chapters of this book will walk you through the descriptive and inferential statistical data analysis. This online textbook is interactive: it means in many places reader will have the chance to go through tutorials, answer some questions, solve some tasks to go to the next page.

Read textbook