Modern approaches to organizing statistical workflows and documentation

I’m looking for current best practices when it comes to structuring statistical analysis projects and creating reports. Most of my work involves one-time analyses for research papers that won’t need to be rerun once published.

I’m wondering about several aspects of workflow organization. Should I bother creating formal packages for code that’s essentially throwaway after publication? It seems like extra work when I’m not planning to share the code widely.

I’m also curious about effective ways to organize datasets and results. What folder structures work well? And I keep hearing about makefiles for automation but I’m not sure how they fit into statistical workflows.

Any recommendations for tools or methodologies that strike a good balance between being organized and not over-engineering simple analysis projects would be really helpful.

Even temporary projects need some structure - it saves you headaches when you’re revising months later. I keep one master script that runs all my analysis files in order. No more guessing which code created which results. For notes, I use a simple text file where I jot down key decisions and parameter choices as I go. Way easier than trying to remember everything later. Version control is a lifesaver when coauthors want different analyses or journals ask for extra statistical tests. The upfront time you spend organizing pays off when you need to tweak analyses or respond to reviewers.
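
A minimal sketch of what that master script can look like, assuming the analysis is split into numbered R scripts (the file names here are placeholders):

```r
# run_all.R -- runs every step of the analysis in order.
# The script names below are placeholders; use whatever your project has.
source("01_clean_data.R")    # raw data -> analysis dataset
source("02_fit_models.R")    # fit the models reported in the paper
source("03_make_figures.R")  # figures and tables for the manuscript
```

Rerunning this one file regenerates every result, so there is never any doubt about which script produced what.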

I used to skip organization completely, until I had to recreate an analysis for a paper revision six months later. Total nightmare.

Now I do a minimal setup: one config file at the top with all the parameters and file paths. When I need to change something, I edit one place instead of hunting through ten scripts.
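
Roughly what that looks like, as a sketch with made-up paths and parameters, assuming the config lives in its own config.R that every script sources:

```r
# config.R -- single place for parameters and file paths (all values are placeholders).
raw_data_path   <- "data/raw/raw_data.csv"
clean_data_path <- "data/processed/clean_data.csv"
results_dir     <- "output"

alpha       <- 0.05     # significance level used throughout
n_bootstrap <- 2000     # bootstrap replicates
seed        <- 20240301
set.seed(seed)          # same seed for every run
```

Each analysis script then starts with source("config.R"), so a changed path or parameter gets edited in exactly one place.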

For makefiles - I love them for stats work, not for anything fancy but for the dependency tracking. My makefile knows that results.csv depends on clean_data.csv, which depends on raw_data.csv. Change the raw data and everything downstream rebuilds automatically.
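
In make terms that chain looks roughly like this; the CSV names come from the example above, while the R script names are assumptions:

```make
# Makefile -- results.csv depends on clean_data.csv, which depends on raw_data.csv.
# Recipe lines must be indented with a tab. Script names are placeholders.

results.csv: clean_data.csv 02_models.R
	Rscript 02_models.R

clean_data.csv: raw_data.csv 01_data_prep.R
	Rscript 01_data_prep.R
```

After editing raw_data.csv, running make rebuilds clean_data.csv and then results.csv; if nothing upstream changed, make does nothing.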

Skip formal packages but create a simple functions.R file for anything you use twice. Future you will thank present you when you need to tweak that custom plotting function.
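
For example, a purely illustrative functions.R (the plotting helper here is made up):

```r
# functions.R -- helpers used by more than one script; source() this everywhere.
library(ggplot2)

# Hypothetical example: one consistent look for every figure in the paper.
paper_theme <- function(base_size = 11) {
  theme_minimal(base_size = base_size) +
    theme(legend.position = "bottom")
}

# Usage in an analysis script:
#   source("functions.R")
#   ggplot(dat, aes(x, y)) + geom_point() + paper_theme()
```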

One trick - I keep a decisions.md file where I log why I chose certain statistical approaches or excluded certain data points. Reviewers always ask about this, and my memory's terrible.

Even for one-off analyses, some organization pays for itself. Formal packages usually aren't necessary, but separate folders for raw data, processed data, scripts, and outputs make files much easier to find. I always include a short README explaining the project plus a concise methods document; both save time when revising the manuscript. Document your decisions and parameter choices as you go, because reviewers often ask about exactly those details. For automation, a makefile can be overkill for simpler projects; a basic shell script that runs the steps in sequence works well and keeps the analysis easy to rerun.
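
A sketch of that kind of driver script, assuming the steps live in numbered R scripts (the names are placeholders):

```sh
#!/bin/sh
# run_all.sh -- run the analysis steps in order, stopping at the first error.
set -e

Rscript 01_data_prep.R
Rscript 02_models.R
Rscript 03_figures.R
```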

don’t overthink it. i use dated folders like “2024-03-analysis” and dump everything there with numbered scripts (01_data_prep.R, 02_models.R, etc). takes 2 minutes to set up and works fine. rmarkdown’s great for combining code and writeup in one doc - fewer files to juggle.