Train-test-evaluate is not dead yet

Lander Analytics Team
May 5

The old workflow still wins when it matters

By Jared Lander and Joe Marlo

You have options now. You can paste a CSV into ChatGPT and ask it to predict churn. You can pull a pretrained model like Chronos or TabPFN off the shelf and get predictions in minutes. Or you can do it the old way: build your features, split your data, train an XGBoost model, evaluate on a holdout set, tune, repeat. So why would anyone still bother with the full train-test-evaluate cycle when there are faster paths to a prediction?

For most mid-sized companies—teams with strong data scientists but without the budget to pretrain a transformer from scratch—the answer is that the traditional workflow can be your biggest competitive advantage.

Your three options in 2026

It helps to be specific about what we are actually comparing.

LLM endpoints. You send your data to a hosted model from OpenAI, Anthropic or Google and ask for predictions. No training, no code, no infrastructure. You get an answer in seconds. The model has never seen your data before and is not optimized for numerical prediction, but it can pattern-match surprisingly well on small problems.

Pretrained models. Tools like TabPFN and Chronos are trained specifically for tabular or time series tasks. They are not general-purpose language models. They are purpose-built and can be remarkably good. In its authors' published benchmarks, TabPFN v2 outperforms tuned gradient boosting on datasets with up to 10,000 samples, and it does so in under three seconds versus hours of hyperparameter tuning. Chronos-2 delivers state-of-the-art zero-shot forecasts across multiple benchmarks without any task-specific training.

Training your own model. XGBoost, ARIMA, random forests, LightGBM. You write the training code, you pick the features, you tune the hyperparameters, you evaluate on held-out data. It is the most work by a wide margin. But it is also where your company’s institutional knowledge actually gets encoded into the model. You do not need GPU clusters or a research team to train an XGBoost model. You need someone who understands the data.

Each of these is a legitimate choice in the right context. The question is when each one earns its keep.

Shortcuts fall short

LLM endpoints are the easiest path to a prediction, but they are also the least reliable for deep domain-specific work. The model has no memory of your feature distributions, no understanding of your target variable's scale, and no guarantee of consistency across runs. Ask it twice, get two different answers. For exploratory analysis or a quick sanity check on non-sensitive data, fine. For a production forecast, you need something more disciplined. You are also handing private data to a third party whose retention and training-use policies you do not fully control.

Pretrained models are a big step up. They are purpose-built for prediction tasks, and the results can be genuinely impressive. But three patterns keep showing up in the research that should make you pause before ditching your sklearn and tidymodels imports.

Scale flips the advantage. On small datasets, pretrained models dominate. On large ones, XGBoost and other gradient boosting models consistently outperform them. When you have sufficient data, a model trained on your data can fully learn the feature interactions that matter for your problem, and the pretrained priors become less valuable.

Supervised baselines are stubbornly competitive. A 2025 arXiv paper evaluated specialized foundation models across genomics, satellite imaging and time series. The finding was blunt: across all three domains, foundation models failed to significantly improve upon tuned supervised learning, despite using two to five orders of magnitude more pretraining data. Simple models matched or outperformed the latest foundation models when properly tuned.

Domain knowledge is your edge, and custom models use it. A pretrained time series model knows about seasonality and trends in general. It does not know that your retail sales collapse on the third Wednesday of every month because that is when a competitor runs promotions, or that your sensor data has a drift pattern tied to equipment age. An LLM knows even less. Your own data reflects your actual business, so it carries that domain knowledge with it. Domain-specific features, carefully engineered and fed into an XGBoost or LightGBM model, capture patterns that no pretrained model can learn about your business. The train-test-evaluate cycle is the mechanism that tells you which of those features actually improve the forecast.
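
As a minimal sketch of what such a feature might look like, the third-Wednesday pattern above could be encoded as a binary flag. The names `is_third_wednesday`, `add_promo_flag` and `competitor_promo` are hypothetical, the rows are invented, and only the Python standard library is used. In practice the flag would be one column among many in the training data.

```python
from datetime import date, timedelta


def is_third_wednesday(day: date) -> bool:
    """Return True if `day` is the third Wednesday of its month."""
    # date.weekday() counts Monday as 0, so Wednesday is 2.
    # The third occurrence of any weekday always falls on days 15 through 21.
    return day.weekday() == 2 and 15 <= day.day <= 21


def add_promo_flag(rows: list[dict]) -> list[dict]:
    """Copy each row and add a hypothetical `competitor_promo` feature (0 or 1)."""
    return [
        {**row, "competitor_promo": int(is_third_wednesday(row["date"]))}
        for row in rows
    ]


# Illustrative data only: one row per day for January 2026.
sales = [{"date": date(2026, 1, 1) + timedelta(days=i)} for i in range(31)]
flagged = [r["date"] for r in add_promo_flag(sales) if r["competitor_promo"]]
```

Whether a flag like this earns its place is an empirical question: train with and without it and compare the holdout error.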

The practical takeaway

There is nothing stopping you from including LLM endpoints and pretrained models in your supervised backtest shootout. Wrap the OpenAI API call in a predict function, do the same for Chronos or TabPFN and evaluate them on the same holdout set alongside your XGBoost model. Let the data decide. The train-test-evaluate cycle is not married to any particular model class. It is a framework for comparing whatever options you have.
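
One way to sketch that shootout, assuming a one-step-ahead forecasting task: every candidate is wrapped as a plain function that takes the history and returns a forecast, and all of them are scored on the same final stretch of the series. The three predictors below are simple stand-ins so the block runs without network access or extra packages. In a real comparison they would wrap an LLM API call, a pretrained model such as Chronos or TabPFN, and a model trained on your own features. All names and data here are hypothetical.

```python
from typing import Callable, Dict, Sequence

# A predictor takes the history seen so far and returns a one-step-ahead forecast.
Predictor = Callable[[Sequence[float]], float]


def mean_absolute_error(actual: Sequence[float], predicted: Sequence[float]) -> float:
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def backtest(
    predictors: Dict[str, Predictor], series: Sequence[float], n_holdout: int
) -> Dict[str, float]:
    """Score every predictor on the same final `n_holdout` points."""
    start = len(series) - n_holdout
    actual = series[start:]
    scores = {}
    for name, predict in predictors.items():
        # Each forecast sees only the data before time t, so the holdout never leaks.
        forecasts = [predict(series[:t]) for t in range(start, len(series))]
        scores[name] = mean_absolute_error(actual, forecasts)
    return scores


# Stand-in predictors (assumptions for illustration, not real model wrappers).
def last_value(history: Sequence[float]) -> float:
    return history[-1]


def moving_average(history: Sequence[float]) -> float:
    return sum(history[-4:]) / 4


def two_steps_back(history: Sequence[float]) -> float:
    return history[-2]


series = [10, 12, 11, 13, 12, 14, 13, 15, 14, 16]  # invented data
scores = backtest(
    {
        "last_value": last_value,
        "moving_average": moving_average,
        "two_steps_back": two_steps_back,
    },
    series,
    n_holdout=4,
)
winner = min(scores, key=scores.get)  # lowest holdout error wins
```

The harness never needs to know what sits behind each function, which is the point: adding a new candidate means adding one entry to the dictionary.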

Start with the easy options to establish a baseline. Then ask: can a model trained on your data, with features your team engineered from domain knowledge, beat the off-the-shelf prediction? If the pretrained model wins, great, ship it. If your XGBoost model outperforms it (and at scale, with good feature engineering, it often will), ship that instead. Or ensemble the two: the pretrained model's priors and your custom model's domain knowledge often complement each other, and the combination can beat either one alone. Either way, you only know the answer because you ran the evaluation.
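
The ensemble option can be sketched as a weighted average of the two models' predictions. This is a minimal illustration under stated assumptions: the function names are hypothetical, the numbers are invented to show the mechanics, and the blend weight is chosen on a separate validation slice so that the final test slice still gives an honest comparison. It is not evidence that blending wins on your data.

```python
def mae(actual, predicted):
    """Mean absolute error between two equal-length sequences."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def blend(preds_a, preds_b, weight_a):
    """Weighted average of two prediction lists; `weight_a` applies to `preds_a`."""
    return [weight_a * a + (1 - weight_a) * b for a, b in zip(preds_a, preds_b)]


def pick_weight(actual, preds_a, preds_b, candidates=(0.0, 0.25, 0.5, 0.75, 1.0)):
    """Choose the candidate weight with the lowest MAE on a validation slice."""
    return min(candidates, key=lambda w: mae(actual, blend(preds_a, preds_b, w)))


# Invented validation-slice numbers: one model runs high, the other runs low.
val_actual = [100, 110, 120, 130]
val_pretrained = [104, 114, 124, 134]
val_custom = [96, 106, 116, 126]
weight = pick_weight(val_actual, val_pretrained, val_custom)

# Invented test-slice numbers, untouched while choosing the weight.
test_actual = [140, 150]
test_pretrained = [146, 154]
test_custom = [136, 148]
test_blend = blend(test_pretrained, test_custom, weight)

results = {
    "pretrained": mae(test_actual, test_pretrained),
    "custom": mae(test_actual, test_custom),
    "blend": mae(test_actual, test_blend),
}
```

If the blend does not beat both inputs on the test slice, ship the simpler single model.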

For mid-sized teams, this is the sweet spot. You do not need a research lab. You need people who understand your business, a solid evaluation framework and the discipline to let the holdout set pick the winner.

We can help you figure this out

This is the kind of problem we work on at Lander Analytics. We help teams set up proper evaluation frameworks, benchmark LLM endpoints and pretrained models against tuned baselines and build the domain-specific models that actually win the shootout. Whether your team needs help standing up the pipeline or figuring out which approach to bet on, reach out at info@landeranalytics.com.

Jared P. Lander
Founder and Chief Data Scientist
Lander Analytics

Joe Marlo
Director of Data Science
Lander Analytics


Work with us: If you want help identifying the right first workflow, building a permissioned knowledge base, or training your team to ship responsibly, reach out at info@landeranalytics.com.

About the author: Jared P. Lander is Chief Data Scientist and founder of Lander Analytics, where he helps organizations build practical, measurable AI workflows grounded in strong data foundations.

About the author: Joe Marlo is Director of Data Science at Lander Analytics, where he designs agentic workflows, statistical models, and interactive frontends that put rigorous analysis into production.
