Human skills rise with practice: a 2,630-person study

A longitudinal look at six skills variously called human skills, soft skills, or (in Happily) Power Skills. Across 2,630 individuals using Happily for at least six months, mean ratings rose by +0.29 points on a 0.5 to 5 scale. Heavier users gained three to six times more than lighter users, the gap survives a regression-to-mean correction, and the largest gains land where there is most room to grow.

The skills in this study go by several names. Researchers and workplace publications increasingly call them human skills, especially in the context of the AI era and what humans bring that machines do not. Practitioners often still call them soft skills, the older umbrella term. Happily calls them Power Skills, the developmental targets in our feedback loop. They are the same six capabilities under different labels: Critical Thinking, Self-awareness, Optimism, Leadership, Initiative, and Empathy. We use Power Skills below for consistency with our prior research.

The framework defense in our last article argued these six are worth coaching toward. This one asks the harder follow-on question: when people actually use Happily over time, do their skill ratings move?

The honest answer takes two corrections to get right. The first is selection. We can only study people who stuck around long enough to accumulate enough ratings, and those people are different from a random employee in ways we cannot perfectly correct for. The second is regression to the mean. People who start low tend to rise just because that is where there is room, even without any coaching. A clean read on coaching effect has to put both effects in the same picture and only claim what survives them.

What survives is a positive dose-response signal that is independent of starting level. Heavier users of Happily gain more rating points than lighter users, in every starting-rating quartile we look at. The effect is largest where there is most room to grow, and smallest at the ceiling. That is the cleanest finding the data can support, and the rest of this article walks through how we got there and what it does, and does not, mean.

Averaged across the four starting-level quartiles, the top engagement quartile gained 0.21 more rating points than the bottom engagement quartile. That is the dose-response signal after the regression-to-mean correction.

Why this matters

The framework defense answered "are these the right skills to develop?" This one answers "do they move with sustained use?" Without that second answer the loop is interesting but not necessarily useful. With it, the case for the loop shifts from a theoretical defense to an empirical one.

Methodology

Cohort

2,630 individuals across 80 customer organisations who accumulated at least 6 Power Skill rating events on the same skill over a span of at least 180 days, with at least 2 ratings in each half of their window.

Time window

Per-person window spans from each person's first to last rating. Median span is 452 days. Cohort earliest rating: 2024-05-28; latest: 2026-05-27.

Rating measurement

The rating dimension of each Power Skill: an AI-evaluated quality score (0.5 to 5) of how well the skill is represented in a person's written feedback. This is distinct from the score dimension, which counts how often a skill is expressed.

Outcome

Δ = (mean rating in second half of window) − (mean rating in first half), computed per (person, skill) cell. 10,045 cells in total.
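
The cohort rule and the outcome definition can be sketched together in Python. This is a minimal illustration, not Happily's pipeline. The function names, the (date, rating) tuple layout, the choice to measure the window per (person, skill) series, and the choice to split it at its temporal midpoint are all assumptions made for the example.

```python
from datetime import date
from statistics import mean

MIN_RATINGS = 6      # rating events on the same skill
MIN_SPAN_DAYS = 180  # days between first and last rating
MIN_PER_HALF = 2     # ratings required in each half of the window


def split_halves(events):
    """Split (date, rating) events at the temporal midpoint of the window.

    Assumption: a rating exactly on the midpoint counts toward the second half.
    """
    events = sorted(events)
    midpoint = (events[0][0].toordinal() + events[-1][0].toordinal()) / 2
    first = [r for d, r in events if d.toordinal() < midpoint]
    second = [r for d, r in events if d.toordinal() >= midpoint]
    return first, second


def qualifies(events):
    """Apply the three inclusion criteria to one (person, skill) series."""
    if len(events) < MIN_RATINGS:
        return False
    dates = [d for d, _ in events]
    if (max(dates) - min(dates)).days < MIN_SPAN_DAYS:
        return False
    first, second = split_halves(events)
    return len(first) >= MIN_PER_HALF and len(second) >= MIN_PER_HALF


def half_delta(events):
    """Second-half mean rating minus first-half mean rating."""
    first, second = split_halves(events)
    return mean(second) - mean(first)


# Hypothetical rating series for one person on one skill.
series = [
    (date(2025, 1, 1), 3.0), (date(2025, 1, 20), 3.0), (date(2025, 2, 10), 3.5),
    (date(2025, 6, 1), 3.5), (date(2025, 7, 1), 4.0), (date(2025, 8, 1), 4.0),
]
if qualifies(series):
    print(round(half_delta(series), 2))  # prints 0.67
```

A series that fails any one of the three criteria is dropped before Δ is computed, which is why the cohort skews toward sustained users.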

Coaching-dose proxy

Response-active months during the person's window. We tested four alternative proxies (total responses, recognitions received, recognitions given, peer feedback received); all pointed in the same direction. Active months had the strongest signal.
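
One way to compute this proxy, assuming a "response-active month" means a calendar month containing at least one response. The article does not spell out the definition, so that reading and the function name are assumptions.

```python
from datetime import date


def active_months(response_dates):
    """Count distinct calendar months that contain at least one response."""
    return len({(d.year, d.month) for d in response_dates})


# Hypothetical response dates: two in January 2025, one in March 2025,
# one in January 2026.
dates = [date(2025, 1, 3), date(2025, 1, 28), date(2025, 3, 10), date(2026, 1, 5)]
print(active_months(dates))  # prints 3
```

Counting months rather than responses rewards sustained use over bursts, which may be part of why it separates the groups more cleanly than total responses.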

Regression-to-mean correction

Within each skill, the cohort is split into starting-level quartiles using each cell's first-half mean. The dose-response comparison is then made within each starting-level stratum and averaged across strata. This isolates dose effect from the mechanical rise of low starters.
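
The stratified comparison can be sketched as follows. This is an illustration under stated assumptions, not the analysis code: the tuple layout, the rank-based quartile assignment, and computing engagement quartiles over all cells of a single skill are choices made for the example.

```python
from statistics import mean


def quartile_labels(values):
    """Rank-based quartile label (0 = lowest .. 3 = highest) per value.

    Ties are broken by input order, a simplification for this sketch.
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    labels = [0] * len(values)
    for rank, i in enumerate(order):
        labels[i] = rank * 4 // len(values)
    return labels


def stratified_dose_effect(cells):
    """cells: (first_half_mean, active_months, delta) tuples for ONE skill.

    Returns the high-minus-low engagement gap in delta, computed inside each
    starting-level quartile and then averaged across the quartiles.
    """
    start_q = quartile_labels([c[0] for c in cells])
    dose_q = quartile_labels([c[1] for c in cells])
    gaps = []
    for stratum in range(4):
        low = [c[2] for c, s, d in zip(cells, start_q, dose_q)
               if s == stratum and d == 0]
        high = [c[2] for c, s, d in zip(cells, start_q, dose_q)
                if s == stratum and d == 3]
        if low and high:  # skip strata missing either engagement extreme
            gaps.append(mean(high) - mean(low))
    return mean(gaps)


# Synthetic cells: delta = room-to-grow term + 0.1 per engagement level.
cells = [(start, months, 0.2 * (5 - start) + 0.1 * months)
         for start in (1, 2, 3, 4) for months in (1, 2, 3, 4)]
effect = stratified_dose_effect(cells)
print(round(effect, 2))  # prints 0.3
```

In the synthetic data the room-to-grow term is identical for everyone inside a stratum, so it cancels out of each within-stratum gap and only the engagement term remains.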

What we see, after the corrections

Mean change across the whole cohort is positive at +0.29 points, and 67% of person-skill cells show a positive change. That is the headline, but it pools two distinct effects: a real dose-response signal and a regression-to-mean artifact. The chart below pulls them apart.

Figure 1. Four lines, one per starting-level quartile. Each line slopes up from low to high engagement: heavier users gain more, no matter where they started. The lines stack by starting level, which is regression to mean visible in the chart: people who started lower had more room to rise.

Two patterns sit in this chart. Read the chart along its vertical axis to see regression to mean. People who started in the bottom quartile (dark purple) sit highest on the chart, even at low engagement, because the lowest starters have the most room to rise. People who started in the top quartile (lightest line) hover near zero across all engagement levels, because they are at the ceiling and have nowhere to go. The vertical stack of the four lines is the regression-to-mean signal.

Read the chart left to right within any one line to see dose response. Every line slopes upward from the low-engagement column to the high-engagement column. Within every starting-level quartile, heavier users of Happily gained more rating points than lighter users. That is the part of the picture that regression to mean does not explain.

Pulling the two effects apart

The unstratified gap between top-quartile and bottom-quartile engagement is +0.35 points. After controlling for starting level by averaging the gap within each starting-level quartile, it falls to +0.21 points. About 40% of the raw difference was regression-to-mean leakage. The remaining 60% is the genuine dose-response signal.

Dose effect (top engagement quartile mean − bottom engagement quartile mean) within starting-level quartiles
Starting-rating quartile	Δ at low engagement (Q1)	Δ at high engagement (Q4)	Dose effect (Q4 − Q1)
Bottom (lowest starters)	+0.54	+0.86	+0.32
Second	+0.25	+0.54	+0.29
Third	+0.07	+0.26	+0.19
Top (highest starters)	−0.06	+0.00	+0.07
Average across strata	n/a	n/a	+0.21
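
The arithmetic behind the table and the 40/60 split can be checked from the published figures. The sketch below uses only the rounded values printed in this article; the variable names are illustrative.

```python
# Published (rounded) values: (delta at low engagement, delta at high engagement).
rows = {
    "Bottom": (0.54, 0.86),
    "Second": (0.25, 0.54),
    "Third": (0.07, 0.26),
    "Top": (-0.06, 0.00),
}

# Dose effect per stratum, then the plain average across the four strata.
dose_effects = {name: high - low for name, (low, high) in rows.items()}
stratum_average = sum(dose_effects.values()) / len(dose_effects)

raw_gap = 0.35       # unstratified top-minus-bottom engagement gap
adjusted_gap = 0.21  # reported stratum-averaged gap
leakage_share = (raw_gap - adjusted_gap) / raw_gap

for name, effect in dose_effects.items():
    print(f"{name}: {effect:+.2f}")
print(f"stratum average from rounded rows: {stratum_average:.3f}")
print(f"share of raw gap removed by stratifying: {leakage_share:.0%}")
```

Recomputed from the rounded figures, the Top row comes to +0.06 and the average to about 0.215, slightly off the published +0.07 and +0.21. Differences of that size are consistent with the published values having been rounded from unrounded means.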

The dose effect is positive in all four starting-rating quartiles, but the magnitude narrows as starting level rises. People who started in the bottom quartile gained about a third of a rating point more (+0.32) when they engaged heavily with Happily than when they barely engaged. People who started near the top of the rating scale show only a +0.07 dose effect, which is small but still in the same direction.

Why the top stratum still matters

A near-zero dose effect at the top of the scale is consistent with a ceiling effect rather than a true null. People rated near 5 at baseline have very little room to rise on a 0.5 to 5 scale regardless of what they do. The fact that the dose-response direction is still positive there is reassuring, but the magnitude is not the right thing to focus on for that group.

Per skill: where the signal is strongest

Start with the picture. For each skill, the chart below traces the rating journey of the most-engaged quartile, in the skill's colour, against the least-engaged quartile, in grey, from the first half of each person's window to the second. On every skill, the most-engaged moved further. For Initiative, the least-engaged group did not move at all.

Figure 2. For each skill, the most-engaged quartile (33+ active months, in colour) is set against the least-engaged (12 or fewer, in grey). The most-engaged consistently start lower and travel further. The axis is cropped to 2.0 to 4.5 so the journeys stay legible; every delta is labelled.

Two things are tangled in that picture. The most-engaged groups tend to start lower, so part of their larger gain is room to grow rather than engagement itself. Separating the two means controlling for starting level.

Averaging the stratum-controlled dose effect across the four starting-level quartiles gives one number per skill. All six are positive. The three skills with the largest stratum-averaged dose effects are Self-awareness, Initiative, and Optimism. The effects on Empathy, Critical Thinking, and Leadership are smaller but still positive.

Stratum-averaged dose effect (top engagement quartile mean − bottom engagement quartile mean, averaged across starting-level quartiles), per skill
Skill	Mean Δ overall	Stratum-avg dose effect	% cells positive
Self-awareness	+0.33	+0.33	69%
Initiative	+0.21	+0.28	61%
Optimism	+0.33	+0.27	71%
Empathy	+0.24	+0.17	65%
Critical Thinking	+0.43	+0.15	72%
Leadership	+0.16	+0.15	59%

Critical Thinking shows the largest unstratified gain (+0.43) but ties with Leadership for the smallest stratum-controlled dose effect (+0.15). That pattern is consistent with strong regression-to-mean recovery among the lowest starters and a weaker dose-response. Initiative shows the opposite: a modest unstratified gain (+0.21) but the second-largest dose effect (+0.28), pointing to a steadier link between engagement and rating change.

What the study shows

This study is observational and dose-comparative. Among people who use Happily for at least six months, those who engaged more saw larger rating changes than those who engaged less, and the gap holds after controlling for where they started. That dose-response pattern is the strongest evidence an observational study can offer that the loop contributes to the change. A randomised comparison, with engagement as the only difference between matched groups, would measure that contribution directly, and is the natural next step for this line of research.

What this gives us is a falsifiable, defensible empirical baseline. If the framework were inert, the dose-response gap would collapse inside starting-level quartiles. It does not. It narrows, in the direction the ceiling predicts, but stays positive at every starting level and on every skill. That is the empirical anchor for the case that the loop does something.

Decision implications
Decision	Implication
Who to invest in first	People who started lower have the most room to grow and the largest absolute gains. Highest-starters show diminishing returns at the ceiling, not because they cannot benefit but because the rating scale runs out.
How long to commit	The biggest gains appear among users active for 33+ months. Six months is enough to see directional movement; sustained, multi-quarter usage is where the larger magnitudes appear.
What to track	Use the rating dimension for development outcomes, not the score. The score dimension counts effort; the rating dimension captures quality, which is what we care about for "are people getting better".
How to talk about results	Internally, the +0.21 stratum-controlled effect is the honest number. The unstratified +0.29 mean change is the cohort-level fact and is also true. Use both, and label which is which.

Limitations

The cohort is self-selected. To enter the analysis, individuals had to accumulate at least six ratings on the same skill over at least 180 days. That selects for engaged users at sticky companies. The result is not a population-level claim about employees; it is a claim about people who use Happily this way.
Observational design. The dose-response comparison is the strongest lever available short of a randomised trial. It does not by itself separate the loop from unmeasured common causes, such as a person whose role, manager, or organisational context is changing in ways that also lift ratings; a controlled comparison would isolate that.
The AI rating model evolves over time. If the model became more generous between 2024 and 2026, observed gains would partly reflect model drift rather than skill change. A future iteration of this study should freeze the model version across the analysis window.
Score-rating coupling. The rating is calibrated on text, and people who write more text give the model more signal to work with. Heavier users write more, which may make their ratings both more accurate and slightly inflated. The size of this effect is not estimated here.
The ceiling at the top quartile (mean Δ near zero) is a real measurement constraint, not a finding about that group. We need a finer-grained scale or a different outcome (e.g. specific competency observations) to measure development among already-strong individuals.
Per-skill comparisons should be read with the sample sizes in mind. Leadership has 1,322 cells and Self-awareness has 2,234. This article does not report confidence intervals; skills with fewer cells will have wider ones.
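
To give a rough sense of scale for the sample-size point, the standard error of a mean shrinks with the square root of the number of observations. The sketch below compares the two cell counts quoted above, assuming equal spread in Δ across skills and independent cells. Neither assumption is verified here, and cells from the same person are unlikely to be fully independent.

```python
import math

# Cell counts quoted in the limitations above.
cells = {"Leadership": 1322, "Self-awareness": 2234}

# Standard error of a mean scales as 1 / sqrt(n) under equal spread and
# independent observations (both are simplifying assumptions here).
se_ratio = math.sqrt(cells["Self-awareness"] / cells["Leadership"])
print(f"Leadership SE is about {se_ratio:.2f}x the Self-awareness SE")
```

Under those assumptions the Leadership estimate is roughly 30% noisier than the Self-awareness estimate, so small differences between their dose effects should not be over-read.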

References

Happily Research (2026). The six skills that show up where engagement does. Companion article defending the choice of the six Power Skills using WEF Future of Jobs, Goleman EI, and positive-psychology research.
Happily Research (2026). Power skills and recognition: why optimism leads. Standardised-effects analysis of the six skills on recognition outcomes.
Campbell, D. T., & Kenny, D. A. (1999). A Primer on Regression Artifacts. Guilford Press. The standard reference on the regression-to-mean phenomenon and how to control for it.
Cronbach, L. J., & Furby, L. (1970). How we should measure change, or should we? Psychological Bulletin, 74(1), 68–80. Foundational paper on the difficulties of measuring change scores.
Happily Research (2026). Power Skills longitudinal cohort dataset. Internal analysis, 2,630 individuals across 80 companies, 10,045 person-skill cells, May 2026.