How We Use Claude Code Skills to Run 1,000+ ML Experiments a Day

Community Article Published December 8, 2025


Giving Claude a Team Memory ( ̄ー ̄)ノ✧

Hugging Face posted a demo yesterday about getting Claude Code to fine-tune an open LLM. At Sionic AI, our researchers already do most of their work with Claude Code. It writes training scripts, debugs CUDA errors, and searches hyperparameters overnight. For the actual work of building models, Claude has become the default partner. But there was one thing it couldn't do: remember what teammates learned last week.

Last month, Sigrid spent three days running experiments on ColBERT parameter configurations. He tested over fifty combinations. He found that longer text chunks, around 4,000 characters, made FDE outperform MaxSim in retrieval tasks. This was a genuine discovery, the kind that saves weeks of work for anyone who encounters the same problem later.

Note: FDE stands for Fixed Dimensional Encoding. It's a technique from Google DeepMind's MuVERA paper that compresses ColBERT's multi-vector representations into a single fixed-size vector. We open-sourced our implementation at sionic-ai/muvera-py, which has picked up more than 300 stars. It makes ColBERT-style retrieval practical at scale, because you can use standard vector databases instead of specialized late-interaction infrastructure.
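
For intuition, here's a heavily simplified sketch of the FDE idea: assign each token embedding to a bucket via random hyperplanes (SimHash-style), aggregate per bucket, and concatenate across repetitions. This is an illustration only, not the muvera-py implementation; the parameter names mirror the ksim / R_reps knobs mentioned later, but the details here are assumptions.

```python
import numpy as np

def fde(token_vectors: np.ndarray, k_sim: int = 4, r_reps: int = 16, seed: int = 0) -> np.ndarray:
    """Toy fixed dimensional encoding of one document's token embeddings."""
    rng = np.random.default_rng(seed)
    n_tokens, dim = token_vectors.shape
    chunks = []
    for _ in range(r_reps):
        # SimHash-style partition: k_sim random hyperplanes -> 2**k_sim buckets
        planes = rng.standard_normal((k_sim, dim))
        bits = (token_vectors @ planes.T) > 0                    # [n_tokens, k_sim]
        bucket_ids = bits.astype(int) @ (1 << np.arange(k_sim))  # [n_tokens]
        buckets = np.zeros((2 ** k_sim, dim))
        for b in range(2 ** k_sim):
            members = token_vectors[bucket_ids == b]
            if len(members):
                buckets[b] = members.sum(axis=0)                 # aggregate per bucket
        chunks.append(buckets.reshape(-1))
    # One fixed-size vector per document, usable in any standard vector DB
    return np.concatenate(chunks)                                # [r_reps * 2**k_sim * dim]

doc_fde = fde(np.random.randn(20, 128))   # 20 ColBERT-style token embeddings, dim 128
print(doc_fde.shape)                      # (32768,)
```

A query is encoded the same way, and a single inner product between the two fixed-size vectors then approximates the late-interaction score.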

That insight lived in a Slack thread. A few teammates saw it. Everyone else missed it. In two months, someone will almost certainly run the same experiments again, not knowing the answer already exists somewhere in our message history. This happens constantly in ML teams. The knowledge is there. It just isn't findable.

Claude could help us train models, but it couldn't help us avoid repeating mistakes. It knew everything from its training data, but nothing about what others discovered on Tuesday.


So we built a system to fix it. The core idea is simple. When you finish an experiment session in Claude Code, you type one command. Claude reads through what you did, extracts the important parts and writes it up as a "skill." That skill goes into a shared registry. The next time anyone on the team asks Claude about a related topic, Claude already knows what your teammate discovered.

Think of it as giving Claude a team memory. Not your personal memory. The team's collective memory.

Quick Look

The setup takes about thirty seconds. Inside Claude Code, you run two commands:

/plugin marketplace add your-org/your-skills-repo
/plugin install all-skills@your-skills-repo

That's it. Claude now has access to every skill your team has recorded.

We use this internally with our own registry at Sionic. But the idea is simple: you can build the same thing for your team. The rest of this post shows you how, starting with a real experiment that went into our registry last week.

/advise pulls from team knowledge. You're about to start a pruning experiment, so you type /advise. Claude checks the registry, finds that someone ran pruning experiments on Ministral3 last month, and tells you what they learned. You get warnings about approaches that didn't work. You get the hyperparameters that did. You skip the three days of trial and error your teammate already went through.

Here's what /advise actually looks like in practice.

[Screenshots: /advise output in Claude Code]

/retrospective saves what you learned. You just finished a session where you figured out something useful. Before closing the session, you type /retrospective. Claude reads through your conversation, extracts the key insights, writes them up as a skill file, and opens a pull request to the team registry. After someone reviews and merges it, that knowledge becomes part of what Claude knows.


The skill file includes everything: the working code, the parameters you settled on, and the mistakes you made along the way. Especially the mistakes. I think those are the most valuable part.

Before ending your Claude Session

The hard part of knowledge management has never been storage. It's getting people to write things down. After a long experiment session, nobody wants to open a doc and summarize what happened. The context is still fresh, but the energy is gone. So most insights never get recorded. They sit in someone's head until they fade. So we made Claude do the writing instead.

At the end of a training session with Claude Code, you type /retrospective. That's the only thing you have to remember. Claude Code reads through your conversation, every command you ran, every error you hit, every fix you tried, and pulls out the parts that would help someone else. It structures them into a skill file, creates a git branch, commits the files, pushes to the team registry, and opens a pull request.
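
Under the hood this is ordinary git plus the GitHub CLI. A minimal sketch of those steps, assuming `gh` is installed and authenticated; the branch naming and paths here are illustrative, not exactly what Claude runs:

```python
import datetime
import subprocess

def open_skill_pr(skill_name: str, repo_dir: str = ".") -> None:
    """Create a branch with the new skill files and open a PR against main."""
    branch = f"skill/{skill_name}-{datetime.date.today()}"
    run = lambda *cmd: subprocess.run(cmd, cwd=repo_dir, check=True)
    run("git", "checkout", "-b", branch)
    run("git", "add", f"plugins/training/{skill_name}")
    run("git", "commit", "-m", f"Add skill: {skill_name}")
    run("git", "push", "-u", "origin", branch)
    run("gh", "pr", "create", "--fill", "--base", "main")  # GitHub CLI

open_skill_pr("grpo-external-vllm-server")
```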


Here's what a real PR looks like.

[Screenshot: the pull request opened by /retrospective]

The PR contains three things.

  • A SKILL.md file holds the actual knowledge: what you were trying to do, what worked, what didn't work, and the parameters you ended up with.
# plugins/training/grpo-external-vllm-server/skills/grpo-external-vllm-server/SKILL.md

---
name: grpo-external-vllm-server
description: "GRPO training skill based on external vLLM server using ms-swift. Usage scenarios: (1) When performing GRPO training with the vLLM server running on a separate GPU, (2) When encountering errors related to vllm_skip_weight_sync, (3) When encountering OpenAI API response parsing errors. Verified on gemma-3-12b-it."
author: Hojin Yang
date: 2025-12-08
---

# grpo-external-vllm-server - Research Notes

## Experiment Overview

| Item | Details |
|------|---------|
| **Date** | 2025-12-08 |
| **Researcher** | Hojin Yang |
| **Goal** | Setup GRPO training using an external vLLM server in ms-swift and fix bugs |
| **Model** | google/gemma-3-12b-it |
| **Environment** | NVIDIA A100-SXM4-80GB x 8, ms-swift, vLLM, DeepSpeed ZeRO2 |

---

## Verified Workflow

### Step 1: Launch vLLM Server (Separate GPUs)

```bash
#!/bin/bash
export CUDA_VISIBLE_DEVICES=6,7

# Important: ms-swift requests the model name as 'default'
vllm serve google/gemma-3-12b-it \
    --host 0.0.0.0 \
    --port 8000 \
    --served-model-name default \
    --tensor-parallel-size 2 \
    --gpu-memory-utilization 0.9 \
    --dtype bfloat16 \
    --trust-remote-code
```

(...)
  • A plugin.json tells Claude when to surface this skill: the trigger conditions. If you wrote any scripts worth keeping, those go in too.
# plugins/training/grpo-external-vllm-server/.claude-plugin/plugin.json

{
  "name": "grpo-external-vllm-server",
  "version": "1.0.0",
  "description": "GRPO training skill based on external vLLM server using ms-swift. Usage scenarios: (1) When performing GRPO training with the vLLM server running on a separate GPU, (2) When encountering errors related to vllm_skip_weight_sync, (3) When encountering OpenAI API response parsing errors. Verified on gemma-3-12b-it.",
  "author": {
    "name": "Hojin Yang"
  },
  "skills": "./skills",
  "repository": "https://github.com/sionic-ai/sionic-research-skills"
}

A teammate or your manager reviews it. Sometimes they add context or catch something that needs more detail, then merge. From that point forward, anyone asking Claude about a related topic, like RoPE embeddings, retrieval model training, or grokking tasks, gets this knowledge automatically.

The reason this works better than documentation is timing. Claude writes the skill while everything is still in context. It watched you debug the tensor mismatch. It saw which approaches failed and why. All of that goes into the file without you reconstructing it from memory two days later.

We noticed something unexpected after a few weeks of using this. People started explaining their reasoning more clearly during sessions, knowing Claude would read it at the end. The quality of /retrospective output depends entirely on the conversation that preceded it. When you know this might become a team resource, you narrate your thinking as you go.

The skills that get referenced most aren't the ones documenting clean successes. They're the ones documenting failures. "I tried X and it broke because Y" turns out to be the most useful sentence in the whole system. Success stories tell you one path that worked. Failure stories tell you which paths to skip entirely.

When Starting a New Experiment

The other half of the loop happens before you write any training code. You use /advise when you're planning an experiment and want to know what the team has already learned about similar problems.

Here's a real example. You're about to train a small transformer to learn addition using Base-100 tokenization. You have a parameter budget of 0.5M to 4M. Before diving in, you ask Claude what the team knows.


Claude Code searches the skill registry and finds relevant experiments. In this case, it pulls from the ColBERT FDE parameter search logs because the methodology could be useful. The response starts with a structured analysis of your request, then maps lessons from past experiments onto your situation.


What comes back isn't generic advice. It's patterns extracted from actual team experiments.

  • Parameter impact analysis: The ColBERT skill documented that ksim=4 works because "16 buckets fit token distribution." That d_proj=32 causes information loss. That R_reps=16 is optimal but with memory tradeoffs. Claude translates these findings into hypotheses for your arithmetic task.
  • Failure patterns: This is where it gets useful. The ColBERT skill recorded that d_proj=32 caused a performance drop because "128 to 32 projection loses information." The lesson: use 64+ or disable entirely. You haven't made this mistake yet, and now you won't.

The whole response takes maybe thirty seconds to generate. You get structured methodology from past experiments, warnings about known failure modes, integration with whatever else you've been discussing, and runnable code. All before you've written a single line of your own.

This is what we mean by team memory. The knowledge already exists; /advise makes it findable.

Okay... What Makes a Good Skill?

Not all skills are equally useful. Some get referenced constantly. Others sit in the registry untouched. The difference comes down to three things: trigger conditions, failure documentation, and concrete numbers.

When you ask Claude Code for advice, it doesn't read every skill in the registry. It scans the description fields and picks the ones that match your situation. This means a vague description like "pruning experiments" will never surface.

A specific one looks like this:

"GRPO training skill based on external vLLM server using ms-swift. 
Usage scenarios:

(1) When performing GRPO training with the vLLM server running on a separate GPU, 
(2) When encountering errors related to vllm_skip_weight_sync, 
(3) When encountering OpenAI API response parsing errors. 

Verified on gemma-3-12b-it."

I think you might notice the pattern. It names the task, lists specific situations where Claude should activate it, and states where it was tested. If someone hits a vllm_skip_weight_sync error next month, Claude will find this skill because that exact phrase is in the description.
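
To make the matching mechanics concrete, here's a toy sketch of description-based retrieval: rank skills by keyword overlap between your request and each plugin.json description. Claude Code's actual selection is semantic rather than this literal, so treat it purely as an illustration.

```python
import json
import pathlib
import re

def tokenize(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9_]+", text.lower()))

def rank_skills(query: str, root: str = "plugins") -> list[str]:
    """Return skill names ordered by how many query tokens their description shares."""
    q = tokenize(query)
    scored = []
    for pj in pathlib.Path(root).rglob(".claude-plugin/plugin.json"):
        meta = json.loads(pj.read_text())
        overlap = len(q & tokenize(meta.get("description", "")))
        if overlap:
            scored.append((overlap, meta["name"]))
    return [name for _, name in sorted(scored, reverse=True)]

# A request containing "vllm_skip_weight_sync" surfaces the GRPO skill above,
# because that exact token appears in its description.
print(rank_skills("GRPO training hits vllm_skip_weight_sync error on gemma"))
```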

Writing good descriptions takes practice. The first few skills people create tend to be too broad; after seeing what actually gets surfaced and what doesn't, authors learn to be specific. So we use a template to enforce it.

# templates/experiment-skill-template/skills/EXPERIMENT_NAME/SKILL.md

---
name: grpo-external-vllm-server
description: "ms-swift external vLLM server-based GRPO training skill. Use cases: (1) Running vLLM server on separate GPUs for GRPO training, (2) When encountering vllm_skip_weight_sync errors, (3) When encountering OpenAI API response parsing errors. Verified on gemma-3-12b-it."
author: Hojin Yang
date: 2025-11-08
---

# grpo-external-vllm-server - Research Note

## Experiment Overview

| Item | Details |
|------|---------|
| **Date** | 2025-11-08 |
| **Experimenter** | Hojin Yang |
| **Objective** | Setup and bug fixing for GRPO training using an external vLLM server in ms-swift |
| **Model** | google/gemma-3-12b-it |
| **Environment** | NVIDIA A100-SXM4-80GB x 8, ms-swift, vLLM, DeepSpeed ZeRO2 |

---

## Verified Workflow

### Step 1: Launch vLLM Server (Separate GPUs)

### Step 2: Execute GRPO Training (Remaining GPUs)

```bash
export CUDA_VISIBLE_DEVICES=1,2,3,4

# Important: --vllm_skip_weight_sync true is mandatory when using standard `vllm serve`
NPROC_PER_NODE=4 \
swift rlhf \
    --rlhf_type grpo \
    --model google/gemma-3-12b-it \
    --use_vllm true \
    --vllm_mode server \
    --vllm_server_host 127.0.0.1 \
    --vllm_server_port 8000 \
    --vllm_skip_weight_sync true \
    --train_type full \
    --torch_dtype bfloat16 \
    --deepspeed zero2.json \
    ...
```

-----

## Failed Attempts (Very Important!)

| Attempt | Why it Failed | Lesson Learned |
|---------|---------------|----------------|
| Execution without `vllm_skip_weight_sync` | 404 `/update_flattened_params/` error. Standard `vllm serve` does not support this API. | `--vllm_skip_weight_sync true` is mandatory when using `vllm serve` instead of `swift rollout`. |
| Running vLLM without `--served-model-name` | 404 Model `default` not found. ms-swift requests the model name as `default`. | Must add `--served-model-name default` to the vLLM server arguments. |
| Parsing OpenAI API response with default ms-swift code | `service_tier` TypeError, `'dict' object has no attribute 'message'`. | ms-swift `vllm_client.py` code modification required (see patch below). |

-----

## ms-swift Code Patches (Required)

### Patch 1: `rollout_mixin.py` - Skip Weight Sync

Add to the beginning of the `_move_model_to_vllm` function in `swift/trainers/rlhf_trainer/rollout_mixin.py`:

As you see above, every skill template includes a "Failed Attempts" table. This table gets read more than any other section. Success paths are nice to know, but failure paths are what save time. "I tried X and it didn't work because Y" is worth more than paragraphs of explanation.


The template also asks for a troubleshooting file in references/troubleshooting.md. Real error messages, actual symptoms, exact fixes. When Claude surfaces a skill, it can pull from this file to give precise answers instead of guesses.

Also, vague advice doesn't help. "Use a small learning rate" means different things to different people. Skills that get used contain copy-paste configurations.

# Experiment Log: grpo-external-vllm-server

## 2025-11-08

### Objective

Perform GRPO training in `ms-swift` using an external vLLM server.

### Environment

  * **GPU:** NVIDIA A100-SXM4-80GB x 8
  * **Frameworks:** ms-swift, vLLM, DeepSpeed ZeRO2
  * **Model:** google/gemma-3-12b-it

-----

### Issues Encountered and Solutions

#### Issue 1: 404 `/update_flattened_params/` Error

  * **Symptom:** A 404 error occurred in the vLLM server logs.
  * **Cause:** The `_move_model_to_vllm` function attempted to call the weight sync API even when `vllm_skip_weight_sync=true` was set.
  * **Solution:** Added a logic check for `vllm_skip_weight_sync` in `rollout_mixin.py`.

...

### Results

  * Confirmed that GRPO training started normally after fixing all bugs.
  * Verified the combination of **External vLLM Server + ms-swift GRPO training**.

### Hyperparameters

# GRPO Training Config (Ready to copy-paste)
rlhf_type: grpo
use_vllm: true
vllm_mode: server
vllm_server_host: 127.0.0.1
vllm_server_port: 8000
vllm_skip_weight_sync: true  # Mandatory when using standard vllm serve

...

### Lessons Learned

1.  **Weight Synchronization:** Be aware that the weight synchronization API is unavailable when using `vllm serve` instead of `swift rollout`.
2.  **API Compatibility:** Be cautious of discrepancies in response formats when using OpenAI API-compatible servers.
3.  **Process Management:** It is highly recommended to always include cleanup logic in distributed training scripts.

The plugin.json tells Claude this skill exists. The SKILL.md contains the actual knowledge. The references/ folder holds supporting material. Scripts go in scripts/ so people can copy them directly.

Each skill follows the same layout:

plugins/training/experiment-name/
├── .claude-plugin/
│   └── plugin.json          # Metadata and trigger conditions
├── skills/experiment-name/
│   └── SKILL.md             # The main knowledge document
├── references/
│   ├── experiment-log.md    # Daily experiment notes
│   └── troubleshooting.md   # Error → solution mappings
└── scripts/
    └── (reusable code)

This structure isn't arbitrary. We iterated on it over a few weeks. Earlier versions put everything in one file, which got unwieldy. Splitting troubleshooting into its own file made it easier to maintain and easier for Claude Code to search skills.

When you open a PR that touches the plugins/ folder, GitHub Actions validate the structure automatically.

  • Does plugin.json have the required fields?
  • Does SKILL.md exist?
  • Is the description field specific enough? (a sketch of that check follows the snippet below)
# From our GitHub Actions workflow
- name: Check SKILL.md exists
  run: |
    for plugin_json in $(find plugins -name "plugin.json" -path "*/.claude-plugin/*"); do
      plugin_dir=$(dirname $(dirname "$plugin_json"))
      skill_md=$(find "$plugin_dir/skills" -name "SKILL.md" | head -1)
      if [ -z "$skill_md" ]; then
        echo "❌ Missing SKILL.md: $plugin_dir"
        exit 1
      fi
    done
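
The snippet above only checks existence. The description-specificity check lives in scripts/validate_plugins.py, which isn't reproduced in this post; here's a rough sketch of what such a check might look like, with thresholds and heuristics that are illustrative rather than the real script's:

```python
import json
import pathlib
import sys

MIN_DESCRIPTION_LENGTH = 80  # illustrative threshold

def description_problems(plugin_json: pathlib.Path) -> list[str]:
    """Flag descriptions that are too short or missing explicit usage scenarios."""
    desc = json.loads(plugin_json.read_text()).get("description", "")
    problems = []
    if len(desc) < MIN_DESCRIPTION_LENGTH:
        problems.append(f"description too short ({len(desc)} chars)")
    if "(1)" not in desc and "When" not in desc:
        problems.append("no usage scenarios / trigger conditions listed")
    return problems

if __name__ == "__main__":
    failed = False
    for pj in pathlib.Path("plugins").rglob(".claude-plugin/plugin.json"):
        for msg in description_problems(pj):
            print(f"❌ {pj}: {msg}")
            failed = True
    sys.exit(1 if failed else 0)
```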

If validation passes, the PR gets a comment listing all plugins that will be affected. After merge, another action regenerates marketplace.json so the registry stays current without manual updates.

We published the above template at templates/experiment-skill-template/ in our repository. When you run /retrospective, Claude uses this template as a starting point. You can also copy it manually if you're writing a skill by hand.

The template is opinionated. It asks for things you might not think to include: hardware requirements, package versions, next steps. That's intentional. A skill written six months ago is only useful if someone can actually reproduce the environment.

Demo: Training a Small Transformer to Learn Addition

Let me walk through an actual research session. The goal is to train a transformer to learn integer addition using Base-100 tokenization. The parameter budget is 0.5M to 4M. I want to know which hyperparameter combination works best before committing to a long training run.
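
To make the setup concrete, here's roughly what Base-100 tokenization means for this task: every pair of decimal digits becomes one token in a 100-symbol vocabulary, so the sequences stay very short. The special-token ids and output format below are assumptions for illustration, not the project's actual encoding.

```python
def to_base100_tokens(n: int) -> list[int]:
    """Encode an integer as tokens in [0, 99], two decimal digits per token."""
    digits = str(n)
    if len(digits) % 2:          # pad to an even number of digits
        digits = "0" + digits
    return [int(digits[i:i + 2]) for i in range(0, len(digits), 2)]

def encode_example(a: int, b: int) -> list[int]:
    PLUS, EQ = 100, 101          # assumed special-token ids
    return to_base100_tokens(a) + [PLUS] + to_base100_tokens(b) + [EQ] + to_base100_tokens(a + b)

print(encode_example(1234, 87))  # [12, 34, 100, 87, 101, 13, 21]
```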

I start by typing /advise.


Claude searches the skill registry and finds relevant experiments. In this case, it pulls from the ColBERT FDE parameter search, not because the tasks are identical, but because the methodology transfers. The response includes a structured breakdown of my request, then maps lessons from past experiments onto my situation.

What comes back isn't generic advice. The ColBERT skill documented that ksim=4 works because "16 buckets fit token distribution" and that d_proj=32 causes information loss. Claude translates these into hypotheses for arithmetic the same way (e.g., document d_model, n_layers, pos_encoding, and output_format).

But here's where it gets interesting. I had been discussing infrastructure with Gemini 3 Deep Think earlier in the session. Claude integrates that context too.

Gemini's insight: "As total parameter count decreases, the bottleneck shifts from VRAM to computation." For small models under 10M parameters, GPU sits underutilized while CPU/IO becomes the constraint. Claude picks this up and suggests aggressive batch sizes for tiny models. This would never occur to me coming from LLM training where batch size 4 is normal.

The /advise response ends with a pre-experiment checklist and concrete code. RoPE theta values to sweep: 10, 30, 100, 500, 10000. Batch size search space: [512, 1024, 2048, 4096]. All before I've written a single line.

I ask Claude to create the experiment infrastructure.

It generates a full project: TECHSPEC.md with phases and success criteria, train.py at around 700 lines, evaluate.py, sweep.py, and a baseline config. The TECHSPEC defines three expected outcomes: best case (95%+ accuracy under 1M params), realistic case (95%+ at 2-3M), worst case (needing over 4M). Having these written down before training starts matters. It's too easy to move goalposts after seeing results.

I launch four baseline experiments: Upper bound (256d-6L, 3.18M params), Middle (192d-4L, 1.4M), Lower (64d-3L, 253K), and Tiny (32d-2L, 77K) as a negative control.

The error message is the same across all runs, raised in apply_rotary_pos_emb at line 377.

RuntimeError: The size of tensor a (32) must match the size of tensor b (16) at non-singleton dimension 3.

Claude reads the logs, identifies the issue, and explains: standard RoPE implementations output freqs with shape [seq_len, head_dim/2], but the attention layer expects [seq_len, head_dim]. For short sequences like arithmetic (under 32 tokens), this mismatch breaks everything.

The fix requires two changes: torch.cat((freqs, freqs), dim=-1) to match head dimension, and .unsqueeze(0).unsqueeze(0) for proper 4D broadcasting with query and key tensors. Claude Code patches train.py and reruns.
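
For reference, the fixed rotary embedding looks roughly like this. It's a minimal sketch of the two changes, not the project's actual train.py:

```python
import torch

def rotary_cos_sin(seq_len: int, head_dim: int, theta: float = 100.0):
    inv_freq = 1.0 / (theta ** (torch.arange(0, head_dim, 2).float() / head_dim))
    freqs = torch.outer(torch.arange(seq_len).float(), inv_freq)  # [seq_len, head_dim/2]
    freqs = torch.cat((freqs, freqs), dim=-1)                     # fix 1: match head_dim
    cos, sin = freqs.cos(), freqs.sin()
    # fix 2: add batch and head dims so they broadcast against [B, H, S, D]
    return cos.unsqueeze(0).unsqueeze(0), sin.unsqueeze(0).unsqueeze(0)

def rotate_half(x):
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat((-x2, x1), dim=-1)

def apply_rotary_pos_emb(q, k, cos, sin):
    # q, k: [batch, heads, seq_len, head_dim]
    return q * cos + rotate_half(q) * sin, k * cos + rotate_half(k) * sin
```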

Now the experiments complete.

The W&B dashboard shows all four runs. Upper bound hits 91.5% eval exact match by step 1900. The tiny model flatlines near zero. Claude queries W&B through MCP to pull the final numbers.

The scaling law is steep: 40x more parameters (77K to 3.18M) yields a 98x improvement in exact match. The minimum viable size for this task appears to be somewhere between 253K and 500K; the 253K model gets 37.6%, which is above random but insufficient. RoPE theta=100 works well for these short sequences.

Claude updates the TECHSPEC, marking Phase 1 complete and recording the baseline results.


Session over. Time to PR it.


Let's say that two days later, your teammate wants to run Phase 2. The question now: at the same parameter count, does a wide-shallow model or a narrow-deep model perform better?

I type /advise again to get hints from other "skills" before designing the experiments.


Claude finds the skill I created two days ago. It surfaces the Phase 1 results: the 90.62% at 3.18M, the 79.31% at 1.4M and notes that Phase 2 is ready.


It already has a hypothesis: for a fixed parameter count, wider-shallower beats narrower-deeper. The rationale: "Addition is a 'local' operation, doesn't need deep reasoning."

The response includes a controlled parameter-matched sweep design. Six pairs of wide vs deep configurations at identical param counts, plus a depth ablation to find the minimum layers needed for reliable carry propagation. Eighteen runs total.

I didn't have to memorize or re-explain the Phase 1 results, or re-derive the RoPE theta settings. Claude knew, because the skill existed. So the cycle continues: run experiments, hit problems, fix them, document everything, and the next session starts with all of that context already loaded.

The System Behind It

The skills registry handles knowledge. But running a thousand experiments a day requires more than good documentation.

The TECHSPEC is where it starts. Before Claude writes any code, I usually spend hours with top-tier models like Claude Opus 4.5, GPT-5.1 Pro and Gemini 3 DeepThink just reading. Prior papers, failed attempts, blog posts about stuff that exploded. That becomes maybe 20 pages of notes, which we distill into a markdown file. It's not a prompt. It's closer to a research contract. What we're trying to learn, which hypotheses matter, what parameter ranges to sweep, budget limits, what success looks like. Claude reads it and knows why each experiment exists. We call this spec-driven modeling. It's different from typing commands into a CLI and hoping.


The infrastructure side matters too. We run a system called Creep Colony that manages GPU containers on a HashiCorp Nomad cluster. The management service runs in Kubernetes, but the actual GPU workloads get scheduled onto Nomad client nodes. When an agent says "I need to sweep 200 configs," that becomes job specs. The system reads the TECHSPEC, estimates how much compute each config needs, and packs jobs onto the cluster. That's how you train over a thousand models in a day without opening a terminal.
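
The scheduler itself isn't public, but the shape of the idea is simple: expand the TECHSPEC's sweep grid into one job spec per config and hand those to the cluster. An illustrative sketch only; the sweep values and job fields below are made up:

```python
import itertools

sweep = {
    "d_model": [64, 128, 256],
    "n_layers": [2, 4, 6],
    "rope_theta": [10, 100, 10000],
}

def expand_jobs(sweep: dict, gpus_per_run: int = 1):
    """Yield one job payload per configuration in the sweep grid."""
    keys = list(sweep)
    for values in itertools.product(*(sweep[k] for k in keys)):
        cfg = dict(zip(keys, values))
        yield {
            "name": "addition-" + "-".join(f"{k}{v}" for k, v in cfg.items()),
            "resources": {"gpu": gpus_per_run},
            "command": ["python", "train.py"] + [f"--{k}={v}" for k, v in cfg.items()],
        }

jobs = list(expand_jobs(sweep))
print(len(jobs), "job specs to schedule")  # 27 configs in this toy grid
```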

Build your own Skill Registry today

You don't need our infrastructure. The skills registry, /advise, and /retrospective run on a GitHub repo with some Claude Code configuration.

I put the minimum viable setup in a GitHub Gist. Folder structure, CLAUDE.md, validation workflows. Copy it and modify for your team.

  • Start with the repository structure:
your-org/research-skills/
├── plugins/
│   ├── training/
│   │   ├── your-first-experiment/
│   │   │   ├── .claude-plugin/
│   │   │   │   └── plugin.json
│   │   │   ├── skills/your-first-experiment/
│   │   │   │   └── SKILL.md
│   │   │   ├── references/
│   │   │   │   └── experiment-log.md
│   │   │   └── scripts/
│   │   └── another-experiment/
│   └── evaluation/
├── templates/
│   └── experiment-skill-template/
├── scripts/
│   ├── validate_plugins.py
│   └── generate_marketplace.py
├── marketplace.json
└── CLAUDE.md

The CLAUDE.md file at the root is important. It tells Claude Code how to behave in this repository. Ours includes instructions for /advise and /retrospective commands, the skill template location, and rules about PR formatting. When someone clones the repo and runs Claude Code inside it, these instructions load automatically.

  • Here's a minimal CLAUDE.md:
# Research Skills Registry

## Commands

### /advise
Search the skills registry for relevant experiments before starting new work.
1. Read the user's goal
2. Search plugins/ for related skills by scanning description fields
3. Summarize relevant findings: what worked, what failed, recommended parameters

### /retrospective  
Save learnings from the current session as a new skill.
1. Summarize key findings from the conversation
2. Create a new plugin folder using templates/experiment-skill-template/
3. Fill in SKILL.md with: goal, what worked, what failed, final parameters
4. Create a branch and open a PR to main

## Skill Template
Use templates/experiment-skill-template/ as the base for new skills.

## Rules
- Every skill needs a specific description field with trigger conditions
- Always include a "Failed Attempts" table
- Include exact hyperparameters, not vague advice
  • The GitHub Actions do two things: validate PRs and update the marketplace index. Note that the marketplace.json is what Claude Code reads when someone runs /plugin marketplace add. It lists all available skills with their descriptions. The action regenerates it whenever a skill gets merged.
# .github/workflows/validate.yml
name: Validate Plugins

on:
  pull_request:
    paths: ['plugins/**']

jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      
      - name: Check plugin structure
        run: |
          for dir in plugins/*/*; do
            if [ -d "$dir/.claude-plugin" ]; then
              # Check plugin.json exists and has required fields
              jq -e '.name and .description and .skills' \
                "$dir/.claude-plugin/plugin.json" > /dev/null || \
                { echo "❌ Invalid plugin.json in $dir"; exit 1; }
              
              # Check SKILL.md exists
              find "$dir/skills" -name "SKILL.md" | grep -q . || \
                { echo "❌ Missing SKILL.md in $dir"; exit 1; }
              
              echo "✅ $dir"
            fi
          done

# .github/workflows/marketplace.yml
name: Update Marketplace

on:
  push:
    branches: [main]
    paths: ['plugins/**']

jobs:
  update:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      
      - name: Generate marketplace.json
        run: python scripts/generate_marketplace.py
      
      - name: Commit if changed
        run: |
          git config user.name "github-actions"
          git config user.email "actions@github.com"
          git add marketplace.json
          git diff --staged --quiet || git commit -m "Update marketplace.json"
          git push
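
The generate_marketplace.py called above isn't shown in this post. Here's a sketch of what it might do, collecting every plugin.json into a flat index; the exact marketplace.json schema Claude Code expects may differ, so check the plugin marketplace docs before copying this verbatim:

```python
import json
import pathlib

def build_index(root: str = "plugins") -> dict:
    """Collect every plugin's metadata into a single marketplace index."""
    plugins = []
    for pj in sorted(pathlib.Path(root).rglob(".claude-plugin/plugin.json")):
        meta = json.loads(pj.read_text())
        plugins.append({
            "name": meta["name"],
            "description": meta["description"],
            "version": meta.get("version", "1.0.0"),
            "source": str(pj.parent.parent),  # path to the plugin folder
        })
    return {"name": "research-skills", "plugins": plugins}

if __name__ == "__main__":
    pathlib.Path("marketplace.json").write_text(json.dumps(build_index(), indent=2) + "\n")
```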

Team onboarding is three steps.

  • First, everyone installs the registry.
# Run these inside Claude Code
/plugin marketplace add your-org/research-skills
/plugin install all-skills@research-skills
  • Second, they read the CLAUDE.md to understand the commands.

  • Third, seed the registry with three or four real experiments from your team's recent work. Someone on the team will need to create these first few skills manually; the system bootstraps from examples, and if your registry is empty, /advise returns nothing useful and people stop using it. Include the failures. Especially include the failures.

The cultural part is harder than the technical part

The skills registry creates a single source of truth for experimental knowledge: failed approaches, working configurations, "I wish someone had told me this" moments. Claude Code becomes the interface to all of it.

Product teams benefit too. When a researcher documents why a certain approach failed, that context becomes accessible to anyone scoping the next project. The gap between "what research knows" and "what product assumes" shrinks. Specifications get grounded in actual experimental results instead of optimistic guesses.

The hard part is getting people to contribute. Researchers are busy. Writing documentation feels like overhead. The trick is making the contribution path so frictionless that it costs more effort to skip it than to do it. /retrospective takes thirty seconds. Claude does the writing. You just approve the PR.

Once the registry has enough mass, the incentives flip. People start contributing because they want their work to be findable. They want credit for the hard problems they solved. They want the new hire to know which bug fixes came from them.

Every team has years of tacit knowledge scattered across Slack threads, abandoned notebooks, and memories that fade. Claude Code can't recover what's already lost. But it can stop the bleeding. From now on, everything gets captured. Everything compounds.

The researchers who adopt this fastest aren't the most organized ones. They're the ones who got burned. The ones who spent a week on a problem, only to discover a teammate solved it last month. That pain converts into habit quickly.

If you're already using Claude Code for experiments, you're halfway there. The skills layer is just making explicit what Claude could do for you if it knew what your team knew.

Give it that knowledge, then ultrathink. Watch what happens :)
