tperes committed · Commit d463150 · verified · 1 Parent(s): 64bde10

Update README.md

Files changed (1)
  1. README.md +149 -43
README.md CHANGED
@@ -196,17 +196,27 @@ response = generate(model, tokenizer, prompt=prompt, max_tokens=200)
 
  Apache 2.0
 
  ---
 
- # Original model card: palmyra-mini-thinking-b
 
- ## Model Details
 
- **Model Name:** palmyra-mini-thinking-b
 
- **Version:** 1.0
 
- **Type:** Generative AI Language Model
 
  ## Introduction
 
@@ -220,44 +230,140 @@ The model's mathematical abilities are particularly noteworthy. It achieves an i
 
  Beyond mathematics, Palmyra-mini-thinking-b demonstrates strong performance in the competitive programming arena. Its score of 0.6343 on the Codeforces (pass_rate) benchmark underscores its ability to understand complex algorithmic problems and generate correct, efficient code. This capability suggests the model is well-suited for tasks involving code generation, debugging, and algorithmic design, making it a valuable asset for software developers and computer science researchers.
 
- ## Benchmark Scores
-
- | Benchmark | Score |
- |:-----------------------------------------------------------------|---------:|
- | gsm8k (strict-match) | 0.4268 |
- | minerva_math (exact_match) | 0.0708 |
- | mmlu_pro (exact_match) | 0.2926 |
- | hendrycks_math | 0.0016 |
- | ifeval (inst_level_loose_acc) | 0.3297 |
- | mathqa (acc) | 0.3045 |
- | humaneval (pass@1) | 0.0732 |
- | BBH (get-answer)(exact_match) | 0.288 |
- | mbpp | 0.168 |
- | leaderboard_musr (acc_norm) | 0.3796 |
- | gpqa lighteval gpqa diamond_pass@1:8_samples | 0.3958 |
- | AIME24 (pass@1)(avg-of-1) | 0.6 |
- | AIME25 (pass@1)(avg-of-1) | 0.5 |
- | Livecodebench-codegen (livecodebench/code_generation_lite v4_v5) | 0.2873 |
- | AMC23 | 0.925 |
- | MATH500 | 0.882 |
- | Minerva | 0.2941 |
- | Olympiadbench (extractive_match) | 0.5733 |
- | Codecontests (pass_rate) | 0.2018 |
- | Codeforces (pass_rate) | 0.6343 |
- | Taco (pass_rate) | 0.3456 |
- | APPS (all_levels) | 0.0584 |
- | HMMT23 (extractive_match) | 0.2333 |
- | Average | 0.359378 |
-
- ## Intended Use
-
- This model is intended for research and development in the field of generative AI, particularly for tasks requiring mathematical and logical reasoning.
-
- ## Limitations
-
- The model's performance has been evaluated on a specific set of benchmarks. Its performance on other tasks or in real-world applications may vary.
 
  ## Ethical Considerations
 
- As with any language model, there is a potential for generating biased or inaccurate information. Users should be aware of these limitations and use the model responsibly.
 
 
  Apache 2.0
 
+ #### Original model card below:
+
  ---
 
+ <div align="center">
+ <h1>Palmyra-mini-thinking-b</h1>
+ </div>
+
+ <p align="center">
+ <img src="https://huggingface.co/Writer/palmyra-mini-thinking-b/resolve/main/logo-mini-b%20benchmark-performance.png?download=true" width="800"/>
+ </p>
+
+ ### Model Description
+
+ - **Language(s) (NLP):** English
+ - **License:** Apache-2.0
+ - **Finetuned from model:** Qwen/Qwen2.5-1.5B
+ - **Context window:** 131,072 tokens
+ - **Parameters:** 1.7 billion
 
  ## Introduction
 
  Beyond mathematics, Palmyra-mini-thinking-b demonstrates strong performance in the competitive programming arena. Its score of 0.6343 on the Codeforces (pass_rate) benchmark underscores its ability to understand complex algorithmic problems and generate correct, efficient code. This capability suggests the model is well-suited for tasks involving code generation, debugging, and algorithmic design, making it a valuable asset for software developers and computer science researchers.
 
+ ## Benchmark Scores (sampling params: temperature 0.6, top_p 0.95)
+
+ Pass@1 (avg-of-64)
+
+ | Benchmark | Pass@1 (avg-of-64) | Majority@64 |
+ | :-------- | :----------------- | :---------- |
+ | AIME24 | 59.43% | 71.67% |
+ | AIME25 | 49.69% | 60.00% |
+ | GPQA | 42.01% | 47.22% |
+ | HMMT25 | 27.86% | 30.00% |
+ | HLE | 5.22% | N/A |
+ | MMLU-PRO | 55.49% | 60.60% |
+ | MATH500 | 93.80% | 95.40% |
+ | LCB | 34.51% | N/A |
+
+ LCB here is LiveCodeBench, version v6_2408_2505.
+
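+ For reference, the two columns can be computed from k sampled answers as follows (an illustrative sketch assuming exact-match scoring; this is not the actual evaluation harness):
+
+ ```py
+ from collections import Counter
+
+ def pass_at_1_avg_of_k(answers, reference):
+     # Average per-sample accuracy: the fraction of the k sampled
+     # answers that match the reference (Pass@1, avg-of-k).
+     return sum(a == reference for a in answers) / len(answers)
+
+ def majority_at_k(answers, reference):
+     # Majority voting: 1.0 if the most frequent of the k sampled
+     # answers matches the reference, else 0.0 (Majority@k).
+     most_common, _ = Counter(answers).most_common(1)[0]
+     return float(most_common == reference)
+ ```
+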
+ Pass@1 (avg-of-1)
+
+ | Benchmark | Score (%) |
+ |:-----------------------------------------------------------------|------------:|
+ | GSM8K (strict-match) | 42.68% |
+ | Minerva Math (exact match) | 7.08% |
+ | MMLU-PRO (exact match) | 29.26% |
+ | MATH (Hendrycks) | 0.16% |
+ | IFEval (inst_level_loose_acc) | 32.97% |
+ | MathQA (acc) | 30.45% |
+ | HumanEval (pass@1) | 7.32% |
+ | BBH (get-answer)(exact match) | 28.80% |
+ | MBPP | 16.80% |
+ | GPQA (diamond, pass@1: 8 samples) | 39.58% |
+ | AIME24 (pass@1)(avg-of-1) | 60.00% |
+ | AIME25 (pass@1)(avg-of-1) | 50.00% |
+ | Livecodebench-codegen (livecodebench/code_generation_lite v4_v5) | 28.73% |
+ | AMC23 | 92.50% |
+ | MATH500 | 88.20% |
+ | Minerva | 29.41% |
+ | Olympiadbench (extractive_match) | 57.33% |
+ | Codecontests (pass_rate) | 20.18% |
+ | Codeforces (pass_rate) | 63.43% |
+ | Taco (pass_rate) | 34.56% |
+ | APPS (all_levels) | 5.84% |
+ | HMMT (Feb 2025) (extractive_match) | 23.33% |
+ | Average | 35.94% |
+
+ ### Use with transformers
+
+ You can run conversational inference using the Transformers Auto classes with the `generate()` function. Here's an example:
+
+ ```py
+ import torch
+ from transformers import AutoTokenizer, AutoModelForCausalLM
+
+ model_id = "Writer/palmyra-mini-thinking-b"
+
+ tokenizer = AutoTokenizer.from_pretrained(model_id)
+
+ # flash_attention_2 requires the flash-attn package; drop the
+ # attn_implementation argument to fall back to the default attention.
+ model = AutoModelForCausalLM.from_pretrained(
+     model_id,
+     torch_dtype=torch.float16,
+     device_map="auto",
+     attn_implementation="flash_attention_2",
+ )
+
+ messages = [
+     {
+         "role": "user",
+         "content": "You have a 3-liter jug and a 5-liter jug. How can you measure exactly 4 liters of water?"
+     }
+ ]
+
+ # Apply the chat template, tokenize, and move the inputs to the model's device.
+ input_ids = tokenizer.apply_chat_template(
+     messages, tokenize=True, add_generation_prompt=True, return_tensors="pt"
+ ).to(model.device)
+
+ gen_conf = {
+     "max_new_tokens": 256,
+     "eos_token_id": tokenizer.eos_token_id,
+     "do_sample": True,  # enable sampling so temperature/top_p take effect
+     "temperature": 0.3,
+     "top_p": 0.9,
+ }
+
+ with torch.inference_mode():
+     output_id = model.generate(input_ids, **gen_conf)
+
+ # Decode only the newly generated tokens, skipping the prompt.
+ output_text = tokenizer.decode(output_id[0][input_ids.shape[1]:])
+
+ print(output_text)
+ ```
+
+ ## Running with vLLM
+
+ Start an OpenAI-compatible server:
+
+ ```sh
+ vllm serve Writer/palmyra-mini-thinking-b
+ ```
+
+ Then query it, for example with curl:
+
+ ```sh
+ curl -X POST http://localhost:8000/v1/chat/completions \
+   -H "Content-Type: application/json" \
+   -d '{
+     "model": "Writer/palmyra-mini-thinking-b",
+     "messages": [
+       {
+         "role": "user",
+         "content": "You have a 3-liter jug and a 5-liter jug. How can you measure exactly 4 liters of water?"
+       }
+     ],
+     "max_tokens": 8000,
+     "temperature": 0.2
+   }'
+ ```
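+
+ The same request can be sent from Python with the `openai` client (a minimal sketch, assuming the server above is running locally and the `openai` package is installed):
+
+ ```py
+ from openai import OpenAI
+
+ # vLLM's OpenAI-compatible server does not check the API key by default.
+ client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
+
+ response = client.chat.completions.create(
+     model="Writer/palmyra-mini-thinking-b",
+     messages=[
+         {
+             "role": "user",
+             "content": "You have a 3-liter jug and a 5-liter jug. How can you measure exactly 4 liters of water?",
+         }
+     ],
+     max_tokens=8000,
+     temperature=0.2,
+ )
+ print(response.choices[0].message.content)
+ ```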
 
  ## Ethical Considerations
 
+ As with any language model, there is a potential for generating biased or inaccurate information. Users should be aware of these limitations and use the model responsibly.
+
+ ### Footnotes
+
+ - Base model: this model builds on NVIDIA's OpenReasoning-Nemotron-1.5B (`https://huggingface.co/nvidia/OpenReasoning-Nemotron-1.5B`).
+ - Evaluation methodology:
+   - Pass@1 (avg-of-1): computed using `lm_eval` and `lighteval`; see the sketch after this list.
+   - Pass@1 (avg-of-64) and Majority@64: computed using `nemoskills`.
+
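+ For the avg-of-1 numbers, a single-task run with lm-evaluation-harness looks roughly like this (a sketch only: task names, flags, and defaults vary by harness version, and this is not the exact evaluation script used):
+
+ ```py
+ # Requires lm-evaluation-harness: pip install lm-eval
+ import lm_eval
+
+ results = lm_eval.simple_evaluate(
+     model="hf",
+     model_args="pretrained=Writer/palmyra-mini-thinking-b,dtype=float16",
+     tasks=["gsm8k"],
+     batch_size=8,
+ )
+ print(results["results"]["gsm8k"])
+ ```
+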
+ ### Citation and Related Information
+
+ To cite this model:
+
+ ```
+ @misc{Palmyra-mini-thinking-b,
+     author = {Writer Engineering team},
+     title = {{Palmyra-mini: A powerful LLM designed for math and coding}},
+     howpublished = {\url{https://dev.writer.com}},
+     year = 2025,
+     month = Sep
+ }
+ ```
+
+ Contact: Hello@writer.com