Update README.md

2025-08-16 13:35:54 +08:00 · 2025-01-15 10:30:16 +00:00 · 2025-01-15 10:30:16 +00:00 · 4a6fbe8a38
commit 4a6fbe8a38
parent 9d7da8daad
1 changed file with 205 additions and 39 deletions
--- a/README.md
+++ b/README.md
@ -1,3 +1,7 @@
+---
+license: apache-2.0
+---
+
 # InternLM 


@ -5,7 +9,6 @@
 <div align="center">
 <img src="https://github.com/InternLM/InternLM/assets/22529082/b9788105-8892-4398-8b47-b513a292378e" width="200"/>

-
  <div>&nbsp;</div>
  <div align="center">
    <b><font size="5">InternLM</font></b>
@ -18,7 +21,6 @@
  </div>


-
 [![evaluation](https://github.com/InternLM/InternLM/assets/22529082/f80a2a58-5ddf-471a-8da4-32ab65c8fd3b)](https://github.com/internLM/OpenCompass/)

 [💻Github Repo](https://github.com/InternLM/InternLM) • [🤔Reporting Issues](https://github.com/InternLM/InternLM/issues/new) • [📜Technical Report](https://arxiv.org/abs/2403.17297)
@ -31,15 +33,14 @@



-
 ## Introduction 

 InternLM3 has open-sourced an 8-billion parameter instruction model, InternLM3-8B-Instruct, designed for general-purpose usage and advanced reasoning. This model has the following characteristics:

 - **Enhanced performance at reduced cost**: 
-  State-of-the-art performance on reasoning and knowledge-intensive tasks, surpassing models like Llama3.1-8B and Qwen2.5-7B. Remarkably, InternLM3 is trained on only 4 trillion high-quality tokens, saving more than 75% of the training cost compared to other LLMs of similar scale.
+State-of-the-art performance on reasoning and knowledge-intensive tasks, surpassing models like Llama3.1-8B and Qwen2.5-7B. Remarkably, InternLM3 is trained on only 4 trillion high-quality tokens, saving more than 75% of the training cost compared to other LLMs of similar scale.
 - **Deep thinking capability**:
-  InternLM3 supports both a deep thinking mode, which solves complicated reasoning tasks via long chain-of-thought, and a normal response mode for fluent user interactions.
+InternLM3 supports both a deep thinking mode, which solves complicated reasoning tasks via long chain-of-thought, and a normal response mode for fluent user interactions.

 ## InternLM3-8B-Instruct

@ -70,9 +71,7 @@ We conducted a comprehensive evaluation of InternLM using the open-source evalua
 - The evaluation data may have numerical differences due to the version iteration of [OpenCompass](https://github.com/internLM/OpenCompass/), so please refer to the latest evaluation results of [OpenCompass](https://github.com/internLM/OpenCompass/).

 **Limitations:** Although we have made efforts to ensure the safety of the model during the training process and to encourage the model to generate text that complies with ethical and legal requirements, the model may still produce unexpected outputs due to its size and probabilistic generation paradigm. For example, the generated responses may contain biases, discrimination, or other harmful content. Please do not propagate such content. We are not responsible for any consequences resulting from the dissemination of harmful information.
-
 ### Requirements
-
 ```python
 transformers >= 4.48
 ```
@ -80,7 +79,7 @@ transformers >= 4.48

 ### Conversation Mode

-#### Modelscope inference
+#### Transformers inference

 To load the InternLM3 8B Instruct model using Transformers, use the following code:

@ -90,7 +89,7 @@ from modelscope import snapshot_download, AutoTokenizer, AutoModelForCausalLM
 model_dir = snapshot_download('Shanghai_AI_Laboratory/internlm3-8b-instruct')
 tokenizer = AutoTokenizer.from_pretrained(model_dir, trust_remote_code=True)
 # Set `torch_dtype=torch.bfloat16` to load the model in bfloat16; otherwise it is loaded as float32 and may cause an OOM error.
-# model = AutoModelForCausalLM.from_pretrained(model_dir, trust_remote_code=True, torch_dtype=torch.float16)
+model = AutoModelForCausalLM.from_pretrained(model_dir, trust_remote_code=True, torch_dtype=torch.bfloat16).cuda()
 # (Optional) If on low resource devices, you can load model in 4-bit or 8-bit to further save GPU memory via bitsandbytes.
  # InternLM3 8B in 4bit will cost nearly 8GB GPU memory.
  # pip install -U bitsandbytes
@ -105,7 +104,7 @@ messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": "Please tell me five scenic spots in Shanghai"},
 ]
-tokenized_chat = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt")
+tokenized_chat = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt").to("cuda")

 generated_ids = model.generate(tokenized_chat, max_new_tokens=1024, temperature=1, repetition_penalty=1.005, top_k=40, top_p=0.8)

@ -114,12 +113,11 @@ generated_ids = [
 ]
 prompt = tokenizer.batch_decode(tokenized_chat)[0]
 print(prompt)
-response = tokenizer.batch_decode(generated_ids)[0]
+response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
 print(response)
 ```
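
The `model.generate` call above passes `temperature=1`, `top_k=40`, and `top_p=0.8`. As a rough illustration of how these three parameters reshape a next-token distribution, here is a minimal pure-Python sketch. The function name `top_k_top_p_filter` and the toy logits are hypothetical, `repetition_penalty` is left out, and this is a simplification for intuition, not the Transformers implementation.

```python
import math


def top_k_top_p_filter(logits, top_k=40, top_p=0.8, temperature=1.0):
    """Return {token_index: probability} after temperature, top-k, then top-p filtering.

    Hypothetical helper for illustration only.
    """
    # Temperature rescales the logits before anything else.
    scaled = [x / temperature for x in logits]
    # Top-k: keep only the k highest-scoring token indices.
    ranked = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)[:top_k]
    # Softmax over the surviving tokens (shifted by the max for numerical stability).
    peak = max(scaled[i] for i in ranked)
    exps = {i: math.exp(scaled[i] - peak) for i in ranked}
    total = sum(exps.values())
    probs = {i: e / total for i, e in exps.items()}
    # Top-p: keep the shortest prefix whose cumulative probability reaches top_p.
    kept = {}
    cumulative = 0.0
    for i in ranked:
        kept[i] = probs[i]
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    # Renormalise so the kept probabilities sum to 1.
    norm = sum(kept.values())
    return {i: p / norm for i, p in kept.items()}


# Toy example with four made-up logits.
print(top_k_top_p_filter([2.0, 1.0, 0.0, -1.0], top_k=3, top_p=0.8))
```

Lower `top_p` or `top_k` values narrow the candidate set and make output more conservative; higher values allow more varied continuations.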

 #### LMDeploy inference
-
 LMDeploy is a toolkit for compressing, deploying, and serving LLMs, developed by the MMRazor and MMDeploy teams.

 ```bash
@ -162,15 +160,56 @@ Find more details in the [LMDeploy documentation](https://lmdeploy.readthedocs.i

 ####  Ollama inference

-TODO
+First, install Ollama, pull the model, and install the Python client:
+```bash
+# install ollama
+curl -fsSL https://ollama.com/install.sh | sh
+# fetch model
+ollama pull internlm/internlm3-8b-instruct
+# install the ollama Python client
+pip install ollama
+```

+Inference code:
+
+```python
+import ollama
+
+system_prompt = """You are an AI assistant whose name is InternLM (书生·浦语).
+- InternLM (书生·浦语) is a conversational language model that is developed by Shanghai AI Laboratory (上海人工智能实验室). It is designed to be helpful, honest, and harmless.
+- InternLM (书生·浦语) can understand and communicate fluently in the language chosen by the user such as English and 中文."""
+
+messages = [
+    {
+        "role": "system",
+        "content": system_prompt,
+    },
+    {
+        "role": "user",
+        "content": "Please tell me five scenic spots in Shanghai"
+    },
+]
+
+stream = ollama.chat(
+    model='internlm/internlm3-8b-instruct',
+    messages=messages,
+    stream=True,
+)
+
+for chunk in stream:
+  print(chunk['message']['content'], end='', flush=True)
+```
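
The loop above prints each streamed chunk and then discards it. If the full reply is also needed afterwards (for logging or post-processing), the pieces can be accumulated while printing. The sketch below is one way to do that; `collect_stream` is a hypothetical helper, and it assumes only that each chunk exposes `chunk['message']['content']` as in the example above. A list of fake chunks stands in for the `ollama.chat(..., stream=True)` iterator so the snippet runs without an Ollama server.

```python
def collect_stream(chunks):
    """Print streamed chunks as they arrive and return the full response text.

    Hypothetical helper; `chunks` is any iterable of objects shaped like
    {'message': {'content': str}}, e.g. the iterator from ollama.chat(stream=True).
    """
    parts = []
    for chunk in chunks:
        piece = chunk["message"]["content"]
        print(piece, end="", flush=True)  # live output, same as the loop above
        parts.append(piece)
    print()  # finish the line once the stream ends
    return "".join(parts)


# Stand-in for a real stream so the example is self-contained.
fake_stream = [
    {"message": {"content": "The Bund"}},
    {"message": {"content": ", Yu Garden"}},
]
full_text = collect_stream(fake_stream)
```

With a running server, the same helper could be called as `collect_stream(stream)` in place of the `for` loop.
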
 #### vLLM inference

 We are still working on merging the PR (https://github.com/vllm-project/vllm/pull/12037) into vLLM. In the meantime, please install vLLM manually from the PR branch as follows.

 ```bash
 git clone -b support-internlm3 https://github.com/RunningLeon/vllm.git
-pip install -e . 
+# then follow https://docs.vllm.ai/en/latest/getting_started/installation/gpu/index.html#build-wheel-from-source to install
+cd vllm
+python use_existing_torch.py
+pip install -r requirements-build.txt
+pip install -e . --no-build-isolation
 ```

 Inference code:
@ -206,7 +245,6 @@ print(outputs)


 ### Thinking Mode
-
 #### Thinking Demo

 <img src="https://github.com/InternLM/InternLM/blob/017ba7446d20ecc3b9ab8e7b66cc034500868ab4/assets/solve_puzzle.png?raw=true" width="400"/>
@ -218,7 +256,6 @@ print(outputs)


 #### Thinking system prompt
-
 ```python
 thinking_system_prompt = """You are an expert mathematician with extensive experience in mathematical competitions. You approach problems through systematic thinking and rigorous reasoning. When solving problems, follow these thought processes:
 ## Deep Understanding
@ -266,16 +303,14 @@ When you're ready, present your complete solution with:
 Focus on clear, logical progression of ideas and thorough explanation of your mathematical reasoning. Provide answers in the same language as the user asking the question, repeat the final answer using a '\\boxed{}' without any units, you have [[8192]] tokens to complete the answer.
 """
 ```
-
 #### Transformers inference
-
 ```python
 import torch
 from modelscope import snapshot_download, AutoTokenizer, AutoModelForCausalLM
 model_dir = snapshot_download('Shanghai_AI_Laboratory/internlm3-8b-instruct')
 tokenizer = AutoTokenizer.from_pretrained(model_dir, trust_remote_code=True)
 # Set `torch_dtype=torch.bfloat16` to load the model in bfloat16; otherwise it is loaded as float32 and may cause an OOM error.
-model = AutoModelForCausalLM.from_pretrained(model_dir, trust_remote_code=True, torch_dtype=torch.float16)
+model = AutoModelForCausalLM.from_pretrained(model_dir, trust_remote_code=True, torch_dtype=torch.bfloat16).cuda()
 # (Optional) If on low resource devices, you can load model in 4-bit or 8-bit to further save GPU memory via bitsandbytes.
  # InternLM3 8B in 4bit will cost nearly 8GB GPU memory.
  # pip install -U bitsandbytes
@ -287,7 +322,7 @@ messages = [
    {"role": "system", "content": thinking_system_prompt},
    {"role": "user", "content": "Given the function\(f(x)=\mathrm{e}^{x}-ax - a^{3}\),\n(1) When \(a = 1\), find the equation of the tangent line to the curve \(y = f(x)\) at the point \((1,f(1))\).\n(2) If \(f(x)\) has a local minimum and the minimum value is less than \(0\), determine the range of values for \(a\)."},
 ]
-tokenized_chat = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt")
+tokenized_chat = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt").to("cuda")

 generated_ids = model.generate(tokenized_chat, max_new_tokens=8192)

@ -296,10 +331,9 @@ generated_ids = [
 ]
 prompt = tokenizer.batch_decode(tokenized_chat)[0]
 print(prompt)
-response = tokenizer.batch_decode(generated_ids)[0]
+response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
 print(response)
 ```
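
The comments in the block above warn that a float32 load may run out of memory and that a 4-bit load needs nearly 8GB. A back-of-the-envelope, weights-only estimate shows why precision matters so much. This sketch assumes exactly 8e9 parameters (the "8B" in the model name; the real count may differ) and ignores activations, the KV cache, and framework overhead, so actual usage will be higher than these figures. `weight_memory_gib` is a hypothetical helper.

```python
def weight_memory_gib(num_params, bits_per_param):
    """Memory needed to hold the weights alone, in GiB (hypothetical helper)."""
    return num_params * bits_per_param / 8 / 1024**3


NUM_PARAMS = 8e9  # assumption: a round 8 billion parameters

for name, bits in [("float32", 32), ("bfloat16/float16", 16), ("int8", 8), ("4-bit", 4)]:
    print(f"{name:>16}: {weight_memory_gib(NUM_PARAMS, bits):.1f} GiB")
```

Halving the precision halves the weight footprint, which is why the half-precision load is the default in the example and 4-bit or 8-bit quantization is suggested for low-resource devices.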
-
 #### LMDeploy inference

 LMDeploy is a toolkit for compressing, deploying, and serving LLMs.
@ -327,15 +361,56 @@ print(response)

 ####  Ollama inference

-TODO
+First, install Ollama, pull the model, and install the Python client:
+
+```bash
+# install ollama
+curl -fsSL https://ollama.com/install.sh | sh
+# fetch model
+ollama pull internlm/internlm3-8b-instruct
+# install the ollama Python client
+pip install ollama
+```
+
+Inference code:
+
+```python
+import ollama
+
+messages = [
+    {
+        "role": "system",
+        "content": thinking_system_prompt,
+    },
+    {
+        "role": "user",
+        "content": "Given the function\(f(x)=\mathrm{e}^{x}-ax - a^{3}\),\n(1) When \(a = 1\), find the equation of the tangent line to the curve \(y = f(x)\) at the point \((1,f(1))\).\n(2) If \(f(x)\) has a local minimum and the minimum value is less than \(0\), determine the range of values for \(a\)."
+    },
+]
+
+stream = ollama.chat(
+    model='internlm/internlm3-8b-instruct',
+    messages=messages,
+    stream=True,
+)
+
+for chunk in stream:
+  print(chunk['message']['content'], end='', flush=True)
+```
+
+
+#### 

 #### vLLM inference

 We are still working on merging the PR (https://github.com/vllm-project/vllm/pull/12037) into vLLM. In the meantime, please install vLLM manually from the PR source as follows.
-
 ```bash
 git clone https://github.com/RunningLeon/vllm.git
-pip install -e .
+# then follow https://docs.vllm.ai/en/latest/getting_started/installation/gpu/index.html#build-wheel-from-source to install
+cd vllm
+python use_existing_torch.py
+pip install -r requirements-build.txt
+pip install -e . --no-build-isolation
 ```

 Inference code:
@ -391,9 +466,9 @@ Code and model weights are licensed under Apache-2.0.
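 InternLM3, the third generation of the InternLM (书生·浦语) large language model, has open-sourced an 8-billion-parameter instruction model, InternLM3-8B-Instruct, designed for general-purpose usage and advanced reasoning. The model has the following characteristics:

 - **Enhanced performance at reduced cost**:
-  State-of-the-art performance among models of similar scale on reasoning and knowledge-intensive tasks, surpassing Llama3.1-8B and Qwen2.5-7B. Notably, InternLM3 was trained on only 4 trillion tokens, saving more than 75% of the training cost compared with models of similar scale.
+State-of-the-art performance among models of similar scale on reasoning and knowledge-intensive tasks, surpassing Llama3.1-8B and Qwen2.5-7B. Notably, InternLM3 was trained on only 4 trillion tokens, saving more than 75% of the training cost compared with models of similar scale.
 - **Deep thinking capability**:
-  InternLM3 supports a deep thinking mode that solves complicated reasoning tasks via long chain-of-thought, as well as a normal response mode for more fluent user interactions.
+InternLM3 supports a deep thinking mode that solves complicated reasoning tasks via long chain-of-thought, as well as a normal response mode for more fluent user interactions.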

 #### Performance Evaluation

@ -444,7 +519,7 @@ from modelscope import snapshot_download, AutoTokenizer, AutoModelForCausalLM
 model_dir = snapshot_download('Shanghai_AI_Laboratory/internlm3-8b-instruct')
 tokenizer = AutoTokenizer.from_pretrained(model_dir, trust_remote_code=True)
 # Set `torch_dtype=torch.bfloat16` to load the model in bfloat16; otherwise it is loaded as float32 and may cause an OOM error.
-# model = AutoModelForCausalLM.from_pretrained(model_dir, trust_remote_code=True, torch_dtype=torch.float16)
+model = AutoModelForCausalLM.from_pretrained(model_dir, trust_remote_code=True, torch_dtype=torch.bfloat16).cuda()
 # (Optional) If on low resource devices, you can load model in 4-bit or 8-bit to further save GPU memory via bitsandbytes.
  # InternLM3 8B in 4bit will cost nearly 8GB GPU memory.
  # pip install -U bitsandbytes
@ -459,7 +534,7 @@ messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": "Please tell me five scenic spots in Shanghai"},
 ]
-tokenized_chat = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt")
+tokenized_chat = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt").to("cuda")

 generated_ids = model.generate(tokenized_chat, max_new_tokens=1024, temperature=1, repetition_penalty=1.005, top_k=40, top_p=0.8)

@ -468,7 +543,7 @@ generated_ids = [
 ]
 prompt = tokenizer.batch_decode(tokenized_chat)[0]
 print(prompt)
-response = tokenizer.batch_decode(generated_ids)[0]
+response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
 print(response)
 ```

@ -516,15 +591,60 @@ curl http://localhost:23333/v1/chat/completions \

 #####  Ollama inference

-TODO
+Preparation:

+```bash
+# install ollama
+curl -fsSL https://ollama.com/install.sh | sh
+# fetch model
+ollama pull internlm/internlm3-8b-instruct
+# install the ollama Python client
+pip install ollama
+```
+
+Inference code:
+
+```python
+import ollama
+
+system_prompt = """You are an AI assistant whose name is InternLM (书生·浦语).
+- InternLM (书生·浦语) is a conversational language model that is developed by Shanghai AI Laboratory (上海人工智能实验室). It is designed to be helpful, honest, and harmless.
+- InternLM (书生·浦语) can understand and communicate fluently in the language chosen by the user such as English and 中文."""
+
+messages = [
+    {
+        "role": "system",
+        "content": system_prompt,
+    },
+    {
+        "role": "user",
+        "content": "Please tell me five scenic spots in Shanghai"
+    },
+]
+
+stream = ollama.chat(
+    model='internlm/internlm3-8b-instruct',
+    messages=messages,
+    stream=True,
+)
+
+for chunk in stream:
+  print(chunk['message']['content'], end='', flush=True)
+```
+
+
+#### 
 ##### vLLM inference

 We are still working on merging the PR (https://github.com/vllm-project/vllm/pull/12037) into vLLM. For now, please install vLLM manually from the PR source as follows.

 ```bash
 git clone https://github.com/RunningLeon/vllm.git
-pip install -e .
+# then follow https://docs.vllm.ai/en/latest/getting_started/installation/gpu/index.html#build-wheel-from-source to install
+cd vllm
+python use_existing_torch.py
+pip install -r requirements-build.txt
+pip install -e . --no-build-isolation
 ```

 Inference code:
@ -624,7 +744,7 @@ from modelscope import snapshot_download, AutoTokenizer, AutoModelForCausalLM
 model_dir = snapshot_download('Shanghai_AI_Laboratory/internlm3-8b-instruct')
 tokenizer = AutoTokenizer.from_pretrained(model_dir, trust_remote_code=True)
 # Set `torch_dtype=torch.float16` to load model in float16, otherwise it will be loaded as float32 and might cause OOM Error.
-model = AutoModelForCausalLM.from_pretrained(model_dir, trust_remote_code=True, torch_dtype=torch.float16)
+model = AutoModelForCausalLM.from_pretrained(model_dir, trust_remote_code=True, torch_dtype=torch.float16).cuda()
 # (Optional) If on low resource devices, you can load model in 4-bit or 8-bit to further save GPU memory via bitsandbytes.
  # InternLM3 8B in 4bit will cost nearly 8GB GPU memory.
  # pip install -U bitsandbytes
@ -636,7 +756,7 @@ messages = [
    {"role": "system", "content": thinking_system_prompt},
    {"role": "user", "content": "已知函数\(f(x)=\mathrm{e}^{x}-ax - a^{3}\)。\n（1）当\(a = 1\)时，求曲线\(y = f(x)\)在点\((1,f(1))\)处的切线方程；\n（2）若\(f(x)\)有极小值，且极小值小于\(0\)，求\(a\)的取值范围。"},
 ]
-tokenized_chat = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt")
+tokenized_chat = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt").to("cuda")

 generated_ids = model.generate(tokenized_chat, max_new_tokens=8192)

@ -645,10 +765,9 @@ generated_ids = [
 ]
 prompt = tokenizer.batch_decode(tokenized_chat)[0]
 print(prompt)
-response = tokenizer.batch_decode(generated_ids)[0]
+response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
 print(response)
 ```
-
 ##### LMDeploy inference

 LMDeploy is a toolkit for compressing, deploying, and serving LLMs, developed by the MMRazor and MMDeploy teams.
@ -676,7 +795,45 @@ print(response)

 #####  Ollama inference

-TODO
+Preparation:
+
+```bash
+# install ollama
+curl -fsSL https://ollama.com/install.sh | sh
+# fetch model
+ollama pull internlm/internlm3-8b-instruct
+# install the ollama Python client
+pip install ollama
+```
+
+Inference code:
+
+```python
+import ollama
+
+messages = [
+    {
+        "role": "system",
+        "content": thinking_system_prompt,
+    },
+    {
+        "role": "user",
+        "content": "Given the function\(f(x)=\mathrm{e}^{x}-ax - a^{3}\),\n(1) When \(a = 1\), find the equation of the tangent line to the curve \(y = f(x)\) at the point \((1,f(1))\).\n(2) If \(f(x)\) has a local minimum and the minimum value is less than \(0\), determine the range of values for \(a\)."
+    },
+]
+
+stream = ollama.chat(
+    model='internlm/internlm3-8b-instruct',
+    messages=messages,
+    stream=True,
+)
+
+for chunk in stream:
+  print(chunk['message']['content'], end='', flush=True)
+```
+
+
+#### 

 ##### vLLM inference

@ -684,7 +841,11 @@ TODO

 ```bash
 git clone https://github.com/RunningLeon/vllm.git
-pip install -e .
+# then follow https://docs.vllm.ai/en/latest/getting_started/installation/gpu/index.html#build-wheel-from-source to install
+cd vllm
+python use_existing_torch.py
+pip install -r requirements-build.txt
+pip install -e . --no-build-isolation
 ```

 Inference code:
@ -715,6 +876,10 @@ print(outputs)



+
+
+
+
 ## Open Source License

 The code and model weights in this repository are licensed under Apache-2.0.
@ -730,4 +895,5 @@ print(outputs)
      archivePrefix={arXiv},
      primaryClass={cs.CL}
 }
-```
+```
+