add chatglm3-6b and glm-4-9b-chat model support #6999

Closed
wants to merge 5 commits

Conversation

@mnlife mnlife commented Apr 30, 2024

This pull request adds support for the chatglm3-6b and glm-4-9b-chat models. Fixes #7778

Some things I'm not sure about:

  • The prompt must include both a prefix and a suffix for the inference results to be correct, for example:
./build/bin/llama-cli -m ~/models/glm-4-9b-chat-Q4_K_M.gguf --verbose-prompt -p "[gMASK]<sop><|user|>hi<|assistant|>"
  • When I add my chat template to examples/server/public/prompt-formats.js, run llama-server, open http://localhost:8080/ in a browser, and switch to that prompt style, the assistant always starts a new line before speaking.
    Screenshot from 2024-06-20 11-43-17

  • The inference results are incorrect with the CUDA version.

Below are the commands for converting the model to GGUF and quantizing it:

./convert-hf-to-gguf.py "--outfile" ~/models/xxx-f16.gguf "--outtype" "f16" ~/os/llm/xxx
./build/bin/quantize ~/models/xxx-f16.gguf ~/models/xxx-Q4_K_M.gguf 15 8
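
For reference, here is a minimal sketch (in Python, plain string handling only) of how a prompt in the format shown earlier can be assembled from chat messages. The [gMASK]<sop> prefix and the <|user|>/<|assistant|> tags come from the example command above; the helper name itself is just illustrative:

def build_glm_prompt(messages):
    # Prompt prefix used in the example above.
    prompt = "[gMASK]<sop>"
    for role, content in messages:
        # role is "system", "user", or "assistant"; content is the message text.
        prompt += f"<|{role}|>{content}"
    # End with the assistant tag so the model continues as the assistant.
    return prompt + "<|assistant|>"

# build_glm_prompt([("user", "hi")]) == "[gMASK]<sop><|user|>hi<|assistant|>"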

github-actions bot commented Apr 30, 2024

📈 llama.cpp server for bench-server-baseline on Standard_NC4as_T4_v3 for phi-2-q4_0: 537 iterations 🚀

  • Concurrent users: 8, duration: 10m
  • HTTP request : avg=8704.42ms p(95)=20543.59ms fails=, finish reason: stop=477 truncated=60
  • Prompt processing (pp): avg=97.85tk/s p(95)=416.02tk/s
  • Token generation (tg): avg=32.52tk/s p(95)=47.83tk/s
  • ggml-org/models/phi-2/ggml-model-q4_0.gguf parallel=8 ctx-size=16384 ngl=33 batch-size=2048 ubatch-size=256 pp=1024 pp+tg=2048 branch=chatglm3 commit=f3bc337f432a5f8d7391bd7af7bacfa55778d210

prompt_tokens_seconds: llama.cpp bench-server-baseline on Standard_NC4as_T4_v3, duration=10m, 537 iterations
predicted_tokens_seconds: llama.cpp bench-server-baseline on Standard_NC4as_T4_v3, duration=10m, 537 iterations
kv_cache_usage_ratio: llama.cpp bench-server-baseline on Standard_NC4as_T4_v3, duration=10m, 537 iterations
requests_processing: llama.cpp bench-server-baseline on Standard_NC4as_T4_v3, duration=10m, 537 iterations

@mofosyne mofosyne added the help wanted, enhancement, and Review Complexity : Medium labels May 9, 2024
@mofosyne mofosyne self-assigned this May 10, 2024
@mofosyne mofosyne removed their assignment May 10, 2024
@mofosyne mofosyne marked this pull request as ready for review May 15, 2024 03:12
@mnlife mnlife force-pushed the chatglm3 branch 2 times, most recently from 8aee20e to cb324f4 on May 15, 2024 05:28
@mnlife mnlife force-pushed the chatglm3 branch 2 times, most recently from ed1d3ff to 9226518 on May 23, 2024 08:14
@github-actions github-actions bot added the testing and python labels May 23, 2024
@mnlife mnlife force-pushed the chatglm3 branch 4 times, most recently from d523390 to f3bc337 on May 29, 2024 06:26
@arch-btw arch-btw mentioned this pull request Jun 5, 2024
@mnlife mnlife force-pushed the chatglm3 branch 2 times, most recently from bccb68f to a096383 on June 11, 2024 07:47

arch-btw commented Jun 13, 2024

Is there any way to support glm-4? #7778


mnlife commented Jun 17, 2024

Is there any way to support glm-4? #7778

under development
https://github.com/mnlife/llama.cpp/tree/glm4

@mnlife mnlife changed the title from "add chatglm3-6b model support [help wanted]" to "add chatglm3-6b and glm-4-9b-chat model support" Jun 19, 2024
@mnlife mnlife force-pushed the chatglm3 branch 2 times, most recently from a03cbca to bf430d6 on June 19, 2024 10:51
Signed-off-by: XingXing Qiao <[email protected]>

legraphista commented Jun 20, 2024

Not sure if this is a model or an implementation issue, but computing the imatrix of https://huggingface.co/THUDM/glm-4-9b-chat (fp16, q8_0, ...) always results in nans (dataset)

$ ../llama.cpp/llama-imatrix -m glm-4-9b-chat.Q8_0.gguf.link.gguf -f imatrix.dataset -c 512 -b 512 --threads 32 -ngl 999 -o imatrix.dat

llama_model_loader: loaded meta data with 24 key-value pairs and 283 tensors from glm-4-9b-chat-IMat-GGUF/glm-4-9b-chat.Q8_0.gguf.hardlink.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = chatglm
llama_model_loader: - kv   1:                               general.name str              = glm-4-9b-chat
llama_model_loader: - kv   2:                     chatglm.context_length u32              = 131072
llama_model_loader: - kv   3:                   chatglm.embedding_length u32              = 4096
llama_model_loader: - kv   4:                chatglm.feed_forward_length u32              = 13696
llama_model_loader: - kv   5:                        chatglm.block_count u32              = 40
llama_model_loader: - kv   6:               chatglm.attention.head_count u32              = 32
llama_model_loader: - kv   7:            chatglm.attention.head_count_kv u32              = 2
llama_model_loader: - kv   8:   chatglm.attention.layer_norm_rms_epsilon f32              = 0.000000
llama_model_loader: - kv   9:                          general.file_type u32              = 7
llama_model_loader: - kv  10:               chatglm.rope.dimension_count u32              = 64
llama_model_loader: - kv  11:               tokenizer.ggml.add_bos_token bool             = false
llama_model_loader: - kv  12:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  13:                         tokenizer.ggml.pre str              = chatglm-bpe
llama_model_loader: - kv  14:                      tokenizer.ggml.tokens arr[str,151552]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  15:                  tokenizer.ggml.token_type arr[i32,151552]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  16:                      tokenizer.ggml.merges arr[str,151073]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  17:            tokenizer.ggml.padding_token_id u32              = 151329
llama_model_loader: - kv  18:                tokenizer.ggml.bos_token_id u32              = 151329
llama_model_loader: - kv  19:                tokenizer.ggml.eos_token_id u32              = 151329
llama_model_loader: - kv  20:                tokenizer.ggml.eot_token_id u32              = 151336
llama_model_loader: - kv  21:            tokenizer.ggml.unknown_token_id u32              = 151329
llama_model_loader: - kv  22:                    tokenizer.chat_template str              = ChatGLM4
llama_model_loader: - kv  23:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:  121 tensors
llama_model_loader: - type q8_0:  162 tensors
llm_load_vocab: special tokens cache size = 223
llm_load_vocab: token to piece cache size = 0.9732 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = chatglm
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 151552
llm_load_print_meta: n_merges         = 151073
llm_load_print_meta: n_ctx_train      = 131072
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 2
llm_load_print_meta: n_layer          = 40
llm_load_print_meta: n_rot            = 64
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 16
llm_load_print_meta: n_embd_k_gqa     = 256
llm_load_print_meta: n_embd_v_gqa     = 256
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.6e-07
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 13696
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 131072
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 8B
llm_load_print_meta: model ftype      = Q8_0
llm_load_print_meta: model params     = 9.40 B
llm_load_print_meta: model size       = 9.30 GiB (8.50 BPW) 
llm_load_print_meta: general.name     = glm-4-9b-chat
llm_load_print_meta: BOS token        = 151329 '<|endoftext|>'
llm_load_print_meta: EOS token        = 151329 '<|endoftext|>'
llm_load_print_meta: UNK token        = 151329 '<|endoftext|>'
llm_load_print_meta: PAD token        = 151329 '<|endoftext|>'
llm_load_print_meta: LF token         = 128 'Ä'
llm_load_print_meta: EOT token        = 151336 '<|user|>'
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:   no
ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
llm_load_tensors: ggml ctx size =    0.31 MiB
llm_load_tensors: offloading 40 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 41/41 layers to GPU
llm_load_tensors:        CPU buffer size =   629.00 MiB
llm_load_tensors:      CUDA0 buffer size =  8897.23 MiB
.................................................................................
llama_new_context_with_model: n_ctx      = 512
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:      CUDA0 KV buffer size =    20.00 MiB
llama_new_context_with_model: KV self size  =   20.00 MiB, K (f16):   10.00 MiB, V (f16):   10.00 MiB
llama_new_context_with_model:  CUDA_Host  output buffer size =     0.58 MiB
llama_new_context_with_model:      CUDA0 compute buffer size =   304.00 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =     9.01 MiB
llama_new_context_with_model: graph nodes  = 1606
llama_new_context_with_model: graph splits = 2

system_info: n_threads = 25 / 32 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | 
compute_imatrix: tokenizing the input ..
compute_imatrix: tokenization took 122.54 ms
compute_imatrix: computing over 125 chunks with batch_size 512
compute_imatrix: 0.65 seconds per pass - ETA 1.35 minutes
[1]7.9954,[2]6.0663,[3]5.9242,[4]7.2853,[5]7.2095,[6]6.0079,[7]6.4837,[8]6.8565,[9]7.0213,
save_imatrix: stored collected data after 10 chunks in glm-4-9b-chat-IMat-GGUF/imatrix.dat
[10]6.1378,[11]6.7411,[12]7.3944,[13]7.8688,[14]8.1976,[15]8.6583,[16]9.1342,[17]9.4154,[18]9.1001,[19]8.6114,
save_imatrix: stored collected data after 20 chunks in glm-4-9b-chat-IMat-GGUF/imatrix.dat
[20]8.5978,nan detected in blk.18.attn_output.weight

Edit: Looks like running it on CPU instead of CUDA gets it past chunk 21
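
A CPU-only rerun of the same command is just a matter of disabling GPU offload (a sketch assuming the same file paths; -ngl 0 keeps all layers on the CPU):

$ ../llama.cpp/llama-imatrix -m glm-4-9b-chat.Q8_0.gguf.link.gguf -f imatrix.dataset -c 512 -b 512 --threads 32 -ngl 0 -o imatrix.dat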

@choyakawa

will the vision model of glm-4 also be considered?


mnlife commented Jun 21, 2024

under development

@mnlife mnlife closed this Jun 21, 2024

mnlife commented Jun 21, 2024

will the vision model of glm-4 also be considered?

under development

@mnlife mnlife reopened this Jun 21, 2024
@CsBoBoNice

Hello, I built from your branch and ran inference on an NVIDIA GPU with the model glm-4-9b-chat.Q5_K_S.gguf.

It can answer short prompts like 你好 (hello), 你是谁 (who are you), and 写一首诗 (write a poem).

But when the prompt gets longer, the reply comes back garbled, for example: 将以下中文翻译为英文: 生活和天气一样，有晴，有阴，偶尔还会下点雨，自然规律，生活不简单尽量简单过。 (Translate the following Chinese into English: life, like the weather, has sunny days, overcast days, and the occasional bit of rain; that is the natural order, and since life is not simple, try to live it simply.)

Here is the log from the run:

.\build\bin\Release\llama-cli.exe -m D:\models\glm-4-9b-chat.Q5_K_S.gguf -p "[gMASK]<|user|>hi<|assistant|>" -t 16 --keep -1 -c 1024 -b 1024 -n -1 -s 123 -ngl 18 --color -i
Log start
main: build = 3187 (de3c909)
main: built with MSVC 19.39.33523.0 for x64
main: seed = 123
llama_model_loader: loaded meta data with 24 key-value pairs and 283 tensors from D:\models\glm-4-9b-chat.Q5_K_S.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = chatglm
llama_model_loader: - kv 1: general.name str = glm-4-9b-chat
llama_model_loader: - kv 2: chatglm.context_length u32 = 131072
llama_model_loader: - kv 3: chatglm.embedding_length u32 = 4096
llama_model_loader: - kv 4: chatglm.feed_forward_length u32 = 13696
llama_model_loader: - kv 5: chatglm.block_count u32 = 40
llama_model_loader: - kv 6: chatglm.attention.head_count u32 = 32
llama_model_loader: - kv 7: chatglm.attention.head_count_kv u32 = 2
llama_model_loader: - kv 8: chatglm.attention.layer_norm_rms_epsilon f32 = 0.000000
llama_model_loader: - kv 9: general.file_type u32 = 16
llama_model_loader: - kv 10: chatglm.rope.dimension_count u32 = 64
llama_model_loader: - kv 11: tokenizer.ggml.add_bos_token bool = false
llama_model_loader: - kv 12: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 13: tokenizer.ggml.pre str = chatglm-bpe
llama_model_loader: - kv 14: tokenizer.ggml.tokens arr[str,151552] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 15: tokenizer.ggml.token_type arr[i32,151552] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 16: tokenizer.ggml.merges arr[str,151073] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv 17: tokenizer.ggml.padding_token_id u32 = 151329
llama_model_loader: - kv 18: tokenizer.ggml.bos_token_id u32 = 151329
llama_model_loader: - kv 19: tokenizer.ggml.eos_token_id u32 = 151329
llama_model_loader: - kv 20: tokenizer.ggml.eot_token_id u32 = 151336
llama_model_loader: - kv 21: tokenizer.ggml.unknown_token_id u32 = 151329
llama_model_loader: - kv 22: tokenizer.chat_template str = ChatGLM4
llama_model_loader: - kv 23: general.quantization_version u32 = 2
llama_model_loader: - type f32: 121 tensors
llama_model_loader: - type q5_1: 40 tensors
llama_model_loader: - type q5_K: 121 tensors
llama_model_loader: - type q6_K: 1 tensors
llm_load_vocab: special tokens cache size = 223
llm_load_vocab: token to piece cache size = 0.9732 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = chatglm
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 151552
llm_load_print_meta: n_merges = 151073
llm_load_print_meta: n_ctx_train = 131072
llm_load_print_meta: n_embd = 4096
llm_load_print_meta: n_head = 32
llm_load_print_meta: n_head_kv = 2
llm_load_print_meta: n_layer = 40
llm_load_print_meta: n_rot = 64
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 16
llm_load_print_meta: n_embd_k_gqa = 256
llm_load_print_meta: n_embd_v_gqa = 256
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.6e-07
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 13696
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn = 131072
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: model type = 8B
llm_load_print_meta: model ftype = Q5_K - Small
llm_load_print_meta: model params = 9.40 B
llm_load_print_meta: model size = 6.23 GiB (5.69 BPW)
llm_load_print_meta: general.name = glm-4-9b-chat
llm_load_print_meta: BOS token = 151329 '<|endoftext|>'
llm_load_print_meta: EOS token = 151329 '<|endoftext|>'
llm_load_print_meta: UNK token = 151329 '<|endoftext|>'
llm_load_print_meta: PAD token = 151329 '<|endoftext|>'
llm_load_print_meta: LF token = 128 'Ä'
llm_load_print_meta: EOT token = 151336 '<|user|>'
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce GTX 1650, compute capability 7.5, VMM: yes
llm_load_tensors: ggml ctx size = 0.31 MiB
llm_load_tensors: offloading 18 repeating layers to GPU
llm_load_tensors: offloaded 18/41 layers to GPU
llm_load_tensors: CPU buffer size = 6377.09 MiB
llm_load_tensors: CUDA0 buffer size = 2468.01 MiB
...................................................................................
llama_new_context_with_model: n_ctx = 1024
llama_new_context_with_model: n_batch = 1024
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CUDA_Host KV buffer size = 22.00 MiB
llama_kv_cache_init: CUDA0 KV buffer size = 18.00 MiB
llama_new_context_with_model: KV self size = 40.00 MiB, K (f16): 20.00 MiB, V (f16): 20.00 MiB
llama_new_context_with_model: CUDA_Host output buffer size = 0.58 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 789.62 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 10.01 MiB
llama_new_context_with_model: graph nodes = 1606
llama_new_context_with_model: graph splits = 202

system_info: n_threads = 16 / 16 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
main: interactive mode on.
sampling:
repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order:
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature
generate: n_ctx = 1024, n_batch = 1024, n_predict = -1, n_keep = 5

== Running in interactive mode. ==

  • Press Ctrl+C to interject at any time.
  • Press Return to return control to the AI.
  • To return control without starting a new line, end your input with '/'.
  • If you want to submit another line, end your input with '\'.

hi
你好!我是人工智能助手,很高兴能帮助你。请问有什么可以帮到你的吗?
你是谁
我是一个名为 ChatGLM 的人工智能助手,我是基于清华大学 KEG 实验室和智谱 AI 公司于 2024 年共同训练的语言模型 GLM-4 开发的 人工智能助手。
将以下中文翻译为英文: 生活和天气一样,有晴,有阴,偶尔还会下点雨,自然规律,生活不简单尽量简单过。
514%89,5!/"18A(;8!9)$0H-)::G.+74=()47:-46%=2:>2$*+",&1!@e:6==:0,E.45/EF+G4!%72-++"
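
One quick check, not part of the original report, would be to rerun the same command with GPU offload disabled (-ngl 0, all 40 layers on the CPU) to see whether the garbled replies are tied to the CUDA path, as suggested by the earlier CPU-only imatrix observation:

.\build\bin\Release\llama-cli.exe -m D:\models\glm-4-9b-chat.Q5_K_S.gguf -p "[gMASK]<|user|>hi<|assistant|>" -t 16 --keep -1 -c 1024 -b 1024 -n -1 -s 123 -ngl 0 --color -i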

@youth123

Hello, I built from your branch and ran inference on an NVIDIA GPU with the model glm-4-9b-chat.Q5_K_S.gguf.
It can answer short prompts like 你好 (hello), 你是谁 (who are you), and 写一首诗 (write a poem).
But when the prompt gets longer, the reply comes back garbled, for example: 将以下中文翻译为英文: 生活和天气一样，有晴，有阴，偶尔还会下点雨，自然规律，生活不简单尽量简单过。

I have already solved the incorrect-answer issue, building on this PR. Here is the new PR:
#8031


mnlife commented Jun 27, 2024

Hello, I built from your branch and ran inference on an NVIDIA GPU with the model glm-4-9b-chat.Q5_K_S.gguf.

It can answer short prompts like 你好 (hello), 你是谁 (who are you), and 写一首诗 (write a poem).

But when the prompt gets longer, the reply comes back garbled, for example: 将以下中文翻译为英文: 生活和天气一样，有晴，有阴，偶尔还会下点雨，自然规律，生活不简单尽量简单过。


Someone has picked this work up now; see this PR: #8031

@mnlife mnlife closed this Jun 27, 2024

0wwafa commented Jul 2, 2024

Why is this still pending?

Labels
enhancement, examples, help wanted, python, Review Complexity : Medium, server, testing
10 participants