billymg[asciilifeform]: http://logs.bitdash.io/pest/2024-12-27#1034669 << yeah, i think this is true. but you could look into "fine tuning", that's something that's within the realm of a beefy home setup or rented server farm time
billymg[asciilifeform]: i think it's roughly taking an existing model and training it further on enough domain-specific data that it can be used effectively for a given use case
discord_bridge[asciilifeform]: (awtho) billymg: I built llama-cpp. I attempted to run a 17gb deepseek model, but it ended up freezing my macbook. I tried another deepseek model using LM Studio (which should be accessible via Cline) but it is very very slow.
billymg[asciilifeform]: awt: is it an intel macbook pro or arm?
discord_bridge[asciilifeform]: (awtho) Arm
billymg[asciilifeform]: how much total ram?
billymg[asciilifeform]: when you try running the model with llama-server you can open 'Activity Monitor' and see how much memory you have available
discord_bridge[asciilifeform]: (awtho) 16 gb
discord_bridge[asciilifeform]: (awtho) Is there a way I can safely do that without freezing my machine?
billymg[asciilifeform]: ah, def not enough then for the 17gb model, it must have been swapping and that's what froze it
billymg[asciilifeform]: considering the OS, browser, IDE, and whatever other random things are gonna take up at least 50% of your ram, i'd say your best bet is trying it out on your desktop PC (assuming that has the specs for it)
billymg[asciilifeform]: you can run it on a desktop pc and serve on your local network too, so your macbook's VS Code plugin will just be making requests to llama-server on your desktop
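(a minimal sketch of that setup, assuming a GGUF model file and the llama-server binary are already on the desktop; the model path, LAN address, and port below are placeholders:)

    # on the desktop: load the model and listen on all interfaces so other
    # machines on the LAN can reach it (the default host is 127.0.0.1)
    ./llama-server -m ./model.gguf --host 0.0.0.0 --port 8080

    # on the macbook: point the editor plugin at the desktop's address, or
    # sanity-check the server first with curl against its health endpoint
    curl http://192.168.1.50:8080/health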
discord_bridge[asciilifeform]: (awtho) Desktop has: 16 GB Radeon RX 6900 XT with 5120 stream processors, 128 GB ECC RAM.
billymg[asciilifeform]: that oughta be enough to get it going. i've only tried it with nvidia, but you can build it with HIP for AMD GPUs. llama-server then has a flag, -ngl / --gpu-layers, that lets you control how many layers to offload to VRAM
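(the exact build flags have moved around between llama.cpp releases, so treat this as an approximate sketch of the HIP build; GGML_HIP was called GGML_HIPBLAS / LLAMA_HIPBLAS in older trees, and gfx1030 is the compile target for an RX 6900 XT:)

    # configure llama.cpp with HIP/ROCm support for the AMD card, then build
    cmake -B build -DGGML_HIP=ON -DAMDGPU_TARGETS=gfx1030
    cmake --build build --config Release -j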
billymg[asciilifeform]: it will exit if you exceed your available vram, so the idea is to increase the layer count until it fails, then back off to the last value that worked
billymg[asciilifeform]: the rest of the model will then load into regular ram and those layers will run on the cpu, so it will be slower but usable
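(roughly what that trial-and-error looks like; the layer count here is illustrative and depends on the model:)

    # offload 32 layers to the 16 GB of VRAM; raise the number each run until
    # llama-server aborts for lack of memory, then back off to the last value
    # that worked; whatever isn't offloaded stays in system RAM on the CPU
    ./llama-server -m ./model.gguf -ngl 32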
discord_bridge[asciilifeform]: (awtho) ty