Services and best practices for running ML inference

Summary of results

1.

For anything involving inference, you're much better off with one of the many inference model servers such as TensorFlow Serving, Triton Inference Server, etc.
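
As an illustration of what calling such a model server looks like (not part of the original comment), here is a minimal sketch against TensorFlow Serving's REST predict endpoint; the model name, port, and input layout are placeholders.

```python
# Minimal sketch: query a TensorFlow Serving REST endpoint.
# Assumes a server started with --rest_api_port=8501 and a model named
# "my_model" that accepts a batch of float vectors; the model name and
# input layout are placeholders.
import json
import requests

def predict(instances):
    """Send a batch of inputs to TF Serving and return its predictions."""
    url = "http://localhost:8501/v1/models/my_model:predict"
    response = requests.post(url, data=json.dumps({"instances": instances}), timeout=10)
    response.raise_for_status()
    return response.json()["predictions"]

if __name__ == "__main__":
    print(predict([[0.1, 0.2, 0.3, 0.4]]))
```

Triton exposes a similar HTTP/gRPC interface, so the same pattern applies with its client libraries.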

3.

What's the best way to run inference on the 70B model as an API? Most of the hosted APIs including HuggingFace seem to not work out of the box for models that large, and I'd rather not have to manage my own GPU server.

4.

If you are just using it for inference, I think an appropriate comparison would just be something like a together.ai endpoint, which allows you to scale up pretty much immediately and is likely more economical as well.

5.

Is that more for training or inference use?

I realize this isn't always entirely decoupled in certain online learning approaches. I don't work in ML, am certainly not an expert, and am genuinely curious where this space is now in terms of hardware requirements for SOTA methodologies, especially inference-phase hardware requirements for just running what's already out there.

6.

How are you doing online ML inference, without fetching data?

7.

I have an ML infra setup that does between 1 and 2 million inferences per day (granted, it is not a chatbot or code generator, but it is legit ML inference in a similar mode), and it only costs $120 per month for the server (which also runs multiple containers and is a web server) and mostly sits idle.

9.

There are basically two approaches to on-chain inference: consensus-based approaches (several parties run inference and give a claimed result), and zkML (one party runs inference and proves the result cryptographically).

zkML can be done using general-purpose ZK libraries (since they support arbitrary computations), or there are some specialized tools for proving ML inference, such as https://github.com/ddkang/zkml. It's currently pretty expensive to prove huge models like LLMs, but there's a lot of work being done to make it more practical.

10.

https://github.com/ml-explore/mlx-examples

Several people are working on MLX-enabled backends for popular ML workloads, but it seems inference workloads get the most acceleration compared to generative/training workloads.
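
For reference (not from the original comment), a minimal inference sketch with the mlx-lm package from that repo, assuming Apple Silicon and a quantized model from the mlx-community hub; the model ID is only an example.

```python
# Minimal sketch: text generation with the mlx-lm package on Apple Silicon.
# The model ID is an example; any MLX-converted model should work.
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/Mistral-7B-Instruct-v0.2-4bit")
response = generate(model, tokenizer,
                    prompt="Explain ML inference in one sentence.",
                    max_tokens=100)
print(response)
```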

11.

Does anyone have any tips for how to spin up services that can efficiently perform inference with the HuggingFace weights of models like this?

I would love to switch to something like this from OpenAI's GPT-3.5 Turbo, but this weekend I'm struggling to get reasonable inference speed on reasonably priced machines.
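
One low-effort starting point (a sketch, not the commenter's setup): the Hugging Face transformers pipeline with half precision and automatic device placement. The model ID is a placeholder, and a 70B model would additionally need multiple GPUs or quantization.

```python
# Minimal sketch: local inference on Hugging Face weights via the
# transformers pipeline. The model ID is a placeholder; device_map="auto"
# requires the accelerate package.
import torch
from transformers import pipeline

pipe = pipeline(
    "text-generation",
    model="meta-llama/Llama-2-7b-chat-hf",  # placeholder model ID
    torch_dtype=torch.float16,
    device_map="auto",
)

out = pipe("Write a haiku about inference.", max_new_tokens=64)
print(out[0]["generated_text"])
```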

12.

What are some good resources for getting into this and learning the basics of ML, to build a fundamental understanding of how this works?

13.

How come you always have to install some version of PyTorch or TensorFlow to run these ML models? When I'm only doing inference, shouldn't there be easier ways of doing that, with automatic hardware selection, etc.? Why aren't models distributed in a standard format like ONNX, with inference on each platform solved once per platform?
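
ONNX Runtime covers part of this; a minimal sketch, assuming an already exported model.onnx whose first input takes a float32 image batch (both are placeholders).

```python
# Minimal sketch: framework-free inference with ONNX Runtime.
# "model.onnx" and the input shape are placeholders for an exported model.
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])
input_name = session.get_inputs()[0].name

x = np.random.rand(1, 3, 224, 224).astype(np.float32)  # example input batch
outputs = session.run(None, {input_name: x})
print(outputs[0].shape)
```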

14.

For some reason they focus on inference, which is the computationally cheap part. If you're working on ML (as opposed to deploying someone else's ML) then almost all of your workload is training, not inference.

15.

Are there any other FDWs that do ML inference?

Remember, this is not plain file serving -- this is actually invoking the XGBoost library, which does complex mathematical operations. The user does not get data from disk; they get inference results.

Unless you know of another solution that can invoke XGBoost (or some other inference library), I don't see anything "embarrassingly overkill" here.
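
For context (not from the original comment), the kind of call such a foreign data wrapper would wrap looks roughly like this; the model file and feature layout are placeholders.

```python
# Minimal sketch: XGBoost inference from a previously saved model.
# "model.json" and the feature layout are placeholders.
import numpy as np
import xgboost as xgb

booster = xgb.Booster()
booster.load_model("model.json")

features = np.array([[0.5, 1.2, 3.4, 0.0]], dtype=np.float32)
preds = booster.predict(xgb.DMatrix(features))
print(preds)
```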

16.

> If you're working on ML (as opposed to deploying someone else's ML) then almost all of your workload is training, not inference.

Wouldn't that depend on the size of your customer base? Or at least, requests per second?

17.

Check out the latest docs (https://mlc.ai/mlc-llm/docs/). MLC started with demos and has lately evolved, with API integrations and documentation, into an inference solution that everyone can reuse for universal deployment.

18.

If by running models you mean just the inference phase, then even today you can run a large family of ML models on commodity hardware (with some elbow grease, of course). The training phase is generally the one that can't easily be replicated by non-corporations.
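
As one example of that elbow grease (an assumption, not the commenter's setup): CPU-only LLM inference with llama-cpp-python and a quantized GGUF file downloaded separately; the file path is a placeholder.

```python
# Minimal sketch: CPU-only inference on commodity hardware with
# llama-cpp-python. The GGUF path is a placeholder for a quantized model.
from llama_cpp import Llama

llm = Llama(model_path="./models/llama-2-7b-chat.Q4_K_M.gguf", n_ctx=2048)
out = llm("Q: What is ML inference? A:", max_tokens=64)
print(out["choices"][0]["text"])
```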

19.

You can already run inference of many modern ML models in-browser via, e.g., https://huggingface.co/docs/transformers.js/en/index .

20.

What is the advantage of this versus running something like https://github.com/simonw/llm , which also gives you options to e.g. use https://github.com/simonw/llm-mlc for accelerated inference?

21.

We're running it on vLLM and are working with others in the community to bring it to other optimized inference frameworks.
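
For anyone who wants to try the same route, a minimal offline-batch sketch with vLLM's Python API (the model ID is a placeholder; vLLM also ships an OpenAI-compatible HTTP server for online serving).

```python
# Minimal sketch: batched offline inference with vLLM.
# The model ID is a placeholder.
from vllm import LLM, SamplingParams

llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.2")
params = SamplingParams(temperature=0.7, max_tokens=64)

outputs = llm.generate(["Summarize what an inference server does."], params)
for out in outputs:
    print(out.outputs[0].text)
```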

23.

I'm assuming the cost/effort of getting a customized Postgres instance that supports efficient (i.e., not CPU-only) ML inference is greater than that of orchestrating something else.

24.

I'm an ML engineer, but I know nothing about the inference part. Are there so many kinds of devices that optimizing for inference on a specific device is a thing? I thought almost everyone serves from GPUs/TPUs, and hence there are only two major device types. What am I missing here?

25.

This is a free, open-source, technology-agnostic platform that makes it easier to benchmark, optimize, compare, and discuss AI and ML systems across rapidly evolving software, hardware, models, and datasets, in a fully automated and reproducible way via open optimization challenges, as demonstrated in the latest MLPerf Inference v3.0 community submission.

26.

Or you can apply GPU optimizations for such ML workloads. By optimizing the way these models run on GPUs, you can significantly improve efficiency and slash costs by a factor of 10 or more. These techniques include kernel fusion, memory access optimization, and efficient use of GPU resources, which can lead to substantial improvements in both training and inference speed. This allows AI models to run on more affordable hardware while still delivering exceptional performance. For example, LLMs that run on an A100 can also run on 3090s with no change in accuracy and comparable inference latency.
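
One accessible example of this class of optimization (an illustration, not necessarily the stack the comment has in mind) is torch.compile, which applies kernel fusion and other graph-level optimizations automatically.

```python
# Minimal sketch: letting torch.compile fuse kernels for inference.
# The toy model is a placeholder standing in for a real workload.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 1024)).eval()
compiled = torch.compile(model)  # fuses elementwise ops with surrounding kernels

x = torch.randn(32, 1024)
with torch.inference_mode():
    y = compiled(x)
print(y.shape)
```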

27.

I always thought that Elasticsearch would make a good host for an ML-enabled datastore. Building indices and searching them are computational paradigms similar to training and inference, and the scaling framework would lend itself well to both computation-heavy training and query/inference.

Although I don't know if it has good support for lots of floats, and I guess all the ML code would have to be Java.

28.

> Right, but they can (and do) also prepackage ML inference services that don't involve Postgres.

But then you would need some data processing/warehousing infra integrated to produce a dataset for inference and then to do something with the inference results. Having one DB with everything packaged would reduce complexity significantly.

The ML training workflow could also be integrated into this DB, so you could have a few queries to generate data for training, train the model, generate data for inference, produce inference results, and do something with those results.

29.

Thanks to the folks at MLCommons, we have some benchmarks and data for evaluating and tracking inference performance, published today. It includes results from GPUs, TPUs, and CPUs, as well as some power measurements, across several ML use cases including LLMs.

"This benchmark suite measures how fast systems can process inputs and produce results using a trained model. Below is a short summary of the current benchmarks and metrics. Please see the MLPerf Inference benchmark paper for a detailed description of the motivation and guiding principles behind the benchmark suite."

https://mlcommons.org/en/inference-datacenter-31/

For example, the latest TPU (v5) from Google scores 7.13 queries per second with an LLM. Looking at GCP, that server runs $1.20/hour on demand.

On Azure, an H100 scores 84.22 queries per second with an LLM. I couldn't find the price for that, but an A100 costs $27.197 per hour, so no doubt the H100 will be more expensive than that.

7.13 / $1.20 = 5.94 queries/second/$

84.22 / $27.197 (A100 pricing) ≈ 3.10 queries/second/$

[edited to include GCP TPU v5 and Nvidia H100 relative performance info for LLM Inference]
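
The same comparison as a tiny script (using the on-demand figures quoted above), so other accelerator/price pairs can be plugged in.

```python
# Recompute queries/second per dollar from the figures quoted above.
systems = {
    "GCP TPU v5 (LLM)": (7.13, 1.20),                     # (queries/sec, $/hour)
    "Azure H100 (LLM, at A100 pricing)": (84.22, 27.197),
}
for name, (qps, dollars_per_hour) in systems.items():
    print(f"{name}: {qps / dollars_per_hour:.2f} queries/second/$")
```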

