Next Generation Model Serving
Model Server Motivation. Producing real-time predictions requires two components: feature vector building and model evaluation (aka inference). Our original infrastructure (the Python streamer) handled both of these in one place. It followed our flight data feed to build feature vectors and evaluated those vectors on models it loaded into memory. The streamer was not performant enough to run for all the airports we wished to make predictions for, so we ran many copies of it, with each streamer… (flightaware.engineering)
Thanks! Both the article and your "translation" helped me better understand some of the AI/ML work happening right around me. I appreciate you taking the time to help those of us not close to the technology better understand the state of the art in this important field.
Impressive read on the behind-the-scenes technology driving (flying?) the FlightAware experience. Well done, Caroline and team!
Basically, as they note, actually using the trained model to make predictions is called inference (contrasted with the first step of teaching the AI what you want it to do, which is broadly called training). Inference comes down to two stages (sketched in code below):
1) Turn the raw data (whatever it is: speech, text, video, images, other types of data) into something the model can actually consume, i.e. numeric feature vectors (vectorization).
2) Run the model on those vectors (often on GPUs from Nvidia).
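To make those two stages concrete, here's a minimal Python sketch. It assumes a hypothetical single-output model exported to ONNX as "eta_model.onnx" with an input tensor named "input"; the feature names and values are made up for illustration, not FlightAware's actual pipeline:

```python
# Toy illustration of the two inference stages described above.
# The model file, tensor name, and features are hypothetical.
import numpy as np
import onnxruntime as ort

def vectorize(flight: dict) -> np.ndarray:
    """Stage 1: turn raw flight data into a fixed-length numeric vector."""
    return np.array(
        [
            flight["groundspeed_kts"],
            flight["altitude_ft"],
            flight["distance_to_dest_nm"],
            float(flight["is_ifr"]),
        ],
        dtype=np.float32,
    ).reshape(1, -1)  # a batch of one

# Stage 2: evaluate the model, here via ONNX Runtime (GPU if available).
session = ort.InferenceSession(
    "eta_model.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)

features = vectorize({
    "groundspeed_kts": 420.0,
    "altitude_ft": 11000.0,
    "distance_to_dest_nm": 35.0,
    "is_ifr": True,
})
(prediction,) = session.run(None, {"input": features})
print("prediction:", prediction)
```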
With this new implementation they took the fairly common approach to doing this fast and at high scale: split those two steps apart. Before, they were running everything in one server process that did both steps 1 and 2, which is relatively easy to build but also inefficient and problematic for the reasons they highlight.
Luckily, Nvidia has a free, open source product called the Triton Inference Server, which the article references. It's used by many, if not most, of the "big boys" in the space that need to make AI work efficiently for a large number of people, lots of data, etc., and it is absolutely the best option for this application.
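For a sense of what talking to Triton looks like from the application side, here's a rough sketch using Nvidia's Python client library (tritonclient). The model name, tensor names, and shape are assumptions for illustration, not details from the article:

```python
# Minimal Triton HTTP client request. Assumes a Triton server on
# localhost:8000 serving a hypothetical model "eta_model" with one
# FP32 input "INPUT__0" of shape [batch, 4] and one output "OUTPUT__0".
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

features = np.random.rand(1, 4).astype(np.float32)  # stand-in feature vector
infer_input = httpclient.InferInput("INPUT__0", list(features.shape), "FP32")
infer_input.set_data_from_numpy(features)

response = client.infer(
    model_name="eta_model",
    inputs=[infer_input],
    outputs=[httpclient.InferRequestedOutput("OUTPUT__0")],
)
print(response.as_numpy("OUTPUT__0"))
```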
It has features, primarily for Nvidia GPUs (of course), that provide incredible increases in performance, often 20x or more. Not only does this make a single node more efficient, it also makes it easier to run "AI" on as much hardware as you need to keep up with demand, data, etc. This is a natural progression for AI deployments as they get used by more people with more data.
The Triton Inference Server does a bunch of near magic to enable this, and that's reflected in one of the final paragraphs of the article:
"This new architecture was released early March 2023. Upon release, it used a single machine for the streamer and three Triton replicas in Kubernetes, which could easily fit onto two nodes. This cut our hardware usage by 75% while still quadrupling our catchup rate."
In terms of really nerding out, I would want to ask the FlightAware tech team whether they've tried using the Triton Model Navigator, Performance Analyzer, etc. to further optimize inference performance. Generally, deploying models with the ONNX Runtime is my preferred approach, but I've found that combining it with the Triton Inference Server's ability to dynamically compile to TensorRT (with caching) provides another large leap in performance (assuming they're running on GPUs with Tensor Cores, which is almost certainly the case). They provide the Triton ensemble configuration but not the Triton model configurations for the models themselves, so I have no way to know whether this is the approach they're taking.
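For reference, a per-model config.pbtxt along those lines might look roughly like the sketch below: an ONNX model served by Triton's onnxruntime backend with the TensorRT execution accelerator and engine caching enabled. The model name, tensor names, dimensions, and parameter values are assumptions for illustration, not FlightAware's actual configuration:

```
name: "eta_model"
backend: "onnxruntime"
max_batch_size: 64
input [
  {
    name: "INPUT__0"
    data_type: TYPE_FP32
    dims: [ 4 ]
  }
]
output [
  {
    name: "OUTPUT__0"
    data_type: TYPE_FP32
    dims: [ 1 ]
  }
]
dynamic_batching { }
optimization {
  execution_accelerators {
    gpu_execution_accelerator : [
      {
        name : "tensorrt"
        parameters { key: "precision_mode" value: "FP16" }
        parameters { key: "trt_engine_cache_enable" value: "true" }
        parameters { key: "trt_engine_cache_path" value: "/tmp/trt_cache" }
      }
    ]
  }
}
```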