

Dynamic Memory Compression

Author: Staci · Posted: 25-08-14 15:56 · Views: 35 · Comments: 0

Despite the success of large language models (LLMs) as general-purpose AI tools, their high demand for computational resources makes their deployment difficult in many real-world scenarios. The sizes of the model and of the conversation state are limited by the available high-bandwidth memory, limiting the number of users that can be served and the maximum conversation length.

- Transformers: the conversation state consists of a distinct representation for each element of a sequence, which quickly explodes in size.
- SSMs: compress the whole sequence into a single representation, which can forget past information due to its finite capacity.

Compressing the conversation state frees up memory and is crucial for running larger models within the same memory constraints, processing more tokens at a time, or simply reducing latency. To this end, researchers at NVIDIA have developed a new technique called dynamic memory compression (DMC) that can drastically improve the efficiency of LLM deployment and broaden its horizons to longer sequences without running out of memory.



DMC opens a third way, in which a Transformer model can be trained to adaptively compress the conversation state and achieve a desired compression rate. This allows a large reduction of the conversation state size without changing the familiar Transformer architecture. DMC does not require training from scratch: existing models can be retrofitted through a negligible amount of additional training, which is more reliable than error-prone training-free methods.

What affects LLM inference performance? Inference proceeds in two phases:

- Pre-filling: a user query is ingested.
- Auto-regressive generation: the response is generated one token at a time.

During generation, to perform self-attention, Transformers append a pair of representations (a key-value pair, or KVP) for each token to a cache. A different KVP is stored for every layer and every attention head, so the KVP cache grows proportionally to the sequence length. Because the KVP cache must fit into GPU memory together with the LLM weights, it can occupy a large part of it or even exhaust it. A back-of-the-envelope calculation of this growth is sketched below.
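As a rough illustration (not from the original post), the following sketch estimates the KVP cache footprint from the model shape; the Llama-2-7B-like dimensions are assumptions chosen for the example:

```python
def kvp_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                    seq_len: int, batch_size: int, bytes_per_elem: int = 2) -> int:
    # One key and one value vector per token, per layer, per KV head.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size * bytes_per_elem

# Assumed Llama-2-7B-like shapes: 32 layers, 32 KV heads, head_dim 128, fp16.
size = kvp_cache_bytes(num_layers=32, num_kv_heads=32, head_dim=128,
                       seq_len=4096, batch_size=8)
print(f"{size / 2**30:.1f} GiB")  # 16.0 GiB for a batch of 8 at a 4K context
```

At these assumed shapes, the cache alone rivals the roughly 13 GiB of fp16 weights of a 7B-parameter model, which is why compressing it matters.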



Additionally, the larger the KVP cache, the longer it takes to execute a single inference step. This is because calculating attention scores is a memory-bound operation: each query has its own KVP cache to be loaded. The situation is different for linear projections in attention or FFN layers, where each weight matrix must be loaded into SRAM from HBM only once for all queries if the GPU is working on many queries in parallel. Previous research tried to reduce the size of the KVP cache by quantizing its representations, sharing attention heads, or evicting tokens from it. However, these methods degrade the original performance because they delete information from memory without altering the original LLM behavior. Dynamic memory compression (DMC) is a simple way to compress the KV cache during inference without incurring a performance drop. The equation lying at the heart of DMC transforms a sub-sequence of keys into a particular prefix sum, which is reminiscent of popular SSMs like xLSTM or RWKV.
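The post does not reproduce that equation; below is a minimal sketch of an append-or-merge cache update consistent with the prefix-sum description, where the importance weight omega and all variable names are assumptions modeled on the DMC paper rather than code from the post:

```python
import torch

def dmc_step(keys, values, weight_sum, k_t, v_t, alpha_t, omega_t):
    """One DMC cache update for a single attention head at step t.

    keys/values: Python lists of tensors already in the KVP cache.
    weight_sum:  accumulated importance weight of the last cache slot.
    alpha_t:     binary decision (1 = merge with last slot, 0 = append).
    omega_t:     importance weight of the incoming key-value pair.
    """
    if alpha_t == 1:
        # Merge: running weighted average over the current segment,
        # i.e. a normalized prefix sum of keys (and values).
        z = weight_sum + omega_t
        keys[-1] = (weight_sum * keys[-1] + omega_t * k_t) / z
        values[-1] = (weight_sum * values[-1] + omega_t * v_t) / z
        weight_sum = z
    else:
        # Append: open a fresh slot, as a plain Transformer always does.
        keys.append(k_t)
        values.append(v_t)
        weight_sum = omega_t
    return keys, values, weight_sum
```

With alpha fixed to 0 this reduces to the ordinary KVP cache; the more often alpha is 1, the higher the compression rate.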



During inference, the values of alpha are strictly binary: a decision variable determines whether the cache should be extended by a new KVP or whether the new pair should be merged with the last one in the KVP cache (in a plain model, the cache is always extended by one KVP at a time). The frequency of merging decisions determines the compression rate of DMC.

Retrofitting an existing model proceeds as follows:

1. Train pre-existing LLMs, such as those from the Llama family, using between 2-8% of the original training data mixture.
2. Slowly transition towards DMC by exerting pressure to merge new pairs with the trailing ones. The target compression rate is ramped up from 1x to the desired level over the course of retrofitting.
3. After reaching the target compression rate, fix it for the final steps of retrofitting to consolidate it.

The decision to append or merge is discrete. To train LLMs with gradient descent, you perform a continuous relaxation of this decision via the Gumbel-Sigmoid distribution, which results in partially appended and partially merged memory elements during training, as sketched below.
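A minimal sketch of that relaxation (standard Gumbel-Sigmoid sampling; the temperature value is an assumption, and this is not code from the post):

```python
import torch

def gumbel_sigmoid(logits: torch.Tensor, temperature: float = 0.5) -> torch.Tensor:
    """Differentiable relaxation of a Bernoulli decision: values in (0, 1)."""
    # The difference of two independent Gumbel(0, 1) noises, added to the
    # logits and squashed, yields a reparameterized soft binary sample.
    u = torch.rand_like(logits).clamp_min(1e-10)
    v = torch.rand_like(logits).clamp_min(1e-10)
    noise = -torch.log(-torch.log(u)) + torch.log(-torch.log(v))
    return torch.sigmoid((logits + noise) / temperature)

# During retrofitting, alpha is soft and gradients flow through it;
# at inference it is binarized, e.g. alpha = (logits > 0).float().
alpha = gumbel_sigmoid(torch.randn(4))
print(alpha)  # values in (0, 1); they vary per sample
```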

