        _______               __                   _______
       |   |   |.---.-..----.|  |--..-----..----. |    |  |.-----..--.--.--..-----.
       |       ||  _  ||  __||    < |  -__||   _| |       ||  -__||  |  |  ||__ --|
       |___|___||___._||____||__|__||_____||__|   |__|____||_____||________||_____|
                                                              on Gopher (unofficial)
  HTML Visit Hacker News on the Web
       
       
       COMMENT PAGE FOR:
   DIR   Ask HN: How do you add guard rails in LLM response without breaking streaming?
       
       
        olalonde wrote 25 min ago:
        From what I can tell, ChatGPT appears to be doing "optimistic"
        streaming... It will start streaming the response to the user but may
        eventually hide the response if it trips some censorship filter. The
        user can theoretically capture the response from the network since the
        censorship is essentially done client-side but I guess they consider
        that good enough.
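
         The same pattern can be sketched server-side: forward chunks as
         they arrive and emit a "retract" event if a later check trips, so
         the client hides what it has already rendered. A rough Python
         sketch (llm_chunks, is_flagged and send are hypothetical
         stand-ins, not anything OpenAI has published):

             import json

             async def stream_with_retract(llm_chunks, is_flagged, send):
                 # Forward chunks optimistically; if the filter trips,
                 # tell the client to hide what it has rendered so far.
                 shown = []
                 async for chunk in llm_chunks:
                     shown.append(chunk)
                     await send(json.dumps({"type": "chunk", "text": chunk}))
                     if is_flagged("".join(shown)):
                         await send(json.dumps({"type": "retract"}))
                         return
                 await send(json.dumps({"type": "done"}))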
       
        brrrrrm wrote 27 min ago:
        fake it.
        
        add some latency to the first token and then "stream" at the rate you
        received tokens even though the entire thing (or some sizable chunk)
        has been generated.  that'll give you the buffer you need to seem fast
        while also staying safe.
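
         A rough Python sketch of that pacing trick (generate_tokens and
         moderate are hypothetical stand-ins for the model call and the
         guardrail check; this version buffers the whole output):

             import asyncio
             import time

             async def fake_stream(generate_tokens, moderate, head_start=1.5):
                 # Buffer the output (with arrival times), run the check,
                 # then replay at the same rhythm after a short delay.
                 tokens, stamps = [], []
                 async for tok in generate_tokens():
                     tokens.append(tok)
                     stamps.append(time.monotonic())
                 if not moderate("".join(tokens)):
                     yield "Sorry, I can't help with that."
                     return
                 await asyncio.sleep(head_start)
                 for i, tok in enumerate(tokens):
                     if i:
                         await asyncio.sleep(stamps[i] - stamps[i - 1])
                     yield tok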
       
        shaun-Galini wrote 2 hours 36 min ago:
        We have just the product for you! We’ve recently improved guardrail
        accuracy by 25% for a $5B client and would be happy to show you how we
        do it.
        
        You're right - prompt eng. alone doesn't work. It's brittle and fails
        on most evals.
        
        Ping me at shaunayrton@galini.ai
       
        com2kid wrote 2 hours 46 min ago:
        You start streaming the response immediately and kick off your
        guardrails checks. If the guard rail checks are triggered you cancel
        the streaming response.
        
        Perfect is the enemy of good enough.
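
         Roughly, in Python (llm_chunks, run_guardrails and send are
         hypothetical stand-ins):

             import asyncio

             async def stream_with_kill_switch(llm_chunks, run_guardrails, send):
                 # Forward chunks immediately; run the guardrail check
                 # concurrently and cut the stream if it comes back bad.
                 verdict = asyncio.create_task(run_guardrails())
                 async for chunk in llm_chunks:
                     if verdict.done() and not verdict.result():
                         await send({"type": "cancelled"})
                         return
                     await send({"type": "chunk", "text": chunk})
                 if not await verdict:
                     await send({"type": "cancelled"})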
       
          simonw wrote 49 min ago:
          About a year ago I was using an earlier version of Claude to help
          analyze the transcript from a podcast episode.
          
          I'd fed in a raw transcript and I was asking it to do some basic
          editing, remove ums and ahs, that kind of thing.
          
          It had streamed about 80% of the episode when it got to a bit where
          the podcast guest started talking about "bombing a data center"...
          and right in front of my eyes the entire transcript vanished. Claude
          effectively retracted the entire thing!
          
          I tried again in a fresh window and hit Ctrl+A plus Ctrl+C while it
          was running to save as much as I could.
          
           I don't think the latest version of Claude does that any more - if
           it does, I've not seen it.
       
        digitaltrees wrote 3 hours 27 min ago:
        We are using keep-ai.com for a set of health care related AI project
        experiments.
       
        CharlieDigital wrote 4 hours 5 min ago:
        If it's the problem I think it is, the solution is to run two
        concurrent prompts.
        
        First prompt validates the input.  Second prompt starts the actual
        content generation.
        
        Combine both streams with SSE on the front end and don't render the
        content stream result until the validation stream returns "OK".  In the
        SSE, encode the chunks of each stream with a stream ID.  You can also
        handle it on the server side by cancelling execution once the first
        stream ends.
        
         Generally, the experience is good because the validation prompt is
         shorter and reaches its last (and only) token faster.
        
        The SSE stream ends up like this:
        
            data: ing|tomatoes
            
            data: ing|basil
            
            data: ste|3. Chop the
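
         A rough Python sketch of the server-side merge (not the code from
         the writeup below; validation_chunks, content_chunks and send are
         hypothetical):

             import asyncio

             def sse(stream_id, text):
                 # One SSE frame, tagged with the stream it belongs to.
                 return f"data: {stream_id}|{text}\n\n"

             async def multiplex(validation_chunks, content_chunks, send):
                 # Interleave both prompt streams into one SSE response,
                 # prefixing each chunk with its stream ID.
                 async def pump(stream_id, chunks):
                     async for text in chunks:
                         await send(sse(stream_id, text))
                 await asyncio.gather(
                     pump("val", validation_chunks),
                     pump("gen", content_chunks),
                 )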
        
        I have a writeup (and repo) of the general technique of
        multi-streaming: [1] (animated gif at the bottom).
        
  HTML  [1]: https://chrlschn.dev/blog/2024/05/need-for-speed-llms-beyond-o...
       
          lolinder wrote 1 hour 8 min ago:
          This doesn't solve the critical problem, which is that you usually
          can't tell if something is okay until you have context that you don't
          yet have. This is why even SOTA models will backtrack when you hit
          the filter—they only realize you're treading into banned territory
          after a bunch of text has already been generated, including text that
          already breaks the rules.
          
          This is hard to fix because if you don't wait until you have enough
          context, you've given your censor a hair trigger.
          
          > Combine both streams with SSE on the front end and don't render the
          content stream result until the validation stream returns "OK".
          
          Just a note that this particular implementation has the additional
          problem of not actually applying your validation stream at the API
          level, which means your service can and will be abused worse than it
          would be if you combined the streams server-side. You should never
          rely on client-side validation for security or legal compliance.
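
           A server-side variant of the gating (a sketch only; validate,
           content_chunks and send are hypothetical):

               import asyncio

               async def gate_on_validation(validate, content_chunks, send):
                   # Buffer the content stream and only start forwarding
                   # once the validation task says OK, so a client that
                   # ignores the "OK" signal can't see rejected output.
                   held = []
                   verdict = asyncio.create_task(validate())
                   async for chunk in content_chunks:
                       held.append(chunk)
                       if not verdict.done():
                           continue
                       if not verdict.result():
                           return  # never sent to the client
                       for piece in held:
                           await send(piece)
                       held.clear()
                   if await verdict:
                       for piece in held:
                           await send(piece)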
       
        joshhart wrote 4 hours 16 min ago:
         Hi, I run the model serving team at Databricks. Usually you run regex
         filters, Llama Guard, etc. on chunks at a time, so you are still
         streaming, but in batches of tokens rather than one token at a
         time. Hope that helps!
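
         Something like this, as a rough sketch (check_batch stands in for
         the regex / Llama Guard call):

             import re

             BANNED = re.compile(r"(?i)\b(account number|ssn)\b")  # toy rule

             def check_batch(text):
                 # Stand-in for regex filters, Llama Guard, etc.
                 return BANNED.search(text) is None

             def stream_in_batches(tokens, batch_size=20):
                 # Emit tokens in batches, running the filters once per
                 # batch: still a stream, just a slightly chunkier one.
                 batch = []
                 for tok in tokens:
                     batch.append(tok)
                     if len(batch) >= batch_size:
                         text = "".join(batch)
                         if not check_batch(text):
                             return
                         yield text
                         batch = []
                 if batch and check_batch("".join(batch)):
                     yield "".join(batch)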
        
        You could of course use us and get that out of the box if you have
        access to Databricks.
       
          lordswork wrote 3 hours 37 min ago:
          But ultimately, it's an unsolved problem in the field. Every single
          LLM has been jailbroken.
       
            accrual wrote 2 hours 28 min ago:
            Has o1 been jailbroken? My understanding is o1 is unique in that
            one model creates the initial output (chain of thought) then
             another model prepares the final response for viewing. Seems like
            that would be a fairly good way to prevent jailbreaks, but I
            haven't investigated myself.
       
              tcdent wrote 1 hour 56 min ago:
              Literally everything is trivial to jailbreak.
              
              The core concept is to pass information into the model using a
               cipher. One that is not so hard that the model can't figure it
               out, but not so easy that it gets detected.
              
              And yes, o1 was jailbroken shortly after release:
              
  HTML        [1]: https://x.com/elder_plinius/status/1834381507978280989
       
        potatoman22 wrote 4 hours 24 min ago:
        Depending on what sort of constraints you need on your output, a custom
        token sampler, logit bias, or verifying it against a grammar could do
        the trick.
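
         For the logit-bias idea, a toy sampler (not any particular
         library's API) might look like:

             import math
             import random

             def sample_with_bias(logits, bias, temperature=1.0):
                 # logits: {token: score}; bias: {token: additive nudge}.
                 # A large negative bias effectively bans a token, which
                 # is one way to keep certain strings out of the output.
                 adj = {t: (s + bias.get(t, 0.0)) / temperature
                        for t, s in logits.items()}
                 z = max(adj.values())
                 weights = {t: math.exp(s - z) for t, s in adj.items()}
                 total = sum(weights.values())
                 r, acc = random.random() * total, 0.0
                 for tok, w in weights.items():
                     acc += w
                     if acc >= r:
                         return tok
                 return tok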
       
        jonathanrmumm wrote 4 hours 27 min ago:
        have it format in yaml instead of json, incomplete yaml is still valid
        yaml
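
         For example, with PyYAML (the chunk boundaries are made up, and a
         truncated document usually parses, though not in every case):

             import yaml  # PyYAML

             def parse_partial(buffer):
                 # Parse whatever has streamed in so far; truncated YAML
                 # often still loads, unlike truncated JSON.
                 try:
                     return yaml.safe_load(buffer)
                 except yaml.YAMLError:
                     return None  # wait for more chunks

             buf = ""
             for chunk in ["title: Tomato soup\n", "steps:\n",
                           "  - Chop the tomatoes\n", "  - Simmer"]:
                 buf += chunk
                 print(parse_partial(buf))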
       
        seany62 wrote 4 hours 27 min ago:
        > Hi all, I am trying to build a simple LLM bot and want to add guard
        rails so that the LLM responses are constrained.
        
        Give examples of how the LLM should respond. Always give it a default
        response as well (e.g. "If the user response does not fall into any of
        these categories, say x").
        
        > I can manually add validation on the response but then it breaks
        streaming and hence is visibly slower in response.
        
        I've had this exact issue (streaming + JSON). Here's how I approached
        it:
        1. Instruct the LLM to return the key "test" in its response. 
        2. Make the streaming call.
        3. Build your JSON response as a string as you get chunks from the
        stream.
         4. Once you detect that key in the string, start sending all
         subsequent chunks wherever you need.
        5. Once you get the end quotation, end the stream.
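
         A rough sketch of steps 3-5 (chunks is a hypothetical stand-in for
         the stream; escaped quotes split across chunks aren't handled):

             def stream_json_value(chunks, key="test", send=print):
                 # Accumulate the raw JSON text, start forwarding once we
                 # are inside the value of "test", stop at its closing
                 # quote.
                 buf, inside = "", False
                 for chunk in chunks:
                     buf += chunk
                     if not inside:
                         i = buf.find(f'"{key}"')
                         if i == -1:
                             continue
                         j = buf.find('"', i + len(key) + 2)  # value's opening quote
                         if j == -1:
                             continue
                         buf, inside = buf[j + 1:], True
                     end = -1
                     for k, ch in enumerate(buf):
                         if ch == '"' and (k == 0 or buf[k - 1] != "\\"):
                             end = k
                             break
                     if end == -1:
                         send(buf)
                         buf = ""
                     else:
                         send(buf[:end])
                         return

             stream_json_value(['{"te', 'st": "Hel', 'lo there"}'])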
       
        viewhub wrote 4 hours 45 min ago:
        What's your stack? What type of response times are you looking for?
       
        throwaway888abc wrote 1 day ago:
        Not sure about the exact nature of your project, but for something
        similar I’ve worked on, I had success using a combination of custom
        stop words and streaming data with a bit of custom logic layered on
        top. By fine-tuning the stop words specific to the domain and applying
         filters in real-time as the data streams in, I was able to improve the
         responses to users' taste. Depending on your use case, adding logic to
        dynamically adjust stop words or contextually weight them might also
        help you.
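
         For example (the stop words and helpers here are made up; tune
         them to your domain):

             STOP_WORDS = {"confidential", "internal only"}

             def filter_stream(chunks, send):
                 # Check the accumulated text each time a chunk arrives so
                 # a stop word split across chunks is still caught, then
                 # cut the stream the moment one shows up.
                 seen = ""
                 for chunk in chunks:
                     seen += chunk
                     if any(w in seen.lower() for w in STOP_WORDS):
                         send("[response truncated]")
                         return
                     send(chunk)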
       
       