Home
        _______               __                   _______
       |   |   |.---.-..----.|  |--..-----..----. |    |  |.-----..--.--.--..-----.
       |       ||  _  ||  __||    < |  -__||   _| |       ||  -__||  |  |  ||__ --|
       |___|___||___._||____||__|__||_____||__|   |__|____||_____||________||_____|
                                                              on Gopher (unofficial)
  HTML Visit Hacker News on the Web
       
       
       COMMENT PAGE FOR:
  HTML   Efficient high-resolution image synthesis with linear diffusion transformer
       
       
        cpldcpu wrote 5 hours 6 min ago:
        >We introduce a new Autoencoder (AE) that aggressively increases the
        scaling factor to 32. Compared with AE-F8, our AE-F32 outputs 16×
        fewer latent tokens,
        
        Basically they compress/decompress the images more, which means they
        need less computation during generation. But on the flip side this
        should mean less variability.
        
        Isn't this more of a design trade-off than an optimization?
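         
         As a rough back-of-the-envelope (a sketch assuming square latents and
         a DiT patch size of 1, which may not match the paper exactly), the
         token-count difference looks like this:
         
           resolution = 1024                # 1024x1024 target image
           for factor in (8, 32):           # AE-F8 vs AE-F32 downsampling
               side = resolution // factor  # latent grid side length
               tokens = side * side         # tokens the DiT must process
               print(factor, tokens)        # 8 -> 16384, 32 -> 1024 (16x fewer)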
       
          Lerc wrote 3 hours 42 min ago:
          It might not be compressing more (haven't yet looked at the paper). 
          You can have fewer but larger tokens for the same amount of data.
          
           It would decrease the workload by having fewer things to compare
           against, balanced against a higher workload per comparison. For
           normal N² attention that makes sense, but the page says:
           
           > We introduce a new linear DiT, replacing vanilla quadratic
           attention and reducing complexity from O(N²) to O(N). Mix-FFN ...
           
           So I'm not sure what's up there.
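           
           For reference, a generic linear-attention sketch (not necessarily
           the paper's exact formulation) shows where the O(N) comes from:
           the per-token key-value outer products are summed once, so cost
           scales with N*d^2 rather than N^2*d:
           
             import torch
             
             def linear_attention(q, k, v, eps=1e-6):
                 # q, k, v: (batch, heads, N, d). ReLU is one common kernel
                 # feature map; the paper's choice may differ.
                 q, k = torch.relu(q), torch.relu(k)
                 kv = torch.einsum('bhnd,bhne->bhde', k, v)   # O(N*d^2)
                 z = torch.einsum('bhnd,bhd->bhn', q, k.sum(dim=2)) + eps
                 return torch.einsum('bhnd,bhde->bhne', q, kv) / z.unsqueeze(-1)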
       
        henning wrote 8 hours 50 min ago:
        Trained on stolen copyrighted work? Or fairly licensed? Not that AI
        bros give a shit about the law or treating people fairly.
       
          zaptrem wrote 3 hours 35 min ago:
          Copyright means you own the right to reproduce a given work. It
          doesn't mean you own the ideas behind that work. If that were true,
          then all of modern music would instantly be a copyright violation.
       
            jrm4 wrote 1 hour 25 min ago:
            Did you see the results of the Marvin Gaye / Pharrell Williams
            case? Sadly, it's getting pretty close to that.
       
              zaptrem wrote 1 hour 22 min ago:
               Exactly what the labels want, since if that type of thing keeps
               going their way, they will soon own not just every song, like
               they currently do, but every future song forever.
       
          ClassyJacket wrote 5 hours 36 min ago:
          This argument is only fair if you also think human artists should be
           banned, from birth, from ever looking at any other art. After all,
           that would be training on stolen copyrighted work.
       
          david-gpu wrote 7 hours 14 min ago:
          Do you believe that human artists should pay license fees for all the
          art that they have ever seen, studied or drawn inspiration from?
          Whether graphic artists, writers or what have you.
       
            accrual wrote 2 hours 7 min ago:
            I'm still trying to figure out which side to be on. On one hand I
            agree with you - there would be little modern art if it wasn't for
            centuries of preceding inspiration.
            
            On the other hand, at least one suit was making headway as of
            2024-08-14, about 2 months ago [0]. It seems like there must be
             some merit to the GP's claim if this is moving forward. But again,
             I'm still trying to figure out where to stand.
            
            [0]
            
  HTML      [1]: https://arstechnica.com/tech-policy/2024/08/artists-claim-...
       
            kadoban wrote 6 hours 55 min ago:
             Human artists get in copyright trouble if they spam out a copy of
            something they studied and sell it. The businesses using AI artists
            do not seem to.
       
              david-gpu wrote 5 hours 14 min ago:
              Artists who think that their copyright has been infringed upon
              are free to sue, just as they do when the alleged plagiarist is a
              human. I fail to see the difference.
       
              ClassyJacket wrote 5 hours 36 min ago:
              Image generation models don't do that either
       
        lpasselin wrote 9 hours 3 min ago:
        This comes from the same group as the EfficientViT model. A few months
         ago, their EfficientViT model was the only modern and small ViT-style
         model I could find that had raw PyTorch code available. No dependencies
         on the shitty frameworks and libraries that other ViTs use.
       
        cube2222 wrote 10 hours 7 min ago:
        This looks like quite a huge breakthrough, unless I'm missing
        something?
        
        ~25x faster performance than Flux-dev, while offering comparable
        quality in benchmarks. And visually the examples (surely cherry-picked,
        but still) look great!
        
        Especially since with GenAI the best way to get good results is to just
         generate a large number of them and pick the best (imo). Performance
        like this will make that much easier/faster/cheaper.
        
        Code is unfortunately "(Coming soon)" for now. Can't wait to play with
        it!
       
          godelski wrote 4 hours 23 min ago:
          > surely cherry-picked
          
           As someone who works in generative vision, this is one of the most
           frustrating aspects (especially for those with fewer GPU resources).
           There's been a silent competition for picking the best images and not
           showing random results (even when random results are shown, they may
           come from a selected batch). So it is hard to judge actual quality
           until you can play around.
           
           Also, I'm not sure what laptop that is, but they say 0.37s to
           generate a 1024x1024 image on a 4090. They also mention that it
           requires 16GB VRAM. But that laptop looks like an MSI Titan, which
           has a 4090, and correct me if I'm wrong, but I think the 4090 is the
           only mobile card with 16GB?[0] (I know most desktop cards have
           16GB). The laptop demo takes 4s to generate a 1024x1024 image. But
           the mobile cards are chopped down quite a bit[1]
          
          I wonder if that's with or without TensorRT
          
          [0] [1]
          
  HTML    [1]: https://en.wikipedia.org/wiki/List_of_Nvidia_graphics_proces...
  HTML    [2]: https://gpu.userbenchmark.com/Compare/Nvidia-RTX-4090-Laptop...
       
            zamadatix wrote 4 hours 5 min ago:
            The GeForce RTX 3080 Mobile and GeForce RTX 3080 Ti Mobile also
            have 16 GB versions as noted directly above the linked section on
            [0].
       
              godelski wrote 3 hours 2 min ago:
              Thanks! I forgot about that (usually mobile cards have less VRAM,
              not more lol). I don't necessarily doubt the paper's generation
              claim, but there are of course many factors that could help
               clarify what that number actually represents.
       
          Lerc wrote 5 hours 35 min ago:
          >This looks like quite a huge breakthrough, unless I'm missing
          something?
          
           Looking at their methodology, it seems like it's more of an
           accumulation of existing good ideas into one model.
          
          If it performs as well as they say, perhaps you can say the
          breakthrough is discovering just how much can be gained by combining
          recent advances.
          
          It's sitting on just the edge of sounding too good to be true to me. 
          I will certainly be pleased if it holds up to scrutiny.
       
          Archit3ch wrote 8 hours 3 min ago:
          If you generate 25x more images, you can afford to cherry-pick.
       
            Lerc wrote 5 hours 41 min ago:
             That transfers computer time to user time. It's great when you want
             variations, less so when you want precision and consistency.
             Picking the best image tires the brain quite quickly: you have to
             take the at-a-glance quality into account without letting it
             override the detail quality.
             
             I'd be curious to see how a vision model would go if it were
             finetuned to select the best image match for given criteria.
             
             It's possible that you could do O1-style training to build a
             final-stage auto-cherrypicker.
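             
             A toy version of that final stage (the names here are
             placeholders, not Sana's or any real API) would just be best-of-n
             with an automatic scorer:
             
               def best_of_n(prompt, generate, score, n=8):
                   # generate(prompt) -> image and score(prompt, image) -> float
                   # are hypothetical callables, e.g. a diffusion sampler and a
                   # CLIP-style preference scorer.
                   images = [generate(prompt) for _ in range(n)]
                   return max(images, key=lambda img: score(prompt, img))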
       
            cube2222 wrote 7 hours 15 min ago:
            It would be interesting to have benchmarks that take this into
            account (maybe they already do or I’m misunderstanding how those
            benchmarks work). I.e. when comparing quality between two different
            models of vastly different performance, you could be doing
            best-of-n in the faster model.
       
              Vt71fcAqt7 wrote 6 hours 51 min ago:
               That sounds like it could be an interesting metric. Worth noting
               that there is a difference between an algorithmic "best of n"
               selection (via e.g. an FID score) vs. manual cherry-picking,
               which takes more factors into account, such as user preference,
               and also takes time to evaluate, which is what GP was suggesting.
       
                cube2222 wrote 6 hours 34 min ago:
                Yeah I’d likely just pick the best scoring one (that is, the
                pick is made by the evaluation tool, not the model) - to
                simulate “whatever the receiver deemed best for what they
                wanted”.
       
          liuliu wrote 9 hours 56 min ago:
         If you look closer at the benchmark, it seems to be slightly worse
         than FLUX [dev] on prompt adherence and quality. However, it's best
         to evaluate the results oneself, and the track record of PixArt Sigma
         (from the same author?) is pretty good!
       
        echelon wrote 11 hours 40 min ago:
        Image models are going to be widely available. They'll probably be a
        dime a dozen soon. It's great that an increasing number of models are
        going open, because these are the ecosystems that will grow.
        
        3D models (sculpts, texture, retopo, etc.) are following a similar
        trend and trajectory.
        
         Open video models are lagging behind by several years. While CogVideo
         and Pyramid are promising, video models are petabyte-scale and so much
         more costly to build and train.
        
        I'm hoping video becomes free and cheap, but it's looking like we might
        be waiting a while.
        
        Major kudos to all of the teams building and training open source
        models!
       
        smusamashah wrote 12 hours 54 min ago:
        > (e.g. Flux-12B), being 20 times smaller and 100+ times faster in
        measured throughput. Moreover, Sana-0.6B can be deployed on a 16GB
        laptop GPU, taking less than 1 second to generate a 1024 × 1024
        resolution image.
       
       
   DIR <- back to front page