• Admiral Patrick@dubvee.org
    English
    arrow-up
    85
    arrow-down
    5
    ·
    2 months ago

    As junk web pages written by AI proliferate, the models that rely on that data will suffer.

    Good.

  • Madrigal@lemmy.world
    English
    arrow-up
    79
    arrow-down
    1
    ·
    2 months ago

    “On two occasions I have been asked, ‘Pray, Mr. Babbage, if you put into the machine wrong figures, will the right answers come out?’ I am not able rightly to apprehend the kind of confusion of ideas that could provoke such a question.” - Charles Babbage

    • bionicjoey@lemmy.ca
      English
      arrow-up
      15
      arrow-down
      1
      ·
      2 months ago

      The business people adopting AI: “Who cares what it’s trained on? It’s intelligent, right? It’ll just sort through the garbage and magically come up with the right answers to everything.”

    • CookieOfFortune@lemmy.world
      English
      arrow-up
      6
      ·
      2 months ago

      Of course modern UX design is very much based on getting the right answer with the wrong inputs (autocorrect, etc).

      • lennivelkant@discuss.tchncs.de
        English
        arrow-up
        1
        ·
        2 months ago

        I believe “robustness” was the term I learned years ago: the ability of a system to gracefully handle user error, make it easy to recover from or fix, clearly communicate what went wrong, etc.

        Of course, nothing is ever perfect and humans are very creative at fucking up, and a lot of companies don’t seem to take UX too seriously. Particularly when the devs get tunnel vision and forget about user error being a thing…

  • tal@lemmy.today
    English
    arrow-up
    26
    arrow-down
    1
    ·
    2 months ago

    Well, at archive.org you’ve got a timestamped copy of much of the Web as it existed up until latent-diffusion models. That may not give you access to newer information, but it’s a pretty whopping big chunk of data to work with.

    • palordrolap@kbin.run
      arrow-up
      19
      ·
      2 months ago

      Hopefully archive.org have measures in place to stop people from yanking all their data too quickly. At least not without a hefty donation or something. As a user it can chug a bit, and I’m hoping that’s the rate-limiting I’m talking about and not that they’re swamped.

      • Grimy@lemmy.world
        English
        arrow-up
        9
        arrow-down
        2
        ·
        edit-2
        2 months ago

        That would go against the principle of the archive, imo, but regardless, if you take away all means of acquiring data freely, you are just giving companies like OpenAI and Google, who already have copies of it, an insane advantage.

        AI isn’t going away. We need to make sure we have free access to it so as not to give our whole economy to a handful of companies.

  • Anarki_@lemmy.blahaj.zone
    English
    arrow-up
    14
    ·
    2 months ago

    ⢀⣠⣾⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⠀⠀⠀⠀⣠⣤⣶⣶ ⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⠀⠀⠀⢰⣿⣿⣿⣿ ⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣧⣀⣀⣾⣿⣿⣿⣿ ⣿⣿⣿⣿⣿⡏⠉⠛⢿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⡿⣿ ⣿⣿⣿⣿⣿⣿⠀⠀⠀⠈⠛⢿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⠿⠛⠉⠁⠀⣿ ⣿⣿⣿⣿⣿⣿⣧⡀⠀⠀⠀⠀⠙⠿⠿⠿⠻⠿⠿⠟⠿⠛⠉⠀⠀⠀⠀⠀⣸⣿ ⣿⣿⣿⣿⣿⣿⣿⣷⣄⠀⡀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⣴⣿⣿ ⣿⣿⣿⣿⣿⣿⣿⣿⣿⠏⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠠⣴⣿⣿⣿⣿ ⣿⣿⣿⣿⣿⣿⣿⣿⡟⠀⠀⢰⣹⡆⠀⠀⠀⠀⠀⠀⣭⣷⠀⠀⠀⠸⣿⣿⣿⣿ ⣿⣿⣿⣿⣿⣿⣿⣿⠃⠀⠀⠈⠉⠀⠀⠤⠄⠀⠀⠀⠉⠁⠀⠀⠀⠀⢿⣿⣿⣿ ⣿⣿⣿⣿⣿⣿⣿⣿⢾⣿⣷⠀⠀⠀⠀⡠⠤⢄⠀⠀⠀⠠⣿⣿⣷⠀⢸⣿⣿⣿ ⣿⣿⣿⣿⣿⣿⣿⣿⡀⠉⠀⠀⠀⠀⠀⢄⠀⢀⠀⠀⠀⠀⠉⠉⠁⠀⠀⣿⣿⣿ ⣿⣿⣿⣿⣿⣿⣿⣿⣧⠀⠀⠀⠀⠀⠀⠀⠈⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢹⣿⣿ ⣿⣿⣿⣿⣿⣿⣿⣿⣿⠃⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢸⣿⣿

  • kromem@lemmy.world
    English
    arrow-up
    15
    arrow-down
    2
    ·
    2 months ago

    I’d be very wary of extrapolating too much from this paper.

    Past research along these lines found that a mix of synthetic and organic data was better than organic alone. A caveat for all the research to date is that it uses shitty cheap models, whose synthetic data is significantly degraded compared to that of SotA models, whereas other research has found notable improvements to smaller models from synthetic data generated by SotA models.

    Basically, this is only really saying that AI models of multiple types, with capabilities from a year or two ago, will collapse when recursively trained with no additional organic data.

    It’s not representative of real-world or emerging conditions.
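
    To make the distinction above concrete, here is a toy sketch, not the paper’s setup: the “model” is just a 1-D Gaussian refit each generation, and every name and parameter (`simulate`, `organic_fraction`, the sample sizes) is an assumption made up for illustration. It compares retraining purely on the previous generation’s output against retraining on a half-and-half mix with fresh “organic” samples.

```python
import random
import statistics


def simulate(generations, n_samples, organic_fraction, seed=0):
    """Refit a 1-D Gaussian 'model' each generation; return its std per generation.

    Toy stand-in for recursive training. All names here are hypothetical.
    """
    rng = random.Random(seed)
    mu, sigma = 0.0, 1.0  # generation 0 matches the "organic" distribution N(0, 1)
    history = [sigma]
    n_organic = int(n_samples * organic_fraction)
    for _ in range(generations):
        # Synthetic data: samples drawn from the previous generation's model.
        data = [rng.gauss(mu, sigma) for _ in range(n_samples - n_organic)]
        # Organic data: fresh samples from the true distribution.
        data += [rng.gauss(0.0, 1.0) for _ in range(n_organic)]
        # "Training" is a maximum-likelihood fit of mean and std.
        mu = statistics.fmean(data)
        sigma = statistics.pstdev(data)
        history.append(sigma)
    return history


synthetic_only = simulate(generations=300, n_samples=10, organic_fraction=0.0)
mixed = simulate(generations=300, n_samples=10, organic_fraction=0.5)
print(f"synthetic only, final std: {synthetic_only[-1]:.3g}")
print(f"half organic,   final std: {mixed[-1]:.3g}")
```

    In this toy, the synthetic-only run tends to shrink its spread toward zero over the generations, while the run that keeps getting organic samples tends to stay at a usable width. It says nothing about how real models of any size behave.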

  • KevonLooney@lemm.ee
    English
    arrow-up
    12
    ·
    edit-2
    2 months ago

    “provenance requires some way to filter the internet into human-generated and AI-generated content, which hasn’t been cracked yet”

    It doesn’t need to be filtered into human / AI content. It needs to be filtered into good (true) / bad (false) content. Or a “truth score” for each.

    We don’t teach children to read by just handing them random tweets. We give them books that are made specifically for children. Our filtering mechanism for good / bad content is very robust for humans. Why can’t AI just read every piece of “classic literature”, famous speeches, popular books, good TV and movie scripts, textbooks, etc?

    • FaceDeer@fedia.io
      arrow-up
      2
      arrow-down
      1
      ·
      edit-2
      2 months ago

      And they’re overlooking that radionuclide contamination of steel actually isn’t much of a problem any more, since the surge in background radionuclides caused by nuclear testing peaked in 1963 and has since gone down almost back to the original background level again.

      I guess it’s still a good analogy, though. People bring up Low Background Steel because they think radionuclide contamination is an unsolved problem (despite it having been basically solved), and they bring up “model collapse” because they think it’s an unsolved problem (despite it having been basically solved). It’s like newspaper stories: everyone sees the big scary front-page headline, but nobody pays attention to the little block of text retracting it on page 8.

  • sundray@lemmus.org
    English
    arrow-up
    3
    ·
    2 months ago

    AI writing, scraped by AI, producing more AI writing…

    So not “gray goo” exactly, but “gray slop”?