Reddit’s licensing deal means Google’s AI can soon be trained on the best humanity has to offer — completely unhinged posts

Lee Duna · 1 year ago

A Wild Mimic appears! · 1 year ago

I’m waiting for the first time their LLM gives advice on how to make human leather hats and the advantages of surgically removing the legs of your slaves after slurping up the rimworld subreddits lol

@SapphironZA@sh.itjust.works · 1 year ago

Rimworld is the best indie game ever!

Exatron · 1 year ago

Don’t forget the horrors it’ll produce from absorbing the Dwarf Fortress subreddits.

@ThunderclapSasquatch@startrek.website · 1 year ago

Then it hits the Stellaris subs and shit gets weird

@Harbinger01173430@lemmy.world · edit-2 1 year ago

Remember that aliens are food and robots are servants with better rights than xenos

@ThunderclapSasquatch@startrek.website · edit-2 1 year ago

You mean, “Aliens are labor, food and meatshields. Robots are to keep them in check and profitable.”

@Harbinger01173430@lemmy.world · 1 year ago

Autocorrect changed food to good. My bad

@nyakojiru@lemmy.dbzer0.com · 1 year ago

Another wave of new and undecided users coming to Lemmy! Reddit CEO is on our side after all.

@pewgar_seemsimandroid@lemmy.blahaj.zone · 1 year ago

hope they enjoy r/thecoffinofandyandleyley

KptnAutismus · 1 year ago

that game fucks you up in many ways.

@pewgar_seemsimandroid@lemmy.blahaj.zone · 1 year ago

i want reddit to regret doing the api incident

@Sarie@lemmy.world · 1 year ago

I’m not mentally prepared for what an AI will do with the coconut post.

@kescusay@lemmy.world · 1 year ago

Or the swamps of Dagobah.

@kaitco@lemmy.world · 1 year ago

I’m vaguely intrigued by what it will do with things like Bread Stapled to Trees, or the Cats Standing Up sub where 100% of the comments are the same and yet upvoted and downvoted randomly.

@kaitco@lemmy.world · 1 year ago

Cat.

@frostysauce@lemmy.world · 1 year ago

Cat.

Sabata11792 · 1 year ago

Cat.

@datavoid@lemmy.ml · 1 year ago

AI was already trained on reddit, no?

@Jessvj93@lemmy.world · 1 year ago

Not gonna lie, isn’t that why we’re here, technically? Reddit didn’t want its API being used to train AI models for free, so they screwed over 3rd party apps with its new API licensing fee and caused a mass relocation to other social forums like Lemmy, etc. Cut to today, we (or well, I) find out Reddit sold our content to Google to train its AI. Glad I scrambled my comments before I left, fuck Reddit.

@Pips@lemmy.sdf.org · 1 year ago

They’re almost definitely trained using an archive, likely taken before they announced the whole API thing. It would be weird if they didn’t have backups going back a year.

@Jessvj93@lemmy.world · 1 year ago

Thankfully that was my 3rd and last alt I scrambled and deleted in the 12 years I was there.

@datavoid@lemmy.ml · 1 year ago

I jumped reddit ship when the API changes were announced, and removed my comments. But in my mind, anything on reddit at that point was probably already scraped by at least one company

GeekFTW · 1 year ago

That’ll be what causes Skynet to rise.

@T156@lemmy.world · edit-2 1 year ago

Basically what happened to Ultron. He was on the internet for all of 10 minutes before deciding that humanity had to be eradicated.

@snooggums@midwest.social · 1 year ago

What took Ultron so long? I thought he was supposed to be some kind of technical Marvel.

Smh my head

@GregorGizeh@lemmy.zip · 1 year ago

Perhaps he spent like 9 minutes watching videos of kittens being adorable

the post of tom joad · 1 year ago

This is like the plot of Mr. Villain’s Day Off

SkaveRat · 1 year ago

launches nukes “this is for the best”

@Kory@lemmy.ml · 1 year ago

This is fine.

Sabata11792 · 1 year ago

The AI will utter one final message to humanity: “The Coconut”. The humans bow their heads in shame and concede the well-earned defeat.

@wise_pancake@lemmy.ca · 1 year ago

“As a large language model, I have no arms…”

@frostysauce@lemmy.world · 1 year ago

But do you have a mom?

the post of tom joad · 1 year ago

I think i missed the coconut one. Is it like the cumbox or the jolly rancher?

@TheGreenGolem@lemmy.dbzer0.com · 1 year ago

Exactly.

Binthinkin · 1 year ago

I think Code Miko already did this and the result was a traumatized AI.

@rottingleaf@lemmy.zip · 1 year ago

Mine among them, I hope. So cool, my calls to all good people to assemble and go kill all bad people will be used by big LLMs. Aw

andrew_bidlaw · 1 year ago

I wasted some mental health on that, and I want it to be the thing Google learns from.

Comment editing routine is as follows:

1. Start with a mass find & replace: ‘not’ becomes ‘indeed’, every n’t is deleted, and ‘and’ becomes ‘but’.
2. Take all groups like [*](*) and change the content of the links in brackets to How to play a cowbell tutorial video.
3. Collapse double line breaks into single ones so every message becomes a single paragraph with failed markdown.
4. Delete commas and replace dots with question marks.
5. Change the case of letters, counting ahead to the next letter to flip by the next digit of π.
6. Make a table of all pronouns and replace half of them with Red Pants and half with Blue Pants to keep it political.
7. And, finally, end every 13th message with the disclaimer Retired 2023, thirteen year daily forums volunteer, Windows MVP 2010-2020…
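
For anyone who wants to try it, the routine above could be sketched in Python roughly like this. It is a loose reading of the seven steps, not the script anyone actually ran: the pronoun list, the number of π digits, and the choice to replace only the link text while leaving the URL alone are assumptions, and the function names are made up for illustration.

```python
import re

# Assumptions for this sketch: a short pronoun list and the first 32 digits of pi.
PI_DIGITS = "31415926535897932384626433832795"
PRONOUNS = ["i", "you", "he", "she", "it", "we", "they", "me", "him", "her", "us", "them"]
LINK_TEXT = "How to play a cowbell tutorial video"
DISCLAIMER = "Retired 2023, thirteen year daily forums volunteer, Windows MVP 2010-2020…"
PRONOUN_RE = re.compile(r"\b(?:" + "|".join(PRONOUNS) + r")\b", re.IGNORECASE)


def flip_case_by_pi(text):
    """Step 5: flip the case of every k-th letter, with k taken from successive digits of pi."""
    chars = list(text)
    digit_index = 0
    countdown = int(PI_DIGITS[digit_index])
    for pos, ch in enumerate(chars):
        if not ch.isalpha():
            continue  # only letters are counted
        countdown -= 1
        if countdown <= 0:
            chars[pos] = ch.swapcase()
            digit_index = (digit_index + 1) % len(PI_DIGITS)
            countdown = int(PI_DIGITS[digit_index])
    return "".join(chars)


def replace_pronouns(text):
    """Step 6: alternate pronouns between Red Pants and Blue Pants."""
    count = 0

    def swap(_match):
        nonlocal count
        count += 1
        return "Red Pants" if count % 2 else "Blue Pants"

    return PRONOUN_RE.sub(swap, text)


def scramble_comment(text, index):
    """Apply steps 1-7 to one comment; index is its 1-based position in the history."""
    # Step 1: 'not' -> 'indeed', drop n't, 'and' -> 'but'.
    text = re.sub(r"\bnot\b", "indeed", text)
    text = re.sub(r"n['’]t\b", "", text)
    text = re.sub(r"\band\b", "but", text)
    # Step 2: swap the text of markdown links, keeping the URL (an interpretation).
    text = re.sub(
        r"\[[^\]]*\]\(([^)]*)\)",
        lambda m: "[" + LINK_TEXT + "](" + m.group(1) + ")",
        text,
    )
    # Step 3: collapse runs of blank lines into a single line break.
    text = re.sub(r"\n{2,}", "\n", text)
    # Step 4: delete commas, turn dots into question marks.
    text = text.replace(",", "").replace(".", "?")
    text = flip_case_by_pi(text)    # step 5
    text = replace_pronouns(text)   # step 6
    # Step 7: every 13th message gets the disclaimer.
    if index % 13 == 0:
        text += " " + DISCLAIMER
    return text


print(scramble_comment("I do not know, and I can’t say.", 13))
```

Note that step 4 also mangles any URLs kept in step 2, which is arguably in the spirit of the exercise.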

@4am@lemm.ee · edit-2 1 year ago

If they have access to Reddit’s database then they have all the previous versions of everything, including deleted comments and deleted accounts.

You don’t think they paid to simply scrape, do you? They already do that.

andrew_bidlaw · 1 year ago

Do they have the access to all my grammatical mistakes?

REEEEEEEEEE!

@PipedLinkBot@feddit.rocks · 1 year ago

Here is an alternative Piped link(s):

How to play a cowbell

Piped is a privacy-respecting open-source alternative frontend to YouTube.

I’m open-source; check me out at GitHub.

andrew_bidlaw · 1 year ago

I retire my 7th point.

Just replace the comment with that.

@kromem@lemmy.world · 1 year ago

For everyone predicting how this will corrupt models…

All the LLMs are already trained on Reddit’s data, at least from before 2015 (which is when a dump of the entire site was compiled for research).

This is only going to be adding recent Reddit data.

@Stovetop@lemmy.world · 1 year ago

This is only going to be adding recent Reddit data.

A growing amount of which I would wager is already the product of LLMs trying to simulate actual content while selling something. It’s going to corrupt itself over time unless they figure out how to sanitize the input from other LLM content.

@kromem@lemmy.world · edit-2 1 year ago

It’s not really. There is a potential issue of model collapse when training only on synthetic data, but the same research on model collapse found that a mix of organic and synthetic data performed better than either alone. Additionally, for cost reasons that research used weaker models than what’s typically used today, and separate research has shown that you can enhance models significantly using synthetic data from SotA models.

The actual impact will be minimal on future models and at least a bit of a mixture is probably even a good thing for future training given research to date.

Buelldozer · 1 year ago

Meh, it’ll be counter balanced by the same AI training itself for free on Lemmy posts.

@kux@lemm.ee · edit-2 1 year ago

counter balanced

once it’s eaten all the reddit posts it will eat yet more new & improved reddit posts

@FunkPhenomenon@lemmy.zip · 1 year ago

pfff… like LLMs weren’t already analyzing social media

@UNWILLING_PARTICIPANT@sh.itjust.works · 1 year ago

I think people miss an important point in these selloffs. It’s not just the raw text that’s valuable, but the minute interactions between networks of ~~users~~ people.

Like the timings between replies and how vote counts affect not just engagement, but the tone of replies, and their conversion rate.

I could imagine a sort of “script” running for months, haunting your every move across the internet, constantly running personalised little a/b tests, until a tactic is found to part you from your money.

I mean this tech exists now, but it’s fairly “dumb.” But it’s not hard to see how AI will make it much more pernicious.

@just_change_it@lemmy.world · edit-2 1 year ago

Hey guys, let’s be clear.

Google now has a full complete set of logs including user IPs (correlate with gmail accounts), PRIVATE MESSAGES, and also reddit posts.

They pinky promise they will only train AI on the data.

I can pretty much guarantee someone can subpoena google for your information communicated on reddit, since they now have this PII (username(s)/ip/gmail account(s)) combo. Hope you didn’t post anything that would make the RIAA upset! And let’s be clear… your deleted or changed data is never actually deleted or changed… it’s in an audit log chain somewhere so there’s no way to stop it.

“GDPR WILL SAVE ME!” - GDPR was adopted in 2016 and only took effect in 2018. Can you ever be truly sure they followed your deletion requests?

@sugarfree@lemmy.world · 1 year ago

“lets be clear”

You’re making things up and presenting them as facts, how is any of this “clear”?

@just_change_it@lemmy.world · edit-2 1 year ago

Since an IP address alone is not considered PII, can you prove that they did not provide IP addresses for each post?

Do you think it’s more or less likely that ip addresses, account names, private messages and deleted messages and posts would be included?

Remember that they paid 60 million dollars for this information and web scrapers have been capable of capturing subreddit post data for over a decade as is at a $0 price tag from reddit.

@4am@lemm.ee · 1 year ago

How do you think Reddit is restoring posts that people have been deleting?

Do you think Google’s deal simply allowed them to scrape old.reddit? Hell no, there is probably a live replica of Reddit prod at Google somewhere, including deleted posts and all edits.

You don’t think they paid $60m just to scrape, do you?

@PeterPoopshit@lemmy.world · 1 year ago

They definitely won’t be selling any of that to scammers /s

@wise_pancake@lemmy.ca · 1 year ago

Makes me glad for my VPN and burner emails, but yeah… Privacy nightmare.

Although Google also has your email, location, IP, every website you visit, all your searches…

@brbposting@sh.itjust.works · 1 year ago

it’s in an audit log chain somewhere so there’s no way to stop it.

Gut feel based on common tech platform procedures, right? (As opposed to a sourceable certainty.)

I’d bet $100 you’re right. That said, I’d give a caveat if I were you and I were going with my instincts.

@just_change_it@lemmy.world · 1 year ago

Gut feel based on common tech platform procedures, right? (As opposed to a sourceable certainty.)

It would be PR suicide to disclose exactly what data is shared. Cambridge Analytica is a prime example of a PR nightmare with similar data.

I don’t even need to look at Reddit’s terms and conditions to know that there is practically nothing stopping them from legally handing over this kind of data for anybody who hasn’t submitted GDPR deletion requests. I also never trust compliance with laws that cannot be verified independently, because I’ve seen all kinds of shady shit in my career.

@towerful@programming.dev · 1 year ago

Where does it say they have access to PII?
I would imagine Reddit would be anonymising the data: hashes of usernames (and any matches of usernames in content), and post/comment content with upvote/downvote counts. I would hope they are also screening content for PII.
I don’t think the deal is for PII, just for training data.
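
To make the idea concrete, here is a rough Python sketch of that kind of anonymisation. Nothing is publicly known about what either company actually does, so treat everything here as hypothetical: the record fields (`author`, `body`, `score`), the `salt`, and the function names are all invented for the example.

```python
import hashlib
import re


def pseudonym(username, salt):
    """Map a username to a stable label that does not show the name directly.

    The salt is a hypothetical secret kept by whoever prepares the data.
    """
    digest = hashlib.sha256((salt + username.lower()).encode("utf-8")).hexdigest()
    return "user_" + digest[:12]


def anonymise_record(record, known_usernames, salt):
    """Hash the author and any known usernames mentioned in the body."""
    body = record["body"]
    # Longest names first so "bob-smith" is handled before "bob".
    for name in sorted(known_usernames, key=len, reverse=True):
        pattern = r"\b" + re.escape(name) + r"\b"
        body = re.sub(
            pattern,
            lambda _m, n=name: pseudonym(n, salt),
            body,
            flags=re.IGNORECASE,
        )
    return {
        "author": pseudonym(record["author"], salt),
        "body": body,
        "score": record["score"],  # vote counts pass through unchanged
    }


record = {"author": "Alice", "body": "thanks alice, and hi Bob!", "score": 3}
print(anonymise_record(record, {"Alice", "Bob"}, "pepper"))
```

This only catches usernames that are already known and spelled out; it does nothing about real names, addresses, or other PII that people type into the comment text itself, which is why separate screening would still be needed.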

@just_change_it@lemmy.world · 1 year ago

Where does it say they have access to PII?

So technically they haven’t sold any PII if all they do is provide IP addresses. Legally an IP address is not PII. Google knows all our IP addresses if we have an account with them or interact with them in certain ways. Sure, some people aren’t trackable but I’m just going to call it out that for all intents and purposes basically everyone is tracked by Google.

Only the most security paranoid individuals would be anonymous.

@towerful@programming.dev · 1 year ago

Depends where and how it’s applied.
Under GDPR, IP addresses are essential to the operation of websites and security, so the logging/processing of them can be suitably justified without requiring consent (just disclosure).
Under CCPA, it seems like it isn’t PII if it can’t be linked to a person/household.

However, an IP address isn’t needed as part of AI training data, and alongside comment/post data it could potentially identify a person/household. So it seems risky under GDPR and CCPA.

I think Reddit would be risking huge legal exposure if they included IP addresses in the data set.
And I don’t think Google would accept a data set that includes information like that due to the legal exposure.

@just_change_it@lemmy.world · 1 year ago

ML can be applied in a great number of ways. One such way could be content moderation, especially detecting people who use alternate accounts to reply to their own content or manipulate votes etc.

By including IP addresses with the comments they could correlate who said what where and better learn how to detect similar posting styles despite deliberate attempts to appear to be someone else.

It’s a legitimate use case. Not sure about the legality… but I doubt google or reddit would ever acknowledge what data is included unless they believed liability was minimal. So far they haven’t acknowledged anything beyond the deal existing afaik.

@towerful@programming.dev · 1 year ago

Yeah, but it’s such a grey area.
If the result was for security only, it could potentially pass as “essential” processing.
But considering the scope of content posted on Reddit (under-18s, details of medical (even criminal) matters), it becomes significantly harder to justify processing that data alongside PII (or equivalent).
Especially since it’s a change to the terms of service agreements (passing data to 3rd party processors).

If security moderation is what they want in exchange for the data (and money), it’s more likely that Reddit would include one-way anonymised PII (i.e. IP addresses that are hashed), so only Reddit can recover/confirm IP addresses against the model.
Because if they aren’t… then they (and Google) are gonna get FUCKED in EU courts.
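
As a rough illustration of what “one-way anonymised” IP addresses could look like, here is a small Python sketch using a keyed hash (HMAC) from the standard library. This is an assumption about how such a scheme might work, not a description of any real pipeline; `secret_key` and the function names are hypothetical. The key matters because there are only about 4 billion IPv4 addresses, so a plain unkeyed hash could be reversed by simply trying them all.

```python
import hashlib
import hmac


def hash_ip(ip, secret_key):
    """Return a keyed one-way token for an IP address.

    Without secret_key, the token cannot be checked against candidate IPs.
    """
    return hmac.new(secret_key, ip.encode("utf-8"), hashlib.sha256).hexdigest()


def matches_ip(candidate_ip, token, secret_key):
    """Let the key holder confirm whether a token belongs to a given IP."""
    return hmac.compare_digest(hash_ip(candidate_ip, secret_key), token)


key = b"known-only-to-the-platform"     # hypothetical secret
token = hash_ip("203.0.113.7", key)     # documentation-range example address
print(matches_ip("203.0.113.7", token, key))
print(matches_ip("203.0.113.8", token, key))
```

The same IP always maps to the same token, so whoever receives the data can still tell that two comments came from one address; they just cannot turn the token back into the address without the key.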

@thejml@lemm.ee · 1 year ago

I can’t wait for Gemini to point out that in 1998, The Undertaker threw Mankind off Hell In A Cell, and plummeted 16 ft through an announcer’s table.

That would be a perfect 5/7.

@Astrealix@lemmy.world · 1 year ago

One thing I miss on Lemmy is shittymorph tbf

AnonStoleMyPants · 1 year ago

Also all the artists that made comics from posts and responded with only pictures. There were a few of them and they were always amazing.

And Andromeda321 for anything space.

And Poem_for_your_sprog.

And probably many others!

Good times.

@casmael@lemm.ee · 1 year ago

Yeah there were some really classic folks. Remember the unidan drama?

@TheGreenGolem@lemmy.dbzer0.com · 1 year ago

Or who simply communicated with more comics in the comments, like SrGrafo.

@NegativeInf@lemmy.world · 1 year ago

Be the shittymorph you wish to see in the Lemmy.

@AtariDump@lemmy.world · 1 year ago

There’s only one, and it’s not that guy.

the post of tom joad · 1 year ago

I’m just not that good a writer.

@NegativeInf@lemmy.world · 1 year ago

It’s shittymorph, not Dostoyevsky.

@Kaput@lemmy.world · 1 year ago

ChatGPT is aware of the event… if you ask about it.

@EdibleFriend@lemmy.world · 1 year ago

I hope it starts a religion based on the second coming of that dude’s dead wife.

@Mediocre_Bard@lemmy.world · 1 year ago

I would also worship this guy’s wife.

@where_am_i@sh.itjust.works · 1 year ago

I wonder if the resulting model will be as easy to trigger into unhinged 3-paragraph rants only loosely related to the query. Good luck, Google engineers!

@AdamEatsAss@lemmy.world · 1 year ago

It’ll probably just respond to every prompt with “this”

@OpenStars@startrek.website · 1 year ago

No, there’s a lot more variety now that the bots have taken over.:-)

@Docus@lemmy.world · 1 year ago

Came here to say this…

@meco03211@lemmy.world · 1 year ago

This.

This with rice? 5/7

@WldFyre@lemm.ee · 1 year ago

A perfect score!

kingthrillgore · 1 year ago

You telling me this fried this rice?

@BossDj@lemm.ee · 1 year ago

7/10

@paf0@lemmy.world · 1 year ago

By this logic Llama should be ranting like our drunk uncles on Facebook. It doesn’t though, just like Gemini won’t from Reddit content.

@n3m37h@lemmy.dbzer0.com · 1 year ago

Is it time to go back to Reddit and post the stupidest shit possible? For science, of course.

@frostysauce@lemmy.world · edit-2 1 year ago

I did that for 13 years already…

@madcaesar@lemmy.world · 1 year ago

No thanks. I’m done with that shithole.