Reddit sells training data to unnamed AI company ahead of IPO

VITecNet@programming.dev · 9 months ago

humanbroadcast@lemmy.world · 9 months ago

Companies are in the fuck-around phase, and we’ll all have to live in the find-out era.

thesystemisdown@lemmy.world · 9 months ago

Meanwhile, the masses are still using all the ‘services’ because they all have momentum. I’m not confident any of them can do anything bad enough to chase off their users.

wise_pancake@lemmy.ca · 9 months ago

I thought Facebook would die with all the scandals, but I’m the only person in my life who cared. I deleted Twitter before it became X, and I’m the only one I know who did that.

I don’t think anyone gives a shit and it’s made me hate people a lot more than I used to.

gregorum@lemm.ee · 9 months ago

Most people I know left Facebook and Twitter years ago. Maybe it’s just a difference in people we associate with?

Imgonnatrythis@sh.itjust.works · 9 months ago

I already left reddit because they did bad things. Assume you mean chase off a critical mass though? The fact that “X” is still a thing may prove you correct.

gregorum@lemm.ee · 9 months ago

Huge numbers of people have already left X. Those that remain are mostly bots and neo-Nazis.

tutus@links.hackliberty.org · 9 months ago

However much we want that to be true, it’s simply not. Sadly.

jmanes@lemmy.world · 9 months ago

Yeah, they’re not leaving. The only way they would leave is if the service were to be physically shut down. Pretty sure you could make everyone watch 1 minute long ads on app open and they would still stay.

millifoo@lemmy.world · 9 months ago

I spent a chunk of this afternoon nuking my old reddit posts. Thousands and thousands of posts… thank goodness for shreddit.

WallEx@feddit.de · 9 months ago

I did that before they went through with their API bullshit. I’m so happy: it was fully automated. I just typed in the replacement message and that was it.

guacupado@lemmy.world · edit-2 9 months ago

I didn’t know about this. I just went into it.

edit: lol you have to pay for Shreddit. Never mind, I don’t care about deleting my posts that much.

millifoo@lemmy.world · 9 months ago

No, not the website: the git project.

If you want a web app, try redact.dev. Yes, there’s a paid version where you can download your old messages, but the free one still wipes out your posts (with random text).

NightAuthor@lemmy.world · 9 months ago

Most tools miss a ton because of the limitations of the website and API. The best, pretty much only, way to get everything is to get an export of your data, then use that CSV to delete all items one by one.
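
A minimal sketch of that export-driven approach. It assumes the export includes a comments CSV with `id` and `subreddit` columns (an assumption about the file layout, not something confirmed in this thread). The `overwrite` and `delete` callbacks are hypothetical stand-ins for whatever API client actually edits and removes each item. This is not how shreddit itself is implemented.

```python
import csv
import io
from typing import Callable, Iterable, Iterator


def iter_export_rows(lines: Iterable[str], skip_subreddits: Iterable[str] = ()) -> Iterator[dict]:
    """Yield one dict per exported item, skipping any excluded subreddits."""
    skip = {name.lower() for name in skip_subreddits}
    for row in csv.DictReader(lines):
        # "id" and "subreddit" are assumed column names for the export file.
        if row["subreddit"].lower() in skip:
            continue
        yield row


def wipe_items(
    rows: Iterable[dict],
    overwrite: Callable[[str, str], None],
    delete: Callable[[str], None],
    replacement: str = "[overwritten]",
) -> int:
    """Overwrite, then delete, every item one by one; return how many were processed."""
    done = 0
    for row in rows:
        overwrite(row["id"], replacement)  # hypothetical: edit the item's text first
        delete(row["id"])                  # hypothetical: then remove the item
        done += 1
    return done


if __name__ == "__main__":
    # Tiny in-memory stand-in for the exported CSV.
    sample = io.StringIO("id,subreddit,body\nabc1,programming,hello\nabc2,keepthis,stay\n")
    log = []
    n = wipe_items(
        iter_export_rows(sample, skip_subreddits=["KeepThis"]),
        overwrite=lambda item_id, text: log.append(("edit", item_id)),
        delete=lambda item_id: log.append(("delete", item_id)),
    )
    print(n, log)
```

Driving the loop from the CSV rather than from the profile listing is the point: the file lists every item, so nothing depends on how far back the site or API will page.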

Hubi@feddit.de · 9 months ago

Yup, shreddit has the ability to use the CSV from the data request. Took me about 24 hours to edit and erase the 20,000+ comments I made over the last 10 years.

x4740N@lemmy.world · 9 months ago

Does shreddit have the ability to exclude certain subreddits? I want to keep my comments on one subreddit but overwrite the rest of them.

Aopen@discuss.tchncs.de · 9 months ago

Isn’t this copyright infringement? Does Reddit’s policy allow them to do this?

Lexi Sneptaur@pawb.social · 9 months ago

Very glad I overwrote all of my comments with random words before deleting my account. They won’t be profiting off of me anymore.

FaceDeer@kbin.social · 9 months ago

Instead you’re posting to the Fediverse, which is even more open for use by third parties.

Lexi Sneptaur@pawb.social · 9 months ago

It’s obscure enough that I don’t think it’s being sought out by AI companies. The nature of federated instances should make it a bit more challenging to pull a complete data set too.

FaceDeer@kbin.social · 9 months ago

Not so obscure that Meta isn’t paying attention and planning for interoperation, and Meta is one of the biggest players in the AI development field.

A complete data set isn’t required, just a comprehensive one.

foggy@lemmy.world · 9 months ago

Really trying to meddle in elections like it’s 2016

FaceDeer@kbin.social · 9 months ago

I don’t see anything in the article related to elections.

ArbiterXero@lemmy.world · 9 months ago

No, but there’s BIG money in AI astroturfing for elections.

pop@lemmy.ml · 9 months ago

Good. Get a good taste of your own medicine.

It’s time people in the US felt what it’s like to have foreign entities meddling with their elections for a change, huh?

grabs popcorn

Hubi@feddit.de · 9 months ago

It’s been happening for the better part of a decade though. And started probably much earlier than that, just not as blatant.

SeedyOne@lemm.ee · 9 months ago

You’re naive to think it’s just starting now.

PizzaFacia@lemmy.world · 9 months ago

$60mm a year seems really cheap, no? I know it’s shit data from the bot posters, but I still would have thought it would be like $100-150mm.

ilinamorato@lemmy.world · edit-2 9 months ago

It’s ludicrously cheap for the size and quality of the dataset. A set of 829 academic papers at University of Michigan is priced at $25,000—about 1/2400 of this sale. If you were to scale that dollar value to the size of the Reddit dataset, you’d expect it to contain about 2 million academic papers’ worth of data.

But Reddit has almost two decades of text written by 200 million chronically-online people. And sure, probably most Reddit users don’t write an academic paper’s worth of content every year; but the average is probably closer to that than not, especially when you consider that some of those subreddits like AskHistorians and AskScience really are generating the equivalent of dozens of academic papers per day. Just based on the amount of text alone, Reddit should’ve sold us out for 50-100x what they got for just a single year of data, and 1000-2000x for the full twenty years (though, granted, they didn’t have that much data for that entire time, so let’s say half that).

Furthermore, those 829 papers in the U of M dataset are disconnected, unlinked text representing a tiny fraction of what U of M’s 50,000 students generate in even a single year. Reddit has data with links, images, conversational responses, prompt responses, Q&As, flash fiction, slash fiction, historical deep-dives, investigations, memes, inside jokes, a development of style and consensus over time, and a comprehensive understanding of what it means to interact online, generated by people around the world over the course of 18 years. It’s much better data for almost any LLM purpose that isn’t just writing academic papers from the perspective of students at a medium size 4-year undergrad institution in the Midwestern US. The quality of the dataset should’ve made the value even higher. It’s hard to say exactly how much higher, but let’s just be extremely conservative and say it should have doubled the total.

That means that, conservatively, the value of Reddit’s dataset—or, rather, our dataset, which Reddit freebooted from us—was about 1000x what they were paid, based on the proportional value of the U of M dataset.
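
For anyone who wants to check the arithmetic, here is a rough sketch of the estimate above. The prices, paper count, user count, 20-year span, halving, and 2x quality factor all come from this comment; the `papers_per_user_per_year` parameter is an assumption added to show how the 50-100x range for a single year arises (half a paper to one paper per user per year).

```python
PAPERS_IN_SAMPLE = 829          # papers in the University of Michigan dataset
SAMPLE_PRICE_USD = 25_000       # price of that dataset
DEAL_PRICE_USD = 60_000_000     # reported annual price of the Reddit deal
USERS = 200_000_000             # Reddit users, per the comment
YEARS = 20                      # "the full twenty years"
GROWTH_DISCOUNT = 0.5           # "let's say half that"
QUALITY_FACTOR = 2              # "doubled the total"

# How many times larger the deal is than the sample price.
price_ratio = DEAL_PRICE_USD / SAMPLE_PRICE_USD

# Papers' worth of data you would expect at the sample's price per paper.
papers_equivalent = PAPERS_IN_SAMPLE * price_ratio


def one_year_multiple(papers_per_user_per_year: float) -> float:
    """How many times more text one year of Reddit holds than the price implies."""
    return USERS * papers_per_user_per_year / papers_equivalent


def full_history_multiple(papers_per_user_per_year: float) -> float:
    """Scale one year to the full history, then apply the discount and quality factor."""
    return one_year_multiple(papers_per_user_per_year) * YEARS * GROWTH_DISCOUNT * QUALITY_FACTOR


if __name__ == "__main__":
    print(price_ratio)               # 2400.0
    print(papers_equivalent)         # 1989600.0
    for rate in (0.5, 1.0):          # assumed range of papers per user per year
        multiple = full_history_multiple(rate)
        print(rate, round(multiple), round(multiple * DEAL_PRICE_USD / 1e9, 1))
```

At the low end of that assumed range the multiple comes out at roughly 1000x, which matches the conservative figure above.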

They should’ve sold us out for billions.

Of course, we don’t know anything about what exclusivity terms or subset of data they might have included in this deal. It might only be one year of data, and only 6 months of exclusivity. But assuming they sold the rights to the entire dataset, we got sold for pennies.