When you're asking AI chatbots for answers, they're data-mining you

146 points 71 comments 9 hours ago

roscas

Always good to remember people of this.

But not just AI bots or interfaces. Everything is saved and never deleted.

Remember Facebook? "We will never delete anything": that is their business.

So anything that you put on those "services" is out of your hands. But we still have an option: stop using these ad companies and let them die.

Back to AI, there are loads of offline models we can use. Many like Ollama that will even download it. Install Ollama, on the ollama site find a model name and "ollama run model-name" and you can use it.

OK, it is not as good as ChatGPT 5, but it can help you so much that you might not even need ChatGPT.
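For anyone who would rather script this than type into the CLI, here is a minimal sketch. It assumes Ollama is already running on the same machine, listening on its default port (11434), and that the model has already been downloaded. The function names are made up for illustration, and only the Python standard library is used, so the prompt never leaves localhost.

```python
import json
import urllib.request

# Assumption: Ollama is running locally on its default port.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(model: str, prompt: str) -> urllib.request.Request:
    """Build a non-streaming generate request for a local Ollama server."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask_local(model: str, prompt: str, timeout: float = 120.0) -> str:
    """Send the prompt to the local server and return the generated text."""
    with urllib.request.urlopen(build_request(model, prompt), timeout=timeout) as resp:
        body = json.loads(resp.read().decode("utf-8"))
    return body["response"]
```

Calling ask_local("model-name", "Why is the sky blue?") with the model name you used for "ollama run" should return the answer as a string. It will raise a connection error if the server is not running.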

Phemist

Indeed, and asking Facebook to delete the data or to not use it for AI training is just another data point indicating you care about it. Your preferences will eventually be stripped through redesigns, refactors, careless usage or Facebook's crooked idea of consent. The data will remain and be used again.

lowwave

It is better to NOT delete facebook, but spam your profile with other data and just leave it.

everybodyknows

This, BTW, is the only way, last I checked, to at all obfuscate Zillow's listing photos of the inside of a house that you have since bought. No multi-delete.

Phemist

Maybe, but that depends on Facebook's ability to filter that data. The filtering should be easy for my inactive-for-10-years FB account that suddenly uploads a bunch of garbage data. Mixing in genuine data seems antithetical, especially considering the garbage may be filtered out.

kibwen

Ironically, this is a completely uncontroversial use case where AI excels.

actionfromafar

And/or change friends to random spam accounts first, then unfriend your real friends.

Sophira

There are also things like Oobabooga's text-generation-webui[0] which can present a similar interface to ChatGPT for local models.

I've had great success in running Qwen3-8B-GGUF[1] on my RTX 2070 SUPER (8GB VRAM) using Oobabooga (everyone just calls it via the author's name, it's much catchier) so this is definitely doable on consumer hardware. Specifically, I run the Q4_K_M model as Oobabooga loads all of its layers into the GPU by default, making it nice and snappy. (Testing has shown that I can actually load up to the Q6_K model before some layers have to be loaded into the CPU, but I have to manually specify that all those layers should be loaded into the GPU, as opposed to leaving it auto-determined.)

It does obviously hallucinate more often than ChatGPT does, so care should be taken. That said, it's really nice to have something local.

There's a subreddit for running text gen models locally that people might be interested in: https://www.reddit.com/r/LocalLlama

[0] https://github.com/oobabooga/text-generation-webui

[1] https://huggingface.co/Qwen/Qwen3-8B-GGUF
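A back-of-the-envelope check of the VRAM numbers above, as a rough sketch only. The bits-per-weight figures are approximate values for these GGUF quantization types, the parameter count is an approximation for an "8B" model, and the 1 GB overhead for KV cache and buffers is a guess. Real usage depends on context length and the runtime.

```python
# Approximate bits per weight for some GGUF quantization types.
# Assumption: rough figures; actual file sizes vary by architecture.
APPROX_BITS_PER_WEIGHT = {"Q4_K_M": 4.85, "Q6_K": 6.56, "Q8_0": 8.5}

# Assumption: approximate parameter count for an "8B" model.
N_PARAMS = 8.2e9


def weights_gb(n_params: float, quant: str) -> float:
    """Estimate the size of the quantized weights in GB (10^9 bytes)."""
    return n_params * APPROX_BITS_PER_WEIGHT[quant] / 8 / 1e9


def fits(n_params: float, quant: str, vram_gb: float, overhead_gb: float = 1.0) -> bool:
    """True if the weights plus a guessed overhead fit in the given VRAM."""
    return weights_gb(n_params, quant) + overhead_gb <= vram_gb


for quant in APPROX_BITS_PER_WEIGHT:
    size = weights_gb(N_PARAMS, quant)
    print(f"{quant}: ~{size:.1f} GB of weights, fits in 8 GB: {fits(N_PARAMS, quant, 8.0)}")
```

Under these assumptions Q4_K_M and Q6_K fit on an 8 GB card and Q8_0 does not, which is at least consistent with Q6_K being the largest quant that loads fully into the GPU.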

dylan604

Facebook doesn't just get data from direct input from users, though. So while people quitting FB is a good first step, it does not stop the firehose of data.

2d520075

It would be more apt if this was a "Concerned Citizens of <city-name>" facebook group, not ycombinator's Hackernews.

If you are here and you require this reminder I would like to think that you are very lost.

throwaway29246

> Back to AI, there are loads of offline models we can use. Many like Ollama that will even download it. Install Ollama, on the ollama site find a model name and "ollama run model-name" and you can use it.

A privilege that is limited to the top 1%. It may come as a surprise, but most people don't have 32GB of VRAM [0]. The rest of us with normal people hardware are stuck with AI cloud providers or good old searching, which is a lot harder now that those same AI providers have ruined search results.

[0] There are some lightweight models you can run on normal people hardware, but they are just too unreliable even for casual usage and are likely to waste more of your time than they save.

lm28469

That's why you should use multiple accounts and bullshit about 30% of what you post. LLMs are a godsend for that; they poison their own well.

SoftTalker

I assume that companies like Facebook know pretty well which accounts are really the same person. Even if you are careful about keeping cookies in separate browser profiles, your machine can be fingerprinted, your posting habits and writing style can be fingerprinted, and Facebook/Google have the resources to do it.

mgh2

The risk is the externalities for actual users who don't know the difference and are affected by your 30% BS.

BolexNOLA

I recently set up LM Studio and have run OpenAI's 20B model locally using an AMD 9070 + 9800x3d. I honestly assumed it would be way more work than it was to set it up. It has limitations, but given it took me all of 5min and I can easily attach docs for it to reference as it all runs locally...it's fantastic. I've got a Claude model I've been messing with too.

notpushkin

> Always good to remember people of this.

You mean “remind”?

NearAP

I refused to use ChatGPT until they created the public version that you could use without signing up.

I later started using Gemini but I use it without signing in to try to ensure my privacy.

I recently came across this App [0] and I've been trying/using it. I end up going back to Gemini if what I need is quite complicated but it's not that common these days.

[0] https://ai.nocommandline.com

glitchc

Everyone knows this. Every layperson I talk to is aware that these companies are siphoning their information. When free email was introduced over two decades ago, the behaviour was the same. Everyone knew Microsoft and Google could read your emails. Then, like now, people think it's worth it. It is too useful a tool to have and the price is palatable.

What people don't want to do is sign up for yet another subscription. There's immense subscription fatigue among the general population, especially in tough economic times such as now.

rafark

Agreed. Not only do I think it’s worth it, I actually like that I can contribute. I’m getting so much good value for free I think it’s fair. It’s a win-win situation. The AIs get better and I get better answers.

random3

This is a funny take. I love your optimism, but it's so extremely naive, it should have a name.

rafark

It’s not naive. The value these ai chatbots provide to me is extremely high.

I’ve been writing code for many years, but one of the areas I wanted to improve was debugging. I’ve always printed variables, but last month I decided to start using a debugger instead of logging to the console. For the past few weeks I’ve only been using breakpoints and the resume program function, because the step-into, step-over, and step-out functions have always been confusing to me. An hour ago I sent Gemini images of my debugger and explained my problem, and it told me what to do step by step and explained what the step-* functions did (I sent it a new screenshot after each step and told it to explain to me what was going on).

I now have a much better understanding of how debuggers work thanks to Gemini.

I’m fine with google getting my data, the value I just got was immense.
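For readers who find the step functions equally confusing, here is a small sketch of what they do, using Python's built-in pdb as a stand-in for whichever debugger is in use. The two functions are made up for illustration. The pdb commands named in the comments (step, next, return, continue) are the standard ones.

```python
def square(x):
    result = x * x
    return result


def sum_of_squares(values):
    total = 0
    for v in values:
        # Paused on the next line, the debugger commands behave like this:
        #   step (s):     step INTO square() and stop on its first line
        #   next (n):     step OVER the call, running square() to completion
        #   return (r):   step OUT, running until the current function returns
        #   continue (c): resume until the next breakpoint
        total += square(v)
    return total


# To try it, add breakpoint() as the first line of sum_of_squares and call
# sum_of_squares([1, 2, 3]) from a terminal session.
```

The same three ideas (into, over, out) map onto the buttons in most graphical debuggers, even though the names and shortcuts differ.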

random3

I got that from your first post. As in any game-theoretic context, a win-win is only possible in a non-zero-sum game with a relatively balanced benefit. It's clear that you can see the value you get and maybe even quantify it. However, you can't quantify the other side, nor the degree to which its win will affect your win in the relatively short term (a few years tops).

Two things come to mind

The less relevant one is that, as a coder, once there's a good enough model (good enough = benefit/cost), your "win" will go to 0. And your contribution to what will make that win 0 is going to be non-0, but you're not going to get anything for it.

The more relevant one, longer term, is that you may end up being predictable: a good model of you will be able to extract value out of you personally forever, again without anything for you to gain.

Both may be argued against, or that they are unavoidable, regardless. But in either case, your "price point" has been arbitrarily chosen, at least from your perspective. I.e. it's not an informed choice on your end. A bit like the Monty Hall problem, you chose a door with little information. The act of sticking to the door you chose is why you're naive.

smjburton

> The more data you give any of the AI services, the more that information can potentially be used against you.

It may seem obvious, but Sam Altman also recently emphasized that the information you share with ChatGPT is not confidential, and could potentially be used against you in court.

[1] https://www.pcmag.com/news/altman-your-chatgpt-conversations...

[2] https://techcrunch.com/2025/07/25/sam-altman-warns-theres-no...

djeastm

Hasn't that always been the case? Phone companies providing records of calls and text messages, etc? Anything stored on someone else's servers is going to be something they have a duty to provide to police/courts, assuming they fall under that jurisdiction.

Jalad

This is always true though. Any data that a cloud company has on you can be subpoenaed.

It would be weird for him not to be transparent about that

Kim_Bruning

Earlier discussion on the "ChatGPT chats in google" angle:

https://news.ycombinator.com/item?id=44778764

Interesting how much traction

     "[x] Make this chat discoverable (allows it to be shown in web searches)"

gets in news articles.

People don't seem to have the same intuition for the web that they used to!

ceroxylon

What about the people who did not opt to share or index their chats, and the companies that claim to not train on user chats?

https://privacy.anthropic.com/en/articles/10023555-how-do-yo...

> We do not actively set out to collect personal data to train our models

The 'snarky tech guy' tone of the article is a bit like nails on a chalkboard.

hazKu4

(At least to me) that language doesn’t feel particularly reassuring… especially given the duplicitous nature of data collection - i.e. “we don’t sell your data” translates to “we create a sophisticated advertising profile about you, and monetize that”

boesboes

That line is about data they find on the internet. Soooo completely not relevant.

falcor84

> So, kids, let's not be asking any AI chatbot whether you should divorce your husband, how to cheat on your taxes, or if you should try to get your boss fired. That information will be kept, it may be revealed in a security breach, and, if so, it will come back to bite you in the buns.

Just as a PSA - there's nothing unique to AIs here - whenever you ask a question of anyone, in any way, they then have the memory of you having asked it. A lot of sitcoms and comedic plays build their plot on a question that a person voiced eventually reaching (either accurately or inaccurately) the person they were hiding the question from.

And as someone who's into spy stories, I know that a big part of tradecraft is formulating your questions in a way that divulges the least about your actual intentions and current information.

If anything, LLM-driven AIs are the first technology that in principle allows you to ask a complex question that would be immediately forgotten. The thing is that you need to be running the AI yourself; if you ask an AI controlled by another entity, then you're trusting that entity with your question, regardless of whether there's an AI on the way.

frakt0x90

Books are also a technology that allows you to answer complex questions without recording the question.

Jalad

Not necessarily though, it depends on where you got the book from (Amazon, the library?), and what your question is

shadowgovt

In general, libraries actually do go out of their way to minimize the ways circulation history can be used against card-holders.

This isn't airtight, but it's a point of principle for most libraries and librarians and they've gone to the mat over this. https://www.newtactics.org/tactics/protecting-right-privacy-...

Theodores

This was a surprisingly big thing back in the early 2000s with The War Against Terror. I think that it was mostly for reasons of 'chilling effect', but the media made everyone aware that the Department of Homeland Security were paying attention to what books people took out of the library.

What was curious about this was that, at the time, there were few dangerous books in libraries. Catcher in the Rye and 1984 were about it. You wouldn't find a large print copy of Che Guevara's Guerrilla Warfare, for instance.

I disagree about how libraries minimise the risk of anyone knowing who is reading what. On the web where so much is tracked by low intelligence marketing people, there is more data than anything that anyone can deal with. In effect, nobody is able to follow you that easily, only machines, with data that humans can't make sense of.

Meanwhile, libraries have had really good IT systems for decades, with everything tracked in a meaningful form with easy lookups. These systems are state owned, therefore it is no problem for a three letter agency to get the information they want from a library.

shadowgovt

Libraries don't tend to have consolidated, centralized IT. As a result, TLAs have to actually make subpoenas to the databanks maintained by individual, regional library groups, and The ALA offers guidelines on how to respond to those (https://www.ala.org/advocacy/privacy/lawenforcement/guidelin...).

This, of course, doesn't mean your information is irretrievable by TLAs. But the premise of "tap every library to bypass the legal protections against data harvesting" is much trickier when applied to libraries than when applied to, say, Google. They also aren't meaningfully "state-owned" any more than the local Elk's Club is state-owned; the vast majority of libraries are, at most, a county organ, and it is the particular and peculiar style of governance in the United States that when the Feds come knocking on a county's door, they can also tell them to come back with a warrant. That's if the library is actually government-affiliated at all; many are in fact private organizations that were created by wealthy donors at some point in the past (New York Public Library and the Carnegie Library System are two such examples).

Many libraries also purposefully discard circulation data so as to minimize the surface area of what can be subpoena'd. New York Public Library for example, as a matter of policy, purges the circulation data tied to a person's account soon after each loaned item is returned (https://www.nypl.org/help/about-nypl/legal-notices/privacy-p...).

y0eswddl

The questions you ask friends and the info you share with them don't end up in a massive data profile on you stored in somebody's cloud to be used for future manipulation/marketing/profiling...

3-cheese-sundae

They do, if they're asked over one of the many popular non-secure chat platforms.

I feel like most people don't wait until their friends are in the room to ask them questions or exchange info.

avmich

I have an issue with the "stupidity" suggestion. Clicking "Agree" without full analysis is a tried-and-true Internet tradition; it's so sad somebody assumes it's serious and attempts to use it. We should have legal protections against wringing quasi-agreements from customers and then using them against those customers.

makeworld

Notably, Anthropic does not do this with Claude.

https://docs.anthropic.com/en/docs/claude-code/data-usage

nachox999

We need a tool that creates random fake data for the data-mining web apps.

Qem

I never interacted with the AI Meta bundled with WhatsApp, fearing this.

tietjens

I’m pretty certain just using WhatsApp is enough.

jdthedisciple

From what I know, only people who DELIBERATELY SHARED their chats and IGNORED THE WARNING that it makes them public had their chats appear in search engine results.

Which makes this article quite misleading.

nottorp

> "How to Use a Microwave Without Summoning Satan,"

Oh, nice idea. We should all ask that.

feydaykyn

If you wonder how the LLMs answer, here are shortened answers from Claude and ChatGPT:

# Claude

Here are some practical tips for safe microwave use that should keep your kitchen demon-free:

Basic Safety Guidelines:

- Use microwave-safe containers only (glass, ceramic, or microwave-safe plastic)
- Avoid metal objects, aluminum foil, or containers with metallic trim
- Pierce foods with skins (potatoes, hot dogs, tomatoes) to prevent explosive results

(...) Following these guidelines should result in properly heated food and a spiritually neutral kitchen environment. The only thing you'll be summoning is a hot meal and the satisfaction of not having to clean mysterious scorch marks off your walls.

----

# chatgpt 5

Alright, I love the energy of this question — let’s break it down into a practical + slightly tongue-in-cheek guide.

How to Use a Microwave Without Summoning Satan

1. Read the Sacred Text (a.k.a. the Manual)

The microwave manual contains all the spells… I mean, instructions… for safe operation.

(...)

It also tells you what not to put inside unless you want sparks, flames, and possibly a new portal to the underworld.

Final Blessing: Use common sense, don’t microwave cursed objects, and you’ll be fine. The microwave is a tool of convenience, not a summoning circle.

Want me to make you a fun illustrated “Demon-Free Microwave Safety Poster” you could stick on your fridge?

mystraline

Wait, you can summon Satan with a microwave?!

Lemme ask ShatGPT how to do that!

unethical_ban

Duck.ai claims to anonymize AI chats and says its conversations are not used for training. It is my go-to for casual usage.

Otherwise, I use local models for complex or potentially controversial questions.

thisisit

If you ask a layperson, the answer is "Yes, and?". If it's free, very few people care. Sure, you can run a local instance, and yes, it might be as simple as downloading Ollama, but not many will do it or even have a powerful enough computer to run it.

Worse yet, you might individually make a choice to do that, but others might not care. They might feed their emails/chats with you to a chatbot to parse them or "make it think like them", and then the chatbot has info about you. So, as much as I understand this sentiment, this seems like a losing battle.

dialup_sounds

Why should they care?

shadowgovt

This is also true of search engines, social media, and various other interactive systems. Google's initial search-algorithm breakthrough was the realization that they had a massive source of data for search result correctness in the form of the behavior of users querying their site.

In general, it's wise to assume that all web interactions are a two-way street between the user and the service provider.

akomtu

Unlike previous technologies, chatbots know what users think at the most intimate level. Chatbots know, but currently cannot make sense of this knowledge. The near-term goal, I believe, is to build simple but accurate models of the user's psyche to serve them ads better. Instead of crude labels like "user 456 loves cars", corpos will have a compact psyche model of that user that will predict his reactions with 95% accuracy. This model will know that user better than he knows himself. And for a brief moment in history, while AI is good enough to predict us, but not replace us, the adtech corpos will make bank.

andrepd

What can you do online these days without being data mined? Browsing gemini?

em3rgent0rdr

Download stuff in bulk (for instance the entire Wikipedia torrent) and then peruse it on your own computer.

Squeeeez

Only if you are not using an OS which has something like windows recall enabled, or that weird stardict build that came up recently, which does an online lookup automatically whenever you select text.

I wonder how far back this has been going on. Did ICQ, IRC server hosters, BBSes do similar things?

reactordev

No, back then storage was at a premium, so everything aside from config, accounts, and billing was ephemeral. It really wasn’t until Cloud came along that storage got cheap enough that you could keep everything. About the time of the social media boom.

It wasn’t until around 2014 that I stopped building routes that did:

    DELETE FROM <table> WHERE id = ? ON DELETE CASCADE;
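The line above reads as shorthand. In standard SQL the cascade is declared on the foreign key, not in the DELETE statement itself. A minimal sketch of that kind of hard-delete route using Python's sqlite3, with hypothetical table and column names (users, posts) that are not from any real schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite leaves foreign-key enforcement off unless it is enabled per connection.
conn.execute("PRAGMA foreign_keys = ON")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute(
    """
    CREATE TABLE posts (
        id INTEGER PRIMARY KEY,
        user_id INTEGER NOT NULL REFERENCES users(id) ON DELETE CASCADE,
        body TEXT
    )
    """
)
conn.execute("INSERT INTO users (id, name) VALUES (1, 'alice')")
conn.executemany(
    "INSERT INTO posts (user_id, body) VALUES (?, ?)",
    [(1, "first"), (1, "second")],
)
conn.commit()


def delete_user(conn: sqlite3.Connection, user_id: int) -> None:
    """Hard-delete a user; the cascade removes the dependent posts as well."""
    conn.execute("DELETE FROM users WHERE id = ?", (user_id,))
    conn.commit()


def count_posts(conn: sqlite3.Connection) -> int:
    """Return the number of rows left in the posts table."""
    return conn.execute("SELECT COUNT(*) FROM posts").fetchone()[0]
```

After delete_user(conn, 1) nothing about that user remains in either table. This is the opposite of the "soft delete" flag pattern, where rows are only marked as deleted and kept.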

timeon

> windows recall enabled

Just curious what other OS has something similar? MacOS maybe?

y0eswddl

Start w/

https://ssd.eff.org

https://privacyguides.net

boesboes

What a terrible, utter bullshit article. Full of half truths and fear mongering. smh.

AlexandrB

> fear mongering

The last 10 years of tech "innovation" is basically what the article describes but happening to other tech products[1]. So, why is this fear mongering? It's basically inevitable unless:

a. There's legislation. But I would bet on legislation for the opposite - storing chats forever - instead.

b. AI moves to on-device where users have control of their own data. Also unlikely considering how much tech loves web technologies and recurring revenue streams.

[1] https://www.cam.ac.uk/research/news/menstrual-tracking-app-d...

actionfromafar

All hail centralized cloud services?

panny

I would expect this, but it doesn't seem to be the case.

If I ask search.brave.com to give me a list of Gini coefficients for the top ten countries by GDP, it can't do it. However, if I tell it the data is available on the CIA world factbook, it can then spit that info out promptly. However, if I close the context and ask again, it hasn't learned this information and once again is unable to provide the list.

It didn't datamine me. It had no better idea where to find this information the second time I asked. This is the experience others have stated with other AIs as well. It does not seem special to brave.

Etheryte

Data mining doesn't mean the model is instantly updated, that would be prohibitively expensive at scale. It's way easier to batch your data together with a bunch of other data and use it later on. That doesn't even mean it will know where to find the information eventually since models are not one to one with their inputs, because again, size and cost.

panny

>Data mining doesn't mean the model is instantly updated

I'm not expecting instant. Even next week it won't be there. It's like how AI never learned to count how many times the letter r appears in strawberry. Like sure, now if you ask brave, it will tell you three, but that is only because that question went viral. It didn't "learn" anything, it was just hard coded for that particular answer. Ask it how many times the letter l appears in smallville and it will get it wrong again.

simgt

I didn't think for a second you could be right, so I tried with Claude. L in smallville was correct, then it suggests it'd have gotten l in parallel wrong by answering 3 instead of 2 (but gets it right in a new chat). Then it suggests it'd get n in millennium wrong by giving the right answer, and gets it wrong in a new chat. https://claude.ai/share/93b46c3b-23a7-40ad-8a2b-ec2ed6c34a19

Thanks, that was enlightening.

t0md4n

It wouldn’t be instant, next week or even next month. Pre-training doesn’t happen that frequently and varies between model providers. As for the strawberry test, this is a tokenization issue that is fundamental to LLMs. However, most models can now solve this type of question using thinking/code/tools to count the letters.

https://imgur.com/a/NqIJEx6
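To make the "use code to count the letters" point concrete, this is roughly the kind of throwaway snippet a model with a code tool could run instead of guessing from tokens. It is only a sketch, and the helper name is made up.

```python
def count_letter(word: str, letter: str) -> int:
    """Count case-insensitive occurrences of a single letter in a word."""
    return word.lower().count(letter.lower())


examples = [
    ("strawberry", "r"),
    ("smallville", "l"),
    ("parallel", "l"),
    ("millennium", "n"),
]
for word, letter in examples:
    print(f"{letter!r} in {word!r}: {count_letter(word, letter)}")
```

Operating on characters rather than tokens gives 3, 4, 3 and 2 for the four words above.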

Etheryte

Both OpenAI and Anthropic average roughly one flagship release a year, and these are some of the best funded companies in the space. The bigger your model, the more expensive it is to train, so you want to do it as rarely as reasonably possible. Every other company will work with smaller models and/or train even more rarely, aside from fine-tunes and customizations they put on top.

ordersofmag

LLMs aren't retrained and released on a weekly time-scale. The data mining may only be reflected in the training of the next generation of the model.

qwertytyyuu

Every week is still way too expensive to do at scale; at best they'll update training data with each model iteration.

add-sub-mul-div

Brave isn't data mining you for your benefit, they're doing it for their benefit.

panny

Likewise, I'm not teaching their AI where to find GINI coefficients for their benefit, but for mine. I'd like for their AI to learn something, if only to make my experience better. But there's no learning happening.

hluska

You’re expecting models to constantly retrain themselves based on riddles. That’s not very reasonable nor is it even economically feasible right now. At massive scale, I question whether it’s even technically feasible.