If OpenAI says 'we don't train on your data', does that mean my uploaded files are private?

It means one specific thing - the file will not be used to compute gradient updates for the next model version - and it leaves a great deal else unaddressed. The file is still stored on OpenAI's infrastructure, still readable by the abuse-monitoring system and the engineers who debug it, still subject to legal preservation orders (including the active New York Times v OpenAI order that, since mid-2025, has required OpenAI to keep deleted ChatGPT chats and API outputs indefinitely while the case is pending), and still tied to your account in ways that survive 'delete'. 'Not used for training' is one privacy property out of about eight you actually want to know about.

What is the New York Times v OpenAI preservation order and why does it matter for my files?

In May 2025, a federal magistrate judge in the Southern District of New York ordered OpenAI to preserve all ChatGPT output data and API output that would otherwise have been deleted, as part of the discovery process in the Times' copyright suit. The order applies to free, Plus, Pro, Team, and standard API users; it explicitly does not apply to ChatGPT Enterprise, Edu, or to API customers with a signed Zero Data Retention agreement. The practical effect for everyone else is that the 'delete chat' button still hides the chat from your sidebar but no longer deletes it from OpenAI's storage - it is retained under a legal hold, accessible to a small audit team, until the court lifts the order. OpenAI is appealing it. As of this writing the order is still in force.

Are Claude, Gemini, and Copilot any better here?

On training, all three of the big paid tiers (Claude Pro/Team/Enterprise, Gemini through Google Workspace, Microsoft 365 Copilot) commit not to train on customer data by default - this is genuinely consistent across the enterprise market and has been since roughly 2024. On retention, the free tiers are similar to ChatGPT free: chats are retained for service operation, abuse monitoring, and improvement of safety classifiers, with windows ranging from 30 days (Anthropic's stated default) to 18 months (Google's default consumer activity retention before 2024, now opt-in). On the legal-preservation question, only OpenAI is currently under the active NYT order; Anthropic and Google have their own pending copyright suits but no equivalent retention orders as of the publication date. The honest summary is that the enterprise contracts are credible, the free tiers are not, and the gap is widening.

What happens to a file I upload to Code Interpreter / Advanced Data Analysis?

The file is copied into a per-session Linux sandbox - a stripped-down container with Python, no network access, and a small ephemeral filesystem. The sandbox is destroyed when the session ends, which OpenAI documents as 'within a few hours of inactivity'. The file inside the sandbox is gone at that point. The copy of the file in the chat attachment, however, is not - that lives in the same chat-storage layer as the rest of the conversation, subject to the same retention, the same legal hold, and the same 'delete chat' semantics described above. The sandbox lifetime is a real technical boundary; the attachment lifetime is the one that actually matters for privacy.

Is the paid tier of ChatGPT/Claude/Gemini meaningfully more private than the free tier?

On training, yes - all three paid consumer tiers default to not training on your inputs (ChatGPT Plus/Pro since April 2023, Claude Pro since launch, Gemini Advanced since launch). On retention, mostly no - the paid consumer tiers use the same chat storage with the same windows as the free tiers. The big jump is from consumer paid to enterprise: ChatGPT Enterprise, Claude Enterprise, and Microsoft 365 Copilot all offer customer-managed retention, no training by default, SOC 2 reports, and (for ChatGPT Enterprise specifically) exemption from the NYT preservation order. If you are evaluating tools for a workplace, the consumer paid tier is not the right tier for sensitive work; the enterprise tier is.

What does the Samsung 2023 incident actually teach us?

In May 2023 Samsung banned generative AI tools on internal systems after engineers in its semiconductor division pasted source code, internal meeting notes, and a hardware test sequence into ChatGPT to ask for debugging help. The ban was not about a leak from OpenAI - there was no breach. The problem was that the company had no control over what employees were sending out, and once a chunk of confidential code is in a third-party chat log, it is sitting under that third party's retention, legal-hold, and breach-exposure policy rather than the company's. The Samsung response - block the consumer tools, deploy an enterprise tier with a no-training contract - is the same response most large companies have taken since. The lesson is not that AI tools are dangerous; it is that the difference between the consumer tier and the enterprise tier is the entire story.

What about the ChatGPT search-indexing incident in 2024?

In late 2024 a number of ChatGPT 'shared chat' links - the public URLs the share button generates - were found indexed by Google Search, surfacing conversations that the people who created the share links almost certainly did not intend to publish. OpenAI's share-link semantics had been 'anyone with the link can view', which is functionally 'anyone' once a search engine has crawled it (see our companion piece on cloud share links for the full pattern of why this happens). OpenAI added a noindex meta tag to shared chats and tightened the default in early 2025, but the indexed URLs that were already in Google's index took weeks to clear. The general rule across the industry: any 'shared with link' surface is one search-engine crawl away from being public, regardless of the URL's length.

Are AI features built into other apps - Notion AI, Slack AI, the Apple Intelligence stuff - any different?

They split into two camps. Notion AI, Slack AI, and most Microsoft 365 Copilot features are wrappers around OpenAI, Anthropic, or Google models running on the AI vendor's infrastructure, with a contract that flows your data through the wrapper vendor to the model vendor and back. The retention and training posture follows the contract chain - Slack AI on the Enterprise+ tier inherits no-training, Notion AI inherits the OpenAI enterprise terms, and so on. Apple Intelligence is genuinely different: Apple's Private Cloud Compute architecture runs the requests on Apple-controlled hardware with attested code and no persistent storage, and requests handled by the on-device models never leave the device at all. Apple's design here is the strongest published privacy posture in mainstream AI, and it is worth understanding even if you do not own Apple devices because it sets a useful bar for what other vendors could be doing and are not.

What is the practical rule for handling sensitive files - contracts, IDs, medical records, source code?

If you would be uncomfortable seeing the file in a public S3 bucket, do not put it into a consumer AI tool. The realistic options are: (1) an enterprise contract with a no-training, zero-retention or short-retention clause, audited by your legal team; (2) an on-device or self-hosted model - Ollama, LM Studio, llama.cpp for text, the local image and OCR models we ship in Privvert's tools - where the file never leaves your machine; (3) local redaction of the sensitive fields before uploading, treating the AI tool as untrusted infrastructure. For the specific case of files where the sensitive part is identifiers or metadata rather than the body, our pieces on PDF redaction, photo metadata, and what 'delete' really does cover the local-redaction workflow in detail.

AI tools and your files: what ChatGPT, Claude, and Gemini actually keep when you upload

The pattern is now reflex. You have a contract you do not fully understand, so you drop it into ChatGPT and ask for a summary. You have a spreadsheet of customer data and you want a quick chart, so you hand it to Claude. You took a photo of a whiteboard at the end of a meeting and you want the action items extracted, so you upload it to Gemini. The model reads it, gives you a useful answer, the chat scrolls up out of view, and the file - in the part of your attention you do not think about - is done. The whole interaction took less than two minutes.

The file is not done. It is sitting in a chat-storage system that belongs to a company you have a one-page consent screen with. It is readable by a small number of employees on an abuse-review team. It is, for ChatGPT users on the free, Plus, Pro, Team, and standard API tiers, currently being preserved indefinitely under a federal court order in the New York Times' copyright case against OpenAI - an order that has been in force since mid-2025 and that explicitly overrides the 'delete chat' button. It is, on every tier of every tool, subject to a breach exposure that is no different in kind from the breach exposure of any other SaaS application. And the rules differ enough between the free tier, the paid consumer tier, and the enterprise tier of the same product that 'I uploaded it to ChatGPT' does not actually tell anyone what happened to the file.

This piece walks through what each of the big AI tools - ChatGPT, Claude, Gemini, and Microsoft Copilot - actually does with a file you upload, what 'we don't train on your data' means and does not mean, the specific legal and operational carve-outs that the in-app documentation tends to skip, the incidents that show what happens when policy and reality diverge, and the practical rules for anything you would not want on the front page of a newspaper.

What 'upload' actually means inside an AI tool

It is worth pausing on the mechanics, because the privacy story only makes sense once you can see the layers. When you attach a file to a chat, three to five things happen in sequence, depending on the tool.

First, the file goes to the vendor's storage. This is a bucket on AWS, GCP, or Azure that the vendor controls, with the file keyed to your account and your specific chat. The upload happens over HTTPS, and the file at rest is encrypted by the cloud provider's default envelope encryption - this is the baseline, and it is the same baseline that any other cloud storage product uses. The encryption protects against a stolen disk; it does not protect against the vendor reading the file, because the vendor holds the keys. This is normal and is not specific to AI tools. It is the same posture Dropbox, Google Drive, OneDrive, and Slack have.

Second, the file is parsed into something the model can consume. For text files (PDF, DOCX, TXT, MD), this means extracting the text - which, for PDFs, is the same selectable-text extraction we covered in our piece on why black rectangles in PDFs do not actually redact, and which carries the same caveat that hidden layers, comments, and metadata come through. For spreadsheets, the parser walks the sheets and serialises them. For images, the parser may run OCR, may run a vision model, or both. The parsed representation is what the model sees; the original file is kept alongside it, in the chat attachment record, for the next time someone re-opens the chat.

Third, the relevant chunks of the parsed file are stuffed into the model's context window. The model has no persistent memory of the file - it sees the bytes only during the inference call - but the inference call itself runs on the vendor's GPU cluster, and the request payload (which contains the file content plus your prompt) is logged for at minimum the duration of the request, and in practice for whatever the vendor's request-log retention is. For OpenAI's API, request and response payloads are documented as retained for 30 days for abuse monitoring and then deleted - except when they are not, which brings us to the next layer.

Fourth, the chat is saved to the conversation-history layer. This is the part you can see in the sidebar. The file attachment, the full text of your prompts, the full text of every model response, the model name, the timestamp, and a session ID are all kept here. For ChatGPT consumer tiers this layer used to be governed by a '30-day default plus opt-in temporary chat' policy; since the May 2025 NYT preservation order it is governed by an indefinite legal hold instead. For Anthropic's Claude, the layer is governed by Anthropic's stated 30-day default for free and Pro consumer tiers, with longer retention for Team and Enterprise customers who configure it. For Google's Gemini through a personal account, it flows through the Google Activity controls - which have shifted between 18-month default, 3-month default, and current opt-in retention over the past two years. For Microsoft 365 Copilot, it flows through your tenant's existing Microsoft 365 retention settings, which is one of the more honest defaults in the market.

Fifth - if the tool offers it - the file may be copied into a sandbox for tools like Code Interpreter, Advanced Data Analysis, or Claude's analysis tool. The sandbox is a short-lived container, the copy in the sandbox is destroyed when the container is reaped, and the sandbox boundary is a real technical boundary. The copy in the chat-attachment layer (step four) is unaffected by what happens in the sandbox.

The privacy properties of an upload are determined almost entirely by what happens in steps two, three, and four - and step four is the long-lived one. The 'delete chat' button, when it works, operates on step four. When the legal hold is in force, it does not.
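
The interaction between 'delete chat' and a legal hold can be sketched as a toy model. This is an illustration of the semantics described above, not any vendor's actual storage code; the class names, fields, and the `legal_hold` flag are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ChatRecord:
    """One row in a hypothetical conversation-history layer (step four)."""
    chat_id: str
    attachments: list[str] = field(default_factory=list)
    visible_in_sidebar: bool = True
    purged: bool = False


class ChatStore:
    """Toy model: 'delete' always hides the chat, but only purges the
    stored copy when no legal hold is in force."""

    def __init__(self, legal_hold: bool = False) -> None:
        self.legal_hold = legal_hold
        self.records: dict[str, ChatRecord] = {}

    def save(self, chat_id: str, attachments: list[str]) -> None:
        self.records[chat_id] = ChatRecord(chat_id, list(attachments))

    def delete(self, chat_id: str) -> None:
        record = self.records[chat_id]
        record.visible_in_sidebar = False   # what the user sees change
        if not self.legal_hold:
            record.attachments.clear()      # stored copy actually removed
            record.purged = True

    def retained_files(self, chat_id: str) -> list[str]:
        """Files still held in storage, whether or not the user can see them."""
        return list(self.records[chat_id].attachments)
```

With `legal_hold=True`, calling `delete` flips `visible_in_sidebar` to `False` while `retained_files` still returns the attachment, which is the gap between what the sidebar shows and what storage holds.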

What 'we don't train on your data' actually means

This phrase is now standard on every paid tier of every major AI tool. It is also one of the most narrowly scoped privacy promises in the industry, and the gap between what it sounds like and what it commits to is large enough to be worth dismantling carefully.

What it does commit to: your inputs (the prompt) and the model's outputs (the reply) will not be used as training examples for the next iteration of the model. Specifically, your conversation will not be added to a fine-tuning dataset, will not be used as reinforcement-learning-from-human-feedback signal, and will not be used as raw text for the next pretraining run. This is a real commitment, and the major vendors have honoured it - there is no documented case of an enterprise customer's data showing up in a subsequent model's outputs.

What it does not commit to:

Storage. The conversation is still stored. The file is still stored. Not training on it is a different question from not keeping it.
Internal access. Vendor employees on trust-and-safety, abuse review, and customer support teams can still access the conversation under documented internal-access policies. These policies are real and audited, but they exist.
Safety classifiers and abuse monitoring. The conversation can still be analysed by automated systems that score it for policy violations, even if it cannot be added to the training set. These classifiers are themselves trained on data, and the rules about which conversations contribute to those classifiers' training sets are usually separate from the headline 'we don't train on your data' commitment.
Legal preservation. As the NYT order demonstrates, a 'we don't train' commitment does not exempt the data from a court-mandated legal hold.
Breach exposure. The data is still on the vendor's infrastructure. A breach of that infrastructure exposes your file the same way a breach of any cloud-storage vendor would.
Downstream contractors. The vendor's subprocessors (hyperscaler cloud providers, content-moderation contractors, analytics vendors) have whatever access their subprocessor agreement gives them.

When you read 'we don't train on your data' on a marketing page, the correct mental translation is: 'this conversation will not become part of the next model'. That is a useful property. It is not the same property as 'this conversation is private'.

ChatGPT: the tool everyone uses, the tier most people are on, and the order nobody mentions

ChatGPT is the dominant entrant, and its privacy posture is the one most people unwittingly inherit. The picture has four tiers and one very large carve-out.

On the free tier, conversations are retained for an unspecified period, are used by default to improve OpenAI's models (you can opt out in settings, under Data Controls), and are accessible to OpenAI's trust-and-safety team. Uploaded files follow the same rules as the conversation they are attached to. The 'delete chat' button removes the chat from your sidebar.

On the Plus, Pro, and Team tiers, the training default flips - your conversations are not used for training unless you opt in. Retention and access otherwise follow the consumer pattern. The delete-chat button has the same semantics.

On the Enterprise and Edu tiers, conversations are not used for training, are stored under a customer-managed retention policy (the default is 'until the customer deletes it', and admins can set automatic deletion windows), and SOC 2 reports are available. Crucially, Enterprise and Edu are the only ChatGPT tiers exempt from the NYT preservation order; on the API side, the exemption covers customers with a Zero Data Retention agreement.

On the API, the default is no training, request/response payloads are retained for 30 days for abuse monitoring, and customers can request a Zero Data Retention agreement that removes that 30-day retention entirely. ZDR is available to customers who can demonstrate a regulatory need and whose use case passes OpenAI's review.

The carve-out that overrides all of this for the consumer and standard-API tiers is the New York Times preservation order. In May 2025, Magistrate Judge Ona Wang of the Southern District of New York ordered OpenAI to preserve all ChatGPT output data and API output that would otherwise have been deleted, as part of discovery in the Times' copyright lawsuit. OpenAI has stated publicly that the order applies to free, Plus, Pro, Team, and standard API users, and that it does not apply to Enterprise, Edu, or ZDR API customers. The practical effect for everyone in the affected bucket: when you click 'delete chat', the chat is hidden from your sidebar and removed from your normal access, but a copy is retained in a separate audit-controlled storage layer, accessible to a small legal team, until the court lifts the order. OpenAI is appealing the order and has called it 'a sweeping and unnecessary overreach'. The appeal is unresolved as of this writing.

If you have been using 'delete chat' on ChatGPT free or Plus since mid-2025 under the assumption that it deletes the chat, that assumption has been wrong. The chat is still there. This is not a breach or a bug - it is a court order that OpenAI is complying with - but it is also not a fact you can find in the in-app help page without searching for it.
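
One way to keep the tiers straight is to write them down as a lookup table. The sketch below only transcribes the claims made in this section; the tier keys, field names, and the `delete_really_deletes` helper are illustrative, and the values will go stale when the order or the policies change.

```python
from typing import NamedTuple


class TierPosture(NamedTuple):
    trains_by_default: bool
    under_nyt_hold: bool
    retention_note: str


# Values transcribed from the tier descriptions in this section.
CHATGPT_TIERS = {
    "free": TierPosture(True, True, "unspecified period"),
    "plus": TierPosture(False, True, "consumer pattern"),
    "pro": TierPosture(False, True, "consumer pattern"),
    "team": TierPosture(False, True, "consumer pattern"),
    "enterprise": TierPosture(False, False, "customer-managed"),
    "edu": TierPosture(False, False, "customer-managed"),
    "api": TierPosture(False, True, "30 days for abuse monitoring"),
    "api-zdr": TierPosture(False, False, "none (Zero Data Retention)"),
}


def delete_really_deletes(tier: str) -> bool:
    """True when 'delete' is not overridden by the preservation order."""
    return not CHATGPT_TIERS[tier.lower()].under_nyt_hold
```

Asking `delete_really_deletes("plus")` and `delete_really_deletes("enterprise")` gives different answers for the same product name, which is the point of the section.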

Claude: a cleaner story, with a different set of asterisks

Anthropic's posture is the clearest of the consumer-AI vendors, partly because the company is younger and partly because its explicit market positioning is 'the safe and responsible one'. The rules for Claude.ai across free, Pro, Team, and Enterprise are consistent on one important point: Anthropic does not train its models on customer inputs or outputs by default on any tier. This is uncommon - OpenAI's free tier defaults the other way - and is worth weighing if the training-default is the property you care about.

Retention is documented at 30 days for free and Pro consumer conversations, with longer retention configurable for Team and Enterprise. Files uploaded to Claude are subject to the same retention as the conversation. Deletion is documented as propagating to backups within 30 additional days. There is no active legal preservation order analogous to the NYT case against OpenAI, though Anthropic has its own pending copyright suits (Bartz v Anthropic, the Universal Music Group case) which may produce something similar in time.

Claude's analysis tool (the equivalent of Code Interpreter) runs in a sandbox with the same lifetime semantics as OpenAI's: the sandbox dies, the copy of the file in the sandbox goes with it, the copy in the chat-attachment layer persists. There is no meaningful difference between vendors on this point.

The asterisk on Claude is the abuse-monitoring carve-out: certain conversations flagged by safety classifiers for review may be retained beyond 30 days, in some cases up to two years, for Anthropic's safety research and to improve the classifiers themselves. This is documented in Anthropic's usage policy, is narrower in scope than the equivalent OpenAI carve-out, but is still a carve-out worth being aware of.

Gemini: tied to Google, which is sometimes good and sometimes the problem

Gemini's privacy story is unique among the major AI tools because Gemini is not a standalone product the way ChatGPT and Claude are - it is a feature that ships across many Google surfaces, and which surface you use changes the privacy posture more than the model does.

Gemini through a personal Google account (gemini.google.com, the Gemini app on Android and iOS) flows through Google Activity controls. Your Gemini conversations are, by default, retained for 18 months and may be reviewed by human annotators to improve Google's services - this is the same posture Google Assistant has had for years, and it carries the same controls (you can turn the retention down to 3 months, or off entirely, in My Activity). The 'reviewed by human annotators' part is the one that has produced Google's largest privacy headlines, going back to the 2019 Belgian press leak of Google Assistant audio recordings, and is the part most worth turning off if you are uploading anything sensitive.

Gemini through Google Workspace (a paid Workspace account with Gemini included) flips the defaults: no human review, no training on your data, retention follows your Workspace data retention rules. This is the credible enterprise posture.

Gemini API on Google AI Studio with a free key uses your conversations to improve Google's models, retains them for 55 days for abuse monitoring, and surfaces them in the Studio UI. Gemini API on Vertex AI (the paid Google Cloud surface) does not use customer data for training, retains request/response payloads for 30 days for abuse monitoring, and offers a 'request not to be logged' option for customers with a regulatory need.

The honest takeaway: 'I uploaded it to Gemini' is meaningless without knowing which Gemini. The free consumer one is a Google Activity stream with human review; the Workspace one is a normal enterprise SaaS; the Vertex API one is enterprise infrastructure.

Microsoft 365 Copilot: the model is the boring part

Microsoft's positioning is that the model (under the hood, a GPT-4-class OpenAI model served from Microsoft's Azure OpenAI infrastructure) is the boring part of Copilot. The interesting part is the integration with your tenant. When you ask Copilot in Word to summarise a document, the document text leaves your machine and goes to an Azure OpenAI inference endpoint inside your tenant boundary; the model processes it and the reply comes back; and - per Microsoft's documented contract - the request payload is not retained, not used for training, and not accessible to OpenAI. The retention that applies to the conversation is your tenant's existing Microsoft 365 retention policy.

This is the strongest published posture among the integrated productivity-suite AI tools, and it is the reason Microsoft has had the easiest time selling Copilot into regulated industries. The trade-off is that it is only available with an enterprise Microsoft 365 subscription, which excludes individual users and small businesses on consumer plans. The Copilot you get on the consumer Microsoft 365 Personal plan is a different product with a different (consumer-tier) privacy posture.

Apple Intelligence and Private Cloud Compute: the only mainstream tool with a meaningfully different architecture

Apple's design for Apple Intelligence is genuinely different from everyone else's and is worth understanding even if you are not on Apple's platform, because it sets the bar that the rest of the industry will be measured against in the coming years.

The on-device model handles most requests locally. The bytes never leave the device. When a request is too complex for the on-device model, it is escalated to Apple's Private Cloud Compute (PCC) - and the PCC architecture is the part that matters. Apple commits that the PCC servers run only Apple-signed, publicly-published code; that the code is verifiable via remote attestation (your device can confirm the server is running the published binary before sending the request); that the servers have no persistent storage and discard the request after processing; and that Apple itself cannot access the contents of the request. Independent security researchers (Trail of Bits, among others) have published audits of the architecture and the published source.

The third-party fallback - the integration that hands off some requests to ChatGPT - is opt-in per request, is mediated through an OpenAI API key Apple holds rather than your account, and is explicitly documented as flowing through OpenAI's standard API privacy posture (which means the 30-day API retention and the NYT preservation order apply unless OpenAI's ZDR terms are in effect for Apple's contract specifically; Apple has not publicly clarified this point).

The takeaway is not 'use Apple devices'. It is that 'an AI model running on someone else's hardware in a way that the operator cannot read or persist' is technically possible in 2026, has been deployed at scale, and is increasingly the bar that informed buyers should hold other vendors to.

What goes wrong when policy and reality diverge: the receipts

Three incidents are worth keeping in your head because they show the failure modes the marketing pages tend to avoid.

Samsung, April 2023. In the space of about three weeks, three separate Samsung Semiconductor employees pasted confidential material into the free tier of ChatGPT: source code for a proprietary chip-database tool, internal meeting notes from a product review, and a hardware test sequence. None of this was a breach - there was no leak from OpenAI - but Samsung had no ability to retrieve or delete the data once it was in OpenAI's chat-storage layer, and the data was, at the time, eligible for use in training. Samsung banned generative AI on internal systems in May 2023 and rolled out a Microsoft 365 Copilot deployment in the following months. The pattern repeated at JPMorgan, Verizon, Apple, Amazon, Bank of America, Deutsche Bank, Citigroup, and Goldman Sachs over the same six months. The lesson, for every company on every tier: an employee with a browser is a leak path unless there is an enterprise-tier alternative they prefer.

The Italian DPA / Garante ban, March-April 2023. Italy's data protection regulator temporarily ordered OpenAI to stop processing the data of Italian users, citing GDPR concerns around the legal basis for training-data collection and the inadequacy of age verification. The ban was lifted after OpenAI added a data-controls page, clarified its privacy policy, and added age-verification on signup, but it set the regulatory precedent that the consumer AI tools were operating without a clear legal basis for what they were doing with EU users' data. The follow-on investigations have not yet produced final rulings but have shaped every subsequent privacy policy update across the industry.

ChatGPT shared chats indexed by Google, late 2024. A wave of ChatGPT 'shared chat' links - the public URLs the share button generates - turned up in Google Search results, exposing conversations that the people who created the share links had not intended to publish. OpenAI's share semantics were 'anyone with the link', which, as our companion piece on why cloud share links are not actually private explains in detail, becomes 'anyone' the moment a search engine crawls the URL. OpenAI added a noindex tag and tightened the default; the URLs that were already in Google's index took weeks to fall out. The lesson generalises to every 'share with link' surface on every AI tool: a shared conversation is one search crawl away from public.
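
If you publish anything behind a 'share with link' URL, you can check for yourself whether the page asks search engines not to index it. The sketch below uses only Python's standard library and looks for a `<meta name="robots">` tag containing `noindex`; it does not inspect the `X-Robots-Tag` response header, which can carry the same directive, and the class and function names are illustrative.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""

    def __init__(self) -> None:
        super().__init__()
        self.directives: set[str] = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # HTMLParser lowercases attribute names; values may be None.
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() == "robots":
            for part in attr.get("content", "").split(","):
                self.directives.add(part.strip().lower())


def has_noindex(html: str) -> bool:
    """True if the page carries a robots meta tag with a noindex directive."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noindex" in parser.directives
```

A `noindex` tag is a request to crawlers, not an access control: the URL remains readable by anyone who has it, and pages that were indexed before the tag was added stay in results until they are recrawled.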

The practical rule for sensitive files

Putting all of the above together, the working rule for handling anything you would not want exposed comes down to four tiers, in order of preference.

First, on-device or self-hosted. If the file is sensitive enough that you would not put it in a public S3 bucket, the cleanest answer is to use a model that runs on a machine you control. Ollama, LM Studio, and llama.cpp all run open-weight models (Llama 3.1, Mistral, Qwen, DeepSeek) locally on a laptop with enough RAM. The model is less capable than a frontier commercial model, but for summarisation, extraction, and classification of text - which is the bulk of what most people use AI for - the gap is smaller than the headlines suggest. The file never leaves your machine. This is the same logic that underpins Privvert's tools: the privacy property of 'it does not leave the machine' is qualitatively different from any property a cloud vendor can offer.
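
As a rough sketch of what the local route looks like in practice, the snippet below sends a document to an Ollama server running on the same machine. It assumes Ollama is listening on its default local address and that a model tagged `llama3.1` has already been pulled; the prompt wording and the function names are illustrative.

```python
import json
import urllib.request

# Assumptions: Ollama's default local address and a model tag you have
# already pulled; change both to match your own setup.
OLLAMA_URL = "http://localhost:11434/api/generate"
MODEL = "llama3.1"


def build_request(document_text: str) -> urllib.request.Request:
    """Build (but do not send) a summarisation request for the local server."""
    payload = {
        "model": MODEL,
        "prompt": "Summarise the following document:\n\n" + document_text,
        "stream": False,  # ask for one JSON object instead of a stream
    }
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def summarise_locally(document_text: str) -> str:
    """Send the request to localhost and return the model's reply text."""
    with urllib.request.urlopen(build_request(document_text)) as resp:
        return json.loads(resp.read())["response"]
```

The only network hop is to `localhost`, so the document text stays on the machine; calling `summarise_locally(open("contract.txt").read())` would return the summary as a string, provided the server is running.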

Second, enterprise tier with a signed contract. If you need a frontier-capable model for sensitive work, the right tier is the enterprise one with a no-training, short-retention contract that your legal team has read. ChatGPT Enterprise (with the NYT exemption), Claude Enterprise, Microsoft 365 Copilot Enterprise, or Vertex AI with logging disabled are the credible options. The consumer paid tier is not a substitute - the contracts are different, the retention is different, and the legal-preservation posture is different.

Third, redact before upload. If the file is mostly non-sensitive but contains a few identifiers - names, addresses, account numbers, photo metadata - the right pattern is to remove those locally before sending the file to any cloud tool. Our pieces on PDF redaction, removing photo metadata, and what 'delete' really does cover the local-only workflows for the most common cases.
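
For plain text, a first pass at local redaction can be as simple as a handful of regular expressions run before the file goes anywhere. The patterns below are deliberately rough and purely illustrative: real identifiers vary by country and format, the phone pattern will also catch other long digit runs, and names and addresses need more than a regex.

```python
import re

# Illustrative patterns only; tune or replace them for your own data.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    # 13 to 16 digits, optionally separated by single spaces or hyphens.
    "CARD": re.compile(r"\b\d(?:[ -]?\d){12,15}\b"),
    "PHONE": re.compile(r"\+?\d[\d ()-]{7,}\d"),
}


def redact(text: str) -> str:
    """Replace each match with a bracketed label, in the order listed above."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text
```

For example, `redact("Contact jane.doe@example.com or +44 20 7946 0958.")` returns `"Contact [EMAIL] or [PHONE]."`. Treat the output as a draft to review by eye, not a guarantee that every identifier is gone.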

Fourth, accept the exposure consciously. If the file is not sensitive - a meeting transcript you would happily post on a blog, a screenshot from a public website, a homework question - then a consumer AI tool is the right level of infrastructure for the job. The mistake is treating the consumer tier as private when it is not.

Where this leaves AI tools in 2026

The honest summary is that AI tools are following the same privacy-maturity curve that cloud storage went through between 2008 and 2015. The early product was 'we have a folder on the internet'; the early privacy posture was 'we hope you trust us'; the gap between the consumer free tier and the enterprise paid tier widened steadily as enterprise customers refused to deploy without contracts. AI tools are in roughly the equivalent of 2012 on that curve: the enterprise contracts are now credible, the consumer tiers are still loose, the gap between them is the defining feature of the market, and the regulatory pressure is starting to bite.

The other piece worth saying out loud, given how often the marketing implies otherwise, is that 'end-to-end encrypted' and 'private' do not currently apply to consumer AI tools in any useful sense. The model needs to read the file to answer the question; the vendor needs to run the model to provide the service; the request payload is, by physics, visible to the vendor's infrastructure at the moment of inference. The closest thing to end-to-end encryption in this space is the attested-execution model Apple deployed with Private Cloud Compute - and even that is the model's owner attesting to itself about itself, with no third-party participant analogous to the Signal protocol's design. The cryptographic primitive that would let you send an encrypted prompt that a model could process without the vendor seeing it (fully homomorphic encryption on a 100-billion-parameter model) is not within an order of magnitude of practical in 2026. The privacy story for AI tools, for the foreseeable future, is going to be a contract and an infrastructure story, not a cryptography story. Pick your tier accordingly.

If you handle files the way we do at Privvert - the starting assumption is that the file stays on your machine - the AI tools fit naturally as a fourth category alongside online file converters, cloud share links, and the rest of the infrastructure we have been writing about. They are useful, they have legitimate uses, and they have a specific shape of exposure that does not match what most people assume. Knowing the shape is the whole game. For the related question of what end-to-end encryption does and does not cover in the messaging layer the same files might travel through, our piece on what 'end-to-end encrypted' actually means walks through the same set of trade-offs in the adjacent domain.

FAQ

If OpenAI says 'we don't train on your data', does that mean my uploaded files are private?: It means one specific thing - the file will not be used as gradient updates to the next model version - and it leaves a great deal else unaddressed. The file is still stored on OpenAI's infrastructure, still readable by the abuse-monitoring system and the engineers who debug it, still subject to legal preservation orders (including the active New York Times v OpenAI order that, since mid-2025, has required ChatGPT to keep deleted chats and API outputs indefinitely while the case is pending), and still tied to your account in ways that survive 'delete'. 'Not used for training' is one privacy property out of about eight you actually want to know about.
What is the New York Times v OpenAI preservation order and why does it matter for my files?: In May 2025, a federal magistrate judge in the Southern District of New York ordered OpenAI to preserve all ChatGPT output data and API output that would otherwise have been deleted, as part of the discovery process in the Times' copyright suit. The order applies to free, Plus, Pro, Team, and standard API users; it explicitly does not apply to ChatGPT Enterprise, Edu, or to API customers with a signed Zero Data Retention agreement. The practical effect for everyone else is that the 'delete chat' button still hides the chat from your sidebar but no longer deletes it from OpenAI's storage - it is retained under a legal hold, accessible to a small audit team, until the court lifts the order. OpenAI is appealing it. As of this writing the order is still in force.
Are Claude, Gemini, and Copilot any better here?: On training, all three of the big paid tiers (Claude Pro/Team/Enterprise, Gemini through Google Workspace, Microsoft 365 Copilot) commit not to train on customer data by default - this is genuinely consistent across the enterprise market and has been since roughly 2024. On retention, the free tiers are similar to ChatGPT free: chats are retained for service operation, abuse monitoring, and improvement of safety classifiers, with windows ranging from 30 days (Anthropic's stated default) to 18 months (Google's default consumer activity retention before 2024, now opt-in). On the legal-preservation question, only OpenAI is currently under the active NYT order; Anthropic and Google have their own pending copyright suits but no equivalent retention orders as of the publication date. The honest summary is that the enterprise contracts are credible, the free tiers are not, and the gap is widening.
What happens to a file I upload to Code Interpreter / Advanced Data Analysis?: The file is copied into a per-session Linux sandbox - a stripped-down container with Python, no network access, and a small ephemeral filesystem. The sandbox is destroyed when the session ends, which OpenAI documents as 'within a few hours of inactivity'. The file inside the sandbox is gone at that point. The copy of the file in the chat attachment, however, is not - that lives in the same chat-storage layer as the rest of the conversation, subject to the same retention, the same legal hold, and the same 'delete chat' semantics described above. The sandbox lifetime is a real technical boundary; the attachment lifetime is the one that actually matters for privacy.
Is the paid tier of ChatGPT/Claude/Gemini meaningfully more private than the free tier?: On training, yes - all three paid consumer tiers default to not training on your inputs (ChatGPT Plus/Pro since April 2023, Claude Pro since launch, Gemini Advanced since launch). On retention, mostly no - the paid consumer tiers use the same chat storage with the same windows as the free tiers. The big jump is from consumer paid to enterprise: ChatGPT Enterprise, Claude Enterprise, and Microsoft 365 Copilot all offer customer-managed retention, no training by default, SOC 2 reports, and (for ChatGPT Enterprise specifically) exemption from the NYT preservation order. If you are evaluating tools for a workplace, the consumer paid tier is not the right tier for sensitive work; the enterprise tier is.
What does the Samsung 2023 incident actually teach us?: In April 2023 Samsung's semiconductor division banned generative AI tools on internal systems after engineers pasted source code, internal meeting notes, and a hardware test sequence into ChatGPT to ask for debugging help. The ban was not a response to a leak from OpenAI - there was no breach. The problem was that the company had no control over what employees were sending out, and once a chunk of confidential code is in a third party's chat log, it sits under that third party's retention, legal-hold, and breach-exposure policy rather than the company's. The Samsung response - block the consumer tools, deploy an enterprise tier with a no-training contract - is the same response most large companies have taken since. The lesson is not that AI tools are dangerous; it is that the difference between the consumer tier and the enterprise tier is the entire story.
What about the ChatGPT search-indexing incident in 2024?: In late 2024 a number of ChatGPT 'shared chat' links - the public URLs the share button generates - were found indexed by Google Search, surfacing conversations that the people who created the share links almost certainly did not intend to publish. OpenAI's share-link semantics had been 'anyone with the link can view', which is functionally 'anyone' once a search engine has crawled it (see our companion piece on cloud share links for the full pattern of why this happens). OpenAI added a noindex meta tag to shared chats and tightened the default in early 2025, but the indexed URLs that were already in Google's index took weeks to clear. The general rule across the industry: any 'shared with link' surface is one search-engine crawl away from being public, regardless of the URL's length.
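For readers who publish their own 'anyone with the link' pages, the noindex mitigation described above can be checked mechanically. The following is a minimal sketch, assuming the page signals indexing policy through a robots meta tag; the names `RobotsMetaParser` and `has_noindex` are hypothetical, and the check deliberately ignores the `X-Robots-Tag` HTTP header, which can carry the same directive. It does not describe how any vendor's share pages are actually built.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots"> / <meta name="googlebot"> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # Attribute values may be None for valueless attributes; normalise to "".
        attr = {key.lower(): (value or "") for key, value in attrs}
        if attr.get("name", "").lower() in ("robots", "googlebot"):
            for directive in attr.get("content", "").split(","):
                self.directives.add(directive.strip().lower())


def has_noindex(html: str) -> bool:
    """Return True if the page asks crawlers not to index it via a robots meta tag."""
    parser = RobotsMetaParser()
    parser.feed(html)
    # "none" is shorthand for "noindex, nofollow" in the robots meta convention.
    return "noindex" in parser.directives or "none" in parser.directives
```

A noindex tag only takes effect once a crawler re-fetches the page and chooses to honor it, which is consistent with already-indexed URLs taking time to clear.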
Are AI features built into other apps - Notion AI, Slack AI, the Apple Intelligence stuff - any different?: They split into two camps. Notion AI, Slack AI, and most Microsoft 365 Copilot features are wrappers around OpenAI, Anthropic, or Google models running on the AI vendor's infrastructure, with a contract that flows your data through the wrapper vendor to the model vendor and back. The retention and training posture follows the contract chain - Slack AI on the Enterprise+ tier inherits no-training, Notion AI inherits the OpenAI enterprise terms, and so on. Apple Intelligence is genuinely different: Apple's Private Cloud Compute architecture runs the requests on Apple-controlled hardware with attested code and no persistent storage, and the on-device models never leave the device at all. Apple's design here is the strongest published privacy posture in mainstream AI, and it is worth understanding even if you do not own Apple devices because it sets a useful bar for what other vendors could be doing and are not.
What is the practical rule for handling sensitive files - contracts, IDs, medical records, source code?: If you would be uncomfortable seeing the file in a public S3 bucket, do not put it into a consumer AI tool. The realistic options are: (1) an enterprise contract with a no-training, zero-retention or short-retention clause, audited by your legal team; (2) an on-device or self-hosted model - Ollama, LM Studio, llama.cpp for text, the local image and OCR models we ship in Privvert's tools - where the file never leaves your machine; (3) local redaction of the sensitive fields before uploading, treating the AI tool as untrusted infrastructure. For the specific case of files where the sensitive part is identifiers or metadata rather than the body, our pieces on PDF redaction, photo metadata, and what 'delete' really does cover the local-redaction workflow in detail.
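The local-redaction option can be sketched in a few lines for plain text. This is a minimal illustration, not a complete redaction tool: the pattern set, the `PATTERNS` table, and the `redact` function are hypothetical names introduced here, the regular expressions only cover US-style identifiers, and anything they miss (names, addresses, free-text details) would still leave your machine.

```python
import re

# Hypothetical, deliberately narrow patterns; tune them to the identifiers
# that actually appear in your own documents before relying on them.
PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}


def redact(text: str) -> tuple[str, dict[str, int]]:
    """Replace each matched identifier with a labelled placeholder.

    Returns the redacted text and a per-label count of replacements, so the
    caller can review what was removed before uploading anything.
    """
    counts = {}
    for label, pattern in PATTERNS.items():
        text, replaced = pattern.subn(f"[{label} REDACTED]", text)
        counts[label] = replaced
    return text, counts


if __name__ == "__main__":
    sample = "Contact jane.doe@example.com or 555-123-4567; SSN 123-45-6789."
    cleaned, counts = redact(sample)
    print(cleaned)
    print(counts)
```

The counts are returned alongside the text on purpose: a count of zero on a document you know contains identifiers is the signal that the patterns need work, and it is better to find that out before the upload than after.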