Self-hosting keeps your private data out of AI models

Tim Abbott

Last week, Slack’s users realized that under the company’s terms of service, their private data could be used to train artificial-intelligence models.

“To develop AI/ML models, our systems analyze Customer Data (e.g. messages, content, and files) submitted to Slack.”

— Slack’s privacy principles, May 17, 2024

This came as a shock: chat messages often contain sensitive company data, and LLMs (large language models, the category of AI models that includes ChatGPT) are known to leak the data they are trained on.

“Large language models (LMs) have been shown to memorize parts of their training data, and when prompted appropriately, they will emit the memorized training data verbatim.”

— “Quantifying Memorization Across Neural Language Models” [arXiv:2202.07646]

Assuming Slack’s belated updates are accurate, how Slack currently uses private data may be acceptable to some of the customers who were initially concerned.

Nevertheless, in this blog post, I’ll argue that the risks of entrusting your business’s private communications to a cloud service are now greater than ever. I’ll also discuss how you can minimize those risks by self-hosting the tools that process your most sensitive private data.

Full disclosure: I lead the development of Zulip, an organized team chat app that can be self-hosted or used as a cloud service.

Data-hungry AI/LLM efforts are an existential need for many tech companies

In the current climate of AI buzz, AI/LLM initiatives are a “must” for companies looking to boost their stock price or raise their next round of funding. AI/ML companies captured more than a third of all venture funding in 2023, and Series B valuations were almost 60% higher for startups that promised AI in their pitch.

Data is the lifeblood of AI/LLM technology, and there’s a long history of major tech companies disregarding data privacy when it conflicts with their commercial interests. Drawing clear boundaries on data use and sticking to them will be challenging for any corporation, regardless of how leaders at the company may personally feel. Slack in particular is part of Salesforce, which considers AI to be key to its strategy.

With a SaaS-only product (one that offers no self-hosting option), you’ll find yourself in a tough spot if the vendor changes its data-use policies to ones you find unacceptable. You can stop using the product altogether, with all the stress of selecting and implementing an alternative in a hurry. Or you can decide to live with it.

You give software providers a limited license to use your data

A terms of service agreement with a cloud services company that hosts your private data will include a limited license that allows the company to process your data. Slack may well update its terms of service to be more restrictive in response to the recent backlash.

However, a license that limits how data can be used for training AI models is effective only to the extent that you can legally enforce it.

To assert your rights, you have to know when they’ve been violated in the first place. If an LLM trained with your proprietary data shares your data with a competitor, how likely are you to even realize that it happened?

Tech giants claim that licensing restrictions don’t apply to training LLMs

If you do find out that your data has been misused to train an LLM, convincing a court to compensate you for the damage to your business will be an arduous and uncertain journey.

Microsoft and OpenAI have taken the position that copyright and licensing restrictions can largely be ignored when training LLMs, and are making that argument in court.

For example, open-source code is published under a variety of permissive licenses that nonetheless limit how that code can be used. In creating the GitHub Copilot AI model, Microsoft and OpenAI blatantly ignored these license restrictions, and are defending this policy as “fair use” in a class-action lawsuit.

Meanwhile, The New York Times is suing Microsoft and OpenAI for training models on its articles; those models can spit out “near-verbatim excerpts from Times articles that would otherwise require a paid subscription to view.”

Self-hosting is the better option for many organizations

The best way to protect your data from being misused by another company is not to share it in the first place. At the same time, teams need software to collaborate: tools for chat, project management, docs, etc. Self-hosting your collaboration software is the surest way not to have your proprietary data end up in someone else’s LLM.

(The other option is to use an end-to-end encrypted application, where the hosting provider cannot access the data at all, but this comes with functionality compromises that make it impractical for most business software.)

Just 10 years ago, self-hosting key applications was often impractical. For many business needs, there was a large gap between the quality and capabilities of market-leading SaaS products and the best self-hosted alternatives. But in 2024, there is a wealth of self-hostable products that can successfully replace popular cloud services.

When you self-host, storing and processing your data on hardware you control is the safest option. When that’s not practical, self-hosting key applications in a public cloud should be safe enough for most organizations: both technical and PR considerations make it highly unlikely that a hosting provider would break into a customer’s database to access its data.

Self-hosting your software tools requires the expertise and attention to set them up securely and to keep them updated with the latest security fixes. If you can’t commit to this right now, your next best option is to use an open-source product that lets you switch between SaaS and self-hosting if your resources and priorities change.


A note on Zulip’s approach to data privacy

We offer both self-hosting and professional cloud hosting for Zulip, and make it easy to move your data between the two.

As a sustainable business that doesn’t depend on venture capital funding, we have the freedom to operate in accordance with our values. Making sure customer data is protected is our highest priority. We don’t train LLMs on Zulip Cloud customer data, and we have no plans to do so.

Should we decide that training our own LLMs is necessary for Zulip to succeed, we promise to do so in a responsible manner. This would include (1) clearly documenting what data would be used and for what purpose, (2) providing an easy in-app option to opt in or out, (3) proactively presenting this option to all Zulip Cloud organizations, and (4) supporting any customers who wish to export their data to self-hosted Zulip.

Finally, we are committed to keeping Zulip 100% open-source, so the source code that defines how data is processed is available for third parties to review and audit.