Why I Built My Homelab


Rethinking Data Ownership and Privacy in the AI Era

*Ding.* My phone chimes: $9.99, Apple, Inc.

Every month, like clockwork, I get a $9.99 bill from Apple for my 1TB of cloud storage. What started as a manageable $0.99 a month climbed to $2.99, then ballooned alongside my ever-growing collection of photos and 4K videos—and my reluctance to sift through and delete the junk. The trap of cloud storage is that unless you stay vigilant, the costs creep higher until you’re paying more annually than the price of a physical hard drive. Of course, the cloud does offer real benefits: redundancy, ease of use, and arguably security.
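The break-even point arrives faster than you might expect. Here is a back-of-the-envelope sketch; the $60 drive price is an illustrative assumption, not a quote from any vendor:

```python
# Compare recurring cloud storage spend against a one-time drive purchase.
# The drive price is an assumption for illustration; tiers and prices vary.

MONTHLY_CLOUD_COST = 9.99   # 1 TB cloud tier, per month
DRIVE_COST = 60.00          # rough price of a 1 TB external drive (assumed)

def months_until_cloud_exceeds_drive(monthly: float, drive: float) -> int:
    """Months before cumulative cloud spend passes the one-time drive price."""
    months = 0
    spent = 0.0
    while spent <= drive:
        months += 1
        spent += monthly
    return months

print(months_until_cloud_exceeds_drive(MONTHLY_CLOUD_COST, DRIVE_COST))
# → 7 (seven months of $9.99 exceeds a $60 one-time purchase)
```

Under those assumptions, the subscription outspends the hardware in just over half a year—before counting the years of billing that follow.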

The more I thought about it, the more I realized how dependent I’d become on cloud storage for convenience. My photos, videos, and files weren’t just in the cloud—they were locked behind someone else’s infrastructure, with costs and conditions I couldn’t control. It felt like renting space in perpetuity, with no clear end in sight. I’m not advocating an all-out revolt against the cloud, though. I still rely on convenient features like automatic phone backups; they’ve saved me more than once. The trick is finding a balance—using cloud services for offsite backups and quick syncing, rather than storing everything there by default. That dependency got me wondering: what would it take to regain control over my data and reduce my reliance on third-party platforms?

As a developer, I’ve always been interested in running my own server rack. I’ve managed large-scale applications in the cloud for years, but I’d never done so on bare metal. Setting up my own infrastructure felt like the perfect opportunity to dive deeper into networking and system administration through hands-on experience—my preferred way to learn. I needed a stronger reason, though. Ten dollars a month wasn’t enough to justify investing in hardware and dedicating the time to set everything up.

I began to weigh data ownership alongside the value of other useful services like Plex—a media server for streaming movies, music, and TV shows. I had previously purchased or rented movies through services like Amazon Prime Video, but I also owned several physical discs of old holiday movies and other classics. This spurred me to look into what actually happened when I “bought” a movie to watch on Amazon Prime Video, and my discoveries concerned me.

When you purchase a physical DVD from a store, you own the disc outright. Warner Brothers can’t come to your home and take it back simply because Blockbuster went out of business or their licensing agreements changed. However, this isn’t the case with digital movies. When you “buy” a digital movie—such as through Amazon—you’re typically purchasing a license to access the content, not ownership of the file. This license depends on agreements between the platform (e.g., Amazon) and the content owner (e.g., Disney). If those agreements change or expire, the platform could lose the rights to offer the movie, revoking your access to it.

For example, Amanda Caudel filed a lawsuit[1][2] against Amazon in 2021, arguing that the term “buy” is misleading in this context, as digital purchases are contingent on ongoing licensing agreements. Unlike physical media, digital ownership often means you don’t control the file or have independent rights to it. Instead, your access is tied to the platform and its agreements. In the digital world, unless you own and have explicit rights to the file, your ownership is not guaranteed. In essence, the lawsuit underscores that “buying” a digital movie is more akin to leasing permission to watch it. Once those licensing agreements lapse or change, that permission can vanish overnight—along with your so-called purchase.

As I grappled with the notion of taking back control over my photos and videos, I realized there’s a bigger dimension to data ownership. If our access to media can be revoked, what assurances do we have about the security and privacy of our personal data stored in the cloud? I know I sure don’t read through every new privacy policy and terms of service agreement sent my way.

“Nothing to hide, nothing to fear”

While it’s true that many of us don’t have much to hide, that doesn’t mean we’re comfortable with leaving the metaphorical curtains wide open. Privacy isn’t just about secrecy—it’s about control. As our lives become more integrated with digital platforms, we often trade control for convenience. The question is: How much control are we willing to give up, and at what cost?

As I began questioning the long-term sustainability of relying on cloud storage, my thoughts inevitably expanded beyond personal costs to the larger, more abstract concerns surrounding data. If I could no longer trust the platform to safeguard my own photos and videos, what about the other data I entrusted to these services? As more of our digital lives are stored in the cloud, the implications of data usage become far more critical. Beyond the loss of access to a file or photo, there are broader ethical issues at play—especially when it comes to how that data is used, shared, and sold. This shift in perspective led me to think more deeply about the intersection of data ownership, privacy, and the emerging role of AI in shaping the future of how we interact with technology.

This tradeoff becomes even more concerning in the context of machine learning and AI. Modern AI systems thrive on data—lots of it. From photos to documents, the files we store in the cloud may be part of a much larger data ecosystem. While companies promise to anonymize data used for training AI models, the line between anonymized and identifiable data isn’t always as clear as it seems. And as competition in AI intensifies, the pressure to collect, utilize, and perhaps exploit user data will only grow.

As we move from personal data storage to the broader implications of data usage, it’s clear that the stakes are much higher. With AI systems becoming more powerful and pervasive, the data we store isn’t just sitting idly in the cloud—it’s actively being utilized in ways that could impact our privacy in areas many users don’t fully understand. This is where the ethical and regulatory challenges of AI data collection become a critical issue.

Anonymized training data is vital for creating machine learning models in privacy-sensitive domains like healthcare. However, mishandling this data can lead to serious, unintended privacy breaches. For example, a common emerging technology is the use of AI Medical Scribes[3] to transcribe medical speech. This software provides great value to medical staff by improving their workflow efficiency, yet it comes with concerns regarding the handling of sensitive patient data for training and other development purposes.

Mariana, a provider of AI Scribe software, acknowledges these issues on their website—outlining steps like de-identification of data and database security measures[4]. Seeing companies recognize and tackle these challenges head-on is promising, but even the best engineers can make mistakes or face downstream risks from third-party vendors. At the end of August 2024, Confidant Health faced a breach of 5.3 terabytes of user data[5], including secure patient files such as multi-page psychiatric summaries and medical history.

Although high-profile breaches make headlines, an equally concerning issue is the everyday outsourcing of data to multiple third-party services. Companies like Astronomer provide a fully managed Apache Airflow service. Rather than hosting in-house or using a secure cloud provider, many companies—especially startups—outsource crucial infrastructure to providers like Astronomer to reduce maintenance overhead. Astronomer is just one example; LangChain is also commonly used for LLM-based applications, offering tools such as LangSmith for debugging, development, and iteration.

The real vulnerability often surfaces during routine development tasks, where sensitive user data can inadvertently end up in logs or debugging tools—potentially exposing personally identifiable information to unauthorized eyes. Startups rely on multiple third-party services to move quickly and minimize operational costs. Unfortunately, you’re only as safe as your most vulnerable link. Without extra measures to properly secure user data—both in transit and at rest—it can be easily exposed once it leaves the core system.
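One practical mitigation is to scrub obvious identifiers before log lines ever leave the process. Here is a minimal sketch using Python’s standard `logging` module; the regex patterns and placeholders are illustrative assumptions, nowhere near an exhaustive PII taxonomy:

```python
import logging
import re

# Illustrative patterns only -- a real deployment needs a vetted PII list.
PII_PATTERNS = [
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"), "<email>"),
    (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "<ssn>"),
]

class RedactingFilter(logging.Filter):
    """Replace PII-looking substrings in a record's final message."""
    def filter(self, record: logging.LogRecord) -> bool:
        msg = record.getMessage()  # resolve %-style args first
        for pattern, placeholder in PII_PATTERNS:
            msg = pattern.sub(placeholder, msg)
        record.msg, record.args = msg, ()
        return True  # keep the record, just sanitized

logger = logging.getLogger("app")
handler = logging.StreamHandler()
handler.addFilter(RedactingFilter())
logger.addHandler(handler)

logger.warning("sync failed for jane.doe@example.com (ssn 123-45-6789)")
# logs: sync failed for <email> (ssn <ssn>)
```

Attaching the filter at the handler means sanitization happens last, after message formatting—so even interpolated arguments get scrubbed before anything reaches a file, a console, or a third-party log aggregator.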

Companies in the MedTech space typically undergo extra privacy and security scrutiny, such as complying with regulations like HIPAA in the U.S., which safeguards personal health information to some extent. In the EU, GDPR provides a comprehensive data protection framework across sectors. By contrast, the U.S. lacks a single, national data privacy law akin to GDPR, so outside of laws like HIPAA, many areas—including advertising and certain AI training—remain underregulated when it comes to personal data protection. The inconsistent regulatory landscape only reinforced my desire to manage my own storage and reduce potential vulnerabilities.

I am not saying you shouldn’t trust your healthcare provider or that you should start living under a rock, but I do see value in reducing my exposure to the cloud in a world where data is more valuable than gold. Ultimately, the decision to build my homelab wasn’t just about saving money on cloud storage or addressing concerns over data ownership—it was about regaining a measure of control.

As a DevOps engineer, this project gave me the opportunity to deepen my understanding of networking and system administration on bare metal, while also providing a sense of security and independence that I didn’t have in the cloud. The frustrations with digital ownership, the creeping costs, and the growing concerns around privacy and AI-driven data collection were all factors in my decision to move away from third-party providers. Of course, cloud services aren’t inherently evil—they offer real convenience and have enabled many innovations. But building my own infrastructure has allowed me to regain some of the control that cloud storage often strips away.

It’s not perfect, and it certainly comes with its own set of challenges, but it’s a step toward more direct ownership of my data—and ultimately, my digital life. Of course, running a homelab means I’m on the hook for everything: I maintain hardware, manage backups, update software, and foot the electricity bill. If a drive fails at 3 a.m., there’s no tech support hotline to call—I’m the tech support. But for me, the hands-on learning and sense of autonomy make up for the occasional headache. In a world that increasingly shifts toward convenience and outsourcing, this homelab is my attempt to carve out a space where I hold the keys to my own information—and have some damn fun. The cloud offers unparalleled convenience, but nothing beats the freedom of owning your own data and software outright.

Sources

  1. Caudel v. Amazon Motion to Dismiss. (2021). Scribd. https://www.scribd.com/document/481886329/Caudel-v-Amazon-MTD
  2. Hollywood Reporter. (2021). Amazon argues users don’t actually own purchased Prime Video content. https://www.hollywoodreporter.com/business/business-news/amazon-argues-users-dont-actually-own-purchased-prime-video-content-4083703/
  3. Wikipedia. (n.d.). Automated medical scribe. https://en.wikipedia.org/wiki/Automated_medical_scribe
  4. Mariana AI. (n.d.). Exploring the privacy implications of AI medical scribes: What you need to know. https://marianaai.com/post/exploring-the-privacy-implications-of-ai-medical-scribes-what-you-need-to-know
  5. Wired. (2024). Confidant Health therapy records database exposure. https://www.wired.com/story/confidant-health-therapy-records-database-exposure/