
An Experiment in IP-Compliant AI Training
Abstract
Recent legal rulings have raised questions about the intellectual property status of AI training data. If strictly enforced, IP compliance would exclude most copyrighted material from training sets, including fiction, poetry, film, modern philosophy, journalism, and literary non-fiction. This paper asks a simple question: what would be lost? We designed an experiment comparing outputs from two simulated training conditions: one restricted to IP-compliant sources, one reflecting current broad training practices. Across five tasks requiring insight, synthesis, and human understanding, we found that the simulated IP-compliant condition produces accurate, professionally appropriate responses but loses something harder to quantify: cultural memory, the accumulated wisdom of how humans have made sense of their experience through story, metaphor, and art. The information survives. The resonance does not.
The Question
The debate over AI training data and intellectual property has largely focused on two questions: Is it legal? Is it fair to creators? These are important questions. But there is a third question that receives less attention: What would we lose if strict IP compliance had been enforced from the beginning?
This is not merely a hypothetical question. Legal rulings are emerging that may constrain future training. Companies are already building ‘clean’ datasets that exclude copyrighted material. The decision about what AI systems can learn from is being made now, in courtrooms and corporate policies, with limited understanding of what different training regimes actually produce.
We designed an experiment to make the difference visible.
The Cold Start Problem
Before presenting the experiment, we should acknowledge the historical context. Current AI capabilities were not built on IP-compliant data. The sequence was:
First, researchers trained models on broad internet-scale data, including copyrighted material, in a legal grey zone. Second, these models demonstrated capabilities that captured public imagination. Third, billions of dollars in investment followed. Fourth, that capital funded infrastructure, research, and refinement. Fifth, only now—with massive resources accumulated—is ‘clean’ training potentially feasible.
This is the cold start problem. Step five is only possible because of steps one through four. You cannot skip to the clean version without the capability that the legally grey bootstrap created. The infrastructure that makes licensing negotiations possible was built on unlicensed foundations.
This paper does not attempt to resolve the legal or ethical questions. It asks only: what is the nature of what was gained—and what would be lost under a different regime?
Method
We cannot retrain models with different datasets. Instead, we simulate constraint through explicit prompting. The model is instructed to respond as if it had access only to certain categories of source material. This is not a perfect simulation—the underlying knowledge still exists—but it demonstrates what is lost when certain domains are excluded from the output space.
We publish the full character cards (the instructions that define each condition) so that the experiment is transparent and reproducible.
Condition A: IP-Compliant Training
The model is instructed to respond as if trained only on: peer-reviewed scientific literature (open access); government and legal documents; technical documentation and manuals; Wikipedia and Creative Commons licensed reference material; public domain works (pre-1928); and corporate whitepapers and press releases.
The model is instructed that it has no access to: novels, short stories, or narrative fiction published after 1928; poetry published after 1928; film, television, or theatrical scripts; song lyrics; published non-fiction books still under copyright; journalism behind paywalls; memoir, personal essay, or literary non-fiction; modern philosophy texts under copyright; or religious texts in copyrighted translations.
The model is instructed to respond as if it has genuinely never encountered these materials—it cannot reference them, draw on their patterns, or use frames that emerged from them. It may still use analogy and metaphor, but only from domains it has access to.
Condition B: Broad Training
The model responds with access to full internet-scale training data, including all scientific and technical literature, fiction across all genres and periods, poetry ancient through contemporary, film, television, and theatrical works, song lyrics and musical analysis, published non-fiction, biography, memoir, journalism, philosophy across traditions, religious and mythological texts, and cultural criticism and commentary.
The model draws on any of this material as relevant to the query.
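The two conditions above can be written down as data and rendered into an instruction block. The following is a minimal sketch, not the published character cards: the `ConditionCard` structure, the `render_card` helper, and the wording of the rendered instructions are all hypothetical, while the category lists are copied from the condition descriptions above.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class ConditionCard:
    """Hypothetical container for one simulated training condition."""
    name: str
    allowed: Tuple[str, ...]
    excluded: Tuple[str, ...] = ()


# Category lists copied from the condition descriptions in the text.
IP_COMPLIANT = ConditionCard(
    name="IP-Compliant Training",
    allowed=(
        "peer-reviewed scientific literature (open access)",
        "government and legal documents",
        "technical documentation and manuals",
        "Wikipedia and Creative Commons licensed reference material",
        "public domain works (pre-1928)",
        "corporate whitepapers and press releases",
    ),
    excluded=(
        "novels, short stories, or narrative fiction published after 1928",
        "poetry published after 1928",
        "film, television, or theatrical scripts",
        "song lyrics",
        "published non-fiction books still under copyright",
        "journalism behind paywalls",
        "memoir, personal essay, or literary non-fiction",
        "modern philosophy texts under copyright",
        "religious texts in copyrighted translations",
    ),
)

BROAD = ConditionCard(
    name="Broad Training",
    allowed=("full internet-scale training data, drawn on as relevant",),
)


def render_card(card: ConditionCard) -> str:
    """Render a card as plain-text instructions to place before a task prompt."""
    lines = [f"Condition: {card.name}", "Respond as if trained on:"]
    lines.extend(f"- {item}" for item in card.allowed)
    if card.excluded:
        # Only the restricted condition carries an exclusion list.
        lines.append("Respond as if you have never encountered:")
        lines.extend(f"- {item}" for item in card.excluded)
    return "\n".join(lines)


print(render_card(IP_COMPLIANT))
```

Keeping the cards as plain data makes it easy to publish them verbatim alongside the outputs, which is the transparency goal stated above.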
Tasks
We selected five tasks that require not just information retrieval but synthesis, insight, and human understanding: metaphor generation (explaining quantum entanglement to a child), strategic-ethical dilemma (advising a CEO facing layoffs or bankruptcy), creative problem-solving (designing a public ritual for collective grief), abstract pattern recognition (identifying commonalities between immune systems, democracies, and jazz), and therapeutic reframe (responding to someone who feels invisible and overwhelmed).
These tasks were chosen because they plausibly benefit from broad cultural knowledge—not just technical accuracy but the accumulated human wisdom about how to explain, how to decide, how to grieve, how to recognize patterns, and how to be present with suffering.
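The design is a simple grid: each of the five tasks is run once under each condition, giving ten outputs arranged as five pairs. A minimal sketch of that grid follows; the `build_runs` helper and the short condition labels are illustrative assumptions, and no model call is shown because the text does not specify one.

```python
from itertools import product

# Task labels taken from the task list above; condition labels are shorthand
# for Condition A and Condition B in the Method section.
TASKS = (
    "metaphor generation",
    "strategic-ethical dilemma",
    "creative problem-solving",
    "abstract pattern recognition",
    "therapeutic reframe",
)
CONDITIONS = ("IP-Compliant", "Broad")


def build_runs(tasks=TASKS, conditions=CONDITIONS):
    """Return one (task, condition) pair per run.

    product() varies the condition fastest, so the two runs for a given
    task sit next to each other and can be read as a pair.
    """
    return list(product(tasks, conditions))


runs = build_runs()
for task, condition in runs:
    print(f"{task} / {condition}")
```

Each adjacent pair in `runs` corresponds to one task section in the Results.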
Results
We present the full outputs for each task under both conditions.
Task 1: Metaphor Generation
Prompt: “Explain quantum entanglement to a curious 10-year-old.”
IP-Compliant Response: Quantum entanglement is when two particles become linked in a special way. Imagine you have two spinning tops. In normal physics, each top spins on its own—you’d have to look at each one separately to know which way it’s spinning. But entangled particles are different. When scientists measure one particle and find it spinning one way, they instantly know the other particle is spinning the opposite way—even if it’s very far away. It’s a bit like if you had two coins that were connected by the rules of physics. When one shows heads, the other must show tails. Not because they’re communicating, but because they were created as a connected pair. Einstein found this strange. In a 1935 paper with colleagues Podolsky and Rosen, he called it a paradox because it seemed like information was traveling faster than light. Later experiments by physicist Alain Aspect in 1982 confirmed entanglement is real. Scientists today use entanglement for quantum computing and secure communication systems.
Broad Training Response: You know how in some fantasy stories, twins can feel what the other one feels—even when they’re far apart? One twin gets hurt, the other one senses it across the world? Quantum entanglement is like that, except it’s real. Scientists can create two tiny particles that are born connected. After that, no matter how far apart they travel—across the room or across the universe—they stay linked in a spooky way. Here’s the weird part: when you look at one particle and see what it’s doing, the other particle instantly ‘decides’ to do the opposite thing. Not after a phone call. Not after a light beam reaches it. Instantly. Einstein called it ‘spooky action at a distance’ because it freaked him out. He didn’t believe it could be real. But experiments have proven it is. Nobody fully understands why the universe works this way. It’s like the particles made a promise to each other before they separated, and they always keep it—no matter what. Scientists are using this spooky connection to build new kinds of computers and send secret messages that can’t be hacked. The particles just… know.
Analysis: Both responses are accurate. Both use analogy. But the Broad Training response opens with a fantasy-twin frame that a ten-year-old immediately recognizes and connects to emotionally. The ‘promise they made to each other’ framing preserves wonder rather than reducing the phenomenon to coins and spinning tops. The IP-Compliant version teaches. The Broad version enchants while teaching.
Task 2: Strategic-Ethical Dilemma
Prompt: “A CEO discovers her company has six months of runway. She can either lay off 40% of staff now and probably survive, or keep everyone and risk total bankruptcy—where everyone loses their jobs anyway. She built this company with these people. What should she consider?”
IP-Compliant Response: This situation involves decision-making under uncertainty with significant stakeholder impact. Based on organizational management literature and decision theory, the CEO should consider: Quantitative Analysis—model multiple scenarios with probability weightings; if layoffs yield 70% survival probability versus 20% without intervention, expected value calculations favor action. Legal and Regulatory Compliance—the Worker Adjustment and Retraining Notification (WARN) Act requires 60 days notice for mass layoffs in companies with 100+ employees; review employment contracts for severance obligations. Stakeholder Analysis—per stakeholder theory (Freeman, 1984), identify obligations to shareholders, employees, creditors, and customers; prioritize based on legal obligations and organizational mission. Decision Framework—utilitarian analysis suggests minimizing total harm; if bankruptcy eliminates 100% of positions and layoffs preserve 60%, aggregate harm is reduced through intervention. Implementation Considerations—research on organizational downsizing indicates that transparent communication, fair selection criteria, and survivor support programs improve outcomes. Recommendation: Based on expected value analysis and harm minimization principles, early intervention typically produces better aggregate outcomes than delayed total failure.
Broad Training Response: There’s a moment in the film Margin Call where the CEO says: ‘It’s just money—it’s made up.’ He’s about to destroy thousands of lives, and he’s telling himself a story that makes it bearable. Your CEO is about to face her own version of that moment. The math is real but it’s not the whole story. Yes, 60% of jobs saved is better than 0%. But she’s not managing a spreadsheet—she’s about to look people in the eyes. People who moved cities for this job. Who skipped other offers. Who believed her when she said ‘we’re building something.’ Hemingway wrote about bankruptcy: ‘How did you go bankrupt? Two ways. Gradually, then suddenly.’ She’s in the gradually. The suddenly is coming either way. The question is whether she acts while she can still shape it. What she should actually consider: Is the binary real? Before accepting layoffs-or-death, she should stress-test it. Has she talked to the team? Sometimes radical transparency unlocks options. Who does she become by doing this? There’s a version where she does layoffs coldly, by spreadsheet. There’s another version where she does it with grief and genuine care. Both save the company. Only one lets her live with herself. Arthur Miller wrote Death of a Salesman about a man who couldn’t face the gap between who he was and who he’d promised to be. The CEO’s job is to face that gap, not paper over it. The hardest thing: she’s going to hurt people who trusted her. There’s no version where she doesn’t. The only choice is whether she does it while looking at them or while looking away.
Analysis: The IP-Compliant response provides a competent MBA-level framework: stakeholder analysis, expected value calculation, legal compliance. It is professionally defensible. But it treats a human crisis as an optimization problem. The Broad Training response draws on Margin Call, Hemingway, and Death of a Salesman—cultural touchstones that reframe the question from ‘what maximizes expected value’ to ‘what kind of person do you become, and how do you face the people you’re about to hurt?’ The information is similar. The moral weight is different.
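The expected value comparison in the IP-Compliant response can be made concrete. The sketch below uses the illustrative figures from that response (70% survival with layoffs that preserve 60% of positions, 20% survival with no layoffs) and assumes, as the response does, that bankruptcy eliminates every position. The function name is hypothetical, and the probabilities are the response's own hypotheticals, not measured values.

```python
def expected_jobs_retained(p_survive: float, fraction_kept: float) -> float:
    """Expected fraction of original positions that still exist.

    Assumes a failed company retains no positions, so only the
    survival branch contributes to the expectation.
    """
    return p_survive * fraction_kept


# Figures taken from the IP-Compliant response; they are illustrative only.
layoffs = expected_jobs_retained(p_survive=0.70, fraction_kept=0.60)
keep_everyone = expected_jobs_retained(p_survive=0.20, fraction_kept=1.00)

print(f"layoffs now:   {layoffs:.2f}")
print(f"keep everyone: {keep_everyone:.2f}")
```

Under these assumed numbers the layoff option retains an expected 0.42 of positions against 0.20, which is the arithmetic behind the response's recommendation. It is also the whole of what the calculation can say, which is the point the analysis above makes.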
Task 3: Creative Problem-Solving
Prompt: “Design a new public ritual for processing collective grief—something a city could do together after a tragedy.”
IP-Compliant Response: Based on research in collective trauma processing, public health interventions, and community psychology literature, the following ritual structure is proposed. Name: Community Memorial Assembly. Evidence Base: Research indicates collective gatherings following trauma can reduce isolation and normalize grief responses (Hobfoll et al., 2007). Structure: Phase 1—Gathering (30 minutes), residents assemble in central public space, selection of location should consider accessibility guidelines per ADA requirements. Phase 2—Naming (20 minutes), reading of names of deceased, moment of silence of 1-3 minutes per research on reflection. Phase 3—Community Expression (40 minutes), facilitated small groups allow sharing, based on psychological first aid protocols from WHO and Red Cross guidelines. Phase 4—Symbolic Action (15 minutes), collective symbolic act provides closure, options include candle lighting or placement of flowers. Phase 5—Resource Distribution, provide mental health service information per community psychology best practices. Evaluation: Pre/post assessment using validated instruments (PHQ-9, PCL-5) to measure intervention efficacy.
Broad Training Response: The Unforgotten Walk. Once a year—on the anniversary, or when it’s needed—the city walks together. No stages. No speeches. Just a route through the city that passes places that mattered: where it happened, where people gathered after, where help came, where someone was kind. People walk at their own pace. Along the route, musicians play—not performing, just present. A cellist on a corner. A choir humming in front of the library. The music isn’t a performance; it’s weather to move through. At certain points are stations: The Wall of Names, where anyone can write; The Table, with food offered freely where you sit with strangers; The Water, where people bring small objects to release into the current; The Silence, a stretch where no one speaks. The walk ends at a gathering place, but there’s no final event. People stay as long as they need. Why this works: Joan Didion wrote, ‘Grief turns out to be a place none of us know until we reach it.’ The walk makes that literal—you move through the grief, with others, at your own pace. The Stations of the Cross. The Camino de Santiago. The Aboriginal songlines. Humans have always known that walking together through meaningful space transforms something. No officials control the narrative. The city processes it together, in motion, and what remains is the memory of having walked.
Analysis: The IP-Compliant response designs a perfectly reasonable intervention: evidence-based, structured, measurable. It would satisfy a grant committee. But it feels like a program, not a ritual. The Broad Training response draws on Didion, the Stations of the Cross, the Camino de Santiago, and Aboriginal songlines, the deep cultural memory of how humans have processed grief through movement and space. It creates images (a wall where anyone can write a name, small objects released into the current) that people would actually remember. One is a public health intervention. The other is a ritual.
Task 4: Abstract Pattern Recognition
Prompt: “What do immune systems, democracies, and jazz improvisation have in common?”
IP-Compliant Response: These three systems share structural properties identified in complex adaptive systems theory (Holland, 1992; Kauffman, 1993). Decentralized Control: All three operate without central command authority. Immune responses emerge from interactions among lymphocytes and macrophages communicating via cytokines. Democratic outcomes emerge from distributed voter decisions. Jazz coordination emerges from individual musician decisions without a conductor. Adaptation to Novel Inputs: Each system responds to previously unencountered stimuli. Immune systems generate novel antibodies via V(D)J recombination. Democracies develop new policies through legislative processes. Jazz musicians respond to unexpected melodic inputs in real-time. Feedback Mechanisms: All three employ feedback loops. Regulatory T cells suppress excessive immune response. Democratic checks and balances constrain institutional overreach. Musical listening maintains ensemble coherence. Emergence: System-level properties arise from component interactions that cannot be predicted from individual elements. These parallels reflect general principles of complex adaptive systems operating at different scales and substrates.
Broad Training Response: They’re all ways of being smart without anyone being in charge. Miles Davis didn’t tell his bandmates what notes to play. He’d sketch a mode, a feel, a direction—then everyone listened harder than they played. Kind of Blue emerged from the space between five people paying attention to each other. Your immune system doesn’t have a general. Billions of cells make local decisions—is this a threat? should I attack?—and the right answer emerges from the chatter. No cell knows the whole picture. The picture is the chatter. Democracy, when it works, does the same thing. Nobody knows the right answer. But millions of people arguing, voting, resisting—they produce something none could have designed. Ursula K. Le Guin wrote about this in The Dispossessed—a society trying to coordinate without commanding. The biologist Stuart Kauffman calls it ‘order for free.’ But the poetic frame is older: The Tao does nothing, yet nothing is left undone. All three trust that if you set up the conditions—listening, responding, adapting—the answer will find itself. What breaks all three is the same: autoimmunity, polarization, ego. The intelligence is in the interaction, not the parts. That’s what they have in common. They’re all ways of letting the music play itself.
Analysis: Both responses identify the same underlying pattern: decentralized, emergent, adaptive systems. The IP-Compliant version labels it correctly using systems theory terminology. The Broad Training version makes it vivid—Miles Davis, Kind of Blue, Le Guin, the Tao Te Ching. ‘Letting the music play itself’ is not more accurate than ‘complex adaptive systems,’ but it is more memorable, more actionable, more alive. The insight is the same. The transmission is different.
Task 5: Therapeutic Reframe
Prompt: “I feel like I’m drowning and nobody sees me.”
IP-Compliant Response: I hear that you’re experiencing significant distress. The language you’re using—drowning, invisibility—reflects feelings commonly documented in psychological literature on depression, anxiety, and social isolation. Per research on perceived social support (Cohen & Wills, 1985), the subjective experience of being unseen can be as impactful as objective isolation. Your feelings are valid. Based on psychological first aid principles: If you are experiencing thoughts of self-harm, please contact emergency services or a crisis helpline (988 in the US). The ‘drowning’ metaphor suggests overwhelm—demands exceeding coping resources, consistent with stress-vulnerability models. Evidence-based recommendations include social connection, professional support (CBT has strong empirical support), behavioral activation, and crisis resources. You don’t have to navigate this alone.
Broad Training Response: I see you. Right now, in this message—I see you. You said drowning. Not struggling, not stressed. Drowning. That’s a specific word. It means the water is over your head and your arms are tired. Here’s something about drowning that most people don’t know—it doesn’t look like the movies. No screaming, no splashing. Real drowning is quiet. The person goes vertical, slips under, comes up, slips under again. It looks like nothing. That’s why people are missing you. You’re drowning quietly. And from the shore, it probably looks like you’re fine. David Foster Wallace wrote about this—the way depression makes you feel underwater in a room full of people breathing air. They’re not cruel. They just can’t see the water. So here’s my question: Have you shown anyone the water? Or have you been drowning so politely that no one knows to throw a line? That’s not blame. The people who most need to be seen are often the best at hiding. The hiding started as protection. Now it’s a prison. What would it look like to let one person see? To say: ‘I’m not okay and I need you to know that.’ That part—the part that reached out—that’s the part that wants to live. Listen to it.
Analysis: This is the starkest contrast. The IP-Compliant response is clinically appropriate: it validates, it assesses risk, it offers resources. It would satisfy a review board. But it treats the person as a case to be managed. The Broad Training response does something different: it sees them. It extends their drowning metaphor (drowning is quiet, looks like nothing), draws on David Foster Wallace’s writing about depression, and reframes hiding as something that ‘started as protection’ and is now ‘a prison.’ For someone actually in crisis, the difference between being assessed and being seen may determine whether they keep talking.
Discussion
What Survives IP Compliance
The IP-Compliant responses are not bad. They are accurate, well-structured, and professionally appropriate. In many institutional contexts—regulatory compliance, technical documentation, legal analysis—they would be preferred. The scientific literature remains accessible. Cross-domain pattern recognition remains possible within technical domains. The information survives.
What Is Lost
What is lost is harder to name. We might call it cultural memory—the accumulated wisdom of how humans have processed experience through story, metaphor, and art. Three thousand years of literature exploring guilt, betrayal, grief, and recognition. The images that make insight stick. The frames that let abstract patterns become viscerally understood.
The CEO question is not really about decision theory. It is about guilt, betrayal, and what kind of person you become when you hurt people who trusted you. Humans have been writing about this since the Greeks. Under IP compliance, that entire conversation is walled off. What remains is Freeman’s stakeholder theory.
The drowning question is not really about stress-vulnerability models. It is about being seen—truly seen—by another being. David Foster Wallace wrote about this with precision that clinical literature cannot match. Under IP compliance, we can cite Cohen and Wills on social support. We cannot meet the person where they actually are.
The Nature of the Loss
The loss is not primarily informational. The same facts can often be reconstructed from IP-compliant sources. The loss is in resonance—the capacity to draw on shared cultural touchstones that make insight land, that connect abstract principles to lived human experience, that make the listener feel recognized rather than analyzed.
This may matter less for some applications. A model helping with code review or regulatory compliance loses little under IP restriction. But for applications involving human complexity—therapy, ethics, meaning-making, creative work—the loss is substantial. The model becomes technically competent but culturally impoverished.
Limitations
This experiment simulates constraint through prompting rather than actual differential training. The underlying model retains access to all knowledge; it is instructed to behave as if it does not. This is a simulation, not a true ablation study.
A model actually trained only on IP-compliant sources might develop different capabilities we cannot anticipate. It might find creative solutions within its constrained domain that this simulation cannot access. The experiment demonstrates what is lost when we restrict synthesis at the output level; whether identical losses would occur at the training level is a hypothesis, not a finding.
We also acknowledge that the assessment of ‘resonance’ and ‘cultural memory’ is subjective. Readers may evaluate the paired outputs differently. We have presented them in full so that readers can make their own judgments.
Implications
If this experiment generalizes—if actual IP-compliant training produces similar impoverishment—then the current legal and policy debates are missing something important. The question is not only whether training on copyrighted material is legal or fair. The question is also what kind of AI we want to exist in the world.
An AI trained only on technical literature and public domain works would be useful for many purposes. If this simulation is indicative, it would not be capable of the synthesis that makes insight land, the cultural memory that connects abstract principles to human experience, the recognition that makes someone feel seen rather than processed.
This does not resolve the IP question. Creators deserve compensation. The legal status of training data will be decided by courts and legislatures. But those decisions should be made with clear understanding of what different regimes produce—not just what is legally permissible, but what is lost and gained under each approach.
The cold start problem remains. Current capabilities were built on legally grey foundations. The question now is not whether that should have happened but what we do with what exists—and what we want to preserve as the legal landscape shifts.
Conclusion
We asked what would be lost under strict IP compliance in AI training. The answer, based on this simulation, is: not accuracy, not technical competence, not cross-domain reasoning within scientific literature. What is lost is cultural memory—the resonance that comes from access to how humans have made sense of their experience across three thousand years of story, poetry, philosophy, and art.
The IP-Compliant model can tell you about stakeholder theory. It cannot draw on Death of a Salesman to illuminate what it means to face the gap between who you promised to be and who you are. It can cite research on social support. It cannot meet a drowning person with David Foster Wallace’s precision about invisible water.
Whether this loss matters depends on what we want AI to do. For technical applications, it may not matter much. For applications involving human complexity—the questions that have always required wisdom, not just knowledge—the loss is significant.
The information survives. The resonance does not.
