AI Copyright Regulation: Navigating Legal Challenges and Ethical Dilemmas

In this episode of "Your AI Injection," host Deep Dhillon and fellow Xyonixian Carsten Tusk dive into the emerging world of AI-generated content and the regulatory challenges it presents. The two discuss the legal distinctions between transformative and replicative works, the impact of AI on creative industries, and the ongoing debate over compensation and ethical use of publicly available data. With insights into ongoing legal cases and potential future regulations, this episode offers valuable perspectives on how AI companies can navigate the complex landscape of copyright laws while fostering innovation and creativity. Join us to explore the balance between protecting creative rights and leveraging AI's transformative potential!


[Automated Transcript]

Deep: Welcome to Your AI Injection, where we explore whether AI is our ally or adversary. I'm Deep Dhillon, here to guide you through the highly nuanced world of artificial intelligence. Together, we'll dig into all the amazing and sometimes frightening possibilities emerging in AI. Through our discussions, you'll learn how to harness AI ethically and effectively in your organization.

So I think the general thing that's got everybody riled up is that AI companies can suck down all this publicly available information and knowledge, whether it's music or text or whatever, and they can generate stuff off of it, be generative. People find a lot of value in that, but there's a question mark around it.


Deep: Who should get compensated? How should that be arbitrated? Who makes money off of it? Is it okay to take stuff and presume you can use it until somebody contacts you, the way it kind of evolved with the internet, or do you need to go out and get permission first?

Carsten: It started a while ago with basically generative art, generative art as far as imagery is concerned, right?

That was the first thing, where artists said: well, this thing produces something very similar to my work. And I feel like this whole discussion about music is essentially just a reiteration of that, this time with some players that have more money behind their lawsuits.


Deep: Tell me about the first case that you're talking about.

Carsten: Well, that's, you know, all the stable diffusion models that generate imagery. They were trained on images, and later on the artists appeared and basically said: hey, this is my work. Sure, it was public, but if somebody is recreating my work, I want to be compensated for it.

Deep: I did a little bit of digging around on what matters from a legal standpoint on copyright, and one of the concepts that's a big deal in music and other areas is what they call transformative works versus replicative works. So, with respect to these new audio lawsuits, I forget the names of the companies being sued, but there are a couple of music generation firms being sued by the RIAA. In the replicative sense, the law is pretty clear: you can't just copy their stuff and reproduce a song. You can't grab a CD, or an MP3 or whatever, record it a little bit differently, and then run around and claim it as yours.

That, the law is very clear on. On the other end of it, you are allowed to listen to it as a solo musician, be inspired by it, take inspiration, and have it affect your stuff. That's the kind of transformative work. So maybe Greta Van Fleet's a good example: a band that sounds virtually identical to Led Zeppelin, to anyone with a trained ear, but they seem to get away with it just fine.

Carsten: Well, I mean, that was my main argument in this whole story, right? And it applies to music as well as to visual art and paintings. Why is it different if a computer looks at this imagery and learns the styles and how they're created, versus an artist who goes around and is influenced by the music he hears and the pictures and videos he sees on the internet?

I mean, humans operate the same way. We find an art style that inspires us. We learn from it. We like it. We might even like it enough to replicate it or do something very similar, right? And so that is transformative work. It kind of takes an art style and progresses it.

Deep: It can be transformative if you're talking influence, right? My understanding of the legal scenarios is that you'd be in a much better place as an AI system if you put in some checks to make sure you don't actually generate an entire song, or even a small sample, identical to anything in the catalog you were trained on. Which seems like a feasible thing to do.

Carsten: Yeah. Well, how identical is identical, right? That's the question. I'm pretty sure none of it is going to be a pixel-perfect replica, so you can't go with true identity. So how similar is too similar? Same thing if somebody creates a replica of a famous painting: let's say the artist is still alive, there's a replica, and it's known to be a replica; he doesn't pretend to be that artist.

How similar is too similar? Is that legal?
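[Editor's note: the "how similar is too similar" question has a concrete engineering counterpart. One common approach is to embed both the generated piece and every catalog item as vectors and compare them with cosine similarity. The sketch below assumes a hypothetical embedding step has already produced those vectors; the 0.95 threshold is an arbitrary illustration, and choosing it is exactly the open question being debated here.]

```python
# Minimal sketch of a replication gate for generated content. Assumes a
# hypothetical embed() step has mapped each song/image/text to a vector;
# catalog vectors would be precomputed over the training set.
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    # Cosine similarity: 1.0 means the two vectors point the same way.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def too_similar(candidate: np.ndarray, catalog: list, threshold: float = 0.95) -> bool:
    # Block release if any training-catalog item is closer than the threshold.
    # Where to set the threshold is the legal question itself.
    return any(cosine(candidate, v) >= threshold for v in catalog)
```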

Deep: Yeah, I was curious about this, so I was digging around a bit. Apparently courts apply this kind of random-person-on-the-street logic. Back in the olden days, when Napster and all that stuff was an issue, copyright infringement cases came down to: hey, if it's a CD-quality, high-quality recording, and you downsample it, or you get a lower-quality MP3 out of it, it's still the same recording, because the average person on the street says it's the same recording.

That was basically the logic. But in our modern context, I think the essence of your question is a super valuable one. Like you were getting at: what's the difference between an AI system learning from publicly available works versus an individual human? I mean, there are a few differences that you and I would agree on.

One is just clearly scale. No human can read all of this stuff or listen to all of this stuff at the same scale and magnitude. So maybe that plays a role.

Carsten: But to counter that argument, you could say the AI also doesn't generate everything all at once, right? I could train an AI on just one genre and have it generate images in that genre, which would be very similar to a human art student who studies one style and then adopts that style and performs it himself.

The fact that I can take AI and put all these different genres into one box, I don't think that matters. I mean, you could pose the question as: would it be okay if the AI were only trained on one genre? Then you would have the human equivalent again.

Deep: Your point is that a given human could become an expert in a smaller subdomain and maybe even do replicative work. Let's say it was a pianist or a guitar player or something, and that's fundamentally no different from a machine; the machine is just replicating across a lot broader space, basically.

Carsten: Yeah.

Deep: I think the other thing that seems to be driving this conversation is fear. There are a lot of artists and musicians, and there seems to be this general fear that AI is going to take away our ability to be creative and be expressive, and that if too much power is given to the machines or models, somehow artists or musicians won't be relevant or needed, or they'll be stolen from. I don't know. What's your read on this? I have my own take on it.

Carsten: You know, that's kind of a tough subject. I think, when it comes to, let's say, commercial art? Absolutely true. But the question is, should it be called art?

Deep: What's commercial art, in your view?

Carsten: Like marketing graphics, commercial work. Not a painter like a Picasso who paints because it's his hobby; basically people who are painting to make money with it.

You could say the same thing about authors, authors who are just writing these serial novellas. I mean, yes, it is creative, but at the same time, it's just meant to generate more and more money, right? It's not somebody's life's work.

Deep: So you mean not art-world art, but maybe craft plus media assets around Hollywood? Let's say you're a commercial graphic designer, and you're hired to make illustrations for commercial magazines.

Carsten: Right. Your job might be going away. But that's not art. That's just work, in my opinion.

Deep: I don't know. I mean, I guess I think about it really differently. I think about it less as jobs going away.

I think about tasks, and jobs are different from tasks. As data scientists, we have tons of tasks that have already gone away, tasks we did 20 years ago, 30 years ago, or even a year and a half ago. And that's very different from jobs. So I imagine there's still going to be a role for graphic designers, because there's still going to be human subjectivity.

We're just going to start operating at a higher level.

Carsten: I agree, because you can compare it to construction, you know. Once upon a time, it took 20 people to dig a basement. Now you have one guy with a backhoe who does it in half an afternoon. And AI in many ways is exactly the same.

Very simple tasks have been replaced by machinery, and I think the jobs will just morph in a different direction. Now you have somebody controlling the machinery rather than doing the work themselves.

Deep: Yeah. At some point, going back to the creativity question I asked, hey, do artists and musicians have something to fear from robot artists and musicians? I think the general answer is: not really. Humans are fickle about what they like and don't like, and the second everyone realizes you can get a machine learning model to generate a Hollywood movie script in a few seconds, that just becomes the absolute baseline.

It has to be way better than that, because everyone's going to be like: I don't want to see that movie. I don't want to see a bunch of robots on stage playing classical music or something; it's just weird, it's uninteresting. And so people redefine their tastes, and their tastes are usually defined with social factors included, you know, fashion, dress, popularity, things in the zeitgeist.

It's the same way photography played out. When it first came out, everyone thought: ah, well, that's going to be the death of painting, because in those days people were commissioning somebody to come in and paint those portraits of the family or whatever. I don't know how much that exists today.

I'm sure it still does to some extent, but it certainly didn't cause the death of painting. Painting just kept going; you had all kinds of movements beyond that.

Carsten: Yeah, and it also boils down to: what do you call art, right? I see art as a form of self-expression. A painter creates a painting because it's something in his mind that he wants to bring onto a canvas.

That never goes away. It's still that artist's self-expression. What might go away is doing that serially and for commercial purposes; then his skill at doing that might be less valuable in the future. But the fundamental fact that somebody who uses art as self-expression is a great artist, I don't think that goes away.

And if I were to buy a painting that is really hand-painted by somebody, there's value for me in knowing: hey, I know the artist who painted that, right?

Deep: Yeah, there's definitely something to this whole artisanal movement that's happened in the last ten years, with the emergence of sites like Etsy, where people don't necessarily want the cheap, mass-produced, machine-made thing.

They start demanding handmade, hand-crafted stuff. And my guess is the same kind of thing's going to happen here: the AI systems are going to get really good at making all kinds of things, and there's just going to be less value attributed to that than to stuff that's handmade, at least with respect to this creativity question.

Carsten: And if I want something to put in my living room because I want a story to tell behind it, something I can look at and think about, then I put a lot of value into it being a handcrafted piece of art, right? But if I'm creating, say, a PowerPoint presentation and need some stock imagery for it, I don't care. Whether it's AI-generated or an artist made it really does not matter to me. It's just an illustration that serves a purpose; it doesn't have any deeper value to me. That's the one hand. On the other hand, I can also understand that a lot of people who until now were making a really good living with such commercial art are now a little bit afraid about their jobs or their future or what they're going to do. But it's just a shift in that industry, I think.

Deep: So how do we think about the bigger copyright question? Let's talk about ChatGPT, Anthropic's Claude, Google's Gemini, all that stuff. When we talk about the textual data they're trained on, there's a big question. I mean, there's kind of an assumption that they're all going to be writing really big checks to places like Reddit, places that have this content. And some places definitely feel threatened.

Carsten: So, I mean, I don't think Reddit has the right to sell their users' data. Actually, that's a topic in itself; I think nobody has the right to sell their users' data. My take on it is: if I can read it on the internet, free access, open access, then do with it whatever you wish.

If it's stolen content, if it was illegally obtained, or if it was behind some guardrail that was circumvented in order to access the data, then it becomes questionable, because clearly the intent was not to have it publicly available.

Deep: Yeah, but legally, that's not what the law says, right? The law is pretty clear on what is subject to free and fair use from a copyright standpoint.

Carsten: I don't think the law is very clear on that, because this is unprecedented, right? Everything that is publicly readable on the internet is free to read. You're not allowed to replicate it, you're not allowed to republish it, et cetera. But that's not what these machines are doing. These machines are literally just reading that content. That's it. They're not copying it, they're not replicating it, they're not storing it anywhere; they're just reading it.

Deep: Well, I don't know. I mean, we'll find out. There's no shortage of cases winding through the courts.

Yeah, I feel like the search engine evolution gives us a glimpse into how this stuff might evolve. If we rewind 25 or 30 years, the general MO was: just suck down all the publicly available content, build search engines, and figure out the legal stuff after. So you had companies, originally AltaVista, and eventually Google and Yahoo, pulling down all of this content, and they were mostly legally hiding behind fair use laws, where you're allowed to generate a snippet but you can't show the whole content.

I don't know if you remember, but back in the old days you used to be able to see the full cached copy. I think that got swiped away at some point legally; they weren't allowed to do that anymore.

Carsten: Well, they kind of justified it by saying the third parties get their traffic through the search engines, right? They become discoverable. So it's kind of a mutually beneficial relationship.

Deep: Well, it was, until the news agencies came along and said: now we're getting disintermediated here. Newspapers sold subscriptions historically, and people bought the whole paper and, in theory, read the whole paper.

Carsten: They still do, though.

Deep: Yeah, but, I mean, find a news agency that's not dying. There's a handful, but most of them are...

Carsten: A lot of them are behind paywalls, though, right? So you find the article on Google or something, and then you try to read it. But that's new, right?

Deep: Yeah, that's new. That's the last few years. But ten or fifteen years ago, even just five years ago, the paywalls were something they kind of jockeyed Google into agreeing to. Google was pushing for the longest time, saying: no, we get to do whatever we want with this content.

My point isn't about the nuances. My point is that, if you jump up a level, the strategy is: push the envelope, give people utility, get everyone addicted to the utility, and then duke it out in court over the next 20 or 30 years. I'm thinking that's what's going to happen in the AI arena. Same thing: get everyone addicted to it, so that the courts can't just suddenly yank it without impacting the economy at large.

Overwhelmed by data and unsure if AI is a risk or a resource? Consult with our data scientists at xyonix.com and let's explore AI's profound potential together.

Carsten: Yeah, I feel like it's the same old story, right? People want to be compensated. If whatever they're creating or publishing is used somewhere and somebody else is making money off of it, they want a share of that money. And I can see that argument. You could say: yes, we published this for free, you can read it on the internet, and now somebody else is taking it and making money with it. Because ultimately, these companies releasing chatbots are charging people for API use, so they're making money with that content. Without that content, they wouldn't be making a ton of money.

Deep: They're making a lot more than whoever wrote the articles.

Carsten: So there's an argument to be made there, right? But since there's no traceability to original sources, which is actually a huge problem with this whole thing, since there's no reference checking, nothing like that in LLMs, there's also no way to funnel any compensation back to those sources. So currently, technically, they don't have a way to compensate somebody for their content being used to generate an answer.

Deep: Well, isn't it Perplexity that has citations? Whenever they do some generative piece, they have links to whoever's content was used to generate the answer.

Carsten: If you do something like retrieval-augmented generation, yeah, you might find sources, because you know what context you provided to generate the answer, et cetera. But that's not the norm today, right?

Deep: No, it's not, but it's also something that Google and a lot of other folks are pushing on, to try to get true referenceability.

Carsten: Technically, they could do something like: hey, I have my model. Yes, the model would work on its own, but I can't allow that. So instead, I use my model together with my search engine: I search for relevant content, use that as context for my model, and then, if my model generates an answer from that relevant content, I can, you know, feed it back to them.
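[Editor's note: what Carsten sketches here is essentially retrieval-augmented generation with an attribution hook. The outline below is a minimal sketch of that idea, not any vendor's actual pipeline. The search, generate, and credit_source callables are hypothetical stand-ins, supplied by the caller, for a search engine, an LLM call, and a compensation ledger.]

```python
# Minimal RAG loop with source tracking. Because retrieval is explicit,
# every document that shaped the answer is known, so citations (and, in
# principle, compensation) can flow back to the publishers.

def answer_with_attribution(question, search, generate, credit_source, top_k=5):
    # 1. Retrieve relevant documents instead of relying on model memory.
    docs = search(question, top_k=top_k)  # -> [{"url": ..., "text": ...}, ...]
    context = "\n\n".join(d["text"] for d in docs)

    # 2. Ask the model to answer strictly from the retrieved context.
    answer = generate(f"Answer using only this context:\n{context}\n\nQ: {question}")

    # 3. Credit each contributing source; plain LLM inference cannot do
    #    this step, since training data leaves no per-answer trace.
    for d in docs:
        credit_source(d["url"])
    return answer, [d["url"] for d in docs]
```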

Deep: And in order to compensate people, you don't necessarily need per-response, itemized-level tracking. You could just put it on the front end, which is kind of de facto what will happen first, where you cut a deal with Reddit: Google and everybody gives them a billion dollars to access all their content. That's probably what's going to happen. And then they can do whatever they want.

Carsten: So, some sort of licensing scheme, exactly. We license your content.

Deep: And that works for the big folks, but it doesn't really cover, you know, small artists. In the search engine world, there was robots.txt in the early days. It was a voluntary standard, so it doesn't necessarily hold up that well in court, but a lot of the search engine companies were saying: hey, your robots.txt said we could grab the content, so we grabbed it. They still have ways for you to pull your content out if you don't want it in there, but people need it in there because they need the traffic.

Carsten: I was going to say, this doesn't exist today, but it would be a fairly easy thing to do in the future: basically have some mechanism where your content is either whitelisted or blacklisted for AI usage. We might be seeing something like that in the future, I don't know.

Deep: But we kind of have that, right? In a way, to grant unconditional access, you throw a Creative Commons license on it, and then everybody assumes: yes, it's fine. But there's a whole spectrum between black and white there. And on the banned end, if you physically can't get at the content, it's probably presumed banned.
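[Editor's note: the closest mechanism in use today extends robots.txt with AI-specific user-agent tokens that some crawlers advertise; treat the exact token name below as an example rather than a settled standard. A crawler that wanted to honor opt-outs could run a check like this, using only Python's standard library, before fetching a page for training data. Note that, as the hosts say, honoring it remains voluntary.]

```python
# Check a site's robots.txt before crawling a page for AI training data.
from urllib.parse import urlparse
from urllib import robotparser

def may_crawl(page_url: str, agent: str = "GPTBot") -> bool:
    # Fetch and parse https://<host>/robots.txt, then ask whether this
    # user agent is allowed to fetch this particular page.
    parts = urlparse(page_url)
    rp = robotparser.RobotFileParser()
    rp.set_url(f"{parts.scheme}://{parts.netloc}/robots.txt")
    rp.read()
    return rp.can_fetch(agent, page_url)

# A publisher opting out of AI crawling site-wide would serve, e.g.:
#   User-agent: GPTBot
#   Disallow: /
```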

Carsten: But also, the burden of proof is hard, right? Imagine two artists with very similar styles. One whitelists his content for AI usage; the other one doesn't. So now the AI sucks up one's content and not the other's, and the AI gets trained. Now somebody generates an image or a song with that AI, and it spits out something super similar to the work of the artist who didn't want his work used in AI.

Deep: Isn't this exactly what happened with the Scarlett Johansson thing? So Sam Altman and those guys, somehow they communicated with Scarlett Johansson and said: hey, we want your voice, because you were in Her and we think it's cool and it should be the voice for ChatGPT.

She said no, unambiguously. I think they just went back and hired an actress whose voice sounded like hers, and then they did it anyway.

Carsten: Yeah. People think they're unique. They're not.

Deep: I think that's a theme we should dig into. Everybody does assume they're really unique, whether as an author, as a writer, as a voice, as a behavioral pattern. But as data scientists, we know that's not true, because what we spend our lives doing is clustering these things into known behavioral patterns.

I feel like that's kind of the meta takeaway here.

Carsten: Yeah, and it's valid in everything, even scientific discovery. If you go back in history, think about how many scientific discoveries were made almost independently by different research teams at almost the same time.

Deep: Yeah, like Newton and Leibniz with calculus.

Carsten: No, we're not unique. There's replication.

Deep: And startups are another one. We're all just sitting around looking at the same information. Maybe we have a little bit of an edge and look at slightly different stuff than some other people, but people with generally the same skills as you and I are going to come up with generally the same ideas when the facts change a little bit.

Yeah, that's the part that I think is in some ways depressing about AI: this illusion humanity has gone through of feeling really individual and special, at least in the Western world, where individuality is more prized, versus the sort of increasing acknowledgement that we're not. The internet has that effect even. I remember when my son was in sixth or seventh grade, he was writing a paper, doing all this research on the internet. He wrote the whole paper, and right before he was going to turn it in, he Googled around and found that somebody had disproven his whole position, and he just deleted everything.

And I was like: nobody expects a sixth grader to be writing an original piece. It's fine; that's pretty cool. And he's like: no, I can't turn it in now, because it's not original. And I'm like: you're not a grad student. It's fine.

Carsten: I think you're right. Most of the things we used to do in the old days were localized, and now they're on a global level. And on a global level, you very quickly realize that, A, you're not as good as you think you are (you might be regionally, but not globally), and B, you're not unique either. There are many other people doing the things you're doing that you thought were great original ideas. But that's not a bad thing, right? It's just that the illusion we used to have while growing up in our little villages, shielded from the world prior to the internet, no longer exists.

Deep: So, I've got some questions that our producer Jesse put together for the show, questions you and I might normally skip past, but I think they make sense for bringing folks in. One of our questions was: how do AI companies actually get training data, and what ethical considerations, if any, do we imagine being involved? They crawl the web.

Carsten: They look for data sources on the web that are available. If they're available, they download them, put them into the pile, and the models get trained on the pile. Very often it's very similar to what search engines do: they crawl the web and look for data sources. And some companies have other resources. For example, Google has their whole library scan, and I'm not sure to what degree or to what end they have used all their Google Books material to train these models.

Deep: And then there are also a lot of public repos that are free to use, like Wikipedia, and a lot of task-specific training sets, for language translation and a lot of other tasks, that you can also pull in.

Carsten: Also more controversial stuff. Well, we don't really know, and that's the problem: you don't really know the details. For example, how did GitHub train its Copilot? Did they train it only on publicly available source code? Did they pay attention to the licenses of the publicly available source they trained it on? Was there some non-public material in that training data? The same applies to books or other sources that these companies use, right? And that's where the ethical questions come in.

Deep: I think that's a good point, because there's an increasing push on the legislative side for transparency there. I think there's a bill in the California legislature right now to force transparency: how did you train? What were the sources? Where did you get them?

That feels to me like a good first step, to at least know where this stuff is coming from. But I think you and I have been around long enough to know that there's what you say is in it and there's what's actually in it, and those aren't necessarily the same thing. Unless you're enforcing it somehow as a government entity, companies are going to lean toward being really private about that.

Carsten: Yeah. There's what you say publicly, what you claim has happened, and what has really happened. Because, again, there's no traceability. When these things are shown to the models, they're not stored; they can't be traced in there. Whether or not a model has seen a certain piece of content can neither be proven nor disproven afterwards.

So unless you have some internal whistleblower who says, hey, they illegally did this and this, it's never going to see the light of day. At the end of the day, we don't know.

Deep: Yeah. Going back to that question, let me throw a little more spin on it: how do they typically acquire the training data? There's all the crawling stuff, where you have bots that go to websites just like you do with a browser. But there's a lot of gray area there, and there are some red lines too. Some of the gray area: a lot of sites are set up specifically for human consumption, not bot consumption. So sometimes companies will play cat-and-mouse games, where they'll have banks of IP addresses rotating through their AWS cloud so that they look like humans, coming in with human-appearing browsers and pulling down content without looking like bots.

For example, if they're pulling down pricing information on consumer goods or whatever, that's often a cat-and-mouse game that gets played. That might be the gray area. The not-gray area that everyone generally agrees is okay is Creative Commons content. If it's publicly accessible, nobody went out of their way to block it, and there are no publicly declared licensing terms, that's probably a little more gray, but not too much.

Then there's the other stuff I'm describing that's more questionable. And then there's even more beyond that, like people cruising the dark web and grabbing material; that's clearly in the red zone. And all of it is after the fact: once you've built all your assets and built your model, you could in theory delete it all or hide it.

And nobody knows. Like you're saying, the model itself doesn't contain that information anymore.

Carsten: Right. And even if the model later reproduces something that looks very similar to an existing work, as we talked about earlier, that's entirely possible without it ever having seen the original, because that's how these models work.

Deep: You're listening to Your AI Injection, where we explore how to harness AI ethically and effectively in your products. Visit xyonix.com for guidance on innovating with AI.

So, here's another question: what challenges do companies face in ensuring their models are not inadvertently trained on copyrighted material?

There are a lot of companies that just blatantly ban their employees from using LLMs, which we both know is never actually going to work. People are going to go to GPT to do their work if it's faster, and they might do it from home or off their phone to circumvent getting caught.

But that kind of policy is not the same thing as what the people in the company actually do. Engineers and data scientists building these models are obsessed with the metric they're trying to improve, the efficacy, the performance. So if they have to grab stuff, they'll grab stuff, unless there's a lawyer sitting right next to them. There are a lot of challenges in just getting a corporate policy adhered to by a bunch of technical people who tend to be a bit freewheeling on their own.

Carsten: I mean, it's the same challenge search engines have when they do web crawls.

You can search for books that are copyrighted on Google, and sometimes you find them, because somebody puts one on their website, the crawler picks up the PDF, and then you can download it from there, because the search engine found it and you can search it there.

It is very challenging to monitor web crawls for data that may or may not be copyrighted. So if you have a web crawler that crawls the web to train your model, and you come across a site with a disclaimer that says, hey, this material is all copyrighted, then unless that's expressed in a standard way the crawler can understand and react to, it won't react to it. It'll suck the material down and use it anyway. So putting the burden on the people training the models to make sure there's no copyrighted material in there is very extreme, I think. If they want to do something, they need to come up with a standard where they really say: okay, this is how you flag your material as copyrighted and not to be used in AI.

And if you follow that standard, then the companies are obliged to adhere to it. Other than that, no guarantees.
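[Editor's note: one page-level version of the standard Carsten is calling for is a robots meta tag carrying an AI-specific directive. Some art-hosting sites have adopted a "noai" value, but it is a convention, not a ratified standard, so the directive names below are assumptions for illustration. A crawler choosing to honor it could check each fetched page like this:]

```python
# Detect a page-level AI opt-out flag such as <meta name="robots" content="noai">.
from html.parser import HTMLParser

class AIOptOutParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.opted_out = False

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() == "robots":
            directives = (attr.get("content") or "").lower()
            # "noai" / "noimageai" are conventions some sites use today.
            if "noai" in directives or "noimageai" in directives:
                self.opted_out = True

def page_opts_out(html: str) -> bool:
    parser = AIOptOutParser()
    parser.feed(html)
    return parser.opted_out
```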

Deep: Yeah, you're bringing up this whole can of worms around user-generated content, right? Even if you take a site like YouTube, where Google goes to pretty great lengths to try to ensure they actually have the rights to the stuff they're using, they didn't in the early days, right?

In the early days, anybody could just rip a CD and throw the song up onto YouTube.

Carsten: Still happens.

Deep: Absolutely, it still happens, but they do go combing and looking for copyrighted material. Your point is a good one, though.

Carsten: Because I was Googling yesterday for some music. I just Googled for, like, Foo Fighters greatest hits.

Deep: Yeah, but that's all there because somebody wants it on there.

Carsten: Not the Foo Fighters. Some dude.

Deep: Okay.

Carsten: So it's interesting. It depends, right?

Deep: Yeah. So people do all kinds of weird stuff, right? They'll skew the material. They might invert the video imagery; they might compress it. That's why a lot of video content will be sped up or slowed down: they're trying to circumvent detection. But I've personally uploaded a video that I shot, grabbed a song, put it on the video in iMovie, and put it up to YouTube, and within an hour I got a takedown notice. I was curious if they were actually checking.

Carsten: So they're checking, and some stuff slips through the cracks.

Deep: But you know, it's absolutely not a hundred percent, right? There's always a cat-and-mouse game there.
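[Editor's note: the detection side of that cat-and-mouse game is typically audio fingerprinting, which matches recordings by the pattern of their loudest spectrogram peaks rather than by raw waveform, so re-encoding or mild compression does not break the match. Below is a toy sketch of the idea; production systems such as YouTube's Content ID are far more elaborate, and a simple version like this is exactly what speed-shifting an upload tries to defeat.]

```python
# Toy audio fingerprint: hash pairs of dominant spectrogram peaks across
# adjacent time frames, then compare recordings by fingerprint overlap.
import hashlib
import numpy as np
from scipy.signal import spectrogram

def fingerprint(samples: np.ndarray, rate: int, peaks_per_frame: int = 3) -> set:
    _, _, spec = spectrogram(samples, fs=rate, nperseg=2048)
    hashes, prev_peaks = set(), None
    for t in range(spec.shape[1]):
        # The strongest frequency bins in each frame survive re-encoding well.
        peaks = np.argsort(spec[:, t])[-peaks_per_frame:]
        if prev_peaks is not None:
            for a in prev_peaks:
                for b in peaks:
                    pair = f"{a}|{b}".encode()
                    hashes.add(hashlib.sha1(pair).hexdigest()[:16])
        prev_peaks = peaks
    return hashes

def overlap(fp_a: set, fp_b: set) -> float:
    # Jaccard similarity of the two hash sets; values near 1.0 suggest
    # the same underlying recording.
    return len(fp_a & fp_b) / max(1, len(fp_a | fp_b))
```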

Carsten: And you also know that these big companies have literally thousands of people who do nothing but content moderation, basically paid to monitor all this stuff for copyright violations, violence, all kinds of things.

Deep: And the courts generally, as long as companies can prove their processes are reasonable, let them make the case that they're doing what needs to be done; then you can argue about how well they're doing it. But generally the courts don't just hammer the crud out of them if they're putting a billion dollars into, for example, copyright analysis. So, I don't know. But getting back to your point: say you go and pull down content off of YouTube, and you're a data science group that's just learning off the stuff from YouTube.

Even on a platform like YouTube, which is pretty well put together and tracking for copyright, stuff is going to leak through. So then the question is: how do we even enforce all of that, and what's really different with AI from the pre-AI world?

Carsten: Nothing. It's not different, I think.

If anything, I think it's less harmful than direct copyright violations. Because, again, we go back to the beginning of the talk: we now have a model, an AI, that gets inspired by the content it sees. If the content was on YouTube originally, any human can look at it and be inspired by it. Why shouldn't the AI look at it and get inspired by it?

Deep: I feel like there might be an obligation to prevent replicative content.

Carsten: I agree, because the moment you create replicative content, it actually becomes a copyright violation. If I generate an image that is an identical replica of some copyrighted work, and I use that image for commercial purposes, I commit a copyright violation. So I don't see where the problem is, to be honest.

Deep: I think it goes back to your question about how similar is too similar, right? When billions of people are using a model, let's say for assistively authoring text, and that model is spitting out, I don't know, three sentences from some copyrighted material, and it just ripples around the planet: is there a liability there?

Carsten: Feels like there is. But I think the problem goes deeper than that. I think the problem is really that the people who used to generate this work have become obsolete. And they don't want to be obsolete, so they're fighting it.

Legally speaking, if I have an AI that can generate stock art in a certain style, I do not need the two artists I had employed before, or the two stock artists I used to buy my content from. I just don't need them anymore. And we're quickly moving into a world where that just becomes...

Deep: Where they become trainers of the algorithm, basically.

Carsten: Yeah, yeah.

Deep: Yeah, I mean, I think it's a genuine question. But what are the specific ethical responsibilities of the AI innovators in terms of transparency and consent? From my vantage point, it comes down to being transparent about how you get your training data. That feels like the ethical piece: if you want to be an ethical company, you should be really straightforward and transparent about where you're getting your data from.

If it's only Creative Commons licensed content, fine, say so. If you're crawling in addition to that, say so. But also give people the ability to get out somehow. I think we may need a more powerful global standard there.

Carsten: I agree with you on the transparency. I think they should just be transparent about where the data is coming from and what data they're using. I'm a little bit torn on the consent. I don't think they need consent; I think the data is available until it's explicitly forbidden.

Deep: I lean that way too. It's allowed until you say you don't want it. But I think there needs to be a mechanism for you to say you don't want it.

Carsten: Exactly. And I think that could be something literally like that robots file, right? Or a certain licensing standard. The world needs to agree on it, so that everybody can actually adhere to it: come up with a standard for how I can explicitly flag my content as disallowed for AI use.

Deep: I don't know if a standard is actually going to happen, because we've had one for twenty-whatever years.

Carsten: Robots.txt.

Deep: Yeah. That standard has gotten pretty widespread use, but the courts don't acknowledge it that well. And then there are the licenses, like Creative Commons. I wouldn't call it a standard, but we have kind of a de facto ethical situation, where I think most companies that want to be on the fair side of things will only take content that has an explicit license they're comfortable with.

Carsten: But most content doesn't have a license, right?

Deep: That's where the problem gets hard.

Carsten: If I'm crawling the web, and I'm crawling a site that has articles with images, and 20 percent of the creators don't want their images used in AI, and the other 70 percent don't have a license on them at all, how would you associate a license with every single webpage? Not even that: with every single media item on a webpage. You would have to find a way to associate a license with that particular item.

Deep: What do the big sites like Reddit and the others do? They all have license terms for their content, do they not? And they're the bulk of the content on the web.

Carsten: No, not really. The web is much, much bigger than that.

Deep: But maybe you get to 70 or 80 percent through the massive sites. Big sites are really easy to work with.

Carsten: Well, no, even big sites are not easy to work with if you actually take the individual artists, the individual people, into account, right?

It's nice for Reddit to say: yeah, you can have all the users' content. But did they ever ask the users? Well, I guess they did, because the user signs up and agrees to some ten-page terms in order to use the site.

Deep: And the main point of those terms is that you give everything to Reddit.

Carsten: Exactly. So I don't think the big sites are the problem. They're easy to handle, and they're easy to deal with licensing-wise too; you could even pay them, like you suggested earlier. I think it's the long tail, the individual content: the digital artist who uploads an image to an artists' site because they want to share it with other artists, right?

Deep: That's where I'm going to predict there will be some information portals, to borrow an old term, that are very accommodating to people who really care about protecting their works. They're going to have the licenses, they're going to have the opt-outs and all that, and they're going to make it really explicit, on behalf of all their users, what the bots can do with the content. And you had that kind of thing happen even in the web era.

Carsten: And I think that's fair; that's really the only way to handle it going forward. As far as past content is concerned, that's difficult, right? It's not like these artists can suddenly go and pull all their work from the internet. It's impossible.

Deep: Well, that's kind of one thing that's different between AI and the search scenario. In the search scenario, Google built an index. The search is up, somebody searches, doesn't like something, and they can actually do something about it: Google can physically pull the stuff down if they need to. But with an AI system, the models are trained once every six months or whatever. Maybe you can fix things moving forward, but you can't go back and delete prior models that are still servicing API requests.

Carsten: Yeah, it'll be interesting. I mean, ultimately it becomes a legal question, right? How the courts will interpret copyright, and whether or not the use of existing content to train AI is copyright infringement. And I can't answer that. Do you throw it all away, recrawl the web, make sure you don't use any of that content, and train the models again? That would give you much worse models.

So do we want to go back to worse models? Or, like you said, do you come up with licensing agreements with the really big players, so you can cover 80 percent of the model training you have today with licensed content, train clean models with licensed content only, and completely forget about the gray zone of the web where you started initially?

Deep: My guess is that's probably going on right now. I'm sure Google's lawyers care a lot more than OpenAI's lawyers, or at least whoever's listening to the lawyers does. Google has an empire to protect, and in OpenAI's case it's an emergent one, so OpenAI has a lot more leeway. Maybe that's one of the reasons Gemini is so much worse: they're a lot more cautious about what they let in and what they don't. And in OpenAI's case, if I had to guess their strategy: stay ahead, charge a bunch for the API and the subscriptions, make cash, build a fortress, pay off the big players like Reddit and Quora and whoever, get your 80 percent of players there, and then just expect a never-ending slew of lawsuits over the next 50 years.

Carsten: Yeah. It gets more interesting if you look at what happens when you make all these models available for free, which many companies do, right? You could say it's for commercial purposes, but no, it's just free. Then that picture changes a little bit.

Deep: Yeah, I mean, Facebook's doing that, right? Meta's doing that with the Llama models.

Carsten: Yeah, they're not selling it in any way, shape, or form.

Deep: Right. But do they even tell you what's in it? Do they give you the training data?

Carsten: I think I read at some point what Llama was trained on. I think they published something, or somebody wrote a blog about it.

Deep: Oh, so they are transparent?

Carsten: Honestly, I forgot. They're pretty transparent about it, I think.

Deep: That's an interesting one, right? Because the free models are getting a lot better. Maybe they'll never be as good as the big super models that the big folks put together, but they're getting really good. And so then, for the artist argument, whatever we come up with to arbitrate and provide compensation, I don't even know how the free models are going to play in that, other than just being clear about what's in them in the first place.

Carsten: Well, the big problem is the hardware and the energy required to run these models at scale, right? Very few people can afford to run a 176-billion-parameter model, and that's a small one these days, on their own. The companies make the models available for free, but nobody can really use them unless someone hosts them and is willing to foot the bill, and somebody has to pay for that one way or the other.

Deep: So, last question; I think we're kind of wrapping up here. If we fast forward five years, maybe ten years out, with respect to copyright and AI in particular, where is all this stuff going to land?

Carsten: I think we'll solve these problems. There's this new technology, and last year was the big "let's find a use for this technology" phase, because we didn't really know: this is great, but what are we going to do with it? Now some use cases are crystallizing out, and we're finding ways to apply it in our day-to-day lives. These issues arise, and I'm pretty sure we'll figure them out. There are going to be some court decisions that lead the way in how people train these models and get the source data for them. But what that's going to be, I don't know.

Deep: be some regulation too, like in the EU, at least they'll, you know, there'll be, we

Carsten: already have these safety regulations,

Deep: all kinds of stuff coming out.

It feels to me like a very similar evolutionary line to what happened with search. I remember a lot of these same kinds of questions in the early days. But at the end of the day, five or ten years out, if I have to guess: we will still have AI models, they'll be a hell of a lot better than they are today, they'll be doing all kinds of super creative and super practical stuff, and the big players will definitely be getting compensated for their content.

The smaller players will probably get compensated, if they need to, through an arbitrator, but it'll be pennies on the dollar, so it won't matter much. That would be my guess as to what ends up happening.

Carsten: Yeah. And I feel like, as far as the creative people are concerned, they will learn how to work with these new tools.

I don't think these tools are a complete replacement. If you've ever used them, you quickly realize they don't do the job magically by themselves. It actually takes a lot of guidance and a lot of babysitting and direction to create something. Like I said earlier, it's kind of like the backhoe that replaced the shovel, you know?

Deep: Yeah, my hope is that having the blank-slate problem addressed, with these writing tools, for example, or similarly with musical composition, just raises everybody's performance level. The best of us at writing and the best of us at creating music get even better, because they have all these tools to stand on, and the worst of us get a lot better too, but that bar has also gone up.

So hopefully, a news entity writing third-grade content, like today, will go away one day.

Carsten: We are becoming more editors than ground-up creators, right? Which is still most of the responsibility and the hard work, because the composition is still up to you, the fact-checking is up to you, the fine-tuning is up to you, and realizing your creative inspiration with these tools is also still up to you.

Deep: Yeah, I think that's right. I think we're all going to become much better editors. And I would argue that great creators are already great editors, because they know how to edit their own books, their own written content, their own songs. If you talk to any great musician about how they wrote a song, 10 percent of the conversation will be how they got inspired to do it, and a bunch of it will be about the rewrite process.

That wraps up today's episode of Your AI Injection. I'm Deep Dhillon, thanking you for joining us as we continue to explore the highly nuanced world of AI, a world full of both tremendous promise and potential pitfalls. Stay tuned for more insights, and visit us at Xyonix.com, that's X-Y-O-N-I-X dot com, to learn how we can help you implement transformative AI solutions responsibly.