Anna's Archive: An Update from the Team

542 points 182 comments 3 hours ago
lolive

I choose the books I buy, from Anna's Archive. I choose the comics I buy from readComicsOnline. I choose the [european] graphic novels I buy from #WONTTELL.

And I am one of the best customers of these 3 physical shops, in my town.

So sure, I don't buy the latest trends based on ads. I investigate a lot to buy GREAT stuff. Sometimes the shopkeeper has a headache finding the obscure stuff I discovered online that NOBODY knows exists.

Am I an exception?

I don't know but those services are great to maintain a freedom of choice.

dfxm12

I don't think I follow. There's no recommendation engine in AA, right? Do you download a bunch of books from AA, read them, then if you happened to like one enough, you will buy it from a local bookstore?

sersi

I'm exactly the same. I tend to get the first book of any series that interests me and read a third before I decide whether to buy it or not. I do buy about 3-4 books a month (mostly epub drm free preferred) plus about 10 european graphic novels (paper books only) a month so I'm a heavy consumer I think.

Wowfunhappy

> Am I an exception?

Yes, I think you're an exception, sorry.

We will never have real data on this. But simply on its face, I find it extremely hard to believe that most consumers have a strong enough moral compass to go out of their way to buy something they already have access to. Maybe they will for a tiny handful of special books that they want hard copies of, or authors they really like, but not for most media they consume.

This type of system also becomes a popularity contest for creators; you are supporting the people you like as opposed to whose work you want to read. If an author says something you disagree with, it's easy to just read their work without paying them. I'm not against consumer boycotts, but it should generally come with a sacrifice on both sides--for consumers, that means missing out on the product or service.

You are free to feel however you want about this. I can certainly see the immense societal value of making things accessible to more people. But I flat out don't believe the "piracy doesn't lead to lost sales" shtick, of course it does.

ZunarJ5

https://gizmodo.com/the-eu-suppressed-a-300-page-study-that-...

From above:

'The Dutch firm Ecory was commissioned to research the impact of piracy for several months, eventually submitting a 304-page report to the EU in May 2015. The report concluded that: “In general, the results do not show robust statistical evidence of displacement of sales by online copyright infringements. That does not necessarily mean that piracy has no effect but only that the statistical analysis does not prove with sufficient reliability that there is an effect.”

The report found that illegal downloads and streams can actually boost legal sales of games, according to the report. The only negative link the report found was with major blockbuster films: “The results show a displacement rate of 40 percent which means that for every ten recent top films watched illegally, four fewer films are consumed legally.”'

ndriscoll

Books seem somewhat unique to me in that the physical product is better or at least different from the digital one, so it kind of makes sense to buy it even if you already have a digital copy. This is unlike e.g. streaming services where the paid service is strictly worse than the pirated one (e.g. no offline, doesn't work at all with some monitors/setups, only low bitrates allowed).

skeaker

Your other points aside...

> I'm not against consumer boycotts, but it should generally come with a sacrifice on both sides--for consumers, that means missing out on the product or service.

I'm curious as to why you feel this way, genuinely. The decision to boycott means that there is no sale, full stop, so no money is being handed over. Why does anything after that matter? The important part, the money, is already decided from the start.

glimshe

I would not buy a book after downloading it from Anna's archive. But that's the wrong question in my opinion. You should be asking why aren't most books available in a DRM free format?

The main reason to download "pirated" books is that they get rid of all annoying barriers that exist in "legitimate" copies. It's a better product.

gspencley

> But I flat out don't believe the "piracy doesn't lead to lost sales" shtick, of course it does.

I'm not as certain as you are. Correlation does not imply causation, but media sales have trended upwards in the age of piracy which leads to some interesting hypotheses.

A few years ago Shirley Manson (lead singer of the 90s band Garbage) accused YouTube of making its fortune off the backs of content creators - basically charging the entire enterprise as being one big exercise in copyright infringement. And yet the music industry, as well as Hollywood, seem to be doing better and better each year in terms of dollars made. Some of the distribution models have changed - broadcast and cable television are pretty dead in the water, but the entertainment industries in general seem to be doing better than ever. And yeah lots of individual artists are still getting raw deals from Spotify and labels etc. as they always have. But industry-wise, in terms of dollar amounts, it seems there's more money to be made than ever before from creating and selling entertainment.

The statement you made that I absolutely agree with is that it's hard to get real world data on this. An individual who is able to get free access to something may be unlikely to ever pay for that same thing. But the question "Does piracy hurt the industry's bottom line, or help it on the whole?" is a very difficult one to answer. And we have to consider the even harder stuff to measure. Things like: is a teenager who pirates recorded media more or less likely to buy merch and concert tickets? More or less likely to buy a special edition package with tangible collector items?

At the end of the day, I have no clue.

I also offer all of this being very pro-capitalism and pro-intellectual-property. I don't condone piracy. But if we're just looking at raw data and trying to form our hypothesis, we have to start with the fact that the raw data points to upwards trends on the whole.

Phelinofist

Can you recommend some of the obscure stuff?

hinkley

I was reading a book series from my local library and for reasons I don’t understand they were missing the third or fourth book in the series. Probably damaged or lost. I even thought I could check the local (especially used) bookstores, buy a copy and then gift it to the library, but there’s a new edition that has a completely different vibe and size, with 2024 prices so I thought better of it. So I’d heard of Anna’s Archive and I got it there. Then it turned out one of the last books was unavailable too, can’t recall if it was missing or someone else had it out and wasn’t going to return it any time soon.

I was just trying to finish this writer’s corpus on a reread of their later material. It’s not that I’m cheap. I own a paper and audiobook copy of several of my favorite books. Including this author, so I’ve paid her twice. I just avoided the trap some of my friends long ago were falling into of hoarding books, by only keeping books I intend to read again. So any completionist tendencies have always been resolved via library or electronic editions.

I’m getting older now, and my first real confrontation with my own mortality came up with books. I have several years worth of books even if I were retired and reading three or four a week. New things come out all the time, and new voices. I haven’t read some of these books in ten years or more. Am I really going to read them again before… So a couple years ago I reread Dune for what will likely be the last time and sold my ratty old yellow copies to a used bookstore. If I do it again it will likely be audiobook.

ofou

Shadow library maintainers deserve a Nobel prize for their contributions to humanity. Satoshi would be proud.

jancsika

Satoshi's pride:

* ability to fund shadow libraries without fear of censorship

* lists with a single item still count as lists

skeaker

To be fair, the theory with the whole coin thing is solid, and I'd say it should count as something to be proud of even if in reality it gets tainted by speculative investments.

notpushkin

aaronsw would be proud, too.

sleepyguy

Perhaps he could spare a few coins, chump change to him to help out.

vlade11115

Also, they provide a torrents list that anyone can seed and be part of the long-term preservation.

https://annas-archive.org/torrents

aniviacat

I'm surprised i2p torrents are still not popular enough to be offered as an option by sites like this.

I'd assume there are many people who don't help out purely because of legal fears, something i2p could help with.

gylterud

What is the status on I2P these days? I used to run a lot of stuff on it. It was a lot of fun. It was like this cozy alternative development of internet, where things still felt like 1997.

justin66

"Anna’s Archive itself has organized some of the largest scrapes: we acquired tens of millions of files from IA Controlled Digital Lending"

Not really helping in the big picture, here, guys.

Palomides

yeah, that's a really unfortunate shoutout that's going to be brought up in court.

om8

Why? They acquired books, that’s what they do

kleiba

The OP is referring to the ongoing legal struggles the IA is facing with respect to their version of an online library (with digital book lending).

boombapoom

fuck those guys, annas archive is one of the last good things about the internet.

akudha

I am curious how they’re funded. How they are able to stay online. Surely there must be people, governments etc with deep pockets that would want to take them down?

jampekka

You can donate to get access to faster download mirrors. I'd guess this is the main source of their revenue.

https://annas-archive.org/donate

notpushkin

I suppose it could also be their enterprise users, though there’s not a lot of info on this aspect of their activity.

Koshkin

> the last good things

Last but not least?

hereme888

Isn't it humorous how citizens are pro Anna's archive, but governments are against it? Bit of additional evidence for elitism and such.

thomassmith65

It is not humorous or strange because that formulation omits authors.

How many authors who write the books in Anna's Archive are happy about it?

I personally am pro Anna's archive (and sci-hub, etc) because I believe it benefits society to have better read citizens. That said, I have some misgivings, because under our current system, there are issues with law and remuneration.

jimbokun

What about writers?

MYEUHD

IIRC it was shown that piracy increases sales for books.

For example, if you pirated an ebook and liked it, you'd likely buy a physical copy.

black_knight

This is true for me! For authors I like, I might read a few epubs, then buy their entire series in hardcover (or paperback if no hardcover is available) to have in my bookshelves for rainy days.

thorn

Kudos to the team behind this project! It looks like they have improved the UI in the last year. The crucial problem right now is to remain accessible, or simply to survive. I have no idea how much effort is being put into it. I wonder, is it possible to remain afloat despite all the efforts to take them down?

jauntywundrkind

There was a pretty major UI update in the past 2-5 days-ish.

Apologies for the minor grumble, but on mobile I used to be able to browse search results much more effectively; the new design only fits ~4-5 results on a screen.

kelvinjps10

I'd like them to enable torrents for single files, like the Internet Archive does. Waiting so long to be able to download a file is kind of annoying.

raybb

Their volunteering system seems pretty well organized. Also might explain why I've seen so many comments over the years sharing about anna's archive.

https://annas-archive.org/volunteering

freefaler

BTW, this is very useful:

https://open-slum.org/

japaget

This site is down or inaccessible to me. What is in it and why is it useful?

tux3

That site has a list of shadow libraries, whether they are still operating, and where to find them.

computerdork

Know am going to be downvoted into oblivion, but as a composer, can see it from the side of creators. Yeah, making their products free is starving these industries. For instance, there is already very little money in music (think about how many musicians you personally know who can make a living off of music, besides being a music teacher). And the music industry is still not even the same size as it was in the 90's - global revenue in 2024 was $29 billion, while in 1994, it was $35 billion (and that's not even taking into account inflation).

Yes, there are many other reasons why the music industry fell, but when your main demographic can always go to bittorrent to get their music if prices are too high, then there is only so much you can do with the price of music.

Yeah, I remember the 90's, music was huge, and there were so many good bands (Smashing Pumpkins, Nirvana, REM, White Stripes... Or if you're more into popular music, Michael Jackson, Whitney Houston...). Now, music is de-valued and cheap and our music scene has been decimated. Personally, think we should try to find ways to support musicians, writers, thinkers, artists...

... but if you have a different opinion, no worries. But, if you can, give it thought.

dulpo

This is surprising. I thought last I heard they'd arrested the guy who was suspected of running the site, about a year or so ago. Guess I'm misremembering.

Also I'm surprised Cloudflare hasn't shut them down like they do for other dodgy sites.

lode

When accessing from Belgium the link is blocked by Cloudflare:

Error HTTP 451 Unavailable For Legal Reasons

In response to a legal order, Cloudflare has taken steps to limit access to this website through Cloudflare's pass-through security and CDN services within Belgium

clickety_clack

Man, I thought cloudflare stood in front of individual sites. When did they start becoming a filter on an individual’s web connections?

foobarchu

CF is in a position such that if they aren't cooperating with national laws, then they are actively hindering them. National governments don't like that, and will have ISPs block CF wholesale if that's what accomplishes their goals.

stavros

Eh, they can't block half the Internet.

celsoazevedo

To operate in Belgium, they have to follow local laws and comply with legal orders. They either make the site unavailable to local IPs or leave that market.

dulpo

Interesting. Seems to be only certain jurisdictions. I can access it no problem from the UK Vodafone network.

camtarn

I'm unable to resolve the domain on EE UK - looks like it's DNS blocked.

By comparison, on my work network (TalkTalk) I can resolve the domain but I get a connection reset from the site.

I think this might be the first time I've hit a DNS block. It feels rather eerie seeing people talking about a site that, from my point of view, doesn't even exist...

PaulRobinson

There's inconsistent censoring of numerous websites across the UK. In short, the biggest ISPs (a list which changes over time) will block various sites (TPB, libgen, AA, and others), based on court orders taken out at different times. In general, it's a good idea to use Private Relay if you're using Apple devices and have access to it, no matter what network you're on, and if you're doing anything you don't want your ISP to capture, you should be using a VPN and/or Tor.

There are a lot of legitimate reasons to want to use scraping sites that UK copyright law is not nuanced enough to protect, and so blanket bans just end up emerging at the demands of copyright owners (which more often than not, means Disney or Springer).

spaceport

It starts with one

teekert

Set proton VPN to Albania and enjoy the full internet is my experience.

spacedcowboy

Hmm. Even the title link above doesn't work for me on Virgin's cable, in the UK

dulpo

Do you see an error page / blocked page?

I used to get archive.org blocked and had to contact my provider to have the filters taken off.

spacedcowboy

Nope, it just takes forever, then eventually shows a blank screen...

barrell

Yep blocked by Ziggo in NL as well

telesilla

Whenever I'm in the Netherlands I need to set my DNS to 1.1.1.1 or similar, lots of blocks.

borski

Except that that’s CloudFlare, which is also blocking Anna’s Archive.

qualeed

Luckily it isn't the only public DNS.

8.8.8.8, 9.9.9.9, and many others exist.

noble-lombax

I actually didn't know there were more error codes beyond error code 429

Mogzol

There's "431 Request Header Fields Too Large" which you will see occasionally. But after that 451 is the only other 400-level error code above 429. It was chosen as a reference to the book Fahrenheit 451.
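For anyone who wants to check that claim, here is a minimal sketch using Python's standard `http.HTTPStatus` enum. It only lists the codes Python's standard library defines, which I am assuming tracks the registered 4xx codes closely enough for this purpose; the variable name `above_429` is just for illustration.

```python
from http import HTTPStatus

# Collect the client-error (4xx) codes defined by the standard library
# that sit above 429 Too Many Requests.
above_429 = sorted(status for status in HTTPStatus if 429 < status < 500)

for status in above_429:
    # Each member carries its numeric value and its reason phrase.
    print(status.value, status.phrase)
```

On a recent Python 3 this should list only 431 and 451, consistent with the comment above.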

mariusor

451 is kind of a novelty code, its meaning being related to Bradbury's "Fahrenheit 451" SciFi novel.

5555624

The two behind Z-Library were arrested in late 2022.

dulpo

Thank you, I think I must have got the details of that confused with the OCLC lawsuit.

mightysashiman

remember guys, it's not pirating, it's gathering data for AI model training purposes. Perfectly legal.

whirlwin

Just curious - What is the future of services like these? More and more content will be AI generated, to some degree. And should that content therefore be aggregated?

akkad33

Not sure what the link is between books and AI

curvaturearth

Pretty sure no one wants AI slop stored away forever even though that's the unavoidable future

thimabi

In the future, the curation function of libraries will become even more important. Libraries — even bookstores —, both physical and online, will probably use as competitive advantage their capacity to separate the wheat from the chaff. There's no value to a place where AI slop is prevalent.

jimsimmons

Also how can one totally anonymously pay them?

squigz

It doesn't look like they accept anything that strikes me as being remotely anonymous, which is surprising.

https://annas-archive.org/donate

I'll also say that when too much money starts becoming a part of this, trouble will increase dramatically. I realize this sort of endeavor costs a lot of time and money, but it's a line we should probably be aware of.

jimsimmons

Does anyone have discreet pointers for downloading all the data? What format is it usually?

stonecharioteer

Please remain up. Libgen no longer works. I've used IRC for fiction and non-fiction but tech books need Anna's Archive and Libgen. I buy the physical copy with company budget to pay the author but I need DRM free ebooks to read comfortably on my Tab S9 Ultra.

DyslexicAtheist

libgen is still there

duckkg5

Not accurate. You are probably looking at a site like https://libgen.ac/ which states clearly at the top: "Not a Part of Library Genesis. ex libgen.io, libgen.org"

The real one has been down for a long time.

gregorygoc

What’s the url?

slt2021

Anna's archives is possibly the greatest site ever.

Infinite love to the team <3

xtracto

Kind of... the fact that they have the actual data behind a "soft" paywall (waiting times and terribly slow transfers otherwise) makes me a bit skeptical of their "goodwill".

SimianSci

No such thing as free when bandwidth costs money. Any service online that is handing out things for free without restriction is getting its return through unscrupulous means and shouldn't be trusted. Anna's Archive straddles the line enough to allow people to download books for free but not at too great an expense to the volunteers who pay out of pocket to support the project.

Vektorceraptor

So what about the authors and creators of the works? They did it for free?

AIPedant

Information and well-crafted sentences are available on the Language Tree, easily plucked by anyone at zero cost. It's greedy for those so-called novelists and subject matter experts to expect a living wage.

"Information wants to be free," which means that any cost of producing that information can be abstracted away due to ideological inconvenience.

slt2021

they already work almost for free, since all the money goes to the publisher and retailer.

out of a $20 book, the authors earn about $1 - $1.5; for e-books it's about $1.7 - $2

The value from book sales goes to retailer and publisher: two large corporations, and in case of amazon - a single big corporation

so please cry me a river about amazon's lost profits earned at the back of the book authors

Aerroon

Governments. You forgot governments. They take the bulk of the money, especially in Europe.

~25% VAT and then the publishers and retailers take their cut. The government takes another 40% in income and payroll taxes from that. The leftovers are what the author gets.

Buying from yourself is probably the biggest markup you can get.

slt2021

yes, if you add VAT and remove taxes from authors' incomes, it becomes even more laughable.

it really might be better to publish for free and create a Buy Me a Coffee page

akkad33

Then what's the economic interest for writing a book

0cf8612b2e1e

Their backdoor plan to get rich! Not going to fool me this time VCs!!

Everyone involved is taking on significant personal liability and hosting expenses. Not sure what more you expect.

klik99

Yes spot on, crazy that asking for an optional pittance for less bandwidth throttling on such a huge and risky project can be seen as exploitative.

exe34

you should ask for a refund!

mattl

Bandwidth isn’t free of charge

bibelo

and hosting

nulld3v

I believe you only hit the paywall when you try to use the search engine & download individual files. They still offer the underlying data for free archival/mirroring via torrents.

baal80spam

annas-archive.li/blog, 2025-08-17

About recent events.

We are still alive and kicking. In recent weeks we’ve seen increased attacks on our mission. We are taking steps to harden our infrastructure and operational security. The work of securing humanity’s legacy is worth fighting for.

Since we started in 2022, we have liberated tens of millions of books, scientific articles, magazines, newspapers, and more. These are now forever protected from destruction by natural disasters, wars, budget cuts, and other catastrophes, thanks to everyone who helps with torrenting.

Anna’s Archive itself has organized some of the largest scrapes: we acquired tens of millions of files from IA Controlled Digital Lending, HathiTrust, DuXiu, and many more.

We have also scraped and published the largest book metadata collections in history: WorldCat, Google Books, and others. With this we’ll be able to identify which books are still missing from our collections, and prioritize saving the rarest ones.

Much thanks to all of our volunteers for making these projects happen.

We’ve forged some incredible partnerships. We’ve partnered with two LibGen forks, STC/Nexus, Z-Library. We’ve secured tens of millions additional files through these partnerships. And they are helping the mission by mirroring our files.

Unfortunately we have seen the disappearance of one of the LibGen forks. We don’t have further information about what happened there, but are saddened by this development.

There is a new entrant: WeLib. They appear to have mirrored most of our collection, and use a fork of our codebase. We have copied some of their user interface improvements, and are grateful for that push. Sadly, we are not seeing them share any new collections, nor share their codebase improvements. Since they haven’t shown commitment to contributing back to the ecosystem, we advise extreme caution. We recommend not using them.

In the meantime, we have some exciting projects in the works. We have hundreds of terabytes in new collections sitting on our servers, waiting to be processed. If you’re at all interested in helping out, feel free to check out our Volunteering and Donate pages. We run all of this on a minimal budget, so any help is greatly appreciated.

Keep fighting.

iLoveOncall

> In recent weeks we’ve seen increased attacks on our mission.

A pretty rich thing to say when your mission is piracy.

I'm not against piracy at all, quite the contrary, but this is quite laughable.

mavamaarten

Right? I mean I love what they're doing. But at the same time please, stop claiming to be holy angels trying to build an archive for historical purposes. You're a terrific piracy site, period.

sandspar

It's an interesting peek into their milieu. For those in the club, the statement might seem self-evident.

oguz-ismail

> We recommend not using them

I've been using WeLib since April and had a good experience so far

SimianSci

If efforts like this are to be sustainable in any lasting way, participants need to be cooperative, not parasitic. I agree with the Anna's Archive team: it serves no one to have one of the players in this space hoarding its own collections and not sharing them with other archiving projects. It makes the collection extremely vulnerable and at risk of becoming lost knowledge as time goes on.

jeron

I disagree with how this is framed. shadow libraries thrive on decentralization, any other servers mirroring a collection is better than no mirrors at all

SimianSci

I'm not sure how you disagree with this. Decentralization relies on multiple copies in multiple places. The fact is that WeLib is not allowing other libraries like Anna's Archive to mirror or copy their exclusive collection, hence the recommendation not to use them.

Otherwise, please explain how I am missing your point.

neilv

> If efforts like this are to be sustainable in any lasting way, participants need to be cooperative, not parasitic. I agree with the Anna's Archive team,

That's an odd combination.

Barrin92

>If efforts like this are to be sustainable in any lasting way, participants need to be cooperative, not parasitic

that is an odd demand for a site that thrives on piracy. Don't steal from the thieves? When you take from others it's liberation, when others take from you it's parasitic, that's certainly a convenient coincidence

carlosjobim

No honour among thieves.

andrei_says_

Let’s have the person who does not use any LLMs throw the first stone.

keroro

Why use them over annas archive?

oguz-ismail

cleaner interface

max_

The entire internet needs to be re-designed to stand up against attacks.

- DDOS attacks

- Spamming

- UK like surveillance laws

- LLM scraping

Why is it that there is almost no initiative for this?

grues-dinner

The Internet has been redesigned. It's just not been redesigned with your interests in mind and at least some of the "attacks" are features to the right people.

theturtletalks

The precursor to Bitcoin was this interesting project called Hashcash. It was built to combat email spam and forced the sender to spend compute solving a moderately hard hash puzzle and put the result in the email header. The person who receives the email can easily verify that the sender "paid" the cost.
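To make the idea concrete, here is a simplified sketch of that kind of proof of work. It is not the real Hashcash stamp format: the `resource:counter` stamp layout, the use of SHA-256, and the function names `mint`, `verify`, and `leading_zero_bits` are all assumptions made for illustration. The point is only the asymmetry: the sender has to try many counters, while the receiver checks with a single hash.

```python
import hashlib
from itertools import count


def leading_zero_bits(digest: bytes) -> int:
    """Count the leading zero bits of a hash digest."""
    bits = 0
    for byte in digest:
        if byte == 0:
            bits += 8
            continue
        # Zero bits at the top of the first non-zero byte.
        bits += 8 - byte.bit_length()
        break
    return bits


def mint(resource: str, difficulty: int) -> str:
    """Sender side: brute-force a counter until the hash is 'rare' enough."""
    for counter in count():
        stamp = f"{resource}:{counter}"  # hypothetical stamp layout
        digest = hashlib.sha256(stamp.encode()).digest()
        if leading_zero_bits(digest) >= difficulty:
            return stamp


def verify(stamp: str, resource: str, difficulty: int) -> bool:
    """Receiver side: one hash confirms the work was done for this resource."""
    if not stamp.startswith(resource + ":"):
        return False
    digest = hashlib.sha256(stamp.encode()).digest()
    return leading_zero_bits(digest) >= difficulty


if __name__ == "__main__":
    stamp = mint("alice@example.com", 12)  # about 2**12 hashes on average
    print(stamp, verify(stamp, "alice@example.com", 12))
```

Each extra bit of difficulty doubles the sender's expected work while leaving the receiver's check at one hash, which is what makes bulk sending expensive and single messages cheap.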

progval

There are, but they each have their tradeoffs.

Proof of work and micropayments (eg. Xanadu or Internet Mail 2000) schemes solve spamming and LLM scraping, but are more expensive or more CPU-intensive.

P2P systems like FreeNet too, but they are harder to use and more storage intensive and make it easier to spy on individual users.

Tor solves UK-like surveillance laws but it's slower and makes it easier to spam.

freefaler

Decentralization and interoperability, including the TCP routing protocols, give the network the ability to grow freely, but make those kinds of attacks easier.

The easiest way to mitigate those problems would be to decrease the openness and centralize more. That might lead to even worse things than DDoS.

GuB-42

RFC-3514 [1] proposed an effective solution against attacks.

So see, there are initiatives, but people treat it as a joke, maybe because of when it was released.

[1] https://www.ietf.org/rfc/rfc3514.txt

uberman

Out of curiosity, do you see the archive in question as being part of the problem or that it needs protection from the issues you raise?

butchkass

Go right ahead

anon191928

because they will come after new design? how do you not see this?

monster_truck

I'll start the wiki

meindnoch

I'll design the logo!

IAmBroom

I'll make a GUI in Visual Basic!

exe34

I'll bring my axe!

spogbiper

i'll make snacks

dulpo

Redesigned like how?

ilovefood

I fully agree. It's difficult though because I genuinely believe that the solution space overlaps with cryptography, which is quickly discounted as a viable option because it is now laden with negative connotations.

goku12

Cryptography has negative connotations? Like what? Do you mean cryptocurrency by any chance? (If so, it's feasible to practice cryptography without touching cryptocurrency).

gia_ferrari

Not op, but in my bubble:

- DRM.

- Owner-unfriendly device locks (such as manufacturer-controlled secure boot or locked-down OSes).

- Inability to audit network traffic from one's own devices, i.e. an IoT device.

- Remote attestation, when in opposition to open computing.

I could also see folks seeing the use of cryptography as "having something to hide" - I don't personally agree.

vpribish

nah. cryptography is not seriously held back by cryptocurrency

squigz

Because the vast majority of people don't want this, and not for some nefarious reason or because they're stupid, but because we don't want to enable blatant fraud and abuse, among other things.

(Not to mention the astronomical technical work it would be; you can't just replace "The Entire Internet")

exe34

the problem is that anybody who does that work will be targeted very quickly by the people in power.

even if it's decentralised, it'll be banned one way or another and you'll be hunted down.

random3

"Be the change you want to see in the world"

NoMoreNicksLeft

I dread these. I still remember the rarbg announcement from a few years back I saw here. Do I even dare click the link?

HedgeMage

Not that scary. Click it.

crest

They just announced that they're still in the fight.

ronsor

I think you'll be happy if you do

revskill

OpenAI needs to train their models on these books, not Stack Overflow or Reddit.

burkaman

They do: https://xcancel.com/vxunderground/status/1888019174133276846, https://www.theverge.com/2023/7/9/23788741/sarah-silverman-o...

The tweet only names Meta, but it would be very surprising if OpenAI didn't do the same thing.

CamperBob2

Anyone who doesn't train on all material available, legal or otherwise, will be outcompeted by teams that do, including those based in countries that don't respect Western copyright law. It's that simple.

Either this practice is judged (or legislated) to be fair use, or copyright is done. It's also that simple.

atrettel

I'm not convinced that LLMs and other AI models need to train on all material available. A representative sample is better.

I'll ignore the legality aspects in my response. I think coming up with a representative sample of all relevant information would be better in the long term (teams will not be outcompeted on long time horizons). Why don't the companies do this? Because it is easier to just "carpet bomb the parameter space" and worry about the potential confounding [1] and sampling bias [2] later. Coming up with a representative sample requires domain expertise and that is expensive in terms of time and money. But it reduces the total amount of training data and should reduce the amount of time and resources it takes to build the models. That may matter now that models are quite large.

This is definitely a design decision with tradeoffs on both sides. I can entertain the notion that we don't have time to sample things, but I think we are all too often dismissing the long-term benefits of proper sampling.

(In terms of the legality aspects, judges are trying to "split the baby" [3] in my opinion by saying that training on stuff you got legally is OK but training on pirated material isn't. So nobody is going to recommend training on pirated material in the first place.)

[1] https://en.wikipedia.org/wiki/Confounding

[2] https://en.wikipedia.org/wiki/Sampling_bias

[3] https://www.404media.co/judge-rules-training-ai-on-authors-b...

sigseg1v

Outcompeted in the competition of what, exactly? How quickly they can produce inaccurate garbage?

9dev

Or neither happens and the corporations will just continue to evade laws and taxes to their benefit.

spaceport

Quality. The transformable value in all data is not equal.

alfalfasprout

So, what? Authors and rights holders are supposed to just take it?

Copyright law exists for a reason. Trying to improve an LLM doesn't give you the right to flout our legal system. Yes, other countries might have an advantage in LLM training as a result but so be it.

crazygringo

> Authors and rights holders are supposed to just take it?

If it's judged as fair use, then yes. And then it's not flouting anything.

Remember the whole point of fair use is to benefit society by allowing reuse of material in ways that don't directly copy large portions of the material verbatim.

For example, nonfiction authors already "just take it" when reviews describe the main points of their book without paying them a cent. The justification is that it's for the greater good, and rights are limited.

dns_snek

> Remember the whole point of fair use is to benefit society by allowing reuse of material in ways that don't directly copy large portions of the material verbatim.

That's a rather bastardized and twisted representation of copyright and fair use.

The "whole point" of copyright was to promote the authorship of original creative works by legally protecting the financial income of those authors. The "whole point" of fair use was to make exceptions in cases where it's clear that the usage doesn't result in a market substitute and deprive original authors of their income.

The end-goal of LLMs is to ingest all of that original content and reproduce it with expert-level accuracy, promising to be the know-all, end-all product. If the wildly optimistic predictions of LLM proponents turn out to be correct, then people will never buy a book again; they will have no reason to. And this is precisely what copyright was designed to protect authors against.

atrettel

Judges have recently ruled [1] that training on legally obtained materials constitutes fair use, but we will have to see in the long term if that ruling holds up.

[1] https://www.404media.co/judge-rules-training-ai-on-authors-b...

Night_Thastus

>the whole point of fair use is to benefit society

I'll stop you right there - I really don't think that applies at all. Does 'society' really benefit when the whole thing is a funnel for enormous amounts of wealth to go to already-gigantic companies like Microsoft?

CamperBob2

Yes, if it helps me get my own job done more effectively, efficiently, and economically. That's how our society works. You and I benefit from this, too, not just Microsoft.

If you don't like it, there's a process for changing how it works, but don't expect an easy path to success. Various people will object, and will have to be won over to your way of thinking.

bfrankline

> Remember the whole point of fair use is to benefit society by allowing reuse of material in ways that don't directly copy large portions of the material verbatim.

How do you think masked language models work?

bee_rider

It seems like it could conceivably be fair in some sense, as long as the models were actually released as open-weights (for the benefit of society).

hyperman1

Copyright law indeed exists for a reason. And that reason was that church and crown felt threatened by the power of printing presses to distribute ideas they couldn't control. 'To promote the useful arts' has always been a way to sell the idea to the masses.

CamperBob2

"...but so be it."

That phrase is carrying a lot of water, isn't it? Trillions of dollars worth by some estimates.

bugufu8f83

They do, don't they? I think OpenAI uses libgen.

Meta managed to get into a private ebook torrent tracker called Bibliotik a few years ago to use for training Llama and the resulting publicity essentially killed the tracker.

neilv

People like this, because people like free stuff, and like to rationalize getting free stuff. Occasionally, someone who likes free stuff styles themself a freedom fighter, though their values do not otherwise seem to extend beyond getting free stuff.

Some AI company techbros like this data trove even harder, and limit their pretending to publicly saying things like "we're changing the world" (and "AI could be bad if you don't give us money and lock out competitors") but really only care about wealth and power.

Certain sanctioned countries that culturally value literature and science might also appreciate this. (This last category, I'm much-much more sympathetic to, and wish them well in their intellectual pursuits and appreciation of the humanities, though we should really find a better way to share that doesn't undermine Western economies and many people's livelihoods.)

agentcoops

I share your concern for the livelihood of authors (and your skepticism regarding the naiveté that often surrounds pro-piracy rhetoric), but I don't think that's fair to the question here. Unlike in the case of music or film, most users are not just trying to get the latest NY Times best-selling novel. The percent of books made accessible through these services that are tied to an author's income through consumer sales is negligible. Most specialist literature, whether in the natural sciences or the humanities, is priced under the assumption that university libraries are the ones making the purchase, often more or less automatically. Yet even and perhaps especially in the US (I know nothing of the library culture in certain sanctioned countries), it's increasingly rare that university libraries have open stacks for non-students and there are incredibly few public libraries that actually provide access to scholarly works, past or present -- New York Public Library and the Library of Congress in DC are the ones I've used personally, but I'm sure there are a handful of others.

Moreover, the countless AI companies now buying and pulping copies of every book in existence seem to be really changing the used book market. Prices are going up dramatically, and before this year it was very rare not to find a single copy in the world of whatever old book one desired.

As someone who spends a disproportionate amount on books and shares your concern for not making life even more difficult for authors, these services going away would be a tremendous regression.

_DeadFred_

Don't forget the video piracy thread had a lot of justification to the effect of 'the people that work on these shows/movies don't get paid enough anyways, so it's ok for me to pirate'. Wait, so you think they should get paid more for their work, that what they do is worth being paid for, just not by you? Weirdest flex.

brianstorms

Fuck that site. Offers people links to free PDF downloads of my book that I worked on for 32 years and finally got published by Pantheon Books in 2017. I didn't work all that fucking time for criminals like these to just break copyright law and make the book available for free. Fuck Anna's Archive, and I hope they go down in legal flames ASAP.

gaudystead

I'm sorry you feel that way and it's understandable to be frustrated by them allowing piracy of something you've worked so long on.

That being said, do you know if their offering of your material has had a significant impact on your revenue, or is it more the principle of the matter?

shortstuffsushi

This strikes me as a bit ironic, if you're serious, as you list your current work as covering the entirety of the Beatles discography. Are you paying them for the rights?

pavel_lishin

I don't think this is a useful path to go down; there's a legal precedent for cover songs, and perhaps he did pay the fee: https://www.nolo.com/legal-encyclopedia/question-when-mechan...

Rotundo

I bought your book. I would never have discovered your book if it wasn't for a shadow library (not Anna's). The topic is rather niche and there is no marketing to speak of.

How do you expect people to find your book?

Also, but too late now, if I had known your attitude, I'd not have bought your book.

pavel_lishin

I wonder if the people who downloaded it for free (has anyone actually done so?) would have ever paid for it.

dd_xplore

I think they shouldn’t publish books which are fairly new. Hurts the authors…

trinsic2

Cultures are created to protect power structures. Culture is the enforcer of authority. Culture distorts principles in order to defend the authority of evil. Culture must convince you that it is not wrong when law subjugates your worth and destroys your freedom. Culture convinces people of this by perverting the concept of morality. Morality is liberty. Immorality is evil. The exercise and defense of freedom are moral. The destruction of freedom is immoral. This is the pure truth of morality. Prudence is the proper application of principle. Imprudence is foolishness. Prudence is not morality. It is not immoral to kick a heavy stone with your bare foot, but it would probably be foolish. Prudence is a question of applying the principles and wisdom you have gathered in your life to achieve the goals you have for yourself. This is made possible by liberty. Without liberty, prudence is meaningless. Morality must come before prudence. The great lie of culture is that authority is not bound by morality, and that authority can enforce its own prudence upon you. The great lie of culture is that you are worth less than law. Cultures teach that intentions of prudence can be enforced by law. In this fashion they gain excuse to control the lives of people. In order for people to learn, grow, and find happiness, people must be free to test their understanding of principles. With freedom, they can do this by a process of faith, trial and error. In this fashion children grow from immaturity to maturity. In this fashion human beings gain wisdom. Cultures are agents of evil. The objective of evil is the damnation of your ability to grow strong in wisdom. The objective of evil is the destruction of your worth. In order to gain control over you, culture spreads the lie that authority is not bound by morality. It teaches that authority can destroy freedom at will, and claims prudence as the reason you should willingly submit. 
In the name of defending you, culture claims that the destruction of freedom is morality. Cultures pretend that evil is good and that good is evil. Prudence can be found all around you. It is found in the choices you make every day. Even when a mistake is made, you learn prudence. Prudence cannot be enforced. To enforce prudence is law. Law is lie. Without the freedom to choose, you cannot learn prudence. You cannot be happy. Morality can be found all around you. Wherever you find it, you will find joy. Wherever you find immorality, you will find misery. Culture enforces authority by destroying freedom with law. This is immorality. - The End of all Evil, Jeremy Locke

You have invested in an idea that has been created by power structures through culture: that you are getting harmed by someone else's freedom. The people that will/want to support your work will do so out of a desire to do so, not because law says it's right.

Many people are deceived that law breakers are immoral and harmful to society, but I don't think that's the case. The people that care too much about copyright are too invested in demanding a return for their efforts. Whatever happened to the priority of making the world a better place first and foremost and having faith that you will be compensated in some fashion for your efforts?

cakealert

Can Anna's Archive claim to be a non-profit when it's effectively an illegal enterprise with unknown controllers?

They are even offering decent bounties: https://software.annas-archive.li/AnnaArchivist/annas-archiv...

Whoever is running it must be doing really well for themselves laundering all that crypto.

Also, interestingly, they don't offer a Tor onion service, while the admin is most certainly technically competent to administer one given that he no doubt uses Tor to insulate himself from his enterprise and launder crypto. What is the reasoning for that?

teraflop

Your comment seems like a non sequitur to me. Whether something is a "non-profit" has nothing to do with whether it receives or spends money. (See, e.g. the American Red Cross's ~$4B/yr budget.) It's about what it does with the money it has.

Obviously, since Anna's Archive is breaking the law, it can't conform itself to the normal legal/regulatory system that governs non-profit organizations. It can certainly still claim to be acting in the spirit of a non-profit, and it's up to you to decide whether you trust that claim. Nobody's forcing you to give them money.

cakealert

The connotation of a non-profit is that it's being audited. It would be extremely silly to suggest otherwise.

teraflop

It may have that connotation to you, but in general (at least in the US) non-profit organizations are not required to have independent audits. Typically, that requirement only happens if they receive a certain amount of government funding. An organization may choose to undergo audits in order to make people feel better about donating to it.

I really, really don't think that anybody is being fooled or misled into thinking that Anna's Archive is a "legitimate" audited organization when they describe themselves as a non-profit.

addaon

> The connotation of a non-profit is that it's being audited.

This is very geography-specific. In the US, 501(c)(3)s (what most people think of when they say "non-profit" where I am) have no general requirement for audits. There's also plenty of non-profit-by-some-definition organizations that never file a Form 1023, giving up some benefits of the 501(c)(3) regulations but in exchange being even less regulated.

Projectiboga

The entities are regulated at the state level in the USA, with the responsibility to comply with both state and federal tax authorities.

badlibrarian

Audits have nothing to do with it; all entities are subject to audit.

The primary difference between a non-profit and a for-profit is that a non-profit does not distribute profit to shareholders, including the founders.

cakealert

Audit or threat of audit is the mechanism of enforcement and that is all that ever matters.

pdabbadabba

At least in the US, claiming that you are a nonprofit implies that contributions are tax deductible. Claiming that you are a nonprofit when contributions are not tax deductible might be considered fraudulent.

anigbrowl

Not true. There are different classes of nonprofit and they are not all tax deductible. Some nonprofits opt to forgo pursuing that status because it involves a lot of extra administration/filing requirements.

jrflowers

They are already very much in breach of US law, which they have always been clear about. That aside, they don’t claim that contributions to them are tax deductible.

I would love to see someone try to explain to the IRS why all those purchases of Amazon gift cards and Monero for the transparently illegal organization should be deductible though

gowld

Is Cosa Nostra a non-profit? The question doesn't make sense. It's a category error.

A non-profit is a corporate legal structure. An unregistered organization could be a cabal, a gang, a syndicate, a fellowship, a religion, a movement, a private club, or something else.

nine_k

The intent is still important. While from a legal point of view a terrorist cell cannot be registered as a non-profit, it typically spends whatever funds it can secure to further its political goals, not on increasing the wealth of its owners or participants. A typical criminal band though is a for-profit entity.

SimianSci

Given the amount of hosting and storage needed to sustain this project, nobody is getting rich off of donations. Not to mention the lifestyle tradeoffs that inevitably come with international fugitive status do not lend themselves to a very comfortable life.

The usage of crypto is entirely one of necessity, as controlling information and knowledge is something powerful people have clear stakes in. Many countries wield their financial systems to hold or acquire power. Information and knowledge is one form of such power.

Everything points to the Anna's Archive team being passionate ideologues as opposed to some criminal enterprise focused on profit motives.

cakealert

> Not to mention the lifestyle tradeoffs that inevitably come with international fugitive status do not lend themselves to a very comfortable life.

Anonymous international fugitive?

> Nobody is getting rich off of donations.

How can anyone aside from the beneficiary know that?

The extent to which the controller can get rich off this enterprise depends entirely on the unknown quantity of donated funds (and deals with AI companies) and his skill at laundering crypto (which darknet marketplace controllers doing far more illegal stuff can do).

iLoveOncall

> Given the amount of hosting and storage needed to sustain this project, nobody is getting rich off of donations.

They're getting donations as much as megaupload was getting donations for premium accounts...

People pay for higher bandwidth and no wait time, not to support the "cause". It's a farce to call these donations.

And obviously people do get rich off of it, as you can see from the slew of file hosting services.

southernplaces7

Illegal doesn't at all have to mean immoral or particularly wrong either. Laws are complex constructions, often created for decidedly hypocritical reasons of benefitting some at the expense of others.

Thus, who gives a shit if they're taking money from those who voluntarily subscribe? They still offer an absolutely incredible free service to who knows how many people who otherwise wouldn't be able to afford so much access to so much free information.

Given the behavior of the pro-copyright business interests and legal bodies of the world, and the outright hypocrisy of openly creating one set of rules on content piracy for certain corporations while applying another, harsher rule system for those who aren't so nicely connected, smug moralizing about something like Anna's Archive has little grounding.

And aside from picking random crap out of your ass for smearing arbitrarily, what shred of evidence do you have of anyone there laundering crypto, and how?

cakealert

> what shred of evidence do you have of anyone there laundering crypto, and how

The controller's freedom. If they didn't launder it they wouldn't be free.

> They still offer an absolutely incredible free service

Actually their free downloads aren't particularly good when compared to some of the other online services that 'leech' from them.

And their torrent strategy could be altruistic, but it could also be self-interested: it spreads storage costs around, attracts more contributions, and provides insurance against hard drive seizures.

What mainly interests me is how much money they are actually making; I suspect it's very profitable.
