Eminetra.com

How OpenAI bots destroyed this seven-man company’s website ‘like a DDoS attack’


On Saturday, Triplegangers CEO Oleksandr Tomchuk was alerted that his company’s e-commerce site was down. At first it looked like some kind of distributed denial-of-service attack. He soon discovered the culprit was a bot from OpenAI that was relentlessly trying to scrape his entire, enormous site.

“We have over 65,000 products, and every product has a page,” Tomchuk told TechCrunch. “Each page has at least three photos.” OpenAI was sending “tens of thousands” of server requests trying to download all of it: hundreds of thousands of photos, along with their detailed descriptions.

“OpenAI used 600 IPs to scrape data, and we’re still analyzing the logs from last week, so it may be even more,” he said of the IP addresses the bot used to crawl the site. “Their crawler was crushing our site,” he said. “It was basically a DDoS attack.”

The Triplegangers website is its business. The seven-employee company has spent more than a decade amassing what it calls the web’s largest database of “human digital doubles,” meaning 3D image files scanned from actual human models. It sells the 3D object files, as well as photos (everything from hands to hair, skin, and full bodies) to 3D artists, video game makers, and anyone who needs to digitally recreate authentic human characteristics.

Tomchuk’s team, based in Ukraine but also licensed in the US out of Tampa, Florida, has a terms-of-service page on its site that prohibits bots from taking its images without permission. But that alone did nothing. Websites must use a properly configured robots.txt file with tags that specifically tell OpenAI’s bot, GPTBot, to leave the site alone. (OpenAI also has a couple of other bots, ChatGPT-User and OAI-SearchBot, which have their own tags, according to its information page on its crawlers.)

robots.txt, otherwise known as the Robots Exclusion Protocol, was created to tell search engines which parts of a site they should not crawl when indexing the web.
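Such a directive might look like the following minimal robots.txt sketch, which blocks the three OpenAI crawlers named above. The user-agent names match those on OpenAI’s crawler information page; the blanket `Disallow: /` is an assumption that a site wants to block them from everything:

```text
# Sketch of a robots.txt blocking OpenAI's crawlers site-wide
# (user-agent names per OpenAI's crawler documentation page)
User-agent: GPTBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /

User-agent: OAI-SearchBot
Disallow: /
```

To be picked up by compliant crawlers, the file must be served at the site’s root, e.g. `https://example.com/robots.txt`.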
OpenAI says on its information page that it respects such files when they are configured with its own do-not-crawl tags, though it also warns that it can take its bots up to 24 hours to recognize an updated robots.txt file. As Tomchuk experienced, if a site isn’t using robots.txt properly, OpenAI and others take that to mean they can scrape to their heart’s content. It is not an opt-in system.

To add insult to injury, not only was Triplegangers knocked offline by OpenAI’s bot during US business hours, but Tomchuk also expects a higher AWS bill thanks to all of the bot’s CPU and download activity.

robots.txt isn’t a failsafe, either. AI companies comply with it voluntarily. Another AI startup, Perplexity, was quite famously called out last summer by a Wired investigation when some evidence implied that Perplexity wasn’t honoring it.

Each of these is a product, with a product page containing more photos. (Image credit: Triplegangers; used with permission.)

No sure way to know what was taken

By Wednesday, after days of the OpenAI bot’s return visits, Triplegangers had a properly configured robots.txt file in place, and also a Cloudflare account set up to block GPTBot and other bots it discovered, such as Barkrowler (an SEO crawler) and Bytespider (TikTok’s crawler). Tomchuk is also hopeful he has blocked crawlers from other AI model companies. As of Thursday morning, the site hadn’t crashed again, he said.

But Tomchuk still has no reasonable way to find out exactly what OpenAI took, or to get that material removed. He’s found no way to contact OpenAI and ask. OpenAI did not respond to TechCrunch’s request for comment. And OpenAI has so far failed to deliver the opt-out tool it has long promised, as TechCrunch has reported.

This is an especially thorny issue for Triplegangers. “We’re in a business where rights are kind of a serious issue, because we scan actual people,” he said. With laws like Europe’s GDPR, “they cannot just take a photo of anyone on the web and use it.”

The Triplegangers website was also an especially delicious find for AI crawlers.
Multi-billion-dollar startups, like Scale AI, have been created where humans painstakingly tag images to train AI. The Triplegangers site features photos tagged in detail: ethnicity, age, tattoos and scars, all body types, and so on.

The irony is that the OpenAI bot’s greediness is what alerted Triplegangers to how exposed it was. Had it scraped more gently, Tomchuk would never have known, he said.

“It’s scary because there seems to be a loophole that these companies are using to crawl data by saying ‘you can opt out if you update your robots.txt with our tags,’” Tomchuk said, but that puts the onus on the business owner to understand how to block them.

Triplegangers’ server logs showed how relentlessly the bots were accessing the site, from hundreds of IP addresses. He wants other small online businesses to know that the only way to discover whether AI bots are taking a website’s copyrighted belongings is to actively look for them. Another website owner recently told Business Insider how OpenAI bots crashed his site, too, in 2024.

New research from the digital advertising company DoubleVerify found that AI crawlers and scrapers caused an 86% increase in “general invalid traffic” in 2024 — that is, traffic that doesn’t come from a real user.

Still, “most sites remain unaware that they’ve been scraped by these bots,” Tomchuk said. “Now we have to monitor log activity daily to spot these bots.”

When you think about it, the whole model operates a bit like a mafia shakedown: the AI bots will take what they want unless you have protection. “They should be asking permission, not just scraping data,” Tomchuk said.
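The daily log monitoring Tomchuk describes can be sketched in a few lines. The following is a hypothetical Python example, not anything Triplegangers published: it assumes a standard combined-format access log (client IP first, user-agent last) and uses the crawler names mentioned in the article to tally requests and distinct IPs per bot.

```python
from collections import Counter

# Crawler names from the article; substring-matched against the
# user-agent portion of each log line (an assumption about the format).
AI_BOTS = ["GPTBot", "ChatGPT-User", "OAI-SearchBot", "Bytespider", "Barkrowler"]

def count_ai_bot_hits(log_lines):
    """Return (request counts per bot, distinct client IPs per bot)."""
    hits = Counter()
    ips = {bot: set() for bot in AI_BOTS}
    for line in log_lines:
        ip = line.split(" ", 1)[0]      # first field: client IP
        for bot in AI_BOTS:
            if bot in line:             # crude user-agent substring match
                hits[bot] += 1
                ips[bot].add(ip)
    return hits, ips

# Tiny fabricated sample in combined log format, for illustration only.
sample = [
    '203.0.113.5 - - [11/Jan/2025:10:00:00 +0000] "GET /p/1 HTTP/1.1" 200 512 "-" "Mozilla/5.0; GPTBot/1.0"',
    '203.0.113.6 - - [11/Jan/2025:10:00:01 +0000] "GET /p/2 HTTP/1.1" 200 512 "-" "Mozilla/5.0; GPTBot/1.0"',
    '198.51.100.7 - - [11/Jan/2025:10:00:02 +0000] "GET / HTTP/1.1" 200 512 "-" "Mozilla/5.0"',
]
hits, ips = count_ai_bot_hits(sample)
print(hits["GPTBot"], len(ips["GPTBot"]))  # → 2 2
```

In practice this would run over the real access log (e.g. piped from the web server), and a spike in per-bot counts or distinct IPs is the signal a site owner would watch for.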
