Is this relatively simple to create or other? Image location scrambler thingy.

3 posts by 2 authors in: Forums > CMS Builder
Last Post: May 8, 2023

Archived

By Codee - May 8, 2023

Dave and I.T. Team,

I've been doing some thinking on security procedures and the recent influx of A.I. site scrapers harvesting images. Some time back your team created the Spambot Email Protector, which was a GREAT idea with only one glitch (order of operations: if someone used dynamic entry from CMSB, it also got scrambled, so legit emails didn't work in that case). Now I am wondering if there is a simple way (like a plugin or standard coding) to either prevent, or screw with, AI image scraping. I was thinking it could work by adjusting the URL after the page is loaded, so the page displays the correct image or thumbnail, but if someone tries to right-click-download, right-click-open-in-new-window, or just scrape the URL from the source code, they get the wrong URL.

Does that make sense?

By Dave - May 8, 2023

Hi Codee,

There is no perfect solution to this issue, as browsers need to download images. A bot that perfectly replicates browser behaviour could still download them. However, there are ways to make it challenging for many scrapers.

One method involves requiring an HTTP_REFERER value, which is the URL of the page that linked to or referred to the current page or image. Simple bots often request a file directly without sending this value. You can filter based on this field (with PHP or .htaccess) to display a different image or an error if there is no referrer.
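As a rough illustration of the referrer check, here is a minimal Python sketch (the same idea can be written in PHP or as .htaccess rules). The host names, the placeholder path, and the function names are all made up for the example. Keep in mind that some legitimate visitors send no referrer at all (privacy settings, some proxies), so serving a placeholder is gentler than a hard error.

```python
from urllib.parse import urlparse

# Hypothetical host names for illustration; replace with your own site's hosts.
ALLOWED_HOSTS = {"www.example.com", "example.com"}


def referer_allowed(headers):
    """Return True if the request's Referer header points at one of our own pages."""
    referer = headers.get("Referer", "")
    if not referer:
        # Simple bots often request the file directly and send no referrer.
        return False
    host = urlparse(referer).hostname  # None if the value is not a usable URL
    return host in ALLOWED_HOSTS


def choose_image(headers, requested_path, placeholder="/images/placeholder.png"):
    """Serve the real image only when the referrer check passes."""
    return requested_path if referer_allowed(headers) else placeholder
```

For example, `choose_image({}, "/uploads/photo.jpg")` falls back to the placeholder path, while a request carrying `Referer: https://www.example.com/gallery.php` gets the real file.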

Another approach is to filter based on the HTTP_USER_AGENT value. All browsers send their name and version when making a web request, and well-behaved bots should include something that identifies their source. You can find more information about this here: https://datadome.co/threat-research/how-chatgpt-openai-might-use-your-content-now-in-the-future/
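A user-agent filter could look something like the Python sketch below. The substrings in the block list are hypothetical placeholders, not real bot names, so you would build the actual list from your own access logs. This only catches bots that identify themselves honestly, since the header is trivial to fake.

```python
# Hypothetical substrings for illustration only; build the real list from your logs.
BLOCKED_AGENT_SUBSTRINGS = ("examplescraper", "exampleimagebot")


def user_agent_blocked(user_agent):
    """Return True if the User-Agent is missing or matches a blocked substring."""
    ua = (user_agent or "").lower()
    if not ua:
        # Assumption: treat a missing User-Agent as suspicious, since browsers always send one.
        return True
    return any(token in ua for token in BLOCKED_AGENT_SUBSTRINGS)
```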

Additionally, you can ban access from blocks of IP addresses. If you know a certain service is scraping your site (or might) and you know their IP range, you can block it. However, be cautious not to exclude valid search engine indexing bots.
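To show what the IP range check amounts to, here is a small Python sketch using the standard `ipaddress` module. The networks listed are reserved documentation ranges used as stand-ins, not the ranges of any real service.

```python
import ipaddress

# Stand-in ranges reserved for documentation; substitute the ranges you actually want to block.
BLOCKED_NETWORKS = [
    ipaddress.ip_network("203.0.113.0/24"),
    ipaddress.ip_network("2001:db8::/32"),
]


def ip_blocked(remote_addr):
    """Return True if the client address falls inside any blocked network."""
    try:
        addr = ipaddress.ip_address(remote_addr)
    except ValueError:
        # Assumption: an unparseable address is not blocked here; handle it elsewhere.
        return False
    # An IPv4 address is never "in" an IPv6 network (and vice versa), so mixing is safe.
    return any(addr in network for network in BLOCKED_NETWORKS)
```

In practice a check like this usually lives in the web server or firewall configuration rather than in application code.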

Other ideas to consider:

Disallowing bots with robots.txt
Loading images with JavaScript to prevent access by bots that can't use JavaScript
Watermarking your images to make them traceable and less useful
Limiting the rate of requests so scrapers and bots can't download everything all at once
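As one example from the list above, rate limiting can be sketched in a few lines of Python. This is a simple in-memory sliding window keyed by client (for example, by IP address). The class name and the limits are made up for illustration, and a real site would more likely do this at the web server or CDN level.

```python
import time
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most max_requests per client within any window_seconds span."""

    def __init__(self, max_requests, window_seconds, clock=time.monotonic):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.clock = clock  # injectable so the limiter can be tested without sleeping
        self._hits = defaultdict(deque)  # client id -> timestamps of recent requests

    def allow(self, client_id):
        now = self.clock()
        hits = self._hits[client_id]
        # Discard timestamps that have aged out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False
        hits.append(now)
        return True
```

Usage would be along the lines of `limiter = SlidingWindowLimiter(60, 60)` and then returning an HTTP 429 response whenever `limiter.allow(client_ip)` is false.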

Hope that helps. The right solution will vary based on what you're trying to protect and why, but there are a few options.

Dave Edis - Senior Developer
interactivetools.com

By Codee - May 8, 2023

Dave,

Thank you. Yeah, I couldn't figure out how to make that happen, because downloading the images is critical browser behavior and monkeying with it causes issues (and different issues in different browsers), and it's not hard to break past some of the robots.txt rules. For example, website coders can block the tool img2dataset by sending the headers "X-Robots-Tag: noai", "X-Robots-Tag: noindex", "X-Robots-Tag: noimageai", and "X-Robots-Tag: noimageindex". By default, img2dataset will ignore images with such headers. HOWEVER, img2dataset tells users "to disable this behaviour and download all images, you may pass --disallowed_header_directives '[ ]'" ...which is part of what just ticks me off.

That's why I wrote: I couldn't think of any other simple ways to combat this. Yes, I have some .htaccess blocks in place, but attempting to block everything would cripple site speed.
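To make the X-Robots-Tag point concrete, here is a rough Python sketch of how a well-behaved downloader might honor those directives. It is not img2dataset's actual code, the function name is made up, and it ignores the user-agent-prefixed form of the header to keep things short. It also shows why the opt-out is only as good as the scraper's manners: hand it an empty directive set and nothing is ever skipped.

```python
# The four directives mentioned above.
DEFAULT_DISALLOWED = frozenset({"noai", "noimageai", "noindex", "noimageindex"})


def opted_out(x_robots_values, disallowed=DEFAULT_DISALLOWED):
    """Return True if any X-Robots-Tag value contains a disallowed directive.

    x_robots_values is a list of raw header values; each value may hold several
    comma-separated directives, e.g. "noindex, noimageai".
    """
    for value in x_robots_values:
        for part in value.split(","):
            if part.strip().lower() in disallowed:
                return True
    return False
```

With the defaults, `opted_out(["noai"])` is true, so a polite tool skips the image. With `disallowed=frozenset()` the same header is ignored, which is the loophole described above.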