Google Is the Only Search Engine That Works on Reddit Now Thanks to AI Deal

Hot Potato@lemmy.world · 10 months ago

Google Is the Only Search Engine That Works on Reddit Now Thanks to AI Deal

Admiral Patrick@dubvee.org · 10 months ago

Yep. I block all bots to my instance.

Most are parasitic (GPTBot, ImageSift bot, Yandex, etc) but I’ve even blocked Google’s crawler (and its ActivityPub cralwer bot) since it now feeds their LLM models. Most of my content can be found anyway because instances it federated to don’t block those, but the bandwidth and processing savings are what I’m in it for.

Mac@federation.red · 10 months ago

Teach me oh wise one

Admiral Patrick@dubvee.org · 10 months ago

Kinda long, so I’m putting it in spoilers. This applies to Nginx, but you can probably adapt it to other reverse proxies.

Create a file to hold the mappings and store it somewhere you can include it from your other configs. I named mine map-bot-user-agents.conf

Here, I’m doing a regex comparison against the user agent ($http_user_agent) and mapping it to either a 0 (default/false) or 1 (true) and storing that value in the variable $ua_disallowed. The run-on string at the bottom was inherited from another admin I work with, and I never bothered to split it out.

'map-bot-user-agents.conf'

# Map bot user agents
map $http_user_agent $ua_disallowed {
    default 		0;
    "~CCBot"		1;
    "~ClaudeBot"	1;
    "~VelenPublicWebCrawler"	1;
    "~WellKnownBot"	1;
    "~Synapse (bot; +https://github.com/matrix-org/synapse)" 1;
    "~python-requests"	1;
    "~bitdiscovery"	1;
    "~bingbot"		1;
    "~SemrushBot" 	1;
    "~Bytespider" 	1;
    "~AhrefsBot" 	1;
    "~AwarioBot"	1;
    "~GPTBot" 		1;
    "~DotBot"	 	1;
    "~ImagesiftBot"	1;
    "~Amazonbot"	1;
    "~GuzzleHttp" 	1;
    "~DataForSeoBot" 	1;
    "~StractBot"	1;
    "~Googlebot"	1;
    "~Barkrowler"	1;
    "~SeznamBot"	1;
    "~FriendlyCrawler"	1;
    "~facebookexternalhit" 1;
    "~*(?i)(80legs|360Spider|Aboundex|Abonti|Acunetix|^AIBOT|^Alexibot|Alligator|AllSubmitter|Apexoo|^asterias|^attach|^BackDoorBot|^BackStreet|^BackWeb|Badass|Bandit|Baid|Baiduspider|^BatchFTP|^Bigfoot|^Black.Hole|^BlackWidow|BlackWidow|^BlowFish|Blow|^BotALot|Buddy|^BuiltBotTough|
^Bullseye|^BunnySlippers|BBBike|^Cegbfeieh|^CheeseBot|^CherryPicker|^ChinaClaw|^Cogentbot|CPython|Collector|cognitiveseo|Copier|^CopyRightCheck|^cosmos|^Crescent|CSHttp|^Custo|^Demon|^Devil|^DISCo|^DIIbot|discobot|^DittoSpyder|Download.Demon|Download.Devil|Download.Wonder|^dragonfl
y|^Drip|^eCatch|^EasyDL|^ebingbong|^EirGrabber|^EmailCollector|^EmailSiphon|^EmailWolf|^EroCrawler|^Exabot|^Express|Extractor|^EyeNetIE|FHscan|^FHscan|^flunky|^Foobot|^FrontPage|GalaxyBot|^gotit|Grabber|^GrabNet|^Grafula|^Harvest|^HEADMasterSEO|^hloader|^HMView|^HTTrack|httrack|HTT
rack|htmlparser|^humanlinks|^IlseBot|Image.Stripper|Image.Sucker|imagefetch|^InfoNaviRobot|^InfoTekies|^Intelliseek|^InterGET|^Iria|^Jakarta|^JennyBot|^JetCar|JikeSpider|^JOC|^JustView|^Jyxobot|^Kenjin.Spider|^Keyword.Density|libwww|^larbin|LeechFTP|LeechGet|^LexiBot|^lftp|^libWeb|
^likse|^LinkextractorPro|^LinkScan|^LNSpiderguy|^LinkWalker|msnbot|MSIECrawler|MJ12bot|MegaIndex|^Magnet|^Mag-Net|^MarkWatch|Mass.Downloader|masscan|^Mata.Hari|^Memo|^MIIxpc|^NAMEPROTECT|^Navroad|^NearSite|^NetAnts|^Netcraft|^NetMechanic|^NetSpider|^NetZIP|^NextGenSearchBot|^NICErs
PRO|^niki-bot|^NimbleCrawler|^Nimbostratus-Bot|^Ninja|^Nmap|nmap|^NPbot|Offline.Explorer|Offline.Navigator|OpenLinkProfiler|^Octopus|^Openfind|^OutfoxBot|Pixray|probethenet|proximic|^PageGrabber|^pavuk|^pcBrowser|^Pockey|^ProPowerBot|^ProWebWalker|^psbot|^Pump|python-requests\/|^Qu
eryN.Metasearch|^RealDownload|Reaper|^Reaper|^Ripper|Ripper|Recorder|^ReGet|^RepoMonkey|^RMA|scanbot|SEOkicks-Robot|seoscanners|^Stripper|^Sucker|Siphon|Siteimprove|^SiteSnagger|SiteSucker|^SlySearch|^SmartDownload|^Snake|^Snapbot|^Snoopy|Sosospider|^sogou|spbot|^SpaceBison|^spanne
r|^SpankBot|Spinn4r|^Sqworm|Sqworm|Stripper|Sucker|^SuperBot|SuperHTTP|^SuperHTTP|^Surfbot|^suzuran|^Szukacz|^tAkeOut|^Teleport|^Telesoft|^TurnitinBot|^The.Intraformant|^TheNomad|^TightTwatBot|^Titan|^True_Robot|^turingos|^TurnitinBot|^URLy.Warning|^Vacuum|^VCI|VidibleScraper|^Void
EYE|^WebAuto|^WebBandit|^WebCopier|^WebEnhancer|^WebFetch|^Web.Image.Collector|^WebLeacher|^WebmasterWorldForumBot|WebPix|^WebReaper|^WebSauger|Website.eXtractor|^Webster|WebShag|^WebStripper|WebSucker|^WebWhacker|^WebZIP|Whack|Whacker|^Widow|Widow|WinHTTrack|^WISENutbot|WWWOFFLE|^
WWWOFFLE|^WWW-Collector-E|^Xaldon|^Xenu|^Zade|^Zeus|ZmEu|^Zyborg|SemrushBot|^WebFuck|^MJ12bot|^majestic12|^WallpapersHD)" 1;

}

Once you have a mapping file setup, you’ll need to do something with it. This applies at the virtual host level and should go inside the server block of your configs (except the include for the mapping config.).

This assumes your configs are in conf.d/ and are included from nginx.conf.

The map-bot-user-agents.conf is included above the server block (since it’s an http level config item) and inside server, we look at the $ua_disallowedvalue where 0=false and 1=true (the values are set in the map).

You could also do the mapping in the base nginx.conf since it doesn’t do anything on its own.

If the $ua_disallowed value is 1 (true), we immediately return an HTTP 444. The 444 status code is an Nginx thing, but it basically closes the connection immediately and wastes no further time/energy processing the request. You could, optionally, redirect somewhere, return a different status code, or return some pre-rendered LLM-generated gibberish if your bot list is configured just for AI crawlers (because I’m a jerk like that lol).

Example site1.conf


include conf.d/includes/map-bot-user-agents.conf;

server {
  server_name  example.com;
  ...
  # Deny disallowed user agents
  if ($ua_disallowed) { 
    return 444;
  }
 
  location / {
    ...
  }

}

taaz@biglemmowski.win · 10 months ago

I’ve always been told to be scared about ifs in nginx configs

Admiral Patrick@dubvee.org · 10 months ago

Yeah, if’s are weird in Nginx. The rule of thumb I’ve always gone by is that you shouldn’t try to if on variables directly unless they’re basically pre-processed to a boolean via a map (which is what the user agent map does).

Mac@federation.red · edit-2 10 months ago

So I would need to add this to every subdomain conf file I have? Preciate you!

Admiral Patrick@dubvee.org · edit-2 10 months ago

I just include the map-bot-user-agents.conf in my base nginx.conf so it’s available to all of my virtual hosts.

When I want to enforce the bot blocking on one or more virtual host (some I want to leave open to bots, others I don’t), I just include a deny-disallowed.conf in the server block of those.

deny-disallowed.conf

  # Deny disallowed user agents
  if ($ua_disallowed) { 
    return 444;
  }

site.conf

server {
  server_name example.com;
   ...
  include conf.d/includes/deny-disallowed.conf;

  location / {
    ...
  }
}

Mac@federation.red · 10 months ago

Okay yeah I was thinking my base domain conf but that’s even better.

MCasq_qsaCJ_234@lemmy.zip · 10 months ago

I have two questions. How much do those bots consume your bandwidth? And by blocking search robots, do you stop being present in the search results or are you still present, but they do not show the content in question?

I ask these questions because I don’t know much about the topic when managing a website or an instance of the fediverse.

Admiral Patrick@dubvee.org · edit-2 10 months ago

How much do those bots consume your bandwidth?

Pretty negligible per bot per request, but I’m not here to feed them. They also travel in packs, so the bandwidth does multiply. It also costs me money when I exceed my monthly bandwidth quota. I’ve blocked them for so long, I no longer have data I can tally to get an aggregate total (I only keep 90 days). SemrushBot alone, before I blocked it, was averaging about 15 GB a month. That one is fairly aggressive, though. Imagesift Bot, which pulls down any images it can find, would also use quite a bit, I imagine, if it were allowed.

With Lemmy, especially earlier versions, the queries were a lot more expensive, and bots hitting endpoints that triggered a heavy query (such as a post with a lot of comments) would put unwanted load on my DB server. That’s when I started blocking bot crawlers much more aggressively.

Static sites are a lot less impactful, and I usually allow those. I’ve got a different rule set for them which blocks the known AI scrapers but allows search indexers (though that distinction is slowly disappearing).

And by blocking search robots, do you stop being present in the search results or are you still present, but they do not show the content in question?

I block bots by default, and that prevents them from being indexed since they can’t be crawled at all. Searching “dubvee” (my instance name / url) in Google returns no relevant results. I’m okay with that, lol, but some people would be appalled.

However, I can search for things I’ve posted from my instance if they’ve federated to another instance that is crawled; the link will just be to the copy on that instance.

For the few static sites I run (mostly local business sites since they’d be on Facebook otherwise), I don’t enforce the bot blocking, and Google, etc are able to index them normally.

MCasq_qsaCJ_234@lemmy.zip · 10 months ago

Thanks for the explanation and it was clear to me.