DuckDuckGo, Bing, Mojeek, and other search engines are not returning full Reddit results any more.
Kinda long, so I’m putting it in spoilers. This applies to Nginx, but you can probably adapt it to other reverse proxies.
map-bot-user-agents.conf
Here, I’m doing a regex comparison against the user agent ($http_user_agent) and mapping it to either a 0 (default/false) or a 1 (true), and storing that value in the variable $ua_disallowed. The run-on string at the bottom was inherited from another admin I work with, and I never bothered to split it out.
# Map bot user agents
map $http_user_agent $ua_disallowed {
    default 0;
    "~CCBot" 1;
    "~ClaudeBot" 1;
    "~VelenPublicWebCrawler" 1;
    "~WellKnownBot" 1;
    "~Synapse (bot; +https://github.com/matrix-org/synapse)" 1;
    "~python-requests" 1;
    "~bitdiscovery" 1;
    "~bingbot" 1;
    "~SemrushBot" 1;
    "~Bytespider" 1;
    "~AhrefsBot" 1;
    "~AwarioBot" 1;
    "~GPTBot" 1;
    "~DotBot" 1;
    "~ImagesiftBot" 1;
    "~Amazonbot" 1;
    "~GuzzleHttp" 1;
    "~DataForSeoBot" 1;
    "~StractBot" 1;
    "~Googlebot" 1;
    "~Barkrowler" 1;
    "~SeznamBot" 1;
    "~FriendlyCrawler" 1;
    "~facebookexternalhit" 1;
    "~*(?i)(80legs|360Spider|Aboundex|Abonti|Acunetix|^AIBOT|^Alexibot|Alligator|AllSubmitter|Apexoo|^asterias|^attach|^BackDoorBot|^BackStreet|^BackWeb|Badass|Bandit|Baid|Baiduspider|^BatchFTP|^Bigfoot|^Black.Hole|^BlackWidow|BlackWidow|^BlowFish|Blow|^BotALot|Buddy|^BuiltBotTough|^Bullseye|^BunnySlippers|BBBike|^Cegbfeieh|^CheeseBot|^CherryPicker|^ChinaClaw|^Cogentbot|CPython|Collector|cognitiveseo|Copier|^CopyRightCheck|^cosmos|^Crescent|CSHttp|^Custo|^Demon|^Devil|^DISCo|^DIIbot|discobot|^DittoSpyder|Download.Demon|Download.Devil|Download.Wonder|^dragonfly|^Drip|^eCatch|^EasyDL|^ebingbong|^EirGrabber|^EmailCollector|^EmailSiphon|^EmailWolf|^EroCrawler|^Exabot|^Express|Extractor|^EyeNetIE|FHscan|^FHscan|^flunky|^Foobot|^FrontPage|GalaxyBot|^gotit|Grabber|^GrabNet|^Grafula|^Harvest|^HEADMasterSEO|^hloader|^HMView|^HTTrack|httrack|HTTrack|htmlparser|^humanlinks|^IlseBot|Image.Stripper|Image.Sucker|imagefetch|^InfoNaviRobot|^InfoTekies|^Intelliseek|^InterGET|^Iria|^Jakarta|^JennyBot|^JetCar|JikeSpider|^JOC|^JustView|^Jyxobot|^Kenjin.Spider|^Keyword.Density|libwww|^larbin|LeechFTP|LeechGet|^LexiBot|^lftp|^libWeb|^likse|^LinkextractorPro|^LinkScan|^LNSpiderguy|^LinkWalker|msnbot|MSIECrawler|MJ12bot|MegaIndex|^Magnet|^Mag-Net|^MarkWatch|Mass.Downloader|masscan|^Mata.Hari|^Memo|^MIIxpc|^NAMEPROTECT|^Navroad|^NearSite|^NetAnts|^Netcraft|^NetMechanic|^NetSpider|^NetZIP|^NextGenSearchBot|^NICErsPRO|^niki-bot|^NimbleCrawler|^Nimbostratus-Bot|^Ninja|^Nmap|nmap|^NPbot|Offline.Explorer|Offline.Navigator|OpenLinkProfiler|^Octopus|^Openfind|^OutfoxBot|Pixray|probethenet|proximic|^PageGrabber|^pavuk|^pcBrowser|^Pockey|^ProPowerBot|^ProWebWalker|^psbot|^Pump|python-requests\/|^QueryN.Metasearch|^RealDownload|Reaper|^Reaper|^Ripper|Ripper|Recorder|^ReGet|^RepoMonkey|^RMA|scanbot|SEOkicks-Robot|seoscanners|^Stripper|^Sucker|Siphon|Siteimprove|^SiteSnagger|SiteSucker|^SlySearch|^SmartDownload|^Snake|^Snapbot|^Snoopy|Sosospider|^sogou|spbot|^SpaceBison|^spanner|^SpankBot|Spinn4r|^Sqworm|Sqworm|Stripper|Sucker|^SuperBot|SuperHTTP|^SuperHTTP|^Surfbot|^suzuran|^Szukacz|^tAkeOut|^Teleport|^Telesoft|^TurnitinBot|^The.Intraformant|^TheNomad|^TightTwatBot|^Titan|^True_Robot|^turingos|^TurnitinBot|^URLy.Warning|^Vacuum|^VCI|VidibleScraper|^VoidEYE|^WebAuto|^WebBandit|^WebCopier|^WebEnhancer|^WebFetch|^Web.Image.Collector|^WebLeacher|^WebmasterWorldForumBot|WebPix|^WebReaper|^WebSauger|Website.eXtractor|^Webster|WebShag|^WebStripper|WebSucker|^WebWhacker|^WebZIP|Whack|Whacker|^Widow|Widow|WinHTTrack|^WISENutbot|WWWOFFLE|^WWWOFFLE|^WWW-Collector-E|^Xaldon|^Xenu|^Zade|^Zeus|ZmEu|^Zyborg|SemrushBot|^WebFuck|^MJ12bot|^majestic12|^WallpapersHD)" 1;
}
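If you want to sanity-check that the map is matching what you expect, one option is to log the mapped value next to the user agent. This is just a sketch (the log format name and log path here are made up); it would sit at the http level alongside the map include:

# Optional: log the mapped value to confirm the map is matching as expected
log_format botcheck '$remote_addr "$http_user_agent" ua_disallowed=$ua_disallowed';
access_log /var/log/nginx/botcheck.log botcheck;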
Once you have a mapping file set up, you’ll need to do something with it. This applies at the virtual host level and should go inside the server block of your configs (except the include for the mapping config). This assumes your configs are in conf.d/ and are included from nginx.conf.

The map-bot-user-agents.conf is included above the server block (since it’s an http-level config item), and inside server we look at the $ua_disallowed value, where 0=false and 1=true (the values are set in the map). You could also do the mapping in the base nginx.conf, since it doesn’t do anything on its own.

If the $ua_disallowed value is 1 (true), we immediately return an HTTP 444. The 444 status code is an Nginx thing, but it basically closes the connection immediately and wastes no further time/energy processing the request. You could, optionally, redirect somewhere, return a different status code, or return some pre-rendered LLM-generated gibberish if your bot list is configured just for AI crawlers (because I’m a jerk like that lol).

Example site1.conf
include conf.d/includes/map-bot-user-agents.conf;

server {
    server_name example.com;
    ...

    # Deny disallowed user agents
    if ($ua_disallowed) {
        return 444;
    }

    location / {
        ...
    }
}
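If you’d rather take the “redirect somewhere” option mentioned above instead of closing the connection, the if block just returns a redirect. A rough sketch (the target URL is only a placeholder):

# Alternative: redirect disallowed user agents instead of returning 444
if ($ua_disallowed) {
    return 302 https://example.com/go-away.html;
}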
So I would need to add this to every subdomain conf file I have? Preciate you!
I just include the map-bot-user-agents.conf in my base nginx.conf so it’s available to all of my virtual hosts. When I want to enforce the bot blocking on one or more virtual hosts (some I want to leave open to bots, others I don’t), I just include a deny-disallowed.conf in the server block of those.

deny-disallowed.conf
# Deny disallowed user agents
if ($ua_disallowed) {
    return 444;
}
site.conf
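A minimal sketch of what that per-site config might look like with the include in place (the include path is an assumption; adjust it to wherever your includes actually live):

server {
    server_name example.com;
    ...

    # Pull in the shared bot-blocking snippet for this vhost only
    include conf.d/includes/deny-disallowed.conf;

    location / {
        ...
    }
}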
Okay yeah I was thinking my base domain conf but that’s even better.
I’ve always been told to be scared about ifs in nginx configs.

Yeah, ifs are weird in Nginx. The rule of thumb I’ve always gone by is that you shouldn’t try to if on variables directly unless they’re basically pre-processed to a boolean via a map (which is what the user agent map does).
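To make that rule of thumb concrete, the pattern is: do the matching once in a map at the http level, then only ever if on the resulting 0/1 value. A stripped-down sketch with a single bot entry for illustration:

# http level: pre-compute a 0/1 flag from the user agent
map $http_user_agent $ua_disallowed {
    default 0;
    "~GPTBot" 1;
}

# server level: the if only tests the already-mapped value
if ($ua_disallowed) {
    return 444;
}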