DuckDuckGo, Bing, Mojeek, and other search engines are not returning full Reddit results any more.

    • Admiral Patrick@dubvee.org
      2 months ago

      Kinda long, so I’m putting it in spoilers. This applies to Nginx, but you can probably adapt it to other reverse proxies.

      1. Create a file to hold the mappings and store it somewhere you can include it from your other configs. I named mine map-bot-user-agents.conf

      Here, I’m doing a regex comparison against the user agent ($http_user_agent) and mapping it to either a 0 (default/false) or 1 (true) and storing that value in the variable $ua_disallowed. The run-on string at the bottom was inherited from another admin I work with, and I never bothered to split it out.

      'map-bot-user-agents.conf'
      # Map bot user agents
      map $http_user_agent $ua_disallowed {
          default 		0;
          "~CCBot"		1;
          "~ClaudeBot"	1;
          "~VelenPublicWebCrawler"	1;
          "~WellKnownBot"	1;
          "~Synapse \(bot; \+https://github\.com/matrix-org/synapse\)" 1;
          "~python-requests"	1;
          "~bitdiscovery"	1;
          "~bingbot"		1;
          "~SemrushBot" 	1;
          "~Bytespider" 	1;
          "~AhrefsBot" 	1;
          "~AwarioBot"	1;
          "~GPTBot" 		1;
          "~DotBot"	 	1;
          "~ImagesiftBot"	1;
          "~Amazonbot"	1;
          "~GuzzleHttp" 	1;
          "~DataForSeoBot" 	1;
          "~StractBot"	1;
          "~Googlebot"	1;
          "~Barkrowler"	1;
          "~SeznamBot"	1;
          "~FriendlyCrawler"	1;
          "~facebookexternalhit" 1;
          "~*(?i)(80legs|360Spider|Aboundex|Abonti|Acunetix|^AIBOT|^Alexibot|Alligator|AllSubmitter|Apexoo|^asterias|^attach|^BackDoorBot|^BackStreet|^BackWeb|Badass|Bandit|Baid|Baiduspider|^BatchFTP|^Bigfoot|^Black.Hole|^BlackWidow|BlackWidow|^BlowFish|Blow|^BotALot|Buddy|^BuiltBotTough|^Bullseye|^BunnySlippers|BBBike|^Cegbfeieh|^CheeseBot|^CherryPicker|^ChinaClaw|^Cogentbot|CPython|Collector|cognitiveseo|Copier|^CopyRightCheck|^cosmos|^Crescent|CSHttp|^Custo|^Demon|^Devil|^DISCo|^DIIbot|discobot|^DittoSpyder|Download.Demon|Download.Devil|Download.Wonder|^dragonfly|^Drip|^eCatch|^EasyDL|^ebingbong|^EirGrabber|^EmailCollector|^EmailSiphon|^EmailWolf|^EroCrawler|^Exabot|^Express|Extractor|^EyeNetIE|FHscan|^FHscan|^flunky|^Foobot|^FrontPage|GalaxyBot|^gotit|Grabber|^GrabNet|^Grafula|^Harvest|^HEADMasterSEO|^hloader|^HMView|^HTTrack|httrack|HTTrack|htmlparser|^humanlinks|^IlseBot|Image.Stripper|Image.Sucker|imagefetch|^InfoNaviRobot|^InfoTekies|^Intelliseek|^InterGET|^Iria|^Jakarta|^JennyBot|^JetCar|JikeSpider|^JOC|^JustView|^Jyxobot|^Kenjin.Spider|^Keyword.Density|libwww|^larbin|LeechFTP|LeechGet|^LexiBot|^lftp|^libWeb|^likse|^LinkextractorPro|^LinkScan|^LNSpiderguy|^LinkWalker|msnbot|MSIECrawler|MJ12bot|MegaIndex|^Magnet|^Mag-Net|^MarkWatch|Mass.Downloader|masscan|^Mata.Hari|^Memo|^MIIxpc|^NAMEPROTECT|^Navroad|^NearSite|^NetAnts|^Netcraft|^NetMechanic|^NetSpider|^NetZIP|^NextGenSearchBot|^NICErsPRO|^niki-bot|^NimbleCrawler|^Nimbostratus-Bot|^Ninja|^Nmap|nmap|^NPbot|Offline.Explorer|Offline.Navigator|OpenLinkProfiler|^Octopus|^Openfind|^OutfoxBot|Pixray|probethenet|proximic|^PageGrabber|^pavuk|^pcBrowser|^Pockey|^ProPowerBot|^ProWebWalker|^psbot|^Pump|python-requests\/|^QueryN.Metasearch|^RealDownload|Reaper|^Reaper|^Ripper|Ripper|Recorder|^ReGet|^RepoMonkey|^RMA|scanbot|SEOkicks-Robot|seoscanners|^Stripper|^Sucker|Siphon|Siteimprove|^SiteSnagger|SiteSucker|^SlySearch|^SmartDownload|^Snake|^Snapbot|^Snoopy|Sosospider|^sogou|spbot|^SpaceBison|^spanner|^SpankBot|Spinn4r|^Sqworm|Sqworm|Stripper|Sucker|^SuperBot|SuperHTTP|^SuperHTTP|^Surfbot|^suzuran|^Szukacz|^tAkeOut|^Teleport|^Telesoft|^TurnitinBot|^The.Intraformant|^TheNomad|^TightTwatBot|^Titan|^True_Robot|^turingos|^TurnitinBot|^URLy.Warning|^Vacuum|^VCI|VidibleScraper|^VoidEYE|^WebAuto|^WebBandit|^WebCopier|^WebEnhancer|^WebFetch|^Web.Image.Collector|^WebLeacher|^WebmasterWorldForumBot|WebPix|^WebReaper|^WebSauger|Website.eXtractor|^Webster|WebShag|^WebStripper|WebSucker|^WebWhacker|^WebZIP|Whack|Whacker|^Widow|Widow|WinHTTrack|^WISENutbot|WWWOFFLE|^WWWOFFLE|^WWW-Collector-E|^Xaldon|^Xenu|^Zade|^Zeus|ZmEu|^Zyborg|SemrushBot|^WebFuck|^MJ12bot|^majestic12|^WallpapersHD)" 1;
      
      }
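
      For anyone who does want to split the run-on string out, a rough sketch of how a few of its alternatives might look as separate entries is below. The layout is an assumption, not what the config above uses; the ~* prefix keeps each match case-insensitive, and these entries would replace the single long line inside the same map rather than go into a second map for the same variable.

      ```nginx
      # Sketch only: a handful of alternatives from the run-on regex,
      # written as individual case-insensitive entries in the same map.
      map $http_user_agent $ua_disallowed {
          default             0;
          "~*80legs"          1;
          "~*360Spider"       1;
          "~*^AIBOT"          1;   # ^ still anchors to the start of the user agent
          "~*Baiduspider"     1;
          "~*Download.Demon"  1;   # . matches any single character, as in the original
      }
      ```

      Nginx checks regex entries in the order they appear, so splitting them out mostly buys readability and easier diffs, not different behavior.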
      

      Once you have a mapping file set up, you’ll need to do something with it. This applies at the virtual host level and should go inside the server block of your configs (except the include for the mapping config).

      This assumes your configs are in conf.d/ and are included from nginx.conf.

      The map-bot-user-agents.conf is included above the server block (since it’s an http-level config item), and inside server we look at the $ua_disallowed value, where 0=false and 1=true (the values are set in the map).

      You could also do the mapping in the base nginx.conf since it doesn’t do anything on its own.

      If the $ua_disallowed value is 1 (true), we immediately return an HTTP 444. The 444 status code is an Nginx thing, but it basically closes the connection immediately and wastes no further time/energy processing the request. You could, optionally, redirect somewhere, return a different status code, or return some pre-rendered LLM-generated gibberish if your bot list is configured just for AI crawlers (because I’m a jerk like that lol).

      Example site1.conf
      
      include conf.d/includes/map-bot-user-agents.conf;
      
      server {
        server_name  example.com;
        ...
        # Deny disallowed user agents
        if ($ua_disallowed) { 
          return 444;
        }
       
        location / {
          ...
        }
      
      }
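
      For the "return a different status code" option mentioned above, a minimal sketch might look like the following. It assumes the same $ua_disallowed map is already included at the http level; the 403 and the placeholder location body are illustrative choices, not part of the original config.

      ```nginx
      server {
        server_name  example.com;

        # Deny disallowed user agents with an explicit 403 Forbidden
        # instead of silently closing the connection with 444
        if ($ua_disallowed) {
          return 403;
        }

        location / {
          # Placeholder body; a real site would proxy or serve files here
          return 200 "ok\n";
        }
      }
      ```

      Unlike 444, a 403 sends a real response back, so the bot learns it was refused and the request still costs a little bandwidth.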
      
      
      
      • Mac@federation.red
        2 months ago

        So I would need to add this to every subdomain conf file I have? Preciate you!

        • Admiral Patrick@dubvee.org
          2 months ago

          I just include the map-bot-user-agents.conf in my base nginx.conf so it’s available to all of my virtual hosts.

          When I want to enforce the bot blocking on one or more virtual hosts (some I want to leave open to bots, others I don’t), I just include a deny-disallowed.conf in the server block of those.

          deny-disallowed.conf
            # Deny disallowed user agents
            if ($ua_disallowed) { 
              return 444;
            }
          
          site.conf
          server {
            server_name example.com;
             ...
            include conf.d/includes/deny-disallowed.conf;
          
            location / {
              ...
            }
          }
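
          The base nginx.conf side of this isn't shown above, so here is a stripped-down sketch of what it could look like. It assumes the same conf.d/ layout as the earlier example; a real nginx.conf will have many more directives, and the empty events block is only there because Nginx requires one.

          ```nginx
          # nginx.conf (sketch; most directives omitted)
          events {}

          http {
            # Define $ua_disallowed once so every virtual host can use it
            include conf.d/includes/map-bot-user-agents.conf;

            # Load the individual virtual hosts; this glob does not descend
            # into conf.d/includes/, so the snippets there are not loaded twice
            include conf.d/*.conf;
          }
          ```

          Keeping deny-disallowed.conf under includes/ matters here: it contains an if block, which is only valid inside server or location, so it must not be pulled in directly at the http level.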
          
        • Admiral Patrick@dubvee.org
          2 months ago

          Yeah, if’s are weird in Nginx. The rule of thumb I’ve always gone by is that you shouldn’t try to if on variables directly unless they’re basically pre-processed to a boolean via a map (which is what the user agent map does).
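
          The same map-then-if pattern can be sketched for another check. The example below is hypothetical and not from the config above: it flags requests that send no User-Agent header at all, with $ua_missing as a made-up variable name.

          ```nginx
          # Hypothetical example: pre-process the user agent into a 0/1 flag
          map $http_user_agent $ua_missing {
            default 0;
            ""      1;   # header absent or empty
          }

          server {
            server_name example.com;

            # "if" only has to test a ready-made boolean; an empty string
            # or "0" counts as false, anything else as true
            if ($ua_missing) {
              return 444;
            }
          }
          ```

          All the matching logic lives in the map, and the if block stays limited to a single return, which is one of the few things that is reliably safe to do inside an if.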