I’m currently working on an Elixir app to create form endpoints for different sites (because I have a lot of static sites without any real “backend”).
A pretty general question: how would you deal with identifying spam? I will use Cloudflare as a WAF, but I’m looking for ideas/libraries on how to identify whether something is spam or not. I don’t wanna block submissions (unless it is a DDoS attack or similar), but I wanna mark a submission as spam if it is likely to be spam. So it is kind of an open-ended question (libraries, general input a la “you should read up on X and Y”, etc).
Any form submitted with the hidden honeypot field filled in (with any value) most likely comes from a spam bot.
Regular users simply won’t see the field.
P.S. This tactic worked pretty well some years ago, but it might be obsolete now. I haven’t worked on frontend for a long time.
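A minimal sketch of the server-side half of that check, written in Python only for brevity (the same logic is a few lines in Elixir). The field name `website` is a hypothetical choice; any name that looks tempting to a bot and is hidden from real users with CSS would do. Browsers still submit a hidden text input as an empty string, so the check looks for a non-empty value rather than for the key being present.

```python
# Hypothetical name of the hidden honeypot input; real users never see or fill it.
HONEYPOT_FIELD = "website"


def is_honeypot_triggered(form: dict[str, str], field: str = HONEYPOT_FIELD) -> bool:
    """Return True when the hidden field came back with a non-empty value."""
    value = form.get(field)
    # A missing key or an empty/whitespace-only string is what a real browser sends.
    return value is not None and value.strip() != ""
```

Since the goal here is to mark rather than block, the result would typically be stored as a flag on the submission instead of rejecting the request.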
Oh, yeah, for sure I will use that! I probably should have added what I already plan to use. I’m gonna use a honeypot field and add the ability to use a captcha service, and also use Cloudflare’s “bot score” (or whatever it is called). But I thought perhaps there is some ML or similar to just analyse the text to see if it is spam or not.
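For the text-analysis idea, one classic technique is a naive Bayes classifier over the words in a message. The sketch below is a toy, assuming you have (or can collect) submissions already labelled as spam or not; the class name `NaiveBayes`, the labels `"spam"`/`"ham"`, and the training sentences in the usage note are all made up for illustration. It is written in plain Python with no libraries, so it should port to Elixir without much trouble.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


class NaiveBayes:
    """Toy multinomial naive Bayes with add-one smoothing.

    Needs at least one training example for each of "spam" and "ham"
    before spam_probability() is called.
    """

    def __init__(self) -> None:
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.doc_counts = Counter()

    def train(self, text: str, label: str) -> None:
        self.doc_counts[label] += 1
        self.word_counts[label].update(tokenize(text))

    def spam_probability(self, text: str) -> float:
        vocab = set(self.word_counts["spam"]) | set(self.word_counts["ham"])
        total_docs = sum(self.doc_counts.values())
        log_scores = {}
        for label in ("spam", "ham"):
            counts = self.word_counts[label]
            total_words = sum(counts.values())
            # Start from the log prior: how common this label is overall.
            score = math.log(self.doc_counts[label] / total_docs)
            for word in tokenize(text):
                if word not in vocab:
                    continue  # words never seen in training carry no evidence
                # Add-one smoothing avoids log(0) for words unseen in this label.
                score += math.log((counts[word] + 1) / (total_words + len(vocab)))
            log_scores[label] = score
        # Turn the two log scores into a probability in a numerically safe way.
        top = max(log_scores.values())
        weights = {k: math.exp(v - top) for k, v in log_scores.items()}
        return weights["spam"] / (weights["spam"] + weights["ham"])
```

Usage would look like calling `train("buy cheap pills now", "spam")` and `train("I have a question about your pricing", "ham")` for each labelled submission, then treating `spam_probability(message)` above some threshold as "likely spam". With only a handful of examples the numbers mean very little, so this is more useful once a site has accumulated some real, hand-marked submissions.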
A few weeks back I started building a very similar app/service but got busy with other things (paying work) and had to place it on the back burner.
Years ago I wrote a simple PHP library to handle form submissions. It basically used the hidden field strategy mentioned by @Egis, but it also regex-scanned all the other fields looking for things like
bcc:|cc:|cco:|to:|content-type|http:|https: etc., and would exit with a 500 or reject the submission as spam. It worked really well and is still in use on many old websites some of my clients still run.
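A rough sketch of that kind of scan, in Python rather than the original PHP (the library itself is not shown here, so this is only a guess at the general shape). The pattern list is copied from above and the function name `find_suspicious` is hypothetical. Instead of exiting with a 500, it reports which fields matched so the submission can be marked as spam.

```python
import re

# Patterns copied from the list above; anything covered by "etc." is not guessed at.
SUSPICIOUS = re.compile(
    r"bcc:|cc:|cco:|to:|content-type|http:|https:",
    re.IGNORECASE,
)


def find_suspicious(form: dict[str, str]) -> dict[str, list[str]]:
    """Map each field name to the suspicious tokens found in its value."""
    hits: dict[str, list[str]] = {}
    for name, value in form.items():
        found = SUSPICIOUS.findall(value)
        if found:
            hits[name] = [match.lower() for match in found]
    return hits
```

One caveat with these exact patterns: `to:` also matches inside ordinary words followed by a colon (for example "photo:"), so on a real site the patterns probably want word boundaries or per-field rules to keep false positives down.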
Where it doesn’t work is for smart(er) bots that skip the hidden field and just fill the form fields with (seemingly) random junk text, with (maybe) a random email address in an email field, to test the form etc. For those forms on some low volume sites I started using this IP Address API.
Many of the bots are running on servers in a datacenter etc., so you can use the info returned by that API to help determine whether you want to reject, mark as spam, or whatnot.
The info returned by the API is quite extensive. If you’re careful, the false positive rate can be kept fairly low but it needs to be tuned / adjusted for the website where the form is located.
The ML idea crossed my mind as well but at this point I don’t (yet) have the ML chops to think it through.
Hope that helps!
If you control both ends, you can use signed URLs to ensure the form data is coming from a known website. A signed URL adds a token to the query params. The token is calculated on the client side using some form of algorithm and a key, and is then recomputed on the backend when the request is received. The two values are then compared, and if they match, the request is allowed.
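As a sketch of the compute-then-compare step, here is one way it could look using HMAC-SHA256 from Python's standard library. HMAC is an assumption on my part (the "some form of algorithm" could be something else), and the function names `sign` and `verify` and the parameter names in the usage note are hypothetical.

```python
import hashlib
import hmac
from urllib.parse import urlencode


def sign(params: dict[str, str], key: bytes) -> str:
    """Compute a hex token over the query params, independent of their order."""
    # Sorting and URL-encoding gives both sides the same canonical string.
    canonical = urlencode(sorted(params.items()))
    return hmac.new(key, canonical.encode("utf-8"), hashlib.sha256).hexdigest()


def verify(params: dict[str, str], token: str, key: bytes) -> bool:
    """Recompute the token on the receiving side and compare in constant time."""
    recomputed = sign(params, key)
    return hmac.compare_digest(recomputed.encode("utf-8"), token.encode("utf-8"))
```

The sender would call `sign({"site": "example.com", "ts": "1700000000"}, key)` and append the result as a `token` query param; the backend strips `token` off, calls `verify` with the remaining params, and allows the request only if it returns `True`. The comparison is only meaningful as long as the key stays secret on both ends.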
I have a blog post, “Pass data from site A to site B”, that talks about how to post data from one site to another securely. Basically, I require a click-through from the user before the action is performed.
Mindblown! Very interesting.
This sounds very interesting, can you tell us more about how this would work or provide some resources perhaps?
For the method described by @connected-cjohnston, I am not sure it is secure enough. If the first site has no back-end, then whatever it does on the front-end can be reverse engineered and forged. One should never trust the front-end. If the first site has a back-end, and you control both sites, you could just do a back-channel API call with a shared secret.
Although this is somewhat Google specific, I think it is a good explanation of signed URLs
Google Signed URL
As others have mentioned, honeypots are a great way to identify possible spam, but if you want to block it - how far do you want to go?
Require JS for form submission (or use LV (LiveView) to only show the contents of the form once the initial page is loaded), in other words, make it so the user must have JS in order to proceed. Most bots do not have JS.
Check IPs - have there been other form submissions from the same IP?
Check form content - does it contain URLs, or is an exact copy of previous form submissions, or contains watched words?
Use 3rd party services to spot known spam email addresses/IPs etc.
Look at open source projects like Discourse to see the type of anti-spam measures they use.
Probably the biggest difference I have noticed between Discourse and other (non-SPA) forum platforms is that bots would need JS to register on a Discourse forum - that alone eradicates a massive number of spam accounts.
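The IP and content checks from the list above could be combined into something like the following sketch. Everything here is an assumption for illustration: the class name `SpamScorer`, the threshold, the watched words, and the in-memory storage (a real app would keep the counts and previous messages in a database or cache). It returns the list of reasons a submission looks spammy rather than a yes/no, which fits marking instead of blocking.

```python
import re
from collections import Counter

URL_RE = re.compile(r"https?://", re.IGNORECASE)
# Hypothetical watched words; a real list would be tuned per site.
WATCHED_WORDS = {"casino", "viagra", "crypto"}


class SpamScorer:
    """Keeps submission history in memory and reports why a message looks spammy."""

    def __init__(self, max_per_ip: int = 3) -> None:
        self.max_per_ip = max_per_ip
        self.ip_counts: Counter = Counter()
        self.seen_messages: set[str] = set()

    def score(self, ip: str, message: str) -> list[str]:
        reasons = []
        # Normalize case and whitespace so trivial variations count as copies.
        normalized = " ".join(message.lower().split())

        self.ip_counts[ip] += 1
        if self.ip_counts[ip] > self.max_per_ip:
            reasons.append("many_submissions_from_ip")
        if URL_RE.search(message):
            reasons.append("contains_url")
        if normalized in self.seen_messages:
            reasons.append("duplicate_content")
        if set(re.findall(r"[a-z0-9']+", normalized)) & WATCHED_WORDS:
            reasons.append("watched_word")

        self.seen_messages.add(normalized)
        return reasons
```

An empty list means nothing tripped. How many reasons it takes before a submission gets the spam flag is a per-site tuning decision.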
Have you checked out Cloudflare Turnstile? It’s a good, completely unobtrusive CAPTCHA alternative.