GitHub Spam is out of control
Update: Theo made a cool video about this article with some of his personal thoughts and experiences as well. Give it a watch when you get a chance: https://www.youtube.com/watch?v=CEZ3WEdNS-c
Spam is nothing new, and spam on GitHub is not particularly new either. Any site that accepts user-generated content needs to figure out how to prevent people from submitting spam, whether that is scams, malicious software, or X-rated material. I have been getting tagged in crypto-related scams for the past 6 months or so. In the past 24 hours alone, I have been tagged in two of them.
Normally, these crypto scams are posted as a comment that tags multiple people, and then almost immediately get deleted by the poster of the scam. It appears that this is a way to bypass spam filters, or at the very least make the scams harder to report. According to this post on GitHub’s community org, the end user gets an email with the full post and spam, but there is no easy way to report it since it is already deleted.
The Issue
Today, though, was my “lucky” day. I got tagged in two scams, but one of them is still up! So let’s take a look into it.
As we can see in the screenshot above, there is a copy-and-paste message from a seemingly auto-generated user and a bunch of real users tagged below as “Winners”. The full pull request can be found here: https://github.com/boazcstrike/github-readme-stats/pull/1 (Archive.org Link)
Let’s do a little experiment and search for the title of the comment on GitHub and see what we get:
https://github.com/search?q=AltLayer+Airdrop+Season+One+Announcement&type=pullrequests (Archive.org Link)
That is 274 comments on pull requests and 545 comments on issues: 819 spam comments in total. To be fair, I saw a couple of false positives in this search, but VERY few, since we searched for a long and very specific phrase. Assuming that 95% of them are correct matches, that is ~780 posts.
The REAL kicker: in all of those pull requests and issues, I could only find ones that were 24 hours old or newer. The oldest I could find was posted only 18 hours before the time of writing this article!
Each post has up to 20 users tagged in it. I do not know if this is a GitHub-imposed limit or if they might get flagged more easily if they tag more than 20 accounts. ~780 posts * 20 = up to 15,600 accounts tagged.
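For anyone who wants to check my napkin math, here is a tiny sketch of it. The variable names are made up for illustration, the 95% match rate is my own guess from above, and since each post tags at most 20 users the final number is an upper bound, not an exact count.

```python
# Counts reported by the GitHub search in this article.
pr_comments = 274
issue_comments = 545
total_hits = pr_comments + issue_comments  # 819

# Assumption from the article: roughly 95% of hits are real spam.
assumed_match_rate = 0.95
estimated_spam_posts = round(total_hits * assumed_match_rate)  # 778

# Each post tags at most 20 accounts, so this is an upper bound.
max_tags_per_post = 20
upper_bound_tagged = estimated_spam_posts * max_tags_per_post

print(total_hits, estimated_spam_posts, upper_bound_tagged)
```

Without rounding 778 up to ~780, the upper bound comes out at 15,560 instead of 15,600, which is close enough for this purpose.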
As I was finishing this article, I found another set of these with the title of “Binance Airdrop Guide: $500k Worth of Airdrop is Ready, here’s how to Claim”.
Another ~800 mentions of it. The interesting thing with this one is that some of these are over 1 month old! There are even 3 spam posts on 1 pull request, tagging 10 users each! https://github.com/varathsurya/nurse_management_api/pull/1 (Archive.org Link)
So that is another ~15k accounts tagged… We are at ~30k accounts tagged so far. Let’s look at who is doing most of the tagging.
Here are a few accounts I found:
https://github.com/devsquadcore (Archive.org Link)
https://github.com/mohamedata-code (Archive.org Link)
https://github.com/altagencyuk (Archive.org Link)
They seem to have a lot of similarities.
1) No profile picture
2) A couple of years old, but usually no commits and no repos
3) If they do have repos, each is usually a single commit of some open-source software (one account had 4 repos of Laravel, and one had 1 repo of WordPress).
WTF
Quick side note: How the actual fuck does GitHub NOT have a report button on a piece of user-generated content? Do you know the process of reporting this? Copy Link -> Go to user’s profile page -> Click Block & Report -> Click Report Abuse button -> *New page* Click “I want to report harmful… cryptocurrency abuse” -> Click “I want to report suspicious cryptocurrency or mining content.” button -> FINALLY paste the link you copied 10 years ago into the form box, give your justification for why this user did a bad thing, and hope that the link still works/content is still up by the time they get around to looking at it…
That is 7 different steps on 3 different pages with multiple modals/dropdowns… Come on, that is WAY too much. I have never reported these before because it was too much work. I legit gave up and just ignored it because I knew it was a scam and wasn’t going to fall for it. IF YOU WANT YOUR USERS TO HELP YOU, MAKE IT EASY FOR THEM!
*Sorry, had to get that off my chest. It always seems that Trust and Safety UI/UX things like that are given little time and thought because they are not the cool, sexy, flashy features that users see or care about most of the time… until the spam starts!
The Fix
So what can be done about this? What can GitHub do? I have a couple of “simple” ideas. I say simple because I realize that not only is user-generated content moderation an uphill battle, but doing it at scale adds another level of complexity to it all.
If a user is posting multiple comments in a relatively short period of time (let’s say a day), have some system that checks whether each one is a 95% copy and paste of their other comments. OK, this could snag some real users who, say, use templates in their PRs or issues. Fine, there must be some way to rate that account on a number of other factors and its past activity. If they have no repos, no commits in any repos (public or private), no profile picture, no bio, no SSH keys, etc., and all they are doing is making comments… that is a lot of red flags to me personally.
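To make that idea a bit more concrete, here is a rough sketch of how the two checks could be combined. Everything in it is an assumption on my part: the AccountProfile fields, the thresholds, and the function names are invented for illustration and have nothing to do with GitHub’s real data model or internal tooling. It uses Python’s standard difflib to measure how similar two comments are.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass
class AccountProfile:
    # Hypothetical fields, not GitHub's real data model.
    repo_count: int
    commit_count: int
    has_avatar: bool
    has_bio: bool
    ssh_key_count: int


def red_flag_count(profile: AccountProfile) -> int:
    """Count how many 'empty account' signals apply to this profile."""
    flags = [
        profile.repo_count == 0,
        profile.commit_count == 0,
        not profile.has_avatar,
        not profile.has_bio,
        profile.ssh_key_count == 0,
    ]
    return sum(flags)


def near_duplicate_ratio(comments: list[str], threshold: float = 0.95) -> float:
    """Fraction of comments that are at least `threshold` similar to the first one."""
    if len(comments) < 2:
        return 0.0
    first = comments[0]
    similar = sum(
        1
        for other in comments[1:]
        # autojunk=False keeps the comparison exact on long, repetitive text.
        if SequenceMatcher(None, first, other, autojunk=False).ratio() >= threshold
    )
    return similar / (len(comments) - 1)


def looks_like_spammer(
    profile: AccountProfile,
    recent_comments: list[str],
    min_flags: int = 4,        # assumed threshold
    min_dup_ratio: float = 0.8,  # assumed threshold
) -> bool:
    """Flag only when BOTH the account looks empty and the comments are copy-paste."""
    return (
        red_flag_count(profile) >= min_flags
        and near_duplicate_ratio(recent_comments) >= min_dup_ratio
    )
```

Requiring both conditions is the point: a real user who pastes the same PR template all day still has repos, commits, and a profile, so the account signals keep them from being flagged.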
Another “simple” idea is to compare comments site-wide with each other. They are using the same heading, same body, same image, same links, and just changing who they are tagging. That is a pretty big red flag for me as well. Also, tagging 20 people (even 10 people) at a time can be a red flag. Maybe not once or twice, but if they do it multiple times and always to different users, then that should trigger something to prevent them from posting.
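Here is one way the site-wide comparison could look, again strictly as a sketch under my own assumptions. The idea is to strip the @mentions out of each comment, hash what is left, and look for the same hash showing up on many posts that each tag a different crowd. The regex, the function names, and the thresholds (3 posts, 10 mentions) are all made up for illustration, and the mention regex is deliberately naive (it would also match part of an email address).

```python
import hashlib
import re
from collections import defaultdict

# Naive pattern for @mentions: letters, digits, and hyphens after an "@".
MENTION_RE = re.compile(r"@[A-Za-z0-9-]+")


def split_comment(body: str) -> tuple[str, frozenset[str]]:
    """Return (hash of the body without mentions, set of mentioned users)."""
    mentions = frozenset(m.lower() for m in MENTION_RE.findall(body))
    stripped = MENTION_RE.sub("", body)
    # Lowercase and collapse whitespace so trivial edits do not change the hash.
    normalized = " ".join(stripped.lower().split())
    fingerprint = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
    return fingerprint, mentions


def find_mass_mention_campaigns(
    comments: list[str],
    min_posts: int = 3,      # assumed threshold
    min_mentions: int = 10,  # assumed threshold
) -> set[str]:
    """Fingerprints reused on at least `min_posts` comments, each tagging at
    least `min_mentions` users, with a different set of users each time."""
    groups: dict[str, list[frozenset[str]]] = defaultdict(list)
    for body in comments:
        fingerprint, mentions = split_comment(body)
        if len(mentions) >= min_mentions:
            groups[fingerprint].append(mentions)
    return {
        fingerprint
        for fingerprint, mention_sets in groups.items()
        if len(set(mention_sets)) >= min_posts
    }
```

An exact hash like this only catches true copy-paste. Catching lightly reworded variants would need a fuzzy comparison instead, which is a harder problem.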
Conclusion
With the rise of generative AI, and ChatGPT being able to write endless variations of 1 spam template to bypass the similarity check I just proposed above, content moderation will continue to be an uphill battle. It most likely will get even harder! I am a bit surprised, though, by GitHub’s seeming lack of ability to handle this sort of spam. I am 100% sure (no proof, though) that intelligent people are already working on this at GitHub, but it’s clear that they need a concrete plan moving forward. They need to put some real effort into it. Hell, train some AI to auto-filter or auto-rank comments before they get posted. If there are too many red flags, then hold those comments for human moderation before letting them be posted. Spam is nothing new, and I am sure that spam on GitHub is nothing new, but it seems to be getting worse, and the only ones getting better are the spammers.
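The “hold it for a human” step does not need to be fancy. As a minimal sketch, assuming some upstream checks already produce a set of named yes/no signals (the signal names, the Decision enum, and the cutoff of 2 are all hypothetical), the gate could be as small as this:

```python
from enum import Enum


class Decision(Enum):
    PUBLISH = "publish"
    HOLD_FOR_REVIEW = "hold_for_review"


def triage(signals: dict[str, bool], max_red_flags: int = 2) -> tuple[Decision, list[str]]:
    """Hold a comment for a human moderator when too many signals fire.

    Returns the decision plus the names of the signals that fired, so the
    moderator can see why the comment was held.
    """
    fired = sorted(name for name, is_set in signals.items() if is_set)
    if len(fired) > max_red_flags:
        return Decision.HOLD_FOR_REVIEW, fired
    return Decision.PUBLISH, fired
```

The hard part is obviously the signals themselves, not this gate, but returning the reasons alongside the decision at least makes the moderator’s job quicker.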