A project I have been working on for some time has now been online for a few days: it is a solution for monitoring the Internet. Right now the main focus is on hunting for phishing kits and webshells. The solution is designed to perform various other tasks, but given the resources this type of work consumes (and the available resources are already almost at their limit), for now I am limiting its functionality.
It is a hobby project, so I work on it in my spare time, but lately it has been giving me some satisfaction!
What I like most about this project is that it poses enough challenges to keep me learning new things on different fronts: design, programming, databases, systems, networks and, of course, security.
If you are interested in the output of this project, you can find the results on my Twitter account (which has by now become unusable given the volume of messages the platform produces…).
The solution consists of several subsystems, and I have decided to release some of them on GitHub. The first project I released is Argilla; it is still in beta and has some bugs, as well as limited functionality, but it is lightweight and does what I need right now. When I have time I will evolve it to make it more usable in other contexts.
Currently about 4,000,000 websites are analyzed every day; for each site the content is analyzed and a value indicating the level of risk is calculated. Sites with higher risk values are published on Twitter as “Threat” or “Possible threat”. Before being published on Twitter they are reported to Netcraft through the appropriate API. The Twitter post includes, in addition to the link, some hashtags (mainly #phishing and #opendir) and, since the last release, a tag for the registrar that registered the domain.
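To give an idea of the publishing step, here is a minimal sketch in Python (the real components are in .NET). Everything here is an assumption for illustration: the function names, the thresholds, and the message format are mine, not the project's actual values.

```python
# Illustrative sketch only: thresholds, names and message format are
# assumptions, not the project's real implementation.

def classify_risk(score, threat_threshold=0.8, possible_threshold=0.5):
    """Map a computed risk score in [0, 1] to a publication label."""
    if score >= threat_threshold:
        return "Threat"
    if score >= possible_threshold:
        return "Possible threat"
    return None  # below the reporting threshold: not published

def build_post(url, score):
    """Compose the text that would be posted for a risky site."""
    label = classify_risk(score)
    if label is None:
        return None
    # In the real pipeline the site is first reported via API,
    # then posted to Twitter with hashtags and the registrar tag.
    hashtags = "#phishing #opendir"
    return f"[{label}] {url} {hashtags}"

print(build_post("http://example.com/kit/", 0.9))
```

The point of a two-tier threshold is to separate confident detections from borderline ones, so followers can filter on the label.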
The application components are developed in .NET (both Core and Framework); the operating systems are Windows Server 2019 (for the database server) and Linux (for the agents). The database engine is SQL Server 2019. There are four databases totaling about 300 GB, growing by about 6 GB per day.
Right now I am evaluating the possibility of adopting TensorFlow for some tasks, but I am having some difficulty creating useful datasets. If anyone has the skills and wants to work on this project, any help is welcome!
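One possible starting point for the dataset problem is turning crawled pages into simple numeric feature vectors with labels, which could then be fed to TensorFlow. The sketch below is purely illustrative: the feature set, the keyword list and the labels are my assumptions, not the project's actual schema.

```python
# Hypothetical feature extraction for crawled pages; the features and
# keyword list below are assumptions for illustration only.

SUSPICIOUS_WORDS = ("login", "verify", "password", "account")

def extract_features(html):
    """Derive a few simple numeric features from raw page content."""
    text = html.lower()
    return [
        len(html),                                    # page size
        text.count("<form"),                          # number of forms
        text.count("<input"),                         # number of inputs
        sum(text.count(w) for w in SUSPICIOUS_WORDS), # suspicious words
    ]

def build_dataset(pages):
    """pages: iterable of (html, label) pairs; returns X, y lists."""
    X, y = [], []
    for html, label in pages:
        X.append(extract_features(html))
        y.append(label)
    return X, y

# Toy examples: 1 = phishing-like, 0 = benign.
samples = [
    ("<form><input name='pw'>Verify your account password</form>", 1),
    ("<h1>Recipe blog</h1><p>Today: pasta.</p>", 0),
]
X, y = build_dataset(samples)
print(X, y)
```

Lists like `X` and `y` can be loaded directly with `tf.data.Dataset.from_tensor_slices((X, y))`; the hard part, as noted above, is obtaining reliable labels at scale.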