Scraping the TOR for rare contents

Pierluigi Paganini July 18, 2019

Cyber security expert Marco Ramilli explains the difficulties for scraping the ‘TOR networks’ and how to enumerate hidden-services with scrapers.

Scraping the “TOR hidden world” is a quite complex topic. First of all you need an exceptional computational power (RAM mostly) for letting multiple runners grab web-pages, extracting new links and re-run the scraping-code against the just extracted links. Plus a queue manager system to manage scrapers conflicts and a database to store scraped data need to be consistent. Second, you need great starting points. In other words you need the .onion addresses where your scrapers start from. You might decide to begin from common and well-known onion links such as The TOR-hidden-wiki or to start from great reddit threads such this one, but seldom those approaches bring you to what I refer as “interesting links”. For this post “interesting links” means specific links that are rare or not very widespread and mostly focused on cyber-attacks and/or cyber-espionage. Another approach needs be used in order to reach better results. One of the most profitable way to search for “interesting links” is to look for .onion addresses in temporal and up-to-date spots such as: temporal pasties, IRC chats, slack or telegram groups, and so on and so forth. In there you might find links that bring you to more rare contents and to less spread information.

Today I want to start from here by showing some simple stats about scraped .onion links in my domestic scraping cluster. From the following graph you might appreciate some statistics of active-and-inactive scraped hidden services. The represented week is actually a great stereotype of what I’ve got in the last whole quarter. What is interesting, at least in my personal point of view, is the percentage of offline (green) onion services versus the percentage of online (yellow) onion services.

Tor crawlers

This scenario changed dramatically in the past few months. While during Q1 (2019) most of the scraped websites were absolutely up-and-running on Q2 (2019) I see, most of the scraped hidden services, dismissed and/or closed even if they persists in the communication channels (IRC chat, Pasties, Telegram, etc.).

I think there are dual factors that so much affected last quarter in spotting active hidden service. (1) Old content revamping. For example bots pushing “interesting links” back online even after months of inactivity. This activity is not new at all, but during the past quarter has been abused too many time respect to previous quarters. (2) Hidden services are changing address much more fast respect to few months ago. In order to make hard to spot malicious actors, they might decide to keep up-and-running their hidden services only for few hours and then change address/location. Is that way to enumerate hidden-services passing away or is it a simple weird time-frame? We will see it during the next “Scraping” months, stay tuned !

About the author: Marco Ramilli, Founder of Yoroi

I am a computer security scientist with an intensive hacking background. I do have a MD in computer engineering and a PhD on computer security from University of Bologna. During my PhD program I worked for US Government (@ National Institute of Standards and Technology, Security Division) where I did intensive researches in Malware evasion techniques and penetration testing of electronic voting systems.

I do have experience on security testing since I have been performing penetration testing on several US electronic voting systems. I’ve also been encharged of testing uVote voting system from the Italian Minister of homeland security. I met Palantir Technologies where I was introduced to the Intelligence Ecosystem. I decided to amplify my cybersecurity experiences by diving into SCADA security issues with some of the biggest industrial aglomerates in Italy. I finally decided to found Yoroi: an innovative Managed Cyber Security Service Provider developing some of the most amazing cybersecurity defence center I’ve ever experienced! Now I technically lead Yoroi defending our customers strongly believing in: Defence Belongs To Humans.

This analysis and many other studies and tools are available on Marco Ramilli’s blog:

https://marcoramilli.com/2019/07/18/scraping-the-tor-for-rare-contents/

[adrotate banner=”9″] [adrotate banner=”12″]

Pierluigi Paganini

(SecurityAffairs – Tor network, DarkWeb)

[adrotate banner=”5″]

[adrotate banner=”13″]



you might also like

leave a comment