Open source intelligence techniques

December 08, 2025

Tools for extracting metadata from PDFs and images.

Today I am going to show the three best tools for extracting metadata from PDFs and images. First, pdfinfo: great for checking authors, creation dates and the software used. Next, ExifTool: the most powerful metadata extractor, revealing GPS data, device information and advanced metadata. And finally, Metadata2Go: a quick online analyzer for when you need immediate results. Use these tools in your OSINT workflow to verify documents, trace the origin of photos and uncover hidden details. 🎥 PDF and Image Metadata Extraction: Complete Tutorial. In this video, I share different techniques and tools that you can use to extract metadata from images or PDF files. We will look at three essential tools: Pdfinfo, ExifTool and Metadata2Go. 🔹 1. PDFINFO: Extracting Metadata from PDFs (Local Tool). What it does: pdfinfo reads the structural and authorship metadata stored inside PDF files. ✅ How to use (Linux...

Dark Web OSINT With Python and OnionScan: Part One

Written by Justin,

July 28th, 2016

You may have heard of this awesome tool called OnionScan that is used to scan hidden services in the dark web looking for potential data leaks. Recently the project released some cool visualizations and a high level description of what their scanning results looked like. What they didn’t provide is how to actually go about scanning as much of the dark web as possible, and then how to produce those very cool visualizations that they show.

At a high level we need to do the following:

Set up a server somewhere to host our scanner 24/7 because it takes some time to do the scanning work.
Get TOR running on the server.
Get OnionScan set up.
Write some Python to handle the scanning and some of the other data management to deal with the scan results.
Write some more Python to make some cool graphs. (Part Two of the series)

Let’s get started!

Setting up a Digital Ocean Droplet

If you already use Amazon, or have your own Linux server somewhere you can skip this step. For the rest of you, you can use my referral link here to get a $10 credit with Digital Ocean that will get you a couple months free (full disclosure I make money in my Digital Ocean account if you start paying for your server, feel free to bypass that referral link and pay for your own server). I am assuming you are running Ubuntu 16.04 for the rest of the instructions.

The first thing you need to do is to create a new Droplet by clicking on the big Create Droplet button.
Next select an Ubuntu 16.04 configuration, and select the $5.00/month option (unless you want something more powerful).
You can pick a datacenter wherever you like, and then scroll to the bottom and click Create.

It will begin creating your droplet, and soon you should receive an email with how to access your new Linux server. If you are on Mac OSX or Linux get your terminal open. If you are on Windows then grab Putty from here.

On Mac OSX it is: Finder -> Applications -> Utilities -> Terminal
On Linux: Click your start menu and search for Terminal

Now you are going to SSH into your new server. Windows Putty users just punch in the IP address that you received in your email and hit Enter. You will be authenticating as the root user, so type in the password you were provided in your email.

For Mac OSX and Linux people you will type the following into your terminal:

ssh root@IPADDRESS

You will be forced to enter your password a second time, and then you have to change your password. Once that is done you should be logged into your server.

Installing Prerequisites

Now we need to install the prerequisites for our upcoming code and for OnionScan. Follow each of these steps carefully. The instructions are the same for Mac OSX, Linux or Windows because the commands are all being run on the server.

Feel free to copy and paste each command instead of typing it out. Hit Enter on your keyboard after each step and watch for any problems or errors.

screen
apt-get update
apt-get install tor git bison libexif-dev
apt-get install python-pip
pip install stem

Now we need to install the Go requirements (OnionScan is written in Go). The following instructions are from Ryan Frankel’s post here.

bash < <(curl -s -S -L https://raw.githubusercontent.com/moovweb/gvm/master/binscripts/gvm-installer)
[[ -s "$HOME/.gvm/scripts/gvm" ]] && source "$HOME/.gvm/scripts/gvm"
source /root/.gvm/scripts/gvm
gvm install go1.5 --binary
gvm use go1.5

Ok beauty we have Go installed. Now let’s get OnionScan set up by entering the following:

go get github.com/s-rah/onionscan
go install github.com/s-rah/onionscan

Now if you just type:

onionscan

And hit Enter you should get the onionscan command line usage information. If this all worked then you have successfully installed OnionScan. If you for some reason close your terminal and you can’t run the onionscan binary anymore just simply do a:

gvm use go1.5

and it will fix it for you.

Now we need to make a small modification to the TOR configuration to allow our Python script to request a new identity (a new IP address) which we will use when we run into scanning trouble later on. We have to enable this by doing the following:

tor --hash-password PythonRocks

This will give you output that will include the bottom line that looks like this:

16:3E73307B3E434914604C25C498FBE5F9B3A3AE2FB97DAF70616591AAF8

Copy this line and then type:

nano -w /etc/tor/torrc

This will open a simple text editor. Now go to the bottom of the file by hitting the following keystrokes (or endlessly scrolling down):

CTRL+W CTRL+V

Paste in the following values at the bottom of the file:

ControlPort 9051
ControlListenAddress 127.0.0.1
HashedControlPassword 16:3E73307B3E434914604C25C498FBE5F9B3A3AE2FB97DAF70616591AAF8

Now hit CTRL+O to write the file and CTRL+X to exit the file editor. Now type:

service tor restart

This will restart TOR and it should have our new settings in place. Note that if you want to use a password other than PythonRocks you will have to follow the steps above substituting your own password in place, and you will also have to later change the associated Python code.
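
If you want to double check that the paste into torrc worked, a few lines of Python can do it. This is only a rough sketch and not part of the original setup: the helper name missing_control_settings is made up for illustration, it is written for Python 3, and it assumes the simple torrc layout used above (one directive per line, with # starting a comment).

```python
# Hypothetical helper: report which control-port directives from this
# tutorial are absent from the text of a torrc file.
REQUIRED = ("ControlPort", "HashedControlPassword")


def missing_control_settings(torrc_text):
    present = set()
    for line in torrc_text.splitlines():
        line = line.split("#", 1)[0].strip()      # drop comments and padding
        if line:
            present.add(line.split()[0].lower())  # first token is the directive name
    return [name for name in REQUIRED if name.lower() not in present]
```

You could call it with the contents of /etc/tor/torrc; an empty list back means both directives were found.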

We are almost ready to start writing some code. The last step is to grab my list of .onion addresses (at last count around 7182 addresses) so that your script has a starting point to start scanning hidden services.

wget https://raw.githubusercontent.com/automatingosint/osint_public/master/onionrunner/onion_master_list.txt

Whew! We are all set up and ready to start punching out some code. At this point you can switch to your local machine or if you are comfortable writing code on a Linux server by all means go for it. I find it easier to use WingIDE on my local machine personally.

A Note About Screen

You may have noticed that the first thing I had you run in the instructions above was the screen command. This is a handy way to keep your session alive even if you get disconnected from your server. When you want to jump back into that session, you simply SSH back into the server and execute:

screen -rx

This will be handy later on when you start doing your scanning work, as it can take days for it to complete fully.

Writing an OnionScan Wrapper

OnionScan is a great tool but we need to be able to systematically control it, and process the results. As well, TOR connections are notoriously unstable so we need a way to kill a stuck scan process and grab a fresh IP address from the TOR network. Let’s get coding! Crack open a new Python file, name it onionrunner.py and start punching out the following (you can download the full code here).

from stem.control import Controller
from stem import Signal
from threading import Timer
from threading import Event

import codecs
import json
import os
import random
import subprocess
import sys
import time

onions         = []
session_onions = []

identity_lock  = Event()
identity_lock.set()

Lines 1-12: we import all of the required modules that we are going to be using in this script.
Lines 14-15: we initialize two empty lists to hold our full onion list and the list of onions we are working through during the current scanning session.
Lines 17-18: we utilize an Event object that will help us to coordinate two threads that will be executing. We have to set the Event object first so that by default our main thread will execute later. More on these threads later.
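
If the Event object is new to you, here is a tiny stand-alone sketch of the idea, written for Python 3 rather than the Python 2 used in this post. The names gate, worker and demo are made up for illustration, and the short sleep is just a stand-in for the real work of killing a scan and switching identities.

```python
import threading
import time

gate = threading.Event()
gate.set()                      # "open": wait() returns immediately


def worker(log, started):
    gate.clear()                # close the gate so the main thread blocks in wait()
    started.set()               # tell the main thread the gate is now closed
    time.sleep(0.1)             # stand-in for the slow recovery work
    log.append("worker done")
    gate.set()                  # reopen the gate


def demo():
    log = []
    started = threading.Event()
    thread = threading.Thread(target=worker, args=(log, started))
    thread.start()
    started.wait()              # make sure the worker has cleared the gate
    gate.wait()                 # blocks here until the worker calls set()
    log.append("main resumed")
    thread.join()
    return log
```

Calling demo() shows the main thread only carrying on once the worker has set the Event again.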

Now we have to build some helper functions that will deal with loading our master list of onions and to be able to continue adding newly discovered onions to this list:

#
# Grab the list of onions from our master list file.
#
def get_onion_list():

	# open the master list
	if os.path.exists("onion_master_list.txt"):

		with open("onion_master_list.txt","rb") as fd:

			stored_onions = fd.read().splitlines()
	else:
		print "[!] No onion master list. Download it!"
		sys.exit(0)

	print "[*] Total onions for scanning: %d" % len(stored_onions)

	return stored_onions
#
# Stores an onion in the master list of onions.
#
def store_onion(onion):

	print "[++] Storing %s in master list." % onion

	with codecs.open("onion_master_list.txt","ab",encoding="utf8") as fd:
		fd.write("%s\n" % onion)

	return

Line 23: we define our get_onion_list function that is going to load our master list.
Lines 26-33: we check to see if the onion_master_list.txt file is present (26) and if it is we crack it open (28) and then read the contents back and split them so that each line gets appended to a list called stored_onions (30). If the file isn’t present then we output an error message (32) and exit the script (33).
Lines 35-37: we simply output the total number of onions loaded (35) and return the list back from the function (37).
Line 41: we define our store_onion function that takes a single parameter onion which is the hidden service we wish to add to the master list.
Lines 45-46: we crack open the master list file (45) and then write out the hidden service address (46).
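
For anyone following along on Python 3, the same two helpers might look roughly like the sketch below. Treat it as an illustration only: the function name load_onions and the path parameter are my own additions, and unlike the script above it returns an empty list instead of exiting when the file is missing.

```python
import os


def load_onions(path):
    """Return the non-empty lines of the master list, or [] if it is missing."""
    if not os.path.exists(path):
        return []
    with open(path, encoding="utf8") as fd:
        return [line.strip() for line in fd if line.strip()]


def store_onion(path, onion):
    """Append one hidden service address to the master list."""
    with open(path, "a", encoding="utf8") as fd:
        fd.write(onion + "\n")
```

Opening the file in text mode with an explicit encoding means the addresses come back as ordinary strings, so they can be compared directly with addresses pulled out of the JSON results later on.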

Now we will implement the function that deals with running the onionscan binary to do the actual scanning work. Keep adding code in your editor:

#
# Runs onion scan as a child process.
#
def run_onionscan(onion):

	print "[*] Onionscanning %s" % onion

	# fire up onionscan (flags must carry their leading dashes, otherwise
	# webport=0 is treated as the address to scan)
	process = subprocess.Popen(["onionscan","--webport=0","--jsonReport","--simpleReport=false",onion],stdout=subprocess.PIPE,stderr=subprocess.PIPE)

	# start the timer and let it run 5 minutes
	process_timer = Timer(300,handle_timeout,args=[process,onion])
	process_timer.start()

	# wait for the onion scan results
	stdout = process.communicate()[0]

	# we have received valid results so we can kill the timer
	if process_timer.is_alive():
		process_timer.cancel()
		return stdout

	print "[!!!] Process timed out!"

	return None

Line 53: we define the run_onionscan function to take one parameter onion that is the address of our hidden service.
Line 58: here we are using the subprocess.Popen class to start onionscan passing in the command line arguments --jsonReport and --simpleReport=false which will give us JSON output on STDOUT and disable the normal output from OnionScan. The final two parameters are telling Popen that we want to communicate with stdout and stderr meaning we want to be able to retrieve the output of both.
Lines 61-62: here is where we have a bit of magic. We create a new Timer object that is provided from the threading module. A Timer will run for a specified time, and then execute a function when that time has been reached unless you cancel the Timer. In this case we are setting it to 300 seconds (5 minutes) and then telling it to call the handle_timeout function when 300 seconds have been hit. We also pass in the process object and the current onion we are processing. This will allow us to handle when onionscan executes for 5 minutes which could indicate that our Tor connection has gone down or that the hidden service can’t be reached any longer, so we want to be able to kill the onionscan, request a new IP from the Tor network, and continue working through our list of hidden services. We start the timer on line 62.
Line 65: here we are waiting for OnionScan to return the JSON results from the scan and we store it in the stdout variable.
Lines 68-70: if we reach this line then we know that OnionScan was finished before the 300 seconds are up, so we check if the Timer is still running (68) and then cancel the Timer (69) and return the JSON output (70).
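
To see the Timer trick on its own, here is a minimal Python 3 sketch that wraps any command. It is not the code from this post: run_with_timeout is a hypothetical name, the timeout handler only kills the child (no Tor identity switch), and it uses an explicit flag rather than is_alive() to decide whether the timeout fired.

```python
import subprocess
import threading


def run_with_timeout(cmd, seconds):
    """Run cmd and return its stdout bytes, or None if it had to be killed."""
    process = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    timed_out = threading.Event()

    def on_timeout():
        timed_out.set()          # record the timeout before killing the child
        process.kill()

    timer = threading.Timer(seconds, on_timeout)
    timer.start()
    stdout = process.communicate()[0]   # returns on normal exit or after the kill
    timer.cancel()                      # harmless if the timer already fired
    return None if timed_out.is_set() else stdout
```

On Python 3 you could also pass timeout= to communicate() and catch subprocess.TimeoutExpired, but the Timer version keeps the same shape as the script in this post, where the timeout handler has extra work to do.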

So there you have a neat trick to deal with some timing issues when running command line binaries. Now let’s implement the actual timeout handling function to deal with killing the OnionScan process and requesting a new IP from the Tor network. Keep on adding code:

#
# Handle a timeout from the onionscan process.
#
def handle_timeout(process,onion):

	global session_onions
	global identity_lock

	# halt the main thread while we grab a new identity
	identity_lock.clear()

	# kill the onionscan process
	try:
		process.kill()
		print "[!!!] Killed the onionscan process."
	except:
		pass

	# Now we switch TOR identities to make sure we have a good connection
	with Controller.from_port(port=9051) as torcontrol:

		# authenticate to our local TOR controller
		torcontrol.authenticate("PythonRocks")

		# send the signal for a new identity
		torcontrol.signal(Signal.NEWNYM)

		# wait for the new identity to be initialized
		time.sleep(torcontrol.get_newnym_wait())

		print "[!!!] Switched TOR identities."

	# push the onion back on to the list
	session_onions.append(onion)
	random.shuffle(session_onions)

	# allow the main thread to resume executing
	identity_lock.set()

	return

Line 79: we define the handle_timeout function that takes the process parameter (our Popen object) and the onion parameter which is the current hidden service we are scanning.
Line 85: here we are clearing the identity_lock which will halt our main thread (you’ll see in a bit). This will allow us to do the process killing, and grab a new identity without the main thread trying to process a new hidden service. We want to be able to cleanly deal with the onionscan process that has timed out before continuing on to a new hidden service.
Lines 88-92: here we are using the kill() function that our process object has to kill off the onionscan process that took too long to execute.
Line 95: we now connect to our local Tor controller port and store the connection object in the torcontrol variable.
Line 98: we authenticate to the Tor controller using our PythonRocks password that you set at the beginning of this blog post. Remember if you decided to use a different password, make sure you put it in here.
Line 101: we send the signal to the local Tor controller that we would like a new identity (IP address).
Line 104: we pause execution until the new IP address has been acquired.
Lines 109-110: here we are re-adding the current hidden service back into our session list. This is because we didn’t get a full scan done on the hidden service so we want to make sure we re-scan it at some point in the future. We then shuffle the list (110) so that we don’t end up just grabbing this same hidden service again. If this hidden service is not working properly or is down, you would end up in an infinite loop of timeouts, kills, re-add to list, rescan. This is why we shuffle!
Line 113: we set the identity_lock object again so that the main thread is now notified to continue executing, which will load a fresh hidden service for scanning.
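
The identity switch can also be pulled out into its own small function, which makes it easier to try out without a running Tor. The Python 3 sketch below is an assumption-laden illustration: switch_identity is a hypothetical name, and it accepts any object that offers the same authenticate, signal and get_newnym_wait methods that the stem Controller is used with above.

```python
import time


def switch_identity(controller, password, newnym_signal, sleep=time.sleep):
    """Authenticate, ask for a new identity, then wait until Tor is ready."""
    controller.authenticate(password)
    controller.signal(newnym_signal)
    wait = controller.get_newnym_wait()   # seconds Tor asks us to wait
    sleep(wait)
    return wait


# Usage with stem and a running Tor (same calls as handle_timeout above):
#
#   from stem import Signal
#   from stem.control import Controller
#   with Controller.from_port(port=9051) as torcontrol:
#       switch_identity(torcontrol, "PythonRocks", Signal.NEWNYM)
```

Passing the controller in, instead of creating it inside the function, is what lets you swap in a stand-in object when testing.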

Now we need to implement the function that will handle processing the JSON results that OnionScan hands back to us. March on good Python soldier:

#
# Processes the JSON result from onionscan.
#
def process_results(onion,json_response):
	global onions
	global session_onions

	# create our output folder if necessary
	if not os.path.exists("onionscan_results"):
		os.mkdir("onionscan_results")

	# write out the JSON results of the scan
	with open("%s/%s.json" % ("onionscan_results",onion), "wb") as fd:
		fd.write(json_response)

	# look for additional .onion domains to add to our scan list
	scan_result = ur"%s" % json_response.decode("utf8")
	scan_result = json.loads(scan_result)

	if scan_result['identifierReport']['linkedOnions'] is not None:
		add_new_onions(scan_result['identifierReport']['linkedOnions'])

	if scan_result['identifierReport']['relatedOnionDomains'] is not None:
		add_new_onions(scan_result['identifierReport']['relatedOnionDomains'])

	if scan_result['identifierReport']['relatedOnionServices'] is not None:
		add_new_onions(scan_result['identifierReport']['relatedOnionServices'])

	return

Line 121: we define our process_results function to take in the onion parameter and the json_response respectively.
Lines 126-127: if the onionscan_results directory doesn’t exist (126) we create it (127) because that’s how we roll.
Lines 130-131: here we are writing out the JSON results to a file that is named by the hidden service that we just scanned. Pretty straightforward.
Lines 134-135: we do a bit of string conversion to get the JSON string into a format we can use (134) and then we decode the JSON (135) to turn it into a native Python dictionary.
Lines 137-144: there are three fields that we are interested in that could contain additional .onion domains that we may want to add to our list of scan targets. The linkedOnions, relatedOnionDomains and relatedOnionServices keys all will return lists. If they are set (that is, not None) we hand the list off to our add_new_onions function.
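
The three checks can be collapsed into a loop. Here is a small Python 3 sketch of that idea; the function name linked_onions is hypothetical, and the only thing it assumes about the report is the identifierReport structure and key names used in the code above.

```python
import json

KEYS = ("linkedOnions", "relatedOnionDomains", "relatedOnionServices")


def linked_onions(json_text):
    """Collect every address listed under the three identifierReport keys."""
    report = json.loads(json_text).get("identifierReport") or {}
    found = []
    for key in KEYS:
        found.extend(report.get(key) or [])   # a null or missing field adds nothing
    return found
```

Note that this returns duplicates if the same address shows up under more than one key; filtering them out is left to whatever adds them to the scan list.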

Let’s implement that function now.

#
# Handle new onions.
#
def add_new_onions(new_onion_list):

	global onions
	global session_onions

	for linked_onion in new_onion_list:

		if linked_onion not in onions and linked_onion.endswith(".onion"):

			print "[++] Discovered new .onion => %s" % linked_onion

			onions.append(linked_onion)
			session_onions.append(linked_onion)
			random.shuffle(session_onions)
			store_onion(linked_onion)

	return

Line 152: we define our add_new_onions function to take in the list of .onion domains we have just discovered.
Lines 157-159: we walk through the list of onions (157) and then check to make sure that we don’t already have this onion in our master list and that it is a .onion domain (159). There are cases where OnionScan will discover sites that are not in the dark web, and we’ll get to those in our visualization post.
Lines 163-166: we add the new onion to our master list (163), we add it to our current session list of onions to scan (164), we shuffle the session list again (165) and then we store the onion in our onion_master_list.txt file (166).
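
With thousands of onions, checking membership against a list gets slower as the list grows. One possible variation, sketched here in Python 3, keeps the known addresses in a set. This is my own illustration and not the code from this post: it takes the collections as parameters instead of using globals, and it leaves out the shuffle and the write to the master list file.

```python
def add_new_onions(new_onions, known, session):
    """Queue unseen .onion addresses and return the ones that were added."""
    added = []
    for onion in new_onions:
        if onion.endswith(".onion") and onion not in known:
            known.add(onion)        # set lookups stay fast as the list grows
            session.append(onion)
            added.append(onion)
    return added
```

Here known would be a set built from the master list and session the list of onions still to scan in the current run.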

Now let’s start putting the finishing touches on this script.

# get a list of onions to process
onions = get_onion_list()

# randomize the list a bit
random.shuffle(onions)
session_onions = list(onions)

count = 0

Line 171: we call our get_onion_list function that will load up all of our stored hidden service addresses.
Lines 174-175: we shuffle the onions up (174) and then create a copy of the list and store it in our session_onions variable (175).
Line 177: we initialize a counter variable that we will use to determine when we are finished looping over all of our hidden services.

Now it’s time to put the main loop in place that will be responsible for kickstarting OnionScan for each hidden service that we have stored.

while count < len(onions):

	# if the event is cleared we will halt here
	# otherwise we continue executing
	identity_lock.wait()

	# grab a new onion to scan
	print "[*] Running %d of %d." % (count,len(onions))
	onion  = session_onions.pop()

	# test to see if we have already retrieved results for this onion
	if os.path.exists("onionscan_results/%s.json" % onion):

		print "[!] Already retrieved %s. Skipping." % onion
		count += 1

		continue

	# run the onion scan
	result = run_onionscan(onion)

	# process the results
	if result is not None:

		if len(result):
			process_results(onion,result)

			count += 1

Line 179: we create our while loop that will stop executing once we have worked through all of our hidden services.
Line 183: we are waiting for our Event object to be set before continuing execution. You will remember that this will only halt here if our handle_timeout function is dealing with grabbing a new Tor identity. Once the identity_lock is set again we will move past this line.
Line 187: we remove a hidden service from our list and store it in the onion variable.
Lines 190-195: we are testing to see if we have already scanned the hidden service by checking to see if the JSON file exists (190) and if so we increment our count variable (193) and then we go back to the top of the while loop using the continue keyword (195).
Line 198: since we have not yet scanned the current hidden service, we kick off the scan process and return the result in the aptly named result variable.
Lines 201-206: if we get a good result back we test the length of the JSON string (203) and if it is greater than zero we pass the JSON string and hidden service off to our process_results function for storage (204) and then increment our count variable before returning to the top of the while loop.
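
If you want to convince yourself of how the counter and the requeueing interact without scanning anything, the loop can be simulated with a stand-in scan function. The Python 3 sketch below is a simplification under my own assumptions: run_session and its parameters are hypothetical, it requeues on any empty result directly in the loop (in the real script the requeue happens inside handle_timeout), and it will loop forever if a scan never succeeds.

```python
import random


def run_session(onions, scan, already_done=()):
    """Drive scan() over every onion; a None or empty result means retry later."""
    session = list(onions)
    results = {}
    count = 0
    while count < len(onions):
        onion = session.pop()
        if onion in already_done:        # stands in for the JSON file check
            count += 1
            continue
        result = scan(onion)
        if result:                       # good output: record it and advance
            results[onion] = result
            count += 1
        else:                            # failed scan: requeue and reshuffle
            session.append(onion)
            random.shuffle(session)
    return results
```

The counter only moves when an onion is either skipped or scanned successfully, which is why a failed scan has to be pushed back on to the session list for the loop to finish.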

Whew! That is a lot of code, but hopefully you have learned a few new Python coding tricks along the way, and it might give you ideas on how you can wrap other scanning software in a similar way as we did with OnionScan. Now for the moment of truth…

Let it Rip!

Now you are ready to start scanning! Simply run:

python onionrunner.py

And you should start seeing output like the following:

# python onionrunner.py
[*] Total onions for scanning: 7182
[*] Running 0 of 7182.
[*] Onionscanning nfokjthabqzfndmj.onion
[*] Running 1 of 7182.
[*] Onionscanning gmts3xxfrbfxdm3a.onion

…

If you check the onionscan_results directory you should see JSON files that are named after the hidden services that were scanned. Let this puppy run as long as you can tolerate. In the second post we are going to process these JSON files and begin to create some visualizations. For bonus points you can also push those JSON files into Elasticsearch (or modify onionrunner.py to do so on the fly) and analyze the results using Kibana!
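
As a starting point for poking at the results yourself, here is a short Python 3 sketch that loads every JSON file in the results directory into a dictionary. It is only an illustration: load_results is a hypothetical name, and the only things it assumes are the onionscan_results directory and the <onion>.json naming used by the script above.

```python
import glob
import json
import os


def load_results(directory="onionscan_results"):
    """Return {onion: parsed_report} for every readable JSON file in directory."""
    reports = {}
    for path in sorted(glob.glob(os.path.join(directory, "*.json"))):
        onion = os.path.splitext(os.path.basename(path))[0]   # drop ".json"
        try:
            with open(path, encoding="utf8") as fd:
                reports[onion] = json.load(fd)
        except ValueError:       # empty or truncated file from an interrupted scan
            continue
    return reports
```

Calling len(load_results()) is a quick way to see how many hidden services have been scanned so far.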

If you don’t want to wait to get all of the data yourself, you can download the scan results for 8,167 onions from here.
