OSINT Web Scraping Guide
AI-powered tools, techniques, and real-world examples for open source intelligence gathering.
OSINT Scraping Pipeline
A structured pipeline ensures your intelligence is reliable, organized, and actionable.
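As a sketch, such a pipeline can be expressed as a chain of stages. The stage names and stub bodies below are illustrative assumptions, not a canonical standard; each stub would be replaced by real scraping, extraction, and correlation logic.

```python
# Minimal OSINT pipeline sketch. Stage names (collect, extract,
# correlate, report) are illustrative, not a formal standard.

def collect(target: str) -> dict:
    """Gather raw material (HTML, JSON, documents) for a target."""
    return {"target": target, "raw": f"<html>About {target}</html>"}  # stub

def extract(collected: dict) -> dict:
    """Pull structured entities out of the raw material."""
    return {"target": collected["target"], "entities": ["example-entity"]}

def correlate(extracted: dict) -> dict:
    """Cross-reference entities against other sources before trusting them."""
    extracted["confirmed"] = list(extracted["entities"])
    return extracted

def report(correlated: dict) -> str:
    """Produce an actionable summary of confirmed findings."""
    return f"{correlated['target']}: {len(correlated['confirmed'])} confirmed entities"

def run_pipeline(target: str) -> str:
    stage_output = collect(target)
    for stage in (extract, correlate):
        stage_output = stage(stage_output)
    return report(stage_output)

print(run_pipeline("example.com"))
```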
AI-Powered Tools
LLM Extraction (GPT/Claude API)
Feed raw HTML/text to an LLM and ask it to extract structured data — names, emails, addresses, relationships — in JSON format. No regex needed.
Perplexity / You.com
AI search engines that combine web search with synthesis. Use them for quick profiling — they surface and summarize information from dozens of sources automatically.
Firecrawl
AI-native scraper that turns any website into clean markdown. Handles JavaScript rendering, pagination, and returns LLM-ready content via API.
Diffbot
ML-based extraction that automatically identifies and classifies content types (articles, people, products, events) without custom selectors.
Crawl4AI
Open-source, LLM-friendly web crawler. Extracts structured data with AI, supports async crawling, custom extraction strategies, and graph building.
AI Agents (LangChain / AutoGen)
Build autonomous OSINT agents that plan searches, scrape iteratively, cross-reference sources, and summarize findings without manual input.
Classic OSINT Scraping Tools
| Tool | Best For | Notes | Cost |
|---|---|---|---|
| Maltego | Entity mapping & relationship graphs | GUI-based, transforms API ecosystem, ideal for social network analysis | Freemium |
| theHarvester | Email, subdomain, host enumeration | Aggregates Google, Bing, Shodan, LinkedIn sources in one command | Free |
| Shodan | Internet-connected device indexing | Finds exposed cameras, servers, ICS — query with filters like org:, port: | Freemium |
| Scrapy | Large-scale custom web crawling | Python framework, highly extendable, middleware support for proxies/headers | Free |
| Playwright / Puppeteer | JS-rendered pages & auth-gated content | Headless browser automation, handles SPAs, CAPTCHAs, and login flows | Free |
| SpiderFoot | Automated OSINT reconnaissance | 100+ modules, GUI + CLI, correlates domains/IPs/emails/usernames automatically | Free |
| Recon-ng | Modular recon framework | Metasploit-style CLI, database-backed, great for structured campaigns | Free |
| OSINT Framework | Resource directory | Curated tree of tools by category — a starting reference, not a scraper itself | Free |
Key Techniques
Google Dorking
Advanced search operators to find exposed files, login pages, and sensitive information indexed by search engines.
```
# Find exposed environment files
site:example.com filetype:env
# Find open directories
intitle:"index of" inurl:/backup
# Find email addresses
site:linkedin.com "@company.com"
```
API Endpoint Discovery
Many sites expose unofficial APIs used by their mobile apps. Intercept with browser DevTools or mitmproxy to find structured data endpoints.
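Once the Network tab reveals such an endpoint, you can often call it directly. A stdlib sketch, where the endpoint URL and the exact header set are hypothetical placeholders (mirror whatever the real web app sends):

```python
import json
import urllib.request

def api_request(url: str) -> urllib.request.Request:
    """Build a request that mimics the site's own front-end XHR calls."""
    return urllib.request.Request(
        url,
        headers={
            # Headers are placeholders: copy the real ones from DevTools
            "User-Agent": "Mozilla/5.0",
            "Accept": "application/json",
            "X-Requested-With": "XMLHttpRequest",
        },
    )

def fetch_json(url: str) -> dict:
    """GET a discovered endpoint and decode its JSON body."""
    with urllib.request.urlopen(api_request(url), timeout=30) as resp:
        return json.loads(resp.read().decode("utf-8"))

# Usage (hypothetical endpoint):
# data = fetch_json("https://target-site.com/api/v2/results.json?page=1")
```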
Wayback Machine Mining
The Internet Archive stores historical snapshots. Use the CDX API to enumerate all snapshots of a target domain and extract deleted content.
```
https://web.archive.org/cdx/search/cdx
  ?url=*.example.com/*
  &output=json
  &fl=timestamp,original
  &filter=statuscode:200
```
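That query can be issued programmatically. A stdlib sketch using the same parameters (function names are my own; the first row of the CDX JSON response is a header):

```python
import json
import urllib.parse
import urllib.request

CDX_API = "https://web.archive.org/cdx/search/cdx"

def cdx_url(target: str) -> str:
    """Build a CDX query for every archived 200-OK capture under a domain."""
    params = {
        "url": f"*.{target}/*",
        "output": "json",
        "fl": "timestamp,original",
        "filter": "statuscode:200",
    }
    return CDX_API + "?" + urllib.parse.urlencode(params)

def snapshots(target: str) -> list:
    """Return (timestamp, url) pairs; the first CDX row is a header, so skip it."""
    with urllib.request.urlopen(cdx_url(target), timeout=60) as resp:
        rows = json.loads(resp.read().decode("utf-8"))
    return [tuple(row) for row in rows[1:]]

# Usage:
# for ts, url in snapshots("example.com"):
#     print(ts, url)
```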
Certificate Transparency
CT logs list every TLS certificate issued. Mine them via crt.sh to discover subdomains, internal services, and infrastructure for any domain.
```
# Returns all subdomains with certs
https://crt.sh/?q=%25.example.com&output=json
```
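A sketch of harvesting subdomains from crt.sh's JSON output. Treat the record layout as an assumption: in current responses each record's name_value field carries newline-separated names, but verify against a live response before relying on it.

```python
import urllib.parse

def crtsh_url(domain: str) -> str:
    """Query crt.sh for all certs matching %.domain (%25 is an escaped %)."""
    return "https://crt.sh/?" + urllib.parse.urlencode(
        {"q": f"%.{domain}", "output": "json"}
    )

def unique_subdomains(records: list) -> set:
    """Dedupe names; name_value may pack several newline-separated entries."""
    names = set()
    for rec in records:
        for name in rec.get("name_value", "").splitlines():
            names.add(name.strip().lstrip("*."))  # drop wildcard prefixes
    return names

# Usage:
# import json, urllib.request
# with urllib.request.urlopen(crtsh_url("example.com"), timeout=60) as resp:
#     print(sorted(unique_subdomains(json.loads(resp.read()))))
```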
Metadata Extraction
Documents, images, and PDFs leak author names, GPS coordinates, software versions, and internal paths via EXIF/metadata.
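For images, exiftool remains the standard extractor. As a stdlib-only illustration for documents: Office files (docx, xlsx, pptx) are zip archives whose docProps/core.xml records author fields. The helper below is a sketch of reading them, not a full metadata toolkit.

```python
import zipfile
import xml.etree.ElementTree as ET

# Standard OOXML core-properties namespaces
NS = {
    "cp": "http://schemas.openxmlformats.org/package/2006/metadata/core-properties",
    "dc": "http://purl.org/dc/elements/1.1/",
}

def office_metadata(path) -> dict:
    """Read author fields from an Office file's docProps/core.xml."""
    with zipfile.ZipFile(path) as zf:
        root = ET.fromstring(zf.read("docProps/core.xml"))
    return {
        "creator": root.findtext("dc:creator", default="", namespaces=NS),
        "last_modified_by": root.findtext(
            "cp:lastModifiedBy", default="", namespaces=NS
        ),
    }

# Usage: print(office_metadata("leaked_report.docx"))
```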
Username Pivoting
A single username found on one platform can be cross-referenced across dozens of services to build a complete digital footprint.
Code Examples
Example 1 — AI-Powered Entity Extraction
Scrape a webpage and use an LLM to extract structured OSINT entities automatically.
```python
import json

import requests
from bs4 import BeautifulSoup
from openai import OpenAI

# 1. Scrape the target page
html = requests.get("https://target-site.com/about", timeout=30).text
text = BeautifulSoup(html, "html.parser").get_text()

# 2. Ask an LLM to extract structured data
prompt = f"""Extract all OSINT entities from the text below.
Return JSON with keys: people, emails, phones, addresses, orgs.
Text: {text[:3000]}"""

client = OpenAI()  # reads OPENAI_API_KEY from the environment
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": prompt}],
    response_format={"type": "json_object"},
)

entities = json.loads(response.choices[0].message.content)
print(entities)
```
Example 2 — Automated Username Hunt (Sherlock)
```
pip install sherlock-project
# Hunt username across 300+ platforms
sherlock johndoe
# Output to file, only found results
sherlock johndoe --output results.txt --print-found
```
Example 3 — Firecrawl + LLM Intelligence Pipeline
```python
import anthropic
from firecrawl import FirecrawlApp

fc = FirecrawlApp(api_key="fc-...")
result = fc.scrape_url("https://company.com", formats=["markdown"])
# Dict-style access per older SDK versions; newer versions return an
# object whose markdown attribute holds the same content.
markdown = result["markdown"]

# Feed clean markdown directly to Claude
client = anthropic.Anthropic()
msg = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    messages=[{"role": "user",
               "content": f"Identify key personnel, partnerships, and technologies: {markdown}"}]
)
print(msg.content[0].text)
```
Example 4 — Shodan Dorking via API
```python
import shodan

api = shodan.Shodan("YOUR_API_KEY")
# Find Apache servers in a specific org
results = api.search('org:"Target Corp" apache')
for r in results["matches"]:
    print(f"IP: {r['ip_str']} | Port: {r['port']} | OS: {r.get('os', 'unknown')}")
```
Pro Tips & Tricks
Rotate User Agents & Use Residential Proxies
Never send requests with the default python-requests/2.x agent. Use fake_useragent to rotate browser UAs and combine with rotating residential proxies (Bright Data, Oxylabs) to avoid IP bans.
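A minimal sketch of the rotation pattern with a hardcoded UA pool. The pool and the empty proxy list are placeholders; in practice you would populate them from fake_useragent and your proxy provider.

```python
import itertools
import random

# Small hand-picked UA pool -- in practice use fake_useragent or a larger list.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.4 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:125.0) Gecko/20100101 Firefox/125.0",
]
_ua_pool = itertools.cycle(USER_AGENTS)

PROXIES = []  # e.g. ["http://user:pass@proxy1:8000", ...] from your provider

def request_kwargs() -> dict:
    """Per-request headers (and a proxy, if configured) for requests.get(**kwargs)."""
    kwargs = {"headers": {"User-Agent": next(_ua_pool)}}
    if PROXIES:
        proxy = random.choice(PROXIES)
        kwargs["proxies"] = {"http": proxy, "https": proxy}
    return kwargs

# Usage: requests.get(url, **request_kwargs(), timeout=30)
```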
Use AI to Parse Inconsistent Formats
Instead of writing fragile regex for phone numbers, addresses, or names in different formats, pass the raw text to an LLM: "Extract all phone numbers in E.164 format from this text: ...". Far more robust.
Cache Everything — Scrape Once, Analyze Many Times
Store raw HTML/JSON locally before processing. This lets you re-run different AI extraction prompts on the same data without re-scraping, saving time and avoiding detection.
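One way to sketch that cache layer, where the file naming scheme and directory layout are my own choices:

```python
import hashlib
import pathlib

CACHE_DIR = pathlib.Path("raw_cache")

def cached_fetch(url: str, fetch, cache_dir: pathlib.Path = CACHE_DIR) -> str:
    """Return cached raw content for url; call fetch(url) only on a miss."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    key = hashlib.sha256(url.encode("utf-8")).hexdigest()
    path = cache_dir / f"{key}.html"
    if path.exists():
        return path.read_text(encoding="utf-8")
    content = fetch(url)  # e.g. lambda u: requests.get(u, timeout=30).text
    path.write_text(content, encoding="utf-8")
    return content
```

Re-running a different extraction prompt then reads from disk instead of hitting the target again.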
Leverage Browser DevTools Network Tab First
Before writing a scraper, check if the site has a JSON API endpoint under the XHR/Fetch tab. Calling /api/v2/results.json is always cleaner than parsing HTML.
Use Async Crawlers for Scale
Use httpx with asyncio or Crawl4AI for concurrent requests. Async crawling can be 10-50x faster than synchronous requests for large target sets.
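The pattern can be sketched with a stubbed fetch coroutine; swap the stub for httpx.AsyncClient.get (or Crawl4AI) in real use. The semaphore caps concurrency so large target sets don't open thousands of connections at once.

```python
import asyncio

async def fetch(url: str) -> str:
    """Stub fetch -- replace with httpx.AsyncClient.get for real crawling."""
    await asyncio.sleep(0.01)  # simulates network latency
    return f"<html>{url}</html>"

async def crawl(urls, concurrency: int = 20) -> list:
    """Fetch all URLs concurrently, capped by a semaphore; preserves order."""
    sem = asyncio.Semaphore(concurrency)

    async def bounded(url: str) -> str:
        async with sem:
            return await fetch(url)

    return await asyncio.gather(*(bounded(u) for u in urls))

# Usage:
# pages = asyncio.run(crawl([f"https://example.com/{i}" for i in range(100)]))
```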
Wayback Machine for Deleted Content
Targets often delete sensitive information — but the Wayback Machine may have archived it. The CDX API lets you programmatically search all snapshots for a domain or URL pattern.
Entity Resolution with AI
When you have "John Smith", "J. Smith", and "John A. Smith" across different sources, use an LLM with context to resolve whether they're the same person. Traditional string matching fails here.
Respect robots.txt — But Know What It Means
robots.txt is advisory for crawlers, not a legal instrument. However, violating it in some jurisdictions may affect the legality of your activity. Always check ToS and applicable law before scraping.
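Python's standard library can both honor and inspect robots.txt. A small sketch; the rules and bot name below are illustrative, and in practice you would load the live file with set_url plus read instead of parse:

```python
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
# Parse illustrative rules directly; for a live site:
#   rp.set_url("https://target.example/robots.txt"); rp.read()
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
    "Crawl-delay: 5",
])

print(rp.can_fetch("MyOSINTBot/1.0", "https://target.example/private/x"))  # False
print(rp.can_fetch("MyOSINTBot/1.0", "https://target.example/public"))     # True
print(rp.crawl_delay("MyOSINTBot/1.0"))                                    # 5
```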
Legal & Ethical Boundaries
Permitted
Publicly available information, security research on your own systems or with authorization, academic research, journalism, due diligence on public entities, bug bounty programs.
Gray Areas
Scraping LinkedIn/social media (ToS violations), automated scraping of copyrighted content, aggregating public info to create private profiles, data scraping in GDPR jurisdictions.
Never Do
Scraping private/authenticated data without permission, CFAA violations, stalking/harassment, bypassing security controls, doxxing individuals, or accessing systems without authorization.