About Open Data

Datasets: 8 Files: 65,153 Total size: 78.1 TB

Rapid7 Open Data offers researchers and partners access to data from Project Sonar, which conducts internet-wide surveys to gain insights into global exposure to common vulnerabilities.

Open Data

Rapid7 Open Data provides access to Project Sonar's internet telemetry data in order to help security researchers and advocates advance security on a societal level. Access is subject to security controls that protect privacy and to a use case review that ensures the data is used in alignment with the project's goals.

Use Cases

Open Data is mostly aligned with the following use cases:

Legitimate public research projects by academics and others that are working on security-related topics. Researchers can request access to the Project Sonar data sets for a limited time and are subject to conditions for sharing findings to advance the public good. See the Data Use Restrictions section below for more details.
Governments, ISACs, and other nonprofits working on security advocacy to reduce opportunities for attackers. As part of our shared mission to advance security for all, data provided to these groups will be geo-filtered and covered by legal agreements stipulating balancing controls.
Commercial security projects with defensive or positive security outcomes. While it is not the primary goal of the Open Data initiative, we recognize that there are entities who would like to incorporate the data within their offerings. We may provide access for these use cases if they align with the project's goals.

Rapid7 customers can access Project Sonar data relating to their assets through Project Doppler, a free tool that provides more curated insight into an organization’s external exposures and attack surface. We are investigating ways to extend Project Doppler access to non-customer internal InfoSec teams while still balancing privacy concerns.

If you have a use case for Project Sonar data that does not fit into one of the categories above, please contact us at research[at]rapid7.com. We welcome any opportunity to better understand how our data may be useful and we want to continue to advance security and support the security community as best we can.

Data Use Restrictions

To ensure that data is only used in support of the project's goals and to help protect privacy, we have implemented the following general restrictions:

The data must be used for cybersecurity purposes that improve security outcomes. It cannot be used in security products or services that attack or cause harm. It cannot be used for non-security purposes such as marketing.
The data cannot be redistributed in bulk. Data about assets (domains, IPs, etc) can only be shared with owners, controllers, and/or others with a legitimate relationship with those assets.
Those requesting access for public research projects must have a specific goal with defined timelines and deliverables. The research results must be freely available to the general public without restrictions.
We are not currently approving requests from individual researchers or bug bounty participants. This is due, in large part, to the logistics of vetting the requests of individuals and the related agreements required to ensure that the required privacy protections are in place.

Feel free to contact research[at]rapid7.com regarding further questions.

Requesting Access

If you would like access to our Open Data datasets, please contact opendata[at]rapid7.com and provide the following:

Summary of your research or use case and the projected outputs.
Description of which datasets you are interested in.
High-level description of your organization, including which of the categories above it best fits.
For academic projects, the expected publication date of your research.

Note that all requests for data via Open Data will be vetted to ensure that they align with its goals of advancing security. All recipients must enter into a data sharing agreement which requires, among other things, commitments to limiting the negative privacy impact of the data's use. For some use cases we will limit what is shared by geo-filtering the data.

Project Sonar

Project Sonar is a security research project by Rapid7 that conducts internet-wide surveys across different services and protocols to gain insights into global exposure to common vulnerabilities. The data collected is available via Open Data in an effort to enable security research.

This page contains a condensed version of the project activities. Please visit the following posts for further details and the motivation behind Project Sonar:

Project Sonar - Scan All The Things | Rapid7 Blog
Legal considerations for widespread scanning | Rapid7 Blog

The Scanning and Collection Process

Project Sonar gathers data in two stages. In the first stage, Sonar scans all public IPv4 addresses to determine which have the relevant service port open. In the second stage, once an IP has been identified as having that port open, collection activities take place, which involve connecting to and communicating with the service.

Project Sonar performs its scans from several different subnets, which can be allowlisted or blocklisted at your preference:

5.63.151.96/27
71.6.233.0/24
88.202.190.128/27
146.185.25.160/27
109.123.117.224/27

Project Sonar performs its collection activities from AWS EC2 instances in us-west-1, us-west-2 and us-east-1. These instances have non-static IP addresses, so they cannot readily be allowlisted or blocklisted themselves. However, it is sufficient to blocklist or allowlist the scan ranges listed above.
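For readers who want to recognize Sonar scan traffic in their own logs or firewall rules, the check against the scan ranges above can be sketched in Python. This is a minimal illustration using the standard `ipaddress` module. The name `is_sonar_scanner` is an assumption made for this example and is not part of any Rapid7 tooling, and the ranges are simply copied from the list on this page, which may change over time.

```python
import ipaddress

# Scan ranges copied from the list on this page; they may change over time.
SONAR_SCAN_RANGES = [
    ipaddress.ip_network(cidr)
    for cidr in (
        "5.63.151.96/27",
        "71.6.233.0/24",
        "88.202.190.128/27",
        "146.185.25.160/27",
        "109.123.117.224/27",
    )
]


def is_sonar_scanner(address: str) -> bool:
    """Return True if the address falls inside one of the listed scan ranges.

    Hypothetical helper for illustration only.
    """
    ip = ipaddress.ip_address(address)
    return any(ip in network for network in SONAR_SCAN_RANGES)
```

Because the collection stage runs from non-static AWS addresses, a check like this only matches the first (scanning) stage.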

At no point does Sonar bypass any technical barriers or otherwise access non-public-facing computers. We are doing everything possible to reduce impact on remote networks and we follow best practices as outlined by the ZMap developers.

Services and collected data

Sonar collects all SSL certificates visible on public IPv4 HTTPS web servers and certain non-HTTP services, such as SSL and STARTTLS-enabled email services like SMTP, IMAP and POP. This data can be used to detect changes such as malicious replacement of certificates, or to reveal the revocation of a previously compromised certificate. This data is complementary to the Electronic Frontier Foundation's SSL Observatory project. Other purposes include detecting certificates that are insecurely reused or that remain in active use after being revoked. In addition, the Sonar data shows all IP addresses and services that claim to represent a particular domain, which in turn can be used for asset identification and detection of malicious certificate usage. The certificate fields can also be used to identify software and hardware in specific situations.

Sonar performs several HTTP studies that collect the HTML content of all public IPv4 web servers. The main HTTP study requests the index page ("/") on TCP port 80, and other studies request other specific pages, potentially on other TCP ports. This behavior is similar to what search engines do, except that Sonar does not crawl the servers beyond the initial requested page. One potential use of this data set is identifying compromised web servers and malicious HTML snippets, such as iframes, injected into non-advertisement web servers. We found several instances of JavaScript and direct iframes pointing to so-called "exploit kits" that try to infect client computers. We also use this data to identify vulnerable embedded devices by fingerprinting the content and headers of the HTTP response.
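As a rough illustration of the iframe use case, the sketch below lists the `src` targets of iframes found in a page body, using only Python's standard `html.parser` module. It is not Sonar's own processing. The names `IframeCollector` and `iframe_sources` are hypothetical, and deciding whether a target is actually malicious is outside the scope of the example.

```python
from html.parser import HTMLParser
from typing import List


class IframeCollector(HTMLParser):
    """Collect the src attribute of every <iframe> tag in a document."""

    def __init__(self) -> None:
        super().__init__()
        self.sources: List[str] = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag names in lowercase.
        if tag == "iframe":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def iframe_sources(html: str) -> List[str]:
    """Return the src values of all iframes in the given HTML text."""
    collector = IframeCollector()
    collector.feed(html)
    collector.close()
    return collector.sources
```

A reviewer could then compare the returned targets against the hosts a site is expected to embed.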

Sonar gathers the reverse DNS records for all IPv4 addresses. This data enables organizational asset discovery and can help identify misconfigurations and possibly DNS hijacking attempts.
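For context, a reverse DNS lookup for an IPv4 address is a PTR query against a name under `in-addr.arpa` built from the reversed octets. The sketch below shows how that query name is formed using Python's standard `ipaddress` module. The helper name `ptr_query_name` is an assumption made for this example, and the code does not perform any network lookup.

```python
import ipaddress


def ptr_query_name(address: str) -> str:
    """Return the name that a reverse DNS (PTR) lookup would query.

    Hypothetical helper for illustration; no DNS request is made.
    """
    return ipaddress.ip_address(address).reverse_pointer


# Example with a documentation-range address:
# ptr_query_name("192.0.2.10") returns "10.2.0.192.in-addr.arpa"
```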

Sonar uses the domain names gathered from the above processes as well as certain TLD zone files to conduct DNS record requests for many common DNS record types. This data is also useful for asset discovery and the identification of phishing portals, as well as new malicious domains matching algorithmic patterns.

Sonar scans a growing number of TCP and UDP services. TCP studies include SSH, SMB, Telnet, RDP, Mongo, Redis, CouchDB, and more. UDP studies include NetBIOS, DNS, NTP, IPMI, NAT-PMP, BACNet, SIP, SNMP, MDNS, and quite a few others. We use the metadata from these publicly exposed services to identify large-scale misconfigurations and vulnerabilities in consumer, enterprise, and critical infrastructure systems.

Opt-out

If you would like to be excluded from some or all of our probes, please let us know at research[at]rapid7.com. Make sure to mention your CIDR blocks or list of IP addresses and your affiliation.

Please note that as part of the opt-out process we attempt to verify that the requestor has been delegated or otherwise controls the network addresses in the opt-out request. We typically perform this verification via WHOIS and other tools. If we cannot verify delegation or ownership, we are unlikely to opt out the requested addresses. We also periodically review our opt-out list and remove stale entries where the WHOIS record has changed or where we can no longer verify ownership, control, or affiliation. The opt-out can be requested again in the future.

Acknowledgements

Project Sonar employs a range of open-source tools, most notably the ZMap software developed by Zakir Durumeric, Eric Wustrow, and J. Alex Halderman at the University of Michigan. We publish a few of our own tools as well, including DAP and Recog, both of which are used in the processing stage of our scanning system. Learn more about the Rapid7 researchers maintaining and extracting insights from Project Sonar.

Rapid7 Labs