Integrating with Amazon S3 (Python)

AWS S3 is a popular tool for storing data in the cloud, but it also has huge potential for unintentionally leaking sensitive data. By using the AWS SDKs in conjunction with Nightfall's Scan API, you can discover, classify, and remediate sensitive data within your S3 buckets.

You will need the following for this tutorial:

We will use boto3 as our AWS client in this demo. If you are using another language, check this page for AWS's recommended SDKs.

To install boto3 and the Nightfall SDK, run the following command.

pip install boto3
pip install nightfall==0.6.0

In addition to boto3, we will be utilizing the following Python libraries to interact with the Nightfall SDK and to process the data.

import boto3
import requests
import json
import csv
import os
from nightfall import Nightfall

We've configured our AWS credentials, as well as our Nightfall API key, as environment variables so they don't need to be committed directly into our code.

aws_session_token = os.environ.get('AWS_SESSION_TOKEN')
aws_access_key_id = os.environ.get('AWS_ACCESS_KEY_ID')
aws_secret_access_key = os.environ.get('AWS_SECRET_ACCESS_KEY')

nightfall_api_key = os.environ.get('NIGHTFALL_API_KEY')

Next, we define the Detection Rule with which we wish to scan our data. The Detection Rule can be created ahead of time in the Nightfall web app and referenced by its UUID. We then instantiate a Nightfall client from the SDK with our API key.

detectionRuleUUID = os.environ.get('DETECTION_RULE_UUID')

nightfall = Nightfall(nightfall_api_key)

Now we create a list of scannable objects from our target S3 buckets, and specify a maximum file size to pass to the Nightfall API (475 KB, comfortably under the 500 KB request limit). In practice, you could add additional code to chunk larger files across multiple API requests.

We will also create an all_findings object to store Nightfall Scan results. The first row of our all_findings object will constitute our headers, since we will dump this object to a CSV file later.

This example includes the full finding in the output. Since a finding may itself be a piece of sensitive data, we recommend using the Redaction feature of the Nightfall API to mask it. More information can be found in the 'Using Redaction to Mask Findings' section below.

objects_to_scan = []
size_limit = 475000

all_findings = []
all_findings.append(
  [
    'bucket', 'object', 'detector', 'confidence',
    'byte_range_start', 'byte_range_end',
    'codepoint_range_start', 'codepoint_range_end', 'fragment'
  ]
)
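As noted above, files over the size limit could be split across multiple API requests. Below is a minimal sketch of such a chunking helper; the function name and line-based splitting strategy are our own, not part of the Nightfall SDK, and a production version would want to avoid splitting in the middle of a sensitive token.

```python
def chunk_text(text, max_bytes=475000):
    """Split text into chunks whose UTF-8 encoding stays under max_bytes.

    Splits on line boundaries for simplicity; a single line larger than
    max_bytes is emitted as its own (oversized) chunk.
    """
    chunks = []
    current, current_size = [], 0
    for line in text.splitlines(keepends=True):
        line_size = len(line.encode('utf-8'))
        if current and current_size + line_size > max_bytes:
            chunks.append(''.join(current))
            current, current_size = [], 0
        current.append(line)
        current_size += line_size
    if current:
        chunks.append(''.join(current))
    return chunks
```

Each chunk could then be submitted as its own payload, with the chunk offset recorded so findings can be mapped back to positions in the original object.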

We will now initialize our AWS S3 Session. Once the session is established, we get a handle for the S3 resource.

my_session = boto3.session.Session(
  aws_session_token = aws_session_token,
  aws_access_key_id = aws_access_key_id,
  aws_secret_access_key = aws_secret_access_key
)

s3 = my_session.resource('s3')

Now we go through each bucket and retrieve the scannable objects, adding their text contents to objects_to_scan as we go.

In this tutorial we assume that all files are text-readable. In practice, you may wish to filter out un-scannable file types such as images with the object.get()['ContentType'] property.
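That content-type check can be factored into a small predicate. The allowlist below is illustrative, not exhaustive; extend it to match the data in your own buckets.

```python
def is_text_scannable(content_type):
    """Return True for content types we expect to be text-readable."""
    if content_type is None:
        return False
    # Normalize away any charset suffix, e.g. "text/plain; charset=utf-8"
    base_type = content_type.split(';')[0].strip().lower()
    if base_type.startswith('text/'):
        return True
    return base_type in {
        'application/json',
        'application/xml',
        'application/csv',
    }
```

You could then guard the append in the loop below with a check such as `if is_text_scannable(temp_object['ContentType'])`.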

for b in s3.buckets.all():
  for o in b.objects.all():
    temp_object = o.get()
    size = temp_object['ContentLength']

    if size < size_limit:
      # Keep the bucket and key alongside the content so we can
      # report where each finding came from later.
      objects_to_scan.append(
        (o.bucket_name, o.key, temp_object['Body'].read().decode())
      )

For each object we find in our S3 buckets, we send its contents as a payload to the Nightfall Scan API with our previously configured Detection Rule.

On receiving the response, we break down each returned finding and assign it a new row in the CSV we are constructing.

In this tutorial, we scope each object to its own API request. At the cost of granularity, you may combine multiple smaller files into a single call to the Nightfall API.

for bucket_name, key, content in objects_to_scan:
  nightfall_response = nightfall.scanText(
        [content],
        detection_rule_uuids=[detectionRuleUUID]
  )

  findings = json.loads(nightfall_response)

  for finding in findings:
    row = [
      bucket_name,
      key,
      finding['detector']['name'],
      finding['confidence'],
      finding['location']['byteRange']['start'],
      finding['location']['byteRange']['end'],
      finding['location']['codepointRange']['start'],
      finding['location']['codepointRange']['end'],
      finding['fragment']
    ]
    all_findings.append(row)
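If you prefer fewer API calls, the smaller objects mentioned above can be grouped so that each request carries several payloads while staying under the size limit. The grouping helper below is our own sketch; the Nightfall scan call already accepts a list of payloads.

```python
def batch_payloads(items, max_bytes=475000):
    """Group (bucket, key, content) tuples into batches whose combined
    content stays under max_bytes per request.

    An item larger than max_bytes is emitted as its own batch; in
    practice such items should be chunked instead.
    """
    batches = []
    current, current_size = [], 0
    for item in items:
        size = len(item[2].encode('utf-8'))
        if current and current_size + size > max_bytes:
            batches.append(current)
            current, current_size = [], 0
        current.append(item)
        current_size += size
    if current:
        batches.append(current)
    return batches
```

Note that when you batch, each finding in the response maps back to a position within the batch, so you would need to track which payload index corresponds to which bucket and key.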

Now that we have finished scanning our S3 buckets and collated the results, we are ready to export them to a CSV file for further review.

if len(all_findings) > 1:
  with open('output_file.csv', 'w') as output_file:
    csv_writer = csv.writer(output_file, delimiter = ',')
    csv_writer.writerows(all_findings)
else:
  print('No sensitive data detected. Hooray!')

That's it! You now have insight into all of the sensitive data stored inside your organization's AWS S3 buckets.

As a next step, you could use boto3 to delete or redact the files in which sensitive data was found.
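As a sketch of that next step, you could collect the distinct objects that produced findings and hand each to boto3 for deletion. The helper below assumes the all_findings layout built earlier (bucket in column 0, key in column 1); the delete call is commented out so you can review results before destroying any data.

```python
def objects_with_findings(all_findings):
    """Return the unique (bucket, key) pairs from the findings table,
    skipping the header row."""
    seen = []
    for row in all_findings[1:]:
        pair = (row[0], row[1])
        if pair not in seen:
            seen.append(pair)
    return seen

# After reviewing the CSV, remediation could look like:
# for bucket, key in objects_with_findings(all_findings):
#     s3.Object(bucket, key).delete()
```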

Using Redaction to Mask Findings

With the Nightfall API, you can also redact and mask your S3 findings by adding a Redaction Config to your Detection Rule. For more information on redaction and its specific options, please refer to the guide here.
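As an illustration, a masking config might look like the fragment below. The field names follow the Nightfall redaction documentation at the time of writing; verify them against the current API reference and redaction guide before use.

```python
# Hypothetical sketch of a mask config attached to a detection rule.
redaction_config = {
    "maskConfig": {
        "maskingChar": "*",            # character used to overwrite findings
        "numCharsToLeaveUnmasked": 4,  # how many characters stay visible
        "charsToIgnore": ["-"],        # keep separators such as dashes
    }
}
```

With a config like this applied, the fragment column of the CSV would contain masked values rather than the raw sensitive data.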

Using the File Scanning Endpoint with S3

The example above is specific to the Nightfall Text Scanning API. To scan files, we can use a process similar to the one we used for the text scanning endpoint. The process is broken down in the sections below, as file scanning is more involved.

Prerequisites:

In order to utilize the File Scanning API, you need the following:

  • An active API Key authorized for file scanning, passed via the Authorization: Bearer header (see Authentication and Security)
  • A Nightfall Detection Policy associated with a webhook URL
  • A web server configured to listen for file scanning results (more information below)

The steps to use the endpoint are as follows:

  1. Retrieve the list of objects in your S3 buckets

Similar to the process in the beginning of this tutorial for the text scanning endpoint, we will now initialize our AWS S3 Session. Once the session is established, we get a handle for the S3 resource.

my_session = boto3.session.Session(
  aws_session_token = aws_session_token,
  aws_access_key_id = aws_access_key_id,
  aws_secret_access_key = aws_secret_access_key
)

s3 = my_session.resource('s3')

Now we go through each bucket and retrieve the scannable objects.

for b in s3.buckets.all():
  for o in b.objects.all():
    # here we can call the file scanning endpoints
    pass

For each object we find in our S3 buckets, we send it as an argument to the Nightfall File Scan API with our previously configured detectors.

  2. Iterate through the list of files and begin the file upload process, as shown here.

  3. Once the files have been uploaded, use the scan endpoint mentioned here. Note: as described in the documentation, a webhook server is required for the scan endpoint; the scanning results will be sent to it. An example webhook server setup can be seen here.

  4. The scan endpoint works asynchronously on the uploaded files, so you can monitor the webhook server to see the API responses and file scan findings as they arrive.
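As an illustration of such a webhook listener, here is a minimal sketch using Python's standard library. The event fields are assumptions based on a generic JSON webhook; consult the file scanning documentation for the actual event schema, and note that a production webhook should also verify Nightfall's request signature.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def parse_scan_event(body):
    """Parse a webhook POST body into a Python dict.

    Returns None if the body is not valid JSON.
    """
    try:
        return json.loads(body)
    except (ValueError, UnicodeDecodeError):
        return None

class ScanResultHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get('Content-Length', 0))
        event = parse_scan_event(self.rfile.read(length))
        if event is None:
            self.send_response(400)
        else:
            # In a real server: verify the request signature and
            # persist the findings instead of printing them.
            print('Received scan event:', event)
            self.send_response(200)
        self.end_headers()

# To run: HTTPServer(('0.0.0.0', 8000), ScanResultHandler).serve_forever()
```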

Resources:

File Scanning Process Documentation
File Scan API Reference