Supported File Types
The file scan API has first-class support for text extraction and scanning on all MIME types enumerated below.
Certain file types, such as tabular data and archives of Git repositories, receive special handling that yields more precise information about the location of findings within the source file.
Handling of MIME Types Not Listed
Files with a MIME type not listed below are processed using an unoptimized text extractor. As a result, the quality of the text extraction for unrecognized types may vary.
Accepted Text and Derivatives
- application/json
- application/x-ndjson
- application/x-php
- text/calendar
- text/css
- text/csv (treated as tabular data and may be redacted)
- text/html
- text/javascript
- text/plain
- text/tab-separated-values (treated as tabular data)
- text/tsv (treated as tabular data)
- text/x-php
Accepted Office Formats
- application/pdf
- application/vnd.openxmlformats-officedocument.presentationml.presentation
- application/vnd.openxmlformats-officedocument.spreadsheetml.sheet (treated as tabular data)
- application/vnd.openxmlformats-officedocument.wordprocessingml.document
- application/vnd.ms-excel (treated as tabular data)
Accepted Archive and Compressed File Types
- application/bzip2
- application/ear
- application/gzip
- application/jar
- application/java-archive
- application/tar+gzip
- application/vnd.android.package-archive
- application/war
- application/x-bzip2
- application/x-gzip
- application/x-rar-compressed
- application/x-tar
- application/x-webarchive
- application/x-zip-compressed
- application/x-zip
- application/zip
Accepted Image File Types
- image/apng
- image/avif
- image/gif
- image/jpeg
- image/jpg
- image/png
- image/svg+xml
- image/tiff
- image/webp
Rejected MIME Types
The file scan API explicitly rejects requests with MIME types that are not conducive to extracting or scanning text. Sample rejected MIME types include:
- application/photoshop
- audio/midi
- audio/wav
- video/mp4
- video/quicktime
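As a rough illustration of how a client might check a file before uploading it, the sketch below classifies a MIME type against the lists in this document. The function names and the "rejected" / "tabular" / "other" labels are hypothetical and not part of the Nightfall API, and the rejected set contains only the sample types shown above, so a type missing from it may still be rejected by the service.

```python
import mimetypes

# Copied from the lists in this document. The rejected set is only a sample,
# so absence from it does not guarantee that the API accepts the type.
TABULAR_TYPES = {
    "text/csv",
    "text/tab-separated-values",
    "text/tsv",
    "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
    "application/vnd.ms-excel",
}
SAMPLE_REJECTED_TYPES = {
    "application/photoshop",
    "audio/midi",
    "audio/wav",
    "video/mp4",
    "video/quicktime",
}


def classify_mime_type(mime_type: str) -> str:
    """Return 'rejected', 'tabular', or 'other' for a MIME type string."""
    # Drop parameters such as "; charset=utf-8" and normalize case.
    normalized = mime_type.split(";")[0].strip().lower()
    if normalized in SAMPLE_REJECTED_TYPES:
        return "rejected"
    if normalized in TABULAR_TYPES:
        return "tabular"
    # Covers both the other listed types and unlisted types.
    return "other"


def classify_path(path: str) -> str:
    """Guess a MIME type from the file name, then classify it."""
    guessed, _encoding = mimetypes.guess_type(path)
    return classify_mime_type(guessed or "application/octet-stream")
```

Guessing from the file extension is only a convenience; the result depends on the local MIME database and may differ from what the API detects.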
Spreadsheets and Tabular Data
File scans of Microsoft Excel, Apache Parquet, CSV, and tab-separated files provide additional properties to locate findings within the document, beyond the standard byteRange, codepointRange, and lineRange properties.
Findings will contain a columnRange and a rowRange that identify the specific row and column within the tabular data where the finding is present.
This functionality is applicable to the following MIME types:
- text/csv
- text/tab-separated-values
- text/tsv
- application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
- application/vnd.ms-excel
Apache Parquet data files are also accepted.
Below is a sample match from a spreadsheet containing dummy PII, where an SSN was detected in the 2nd column and 55th row.
{
  "findings":[
    {
      "path":"Sheet1 (5)",
      "detector":{
        "id":"e30d9a87-f6c7-46b9-a8f4-16547901e069",
        "name":"US social security number (SSN)",
        "version":1
      },
      "finding":"624-84-9182",
      "confidence":"LIKELY",
      "location":{
        "byteRange":{
          "start":2505,
          "end":2516
        },
        "codepointRange":{
          "start":2452,
          "end":2463
        },
        "lineRange":{
          "start":55,
          "end":55
        },
        "rowRange":{
          "start":55,
          "end":55
        },
        "columnRange":{
          "start":2,
          "end":2
        },
        "commitHash":""
      },
      "matchedDetectionRuleUUIDs":[
        "950833c9-8608-4c66-8a3a-0734eac11157"
      ],
      "matchedDetectionRules":[
      ]
    },
    ...
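The rowRange and columnRange values can be used to look up the affected cell in a local copy of the data. The sketch below is one way to do that for a CSV file. It assumes that rows and columns are numbered from 1, which is consistent with the sample above (2nd column, 55th row) but not stated explicitly, and that a header line, if present, counts as row 1. The function names are illustrative only.

```python
import csv
import io
from typing import Tuple


def cell_coordinates(finding: dict) -> Tuple[int, int]:
    """Return (row, column) exactly as reported in the finding's location."""
    location = finding["location"]
    return location["rowRange"]["start"], location["columnRange"]["start"]


def lookup_cell(csv_text: str, row: int, column: int) -> str:
    """Fetch a cell from CSV text.

    Assumption: rows and columns are 1-based, and a header line (if any)
    counts as row 1. Adjust the offsets if your data shows otherwise.
    """
    rows = list(csv.reader(io.StringIO(csv_text)))
    return rows[row - 1][column - 1]
```

For the sample finding above, cell_coordinates returns (55, 2).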
Redacting CSV Files
Findings within CSV files may be redacted.
To enable redaction in files, set the enableFileRedaction flag of your policy to true.
The CSV file will be redacted based on the configuration of the policy's defaultRedactionConfig.
Below is an example curl request for a CSV file that has already been uploaded.
curl --request POST \
  --url https://api.nightfall.ai/v3/upload/02a0c5e1-c950-4e28-a988-f6fffefc4205/scan \
  --header 'Accept: application/json' \
  --header 'Authorization: Bearer NF-<Your API Key>' \
  --header 'Content-Type: application/json' \
  --data '
{
  "policy": {
    "detectionRuleUUIDs": [
      "950833c9-8608-4c66-8a3a-0734eac11157"
    ],
    "alertConfig": {
      "email": {
        "address": "<your email address>"
      }
    },
    "defaultRedactionConfig": {
      "maskConfig": {
        "charsToIgnore": [
          "-",
          "@"
        ],
        "maskingChar": "*"
      }
    },
    "enableFileRedaction": true
  },
  "requestMetadata": "csv redaction test"
}
'
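For clients that prefer to build the same request in code, the following is a minimal Python sketch that mirrors the curl example using only the standard library. The build_scan_request helper and its parameters are illustrative names rather than part of any Nightfall SDK. The function only constructs the request and does not send it.

```python
import json
import urllib.request

API_BASE = "https://api.nightfall.ai/v3"


def build_scan_request(upload_id: str, api_key: str, email: str) -> urllib.request.Request:
    """Build (but do not send) the scan request shown in the curl example."""
    payload = {
        "policy": {
            "detectionRuleUUIDs": ["950833c9-8608-4c66-8a3a-0734eac11157"],
            "alertConfig": {"email": {"address": email}},
            "defaultRedactionConfig": {
                "maskConfig": {"charsToIgnore": ["-", "@"], "maskingChar": "*"}
            },
            # A JSON boolean, not the string "true".
            "enableFileRedaction": True,
        },
        "requestMetadata": "csv redaction test",
    }
    return urllib.request.Request(
        url=f"{API_BASE}/upload/{upload_id}/scan",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Accept": "application/json",
            # api_key is the full key, including its "NF-" prefix.
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Passing the returned object to urllib.request.urlopen would submit the scan; error handling and retries are left out of this sketch.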
When results are sent to the location specified in the alertConfig (in this case, an email address), a redactedFile property will be set with a fileURL, in addition to the findingsURL:
{
"errors":null,
"findingsPresent":true,
"findingsURL":"https://files.nightfall.ai/asdfc5e1-c950-4e28-a988-f6fffefc4205.json?Expires=1655324479&Signature=zjo1nT-PECHC-fiTvAgdA8aDnceoY~6iGfzOBCcBjscKqOHnIar8hoH4gGufffiulBw5BpfJuvWwBW~lXO~ZNhN139LDwoTsfLJswJiQCB2Hj-Az0Em6go~1j8WBqCS8G0Gk17M-zcPedHGX3z~1pw8nm5sh6Pa-jJwfw9NIEiqmBb3Vdcj3J-~Wzag~ENV4499rnG299ee-ig5Ms1oVlzycb4YxzgTMrTL5Q07ozNenwFZcGDNQre1inLXmV-m8teLX-K3boklenp9KXiNDDV0wi74ADN-QfIR1q1oU7mEI1f3aVC3kju0QRErp2lsfs08EtZKLE3C4N17jDJdYcw__&Key-Pair-Id=K24YOPZ1EKX0YC",
"redactedFile":{
"fileURL":"https://files.nightfall.ai/asdfc5e1-c950-4e28-a988-f6fffefc4205-redacted.csv?Expires=1655324479&Signature=Hx8kRh88maLeStysy3fsLbFVG9VELEtfemtQe2lWUnFjAMd9HqlEksTmirqAWFWV4zPVUB73izlMj5cSer8v2N5ZCcnD3dz~nnwR4P5LewGJ2CQzGnDnXgh70HW5qp04gnUD-pYWp~bGPVspkJKCkl1zH-EoGonvcNVq3SNsVzOlsVIjep7Y7otQKEEyAZ7JmHiVfuBxrvn8pleuC5lEJ3f9miPyoRqH9DyPlNTJTIuijqe9q32Qcui2RsDR6IT-foFX52dy6rRa01ZV0gZMDWJokMlCr8Iu5An~qnhxC49bqTtI82oz9FcBaP-Yea8cq1TiAfGxX7CJ0~JeTLvr6g__&Key-Pair-Id=K24YOPZ1EKX0YC",
"validUntil":"2022-06-15T20:21:19.750990823Z"
},
"requestMetadata":"csv redaction test",
"uploadID":"02a0c5e1-c950-4e28-a988-f6fffefc4205",
"validUntil":"2022-06-15T20:21:19.723045787Z"
}
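A client receiving this payload might extract the redacted file's URL and check whether it is still usable. The sketch below assumes that validUntil marks the time after which the signed URL expires, which the name suggests but this page does not state. The timestamp has nanosecond precision, so it is parsed by hand and truncated to microseconds instead of relying on datetime.fromisoformat, whose handling of such values varies across Python versions. Both function names are hypothetical.

```python
import re
from datetime import datetime, timezone
from typing import Optional

_TIMESTAMP = re.compile(r"(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})(?:\.(\d+))?Z")


def parse_valid_until(value: str) -> datetime:
    """Parse a UTC timestamp such as 2022-06-15T20:21:19.750990823Z."""
    match = _TIMESTAMP.fullmatch(value)
    if match is None:
        raise ValueError(f"unrecognized timestamp: {value!r}")
    base, fraction = match.groups()
    # Keep at most microsecond precision; extra digits are dropped.
    microseconds = int((fraction or "0")[:6].ljust(6, "0"))
    parsed = datetime.strptime(base, "%Y-%m-%dT%H:%M:%S")
    return parsed.replace(microsecond=microseconds, tzinfo=timezone.utc)


def redacted_file_url(result: dict, now: datetime) -> Optional[str]:
    """Return the redacted file URL, or None if absent or past validUntil."""
    redacted = result.get("redactedFile")
    if not redacted:
        return None
    # Assumption: validUntil is the expiry time of the signed URL.
    if now >= parse_valid_until(redacted["validUntil"]):
        return None
    return redacted["fileURL"]
```

Pass datetime.now(timezone.utc) as now in real use; taking it as a parameter keeps the function easy to test.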
This redacted file will be a modified version of the original CSV file.
Below is an example of a redacted CSV file.
name,email,phone,alphanumeric
Ulric Burton,*****@*************,*-***-***-****,TEL82EBM1GQ
Wade Jones,******************@***********,(********-****,VVF64PJV2EF
Molly Mccullough,*****************@**********,(********-****,OHO41SFZ2BR
Raja Riggs,************@**********,(********-****,UVD51JTE5NZ
Colin Carter,**********************@*********,(********-****,LNI34LLC5WV
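To make the effect of maskConfig concrete, the sketch below reproduces the masking pattern seen in the sample output: every character of a finding is replaced with maskingChar unless it appears in charsToIgnore. This is an illustration of the observable behavior and not Nightfall's implementation; the mask function is a hypothetical helper.

```python
from typing import Iterable


def mask(value: str, masking_char: str = "*", chars_to_ignore: Iterable[str] = ()) -> str:
    """Replace each character with masking_char unless it is ignored."""
    ignore = set(chars_to_ignore)
    return "".join(ch if ch in ignore else masking_char for ch in value)


# With the configuration from the request above ("-" and "@" ignored):
print(mask("1-583-385-1427", "*", ["-", "@"]))  # *-***-***-****
```

Only the text of the finding itself is masked, so characters that fall outside the detected span are left unchanged.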
Git Repositories
Nightfall provides special handling for archives of Git repositories.
Nightfall will scan the repository history to discover findings in particular commits, returning the hash of the commit in which each finding occurs.
To scan a repository, first create a clone, for example:
git clone https://github.com/nightfallai/nightfall-go-sdk.git
This creates a clone of the Nightfall Go SDK.
You will then need to create an archive that can be uploaded using Nightfall's file scanning sequence, for example:
zip -r directory.zip directory
Note that for the history scan to work, the hidden directory .git, which holds the repository's commit history, must be included in the archive.
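The archive can also be created programmatically. The sketch below uses Python's zipfile module to archive a cloned repository, including hidden files and directories, and then checks that the archive contains a .git directory (where Git stores commit history). The function names are illustrative, and the output path is assumed to lie outside the repository directory so the archive does not try to include itself.

```python
import os
import zipfile


def archive_repository(repo_dir: str, zip_path: str) -> None:
    """Zip repo_dir (hidden entries included), like `zip -r repo.zip repo`."""
    repo_dir = os.path.abspath(repo_dir)
    parent = os.path.dirname(repo_dir)
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for root, dirs, files in os.walk(repo_dir):
            # Directories are written too, so empty ones are preserved.
            for name in dirs + files:
                full_path = os.path.join(root, name)
                archive.write(full_path, os.path.relpath(full_path, parent))


def archive_has_git_dir(zip_path: str) -> bool:
    """Return True if any archive entry lives in (or is) a .git directory."""
    with zipfile.ZipFile(zip_path) as archive:
        return any(".git" in name.split("/") for name in archive.namelist())
```

Running archive_has_git_dir before uploading is a cheap way to catch archives built from an exported source tree that lacks history.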
When you initiate the file upload sequence with this file, you will receive scan results with the commitHash property filled in.
Using the Nightfall Go SDK archive created above, a simple example would be to scan for URLs (i.e., strings starting with http:// or https://), which will send results such as the following:
{
  "findings":[
    {
      "path":"f607a067..53e59684/nightfall.go",
      "detector":{
        "id":"6123060e-2d9f-4f35-a7a1-743379ea5616",
        "name":"URL"
      },
      "finding":"https://api.nightfall.ai/\"",
      "confidence":"LIKELY",
      "location":{
        "byteRange":{
          "start":142,
          "end":168
        },
        "codepointRange":{
          "start":142,
          "end":168
        },
        "lineRange":{
          "start":16,
          "end":16
        },
        "rowRange":{
          "start":0,
          "end":0
        },
        "columnRange":{
          "start":0,
          "end":0
        },
        "commitHash":"53e59684d9778ceb0f0ed6a4b949c464c24d35ce"
      },
      "beforeContext":"tp\"\n\t\"os\"\n\t\"time\"\n)\n\nconst (\n\tAPIURL = \"",
      "afterContext":"\n\n\tDefaultFileUploadConcurrency = 1\n\tDef",
      "matchedDetectionRuleUUIDs":[
        "cda0367f-aa75-4d6a-904f-0311209b3383"
      ],
      "matchedDetectionRules":[
      ]
    },
    ...
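When a scan returns many findings, it can help to group them by commit before reviewing them. The following sketch does this for a parsed results object shaped like the sample above. The findings_by_commit helper is a hypothetical name, and findings with an empty commitHash (as returned for non-repository files) are skipped.

```python
from collections import defaultdict
from typing import Dict, List


def findings_by_commit(scan_results: dict) -> Dict[str, List[str]]:
    """Map each commit hash to the paths of the findings reported for it."""
    grouped = defaultdict(list)
    for finding in scan_results.get("findings", []):
        commit = finding.get("location", {}).get("commitHash", "")
        if commit:  # an empty string means no commit information
            grouped[commit].append(finding["path"])
    return dict(grouped)
```

Each key of the returned dictionary can then be passed to git checkout to inspect the corresponding commit.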
Support for Large Repositories
Currently, processing is limited to repositories with fewer than 5000 commits in total.
Large repositories result in a large volume of data sent at once. We are working on changes to allow these and other large surges of data to be processed in a more controlled manner, and will increase the limit or remove it altogether once those changes are complete.
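Before archiving a repository, you may want to check its size against this limit. The sketch below counts commits with git rev-list --count --all, which counts every commit reachable from any ref. Whether Nightfall counts commits the same way is an assumption, and the function names are illustrative.

```python
import subprocess

MAX_COMMITS = 5000  # limit stated in this document; it may change


def count_commits(repo_dir: str) -> int:
    """Count commits reachable from any ref in the repository."""
    completed = subprocess.run(
        ["git", "-C", repo_dir, "rev-list", "--count", "--all"],
        check=True,
        capture_output=True,
        text=True,
    )
    return int(completed.stdout.strip())


def within_commit_limit(commit_count: int, limit: int = MAX_COMMITS) -> bool:
    """Return True if the commit count is strictly below the limit."""
    return commit_count < limit
```

A typical check would be within_commit_limit(count_commits("nightfall-go-sdk")), run from the directory containing the clone.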
Sensitive Data in GitHub Repositories
If a finding in a GitHub repository is sensitive, it should be considered compromised, and appropriate mitigation steps should be taken (e.g., secrets should be rotated).
To retrieve the specific commit, you will need to clone the repository, for example:
git clone https://github.com/nightfallai/nightfall-go-sdk.git
You can then check out the specific commit using the commit hash returned by Nightfall.
cd nightfall-go-sdk
git checkout 53e59684d9778ceb0f0ed6a4b949c464c24d35ce
Note that you are in a 'detached HEAD' state when working with this kind of checkout.
See also: Removing sensitive data from a repository
JSON Files
Preview Functionality
The functionality described below is not yet generally available to customers. Contact [email protected] if you are interested in working with this functionality.
Nightfall provides support for locating findings within JSON objects. Instead of relying on byte ranges, codepoint ranges, or beforeContext and afterContext, client code may use the jsonKey in the location of a finding. The value of jsonKey contains the path at which the finding occurred (e.g. ".data.user[2].first_name" or ".[2].data[1].moreData").
The jsonKey value allows client code to directly access the attribute within the parsed object. This feature supports arrays (e.g. ".AttributeName[i]") as well as nested attributes (e.g. ".level_0_Attribute.level_1_Attribute.level_2_Attribute").
This feature is strictly limited to correctly formatted JSON files and is only available for uploaded files. It is useful when the content consists of logs and events in JSON format.
Limitations on Keys
If a key/attribute is too large (1000+ characters), it will be truncated, and the jsonKey value will contain the first and last 5 characters of the key in this format: "abcde…fghej".
Nightfall does not scan the names of keys (as opposed to values) for sensitive data.
Below is a sample JSON file containing an array of objects with various attributes.
[
{
"phone": "(211) 488-6068",
"email": "[email protected]",
"alphanumeric": "XOZ42ZMC7FY",
"name": "Ocean Fox"
},
{
"phone": "1-583-385-1427",
"email": "[email protected]",
"alphanumeric": "KZI71WKV8QZ",
"name": "Keelie Berry"
}
]
Below are the findings that result from a file scan where the Detection Rule checks for phone numbers and email addresses. Note the jsonKey attribute under the location sub-object of each entry in the findings array.
{
"findings": [
{
"confidence": "VERY_LIKELY",
"matchedDetectionRules": [],
"matchedDetectionRuleUUIDs": [
"950833c9-8608-4c66-8a3a-0734eac11157"
],
"location": {
"jsonKey": ".[0].phone",
"columnRange": {
"start": 0,
"end": 0
},
"rowRange": {
"start": 0,
"end": 0
},
"codepointRange": {
"start": 1,
"end": 14
},
"byteRange": {
"start": 1,
"end": 14
},
"commitHash": "",
"lineRange": {
"start": 4,
"end": 4
}
},
"finding": "211) 488-6068",
"detector": {
"version": 1,
"id": "d08edfc4-b5e2-420a-a5fe-3693fb6276c4",
"name": "Phone number Detector"
}
},
{
"confidence": "VERY_LIKELY",
"matchedDetectionRules": [],
"matchedDetectionRuleUUIDs": [
"950833c9-8608-4c66-8a3a-0734eac11157"
],
"location": {
"jsonKey": ".[1].phone",
"columnRange": {
"start": 0,
"end": 0
},
"rowRange": {
"start": 0,
"end": 0
},
"codepointRange": {
"start": 0,
"end": 14
},
"byteRange": {
"start": 0,
"end": 14
},
"commitHash": "",
"lineRange": {
"start": 8,
"end": 8
}
},
"finding": "1-583-385-1427",
"detector": {
"version": 1,
"id": "d08edfc4-b5e2-420a-a5fe-3693fb6276c4",
"name": "Phone number Detector"
}
},
{
"confidence": "LIKELY",
"matchedDetectionRules": [],
"matchedDetectionRuleUUIDs": [
"950833c9-8608-4c66-8a3a-0734eac11157"
],
"location": {
"jsonKey": ".[0].email",
"columnRange": {
"start": 0,
"end": 0
},
"rowRange": {
"start": 0,
"end": 0
},
"codepointRange": {
"start": 0,
"end": 35
},
"byteRange": {
"start": 0,
"end": 35
},
"commitHash": "",
"lineRange": {
"start": 2,
"end": 2
}
},
"finding": "[email protected]",
"detector": {
"id": "89f810aa-64a5-4269-b0a0-110d250d55ee",
"name": "email address Detector"
}
},
{
"confidence": "LIKELY",
"matchedDetectionRules": [],
"matchedDetectionRuleUUIDs": [
"950833c9-8608-4c66-8a3a-0734eac11157"
],
"location": {
"jsonKey": ".[1].email",
"columnRange": {
"start": 0,
"end": 0
},
"rowRange": {
"start": 0,
"end": 0
},
"codepointRange": {
"start": 0,
"end": 29
},
"byteRange": {
"start": 0,
"end": 29
},
"commitHash": "",
"lineRange": {
"start": 6,
"end": 6
}
},
"finding": "[email protected]",
"detector": {
"id": "89f810aa-64a5-4269-b0a0-110d250d55ee",
"name": "email address Detector"
}
}
]
}
The value of the jsonKey property can be used to programmatically retrieve the contents of the specific property where the finding was made. For instance, you could use a tool such as jq, which is like sed for JSON data. On jqplay.org, for example, the filter .[0].phone retrieves the value "(211) 488-6068" from the original sample JSON object.
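The same lookup can be done in client code without jq. The sketch below resolves a jsonKey value against a parsed JSON document by walking it one segment at a time. It handles the forms shown on this page (".name", "[i]", and a leading ".[i]" for a top-level array) and is not a full jq path parser: keys that themselves contain dots or brackets, and keys truncated because of their length, cannot be resolved this way. The resolve_json_key function is a hypothetical helper.

```python
import json
import re

# One path segment: either ".name" or "[index]".
_SEGMENT = re.compile(r"\.([^.\[\]]+)|\[(\d+)\]")


def resolve_json_key(document, json_key: str):
    """Follow a jsonKey such as ".[0].phone" through a parsed JSON value."""
    current = document
    position = 0
    while position < len(json_key):
        # In ".[0]" the dot refers to the current value, so skip it.
        if json_key.startswith(".[", position):
            position += 1
            continue
        match = _SEGMENT.match(json_key, position)
        if match is None:
            raise ValueError(f"unsupported jsonKey: {json_key!r}")
        name, index = match.groups()
        current = current[name] if name is not None else current[int(index)]
        position = match.end()
    return current


sample = json.loads('[{"phone": "(211) 488-6068", "name": "Ocean Fox"}]')
print(resolve_json_key(sample, ".[0].phone"))  # (211) 488-6068
```

A missing key or out-of-range index raises the usual KeyError or IndexError, which a caller may want to catch when the local file differs from the one that was scanned.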