Parsers

To add support for a new log type we should:

1. Define the schema for the new log type
2. Provide a parser to parse a single log item
3. Register and validate a log type entry
4. Publish the log type entry so Backend and Frontend can use it

Code structure

Log types are identified by a name of the form Service.EventName. The Service prefix groups log types based on the service that produced them (e.g. NGINX, Juniper, Syslog). All log types for a Service are grouped in a Go package under the parsers package. For example all AWS.* logs live in the awslogs package at internal/log_analysis/log_processor/parsers/awslogs.

Each log type is described in its own .go file inside its package. For example AWS.CloudTrail is defined in the cloudtrail.go file at internal/log_analysis/log_processor/parsers/awslogs/cloudtrail.go.
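
As an illustration, the hypothetical foologs package used for the Foo.Event example later in this guide could be laid out like this (the file names are illustrative, not a convention enforced by Panther):

internal/log_analysis/log_processor/parsers/foologs/
    foo_event.go        schema, parser and log type registration for Foo.Event
    foo_event_test.go   tests for the Foo.Event parser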

New log types must be registered with the logtypes package during module init.

As an example we will use Foo.Event, a log event produced by the Foo service that describes the result of a user request. An example Foo.Event log line is:

{
    "time": "2020-08-06T15:42:23.415Z",
    "message": "failed to serve a request",
    "remote_ip": "127.0.0.1",
    "referrer_url": "https://example.com",
    "request_id": "123456789",
    "user": {
        "uid": 123456,
        "groups": ["admin", "ssh"]
    },
    "duration_s": 0.15
}

Step 0: Writing a test

We will start by writing a test to better understand what we want to achieve and to make sure our code works as intended.

Our goal here is to parse a JSON string of a Foo.Event and produce a Panther log event that will be stored in our storage backend to be processed by the rules engine and queried in the security data lake.

Panther provides the testutil.CheckRegisteredParser helper that tests a registered log type using an input log and the expected JSON for the panther log event(s) it should produce when processed. This method ensures we haven't missed any fields or indicator values throughout the process.

package foologs

import (
    "testing"

    "github.com/panther-labs/panther/internal/log_analysis/log_processor/parsers/testutil"
)

func TestFooEvent(t *testing.T) {
    input := `{
        "time": "2020-08-06T15:42:23.415Z",
        "message": "failed to serve a request",
        "remote_ip": "127.0.0.1",
        "referrer_url": "https://example.com",
        "request_id": "123456789",
        "user": {
            "uid": 123456,
            "groups": ["admin","ssh"]
        },
        "duration_s": 0.15
    }`
    expect := `{
        "time": "2020-08-06T15:42:23.415Z",
        "message": "failed to serve a request",
        "remote_ip": "127.0.0.1",
        "referrer_url": "https://example.com",
        "request_id": "123456789",
        "user": {
            "uid": 123456,
            "groups": ["admin","ssh"]
        },
        "duration_s": 0.15,
        "p_event_time": "2020-08-06T15:42:23.415Z",
        "p_any_ip_addresses": ["127.0.0.1"],
        "p_any_domain_names": ["example.com"],
        "p_any_trace_ids": ["123456789"],
        "p_log_type": "Foo.Event"
    }`
    testutil.CheckRegisteredParser(t, "Foo.Event", input, expect)
}

Note that Panther adds some fields to the output event. We call these fields 'panther fields' and they are prefixed with p_ to avoid name collisions.

Also note that the expect JSON includes all the panther fields so we can verify that the log processing is correct. For testing purposes p_parse_time and p_row_id are omitted since they would vary on each run of the test. The helper only verifies that these fields are non-empty and have a valid format in the parsed result.

The test should fail at this point since we haven't defined anything yet.

Step 1: Defining the log type schema

First order of business is to define the schema for the Foo.Event log type. Panther uses Go structs to define the schema of a log type.

We can represent the schema of a Foo.Event with the following Go structs:

package foologs

import "github.com/panther-labs/panther/internal/log_analysis/log_processor/pantherlog"

type FooEvent struct {
    Time        pantherlog.Time    `json:"time" validate:"required" tcodec:"rfc3339" panther:"event_time" description:"The foo event time"`
    Message     pantherlog.String  `json:"message" validate:"required" description:"The message describing the result of the request"`
    RemoteIP    pantherlog.String  `json:"remote_ip" panther:"ip" description:"The remote IP used for the request"`
    ReferrerURL pantherlog.String  `json:"referrer_url" panther:"url" description:"The referrer URL of the request"`
    RequestID   pantherlog.String  `json:"request_id" panther:"trace_id" description:"The id of the request that generated this event"`
    User        *User              `json:"user" description:"The user that made the request"`
    Duration    pantherlog.Float64 `json:"duration_s" description:"The number of seconds it took to serve the request"`
}

type User struct {
    UID    pantherlog.Int64 `json:"uid" description:"The id of the user that made the request"`
    Groups []string         `json:"groups" description:"The groups the user is a member of"`
}

Field types

Fields in a log event struct should use the types defined in the pantherlog module. These types handle null values and missing JSON fields by omitting them in the output. Empty strings ("") and zero numeric values are never omitted in order to preserve as much as possible from the original log event.
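
For example, given the FooEvent schema above, a null remote_ip and a missing user object are simply dropped from the output, while an empty referrer_url and a zero duration_s are preserved. A sketch of the field-level behaviour, ignoring the added panther fields:

{"time": "2020-08-06T15:42:23.415Z", "message": "failed to serve a request", "referrer_url": "", "remote_ip": null, "duration_s": 0}

would be stored as

{"time": "2020-08-06T15:42:23.415Z", "message": "failed to serve a request", "referrer_url": "", "duration_s": 0}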

Panther uses struct tags to specify field attributes. All fields must have a description tag documenting their contents; these documentation strings are used to generate user documentation from the code. The json tag must use the exact field name that appears in the logs. Panther automatically adds omitempty to all fields.

String values

pantherlog.String

Indicator strings

String values can be tagged with a panther:"SCANNER" tag to define indicator fields. Note that a scanner can produce multiple indicator fields from a single value, or different indicator fields depending on the value. Panther defines the following scanners:

| Scanner | Description |
| --- | --- |
| ip | Adds a p_any_ip_addresses indicator if the value is a valid IP address |
| domain | Adds a p_any_domain_names indicator |
| url | Adds a p_any_domain_names or a p_any_ip_addresses indicator using the hostname part of the URL |
| net_addr | Adds a p_any_domain_names or a p_any_ip_addresses indicator by splitting a HOST:PORT address |
| sha256 | Adds a p_any_sha256_hashes indicator |
| sha1 | Adds a p_any_sha1_hashes indicator |
| md5 | Adds a p_any_md5_hashes indicator |
| trace_id | Adds a p_any_trace_ids indicator |
| aws_arn | Scans an ARN string and adds any p_any_aws_arns, p_any_aws_account_ids and p_any_aws_instance_ids indicators found (needs import of the awslogs package) |
| aws_instance_id | Adds a p_any_aws_instance_ids indicator (needs import of the awslogs package) |
| aws_account_id | Adds a p_any_aws_account_ids indicator (needs import of the awslogs package) |
| aws_tag | Adds a p_any_aws_tags indicator (needs import of the awslogs package) |
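
For example, a field that holds a HOST:PORT pair can be tagged with the net_addr scanner. A minimal sketch (the ConnEvent type and its field are hypothetical, not part of the Foo.Event example):

package foologs

import "github.com/panther-labs/panther/internal/log_analysis/log_processor/pantherlog"

// ConnEvent is a hypothetical event used only to illustrate the net_addr scanner.
// A value such as "10.0.0.1:443" contributes to p_any_ip_addresses, while
// "db.example.com:5432" contributes to p_any_domain_names.
type ConnEvent struct {
    ServerAddr pantherlog.String `json:"server_addr" panther:"net_addr" description:"The HOST:PORT address of the upstream server"`
}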

Timestamps

pantherlog.Time

Timestamps use a separate Go type to allow easier querying of logs using date time ranges. By default timestamps use the RFC3339 format. To specify a different format for a timestamp use the tcodec tag.
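
For example, if a log wrote its timestamp as Unix seconds instead of RFC3339, the field could be declared as below. This is a sketch; the unix codec name is an assumption, so check the tcodec package for the codecs registered in your version:

package foologs

import "github.com/panther-labs/panther/internal/log_analysis/log_processor/pantherlog"

// BarEvent is a hypothetical event whose timestamp arrives as Unix seconds,
// e.g. 1596728543. The tcodec tag selects the codec used to decode the value.
type BarEvent struct {
    CreatedAt pantherlog.Time `json:"created_at" validate:"required" tcodec:"unix" panther:"event_time" description:"The event time in Unix seconds"`
}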

Numbers

All numeric values can be parsed from either a JSON number or a JSON numerical string (i.e. both 42 and "42" are valid).

Floating point numbers

pantherlog.Float64, pantherlog.Float32

Integers

pantherlog.Int64, pantherlog.Int32, pantherlog.Int16, pantherlog.Int8

If you are certain about the range of values an integer can take, you can use one of the smaller sizes accordingly. This will reduce the storage required for the column. If you are unsure about the range limits use pantherlog.Int64.

Unsigned Integers

pantherlog.Uint64, pantherlog.Uint32, pantherlog.Uint16, pantherlog.Uint8

If you are certain about the range of values an integer can take, you can use one of the smaller sizes accordingly. A usual example is a port number, which fits in a pantherlog.Uint16. This will reduce the storage required for the column. If you are unsure about the range limits use pantherlog.Uint64.
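
A minimal sketch of such a field (the ConnInfo type is hypothetical):

package foologs

import "github.com/panther-labs/panther/internal/log_analysis/log_processor/pantherlog"

// ConnInfo illustrates using a small unsigned integer type: a port number
// always fits in 16 bits, so Uint16 is enough.
type ConnInfo struct {
    DstPort pantherlog.Uint16 `json:"dst_port" description:"The destination port of the connection"`
}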

Booleans

pantherlog.Bool

Boolean values handle the null case by omitting the field when encoding to JSON.

Objects

If our log has nested objects we define a separate struct for each object. Fields that hold an object value should use a pointer so the nested object can be safely omitted in the output if it is null or missing in the log input.

Arrays

We use a simple Go slice, which will be omitted in the output if it is empty.

Step 2: Defining a parser

Now that we have defined the schema of Foo.Event we need to provide a way for Panther to parse log input into a panther log event. To achieve this we need to provide a type implementing parsers.Interface.

type Interface interface {
    ParseLog(input string) (results []*pantherlog.Result, err error)
}

The parser takes a log item (in most cases a line of text) and tries to parse it into one or more log events. If it fails to parse the line it should return a nil slice and an error.

For our example the parser would be:

package foologs

import (
    jsoniter "github.com/json-iterator/go"

    "github.com/panther-labs/panther/internal/log_analysis/log_processor/pantherlog"
    "github.com/panther-labs/panther/internal/log_analysis/log_processor/parsers"
)

type FooParser struct {
    pantherlog.ResultBuilder
}

func (p *FooParser) ParseLog(input string) ([]*pantherlog.Result, error) {
    event := FooEvent{}
    // Unmarshal the input string using jsoniter.
    if err := jsoniter.UnmarshalFromString(input, &event); err != nil {
        return nil, err
    }
    // Validate the parsed struct using the `validate` struct tags.
    if err := parsers.ValidateStruct(&event); err != nil {
        return nil, err
    }
    // Package the event into a pantherlog.Result.
    result, err := p.BuildResult("Foo.Event", &event)
    if err != nil {
        return nil, err
    }
    return []*pantherlog.Result{result}, nil
}

Parser code follows this general pattern:

1. Parse the input text into a struct.
2. Validate the parsed output to ensure the input is valid for this log type.
3. Package the log event(s) into pantherlog.Result values so the log processor can store them.

Fields marked with a panther struct tag will only be processed when the Result is encoded to JSON. This is deliberate in order to be able to support both JSON and text-based log types. In the log processor pipeline this happens in the final stage when the result is written to a buffer that will be uploaded to S3.

Attention: This means that the event time will only be set on the Result when it is encoded to JSON.

If a log event requires special logic to produce the event timestamp it can implement pantherlog.EventTimer interface:

type EventTimer interface {
    PantherEventTime() time.Time
}

The time.Time returned by EventTimer instances takes precedence over event timestamps defined with struct tags.
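
As a sketch, an event that stores its timestamp split across two numeric fields could derive the event time in code. The field names and the Value accessor on the pantherlog numeric types are assumptions made for illustration:

package foologs

import (
    "time"

    "github.com/panther-labs/panther/internal/log_analysis/log_processor/pantherlog"
)

// BazEvent is a hypothetical event that carries its timestamp as separate
// Unix seconds and nanoseconds fields, so the event time must be computed.
type BazEvent struct {
    Seconds pantherlog.Int64 `json:"sec" validate:"required" description:"Unix seconds of the event"`
    Nanos   pantherlog.Int64 `json:"nsec" description:"Nanosecond part of the event time"`
}

// PantherEventTime implements pantherlog.EventTimer.
// Note: this assumes the pantherlog numeric types expose their decoded value
// via a Value field; adjust to the accessor your version provides.
func (e *BazEvent) PantherEventTime() time.Time {
    return time.Unix(e.Seconds.Value, e.Nanos.Value).UTC()
}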

Step 3: Create and validate a log type entry

Now that we have our parser and schema ready, we need to map the Foo.Event log type to them. Panther keeps track of supported log types using a registry of log types.

To create and validate a log type entry you need to use the function

logtypes.Register(config logtypes.Config) (logtypes.Entry, error)

This validates the configuration and makes sure there are no duplicate entries system-wide.

The logtypes.Config struct fields include:

  • a struct value defining the log event schema

  • a parsers.Factory value to instantiate parsers for this log type

  • fields describing the log type (name, description and reference URL)

So back to our example we would write the following, using logtypes.MustRegister, a convenience wrapper around Register that panics on error:

package foologs

import (
    "github.com/panther-labs/panther/internal/log_analysis/log_processor/logtypes"
    "github.com/panther-labs/panther/internal/log_analysis/log_processor/parsers"
)

func init() {
    logtypes.MustRegister(logtypes.Config{
        Name:         "Foo.Event",
        Description:  "Foo log event",
        ReferenceURL: "https://example.com/logs/foo-event", // used in generated documentation
        Schema:       FooEvent{},
        NewParser: parsers.FactoryFunc(func(_ interface{}) (parsers.Interface, error) {
            return &FooParser{}, nil
        }),
    })
}

For the common case, where each log line is a JSON object value corresponding to a single log event, you can use logtypes.MustRegisterJSON to avoid most of the boilerplate code in the above example:

package foologs

import "github.com/panther-labs/panther/internal/log_analysis/log_processor/logtypes"

var EntryFooEvent = logtypes.MustRegisterJSON(logtypes.Desc{
    Name:         "Foo.Event",
    Description:  "Foo log event",
    ReferenceURL: "https://example.com/logs/foo-event", // used in generated documentation
}, func() interface{} {
    return &FooEvent{}
})

By now our test should be passing, verifying that Panther can properly handle Foo.Event logs.

Step 4: Publishing the log type to the system

The final step is to publish our new log type to make it available to the system at runtime.

To ensure our foologs package registers its log types at runtime we need to import it in internal/log_analysis/log_processor/registry/registry.go. That package is imported by all Lambda functions and tools that need access to the available log types.

// internal/log_analysis/log_processor/registry/registry.go
package registry

import (
    // Register log types in init() blocks
    _ "github.com/panther-labs/panther/internal/log_analysis/log_processor/parsers/apachelogs"
    _ "github.com/panther-labs/panther/internal/log_analysis/log_processor/parsers/awslogs"
    // ...
    _ "github.com/panther-labs/panther/internal/log_analysis/log_processor/parsers/foologs"
    // ...
)

We also need to add the log type name to web/constants.ts so the frontend knows about the existence of Foo.Event.

// web/constants.ts
export const LOG_TYPES = [
  'Apache.AccessCombined',
  'Apache.AccessCommon',
  // ...
  'Foo.Event',
  // ...
];

We try to keep things tidy by adding the new log types in alphabetical order.

Before making a pull-request

  • Ensure your code is formatted by running mage fmt

  • Ensure all tests pass by running mage test:ci

  • Be sure to check in the automatically generated documentation and update SUMMARY.md if you added a new family of logs.

  • Deploy Panther and add a Source that uses the log types you defined. You should be able to see a new table for your log type in the AWS Glue Data Catalog.

  • Do an end-to-end test. You can use s3queue to copy test files into the panther-bootstrap-auditlogs-<id> bucket to drive log processing or use the development tool ./out/bin/devtools/<os>/<arch>/logprocessor to read files from the local file system.

  • Write a test rule for the new type to ensure data is flowing.