AWS S3/Athena

A Hands-on Guide

📘

TL;DR AWS S3/Athena:

To integrate AWS S3/Athena and perform searches with Query:

  • Set up AWS IAM access for the API
  • Create IAM access keys
  • Create and/or upload JSON, CSV, XML, or Parquet files to an S3 bucket
  • Create an Athena table
  • Configure the Query platform data connection
  • Perform searches using the Athena data connection

Dhiraj Sharan recently wrote a great post that delves into setting up S3 & Athena for searching security data. In this article we will go step by step to get you started. We will then configure Query’s Federated Search platform to combine those results with other data sources.

Here's a brief data architecture of what we will configure to get data from AWS S3 to the Query Federated Search platform:

Setup User Account

As with everything in AWS, you need access and permissions to the APIs. You will create a specific user for this purpose. You may already have a user and roles configured.

Login to the AWS console, search for IAM, and open the link:

Click "Users" on the left navigation bar, then click "Add User".

Note: you do not need to check "Provide user access to the AWS Management Console". Use this option only if you plan on having this new user access the AWS console; otherwise, leave it unchecked.

Click "Next."

You can attach policies to a group or directly to the user. Below, I have attached a policy directly, called “AmazonS3FullAccess”. Later in this guide, we will need read and write access to the bucket we create. If more granular permissions are needed, you may create your own policy scoped to a specific bucket, as in the sketch below.
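If you go the granular route, here is a minimal boto3 sketch of creating a bucket-scoped policy instead of using AmazonS3FullAccess. This is an illustration only, not part of the console flow; it assumes admin credentials are already configured for boto3, and the bucket name "mysecuritydata" and policy name are placeholders. The console JSON policy editor works just as well.

import json
import boto3

BUCKET = 'mysecuritydata'  # placeholder; use your own bucket name

# Read/write access limited to a single bucket instead of AmazonS3FullAccess
policy_doc = {
    'Version': '2012-10-17',
    'Statement': [
        {
            'Effect': 'Allow',
            'Action': ['s3:ListBucket'],
            'Resource': [f'arn:aws:s3:::{BUCKET}'],
        },
        {
            'Effect': 'Allow',
            'Action': ['s3:GetObject', 's3:PutObject'],
            'Resource': [f'arn:aws:s3:::{BUCKET}/*'],
        },
    ],
}

iam = boto3.client('iam')
iam.create_policy(
    PolicyName='MySecurityDataS3Access',  # placeholder name
    PolicyDocument=json.dumps(policy_doc),
)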

Click "Next."

On the Review & Create page, click "Create User."

Create IAM Access Keys

To upload files via the CLI or Python, we will need to create IAM access keys. In the user window, click the user created above.

Click "Security Credentials."

Scroll down to the section "Access Keys" and click "Create Access Key."

Select “Application running outside AWS” and then click "Next."

Add a tag if you wish, and click "Create Access Key."

In the "Retrieve access keys" screen, save the access key and secret access key in a safe place. We will use this key later.  

Click "Done."

Create S3 Bucket

Next, we will create an S3 bucket to store files. These files can be in a few different formats: JSON lines, Parquet, XML, or CSV. In this tutorial, we will use JSON lines, as it is a common format.

In the AWS console, browse to the S3 page. Click "Create Bucket." You have the option to store the data in any AWS region. In this example, I am using US West (N. California) us-west-1. Select your preferred region.

There are many options when creating a bucket; however, the defaults are sufficient for this tutorial. We do not want public access to this bucket, and the default AWS-managed keys are suitable for encrypting our data. Read through the documentation on any other parameters that are of interest.
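If you prefer to script this step, here is a minimal boto3 sketch of creating the bucket. It assumes the us-west-1 region used in this example and credentials already configured; the bucket name is a placeholder and must be globally unique.

import boto3

s3 = boto3.client('s3', region_name='us-west-1')

# Bucket names are globally unique; replace with your own
s3.create_bucket(
    Bucket='mysecuritydata',
    CreateBucketConfiguration={'LocationConstraint': 'us-west-1'},
)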

Once created, you should see the bucket listed:

One bucket can contain multiple files and folders. We recommend a folder for each data source under the same bucket. Then, under each folder you have the option to store the data by date and hour. For example:

Bucket structure:

├── mysecuritydata (bucket)
│   ├── Data source folder
│   │   ├── Date folder
│   │   │   ├── Date hour file 00
│   │   │   ├── Date hour file 01
│   │   │   ├── Date hour file 02

Click the bucket where you are storing the data. In my example it's “mysecuritydata”. Next, create a folder and name it after the data source type.

Here you can see all of the folders, one for each data source:

Create Sample Data

Next, we can generate sample data in JSON lines format. Here are the two JSON lines we will use for this example:

{"id": "4sLczeBcw8hN", "ts": "2023-04-24T13:55:06", "ip": "172.16.16.10", "hostname": "TS-2852", "fqdn": "ts-2852.directory.query.ai", "user": "barbara.salazar"}
{"id": "82u7wcb5s56u", "ts": "2023-04-24T13:55:06", "ip": "172.16.16.11", "hostname": "TS-1533", "fqdn": "ts-1533.directory.query.ai", "user": "mary.duncan"}

Copy and paste the lines into a text document and save it with the extension “.json”. I’ll save it as userdevices.json. Here is a screenshot from VS Code:
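If you would rather generate the sample file programmatically, here is a small sketch that writes a few records in JSON lines format. The field values are made up for illustration; adjust them to look like your own data.

import json
import random
import string
from datetime import datetime, timezone

users = ['barbara.salazar', 'mary.duncan']

with open('userdevices.json', 'w') as f:
    for i, user in enumerate(users):
        hostname = f'TS-{random.randint(1000, 9999)}'
        record = {
            'id': ''.join(random.choices(string.ascii_letters + string.digits, k=12)),
            'ts': datetime.now(timezone.utc).strftime('%Y-%m-%dT%H:%M:%S'),
            'ip': f'172.16.16.{10 + i}',
            'hostname': hostname,
            'fqdn': f'{hostname.lower()}.directory.query.ai',
            'user': user,
        }
        f.write(json.dumps(record) + '\n')  # one JSON object per line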

Upload Sample Data

You can upload the file using any of several methods: via the web console, the AWS CLI, or code. Below is a Python sample.

Upload Sample Data via Web


Here we will upload the file via the console. Open the folder you created above; in my example it's called SampleTestData. Then click the Upload button.

If successful, you will see your uploaded file.

Upload Sample Data with AWS CLI

You can also upload the file to the bucket/folder using the CLI. For CLI access, you will need to follow this AWS guide to use the keys we created in the “Create IAM Access Keys” portion of this document.

To test your permissions and configuration you can issue the following command:

aws s3 ls <your bucket name>/

In my sample it looks like this:

Next, upload your file with this command:

aws s3 cp <local file> s3://<bucket name>/<folder>

For example:

If you get a permissions error, you will need to update the policy permissions for the account you created. See the Setup User Account section in this document.

Upload Sample Data with Python

Below is a simple example of how to upload a file to AWS S3 using Python. This was created with Python 3.9; however, it should work on most Python versions newer than 3.4.

The script depends on the same credentials used by the AWS CLI. If you are unable to run aws s3 ls, this script will not work. We are good security practitioners and do not want to put credentials in our code.

This is only sample code to get you started; you may want to add additional code to validate the file contents. To get started, I would recommend setting up a Python virtual environment, and the Python documentation is a great place to start. Once you're ready, create your virtual environment in a new folder with these commands on Linux or Mac:

cd /<your new folder>
python3 -m venv .venv

Activate the Python environment:

source .venv/bin/activate

On Windows, the commands would be:

cd \<your new folder>
python -m venv .venv

Activate the Python environment (Windows):

c:\> .venv\Scripts\activate.bat

Or in PowerShell:

PS C:\> .venv\Scripts\Activate.ps1

Testing

To test your Python virtual environment on Mac or Linux, use which python. This confirms that the Python binary is the local virtual environment's binary and not the system Python. Here are the steps I took on my Mac:

mkdir /Users/guruuser/scripts/test-env/
cd /Users/guruuser/scripts/test-env/
python3 -m venv .venv
source .venv/bin/activate
which python
/Users/guruuser/scripts/test-env/.venv/bin/python

Next, you will need to install a few modules. The commands are:

pip install boto3
pip install pendulum

Below you will see the code to push a file to AWS S3. You will need to update four lines of the script to match your environment:

  • awsprefix
  • jsondir
  • bucket
  • security_folder

For the example in this guide, here’s my configuration:

awsprefix = 'dt-'
jsondir = './toUpload/'
bucket = 'mysecuritydata'
security_folder = 'SampleTestData'

Note: this was written on a Mac and works on Linux. Windows users should be careful when copying and pasting, particularly with indentation.

To run the script from your virtual environment, use python3 aws-s3-upload.py.

The script uploads files into a folder structure based on year, month, and day, and names each file for the hour it was collected.

The script: aws-s3-upload.py

import os
import boto3
import pendulum
from glob import glob
import botocore.exceptions

awsprefix = 'dt-'
jsondir = '<YOUR LOCAL DIRECTORY HERE>'
bucket = '<YOUR AWS BUCKET HERE>'
security_folder = '<YOUR AWS FOLDER>'
s3_client = boto3.client('s3')

def getawsfile(jfilepath):
    filedate = os.path.getctime(jfilepath)
    file_dt = pendulum.from_timestamp(filedate)
    file_yr = file_dt.format('Y')
    file_mo = file_dt.format('MM')
    file_day = file_dt.format('DD')
    file_hr = file_dt.format('HH')
    awspath = f'{security_folder}/{file_yr}/{file_mo}/{file_day}/'
    awsfilename = f'{awsprefix}{file_yr}{file_mo}{file_day}-{file_hr}.jsonl'

    return (f'{awspath}{awsfilename}')

def main():
    jsonfiles = glob(jsondir + '*.json')  # get all json files in the dir
    for jfile in jsonfiles:
        # name the uploaded file based on the timestamp of the local file
        awsfileandpath = getawsfile(jfile)
        try:
            s3_client.upload_file(jfile, bucket, awsfileandpath)
        except boto3.exceptions.S3UploadFailedError as s3e:
            print(s3e)
            print('Error: File upload failed!')
        except botocore.exceptions.ClientError as ce:
            print(ce)
            print('Error: File upload failed!')
        else:
            print(f'{jfile} file uploaded to {awsfileandpath}')

if __name__ == '__main__':
    main()

Review Your Uploaded Files

Now that you have uploaded your file(s), you should be able to see the objects in the AWS console. Here you see I’ve uploaded the files to the “mysecuritydata” bucket, under “SampleTestData/2023/04/26”:
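You can also verify the upload programmatically. Here is a minimal boto3 sketch that lists the objects under the folder; the bucket and prefix match the example in this guide, so replace them with your own.

import boto3

s3 = boto3.client('s3')

# List every object under the SampleTestData/ prefix
resp = s3.list_objects_v2(Bucket='mysecuritydata', Prefix='SampleTestData/')
for obj in resp.get('Contents', []):
    print(obj['Key'], obj['Size'])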

Now that we have the files uploaded, we can query the json files using Athena.

Setup Athena Table

Athena lets you query data stored in an S3 bucket in CSV, JSON lines, Parquet, or XML format. Athena uses standard SQL syntax and searches through all of the files in the bucket as its data storage. To get started with Athena, we must configure Athena tables so it understands the structure of the files stored in S3. Note: charges are incurred for each query, as Athena scans the data files.

There are two methods to configure Athena tables. You can define the table structure manually, much like creating a database table, or you can use AWS Glue to scan the files and configure the structure for you; a boto3 sketch of the Glue approach follows.
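For reference, here is a minimal boto3 sketch of the Glue approach: it creates and starts a crawler that scans the folder and builds the table for you. The crawler name is made up, and the role ARN is a placeholder for an IAM role with Glue and S3 read permissions.

import boto3

glue = boto3.client('glue', region_name='us-west-1')

# Crawl the sample data folder and register the resulting table in securitydb
glue.create_crawler(
    Name='securitydata-crawler',  # made-up name
    Role='arn:aws:iam::<YOUR ACCOUNT ID>:role/<YOUR GLUE ROLE>',  # placeholder
    DatabaseName='securitydb',
    Targets={'S3Targets': [{'Path': 's3://mysecuritydata/SampleTestData/'}]},
)
glue.start_crawler(Name='securitydata-crawler')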

In our example, we are going to manually create the table since we know the data structure. Open the AWS console and browse to the AWS Athena query editor page:

We will first need to create a database. The command is:

CREATE DATABASE <database_name>

In my example, I call my Athena database securitydb:

Click Run.

If the database was created you should see:

Next, let's create a table within the database we just created:

Here’s the SQL code:

CREATE EXTERNAL TABLE IF NOT EXISTS `securitydb`.`inventory` (
        `id` string,
        `ts` timestamp,
        `ip` string,
        `hostname` string,
        `fqdn` string,
        `user` string
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES ('paths' = 'id, ts, ip, hostname, fqdn, user')
LOCATION 's3://mysecuritydata/SampleTestData/';

Notice that I mapped each key within the JSON to a column type, either string or timestamp. Athena supports other data types as well, so review your data and make the appropriate updates.

Click "Run" to create this database table.

Once created, you will see in the left navigation bar a database named securitydb with one table called inventory.

Testing Athena

So far we have:

  1. Created a user with the proper permissions to upload to S3,
  2. Created IAM keys for that user,
  3. Created an S3 bucket,
  4. Uploaded a file to the bucket via the console, the CLI, and Python, and
  5. Created an Athena database & table.

Let's test our configuration. From the Athena query editor, open a new tab (the blue + on the right) and run this SQL: select * from securitydb.inventory

Here you can see the two results found in the S3 files.
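You can also run the same query programmatically with boto3. This is a minimal sketch: Athena needs an S3 location to write its result files, and the athena-results/ path below is an assumption, so point it at a location of your choosing.

import time
import boto3

athena = boto3.client('athena', region_name='us-west-1')

# Start the query; OutputLocation is where Athena writes result files
resp = athena.start_query_execution(
    QueryString='SELECT * FROM securitydb.inventory',
    ResultConfiguration={'OutputLocation': 's3://mysecuritydata/athena-results/'},
)
qid = resp['QueryExecutionId']

# Poll until the query finishes
while True:
    status = athena.get_query_execution(QueryExecutionId=qid)
    state = status['QueryExecution']['Status']['State']
    if state in ('SUCCEEDED', 'FAILED', 'CANCELLED'):
        break
    time.sleep(1)

# Print each row of the result set (the first row holds the column headers)
if state == 'SUCCEEDED':
    results = athena.get_query_results(QueryExecutionId=qid)
    for row in results['ResultSet']['Rows']:
        print([col.get('VarCharValue') for col in row['Data']])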

Setup Query Federated Search

Coming Soon

View Query Results

Coming Soon