
AWS Data Retrieval Efficiency with S3 Select

For large objects in S3, switching to query-driven partial fetches using S3 Select can help you save on data transfer costs.

AWS S3 is an object storage service, and for our projects we have been using it both as a data lake and as a storage space for our structured data. We recently started using the AWS S3 Select functionality. S3 Select gives us the ability to query our data in place, and it can reduce data transfer costs on AWS because you no longer need to download an entire object. Apart from the reduction in data transfer cost, the other major uplift has been in the time taken to get to results: we are seeing close to a 300% improvement.

With AWS S3 Select, you retrieve just the rows or data points you need instead of fetching the entire object from storage and doing the processing yourself. This improves efficiency in two ways: lower bandwidth, because less data moves over the wire, and lower processing, because you directly get the result instead of the object from which you would then have to extract the result.

What is AWS S3 Select?

In order to understand S3 Select, think of a scenario where you have an object residing in your AWS S3 bucket; for the sake of argument, let’s say it is a CSV file containing the names of employees, their employee IDs, employee grades, and salary per month. Now say you want to query all employees who are part of a certain employee grade. Without S3 Select, you would first need to download the entire file from S3, load it into your application, and then pick out the records that meet your criteria (detailed a bit more under the Application flow with AWS S3 section below). With S3 Select, however, you can write a simple SQL query and have only the matching records returned to you. A two-step (or, depending on the complexity of the architecture, a multi-step) process is reduced to a single step because of the “query in place” that we are able to achieve with S3. The other AWS services that can query S3 data in place (and are the recommended services if you wish to build an analytics service) are AWS Redshift Spectrum and AWS Athena; S3 Select is designed more for selective retrieval of data from a single object.
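For the employee example above, the S3 Select expression would look roughly like this (the column names are hypothetical, matching the CSV just described; s3object is the name S3 Select always uses for the object being queried):

SELECT s."Name", s."Employee_ID", s."Salary" FROM s3object s WHERE s."Grade" = 'B'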

Application flow with AWS S3 (without Select)

While the actual application logic is too complex to detail here, below is a schematic representation of the application we were using earlier. In our case, the application was running on EC2; once we started using S3 Select, we moved this entire setup to a Lambda function. The same Python application could sit anywhere (we also have processes running MapReduce with a very similar architectural abstraction).

Application flow with objects in AWS S3
  • Download the object from AWS S3.
  • Read the object by loading it into a DataFrame.
  • Fetch the necessary values and process the next step.
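As a rough sketch of this flow in Python (the bucket, key, and column names are hypothetical, and pandas stands in for whatever processing layer you use):

import boto3
import pandas as pd

s3 = boto3.client('s3')

# Step 1: download the entire object from S3 (full transfer cost)
s3.download_file('hr-data', 'employees.csv', '/tmp/employees.csv')

# Step 2: load the whole file into a DataFrame
df = pd.read_csv('/tmp/employees.csv')

# Step 3: filter in application code; most of the downloaded data is discarded
grade_b = df[df['Grade'] == 'B']
print(grade_b)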

Application Flow with AWS S3 Select 

With AWS S3 Select, our flow is simplified as follows:

  • We make a call to AWS S3 using the SELECT call. 
  • We pass the necessary SQL 
  • We get the result back and use it directly.
AWS S3 Select enables efficiency and cost savings because of query in place

So in short: no more downloading the file, extracting it, processing it, and then getting the result.

For S3 Select to work, your objects must be stored in CSV, JSON, or Apache Parquet format. Another very interesting point is that S3 Select also supports GZIP and BZIP2 compression for CSV and JSON objects. Whole-object compression is not supported for Parquet.

With Amazon S3 Select you retrieve only the subset of data that you need from your object, and you can expect better performance when objects are compressed and larger in size.

AWS published test data for a CSV object of 133,975,755 bytes (~128 MB) consisting of 1 million lines. With GZIP compression, the file size was reduced by ~60%, down to 50,308,104 bytes (~50.3 MB).

Object       | File Size (Bytes) | Bytes Scanned | Bytes Processed | Bytes Returned | Difference
Uncompressed | 133,975,755       | 133,975,755   | 133,975,755     | 6              | N/A
Compressed   | 50,308,104        | 50,308,104    | 133,975,755     | 6              | ~60% smaller
Data published by AWS shows that compressed objects cut the bytes scanned by ~60% when using AWS S3 Select

Using S3 Select with Python and AWS SDK

S3 Select is available through the AWS SDKs, the AWS CLI, and the REST API. You can use the latest boto3 library to get started with it.

# Import the boto3 library
import boto3

# Create an S3 client
s3 = boto3.client('s3')

# Here we are making an S3 Select call.
# We tell S3 the bucket, the file name (key),
# and what type of expression we are about to send (SQL here).
# We then pass the SQL itself. We also need to describe the
# input: if the file has headers, we should say so, and the
# same goes for whether the file is compressed or not.

resp = s3.select_object_content(
    Bucket='s3select-demo',
    Key='sample_data.csv',
    ExpressionType='SQL',
    Expression="SELECT * FROM s3object s where s.\"Name\" = 'Jane'",
    InputSerialization = {'CSV': {"FileHeaderInfo": "Use"}, 'CompressionType': 'NONE'},
    OutputSerialization = {'CSV': {}},
)

# resp has a Payload (an event stream) which we iterate through.
for event in resp['Payload']:
    if 'Records' in event:
        records = event['Records']['Payload'].decode('utf-8')
        print(records)
    elif 'Stats' in event:
        statsDetails = event['Stats']['Details']
        print("Stats details bytesScanned: ")
        print(statsDetails['BytesScanned'])
        print("Stats details bytesProcessed: ")
        print(statsDetails['BytesProcessed'])
        print("Stats details bytesReturned: ")
        print(statsDetails['BytesReturned'])
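If your object is GZIP-compressed (as discussed earlier), only the input serialization changes. A minimal sketch, assuming a hypothetical compressed key:

resp = s3.select_object_content(
    Bucket='s3select-demo',
    Key='sample_data.csv.gz',
    ExpressionType='SQL',
    Expression="SELECT * FROM s3object s where s.\"Name\" = 'Jane'",
    # The only change: declare the compression so S3 can decompress while scanning
    InputSerialization = {'CSV': {"FileHeaderInfo": "Use"}, 'CompressionType': 'GZIP'},
    OutputSerialization = {'CSV': {}},
)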

Accessing S3 Select using the AWS Console

While it’s possible to use S3 Select via the AWS SDKs, the REST API, and the CLI, it is also available as a feature in the AWS Console. From the console, you can extract up to 40 MB of records from an object of up to 128 MB in size. Here are the steps you need to follow to use S3 Select from the AWS Console:

a. First head to the bucket where you have the object. In my case, I have created a demo bucket, and a CSV file is stored there. Select the file, then click on the menu item “Actions”.

b. Next, select “Query with S3 Select” from the drop-down menu.

Selecting Query with S3 Select allows you to query this object

c. Next, specify your input and output settings. Our object is a comma-separated CSV file with no compression; you will need to select the settings appropriate for your object.

Input settings for the object to be queried
Output settings for the query results. These can also be downloaded later.

d. The next step is to write your SQL query. Do note that only SELECT is allowed. In this example we ran a SELECT * with a LIMIT of 5, which returns the first 5 rows to the console.
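The query used here is simply:

SELECT * FROM s3object s LIMIT 5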

Example S3 Select Query and Results

Once you are happy with the results, you can also download the results as a CSV or JSON depending on what you selected under Output settings.

Download results from S3 Select

What kind of SQL does S3 Select support?

S3 Select supports standard SQL, but there is no support for GROUP BY or JOIN. One thing to understand here is that AWS’s goal with S3 Select is likely to help with data retrieval, not analytics. Therefore, if your goal is to do analytics directly, I suggest you either build your own pipeline or try out Athena.
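A few hedged examples against the employee CSV from earlier (column names are hypothetical; note that CSV values arrive as strings, so numeric comparisons need a CAST):

-- Supported: projections, WHERE filters, LIMIT, and simple aggregates
SELECT s."Name", s."Salary" FROM s3object s WHERE s."Grade" = 'B' LIMIT 100
SELECT COUNT(*) FROM s3object s WHERE CAST(s."Salary" AS INT) > 50000

-- Not supported: GROUP BY, JOIN, ORDER BY
-- SELECT s."Grade", AVG(s."Salary") FROM s3object s GROUP BY s."Grade"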

How are you Charged with AWS S3 Select?

You are charged for the amount of data scanned and the amount of data returned. One feature of S3 Select that can help, if you understand your data well, is selective scanning: if you have a 500 MB file and you only scan the first 50 MB, you are charged for 50 MB of scanning, not 500.
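With the SDK, one way to do this is the ScanRange parameter of select_object_content, which restricts the byte range S3 scans. A minimal sketch (the key is hypothetical, and ScanRange applies to uncompressed CSV and JSON objects):

resp = s3.select_object_content(
    Bucket='s3select-demo',
    Key='large_file.csv',
    ExpressionType='SQL',
    Expression="SELECT * FROM s3object s",
    InputSerialization = {'CSV': {"FileHeaderInfo": "Use"}},
    OutputSerialization = {'CSV': {}},
    # Scan only the first 50 MB; you are charged for these bytes, not the full object
    ScanRange = {'Start': 0, 'End': 50 * 1024 * 1024},
)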
