
AWS Data Retrieval Efficiency with S3 Select

AWS S3 is an object storage service, and for our projects we have been using it both as a data lake and as a storage space for our structured data. We recently started making use of the AWS S3 Select functionality. S3 Select gives us the ability to query our data in place, which can reduce data transfer costs on AWS since you do not need to download an entire object. Apart from the reduction in data transfer cost, the other major uplift has been in the time taken to get to the results: we are seeing close to a 300% improvement.

With AWS S3 Select, you can retrieve just a few rows or a few data points instead of fetching the entire object from storage and doing the processing yourself. This improves efficiency both through lower bandwidth requirements and through lower processing, since you now directly get the result instead of the object from which you would then have to extract the result.

What is AWS S3 Select?

In order to understand S3 Select, think of a scenario where you have an object residing in your AWS S3 bucket. For the sake of argument, let’s say it is a CSV file containing names of employees, their employee IDs, employee grades and salary per month. Let’s say you want to query all those employees who are part of a certain employee grade. Without S3 Select, you would need to first download the entire file from S3, load it into your application and then pick out the records that meet your criteria (detailed a bit more under the Application flow with AWS S3 section). With S3 Select, however, you can write a simple SQL query, as sketched below, and have the specific records returned to you. A two-step (or, depending on the complexity of the architecture, a multi-step) process is reduced to a single-step process because of the “Query in place” that we are able to achieve with S3. The other AWS services that can query S3 data in place (and are the recommended services to use if you wish to build an analytics service) are AWS Redshift Spectrum and AWS Athena. S3 Select is designed more for selective retrieval of data from a single object.
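
To make this concrete, the SQL expression for the employee-grade scenario would look something like the sketch below. The column names (Name, EmployeeId, Grade) and the grade value are hypothetical placeholders; the full boto3 call that carries such an expression is shown later in this article.

# A minimal sketch of an S3 Select expression for the employee-grade scenario.
# Column names and the grade value are illustrative, not from a real dataset.
expression = "SELECT s.\"Name\", s.\"EmployeeId\" FROM s3object s WHERE s.\"Grade\" = 'A'"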

Application flow with AWS S3 (without Select)

While the actual application logic is fairly complex to detail out here, below is a schematic representation of the application we were using earlier. In our case we were running the application on EC2; once we started using S3 Select, we moved this entire setup to a Lambda function. The same Python application could be sitting anywhere (we also have processes running MapReduce with a very similar architecture abstraction).

Application Flow with objects in AWS S3
  • Download object from AWS S3. 
  • Read the object by loading it into a dataframe. 
  • Do the necessary value fetch and process the next step (a minimal sketch of this flow follows below). 
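
As a rough sketch of this earlier flow (assuming pandas and the hypothetical employees.csv object from the example above), it looked something like this:

import boto3
import pandas as pd

s3 = boto3.client('s3')

# Step 1: download the whole object from S3 (hypothetical bucket and key).
s3.download_file('employee-data', 'employees.csv', '/tmp/employees.csv')

# Step 2: load the entire file into a dataframe.
df = pd.read_csv('/tmp/employees.csv')

# Step 3: pick up only the records that meet our criteria.
grade_a = df[df['Grade'] == 'A']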

Application Flow with AWS S3 Select 

With AWS S3 Select, our flow has been simplified as follows:

  • We make a call to AWS S3 using the SELECT call. 
  • We pass the necessary SQL 
  • We get the result and we use that result. 
Amazon S3 Select allows efficiency and cost savings because of Query in Place

So in short: no more downloading the file, extracting it, processing it and then getting the result.

For S3 Select to work, your objects should be stored in CSV, JSON or Apache Parquet format. One other very interesting thing is that S3 Select also supports GZIP or BZIP2 compression on CSV and JSON objects. Compressed Parquet is not supported.
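
If your CSV objects are gzipped, the only change needed in the boto3 call shown later in this article is in the InputSerialization block, along these lines (a sketch, not a full call):

# Input settings for a GZIP-compressed CSV object with a header row.
input_serialization = {
    'CSV': {'FileHeaderInfo': 'Use'},
    'CompressionType': 'GZIP',  # or 'BZIP2'
}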

With Amazon S3 Select, you retrieve only the subset of data that you need from your object. You can expect even better performance gains when objects are compressed and larger in size.

As per some data published by AWS, a test CSV object of 133,975,755 bytes (~128 MB) consisting of 1 million lines was reduced by ~60%, down to 50,308,104 bytes (~50.3 MB), with GZIP compression.

File         | File Size (Bytes) | Bytes scanned | Bytes processed | Bytes returned | Difference
Uncompressed | 133,975,755       | 133,975,755   | 133,975,755     | 6              | N/A
Compressed   | 50,308,104        | 50,308,104    | 133,975,755     | 6              | ~60% smaller
Data published by AWS shows a ~60% reduction in bytes scanned when dealing with compressed objects and AWS S3 Select

Using S3 Select with Python and AWS SDK

S3 Select is available through the AWS SDKs, the AWS CLI and the REST API. You can use the latest boto3 library and get started with it.

# Import the boto3 library.
import boto3

# Create an S3 client.
s3 = boto3.client('s3')

# Here we are creating an S3 Select call.
# We tell S3 the bucket and the file name (key), and
# what type of expression we are about to send (SQL here).
# We then also pass the SQL itself. We also need to describe the
# input: if the file has headers, we should say so, and the same
# goes for whether the file is compressed or not.

resp = s3.select_object_content(
    Bucket='s3select-demo',
    Key='sample_data.csv',
    ExpressionType='SQL',
    Expression="SELECT * FROM s3object s where s.\"Name\" = 'Jane'",
    InputSerialization = {'CSV': {"FileHeaderInfo": "Use"}, 'CompressionType': 'NONE'},
    OutputSerialization = {'CSV': {}},
)

# resp has a Payload (an event stream) which we go through.
for event in resp['Payload']:
    if 'Records' in event:
        records = event['Records']['Payload'].decode('utf-8')
        print(records)
    elif 'Stats' in event:
        statsDetails = event['Stats']['Details']
        print("Stats details bytesScanned: ")
        print(statsDetails['BytesScanned'])
        print("Stats details bytesProcessed: ")
        print(statsDetails['BytesProcessed'])
        print("Stats details bytesReturned: ")
        print(statsDetails['BytesReturned'])
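
One practical note: the Records payload arrives as a stream of chunks, and a single row can in principle be split across chunk boundaries, so in our own code we concatenate the chunks before parsing them. A minimal sketch (this would replace the loop above rather than run after it, since the Payload stream can only be read once):

# Stitch the streamed record chunks back together before parsing.
chunks = []
for event in resp['Payload']:
    if 'Records' in event:
        chunks.append(event['Records']['Payload'].decode('utf-8'))
result_csv = ''.join(chunks)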

Accessing S3 Select using AWS Console

While it’s possible to use S3 Select via the AWS SDKs, the REST API and the CLI, it is also available as a feature in the AWS Console. If you are using the AWS Console, you can extract up to 40 MB of records from an object that is up to 128 MB in size. Here are the steps you need to follow to use S3 Select from the AWS Console:

a. First head to the bucket where you have the object. In my case, I have created a demo bucket and a CSV file is stored there. First select the file and then click on the menu item “Actions”.

b. Next select “Query with S3 Select” from the drop down.

Selecting Query with S3 Select allows you to query this object

c. Next, specify your input and output settings. Our object is a CSV file which is comma separated and has no compression. You will need to select the appropriate settings for your object.

Input settings for the object to be queried
Output settings for the query results. These can also be downloaded later.

d. The next step after this is to write your SQL query. Do note that only SELECT is allowed. In this example we have done a SELECT * with a limit of 5, which returns the first 5 results to the console.

Example S3 Select Query and Results

Once you are happy with the results, you can also download the results as a CSV or JSON depending on what you selected under Output settings.

Download results from S3 Select

What kind of SQL does S3 Select support?

S3 Select supports standard SQL, but there is no support for GROUP BY or JOIN. One thing you need to understand here is that AWS has provided this feature with data retrieval in mind rather than analytics. Therefore, if your goal is to do analytics directly, I would suggest you either build your own pipeline or try out Athena.
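
As an illustration, simple projections, WHERE filters, LIMIT and scalar aggregates such as COUNT do work against a single object, while grouping or joining across objects does not. A minimal sketch reusing the sample_data.csv object from the earlier example:

# Counts the matching rows inside one object; a GROUP BY or JOIN would be rejected.
resp = s3.select_object_content(
    Bucket='s3select-demo',
    Key='sample_data.csv',
    ExpressionType='SQL',
    Expression="SELECT COUNT(*) FROM s3object s WHERE s.\"Name\" = 'Jane'",
    InputSerialization={'CSV': {'FileHeaderInfo': 'Use'}, 'CompressionType': 'NONE'},
    OutputSerialization={'CSV': {}},
)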

How are you Charged with AWS S3 Select?

You are charged for the amount of data scanned and the amount of data returned. One feature of S3 Select that can help you, if you understand your data well, is selective scanning. With selective scanning, if you have a 500 MB file and you only scan the first 50 MB, you will only be charged for scanning 50 MB and not 500 MB.
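
In boto3, selective scanning maps to the ScanRange parameter of select_object_content, which takes byte offsets (to the best of my understanding it applies to uncompressed CSV and JSON objects). A sketch for the 50 MB example above, reusing the sample_data.csv object:

# Scan only the first 50 MB of the object.
resp = s3.select_object_content(
    Bucket='s3select-demo',
    Key='sample_data.csv',
    ExpressionType='SQL',
    Expression="SELECT * FROM s3object s WHERE s.\"Name\" = 'Jane'",
    InputSerialization={'CSV': {'FileHeaderInfo': 'Use'}, 'CompressionType': 'NONE'},
    OutputSerialization={'CSV': {}},
    ScanRange={'Start': 0, 'End': 50 * 1024 * 1024},
)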


How To Lower Your AWS Data Transfer Costs

AWS (Amazon Web Services) charges its users for three primary components across its services: Compute, Bandwidth and Storage. In this article we will first understand what constitutes Data Transfer and then look at a few tried and tested ways of reducing your Data Transfer costs on AWS.

AWS data transfer costs can add up very quickly, so it is important to understand how these costs work and some basic ways in which they can be optimized.

How AWS looks at Data Transfer and associated Costs

Even before we get started with reducing our usage bill, we need to know how AWS looks at Data Transfer. There are two pieces to this puzzle. There is data transfer that happens within AWS, i.e. within the same data centre (or, let’s say, the AWS region), which could be for accessing your database on another server or accessing a certain service like Kinesis; and then there is bandwidth consumption that happens between AWS and the public internet.

Fig: Incoming traffic to AWS region from Internet

When it comes to networking, we describe data movement using terms like ingress, midgress and egress.

  • ingress traffic refers to the data that makes its way into AWS from the public internet. This is usually the data that is not a result of a request made from within the VPC.
  • midgress traffic here will refer to traffic between AWS regions. For example, a request made from your EC2 instance to AWS S3. Within AWS, there is a concept of a Region, and then there is a concept of Availability Zones within a region. So while you could theoretically say that all traffic within AWS is midgress, the pricing rules take traffic movement across Regions and Availability Zones into account.
Fig : An AWS Region comprises multiple Availability Zones. These zones in turn host services.
  • egress traffic refers to the traffic making its way out of the AWS region to the public internet. This could be because of a script running on an EC2 instance that pings some public server, or it could be traffic going out of the VPC (in which case it will be egress for the VPC).

For certain AWS services, data movement cost is part of the service cost itself, instead of being a separate line item.

Data Transfer within AWS

This is the midgress use-case we discussed earlier, where data travels between AWS services across regions, within the same region but across different Availability Zones, or within the same region and the same Availability Zone.

  • At the time of writing this article, data transferred between AWS services like EC2, RDS, Redshift, ElastiCache, Elastic Network Interfaces or VPC peering connections across Availability Zones in the same AWS Region is charged at $0.01/GB in each direction (see the worked example after this list).
  • If you make use of AWS services like EC2, S3, Glacier, DynamoDB, SES, SQS, Kinesis, ECR, SNS or SimpleDB, do note that data transferred between these services within the same region is free.
  • Also note that within the same availability zone (this is like being in the same server rack), most of the AWS services have zero fee on data transfer.
  • Data Transfer from one region to another is classified as InterRegion Inbound (data transfer in) and InterRegion Outbound (data transfer out) on your AWS monthly bill. Inter-region data transfer is charged by AWS as per the data transfer pricing of the source region.
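
As a quick worked example using the $0.01/GB cross-AZ rate above (the 200 GB figure is purely illustrative): if an application in one Availability Zone pulls 200 GB a month from a database in another Availability Zone in the same region, that traffic is billed in both directions, so the data transfer line item works out to roughly 200 GB × $0.01/GB × 2 = $4 per month.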

Data Transfer between AWS and Public Internet

If you use any service that transfers data out to the public internet from an AWS region, you are generally billed based on the pricing for that specific region. Data transfer rates are tiered, so you will be charged depending on the usage tier you fall into.

Fig : Requests from AWS Region to Public Internet

The image below shows the pricing you will incur for sending data out of EC2 to the internet. All ingress is free. Refer to the latest AWS data transfer pricing here.

Fig : Amazon EC2 On Demand Data Transfer Cost to the Internet

AWS Free Tier gives you 15 GB of free Data Transfer Out each month for one year. After the first year, you get 1 GB of Data Transfer Out to the internet free per month per region.

Tips for Saving AWS Data Transfer Costs

  • Plan your architecture so that you factor in the routes the data will travel. As we have seen, data transfer costs within AWS are highest for inter-region connectivity, next for traffic between Availability Zones in the same region, and lowest within the same Availability Zone.
  • Back in 2015, we had a high-traffic application hosted on one EC2 instance that called another EC2 server (within the same Availability Zone) over its public internet address. This added both to latency and to data transfer cost for us. We were able to get rid of both by switching to the private IP address. The lesson here is to use private IP addresses whenever your application architecture allows; this is cheaper than using either public or Elastic IP addresses.
  • We have been able to come up with interesting solutions by complementing our data transfer modelling with the AWS Calculator as well. One interesting solution we found was using CloudFront with Amazon EC2. If you transfer a lot of data to your end users (images, videos, large files etc.), do consider using CloudFront. Transfers from EC2 to CloudFront (which are called “origin fetches”), i.e. from origin to edge locations, are free of charge. You will, however, pay the object delivery charges based on the regions selected for CloudFront.
  • Look at your WAN configurations and Multi-AZ setups; there are often leakages that can be plugged here.
  • If you are using a VPN, you might want to consider AWS Direct Connect for connectivity between AWS and your own corporate data centre. This can also have a significant cost benefit, since with Direct Connect you are charged only once: for data that egresses from the AWS region to your Direct Connect location, at region-to-Direct-Connect-location rates.
