Migrating data from RDS to Redshift using AWS Data Pipeline

Fola Animashaun
4 min read · Jun 1, 2021

Data Pipeline is an Amazon Web Services (AWS) service that allows you to move data from one place to another and run transformations along the way. You can roughly compare it to Google Cloud’s Dataflow.

At work, I am currently creating data pipelines to publish our data to a single warehouse, which will be updated as frequently as needed. We have data on Azure and on AWS’s Relational Database Service (RDS), and I need to bring it all into one place to create data-rich tables and to serve as a single data source for Metabase, our data visualization tool. Hopefully this will also help others who are in a similar situation or new to AWS.
Below is a simple diagram describing what I’m trying to achieve.

First steps …

1. Make sure Data Pipeline is available in your region

2. Create a Redshift cluster

3. Create a bucket in S3 that is in the same region as Data Pipeline

4. Create a table to avoid the error below. I did mine manually on Redshift (a scripted sketch of steps 2–5 follows this list).

'XXX.table_name' doesn't exist and no createTableSql was provided at

The Amazon Redshift documentation on CREATE TABLE can guide you on how to create a table.

5. Ensure DataPipelineDefaultRole has the following policies attached:

  • AWSDataPipelineRole
  • AmazonEC2RoleforDataPipelineRole
  • AmazonS3FullAccess. This last policy fixed the error below for me:
Please ensure role ('DataPipelineDefaultRole') has s3:Get*, s3:List*, s3:Put* and sts:AssumeRole permissions for DataPipeline
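
If you prefer to script these prerequisites instead of clicking through the console, here is a minimal boto3 sketch covering steps 2–5. Everything in it is an assumption on my part: the region, cluster identifier, bucket name, credentials and table columns are placeholders, and I am assuming the two Data Pipeline managed policies live under the service-role ARN path.

import boto3

REGION = "eu-west-1"  # placeholder; use a region where Data Pipeline is available

# Step 2: create a small Redshift cluster (identifier and credentials are placeholders).
redshift = boto3.client("redshift", region_name=REGION)
redshift.create_cluster(
    ClusterIdentifier="my-warehouse",
    NodeType="dc2.large",
    ClusterType="single-node",
    MasterUsername="awsuser",
    MasterUserPassword="Change-Me-123",
    DBName="dev",
)

# Step 3: create the S3 staging bucket in the same region as the cluster.
# (For us-east-1 you would omit CreateBucketConfiguration entirely.)
s3 = boto3.client("s3", region_name=REGION)
s3.create_bucket(
    Bucket="my-datapipeline-staging",
    CreateBucketConfiguration={"LocationConstraint": REGION},
)

# Step 4: create the target table up front so the copy activity has somewhere to land.
# Wait for the cluster to become available before running this call.
# The columns below are hypothetical; mirror your RDS MySQL table's schema.
redshift_data = boto3.client("redshift-data", region_name=REGION)
redshift_data.execute_statement(
    ClusterIdentifier="my-warehouse",
    Database="dev",
    DbUser="awsuser",
    Sql="""
        CREATE TABLE IF NOT EXISTS public.table_name (
            id         BIGINT,
            created_at TIMESTAMP,
            amount     DECIMAL(10, 2)
        );
    """,
)

# Step 5: attach the managed policies named above to DataPipelineDefaultRole.
iam = boto3.client("iam")
for policy_arn in (
    "arn:aws:iam::aws:policy/service-role/AWSDataPipelineRole",
    "arn:aws:iam::aws:policy/service-role/AmazonEC2RoleforDataPipelineRole",
    "arn:aws:iam::aws:policy/AmazonS3FullAccess",
):
    iam.attach_role_policy(RoleName="DataPipelineDefaultRole", PolicyArn=policy_arn)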

To create a data pipeline, visit the link below and follow these steps:
https://console.aws.amazon.com/datapipeline/

For my use case, I am performing a full copy of the data from RDS to Redshift on a daily basis.
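
The console is what I used, but for completeness the same pipeline can be registered and activated programmatically. This is only a sketch under my own assumptions (placeholder name, unique ID and region), not what I ran:

import boto3

dp = boto3.client("datapipeline", region_name="eu-west-1")  # placeholder region

# Register the pipeline shell. The definition itself (the template's objects and
# parameter values) still has to be uploaded with put_pipeline_definition
# before activation will succeed.
pipeline = dp.create_pipeline(
    name="rds-to-redshift-full-copy",         # hypothetical name
    uniqueId="rds-to-redshift-full-copy-v1",  # idempotency token
)

dp.activate_pipeline(pipelineId=pipeline["pipelineId"])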

Parameters

Name: Name the data pipeline (this cannot be changed later, so name it appropriately)

Source: Build using a template, then choose Full copy of RDS MySQL table … (an illustrative set of example values for these fields is sketched after this list)

RDS MySQL password: Your database password.

RDS MySQL table name: The name of the table in your database

Redshift username: The username for your Redshift cluster

Redshift table insert mode: OVERWRITE_EXISTING (this is because I want a full rewrite of the table)

RDS MySQL connection string: Usually in this format: jdbc:mysql://dbinstance.id.region.rds.amazonaws.com:port/dbname

S3 staging folder: The S3 folder path (the folder should be in the same region as the Redshift cluster)

Redshift table distribution key (optional)

MySQL to Redshift type conversion overrides (optional)

Redshift table name: The name of the table you created

Redshift password: Your Redshift cluster password

RDS and Redshift security group(s): The name of your security group(s); this can be found in RDS

RDS MySQL username: Your database username

Redshift table sort keys (optional)

Redshift JDBC connection string: Usually in this format: jdbc:redshift://endpoint:port/database
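
To make the fields above concrete, here is an illustrative set of values. Every value is hypothetical (the hosts, database, table, bucket and security group names are made up), and the keys are the console labels rather than the template’s internal parameter IDs:

# Hypothetical example values for the console fields listed above.
example_values = {
    "RDS MySQL connection string": "jdbc:mysql://mydb.abc123xyz.eu-west-1.rds.amazonaws.com:3306/shopdb",
    "RDS MySQL username": "admin",
    "RDS MySQL password": "<your RDS password>",
    "RDS MySQL table name": "orders",
    "Redshift JDBC connection string": "jdbc:redshift://my-warehouse.abc123xyz.eu-west-1.redshift.amazonaws.com:5439/dev",
    "Redshift username": "awsuser",
    "Redshift password": "<your Redshift password>",
    "Redshift table name": "public.orders",
    "Redshift table insert mode": "OVERWRITE_EXISTING",
    "S3 staging folder": "s3://my-datapipeline-staging/rds-to-redshift/",
    "RDS and Redshift security group(s)": "default",
}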

Schedule

There are two options for scheduling your pipeline run:

On pipeline activation: This would run once every 100 years, so basically once.

On a schedule: You can set the pipeline to run periodically, with a start date and an end date.

Note that you cannot switch from "on a schedule" to "on pipeline activation" (or vice versa) later.
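
For reference, when you pick "on a schedule" the console generates a Schedule object in the pipeline definition. The sketch below shows roughly what that object looks like in the shape boto3's put_pipeline_definition expects; the id, name and dates are placeholders based on my daily full-copy setup:

# Roughly what a daily schedule looks like as a pipeline object
# (placeholder id/name/dates), in the shape put_pipeline_definition expects.
daily_schedule = {
    "id": "DefaultSchedule",
    "name": "Every 1 day",
    "fields": [
        {"key": "type", "stringValue": "Schedule"},
        {"key": "period", "stringValue": "1 days"},
        {"key": "startDateTime", "stringValue": "2021-06-01T00:00:00"},
        # optionally: {"key": "endDateTime", "stringValue": "2022-06-01T00:00:00"},
    ],
}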

You can view the pipeline stages by clicking on the pipeline’s link; you will then see something like the image below.

To check whether your data was successfully uploaded, click the ellipsis next to the table name in Redshift and select Preview data.

Or query the data in Redshift:

SELECT * FROM tablename LIMIT 10;
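
If you prefer to check from code rather than the console or query editor, here is a sketch using the Redshift Data API; the region, cluster, database, user and table names are placeholders:

import time
import boto3

redshift_data = boto3.client("redshift-data", region_name="eu-west-1")  # placeholder region

# Run the same preview query programmatically.
stmt = redshift_data.execute_statement(
    ClusterIdentifier="my-warehouse",
    Database="dev",
    DbUser="awsuser",
    Sql="SELECT * FROM public.orders LIMIT 10;",
)

# Poll until the statement finishes, then print the returned rows
# (fetching results will raise an error if the statement failed).
while redshift_data.describe_statement(Id=stmt["Id"])["Status"] not in ("FINISHED", "FAILED", "ABORTED"):
    time.sleep(1)

print(redshift_data.get_statement_result(Id=stmt["Id"])["Records"])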

Notes:

1. For reasons unknown after my first successful … primary key (salesid)

2. Even though the RedshiftTableCreateActivity was successful, I encountered an error at S3ToRedshiftCopyActivity:

output table named 'xxx.table_name' doesn't exist and no createTableSql was provided

3. I got a validation warning, but I was still able to run the pipeline successfully.

4. I think AWS can do better with the arrangement of the parameter fields, for example by putting the RDS parameters together and the Redshift parameters together.
