Introduction

AWS AppFlow is a new data integration service which, as its tagline states, "securely integrates apps and automates data flows at any scale, without code". It offers features such as:

  • Point and click user interface
  • Native SaaS integrations
  • High scale data transfer 

In this article we will explore using AppFlow to extract data and load it into the Snowflake Cloud Data Platform.    

At the core of the Snowflake Cloud Data Platform is a cloud data warehouse. It offers features such as the separation of storage from compute, automatic elastic scaling, and data sharing.    

Configuration

While there are various AppFlow source connectors to choose from such as Zendesk, ServiceNow, and Slack, for the purposes of this blog the data source was three CSV files, each with the same format, located on AWS S3. 

A prerequisite to configuring Snowflake as a destination in AppFlow is a Snowflake external stage. An external stage allows Snowflake to access data files stored in a location outside of Snowflake. Once that is set up, the AppFlow configuration can begin.
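For reference, the following is a minimal sketch of that Snowflake-side setup using the snowflake-connector-python library. The account details, IAM role ARN, bucket, and object names are all placeholders, and a storage-integration-based stage is just one option (credential-based stages also work):

    import os
    import snowflake.connector

    # Placeholder connection details; in practice these would come from config.
    # Creating a storage integration needs a role with the CREATE INTEGRATION
    # privilege (typically ACCOUNTADMIN).
    conn = snowflake.connector.connect(
        account="my_account",
        user="my_user",
        password=os.environ["SNOWFLAKE_PASSWORD"],
        role="ACCOUNTADMIN",
    )
    cur = conn.cursor()

    # Database/schema to hold the stage (placeholder names).
    cur.execute("CREATE DATABASE IF NOT EXISTS appflow_db")
    cur.execute("USE SCHEMA appflow_db.public")

    # A storage integration delegates S3 access to an IAM role (placeholder ARN).
    cur.execute("""
        CREATE STORAGE INTEGRATION IF NOT EXISTS appflow_s3_int
          TYPE = EXTERNAL_STAGE
          STORAGE_PROVIDER = 'S3'
          ENABLED = TRUE
          STORAGE_AWS_ROLE_ARN = 'arn:aws:iam::123456789012:role/appflow-snowflake-role'
          STORAGE_ALLOWED_LOCATIONS = ('s3://my-appflow-bucket/staging/')
    """)

    # The external stage itself, referenced later when configuring the
    # Snowflake connection in AppFlow.
    cur.execute("""
        CREATE STAGE IF NOT EXISTS appflow_stage
          URL = 's3://my-appflow-bucket/staging/'
          STORAGE_INTEGRATION = appflow_s3_int
    """)

    cur.close()
    conn.close()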

As it says on the AppFlow tin, the configuration was point-and-click. Along with the source (AWS S3) and destination (Snowflake), the flow trigger (in this case, on-demand) can also be configured. In addition to the Snowflake external stage, a login with relevant permissions, a compute warehouse, and a target table are needed (see the sketch after the note below).

  • Note: Currently, once a data connection is configured in AppFlow, it can't be changed, so if an adjustment is needed, a new connection must be created.
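To make those Snowflake-side prerequisites a little more concrete, here is a hedged sketch. Every object and role name is an assumption (the role used by the AppFlow login is assumed to exist already), and the exact grants will depend on your own security model:

    import os
    import snowflake.connector

    conn = snowflake.connector.connect(
        account="my_account",
        user="my_admin_user",
        password=os.environ["SNOWFLAKE_PASSWORD"],
        role="SYSADMIN",                  # assumes SYSADMIN owns these objects
    )
    cur = conn.cursor()

    for stmt in [
        # Compute warehouse the flow will use (placeholder sizing).
        "CREATE WAREHOUSE IF NOT EXISTS appflow_wh WAREHOUSE_SIZE = 'XSMALL' AUTO_SUSPEND = 60",
        "CREATE DATABASE IF NOT EXISTS appflow_db",
        # Target table; columns are illustrative and must match the mapped fields.
        """CREATE TABLE IF NOT EXISTS appflow_db.public.customer_target (
             customer_id   NUMBER,
             customer_name VARCHAR,
             signup_date   DATE)""",
        # Grants for the role behind the AppFlow connection login.
        "GRANT USAGE ON WAREHOUSE appflow_wh TO ROLE appflow_role",
        "GRANT USAGE ON DATABASE appflow_db TO ROLE appflow_role",
        "GRANT USAGE ON SCHEMA appflow_db.public TO ROLE appflow_role",
        "GRANT INSERT, SELECT ON TABLE appflow_db.public.customer_target TO ROLE appflow_role",
        "GRANT USAGE ON STAGE appflow_db.public.appflow_stage TO ROLE appflow_role",
    ]:
        cur.execute(stmt)

    cur.close()
    conn.close()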

The next step was to map source fields to target fields, along with modifications such as complete or partial column masking. Mappings can also be uploaded via CSV. Field validations are possible, as indicated in the following screenshot:

When a value fails validation, the action taken can be either to terminate the flow or to ignore the record.

  • Note: The functionality to change mappings after the initial configuration does not appear to be 100% stable at present and can throw an error even though the configured mappings are correct.

Running the Flow

A Flow can be triggered with a mouse click in the AWS console. In my case, the Flow failed on the first attempt and, in addition, the corresponding run history wasn't the most informative:

However, I'm sure this will improve as the service matures. After much trial and error with the source files, such as changing the newline character and removing the text qualifiers, and after creating a new Flow from scratch, the Flow worked. In other words, the data was loaded successfully into the Snowflake data warehouse target table.
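For anyone hitting similar issues, that clean-up amounted to something along the lines of the following sketch: rewriting the CSVs with plain LF newlines and without text qualifiers, then putting them back on S3. The bucket, prefix, and file names are placeholders:

    import csv
    import boto3

    BUCKET = "my-appflow-source-bucket"              # placeholder bucket name
    FILES = ["part1.csv", "part2.csv", "part3.csv"]  # placeholder source files

    s3 = boto3.client("s3")

    for name in FILES:
        cleaned = f"cleaned_{name}"
        with open(name, newline="") as src, open(cleaned, "w", newline="") as dst:
            reader = csv.reader(src)        # reading strips any "..." text qualifiers
            writer = csv.writer(
                dst,
                quoting=csv.QUOTE_NONE,     # write without text qualifiers
                escapechar="\\",            # required by QUOTE_NONE if a delimiter appears in data
                lineterminator="\n",        # plain LF newlines
            )
            writer.writerows(reader)
        s3.upload_file(cleaned, BUCKET, f"input/{cleaned}")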

(This data is test data!)

Perhaps it was due to the source being S3, but the only error handling setting available was: if AppFlow can't write a record to the destination, stop the current flow run. This meant AppFlow only created a log file in S3 if the Flow run was a success.

Behind the scenes

Checking behind the scenes revealed that AppFlow converted the contents of the three source CSVs into a single JSON file. It then loaded this data into the Snowflake target table using the Snowflake JDBC driver, matching source and target columns by column name.
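A hedged way to confirm this from the Snowflake side is to list what landed on the external stage and, assuming the load went through COPY, check the target table's copy history. The connection details and object names below are the same placeholders used earlier:

    import os
    import snowflake.connector

    conn = snowflake.connector.connect(
        account="my_account",
        user="my_user",
        password=os.environ["SNOWFLAKE_PASSWORD"],
        warehouse="APPFLOW_WH",
        database="APPFLOW_DB",
        schema="PUBLIC",
    )
    cur = conn.cursor()

    # Files AppFlow staged (in my run, a single JSON file for the three CSVs).
    for row in cur.execute("LIST @appflow_stage"):
        print(row)

    # Load history for the target table over the last 24 hours.
    cur.execute("""
        SELECT file_name, status, row_count, first_error_message
        FROM TABLE(information_schema.copy_history(
            table_name => 'CUSTOMER_TARGET',
            start_time => DATEADD(hours, -24, CURRENT_TIMESTAMP())))
    """)
    for row in cur.fetchall():
        print(row)

    cur.close()
    conn.close()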

Billing

You pay for every successful Flow run. A Flow run is a call to the data source to transfer data to a destination, with optional validations and filters. You are also charged based on data volume. Usage of additional services, such as AWS KMS or reads and writes to S3, is billed separately.
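As a back-of-the-envelope illustration of that billing model only (the rates below are deliberate placeholders; check the current AWS AppFlow pricing page for actual figures):

    # Placeholder rates; substitute the published AppFlow prices for your region.
    PRICE_PER_FLOW_RUN = 0.0   # USD per successful flow run (placeholder)
    PRICE_PER_GB       = 0.0   # USD per GB processed (placeholder)

    runs_per_month = 30        # e.g. one run per day
    gb_per_run     = 0.5       # e.g. three smallish CSV files

    estimate = runs_per_month * (PRICE_PER_FLOW_RUN + gb_per_run * PRICE_PER_GB)
    print(f"Estimated AppFlow charge: ${estimate:.2f}/month "
          "(KMS and S3 usage billed separately)")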

Encryption

Data is encrypted at rest and in transit (the at-rest aspect only applies to S3). You can customize the encryption using, for example, AWS KMS (Key Management Service). AWS PrivateLink support is a nice touch, and it also integrates with Snowflake: PrivateLink is used if you wish to ensure that traffic from AppFlow to Snowflake stays off the public internet.

  • Tip: If you are considering AWS PrivateLink, check your Snowflake edition to ensure support (Business Critical and upwards); one way to check is shown in the sketch below.
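One quick way to check from the Snowflake side, again using snowflake-connector-python with placeholder connection details, is to call SYSTEM$GET_PRIVATELINK_CONFIG(), which should only return the account's PrivateLink endpoints where the feature is available:

    import os
    import snowflake.connector

    conn = snowflake.connector.connect(
        account="my_account",
        user="my_user",
        password=os.environ["SNOWFLAKE_PASSWORD"],
        role="ACCOUNTADMIN",   # the function typically needs an admin role
    )
    cur = conn.cursor()
    try:
        cur.execute("SELECT SYSTEM$GET_PRIVATELINK_CONFIG()")
        print(cur.fetchone()[0])   # JSON with the account's PrivateLink endpoints
    except snowflake.connector.errors.ProgrammingError as err:
        print(f"PrivateLink does not appear to be available for this account/edition: {err}")
    finally:
        cur.close()
        conn.close()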

Scalability

AppFlow scales up automatically, which means that you don't need to plan or provision resources. It can process up to 100 GB per data flow.

Conclusion

The configuration process is, in principle, straightforward. As mentioned, no code was required. If you already have the source and target details to hand, then configuration should take a matter of minutes. Likewise, the scale-up/down is documented as automatic, although this would need to be put through its paces.

On the other hand, I found AppFlow to be a tad temperamental. It took considerable trial and error to get a Flow to work. Also, as mentioned, re-configuring certain aspects after the initial Flow configuration was either not possible or not as robust as it could be. More informative error handling would be a big help, and AppFlow API support does not appear to be available at present.

While currently the number of source and target connectors is relatively small, I’m sure that this will grow, along with the service’s robustness. If you need to get data to a target such as Snowflake regularly, securely, and with a certain amount of validation, AppFlow is worth being aware of. For now I would recommend giving the service a little time to evolve.  

Dan Galavan has been delivering data solutions to clients for 21 years and is Snowflake SnowPro Core certified.

References: