Analyzing Clinical Trials Data using Snowflake and ThoughtSpot – Part 1

ThoughtSpot and Snowflake recently announced tight integration of their offerings. In the recently concluded Beyond 2019 (ThoughtSpot annual conference) in Dallas, ThoughtSpot showed how data stored in Snowflake can be analyzed in ThoughtSpot.

Since we have a deep partnership with both Snowflake and ThoughtSpot, we are obviously pretty excited with this. We have our own point of view on how to integrate the two products to make sure we are extracting the best from both of them. As a starter, this 2 blog series explains how to analyze publicly available clinical trial data using Snowflake and ThoughtSpot. The part1 of the blog focuses on ingesting this data (in XML format) into Snowflake and preparing it for analysis in ThoughtSpot. The part2 of the blog will focus on how to connect Snowflake to ThoughtSpot and how to analyze it.

Dataset used

For this demo, we have used Clinical Trials Data which is publically available on https://clinicaltrials.gov/ct2/search/advanced

Search term used – Heart Attack (no other filters used)

Found 1497 studies

Selected option of Downloading ZIP file for ALL studies

This creates a zip file and downloads it.

Data Ingestion

The Python code below explains various steps involved in loading the data into Snowflake and then preparing it for analyzing in ThoughtSpot. In the process, I also explain various constructs using which XML can we queried in Snowflake.

Read Snowflake Config Details and open Snowflake Connection

Just pull the credentials from the config file. This file is similar to typical properties file (key=value pair).

In [1]: from configparser import ConfigParser
config = ConfigParser()
config.read('config.properties')
# get config data
account = config.get('DatabaseSection', 'account')
user = config.get('DatabaseSection', 'user')
password = config.get('DatabaseSection', 'password')
data_path = config.get('DataSection', 'path')

Connect to Snowflake

In [58]: import snowflake.connector
from pandas import DataFrame

ctx = snowflake.connector.connect(
user=user,
password=password,
account=account
)

Some utility functions

In [30]: def execSQL(sql):
    ctx.cursor().execute(sql)
    
def execSQLAndGetDF(sql, max_rows = 10000):
    try:
        cur = ctx.cursor()
        cs = cur.execute(sql)
        rows = cs.fetchmany(max_rows)
    finally:
        cs.close()
    df = DataFrame(rows)
    df.columns = [col[0] for col in cur.description]
    return df

Setup database, schema, virtual warehouse to be used

In [60]: execSQL("USE DATABASE HealthCareSolution")
execSQL("USE SCHEMA ClinicalTrials")
execSQL("USE WAREHOUSE LoadWH")
execSQL("CREATE STAGE IF NOT EXISTS ClinicalTrialXMLStage")
SQL = "CREATE OR REPLACE FILE FORMAT ClinicalTrialXMLFileFormat 
       TYPE = 'XML' 
       COMPRESSION = 'AUTO' PRESERVE_SPACE = FALSE 
       STRIP_OUTER_ELEMENT = FALSE 
       DISABLE_SNOWFLAKE_DATA = FALSE 
       DISABLE_AUTO_CONVERT = FALSE 
       IGNORE_UTF8_ERRORS = TRUE;"
execSQL(SQL)

Push data to stage and copy into Snowflake table

In [12]: #Upload data to Stage
execSQL("PUT file://" + data_path  + 
        "* @ClinicalTrialXMLStage/heart_attack")
#create table
execSQL("CREATE TABLE IF NOT EXISTS ClinicalTrialsXML (X Variant);")
#Copy data from stage to the table
execSQL("COPY INTO ClinicalTrialsXML FROM 
        @ClinicalTrialXMLStage/heart_attack/  
        FILE_FORMAT = ClinicalTrialXMLFileFormat 
        on_error = continue;")

Basic verification using record count

In [20]: sql = "SELECT count(*) FROM ClinicalTrialsXML;"

try:
    cs = ctx.cursor().execute(sql)
    rows = cs.fetchone()
    print("Record Count = ", rows[0])
finally:
    cs.close()

Record Count =  1498

Query XML Nodes
The section below explains how to use XML constructs to find nodes like condition, minimum age, maximum age and show it

Use XMLGet to reach to a XML node.

In [40]: #Get min/max age. Note that 'condition' is an array and this query only brings in one of the conditions
sql = "SELECT 
GET(XMLGET(XMLGET(X, 'id_info', 0), 'nct_id', 0), '

	NCT_ID	CONDITION_VALUE	MIN_AGE	MAX_AGE
0	“NCT00000542”	“Diabetes Mellitus”	“55 Years”	“100 Years”
1	“NCT00001629”	“Myocardial Infarction”	“N/A”	“N/A”
2	“NCT00050765”	“Myocardial Infarction”	“18 Years”	“80 Years”