Exploring dataset

Exploring dataset

Overview

This stage is to visualize the features in the metadata. The product of this stage is a website displaying many interactive figures. The website and figures are based on Dash1 and Plotly2 respectively. Users can hover and zoom in on the figures to get more information such as the locations or the herbs in the formula. The interactive functions are disabled here due to technical limitations. Note that when there are too many items on the x-axis, some items will be hidden. The hidden items will be visible when zooming in.

Auxilary code

Run the following code before drawing figures. Here is the code for imports, data loading and preprocessing.

import json
import pandas as pd
import plotly.express as px
import plotly.subplots
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import pairwise
import plotly.graph_objects as go
from dash import Dash, dcc, html, Input, Output, callback, dash_table
from jinja2 import Template
from utils import getLocation, removeTextColumn

pd.DataFrame.iteritems = pd.DataFrame.items  # since iteritems is removed from pandas 2.0

quantity_model = pd.read_excel("../clean_quantity_model.xlsx", header=0)
herbs = list(quantity_model.columns[1:])
chinese_quantity_model = pd.read_excel("../chinese_quantity_model.xlsx", header=0)
herb_types = pd.read_excel("../herbs_name_animalplantseparation.xlsx", header=0, index_col=0, usecols=[1, 3])
formulae_locations = getLocation()
quantity_model = pd.merge(quantity_model, formulae_locations, how='left', on='Name')
herbs_frequency_df = pd.read_excel('../herbs_name_frequency.xlsx', header=0, index_col=0)
merge_result = pd.read_excel("../shanghanlun_mergeOCR_afterExpansion.xlsx", header=0)

# relocate the location columns
locations = quantity_model.pop('Location')
quantity_model.insert(1, locations.name, locations)

Here are the utility functions.

import pandas as pd
import plotly.graph_objects as go

herb_types = pd.read_excel("../herbs_name_animalplantseparation.xlsx", header=0, index_col=0, usecols=[1, 3])


def removeTextColumn(df):
    textCols = ['Name', 'Location']
    for textCol in textCols:
        if textCol in df.columns:
            df = df.drop(textCol, axis=1)
    return df

def getLocation():
    merge_result = pd.read_excel("../shanghanlun_mergeOCR_afterExpansion.xlsx", header=0, index_col=0)
    volumeName = {4971387: '仲景全書卷一', 4971388: "仲景全書卷二", 4971389: "仲景全書卷三"}
    df = dict()
    df['Name'] = merge_result.index.copy()
    df['Location'] = [f"{volumeName[volumeId]} 第{startPage}頁"
                      for (volumeId, startPage) in merge_result[['VolumeID', 'StartPage']].values.tolist()]
    df = pd.DataFrame.from_dict(df)
    return df

Frequencies of Herbs

Figure 1 shows that the herb frequency distribution is skewed. About 1/8 of the herbs are commonly used. About half of the herbs are only used once.

Figure 1: Frequencies of herbs in Shang-Han-Lun

Zooming in on the 10 most commonly used herbs, Figure 2 shows that the frequency of 甘草 is much higher than the others. Over half of the formulae use 甘草. Moreover, the frequencies of 大棗, 桂枝 and 生薑 are nearly the same, indicating that they may be used together. A similar pattern shows between 乾薑, 附子 and 人參.

Furthermore, 8 of the 10 most commonly used herbs have the property of hotness. This is because Shang-Han-Lun (傷寒論) is about treating cold. Also, 3 of the top 4 most commonly used herbs could replenish the patient’s energy. This reveals an important concept in Traditional Chinese Medicine– People’s internal energy is of the greatest importance.

Code for the herb frequency distribution
herbs_frequency_barChart = px.bar(
    herbs_frequency_df,
    x='藥材名稱',
    y='出現次數',
    title='Frequencies of herbs in Shang-Han-Lun',
    orientation='v',
    labels={'藥材名稱': 'herb name', '出現次數': 'Frequency (times)'}
)

Effect of Location

Figure 3 clearly shows the distribution of formulae in Shang-Han-Lun (傷寒論). The green and purple sectors together occupy more than half of the chart. These sectors contain the formulae treating illnesses related to 太陽, which is the first line of defence in the human body in Traditional Chinese Medicine. We now know that Zhang Zhong Jing (張仲景) had made lots of efforts in treating patients at a very early stage in the course of illness. This is impressive.

Code for the distribution chart
def getFormulaeDistribution():
    volumeName = {4971387: '仲景全書卷一', 4971388: "仲景全書卷二", 4971389: "仲景全書卷三"}

    def volumeId_to_name(id):
        return volumeName[id]

    df = pd.read_excel("../shanghanlun_mergeOCR_afterExpansion.xlsx", header=0)
    df['VolumeID'] = df['VolumeID'].apply(volumeId_to_name)
    fig = px.sunburst(
        df,
        path=['VolumeID', 'Subchapter', 'Name'],
        hover_data={'StartPage': True},
        title='Formulae Distribution in Shang-Han-Lun (傷寒論)'
    )
    fig.update_layout(height=800, width=800)
    return fig

To further study the effect of formulae’s locations, we have plotted the number of herbs used by formulae across volume (Figure 4). The 3 colors correspond to 3 volumes in Zhongjing quan shu (仲景全書). Since there are 112 formulae to show, the formulae name may be hidden at the x-axis. You can zoom in for more details.

We can see that the number of herbs used by formulae is not too large. The most herbs used is Wu-Mei-Wan-Fang (烏梅丸方), which uses 14 herbs but most of the other formulae use only 3-7 herbs. The low herb count could be related to the writing time of Shang-Han-Lun. Zhang Zhong Jing wrote the book in the late Eastern Han Dynasty when wars were everywhere. Herbs available were limited. That is why he did not use many herb species in each formula. 

Besides, Figure 2 shows that the herbCount has no direct relationship with the locations. The herb count oscillates across Shang-Han-Lun.

Code for the herbCount distribution
formulae_herbCount_chart = px.line(
    merge_result,
    x='Name',
    y='herbCount',
    title='Number of Herbs used by Formulae',
    markers=True
)
formulae_herbCount_chart.update_xaxes(title_text="")
maxCount = merge_result.loc[:, 'herbCount'].max(axis=None)
herbCount_rects = [('桂枝湯方', '四逆湯方', '#00CC96'), ('葛根湯方', '炙甘草湯方', '#636EF9'),
         ('大承氣湯方', '竹葉石膏湯方', '#EE553B')]
for start, end, color in herbCount_rects:
    formulae_herbCount_chart.add_shape(
        type='rect',
        x0=start,
        y0=0,
        x1=end,
        y1=maxCount,
        fillcolor=color,
        opacity=0.25,
    )

Apart from the herb count, we have also studied the relationship between location and dosages (Figure 5). It shows that many herbs are used together since there are similar trends. This will be further discussed in the next section. An interesting finding is that when we isolate 半夏 from the trend by clicking its legend (Figure 6), it shows that 半夏 is commonly used in volume 2 only.

Code for the dosage trends
def get_parallel_plot_on_common_herbs(numeric_table, herbNum, overviewTitle, subplotTitle):
    numeric_table = numeric_table.loc[:, (numeric_table != 0).any(axis=0)]
    cols = list(numeric_table.columns)
    for col in ['Name', 'Location']:
        if col in cols:
            cols.remove(col)
    cols = cols[:herbNum]
    df_toScan = numeric_table.loc[:, cols]
    maxDosage = df_toScan.max(axis=None)

    plots = plotly.subplots.make_subplots(
        rows=len(cols),
        cols=1,
        shared_xaxes=True,
        y_title="Weight (g)",
    )
    overview = go.Figure(layout={'title': overviewTitle})

    rects = [('桂枝湯方', '四逆湯方', '#00CC96'), ('葛根湯方', '炙甘草湯方', '#636EF9'), ('大承氣湯方', '竹葉石膏湯方', '#EE553B')]
    for index, col in enumerate(cols):
        fig = go.Scatter(
            x=numeric_table.loc[:, 'Name'],
            y=numeric_table.loc[:, col],
            mode='lines+markers+text',
            name=f'{col}'
        )
        plots.add_trace(
            fig,
            row=1 + index,
            col=1,
        )
        upper_limit = max(numeric_table.loc[:, col])
        for start, end, color in rects:
            plots.add_shape(
                type='rect',
                x0=start,
                y0=0,
                x1=end,
                y1=upper_limit,
                fillcolor=color,
                opacity=0.25,
                row=1 + index,
                col=1,
            )

        overview.add_trace(fig)

    plots.update_layout(
        height=1200,
        width=600,
        title_text=subplotTitle
    )
    for start, end, color in rects:
        overview.add_shape(
            type='rect',
            x0=start,
            y0=0,
            x1=end,
            y1=maxDosage,
            fillcolor=color,
            opacity=0.25,
        )

    return overview, plots


default_overview, default_stack_trend = get_parallel_plot_on_common_herbs(
    quantity_model,
    10,
    overviewTitle="Overview of dosage trends for the 10 most common herbs",
    subplotTitle="Stacked subplots with dosage trends"
)

Correlations between Herbs

The aforementioned “similar trend” corresponds to the concept of “herb combination” in traditional Chinese Medicine. A well-known herb combination would be {桂枝, 芍藥}. Figure 7 shows that they have a similar dosage trend. In Figure 9, their dosages’ trendline has a positive slope, indicating that they may be positively correlated. Their high correlation is visualized by dark colour in the heatmap (Figure 10). To explain this in traditional Chinese Medicine, it is said that 桂枝 and 芍藥 together can harmonize the energies of Ying and Wei in our bodies, which is crucial for staying healthy.

On the other hand, some drugs are not meant to be used together, because they may show side effects or limited treating effects. For example, {附子, 半夏} are rarely used together because this would produce toxins, according to the concept “18 opposites(十八反)” in traditional Chinese Medicine. The opposite trend in Figure 8 and the L-shaped line in Figure 9 show that these 2 herbs are mutually exclusive in Shang-Han-Lun. Figure 10 shows that they have high differences.

Code for pair plot and heatmap
def getPairPlot(dim=10):
    if dim <= 0:
        return None

    df = pd.read_excel("../clean_quantity_model.xlsx", header=0, index_col=0)
    herb_names = df.columns[:dim]
    labels = [herb_name for herb_name in herb_names]
    fig = px.scatter_matrix(
        df,
        dimensions=herb_names,
        hover_name=df.index,
        title=f'Compare and Contrast the {dim} most common herbs by dosage',
        height=800,
        labels=dict(zip(herb_names, labels))
    )
    fig.update_traces(diagonal_visible=False, showupperhalf=False)

    return fig


def get_heatmap(df,
                labels,
                precomputed=False,
                title="Differences across formulae",
                description="The Darker, The More Similar",
                x='Formula 1',
                y='Formula 2'
                ):
    if not precomputed:
        df = scale_df(df, None, MinMaxScaler())
        dist = pairwise.euclidean_distances(df)
        df = pd.DataFrame(dist, index=labels, columns=labels)

    title = title
    description = f"<br><sup>{description}</sup>"
    fig = px.imshow(
        df,
        labels=dict(x=x, y=y, color="Relative Distances"),
        x=labels,
        y=labels,
        title=title + description
    )
    if len(df.index) > 24:
        fig.update_layout(height=800)
    else:
        fig.update_layout(autosize=True)
    return fig

pair_plot = getPairPlot(dim=10)
HerbCorrHeatmap = get_heatmap(
    1 - round(quantity_model.iloc[:, 1:].corr(), 2),
    quantity_model.columns[1:],
    precomputed=True,
    title="Differences across herbs",
    x='Herb 1',
    y='Herb 2'
)

Hierarchical Relationship between Formulae

As mentioned before, Shang-Han-Lun has explicitly indicated the parent-child relationship for some formulae with the ending keyword 「於」. We have used a treemap (Figure 11) to visualize this hierarchical Relationship. Gui-Zhi-Tang-Fang (桂枝湯方) is the largest category and it is the only grandparent in the Shang-Han-Lun.

Code for treemap
def getHerbLevel():
    df = pd.read_excel("../shanghanlun_mergeOCR_afterExpansion.xlsx", header=0)
    one_hot_model = pd.read_excel("../one_hot_model.xlsx", header=0, index_col=0)

    # Aim: find the grandparent and parent of each formula
    df['parent'] = df['dependOn']
    df['grandParent'] = '無'
    for index, data in df.iterrows():
        if data['dependOn'] != '無':
            parent = data['parent']
            grandParent = df[df['Name'] == parent]
            df.at[index, 'grandParent'] = grandParent.loc[grandParent.index[0], 'parent']

    # Aim: find the herb difference between a formula and its origin
    df['added'] = '不適用'
    df['removed'] = '不適用'
    for index, data in df.iterrows():
        if data['parent'] == '無':
            continue

        diff_herbs = one_hot_model.loc[data['Name']] - one_hot_model.loc[data['parent']]
        added_herbs = diff_herbs[diff_herbs == 1].index.values
        removed_herbs = diff_herbs[diff_herbs == -1].index.values

        df.at[index, 'added'] = ",".join(added_herbs) if len(added_herbs) > 0 else "無"
        df.at[index, 'removed'] = ",".join(removed_herbs) if len(removed_herbs) > 0 else "無"

    volumeName = {4971387: '仲景全書卷一', 4971388: "仲景全書卷二", 4971389: "仲景全書卷三"}
    names = ["Formulae hierarchy"] + df['Name'].tolist()
    parents = [""] + ["Formulae hierarchy" if source == '無' else source for source in df['parent']]
    formula_locations = ["不適用"] + \
                        [
                            f"{volumeName[volumeId]} 第{startPage}頁"
                            for (volumeId, startPage) in df[['VolumeID', 'StartPage']].values.tolist()
                        ]
    added_herbs = ["不適用"] + df['added'].tolist()
    removed_herbs = ["不適用"] + df['removed'].tolist()
    custom_data = [
        f"Location: {location}<br>Parent: {parent}<br>added: {added}<br>removed: {removed}"
        for (location, parent, added, removed) in zip(formula_locations, parents, added_herbs, removed_herbs)
    ]

    fig = go.Figure(go.Treemap(
        labels=names,
        parents=parents,
        hovertemplate=
        "<b>%{label}</b><br>" +
        '%{customdata}' +
        "<extra></extra>"
        ,
        customdata=custom_data,
    ))
    fig.update_layout(margin=dict(t=50, l=20, r=20, b=20))

    return fig

treemap = getHerbLevel()

Apart from the documented parent-child relationship, we would like to discover the undocumented relationship. So, we perform k-means clustering and visualize it with UMAP dimensionality reduction (Figure 12). The clustering is based on the quantity in grams model since it gives the highest silhouette score compared to other models.

Users can use the slider to choose the expected number of clusters. The binary cluster gives the highest silhouette score compared to other choices of k. Interestingly, Wu-Mei-Wan-Fang (烏梅丸方) is the only member in cluster 1.

Possible Reasons for Wu-Mei-Wan-Fang (烏梅丸方) to be the most unique formula:

In terms of data analytics, only Wu-Mei-Wan-Fang (烏梅丸方) uses 14 herbs, which is the formula using the most number of herbs in Shang-Hang-Lun. To emphasise this reason! we could see 75% of the formulae in Shang-Han-Lun use less than 7 herbs, which is a lot less than 14. Also, in this book, only Wu-Mei-Wan-Fang (烏梅丸方) uses Wu-Mei (烏梅) and Shu-Jiao (蜀椒), which also makes the formula less similar to the other formulae.

In terms of Chinese Medicine theory, Wu-Mei-Wan-Fang (烏梅丸方) treats diseases that simultaneously show symptoms that are caused by the hotness evil and the cold evil. This would be a complicated situation to be handled. Moreover, only this formula helps to handle parasites such as the roundworms in the human body. Therefore, we could understand that Wu-Mei-Wan-Fang (烏梅丸方) is far from other formulae in Shang-Han-Lun.

The last figure would be the heatmap representing the differences across formulae. In Figure 13, the yellow line corresponds to the outlier, which is Wu-Mei-Wan-Fang (烏梅丸方).

Code for clustering and heatmap

Here is the code for clustering. Since it takes a long time to perform 10 times clustering. We choose to run it once and save it in json file. This could shorten the first response time of the website.

def kmeansVisualization():
    model_to_cluster = pd.read_excel("../clean_quantity_model.xlsx", header=0, index_col=0)

    fit = umap.UMAP()
    u = fit.fit_transform(model_to_cluster)
    u = pd.DataFrame(u, index=model_to_cluster.index, columns=['component 1', 'component 2'])

    df = pd.read_excel("../shanghanlun_mergeOCR_afterExpansion.xlsx", header=0, index_col=0)
    selected_merge_result = df.loc[model_to_cluster.index.values, :]

    volumeName = {4971387: '仲景全書卷一', 4971388: "仲景全書卷二", 4971389: "仲景全書卷三"}
    formula_location = [f"{volumeName[volumeId]} 第{startPage}頁" for (volumeId, startPage) in
                        selected_merge_result[['VolumeID', 'StartPage']].values.tolist()]

    figs = []
    cluster_tables = []
    sse = {}
    silhouette_scores = []

    # Perform k-means clustering
    for k in range(2, 11):
        kmeans = KMeans(init='random', n_clusters=k, random_state=42, n_init='auto')
        cluster_assignments = kmeans.fit_predict(model_to_cluster)
        labels = [f"Cluster_{label + 1}" for label in cluster_assignments]

        # record metrics
        sse[k] = kmeans.inertia_  # Inertia: Sum of distances of samples to their closest cluster center
        silhouette = silhouette_score(model_to_cluster, cluster_assignments, metric='euclidean')
        silhouette_scores.append(silhouette)

        # Plot the clusters
        fig = px.scatter(u, x='component 1', y='component 2', color=labels, hover_name=u.index,
                         hover_data={'component 1': False, 'component 2': False, 'location': formula_location},
                         title=f'K-Means Clustering (K={k}) - UMAP')
        figs.append(fig)

        clusters = {
            'Clusters': [f"Cluster {cluster_label + 1}" for cluster_label in range(k)],
            'Formulae': []
        }
        for cluster_label in range(k):
            cluster_indices = np.where(cluster_assignments == cluster_label)[0]
            cluster_data_indices = sorted(model_to_cluster.index[cluster_indices], key=lambda name: len(name))
            clusters['Formulae'].append(" ".join(cluster_data_indices))
        cluster_tables.append(clusters)

    def to_records(t):
        new_df = pd.DataFrame.from_dict(t)
        return new_df.to_dict('records')

    cluster_tables = list(map(to_records, cluster_tables))

    sse_chart = px.line(
        x=sse.keys(),
        y=sse.values(),
        title="Sum of squared distance for different k",
        markers=True,
        labels={
            "x": "k",
            "y": "Sum of squared distance"
        }
    )
    silhouette_chart = px.line(
        x=range(2,11),
        y=silhouette_scores,
        title="Silhouette Scores for different k",
        markers=True,
        labels={
            "x": "k",
            "y": "Silhouette Score"
        }
    )
    return figs, cluster_tables, sse_chart, silhouette_chart

def figure_to_json(figure, fileName):
    with open(fileName, "w") as f:
        plotly.io.write_json(figure, f)


if __name__ == '__main__':
    kmeans_scatter_plots, kmeans_cluster_tables, sse_lineChart, silhouette_lineChart = kmeansVisualization()

    for index, cluster in zip(range(2, 11), kmeans_scatter_plots):
        figure_to_json(cluster, f"./figures/cluster_{index}.json")
    for index, table in zip(range(2, 11), kmeans_cluster_tables):
        with open(f"./figures/cluster_table_{index}.json", "w") as out:
            json.dump(table, out)

    figure_to_json(sse_lineChart, "./figures/sse_lineChart.json")
    figure_to_json(silhouette_lineChart, "./figures/silhouette_lineChart.json")

################## Get json on website ##################
def getFormulaeClusterCharts():
    cluster_charts = []
    for k in range(2, 11):
        cluster_charts.append(plotly.io.read_json(f"./figures/cluster_{k}.json"))
    return cluster_charts


def getFormulaeClusterTables():
    cluster_tables = []
    for k in range(2, 11):
        with open(f"./figures/cluster_table_{k}.json", "r") as table_json:
            cluster_tables.append(json.load(table_json))
    return cluster_tables


kmeans_scatter_plot, kmeans_cluster_table = getFormulaeClusterCharts(), getFormulaeClusterTables()
sse_lineChart, silhouette_lineChart = \
    plotly.io.read_json("./figures/sse_lineChart.json"), \
        plotly.io.read_json("./figures/silhouette_lineChart.json")

Here is the code for the heatmap

formula_heatmap = get_heatmap(
    removeTextColumn(quantity_model),
    labels=quantity_model.loc[:, 'Name'],
)

Layout for the Website

eda = Dash(__name__)

eda.layout = html.Div([
    html.H1("Exploring the dataset"),
    html.Div(
        "Note that when there are too many formulae to show, the formula name may be hidden at the axis. You can zoom "
        "in for more details."
    ),

    dcc.Graph(
        figure=formulae_distribution,
        id='formulae_distribution'
    ),

    dcc.Graph(
        figure=herbs_frequency_barChart,
        id='herb-frequency-chart'
    ),

    dcc.Graph(
        figure=formulae_herbCount_chart,
        id='formulae_herbCount_chart'
    ),

    dcc.Graph(
        figure=pair_plot,
        id='pair_plot'
    ),
    dcc.Graph(
        figure=HerbCorrHeatmap,
        id='corrHeatmap'
    ),

    dcc.Graph(
        figure=default_overview,
        id='overview_trend'
    ),
    dcc.Graph(
        figure=default_stack_trend,
        id='dosage_trends'
    ),

    dcc.Graph(
        figure=treemap,
        id='treemap'
    ),

    html.H3("Clustering"),
    dcc.Graph(id='scatter', style={"display": "None"}),
    dcc.Slider(
        id="level-slider",
        min=2,
        max=11,
        marks={k: f"k: {k}" for k in range(2, 11)},
        value=2,
        step=None
    ),
    dash_table.DataTable(
        id='cluster_table',
        data=[],
        columns=[
            {'name': 'Clusters', 'id': 'Clusters'},
            {'name': 'Formulae', 'id': 'Formulae'},
        ],
        style_table={"display": "None"},
        style_cell={
            'textAlign': 'left',
            'height': 'auto',
            'whiteSpace': 'normal'
        }
    ),
    dcc.Graph(id='sse-chart', figure=sse_lineChart),
    dcc.Graph(id='silhouette_lineChart', figure=silhouette_lineChart),

    dcc.Graph(id='heatmap', figure=formula_heatmap)
])


# Callback to update the treemap based on the selected k
@callback(
    Output("cluster_table", "data"),
    Output("cluster_table", "style_table"),
    Input("level-slider", "value")
)
def update_cluster_table(k):
    return kmeans_cluster_table[k - 2], {"display": "inline"}


@callback(
    Output("scatter", "figure"),
    Output("scatter", "style"),
    Input("level-slider", "value")
)
def update_scatter(k):
    return kmeans_scatter_plot[k - 2], {"display": "inline"}


eda.title = 'Explore the dataset'
if __name__ == '__main__':
    eda.run_server(debug=True, port=8055)
  1. Dash’s official website, https://dash.plotly.com/ ↩︎
  2. Plotly Open Source Graphing Library for Python, https://plotly.com/python/ ↩︎