{ "cells": [ { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "%matplotlib inline" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "import seaborn as sns" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Comparison and assessment of GRN inference methods" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Load GRN inferences\n", "+ Load the rankings/scores provided by the different methods; let $M$ be the number of methods.\n", "+ Concatenate the rankings using the `pandas.concat` function.\n", "\n", "__Check the `pandas.concat` example:__" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<table>\n", "<thead>\n", 
"<tr><th></th><th>score_method1</th><th>score_method2</th><th>score_method3</th></tr>\n", 
"</thead>\n", "<tbody>\n", 
"<tr><th>a</th><td>1</td><td>30</td><td>100</td></tr>\n", 
"<tr><th>b</th><td>2</td><td>20</td><td>300</td></tr>\n", 
"<tr><th>c</th><td>3</td><td>10</td><td>200</td></tr>\n", 
"</tbody>\n", "</table>
\n", "
" ], "text/plain": [ "   score_method1  score_method2  score_method3\n", "a              1             30            100\n", "b              2             20            300\n", "c              3             10            200" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# create a dummy dataframe a\n", "a = pd.DataFrame()\n", "a[\"score_method1\"] = [1,2,3]\n", "a.index = [\"a\",\"b\",\"c\"]\n", "# create a dummy dataframe b\n", "b = pd.DataFrame()\n", "b[\"score_method2\"] = [10,20,30]\n", "b.index = [\"c\",\"b\",\"a\"]\n", "# create a dummy dataframe c\n", "c = pd.DataFrame()\n", "c[\"score_method3\"] = [100,200,300]\n", "c.index = [\"a\",\"c\",\"b\"]\n", "# join the three dummy datasets\n", "join_df = pd.concat([a,b,c], axis=1,sort=True) # axis=1 aligns the frames on their index and places them side by side as columns; axis=0 would stack them as rows\n", "join_df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "+ It can be useful to set the index of each ranking to the following format: \"TFid_TGid\"\n", "\n", "__Check the index modification example:__" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<table>\n", "<thead>\n", 
"<tr><th></th><th>TF</th><th>TG</th><th>score</th><th>rank</th></tr>\n", 
"</thead>\n", "<tbody>\n", 
"<tr><th>tfa_tga</th><td>tfa</td><td>tga</td><td>0.9</td><td>0</td></tr>\n", 
"<tr><th>tfb_tgb</th><td>tfb</td><td>tgb</td><td>0.2</td><td>1</td></tr>\n", 
"<tr><th>tfc_tgc</th><td>tfc</td><td>tgc</td><td>0.1</td><td>2</td></tr>\n", 
"</tbody>\n", "</table>
\n", "
" ], "text/plain": [ "          TF   TG  score  rank\n", "tfa_tga  tfa  tga    0.9     0\n", "tfb_tgb  tfb  tgb    0.2     1\n", "tfc_tgc  tfc  tgc    0.1     2" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Create a dummy ranking dataframe\n", "ranking = pd.DataFrame()\n", "ranking[\"TF\"] = [\"tfa\",\"tfb\",\"tfc\"]\n", "ranking[\"TG\"] = [\"tga\",\"tgb\",\"tgc\"]\n", "ranking[\"score\"] = [0.9,0.2,0.1]\n", "ranking[\"rank\"] = [0,1,2]\n", "# Change the index\n", "ranking.index = ranking[\"TF\"] + \"_\" + ranking[\"TG\"]\n", "ranking" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Load \"High Confidence\" GRN" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let the directed graph $GRN_{HC} = \\langle V_{HC}, E_{HC} \\rangle$ denote the \"high confidence\" GRN presented in [Poitier et al. 2014](https://www.sciencedirect.com/science/article/pii/S2211124714010043), where $V_{HC}$ denotes the set of nodes (genes) and $E_{HC}$ denotes the set of edges (regulatory relationships).\n", "\n", "+ Load the \"high confidence\" GRN.\n", "+ Set the index (as in the previous step).\n", "+ Select only the edges that involve highly expressed genes (those that you have used to infer your GRNs)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Compare methods" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "+ Represent the correlation between the different methods using a heatmap " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "+ Generate a [dendrogram](https://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.cluster.hierarchy.dendrogram.html) or a [clustermap](https://seaborn.pydata.org/generated/seaborn.clustermap.html) using the ranking or the score vectors " ] }, { "cell_type": "code", "execution_count": 
null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "+ Compare the top $k$ links of each method using the [Jaccard similarity score](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.jaccard_similarity_score.html) and represent the results in a matrix" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "+ For each method, keep only the edges that are reported in the \"high confidence\" GRN.\n", "+ Apply a [PCA](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html) to the rankings of the methods for the chosen links, keep 3 dimensions, and plot the location of each method in this new space (you go from $M$ points in a $|E_{HC}|$-dimensional space to $M$ points in a 3D space).\n", "\n", "__Check the following example:__" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Dummy ranking:\n", "       method0  method1  method2  method3  method4  method5  method6  method7  \\\n", "edge0      327      924      545       53      740       27      755      108  \n", "edge1      247      908      773      376      189       69      191      536  \n", "edge2      185      115      305      833      433      500      747      997  \n", "edge3      504      206      287      403      203      862      221      538  \n", "edge4      323      449      549      438      388      573       85      539  \n", "\n", "       method8  method9  \n", "edge0      213      936  \n", "edge1      138      166  \n", "edge2      686      496  \n", "edge3      511      306  \n", "edge4      573      441  \n", "\n", "Explained variance ratio for each PC\n", "[0.16138035 0.1454675 ]\n" ] }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "# create a dummy joint ranking of 10 methods and 100 edges\n", "rankings = pd.DataFrame(np.random.randint(0,1000,size=(100,10)))\n", "rankings.columns = [\"method\"+str(i) for i in rankings.columns]\n", "rankings.index = [\"edge\"+str(i) for i in rankings.index]\n", "print(\"Dummy ranking:\")\n", "print(rankings.head())\n", "# transpose the ranking matrix so that each method is one sample (row)\n", "rankingsT = rankings.T\n", "# import PCA\n", "from sklearn.decomposition import PCA\n", "# create PCA object with k principal axes\n", "k = 2\n", "pca = PCA(n_components=k)\n", "# Apply the PCA\n", "rankingsT_pca = pca.fit_transform(rankingsT)\n", "# Explained variance ratio for each principal axis\n", "print(\"\\nExplained variance ratio for each PC\")\n", "print(pca.explained_variance_ratio_)\n", "# plot the coordinates of each method along the first two principal axes\n", "plt.plot(rankingsT_pca[:,0],rankingsT_pca[:,1],\"o\")\n", "plt.xlabel(\"$PC_0$\")\n", "plt.ylabel(\"$PC_1$\")\n", "for i,method in enumerate(rankingsT.index):\n", "    plt.text(x=rankingsT_pca[i,0], y=rankingsT_pca[i,1], s=method)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Answer the following questions:\n", "+ How different are the methods from one another?\n", "+ Which methods are most similar to each other?" 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Assess the quality of the inferred GRNs\n", "In this section we are going to use the following quality measures to evaluate the inferred GRNs with respect to the \"High Confidence\" GRN:\n", "\n", "+ [Precision](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_score.html)\n", "+ [Recall](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.recall_score.html)\n", "+ [Accuracy](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html)\n", "\n", "These measures are defined using the __confusion matrix__:\n", "$$ConfMat=\\begin{pmatrix}& T_{real} & F_{real}\\\\T_{pred}& TP & FP \\\\F_{pred} & FN & TN \\end{pmatrix}$$\n", "\n", "__Where__:\n", "+ $TP$: True positive (correctly predicted as True)\n", "+ $FP$: False positive (wrongly predicted as True)\n", "+ $FN$: False negative (wrongly predicted as False)\n", "+ $TN$: True negative (correctly predicted as False)\n", "\n", "__Metrics__:\n", "+ $Recall = \\frac{TP}{TP + FN}$\n", "+ $Precision = \\frac{TP}{TP + FP}$\n", "+ $Accuracy = \\frac{TP+TN}{TP+TN+FP+FN}$\n", "\n", "__Bonus__:\n", "You can also use the following evaluation criteria:\n", "+ [ROC curve](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_curve.html)\n", "+ [ROC AUC score](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_auc_score.html#sklearn.metrics.roc_auc_score)\n", "+ [Average Precision Score](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.average_precision_score.html#sklearn.metrics.average_precision_score)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "+ Select the top $k = |E_{HC}|$ edges for each method\n", "+ Evaluate the different results with respect to 
the \"High Confidence\" GRN using the previous measures\n", "+ Which methods perform best on this dataset? Explain." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Build a meta learner\n", "In [Marbach et al. 2012](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3512113/), the authors suggest that \"no single inference method performs optimally across all datasets. In contrast, integration of predictions from multiple inference methods shows robust and high performance across diverse datasets\".\n", "\n", "+ Compute a new GRN by averaging the ranks of the different methods. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "+ Compute the similarity of this new GRN w.r.t. the previous ones" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "+ Compute the quality of this new GRN" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "+ Let $X^{rank}$ be a matrix with $E$ rows (one for each edge) and $M$ columns (one for each method). The value $X^{rank}_{i,j}$ is the rank that method $j$ has attributed to edge $i$. \n", "+ Let $y$ be a binary label vector with $E$ rows (one for each edge), s.t. 
$y_e=1$ if $e \\in E_{HC}$ and $y_e=0$ otherwise\n", "+ Train a Random Forest classifier to predict $y$ from the values of $X^{rank}$\n", "+ Extract the feature importances from this classifier to infer which methods are the best\n", "\n", "__Check the following example:__" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Dummy ranking:\n", "       method0  method1  method2  method3  method4  method5  method6  method7  \\\n", "edge0        7      320       25      672      333      413      487      657  \n", "edge1       86      444      726      745      502      224       11      610  \n", "edge2      569      849      788      281      921      207      329      700  \n", "edge3      190      531      193      827      744       45      160      813  \n", "edge4      712      344      347      893      817      409      785      823  \n", "\n", "       method8  method9  \n", "edge0      953      692  \n", "edge1      847      820  \n", "edge2       57      614  \n", "edge3      648      480  \n", "edge4      899      420  \n", "\n", "Dummy labels:\n", "       labels\n", "edge0       1\n", "edge1       1\n", "edge2       0\n", "edge3       0\n", "edge4       1\n" ] }, { "data": { "text/plain": [ "array([0.08026446, 0.10601871, 0.08706498, 0.09726857, 0.1003492 ,\n", "       0.09425878, 0.15110857, 0.0748779 , 0.08805547, 0.12073336])" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# create a dummy joint ranking of 10 methods and 100 edges\n", "rankings = pd.DataFrame(np.random.randint(0,1000,size=(100,10)))\n", "rankings.columns = [\"method\"+str(i) for i in rankings.columns]\n", "rankings.index = [\"edge\"+str(i) for i in rankings.index]\n", "print(\"Dummy ranking:\")\n", "print(rankings.head())\n", "# create dummy $y$ vector (binary labels)\n", "y = pd.DataFrame()\n", "y[\"labels\"] = np.random.binomial(p=0.5,n=1,size=100)\n", "y.index = rankings.index\n", "print(\"\\nDummy labels:\")\n", "print(y.head())\n", "# Train the RF classifier\n", "from sklearn.ensemble import RandomForestClassifier\n", "rf = RandomForestClassifier(n_estimators=100)\n", "rf.fit(rankings,y[\"labels\"])\n", "# one importance value per method (column), in column order\n", "rf.feature_importances_" ] }, { "cell_type": "markdown", "metadata": {}, 
"source": [ "__Question__: Since this is a supervised learning method, how could you evaluate its performance?\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.8" } }, "nbformat": 4, "nbformat_minor": 2 }