Can You Use ChatGPT to Analyze Data from TCGA?

can chatgpt analyze tcga gene expression data

I’ve heard a lot about Generative AI (GenAI) doing wonders for data analysis; however, I’ve been skeptical. That is because some datasets are inherently simpler than others to handle, their structure is very straightforward, and the complexity of the questions is low.

For example, asking:

“How many books did Stephen King write?”

Is not the same as asking:

“Subset patients in three groups based on the expression of XYZ, then compare the gene expression of the high vs. low groups, tell me if you notice an enrichment in effector memory cells, run pathway/gene set enrichment analysis, and prepare nice plots that I can show my boss/students/colleague/whatever”.

On the other hand, I believe that GenAI tools can be wonderful to help interpret the results. Even accounting for the ever present mantra “AI can make mistakes”, GenAI tools can access corners of knowledge that we may not know existed. And with that, they can show us insights that we may not have otherwise considered.

Here I am guilty of downplaying the ability of some GenAI tools to provide the correct code, or run it for me on the fly. Yet, as a good scientist, I decided to put this to the test.

I downloaded gene expression data from TCGA, and then passed it to ChatGPT 4o to see what it can do. Here’s my step-by-step guide on what I did, how I did it, and the results.

Note, the analyses in this post were done back in November 2024, before the OpenAI updates for GPT 4.5 and 4o image generation.

Disclaimer: needless to say that this is for educational and entertainment purposes only; I am doing this to fulfill my interest in AI and genomics. None of the questions/data/prompts/code/results should ever be used for medical, diagnostic, prognostic, therapeutic, or patient-related purposes.

What do you need to recreate the results using ChatGPT?

You can rerun all this by yourself, and you will need very little, and this includes:

Access to ChatGPT 4o (I will use my subscription)
Access to cBioportal (free)
The ability to upload local files to ChatGPT (some workstations are blocked by the employer’s policy, so make sure to use your personal computer if that happens)
Access to Perplexity.ai (free)

You do NOT need to know how to code.

You do NOT have to know statistics…well, this sounds like blasphemy…let me rephrase it: you should know/I hope you know basic statistical terminologies, even if you don’t know how to code.

Please note: in this test I will only use the web interface of ChatGPT.

Why? Because this test is designed with wet lab researchers in mind.

Wet lab researchers do not know how to code, and they will not know how to use API calls either.

I want for anyone to be able to recreate this test independently of any coding background, and without setting up and paying for API calls.

Furthermore, it would make no sense to get R or python code from ChatGPT, and then have a wet-lab scientist run it without knowing what’s going on.

Where to find and download TCGA gene expression data

One of the simplest ways possible to download TCGA gene expression data is to use cBioportal. You can select a study > go to the pie chart on the far right > select the Download symbol next to the study title, and voilà: you can save a zipped file with all data we need.

For this test, I downloaded the AML data (Acute Myeloid Leukemia) data because it has only 200 patients, I have not used in the past, and I don’t work with AML data, thus making this a true unbiased experiment.

See the video below for a step-by-step guide on how to download the TCGA gene expression data from cBioportal.

Step by step on how to download data from cBioportal

What file contains the TCGA gene expression data?

The content of the zipped folder you downloaded will look something like this:

And we will pick the data_mrna_seq_v2_rsem.txt file.

Please note that the rsem v2 in cBioportal corresponds to the rsem.genes.normalized_results from TCGA. This dataset is normalized already (RSEM), so it won’t need any further normalization beyond a log2 transformation when calculating fold changes.

Are there other ways to download TCGA gene expression data?

Yes, there are about a million different ways you can download TCGA gene expression data. Here’s a few, if you prefer any of them over the way I showed you, feel free to go ahead and use that method:

Firehose website (https://gdac.broadinstitute.org/)
UCSC Xena browser (https://xenabrowser.net/)
Various R packages
Python packages (e.g. pytcga)
API calls (although that would still run through the above points)

ChatGPT data analysis: can ChatGPT handle TCGA gene expression data?

The short answer is no, at least not as text file “as-is” and in the web interface.

It looks like the text file as-is cannot go through, maybe due to its size…so some alternative approach is needed.

The file contains 20,532 rows by 175 columns for a size of 30.1Mb. By itself is not huge, but to upload it, we can zip it. Once zipped, the file uploads with no problem.

Note: it may be that for some of you, the file uploads as a txt file with no problem. That’s totally fine. You can use the text file instead of the zipped file.

Structure of the TCGA gene expression data file

structure of tcga gene expression data

The gene expression file is a simple table, with gene names, Entrez gene IDs, and expression values, as RSEM, in each column. Nothing fancy here, it is pretty standard. Yet, we want to ensure that ChatGPT understands what each column means. It should be able to automatically recognize the different elements of the table, and we will be able to see that while it runs the analysis.

Differential gene expression on cBioportal as QC

cBioportal allows you to run differential gene expression analysis directly on the portal, even if not everyone knows that it’s possible.

This is a simple and quick way to keep an eye on the top-line results generated by ChatGPT, and make sure that they match-ish what we expect to see.

Here I will subset patients manually, comparing the top 32% and bottom 32% of patients based on CDK6 gene expression. This will correspond to patients that have high vs. low expression of CDK6. If you want to run it on your own, from the main query page in cBioportal, the input values I used are the following:

mRNA expression z-scores relative to all samples (log RNA Seq V2 RSEM):CDK6: EXP > 0.55CDK6: EXP < -0.39

Then go to the “Comparison/Survival” tab, select the two groups based on gene expression, select mRNA, and from Data Set use the RNA V2 RSEM data.

The table with the results includes genes differentially expressed between patients with high vs. low CDK6 expression. The p.value is calculated using a Student’s t.test and the q-value comes from FDR correction.

The table with the results looks as follows:

*Example of the differential gene expression obtained from cBioportal*

How to Prompt ChatGPT for Analyzing TCGA gene expression data

It’s prompting time….with the file containing gene expression data successfully zipped and uploaded, we can now test some prompting.

First, there are a few key elements that make a good prompt:

Be clear and specific
Longer prompts are better understood if written in a markdown-like style
Priming it with something like “you are a bioinformatician/data analyst/wizard of Oz and you need to help me analyze gene expression data from TCGA” can help adding specificity to the response

For fun and simplicity, I will first use unstructured text. Then, depending on the output, I will edit the prompt to be more and more specific, guiding ChatGPT to what has to be done.

Prompt iteration number 1

Let’s first try to use the example I wrote above in the introduction, and let’s see how ChatGPT handles it. For now, we will use the CDK6 as a gene for subsetting patients, and everything else will stay as-is.

Moreover, for this first iteration, I will purposely not pass much information to ChatGPT to see if it can recognize where the genes and patient IDs columns are in the data file.

Here’s the prompt:

Using the file I uploaded, subset patients in three groups based on the expression of CDK6, then compare the gene expression of the high vs. low groups, tell me if you notice an enrichment in effector memory cells, run pathway/gene set enrichment analysis, and prepare nice plots that I can show my boss/students/colleague/whatever.

You can follow this in the video:

I have to admit it: ChatGPT went further than I expected before getting stuck for a missing python library.

Here are a couple of pros and cons that I see from the get-go.

Pros:

It detected genes in rows and patients in columns
It correctly detected the second column as Entrez Gene ID and did not treat it as a gene expression value
It divided the dataset into the three groups as requested
It calculated a fold change value to somewhat find genes differentially expressed between the groups.

Cons:

It did not run any statistical test to see if a gene was significantly differentially expressed
When calculating a fold change, it adds 1e-6 to the denominator. While I agree with the point of using a small constant to avoid divisions by zero, adding 1e-6 will favor small numbers and over-inflate the fold changes. I rather be conservative and add +1 to both numerator and denominator
Less of a con but more of a pet peeve: it ran everything in python. Contrary to Claude Sonnet, ChatGPT can run python code but cannot run R code, so if I had to pass the code to my genomic person in the lab, I’d need to get something in R most likely…sorry python lovers!

Here’s the python code that ChatGPT generated in this first iteration.

import pandas as pd
import zipfile

# Unzip and read the uploaded file
zip_path = '/mnt/data/data_mrna_seq_v2_rsem.txt.zip'
extracted_file_path = '/mnt/data/data_mrna_seq_v2_rsem.txt'

with zipfile.ZipFile(zip_path, 'r') as zip_ref:
    zip_ref.extractall('/mnt/data/')

# Load the dataset to check its structure
data = pd.read_csv(extracted_file_path, sep="\t")

# Display the first few rows and columns of the dataset to understand its structure
data.head(), data.columns

# Step 1: Identify CDK6 expression and subset patients into groups
cdk6_expression = data[data['Hugo_Symbol'] == 'CDK6'].iloc[:, 2:].T  # Extract expression for all patients
cdk6_expression.columns = ['CDK6']
cdk6_expression['Group'] = pd.qcut(cdk6_expression['CDK6'], 3, labels=["Low", "Medium", "High"])

# Step 2: Subset data into High and Low groups
high_group = cdk6_expression[cdk6_expression['Group'] == 'High'].index
low_group = cdk6_expression[cdk6_expression['Group'] == 'Low'].index

# Extract gene expression for High and Low groups
high_expression = data[['Hugo_Symbol'] + list(high_group)].set_index('Hugo_Symbol')
low_expression = data[['Hugo_Symbol'] + list(low_group)].set_index('Hugo_Symbol')

# Step 3: Perform differential expression analysis (basic fold-change as a starting point)
mean_high = high_expression.mean(axis=1)
mean_low = low_expression.mean(axis=1)
fold_change = mean_high / (mean_low + 1e-6)  # Avoid division by zero
diff_expr = pd.DataFrame({'Gene': mean_high.index, 'High': mean_high.values,
                          'Low': mean_low.values, 'FoldChange': fold_change.values})

# Filter for significantly changed genes (arbitrary threshold for now)
significant_genes = diff_expr[diff_expr['FoldChange'] > 2]  # Fold change > 2 for up-regulated in High

# Display initial results for significant genes
import ace_tools as tools; tools.display_dataframe_to_user(name="Differential Gene Expression Results (High vs. Low)", dataframe=significant_genes)

# Step 4 and 5: Pathway enrichment and visualization
# These require additional libraries or external databases for pathway analysis.
# I will implement pathway enrichment next.

And here’s the code that ChatGPT generated to create plots:

import gseapy as gp
import matplotlib.pyplot as plt

# Step 4: Prepare data for gene set enrichment analysis (GSEA)
# Use significant genes to check for pathway enrichment
gene_list = significant_genes['Gene'].tolist()

# Perform GSEA using KEGG pathways as an example
pre_res = gp.prerank(
    rnk=diff_expr.set_index('Gene')['FoldChange'],
    gene_sets='KEGG_2016',
    outdir=None,  # No output directory, in-memory analysis
    min_size=15,
    max_size=500,
    permutation_num=100,  # Reduce permutations for speed
    no_plot=True,
)

# Step 5: Visualize GSEA results
# Extract top pathways for visualization
top_pathways = pre_res.res2d.sort_values(by="NES", ascending=False).head(5)

# Plot bar chart for top pathways
plt.figure(figsize=(8, 6))
plt.barh(top_pathways.index, top_pathways['NES'], color='blue', alpha=0.7)
plt.xlabel('Normalized Enrichment Score (NES)')
plt.ylabel('Pathways')
plt.title('Top Pathways Enriched in High vs. Low CDK6 Expression')
plt.tight_layout()

# Save plot to visualize
plt_path = "/mnt/data/top_pathways_enrichment.png"
plt.savefig(plt_path)

tools.display_dataframe_to_user(name="Top Pathway Enrichment Results", dataframe=top_pathways)
plt_path

But unluckily it returns this error because it cannot find in the environment the correct python module to run GSEA analysis:

————————————————————————— ModuleNotFoundError Traceback (most recent call last) Cell In[3], line 1 —-> 1 import gseapy as gp 2 import matplotlib.pyplot as plt 4 # Step 4: Prepare data for gene set enrichment analysis (GSEA) 5 # Use significant genes to check for pathway enrichment ModuleNotFoundError: No module named ‘gseapy’

As a way to overcome this, we can ask ChatGPT not to use that python library. In this way it will prevent it from trying to load it, and therefore, prevent this error.

Prompt iteration number 2

We know that the first iteration needed a few improvements, so I tried to improve the query by giving clearer directions, which include:

Run a statistical test to evaluate whether two genes are significantly differentially expressed.
Use a conservative +1 when calculating the Log2FC (as personal choice).
I changed the Bonferroni adjustment to FDR to match what cBioportal does.
I specified not to use the gseapy module.
We can ask to find a potential enrichment in main pathways and come up with a method to do it if no python modules are available. I specified to account for the magnitude AND direction of the changes.
I added TGF-beta and removed cell cycle, mostly because MYC can promote cancer cell proliferation, so asking for cell cycle would be redundant.
I specified that I want both tables and graphs, and to sort the tables by p.value/adjusted p.value.

The new prompt now looks as follows:

Using the file I uploaded do the following:
> subset patients in three groups based on the expression of CDK6,
> compare the gene expression of the high vs. low groups.
> Present the fold changes as Log2 fold changes and add 1 to numerator and denominator to avoid dividing by zero
> You MUST run a statistical test to identify genes that are significantly differentially expressed, and adjust the raw p.value using FDR correction,
> run pathway/gene set enrichment analysis, and tell me if you notice an enrichment in effector memory cells. Do not use the module gseapy as it is not available. Design a test similar to GSEA (e.g. ranked method) to calculate enrichment in effector memory cells, cell death, TGFbeta, IFNG pathway, and MYC-related pathways. Make sure it accounts for the magnitude and direction of changes in gene expression by weighting the degree of up/down regulation of each gene in the pathway.
> prepare and print summary tables with the top 20 results per each analysys, sorting the tables by adjusted p.value and p.value, lowest to highest.
> print in the chat pretty summary plots for gene expression changes and pathway enrichment that I can show my boss/students/colleague/whatever.

Lastly, before running the query, I started a new session. Why? Well, in previous iterations of this test, while I was writing this article, ChatGPT kept encountering problems.

What kind of problems? It did not follow directions, instead of gene names it reported Gene1/Gene2…, it wasn’t able to print tables or plots in the chat window, it didn’t recall the file anymore, and it was mixing up old directions with new directions.

Starting fresh from scratch has the advantage that it cannot mix up old and new directions. However, it cannot learn from what was an older set of instructions and cannot keep memory of older data it generated. And I am ok with that.

Moving on. This version of the prompt worked much better. It still had a couple of issues that I will describe, but it was able to address most of my requests correctly. Moreover, the results from differential gene expression analysis were quite similar to what I obtained in cBioportal.

Let’s check it out.

ChatGPT Differential Gene Expression analysis

ChatGPT generated this code:

import numpy as np
from scipy.stats import ttest_ind
from statsmodels.stats.multitest import multipletests

# Extract CDK6 expression data
cdk6_expression = data.loc[data['Hugo_Symbol'] == 'CDK6'].iloc[:, 2:]

# Classify patients into high, medium, and low expression groups based on CDK6 expression
expression_quantiles = cdk6_expression.quantile([0.33, 0.66], axis=1)
low_group = cdk6_expression.columns[cdk6_expression.iloc[0] <= expression_quantiles.iloc[0]]
high_group = cdk6_expression.columns[cdk6_expression.iloc[0] >= expression_quantiles.iloc[1]]
medium_group = cdk6_expression.columns[(cdk6_expression.iloc[0] > expression_quantiles.iloc[0]) & (cdk6_expression.iloc[0] < expression_quantiles.iloc[1])]

# Subset data for high and low groups
high_expression_data = data[high_group]
low_expression_data = data[low_group]

# Calculate log2 fold changes and perform statistical tests
log2_fold_changes = np.log2(high_expression_data.mean(axis=1) + 1) - np.log2(low_expression_data.mean(axis=1) + 1)
p_values = [ttest_ind(high_expression_data.iloc[i], low_expression_data.iloc[i], nan_policy='omit').pvalue
            for i in range(len(data))]

# Adjust p-values using FDR correction
_, adj_p_values, _, _ = multipletests(p_values, method='fdr_bh')

# Create a result dataframe with log2 fold changes and adjusted p-values
results = pd.DataFrame({
    'Gene': data['Hugo_Symbol'],
    'Log2_Fold_Change': log2_fold_changes,
    'P_Value': p_values,
    'Adjusted_P_Value': adj_p_values
}).sort_values(by=['Adjusted_P_Value', 'P_Value'])

# Filter top 20 significant genes
top_20_results = results.head(20)

# Display the top 20 results to the user
import ace_tools as tools; tools.display_dataframe_to_user(name="Top 20 Differentially Expressed Genes", dataframe=top_20_results)

top_20_results

And you can see that:

It correctly calculated the Log2FC as requested
It ran a t.test and reported a p.value
It tried to calculate the FDR, but it didn’t work
It sorted the results as requested

And the table looks as follows:

Gene	Log2_Fold_Change	P_Value	Adj_P_Value
CDK6	2.0731014323995627	2.02069583294685e-33
ZBTB7B	-1.8898232	1.5517164814828007e-23
NADK	-1.3187898	5.222453645989154e-23
DNAJC5	-0.8762809	8.258543830859833e-23
SIN3A	0.44209799880913714	1.7774779444226764e-22
MYO1F	-1.3129109	2.935519236887216e-22
MED21	1.0145126192190492	3.2793779960943836e-22
FGR	-3.2286553	1.32928330826454e-21
KIAA1147	1.348467711631173	1.5226135767225875e-21
BCKDHB	0.9431785388374241	3.545541190609549e-21
PRA	-2.2749654	4.655113647151368e-21
SMARCC1	0.7079032155847802	7.188732380047411e-21
CTSZ	-1.836924	7.63716261468474e-21
NAGA	-1.5047209	1.2318010227832022e-20
PRKCD	-1.5031394	1.269027820698226e-20
TFEB	-2.0741527	3.1272921888375734e-20
GPR132	-1.4971557	6.557733953893896e-20
CAP1	-0.8361107	9.395187559446837e-20
ZNF385A	-2.2389463	9.540521179657127e-20
UBFD1	0.7064396876575589	1.2569829329963535e-19

Interestingly, if we take these genes and compare them with what cBioportal gave us, the results are very similar:

Gene	ChatGPT Log2FC	cBioportal Log2FC
CDK6	2.0731014323995627	2.16
ZBTB7B	-1.889823178	-1.97
NADK	-1.318789752	-1.31
DNAJC5	-0.876280922	-0.87
SIN3A	0.44209799880913714	0.44
MYO1F	-1.31291086	-1.38
MED21	1.0145126192190492	1.02
FGR	-3.228655282	-3.77
KIAA1147/DENND11	1.348467711631173	1.39
BCKDHB	0.9431785388374241	1.01
PRA/S100A6	-2.274965406	-2.33
SMARCC1	0.7079032155847802	0.76
CTSZ	-1.836923991	-1.71
NAGA	-1.504720938	-1.41
PRKCD	-1.503139428	-1.5
TFEB	-2.074152701	-1.58
GPR132	-1.497155731	-1.37
CAP1	-0.836110697	-0.77
ZNF385A	-2.238946304	-2.09
UBFD1	0.7064396876575589	0.73

I decided to plot them side by side to better visualize how close they actually are.

The Pearson correlation coefficient of the two arrays (ChatGPT vs. cBioportal) is 0.99, and as illustrated by the plot below, the results are pretty darn close.

The blue and orange bars indicate the Log2FC data from ChatGPT4o and cBioportal, respectively.

And yes, I prepped the plot below using Excel, go ahead and hate me for that 🙂 .

ChatGPT Pathway analysis

The pathway analysis mostly followed the directions I gave it, although it was not perfect, it did seem to capture the right gist of it.

Here’s the code generated by ChatGPT to run the pathway analysis with TCGA data.

# Simulating a pathway analysis approach for enrichment testing
# Define pathways of interest with representative gene sets (example genes for demonstration)
pathways = {
    'Effector_Memory_Cells': ['CDK6', 'ZBTB7B', 'DNAJC5', 'FGR'],
    'Cell_Death': ['NADK', 'SIN3A', 'PRKCD'],
    'TGFbeta': ['SMARCC1', 'TFEB', 'ZNF385A'],
    'IFNG_Pathway': ['MYO1F', 'GPR132', 'CTSZ'],
    'MYC_Related': ['MED21', 'PRA', 'BCKDHB']
}

# Weighted scoring based on the direction and magnitude of gene expression changes
def pathway_enrichment(pathway_genes, results):
    ranked_genes = results.set_index('Gene').sort_values('Log2_Fold_Change', ascending=False)
    ranks = ranked_genes.loc[pathway_genes, 'Log2_Fold_Change'].dropna()
    enrichment_score = ranks.sum() / len(pathway_genes)
    return enrichment_score

# Calculate enrichment scores for each pathway
pathway_scores = {pathway: pathway_enrichment(genes, results) for pathway, genes in pathways.items()}

# Create a dataframe for pathway scores
pathway_results = pd.DataFrame({
    'Pathway': pathway_scores.keys(),
    'Enrichment_Score': pathway_scores.values()
}).sort_values(by='Enrichment_Score', ascending=False)

# Display the top pathway enrichment results
tools.display_dataframe_to_user(name="Pathway Enrichment Results", dataframe=pathway_results)

# Generating a summary plot for pathway enrichment
import matplotlib.pyplot as plt

plt.figure(figsize=(10, 6))
plt.barh(pathway_results['Pathway'], pathway_results['Enrichment_Score'], color='skyblue')
plt.xlabel('Enrichment Score')
plt.title('Pathway Enrichment Analysis')
plt.gca().invert_yaxis()
plt.tight_layout()
plt.show()

The custom function for calculating a pathway enrichment score appears to be alright.

# Weighted scoring based on the direction and magnitude of gene expression changes
def pathway_enrichment(pathway_genes, results):
    ranked_genes = results.set_index('Gene').sort_values('Log2_Fold_Change', ascending=False)
    ranks = ranked_genes.loc[pathway_genes, 'Log2_Fold_Change'].dropna()
    enrichment_score = ranks.sum() / len(pathway_genes)
    return enrichment_score

Let’s break it down:

It sorts the results from differential gene expression analysis by Log2FC, high to low, setting the Gene column as index
It removes NAs (there shouldn’t be any)
It calculates the enrichment score by summing the Log2FC values of the genes in a pathway and dividing by the total number of genes in the same pathway. Therefore, a positive score may suggest pathway upregulation, while a negative score may indicate pathway downregulation.

And here are the results in table format with an accompanying bar chart:

Pathway	Enrichment_Score
MYC_Related	-0.105758083
Cell_Death	-0.79327706
Effector_Memory_Cells	-0.980414488
TGFbeta	-1.20173193
IFNG_Pathway	-1.548996861

Now, there are a couple of weaknesses that I can easily think about:

The genes used to calculate the enrichment are bare bone. For example, in the TGFbeta pathway it didn’t even include TGFBR1, aka the TGF beta receptor.
There are no calculations for a p.value
And mostly….Do these results make sense biologically?

Let’s address the last point first: Do these results make sense if we think about the biology of AML and CDK6? If we are in the right ballpark in terms of the biology we capture, then it makes sense to address all other points.

I don’t work with AML, and that’s one of the main reasons I picked this dataset. I wanted to analyze the data without knowing anything about the biology of the disease. Yet, when it comes to interpreting results, knowing the biology behind AML can mean the difference between the right and the wrong results.

ChatGPT Results and Biological Insights

Since this post is about using GenAI to analyze TCGA data, I thought it could be useful and faster (and hopefully accurate) to use GenAI to interpret the results. I want to see if the results that I have make sense in the context of AML and CDK6 expression.

To accomplish this, I used Perplexity.ai because:

It is not ChatGPT
It provides live references and links to the claims it makes about anything
I can use it for free (no need to have a Pro subscription for what we do)

I created a detailed-ish prompt that clearly tells Perplexity.ai what I need, and why I need it. The prompt is below, and you can watch it work in the video here:

I need your help to understand some data in the context of the biology of the AML disease (aka, acute myeloid leukemia).

I have the following results:
MYC_Related -0.105758083
Cell_Death -0.79327706
Effector_Memory_Cells -0.980414488
TGFbeta -1.20173193
IFNG_Pathway -1.548996861

these are pathway names with enrichment scores calculated using differential gene expression analysis. a negative value indicates that genes in a pathway are mostly downregulated, and vice versa, positive scores indicate that genes in a pathway are mostly upregulated.

the scores were calculated using the results of differential gene expression analysis between patients with high CDK6 expression vs low CDK6 expression.
Therefore, positive values indicate higher expression in patients with high CDK6
while negative values indicate lower expression in patients with high CDK6.

I need you to tell me if these results make sense based on the biology of AML and the role of CDK6 in AML

Perplexity.ai mostly agrees with the results that I obtained, however, it is not fully sold on MYC downregulation. CDK6 and MYC appear to concomitantly promote cell proliferation in AML, so a downregulation of MYC pathway in CDK6 high patients may not fully fit what we expect. However, the actual score is low (-0.1), and it may not be significant.

The full response from Perplexity.ai, complete with references, is here:

To analyze these results in the context of AML biology and CDK6’s role, let’s examine each pathway:

## MYC_Related (-0.105758083)

The slight downregulation of MYC-related genes in high CDK6 patients is somewhat unexpected. CDK6 and MYC are often considered cooperating oncogenes in AML[1]. However, the small magnitude of this score suggests the difference may not be biologically significant.

## Cell_Death (-0.79327706)

The downregulation of cell death pathways in high CDK6 patients aligns with CDK6’s role in AML. CDK6 promotes cell cycle progression and inhibits apoptosis, contributing to the survival and proliferation of leukemic cells[5]. This result supports CDK6’s function in enhancing AML cell survival.

## Effector_Memory_Cells (-0.980414488)

The downregulation of effector memory cell-related genes in high CDK6 patients may reflect the less differentiated state of AML cells. CDK6 is known to block myeloid differentiation in AML[2], which could explain the reduced expression of genes associated with more mature immune cells.

## TGFbeta (-1.20173193)

The downregulation of TGF-beta pathway genes in high CDK6 patients is consistent with AML biology. TGF-beta typically acts as a tumor suppressor, and its downregulation can promote leukemogenesis[1]. This result suggests that high CDK6 expression may be associated with reduced TGF-beta signaling, potentially contributing to AML progression.

## IFNG_Pathway (-1.548996861)

The significant downregulation of the interferon-gamma (IFN-γ) pathway in high CDK6 patients is noteworthy. IFN-γ has anti-tumor effects and can promote differentiation of myeloid cells[1]. The reduced activity of this pathway in high CDK6 patients aligns with CDK6’s role in maintaining the undifferentiated state of AML cells and promoting their survival.

## Conclusion

Overall, these results largely align with the known biology of AML and CDK6’s role in the disease. The downregulation of cell death, effector memory cell, TGF-beta, and IFN-γ pathways in high CDK6 patients is consistent with CDK6’s functions in promoting cell survival, blocking differentiation, and maintaining the leukemic state[5][2].

The slight downregulation of MYC-related genes is the only result that doesn’t immediately fit expectations, but its small magnitude suggests it may not be biologically significant. Further investigation into the specific MYC-related genes affected could provide more insight.

These findings support the potential of CDK6 as a prognostic biomarker and therapeutic target in AML, as suggested by recent research[5]. They also highlight the complex interplay between CDK6 and various signaling pathways in the context of AML biology.

Citations:
[1] https://www.frontiersin.org/journals/genetics/articles/10.3389/fgene.2023.1009462/full
[2] https://ashpublications.org/blood/article/124/1/13/33127/Requirement-for-CDK6-in-MLL-rearranged-acute
[3] https://tcr.amegroups.org/article/view/46401/html
[4] https://journals.lww.com/md-journal/fulltext/2020/08280/identification_of_the_hub_genes_and_pathways.106.aspx
[5] https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7882723/
[6] https://www.nature.com/articles/s41598-021-91864-2
[7] https://www.oncotarget.com/article/9965/text/
[8] https://www.nature.com/articles/s41540-024-00366-0
[9] https://pmc.ncbi.nlm.nih.gov/articles/PMC7707947/

I would believe that the discordance with MYC and CDK6 biology may come from the weak gene signature used to calculate pathway enrichment.

I therefore asked again Perplexity.ai to give me an improved gene signatures that I can use to analyze the data.

Perplexity.ai reply was again quite good:

Based on the scientific literature and evidence data, I can provide improved gene signatures for the pathways you mentioned. These signatures are derived from well-established research in AML and related hematological contexts:

## Effector Memory Cells

– CD44: A cell-surface glycoprotein involved in cell-cell interactions and lymphocyte activation[2].
– CCR7: Chemokine receptor crucial for lymphocyte trafficking and homing[2].
– CD62L (SELL): L-selectin, important for lymphocyte homing to secondary lymphoid organs[2].
– KLRG1: Marker of terminally differentiated effector T cells[2].

## Cell Death

– BCL2: Key anti-apoptotic protein often overexpressed in AML[3][8].
– BAX: Pro-apoptotic protein, often dysregulated in AML[3][8].
– CASP3: Executioner caspase in apoptosis pathway[3].
– TP53: Tumor suppressor gene regulating cell cycle and apoptosis[7].

## TGF-beta

– SMAD2: Key mediator of TGF-beta signaling[1].
– SMAD4: Co-SMAD protein essential for TGF-beta signal transduction[1].
– TGFBR1: TGF-beta receptor type 1[1].
– TGFB1: TGF-beta 1 ligand[1].

## IFN-gamma Pathway

– STAT1: Key transcription factor in IFN-gamma signaling[2][5].
– IRF1: Interferon regulatory factor 1, important in IFN-gamma responses[2][5].
– JAK2: Janus kinase 2, crucial for IFN-gamma signal transduction[5].
– IFNGR1: IFN-gamma receptor 1[5].

## MYC-Related

– MYC: Proto-oncogene often deregulated in AML[7].
– MAX: MYC-associated factor X, forms heterodimers with MYC[4].
– CDKN2A: Cyclin-dependent kinase inhibitor, often altered in MYC-driven cancers[4].
– CCND2: Cyclin D2, a target gene of MYC[4].

These gene signatures are more closely aligned with the current understanding of AML biology and the specific pathways mentioned. They include key regulators and effectors in each pathway, which should provide a more accurate representation of pathway activity in your analyses.

Citations:
[1] https://pmc.ncbi.nlm.nih.gov/articles/PMC7707947/
[2] https://www.nature.com/articles/s41375-024-02387-4
[3] https://haematologica.org/article/view/4309
[4] https://www.frontiersin.org/journals/genetics/articles/10.3389/fgene.2023.1009462/full
[5] https://www.nature.com/articles/s41375-024-02255-1
[6] https://www.nature.com/articles/s41598-024-68755-3
[7] https://www.nejm.org/doi/full/10.1056/NEJMoa1301689
[8] https://pmc.ncbi.nlm.nih.gov/articles/PMC9600036/
[9] https://journals.lww.com/md-journal/fulltext/2020/08280/identification_of_the_hub_genes_and_pathways.106.aspx

So the next step is to provide these genes to ChatGPT, and address the other minor issues that came up with the second iteration of the prompt.

Prompt iteration number 3

The second iteration of the prompt was an improvement over the first attempt. Now we can try to improve the prompt further and see if ChatGPT 4o can really shine with the extra details.

Notice that I changed SELL to LNHR, which is the name used for SELL/CD62L in the gene expression dataset that we uploaded to ChatGPT.

Using the file I uploaded do the following:
> subset patients in three groups based on the expression of CDK6,
> compare the gene expression of the high vs. low groups.
> Present the fold changes as Log2 fold changes and add 1 to numerator and denominator to avoid dividing by zero
> You MUST run a statistical test to identify genes that are significantly differentially expressed, and adjust the raw p.value using FDR correction,
> run pathway/gene set enrichment analysis using the following gene signatures: Effector Memory Cells (CD44,CCR7,LNHR,KLRG1), Cell Death (BCL2,BAX,CASP3,TP53), TGF-beta (SMAD2,SMAD4,TGFBR1,TGFB1), IFN-gamma (STAT1,IRF1,JAK2,IFNGR1), MYC (MYC,MAX,CDKN2A,CCND2)
> Do not use the module gseapy as it is not available. Design a test similar to GSEA (e.g. ranked method) to calculate enrichment in the pathways I gave you. Make sure it accounts for the magnitude and direction of changes in gene expression by weighting the degree of up/down regulation of each gene in the pathway.
> calculate a p.value for the pathway enrichment analysis and add it to the table
> prepare and print summary tables with the top 20 results per each analysis, sorting the tables by adjusted p.value and p.value, lowest to highest.
> print in the chat pretty summary plots for differential gene expression analysis and pathway enrichment analysis that I can show my boss/students/colleague/whatever.

The results from the third iteration are interesting, and you can [[ see the video here ]].

While ChatGPT did almost everything I asked for, there were a few issues.

The most troublesome issue was that ChatGPT was unable to adjust the p.values from the differential gene expression analysis using FDR correction. That ranked incorrectly the genes in the table and reported the wrong results for the top 20 differentially expressed genes.

Since in the code ChatGPT tries to use the statsmodels python module, which gave me issues in the past, I started believing that there is something funky going on between ChatGPT python environment and this module.
And to overcome this issue, I can ask ChatGPT to manually calculate the FDR.

Prompt iteration number 4(ish)

ChatGPT 4o seemed to have a few issues loading the statsmodels python module. To overcome this issue I initially specified in the prompt

Don’t use the statsmodels python module for FDR correction, rather, calculating manually.

But it didn’t work and kept throwing errors where it was unable to calculate the p.values.

I then modified the prompt removing the FDR correction altogether:

Using the file I uploaded do the following:
> subset patients in three groups based on the expression of CDK6,
> compare the gene expression of the high vs. low groups.
> Present the fold changes as Log2 fold changes and add 1 to numerator and denominator to avoid dividing by zero
> You MUST run a statistical test to identify genes that are significantly differentially expressed
> run pathway/gene set enrichment analysis using the following gene signatures: Effector Memory Cells (CD44,CCR7,LNHR,KLRG1), Cell Death (BCL2,BAX,CASP3,TP53), TGF-beta (SMAD2,SMAD4,TGFBR1,TGFB1), IFN-gamma (STAT1,IRF1,JAK2,IFNGR1), MYC (MYC,MAX,CDKN2A,CCND2)
> Do not use the module gseapy as it is not available. Design a test similar to GSEA (e.g. ranked method) to calculate enrichment in the pathways I gave you. Make sure it accounts for the magnitude and direction of changes in gene expression by weighting the degree of up/down regulation of each gene in the pathway.
> calculate a p.value for the pathway enrichment analysis and add it to the table
> prepare and print summary tables with the top 20 results per each analysis, sorting the tables by p.value, lowest to highest.
> print in the chat pretty summary plots for differential gene expression analysis and pathway enrichment analysis that I can show my boss/students/colleague/whatever.

And (drum rolls)…. I would consider this as the best run so far.

ChatGPT Differential Gene Expression Analysis – problem-solving

Let me be clear, this run was not error-less. In fact, there were issues mostly ChatGPT complaining about non-numeric values:

/tmp/ipykernel_12/3077703208.py:15: FutureWarning: The pandas.np module is deprecated and will be removed from pandas in a future version. Import numpy directly instead.
  log2_fold_changes = log2_fold_changes.apply(lambda x: pd.np.log2(x))

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[4], line 25
    23     high_vals = high_group_data.iloc[i, 1:].values
    24     low_vals = low_group_data.iloc[i, 1:].values
---> 25     stat, p = ttest_ind(high_vals, low_vals, equal_var=False)
    26     p_values.append(p)
    28 # Assemble results into a DataFrame

File ~/.local/lib/python3.11/site-packages/scipy/stats/_stats_py.py:6587, in ttest_ind(a, b, axis, equal_var, nan_policy, permutations, random_state, alternative, trim)
  6584 n2 = b.shape[axis]
  6586 if trim == 0:
-> 6587     v1 = _var(a, axis, ddof=1)
  6588     v2 = _var(b, axis, ddof=1)
  6589     m1 = np.mean(a, axis)

File ~/.local/lib/python3.11/site-packages/scipy/stats/_stats_py.py:1250, in _var(x, axis, ddof, mean)
  1248 def _var(x, axis=0, ddof=0, mean=None):
  1249     # Calculate variance of sample, warning if precision is lost
-> 1250     var = _moment(x, 2, axis, mean=mean)
  1251     if ddof != 0:
  1252         n = x.shape[axis] if axis is not None else x.size

ChatGPT also complained about not loading a module:

---------------------------------------------------------------------------
NameError Traceback (most recent call last) Cell In[8], line 18 16 # Display the top 20 results 17 top_genes = differential_expression_results.head(20)
---> 18 tools.display_dataframe_to_user(name="Top 20 Differentially Expressed Genes", dataframe=top_genes)
NameError: name 'tools' is not defined

However, it ChatGPT worked its way around! It told me about the error with the module and re-loaded it:

The analysis was updated, but an error occurred while displaying the results. Let me re-import the necessary tools and ensure the output is properly presented

And then finally reported correctly the top differentially expressed genes, sorted by p.value:

Gene	Log2_Fold_Change	p_value
CDK6	2.067219611522135	5.660371147310949e-26
SIN3A	0.43308727234316274	4.0156060681676737e-22
MED21	0.98170286	1.1038562107511053e-20
SMARCC1	0.71197579	2.895389255088193e-20
BCKDHB	0.9240641551504545	3.915908638597995e-20
UBFD1	0.70556967	1.345452755491943e-19
UNC119B	0.8080343734938509	1.4183170356415648e-19
ZC3HAV1L	1.1764913238270776	1.5532420726952573e-19
DNAJC5	-0.8546164	2.2312267711099827e-19
KIAA1147	1.3127269723859563	2.331020518448842e-19
ANAPC1	0.7649582040450793	3.1819880139169394e-19
MYO1F	-1.2909077	3.813280391247822e-19
VKORC1L1	0.6719584731022111	7.213374824359152e-19
ZBTB7B	-1.8704801	8.3060487586793445e-19
NADK	-1.2969082	1.226831575152487e-18
GPALPP1	0.6569137779749854	1.3712893376718062e-18
TBC1D19	1.366078834614737	1.824171945163633e-18
TP53BP1	0.8669344531378103	3.8363902900256965e-18
ALDH5A1	1.277755380333189	3.936952034264973e-18
ZRANB3	1.0832961702096027	6.392723121773683e-18

ChatGPT Pathway Enrichment Analysis

ChatGPT also did a good job with running pathway enrichment analysis. It scored the MYC, cell death, TGF-beta, effector memory cells and IFN gamma pathways in a manner that better reflects the expected biology.

Pathway	Enrichment_Score	p_value
MYC	1.9483001390457564	0.052947052947052944
Cell Death	1.5143642667010107	0.0959041
TGF-beta	1.0050862796598343	0.16283716283716285
Effector Memory Cells	-0.314078523	0.6173826173826173
IFN-gamma	-2.379303013	0.9240759240759241

The table seems to suggest that there is an increased MYC, cell death and TGF-beta pathways, while there seems to be a reduction in the IFN-gamma pathway and in the activity of effector memory cells.

ChatGPT generated this code to run the analysis:

# Define gene sets for pathway enrichment analysis
gene_sets = {
    "Effector Memory Cells": ["CD44", "CCR7", "LNHR", "KLRG1"],
    "Cell Death": ["BCL2", "BAX", "CASP3", "TP53"],
    "TGF-beta": ["SMAD2", "SMAD4", "TGFBR1", "TGFB1"],
    "IFN-gamma": ["STAT1", "IRF1", "JAK2", "IFNGR1"],
    "MYC": ["MYC", "MAX", "CDKN2A", "CCND2"]
}

# Perform pathway enrichment analysis using a rank-based method
pathway_results = []
ranked_genes = differential_expression_results.sort_values(by='Log2_Fold_Change', ascending=False)

for pathway, genes in gene_sets.items():
    # Extract the rank of the genes in the pathway
    pathway_ranks = ranked_genes[ranked_genes['Gene'].isin(genes)]
    enrichment_score = pathway_ranks['Log2_Fold_Change'].sum()  # Sum of weighted ranks
   
    # Calculate a p-value (simplified permutation test logic)
    random_scores = []
    for _ in range(1000):  # Random permutations
        random_sample = ranked_genes['Log2_Fold_Change'].sample(len(pathway_ranks), replace=False).sum()
        random_scores.append(random_sample)
   
    random_scores = np.array(random_scores)
    p_value = (np.sum(random_scores >= enrichment_score) + 1) / (len(random_scores) + 1)
   
    # Store results
    pathway_results.append({
        'Pathway': pathway,
        'Enrichment_Score': enrichment_score,
        'p_value': p_value
    })

# Convert results to a DataFrame and sort by p-value
pathway_results_df = pd.DataFrame(pathway_results).sort_values(by='p_value', ascending=True)

# Display the top 20 pathways
top_pathways = pathway_results_df.head(20)
tools.display_dataframe_to_user(name="Top 20 Pathway Enrichment Results", dataframe=top_pathways)

It uses a simple approach to calculate enrichment, which is the sum of the Log2FC values for the genes in a pathway, and then it calculates an empirical p.value by bootstrapping 1000 scores to calculate the confidence intervals.

pathway_ranks = ranked_genes[ranked_genes['Gene'].isin(genes)]
enrichment_score = pathway_ranks['Log2_Fold_Change'].sum()  # Sum of weighted ranks

The final p.value is simply the ratio between the scores greater or equal of the enriched p.value, divided by the number of enrichment scores.

p_value = (np.sum(random_scores >= enrichment_score) + 1) / (len(random_scores) + 1

Simple, yet correct.

Did ChatGPT plot the DEG and pathway enrichment results?

The answer is yes.
The plots are not the fanciest, but they are ok visuals, and they give you an idea of the results.

First, it prepped a volcano plot with the results from differential gene expression analysis.

And then it prepared a ranked barplot showing the enrichment scores for the pathways I asked about.

They are both simple plots, no fancy colors, no cool shapes, no magic annotations worth a Nature paper and so on, but they all convey a direct and simple message about the results.

The volcano plot is spartan, but it suggests that a lot of genes are differentially expressed, with an even number of genes up and down-regulated.

The barplot is straightforward to interpret, it is ranked and well labelled. The axes are labelled correctly with Pathway and Enrichment Score so that there is no confusion.

Summary and my take on using ChatGPT for analyzing TCGA gene expression data

I remember trying to do this as soon as OpenAI gave us the ability to upload files in the chat. It was a disaster, to say the least. Now ChatGPT4o is not yet perfect by any means, but through a few educated iterations of the prompt, you can get pretty close to what an initial real analysis would look like.

One of the issues I noticed was caused by inconsistent coding format. ChatGPT4o was able to overcome the issue a few times by automatically iterating over the issue, but more often than not, I had to restart the analysis.

Moreover, it seems to have issues with loading python modules, leading to a less robust analytic experience. Yet, by understanding the errors, and tweaking the prompt (and a dose of luck from the GenAI Gods), it all worked out.

I would have loved to have had an FDR correction in the dataset, but ChatGTP was first unable to load the necessary libraries, and then it was unable to calculate it manually. I would consider this a tradeoff to accept to get a glimpse inside the data.

In this test, I also started a new chat each time I tested a new prompt. The reason I did that is because I would like to give a wet-lab scientist the ability to ask directly a question to ChatGPT without getting lost in the many iterations needed provide feedback to the LLM and fine tune the request.

Last but not least… all of this was done without writing a single line of code. I love coding, I find it almost hypnotic, and I often stay up late at night analyzing data without realizing it’s 1am, if not 2am.

However, plenty of great scientists do not know how to code because they work on the bench side of research.

Being able to use AI tools to get a glimpse into genomic data can simplify their life so that they don’t have to wait for their bioinformatician before getting an idea of what’s going on with their favorite gene.

Did you ever use ChatGPT to analyze TCGA data?

How did you do it?

Feel free to shoot me a message at seb@letsjustdoai.com with your experience, or any feedback.