Monday, August 3, 2009

Genius MIT Broad Institute GenePattern [.gct file format]


Hahaha... I just got rescued by a group of benevolent MIT nerds! After hours and hours of intensive internet searching, I finally found a free tool that will help me analyse the muscular dystrophy microarray data. The tool that I chanced upon is called GenePattern 3.2. It is a powerful genomic analysis platform that provides access to more than 100 tools for gene expression analysis, proteomics, SNP analysis and common data processing tasks. The MIT team developed this tool at the Broad Institute. I salute these generous guys for coming up with this top-notch in silico research software.

[Changing data from MATLAB (.mat) format to GenePattern 3.2 (.gct) format]
I spent a good chunk of the afternoon planning out the steps that I will follow in order to carry out gene expression profiling of the unknown type of muscular dystrophy. Gosh... this stuff is kinda crazy confusing! However, I got some bonus time at the end of the day and ended up having some fun programming in the MATLAB® language. Apparently the GenePattern 3.2 gene clustering module only accepts a specific file format (.gct) [see Excel image below]. I had to use my inadequate MATLAB skills to write a script that would convert all the massive .mat matrix data files into the GenePattern format. The task was a success [even though I surely know that my computer science professor would give it a B- grade!]. I had four .mat data files, and each took ~5 mins to convert into a plain ASCII text .gct file that occupied ~32 MB of space. Massive data!!
% Fumba Chibaka (CNMC Internship- MD DIAGNOSIS)
% August 3rd, 2009
% Converts MATLAB (.mat) matrix data to raw GENE PATTERN (.gct) text format
% Data collected from Human Genome U133 Plus 2.0 Array
% expressnValues = matrix (~60,000 rows/ 200 cols)-gene expression values
% probeid = cell array (~60,000 elements) - microarray probe gene id
% class_name = cell array (~10 elements) - disease groups
function output= genePattern(expressnValues, probeid, class_name)
% number of probes (rows) and samples (cols) in the expression matrix
[rows, cols]= size(expressnValues);


%output file (.gct)
fid = fopen('GENEPATTERN.gct', 'w');
% gct version (always #1.2)
fprintf(fid, '%s\n','#1.2');

%print # of rows and cols
fprintf(fid, '%d\t%d\n',rows,cols);
fprintf(fid, '%s\t','Gene');
fprintf(fid, '%s','Description');

%print one column label per sample, each preceded by a tab
%(class_name must hold at least cols entries)
for c=1:cols
  id= char(class_name(c));
  fprintf(fid, '\t%s',id);
end;

%print gene ids and expression values
for r=1:rows

  % Start new line
  fprintf(fid, '\n');
  %print gene probe id (extracted from cell array- probeid)
  id= char(probeid(r));
  fprintf(fid, '%s\t',id);
  %print dummy text in the probe gene description column
  fprintf(fid, '%s','na');

  %print the expression values (all cols of them, each preceded by a tab)
  for c=1:cols
    fprintf(fid, '\t%.3f' , expressnValues(r, c));
  end;
end;

%return the file id (already closed by the time the caller sees it)
output= fid;
fclose(fid);
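For anyone who wants to see the .gct layout without waiting on a ~60,000-row matrix, here is a minimal sketch on toy data. Everything in it is an assumption made up for illustration: the probe ids, the sample labels, the numbers, and the output file name are all placeholders, and 'na' is just dummy text for the description column. The first header column is labelled 'Name' here, which is how I understand the GCT format description to label it; the script above uses 'Gene' instead.

```matlab
% Toy data, invented purely for illustration: 2 probes x 3 samples
expressnValues = [1.5 2.25 3; 4 5.125 6];
probeid    = {'probe_A', 'probe_B'};              % placeholder probe ids
class_name = {'sample1', 'sample2', 'sample3'};   % one label per column

[rows, cols] = size(expressnValues);

% Hypothetical output location (system temp folder)
outFile = fullfile(tempdir, 'toy_example.gct');
fid = fopen(outFile, 'w');

% Line 1: format version
fprintf(fid, '#1.2\n');
% Line 2: number of data rows, then number of sample columns
fprintf(fid, '%d\t%d\n', rows, cols);
% Line 3: two fixed header columns followed by the sample labels
fprintf(fid, 'Name\tDescription');
fprintf(fid, '\t%s', class_name{:});   % format is reused for each label
fprintf(fid, '\n');

% One line per probe: id, dummy description, then every expression value
for r = 1:rows
    fprintf(fid, '%s\t%s', probeid{r}, 'na');
    fprintf(fid, '\t%.3f', expressnValues(r, :));  % format reused per value
    fprintf(fid, '\n');
end

fclose(fid);

% Show the resulting file in the command window
type(outFile)
```

Writing the tab before each value (rather than after it) keeps every line free of a trailing tab, and passing a whole row to fprintf lets MATLAB reuse the format for each value, so no inner loop over columns is needed.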
Simple is complex! Muhahahaha [evil laugh].... Oh me gosh, it's August already!
