affyread

Read microarray data from Affymetrix GeneChip file

Syntax

AffyStruct = affyread(File)
AffyStruct = affyread(File, LibraryPath)

Description

AffyStruct = affyread(File) reads an Affymetrix® file and creates a MATLAB® structure. The affyread function can read Affymetrix EXP, DAT, CEL, CLF, BGP, CDF, and GIN files associated with Affymetrix GeneChip® arrays for expression, genotyping (SNP), or resequencing assays. It can read Affymetrix CHP files associated with Affymetrix GeneChip arrays for expression assays only.

AffyStruct = affyread(File, LibraryPath) specifies the path and folder of a CDF or GIN library file.

Reading many CEL files or a large CEL file can require a large amount of memory from the operating system, which can cause memory-related errors or trouble reading the files.

Input Arguments

File

Character vector or string specifying a file name or a path and file name of one of the following Affymetrix file types associated with Affymetrix GeneChip arrays for expression, genotyping (SNP), or resequencing assays. However, if the file name is for a CHP file, it must be associated with an Affymetrix GeneChip array for an expression assay.

  • EXP — Data file containing information about experimental conditions and protocols.

  • DAT — Data file containing raw image data (pixel intensity values).

  • CEL — Data file containing information about the intensity values of the individual probes.

  • CHP — Data file containing summary information of the probe sets, including intensity values.

  • CLF — Cell layout file that maps probe IDs to a location (x- and y-coordinates) in the CEL file.

  • BGP — Background probe file that lists the probes to use for background correction.

  • CDF — Library file containing information about which probes belong to which probe set.

  • GIN — Library file containing information about the probe sets, such as the gene name associated with the probe set.

If you specify only a file name, put that file on the MATLAB search path or in the current folder. If you specify only a file name of a CDF or GIN library file, you can specify the path and folder in the LibraryPath input argument.

LibraryPath

Character vector or string specifying the path and folder of a:

  • CDF library file associated with File when File is a CHP file

  • CDF library file when File is a CDF file

  • GIN library file when File is a GIN file

Note

If you do not specify LibraryPath when reading a CHP file, affyread looks in the current folder for the CDF file. If it does not find the CDF file, it still reads the CHP file. However, it omits the probe set names and types from the return value, AffyStruct.
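As a minimal sketch of the two calling forms, the following code checks that a CHP file exists before reading it and passes a library folder only when that folder is available. The file name and the libDir folder are placeholders borrowed from the example on this page, not paths that are assumed to exist on your system.

```matlab
% Placeholder names: replace with your own CHP file and library folder.
chpFile = 'Ecoli-antisense-121502.chp';
libDir  = 'C:\LibFiles\';

chpStruct = [];                         % stays empty if nothing is read
if exist(chpFile, 'file') == 2          % 2 means the name is a file
    if exist(libDir, 'dir') == 7        % 7 means the name is a folder
        % Library folder found: pass it so affyread can locate the CDF file.
        chpStruct = affyread(chpFile, libDir);
    else
        % No library folder: affyread looks in the current folder instead.
        chpStruct = affyread(chpFile);
    end
else
    fprintf('File %s not found; nothing was read.\n', chpFile);
end
```

Checking for the folder first makes it explicit which of the two behaviors described in the note you are relying on.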

Output Arguments

AffyStruct

MATLAB structure containing information from an Affymetrix data or library file, for expression, genotyping (SNP), or resequencing assay types.

The following tables describe the fields in AffyStruct for the different Affymetrix file types.

EXP, DAT, CEL, CHP, CLF, BGP, CDF, and GIN Files

Field    Description

Name

File name.

DataPath

Path and folder of the file.

LibPath

Path and folder of the CDF and GIN library files associated with the file you are reading.

FullPathName

Path and folder of the file.

ChipType

Name of the Affymetrix GeneChip array (for example, DrosGenome1 or HG-Focus).

Date or CreateDate

File creation date.

EXP File

Field    Description

ChipLot
Operator
SampleType
SampleDesc
Project
Comments
Reagents
ReagentLot
Protocol
Station
Module
HybridizeDate
ScanPixelSize
ScanFilter
ScanDate
ScannerID
NumberOfScans
ScannerType
NumProtocolSteps
ProtocolSteps

Information about experimental conditions and protocols captured by the Affymetrix software.

DAT File

Field    Description

NumPixelsPerRow

Number of pixels per row in the image created from the GeneChip array (number of columns).

NumRows

Number of rows in the image created from the GeneChip array.

MinData

Minimum intensity value in the image created from the GeneChip array.

MaxData

Maximum intensity value in the image created from the GeneChip array.

PixelSize

Size of one pixel in the image created from the GeneChip array.

CellMargin

Size of gaps between cells in the image created from the GeneChip array.

ScanSpeed

Speed of the scanner used to create the image.

ScanDate

Date the scan was performed.

ScannerID

Name of the scanning device used.

UpperLeftX
UpperLeftY
UpperRightX
UpperRightY
LowerLeftX
LowerLeftY
LowerRightX
LowerRightY

Pixel coordinates of the scanned image.

ServerName

Not used.

Image

NumRows-by-NumPixelsPerRow image of the scanned GeneChip array.

CEL File

Field    Description

FileVersion

Version of the CEL file format.

Algorithm

Algorithm used in the image-processing step that converts from DAT format to CEL format.

AlgParams

Character vector containing parameters used by the algorithm in the image-processing step.

NumAlgParams

Number of parameters in AlgParams.

CellMargin

Size of gaps between cells in the image created from the GeneChip array, used for computing the intensity values of the cells.

Rows

Number of rows of probes.

Cols

Number of columns of probes.

NumMasked

Number of masked probes, which are not used in subsequent processing.

NumOutliers

Number of cells identified as outliers (extremely high or extremely low intensity) by the image-processing step.

NumProbes

Number of probes (Rows * Cols) on the GeneChip array.

UpperLeftX
UpperLeftY
UpperRightX
UpperRightY
LowerLeftX
LowerLeftY
LowerRightX
LowerRightY

Pixel coordinates of the scanned image.

ProbeColumnNames

Cell array containing the eight column names in the Probes field:

  • PosX — x-coordinate of the cell

  • PosY — y-coordinate of the cell

  • Intensity — Intensity value of the cell

  • StdDev — Standard deviation of intensity value

  • Pixels — Number of pixels in the cell

  • Outlier — True/false flag indicating if the cell was marked as an outlier

  • Masked — True/false flag indicating if the cell was masked

  • ProbeType — Integer indicating the probe type (for example, 1 = expression)

Probes

NumProbes-by-8 array of information about the individual probes, including intensity values. The ProbeColumnNames field contains the column names of this array.
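Because ProbeColumnNames holds the column names of Probes, you can select columns by name instead of by position. The following sketch uses a small mock structure with made-up values in place of a real CEL file, so it runs without any data files. The variable names are illustrative only.

```matlab
% Mock CEL structure with the documented layout (values are made up).
celStruct.ProbeColumnNames = {'PosX','PosY','Intensity','StdDev', ...
                              'Pixels','Outlier','Masked','ProbeType'};
celStruct.Probes = [0 0 120.5 10.2 9 0 0 1;
                    1 0 310.0 25.1 9 0 0 1;
                    2 0  95.0  8.7 9 1 0 1];

% Find columns by name rather than hard-coding their positions.
intCol = strcmp(celStruct.ProbeColumnNames, 'Intensity');
outCol = strcmp(celStruct.ProbeColumnNames, 'Outlier');

intensity = celStruct.Probes(:, intCol);
isOutlier = celStruct.Probes(:, outCol) == 1;

% Mean intensity over the cells that are not flagged as outliers.
meanGood = mean(intensity(~isOutlier));
```

With a structure returned by affyread, the same indexing applies to the full NumProbes-by-8 array.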

CHP File

Field    Description

AssayType

Type of assay associated with the GeneChip array (for example, Expression, Genotyping, or Resequencing).

CellFile

File name of the CEL file from which the CHP file was created.

Algorithm

Algorithm used to convert from CEL format to CHP format.

AlgVersion

Version of the algorithm used to create the CHP file.

NumAlgParams

Number of parameters in AlgParams.

AlgParams

Character vector containing parameters used in steps required to create the CHP file (for example, background correction).

NumChipSummary

Number of entries in ChipSummary.

ChipSummary

Summary information for the GeneChip array, including background average, standard deviation, max, and min.

BackgroundZones

Structure containing information about the zones used in the background adjustment step.

Rows

Number of rows of probes.

Cols

Number of columns of probes.

NumProbeSets

Number of probe sets on the GeneChip array.

NumQCProbeSets

Number of QC probe sets on the GeneChip array.

ProbeSets

(Expression GeneChip array)

NumProbeSets-by-1 structure array containing information for each expression probe set, including the following fields:

  • Name — Name of the probe set.

  • ProbeSetType — Type of the probe set.

  • CompDataExists — True/false flag indicating if the probe set has additional computed information.

  • NumPairs — Number of probe pairs in the probe set.

  • NumPairsUsed — Number of probe pairs in the probe set used for calculating the probe set signal (not masked).

  • Signal — Summary intensity value for the probe set.

  • Detection — Indicator of statistically significant difference between the intensity value of the PM probes and the intensity value of the MM probes in a single probe set (Present, Absent, or Marginal).

  • DetectionPValue — P-value for the Detection indicator.

  • CommonPairs — When CompDataExists is true, contains the number of common pairs between the experiment and the baseline after the removal of outliers and masked probes.

  • SignalLogRatio — When CompDataExists is true, contains the change in signal between the experiment and baseline.

  • SignalLogRatioLow — When CompDataExists is true, contains the lowest ratios of probes between the experiment and the baseline.

  • SignalLogRatioHigh — When CompDataExists is true, contains the highest ratios of probes between the experiment and the baseline.

  • Change — When CompDataExists is true, describes how the probe changes versus a baseline experiment. Choices are Increase, Marginal Increase, No Change, Decrease, or Marginal Decrease.

  • ChangePValue — When CompDataExists is true, contains the p-value associated with Change.

ProbeSets

(Genotyping GeneChip array)

NumProbeSets-by-1 structure array containing information for each genotyping probe set, including the following fields:

  • Name — Name of the probe set.

  • AlleleCall — Allele that is present for the probe set. Possibilities are AA (homozygous for the major allele), AB (heterozygous for the major and minor allele), BB (homozygous for the minor allele), or NoCall (unable to determine allele).

  • Confidence — Measure of the accuracy of the allele call.

  • RAS1 — Relative Allele Signal 1 for the SNP site, which is calculated using sense probes.

  • RAS2 — Relative Allele Signal 2 for the SNP site, which is calculated using antisense probes.

  • PValueAA — p-value for an AA call.

  • PValueAB — p-value for an AB call.

  • PValueBB — p-value for a BB call.

  • PValueNoCall — p-value for a NoCall call.

ProbeSets

(Resequencing GeneChip array)

NumProbeSets-by-1 structure array containing information for each resequencing probe set, including the following fields:

  • CalledBases — 1-by-NumProbeSets character vector containing the bases called by the resequencing algorithm. Possible values are a, c, g, t, and n.

  • Scores — 1-by-NumProbeSets array containing the score associated with each base call.
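Because ProbeSets is a structure array, you can gather one field across all probe sets with a single expression. The following sketch does this for an expression array, using a mock structure that carries only three of the documented fields. The probe set names and values are made up, and storing Detection as a character vector such as 'Present' is an assumption.

```matlab
% Mock of a few documented ProbeSets fields (names and values are made up).
chpStruct.ProbeSets = struct( ...
    'Name',      {'setA_at', 'setB_at', 'setC_at'}, ...
    'Signal',    {512.3, 48.9, 1020.0}, ...
    'Detection', {'Present', 'Absent', 'Present'});

% Collect one field across the structure array.
names   = {chpStruct.ProbeSets.Name};
signals = [chpStruct.ProbeSets.Signal];

% Keep probe sets whose detection call is Present (assumed char values).
isPresent    = strcmp({chpStruct.ProbeSets.Detection}, 'Present');
presentNames = names(isPresent);
presentLog2  = log2(signals(isPresent));
```

The comma-separated list syntax works the same way whether the structure array is a row or a column.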

CLF File

Field    Description

LibSetName

Name of a collection of related library files for a given chip. There is only one LibSetName for a CLF file. For example, PGF and CLF files intended for use together must have the same LibSetName.

LibSetVersion

Version of a collection of related library files for a given chip. There is only one LibSetVersion for a CLF file. For example, PGF and CLF files intended for use together must have the same LibSetVersion.

GUID

Unique identifier for the CLF file.

CLFFormatVersion

Version of the CLF file format.

Rows

Number of rows in the CEL file.

Note

The CLF file is 1-based, which means the first row and column are designated 1,1, not 0,0.

Cols

Number of columns in the CEL file.

Note

The CLF file is 1-based, which means the first row and column are designated 1,1, not 0,0.

StartID

Starting number for the numbering of elements in the CLF file.

Tip

This information is useful when numbering does not start with 1.

EndID

Ending number for the numbering of elements in the CLF file.

Tip

This information is useful when numbering does not start with 1 and/or there are gaps in the numbering.

Order

Order in which the probe IDs are numbered in the CEL file, either 'row_major' or 'col_major'.

DataColNames

Names of the columns in the CEL file that contain data.

Data

If the numbering of elements in the CLF file is sequential, this field contains a function handle that calculates the x- and y-coordinates of each element in the file from the probe ID.

If the numbering of elements in the CLF file is not sequential, this field contains a matrix indicating the number value of each element in the file.

BGP File

Field    Description

LibSetName

Name of a collection of related library files for a given chip. There is only one LibSetName for a BGP file.

LibSetVersion

Version of a collection of related library files for a given chip. There is only one LibSetVersion for a BGP file.

GUID

Unique identifier for a BGP file.

ExecGUID
ExecVersion
Cmd

Information about the algorithm used to generate the BGP file.

Data

Structure containing the following fields:

  • probe_id — ID of the probe to use for background correction.

  • probeset_id — ID of the probe set in the PGF file to which the probe belongs.

  • type — Classification information for the probe.

  • gc_count — Combined number of G and C bases in the probe.

  • probe_length — Length of the probe in base pairs.

  • interrogation_position — Interrogation position of the probe. It is typically 13 for 25-mer PM/MM probes.

  • probe_sequence — Sequence of the probe on the array, going in the direction from array surface to solution. For most standard Affymetrix arrays, this direction is from 3' to 5'. For example, for a sense target (st) probe (see the probe_type field), complement the sequence in this field before looking for matches to transcript sequences. For an antisense target (at), reverse this sequence.

  • atom_id — ID of the atom to which the probe belongs.

  • x — Column coordinate of the probe in the CEL file.

  • y — Row coordinate of the probe in the CEL file.

  • probeset_type — Classification information for the probe set, such as control, affx, or spike. This type information can include multiple classifications and can also be nested.

  • probe_type — Classification information for the probe, such as pm (perfect match), mm (mismatch), st (sense target), or at (antisense target). This type information can include multiple classifications and can also be nested.
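The gc_count and probe_sequence descriptions above can be illustrated with plain character operations. The following sketch uses a made-up 25-mer in place of a real probe_sequence value and shows what counting G and C bases, complementing, and reversing a sequence mean. It is an illustration only, not how the BGP file was generated. Bioinformatics Toolbox also provides seqcomplement and seqreverse for the last two steps.

```matlab
% Made-up 25-mer standing in for one probe_sequence entry.
probeSeq = 'ACGTTGCAACGTAGCTAGGCTTAAC';

% Combined number of G and C bases (what gc_count describes).
gcCount = sum(probeSeq == 'G' | probeSeq == 'C');

% Complement: swap A<->T and C<->G, keeping the base order.
complemented = probeSeq;
complemented(probeSeq == 'A') = 'T';
complemented(probeSeq == 'T') = 'A';
complemented(probeSeq == 'C') = 'G';
complemented(probeSeq == 'G') = 'C';

% Reverse: same bases, opposite order.
reversed = fliplr(probeSeq);
```

Each replacement indexes with a mask computed from the original probeSeq, so a base is never swapped twice.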

CDF File

Field    Description

Rows

Number of rows of probes.

Cols

Number of columns of probes.

NumProbeSets

Number of probe sets on the GeneChip array.

NumQCProbeSets

Number of QC probe sets on the GeneChip array.

ProbeSetColumnNames

Cell array containing the six column names in the ProbePairs field in the ProbeSets array:

  • GroupNumber — Number identifying the group to which the probe pair belongs. For expression arrays, this value is always 1. For genotyping arrays, this value is typically 1 (allele A, sense), 2 (allele B, sense), 3 (allele A, antisense), or 4 (allele B, antisense).

  • Direction — Number identifying the direction of the probe pair. 1 = sense and 2 = antisense.

  • PMPosX — x-coordinate of the perfect match probe.

  • PMPosY — y-coordinate of the perfect match probe.

  • MMPosX — x-coordinate of the mismatch probe.

  • MMPosY — y-coordinate of the mismatch probe.

ProbeSets

NumProbeSets-by-1 structure array containing information for each probe set, including the following fields:

  • Name — Name of the probe set.

  • ProbeSetType — Type of the probe set.

  • CompDataExists — True/false flag indicating if the probe set has additional computed information.

  • NumPairs — Number of probe pairs in the probe set.

  • NumQCProbes — Number of QC probes in the probe set.

  • QCType — Type of QC probes.

  • GroupNames — Name of the group to which the probe set belongs. For expression arrays, this field contains the name of the probe set. For genotyping arrays, this field contains the name of the alleles, for example {'A' 'C' 'A' 'C'}'.

  • ProbePairs — NumPairs-by-6 array of information about the probe pairs. The column names of this array are contained in the ProbeSetColumnNames field.
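To show how ProbeSetColumnNames and ProbePairs fit together, the following sketch finds a probe set by name and then pulls out the perfect match and mismatch coordinates by column name. It uses a mock structure with one made-up probe set, so the probe set name and all coordinates are illustrative only.

```matlab
% Mock CDF structure with one expression probe set (coordinates are made up).
cdfStruct.ProbeSetColumnNames = {'GroupNumber','Direction', ...
                                 'PMPosX','PMPosY','MMPosX','MMPosY'};
cdfStruct.ProbeSets(1).Name = 'setA_at';
cdfStruct.ProbeSets(1).NumPairs = 3;
cdfStruct.ProbeSets(1).ProbePairs = [1 1 10 4 10 5;
                                     1 1 22 8 22 9;
                                     1 1 31 2 31 3];

% Locate the probe set by name, then the coordinate columns by name.
setIdx = find(strcmp({cdfStruct.ProbeSets.Name}, 'setA_at'), 1);
pairs  = cdfStruct.ProbeSets(setIdx).ProbePairs;
colOf  = @(name) find(strcmp(cdfStruct.ProbeSetColumnNames, name));

pmXY = pairs(:, [colOf('PMPosX') colOf('PMPosY')]);   % perfect match x, y
mmXY = pairs(:, [colOf('MMPosX') colOf('MMPosY')]);   % mismatch x, y
```

Looking columns up by name keeps the code readable and avoids relying on the column order.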

GIN File

Field    Description

Version

GIN file format version.

ProbeSetName

Probe set ID/name.

ID

Identifier for the probe set (gene ID).

Description

Description of the probe set.

SourceNames

Source or sources of the probe sets.

SourceURL

Source URL or URLs for the probe sets.

SourceID

Vector of numbers specifying which SourceNames or SourceURL each probe set is associated with.
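The following sketch shows one way to pair each probe set with its source. It assumes that SourceID holds 1-based positions into SourceNames and that SourceNames is a cell array, neither of which is stated above, so check both against a structure returned by affyread. All names in the mock structure are made up.

```matlab
% Mock GIN structure (all names and indices are made up).
ginStruct.ProbeSetName = {'setA_at'; 'setB_at'; 'setC_at'};
ginStruct.SourceNames  = {'Source One', 'Source Two'};
ginStruct.SourceID     = [1; 2; 1];

% Assumption: SourceID holds 1-based positions into SourceNames.
sourcePerSet = ginStruct.SourceNames(ginStruct.SourceID);

% Print one line per probe set.
for k = 1:numel(ginStruct.ProbeSetName)
    fprintf('%s -> %s\n', ginStruct.ProbeSetName{k}, sourcePerSet{k});
end
```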

Examples

This example shows how to read and visualize microarray data from an Affymetrix® GeneChip® file.

This example requires sample data files. It uses the sample data from the E. coli Antisense Genome Array. Extract the data files from the DTT archive using the Data Transfer Tool.

You also need the corresponding library files for the sample. This example uses Ecoli_ASv2.CDF and Ecoli_ASv2.GIN, the library files for the E. coli Antisense Genome Array. You might already have these files if Affymetrix GeneChip software is installed on your machine. If not, download the E. coli Antisense Genome Array zip file, which contains the library files.

Read the contents of a CEL file into a MATLAB structure.

celStruct = affyread('Ecoli-antisense-121502.CEL');

Display a spatial plot of the probe intensities.

maimage(celStruct, 'Intensity')

Zoom in on a specific region of the plot.

axis([200 340 0 70])

Read the contents of a DAT file into a MATLAB structure. Display the raw image data, and then use the axis image command to set the correct aspect ratio.

datStruct = affyread('Ecoli-antisense-121502.dat');
imagesc(datStruct.Image)
axis image

Zoom in on a specific region of the plot.

axis([1900 2800 160 650])

Read the contents of a CHP file into a MATLAB structure, specifying the location of the associated CDF library file. Then extract information for probe set 3315278.

chpStruct = affyread('Ecoli-antisense-121502.chp','C:\LibFiles\');
geneName = probesetlookup(chpStruct,'3315278')
geneName = 

  struct with fields:

      Identifier: '3315278'
    ProbeSetName: 'argG_b3172_at'
        CDFIndex: 5213
        GINIndex: 3074
     Description: '/start=3316278 /end=3317621 /direction=+ /description=argininosuccinate synthetase'
          Source: 'NCBI EColi Genome'
       SourceURL: 'http://www.ncbi.nlm.nih.gov/cgi-bin/Entrez/altvik?gi=115&db=g&from=3315278'

Version History

Introduced before R2006a