Transcription Factor Binding Prediction from ATAC-seq and scATAC-seq with Deep Neural Networks
The `train` function trains a maxATAC model using the supplied ATAC-seq and ChIP-seq inputs. The inputs are listed in a tab-delimited meta file with the columns described below:
| Column Name | Description |
|---|---|
| Cell_Line | Sample cell type |
| TF | Gene symbol for the TF |
| ATAC_Signal_File | Path to ATAC-seq bigwig signal track |
| Binding_File | Path to ChIP-seq bigwig signal track |
| ATAC_Peaks | Path to ATAC-seq peak BED file |
| CHIP_Peaks | Path to ChIP-seq peak BED file |
| Train_Test_Label | Train or Test label |

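
As an illustration, a meta file with these columns can be written using Python's standard `csv` module. This is a minimal sketch: the column names come from the table above, while the cell line names, the file paths, and the exact spelling of the `Train`/`Test` labels are hypothetical placeholders.

```python
import csv
import io

# Column names as listed in the table above.
META_COLUMNS = [
    "Cell_Line", "TF", "ATAC_Signal_File", "Binding_File",
    "ATAC_Peaks", "CHIP_Peaks", "Train_Test_Label",
]

def example_row(cell_line, label):
    """Build one meta row. Every path here is a hypothetical placeholder."""
    return {
        "Cell_Line": cell_line,
        "TF": "CTCF",
        "ATAC_Signal_File": f"/data/{cell_line}_ATAC.bw",
        "Binding_File": f"/data/{cell_line}_CTCF_ChIP.bw",
        "ATAC_Peaks": f"/data/{cell_line}_ATAC_peaks.bed",
        "CHIP_Peaks": f"/data/{cell_line}_CTCF_peaks.bed",
        "Train_Test_Label": label,
    }

def write_meta(rows, handle):
    """Write the rows as tab-delimited text with a header line."""
    writer = csv.DictWriter(
        handle, fieldnames=META_COLUMNS, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)

# Assumed cell lines and labels, for illustration only.
rows = [example_row("GM12878", "Train"), example_row("HepG2", "Test")]
buffer = io.StringIO()
write_meta(rows, buffer)
meta_text = buffer.getvalue()
print(meta_text)
```

To write to disk instead of an in-memory buffer, pass a file opened with `open("CTCF_meta.tsv", "w", newline="")` as the handle.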
**Note:** maxATAC was built with TensorFlow version 2.5.0; newer versions of TensorFlow may not be compatible. If you encounter errors when running `train`, check the version of TensorFlow installed in your environment.
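
One way to check the installed version without importing TensorFlow is sketched below. It assumes the package is distributed under the name `tensorflow` (some environments use a different distribution name), and comparing only the major and minor version is an assumption of this sketch, not a documented maxATAC compatibility rule.

```python
from importlib import metadata

EXPECTED_TF = "2.5.0"  # version named in the note above

def installed_version(package="tensorflow"):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None

def same_minor_release(installed, expected=EXPECTED_TF):
    """Compare major.minor only (an assumption made for this sketch)."""
    return installed.split(".")[:2] == expected.split(".")[:2]

version = installed_version()
if version is None:
    print("tensorflow is not installed in this environment")
elif same_minor_release(version):
    print(f"tensorflow {version} is in the same series as {EXPECTED_TF}")
else:
    print(f"tensorflow {version} differs from {EXPECTED_TF}; training may fail")
```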
The meta file described above is used to locate all input files for the training data. **Note:** all files specified in the meta file must be uncompressed.
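
A quick pre-flight check for compressed inputs might look like the following sketch. The list of suffixes is an assumption, and the example meta content is made up; only the column names come from the meta file description.

```python
import csv
import io

# Columns of the meta file that hold file paths.
PATH_COLUMNS = ["ATAC_Signal_File", "Binding_File", "ATAC_Peaks", "CHIP_Peaks"]
# Assumed set of suffixes that indicate a compressed file.
COMPRESSED_SUFFIXES = (".gz", ".zip", ".bz2")

def find_compressed_paths(meta_handle):
    """Return (row_number, column, path) for each path that looks compressed."""
    problems = []
    reader = csv.DictReader(meta_handle, delimiter="\t")
    for row_number, row in enumerate(reader, start=1):
        for column in PATH_COLUMNS:
            path = row.get(column) or ""
            if path.lower().endswith(COMPRESSED_SUFFIXES):
                problems.append((row_number, column, path))
    return problems

# Hypothetical meta file content with one gzipped peak file.
example_meta = io.StringIO(
    "Cell_Line\tTF\tATAC_Signal_File\tBinding_File\t"
    "ATAC_Peaks\tCHIP_Peaks\tTrain_Test_Label\n"
    "GM12878\tCTCF\tatac.bw\tchip.bw\tatac_peaks.bed.gz\tchip_peaks.bed\tTrain\n"
)
problems = find_compressed_paths(example_meta)
for row_number, column, path in problems:
    print(f"row {row_number}: {column} points to a compressed file: {path}")
```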
General Steps:
1) Initialize the pools of training regions of interest (ROIs).
2) Initialize a `keras.utils.Sequence` object. Each sequence object is specific to training or validation. The object creates random batches of regions of interest from the input ROI pool, paired with the matching ATAC-seq and ChIP-seq signals.
3) Fit the model for the given number of epochs.
4) Select the best model based on the dice coefficient and save the results of training.
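
Step 4 can be sketched in plain Python as follows. This is not the maxATAC implementation: the thresholded form of the dice coefficient, the 0.5 cutoff, and the per-epoch values are all assumptions chosen for illustration.

```python
def dice_coefficient(y_true, y_pred, threshold=0.5):
    """Dice = 2 * |A and B| / (|A| + |B|), computed on binarized predictions."""
    pred = [1 if p >= threshold else 0 for p in y_pred]
    intersection = sum(t * p for t, p in zip(y_true, pred))
    total = sum(y_true) + sum(pred)
    # Two empty sets are treated as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total

def best_epoch(dice_per_epoch):
    """Return the 1-based epoch with the highest dice (first one on ties)."""
    best_index = max(range(len(dice_per_epoch)), key=dice_per_epoch.__getitem__)
    return best_index + 1

# Made-up labels and predictions.
y_true = [1, 1, 0, 0, 1]
y_pred = [0.9, 0.4, 0.2, 0.7, 0.8]
print(dice_coefficient(y_true, y_pred))

# Made-up dice values, one per epoch.
history = [0.41, 0.55, 0.63, 0.61]
print(best_epoch(history))
```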
For each TF model, all the ATAC-seq and ChIP-seq peaks are pooled into training, validation, and test groups.
Each model is trained on 100 batches of 1,000 examples per epoch. Each batch is composed of randomly chosen peaks that are then randomly assigned to cell types.
Training on ATAC-seq and ChIP-seq peaks is considered “peak-centric” training.
Training on batches that mix multiple cell types, with peaks randomly assigned to them, is called “pan-cell” training.
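
The batch composition described above can be sketched as follows. The peak coordinates and cell type names are hypothetical, sampling with replacement is an assumption, and the real training code also attaches the ATAC-seq and ChIP-seq signal for each region, which is omitted here.

```python
import random

def make_pan_cell_batch(peaks, cell_types, batch_size, rng):
    """Draw peaks at random, then give each one a randomly chosen cell type."""
    batch = []
    for _ in range(batch_size):
        chrom, start, end = rng.choice(peaks)  # peak-centric: regions come from peaks
        batch.append((chrom, start, end, rng.choice(cell_types)))  # pan-cell
    return batch

# Hypothetical pooled peaks and cell types.
peaks = [("chr3", 1000, 2000), ("chr4", 5000, 6000), ("chr5", 9000, 10000)]
cell_types = ["GM12878", "HepG2", "K562"]

rng = random.Random(42)  # fixed seed so the sketch is repeatable
batches_per_epoch, batch_size = 100, 1000  # values stated in the text above
epoch = [
    make_pan_cell_batch(peaks, cell_types, batch_size, rng)
    for _ in range(batches_per_epoch)
]
print(len(epoch) * batch_size)  # number of examples seen in one epoch
```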
For every TF model, one cell type and two chromosomes are held out for independent testing.
maxatac train --genome hg38 --arch DCNN_V2 --sequence hg38.2bit --meta_file CTCF_meta.tsv --output ./CTCF_DCNN --prefix CTCF_DCNN --shuffle_cell_type --rev_comp
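
When launching many training runs, the same command can be assembled programmatically. The sketch below only rebuilds the example command shown above from a dictionary of options; it uses no flags beyond those in that example, and actually running it assumes `maxatac` is on the `PATH`.

```python
import shlex

# Options that take a value, copied from the example command.
options = {
    "--genome": "hg38",
    "--arch": "DCNN_V2",
    "--sequence": "hg38.2bit",
    "--meta_file": "CTCF_meta.tsv",
    "--output": "./CTCF_DCNN",
    "--prefix": "CTCF_DCNN",
}
# Boolean switches, also from the example command.
flags = ["--shuffle_cell_type", "--rev_comp"]

command = ["maxatac", "train"]
for name, value in options.items():
    command += [name, value]
command += flags

print(shlex.join(command))
# To launch the run: subprocess.run(command, check=True)
```

Passing the list form to `subprocess.run` avoids shell quoting problems if a path contains spaces.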
| Argument | Description |
|---|---|
| `--sequence` | The path to the 2bit DNA sequence file for the genome of interest. |
| `--meta_file` | The path to the meta file that describes the available training data. This meta file is described above. |
| `--genome` | The genome build to use (e.g. hg38). |
| `--prefix` | The prefix used to build the output filename. This can be any string. The extension .bw will be added to the filename prefix. Default: maxatac_model |
| `--train_roi` | The BED file that defines the training regions of interest. If you set this option, regions are randomly selected from this file for training instead of using the meta file to build the training data pool. |
| `--validate_roi` | The BED file that defines the validation regions of interest. If you set this option, regions are randomly selected from this file for validation instead of using the meta file to build the validation data pool. |
| `--target_scale_factor` | The factor used to scale the model's target signal. Used only for quantitative models. Default: 1 |
| `--output_activation` | The output activation to use for the model. No other options are considered in the publication. Test at your own risk. Default: sigmoid |
| `--chroms` | The list of chromosomes to limit the study to. These include the training and validation chromosomes. Default: ["chr2", "chr3", "chr4", "chr5", "chr6", "chr7", "chr9", "chr10", "chr11", "chr12", "chr13", "chr14", "chr15", "chr16", "chr17", "chr18", "chr19", "chr20", "chr21", "chr22", "chrX"]. |
| `--tchroms` | The list of chromosomes to use for training only. Default: ["chr3", "chr4", "chr5", "chr6", "chr7", "chr9", "chr10", "chr11", "chr12", "chr13", "chr14", "chr15", "chr16", "chr17", "chr18", "chr20", "chr21", "chr22"]. |
| `--vchroms` | The list of chromosomes to use for validation only. Default: ["chr2", "chr19"]. |
| `--arch` | The architecture to use for the neural network. Default: DCNN_V2. |
| `--rand_ratio` | The proportion of random regions to use per training and validation batch. This corresponds to the number of regions that are randomly selected from the genome as opposed to being created based on the ATAC-seq or ChIP-seq peaks. Default: 0. |
| `--seed` | The random seed to use for the model, for reproducibility. Default: `random.randint(1, 99999)`. |
| `--weights` | The weights to use for initializing a model prior to training. Default: do not initialize with weights. |
| `--epochs` | The number of epochs to train the model for. Default: 20. |
| `--batches` | The number of batches to use per epoch. Default: 100. |
| `--batch_size` | The number of examples to use per training batch. Default: 1000. |
| `--val_batch_size` | The number of examples to use per validation batch. Default: 1000. |
| `--output` | The name of the output directory to save results to. Default: ./training_results. |
| `--plot` | Whether to plot the model structure or training history. Default: True. |
| `--dense` | Use a dense layer at the end of the neural network. Default: False. |
| `--threads` | The number of threads used per job. Default: the number of available threads. |
| `--loglevel` | The logging level. Currently, the only working logging level is ERROR. |
| `--rev_comp` | If set, use the reverse complement sequence in addition to the reference sequence. Default: False. |
| `--shuffle_cell_type` | If set, shuffle the cell type labels of the training ROIs. This is related to “pan-cell” training as described in the maxATAC manuscript. Default: True. |
| `--multiprocessing` | If set, this option supports parallelization of training. Default: False. |
| `--max_queue_size` | |
| `--blacklist` | The path to a BED file that has regions to exclude. |
| `--chrom_sizes` | The path to the chromosome sizes file. |
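
The default chromosome lists for `--chroms`, `--tchroms`, and `--vchroms` can be cross-checked with a few set operations. The sketch below copies the three default lists from the argument descriptions. Its reading of the result is an inference, not something the documentation states: chr1 and chr8 are absent from the default `--chroms` list, which would be consistent with two chromosomes being held out for testing.

```python
# Defaults copied from the --chroms, --tchroms, and --vchroms descriptions.
CHROMS = [
    "chr2", "chr3", "chr4", "chr5", "chr6", "chr7", "chr9", "chr10", "chr11",
    "chr12", "chr13", "chr14", "chr15", "chr16", "chr17", "chr18", "chr19",
    "chr20", "chr21", "chr22", "chrX",
]
TCHROMS = [
    "chr3", "chr4", "chr5", "chr6", "chr7", "chr9", "chr10", "chr11", "chr12",
    "chr13", "chr14", "chr15", "chr16", "chr17", "chr18", "chr20", "chr21",
    "chr22",
]
VCHROMS = ["chr2", "chr19"]

# Training and validation chromosomes should not overlap.
overlap = set(TCHROMS) & set(VCHROMS)

# Chromosomes in --chroms that are neither training nor validation.
unassigned = set(CHROMS) - (set(TCHROMS) | set(VCHROMS))

# Autosomes (chr1..chr22) missing from --chroms, kept in numeric order.
autosomes = [f"chr{i}" for i in range(1, 23)]
absent_autosomes = [c for c in autosomes if c not in CHROMS]

print("overlap:", sorted(overlap))
print("in --chroms but not train/validation:", sorted(unassigned))
print("autosomes absent from --chroms:", absent_autosomes)
```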