Identifying association model for single-nucleotide polymorphisms of ORAI1 gene for breast cancer

Background ORAI1 channels play an important role in breast cancer progression and metastasis. Previous studies indicated a strong correlation between breast cancer and individual single nucleotide polymorphisms (SNPs) of the ORAI1 gene. However, possible SNP-SNP interactions of the ORAI1 gene have not been investigated. Results To analyze SNP-SNP interactions, we propose a genetic algorithm (GA) to detect models of breast cancer association among five SNPs (rs12320939, rs12313273, rs7135617, rs6486795 and rs712853) of the ORAI1 gene. For individual SNPs, the differences between the case and control groups were not significant for any of the five SNPs. In contrast, the GA-generated 2-SNP (rs12320939-GT/rs6486795-CT), 3-SNP (rs12320939-GT/rs12313273-TT/rs6486795-TC), and 5-SNP (rs12320939-GG/rs12313273-TC/rs7135617-TT/rs6486795-TT/rs712853-TT) models show higher risks for breast cancer in terms of odds ratio (1.357, 1.689, and 13.148, respectively). Conclusion Taken together, the GA-generated SNP models demonstrate the cumulative effects of ORAI1 SNPs in this breast cancer association study.


Background
Single nucleotide polymorphisms (SNPs) are among the most common variants in the human genome [1]. SNPs have been widely applied in association studies of complex diseases [2][3][4]. Genome-wide association studies (GWAS) can identify SNPs that predispose to many diseases [5][6][7][8]. Although GWAS cover SNPs across the human genome, many SNPs that are not individually significant are commonly ignored. Recently, the joint effects of gene-gene interactions have gradually been uncovered in predicting the risk of many diseases [9][10][11][12]. However, simultaneously evaluating the interactions among a very large number of SNPs is complex and may require new strategies [13] or computational methods [14].
Similarly, non-GWAS association studies tend to ignore possible gene-gene interactions. For example, several individual SNPs of the ORAI calcium release-activated calcium modulator 1 (ORAI1) gene have been reported to be involved in breast cancer susceptibility [15]. However, possible SNP-SNP interactions of the ORAI1 gene associated with breast cancer were not addressed. Different computational analyses have been introduced to examine SNP-SNP interactions in many association studies [14][16][17][18][19][20][21][22][23]. The genetic algorithm (GA) has potential for feature selection in genome-wide scale datasets [24] and may be applied to compute the difference between case and control groups in order to identify good models among the very large number of SNP combinations, as well as to tagSNP selection [25].
To address possible SNP-SNP interactions in breast cancer susceptibility, five tagSNPs (rs12313273, rs6486795, rs7135617, rs12320939, and rs712853) of the ORAI1 gene were selected in this study. We introduce a GA to optimize the analysis of SNP-SNP interactions of the ORAI1 gene associated with breast cancer. The GA is used to identify the best SNP models (SNP combinations with genotypes), that is, those with the maximum frequency difference between the breast cancer and control groups. The best GA-generated SNP models of the ORAI1 gene may therefore be useful for predicting breast cancer risk.

Data set collection
The subjects are 345 female breast cancer patients (cases) and 290 female normal controls; recruitment was approved by the Cancer Center of Kaohsiung Medical University Hospital. The genotype dataset of the breast cancer patients for five tagSNPs (rs12313273, rs6486795, rs7135617, rs12320939, and rs712853) of the ORAI1 gene with minor allele frequency (MAF) >10% was obtained from our previous study [15]. For the normal controls, samples were collected in the current study and SNP genotyping was performed as described [15].

Genetic algorithm
The GA [26] is a well-known evolutionary algorithm that has been applied to complex problems in several fields. It simulates natural evolution, including selection, crossover, mutation, and inheritance, to generate solutions to complex problems. The GA process has six steps: (1) initialize the population, (2) evaluate chromosome values, (3) select two parents using the selection operation, (4) crossover operation, (5) mutation operation, and (6) replacement operation.
In the first step, a population is initialized according to the encoding scheme of the problem. The second step evaluates the chromosomes in the population using the fitness function. The third step uses the evaluated values to select two good parents, from which two offspring are generated (step 4). The fifth step mutates the two offspring with a given probability. The final step improves the value of the population. Repeating steps 2 to 6 over several generations effectively searches for good chromosomes, and the best chromosome in the population is regarded as the best solution. Algorithm 1 shows the GA pseudo-code, and the following sections explain the six steps in detail.

Algorithm 1. GA pseudo-code
begin
  initialize population
  for g = 1 to number of generations
    evaluate chromosome values using fitness function
    select two parents using selection operation
    generate two offspring using crossover operation
    mutate two offspring using mutation operation
    improve the value of population using replacement operation
  next g
end

Encoding schemes
A population consists of several possible solutions to the problem. A possible solution in GA is called a chromosome, which is a set C = {c_1, …, c_d}. In this study, a chromosome represents a possible model of association between SNPs. All combinations of SNPs and genotypes can be represented as a set A = S × G = {(s, g) | s ∈ S and g ∈ G}, where S is a set of SNPs and G is a set of genotypes. For example, assume that S contains two SNPs and G contains three genotypes, i.e., S = {s_1, s_2} and G = {g_1, g_2, g_3}.
All possible pairs can then be represented as A = S × G = {(s_1, g_1), (s_1, g_2), (s_1, g_3), (s_2, g_1), (s_2, g_2), (s_2, g_3)}. Each element of A represents a selected SNP and its genotype. A chromosome C = {c_1, …, c_d} is composed of elements of A, where d is the association model size. A possible chromosome in the above example is C = {(s_1, g_1), (s_2, g_2)}; it denotes a model that includes genotype "AA" of the first SNP and genotype "Aa" of the second SNP.
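The encoding described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the labels `s1`, `s2`, `g1`, `g2`, `g3` and the helper `is_valid` are hypothetical names chosen here, and a chromosome is stored as a tuple of (SNP, genotype) pairs rather than in any form prescribed by the study.

```python
from itertools import product

# Hypothetical labels: two SNPs and three genotypes, as in the example above.
S = ["s1", "s2"]
G = ["g1", "g2", "g3"]

# A = S x G: every (SNP, genotype) pair that a chromosome may contain.
A = list(product(S, G))

# A chromosome of model size d = 2, i.e. C = {(s1, g1), (s2, g2)}.
C = (("s1", "g1"), ("s2", "g2"))


def is_valid(chromosome, pairs):
    """Return True if every element comes from A and no SNP appears twice."""
    snps = [snp for snp, _ in chromosome]
    return all(c in pairs for c in chromosome) and len(set(snps)) == len(snps)


print(len(A))          # 6
print(is_valid(C, A))  # True
```

Storing each element as an explicit pair keeps the SNP and its genotype together, which makes the repeated-SNP check used by the fitness function straightforward.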

Fitness function
The value of a chromosome C is evaluated by the fitness function, which allows the GA to eliminate the worst chromosomes of the population in each generation. In this study, the fitness function is based on the difference between the numbers of cases and controls that carry a model. Equation 1 checks whether a SNP is selected more than once in a model. If a SNP is selected repeatedly in C, the value of C is set to zero. Otherwise, Equation 2 calculates the difference between the numbers of cases and controls that carry the model. In Equation 2, max_P and max_N are the total numbers of case and control samples, respectively. P and N denote the sets of case and control data, respectively; P_i is the i-th patient sample in the case data and N_i is the i-th normal sample in the control data. Equation 3 evaluates whether all factors in a model are present in a sample. If a sample includes the model, Equation 3 returns one to Equation 2; otherwise, it returns zero.

Selection operation
The selection operation selects good chromosomes for generating offspring; the selected chromosomes are called parents. In this study, the selection operation uses a rank-based tournament scheme to select two parents. The fitness function is used to evaluate all chromosomes of a population P = {C_1, …, C_i | i is the population size}, and the resulting values are recorded in a set R = {r_1, …, r_i | i is the population size}. These values determine the chromosome ranks. R is then sorted in descending order, i.e., r_1 ≥ r_2 ≥ … ≥ r_i, and the chromosomes in P corresponding to r_1 and r_2 are selected as the two parents.
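A minimal Python sketch of the three roles described for Equations 1 to 3 follows. It is not the authors' formulation: the function names, the dictionary representation of a sample, the SNP labels `snp1` and `snp4`, and the toy genotypes are all assumptions made for illustration. In particular, the signed count difference (cases minus controls) is an assumption; the original Equation 2 may instead use an absolute value or frequencies normalized by max_P and max_N.

```python
def has_repeated_snp(model):
    """Role of Equation 1: detect a SNP that is selected more than once."""
    snps = [snp for snp, _ in model]
    return len(snps) != len(set(snps))


def matches(model, sample):
    """Role of Equation 3: 1 if the sample carries every (SNP, genotype) pair, else 0."""
    return int(all(sample.get(snp) == genotype for snp, genotype in model))


def fitness(model, cases, controls):
    """Role of Equation 2: matching cases minus matching controls (assumed signed)."""
    if has_repeated_snp(model):
        return 0
    n_cases = sum(matches(model, p) for p in cases)
    n_controls = sum(matches(model, n) for n in controls)
    return n_cases - n_controls


# Toy data with made-up genotypes; each sample maps a SNP label to a genotype.
cases = [
    {"snp1": "GT", "snp4": "TC"},
    {"snp1": "GT", "snp4": "TC"},
    {"snp1": "GG", "snp4": "TT"},
]
controls = [
    {"snp1": "GT", "snp4": "TC"},
    {"snp1": "GG", "snp4": "TT"},
]

model = (("snp1", "GT"), ("snp4", "TC"))
print(fitness(model, cases, controls))  # 1 (two matching cases, one matching control)
```

Because the case and control groups differ in size, a variant that divides each count by its group size would compare frequencies rather than raw counts; which of the two the study uses cannot be read off the text alone.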
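The selection step described above, which ranks all chromosomes by fitness and takes the two best as parents, can be sketched as follows. The function name `select_parents`, the toy population, and the lookup table of fitness values are hypothetical and serve only to make the example runnable.

```python
def select_parents(population, fitness_fn):
    """Sort chromosomes by descending fitness and return the top two."""
    ranked = sorted(population, key=fitness_fn, reverse=True)
    return ranked[0], ranked[1]


# Toy population with made-up fitness values for each chromosome.
toy_scores = {
    (("s1", "g1"), ("s2", "g2")): 5,
    (("s1", "g2"), ("s2", "g3")): 12,
    (("s1", "g3"), ("s2", "g1")): 8,
}
population = list(toy_scores)

parent1, parent2 = select_parents(population, toy_scores.get)
print(parent1)  # (('s1', 'g2'), ('s2', 'g3'))
print(parent2)  # (('s1', 'g3'), ('s2', 'g1'))
```

Since the two highest-ranked chromosomes are always chosen, this scheme is deterministic for a given population.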

Crossover operation
The crossover operation generates offspring from the parents using a uniform crossover scheme. Uniform crossover first generates a binary mask set B = {b 1
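For orientation, a textbook uniform crossover with a binary mask can be sketched as below. The description of the mask in this section is incomplete, so the details here are assumptions rather than the study's exact operator: the function name `uniform_crossover`, the optional `mask` argument, and the convention that a mask bit of 1 swaps the genes at that position are all illustrative choices.

```python
import random


def uniform_crossover(parent1, parent2, mask=None, rng=random):
    """Build two offspring; where the mask bit is 1 the parental genes are swapped."""
    if mask is None:
        mask = [rng.randint(0, 1) for _ in parent1]
    child1 = tuple(b if bit else a for a, b, bit in zip(parent1, parent2, mask))
    child2 = tuple(a if bit else b for a, b, bit in zip(parent1, parent2, mask))
    return child1, child2


p1 = (("s1", "g1"), ("s2", "g2"))
p2 = (("s1", "g3"), ("s2", "g1"))

# With an explicit mask [0, 1] only the second position is exchanged.
c1, c2 = uniform_crossover(p1, p2, mask=[0, 1])
print(c1)  # (('s1', 'g1'), ('s2', 'g1'))
print(c2)  # (('s1', 'g3'), ('s2', 'g2'))
```

Position-wise exchange keeps each gene at its original index, so offspring of two valid parents that list SNPs in the same order cannot acquire a repeated SNP.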

Replacement operation
The replacement operation aims to gradually improve the value of the population. The two generated offspring C'_1 and C'_2 are evaluated by the fitness function and compared with all chromosomes in the population. When an offspring has a higher value than a chromosome in the population, it replaces that chromosome; otherwise, the offspring is discarded.

Parameter settings
For the GA parameters, the exchange probabilities of tournament selection and uniform crossover are both 1.0. The exchange probability of one-point mutation is 0.1. The population size is 50, and the number of generations is 100.
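To show how the mutation probability of 0.1 might be used, the following sketch applies a one-point mutation to a chromosome. The mutation operator is not described in detail in this section, so this is a generic version under assumptions: the function name `one_point_mutation`, the choice to redraw the mutated element uniformly from the set A, and the labels `s1`, `s2`, `g1` to `g3` are all hypothetical.

```python
import random
from itertools import product

MUTATION_RATE = 0.1  # mutation probability from the parameter settings


def one_point_mutation(chromosome, pairs, rate=MUTATION_RATE, rng=random):
    """With probability `rate`, replace one randomly chosen element by a pair from A."""
    if rng.random() >= rate:
        return tuple(chromosome)  # no mutation this time
    position = rng.randrange(len(chromosome))
    mutated = list(chromosome)
    mutated[position] = rng.choice(pairs)  # assumed: uniform draw from A
    return tuple(mutated)


A = list(product(["s1", "s2"], ["g1", "g2", "g3"]))
C = (("s1", "g1"), ("s2", "g2"))

rng = random.Random(0)  # seeded generator so runs are repeatable
offspring = one_point_mutation(C, A, rng=rng)
print(len(offspring) == len(C))  # True
```

A mutation drawn this way can produce a chromosome with a repeated SNP; under the fitness function described earlier such a chromosome would receive a value of zero.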

Statistical analysis
All statistical values were computed using SPSS version 19.0 (SPSS Inc., Chicago, IL). The odds ratio (OR) with 95% confidence interval (CI) was used to measure the effect of a single SNP and of each model of association between SNPs; a P value of < 0.05 was considered to indicate a statistically significant difference between cases and controls.
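As an illustration of the statistic, the odds ratio and an approximate 95% CI can be computed from a 2 × 2 table as sketched below. The analysis in this study was run in SPSS; the logit (Woolf) interval used here is a common choice and is an assumption, not a statement of the exact procedure SPSS applied. The function name `odds_ratio_ci` and the counts are hypothetical and are not data from this study.

```python
import math


def odds_ratio_ci(case_with, case_without, control_with, control_without, z=1.96):
    """Odds ratio of a 2 x 2 table with a logit (Woolf) confidence interval."""
    a, b, c, d = case_with, case_without, control_with, control_without
    odds_ratio = (a * d) / (b * c)
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


# Made-up counts: 30 of 100 cases and 20 of 100 controls carry the model.
or_value, ci_low, ci_high = odds_ratio_ci(30, 70, 20, 80)
print(round(or_value, 3))  # 1.714
```

If the interval contains 1, the association is not significant at the corresponding level; the formula requires all four cell counts to be non-zero.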

Data collection
The complete genotype data set is available at http://bioinfo.kmu.edu.tw/BRCA-ORAI1-5SNPs.xlsx. Based on these data, the GA-generated SNP models addressing possible SNP-SNP interactions in the ORAI1 gene were evaluated in terms of breast cancer association.

Comparison of patients and normal controls in terms of the effect of single SNPs
Table 1 shows the occurrence of breast cancer for the five SNPs of the ORAI1 gene. The genotype carrying the major allele (G in rs12320939; T in rs12313273; G in rs7135617; T in rs6486795; and T in rs712853) is regarded as the reference for analyzing breast cancer risk for single SNPs. The minor allele was determined according to the dbSNP database of NCBI (National Center for Biotechnology Information). No significant differences between breast cancer patients and controls were found for any genotype of any single SNP.

Identification of the best model of SNP association with maximum frequency difference between breast cancer and control groups
During GA processing, the ten best models of two-SNP combinations with genotypes (2-SNP models) were identified (Table 2). Among these 2-SNP models, SNPs (1, 4) with genotypes 2-2, i.e., [rs12320939-GT]-[rs6486795-TC], showed the maximum frequency difference (7.20%) between the breast cancer and control groups and thus constitute the best 2-SNP model. Similarly, the best GA-generated SNP models involving three to five SNPs are shown on the left side of Table 3.

Discussion
GA is a robust non-parametric method that can detect nonlinear interactions among multiple discrete genetic factors. Its advantage is that it can directly search for good models among the very large number of possible combinations without a training data set. In this study, the fitness function is designed for an unbalanced data set and computes the difference between the case and control data sets. The function can effectively measure high risk when searching for good models in real data. In the current study, the OR values of the 2- to 3-SNP models are larger than 1 but small, suggesting that the cumulative effect of these four SNPs (rs12320939, rs7135617, rs6486795, and rs712853) is weak. When five SNPs are included, the OR value is 13.148, indicating that the cumulative effect of the 5-SNP model becomes strong. This unstable cumulative effect of SNP combinations may be partly explained by the experimental design, in which all five SNPs were derived from a single gene, ORAI1. Because breast cancer is a multigene disease [27][28][29][30], association studies that include SNPs from more genes may reveal the cumulative effect more effectively [9][11][12][31][32][33]. Accordingly, the difference in cumulative effects between SNPs from a single gene and SNPs from multiple genes is worth further investigation. The computational complexity of GA is determined by the number of fitness function evaluations. If n iterations are performed in a test, the computational complexity of GA is O(n) in big-O notation. In searching for good association models, GA has the following advantages: (1) it effectively identifies high-risk models in high-order interactions, (2) the best statistically significant model can be identified quickly, and (3) it has only two parameters to set and is easy to apply.
Further, in our experience, GA is able to analyze high-order SNP interactions among the very large number of SNPs from GWAS and pharmacogenomics studies.

Conclusions
Although polymorphisms of the ORAI1 gene have been reported to be associated with inflammatory diseases [34][35][36], the effects of SNP-SNP interactions on diseases are still unclear. In this study, the GA successfully identified appropriate models of SNP-SNP interaction among five SNPs of the ORAI1 gene in a breast cancer association study. The resulting SNP models can predict breast cancer susceptibility more effectively than individual SNPs. This methodology can also be applied to other kinds of SNP association studies, such as GWAS and pharmacogenomics. The possible cumulative effects of SNP combinations may thus be uncovered by this methodology.