In genetic studies, advances in high-throughput technology have allowed scientists to collect data on a larger scale and with greater complexity. As a result, the collected data are commonly high-dimensional, i.e. the number of features exceeds the number of observations, and heterogeneous. For example, in the study of liver cancer, the number of genes exceeds 25000 while the sample size is only in the hundreds. However, not all variables are useful for solving a given problem; only a small proportion should be used. Hence, selecting a useful subset of variables based on clinical information has received a lot of attention. Supervised learning and unsupervised learning are the two major problems of statistical learning. In supervised learning, we are given a labeled training set of variables and responses and fit a model to predict the response for new test data. When the response is continuous, the problem is known as regression; when the response is categorical, it is a classification problem. When responses are not available, the problem becomes one of unsupervised learning, in which we first need to recover the responses from the input variables. This is often referred to as a clustering problem. When the data are high-dimensional, both supervised and unsupervised problems face statistical and computational challenges. We therefore focus on three problems. Firstly, many gene expression data sets do not come with a response, so we need to cluster patients based on the input variables in order to turn the unsupervised learning problem into a supervised one. Secondly, although many methods for high-dimensional data have been proposed, they are not suitable for heterogeneous gene expression data; moreover, efficient methods for selecting a subset of variables that discriminates between groups are attracting growing attention.
Thirdly, it is important to find a suitable model to predict the labels of new data based on the selected variables and the recovered labels. In the real data analysis, we mainly study the identification of biomarkers and personalized treatments based on patients' gene expression data. For the supervised learning problem, in which the label information is available, we propose a framework for identifying biomarkers that consists of three steps: differential gene expression analysis based on the labels, pathway analysis, and random forest with 10-fold cross-validation. This framework provides a subset of useful genes and identifies the biomarkers based on the votes of the random forest. For the unsupervised learning problem, we propose a framework for clustering cancer patients for treatment, which sequentially biclusters patients and assigns different treatments to the resulting groups. Sequential biclustering is a novel biclustering method in which different groups may overlap only in their genes. This framework returns labels for the patients, which leads to the next step of identifying biomarkers and assigning suitable treatments to the different clusters of patients. Moreover, motivated by the real data studies, we consider the clustering problem for high-dimensional and heterogeneous data and propose a more efficient procedure based on marginal screening for a mixture regression model. This algorithm takes advantage of the heterogeneity of the data to filter out variables, which leads to lower storage costs and higher computation speed. Our method is more stable and more accurate than the existing method. Finally, we discuss some future work, including real data applications and extensions to generalized linear models.
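
To make the idea of marginal screening concrete, the following is a minimal sketch of a generic correlation-based screening step. It is not the procedure proposed in this work: it ignores the mixture structure entirely and simply ranks each variable by the absolute value of its marginal Pearson correlation with the response. The function name `marginal_screen`, the parameter `keep`, and the simulated data are all illustrative assumptions.

```python
import math
import random


def marginal_screen(X, y, keep):
    """Return the sorted indices of the `keep` columns of X whose absolute
    marginal Pearson correlation with y is largest.

    Illustrative only: a plain correlation ranking, not the
    mixture-regression screening procedure described in the text.
    """
    n = len(y)
    y_mean = sum(y) / n
    y_c = [v - y_mean for v in y]
    y_norm = math.sqrt(sum(v * v for v in y_c))

    scores = []
    for j in range(len(X[0])):
        col = [row[j] for row in X]
        c_mean = sum(col) / n
        col_c = [v - c_mean for v in col]
        c_norm = math.sqrt(sum(v * v for v in col_c))
        if c_norm == 0 or y_norm == 0:
            # A constant column carries no marginal signal.
            scores.append(0.0)
        else:
            dot = sum(a * b for a, b in zip(col_c, y_c))
            scores.append(abs(dot) / (c_norm * y_norm))

    # Rank variables by score, largest first, and keep the top `keep`.
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:keep])


# Hypothetical toy data: 100 observations, 200 variables,
# of which only the first two influence the response.
rng = random.Random(0)
n_obs, n_vars = 100, 200
X = [[rng.gauss(0, 1) for _ in range(n_vars)] for _ in range(n_obs)]
y = [3 * row[0] - 3 * row[1] + rng.gauss(0, 0.5) for row in X]

selected = marginal_screen(X, y, keep=10)
print(selected)
```

Because each variable is scored on its own, a screening step of this kind needs only one pass over the columns and never forms a full covariance matrix, which is the source of the storage and speed advantages mentioned above.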