
PROJECTS

Updated in May 2016

This project uses Sequential Monte Carlo (SMC) methods to simulate the self-avoiding walk (SAW) problem. The SAW problem is a mathematical abstraction of real problems such as counting the number of protein chain configurations in a given grid/area. Three Monte Carlo sampling designs are used and compared.
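Below is a minimal sketch of one such sequential design (Rosenbluth-style growth with importance weights), assuming an n x n lattice and a walk grown from a corner site; the grid size and walk length are illustrative choices, not the project's actual settings.

```python
# Sequential importance sampling of self-avoiding walks (Rosenbluth-style sketch).
import random

def sample_saw(n, length):
    """Grow one self-avoiding walk of up to `length` steps on an n x n grid.
    Returns the walk and its importance weight (product of choice counts)."""
    walk = [(0, 0)]
    visited = {(0, 0)}
    weight = 1.0
    for _ in range(length):
        x, y = walk[-1]
        # Unvisited neighbors that stay inside the grid.
        options = [(x + dx, y + dy)
                   for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1))
                   if 0 <= x + dx < n and 0 <= y + dy < n
                   and (x + dx, y + dy) not in visited]
        if not options:          # walk is trapped; stop early
            break
        weight *= len(options)   # importance weight corrects the biased growth
        step = random.choice(options)
        walk.append(step)
        visited.add(step)
    return walk, weight

# Averaging the weights over many samples estimates the number of SAWs of a given length.
samples = [sample_saw(10, 30) for _ in range(10000)]
```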

This project uses Markov chain Monte Carlo methods to conduct exact sampling (Gibbs sampler) of the Ising/Potts model with coupled Markov chains (black & white). The white chain starts with all sites set to 1 and the black chain starts with all sites set to 0. The left figure shows the final configuration of 1s (white) and 0s (black) after the two chains converge.
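Below is a minimal sketch of the coupling mechanic, assuming an Ising model (q = 2) on a small grid with free boundaries; beta and the grid size are illustrative, and a full exact sampler would run this coupled update from increasingly earlier start times (coupling from the past) rather than simply forward until the chains meet.

```python
# Coupled Gibbs sampler for the Ising model: both chains share the same random numbers.
import numpy as np

def coupled_gibbs(n=32, beta=0.5, seed=0):
    rng = np.random.default_rng(seed)
    white = np.ones((n, n), dtype=int)   # white chain: all sites 1
    black = np.zeros((n, n), dtype=int)  # black chain: all sites 0
    while not np.array_equal(white, black):
        for i in range(n):
            for j in range(n):
                u = rng.random()         # shared randomness couples the two chains
                for chain in (white, black):
                    # Neighbors inside the grid (free boundary).
                    nbrs = [chain[a, b]
                            for a, b in ((i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1))
                            if 0 <= a < n and 0 <= b < n]
                    n1 = sum(nbrs)
                    n0 = len(nbrs) - n1
                    # Conditional probability of the site being 1 given its neighbors.
                    p1 = np.exp(beta * n1) / (np.exp(beta * n1) + np.exp(beta * n0))
                    chain[i, j] = 1 if u < p1 else 0
    return white  # equals black once the chains have coalesced
```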

This project uses Monte Carlo simulation to conduct cluster sampling (Swendsen-Wang algorithm) of the Ising/Potts model and compares it with the Gibbs sampler. We found that cluster sampling is much more efficient and converges faster than the Gibbs sampler. Two different cluster-sampling schemes were used for the SW algorithm.
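Below is a minimal sketch of one Swendsen-Wang sweep for the q-state Potts model, assuming the 1{x_i = x_j} form of the coupling and a simple union-find to label clusters; beta and q are illustrative values.

```python
# One Swendsen-Wang sweep: bond equal-spin neighbors, then relabel each cluster.
import numpy as np

def sw_sweep(state, beta, q, rng):
    n = state.shape[0]
    parent = list(range(n * n))

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a

    def union(a, b):
        parent[find(a)] = find(b)

    p_bond = 1.0 - np.exp(-beta)
    # Turn on a bond between equal-spin neighbors with probability 1 - exp(-beta).
    for i in range(n):
        for j in range(n):
            for a, b in ((i + 1, j), (i, j + 1)):
                if a < n and b < n and state[i, j] == state[a, b] and rng.random() < p_bond:
                    union(i * n + j, a * n + b)
    # Assign every connected cluster a uniformly chosen new spin.
    new_label = {}
    for i in range(n):
        for j in range(n):
            root = find(i * n + j)
            if root not in new_label:
                new_label[root] = rng.integers(q)
            state[i, j] = new_label[root]
    return state

rng = np.random.default_rng(0)
state = rng.integers(3, size=(64, 64))
for _ in range(100):
    sw_sweep(state, beta=0.8, q=3, rng=rng)
```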

This project uses machine learning techniques to predict the geographical origin of traditional music. We apply and compare eight classification models, including the LASSO logistic model, Naive Bayes, Linear Discriminant Analysis (LDA), Random Forest, Support Vector Machines (SVM) with linear and Gaussian kernels, Kernel Regularized Least Squares (KRLS), and AdaBoost, to estimate the probability that a piece of music originates from each region. Cross-validation is used to search for the optimal parameters of each model. The best accuracy, 63%, is achieved by Random Forest.
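Below is a minimal sketch of the cross-validated model search, assuming scikit-learn; the synthetic data and parameter grids are placeholders for the real audio features, region labels, and tuning ranges.

```python
# Cross-validated parameter search for two of the candidate classifiers.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Placeholder data: 50 audio features, 4 candidate regions.
X, y = make_classification(n_samples=1000, n_features=50, n_informative=20,
                           n_classes=4, random_state=0)

models = {
    "random_forest": (RandomForestClassifier(random_state=0),
                      {"n_estimators": [100, 300], "max_depth": [None, 10]}),
    "svm_rbf": (SVC(kernel="rbf"), {"C": [0.1, 1, 10], "gamma": ["scale", 0.01]}),
}
for name, (model, grid) in models.items():
    search = GridSearchCV(model, grid, cv=5)   # 5-fold cross-validation
    search.fit(X, y)
    print(name, search.best_params_, round(search.best_score_, 3))
```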

This project seeks to use machine learning algorithms to predict Diabetic Retinopathy (DR) from retinal image features in order to improve the efficiency of DR diagnosis during eye examinations. We compare and analyze 10 different statistical prediction models using image-quality, lesion-specific, and anatomical features (19 in total). The best prediction model is the Support Vector Machine (SVM), with an accuracy of 75%, followed by the weighted model ensemble and the LASSO model.
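Below is a minimal sketch of the SVM versus weighted-ensemble comparison, again assuming scikit-learn; the synthetic 19-feature data and the ensemble weights are placeholders, not the project's actual inputs.

```python
# Compare an SVM, a LASSO-style logistic model, and a weighted soft-voting ensemble.
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Placeholder data: 19 image-quality / lesion / anatomical features, binary DR label.
X, y = make_classification(n_samples=1100, n_features=19, n_informative=10, random_state=0)

svm = SVC(kernel="rbf", probability=True)
lasso = LogisticRegression(penalty="l1", solver="liblinear", C=0.5)
ensemble = VotingClassifier([("svm", svm), ("lasso", lasso)],
                            voting="soft", weights=[2, 1])

for name, model in [("svm", svm), ("lasso", lasso), ("weighted ensemble", ensemble)]:
    print(name, round(cross_val_score(model, X, y, cv=5).mean(), 3))
```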

This project studies how the social attributes of human faces affect governor elections, using Support Vector Machines (SVM). First, landmark and HoG features of the faces are extracted from hundreds of politician facial photos to train an SVM model for each of 14 perceived social attributes (Old, Masculine, Baby-faced, Competent, Attractive, Energetic, Well-groomed, Intelligent, Honest, Generous, Trustworthy, Confident, Rich, Dominant), followed by k-fold cross-validation. The SVM model with the highest accuracy and precision is selected. The best SVM models of the social attributes are then used to train an "election-winner" SVM classifier.
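Below is a minimal sketch of one attribute classifier (HoG features into a linear SVM with k-fold cross-validation), assuming scikit-image and scikit-learn; the random arrays stand in for the aligned grayscale politician photos and the "Trustworthy" labels.

```python
# HoG features + linear SVM for one social attribute, scored by k-fold CV.
import numpy as np
from skimage.feature import hog
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(0)
images = rng.random((40, 128, 128))            # placeholder face images (grayscale)
labels = np.array([0, 1] * 20)                 # placeholder "Trustworthy" labels

# One HoG descriptor per face image.
features = np.array([hog(img, orientations=9, pixels_per_cell=(16, 16),
                         cells_per_block=(2, 2)) for img in images])

clf = SVC(kernel="linear", C=1.0)
scores = cross_val_score(clf, features, labels, cv=5)   # k-fold cross-validation
print("mean CV accuracy:", scores.mean())
```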

The AdaBoost and RealBoost techniques are used to train on >75,000 face/non-face images with 12,000 face features, constructing a strong classifier from 100 weak classifiers for detecting human faces in a class photo. The strong classifier from the AdaBoost results is run on background classroom images to identify false-alarm samples (i.e., hard negative mining). Non-maximum suppression (NMS) is then applied to remove detections with strong overlaps. The detected false-alarm samples (~2,500) are added to the negative training samples, and the AdaBoost model is re-trained to obtain a refined strong classifier for the test images. Image boxes (red) with scores greater than the threshold are identified as faces (positive).
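Below is a minimal sketch of the non-maximum suppression step, assuming detections are given as (x1, y1, x2, y2) boxes with classifier scores; the IoU threshold is an illustrative choice.

```python
# Greedy non-maximum suppression over scored detection boxes.
import numpy as np

def nms(boxes, scores, iou_threshold=0.3):
    """Keep the highest-scoring boxes, dropping any box whose overlap (IoU)
    with an already kept box exceeds the threshold."""
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        # Intersection of the current box with the remaining candidates.
        xx1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        yy1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        xx2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        yy2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.maximum(0, xx2 - xx1) * np.maximum(0, yy2 - yy1)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_o = (boxes[order[1:], 2] - boxes[order[1:], 0]) * \
                 (boxes[order[1:], 3] - boxes[order[1:], 1])
        iou = inter / (area_i + area_o - inter)
        order = order[1:][iou <= iou_threshold]
    return keep

boxes = np.array([[10, 10, 60, 60], [12, 12, 62, 62], [100, 100, 150, 150]], dtype=float)
scores = np.array([0.9, 0.8, 0.7])
print(nms(boxes, scores))   # -> [0, 2]: the second box overlaps the first too much
```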

Principal component analysis (PCA) is used to obtain the top 20 eigenfaces from ~180 facial images, with and without landmark warping (i.e., geometric correction). The eigenfaces are further used to reconstruct the original faces. To differentiate male and female faces, Fisher's Linear Discriminant (FLD) is applied to the appearance and geometry traits of the faces.
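Below is a minimal sketch of the eigenface computation via SVD, with random arrays standing in for the ~180 aligned face images and the gender labels; the FLD step at the end uses scikit-learn's implementation.

```python
# Eigenfaces from mean-centered images, plus an FLD on the PCA coefficients.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
faces = rng.random((180, 64 * 64))             # placeholder: 180 flattened face images

mean_face = faces.mean(axis=0)
centered = faces - mean_face
# Rows of Vt are the principal directions (eigenfaces).
U, S, Vt = np.linalg.svd(centered, full_matrices=False)
eigenfaces = Vt[:20]                           # top 20 eigenfaces

# Reconstruct one face from its 20 PCA coefficients.
coeffs = centered[0] @ eigenfaces.T
reconstruction = mean_face + coeffs @ eigenfaces

# Fisher Linear Discriminant on the PCA coefficients (placeholder gender labels).
genders = rng.integers(0, 2, size=180)
fld = LinearDiscriminantAnalysis().fit(centered @ eigenfaces.T, genders)
```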

This project reconstructs the global mean surface temperature from 17 key proxy factors (e.g., ice core data, tree ring data, sediment analysis, etc.) and compares it with the observed temperature trend. Principal component analysis (PCA) and multiple linear regression are used to identify the top 7 most effective factors and reconstruct the temperature change during 1400-2000. Residual analysis, autocorrelation tests, and various transformations of the original parameters/factors were also carried out in this process.
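Below is a minimal sketch of the PCA-plus-regression reconstruction, assuming scikit-learn; the random arrays are placeholders for the 17 proxy series and the instrumental overlap period.

```python
# Reduce the proxy series to leading principal components, then regress on them.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
proxies = rng.random((600, 17))        # placeholder: 600 years x 17 proxy factors
observed = rng.random(150)             # placeholder: instrumental overlap period

pcs = PCA(n_components=7).fit_transform(proxies)    # top 7 principal components
reg = LinearRegression().fit(pcs[-150:], observed)  # calibrate on the overlap years
reconstruction = reg.predict(pcs)                   # reconstructed 1400-2000 series
```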

My Small Computer Games coded in Python (CodeSkulptor Platform)

Asteroid wars

Blackjack

Simplified Tennis

Remembering cards

More interesting projects are coming!

Courses

UCLA Dept. of Statistics Courses:

1. Applied Probability
2. Large Sample Theory, Including Resampling
3. Research Design, Sampling, and Analysis
4. Statistical Modeling and Learning
5. Advanced Modeling and Inference
6. Statistics Programming
7. Matrix Algebra and Optimization
8. Pattern Recognition and Machine Learning
9. Monte Carlo Methods for Optimization

 

Online Data Science Courses:

edX:

1. Introduction to Big Data with Apache Spark
2. Scalable Machine Learning
3. Knowledge Management and Big Data in Business
4. Introduction to Computer Science and Programming Using Python

Coursera:

1. An Introduction to Interactive Programming in Python
2. Computing for Data Analysis
3. Mathematical Biostatistics Boot Camp

Other:

1. Computer Science 101 (Stanford online)
2. Data Lakes for Big Data (EMC)
3. Fundamentals of NoSQL data management (CouchBase)
