I am a passionate data scientist obsessed with solving all kinds of data problems using machine learning. My ever-growing github is here. If you have any interesting math or data problems, feel free to contact me by email at “chutianwen123@gmail.com”.
Project Experience
Machine Learning Clustering on DNA Variation
DNA holds the key to the differences between humans: it explains hair color, skin color, and other genetic features. I contributed to a cpp application that, given an individual's DNA variations, identifies the population to which that individual belongs. The application illustrates the growing role of machine learning in bioscience.
The application uses a graph database called [“sparksee”](http://www.sparsity-technologies.com) to store DNA data from the 1000 Genomes Project. By querying the database through the sparksee API, the application calculates the pairwise similarity among individuals in parallel. From the resulting similarity matrix, a spectral clustering algorithm groups all individuals into a specified number of populations. This application plays an important role in a complex proof of concept in collaboration with the National Cancer Institute/Frederick National Laboratory for Cancer Research/Leidos Biomedical Research. This work won 3rd place out of 42 at the 4th Annual BioMedical Informatics Symposium at Georgetown University on October 16, 2015 and 1st place out of 78 at the Bio-IT World Conference Expo in Boston, MA on April 7, 2016.
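As a rough illustration of the pipeline described above (not the original cpp code), the sketch below computes a pairwise similarity matrix and splits the individuals into two groups with a basic spectral method. The Jaccard similarity, the two-group limit, and all function names are assumptions made for this example; the real application reads its data from sparksee and supports any number of populations.

```python
from itertools import combinations
from math import sqrt


def jaccard(a, b):
    """Jaccard similarity between two sets of variant identifiers."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def similarity_matrix(variants):
    """Return (names, matrix) of pairwise similarities between individuals."""
    names = sorted(variants)
    n = len(names)
    sim = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for i, j in combinations(range(n), 2):
        sim[i][j] = sim[j][i] = jaccard(variants[names[i]], variants[names[j]])
    return names, sim


def spectral_bipartition(sim, iterations=500):
    """Split nodes into two groups by the sign of the Fiedler vector.

    The Fiedler vector (eigenvector of the second-smallest eigenvalue of the
    graph Laplacian L = D - W) is found by power iteration on shift*I - L
    while projecting out the constant vector.
    """
    n = len(sim)
    deg = [sum(sim[i][j] for j in range(n) if j != i) for i in range(n)]
    # 2 * max degree is an upper bound on the eigenvalues of L (Gershgorin).
    shift = 2.0 * max(deg) if max(deg) > 0 else 1.0
    v = [float(i) for i in range(n)]  # deterministic starting vector
    for _ in range(iterations):
        mean = sum(v) / n
        v = [x - mean for x in v]  # remove the constant component
        w = [
            (shift - deg[i]) * v[i]
            + sum(sim[i][j] * v[j] for j in range(n) if j != i)
            for i in range(n)
        ]
        norm = sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    return [0 if x >= 0 else 1 for x in v]
```

Calling `similarity_matrix` on a dictionary that maps each individual to a set of variant identifiers, then passing the matrix to `spectral_bipartition`, yields one group label per individual. Which group receives label 0 is arbitrary.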
Deep Learning Projects through Tensorflow
Tensorflow with GPU support brings great benefits to the practice of deep learning in various fields. I have written several python programs that demonstrate applications of deep learning. My code is here; feel free to have a look.
- Applying a convolutional neural network (a mini AlexNet) to classify “CIFAR-10” images into ten classes.
- Applying recurrent neural networks to train a model that writes articles in the style of the books it learns from. You can give the program any text to train on, and it will learn to write in the corresponding style.
- Applying both recurrent and feed-forward neural networks with embedding techniques to sentiment analysis of movie reviews. Given a movie review, the program can tell whether it is positive or negative with 90% accuracy.
- Applying recurrent neural networks to analyze and try to predict stock prices. Given a historical time series of financial data, the program learns the pattern and predicts the stock price for the next few days.
- …
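For the stock price project in the list above, a recurrent network is typically trained on sliding windows cut from the historical series. The helper below is a minimal, framework-free sketch of that preparation step; the function name and the `window` and `horizon` parameters are assumptions for illustration, not taken from the actual programs.

```python
def make_windows(prices, window=5, horizon=1):
    """Split a price series into (input window, future targets) training pairs.

    Each pair holds `window` consecutive prices as input and the following
    `horizon` prices as the values the model should learn to predict.
    """
    pairs = []
    for start in range(len(prices) - window - horizon + 1):
        inputs = prices[start:start + window]
        targets = prices[start + window:start + window + horizon]
        pairs.append((inputs, targets))
    return pairs
```

A larger `horizon` corresponds to predicting several days ahead, as described in the last project above.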
SVM on Classification of Virus Type
Because DNA/RNA sequences code for proteins, a model can be trained to predict a virus's type from its RNA sequence. I designed a mismatch trie model that projects RNA sequences into numerical feature vectors. These vectors incorporate biological knowledge such as codons and amino acids. The program is written in cpp and processes sequences in parallel. After obtaining the feature vectors, I trained a multi-class SVM model in Matlab. Measured with cross-validation, the model reached a surprisingly high accuracy of around 99%.
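The feature mapping can be illustrated with a small sketch. It is not the original trie implementation: it enumerates mismatch neighbors directly, which gives the standard (k, m)-mismatch feature counts but is much slower than a trie on long sequences. The choice of k = 3 (one codon), m = 1, and the function names are assumptions for this example.

```python
from collections import Counter
from itertools import combinations, product

ALPHABET = "ACGU"  # RNA nucleotides


def neighbors(kmer, m):
    """All strings over ALPHABET within Hamming distance <= m of kmer."""
    result = {kmer}
    for distance in range(1, m + 1):
        for positions in combinations(range(len(kmer)), distance):
            # At each chosen position, substitute every other nucleotide.
            choices = [[c for c in ALPHABET if c != kmer[p]] for p in positions]
            for replacement in product(*choices):
                chars = list(kmer)
                for p, c in zip(positions, replacement):
                    chars[p] = c
                result.add("".join(chars))
    return result


def mismatch_features(seq, k=3, m=1):
    """Sparse feature vector: each k-mer in seq adds 1 to every neighbor k-mer."""
    counts = Counter()
    for i in range(len(seq) - k + 1):
        for neighbor in neighbors(seq[i:i + k], m):
            counts[neighbor] += 1
    return counts
```

The resulting counts, laid out over all possible k-mers in a fixed order, form the kind of vector that can be passed to an SVM.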
Big Data Software Research and Benchmark (Spark, Sparksee, MemSQL, CrayRDF, Neo4j)
As a data scientist, knowing how to use different software and tools to solve big data problems is a must. Moreover, most bioscience projects are closely tied to the “big data” concept because DNA information can be very complex. For example, a person's genome consists of about 3 billion nucleotides. Therefore, the speed of data handling, database ingestion, and querying becomes a crucial factor when deciding which software to use. Meanwhile, graphs are becoming a more popular way to speed up queries and illustrate data.
Given vcf files of up to 1 TB from the 1000 Genomes Project, I designed a graph model of vertices and edges to represent different types of information such as chromosomes, variations, individuals, and populations.
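To make the graph model concrete, here is a minimal sketch that turns one vcf data line into vertices and edges. The vertex types follow the description above, but the edge labels `ON` and `HAS`, the variation identifier format, and the function name are hypothetical. Population vertices are left out because population membership comes from a separate sample panel rather than from the vcf line itself.

```python
def vcf_line_to_graph(samples, line):
    """Convert one VCF data line into sets of vertices and edges.

    `samples` is the list of sample names from the VCF header line, in the
    same order as the genotype columns (column 10 onward).
    """
    fields = line.rstrip("\n").split("\t")
    chrom, pos, _var_id, ref, alt = fields[:5]
    genotypes = fields[9:]

    chromosome = ("chromosome", chrom)
    variation = ("variation", f"{chrom}:{pos}:{ref}>{alt}")
    vertices = {chromosome, variation}
    edges = {(variation, "ON", chromosome)}

    for sample, genotype in zip(samples, genotypes):
        call = genotype.split(":")[0]  # the GT sub-field comes first
        alleles = call.replace("|", "/").split("/")
        # Link the individual only if it carries a non-reference allele.
        if any(allele not in ("0", ".") for allele in alleles):
            individual = ("individual", sample)
            vertices.add(individual)
            edges.add((individual, "HAS", variation))
    return vertices, edges
```

Storing only non-reference calls keeps the edge count far below the number of samples times the number of variations, which matters at this data size.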
Based on this graph model, I wrote multiple scala and python programs using different software tools. The combinations include scala + spark + graphX, scala + spark + MemSQL, python + sparksee, python + CrayRDFTriple, and scala + spark + ignite, among others. They all follow the same workflow logic so that they can be benchmarked against each other. In terms of performance, tuning difficulty, and learning curve, each solution has its own pros and cons. For the spark-based solutions, for example, optimizing the processing algorithm and the software/hardware environment parameters achieved a 10X speed increase. After a series of tests on an SGI UV300 supercomputer, I summarized each solution's characteristics and bottlenecks on this special hardware.
Tax-fraud Full-Stack Application
A data scientist is often also a full-stack software engineer. Before 2016, the IRS was still looking for a programmatic way to handle tax return fraud. I designed a graph model to represent tax information including companies, individuals, W-2s, and 1099s. I then developed a python back-end program to process companies' W-2s (in xml format) and ingest them into the graph database sparksee. Once a person's tax return is submitted (in xml format), the back-end program parses all the fields, compares them with the corresponding ones in the database, and decides whether the return is fraudulent according to business rules. For the front-end UI and its interaction with the database, I wrote an AngularJS program to handle user queries and display results.
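The comparison step can be sketched as follows. This is only an illustration: the xml element names (`W2`, `EmployerEIN`, `SSN`, `Wages`), the single wage-matching rule, and the tolerance are all hypothetical, and a plain dictionary stands in for the sparksee graph database.

```python
import xml.etree.ElementTree as ET


def parse_w2_amounts(xml_text):
    """Map (employer EIN, employee SSN) to reported wages for every W2 element."""
    root = ET.fromstring(xml_text)
    return {
        (w2.findtext("EmployerEIN"), w2.findtext("SSN")): float(w2.findtext("Wages"))
        for w2 in root.iter("W2")
    }


def flag_mismatches(employer_xml, return_xml, tolerance=1.0):
    """Compare wages claimed on a return with what employers filed."""
    on_file = parse_w2_amounts(employer_xml)   # stands in for the database
    claimed = parse_w2_amounts(return_xml)
    flags = []
    for key, wages in claimed.items():
        if key not in on_file:
            flags.append((key, "no matching W-2 on file"))
        elif abs(on_file[key] - wages) > tolerance:
            flags.append((key, "wages differ from employer filing"))
    return flags
```

An empty result means the return passed this rule; real business rules would cover many more fields and form types.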
Autonomous Drone Motion Control with SLAM
The ability to recognize the surrounding environment is a key factor for autonomous vehicles such as drones (quadrotors). For drones, verifying feasibility and reliability in simulation is critical because a failure in the field can destroy the vehicle.
I developed cpp and matlab programs implementing feedback controllers and a trajectory optimization method for a simulated quadrotor following an elliptical trajectory in ROS. I also calibrated a stereo camera pair to correct the images. After a series of 3D reconstructions from images, an initial visual odometry was achieved. Using GTSAM, I mapped the observations (matching features between images taken at different time frames, and IMU sensor measurements) and the states (3D locations) into a factor graph. After computing the maximum a posteriori estimate of the 3D positions, an optimized trajectory including loop closures was obtained. For the real field test, I wrote a matlab program that controls the quadrotor to reach a target and avoid dynamic obstacles autonomously.
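The idea of a feedback controller tracking an ellipse can be shown with a heavily simplified sketch. It treats the vehicle as a 2D point mass with directly commanded acceleration, which ignores attitude dynamics and is not the original cpp/matlab/ROS code. The PD-plus-feedforward law, the gains, the ellipse dimensions, and the explicit Euler integration are all assumptions chosen for illustration.

```python
from math import cos, sin, hypot


def ellipse_reference(t, a=2.0, b=1.0, omega=1.0):
    """Position, velocity, and acceleration of a point moving on an ellipse."""
    pos = (a * cos(omega * t), b * sin(omega * t))
    vel = (-a * omega * sin(omega * t), b * omega * cos(omega * t))
    acc = (-a * omega ** 2 * cos(omega * t), -b * omega ** 2 * sin(omega * t))
    return pos, vel, acc


def simulate(duration=10.0, dt=0.01, kp=16.0, kd=8.0):
    """Track the ellipse with a PD + feedforward controller; return position errors."""
    pos = [2.0, 0.0]  # start on the ellipse at t = 0
    vel = [0.0, 0.0]  # start at rest, so there is an initial velocity error
    errors = []
    for step in range(int(round(duration / dt))):
        p_ref, v_ref, a_ref = ellipse_reference(step * dt)
        errors.append(hypot(p_ref[0] - pos[0], p_ref[1] - pos[1]))
        for axis in range(2):
            # Commanded acceleration: feedforward plus PD feedback on the error.
            u = (a_ref[axis]
                 + kp * (p_ref[axis] - pos[axis])
                 + kd * (v_ref[axis] - vel[axis]))
            pos[axis] += dt * vel[axis]  # explicit Euler step
            vel[axis] += dt * u
    return errors
```

With `kp = 16` and `kd = 8` the error dynamics are critically damped, so the initial velocity mismatch dies out without oscillation and the remaining error comes mainly from the integration step size.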
Skill-Set
Programming Language
- python
- scala
- cpp
- java
- matlab
- R
Framework and Platform
- spark(GraphX, SparkSQL)
- tensorflow
- cloudera
Database
- MemSQL
- SQLite
- MySQL
- HBase
- Sparksee
- Neo4j
Operating System
- Windows
- Linux
Project Management
- Agile Jira
- BitBucket
- Github