Rachel was a Fellow in our Spring 2017 cohort and an instructor for our Summer 2017 cohort.
Tell us about your background. How did it set you up to be a great Data Scientist?
My background is in neuroscience; specifically, I studied how images are processed in the visual system of biological brains.
Could you tell us about your Data Incubator Capstone project?
My goal was to determine whether one could build an app that guides poorly performing taxi drivers so they can increase their hourly wage. One important decision a taxi driver makes is where to go to look for customers. A naive strategy is simply to stay put and wait in the same general area until a customer arrives. Our algorithm determines where a taxi driver should search for customers by looking at the behavior of high-performing drivers, and it also factors in driving time under traffic and the cost of gas.
How did you come up with the idea for the project?
I began my project with some exploration of the images. I used t-SNE (t-distributed stochastic neighbor embedding) in scikit-learn, a tool for visualizing high-dimensional data. Visualizing each image as a point in a 3-D plot showed that the three classes of cervixes did not cluster into separate groups. I also used a hierarchical cluster analysis in seaborn to confirm that the images did not easily group together by their three classes.
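This kind of exploration can be sketched minimally as follows, assuming the images have already been flattened into feature vectors (the array below is random placeholder data, not the actual image dataset):

```python
import numpy as np
from sklearn.manifold import TSNE

# Placeholder data: 60 "images" flattened to 128-dim vectors, 3 classes
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 128))
labels = np.repeat([0, 1, 2], 20)

# Embed into 3 dimensions for plotting; perplexity must be < n_samples
tsne = TSNE(n_components=3, perplexity=15, random_state=0)
embedding = tsne.fit_transform(X)

print(embedding.shape)  # one 3-D point per image: (60, 3)
```

Plotting `embedding` colored by class label makes any clustering (or lack of it) visible; seaborn's `clustermap` gives the complementary hierarchical-clustering view of the same feature matrix.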
What technologies did you use and what skills did you learn at TDI that you applied to the project?
After some research I found that convolutional neural networks (CNNs) are the type of neural network best suited to complex image-classification tasks. While there are over a dozen Python frameworks for CNNs, I ended up installing Keras with the TensorFlow backend on my DigitalOcean box. I chose Keras because it has a large and fast-growing community of researchers and organizations, with lots of online resources and documentation. I was also interested in trying a pre-trained model for extracting features, as well as building the network architecture myself; both options are relatively straightforward in Keras. I tried two sets of pre-trained weights, VGG16 and VGG19, released by the Visual Geometry Group at Oxford. My highest validation accuracy came from training my own custom CNN with two Conv2D layers to extract features, two max-pooling layers to downsample, two fully connected (dense) layers, and two dropout layers to prevent overfitting.
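A Keras sketch of that custom architecture might look like the following. The input size, filter counts, kernel sizes, and dropout rates here are illustrative guesses, not the exact values used:

```python
from tensorflow import keras
from tensorflow.keras import layers

# Illustrative hyperparameters -- not the exact values from the project
model = keras.Sequential([
    layers.Input(shape=(128, 128, 3)),
    layers.Conv2D(32, (3, 3), activation="relu"),  # feature extraction
    layers.MaxPooling2D((2, 2)),                   # downsample
    layers.Conv2D(64, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dropout(0.5),                           # reduce overfitting
    layers.Dense(128, activation="relu"),          # fully connected
    layers.Dropout(0.5),
    layers.Dense(3, activation="softmax"),         # three cervix types
])
model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])
```

The pre-trained alternative is equally compact: `keras.applications.VGG16(weights="imagenet", include_top=False)` yields a frozen feature extractor whose outputs feed a small dense classifier on top.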
What was your most surprising or interesting finding?
The thing that surprised me the most while working on my project was the lack of improvement that came from trying different pre-processing techniques. At the core of what separates the different cervix types is the amount of the “transition zone” visible on the cervix. This “transition zone” was usually at the center of the image, surrounded by thousands of pixels containing no obviously useful classification information. I experimented with Otsu thresholding in skimage, a histogram-based method that separates foreground from background. I also tried segmentation with a Gaussian mixture model analysis in sklearn. I was hoping that removing the background of the image and extracting just the cervix would improve the model’s accuracy. Since neither pre-processing technique improved model performance, I suspect there was simply too much variation in the foreground and background conditions of the individual images.
Describe the business application for this project. (How could a company use your work or your data?)
The overall accuracy of my classifier when predicting a cervix image as one of three types is 57%. However, in healthcare applications, both types 2 and 3 require additional cancer screening. When I combine types 2 and 3 into a single group, I can predict an image as either type 1 or type 2/3 with 87% accuracy. Thus, this application could increase efficiency in identifying the need for additional cancer screenings during cervical examinations.
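Collapsing types 2 and 3 into a single "needs further screening" class is just a relabeling of ground truth and predictions before scoring. A toy sketch with made-up labels (not the project's actual results):

```python
# Toy ground-truth and predicted cervix types (1, 2, or 3)
y_true = [1, 2, 3, 1, 3, 2, 1, 2]
y_pred = [1, 3, 2, 1, 3, 2, 2, 2]

def collapse(label):
    # Types 2 and 3 both require additional screening -> one class
    return 1 if label == 1 else 23

merged_true = [collapse(y) for y in y_true]
merged_pred = [collapse(y) for y in y_pred]

acc3 = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
acc2 = sum(t == p for t, p in zip(merged_true, merged_pred)) / len(y_true)
print(acc3, acc2)  # 0.625 0.875
```

Type 2/3 confusions that count as errors in the three-class setting become correct "needs screening" calls after merging, which is why the binary accuracy is higher.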