So you want to know what is in your genome. Maybe, like me, you spent 20 minutes spitting into a tube and sent it off to a company like 23andMe to have your genome arrayed for Single Nucleotide Polymorphisms (SNP, but never "snip"; I tend not to pronounce acronyms unless they are phonetic, since otherwise it prevents people from looking them up on the internet). What 23andMe does is match these SNPs to published studies that associate a DNA sequence with a specific phenotype in a human being. They tell you that you are Caucasian and have a probability of having wet earwax and blue hair.
I had an idea: what if we could create a noninvasive way to composite and predict a genome, like predicting a protein sequence from its structure? What if we could take phenotypic features from a picture of someone and composite a genome? What if, just from a picture of someone, you could tell whether they had a gene associated with a specific disease or disease risk? Did you know that some genes (alleles) occur in greater than 90% of people with certain phenotypic traits?
This project is speculative. Our current understanding of our genomes only allows major traits to be distinguished. No one has studied things such as the association of chin-to-lip length with genes linked to bad teeth. This will happen, though. Eventually our understanding of our genomes will allow one to learn a lot about a person from only a picture. Data science in this area is very primitive at the moment; most genomes that have been studied have no pictures associated with them, or much else.
From what I can tell, no one has publicly (looking at you, governments) tried to determine genetics from a photo. To me, this project was an interesting way to see how invasive one could be with just a picture, and maybe to look a little at what the future holds. What if, instead of giving a company your love interest's DNA to sequence, like in GATTACA, you just uploaded a picture to a webserver?
This was the idea and this is the piece.
How it works:
It starts with a picture.
I wrote all the code in C++, which I consider an awful programming language. The program is multithreaded and uses the OpenCV library. It uses a webcam to try to find a human face in the video it is constantly streaming. Then comes the machine learning: I used a number of the machine learning algorithms built into the OpenCV API and ran them on a database of somewhere between 500 and 1000 faces, sorted by sex and ancestry (race). Combining the different algorithms, I performed "boosting" to create a meta-algorithm of sorts, which really helped (though I have not statistically quantified by how much). Because I wanted my training set to be reliable, I needed a way to build this database of training images: I could not find an existing dataset like this, and the NSA wouldn't answer my emails. I ended up scraping OkCupid and using people's self-identified race and sex to build my training dataset. I also hand-curated these images.
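To make that concrete, here is a minimal sketch of that kind of webcam detection loop, assuming OpenCV's stock frontal-face Haar cascade; the file name and parameters are illustrative, not the project's actual values.

    // Minimal sketch: grab webcam frames and mark any detected faces.
    #include <opencv2/opencv.hpp>
    #include <vector>

    int main() {
        cv::CascadeClassifier faceCascade;
        if (!faceCascade.load("haarcascade_frontalface_default.xml"))
            return 1; // this cascade file ships with OpenCV

        cv::VideoCapture cam(0); // default webcam
        if (!cam.isOpened()) return 1;

        cv::Mat frame, gray;
        while (cam.read(frame)) {
            cv::cvtColor(frame, gray, cv::COLOR_BGR2GRAY);
            cv::equalizeHist(gray, gray); // even out lighting a bit

            std::vector<cv::Rect> faces;
            faceCascade.detectMultiScale(gray, faces, 1.1, 3, 0, cv::Size(80, 80));

            for (const cv::Rect& f : faces)
                cv::rectangle(frame, f, cv::Scalar(0, 255, 0), 2);

            cv::imshow("scanner", frame);
            if (cv::waitKey(30) == 27) break; // Esc quits
        }
        return 0;
    }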
Side Note:
If you ever post a picture on a dating website, please look forward and directly at the camera with no sunglasses or hat on, because that might make the day of a data scientist. Seriously, I had a program that did face and eye detection and cropped the photos, and only about 20%-30% of them were usable!
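That curation filter could look something like this sketch: keep a photo only if exactly one face and at least two eyes are found, then crop to the face. The cascade file names are OpenCV's stock ones; the rules here are assumptions rather than my script's exact logic.

    // Keep only frontal, unobstructed faces; sunglasses and hats
    // usually fail the eye check.
    #include <opencv2/opencv.hpp>
    #include <vector>

    bool cropUsableFace(const cv::Mat& photo, cv::Mat& cropped) {
        static cv::CascadeClassifier face("haarcascade_frontalface_default.xml");
        static cv::CascadeClassifier eyes("haarcascade_eye.xml");

        cv::Mat gray;
        cv::cvtColor(photo, gray, cv::COLOR_BGR2GRAY);

        std::vector<cv::Rect> f;
        face.detectMultiScale(gray, f);
        if (f.size() != 1) return false; // want exactly one face

        std::vector<cv::Rect> e;
        eyes.detectMultiScale(gray(f[0]), e); // search inside the face only
        if (e.size() < 2) return false;

        cropped = photo(f[0]).clone();
        return true;
    }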
The interesting thing is that machines see more than you or I. We as humans are usually limited to what we have been trained to look for. Machine learning allows computers to explore associations that humans cannot. With decent lighting, the sex and race detection is actually pretty good!
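For a flavor of what one of those per-trait classifiers might look like, here is a hedged sketch using Fisherfaces from the opencv_contrib face module; the piece actually combined several OpenCV algorithms with boosting, and the label encoding and loading step here are assumptions.

    // Sketch: train one learner (e.g. male vs. female) on the curated crops.
    #include <opencv2/face.hpp>
    #include <opencv2/opencv.hpp>
    #include <vector>

    int main() {
        std::vector<cv::Mat> images; // grayscale crops, all the same size
        std::vector<int> labels;     // e.g. 0 = female, 1 = male (assumed)

        // ... load the curated, aligned face crops into images/labels ...

        cv::Ptr<cv::face::FisherFaceRecognizer> model =
            cv::face::FisherFaceRecognizer::create();
        model->train(images, labels);

        // At the installation, each new face would be classified with:
        // int predicted = model->predict(newFaceGray);
        return 0;
    }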
Sadly, since my datasets are pretty big and I run several algorithms for comparison on a semi-old laptop, it takes about 20 seconds for the detection to run. I needed a faster method; otherwise people who interacted with it would become bored. I found servers online that provide APIs to do the same thing, but I had to sacrifice the freedom to have any of the race categories I wanted and to keep expanding the dataset. Sucks, but it runs in about 2 seconds instead of 20. It is also good that I have both a version that can run without the internet (the original version) and the one that uses the online API.
Next, the program starts breaking down different features of the face: eye color, hair color, skin color. Color is an interesting thing. On computers, the way we most frequently define colors is through Red Green Blue (RGB) values. But really, what is red? RGB 255, 0, 0? Of course! RGB 255, 100, 100? Maybe??? RGB 255, 255, 255? Definitely not. Even genetic studies reference "blue" eyes and "brown" hair, when their blue could include blue-green and their brown could also be blackish. So when I was going to associate an SNP with blue eyes, I first needed to figure out what RGB values I thought blue was. That was interesting. I kind of just typed in values and created cut-offs based on what I saw, so pretty arbitrary...
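A toy version of those cut-offs, with invented thresholds standing in for the hand-tuned ones:

    // Label the mean color of an eye region; thresholds are illustrative
    // and just as arbitrary as the originals.
    #include <opencv2/opencv.hpp>
    #include <string>

    std::string roughEyeColor(const cv::Mat& eyeRegionBGR) {
        cv::Scalar m = cv::mean(eyeRegionBGR); // OpenCV orders channels B, G, R
        double b = m[0], g = m[1], r = m[2];
        if (b > 90 && b > 1.2 * r && b > 1.1 * g) return "blue";
        if (r > 60 && r >= g && g > b)            return "brown";
        return "unclassified";
    }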
Then I built my database of alleles associated with traits such as race, sex, skin color, &c.
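The lookup table might be shaped something like this; the rsIDs and sequences below are placeholders, not entries from the actual database.

    // Hypothetical shape of the allele database: trait -> candidate alleles.
    #include <map>
    #include <string>
    #include <vector>

    struct Allele {
        std::string rsID;     // SNP identifier (placeholder values below)
        std::string sequence; // flanking DNA shown on the display
    };

    std::map<std::string, std::vector<Allele>> alleleDB = {
        {"blue_eyes", {{"rs0000000", "...ACGT[A/G]TGCA..."}}},
        {"male",      {{"rs0000001", "...GATT[C/T]ACAG..."}}},
    };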
The program also talks to you and shows you videos and your genes.
So from the start:
You walk in and see a large display on the wall with a live feed. It detects your face, uses machine learning to identify traits about you, and tells you about your genome, showing you DNA sequences of the alleles you are predicted to have based on these traits. While it is doing this, the program talks to you generatively, using Google Translate English-to-English as a text-to-speech engine (a little trick I picked up that works GREAT!).
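The Translate trick boils down to requesting audio from Google's unofficial translate_tts endpoint and piping it to a player; the endpoint has since been locked down, and the exact URL parameters and player below are assumptions.

    // Quick-and-dirty speech: fetch an MP3 from the unofficial endpoint
    // and play it. `text` should already be URL-encoded; no escaping here.
    #include <cstdlib>
    #include <string>

    void say(const std::string& text) {
        std::string cmd =
            "curl -s -A Mozilla 'https://translate.google.com/translate_tts"
            "?ie=UTF-8&tl=en&client=tw-ob&q=" + text + "' | mpg123 -q -";
        std::system(cmd.c_str());
    }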
Code for the version using the online API:
https://drive.google.com/folderview?id=0B_R75gIJvkFUczBxNXk3ZGtWRDg&usp=sharing
A bunch of face images cropped, aligned, and sorted by self-identified sex (male, female) and self-identified race (caucasian, black, asian):
https://drive.google.com/folderview?id=0B_R75gIJvkFUdVAxNE45NXhSXzA&usp=sharing
This video plays when your presence is detected in the Scanning Room by the software.