Relationships can be a messy business. Fortunately for Yakir Reshef, identifying meaningful relationships is his forte. Born in Israel in 1987, Yakir found himself moving with his family to Kenya at age 3 and, shortly thereafter, immigrating to America. In high school he began dating Hilary Finucane, and since then they have taken the meaning of relationships to a whole new level. They both attended Harvard University, both in the Department of Mathematics. They are now both at the Weizmann Institute of Science and both in the Faculty of Mathematics and Computer Science: Yakir – a visiting Fulbright scholar hosted by Prof. Moni Naor;* and Hilary – a Ph.D. student in the group of Prof. Itai Benjamini.*
So the fact that Yakir and Hilary are joint authors of a recently published paper on the subject of relationships seems quite appropriate: The paper reports a new data analysis tool that is able to search complex data sets for interesting relationships and trends that are invisible in other types of statistical analysis. And in another twist on relationships, the other first author of the paper, in addition to Yakir, is David Reshef – a computer scientist at the Broad Institute of MIT and Harvard in Cambridge, Massachusetts, and Yakir’s brother.
The two brothers decided they needed an algorithm that could uncover new and important, yet unexpected, relationships that would otherwise go unnoticed.
The tool they developed – under the guidance of advisers Michael Mitzenmacher of the Harvard University School of Engineering and Applied Sciences and Pardis Sabeti of the Broad Institute – is called the maximal information coefficient, or MIC for short. It is based on the idea that if two variables are related to each other, there should be a way to draw a grid on a scatterplot of the two variables that captures the relationship between them. The algorithm that calculates the MIC searches through many such grids and uses the one best able to quantify how strong the relationship is. Researchers can calculate the MIC on each pair of variables in their data set, rank the pairs by their scores (the higher the score, the more strongly related the pair) and then examine the top-scoring pairs – that is, the pairs with the strongest relationships.
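The grid idea described above can be illustrated with a short Python sketch. This is not the authors' implementation: it is a simplified stand-in that only tries grids with roughly equal-count rows and columns, whereas the published algorithm searches over grid placements much more thoroughly. The function names (`equal_frequency_bins`, `mutual_information`, `grid_score`, `rank_pairs`) are hypothetical, and the default cap on grid size (about n^0.6 cells for n data points) is an assumption borrowed from the published definition of the MIC rather than something stated in this article.

```python
import math
from collections import Counter
from itertools import combinations


def equal_frequency_bins(values, n_bins):
    """Give each value a bin index so bins hold roughly equal counts.

    Tied values always share a bin, so a constant column gets a single bin.
    """
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    bins = [0] * n
    for rank, i in enumerate(order):
        if rank > 0 and values[i] == values[order[rank - 1]]:
            bins[i] = bins[order[rank - 1]]
        else:
            bins[i] = rank * n_bins // n
    return bins


def mutual_information(x_bins, y_bins):
    """Mutual information (in bits) of the grid defined by two binnings."""
    n = len(x_bins)
    joint = Counter(zip(x_bins, y_bins))
    px, py = Counter(x_bins), Counter(y_bins)
    return sum(
        (count / n) * math.log2(count * n / (px[bx] * py[by]))
        for (bx, by), count in joint.items()
    )


def grid_score(x, y, max_cells=None):
    """Best normalized grid score over many grid sizes (simplified MIC-like)."""
    n = len(x)
    if max_cells is None:
        max_cells = max(4, int(n ** 0.6))  # assumed cap on grid size
    best = 0.0
    for nx in range(2, max_cells // 2 + 1):
        for ny in range(2, max_cells // nx + 1):
            mi = mutual_information(
                equal_frequency_bins(x, nx), equal_frequency_bins(y, ny)
            )
            # Dividing by log2(min(nx, ny)) keeps every score between 0 and 1.
            best = max(best, mi / math.log2(min(nx, ny)))
    return best


def rank_pairs(columns):
    """Score every pair of named variables and sort from strongest to weakest."""
    scored = [
        (grid_score(columns[a], columns[b]), a, b)
        for a, b in combinations(columns, 2)
    ]
    return sorted(scored, reverse=True)
```

Calling `rank_pairs` on a dictionary that maps variable names to equal-length lists of numbers returns `(score, name_a, name_b)` tuples with the most strongly related pairs first. Because the sketch restricts itself to equal-count grids, its scores should be read as a rough approximation that can understate what a fuller grid search would find.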
To test how well the algorithm works, Yakir, David and Hilary applied the MIC to data sets in a variety of fields – global health, gene expression, human gut microbiota and even major-league baseball – and compared the MIC results to those of current methods.
How did they fare? With regard to the microbiota data, the MIC was able to narrow down 22 million variable pairs to just a few hundred interesting relationships, many of which had not been observed before. For instance, it identified examples of “non-coexistent” species in which if one bacterium is abundant, the other is not, and vice versa. Some of the non-coexistent relationships identified were familiar – known to be caused by differences in host diet – while others were novel. This finding raises the possibility of the existence of additional factors that, like diet, affect the make-up of the human microbiome.
In another example, the team examined a data set from the World Health Organization covering 200 countries and containing 357 variables per country. One of the identified relationships was between female obesity and household income in the Pacific Islands, in which obesity increases with income, in contrast with other countries. It turned out that this trend is no anomaly: obesity is considered a sign of status in the Pacific Islands. Most methods would treat this separate trend as “noise,” but the MIC is able to identify relationships, such as this one, that include more than one trend.
And major-league baseball? According to the MIC, hits, total bases and how many runs a player generates for a team are the most influential factors determining a player’s salary. A more traditional statistic had placed walks, intentional walks and runs batted in as the three strongest factors. So, which of the statistics is correct? The researchers are wisely leaving it to baseball enthusiasts to decide which set of variables is – or at least should be – more strongly tied to salary.
“Two things set the MIC apart from other data analysis tools,” says Hilary. “Unlike other methods, the MIC assigns high scores to a wide variety of relationship types hidden in large data sets, while it can also provide similar scores for relationships with comparable amounts of noise.” Yakir: “In other words, the MIC has a ‘sweet spot’ – it finds cool things going on that you might not have expected and that are difficult to find with other types of analyses.”
As for Hilary and Yakir, while working on the MIC together they discovered the top-scoring relationship of all: marriage. “It’s really great for us that we share the same passion for mathematics.” Not stopping at that, they also share hobbies, including classical piano, jogging and cooking.
*Profs. Itai Benjamini and Moni Naor had no involvement in this research.
Prof. Itai Benjamini is the incumbent of the Renee and Jay Weiss Professorial Chair.
Prof. Moni Naor's research is supported by Citi Foundation and Walmart.