Computers may be getting smaller and faster, but the amount of information streaming through wires everywhere is still far outpacing our machines’ processing capacity. Computer scientists and mathematicians have in the past few decades been tackling the analysis of extremely large datasets – for example, continuous streams from sensors or satellites.
One of the issues researchers deal with is identifying and characterizing algorithms that are efficient at handling such massive quantities of data. Large datasets are often linear – that is, they arrive in a continuous stream, one chunk of data at a time – which makes it difficult, and sometimes impossible, to go back and retrieve earlier information. Prof. Robert Krauthgamer of the Weizmann Institute of Science’s Computer Science and Applied Mathematics Department explains that the typical questions asked about data streams – and, one hopes, answered using computer algorithms – fall into two groups: those that are practically impossible to answer, because they would take prohibitive amounts of computing memory, and those that are easy, or tractable. Sometimes two questions may seem very similar, yet one is easy and the other very hard to answer.
Krauthgamer gives a real-life example to clarify the kinds of questions that can be asked: To identify a so-called distributed denial-of-service (DDoS) attack, in which a server is overloaded with excessive communication, technicians may want to know how many different computers contacted the server over a 24-hour period, and which computer contacted it the most times. Figuring this out from the digital identity – the IP address – of each of these computers would be highly inefficient if the server had to scan its entire history. But a clever algorithm can estimate the number of computers in question with an error margin of something like 1%. And while it is difficult to find, with certainty, the IP address that shows up the most times in the dataset, an algorithm can use carefully chosen random samples to create a highly accurate list of “heavy hitters.”
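The heavy-hitters idea can be sketched with the classic Misra-Gries summary – one standard small-memory technique, shown here only for illustration, not necessarily the algorithm the article refers to. The IP addresses are made up:

```python
def misra_gries(stream, k):
    """Misra-Gries summary: track candidate heavy hitters using at most
    k-1 counters, no matter how long the stream is. Any item appearing
    more than len(stream)/k times is guaranteed to survive."""
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # No free counter: decrement everyone, dropping zeros.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters

# Hypothetical stream of client IPs contacting a server
stream = ["10.0.0.1", "10.0.0.2", "10.0.0.1", "10.0.0.3",
          "10.0.0.1", "10.0.0.2", "10.0.0.1"]
summary = misra_gries(stream, k=3)
print(summary)  # "10.0.0.1" dominates the surviving counters
```

The counts in the summary are underestimates, but a second pass (or a small follow-up check) can confirm the true frequencies of the few surviving candidates – the point is that memory stays at k-1 counters rather than one per distinct address.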
One sign of a good algorithm is that it uses much less memory than the dataset itself, so that the computation can be fast and efficient. In recent studies, Krauthgamer and his colleagues came up with proofs for identifying which problems are easy and which are hard. Going back to the example of the DDoS attack, he notes that any algorithm for finding the heaviest hitter in the data stream – or even just its frequency – would be inefficient. But another question is easier – and, in this case, more important – to answer: How many distinct addresses have a frequency greater than 0 (that is, appear at least once)? Such calculations involve a commonly used family of functions known as Lp norms, in which the exponent p is some number. When p=0, the function is easy to estimate; when p=infinity, it is extremely difficult. What lies in between? It turns out that anything over p=2 is hard, while anything up to and including 2 is easy; and there is a sharp transition between them – something resembling a phase transition in physics. Why should this be? The explanation has generally been that anything over 2 behaves like infinity. But in the course of characterizing the more general class of all symmetric functions, Krauthgamer and his colleagues provided two proofs that clarify this sharp dividing line, showing that an inherent mathematical principle describes this behavior.
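Why p=2 is still on the easy side can be illustrated with the Alon-Matias-Szegedy (AMS) sketch, a textbook small-memory estimator for the squared L2 norm of the frequency vector – this is standard material, not the proofs described above. For clarity the random signs are stored explicitly per item; a real implementation derives them from a hash function so that memory stays small:

```python
import random

def ams_f2_estimate(stream, num_counters=100, seed=0):
    """AMS sketch: estimate F2 = sum_i f_i**2, the squared L2 norm of the
    frequency vector, by averaging the squares of a few random +/-1 sums.
    Each counter z_j = sum over the stream of sign_j(item), and z_j**2 is
    an unbiased estimate of F2."""
    rng = random.Random(seed)
    signs = [{} for _ in range(num_counters)]  # per-counter +/-1 per item
    z = [0] * num_counters
    for item in stream:
        for j in range(num_counters):
            if item not in signs[j]:
                # Real implementations use a 4-wise independent hash here.
                signs[j][item] = rng.choice((-1, 1))
            z[j] += signs[j][item]
    return sum(zj * zj for zj in z) / num_counters

# Frequency vector f = (3, 2, 1), so the true F2 is 9 + 4 + 1 = 14
stream = ["a"] * 3 + ["b"] * 2 + ["c"]
print(ams_f2_estimate(stream))  # close to 14
```

Averaging more counters tightens the estimate, but the key point is that the number of counters depends on the desired accuracy, not on the length of the stream – exactly the kind of small-memory guarantee that fails for p>2.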
“What I find most compelling about these proofs,” says Krauthgamer, “is that we used math that has nothing to do with algorithms to explain whether an efficient algorithm can exist or not.”
In another study, Krauthgamer looked at a larger class of problems, asking whether it is possible to tell if a new problem has an efficient algorithm. Perhaps the most fundamental method in computer science, called reduction, is to mathematically convert the new problem into a canonical, well-understood one – such as the Lp norm in the example above – and then rely on the algorithm already known to solve the canonical problem. But if this method fails – that is, if the new problem cannot be converted to the canonical one – can an efficient algorithm be found some other way? Krauthgamer and his colleagues have now proved that if the new question cannot be converted, then no good algorithm can exist. Their proof draws on a classical area of math named after the Polish mathematician Stefan Banach, who developed it in the 1920s. “Basic logic now tells us that if there is a good algorithm, no matter how sophisticated or clever, then the new problem can also be converted to the canonical one,” says Krauthgamer. “Our contribution has been to demonstrate that something is algorithmically possible if, and only if, a simple mathematical conversion exists.”
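A toy illustration of the reduction idea – not the construction in the study – is to solve a new problem by translating its input so that an existing, canonical solver applies unchanged:

```python
def count_distinct(stream):
    """Canonical problem: number of distinct items. Written exactly here
    for clarity; streaming algorithms approximate it in small memory."""
    return len(set(stream))

def count_distinct_pairs(pair_stream):
    """New problem: number of distinct (source, destination) pairs.
    Reduction: encode each pair as a single token, then hand the
    transformed stream to the canonical solver as-is."""
    return count_distinct(f"{src}->{dst}" for src, dst in pair_stream)

# Hypothetical connection log: (client IP, service) pairs
pairs = [("10.0.0.1", "s1"), ("10.0.0.2", "s1"),
         ("10.0.0.1", "s1"), ("10.0.0.1", "s2")]
print(count_distinct_pairs(pairs))  # 3 distinct pairs
```

The reduction inherits whatever efficiency the canonical solver has; the result described above says, conversely, that when no such conversion exists, no efficient algorithm can exist either.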
Krauthgamer adds: “We explain a particular computational phenomenon by something that appears to be unconnected, for example, by a branch of math that contains no elements whatsoever of computers or algorithms. These characterizations get to the very root of such problems as why p>2 is hard, giving us an overarching, holistic explanation of why they occur.”
Although Krauthgamer’s work is theoretical, it informs the work of computer scientists who deal with complex data streams that may include anything from communications networks to extensive records from retail chains, astrophysics or biological experiments.