Theory In The Time of Big Data
What is the role of theory today?
Anna Gilbert and Atri Rudra are top theorists who are well known for their work in unraveling secrets of computation. They are experts on anything to do with coding theory—see this for a book draft by Atri with Venkatesan Guruswami and Madhu Sudan called Essential Coding Theory. They also do great theory research involving not only linear algebra but also much non-linear algebra of continuous functions and approximative numerical methods.
Today we want to focus on a recent piece of research they have done that is different from their usual work: It contains no proofs, no conjectures, nor even any mathematical symbols.
Their new working paper is titled, “Teaching Theory in the time of Data Science/Big Data.” As you might guess it is about the role of theory in the education of computer scientists today. The paper contains much information that they have collected on what is being taught at some of the top departments in computer science, and how the current immense interest in Big Data is affecting classic theory courses.
A short overview of what they find is:
- Computer Science (CS) enrollments are growing rapidly.
- Interest and importance of Big Data is also growing rapidly. The same is true of AI, Machine Learning, and Systems fields in general.
- The main consequence is that departments are being forced to revamp their curricula.
- There is a nationwide trend to reducing and condensing the required courses, thus leaving more room for courses in rapidly growing subfields and area concentrations such as certificates in Data Analytics, Information Assurance, or Mobile Applications.
The above is leading to pressure to delete and/or modify theory courses. From Atri’s CS viewpoint and Anna’s as Mathematics faculty active in the theory community, both wish to see CS majors obtain degrees that leave them well versed in CS in general and theory in particular. Undergraduates in programs with a CS component should likewise be well served in formal and mathematical areas. Is this possible given the finite constraints on the curriculums? It is not clear, but their paper shows what is happening right now with theory courses (plus linear algebra and probability/statistics), what is being planned for the near future, and some options that may be useful to consider.
§
For the purpose of this post, we made some edits to their text which follows, with their permission. Some changes were stylistic and some more content-oriented. Their PDF version linked as above may evolve over time—especially upon success of their appeal for reader input at the end. So to obtain a complete and current picture please visit their paper too.
Now Anna and Atri speak:
History and Some Caveats
The genesis of this article is a conversation between the two authors that started six weeks ago. One of us (Anna) was giving a talk at an NSF workshop on Theoretical Foundations of Data Science (TFoDS) and the other (Atri) was thinking about changes to the Computer Science (henceforth CS) curriculum that his department at the University at Buffalo is considering. Anna’s talk at NSF, which included data on theory courses at top ranked schools, generated a great deal of interest in knowing even more about the state of theory courses. This was followed by more data collection on our part.
This post is meant as a starting point of discussion on how we teach theory courses, especially in the light of the increased importance of data science. It is not a position paper—it does not argue that the current trends are inherently good or bad, nor does it prescribe any silver bullet. We do suggest some possible courses of action around which discussion can begin.
Overview: Drivers of Change
CS enrollments as well as the numbers of CS majors have increased exponentially in the last few years. In 2014, Ed Lazowska, Eric Roberts, and Jim Kurose exhibited the trend in the former, not only majors. Their graphs in Figure 1 show the trend in introductory CS course enrollments at six institutions in the years 2006–2014.
Figure 1. Enrollment trends in introductory CS sequences at six institutions (Stanford, MIT, University of Pennsylvania, Harvard, University of Michigan, and University of Washington) from 2006–2014. |
Lazowska’s presentation has more detailed statistics and a discussion of the potential implications of these increases. These trends remain valid in 2016, for example as shown by the following chart for the University at Buffalo. In addition to total number of CSE majors, it shows the enrollment in CSE 115 (the introduction to CSE course), CSE 191 (Discrete Math), CSE 250 (Data Structures), CSE 331 (Algorithms) and CSE 396 (Theory of Computation), all of which are required of all CS majors:
Figure 2. Enrollment trends, University at Buffalo CSE 8/08–5/16, with total majors. |
As enrollments out-pace hiring, class sizes have exploded. Lazowska points out that over 10% of Princeton’s majors are CS majors, while it is highly unlikely that 10% of Princeton’s faculty will ever be CS faculty. At the same time, many institutions are re-evaluating and changing their theoretical computer science (henceforth TCS) course requirements and content.
The twin pressures of staffing and content are shifting priorities in both the material covered and how it is covered—e.g., reducing emphasis on proofs and essay-type problems which are harder to grade. We are not judging these shifts or tying them directly to enrollments, but are for now observing that they are happening and impact a large (and increasing) number of students.
The changes in course content, in emphasis on particular TCS components, and in overall CS requirements (including mathematics and statistics) are occurring exactly when there is a big move towards “computational thinking” in many fields and a national emphasis on STEM education more broadly. Not only are the fundamental backgrounds of incoming CS majors thereby changing, but the CS audience is expanding to students in other fields that are benefiting from solid computational foundations. With the increasing role of data and concomitant needs for machine learning and statistics, it is important to obtain a deep understanding of the mathematical foundations of data science. Traditional TCS has been founded on discrete mathematics, but “continuous” math—especially as related to statistics, probability, and linear algebra—is increasingly important in ways also reflected by cutting-edge TCS research.
Survey of Curricula: Results and Analysis
We considered the top 20 CS schools according to the US News ranking of graduate programs, numbering 24 including ties. It may be inappropriate to use the graduate program rankings to consider the undergraduate program requirements, and it should be noted that the rankings cover all of the graduate program not just TCS, but this is a reasonable starting point. We sent colleagues a short survey and collected data (available spreadsheet) on these 24 schools. Since several include Engineering in one department as at Buffalo or a separate department as at Michigan we will use `CSE’ as the collective term.
What constitutes a theory course?
We counted the total number of theory courses that all CS majors have to take within the CSE department and then calculated the fraction over the total number of required courses. We categorized the theory courses under these bins:
- Discrete Math
- Data Structures
- Algorithms
- Theory of Computation
- Probability/Statistics
The bounds are not sharp—a Data Structures course always covers algorithms associated to the data structures and may overlap with an Algorithms course especially when graphs are covered—and Algorithms often includes some complexity theory, especially NP-completeness. In our spreadsheet these columns are followed by the number of theory electives—besides these required courses—that all CS majors have to take. We would like to clarify four things:
- Data Structures is, in most schools, primarily a programming course with little or no proofs, and is almost always taught by non-core theory faculty. We still counted it as a theory course—after all data structures did come from theory so we can “own it”—but we note that many of our colleagues do not.
- Many schools tie their Probability/Statistics requirement to courses outside the CSE department. We discuss this further below.
- We have not specifically addressed parallel or distributed algorithms which are necessary in much of data science or data-enabled science.
- All the discussion below is based purely on TCS or math courses that all CS majors have to take. In particular, our numbers do not consider cases where many—even most—students take a specific TCS course but it is formally optional.
Current theory requirements
We begin with statistics on the total number of semesters of theory courses that are currently required of all CS majors, standardly equating 3 quarters or trimesters to 2 semesters. The basic statistics are in Table 1.
The median number of semester-long courses was three. All but one school requires a discrete math course, all but two require a Data Structures course, and all but nine require an Algorithms course. Eight schools require a Theory of Computation course separate from Algorithms. All these schools have a significant programming component in their Data Structures course. Only one, Cornell, currently adds programming assignments in the required algorithms course. We would like to remind the reader that we are only considering TCS courses required of all CS majors—for instance, CS 124/125 at Harvard has programming assignments but is not required of all CS majors.
Current Mathematics/Statistics requirements
We limited attention to cases where courses in Probability/Statistics and/or Linear Algebra are required of all CS majors but taught outside of CSE. We focus on these two courses since they are most relevant to data science.
Probability/Statistics. Of those surveyed, nineteen schools required a Probability/Statistics course, while five did not. Five had developed a specific required course within the CSE department (Stanford, Berkeley, UIUC, Univ. of Washington, and MIT), three had choices among courses both inside and outside the CSE department, and eleven required a course outside CSE. Of the five institutions that did not require a Probability/Statistics course, two (Univ. of Wisconsin and Harvard) listed such a course among electives in Mathematics. Princeton, Yale, and Brown do not list such a course.
Linear Algebra. Sixteen surveyed schools require a Linear Algebra course, out of 24 total. Of the 16, only Brown and Columbia provide a linear algebra course within CSE that satisfies the requirement, though both allow for non-CSE linear algebra courses.
Changes: past and future
After reflecting on the data in relation to our initial observations about increasing CS enrollments and emphasis on computational thinking across disciplines, we dug deeper and asked people further questions about changes they have seen or are discussing at their institutions. Of eight departments responding (as of 6/10/16):
- four kept the number of theory requirements the same but changed emphasis on content, amid an overall decrease in requirements;
- three decreased the number of theory requirements and overall;
- one increased its theory requirements.
Four universities changed their Mathematics requirements in the last 10 years. These changes are primarily to require fewer semesters of Calculus II or III (e.g., some no longer require Ordinary Differential Equations) and, instead, require Linear Algebra and/or Probability/Statistics (whether inside the CSE department or not). Two institutions plan to make changes in the future, likely to require Linear Algebra.
Starting Points for Discussion
We suggest that now is the time to re-think some of the theory curriculum, to work with our colleagues in Mathematics and Statistics, and to develop mathematical foundations classes that are appropriate both for CS majors and STEM majors more broadly. Especially for CS majors, this exposure should come no later than junior year. Here are some starting points for this discussion.
- Perhaps require students to choose one out of five different foundational courses offered by several different departments—e.g. CSE, Statistics, and Mathematics.
- Perhaps team-teach courses that cover a combination of material across several departments.
- Developing small exploratory or project-based courses that complement the regular theoretical foundations classes, with hands-on programming in analogy to lab-based courses in the sciences, might improve on adding programming assignments to TCS courses.
- Assignments in algorithms courses can be tailored to show how the algorithms are applied to real-life data. Short of large-scale projects, smaller assignments amenable to good available auto-graders can achieve this benefit even in increasingly large classes. We hear that students—especially those less mathematically inclined—learn the material better if they have environments conducive to coding up the algorithms they see in class. We believe proof-based assignments remain relevant in algorithms courses but may be better complemented with some programming assignments.
- If your department is looking into reducing the total number of required CS courses, instead of dropping the theory of computing course, examine carefully and discuss which pieces of the Theory of Computation curriculum to retain, which elements and modes of analysis enable a variety of applications in the rest of computer science. See an earlier post on this blog, titled “Complexity as an Enable,” for a broader discussion.
Our goal is to educate the different students at our respective institutions as best we can, by working with our colleagues at our home institutes and by having a dialogue with our theory colleagues across the country.
Sociology and Sample Size
After sending emails initially to friends in our social networks to gather data and/or supplement the above preliminary analysis, we noted that we had asked only three women total. We then mused on how we could have increased that number by thinking a bit harder about which women were in our social network and whether the institutions we collected figures for had women theorists. We found that, upon reflection, we could have asked eight more women in our social networks, for a total of 11 women theorists, each at a different school, among the top 24 institutions. There are certainly more than 11 institutions with women theorists but either the women faculty are in areas we are not familiar with or they are women in our areas whom we do not yet know personally (e.g., new, junior faculty). In other words, a ten-minute reflection yielded an almost four-fold increase in representatives from an under-represented group.
We recognize that our sample covers only 24 top institutions. This was done mostly to reduce work on our part since the first data was collected by reading the relevant curricula webpages. Needless to say, a better picture of TCS and math requirements for CS degrees in schools in the US can be gained with more data. We are hoping that readers of this blog at many more institutions can make valuable contributions to our data collection and discussion. Those of you interested can contribute your institution’s information to this survey by filling in a Google form. We will periodically update the master spreadsheet with information that we get from this Google form.
Open Problems
We join Anna and Atri in their appeal which ends their paper: the destiny of theory courses can be considered as one large “open problem.” They conclude by thanking those who have already contributed data and others at Michigan and Buffalo and Georgia Tech (besides us) and MIT for inputs to their article.
We have a few remarks of our own: The main ulterior purpose of theory courses is to sharpen analytical modes of thinking and linear deductive argument, among skills often lumped into the general term “mathematical maturity.” The Internet and advances in technology have brought greater and quicker rewards for non-linear, associative, and more-visual modes. These might seem to compete with or even replace “theory,” but the point behind Anna and Atri’s post is that while diffused among more courses in various areas, the need for analytical and linear-deductive experience grows overall.
What emerges is a greater call for mathematical maturity before capstone courses in these areas, as opposed to the view that a required theory course can be taken in the senior year. Shifting TCS material into an early discrete mathematics course may accomplish this. As we have discussed in Buffalo, this could accompany an across-the-board upgrade in rigor of our entry curriculum, but that may discourage some types of students. That in turn might slow increased enrollments—amid several feedback loops whose consequences are an open problem.
[clarified in Buffalo figure that “Total” means majors.]
I have a small question: Did von Neumann have cancer? Some other blog says it was an experimental disease at that time… Ioannis (Yannis) Avramopoulos
thx much Gilbert/Rudra for this groundbreaking analysis. their use of the term “CSE” crossing computer science and engineering is telling and will not be to everyones taste in the field. the big data “revolution” is having a big impact and it looks like a big wave that will span early decades of the 21st century. recently its also increasingly becoming tightly coupled with machine learning. its of increasingly intense interest to (top) corporations eg google and there is talk about the next wave of innovation/ competition/ “disruptive technology” occurring in this area.
its a very applied field that has different models/ paradigms/ emphasis than prior fields such as statistics but is heavily centered on it also. coincidentally fortnow was recently musing on the “theory/ application” split. http://blog.computationalcomplexity.org/2016/06/karp-v-wigderson-20-years-later.html this split has been around for decades in CS but is coming to the forefront of attention lately and causing some friction/ shearing. but CS right now is very vibrant, outlook is rosy and it seems theres nothing to fear, only growth. much more investigation of applied CS in my blog in various essays incl recently here https://vzn1.wordpress.com/2015/12/29/select-refs-and-big-tribute-to-empirical-cs-math/
yawn
If one can ignore the issue of political correctness, the truth is that the demand for Data Science and Machine Learning, etc. is market driven, and hence driven by the Corporate America which is more interested in creating false needs for the masses, and an infrastructure for more effective marketing and sales tools.
The middle class lands up being simply conforming to such forces.
Is this a sign of advancement of the civilization?
Agreed.