These are the slides for a talk I gave recently.
Abstract. IPython notebooks, NumPy, and Pandas data frames are the go-to tools for doing data science with Python. Spark and PySpark are rapidly becoming the de facto standard for analyzing large volumes of data. But what about CPU-intensive tasks? What about raw numerical computations that still need to be distributed? In the first part of this talk I give an overview of the most interesting alternatives. The second part is a brief roundup of file formats for storing data for numerical analysis; most of these formats are language-independent.