"Large data" workflows using pandas

I use SAS in my day job, and its out-of-core support is great. However, SAS is horrible as a piece of software for numerous other reasons.

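One day I hope to replace my use of SAS with Python and pandas, but I currently lack an out-of-core workflow for large datasets. I'm not talking about "big data" that requires a distributed network, but rather files too large to fit in memory yet small enough to fit on a hard drive.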

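My first thought is to use HDFStore to hold large datasets on disk and pull only the pieces I need into dataframes for analysis. Others have mentioned MongoDB as an easier-to-use alternative. My question is this: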

What are some best-practice workflows for accomplishing the following:

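1. Loading flat files into a permanent, on-disk database structure
2. Querying that database to retrieve data to feed into a pandas data structure
3. Updating the database after manipulating pieces in pandas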

Real-world examples would be much appreciated, especially from anyone who uses pandas on "large data".

Edit: an example of how I would like this to work:

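1. Iteratively import a large flat file and store it in a permanent, on-disk database structure. These files are typically too large to fit in memory.
2. In order to use pandas, I would like to read subsets of this data (usually just a few columns at a time) that can fit in memory.
3. I would create new columns by performing various operations on the selected columns.
4. I would then have to append these new columns into the database structure.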

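I am trying to find a best-practice way of performing these steps. Reading the links about pandas and pytables, it seems that appending a new column could be a problem.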
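For illustration, here is a minimal sketch of the four steps above. It swaps in SQLite (via the standard-library `sqlite3` module) as the on-disk store rather than HDFStore, which is an assumption made for this example and not something the question prescribes. The table names `base` and `derived`, the column names, and the tiny inline CSV are all hypothetical. New columns are written to a separate table keyed by `id` and joined back on read, which sidesteps the column-append problem.

```python
import io
import sqlite3

import pandas as pd

# Hypothetical stand-in for a flat file that is too large to load at once.
flat_file = io.StringIO("id,var1,var2\n1,3,0\n2,1,4\n3,0,0\n4,5,4\n")

# ":memory:" keeps the example self-contained; pass a file path for a
# persistent on-disk database.
conn = sqlite3.connect(":memory:")

# Step 1: import the flat file chunk by chunk into a table.
for chunk in pd.read_csv(flat_file, chunksize=2):
    chunk.to_sql("base", conn, if_exists="append", index=False)

# Step 2: read back only the columns needed for the current operation.
subset = pd.read_sql_query("SELECT id, var1 FROM base", conn)

# Step 3: create a new column from the selected columns.
derived = pd.DataFrame({"id": subset["id"], "var1_doubled": subset["var1"] * 2})

# Step 4: store the new column in its own table, keyed by id.
derived.to_sql("derived", conn, if_exists="replace", index=False)

# Later reads can join the original and derived columns as needed.
combined = pd.read_sql_query(
    "SELECT b.id, b.var2, d.var1_doubled "
    "FROM base AS b JOIN derived AS d ON b.id = d.id ORDER BY b.id",
    conn,
)
print(combined)
```

Keeping derived columns in their own table means the base table is never rewritten. The cost is a join at read time.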

Edit: responding to Jeff's questions specifically:

1. I am building consumer credit risk models. The kinds of data include phone, SSN and address characteristics; property values; derogatory information like criminal records, bankruptcies, etc... The datasets I use every day have nearly 1,000 to 2,000 fields on average of mixed data types: continuous, nominal and ordinal variables of both numeric and character data. I rarely append rows, but I do perform many operations that create new columns.
2. Typical operations involve combining several columns using conditional logic into a new, compound column. For example, if var1 > 2 then newvar = 'A' elif var2 = 4 then newvar = 'B'. The result of these operations is a new column for every record in my dataset.
3. Finally, I would like to append these new columns into the on-disk data structure. I would repeat step 2, exploring the data with crosstabs and descriptive statistics trying to find interesting, intuitive relationships to model.
4. A typical project file is usually about 1GB. Files are organized into such a manner where a row consists of a record of consumer data. Each row has the same number of columns for every record. This will always be the case.
5. It's pretty rare that I would subset by rows when creating a new column. However, it's pretty common for me to subset on rows when creating reports or generating descriptive statistics. For example, I might want to create a simple frequency for a specific line of business, say Retail credit cards. To do this, I would select only those records where the line of business = retail in addition to whichever columns I want to report on. When creating new columns, however, I would pull all rows of data and only the columns I need for the operations.
6. The modeling process requires that I analyze every column, look for interesting relationships with some outcome variable, and create new compound columns that describe those relationships. The columns that I explore are usually done in small sets. For example, I will focus on a set of say 20 columns just dealing with property values and observe how they relate to defaulting on a loan. Once those are explored and new columns are created, I then move on to another group of columns, say college education, and repeat the process. What I'm doing is creating candidate variables that explain the relationship between my data and some outcome. At the very end of this process, I apply some learning techniques that create an equation out of those compound columns.
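As a rough sketch, the conditional rule quoted above (if var1 > 2 then 'A', elif var2 = 4 then 'B') could be written in pandas with `numpy.select`, which takes the first matching condition per row. The sample values and the fallback label "other" are assumptions for illustration, since the rule as stated does not say what happens when neither condition holds.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data using the column names from the example rule.
df = pd.DataFrame({"var1": [3, 1, 0, 5], "var2": [0, 4, 0, 4]})

# Conditions are checked in order, so this mirrors if / elif semantics.
conditions = [df["var1"] > 2, df["var2"] == 4]
choices = ["A", "B"]

# "other" is an assumed fallback for rows matching neither condition.
df["newvar"] = np.select(conditions, choices, default="other")
print(df)
```

The last row satisfies both conditions and gets 'A', because the first condition wins.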

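It is rare that I would ever add rows to the dataset. I will nearly always be creating new columns (variables or features in statistics/machine learning parlance).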

Current answer

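If your datasets are between 1 and 20GB, you should get a workstation with 48GB of RAM. Then pandas can hold the entire dataset in RAM. I know it's not the answer you're looking for, but doing scientific computing on a notebook with 4GB of RAM isn't reasonable.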
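To judge whether a dataset would actually fit in RAM, one option is to measure a small sample and extrapolate. The sketch below is only a rough estimate under stated assumptions: the inline CSV stands in for the first rows of the real file, and `total_rows` is a made-up row count. On-disk size and in-memory size can differ a lot, especially for string columns.

```python
import io

import pandas as pd

# Hypothetical stand-in for the first rows of a large flat file.
sample_csv = io.StringIO("id,name,score\n1,alice,0.5\n2,bob,0.7\n3,carol,0.9\n")
total_rows = 1_000_000  # assumed row count of the full file

sample = pd.read_csv(sample_csv)

# deep=True also counts the memory held by Python string objects.
bytes_per_row = sample.memory_usage(index=False, deep=True).sum() / len(sample)
estimated_gb = bytes_per_row * total_rows / 1024**3
print(f"Estimated in-memory size: {estimated_gb:.2f} GB")
```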

2013-11-02 07:14:07

Other answers

It is worth mentioning Ray here as well. It is a distributed computation framework that has its own distributed implementation of pandas.

Just replace the pandas import, and the code should work as is:

# import pandas as pd
import ray.dataframe as pd

# use pd as usual

More details here:

https://rise.cs.berkeley.edu/blog/pandas-on-ray/

Update: the part that handles the pandas distribution has been extracted into the modin project.

The proper way to use it now is:

# import pandas as pd
import modin.pandas as pd

2018-03-18 09:30:39

I'd like to point out the Vaex package.

Vaex is a python library for lazy Out-of-Core DataFrames (similar to Pandas), to visualize and explore big tabular datasets. It can calculate statistics such as mean, sum, count, standard deviation etc, on an N-dimensional grid up to a billion (10^9) objects/rows per second. Visualization is done using histograms, density plots and 3d volume rendering, allowing interactive exploration of big data. Vaex uses memory mapping, zero memory copy policy and lazy computations for best performance (no memory wasted).

Have a look at the documentation: https://vaex.readthedocs.io/en/latest/ The API is very close to the pandas API.

2019-06-03 09:40:50

One more variation

Many of the operations done in pandas can also be done as a database query (SQL, MongoDB).

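Using an RDBMS or MongoDB lets you perform some of the aggregations in the DB query, which is optimized for large data and uses caches and indexes efficiently.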

Later, you can perform post-processing using pandas.

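The advantage of this method is that you gain the DB optimizations for working with large data while still defining the logic in a high-level declarative syntax, without having to deal with the details of deciding what to do in memory and what to do out of core.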

And although the query language and pandas are different, it is usually not complicated to translate part of the logic from one to the other.
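As a small sketch of this split, the example below lets the database do the grouping and hands only the aggregated rows to pandas for post-processing. SQLite is used purely because it ships with Python. The `accounts` table, its columns, and the sample rows are hypothetical, loosely modeled on the "line of business" report described in the question.

```python
import sqlite3

import pandas as pd

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (line_of_business TEXT, defaulted INTEGER)")
conn.executemany(
    "INSERT INTO accounts VALUES (?, ?)",
    [("retail", 1), ("retail", 0), ("retail", 0), ("auto", 1), ("auto", 1)],
)

# The aggregation runs inside the database; only one row per group
# is transferred into pandas.
summary = pd.read_sql_query(
    "SELECT line_of_business, COUNT(*) AS n, SUM(defaulted) AS n_default "
    "FROM accounts GROUP BY line_of_business ORDER BY line_of_business",
    conn,
)

# Post-processing in pandas on the small aggregated result.
summary["default_rate"] = summary["n_default"] / summary["n"]
print(summary)
```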

2015-04-28 05:22:21

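Now, two years after the question, there is an "out-of-core" pandas equivalent: dask. It is excellent! Though it does not support all pandas functionality, you can get really far with it. Update: over the past two years it has been consistently maintained, and there is a substantial user community working with Dask.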
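To make the "out-of-core" idea concrete, here is a sketch that uses only plain pandas with `chunksize`. It is not Dask's API. It illustrates the underlying pattern of processing a file piece by piece and combining partial results, which libraries like Dask automate. The inline CSV and the `value` column are stand-ins for a real large file.

```python
import io

import pandas as pd

# Hypothetical stand-in for a large CSV: the integers 1 through 10.
csv_data = io.StringIO("value\n" + "\n".join(str(i) for i in range(1, 11)))

total = 0.0
count = 0

# Only one chunk of rows is held in memory at a time.
for chunk in pd.read_csv(csv_data, chunksize=3):
    total += chunk["value"].sum()
    count += len(chunk)

mean = total / count
print(mean)
```

This works for statistics that can be combined from partial results, such as sums and counts. Operations like medians or sorts need more care.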

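And now, four years after the question, there is another high-performance "out-of-core" pandas equivalent in Vaex. It "uses memory mapping, zero memory copy policy and lazy computations for best performance (no memory wasted)." It can handle datasets of billions of rows without loading them into memory, which makes it possible to do analysis even on suboptimal hardware.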