I use SAS for my day-to-day work, and it is great for its out-of-core support. However, SAS is horrible as a piece of software for numerous other reasons.

One day I hope to replace my use of SAS with python and pandas, but I currently lack an out-of-core workflow for large datasets. I'm not talking about "big data" that requires a distributed network, but rather files too large to fit in memory yet small enough to fit on a hard drive.

My first thought is to use HDFStore to hold large datasets on disk and pull only the pieces I need into dataframes for analysis. Others have mentioned MongoDB as an easier-to-use alternative. My question is this:

What are some best-practice workflows for accomplishing the following:

1. Loading flat files into a permanent, on-disk database structure
2. Querying that database to retrieve data to feed into a pandas data structure
3. Updating the database after manipulating pieces in pandas

Real-world examples would be much appreciated, especially from anyone who uses pandas on "large data".

Edit -- an example of how I would like this to work:

1. Iteratively import a large flat file and store it in a permanent, on-disk database structure. These files are typically too large to fit in memory.
2. In order to use pandas, I would like to read subsets of this data (usually just a few columns at a time) that can fit in memory.
3. I would create new columns by performing various operations on the selected columns.
4. I would then have to append these new columns into the database structure.

I am trying to find a best-practice way of performing these steps. Reading links about pandas and pytables, it seems that appending a new column could be a problem.

Edit -- responding to Jeff's questions specifically:

1. I am building consumer credit risk models. The kinds of data include phone, SSN and address characteristics; property values; derogatory information like criminal records, bankruptcies, etc... The datasets I use every day have nearly 1,000 to 2,000 fields on average of mixed data types: continuous, nominal and ordinal variables of both numeric and character data. I rarely append rows, but I do perform many operations that create new columns.

2. Typical operations involve combining several columns using conditional logic into a new, compound column. For example, if var1 > 2 then newvar = 'A' elif var2 = 4 then newvar = 'B'. The result of these operations is a new column for every record in my dataset. (A small pandas sketch of this kind of operation follows this list.)

3. Finally, I would like to append these new columns into the on-disk data structure. I would repeat step 2, exploring the data with crosstabs and descriptive statistics, trying to find interesting, intuitive relationships to model.

4. A typical project file is usually about 1GB. Files are organized in such a manner that a row consists of a record of consumer data. Each row has the same number of columns for every record. This will always be the case.

5. It's pretty rare that I would subset by rows when creating a new column. However, it's pretty common for me to subset on rows when creating reports or generating descriptive statistics. For example, I might want to create a simple frequency for a specific line of business, say Retail credit cards. To do this, I would select only those records where the line of business = retail, in addition to whichever columns I want to report on. When creating new columns, however, I would pull all rows of data and only the columns I need for the operations.

6. The modeling process requires that I analyze every column, look for interesting relationships with some outcome variable, and create new compound columns that describe those relationships. The columns that I explore are usually done in small sets. For example, I will focus on a set of, say, 20 columns just dealing with property values and observe how they relate to defaulting on a loan. Once those are explored and new columns are created, I then move on to another group of columns, say college education, and repeat the process. What I'm doing is creating candidate variables that explain the relationship between my data and some outcome. At the very end of this process, I apply some learning techniques that create an equation out of those compound columns.
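A small pandas sketch of the conditional, compound-column logic from item 2 above (the column names, cutoffs and labels are placeholders, not taken from the question):

import numpy as np
import pandas as pd

df = pd.DataFrame({'var1': [3, 1, 0], 'var2': [0, 4, 2]})
# if var1 > 2 then newvar = 'A' elif var2 == 4 then newvar = 'B' else 'C'
df['newvar'] = np.select([df['var1'] > 2, df['var2'] == 4], ['A', 'B'], default='C')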

I rarely add rows to the dataset. I will nearly always be creating new columns (variables or features in statistics/machine learning parlance).


I routinely use tens of gigabytes of data in just this fashion, e.g. I have tables on disk that I read via queries, create data from, and append back.

It's worth reading the docs, and late in this thread, for several suggestions on how to store your data.

Give as much detail as you can, and I can help you develop a structure. Details which will affect how you store your data include:

1. Size of data, # of rows, columns, types of columns; are you appending rows, or just columns?
2. What will typical operations look like? E.g. do a query on columns to select a bunch of rows and specific columns, then do an operation (in-memory), create new columns, save these. (Giving a toy example could enable us to offer more specific recommendations.)
3. After that processing, then what do you do? Is step 2 ad hoc, or repeatable?
4. Input flat files: how many, rough total size in GB? How are these organized, e.g. by records? Does each one contain different fields, or do they have some records per file with all of the fields in each file?
5. Do you ever select subsets of rows (records) based on criteria (e.g. select the rows with field A > 5) and then do something, or do you just select fields A, B, C with all of the records (and then do something)?
6. Do you 'work on' all of your columns (in groups), or are there a good proportion that you may only use for reports (e.g. you want to keep the data around, but don't need to pull in that column explicitly until final results time)?

Solution

Ensure you have pandas at least 0.10.1 installed.

Read the docs on iterating files chunk-by-chunk and on multiple table queries.

Since pytables is optimized to operate on a row-wise basis (which is what you query on), we will create a table for each group of fields. This way it's easy to select a small group of fields (this will work with one big table, but it's more efficient this way... I think I may be able to fix this limitation in the future... it's more intuitive anyway). (The following is pseudocode.)

import numpy as np
import pandas as pd

# create a store
store = pd.HDFStore('mystore.h5')

# this is the key to your storage:
#    this maps your fields to a specific group, and defines 
#    what you want to have as data_columns.
#    you might want to create a nice class wrapping this
#    (as you will want to have this map and its inversion)  
group_map = dict(
    A = dict(fields = ['field_1','field_2',.....], dc = ['field_1',....,'field_5']),
    B = dict(fields = ['field_10',......        ], dc = ['field_10']),
    .....
    REPORTING_ONLY = dict(fields = ['field_1000','field_1001',...], dc = []),

)

group_map_inverted = dict()
for g, v in group_map.items():
    group_map_inverted.update(dict([ (f,g) for f in v['fields'] ]))

Read in the files and create the storage (essentially doing what append_to_multiple does):

for f in files:
   # read in the file, additional options may be necessary here
   # the chunksize is not strictly necessary, you may be able to slurp each 
   # file into memory in which case just eliminate this part of the loop 
   # (you can also change chunksize if necessary)
   for chunk in pd.read_table(f, chunksize=50000):
       # we are going to append to each table by group
       # we are not going to create indexes at this time
       # but we *ARE* going to create (some) data_columns

       # figure out the field groupings
       for g, v in group_map.items():
             # create the frame for this group
             frame = chunk.reindex(columns = v['fields'], copy = False)    

             # append it
             store.append(g, frame, index=False, data_columns = v['dc'])

Now you have all of the tables in the file (actually you could store them in separate files if you wish; you would probably have to add the filename to the group_map, but that is probably unnecessary).

This is how you get columns and create new ones:

frame = store.select(group_that_I_want)
# you can optionally specify:
# columns = a list of the columns IN THAT GROUP (if you wanted to
#     select only say 3 out of the 20 columns in this sub-table)
# and a where clause if you want a subset of the rows

# do calculations on this frame
new_frame = cool_function_on_frame(frame)

# to 'add columns', create a new group (you probably want to
# limit the columns in this new_group to be only NEW ones,
# e.g. so you don't overlap with the other tables)
# add this info to the group_map
store.append(new_group, new_frame.reindex(columns = new_columns_created, copy = False), data_columns = new_columns_created)

When you are ready to do post_processing:

# This may be a bit tricky; and depends what you are actually doing.
# I may need to modify this function to be a bit more general:
report_data = store.select_as_multiple([groups_1,groups_2,.....], where =['field_1>0', 'field_1000=foo'], selector = group_1)

About data_columns: you don't actually need to define ANY data_columns; they just allow you to sub-select rows based on those columns. E.g.:

store.select(group, where = ['field_1000=foo', 'field_1001>0'])

They may be most interesting to you in the final report-generation stage (essentially a data column is segregated from the other columns, which might impact efficiency somewhat if you define a lot of them).

You also might want to:

1. Create a function which takes a list of fields, looks up the groups in the group_map, then selects these and concatenates the results so that you get the resulting frame (this is essentially what select_as_multiple does). This way the structure would be pretty transparent to you. (A rough sketch of such a helper follows this list.)
2. Build indexes on certain data columns (to make row-subsetting much faster).
3. Enable compression.
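A rough sketch of such a helper, written against the group_map / group_map_inverted structure defined above (select_as_multiple is the built-in equivalent; concatenating on axis=1 assumes every group was appended chunk-by-chunk in the same row order):

def select_fields(store, fields, where=None):
    # look up which group holds each requested field
    groups = sorted(set(group_map_inverted[f] for f in fields))
    frames = [
        store.select(g,
                     columns=[f for f in fields if group_map_inverted[f] == g],
                     where=where)
        for g in groups
    ]
    # rows line up because all groups were written from the same chunks, in order
    return pd.concat(frames, axis=1)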

Let me know if you have questions!


This is the case for pymongo. I have also prototyped using sql server, sqlite, HDF, and an ORM (SQLAlchemy) in python. First and foremost, pymongo is a document-based DB, so each person would be a document (a dict of attributes). Many people form a collection, and you can have many collections (people, stock market, income).

pandas DataFrame -> pymongo. Note: I use the chunksize in read_csv to keep it to 5 to 10k records (pymongo drops the socket if it is larger):

aCollection.insert((a[1].to_dict() for a in df.iterrows()))

Querying: gt = greater than...

pd.DataFrame(list(mongoCollection.find({'anAttribute':{'$gt':2887000, '$lt':2889000}})))

.find() returns an iterator, so I commonly use ichunked to chop it into smaller iterators.
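A hedged example of that chopping, assuming the ichunked helper from the more_itertools package (batch size chosen arbitrarily):

from more_itertools import ichunked

cursor = mongoCollection.find({'anAttribute': {'$gt': 2887000}})
for batch in ichunked(cursor, 10000):        # lazily slice the cursor into 10k-doc batches
    batch_df = pd.DataFrame(list(batch))     # materialize one batch at a time
    # ... work on batch_df ...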

How about a join, since I normally get 10 data sources to paste together:

aJoinDF = pandas.DataFrame(list(mongoCollection.find({'anAttribute':{'$in':Att_Keys}})))

then (in my case, sometimes I have to agg on aJoinDF first before it is "mergeable"):

df = pandas.merge(df, aJoinDF, on=aKey, how='left')

You can then write the new info to your main collection via the update method below (logical collection vs. physical data sources).

collection.update({primarykey:foo},{key:change})

For smaller lookups, just denormalize. For example, if a document carries a code, just add the code's text as a field and do a dict lookup as you create the documents.
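A tiny, hypothetical illustration of that denormalization, using the same insert call as above (the code_text lookup and field names are invented for the example):

code_text = {10: 'Retail credit card', 20: 'Mortgage'}   # small in-memory lookup table
for _, row in df.iterrows():
    doc = row.to_dict()
    doc['code_text'] = code_text.get(doc['code'])        # store the text alongside the code
    aCollection.insert(doc)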

Now you have a nice dataset based around a person, you can unleash your logic on each case and create more attributes. Finally, you can read your 3-to-memory-max key indicators into pandas and do pivots/agg/data exploration. This works for me with 3 million records of numbers/big text/categories/codes/floats...

You can also use the two methods built into MongoDB (MapReduce and the aggregation framework). See here for more info about the aggregation framework, as it seems to be easier than MapReduce and looks handy for quick aggregate work. Note that I didn't need to define my fields or relations, and I can add items to a document. At the current state of the rapidly changing numpy, pandas, python toolset, MongoDB helps me just get to work :)
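A rough sketch of an aggregation-framework call through pymongo (the field names are made up; the point is that the grouping happens server-side before anything reaches pandas):

pipeline = [
    {'$match': {'anAttribute': {'$gt': 2887000}}},                     # filter in MongoDB
    {'$group': {'_id': '$category', 'total': {'$sum': '$amount'}}},    # aggregate in MongoDB
]
agg_df = pd.DataFrame(list(mongoCollection.aggregate(pipeline)))       # small result into pandas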


I found this a little late, but I work with a similar problem (mortgage prepayment models). My solution has been to skip the pandas HDFStore layer and use straight pytables. In my final file, I save each column as an individual HDF5 array.

My basic workflow is to first get a CSV file from the database. I gzip it, so it's not as huge. Then I convert it to a row-oriented HDF5 file by iterating over it in python, converting each row to a real data type, and writing it to the HDF5 file. That takes some tens of minutes, but it doesn't use any memory, since it only operates row-by-row. Then I "transpose" the row-oriented HDF5 file into a column-oriented HDF5 file.
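A rough sketch of that first, row-oriented conversion step, using the same older camelCase pytables API as the code below (with current PyTables these would be the snake_case open_file / create_table variants). RowDesc, a tables.IsDescription subclass describing the columns, and parse_row(), which casts each CSV field to its real dtype, are hypothetical helpers:

import csv
import gzip
import tables

h_out = tables.openFile('rows.h5', mode='w')
table = h_out.createTable('/', 'data', RowDesc, filters=tables.Filters(complevel=1))
row = table.row
for rec in csv.DictReader(gzip.open('dump.csv.gz', 'rt')):
    for name, value in parse_row(rec).items():   # cast strings to real types
        row[name] = value
    row.append()                                 # one row at a time keeps memory use flat
table.flush()
h_out.close()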

The table transpose looks like this:

import logging
import tables

logger = logging.getLogger(__name__)

def transpose_table(h_in, table_path, h_out, group_name="data", group_path="/"):
    # Get a reference to the input data.
    tb = h_in.getNode(table_path)
    # Create the output group to hold the columns.
    grp = h_out.createGroup(group_path, group_name, filters=tables.Filters(complevel=1))
    for col_name in tb.colnames:
        logger.debug("Processing %s", col_name)
        # Get the data.
        col_data = tb.col(col_name)
        # Create the output array.
        arr = h_out.createCArray(grp,
                                 col_name,
                                 tables.Atom.from_dtype(col_data.dtype),
                                 col_data.shape)
        # Store the data.
        arr[:] = col_data
    h_out.flush()

Reading it back in then looks like this:

import numpy as np
import pandas as pd
import tables

def read_hdf5(hdf5_path, group_path="/data", columns=None):
    """Read a transposed data set from a HDF5 file."""
    if isinstance(hdf5_path, tables.file.File):
        hf = hdf5_path
    else:
        hf = tables.openFile(hdf5_path)

    grp = hf.getNode(group_path)
    if columns is None:
        data = [(child.name, child[:]) for child in grp]
    else:
        data = [(child.name, child[:]) for child in grp if child.name in columns]

    # Convert any float32 columns to float64 for processing.
    for i in range(len(data)):
        name, vec = data[i]
        if vec.dtype == np.float32:
            data[i] = (name, vec.astype(np.float64))

    if not isinstance(hdf5_path, tables.file.File):
        hf.close()
    return pd.DataFrame.from_items(data)

Now, I generally run this on a machine with a ton of memory, so I may not be careful enough with my memory use. For example, by default the load operation reads the whole data set.

This generally works for me, but it's a bit clunky, and I can't use the fancy pytables magic.

Edit: The real advantage of this approach, over the array-of-records pytables default, is that I can then load the data into R using h5r, which can't handle tables. Or, at least, I've been unable to get it to load heterogeneous tables.


If your datasets are between 1 and 20GB, you should get a workstation with 48GB of RAM. Then pandas can hold the entire dataset in RAM. I know it's not the answer you're looking for here, but doing scientific computing on a notebook with 4GB of RAM isn't reasonable.


I think the answers above are missing a simple approach that I've found very useful.

When I have a file that is too large to load in memory, I break it up into multiple smaller files (either by rows or columns).

Example: In the case of 30 days' worth of trading data of roughly 30GB, I break it into a file per day of roughly 1GB each. I then process each file separately and aggregate the results at the end.

One of the biggest advantages is that it allows parallel processing of the files (multiple threads or processes).

The other advantage is that file manipulation (like adding/removing dates in the example) can be accomplished with regular shell commands, which is not possible with more advanced/complicated file formats.

This approach doesn't cover all scenarios, but it is very useful in a lot of them.
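A rough sketch of that split-then-aggregate pattern with multiprocessing (the file layout and the per-day groupby are made up for illustration):

import glob
from multiprocessing import Pool

import pandas as pd

def process_one_day(path):
    day = pd.read_csv(path)                       # each daily file fits in memory on its own
    return day.groupby('symbol')['volume'].sum()

if __name__ == '__main__':
    files = sorted(glob.glob('trades_2015-*.csv'))
    with Pool(processes=4) as pool:
        partials = pool.map(process_one_day, files)       # process the files in parallel
    result = pd.concat(partials).groupby(level=0).sum()   # combine the per-day results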


If you go down the simple path of creating a data pipeline that is broken down into multiple smaller files, consider Ruffus.


I know this is an old thread, but I think the Blaze library is worth checking out. It's built for exactly these types of situations.

From the docs:

Blaze extends the usability of NumPy and Pandas to distributed and out-of-core computing. Blaze provides an interface similar to that of the NumPy ND-Array or the Pandas DataFrame, but maps these familiar interfaces onto a variety of other computational engines such as Postgres or Spark.

Edit: By the way, it is supported by ContinuumIO and Travis Oliphant, the author of NumPy.


There is one more variation.

Many of the operations done in pandas can also be done as a database query (sql, mongo).

Using an RDBMS or mongodb allows you to perform some of the aggregations in the DB query (which is optimized for large data and uses caches and indexes efficiently).

Later, you can do the post-processing in pandas.

The advantage of this method is that you gain the DB optimizations for working with large data, while still defining the logic in a high-level declarative syntax, and you don't have to deal with the details of deciding what to do in memory and what to do out of core.

And although the query language and pandas are different, it's usually not complicated to translate part of the logic from one to the other.
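A minimal sketch of that division of labour, assuming a SQLite file with a transactions table (the table and column names are invented):

import sqlite3

import pandas as pd

conn = sqlite3.connect('data.db')
query = """
    SELECT line_of_business, COUNT(*) AS n, AVG(amount) AS avg_amount
    FROM transactions
    GROUP BY line_of_business
"""
summary = pd.read_sql_query(query, conn)                 # the heavy aggregation runs in the DB
summary['share'] = summary['n'] / summary['n'].sum()     # light post-processing in pandas
conn.close()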


Now, two years after the question, there is an "out-of-core" pandas equivalent: dask. It is excellent! Though it does not support all of pandas' functionality, you can get really far with it. Update: it has continued to be maintained over the past two years, and there is a substantial user community working with Dask.

And now, four years after the question, there is another high-performance "out-of-core" pandas equivalent: Vaex. It "uses memory mapping, a zero memory copy policy and lazy computations for best performance (no memory wasted)" and can handle datasets of billions of rows without storing them in memory (making it even possible to do analysis on suboptimal hardware).


I recently came across a similar issue. I found that simply reading the data in chunks and appending it, chunk by chunk, to the same csv works well. My problem was adding a date column based on information in another table, using the values of certain columns as shown below. This may help those confused by dask and hdf5 but more familiar with pandas, like myself.

import pandas as pd

def addDateColumn():
    """Adds time to the daily rainfall data. Reads the csv as chunks of 100k
    rows at a time and outputs them, appending as needed, to a single csv.
    Uses the column of the raster names to get the date.
    """
    df = pd.read_csv(pathlist[1]+"CHIRPS_tanz.csv", iterator=True,
                     chunksize=100000) #read csv file as 100k chunks

    '''Do some stuff'''

    count = 1 #for indexing item in time list
    chunknum = 0 #track how many chunks have been written out
    for chunk in df: #for each 100k rows
        newtime = [] #empty list to append repeating times for different rows
        toiterate = chunk[chunk.columns[2]] #ID of raster nums to base time on
        while count <= toiterate.max():
            for i in toiterate:
                if i == count:
                    newtime.append(newyears[count])
            count += 1
        chunknum += 1
        print("Finished", chunknum, "chunks")
        chunk["time"] = newtime #create new column in dataframe based on time
        outname = "CHIRPS_tanz_time2.csv"
        #append each output to the same csv, using no header
        chunk.to_csv(pathlist[2]+outname, mode='a', header=None, index=None)

One trick I have found helpful for large-data use cases is to reduce the volume of the data by reducing float precision to 32-bit. It's not applicable in all cases, but in many applications 64-bit precision is overkill and the 2x memory savings are worth it. To make an obvious point even more obvious:

>>> df = pd.DataFrame(np.random.randn(int(1e8), 5))
>>> df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000000 entries, 0 to 99999999
Data columns (total 5 columns):
...
dtypes: float64(5)
memory usage: 3.7 GB

>>> df.astype(np.float32).info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000000 entries, 0 to 99999999
Data columns (total 5 columns):
...
dtypes: float32(5)
memory usage: 1.9 GB

As noted by others, after some years an "out-of-core" pandas equivalent has emerged: dask. Though dask is not a drop-in replacement for pandas with all of its functionality, it stands out for several reasons:

Dask is a flexible parallel computing library for analytic computing that is optimized for dynamic task scheduling for interactive computational workloads of "Big Data" collections like parallel arrays, dataframes, and lists, which extend common interfaces like NumPy, Pandas, or Python iterators to larger-than-memory or distributed environments and scale from laptops to clusters.

Dask emphasizes the following virtues:
- Familiar: provides parallelized NumPy array and Pandas DataFrame objects
- Flexible: provides a task scheduling interface for more custom workloads and integration with other projects
- Native: enables distributed computing in pure Python with access to the PyData stack
- Fast: operates with low overhead, low latency, and minimal serialization necessary for fast numerical algorithms
- Scales up: runs resiliently on clusters with thousands of cores
- Scales down: trivial to set up and run on a laptop in a single process
- Responsive: designed with interactive computing in mind, it provides rapid feedback and diagnostics to aid humans

And to add a simple code example:

import dask.dataframe as dd
df = dd.read_csv('2015-*-*.csv')
df.groupby(df.user_id).value.mean().compute()

which replaces some pandas code like this:

import pandas as pd
df = pd.read_csv('2015-01-01.csv')
df.groupby(df.user_id).value.mean()

And, especially noteworthy, it provides through the concurrent.futures interface a general infrastructure for submitting custom tasks:

from dask.distributed import Client
client = Client('scheduler:port')

futures = []
for fn in filenames:
    future = client.submit(load, fn)
    futures.append(future)

summary = client.submit(summarize, futures)
summary.result()

It is worth mentioning Ray here as well; it is a distributed computation framework that has its own implementation of pandas in a distributed way.

Just replace the pandas import, and the code should work as is:

# import pandas as pd
import ray.dataframe as pd

# use pd as usual

More details can be found here:

https://rise.cs.berkeley.edu/blog/pandas-on-ray/


Update: the part that handles the pandas distribution has been extracted into the modin project.

The proper way of using it is now:

# import pandas as pd
import modin.pandas as pd

I'd like to point out the Vaex package.

Vaex is a python library for lazy out-of-core DataFrames (similar to Pandas), to visualize and explore big tabular datasets. It can calculate statistics such as mean, sum, count, standard deviation etc., on an N-dimensional grid up to a billion (10^9) objects/rows per second. Visualization is done using histograms, density plots and 3d volume rendering, allowing interactive exploration of big data. Vaex uses memory mapping, a zero memory copy policy and lazy computations for best performance (no memory wasted).

Have a look at the documentation: https://vaex.readthedocs.io/en/latest/ The API is very close to the API of pandas.
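A tiny hedged example (it assumes an existing big.hdf5 file with an x column):

import vaex

df = vaex.open('big.hdf5')            # memory-mapped; nothing is loaded eagerly
print(df.count(), df.mean(df.x))      # statistics are computed out of core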


At the moment I am working "like" you, just on a much smaller scale, which is why I don't have a PoC for my suggestion.

However, I seem to have found success in using pickle as a caching system and outsourcing the execution of various functions into files, executing those files from my command/main file; for example, I use a prepare_use.py to convert object types and split a data set into test, validation and prediction data sets.

How does my caching with pickle work? I use strings to access dynamically created pickle files, depending on which parameters and data sets were passed (with that I try to capture and determine whether the program has already been run, using .shape for the data set and a dict for the passed parameters). Respecting these measures, I get a string with which I try to find and read a .pickle file; if it is found, I can skip the processing time and jump straight to the work I am on right now.
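A loose sketch of that pickle-cache idea (the key scheme and the cached_step wrapper are my own invention, not the poster's code):

import os
import pickle

def cached_step(name, params, dataset, compute):
    # build a string key from the step name, data shape and parameters
    key = "{}_{}_{}.pickle".format(name, dataset.shape, sorted(params.items()))
    if os.path.exists(key):
        with open(key, 'rb') as fh:      # this combination has already been run
            return pickle.load(fh)
    result = compute(dataset, **params)
    with open(key, 'wb') as fh:          # cache the result for the next run
        pickle.dump(result, fh)
    return result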

Using databases I encountered similar problems, which is why I found joy in using this solution. However, there are certainly many constraints, for example storing huge pickle sets due to redundancy. Updating a table from before to after a transformation can be done with proper indexing; validating information opens up a whole other book (I tried consolidating crawled rent data and basically stopped using a database after 2 hours, as I would have liked to jump back after every transformation process).

I hope my 2 cents help you in some way.

Regards.


The Parquet file format is ideal for the use case you describe. You can efficiently read in a specific subset of columns with pd.read_parquet(path_to_file, columns=["foo", "bar"]).

https://pandas.pydata.org/docs/reference/api/pandas.read_parquet.html
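For instance, a minimal hedged example (file and column names are placeholders, and a parquet engine such as pyarrow needs to be installed): convert once, then read back only the columns you need:

import pandas as pd

df = pd.DataFrame({'foo': [1, 2], 'bar': ['a', 'b'], 'baz': [0.1, 0.2]})
df.to_parquet('consumers.parquet')                                      # one-time conversion
subset = pd.read_parquet('consumers.parquet', columns=['foo', 'bar'])   # reads only those columns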