



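1. Load flat files into a permanent, on-disk database structure.
2. Query that database to retrieve data to feed into a pandas data structure.
3. Update the database after manipulating pieces in pandas.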



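1. Iteratively import a large flat file and store it in a permanent, on-disk database structure. These files are typically too large to fit in memory.
2. In order to use Pandas, I would like to read subsets of this data (usually just a few columns at a time) that can fit in memory.
3. I would create new columns by performing various operations on the selected columns.
4. I would then have to append these new columns into the database structure.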
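As a rough sketch of that workflow, here is one way it could look with the standard-library `sqlite3` module as the on-disk store. The file name, the table names (`consumers`, `derived`), the `row_id` key and the tiny chunk size are all made up for illustration; a real file would use a much larger `chunksize`, and other stores would work just as well.

```python
import os
import sqlite3
import tempfile

import pandas as pd

# Build a small stand-in for the "large" flat file (hypothetical columns).
workdir = tempfile.mkdtemp()
csv_path = os.path.join(workdir, "consumers.csv")
pd.DataFrame({
    "var1": [1, 3, 0, 5, 2, 1],
    "var2": [4, 0, 4, 1, 2, 9],
    "other": list("abcdef"),
}).to_csv(csv_path, index_label="row_id")

con = sqlite3.connect(os.path.join(workdir, "consumers.db"))

# Step 1: import the file in chunks so it never has to fit in memory at once.
for chunk in pd.read_csv(csv_path, chunksize=2):
    chunk.to_sql("consumers", con, if_exists="append", index=False)

# Step 2: read back only the columns needed for this round of work.
subset = pd.read_sql_query("SELECT row_id, var1, var2 FROM consumers", con)

# Step 3: derive a new column from the selected ones.
subset["var_sum"] = subset["var1"] + subset["var2"]

# Step 4: store the new column in its own table keyed by row_id,
# so it can be joined back to the original columns later.
subset[["row_id", "var_sum"]].to_sql(
    "derived", con, if_exists="replace", index=False
)

joined = pd.read_sql_query(
    "SELECT c.row_id, c.other, d.var_sum "
    "FROM consumers AS c JOIN derived AS d ON c.row_id = d.row_id "
    "ORDER BY c.row_id",
    con,
)
con.close()
```

Keeping derived columns in a separate table avoids rewriting the wide original table every time a new column is added, at the cost of a join when both are needed.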



I am building consumer credit risk models. The kinds of data include phone, SSN and address characteristics; property values; derogatory information like criminal records, bankruptcies, etc. The datasets I use every day have roughly 1,000 to 2,000 fields on average, of mixed data types: continuous, nominal and ordinal variables of both numeric and character data. I rarely append rows, but I do perform many operations that create new columns.

Typical operations involve combining several columns using conditional logic into a new, compound column. For example, if var1 > 2 then newvar = 'A' elif var2 == 4 then newvar = 'B'. The result of these operations is a new column for every record in my dataset. Finally, I would like to append these new columns into the on-disk data structure. I would repeat step 2, exploring the data with crosstabs and descriptive statistics, trying to find interesting, intuitive relationships to model.

A typical project file is usually about 1GB. Files are organized so that a row consists of a record of consumer data. Each row has the same number of columns for every record. This will always be the case.

It's pretty rare that I would subset by rows when creating a new column. However, it's pretty common for me to subset on rows when creating reports or generating descriptive statistics. For example, I might want to create a simple frequency for a specific line of business, say Retail credit cards. To do this, I would select only those records where the line of business = retail, in addition to whichever columns I want to report on. When creating new columns, however, I would pull all rows of data and only the columns I need for the operations.

The modeling process requires that I analyze every column, look for interesting relationships with some outcome variable, and create new compound columns that describe those relationships. I usually explore the columns in small sets. For example, I will focus on a set of, say, 20 columns just dealing with property values and observe how they relate to defaulting on a loan. Once those are explored and new columns are created, I then move on to another group of columns, say college education, and repeat the process. What I'm doing is creating candidate variables that explain the relationship between my data and some outcome. At the very end of this process, I apply some learning techniques that create an equation out of those compound columns.
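The conditional-logic example and the row-subset report described above could be sketched in pandas roughly like this. The column `line_of_business`, the sample values, and the `"other"` fallback for rows that match neither condition are assumptions added for illustration; the question does not say what should happen in that case.

```python
import numpy as np
import pandas as pd

# Tiny made-up sample; real data would have all rows and only the needed columns.
df = pd.DataFrame({
    "var1": [3, 1, 5, 0],
    "var2": [4, 4, 0, 7],
    "line_of_business": ["retail", "retail", "auto", "retail"],
})

# Conditions are checked in order, like if / elif; the first match wins.
conditions = [df["var1"] > 2, df["var2"] == 4]
choices = ["A", "B"]
df["newvar"] = np.select(conditions, choices, default="other")

# Row subsetting for a report: frequency of newvar within retail only.
retail = df.loc[df["line_of_business"] == "retail", ["newvar"]]
freq = retail["newvar"].value_counts()
```

`np.select` evaluates whole columns at once, which tends to be much faster than applying a Python function row by row.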










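If your datasets are between 1 and 20GB, you should get a workstation with 48GB of RAM. Then Pandas can hold the entire dataset in RAM. I know it's not the answer you're looking for here, but doing scientific computing on a notebook with 4GB of RAM isn't reasonable.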



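Blaze extends the usability of NumPy and Pandas to distributed and out-of-core computing. Blaze provides an interface similar to that of the NumPy ND-Array or Pandas DataFrame, but maps these familiar interfaces onto a variety of other computational engines like Postgres or Spark.

Edit: By the way, it's supported by ContinuumIO and Travis Oliphant (author of NumPy).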


>>> import numpy as np
>>> import pandas as pd
>>> df = pd.DataFrame(np.random.randn(int(1e8), 5))
>>> df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000000 entries, 0 to 99999999
Data columns (total 5 columns):
dtypes: float64(5)
memory usage: 3.7 GB

>>> df.astype(np.float32).info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000000 entries, 0 to 99999999
Data columns (total 5 columns):
dtypes: float32(5)
memory usage: 1.9 GB
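The same comparison can be tried on a frame small enough for any machine. This is only a sketch: the 1000-row size and the seed are arbitrary, and it also measures the rounding error that the downcast introduces, which is the trade-off to weigh before converting real data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.standard_normal((1000, 5)))

# Bytes used by the column data alone (index excluded).
bytes_f64 = df.memory_usage(index=False).sum()

df32 = df.astype(np.float32)
bytes_f32 = df32.memory_usage(index=False).sum()

# float32 keeps roughly 7 significant decimal digits, so values shift slightly.
max_abs_error = (df - df32.astype(np.float64)).abs().to_numpy().max()

print(bytes_f64, bytes_f32, max_abs_error)
```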


Vaex is a Python library for lazy Out-of-Core DataFrames (similar to Pandas), to visualize and explore big tabular datasets. It can calculate statistics such as mean, sum, count, standard deviation, etc., on an N-dimensional grid at up to a billion (10^9) objects/rows per second. Visualization is done using histograms, density plots and 3d volume rendering, allowing interactive exploration of big data. Vaex uses memory mapping, a zero memory copy policy and lazy computations for best performance (no memory wasted).

Have a look at the documentation: https://vaex.readthedocs.io/en/latest/ The API is very close to the pandas API.


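Dask is a flexible parallel computing library for analytic computing. It combines dynamic task scheduling optimized for interactive computational workloads with "Big Data" collections like parallel arrays, dataframes, and lists. These collections extend common interfaces like NumPy, Pandas, or Python iterators to larger-than-memory or distributed environments, and scale from laptops to clusters.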

Dask emphasizes the following virtues:

Familiar: Provides parallelized NumPy array and Pandas DataFrame objects.
Flexible: Provides a task scheduling interface for more custom workloads and integration with other projects.
Native: Enables distributed computing in pure Python with access to the PyData stack.
Fast: Operates with low overhead, low latency, and the minimal serialization necessary for fast numerical algorithms.
Scales up: Runs resiliently on clusters with 1000s of cores.
Scales down: Trivial to set up and run on a laptop in a single process.
Responsive: Designed with interactive computing in mind, it provides rapid feedback and diagnostics to aid humans.


import dask.dataframe as dd
df = dd.read_csv('2015-*-*.csv')


import pandas as pd
df = pd.read_csv('2015-01-01.csv')
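To make the "larger than memory" idea a bit more concrete, here is a rough sketch in plain pandas (not Dask's API) of computing a per-group mean over several files by combining partial results chunk by chunk. The columns `user_id` and `value`, the file contents and the tiny chunk size are made up for illustration.

```python
import glob
import os
import tempfile

import pandas as pd

# Write two small stand-in files that match the 2015-*-*.csv pattern.
workdir = tempfile.mkdtemp()
pd.DataFrame({"user_id": [1, 2, 1], "value": [10.0, 20.0, 30.0]}).to_csv(
    os.path.join(workdir, "2015-01-01.csv"), index=False
)
pd.DataFrame({"user_id": [2, 3], "value": [40.0, 50.0]}).to_csv(
    os.path.join(workdir, "2015-01-02.csv"), index=False
)

# Reduce each chunk to per-user sums and counts; only these small
# partial results are kept in memory, never the full data.
partials = []
for path in sorted(glob.glob(os.path.join(workdir, "2015-*-*.csv"))):
    for chunk in pd.read_csv(path, chunksize=2):
        partials.append(chunk.groupby("user_id")["value"].agg(["sum", "count"]))

# Combine the partial results, then finish the mean.
totals = pd.concat(partials).groupby(level=0).sum()
mean_by_user = totals["sum"] / totals["count"]
```

A mean has to be rebuilt from sums and counts because averaging the per-chunk means would weight small chunks too heavily.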


from dask.distributed import Client
client = Client('scheduler:port')

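# filenames, load and summarize are assumed to be defined elsewhere
futures = []
for fn in filenames:
    future = client.submit(load, fn)
    futures.append(future)  # collect each future so summarize receives them

summary = client.submit(summarize, futures)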
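For readers without a Dask scheduler at hand, the same submit-and-collect pattern can be tried with the standard library's `concurrent.futures`, which uses a similar `submit` call. This is a stand-in, not Dask: `load`, `summarize` and the file names below are hypothetical placeholders, and nothing is actually read from disk.

```python
from concurrent.futures import ThreadPoolExecutor


def load(fn):
    # Hypothetical loader: pretend the record count equals the name length.
    return len(fn)


def summarize(results):
    # Hypothetical summary: total number of records across all files.
    return sum(results)


filenames = ["a.csv", "bb.csv", "ccc.csv"]

with ThreadPoolExecutor(max_workers=2) as pool:
    futures = [pool.submit(load, fn) for fn in filenames]
    # Unlike Dask, the standard library does not resolve futures passed
    # as arguments, so the results are collected explicitly first.
    summary = summarize([f.result() for f in futures])

print(summary)
```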