Having spent a decent amount of time watching both the r and pandas tags on SO, the impression that I get is that pandas questions are less likely to contain reproducible data. This is something that the R community has been pretty good about encouraging, and thanks to guides like this, newcomers are able to get some help on putting together these examples. People who are able to read these guides and come back with reproducible data will often have much better luck getting answers to their questions.

How do we create good reproducible examples for pandas questions? Simple DataFrames can be put together easily, e.g.:

import pandas as pd
df = pd.DataFrame({'user': ['Bob', 'Jane', 'Alice'], 
                   'income': [40000, 50000, 42000]})

But many example datasets need more complicated structure, e.g.:

- datetime indices or timeseries data
- multiple categorical variables (is there an equivalent to R's expand.grid() function, which produces all possible combinations of some given variables?)
- MultiIndex or panel data

For datasets that are hard to mock up in a few lines of code, is there an equivalent to R's dput() that lets you generate copy-pasteable code to regenerate the data structure?


Note: most of the ideas here are pretty generic for Stack Overflow, indeed for questions in general. See Minimal, reproducible examples; see also Short, Self Contained, Correct Example.

Disclaimer: writing a good question is hard.

The Good:

- Do include a small example DataFrame, either as runnable code:

      In [1]: df = pd.DataFrame([[1, 2], [1, 3], [4, 6]], columns=['A', 'B'])

  or make it "copy and pasteable" using pd.read_clipboard(sep='\s\s+'). You can format the text for Stack Overflow by highlighting and using Ctrl+K (or prepend four spaces to each line), or place three backticks (```) above and below your code with your code unindented:

      In [2]: df
      Out[2]:
         A  B
      0  1  2
      1  1  3
      2  4  6

  Test pd.read_clipboard(sep='\s\s+') yourself.

- I really do mean small. The vast majority of example DataFrames could be fewer than 6 rows,[citation needed] and I bet I can do it in 5 rows. Can you reproduce the error with df = df.head()? If not, fiddle around to see if you can make up a small DataFrame which exhibits the issue you are facing.

  But every rule has an exception, the obvious one being for performance issues (in which case definitely use %timeit and possibly %prun), where you should generate:

      df = pd.DataFrame(np.random.randn(100000000, 10))

  Consider using np.random.seed so we have the exact same frame. Having said that, "make this code fast for me" is not strictly on topic for the site.

- Write out the outcome you desire (similarly to above):

      In [3]: iwantthis
      Out[3]:
         A  B
      0  1  5
      1  4  6

  Explain where the numbers come from: the 5 is the sum of the B column for the rows where A is 1.

- Do show the code you've tried:

      In [4]: df.groupby('A').sum()
      Out[4]:
         B
      A
      1  5
      4  6

  But say what's incorrect: the A column is in the index rather than a column.

- Do show you've done some research (search the documentation, search Stack Overflow), and give a summary: the docstring for sum simply states "Compute sum of group values", and the groupby documentation doesn't give any examples for this.

  Aside: the answer here is to use df.groupby('A', as_index=False).sum() (see the runnable sketch after this list).

- If it's relevant that you have Timestamp columns, e.g. you're resampling or something, then be explicit and apply pd.to_datetime to them for good measure:

      df['date'] = pd.to_datetime(df['date'])  # this column ought to be date

  Sometimes this is the issue itself: they were strings.
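To make the thread easy to follow end to end, here is a minimal, self-contained sketch (an addition, not part of the original answer) tying the points above together: the small example frame, the as_index=False fix from the aside, and the pd.to_datetime tip.

import pandas as pd

df = pd.DataFrame([[1, 2], [1, 3], [4, 6]], columns=['A', 'B'])

# Keep 'A' as a regular column instead of the index (the fix from the aside)
result = df.groupby('A', as_index=False).sum()
print(result)
#    A  B
# 0  1  5
# 1  4  6

# If a column holds dates as strings, convert it explicitly before resampling etc.
# ('date' is a hypothetical column name used only for illustration)
# df['date'] = pd.to_datetime(df['date'])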

The Bad:

- Don't include a MultiIndex, which we can't copy and paste (see above). This is kind of a grievance with pandas' default display, but nonetheless annoying:

      In [11]: df
      Out[11]:
           C
      A B
      1 2  3
        2  6

  The correct way is to include an ordinary DataFrame with a set_index call:

      In [12]: df = pd.DataFrame([[1, 2, 3], [1, 2, 6]], columns=['A', 'B', 'C'])
      In [13]: df = df.set_index(['A', 'B'])
      In [14]: df
      Out[14]:
           C
      A B
      1 2  3
        2  6

- Do provide insight into what it is when giving the outcome you want:

         B
      A
      1  1
      5  0

  Be specific about how you got the numbers (what are they)... double check they're correct.

- If your code throws an error, do include the entire stack trace. This can be edited out later if it's too noisy. Show the line number and the corresponding line of your code which it's raising against.

The Ugly:

- Don't link to a CSV file we don't have access to (and ideally don't link to an external source at all):

      df = pd.read_csv('my_secret_file.csv')  # ideally with lots of parsing options

  Most data is proprietary, we get that. Make up similar data and see if you can reproduce the problem (something small).

- Don't explain the situation vaguely in words, like you have a DataFrame which is "large", mention some of the column names in passing (be sure not to mention their dtypes). Try and go into lots of detail about something which is completely meaningless without seeing the actual context. Presumably no one is even going to read to the end of this paragraph. Essays are bad; it's easier with small examples.

- Don't include 10+ (100+??) lines of data munging before getting to your actual question. Please, we see enough of this in our day jobs. We want to help, but not like this.... Cut the intro, and just show the relevant DataFrames (or small versions of them) in the step which is causing you trouble.


How to create sample datasets

This is mainly to expand on AndyHayden's answer by providing examples of how you can create sample DataFrames. Pandas and (especially) NumPy give you a variety of tools for this, such that you can generally create a reasonable facsimile of any real dataset with only a few lines of code.

After importing NumPy and Pandas, be sure to provide a random seed if you want people to be able to exactly reproduce your data and results.

import numpy as np
import pandas as pd

np.random.seed(123)

Kitchen sink example

The following example shows a variety of things you can do. All kinds of useful sample DataFrames can be created from a subset of this:

df = pd.DataFrame({

    # some ways to create random data
    'a':np.random.randn(6),
    'b':np.random.choice( [5,7,np.nan], 6),
    'c':np.random.choice( ['panda','python','shark'], 6),

    # some ways to create systematic groups for indexing or groupby
    # this is similar to R's expand.grid(), see note 2 below
    'd':np.repeat( range(3), 2 ),
    'e':np.tile(   range(2), 3 ),

    # a date range and set of random dates
    'f':pd.date_range('1/1/2011', periods=6, freq='D'),
    'g':np.random.choice( pd.date_range('1/1/2011', periods=365,
                          freq='D'), 6, replace=False)
    })

That produces:

          a   b       c  d  e          f          g
0 -1.085631 NaN   panda  0  0 2011-01-01 2011-08-12
1  0.997345   7   shark  0  1 2011-01-02 2011-11-10
2  0.282978   5   panda  1  0 2011-01-03 2011-10-30
3 -1.506295   7  python  1  1 2011-01-04 2011-09-07
4 -0.578600 NaN   shark  2  0 2011-01-05 2011-02-27
5  1.651437   7  python  2  1 2011-01-06 2011-02-03

A few notes:

1. np.repeat and np.tile (columns d and e) are very useful for creating groups and indices in a very regular way. For 2 columns, this can be used to easily duplicate R's expand.grid(), but is also more flexible in the ability to provide a subset of all permutations. However, for 3 or more columns the syntax quickly becomes unwieldy.

2. For a more direct replacement for R's expand.grid(), see the itertools solution in the pandas cookbook or the np.meshgrid solution shown here; those allow any number of dimensions. (A sketch of the itertools approach follows below.)

3. You can do quite a bit with np.random.choice. For example, in column g, we have a random selection of six dates from 2011. Additionally, by setting replace=False we can assure these dates are unique -- very handy if we want to use this as an index with unique values.
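As a reference for note 2, here is a minimal sketch of an expand.grid()-style helper built on itertools.product, along the lines of the cookbook recipe mentioned above (the helper name expand_grid is an illustration, not a pandas function):

import itertools
import pandas as pd

def expand_grid(**columns):
    # Cartesian product of the given column values, similar to R's expand.grid()
    names = list(columns)
    rows = itertools.product(*columns.values())
    return pd.DataFrame.from_records(rows, columns=names)

# every combination of 3 groups x 2 flags x 2 dates -> 12 rows
grid = expand_grid(group=range(3),
                   flag=[True, False],
                   date=pd.date_range('1/1/2011', periods=2, freq='D'))
print(grid.shape)  # (12, 3)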

Fake stock market data

Beyond just taking subsets of the above code, you can further combine the techniques to do just about anything. For example, here's a short example that combines np.tile and date_range to create sample ticker data for 4 stocks covering the same dates:

stocks = pd.DataFrame({
    'ticker':np.repeat( ['aapl','goog','yhoo','msft'], 25 ),
    'date':np.tile( pd.date_range('1/1/2011', periods=25, freq='D'), 4 ),
    'price':(np.random.randn(100).cumsum() + 10) })

Now we have a sample dataset with 100 rows (25 dates for each ticker), but we only used 4 lines of code to create it, so everyone else can reproduce it easily without having to copy and paste 100 rows. You can then display subsets of the data if it helps explain your question:

>>> stocks.head(5)

        date      price ticker
0 2011-01-01   9.497412   aapl
1 2011-01-02  10.261908   aapl
2 2011-01-03   9.438538   aapl
3 2011-01-04   9.515958   aapl
4 2011-01-05   7.554070   aapl

>>> stocks.groupby('ticker').head(2)

         date      price ticker
0  2011-01-01   9.497412   aapl
1  2011-01-02  10.261908   aapl
25 2011-01-01   8.277772   goog
26 2011-01-02   7.714916   goog
50 2011-01-01   5.613023   yhoo
51 2011-01-02   6.397686   yhoo
75 2011-01-01  11.736584   msft
76 2011-01-02  11.944519   msft

One of the most challenging aspects of answering SO questions is the time it takes to recreate the problem (including the data). Questions which don't have a clear way to reproduce the data are less likely to be answered. Given that you are taking the time to write a question and you have an issue that you'd like help with, you can easily help yourself by providing data that others can then use to help solve your problem.

The instructions provided by @Andy for writing good pandas questions are an excellent place to start. For more information, refer to how to ask and how to create Minimal, Complete, and Verifiable examples.

Please clearly state your question up front. After taking the time to write your question and any sample code, try to read it through and provide an "executive summary" for your reader which summarizes the problem and clearly states the question.

Original question:

- I have this data...
- I want to do this...
- I want my result to look like this...
- However, when I try to do [this], I get the following problem...
- I have tried to find solutions by doing [this] and [that].
- How do I fix it?

Depending on the amount of data, sample code, and error stacks provided, the reader has to go a long way before understanding what the problem is. Try restating your question so that the question itself is up front, and then provide the necessary details.

Revised question:

- Question: How do I do [this]?
- I have tried to find solutions by doing [this] and [that].
- When I've tried to do [this], I get the following problem...
- I'd like my final result to look like this...
- Here is some minimal code that can reproduce my problem...
- And here's how to recreate my sample data: df = pd.DataFrame({'A': [...], 'B': [...], ...})

Provide sample data if needed!!

Sometimes only the head or tail of the DataFrame is needed. You can also use the methods proposed by @JohnE to create larger datasets that can be reproduced by others. Using his example to generate a 100-row DataFrame of stock prices:

stocks = pd.DataFrame({ 
    'ticker':np.repeat( ['aapl','goog','yhoo','msft'], 25 ),
    'date':np.tile( pd.date_range('1/1/2011', periods=25, freq='D'), 4 ),
    'price':(np.random.randn(100).cumsum() + 10) })

If this were your actual data, you might just want to include the head and/or tail of the DataFrame as follows (be sure to anonymize any sensitive data). A sketch of how an answerer can rebuild the frame from this kind of output follows after the example.

>>> stocks.head(5).to_dict()
{'date': {0: Timestamp('2011-01-01 00:00:00'),
  1: Timestamp('2011-01-01 00:00:00'),
  2: Timestamp('2011-01-01 00:00:00'),
  3: Timestamp('2011-01-01 00:00:00'),
  4: Timestamp('2011-01-02 00:00:00')},
 'price': {0: 10.284260107718254,
  1: 11.930300761831457,
  2: 10.93741046217319,
  3: 10.884574289565609,
  4: 11.78005850418319},
 'ticker': {0: 'aapl', 1: 'aapl', 2: 'aapl', 3: 'aapl', 4: 'aapl'}}

>>> pd.concat([stocks.head(), stocks.tail()], ignore_index=True).to_dict()
{'date': {0: Timestamp('2011-01-01 00:00:00'),
  1: Timestamp('2011-01-01 00:00:00'),
  2: Timestamp('2011-01-01 00:00:00'),
  3: Timestamp('2011-01-01 00:00:00'),
  4: Timestamp('2011-01-02 00:00:00'),
  5: Timestamp('2011-01-24 00:00:00'),
  6: Timestamp('2011-01-25 00:00:00'),
  7: Timestamp('2011-01-25 00:00:00'),
  8: Timestamp('2011-01-25 00:00:00'),
  9: Timestamp('2011-01-25 00:00:00')},
 'price': {0: 10.284260107718254,
  1: 11.930300761831457,
  2: 10.93741046217319,
  3: 10.884574289565609,
  4: 11.78005850418319,
  5: 10.017209045035006,
  6: 10.57090128181566,
  7: 11.442792747870204,
  8: 11.592953372130493,
  9: 12.864146419530938},
 'ticker': {0: 'aapl',
  1: 'aapl',
  2: 'aapl',
  3: 'aapl',
  4: 'aapl',
  5: 'msft',
  6: 'msft',
  7: 'msft',
  8: 'msft',
  9: 'msft'}}
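For completeness, here is a minimal sketch (an addition, not part of the original answer) of how an answerer can turn a posted to_dict output back into a DataFrame; the dict below is a hypothetical two-row excerpt of the output above:

import pandas as pd
from pandas import Timestamp  # needed so a pasted dict literal evaluates as-is

# hypothetical two-row excerpt copied from the posted to_dict() output
posted = {'date': {0: Timestamp('2011-01-01 00:00:00'),
                   1: Timestamp('2011-01-01 00:00:00')},
          'price': {0: 10.284260107718254,
                    1: 11.930300761831457},
          'ticker': {0: 'aapl', 1: 'aapl'}}

df = pd.DataFrame(posted)
print(df.dtypes)  # date becomes datetime64[ns], price float64, ticker object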

You may also want to provide a description of the DataFrame (using only the relevant columns). This makes it easier for others to check each column's dtype and identify other common errors (e.g. dates stored as string vs. datetime64 vs. object):

stocks.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 100 entries, 0 to 99
Data columns (total 3 columns):
date      100 non-null datetime64[ns]
price     100 non-null float64
ticker    100 non-null object
dtypes: datetime64[ns](1), float64(1), object(1)

Note: If your DataFrame has a MultiIndex:

If your DataFrame has a MultiIndex, you must first reset it before calling to_dict. You then need to recreate the index using set_index:

# MultiIndex example.  First create a MultiIndex DataFrame.
df = stocks.set_index(['date', 'ticker'])
>>> df
                       price
date       ticker           
2011-01-01 aapl    10.284260
           aapl    11.930301
           aapl    10.937410
           aapl    10.884574
2011-01-02 aapl    11.780059
...

# After resetting the index and passing the DataFrame to `to_dict`, make sure to use 
# `set_index` to restore the original MultiIndex.  This DataFrame can then be restored.

d = df.reset_index().to_dict()
df_new = pd.DataFrame(d).set_index(['date', 'ticker'])
>>> df_new.head()
                       price
date       ticker           
2011-01-01 aapl    10.284260
           aapl    11.930301
           aapl    10.937410
           aapl    10.884574
2011-01-02 aapl    11.780059

Diary of an Answerer

My best advice for asking questions is to play on the psychology of the people who answer them. Being one of those people, I can give insight into why I answer certain questions and why I don't answer others.

Motivation

I'm motivated to answer questions for several reasons:

1. Stackoverflow.com has been a hugely valuable resource to me. I wanted to give back.
2. In my efforts to give back, I've found this site to be an even more powerful resource than before. Answering questions is a learning experience for me, and I like to learn. Reading answers and comments from other veterans is part of that; the interaction makes it fun for me.
3. I like points!
4. See #3.
5. I like interesting problems.

All my purest intentions are great, but I get that satisfaction whether I answer 1 question or 30. What drives my choice of which questions to answer is, to a large extent, point maximization.

I'll also spend time on interesting problems, but those are few and far between, and that doesn't help an asker who needs a solution to an uninteresting question. Your best bet for getting me to answer a question is to serve it up on a platter, ripe for me to answer with as little effort as possible. If I'm looking at two questions and one has code I can copy and paste to create all the variables I need... I'm taking that one! I'll come back to the other one if I have time, maybe.

Main advice

Make it easy for the people answering questions.

- Provide code that creates the variables that are needed.
- Minimize that code. If my eyes glaze over as I look at the post, I'm on to the next question or getting back to whatever else I'm doing.
- Think about what you're asking and be specific. We want to see what you've done, because natural languages (English) are inexact and confusing. Code samples of what you've tried help resolve inconsistencies in a natural-language description.
- PLEASE show what you expect!!! I have to sit down and try things. I almost never know the answer to a question without trying some things out. If I don't see an example of what you're looking for, I might pass on the question because I don't feel like guessing.

Your reputation is more than just your reputation.

I do like points (I mentioned that above). But those points aren't really my reputation. My real reputation is an amalgamation of what others on the site think of me. I strive to be fair and honest, and I hope others can see that. What that means for an asker is: we remember the behaviors of askers. If you don't select answers and upvote good answers, I remember. If you behave in ways I don't like, or in ways I do like, I remember. This also plays into which questions I'll answer.


Anyway, I could probably go on, but I'll spare those of you who've read this far.


Here is my version of dput - the standard R tool for producing reproducible reports - for Pandas DataFrames. It will probably fail for more complex frames, but it seems to do the job in simple cases:

import pandas as pd

def dput(x):
    # Relies on the repr of the index (e.g. RangeIndex(...), Index([...], dtype='object'))
    # evaluating correctly once prefixed with "pd."
    if isinstance(x, pd.Series):
        return "pd.Series(%s,dtype='%s',index=pd.%s)" % (list(x), x.dtype, x.index)
    if isinstance(x, pd.DataFrame):
        # Emit each column as a Series, then reattach the frame's index
        return "pd.DataFrame({" + ", ".join([
            "'%s': %s" % (c, dput(x[c])) for c in x.columns]) + (
                "}, index=pd.%s)" % (x.index))
    raise NotImplementedError("dput", type(x), x)

now,

df = pd.DataFrame({'a':[1,2,3,4,2,1,3,1]})
assert df.equals(eval(dput(df)))
du = pd.get_dummies(df.a,"foo")
assert du.equals(eval(dput(du)))
di = df
di.index = list('abcdefgh')
assert di.equals(eval(dput(di)))

Note that this produces much more verbose output than DataFrame.to_dict, e.g.

pd.DataFrame({
    'foo_1': pd.Series([1, 0, 0, 0, 0, 1, 0, 1], dtype='uint8', index=pd.RangeIndex(start=0, stop=8, step=1)),
    'foo_2': pd.Series([0, 1, 0, 0, 1, 0, 0, 0], dtype='uint8', index=pd.RangeIndex(start=0, stop=8, step=1)),
    'foo_3': pd.Series([0, 0, 1, 0, 0, 0, 1, 0], dtype='uint8', index=pd.RangeIndex(start=0, stop=8, step=1)),
    'foo_4': pd.Series([0, 0, 0, 1, 0, 0, 0, 0], dtype='uint8', index=pd.RangeIndex(start=0, stop=8, step=1))
}, index=pd.RangeIndex(start=0, stop=8, step=1))

vs

{'foo_1': {0: 1, 1: 0, 2: 0, 3: 0, 4: 0, 5: 1, 6: 0, 7: 1},
 'foo_2': {0: 0, 1: 1, 2: 0, 3: 0, 4: 1, 5: 0, 6: 0, 7: 0},
 'foo_3': {0: 0, 1: 0, 2: 1, 3: 0, 4: 0, 5: 0, 6: 1, 7: 0},
 'foo_4': {0: 0, 1: 0, 2: 0, 3: 1, 4: 0, 5: 0, 6: 0, 7: 0}}

for the du above, but it preserves the column types. For example, in the above test case,

du.equals(pd.DataFrame(du.to_dict()))
==> False

because du.dtypes is uint8 while pd.DataFrame(du.to_dict()).dtypes is int64.
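If you do want to stick with to_dict, one hedged workaround (an addition, not part of the original answer) is to post the dtypes alongside the data so the frame can be rebuilt with astype:

import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4, 2, 1, 3, 1]})
du = pd.get_dummies(df.a, prefix='foo')

# share both the data and the column dtypes
data = du.to_dict()
dtypes = du.dtypes.astype(str).to_dict()  # e.g. {'foo_1': 'uint8', ...} ('bool' on newer pandas)

# the reader rebuilds the frame and restores the original column types
du_rebuilt = pd.DataFrame(data).astype(dtypes)
assert du_rebuilt.equals(du)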