What is a good strategy for keeping IPython notebooks under version control?

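The IPython notebook format is quite amenable to version control: if one wants to version control both the notebook and the outputs, then this works quite nicely. The annoyance comes when one wants to version control only the input, excluding the cell outputs (a.k.a. "build products"), which can be large binary blobs, especially for movies and plots. In particular, I am trying to find a good workflow that: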

allows me to choose between including or excluding output,
prevents me from accidentally committing output if I do not want it,
allows me to keep output in my local version,
allows me to see when I have changes in the inputs using my version control system (i.e. if I only version control the inputs but my local file has outputs, then I would like to be able to see if the inputs have changed (requiring a commit). Using the version control status command will always register a difference since the local file has outputs.),
allows me to update my working notebook (which contains the output) from an updated clean notebook. (update)

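As mentioned, if I choose to include the outputs (which is desirable when using nbviewer, for example), then everything is fine. The problem arises when I do not want to version control the output. There are some tools and scripts for stripping the output of a notebook, but I frequently run into the following issues: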

I accidentally commit a version with the output, thereby polluting my repository.
I clear output to use version control, but would really rather keep the output in my local copy (sometimes it takes a while to reproduce, for example).
Some of the scripts that strip output change the format slightly compared to the Cell/All Output/Clear menu option, thereby creating unwanted noise in the diffs. This is resolved by some of the answers.
When pulling changes to a clean version of the file, I need to find some way of incorporating those changes into my working notebook without having to rerun everything. (update)

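I have considered several options, which I discuss below, but have yet to find a good comprehensive solution. A full solution might require some changes to IPython, or may rely on some simple external scripts. I currently use mercurial, but would like a solution that also works with git: an ideal solution would be version-control agnostic.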

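This issue has been discussed many times, but there is no definitive or clear solution from the user's perspective. The answer to this question should provide a definitive strategy. It is fine if it requires a recent (even development) version of IPython or an easily installed extension.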

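Update: I have been playing with my modified notebook version, which optionally saves a .clean version with every save, using Gregory Crosswhite's suggestions. This satisfies most of my constraints but leaves the following unresolved: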

This is not yet a standard solution (it requires a modification of the ipython source). Is there a way of achieving this behaviour with a simple extension? It needs some sort of on-save hook.
A problem I have with the current workflow is pulling changes. These will come in to the .clean file, and then need to be integrated somehow into my working version. (Of course, I can always re-execute the notebook, but this can be a pain, especially if some of the results depend on long calculations, parallel computations, etc.) I do not have a good idea about how to resolve this yet. Perhaps a workflow involving an extension like ipycache might work, but that seems a little too complicated.
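One possible way to fold a pulled .clean file back into the working copy is sketched below. It assumes both files are nbformat-4 JSON already loaded as dicts, takes the cell list from the clean notebook, and reuses the stored outputs of any working cell whose source is identical. The function names are hypothetical and not part of IPython. The sketch does not track dependencies between cells, so a reused output can be stale if an earlier cell changed.

```python
import copy


def source_text(cell):
    """nbformat allows 'source' to be a string or a list of lines; normalize it."""
    src = cell.get("source", "")
    return "".join(src) if isinstance(src, list) else src


def merge_clean_into_working(clean, working):
    """Take cells from `clean`; reuse outputs from `working` where code is unchanged."""
    # Index the working code cells by source text. If the same source occurs
    # more than once, the last occurrence wins (a deliberate simplification).
    cached = {}
    for cell in working.get("cells", []):
        if cell.get("cell_type") == "code":
            cached[source_text(cell)] = cell

    merged = copy.deepcopy(clean)
    for cell in merged.get("cells", []):
        if cell.get("cell_type") != "code":
            continue
        old = cached.get(source_text(cell))
        if old is not None:
            # Unchanged input: keep the locally computed result.
            cell["outputs"] = copy.deepcopy(old.get("outputs", []))
            cell["execution_count"] = old.get("execution_count")
    return merged


# Tiny illustration with made-up cells.
working = {"cells": [
    {"cell_type": "code", "source": "x = 1", "execution_count": 3,
     "outputs": [{"output_type": "stream", "name": "stdout", "text": "ok\n"}]},
    {"cell_type": "code", "source": "y = x + 1", "execution_count": 4,
     "outputs": [{"output_type": "stream", "name": "stdout", "text": "2\n"}]},
]}
clean = {"cells": [
    {"cell_type": "code", "source": "x = 1", "execution_count": None, "outputs": []},
    {"cell_type": "code", "source": "y = x + 2", "execution_count": None, "outputs": []},
]}
merged = merge_clean_into_working(clean, working)
```

Only cells whose input changed upstream end up without output, so those are the ones left to re-execute by hand.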

Notes

Removing (stripping) output

When the notebook is running, one can use the Cell/All Output/Clear menu option to remove the output.
There are some scripts for removing output, such as nbstripout.py, which removes the output but does not produce the same result as the notebook interface. This was eventually included in the ipython/nbconvert repo, but that has been closed with a note that the changes are now included in ipython/ipython; the corresponding functionality seems not to have been included yet. (update) That being said, Gregory Crosswhite's solution shows that this is pretty easy to do, even without invoking ipython/nbconvert, so this approach is probably workable if it can be properly hooked in. (Attaching it to each version control system, however, does not seem like a good idea; this should somehow hook in to the notebook mechanism.)
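To illustrate how little is needed to strip output without ipython/nbconvert, here is a minimal sketch using only the standard library. It assumes the nbformat-4 layout (a top-level "cells" list) and is not the code of nbstripout.py or of Gregory Crosswhite's script; the function names are made up for this example. Serializing with fixed settings is one way to avoid formatting noise in diffs, though the result is not guaranteed to match what the Cell/All Output/Clear menu option writes.

```python
import copy
import json


def strip_outputs(notebook):
    """Return a copy of an nbformat-4 notebook dict without code-cell outputs."""
    clean = copy.deepcopy(notebook)  # leave the working notebook untouched
    for cell in clean.get("cells", []):
        if cell.get("cell_type") != "code":
            continue
        cell["outputs"] = []
        cell["execution_count"] = None
    return clean


def dump_notebook(notebook):
    """Serialize with fixed settings so repeated stripping gives identical text."""
    return json.dumps(notebook, indent=1, sort_keys=True, ensure_ascii=False) + "\n"


# Made-up notebook used only to exercise the two functions.
sample = {
    "nbformat": 4,
    "nbformat_minor": 0,
    "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": "Some text"},
        {"cell_type": "code", "metadata": {}, "source": "1 + 1",
         "execution_count": 5,
         "outputs": [{"output_type": "execute_result", "execution_count": 5,
                      "metadata": {}, "data": {"text/plain": "2"}}]},
    ],
}
clean_text = dump_notebook(strip_outputs(sample))
```

Writing clean_text to a separate file (for instance the .clean file mentioned above) keeps the outputs in the local working copy.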

Newsgroups

Thoughts on the notebook format for version control.

Issues

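977: Notebook feature requests (Open).
1280: Clear-all on save option (Open). (Follows from the discussion below.)
3295: autoexported notebooks: only export explicitly marked cells (Closed). Resolved by extension 11, Add write and execute magic (Merged).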

Pull requests

1621: clear In[] prompt numbers on "Clear All Output" (Merged). (See also 2519 (Merged).)
1563: clear_output improvements (Merged).
3065: diff-ability of notebooks (Closed).
3291: Add the option to skip output cells when saving. (Closed). This seems extremely relevant, but was closed with the suggestion to use a "clean/smudge" filter. A relevant question, "what can you use if you want to strip off output before running git diff?", seems not to have been answered.
3312: WIP: Notebook save hooks (Closed).
3747: ipynb -> ipynb transformer (Closed). This is rebased in 4175.
4175: nbconvert: Jinjaless exporter base (Merged).
142: Use STDIN in nbstripout if no input is given (Open).


Current answer

Following up on the excellent script by Pietro Battiston: if you get a Unicode parsing error like this:

Traceback (most recent call last):
  File "/Users/kwisatz/bin/ipynb_output_filter.py", line 33, in <module>
    write(json_in, sys.stdout, NO_CONVERT)
  File "/Users/kwisatz/anaconda/lib/python2.7/site-packages/IPython/nbformat/__init__.py", line 161, in write
    fp.write(s)
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2014' in position 11549: ordinal not in range(128)

You can add the following at the beginning of the script:

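# Python 2 only: neither reload() as a builtin nor setdefaultencoding exists in Python 3.
import sys
reload(sys)  # restores sys.setdefaultencoding, which site.py removes at startup
sys.setdefaultencoding('utf8')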

Other answers

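After digging around, I finally found this relatively simple pre-save hook in the Jupyter docs. It strips the cell output data. You have to paste it into the jupyter_notebook_config.py file (see below for instructions).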

def scrub_output_pre_save(model, **kwargs):
    """scrub output before saving notebooks"""
    # only run on notebooks
    if model['type'] != 'notebook':
        return
    # only run on nbformat v4
    if model['content']['nbformat'] != 4:
        return

    for cell in model['content']['cells']:
        if cell['cell_type'] != 'code':
            continue
        cell['outputs'] = []
        cell['execution_count'] = None
        # Added by binaryfunt:
        if 'collapsed' in cell['metadata']:
            cell['metadata'].pop('collapsed', 0)

c.FileContentsManager.pre_save_hook = scrub_output_pre_save

From Rich Signell's answer:

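If you are not sure which directory holds your jupyter_notebook_config.py file, you can type jupyter --config-dir [into the command prompt/terminal], and if you do not find the file there, you can create it by typing jupyter notebook --generate-config.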

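How about the idea discussed in the post below, where the output of the notebook is kept? The argument is that it may take a long time to generate, and keeping it is handy now that GitHub can render notebooks. The post adds auto-save hooks that export a .py file, used for diffs, and an .html file for sharing with team members who do not use notebooks or git.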

https://towardsdatascience.com/version-control-for-jupyter-notebook-3e6cef13392d
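A post-save hook along those lines might look roughly like the sketch below. It is not taken from the linked post. It assumes the jupyter nbconvert command is on the PATH and that the code lives in jupyter_notebook_config.py. The helper name export_commands is made up for this example.

```python
import os
from subprocess import check_call


def export_commands(fname):
    """Build the nbconvert command lines for one notebook file name."""
    return [
        ["jupyter", "nbconvert", "--to", "script", fname],  # script export for diffs
        ["jupyter", "nbconvert", "--to", "html", fname],    # HTML export for sharing
    ]


def export_post_save(model, os_path, contents_manager, **kwargs):
    """Post-save hook: export each saved notebook next to the .ipynb file."""
    if model["type"] != "notebook":
        return  # leave plain files and directories alone
    directory, fname = os.path.split(os_path)
    for cmd in export_commands(fname):
        # Run inside the notebook's directory so the exports land beside it.
        check_call(cmd, cwd=directory or None)


# In jupyter_notebook_config.py, where Jupyter provides the `c` object:
# c.FileContentsManager.post_save_hook = export_post_save
```

With this in place the .ipynb (with output), the script export, and the .html can all be committed, and the script export is the file to read when reviewing input changes.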

I would also add to the others who suggested https://nbdev.fast.ai/, which is a state-of-the-art "literate programming environment, as envisioned by Donald Knuth back in 1983!"

It also has some git hooks that help (see https://nbdev.fast.ai/#Avoiding-and-handling-git-conflicts) and other commands such as:

nbdev_read_nbs
nbdev_clean_nbs
nbdev_diff_nbs
nbdev_test_nbs

So you can also create your documentation as you go, as if you were writing a library, for example:

https://dev.fast.ai/
https://ohmeow.github.io/blurr/
https://rbracco.github.io/fastai2_audio/

Besides the first link, there is also an nbdev video tutorial.

The very popular 2016 answers above are inconsistent compared with the better approaches available in 2019.

Several options exist; the one that best answers the question is Jupytext.

Jupytext

See the Towards Data Science article on Jupytext.

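The way it works with version control is that you put both the .py and the .ipynb files under version control. Look at the .py if you want the input diff; look at the .ipynb if you want the latest rendered output.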
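The idea behind diffing the text file instead of the .ipynb can be sketched with the standard library alone. The code below is not Jupytext and does not use its API. It is a rough stand-in that renders only the cell inputs as text, with "# %%" cell markers loosely modeled on the percent format, and diffs two notebooks on that basis. The function names are hypothetical, and the notebooks are assumed to be nbformat-4 dicts.

```python
import difflib


def notebook_to_script(notebook):
    """Render cell inputs as plain text, one '# %%' marker per cell."""
    chunks = []
    for cell in notebook.get("cells", []):
        src = cell.get("source", "")
        text = "".join(src) if isinstance(src, list) else src
        if cell.get("cell_type") == "code":
            chunks.append("# %%\n" + text)
        elif cell.get("cell_type") == "markdown":
            # Markdown is kept as comments so the file remains valid Python.
            commented = "\n".join("# " + line for line in text.splitlines())
            chunks.append("# %% [markdown]\n" + commented)
    return "\n\n".join(chunks) + "\n"


def input_diff(old_nb, new_nb):
    """Unified diff of the inputs only; outputs never enter the comparison."""
    old = notebook_to_script(old_nb).splitlines(keepends=True)
    new = notebook_to_script(new_nb).splitlines(keepends=True)
    return "".join(difflib.unified_diff(old, new, "old", "new"))
```

Two notebooks that differ only in their outputs produce an empty diff here, which is the property that makes the text file the convenient one to review.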

Notable mentions: VS studio, nbconvert, nbdime, hydrogen

I think that with a bit more work, VS studio and/or hydrogen (or similar) will become the dominant players in solving this workflow.

I have built a python package that solves this problem:

https://github.com/brookisme/gitnb

It provides a CLI with a git-inspired syntax to track/update/diff notebooks inside your git repo.

Here is an example:

# add a notebook to be tracked
gitnb add SomeNotebook.ipynb

# check the changes before committing
gitnb diff SomeNotebook.ipynb

# commit your changes (to your git repo)
gitnb commit -am "I fixed a bug"

Note that the last step, where I use "gitnb commit", commits to your git repo. It is essentially a wrapper for:

# get the latest changes from your python notebooks
gitnb update

# commit your changes ** this time with the native git commit **
git commit -am "I fixed a bug"

There are several more methods, and it can be configured to require more or less user input at each stage, but that is the general idea.