如何做一个递归子文件夹搜索和返回文件在一个列表?

我正在编写一个脚本，以递归地遍历主文件夹中的子文件夹，并构建一个特定文件类型的列表。我对剧本有点意见。目前设置如下:

for root, subFolder, files in os.walk(PATH):
    for item in files:
        if item.endswith(".txt") :
            fileNamePath = str(os.path.join(root,subFolder,item))

问题是subFolder变量拉入的是子文件夹列表，而不是ITEM文件所在的文件夹。我在考虑之前运行子文件夹的for循环，并加入路径的第一部分，但我想我会仔细检查，看看是否有人有任何建议之前。

当前回答

您最初的解决方案几乎是正确的，但是变量“root”在递归遍历时被动态更新。Os.walk()是一个递归生成器。每个元组(root, subFolder, files)都是针对特定的根目录设置的。

i.e.

root = 'C:\\'
subFolder = ['Users', 'ProgramFiles', 'ProgramFiles (x86)', 'Windows', ...]
files = ['foo1.txt', 'foo2.txt', 'foo3.txt', ...]

root = 'C:\\Users\\'
subFolder = ['UserAccount1', 'UserAccount2', ...]
files = ['bar1.txt', 'bar2.txt', 'bar3.txt', ...]

...

我对你的代码做了轻微的调整，以打印一个完整的列表。

import os
for root, subFolder, files in os.walk(PATH):
    for item in files:
        if item.endswith(".txt") :
            fileNamePath = str(os.path.join(root,item))
            print(fileNamePath)

希望这能有所帮助!

编辑:(根据反馈)

OP误解/错误标记了子文件夹变量，因为它实际上是“根”中的所有子文件夹。因为OP，你要执行os。path。Join (str, list, str)，这可能不像你预期的那样。

为了帮助增加清晰度，你可以试试这个标签方案:

import os
for current_dir_path, current_subdirs, current_files in os.walk(RECURSIVE_ROOT):
    for aFile in current_files:
        if aFile.endswith(".txt") :
            txt_file_path = str(os.path.join(current_dir_path, aFile))
            print(txt_file_path)

2020-07-07 19:36:24

其他回答

i.e.

root = 'C:\\'
subFolder = ['Users', 'ProgramFiles', 'ProgramFiles (x86)', 'Windows', ...]
files = ['foo1.txt', 'foo2.txt', 'foo3.txt', ...]

root = 'C:\\Users\\'
subFolder = ['UserAccount1', 'UserAccount2', ...]
files = ['bar1.txt', 'bar2.txt', 'bar3.txt', ...]

...

我对你的代码做了轻微的调整，以打印一个完整的列表。

import os
for root, subFolder, files in os.walk(PATH):
    for item in files:
        if item.endswith(".txt") :
            fileNamePath = str(os.path.join(root,item))
            print(fileNamePath)

希望这能有所帮助!

编辑:(根据反馈)

OP误解/错误标记了子文件夹变量，因为它实际上是“根”中的所有子文件夹。因为OP，你要执行os。path。Join (str, list, str)，这可能不像你预期的那样。

为了帮助增加清晰度，你可以试试这个标签方案:

import os
for current_dir_path, current_subdirs, current_files in os.walk(RECURSIVE_ROOT):
    for aFile in current_files:
        if aFile.endswith(".txt") :
            txt_file_path = str(os.path.join(current_dir_path, aFile))
            print(txt_file_path)

2020-07-07 19:36:24

您可以通过这种方式返回绝对路径文件的列表。

def list_files_recursive(path):
    """
    Function that receives as a parameter a directory path
    :return list_: File List and Its Absolute Paths
    """

    import os

    files = []

    # r = root, d = directories, f = files
    for r, d, f in os.walk(path):
        for file in f:
            files.append(os.path.join(r, file))

    lst = [file for file in files]
    return lst


if __name__ == '__main__':

    result = list_files_recursive('/tmp')
    print(result)

2019-11-13 06:00:10

这不是最python的答案，但我把它放在这里是为了好玩，因为这是递归的一课

def find_files( files, dirs=[], extensions=[]):
    new_dirs = []
    for d in dirs:
        try:
            new_dirs += [ os.path.join(d, f) for f in os.listdir(d) ]
        except OSError:
            if os.path.splitext(d)[1] in extensions:
                files.append(d)

    if new_dirs:
        find_files(files, new_dirs, extensions )
    else:
        return

在我的机器上有两个文件夹，root和root2

mender@multivax ]ls -R root root2
root:
temp1 temp2

root/temp1:
temp1.1 temp1.2

root/temp1/temp1.1:
f1.mid

root/temp1/temp1.2:
f.mi  f.mid

root/temp2:
tmp.mid

root2:
dummie.txt temp3

root2/temp3:
song.mid

假设我想在这两个目录中找到所有。txt和。mid文件，然后我就可以

files = []
find_files( files, dirs=['root','root2'], extensions=['.mid','.txt'] )
print(files)

#['root2/dummie.txt',
# 'root/temp2/tmp.mid',
# 'root2/temp3/song.mid',
# 'root/temp1/temp1.1/f1.mid',
# 'root/temp1/temp1.2/f.mid']

2017-08-12 03:59:51

递归是Python 3.5中的新功能，所以它在Python 2.7中不起作用。下面是一个使用r个字符串的例子，你只需要提供路径，就像Win, Lin，…

import glob

mypath=r"C:\Users\dj\Desktop\nba"

files = glob.glob(mypath + r'\**\*.py', recursive=True)
# print(files) # as list
for f in files:
    print(f) # nice looking single line per file

注意:它将列出所有文件，无论它应该有多深。

2019-05-30 16:09:38

这似乎是我能想出的最快的解决方案，比os更快。走，比任何glob解决方案快得多。

它还会给你一个所有嵌套子文件夹的列表，基本上没有成本。您可以搜索几个不同的扩展。您还可以通过将f.path更改为f.name(不要为子文件夹更改它!)来选择返回完整路径或仅返回文件的名称。

参数:dir: str, ext: list。函数返回两个列表:子文件夹、文件。

请参阅下面的详细速度分析。

def run_fast_scandir(dir, ext):    # dir: str, ext: list
    subfolders, files = [], []

    for f in os.scandir(dir):
        if f.is_dir():
            subfolders.append(f.path)
        if f.is_file():
            if os.path.splitext(f.name)[1].lower() in ext:
                files.append(f.path)


    for dir in list(subfolders):
        sf, f = run_fast_scandir(dir, ext)
        subfolders.extend(sf)
        files.extend(f)
    return subfolders, files


subfolders, files = run_fast_scandir(folder, [".jpg"])

如果你需要文件大小，你也可以创建一个大小列表并添加f.t at()。st_size用于显示MiB:

sizes.append(f"{f.stat().st_size/1024/1024:.0f} MiB")

速度分析

用于获取所有子文件夹和主文件夹中具有特定文件扩展名的所有文件的各种方法。

tl; diana:

Fast_scandir明显胜出，它的速度是所有其他解决方案的两倍，除了os.walk。操作系统。走路是第二慢一点。使用glob将大大减慢这个过程。没有一个结果使用自然排序。这意味着结果将像这样排序:1,10,2。要获得自然排序(1,2,10)，请查看:

https://stackoverflow.com/a/48030307/2441026

结果:

fast_scandir    took  499 ms. Found files: 16596. Found subfolders: 439
os.walk         took  589 ms. Found files: 16596
find_files      took  919 ms. Found files: 16596
glob.iglob      took  998 ms. Found files: 16596
glob.glob       took 1002 ms. Found files: 16596
pathlib.rglob   took 1041 ms. Found files: 16596
os.walk-glob    took 1043 ms. Found files: 16596

更新日期:2022-07-20 (Py 3.10.1寻找*.pdf)

glob.iglob      took 132 ms. Found files: 9999
glob.glob       took 134 ms. Found files: 9999
fast_scandir    took 331 ms. Found files: 9999. Found subfolders: 9330
os.walk         took 695 ms. Found files: 9999
pathlib.rglob   took 828 ms. Found files: 9999
find_files      took 949 ms. Found files: 9999
os.walk-glob    took 1242 ms. Found files: 9999

测试使用W7x64, Python 3.8.1，运行20次。439(部分嵌套)子文件夹中的16596个文件。 Find_files来自https://stackoverflow.com/a/45646357/2441026，允许您搜索多个扩展名。 Fast_scandir是我自己写的，还将返回一个子文件夹列表。你可以给它一个扩展列表来搜索(我测试了一个包含一个条目的列表到一个简单的if…== ".jpg"且无显著差异)。

# -*- coding: utf-8 -*-
# Python 3


import time
import os
from glob import glob, iglob
from pathlib import Path


directory = r"<folder>"
RUNS = 20


def run_os_walk():
    a = time.time_ns()
    for i in range(RUNS):
        fu = [os.path.join(dp, f) for dp, dn, filenames in os.walk(directory) for f in filenames if
                  os.path.splitext(f)[1].lower() == '.jpg']
    print(f"os.walk\t\t\ttook {(time.time_ns() - a) / 1000 / 1000 / RUNS:.0f} ms. Found files: {len(fu)}")


def run_os_walk_glob():
    a = time.time_ns()
    for i in range(RUNS):
        fu = [y for x in os.walk(directory) for y in glob(os.path.join(x[0], '*.jpg'))]
    print(f"os.walk-glob\ttook {(time.time_ns() - a) / 1000 / 1000 / RUNS:.0f} ms. Found files: {len(fu)}")


def run_glob():
    a = time.time_ns()
    for i in range(RUNS):
        fu = glob(os.path.join(directory, '**', '*.jpg'), recursive=True)
    print(f"glob.glob\t\ttook {(time.time_ns() - a) / 1000 / 1000 / RUNS:.0f} ms. Found files: {len(fu)}")


def run_iglob():
    a = time.time_ns()
    for i in range(RUNS):
        fu = list(iglob(os.path.join(directory, '**', '*.jpg'), recursive=True))
    print(f"glob.iglob\t\ttook {(time.time_ns() - a) / 1000 / 1000 / RUNS:.0f} ms. Found files: {len(fu)}")


def run_pathlib_rglob():
    a = time.time_ns()
    for i in range(RUNS):
        fu = list(Path(directory).rglob("*.jpg"))
    print(f"pathlib.rglob\ttook {(time.time_ns() - a) / 1000 / 1000 / RUNS:.0f} ms. Found files: {len(fu)}")


def find_files(files, dirs=[], extensions=[]):
    # https://stackoverflow.com/a/45646357/2441026

    new_dirs = []
    for d in dirs:
        try:
            new_dirs += [ os.path.join(d, f) for f in os.listdir(d) ]
        except OSError:
            if os.path.splitext(d)[1].lower() in extensions:
                files.append(d)

    if new_dirs:
        find_files(files, new_dirs, extensions )
    else:
        return


def run_fast_scandir(dir, ext):    # dir: str, ext: list
    # https://stackoverflow.com/a/59803793/2441026

    subfolders, files = [], []

    for f in os.scandir(dir):
        if f.is_dir():
            subfolders.append(f.path)
        if f.is_file():
            if os.path.splitext(f.name)[1].lower() in ext:
                files.append(f.path)


    for dir in list(subfolders):
        sf, f = run_fast_scandir(dir, ext)
        subfolders.extend(sf)
        files.extend(f)
    return subfolders, files



if __name__ == '__main__':
    run_os_walk()
    run_os_walk_glob()
    run_glob()
    run_iglob()
    run_pathlib_rglob()


    a = time.time_ns()
    for i in range(RUNS):
        files = []
        find_files(files, dirs=[directory], extensions=[".jpg"])
    print(f"find_files\t\ttook {(time.time_ns() - a) / 1000 / 1000 / RUNS:.0f} ms. Found files: {len(files)}")


    a = time.time_ns()
    for i in range(RUNS):
        subf, files = run_fast_scandir(directory, [".jpg"])
    print(f"fast_scandir\ttook {(time.time_ns() - a) / 1000 / 1000 / RUNS:.0f} ms. Found files: {len(files)}. Found subfolders: {len(subf)}")

2020-01-18 18:52:11

如何做一个递归子文件夹搜索和返回文件在一个列表?

推荐文章

最新文章

标签