我有一个装满了上千份文件的桶。我如何搜索水桶?


当前回答

我也面临同样的问题。在S3中进行搜索应该比目前的情况容易得多。这就是为什么我在S3中实现了这个用于搜索的开源工具。

search是完全开源的S3搜索工具。它的实现始终牢记性能是关键因素,并根据基准测试在几秒钟内搜索包含~1000个文件的桶。

安装很简单。你只需要下载docker-compose文件并运行它

docker-compose up

搜索将开始,你可以在任何桶搜索任何东西。

其他回答

(至少)有两个不同的用例可以描述为“搜索桶”:

Search for something inside every object stored at the bucket; this assumes a common format for all the objects in that bucket (say, text files), etc etc. For something like this, you're forced to do what Cody Caughlan just answered. The AWS S3 docs has example code showing how to do this with the AWS SDK for Java: Listing Keys Using the AWS SDK for Java (there you'll also find PHP and C# examples). List item Search for something in the object keys contained in that bucket; S3 does have partial support for this, in the form of allowing prefix exact matches + collapsing matches after a delimiter. This is explained in more detail at the AWS S3 Developer Guide. This allows, for example, to implement "folders" through using as object keys something like folder/subfolder/file.txt If you follow this convention, most of the S3 GUIs (such as the AWS Console) will show you a folder view of your bucket.

我也面临同样的问题。在S3中进行搜索应该比目前的情况容易得多。这就是为什么我在S3中实现了这个用于搜索的开源工具。

search是完全开源的S3搜索工具。它的实现始终牢记性能是关键因素,并根据基准测试在几秒钟内搜索包含~1000个文件的桶。

安装很简单。你只需要下载docker-compose文件并运行它

docker-compose up

搜索将开始,你可以在任何桶搜索任何东西。

虽然不是AWS的原生服务,但有Mixpeek,它在S3文件上运行文本提取,如Tika、Tesseract和ImageAI,然后将它们放在Lucene索引中,使它们可搜索。

在这里查看文档

积分如下:

Download the module: https://github.com/mixpeek/mixpeek-python Import the module and your API keys: from mixpeek import Mixpeek, S3 from config import mixpeek_api_key, aws Instantiate the S3 class (which uses boto3 and requests): s3 = S3( aws_access_key_id=aws['aws_access_key_id'], aws_secret_access_key=aws['aws_secret_access_key'], region_name='us-east-2', mixpeek_api_key=mixpeek_api_key ) Upload one or more existing S3 files: # upload all S3 files in bucket "demo" s3.upload_all(bucket_name="demo") # upload one single file called "prescription.pdf" in bucket "demo" s3.upload_one(s3_file_name="prescription.pdf", bucket_name="demo") Now simply search using the Mixpeek module: # mixpeek api direct mix = Mixpeek( api_key=mixpeek_api_key ) # search result = mix.search(query="Heartgard") print(result) Where result can be: [ { "_id": "REDACTED", "api_key": "REDACTED", "highlights": [ { "path": "document_str", "score": 0.8759502172470093, "texts": [ { "type": "text", "value": "Vetco Prescription\nVetcoClinics.com\n\nCustomer:\n\nAddress: Canine\n\nPhone: Australian Shepherd\n\nDate of Service: 2 Years 8 Months\n\nPrescription\nExpiration Date:\n\nWeight: 41.75\n\nSex: Female\n\n℞ " }, { "type": "hit", "value": "Heartgard" }, { "type": "text", "value": " Plus Green 26-50 lbs (Ivermectin 135 mcg/Pyrantel 114 mg)\n\nInstructions: Give one chewable tablet by mouth once monthly for protection against heartworms, and the treatment and\ncontrol of roundworms, and hookworms. " } ] } ], "metadata": { "date_inserted": "2021-10-07 03:19:23.632000", "filename": "prescription.pdf" }, "score": 0.13313256204128265 } ]

然后解析结果

快进到2020年,使用aws-okta作为我们的2fa,下面的命令,尽管迭代这个特定bucket(+270,000)中的所有对象和文件夹非常缓慢,但运行良好。

aws-okta exec dev -- aws s3 ls my-cool-bucket --recursive | grep needle-in-haystax.txt

考虑到你在AWS…我认为你会想要使用他们的CloudSearch工具。把你想要搜索的数据放到他们的服务中…让它指向S3密钥。

http://aws.amazon.com/cloudsearch/