我有一个装满了上千份文件的桶。我如何搜索水桶?
当前回答
虽然不是AWS的原生服务,但有Mixpeek,它在S3文件上运行文本提取,如Tika、Tesseract和ImageAI,然后将它们放在Lucene索引中,使它们可搜索。
在这里查看文档
积分如下:
Download the module: https://github.com/mixpeek/mixpeek-python Import the module and your API keys: from mixpeek import Mixpeek, S3 from config import mixpeek_api_key, aws Instantiate the S3 class (which uses boto3 and requests): s3 = S3( aws_access_key_id=aws['aws_access_key_id'], aws_secret_access_key=aws['aws_secret_access_key'], region_name='us-east-2', mixpeek_api_key=mixpeek_api_key ) Upload one or more existing S3 files: # upload all S3 files in bucket "demo" s3.upload_all(bucket_name="demo") # upload one single file called "prescription.pdf" in bucket "demo" s3.upload_one(s3_file_name="prescription.pdf", bucket_name="demo") Now simply search using the Mixpeek module: # mixpeek api direct mix = Mixpeek( api_key=mixpeek_api_key ) # search result = mix.search(query="Heartgard") print(result) Where result can be: [ { "_id": "REDACTED", "api_key": "REDACTED", "highlights": [ { "path": "document_str", "score": 0.8759502172470093, "texts": [ { "type": "text", "value": "Vetco Prescription\nVetcoClinics.com\n\nCustomer:\n\nAddress: Canine\n\nPhone: Australian Shepherd\n\nDate of Service: 2 Years 8 Months\n\nPrescription\nExpiration Date:\n\nWeight: 41.75\n\nSex: Female\n\n℞ " }, { "type": "hit", "value": "Heartgard" }, { "type": "text", "value": " Plus Green 26-50 lbs (Ivermectin 135 mcg/Pyrantel 114 mg)\n\nInstructions: Give one chewable tablet by mouth once monthly for protection against heartworms, and the treatment and\ncontrol of roundworms, and hookworms. " } ] } ], "metadata": { "date_inserted": "2021-10-07 03:19:23.632000", "filename": "prescription.pdf" }, "score": 0.13313256204128265 } ]
然后解析结果
其他回答
使用Amazon Athena查询S3桶。另外,加载数据到Amazon Elastic搜索。希望这能有所帮助。
我尝试了以下方法
aws s3 ls s3://Bucket1/folder1/2019/ --recursive |grep filename.csv
这将输出文件存在的实际路径
2019-04-05 01:18:35 111111 folder1/2019/03/20/filename.csv
我也面临同样的问题。在S3中进行搜索应该比目前的情况容易得多。这就是为什么我在S3中实现了这个用于搜索的开源工具。
search是完全开源的S3搜索工具。它的实现始终牢记性能是关键因素,并根据基准测试在几秒钟内搜索包含~1000个文件的桶。
安装很简单。你只需要下载docker-compose文件并运行它
docker-compose up
搜索将开始,你可以在任何桶搜索任何东西。
(至少)有两个不同的用例可以描述为“搜索桶”:
Search for something inside every object stored at the bucket; this assumes a common format for all the objects in that bucket (say, text files), etc etc. For something like this, you're forced to do what Cody Caughlan just answered. The AWS S3 docs has example code showing how to do this with the AWS SDK for Java: Listing Keys Using the AWS SDK for Java (there you'll also find PHP and C# examples). List item Search for something in the object keys contained in that bucket; S3 does have partial support for this, in the form of allowing prefix exact matches + collapsing matches after a delimiter. This is explained in more detail at the AWS S3 Developer Guide. This allows, for example, to implement "folders" through using as object keys something like folder/subfolder/file.txt If you follow this convention, most of the S3 GUIs (such as the AWS Console) will show you a folder view of your bucket.
另一种选择是在您的web服务器上镜像S3桶并在本地遍历。诀窍在于本地文件是空的,只用作骨架。或者,本地文件可以保存您通常需要从S3获取的有用元数据(例如,文件大小、mimetype、作者、时间戳、uuid)。当您提供下载文件的URL时,在本地搜索,但要提供到S3地址的链接。
本地文件遍历很容易,而且这种用于S3管理的方法与语言无关。本地文件遍历还可以避免维护和查询文件数据库,或者延迟执行一系列远程API调用来验证和获取桶内容。
您可以允许用户通过FTP或HTTP直接将文件上传到您的服务器,然后在非高峰时段通过递归遍历任意大小文件的目录将一批新的和更新的文件传输到Amazon。在完成向Amazon的文件传输后,将web服务器文件替换为同名的空文件。如果一个本地文件有任何文件大小,那么直接提供它,因为它正在等待批量传输。
推荐文章
- 如何搜索亚马逊s3桶?
- 拒绝访问;您需要(至少一个)SUPER特权来执行此操作
- 我如何使用通配符“cp”一组文件与AWS CLI
- 我如何获得亚马逊的AWS_ACCESS_KEY_ID ?
- 如何使所有对象在AWS S3桶公共默认?
- 为什么我应该使用亚马逊Kinesis而不是SNS-SQS?
- 如何重命名AWS S3 Bucket
- AWS ECS中的任务和服务之间有什么区别?
- 亚马逊SimpleDB vs亚马逊DynamoDB
- 亚马逊ECS和亚马逊EC2有什么区别?
- 我如何知道我在S3桶中存储了多少对象?
- S3 Bucket操作不应用于任何资源
- 将AWS凭证传递给Docker容器的最佳方法是什么?
- 当权限为S3时,AccessDenied for ListObjects for S3 bucket:*
- 电子邮件地址未验证(AWS SES)