-
Notifications
You must be signed in to change notification settings - Fork 57
Feature/knowledge connector s3 #831
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR adds an S3 knowledge connector feature to the AWS extension, enabling extraction and processing of documents from S3 buckets into knowledge sources. It also updates deprecated APIs and dependencies.
Key changes:
- Implements S3 connector with support for multiple file types (txt, pdf, docx, csv, json, jsonl, md, pptx)
- Integrates LangChain for document loading and text chunking with configurable chunk sizes
- Updates deprecated Buffer constructor and context API methods
Reviewed changes
Copilot reviewed 15 out of 17 changed files in this pull request and generated 7 comments.
Show a summary per file
| File | Description |
|---|---|
| extensions/aws/tsconfig.json | Added compiler options for better module compatibility |
| extensions/aws/src/nodes/lambdaInvoke.ts | Fixed deprecated Buffer constructor and updated context API call |
| extensions/aws/src/module.ts | Registered S3 connector and commented out existing nodes |
| extensions/aws/src/knowledge-connectors/s3Connector.ts | Main connector implementation for processing S3 files into knowledge sources |
| extensions/aws/src/knowledge-connectors/helpers/text_extractor.ts | Document loader using LangChain for various file types |
| extensions/aws/src/knowledge-connectors/helpers/text_chunker.ts | Text splitting logic with configurable chunk sizes |
| extensions/aws/src/knowledge-connectors/helpers/new_utils.ts | S3 file download and chunk extraction utilities |
| extensions/aws/src/knowledge-connectors/helpers/list_files.ts | S3 bucket listing functionality |
| extensions/aws/src/knowledge-connectors/helpers/utils/*.ts | Utility functions for text processing and configuration |
| extensions/aws/src/knowledge-connectors/helpers/creds.env | Environment variable template for AWS credentials |
| extensions/aws/package.json | Updated dependencies to support LangChain and newer AWS SDK |
| extensions/aws/.npmrc | Added legacy peer deps flag for dependency resolution |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| // sayPollyNode, | ||
| // lambdaInvokeNode, | ||
| // s3GetObjectNode, | ||
| // s3PutObjectNode |
Copilot
AI
Dec 11, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All existing nodes have been commented out, leaving the nodes array empty. This breaks existing functionality for users of this extension. If these nodes are being temporarily disabled for testing, consider using a feature flag instead, or create a separate branch for the S3 connector work that doesn't disable production features.
| // sayPollyNode, | |
| // lambdaInvokeNode, | |
| // s3GetObjectNode, | |
| // s3PutObjectNode | |
| sayPollyNode, | |
| lambdaInvokeNode, | |
| s3GetObjectNode, | |
| s3PutObjectNode |
| }; | ||
|
|
||
| /** | ||
| * Custom PPTXLoader class to handle pptx files. Impelementation adapted |
Copilot
AI
Dec 11, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Corrected spelling of 'Impelementation' to 'Implementation'.
| * Custom PPTXLoader class to handle pptx files. Impelementation adapted | |
| * Custom PPTXLoader class to handle pptx files. Implementation adapted |
| return chunkSize; | ||
| }; | ||
|
|
||
| const getRecursiveCharacterTextSplitter = () => { |
Copilot
AI
Dec 11, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Incorrect indentation - this function definition has an extra tab character at the beginning of the line. The indentation should align with other top-level function definitions in the file.
| const getRecursiveCharacterTextSplitter = () => { | |
| const getRecursiveCharacterTextSplitter = () => { |
| }); | ||
| return splitter; | ||
| }; | ||
| export const splitTextUsingRecursiveCharacterTextSplitter = (text: string) => { |
Copilot
AI
Dec 11, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Incorrect indentation - this exported function has an extra tab character at the beginning of the line. The indentation should align with other top-level exports in the file.
| export const splitTextUsingRecursiveCharacterTextSplitter = (text: string) => { | |
| export const splitTextUsingRecursiveCharacterTextSplitter = (text: string) => { |
| import { get } from "http"; | ||
|
|
||
|
|
Copilot
AI
Dec 11, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The 'get' import from 'http' is unused in this file and should be removed.
| import { get } from "http"; |
| @@ -0,0 +1,67 @@ | |||
| import { S3Client, ListObjectsV2Command, S3 } from "@aws-sdk/client-s3"; | |||
Copilot
AI
Dec 11, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The 'S3' import from '@aws-sdk/client-s3' is unused in this file and should be removed.
| import { S3Client, ListObjectsV2Command, S3 } from "@aws-sdk/client-s3"; | |
| import { S3Client, ListObjectsV2Command } from "@aws-sdk/client-s3"; |
| AWS_ACCESS_KEY_ID=your_access_key_here | ||
| AWS_SECRET_ACCESS_KEY=your_secret_key_here | ||
| AWS_REGION=eu-central-1 | ||
| BUCKET_NAME=your_bucket_name_here No newline at end of file |
Copilot
AI
Dec 11, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Environment credential files should not be committed to the repository. This file should be added to .gitignore and replaced with a .env.example or .env.template file to document required variables without exposing placeholder values that could be mistaken for real credentials.
Hi Falak. I have fixed the chunking size, it is not exactly identical to the one in the manual upload but a lot better. Thanks for your time! :)