Skip to content

Add File System Support via pyodide.FS #37

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 32 commits into
base: main
Choose a base branch
from

Conversation

fullzer4
Copy link

@fullzer4 fullzer4 commented May 26, 2025

Add File System Support via pyodide.FS

This PR implements comprehensive file system operations in the PyodideSandbox, addressing issue #34 (Allow attaching files) and issue #26 (Preserve newlines when printing).

Key Features

File System Operations

  • File Attachment: Pre-load files into sandbox with attach_file() and attach_files()
  • Standard Python I/O: Full support for open(), Path(), and standard file operations
  • Multiple File Types: Handle text, binary files (CSV, images, etc.)
  • Directory Support: Organize files in nested directory structures

Bug Fixes

Implementation Details

  • Added FileSystemOperation interface for defining file operations
  • Implemented file system operations in main.ts using pyodide.FS
  • Migrated from BaseTool to StructuredTool for dynamic descriptions

Usage Example

# pip install langgraph-codeact "langchain[anthropic]"
import asyncio
from langchain_sandbox import PyodideSandboxTool
from langgraph.prebuilt import create_react_agent

# Define sandbox tool with filesystem support
sandbox_tool = PyodideSandboxTool(
    enable_filesystem=True,
    allow_net=True,
)

# Attach sample data
sales_data = """...csv_data"""

sandbox_tool.attach_file("sales.csv", sales_data)

# Create agent with sandbox tool
agent = create_react_agent("anthropic:claude-3-7-sonnet-latest", [sandbox_tool])

query = """Please analyze the sales data and tell me:
1. What is the total revenue by category?
2. Which region has the highest sales?
3. What are the top 3 best-selling products by revenue?

Use pandas to read the CSV file and perform the analysis."""

async def run_agent(query: str):
    async for chunk in agent.astream({"messages": query}):
        print(chunk)

if __name__ == "__main__":
    asyncio.run(run_agent(query))

References

@fullzer4 fullzer4 changed the title Add File System Support via pyodide.FS [WIP] Add File System Support via pyodide.FS May 26, 2025
@fullzer4

This comment was marked as outdated.

@fullzer4 fullzer4 changed the title [WIP] Add File System Support via pyodide.FS Add File System Support via pyodide.FS May 28, 2025
@fullzer4
Copy link
Author

fullzer4 commented May 28, 2025

Update: Feature Complete & Ready for Review

Quick Summary

StructuredTool Migration

The switch to StructuredTool enables dynamic descriptions that automatically list available files, improving developer experience by providing clear visibility into attached resources.

Technical Details

This implementation focuses only on MEMFS support as documented in the Pyodide File System documentation.


Ready for review! 🚀 Open to feedback and suggestions for improvements.

@eyurtsev @vbarda Could you please review when you have a chance? Thank you!

@eyurtsev eyurtsev self-assigned this May 28, 2025
Copy link
Collaborator

@eyurtsev eyurtsev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@fullzer4 this PR is really amazing. Thank you for putting all of this together! I left a few questions about the extra entrypoints in the code and would like to figure out how to avoid them if possible.

Thanks also for fixing the various issues in the existing code! You caught more than one bug!

@eyurtsev
Copy link
Collaborator

I think we have to pass the file bytes via stdin rather than as arg as most OSs impose a max limit on the size of the arg value that can be passed

@fullzer4
Copy link
Author

fullzer4 commented Jun 2, 2025

@eyurtsev I've completed the filesystem implementation with stdin. It's working well and solves the CLI argument size limitations we discussed.

I chose a binary protocol over JSON stdin for file transfer, implemented with ReadableStream for efficient binary data handling, because it preserves binary data integrity and avoids "Maximum call stack size exceeded" errors that occur when sending large files as JSON through stdin. Would appreciate your thoughts on this approach.

I'm already working on the output functionality with a get_file method to extract files from the sandbox after execution. Will submit this in a separate PR once it's ready and tested.

Regarding the permissions flags, I had to make changes to fix critical errors like:

Error: Requires write access to "/home/fullzer4/path/node_modules/.deno/[email protected]/node_modules/pyodide/micropip-0.8.0-py3-none-any.whl"

and Code

NotCapable: Requires read access to "/path/node_modules/.deno/[email protected]/node_modules/pyodide/pyodide.asm.wasm"

The approach follows the principle of least privilege while ensuring everything works properly.

For the description part, I implemented a template-based system with _description_template and _build_description() that dynamically shows available files to the LLM:

A secure Python code sandbox with filesystem support... ATTACHED FILES AVAILABLE: • data.csvconfig.json These files are already loaded and ready to use...

The implementation now guarantees that when users provide custom descriptions, they are respected exactly as specified, while still supporting dynamic file information through an optional placeholder system.

I kept the filesystem methods (attach_file, create_directory, etc.) as they provide a clean, intuitive API. These methods make working with sandbox files much simpler than reimplementing these operations in user code.

I've also implemented the constructor-based file attachment you suggested, so users can now provide files directly when creating the sandbox tool:

sandbox_tool = PyodideSandboxTool(
    allow_net=True,
    files={
        "sales.csv": sales_data
    }
)

I decided to maintain both approaches - constructor-based initialization and method-based file operations - because they serve different use cases. The constructor approach is perfect for setting up files at creation time in a clean, declarative style, while the method-based approach allows for more dynamic file manipulation during the tool's lifecycle. This flexibility gives users the best of both worlds without forcing them into a single pattern.

Let me know if you'd like me to adjust anything before merging.

@fullzer4
Copy link
Author

fullzer4 commented Jun 4, 2025

Hi @eyurtsev, I hope you're doing well! I just wanted to gently check in on the filesystem support PR when you have a moment. I'm eager to hear your thoughts. Thank you so much!

Copy link
Collaborator

@eyurtsev eyurtsev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left a few questions and I'd like to remove the mutability and and various help methods on the Python side.

I can take care of the changes on the python side if you'd like me to!

If you want to get the changes in more quickly, we can also break this PR into smaller ones (e.g., just tackling the typescript side first.)

):
self.description = self._build_description()

def attach_file(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we remove all the extra helper methods to avoid mutability and multiple ways of achieving the same result? Passing files through the init is enough for both the tool and the underlying Sandbox.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I included the helper methods (attach_file, create_directory, etc.) primarily for dynamic use cases where files need to be added after tool initialization - particularly useful in LangGraph nodes where the tool might need to attach files based on previous execution results.

Without these methods, users would need to recreate the entire tool instance or implement custom file management logic. The methods provide a clean API for: "tool running → needs new file → attach → continue" etc...

However, I understand the preference for immutability. If you'd like to remove them for simplicity, I can refactor to constructor-only approach.

_sync_sandbox: SyncPyodideSandbox | None = PrivateAttr(default=None)
_custom_description: bool = PrivateAttr(default=False)

def model_post_init(self, /, __context) -> None:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason to use pydantic's model_post_init instead of placing the code after the init (super().init()) call?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I used model_post_init specifically to handle Pydantic's field initialization lifecycle. When trying to update the description field after super().__init__(), Pydantic had already frozen the field state, causing silent failures when attach_file() tried to update the description dynamically.

model_post_init ensures the description is properly built after all Pydantic initialization is complete, allowing proper field mutation for the dynamic file information feature.

@fullzer4
Copy link
Author

fullzer4 commented Jun 5, 2025

Left a few questions and I'd like to remove the mutability and and various help methods on the Python side.

I can take care of the changes on the python side if you'd like me to!

If you want to get the changes in more quickly, we can also break this PR into smaller ones (e.g., just tackling the typescript side first.)

Thanks for offering to help! I'm happy to handle the remaining changes on the Python side myself - there's no rush on my end to get these changes in immediately. I've already addressed some of your review points with recent commits (specifically the unnecessary filesystem operations and permission model reversion).

If you prefer breaking this into smaller PRs as you suggested, I'm open to that approach too, but I don't want you to worry about making the changes yourself unless you specifically want to. I'm comfortable implementing the remaining feedback items based on your guidance.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Allow attaching files Preserve newlines when printing inside the sandbox
2 participants