| title: DockerTester | |
| emoji: 📚 | |
| colorFrom: red | |
| colorTo: blue | |
| sdk: docker | |
| pinned: false | |
| license: mit | |
| RedPajama Dataset API | |
| A FastAPI-based Application for Exploring the RedPajama-Data-1T Dataset | |
| Overview | |
| This application provides an HTTP API for exploring the RedPajama-Data-1T dataset. Built with FastAPI, it lets users retrieve data chunks, run keyword searches, and view dataset summaries. It is intended for researchers and developers working with large-scale language model datasets. | |
| Features | |
| 1. Retrieve Dataset Chunks | |
| Fetch smaller, manageable subsets of the dataset to explore or preprocess. | |
| 2. Search Data | |
| Search for specific keywords in the dataset and retrieve relevant results. | |
| 3. Dataset Summary | |
| Get an overview of the dataset’s structure, including available splits. | |
| Endpoints | |
| Endpoint | Method | Parameters | Description |
| / | GET | None | Displays a welcome message. |
| /get_data/ | GET | chunk_size (int, default: 10) | Fetches a subset of the dataset. |
| /search_data/ | GET | keyword (str, required), max_results (int, default: 10) | Searches for entries containing the given keyword. |
| /data_summary/ | GET | None | Displays a summary of the dataset. |
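The chunking and search behaviour of the endpoints listed above can be sketched in plain Python. This is a minimal sketch under stated assumptions, not the application's actual code: the function names `take_chunk` and `search_records`, the `text` field, and the case-insensitive matching are all hypothetical choices made for illustration.

```python
from itertools import islice
from typing import Any, Dict, Iterable, List

Record = Dict[str, Any]


def take_chunk(records: Iterable[Record], chunk_size: int = 10) -> List[Record]:
    """Return at most `chunk_size` records, reading no further than needed."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be a positive integer")
    return list(islice(records, chunk_size))


def search_records(
    records: Iterable[Record],
    keyword: str,
    max_results: int = 10,
    field: str = "text",  # assumed name of the text column
) -> List[Record]:
    """Return up to `max_results` records whose `field` contains `keyword`."""
    if not keyword:
        raise ValueError("keyword is required")
    if max_results < 1:
        raise ValueError("max_results must be a positive integer")
    needle = keyword.lower()  # case-insensitive matching is an assumption
    # Lazy generator: scanning stops as soon as enough matches are found.
    matches = (r for r in records if needle in str(r.get(field, "")).lower())
    return list(islice(matches, max_results))
```

Because both functions accept any iterable and stop early, they would also work on a streamed dataset, where loading every record up front is impractical.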
| Getting Started | |
| Prerequisites | |
| • Python 3.8+ | |
| • Pip for dependency management | |
| Setup | |
| 1. Clone the repository: | |
| git clone https://huggingface.co/spaces/Canstralian/DockerTester | |
| cd DockerTester | |
| 2. Install dependencies: | |
| pip install -r requirements.txt | |
| 3. Run the application: | |
| uvicorn app:app --host 0.0.0.0 --port 8000 | |
| 4. Open the API in your browser, or call it with a tool such as Postman, at: | |
| http://127.0.0.1:8000 | |
| Example Usage | |
| 1. Retrieve a Small Chunk of Data | |
| Fetch 5 examples from the dataset: | |
| curl "http://127.0.0.1:8000/get_data/?chunk_size=5" | |
| 2. Search the Dataset | |
| Search for the keyword "example" and return up to 3 results: | |
| curl "http://127.0.0.1:8000/search_data/?keyword=example&max_results=3" | |
| 3. View Dataset Summary | |
| Get an overview of available splits: | |
| curl "http://127.0.0.1:8000/data_summary/" | |
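The same three requests can be issued from Python using only the standard library. The sketch below assumes the server is running locally as described under Setup and that responses are JSON (FastAPI's default for returned dicts and lists, but not confirmed for this app); `build_url` and `call_api` are hypothetical helper names.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "http://127.0.0.1:8000"  # local address from the setup steps


def build_url(path: str, **params) -> str:
    """Join the base URL, an endpoint path, and optional query parameters."""
    query = urlencode(params)
    return f"{BASE_URL}{path}?{query}" if query else f"{BASE_URL}{path}"


def call_api(path: str, **params):
    """Send a GET request and decode the body, assuming it is JSON."""
    with urlopen(build_url(path, **params), timeout=10) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the server running, `call_api("/search_data/", keyword="example", max_results=3)` would send the same request as the second curl command, and `call_api("/data_summary/")` the same as the third.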
| Technologies Used | |
| • FastAPI: For building the API. | |
| • Hugging Face Datasets: To access and process the RedPajama-Data-1T dataset. | |
| • Uvicorn: For running the ASGI server. | |
| • Python: Backend language. | |
| Future Enhancements | |
| • Add support for advanced filtering (e.g., by metadata or specific fields). | |
| • Implement user authentication for restricted dataset access. | |
| • Add visualization endpoints for dataset insights. | |
| License | |
| This project uses the Apache 2.0 License. Refer to the LICENSE file for more details. | |
| Feel free to reach out for questions, feature requests, or contributions! | |