# User Embedding Model
This repository contains a PyTorch model for generating user embeddings based on DMP (Data Management Platform) data. The model creates dense vector representations of users that can be used for recommendation systems, user clustering, and similarity searches.
## Quick Start with Docker
To run the model using Docker:
```bash
docker build -t user-embedding-model .

docker run -v /path/to/your/data:/app/data \
  -e DATA_PATH=/app/data/users.json \
  -e NUM_EPOCHS=10 \
  -e BATCH_SIZE=32 \
  -v /path/to/output:/app/embeddings_output \
  user-embedding-model
```
## Pushing to Hugging Face
To automatically push the model to Hugging Face, add your credentials:
```bash
docker run -v /path/to/your/data:/app/data \
  -e DATA_PATH=/app/data/users.json \
  -e HF_REPO_ID="your-username/your-model-name" \
  -e HF_TOKEN="your-huggingface-token" \
  -v /path/to/output:/app/embeddings_output \
  user-embedding-model
```
## Input Data Format
The model expects user data in JSON format, with each user record containing DMP fields such as:
```json
{
  "dmp": {
    "city": "milano",
    "domains": ["example.com"],
    "brands": ["brand1", "brand2"],
    "clusters": ["cluster1", "cluster2"],
    "industries": ["industry1"],
    "tags": ["tag1", "tag2"],
    "channels": ["channel1"],
    "~click__host": "host1",
    "~click__domain": "domain1",
    "": {
      "id": "user123"
    }
  }
}
```
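As a minimal sketch, a record in this format can be parsed like so. The field names mirror the example above; in particular, the user id nested under the empty-string key is an assumption based on that example:

```python
import json

# Parse one user record in the expected format (field names follow the
# example above; the empty-string key holding the id is an assumption).
record = json.loads("""
{
  "dmp": {
    "city": "milano",
    "domains": ["example.com"],
    "brands": ["brand1", "brand2"],
    "": {"id": "user123"}
  }
}
""")

dmp = record["dmp"]
user_id = dmp[""]["id"]            # user id sits under the empty-string key
city = dmp.get("city", "unknown")  # scalar field
brands = dmp.get("brands", [])     # list-valued field
```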
## Environment Variables
- `DATA_PATH`: Path to your input JSON file (default: `users.json`)
- `NUM_EPOCHS`: Number of training epochs (default: 10)
- `BATCH_SIZE`: Batch size for training (default: 32)
- `LEARNING_RATE`: Learning rate for the optimizer (default: 0.001)
- `SAVE_INTERVAL`: Save a checkpoint every N epochs (default: 2)
- `HF_REPO_ID`: Hugging Face repository ID for uploading
- `HF_TOKEN`: Hugging Face API token
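A training script would typically read these variables along the lines of the sketch below. The names and defaults match the list above, but the parsing code itself is illustrative, not the repository's exact implementation:

```python
import os

# Read configuration from the environment, falling back to the documented
# defaults. HF_REPO_ID and HF_TOKEN are optional: if unset, no upload happens.
DATA_PATH = os.environ.get("DATA_PATH", "users.json")
NUM_EPOCHS = int(os.environ.get("NUM_EPOCHS", "10"))
BATCH_SIZE = int(os.environ.get("BATCH_SIZE", "32"))
LEARNING_RATE = float(os.environ.get("LEARNING_RATE", "0.001"))
SAVE_INTERVAL = int(os.environ.get("SAVE_INTERVAL", "2"))
HF_REPO_ID = os.environ.get("HF_REPO_ID")  # optional
HF_TOKEN = os.environ.get("HF_TOKEN")      # optional
```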
## Output
The model generates:
- `embeddings.json`: User embeddings in JSON format
- `embeddings.npz`: User embeddings in NumPy format
- `vocabularies.json`: Vocabulary mappings
- `model.pth`: Trained PyTorch model
- `model_config.json`: Model configuration
- Hugging Face-compatible model files in the `huggingface` subdirectory
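As an illustration of consuming the output, the saved embeddings can drive a simple cosine-similarity lookup. Note that the `.npz` key names used in the usage comment (`ids`, `vectors`) are assumptions about the archive layout; inspect `embeddings.npz` for the actual keys:

```python
import numpy as np

def most_similar(query_vec, vectors, ids, k=3):
    """Return the ids of the k embeddings closest to query_vec by cosine similarity."""
    q = query_vec / np.linalg.norm(query_vec)                  # normalize query
    v = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)  # normalize rows
    sims = v @ q                                               # cosine similarities
    top = np.argsort(-sims)[:k]                                # indices, best first
    return [ids[i] for i in top]

# Usage against the saved file might look like (key names are assumed):
# data = np.load("embeddings_output/embeddings.npz")
# neighbours = most_similar(data["vectors"][0], data["vectors"], data["ids"])
```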
## Hardware Requirements
- Recommended: NVIDIA GPU with CUDA support
- The code uses parallel processing for triplet generation to utilize all available CPU cores
- For an NVIDIA L40S GPU, the recommended batch size is 32-64
- Memory requirement: At least 8GB RAM
## Model Architecture
The model consists of:
- Embedding layers for each user field
- Sequential fully connected layers for dimensionality reduction
- Output dimension: 256 (configurable)
- Training method: Triplet margin loss using similar/dissimilar users
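The triplet objective can be written out in plain NumPy for clarity. PyTorch's `nn.TripletMarginLoss` computes this same per-sample quantity (with Euclidean distance and averaging over a batch); the function below is only an illustration of the formula, not the repository's training code:

```python
import numpy as np

def triplet_margin_loss(anchor, positive, negative, margin=1.0):
    """max(0, d(a, p) - d(a, n) + margin): push the anchor closer to the
    similar user (positive) than to the dissimilar one (negative)."""
    d_pos = np.linalg.norm(anchor - positive)  # distance to similar user
    d_neg = np.linalg.norm(anchor - negative)  # distance to dissimilar user
    return max(0.0, d_pos - d_neg + margin)
```

The loss is zero once the negative is at least `margin` farther from the anchor than the positive, so training focuses on triplets that still violate that separation.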
## Performance Optimization
The code includes several optimizations:
- Parallel triplet generation using all available CPU cores
- GPU acceleration for model training
- Efficient memory handling for large datasets
- TensorBoard integration for monitoring training
## Troubleshooting
If you encounter issues:
- Check that your input data follows the expected format
- Ensure you have sufficient memory for your dataset size
- For GPU errors, try reducing batch size
- Check the logs for detailed error messages