🤖 Automated Machine Learning Workflow

State-driven ML pipeline with intelligent agent orchestration

This project builds an automated machine learning workflow with AG2. The workflow covers data analysis, preprocessing, and model training, and ends with a trained model and a summary of the run.

Machine learning workflows typically involve several key steps:

  1. Data Analysis and Exploration: Understanding dataset size, columns, and distributions.
  2. Data Preprocessing: Cleaning data, handling missing values, and encoding categorical variables.
  3. Model Training: Training a model, comparing different models, and tuning hyperparameters.

🏗️ System Architecture

State Machine Workflow

The system follows a state machine design with intelligent transitions between ML workflow stages:

flowchart TD
    A["🚀 Init State"] --> B["🔍 Explore State"]
    B --> B1["👤 Data Explorer"]
    B1 --> B2["⚙️ Code Executor"]
    B2 -->|Success| C["🛠️ Preprocess State"]
    B2 -->|Error| B1
    C --> C1["🔧 Data Preprocessor"]
    C1 --> C2["⚙️ Code Executor"]
    C2 -->|LLM Decision: Ready for Training| D["🎯 Train State"]
    C2 -->|Need More Analysis| B
    C2 -->|Error| C1
    D --> D1["🧠 Model Trainer"]
    D1 --> D2["⚙️ Code Executor"]
    D2 -->|< 2 Trials| D1
    D2 -->|≥ 2 Trials| E["📊 Summarize State"]
    D2 -->|Error| D1
    E --> E1["📝 Summarizer"]
    E1 --> F["🏁 End State"]

    %% State grouping
    subgraph STATES ["State Machine Workflow"]
        direction LR
        G["Custom Speaker Selection Method"]
        H["StateFlow Pattern Transition Logic"]
    end

    %% Styling
    classDef initEnd fill:#ffeaa7,stroke:#fdcb6e,stroke-width:3px
    classDef state fill:#e17055,stroke:#d63031,stroke-width:3px
    classDef agent fill:#74b9ff,stroke:#0984e3,stroke-width:2px
    classDef executor fill:#00b894,stroke:#00a085,stroke-width:2px
    classDef pattern fill:#fd79a8,stroke:#e84393,stroke-width:2px
    class A,F initEnd
    class B,C,D,E state
    class B1,C1,D1,E1 agent
    class B2,C2,D2 executor
    class G,H pattern

📋 State Machine Details

🔍 Explore State

Analyze the dataset structure, distributions, and characteristics to understand the data landscape.

Agents:
• Data Explorer → Code Executor
Transition: Success → Preprocess | Error → Stay in Explore

🛠️ Preprocess State

Clean and prepare data including handling missing values, encoding categoricals, and feature scaling.

Agents:
• Data Preprocessor → Code Executor
Transition: LLM decides ready for training → Train | Need more analysis → Explore | Error → Retry

🎯 Train State

Train and compare multiple ML models with different algorithms and hyperparameters.

Agents:
• Model Trainer → Code Executor
Transition: < 2 trials → Continue training | ≥ 2 trials → Summarize | Error → Retry

📊 Summarize State

Generate comprehensive workflow summary and integrate all successful code snippets.

Agents:
• Summarizer (LLM-only)
Transition: Always → End
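
The transition rules listed above can be written down as a small pure function. This is a minimal sketch, not the project's actual code: the `State` enum, the `next_state` function, and its parameters are hypothetical names chosen for illustration, and it assumes the caller supplies the execution result, the LLM's readiness decision, and the number of completed training trials.

```python
from enum import Enum


class State(Enum):
    EXPLORE = "explore"
    PREPROCESS = "preprocess"
    TRAIN = "train"
    SUMMARIZE = "summarize"
    END = "end"


# Threshold taken from the "≥ 2 trials" rule of the Train state.
MAX_TRIALS = 2


def next_state(state, *, success=True, ready_for_training=True, trials=0):
    """Return the state that follows `state` (hypothetical helper).

    success: whether the last code execution succeeded.
    ready_for_training: the LLM's decision after preprocessing.
    trials: number of completed training trials (assumed to be
        counted by the caller; the README does not say how).
    """
    if state is State.EXPLORE:
        return State.PREPROCESS if success else State.EXPLORE
    if state is State.PREPROCESS:
        if not success:
            return State.PREPROCESS  # error: retry preprocessing
        return State.TRAIN if ready_for_training else State.EXPLORE
    if state is State.TRAIN:
        if not success:
            return State.TRAIN  # error: retry training
        return State.SUMMARIZE if trials >= MAX_TRIALS else State.TRAIN
    # Summarize always leads to End, and End is terminal.
    return State.END
```

Keeping the rules in one function like this makes each transition easy to check in isolation, independently of any agent or LLM call.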

🔄 Workflow Process

Step 1 - Dataset Analysis: Explore data shape, types, distributions, and missing values
Step 2 - Data Preprocessing: Handle missing data, encode categoricals, and scale features
Step 3 - Model Training: Train multiple models (2 iterations) with performance comparison
Step 4 - Visualization: Generate performance plots, confusion matrices, and evaluation metrics
Step 5 - Code Integration: Combine all successful code into reproducible script
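
Step 5 can be illustrated with a short helper that scans a chat history and keeps only code whose execution succeeded. This is a sketch under stated assumptions: `collect_successful_code` is a hypothetical name, each message is assumed to be a dict with a `content` string, code is assumed to arrive in fenced blocks, and the executor's reply is assumed to contain the text `exitcode: 0` on success. The project's Summarizer is LLM-based and may work differently.

```python
import re

# Build the fence marker programmatically so this example can itself
# be shown inside a fenced block.
FENCE = "`" * 3
CODE_BLOCK = re.compile(FENCE + r"(?:python)?\n(.*?)" + FENCE, re.DOTALL)


def collect_successful_code(messages):
    """Join code blocks that are directly followed by a successful run.

    A block counts as successful when the next message contains
    "exitcode: 0" (an assumed executor reply format).
    """
    snippets = []
    for current, following in zip(messages, messages[1:]):
        if "exitcode: 0" not in (following.get("content") or ""):
            continue
        blocks = CODE_BLOCK.findall(current.get("content") or "")
        snippets.extend(block.strip() for block in blocks)
    return "\n\n".join(snippets)
```

The returned string could then be written to a single `.py` file to serve as the reproducible script.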

🤖 AG2 Features

🎭 Custom Speaker Transitions

State-driven agent selection using custom speaker_selection_method for workflow control
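
As a rough illustration, AG2's `GroupChat` accepts a callable as `speaker_selection_method`; the callable receives the last speaker and the group chat and returns the next agent, or `None` to end the conversation. The sketch below shows only the shape of such a callable. The agent names, the `make_speaker_selector` factory, and the stub classes are hypothetical stand-ins so the example runs without AG2 installed, and the `exitcode: 0` check is an assumption about the executor's reply text. It advances linearly and retries on error; the LLM readiness decision and the trial counter are left out.

```python
from dataclasses import dataclass, field


@dataclass
class StubAgent:
    """Stand-in for an AG2 agent; only the name matters here."""
    name: str


@dataclass
class StubGroupChat:
    """Stand-in for an AG2 GroupChat holding agents and messages."""
    agents: list
    messages: list = field(default_factory=list)


def make_speaker_selector(writers, executor):
    """Build a selector over an ordered list of code-writing agents.

    Each writer hands off to the executor. After the executor speaks,
    a successful run advances to the next writer and a failed run
    returns to the same writer.
    """
    position = {"index": 0}

    def select_speaker(last_speaker, groupchat):
        if last_speaker is executor:
            last = groupchat.messages[-1].get("content") or ""
            if "exitcode: 0" in last:  # assumed success marker
                position["index"] += 1
            if position["index"] >= len(writers):
                return None  # no writers left: end the chat
            return writers[position["index"]]
        # A writer just spoke: run its code. Anyone else: start over.
        return executor if last_speaker in writers else writers[0]

    return select_speaker
```

With real AG2 agents, the returned function would be passed as `speaker_selection_method` when constructing the group chat.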

🌊 StateFlow Design

Build state-driven workflows with intelligent transitions based on execution results

⚡ Code Execution

Jupyter-based code execution environment for interactive ML development

🧠 LLM Decision Making

AI-powered decisions on workflow readiness and state transitions

🏷️ Tags

data-analysis groupchat stateflow code-execution kaggle automated-ml workflow-automation model-training data-preprocessing state-machine hyperparameter-tuning

📋 Prerequisites

⚙️ Installation

  1. Clone and navigate to the folder:
git clone https://github.com/ag2ai/build-with-ag2.git
cd build-with-ag2/automate-ml-for-kaggle
  2. Install dependencies:
uv sync
  3. Set up environment variables:
cp .env.example .env
# Edit .env with your OpenAI API key

🚀 Usage

uv run python main.py

The workflow will:

  1. Analyze the dataset (house_prices_train.csv)
  2. Preprocess the data automatically
  3. Train and compare multiple models
  4. Generate performance visualizations
  5. Output a comprehensive summary

📄 License

This project is licensed under the Apache License 2.0. See the LICENSE file for details.