IdeaSearch Fitter Demo

Usage examples for IdeaSearch Fitter Demo, a Streamlit-based intelligent symbolic regression web application

🚀 IdeaSearch Fitter Demo Usage Tutorial

📖 Overview

IdeaSearch Fitter Demo is an intelligent symbolic regression web application built on the IdeaSearch framework. It uses large language models to discover mathematical expressions automatically: draw a curve or upload data, and the application searches for best-fit formulas.

✨ Key Features

  • 🎨 Interactive Drawing Canvas - Intuitively draw target curves with support for multiple drawing modes
  • 📁 File Upload Support - Support NPZ data file upload and multi-dimensional feature fitting
  • 🤖 Multi-Model Support - Integrated with mainstream LLMs like GPT, Gemini, Qwen, DeepSeek
  • 🧠 Fuzzy Mode - Use natural language theory descriptions to assist fitting
  • 📊 Real-time Visualization - Dynamically display fitting progress and result comparisons
  • 🏝️ Island Evolution Algorithm - Parallel exploration of multiple solution spaces to improve fitting quality
  • 📈 Pareto Front Analysis - Balance expression complexity and fitting accuracy
  • ⚙️ Physical Unit Validation - Ensure generated expressions have correct dimensions

🛠️ Environment Setup

1. Clone the Repository

# Clone repository
git clone https://github.com/IdeaSearch/ideasearch-fit-demo
cd ideasearch-fit-demo

2. Install uv Package Manager

uv is a fast and reliable Python package manager, and it is the recommended way to set up this project:

# macOS/Linux
curl -LsSf https://astral.sh/uv/install.sh | sh

# Windows
powershell -c "irm https://astral.sh/uv/install.ps1 | iex"

# Or install via pip
pip install uv

3. Configure Environment and Dependencies

# Sync dependency environment
uv sync

4. Configure API Keys

Copy the example configuration file and edit it:

# Copy example configuration
cp api_keys.json.example api_keys.json

# Edit configuration file
nano api_keys.json  # or use other editors

API key configuration format:

{
  "Gemini_2.5_Flash": [{
    "api_key": "your-gemini-api-key-here",
    "base_url": "https://generativelanguage.googleapis.com/v1beta",
    "model": "gemini-2.5-flash"
  }],
  "GPT_4o_Mini": [{
    "api_key": "your-openai-api-key",
    "base_url": "https://api.openai.com/v1",
    "model": "gpt-4o-mini"
  }],
  "Qwen_Plus": [{
    "api_key": "your-qwen-api-key",
    "base_url": "https://dashscope.aliyuncs.com/compatible-mode/v1",
    "model": "qwen-plus"
  }]
}

Supported Model Names (the top-level keys in api_keys.json must match these names exactly):

  • Gemini Series: Gemini_2.5_Flash, Gemini_2.5_Pro, Gemini_Pro
  • OpenAI Series: GPT_4o, GPT_4o_Mini, GPT_4_Turbo
  • Chinese Models: Qwen_Plus, Qwen_Max, Qwen3, Doubao
  • Other Models: Deepseek_V3, Grok_4
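
Because the key names and the three fields per entry must be exact, it can help to check the file before launching. The following is a minimal sketch, not part of the application: the helper names `check_api_keys` and `check_api_keys_file` are hypothetical, and the rule that every entry needs all three fields is an assumption based on the example configuration above.

```python
import json

# Names copied from the supported-model list above.
SUPPORTED_MODELS = {
    "Gemini_2.5_Flash", "Gemini_2.5_Pro", "Gemini_Pro",
    "GPT_4o", "GPT_4o_Mini", "GPT_4_Turbo",
    "Qwen_Plus", "Qwen_Max", "Qwen3", "Doubao",
    "Deepseek_V3", "Grok_4",
}
# Assumption: every entry carries the same three fields as the example.
REQUIRED_FIELDS = ("api_key", "base_url", "model")


def check_api_keys(config):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    for name, entries in config.items():
        if name not in SUPPORTED_MODELS:
            problems.append(f"{name}: not a supported model name")
        if not isinstance(entries, list) or not entries:
            problems.append(f"{name}: expected a non-empty list of entries")
            continue
        for i, entry in enumerate(entries):
            for field in REQUIRED_FIELDS:
                if not isinstance(entry, dict) or not entry.get(field):
                    problems.append(f"{name}[{i}]: missing '{field}'")
    return problems


def check_api_keys_file(path="api_keys.json"):
    """Load the JSON file and run the same checks on its contents."""
    with open(path, encoding="utf-8") as fh:
        return check_api_keys(json.load(fh))
```

Calling `check_api_keys_file()` from the repository root returns an empty list when the file passes these checks.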

🚀 Launch Application

After configuration, start the application:

# Use launch script (recommended)
./run.sh

# Or start manually
uv run streamlit run app.py --server.port 8501

The application opens automatically in your browser at http://localhost:8501

📖 Usage Guide

The application provides two main tabs for different usage scenarios:

🎨 Tab 1: Draw Curve Fitting

This is the most intuitive way to use the application, suited to quick exploration and teaching demonstrations.

Operation Steps

  1. Draw Curves

    • Draw target curves on the left canvas
    • Supports three drawing modes:
      • Free Drawing: Hand-draw curves of any shape
      • Straight Line: Draw line segments
      • Points: Mark data points one by one
    • Line width is adjustable (1-10 pixels)
    • Enable the 📷 Pass Image option to send the canvas image to vision-capable models (such as Gemini)
  2. Configure Parameters (Right sidebar)

    • Model Selection: Gemini_2.5_Flash is recommended (best balance of speed and quality)
    • Function Configuration: Select available mathematical functions
      Basic functions: sin, cos, tan, exp, log, sqrt, abs
      Advanced functions: sinh, cosh, tanh, asin, acos, atan
    • Fitting Parameter Tuning:
      • Number of Islands: 3-8 (Recommended: 5) - Number of parallel search populations
      • Number of Cycles: 3-10 (Recommended: 5) - Number of evolution generations
      • Unit Interactions: 5-10 (Recommended: 8) - LLM calls per generation
      • Target Score: 80.0 (Recommended) - Early stop when reached
    • Fuzzy Mode: Check to enable natural language theory description assistance
  3. Data Preview

    • The right side displays extracted data point information
    • Shows X, Y ranges and data point scatter plots
    • Confirm data quality before starting fitting
  4. Execute Fitting

    • Click ▶️ Start Fitting button
    • Observe real-time progress and log output
    • During fitting you can see:
      • Current best score and expression
      • Real-time fitting curve comparison plots
      • API call counts and runtime
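
The Function Configuration step above limits which functions may appear in candidate expressions. One way to picture that restriction is a whitelist namespace: only the enabled names resolve when an expression is evaluated. This is a hedged sketch using the standard `math` module; `build_namespace` and `evaluate` are hypothetical helpers and not the application's own code, and `eval` with empty builtins is used for illustration only (it is not a security sandbox).

```python
import math

# Function names taken from the Function Configuration list above.
BASIC = ("sin", "cos", "tan", "exp", "log", "sqrt", "abs")
ADVANCED = ("sinh", "cosh", "tanh", "asin", "acos", "atan")


def build_namespace(enabled):
    """Map each enabled function name to its implementation."""
    namespace = {}
    for name in enabled:
        if name not in BASIC + ADVANCED:
            raise ValueError(f"unknown function: {name}")
        # abs is a builtin; every other name lives in the math module.
        namespace[name] = abs if name == "abs" else getattr(math, name)
    return namespace


def evaluate(expression, x, enabled=BASIC):
    """Evaluate expression at scalar x using only the enabled functions."""
    scope = build_namespace(enabled)
    scope["x"] = x
    # Empty __builtins__ keeps names outside the whitelist from resolving.
    return eval(expression, {"__builtins__": {}}, scope)
```

With the default basic set, `evaluate("2 * sin(x) + 1", 0.0)` works, while `evaluate("sinh(x)", 1.0)` raises `NameError` until the advanced functions are enabled.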

Fitting Results Interpretation

After fitting completes, the application displays:

  • 📊 Fitting Comparison Plot: Original curve vs AI-discovered fitting curve
  • 📈 Score History: Shows fitting quality improvement over iterations
  • 📊 Pareto Front: Analyzes trade-off between expression complexity and accuracy
  • 📞 API Call Log: Detailed model call records and performance statistics

📁 Tab 2: File Upload Fitting

This is the preferred method for professional users, supporting complex multi-dimensional data and physical unit validation.

Data Preparation

Prepare NPZ files containing the following keys:

  • 'x': Input features, shape (n_samples, n_features)
  • 'y': Output targets, shape (n_samples,)
  • 'error': Optional error data, shape (n_samples,)

Python example code:

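import numpy as np

rng = np.random.default_rng(42)  # seeded so the file is reproducible
n = 100
sigma = 0.1  # assumed measurement uncertainty on F, in N

# Generate example data: F = m * a (Newton's second law)
m = rng.uniform(1, 10, n)  # mass, kg
a = rng.uniform(0.5, 20, n)  # acceleration, m/s^2
F = m * a + rng.normal(0, sigma, n)  # measured force, N (with noise)
error = np.full(n, sigma)  # per-sample uncertainty (non-negative)

# Save as NPZ file
x = np.column_stack([m, a])  # input feature matrix, shape (100, 2)
y = F  # output target, shape (100,)
np.savez('physics_data.npz', x=x, y=y, error=error)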
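
Before uploading, the file can be checked locally against the key and shape rules listed under Data Preparation. The sketch below is an illustration only: `validate_npz` is a hypothetical helper, and the application's own validation may differ.

```python
import numpy as np


def validate_npz(path):
    """Check an NPZ file against the key and shape rules listed above."""
    with np.load(path) as data:
        if "x" not in data.files or "y" not in data.files:
            raise ValueError("NPZ file must contain 'x' and 'y'")
        x, y = data["x"], data["y"]
        error = data["error"] if "error" in data.files else None
    if x.ndim != 2:
        raise ValueError(f"'x' must be 2-D, got shape {x.shape}")
    n_samples, n_features = x.shape
    if y.shape != (n_samples,):
        raise ValueError(f"'y' must have shape ({n_samples},), got {y.shape}")
    if error is not None and error.shape != (n_samples,):
        raise ValueError(f"'error' must have shape ({n_samples},)")
    return {"n_samples": n_samples, "n_features": n_features,
            "has_error": error is not None}
```

The returned dictionary mirrors the summary the upload step reports: number of samples, number of features, and whether errors are included.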

Operation Steps

  1. Upload Data File

    • Click Choose NPZ File to upload data
    • The system automatically validates the data format
    • It then displays basic data information: number of samples, number of features, and whether errors are included
  2. Variable Configuration (Key step)

    Set in ⚙️ Variable Configuration area:

    Basic Description:

    • Input Description: Describe the physical meaning of input data
      Example: "Use object's mass and acceleration to derive force acting on the object"

    Output Variables:

    • Output Variable Name: F
    • Output Variable Description: force
    • Output Variable Unit: kg*m/s^2

    Input Variable Configuration: Configure for each input feature:

    • Variable Name: m, a (corresponding to mass, acceleration)
    • Unit: kg, m/s^2
    • Description: mass, acceleration
  3. Advanced Options

    • Enable Unit Validation: When checked, performs dimensional analysis to ensure generated expressions are physically correct
    • Uncheck to skip unit checking, suitable for pure mathematical fitting
  4. Parameter Tuning

    • Sidebar parameters are the same as in drawing mode
    • For complex data, the recommended settings are:
      • Number of Islands: 6-8
      • Number of Cycles: 8-10
      • Enable Fuzzy mode
  5. Execution and Results

    • Click ▶️ Start Fitting
    • For multi-dimensional data, results shown as predicted vs actual scatter plots
    • Ideally, points should be distributed near the y=x line
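
"Points near the y=x line" can also be checked numerically. The sketch below computes the coefficient of determination R² between actual and predicted values; this is a standard statistic chosen here for illustration and is not necessarily the 0-100 score the application reports. The data is regenerated from the F = m * a example, and the second candidate is deliberately wrong.

```python
import numpy as np


def r_squared(actual, predicted):
    """Coefficient of determination; 1.0 means every point lies on y = x."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    ss_res = np.sum((actual - predicted) ** 2)      # scatter around y = x
    ss_tot = np.sum((actual - actual.mean()) ** 2)  # total scatter of the data
    return 1.0 - ss_res / ss_tot


rng = np.random.default_rng(0)
m = rng.uniform(1, 10, 100)
a = rng.uniform(0.5, 20, 100)
F = m * a  # noise-free target for this illustration

good = r_squared(F, m * a)  # the true law reproduces the data exactly
poor = r_squared(F, m + a)  # a wrong candidate falls well off the y = x line
```

A candidate whose R² is close to 1 produces a scatter plot that hugs the diagonal.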

🎯 Parameter Tuning Guide

Key Parameter Explanations

| Parameter | Recommended Value | Description | Tuning Suggestions |
| --- | --- | --- | --- |
| Number of Islands | 3-8 | Number of parallel evolution populations | Increasing it improves diversity but consumes more API calls |
| Number of Cycles | 3-10 | Number of evolution generations | More cycles usually yield better results |
| Unit Interactions | 5-10 | LLM calls per cycle | Balance exploration depth and cost |
| Target Score | 80.0 | Automatic stop threshold | Adjust based on accuracy requirements (0-100) |
| Sample Temperature | 10-30 | Generation randomness control | Higher temperature increases creativity; lower temperature is more stable |
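
These parameters multiply, so a rough call budget is useful before starting a run. The estimate below assumes that each island makes "Unit Interactions" LLM calls in every cycle; that per-island reading is an assumption, and the result is an upper bound because reaching the target score stops the run early. `estimate_llm_calls` is a hypothetical helper.

```python
def estimate_llm_calls(islands, cycles, unit_interactions):
    """Upper bound on LLM calls, assuming every island makes
    `unit_interactions` calls in each cycle (an assumption)."""
    return islands * cycles * unit_interactions


recommended = estimate_llm_calls(islands=5, cycles=5, unit_interactions=8)
heavy = estimate_llm_calls(islands=8, cycles=10, unit_interactions=10)
print(recommended, heavy)  # 200 800
```

Under this assumption, moving from the recommended settings to the top of every range quadruples the worst-case number of calls.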

🔧 Troubleshooting

Common Problem Solutions

Q: Application fails to start?

# Check Python version (requires 3.10+)
python --version

# Reinstall dependencies
uv sync

Q: API calls failing?

  1. Check if api_keys.json format is correct
  2. Confirm the API keys are valid and the account has remaining balance
  3. Verify network connection
  4. Check if model names exactly match configuration file key names

Q: Fitting results unsatisfactory?

  1. Increase search intensity: Raise number of islands and cycles
  2. Enable Fuzzy mode: Use natural language theory descriptions
  3. Try different models: GPT-4o usually performs better than Mini versions
  4. Optimize data quality: Ensure canvas curves are clear and data is evenly distributed
  5. Adjust function library: Choose appropriate basic functions based on expected function types

Q: Memory or performance issues?

  1. Lower number of islands and cycles
  2. Use more lightweight models
  3. Reduce number of data points
  4. Turn off some unnecessary visualizations

Log Viewing

The application automatically saves detailed logs in the logs/ directory:

logs/
├── fit_20231208_143022/    # Fitting process logs
├── db_20231208_143022/     # IdeaSearcher database files
└── ...

Each fitting creates an independent timestamped directory containing:

  • Complete fitting process records
  • API call details
  • Error messages and debug output
  • Best expressions and Pareto front data
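
To jump straight to the most recent run, the timestamp in the directory name can be parsed. This sketch relies only on the `fit_YYYYMMDD_HHMMSS` naming shown above; `latest_fit_log` is a hypothetical helper, and the layout inside each directory is not assumed.

```python
from datetime import datetime
from pathlib import Path


def latest_fit_log(log_root="logs"):
    """Return the newest fit_YYYYMMDD_HHMMSS directory, or None."""
    runs = []
    for path in Path(log_root).glob("fit_*"):
        try:
            stamp = datetime.strptime(path.name, "fit_%Y%m%d_%H%M%S")
        except ValueError:
            continue  # skip names that do not follow the timestamp pattern
        if path.is_dir():
            runs.append((stamp, path))
    if not runs:
        return None
    return max(runs, key=lambda run: run[0])[1]
```

Sorting by the parsed timestamp rather than by file modification time keeps the result stable even if old logs are copied or touched later.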

📚 Technical Architecture

Core Components

  • Streamlit: Web application framework
  • IdeaSearch-framework: Core optimization engine
  • IdeaSearch-fit: Symbolic regression adapter
  • streamlit-drawable-canvas: Drawing canvas component

Data Flow

User Input (Canvas/File) → Data Preprocessing → IdeaSearchFitter → IdeaSearcher → LLM Calls → Expression Generation → Evaluation and Selection → Result Display

File Structure

app.py
api_keys.json
api_keys.json.example
pyproject.toml
run.sh
README.md
ARCHITECTURE.md

🚀 Advanced Features

Fuzzy Mode

Fuzzy mode is a unique feature of IdeaSearch: the LLM first generates natural language theory descriptions, which are then converted to mathematical expressions:

  1. Theory Generation: LLM analyzes data characteristics and generates physical or mathematical theory hypotheses
  2. Expression Conversion: Convert natural language theories to specific mathematical formulas
  3. Iterative Optimization: Continue refining theories and expressions based on fitting results

Applicable scenarios:

  • Physical law discovery
  • Complex nonlinear relationship modeling
  • Symbolic regression requiring interpretability

Physical Unit Validation

When unit validation is enabled, the system will:

  1. Dimensional Analysis: Check dimensional consistency of expressions
  2. Unit Derivation: Verify if output units match expectations
  3. Correction Suggestions: Provide correction suggestions for expressions whose units do not match

This ensures generated expressions are physically meaningful.
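
The idea behind dimensional analysis can be sketched by representing each unit as a map from base units to exponents: multiplication adds exponents, and addition requires identical units. The code below is an illustration under that representation, not the validator the application uses; all names are hypothetical.

```python
def multiply(u, v):
    """Units of a product: add exponents and drop any that cancel to zero."""
    out = dict(u)
    for base, power in v.items():
        out[base] = out.get(base, 0) + power
    return {base: power for base, power in out.items() if power != 0}


def divide(u, v):
    """Units of a quotient: subtract the divisor's exponents."""
    return multiply(u, {base: -power for base, power in v.items()})


def add(u, v):
    """Units of a sum: both operands must already agree."""
    if u != v:
        raise ValueError(f"cannot add quantities with units {u} and {v}")
    return u


MASS = {"kg": 1}                       # kg
ACCELERATION = {"m": 1, "s": -2}       # m/s^2
FORCE = {"kg": 1, "m": 1, "s": -2}     # kg*m/s^2

# The candidate F = m * a has the expected output unit.
assert multiply(MASS, ACCELERATION) == FORCE
```

A candidate such as F = m + a fails immediately, because `add(MASS, ACCELERATION)` raises `ValueError`.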

Island Evolution Algorithm

  • Parallel Search: Multiple "islands" simultaneously evolve different expression populations
  • Population Exchange: Islands periodically exchange excellent individuals
  • Diversity Maintenance: Avoid premature convergence to local optima
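
The three points above can be illustrated with a toy island model. In this sketch each candidate is a single coefficient k in y = k * x, and Gaussian mutation stands in for the LLM proposing new expressions; both are simplifications made for illustration, and `evolve` is not the IdeaSearch implementation. Migration is a simple ring in which each island's best candidate replaces its neighbor's worst.

```python
import random


def loss(k, data):
    """Mean squared error of the candidate y = k * x."""
    return sum((k * x - y) ** 2 for x, y in data) / len(data)


def evolve(data, islands=5, cycles=5, pop_size=6, seed=0):
    rng = random.Random(seed)
    # Each island starts from its own random population of candidates.
    pops = [[rng.uniform(-10, 10) for _ in range(pop_size)]
            for _ in range(islands)]
    for _ in range(cycles):
        for pop in pops:
            # Mutate every candidate, then keep the best pop_size overall.
            children = [k + rng.gauss(0, 1.0) for k in pop]
            ranked = sorted(pop + children, key=lambda k: loss(k, data))
            pop[:] = ranked[:pop_size]
        # Ring migration: each island's best replaces its neighbor's worst.
        best = [pop[0] for pop in pops]
        for i, pop in enumerate(pops):
            pop[-1] = best[i - 1]
    return min((k for pop in pops for k in pop), key=lambda k: loss(k, data))


data = [(x, 3.0 * x) for x in range(1, 6)]  # generated with true k = 3
best_k = evolve(data)
```

Because the best candidate on each island always survives selection and migration only overwrites the worst slot, the best loss never gets worse from one cycle to the next.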

Pareto Front Optimization

Balances two objectives:

  • Fitting Accuracy: Degree of expression matching with data
  • Expression Complexity: Simplicity of formulas

This helps users find a good balance between accuracy and interpretability.
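
A Pareto front keeps every candidate that no other candidate beats on both objectives at once. The sketch below treats lower complexity and lower error as better; the candidate expressions and their numbers are invented for illustration and are not results from the application.

```python
def pareto_front(candidates):
    """Keep candidates not dominated in (complexity, error); lower is better."""
    front = []
    for name, complexity, error in candidates:
        dominated = any(
            c <= complexity and e <= error and (c < complexity or e < error)
            for _, c, e in candidates
        )
        if not dominated:
            front.append((name, complexity, error))
    return sorted(front, key=lambda item: item[1])


# Illustrative values only: (expression, complexity, error).
candidates = [
    ("x", 1, 9.0),
    ("2*x", 3, 4.0),
    ("sin(x) + x", 4, 4.5),          # dominated by "2*x"
    ("2*x + 0.1*sin(x)", 8, 0.5),
    ("x**3 - x + exp(x)", 9, 0.7),   # dominated by the previous entry
]

front = pareto_front(candidates)
```

Here `front` contains "x", "2*x", and "2*x + 0.1*sin(x)": each step along the front buys lower error at the cost of a more complex formula.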

📈 Performance Optimization Suggestions

  1. Model Selection: Gemini 2.5 Flash provides best cost-effectiveness
  2. Batch Processing: Use larger unit interaction numbers to reduce network overhead
  3. Caching: System automatically caches intermediate results
  4. Parallelization: Island algorithm naturally supports parallel computing
  5. Early Stopping: Set a reasonable target score so the search stops once it is reached, which avoids overfitting and unnecessary API calls

🤝 Contribution and Feedback

Encountered a problem or have a suggestion for improvement?


🎯 Start exploring AI-driven symbolic regression!

Let large language models drive research and promote scientific discovery
