Merged
143 changes: 143 additions & 0 deletions SAMPLE_DATA_GENERATOR_IMPROVEMENTS.md
@@ -0,0 +1,143 @@
# Sample Data Generator Improvements Summary

## 🎯 Objective Achieved
The sample data generator now produces **exactly 8 event types** with **higher user connectivity**, which gives more realistic input for funnel analysis.

## 📊 Key Improvements

### 1. **Exactly 8 Events** ✅
**Before:** 6 main funnel events + 8 additional scattered events (14 total)
**After:** Exactly 8 focused funnel events

```python
# NEW Event Sequence (8 events)
event_sequence = [
"Sign Up",
"Email Verification",
"First Login",
"Profile Setup",
"Product Browse",
"Add to Cart",
"Checkout Start",
"Purchase Complete",
]
```

### 2. **Higher User Connectivity** 📈
**Before:** Aggressive dropout rates (25%, 20%, 25%, 20%, 22%)
**After:** Gradual, realistic dropout rates (12%, 15%, 18%, 20%, 22%, 25%, 28%)
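
As a rough check on the new dropout rates, the expected number of users reaching each step can be computed by applying each rate in turn. This is a minimal sketch, not the generator's code: `expected_users_per_step` is a hypothetical helper, and flooring the user count at every step is an assumption.

```python
# Sketch only: expected_users_per_step is a hypothetical helper,
# not a function from the generator.
EVENT_SEQUENCE = [
    "Sign Up",
    "Email Verification",
    "First Login",
    "Profile Setup",
    "Product Browse",
    "Add to Cart",
    "Checkout Start",
    "Purchase Complete",
]

# Dropout between consecutive steps, in percent (the "After" rates above).
DROPOUT_PERCENT = [12, 15, 18, 20, 22, 25, 28]


def expected_users_per_step(total_users, dropout_percent):
    """Return the user count at each step, flooring after every dropout."""
    counts = [total_users]
    for pct in dropout_percent:
        # Integer arithmetic avoids float rounding; flooring is an assumption.
        counts.append(counts[-1] * (100 - pct) // 100)
    return counts


counts = expected_users_per_step(8000, DROPOUT_PERCENT)
for name, n in zip(EVENT_SEQUENCE, counts):
    print(f"{name}: {n}")
```

Under these assumptions, 8,000 sign-ups leave 1,652 users at Purchase Complete. Repeat events are not modeled here, so counts for steps that receive repeat interactions will differ from generated data.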

#### Connectivity Statistics:
- **Average events per user:** 4.91 (significantly improved)
- **Users completing all 8 events:** 20.6% (1,652 users)
- **Users completing 5+ events:** 56.1% (4,490 users)
- **Users completing only 1 event:** 7.2% (580 users)

### 3. **Weighted User Retention** 🎯
Implemented intelligent user selection based on user properties:
- **Premium users:** 1.8x more likely to continue
- **Basic users:** 1.3x more likely to continue
- **Younger users (18-35):** 1.2x more likely to continue
- **Free users:** Standard retention rate
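
One way to apply these multipliers is weighted sampling without replacement when choosing who continues to the next step. The sketch below uses the Efraimidis-Spirakis key method with the standard library only. The field names `plan` and `age`, the helper names, and the choice to multiply the plan and age weights together are all assumptions, not details confirmed by the generator.

```python
import random

# Multipliers from the list above. Combining them by multiplication is an assumption.
PLAN_WEIGHTS = {"premium": 1.8, "basic": 1.3, "free": 1.0}
YOUNG_WEIGHT = 1.2  # applied to ages 18-35


def retention_weight(plan, age):
    """Relative likelihood that a user continues to the next step."""
    weight = PLAN_WEIGHTS.get(plan, 1.0)
    if 18 <= age <= 35:
        weight *= YOUNG_WEIGHT
    return weight


def select_continuing_users(users, dropout_rate, rng):
    """Keep (1 - dropout_rate) of the users, favoring higher weights.

    Efraimidis-Spirakis sampling: each user draws key = u ** (1 / weight)
    with u uniform in [0, 1), and the users with the largest keys are kept.
    """
    keep = round(len(users) * (1 - dropout_rate))
    keyed = [
        (rng.random() ** (1.0 / retention_weight(u["plan"], u["age"])), u)
        for u in users
    ]
    keyed.sort(key=lambda pair: pair[0], reverse=True)
    return [u for _, u in keyed[:keep]]


rng = random.Random(42)
plans = ["premium", "basic", "free"]
users = [
    {"user_id": i, "plan": plans[i % 3], "age": 18 + (i % 50)}
    for i in range(1000)
]
continuing = select_continuing_users(users, dropout_rate=0.12, rng=rng)
print(len(continuing))
```

The dropout rate fixes how many users continue, while the weights only bias which users they are.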

### 4. **Realistic Timing Progression** ⏰
**Before:** Exponential time distribution with cohort factors
**After:** Step-specific realistic timing:
- **Sign Up:** On registration date
- **Email Verification:** Within 2 hours (exponential)
- **First Login:** Within 12 hours (exponential)
- **Profile Setup:** Within 48 hours (exponential)
- **Shopping Events:** Spread over weeks
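
The onboarding delays could be sampled roughly as follows. This is a sketch under stated assumptions: the caps come from the list above, but the mean delays, clipping the exponential draw at the cap, and measuring each delay from the previous step are guesses, and `sample_delay` and `onboarding_timestamps` are hypothetical names.

```python
import random
from datetime import datetime, timedelta

# step -> (mean_hours, cap_hours). Caps are taken from the list above;
# the means and the clipping at the cap are assumptions.
STEP_DELAYS = {
    "Email Verification": (0.5, 2),
    "First Login": (3, 12),
    "Profile Setup": (12, 48),
}


def sample_delay(step, rng):
    """Draw an exponential delay for a step, clipped at its cap."""
    mean_hours, cap_hours = STEP_DELAYS[step]
    hours = min(rng.expovariate(1.0 / mean_hours), cap_hours)
    return timedelta(hours=hours)


def onboarding_timestamps(signup_time, rng):
    """Return a timestamp per onboarding step, each measured from the previous one."""
    timestamps = {"Sign Up": signup_time}
    current = signup_time
    for step in STEP_DELAYS:
        current = current + sample_delay(step, rng)
        timestamps[step] = current
    return timestamps


rng = random.Random(7)
ts = onboarding_timestamps(datetime(2024, 1, 1, 9, 0), rng)
for step, moment in ts.items():
    print(step, moment.isoformat())
```

Adding each delay to the previous timestamp keeps the steps in funnel order for every user.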

### 5. **Enhanced Event Properties** 🔧
Added step-specific rich properties for better analysis:

#### Purchase Complete:
- `order_value`: $30-$300 range (lognormal distribution)
- `payment_method`: credit_card, paypal, apple_pay, google_pay
- `product_category`: electronics, clothing, books, home

#### Add to Cart:
- `cart_value`: $25-$200 range
- `items_count`: 1-5 items with realistic distribution

#### Product Browse:
- `pages_viewed`: 1-8 pages viewed
- `time_spent_minutes`: Exponential distribution (avg 8 min)
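
A minimal sketch of how such step-specific properties might be generated with the standard library. The lognormal parameters, the clipping to the $30-$300 range, and the uniform draw for `pages_viewed` are illustrative assumptions, and the function names are hypothetical.

```python
import random

PAYMENT_METHODS = ["credit_card", "paypal", "apple_pay", "google_pay"]
PRODUCT_CATEGORIES = ["electronics", "clothing", "books", "home"]


def purchase_properties(rng):
    """Properties for a Purchase Complete event."""
    # mu and sigma are illustrative guesses; clipping keeps the value in $30-$300.
    raw_value = rng.lognormvariate(4.4, 0.6)
    order_value = min(max(raw_value, 30.0), 300.0)
    return {
        "order_value": round(order_value, 2),
        "payment_method": rng.choice(PAYMENT_METHODS),
        "product_category": rng.choice(PRODUCT_CATEGORIES),
    }


def browse_properties(rng):
    """Properties for a Product Browse event."""
    return {
        # Uniform over 1-8 pages is an assumption.
        "pages_viewed": rng.randint(1, 8),
        # Exponential with a mean of 8 minutes, as described above.
        "time_spent_minutes": round(rng.expovariate(1.0 / 8.0), 1),
    }


rng = random.Random(0)
purchases = [purchase_properties(rng) for _ in range(200)]
browses = [browse_properties(rng) for _ in range(200)]
print(purchases[0])
print(browses[0])
```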

### 6. **Cross-Step Engagement Events** 🔄
Added repeat interactions for 40% of users to increase connectivity:
- **Repeat events:** Product Browse, Add to Cart
- **Timing:** 1 week to 2 months after initial journey
- **Enhanced properties:** Longer sessions, higher values for repeat users
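
The repeat-interaction step could look something like the sketch below. It assumes a mapping from user ID to the end of that user's initial journey, treats "2 months" as 60 days, and uses hypothetical names (`add_repeat_events`, `is_repeat_action`); none of these are confirmed by the generator.

```python
import random
from datetime import datetime, timedelta

REPEAT_EVENTS = ["Product Browse", "Add to Cart"]
REPEAT_SHARE = 0.40  # 40% of users get a repeat interaction


def add_repeat_events(journey_end_by_user, rng):
    """Create one flagged repeat event for a random 40% of users."""
    user_ids = sorted(journey_end_by_user)
    repeat_users = rng.sample(user_ids, k=round(len(user_ids) * REPEAT_SHARE))
    events = []
    for user_id in repeat_users:
        # 1 week to 2 months after the initial journey (2 months taken as 60 days).
        offset_days = rng.uniform(7, 60)
        events.append(
            {
                "user_id": user_id,
                "event_name": rng.choice(REPEAT_EVENTS),
                "timestamp": journey_end_by_user[user_id] + timedelta(days=offset_days),
                "is_repeat_action": True,
            }
        )
    return events


rng = random.Random(1)
journey_end = {f"user_{i}": datetime(2024, 1, 1) + timedelta(hours=i) for i in range(100)}
repeat_events = add_repeat_events(journey_end, rng)
print(len(repeat_events))
```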

### 7. **Improved Data Quality** 🛠️
- **JSON Serialization:** Fixed numpy type issues with explicit type casting
- **Session Tracking:** Added unique session IDs for each user interaction
- **Repeat Action Flags:** Marked repeat actions for analysis
- **Performance:** Reduced from 10,000 to 8,000 users for optimal performance
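
The serialization fix is likely along these lines: `json.dumps` rejects NumPy integer scalars such as `np.int64`, so values are cast to built-in types first. This is a hedged sketch rather than the generator's code. `to_builtin` and `build_event` are hypothetical helpers, and the event field names are assumptions.

```python
import json
import uuid

import numpy as np


def to_builtin(value):
    """Convert NumPy scalars and arrays to plain Python types; leave others untouched."""
    if isinstance(value, np.generic):
        return value.item()
    if isinstance(value, np.ndarray):
        return value.tolist()
    return value


def build_event(user_id, event_name, properties):
    """Build one event record with a unique session ID and JSON-encoded properties."""
    clean = {key: to_builtin(val) for key, val in properties.items()}
    return {
        "user_id": user_id,
        "event_name": event_name,
        "session_id": str(uuid.uuid4()),
        "event_properties": json.dumps(clean),
    }


event = build_event(
    "user_1",
    "Add to Cart",
    {"cart_value": np.float64(59.9), "items_count": np.int64(3)},
)
print(event["event_properties"])
```

Casting at the point where the record is built keeps the rest of the pipeline free of NumPy-specific types.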

## 📈 Results Comparison

### Event Distribution (event counts, as a percentage of the 8,000 generated users):
```
Sign Up: 8,000 events (100.0%)
Email Verification: 7,040 events (88.0%)
Product Browse: 6,711 events (83.9%) ← High engagement
First Login: 5,984 events (74.8%)
Add to Cart: 5,812 events (72.7%) ← High engagement
Profile Setup: 4,906 events (61.3%)
Checkout Start: 2,295 events (28.7%)
Purchase Complete: 1,652 events (20.6%)
```

### User Journey Connectivity:
```
1 event: 580 users (7.2%) ← Very few single-event users
2 events: 909 users (11.4%)
3 events: 1,025 users (12.8%)
4 events: 996 users (12.4%)
5 events: 1,073 users (13.4%)
6 events: 1,122 users (14.0%)
7 events: 643 users (8.0%)
8 events: 1,652 users (20.6%) ← Strong complete journey rate
```
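
The summary figures can be recomputed directly from the connectivity table above. The snippet only restates those counts; the variable names are illustrative.

```python
# Users grouped by how many events they completed (from the table above).
USERS_BY_EVENT_COUNT = {
    1: 580,
    2: 909,
    3: 1025,
    4: 996,
    5: 1073,
    6: 1122,
    7: 643,
    8: 1652,
}

total_users = sum(USERS_BY_EVENT_COUNT.values())
total_events = sum(k * n for k, n in USERS_BY_EVENT_COUNT.items())
average_events = total_events / total_users
users_5_plus = sum(n for k, n in USERS_BY_EVENT_COUNT.items() if k >= 5)

print(total_users)               # 8000
print(round(average_events, 2))  # 4.91
print(users_5_plus)              # 4490
```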

## 🎯 Business Impact

### Better Analysis Capabilities:
1. **Process Mining:** More connected user journeys for path analysis
2. **Cohort Analysis:** Realistic user behavior patterns
3. **Time Series:** Proper timing distributions for temporal analysis
4. **Segmentation:** Rich properties for detailed segmentation

### More Realistic Funnels:
1. **E-commerce Focus:** Clear shopping journey from browse to purchase
2. **Engagement Patterns:** Repeat interactions mirror real user behavior
3. **Retention Logic:** User properties influence journey completion
4. **Revenue Tracking:** Order values and payment methods for business analysis

## 🔧 Technical Improvements

### Performance:
- **Data Size:** Optimized from 10K to 8K users
- **Event Focus:** Reduced from 14 to 8 events for clarity
- **JSON Handling:** Fixed serialization issues
- **Memory Usage:** More efficient data generation

### Code Quality:
- **Type Safety:** Explicit type casting for JSON serialization
- **Documentation:** Clear comments explaining each improvement
- **Maintainability:** Structured approach to event generation
- **Testing:** Verified with actual funnel analysis

## ✅ Success Metrics

1. **✅ Exactly 8 Events:** Achieved - no more, no less
2. **✅ Higher Connectivity:** 4.91 average events per user, with 4,490 of 8,000 users reaching 5+ events
3. **✅ Realistic Patterns:** E-commerce journey with proper timing
4. **✅ Rich Properties:** Step-specific properties for advanced analysis
5. **✅ Performance:** Faster generation and analysis
6. **✅ Compatibility:** Works seamlessly with existing funnel calculator

The improved sample data generator now provides a much more realistic and connected dataset that better represents actual user behavior in an e-commerce funnel, enabling more meaningful analysis and testing of the funnel analytics platform.
30 changes: 28 additions & 2 deletions app.py
@@ -925,7 +925,7 @@ def main():
     )

     # Segmentation in a separate section for better visibility
-    st.markdown("### 🎯 Segmentation (Optional)")
+    st.markdown("### 🎯 Segmentation (Optional, in development)")

     # Segmentation controls
     selected_property = "None"
@@ -2047,6 +2047,20 @@ def format_time(minutes):
             help="Display transition counts on visualizations",
         )

+    with col4:
+        use_funnel_events_only = st.checkbox(
+            "Use selected events only",
+            value=True,
+            help="Analyze only the events selected in your funnel (recommended for focused analysis)",
+        )
+
+    # Show warning if filtering is enabled but no funnel events selected
+    if use_funnel_events_only and not st.session_state.funnel_steps:
+        st.warning(
+            "⚠️ 'Use selected events only' is enabled but no funnel events are selected. "
+            "Please build your funnel first or disable this option to analyze all events."
+        )
+
     # Process Mining Analysis
     if st.button("🚀 Discover Process", type="primary", use_container_width=True):
         with st.spinner("Analyzing user journeys..."):
@@ -2055,18 +2069,30 @@ def format_time(minutes):
                 config = FunnelConfig()
                 path_analyzer = PathAnalyzer(config)

+                # Determine which events to analyze
+                filter_events = None
+                if use_funnel_events_only and st.session_state.funnel_steps:
+                    filter_events = st.session_state.funnel_steps
+
                 # Discover process structure
                 process_data = path_analyzer.discover_process_mining_structure(
                     st.session_state.events_data,
                     min_frequency=min_frequency,
                     include_cycles=include_cycles,
+                    filter_events=filter_events,
                 )

                 # Store in session state
                 st.session_state.process_mining_data = process_data

+                # Create success message with filtering info
+                if filter_events:
+                    filter_info = f" (filtered to {len(filter_events)} selected funnel events)"
+                else:
+                    filter_info = " (analyzing all events in dataset)"
+
                 st.success(
-                    f"✅ Discovered {len(process_data.activities)} activities and {len(process_data.transitions)} transitions"
+                    f"✅ Discovered {len(process_data.activities)} activities and {len(process_data.transitions)} transitions{filter_info}"
                 )

             except Exception as e: