Category: Content & Media | No API key required

data-ai-guide

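A comprehensive guide to data science, machine learning, and AI, covering Python, deep learning, NLP, large language models, prompt engineering, and MLOps. Use it when building AI models, data pipelines, or machine learning systems.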

Author: jakexiaohub (GitHub)

Data Science & AI Guide

Master data science, machine learning, generative AI, and modern AI engineering practices.

Quick Start

Python Data Science Stack

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

# Load and prepare data
df = pd.read_csv('data.csv')
X = df.drop('target', axis=1)
y = df['target']

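# Hold out 20% of the rows for evaluation (fixed seed for a reproducible split)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train model
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Evaluate: score() returns mean accuracy on the held-out rows
accuracy = model.score(X_test, y_test)
print(f"Test accuracy: {accuracy:.3f}")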

Deep Learning with PyTorch

import torch
import torch.nn as nn

class SimpleNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear1 = nn.Linear(784, 128)
        self.linear2 = nn.Linear(128, 10)

    def forward(self, x):
        x = torch.relu(self.linear1(x))
        return self.linear2(x)

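# Training setup
model = SimpleNN()
optimizer = torch.optim.Adam(model.parameters())
criterion = nn.CrossEntropyLoss()

# Placeholder batch (random tensors stand in for a real DataLoader)
inputs = torch.randn(32, 784)
targets = torch.randint(0, 10, (32,))

# Training loop: forward pass, loss, backward pass, parameter update
for epoch in range(5):
    optimizer.zero_grad()
    loss = criterion(model(inputs), targets)
    loss.backward()
    optimizer.step()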

LLM Prompt Engineering

from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
  model="gpt-4",
  messages=[
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is machine learning?"}
  ],
  temperature=0.7
)

# The generated text is on the first choice's message
print(response.choices[0].message.content)
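The request above sends one system message and one user message. Few-shot prompting extends the same `messages` list with worked example exchanges before the real question. The following is a minimal sketch of that idea; the helper name `build_few_shot_messages` and the sentiment examples are hypothetical, and no API call is made.

```python
def build_few_shot_messages(system_prompt, examples, question):
    """Return a chat `messages` list with worked examples before the real question."""
    messages = [{"role": "system", "content": system_prompt}]
    for example_question, example_answer in examples:
        # Each example is a user turn followed by the assistant reply to imitate.
        messages.append({"role": "user", "content": example_question})
        messages.append({"role": "assistant", "content": example_answer})
    messages.append({"role": "user", "content": question})
    return messages


# Illustrative examples only; swap in examples from your own task.
examples = [
    ("Classify the sentiment: 'I love this library.'", "positive"),
    ("Classify the sentiment: 'The docs are confusing.'", "negative"),
]
messages = build_few_shot_messages(
    "You are a sentiment classifier. Answer with one word.",
    examples,
    "Classify the sentiment: 'Setup was quick and painless.'",
)
print(len(messages))  # 1 system + 2 examples * 2 turns + 1 question
```

The resulting list can be passed as the `messages` argument of the chat completion call shown above.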

Data Science Path

Fundamentals

  • Mathematics: Statistics, linear algebra, calculus
  • Python: Libraries (Pandas, NumPy, Scikit-learn)
  • Data Analysis: Exploratory analysis, visualization
  • SQL: Querying and data manipulation

Machine Learning

  • Supervised Learning: Regression, classification
  • Unsupervised Learning: Clustering, dimensionality reduction
  • Model Evaluation: Cross-validation, metrics
  • Hyperparameter Tuning: Grid search, Bayesian optimization
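To make cross-validation and grid search concrete, here is a dependency-free sketch. It is not how Scikit-learn implements them: the k-nearest-neighbours classifier, the toy clusters, and all function names are assumptions chosen to keep the example short.

```python
import random
from collections import Counter


def k_fold_indices(n_samples, n_folds, seed=0):
    """Shuffle sample indices and split them into n_folds nearly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::n_folds] for i in range(n_folds)]


def knn_predict(train_points, train_labels, point, k):
    """Predict by majority vote among the k nearest training points."""
    nearest = sorted(
        range(len(train_points)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(train_points[i], point)),
    )[:k]
    return Counter(train_labels[i] for i in nearest).most_common(1)[0][0]


def cross_val_accuracy(points, labels, k, n_folds=5):
    """Average held-out accuracy over n_folds train/validation splits."""
    scores = []
    for fold in k_fold_indices(len(points), n_folds):
        held_out = set(fold)
        train = [i for i in range(len(points)) if i not in held_out]
        train_points = [points[i] for i in train]
        train_labels = [labels[i] for i in train]
        correct = sum(
            knn_predict(train_points, train_labels, points[i], k) == labels[i]
            for i in fold
        )
        scores.append(correct / len(fold))
    return sum(scores) / len(scores)


# Toy data: two well-separated 2-D clusters (made up for illustration).
rng = random.Random(42)
points = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(30)]
points += [(rng.gauss(5, 1), rng.gauss(5, 1)) for _ in range(30)]
labels = [0] * 30 + [1] * 30

# Grid search: score every candidate k with cross-validation, keep the best.
grid = [1, 3, 5, 7]
results = {k: cross_val_accuracy(points, labels, k) for k in grid}
best_k = max(results, key=results.get)
print(results, best_k)
```

In practice Scikit-learn's `cross_val_score` and `GridSearchCV` cover the same ground with far less code.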

Deep Learning

  • Neural Networks: Architecture, training
  • CNNs: Computer vision tasks
  • RNNs: Sequence modeling
  • Transformers: Modern architecture for NLP/Vision

Natural Language Processing

  • Text Processing: Tokenization, embeddings
  • Word Embeddings: Word2Vec, GloVe, FastText
  • BERT: Contextual embeddings
  • Transformers: GPT, BERT for various NLP tasks
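As a rough illustration of tokenization and of comparing texts as vectors, the sketch below uses bag-of-words counts and cosine similarity. Counts are a crude stand-in for learned embeddings such as Word2Vec or BERT, and the function names are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bag_of_words(text):
    """Sparse count vector: maps each token to how often it occurs."""
    return Counter(tokenize(text))


def cosine_similarity(vec_a, vec_b):
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(count * vec_b[token] for token, count in vec_a.items())
    norm_a = math.sqrt(sum(c * c for c in vec_a.values()))
    norm_b = math.sqrt(sum(c * c for c in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


doc_a = bag_of_words("Transformers power modern NLP models.")
doc_b = bag_of_words("Modern NLP relies on transformers.")
doc_c = bag_of_words("Kafka streams events between services.")

print(cosine_similarity(doc_a, doc_b))  # three shared tokens out of five each
print(cosine_similarity(doc_a, doc_c))  # no shared tokens
```

Learned embeddings improve on this by placing related words near each other even when the surface tokens differ.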

Generative AI & LLMs

Large Language Models

  • GPT Family: GPT-3.5, GPT-4 for text generation
  • Claude: Anthropic's models, trained with Constitutional AI
  • Open Source: Llama, Mistral, Zephyr
  • Fine-tuning: Adapting models for specific tasks

Prompt Engineering

  • Role-based Prompting: Setting context and expertise
  • Few-shot Learning: Examples in prompt
  • Chain-of-Thought: Step-by-step reasoning
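  • Retrieval Augmented Generation (RAG): Knowledge augmentation

# RAG example (import paths follow the older LangChain layout;
# newer releases move them to langchain_community and langchain_openai)
from langchain.vectorstores import Chroma
from langchain.embeddings import OpenAIEmbeddings
from langchain.chat_models import ChatOpenAI
from langchain.chains import RetrievalQA

# Vector store that embeds documents with OpenAI embeddings
# (add documents to it before querying)
embeddings = OpenAIEmbeddings()
vectorstore = Chroma(embedding_function=embeddings)

# Chat model that answers from the retrieved context (model choice is an assumption)
llm = ChatOpenAI(model="gpt-4")

# "stuff" places all retrieved documents into a single prompt
qa = RetrievalQA.from_chain_type(
  llm=llm,
  chain_type="stuff",
  retriever=vectorstore.as_retriever()
)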
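The retrieve-then-stuff flow in the RAG example can be sketched without any framework. This version swaps embedding search for Jaccard token overlap, purely as an assumption to keep it runnable offline; the documents, function names, and prompt wording are all hypothetical.

```python
import re


def tokens(text):
    """Set of lowercase alphanumeric tokens in the text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query, documents, top_k=2):
    """Rank documents by Jaccard overlap with the query and return the top_k."""
    query_tokens = tokens(query)

    def score(doc):
        doc_tokens = tokens(doc)
        union = query_tokens | doc_tokens
        return len(query_tokens & doc_tokens) / len(union) if union else 0.0

    return sorted(documents, key=score, reverse=True)[:top_k]


def build_stuffed_prompt(query, context_docs):
    """Concatenate ("stuff") the retrieved passages into one prompt."""
    context = "\n".join(f"- {doc}" for doc in context_docs)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}"
    )


documents = [
    "Airflow schedules and orchestrates batch workflows.",
    "Data drift means the input distribution has changed since training.",
    "A star schema links one fact table to several dimension tables.",
]
query = "What is data drift?"
top_docs = retrieve(query, documents, top_k=1)
prompt = build_stuffed_prompt(query, top_docs)
print(prompt)
```

A production system would replace `retrieve` with a vector-store similarity search and send `prompt` to an LLM.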

AI Agents

  • Tool Use: Agents calling external tools
  • Planning: Multi-step task execution
  • Memory: Conversation history, context
  • Evaluation: Assessing agent performance
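The four ideas above (tool use, planning, memory, evaluation) fit into one small loop. In the sketch below a scripted planner stands in for the LLM, so every name here (`scripted_planner`, `run_agent`, the `"previous"` placeholder) is a hypothetical illustration and not any framework's API.

```python
def add(a, b):
    return a + b


def multiply(a, b):
    return a * b


# Tool registry: the agent may only call functions listed here.
TOOLS = {"add": add, "multiply": multiply}


def scripted_planner(goal, memory):
    """Hypothetical stand-in for an LLM planner: picks the next action from a fixed plan."""
    plan = [
        {"tool": "add", "args": (2, 3)},
        {"tool": "multiply", "args": ("previous", 4)},  # "previous" = last result
    ]
    return plan[len(memory)] if len(memory) < len(plan) else None


def run_agent(goal, planner, tools, max_steps=5):
    """Ask the planner for an action, run the tool, record the result, repeat."""
    memory = []
    for _ in range(max_steps):
        action = planner(goal, memory)
        if action is None:  # planner signals that the goal is reached
            break
        args = tuple(
            memory[-1]["result"] if arg == "previous" else arg
            for arg in action["args"]
        )
        result = tools[action["tool"]](*args)
        memory.append({"tool": action["tool"], "args": args, "result": result})
    return memory


history = run_agent("compute (2 + 3) * 4", scripted_planner, TOOLS)
print(history[-1]["result"])  # prints 20
```

Evaluation can start as simply as asserting on the final result and on the sequence of tools recorded in `history`.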

Data Engineering

ETL Pipelines

  • Apache Airflow: Workflow orchestration
  • dbt: Data transformation
  • Kafka: Stream processing
  • Spark: Distributed processing
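The tools above orchestrate or scale the same three steps: extract, transform, load. Here is a minimal single-process sketch using only the standard library; the CSV contents, table name, and cleaning rules are made up for illustration.

```python
import csv
import io
import sqlite3

# Inline sample data standing in for a source file or API export.
RAW_CSV = """order_id,amount,country
1,19.99,us
2,,de
3,5.00,US
"""


def extract(csv_text):
    """Extract: parse CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def transform(rows):
    """Transform: drop rows without an amount, cast types, normalise country codes."""
    cleaned = []
    for row in rows:
        if not row["amount"]:
            continue
        cleaned.append(
            (int(row["order_id"]), float(row["amount"]), row["country"].upper())
        )
    return cleaned


def load(rows, connection):
    """Load: write the cleaned rows into a SQLite table."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(order_id INTEGER PRIMARY KEY, amount REAL, country TEXT)"
    )
    connection.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    connection.commit()


connection = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), connection)
totals = connection.execute(
    "SELECT country, ROUND(SUM(amount), 2) FROM orders GROUP BY country"
).fetchall()
print(totals)
```

In an orchestrator such as Airflow, each of the three functions would typically become its own task so that failures can be retried independently.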

Big Data

  • Hadoop: Distributed storage and processing
  • Spark: In-memory processing framework
  • Scala: Spark's native language
  • Distributed Systems: Understanding CAP theorem

Data Warehousing

  • Snowflake: Cloud data warehouse
  • BigQuery: Google's data warehouse
  • Redshift: AWS data warehouse
  • Star Schema: Dimensional modeling
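A star schema keeps measurements in a central fact table and descriptive attributes in dimension tables that the fact table references. The sketch below builds a tiny one in SQLite; the table names, columns, and rows are hypothetical, and a cloud warehouse would use its own SQL dialect.

```python
import sqlite3

connection = sqlite3.connect(":memory:")
connection.executescript(
    """
    -- Dimension tables hold descriptive attributes.
    CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, category TEXT);
    CREATE TABLE dim_date (date_id INTEGER PRIMARY KEY, year INTEGER);

    -- The fact table holds measurements plus a key into each dimension.
    CREATE TABLE fact_sales (
        sale_id INTEGER PRIMARY KEY,
        product_id INTEGER REFERENCES dim_product(product_id),
        date_id INTEGER REFERENCES dim_date(date_id),
        revenue REAL
    );

    INSERT INTO dim_product VALUES (1, 'books'), (2, 'games');
    INSERT INTO dim_date VALUES (1, 2023), (2, 2024);
    INSERT INTO fact_sales VALUES
        (1, 1, 1, 10.0), (2, 1, 2, 15.0), (3, 2, 2, 40.0), (4, 2, 2, 5.0);
    """
)

# Typical analytical query: join facts to dimensions, then aggregate.
revenue_by_category_year = connection.execute(
    """
    SELECT p.category, d.year, SUM(f.revenue)
    FROM fact_sales AS f
    JOIN dim_product AS p ON p.product_id = f.product_id
    JOIN dim_date AS d ON d.date_id = f.date_id
    GROUP BY p.category, d.year
    ORDER BY p.category, d.year
    """
).fetchall()
print(revenue_by_category_year)
```

Every analytical question becomes a join from the fact table out to whichever dimensions it needs, which is what gives the schema its star shape.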

MLOps

Model Management

  • Model Versioning: Tracking model versions
  • Model Registry: MLflow, Weights & Biases
  • Experiment Tracking: Monitoring training runs
  • Model Cards: Documenting model capabilities

Deployment

  • Model Serving: FastAPI, TensorFlow Serving
  • Containerization: Docker for models
  • Kubernetes: Production ML deployment
  • API Monitoring: Performance and data drift

Monitoring

  • Data Drift: Detecting distribution changes
  • Model Drift: Performance degradation
  • Feature Store: Consistent feature serving
  • Observability: Logging and metrics
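One common way to quantify data drift for a single numeric feature is the Population Stability Index (PSI), which compares binned shares of a reference sample against a new sample. The sketch below is one simple variant, assuming equal-width bins taken from the reference range and a small floor for empty bins; the function name and sample data are illustrative.

```python
import math


def population_stability_index(expected, actual, n_bins=5):
    """PSI between a reference sample and a new sample of one numeric feature."""
    low, high = min(expected), max(expected)
    width = (high - low) / n_bins or 1.0  # fall back to 1.0 if all values are equal

    def bin_shares(values):
        counts = [0] * n_bins
        for value in values:
            # Clamp so values outside the reference range land in the edge bins.
            index = min(max(int((value - low) / width), 0), n_bins - 1)
            counts[index] += 1
        # A small floor avoids log(0) for empty bins.
        return [max(count / len(values), 1e-6) for count in counts]

    expected_shares = bin_shares(expected)
    actual_shares = bin_shares(actual)
    return sum(
        (a - e) * math.log(a / e) for e, a in zip(expected_shares, actual_shares)
    )


reference = [i / 100 for i in range(100)]    # training-time feature values
shifted = [value + 0.5 for value in reference]  # same shape, shifted upward

print(population_stability_index(reference, reference))  # identical samples
print(population_stability_index(reference, shifted))    # clearly drifted sample
```

A commonly quoted rule of thumb treats PSI above roughly 0.25 as a significant shift, though the threshold should be tuned per feature.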

Technology Stack

Core Libraries

  • Pandas: Data manipulation
  • NumPy: Numerical computing
  • Scikit-learn: Machine learning
  • Matplotlib/Seaborn: Visualization
  • Plotly: Interactive plots

Deep Learning

  • TensorFlow: Keras API, distributed training
  • PyTorch: Dynamic graphs, research-friendly
  • JAX: Functional programming for ML

LLM Frameworks

  • LangChain: Building LLM applications
  • LlamaIndex: RAG and indexing
  • OpenAI API: GPT models access
  • Hugging Face: Model hub and transformers

Learning Path

  1. Fundamentals (3 months)

    • Python programming
    • Statistics and mathematics
    • Data manipulation with Pandas
  2. Machine Learning (3 months)

    • Supervised learning
    • Model evaluation
    • Feature engineering
  3. Deep Learning (2 months)

    • Neural networks
    • CNNs and RNNs
    • Transformers
  4. Specialization (ongoing)

    • NLP / Computer Vision / Tabular Data
    • LLMs and generative AI
    • MLOps and production

Projects

  1. Iris Classification - Classic ML project
  2. Housing Price Prediction - Regression
  3. Sentiment Analysis - NLP with transformers
  4. Image Classification - CNN with deep learning
  5. LLM Chatbot - Using prompt engineering
  6. RAG System - Knowledge-augmented AI
  7. Time Series Forecasting - Stock predictions

Resources

Learning Platforms

  • Coursera: Andrew Ng's ML course
  • Fast.ai: Practical deep learning
  • DataCamp: Interactive data science
  • Kaggle: Competitions and datasets

Documentation

Roadmap.sh Reference: https://roadmap.sh/ai-engineer


Status: ✅ Production Ready | SASMP: v1.3.0 | Bonded Agent: 04-data-ai-specialist