

Category: AI & Data Science Tools
Type: AI/ML Tool or Library
Generated on: 2025-08-26 11:10:16
For: Data Science, Machine Learning & Technical Interviews


Git for Version Control in ML Projects: A Data Scientist’s Cheatsheet


This cheatsheet provides a comprehensive guide to using Git for version control in machine learning projects. It covers installation, core concepts, practical examples, and advanced techniques tailored for data scientists and ML engineers.

1. Tool Overview: Git

  • What it is: Git is a distributed version control system that tracks changes to files and directories over time. It allows you to revert to previous versions, collaborate with others, and manage multiple branches of development.
  • Main Use Cases in AI/ML:
    • Code Management: Tracking changes to Python scripts, notebooks (Jupyter, Colab), and configuration files.
    • Experiment Tracking: Recording different versions of models, datasets, and training parameters.
    • Collaboration: Enabling multiple data scientists and engineers to work on the same project without conflicts.
    • Reproducibility: Ensuring that experiments can be replicated by others or at a later time.
    • Deployment: Managing different versions of models and deployment scripts.
    • Data Versioning (with Git LFS or DVC): While Git isn’t ideal for large datasets directly, it can be combined with other tools to manage data versions effectively.

2. Installation & Setup

  • Installation:
    • Linux (Debian/Ubuntu): sudo apt-get install git
    • macOS: brew install git (or install the Xcode Command Line Tools)
    • Windows: download the installer from git-scm.com, or run winget install Git.Git
    • Verify the installation with git --version
  • Configuration:
    Terminal window
    git config --global user.name "Your Name"
    git config --global user.email "your.email@example.com"
    git config --global core.editor "nano" # or vim, emacs, etc.
    git config --global --list # Verify your configuration

3. Core Features & API

  • Initialization:
    • git init: Creates a new Git repository in the current directory.
  • Staging & Committing:
    • git status: Shows the status of the working directory and staging area.
    • git add <file>: Adds a file to the staging area. git add . adds all modified and new files.
    • git commit -m "Commit message": Commits the staged changes with a descriptive message.
    • git commit -am "Commit message": Adds all tracked changes and commits them (shortcut). Use with caution to avoid accidentally committing unintended changes.
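The status/add/commit cycle above can be walked through end to end. A minimal sketch in a throwaway repository (the file name and identity settings are illustrative):

```shell
# Throwaway repository so nothing outside this demo is touched
repo=$(mktemp -d) && cd "$repo"
git init -q -b main
git config user.name "Demo" && git config user.email "demo@example.com"

echo "print('v1')" > train.py
git status --short            # "?? train.py" — untracked
git add train.py
git status --short            # "A  train.py" — staged
git commit -q -m "Add training script"
git status --short            # no output — working tree is clean
```

Running git status between each step is a cheap habit that makes the staging area visible rather than mysterious.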
  • Branching & Merging:
    • git branch: Lists all local branches.
    • git branch <branch_name>: Creates a new branch.
    • git checkout <branch_name>: Switches to an existing branch (git switch <branch_name> is the modern equivalent).
    • git checkout -b <branch_name>: Creates a new branch and switches to it.
    • git merge <branch_name>: Merges changes from another branch into the current branch.
    • git branch -d <branch_name>: Deletes a branch (if it’s been merged). git branch -D <branch_name> forces deletion.
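The branching commands above compose into a typical experiment workflow: branch, commit, merge back, delete. A self-contained sketch (the branch name and file contents are illustrative):

```shell
# Throwaway repository so nothing outside this demo is touched
repo=$(mktemp -d) && cd "$repo"
git init -q -b main
git config user.name "Demo" && git config user.email "demo@example.com"

echo "print('baseline')" > train.py
git add train.py
git commit -q -m "Initial commit"

# Create a branch for an experiment and switch to it in one step
git checkout -q -b experiment/larger-test-split
echo "# test_size raised to 0.3" >> train.py
git commit -q -am "Experiment: larger test split"

# Merge the experiment back into main and delete the merged branch
git checkout -q main
git merge -q experiment/larger-test-split
git branch -d experiment/larger-test-split
git log --oneline   # both commits now appear on main
```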
  • Remote Repositories:
    • git remote add origin <repository_url>: Adds a remote repository (e.g., GitHub, GitLab, Bitbucket). origin is a common alias for the main remote.
    • git remote -v: Lists configured remote repositories.
    • git push origin <branch_name>: Pushes local changes to a remote branch.
    • git pull origin <branch_name>: Fetches and merges changes from a remote branch into the local branch.
    • git clone <repository_url>: Clones a remote repository to your local machine.
  • Undoing Changes:
    • git reset HEAD <file>: Unstages a file (removes it from the staging area); git restore --staged <file> is the modern equivalent.
    • git checkout -- <file>: Discards changes in the working directory, reverting the file to the last committed version; git restore <file> is the modern equivalent.
    • git revert <commit_hash>: Creates a new commit that undoes the changes from a specific commit. This is safer than reset as it preserves history.
    • git reset --hard <commit_hash>: Resets the repository to a specific commit, discarding all changes after that commit. Use with extreme caution as this can lead to data loss.
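The difference between revert (safe, history-preserving) and reset --hard (destructive) is easiest to see side by side. A throwaway sketch with an illustrative "bad" commit:

```shell
# Throwaway repository so nothing outside this demo is touched
repo=$(mktemp -d) && cd "$repo"
git init -q -b main
git config user.name "Demo" && git config user.email "demo@example.com"

echo "lr = 0.01" > config.txt
git add config.txt && git commit -q -m "Good config"
echo "lr = 999" > config.txt
git commit -q -am "Bad config"

# revert: adds a NEW commit that undoes the bad one; history is preserved
git revert --no-edit HEAD
cat config.txt        # back to the good value
git log --oneline     # three commits: good, bad, revert

# reset --hard: rewinds the branch to an older commit; later commits vanish
git reset -q --hard HEAD~2
git log --oneline     # only "Good config" remains
```

On a branch anyone else has pulled, prefer revert; reset --hard silently discards commits your collaborators may depend on.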
  • Viewing History:
    • git log: Shows the commit history.
    • git log --oneline: Shows a concise one-line commit history.
    • git log --graph: Shows a graphical representation of the branch history.
    • git diff: Shows the differences between the working directory and the staging area.
    • git diff --staged: Shows the differences between the staging area and the last commit.
    • git show <commit_hash>: Shows the details of a specific commit.
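The two flavors of git diff above inspect different boundaries: working directory vs. staging area, and staging area vs. last commit. A short sketch (file contents are illustrative):

```shell
# Throwaway repository so nothing outside this demo is touched
repo=$(mktemp -d) && cd "$repo"
git init -q -b main
git config user.name "Demo" && git config user.email "demo@example.com"
echo "epochs = 10" > config.txt
git add config.txt && git commit -q -m "Base config"

echo "epochs = 20" > config.txt
git diff            # working tree vs. staging area: shows the change
git add config.txt
git diff            # now empty: the change has moved into the staging area
git diff --staged   # staging area vs. last commit: shows the change again
```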
  • Ignoring Files:
    • Create a .gitignore file in the root of your repository. List files and patterns to exclude from version control. Common entries:
      *.pyc # Python bytecode files
      __pycache__/
      *.csv # Large data files (consider Git LFS or DVC)
      *.h5 # Model files
      /data/ # Entire data directory
      .env # Environment variables
      secrets.txt # Sensitive information

4. Practical Examples

  • Scenario: Training a Model and Tracking Experiments

    train.py
    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LogisticRegression
    import joblib # For saving models
    # Load data
    data = pd.read_csv("data/iris.csv") # Assume iris.csv exists in ./data/
    X = data.drop("species", axis=1)
    y = data["species"]
    # Split data
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    # Train model
     model = LogisticRegression(solver='liblinear') # liblinear handles multiclass via one-vs-rest; the multi_class argument is deprecated in recent scikit-learn
    model.fit(X_train, y_train)
    # Evaluate model (optional)
    accuracy = model.score(X_test, y_test)
    print(f"Accuracy: {accuracy}")
    # Save model
     joblib.dump(model, "models/iris_model.pkl") # Save to models directory
     # Create a file with hyperparameters
     with open("config.txt", "w") as f:
         f.write("model_type: LogisticRegression\n")
         f.write("test_size: 0.2\n")
         f.write("random_state: 42\n")
    # Expected output (will vary based on the actual data):
    # Accuracy: 1.0
    Terminal window
    # Initialize Git repository
    git init
    # Create directories
    mkdir data
    mkdir models
    # Add data (replace with actual data retrieval if needed)
    echo "sepal_length,sepal_width,petal_length,petal_width,species" > data/iris.csv
    echo "5.1,3.5,1.4,0.2,setosa" >> data/iris.csv
    echo "4.9,3.0,1.4,0.2,setosa" >> data/iris.csv
    echo "7.0,3.2,4.7,1.4,versicolor" >> data/iris.csv
    echo "6.4,3.2,4.5,1.5,versicolor" >> data/iris.csv
    echo "6.3,3.3,6.0,2.5,virginica" >> data/iris.csv
    echo "5.8,2.7,5.1,1.9,virginica" >> data/iris.csv
     # Create .gitignore (ignore model binaries and bytecode; the tiny CSV is committed directly here)
     echo "*.pkl" > .gitignore
     echo "__pycache__/" >> .gitignore
     # Stage and commit (models/ holds only ignored .pkl files, so it is not added)
     git add train.py data/iris.csv config.txt .gitignore
     git commit -m "Initial commit: Training script, data, config, and .gitignore"
    # Experiment 1: Change the test size to 0.3 in train.py
    # ... (edit train.py)
    git add train.py
    git commit -m "Experiment 1: Changed test size to 0.3"
    # Experiment 2: Change the solver to 'newton-cg'
    # ... (edit train.py)
    git add train.py
    git commit -m "Experiment 2: Changed solver to 'newton-cg'"
     # Push to remote repository
     git remote add origin <your_repo_url> # Replace with your actual repository URL
     git push -u origin main # Or "git push -u origin master" if your default branch is named master
    # To track the data directory, you'd need Git LFS or DVC (see Advanced Usage)
  • Scenario: Collaborating on a Notebook

    1. Clone the repository: git clone <repository_url>
    2. Create a new branch for your changes: git checkout -b feature/data-cleaning
    3. Make changes to the notebook (e.g., data_cleaning.ipynb).
    4. Commit your changes: git add data_cleaning.ipynb; git commit -m "Cleaned missing values"
    5. Push your branch to the remote repository: git push origin feature/data-cleaning
    6. Create a pull request on GitHub/GitLab/Bitbucket to merge your changes into the main branch.

5. Advanced Usage

  • Git LFS (Large File Storage):

    • Designed for versioning large binary files (e.g., datasets, models).
    • Replaces large files with text pointers in Git, storing the actual files on a separate server.
    • Installation: brew install git-lfs (macOS) or sudo apt-get install git-lfs (Linux)
    • Usage:
      Terminal window
      git lfs install
      git lfs track "*.h5" # Track HDF5 model files
      git add .gitattributes # Stage the .gitattributes file created by "git lfs track"
      git add models/my_model.h5
      git commit -m "Added model using Git LFS"
      git push origin main
  • DVC (Data Version Control):

    • A more comprehensive solution for data and model versioning.
    • Tracks data dependencies, pipelines, and model metrics.
    • Integrates with cloud storage (AWS S3, Google Cloud Storage, Azure Blob Storage) and other version control systems.
    • Installation: pip install dvc
    • Usage:
      Terminal window
      dvc init
      dvc add data/iris.csv # Track the data file
      git add data/.gitignore data/iris.csv.dvc
      git commit -m "Add iris dataset using DVC"
      dvc stage add -n train_model -d data/iris.csv -o models/iris_model.pkl python train.py # Define the training stage (replaces the deprecated dvc run)
      dvc repro # Execute the pipeline; results are recorded in dvc.yaml and dvc.lock
      git add dvc.yaml dvc.lock
      git commit -m "Tracked training pipeline"
      dvc push # Push the data and models to remote storage
  • Branching Strategies:

    • Gitflow: A well-defined branching model with main, develop, feature, release, and hotfix branches.
    • GitHub Flow: Simpler branching model with main and feature branches. Suitable for smaller teams and faster development cycles.
  • Rebasing:

    • git rebase <branch_name>: Replays your current branch’s commits on top of another branch, producing a linear history. Use with caution on shared branches, because it rewrites commit history.
  • Cherry-picking:

    • git cherry-pick <commit_hash>: Applies a specific commit from one branch to another.
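Cherry-picking is handy when one useful commit is buried in an experiment branch you are not ready to merge. A self-contained sketch (branch and file names are illustrative):

```shell
# Throwaway repository with a fix buried inside an experiment branch
repo=$(mktemp -d) && cd "$repo"
git init -q -b main
git config user.name "Demo" && git config user.email "demo@example.com"
echo "base" > a.txt && git add a.txt && git commit -q -m "Base"

git checkout -q -b experiment
echo "fix" > fix.txt && git add fix.txt && git commit -q -m "Useful bug fix"
echo "wip" > big.txt && git add big.txt && git commit -q -m "Unfinished rework"

# Bring only the bug-fix commit onto main, leaving the rework behind
git checkout -q main
git cherry-pick "$(git rev-parse experiment~1)"
ls   # a.txt fix.txt — big.txt stays on the experiment branch
```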
  • Stashing:

    • git stash: Temporarily saves changes in your working directory without committing them.
    • git stash pop: Applies the most recent stash.
    • git stash list: Lists all stashes.
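A common stash scenario: you are mid-edit when you need a clean working tree (say, to switch branches). The stash commands above fit together like this (file contents are illustrative):

```shell
# Throwaway repository so nothing outside this demo is touched
repo=$(mktemp -d) && cd "$repo"
git init -q -b main
git config user.name "Demo" && git config user.email "demo@example.com"
echo "base" > notes.txt
git add notes.txt && git commit -q -m "Base"

echo "half-finished idea" >> notes.txt
git stash               # parks the uncommitted change; working tree is clean
grep -c . notes.txt     # 1 — file is back at its committed state
git stash list          # stash@{0}: WIP on main ...
git stash pop           # restores the parked change and drops the stash
grep -c . notes.txt     # 2 — the half-finished edit is back
```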
  • Submodules and Subtrees:

    • For including other Git repositories as part of your project.

6. Tips & Tricks

  • Descriptive Commit Messages: Use clear and concise commit messages that explain the purpose of the changes. Follow the “Imperative Mood” convention (e.g., “Fix bug” instead of “Fixed bug”).
  • Frequent Commits: Commit small, logical changes frequently.
  • Use Branches: Create branches for new features, bug fixes, and experiments.
  • Code Reviews: Use pull requests and code reviews to ensure code quality and collaboration.
  • Automated Testing: Integrate Git with CI/CD pipelines to automatically run tests and deploy code.
  • Visual Git Clients: Consider using a GUI Git client like GitKraken, SourceTree, or GitHub Desktop for a more visual representation of the repository history.
  • Aliases: Define Git aliases for frequently used commands:
    Terminal window
    git config --global alias.st status
    git config --global alias.co checkout
    git config --global alias.br branch
    git config --global alias.ci commit
    git config --global alias.lg "log --graph --pretty=format:'%Cred%h%Creset -%C(yellow)%d%Creset %s %Cgreen(%cr) %C(bold blue)<%an>%Creset' --abbrev-commit --date=relative"
    Now you can use git st, git co, git br, git ci, and git lg as shortcuts.
  • Resolve Merge Conflicts Carefully: When merging branches, conflicts may arise. Carefully review and resolve these conflicts to ensure that the code is correct. Use a merge tool like meld or the built-in merge tools in your IDE.
  • Don’t Commit Large Data Directly: Use Git LFS or DVC for data versioning. Never commit sensitive data (API keys, passwords) to the repository.
  • Use a .gitattributes file: This file allows you to define attributes for specific file types or paths within your repository. It’s often used in conjunction with Git LFS or to handle line endings consistently across different operating systems.
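As a concrete example, a .gitattributes combining line-ending normalization with Git LFS routing might look like this (the patterns are illustrative; adjust them to your project's file types):

```
# Normalize line endings across operating systems
* text=auto
# Keep notebooks as text so textual diffs still work
*.ipynb text
# Route large binary artifacts through Git LFS
*.h5 filter=lfs diff=lfs merge=lfs -text
*.pkl filter=lfs diff=lfs merge=lfs -text
```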

7. Integration

  • Pandas: No direct integration, but Pandas DataFrames can be saved to CSV files, which can be tracked with Git (or better, DVC for large CSVs).
  • NumPy: NumPy arrays can be saved to files (e.g., .npy format), which can be tracked with Git LFS or DVC.
  • Scikit-learn: Scikit-learn models can be saved using joblib or pickle and tracked with Git LFS or DVC.
  • TensorFlow/PyTorch: TensorFlow models are typically saved in the SavedModel or .keras format (or legacy HDF5 .h5), and PyTorch models as .pt/.pth checkpoints via torch.save; all of these binaries should be tracked with Git LFS or DVC.
  • MLflow: MLflow automatically tracks experiments, parameters, metrics, and models, and can integrate with Git to track the code associated with each run.
  • CI/CD (Continuous Integration/Continuous Delivery): Git is a core component of CI/CD pipelines. Tools like Jenkins, CircleCI, GitHub Actions, and GitLab CI can automatically build, test, and deploy code whenever changes are pushed to a Git repository.

8. Further Resources