

Category: AI & Data Science Tools
Type: AI/ML Tool or Library
Generated on: 2025-08-26 11:10:16
For: Data Science, Machine Learning & Technical Interviews


Git for Version Control in ML Projects: A Data Scientist’s Cheatsheet


This cheatsheet provides a comprehensive guide to using Git for version control in machine learning projects. It covers installation, core concepts, practical examples, and advanced techniques tailored for data scientists and ML engineers.

1. Tool Overview: Git

  • What it is: Git is a distributed version control system that tracks changes to files and directories over time. It allows you to revert to previous versions, collaborate with others, and manage multiple branches of development.
  • Main Use Cases in AI/ML:
    • Code Management: Tracking changes to Python scripts, notebooks (Jupyter, Colab), and configuration files.
    • Experiment Tracking: Recording different versions of models, datasets, and training parameters.
    • Collaboration: Enabling multiple data scientists and engineers to work on the same project without conflicts.
    • Reproducibility: Ensuring that experiments can be replicated by others or at a later time.
    • Deployment: Managing different versions of models and deployment scripts.
    • Data Versioning (with Git LFS or DVC): While Git isn’t ideal for large datasets directly, it can be combined with other tools to manage data versions effectively.

2. Installation & Setup

  • Installation:
    • Linux (Debian/Ubuntu): sudo apt-get install git
    • macOS: brew install git (or install the Xcode Command Line Tools)
    • Windows: download the installer from git-scm.com, or run winget install Git.Git
    • Verify the installation with git --version
  • Configuration:
    Terminal window
    git config --global user.name "Your Name"
    git config --global user.email "your.email@example.com"
    git config --global core.editor "nano" # or vim, emacs, etc.
    git config --global --list # Verify your configuration

3. Core Features & API

  • Initialization:
    • git init: Creates a new Git repository in the current directory.
  • Staging & Committing:
    • git status: Shows the status of the working directory and staging area.
    • git add <file>: Adds a file to the staging area. git add . adds all modified and new files.
    • git commit -m "Commit message": Commits the staged changes with a descriptive message.
    • git commit -am "Commit message": Adds all tracked changes and commits them (shortcut). Use with caution to avoid accidentally committing unintended changes.
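The status/add/commit cycle above can be walked through end to end. A minimal sketch in a throwaway repository (the file name and identity settings are illustrative):

```shell
# Throwaway repository so nothing outside this demo is touched
repo=$(mktemp -d) && cd "$repo"
git init -q -b main
git config user.name "Demo" && git config user.email "demo@example.com"

echo "print('v1')" > train.py
git status --short            # "?? train.py" — untracked
git add train.py
git status --short            # "A  train.py" — staged
git commit -q -m "Add training script"
git status --short            # no output — working tree is clean
```

Running git status between each step is a cheap habit that makes the staging area visible rather than mysterious.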
  • Branching & Merging:
    • git branch: Lists all local branches.
    • git branch <branch_name>: Creates a new branch.
    • git checkout <branch_name>: Switches to an existing branch (git switch <branch_name> is the modern equivalent).
    • git checkout -b <branch_name>: Creates a new branch and switches to it.
    • git merge <branch_name>: Merges changes from another branch into the current branch.
    • git branch -d <branch_name>: Deletes a branch (if it’s been merged). git branch -D <branch_name> forces deletion.
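The branching commands above compose into a typical experiment workflow: branch, commit, merge back, delete. A self-contained sketch (the branch name and file contents are illustrative):

```shell
# Throwaway repository so nothing outside this demo is touched
repo=$(mktemp -d) && cd "$repo"
git init -q -b main
git config user.name "Demo" && git config user.email "demo@example.com"

echo "print('baseline')" > train.py
git add train.py
git commit -q -m "Initial commit"

# Create a branch for an experiment and switch to it in one step
git checkout -q -b experiment/larger-test-split
echo "# test_size raised to 0.3" >> train.py
git commit -q -am "Experiment: larger test split"

# Merge the experiment back into main and delete the merged branch
git checkout -q main
git merge -q experiment/larger-test-split
git branch -d experiment/larger-test-split
git log --oneline   # both commits now appear on main
```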
  • Remote Repositories:
    • git remote add origin <repository_url>: Adds a remote repository (e.g., GitHub, GitLab, Bitbucket). origin is a common alias for the main remote.
    • git remote -v: Lists configured remote repositories.
    • git push origin <branch_name>: Pushes local changes to a remote branch.
    • git pull origin <branch_name>: Fetches and merges changes from a remote branch into the local branch.
    • git clone <repository_url>: Clones a remote repository to your local machine.
  • Undoing Changes:
    • git reset HEAD <file>: Unstages a file (removes it from the staging area); git restore --staged <file> is the modern equivalent.
    • git checkout -- <file>: Discards changes in the working directory, reverting the file to the last committed version; git restore <file> is the modern equivalent.
    • git revert <commit_hash>: Creates a new commit that undoes the changes from a specific commit. This is safer than reset as it preserves history.
    • git reset --hard <commit_hash>: Resets the repository to a specific commit, discarding all changes after that commit. Use with extreme caution as this can lead to data loss.
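The difference between revert (safe, history-preserving) and reset --hard (destructive) is easiest to see side by side. A throwaway sketch with an illustrative "bad" commit:

```shell
# Throwaway repository so nothing outside this demo is touched
repo=$(mktemp -d) && cd "$repo"
git init -q -b main
git config user.name "Demo" && git config user.email "demo@example.com"

echo "lr = 0.01" > config.txt
git add config.txt && git commit -q -m "Good config"
echo "lr = 999" > config.txt
git commit -q -am "Bad config"

# revert: adds a NEW commit that undoes the bad one; history is preserved
git revert --no-edit HEAD
cat config.txt        # back to the good value
git log --oneline     # three commits: good, bad, revert

# reset --hard: rewinds the branch to an older commit; later commits vanish
git reset -q --hard HEAD~2
git log --oneline     # only "Good config" remains
```

On a branch anyone else has pulled, prefer revert; reset --hard silently discards commits your collaborators may depend on.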
  • Viewing History:
    • git log: Shows the commit history.
    • git log --oneline: Shows a concise one-line commit history.
    • git log --graph: Shows a graphical representation of the branch history.
    • git diff: Shows the differences between the working directory and the staging area.
    • git diff --staged: Shows the differences between the staging area and the last commit.
    • git show <commit_hash>: Shows the details of a specific commit.
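The two flavors of git diff above inspect different boundaries: working directory vs. staging area, and staging area vs. last commit. A short sketch (file contents are illustrative):

```shell
# Throwaway repository so nothing outside this demo is touched
repo=$(mktemp -d) && cd "$repo"
git init -q -b main
git config user.name "Demo" && git config user.email "demo@example.com"
echo "epochs = 10" > config.txt
git add config.txt && git commit -q -m "Base config"

echo "epochs = 20" > config.txt
git diff            # working tree vs. staging area: shows the change
git add config.txt
git diff            # now empty: the change has moved into the staging area
git diff --staged   # staging area vs. last commit: shows the change again
```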
  • Ignoring Files:
    • Create a .gitignore file in the root of your repository. List files and patterns to exclude from version control. Common entries:
      *.pyc # Python bytecode files
      __pycache__/
      *.csv # Large data files (consider Git LFS or DVC)
      *.h5 # Model files
      /data/ # Entire data directory
      .env # Environment variables
      secrets.txt # Sensitive information

4. Practical Examples

  • Scenario: Training a Model and Tracking Experiments

    train.py
    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LogisticRegression
    import joblib # For saving models
    # Load data
    data = pd.read_csv("data/iris.csv") # Assume iris.csv exists in ./data/
    X = data.drop("species", axis=1)
    y = data["species"]
    # Split data
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    # Train model
     model = LogisticRegression(solver='liblinear') # liblinear handles multiclass via one-vs-rest; the multi_class argument is deprecated in recent scikit-learn
    model.fit(X_train, y_train)
    # Evaluate model (optional)
    accuracy = model.score(X_test, y_test)
    print(f"Accuracy: {accuracy}")
    # Save model
     joblib.dump(model, "models/iris_model.pkl") # Save to models directory
     # Create a file with hyperparameters
     with open("config.txt", "w") as f:
         f.write("model_type: LogisticRegression\n")
         f.write("test_size: 0.2\n")
         f.write("random_state: 42\n")
    # Expected output (will vary based on the actual data):
    # Accuracy: 1.0
    Terminal window
    # Initialize Git repository
    git init
    # Create directories
    mkdir data
    mkdir models
    # Add data (replace with actual data retrieval if needed)
    echo "sepal_length,sepal_width,petal_length,petal_width,species" > data/iris.csv
    echo "5.1,3.5,1.4,0.2,setosa" >> data/iris.csv
    echo "4.9,3.0,1.4,0.2,setosa" >> data/iris.csv
    echo "7.0,3.2,4.7,1.4,versicolor" >> data/iris.csv
    echo "6.4,3.2,4.5,1.5,versicolor" >> data/iris.csv
    echo "6.3,3.3,6.0,2.5,virginica" >> data/iris.csv
    echo "5.8,2.7,5.1,1.9,virginica" >> data/iris.csv
     # Create .gitignore (ignore model binaries and bytecode; the tiny CSV is committed directly here)
     echo "*.pkl" > .gitignore
     echo "__pycache__/" >> .gitignore
     # Stage and commit (models/ holds only ignored .pkl files, so it is not added)
     git add train.py data/iris.csv config.txt .gitignore
     git commit -m "Initial commit: Training script, data, config, and .gitignore"
    # Experiment 1: Change the test size to 0.3 in train.py
    # ... (edit train.py)
    git add train.py
    git commit -m "Experiment 1: Changed test size to 0.3"
    # Experiment 2: Change the solver to 'newton-cg'
    # ... (edit train.py)
    git add train.py
    git commit -m "Experiment 2: Changed solver to 'newton-cg'"
     # Push to remote repository
     git remote add origin <your_repo_url> # Replace with your actual repository URL
     git push -u origin main # Or "git push -u origin master" if your default branch is named master
    # To track the data directory, you'd need Git LFS or DVC (see Advanced Usage)
  • Scenario: Collaborating on a Notebook

    1. Clone the repository: git clone <repository_url>
    2. Create a new branch for your changes: git checkout -b feature/data-cleaning
    3. Make changes to the notebook (e.g., data_cleaning.ipynb).
    4. Commit your changes: git add data_cleaning.ipynb; git commit -m "Cleaned missing values"
    5. Push your branch to the remote repository: git push origin feature/data-cleaning
    6. Create a pull request on GitHub/GitLab/Bitbucket to merge your changes into the main branch.

5. Advanced Usage

  • Git LFS (Large File Storage):

    • Designed for versioning large binary files (e.g., datasets, models).
    • Replaces large files with text pointers in Git, storing the actual files on a separate server.
    • Installation: brew install git-lfs (macOS) or sudo apt-get install git-lfs (Linux)
    • Usage:
      Terminal window
      git lfs install
      git lfs track "*.h5" # Track HDF5 model files
      git add .gitattributes # Stage the .gitattributes file created by "git lfs track"
      git add models/my_model.h5
      git commit -m "Added model using Git LFS"
      git push origin main
  • DVC (Data Version Control):

    • A more comprehensive solution for data and model versioning.
    • Tracks data dependencies, pipelines, and model metrics.
    • Integrates with cloud storage (AWS S3, Google Cloud Storage, Azure Blob Storage) and other version control systems.
    • Installation: pip install dvc
    • Usage:
      Terminal window
      dvc init
      dvc add data/iris.csv # Track the data file
      git add data/.gitignore data/iris.csv.dvc
      git commit -m "Add iris dataset using DVC"
      dvc stage add -n train_model -d data/iris.csv -o models/iris_model.pkl python train.py # Define the training stage (replaces the deprecated dvc run)
      dvc repro # Execute the pipeline; results are recorded in dvc.yaml and dvc.lock
      git add dvc.yaml dvc.lock
      git commit -m "Tracked training pipeline"
      dvc push # Push the data and models to remote storage
  • Branching Strategies:

    • Gitflow: A well-defined branching model with main, develop, feature, release, and hotfix branches.
    • GitHub Flow: Simpler branching model with main and feature branches. Suitable for smaller teams and faster development cycles.
  • Rebasing:

    • git rebase <branch_name>: Replays your current branch’s commits on top of another branch, producing a linear history. Use with caution on shared branches, because it rewrites commit history.
  • Cherry-picking:

    • git cherry-pick <commit_hash>: Applies a specific commit from one branch to another.
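Cherry-picking is handy when one useful commit is buried in an experiment branch you are not ready to merge. A self-contained sketch (branch and file names are illustrative):

```shell
# Throwaway repository with a fix buried inside an experiment branch
repo=$(mktemp -d) && cd "$repo"
git init -q -b main
git config user.name "Demo" && git config user.email "demo@example.com"
echo "base" > a.txt && git add a.txt && git commit -q -m "Base"

git checkout -q -b experiment
echo "fix" > fix.txt && git add fix.txt && git commit -q -m "Useful bug fix"
echo "wip" > big.txt && git add big.txt && git commit -q -m "Unfinished rework"

# Bring only the bug-fix commit onto main, leaving the rework behind
git checkout -q main
git cherry-pick "$(git rev-parse experiment~1)"
ls   # a.txt fix.txt — big.txt stays on the experiment branch
```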
  • Stashing:

    • git stash: Temporarily saves changes in your working directory without committing them.
    • git stash pop: Applies the most recent stash.
    • git stash list: Lists all stashes.
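A common stash scenario: you are mid-edit when you need a clean working tree (say, to switch branches). The stash commands above fit together like this (file contents are illustrative):

```shell
# Throwaway repository so nothing outside this demo is touched
repo=$(mktemp -d) && cd "$repo"
git init -q -b main
git config user.name "Demo" && git config user.email "demo@example.com"
echo "base" > notes.txt
git add notes.txt && git commit -q -m "Base"

echo "half-finished idea" >> notes.txt
git stash               # parks the uncommitted change; working tree is clean
grep -c . notes.txt     # 1 — file is back at its committed state
git stash list          # stash@{0}: WIP on main ...
git stash pop           # restores the parked change and drops the stash
grep -c . notes.txt     # 2 — the half-finished edit is back
```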
  • Submodules and Subtrees:

    • For including other Git repositories as part of your project.

6. Tips & Tricks

  • Descriptive Commit Messages: Use clear and concise commit messages that explain the purpose of the changes. Follow the “Imperative Mood” convention (e.g., “Fix bug” instead of “Fixed bug”).
  • Frequent Commits: Commit small, logical changes frequently.
  • Use Branches: Create branches for new features, bug fixes, and experiments.
  • Code Reviews: Use pull requests and code reviews to ensure code quality and collaboration.
  • Automated Testing: Integrate Git with CI/CD pipelines to automatically run tests and deploy code.
  • Visual Git Clients: Consider using a GUI Git client like GitKraken, SourceTree, or GitHub Desktop for a more visual representation of the repository history.
  • Aliases: Define Git aliases for frequently used commands:
    Terminal window
    git config --global alias.st status
    git config --global alias.co checkout
    git config --global alias.br branch
    git config --global alias.ci commit
    git config --global alias.lg "log --graph --pretty=format:'%Cred%h%Creset -%C(yellow)%d%Creset %s %Cgreen(%cr) %C(bold blue)<%an>%Creset' --abbrev-commit --date=relative"
    Now you can use git st, git co, git br, git ci, and git lg as shortcuts.
  • Resolve Merge Conflicts Carefully: When merging branches, conflicts may arise. Carefully review and resolve these conflicts to ensure that the code is correct. Use a merge tool like meld or the built-in merge tools in your IDE.
  • Don’t Commit Large Data Directly: Use Git LFS or DVC for data versioning. Never commit sensitive data (API keys, passwords) to the repository.
  • Use a .gitattributes file: This file allows you to define attributes for specific file types or paths within your repository. It’s often used in conjunction with Git LFS or to handle line endings consistently across different operating systems.
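As a concrete example, a .gitattributes combining line-ending normalization with Git LFS routing might look like this (the patterns are illustrative; adjust them to your project's file types):

```
# Normalize line endings across operating systems
* text=auto
# Keep notebooks as text so textual diffs still work
*.ipynb text
# Route large binary artifacts through Git LFS
*.h5 filter=lfs diff=lfs merge=lfs -text
*.pkl filter=lfs diff=lfs merge=lfs -text
```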

7. Integration

  • Pandas: No direct integration, but Pandas DataFrames can be saved to CSV files, which can be tracked with Git (or better, DVC for large CSVs).
  • NumPy: NumPy arrays can be saved to files (e.g., .npy format), which can be tracked with Git LFS or DVC.
  • Scikit-learn: Scikit-learn models can be saved using joblib or pickle and tracked with Git LFS or DVC.
  • TensorFlow/PyTorch: TensorFlow models are typically saved in the SavedModel or .keras format (or legacy HDF5 .h5), and PyTorch models as .pt/.pth checkpoints via torch.save; all of these binaries should be tracked with Git LFS or DVC.
  • MLflow: MLflow automatically tracks experiments, parameters, metrics, and models, and can integrate with Git to track the code associated with each run.
  • CI/CD (Continuous Integration/Continuous Delivery): Git is a core component of CI/CD pipelines. Tools like Jenkins, CircleCI, GitHub Actions, and GitLab CI can automatically build, test, and deploy code whenever changes are pushed to a Git repository.

8. Further Resources