How to get started with GIT and work with GIT Remote Repo

From https://www.ntu.edu.sg/home/ehchua/programming/howto/Git_HowTo.html

1.  Introduction

GIT is a Version Control System (VCS) (aka Revision Control System (RCS), Source Code Manager (SCM)). A VCS serves as a Repository (or repo) of program codes, including all the historical revisions. It records changes to files at so-called commits in a log so that you can recall any file at any commit point.

Why VCS?

  1. The Repository serves as the backup (in case of code changes or disk crash).
  2. It is a living archive of all historical revisions. It lets you revert back to a specific version, if the need arises.
  3. It facilitates collaboration between team members, and serves as a project management tool.
  4. more…

Git was initially designed and developed by Linus Torvalds, in 2005, to support the development of the Linux kernel.

GIT is a Distributed Version Control System (DVCS). Other popular VCSes include:

  1. The standalone and legacy Unix’s RCS (Revision Control System).
  2. Centralized Client-Server Version Control System (CVCS): CVS (Concurrent Version System), SVN (Subversion) and Perforce.
  3. Distributed VCS (DVCS): GIT, Mercurial, Bazaar, Darcs.

The mother site for Git is http://git-scm.com.

2.  Setting Up Git

You need to set up Git on your local machine, as follows:

  1. Download & Install:
    • For Windows and Mac, download the installer from http://git-scm.com/downloads and run the downloaded installer.
    • For Ubuntu, issue command “sudo apt-get install git“.

    For Windows, use the “Git Bash” command shell bundled with Git Installer to issue commands. For Mac/Ubuntu, use the “Terminal”.

  2. Customize Git:
    Issue “git config” command (for Windows, run “Git Bash” from the Git installed directory. For Ubuntu/Mac, launch a “Terminal”):

    // Set up your username and email (to be used in labeling your commits)
    $ git config --global user.name "your-name"
    $ git config --global user.email "your-email@youremail.com"

    The settings are kept in “<GIT_HOME>/etc/gitconfig” (of the GIT installed directory) and “<USER_HOME>/.gitconfig” (of the user’s home directory).
    You can issue “git config --list” to list the settings:

    $ git config --list
    user.email=xxxxxx@xxxxxx.com
    user.name=xxxxxx

3.  Git Basics

Git Commands

Git provides a set of simple, distinct, standalone commands developed according to the “Unix toolkit” philosophy – build small, interoperable tools.

To issue a command, start a “Terminal” (for Ubuntu/Mac) or “Git Bash” (for Windows):

$ git <command> <arguments>

The commonly-used commands are:

  1. init, clone, config: for starting a Git-managed project.
  2. add, mv, rm: for staging file changes.
  3. commit, rebase, reset, tag: for committing file changes.
  4. status, log, diff, grep, show: for showing statuses, logs and differences.
  5. checkout, branch, merge, push, fetch, pull: for branching, merging and working with remote repos.

Help and Manual

The best way to get help these days is certainly googling.

To get help on Git commands:

$ git help <command>
// or
$ git <command> --help

The GIT manual is bundled with the software (under the “doc” directory), and also available online @ http://git-scm.com/docs.

3.1  Getting Started with Local Repo

There are 2 ways to start a Git-managed project:

  1. Starting your own project;
  2. Cloning an existing project from a GIT host.

We shall begin with “Starting your own project” and cover “Cloning” later @ “Clone a Project from a Remote Repo“.

Setup the Working Directory for a New Project

Let’s start a programming project under the working directory called “hello-git“, with one source file “Hello.java” (or “Hello.cpp“, or “Hello.c“) as follows:

// Hello.java
public class Hello {
   public static void main(String[] args) {
      System.out.println("Hello, world from GIT!");
   }
}

Compile the “Hello.java” into “Hello.class” (or “Hello.cpp” or “Hello.c” into “Hello.exe“).

It is also highly recommended to provide a “README.md” file (a text file in a so-called “Markdown” syntax such as “GitHub Flavored Markdown“) to describe your project:

// README.md
This is the README file for the Hello-world project.

Now, we have 3 files in the working tree: “Hello.java“, “Hello.class” and “README.md“. We do not wish to track the “.class” as they can be reproduced from “.java“.

Initialize a new Git Repo (git init)

To manage a project under Git, run “git init” at the project root directory (i.e., “hello-git“) (via “Git Bash” for Windows, or “Terminal” for Ubuntu/Mac):

// Change directory to the project directory
$ cd /path-to/hello-git
 
// Initialize Git repo for this project
$ git init
Initialized empty Git repository in /path-to/hello-git/.git/

$ ls -al
drwxr-xr-x    1 xxxxx    xxxxx     4096 Sep 14 14:58 .git
-rw-r--r--    1 xxxxx    xxxxx      426 Sep 14 14:40 Hello.class
-rw-r--r--    1 xxxxx    xxxxx      142 Sep 14 14:32 Hello.java
-rw-r--r--    1 xxxxx    xxxxx       66 Sep 14 14:33 README.md

A hidden sub-directory called “.git” will be created under your project root directory (as shown in the above “ls -a” listing), which contains ALL Git related data.

Take note that EACH Git repo is associated with a project directory (and its sub-directories). The Git repo is completely contained within the project directory. Hence, it is safe to copy, move or rename the project directory. If your project uses more than one directory, you may create one Git repo for EACH directory, or use symlinks to link up the directories, or … (?!).

Git Storage Model

The local repo after “git init” is empty. You need to explicitly deposit files into the repo.

Before we proceed, it is important to stress that Git manages changes to files between so-called commits. In other words, it is a version control system that allows you to keep track of the file changes at the commits.

Staging File Changes for Tracking (git add <file>…)

Issue a “git status” command to show the status of the files:

$ git status
On branch master
Initial commit
 
Untracked files:
  (use "git add <file>..." to include in what will be committed)
      Hello.class
      Hello.java
      README.md
nothing added to commit but untracked files present (use "git add" to track)

By default, we start on a branch called “master“. We will discuss “branch” later.

In Git, the files in the working tree are either untracked or tracked. Currently, all 3 files are untracked. To stage a new file for tracking, use “git add <file>...” command.

// Add README.md file
$ git add README.md
 
$ git status
On branch master
Initial commit
 
Changes to be committed:
  (use "git rm --cached <file>..." to unstage)
        new file:   README.md
 
Untracked files:
  (use "git add <file>..." to include in what will be committed)
        Hello.class
        Hello.java
 
// You can use wildcard * in the filename
// Add all Java source files into Git repo
$ git add *.java
 
// You can also include multiple files in the "git add"
// E.g.,
// git add Hello.java README.md
 
$ git status
On branch master
Initial commit
 
Changes to be committed:
  (use "git rm --cached <file>..." to unstage)
        new file:   Hello.java
        new file:   README.md

Untracked files:
  (use "git add <file>..." to include in what will be committed)
        Hello.class

The command “git add <file>...” takes one or more filenames or pathnames, possibly with wildcard patterns. You can also use “git add .” to add all the files in the current directory (and all sub-directories). But this will include “Hello.class“, which we do not wish to be tracked.

When a new file is added, it is staged (or indexed, or cached) in the staging area (as shown in the GIT storage model), but NOT yet committed.

Git uses two stages to commit file changes:

  1. “git add <file>” to stage file changes into the staging area, and
  2. “git commit” to commit ALL the file changes in the staging area to the local repo.

The staging area allows you to group related file changes and commit them together.

Committing File Changes (git commit)

The “git commit” command commits ALL the file changes in the staging area. Use a -m option to provide a message for the commit.

$ git commit -m "First commit"   // -m to specify the commit message
[master (root-commit) 858f3e7] first commit
 2 files changed, 8 insertions(+)
 create mode 100644 Hello.java
 create mode 100644 README.md
 
// Check the status
$ git status
On branch master
Untracked files:
  (use "git add <file>..." to include in what will be committed)
      Hello.class
nothing added to commit but untracked files present (use "git add" to track)

Viewing the Commit Data (git log)

Git records several pieces of metadata for every commit, which includes a log message, timestamp, the author’s username and email (set during customization).

You can use “git log” to list the commit data; or “git log --stat” to view the file statistics:

$ git log
commit 858f3e71b95271ea320d45b69f44dc55cf1ff794
Author: username <email>
Date:   Thu Nov 29 13:31:32 2012 +0800
    First commit
 
$ git log --stat
commit 858f3e71b95271ea320d45b69f44dc55cf1ff794
Author: username <email>
Date:   Thu Nov 29 13:31:32 2012 +0800
    First commit
 Hello.java | 6 ++++++
 README.md  | 2 ++
 2 files changed, 8 insertions(+)

Each commit is identified by a 40-hex-digit SHA-1 hash code, but we typically use the first 7 hex digits to reference a commit.
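
For example, a quick illustrative check (using the abbreviated hash of the first commit above; your hash will differ):

$ git show 858f3e7
   // Shows the metadata and patch of that single commit,
   //   similar to "git log -p" limited to one commit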

To view the commit details, use “git log -p“, which lists all the patches (or changes).

$ git log -p
commit 858f3e71b95271ea320d45b69f44dc55cf1ff794
Author: username <email>
Date:   Thu Nov 29 13:31:32 2012 +0800
    First commit
diff --git a/Hello.java b/Hello.java
new file mode 100644
index 0000000..dc8d4cf
--- /dev/null
+++ b/Hello.java
@@ -0,0 +1,6 @@
+// Hello.java
+public class Hello {
+   public static void main(String[] args) {
+      System.out.println("Hello, world from GIT!");
+   }
+}
diff --git a/README.md b/README.md
new file mode 100644
index 0000000..9565113
--- /dev/null
+++ b/README.md
@@ -0,0 +1,2 @@
+// README.md
+This is the README file for the Hello-world project.

Below are more options for using “git log“:

$ git log --oneline
   // Display EACH commit in one line.
 
$ git log --author="<author-name-pattern>"
   // Display commits by author
   
$ git log <file-pattern>
   // Display commits for particular file(s)

// EXAMPLES
$ git log --author="Tan Ah Teck" -p Hello.java
   // Display commits for file "Hello.java" by a particular author

File Status (git status)

A file could be untracked or tracked.

As mentioned, Git tracks file changes at commits. In Git, changes for a tracked file could be:

  1. unstaged (in Working Tree) – called unstaged changes,
  2. staged (in Staging Area or Index or Cache) – called staged changes, or
  3. committed (in local repo object database).

The files in the “working tree” or “staging area” could have a status of unmodified, added, modified, deleted, renamed or copied, as reported by “git status“.

The “git status” output is divided into 3 sections: “Changes not staged for commit” for the unstaged changes in the “working tree”, “Changes to be committed” for the staged changes in the “staging area”, and “Untracked files”. In each section, it lists all the files that have been changed, i.e., files having a status other than unmodified.

When a new file is created in the working tree, it is marked as new in working tree and shown as an untracked file. When the file change is staged, it is marked as new (added) in the staging area, and unmodified in working tree. When the file change is committed, it is marked as unmodified in both the working tree and staging area.

When a committed file is modified, it is marked as modified in the working tree and unmodified in the staging area. When the file change is staged, it is marked as modified in the staging area and unmodified in the working tree. When the file change is committed, it is marked as unmodified in both the working tree and staging area.

For example, make some changes to the file “Hello.java“ and check the status again:

// Hello.java
public class Hello {
   public static void main(String[] args) {
      System.out.println("Hello, world from GIT!");
      System.out.println("Changes after First commit!");
   }
}

$ git status
On branch master
Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git checkout -- <file>..." to discard changes in working directory)
      modified:   Hello.java
 
Untracked files:
  (use "git add <file>..." to include in what will be committed)
      Hello.class
no changes added to commit (use "git add" and/or "git commit -a")

The “Hello.java” is marked as modified in the working tree (under “Changes not staged for commit”), but unmodified in the staging area (not shown in “Changes to be committed”).

You can inspect all the unstaged changes using “git diff” command (or “git diff <file>” for the specified file). It shows the file changes in the working tree since the last commit:

$ git diff
diff --git a/Hello.java b/Hello.java
index dc8d4cf..f4a4393 100644
--- a/Hello.java
+++ b/Hello.java
@@ -2,5 +2,6 @@
 public class Hello {
    public static void main(String[] args) {
       System.out.println("Hello, world from GIT!");
+      System.out.println("Changes after First commit!");
    }
 }

The older version (as of the last commit) is marked as --- and the new one as +++. Each chunk of changes is delimited by “@@ -<old-line-number>,<number-of-lines> +<new-line-number>,<number-of-lines> @@“. Added lines are marked with + and deleted lines with -. In the above output, the older version (as of the last commit), spanning 5 lines from line 2, is compared with the modified version, spanning 6 lines from line 2. One line (marked +) is added.

Stage the changes of “Hello.java” by issuing the “git add <file>...“:

$ git add Hello.java
 
$ git status
On branch master
Changes to be committed:
  (use "git reset HEAD <file>..." to unstage)
      modified:   Hello.java
 
Untracked files:
  (use "git add <file>..." to include in what will be committed)
      Hello.class

Now, it is marked as modified in the staging area (“Changes to be committed”), but unmodified in the working tree (not shown in “Changes not staged for commit”).

Now, the changes have been staged. Issuing a “git diff” to show the unstaged changes results in empty output.

You can inspect the staged change (in the staging area) via “git diff --staged” command:

// List all "unstaged" changes for all files (in the working tree)
$ git diff
   // empty output - no unstaged change
 
// List all "staged" changes for all files (in the staging area)
$ git diff --staged
diff --git a/Hello.java b/Hello.java
index dc8d4cf..f4a4393 100644
--- a/Hello.java
+++ b/Hello.java
@@ -2,5 +2,6 @@
 public class Hello {
    public static void main(String[] args) {
       System.out.println("Hello, world from GIT!");
+      System.out.println("Changes after First commit!");
    }
 }
   // The "unstaged" changes are now "staged".

Commit ALL staged file changes via “git commit“:

$ git commit -m "Second commit"
[master 96efc96] Second commit
 1 file changed, 1 insertion(+)
 
$ git status
On branch master
Untracked files:
  (use "git add <file>..." to include in what will be committed)
      Hello.class
nothing added to commit but untracked files present (use "git add" to track)

Once the file changes are committed, the file is marked as unmodified in the staging area (not shown in “Changes to be committed”).

Both “git diff” and “git diff --staged” return empty output, signalling that there are no “unstaged” or “staged” changes.

The staged changes are cleared when the changes are committed; while the unstaged changes are cleared when the changes are staged.

Issue “git log” to list all the commits:

$ git log
commit 96efc96f0856846bc495aca2e4ea9f06b38317d1
Author: username <email>
Date:   Thu Nov 29 14:09:46 2012 +0800
    Second commit

commit 858f3e71b95271ea320d45b69f44dc55cf1ff794
Author: username <email>
Date:   Thu Nov 29 13:31:32 2012 +0800
    First commit

Check the patches for the latest commit via “git log -p -1“, where the option -n limits the output to the last n commits:

$ git log -p -1
commit 96efc96f0856846bc495aca2e4ea9f06b38317d1
Author: username <email>
Date:   Thu Nov 29 14:09:46 2012 +0800
    Second commit
diff --git a/Hello.java b/Hello.java
index dc8d4cf..ede8979 100644
--- a/Hello.java
+++ b/Hello.java
@@ -2,5 +2,6 @@
 public class Hello {
    public static void main(String[] args) {
       System.out.println("Hello, world from GIT!");
+      System.out.println("Changes after First commit!");
    }
 }

I shall stress again that Git tracks the “file changes” at each commit over the previous commit.

The .gitignore File

All the files in the Git directory are either tracked or untracked. To exclude files (such as .class, .o, .exe, which can be reproduced from the source) from being tracked and remove them from the untracked file list, create a “.gitignore” file in your project directory, which lists the files to be ignored, as follows:

# .gitignore
 
# Java class files
*.class

# Executable files
*.exe

# Object and archive files
# Can use regular expression, e.g., [oa] matches either o or a
*.[oa]

# temp sub-directory (ended with a directory separator)
temp/

There should NOT be any trailing comments after a filename. You can use glob patterns for matching filenames/pathnames, e.g., [oa] denotes either o or a. You can override the rules by using an inverted pattern (!), e.g., adding !hello.exe includes hello.exe even though *.exe files are excluded.
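
For example, a minimal excerpt (the “hello.exe” here is just a hypothetical executable that you do want tracked):

# Exclude all executable files ...
*.exe
# ... but keep this particular one
!hello.exe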

Now, issue a “git status” command to check the untracked files.

$ git status
On branch master
Untracked files:
  (use "git add <file>..." to include in what will be committed)
      .gitignore
nothing added to commit but untracked files present (use "git add" to track)

Now, “Hello.class” is not shown in “Untracked files”.

Typically, we also track and commit the .gitignore file.

$ git add .gitignore
 
$ git status
On branch master
Changes to be committed:
  (use "git reset HEAD <file>..." to unstage)
        new file:   .gitignore
 
$ git commit -m "Added .gitignore"
[master 711ef4f] Added .gitignore
 1 file changed, 14 insertions(+)
 create mode 100644 .gitignore
  
$ git status
On branch master
nothing to commit, working directory clean

3.2  Setting up Remote Repo

  1. Sign up for a GIT host, such as Github https://github.com/signup/free (Unlimited for public projects; fee for private projects); or BitBucket @ https://bitbucket.org/ (Unlimited users for public projects; 5 free users for private projects; Unlimited for Academic Plan); among others.
  2. Login to the GIT host. Create a new remote repo called “test“.
  3. On your local repo (let’s continue to work on our “hello-git” project), set up the remote repo’s name and URL via “git remote add <remote-name> <remote-url>” command.
    By convention, we shall name our remote repo as “origin“. You can find the URL of a remote repo from the Git host. The URL may take the form of HTTPS or SSH. Use HTTPS for simplicity.

    // Change directory to your local repo's working directory
    $ cd /path-to/hello-git
     
    // Add a remote repo called "origin" via "git remote add <remote-name> <remote-url>"
    // For examples,
    $ git remote add origin https://github.com/your-username/test.git              // for GitHub
    $ git remote add origin https://username@bitbucket.org/your-username/test.git  // for Bitbucket

    You can list all the remote names and their corresponding URLs via “git remote -v“, for example,

    // List all remote names and their corresponding URLs
    $ git remote -v
    origin  https://github.com/your-username/test.git (fetch)
    origin  https://github.com/your-username/test.git (push)

    Now, you can manage the remote connection, using a simple name instead of the complex URL.

  4. Push the commits from the local repo to the remote repo via “git push -u <remote-name> <local-branch-name>“.
    By convention, the main branch of our local repo is called “master” (as seen from the earlier “git status” output). We shall discuss “branch” later.

    // Push all commits of the branch "master" to remote repo "origin"
    $ git push -u origin master
    Username for 'https://github.com': ******
    Password for 'https://your-username@github.com': *******
    Counting objects: 10, done.
    Delta compression using up to 8 threads.
    Compressing objects: 100% (10/10), done.
    Writing objects: 100% (10/10), 1.13 KiB | 0 bytes/s, done.
    Total 10 (delta 1), reused 0 (delta 0)
    To https://github.com/your-username/test.git
     * [new branch]      master -> master
    Branch master set up to track remote branch master from origin.
  5. Login to the GIT host and select the remote repo “test”; you shall find all the committed files.
  6. On your local system, make some change (e.g., on “Hello.java“); stage and commit the changes on the local repo; and push it to the remote. This is known as the “Edit/Stage/Commit/Push” cycle.
    // Hello.java
    public class Hello {
       public static void main(String[] args) {
          System.out.println("Hello, world from GIT!");
          System.out.println("Changes after First commit!");
          System.out.println("Changes after Pushing to remote!");
       }
    }

    $ git status
    On branch master
    Your branch is up-to-date with 'origin/master'.
     
    Changes not staged for commit:
      (use "git add <file>..." to update what will be committed)
      (use "git checkout -- <file>..." to discard changes in working dire
          modified:   Hello.java
    no changes added to commit (use "git add" and/or "git commit -a")
     
    // Stage file changes
    $ git add *.java
    
    $ git status
    On branch master
    Your branch is up-to-date with 'origin/master'.
    
    Changes to be committed:
      (use "git reset HEAD <file>..." to unstage)
            modified:   Hello.java
     
    // Commit all staged file changes
    $ git commit -m "Third commit"
    [master 744307e] Third commit
     1 file changed, 1 insertion(+)
     
    // Push the commits on local master branch to remote
    $ git push origin master
    Username for 'https://github.com': ******
    Password for 'https://username@github.com': ******
    Counting objects: 5, done.
    Delta compression using up to 8 threads.
    Compressing objects: 100% (3/3), done.
    Writing objects: 100% (3/3), 377 bytes | 0 bytes/s, done.
    Total 3 (delta 1), reused 0 (delta 0)
    To https://github.com/your-username/test.git
       711ef4f..744307e  master -> master

    Again, login to the remote to check the committed files.

3.3  Cloning a Project from a Remote Repo (git clone <remote-url>)

As mentioned earlier, you can start a local GIT repo either by running “git init” on your own project, or by running “git clone <remote-url>” to copy from an existing project.

Anyone having read access to your remote repo can clone your project. You can also clone any project in any public remote repo.

The “git clone <remote-url>” initializes a local repo and copies all files into the working tree. You can find the URL of a remote repo from the Git host.

// SYNTAX
// ======
$ git clone <remote-url>
   // <url>: can be https (recommended), ssh or file.
   // Clone the project UNDER the current directory
   // The name of the "working directory" is the same as the remote project name
$ git clone <remote-url> <working-directory-name>
   // Clone UNDER current directory, use the given "working directory" name

// EXAMPLES
// ========
// Change directory (cd) to the "parent" directory of the project directory
$ cd path-to-parent-of-the-working-directory
 
// Clone our remote repo "test" into a new working directory called "hello-git-cloned"
$ git clone https://github.com/your-username/test.git hello-git-cloned
Cloning into 'hello-git-cloned'...
remote: Counting objects: 13, done.
remote: Compressing objects: 100% (11/11), done.
remote: Total 13 (delta 2), reused 13 (delta 2)
Unpacking objects: 100% (13/13), done.
Checking connectivity... done.
 
// Verify
$ cd hello-git-cloned
 
$ ls -a
.  ..  .git  .gitignore  Hello.java  README.md
 
$ git status
On branch master
Your branch is up-to-date with 'origin/master'.
nothing to commit, working directory clean

The “git clone” automatically creates a remote name called “origin” mapped to the cloned remote-URL. You can check via “git remote -v“:

// List all the remote names
$ git remote -v
origin  https://github.com/your-username/test.git (fetch)
origin  https://github.com/your-username/test.git (push)

3.4  Summary of Basic “Edit/Stage/Commit/Push” Cycle

// Edit (Create, Modified, Rename, Delete) files,
//  which produces "unstaged" file changes.
 
// Stage file changes, which produces "Staged" file changes
$ git add <file>                          // for new and modified files
$ git rm <file>                           // for deleted files
$ git mv <old-file-name> <new-file-name>  // for renamed file

// Commit (ALL staged file changes)
$ git commit -m "message"

// Push
$ git push <remote-name> <local-branch-name>

OR,

// Stage ALL files with changes
$ git add -A   // OR, 'git add --all'

$ git commit -m "message"
$ git push

OR,

// Stage all modified/deleted (tracked) files and commit in one command
$ git commit -a -m "message"

$ git push

3.5  More on Staged and Unstaged Changes

If you modify a file, stage the changes and modify the file again, there will be staged changes and unstaged changes for that file.

For example, let’s continue the “hello-git” project. Add one more line to “README.md” and stage the changes:

// README.md
This is the README file for the Hello-world project.
Make some changes and staged.

$ git status
On branch master
Your branch is up-to-date with 'origin/master'.
Changes not staged for commit:
        modified:   README.md
 
$ git add README.md
 
$ git status
On branch master
Your branch is up-to-date with 'origin/master'.
Changes to be committed:
        modified:   README.md

Before the changes are committed, suppose we modify the file again:

// README.md
This is the README file for the Hello-world project.
Make some changes and staged.
Make more changes before the previous changes are committed.

$ git status
On branch master
Your branch is up-to-date with 'origin/master'.

Changes to be committed:
        modified:   README.md

Changes not staged for commit:
        modified:   README.md

// Now, "README.md" has both unstaged and staged changes.

// Show the staged changes
$ git diff --staged
diff --git a/README.md b/README.md
index 9565113..b2e9afb 100644
--- a/README.md
+++ b/README.md
@@ -1,2 +1,3 @@
 // README.md
 This is the README file for the Hello-world project.
+Make some changes and staged.
 
// Show the unstaged changes
$ git diff
diff --git a/README.md b/README.md
index b2e9afb..ca6622a 100644
--- a/README.md
+++ b/README.md
@@ -1,3 +1,4 @@
 // README.md
 This is the README file for the Hello-world project.
 Make some changes and staged.
+Make more changes before the previous changes are committed.
 
// Stage the changes
$ git add README.md
 
$ git status
On branch master
Your branch is up-to-date with 'origin/master'.
 
Changes to be committed:
        modified:   README.md

// Show staged changes
$ git diff --staged
diff --git a/README.md b/README.md
index 9565113..ca6622a 100644
--- a/README.md
+++ b/README.md
@@ -1,2 +1,4 @@
 // README.md
 This is the README file for the Hello-world project.
+Make some changes and staged.
+Make more changes before the previous changes are committed.
 
// Commit the staged changes
$ git commit -m "Unstaged vs. Staged Changes"
[master a44199b] Unstaged vs. Staged Changes
 1 file changed, 2 insertions(+), 0 deletion(-)

Staged and Unstaged Changes

Take note that the staged changes are cleared when the changes are committed; while the unstaged changes are cleared when the changes are staged.

For convenience, you can also use the “git-gui” tool to view the unstaged and staged changes.

3.6  Git GUI Tools

Git-GUI (Windows)

For convenience, Git provides a GUI tool, called git-gui, which can be used to perform all tasks and view the commit log graphically.

Install “Git-Gui”.

To run the git-gui, you can right-click on the project folder and choose “Git Gui”; or launch the Git-bash shell and run “git gui” command.

To view the log, choose “Repository” ⇒ “Visualize master’s history”, which launches the “gitk”. You can view the details of each commit.

You can also view each of the file via “Repository” ⇒ “Browse master’s Files” ⇒ Select a file.

Git-gui is bundled with Git. To launch git-gui, right click on the working directory and choose “git gui”, or run “git gui” command on the Git-Bash shell.

[TODO]

EGit Plugin for Eclipse

[TODO]

4.  Tagging

Tag (or label) can be used to mark a specific commit as being important, for example, to mark a particular release. The release is often marked in this format: version-number.release-no.modification-no (e.g., v1.1.5) or version-number.release-no.upgrade-no_modification-no (e.g., v1.7.0_26).

I recommend that you commit your code and push it to the remote repo as often as needed (e.g., daily), to BACK UP your code. When your code reaches a stable point (in terms of functionality), create a tag to mark the commit, which can then be used for CHECKOUT, if you need to show your code to others.

Listing Tags (git tag)

To list the existing tags, use “git tag” command.

Types of Tags – Lightweight Tags and Annotated Tags

There are two kinds of tags: lightweight tags and annotated tags. A lightweight tag is simply a pointer to a commit. An annotated tag contains annotations (meta-data) and can be digitally signed and verified.

Creating an Annotated Tag (git tag -a <tag-name> -m <message>)

To create an annotated tag at the latest commit, use “git tag -a <tag-name> -m <message>“, where the -a option specifies an annotated tag carrying meta-data. For example,

$ git tag -a v1.0.0 -m "First production system"
 
// List all tags
$ git tag
v1.0.0
 
// Show tag details
$ git show v1.0.0
   // Show the commit point and working tree

To create a tag for an earlier commit, you need to find out the commit’s name (the first seven characters of its hash code) (via “git log“), and issue “git tag -a <tag-name> -m <message> <commit-name>“. For example,

$ git log
......
commit 7e7cb40a9340691e2b16a041f7185cee5f7ba92e
......
    Commit 3
 
$ git tag -a "v0.9.0" -m "Last pre-production release" 7e7cb40
 
// List all tags
$ git tag
v0.9.0
v1.0.0
 
// Show details of a tag
$ git show v0.9.0
......

[TODO] Diagram

Creating Lightweight Tags (git tag <tag-name>)

To create a lightweight tag (without meta-data), use “git tag <tag-name>” without the -a option. The lightweight tag stores only the commit hash code.
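
For example, a minimal sketch (the tag name “v1.0.1-rc1” is just an illustration):

// Create a lightweight tag at the latest commit
$ git tag v1.0.1-rc1
 
// List all tags
$ git tag
v0.9.0
v1.0.0
v1.0.1-rc1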

Signed Tags

You can sign your tags with your private key, using the -s option instead of -a.

To verify a signed tag, use the -v option and provide the signer’s public key.

[TODO] Example
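
A minimal sketch, assuming you have a GPG key pair configured (e.g., via "git config --global user.signingkey <key-id>"); the tag name “v1.0.1” is illustrative:

// Create a signed annotated tag (-s instead of -a)
$ git tag -s v1.0.1 -m "Signed release"
 
// Verify the signed tag (requires the signer's public key in your keyring)
$ git tag -v v1.0.1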

Pushing to Remote Repo

By default, Git does not push tags (and branches) to remote repo. You need to push them explicitly, via “git push origin <tag-name>” for a particular tag or “git push origin --tags” for all the tags.
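
For example, continuing with the tags created above:

// Push a single tag to the remote repo "origin"
$ git push origin v1.0.0
 
// Push ALL tags
$ git push origin --tags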

5.  Branching/Merging

5.1  Git’s Data Structures

Git has two primary data structures:

  1. an immutable, append-only object database (or local repo) that stores all the commits and file contents;
  2. a mutable staging area (or index, or cache) that caches the staged information.

The staging area serves as the connection between the object database and the working tree (as shown in the storage model diagram). It serves to avoid volatility, and allows you to stage ALL the file changes before issuing a commit, instead of committing individual file changes. Changes to files that have been explicitly added to the index (staging area) via “git add <file>” are called staged changes. Changes that have not been added are called unstaged changes. Staged and unstaged changes can co-exist. Performing a commit copies the staged changes into the object database (local repo) and clears the index. The unstaged changes remain in the working tree.

The object database contains these objects:

  • Each version of a file is represented by a blob (binary large object – a file that can contain any data: binaries or characters). A blob holds the file data only, without any metadata – not even the filename.
  • A snapshot of the working tree is represented by a tree object, which links the blobs and sub-trees for sub-directories.
  • A commit object points to a tree object, i.e., the snapshot of the working tree at the point the commit was created. It holds metadata such as timestamp, log message, author’s and committer’s username and email. It also references its parent commit(s), except the root commit which has no parent. A normal commit has one parent; a merge commit could have multiple parents. A commit at which a new branch is created has more than one child. By following the chain of parent commit(s), you can trace the history of the project.

Each object is identified (or named) by a 160-bit (or 40 hex-digit) SHA-1 hash value of its contents (i.e., a content-addressable name). Any tiny change to the contents produces a different hash value, resulting in a different object. Typically, we use the first 7 hex-digit prefix to refer to an object, as long as there is no ambiguity.
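
As an illustration, you can peek at these objects with the plumbing command “git cat-file“ (a sketch against the earlier “hello-git“ repo; the hashes shown are placeholders and the output is abbreviated):

// Show the type of the latest commit object
$ git cat-file -t HEAD
commit
 
// Show its content: the tree (snapshot), parent commit and metadata
$ git cat-file -p HEAD
tree <tree-hash>
parent <parent-commit-hash>
author username <email> ......
......
 
// Show the tree object: one blob per file in the snapshot
$ git cat-file -p HEAD^{tree}
100644 blob <blob-hash>    Hello.java
100644 blob <blob-hash>    README.md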

There are two ways to refer to a particular commit: via a branch or a tag.

  • A branch is a movable reference to a commit. It moves forward whenever a commit is made on that branch.
  • A tag (like a label) marks a particular commit. Tags are often used for marking releases.
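
For example, a quick sketch (assuming the “master“ branch and the “v1.0.0“ tag created earlier):

$ git show master
   // Shows the latest commit that the branch "master" points to
 
$ git show v1.0.0
   // Shows the commit that the tag "v1.0.0" marks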

5.2  Branching

Branching allows you and your team members to work on different aspects of the software concurrently (on so-called feature branches), and merge them into the master branch as and when they are completed. Branching is the most important feature of a concurrent version control system.

A branch in Git is a lightweight movable pointer to one of the commits. For the initial commit, Git assigns the default branch name master and sets the master branch pointer at the initial commit. As you make further commits on the master branch, the master branch pointer moves forward accordingly. Git also uses a special pointer called HEAD to keep track of the branch that you are currently working on. The HEAD always refers to the latest commit on the current branch. Whenever you switch branches, the HEAD also switches to the latest commit on the switched branch.

Example

For example, let’s create a Git-managed project called git_branch_test with only a single-line README.md file:

This is the README. My email is xxx@somewhere.com

$ git init
$ git add README.md
$ git commit -m "Commit 1"

// Append a line in README.md: This line is added after Commit 1
$ git status
$ git add README.md
$ git commit -m "Commit 2"

// Append a line in README.md: This line is added after Commit 2
$ git status
$ git add README.md
$ git commit -m "Commit 3"
 
// Show all the commits (oneline each)
$ git log --oneline
44fdf4c Commit 3
51f6827 Commit 2
fbed70e Commit 1

Creating a new Branch (git branch <branch-name>)

You can create a new branch via the “git branch <branch-name>” command. When you create a new branch (say devel, or development), Git creates a new branch pointer for the branch devel, pointing initially at the latest commit on the current branch master.

$ git branch devel

Take note that when you create a new branch, the HEAD pointer is still pointing at the current branch.

Branch Names Convention
  • master branch: the production branch with tags for the various releases.
  • development (or next or devel) branch: developmental branch, to be merged into master if and when it completes.
  • topic branch: a short-lived branch for a specific topic, such as introducing a feature (for the devel branch) or fixing a bug (for the master branch).

Switching to a Branch (git checkout <branch-name>)

Git uses a special pointer called HEAD to keep track of the branch that you are working on. The “git branch <branch-name>” command simply creates a branch, but does not switch to the new branch. To switch to a branch, use the “git checkout <branch-name>” command. The HEAD pointer will then be pointing at the switched branch (e.g., devel).

$ git checkout devel
Switched to branch 'devel'

Alternatively, you can use “git checkout -b <branch-name>” to create a new branch and switch into the new branch.

If you switch to a branch, make changes and commit, the HEAD pointer moves forward on that branch.

// Append a line in README.md: This line is added on devel branch after Commit 3
$ git status   // NOTE "On branch devel"
$ git add README.md
$ git commit -m "Commit 4"
[devel c9b88d9] Commit 4

You can switch back to the master branch via “git checkout master“. The HEAD pointer moves back to the last commit of the master branch, and the working directory is rewound back to the latest commit on the master branch.

$ git checkout master
Switched to branch 'master'
// Check the content of the README.md, which is rewound back to Commit 3

If you continue to work on the master branch and commit, the HEAD pointer moves forward on the master branch. The two branches now diverge.

// Append a line in README.md: This line is added on master branch after Commit 4
$ git status   // NOTE "On branch master"
$ git add README.md
$ git commit -m "Commit 5"
[master 6464eb8] Commit 5

If you check out the devel branch, the file contents will be rewound back to Commit-4.

$ git checkout devel
// Check file contents

5.3  Merging Two Branches (git merge <branch-name>)

To merge two branches, say master and devel, check out the first branch, e.g., master (via “git checkout <branch-name>“), and merge with another branch, e.g., devel, via the command “git merge <branch-name>“.

Fast-Forward Linear Merge

If the branch to be merged is a direct descendant, Git performs a fast-forward merge by simply moving the branch pointer (and HEAD) forward. For example, suppose that you are currently working on the devel branch at commit-4, and the master branch’s latest commit is at commit-3:

$ git checkout master

// Let's discard Commit-5 entirely and rewind to Commit-3 on the master branch
// This is solely for illustration!!! Do this with great care!!!
$ git reset --hard HEAD~1
HEAD is now at 7e7cb40 Commit 3
  // HEAD~1 moves the HEAD pointer back by one commit (-1)
  // --hard also resets the working tree
 
// Check the file contents
 
$ git merge devel
Updating 7e7cb40..4848c7b
Fast-forward
 README.md | 1 +
 1 file changed, 1 insertion(+)
 
// Check the file contents

Take note that no new commit is created.

3-Way Merge

If the two branches have diverged, Git automatically searches for the common ancestor commit and performs a 3-way merge. If there is no conflict, a new commit will be created.

If Git detects a conflict, it will pause the merge, report a merge conflict, and ask you to resolve the conflict manually. The file is marked as unmerged. You can issue “git status” to check the unmerged files, study the details of the conflict, and decide which way to resolve it. Once the conflict is resolved, stage the file (via “git add <file>“). Finally, run a “git commit” to finalize the 3-way merge (the same Edit/Stage/Commit cycle).

$ git checkout master
// undo the Commit-4, back to Commit-3
$ git reset --hard HEAD~1
HEAD is now at 7e7cb40 Commit 3

// Change the email to abc@abc.com
$ git add README.md
$ git commit -m "Commit 5"

$ git checkout devel
// undo the Commit-4, back to Commit-3
$ git reset --hard HEAD~1
// Change the email to xyz@xyz.com to trigger conflict
$ git add README.md
$ git commit -m "Commit 4"

// Let's do a 3-way merge with conflict
$ git checkout master
$ git merge devel
Auto-merging README.md
CONFLICT (content): Merge conflict in README.md
Automatic merge failed; fix conflicts and then commit the result.
 
$ git status
# On branch master
# You have unmerged paths.
#   (fix conflicts and run "git commit")
#
# Unmerged paths:
#   (use "git add <file>..." to mark resolution)
#       both modified:      README.md
no changes added to commit (use "git add" and/or "git commit -a")

The conflicted file is marked as follows (open “README.md“ to inspect the conflict markers):

<<<<<<< HEAD
This is the README. My email is abc@abc.com
=======
This is the README. My email is xyz@xyz.com
>>>>>>> devel
This line is added after Commit 1
This line is added after Commit 2

You need to manually decide which way to take, or you could discard both by setting the email to zzz@nowhere.com.

$ git add README.md
$ git commit -m "Commit 6"

Take note that in a 3-way merge, a new commit will be created in the process (unlike a fast-forward merge).

Deleting a Merged Branch (git branch -d <branch-name>)

The merged branch (e.g., devel) is no longer needed. You can delete it via “git branch -d <branch-name>“.

$ git branch -d devel
Deleted branch devel (was a20f002).
 
// Create the development branch again at the latest commit
$ git branch devel

5.4  Rebasing Branch (git rebase)

The primary purpose of rebasing is to maintain a linear project history. For example, if you check out a devel branch and work on commit-5 and commit-6, instead of doing a 3-way merge into the master branch and subsequently removing the devel branch, you can rebase commit-5 and commit-6 on commit-4, and perform a linear fast-forward merge to maintain all the project history. New commits (7 and 8) will be created for the rebased commits (5 and 6).

The syntax is:

// SYNTAX
$ git rebase <base-name>
   // <base-name> could be any kind of commit reference
   // (such as a commit-name, a branch name, a tag, 
   // or a relative reference to HEAD).

Examples:

// Start a new feature branch from the current master
$ git checkout -b feature master
// Edit/Stage/Commit changes to feature branch
 
// Need to work on a fix on the master
$ git checkout -b hotfix master
// Edit/Stage/Commit changes to hotfix branch
// Merge hotfix into master
$ git checkout master
$ git merge hotfix
// Delete hotfix branch
$ git branch -d hotfix
 
// Rebase feature branch on master branch
//  to maintain a linear history
$ git checkout feature
$ git rebase master
// Now, linear merge
$ git checkout master
$ git merge feature

5.5  Amend the Last Commit (git commit --amend)

If you make a commit but want to change the commit message or add more changes, you may amend the most recent commit (instead of creating a new commit) via the “git commit --amend“ command:

$ git commit --amend -m "message"

For example,

// Do a commit
$ git commit -m "added login menu"

// Realize that you have not staged some files.
// Amend the commit
$ git add morefile
$ git commit --amend
    // You could modify the commit message here

5.6  More on “git checkout” and Detached HEAD

“git checkout” can be used to check out a branch, a commit, or files. The syntaxes are:

$ git checkout <branch-name>
$ git checkout <commit-name>
$ git checkout <commit-name> <filename>

When you checkout a commit, Git switches into the so-called “Detached HEAD” state, i.e., the HEAD is detached from the tip of a branch. Suppose that you continue to work on the detached HEAD at commit-5, and wish to merge commit-5 back to master. You checkout the master branch, but there is no branch name for you to reference commit-5!

In summary, you can use “git checkout <commit-name>” to inspect a commit. BUT you should always work on a branch, NOT on a detached HEAD.
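
For example, a minimal sketch of inspecting a commit and getting back onto a branch (using a commit hash from the earlier examples; output abbreviated; the branch name "my-experiment" is hypothetical):

$ git checkout 7e7cb40
Note: checking out '7e7cb40'.
You are in 'detached HEAD' state. ......
 
// If you made commits here that you want to keep,
//   put them on a branch BEFORE switching away
$ git checkout -b my-experiment
 
// Otherwise, simply switch back to the master branch
$ git checkout master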

5.7  More on “git reset” and “git reset --hard”

[TODO] examples and diagram

$ git reset <file>
   // Unstage the changes of <file> from staging area,
   //   not affecting the working tree.
 
$ git reset
   // Reset the staging area
   // Remove all changes (of all files) from staging area,
   //   not affecting the working tree.

$ git reset --hard
   // Reset the staging area and working tree to match the
   //   recent commit  (i.e., discard all changes since the
   //   last commit).
 
$ git reset <commit-name>
   // Move the HEAD of the current branch to the given commit
   //   and reset the staging area, not affecting the working tree.
 
$ git reset --hard <commit-name>
   // Reset both staging area and working tree to the given
   //   commit, i.e., discard all changes after that commit.

[TODO] Diagram

[TODO] --soft option
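
A minimal worked sketch of the common cases (try these on a scratch repo; output omitted):

// Unstage a file that was added by mistake (working tree untouched)
$ git add Hello.java
$ git reset Hello.java
 
// Undo the last commit but keep its changes staged (--soft)
$ git reset --soft HEAD~1
 
// Undo the last commit and discard its changes entirely (--hard)
$ git reset --hard HEAD~1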

5.8  git revert <commit-name>

The “git revert” undoes a commit. But, instead of removing the commit from the project history, it undoes the changes introduced by the commit and appends a new commit with the resulting content. This prevents Git from losing history. “git revert” is a safer way compared with “git reset“.

// SYNTAX
$ git revert <commit-name>
 
// EXAMPLE
[TODO] example and diagram
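
A minimal sketch (reverting the “a44199b“ commit from the earlier example; the new commit hash shown is a placeholder):

$ git log --oneline
a44199b Unstaged vs. Staged Changes
......
 
// Undo that commit by creating a NEW commit with the opposite changes
$ git revert --no-edit a44199b
[master xxxxxxx] Revert "Unstaged vs. Staged Changes"
 1 file changed, 2 deletions(-)
 
// The original commit is still in the history ("git log" shows both)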

5.9  Summary of Work Flows

Setting up GIT and “Edit/Stage/Commit/Push” Cycle

Step 1: Install GIT.

  • For Windows and Mac, download the installer from http://git-scm.com/downloads and run the downloaded installer.
  • For Ubuntu, issue command “sudo apt-get install git“.

For Windows, use the “git-bash” command shell provided by the Windows installer to issue commands. For Mac/Ubuntu, use the “Terminal”.

Step 2: Configuring GIT:

// Setup your username and email to be used in labeling commits
$ git config --global user.email "your-email@yourmail.com"
$ git config --global user.name "your-name"

Step 3: Set up GIT repo for a project. For example, we have a project called “olas1.1” located at “/usr/local/olas/olas1.1“.

$ cd /usr/local/olas/olas1.1
 
// Initialize the GIT repo
$ git init
 
$ ls -al
   // Check for ".git" directory

Create a “README.md” (or “README.textile” if you are using Eclipse’s WikiText in “textile” markup) under your project directory to describe the project.

Step 4: Start “Edit/Stage/Commit/Push” cycles.

Create/Modify files. Stage files into the staging area via “git add <file>“.

// Check the status
$ git status
......
 
// Add files into repo
$ git add README.md
$ git add www
......
 
// Check the status
$ git status
......

Step 5: Create a “.gitignore” (in the project base directory) to exclude folders/files from being tracked by GIT. Check your “git status” output to decide which folders/files to be ignored.

For example,

# ignore files and directories beginning with dot
.*
 
# ignore directories beginning with dot (a directory ends with a slash)
.*/
 
# ignore these files and directories
www/test/
www/.*
www/.*/

The trailing slash indicates a directory (and its sub-directories and files).

If you want the “.gitignore” to be tracked (which is in the ignore list):

$ git add -f .gitignore
      // -f to override the .gitignore

Step 6: Commit.

$ git status
......
 
// Commit with a message
$ git commit -m "Initial Commit"
......
 
$ git status
......

Step 7: Push to the Remote Repo (for backup, version control, and collaboration).

You need to first create a repo (say olas) in a remote GIT host, such as GitHub or BitBucket. Take note of the remote repo URL, e.g., https://username@hostname.org/username/olas.git.

$ cd /path-to/local-repo
 
// Add a remote repo name called "origin" mapped to the remote URL
$ git remote add origin https://hostname/username/olas.git
 
// Push the "master" branch to the remote "origin"
// "master" is the default branch name of your local repo after init. 
$ git push origin master

Check the remote repo for the files committed.

Step 8: Work on the source files, make changes, commit and push to remote repo.

// Check the files modified 
$ git status
......
 
// Stage for commit the modified files
$ git add ....
......
 
// Commit (with a message)
$ git commit -m "commit-message"
 
// Push to remote repo
$ git push origin master

Step 9: Create a “tag” (for version number).

// Tag a version number to the current commit
$ git tag -a v1.1 -m "Version 1.1"
      // -a to create an annotated tag, -m to provide a message
 
// Display all tags
$ git tag
......
 
// Push the tags to remote repo
// ("git push -u origin master" does not push the tags)
$ git push origin --tags

Branch and Merge Workflow

It is a good practice to freeze the “master” branch for production, and work on a development branch (say “devel“) instead. You may often spawn a branch to fix a bug in production.

// Create a branch called "devel" and checkout.
// The "devel" is initially synchronized with the "master" branch.
$ git checkout -b devel
      // same as:
      // $ git branch devel
      // $ git checkout devel

// Edit/Stage/Commit
$ git add <file>
$ git commit -m "commit-message"

// To merge the "devel" into the production "master" branch
$ git checkout master
$ git merge devel

// Push both branches to remote repo
$ git push origin master devel

// Checkout the "devel" branch and continue...
$ git checkout devel
   // Edit/Stage/Commit/Push

// Need to fix a bug in production (in "master" branch)
$ git checkout master
// Spawn a "fix" branch to fix the bug, and merge with the "master" branch

// To remove the "devel" branch (if the branch is out-of-sync)
$ git branch -d devel
// To re-create the "devel" branch
$ git checkout -b devel

5.10  Viewing the Commit Graph (gitk)

You can use the “gitk” tool (bundled with git-gui) to view the commit graph.

To run the git-gui, you can right-click on the project folder and choose “Git Gui”; or launch the Git-bash shell and run “git gui” command.

To view the commit graph, choose “Repository” ⇒ “Visualize master’s history”, which launches the “gitk”. You can view the details of each commit.

6.  Collaboration

Reference: https://www.atlassian.com/git/tutorials/making-a-pull-request/how-it-works.

6.1  Synchronizing Remote and Local: Fetch/Merge, Pull and Push

Setting up a Remote Repo (revision)

As described earlier, you can use “git remote” command to set up a “remote name”, mapped to the URL of a remote repo.

// Add a new "remote name" maps to the URL of a remote repo
$ git remote add <remote-name> <remote-url>
// For example,
$ git remote add origin https://hostname/username/project-name.git
    // Define a new remote name "origin" mapping to the given URL
 
// List all the remote names
$ git remote -v

// Delete a remote name
$ git remote rm <remote-name>

// Rename a remote name
$ git remote rename <old-remote-name> <new-remote-name>

Cloning a Remote Repo (revision)

$ git clone <remote-url>
    // Init a GIT local repo and copy all objects from the remote repo
$ git clone <remote-url> <working-directory-name>
   // Use the working-directory-name instead of default to project name

Whenever you clone a remote repo using command “git clone <remote-url>“, a remote name called “origin” is automatically added and mapped to <remote-url>.

[TODO] Diagram

Fetch/Merge Changes from remote (git fetch/merge)

The “git fetch” command imports commits from a remote repo into your local repo, without updating your local working tree. This gives you a chance to review the changes before updating (merging into) your working tree. The fetched objects are stored in remote branches, which are kept separate from the local branches.

$ cd /path-to/working-directory
 
$ git fetch <remote-name>
   // Fetch ALL branches from the remote repo to your local repo 
 
$ git fetch <remote-name> <branch-name>
   // Fetch the specific branch from the remote repo to your local repo

// List the local branches
$ git branch
* master
  devel
   // * indicates current branch   
 
// List the remote branches
$ git branch -r
  origin/master
  origin/devel

// You can checkout a remote branch to inspect the files/commits.
// But this puts you into the "Detached HEAD" state, which prevents you
// from updating the remote branch.

// You can merge the fetched changes into local repo
$ git checkout master
   // Switch to "master" branch of local repo
$ git merge origin/master
   // Merge the fetched changes from stored remote branch to local

[TODO] Diagram

git pull

As a shorthand, “git pull” combines “git fetch” and “git merge” into one command, for convenience.

$ git pull <remote-name>
   // Fetch the remote's copy of the current branch and merge it 
   //  into the local repo immediately, i.e., update the working tree
   
// Same as
$ git fetch <remote-name> <current-branch-name>
$ git merge <remote-name>/<current-branch-name>

$ git pull --rebase <remote-name>
   // Fetch, then rebase the local changes on top of the remote branch (linear history).

The “git pull” is an easy way to synchronize your local repo with origin’s (or upstream) changes (for a specific branch).

[TODO] Diagram

Pushing to Remote Repo (revision)

The “git push <remote-name> <branch-name>” is the counterpart of “git fetch“, which exports commits from local repo to remote repo.

$ git push <remote-name> <branch-name>
   // Push the specific branch of the local repo
 
$ git push <remote-name> --all
   // Push all branches of the local repo
 
$ git push <remote-name> --tags
   // Push all tags
   // "git push" does not push tags
   
$ git push -u <remote-name> <branch-name>
   // Also set the given remote branch as the upstream (tracking)
   //   reference for the current local branch.
   // Subsequent "git push" without arguments will use this reference.

[TODO] Diagram

6.2  “Fork” and “Pull Request”

“Fork” and “Pull Request” are features provided by GIT hosts (such as GitHub and BitBucket):

  • Pushing the “Fork” button copies a project from an account (e.g., the project maintainer’s) to your own personal account. [TODO] diagram
  • Pushing the “Pull Request” button notifies other developers (e.g., the project maintainer or the entire project team) to review your changes. If accepted, the project maintainer can pull and apply the changes. A pull request shall provide the source’s repo name, source’s branch name, destination’s repo name and destination’s branch name.

6.3  Feature-Branch Workflow for Shared Repo

Feature-Branch workflow is more prevalent with small teams on private projects. Everyone in the team is granted push access to a single shared remote repository and feature (or topic) branches are used to isolate changes made by the team members.

The project maintainer starts the “master” branch on the shared remote repo. All developers clone the “master” branch into their local repos. Each developer starts a feature branch (e.g., “user1-featureX“) to work on a feature. Once completed (or even while work is in progress), he files a “pull request” to initiate a review of his feature. All developers can provide comments and suggestions. Once accepted, the project maintainer can then merge the feature branch into the “master” branch.

Feature Branch Workflow

The steps are:

  1. Mark, the project maintainer, starts the project by pushing to the shared remote repo’s “master” branch.
  2. Carol, a contributor, clones the project into her local repo, via:
    // Carol:
    $ cd parent-directory-of-the-working-directory
    $ git clone https://hostname/path-to/project-name.git
        // Create a remote-name "origin" (default), branch "master"
        //  on her local repo
  3. Carol starts a feature branch (say “carol-feature“) under the “master” branch to work on a new feature, via:
    // Carol:
    $ git checkout -b carol-feature master
       // Create a new branch "carol-feature" under "master" branch
       //   and switch to the new branch
    
    // Edit/Stage/Commit/Push cycles on carol-feature branch
    $ git status
    $ git add <file>
    $ git commit -m <message>
    $ git push origin carol-feature
    
    // Repeat until done
  4. Carol completes the new feature. She files a “pull request” (by pushing the “pull request” button on the Git host) to notify the rest of the team members.
  5. Mark, the project maintainer, or anyone in the team, can comment on Carol’s feature. Carol can re-work on the feature, if necessary, and pushes all subsequent commits under her feature branch.
  6. Once the feature is accepted, Mark, or anyone in the team (including Carol), performs a merge to apply the feature branch into the “master” branch:
    // Mark, or Anyone:
    $ git checkout master
       // Switch to the "master" branch of the local repo
    $ git pull origin master
       // Fetch and merge the latest changes on local's "master" branch,
       //   if any (i.e., synchronize)
    $ git pull origin carol-feature
       // Fetch and merge carol-feature branch on local's "master" branch
    $ git push origin master
       // Update the shared remote repo
  7. Everyone can update their local repo, via:
    // Everyone:
    $ git checkout master
       // Switch to the "master" branch of the local repo
    $ git pull origin master
       // Fetch and merge the latest changes on local "master" branch

[TODO] Diagram

6.4  Forking Workflow

In the Forking workflow, instead of using a common shared remote repo, each developer forks the project into his own personal account on the remote host. He then works on his feature (preferably in a feature branch). Once completed, he files a “pull request” to notify the maintainer to review his changes, and, if accepted, merge the changes.

Forking workflow is applicable to developers working in small teams and to a third-party developer contributing to an open source project.

Forking Workflow

The steps are:

  1. Mark, the project maintainer, pushes the project from his local repo (“master” branch) to a remote Git host. He permits “read” access by contributors.
  2. Carol, a contributor, goes to Mark’s repo and forks the project (by pushing the fork button). “Forking” copies the project to Carol’s own personal account on the same Git host.
  3. Carol then clones the project from her forked repo into her local repo, via:
    // Carol:
    $ cd parent-directory-of-the-working-directory
    $ git clone https://hostname/carol/project-name.git
       // Create a remote name "origin" automatically 
       // Copy the "master" branch
  4. When a fork is cloned, Git creates a remote-name called origin that points to the fork, not the original repo it was forked from. To keep track of the original repo, Carol creates a remote name called “upstream” and pulls (fetches and merges) all new changes:
    // Carol:
    $ cd carol-local-repo-of-the-fork
    $ git remote add upstream https://hostname/mark/project-name.git
       // Create a remote-name "upstream" pointing to the original remote repo
     
    $ git remote -v
       // List all the remote names and URLs
       // origin: mapped to Carol's forked remote repo
       // upstream: mapped to Mark's original remote repo
     
    $ git pull upstream master
       // Fetch and merge all changes from the original remote repo to local repo
       //   for the "master" branch
  5. Now, Carol can make changes on her local repo (on a new branch), stage and commit the changes, and pushes them to her forked remote repo (so called Edit/Stage/Commit/Push cycles):
    // Carol:
    $ git checkout -b carol-feature master
       // Create a new branch called "carol-feature" under "master"
       //  and switch to the new branch
     
    // Edit/Stage/Commit/Push cycles on "carol-feature" branch
    $ git status
    $ git add <file>
    $ git commit -m <message>
    $ git push origin carol-feature
     
    // Repeat until done
  6. Carol files a pull request to Mark (the project maintainer) by pushing the pull-request button. She needs to specify her forked remote repo-name, her branch name (carol-feature), Mark’s remote repo-name, and Mark’s branch name (master).
  7. Mark opens the pull request (in pull request tab), reviews the change, and decides whether to accept the changes. Mark can ask Carol to re-work on the feature. Carol repeats the Edit/Stage/Commit/Push cycles.
    If Mark decides to accept the changes, he pushes the “Merge” button to merge Carol’s contribution to his master branch on the remote repo.
    [If there is no “Merge” button] Mark needs to do the following:

    // Mark:
    $ git checkout master
       // Checkout the "master" branch of the local repo
     
    $ git remote add carol git://hostname/carol/project-name.git
        // Add a new remote pointing to Carol's forked remote repo
     
    $ git pull carol carol-feature
       // Fetch and merge the changes into local repo's master branch
     
    $ git push origin master
       // Push the update to Mark's original remote repo
  8. All contributors (including Mark and Carol) should regularly synchronize their local repos by fetching and merging from Mark’s “master” branch.
    // Carol (and everyone):
    $ git checkout master
       // Switch to the "master" branch of the local repo
    $ git pull upstream master
       // Fetch and merge the latest changes on "master" branch from 
       //  original remote repo to his local repo

6.5  Other Workflows

There are other workflows such as “Centralized Workflow” and “GitFlow Workflow”. Read “https://www.atlassian.com/git/tutorials/comparing-workflows/gitflow-workflow“.

7.  Miscellaneous and How-To

Stage and Commit (git commit -a -m <message>)

You can skip the staging step (i.e., the “git add <file>...“) and commit all changes to tracked files in the working tree via “git commit -a -m <message>” with the -a (or --all) option. Note that -a picks up only files that Git already tracks; new (untracked) files still need “git add”.
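
For example, assuming “Hello.java” is already tracked and has been modified:

$ git commit -a -m "Update Hello.java"
   // Stage and commit the modified Hello.java in one step
   // A brand-new (untracked) file would still need "git add" first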

Stage all changes (git add -A)

You can use “git add -A” to stage all changes in the working tree (new, modified and deleted files) into the staging area.

Unstage a Staged file (git rm --cached <file> / git reset HEAD <file>)

Recall that you can use “git add <file>” to stage new files or modified files into the staging area.

To unstage a staged new file, use “git rm --cached <file>“.

To unstage a staged modified file, use “git reset HEAD <file>”.
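
For example (“newfile.txt” is just an illustrative name):

$ git add newfile.txt
   // Stage a new (untracked) file
$ git rm --cached newfile.txt
   // Unstage it; the file itself stays in the working tree

$ git add Hello.java
   // Stage a modified tracked file
$ git reset HEAD Hello.java
   // Unstage it; the modification stays in the working tree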

Discard changes to a modified file (git checkout -- <file>)

After a commit, you may have modified some files. You can discard the changes by checking out the last commit via “git checkout -- <file>“.
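
For example:

$ git checkout -- Hello.java
   // Overwrite Hello.java in the working tree with the version from
   //   the last commit; the discarded changes cannot be recovered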

How to Amend the Last Commit (git commit --amend)

If you make a commit but want to change the commit message:

$ git commit --amend -m "message"

If you make a commit but realize that you have not staged some file changes, you can also do it with --amend:

$ git add morefile
$ git commit --amend

You can also make further changes to the working tree, stage them, and amend the last commit:

// Edit morefile (make changes)
$ git add morefile
$ git commit --amend

How to Undo the Previous Commit(s) (git reset)

To undo previous commit(s):

// Reset the HEAD to the previous commit
// --soft to keep the working tree and index
$ git reset --soft HEAD~1   // works in any shell
$ git reset --soft HEAD^    // same; quote or escape "^" in the Windows cmd shell

// Make changes
......

// Stage
$ git add ......
// Commit
$ git commit -c ORIG_HEAD

The “git reset --hard HEAD~1” moves the HEAD to the previous commit and resets both the index and the working tree to match it (i.e., discards all changes made after that commit). Instead of HEAD~n, you can also specify the commit hash code.

The “git reset HEAD~1” with the default --mixed option moves the HEAD to the previous commit, keeps the working tree and resets (discards) the index.

The “git reset --soft HEAD~1” moves the HEAD to the previous commit and keeps both the working tree and the index (i.e., all changes made after that commit remain staged).
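
For example, each of the following is an alternative way to undo the same last commit (they are not meant to be run in sequence):

$ git reset --soft HEAD~1
   // Undo the commit only; the changes remain staged in the index
$ git reset HEAD~1
   // Default --mixed: undo the commit and unstage the changes,
   //   which remain in the working tree
$ git reset --hard HEAD~1
   // Undo the commit and discard the changes entirely (use with care!)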

[TODO] Examples, diagrams and “git status” outputs.

For commits that have already been pushed to a public repo, you should avoid rewriting history; instead, make another commit that reverses the change (e.g., via “git revert”) and push that commit to the public repo.

Relative Commit Names

A commit is uniquely and absolutely named using a 160-bit (40-hex-digit) SHA-1 hash code of its contents. You can always refer to a commit via its hash value or abbreviated hash value (such as the first 7 hex-digit) if there is no ambiguity.

You can also refer to a commit relatively, e.g., master~1 or master^ (note that ^ must be quoted in the Windows cmd shell) refers to the previous (parent) commit on the master branch; master~2 or master^^ refers to the previous of the previous (grandparent) commit, and so on. If a commit has multiple parents (e.g., due to merging of branches), ^1 refers to the first parent, ^2 refers to the second parent, and so on.
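
For example (assuming a repo with a few commits; the ^2 example applies only to merge commits):

$ git log --oneline -5
   // List the last 5 commits with their abbreviated hash codes
$ git show HEAD~2
   // Show the grandparent of the current commit
$ git diff master~1 master
   // Compare the previous commit on "master" with its tip
$ git show "HEAD^2"
   // Show the second parent of a merge commit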

REFERENCES & RESOURCES

  1. GIT mother site @ http://git-scm.com and GIT Documentation @ http://git-scm.com/doc.
  2. Git User’s Manual @ http://www.kernel.org/pub/software/scm/git/docs/user-manual.html.
  3. Git Hosts: GitHub @ https://github.com, Bitbucket @ https://bitbucket.org.
  4. Git Tutorials @ https://www.atlassian.com/git/tutorials.
  5. Bitbucket Documentation Home @ https://confluence.atlassian.com/display/BITBUCKET/Bitbucket+Documentation+Home.
  6. Bitbucket 101 @ https://confluence.atlassian.com/display/BITBUCKET/Bitbucket+101.
  7. Jon Loeliger and Matthew McCullough, “Version Control with Git”, 2nd ed., O’Reilly, 2012.
  8. Scott Chacon, “Pro Git”, Apress, 2009.

Managing Your Dependencies with JDepend

As a developer and architect, I’m always on the lookout for tools that will quickly provide feedback on the quality of software architectures and designs. The problem is that most measures of architectural and design quality tend to be vague qualities — scalability, reliability, maintainability, flexibility, modularity, etc. — that are difficult to measure in a repeatable, quantitative sense.

In this article, I’ll introduce you to JDepend, a freely available tool that can provide insight into several qualities of your software architecture. JDepend analyzes the relationships between Java packages using the class files. Since packages represent cohesive building blocks of your architecture, maintaining a well-defined package structure provides insight into the architectural qualities of maintainability, flexibility, and modularity. Packages also provide a useful mechanism for estimating the impact of requirements changes, so understanding their dependencies is useful in this respect as well. Because JDepend’s metrics are based on class files, they can be used to track the true state of your architecture at any given point in the software lifecycle.

Finally, I’ll use JDepend to analyze Sun’s J2EE Java Pet Store to give you an idea of how to utilize JDepend to manage your software development efforts.

A Little OO Background

Before we dive into the specifics of JDepend, I thought it would be a good idea to review a few OO design concepts to set the stage for the rest of our discussion. A logical question is: “Why not measure dependencies between classes instead of dependencies between packages?” The problem with using classes is that OO design is predicated upon the notion of autonomous (or semi-autonomous) groupings of state and behavior (classes) collaborating to perform work. So at some level, you want to have classes working together or depending on each other. Unfortunately, the number of classes tends to get large quickly. If you pick an arbitrary maximum size for a class, say 300 lines of code (LOC), building a system of 300,000 LOC will give you a minimum of 1000 classes. Very quickly, you see that managing dependencies at the class level will not work. Packages provide a reasonable alternative. Classes that collaborate closely to perform a specific task are referred to as “cohesive” and they get grouped together into a package. The same technique can be applied to packages that collaborate closely; they form “subsystem” packages. You can now focus on managing the dependencies between the packages, which turns out to be a more realistic goal.

So how do we go about managing dependencies between packages? As a starting point, let us look at the differences between the two primary types of classes: concrete and abstract. In Java, a concrete class is any class that can be directly created using the new operator. The type of these classes is fixed when the code is compiled and at some point in the system, every object is represented by a concrete class. A dependency occurs when one class uses another concrete class within its implementation. This basically makes the statement “my implementation depends on this concrete type.” As experience in developing OO systems has evolved, designs that minimize the dependencies on concrete types have proven to be the most flexible. This flexibility is achieved through the use of abstract classes (this includes abstract classes, as well as interfaces in Java). By utilizing abstract classes in your implementation, you allow your class to accept any class that implements the contract defined by the abstract class. You see this pattern at work in the J2SE and J2EE APIs, as well as other design patterns. So managing package dependencies boils down to minimizing your use of concrete classes defined in other packages.

Getting and Using JDepend

JDepend is an open source software program available for download at www.clarkware.com. The download includes the source code, JUnit test cases, documentation, pre-built .jar files, an ANT build script, and a sample application for testing purposes. The online documentation covers all of the features available for setting up, configuring, and running JDepend, so I’ll just highlight a few of the key features here.

After unzipping the download file, you can verify your download by running JDepend on the sample application provided in the download. Assuming you are in the directory where you unzipped the files, execute the following command:

java -cp ./lib/jdepend.jar jdepend.swingui.JDepend ./sample

This should produce a window similar to the one shown in Figure 1.

Figure 1
Figure 1. JDepend’s Swing GUI

This is JDepend’s Swing GUI. For a target package, the GUI provides drill-down capability for the packages it depends upon and the packages that depend upon it. The GUI also provides a series of metrics for each package. We will talk more about the metrics reported by the GUI in the following sections. JDepend can also provide output in either text or XML format for integration with other tools or reports within your environment.

Once you understand the JDepend metrics, you may want to automate their collection within your normal build and release cycle. Fortunately, there are optional Ant tasks for doing just that. If you are using Ant 1.5, you can use an XSL stylesheet to transform your JDepend XML output into an HTML report. These reports can be used as a regular part of your quality or metrics program. Turbine and Maven are two projects utilizing this feature.

Understanding JDepend’s Metrics Definitions

JDepend computes a series of metrics based on the dependencies between Java packages. These metrics were first identified in Robert Martin’s work with C++ (see Designing Object Oriented C++ Applications Using The Booch Method) and later extended to work with Java packages by Mike Clark. If the definitions in Table 1 are slightly intimidating, don’t worry. We will tie them back into some familiar design and architecture concepts in the next section.

Table 1. JDepend's metric definitions

CC (Concrete Classes): The number of concrete classes in this package.
AC (Abstract Classes): The number of abstract classes or interfaces in this package.
Ca (Afferent Couplings): The number of other packages that depend on classes in this package. Answers the question “How will changes to me impact the rest of the project?”
Ce (Efferent Couplings): The number of other packages that classes in this package depend upon. Answers the question “How sensitive am I to changes in other packages in the project?”
A (Abstractness): Ratio (0.0-1.0) of abstract classes (and interfaces) in this package: A = AC/(CC+AC).
I (Instability): Ratio (0.0-1.0) of efferent coupling to total coupling: I = Ce/(Ce+Ca).
D (Distance from the Main Sequence): The perpendicular distance of a package from the idealized line A + I = 1. Answers the question “How balanced am I in terms of Abstractness and Instability?” The range of this metric is 0 to 1, with D = 0 indicating a package that is coincident with the main sequence (balanced) and D = 1 indicating a package that is as far from the main sequence as possible (unbalanced).

In many ways, packages represent an ideal unit for managing architectural qualities of the system. Packages represent groups of classes, so the package definition must accommodate the broader purpose of the classes that it represents. A well-defined package architecture allows the system to be partitioned into major subcomponents, which supports the isolation of concerns and the ability to understand and reason about the architecture in manageable chunks.

The second strength of packages is that their dynamic components, package dependencies, are tied to the implementation of the classes contained within the package. Because of this, the dependencies can be determined automatically and are representative of the true state of the architecture at that given point in time.

While package dependencies allow us to reason about the structure and relationships within our architecture, this ability is diminished by cyclic package dependencies. Figure 2 illustrates the ways in which a package can become cyclically dependent.

Figure 2
Figure 2. Cyclic dependencies

In the simplest form, two packages, A and B, are cyclically dependent if package A depends on package B and package B depends on package A. I refer to this as a direct cyclic dependency because the packages are directly in the cycle. Note that these direct cycles can also span multiple packages: for example, A depends on B, B depends on C, and C depends on A. The second form of cyclic dependency in figure 2 is what I call an indirect cyclic dependency. Package Z doesn’t directly participate in a direct cycle, but because it depends on one (or more) packages that do participate in a direct cyclic relationship, it is inherently less stable.

Cyclic dependencies have the following negative consequences for the system:

  • Diminish the ability to reason about components of the architecture in isolation.
  • Changes impact seemingly unrelated components of the architecture. This makes it difficult to accurately assess and manage the impact of changes to the system.
  • They defeat the separation of layers. Most architectural approaches recognize the advantages of layered architectures, but cyclic dependencies across layers couple the layers, defeating the purpose of layering.
  • Packages cooperating in a cycle must be released as an atomic unit.

Because of the negative consequences of cyclic package dependencies, it is often better to catch them before they make it into your software baseline. The following code sample demonstrates how to write a JUnit test using JDepend to verify that no cyclic dependencies exist for a particular package.

package com.xyz.ejb;

import java.io.*;
import java.util.*;
import junit.framework.*;
import jdepend.framework.*;

public class CycleTest extends TestCase {

  // ...

  /**
   * Tests that a single package does not contain
   * any package dependency cycles.
   */
  public void testOnePackageCycle() throws IOException {

    JDepend jdepend = new JDepend();

    // Directories containing the class files to analyze
    jdepend.addDirectory("/projects/ejb/classes");
    jdepend.addDirectory("/projects/web/classes");

    jdepend.analyze();

    JavaPackage p = jdepend.getPackage("com.xyz.ejb");

    assertNotNull(p);
    assertEquals("Cycle exists: " + p.getName(),
                 false, p.containsCycle());
  }

  // ...
}

The CycleTest simply creates a new instance of the jdepend.framework.JDepend class, initializes the instance with the directories containing the classes to be analyzed, and then calls the instance’s analyze() method. After that, you just ask the JDepend instance for the JavaPackage in which you are interested. Each JavaPackage contains all of the metrics we just described. In this case, we are only requiring that the package not include any cyclic dependencies, but you can adjust the test as needed for your project.

Now that we have discussed the negative consequences of cyclic dependencies on package design, we can address the other metrics JDepend provides for assessing package design. You can also think about these metrics in terms of the graphical representation shown in Figure 3.

Figure 3
Figure 3. Graphical view of JDepends metrics

From an architectural perspective, there are two primary categories for packages: Interface packages and Implementation packages.

Interface Packages

In most projects of any size, you typically spend some time identifying and defining the key contracts or interfaces between the major components of your system. This arrangement allows various development groups to proceed in parallel using the interface as an informal contract that will allow them to successfully integrate at a future date.

You will typically capture this contract information as either Java interfaces and/or Java abstract classes that can be grouped, as a cohesive unit, in a Java package. From an OO perspective, you have a package with little or no implementation that is being used throughout the rest of the system. This conforms to the general guidance:

Program to an interface, not to an implementation.

These Interface packages are represented by the ellipse in the upper left hand corner of Figure 3. Because they are only composed of Java interfaces and/or abstract classes, they have a high Abstractness (A). Because they are used by other packages but have few (if any) dependencies on other packages, they tend to be very stable. That is to say, they have low Instability (I).

Implementation Packages

At the other end of the spectrum from Interface packages are Implementation packages. These packages are made up of predominantly concrete classes that represent the implementation(s) of the various components of the system. The implementation classes are represented by the ellipse in the lower right hand corner of Figure 3. The key is that the classes in these Implementation packages may depend on all of the other packages in the system, but no other packages should depend on them. Because of this, the implementation is free to change without having these changes ripple through the rest of the system.

Low-abstraction packages should depend upon high-abstraction packages.

This fits well with the concept of introducing new implementations to improve maintainability or performance of one portion of the system without having the change ripple through other unrelated portions of the system.

Main Sequence

It would be nice if everything fit neatly into the categories of pure interface or pure implementation, but the real world involves compromises and trade-offs. The Main Sequence, shown by the dashed line in Figure 3, represents the notion that although the forces of Abstractness and Instability for a package may vary, they should vary proportionally to one another. The ellipses around the Main Sequence are intended to show that the JDepend metrics are generalized, versus absolute, measures of package architecture quality. JDepend reports D, which is actually the perpendicular distance from the Main Sequence, to simplify the math. This is shown by d1 and d2 in the figure. There are no absolutes for the value of D, but as its distance from the Main Sequence increases, there is a higher likelihood that the package(s) could benefit from a review or refactoring.
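
If you want to guard D as part of your build, you can write a JUnit test analogous to CycleTest. The sketch below assumes the same class directories as the earlier example and relies on the distance() accessor of JavaPackage (as documented by JDepend); pick an ideal value and a tolerance that make sense for your project.

package com.xyz.ejb;

import java.io.IOException;
import junit.framework.TestCase;
import jdepend.framework.JDepend;
import jdepend.framework.JavaPackage;

public class DistanceTest extends TestCase {

  /** Fails if the package drifts too far from the Main Sequence. */
  public void testOnePackageDistance() throws IOException {

    JDepend jdepend = new JDepend();

    jdepend.addDirectory("/projects/ejb/classes");
    jdepend.addDirectory("/projects/web/classes");

    jdepend.analyze();

    JavaPackage p = jdepend.getPackage("com.xyz.ejb");
    assertNotNull(p);

    // Ideal D is 0.0; allow a drift of up to 0.5 before failing
    assertEquals("Distance exceeded: " + p.getName(),
                 0.0, p.distance(), 0.5);
  }
}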

Analyzing the Pet Store

As an example of how you might apply JDepend in a project setting, let’s take a look at the results of analyzing the popular Java Pet Store application using JDepend.

I downloaded the 1.3 version of Java Pet Store from the Sun site and extracted the class files from the .jar files into a common directory, C:/petstore. I then ran JDepend using the Swing GUI:

java -cp %CLASSPATH% jdepend.swingui.JDepend C:/petstore

Figure 4 shows the resulting Afferent Dependencies portion of the JDepend GUI.

Figure 4
Figure 4. Afferent dependencies for petstore.controller.ejb

Reviewing the results in the GUI shows the dreaded “cyclic” tag on the end of a number of the packages. The petstore.controller.ejb package has been expanded to show the packages that depend upon it, controller.ejb.actions and controller.web. You can also see that controller.web.actions is the only package that depends upon controller.web. Since we know cyclic dependencies are generally not desirable, let’s see if we can identify their causes first.

Cyclic Dependencies

JDepend identifies packages that are cyclically dependent, but it also labels packages that depend on cyclically dependent packages as cyclic. These are the direct and indirect cyclic dependencies we discussed earlier. When refactoring to remove cyclic dependencies, your goal is to remove direct cyclic dependencies. Indirect cyclic dependencies are really just a side effect of the direct cyclic dependencies representing the ripple of the instability through the system. You can use the Swing GUI to identify the type of cyclic dependency your package is participating in.

Referring back to Figure 4, if the petstore.controller.ejb package is part of a direct cyclic dependency, then we would see the package names begin to repeat as we drilled down into the dependency structure. Because we do not see a pattern of repeated package dependencies in the afferent window, we know the package is cyclically dependent because it depends on a package that contains a cycle (indirect cyclic dependency). Figure 5 shows the efferent dependencies for the controller.ejb package.

Figure 5
Figure 5. Efferent dependencies for petstore.controller.ejb

Figure 5 shows that petstore.controller.ejb depends on six packages:

  • cart.ejb
  • customer.ejb
  • servicelocator
  • servicelocator.ejb
  • waf.controller.ejb
  • waf.exceptions

I’ve expanded the dependencies so you can see them in the display, but note that the Ce metric also tells you there are six packages that this package depends upon. I drilled down further into the waf.controller.ejb package to allow you to see that the cycle is caused by dependencies between the waf.controller.ejb and waf.controller.ejb.action packages. The interface and abstract class composing the waf.controller.ejb.action package are shown below.

// ...waf.controller.ejb.action.EJBAction
// all packages begin with com.sun.j2ee.blueprints

package waf.controller.ejb.action;
import waf.event.Event;
import waf.event.EventResponse;
import waf.controller.ejb.StateMachine;
import waf.event.EventException;

public interface EJBAction  {

  public void init(StateMachine urc);
...
}

// ...waf.controller.ejb.action.EJBActionSupport
// all packages begin with com.sun.j2ee.blueprints

package waf.controller.ejb.action;
import waf.controller.ejb.StateMachine;

public abstract class EJBActionSupport
 implements java.io.Serializable, EJBAction {

  protected StateMachine machine = null;

  public void init(StateMachine machine) {
      this.machine = machine;
  }
...
}

Note that both of these files meet the general criteria for an Interface package. However, both the interface and the abstract class depend on the concrete class com.sun.j2ee.blueprints.waf.controller.ejb.StateMachine. As I mentioned earlier in the article, Interface packages should, if possible, generally avoid depending on concrete Implementation packages. In this case, simply defining an interface, waf.controller.ejb.action.StateMachineIF, for the EJBActionSupport class and EJBAction interface to use in their definitions would correct the cycle. The StateMachine class could then implement the StateMachineIF and the dependencies would only flow into the ejb.action package.

Larger Distance from Main Sequence

Another package that stood out in a quick review of the Pet Store packages is the servicelocator.ejb package, which had the following metrics:

CC: 1   AC: 0   Ca: 12   Ce: 1   A: 0   I: 0.08   D: 0.92

The distance from the main sequence, D, was what brought the package to my attention. Let’s look at the story the metrics tell about the package. First, there is only one concrete class in the package (CC = 1) and no abstract classes (AC = 0), so it falls into the general category of an Implementation class. Next, the package is depended upon by 12 other packages (Ca = 12) but only depends on one other package (Ce = 1). Because there are no abstract classes, the Abstractness is 0 (A = AC/(AC+CC)= 0). Because 12 packages depend on this package and it depends on only 1 package, the Instability is 0.08 (I = Ce/(Ce+Ca)).
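
From these two values the reported distance follows directly: D = |A + I - 1| = |0 + 0.08 - 1| = 0.92, which is the normalized form of the perpendicular distance from the Main Sequence that JDepend reports.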

At this point, you can begin to see several contradictions in the package. The package is heavily depended upon, so it should have a high degree of abstraction, but it is composed of only one concrete class. The package consists of only one class, so the JDepend metrics are really class metrics in this case. The class is also used by 12 other packages, so changes to the ServiceLocator class will ripple through these packages. The incredible part is that we already know quite a bit about the package and its relationships, even though we have not yet looked at the source. If you are using the JDepend GUI, you can also find out the specific packages this package depends upon and the packages that depend on this package.

From the package name, you may have already guessed that the concrete class contained in the servicelocator.ejb package is an implementation of the ServiceLocator pattern from the Core J2EE Patterns book. The asyncsender.ejb.AsyncSenderEJB implementation shows how the ServiceLocator is used by one of these packages, asyncsender.ejb.

//asyncsender.ejb.AsyncSenderEJB
// all packages begin with com.sun.j2ee.blueprints

package asyncsender.ejb;

// EJB and JMS imports needed by this snippet
import javax.ejb.*;    // SessionBean, SessionContext, CreateException, EJBException
import javax.jms.*;    // Queue, QueueConnectionFactory

import asyncsender.util.JNDINames;
import servicelocator.ejb.ServiceLocator;
import servicelocator.ServiceLocatorException;

public class AsyncSenderEJB implements SessionBean
{
  private SessionContext sc;
  private Queue q;
  private QueueConnectionFactory qFactory;
  ....
  public void ejbCreate( ) throws CreateException
  {
    try {
      ServiceLocator serviceLocator =
        new ServiceLocator();

      qFactory =
        serviceLocator.getQueueConnectionFactory(
         JNDINames.QUEUE_CONNECTION_FACTORY);

      q = serviceLocator.getQueue(
        JNDINames.ASYNC_SENDER_QUEUE);

    }catch (ServiceLocatorException sle) {
      throw new EJBException(
        "AsyncSenderEJB.ejbCreate failed", sle);
    }
  }
  ....
}

You can see that the AsyncSenderEJB class creates and holds a local reference to an instance of the concrete class, ServiceLocator. A more flexible approach would be to define a new interface, servicelocator.ejb.ServiceLocatorIF, that the ServiceLocator class could implement. Because JDepend considers ServiceLocatorIF to be an abstract class, this would improve the Abstractness of the servicelocator.ejb package from 0 to A = AC/(AC+CC) = 1/(1+1) = 0.5. Points D1 (original) and D2 (modified) in Figure 6 illustrate the impact of this change on the distance from the main sequence for the servicelocator.ejb package.

Figure 6
Figure 6. servicelocator.ejb package analysis

Note that while this improves the servicelocator.ejb package’s metrics, it does not address the problems of the classes that use the ServiceLocator class. Currently, adding a new type of ServiceLocator, say a CachingServiceLocator, would require modifying all 12 packages (24 files) that use the ServiceLocator. A more flexible approach would be to declare the local variable to be of type ServiceLocatorIF. The creation of the correct type of ServiceLocator could then be abstracted behind a factory method. This is left as an exercise for the reader.
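
As a starting point for that exercise, here is a minimal sketch of the suggested interface (its two methods are inferred from the ServiceLocator calls shown above, and the name ServiceLocatorIF comes from the suggestion earlier in this section; the factory method is still left to the reader):

// servicelocator.ejb.ServiceLocatorIF (a hypothetical sketch)
// all packages begin with com.sun.j2ee.blueprints

package servicelocator.ejb;

import javax.jms.Queue;
import javax.jms.QueueConnectionFactory;
import servicelocator.ServiceLocatorException;

public interface ServiceLocatorIF {

  QueueConnectionFactory getQueueConnectionFactory(String jndiName)
      throws ServiceLocatorException;

  Queue getQueue(String jndiName)
      throws ServiceLocatorException;

  // ... the remaining lookup methods currently exposed by ServiceLocator
}

With ServiceLocator declared to implement ServiceLocatorIF, clients such as AsyncSenderEJB can type their local variable as ServiceLocatorIF; only the code that creates the concrete locator still needs to name the concrete class, which is what the factory method would hide.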

Some Parting Remarks

There is always the tendency to attempt to identify a metric(s) that can be cast in stone as the ultimate proof of good software or architectural quality. I’ve had great success utilizing the metrics described in this article, on both Java and C++ projects, to identify architecture and implementation hotspots. I’ve never found a desirable cyclic dependency; however, I have found cases where the cost to fix the cycle was too high. The distance metric, D, also provides a reasonable approach to organizing classes into cohesive packages, but there is not a magical value for D.

Architectural “ilities” are often difficult to quantify in a concrete and repeatable manner. As your team size grows, it becomes impossible to ensure good design qualities are maintained by simply reviewing all the implementations. As you iterate through the design-implement-refactor cycle, it is relatively easy to introduce undesirable dependencies or erode the cohesion of one or more packages. JDepend provides a series of simple, repeatable metrics that allow you to monitor the evolution of your package architecture as part of your normal development process. The ultimate decision as to what to do is still up to the team, but the key is that you now have a simple, repeatable approach for monitoring the impacts of design and implementation decisions on the architecture.

From http://www.onjava.com/pub/a/onjava/2004/01/21/jdepend.html?page=1

Some Questions Don't Need Answers

One of the favorite question types at companies like Google or McKinsey is the market-sizing question, for example: “How many Starbucks stores are there across the US?”

If you pull out your phone and google the answer, then BZZZT! Sorry, you have just been eliminated.

Whether you are interviewing for an engineering, finance, marketing or consulting position, they will ask this same kind of question to identify the strongest candidates.

The truth is that the number of Starbucks stores in the US does not matter; they do not care at all. If you fixate on the literal wording and miss the intent behind it, you have shown that your thinking is too simplistic and, of course, that you are not the right fit.

In fact, the purpose of this kind of question is to observe how candidates answer: their thought process, their problem solving, and how they defend their position. For instance, the interviewer might ask a question with different content but of the same form:

“How many households are there across the US?”

While the Starbucks figure can still be found online, the number of households is nearly impossible to look up on the spot, because there is no single statistic you can quickly search for. So, where do we start?

First, let's start from the total US population. We know that about 300 million people currently live in the US. A typical family has about 3 people: father, mother, child. Averaging one household per 3 people, we get a figure of 100 million.

A careful look at the census data shows that there are about 125 million households in the US. You answered wrong. But that does not matter; what matters is that you derived the figure of 100 million, close to the 125 million mark, with a clear line of reasoning rather than hand-waving.

Here is another question: “How many cars are there in the US?”

Now, stay calm; once again, this is not the kind of question you need to answer correctly. We start from the 300-million population figure once more. A typical family of 3 usually has 2 cars. So, averaging across 300 million people, we get 200 million cars.

The actual statistic: there are 260 million cars in the US. We answered wrong again, but it took only 30 seconds to produce a number out of thin air. And, importantly, there was reasoning behind it.

Now let's return to the question at the start: “How many Starbucks stores are there across the US?”

I currently live in a town of about 100,000 people, and I estimate it has about 3-5 stores. So, with a population of 300 million, we can figure there are around 3,000 such towns. The number of Starbucks stores should therefore fall somewhere between 9,000 (3,000 x 3) and 15,000 (3,000 x 5). Taking the midpoint, we get a figure of around 12,000 stores.

Once again, I answered the question out of thin air, based on clear reasoning. The point I want to make is that, for market-sizing questions like these, the best way to reason is to break the problem down and estimate with averages.

In fact, the questions above are the simplest of this type; more complex ones will demand even more logical reasoning from you. In many cases the interviewer does not actually know the exact answer either. They simply observe and listen to your train of thought.

Next time you run into an interview question of this type, try starting from some anchor: the population, or anything else logical. At least then you have a fulcrum; whether you can lift the whole earth with it is up to you.

From http://www.ecoblader.com/2017/08/co-nhung-cau-hoi-khong-can-cau-tra-loi/

DDD, Hexagonal, Onion, Clean, CQRS, … How I put it all together

100-explicit-architecture-svg

This post is part of The Software Architecture Chronicles, a series of posts about Software Architecture. In them, I write about what I’ve learned on Software Architecture, how I think of it, and how I use that knowledge. The contents of this post might make more sense if you read the previous posts in this series.

After graduating from University I followed a career as a high school teacher until a few years ago I decided to drop it and become a full-time software developer.

From then on, I have always felt like I need to recover the “lost” time and learn as much as possible, as fast as possible. So I have become a bit of an addict in experimenting, reading and writing, with a special focus on software design and architecture. That’s why I write these posts, to help me learn.

In my last posts, I’ve been writing about many of the concepts and principles that I’ve learned and a bit about how I reason about them. But I see these as just pieces of a big puzzle.

Today’s post is about how I fit all of these pieces together and, as it seems I should give it a name, I call it Explicit Architecture. Furthermore, these concepts have all “passed their battle trials” and are used in production code on highly demanding platforms. One is a SaaS e-commerce platform with thousands of web-shops worldwide; another is a marketplace, live in 2 countries, with a message bus that handles over 20 million messages per month.

Fundamental blocks of the system

I start by recalling EBI and Ports & Adapters architectures. Both of them make an explicit separation of what code is internal to the application, what is external, and what is used for connecting internal and external code.

Furthermore, Ports & Adapters architecture explicitly identifies three fundamental blocks of code in a system:

  • What makes it possible to run a user interface, whatever type of user interface it might be;
  • The system business logic, or application core, which is used by the user interface to actually make things happen;
  • Infrastructure code, that connects our application core to tools like a database, a search engine or 3rd party APIs.

000 - Explicit Architecture.svg

The application core is what we should really care about. It is the code that allows our code to do what it is supposed to do; it IS our application. It might use several user interfaces (progressive web app, mobile, CLI, API, …), but the code actually doing the work is the same and is located in the application core; it shouldn’t really matter which UI triggers it.

As you can imagine, the typical application flow goes from the code in the user interface, through the application core to the infrastructure code, back to the application core and finally delivers a response to the user interface.

010 - Explicit Architecture.svg

Tools

Far away from the most important code in our system, the application core, we have the tools that our application uses, for example, a database engine, a search engine, a Web server or a CLI console (although the last two are also delivery mechanisms).

020 - Explicit Architecture.svg

While it might feel weird to put a CLI console in the same “bucket” as a database engine, and although they serve different purposes, they are in fact tools used by the application. The key difference is that, while the CLI console and the web server are used to tell our application to do something, the database engine is told by our application to do something. This is a very relevant distinction, as it has strong implications on how we build the code that connects those tools with the application core.

Connecting the tools and delivery mechanisms to the Application Core

The code units that connect the tools to the application core are called adapters (Ports & Adapters Architecture). The adapters are the ones that effectively implement the code that will allow the business logic to communicate with a specific tool and vice-versa.

The adapters that tell our application to do something are called Primary or Driving Adapters while the ones that are told by our application to do something are called Secondary or Driven Adapters.

Ports

These Adapters, however, are not randomly created. They are created to fit a very specific entry point to the Application Core, a Port. A port is nothing more than a specification of how the tool can use the application core, or how it is used by the Application Core. In most languages and in its most simple form, this specification, the Port, will be an Interface, but it might actually be composed of several Interfaces and DTOs.

It’s important to note that the Ports (Interfaces) belong inside the business logic, while the adapters belong outside. For this pattern to work as it should, it is of utmost importance that the Ports are created to fit the Application Core needs and not simply mimic the tools APIs.

Primary or Driving Adapters

The Primary or Driver Adapters wrap around a Port and use it to tell the Application Core what to do. They translate whatever comes from a delivery mechanism into a method call in the Application Core.

030 - Explicit Architecture.svg

In other words, our Driving Adapters are Controllers or Console Commands which are injected, through their constructor, with some object whose class implements the interface (Port) that the controller or console command requires.

In a more concrete example, a Port can be a Service interface or a Repository interface that a controller requires. The concrete implementation of the Service, Repository or Query is then injected and used in the Controller.

Alternatively, a Port can be a Command Bus or Query Bus interface. In this case, a concrete implementation of the Command or Query Bus is injected into the Controller, who then constructs a Command or Query and passes it to the relevant Bus.

Secondary or Driven Adapters

Unlike the Driver Adapters, which wrap around a port, the Driven Adapters implement a Port, an interface, and are then injected into the Application Core, wherever the port is required (type-hinted).

040 - Explicit Architecture.svg

For example, let’s suppose that we have a naive application which needs to persist data. So we create a persistence interface that meets its needs, with a method to save an array of data and a method to delete a line in a table by its ID. From then on, wherever our application needs to save or delete data we will require in its constructor an object that implements the persistence interface that we defined.

Now we create an adapter specific to MySQL which will implement that interface. It will have the methods to save an array and delete a line in a table, and we will inject it wherever the persistence interface is required.

If at some point we decide to change the database vendor, let’s say to PostgreSQL or MongoDB, we just need to create a new adapter that implements the persistence interface and is specific to PostgreSQL, and inject the new adapter instead of the old one.
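
A minimal sketch of this arrangement in Java (the names PersistencePort and MySqlPersistenceAdapter, and the two-method API, are purely illustrative, mirroring the save/delete example above):

// PersistencePort.java -- the Port, owned by the Application Core
import java.util.Map;

public interface PersistencePort {
    void save(String table, Map<String, Object> row);
    void deleteById(String table, long id);
}

// MySqlPersistenceAdapter.java -- a Secondary/Driven Adapter, specific to MySQL
import java.sql.Connection;
import java.util.Map;

public class MySqlPersistenceAdapter implements PersistencePort {

    private final Connection connection;

    public MySqlPersistenceAdapter(Connection connection) {
        this.connection = connection;
    }

    public void save(String table, Map<String, Object> row) {
        // build and execute an INSERT statement for the given table and row
    }

    public void deleteById(String table, long id) {
        // execute "DELETE FROM <table> WHERE id = ?" with a PreparedStatement
    }
}

Swapping MySQL for PostgreSQL or MongoDB then means writing another class that implements PersistencePort and injecting it instead; the Application Core does not change.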

Inversion of control

A characteristic to note about this pattern is that the adapters depend on a specific tool and a specific port (by implementing an interface). But our business logic only depends on the port (interface), which is designed to fit the business logic needs, so it doesn’t depend on a specific adapter or tool.

050 - Explicit Architecture.svg

This means the direction of dependencies is towards the centre, it’s the inversion of control principle at the architectural level.

Although, again, it is of utmost importance that the Ports are created to fit the Application Core needs and not simply mimic the tools APIs.

Application Core organisation

The Onion Architecture picks up the DDD layers and incorporates them into the Ports & Adapters Architecture. Those layers are intended to bring some organisation to the business logic, the interior of the Ports & Adapters “hexagon”, and just like in Ports & Adapters, the dependencies direction is towards the centre.

Application Layer

The use cases are the processes that can be triggered in our Application Core by one or several User Interfaces in our application. For example, in a CMS we could have the actual application UI used by the common users, another independent UI for the CMS administrators, another CLI UI, and a web API. These UIs (applications) could trigger use cases that can be specific to one of them or reused by several of them.

The use cases are defined in the Application Layer, the first layer provided by DDD and used by the Onion Architecture.

060 - Explicit Architecture.svg

This layer contains Application Services (and their interfaces) as first class citizens, but it also contains the Ports & Adapters interfaces (ports), which include ORM interfaces, search engine interfaces, messaging interfaces and so on. In the case where we are using a Command Bus and/or a Query Bus, this layer is where the respective Handlers for the Commands and Queries belong.

The Application Services and/or Command Handlers contain the logic to unfold a use case, a business process. Typically, their role is to (a rough sketch in code follows the list):

  1. use a repository to find one or several entities;
  2. tell those entities to do some domain logic;
  3. and use the repository to persist the entities again, effectively saving the data changes.
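
A rough sketch of those three steps in code (the names OrderRepository, Order and PayOrderService are hypothetical):

// Application Layer -- a use case unfolded by an Application Service
interface OrderRepository {                 // a Port; its adapter lives outside the core
    Order byId(long id);
    void save(Order order);
}

class Order {                               // Domain Layer entity (simplified)
    private boolean paid;
    void markAsPaid() { this.paid = true; }
}

public class PayOrderService {
    private final OrderRepository orders;

    public PayOrderService(OrderRepository orders) {
        this.orders = orders;
    }

    public void pay(long orderId) {
        Order order = orders.byId(orderId);   // 1. use a repository to find the entity
        order.markAsPaid();                   // 2. tell the entity to run its domain logic
        orders.save(order);                   // 3. persist the entity, saving the changes
    }
}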

The Command Handlers can be used in two different ways:

  1. They can contain the actual logic to perform the use case;
  2. They can be used as mere wiring pieces in our architecture, receiving a Command and simply triggering logic that exists in an Application Service.

Which approach to use depends on the context, for example:

  • Do we already have the Application Services in place and are now adding a Command Bus?
  • Does the Command Bus allow specifying any class/method as a handler, or do they need to extend or implement existing classes or interfaces?

This layer also contains the triggering of Application Events, which represent some outcome of a use case. These events trigger logic that is a side effect of a use case, like sending emails, notifying a 3rd party API, sending a push notification, or even starting another use case that belongs to a different component of the application.

Domain Layer

Further inwards, we have the Domain Layer. The objects in this layer contain the data and the logic to manipulate that data that is specific to the Domain itself; they are independent of the business processes that trigger that logic, and completely unaware of the Application Layer.

070 - Explicit Architecture.svg

Domain Services

As I mentioned above, the role of an Application Service is to:

  1. use a repository to find one or several entities;
  2. tell those entities to do some domain logic;
  3. and use the repository to persist the entities again, effectively saving the data changes.

However, sometimes we encounter some domain logic that involves different entities, of the same type or not, and we feel that that domain logic does not belong in the entities themselves, we feel that that logic is not their direct responsibility.

So our first reaction might be to place that logic outside the entities, in an Application Service. However, this means that that domain logic will not be reusable in other use cases: domain logic should stay out of the application layer!

The solution is to create a Domain Service, which has the role of receiving a set of entities and performing some business logic on them. A Domain Service belongs to the Domain Layer, and therefore it knows nothing about the classes in the Application Layer, like the Application Services or the Repositories. On the other hand, it can use other Domain Services and, of course, the Domain Model objects.

Domain Model

In the very centre, depending on nothing outside it, is the Domain Model, which contains the business objects that represent something in the domain. Examples of these objects are, first of all, Entities but also Value Objects, Enums and any objects used in the Domain Model.

The Domain Model is also where Domain Events “live”. These events are triggered when a specific set of data changes and they carry those changes with them. In other words, when an entity changes, a Domain Event is triggered and it carries the new values of the changed properties. These events are perfect, for example, to be used in Event Sourcing.

Components

So far we have been segregating the code based on layers, but that is the fine-grained code segregation. The coarse-grained segregation of code is at least as important and it’s about segregating the code according to sub-domains and bounded contexts, following Robert C. Martin’s ideas expressed in screaming architecture. This is often referred to as “Package by feature” or “Package by component”, as opposed to “Package by layer”, and it’s quite well explained by Simon Brown in his blog post “Package by component and architecturally-aligned testing”:

I am an advocate for the “Package by component” approach and, picking up on Simon Brown’s diagram about Package by component, I would shamelessly change it to the following:

These sections of code are cross-cutting to the layers previously described, they are the components of our application. Examples of components can be Authentication, Authorization, Billing, User, Review or Account, but they are always related to the domain. Bounded contexts like Authorization and/or Authentication should be seen as external tools for which we create an adapter and hide behind some kind of port.

080 - Explicit Architecture.svg

Decoupling the components

Just like the fine-grained code units (classes, interfaces, traits, mixins, …), the coarse-grained code units (components) also benefit from low coupling and high cohesion.

To decouple classes we make use of Dependency Injection, by injecting dependencies into a class as opposed to instantiating them inside the class, and Dependency Inversion, by making the class depend on abstractions (interfaces and/or abstract classes) instead of concrete classes. This means that the depending class has no knowledge about the concrete class that it is going to use, it has no reference to the fully qualified class name of the classes that it depends on.

In the same way, having completely decoupled components means that a component has no direct knowledge of any other component. In other words, it has no reference to any fine-grained code unit from another component, not even interfaces! This means that Dependency Injection and Dependency Inversion are not enough to decouple components; we will need some sort of architectural constructs. We might need events, a shared kernel, eventual consistency, and even a discovery service!

Triggering logic in other components

When one of our components (component B) needs to do something whenever something else happens in another component (component A), we can not simply make a direct call from component A to a class/method in component B because A would then be coupled to B.

However we can make A use an event dispatcher to dispatch an application event that will be delivered to any component listening to it, including B, and the event listener in B will trigger the desired action. This means that component A will depend on an event dispatcher, but it will be decoupled from B.

Nevertheless, if the event itself “lives” in A this means that B knows about the existence of A, it is coupled to A. To remove this dependency, we can create a library with a set of application core functionality that will be shared among all components, the Shared Kernel. This means that the components will both depend on the Shared Kernel but they will be decoupled from each other. The Shared Kernel will contain functionality like application and domain events, but it can also contain Specification objects, and whatever makes sense to share, keeping in mind that it should be as minimal as possible because any changes to the Shared Kernel will affect all components of the application. Furthermore, if we have a polyglot system, let’s say a micro-services ecosystem where they are written in different languages, the Shared Kernel needs to be language agnostic so that it can be understood by all components, whatever the language they have been written in. For example, instead of the Shared Kernel containing an Event class, it will contain the event description (ie. name, properties, maybe even methods although these would be more useful in a Specification object) in an agnostic language like JSON, so that all components/micro-services can interpret it and maybe even auto-generate their own concrete implementations.
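
A rough Java sketch of this event-based decoupling (EventDispatcher, UserRegistered, AccountService and BillingSubscriber are hypothetical names; in a polyglot ecosystem the Shared Kernel would hold a language-agnostic event description, e.g. in JSON, rather than a Java class):

// shared-kernel/UserRegistered.java -- an application event both components may depend on
public class UserRegistered {
    public final String userId;
    public UserRegistered(String userId) { this.userId = userId; }
}

// shared-kernel/EventDispatcher.java -- the dispatcher abstraction
public interface EventDispatcher {
    <E> void subscribe(Class<E> eventType, java.util.function.Consumer<E> listener);
    void dispatch(Object event);
}

// component-a/AccountService.java -- dispatches the event, knows nothing about component B
public class AccountService {
    private final EventDispatcher dispatcher;

    public AccountService(EventDispatcher dispatcher) { this.dispatcher = dispatcher; }

    public void register(String userId) {
        // ... create and persist the account ...
        dispatcher.dispatch(new UserRegistered(userId));
    }
}

// component-b/BillingSubscriber.java -- reacts to the event, knows nothing about component A
public class BillingSubscriber {
    public void subscribe(EventDispatcher dispatcher) {
        dispatcher.subscribe(UserRegistered.class,
                event -> createBillingProfile(event.userId));
    }

    private void createBillingProfile(String userId) {
        // side effect of the use case: set up billing for the new user
    }
}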

This approach works both in monolithic applications and distributed applications like micro-services ecosystems. However, when the events can only be delivered asynchronously, for contexts where triggering logic in other components needs to be done immediately this approach will not suffice! Component A will need to make a direct HTTP call to component B. In this case, to have the components decoupled, we will need a discovery service to which A will ask where it should send the request to trigger the desired action, or alternatively make the request to the discovery service who can proxy it to the relevant service and eventually return a response back to the requester. This approach will couple the components to the discovery service but will keep them decoupled from each other.

Getting data from other components

The way I see it, a component is not allowed to change data that it does not “own”, but it is fine for it to query and use any data.

Data storage shared between components

When a component needs to use data that belongs to another component, let’s say a billing component needs to use the client name which belongs to the accounts component, the billing component will contain a query object that will query the data storage for that data. This simply means that the billing component can know about any dataset, but it must use the data that it does not “own” as read-only, by the means of queries.

Data storage segregated per component

In this case, the same pattern applies, but we have more complexity at the data storage level. Having components with their own data storage means each data storage contains:

  • A set of data that it owns and is the only one allowed to change, making it the single source of truth;
  • A set of data that is a copy of other components data, which it can not change on its own, but is needed for the component functionality, and it needs to be updated whenever it changes in the owner component.

Each component will create a local copy of the data it needs from other components, to be used when needed. When the data changes in the component that owns it, that owner component will trigger a domain event carrying the data changes. The components holding a copy of that data will be listening to that domain event and will update their local copy accordingly.

Flow of control

As I said above, the flow of control goes, of course, from the user into the Application Core, over to the infrastructure tools, back to the Application Core and finally back to the user. But how exactly do classes fit together? Which ones depend on which ones? How do we compose them?

Following Uncle Bob, in his article about Clean Architecture, I will try to explain the flow of control with UMLish diagrams…

Without a Command/Query Bus

In the case we do not use a command bus, the Controllers will depend either on an Application Service or on a Query object.

[EDIT – 2017-11-18] I completely missed the DTO I use to return data from the query, so I added it now. Tkx to MorphineAdministered who pointed it out for me.

In the diagram above we use an interface for the Application Service, although we might argue that it is not really needed since the Application Service is part of our application code and we will not want to swap it for another implementation, although we might refactor it entirely.

The Query object will contain an optimized query that will simply return some raw data to be shown to the user. That data will be returned in a DTO which will be injected into a ViewModel. This ViewModel may have some view logic in it, and it will be used to populate a View.

The Application Service, on the other hand, will contain the use case logic, the logic we will trigger when we want to do something in the system, as opposed to simply view some data. The Application Services depend on Repositories which will return the Entity(ies) that contain the logic which needs to be triggered. It might also depend on a Domain Service to coordinate a domain process in several entities, but that is hardly ever the case.

After unfolding the use case, the Application Service might want to notify the whole system that that use case has happened, in which case it will also depend on an event dispatcher to trigger the event.

It is interesting to note that we place interfaces both on the persistence engine and on the repositories. Although it might seem redundant, they serve different purposes:

  • The persistence interface is an abstraction layer over the ORM so we can swap the ORM being used with no changes to the Application Core.
  • The repository interface is an abstraction on the persistence engine itself. Let’s say we want to switch from MySQL to MongoDB. The persistence interface can be the same, and, if we want to continue using the same ORM, even the persistence adapter will stay the same. However, the query language is completely different, so we can create new repositories which use the same persistence mechanism, implement the same repository interfaces, but build the queries using the MongoDB query language instead of SQL.

With a Command/Query Bus

In the case that our application uses a Command/Query Bus, the diagram stays pretty much the same, with the exception that the Controller now depends on the Bus and on a Command or a Query. It will instantiate the Command or the Query and pass it along to the Bus, which will find the appropriate handler to receive and handle it.

In the diagram below, the Command Handler then uses an Application Service. However, that is not always needed; in fact, in most cases the handler will contain all the logic of the use case. We only need to extract logic from the handler into a separate Application Service if we need to reuse that same logic in another handler.

[EDIT – 2017-11-18] I completely missed the DTO I use to return data from the query, so I added it now. Thanks to MorphineAdministered who pointed it out for me.

You might have noticed that there is no dependency between the Bus and the Command, the Query or the Handlers. This is because they should, in fact, be unaware of each other in order to provide good decoupling. The way the Bus knows which Handler should handle which Command, or Query, is set up with mere configuration.
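
A minimal sketch of a bus whose Command-to-Handler mapping is plain configuration (the bus, command and handler names are hypothetical):

import java.util.HashMap;
import java.util.Map;
import java.util.function.Consumer;

// The Bus knows nothing about concrete Commands or Handlers;
// the mapping is provided from the outside, as configuration.
class CommandBus {
    private final Map<Class<?>, Consumer<Object>> handlers = new HashMap<>();

    @SuppressWarnings("unchecked")
    <C> void register(Class<C> commandType, Consumer<C> handler) {
        handlers.put(commandType, (Consumer<Object>) handler);
    }

    void dispatch(Object command) {
        // A real bus would fail loudly if no handler is registered for the command
        handlers.get(command.getClass()).accept(command);
    }
}

record DeactivateUserCommand(String userId) {}

// Wiring, typically done at application bootstrap:
//   bus.register(DeactivateUserCommand.class, cmd -> deactivateUserHandler.handle(cmd));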

As you can see, in both cases all the arrows (the dependencies) that cross the border of the application core point inwards. As explained before, this is a fundamental rule of Ports & Adapters Architecture, Onion Architecture and Clean Architecture.

Conclusion

The goal, as always, is to have a codebase that is loosely coupled and highly cohesive, so that changes are easy, fast and safe to make.

Plans are worthless, but planning is everything.

Eisenhower

This infographic is a concept map. Knowing and understanding all of these concepts will help us plan for a healthy architecture, a healthy application.

Nevertheless:

The map is not the territory.

Alfred Korzybski

Meaning that these are just guidelines! The application is the territory, the reality, the concrete use case where we need to apply our knowledge, and that is what will define what the actual architecture will look like!

We need to understand all these patterns, but we also always need to think about and understand exactly what our application needs, and how far we should go for the sake of decoupling and cohesiveness. This decision can depend on plenty of factors, starting with the project’s functional requirements, but can also include factors like the time-frame to build the application, the lifespan of the application, the experience of the development team, and so on.

This is it, this is how I make sense of it all. This is how I rationalize it in my head.

However, how do we make all this explicit in the code base? That’s the subject of my next post about how I reflect the architecture and domain, in the code.

Last but not least, thanks to my colleague Francesco Mastrogiacomo, for helping me make my infographic look nice. 🙂

From https://herbertograca.com/2017/11/16/explicit-architecture-01-ddd-hexagonal-onion-clean-cqrs-how-i-put-it-all-together/

Clean Architecture: Standing on the shoulders of giants

This post is part of The Software Architecture Chronicles, a series of posts about Software Architecture. In them, I write about what I’ve learned on Software Architecture, how I think of it, and how I use that knowledge. The contents of this post might make more sense if you read the previous posts in this series.

Robert C. Martin (AKA Uncle Bob) published his ideas about Clean Architecture back in 2012, in a post on his blog, and lectured about it at a few conferences.

The Clean Architecture leverages well-known and not so well-known concepts, rules, and patterns, explaining how to fit them together, to propose a standardised way of building applications.

Standing on the shoulders of EBI, Hexagonal and Onion Architectures

The core objectives behind Clean Architecture are the same as for Ports & Adapters (Hexagonal) and Onion Architectures:

  • Independence of tools;
  • Independence of delivery mechanisms;
  • Testability in isolation.

In the post in which the Clean Architecture was published, this was the diagram used to explain the global idea:

Robert C. Martin 2012, The Clean Architecture

As Uncle Bob himself says in his post, the diagram above is an attempt at integrating the most recent architecture ideas into a single actionable idea.

Let’s compare the Clean Architecture diagram with the diagrams used to explain Hexagonal Architecture and Onion Architecture, and see where they coincide:

 

  • Externalisation of tools and delivery mechanisms

    Hexagonal Architecture focuses on externalising the tools and the delivery mechanisms from the application, using interfaces (ports) and adapters. This is also one of the core fundamentals of Onion Architecture, as we can see in its diagram: the UI, the infrastructure and the tests are all in the outermost layer. The Clean Architecture has exactly the same characteristic, placing the UI, the web, the DB, etc., in the outermost layer. In the end, all application core code is framework/library independent.

  • Dependencies direction

    In the Hexagonal Architecture, we don’t have anything explicitly telling us the direction of the dependencies. Nevertheless, we can easily infer it: the Application has a port (an interface) which must be implemented or used by an adapter. So the Adapter depends on the interface; it depends on the application, which is in the centre. What is outside depends on what is inside: the direction of the dependencies is towards the centre. In the Onion Architecture diagram, we also don’t have anything explicitly telling us the dependencies direction; however, in his second post, Jeffrey Palermo states very clearly that all dependencies are toward the centre. The Clean Architecture diagram, in turn, is quite explicit in pointing out that the dependencies direction is towards the centre. They all introduce the Dependency Inversion Principle at the architectural level: nothing in an inner circle can know anything at all about something in an outer circle. Furthermore, when we pass data across a boundary, it is always in the form that is most convenient for the inner circle.

  • Layers

    The Hexagonal Architecture diagram only shows us two layers: inside of the application and outside of the application. The Onion Architecture, on the other hand, brings to the mix the application layers identified by DDD: Application Services holding the use case logic; Domain Services encapsulating domain logic that does not belong in Entities nor Value Objects; and the Entities, Value Objects, etc. When compared to the Onion Architecture, the Clean Architecture maintains the Application Services layer (Use Cases) and the Entities layer, but it seems to forget about the Domain Services layer. However, reading Uncle Bob’s post we realise that he considers an Entity not only as an Entity in the DDD sense but as any Domain object: “An entity can be an object with methods, or it can be a set of data structures and functions.” In reality, he merged those two innermost layers to simplify the diagram.

  • Testability in isolation

    In all three architecture styles, the rules they abide by provide isolation of the application and domain logic. This means that in all cases we can simply mock the external tools and delivery mechanisms and test the application code in isolation, without using any DB nor making any HTTP requests.

As we can see, Clean Architecture incorporates the rules of Hexagonal Architecture and Onion Architecture. So far, the Clean Architecture does not add anything new to the equation. However, in the bottom right corner of the Clean Architecture diagram, we can see a small extra diagram…

Standing on the shoulders of MVC and EBI

The small extra diagram in the bottom right corner of the Clean Architecture diagram explains how the flow of control works. That small diagram does not give us much information, but the blog post explanations and the conference lectures given by Robert C. Martin expand on the subject.

[Diagram: Clean Architecture flow of control]

In the diagram above, on the left side, we have the View and the Controller of MVC. Everything inside/between the black double lines represents the Model in MVC. That Model also represents the EBI Architecture (we can clearly see the Boundaries, the Interactor and the Entities), the “Application” in Hexagonal Architecture, the “Application Core” in the Onion Architecture, and the “Entities” and “Use Cases” layers in the Clean Architecture diagram above.

Following the control flow, we have an HTTP Request that reaches the Controller. The Controller will then (a code sketch of these pieces follows the list):

  1. Dismantle the Request;
  2. Create a Request Model with the relevant data;
  3. Execute a method in the Interactor (which was injected into the Controller using the Interactor’s interface, the Boundary), passing it the Request Model;
  4. The Interactor:
    1. Uses the Entity Gateway Implementation (which was injected into the Interactor using the Entity Gateway Interface) to find the relevant Entities;
    2. Orchestrates interactions between Entities;
    3. Creates a Response Model with the data result of the Operation;
    4. Populates the Presenter giving it the Response Model;
    5. Returns the Presenter to the Controller;
  5. The Controller then uses the Presenter to generate a ViewModel;
  6. Binds the ViewModel to the View;
  7. Returns the View to the client.
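
The sketch below is one simplified rendition of these pieces (the use case and all names are invented for illustration, and the Entity logic is reduced to a single stand-in line):

// Request and Response Models: plain data structures crossing the boundary
record CreateOrderRequest(String customerId, String productId, int quantity) {}
record CreateOrderResponse(String orderId, long totalCents) {}

// The Boundary: the interface through which the Controller calls the Interactor
interface CreateOrderBoundary {
    void execute(CreateOrderRequest request, CreateOrderPresenter presenter);
}

// The Presenter is populated by the Interactor with the Response Model
interface CreateOrderPresenter {
    void present(CreateOrderResponse response);
}

// Entity Gateway: how the Interactor reaches the Entities / persistence
interface OrderGateway {
    String save(String customerId, String productId, int quantity, long totalCents);
}

// The Interactor: orchestrates Entities and produces the Response Model
class CreateOrderInteractor implements CreateOrderBoundary {
    private final OrderGateway orders;

    CreateOrderInteractor(OrderGateway orders) { this.orders = orders; }

    @Override
    public void execute(CreateOrderRequest request, CreateOrderPresenter presenter) {
        long totalCents = request.quantity() * 1_000L;  // stand-in for real Entity logic
        String orderId = orders.save(request.customerId(), request.productId(),
                                     request.quantity(), totalCents);
        presenter.present(new CreateOrderResponse(orderId, totalCents));
    }
}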

The only thing here that causes me some friction, and that I do differently in my projects, is the usage of the “Presenter”. I’d rather have the Interactor return the data in some kind of DTO, as opposed to injecting an object that gets populated with data.

What I usually do is the actual MVP implementation, where the Controller has the responsibility of receiving and responding to the client.

Conclusion

I would not say that the Clean Architecture is revolutionary because it does not actually bring a new groundbreaking concept or pattern to the table.

However, I would say that it is a work of the utmost importance:

  • It recovers somewhat forgotten concepts, rules, and patterns;
  • It clarifies useful and important concepts, rules and patterns;
  • It tells us how all these concepts, rules and patterns fit together to provide us with a standardised way to build complex applications with maintainability in mind.

When I think about Uncle Bob’s work on the Clean Architecture, it makes me think of Isaac Newton. Gravity had always been there; everybody knew that if we release an apple one meter above the ground, it will move towards the ground. The “only” thing Newton did was to publish a paper making that fact explicit*. It was a “simple” thing to do, but it allowed people to reason about it and use that concrete idea as a foundation for other ideas.

In other words, I see Robert C. Martin as the Isaac Newton of software development! 🙂

Resources

2012 – Robert C. Martin – Clean Architecture (NDC 2012)

2012 – Robert C. Martin – The Clean Architecture

2012 – Benjamin Eberlei – OOP Business Applications: Entity, Boundary, Interactor

2017 – Lieven Doclo – A couple of thoughts on Clean Architecture

2017 – Grzegorz Ziemoński – Clean Architecture Is Screaming

* I know Sir Isaac Newton did more than that, but I just want to emphasize how important I consider the views of Robert C. Martin.

From https://herbertograca.com/2017/09/28/clean-architecture-standing-on-the-shoulders-of-giants/

Clean Architecture Is Screaming

Uncle Bob’s Clean Architecture keeps your application flexible, testable, and highlights its use cases. But there is a cost: No idiomatic framework usage!

Welcome to the fifth installment of the little architecture series! So far we have covered layers, hexagons, onions, and features. Today, we’ll look at a close friend of all four – Uncle Bob’s Clean Architecture, initially introduced here.

What Is Clean Architecture?

Clean Architecture builds upon the previously introduced four concepts and aligns the project with best practices like the Dependency Inversion Principle or Use Cases. It also aims for maximum independence from any frameworks or tools that might stand in the way of the application’s testability or their replacement.

Clean Architecture divides our system into four layers, usually represented by circles:

  • Entities, which contain enterprise-wide business rules. You can think of them as Domain Entities a la DDD.
  • Use cases, which contain application-specific business rules. These would be counterparts to Application Services with the caveat that each class should focus on one particular Use Case.
  • Interface adapters, which contain adapters to peripheral technologies. Here, you can expect MVC, Gateway implementations and the like.
  • Frameworks and drivers, which contain tools like databases or frameworks. By default, you don’t code too much in this layer, but it’s important to clearly state the place and priority that those tools have in your architecture.

Between the circles, there is a strong dependency rule – no code in the inner circle can directly reference a piece of code from the outer circle. All outward communication should happen via interfaces. It’s exactly the same dependency rule as we introduced in the Onion Architecture post.

Apart from the layers, Clean Architecture gives us some tips about the classes we need to implement. As you can see in the picture below, the flow of control from the Controller to the Use Case goes through an Input Port interface and flows to the Presenter through an Output Port interface. This ensures that the Use Case and the user interface are properly decoupled. We’ll see an example of this later in the implementation section.
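
A minimal sketch of what such Input and Output Ports could look like (the use case and method names are hypothetical):

// Input Port: the interface the Controller calls
interface RegisterUserInputPort {
    void register(String email, String plainPassword);
}

// Output Port: the interface the Use Case calls back; a Presenter implements it
interface RegisterUserOutputPort {
    void userRegistered(String userId);
    void registrationFailed(String reason);
}

// The Use Case depends only on the two ports, never on a concrete Controller or Presenter
class RegisterUserUseCase implements RegisterUserInputPort {
    private final RegisterUserOutputPort output;

    RegisterUserUseCase(RegisterUserOutputPort output) { this.output = output; }

    @Override
    public void register(String email, String plainPassword) {
        if (plainPassword.length() < 8) {
            output.registrationFailed("password too short");
            return;
        }
        // Stand-in for the real registration logic
        output.userRegistered(java.util.UUID.randomUUID().toString());
    }
}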

The Essence of Clean Architecture

I see two things that make Clean Architecture distinct and potentially more effective than other architectural styles: strong adherence to the Dependency Inversion Principle and Use Case orientation.

Strong Adherence to DIP

Similarly to its Onion cousin, Clean Architecture introduces the Dependency Inversion Principle at the architectural level. This way, it explicitly states the priorities between different kinds of objects in your system. In a way, Clean Architecture does a better job at this, as it leaves no doubt about the tools like frameworks or databases – they have a dedicated layer outside all others.

Use Case Orientation

Similarly to what we’ve seen in Package by Feature, Clean Architecture promotes vertical slicing of the code, leaving the layers mostly at the class level. The major difference between the two is that instead of focusing on a blurry concept of a feature, it reorients the packaging towards Use Cases. This is important as, ultimately, in any application that has some sort of GUI, one could identify real Use Cases. It’s also important to note that entities sit in a different layer because, in complex systems, one Use Case can orchestrate several entities to cooperate, and categorizing it by the type of a single entity would be artificial.

Implementing Clean Architecture

We can’t be 100% sure about how Uncle Bob would implement a Clean Architecture today, as his book about it comes out in July (I preordered it already and will do a review or rehash of this post then), but we can look at the GitHub repository of his Clean Code Case study:

Packaging

As you can see, the Entities and Use Cases layers have their own separate packages, while the other layers can be identified only conceptually. The socketserver, http, and view packages can be considered part of Frameworks and Drivers, while the gateways package is a little bit ambiguous – the implementations surely belong to the Interface Adapters layer, but the interfaces conceptually belong to Use Cases. My guess is that the interfaces are extracted to a separate package so they can be shared between different Use Cases. But it’s just a guess!

By looking at the usecases.codecastSummeries package, we can get more insight into what a complete Use Case package looks like. As you can see, it accommodates all classes related to the execution of a particular Use Case: the view, controller, presenter, boundaries, view and response models, and the Use Case class itself. This might be a lot more classes than you usually see in your projects when you execute an Application Service, but that’s what it takes to go perfectly Clean.

Internals

If you dug deeper into the implementation of the project’s classes, you’d see no annotations there other than @Override. That’s because the frameworks sit at the very outer layer of the architecture – the code is not allowed to reference them directly. You might ask, how could I leverage Spring in such a project? Well, if you really wanted to, you’d have to do some XML configuration or do it using @Configuration and @Bean classes. No @Service, no @Autowired, sorry!
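
As an illustration, and assuming a hypothetical use case and notifier, the wiring could live in a Spring @Configuration class while the Application Core classes stay annotation-free:

import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

// Application Core classes: no framework annotations at all
interface Notifier { void notify(String userId, String message); }

class EmailNotifier implements Notifier {
    public void notify(String userId, String message) { /* send an email */ }
}

class WelcomeUserService {
    private final Notifier notifier;
    WelcomeUserService(Notifier notifier) { this.notifier = notifier; }
    public void welcome(String userId) { notifier.notify(userId, "Welcome!"); }
}

// All the Spring-specific wiring lives in the outermost layer,
// in a @Configuration class, instead of @Service/@Autowired on the core classes.
@Configuration
class ApplicationWiring {

    @Bean
    Notifier notifier() {
        return new EmailNotifier();
    }

    @Bean
    WelcomeUserService welcomeUserService(Notifier notifier) {
        return new WelcomeUserService(notifier);
    }
}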

My Extra Advice

Pulling off a Clean Architecture might be a demanding task, especially if you worked with a Package by Layer, fat-controllers kind of project before. And even if you get the idea and the necessary skills to implement it, your colleagues might not. They might want to do it anyway, but simply forget to add interfaces or to work around framework annotations when necessary. One way to prevent some of these issues could be to create a Maven module for each of the layers, so that breaking a rule won’t even compile. At the same time, if you don’t have these already, introducing Pair Programming or Code Reviews will help you prevent people from messing up the dependency declarations (circular dependencies would not work, but adding Spring to both the Use Cases and the Adapters module would!).

Benefits of a Clean Architecture

  • Screaming – Use Cases are clearly visible in the project’s structure
  • Flexible – you should be able to switch frameworks, databases or application servers like pairs of gloves
  • Testable – all the interfaces around let you set up the test scope and outside interactions in any way you want
  • Could play well with best practices like DDD – to be honest, I haven’t seen it so far, but I also don’t see anything stopping you from making an effective mix of DDD’s Strategic and Tactical Patterns with Clean Architecture

Drawbacks of Clean Architecture

  • No Idiomatic Framework Usage – the dependency rule is relentless in this area
  • Learning Curve – it’s harder to grasp than the other styles, especially considering the point above
  • Indirect – there will be a lot more interfaces than one might expect (I don’t see it as necessarily bad, but I’ve seen people pointing this out)
  • Heavy – in the sense that you might end up with a lot more classes than you currently have in your projects (again, the extra classes are not necessarily bad)

When to Use Clean Architecture

Before I say something, let me note that I haven’t tried implementing it in a professional context yet, so all of it is a gut feeling. We will probably get some more knowledgeable advice from Uncle Bob himself in his upcoming book.

If we consider Clean Architecture‘s biggest drawbacks and its essence, I would derive the following criteria to consider:

  • Is the team skilled and/or convinced enough? One might consider this lame, but if people just don’t get it or they don’t want to do this, imposing the rigor of Clean Architecture on them might be counter-productive.
  • Will the system outlive major framework releases? Since we’re talking about heavy technology flexibility here, it’s important to consider if we’ll ever capitalize on this benefit. My experience so far suggests that most systems will, even if the developers won’t be in the company by then.
  • Will the system outlive the developers’ and stakeholders’ employment? Since Clean Architecture is so sound and makes Use Cases so clearly visible, systems that follow its principles will be much simpler to comprehend in the code, even if those who wrote it and asked for it are already gone.

Summary

Clean Architecture looks like a very carefully thought-out and effective architecture. It makes the big leap of recognizing the mismatch between Use Cases and Entities and puts the former in the driving seat of our system. It also gives a clear place for Frameworks and Drivers in our system – a separate layer outside all other layers. This, combined with the dependency rule, might give us a plethora of benefits, but it also might be way harder to pull off. In the end, it boils down to the question of whether the system will live long enough for the investment to pay off.

From https://dzone.com/articles/clean-architecture-is-screaming