diff --git a/.gitignore b/.gitignore
index a59ec565..1bd621cb 100644
--- a/.gitignore
+++ b/.gitignore
@@ -6,6 +6,7 @@ cis565_getting_started_generated_kernel*
*.vcxproj
*.xcodeproj
build
+profiling
# Created by https://www.gitignore.io/api/linux,osx,sublimetext,windows,jetbrains,vim,emacs,cmake,c++,cuda,visualstudio,webstorm,eclipse,xcode
diff --git a/README.md b/README.md
index 0e38ddb1..f0d67ae3 100644
--- a/README.md
+++ b/README.md
@@ -3,12 +3,162 @@ CUDA Stream Compaction
**University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 2**
-* (TODO) YOUR NAME HERE
- * (TODO) [LinkedIn](), [personal website](), [twitter](), etc.
-* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab)
+- Zhanbo Lin
+ - [LinkedIn](https://www.linkedin.com/in/zhanbo-lin)
+- Tested on: Windows 10, i5-10400F @ 2.90GHz 48GB, RTX-3080 10GB (Personal)
+- GPU Compute Capability: 8.6
+
+## Project Description ##
+
+In this project I implemented a few versions of prefix sum (scan) and stream compaction algorithm using both CPU and CUDA. I then benchmarked and compared their performance and scalability.
+
+### Algorithm implementations ###
+#### Scan ####
+- __Sequential__: Straightforward sequential implementation (CPU)
+- __Naive__: Basic parallel implementation (GPU) - "Stream Reduction Operations for GPGPU Applications." by Horn, Daniel in GPU Gems 2
+- __Efficient__: Work-efficient parallel implementation (GPU)
+- __Efficient-Plus__: Work-efficient parallel implementation using shared memory to speed up (GPU)
+- __Thrust__: Implementation using NVIDIA’s Thrust library (GPU)
+
+#### Stream Compaction ####
+- __Sequential Compaction without Scan__: Straightforward sequential implementation (CPU)
+
+- __Sequential Compaction with Scan__: Performs a sequential scan first, then uses the result for compaction (CPU)
+
+- __Parallel Compaction__: Performs an __Efficient__ scan first, followed by compaction (GPU)
+
+### Benchmarking Setup ###
+
+Since scan does the heavy lifting in stream compaction, we benchmarked only the scan algorithm’s execution time with input sizes ranging from $2^{23}$ to $2^{28}$, and focused our performance analysis on it.
+
+Each algorithm was executed 8 times, and the __average execution time__ was recorded.
+
+
+## Performance Analysis ##
+
+### Scan Algorithm Benchmarking Results (Execution Time by Data Size) ###
+
+
+---
+
+### Bottleneck Analysis of Different Implementations ###
+
+#### Naive ####
+It is the most computationally expensive of the parallel implementations, but it still suffers from imbalanced compute and memory throughput. Although the calculations are evenly distributed across kernel executions, the workload remains skewed toward memory throughput, causing long scoreboard stalls and hurting performance.
+
+
+Imbalance between compute and memory throughput
+
+
+Long scoreboard stalls due to global memory fetches
+
+
+
+#### Efficient ####
+The work-efficient version makes use of intermediate results and reduces the number of repetitive calculations, freeing more threads earlier and therefore scaling better with data size.
+
+However, it still suffers from the same problem of imbalanced compute and memory throughput. Most kernel invocations are filled with computationally light threads, causing poor utilization of available compute resources.
+
+Imbalance between compute and memory throughput
+
+
+#### Efficient-Plus ####
+This version builds on the work-efficient implementation. It pre-fetches data into shared memory and performs all levels of up-sweep and down-sweep within a single kernel call. This minimizes data transfers between the L1 cache and global memory, alleviating the long scoreboard stalls observed in the previous implementation.
+
+
+Balanced compute and memory throughput
+
+It also supports inputs of arbitrary size by dividing the array into blocks and padding only the last block. Compared to the power-of-two padding required in previous versions, this greatly reduces unnecessary computations, especially for large data sizes.
+
+
+GPU Gems 3: Figure 39-6
+
+A potential improvement is to avoid bank conflicts when reading from shared memory. This can be done by adding padding to the indices used when reading from and writing to shared memory.
+
+In addition, to ensure block-level synchronization, my implementation writes the block-increment array to global memory before the next kernel launch, and that kernel reads the data back. There might be a better way to do this.
+
+
+#### Thrust ####
+
+The Thrust library’s implementation is more than twice as fast as the Efficient-Plus version. I think this is largely due to its higher memory throughput.
+
+Interestingly, its L1 cache hit rate is nearly zero; I have not yet determined the cause.
+
+
+Memory throughput comparison: Thrust (left) vs. Efficient-Plus (right)
+
+---
+### Sample Outputs ###
+Measured with data size = $2^{26}$
+```
+****************
+** SCAN TESTS **
+****************
+ [ 16 1 47 ... 25 0 ]
+==== cpu scan, power-of-two ====
+ elapsed time: 51.5072ms (std::chrono Measured)
+ [ 0 16 17 ... 1643369755 1643369780 ]
+==== cpu scan, non-power-of-two ====
+ elapsed time: 43.3375ms (std::chrono Measured)
+ [ 0 16 17 ... 1643369691 1643369733 ]
+ passed
+==== naive scan, power-of-two ====
+ elapsed time: 24.6601ms (CUDA Measured)
+ passed
+==== naive scan, non-power-of-two ====
+ elapsed time: 23.1905ms (CUDA Measured)
+ passed
+==== work-efficient scan, power-of-two ====
+ elapsed time: 9.14457ms (CUDA Measured)
+ passed
+==== work-efficient scan, non-power-of-two ====
+ elapsed time: 9.06283ms (CUDA Measured)
+ [ 0 16 17 ... 1643369691 1643369733 ]
+ passed
+==== work-efficient-plus scan, power-of-two ====
+ elapsed time: 2.25453ms (CUDA Measured)
+ passed
+==== work-efficient-plus scan, non-power-of-two ====
+ elapsed time: 2.2231ms (CUDA Measured)
+ passed
+==== thrust scan, power-of-two ====
+ elapsed time: 1.01891ms (CUDA Measured)
+ passed
+==== thrust scan, non-power-of-two ====
+ elapsed time: 1.02716ms (CUDA Measured)
+ passed
+
+*****************************
+** STREAM COMPACTION TESTS **
+*****************************
+==== cpu compact without scan, power-of-two ====
+ elapsed time: 165.528ms (std::chrono Measured)
+ [ 1 1 1 ... 3 1 ]
+ passed
+==== cpu compact without scan, non-power-of-two ====
+ elapsed time: 167.281ms (std::chrono Measured)
+ [ 1 1 1 ... 3 3 ]
+ passed
+==== cpu compact with scan ====
+ elapsed time: 276.596ms (std::chrono Measured)
+ [ 1 1 1 ... 3 1 ]
+ passed
+==== work-efficient compact, power-of-two ====
+ elapsed time: 110.745ms (CUDA Measured)
+ [ 1 1 1 ... 3 1 ]
+ passed
+==== work-efficient compact, non-power-of-two ====
+ elapsed time: 11.9542ms (CUDA Measured)
+ [ 1 1 1 ... 3 3 ]
+ passed
+```
+
+
+
+
+
+
+
-### (TODO: Your README)
-Include analysis, etc. (Remember, this is public, so don't put
-anything here that you don't want to share with the world.)
diff --git a/image.png b/image.png
new file mode 100644
index 00000000..c7ba3121
Binary files /dev/null and b/image.png differ
diff --git a/img/arbitrary_size.png b/img/arbitrary_size.png
new file mode 100644
index 00000000..21d2a0d2
Binary files /dev/null and b/img/arbitrary_size.png differ
diff --git a/img/efficient/summary.png b/img/efficient/summary.png
new file mode 100644
index 00000000..fca7dc5d
Binary files /dev/null and b/img/efficient/summary.png differ
diff --git a/img/efficient_plus/long_scoreboard.png b/img/efficient_plus/long_scoreboard.png
new file mode 100644
index 00000000..f76a5818
Binary files /dev/null and b/img/efficient_plus/long_scoreboard.png differ
diff --git a/img/efficient_plus/memory_workload_analysis_add.png b/img/efficient_plus/memory_workload_analysis_add.png
new file mode 100644
index 00000000..1c919521
Binary files /dev/null and b/img/efficient_plus/memory_workload_analysis_add.png differ
diff --git a/img/efficient_plus/memory_workload_analysis_scan.png b/img/efficient_plus/memory_workload_analysis_scan.png
new file mode 100644
index 00000000..092ac24b
Binary files /dev/null and b/img/efficient_plus/memory_workload_analysis_scan.png differ
diff --git a/img/efficient_plus/summary.png b/img/efficient_plus/summary.png
new file mode 100644
index 00000000..b1f76001
Binary files /dev/null and b/img/efficient_plus/summary.png differ
diff --git a/img/naive/long_scoreboard.png b/img/naive/long_scoreboard.png
new file mode 100644
index 00000000..0e7458ca
Binary files /dev/null and b/img/naive/long_scoreboard.png differ
diff --git a/img/naive/summary.png b/img/naive/summary.png
new file mode 100644
index 00000000..d91ec2e5
Binary files /dev/null and b/img/naive/summary.png differ
diff --git a/img/result.png b/img/result.png
new file mode 100644
index 00000000..c2d30ece
Binary files /dev/null and b/img/result.png differ
diff --git a/img/thrust/memory_workload_analysis_DeviceScanKernel.png b/img/thrust/memory_workload_analysis_DeviceScanKernel.png
new file mode 100644
index 00000000..d361d517
Binary files /dev/null and b/img/thrust/memory_workload_analysis_DeviceScanKernel.png differ
diff --git a/img/thrust/thrust.png b/img/thrust/thrust.png
new file mode 100644
index 00000000..ed748731
Binary files /dev/null and b/img/thrust/thrust.png differ
diff --git a/img/thrust/vs_efficient_plus_memory_throughput.png b/img/thrust/vs_efficient_plus_memory_throughput.png
new file mode 100644
index 00000000..a1343fa9
Binary files /dev/null and b/img/thrust/vs_efficient_plus_memory_throughput.png differ
diff --git a/img/time_vs_size.png b/img/time_vs_size.png
new file mode 100644
index 00000000..401ee4b5
Binary files /dev/null and b/img/time_vs_size.png differ
diff --git a/src/main.cpp b/src/main.cpp
index 3d5c8820..5d4358f0 100644
--- a/src/main.cpp
+++ b/src/main.cpp
@@ -13,15 +13,29 @@
#include
#include "testing_helpers.hpp"
-const int SIZE = 1 << 8; // feel free to change the size of array
+const int SIZE = 1 << 26; // feel free to change the size of array
const int NPOT = SIZE - 3; // Non-Power-Of-Two
int *a = new int[SIZE];
-int *b = new int[SIZE];
+int *cpuResult = new int[SIZE];
int *c = new int[SIZE];
+#define PROFILING 1
+
+#define TEST_ITERATIONS 8
+
+#define SCAN_TEST 1
+#define SCAN_CPU 1
+#define SCAN_NAIVE 1
+#define SCAN_EFFICIENT 1
+#define SCAN_EFFICIENT_PLUS 1
+#define SCAN_THRUST 1
+
+#define COMPACT_TEST 1
+
+
int main(int argc, char* argv[]) {
+#if SCAN_TEST
// Scan tests
-
printf("\n");
printf("****************\n");
printf("** SCAN TESTS **\n");
@@ -31,92 +45,122 @@ int main(int argc, char* argv[]) {
a[SIZE - 1] = 0;
printArray(SIZE, a, true);
- // initialize b using StreamCompaction::CPU::scan you implement
- // We use b for further comparison. Make sure your StreamCompaction::CPU::scan is correct.
- // At first all cases passed because b && c are all zeroes.
- zeroArray(SIZE, b);
+ // initialize cpuResult using StreamCompaction::CPU::scan you implement
+ // We use cpuResult for further comparison. Make sure your StreamCompaction::CPU::scan is correct.
+ // At first all cases passed because cpuResult && c are all zeroes.
+ float t;
+
+#if SCAN_CPU
printDesc("cpu scan, power-of-two");
- StreamCompaction::CPU::scan(SIZE, b, a);
- printElapsedTime(StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation(), "(std::chrono Measured)");
- printArray(SIZE, b, true);
+ t = testForIterations(TEST_ITERATIONS, &PerformanceTimer::getCpuElapsedTimeForPreviousOperation, StreamCompaction::CPU::timer(),
+ StreamCompaction::CPU::scan, SIZE, cpuResult, a);
+ printElapsedTime(t, "(std::chrono Measured)");
+ printArray(SIZE, cpuResult, true);
zeroArray(SIZE, c);
printDesc("cpu scan, non-power-of-two");
- StreamCompaction::CPU::scan(NPOT, c, a);
- printElapsedTime(StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation(), "(std::chrono Measured)");
+ t = testForIterations(TEST_ITERATIONS, &PerformanceTimer::getCpuElapsedTimeForPreviousOperation, StreamCompaction::CPU::timer(),
+ StreamCompaction::CPU::scan, NPOT, c, a);
+ printElapsedTime(t, "(std::chrono Measured)");
printArray(NPOT, c, true);
- printCmpResult(NPOT, b, c);
+ printCmpResult(NPOT, cpuResult, c);
+#endif
- zeroArray(SIZE, c);
+#if SCAN_NAIVE
printDesc("naive scan, power-of-two");
- StreamCompaction::Naive::scan(SIZE, c, a);
- printElapsedTime(StreamCompaction::Naive::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
- //printArray(SIZE, c, true);
- printCmpResult(SIZE, b, c);
-
- /* For bug-finding only: Array of 1s to help find bugs in stream compaction or scan
- onesArray(SIZE, c);
- printDesc("1s array for finding bugs");
- StreamCompaction::Naive::scan(SIZE, c, a);
- printArray(SIZE, c, true); */
+ zeroArray(SIZE, c);
+ t = testForIterations(TEST_ITERATIONS, &PerformanceTimer::getGpuElapsedTimeForPreviousOperation, StreamCompaction::Naive::timer(),
+ StreamCompaction::Naive::scan, SIZE, c, a);
+ printElapsedTime(t, "(CUDA Measured)");
+ // printArray(SIZE, c, true);
+ printCmpResult(SIZE, cpuResult, c);
zeroArray(SIZE, c);
printDesc("naive scan, non-power-of-two");
- StreamCompaction::Naive::scan(NPOT, c, a);
- printElapsedTime(StreamCompaction::Naive::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
- //printArray(SIZE, c, true);
- printCmpResult(NPOT, b, c);
-
+ t = testForIterations(TEST_ITERATIONS, &PerformanceTimer::getGpuElapsedTimeForPreviousOperation, StreamCompaction::Naive::timer(),
+ StreamCompaction::Naive::scan, NPOT, c, a);
+ printElapsedTime(t, "(CUDA Measured)");
+ // printArray(NPOT, c, true);
+ printCmpResult(NPOT, cpuResult, c);
+#endif
+
+#if SCAN_EFFICIENT
zeroArray(SIZE, c);
printDesc("work-efficient scan, power-of-two");
- StreamCompaction::Efficient::scan(SIZE, c, a);
- printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
+ t = testForIterations(TEST_ITERATIONS, &PerformanceTimer::getGpuElapsedTimeForPreviousOperation, StreamCompaction::Efficient::timer(),
+ StreamCompaction::Efficient::scan, SIZE, c, a);
+ printElapsedTime(t, "(CUDA Measured)");
+ // printArray(SIZE, c, true);
+ printCmpResult(SIZE, cpuResult, c);
+
+ zeroArray(SIZE, c);
+ printDesc("work-efficient scan, non-power-of-two");
+ t = testForIterations(TEST_ITERATIONS, &PerformanceTimer::getGpuElapsedTimeForPreviousOperation, StreamCompaction::Efficient::timer(),
+ StreamCompaction::Efficient::scan, NPOT, c, a);
+ printElapsedTime(t, "(CUDA Measured)");
+ printArray(NPOT, c, true);
+ printCmpResult(NPOT, cpuResult, c);
+#endif
+
+#if SCAN_EFFICIENT_PLUS
+ zeroArray(SIZE, c);
+ printDesc("work-efficient-plus scan, power-of-two");
+ t = testForIterations(TEST_ITERATIONS, &PerformanceTimer::getGpuElapsedTimeForPreviousOperation, StreamCompaction::EfficientPlus::timer(),
+ StreamCompaction::EfficientPlus::scan, SIZE, c, a);
+ printElapsedTime(t, "(CUDA Measured)");
//printArray(SIZE, c, true);
- printCmpResult(SIZE, b, c);
+ printCmpResult(SIZE, cpuResult, c);
zeroArray(SIZE, c);
- printDesc("work-efficient scan, non-power-of-two");
- StreamCompaction::Efficient::scan(NPOT, c, a);
- printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
+ printDesc("work-efficient-plus scan, non-power-of-two");
+ t = testForIterations(TEST_ITERATIONS, &PerformanceTimer::getGpuElapsedTimeForPreviousOperation, StreamCompaction::EfficientPlus::timer(),
+ StreamCompaction::EfficientPlus::scan, NPOT, c, a);
+ printElapsedTime(t, "(CUDA Measured)");
//printArray(NPOT, c, true);
- printCmpResult(NPOT, b, c);
+ printCmpResult(NPOT, cpuResult, c);
+#endif
+#if SCAN_THRUST
zeroArray(SIZE, c);
printDesc("thrust scan, power-of-two");
- StreamCompaction::Thrust::scan(SIZE, c, a);
- printElapsedTime(StreamCompaction::Thrust::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
+ t = testForIterations(TEST_ITERATIONS, &PerformanceTimer::getGpuElapsedTimeForPreviousOperation, StreamCompaction::Thrust::timer(),
+ StreamCompaction::Thrust::scan, SIZE, c, a);
+ printElapsedTime(t, "(CUDA Measured)");
//printArray(SIZE, c, true);
- printCmpResult(SIZE, b, c);
+ printCmpResult(SIZE, cpuResult, c);
zeroArray(SIZE, c);
printDesc("thrust scan, non-power-of-two");
- StreamCompaction::Thrust::scan(NPOT, c, a);
- printElapsedTime(StreamCompaction::Thrust::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
+ t = testForIterations(TEST_ITERATIONS, &PerformanceTimer::getGpuElapsedTimeForPreviousOperation, StreamCompaction::Thrust::timer(),
+ StreamCompaction::Thrust::scan, NPOT, c, a);
+ printElapsedTime(t, "(CUDA Measured)");
//printArray(NPOT, c, true);
- printCmpResult(NPOT, b, c);
+ printCmpResult(NPOT, cpuResult, c);
+#endif
+
+
+#endif
+#if COMPACT_TEST
printf("\n");
printf("*****************************\n");
printf("** STREAM COMPACTION TESTS **\n");
printf("*****************************\n");
// Compaction tests
-
genArray(SIZE - 1, a, 4); // Leave a 0 at the end to test that edge case
- a[SIZE - 1] = 0;
- printArray(SIZE, a, true);
-
+ a[SIZE - 1] = 0;
int count, expectedCount, expectedNPOT;
- // initialize b using StreamCompaction::CPU::compactWithoutScan you implement
- // We use b for further comparison. Make sure your StreamCompaction::CPU::compactWithoutScan is correct.
- zeroArray(SIZE, b);
+ // initialize cpuResult using StreamCompaction::CPU::compactWithoutScan you implement
+ // We use cpuResult for further comparison. Make sure your StreamCompaction::CPU::compactWithoutScan is correct.
+ zeroArray(SIZE, cpuResult);
printDesc("cpu compact without scan, power-of-two");
- count = StreamCompaction::CPU::compactWithoutScan(SIZE, b, a);
+ count = StreamCompaction::CPU::compactWithoutScan(SIZE, cpuResult, a);
printElapsedTime(StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation(), "(std::chrono Measured)");
expectedCount = count;
- printArray(count, b, true);
- printCmpLenResult(count, expectedCount, b, b);
+ printArray(count, cpuResult, true);
+ printCmpLenResult(count, expectedCount, cpuResult, cpuResult);
zeroArray(SIZE, c);
printDesc("cpu compact without scan, non-power-of-two");
@@ -124,31 +168,37 @@ int main(int argc, char* argv[]) {
printElapsedTime(StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation(), "(std::chrono Measured)");
expectedNPOT = count;
printArray(count, c, true);
- printCmpLenResult(count, expectedNPOT, b, c);
+ printCmpLenResult(count, expectedNPOT, cpuResult, c);
zeroArray(SIZE, c);
printDesc("cpu compact with scan");
count = StreamCompaction::CPU::compactWithScan(SIZE, c, a);
printElapsedTime(StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation(), "(std::chrono Measured)");
printArray(count, c, true);
- printCmpLenResult(count, expectedCount, b, c);
+ printCmpLenResult(count, expectedCount, cpuResult, c);
zeroArray(SIZE, c);
printDesc("work-efficient compact, power-of-two");
count = StreamCompaction::Efficient::compact(SIZE, c, a);
printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
- //printArray(count, c, true);
- printCmpLenResult(count, expectedCount, b, c);
+ printArray(count, c, true);
+ printCmpLenResult(count, expectedCount, cpuResult, c);
zeroArray(SIZE, c);
printDesc("work-efficient compact, non-power-of-two");
count = StreamCompaction::Efficient::compact(NPOT, c, a);
printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
- //printArray(count, c, true);
- printCmpLenResult(count, expectedNPOT, b, c);
+ printArray(count, c, true);
+ printCmpLenResult(count, expectedNPOT, cpuResult, c);
+#endif
+ fflush(stdout);
+#if !PROFILING
system("pause"); // stop Win32 console from closing on exit
+#endif
+
+
delete[] a;
- delete[] b;
+ delete[] cpuResult;
delete[] c;
}
diff --git a/src/testing_helpers.hpp b/src/testing_helpers.hpp
index 025e94aa..e696bb79 100644
--- a/src/testing_helpers.hpp
+++ b/src/testing_helpers.hpp
@@ -60,7 +60,7 @@ void genArray(int n, int *a, int maxval) {
void printArray(int n, int *a, bool abridged = false) {
printf(" [ ");
for (int i = 0; i < n; i++) {
- if (abridged && i + 2 == 15 && n > 16) {
+ if (abridged && i + 2 == 5 && n > 16) {
i = n - 2;
printf("... ");
}
@@ -74,3 +74,17 @@ void printElapsedTime(T time, std::string note = "")
{
std::cout << " elapsed time: " << time << "ms " << note << std::endl;
}
+
+using StreamCompaction::Common::PerformanceTimer;
+template
+float testForIterations(int iteration, TimerFuncType checkTimer, PerformanceTimer& timer, FuncType f, Args... args)
+{
+ float t = 0;
+ for (int i = 0; i < iteration; i++)
+ {
+ f(args...);
+ t += (timer.*checkTimer)();
+ }
+ t /= static_cast(iteration);
+ return t;
+}
diff --git a/stream_compaction/common.cu b/stream_compaction/common.cu
index 2ed6d630..fa06e72a 100644
--- a/stream_compaction/common.cu
+++ b/stream_compaction/common.cu
@@ -22,8 +22,12 @@ namespace StreamCompaction {
* Maps an array to an array of 0s and 1s for stream compaction. Elements
* which map to 0 will be removed, and elements which map to 1 will be kept.
*/
- __global__ void kernMapToBoolean(int n, int *bools, const int *idata) {
- // TODO
+ __global__ void kernMapToBoolean(int n, int *bools, const int *idata)
+ {
+ int idx = threadIdx.x + (blockIdx.x * blockDim.x);
+ if (idx >= n) { return; }
+
+ bools[idx] = (idata[idx] > 0) ? 1 : 0;
}
/**
@@ -31,9 +35,34 @@ namespace StreamCompaction {
* if bools[idx] == 1, it copies idata[idx] to odata[indices[idx]].
*/
__global__ void kernScatter(int n, int *odata,
- const int *idata, const int *bools, const int *indices) {
- // TODO
+ const int *idata, const int *bools, const int *indices)
+ {
+ int idx = threadIdx.x + (blockIdx.x * blockDim.x);
+ if (idx >= n) { return; }
+
+ if (bools[idx] > 0)
+ {
+ int oIdx = indices[idx];
+ odata[oIdx] = idata[idx];
+ }
+ }
+
+ void testKernMapToBoolean(int n, int*oData, const int* idata)
+ {
+ int blocks = (n + 32 - 1) / 32;
+ int *dev_oData,
+ *dev_iData;
+ cudaMalloc(&dev_oData, n * sizeof(int));
+ cudaMalloc(&dev_iData, n * sizeof(int));
+ cudaMemcpy(dev_iData, idata, n * sizeof(int), cudaMemcpyHostToDevice);
+
+ kernMapToBoolean <<< dim3(blocks), dim3(32) >>> (n, dev_oData, dev_iData);
+ cudaMemcpy(oData, dev_oData, n * sizeof(int), cudaMemcpyDeviceToHost);
+
+ cudaFree(dev_oData);
+ cudaFree(dev_iData);
}
+
}
}
diff --git a/stream_compaction/common.h b/stream_compaction/common.h
index d2c1fed9..439b52af 100644
--- a/stream_compaction/common.h
+++ b/stream_compaction/common.h
@@ -30,6 +30,12 @@ inline int ilog2ceil(int x) {
return x == 1 ? 0 : ilog2(x - 1) + 1;
}
+template
+T divUp(T size, T div)
+{
+ return (size + div - 1) / div;
+}
+
namespace StreamCompaction {
namespace Common {
__global__ void kernMapToBoolean(int n, int *bools, const int *idata);
@@ -37,6 +43,7 @@ namespace StreamCompaction {
__global__ void kernScatter(int n, int *odata,
const int *idata, const int *bools, const int *indices);
+ void testKernMapToBoolean(int n, int* oData, const int* idata);
/**
* This class is used for timing the performance
* Uncopyable and unmovable
diff --git a/stream_compaction/cpu.cu b/stream_compaction/cpu.cu
index 719fa115..1a778ce9 100644
--- a/stream_compaction/cpu.cu
+++ b/stream_compaction/cpu.cu
@@ -1,8 +1,11 @@
-#include
#include "cpu.h"
+#include
#include "common.h"
+#define SIMULATE_GPU_SCAN 0
+
+
namespace StreamCompaction {
namespace CPU {
using StreamCompaction::Common::PerformanceTimer;
@@ -19,7 +22,16 @@ namespace StreamCompaction {
*/
void scan(int n, int *odata, const int *idata) {
timer().startCpuTimer();
- // TODO
+#if SIMULATE_GPU_SCAN
+ // placeholder
+#else
+ odata[0] = 0;
+ int sum = 0;
+ for (int i = 1; i < n; i++) {
+ sum += idata[i-1];
+ odata[i] = sum;
+ }
+#endif
timer().endCpuTimer();
}
@@ -30,9 +42,24 @@ namespace StreamCompaction {
*/
int compactWithoutScan(int n, int *odata, const int *idata) {
timer().startCpuTimer();
- // TODO
+ int p1 = 0, p2 = 0;
+ int count = 0;
+ for (;p2 < n && p1 < n;)
+ {
+ if (idata[p2] != 0)
+ {
+ odata[p1] = idata[p2];
+ p1++;
+ p2++;
+ count++;
+ }
+ else
+ {
+ p2++;
+ }
+ }
timer().endCpuTimer();
- return -1;
+ return count;
}
/**
@@ -41,10 +68,43 @@ namespace StreamCompaction {
* @returns the number of elements remaining after compaction.
*/
int compactWithScan(int n, int *odata, const int *idata) {
+ int* valValid = new int[n];
+ int* scanResult = new int[n];
+
timer().startCpuTimer();
- // TODO
+
+ // mark valid indices and scan
+ for (int i = 0; i < n; i++)
+ {
+ valValid[i] = idata[i] > 0 ? 1 : 0;
+ }
+
+ scanResult[0] = 0;
+ int sum = 0;
+ for (int i = 1; i < n; i++)
+ {
+ sum += valValid[i - 1];
+ scanResult[i] = sum;
+ }
+
+ // compact
+ int count = 0;
+ for (int i = 0; i < n; i++)
+ {
+ int val = idata[i];
+ if (val > 0)
+ {
+ int oIdx = scanResult[i];
+ odata[oIdx] = val;
+ count++;
+ }
+ }
+
timer().endCpuTimer();
- return -1;
+
+ delete[] valValid;
+ delete[] scanResult;
+ return count;
}
}
}
diff --git a/stream_compaction/efficient.cu b/stream_compaction/efficient.cu
index 2db346ee..3f8c906a 100644
--- a/stream_compaction/efficient.cu
+++ b/stream_compaction/efficient.cu
@@ -1,7 +1,10 @@
-#include
-#include
#include "common.h"
#include "efficient.h"
+#include
+#include
+#include
+#include
+
namespace StreamCompaction {
namespace Efficient {
@@ -12,13 +15,76 @@ namespace StreamCompaction {
return timer;
}
+ __global__ void kernUpSweep(int n, int stride, int offset, int* data)
+ {
+ int idx = threadIdx.x + blockIdx.x * blockDim.x;
+ if (idx >= n) { return; }
+
+ int rIdx = (idx + 1) * stride - 1;
+ int lIdx = rIdx - offset;
+
+ data[rIdx] += data[lIdx];
+ }
+
+ __global__ void kernDownSweep(int n, int stride, int offset, int* data)
+ {
+ int idx = threadIdx.x + blockIdx.x * blockDim.x;
+ if (idx >= n) { return; }
+
+ int rIdx = (idx + 1) * stride - 1;
+ int lIdx = rIdx - offset;
+
+ int rVal = data[rIdx];
+ int lVal = data[lIdx];
+
+ data[rIdx] = rVal + lVal;
+ data[lIdx] = rVal;
+ }
+
/**
* Performs prefix-sum (aka scan) on idata, storing the result into odata.
*/
- void scan(int n, int *odata, const int *idata) {
+ void scan(int n, int *odata, const int * idata) {
+
+ const int N = 1 << ilog2ceil(n);
+ constexpr int threadsPerBlock = 256;
+
+ // create buffers
+ int* dev_buf;
+ cudaMalloc(&dev_buf, N * sizeof(int));
+ cudaMemset(dev_buf, 0, N * sizeof(int));
+ cudaMemcpy(dev_buf, idata, n * sizeof(int), cudaMemcpyHostToDevice);
+
+ // compute
+ const int maxDepth = ilog2(n - 1);
timer().startGpuTimer();
- // TODO
+ // upsweep
+ for (int d = 0; d <= maxDepth; d++)
+ {
+ int offset = 1 << d;
+ int stride = 1 << (d + 1);
+ int threadCount = N / stride;
+ int numBlocks = (threadCount + threadsPerBlock - 1) / threadsPerBlock;
+ kernUpSweep<<< numBlocks, threadsPerBlock >>>(threadCount, stride, offset, dev_buf);
+ }
+
+ // downsweep
+ cudaMemset(&dev_buf[N - 1], 0, sizeof(int));
+ for (int d = maxDepth; d >= 0; d--)
+ {
+ int offset = 1 << d;
+ int stride = 1 << (d + 1);
+ int threadCount = N / stride;
+ int numBlocks = (threadCount + threadsPerBlock - 1) / threadsPerBlock;
+ kernDownSweep<<< numBlocks, threadsPerBlock >>>(threadCount, stride, offset, dev_buf);
+ }
timer().endGpuTimer();
+
+ // copy to host and free buffers
+ cudaMemcpy(odata, dev_buf, n * sizeof(int), cudaMemcpyDeviceToHost);
+ cudaFree(dev_buf);
+
+ cudaDeviceSynchronize();
}
/**
@@ -30,11 +96,236 @@ namespace StreamCompaction {
* @param idata The array of elements to compact.
* @returns The number of elements remaining after compaction.
*/
+
int compact(int n, int *odata, const int *idata) {
+ const int N = 1 << ilog2ceil(n);
+ const int threadsPerBlock = 128;
+
+ int* dev_idata;
+ cudaMalloc(&dev_idata, n * sizeof(int));
+ cudaMemcpy(dev_idata, idata, n * sizeof(int), cudaMemcpyHostToDevice);
+
+ int* dev_odata;
+ cudaMalloc(&dev_odata, n * sizeof(int));
+
+ int* dev_bool;
+ cudaMalloc(&dev_bool, n * sizeof(int));
+
+ int* dev_scanBuf;
+ cudaMalloc(&dev_scanBuf, N * sizeof(int));
+ cudaMemset(dev_scanBuf, 0, N * sizeof(int));
+
+ timer().startGpuTimer();
+ // calculate mask
+ int numBlocks = (n + threadsPerBlock - 1) / threadsPerBlock;
+ Common::kernMapToBoolean << > > (n, dev_bool, dev_idata);
+ cudaMemcpy(dev_scanBuf, dev_bool, n * sizeof(int), cudaMemcpyDeviceToDevice);
+
+ // scan
+ const int maxDepth = ilog2(n - 1);
+ for (int d = 0; d <= maxDepth; d++)
+ {
+ int offset = 1 << d;
+ int stride = 1 << (d + 1);
+ int threadCount = N / stride;
+ int numBlocks = (threadCount + threadsPerBlock - 1) / threadsPerBlock;
+ kernUpSweep << < numBlocks, threadsPerBlock >> > (threadCount, stride, offset, dev_scanBuf);
+ }
+
+ cudaMemset(&dev_scanBuf[N - 1], 0, sizeof(int));
+ for (int d = maxDepth; d >= 0; d--)
+ {
+ int offset = 1 << d;
+ int stride = 1 << (d + 1);
+ int threadCount = N >> (d + 1); // N / stride;
+ int numBlocks = (threadCount + threadsPerBlock - 1) / threadsPerBlock;
+ kernDownSweep << < numBlocks, threadsPerBlock >> > (threadCount, stride, offset, dev_scanBuf);
+ }
+
+ // scatter
+ Common::kernScatter<<>>(n, dev_odata, dev_idata, dev_bool, dev_scanBuf);
+ timer().endGpuTimer();
+
+ // copy to host and free buffers
+ int lastExclusiveResult, lastIsNotZero;
+ cudaMemcpy(&lastExclusiveResult, dev_scanBuf + n - 1, sizeof(int), cudaMemcpyDeviceToHost);
+ cudaMemcpy(&lastIsNotZero, dev_bool + n - 1, sizeof(int), cudaMemcpyDeviceToHost);
+ int compactElementCount = lastExclusiveResult + lastIsNotZero;
+
+ cudaMemcpy(odata, dev_odata, compactElementCount * sizeof(int), cudaMemcpyDeviceToHost);
+
+ cudaFree(dev_idata);
+ cudaFree(dev_odata);
+ cudaFree(dev_bool);
+ cudaFree(dev_scanBuf);
+
+ checkCUDAError("cuda error");
+ return compactElementCount;
+ }
+ }
+
+ namespace EfficientPlus
+ {
+ using StreamCompaction::Common::PerformanceTimer;
+ PerformanceTimer& timer()
+ {
+ static PerformanceTimer timer;
+ return timer;
+ }
+
+ __global__ void kernScanBlocksInclusive(int maxDepth, int* blockIncrementData, int* idata)
+ {
+ int gidx = threadIdx.x + blockIdx.x * blockDim.x;
+ int idx = threadIdx.x;
+ int blockSize = blockDim.x;
+
+ extern __shared__ int buf[];
+
+ // load input data into shared memory
+ int globalIData = idata[gidx];
+ buf[idx] = globalIData;
+
+ // upsweep
+ for (int d = 0; d <= maxDepth; d++)
+ {
+ int nodeCount = blockSize >> (d + 1);
+
+ __syncthreads();
+ if (idx < nodeCount)
+ {
+ int offset = 1 << d;
+ int stride = 1 << (d + 1);
+
+ int rIdx = (idx + 1) * stride - 1;
+ int lIdx = rIdx - offset;
+
+ buf[rIdx] += buf[lIdx];
+ }
+ }
+
+ // down sweep
+ if (idx == 0)
+ {
+ buf[blockSize - 1] = 0;
+ }
+ for (int d = maxDepth; d >= 0; d--)
+ {
+
+ int nodeCount = blockSize >> (d + 1);
+ __syncthreads();
+ if (idx < nodeCount)
+ {
+ int offset = 1 << d;
+ int stride = 1 << (d + 1);
+
+ int rIdx = (idx + 1) * stride - 1;
+ int lIdx = rIdx - offset;
+
+ int rVal = buf[rIdx];
+ int lVal = buf[lIdx];
+ buf[rIdx] = rVal + lVal;
+ buf[lIdx] = rVal;
+ }
+ }
+
+ // copy back
+ __syncthreads();
+ int inclusiveScanData = buf[idx] + globalIData;
+ idata[gidx] = inclusiveScanData;
+ if (blockIncrementData != nullptr && idx == blockSize - 1)
+ {
+ blockIncrementData[blockIdx.x] = inclusiveScanData;
+ }
+ }
+
+ __global__ void kernAddBlockIncrement(int* idata, const int* incrementData)
+ {
+ if (blockIdx.x == 0) { return; }
+ int idx = threadIdx.x + (blockIdx.x) * blockDim.x;
+ // shift left by one block because of incrementData is the result of inclusive scan
+ int blockIncrement = incrementData[blockIdx.x - 1];
+ idata[idx] += blockIncrement;
+ }
+
+ /**
+ * work-efficient scan of arbitrary size using shared memory
+ */
+ void scan(int n, int* odata, const int* idata)
+ {
+ constexpr int MAX_THREADS_PER_BLOCK = 256;
+
+ // Calculate the parameters and create buffers for each level beforehand, we may need multiple levels of inter block synchronization,
+ std::vector dev_buffers; // dev_buffers stores the buffers for each level (both input and output since the algorithm is in-place)
+ std::vector blockCounts; // number of blocks for each level
+ std::vector perBlockThreadCounts; // number of threads per block for each level (in the last level we might launch blocks with fewer threads)
+ std::vector perBlockMaxDepths; // the max depth of the up-sweep/down-sweep within each block for each level
+
+ for (int elementCount = n; elementCount > 1;)
+ {
+ int blockCount = divUp(elementCount, MAX_THREADS_PER_BLOCK);
+ int perBlockThreadCount = std::min(MAX_THREADS_PER_BLOCK, elementCount);
+ int perBlockMaxDepth = ilog2(perBlockThreadCount - 1);
+ int* dev_buf;
+ cudaMalloc(&dev_buf, blockCount * perBlockThreadCount * sizeof(int));
+ cudaMemset(dev_buf, 0, blockCount * perBlockThreadCount * sizeof(int));
+
+ blockCounts.push_back(blockCount);
+ perBlockThreadCounts.push_back(perBlockThreadCount);
+ perBlockMaxDepths.push_back(perBlockMaxDepth);
+ dev_buffers.push_back(dev_buf);
+
+ // In the next pass we will use the per-block tails as scan input
+ elementCount = blockCount;
+ }
+ cudaMemcpy(dev_buffers[0], idata, n * sizeof(int), cudaMemcpyHostToDevice);
+
+ // scan
timer().startGpuTimer();
- // TODO
+ for (int i = 0; i < dev_buffers.size(); i++)
+ {
+ // kernel dimension
+ int blockCount = blockCounts[i];
+ int perBlockThreadCount = perBlockThreadCounts[i];
+
+ // parameters
+ int perBlockMaxDepth = perBlockMaxDepths[i];
+ int* dev_scanInputBuffer = dev_buffers[i];
+ int* dev_blockIncrementBuffer = (i + 1) < dev_buffers.size() ? dev_buffers[i + 1] : nullptr;
+
+ kernScanBlocksInclusive<<>>(perBlockMaxDepth, dev_blockIncrementBuffer, dev_scanInputBuffer);
+
+ // std::string errorMsg = "kernScanBlocksInclusive failed at level " + std::to_string(i) + "\n";
+ // cudaDeviceSynchronize();
+ // checkCUDAError(errorMsg.c_str());
+ }
+
+ for (int i = dev_buffers.size() - 1; i > 0; i--)
+ {
+ // kernel dimension
+ int blockCount = blockCounts[i - 1];
+ int perBlockThreadCount = perBlockThreadCounts[i - 1];
+
+ // parameters
+ int* dev_blockIncrementBuffer = dev_buffers[i];
+ int* dev_scanInputBuffer = dev_buffers[i - 1];
+
+ kernAddBlockIncrement <<< blockCount , perBlockThreadCount >>> (dev_scanInputBuffer, dev_blockIncrementBuffer);
+
+ // std::string errorMsg = "kernAddBlockIncrement failed at level " + std::to_string(i) + "\n";
+ // cudaDeviceSynchronize();
+ // checkCUDAError(errorMsg.c_str());
+ }
timer().endGpuTimer();
- return -1;
+
+ odata[0] = 0;
+ cudaMemcpy(odata + 1, dev_buffers[0], (n-1) * sizeof(int), cudaMemcpyDeviceToHost);
+
+ // free buffers
+ for (int* buf : dev_buffers)
+ {
+ cudaFree(buf);
+ }
+ checkCUDAError("");
}
}
}
diff --git a/stream_compaction/efficient.h b/stream_compaction/efficient.h
index 803cb4fe..16c7197e 100644
--- a/stream_compaction/efficient.h
+++ b/stream_compaction/efficient.h
@@ -10,4 +10,11 @@ namespace StreamCompaction {
int compact(int n, int *odata, const int *idata);
}
+
+ namespace EfficientPlus
+ {
+ StreamCompaction::Common::PerformanceTimer& timer();
+
+ void scan(int n, int* odata, const int* idata);
+ }
}
diff --git a/stream_compaction/naive.cu b/stream_compaction/naive.cu
index 43088769..aa07790e 100644
--- a/stream_compaction/naive.cu
+++ b/stream_compaction/naive.cu
@@ -11,15 +11,69 @@ namespace StreamCompaction {
static PerformanceTimer timer;
return timer;
}
- // TODO: __global__
+
+ __global__ void kernInclusiveScan(int n, int d, int* odata, const int *idata)
+ {
+ int idx = threadIdx.x + blockIdx.x * blockDim.x;
+ if ( idx >= n ) { return; }
+
+ int offset = 1 << (d - 1);
+ if ( idx < offset )
+ {
+ odata[idx] = idata[idx];
+ }
+ else
+ {
+ int self = idata[idx];
+ int prev = idata[idx - offset];
+ int result = self + prev;
+ odata[idx] = result;
+ }
+ }
/**
* Performs prefix-sum (aka scan) on idata, storing the result into odata.
*/
- void scan(int n, int *odata, const int *idata) {
+ void scan(int n, int *odata, const int *idata)
+ {
+ const int maxDepth = ilog2ceil(n);
+ const int paddedBufSize = 1 << maxDepth;
+ const int threadsPerBlock = 256;
+ const int numBlocks = (paddedBufSize + threadsPerBlock - 1) / threadsPerBlock;
+
+ // create buffers
+ int* dev_buf1;
+ int* dev_buf2;
+ cudaMalloc(&dev_buf1, paddedBufSize * sizeof(int));
+ cudaMalloc(&dev_buf2, paddedBufSize * sizeof(int));
+
+ // cudaMemset takes a byte count; buffers hold paddedBufSize ints,
+ // so zero paddedBufSize * sizeof(int) bytes (padding tail must be 0 for the scan).
+ cudaMemset(dev_buf1, 0, paddedBufSize * sizeof(int));
+ cudaMemset(dev_buf2, 0, paddedBufSize * sizeof(int));
+ cudaMemcpy(dev_buf1, idata, n * sizeof(int), cudaMemcpyHostToDevice);
+
+ // computation
+ int* pIBuffer = dev_buf1;
+ int* pOBuffer = dev_buf2;
+
timer().startGpuTimer();
- // TODO
+ for (int d = 1; d <= maxDepth; d++)
+ {
+ kernInclusiveScan<<<numBlocks, threadsPerBlock>>>(paddedBufSize, d, pOBuffer, pIBuffer);
+ std::swap(pIBuffer, pOBuffer);
+ }
timer().endGpuTimer();
+
+ pOBuffer = pIBuffer;
+
+ // copy back to host
+ odata[0] = 0;
+ cudaMemcpy(odata + 1, pOBuffer, (n - 1) * sizeof(int), cudaMemcpyDeviceToHost);
+
+ // free buffers
+ cudaFree(dev_buf1);
+ cudaFree(dev_buf2);
+
+ cudaDeviceSynchronize();
}
}
}
diff --git a/stream_compaction/thrust.cu b/stream_compaction/thrust.cu
index 1def45e7..4adc4b93 100644
--- a/stream_compaction/thrust.cu
+++ b/stream_compaction/thrust.cu
@@ -18,11 +18,18 @@ namespace StreamCompaction {
* Performs prefix-sum (aka scan) on idata, storing the result into odata.
*/
void scan(int n, int *odata, const int *idata) {
+ thrust::host_vector<int> h_idata(idata, idata + n);
+ thrust::device_vector<int> d_idata = h_idata;
+ thrust::device_vector<int> d_odata(n);
+
timer().startGpuTimer();
// TODO use `thrust::exclusive_scan`
// example: for device_vectors dv_in and dv_out:
// thrust::exclusive_scan(dv_in.begin(), dv_in.end(), dv_out.begin());
+ thrust::exclusive_scan(d_idata.begin(), d_idata.end(), d_odata.begin());
timer().endGpuTimer();
+
+ thrust::copy(d_odata.begin(), d_odata.end(), odata);
}
}
}