**README.md** (70 additions)
The project contains two branches:
- **soufun** - Implementation using the soufun rental-listing data set, with heat-map and interpolation (ML) analysis over an overlay grid.
- **weibo** - Implementation using the weibo social-media data set, with graph analysis, click events, and animated transitions.

**FINAL PROJECT**

## Project Description

1. Research Topic - what general question or idea are you examining?
- We examined how consumer behavior responds to changes in the stock market.

2. Research Scope - what geographic area are you focusing on?
- We focus mainly on the Pearl River Delta.

3. Hypothesis - the specific hypothesis tested by your prototype.
- Consumer behavior, represented by the number of check-ins in the Food/Drinks category, will change with the stock market: when the market goes up, total check-ins will go up, and vice versa.

4. Methodology - general description of how your project tests the hypothesis.
- We clustered areas of the region to show concentrations of check-in types, then tracked the total number of check-ins within each cluster during a specific time window for a specific check-in category. More generally, we tried to represent the overall consumer trend of the region in order to compare it against global stock markets.
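
The comparison step could be tested numerically as a simple correlation between the two daily series. This is a sketch of one such measure, not part of the prototype, which only visualizes the two series side by side:

```python
import numpy as np

def checkin_market_correlation(daily_checkins, daily_index):
    """Pearson correlation between a daily check-in count series and a
    stock-index series covering the same days. A value near +1 supports
    the hypothesis that check-ins rise and fall with the market."""
    return float(np.corrcoef(daily_checkins, daily_index)[0, 1])
```

A positive value would support the hypothesis; a value near zero or negative would argue against it.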

5. Minimum Viable Product - what assumptions or generalizations did you implement in order to generate a working MVP? How would the MVP be extended to create the ultimate product you envision?
- We had issues running the loop within a responsive timeframe: requests to the server took more than 10 seconds for a single check-in query and produced no response across the full range of 10 clusters. Without knowing the exact issue with the dataset, a sample of the data would be plotted and color-coded to show the spatial diversity of check-ins.

## Data Processing

1. Which dataset are you using (soufun or weibo)
- Weibo

2. Describe in detail any processing you did on the data to prepare it for your application. This should include any pre-processing done outside of runtime, which was hardcoded into the dataset. Include a reference to the actual script in your project files which must be run to generate the proper data set.
- We use clustering to categorize the dataset, based on our assumption that weibo check-ins of a given consumption type should form spatial clusters, so k-means is a natural fit. At first we intended to compute the clusters from the check-ins themselves, but due to the huge number of records we switched to computing clusters from places. The whole process is therefore divided into two parts: one script calculates the clusters and writes the cluster id into every place record, and a second writes the cluster id into each check-in by matching the check-in record to its place record. At the end we have a database with cluster ids in both check-in and place records.

K-means can also take weights when forming clusters. Our original idea was to use the soufun data and weight each place by property value, but that turned out to be too ambitious to be doable, so we simplified the process to cluster by place location rather than by political boundaries.

Another early idea was to use the network for cluster analysis, based on the assumption that a strong connection between two places might reveal a particular kind of consumption behavior. We abandoned this approach due to its difficulty.
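
The two-pass process described above can be sketched roughly as follows. The field name `cluster_id_1` matches the scripts in this repo; the plain-dict record layout is an assumption for illustration (the real records live in OrientDB):

```python
# Sketch of the two-pass clustering: pass 1 clusters places and writes
# cluster_id_1 into each place record; pass 2 copies the id onto check-ins.
import numpy as np
from sklearn.cluster import KMeans

def cluster_places(places, k=10):
    """Pass 1: k-means over place coordinates; tag each place record."""
    coords = np.array([[p["lat"], p["lng"]] for p in places])
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(coords)
    for place, label in zip(places, labels):
        place["cluster_id_1"] = int(label)
    return places

def tag_checkins(checkins, places_by_id):
    """Pass 2: copy each place's cluster id onto its check-ins."""
    for checkin in checkins:
        checkin["cluster_id_1"] = places_by_id[checkin["place_id"]]["cluster_id_1"]
    return checkins
```

Clustering the (much smaller) place table and propagating ids to check-ins avoids running k-means over every check-in record.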

## Server Back End

Describe how the server interacts with the client, including:
1. The arguments that are sent from the client and received by the server at the beginning of the request.
2. Any update messages sent back to the client during the request.
3. The data sent back to the client at the end of the request, including the data format.

- The server gets the map bounds and the time parameters from the client interface using the request.args.get function. All of these parameters are then used in the dynamic query. Because the database is large, we divided it into several chunks; during the query, we gradually send data back to the client. The chunk size is xxxxxx (specific number to be filled in). Updates on the client side therefore happen every time a chunk finishes its query, and by the end of the request the whole database has been processed.
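
The chunked request/response cycle can be sketched as follows. `CHUNK_SIZE` and the `run_chunk` callback are illustrative placeholders; in the real server (`Server_Corrected.py`), status messages are pushed through a queue consumed by an SSE endpoint:

```python
# Sketch of the chunked query loop: the database is read in fixed-size
# chunks, and a status message is queued for the client after each chunk.
from queue import Queue  # Python 3; the project server uses Python 2's Queue

CHUNK_SIZE = 2000  # illustrative; the real chunk size is set server-side

def stream_query(run_chunk, total_records, status_queue):
    """Fetch total_records in CHUNK_SIZE pieces, reporting progress.

    run_chunk(offset, limit) stands in for the OrientDB query on one chunk.
    """
    features = []
    for offset in range(0, total_records, CHUNK_SIZE):
        features.extend(run_chunk(offset, CHUNK_SIZE))
        done = min(offset + CHUNK_SIZE, total_records)
        status_queue.put("processed %d of %d records" % (done, total_records))
    status_queue.put("idle")  # final status once the whole range is covered
    return features
```

Pushing a message per chunk is what lets the client show incremental progress instead of waiting on one long-running request.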

Describe how the server interacts with the database, including:

1. The query that is sent to the database.
2. How the results of the query are processed and formatted for sending back to the client.

- The query parameters are acquired from the client side, which dynamically chooses the time duration to be queried through the manual interface. The query includes the following elements: 1. latitude, 2. longitude, 3. time, 4. category (Food/Drinks), 5. cluster (from the preprocessed database). Items are selected if they match the query.

After the query is processed, the server records the latitude and longitude of each item and sends them back to the client for display; they are stored in a NumPy array keyed by rid. To calculate the sum of all check-ins, we use an array as a "voter": we first convert each timestamp to its day-of-year number, then add a vote to that slot as we process the data.
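
The day-of-year "voter" tally can be sketched like this (the timestamp string format is an assumption; the real values come back from OrientDB records):

```python
# Sketch of the voter array: one slot per day of the year; each check-in
# timestamp casts a vote into its day's slot.
from datetime import datetime

def tally_by_day(timestamps):
    """Convert each timestamp to its day-of-year and accumulate counts."""
    votes = [0] * 366  # one slot per day, leap years included
    for ts in timestamps:
        day = datetime.strptime(ts, "%Y-%m-%d %H:%M:%S").timetuple().tm_yday
        votes[day - 1] += 1  # tm_yday is 1-based
    return votes
```

The resulting array is the daily check-in series that gets compared against the stock-market data on the client.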

## Client Front End [to be completed by Client team]

1. Describe the front end User Interface (UI). What options or parameters are available to the user?
- The user is currently able to explore the Pearl River Delta region and see where Weibo users have checked in for Food/Drink. The code for all categories is live, but it is slow and does not process properly.

2. Describe the general User Experience (UX) story or narrative you envision for your MVP.
- The option to view 'FakeData' shows how the UX would perform with a fully operational database and request pipeline. Users can see regions of the PRD grow in relation to the check-ins for each category, as well as their relation to stock-market fluctuation.

3. Reproduce and explain in detail any requests you are sending to the server, including any arguments in the query string, and how they are communicating the decisions the user has made in the UI
- The user is able to request data for a specific month or week within the yearly snapshot.

4. Describe any further processing that is done to the data received from the server in the front end.
- Stock-market data is pulled from an external source to show its relation to the Weibo dataset, and is represented with a histogram.

5. Describe how the data is visualized using JavaScript/D3. How does the visualization work to communicate the important insights of the data to the user.
- We were not able to implement a color range for each individual cluster, although this would be ideal for a future implementation, as would the addition of a heat map.

**Server_Corrected.py** (188 additions)
#######################################################################
#####################INDEX TIME in OrientDB FIRST!!!###################
#######################################################################

from flask import Flask
from flask import render_template
from flask import request
from flask import Response

import json
import time
import sys
import random
import math

import pyorient

from Queue import Queue

from sklearn import preprocessing
from sklearn import svm

import numpy as np


app = Flask(__name__)

q = Queue()

##############heat map display########
def point_distance(x1, y1, x2, y2):
    return ((x1-x2)**2.0 + (y1-y2)**2.0)**(0.5)

def remap(value, min1, max1, min2, max2):
    return float(min2) + (float(value) - float(min1)) * (float(max2) - float(min2)) / (float(max1) - float(min1))

def normalizeArray(inputArray):
    maxVal = 0
    minVal = 100000000000

    for j in range(len(inputArray)):
        for i in range(len(inputArray[j])):
            if inputArray[j][i] > maxVal:
                maxVal = inputArray[j][i]
            if inputArray[j][i] < minVal:
                minVal = inputArray[j][i]

    for j in range(len(inputArray)):
        for i in range(len(inputArray[j])):
            inputArray[j][i] = remap(inputArray[j][i], minVal, maxVal, 0, 1)

    return inputArray

def event_stream():
    while True:
        result = q.get()
        yield 'data: %s\n\n' % str(result)

@app.route('/eventSource/')
def sse_source():
    return Response(
        event_stream(),
        mimetype='text/event-stream')

@app.route("/")
def index():
    return render_template("index.html")

@app.route("/getData/")
def getData():

    q.put("starting data query...") ##### status display#####

    startstr = str(request.args.get('starttime'))
    endstr = str(request.args.get('endtime'))

    ################################map bound & window bound##############################
    lat1 = str(request.args.get('lat1'))
    lng1 = str(request.args.get('lng1'))
    lat2 = str(request.args.get('lat2'))
    lng2 = str(request.args.get('lng2'))

    w = float(request.args.get('w'))
    h = float(request.args.get('h'))

    #############################cellsize###############################
    # cell_size = float(request.args.get('cell_size'))

    ############################analysis trigger########################
    # analysis = request.args.get('analysis')

    print "received coordinates: [" + lat1 + ", " + lat2 + "], [" + lng1 + ", " + lng2 + "]"

    client = pyorient.OrientDB("localhost", 2424)
    session_id = client.connect("root", "network.ssl.keyStorePassword")
    db_name = "weibo"
    db_username = "admin"
    db_password = "admin"

    if client.db_exists( db_name, pyorient.STORAGE_TYPE_MEMORY ):
        client.db_open( db_name, db_username, db_password )
        print db_name + " opened successfully"
    else:
        print "database [" + db_name + "] does not exist! session ending..."
        sys.exit()

    #########################################################################
    #####################CATEGORY NAME CHECK PLEASE!!!!!#####################
    output = {"type": "FeatureCollection", "features": []}

    # The cluster id must be substituted into the query string; a bare "n"
    # inside the string would be sent to OrientDB literally.
    query = ('SELECT FROM Checkin WHERE time BETWEEN "{}" AND "{}" '
             'AND cat_2 = "Food/Drinks" AND cluster_id_1 = {} '
             'AND lat BETWEEN {} AND {} AND lng BETWEEN {} AND {} LIMIT 200')

    for n in range(0, 10):
        results = client.command(query.format(startstr, endstr, n, lat1, lat2, lng1, lng2))

        for record in results:

            feature = {"type": "Feature", "properties": {}, "geometry": {"type": "Point"}}
            feature["id"] = record._rid
            feature["geometry"]["coordinates"] = [record.lat, record.lng]
            feature["time"] = str(record.time)
            feature["cat_1"] = record.cat_1
            feature["cat_2"] = record.cat_2
            feature["TOD"] = record.TOD
            feature["DOW"] = record.DOW
            output["features"].append(feature)

    client.db_close()
# if analysis == "false":
# q.put('idle')
# return json.dumps(output)

# q.put('starting analysis...') ##### status display#####

# output["analysis"] = []

# numW = int(math.floor(w/cell_size))
# numH = int(math.floor(h/cell_size))

# grid = []

# for j in range(numH):
# grid.append([])
# for i in range(numW):
# grid[j].append(0)

# #HEAT MAP IMPLEMENTATION
# for record in records:

# pos_x = int(remap(record.longitude, lng1, lng2, 0, numW))
# pos_y = int(remap(record.latitude, lat1, lat2, numH, 0))

# spread = 12

# for j in range(max(0, (pos_y-spread)), min(numH, (pos_y+spread))):
# for i in range(max(0, (pos_x-spread)), min(numW, (pos_x+spread))):
# grid[j][i] += 2 * math.exp((-point_distance(i,j,pos_x,pos_y)**2)/(2*(spread/2)**2))

## ML IMPLEMENTATION
# grid = normalizeArray(grid)

# offsetLeft = (w - numW * cell_size) / 2.0
# offsetTop = (h - numH * cell_size) / 2.0


# for j in range(numH):
# for i in range(numW):
# newItem = {}

# newItem['x'] = offsetLeft + i*cell_size
# newItem['y'] = offsetTop + j*cell_size
# newItem['width'] = cell_size-1
# newItem['height'] = cell_size-1
# newItem['value'] = grid[j][i]

# output["analysis"].append(newItem)
    q.put('idle')
    return json.dumps(output)

if __name__ == "__main__":
    app.run(host='0.0.0.0', port=5000, debug=True, threaded=True)
**iterate_weibo_addData_checkin_cluster.py** (73 additions)
#!/usr/bin/python
# -*- coding: utf-8 -*-

import os
import sys
import math
import urllib
import urllib2
import json
import types
import datetime

import pyorient


client = pyorient.OrientDB("localhost", 2424)
session_id = client.connect("root", "network.ssl.keyStorePassword")

db_name = "weibo"

if client.db_exists( db_name, pyorient.STORAGE_TYPE_MEMORY ):
    client.db_open( db_name, "admin", "admin" )
else:
    print "database does not exist!"
    sys.exit()


result = client.command("SELECT COUNT(*) FROM Checkin")
numRecords = result[0].COUNT

numRetrieve = 2000

# float cast needed: Python 2 integer division would floor before ceil runs
iterations = int(math.ceil(float(numRecords) / numRetrieve))
print "Number of Records: " + str(numRecords)
print "Number of Iterations: " + str(iterations)

currProgress = 0
progressBreaks = .05

currentRID = "#-1:-1"
# currentRID = "#13:4200000"


for i in range(iterations):

    results = client.command("SELECT FROM Checkin WHERE @rid > {} LIMIT {}".format(currentRID, numRetrieve))

    print results[0]._rid

    for record in results:

        try:
            r = record.cluster_id_1
        except AttributeError:
            # no cluster id yet: look up the linked place record
            place = client.command("SELECT * FROM (SELECT expand(in) FROM {})".format(record._rid))

            try:
                client.command("UPDATE {} SET {} = {}".format(record._rid, 'cluster_id_1', place[0].cluster_id_1))
            except (IndexError, AttributeError):
                print "error: no cluster in place record!"

    currentRID = results[-1]._rid

    c = float(i) / float(iterations)

    if c > (currProgress + progressBreaks):
        print "done: " + str(int(c * 100)) + "%"
        currProgress = c


client.db_close()