notes:
titles-sorted.txt
just names of titles
index position is like the ID
links-simple-sorted.txt
index position is like the ID (supposed to correspond to titles)
each line: a "from" page, then the pages it links "to" (one or more)
page: page page page
so the first thing is the source page, then a : delimiter, then 1 or more target pages, separated by spaces
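A minimal sketch of parsing one line of links-simple-sorted.txt as described above. It assumes page IDs are integers (the notes only say the index position acts as the ID); the function name parse_links_line is hypothetical.

```python
def parse_links_line(line):
    """Split 'page: target target ...' into (source_id, [target_ids]).

    Assumes IDs are integers; a line with nothing after the colon
    yields an empty target list.
    """
    source, _, targets = line.partition(":")
    return int(source), [int(t) for t in targets.split()]


# Example with a made-up line in the format from the notes.
source, targets = parse_links_line("1: 2 3 4")
```

Using partition rather than split(":") keeps the source intact even if the rest of the line is empty.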
HITS:
so there's two components:
authority score
hub score
authority score is basically how many pages link TO it
hub score is basically how many pages it links OUT to
they depend on each other:
authority: sum of hub scores of pages that link TO it
hub: sum of authority scores of pages it links OUT to
calculate scores by SUMMING everything, then NORMALIZING
normalize by (a^2 + b^2 + ...)^(1/2)
so sum everything, then divide by the norm
WAIT: normalizing is a separate step
run algo this way:
everything initialized with 1
run Authority Update (SUM ONLY)
run Hub update (SUM ONLY)
normalize both
hub score: divide each score by: root sum of squares of all hub scores (all!)
authority score: divide each score by: root sum of squares of all authority scores (all!)
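The steps above can be sketched in plain Python on a small in-memory graph. This is only an illustration, not the cluster job: the function name hits, the out_links dict layout, and the fixed iteration count are assumptions. It also assumes the hub update uses the authority sums just computed in the same round.

```python
import math


def hits(out_links, iterations=20):
    """Run HITS on a dict mapping page -> list of pages it links OUT to."""
    # Collect every page that appears as a source or a target.
    pages = set(out_links)
    for targets in out_links.values():
        pages.update(targets)

    # Reverse the graph so we can look up who links TO each page.
    in_links = {p: [] for p in pages}
    for src, targets in out_links.items():
        for dst in targets:
            in_links[dst].append(src)

    # Everything initialized with 1.
    auth = {p: 1.0 for p in pages}
    hub = {p: 1.0 for p in pages}

    for _ in range(iterations):
        # Authority update (sum only): hub scores of pages linking TO p.
        auth = {p: sum(hub[q] for q in in_links[p]) for p in pages}
        # Hub update (sum only): authority scores of pages p links OUT to.
        hub = {p: sum(auth[q] for q in out_links.get(p, [])) for p in pages}

        # Normalize both by the root sum of squares over ALL pages.
        # "or 1.0" avoids dividing by zero on a graph with no links.
        auth_norm = math.sqrt(sum(v * v for v in auth.values())) or 1.0
        hub_norm = math.sqrt(sum(v * v for v in hub.values())) or 1.0
        auth = {p: v / auth_norm for p, v in auth.items()}
        hub = {p: v / hub_norm for p, v in hub.items()}

    return auth, hub


# Toy graph: page 1 links to 2 and 3, page 2 links to 3.
auth, hub = hits({1: [2, 3], 2: [3], 3: []})
```

On this toy graph page 3 ends up with the largest authority score and page 1 with the largest hub score, which matches the intuition in the notes.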
--------
data cleaning:
make sure you remove all pages that nothing links to and that don't link to anything!
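One way to read the cleaning rule is "drop pages with no inbound and no outbound links". A minimal sketch under that interpretation, reusing an assumed dict of page -> list of target pages; the name drop_isolated is hypothetical.

```python
def drop_isolated(out_links):
    """Remove pages that link to nothing AND that no page links to."""
    # Every page that appears as a target of at least one link.
    linked_to = {dst for targets in out_links.values() for dst in targets}
    return {
        src: targets
        for src, targets in out_links.items()
        if targets or src in linked_to
    }


# Page 3 neither links out nor is linked to, so it is dropped.
cleaned = drop_isolated({1: [2], 2: [], 3: []})
```

Such pages would get an authority and hub score of 0 anyway, so removing them up front only shrinks the data.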
----------
my cluster:
- connect
ssh -i /Users/coolguy/Desktop/AWS_stuff/supersecret.pem hadoop@ec2-52-26-22-227.us-west-2.compute.amazonaws.com
- monitoring
ssh -i /Users/coolguy/Desktop/AWS_stuff/supersecret.pem -ND 8157 hadoop@ec2-52-26-22-227.us-west-2.compute.amazonaws.com
scp -i /Users/coolguy/Desktop/AWS_stuff/supersecret.pem hadoop@ec2-52-26-22-227.us-west-2.compute.amazonaws.com:/mnt/var/log/pig/* /Users/coolguy/Downloads