Skip to content

Commit 7acac5e

Browse files
committed
markdown formatting
1 parent 42f3dd4 commit 7acac5e

File tree

1 file changed

+5
-3
lines changed

1 file changed

+5
-3
lines changed

assignment4/assignment4.md

Lines changed: 5 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -40,12 +40,14 @@ RDF data is represented as a set of triples of the form:
4040

4141
The [context] is not part of the triple, but is sometimes added to tell where the data is coming from. For example, file btc-2010-chunk-200 contains the two "triples" (they are actually "quads" because they have the context too):
4242

43-
```<http://www.last.fm/user/ForgottenSound> <http://xmlns.com/foaf/0.1/nick> "ForgottenSound" <http://rdf.opiumfield.com/lastfm/friends/life-exe> .```
44-
```<http://dblp.l3s.de/d2r/resource/publications/journals/cg/WestermannH96> <http://xmlns.com/foaf/0.1/maker> <http://dblp.l3s.de/d2r/resource/authors/Birgit\_Westermann> <http://dblp.l3s.de/d2r/data/publications/journals/cg/WestermannH96> .```
43+
```
44+
<http://www.last.fm/user/ForgottenSound> <http://xmlns.com/foaf/0.1/nick> "ForgottenSound" <http://rdf.opiumfield.com/lastfm/friends/life-exe> .
45+
<http://dblp.l3s.de/d2r/resource/publications/journals/cg/WestermannH96> <http://xmlns.com/foaf/0.1/maker> <http://dblp.l3s.de/d2r/resource/authors/Birgit\_Westermann> <http://dblp.l3s.de/d2r/data/publications/journals/cg/WestermannH96> .
46+
```
4547

4648
The first says that Webpage <http://www.last.fm/user/ForgottenSound> has the nickname "ForgottenSound"; the second describes the maker of another webpage. foaf stands for _Friend of a Friend._ Confused ? You don't need to know what they mean; some of the many triples refer to music, http://dbtune.org, others refer to company relationships, etc. For our purpose, these triples are just a large collection of triples. There were 317 2GB files in the [billion triple dataset](http://km.aifb.kit.edu/projects/btc-2010/) when we downloaded it. We uploaded them to Amazon's Web Services in S3: there were some errors, and only 251 uploaded correctly, for a total of about 550 GB of data.
4749

48-
This graph is similar in size to the [web graph](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.33.44&rep=rep1&type=pdf). As part of this assignment, we will compute the out-degree of each node in the graph. The out-degree of a node is the number of edges coming out of the node. This is an important property. If a graph is random, the out-degree of nodes will follow an exponential distribution (i.e., the number of nodes with degree d should be exp(- c\*d) for some constant c). We will write the script in Problem 2, where we will run it on a small data sample. We will run the script on the big graph in Problem 4. What is very interesting is that we will find the distribution of node out-degrees to follow a power law (1/d^k for some constant k and it will look roughly like a straight-line on a graph with logarithmic scales on both the x and y axes) instead of an exponential distribution. If you look at Figures 2 and 3 in [this paper](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.33.44&rep=rep1&type=pdf), you will find that the degrees of web pages on the web, in general, follow a similar power law distribution. This is very interesting because it means that the Web and the semantic Web cannot be modeled as random graphs. They need a different theoretical model.
50+
This graph is similar in size to the [web graph](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.33.44&rep=rep1&type=pdf). As part of this assignment, we will compute the out-degree of each node in the graph. The out-degree of a node is the number of edges coming out of the node. This is an important property. If a graph is random, the out-degree of nodes will follow an exponential distribution (i.e., the number of nodes with degree d should be exp(- c\*d) for some constant c). We will write the script in Problem 2, where we will run it on a small data sample. We will run the script on the big graph in Problem 4. We will find the distribution of node out-degrees to follow a power law (1/d^k for some constant k and it will look roughly like a straight-line on a graph with logarithmic scales on both the x and y axes) instead of an exponential distribution. If you look at Figures 2 and 3 in [this paper](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.33.44&rep=rep1&type=pdf), you will find that the degrees of web pages on the web, in general, follow a similar power law distribution. This is very interesting because it means that the Web and the semantic Web cannot be modeled as random graphs. They need a different theoretical model.
4951

5052
In Problem 3, we will look for paths of length 2 in a sub-graph of our big graph. This is a simple version of more complex algorithms that try to measure the diameter of a graph or try to extract other related properties. We will do all this on a very real 0.5TB graph! How cool will that look on your resume: "Analyzed properties of a 0.5TB (a billion vertices) graph using Pig/Hadoop".
5153

0 commit comments

Comments
 (0)