This homework involves data analysis (running community detection
methods
on synthetic networks) and writing a report about what
you did and what you observed.
You should also relate the results you find to other papers
you have read or heard about (e.g., in class lectures): are they
similar to what you saw, or are they different?
What did you learn?
What are you still trying to figure out?
Here are some basic instructions.
Perform community detection using at least two methods
on at least two EC-SBM networks based on SBM+WCC input parameters.
One of the networks should be relatively small (under 10,000 nodes)
and the others should have at least 30,000 nodes.
The-Anh suggests these:
<ul>
<li>
topology (35K nodes, 171K edges)
<li>
internet_as (23K nodes, 48K edges)
<li>
marker_cafe (69K nodes, 1.6M edges)
</ul>
Some details are given below:
<ul>
<li>
The EC-SBM networks are public:
<a href="https://databank.illinois.edu/datasets/IDB-3284069">(link)</a>.
Use the "sbm+wcc" versions for this experiment.
The largest is only around 1.4M nodes.
<li>
For methods, run Leiden optimizing modularity and Leiden optimizing CPM, and any other methods you wish to explore.
For Leiden optimizing CPM, try resolution value 0.1 or 0.01.
Use the leidenalg package <a href="https://leidenalg.readthedocs.io/en/stable/index.html">(link)</a>.
You may also be interested in running Infomap
<a href="https://www.mapequation.org/infomap/">(link)</a> or graph-tool
for SBM <a href="https://graph-tool.skewed.de/static/docs/stable/demos/inference/inference.html">(link)</a>.
<li>
To evaluate accuracy, report AMI, ARI, and NMI, using our
scripts
<a href="http://github.com/illinois-or-research-analytics/network_evaluation">(link)</a>.
<li>
Besides accuracy, report the percentage of nodes in non-singleton clusters (what
we refer to as "node coverage").
Also report statistics about the distributions of cluster density
and edge connectivity (e.g., the ratio between the size of the
minimum edge cut and log10(n)).
It is up to you what you report, but report what you find interesting.
You can find scripts for some of these at the same URL given above.
</ul>
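<p>
As an illustration of the node coverage and cluster density statistics mentioned above,
here is a minimal Python sketch that uses only the standard library.
It is not the course script, and several things in it are assumptions:
the clustering is given as lines of the form "node_id cluster_id",
the network is a simple undirected graph with each edge listed once,
and density is taken to be the number of edges inside a cluster divided by the number of possible node pairs in it.
The function names <code>read_clustering</code> and <code>cluster_stats</code> are hypothetical.

```python
from collections import defaultdict


def read_clustering(lines):
    """Parse 'node_id cluster_id' lines (assumed format) into a dict."""
    membership = {}
    for line in lines:
        parts = line.split()
        if len(parts) >= 2:
            membership[parts[0]] = parts[1]
    return membership


def cluster_stats(edges, membership, n_nodes):
    """Return (node_coverage, {cluster_id: density}).

    node_coverage: fraction of all n_nodes that sit in a cluster of size >= 2.
    density (assumed definition): intra-cluster edges / (k choose 2),
    computed only for non-singleton clusters.
    """
    sizes = defaultdict(int)
    for cluster in membership.values():
        sizes[cluster] += 1

    covered = sum(size for size in sizes.values() if size >= 2)
    coverage = covered / n_nodes

    # Count edges whose two endpoints fall in the same cluster.
    internal = defaultdict(int)
    for u, v in edges:
        cu, cv = membership.get(u), membership.get(v)
        if cu is not None and cu == cv:
            internal[cu] += 1

    density = {
        c: internal[c] / (k * (k - 1) / 2)
        for c, k in sizes.items()
        if k >= 2
    }
    return coverage, density


# Toy input: a 3-node path (cluster A), one edge (cluster B), one singleton (C).
edges = [("1", "2"), ("2", "3"), ("4", "5"), ("3", "4")]
membership = read_clustering(["1 A", "2 A", "3 A", "4 B", "5 B", "6 C"])
coverage, density = cluster_stats(edges, membership, n_nodes=6)

print(f"node coverage: {coverage:.1%}")
for cluster_id in sorted(density):
    print(cluster_id, round(density[cluster_id], 3))
```

Passing the total node count separately means nodes that a method leaves unclustered still count against coverage.
If your clustering output uses a different file format, only the parsing function needs to change.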
<p>
General recommendations.
<ul>
<li>
You may find it helpful to examine papers that have used these methods to see
what commands were used and how they report their analyses, to understand
reproducibility expectations.
Specifically, examine the supplementary materials documents for the
following papers, as these provide some helpful details.
<ul>
<li>
M. Park et al. "Well-connectedness and community detection".
PLOS Complex Systems, 2024.
<a href="https://doi.org/10.1371/journal.pcsy.0000009">(link)</a>
<li>
T. Vu-Le et al. "Using Stochastic Block Models for Community
Detection". Applied Network Science Vol 11, article 2.
https://doi.org/10.1007/s41109-025-00747-2.
<a href="https://link.springer.com/article/10.1007/s41109-025-00747-2">(link)</a>
</ul>
<li>
If you have trouble with anything in this analysis,
let me know early -- but most likely you can figure it out yourself.
It's best to start this as early as possible (i.e., before Feb 7) to make sure
you know how to do everything.
</ul>
<p> Writing advice
<ul>
<li>
This homework involves not only doing the analyses but also writing them up
in a way that reflects the understanding you gain from the experiment
and enables reproducibility (so that the reader can
repeat your experiment exactly).
Therefore, give yourself at least a few days for writing; don't wait until
the last day to finish experiments!
<li>
Part of the grade is based on reproducibility, so
write up your work in a way that allows the reader to reproduce it.
This is not about grammar and spelling, etc., so once again
write this without assistive AI.
</ul>
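<p>
One possible way to support exact repetition of an experiment is to save a small record of each run next to its output.
The sketch below is only a suggestion under stated assumptions, not a required format:
the field names, the function <code>build_run_record</code>, and the seed value are all hypothetical,
and the package names listed are simply the ones you would replace with whatever you actually used.

```python
import json
import platform
import sys
from importlib import metadata


def package_version(name):
    """Return the installed version of a package, or None if it is absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def build_run_record(network, method, params, seed, packages):
    """Collect the settings needed to repeat one clustering run."""
    return {
        "network": network,
        "method": method,
        "params": params,
        "seed": seed,
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "packages": {name: package_version(name) for name in packages},
    }


record = build_run_record(
    network="topology (sbm+wcc)",
    method="Leiden-CPM",
    params={"resolution": 0.01},
    seed=1,  # hypothetical seed; record whatever value you really used
    packages=["leidenalg", "igraph"],
)
print(json.dumps(record, indent=2, sort_keys=True))
```

A record like this can be pasted into an appendix or kept alongside the exact commands you ran.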
<p>
Grading
<ul>
<li> 25% reproducibility
<li> 25% figures or tables showing results
<li> 50% discussion of results
<li> Up to an additional 10 points for extra work beyond the minimum
</ul>
Note: to receive full points, you must do at least the minimum
required (the two community detection methods as indicated above,
analyses of at least two EC-SBM networks based on SBM+WCC parameters,
and reporting AMI, ARI, and NMI accuracy).
Anything beyond this can earn extra credit.