Commit 6d8f227

Merge pull request #54 from polis-community/2025-06-02-patcon
Add selected repness and consensus statements to polis.run_clustering
2 parents 6e36a31 + d908026 commit 6d8f227

18 files changed: 1217 additions & 583 deletions

CHANGELOG.md

Lines changed: 13 additions & 0 deletions
@@ -7,10 +7,23 @@
 - Add `select_consensus_statements()` function, and wire into Polis implementation.
 - Allow `calculate_comment_statistics()` to work without groups/labels.
 - Generalize `format_comment_stats()` to work for group and consensus statements.
+- Add `select_representative_statements()` to PolisClusteringResult as `repness` key.
+- Rename arg `pick_n` to `pick_max` in `select_consensus_statements()`, for clarity and consistency.
+- Slight change to PolisRepness type, so group IDs now returned as ints.
+- Add `print_selected_statements()` presenter for inspecting `PolisClusteringResult`.
+- Add `print_consensus_statements()` presenter for inspecting `PolisClusteringResult`.
+- Allow `pick_max` and `confidence` interval args to be set in `polis.run_clustering()`.
+- Allow `get_corrected_centroid_guesses()` to unflip each axis if correction not needed.
+
+### Fixes
+- Handle when `is-meta` and `is-seed` columns arrive in CSV import.
+[`#55`](https://github.com/polis-community/red-dwarf/issues/55)
+- Handle loading comments data from API when `is_meta` missing in CSV import.
 
 ### Chores
 
 - Update the release process instructions.
+- Added `simulate_api_response()` test helper for easier comparison with polismath output.
 
 ## [0.3.0][] (2025-04-29)
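
The two Fixes entries above deal with flag columns that may arrive hyphenated (`is-meta`, `is-seed`) or not at all. The sketch below shows one way such normalization could look, purely as an illustration. It is not red-dwarf's code: `normalize_comment_row`, `read_comments`, and the choice of `False` as the default for a missing flag are all assumptions made for this example.

```python
import csv
import io

# Hypothetical names for this sketch; not taken from red-dwarf.
FLAG_COLUMNS = ("is_meta", "is_seed")


def normalize_comment_row(row: dict) -> dict:
    """Return a copy of the row with underscore-named boolean flag columns."""
    normalized = dict(row)
    for flag in FLAG_COLUMNS:
        hyphenated = flag.replace("_", "-")
        # Accept the hyphenated spelling, e.g. "is-meta" becomes "is_meta".
        if hyphenated in normalized:
            normalized[flag] = normalized.pop(hyphenated)
        # Assumed default: a missing flag is treated as False.
        raw = normalized.get(flag, False)
        normalized[flag] = str(raw).strip().lower() in ("1", "true")
    return normalized


def read_comments(csv_text: str) -> list[dict]:
    """Parse CSV text and normalize the flag columns of every row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [normalize_comment_row(row) for row in reader]
```

Other columns are left untouched, so a file without any flag columns still loads and simply reports both flags as `False`.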

docs/api_reference.md

Lines changed: 14 additions & 0 deletions
@@ -71,6 +71,20 @@ use in Scikit-Learn workflows, pipelines, and APIs.
 options:
 show_root_heading: true
 
+## `reddwarf.utils.stats`
+
+### ::: reddwarf.utils.stats.select_representative_statements
+options:
+show_root_heading: true
+
+### ::: reddwarf.utils.stats.calculate_comment_statistics
+options:
+show_root_heading: true
+
+### ::: reddwarf.utils.stats.calculate_comment_statistics_dataframes
+options:
+show_root_heading: true
+
 ## `reddwarf.utils`
 
 (These are in the process of being either moved or deprecated.)

docs/notebooks/loading-data.ipynb

Lines changed: 5 additions & 3 deletions
@@ -176,7 +176,8 @@
 "\n",
 "# All of these are equivalent:\n",
 "loader = Loader(polis_id=\"r7dr5tzke7pbpbajynkv8\", data_source=\"csv_export\")\n",
-"loader = Loader(directory_url=\"https://pol.is/api/v3/reportExport/r7dr5tzke7pbpbajynkv8/\")\n",
+"# Doesn't work for now. See: https://github.com/polis-community/red-dwarf/issues/56\n",
+"# loader = Loader(directory_url=\"https://pol.is/api/v3/reportExport/r7dr5tzke7pbpbajynkv8/\")\n",
 "\n",
 "# math_data and conversation_data only populate from the \"api\" data_source.\n",
 "assert_fully_populated(loader, ignore=[\"math_data\", \"conversation_data\"])\n",
@@ -314,7 +315,8 @@
 "# The loader will look for files with these names:\n",
 "# - comments.csv\n",
 "# - votes.csv\n",
-"loader = Loader(directory_url=\"https://raw.githubusercontent.com/compdemocracy/openData/refs/heads/master/scoop-hivemind.ubi/\")\n",
+"# Doesn't work for now. See: https://github.com/polis-community/red-dwarf/issues/56\n",
+"# loader = Loader(directory_url=\"https://raw.githubusercontent.com/compdemocracy/openData/refs/heads/master/scoop-hivemind.ubi/\")\n",
 "\n",
 "assert_fully_populated(loader, ignore=[\"math_data\", \"conversation_data\"])\n",
 "print_summary(loader)"
@@ -420,7 +422,7 @@
 "source": [
 "# By default, the Loader imports data from the https://pol.is API.\n",
 "# You can also choose to import data from an alternative instance.\n",
-"Loader(polis_instance_url=\"https://preprod.pol.is\", polis_id=\"r7kfpvrhdpyykbhnirtcd\")\n",
+"Loader(polis_instance_url=\"https://polis.tw\", polis_id=\"r7xrbjj7brcxmcfmeun2u\")\n",
 "\n",
 "assert_fully_populated(loader, ignore=[\"math_data\", \"conversation_data\"])\n",
 "print_summary(loader)"
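
The notebook comments in the second hunk say a `directory_url` is expected to contain `comments.csv` and `votes.csv`. While that option is disabled (issue #56), the sketch below shows how the two file URLs could be derived from a directory URL with the standard library. `resolve_export_urls` is a hypothetical helper written for this illustration; it makes no claim about how `Loader` resolves these paths internally.

```python
from urllib.parse import urljoin

# File names as listed in the notebook comment above.
EXPECTED_FILES = ("comments.csv", "votes.csv")


def resolve_export_urls(directory_url: str) -> dict:
    """Map each expected file name to its full URL under directory_url.

    Hypothetical helper for illustration only.
    """
    # urljoin replaces the last path segment unless the base ends with "/",
    # so add the trailing slash when it is missing.
    base = directory_url if directory_url.endswith("/") else directory_url + "/"
    return {name: urljoin(base, name) for name in EXPECTED_FILES}
```

The resulting URLs could then be fetched with any HTTP client and parsed as ordinary CSV files.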

docs/notebooks/polis-implementation-demo.ipynb

Lines changed: 56 additions & 29 deletions
@@ -18,7 +18,7 @@
 "base_uri": "https://localhost:8080/"
 },
 "id": "kEyVHx6y7zpu",
-"outputId": "fc4e261e-4328-4241-b2e8-bd5d80b1f740"
+"outputId": "edae961b-e68b-4ca6-8d67-38f36bc6d970"
 },
 "outputs": [
 {
@@ -28,10 +28,10 @@
 " Installing build dependencies ... \u001b[?25l\u001b[?25hdone\n",
 " Getting requirements to build wheel ... \u001b[?25l\u001b[?25hdone\n",
 " Preparing metadata (pyproject.toml) ... \u001b[?25l\u001b[?25hdone\n",
-"\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m116.1/116.1 kB\u001b[0m \u001b[31m26.3 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
-"\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m161.7/161.7 kB\u001b[0m \u001b[31m23.3 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
-"\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m61.4/61.4 kB\u001b[0m \u001b[31m194.0 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
-"\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m66.5/66.5 kB\u001b[0m \u001b[31m158.9 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
+"\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m116.1/116.1 kB\u001b[0m \u001b[31m7.6 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
+"\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m161.7/161.7 kB\u001b[0m \u001b[31m19.6 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
+"\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m61.4/61.4 kB\u001b[0m \u001b[31m166.9 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
+"\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m69.2/69.2 kB\u001b[0m \u001b[31m138.9 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
 "\u001b[?25h Building wheel for red-dwarf (pyproject.toml) ... \u001b[?25l\u001b[?25hdone\n"
 ]
 }
@@ -48,7 +48,7 @@
 "base_uri": "https://localhost:8080/"
 },
 "id": "NkGdoHCy8RdA",
-"outputId": "ac246841-e319-494f-eb90-431fec639bc2"
+"outputId": "b64eb1be-64a6-4e2a-91c4-8debdffeb9d5"
 },
 "outputs": [
 {
@@ -125,24 +125,24 @@
 "\n",
 "# In this conversation, any -1 is moderated out. Matches upstream behavior.\n",
 "# TODO: Investigate why is_strict_moderation doesn't affect this.\n",
-"_, _, mod_out_statement_ids, _ = process_statements(statements)\n",
+"_, _, mod_out_statement_ids, meta_statement_ids = process_statements(statements)\n",
 "print(f\"{math_data['mod-out']=}\")\n",
 "print(f\"{mod_out_statement_ids=}\")\n",
 "\n",
 "# We can run this from scratch, but kmeans is non-deterministic and might find slightly different clusters\n",
 "# or even different k-values (number of groups) if the silhouette scores it finds are better.\n",
 "# To show how to reproduce Polis results, we'll set init guess coordinates that we know polis platform got:\n",
-"init_cluster_center_guesses = get_corrected_centroid_guesses(math_data, skip_correction=False)\n",
+"init_cluster_center_guesses = get_corrected_centroid_guesses(math_data)\n",
 "print(f\"{init_cluster_center_guesses=}\")"
 ],
 "metadata": {
 "id": "EAfHaFFIhYw7",
-"outputId": "18e03149-988a-4919-e8a5-41a3791a782a",
+"outputId": "f83c0154-20e5-482d-90eb-d7121fd0b26b",
 "colab": {
 "base_uri": "https://localhost:8080/"
 }
 },
-"execution_count": 3,
+"execution_count": 8,
 "outputs": [
 {
 "output_type": "stream",
@@ -162,6 +162,7 @@
 "result = run_clustering(\n",
 " votes=votes,\n",
 " mod_out_statement_ids=mod_out_statement_ids,\n",
+" meta_statement_ids=meta_statement_ids,\n",
 " # If clustering is getting ready to find a new k, more need to uncomment\n",
 " # this to properly reproduce Polis visualization.\n",
 " #\n",
@@ -182,9 +183,9 @@
 "base_uri": "https://localhost:8080/"
 },
 "id": "HnTewjhSIb0a",
-"outputId": "e4c69d54-3382-4225-8c7f-01a84a783fdd"
+"outputId": "90fd02f8-95ea-43c7-db0f-208ff57ea146"
 },
-"execution_count": 4,
+"execution_count": 9,
 "outputs": [
 {
 "output_type": "stream",
@@ -223,9 +224,9 @@
 "height": 469
 },
 "id": "u_NmYu_bIfLR",
-"outputId": "d5bd08da-447d-429d-9990-c99113ea886a"
+"outputId": "e946c98b-fa86-4802-bcfa-a142b6503512"
 },
-"execution_count": 5,
+"execution_count": 10,
 "outputs": [
 {
 "output_type": "stream",
@@ -265,29 +266,51 @@
 {
 "cell_type": "code",
 "source": [
-"from reddwarf.utils.stats import select_representative_statements\n",
-"from reddwarf.data_presenter import print_repness\n",
+"from reddwarf.data_presenter import print_selected_statements\n",
 "\n",
-"repness = select_representative_statements(\n",
-" grouped_stats_df=result.group_comment_stats,\n",
-" mod_out_statement_ids=mod_out_statement_ids,\n",
-")\n",
-"print_repness(repness=repness, statements_data=statements)\n"
+"print_selected_statements(result=result, statements_data=statements)\n"
 ],
 "metadata": {
 "id": "06pUuMhWKw5H",
-"outputId": "f9b0c6f4-e1f2-49cc-c4c4-ce7b6daa5c08",
+"outputId": "8add8b42-b47b-47be-e1b9-65fe1215f6df",
 "colab": {
 "base_uri": "https://localhost:8080/"
 }
 },
-"execution_count": 6,
+"execution_count": 11,
 "outputs": [
 {
 "output_type": "stream",
 "name": "stdout",
 "text": [
-"GROUP A\n",
+"# CONSENSUS STATEMENTS\n",
+"\n",
+"## FOR AGREEMENT\n",
+"\n",
+"* Authoritarian populist parties worldwide figured out how to weaponize trust and social media, winning elections.\n",
+" 86% of everyone who voted on statement 28 agreed.\n",
+"\n",
+"* We realized that information warfare is occurring by nonstate actors in destabilizing the international order\n",
+" 80% of everyone who voted on statement 20 agreed.\n",
+"\n",
+"* 2018 has been marked by the troubling rise of authoritarian leaders around the world.\n",
+" 88% of everyone who voted on statement 39 agreed.\n",
+"\n",
+"* The conversation about ethical uses of technology has reached a tipping point. Citizens, businesses and governments are on it, but baffled.\n",
+" 77% of everyone who voted on statement 27 agreed.\n",
+"\n",
+"* 2018 was the year Americans stopped thinking Silicon Valley was “different” or distinct from Wall St or the military industrial complex\n",
+" 74% of everyone who voted on statement 23 agreed.\n",
+"\n",
+"## FOR DISAGREEMENT\n",
+"\n",
+"None.\n",
+"\n",
+"\n",
+"# GROUP-REPRESENTATIVE STATEMENTS\n",
+"\n",
+"## GROUP A\n",
+"\n",
 "* Major regulatory interference in the operation of Facebook's algorithms and policies is now definitely going to happen, in the USA.\n",
 " 100% of those in group A who voted on statement 11 agreed.\n",
 "\n",
@@ -304,12 +327,14 @@
 " 100% of those in group A who voted on statement 15 disagreed.\n",
 "\n",
 "\n",
-"GROUP B\n",
+"## GROUP B\n",
+"\n",
 "* Swing Left's campaign in waiting: building grassroots donors and volunteers during the primary, ready to go for the winning candidate.\n",
 " 55% of those in group B who voted on statement 38 agreed.\n",
 "\n",
 "\n",
-"GROUP C\n",
+"## GROUP C\n",
+"\n",
 "* Cyber-security is still not taken seriously enough by most people in the politics-tech world.\n",
 " 100% of those in group C who voted on statement 5 agreed.\n",
 "\n",
@@ -326,7 +351,8 @@
 " 70% of those in group C who voted on statement 14 disagreed.\n",
 "\n",
 "\n",
-"GROUP D\n",
+"## GROUP D\n",
+"\n",
 "* Facebook implementing local news and local government alerts directly into its product\n",
 " 85% of those in group D who voted on statement 15 agreed.\n",
 "\n",
@@ -343,7 +369,8 @@
 " 80% of those in group D who voted on statement 34 agreed.\n",
 "\n",
 "\n",
-"GROUP E\n",
+"## GROUP E\n",
+"\n",
 "* The realisation that the Republicans are now just as good at the parts of Digital comms that actually influence elections as the Democrats\n",
 " 83% of those in group E who voted on statement 16 agreed.\n",
 "\n",
@@ -380,4 +407,4 @@
 },
 "nbformat": 4,
 "nbformat_minor": 0
-}
+}
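
The new notebook output reports, for each consensus statement, the share of all voters who agreed, and the changelog mentions a `pick_max` cap. The sketch below illustrates that idea with a plain agreement share and a cutoff. It is a simplification under stated assumptions, not the library's selection logic: the function names, the `min_share` threshold, the `1 = agree` vote encoding, and ranking by raw share are all hypothetical. The statement order in the output above does not follow raw percentage, so the real implementation evidently ranks by some other metric.

```python
from collections import defaultdict

# Assumed vote encoding for this sketch: 1 = agree, -1 = disagree, 0 = pass.
AGREE = 1


def select_consensus_agree(votes, pick_max=5, min_share=0.5):
    """Pick at most pick_max statements whose agree share exceeds min_share.

    votes is an iterable of (statement_id, vote) pairs.
    Hypothetical helper; ranks by raw agree share, highest first.
    """
    agreed = defaultdict(int)
    total = defaultdict(int)
    for statement_id, vote in votes:
        total[statement_id] += 1
        if vote == AGREE:
            agreed[statement_id] += 1
    shares = {sid: agreed[sid] / total[sid] for sid in total}
    ranked = sorted(
        (sid for sid, share in shares.items() if share > min_share),
        key=lambda sid: shares[sid],
        reverse=True,
    )
    return [(sid, shares[sid]) for sid in ranked[:pick_max]]


def format_consensus_line(statement_id, share):
    """Render a share in the style of the notebook output above."""
    return f"{share:.0%} of everyone who voted on statement {statement_id} agreed."
```

Because `pick_max` is an upper bound rather than a target, the function may return fewer statements than requested, or none, which mirrors the "None." shown under FOR DISAGREEMENT.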
