MediaWiki Client - read and write Wikipedia, follow recent edits, access machine learning analytics

I’m seeking early adopters and reviewers for the mediawiki_client library (docs), which aims to wrap all of the major access methods for Wikipedia content.

The supported APIs are functionally complete and the library includes tests. My current goal is to gather feedback on the exposed interfaces and to settle on an idiomatic Elixir style before beginning the 1.x releases.

Examples

Watch all Wikimedia sites and print a link to each of the next six pages as they are created:

```elixir
Wiki.EventStreams.start_link(streams: "page-create")
Wiki.EventStreams.stream()
|> Stream.take(6)
|> Enum.each(fn event -> IO.puts(event["meta"]["uri"]) end)
# https://www.wikidata.org/wiki/Q111963009
# https://en.wikipedia.org/wiki/Rhinella_xerophylla
# https://fr.wiktionary.org/wiki/stachst_aus
# https://commons.wikimedia.org/wiki/File:St_Swithin,_Quenington_-_Wall_monument_-_geograph.org.uk_-_3514076.jpg
# https://commons.wikimedia.org/wiki/File:Nebria_brevicollis_(Carabidae)_-_(imago),_Henshuisterveld,_the_Netherlands.jpg
# https://commons.wikimedia.org/wiki/Category:Protected_areas_of_Russia/5630230
```

Get current site statistics for German Wikipedia:

```elixir
Wiki.Site.get!("dewiki")
|> Wiki.Action.new()
|> Wiki.Action.get!(
  action: :query,
  meta: :siteinfo,
  siprop: :statistics
)
# %Wiki.Action.Session{
#   ...
#   result: %{
#     "batchcomplete" => true,
#     "query" => %{
#       "statistics" => %{
#         "activeusers" => 19687,
#         "admins" => 188,
#         "articles" => 2583285,
#         "edits" => 211219883,
#         "images" => 130199,
#         "jobs" => 0,
#         "pages" => 7163473,
#         "queued-massmessages" => 0,
#         "users" => 3715461
#       }
#     }
#   },
#   ...
# }
```

Fetch all information about Douglas Adams (Q42) from Wikidata:
```elixir
Wiki.Site.get!("wikidatawiki")
|> Wiki.Action.new()
|> Wiki.Action.get!(
  action: :wbgetentities,
  ids: "Q42"
)
# result: %{
#   "entities" => %{
#     "Q42" => %{
#       "aliases" => %{
#         "uk" => [
#           %{"language" => "uk", "value" => "Дуглас Ноел Адамс"},
#           %{"language" => "uk", "value" => "Адамс Дуглас"}
#         ],
#         "pt" => [
#           %{"language" => "pt", "value" => "Douglas Noël Adams"},
#           %{"language" => "pt", "value" => "Douglas Noel Adams"}
#         ],
#         ...
```

Request ORES scoring for an edit, to predict whether it was damaging:

```elixir
Wiki.Ores.new("enwiki")
|> Wiki.Ores.request!(
  models: ["damaging"],
  revids: 456789
)
# %{
#   "enwiki" => %{
#     "models" => %{
#       "damaging" => %{"version" => "0.5.1"}
#     },
#     "scores" => %{
#       "456789" => %{
#         "damaging" => %{
#           "score" => %{
#             "prediction" => false,
#             "probability" => %{
#               "false" => 0.9784615344695441,
#               "true" => 0.021538465530455946
#             }
#           }
#         }
#       }
#     }
#   }
# }
```
"""
7 Likes

I’ve spun out a streaming bzip2 codec as part of the work to support Elixir processing of Wikipedia dumps.

Example usage:

```elixir
Mix.install([:bzip2])

:inets.start()
:ssl.start()

# :httpc and friends expect charlists rather than Elixir strings
tmpfile = '/tmp/articles.xml.bz2'
url = 'https://dumps.wikimedia.org/testwiki/20220501/testwiki-20220501-pages-articles.xml.bz2'

{:ok, :saved_to_file} = :httpc.request(:get, {url, []}, [], [stream: tmpfile])

# Read the saved dump in ~900 KiB chunks and decompress lazily
File.stream!(to_string(tmpfile), [], 900 * 1024)
|> Bzip2.decompress!()
|> Enum.into("")
#  ...
#  <page>
#    <title>Wikipedia:Help</title>
#    <ns>4</ns>
#    <id>80</id>
#    <redirect title="Main Page" />
#    <revision>
#      <id>438551</id>
#      <parentid>438171</parentid>
#      <timestamp>2020-07-01T07:11:36Z</timestamp>
#      <contributor>
#        <username>JohanahoJ</username>
#        <id>37147</id>
#      </contributor>
#      <comment>Changed redirect target from [[Wikipedia:Main Page]] to [[Main Page]]</comment>
#      <model>wikitext</model>
#      <format>text/x-wiki</format>
#  ...
```

I would have liked to give a fully streamed example, but :httpc requires more than a dab of glue to do so, and I can’t find any other currently working library to stream HTTP.
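
For the curious, here is roughly what that glue could look like: wrapping :httpc’s asynchronous mode, which delivers body chunks as ordinary process messages, in a Stream.resource/3. The HttpStream module below is a hypothetical name of mine and an untested sketch, not part of any library:

```elixir
defmodule HttpStream do
  # With sync: false and stream: :self, :httpc sends the response
  # body to the calling process as messages, which we re-emit lazily.
  def get!(url) do
    Stream.resource(
      fn ->
        {:ok, ref} =
          :httpc.request(:get, {to_charlist(url), []}, [],
            sync: false,
            stream: :self
          )

        ref
      end,
      fn ref ->
        receive do
          {:http, {^ref, :stream_start, _headers}} -> {[], ref}
          {:http, {^ref, :stream, chunk}} -> {[chunk], ref}
          {:http, {^ref, :stream_end, _headers}} -> {:halt, ref}
        end
      end,
      # Clean up in case the consumer halts early, e.g. via Stream.take/2.
      fn ref -> :httpc.cancel_request(ref) end
    )
  end
end

# Reusing the url bound above, no temporary file needed:
HttpStream.get!(to_string(url))
|> Bzip2.decompress!()
|> Stream.take(1)
|> Enum.to_list()
```

Note the receive loop has no timeout, so a stalled connection would block the consumer; real code would want timeouts and error handling.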

Since it’s fun, I also decided to upgrade the Server-sent Events listener to use a new Broadway integration, replacing the homespun GenStage relay. The library is already functional but very rough—feedback encouraged! Adam Wight / Off-Broadway Server-sent Events · GitLab
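
Consuming it should look like any other Broadway pipeline. The sketch below is only my guess at the wiring; the producer module name and its options are placeholders rather than the documented API, so check the repo for the real thing:

```elixir
defmodule RecentChanges do
  use Broadway

  def start_link(_opts) do
    Broadway.start_link(__MODULE__,
      name: __MODULE__,
      producer: [
        # Placeholder producer module and options.
        module:
          {OffBroadway.EventSource.Producer,
           url: "https://stream.wikimedia.org/v2/stream/recentchange"},
        concurrency: 1
      ],
      processors: [
        default: [concurrency: 2]
      ]
    )
  end

  @impl true
  def handle_message(_processor, message, _context) do
    # Each message wraps one server-sent event; the payload shape
    # depends on the producer.
    IO.inspect(message.data)
    message
  end
end
```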

Now with Livebook examples.

Also, at my day job, my team is writing a complementary library to parse wiki article HTML dumps. It already works, but will be finalized in a few weeks; then we’ll run it on all available dumps.

We’re looking at article footnotes for our particular use case, but the pipeline can be generalized to run arbitrary other parsing modules, along the lines of the sketch below.
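
To illustrate the idea (none of this is the actual library, which isn’t public yet), one way to make the parsing modules pluggable is a small behaviour that each analysis implements:

```elixir
# Hypothetical behaviour: each analysis turns one article’s HTML
# into a map of extracted facts.
defmodule DumpParser do
  @callback parse_article(html :: String.t()) :: map()
end

defmodule FootnoteCounter do
  @behaviour DumpParser

  # Toy stand-in for real footnote extraction: count <ref> openings.
  @impl true
  def parse_article(html) do
    %{footnotes: length(String.split(html, "<ref")) - 1}
  end
end

# The dump-walking pipeline then maps articles through whichever
# modules are configured:
articles = [~s(<p>So long<ref>DNA</ref> and thanks<ref>fish</ref></p>)]
Enum.map(articles, &FootnoteCounter.parse_article/1)
# [%{footnotes: 2}]
```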
