MediaWiki Client - read and write Wikipedia, follow recent edits, access machine learning analytics

Seeking early adopters and reviewers of the mediawiki_client library (docs), which aims to wrap all of the major access methods for Wikipedia content.

The supported APIs are functionally complete and the library includes tests. My current goal is to get feedback on the exposed interfaces and to settle on an idiomatic Elixir style before beginning the 1.x releases.

Examples

Watch all Wikimedia sites and print links to the next 6 pages as they are created:

Wiki.EventStreams.start_link(streams: "page-create")
Wiki.EventStreams.stream()
|> Stream.take(6)
|> Enum.each(fn event -> IO.puts(event["meta"]["uri"]) end)
# https://www.wikidata.org/wiki/Q111963009
# https://en.wikipedia.org/wiki/Rhinella_xerophylla
# https://fr.wiktionary.org/wiki/stachst_aus
# https://commons.wikimedia.org/wiki/File:St_Swithin,_Quenington_-_Wall_monument_-_geograph.org.uk_-_3514076.jpg
# https://commons.wikimedia.org/wiki/File:Nebria_brevicollis_(Carabidae)_-_(imago),_Henshuisterveld,_the_Netherlands.jpg
# https://commons.wikimedia.org/wiki/Category:Protected_areas_of_Russia/5630230
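
Each event is a decoded JSON map, and the `"meta"` section shown above carries the source wiki's domain, so the firehose can be narrowed to a single site with an ordinary `Stream.filter/2`. A sketch of the filter over hypothetical sample events (the maps here are invented, shaped like the output above):

```elixir
# Invented sample events, shaped like the payloads printed above:
events = [
  %{"meta" => %{"domain" => "en.wikipedia.org", "uri" => "https://en.wikipedia.org/wiki/Rhinella_xerophylla"}},
  %{"meta" => %{"domain" => "www.wikidata.org", "uri" => "https://www.wikidata.org/wiki/Q111963009"}}
]

# Keep only events from English Wikipedia and pull out their links:
english_only =
  events
  |> Enum.filter(fn event -> event["meta"]["domain"] == "en.wikipedia.org" end)
  |> Enum.map(fn event -> event["meta"]["uri"] end)
# => ["https://en.wikipedia.org/wiki/Rhinella_xerophylla"]
```

In live use the same predicate slots into the pipeline above, e.g. `Wiki.EventStreams.stream() |> Stream.filter(...) |> Stream.take(6)`.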

Get current site statistics for German Wikipedia:

Wiki.Site.get!("dewiki")
|> Wiki.Action.new()
|> Wiki.Action.get!(
  action: :query,
  meta: :siteinfo,
  siprop: :statistics
)
# %Wiki.Action.Session{
#   ...
#   result: %{
#     "batchcomplete" => true,
#     "query" => %{
#       "statistics" => %{
#         "activeusers" => 19687,
#         "admins" => 188,
#         "articles" => 2583285,
#         "edits" => 211219883,
#         "images" => 130199,
#         "jobs" => 0,
#         "pages" => 7163473,
#         "queued-massmessages" => 0,
#         "users" => 3715461
#       }
#     }
#   },
#   ...
# }
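
The `result` field on the returned session is plain decoded JSON, so individual statistics come out with ordinary nested access such as `get_in/2`. A sketch against a sample map in the shape above (values copied from the example output):

```elixir
# Sample result map, shaped like the session's result field above:
result = %{
  "batchcomplete" => true,
  "query" => %{
    "statistics" => %{"activeusers" => 19_687, "articles" => 2_583_285}
  }
}

articles = get_in(result, ["query", "statistics", "articles"])
# => 2583285
```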

Fetch all information about Douglas Adams contained on Wikidata:
Wiki.Site.get!("wikidatawiki")
|> Wiki.Action.new()
|> Wiki.Action.get!(
    action: :wbgetentities,
    ids: "Q42"
)
# result: %{
# 	"entities" => %{
# 		"Q42" => %{
# 			"aliases" => %{
# 				"uk" => [
# 					%{"language" => "uk", "value" => "Дуглас Ноел Адамс"},
# 					%{"language" => "uk", "value" => "Адамс Дуглас"}
# 				],
# 				"pt" => [
# 					%{"language" => "pt", "value" => "Douglas Noël Adams"},
# 					%{"language" => "pt", "value" => "Douglas Noel Adams"}
# 				],
# 				...
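
Since the `"entities"` map is again plain decoded JSON, all alias strings across every language can be collected with `Enum.flat_map/2` over the per-language lists. A sketch over a fragment of the output above (sorted to make the order deterministic):

```elixir
# A fragment of the entities map shown above:
entities = %{
  "Q42" => %{
    "aliases" => %{
      "uk" => [
        %{"language" => "uk", "value" => "Дуглас Ноел Адамс"},
        %{"language" => "uk", "value" => "Адамс Дуглас"}
      ],
      "pt" => [
        %{"language" => "pt", "value" => "Douglas Noël Adams"},
        %{"language" => "pt", "value" => "Douglas Noel Adams"}
      ]
    }
  }
}

# Flatten every language's alias list into one sorted list of strings:
all_aliases =
  entities["Q42"]["aliases"]
  |> Enum.flat_map(fn {_lang, aliases} -> Enum.map(aliases, & &1["value"]) end)
  |> Enum.sort()
```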

Request ORES scoring for an edit, to predict whether it was damaging:

Wiki.Ores.new("enwiki")
|> Wiki.Ores.request!(
  models: ["damaging"],
  revids: 456789
)
# %{
#   "enwiki" => %{
#     "models" => %{
#       "damaging" => %{"version" => "0.5.1"}
#     },
#     "scores" => %{
#       "456789" => %{
#         "damaging" => %{
#           "score" => %{
#             "prediction" => false,
#             "probability" => %{
#               "false" => 0.9784615344695441,
#               "true" => 0.021538465530455946
#             }
#           }
#         }
#       }
#     }
#   }
# }
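
The scores come back keyed by revision ID (as strings), so a batch request's predictions can be collected in one pass. A sketch over a sample response in the shape above (probabilities shortened here):

```elixir
# Sample response, shaped like the ORES output above:
response = %{
  "enwiki" => %{
    "scores" => %{
      "456789" => %{
        "damaging" => %{
          "score" => %{
            "prediction" => false,
            "probability" => %{"false" => 0.978, "true" => 0.022}
          }
        }
      }
    }
  }
}

# Collect each revision's damaging prediction into a revid => boolean map:
predictions =
  response["enwiki"]["scores"]
  |> Enum.map(fn {revid, scores} ->
    {revid, get_in(scores, ["damaging", "score", "prediction"])}
  end)
  |> Map.new()
# => %{"456789" => false}
```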

I’ve spun out a streaming bzip2 codec as part of the work to support Elixir processing of Wikipedia dumps.

Example usage:

Mix.install([:bzip2])

:inets.start()
:ssl.start()
tmpfile = '/tmp/articles.xml.bz2'
url = 'https://dumps.wikimedia.org/testwiki/20220501/testwiki-20220501-pages-articles.xml.bz2'

{:ok, :saved_to_file} = :httpc.request(:get, {url, []}, [], [stream: tmpfile])

File.stream!(to_string(tmpfile), [], 900 * 1024)
|> Bzip2.decompress!()
|> Enum.into("")
#  ...
#  <page>
#    <title>Wikipedia:Help</title>
#    <ns>4</ns>
#    <id>80</id>
#    <redirect title="Main Page" />
#    <revision>
#      <id>438551</id>
#      <parentid>438171</parentid>
#      <timestamp>2020-07-01T07:11:36Z</timestamp>
#      <contributor>
#        <username>JohanahoJ</username>
#        <id>37147</id>
#      </contributor>
#      <comment>Changed redirect target from [[Wikipedia:Main Page]] to [[Main Page]]</comment>
#      <model>wikitext</model>
#      <format>text/x-wiki</format>
#  ...
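
Rather than accumulating the whole decompressed dump in memory with `Enum.into("")`, the chunks can be streamed straight to disk with `Stream.into/2` and `Stream.run/1`; in the pipeline above, the `Bzip2.decompress!/1` stage sits between source and sink. A stdlib-only sketch of the sink pattern, using invented stand-in chunks so it is self-contained:

```elixir
# Stand-in chunks; in the real pipeline these come from
# File.stream!(...) |> Bzip2.decompress!()
chunks = ["<page>", "<title>Wikipedia:Help</title>", "</page>"]

path = Path.join(System.tmp_dir!(), "articles.xml")

# Lazily copy every chunk into the output file, constant memory:
chunks
|> Stream.into(File.stream!(path))
|> Stream.run()

File.read!(path)
# => "<page><title>Wikipedia:Help</title></page>"
```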

I would have liked to give a fully streamed example, but :httpc needs more than a dab of glue to stream a response, and I couldn't find any other currently working library for streaming HTTP.
