Troubleshooting/benchmarking vertex deletion in a large graph with libgraph

I’m trying to fix an issue that was raised for libgraph about slow vertex deletion in large graphs.

I think the problem is that the current delete_vertex code loops over every entry in the collections of total edges, in-edges, and out-edges for every deletion. I thought that could be improved by looping over the total edges once and reducing into all of the new collections at the same time. In my simple test of a million vertices and roughly a million edges, I found minimal, if any, real difference in the time the deletion takes compared to the current implementation. Can anyone figure out why my intuition about this problem is wrong? Or, alternatively, why my test is wrong?

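For context, my rough mental model of the relevant fields of the %Graph{} struct, inferred from the delete_vertex code below, looks like this for a graph with a single edge 1 -> 2 (the value shapes, especially the edge metadata, are my assumption):

# ids shown as the vertex terms themselves for readability;
# real keys come from vertex_identifier.(v)
%{
  vertices: %{1 => 1, 2 => 2},
  vertex_labels: %{},
  out_edges: %{1 => MapSet.new([2])},
  in_edges: %{2 => MapSet.new([1])},
  edges: %{{1, 2} => %{}}
}
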
current implementation:

  def delete_vertex(
        %__MODULE__{out_edges: oe, in_edges: ie, edges: em, vertex_identifier: vertex_identifier} =
          g,
        v
      ) do
    vs = g.vertices
    ls = g.vertex_labels

    with v_id <- vertex_identifier.(v),
         true <- Map.has_key?(vs, v_id),
         oe <- Map.delete(oe, v_id),
         ie <- Map.delete(ie, v_id),
         vs <- Map.delete(vs, v_id),
         ls <- Map.delete(ls, v_id) do
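      # NB: each of the next three comprehensions walks an entire collection
      # (every out-edge set, every in-edge set, and every edge in the graph) on
      # every deletion, no matter how many edges actually touch v_id.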
      oe = for {id, ns} <- oe, do: {id, MapSet.delete(ns, v_id)}, into: %{}
      ie = for {id, ns} <- ie, do: {id, MapSet.delete(ns, v_id)}, into: %{}
      em = for {{id1, id2}, _} = e <- em, v_id != id1 && v_id != id2, do: e, into: %{}
      %__MODULE__{g | vertices: vs, vertex_labels: ls, out_edges: oe, in_edges: ie, edges: em}
    else
      _ -> g
    end
  end

attempt to fix:

  def delete_vertex(
        %__MODULE__{
          vertices: vs,
          vertex_labels: ls,
          out_edges: oe,
          in_edges: ie,
          edges: em,
          vertex_identifier: vertex_identifier
        } = g,
        v
      ) do
    with v_id <- vertex_identifier.(v),
         true <- Map.has_key?(vs, v_id),
         oe <- Map.delete(oe, v_id),
         ie <- Map.delete(ie, v_id),
         vs <- Map.delete(vs, v_id),
         ls <- Map.delete(ls, v_id) do
      # single pass over the edges map, updating all three collections as we go
      {em, oe, ie} =
        Enum.reduce(em, {em, oe, ie}, fn
          # self-loop: its in/out sets were already dropped above, so only the
          # edge entry needs removing (Map.update! would raise otherwise)
          {{^v_id, ^v_id} = k, _}, {e, o, i} ->
            {Map.delete(e, k), o, i}

          {{^v_id, id2} = k, _}, {e, o, i} ->
            {Map.delete(e, k), o, Map.update!(i, id2, &MapSet.delete(&1, v_id))}

          {{id1, ^v_id} = k, _}, {e, o, i} ->
            {Map.delete(e, k), Map.update!(o, id1, &MapSet.delete(&1, v_id)), i}

          _, acc ->
            acc
        end)

      %__MODULE__{g | vertices: vs, vertex_labels: ls, out_edges: oe, in_edges: ie, edges: em}
    else
      _ -> g
    end
  end

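Note that the ^ pins in the clause heads above matter: in an anonymous function clause, a bare v_id in the pattern introduces a new binding that matches anything, shadowing the outer v_id instead of comparing against it, so the first clause would match every edge. A quick illustration:

v_id = 1
no_pin = fn {v_id, _} -> v_id end  # rebinds v_id; matches ANY 2-tuple
pinned = fn {^v_id, x} -> x end    # matches only tuples whose first element == 1
no_pin.({2, :a}) #=> 2
pinned.({1, :b}) #=> :b
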
simple test in iex:

g = Graph.new()
vs = Enum.to_list(1..1_000_000)
# Graph structs are immutable, so the updated graph has to be threaded through a reduce
g = Enum.reduce(vs, g, fn i, acc -> Graph.add_vertex(acc, i) end)
g = vs |> Enum.zip(Enum.reverse(vs)) |> Enum.reduce(g, fn {i, o}, acc -> Graph.add_edge(acc, i, o) end)
g = vs |> Enum.zip(Enum.reverse(vs)) |> Enum.reduce(g, fn {i, o}, acc -> Graph.add_edge(acc, o, i) end)
# time deleting every 7th vertex; each call deletes from the same full graph
:timer.tc(fn -> 1..1_000_000//7 |> Enum.each(fn i -> Graph.delete_vertex(g, i) end) end, :millisecond)

Test doesn’t seem quite right: it times a million delete calls in a loop rather than isolating the cost of a single deletion.

Maybe try something like

g =
  Graph.new()
  |> Graph.add_edges(
    Enum.map(1..1_000_000, fn i -> {i, 1_000_000 - i} end)
  )

:timer.tc(
  fn ->
    g
    |> Graph.delete_vertex(
      1..1_000_000
      |> Enum.random()
    )
  end,
  :millisecond
)

> {976, #Graph<type: directed, num_vertices: 999896, num_edges: 999998>}

Know that in most graph implementations, performance will depend on how many edges there are and, potentially, on how connected a component is.

Libgraph is just maps and sets (which are themselves maps), so it will perform within the limits of the Erlang map implementation and :erlang.phash2/1.

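Concretely, assuming the default vertex_identifier (which, as I understand it, hashes the vertex term), a vertex lookup is a phash2 call plus a map access, and an edge lookup is a tuple key into the edges map:

v = {:some, "vertex", :term}
v_id = :erlang.phash2(v)
# vertex lookup: Map.get(g.vertices, v_id)
# edge lookup:   Map.get(g.edges, {v_id, other_id})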

Thanks, I’ll try that tonight. Actually, I’m not sure that test is the best either. Calling Enum.random inside the timed portion could mean the result is unduly influenced by the cost of the random selection.

I know there’s no improving past a certain limit, but it still seems to me that iterating over three maps key by key versus reducing over a single map key by key should make a significant difference with a large enough data set.

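To sanity-check that intuition outside of libgraph, the two traversal patterns can be timed over plain maps (a rough sketch; the names here are mine):

m = Map.new(1..1_000_000, fn i -> {i, i} end)

# three separate passes, one per collection (roughly the old shape)
three_passes = fn ->
  _ = for {k, v} <- m, into: %{}, do: {k, v}
  _ = for {k, v} <- m, into: %{}, do: {k, v}
  for {k, v} <- m, into: %{}, do: {k, v}
end

# one pass that builds all three collections at once (roughly the new shape)
one_reduce = fn ->
  Enum.reduce(m, {%{}, %{}, %{}}, fn {k, v}, {a, b, c} ->
    {Map.put(a, k, v), Map.put(b, k, v), Map.put(c, k, v)}
  end)
end

{t3, _} = :timer.tc(three_passes)
{t1, _} = :timer.tc(one_reduce)
{t3, t1}

If those come out close, it would suggest the per-entry map updates dominate and the number of passes over the data matters much less than I assumed.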

Yeah, that was more demonstrative. For a single vertex deletion I get 1228 ms with the old version versus 45 ms with the new version, almost a 30x improvement. Thanks for the help.


Hey @stevensonmt, thanks for working on this! It would greatly help with Hologram’s incremental compilation and live reloading. Hopefully there will also be a new release of libgraph; I had to temporarily vendor it in Hologram, since the last release is from 2022.


Curious what you mean by “vendor” the lib in this context?

I included it as part of the Hologram package, i.e. copied its source into the project tree so that it ships with the package: Vendor libgraph · bartblast/hologram@1241d4f · GitHub

Hologram used an unreleased version with a specific bug fix:

# It seems that libgraph is not maintained - there are pending pull requests and unreleased commits.
{:libgraph,
  git: "https://github.com/bitwalker/libgraph",
  ref: "460cdfd9a163a533bdab0a160ba7ccf888047927"},

All dependencies in your Hex release must also be Hex releases to make dependency resolution possible, so a package that declares a git dependency like the one above can’t be published to Hex.

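For anyone following along, the difference in mix.exs is roughly this (the Hex version requirement below is only for illustration):

# allowed in a package published to Hex:
{:libgraph, "~> 0.16"},

# fine for an application, but a package declaring this cannot itself be published to Hex:
{:libgraph,
  git: "https://github.com/bitwalker/libgraph",
  ref: "460cdfd9a163a533bdab0a160ba7ccf888047927"},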

I guess if you’re having to resort to that, you could base it on my recent commit to test the performance gain, at least until @bitwalker can review and merge it.


Just to revisit this: I don’t think the test in this thread was a valid assessment of the issue. I had missed that both implementations scale linearly with the number of edges, but that only the old implementation also scales with the number of nodes that have any edges. So tons of nodes with tons of edges each yields minimal difference, while tons of nodes with few edges each makes a big difference. The new implementation in the issue tracker now, I think, correctly addresses the issue by scaling only with the number of edges involving the node being deleted, independent of the total number of nodes or edges in the graph.

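In other words, the idea is roughly the following (a simplified sketch of the approach, not the exact code in the pull request):

# look up only the deleted vertex's own neighbour sets (self-loop dropped explicitly)
out_ns = oe |> Map.get(v_id, MapSet.new()) |> MapSet.delete(v_id)
in_ns = ie |> Map.get(v_id, MapSet.new()) |> MapSet.delete(v_id)

# remove v_id from just those neighbours' in/out sets
ie = Enum.reduce(out_ns, Map.delete(ie, v_id), fn n, acc ->
  Map.update(acc, n, MapSet.new(), &MapSet.delete(&1, v_id))
end)

oe = Enum.reduce(in_ns, Map.delete(oe, v_id), fn n, acc ->
  Map.update(acc, n, MapSet.new(), &MapSet.delete(&1, v_id))
end)

# drop only the edge entries that touch v_id
em =
  em
  |> Map.delete({v_id, v_id})
  |> Map.drop(Enum.map(out_ns, &{v_id, &1}))
  |> Map.drop(Enum.map(in_ns, &{&1, v_id}))

Every operation above touches only entries keyed by v_id or one of its direct neighbours, so the cost depends on the degree of the deleted vertex rather than on the total size of the graph.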

Great! After the maintainer of libgraph (bitwalker) merges the pull requests, I’ll benchmark your updates on a live organism.
