Looking for Interesting Master's Thesis Topics for Nx

Hi, I am currently looking for topics for my master's thesis in Data Science. All topics I have come across so far were applied (and better off done in Python, for several reasons). But a year ago I did a longer Nx project, applying several networks and a bit of Scholar and Explorer. The libraries have evolved since then, so I am not 100 % up to date.

One thing I have often read about, but never been able to try out myself, is easy remote execution. So I thought about how the capabilities of Numerical Elixir (Nx) compare to Spark. Given that this is a) doable and b) usable for others, the thesis could be to add some of those capabilities.

From my understanding, partitioning and load sharing, especially for operations like joins, do not currently exist. Also, is there no multicore capability?

I study Data Science part-time. Hence my background is currently 70 % GIS (work) and 30 % Data Science (studies), although there is a huge overlap between the two, both in Python (I also do Data Science within GIS). Before that, I did hardware programming in C (and studied Computer Engineering).

I am aware of the threads/posts:

Hence, my idea is probably not the best?

1 Like

Hi @sehHeiden,

I think distributed, remote computation is very much on the minds of the Elixir Data Science community. Check out this keynote from ElixirConf just a few weeks ago:

I’m sure there are ideas in this area that would be great for a Master’s Thesis.

1 Like

Thanks, @billylanchantin, for sharing the video. So do I interpret it correctly that most of the issues I mentioned are considered done?

I still have some questions to understand the tech, though.

When I see that I can use it with fly.io or with k8s, I wonder what it costs to train a DL model[1] :slight_smile: and how easy it is to set up a local k8s cluster compared to a Spark cluster. At least Spark clusters can be set up with Docker, which I use sometimes, but I have never used k8s. Nobody in our small company has ever done that either. :wink:

The other thing is that it is all multi-machine, but multiprocessing, for example, is not mentioned. When there is a longer calculation per row of a DataFrame, multiprocessing can reduce the calculation time, as in Dask. I just checked the Polars docs: it can use multiple cores. So I assume that is also true for Explorer, although I have not found it in the docs.
I still wonder how it would work with an Explorer DataFrame partitioned across several machines in a cluster. But that is probably not enough for a thesis!?
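For concreteness, this is the kind of Explorer work I mean (a minimal sketch with made-up columns; whether the Polars backend parallelizes it across cores is exactly what I could not find in the docs):

```elixir
# Minimal Explorer sketch. The grouping/aggregation is delegated to the
# Polars backend (Rust); any multicore behaviour happens there, not at the
# Elixir level. Column names are made up.
require Explorer.DataFrame, as: DF

df =
  DF.new(
    station: ["a", "b", "a", "b"],
    temp: [10.5, 12.0, 11.2, 13.4]
  )

df
|> DF.group_by("station")
|> DF.summarise(mean_temp: mean(temp))
```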

Perhaps a GPU backend for Explorer? That is probably more work. But how scientific would that work be?


  1. Okay, the CPU prices look good. It's just the GPU prices… Well, you have to pay for them yourself and spin up a larger cluster for hours. ↩︎

So do I interpret it correctly that most of the issues I mentioned are considered done?

You can do a version of distributed computation with FLAME + Explorer, sure. But is it done? Certainly not.
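To make that concrete, here's roughly what it looks like today (a sketch, assuming a configured FLAME backend such as FLAME.FlyBackend; the pool name, file, and columns are placeholders):

```elixir
# Sketch only: assumes a FLAME backend is configured and that the parquet
# file is reachable from the remote machine. Names are made up.
require Explorer.DataFrame, as: DF

# In the application supervision tree:
children = [
  {FLAME.Pool, name: MyApp.DataPool, min: 0, max: 4, max_concurrency: 2}
]

# Later: push a heavy Explorer computation onto a remote machine and only
# ship the (small) aggregated result back.
summary =
  FLAME.call(MyApp.DataPool, fn ->
    "events.parquet"
    |> DF.from_parquet!()
    |> DF.group_by("user_id")
    |> DF.summarise(total_spend: sum(amount))
  end)
```

Note there is no partitioning story in this sketch: a single remote machine reads the whole file.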

For instance, I doubt there exists a theoretical model for how data is processed in that setup (though if I'm wrong, someone call me out!). Creating one is probably worthwhile. Then we could answer: what are the current limits of our processing capabilities? Could we be processing more? Could we introduce configuration settings that make certain workloads easier/faster/more efficient?

And that’s just off the top of my head. There are tons of open questions even in that one project.

The other thing is that it is all multi-machine, but multiprocessing, for example, is not mentioned.

It’s funny you mention that. FLAME + Explorer is just one example from this area. Let me send you to this Livebook/Nx announcement from a year ago:

That link has this video at the top, and I’ll send you to a specific timestamp:

José says:

But it gets even better. Because, what we can do now is that we can also make this distributed. And “distributed” is a funny word because depending on who you ask it has different meanings. So if you asked me two years ago I would say well distributed is when you have multiple machines communicating with each other. But if you ask a machine learning engineer they may say well distributed is when you have more than one GPU in your machine and you’re using those GPUs and sometimes even communicating across those GPUs. And in order to avoid confusion, Nx can do both. We can do both kinds of distributions. And that’s what I want to show you…

So in Nx also there is some pre-existing work. But again, I’d never describe the work as “done”.
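For a taste of that pre-existing work, here's a rough sketch of Nx.Serving, which is (I believe) the piece José demos there: it wraps a numerical function, batches requests, and `Nx.Serving.batched_run/2` can be called from any node in the cluster while the node with the accelerator does the work (names below are made up; EXLA/CUDA setup and multi-GPU partitioning are assumed, not shown):

```elixir
# Sketch: a serving that doubles tensors. Started under a supervisor on the
# "compute" node, it can be invoked from any connected node.
serving =
  Nx.Serving.new(fn opts ->
    Nx.Defn.jit(fn t -> Nx.multiply(t, 2) end, opts)
  end)

# In the supervision tree of the node that owns the GPU/CPU resources:
children = [
  {Nx.Serving, serving: serving, name: MyApp.DoubleServing, batch_size: 8}
]

# From any node in the cluster:
batch = Nx.Batch.stack([Nx.tensor([1, 2, 3])])
Nx.Serving.batched_run(MyApp.DoubleServing, batch)
```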

So I assume [that it can use multiple cores is] also true for Explorer, although I have not found it in the docs.

Thanks, we should definitely be documenting that!

Perhaps a GPU backend for Explorer? That is probably more work. But how scientific would that work be?

A new backend for Explorer would be a massive amount of work. And yeah, I'm not sure it's really appropriate for a Master's thesis?

When I see that I can use it with fly.io or with k8s, I wonder what it costs to train a DL model :slight_smile: and how easy it is to set up a local k8s cluster compared to a Spark cluster.

I personally know less about this; I don't use DL much in my own work. Tutorials for this kind of thing would be incredibly valuable to the community, and I'm not sure any exist.
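That said, the clustering part on Kubernetes tends to be mostly configuration, e.g. via libcluster (a sketch; the headless service and application names are placeholders, and the Deployment/Service manifests are not shown):

```elixir
# config/runtime.exs
import Config

config :libcluster,
  topologies: [
    k8s: [
      strategy: Cluster.Strategy.Kubernetes.DNS,
      config: [
        service: "myapp-headless",
        application_name: "myapp"
      ]
    ]
  ]
```

With `{Cluster.Supervisor, [Application.get_env(:libcluster, :topologies), [name: MyApp.ClusterSupervisor]]}` in the supervision tree, the pods discover each other and form an Erlang cluster. The cost and setup comparison with Spark would still need to be written up, which is where a tutorial or thesis chapter could come in.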

1 Like

@josevalim recently tweeted out:

Btw, we are starting to look into bringing Nx, Livebook, and FLAME to High Performance Computing (HPC). If your company or university works with HPC and you would like to explore these ideas, please do reach out!

Even if you don’t have a lot of experience with HPC, I bet even cursory attempts to survey the scene and publish comparisons with Elixir’s potential would be valuable!

FWIW, completely independently, I interviewed some folks working on an HPC project a few months ago to talk about applications of Elixir in their space. They were impressed with Elixir's capabilities as a resilient, distributed orchestrator of embarrassingly parallelizable work, and intrigued by Nx. Primarily, they had open questions about:

  • Immutability/copy-on-write slowing down node-local performance for computation
  • Escape hatches into other languages/tooling that provide highly optimized solutions for their particular domain
  • Efficiently and expediently transferring large amounts of data between nodes on successful partial computation to reduce into high-level state

Essentially, these are questions that existing HPC frameworks answer for them, albeit with a lot of fuss and headache in how they write their code, structure their programs, and serialize their data structures. Exploring things like Nx, Rustler, Zigler, and Broadway through this lens could make Elixir more accessible to the domain for future pioneers!
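To illustrate the "escape hatch" bullet: the Elixir half of a Rustler NIF is tiny (the crate and function below are hypothetical placeholders for whatever optimized kernel a team already has):

```elixir
# Hypothetical example: expose an existing native kernel to Elixir via Rustler.
defmodule MyApp.Native do
  use Rustler, otp_app: :my_app, crate: "hpc_kernels"

  # This stub is only hit if the NIF fails to load; the real implementation
  # is supplied by the Rust crate at load time.
  def pairwise_distances(_points), do: :erlang.nif_error(:nif_not_loaded)
end
```

The Rust side (and the care needed around long-running NIFs and dirty schedulers) is of course where the real work lives.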

3 Likes

This particular bit is not a concern because we use functional data structures to build a representation of the computation and then compile and execute that instead. So in many cases it can be more efficient, because there is less back and forth between Elixir and the native code doing the work. :slight_smile:
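A tiny illustration of that point (assuming the EXLA dependency is available): a `defn` builds an expression graph out of immutable data structures, and the whole graph is then compiled and executed natively in one shot, so there is no per-operation round trip between Elixir and the native code.

```elixir
defmodule MyMath do
  import Nx.Defn

  # Inside defn, operators and Nx calls build an expression graph rather
  # than executing eagerly.
  defn softmax(t) do
    Nx.exp(t) / Nx.sum(Nx.exp(t))
  end
end

# Hand the whole graph to a compiler (EXLA here) for native execution.
Nx.Defn.default_options(compiler: EXLA)
MyMath.softmax(Nx.tensor([1.0, 2.0, 3.0]))
```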

2 Likes

Yup, dispelling some of those traditional FP misunderstandings featured quite a bit in our discussion.

1 Like

This is the hardest part of grad school, in my opinion: deciding what to do. I can't give any advice on Nx specifically, but I can offer some general advice based on my experience. The best approach is usually to dive into whatever project seems the coolest. Don't worry about feasibility too much; pick the direction that excites you the most. Once you start actually writing code and putting the bits to the metal, you'll start to realize the strengths and weaknesses of your idea, and you can iterate from that point.