Hi davidktw, nice to see you again, hope you remember our email exchange about free software. Anyway, sorry to disturb you in this thread, but what is currently the bleeding edge for cutting/adding/sorting a ~500 GiB to 1 TiB .txt file in an optimized way, short of going for FPGA hardware? I am trying to achieve O(lg n).
My servers run a mixture of BSD and Linux, using C/C++ and soon C++20.
Perhaps map-reduce approaches would work for you? I don't think going FPGA will give you a significant gain, since you are not dealing with complex processing per unit. The dominant factor should be your data size, as you deal the data out across multiple processing units.
If you have a large number of cores/processors within the same server, then your network will not be a concern as you distribute your information out, unless you are using a shared filesystem holding the same piece of data. Even so, you are still talking about multiple systems drawing the data from the disk.
AWS EMR, Apache Hadoop, or similar tools that fan the data out and aggregate the results back should work. Even MPI or PVM should work, but I guess these days most people are more interested in the more recent parallel techniques.
I personally haven't had many encounters with large datasets, so my suggestions may not be the top-notch, proven approaches.
Do your tasks allow for embarrassingly parallel distribution? I ask because I am not sure what kind of adding and cutting operations you are referring to. Sorting-wise, within each node you can use the best sorting algorithm available, and then, when you are combining the results back, use a merge as in merge sort. I have read about sample sort, but haven't really tried it myself, as I have little use for it.
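To make the "sort within each node, then merge when combining" idea concrete, here is a minimal sketch in C++17. It is only an illustration under assumptions: the chunks live in memory as vectors of lines, whereas a 500 GiB file would need them as sorted runs on disk or on separate nodes, and the names sort_chunks and kway_merge are made up for this example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <queue>
#include <string>
#include <utility>
#include <vector>

// Step 1: sort every chunk on its own. The chunks are independent, so this
// is the part that could be handed to separate nodes/cores.
void sort_chunks(std::vector<std::vector<std::string>>& chunks) {
    for (auto& chunk : chunks) {
        std::sort(chunk.begin(), chunk.end());
    }
}

// Step 2: k-way merge of already-sorted chunks using a min-heap.
// Each heap entry is (line, index of the chunk it came from).
std::vector<std::string> kway_merge(
        const std::vector<std::vector<std::string>>& chunks) {
    using Entry = std::pair<std::string, std::size_t>;
    std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> heap;
    std::vector<std::size_t> pos(chunks.size(), 0);  // next unread line per chunk
    std::size_t total = 0;

    for (std::size_t i = 0; i < chunks.size(); ++i) {
        // Precondition: every chunk must already be sorted.
        assert(std::is_sorted(chunks[i].begin(), chunks[i].end()));
        total += chunks[i].size();
        if (!chunks[i].empty()) {
            heap.emplace(chunks[i][0], i);
        }
    }

    std::vector<std::string> out;
    out.reserve(total);
    while (!heap.empty()) {
        auto [line, i] = heap.top();  // smallest line among all chunk heads
        heap.pop();
        out.push_back(std::move(line));
        // Refill the heap from the chunk we just consumed from.
        if (++pos[i] < chunks[i].size()) {
            heap.emplace(chunks[i][pos[i]], i);
        }
    }
    return out;
}
```

With k chunks and n lines in total, the merge step costs roughly O(n log k) comparisons, since the heap never holds more than one line per chunk. Calling sort_chunks(chunks) followed by kway_merge(chunks) gives the fully sorted list of lines.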
Without further understanding the details of how you are operating, it might be hard to find out where exactly the bottlenecks are and what solution to offer.
Personally I would encourage you to try out MPI. One open-source implementation is Open MPI. The one I used in uni should be MPICH, on a cluster at NUS during my Parallel and Concurrent Programming course. I was using C to parallelise sorting algorithms.
However, you don't need to use C, since bindings for other programming languages are offered, such as Perl, Python, Java, etc. It is a very interesting distributed programming methodology where you can write all the nodes' responsibilities in just one program, and it can be designed to work differently depending on which node it is running on.
You have complete control over the distributed topology, using the common fanning-out approach, or a tree distribution technique using recursion, and so forth. Nodes are not necessarily systems; they could be processors, cores, processes or even threads (I assume). Thus you could have a development environment fully self-contained within your own system, using VMs, or use nodes in the cloud. In AWS there is a service that allows you to use MPI:
https://docs.aws.amazon.com/parallelcluster/latest/ug/tutorials_03_batch_mpi.html. I have not used it myself, so it is something for you to venture into.
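As a rough picture of the tree distribution idea, here is a small C++ sketch. Note that it does not use MPI at all: the "ranks" are just numbers and everything runs sequentially in one process, so it only shows the shape of the recursion. The function tree_sort and the owner bookkeeping are invented for this illustration. In a real MPI program, each recursive call on the second half would instead be a send to another rank and a receive of its sorted result.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Sorts data[lo, hi) with `ranks` simulated workers numbered
// first_rank .. first_rank + ranks - 1.
// owner[i] records which worker did the leaf-level sort covering slot i.
void tree_sort(std::vector<int>& data, std::vector<int>& owner,
               std::size_t lo, std::size_t hi, int first_rank, int ranks) {
    assert(ranks >= 1 && lo <= hi && hi <= data.size());
    assert(owner.size() == data.size());

    const auto b = data.begin();
    if (ranks == 1) {
        // Leaf of the tree: this worker sorts its own slice locally.
        std::sort(b + static_cast<std::ptrdiff_t>(lo),
                  b + static_cast<std::ptrdiff_t>(hi));
        std::fill(owner.begin() + static_cast<std::ptrdiff_t>(lo),
                  owner.begin() + static_cast<std::ptrdiff_t>(hi), first_rank);
        return;
    }

    // Split the workers in two groups, and the data in the same proportion.
    const int left_ranks = ranks / 2;
    const std::size_t mid =
        lo + (hi - lo) * static_cast<std::size_t>(left_ranks)
                 / static_cast<std::size_t>(ranks);

    // Recurse: each group handles its part of the data the same way.
    tree_sort(data, owner, lo, mid, first_rank, left_ranks);
    tree_sort(data, owner, mid, hi, first_rank + left_ranks, ranks - left_ranks);

    // Parent step: combine the two sorted halves on the way back up.
    std::inplace_merge(b + static_cast<std::ptrdiff_t>(lo),
                       b + static_cast<std::ptrdiff_t>(mid),
                       b + static_cast<std::ptrdiff_t>(hi));
}
```

For example, calling tree_sort(data, owner, 0, data.size(), 0, 4) on 8 values splits them 4/4 and then 2/2/2/2, so each of the 4 simulated workers sorts 2 values and the merges happen in two levels on the way back up.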
Some recent work and issues worth reading:
https://jiaweizhuang.github.io/blog/