Slashdot: News for Nerds

Ask Slashdot: Best Language To Learn For Scientific Computing?

timothy posted about 9 months ago | from the english-then-chinese dept.

Education 465

New submitter longhunt writes "I just started my second year of grad school and I am working on a project that involves a computationally intensive data mining problem. I initially coded all of my routines in VBA because it 'was there'. They work, but run way too slow. I need to port to a faster language. I have acquired an older Xeon-based server and would like to be able to make use of all four CPU cores. I can load it with either Windows (XP) or Linux and am relatively comfortable with both. I did a fair amount of C and Octave programming as an undergrad. I also messed around with Fortran77 and several flavors of BASIC. Unfortunately, I haven't done ANY programming in about 12 years, so it would almost be like starting from scratch. I need a language I can pick up in a few weeks so I can get back to my research. I am not a CS major, so I care more about the answer than the code itself. What language suggestions or tips can you give me?"

465 comments

Python (4, Insightful)

curunir (98273) | about 9 months ago | (#45154759)

I have a friend who works for a company that does gene sequencing and other genetic research and, from what he's told me, the whole industry uses mostly python. You probably don't have the hardware resources that they do, but I'd bet you also don't have data sets that are nearly as large as theirs are.

You might also get better results from something less general purpose like Julia [julialang.org] , which is designed for number crunching.

Re:Python (4, Insightful)

the gnat (153162) | about 9 months ago | (#45154805)

the whole industry uses mostly python

This is certainly the way of the future, not just for gene sequencing but many other quantitative sciences, although a complete answer would be Python and C++, because numpy/scipy can't do everything and Python is still very slow for number-crunching. It's best to start with just Python, but eventually some C++ knowledge will be helpful. (Or just plain C, but I can't see any good reason to inflict that on myself or anyone else.)

Re:Python (4, Insightful)

Anonymous Coward | about 9 months ago | (#45154845)

Python is the new VB.

Re:Python (5, Insightful)

shutdown -p now (807394) | about 9 months ago | (#45154959)

Python is VB done right.

Re:Python (5, Funny)

SJHillman (1966756) | about 9 months ago | (#45155029)

VB is feeding your scrotum to a python.

Re:Python (-1)

Anonymous Coward | about 9 months ago | (#45155061)

Python is open-source trash.

Re:Python (3, Insightful)

Anonymous Coward | about 9 months ago | (#45155167)

VB is closed-source trash.

Re:Python (5, Interesting)

shutdown -p now (807394) | about 9 months ago | (#45154939)

a complete answer would be Python and C++, because numpy/scipy can't do everything and Python is still very slow for number-crunching.

The problem with using the mix (when you actually write the C++ code yourself) is that debugging it is a major pain in the ass - you either attach two debuggers and simulate stepping across the boundary by manually setting breakpoints, or you give up and resort to printf debugging.

OTOH, if Windows is an option, PTVS is a Python IDE that can debug Python and C++ code side by side [codeplex.com] , with cross-boundary stepping etc. It can also do Python/Fortran debugging with a Fortran implementation that integrates into VS (e.g. the Intel one).

(full disclosure: I am a developer on the PTVS team who implemented this particular feature)

Re:Python (3, Insightful)

dmbasso (1052166) | about 9 months ago | (#45155139)

The problem with using the mix (when you actually write the C++ code yourself) is that debugging it is a major pain in the ass

Only if you don't use the C/C++ code as an independent module, as it should be. If you *must* debug it in parallel, you're designing it wrong.

Re:Python (0)

Anonymous Coward | about 9 months ago | (#45154973)

Python is a nice language for "quick pick up".
If you want to take advantage of multiple cores you can launch multiple instances of your program or you can use the multiprocessing module.
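
A minimal sketch of the multiprocessing route, assuming the work splits into independent chunks. The function names and the sum-of-squares workload are hypothetical stand-ins for a real routine; only `multiprocessing.Pool` is the actual standard-library piece.

```python
from multiprocessing import Pool


def chunk_ranges(n, parts):
    """Split range(n) into `parts` contiguous (start, stop) pairs."""
    step, extra = divmod(n, parts)
    ranges = []
    start = 0
    for i in range(parts):
        # The first `extra` chunks get one additional item each.
        stop = start + step + (1 if i < extra else 0)
        ranges.append((start, stop))
        start = stop
    return ranges


def partial_sum_of_squares(bounds):
    """CPU-bound stand-in for one chunk of the real computation."""
    start, stop = bounds
    return sum(i * i for i in range(start, stop))


def parallel_sum_of_squares(n, workers=4):
    """Farm the chunks out to `workers` processes and combine the results."""
    with Pool(processes=workers) as pool:
        return sum(pool.map(partial_sum_of_squares, chunk_ranges(n, workers)))
```

Call `parallel_sum_of_squares(...)` from under an `if __name__ == "__main__":` guard; on Windows the guard is required because the worker processes re-import the main script. Each worker is a separate process, so four workers can keep four cores busy as long as the chunks are large enough to outweigh the startup and pickling overhead.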

Re:Python (4, Informative)

rwa2 (4391) | about 9 months ago | (#45155031)

Yes, I did my master's thesis using simpy [readthedocs.org] / scipy [scipy.org], integrated with lp_solve for the number crunching, all of which was a breeze to learn and use. It was amazing banging out a new recursive algorithm crawling a new object structure and just having it work the first time without spending several precious cycles bugfixing syntax errors and chasing down obscure stack overflows.

I used the psyco JIT compiler (unfortunately 32-bit only) to get ~100x boost in runtime performance (all from a single import statement, woo), which was fast enough for me... these days I think you can get similar boosts from running on PyPy [pypy.org] . Of course, if you're doing more serious number crunching, python makes it easy to rewrite your performance-critical modules in C/C++.

I also ended up making a LiveCD and/or VM of my thesis, which was a good way of wrapping up the software environment and dependencies, which could quickly grow outdated in a few short years.

Re:Python (5, Informative)

Garridan (597129) | about 9 months ago | (#45154883)

I use Sage. When Python isn't fast enough, I can essentially write in C with Cython. It's gloriously easy. Have some trivially parallelizable data mining? Just use the @parallel decorator. Sage comes with a slew of fast mathematical packages, so your toolbox is massive, and you can hook it all in to your Cython code with minimal overhead.

Sage + (0)

jbolden (176878) | about 9 months ago | (#45155223)

Let me second this one. Mathematica, Maple, Sage, Matlab / Octave... Mathematical languages are so nice for scientific computing because the languages have wonderful built-in functions.

Re:Python (0)

Anonymous Coward | about 9 months ago | (#45154909)

Absolutely Python.

NumPy takes care of the array buffer and fast basic math, SciPy has many scientific extensions, Matplotlib provides data visualization, IPython is your notebook of choice, Scikit-* provide faster moving toolkits that further extend SciPy to the cutting edge, and SymPy has symbolic math.

In aggregate these are known as the SciPy Stack.

Utilizing multiple cores in Python can be accomplished with JIT compilers (Numba / NumbaPro are the best developed presently), Cython with `nogil` and use of OpenMP, or subprocess management. Many libraries which need this type of behavior already offer it (like Scikit-Learn for machine learning), and IPython has inbuilt cluster management tools.

Re:Python (0)

Anonymous Coward | about 9 months ago | (#45154915)

Python is popular.
R is popular.
Perl is still in wide use.

All our new stuff is being written in Go, Julia, Scala, and *gasp* Javascript.

Java Java! (3, Interesting)

Latent Heat (558884) | about 9 months ago | (#45154953)

For research engineering, I use Java to run the numerical examples of the algorithms I develop although most of the authors in the journals I publish in are using Matlab for this purpose (ewwwwww!). Long time ago I was a Turbo Pascal person as were engineering colleagues who crossed over to Matlab seeking the same kind of ease-of-use. Me, I transitioned to Delphi but now I am with Java and Eclipse -- the Turbo Pascal of the 21st century.

For numeric-intensive work, I can get within 20% of the speed of C++ using the usual techniques -- minimize garbage collection by allocating variables once, use the "server" VM, perform "warmup" iterations in benchmark code to stabilize the JIT. I use the Eclipse IDE, copy and paste numeric results from the Console View into a spreadsheet program, and voila, instant journal article tables.

Re:Python (1)

polyphemus (473112) | about 9 months ago | (#45155177)

+1

I was using Mathematica in grad school (experimental physics). Great for simple number crunching, but awful for doing anything programmatically interesting, and annoyingly expensive.

I'm now using Python and loving it.

Fortran (2, Insightful)

Anonymous Coward | about 9 months ago | (#45154783)

sorry to say, but that is a fact

Re:Fortran (0)

Anonymous Coward | about 9 months ago | (#45154819)

This is the correct answer.

Fortran + Python = F2PY (4, Informative)

n1ywb (555767) | about 9 months ago | (#45155089)

Better yet, Fortran + Python.

http://docs.scipy.org/doc/numpy/user/c-info.python-as-glue.html#f2py [scipy.org]

I used it to wrap some crazy magnetometer processing code written in Fortran into a nice Python program. I ripped out all the I/O from the Fortran code and moved it into the Python layer. It worked great. Fortran is AWESOME at number crunching but SUCKS ASS at IO or well pretty much anything else, hence Python.

Re:Fortran (2)

shutdown -p now (807394) | about 9 months ago | (#45155015)

It depends on what exactly his computationally intensive part is. It may be something that can be trivially implemented in Python in terms of standard numpy operations, for example, with performance that's "good enough".

Re:Fortran (0)

the gnat (153162) | about 9 months ago | (#45155021)

Sure, if you don't care about having your code be maintained or extended by anyone under age 30, don't plan on doing any custom visualization beyond GNUplot, and don't care if you ever find employment outside of academia.

fortran of LaTeX (-1)

Anonymous Coward | about 9 months ago | (#45154789)

Fortran is ridiculously fast if you don't care about having a nice GUI, and LaTeX is also a great choice. Otherwise, you can go with C but everything else is going to be slower than Fortran for those kinds of mathematically intensive problems.

English (4, Funny)

Anonymous Coward | about 9 months ago | (#45154791)

Obviously.

Try the CS department? (0)

Anonymous Coward | about 9 months ago | (#45154821)

Why not try tracking down a CS professor and getting paired up with an undergrad student who needs to create a capstone project?

Universally, and unambiguously (0)

gwstuff (2067112) | about 9 months ago | (#45154827)

Math

FORTRAN (2, Insightful)

Frosty Piss (770223) | about 9 months ago | (#45154833)

Seriously consider FORTRAN

Re:FORTRAN (2, Insightful)

Anonymous Coward | about 9 months ago | (#45154921)

Yeah, sure.

So that no one can ever check your models or replicate your results even if you publish code and initial data.

Re:FORTRAN (1)

jythie (914043) | about 9 months ago | (#45155023)

Was that supposed to be a crack about popularity? Because auditing Fortran is no worse than most other languages, and it can be argued that Fortran is better than most in terms of being able to validate models.

Re:FORTRAN (0)

Anonymous Coward | about 9 months ago | (#45155111)

Having to read old FORTRAN is not a pleasant experience for someone who figures they can generally read languages without hitting a book of some sort.

Re:FORTRAN (5, Interesting)

Frosty Piss (770223) | about 9 months ago | (#45155025)

Clearly you are not involved in serious science.

And if you think FORTRAN is some ancient esoteric language, you're ignorant as well. The most recent standard, ISO/IEC 1539-1:2010, informally known as Fortran 2008, was approved in September 2010.

Fortran is, for better or worse, the only major language out there specifically designed for scientific numerical computing. Its array handling is nice, with succinct array operations on both whole arrays and on slices, comparable with MATLAB or numpy but super fast. The language is carefully designed to make it very difficult to accidentally write slow code -- pointers are restricted in such a way that it's immediately obvious if there might be aliasing, as the standard example -- and so the optimizer can go to town on your code. Current incarnations have things like coarray Fortran, and "do concurrent" and "forall" built into the language, allowing distributed memory and shared memory parallelism, and vectorization.

The downsides of Fortran are mainly the flip side of one of the upsides mentioned; Fortran has a huge long history. Upside: tonnes of great libraries. Downsides: tonnes of historical baggage.

If you have to do a lot of number crunching, Fortran remains one of the top choices, which is why many of the most sophisticated simulation codes run at supercomputing centres around the world are written in it. But of course it would be a terrible, terrible, language to write a web browser in. To each task its tool.

Re:FORTRAN (1)

jonesy16 (595988) | about 9 months ago | (#45155157)

Agreed. There are also OpenMP implementations for doing your parallel processing. If you're running on a Xeon processor then I would SERIOUSLY consider Intel's linux fortran compiler as it will provide the best performance by far.

MATLAB? (1)

Anonymous Coward | about 9 months ago | (#45154835)

Have you looked at Matlab? It's commercial, requiring a license, but many universities have a site license available for you to use it. It's pretty powerful and faster than VB, though not as fast as native C/C++. Unless you're running calculations in real time, that probably is not an issue for you.

Re:MATLAB? (1)

golden age villain (1607173) | about 9 months ago | (#45155175)

All the labs I know in my field (neuroscience) do most of the data analysis and simulations with MATLAB. It is also used to control hardware for data acquisition.

Try Julia.. (0)

Anonymous Coward | about 9 months ago | (#45154841)

..seems pretty self-explaining to me.
http://julialang.org/

Step one: export to a database? (1)

xxxJonBoyxxx (565205) | about 9 months ago | (#45154851)

>> I initially coded all of my routines in VBA because it 'was there'.

Are you in Access? Or Excel?

If your routines work but are just slow, I'd first look at moving the data to SQL Server and porting your VBA routines to VB.NET.

If you have more time, you may want to learn what the "Hadoop" world is all about.

Re:Step one: export to a database? (1)

K. S. Kyosuke (729550) | about 9 months ago | (#45155011)

If he wrote it in VBA, I'm pretty sure he can rewrite it into a native extension of some kind and use it from the same environment. Some industries love those since you expose the functionality to many users who want or need to work in that user environment.

Re:Step one: export to a database? (0)

Anonymous Coward | about 9 months ago | (#45155047)

If you have more time, you may want to learn what the "Hadoop" world is all about.

Not to be confused with Hardocp [hardocp.com]

Re:Step one: export to a database? (1)

Anonymous Coward | about 9 months ago | (#45155149)

If you have more time, you may want to learn what the "Hadoop" world is all about.

That's absolutely the wrong answer to the question as posed. Hadoop is all about massive parallelization. It's the answer to "How do I throw hardware at a difficult problem?" instead of "How do I solve a difficult problem efficiently?"

I'm not saying that Hadoop isn't extremely useful, but for a student who managed to scrounge up a single Xeon machine, it's entirely ill suited.

More details? (3, Informative)

schneidafunk (795759) | about 9 months ago | (#45154853)

Depending on your needs, R may be your best bet if it is statistical processing you are interested in.

Re:More details? (4, Informative)

Bovius (1243040) | about 9 months ago | (#45154937)

Second this. There are numerous languages out there that are tailor-made for specific kinds of problems. You didn't quite share enough to narrow down what kinds of problems you need to solve, but the R project is geared toward number crunching, albeit with a significant bent toward statistics and graphic display.

http://www.r-project.org/ [r-project.org]

If that's not pointed in the right direction, some other language might be. Alternatively, there are a lot of libraries out there for the more popular languages that could help with what you're doing. Heck, 12 years ago we didn't even have the boost libraries for C++. It's difficult for me to imagine using that language without them now.

And my answer is... (0)

Anonymous Coward | about 9 months ago | (#45154863)

Java (for quick prototyping), C++ (port from Java code/structure to fine-tune performance).

Check with your potential employers what language(s) was(were) used to build their current applications, and what languages (if any) they will port to.

What are you doing? (3, Informative)

RichMan (8097) | about 9 months ago | (#45154865)

What do you mean by scientific computing?

Modelling: Hard core finite element simulations or the like. Then C or Fortran and you will be linking with the math libraries.
Log Processing: For a lot of other stuff you will be parsing data logs and doing statistics. So Perl or Python, then Octave.
Data Mining: Python or other SQL front end.
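
To illustrate the "Python as an SQL front end" option, here's a minimal sketch using the standard-library `sqlite3` module. The table name, columns, sample rows and the `mean_by_group` helper are all made up for the example; a real project would point at its own database instead of an in-memory one.

```python
import sqlite3


def mean_by_group(rows):
    """Load (group, value) rows into an in-memory table and average per group."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE measurements (grp TEXT, value REAL)")
        conn.executemany("INSERT INTO measurements VALUES (?, ?)", rows)
        # Let the database engine do the grouping and aggregation.
        cursor = conn.execute(
            "SELECT grp, AVG(value), COUNT(*) FROM measurements "
            "GROUP BY grp ORDER BY grp"
        )
        return cursor.fetchall()
    finally:
        conn.close()


# Hypothetical sample data: (group label, measured value).
sample = [("a", 1.0), ("a", 3.0), ("b", 10.0)]
print(mean_by_group(sample))
```

The idea is that the query does the heavy filtering and aggregation, and Python only handles loading the data and post-processing the (much smaller) result set.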

Re:What are you doing? (0)

Anonymous Coward | about 9 months ago | (#45155187)

Second that. We teach C for introductory scientific computing, and modern variants of Fortran are very slick. If it's performance you're after, those two running on a unix/linux architecture is the only game in town.

Microsoft and high performance computing should never be used in the same sentence.

Re:What are you doing? (3, Informative)

UnknowingFool (672806) | about 9 months ago | (#45155235)

Well if your problems require statistical computing, R is the language to use. For general scientific computing, the last I checked Octave was still valid. As for multi-core processing, only a few languages and compilers support platforms like OpenMP: Fortran, C, and C++.

Python (1)

Anonymous Coward | about 9 months ago | (#45154875)

Try Python. Make sure to use scipy (numpy really), because you don't want to do the heavy lifting in native Python.
http://www.scipy.org/

Re:Python (1)

shutdown -p now (807394) | about 9 months ago | (#45154981)

You can use Cython for heavy lifting without dropping all the way down to C.

what the rest of your team uses (4, Insightful)

peter303 (12292) | about 9 months ago | (#45154887)

You should all be sharing your code to avoid rewriting it and to perfect it.
And if you are not a member of a team then I seriously question the quality of your graduate program.

BAD TIM! BAD! (5, Funny)

girlintraining (1395911) | about 9 months ago | (#45154895)

What language suggestions or tips can you give me?"

Timothy, shame on you. You should know better than to start a holy war.

Re:BAD TIM! BAD! (0)

Anonymous Coward | about 9 months ago | (#45155019)

Can not mod this up enough.
Your turn!

Re:BAD TIM! BAD! (0)

Anonymous Coward | about 9 months ago | (#45155027)

Seriously. Why even ask? I mean everybody knows that *Insert programming language you know best here* is clearly the most superior programming language ever developed. Question answered.

Julia lang seems to match your needs. (0)

Anonymous Coward | about 9 months ago | (#45154903)

But from what I heard, it's still in development. Does someone know how usable it is atm?

Are you asking for permission to use fortran? (0)

Anonymous Coward | about 9 months ago | (#45154907)

Cause that's probably the answer if you're having "computation performance problems", maybe even C++/OpenCL if you're feeling really brave...

On the other hand, why not just throw more hardware at the problem (or wait a little longer)? By the time you have recoded your VBA in something else, I'm betting the VBA code could have solved the problem running on some decent hardware.

Unless the question is "I wrote my code in VBA and it doesn't scale to a 5k node cluster, what did I do wrong". In that case you aren't really asking the right question.

Fortran (plus MPI and some CUDA) (1)

Anubis350 (772791) | about 9 months ago | (#45154925)

Fortran, and learn how to implement MPI and CUDA code if your work is parallelizable.

Re:Fortran (plus MPI and some CUDA) (1)

GiganticLyingMouth (1691940) | about 9 months ago | (#45155205)

For completeness, it should also be noted that both C and C++ work with MPI and CUDA. Fortran can theoretically be faster than C or C++ as its compiler can optimize more aggressively (due to the lack of pointer aliasing in Fortran), but I don't have any hard data for how much of a difference it would make in actual runtime speeds.

perl ftw (0)

Anonymous Coward | about 9 months ago | (#45154929)

Perl should handle literally anything you can throw at it.

2 paths (3, Informative)

johnjaydk (584895) | about 9 months ago | (#45154941)

If you can find anything that resembles a math library with the correct tools then go with Python. Numpy is everyone's friend here.

If you have to do the whole thing from scratch then Fortran is the fastest platform. I can't say I've met anyone who enjoyed Fortran but it's wicked fast.

Re:2 paths (1)

the gnat (153162) | about 9 months ago | (#45155059)

If you have to do the whole thing from scratch then Fortran is the fastest platform. I can't say I've met anyone who enjoyed Fortran but it's wicked fast.

True, but the only place where this *really* matters is programming for repetitive calculations on massively parallel supercomputers. For anything else, there is a tradeoff between program speed and developer speed, and ultimately it's cheaper to buy more computers than hire more programmers.

Fortran, R, Matlab, C, Python or Perl, and ??? (0)

Anonymous Coward | about 9 months ago | (#45154945)

There's nothing wrong with Fortran or C. There are newer and in some cases more focused languages you may want to check into like R, Matlab, C++, Python, Perl, or Go. I'm not a fan of Java, and it's not known for raw performance compared to Fortran or C, but there are probably great libraries for what you need in it.

Rather than learning a language... (1)

Anonymous Coward | about 9 months ago | (#45154957)

I would recommend learning what a programming language is, especially if you have the time. Personally I spent a lot of time learning languages and not really seeing the abstraction that every programming language adheres to, making learning a new language difficult and time consuming. I can only really describe it as trying to learn a language rather than learning linguistics. All computer languages share common patterns all based on formalism, just like all spoken languages share common patterns. Learning formalism makes picking up new programming languages much easier since you'll not only be able to identify patterns shared between them faster, but pick up the lexicon to communicate well-formed questions to other programmers. I'd recommend reading Structure and Interpretation of Computer Programs. There are other books that attempt to replicate what this does, but it really is great and I haven't seen other books get to the point of computer programming faster. It is based on LISP, which most people will never use, but it's deceptively easy to read and understand, so getting through the book for someone that hasn't used LISP before shouldn't be a problem. Good Luck!

Python, or ... (2)

Kiliani (816330) | about 9 months ago | (#45154961)

First suggestion: Python. Lots of nice stuff for science (NumPy, SciPy), lots of other goodies, easy to learn, many people to ask or places to get help from. Plus you can explore data interactively ("Yes Wednesday, play with your data!").

Beyond that: CERN uses a lot of Java (sorry folks, true), and they have good (and fast) tools. I'm doing a project right now where I am using Jython since it is supported by the main (Java) software I have to use. I like jhepwork/SCaVis quite a bit, if you are into plotting stuff on Java.

If you have extra free time and want to learn how to program well? I'd learn something like Smalltalk (for OOP concepts) and/or Haskell (functional programming). Scientists are often lousy programmers because they often do not learn programming properly, and/or the language allows them to get away with bad programming (I know, every language allows bad programmers to write bad code, but some make it easier than others).

So, stick with Python, it works really well, is modern, and has good support. Plus you can read your code in 5 years time ...

What do I program in? Python (and Jython), Perl, C, IDL (yikes!), Smalltalk, Matlab, Mathematica. I know some Lisp, but that's just for fun. And whatever allows me to load sketches on an Arduino. I like Python (gets stuff done) and Smalltalk (works actually like I think - passing messages between objects).

Use whatever works and you don't hate :-)

R-language (4, Informative)

biodata (1981610) | about 9 months ago | (#45154965)

Most of the cutting edge data mining I've seen is done using R (which acts as a scripting wrapper for the C or Fortran code that the fast analysis libraries are coded in), or alternatively in Python. Some people swear by MATLAB if they have trained in it (so your Octave would come in handy there). Have a look at some discussions at places like kaggle.com to see what the competitive machine learning community uses (if that is what you mean by data mining).

Re:R-language (2, Insightful)

green is the enemy (3021751) | about 9 months ago | (#45155135)

This is the correct advice: Use whatever language is most common in your research area, so you can benefit from the most existing source code. This will almost certainly be a high-level scripting language like R, MATLAB or Python, with the ability to drop down to C, FORTRAN and CUDA for the small parts of the code that need optimization. (In my case: electrical engineering = MATLAB + C and CUDA mex files)

Go (aka Golang) if you come from a C background (1)

genghisjahn (1344927) | about 9 months ago | (#45154967)

http://golang.org/ [golang.org] You won't regret it.

Re:Go (aka Golang) if you come from a C background (1)

K. S. Kyosuke (729550) | about 9 months ago | (#45155209)

Could use some vectorizing FP, but yeah, it's not a bad choice, especially if the complexity of mixed environments is undesirable. (Could also use some native port of netlib/GSL as well, though.)

It might also make him a better practical software engineer, which, as I understand, is an area in which many numerics people...experience certain difficulties.

Profile (5, Insightful)

Arker (91948) | about 9 months ago | (#45154977)

A lot of people will propose a language because it is their favorite. Others because they believe it is very easy to learn. I will give you a third line of thought.

I would not look for a language in this case, I would look for a library, then teach myself whatever language is easiest/quickest to access it. I would try to profile what you are building, figure out where the bottlenecks are likely to be (profiling your existing mockup can help here but don't trust it entirely) and try to find the best stable well-designed high performance library for that particular type of code.
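
As a rough illustration of the profiling step, here is a minimal sketch using Python's standard `cProfile` and `pstats` modules. The workload (`slow_distance_sum`, `run_analysis`) and the `profile_report` helper are hypothetical stand-ins; the same idea applies in whatever language the mockup is written in.

```python
import cProfile
import io
import pstats


def slow_distance_sum(points):
    """Deliberately naive O(n^2) loop standing in for a real bottleneck."""
    total = 0.0
    for x1, y1 in points:
        for x2, y2 in points:
            total += ((x1 - x2) ** 2 + (y1 - y2) ** 2) ** 0.5
    return total


def run_analysis(n=200):
    """Hypothetical top-level routine: build the data, then crunch it."""
    points = [(i * 0.5, i * 0.25) for i in range(n)]
    return slow_distance_sum(points)


def profile_report(func, *args, top=5):
    """Run func under cProfile; return its result and a text report
    of the `top` entries sorted by cumulative time."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("cumulative").print_stats(top)
    return result, buffer.getvalue()
```

Something like `result, report = profile_report(run_analysis, 200)` followed by `print(report)` lists the functions where the time goes. Whichever one dominates is the part worth handing to a fast library; the rest can usually stay in the easy language.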

Re:Profile (1)

jythie (914043) | about 9 months ago | (#45155077)

I am not sure how much that helps since unless the person is doing something very specific, chances are it will just shift the problem into 'which library is best' debate, which will again mostly involve people suggesting libraries they like or because they believe they are easy to learn.

Hadoop? (1)

bjzadwor (322436) | about 9 months ago | (#45154979)

If you are doing a computationally intensive data mining problem, have you considered porting to a Hadoop solution? You may need to rewrite your code, or you may be able to use Hadoop to call your current functions. You could use an AWS Hadoop cluster; Amazon often gives free credits to students, it may cost you nothing out of pocket, and help you learn a hot new technology.

Fortran (0)

Anonymous Coward | about 9 months ago | (#45154983)

Recent versions of Fortran are very advanced, a lot easier to use than Fortran 77 and still extremely fast.

Some new features since 77: structured programming, array programming, modular programming and generic programming (Fortran 90), high performance Fortran (Fortran 95), object-oriented programming (Fortran 2003) and concurrent programming (Fortran 2008).

Free compilers: GFortran [wikipedia.org] and G95 [wikipedia.org]

Speed incarnate (2)

Impy the Impiuos Imp (442658) | about 9 months ago | (#45154997)

If you're using VBA in Excel, you can speed it up a ton by putting this at the beginning of your function:

Application.Calculation = xlCalculationManual

And restore it with ...Automatic at the end.

Do this at the top level with a wrapper function whose only purpose is to disable and enable that, calling the real function in between.

If you want a real speedup, I am available for part time work in C or C++.

My favorite is CnH2n+1OH (3, Funny)

nanospook (521118) | about 9 months ago | (#45155037)

It takes all the work out of the computations..

Fortran 90+ with OpenMP or Python (1)

dlenmn (145080) | about 9 months ago | (#45155039)

If you really want to do heavy lifting, you can't beat Fortran. Just stay away from Fortran 77; it's a hot mess. Fortran 90 and later are much easier to use, and they're supported by the main compilers: gfortran and ifort.

ifort is Intel's Fortran compiler. It's the fastest out there, and it runs on Windows and Linux. Furthermore, you can get it as a free download for some types of academic use. (Search around Intel's website -- it's hard to find.) That said, I usually use gfortran -- which is free and open source -- on Linux. See http://www.polyhedron.com/compare0html [polyhedron.com] for a compiler comparison.

If you use Fortran, it's very easy to use OpenMP to do multiprocessing and make use of all those cores. OpenMP is supported by the main compilers.

If you're doing lighter work, SciPy/NumPy works fine; I use it a fair amount if maximum performance isn't essential. However, I can't speak to its multiprocessing ability.

FORTRAN (0)

Anonymous Coward | about 9 months ago | (#45155041)

For scientific stuff, FORTRAN is still the best. Simple, old things are very often the best things around. C++ is in many ways a regression, especially all the C-style stuff you can find in the average C++ program.

As soon as you need to process massive data sets or run massive simulations, all the Script languages won't cut it any longer, so you either go Fortran or C++. So, again, Fortran.

Before you C++ kids want to tell me something, read up on that Mr Kuck and his optimizers. Fortran optimizers did things about 20 years ago which C++ optimizers still cannot do.

Finally, there are tons of Fortran libraries already available for all kinds of science and engineering problems.

Re:FORTRAN (0)

Anonymous Coward | about 9 months ago | (#45155153)

See: http://www.ieeeghn.org/wiki/index.php/Oral-History:David_Kuck

Why code, when you are use a workflow tool? (1)

Grantbridge (1377621) | about 9 months ago | (#45155045)

Use KNIME and you can probably do 90% of what you want by dragging and dropping new nodes and joining them up. KNIME does all the complicated memory caching for large filesets for you, and you can write your own Java functions to plug into it if you need something special.

Depends (1)

Enry (630) | about 9 months ago | (#45155079)

R, MATLAB, SAS, Python, there's a bunch of languages you can use, and a bunch of ways to store the data (RDBMS, NOSQL, Hadoop, etc.). It really comes down to what kind of access to the data you have, how it's presented, what other resources you have available to you, and what you want to do with it.

Depends... On the Data... (1)

tiberus (258517) | about 9 months ago | (#45155091)

Well, it depends. You say "computationally intensive data mining problem," but what kind of computations (arithmetic, mathematical, text-based, etc.)?

In general, for flat-out speed, toss interpreted languages (Perl, Python, Java, etc.) out the door. You'll want something that compiles to machine code, esp. if you are running on older hardware. For crunching numbers, complex math, and matrices, Fortran is the beast. If your data is arranged in lists, consider Lisp, then pick something else as it will likely give you a migraine. The format of your data and what you need to do with it will drive your language choice.

Is finding a partner an option? Seems you should be able to work with someone from CS who needs a coding project...

Python or R (1)

Anonymous Coward | about 9 months ago | (#45155097)

I work in the industry (all our customers are scientists), and the two languages that seem to be predominant are R and Python. R has lots of cool stuff specifically for advanced number crunching, while Python is more the Swiss Army knife that can be used to tackle anything. I don't think you can go wrong with either, but Python will probably be more friendly (e.g. it has way more books on it than R) and will serve you better in non-scientific enterprises.

Python, numpy, Pyvot (4, Informative)

shutdown -p now (807394) | about 9 months ago | (#45155119)

Since you mention VBA, I suspect that your data is in Excel spreadsheets? If you want to try to speed this up with minimum effort, then consider using Python with Pyvot [codeplex.com] to access the data, and then numpy [numpy.org] /scipy [scipy.org] /pandas [pydata.org] to do whatever processing you need. This should give you a significant perf boost without the need to significantly rearchitect everything or change your workflow much.

In addition, using Python this way gives you the ability to use IPython [ipython.org] to work with your data in interactive mode - it's kinda like a scientific Python REPL, with graphing etc.

If you want an IDE that can connect all these together, try Python Tools for Visual Studio [codeplex.com] . This will give you a good general IDE experience (editing with code completion, debugging, profiling etc), and also comes with an integrated IPython console. This way you can write your code in the full-fledged code editor, and then quickly send select pieces of it to the REPL for evaluation, to test it as you write it.

(Full disclosure: I am a developer on the PTVS team)

If you care about the answer... (0)

mpmansell (118934) | about 9 months ago | (#45155123)

then you should care about the code, as well. Choice of language can have a lot of consequences for accuracy: floating-point rounding errors need to be accounted for, and these may differ per language and per implementation version of each language.
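
A small sketch of the rounding point, assuming Python floats (IEEE 754 doubles) and an arbitrary list of ten 0.1 values chosen only for illustration. It compares a plain running total with math.fsum, which tracks the partial sums exactly.

```python
import math

values = [0.1] * 10

# Plain accumulation: each addition rounds, and the errors add up.
naive = 0.0
for v in values:
    naive += v

# math.fsum keeps exact partial sums and rounds only once at the end.
careful = math.fsum(values)

print(naive == 1.0)    # False
print(careful == 1.0)  # True
```

The two totals differ only in the last bit here, but over millions of terms the gap can grow, and a different language, library, or version may accumulate in a different order and give a slightly different answer.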

matlab (3, Informative)

smadasam (831582) | about 9 months ago | (#45155125)

FORTRAN used to be it back in the day, but nowadays Matlab is the stuff that many engineers use for scientific computing. Many of the math libraries are very good in Matlab and don't require you to be a computer scientist to make them run fast. I used to work with scientists in my old lab to port their Matlab code to FORTRAN or C so it could run on HPC clusters. Often the Matlab libraries smoked the BLAS/ATLAS packages that you find on Linux/UNIX machines, for instance. The same would hold true for Octave, since it just builds on the standard GNU math packages like BLAS.

Same language as your peers (1)

willy_me (212994) | about 9 months ago | (#45155127)

If you want to be able to ask someone for help then it would be best to use the same tools they use. The point is that any programming language will work. Some languages are easier than others, but the difference is negligible compared to the advantage of being able to ask your peers for assistance.

Try J.. (1)

DavidHumus (725117) | about 9 months ago | (#45155129)

...at jsoftware.com .

It's more powerful, concise, and consistent than most languages. However, R and Matlab have larger user communities and this is an important consideration.

There was a note on the J-forum a few months ago from an astronomer who uses J to "...compute photoionization models of planetary nebulae." His code to do this is about 500 lines in about 30 modules and uses some multi-dimensional datasets, including a four-dimensional one of "...2D grids of the collisional cooling by each of 16 ions".

However, the point of his note was that he ported this code to his iPhone - and it works! Consider, too, that porting consists mainly of copying some text and data files - there would be few or no code changes.

do you have a budget? (0)

Anonymous Coward | about 9 months ago | (#45155133)

you haven't given a lot of info on specifics of what you're trying to do, but i'm assuming something like crunching through tables of data with possible aggregations, filtering and sorting with possibly a few custom calculations based on the raw data. kind of stuff you can do in excel on a small scale.

so my first question is have you looked at excel 2010 or 2013? if not they're much better at bigger data than previous versions. but excel does have its limits....

if you have a budget for commercial software, then something like matlab might work. it is uber fast, can handle multiple cores/64-bit and is extremely well documented on their website with copious examples and documentation. the pace of updates at 2x per year is also good with steady incremental improvements.

if you have no budget then python+matplotlib+ipython+pandas is an excellent combo. it's what i use. free and productive once you learn the ropes. and you spend minimal time on learning a programming environment, etc. if you can do VBA, you can definitely do python. and with pandas it can be quite fast.

as a final thought, if you're really just doing data mining and have the data somewhere like in a database, you might want to consider some of the newer tools like tableau. no/minimal programming required to do some pretty nice analysis and it's dead simple to play around with new ways of looking at the data.

the language is probably not the issue (0)

kbdd (823155) | about 9 months ago | (#45155137)

Nowadays, most languages can be pretty efficient. Your algorithms may be where the problem lies. Most any language can run most algorithms efficiently, but achieving that may not be easy.

No language will give you a magic speed boost if you do not understand how it processes the numbers and data structures.

My recommendation is probably not what you want to hear: pick a language that you are comfortable with and study it so that you know how to write efficient code with it.
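
To make the algorithm point concrete, here is a hedged sketch with a made-up task (checking a list for duplicates) that is not from the original question. Both functions return the same answer, but the first compares every pair of items while the second makes a single pass with a set.

```python
def has_duplicate_quadratic(items):
    """Compare every pair: roughly n*n/2 comparisons for n items."""
    for i in range(len(items)):
        for j in range(i + 1, len(items)):
            if items[i] == items[j]:
                return True
    return False


def has_duplicate_linear(items):
    """Single pass with a set: roughly n lookups for n items."""
    seen = set()
    for item in items:
        if item in seen:
            return True
        seen.add(item)
    return False


if __name__ == "__main__":
    data = list(range(5000)) + [42]
    print(has_duplicate_quadratic(data), has_duplicate_linear(data))
```

On a million rows the pairwise version does on the order of a trillion comparisons in any language, compiled or not, while the set version does about a million lookups.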

C/C++ (3, Interesting)

ericcc65 (2663835) | about 9 months ago | (#45155143)

I'm a MSEE and I've been working in the digital signal processing realm for the last 10 years since graduating. I should mention that I haven't done a lot of low level hardware work, I haven't programmed actual DSP cards or played with CUDA. I have written software that did real-time signal processing just on a GPU. Everyone in my industry at this point uses C or C++. There is some legacy FORTRAN, and I shudder when I have to read it. Some old types swear by it, but it's fallen out of favor mostly just because it's antiquated and most people know C/C++ and libraries are available for it.

For non-real-time prototypes I'd recommend learning python (scipy, numpy, matplotlib). Perhaps octave and/or Matlab would be useful as well.

At some point you have to decide what your strength will be. I love learning about CS and try to improve my coding skills, but it's just not my strength. I'm hired because of my DSP knowledge, and I need to be able to program well enough to translate algorithms to programs. If you really want to squeeze out performance then you'll probably want to learn CUDA, assembly, AVX/SSE, and DSP specific C programming. But I haven't delved to that level because, honestly, we have a somewhat different set of people at the company that are really good in those realms.

Of course, it would be great if I could know everything. But at the moment it's been good enough to know C/C++ for most of our real time signal processing. If something is taking a really long time, we might look at implementing a vectorized version. I would like to learn CUDA for when I get a platform that has GPUs but part of me wonders if it's worth it. The reason C/C++ has been enough so far is that compilers are getting so good that you really have to know what you're doing in assembly to beat them. Casual assembly knowledge probably won't help. I might be wrong, but I envision that being the case in the not too distant future with GPUs and parallel programming.

Re:C/C++ (1)

ericcc65 (2663835) | about 9 months ago | (#45155201)

Edit: GPU in the first paragraph should be GPP, general purpose processor.

openCL, MATLAB/octave, Python (0)

Anonymous Coward | about 9 months ago | (#45155147)

If you really need fast number crunching and have a highly-parallelizable problem, consider openCL (or CUDA/directCompute, but those are less generic). There is a bit of a learning curve, but the results are worthwhile for these types of problems.

For powerful expressiveness with large datasets, I like MATLAB/Octave. For universal-ness that's easy to learn and easy for others to understand, go with Python - it's very common for certain types of simulations and models and I can understand why.

Do not even consider using Java - you will regret it.

Quick suggestion... (2)

MiniMike (234881) | about 9 months ago | (#45155181)

Do you have access to MATLAB or a similar analysis tool? Many universities have licenses, and overall it seems like it might be a good choice for you. These programs usually have a lot of built-in functionality that will be difficult to reproduce if you are not an experienced scientific programmer.

I haven't done ANY programming in about 12 years, so it would almost be like starting from scratch.

This is probably a bigger problem than choosing which language to use. If you don't know how to program properly and efficiently, it doesn't matter which language you choose. If you go this route I'd suggest taking a course to refresh or upgrade your skills. Since you're familiar with C that might be a good language to focus on in the course. Another factor is if you have to work with any existing libraries it might limit your choices. I program in C, FORTRAN, and VB and find that for computationally intensive programs C is usually the best fit, sometimes FORTRAN, and never VB.

Use an interpreted language that calls C libraries (1)

pigiron (104729) | about 9 months ago | (#45155185)

newLISP is small and can easily call most c/c++ libraries, plus Java for graphics. HTML/XML are really just LISP S-expressions for all practical purposes. Throw in a little Unix/bash and you are there.

VB (1)

confused one (671304) | about 9 months ago | (#45155189)

Personally, I would do it in C unless you have Fortran libraries you want to use, then I'd use Fortran. However, if you have existing VBA code you want to leverage, I'd just use VB.Net, import the core parts of the code and run with it. There's a moderately steep learning curve going from VB6 or VBA to VB.Net; but, it'll be much less effort than learning a new language.

R actually (0)

Anonymous Coward | about 9 months ago | (#45155193)

Perl is the second one, but if you actually mean real science, you need to learn R, or at least S, in addition to Perl.

C is a good choice too, as is C++.

We use those in real science.

What is worth your time? (1)

BruiserBlanton (133306) | about 9 months ago | (#45155195)

It depends on what you're willing to deal with.

Python is good if you don't need very heavy array code. I know you can use Python libraries that give you access to good arrays, but I think of Python as a scripting language. It's good for a quick prototype as well, but for heavy computation, I would move on to a compiled language.

Fortran 90 or Fortran 2003/08 is what will look most like the mathematical syntax you'll use. Despite what people may tell you, it is possible to write code that is understandable and reusable in Fortran; it just takes a great deal of understanding when you design the code. Most people have only seen Fortran code that was either hacked together or is so heavily optimized that it has been obfuscated.

C++ is good as well, but you'll spend more time figuring out how to express your mathematics and to use the arrays than you might find productive. In my group, we do the computer science parts of our codes in C++, but numeric calculations and heavy-duty array manipulation are done in Fortran.

The thing about taking advantage of the multiple core machine is much deeper than simply choosing a language. There are MPI and OpenMP libraries that are very good for Fortran and C++. However, producing efficient code that is parallelizable requires changing and complicating the algorithm for a well understood and functioning serial code. Writing effective parallel code will take you much more time than picking up a programming language.

details would be fantastic. (0)

Anonymous Coward | about 9 months ago | (#45155211)

How easily does your problem parallelize? How slow is too slow? Why, exactly, do you "want" to use all four cores?

Has someone solved a variant of your problem before? Since you're doing data mining, the answer is most likely 'yes', in which case unless you're a masochist or have something to prove to yourself, you want to adapt what they've done. Hell, it's quite likely that there's a nice-enough implementation in a standard software package already (R comes to mind). A few hours spent mathematically/conceptually massaging your problem into a canonical form can save you days of programming, and will train you to make useful analogies too. It won't be optimized, but that shouldn't be your concern now. Days or weeks of coding to save a few hours or days is not a smart investment.

It's really easy to get into a trap of focusing on coding, and frankly asking on Slashdot will probably lead you further that way. Sometimes you do need to focus on coding, but it should always be in the mindset of automating something you could (at least in principle) be doing by hand. For a quantitative researcher, programming is, itself, a subroutine, not an end in itself.

Matlab (2)

necro81 (917438) | about 9 months ago | (#45155213)

If you are working in academia, then you probably have access to Matlab. Matlab, as a language, has both scripting abilities and programming abilities. The scripting was born from Matlab's roots in Unix, which makes it handy for batch processing lots of files. Its programming side started off as C, but has since incorporated features from C++, Python, and Java. The programming side of it has, in my opinion, more structure and formalism than Python, but makes certain things like file IO and data visualization (i.e., graphing) easier than straight up C/C++. The basics of using it can be picked up in an afternoon, and the sky's the limit from there. There are lots of well-written and documented functions built in; specialized toolboxes can be had for additional fees. There's a fair bit of user-generated code out there. Plus, I expect you can find a lot of people around you who know plenty about it.

Multi-threading (0)

Anonymous Coward | about 9 months ago | (#45155215)

For a free windows compiler, go with MinGW.
On Linux, GCC is the standard. You can also go with LLVM/Clang.
All of these support the C++11 standard which in turn supports multi-threading out of the box.
http://solarianprogrammer.com/2011/12/16/cpp-11-thread-tutorial/
http://en.cppreference.com/w/cpp/thread
http://cpprocks.com/wp-content/uploads/C++-concurrency-cheatsheet.pdf

As for Octave, see here: http://stackoverflow.com/questions/11889118/get-gnu-octave-to-work-with-a-multicore-processor-multithreading

If you want true horse power and are willing to work for it, invest in a compatible AMD or Nvidia card that works with OpenCL and spend some time learning that.
http://opencl.codeplex.com/wikipage?title=OpenCL%20Tutorials%20-%201
http://www.drdobbs.com/parallel/a-gentle-introduction-to-opencl/231002854
http://enja.org/2010/07/13/adventures-in-opencl-part-1-getting-started/

Not enough information (1)

already_read (1170181) | about 9 months ago | (#45155233)

The answer would really depend on the nature of the problem. If you are doing more statistics type processing then R is commonly used in academia. Python might be good in the short and medium term, but you will probably want to get acquainted with C++ if you are serious.

Don't start over (0)

Anonymous Coward | about 9 months ago | (#45155239)

You're falling for one of the most common traps in programming. It doesn't do what I want, so I'm going to start over from scratch. You'll waste a lot of time doing things you've already done and debugged.

So what you should do is keep the VB program. Identify the slow parts, most likely the innermost section of the innermost loop, and convert that to a C or Fortran module. Ideally use a "message passing" interface so the C or Fortran code can be multi-threaded while the VB portion stays more or less as is.

Do just a little bit at a time, so you can see the actual progress. Choose between C or Fortran based on the availability of libraries that make your computation easier.
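
The "identify the slow parts" step is the same in any language: measure before rewriting. As an illustration only (in Python rather than VB, with made-up function names standing in for a real program), the standard library's cProfile can rank functions by where the time goes.

```python
import cProfile
import io
import pstats


def load_data():
    """Stand-in for cheap setup code (hypothetical)."""
    return list(range(200))


def inner_kernel(data):
    """Stand-in for the innermost loop, where the time actually goes."""
    total = 0
    for x in data:
        for y in data:
            total += x * y
    return total


def run_analysis():
    return inner_kernel(load_data())


def profile_report(func, top=5):
    """Run func under cProfile; return the `top` entries by cumulative time."""
    profiler = cProfile.Profile()
    profiler.enable()
    func()
    profiler.disable()
    out = io.StringIO()
    pstats.Stats(profiler, stream=out).sort_stats("cumulative").print_stats(top)
    return out.getvalue()


if __name__ == "__main__":
    print(profile_report(run_analysis))
```

Whatever function dominates the report is the candidate for porting; everything else can stay where it is.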

Fortran (1)

quietwalker (969769) | about 9 months ago | (#45155247)

I worked as a sysadmin for a high energy physics group at the Beckman Center. Day and night, it was Fortran, on big whopping clusters, doing monte carlo simulations.

Though it ~was~ many years ago.

Elsewhere, I worked for a company doing datamining on massive datasets, over a terabyte of data back in 2000, per customer, with multiple customers and daily runs on 1-5 gig subsets. We used C + big math/vector/matrix libs for the processing because nothing else could come close, and Perl or Java for the data management: preprocessing, set creation and munging (like attempting to correct spelling mistakes, parsing date strings into a standard format, normalizing data against a standard metric, applying expert system filters, even actual machine analysis like clustering or shape detection, which to us was still just preprocessing).
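
Two of the munging steps mentioned above, sketched in Python under stated assumptions: the list of accepted date formats and the choice of min-max scaling as the "standard metric" are illustrative guesses, not what that company actually used.

```python
from datetime import datetime

# Hypothetical set of input formats; extend to match the real data.
KNOWN_FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%d.%m.%Y")


def to_iso_date(text):
    """Return the date as YYYY-MM-DD, or None if no known format matches."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None


def min_max_normalize(values):
    """Rescale values linearly onto [0, 1]; constant input maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


if __name__ == "__main__":
    print([to_iso_date(s) for s in ["10/17/2013", "2013-10-17", "17.10.2013", "n/a"]])
    print(min_max_normalize([2, 4, 6]))
```

Returning None for unparseable dates, rather than raising, keeps a long batch run going and leaves the bad rows easy to count afterwards.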
