Is Python a Legitimate Data Analysis Tool?

Soulskill posted more than 2 years ago | from the one-end-it-is dept.

Python 67

Back in May we discussed using Python, R, and Octave as data analysis tools, and compared the relative strengths of each. One point of contention was whether Python could be considered a legitimate tool for such work. Now, Bei Lu writes that while Python on its own may be lacking, Python with packages is very much up to the task: "My passion with Python started with its natural language processing capability when paired with the Natural Language Toolkit (NLTK). Considering the growing need for text mining to extract content themes and reader sentiments (just to name a few functions), I believe Python+packages will serve as more mainstream analytical tools beyond the academic arena." She also discusses an emerging set of solutions for R which let it better handle big data.


67 comments


really? (2, Interesting)

Anonymous Coward | more than 2 years ago | (#40567621)

Any Turing-complete language is a legitimate data analysis tool.

Re:really? (2)

Meshach (578918) | more than 2 years ago | (#40567669)

Any Turing-complete language is a legitimate data analysis tool.

The question is not whether or not it is possible but whether or not it is realistic and practical.

Re:really? (0)

Anonymous Coward | more than 2 years ago | (#40567715)

No the question is whether it is legitimate.

Re:really? (3, Funny)

Billly Gates (198444) | more than 2 years ago | (#40567731)

No the question is whether it is legitimate.

Then in that case, Excel, because you can email it and share it with colleagues and it is PHB approved.

Re:really? (1)

Phibz (254992) | more than 2 years ago | (#40568955)

You joke, but you'd be surprised: in the marketing industry Excel is quite popular for data analysis. For small data sets it does a fine job.

Re:really? (1)

Billly Gates (198444) | more than 2 years ago | (#40570061)

It does a great job.

The problem is when you need to share the data. Then a database is the answer, and it can do data mining as well if it is a good non-free one ... well, except if it's Oracle. UGH.

But to me that is common sense rather than feeding it into a script in some programming language.

Re:really? (1)

NadMutter (631470) | more than 2 years ago | (#40570535)

Replying to undo accidental moderation (sticky trackpad)

Sadly I see way too many corporate 'documents' that subscribe to the putative logic 'if it has numbers, it must be a spreadsheet; if it has pictures, it goes into powerpoint'. Where's the 'Ironic' moderation option?

Re:really? (1)

cynyr (703126) | more than 2 years ago | (#40572101)

you should see small business engineering... if it is, it is in excel or autocad. These are then printed to PDF for publication.

Re:really? (1)

Anonymous Coward | more than 2 years ago | (#40569543)

I've been seeing this way too often. People don't bother considering what the intent of the person asking was. They fixate on one word with a relatively vague meaning and choose one particular interpretation of it, and then go haring off into oblivion. The discussion turns into a verbal fight over the precise definition of a word that's vague in the first place.

It's really rather annoying.

So: can you replace R with Python (let's say, for a new project), assuming that you know both languages and all the relevant libraries, without a significant hit to productivity? Without reinventing stuff that R has by default? Without pining for CRAN every ten minutes?

Re:really? (1)

KiloByte (825081) | more than 2 years ago | (#40567945)

With the right libraries, it ALWAYS is both realistic and practical.

Of course, you'd need really good libraries to overcome malbolge or brainfuck, but hey, no one says the underlying language has to be visible from behind them...

Re:really? (2)

cynyr (703126) | more than 2 years ago | (#40572121)

Oblig. XKCD [xkcd.com]

Re:really? (1)

jonadab (583620) | more than 2 years ago | (#40569251)

> The question is not whether or not it is possible but whether or not it is realistic and practical.

Using Python for data analysis is realistic, assuming you know Python (or have enough background in computer science to pick it up quickly -- it's not a particularly difficult language, as languages go: I've seen accounting software packages that would be much harder to learn).

Python is perhaps not quite as practical as some other choices. In particular, object-oriented programming is not an especially good fit for many data analysis tasks; a multiparadigmatic language would often be better, because it lets you use functional techniques, which is often very handy for working with data sets. (It's no coincidence that SQL bears a striking resemblance to the data-filtering portions of a typical impure functional language.) OTOH, it's good that not everyone uses exactly the same thing. The right tool for the specific job you're doing and all that -- all data analysis is not created identical.

Personally, I use Perl.
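
For readers weighing the point above about functional techniques and SQL-like filtering, here is a minimal sketch of that style in plain Python. It is not from the comment; the records and field names are invented for illustration, and it only shows that comprehensions and itertools cover the filter, project, and group steps.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical sample records, invented for illustration.
rows = [
    {"region": "east", "product": "widget", "units": 12},
    {"region": "west", "product": "widget", "units": 7},
    {"region": "east", "product": "gadget", "units": 3},
    {"region": "west", "product": "gadget", "units": 20},
]

# Roughly: SELECT region, units FROM rows WHERE units > 5
selected = [(r["region"], r["units"]) for r in rows if r["units"] > 5]

# Roughly: SELECT region, SUM(units) ... GROUP BY region
# groupby only merges adjacent items, so sort by the key first.
by_region = sorted(selected, key=itemgetter(0))
totals = {
    region: sum(units for _, units in group)
    for region, group in groupby(by_region, key=itemgetter(0))
}

print(totals)
```

No classes are involved; whether this reads better than the equivalent SQL or Perl is a matter of taste.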

CERN (1)

Roger W Moore (538166) | more than 2 years ago | (#40575109)

The question is not whether or not it is possible but whether or not it is realistic and practical.

Not only is it realistic and practical but it is already in use for data analysis! Everyone on the ATLAS experiment at CERN uses python to some degree in their analysis and my grad students and I use an analysis framework almost entirely in Python with ROOT [root.cern.ch] for I/O.

Re:really? (0)

Anonymous Coward | more than 2 years ago | (#40568845)

Try doing data analysis in brainfuck or machine language then and get back to us on how legitimate a tool those are.

Re:really? (1)

luis_a_espinal (1810296) | more than 2 years ago | (#40569887)

Any Turing-complete language is a legitimate data analysis tool.

Legitimate =/= feasible without regard to cost.

Otherwise, let's use assembly to write our own analytics package.</rollseyes>

It Works (4, Insightful)

mrsquid0 (1335303) | more than 2 years ago | (#40567633)

Python may not be a legitimate data analysis tool, but it is widely used for data analysis, and it gives the right results. For the most part that is what really matters.

Re:It Works (-1)

Anonymous Coward | more than 2 years ago | (#40567657)

First Reply!

Re:It Works (5, Insightful)

mcgrew (92797) | more than 2 years ago | (#40567699)

Python is a language. It's a tool to build other tools with, including data analysis tools.

Re:It Works (2)

Instine (963303) | more than 2 years ago | (#40569777)

or use other libraries easily and quickly. PyCUDA gives genuinely huge number crunching power to the language. And allows meta programming which suits scripting languages and machine learning very well. http://mathema.tician.de/dl/pub/nvidia-gtc-2009.mp4 [tician.de]

The readability and flexibility and speed of development are what it brings, the raw power comes from the libraries it can talk to.

Re:It Works (4, Insightful)

ceoyoyo (59147) | more than 2 years ago | (#40568005)

What does "legitimate data analysis tool" mean? MatLab was included in the comparison, and MatLab is more of an engineering tool. The built in (excuse me, optional paid for) stats library is pretty limited.

R is great for doing statistical analysis, but it's not great for doing things like image analysis. Without additional libraries R isn't nearly as good as it is with libraries either.

Re:It Works (3, Funny)

roman_mir (125474) | more than 2 years ago | (#40571419)

What does "legitimate data analysis tool" mean?

- obviously it means to ask whether Python is legitimate or is bastard, what do you think it means? It is not asking whether Python is a 'data analysis tool', it is asking whether Python is a legitimate something or other.

So to answer the question you have to look at the Python's descendancy. You'll quickly discover that Python was actually conceived in a huge orgy of different programming paradigms, styles and languages, it's even named after a circus!

I believe the answer is that Python is a bastard of data analysis tools, but so what, bastards are people too.

Re:It Works (0)

Anonymous Coward | more than 2 years ago | (#40577671)

..., bastards are people too.

Unless they are gingers, in which case it depends on whether you expect people to have soul.

Call me old fashioned (0)

Anonymous Coward | more than 2 years ago | (#40567679)

But most of my data analysis stuff I put in a database and retrieve it with SQL. The language makes little difference. There are also people who swear by Excel and only excel to do calculations.

It doesn't matter where you apply the math.

Re:Call me old fashioned (5, Interesting)

ceoyoyo (59147) | more than 2 years ago | (#40567853)

It depends how complicated the math is.

I wrote a general linear model in Python because I was unhappy with the existing ones and I wanted an intimate knowledge of how it worked. I wrote most of a general linear mixed model, but then decided it wasn't worth the time and just used the one in R via RPy2. Then it turned out the one built into R was too slow, so I upgraded to the one in the lme R package. That exists because a lot of smart people use R.

But sure, if your "data analysis" involves multiplication and maybe a t-test or two, it doesn't really matter what you use.
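
As a rough illustration of the simple end of that spectrum, the sketch below fits a one-predictor linear model by ordinary least squares using only the standard library. It is not the commenter's code and is far from a general linear (mixed) model; the function name fit_line and the sample data are assumptions made for the example.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two or more paired observations")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and of cross deviations of x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x values must not all be identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical data lying exactly on y = 1 + 2x.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
intercept, slope = fit_line(xs, ys)
print(intercept, slope)
```

Anything beyond this (multiple predictors, random effects, diagnostics) is where reaching for an established package, as described above, starts to pay off.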

Re:Call me old fashioned (-1)

Anonymous Coward | more than 2 years ago | (#40567895)

It would be conducive to a better world if you ingested as much sewage as possible.

Re:Call me old fashioned (1)

highacnumber (988934) | more than 2 years ago | (#40569367)

If the math is more abstract, then Sage (python-based) is a better bet than R (and Sage includes R): www.sagemath.org.

Re:Call me old fashioned (1)

ceoyoyo (59147) | more than 2 years ago | (#40569571)

Sage is basically a batteries-included Python distribution. Lots of people like a big package to use like that. I prefer putting the pieces together myself. My other complaint about Sage, last time I looked at it, is that it's more difficult to install your own packages in the Sage environment than it is to do so with stock Python. One of the great things about Python is needing a particular algorithm, typing it into Google, and downloading and installing the handy package that someone else has already written and shared.

Re:Call me old fashioned (0)

Anonymous Coward | more than 2 years ago | (#40573277)

sage -sh

easy_install foo

Re:Call me old fashioned (3, Insightful)

hey! (33014) | more than 2 years ago | (#40569997)

Alright, you're old-fashioned. And you're mixing up apples and oranges.

I think what most people these days are talking about is not just having some kind of online analytics data resource, but having a system where having that resource is taken as a given and the task is to use mathematics and AI to classify records, discover patterns and relationships, locate unusual data (without necessarily specifying the nature of the anomaly in advance), and whatnot.

A spreadsheet is fine for doing simple summaries of small, heterogeneous, tabular datasets (calculating averages and whatnot). But it's not going to help you find one record out of millions where your search criteria are too complex to be expressed in a SQL where clause.

http://en.wikipedia.org/wiki/Betteridge's_Law_of_H (0)

Anonymous Coward | more than 2 years ago | (#40567723)

No.

Re:http://en.wikipedia.org/wiki/Betteridge's_Law_o (3, Funny)

jdgeorge (18767) | more than 2 years ago | (#40567893)

No.

Working link for subject. [wikipedia.org]
In other news, How hot is vehicle theft in your area? [examiner.com]

Re:http://en.wikipedia.org/wiki/Betteridge's_Law_o (2)

godrik (1287354) | more than 2 years ago | (#40567981)

Tomorrow on slashdot:

"Can all questions in headlines be answered by 'no' ?"

Re:http://en.wikipedia.org/wiki/Betteridge's_Law_o (1)

mooingyak (720677) | more than 2 years ago | (#40568093)

Tomorrow on slashdot:

"Can all questions in headlines be answered by 'no' ?"

Most, but not all [slashdot.org]

Re:http://en.wikipedia.org/wiki/Betteridge's_Law_o (1)

physburn (1095481) | more than 2 years ago | (#40570759)

Can betteridge's Law cause paradox?

these articles are not informative (1)

Anonymous Coward | more than 2 years ago | (#40567873)

Someone who knows so little about tools like R, python, etc. should spend their time learning about what is available rather than writing articles on the topic using their own cursory knowledge.

Re:these articles are not informative (1)

Eponymous Hero (2090636) | more than 2 years ago | (#40568279)

seems legit

Use what works (5, Insightful)

hawguy (1600213) | more than 2 years ago | (#40567889)

Since people do use python for data analysis (hence the data analysis related packages that are available), of course it's legitimate.

Just like how when you're standing on the roof and you need to pound in a couple nails, that heavy pair of pliers in your pocket is a legitimate tool. It may not be the best tool for the job, the best tool might be a pneumatic nail gun, but if all you have with you and what you know how to use is pliers, then that's the right tool. Why spend time and money learning some other "more appropriate" language (or buying an air compressor and nail gun) when you already have a tool at your fingertips that will do what you need.

As your needs grow you might need to find another more appropriate tool, but if you can get the job done with Python, why bother searching for the "perfect" tool?

Depending on your needs, sh, awk, sed, sort, and uniq may be all the tools you need - many log parsing, analysis and reporting programs have been written with those tools, often ingesting more rows of data per day than many small business BI systems.
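
For comparison, a pipeline along the lines of awk '{print $4}' | sort | uniq -c | sort -rn has a short Python counterpart. This is a minimal sketch, assuming whitespace-separated fields; the log format and the helper name count_field are invented for illustration.

```python
from collections import Counter

# Hypothetical log lines; the format is invented for illustration.
log_lines = [
    "10.0.0.1 GET /index.html 200",
    "10.0.0.2 GET /missing 404",
    "10.0.0.1 GET /about.html 200",
    "10.0.0.3 GET /missing 404",
    "10.0.0.1 POST /login 500",
]


def count_field(lines, index):
    """Count occurrences of the whitespace-separated field at `index`."""
    return Counter(line.split()[index] for line in lines if line.strip())


# Field 3 (zero-based) is the status code in this made-up format.
status_counts = count_field(log_lines, 3)
for status, count in status_counts.most_common():
    print(count, status)
```

Passing an open file object instead of the list would process a real log line by line, much as the shell pipeline does.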

Re:Use what works (0)

Anonymous Coward | more than 2 years ago | (#40568187)

Depending on your needs, sh, awk, sed, sort, and uniq may be all the tools you need

Add wc to your list and I'm good...

Re:Use what works (0)

Anonymous Coward | more than 2 years ago | (#40572573)

You might like pyp ('the Pyed Piper') and piep then.

Re:Use what works (1)

betterunixthanunix (980855) | more than 2 years ago | (#40568437)

Why spend time and money learning some other "more appropriate" language (or buying an air compressor and nail gun) when you already have a tool at your fingertips that will do what you need.

Indeed, although sometimes you save yourself a lot of headaches by getting a tool that was built for your task. I have, in a pinch, used a screw driver to hammer nails, but a screw driver is no replacement for a hammer.

That being said, Python+SciPy+NumPy is fine for data analysis; people use it all the time, and it works as well as R or MatLab. It is not as though we are talking about QuickBasic for data analysis.

Re:Use what works (0)

Anonymous Coward | more than 2 years ago | (#40568709)

Analysing the data is only part of it. Python is also awesome for collecting it thanks to the tons of bindings it has for things like OpenCV and Phidgets. In about 30 minutes, I wrote a script that uses a motor controller and a webcam to scan different layers in core samples and measures their relative depths. I could have done the same thing in C++ but why bother?

Re:Use what works (0)

Anonymous Coward | more than 2 years ago | (#40568705)

Just like how when you're standing on the roof and you need to pound in a couple nails, that heavy pair of pliers in your pocket is a legitimate tool.

I think that's a poor comparison, in that it implies that python is a poor tool for the job, or somehow inadequate.

I think a more apt comparison is installing a roof. You can do it with just a regular claw hammer, but things are *much* easier and faster if you pull out a pneumatic nail gun. It's not that a claw hammer is a bad or sub-standard tool, it's just that the pneumatic nailer was designed for tasks such as installing a complete roof, and has the firepower to make things quicker, easier, and more efficient.

You can certainly come up with certain edge cases where a claw hammer might be better than the pneumatic nailer. For example, if you're installing a slate roof, and the power of the nailer would risk cracking the tiles. Or if you're not installing a complete roof, but only patching a few shingles, and the overhead of lugging out the compressor would be overkill. Or you're an old school roofer with the hammer-handling chops which give you extra control with a hammer without the sacrifice of much speed. But if you're a normal person doing a large, standard, technical job like installing a complete asphalt shingle roof, then it makes sense to look at a tool that was built around such use cases.

Yes. (0)

yeltski (1438587) | more than 2 years ago | (#40567953)

Of course.

Python can do anything (5, Funny)

Anonymous Coward | more than 2 years ago | (#40568025)

http://xkcd.com/353/

Better than R (1)

Anonymous Coward | more than 2 years ago | (#40568149)

I looked at R and it's one of the most deranged languages I've ever seen in terms of syntax (up there with Erlang). At least Python is readable to the average programmer who knows C or Java.

Re:Better than R (3, Informative)

ceoyoyo (59147) | more than 2 years ago | (#40568341)

R is MUCH nicer when you use it through a bridge from Python.

Re:Better than R (0)

Anonymous Coward | more than 2 years ago | (#40568827)

Can you elaborate what is a bridge? Do you mean something like Rserve [rforge.net] ?

Re:Better than R (1)

Anonymous Coward | more than 2 years ago | (#40568911)

I think he means RPy2

Re:Better than R (0)

Anonymous Coward | more than 2 years ago | (#40571461)

Thanks, much appreciated!

A Language, dummy (0)

Anonymous Coward | more than 2 years ago | (#40568233)

It's a language, not a tool, although languages are tools to communicate with.
Analysis is a function of math, not language.

Is it reproducible? (1)

Anonymous Coward | more than 2 years ago | (#40568337)

I work in the biosciences and we occasionally have a similar discussion.

In our context, it isn't about how one analyzes the data, it is a question of how anyone else can recreate your experiment: that is, set up the experimental system, acquire the data, and analyze it in a way that yields approximately the same results. It is in our best interest [and mandated by our funding agency and the journals] to publish papers that clearly define how we made our observations and how we analyzed the data.

My group concludes that any tool is fine, but it must be part of a well-described logical framework in which we generate a hypothesis, test it, and make a conclusion.

It depends on the context (0)

Anonymous Coward | more than 2 years ago | (#40568365)

The language of choice depends on the community. If everyone is using Mumps, you should use Mumps. http://en.wikipedia.org/wiki/MUMPS [wikipedia.org] You want to be able to share your work with the rest of the community.

One application I am aware of where Python is widely used is bioinformatics.
http://onlamp.com/pub/a/python/2002/10/17/biopython.html [onlamp.com]

Having said the above, Python has a lot to recommend it. It has become the initial teaching language of choice. There will be some people whose only language is Python. That's OK. It scales and can be used within almost any programming paradigm. If your only language is Python and you don't have much data to process, why would you bother learning something like R. http://en.wikipedia.org/wiki/R_(programming_language) [wikipedia.org]

Re:It depends on the context (0)

Anonymous Coward | more than 2 years ago | (#40570309)

It scales

No, no it doesn't. And I've seen it choke on large data sets that other languages have no problem with. Python is not a good choice if scalability is a concern.

Legitimate? (1)

jdavidb (449077) | more than 2 years ago | (#40568693)

"legitimate" is such a disrespectful value judgment. Are you saying that people who do data analysis with Python are illegitimate? Are you calling them bastards?

No, seriously, you can have a profitable conversation all about the reasons why you think there are serious drawbacks to using Python as your data analysis tool. Lots of people might benefit from that. But when you start saying things like "That's not a legitimate data analysis tool" or "That's not a real programming language" or whatever, then you are getting down into contentless arguing, passing off disrespect as if it were legitimate discourse.

If you really think use of Python as a data analysis tool is that bad, go all the way: don't try to have a serious subject on the discussion, turn it into a humorous essay on people who are so stupid and unenlightened that they can't see what is blindingly obvious to you.

A long time ago in my academic life, I took a neural networks class that did a lot of data analysis with matlab. I poked around with octave, but I finally wound up writing my projects in Perl with PDL. I'm sure not many people would do that, but I just wanted the learning experience. It was legitimate for my purposes, which was learning and the joy of being able to say I did it. But you might want to mock me for it. :)

Re:Legitimate? (1)

Sir_Sri (199544) | more than 2 years ago | (#40573163)

"legitimate" is such a disrespectful value judgment. Are you saying that people who do data analysis with Python are illegitimate? Are you calling them bastards?

I'm not sure that's how it's meant, but I agree, it's an odd choice of phrase. If I were to look at it another way, what would make a language 'illegitimate' for data analysis? In that case you look at things like Excel and Access for financial transactions, or some of the early versions of CUDA that didn't support proper IEEE floating point maths (or at least, not fast IEEE floating point maths). In those cases you can use the language, and it will spit out results, but they might not be right, and there's no obvious way to know. Let's say by default some language doesn't convert in any sort of reasonably obvious way between number types (ints to floats, floats to ints, floats to double precision floats, that sort of thing), or the math has some bizarre errors in it. A classic example would be division: that's slow to do properly, so if your language takes some shortcuts by default that are faster, but wrong, well, that'd be bad.

These of course are all things that can be changed or fixed with appropriate libraries and so on, but you'd need to know those are problems. Which I guess is why you'd ask.

So ya, overall, it's a strange way to phrase the problem. In a broadly theoretical sense there's no reason any decent language couldn't be used for data analysis, obviously, so from there it's a matter of whether or not it's up front about when it does things badly.
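
To make the numeric concerns above concrete, here is a small sketch of how Python handles a few of them. It assumes Python 3 (division behaves differently in Python 2) and uses only the standard library; it says nothing about Excel, Access, or CUDA.

```python
from decimal import Decimal
import math

# Binary floating point cannot represent 0.1 exactly, so equality tests mislead.
float_sum = 0.1 + 0.2
print(float_sum == 0.3)              # False
print(math.isclose(float_sum, 0.3))  # True

# Decimal keeps exact base-10 values, which suits financial amounts.
decimal_sum = Decimal("0.10") + Decimal("0.20")
print(decimal_sum == Decimal("0.30"))  # True

# Python 3 separates true division from floor division explicitly.
print(7 / 2)    # 3.5
print(7 // 2)   # 3
print(-7 // 2)  # -4, because floor division rounds toward negative infinity

# Converting a float to int truncates toward zero without any warning.
print(int(2.9), int(-2.9))  # 2 -2
```

None of this is wrong behaviour, but each line is a place where results can be silently off if the analyst does not know the rule.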

It's a programming language (0)

Anonymous Coward | more than 2 years ago | (#40568939)

If you're seriously asking this question you're over your head.

Already considered such.... (1)

Anonymous Coward | more than 2 years ago | (#40569283)

Just ask the astronomy community. They've been moving away from IDL as an analysis environment and towards the use of python with scipy (with numpy and pyfits offering similar performance). You're asking this question several years after it's already been effectively declared as such.

Re:Already considered such.... (1)

goatbar (661399) | more than 2 years ago | (#40572577)

If you are talking astronomy, you've left out http://yt-project.org/ [yt-project.org]

Re:Already considered such.... (0)

Anonymous Coward | more than 2 years ago | (#40572847)

Yeah, I haven't been involved in an actual astronomy (astrophysics really) collaboration since about 8 years ago, so, I'm a bit dated. :)

A fundamental misunderstanding (0)

Anonymous Coward | more than 2 years ago | (#40570375)

Is Python a Legitimate Data Analysis Tool?

Python is not a data analysis tool. It's a programming language.

You would use python (or any other reasonable programming language) to BUILD a data analysis tool.

But that doesn't mean that python IS a data analysis tool.

I can only imagine what your next question will be: "Can the x86 processor be used to run a data analysis program?" Asking such questions demonstrates a deep misunderstanding about how computing is organized into layers.

"And it's critical to finding the Higgs Boson" (1)

Narrowband (2602733) | more than 2 years ago | (#40571133)

This story seems like an echo of the one a day or so ago about Linux being critical to the success of the LHC. Something with generic programmability supports something specific, then gets discussed as a tool for that specific task. Probably a lot of the comments there apply here.

Perl and Python both (2)

gizmo_mathboy (43426) | more than 2 years ago | (#40571615)

Python and Perl make great data analysis tools.

They have a plethora of libraries to handle things: Numpy/Scipy for Python and PDL/GSL for Perl.

They can access FORTRAN and C libraries as necessary for either performance or legacy needs.

They are probably best because they are high level languages, very platform neutral, and cost significantly less than other "serious" data analysis tools/languages.

Pandas - Data Analysis for Python (1)

Anonymous Coward | more than 2 years ago | (#40572153)

Yes, absolutely. It's being used to do all sorts of data analysis in the real world.

Check out Pandas (http://pandas.pydata.org/) the Python data analysis library.

Also there are lots of machine learning libraries: scikits-learn is probably the best known (http://scikit-learn.org/)
Both of these are built on NumPy.

You should also check out the videos from the 2012 PyData workshop: http://marakana.com/s/2012_pydata_workshop,1090/index.html

yes (0)

Anonymous Coward | more than 2 years ago | (#40573595)

add numpy, scipy, scikit-*, pyplot to make it comfortable. perhaps pyr since you mention r. one phd in physics defended in this home with help of python.

size of data (1)

transonic_shock (1024205) | more than 2 years ago | (#40575153)

I love R and Python. However, both of them choke on big data sets. What they need is a built-in mechanism to store data on disk rather than in memory. There are some really convoluted ways of doing this, but they don't always work with modeling packages that weren't written with the convoluted approach you are taking in mind. So, if the base language has the ability to store objects on disk, say with a simple flag, and it's transparent to the rest of the system, most downstream libraries/packages would still work.

The ff package in R is a good approach; maybe that should be adopted as the memory model for R.

I hate to say this but maybe R/Python can learn something from SAS here.
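
One partial workaround for the memory limit described above, when the statistic can be computed in a single pass, is to stream rows from disk instead of loading them. The sketch below is not the transparent on-disk object model the comment asks for; it uses a generator plus Welford's one-pass algorithm, and the file contents, column name, and function names are assumptions for illustration.

```python
import csv
import os
import tempfile


def stream_values(path, column):
    """Yield one float at a time from a CSV column, never loading the whole file."""
    with open(path, newline="") as handle:
        for row in csv.DictReader(handle):
            yield float(row[column])


def running_mean_variance(values):
    """Welford's one-pass algorithm: returns (count, mean, sample variance)."""
    count, mean, m2 = 0, 0.0, 0.0
    for value in values:
        count += 1
        delta = value - mean
        mean += delta / count
        m2 += delta * (value - mean)
    variance = m2 / (count - 1) if count > 1 else 0.0
    return count, mean, variance


# Build a small hypothetical file so the sketch runs end to end.
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False, newline="") as tmp:
    writer = csv.writer(tmp)
    writer.writerow(["reading"])
    for value in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
        writer.writerow([value])
    path = tmp.name

count, mean, variance = running_mean_variance(stream_values(path, "reading"))
os.remove(path)
print(count, mean, variance)
```

Memory use stays constant regardless of file size, but this only helps for computations that can be expressed as a single pass; model-fitting packages that expect the full data set in memory are a different problem.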
