What Would You Want In a Large-Scale Monitoring System?

timothy posted more than 4 years ago | from the detect-curse-words-and-fine-a-quarter-apiece dept.

Networking 342

Krneki writes "I've been developing monitoring solutions for the last five years. I have used Cacti, Nagios, WhatsUP, PRTG, OpManager, MOM, Perl-script solutions, ... Today I have changed employer and I have been asked to develop a new monitoring solution from scratch (5,000 devices). My objective is to deliver a solution that will cover the network devices, servers and applications. The final product must be very easy to understand as it will be used also by help support to diagnose problems during the night. I need a powerful tool that will cover all I need and yet deliver a nice 2D map of the company IT infrastructure. I like Cacti, but usually I use it only for performance monitoring, since polling can't be set to a 5 or 10 sec interval for huge networks. I'm thinking about Nagios (but the 2D map is hard to understand), or maybe OpManager. What monitoring solution do you use and why?"
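
To put the 5 or 10 second polling interval in perspective, here is a rough sizing sketch. It assumes one check per device and a purely hypothetical 0.25 s round trip per poll; neither number comes from the submission, and the function names are illustrative.

```python
import math

def polls_per_second(devices: int, interval_s: float, checks_per_device: int = 1) -> float:
    """Average poll rate needed to visit every check once per interval."""
    return devices * checks_per_device / interval_s

def workers_needed(rate_per_s: float, seconds_per_poll: float) -> int:
    """Concurrent pollers needed to sustain the rate (Little's law: L = rate * latency)."""
    return math.ceil(rate_per_s * seconds_per_poll)

# 0.25 s per poll is an assumed figure; measure your own devices before sizing anything.
for interval in (5, 10, 300):
    rate = polls_per_second(5000, interval)
    print(f"{interval:>3}s interval: {rate:7.1f} polls/s, "
          f"{workers_needed(rate, 0.25)} workers at 0.25s per poll")
```

The point of the comparison is that moving from a 5 minute cycle to a 5 second cycle multiplies the poll rate by 60, which is why a poller that is comfortable at the default interval may not keep up at the shorter one.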


I Name My Devices After Al Qaeda Members (5, Funny)

Philip K Dickhead (906971) | more than 4 years ago | (#28628637)

Publish them in DNS, and have the NSA monitor them for me!

Re:I Name My Devices After Al Qaeda Members (1)

ionix5891 (1228718) | more than 4 years ago | (#28628921)

What Would You Want In a Large-Scale Monitoring System?

uncle sam is that you?

Hyperic HQ (2, Informative)

Anonymous Coward | more than 4 years ago | (#28628639)

Hyperic HQ may be worth checking out.

Re:Hyperic HQ (0)

Anonymous Coward | more than 4 years ago | (#28628811)

Case of one, but I have not had good luck with Hyperic being able to consistently detect whether a server or application is truly "Available" on even 5 minute granularity. FreeNATS could also be looked at.

Re:Hyperic HQ (0)

Anonymous Coward | more than 4 years ago | (#28629199)

Yes, it's true. Hyperic is a very good option. But the Corporate license (the only one that scales well to those 2500+ monitored nodes) is very expensive (well, at least it was a year ago)...

rule based DSS (1)

ecklesweb (713901) | more than 4 years ago | (#28628661)

Don't assume that you can successfully diagnose the problem based on your understanding of the indicators. You don't know my institutional context. Instead, give me a decision support system that I can use by adding rules that key off the monitored indicators and inject some of our own expertise into the diagnostic process.
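
A minimal sketch of what such user-supplied rules could look like, assuming the monitored indicators arrive as a plain dictionary of numbers. The `Rule` class, the indicator names, and the thresholds are all hypothetical and only illustrate the idea.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Indicators = Dict[str, float]

@dataclass
class Rule:
    name: str
    condition: Callable[[Indicators], bool]  # site-specific predicate
    advice: str                              # what the operator should do

def diagnose(indicators: Indicators, rules: List[Rule]) -> List[str]:
    """Return the advice of every rule whose condition matches the indicators."""
    return [f"{r.name}: {r.advice}" for r in rules if r.condition(indicators)]

# Hypothetical rules encoding local knowledge; names and thresholds are made up.
rules = [
    Rule("ups-on-battery",
         lambda m: m.get("ups_on_battery", 0) == 1,
         "Power feed lost; call facilities before paging server admins."),
    Rule("db-disk-nearly-full",
         lambda m: m.get("db_disk_used_pct", 0) > 90,
         "Archive logs are probably piling up; run the purge job."),
]

for line in diagnose({"ups_on_battery": 1, "db_disk_used_pct": 42}, rules):
    print(line)
```

Keeping the rules as data separate from the collector means the people who know the institutional context can add to them without touching the monitoring code.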

OpenNMS (5, Informative)

Anonymous Coward | more than 4 years ago | (#28628669)

That's all you should need. For 5000 devices I don't know that any of the options you listed would be appropriate.

OpenNMS is much more than monitoring, but I think that you'll appreciate the other features as well.



Re:OpenNMS (4, Interesting)

mu51c10rd (187182) | more than 4 years ago | (#28628863)

I use OpenNMS as well. I actually migrated off of Nagios to OpenNMS. Tried out Zenoss and Cacti as well. While any of these are better than OpenView IMHO, I liked OpenNMS's full suite of functionality without having to pay for the 'commercial' version.

Re:OpenNMS (4, Insightful)

Cato (8296) | more than 4 years ago | (#28628991)

I've only tried OpenNMS. It looks very powerful, but wasn't at all hard to get installed and configured on Ubuntu - it figures out the type of node it has discovered and shows useful data through SNMP, and can also do uptime monitoring, and is generally very scalable and configurable if needed.

Yep. This is the one. (0)

Anonymous Coward | more than 4 years ago | (#28629195)

It requires a substantial investment in learning how to set it up and how to use it, but then so do the big commercial products like OpenView and Tivoli.

Re:OpenNMS (0)

Anonymous Coward | more than 4 years ago | (#28629397)

Why do all free NMS systems have crappy performance (executing system()/shell scripts to check absolutely anything) or require you to learn a whole new application-specific, poorly designed language to do even simple tasks?

I just want something that looks reasonable (a configuration GUI), works, scales well with strong SNMP discovery, and that a sane person can learn to use reasonably well in less than an hour.

After spending less than 2 minutes toying around with the OpenNMS demo server, I was greeted with the following:

org.opennms.web.event.EventIdNotFoundException: The event id must be an integer.
        at org.apache.jsp.event.detail_jsp._jspService(detail_jsp.java:70)
        at org.apache.jasper.runtime.HttpJspBase.service(HttpJspBase.java:98)
        at javax.servlet.http.HttpServlet.service(HttpServlet.java:803)
        at org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.java:328)
        at org.apache.jasper.servlet.JspServlet.serviceJspFile(JspServlet.java:315)
        at org.apache.jasper.servlet.JspServlet.service(JspServlet.java:265)
        at javax.servlet.http.HttpServlet.service(HttpServlet.java:803)
        at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:269)
        at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:188)
        at org.extremecomponents.table.filter.AbstractExportFilter.doFilter(AbstractExportFilter.java:49)
        at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:215)
        at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:188)
        at org.opennms.web.StoreRequestPropertiesFilter.doFilter(StoreRequestPropertiesFilter.java:71)
        at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:215)
        at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:188)
        at org.acegisecurity.util.FilterChainProxy$VirtualFilterChain.doFilter(FilterChainProxy.java:265)
        at org.acegisecurity.intercept.web.FilterSecurityInterceptor.invoke(FilterSecurityInterceptor.java:107)
        at org.acegisecurity.intercept.web.FilterSecurityInterceptor.doFilter(FilterSecurityInterceptor.java:72)
        at org.acegisecurity.util.FilterChainProxy$VirtualFilterChain.doFilter(FilterChainProxy.java:275)
        at org.acegisecurity.ui.ExceptionTranslationFilter.doFilter(ExceptionTranslationFilter.java:166)
        at org.acegisecurity.util.FilterChainProxy$VirtualFilterChain.doFilter(FilterChainProxy.java:275)
        at org.acegisecurity.providers.anonymous.AnonymousProcessingFilter.doFilter(AnonymousProcessingFilter.java:125)
        at org.acegisecurity.util.FilterChainProxy$VirtualFilterChain.doFilter(FilterChainProxy.java:275)
        at org.acegisecurity.wrapper.SecurityContextHolderAwareRequestFilter.doFilter(SecurityContextHolderAwareRequestFilter.java:81)
        at org.acegisecurity.util.FilterChainProxy$VirtualFilterChain.doFilter(FilterChainProxy.java:275)
        at org.acegisecurity.ui.basicauth.BasicProcessingFilter.doFilter(BasicProcessingFilter.java:173)
        at org.acegisecurity.util.FilterChainProxy$VirtualFilterChain.doFilter(FilterChainProxy.java:275)
        at org.acegisecurity.ui.AbstractProcessingFilter.doFilter(AbstractProcessingFilter.java:271)
        at org.acegisecurity.util.FilterChainProxy$VirtualFilterChain.doFilter(FilterChainProxy.java:275)
        at org.acegisecurity.ui.logout.LogoutFilter.doFilter(LogoutFilter.java:110)
        at org.acegisecurity.util.FilterChainProxy$VirtualFilterChain.doFilter(FilterChainProxy.java:275)
        at org.acegisecurity.context.HttpSessionContextIntegrationFilter.doFilter(HttpSessionContextIntegrationFilter.java:249)
        at org.acegisecurity.util.FilterChainProxy$VirtualFilterChain.doFilter(FilterChainProxy.java:275)
        at org.acegisecurity.util.FilterChainProxy.doFilter(FilterChainProxy.java:149)
        at org.acegisecurity.util.FilterToBeanProxy.doFilter(FilterToBeanProxy.java:98)
        at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:215)
        at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:188)
        at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:210)
        at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:172)
        at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
        at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:117)
        at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:108)
        at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:151)
        at org.apache.jk.server.JkCoyoteHandler.invoke(JkCoyoteHandler.java:200)
        at org.apache.jk.common.HandlerRequest.invoke(HandlerRequest.java:283)
        at org.apache.jk.common.ChannelSocket.invoke(ChannelSocket.java:773)
        at org.apache.jk.common.ChannelSocket.processConnection(ChannelSocket.java:703)
        at org.apache.jk.common.ChannelSocket$SocketConnection.runIt(ChannelSocket.java:895)
        at org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run(ThreadPool.java:685)
        at java.lang.Thread.run(Thread.java:619)

I'd gladly pay for a commercial solution any day given the experience I've had with the open source solutions tried thus far, and I absolutely love MRTG.

A more interesting question (5, Insightful)

drsmithy (35869) | more than 4 years ago | (#28628691)

What limitations exist in current solutions that justify developing a new one from scratch?

Re:A more interesting question (3, Insightful)

Meshach (578918) | more than 4 years ago | (#28628817)

What limitations exist in current solutions that justify developing a new one from scratch?

Exactly! Too often people just jump in and redo everything without actually investigating what needs to be fixed. The quote from George Santayana, "Those who cannot learn from history are doomed to repeat it," seems very apropos here.

Re:A more interesting question (3, Informative)

glassware (195317) | more than 4 years ago | (#28628823)

He said he was asked to "develop a new solution" - which most likely means he gets to pick and choose what to implement, whether parts of it are custom developed or off the shelf. I would imagine a good solution would be a core product plus custom built extensions for the features he needs that the product doesn't implement itself.

Re:A more interesting question (2, Informative)

Krneki (1192201) | more than 4 years ago | (#28628853)

Exactly, I need a good core product that I'll evolve over time.

Re:A more interesting question (1)

drsmithy (35869) | more than 4 years ago | (#28629075)

Exactly, I need a good core product that I'll evolve over time.

Why do you want to "evolve" it (by which I'm assuming you mean modify in depth rather than "configure") at all? What's missing that you need?

System monitoring isn't exactly a fresh and new field. There are numerous well-established and quite comprehensive products already out there.

Re:A more interesting question (1)

Krneki (1192201) | more than 4 years ago | (#28629143)

Mostly it's SNMP and WMI hacking.

Not all vendors respect open standards, so you have to guess how to get the info from a UPS, printer, ...

Re:A more interesting question (3, Interesting)

abigor (540274) | more than 4 years ago | (#28629275)

The big questions are:

Will your solution need to support SNMPv3?

Do the devices you talk to have published OIDs?

Do you need source code to extend it?

If yes to these, OpenNMS is a great bet.

Re:A more interesting question (1)

drsmithy (35869) | more than 4 years ago | (#28629041)

He said he was asked to "develop a new solution" [...]

From TFSummary:

Today I have changed employer and I have been asked to develop a new monitoring solution from scratch [...]

Where I come from, "from scratch" doesn't mean "configure existing solutions to my needs".

Re:A more interesting question (2, Funny)

ArsonSmith (13997) | more than 4 years ago | (#28629475)

Yeah, from scratch: first I'd develop the tools needed to mine the raw materials of silicon, iron, and other needed elements. Then I'd refine them and produce the needed components for memory, processors and storage, as well as develop the new networking, power, form factor, etc. Then start working on the boot code and a core kernel; hmm, should it be micro/macro or hybrid...? Then I'd start working on interface tools or user space or something along those lines. Once I got this part done I'd start gathering information on what needed to be monitored. Then develop the required protocols to monitor those things.

On second thought maybe it'd be easier to not start from scratch and build on the tools others have created as a basis and customize from there.

Re:A more interesting question (0)

Anonymous Coward | more than 4 years ago | (#28628955)

Limitations? There is no need for limitations; sometimes you do something just because you can...

Re:A more interesting question (1)

dave562 (969951) | more than 4 years ago | (#28629167)

The article mentions that he is starting a job at a new employer. The systems that he listed are systems that he has experience with. It seems to me that he's open to the possibility that, despite having had experience with numerous systems, there might be a better way to do things than he has done them in the past.

Bash monitoring (0, Offtopic)

Foofoobar (318279) | more than 4 years ago | (#28628695)

I built a small program that updates all bash profiles with timestamps, compares changes every few ticks, and saves the changes to a database where each user is given a unique ID. It associates a parent-child relationship when users su/sudo, to show heightened privileges. Very useful, as sysadmins are known to wipe their bash histories, and this kept a centralized history with relationships.

Re:Bash monitoring (0)

Anonymous Coward | more than 4 years ago | (#28628917)

export HISTFILE=/dev/null

Re:Bash monitoring (1)

karnal (22275) | more than 4 years ago | (#28629113)

Sounds like you need a centralized syslog server. It could do more for you than just log commands.....

Before I get flamed... (4, Interesting)

jwilki1 (463599) | more than 4 years ago | (#28628701)

I am going through this right now and am using or have used all the above-mentioned solutions. We are leaning towards System Center Operations Manager. http://www.microsoft.com/systemcenter/operationsmanager/en/us/default.aspx If you had told me 6 months ago that it would be the way to go, I would have said over my dead body, but it has come a very long way in terms of usability and ease of setup.

Re:Before I get flamed... (1)

jmulvey (233344) | more than 4 years ago | (#28628905)

I foresee myself in your shoes in the next few months. Application-level awareness of our key Microsoft applications (Exchange, MOSS, AD, etc.) is very high on our need list, and so SCOM is a natural best-of-breed pick. However, I *really* want a single integrated solution that also covers our unix/linux systems. Is unix/linux monitoring part of your requirements? If so, could you briefly describe the capabilities (and requirements on the monitored systems) that SCOM currently has in this regard?

Re:Before I get flamed... (2, Informative)

Anonymous Coward | more than 4 years ago | (#28629091)

SCOM R2 integrates native unix and linux agent [microsoft.com], supported systems are :

HP-UX 11i v2 and v3 (PA-RISC and IA64)
Sun Solaris 8 and 9 (SPARC) and Solaris 10 (SPARC and x86)
Red Hat Enterprise Linux 4 (x86/x64) and 5 (x86/x64) Server
Novell SUSE Linux Enterprise Server 9 (x86) and 10 SP1 (x86/x64)
IBM AIX v5.3 and v6.1

For application awareness, you can check bridgeways management packs [bridgeways.ca].

Splunk It! (0)

Anonymous Coward | more than 4 years ago | (#28628709)

There's only one way! Splunk it!

Zabbix (5, Informative)

ender- (42944) | more than 4 years ago | (#28628715)

You can also look into Zabbix [zabbix.com]. It's open source, and has Enterprise support available. I haven't used it yet, but as soon as I have a spare moment to breathe I intend to test it out for use in my environment.

Re:Zabbix (5, Informative)

TooMuchToDo (882796) | more than 4 years ago | (#28628793)

We use Zabbix in a production environment with 2500+ servers and tens of thousands of monitored items. The database will get big (currently at 150GB) but everything works like a champ, monitored at 1min intervals.

Re:Zabbix (1)

Achromatic1978 (916097) | more than 4 years ago | (#28628843)

You manage 2500 servers but a 150GB database is "big"? *confused*

Re:Zabbix (1)

TooMuchToDo (882796) | more than 4 years ago | (#28628891)

For a monitoring database, yes. We really have no need for the historical data.

No reliability issues? (1)

Colin Smith (2679) | more than 4 years ago | (#28629365)

Which revision?

I tried it for a couple of months, and rather liked it, but it'd simply stop monitoring stuff, triggers wouldn't fire reliably, etc.

Re:Zabbix (0)

Anonymous Coward | more than 4 years ago | (#28629501)

We use Zabbix in a production environment with 2500+ servers and tens of thousands of monitored items. The database will get big (currently at 150GB) but everything works like a champ, monitored at 1min intervals.

I second zabbix!

Slightly off topic:

My zabbix DB (MySQL) is only at 25G, but it does seem to grow quite fast. Out of curiosity, how do you back your DB up? If I dump the DB, zabbix stalls out, so that's not an option. I figure replication will work, but wanted to know what you're doing with a DB much larger than mine.


Re:Zabbix (0)

Anonymous Coward | more than 4 years ago | (#28629107)

We use Zabbix to monitor 6768 hosts with 1-8 checks per host at a frequency of 30 sec to once a day. I just can't even begin to say how much we like it. It is by far the best for our environment!

GKrellM (5, Funny)

Areyoukiddingme (1289470) | more than 4 years ago | (#28628729)

You can pry my GKrellM from my cold, dead hands!

Yeah, for 5000 devices, the displays start to take up quite a bit of screen space, but that's what video walls are for!


Re:GKrellM (1)

funkatron (912521) | more than 4 years ago | (#28628861)

Interesting! Do the mods not know what GKrellM is?

Re:GKrellM (1)

Areyoukiddingme (1289470) | more than 4 years ago | (#28629001)

To be fair, one of the two mods knew it was +11 Funny... er I mean +1 Funny...

One of today's articles has a thread deploring people who read the article, the summary, the title, or the posts before posting. So in my defense, I only read 4 words of the title and posted, so I saw Would Want Monitoring System. Naturally I thought of GKrellM.

Now if I had read 4 different words, I'd have thought it was spam and deleted it. "What You Want Large"

Re:GKrellM (1)

gmuslera (3436) | more than 4 years ago | (#28628961)

There are several desktop applets that show what happens in more "serious" (or at least massive) monitoring solutions. Nagstamon [sourceforge.net] shows Nagios alarms (and lets you ssh/vnc or even see Nagios reports on problematic hosts right there); ZApplet [sourceforge.net] shows Zenoss alarms/warnings too.

Re:GKrellM (1)

Areyoukiddingme (1289470) | more than 4 years ago | (#28629039)

So what you're saying is, GKrellM needs plugins for Zenoss and Nagios and whatever else?

Damnit, now my +1 Funny is starting to sound practical. Quick, somebody add another +1 Funny! This has to be stopped!

Nagios, Munin, GKrellm (0)

Anonymous Coward | more than 4 years ago | (#28628731)

I've played with quite a few in the past. For your application, I'd stick with Nagios... the new version has plenty of scalability features. It has a pretty steep learning curve and just about all the configuration is text-based, but I've always found it well worth the investment in time. I currently use it to monitor services on only a couple hundred devices, and there are plenty of plugins to make it more useful. It's not great for creating and visualizing large 2D and 3D maps, but it has all the necessary hooks for it and with a bit of scripting you should be able to generate more useful views and reports of your farm.

Corps seem to buy into the commercial HP Openview a lot, but no one I've talked to that uses it seems to like it.

On a few of my servers, I also like to run Munin... it tracks and displays a little bit more information than Nagios, such as graphs of sensors, uptime, UPS stats, etc. It's come in handy on several occasions when Nagios had simply shown me that "the server went down", but the information from Munin showed that "the server room temperature started climbing up to 90F starting at 2AM".

For real-time monitoring, I really like GKrellm, which has a server/client mode of operation. It wouldn't be practical to have it up all the time, but it would be sweet to set up the daemon and have a link from Nagios launch a gkrellm client to a remote server, where you can see the effects of anything you do in real time (rather than waiting for Nagios to refresh in 5-10 minutes).

The Dangers of averaging (5, Insightful)

Anonymous Coward | more than 4 years ago | (#28628753)

MRTG does it right... most of the others do it wrong.
When rolling up a day's worth of data (averaging), you lose the peak information on most monitoring systems.
So your 380Mbps peak that you had an hour ago is fine on today's graph.
But tomorrow, when you look at "yesterday's" graph... the peak is down to 100Mbps.
And next week, when you look at "last week's" graph... there's a little 50Mbps peak.

Damnit... I want to keep information on my peaks for capacity planning!
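
The effect is easy to reproduce. The sketch below uses made-up five-minute samples and a hypothetical `consolidate` helper to roll the same series up twice, once by averaging and once by keeping the maximum of each bucket.

```python
from typing import Callable, List

def consolidate(samples: List[float], bucket: int,
                fn: Callable[[List[float]], float]) -> List[float]:
    """Collapse every `bucket` consecutive samples into one value using fn."""
    return [fn(samples[i:i + bucket]) for i in range(0, len(samples), bucket)]

def mean(xs: List[float]) -> float:
    return sum(xs) / len(xs)

# Made-up traffic in Mbps: mostly quiet, with one short 380 Mbps burst.
raw = [20.0] * 11 + [380.0] + [20.0] * 12   # 24 five-minute samples

avg_rollup = consolidate(raw, 12, mean)   # two hourly averages
max_rollup = consolidate(raw, 12, max)    # two hourly peaks

print("hourly average:", avg_rollup)
print("hourly maximum:", max_rollup)
```

With this data the averaged series shows the 380 Mbps burst as 50 Mbps, while the max series still shows 380. Storing a maximum alongside the average costs one extra value per bucket and keeps the number that capacity planning actually needs.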

I would like (1, Funny)

Anonymous Coward | more than 4 years ago | (#28628755)

Twitter client, facebook integration, google maps mashup.
And a pony.


Zenoss (4, Informative)

KerberosKing (801657) | more than 4 years ago | (#28628757)

I was really impressed by Zenoss [zenoss.com], which has all the slick features that cost the earth from vendors like HP for Openview. You get automatic discovery, CMDB inventory, availability monitoring, alerting, and performance graphs all in a web portal.

You get open source, commercial support, and a good community of users and plug-in developers. The best of both worlds IMHO.

Re:Zenoss (2, Interesting)

NuclearRampage (830297) | more than 4 years ago | (#28628879)

A little tough to setup new SNMP devices, I thought, but overall a great product. Even the free version gets you quite far.

Re:Zenoss (5, Informative)

rawler (1005089) | more than 4 years ago | (#28629283)

ZenOSS may be great, but a word of warning. We've had 3 failed attempts at implementing it in our shop. What we tried to achieve was mainly host and service-monitoring, with some slight network-monitoring on the side. Nothing fancy, just some 20 hosts, maybe 30 network-devices, and a variety of services.

One of the major parts we've found missing in most open-source solutions was proper event management (receiving syslog + SNMP traps, and applying some intelligence to them regarding flow control, dispatching, archival and that stuff). On paper, and throughout the initial evaluation, ZenOSS is one of the best open source tools to do this.

However, during our three attempts to get it up and running, we've always encountered some major obstacle (usually after a while of operation), forcing us to start all over from scratch. The problems we had were always in the same category: strange and unexplainable errors, often hard to reproduce, and in general it resulted in a very flaky experience. Some of the problems were service checks showing both false positives and false negatives. In the last one, ZenOSS refused to import new SNMP MIBs, complaining about some IP address that could not be found anywhere in the config. Grepping ultimately found the IP to be present only somewhere in the opaque Zope database, where evidently it could not easily be removed, nor could we even find out exactly what the IP address was for. (It was something auto-discovered in a remote network segment out of our control, but advertised throughout the routers.)

So, while ZenOSS can do all kinds of things, and does a LOT of things really well, it's extremely complex, and not all parts rest on a solid foundation (such as all network objects living in a non-accessible Zope database that the devs themselves recommend not touching, since it may upset things more). If you plan on implementing ZenOSS, I would not go without the support, which I assume is great, since there seem to be quite a few dark pits to fall into on your own.

I don't know how come we had so many obstacles and strange problems when others seem to have a smooth ride. Maybe one explanation is what was the final nail in the coffin for ZenOSS in our deployment. When I started asking around about these problems (and ZenOSS has a really helpful community, no problems there), I realised that many users claimed to have run into problems similar to ours, but their solution was to just keep daily backups and revert to a backup when they ran into these problems. For us, the monitoring data is the basis for a lot of third-party agreements, and losing even days' worth of monitoring and logging is completely unacceptable for these reasons. We do back up everything, but only for rare disasters, and we must be able to rely on the monitoring system giving us a clear view through those disasters.

Re:Zenoss (1)

jon3k (691256) | more than 4 years ago | (#28629407)

Zenoss's commercial support prices [zenoss.com] are hilarious, I mean, literally, hilarious. The CHEAPEST support (silver) is $100 per managed host (including virtualized hosts) most expensive (platinum) is $180 per node. So your 5,000 hosts would be $500,000-$900,000 per year in support.

Yes. Seriously.
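
The arithmetic behind those figures, as a small sketch. It assumes a flat per-host price with no volume discount, which is how the comment reads; the function name is illustrative.

```python
def annual_support_cost(hosts: int, price_per_host: int) -> int:
    """Yearly support bill when pricing is a flat fee per managed host."""
    return hosts * price_per_host

HOSTS = 5000
for tier, price in (("silver", 100), ("platinum", 180)):
    print(f"{tier:>8}: ${annual_support_cost(HOSTS, price):,} per year")
```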

The other problem I have with Zenoss is that the reporting is basically nonexistent. It may sound like I'm being hyper-critical, but it's only because I've looked at Zenoss and I so wanted it to be the NMS for me (I particularly like the fact that it's both open source and written in Python), but at this point I just don't think it's going to work.

We use What's Up Gold from ipswitch right now, but we're only monitoring a few hundred hosts. It's slow, runs on windows, requires ms sql, but it's surprisingly full featured and gets the job done I suppose. Oh and its $900.

Spiceworks? (1)

BagOBones (574735) | more than 4 years ago | (#28628819)

http://www.spiceworks.com/ [spiceworks.com]
Not sure how far it scales but I have played with it on some small installations, very easy to manage.

I have used Cacti but never felt it was mature or robust enough for very large environments.

SCOM (System Center Operations Manager) is what we are deploying now for our enterprise; however, I would be afraid to manage it on my own, as it is a large system unto itself, yet very powerful.

A couple of other options (3, Informative)

AFresh1 (1585149) | more than 4 years ago | (#28628827)

I use Nagios and some custom rolled scripts myself.

For some other options, Nagios has now been forked, so if that is "close" to what you want, you may want to contribute to Icinga [icinga.org].

Reconnoiter [omniti.com] also looked pretty kewl; they haven't released anything yet, but it looks like they are planning it to be very scalable.

OpsView (1)

imemyself (757318) | more than 4 years ago | (#28628833)

I've really been impressed with OpsView. Can't say how well it scales on huge networks (but there are options for having multiple servers). It's based on Nagios, but it's a lot less of a pain to configure and has a pretty good web interface. The only thing I don't really like is its graphing functionality. I use Cacti for monitoring bandwidth/server load/etc. But for availability checking OpsView does a fantastic job. I'm using it to monitor maybe twenty devices, including Linux and Windows servers, and HP/Cisco network devices. I tried Zenoss as well, but it seemed awkward to work with. For instance, with OpsView/Nagios it's easy to add a check to verify that a DNS server is correctly resolving a record in a particular zone. I remember it was going to be a pain to monitor some of the things I wanted to with Zenoss. Maybe I'm biased because I used plain old Nagios for a while before I tried OpsView and Zenoss.
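
For a sense of what such a DNS check involves, here is a minimal sketch in the shape of a Nagios-style plugin: a status line plus an exit code (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). It is not the check that ships with Nagios or OpsView. The function names are hypothetical, and the default resolver goes through the operating system rather than a specific DNS server, which a real check would need a resolver library for.

```python
import socket
from typing import Callable, List, Tuple

OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3   # standard Nagios plugin exit codes

def system_resolve(name: str) -> List[str]:
    """Resolve via the OS resolver. Querying one specific DNS server would
    need a resolver library instead; this stand-in keeps the sketch stdlib-only."""
    return socket.gethostbyname_ex(name)[2]

def check_dns(name: str, expected_ip: str,
              resolve: Callable[[str], List[str]] = system_resolve) -> Tuple[int, str]:
    """Return (exit_code, status_line) in the shape Nagios expects from a plugin."""
    try:
        addresses = resolve(name)
    except OSError as exc:   # socket.gaierror is a subclass of OSError
        return CRITICAL, f"DNS CRITICAL - {name} did not resolve ({exc})"
    if expected_ip in addresses:
        return OK, f"DNS OK - {name} resolves to {expected_ip}"
    return CRITICAL, f"DNS CRITICAL - {name} returned {addresses}, expected {expected_ip}"

def main(argv: List[str]) -> int:
    """Entry point: argv[1] is the record name, argv[2] the expected address."""
    code, message = check_dns(argv[1], argv[2])
    print(message)
    return code
```

Run as a script, this would end with `sys.exit(main(sys.argv))` so the monitoring server can read the exit code. Passing the resolver in as a parameter also makes the check testable without touching the network.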

GAS/Plexos (0)

Anonymous Coward | more than 4 years ago | (#28628839)

Talk to these guys http://netfuel.com. Excellent client server monitoring, started out as a trading app monitoring tool and grew. Scales to thousands of nodes and has lots of options including several API's for integration.

Nagios Might Work (2, Interesting)

hax4bux (209237) | more than 4 years ago | (#28628881)

I spent last year converting a shop from OpenView to Nagios. They were in the same neighborhood as you (~5000 devices).

If you do not like the Nagios UI, you could create something else. The native Nagios UI is CGI based and implemented in C. The documentation is good and the sources are well commented.

The hardest decision about Nagios is how to implement the monitoring. I went w/SNMP (polling, not traps) for the most part. Sorting out all the Nagios plugins is something of a chore and many of them seem incomplete and abandoned.

MRTG also integrates w/Nagios, which can be useful.

Good luck.

ZenOSS all the way (5, Interesting)

Midnight Warrior (32619) | more than 4 years ago | (#28628913)

We use ZenOSS [zenoss.com] exclusively at work and have enjoyed every minute of it. Pros include:
  • 2D map with status of all nodes or submaps, organized by network
  • Application monitoring, with more advanced maps available for purchase (Oracle, JBoss, Cisco) for those things you already paid a lot of money for
  • Performance monitoring via SNMP or other data sources using RRDtool internally which includes graphs linked to each other during zoom in/out or panning
  • Nagios plugins already do some of the heavy lifting
  • Built-in support for watching Windows servers (any metric accessible via WMI)
  • Access control using at least LDAP and Active Directory
  • Secondary data collectors for those networks which are too big for just one central source
  • Highly customizable through Python
  • It has so, so much more than pathetic commercial solutions like OpenView

Cons include:
  • You have to keep your eye on the back end database
  • It still takes a long, long time to tune it to remove noise events
  • If you don't know Python, it can be tough in a few places
  • Proper support is not cheap

Re:ZenOSS all the way (1)

glsiii (247900) | more than 4 years ago | (#28628983)

There are plenty of tweaks that help speed things up greatly, including disabling the section of code that calculates overall system availability (if that's not important to you)... we dump all of our syslog into ZenOSS, so the tables got quite large before the retention rules kicked in.

FreeNATS (0)

Anonymous Coward | more than 4 years ago | (#28628923)

Some basic functionality, maybe not at the development level you're looking for yet.

5000 seat network? (0)

Anonymous Coward | more than 4 years ago | (#28628929)

Shurely Shome cash is available to actually PAY somebody to sort this out, rather than ask for free help on /.? Hic - burp.

The mistake (3, Insightful)

vlm (69642) | more than 4 years ago | (#28628949)

The mistake is trying to monitor thousands of devices on a 2-D map. It'll look pretty to the suits, but be useless for the users. Nothing but endless slow clicky clicky clicky.

Give them a text screen of what's currently down ... that'll work.

Re:The mistake (1)

Krneki (1192201) | more than 4 years ago | (#28629205)

No, a text screen doesn't give the idea to the help desk of what zones are affected.

I want the help desk to know what the problem is before our clients call us.

Re:The mistake (0)

Anonymous Coward | more than 4 years ago | (#28629299)

First, nothing you can do will mean that the helpdesk has any idea.

Second, management won't like you letting the helpdesk know when things are broken. Telling customers that things are broken costs money.

MOM (1)

aquilah (1594039) | more than 4 years ago | (#28628973)

I've only used MOM, but for what it's worth the diagramming capabilities are much improved with the new Visio plugin. Previously you could export your diagrams from OpsMgr to Visio, but with the new plugin the Visio diagrams reflect live health state. You can also create whatever diagram you want in Visio and then tie it to monitored objects living in OpsMgr (for example, rack diagrams).

Bling (0)

Anonymous Coward | more than 4 years ago | (#28628987)

"The final product must be very easy to understand as it will be used also by help support to diagnose problems during the night".

If you can develop a monitoring solution that night time support personnel can understand to diagnose a problem quickly and properly, I am going to nominate you for a peace prize. BTW, give it lots of rapidly updating graphs and eye candy, you know, bling the sh1t out of it. Management types love that.

Pay $20k to stat (0)

Anonymous Coward | more than 4 years ago | (#28628989)

For the hassle of developing your own solution, you could just cough up the $20k for a statseeker box (hardware and software included in that price too) and then you can monitor up to 200,000 interfaces with ease. We poll 186,000 interfaces twice every 60 seconds and we have 2 years of data taking up only 17 gig of space, and the box is still speedy. I've used nagios, cacti and the like for years and they are great for smaller deployments, but statseeker smokes them in almost every respect. I know they have upgraded options if you need more than 200K interfaces, but so far we haven't hit the limit.

Intellipool Network Monitor (0)

Anonymous Coward | more than 4 years ago | (#28629015)

Have you checked out Intellipool Network Monitor? They have a new version coming out (version 4, I believe). It has maps with drill-down and all that stuff, and the distributed edition can spread the workload over several gateways, so monitoring large-scale networks is not an issue.


Roll your own... (2, Interesting)

hofmny (1517499) | more than 4 years ago | (#28629033)

I have looked at Cacti, Nagios, and a few others, but I think rolling your own is easy enough and gives you the best flexibility. You could also use Nagios, or others, for example, and simply pull the results into your own system.

I built and managed a software system for my previous employer called the SMS (Server Management System). It basically tracked 50 of our web servers, database servers, and Endeca (full text search) farms at data centers spread around the country. It was pretty simple.

The system did push and pull operations. First, the system was built in PHP.
In order to push commands to the servers I used the PEAR SSH2 class for communication when it became stable. Another option (and what I did back in 2003) was to use exec and other command line functions in PHP in conjunction with a SETUID script (written in C) -- which gave the command line output from PHP "true" rootly powers. The problem was I had to enter a password for each server I wanted to connect to, and the PHP functions couldn't handle real time input/output, so I designed the system to work by creating an SSH2 key pair on my master monitoring server and put its public key on each of our external servers for passwordless SSH.

The pull part of the system simply had a PHP script running on a cron per server, that would deliver information about the health of the server, its running processes, etc, to the main SMS server every 5 minutes. All load activity for all servers was logged as well to MySQL. The push operations were used to update those scripts, as well as restart daemons on command, clear cache (such as after we did a database update), etc. It was a pretty robust system and really automated the functions of our company, to where we could perform a FULL Database Update to our 30 web servers simultaneously (using PHP and fork()), clear all cache, etc, in under an hour. We would then monitor the servers using the SMS's main screen, which showed real time server stats (updated every 5 minutes, or you could "force" a push operation to get the status). If we needed to roll back the update, that was a simple mouse click away too.
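
The per-server reporting script described above could look roughly like the sketch below. It is written in Python rather than the PHP the original system used, and the report fields, the 10-minute staleness rule, and the function names are all assumptions for illustration, not details of the SMS.

```python
import json
import os
import socket
import time


def collect_health(hostname=None):
    """Gather a small health snapshot for this host (hypothetical schema)."""
    load1, load5, load15 = os.getloadavg()  # available on Unix-like systems only
    return {
        "host": hostname or socket.gethostname(),
        "timestamp": int(time.time()),
        "load": {"1m": load1, "5m": load5, "15m": load15},
    }


def is_stale(report, now, max_age_seconds=600):
    """Central side: flag a host whose last report is older than two 5-minute cycles."""
    return now - report["timestamp"] > max_age_seconds


if __name__ == "__main__":
    # A cron entry would run this every 5 minutes and ship the JSON to the central server.
    print(json.dumps(collect_health()))
```

The staleness check matters as much as the report itself: with an agent that reports in on a timer, a dead server shows up only as silence, so the central side has to notice missing reports.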

I also had a hidden screen that let me run any series of commands as root on any number of servers. Everyone objected to it but I convinced my boss to let me put it in. All of our servers were a mouse click away from being "rm -rf *" 'ed. ROFL. Anyway, I hope my little story about my system helps you out, in either avoiding what I did (LOL) or by giving you ideas.

Things that go "ping"? (0, Troll)

nine-times (778537) | more than 4 years ago | (#28629073)

What would I want in a monitoring system? The first thing that pops into my head is "lots and lots of knobs." The kind where when you turn them you get a nice satisfying click. And blinking lights. Lots of switches. Things that go "ping" at regular intervals would be nice. Oh! And a nice big screen that says, "All systems nominal" all the time.

How about SolarWinds Orion? (1)

dakaix (1594051) | more than 4 years ago | (#28629097)

This doesn't seem to have already been suggested, but we use SolarWinds Orion. It's cheaper than many of the big systems, such as HP OpenView, and much simpler to use and operate.

The basic Orion package, which you can get for $2000 for up to 100 servers, will pull the usual CPU/RAM/Disk/Network statistics via SNMP. Built in is a mapping engine, that allows you to take a network map, and drop active elements onto it for live interfaces and device information. In a NOC environment, you can show this on a screen and it'll even sound an alarm when a system Alert fires through the website.

You can then bolt on additional modules, such as their Application Performance Monitor. It has ready-to-use templates for common business applications: Exchange, Apache, IIS, etc. You can also create your own, mixing SNMP, WMI and User Experience monitors. User Experience monitors, for example, allow you to actively poll HTTP/FTP/DNS/SMTP/IMAP/POP etc. services to ensure they are not only UP but responding as they should to requests.
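
The "not only UP but responding as they should" idea can be sketched independently of any product. The Python below is a generic illustration, not how Orion implements its monitors; the function names, the OK/DEGRADED/DOWN states, and the 2-second threshold are assumptions.

```python
import time
import urllib.error
import urllib.request


def fetch(url, timeout=5.0):
    """Return (status_code, body_text) for a GET request."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status, resp.read().decode("utf-8", errors="replace")
    except urllib.error.HTTPError as err:
        # The server answered, but with a 4xx/5xx status.
        return err.code, ""


def check_http(url, expected_text, max_seconds=2.0, fetcher=fetch):
    """Classify a service as OK, DEGRADED, or DOWN from one active poll."""
    started = time.monotonic()
    try:
        status, body = fetcher(url)
    except OSError as err:  # URLError and socket timeouts are OSError subclasses
        return "DOWN", f"no response: {err}"
    elapsed = time.monotonic() - started
    if status != 200:
        return "DOWN", f"HTTP {status}"
    if expected_text not in body:
        return "DEGRADED", "response missing expected text"
    if elapsed > max_seconds:
        return "DEGRADED", f"slow response: {elapsed:.2f}s"
    return "OK", f"{elapsed:.2f}s"
```

Passing the fetcher in as a parameter keeps the classification logic testable without a network, and the same shape works for other protocols by swapping the fetcher.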

For scaling, you can tack on Additional Pollers to spread polling load across them. You can also use hot-standby pollers to resume the work of a failed poller.

Just my 2 cents, and not a corporate plug - just a very content user!

Pandora FMS (1)

guruevi (827432) | more than 4 years ago | (#28629123)

As one of the core devs and a large user, I can tell you it scales well, is easy to develop for, and has a lot of prefab. The system does everything you're asking for. Let me know if you need help or paid support.

not sure if this is helpful, but... (1)

sneakyimp (1161443) | more than 4 years ago | (#28629129)

I'm a software developer and, sadly, my knowledge of hardware systems isn't always what it should be. When I write an application to run on a server and it starts to get slow, I want to know where the bottleneck is. Is my application CPU-bound? I/O-bound? Memory-bound? Do I need more memory? Faster storage? More cores or faster processor speed? Is it the network that's causing the problem? I can usually figure this out using various linux command-line programs like netstat and top and all that, but I would sure love a big fat GUI to make it more graphic. I found something like this once and couldn't remember what it was called. It required all kinds of diagnostic utilities be manually installed.

Ideally, you could view a machine and get some quick idea of where the bottlenecks lie. Maybe that's asking a bit much, but the closer you can get to a single control panel where I could see all my machines in a list with a status indicator and then drill down machine-by-machine, the happier I would be. It would be even cooler if the machines could contact me when they experience times of overload so that I could get a feel for when the trying times are so I can watch them more closely. I'm imagining a daemon that runs on each server and an admin gui that can speak to that daemon somehow. It would also be nice to have hooks so that I can easily report performance profiling information to the GUI from within my application.

The Activity Monitor utility found on Macs is pretty close to what I'm imagining.

JMX Support (1)

Cyberax (705495) | more than 4 years ago | (#28629153)

What I'd like to see is a good monitoring support for JMX-capable Java services.

It'd be nice to set up an alarm based on time spent in garbage collector in a JVM running our application, for example.
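
One way such an alarm could work: the JVM's GarbageCollectorMXBean exposes a cumulative CollectionTime in milliseconds, so two samples taken some interval apart give the fraction of wall-clock time spent in GC. The sketch below covers only the arithmetic, assuming the samples have already been fetched over JMX by some collector; the function names and the 10% threshold are hypothetical.

```python
def gc_fraction(prev, curr):
    """Fraction of wall-clock time spent in GC between two samples.

    Each sample is (wall_clock_ms, cumulative_gc_ms).
    """
    wall = curr[0] - prev[0]
    if wall <= 0:
        raise ValueError("samples must be in increasing time order")
    # A JVM restart resets the counter; treat a negative delta as zero.
    gc = max(curr[1] - prev[1], 0)
    return gc / wall


def gc_alarm(prev, curr, threshold=0.10):
    """True when more than `threshold` of the interval was spent collecting."""
    return gc_fraction(prev, curr) > threshold
```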

Re:JMX Support (1)

Intelopment (554080) | more than 4 years ago | (#28629257)

Cyberax, Check out dynaTrace. They have just what you're talking about. Deep dive into the JVM (or CLR).

Nagios, Munin, GKrellm (1)

rwa2 (4391) | more than 4 years ago | (#28629185)

I think Nagios should provide a good start.. they've recently added a lot of scalability features. Though it has a high learning curve and all of its configuration is done in text, I've always found it worth the time and effort. I currently use it to monitor services on a couple hundred machines.

Munin is a bit simpler, but I like the graphs it provides which occasionally are more useful than the data Nagios provides. In some cases, Nagios might tell me that a server went down, but I'd look at Munin and see that the server room temperature spiked to 90F before then. Also it's neat to see the uptime graphs for the year.

While it might not be practical to use GKrellm all the time, I'd find it useful for real-time feedback. You might set something up where you can launch a gkrellm client to a server of interest while you're working on it. Then you can see the effects of things you do without waiting for Nagios to refresh in 5-10 minutes.

Hobbit+Cacti+Smokeping (1)

adary (1255614) | more than 4 years ago | (#28629193)

That is the solution that i have implemented for our little environment that consists of about 50-ish solaris (8 and 10) servers, 80-ish windows servers, about 500 linux servers, and 40-odd cisco switches. Hobbit handles all host monitoring: availability, services, and a bunch of custom scripts written for it to check various aspects of our HPC grid, plus the SMS sending through an old nokia connected to the comm port of a solaris box. Smokeping is there to check latency, and cacti primarily for network traffic volume, and a custom module for FlexLM licenses. Works like a charm

The Dude (0)

Anonymous Coward | more than 4 years ago | (#28629217)

The Dude is what we have been toying with lately.


SNMPc (1)

Stile 65 (722451) | more than 4 years ago | (#28629227)

It's *not* open-source, but it IS inexpensive. When I worked at a NOC, we used it to monitor hundreds of routers, switches, mainframes, Tandem systems, UNIX boxes, etc. It takes SNMP traps and displays them graphically on a 2D map, and the 2D map is very nicely implemented. You can have your top level view made up only of groups of devices, so if a group goes red you double-click that group to view its members and see which device actually has the error. IIRC, you can nest groups, so it ends up being a fairly scalable solution when you talk about screen space.
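
The nested-group idea is easy to model: a group takes the worst status of anything inside it, and drilling down means following the red path. A rough Python sketch, with the data layout (nested dicts, status strings) and all names made up for illustration rather than taken from SNMPc:

```python
SEVERITY = {"green": 0, "yellow": 1, "red": 2}


def rollup(node):
    """Worst status under a node. A device is a status string; a group is a dict."""
    if isinstance(node, str):
        return node
    statuses = (rollup(child) for child in node.values())
    return max(statuses, key=SEVERITY.__getitem__, default="green")


def failing_paths(node, path=()):
    """Paths from the top-level view down to every red device."""
    if isinstance(node, str):
        return [path] if node == "red" else []
    found = []
    for name, child in node.items():
        found.extend(failing_paths(child, path + (name,)))
    return found


network = {
    "east": {"core": {"sw1": "green", "sw2": "red"}, "edge": {"r1": "green"}},
    "west": {"r2": "yellow"},
}
```

Here the top-level view only needs to draw two icons, "east" and "west", however many devices sit underneath them.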

I use Nagios but also recommend OpenNMS (1)

WML MUNSON (895262) | more than 4 years ago | (#28629295)

I use Nagios, but on a smaller scale than what you describe. I love the system, but I would imagine it being difficult to maintain on a larger scale. Nagios itself requires manual configuration unless you use a separate front-end like Centreon, which is also far from perfect.

A friend of mine has been toying with OpenNMS for the last few months, and he's pretty happy with it although he reports that it's still got some minor issues that need to be worked out. It's FCAPS compliant, and I get the impression that it might be the better option for handling a large installation. There's a new version scheduled for release soon, so we'll see what that brings to the table.

There's also recently been an announcement of a Nagios fork, scheduled for release sometime around October. I forget the site or project name but I'm sure a bit of Googling will locate their site for you.

Wikipedia chart (from hell?) and reading rec (1)

vevel (1366705) | more than 4 years ago | (#28629313)

This sounds like the perfect opportunity to harness the power of app partisans to fix the wikipedia article comparing monitoring software. See http://en.wikipedia.org/wiki/Comparison_of_network_monitoring_systems [wikipedia.org] . Some good info there. And probably bad info. But certainly has a good list of applications. Also, if you like nagios (and he seems to me to be fair to a lot of packages, including ossim), you might check out some of David Josephsen's articles (or Nagios book), etc.. His site is http://www.skeptech.org/ [skeptech.org] . A decent design article is here -- Best Practices for Designing a Nagios Monitoring System -- http://www.informit.com/articles/printerfriendly.aspx?p=705685 [informit.com] .

I Hate War Rooms (4, Interesting)

afabbro (33948) | more than 4 years ago | (#28629355)

I really don't like the "War Room" video wall concept. I suspect such walls are made to look cool rather than to monitor.

What you want in large-scale monitoring is:

  • The ability to map complex relationships. I don't want 50 alerts that I can't reach host X, host Y, etc. I want one alert that I can't reach router A. Even better, I want to map things so that I can say "end user application XYZ is not accessible in Kansas due to X being down".
  • I want my monitoring solution to understand HA and service degradation. I want programmable rules about what happens when X is down or Y is down.
  • I want many options for escalation. If X doesn't acknowledge, try Y after 15 mins, etc.
  • I don't ever, ever want a pager to explode or be flooded. A problem should be noticed once and tracked. There should be no pager blizzards.
  • Of course, I don't want this thing relying on my mail system for paging because, of course, my mail system could go down. An ability to dial out if the mail system is down would be nice.
  • I want agents, hooks, interfaces, third-party add-ons, and every possible way of tying something into the monitoring system. I don't want dumb limitations like "you can only get an exit code from the OS and it acts on that" or something. For big monitoring, it's almost mandatory that some kind of API for agents is exposed.
  • I want "I'm working on it, stop paging" blackouts. I want to be reminded to lift them.
  • I want it to tie into my change-management system. If I open a ticket and say that server X is down for 2 hours on this date, I don't want to have to remember to black it out.
  • I want reports. I don't care about silly little charts and graphs, but a history of everything that has ever gone wrong with device Y would be nice.
  • I want more info on my page-receiving device than just "HOST X IS DOWN". I want context so I can decide if I have to drop everything immediately.
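
The first item in the list (one alert for router A instead of 50 alerts for the hosts behind it) comes down to walking a dependency map. A minimal Python sketch, assuming each node has at most one upstream parent and the map contains no cycles; the names are hypothetical:

```python
def root_causes(parents, down):
    """Return the down nodes that have no down node anywhere upstream.

    parents: dict mapping node -> the upstream node it depends on (None at the top).
    down: set of nodes currently failing their checks.
    Assumes the dependency map is a tree (no cycles).
    """
    causes = set()
    for node in down:
        upstream = parents.get(node)
        masked = False
        while upstream is not None:
            if upstream in down:
                masked = True  # an upstream failure already explains this one
                break
            upstream = parents.get(upstream)
        if not masked:
            causes.add(node)
    return causes


parents = {
    "router-a": None,
    "host-x": "router-a",
    "host-y": "router-a",
    "router-b": None,
    "host-z": "router-b",
}
```

With router-a, host-x and host-y all failing, only router-a is reported; host-z failing on its own is reported as itself.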

Etcetera. These are some of the things that make sane large monitoring systems. I don't think any open source product has all of them, alas.
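
The escalation and "no pager blizzards" items in the list above can be sketched together: repeated reports of the same problem fold into one tracked incident, and a new page goes out only when the next escalation step is reached without an acknowledgement. The class names and the 15-minute step below are assumptions for illustration, not any product's behavior.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    opened_at: float           # time of the first report, in seconds
    acknowledged: bool = False
    occurrences: int = 0       # how many times the problem was reported
    last_level_paged: int = -1


class Pager:
    def __init__(self, chain, step_seconds=15 * 60):
        self.chain = chain              # contacts in escalation order
        self.step_seconds = step_seconds
        self.incidents = {}

    def report(self, key, now):
        """Record a failure; return who to page, or None if no new page is due."""
        incident = self.incidents.setdefault(key, Incident(opened_at=now))
        incident.occurrences += 1
        if incident.acknowledged:
            return None
        elapsed_steps = int((now - incident.opened_at) // self.step_seconds)
        level = min(elapsed_steps, len(self.chain) - 1)
        if level <= incident.last_level_paged:
            return None  # this level was already paged; track, don't flood
        incident.last_level_paged = level
        return self.chain[level]

    def acknowledge(self, key):
        if key in self.incidents:
            self.incidents[key].acknowledged = True
```

A check that fails every minute for an hour then produces one page per escalation level rather than sixty, while the occurrence count keeps the history for later reporting.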

Hobbit (1)

Rementis (656260) | more than 4 years ago | (#28629375)

Take a close look at XYMon, previously called Hobbit. Easy to use, tons of plugins (Big Brother compatible)... Really nice overall: easy-to-understand web-based interface, alerting, graphing, etc.

Don't be like Tivoli, OpenView, etc (2, Insightful)

duffbeer703 (177751) | more than 4 years ago | (#28629399)

Focus on usability and rapid deployment rather than wide-ranging featuresets that sit on the shelf for a decade. Nearly all products in this space really, really suck.

Intermapper (1)

ChiefArcher (1753) | more than 4 years ago | (#28629471)

Big fan of intermapper (www.dartware.com) ... It can use nagios plugins as well.
It's fairly cheap. We monitor about 1250 devices at the moment with it, and polling can be set all the way down to 5 seconds.
Server and Client are both in Java... so more or less it runs on any platform.

They give out 30 day demo keys.
