Software To Diagnose Faulty PC Hardware?

timothy posted about 5 years ago | from the diagnostics-are-underrated dept.

Hardware 274

Etylowy writes "Over the years I have repaired my own PC and those belonging to family and friends many, many times. While in most cases it turned out to be restoring a system after malware/the user/Windows made a mess, or simple cases of 'follow the smell of smoke and molten plastic,' there were some nasty ones where the computer mostly works. By 'mostly,' I mean: you can boot it up, it might even work for a while, but will crash way too often to blame it all on Microsoft — what do you do then? Once you strip it of any extra hardware (which, with today's motherboards that have pretty much everything integrated, might not be an option) you are left with the CPU, motherboard, graphics card, RAM and HDD. You can test the HDD, you can run memtest86+ to check the RAM, but how do you go about testing the CPU, motherboard and graphics card trio to find which is to blame? Replacing them one by one isn't really an option. Do you know of any software that would help the way memtest helps with RAM?"

OCCT (4, Interesting)

PFAK (524350) | about 5 years ago | (#29705577)

It will stress your RAM, CPU, and GPU or all at once with pretty temperature and utilization graphs (for Windows only): http://www.ocbase.com/perestroika_en/ [ocbase.com]

Re:OCCT (1)

Etylowy (1283284) | about 5 years ago | (#29705735)

As far as I can see there is no GPU testing option, but it seems like a good solution for Mobo + CPU.

Re:OCCT (3, Informative)

PFAK (524350) | about 5 years ago | (#29705787)

Did you actually install it? (or are you a typical /. reader?) It has a "GPU" option for stress testing your graphics card if you have the latest DirectX updates installed.

Re:OCCT (1)

Etylowy (1283284) | about 5 years ago | (#29705843)

It's the first time I have heard about this software, so I had a look at http://www.ocbase.com/perestroika_en/index.php?2008/03/14/23-what-is-occt [ocbase.com] - and all it says is "OCCT (stands for "OverClock Checking Tool) is a CPU stability testing program, developped by myself." - not a single word about GPU on the whole "what is occt" page. Maybe I am not paranoid enough to assume that the developer is a liar and check if he hadn't hidden like 1/3 of his software's functions ;-)

Re:OCCT (2, Informative)

adamstew (909658) | about 5 years ago | (#29705949)

Many people overclock their GPUs too, so it would make sense that an overclocking stability tool would stress that as well.

Overheat (4, Informative)

gd2shoe (747932) | about 5 years ago | (#29705793)

That's a marginal idea at best, but a common one.

While the technique of blasting a processing unit to see how it behaves at maximum temperature will sometimes find a faulty unit, many faults are not temperature related, and will not show up on this test. It's fine that you brought it up here, but something that both heats the CPU/GPU and tries to test as many pathways / as much of the instruction set as possible would be far more useful. (cf memtest86+ for RAM)

PSU (5, Informative)

gd2shoe (747932) | about 5 years ago | (#29705809)

Oh, and don't forget to check the PSU. When it acts up, it will often appear to be a hardware fault somewhere else in the machine. (often RAM, but can be MB, CPU, GPU...)

This certainly doesn't answer the poster's question, but it is related and important.

Re:OCCT (1)

piero.grimo (1652185) | about 5 years ago | (#29705877)

It will stress your RAM, CPU, and GPU or all at once

So how does it help identify which one is to blame?

Re:OCCT (2, Informative)

Narpak (961733) | about 5 years ago | (#29705959)

More than once I have seen the on-board Realtek sound chip cause the computer to crash or slow down significantly. Disabling it and putting in a budget sound card fixed it. So I would suggest that disabling various on-board components in turn could uncover the culprit. That being said, identifying hardware problems has always, for me, been a bit hit and/or miss.

Re:OCCT (1)

astar (203020) | about 5 years ago | (#29705973)

On your sig, English dictionaries have a lot of definitions of free, and as I understand it, none that exactly match free as in free software. That is why people who need to be precise say gratis and libre. You are playing nominalist.

Not all your fault. The current English dictionaries are probably the result of the ongoing long-term cultural deterioration. I would expect that really old English dictionaries have the meaning.

Re:OCCT (0, Offtopic)

PopeRatzo (965947) | about 5 years ago | (#29706429)

...the ongoing long-term cultural deterioration.

There is no such thing. Every culture has a small number of people for whom nothing is as good as it once was. Music "deteriorated" when The Beatles arrived, because their music wasn't as "good" as Glenn Miller or Beethoven. Painting "deteriorated" when those sloppy Impressionists stopped coloring inside the lines. Movies "deteriorated" when Godard didn't tell a story as neatly as Howard Hawks. Human communications "deteriorated" because people send email instead of writing letters. Everything was better "back in my day".

I don't believe "ongoing long-term cultural deterioration" means what you think it means, if it exists at all.

Re:OCCT (2, Informative)

JMandingo (325160) | about 5 years ago | (#29706341)

Use a can of compressed air to purge out any accumulated dust. Less dust means a cooler box, which may just bring the unit back within whatever temperature (or, by extension, power) tolerance it is pushing the envelope on. Another technique is to wiggle every cable and connector and slotted card, just to make sure nothing has come loose. Check to make sure all the fans are running whilst powered on.

Re:OCCT (0)

Anonymous Coward | about 5 years ago | (#29706363)

Fer christsakes! Just Google "pc diagnostic card"
There's lots of them out there.

what about PTS? (2, Insightful)

midol (752608) | about 5 years ago | (#29705589)

The Phoronix Test Suite is a good profiler; at least it would narrow the search. But, as you observed, once you are down to the RAM and integrated devices, what options do you really have except to toss the mobo?

Replace the integrated part (2, Informative)

gd2shoe (747932) | about 5 years ago | (#29705857)

Integrated devices can typically be replaced with PCI/PCIe devices. If an integrated network or sound card gives out, it can often be easier and less expensive to shove a new device into the case and disable the old one in the Device Manager. Still, integrated devices don't go out that often. It's more common for the MB itself to go (my experience, anecdotal).

Re:Replace the integrated part (3, Insightful)

sortius_nod (1080919) | about 5 years ago | (#29706257)

Even when they do, it's usually a sign the rest of the board is on its way out too. A device on the board not functioning can mean a number of things (MB controllers acting up, visible/non-visible corrosion in the board, blown capacitors, etc), so you can be up for a lot of weird behaviour from the board that you can't pin down.

To be honest, relying purely on a test suite to tell you what's broken will lead to disaster. Only through experience do you get the pointers toward what is actually faulty. Add to this that true diagnosis only comes from swapping out parts, and, well, test suites don't look at all like a viable option.

When I am repairing hardware about the only suite I use is memtest86+ and a decent live linux distro. You can usually pick devices that have failed with lspci, however this is not always correct. It all goes back to having test hardware & the knowledge of what certain behaviours in systems are caused by certain faults. After 15 years of working in IT with both hardware & software faults, there's only so much you can do with limited or no test hardware. Most of the time when you're diagnosing hardware faults on the phone it's an educated guess at best, the only time you truly get a decent diagnosis is when you have the machine with you and can swap parts out. Hell, we don't even use the Dell diagnostics at work due to their inability to give decent results on anything other than RAM.

Eurosoft PC Check (4, Informative)

jdb2 (800046) | about 5 years ago | (#29705611)

This is probably one of the best and most comprehensive OS agnostic boot-CD/floppy general purpose PC hardware testing and burn-in tools I've come across IMHO.

Here's its web page : http://www.eurosoft-uk.com/pc_check.htm [eurosoft-uk.com]

In any case, I recommend plugging the ATX cable into a power supply tester that presents a non-trivial load as a first step in diagnosing any PC. You'd be surprised in what ways the problems caused by out-of-spec voltages can be manifested.

jdb2

Re:Eurosoft PC Check (2, Informative)

Omnifarious (11933) | about 5 years ago | (#29705645)

I second this. I've had 2 or 3 PCs now that have begun acting very strangely only to discover that the real problem was the power supply. Replace it and the PC acts fine again.

Re:Eurosoft PC Check (2, Informative)

piero.grimo (1652185) | about 5 years ago | (#29705995)

Same here. I had persistent problems with a PC, only to discover years later that the PSU was defective (it actually blew up). I got a 450W PSU and all the bizarre symptoms have vanished.

Re:Eurosoft PC Check (0)

Anonymous Coward | about 5 years ago | (#29706047)

Not sure if Eurosoft's PC-Check is better than the PC-Doctor Service Center product that you can buy on Amazon. It looks like the PC-Doctor one has a lot more hardware testing tools.

http://www.amazon.com/PC-Doctor-Service-Center-Computer-Diagnostics/dp/B000Z88VXK

Re:Eurosoft PC Check (1)

Monolith1 (1481423) | about 5 years ago | (#29706409)

+1 for pc check

random thoughts (3, Insightful)

Anonymous Coward | about 5 years ago | (#29705613)

self-checking programs like Prime95 can be useful to test the computer more generally (if you've verified with memtest a failure here basically means cpu/chipset at fault).

Other things I've tried before (if the motherboard allows) include significantly underclocking sections of the motherboard/processor. If a specific underclock fixes the problem, you have just significantly narrowed down the list of possible failures.

There are similar programs to memtest that will check that a GPU's output conforms to what it should be, but if you just have random-crashy-badness that can be a pain to diagnose. Sometimes just running without graphics drivers for a while can help spot those problems: if the computer no longer crashes you can look a bit further away from the graphics card, as most of its capabilities won't be used.
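The "self-checking" idea mentioned above can be illustrated with a minimal sketch: run a deterministic computation many times and flag any pass whose result differs from the first. This is only an illustration of the principle, not a substitute for Prime95; the function names `workload` and `self_check` are made up for this example, and a Python loop exercises only a small slice of the CPU, so a clean result here proves very little.

```python
import hashlib

def workload(seed, rounds=2000):
    """Deterministic integer + hashing workload.

    The same seed and round count must always produce the same digest
    on healthy hardware.
    """
    h = hashlib.sha256(str(seed).encode())
    x = seed
    for _ in range(rounds):
        # 64-bit linear congruential step, purely to generate arithmetic work
        x = (x * 6364136223846793005 + 1442695040888963407) % (1 << 64)
        h.update(x.to_bytes(8, "little"))
    return h.hexdigest()

def self_check(passes=50, seed=12345):
    """Return the indices of passes whose digest differs from the first pass."""
    reference = workload(seed)
    return [i for i in range(1, passes) if workload(seed) != reference]

if __name__ == "__main__":
    bad = self_check()
    print("mismatching passes:", bad if bad else "none")
```

Any mismatch means the machine computed two different answers to the same question. As the comment notes, that only points at the CPU/chipset if the RAM has already passed memtest.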

Just replace it. (2, Informative)

lukas84 (912874) | about 5 years ago | (#29705615)

Repairing hardware makes no sense anymore. Just swap in a new machine from the pool, so the user will be happy again, call the manufacturer to send someone onsite to replace the system board, redeploy the image, and put the machine back into the pool.

At home, i usually replace the machine before it has a chance to get old and flaky.

Re:Just replace it. (1)

Trahloc (842734) | about 5 years ago | (#29705701)

That works awesome for the corporate world. But last I checked, friends and family don't have a pool to draw from, as you'd know if you had read at least the first couple of words of the summary:

"Over the years I have repaired my own, family, and friends' PCs many, many times".

I know RTFA is too much to ask on other articles but RTFS's first sentence on askslashdot can't be THAT much ... can it?

Re:Just replace it. (1)

lukas84 (912874) | about 5 years ago | (#29705765)

So they won't get a replacement machine, but it's the same thing. Call up the manufacturer, have him replace everything, and then restore from their last stable backup.

Re:Just replace it. (1)

vxvxvxvx (745287) | about 5 years ago | (#29705869)

What if he is the manufacturer? If he's building the machines for friends & family he can't simply call someone else up to replace everything. Or perhaps someone else built the machine from components. The fix is the same, replace whatever is broken, but the question is how do you determine which thing is broken. (This is a good reason not to build machines for friends and family.)

Re:Just replace it. (1)

gd2shoe (747932) | about 5 years ago | (#29705923)

You still don't seem to get it. Friends and family only rarely ask you to fix a machine that's still under warranty. More often you wind up diagnosing / replacing the broken part yourself, and sending them on their way.

(often the 1year warrantied hard drive that gives out at 13 months; people aren't going to replace their computer every year because of that.)

Re:Just replace it. (2, Interesting)

Trahloc (842734) | about 5 years ago | (#29706225)

So your solution is to pawn off the problem to someone else? Either you're an ancient tech near retirement who is bored to tears with what he does or you never really loved this stuff to begin with. Unless I take glee in the idea that a particular individual's machine is broken because I despise them, I'll help utter strangers fix their stuff just because it's *fun* to figure out a problem. Helping the other guy and gaining their gratitude is a bonus. And yes I've been doing this for long enough that it isn't something I'll "grow out of".

Re:Just replace it. (0, Flamebait)

lukas84 (912874) | about 5 years ago | (#29706339)

No, it's just that i've given up on trying to solve issues that are utterly impossible to figure out, because you're basically just guessing what the issue could be, based on your experience.

In my job, i've learned that this does not pay - fixing an out-of-warranty machine for 185 CHF per hour is _not_ something a customer will pay for - replacing the machine is cheaper and gets you a new one with 3 years of warranty.

Of course there are still friends and family, but i've stopped building machines for them from parts since i've got out of my apprenticeship. They'll expect instant and free support for every issue they have, so my recommendation is usually to get a machine from a local shop where they can annoy someone else.

The same goes for many software issues - sure, if i have a strange issue on one of my machines, i'll usually spend a few hours on trying to resolve, just to satisfy my curiosity. The same also goes for servers at work.

But if i have a non-reproducible problem on a client machine, replacing it with a swap machine and a fresh OS image immediately fixes the user's issue and costs less money.

Add to that that a lot of hardware has been replaced by laptops, where you can do very little in case of issues, since replacement parts are fuck expensive and maintenance manuals sometimes hard to come by, depending on the manufacturer.

Also, most of the client machines at work consist of very few components, and fixing out of warranty machines makes little sense - a new ThinkCentre M58 costs around 1000$ - getting a replacement mainboard for an old A51 or such costs around 200$, plus labour, and if you're unlucky the problem wasn't the board but the psu, cpu or memory, and you'll need to order more parts and invest more work.

But hey, maybe i'm just too negative about this. Maybe you can enlighten me how you can sort out these issues.

Re:Just replace it. (1)

jp10558 (748604) | about 5 years ago | (#29706385)

Yea, and basic Dell Precision Workstation T3400s are ~$650... Even less reason to deal with them out of warranty.

How to test? (3, Insightful)

girlintraining (1395911) | about 5 years ago | (#29705619)

Well... typically you find the fault by using an application which stresses one of those components far more than any other and then seeing if the failure condition you're observing occurs more often. This is just basic troubleshooting, it's not even specific to computers.

Re:How to test? (1)

Etylowy (1283284) | about 5 years ago | (#29705791)

For RAM it's fairly easy - 2-3 different data manipulation methods used by memtest and you know if there is an issue. With CPU it's WAY more tricky - example: I once had a CPU (it was some kind of old AMD, Duron I think) which crashed every single time I used LAME to encode wav to mp3. I have failed to find any other software that would crash every single time (or even often), but the system was generally less stable. As you can see sometimes it is not enough to simply stress the CPU with a generic app - that's why I was asking if there was software designed to test CPUs, just like memtest is designed to test RAM and nothing else.

Re:How to test? (1)

gd2shoe (747932) | about 5 years ago | (#29705955)

I hear you. I too have often wished for something like this. (something that would test individual pathways and as much of the instruction set as feasible, preferably all of it.) Stress testing is still a good idea, but it should be done in tandem with real testing.

Preventative Medicine - get a UPS (4, Informative)

jackchance (947926) | about 5 years ago | (#29705631)

Most home computer hardware failures come from "brownouts".

If you notice that your lights dim a little bit when your fridge compressor or AirCon comes on, that is a recipe for a computer failure. Spend $50 get a UPS [amazon.com]
Btw, i noticed that my linksys wifi router was also extremely sensitive to brownouts. It would get funked up and need to be power cycled. Plug it into a UPS , no more wifi problems either.

I learned this the hard way when i moved to an old building in the east village of NYC and had 3 motherboards/cpu fail within a 3 month period.

Re:Preventative Medicine - get a UPS (2, Insightful)

lukas84 (912874) | about 5 years ago | (#29705677)

That's not an Online UPS, so it won't protect against all grid issues. And Online UPS are expensive and noisy.

Re:Preventative Medicine - get a UPS (2, Interesting)

a09bdb811a (1453409) | about 5 years ago | (#29705777)

If you notice that your lights dim a little bit when your fridge compressor or AirCon comes on, that is a recipe for a computer failure.

Why? Doesn't the computer's PSU have enough juice in it to survive a quick dip in voltage? Besides, almost all PSUs are rated ~90-260V, so I always assumed if it dips from 230V, it won't matter.

Occasionally my lights dim but I don't seem to have had problems. I'm still waiting for my decade-old P3 to die so it can be replaced by an Atom board, but the darn thing keeps on running.

Re:Preventative Medicine - get a UPS (1, Informative)

Anonymous Coward | about 5 years ago | (#29706023)

Why? Doesn't the computer's PSU have enough juice in it to survive a quick dip in voltage?

No. Off-the-shelf computers from the big vendors tend to select the cheapest, lowest-rating power supply they can find. And since it's the cheapest the power supply vendor may additionally cut corners. A *good* power supply? A little brownout is no problem. Most PCs do not have a good power supply.

I'm still waiting for my decade-old P3 to die so it can be replaced by an Atom board, but the darn thing keeps on running.

From my experience at a computer surplus, P3s and below have been very reliable. VERY reliable. Higher-end (2.8GHz+ or so) P4s have exhibited increasing rates of blown motherboard caps and power supply failures, and the Pentium Ds we are starting to get have had VERY high failure rates.

Anyway, my burn-in method won't help you. But at the University surplus I work at, we have an automated netboot Ubuntu installer. We *could* basically ghost it on, but the net installer works out the ethernet, hard disk, CPU and RAM pretty hard -- it has actually found many machines (out of the ~10,000 a year we get through) that have no apparent blown caps (*cough* *GX270* *cough*) but are nevertheless unstable. This does not help narrow down the fault, but it does tell you whether the fault is real or whether it IS Windows and/or drivers.

Power supply -- check the voltage readings in the BIOS, and if it doesn't have any, you'll need to have a voltmeter with you. I've seen power supplies where the voltage sags; the machine will run but crash randomly. In reality, I have not checked the power supply very frequently; the tests below detect most faults.
CPUburn -- this'll exercise the CPU.
Memtest86 -- memory
If it doesn't crash with these going, then run something video-card-intensive. If it then crashes, it could be the card unstable, the mobo unstable (either not supplying the card enough voltage, or some other problem...), or a faulty power supply (sagging under the load of the video card perhaps? You did test this, right?). Or it could be drivers, of course.
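As a rough aid for the voltmeter step, the readings can be compared against rail tolerances. The sketch below assumes the commonly cited ATX figures (±5% on +3.3 V, +5 V and +12 V, ±10% on -12 V); those numbers and the helper name `out_of_spec` are assumptions for illustration, so check the documentation for your own supply.

```python
# Assumed tolerances per nominal rail voltage (commonly cited ATX figures).
TOLERANCE = {3.3: 0.05, 5.0: 0.05, 12.0: 0.05, -12.0: 0.10}

def out_of_spec(readings):
    """Return {nominal: measured} for every rail outside its tolerance.

    `readings` maps a nominal rail voltage (a key of TOLERANCE) to the
    value read from the BIOS or a voltmeter.
    """
    bad = {}
    for nominal, measured in readings.items():
        limit = abs(nominal) * TOLERANCE[nominal]
        if abs(measured - nominal) > limit:
            bad[nominal] = measured
    return bad

if __name__ == "__main__":
    # Hypothetical readings taken while the machine is under load.
    print(out_of_spec({12.0: 11.2, 5.0: 5.1, 3.3: 3.3}))
```

Readings taken at idle can look fine on a supply that sags under load, so it is worth measuring while the CPU or video card is being stressed.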

Re:Preventative Medicine - get a UPS (1, Informative)

Anonymous Coward | about 5 years ago | (#29706041)

Your observation is correct: Modern switching power supplies don't care much about voltage, as long as it's in a certain range: They simply draw more current when the voltage is lower. It's not the brown-out as such which kills the computer but the transients which go with it. Power supply quality varies. Good power supplies can bridge longer drop-outs and withstand stronger voltage spikes than others. It speaks for a PSU when the computer keeps running while a short drop out turns the TV off, for example. A PSU like that most likely doesn't need a UPS to protect it from bad mains.
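The "draws more current when the voltage is lower" point is just constant power divided by voltage. A minimal sketch, ignoring PSU efficiency and power factor (both simplifying assumptions), with a hypothetical 300 W load:

```python
def input_current(power_watts, mains_volts):
    """Current drawn by an idealized constant-power load (I = P / V)."""
    return power_watts / mains_volts

if __name__ == "__main__":
    for volts in (230, 200, 120, 100):
        amps = input_current(300, volts)
        print(f"{volts:>3} V -> {amps:.2f} A")
```

A sag from 230 V to 200 V raises the draw of this idealized load from about 1.3 A to 1.5 A, which a supply rated for a wide input range should take in stride.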

Re:Preventative Medicine - get a UPS (1)

Etylowy (1283284) | about 5 years ago | (#29705807)

Good power supply should handle it - maybe not as well as UPS but will greatly contribute to general system stability.

Re:Preventative Medicine - get a UPS (3, Informative)

The Grim Reefer2 (1195989) | about 5 years ago | (#29705813)

Most home computer hardware failures come from "brownouts".

If you notice that your lights dim a little bit when your fridge compressor or AirCon comes on, that is a recipe for a computer failure. Spend $50 get a UPS [amazon.com]
Btw, i noticed that my linksys wifi router was also extremely sensitive to brownouts. It would get funked up and need to be power cycled. Plug it into a UPS , no more wifi problems either.

I learned this the hard way when i moved to an old building in the east village of NYC and had 3 motherboards/cpu fail within a 3 month period.

What you really need in the case you describe is a good line conditioner. I didn't look at the 'UPS' you mentioned, but many in that price range are not a true UPS and will still allow for under voltage to occur, albeit for a shorter period if you're lucky.

Re:Preventative Medicine - get a UPS (1)

dbs11 (1653955) | about 5 years ago | (#29705897)

It sounds like you may have a bad neutral. In the US, we get two hots and a neutral from the electric company into a building. From each hot to neutral is 120 volts for receptacles and the like; across both hots you get 240 volts for heavy loads like stoves, dryers, and central air units. If the neutral opens up the voltage doesn't divide evenly. It will sag on the more heavily loaded hot leg and soar on the other. You notice this when you switch on heavier loads like a refrigerator or toaster. Open, or half-open neutrals are not rare. The connections outside can corrode due to aging. Old boxes, especially in a cellar or other damp location, are another culprit. We had this happen due to an incompetent electrician. He replaced the circuit panel and forgot to tighten the neutral screw. The kitchen lights got super bright when we turned on the one of the stovetop elements, and dimmed when we turned on a second. Among other things we lost the doorbell transformer, the garage door opener, and a few digital clocks. Settlement was a lot of fun. The real estate agent had brought this guy in to replace the box for a finicky buyer. The damages came out of their commission. They didn't argue much - it was their electrician, and he could've burnt down my house.
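The uneven split described above follows from the two legs ending up in series across 240 V once the neutral is gone. A minimal sketch, assuming a fully open neutral and purely resistive loads sized from their rated wattage at 120 V (real loads are messier); the function name is made up for this example:

```python
def open_neutral_voltages(watts_leg_a, watts_leg_b, leg_volts=120.0):
    """Voltage across each leg's load when the neutral is fully open.

    Each load is modeled as a fixed resistance R = V^2 / P. With no
    neutral, the two resistances sit in series across 2 * leg_volts.
    """
    r_a = leg_volts ** 2 / watts_leg_a
    r_b = leg_volts ** 2 / watts_leg_b
    total = 2 * leg_volts
    v_a = total * r_a / (r_a + r_b)
    return v_a, total - v_a

if __name__ == "__main__":
    # Hypothetical: a 1500 W appliance on leg A, 100 W of lighting on leg B.
    v_a, v_b = open_neutral_voltages(1500, 100)
    print(f"leg A: {v_a:.0f} V, leg B: {v_b:.0f} V")
```

With those hypothetical loads the heavily loaded leg drops to roughly 15 V while the lightly loaded one rises to roughly 225 V, which is the "sag on one leg, soar on the other" behaviour and explains the burnt-out transformers and clocks.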

Re:Preventative Medicine - get a UPS (1)

xaxa (988988) | about 5 years ago | (#29706085)

Interesting...

In my parents' house, turning on the 7kW electric shower briefly dims the lights, do they have a problem with the neutral? The house is in the UK, so the electricity supply to the building is 230V single phase (IANAE).

Re:Preventative Medicine - get a UPS (0)

Anonymous Coward | about 5 years ago | (#29706281)

Wait a minute; hold up. A 7 thousand watt shower?? What are you washing with that thing? An elephant?

Imagine a beowulf cluster of those... amiright?

Re:Preventative Medicine - get a UPS (0)

Anonymous Coward | about 5 years ago | (#29706369)

How long does it take to heat a glass of water in a 1kW microwave oven? A 7kW flow heater is not out of the ordinary.
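The arithmetic behind that comparison can be sketched as follows, assuming a specific heat of about 4186 J/(kg·K) for water, no losses, and illustrative temperature rises (30 K for the shower, 60 K for the glass) that are not from the thread:

```python
WATER_HEAT_CAPACITY = 4186.0  # J/(kg*K), approximate value for liquid water

def flow_litres_per_minute(power_watts, temp_rise_kelvin):
    """Flow a heater can sustain at a given temperature rise (lossless)."""
    kg_per_second = power_watts / (WATER_HEAT_CAPACITY * temp_rise_kelvin)
    return kg_per_second * 60.0  # 1 kg of water is roughly 1 litre

def seconds_to_heat(mass_kg, temp_rise_kelvin, power_watts):
    """Time to heat a fixed mass of water (lossless)."""
    return mass_kg * WATER_HEAT_CAPACITY * temp_rise_kelvin / power_watts

if __name__ == "__main__":
    print(f"7 kW shower: {flow_litres_per_minute(7000, 30):.1f} L/min")
    print(f"1 kW microwave, 0.25 kg glass: {seconds_to_heat(0.25, 60, 1000):.0f} s")
```

Under these assumptions 7 kW yields only a little over 3 litres per minute, while 1 kW needs about a minute for a single glass, so a 7 kW flow heater is a modest shower rather than an elephant wash.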

Re:Preventative Medicine - get a UPS (1)

gd2shoe (747932) | about 5 years ago | (#29705983)

I doubt it was the dips that killed your equipment. More likely, it was the spikes on the line that shortened their life. (same crummy electric grid that caused your brown outs) Of course, each dip can be accompanied by a spike as the power recovers. As the other poster here mentioned, it's not a matter of keeping power to your devices, it's a matter of conditioning the power that's coming in.

I'm not so sure about that (1)

Sycraft-fu (314770) | about 5 years ago | (#29706187)

You have a source to back that up? Because if not, I'm calling shenanigans. That seems real unlikely for a number of reasons:

1) This would be a recipe for lawsuits. After all, this situation of momentary power drops happens ALL the time on all kinds of circuits. If computers weren't able to handle it, that'd be a great way to get sued. With consumer devices you don't get to say "Oh this thing is super sensitive you have to take all kinds of measures to protect it." Your device is expected to deal with common conditions and there are tests out there for that sort of thing. FCC Part 15 Class B is an example, which deals with unintentional radiators of EMI/RFI.

2) Modern PSUs are almost universally active PFC, meaning they smooth their load on the electrical grid. The other side effect of that is they are voltage and frequency agnostic. You'll notice they are generally spec'd like "AC input 90-264V, 47Hz-63Hz." They will work anywhere in that range, they don't require a specific voltage or frequency. Well, as such a momentary line sag is probably nothing to worry about. The voltage is still within the operational range. The PSU doesn't care, it just draws more current. Unless you are near its operational limits, this isn't a problem.

3) PSUs have bigass capacitors in them. Google around for some pictures of the inside of a PSU. You'll notice some extremely heavy duty caps. Those provide a whole bunch of instantaneous power reserves. These can deal with both quick increases in demand for power from the system, and drops in supply from the line. That is one of the major reasons to stick a cap in a system, they smooth out a power rail.

4) The consumer devices you call UPSes, aren't. What I mean is they are not truly uninterruptible. For that, you need a fully online UPS system. What that means is something where the incoming power is converted to DC, sent to a battery, then the output of that is inverted and sent to the computer. That will have no interruptions, no sags. A normal UPS is a line interactive one. It is fast, but not instant. A momentary drop it can't catch. It takes a few tens of milliseconds to switch to battery. For that matter, line interactive UPSes don't tend to make up low voltage conditions with battery, they instead switch taps on a transformer and act as a voltage regulator, again not instant. So while they'll help with a chronic sag, a true grid brownout, they aren't fast enough to catch a fast drop.

I'm not saying that power conditioning and backup is a bad idea, quite the contrary. I am saying that I think you are incorrect that the momentary sag caused by turning on a high drain device is a problem. If all you've got is anecdotal evidence, well I think you should rethink your position. As a counter to your anecdote, here's my own: I live in the desert and thus have an AC unit that kicks in all the time. It causes a line sag when it kicks on. I also have a fridge, freezer, and some other devices that cause power on sags, like a receiver (it has a huge set of caps that have to charge). I do have a UPS on my computer, but there have been times when I didn't, and I have devices that aren't on UPS power. None have died.

Re:I'm not so sure about that (1)

jackchance (947926) | about 5 years ago | (#29706349)

I concede that i was incorrect to place the blame on the brownouts specifically. I should have said home PC hardware failures are caused mostly by electrical problems. I mention the brownouts because that is something visible (as opposed to the spikes.)

And getting a cheap UPS solved the problem. Specifically I got an , which was around $50. [apc.com]

If you spend $500 to $5000 on a computer (or other electronics), it is a good investment to protect it with a $50 UPS.

Tools (1, Informative)

Anonymous Coward | about 5 years ago | (#29705633)

CPU:
Prime95 (Step 2): http://www.mersenne.org/freesoft/#newusers [mersenne.org] ... Blend test for memory+CPU stability, Small FFT for CPU
LinX: http://www.softpedia.com/get/System/Benchmarks/LinX-benchmark.shtml [softpedia.com]

Video Card:
3dmark: http://www.futuremark.com/benchmarks/ [futuremark.com]

When testing the video card, listen for high pitch squealing (power issue), over heating, and symptoms like white dots appearing at random. This is not a test tool but will put some stress on the card.

Built in Self Test (1)

camgirlshide (1649725) | about 5 years ago | (#29705647)

Many machines, like my Dell notebook, have a built-in self test. On this machine I'm typing on now (Dell Inspiron 1501), if you go into the self-test mode (instructions to enable it are on the splash screen) you can pick one of two tests - a quick 15 minute test or one that runs for several hours. I assume this self test is the same test that Dell runs before they let a machine leave their factory. I have also seen their techs use it to determine which piece of hardware they should send you if they believe there is a hardware problem. Unfortunately, I don't think it is foolproof. I'm pretty sure the hard drive is going on this laptop even though I don't see anything to indicate that on the SMART status or the self-test screens. It's a good start though.

Re:Built in Self Test (2, Interesting)

gd2shoe (747932) | about 5 years ago | (#29706029)

It is a good start, but no more than that. Those tests are certainly not comprehensive (and should be). On the plus side, they often have your specific hardware in mind, and might possibly catch something that other tools won't. (doesn't happen often, but sometimes...)

SMART is also not the end-all of hard drive indicators. A drive can pass SMART, and still be on the way out. I've found (for those familiar with Linux) that a dd from the hd to /dev/null will often spit out errors on a drive that's getting ready to fail. A linear read is far faster than a read/write surface scan, albeit not as thorough. (can be run from knoppix live CD)
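The linear-read idea can also be sketched in a few lines of Python: read the device front to back, throw the data away, and report where the first I/O error occurs. This is only an illustration of the technique, not a replacement for dd; the function name is made up, `/dev/sda` in the usage comment is a hypothetical device path, and reading a raw device normally requires root.

```python
def linear_read(path, chunk_size=1 << 20):
    """Read `path` sequentially and discard the data.

    Returns (bytes_read, error), where error is None if the whole file
    or device was readable, or the OS error message from the first
    failed read. Like a plain dd, it stops at the first error.
    """
    total = 0
    # buffering=0 so each read() maps to one OS-level read
    with open(path, "rb", buffering=0) as dev:
        while True:
            try:
                chunk = dev.read(chunk_size)
            except OSError as exc:
                return total, str(exc)
            if not chunk:
                return total, None
            total += len(chunk)

if __name__ == "__main__":
    import sys
    # Usage (hypothetical device): sudo python3 linear_read.py /dev/sda
    done, err = linear_read(sys.argv[1])
    print(f"read {done} bytes, error: {err}")
```

A reported error offset well inside the drive's capacity is the kind of warning sign described above, even when SMART still reports the drive as healthy.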

Microscope (3, Informative)

grapeape (137008) | about 5 years ago | (#29705649)

I like the Microscope products... their newest version, Microscope Duo, boots off of a USB stick. For machines that don't boot at all they also have a diagnostic card; it's basically a PCI card with an LED readout that gives a series of POST codes that can help diagnose whether it's the board, a card, memory, etc. They can be found at http://www.micro2000.com/ [micro2000.com]

The handiest piece of diagnostic gear I use is actually a simple power supply tester. You would be amazed how many systems that appear to power up are actually suffering from a dead -5 or +5 rail on the power supply. Many tend to think that if the fan's spinning the power supply is OK, but that's often not the case. The best part is they are cheap... around $10 for a basic one.

Re:Microscope (0)

Anonymous Coward | about 5 years ago | (#29705967)

Man, are these guys still around? I remember them from way back in the dark days of MS-DOS and Norton Utilities to diagnose hardware...

Re:Microscope (1)

fjin (36284) | about 5 years ago | (#29706163)

I do consumer computer repairs for a living. One observation is that a power supply lasts about 5 years when it is in a general-purpose, non-gamer box.

Testing video card? (1, Interesting)

Anonymous Coward | about 5 years ago | (#29705671)

ATITOOL [techpowerup.com]

It's not just for ATI. Has a card stressing feature.

Another program to note is something like: speedfan [almico.com]

I can't count how many times a problem was directly caused by high temperatures on the CPU, GPU, etc.

And one more tool which I keep in my toolbox:

Spacemonger [almico.com]

A quick run of it gives you a visual representation of the hard drive. I've fixed several problems by seeing that crap needs to get deleted.

Good luck!

I wish you had asked this question 2 weeks ago... (1, Interesting)

Anonymous Coward | about 5 years ago | (#29705673)

I've slowly replaced every component in my system due to random crashes. Memory, hard drives, motherboard, power supply, video card and finally this morning the CPU. Each with a fresh OS install.

I'm left with either the case, or the DVD drive being the culprit - if it is the DVD drive, I'm gonna kill someone - most likely me...

Re:I wish you had asked this question 2 weeks ago. (1)

Panaflex (13191) | about 5 years ago | (#29706319)

It's your power mains... get a good UPS with a line conditioner.

Hiren's... (3, Insightful)

Zakabog (603757) | about 5 years ago | (#29705681)

Hiren's BootCD [hiren.info] contains a bunch of different utilities for doing just this. Plus it's bootable, so if you can't get into the OS you can still use the CD. It can do just about anything you'd need to in order to diagnose and repair a machine. You just gotta find it (usually the pirate bay or other torrent sites are a good place to look.)

Hardware tester (2, Informative)

iammani (1392285) | about 5 years ago | (#29705689)

When you no longer trust your CPU/motherboard, I am afraid the only option to test them would be a hardware circuit (one that can make decisions using its own CPU) specifically designed for your motherboard/processor, which I believe only the manufacturer will have access to. If you are looking for a more practical solution, the only way is to eliminate the possibility of all other hardware failing (by simply removing it or trying it in a known-good machine) and then assume it must be a CPU/motherboard issue (which means you may have to junk them both and buy new ones). And don't forget to test your power supply unit (not checking it on my old PC cost me a hell of a lot of hours).

Re:Hardware tester (1)

snadrus (930168) | about 5 years ago | (#29706265)

Lately, buying the latest special combo beats web purchases of old hardware for testing in all but the most special cases. In those cases either have a warranty or a seller test, or, if it's bleeding edge, expect trouble and use the forums.

Best Memtest I've Used Is... (0)

Anonymous Coward | about 5 years ago | (#29705745)

Age of Conan

Believe it or not. I had some crap memory (OCZ Reaper) which other than a few random crashes, *mostly* worked. However, it would consistently corrupt AoC's patches. If AoC decides to re-download a GB of patches on you over and over, check your memory. I've since replaced the memory on this machine and had no problems since. Sadly I've stopped playing AoC. Oh well.

Re:Best Memtest I've Used Is... (1)

lukas84 (912874) | about 5 years ago | (#29705783)

I just hope at some point people will decide that ECC should be mandatory for everything.

GPUs like ATI's HD5800 series already employ memory with ECC.

2GB sticks are now the norm, systems with 4GB or 6GB of RAM pretty standard. So ECC would make a lot of sense. But we still don't see it anywhere, even though by now all modern hardware is capable of it (though Intel disables it on the consumer badged versions).

SMART for dying hard drives (4, Informative)

Wrath0fb0b (302444) | about 5 years ago | (#29705747)

http://sourceforge.net/apps/trac/smartmontools/wiki [sourceforge.net] is great for finding out what the drives think about their own health. Things to look out for are spin-retry counts (which lead to that annoying 2-5 seconds freeze), high reallocated sector counts (never never never use chkdsk to attempt to fix a broken hard drive. With the robustness of modern journaling file systems (HFS, extN, NTFS), storage errors are almost always hardware errors. Running chkdsk stresses the drive just as it's failing and usually pushes it over the edge -- and then users complain that you can't recover their data.
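A rough Python sketch of that triage, assuming you already have the attribute table that `smartctl -A` prints. The sample text below is made up, and flag_smart_attributes and the watch list are hypothetical names chosen for illustration. It only flags watched attributes whose raw value is nonzero; it does not predict failure.

```python
# Attribute names as smartmontools prints them; the watch list is an assumption.
WATCHED = {"Reallocated_Sector_Ct", "Spin_Retry_Count"}


def flag_smart_attributes(table_text, watched=WATCHED):
    """Return {attribute_name: raw_value} for watched attributes with raw > 0."""
    flagged = {}
    for line in table_text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns;
        # the header row ("ID# ATTRIBUTE_NAME ...") is skipped by the isdigit test.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        name, raw = fields[1], fields[9]
        if name in watched and raw.isdigit() and int(raw) > 0:
            flagged[name] = int(raw)
    return flagged


# Made-up sample in the usual column layout, for illustration only.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       12
  9 Power_On_Hours          0x0032   095   095   000    Old_age   Always       -       4321
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
"""

print(flag_smart_attributes(SAMPLE))
```

Raw values are vendor-specific for some attributes, so a flagged entry is a prompt to back up and investigate rather than a verdict.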

Good comment )))..) ) ) ) ) ))))))) (1)

ClintJCL (264898) | about 5 years ago | (#29706427)

But leaving a parenthesis open like that is offensive. )))))

prime 95 (2, Interesting)

LordKronos (470910) | about 5 years ago | (#29705775)

Prime 95 is a good test of CPU/RAM, as well as to see if the system remains stable under peak temperature. It's often used to burn in overclocked machines.

Buy Dell or HP (1)

dcheest (463877) | about 5 years ago | (#29705797)

That's exactly why I buy Dell or HP PC's. They both offer complete hardware diagnostic programs on a bootable CD or a utility partition and they replace the parts that fail the tests if it's under warranty.

and pay $200 for a $100 video card (0)

Anonymous Coward | about 5 years ago | (#29705991)

And pay $200 for a $100 video card. Yes, at one time Dell wanted $200 more for a video card upgrade that in stores or online was only a $100 difference.
This was with a BTO system.

Just tell them (0)

Anonymous Coward | about 5 years ago | (#29705841)

Just tell them that they'll need to buy a new computer. Also tell them you'll be nice and take their old one off them for "proper disposal" for a fee of only $50. That plus the $50 "diagnostic fee" means you come out $100 + 1 computer richer.

stresslinux (1)

zal (553) | about 5 years ago | (#29705855)

http://www.stresslinux.org/

nice single purpose linux distro to stress test a system

PSU (1)

sulliwan (810585) | about 5 years ago | (#29705879)

Just replace the damn power supply already and stop wasting your time testing the cpu and the mobo.

Other cause (0)

Anonymous Coward | about 5 years ago | (#29705889)

Well, some years ago I had a customer's PC that had many problems at their office but ran fine whenever I was trying to catch the defect.
After some time, I found that the PC wasn't the problem; instead it was the UPS, which was producing strong magnetic fields when it was near the PC.
Then I put one meter of distance between the two and the problems disappeared ...

Sometimes, matters are not really obvious ;p

PC-Doctor is your solution (0)

Anonymous Coward | about 5 years ago | (#29705899)

http://www.pc-doctor.com/

Swap the damn hardware (3, Informative)

evilviper (135110) | about 5 years ago | (#29705917)

but how do you go about testing CPU, motherboard and graphics card trio to find which is to blame? Replacing them one by one isn't really an option. Do you know any software that would help the way memtest helps with RAM?

There is no way to tell, with software, whether your PSU, CPU, or motherboard is to blame, in the overwhelming majority of cases.

It's just idiotic to say "Replacing them one by one isn't really an option". In fact, that's by far the best option. I don't run memtest for a week to find out I have bad RAM, I take 30 seconds to swap it, and find out, for certain, in no time. PSUs are equally easy to swap, AND are the more likely component to fail, so that's the best place to start.

If you don't know whether it's CPU or the MoBo, buy a new motherboard... Vastly more likely to be the cause, and pretty damn cheap just as soon as they're no longer brand new. Of course CPUs fail, but it's likely to be obvious from a visual inspection if they've been installed wrong, or otherwise abused.

Re:Swap the damn hardware (1)

Etylowy (1283284) | about 5 years ago | (#29705971)

To replace hardware piece by piece you need to have replacements. When called by a relative on a Friday evening to fix the computer you don't usually cannibalize your own hardware just to have parts, do you?

Re:Swap the damn hardware (1)

Doppler00 (534739) | about 5 years ago | (#29706139)

I agree. When I build a new system I first:
memtest86+
cpu test with something like prime95
CPU+GPU test with prime95 and then another 3D game running in the background.

If it survives that last test, then it's good. I've found overheating of my system to be the main cause of crashes. I've actually had to underclock my RAM to get it stable. If something does fail, I swap that component or add more fans and try again.

Re:Swap the damn hardware (0)

Anonymous Coward | about 5 years ago | (#29706255)

Ultimately, "swap components" is the best method to test. My guess is most on /. have access to multiple machines and can easily swap in a working power supply to rule that out. Most probably have access to more than one computer using the same CPU socket type and can swap everything to determine what is bad. I don't bother buying new components to test with; if I'm going to buy a new mobo to test, then I'm just going to decide it's time to upgrade the computer and get a current one instead of an old, obsolete socket type. Sucks for single-computer families: they should probably buy the extended warranty on computers and go with companies that are known for customer service. And of course there's always Geek Squad.

prime95 (1)

MoFoQ (584566) | about 5 years ago | (#29705935)

never heard of prime95?

It's been used for years by overclocking and gaming enthusiasts to check the stability of their rigs.
It even has different "levels" of FFT tests: the small ones keep the torture test within the CPU cache, which tests the CPU itself, while the larger ones also test the RAM, PSU, etc.

Prime95 [mersenne.org]

Serious answer: don't bother: upgrade. (1)

lkcl (517947) | about 5 years ago | (#29705943)

I've done a significant amount of PC construction and reconstruction: approximately 60 from-scratch builds in 20 years. One thing that has taught me is: do not bother to try to diagnose motherboard or CPU faults: just replace them, end of story.

Even Integrated Motherboards can be had for £40, and CPUs for £25. You can get dual-core 1.6GHz Atom Integrated-everything-including-CPU motherboards for £90.

For the amount of time and effort spent unscrewing components and testing combinations that may, if there is some I.C. damage, result in EXTRA damage to other components, the risk and the time is *just* not worth it.

There is, however, short-circuit protection in Hard Drive channels (there is now: there used not to be!), USB devices, PCI cards etc. so the risk associated with these components of them causing further damage, if they themselves are damaged, is much lower (but still possible).

Additionally, short-circuit protection in the PSU is also present and helps mitigate the risk of further damage.

Basically: if you find that a machine is acting up, do an Internet Search for that model: there may be a firmware upgrade that fixes the problem. I once bought eight identical machines (£125 each) and they all had EXACTLY the same memory / unreliability fault. Eighteen months later I found the firmware upgrade that changed the timings to work around the problem.

Other than that: if you cannot find any evidence of firmware upgrades to potentially fix an unreliable machine - throw out the power supply, the motherboard and the CPU, without hesitation (or get them replaced under warranty). Simple as that.

*possibly* keep the memory, but bear in mind that when you upgrade the CPU and the motherboard, you will likely need a different kind of memory, and that memory is likely to be incredibly cheap, anyway.

Peripherals and cards: you should be okay (but test them one at a time).

Ultimately it's about risk management, and the level of integration is simply too high to take any risks. Throw the components out, and get new ones.

Re:Serious answer: don't bother: upgrade. (1)

vxvxvxvx (745287) | about 5 years ago | (#29706311)

Other than that: if you cannot find any evidence of firmware upgrades to potentially fix an unreliable machine - throw out the power supply, the motherboard and the CPU, without hesitation (or get them replaced under warranty).

If you've only narrowed it down to one of those 3, how will you get the companies to replace those parts under warranty?

Practical System Stressing... (3, Funny)

Linker3000 (626634) | about 5 years ago | (#29705945)

I stress my Linux boxes by telling them that if they develop a fault I'll re-image them with Vista.

Not a single one has dared to fail on me yet.

PC Doctor Service Center (0)

Anonymous Coward | about 5 years ago | (#29705947)

The PC-Doctor software runs on a PCI boot ROM, DOS, Linux, and Windows. It's pretty good at identifying problem areas and problematic components. They sell a retail product called Service Center which comes with a PCI tester card, USB device tester, MiniPCI tester card, power supply tester, and some other neat little toys. It looks really cool:

http://www.amazon.com/PC-Doctor-Service-Center-Computer-Diagnostics/dp/B000Z88VXK

The source of your woes (1)

brassmaster (950537) | about 5 years ago | (#29705957)

crash way too often to blame it all on Microsoft —

Not possible.

Re:The source of your woes (0)

Anonymous Coward | about 5 years ago | (#29706035)

Indeed.

empirical testing: Compile the Linux kernel (1)

lkcl (517947) | about 5 years ago | (#29705993)

gcc is an incredibly good test application. it's horrendously cpu-intensive, and it is designed to eat whatever physical memory is available. compiling c++ applications is particularly memory-intensive, but the best test of both disk and memory has to be simply to compile the linux kernel.

if you have multiple cores, you can use "make -j {number of cores + 1}" and this will test all of the CPUs, as well. if you particularly want to stress things, make that "make -j {number of cores * 2}" instead.
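A small Python sketch of how those job counts could be worked out automatically. The helper name kernel_build_command is made up for illustration; it only builds the argument list and does not run anything.

```python
import os


def kernel_build_command(cores=None, heavy=False):
    """Return the make invocation as an argument list (not executed here).

    heavy=False -> number of cores + 1 jobs
    heavy=True  -> number of cores * 2 jobs
    """
    if cores is None:
        cores = os.cpu_count() or 1  # cpu_count() can return None
    jobs = cores * 2 if heavy else cores + 1
    return ["make", f"-j{jobs}"]


print(kernel_build_command())
print(kernel_build_command(heavy=True))
```

To actually start the build you could hand the list to subprocess.run with cwd set to your kernel source tree.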

Re:empirical testing: Compile the Linux kernel (2, Funny)

cschepers (1581457) | about 5 years ago | (#29706105)

Or you could install Gentoo. That'll eat up the CPU, RAM, and hard disk for darn near eternity.

on which machine do i log calls, and which to test (0)

Anonymous Coward | about 5 years ago | (#29706005)

bring the crash cart.

oddly enough, a new power supply has helped more than once.

Ultimate Boot CD (1)

googlesmith123 (1546733) | about 5 years ago | (#29706021)

http://www.ultimatebootcd.com/ [ultimatebootcd.com]

This is pretty much the best free tool there is to test and diagnose a system. It also has a bunch of tools for partitioning and the like as well as password resetting.

I've had this in my arsenal for many years now, it's a great tool.

Not a perfect solution (1)

Whuffo (1043790) | about 5 years ago | (#29706067)

If your hardware is suspect, then the output of any program running on that hardware would also be suspect. Keep that in mind when you run diagnostic software - if it says the system is good then it probably is but if the software reports errors then the reported error isn't necessarily accurate. I've also seen these programs detect failures in perfectly working systems. I've tested many of these "technician on a disk" programs over the years and Microscope is the best of a bad bunch.

A more productive diagnostic method is to "divide and conquer" - consider the various replaceable sub-assemblies and diagnose only to that level. Tips: most failures are memory related - bad RAM or it's not making good contact in its socket. If the system locks up immediately at boot or after running a short time, it's almost always memory. Bad power supplies are also a fairly common source of general flakiness and no diagnostic software will be able to diagnose those problems. Bad motherboards are rare and bad CPU chips are almost unheard of.

A few thought ( elimination process ) (1)

hebertrich (472331) | about 5 years ago | (#29706073)

First: if you repair PCs on any kind of regular basis, I suggest you make a test jig. Get a flat surface and fix a working mobo/CPU and two power supplies to it. Make a second space for the mobo to be tested. Have a screen, keyboard and mouse at the station. That will help you tremendously, as you can just move the parts about; for almost everything but CPU/mobo fault isolation it makes it a breeze.

But that's where we stop. When you are hit with a fault of the mobo or CPU, the only valid suggestion nowadays is to replace both.

You may like to know what's broken, but that's pointless as you need to change both the CPU and the motherboard, and I'll explain why.

You have no way of knowing which caused which to fail. The only valid fix is to replace them both. If you plug a new CPU into a bad mobo and blow the new CPU, you're no better off and you've lost a CPU. If the CPU is the cause and made the mobo fail, well, swapping only the CPU is no fix either, because the mobo is already damaged.

No, I strongly believe there's no point in trying to find out, other than to satisfy your natural and, believe me, mutual curiosity.

Happy trails :)

Re:A few thought ( elimination process ) (1)

vxvxvxvx (745287) | about 5 years ago | (#29706321)

You may like to know what's broken , but that's pointless as you need to change both the cpu and the motherboard , and i explain myself.

Only pointless if you don't plan to get a replacement under warranty.

What separates a PC from a real computer... (1)

cdrguru (88047) | about 5 years ago | (#29706091)

Your average PC hardware has utterly no way to "test" it. You can sort of test RAM - to the point of identifying there is a failure somewhere in the memory. OK, if you have four DIMMs what does that mean? Well, it means you have a RAM problem somewhere.

Motherboard? Not really any sort of testing possible. There are some "pretend" diagnostic tools that will try to tell you if something fails, but what exactly does that mean? Nothing. If you have an ATAPI DVD drive and a SATA hard drive I assure you that a failing drive can easily show up as a failure in some "motherboard" test.

There is no clear isolation of the hardware whatsoever, and no ability for the hardware to meaningfully participate in any sort of testing. So you are left with changing parts - more or less what I like to call "throw parts at the problem". Today this isn't terribly practical as most everything is on the motherboard. If you are a skilled screwdriver user you could replace the motherboard, but for most people it is just getting a new computer.

Even if you take a computer to a "computer shop" you are likely to see very little in the way of diagnostics or fault isolation. They will pull out something and replace it with something they have lying around to see if that "fixes" the problem. Often they will do this blindly without much real thought in the process. The end result for the customer is that their computer works again but nobody really knows what the problem was. And, by the way, here is the bill for the parts that we replaced.

There are some external hardware parts that are pretty simple to diagnose and replace. The power supply is probably the most prone to failure and is pretty obvious - the machine is dead with no lights. A CD or DVD drive is pretty simple to sort out as well with most common failures because it either works or it does not. In either case it is a few connectors and a few screws and you have the part in your hand. Both are going to be less than $100 to replace and well worth doing it.

The lack of any real diagnostic ability - or even ability to verify proper operation - is a serious limitation in the PC world. If you move up to real server hardware you see all sorts of diagnostic and fault isolation capabilities. Things like the memory test telling you what DIMM is bad or that a hard drive is failing. But the real gem of hardware diagnostics seems to be reserved for mainframe systems. It tells you a part is going to fail, tells you where the part is and you can confirm that it fails specific tests and a new part passes the same tests.

Don't overlook the hardware basics... (1)

jpdbest (44934) | about 5 years ago | (#29706117)

Sometimes a quick visual inspection of the interior of the computer can lead to the cause of the problem. Double-check the cabling, cards, memory, etc. to make sure that everything is secured in place. Even if the cards appear to be fine, I've seen cases where they needed to be removed and reseated. Don't forget about cooling as well. Make sure that the system has adequate cooling, that the existing fans/heatsinks are not clogged with dust, and that the fans spin freely with the flick of a finger. Double-check that the fans are operational with the case open and the system powered; most motherboards also have basic temperature monitoring for the CPUs and speed monitoring for the fans. On the motherboard, make sure to check the capacitors. Over the years (as recently as a couple of weeks ago), I've had to replace motherboards because the capacitors had gone bad:

see -- http://en.wikipedia.org/wiki/Capacitor_plague [wikipedia.org]

Some people have already mentioned it, but it needs to be stressed, a *good* power supply is mandatory and if necessary a UPS. The power supply can be perfectly operational and even pass with a power supply tester (also a good investment), but if the power being supplied to it is not consistent (brown-outs) or simply not adequate to drive all the components (e.g. video cards, # of drives, etc.) that can cause problems. In one case by simply swapping the cheap power supply out for a good quality one that I had as a spare from an older system resolved the problem.

Improper BIOS settings can also cause problems. Are the memory/CPU voltages or speeds incorrect? Is conflicting on-board video/audio still enabled after add-in video and audio cards have been installed?

I still haven't even gotten to the software debugging side of things...

It's not the CPU (0)

Anonymous Coward | about 5 years ago | (#29706207)

I have been a pc-building hobbyist, done it for personal gain, and worked in corporate support environments for years. It's never your CPU. Modern CPUs do not "go bad" unless subjected to abuse/lack of cooling, in which case they will fry and not work at all. A CPU can't really work halfway.

Rule #1 of Diagnosing Hardware (1)

frank_adrian314159 (469671) | about 5 years ago | (#29706241)

1. Check the software
2. It's probably the software
3. Really, it's going to be the software

...

87. OK, now you should run some diagnostics

Really. The bottom line is that computers and their parts (especially non-moving ones like processors and RAM), once they're burned in and assuming you don't try to run them overclocked for twenty years without rotating them out, are pretty reliable. I can't count more than a couple of instances of hardware failure post burn-in across about fifteen different home machines over twenty years. And both times those were disk failures, which are usually obvious to diagnose (as are broken CPU fans, which happened to a friend). Contrast this with my experience with Windows machines, where bad drivers, creeping registry cruft, and general unpleasantness of management force you to reinstall the OS every couple of years (which is why I'm switching to either Linux or Macs as machines rotate out).

So as to my advice... see above.

The test suite I use (1)

cracauer (6353) | about 5 years ago | (#29706247)

Memtest86 tests much less of the memory than you think. It is 100% no-load. It does find outright broken memory cells but it does nothing if the memory interface runs unreliably.

To test your memory interface under stress you use a program named "Superpi", you run the "32M" test. It is available for Linux and runs on FreeBSD. I find a lot more problems with SuperPi than with memtest, a lot of memtest-stable machines don't actually work right once you stress-test.

%%

To test the CPUs/cores, you use "MPrime" or "Prime95" (same thing). It is the hardest load test that the overclocking record chasers have found, and they try very hard to find more and more nasty tests to prove that their competitors' overclock is not valid. They do this all day long; you should profit from their research.

You run MPrime with one instance per core. Available for Linux, IIRC also works on FreeBSD.

Be warned that the CPU temperature during MPrime will rise to levels that no other program I am aware of reaches. That's the point. MPrime also has a very high amount of plausibility checking on its intermediate results. The combination of those two factors is why it is such an effective hardware test.
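To show why result checking matters, here is a toy Python sketch of the general idea: repeat a deterministic, CPU-heavy calculation and compare the results. This is not MPrime's actual algorithm, and it will not heat a CPU anything like MPrime does; the workload, the function names and the round count are assumptions chosen purely for illustration. On healthy hardware every round gives the same digest, so any mismatch means a bit got flipped somewhere.

```python
import hashlib


def workload(seed, iterations=200_000):
    """Deterministic integer crunching; returns a digest of the final state."""
    x = seed
    for _ in range(iterations):
        # 64-bit linear congruential step, used here only to keep the CPU busy.
        x = (x * 6364136223846793005 + 1442695040888963407) % (1 << 64)
    return hashlib.sha256(x.to_bytes(8, "little")).hexdigest()


def consistency_check(rounds=5, seed=12345):
    """Run the workload `rounds` times; return indices of rounds whose
    result disagrees with the first one (empty list means all agreed)."""
    reference = workload(seed)
    return [i for i in range(1, rounds) if workload(seed) != reference]


mismatches = consistency_check()
print("mismatching rounds:", mismatches)
```

A real stress test would run one such loop per core for hours; this sketch only demonstrates the compare-against-a-known-result principle.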

%%

So, in summary:

Run:
1) MPrime for 36 hours (all cores simultaneously, one MPrime each)
2) 24 hours of memtest86
3) a whole bunch of SuperPi 32M.

If there is any 3D graphics ever used you also run Futuremark's 3DMark (Windoze only).

Oh, and you will have to note the CPU temperature that you get during that mprime run and never exceed that temperature during everyday work from then on. This usually isn't a problem since mprime will heat your CPU like nothing else.

Good luck. Notebooks in particular, and cheap ready-made desktops not distributed by Dell, tend to fail this outright. If any of these steps fails, you can't pass any important data through this computer; it can and sooner or later will scramble your hard drive contents, silently, so that your backup USB drive already has the corrupted version by the time you notice.

Well (1)

ShooterNeo (555040) | about 5 years ago | (#29706267)

Toast and Pi and various other CPU stability test programs will let you test the CPU.

Go into system configuration in Windows and turn off auto-reboot, so that if the machine blue screens, you can see what the error code is. Sometimes that will let you isolate it to graphics or the motherboard.

Ultimately, the way to find out IS to replace the components one by one. If you have several machines, or spares from an older machine, you should swap each component and run the machine until either you get a crash or it's been long enough that you must have found the problem.

Power supply (2, Interesting)

Alioth (221270) | about 5 years ago | (#29706359)

You didn't mention the power supply.

In my experience, a "crashy machine" is almost always down to the PSU. Out of the dozens of "crashy machines" I've had to fix, only one was due to bad memory. The rest were *all* down to faulty power supplies, and all of those were due to capacitors that had failed.

I have an oscilloscope so I can easily test for ripple without needing to open up the power supply and look for the obvious signs (bulging capacitors, maybe ones that have leaked). We've had dozens of machines at work with supplies that have gone bad this way. Bad capacitors have been a real problem in recent years. Four years ago, it wasn't just in power supplies either - we had to return 70 machines to Hewlett-Packard under warranty after the capacitors on the motherboard began failing after 3 months of use. We've not seen anything on that scale on motherboards since, but we still have frequent problems with power supplies failing from "capacitor plague".
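If your scope can export samples, the ripple check is simple arithmetic: peak-to-peak ripple is the maximum minus the minimum of the rail voltage over the capture. A minimal Python sketch with made-up sample values follows. The limits in the table (120 mV on +12 V, 50 mV on +5 V and +3.3 V) are the figures commonly quoted from the ATX12V design guide, so treat them as an assumption and check the spec for your own supply.

```python
# Assumed peak-to-peak ripple limits in millivolts; verify against your PSU spec.
RIPPLE_LIMIT_MV = {"+12V": 120.0, "+5V": 50.0, "+3.3V": 50.0}


def ripple_mv(samples_v):
    """Peak-to-peak ripple of a list of voltage samples, in millivolts."""
    return (max(samples_v) - min(samples_v)) * 1000.0


def rail_ok(rail, samples_v, limits=RIPPLE_LIMIT_MV):
    """True if the measured ripple on `rail` is within the assumed limit."""
    return ripple_mv(samples_v) <= limits[rail]


# Made-up captures for illustration only.
good_12v = [12.01, 12.03, 11.99, 12.02, 12.00]
bad_12v = [12.10, 11.85, 12.15, 11.90, 12.05]
print(rail_ok("+12V", good_12v), rail_ok("+12V", bad_12v))
```

The measurement only means something if it is taken under load, with the probe grounded close to the rail.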

A machine of mine was actually killed by a sudden power supply failure - the PSU let the magic smoke out with a loud "bang", and there was the sound of stuff ricocheting around the computer's case. That sound turned out to be bits of exploding chips on the motherboard. The only thing that survived that incident was the CD-ROM drive - all other components were destroyed.

It's a loaded question (1)

camperslo (704715) | about 5 years ago | (#29706387)

What's the best software to change a tire on your car and find the leak?

Software can check quite a few things, but for the most part during a short time interval, digital hardware is either working or it isn't. So software performance tests may not be very good at revealing something marginal.

Beyond a few software tests and ruling some things out by substitution, it generally takes someone with some hardware troubleshooting skills, and some test equipment.

Of course test equipment starts with your senses. Software isn't very good at spotting things like a failing fan, or dust buildup in heatsinks, cooling vents, or the power supply. Software won't find that little solder blob or loose screw shorting something. It probably won't tell you about something poorly seated or dirty in a socket. It won't tell you about the marginal power supply or high-resistance connector that makes the voltage dip when a drive spins up... It won't tell you if the CPU doesn't have thermal compound properly applied (although software monitoring of temperature sensors does help).

Of course it goes without saying that you've made sure that BIOS settings are such that nothing is stressed. Don't be afraid to let a memory test run overnight or longer.

A multimeter, oscilloscope, and dummy-load resistors are a good starting point. Adjustable power supplies to allow board testing at the upper and lower ends of the specified operating voltages can also be very revealing. A hair dryer and freeze spray may help localize thermal intermittents. A temperature probe and IR video camera can be handy. For example, being able to see a pin of a connector heating could reveal a problem even when voltages are within normal limits.

If qualified to do so, use an oscilloscope and voltmeter to see that any switching regulators on the motherboard are functioning properly. Failing capacitors sometimes have obvious physical signs, but don't count on finding bad parts by appearance only. Seeing excessive ripple/noise with a scope can make filtering problems immediately obvious. Many modern boards take a 12 Volt input and convert it to what the CPU requires. In some cases the related components are heavily stressed.

Beyond simple things like regulator problems, it is unlikely that most people outside of a specialized service facility could actually fix much on a motherboard. Even when possible, it is not likely to be cost-effective.

Use every clue presented. What's going on when the malfunction occurs? (what's running, is the environment hotter or cooler, is equipment subjected to vibration or static discharge, note time of day when other equipment kicks in etc.)
