Boost UltraSPARC T1 Floating Point w/ a Graphics Card? 71
alxtoth asks: "All over the web, Sun's UltraSPARC T1 is described as 'not fit for floating point calculations'. Somebody has benchmarked it for HPC applications and got results that weren't that bad. What if one of the threads could do the floating point on the GPU, as suggested here? Even if the factory setup does not expect a video card, could you insert a low-profile PCI-E video card, boot Ubuntu, and expect decent performance?"
Re:Video card for Sparc? (Score:2)
Re:Video card for Sparc? (Score:1, Interesting)
Re:Video card for Sparc? (Score:2)
Re:Video card for Sparc? (Score:2)
Just curious. Do you know of any good CAD software for Linux other than qcad?
Re:Video card for Sparc? (Score:1)
Well, IBM makes what's called CATIA for Linux, and EDS is making Unigraphics NX4 for Linux (which is why it was late). Both were Unix products before going to Windows, too.
Re:Video card for Sparc? (Score:2)
Huh? CAD on Macs/Windows??? (Score:5, Insightful)
There are no good technical reasons not to recompile something like this for OS X, but if you can imagine porting a package which comes as a bookshelf of CDs from UN*X to the Win API, I'd like some of the stuff you are smoking!
Paul
Re:Huh? CAD on Macs/Windows??? (Score:3, Informative)
Re:Huh? CAD on Macs/Windows??? (Score:2)
The gap may have narrowed from what it once was, but there are still things (particularly in some
Re:Huh? CAD on Macs/Windows??? (Score:2)
Re:Huh? CAD on Macs/Windows??? (Score:2)
ITYM, "not even a RISC-instruction set box!" since every intel chip since the Pentium and every AMD chip since the Am586 is internally RISC.
Aside from that nit, you're totally right. I remember in the 90s seeing a video for IBM CAEDS, a CAD program that ran only on RS/6k
Re:Huh? CAD on Macs/Windows??? (Score:2)
Re:Video card for Sparc? (Score:3, Informative)
Re:Video card for Sparc? (Score:2)
Re:Video card for Sparc? (Score:2)
The CAD applications I'm familiar with are all related to electronic engineering... Cadence, Mentor, OrCAD, EAGLE, etc. Some of these have Solaris versions without a Linux option, some have both. I'm sure there are good generic CAD programs out there for Linux, but I haven't used any.
This link [tech-edv.co.at] looks useful.
No, you cannot (Score:5, Insightful)
There's been some work by David S Miller on getting BIOS emulation into the Linux kernel so that regular cards can be fooled into working, but it's not there yet and will probably fall foul of Debian's firmware loading policy (does that apply to Ubuntu too?).
Yes you can.. maybe not on SPARC though.. (Score:5, Informative)
PowerPC boards, PC graphics chips with x86 BIOS, no driver edits required on the OS side.. it is there like it would be on a PC.
http://metadistribution.org/blog/Blog/78A3C88E-1C
http://www.genesippc.com/ [genesippc.com]
Re:No, you cannot (Score:3, Informative)
Re:No, you cannot (Score:1, Interesting)
No, it won't. The firmware won't be shipped with Debian; it is run directly from the ROM on the very card that is to be initialized. Debian has shipped XFree86 for a long time, and it supports a similar method to initialize secondary graphics cards that require their BIOS to set them up to function properly (which probably only works on x86 CPUs).
Re:No, you cannot (Score:2)
Sun Currently OEMs ATI Radeon video cards (Score:2)
Probably, but it's not an optimal solution (Score:5, Informative)
Re:Probably, but it's not an optimal solution (Score:2)
I'm not certain of the cost of the T1 systems, but I would think that if FPU performance is important, you'd rather just go for a dual dual-core server. The T1 systems do support more memory, though: 32GB for the T1000 vs. 16GB for what I've seen in the AMD dual-processor workstations.
Re:Probably, but it's not an optimal solution (Score:2)
Re:Probably, but it's not an optimal solution (Score:2)
~Glonoinha, April 23, 2006
Re:Probably, but it's not an optimal solution (Score:2)
Re:Probably, but it's not an optimal solution (Score:2)
Actually, the Sun Fire T1000 [sun.com] only supports 16GB - but the T2000 [sun.com] does support 32GB. If you're looking at these, the alternative Opteron solution would be another rack-mounted system rather than a workstation, and the obvious candidate is the Sun Fire V40z [sun.com] with 32GB and 4 dual-core Opterons. Which, btw, are very nice systems.
Thanks for making me feel old... (Score:5, Insightful)
But Sun realized that the more things change, the more they stay the same; the reason why vendors got away with making floating point an expensive option was that there are lots of workloads where floating point performance is unimportant. So they applied the RISC principle and chose to not waste a lot of silicon on the T1 implementing instructions that are not needed in their target workload, but instead figure out how to get lots of concurrent threads.
Trying to improve floating point perf on a T1 by adding another card is like trying to figure out how to put wheels on a fish. It might be a cool hack and it might solve some particular problem but it doesn't generalize.
If you want floating point perf and tons of threads, wait for the Rock chip from Sun (and hope that Sun stays afloat long enough to ship it). It's like a T1, only more so, with floating point for each thread.
Re:Thanks for making me feel old... (Score:2)
Re:Thanks for making me feel old... (Score:2)
That's why GPGPU is an interesting strategy. GPU APIs offer parallelism, too. When those APIs can be harnessed with bus signalling that's high-enough level symbolically to exploi
Re:Thanks for making me feel old... (Score:1, Informative)
Combining a T1 and a GPU offers you jack, since GPUs use single-precision arithmetic.
Re:Thanks for making me feel old... (Score:2)
As I wrote before, I'm sure there's some workload where it makes sense to mate a T1 and a GPU (besides the obvious one, i.e., rendering grap
Re:Thanks for making me feel old... (Score:4, Interesting)
The way to think about the use of GPGPU in a host with its own (GP) CPU is client/server computing. I put together such a system in 1990, a 12MHz 80286, with 4 12.5MFLOPS DSPs (AT&T DSP32c) and an FPGA "scheduler" on the ISA card. The 286 ran a loop sending data and commands to a memory mapped page on the card's SRAM, and copying the page when a status register was set. I had realtime 24bit VGA renderings of megapolygons at 30FPS, all processed on the DSPs. The systems have all scaled up, but the price improvement per FLOPS of the GPUs over the CPU is even better now than then.
As you say, the key is keeping the compute servers full, which amortizes the signalling overhead best, and keeping the signaling across the bus high-level enough that the bandwidth doesn't bottleneck. There are lots of demanding apps now which could use that architecture. Audio compression is my favorite - I'm waiting to stuff a $1000 P4 with 6 $400 dual GPUs, and beat the performance of any <$10K server, scalable down to $1500. That's the kind of host that could really transform telephony.
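That host loop can still be sketched today; here's a toy model of it in Python. FakeCard and its doubling "workload" are invented for illustration, and the accelerator is modeled as an object rather than memory-mapped SRAM, but the shape is the same: write a command page, poll the status flag, copy the result page out.

```python
class FakeCard:
    # stand-in for the ISA card: a result page plus a status flag the host polls
    def __init__(self):
        self.result_page = None
        self.status_ready = False

    def submit(self, data):
        # pretend the DSPs crunch the batch and raise the status flag
        self.result_page = [x * 2.0 for x in data]
        self.status_ready = True

def host_loop(card, batches):
    results = []
    for batch in batches:
        card.submit(batch)                # write data + command to the card
        while not card.status_ready:      # poll the status register
            pass
        results.append(card.result_page)  # copy the result page back out
        card.status_ready = False
    return results

print(host_loop(FakeCard(), [[1.0, 2.0], [3.0]]))  # [[2.0, 4.0], [6.0]]
```

The host does almost no arithmetic; its whole job is keeping the card's command page full, which is exactly the amortization point.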
Re:Thanks for making me feel old... (Score:2)
http://www.drccomputer.com/pages/products.html [drccomputer.com]
Re:Thanks for making me feel old... (Score:2)
Yes, but the price improvement per bandwidth, and especially latency, of the interconnect between the two is much worse. Going off-chip for anything has a huge cost; in order for it to make sense, you have to be able to amortize that cost.
And those DSP chips are CPUs in the conventional sense, although they don't have all the niceties that modern CPUs have (which, ironically, also often used to be implemented as co-processor
Damn, brings back fond memories (Score:2)
Fun, bleeding edge, stuff back then.
Re:Damn, brings back fond memories (Score:2)
As we can see from the current discussion, those same issues and techniques (or at least architectural patterns) are still relevant. In proportion - about a thousand times faster, but equally across the whole uneven platform.
Re:Thanks for making me feel old... (Score:2)
Since when (on both counts)?
DSP == Digital Signal _Processor_ which is the Central Processor Unit on several platforms I know of.
http://www.signalogic.com/index.pl?page=m44 [signalogic.com]
http://www.bittware.com/products/type/dsp-pci.cfm [bittware.com]
http://www.innovative-dsp.com/products/delfin.htm [innovative-dsp.com]
http://www.innovative-dsp.com/products/toro.htm [innovative-dsp.com]
http://www.globalspec.com/FeaturedProducts/Detail/InnovativeIntegration/CONEJO_64_bit_PCI_DSP_Card/11265/0?fromSpotli [globalspec.com]
Re:Thanks for making me feel old... (Score:1)
Even more exotic was the TAAC-1, which was a wide-instruction-word processor that could be used for FFTs, imaging, etc.
One correction: the T2 (Niagara II) will be the first heavily multi-threaded SPARC CPU with one FPU per core; it is due out next year, with Rock due in 2008.
Re:Thanks for making me feel old... (Score:2)
Was the Weitek an FPU or a vector processor?
Re:Thanks for making me feel old... (Score:1)
Re:Thanks for making me feel old... (Score:2)
I almost never need to use floats in my code so I haven't really looked in a long time.
Wait for the T2 (Score:4, Interesting)
Re:Wait for the T2 (Score:1)
Re:Wait for the T2 (Score:2)
Re:Wait for the T2 (Score:1)
The changes in the T2: 2 pipelines per core, up from 1; 8 threads per core, up from 4; an FPU per core, up from 1 per chip; a faster memory subsystem; and additional hardware support for encryption and network offload. On-chip cache is expected to remain the same.
Feh (Score:3, Insightful)
Re:Feh (Score:1)
What if you want a better solution than the ones that are normally available?
Re:Feh (Score:2)
I.e., this is not a better solution than the ones that are normally available.
Will never work properly.... (Score:4, Informative)
Now even if you custom-code an application to do all floating point work in a specific thread, you would need to completely modify the kernel's thread management subsystems. The threads themselves would need meta-flag data to signify what "kind" of thread they are, so that the "floating point" thread(s) are queued to run on the GPU and not on the T1 (unless there are idle T1 cores and the GPU is already busy).
And even if you have the above changed, the only thing this will work for is custom-made applications; in other words, you will need to completely rewrite anything and everything to take advantage of this setup. That really isn't viable when you may be dealing with non-open-source products like Matlab or Oracle. Even with open source products, it would take MAJOR rework to implement a change like this.
The T1 is designed for what it is: a multi-core processor that would make a very good NFS data server, FTP server, or web host with highly efficient power usage. It is NOT a database, application, or HPC server core. Too many of the latter workloads involve too many floating point operations to run efficiently on the T1. In a pinch you can use it for them, but it will not shine in that role.
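In user space, the tagging idea above looks something like this toy sketch. The queues, the "kind" flag, and the pretend GPU side are all invented for illustration; the hard part the comment describes is doing the equivalent inside the kernel scheduler, not in a Python script.

```python
import queue

# one work queue per "device"; a real system would have a GPU worker
# draining gpu_q and T1 hardware threads draining cpu_q
cpu_q, gpu_q = queue.Queue(), queue.Queue()

def dispatch(task):
    # route by the task's meta flag instead of letting the scheduler
    # treat every thread identically
    (gpu_q if task["kind"] == "float" else cpu_q).put(task)

dispatch({"kind": "float", "op": "matmul"})
dispatch({"kind": "int",   "op": "hash"})
print(gpu_q.qsize(), cpu_q.qsize())  # 1 1
```

Even this trivial version shows why applications would need rewriting: the "kind" flag has to be attached by the programmer, because nothing in an existing binary carries that information.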
Re:Will never work properly.... (Score:2)
I have never written one, but I have written btrees and hash algorithms, and they never used floating point.
For a database server I would guess you would tend to be IO bound.
You do have a point in that the T1 is a good platform for a web server or file server but not ideal for many other tasks. I wonder what its SSL performance is like?
Re:Will never work properly.... (Score:1)
DBMSes don't require FPU performance, since they don't issue floating point instructions. The app server market is also dominated by integer workloads; think Java and J2EE app servers as an example.
The T1 looks like an exceptionally effective Java/J2EE platform from the slew of great benchmark results Sun has published for the platform. It is also no slouch as a DBMS platform, as its SAP results show. It does lack single threaded performance, so it's going to be
GPUs == Worthless Floating Point Precision (Score:4, Insightful)
nVidia & IBM/Sony/Cell/Playstation can perform only 32-bit single-precision floating point calculations in hardware. [IBM/Sony can, at least in theory, perform 64-bit double-precision floating point calculations, but the implementation involves some weird software emulation thingamabob which invokes a massive performance penalty.]
ATi is even worse - last I checked, they could perform only 24-bit "three-quarters"-precision floating point calculations in hardware.
And just in case you aren't aware, 32-bit single-precision floats are essentially worthless for anyone doing even the simplest mathematical calculations; for instance, with 32-bit single-precision floats, integer granularity is lost at 2^24 = 16M, i.e.
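The 2^24 granularity point is easy to check yourself; a minimal sketch, using Python's struct module to round a double through the IEEE-754 single-precision format:

```python
import struct

def to_f32(x):
    # round-trip a Python double through a 4-byte IEEE-754 single
    return struct.unpack('f', struct.pack('f', x))[0]

big = float(2 ** 24)               # 16,777,216
print(to_f32(big + 1) == big)      # True: in single precision, the +1 is lost
print(to_f32(big - 1) == big - 1)  # True: below 2^24, integers are still exact
```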
Now while 64-bit double-precision floats [or "doubles"] are probably accurate enough for most financial calculations, where, generally speaking, accuracy is only needed to the nearest 1/100th [i.e. to the nearest cent], 64-bit doubles are still more or less worthless to the mathematician, physicist, and engineer. For instance, consider the work of Professor Kahan at UC-Berkeley:
In particular, read a few of these papers from the late nineties: At the time, Kahan was arguing in favor of using the full power of the Intel/AMD 80-bit extended-precision doubles [i.e. embedding 64-bit doubles in an 80-bit space, performing calculations with the greater accuracy afforded therein, and then rounding the result back down to 64 bits and returning that as your answer], but, truth be told, the Sine Qua Non of hardware-based calculation is true 128-bit "quad-precision" floating point, as performed in hardware. Sun has a "quad-precision" floating point number for Solaris/SPARC, but, sadly, it's a software hack, and, like IBM/Sony/Cell/Playstation, far too slow to be used in practice.
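Kahan's "compute wide, round once at the end" argument can be illustrated with Python's decimal module standing in for the x87's 80-bit extended format. This is an analogy, not the real hardware path: 28-digit Decimal plays the role of the wider intermediate.

```python
from decimal import Decimal

# In 64-bit doubles, 1e16 + 1.0 rounds straight back to 1e16 (the spacing
# between doubles at that magnitude is 2), so the small term vanishes:
naive = (1e16 + 1.0) - 1e16
print(naive)  # 0.0

# Carry the intermediate in extra precision and round only at the end,
# and the small term survives:
wide = (Decimal(10) ** 16 + 1) - Decimal(10) ** 16
print(wide)   # 1
```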
I believe that IBM makes a chip for the Z-Series mainframe which can perform 128-bit arithmetic in hardware, but I imagine that it's prohibitively expensive [if you could even convince IBM to sell it to you in the first place].
The best configuration here would probably look like a fancy-schmantzy Digital Signal Processor [DSP] chipset, from someone like Texas Instruments, capable of 128-bit hardware calculations, mounted onto a card that would plug into something very fast, like a 16x PCIe bus, which in turn would be connected to a HyperTransport bus [but boy, wouldn't it be really cool if the DSP lay directly on the HyperTransport bus itself?].
By the way, if anyone knows of a company that's making such a card, with stable drivers [or, God forbid, a motherboard with a socket for a 128-bit DSP on the HyperTransport bus], then please tell me about it, 'cause I'd be very interested in purchasing such a thing.
Re:GPUs == Worthless Floating Point Precision (Score:2)
For in-hardware calculation, yes. For a quick approximation or when the result has no serious consequences, yes. For anyone serious about getting the correct answer, no no no
We (by which I mean CS, math, and hard-science folks) have known since the earliest days of floating point that it has inherent, unavoidable flaws that no arbitrary fixed number of
Re:GPUs == Worthless Floating Point Precision (Score:1, Informative)
Not cranky and old enough.
If you care about your answer, no matter how many bits the FPU supports, you do it in software. Period. You use GMP, and don't round until the final result... and while that might not always prove possible due to having finite memory, I highly doubt we'll ever see even a 1024-bit F
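For the record, Python's fractions module makes the "don't round until the final result" point concrete; it stands in here for GMP-style exact arithmetic (GMP itself is a C library, so this is the same idea in a different wrapper):

```python
from fractions import Fraction

# Binary floating point rounds 0.1 at every step, so error accumulates:
print(sum([0.1] * 10) == 1.0)            # False

# Exact rationals never round; convert to float or string only at the end:
print(sum([Fraction(1, 10)] * 10) == 1)  # True
```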
Re:GPUs == Worthless Floating Point Precision (Score:1, Insightful)
Just, for the record. Cell uses no "software emulation" for their double calculations. It's 7 cycle latency to do two DP multiply-add, which is certainly not slow. The "slow" part is that the throughput is also 7 cycles, meaning that multiple DP MADDs don't pipeline. So, while this cuts the t
Re:GPUs == Worthless Floating Point Precision (Score:2)
The CDC 6600's single precision arit
Re:GPUs == Worthless Floating Point Precision (Score:2)
The error in floating point calculations is supposed to be roughly 2^-N, where N is the number of bits. Although some ALGORITHMS can be unstable, because they use series of operations that greatly increase error, many useful algorithms can be accurately
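Compensated (Kahan) summation is the classic example of choosing a stable algorithm: it keeps the accumulated error near 2^-N regardless of how long the series of additions is. A minimal sketch:

```python
def kahan_sum(xs):
    # compensated summation: 'c' carries the low-order bits that each
    # addition would otherwise discard
    total, c = 0.0, 0.0
    for x in xs:
        y = x - c
        t = total + y
        c = (t - total) - y
        total = t
    return total

vals = [0.1] * 1000
# the compensated sum is at least as close to 100 as the naive one
print(abs(kahan_sum(vals) - 100.0) <= abs(sum(vals) - 100.0))  # True
```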
Re:GPUs == Worthless Floating Point Precision (Score:2)
You may not be aware, but AMD just released the new HyperTransport spec version, and it includes, along with the usual speed and signaling improvements, externally connected devices.
Re:GPUs == Worthless Floating Point Precision (Score:2)
I'm sure quad (or possibly even octuple) precision floats could be implemented in that bad boy.
As I said in an earlier thread, this has my intel fanboi status at risk...
-nB
Framebuffer in UST1 (Score:1)