
Intel's Knights Landing — 72 Cores, 3 Teraflops

Soulskill posted about 10 months ago | from the go-big-or-go-home dept.


New submitter asliarun writes "David Kanter of Realworldtech recently posted his take on Intel's upcoming Knights Landing chip. The technical specs are massive, showing Intel's new-found focus on throughput processing (and possibly graphics). 72 Silvermont cores with beefy FP and vector units, mesh fabric with a tile-based architecture, DDR4 support with a 384-bit memory controller, QPI connectivity instead of PCIe, and 16GB of on-package eDRAM (yes, 16GB). All this should ensure throughput of 3 teraflop/s double precision. Many of the architectural elements would also be the same as Intel's future CPU chips — so this is also a peek into Intel's vision of the future. Will Intel use this as a platform to compete with nVidia and AMD/ATI on graphics? Or will this be another Larrabee? Or just an exotic HPC product like Knights Corner?"


Imagine (3, Funny)

Konster (252488) | about 10 months ago | (#45867485)

Imagine a Beowulf cluster of these!

Re:Imagine (0)

Anonymous Coward | about 10 months ago | (#45867537)

Only if this time Intel manages to produce more than one CES sample.

Re:Imagine (1)

Anonymous Coward | about 10 months ago | (#45867609)

That would be... one unit?

Re:Imagine (1)

imevil (260579) | about 10 months ago | (#45867673)

It wouldn't be very different than the most powerful supercomputer in the world: http://top500.org/system/177999 [top500.org]

Re:Imagine (0)

dreamchaser (49529) | about 10 months ago | (#45867709)

But does it run Linux?

Re:Imagine (0)

Anonymous Coward | about 10 months ago | (#45868023)

I'd rather imagine one screaming "I'll bite your legs off!!".

Imagine, 2 (2, Funny)

Futurepower(R) (558542) | about 10 months ago | (#45868273)

Imagine having one of those in your smartphone. You could answer text messages 1 microsecond faster. The battery life wouldn't be good.

Yay more cores that I won't be using much of! (0, Troll)

Press2ToContinue (2424598) | about 10 months ago | (#45867497)

Because you can never have too many cores that you aren't using most of the time.

How about more speed? Or is that too hard?

Re:Yay more cores that I won't be using much of! (1, Insightful)

Frosty Piss (770223) | about 10 months ago | (#45867507)

Because you can never have too many cores that you aren't using most of the time.

Ask the NSA, they might have a (SECRET) opinion on that.

Re:Yay more cores that I won't be using much of! (3, Insightful)

Anonymous Coward | about 10 months ago | (#45867523)

Yes, it's too hard. The future is in concurrency. The actor model will probably take off since it's easy to pick up and use.

Re:Yay more cores that I won't be using much of! (2, Insightful)

icebike (68054) | about 10 months ago | (#45867587)

Because you can never have too many cores that you aren't using most of the time.

How about more speed? Or is that too hard?

Pretty sure it wasn't meant for you (or me).

Re:Yay more cores that I won't be using much of! (4, Insightful)

H0p313ss (811249) | about 10 months ago | (#45867649)

Because you can never have too many cores that you aren't using most of the time.

How about more speed? Or is that too hard?

Pretty sure it wasn't meant for you (or me).

However, for servers, including hypervisors, it would be very interesting. There are lots of client/server products that scale better with more cores.

Re:Yay more cores that I won't be using much of! (0)

hairyfeet (841228) | about 10 months ago | (#45868589)

But this is just a bunch of Atom cores....who wants that? I've had plenty of Intel Atoms and AMD Bobcats go through the shop and...while they are good at your basic websurfing and the Bobcats make good media tanks thanks to the better GPUs....sigh...I really REALLY wouldn't want to do any heavy lifting with 'em.

Maybe things have changed since I was working SMB, but back in my day the servers were doing heavy number crunching, something I wouldn't want to do on an Intel Atom, I don't care how many of them you jam together. Now maybe if you had a Xeon to pass off the heavy jobs to...but in that case an ARM chip to do the light work would make more sense. Sigh, sounds like another "tech demo for shit you'll never see IRL" like their Laredo or whatever it was called.

Re:Yay more cores that I won't be using much of! (1)

koan (80826) | about 10 months ago | (#45867595)

Doesn't multi-core imply more speed, if not by clock then by efficiency?

Requires parallelism (5, Informative)

tepples (727027) | about 10 months ago | (#45867617)

Multicore implies more speed only if your process is parallelized. Not all interactive processes on a single-user computer can be, wrote Amdahl [wikipedia.org] .

Re: Requires parallelism (0)

Anonymous Coward | about 10 months ago | (#45867669)

Good to know. I went from a 3 GHz dual-core to a quad-core at the same frequency and it was faster at rendering video and 3D.

Embarrassingly parallel (4, Informative)

tepples (727027) | about 10 months ago | (#45867773)

You saw a speed-up because video and 3D are in a class of problems that are very easy to parallelize [wikipedia.org] . So is decompressing all the images in an HTML document. Laying out the document, on the other hand, isn't so easy to parallelize, if only because every floating box theoretically affects all the boxes that follow it.

Amdahl's Law (0)

Anonymous Coward | about 10 months ago | (#45867721)

Actually, Amdahl described the theoretical speed-up given the percentage of a process that can be parallelized and the number of processors. Let the latter go towards infinity and you get the maximum theoretical speed-up / minimum theoretical run-time.
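
For reference, the parent's point in a few lines of C (a minimal sketch; the function name is mine, the formula is the standard statement of Amdahl's law):

#include <stdio.h>

/* Amdahl's law: speedup of a workload whose parallelizable fraction is p,
 * run on n processors. As n goes to infinity this tends to 1 / (1 - p). */
static double amdahl_speedup(double p, double n)
{
    return 1.0 / ((1.0 - p) + p / n);
}

int main(void)
{
    /* Example: 95% parallelizable code tops out at 20x, no matter the core count. */
    printf("72 cores:  %.2fx\n", amdahl_speedup(0.95, 72.0));
    printf("infinity:  %.2fx\n", 1.0 / (1.0 - 0.95));
    return 0;
}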

Re:Amdahl's Law (1)

tepples (727027) | about 10 months ago | (#45867759)

In practice, the percentage of a process on a single-user system that can be parallelized is rarely 100 percent. If one holds the performance of a core constant, even a 1000 core system will still run as slowly as a 1 core system on the fraction that cannot.

Re:Amdahl's Law (2)

sjames (1099) | about 10 months ago | (#45868035)

Keep in mind, Amdahl's law can be expanded to all the processes that make up a system. Even if you are using a single-process program, it can benefit from not having to share its core with the various system processes.

If the program uses async I/O, that counts as parallelism.

Then a dual core should be plenty (1)

tepples (727027) | about 10 months ago | (#45868145)

Even if you are using a single-process program, it can benefit from not having to share its core with the various system processes.

Then there's not really much of a benefit to adding more than a dual core, which will probably end up running the application with which the user is interacting on one core and the background applications and system processes on the other. To go beyond that, you have to either parallelize the application, run more than one CPU-bound application at once (which most desktop PC users tend not to do), or run more than one user at once using dual monitors, dual keyboards, and dual mice (which most desktop PC operating systems tend not to support).

If the program uses async I/O, that counts as parallelism.

That counts as being I/O bound, and if all your processes are I/O bound, even a single core with simultaneous multithreading is enough.

Re:Then a dual core should be plenty (1)

phantomfive (622387) | about 10 months ago | (#45868275)

Then there's not really much of a benefit to adding more than a dual core, which will probably end up running the application with which the user is interacting on one core and the background applications and system processes on the other.

Wow, I just realized you are right, and got depressed.

Re:Then a dual core should be plenty (2)

sjames (1099) | about 10 months ago | (#45868309)

Not necessarily. A process could be CPU bound and prefer not to make it worse by also waiting for I/O completion. Let another core drive the filesystem and talk to the block device (which might be a soft RAID).

Frequently enough, my system is busy compressing video or doing large compiles in the background while I work in the foreground.

If all you're doing is word processing, single thread speed isn't all that important either since it's mostly waiting for you to press a key.

Re:Then a dual core should be plenty (1)

Anonymous Coward | about 10 months ago | (#45868357)

I've encountered instances where multiple cores helped the user experience because it distributed the CPU use by the malware on the machine.

Re:Then a dual core should be plenty (1)

sjames (1099) | about 10 months ago | (#45868377)

Sad but true.

Re:Then a dual core should be plenty (2)

tepples (727027) | about 10 months ago | (#45868417)

To go beyond that, you have to either parallelize the application, run more than one CPU-bound application at once (which most desktop PC users tend not to do)

Let another core drive the filesystem and talk to the block device (which might be a soft RAID).

Frequently enough, my system is busy compressing video or doing large compiles in the background while I work in the foreground.

Then you're not most users. I was under the impression that most users tend not to use soft RAID 5/6 or CPU-intensive file systems, compress large videos, or do large compiles. I too compress video and do compiles, but geeks such as you and myself are edge cases.

Re:Then a dual core should be plenty (1)

dbIII (701233) | about 10 months ago | (#45868523)

Most users are not going to be able to justify the likely expense of the things the article is about. The edge cases are the ones that will think it's worth putting up the cash.

Re:Then a dual core should be plenty (1)

sjames (1099) | about 10 months ago | (#45868551)

While most people probably don't do large compiles, the video compression is just for shows I record. In my case, it just happens to happen on a PC, others might use an appliance for that. My filesystem isn't particularly CPU intensive but no filesystem uses zero cycles.

The people not doing any of that probably wouldn't utilize the full speed of a single core either, so it's not much of an issue.

Re:Then a dual core should be plenty (1)

tepples (727027) | about 10 months ago | (#45868649)

the video compression is just for shows I record.

For shows you record from OTA, cable, or satellite, it doesn't have to be significantly faster than real time. How many tuners does your PC have? You could put one video encode on each core, plus another core for the audio encodes. But then I confess ignorance as to how much CPU power it takes to encode video at, say, full 1080p/24.

My filesystem isn't particularly CPU intensive but no filesystem uses zero cycles.

True, which is why the file system would probably run on the second core of a dual core along with the rest of the "system processes".

Re:Then a dual core should be plenty (1)

dbIII (701233) | about 10 months ago | (#45868517)

Yes, we get it: it's not for everyone, and there is still a lot of braindead software stuck in 1995 that should be multithreaded (due to the problem it is solving) but isn't, let alone the stuff that is going to be stuck on one thread forever. Meanwhile, at least some stuff can use this thing.
For a lot of people, bucketloads of memory is a better deal than large numbers of cores. For others there is no problem pegging all cores at 100% for days on end.

There's been this sort of discussion here ever since the two-socket boards for the Celeron 300A were cheap. For some people it was overkill, but I never wanted to go back to one core.

Re:Requires parallelism (1)

Max Threshold (540114) | about 10 months ago | (#45867765)

More to the point, even if they can be, there's no guarantee that they are. Most existing desktop software won't benefit much from multiple cores.

Re:Requires parallelism (2)

Morpf (2683099) | about 10 months ago | (#45867771)

I think you'd be surprised how many real-world day-to-day tasks can be and are parallelized: almost everything concerning audio and video (images or movies), searching, analyzing, rendering web pages, compiling, computing physics and AI for games.

I can't think of one compute-intensive day-to-day action that is not parallelized or wouldn't be easy to parallelize.

I fail to see parallelism in CSS flow (3, Insightful)

tepples (727027) | about 10 months ago | (#45867869)

I think you'd be surprised how many real-world day-to-day tasks can be and are parallelized: [...] searching

I thought searching a large collection of documents was disk-bound, and traversing an index was an inherently serial process. Or what parallel data structure for searching did I miss?

rendering web pages

I don't see how rendering a web page can be fully parallelized. Decoding images, yes. Compositing, yes. Parsing and reflow, no. The size of one box affects every box below it, especially when float: is involved. And JavaScript is still single-threaded unless a script is 1. being displayed from a web server (Chrome doesn't support Web Workers in file:// for security reasons), 2. being displayed on a browser other than IE on XP, IE on Vista, and Android Browser <= 4.3 (which don't support Web Workers at all), and 3. not accessing the DOM.

compiling

True, each translation unit can be compiled in parallel if you choose not to enable whole-program optimization. But I don't see how whole-program optimization can be done in parallel.

Re:I fail to see parallelism in CSS flow (1)

msobkow (48369) | about 10 months ago | (#45867885)

High performance RDBMS indexes do indeed parallelize scans and index searches.

Re:I fail to see parallelism in CSS flow (1)

msobkow (48369) | about 10 months ago | (#45867889)

WTF is going on here? I typed "engines", not "indexes".

Is slashdot now EDITING posts before publishing them, or is Firefox screwing with me?

Re:I fail to see parallelism in CSS flow (2, Funny)

Anonymous Coward | about 10 months ago | (#45867961)

Are you in Colorado?

Re:I fail to see parallelism in CSS flow (0)

Anonymous Coward | about 10 months ago | (#45868487)

Confronted with the possibility that either you mistyped a word or what is essentially malicious magic happened, you chose malicious magic. That's amazingly indicative of your thought processes.

Re:I fail to see parallelism in CSS flow (1)

msobkow (48369) | about 10 months ago | (#45868693)

Confronted with the fact that I proof-read my post, hit submit, and the comment posted was different than what I'd just proof-read, yes, I do presume something is fucking with the system.

compilation often not just one single program (1)

Chirs (87576) | about 10 months ago | (#45868371)

In my experience, most cases where compilation takes a long time involve multiple compilation units. I have a fair bit of experience with compiling Linux distros professionally... when you're building glibc and the kernel and five hundred other packages, it'll use as many cores as you can throw at it.

-fwhole-program --combine (1)

tepples (727027) | about 10 months ago | (#45868459)

True, each translation unit can be compiled in parallel if you choose not to enable whole-program optimization. But I don't see how whole-program optimization can be done in parallel.

In my experience, most cases where compilation takes a long time involve multiple compilation units.

That's what I said. But a lot of times nowadays, the compiler is set to perform whole-program optimization [wikipedia.org] on release builds to try to save cycles even in calls from a function in one translation unit of a program to a function in another. Mozilla's Firefox web browser, for example, is so big that it can't be compiled with profile-guided whole-program optimization on 32-bit machines [slashdot.org] . But I'll grant that a multi-core CPU speeds up debug builds.

when you're building glibc and the kernel and five hundred other packages

Not many people are maintainers of an operating system distribution.

Re:I fail to see parallelism in CSS flow (0)

Anonymous Coward | about 10 months ago | (#45868405)

Actually, I remember seeing at least one talk on research into parallel rendering of web pages. I don't remember exactly how it worked, but they had to accept that in the worst case, parallelism would gain them nothing, for the reasons you state. Web browser rendering is just a high-value enough application that dealing with the awful lack of parallelizability in the general case is worth the trouble.

Re:I fail to see parallelism in CSS flow (1)

Morpf (2683099) | about 10 months ago | (#45868475)

I think you'd be surprised how many real-world day-to-day tasks can be and are parallelized: [...] searching

I thought searching a large collection of documents was disk-bound, and traversing an index was an inherently serial process. Or what parallel data structure for searching did I miss?

Searching a large collection of non-indexed documents from disk is likely disk-bound, yes - unless you have somehow formulated a very complex search or you stream from multiple disks at a time - but maybe you are searching data already in RAM. Traversing an index isn't necessarily a serial process, depending on your data structure. There are parallel implementations of binary and red-black trees, as far as I know. Or one could simply use a forest of as many trees as one has searching threads (this will perform worse when using fewer threads than trees). If you only have a sorted list or array you could use a parallel search. If your data is not indexed you are likely to be faster with multiple threads (if there is no other bottleneck like, for example, disk throughput). Maybe you are searching for multiple things at the same time (like a string in both the authors and the contents of e-mails) or you are searching with multiple parameters (filetype [type], last access after [date], string in content [foo]) where not all parameters are indexed.
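
To make the partitioning idea concrete, a minimal sketch in C with pthreads (the fixed 4-thread split, the document array, and the needle are all just placeholders):

#include <pthread.h>
#include <stdio.h>
#include <string.h>

#define NTHREADS 4

struct chunk { const char **docs; size_t count; const char *needle; long hits; };

/* Each thread scans its own slice of the document array; no locking is needed
 * because every thread writes only to its own hit counter. */
static void *scan(void *arg)
{
    struct chunk *c = arg;
    for (size_t i = 0; i < c->count; i++)
        if (strstr(c->docs[i], c->needle))
            c->hits++;
    return NULL;
}

int main(void)
{
    const char *docs[] = { "alpha beta", "gamma delta", "beta gamma", "delta epsilon" };
    size_t n = sizeof docs / sizeof docs[0];
    size_t per = (n + NTHREADS - 1) / NTHREADS;
    pthread_t tid[NTHREADS];
    struct chunk parts[NTHREADS];

    for (int t = 0; t < NTHREADS; t++) {
        size_t lo = t * per, hi = (lo + per > n) ? n : lo + per;
        parts[t] = (struct chunk){ docs + lo, hi > lo ? hi - lo : 0, "beta", 0 };
        pthread_create(&tid[t], NULL, scan, &parts[t]);
    }
    long total = 0;
    for (int t = 0; t < NTHREADS; t++) {
        pthread_join(tid[t], NULL);
        total += parts[t].hits;
    }
    printf("matches: %ld\n", total);
    return 0;
}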

rendering web pages

I don't see how rendering a web page can be fully parallelized. Decoding images, yes. Compositing, yes. Parsing and reflow, no. The size of one box affects every box below it, especially when float: is involved. And JavaScript is still single-threaded unless a script is 1. being displayed from a web server (Chrome doesn't support Web Workers in file:// for security reasons), 2. being displayed on a browser other than IE on XP, IE on Vista, and Android Browser <= 4.3 (which don't support Web Workers at all), and 3. not accessing the DOM.

I never stated that my problems are 100% parallelizable. ;) Parsing: why not? Reflow: and what if I have multiple boxes at the same layer? At least as long as the dimensions are fixed or bounded, some parallel processing could be possible; whether it would actually be a benefit I can't tell.
Often enough there is more than one page open at a time. With every open page the likelihood of executing multiple JavaScripts rises, and with multiple pages getting rendered at the same time you can use parallelism, too.

compiling

True, each translation unit can be compiled in parallel if you choose not to enable whole-program optimization. But I don't see how whole-program optimization can be done in parallel.

Many steps can be parallelized, though not all, as you pointed out. And even then I am not sure there wouldn't be a solution for whole-program / link-time optimization, but I'm not an expert on compiler construction. And even then: I happen to compile multiple binary files with one run of make most of the time, so using multiple threads comes for free (there is a reason make has the -j option).

Re:I fail to see parallelism in CSS flow (1)

tepples (727027) | about 10 months ago | (#45868559)

If your data is not indexed you are likely to be faster with multiple threads (if there is no other bottleneck like, for example, disk throughput).

Or RAM throughput.

Parsing: Why not?

Sure, the browser can parse multiple CSS files or multiple HTML files or multiple JavaScript files at once, just as the browser can decode multiple images at once. But the parser for a single file is a state machine. In order to "drop the needle" halfway into the byte stream and start parsing the second half on the second core, the parser would first have to know what state the state machine was in as of halfway into the stream. What parallelization were you thinking of?

And if I have multiple boxes at the same layer?

Once the browser finishes loading stylesheets, one or more of the stylesheets can alter the "same layer" status. And even then, web browsers are a huge target for exploits, and it might be harder to prove correctness and thread safety of parallel layout code than of serial layout code, especially when it has to fall back and start over serially if a float turns up.

Often enough there is more than one page opened at a time.

But only one page at a time is the visible tab in the frontmost window. True, it's possible to have multiple pages visible in multiple browser windows on a desktop operating system, but browsers for mobile devices have only one window. And if people who post comments to Slashdot stories about Windows 8 or web site styling are to be believed, "most" users maximize all browser windows even on desktop operating systems.

And even then: I happen to compile multiple binary files with one run of make most of the time, so using multiple threads comes for free (there is a reason make has the -j option).

So do I, and this works well for projects that don't use whole-program optimization.

Re:Requires parallelism (0)

Anonymous Coward | about 10 months ago | (#45868381)

Parsing cannot efficiently be done in parallel.

Re:Requires parallelism (1)

tepples (727027) | about 10 months ago | (#45868627)

If you have one HTML document, another HTML document in an IFRAME, four CSS style sheets, and four JavaScript programs, you can parse each on a core.

Re:Requires parallelism (1)

Entropius (188861) | about 10 months ago | (#45868489)

Are there really that many interactive processes on a single-user computer that are

1) CPU-bound
2) not parallelizable
3) take long enough that waiting on them gets annoying?

I ask out of genuine curiosity; I can't think of many times when I wind up waiting on my computer to do anything that fits.

Re:Requires parallelism (1)

Entropius (188861) | about 10 months ago | (#45868493)

edit: Compiling is one, definitely. Forgot about that.

Reflow in web browsers and word processors (1)

tepples (727027) | about 10 months ago | (#45868617)

As I wrote elsewhere [slashdot.org] : laying out a web page that includes float-styled elements. That fits 1) and 2), and it fits 3) on a netbook or tablet with an ARM or Atom processor. Or repaginating a document in a word processor, which happens every time the user enters enough text to make the current paragraph one line longer, deletes enough to make it one line shorter, or changes the styling of any span of text. Repagination may affect figures, references to page numbers elsewhere in the document, etc. Repaginating text after the visible page can be deferred unless there's a "See page n" elsewhere in the document, which may even end up triggering repagination of text before the edit if the new page number has more or fewer digits than the old page number.

Also the PBKDF2 key stretching used to connect to a WPA2 access point, when run on a similarly slow machine.
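
For the curious, that WPA2 derivation is PBKDF2-HMAC-SHA1 with the SSID as salt and 4096 iterations; a minimal sketch using OpenSSL's PKCS5_PBKDF2_HMAC_SHA1 (the passphrase and SSID below are made up, and it assumes linking against libcrypto):

#include <openssl/evp.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    const char *passphrase = "correct horse battery staple";  /* hypothetical */
    const char *ssid = "MyAccessPoint";                       /* hypothetical */
    unsigned char psk[32];                                    /* 256-bit pairwise master key */

    /* WPA2-PSK: PMK = PBKDF2-HMAC-SHA1(passphrase, SSID, 4096 iterations, 32 bytes).
     * The 4096 iterations are the key-stretching cost that eats CPU on slow machines. */
    if (!PKCS5_PBKDF2_HMAC_SHA1(passphrase, (int)strlen(passphrase),
                                (const unsigned char *)ssid, (int)strlen(ssid),
                                4096, (int)sizeof psk, psk))
        return 1;

    for (size_t i = 0; i < sizeof psk; i++)
        printf("%02x", psk[i]);
    printf("\n");
    return 0;
}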

Also compressing a large still image. I don't see how the DEFLATE codec used by, say, PNG can be parallelized.

Re:Requires parallelism (1)

dbIII (701233) | about 10 months ago | (#45868495)

However, there are plenty that are: geophysics, biochemistry, engineering, and even editing home movies.
I'd love some of these if they come in with better price/performance than an AMD system, or even if they just beat it by a lot on performance without being ten times the cost (the sad state of the very top end of Xeons right now).

Re:Requires parallelism (1)

Darinbob (1142669) | about 10 months ago | (#45868701)

That many cores implies a big fat pipe to memory as well. Sure, they have local cache, but memory is going to be the bottleneck here even with parallelized computation.

mhmm (0)

Anonymous Coward | about 10 months ago | (#45867785)

implies it, sure.

Re:Yay more cores that I won't be using much of! (1)

godrik (1287354) | about 10 months ago | (#45867613)

That's an HPC processor. You are unlikely to deploy it in a classic desktop/laptop for a while. Think of it as a classic coprocessor.

Re:Yay more cores that I won't be using much of! (0)

Anonymous Coward | about 10 months ago | (#45867655)

Yes, that really sucks that you will be forced to buy one of these things.

Re:Yay more cores that I won't be using much of! (1)

dreamchaser (49529) | about 10 months ago | (#45867713)

It depends on the use case. There are many applications where this would shine. Sure if you want to play Quake 3 Arena it's not going to give you much at all, but if you're doing parallel processing for scientific or engineering applications this would rock.

Re:Yay more cores that I won't be using much of! (5, Funny)

morcego (260031) | about 10 months ago | (#45867909)

Because you can never have too many cores that you aren't using most of the time.

Install McAfee Antivirus, and problem solved: no more unused cores.

Re:Yay more cores that I won't be using much of! (1)

Entropius (188861) | about 10 months ago | (#45868443)

This isn't intended for you if you can't think of what to do with all those cores.

This is for the high performance physics folks to whom the difference between 16 cores, 256 cores, and maybe even 8192 cores is a line in a config file.

It's also for the folks developing 24 megapixel RAW files (which Nikon's cheapest SLR spits out these days), where splitting the image into 64 sectors is no more difficult than splitting it into four, or for the folks doing video encoding which is pretty trivially parallelizable.
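
A minimal sketch of that kind of tiling with OpenMP (the 6000x4000 frame, the 8x8 grid, and the dummy per-pixel operation are placeholders, not anything Nikon- or RAW-specific):

#include <stdint.h>
#include <stdlib.h>

#define W 6000      /* roughly a 24 MP frame, e.g. 6000 x 4000 */
#define H 4000
#define GRID 8      /* 8 x 8 = 64 independent tiles */

/* Placeholder per-pixel work on one tile. */
static void process_tile(uint16_t *img, int x0, int y0, int x1, int y1)
{
    for (int y = y0; y < y1; y++)
        for (int x = x0; x < x1; x++)
            img[(size_t)y * W + x] /= 2;
}

int main(void)
{
    uint16_t *img = calloc((size_t)W * H, sizeof *img);
    if (!img) return 1;

    /* Each tile touches disjoint pixels, so the 64-tile loop parallelizes trivially. */
    #pragma omp parallel for collapse(2) schedule(static)
    for (int ty = 0; ty < GRID; ty++)
        for (int tx = 0; tx < GRID; tx++)
            process_tile(img,
                         tx * W / GRID, ty * H / GRID,
                         (tx + 1) * W / GRID, (ty + 1) * H / GRID);

    free(img);
    return 0;
}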

Most of the times that I can think of where I'm truly waiting on my computer to do something that's limited by the number of flops that can be brought to bear, more cores is just as good as more speed.

No it cannot compete with nVidia and AMD/ATI (1)

Rockoon (1252108) | about 10 months ago | (#45867503)

Summary asks:

Will Intel use this as a platform to compete with nVidia and AMD/ATI on graphics?

...but first it says it has 16GB of eDRAM. The 128MB of eDRAM in their "Iris Pro" adds almost $200 to the price tag.

This chip is going to cost MANY THOUSANDS OF DOLLARS.

Re:No it cannot compete with nVidia and AMD/ATI (5, Informative)

rsmith-mac (639075) | about 10 months ago | (#45867711)

"eDRAM" in this article is almost certainly an error for that reason.

eDRAM isn't very well defined, but it basically boils down to "DRAM manufactured on a modified logic process," allowing it to be placed on-die alongside logic, or at the very least built using the same tools if you're a logic house (Intel, TSMC, etc). This is as opposed to traditional DRAM, which is made on dedicated processes that are optimized for space (capacitors) and follow their own development cadence.

The article notes that this is on-package as opposed to on-die memory, which under most circumstances would mean regular DRAM would work just fine. The biggest example of on-package RAM would be SoCs, where the DRAM is regularly placed in the same package for size/convenience and then wire-bonded to the processor die (although alternative connections do exist). Conversely eDRAM is almost exclusively used on-die with logic - this being its designed use - chiefly as a higher density/lower performance alternative to SRAM. You can do off-die eDRAM, which is what Intel does for Crystalwell, but that's almost entirely down to Intel using spare fab capacity and keeping production in house (they don't make DRAM) as opposed to technical requirements. Which is why you don't see off-die eDRAM regularly used.

Or to put it bluntly, just because DRAM is on-package doesn't mean it's eDRAM. There are further qualifications to making it eDRAM than moving the DRAM die closer to the CPU.

But ultimately, as you note, cost would be an issue. Even taking into account process advantages between now and the Knights Landing launch, 16GB of eDRAM would be huge. Mind-bogglingly huge. Many thousands of square millimeters huge. Based on space constraints alone it can't be eDRAM; it has to be DRAM to make that aspect work, and even then 16GB of DRAM wouldn't be small.

Re:No it cannot compete with nVidia and AMD/ATI (3, Informative)

Anonymous Coward | about 10 months ago | (#45868113)

It may not be eDRAM, but I'm not sure what else Intel would easily package with the chip. We know the 128 MB of eDRAM on 22 nm is ~80 mm^2 of silicon; currently Intel is selling ~100 mm^2 of N-1 node silicon for ~$10 or less (see all the ultra-cheap 32 nm Clover Trail+ tablets where they're winning sockets against Allwinner, Rockchip, etc., indicating that they must be selling them for equivalent or better prices than these companies). By the time this product comes out, 22 nm will be the N-1 node. In addition, a dedicated eDRAM chip is probably cheaper than a typical SoC/logic chip due to the smaller number of metal levels that are needed. Assuming N-1 node prices hold for a given area of silicon, 16 GB will need 12000 mm^2 of silicon (likely less, as the current 128 MB die likely uses a not insignificant area for readout circuitry and the PHY interface), coming out to around $1200. Add an extra $1000 for your actual processor and you have the current price of a low-end Xeon Phi.
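
(Rough cross-check of that figure: 16 GB / 128 MB = 128 copies of the ~80 mm^2 Crystalwell die, i.e. roughly 128 x 80 ≈ 10,000 mm^2, which is in the same ballpark as the ~12,000 mm^2 estimate above.)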

Re:No it cannot compete with nVidia and AMD/ATI (0)

Anonymous Coward | about 10 months ago | (#45868699)

16 GB will need 12000 mm^2 of silicon

Nice theory.

But there is no way Intel is making a 10.9x10.9cm (4.3x4.3 inches) chip. The yield would be terrible unless the whole thing is redundant, not to mention the thermal and bonding issues with a chip that large (or even fitting it in a case).

More likely it is stacked silicon with TSVs on top of or beside the CPU die in an MCM or a wafer-bonded (2.5D) package (like Xilinx's flip-chip-on-metal-carrier microbumping process). Either that or they've figured out how to push eDRAM density up significantly.

Re:No it cannot compete with nVidia and AMD/ATI (2)

im_thatoneguy (819432) | about 10 months ago | (#45867799)

An Nvidia Quadro card costs $8,000 for an 8GB card. I would consider $8,000 "many thousands of dollars". Nobody is suggesting Knights ____ is competing with any consumer chips, CPU or GPU. I have a $1,500 raytracing card in my system along with a $1,000 GPU as well as a $1,000 CPU. If this could replace the CPU and GPU while competing with a dual-CPU system for rendering performance, I would be a happy camper even if it cost $3-4k.

Programmability? (4, Informative)

gentryx (759438) | about 10 months ago | (#45867509)

I wonder how nice these will be to program. The "just recompile and run" promise for Knights Corner was little more than a cruel joke: to get any serious performance out of the current generation of MICs you have to wrestle with vector intrinsics and that stupid in-order architecture. At least the latter will apparently be dropped in Knights Landing.

For what it's worth: I'll be looking forward to NVIDIA's Maxwell. At least CUDA got the vectorization problem sorted out. And no: not even the Intel compiler handles vectorization well.

Re:Programmability? (1)

godrik (1287354) | about 10 months ago | (#45867629)

Actually, the in-order execution isn't so much of a problem in my experience. The vectorization is a real problem. But you essentially have the same problem there, except it is hidden in the programming model. And the performance problems are there as well.

Anybody who understands GPU architecture well enough to write efficient code there won't have much problem using the MIC architecture. The programming model is different but the key difficulties are essentially the same. If you think about a MIC SIMD element as a CUDA thread, the programming differences are mostly syntactical.

Re:Programmability? (2)

Arakageeta (671142) | about 10 months ago | (#45868103)

It's not entirely syntactical. Local shared memory is exposed to the CUDA programmer (e.g., __syncthreads()). CUDA programmers also have to be mindful of register pressure and the L1 cache. These issues directly affect the algorithms used by CUDA programmers. CUDA programmers have control over very fast local memory---I believe that this level of control is missing from MIC's available programming models. Being closer to the metal usually means a harder time programming, but higher performance potential. However, I believe NVIDIA has made CUDA pretty programmer-friendly, given the architectural constraints. I'd like to hear the opinions of MIC programmers, since I have no direct experience with MIC.

Re:Programmability? (2)

godrik (1287354) | about 10 months ago | (#45868187)

I don't understand. MIC is your regular cache-based architecture. Accessing L1 cache on MIC is very fast (3-cycle latency, if my memory is correct). You have similar register constraints on MIC, with 32 512-bit vectors per thread (per core, maybe). Both architectures overlap memory latency by using hardware threading.

I have programmed both MIC and GPU, mainly on sparse algebra and graph kernels. And quite frankly there are differences, but I find them much more alike than most people acknowledge. The main difference in my opinion is the programming model, where GPUs are used with millions of threads while MIC is better used with a smaller number of threads and more of a work pool. Atomics are really fast on GPUs and not so fast on MIC. But you also have much finer thread synchronization opportunities on MIC, which somewhat removes the appeal of fast atomics.

Re:Programmability? (1)

imevil (260579) | about 10 months ago | (#45867691)

I wonder how nice these will be to program. The "just recompile and run" promise for Knights Corner was little more than a cruel joke

I tried recompiling and running some OpenCL code (that previously ran on GPUs). It was "just recompile and run" and the promises about performance were kept. But still, OpenCL is not what most people consider "nice to program".

Re:Programmability? (1)

Anonymous Coward | about 10 months ago | (#45867965)

nVidia didn't sort out any vectorization problem; they dodged it completely.

They use a kind of super-scalar architecture (which they term SIMT) - many, many dumb ALUs (warp size is 32 for most consumer cards) sharing a single instruction scheduler, and half as many FPUs as ALUs per warp - very much like AMD's current CPU architecture but with far more ALUs/FPUs per scheduler - and a slightly different approach than VLIW superscalar architectures.

The downside with superscalar architectures, though, is of course branching... if 'threads' within a superscalar architecture hit a divergent branch (e.g. some go into if(a){} and others into else{}), you end up evaluating both branches sequentially (ouch!).

So while there's no vectorization problem, there certainly are others. Although I admit I find instruction/branch divergence and gather/scatter coalescing easier to deal with mentally than manually dealing with alignment, swizzling/shuffling, etc.

Re:Programmability? (2)

PhrostyMcByte (589271) | about 10 months ago | (#45868079)

Intel's AVX-512 is really friggin cool, and a huge departure from their SIMD of the past. It adds some important features -- most notably mask registers to optimally support complex branching -- which make it nearly identical to GPU coding so that compilers will have a dramatically easier time targeting it. I doubt it will kill discrete GPUs any time soon, but it's a big step in that long-term direction.
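
For a feel of what those mask registers look like from C, a minimal sketch using AVX-512F intrinsics (the function and the comparison are arbitrary illustrations, not code from any particular compiler's output; assumes each array holds at least 16 floats):

#include <immintrin.h>

/* Scalar logic being vectorized:
 *   for (i = 0; i < 16; i++)
 *       out[i] = (a[i] > b[i]) ? a[i] + b[i] : b[i];
 * With AVX-512, the branch becomes a mask register instead of a jump. */
void masked_add(const float *a, const float *b, float *out)
{
    __m512 va = _mm512_loadu_ps(a);
    __m512 vb = _mm512_loadu_ps(b);

    /* k has one bit per lane: 1 where a > b, 0 elsewhere. */
    __mmask16 k = _mm512_cmp_ps_mask(va, vb, _CMP_GT_OQ);

    /* Lanes where k is set get a + b; the others keep vb (the "src" operand). */
    __m512 vr = _mm512_mask_add_ps(vb, k, va, vb);

    _mm512_storeu_ps(out, vr);
}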

Not going to work (1)

faragon (789704) | about 10 months ago | (#45867543)

In my opinion, the whole basis of these experiments is using x86 in order to reuse units from desktop/server CPUs. The downside is having to deal with the x86 mess everywhere. This seems a desperate reaction to AMD's CPU+GPGPU, which also has drawbacks. I bet that both Intel and AMD prefer to keep the memory controller as simple as possible, giving themselves a comfortable long run without burning their ships too early. E.g. a CPU+GPGPU on the same die, with 8 x 128-bit separate memory controllers configured as NUMA (i.e. without channel interleaving/bonding), would be much better, but it would imply expensive chips, motherboards, and more DRAM chips. So I bet we'll have same-die CPU+GPU plus a simple memory controller (even with embedded RAM in a 3D package) for the next 20 years in consumer-grade products.

Re:Not going to work (1)

cnettel (836611) | about 10 months ago | (#45867569)

20 years? I would be very doubtful regarding any prediction beyond the point where current process scaling trends finally break. Note, they might break the other way. Switching to a non-silicon material might allow higher frequencies which will again shift the tradeoff between locality, energy, and production cost. But there is no reason, no reason at all, to expect the current style to last for more than ten years, while you could be quite right that it could stay much the same for the next five years or so.

Re:Not going to work (1)

viperidaenz (2515578) | about 10 months ago | (#45868545)

8 x 128-bit memory controllers? 1024 pins just for the memory bus? You've got to be kidding.

Calm down (1)

symbolset (646467) | about 10 months ago | (#45867581)

You aren't ever going to see this at Newegg.

QPI? (1)

Joe_Dragon (2206452) | about 10 months ago | (#45867589)

Too bad most Intel CPUs don't have it, and just about all socket 2011 boards don't use it. The ones that do use it for dual-CPU setups.

Too bad the Apple Mac Pro does not have this and is not likely to use it any time soon.

Bitcoin/Litecoin Performance (1)

Anonymous Coward | about 10 months ago | (#45867593)

Will this be any better on Bitcoin/Litecoin mining than anything else?

Re:Bitcoin/Litecoin Performance (0)

Anonymous Coward | about 10 months ago | (#45867663)

Nope. Might be nice for some of the CPU intensive Altcoins, but not BTC or LTC.

Re:Bitcoin/Litecoin Performance (-1)

Anonymous Coward | about 10 months ago | (#45867801)

Nope. Might be nice for some of the CPU intensive Altcoins, but not BTC or LTC.

So LTC isn't CPU intensive? You're nothing but a god damn liar. You CONservatives keep telling bold-face lies about BTC and LTC because it hurts you since it reduces the value of the wealth you have hoarded by stealing from minorities. That is why your kind always tells lies.

Re:Bitcoin/Litecoin Performance (1)

Jeremy Erwin (2054) | about 10 months ago | (#45868205)

BTC and LTC are best run on ASICs or perhaps AMD GPUs

btc hardware [bitcoin.it]
ltc hardware [litecoin.info]

Perhaps this chip will change things, but for now, CPU mining is pretty inefficient.

Re:Bitcoin/Litecoin Performance (-1)

Anonymous Coward | about 10 months ago | (#45867817)

> [deleted a bunch of Republican lies]

scrypt isn't CPU intensive? Wow, Republicans are getting even more desperate with their irrational attacks on alternative currencies. By telling lies like that, you Republicans are showing that you are wrong and have nothing to back up your incorrect position. Please stop trying to ruin /. by making everything about politics. You people are disgusting with how you do that.

Also, no one believes your lies about LTC. You failed.

Re:Bitcoin/Litecoin Performance (0)

Anonymous Coward | about 10 months ago | (#45867851)

LTC can be ASIC-mined just like BTC. It's only a matter of time. scrypt, like cryptocurrency in general, relies on technology to prevent "attack". Yeah, that's like saying "hack into my computer and you can have all the money in the world". Someone, somewhere, will do it because it's worth it. Cryptocurrencies are flawed from the start.

Re:Bitcoin/Litecoin Performance (4, Interesting)

InvalidError (771317) | about 10 months ago | (#45868409)

Bitcoin has ASIC miners with ~10X the mining power per watt of most programmable alternatives such as GPGPU and FPGA. Anything less efficient than that is, or soon will become, cost-prohibitive to run.

The newer Bitcoin alternatives use memory-bound algorithms to prevent such a steep mining-power escalation, since memory capacity and bandwidth scale much more slowly than processing power but much more quickly in cost: with Bitcoin, increasing throughput by 10X simply requires 10X the processing power, but with the memory-bound alternatives you also need 10X the RAM and 10X the memory bandwidth.

Re:Bitcoin/Litecoin Performance (1)

dbIII (701233) | about 10 months ago | (#45868549)

Just do what the other people at the top end of the bitcoin scam are doing - infect PCs with malware and let those do the mining for you.

Any day now I'm expecting media reports about how there is this nefarious web site called slashdot that is the hub of the bitcoin scam. Let's not have that please.

Form Factor (0)

Anonymous Coward | about 10 months ago | (#45867599)

Does this by any chance look like either the chip from the Terminator films, or a Borg Cube?

I mean, they've been working on 3d chips for years, right?

Yea, but can it run Crysis? (0)

Anonymous Coward | about 10 months ago | (#45867685)

How long before it's a tiny little square I can drop into my mobo's cpu slot?

fuck crysis (0)

Anonymous Coward | about 10 months ago | (#45867805)

but a stacked CPU cube built only for watercooling (coolant would travel throughout the cube) would be damn spiffy.

I read IBM was pondering something along these lines a while back.

Unobtainium (3, Insightful)

Anonymous Coward | about 10 months ago | (#45867745)

This is another one of those IBM things made from the most rare element in the universe: unobtainium. You can't get it here. You can't get it there either. At one point I would have argued otherwise, but no. CUDA cores I can get. This crap I can't get. It's just like the Cell Broadband Engine. Remember that? If you bought a PS3, then it had a (slightly crippled) one of those in it. Except that it had no branch prediction. And one of the main cores was disabled. And you couldn't do anything with the integrated graphics. And if you wanted to actually use the co-processor functions, you had to re-write your applications. And you needed to let IBM drill into your teeth and then do a rectal probe before you could get any of the software to make it work. And it only had 256MB of RAM. And you couldn't upgrade or expand that. With IBM's new wonder, we get the promise of 72 cores. If you have a dual-Xeon processor. And give IBM a million dollars. And you sign a bunch of papers letting them hook up the high-voltage rectal probes. Or you could buy a Kepler NVIDIA card which you can install into the system you already own, and it costs about the same as a half-decent monitor. And NVIDIA's software is publicly downloadable. So is this useful to me or 99.999% of the people on /.? No. It's news for nerds, but only four guys can afford it: Bill G., Mark Z., Larry P. and Sergey B.

Re:Unobtainium (1)

Guy Harris (3803) | about 10 months ago | (#45867863)

This is another one of those IBM things made from the most rare element in the universe: unobtainium

Presumably meaning "this is like those IBM things", given that, while the first word of the title begins with "I", it doesn't have "B" or "M" following it, it has "n", "t", "e", and "l", instead.

Re:Unobtainium (1)

im_thatoneguy (819432) | about 10 months ago | (#45868301)

This is x86. Theoretically your program already runs on this. You don't have to rewrite your entire application to run on CUDA.

Re:Unobtainium (2)

radarskiy (2874255) | about 10 months ago | (#45868479)

"Its news for nerds, but only four guys can afford it: Bill G., Mark Z., Larry P. and Sergey B."

I would rather have that market than all of the rest.

Bigger than the town (0)

Anonymous Coward | about 10 months ago | (#45867813)

That chip is probably 3 times the size of Knights Landing. Seriously it might be time for a new naming scheme.

How does the intercommunication work? (4, Informative)

Animats (122034) | about 10 months ago | (#45867821)

OK, we have yet another mesh of processors, an idea that comes back again and again. The details of how processors communicate really matter. Is this a totally non-shared-memory machine? Is there some shared memory, but it's slow? If there's shared memory, what are the cache consistency rules?

Historically, meshes of processors without shared memory have been painful to program. There's a long line of machines, from the nCube to the Cell, where the hardware worked but the thing was too much of a pain to program. Most designs have suffered from having too little local memory per CPU. If there's enough memory per CPU to, well, run at least a minimal OS and some jobs, then the mesh can be treated as a cluster of intercommunicating peers. That's something for which useful software exists. If all the CPUs have to be treated as slaves of a control machine, then you need all-new software architectures to handle them. This usually results in one-off software that never becomes mature.

Basic truth: we only have three successful multiprocessor architectures that are general purpose - shared-memory multiprocessors, clusters, and GPUs. Everything other than that has been almost useless except for very specialized problems fitted to the hardware. Yet this problem needs to be cracked - single CPUs are not getting much faster.

Re:How does the intercommunication work? (1)

dbIII (701233) | about 10 months ago | (#45868563)

Historically, meshes of processors without shared memory have been painful to program

Which is why we don't see those GPU cards in absolutely every place where there is a massively parallel problem to solve. Even 8GB is not enough for some stuff and you spend so much time trying to keep the things fed that the problem could already be solved on the parent machine.

Re:How does the intercommunication work? (0)

Anonymous Coward | about 10 months ago | (#45868633)

Well that's a lovely comment, but this *is* in fact a shared-memory multiprocessor machine; all 72 cores on the die share the same 16GB on-chip RAM (and any other external RAM hooked up to it), so fundamentally it's just programming a 72-way cache-coherent SMP (well ok, NUMA) machine, similar to programming your regular SMP Intel machine, like a common-or-garden laptop or server. There are some quirks, like there only being 36 L2 caches on the chip, but your compiler should take care of figuring out the best strategy to deal with that once the appropriate support is added.

Confusingly many items (0)

Anonymous Coward | about 10 months ago | (#45867833)

The article appears to be rumors mangled together: socketed chip versus PCIe board, fabric over PCIe versus over the host processor, while forgetting the chip-integrated possibility. If the DDR4 rumor is correct, it simply suggests that the coming Xeon sockets Intel uses for the HPC and low- to mid-end server markets (E5) utilize 6 directly connected memory channels with a maximum DIMM size of 64 GB.

Intel's version of an IBM/Sony Cell CPU (2)

Required Snark (1702878) | about 10 months ago | (#45868115)

This will have the same usability as the Cell CPU. From TFA:

Second, while Knights Landing can act as a bootable CPU, many applications will demand greater single threaded performance due to Amdahl’s Law. For these workloads, the optimal configuration is a Knights Landing (which provides high throughput) coupled to a mainstream Xeon server (which provides single threaded performance). In this scenario, latency is critical for communicating results between the Xeon and Knights Landing.

So there will be a useful mainstream CPU closely coupled with a bunch of vector oriented processors that will be hard to use effectively. (Also from TFA).

The rumors also state that the KNL core will replace each of the floating point pipelines in Silvermont with a full blown 512-bit AVX3 vector unit, doubling the FLOPs/clock to 32.

So unless there is a very high compute-to-memory-access ratio, this monster will spend most of its time waiting for memory and converting electrical energy to heat. Plus writing software that uses 72 cores is such a walk in the park...

Re:Intel's version of an IBM/Sony Cell CPU (1)

dbIII (701233) | about 10 months ago | (#45868587)

Plus writing software that uses 72 cores is such a walk in the park

Some stuff actually is. It depends on how trivially parallel the problem is. With some stuff there is no interaction at all between the threads - feed it the right subset of the input - process the data - dump it out.

QPI bad for NVIDIA (0)

Anonymous Coward | about 10 months ago | (#45868135)

PCIe bandwidth is often the major performance bottleneck for GPGPU applications (e.g., NVIDIA's CUDA). Even if we suppose that Knights Landing's compute performance is always _worse_ than what NVIDIA can offer, Intel, with QPI, could still push NVIDIA out of HPC for bandwidth-constrained applications. That is, NVIDIA could have the greatest GPU, but if their GPU is stuck on pokey PCIe, Knights Landing + QPI could still offer better performance. It would be neat if Intel could let NVIDIA on the QPI bus, but that might not even be technically feasible even if Intel were willing to license the technology (as if that would ever happen). AMD also has HyperTransport/Fusion to leverage for their GPGPU solutions. NVIDIA needs to find something better than PCIe.

QPI vs PCIe? (1)

unixisc (2429386) | about 10 months ago | (#45868697)

I just read up on QPI on Wikipedia, and it's a point-to-point processor interconnect which replaces the front-side bus in Xeon and certain desktop platforms - presumably the Core i7s. PCIe, OTOH, is a serial computer expansion bus standard, which can take in things like graphics cards, SSDs, network cards and other such peripheral controllers. I just don't see how QPI is any sort of a replacement for PCIe. That would almost be like arguing for PCIe being superseded by USB4 or something.

Essentially, QPI is Intel's equivalent of the HyperTransport that AMD uses. The PCIe part of it is completely separate - I doubt one will have QPI graphics cards or SSDs.

AT&T Network On A Chip (0)

Anonymous Coward | about 10 months ago | (#45868193)

Why stop at a meager 72 cores!

Why not, A Core For Every Instruction!

Then the "nasty" moves to the "central message passing unit parallelized on vectors Ernestine" what a laugh-n!

This of course ignites the "Core Arms Race"!

Then the "USA" vs "World" launch into a "Core Per Syllable Arms Race"!

The butt-zillions of oil-dollars will be spent wildly at an Above Fast N Furious Pace to the ultimate CORE THOUGHT paradigm shift.

Yet, there is the "NSA Wild Card."

[Dealer} What's your bet?

[High Roller] Hold! [;-)]

[Dealer] "Pervert fucker!" ;-)

72 slow as shit cores (0)

Anonymous Coward | about 10 months ago | (#45868287)

is still gonna be horrible except in very specific, highly multithreaded scenarios... some shared parts between cores on a tile... it simply looks like Intel is copying AMD's current architecture (Intel "tile" = AMD "module"), mixing in an advance of their own (the on-die "eDRAM"), and throwing more cores at it than even AMD has tried.

To keep thermals in check for a single-socket package, you're looking at something that draws less than 2W per core under load. That's a little less than what a current Silvermont N2805 mobile Celeron draws (4.3W for 2 cores, 1 thread per core), which delivers a whopping... wait for it... 228 PassMarks per core. The original Atom 230 scores 301 PassMarks, and we know how fast that piece of shit was. Compare to a current low-end desktop chip that runs Win Vista or newer with "usable" performance when paired with 4GB RAM, which scores about 2000 PassMarks on 2 cores.

cool (0)

Anonymous Coward | about 10 months ago | (#45868505)

When is this new CPU going to be available at Fry's?
