Re: Memory diagnostic (Followup II)

<u7ssm3$3e8sg$1@dont-email.me>

https://www.novabbs.com/computers/article-flat.php?id=6496&group=alt.windows7.general#6496

Path: i2pn2.org!i2pn.org!eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: jbb...@notatt.com (Jeff Barnett)
Newsgroups: alt.windows7.general
Subject: Re: Memory diagnostic (Followup II)
Date: Sun, 2 Jul 2023 16:11:12 -0600
Organization: A noiseless patient Spider
Lines: 252
Message-ID: <u7ssm3$3e8sg$1@dont-email.me>
References: <u7ii8v$1t8bd$1@dont-email.me> <u7lm0a$2ersj$1@dont-email.me>
<u7mi6e$2hqpr$1@dont-email.me> <u7ncmr$2klot$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: base64
Injection-Date: Sun, 2 Jul 2023 22:11:16 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="7a25993d91f1d14e9d591f8a691083e2";
logging-data="3613584"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX18wE/XRhLy8k+pf3NMiya0bRbPGF9AYd3I="
User-Agent: Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:102.0) Gecko/20100101
Thunderbird/102.12.0
Cancel-Lock: sha1:dga6Ta62/pNg7E04tN+sfBwpyZE=
Content-Language: en-US
In-Reply-To: <u7ncmr$2klot$1@dont-email.me>
 by: Jeff Barnett - Sun, 2 Jul 2023 22:11 UTC

On 6/30/2023 2:07 PM, Jeff Barnett wrote:
> On 6/30/2023 6:35 AM, Paul wrote:
>> On 6/30/2023 12:34 AM, Jeff Barnett wrote:
>>> On 6/28/2023 6:12 PM, Jeff Barnett wrote:
>>>> CONFIGURATION: My configuration consists of an Intel Core i7-4930K
>>>> Processor, on an ASUS X79-Deluxe motherboard, with G.SKILL Ripjaws Z
>>>> Series 64GB (8 x 8GB) DDR3 2400 (PC3 19200) Desktop Memory Model
>>>> F3-19200CL10Q2-64GBZHD memory. I'm running Win7 PRO SP1 64-bit.
>>>>
>>>> PROBLEM: Recently, I wanted to run a memory diagnostic so I entered
>>>> safe mode and tried to run the one supplied by ASUS with the
>>>> motherboard. It would not load; my best guess is that the ASUS
>>>> implementation uses .NET to build a cute GUI but .NET doesn't
>>>> operate in safe mode. I could still use the ASUS utility within
>>>> Windows proper but if there is a memory error, I'm afraid that
>>>> utility could cause damage.
>>>>
>>>> NEED: So I'm looking for some sort of memory diagnostic that will
>>>> run in safe mode or is able to boot from an optical disk or a thumb
>>>> drive. I'm willing to pay a few bucks or use a free one. Any
>>>> pointers would be appreciated.
>>>>
>>>> HISTORY: When we built our computers, several years ago, the ASUS
>>>> boards had difficulty supporting 64GB memory so I talked to G.SKILL;
>>>> they had no diagnostics they would share with the public (me) but
>>>> helped get the problem resolved with ASUS by board replacements and
>>>> new release of firmware. Since that time, memory has been rock solid.
>>>
>>>
>>> I want to thank everyone who replied. As a result, I downloaded
>>> Memtest, let it make a bootable thumb drive and tried to use it. No
>>> problems with it booting and loading. However, I ran into the
>>> following sequence of results that left me more puzzled than when I
>>> started:
>>>
>>> First test: It loaded and started to run. After 36 minutes it was
>>> performing Test 6 -- Block move, 64-byte blocks. At that point it
>>> froze, i.e., neither its clock nor its animations gave any sign of life.
>>>
>>> Second test: It started spewing error messages, exceeded some limit,
>>> and stopped. The first and last error messages follow:
>>>      First error message recorded: 2023-06-29 14:45:26 - [MEM ERROR
>>>      - Data] Test: 4, CPU: 0, Address: 8D305543C, Expected: 02020202,
>>>      Actual: 82020202
>>>
>>>      Last error message output: 2023-06-29 14:45:45 - [MEM ERROR -
>>>      Data] Test: 4, CPU: 0, Address: 90F292834, Expected: 40404040,
>>>      Actual: C0404040
>>>
>>> Third test: Ran for about 3 hours with no errors and finished one of
>>> four repetitions at which point I halted testing. I then realized
>>> that it had estimated the total amount of memory as about 48GB
>>> instead of the actual 64GB. However, the firmware and Windows both
>>> showed the full 64GB later.
>>>
>>> I should note that I recently have been generating blue screens, each
>>> one with a different BCC code. From that I guessed a hardware rather
>>> than a software problem, with memory as a likely candidate. If Memtest
>>> had performed more or less consistently, I would fully believe that
>>> diagnosis, but see the above results. I also don't know how to
>>> translate the information from test two to either memory slots or
>>> memory module designations.
>>>
>>> In other words, I'm stuck again. Another problem is that my memory
>>> modules are under a massive CPU cooler and difficult to remove and
>>> shuffle around. At this point, I'd be willing to pay a good tech to
>>> complete the diagnosis necessary so I could effect repairs. (I live
>>> in Albuquerque, NM and don't know of any such person around here.)
>>> Any ideas on how to proceed would be most welcome.
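The two errors logged in the second test can be compared bit by bit. Here is a minimal Python sketch that XORs the expected and actual words from the log to show which bits differ. The helper name flipped_bits is made up for illustration and is not part of Memtest.

```python
# Expected/actual 32-bit words copied from the two logged Memtest errors.
errors = [
    (0x8D305543C, 0x02020202, 0x82020202),  # first logged error
    (0x90F292834, 0x40404040, 0xC0404040),  # last logged error
]

def flipped_bits(expected, actual):
    """Return the bit positions (0 = least significant) that differ."""
    diff = expected ^ actual
    return [bit for bit in range(32) if (diff >> bit) & 1]

for address, expected, actual in errors:
    print(f"{address:#x}: flipped bits {flipped_bits(expected, actual)}")
```

Both entries report bit 31 only, i.e. the top bit of the 32-bit word read back as 1 when 0 was written. That says which data bit misbehaved, not which slot holds it.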
>>
>> Note: The following is pure hypothesis. I'm only doing this to
>>        demonstrate how hard it is for a programmer to get this right.
>>        It doesn't matter how much money you pay, chance of error is
>>        large.
>>
>>        We are, for example, relying on the Tech Writer at Asus, to have
>>        labeled the channels correctly. We are relying upon Intel doing
>>        something reasonable in the decoder.
>>
>> https://images.hothardware.com/static/articleimages/Item1811/intel-x79-blockdiagram.png
>>
>>     [Picture]  "DRAM labeling according to Asus"
>>
>>      https://i.postimg.cc/W3mzg45k/P9-X79-layout.jpg
>>
>> By not installing the second fan on the dual tower cooler,
>> both DIMM sets are equally accessible. The dual tower cooler
>> does not really need the second fan.
>>
>> *******
>>
>> With all DRAM pages open, it behaves like a dual channel
>> board, even though it is a four channel CPU. It should try
>> to interleave on all DIMMs, like this, in a recurring pattern.
>> The more DRAM pages a CPU can keep open, the faster it goes
>> (on memory-bound work). How the pages interleave (the pivot),
>> is immaterial for MEMTEST problems. Some strides in memory
>> work faster than others, and that is how you figure out where
>> pivots were placed. The pivot could be A11-A12 or so, or
>> on four channels A12-A13, and since the number of pages changed,
>> it could be even higher now. The pivot does not matter when
>> decoding addresses for memory failures. Each generation of memory,
>> has a different number of page registers inside.
>>
>>      A1(MS) B1(LS)   64byte Line Size, Burst-of-4 times 16 bytes (each DIMM 8 bytes wide)
>>      C1(MS) D1(LS)   64byte Line Size, Burst-of-4 times 16 bytes (each DIMM 8 bytes wide)
>>      A2(MS) B2(LS)   64byte Line Size, Burst-of-4 times 16 bytes (each DIMM 8 bytes wide)
>>      C2(MS) D2(LS)   64byte Line Size, Burst-of-4 times 16 bytes (each DIMM 8 bytes wide)
>>      ...
>>      pattern repeats, in attempt to keep pages open.
>>      Pages can close, considerably slowing down access (like during purely random access)
>>
>> 8D305543C = 37,900,080,188
>>
>> 90F292834 = 38,909,061,172     Expected: 40404040   Actual: C0404040
>>
>> The next error would be 256 bytes later, if one chip was "blown".
>> The next failure address would be:
>>
>> 90F292934
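The conversions and the "256 bytes later" figure above can be checked with a few lines of Python. This is only a sketch. It assumes the stride is four 64-byte lines (one each for A1B1, C1D1, A2B2, C2D2), which is the hypothesised layout above and not something confirmed by Intel or ASUS documentation.

```python
LINE_SIZE = 64        # bytes per line, from the layout sketched above
LINES_PER_CYCLE = 4   # A1B1, C1D1, A2B2, C2D2 (assumed interleave order)
STRIDE = LINE_SIZE * LINES_PER_CYCLE  # 256 bytes

first_error = int("8D305543C", 16)
last_error = int("90F292834", 16)

print(f"{first_error:,}")   # 37,900,080,188
print(f"{last_error:,}")    # 38,909,061,172

# If one chip were bad, the same chip would be hit again one stride later.
next_failure = last_error + STRIDE
print(f"{next_failure:X}")  # 90F292934
```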
>>
>> We have to develop a decoder for the bits. The 34 is what needs to be
>> decoded. Let us see what we can cook up.
>>
>>    0011 0100
>>    \/\/ ||\/
>>     | | || +---- index the 32 bit data word C0404040. This should stay at 00 always, on a failure.
>>     | | |+------ which half of DIMM, upper or lower
>>     | | +------- upper or lower channel (A1 or B1, A1 is one, B1 is zero)
>>     | +--------- burst of four is immaterial with regard to which chip, only matters to bit identification
>>     +----------- A1B1, C1D1, A2B2, C2D2 pair of indexing bits
>>
>> Error is on:
>>
>>     A1B1
>>     On B1
>>     Upper half of B1 (32 bit wide chunk)
>>     The C0404040 tells us it is the upper byte (byte lane 7 I suppose)
>>
>> I suspect B1 :-) Hahaha. Or it could be Colonel Mustard in the
>> drawing room.
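The bit diagram above can be written out as a small Python function. This is a sketch of that hypothesis only, not a verified X79 address decoder. The names PAIRS and decode_low_byte are invented here, and the reading of bit 2 (set means "upper half") is inferred from the worked result above.

```python
PAIRS = ["A1B1", "C1D1", "A2B2", "C2D2"]  # assumed interleave order

def decode_low_byte(address):
    """Decode the low 8 address bits per the hypothesised layout."""
    low = address & 0xFF
    pair = PAIRS[(low >> 6) & 0b11]   # bits 7-6: which DIMM pair
    burst = (low >> 4) & 0b11         # bits 5-4: position in burst of four
    channel_bit = (low >> 3) & 1      # bit 3: 1 = first DIMM of pair, 0 = second
    half_bit = (low >> 2) & 1         # bit 2: assumed 1 = upper 32-bit half
    word_offset = low & 0b11          # bits 1-0: byte index within the word
    return {
        "pair": pair,
        "dimm": pair[:2] if channel_bit else pair[2:],
        "half": "upper" if half_bit else "lower",
        "burst": burst,
        "word_offset": word_offset,
    }

print(decode_low_byte(0x90F292834))
```

For the last logged address this gives pair A1B1, DIMM B1, upper half, and word offset 0, which is the same answer as the hand decode above.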
>>
>> *******
>>
>> What we do, after seeing one of these errors, is test one DIMM at
>> a time. Total of eight tests.
>>
>> The annoying part is, of course the stupid thing passes. With flying
>> colours. I hardly need to tell you that.
>>
>> Place a single DIMM in any of A1/B1/C1/D1 (end of channel) and test
>> the 8GB stick. A detected error then, is unambiguous. DO NOT change
>> memory timings. Do not reach in and switch on XMP for example, if
>> previously it was off.
>>
>> There is a good chance, single DIMM tests will pass.
>>
>> Next, you test A1A2 DIMMS in A1A2 again, by themselves. But if a
>> failure is detected with this 2DPC case, we don't really know which
>> DIMM. And if we install one DIMM only again, it might well pass.
>>
>> *******
>>
>> My system isn't bodged very much. Only VCCSA (memory controller) was
>> bumped a tiny bit. At this speed I'm using, it should hardly need to
>> be bumped.
>>
>> CPU VCore   1.098V
>>
>> CPU VCCSA   1.277V  (set to 1.250 manually, code yellow)
>>
>> DRAM        1.652V  (CHA, CHB) Auto, it sets itself to CPU_MAX_allowed 1.65V
>>             1.659V  (CHC, CHD) Auto, it sets itself to CPU_MAX_allowed 1.65V
>>                     There's not much point running at only 1.5V, which would be silly.
>>                     XMP can control this, and it would use 1.65V as well.
>>                     Intel says to not use more than 1.65V, which is where the number comes from.
>>
>> CPU PLL     1.807V  (usually higher than VCore and a separate supply pin provided, auto value)
>>
>> VTTCPU      1.062V  ???  (A terminator voltage, but is this for DMI? Not sure.)
>>
>> PCH 1.1V    1.110V
>> PCH 1.5V    1.501V
>>
>> VTTDDR              CHA,CHB    This one and the ones below it, are not measured.
>>                     CHC,CHD    The terminator voltage value is not exactly mid-rail.
>>
>> The only thing I bumped then, was VCCSA, by a little bit.
>> That was in the hope I could get more than 1866MHz (on 2400 DIMMs)
>> while doing non-XMP manual tuning (since officially, using the XMP with
>> 8 DIMMs is undefined and bus loading predicts it won't work
>> and will throw errors). It did throw errors at 2400, so I immediately
>> got it out of my head, it was ever going to run eight sticks
>> at 2400. It quite happily runs four 4GB sticks (SS) at 2400.
>>
>> DDR3 still shows bus loading effects.
>>
>> On my DDR4 machine, I just jammed'er on XMP and... it worked!
>> And it passed the tests! I got a good chuckle out of that, I
>> can tell you, after the hell of tuning the X79.
>
> Above you expressed surprise that the addresses did not start or end on
> an obvious hex break point. The log says that it was truncated to 500
> total errors. When Memtest was running, I believe it said there were
> 10000(?) errors; who knows whether that is hex or decimal. I also have
> no idea how it selected the cut points in the errors list.
>
> Other points: I've been running with XMP since day one, i.e., for a long
> time. I told the firmware to use the XMP option recommended by G.SKILL
> and I have left all memory timings to that specification. The memory
> according to an ASUS utility runs at 2400 (its design target) plus or
> minus a trivial amount (e.g., .1) when the CPU changes its clock. I
> presume speed change is based on a scale that allows memory and CPU
> clock ratios to be a simple rational number.
>
> I also have not touched memory voltages and set virtually everything
> else to automatic, i.e., let the firmware set it; and I have not tried
> to overclock the CPU so its max cycle rate is 3.9GHz with all sites (2
> per core) running with the same clock.
>
> Unfortunately, I understand only a little about modern hardware and age
> has robbed me of the dexterity necessary to go through extensive swap
> cycles that include removing and remounting the CPU multiple times.
>
> In the early to mid 1960s, I spent some time developing hardware
> diagnostics, mostly for peripherals. I worked with the fellows who did
> CPU and memory work though. I remember that those diagnostics would
> sometimes identify a basic component, its exact location, and instruct
> the tech to replace it. It was my hope that today's diagnostics would
> give results such as: replace CPU; replace motherboard; replace memory
> modules in slots x, y, and z. Even at this chunk size, the information
> would be very useful.
>
> My goal is to keep this machine and another running for another 4-6
> months then build replacements out of more modern hardware.

My machine now seems to be rock stable again after a week or more of
consistent blue screens with "inconsistent" reasons. I am now convinced
that the error is confined to a pair of memory sticks or the slots they
are in.

The firmware, Windows, and the ASUS "AI Suite 3" utility now all agree
that I have 48GB of memory. There is actually 64GB plugged into the
motherboard. All of the testing and pounding that diagnostics have done
probably convinced the firmware not to count two of the eight sticks.

My remaining problem is that I can't tell from the information available
to me what the two slots are. When I look at firmware options preboot, I
find that it has timings for all four slot pairs. (Probably an artifact
of using XMP.)

At this point, I'd like to remove the two sticks that either are not
working or that are plugged into two slots that have failed. I can work
for several more months and won't really rue the memory decrease. The
reason for wanting to remove them is that the hardware is more likely to
go from latent bad to aggressive bad in its current state.

BTW, above I say "I can't tell from the information available to me what
the two slots are," and that's true. What else I hope is true is that
someone else can figure out how to tell and inform me.

Once again, thanks for the pointers and information you all have provided.
--
Jeff Barnett
