NCRs and ECRs affecting XMM-OM post-launch

Pre-launch "flight" NCRs are:

ID       Where         Summary                  Action         Status
NCR 167  DPU software  magnifier tracking       investigation  OPEN
NCR 171  DPU software  eng4 16-chunk            reversed       closed
NCR 174  DEM hardware  KAL not working          reload code    effectively closed
NCR 183  DPU software  track not turned off     fixed          closed
NCR 184  DPU software  guide star selection     investigation  OPEN
NCR 185  ICU software  60h in offset field      fixed          closed
NCR 186  DPU software  fast mode                investigation  OPEN

Post-launch "flight" NCRs are:

ID       Where         Summary                  Action         Status
NCR 187  ICU software  memory dumps             fixed          closed
NCR 188  ICU software  memory dumps             fixed          closed
NCR 189  DPU/ground    single ev upsets         MOC problem    effectively closed
NCR 190  OM optics     stray light              investigated   effectively closed
NCR 191  ICU software  uninitialized var        fixed          closed
NCR 192  OM optics     broad PSF                focus heaters  closed
NCR 193  DPU software  memory check             fixed          closed
NCR 194  ICU software  safe mode                fixed          closed
NCR 195  ICU software  unsuccessful exec        fixed          closed
NCR 196  ICU software  hv ramp-up               fixed          closed
NCR 197  DEM software  dpu not get cgs          nothing        closed
NCR 198  DPU software  scrubbing reported late  fixed          closed
NCR 199  ICU software  ICU hang                 investigation  OPEN
NCR 200  DPU software  Spont DPU reset          investigation  OPEN
NCR 201  DPU software  16 bit wraparound        investigation  OPEN
NCR 202  DPU software  Missing alerts           fixed          closed
NCR 203  DPU software  FAQ word order           fixed          closed
NCR 204  DPU software  Unexpected A5AD alerts   investigated   closed
NCR 205  DPU software  Alerts from CGS          investigation  closed
NCR 206  DPU software  FAQ failed               investigation  to be closed soon
NCR 207  OM hardware?  HV failure               investigation  closed
NCR 208  DPU software  fast mode pointer        investigation  closed
NCR 209  DPU software  offsets remembered       investigation  closed
NCR 210  DPU software  a580 seen in cgs         investigation  closed
NCR 211  ICU software  DPU package exception    investigation  OPEN
NCR 212  DPU software  DP_FAQ ordering          investigation  to be closed soon
NCR 213  OM hardware   Cathode anomaly          investigation  closed
NCR 214  DPU software  DPU spontaneous reset    investigation  OPEN
NCR 215  DPU software  Eng 3 corruption         investigation  OPEN
ECR 86   ICU software  mode changes             fixed          closed
ECR 87   ICU software  exceptions->anomalies    fixed          closed
ECR 88   ICU software  safe on filter loss      fixed          closed
ECR 89   ICU software  tm timeout change        fixed          closed
ECR 90   DPU software  Eng BEG/ENDOF_EXP        investigation  closed
ECR 91   DPU software  Whole frame              investigation  closed
ECR 92   ICU software  fw offset of UV grism    investigation  OPEN

Release 10

NCRs 183, 185, 186 (partial), 187, 188, 191, 192, 193, 194, 195, 196 and 198, and ECRs 86, 87 (partial), 88 and 89, are fixed in release 10 of the flight software.

Release 10b

In addition to those of Release 10, NCR 203 is fixed in release 10b of the flight software.

Release 11

NCRs 202, 205, 208 and 209, and ECRs 87, 90 and 91, are fixed for release 11 of the flight software. NCR 171 is reversed back to 16-chunk.

Pre-launch NCRs and ECRs are available at:
http://xmmom.mssl.ucl.ac.uk/docs/xmm-om-ncrs/ncr.ps
http://xmmom.mssl.ucl.ac.uk/docs/xmm-om-ecrs/ecr.ps


Here's the detailed list of post-launch NCRs ...


NCR 187
-------
XMM-NC-ESO-0103

A Dump from OM MID 37 was commanded at 2000.019.13.12.24, and based on the
contents of the dump packets 2037.OM6, the following has been found:

Packet #  Start address  Delta address from last packet (dec)
1         00E5 0000 ---
2         00E5 00A6 166
3         00E5 014C 166
4         00E5 01F2 166
5         00E5 0299 167
6         00E5 033F 166
7         00E5 03E5 166
8         00E5 048A 165
9         00E5 0530 166
10        00E5 05D6 166
11        00E5 067D 167
12        00E5 0723 166
13        00E5 07C9 166
14        00E5 086E 165
15        00E5 0914 166

As can be seen, the start address for the 5th packet is one larger than
expected. It then corrects itself in the 8th packet. This behaviour continues
through the rest of the dump.
In our comparison task we saw the dump starting off OK, then slipping one
word, and then synching back up again.

Kate
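
The slip pattern in the table above can be checked mechanically. The following
is a minimal Python sketch, not part of the flight or ground software. It
assumes a nominal stride of 166 words per packet (the most common delta in the
table), and the helper names are hypothetical.

```python
# Low 16 bits of the packet start addresses listed above (high word is 00E5).
STARTS = [0x0000, 0x00A6, 0x014C, 0x01F2, 0x0299, 0x033F, 0x03E5, 0x048A,
          0x0530, 0x05D6, 0x067D, 0x0723, 0x07C9, 0x086E, 0x0914]

NOMINAL_STRIDE = 166  # assumption: words per dump packet, read off the table


def deltas(starts):
    """Address difference between each packet and the one before it."""
    return [b - a for a, b in zip(starts, starts[1:])]


def slips(starts, stride=NOMINAL_STRIDE):
    """Offset of each start address from where a fixed stride would put it."""
    return [s - (starts[0] + i * stride) for i, s in enumerate(starts)]


if __name__ == "__main__":
    for number, slip in enumerate(slips(STARTS), start=1):
        print(f"packet {number:2d}: slip {slip:+d}")
```

With the values from the table, packets 5 to 7 and 11 to 13 come out one word
ahead of the fixed stride and all others at zero, which matches the
slip-and-resynchronise behaviour described above.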

Telecommand:
H4125 memory dump   2000.019.13.12.21.324

mid   = MID 37 (25 hex)
H0500 = e50000 hex
H0510 = 2990 hex

VxWorks command:
tc_dump_mem(0x25, 0xe50000, 0x2990)

Reproduced at MSSL.

Fix known.  Same as NCR 188.  Edit icu/fm/oper/memdpu.adb.  Will be implemented
for OM flight code release 10.



NCR 188
-------
XMM-NC-ESO-0103

"no dump TM for fix DPU status"

Command to dump DPU status not responding.  Telecommand HL113 Memd White LOCAL
did not produce any packet dump.  1M packet 94213 Memd White LOCAL (OM6 2019)
is expected.

Telecommand:

H4113 memory dump    2000.005.17.32.46.853
mid=13 hex
h0500=23278 dec
h0510=1

VxWorks command:

tc_dump_mem(0x13, 0x5aee, 0x1)
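
The telecommand parameters and the VxWorks call carry the same values in
different bases (23278 dec is 5aee hex). A small Python sketch makes the
conversion explicit. The helper name `dump_call` is hypothetical and only
formats the string; it does not talk to any real system.

```python
def dump_call(mid, address, length):
    """Format a tc_dump_mem call with all three arguments in hexadecimal."""
    return f"tc_dump_mem(0x{mid:x}, 0x{address:x}, 0x{length:x})"


if __name__ == "__main__":
    # mid = 13 hex, h0500 = 23278 dec, h0510 = 1 (values from the telecommand above)
    print(dump_call(0x13, 23278, 1))
```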

Reproduced at MSSL.

Fix known.  Same as NCR 187.  Edit icu/fm/oper/memdpu.adb.  Will be implemented
for OM flight code release 10.



NCR 189
-------
XMM-NC-ESO-0105

"OM DPU Code corrupted by non-recoverable doublebit error"

DPU crashes occur due to higher than expected Single Event Upsets on board and
require a manual reload of the DPU code from the MOC.


If this (multi-bit errors) turns out to be a problem that occurs
too frequently I can make some software modifications to
implement a voting scheme based on multiple copies of the
executable stored in the DPU. It could be done in such a way that
it would *NOT* affect upload times (loading the code from the
ground into the DPU).

Jim
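
The voting idea in the message above can be illustrated as follows. This is
only a sketch under stated assumptions: the message does not say how many
copies would be kept or how the vote would be taken, so three copies and a
bitwise two-out-of-three majority are assumed here, and the function name is
hypothetical.

```python
def majority_vote(copy_a, copy_b, copy_c):
    """Bitwise 2-of-3 vote over three equally long byte strings."""
    if not (len(copy_a) == len(copy_b) == len(copy_c)):
        raise ValueError("copies must have the same length")
    # A bit is set in the result when it is set in at least two of the copies.
    return bytes((a & b) | (a & c) | (b & c)
                 for a, b, c in zip(copy_a, copy_b, copy_c))


if __name__ == "__main__":
    good = bytes([0x12, 0x34, 0x56])
    double_bit = bytes([0x12 ^ 0x09, 0x34, 0x56])   # two bits flipped in one byte
    single_bit = bytes([0x12, 0x34, 0x56 ^ 0x80])   # one bit flipped elsewhere
    print(majority_vote(double_bit, good, single_bit) == good)
```

A vote of this kind recovers the original as long as no bit position is
corrupted in more than one copy, which is why it also covers a multi-bit error
confined to a single copy.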


The architecture of the RAMs used in the DPU has eight 8bit X
32K RAM chips on a single ceramic substrate (for a total of
8bits X 256K). The die themselves are organized into multiple
pages of 8bit words. So it is possible that an energetic particle
could impact a single die, let's say between two adjacent RAM
cells, and cause both of them to flip. This could be seen as like
bits (ex. 2^3) in sequential addresses (ex. e01662 and e01663)
being flipped. Another possible scenario is the 2^0 and 2^7 bits
on widely separated addresses.

Jim


We will wait and see how many of these we get.

Phil




NCR 190
-------

ICU stray light

Several OM images acquired showed a low emission structure roughly three
times the background level. The increased background has the shape of
loops or elongated streaks.
Initial analysis suggests that the increased background is caused by a
chamfer in the detector holding structure.



NCR 191
-------

Helpdesk Ref E351 Vega ID 480

The variable SYNCHRONISING in time_man.adb is not initialized.  This could
cause an unpredictable initial value in the housekeeping though it does not on
the real hardware.

This is seen in (and causes a problem with) the simulator at VILSPA.
Fix known.  Edit icu/fm/oper/time_man.adb.  Will be implemented for OM flight
code release 10.



NCR 192
-------

Instrument PSF

The instrument PSF is broader than expected. The magnifier PSF exhibits a
donut shape, which suggests that the images are out of focus. Besides the
defocus seen in the lenticular filters, an additional 'defocus' component
may be contained in the magnifier PSF.

Focus heaters will be set to the following when the filter is chosen.

1200        -- Blocked           - Filter 0      +100%
1400        -- V                 - Filter 1      +100%
1600        -- Magnifier         - Filter 2      -100%
1800        -- U (no bar)        - Filter 3      +100%
2000        -- B                 - Filter 4      +100%
0000        -- White             - Filter 5      +100%
0200        -- Grism 2 (Visible) - Filter 6      +100%
0400        -- UVW1              - Filter 7      +100%
0600        -- UVM2              - Filter 8      +100%
0800        -- UVW2              - Filter 9      +100%
1000        -- Grism 1 (UV)      - Filter 10     +100%
2100        -- Bar               - Filter 11     +100%
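
The table above amounts to a lookup from filter to focus heater setting. The
sketch below shows it as a Python table. It is illustrative only: the first
column is assumed to be the filter wheel position, and the names
`FILTER_TABLE` and `heater_setting` are hypothetical.

```python
# (first column of the table above, filter name, filter number, heater setting in %)
FILTER_TABLE = [
    (1200, "Blocked", 0, +100),
    (1400, "V", 1, +100),
    (1600, "Magnifier", 2, -100),
    (1800, "U (no bar)", 3, +100),
    (2000, "B", 4, +100),
    (0, "White", 5, +100),
    (200, "Grism 2 (Visible)", 6, +100),
    (400, "UVW1", 7, +100),
    (600, "UVM2", 8, +100),
    (800, "UVW2", 9, +100),
    (1000, "Grism 1 (UV)", 10, +100),
    (2100, "Bar", 11, +100),
]


def heater_setting(filter_number):
    """Return the focus heater setting (percent) for a filter number."""
    for _position, _name, number, setting in FILTER_TABLE:
        if number == filter_number:
            return setting
    raise KeyError(f"unknown filter number {filter_number}")


if __name__ == "__main__":
    print(heater_setting(2))   # the magnifier is the only entry at -100
```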



NCR 193
-------

Problem: The memory access check for writing a single small word
integer would have reported its error with a hard-coded number, 0xa555.
The code should report it with the symbolically defined name DA_GADE_WSI.
Since this logic path is rarely passed through, the problem was only found
accidentally, through inspection.

Action: Implement the correction.

Test: 

1. Write special code to generate this logic path.
2. Load and execute the special DPU codes.
3. Confirm the correct generation of the error message.

Results:

Exception message is correctly generated.

Affected codes:

include/global_access.c: v. 1.16.



NCR 194
-------

Goto Safe Mode can lose filter wheel position.

If the DPU is sending alerts or data during the filter wheel movement of a goto
safe command, this traffic on the SSI can interrupt the filter wheel on its way
to blocked.

Fix known.  Edit icu/fm/oper/modeman.adb.  Will be implemented for OM flight
code release 10.



NCR 195
-------


The expected TM packet, Unsuccessful Command Execution (Type 3.4, TPN 91404,
Error Code 134), should have arrived when the FW was commanded and not datumed.
It was generated by OM but arrived as a TM packet of Type 3.2, Unsuccessful
Command Acceptance.

These TM packets are set up in the database as defined in the TC and TM
Specification - User Manual XMM-OM/MSSL/ML/0010.4 section 3.3.3 as type 3.4s.
The XMCS was not able to recognise the packet when it arrived as a type 3.2 as
the packet is not defined in the database.  As a result of this no automatic
action was taken by the XMCS and commanding to OM was not stopped.

Fix known: Missing if statement at the end of icu/fm/oper/tc_verify.adb
Will be implemented for OM flight code release 10.



NCR 196
-------

HV ramp-up failed

1. First we send a hv ramp param tc and it works.
2. Then we send the hv ramp tc and it fails with an unknown tm packet.
3. The same tc is repeated and it works.

1. ****TELEMETRY****
2000 53d 16:49:18.265
detector event mcp23 OK

2. ****TELECOMMAND****
2000 53d 16:52:09.562
H7140 set hv ramp para
mcp1
500
19
0
0
OFF

3. ****TELECOMMAND****
2000 53d 16:52:49.785
H5140
start hv ramp

4. ****TELEMETRY****
2000 53d 16:52:51.651
XREF.XXX <------------------bad telemetry packet
8c00 edb4 000d
0332 0061 ff0e 2f93 0a17 8400 a115

5. ****TELECOMMAND****
2000 53d 16:56:45.441
H7140 set hv ramp param
mcp1
500
19
0
0
OFF

6. ****TELEMETRY****
2000 53d 16:58:12.892
MCP1 at correct voltage tm packet

This was caused by a command sent too soon from the ground.
The bad telemetry packet was a consequence of NCR 195.
The fix for this NCR should make sure that the HV code sends a command too soon
packet rather than an invalid parameters packet when the command is too soon.



NCR 197
-------

Problem: It has been observed that ICU operation appears abnormal when
setting up the exposure, if the DPU continues to transmit a significant
amount of data. The key symptom is that the DPU does not receive the
IC_CHOOSE_GS command. The DPU continues to operate properly. This
condition does not appear to cause any permanent damage to the continuing
operation of OM, other than the improper set up of the exposure
configuration, which leads to loss of science data.

Resolution: This condition was alleviated during the OM commissioning
phase, when the constraint was imposed that all data downlink from
exposure n must be completed prior to the termination of
exposure n+1. Thus this condition should not occur under the current
operation scenario. If the operation scenario is revised, then this
error/issue needs to be recreated/revisited.





NCR 198
-------

Scrubbing reports errors too often (when not busy) and sometimes too late (when busy).





NCR 199
-------
The ICU got stuck on 2000.129.16.34.30.155 whilst loading the DPU.





NCR 200
-------
Spontaneous DPU reset

This problem causes a spontaneous reset of the DPU.  At the moment it is not
clear what the cause is, though it happens relatively infrequently.  In these
tests it was only seen once in approximately 160 hr of testing.  Previous
testing on earlier versions of the code indicates that this NCR occurs randomly
with no obvious period, with occurrence times of between 2 hr and over 100 hr.





NCR 201
-------
16 bit wraparound when no stars

Having no star in the fast mode window causes the 16 bit fast mode memory to
overflow.  Since operationally you would expect to have a star in the fast mode
window, this is unlikely to happen often, but there are plans to implement fast
mode using 24 bit memory, which would solve this problem.
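
The difference between 16 bit and 24 bit storage can be sketched in Python.
This is a generic illustration of unsigned wraparound only: the NCR does not
say which quantity overflows, and the function name is hypothetical.

```python
def accumulate(total, increment, bits):
    """Add to an unsigned accumulator `bits` wide, wrapping on overflow."""
    return (total + increment) & ((1 << bits) - 1)


if __name__ == "__main__":
    print(accumulate(65535, 1, 16))   # wraps back to 0 in 16 bits
    print(accumulate(65535, 1, 24))   # still fits in 24 bits: 65536
```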





NCR 202
-------
Missing alerts before Dave.  12.06.2000

All alerts from the DPU except for heartbeats are missing before the first
IC_INIT_DPU 0XA430 is sent.

From Cheng:
This symptom is resolved by initializing the inhibit_ssi in cwhite.c,
instead of in su_initialize.c. This is the same as the missing Jim.
Releases 10 and 10b will inhibit all alerts except for HB between Jim and Dave.
Release 11 should fix this problem.





NCR 203
-------
FAQ failure: DPU assumes incorrect word ordering of input reference
stars.  08.10.2000.

Opened by: Jamie Kennea (8th June 2000)

Explanation:

When loading reference stars for field acquisition, the DPU code assumes
that the word ordering of the coordinates in the IC_LOAD_REF_STARS (A428)
command is least significant word first. The DPU-ICU Protocol Definitions
document (XMM-OM/MSSL/ML/0011.4) states that "In all cases, the most
significant bit is transmitted first." This error causes the input
reference star positions to be corrupted and therefore field acquisition
fails.
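
The effect of the wrong word order can be shown with a short Python sketch. It
assumes, for illustration only, that a coordinate is a 32-bit value carried as
two 16-bit words; the actual field widths are defined in the protocol document
and are not restated here. The function names are hypothetical.

```python
def split_words(value):
    """Split a 32-bit value into (most significant, least significant) words."""
    return (value >> 16) & 0xFFFF, value & 0xFFFF


def join_words(first, second, msw_first=True):
    """Rebuild a 32-bit value from two 16-bit words in transmission order."""
    if msw_first:
        return (first << 16) | second
    return (second << 16) | first


if __name__ == "__main__":
    sent = split_words(0x00012345)                 # sender: MSW first
    print(hex(join_words(*sent)))                  # correct decoding
    print(hex(join_words(*sent, msw_first=False))) # decoding as LSW first
```

Decoding the same two words in the opposite order gives a completely different
value, which is consistent with the corrupted reference star positions
described above.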





NCR 204
-------
Unexpected A5AD alerts.  13.06.2000.


From Jamie:
On several occasions during long exposures, the DPU has entered a mode
where it issues many A5AD alerts, indicating that the DPU is attempting to
access illegal memory locations. The alerts are random, infrequent and not
reproducible on simulator tests, indicating a possible cause to be
corruption of the PROC memory area by an SEU, causing the DPU code to run
unpredictably.


From Mat:
I've been through the REPEX alerts and exceptions looking for the a5ad alerts.
I found the following groups of a5ad "WHITE BOUND I. EXCEPT" warnings:

ERT first warning       no.  of warnings        ERT last warning         

2000.073.19.31.42.293   lots of screenfuls      2000.074.11.19.18.102
2000.118.09.54.08.258   lots of screenfuls      2000.118.10.14.51.050
2000.119.03.53.13.409   lots of screenfuls      2000.119.03.53.54.193
2000.120.10.30.33.586   lots of screenfuls      2000.120.10.48.38.029
2000.121.01.59.19.190   lots of screenfuls      2000.121.01.59.54.051
2000.134.06.23.36.409   lots of screenfuls      2000.134.07.34.22.601
2000.135.01.34.15.016   lots of screenfuls      2000.135.01.38.50.322
2000.139.17.10.10.842          4                2000.139.17.11.30.877
2000.139.22.52.17.807          4                2000.139.22.52.18.637
2000.140.13.01.56.247   lots of screenfuls      2000.140.23.31.48.258
2000.142.11.03.07.877   lots of screenfuls      2000.142.12.15.09.510
2000.142.14.25.56.025   lots of screenfuls      2000.142.15.01.18.828
2000.143.00.45.52.465   lots of screenfuls      2000.143.00.49.35.290
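
The duration of each group of warnings follows from the first and last ERT
stamps. A minimal Python sketch, assuming the stamps are
year.day-of-year.hour.minute.second.millisecond (which is how they read), is
given below. The function names are hypothetical.

```python
from datetime import datetime, timedelta

# Assumed layout: year . day of year . hour . minute . second . millisecond
ERT_FORMAT = "%Y.%j.%H.%M.%S.%f"


def parse_ert(stamp):
    """Parse an ERT stamp such as 2000.073.19.31.42.293."""
    return datetime.strptime(stamp, ERT_FORMAT)


def group_duration(first, last):
    """Time between the first and last warning of a group."""
    return parse_ert(last) - parse_ert(first)


if __name__ == "__main__":
    print(group_duration("2000.073.19.31.42.293", "2000.074.11.19.18.102"))
```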


From Cheng:
This e-mail concerns the so-called A5AD problem seen on XMM-Newton-OM. We
have seen a series of A5AD (GADE_RSIA_L) errors after frame 48 in a 100
frame exposure. One suspicion is that it is related to the long
exposures. But Jamie reports that there are many incidents of long
exposures where no data corruption has occurred on OM proper. Also, we
cannot reproduce this problem on the ground. Rudi, Kate, how many times
have we seen errors like this? Once or twice?

Jamie has set up the data archive on eridanus. I haven't got the chance to
look at them carefully. But, let me just raise this food for thought. We
have seen data corruption, presumably due to radiation, both in PROG and
real data product. It is thus not inconceivable that data corruption will
occur in the DPU operation parameter area (PROC). When that happens, the
behavior will be all over the map. One way to reduce this problem is to
have short exposures. This will force a reference frame acquisition and
processing, which can serve as a mini-reset where many variables are
refreshed. Thus, if there is a 20 sec x 100 exposure, I recommend we break
it up into two 20sec x 50, as long as the data volume and TM bandwidth can
be accommodated. Even though we might incur some overhead, I think the
resultant data resilience/robustness is worth it.




NCR 205
-------
Alerts from CGS.  

The DPU produces lots of alerts during choose guide stars for the magnifier.



NCR 206
-------
FAQ failed  03.07.2000

Field acquisition gave a5b2 alerts at 22:16 03.07.2000.



NCR 207
-------
HV failed.  (Vilspa NCRs 60 and 66) 05.07.2000.

The OM high voltages spontaneously turned off and disabled themselves on day 186 at 05:15.  The ICU software knew nothing of this (it didn't seem to be responsible) though it was seen by the ICU software in the housekeeping.  The ICU software correctly disallowed a further manual attempt to change the voltage as it correctly thought there was a problem as the measured and expected voltage were different.  A RBI reset had to be performed to reset the software.  The high voltages ramped up correctly afterwards.

It looks like the electronics did this spontaneously.  The software is very simple and before and after this problem was running correctly.  No errors were reported from the software.

The only suggestion is that this was radiation-induced.

The out-of-limits should be changed so that in science mode (mode 3) the high
voltages should be enabled.  Currently there is a limit that the high voltages
should be in limits when they are enabled but this doesn't catch the
spontaneous disabling of the high voltages.
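
The proposed limit change can be sketched as a pair of checks. This is an
illustrative Python sketch only, not the actual out-of-limits definition: the
parameter names and the tolerance are hypothetical, and only the rule that
mode 3 is science mode comes from the text above.

```python
SCIENCE_MODE = 3  # from the text: science mode is mode 3


def hv_violations(mode, hv_enabled, measured, expected, tolerance):
    """Return the limit violations for one housekeeping sample."""
    violations = []
    # Proposed additional rule: in science mode the HV must be enabled.
    if mode == SCIENCE_MODE and not hv_enabled:
        violations.append("HV disabled in science mode")
    # Existing rule: when enabled, the HV must be within limits.
    if hv_enabled and abs(measured - expected) > tolerance:
        violations.append("HV out of limits while enabled")
    return violations


if __name__ == "__main__":
    # Spontaneous disabling: the existing rule alone would report nothing.
    print(hv_violations(3, False, 0.0, 500.0, 10.0))
```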

If this is radiation-induced, we can do nothing.




NCR 208
-------
Toggling of the fast mode data buffer pointer.  17.07.2000.

Problem Statement: Toggling of the fast mode data buffer pointer between
exposures in the whitedsp.c is incorrectly coupled to the toggling of
image mode accumulated image pointer.

The correction is straightforward to implement. (It is actually
implemented and tested.)



NCR 209
-------
Old tracking offsets remembered by shift and add algorithm.  26.07.2000

In the case where tracking is off due to a failure of the choose guide
star algorithm, the Red DSP will apply the last calculated shift and add
offset (from a previous exposure). Usually this value is no more than
+/- a few pixels. However, in extreme cases (explicitly where tracking
has gone badly wrong, followed by an exposure where CGS fails) this offset
can be large (>100 pixels), which can lead to corruption/loss of data.
This behaviour has so far been seen twice in OM data, once in a series of
observations in UV, and once using the magnifier.



NCR 210
-------
a595 (DA_BLUE1_S_ALERT) and a580 (DA_CLK_SYNCH_ERROR) seen in Choose Guide
Stars.  The observed a580 is understood to some extent.  It is correlated with
exposures with two fast mode windows. If you have only one fast window,
then we should not have any problem. The reason we haven't seen this is that
EOB2 has no facility to do clock sync. It shouldn't have any detrimental effect.
It happens when we tell both blue1 and blue2 to load BFAST simultaneously.
The fix is to split up the command to load BFAST.



NCR 211
-------
Ada exception in DPU data manager.

Opened by: Fabio Giannini                25.09.2000

At 11:24.53 today the ICU code crashed.

Exception packet 92800 was sent indicating an Ada error with parameters:
H8080=32hex (Ada exception DPU data manager)
H8085=E010hex (out of range as defined in TM-TC doc)
TM was lost and the command to go to safe was not accepted.
A cold reset was then issued and a RBI dump of the memory was performed




NCR 212
-------
NCR 212: Ordering issue with DP_FAQ packet

Opened by: Jamie Kennea, Rudi Much    13.10.2000

The DP_FAQ alert contains both a list of the uplinked guide stars used for
field acquisition and the positions of the stars identified by the DPU to
be associated with these reference stars. Currently, if a reference star is
not found, then it has no entry in the downlinked star list. Therefore
if 2 stars out of 16 are not found, the reference star list is an array of
16 and the found star list is an array of 14 with padding at the end.
This makes direct comparison of the reference star list and the found star
list difficult in the case where not all uplinked stars are found.
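
The comparison problem can be illustrated with a short Python sketch. It is
not the DP_FAQ packet layout: the padding value, the use of star indices as
keys and both function names are assumptions made for the example. The second
function shows one possible aligned layout, in which a missing star keeps its
slot.

```python
def compact_found(reference_ids, found, pad=(0, 0)):
    """Behaviour as described: found stars first, padding at the end."""
    hits = [found[ref_id] for ref_id in reference_ids if ref_id in found]
    return hits + [pad] * (len(reference_ids) - len(hits))


def aligned_found(reference_ids, found):
    """One possible alternative: keep reference order, None where not found."""
    return [found.get(ref_id) for ref_id in reference_ids]


if __name__ == "__main__":
    refs = [0, 1, 2, 3]
    found = {0: (10, 11), 2: (30, 31), 3: (40, 41)}   # star 1 was not found
    print(compact_found(refs, found))
    print(aligned_found(refs, found))
```

In the compact form, entry i of the found list no longer corresponds to entry
i of the reference list once any star is missing; in the aligned form it
always does.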



NCR 213
-------
NCR 213: Cathode anomaly

Opened by: Jorge Fauste (Vilspa NCR 70)     04.10.2000

During Optical Monitor Activation on day 272 revolution 148 around 
20 hours 45 minutes parameter H5165 HVM VCATH had several spikes. 
The problem appeared some minutes before the cathode High voltage 
had been set up by Telecommand during the High Voltages ramp up

These are the values found:

                before 2000.272.20.45.16        H5165=0.50 volts
                at 2000.272.20.45.16            H5165=3.91 volts
                at 2000.272.20.45.25            H5165=7.32 volts
                at 2000.272.20.45.35            H5165=10.25 volts
                at 2000.272.20.45.46            H5165=98.5 volts
                at 2000.272.20.45.56            H5165=0.50 volts
                after that time some small spikes.
                at 2000.272.20.49.36 the command to set up the high 
                voltages for the cathode was sent.

After the command was sent everything was O.K


The cathode voltage is seen to rise on the QM hardware when the other high
voltages are ramped up.
This is a known feature of the hardware.



NCR 214
-------
NCR 214: DPU spontaneous reset

Opened by: Jorge Fauste (Vilspa NCR 76)     16.01.2001

On day 2001-011 at 19:07 the following Repex message appeared:
 "92301 SSI Exception"
after this message another two messages appeared: "92802 Heartbeat lost", 
 "92803 DPU Reset exception"
Executed procedure CRP_OPM_004



NCR 215
-------
NCR 215: Eng 3 corruption

Opened by: Jorge Fauste (Vilspa NCR 78)     09.02.2001

First OM exposures of observation 0109870201 on revolution 215 showed corrupted
images. After some investigation it was discovered that Engineering 3 data was
corrupted as well. Science observations were stopped for OM, and Engineering 3
and 6 were executed again. No Telecommand or telemetry problems were detected.



ECR 86
------
Two mode change commands to the same mode should not produce an error.

For example,
tc_mode 2
tc_mode 2
should not generate an error

Change known.  Edit icu/fm/oper/modeman.adb.  Will be implemented for OM flight
code release 10.





ECR 87
------

Change all critical exceptions to major anomalies.





ECR 88
------

Loss of filter wheel position should cause an automatic goto safe and prevent mode changes until recovered.  As an additional safety measure, HV ramp-up is not allowed unless filter wheel in blocked position.
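
The requested interlocks can be sketched as simple guard functions. This is an
illustrative Python sketch, not the ICU Ada code; the class, the mode names
and the use of None for a lost position are all assumptions made for the
example.

```python
BLOCKED = "blocked"


class OMState:
    """Minimal instrument state for the example."""

    def __init__(self, mode, fw_position):
        self.mode = mode
        self.fw_position = fw_position  # None means the position has been lost


def on_fw_position_lost(state):
    """Loss of filter wheel position forces an automatic goto safe."""
    state.fw_position = None
    state.mode = "safe"


def may_change_mode(state):
    """Mode changes are refused until the position has been recovered."""
    return state.fw_position is not None


def may_ramp_hv(state):
    """HV ramp-up is only allowed with the filter wheel at blocked."""
    return state.fw_position == BLOCKED


if __name__ == "__main__":
    state = OMState("science", "V")
    on_fw_position_lost(state)
    print(state.mode, may_change_mode(state), may_ramp_hv(state))
```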




ECR 89
------

Change the loss of telemetry timeout to goto safe approx 1 min after the CDMU dies.



ECR 90
------

Implement BEGOF_EXP ENDOF_EXP in engineering exposures.



ECR 91
------

Implement a whole field of view exposure mode.



ECR 92
------

Change fw offset of UV grism.

Opened by: Rudi Much                    11.01.2001

 The UV grism spectra are disturbed by 0-order images and by straylight
 features.

 After the analysis of the UV grism data acquired in several test observations
 of BPM16274 the OM team came up with a new filter wheel position of the UV
 grism. The new position cleans up parts of the OM FOV both for straylight
 and 0-order images. Cleaner grism images are obtained.
 The new grism position will replace the current one.

 The new position is defined as (old position - 60) = 940. 
 The FW position is normally commanded by filter element identifier and the
 translation from FW identifier to FW counter position is made by the ICU.
 Therefore a change on the ICU s/w is required (OM team).
 There is also an absolute filter wheel commanding capability. Here the
 reference document "TC and TM specification" is consulted.  An update of this
 document is required (OM team).

 However changes of the ground segment are required as well, e.g. the 
 translation from FW position counter into FW position for the mimic display 
 of the filter wheel position.
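
 The position change itself is a one-line update to the translation from filter
 identifier to FW counter position. The Python sketch below illustrates it. The
 old position of 1000 is inferred from (old position - 60) = 940 and agrees with
 the entry for Grism 1 (UV), filter 10, in the NCR 192 table, assuming the first
 column there is the FW position. The table and function names are hypothetical.

```python
UV_GRISM_FILTER = 10      # Grism 1 (UV) is filter 10 in the NCR 192 table
UV_GRISM_OFFSET = -60     # from this ECR: new position = old position - 60

# Assumed extract of the identifier-to-position translation (one entry only).
FW_POSITION = {UV_GRISM_FILTER: 1000}


def apply_offset(table, filter_number, offset):
    """Return a copy of the translation table with one position shifted."""
    updated = dict(table)
    updated[filter_number] += offset
    return updated


if __name__ == "__main__":
    print(apply_offset(FW_POSITION, UV_GRISM_FILTER, UV_GRISM_OFFSET))
```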