Operators Current Issues List

Current instructions for LWA1 operators as of September 29, 2014

The Heuristic Automation for LWA1 (HAL) System: The HAL system is designed to automate many aspects of running the station and provides a mechanism for the rapid processing of remote triggers. The tasks that HAL performs are:
- Generating a preliminary schedule between 16:48 and 19:12 UTC each day and e-mailing it to the operator.
- Implementing the schedule for the next UTC day between 21:36 and 23:59 UTC and e-mailing it to the operator. This includes scheduling:
  - the SDFs for that date,
  - DP INIs before/after beam forming sessions,
  - ASP filter changes,
  - TBW health checks,
  - and any gap filling projects, e.g., LO001.
- Moving MCS metadata tarballs out of the tp/mbox directory and into the relevant tp/YYMMDD directories.
- Uploading the MCS metadata tarballs to the LWA Data Archive and moving the tp/YYMMDD directories to tp/YYMMDD_done.
- Triggering the creation of new LWAdb entries and updating the file sizes.
- Monitoring the station for problems and taking action as necessary to protect it. The monitored problems include:
  - sub-system temperatures,
  - icing of the shelter HVAC units, and
  - lightning.
- Responding to triggers from the Burst Early Response Triggering (BERT) system.
The operator is responsible for checking the preliminary schedules to make sure that they do not contain scheduling conflicts, checking that the LWAdb is being updated, and maintaining an operator log. The operator also needs to watch the station for problems that HAL cannot recover from, such as sub-systems that need to be brought back after a power outage.
- Added in the new metadata handling features and updated the list of operator responsibilities (2014 Sep 29).
- Re-enabled the progressive canceling during lightning and auto-recovery after the "all-clear" now that the thunderstorm season is winding down (2014 Sep 3).
- Disabled progressive canceling during lightning and auto-recovery after the "all-clear" to support DS001. Operators will now need to manually recover the station (2014 Jul 28).
- Updated the destination of the metadata tarballs (2014 Jun 30).
- Added (2014 Jun 23).
Useful links:
- LWA1 Operator Screen: Almost everything you need to know
- TBN Histograms from PASI: Useful for checking TBN gain setting
- LWA Computing Cluster Usage: Useful for checking on the UCF
- Added (2013 Sep 24).

OLD ISSUES:

How to Automatically Record File Sizes: Frank has provided the following instructions. To automatically enter the file sizes into the lwadb database, run the following scripts after you have ingested the metadata into the lwadb. This can be done once at the end of the week.
1. Log in to each dr (dr1-dr5) and execute the scan_drsu_drx.py script found in the home directory. It takes the device name of a DRSU (e.g., /dev/md126) as its parameter, so you have to run it twice on each dr to cover both DRSUs.
2. Then run the following script, found in the home directory on dr1: fill_filesizes.py [lwadb username] [start date YYYY-MM-DD] [stop date]
  This updates the operator field and fills in the file size entries. It outputs the file tags for which it could not find a file size in the specified date range.
- Update: Performed by HAL (2014 Sep 29).
- Added (2014 Mar 11).
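The manual file-size procedure above can be sketched as a list of commands to run. The sketch below only builds and prints the command lines; it does not log in anywhere or execute anything. The `~/` invocation style, the `/dev/mdX` placeholder for the second DRSU, the `op1` username, and the YYYY-MM-DD format for the stop date are assumptions made for illustration, not details taken from the instructions.

```python
from datetime import date

# The five data recorders named in the instructions.
DATA_RECORDERS = ["dr1", "dr2", "dr3", "dr4", "dr5"]


def file_size_commands(drsu_devices, username, start, stop):
    """Return (host, command) pairs for the weekly file-size update.

    drsu_devices: device names of the DRSUs on each dr (the operator must
    supply the real ones; only /dev/md126 appears in the instructions).
    """
    commands = []
    # One scan per DRSU on every data recorder.
    for host in DATA_RECORDERS:
        for device in drsu_devices:
            commands.append((host, f"~/scan_drsu_drx.py {device}"))
    # Finally, fill in the file sizes from dr1 (date format is assumed).
    commands.append(
        ("dr1", f"~/fill_filesizes.py {username} {start.isoformat()} {stop.isoformat()}")
    )
    return commands


if __name__ == "__main__":
    # "/dev/mdX" is a hypothetical placeholder for the second DRSU device.
    for host, cmd in file_size_commands(
        ["/dev/md126", "/dev/mdX"], "op1", date(2014, 3, 3), date(2014, 3, 9)
    ):
        print(f"{host}: {cmd}")
```

With two DRSUs per recorder this yields ten scan commands followed by one fill_filesizes.py command on dr1.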
Triggered DS001 Observations: If you receive a triggering request for DS001 from Mike Stock do the following:
1. On tp, change into the ~op1/DS001/ directory
2. Run ./runDS001.py. This script will take care of configuring ASP and DP, setting up the observations, and copying the data to the cluster.
3. After the observations have finished, shut down ASP and DP until the "all-clear" e-mail has been sent.
- Thunderstorm season winding down for 2014 (2014 Sep 3).
- Added (2014 Jul 28).
Lightning: If you receive a warning from the station or have other cause to believe that lightning is in the area you should shut down both DP and ASP. You will get an "all-clear" e-mail from the station once 30 minutes have gone by without any lightning.
- Update: Performed by HAL (2014 Jun 23).
- Added (2013 Sep 24).
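The 30-minute "all-clear" rule in the lightning instructions above is simple time arithmetic. The following is a minimal sketch of that rule only; it is not how the station itself decides to send the e-mail, and the function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# The "all-clear" goes out once 30 minutes pass without any lightning.
ALL_CLEAR_DELAY = timedelta(minutes=30)


def expected_all_clear(last_strike):
    """Earliest time the all-clear can be expected after the last strike."""
    return last_strike + ALL_CLEAR_DELAY


def is_all_clear(last_strike, now):
    """True once at least 30 minutes have elapsed since the last strike."""
    return now - last_strike >= ALL_CLEAR_DELAY


if __name__ == "__main__":
    # Illustrative timestamp only, not a real event.
    last = datetime(2014, 7, 28, 20, 0, tzinfo=timezone.utc)
    print("Expect all-clear no earlier than", expected_all_clear(last).isoformat())
```

Any new strike resets the clock, so the operator should recompute from the most recent warning.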
Jupiter Observing Season: Make sure when observing Jupiter (LH010) that ASP is in FULL bandwidth mode.
- Update: Performed by HAL (2014 Jun 23).
- Added (2013 Oct 1).
Gap filling with LO001: Whenever there is a gap of >1 hour during night time hours when there are no observations scheduled, please follow these instructions.
1. Identify gaps in the schedule where there are no beam observations scheduled. If the following criteria are true then continue:
  - between sunset and sunrise, i.e., the sun is down
  - the gap is >1 hour
2. Determine the time in seconds between now and the next scheduled observation if LO001 is supposed to start immediately. Otherwise, determine the duration of the gap to be scheduled and note the UTC time of the beginning of the gap, i.e., the end of the last observation. There is no need to schedule a DP INI after the end of the previous observation; it will be taken care of by this script.
  
  Example 1: /home/op1/LO001/runLO001_split.py 3600
  To schedule a 1 hour observation to start immediately using ASP in split bandwidth mode (current default for LEDA compatibility).
  
  Example 2: /home/op1/LO001/runLO001_split.py -t 12:00:00 -d 2013/11/06 3600
  To schedule a 1 hour observation starting at 12:00:00 UT on 2013/11/06.
  
  The script will issue a DP INI within 2 minutes after it is executed or 2 minutes after the specified start time. It runs INIdp.sh, which reissues a DP INI if the calibration fails and hopefully brings DP into an operational state before the beam observations begin. The script then generates sdfs for beams 2, 3, and 4 and puts them into the appropriate directory, provided that the directory for that date exists in the ~/MCS/tp/ directory. It then calls tpss and schedules the sdfs. The sdfs are scheduled to start 15 min after the execution of the script or the specified start time and to end 15 min before the end of the duration you have specified, to leave time for a DP INI. The script also checks whether the sdf files appeared in mesq.dat; if not, it reissues the tpss command, giving up if it has not managed to get them queued within 5 minutes.
  
  The script also adds three 'at' commands to the queue:
  1. INI DP before the beam observations begin
  2. Start TBN after the beam observations begin
  3. INI DP after the beam observations are finished, 13 min before the end of the specified duration.
3. After running the script please do a sanity check using the OpScreen webpage. If there are any issues or feature requests please contact me and I will see if there is a bugfix needed or a feature to be added to make the life of the operator easier.
- Update: Performed by HAL (2014 Jun 23).
- Added (2013 Nov 11).
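The planning arithmetic in the LO001 gap-filling instructions above can be sketched as follows. This is only an illustration of the criteria and timings described in the text, not part of runLO001_split.py; the helper names are hypothetical, and whether the sun is down is passed in by the operator rather than computed.

```python
from datetime import datetime, timedelta

SCRIPT = "/home/op1/LO001/runLO001_split.py"
MIN_GAP = timedelta(hours=1)   # the gap must be longer than 1 hour
PAD = timedelta(minutes=15)    # time left on either side of the sdfs


def gap_qualifies(gap_start, gap_end, sun_is_down):
    """Apply the two criteria: the sun is down and the gap is >1 hour."""
    return sun_is_down and (gap_end - gap_start) > MIN_GAP


def lo001_command(duration_s, start=None):
    """Build the command line in the form used by Examples 1 and 2."""
    if start is None:
        # Start immediately (Example 1).
        return f"{SCRIPT} {duration_s}"
    # Start at a given UTC time and date (Example 2).
    return f"{SCRIPT} -t {start:%H:%M:%S} -d {start:%Y/%m/%d} {duration_s}"


def beam_window(start, duration_s):
    """Beam sdfs run from 15 min after the start to 15 min before the end."""
    end = start + timedelta(seconds=duration_s)
    return start + PAD, end - PAD


if __name__ == "__main__":
    start = datetime(2013, 11, 6, 12, 0, 0)
    print(lo001_command(3600, start))
    print(beam_window(start, 3600))
```

For the 1 hour request in Example 2, this gives a beam window of 12:15:00 to 12:45:00 UT, which is why short gaps leave little observing time once the DP INIs are accounted for.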
GRB triggers: We are now triggering observations of GRBs with 4 beams in the FULL ASP setting. If a GRB occurs and it is possible for us to initiate an observation then the station will automatically send an e-mail with detailed instructions. Basically, files will be placed in the validator queue, so they need to be retrieved and then submitted. Then DP likely needs to be INI'd, and ASP set to FULL. All this is time critical, so to improve our chances any one of Greg, Jayce, Frank, or Kevin may initiate the observations. Whoever is first will see the getValidatorQueue.sh script move the files. It might be a good idea for this person to e-mail lwa1ops to claim that the submission is going forward.
- Update: Performed by HAL and BERT (2014 Jun 23).
- Added (2013 Nov 11).
LS003: Please run the following scripts before and after each LS003 observation:
Before: setASP_LS003.sh
After: setLEDA64_split.sh
- Update: Performed by HAL (2014 Jun 23).
- Added (2014 Mar 11).
SHL reports rack 3 current as 0: The PDU for rack 3 (DP) is currently broken and not reporting. DP is powered but we have no control over the power at the moment. Hopefully this will be fixed soon.
- Update: Fixed with new Raritan PDU (2014 Jun 6).
- Update: Problem back (2013 Dec 11).
- Update: Fixed (2013 Oct 3).
- Added (2013 Sep 24).
Operator Log: All operator logs must now be entered using the opLog.py program. We are trying to address: (1) logs getting lost in e-mail land, (2) too much variation in operator style, formats, etc., (3) missing information about downtimes. The new tool is in the OperatorLog extension (on hercules this can be found in /usr/local/extenstions/OperatorLog) and is called "opLog.py". All logs are stored in the meta-data area on the LWA archive.
- Update: Moved to Operator Manual (2013 Dec 13).
- Added (2013 Nov 19).
SDF Validator: We have a new way of accepting SDFs via the SDF Submission web page. Once a valid set of SDFs has been submitted an e-mail will go out to lwa1ops. The operator should put the files in place. To do this there is a new script in ~op1/MCS/sch/operatorScripts/getValidatorQueue.sh on tp to copy the files over to the correct directory in ~op1/MCS/tp and delete the files off fornax. If you see an unavoidable conflict please alert Greg or Jayce.
- Update: Moved to Operator Manual (2013 Dec 13).
- Added (2013 Dec 3).
Default Sky Frequency for TBN: Currently this is 37.80 MHz. Please run TBN whenever beams are not in operation, and make sure that PASI is running and LWA-TV looks good. If you have questions contact Greg Taylor or Jake Hartman.
- Update: Moved to Operator Manual (2013 Dec 13).
- Added (2013 Sep 24).
MCS Executive Crashes: About once a week we have been seeing MCS Executive stop operating in the middle of an observation. We need to properly document this rare fault so that we can track it down and fix it. The symptom is that all commands stop processing and the 5 minute polling of MCS Executive stops. If you see this happen, please copy meelog.txt and mselog.txt and send them to lwa1staff. Please also describe what was going on at the time of the crash.
- Update: Appears to be fixed (2013 Dec 3).
- Update: Beware of sessions ending while others are running. This might be the source (2013 Oct 29).
- Added (2013 Oct 8).