################################################################## SHORT TERM PROJECTS ################################################################## Kreymer on shift at Minos Mon-Thu Nov 30- Dec 3 OPEN TICKETS 124576 Account and group alignment ( gfactory/gfrontend ) _____________________________________________________________ ACTIVE 5 - test UPS AFS -> fermiapp Follow up on update of resolv.conf with 131.225.8.120 at top roundup cedar file cleanup, recent note from Howie nonap file cleanup fnpcsrv1 file cleanup ECRC READ READ/SAM DFARM /minos/home in bluearc, for minospro, minosana, mindata, Alternate name possibilities : /minos/app/home /grid/home ( exists, but used for grid internal accounts ) /if/home /numi/home /neutrino/home /fnal/home /bluearc/home /netapp/home /nfs/home CSI - make and export this area, to Minos Cluster nodes only, root unsquashed to one node FEF - mount this on Minos Cluster create minospro, minosana accounts initially. give root access to one node ( minos-sam02 dev/int ? ) rootunsquashed host for bluearc Then move analysis tree to /minos/data2. ( June 21/22 downtime ) Or /minos/scratch ? _____________________________________________________________ web - look at Mengel's document http://www-css.fnal.gov/csi/webdocs/access_apache.html#certip re web access control ( hope for public Ganglia access ) mysql - archive using slave database clean out remnants on minos-mysql1 different rpms see /minos/scratch/kreymer/ minos11.rpms minos12.rpms Offsite access to rexganalia1 via pass/cert samadmin - autoregister seems to send mail to sam-admin, should be minos_sam_admin srmcp -2 should start doing this ? FARM - merge daily suppression lists in /minos/data/minfarm/lists/daq_lists/sup Total 1957 lines , 3 MBytes in 730 files. Staging raw data - followup missing files - in migration ? need to correct file names/paths using enmv. 
RAW DATA crc info sam update file crc --file= --crcValue=...L --crcType="adler 32 crc type" AFS timeouts - pursue minos-mysql1 timeouts correlated to nwest ssh connx Check 'WasIScanned' at security web page Write Minos computing annual review Clean up kreymer email copies to new/old desktop. Add monitor and plot of kcron timing for DNS/Kerberos diagnostic fall protection training, for underground access in the Minerva era ? Make timeline plot of raw data times, to verify deferred writes histogram of times for last week. ############################################################################# ############################################################################# W O R K L O G ############################################################################# ############################################################################# IMMEDIATE mcimport - correct the backgrounding ( -c or -l run in foreground ) move directory list inside loop control code DCache - update pool affinities for *sntp* Implement grid/cluster test of Bluearc client load, for DDN etc tests, then reply again to rayp re his 21 Oct mail. steal code from admin/mysql/scripts/contest,contact lock - use date rate limit (optional), as reported by bluwatch bluwatch - fill the PERF file for lock ============================================================================= 2009 11 23 ============================================================================= ############ # MCIMPORT # ############ mindata@minos27 rm /minos/data/mcimport/STOP $ cp -a AFSS/mcimport.20091123 . 
$ ln -sf mcimport.20091123 mcimport # was mcimport.20091006

set nohup ./mcimport -l 99999 ALL &

Mon Nov 23 16:06:48 UTC 2009 minos27 ALL 7285 SLEEP 1/300 waiting for files

=============================================================================
2009 11 21
=============================================================================

############
# MCIMPORT #
############

touch /minos/data/mcimport/STOP
Sat Nov 21 09:24:40 CST 2009

=============================================================================
2009 11 20
=============================================================================

########
# GRID #
########

Date: Fri, 20 Nov 2009 20:53:47 -0800 (PST)
From: Ryan B. Patterson
To: minos_software_discussion@fnal.gov, minos_batch@fnal.gov, minos-admin@fnal.gov
Subject: glidein disconnect problems (probably) resolved

Hi,

FermiGrid admin has identified the cause of the glidein failures we've been suffering from since Thursday afternoon (details below). MINOS glideins seem to be running successfully now, but let me know if you get any of these disconnect errors through Saturday.

Ryan

--> More detail: the FermiGrid upgrades on Thursday included a new version of glexec that is inexplicably orphaning (rather than terminating) daughter processes upon exit, filling up the memory on the worker nodes. Eventually, the kernel starts killing things (including the glidein condor_startd processes), and you see a disconnect error in your Condor log file. FermiGrid has opted simply to roll back to the previous version of glexec, and this seems to have restored normal MINOS glidein operation.
__________________

Ticket search ?

INC000000016927 11/20/2009 10:52:06 PM timm
INC000000016642 11/18/2009 10:00:34 PM Patterson glidein to CMS nodes?
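Sketch only, not something that was run here: one way to spot orphans like the ones Ryan describes, which have been re-parented to init (PPID 1) and are still holding memory. The function names and the 'minos' account in the usage note are made up for illustration.

```shell
# Keep records "pid ppid user rss comm" (read from stdin) that belong to
# user $1 and whose parent is init (PPID 1); largest resident set first.
filter_orphans() {
    awk -v user="$1" '$2 == 1 && $3 == user' | sort -k4,4nr
}

# Live use: feed in the process table, one -o per column with an empty
# header ("=") so that only data lines come out.
list_orphans() {
    ps -e -o pid= -o ppid= -o user= -o rss= -o comm= | filter_orphans "$1"
}
```

Usage would be something like 'list_orphans minos' on a worker node. Daemons are legitimately parented by init as well, so the output is a list of candidates, not proof of a leak.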
############ # PREDATOR # ############ Cleaned up the mangled .py files noted yesterday N00017210_0001 cdm cd GDAT/neardet_data/2009-11/ less N00017210_0001.log Error in : nargs (3) not consistent with expected number of arguments ([0-0]) rm N00017210_0001.sam.py* cds ; ./predator F00045028_0000 cdm cd GDAT/neardet_data/2009-11/ less F00045028_0000.log no obvious problem, just rerun this. rm F00045028_0000.sam.py cds ; ./predator These files are OK now. Must have been a network problem Wednesday night. ############# # MDSUM_LOG # ############# mdsum_log.20091116 - updated to put only directories in SMALLS ARK > ln -sf mdsum_log.20091116 mdsum_log # was mdsum_log.20081124 ############ # MCIMPORT # ############ mcimport.20091120 moved the directory list inside the -l loop ####### # WEB # ####### Added data2 rates to dhmain.html. ln -sf dhmain.20091120.html dhmain.html # was dhmain.20091118.html ########### # BLUEARC # ########### Subject: monitoring /minos/data2 from minos25 $ mkdir /minos/data2/maint/bluwatch $ GDM=/minos/data2/maint/bluwatch/minos25 $ mkdir -p $GDM date NF=0 while [ ${NF} -lt 200 ] ; do NFST=`printf "%3.3d" ${NF}` mkdir -p ${GDM}/${NFST:0:2} cp /var/tmp/100M ${GDM}/${NFST:0:2}/file${NFST} echo ${GDM}/${NFST:0:2}/file${NFST} (( NF ++ )) done date Fri Nov 20 00:06:32 CST 2009 /minos/data2/maint/bluwatch/minos25/00/file000 /minos/data2/maint/bluwatch/minos25/00/file001 ... 
/minos/data2/maint/bluwatch/minos25/19/file198
/minos/data2/maint/bluwatch/minos25/19/file199
$ date
Fri Nov 20 00:10:12 CST 2009

kreymer@Minos25
cdadmin
cd bluearc
set nohup ./bluwatch -r -S 100000000 -s 120 -b /minos/data2/maint/bluwatch/minos25 &
set nohup ./bratenow -n minos25 -T /minos/data2 &
set nohup ./bratenow -n minos25 -T /minos/data2 -w &

touch /afs/fnal.gov/files/expwww/numi/html/computing/dh/bluwatch/rate/2009/11/16/minos25.txt

=============================================================================
2009 11 19
=============================================================================

##########
# PARROT #
##########

Parrot builds have been too big recently.

$ cd /afs/fnal.gov/files/data/minos/d119/GROWFS
$ du -sm 20091030
139 20091030
$ du -sm 20091117
1516 20091117

$ wc -l 20091030/.growfsdir
4022115 20091030/.growfsdir
$ wc -l 20091117/.growfsdir
43255297 20091117/.growfsdir

$ grep '^D ' 20091030/.growfsdir | wc -l
365860
$ grep '^D ' 20091117/.growfsdir | wc -l
2983359

$ grep '^D XrdSeckrb4' 20091030/.growfsdir | wc -l
35
$ grep '^D XrdSeckrb4' 20091117/.growfsdir | wc -l
35

$ grep '^F UgliGeometry.bin' /afs/fnal.gov/files/data/minos/d119/GROWFS/20091030/.growfsdir | wc -l
298
$ grep '^F UgliGeometry.bin' /afs/fnal.gov/files/data/minos/d119/GROWFS/20091117/.growfsdir | wc -l
5924

Find the path to an instance of UgliGeometry.bin
//doc/UserManual/build-logs/Fri/Linux2.6-GCC_3_4/ ...

Robert found circular symlinks to and from /grid/fermiapp/...
Rebuilding the index, around 15:30

Date: Thu, 19 Nov 2009 16:08:50 -0600
From: Robert Hatcher

Okay, Art and I tracked down a pair of symlinks that formed a weird circular path that was causing make_growfs.auto to go nuts. The new indexing has now completed.
______________________________________________________
Date: Thu, 19 Nov 2009 14:14:35 -0800 (PST)
From: Ryan B. Patterson

Thanks! I can now successfully get directory listings from the affected areas. I'll start the factory up.
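Sketch for next time, not what Robert and Art actually ran: flag symlinks whose resolved target is a directory that contains the link itself, the simplest kind of loop a tree walker can fall into. The function name is made up, and 'readlink -f' assumes GNU coreutils. The pair found on 19 Nov ran to and from /grid/fermiapp, so it may have had a different shape that this check would miss.

```shell
# Print "link -> target" for every symlink under $1 whose fully resolved
# target is the link's own directory or one of that directory's ancestors.
find_ancestor_links() {
    find "$1" -type l | while IFS= read -r link; do
        # Resolve the link to an absolute physical path (GNU readlink).
        target=$(readlink -f -- "$link") || continue
        # Physical path of the directory that holds the link.
        here=$(cd "$(dirname -- "$link")" && pwd -P) || continue
        case "$here/" in
            "$target"/*) printf '%s -> %s\n' "$link" "$target" ;;
        esac
    done
}
```

Running it on a release tree before building the index would list the suspects. A loop formed by two links in separate trees needs both trees checked, and a walker that records visited directories by device and inode is the more general cure.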
______________________________________________________ Date: Thu, 19 Nov 2009 14:25:50 -0800 (PST) From: Ryan B. Patterson My test job reveals the absence of /minos mount points on at least one node (fnpc231). I've stopped the factory. I'll be away from my desk for the next hour or so. Let me know if I should submit a ticket about the mount points. (Art said this was an on-going issue from earlier in the day...) ______________________________________________________ Date: Thu, 19 Nov 2009 23:05:33 +0000 (GMT) From: Arthur Kreymer Which mount points are missing ? I see : -bash-3.00$ ls -ld /minos/scratch /minos/data /minos/data2 drwxrwxrwx 8 root root 4096 Oct 29 10:44 /minos/data drwxrwxrwx 23 root root 4096 Nov 16 16:00 /minos/data2 drwxrwxrwx 241 root root 18432 Nov 10 05:26 /minos/scratch -bash-3.00$ hostname fnpc231.fnal.gov -bash-3.00$ date Thu Nov 19 17:05:04 CST 2009 ______________________________________________________ Date: Thu, 19 Nov 2009 15:07:05 -0800 (PST) From: Ryan B. Patterson They were not there before, but they are now. I'll submit a test job once more for completeness... ______________________________________________________ I do see a failure on fnpc206 001 (808499.000.000) 11/19 16:50:06 Job executing on host: <131.225.167.206:33583> But it is OK at 17:11 . 
Scan for other nodes : grep '/minos/scratch' logs/glide/808*.err Consistent with empty .out files, -rw------- 1 kreymer g020 0 Nov 19 09:10 logs/glide/probe.808316.0.out -rw------- 1 kreymer g020 0 Nov 19 09:20 logs/glide/probe.808317.0.out -rw------- 1 kreymer g020 0 Nov 19 16:17 logs/glide/probe.808330.0.out -rw------- 1 kreymer g020 0 Nov 19 16:17 logs/glide/probe.808331.0.out -rw------- 1 kreymer g020 0 Nov 19 16:17 logs/glide/probe.808336.0.out -rw------- 1 kreymer g020 0 Nov 19 16:17 logs/glide/probe.808338.0.out -rw------- 1 kreymer g020 0 Nov 19 16:17 logs/glide/probe.808337.0.out -rw------- 1 kreymer g020 0 Nov 19 16:17 logs/glide/probe.808332.0.out -rw------- 1 kreymer g020 0 Nov 19 16:17 logs/glide/probe.808333.0.out -rw------- 1 kreymer g020 0 Nov 19 16:17 logs/glide/probe.808339.0.out -rw------- 1 kreymer g020 0 Nov 19 16:17 logs/glide/probe.808334.0.out -rw------- 1 kreymer g020 0 Nov 19 16:50 logs/glide/probe.808499.0.out for SEC in 808316 808317 808330 808331 808332 808333 \ 808334 808336 808337 808338 808339 808499 ; do host `grep executing logs/glide/probe.${SEC}.0.log | cut -f 2 -d '<' | cut -f 1 -d :` | cut -f 5 -d ' ' done | sort -u fnpc204.fnal.gov. fnpc206.fnal.gov. fnpc213.fnal.gov. fnpc222.fnal.gov. fnpc223.fnal.gov. fnpc231.fnal.gov. fnpc240.fnal.gov. fnpc341.fnal.gov. for HOST in 204 206 213 222 223 231 240 341 ; do ssh -akx fnpc${HOST} ls -ld /minos/scratch ; done 2>/dev/null ______________________________________________________ Date: Thu, 19 Nov 2009 15:07:05 -0800 (PST) From: Ryan B. Patterson They were not there before, but they are now. I'll submit a test job once more for completeness... ______________________________________________________ Date: Thu, 19 Nov 2009 15:10:31 -0800 (PST) From: Ryan B. Patterson Okay, the factory is back up, and I've had two successful Parrot jobs, one with a minossoft environment set up. I think we're ready for an all-systems-back-up announcement. 
(I can't send email to minos-users, or else I would send the email myself.) Thanks, Robert and Art, for solving the Parrot issue. ______________________________________________________ ############ # SHUTDOWN # ############ Systems are upgraded to 2.6.9-89.0.16.ELsmp Date: Thu, 19 Nov 2009 11:16:52 -0600 From: Etta Burns To: minos-admin@fnal.gov Cc: run2-sys@fnal.gov Subject: Minos Downtime - kernel updates The kernel updates/reboots on the servers has been completed. minos-mysql(1-3) minos-sam0(1-4) minos01 minos0(3-27) MINOS25 > minos_q -- Summary of minos25.fnal.gov : <131.225.193.25:62476> : minos25.fnal.gov OWNER RUN IDLE HELD OLDEST_JOB grafnj 0 0 0 11/14 17:12 0+00:00:00 paloon_SriptPlus5A rubin 1 0 0 11/18 15:40 0+16:18:51 ana_mc_driver.glid TOTALS 1 0 0 Farm glideins: R=16 I=0 H=0 Post-shutdown tests : OK >>> kreymer@minos26 FINISHED Thu Nov 19 11:15:13 UTC 2009 STOPPED Thu Nov 19 13:06:06 UTC 2009 /local/scratch26/kreymer/log/predator/2009-11.log ls -l /local/scratch26/kreymer/log/predator/STOP rm -f /local/scratch26/kreymer/log/predator/STOP OK ran predator manually 12:34 mindata@minos27 ls -l /minos/data/mcimport/STOP rm -f /minos/data/mcimport/STOP OK Test sam station UNIV=dev UNIV=prd OK >>> kreymer@minos26 verify monitor.minos26 /bin/bash /afs/fnal.gov/files/home/room1/kreymer/minos/scripts/bratenow /bin/bash /afs/fnal.gov/files/home/room1/kreymer/minos/scripts/bratenow -w /bin/bash ./bratenow -n d0mino06 -d /grid/data/monitor -o /afs/fnal.gov/files/expwww/numi/html/computing/dh/bluearc/rates -T /prj_root/3024 -w /bin/bash ./bratenow -n d0mino06 -d /grid/data/monitor -o /afs/fnal.gov/files/expwww/numi/html/computing/dh/bluearc/rates -T /prj_root/3024 /bin/bash ./bratenow -n d0mino05 -d /grid/data/monitor -o /afs/fnal.gov/files/expwww/numi/html/computing/dh/bluearc/rates -T /prj_root/5012 -w /bin/bash ./bratenow -n d0mino05 -d /grid/data/monitor -o /afs/fnal.gov/files/expwww/numi/html/computing/dh/bluearc/rates -T /prj_root/5012 /bin/bash ./bluwatch 
-r -S 100000000 -b /grid/data/minos/bluwatch/minos27 -d 10

OK >>> kreymer@minos27

./bluwatch -r -S 100000000 -b /grid/data/minos/bluwatch/minos27 -d 10
/afs/fnal.gov/files/home/room1/kreymer/minos/scripts/bratenow
/afs/fnal.gov/files/home/room1/kreymer/minos/scripts/bratenow -w
./bratenow -n d0mino06 -d /grid/data/monitor -o /afs/fnal.gov/files/expwww/numi/html/computing/dh/bluearc/rates -T /prj_root/3024 -w
./bratenow -n d0mino06 -d /grid/data/monitor -o /afs/fnal.gov/files/expwww/numi/html/computing/dh/bluearc/rates -T /prj_root/3024
./bratenow -n d0mino05 -d /grid/data/monitor -o /afs/fnal.gov/files/expwww/numi/html/computing/dh/bluearc/rates -T /prj_root/5012 -w
./bratenow -n d0mino05 -d /grid/data/monitor -o /afs/fnal.gov/files/expwww/numi/html/computing/dh/bluearc/rates -T /prj_root/5012

MINOS27 > cd crontab
MINOS27 > crontab minos27.kreymer

OK >>> kreymer@minos-mysql1

verify @reboot
verify mysqld

OK >>> kreymer@minos-mysql2

verify @reboot
verify mysqld

The topdb_log scripts were running, no output, on mysql1/2.
Worked fine interactively.
Tried via cron long after boot,
03 13 * * * /afs/fnal.gov/files/expwww/numi/html/computing/admin/mysql/scripts/topdb_log minos-mysql2
Corrected topdb_log to include /usr/krb5/bin in the path

Restarted >>> minfarm@minos-sam04

./roundup -l 999999 -r dogwood0 near
./roundup -l 999999 -r dogwood0 far

OK >>> kreymer@minos-sam04

@reboot - change to
./bluwatch -r -S 100000000 -s 120 -b /minos/scratch/bluwatch/minos-sam04 -d ..
/bratenow -n minos-sam04 -T /minos/scratch
/bratenow -n minos-sam04 -T /minos/scratch -w

cdadmin
cd crontab
crontab minos-sam04.kreymer

OK >>> ORACLE

Date: Thu, 19 Nov 2009 09:14:25 -0600
From: Maurine Mihalek

minosora1 has been rebooted and is back up. jared will be patching oracle beginning now.

Date: Thu, 19 Nov 2009 10:07:16 -0600
From: Jared M Platson

Oracle patching is complete. Please check everything and make sure all is functioning properly.
OK >>> CONDOR MINOS25 > sudo /etc/init.d/condor start Starting up Condor MINOS25 > date Thu Nov 19 13:41:30 CST 2009 ps -flu condor shows condor running on all but minos03 out of disk minos08 out of disk minos11 not configured Need to start up gfactory, gfrontend once Parrot is healthy ########### # MINOS27 # ########### Rates logged by minos27 seem to be hit hard, varying between normal and under 1 MB/sec since 02:00 CST. cannot ssh to minos27 - no response Ganglia reports minos27 being down MRTG reports high data rates to minos27 around 00:00 and 02:00, to 0 around 04:45 and 07:45. /minos/scratch/kreymer/log/mrtg.minos27.20091117.png ____________________________________________________________________________ Date: Thu, 19 Nov 2009 08:23:13 -0600 (CST) Confirmation Notification Request INC000000016688 requested by you has been submitted. Status: New Summary: minos27 nfs/oom problems last night Notes: FEF primary - run2-sys@fnal.gov Node minos27 had problems overnight. It responds to pings, but logins via ssh or rsh hang up. Existing interactive windows do not respond. Ganglia monitoring claims the node is down. There was unusual network activity, see an MRTG snapshot at /minos/scratch/kreymer/log/minos27/mrtg.minos27.20091117.png I just managed to get logged in around 08:05. And the formerly locked screens cleared up. I've copied /var/log/messages* to /minos/scratch/kreymer/log/minos27/minos27.messages* I see a lot of nfs_status errors starting in /var/log/messages.3, like Oct 26 11:02:15 minos27 kernel: nfs_statfs: statfs error = 116 Then Nov 19 04:49:22 minos27 kernel: nfs_statfs: statfs error = 116 Nov 19 04:53:02 minos27 kernel: warning: many lost ticks. 
Nov 19 04:53:02 minos27 kernel: Your time source seems to be instable or some driver is hogging interupts
Nov 19 04:53:02 minos27 kernel: rip __do_softirq+0x4d/0xd0
Nov 19 04:53:02 minos27 kernel: Falling back to HPET
Nov 19 04:54:03 minos27 ntpd[3861]: no servers reachable
Nov 19 05:08:46 minos27 kernel: oom-killer: gfp_mask=0xd2
Nov 19 05:08:46 minos27 kernel: Mem-info:

Then

Nov 19 05:08:46 minos27 kernel: Out of Memory: Killed process 24483 (make_growfs.aut).
...
Nov 19 08:05:54 minos27 kernel: Out of Memory: Killed process 26910 (make_growfs.aut).

This may be related to a make-growfs script we ran yesterday. We should investigate this script after the reboots.

Please look into the nfs errors, if they persist.
____________________________________________________________________________
Date: Thu, 19 Nov 2009 09:36:26 -0600 (CST)

Status: In Progress
____________________________________________________________________________
Date: Thu, 19 Nov 2009 09:57:41 -0600 (CST)

Hello Art,

The machine has run out of memory and swap space right after 2am, and came back after 8am. Did the script got started around 2am?

There are two mounts that are stale (/minos/data2 and /minos/test9293). I think the statfs errors are result of that. We have had the errors since Oct 26.

Do you want us to reboot the machine, or just try remounting the nfs mounts?

... ling
____________________________________________________________________________
Date: Thu, 19 Nov 2009 16:02:01 +0000 (GMT)
From: Arthur Kreymer

The suspected growfs script was running at those times. We will debug this script later.

Please proceed with the scheduled reboot for the kernel upgrade.

Please remove the /minos/test9293 and /minos/test mounts.
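Sketch only: the 'Out of Memory: Killed process' lines quoted in the ticket can be tallied per command, to show at a glance which program the oom-killer kept hitting. The function name is made up, and the pattern matches the message wording of this kernel (2.6.9); other kernels phrase it differently.

```shell
# Count oom-killer victims by command name.
# Reads the named log files, or stdin when no file is given.
oom_victims() {
    sed -n 's/.*Out of Memory: Killed process [0-9][0-9]* (\(.*\))\.*$/\1/p' "$@" |
        sort | uniq -c | sort -rn
}
```

For example 'oom_victims /minos/scratch/kreymer/log/minos27/minos27.messages*' on the copies saved above. Note that the kernel truncates the command name, which is why make_growfs.auto shows up as make_growfs.aut.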
____________________________________________________________________________
Date: Thu, 19 Nov 2009 10:29:55 -0600 (CST)

Status: Completed
____________________________________________________________________________

###########
# BLUEARC #
###########

Date: Thu, 19 Nov 2009 08:29:52 -0600
From: Ramon C. Pasetes
To: 'Arthur Kreymer'
Cc: 'Jason Allen', 'Glenn Cooper', "'storage-admins@fnal.gov'"
Subject: Update on Minos Storage Change on BlueArc
Parts/Attachments:
1 OK ~40 lines Text
2 Shown ~193 lines Text
----------------------------------------

Migration of Minos data to new storage
   Status: Complete.
   Notes:  New area -> blue2:/minos/data
           Old area -> blue2:/minos/data-old

 * New areas will be accessible by MINOS users once FEF reboots their nodes for patching.
     o There should be no other changes required from FEF to get new area mounted.
     o The old area is only accessible via minos27 (read-only). The admin for that server will need to mountup this area.
 * All areas have completed copying except /minos/data/mcimport and /minos/data/nue_group.
 * Files are still copying from this area. If a user reports that a file is missing from one of these two areas, it is probably because that file has not yet been copied to the new area. These copies are progressing but could take hours to days to complete. If they absolutely need their files back asap, they can access the old area and manually copy the file over versus waiting for our copy job to do it for them.

-Central Storage Admins
____________________________________________________________
To : minos-users@fnal.gov, minos-admin@fnal.gov, minos-shifters@fnal.gov
Cc : minos_batch@fnal.gov, minos_sim@fnal.gov, minos_software_discussion@fnal.gov, minosdb-support@fnal.gov
Attchmnt:
Subject : Re: Fermilab Minos Server shutdowns Thu 2009 Nov 19
----- Message Text -----

On Thu, 19 Nov 2009, Arthur Kreymer wrote:

> The Condor system is still down.
> It took a while to get the file system mounted correctly on FermiGrid
> and we are working on Parrot problems.

The Parrot problems have been resolved, and we seem to have /minos/scratch mounted everywhere again.

Ryan has restarted the Glidein factory.

Thanks for your patience !

( The copy of /minos/data/mcimport and /minos/data/nue_group* is still underway, so be careful with those areas. )
____________________________________________________________

MINOS27 > du -sk /minos/data-old/nue_group_*
1614742908 /minos/data-old/nue_group_files
2906852568 /minos/data-old/nue_group_tmp

MINOS27 > du -sk /minos/data2/nue_group_*
1615697760 /minos/data2/nue_group_files
du: cannot read directory `/minos/data2/nue_group_tmp/tmp/output_electrons_nd_mc': Permission denied
2837286272 /minos/data2/nue_group_tmp
Fri Nov 20 08:46:10 CST 2009

MINOS27 > du -sk /minos/data2/nue_group_tmp ; date
du: cannot read directory `/minos/data2/nue_group_tmp/tmp/output_electrons_nd_mc': Permission denied
2858938016 /minos/data2/nue_group_tmp
Fri Nov 20 08:47:48 CST 2009

MINOS27 > dds -d /minos/data2/nue_group_tmp/tmp/output_electrons_nd_mc
drwx------ 2 root root 2048 Nov 19 08:53 /minos/data2/nue_group_tmp/tmp/output_electrons_nd_mc/
____________________________________________________________
Date: Fri, 20 Nov 2009 08:06:46 -0600
From: Ramon C. Pasetes

As of this morning, 8:05AM, still going....
____________________________________________________________

MINOS27 > du -sk /minos/data2/nue_group_tmp ; date
2911766912 /minos/data2/nue_group_tmp
Fri Nov 20 10:32:58 CST 2009

MINOS27 > du -sk /minos/data2/nue_group_tmp ; date
2912209728 /minos/data2/nue_group_tmp
Fri Nov 20 10:33:58 CST 2009

Still running, still some files owned by root.
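Rough arithmetic on the two du samples taken a minute apart (10:32:58 and 10:33:58), as a sketch: the growth in KB divided by the elapsed seconds gives the copy rate into nue_group_tmp. The function name is made up. du takes a while to walk the tree, so the interval and hence the rate are only approximate.

```shell
# rate_kb_per_s KB1 T1 KB2 T2
# Integer KB/s between two "du -sk" samples taken at times T1 and T2 (seconds).
rate_kb_per_s() {
    echo $(( ($3 - $1) / ($4 - $2) ))
}

# The 10:32:58 and 10:33:58 samples, 60 seconds apart.
rate_kb_per_s 2911766912 0 2912209728 60    # prints 7380
```

That is about 7 MB/s for this one-minute window. The figure does not give a finish time, because the data2 total had already passed the 2906852568 KB reported for data-old, presumably from differences in how the two volumes allocate blocks.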
MINOS27 > find /minos/data2/nue_group_tmp -user root /minos/data2/nue_group_tmp/tmp/output_electrons_nd_mc/.output_electrons_n13037023_0003.root.RrfqsT This directory is quickly getting filled du -sm /minos/data-old/nue_group_tmp/tmp/output_electrons_nd_mc 22785 MINOS27 > ls /minos/data-old/nue_group_tmp/tmp/output_electrons_nd_mc | wc -l 9529 MINOS27 > ls /minos/data2/nue_group_tmp/tmp/output_electrons_nd_mc | wc -l 1009 1326 Fri Nov 20 10:38:26 CST 2009 MINOS27 > ls /minos/data2/nue_group_tmp/tmp/output_electrons_nd_mc | wc -l ; sleep 100 1425 1741 Rate seems to be about 3 files/second. 2879 3215 MINOS27 > ls /minos/data2/nue_group_tmp/tmp/output_electrons_nd_mc | wc -l 9529 MINOS27 > find /minos/data2/nue_group_tmp -user root MINOS27 > date Fri Nov 20 11:26:44 CST 2009 MINOS27 > du -sk /minos/data2/nue_group_tmp 2934422176 /minos/data2/nue_group_tmp MINOS27 > du -sk /minos/data2/nue_group_tmp 2934572384 /minos/data2/nue_group_tmp MINOS27 > date Fri Nov 20 13:49:17 CST 2009 Checking mcimport MINOS27 > find /minos/data2/mcimport -user root /minos/data2/mcimport/mho/log/L010185R_n1104/.n11047039_0022_L010185R_D07_r4.log.46fqrT /minos/data2/mcimport/mho/log/L010185R_n1204 MCIN Fri Nov 20 14:34:27 CST 2009 MINOS27 > find /minos/data2/mcimport/rodriges -user root /minos/data2/mcimport/rodriges/log/L010185N_f2143 du -sk /minos/data2/mcimport $ du -sk /minos/data-old/mcimport MINOS27 > DIRS=`ls /minos/data-old/mcimport | sort ` MINOS27 > for DIR in ${DIRS} ; do \ printf "${DIR} " ; ls /minos/data-old/mcimport/${DIR} | wc -l ; done ( selecting big hitters ) rmehdi 1491 rodriges 554 sjc 481 for DIR in ${DIRS} ; do \ printf "${DIR} " ; ls /minos/data-old/mcimport/${DIR}/mcin | wc -l ; done wingmc 3347 data-old wingmc 1169 data2 MINOS27 > du -sh /minos/data-old/mcimport/wingmc/mcin 826G /minos/data-old/mcimport/wingmc/mcin Sat Nov 21 08:16:41 CST 2009 MINOS27 > ls /minos/data2/mcimport/rodriges | wc -l 554 MINOS27 > ls /minos/data2/mcimport/sjc | wc -l 481 MINOS27 > find 
/minos/data2/mcimport -user root /minos/data2/mcimport/sjc/log/L010185_fardet find: /minos/data2/mcimport/sjc/log/L010185_fardet: Permission denied /minos/data2/mcimport/sjc/log/L010185_farrock find: /minos/data2/mcimport/sjc/log/L010185_farrock: Permission denied /minos/data2/mcimport/sjc/log/L010185_farrocknutau find: /minos/data2/mcimport/sjc/log/L010185_farrocknutau: Permission denied MINOS27 > find /minos/data2/mcimport/sjc -user root find: /minos/data2/mcimport/sjc/rockmu/.n12035048_0007_L010185N_D07_r2.rockmu.hr.hRkqrT: No such file or directory 9353 /minos/data-old/mcimport/sjc/rockmu 7483 /minos/data2/mcimport/sjc/rockmu MINOS27 > ls /minos/data-old/mcimport/sjc/rockmu | wc -l 1907 MINOS27 > ls /minos/data2/mcimport/sjc/rockmu | wc -l 1690 MINOS27 > ls /minos/data2/mcimport/sjc/rockmu | wc -l ; sleep 100 1701 1776 ____________________________________________________________ Date: Sat, 21 Nov 2009 15:22:37 +0000 (GMT) From: Arthur Kreymer The copy of /minos/data/mcimport is still running. This is nearly a Terabyte of recent data to be copied to mcimport/wingmc/mcin, so this may take another day. ____________________________________________________________ ____________________________________________________________ ############ # PREDATOR # ############ Mangled .py files starting around FINISHED Wed Nov 18 21:19:55 2009 InvalidMetadata: Invalid Metadata specified for file 'N00017210_0001.mdaq.root' of type 'importedDetector': This was killed, after time out around Wed Nov 18 23:09:36 UTC 2009 STARTED Thu Nov 19 05:12:04 2009 OOPS - no tape location in F00045028_0000.sam.py This was killed, after time out around Wed Nov 18 23:14:59 UTC 2009 ============================================================================= 2009 11 18 ============================================================================= ####### # NAS # ####### Date: Wed, 18 Nov 2009 15:52:32 -0600 From: Ramon C. 
Pasetes To: "'site-nas-announce@fnal.gov'" Subject: Slides from Today's Users' Meeting The slides from today's users' meeting has been posted here: http://computing.fnal.gov/nasan/talks.html Our next meeting will be held in January, 2010, and the main topic will be our backup service. Thank you, Central Storage Admins ########## # ORACLE # ########## Date: Wed, 18 Nov 2009 14:42:37 -0600 From: Jared M Platson Maurine will be patching the OS and I will be patching Oracle on minosora1 OS/Oracle Patching Production databases minosprd and minerprd tomorrow November 19 at 9am  Patching should be complete in approximately 2 hours. ######### # MYSQL # ######### Updated the .table file to set the password and account for mysql status ... as is done for stop and start. ######### # MYSQL # ######### Date: Wed, 18 Nov 2009 17:52:16 +0000 (GMT) From: Arthur Kreymer To: minosdb-support@fnal.gov Subject: init.d/mysql scripts available for minos-mysql2 I have drafted init.d/mysql and config/init.mysql scripts for minos-mysql2. See the files in /home/minsoft/ups/db/mysql/config : init.d.mysql # to be put in /etc/rc.d/init.d/mysql init.mysql # called by mysql I took the route of using ups start and ups stop rather than grepping for and killing mysqld processes. I modified the table file so that 'ups stop' works without a password, getting MYSQL_PWD from ${MYSQL_DATA}/mysql.pwd I have tested these on my desktop system. Let's test them on minos-mysql2 during Thursday's shutdown. ____________________________________________________________ Date: Wed, 18 Nov 2009 12:19:26 -0600 (CST) Confirmation Notification Request INC000000016589 requested by you has been submitted. Status: New Summary: minos-mysql2 init.d/mysql update Notes: FEF primary - run2-sys@fnal.gov We have a new init.d/mysql script, which corrects errors in the previous script and which should eliminate the need for further changes by FEF. 
Please update /etc/rc.d/init/mysql on minos-mysql2 before the Thursday reboot for the kernel upgrade, and configure it to run at shutdown and reboot. The new file can be copied from http://cdcvs0.fnal.gov/cgi-bin/public-cvs/cvsweb-public.cgi/minossoft/admin/mysq/init/init.d.mysql?rev=1.1 &content-type=text/x-cvsweb-markup or from minos-mysql2:/home/minsoft/ups/db/mysql/config/init.d.mysql Thanks ! ____________________________________________________________ Date: Wed, 18 Nov 2009 15:42:07 -0600 (CST) Status: Completed copied new file into place. ____________________________________________________________ ####### # WEB # ####### dhmain.html - Added NET IP link to ipchicken.com This reports your IP address, helps detect proxy servers. ln -sf dhmain.20091118.html dhmain.html # was dhmain.20090912.html ============================================================================= 2009 11 17 ============================================================================= ######### # MYSQL # ######### Testing init.d scripts per lebedeva discussions Testing mysql on my desktop (SLF 5.3), where I have root, ################## # MINSOFT on ARK # ################## Account Mysql> ypcat passwd | grep minsoft minsoft:KERBEROS:9979:9531:Minos Software:/home/minsoft:/bin/bash ARK > nedit /etc/passwd minsoft:KERBEROS:9979:9531:Minos Software:/home/minsoft:/bin/bash ARK > /usr/sbin/pwconv GROUP Mysql> ypcat group | grep mysql mysql:x:9531:nwest HOME ARK > mkdir /home/minsoft ARK > chown minsoft.mysql /home/minsoft ACCESS ARK > echo "kreymer@FNAL.GOV" > /home/minsoft/.k5login ARK > chown minsoft.mysql /home/minsoft/.k5login ################ # MYSQL ON ARK # ################ Per HOWTO.mysqladmin # UPS # cat > ${HOME}/setups.sh << 'EOF' # set up ups and add our own products area , filtering out /afs products unset UPS_DIR unset SETUP_UPS . 
/usr/local/etc/setups.sh export PRODUCTS=${HOME}/ups/db:`printf "${PRODUCTS}\n" | tr : \\\n | grep -v ^/afs | head -1` EOF # BOOTUPS # AFSP=/afs/fnal.gov/files/code/e875/general/ups cd ${HOME} mkdir -p ups/db/foo mkdir -p ups/db/.upsfiles mkdir -p ups/db/.updfiles cp ${AFSP}/db/.upsfiles/dbconfig ups/db/.upsfiles/dbconfig nedit ups/db/.upsfiles/dbconfig changed /afs/fnal.gov/files/code/e875/general/ups to /home/minsoft/ups cp ${AFSP}/db/.updfiles/updconfig ups/db/.updfiles/updconfig # UPD installation of mysql # . ${HOME}/setups.sh setup upd # version of mysql to install MYSQLVER=v5_0_67 upd list -aK+ mysql upd install -j mysql ${MYSQLVER} upd install succeeded. ups declare -c mysql ${MYSQLVER} DECLARE: A UPS start/stop exists for this product # TAILOR # mkdir ${HOME}/database export LD_LIBRARY_PATH=/lib ups tailor mysql -bash-3.2$ ups tailor mysql chgrp: invalid group `products' Enter valid path for mysql data directory: /home/minsoft/database Never use default port number 3306 for any mysql server instances! Assign your port number here:3306 You can update mysql server options in my.cnf file before you start mysql server. Please assign a new username for your mysql daemon. For security it is recommended to substitute this name for mysql root in a mysql database. See README file in your mysql datadir for more details. Do not forget to set a strong password for root user IMMEDIATELY after initial startup of mysql daemon! Then replace root username with the newly assigned username. Enter your new username here:mysql There are small,medium,large or huge cnf files in /home/minsoft/ups/prd/mysql/v5_0_67/Linux-2-6/share/mysql directory. Which one you would like to use (s/m/l/h)? h Installing MySQL system tables... /home/minsoft/ups/prd/mysql/v5_0_67/Linux-2-6/libexec/mysqld: error while loading shared libraries: libssl.so.4: cannot open shared object file: No such file or directory Installation of system tables failed! 
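Sketch only: the loader reports one missing library per attempt, so asking ldd for the whole list at once saves a few rounds of trial and error. The function names are made up for illustration.

```shell
# Read ldd output on stdin and print the libraries that did not resolve.
parse_missing() {
    awk '/=> not found/ {print $1}'
}

# List the unresolved shared libraries of the binary given as $1.
missing_libs() {
    ldd "$1" | parse_missing
}
```

For example 'missing_libs /home/minsoft/ups/prd/mysql/v5_0_67/Linux-2-6/libexec/mysqld'. An empty result means every library resolved.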
ROOT ARK > ln -s libssl.so.0.9.8e /lib/libssl.so.4 /home/minsoft/ups/prd/mysql/v5_0_67/Linux-2-6/libexec/mysqld: error while loading shared libraries: libcrypto.so.4: cannot open shared object file: No such file or directory ROOT ARK > ln -s libcrypto.so.0.9.8e /lib/libcrypto.so.4 ups start mysql ups rootpass mysql Root password : TEST SHUTDOWN export MYSQL_PWD=... mysqladmin -u root processlist mysqladmin -u root shutdown UPS STOP Changed the table file to allow ups stop mysql Removed -p from the mysqladmin which stops mysql export MYSQL_PWD from ${MYSQL_DATA}/mysql.pwd ######### # ADMIN # ######### Date: Tue, 17 Nov 2009 12:15:14 -0600 (CST) Request INC000000016470 requested by you has been submitted. Status: New Summary: minos27:/minos/data-old mount Thur. Notes: FEF Primary - run2-sys CSI/SNS is copying /minos/data2 to new media. The copy is essentially complete, and is being kept up to date. On Thursday morning at 08:00 they will lock the original disks and export them to minos27 as blue2:/minos/data-old Please mount these disks as /minos/data-old on minos27, as part of the kernel update, after 08:00 Thursday 19 November. The /etc/fstab entry would look like blue2.fnal.gov:/minos/data-old /minos/data-old nfs rsize=32768,rw,timeo=600,proto=tcp,vers=3,hard,intr 0 0 The rest of the Minos servers will automatically see the new copies when they reboot with new kernels on Thursday morning. They will not need access to the old disks. _________________________________________________________________________ Date: Thu, 19 Nov 2009 10:56:24 -0600 (CST) Status: In Progress _________________________________________________________________________ Date: Thu, 19 Nov 2009 10:57:07 -0600 (CST) Status: Completed blue2.fnal.gov:/minos/data-old is now mounted on minos27 as a static mount. 
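An /etc/fstab line like the one in the ticket above can be checked for repeated mount options before it is sent off. A sketch, assuming bash and awk; `fstab_dup_opts` is a made-up name, and it compares only option names in the fourth field, not their values.

```shell
#!/bin/bash
# Hypothetical helper: print the name of any mount option that appears
# more than once in the 4th field of a single fstab line.
fstab_dup_opts() {
    awk '{
        n = split($4, opt, ",")
        for (i = 1; i <= n; i++) {
            key = opt[i]
            sub(/=.*/, "", key)          # timeo=600 becomes timeo
            if (++seen[key] == 2) print key
        }
    }' <<< "$1"
}

# The data-old entry requested above has no repeats, so this prints nothing.
fstab_dup_opts 'blue2.fnal.gov:/minos/data-old /minos/data-old nfs rsize=32768,rw,timeo=600,proto=tcp,vers=3,hard,intr 0 0'
```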
_________________________________________________________________________ _________________________________________________________________________ _________________________________________________________________________ ============================================================================= 2009 11 16 ============================================================================= ############# # MDSUM_LOG # ############# mdsum_log.20091116 - copied from .20090518 ( which was never used in anger) kreymer@minos-sam04 cds ./mdsum_log.20091116 "" /minos/data2 Need to touch this up to reject top level bare files SMALLS=`find . -type d -maxdepth 1 | grep / | cut -f 2 -d / | sort | grep -v "${GBIG}"` ####### # NAS # ####### Date: Mon, 16 Nov 2009 11:20:30 -0600 From: Ramon C. Pasetes To: "'site-nas-announce@fnal.gov'" Subject: Storage User's Meeting 11/18 @ 1:30PM Hornet's Nest Time: Wednesday, 11/18, 1:30PM - 3:00PM Location: WH8x Hornet's Nest There have been a lot of changes in CD lately, including changes to our service. This will be the first of a series of meetings between the central storage service and the users of the service. The intention of these meeting is to provide information on the use of the various services we provide. For this first meeting we will be discussing: 1) Brief description of CD re-org with respect to our group. 2) What services we provide (NAS/BlueArc being one of them). 3) More in-depth view of our NAS service including: a) Changes in storage purchases b) Some lessons learned from this past Summer c) Things to keep in mind (when using our service) d) Workflow for end-user requests Thank you. Central Storage Admins ######## # GRID # ######## Date: Mon, 16 Nov 2009 10:53:18 -0600 From: Frank J. Nagy To: linux-users , Mac Users Group , "csi-adm@fnal.gov" Subject: To all Linux and Mac users: new HSM-based KCA Servers are live! 
Parts/Attachments: 1 Shown 24 lines Text (charset: ISO-8859-1) ---------------------------------------- [ This message was cryptographically signed but the signature could not be verified. ] The new HSM-based (Hardware Security Module) KCA Servers went live on the production systems this morning (a bit after 7 AM). In order to use the new software, you must have the new clients which support 1024-bit keys (in the kx509 binary) otherwise you will get error messages about your request being rejected due to having too short of a key. If you need to update your client, please see this page: http://computing.fnal.gov/xms/Services/Getting_Services/Certificates/Certificate_Client_Update_Instructions ########### # MINERVA # ########### Added /grid/data per dschmitz request, as was done below for fermiapp. blue2:/fermigrid-data /grid/data nfs rsize=32768,rw,timeo=600,proto=tcp,vers=3,hard,intr,timeo=600,noexec 0 0 [root@minerva-om ~]# mkdir /grid/data [root@minerva-om ~]# nedit /etc/fstab [root@minerva-om ~]# mount /grid/data ########### # MINERVA # ########### minerva-evd, minerva-om, minerva-rc updated fstab per dschmitz, to include /grid/fermiapp Added the fstab lines similar to minos26, adding timeo=600 per other mounts on these systems cp -a fstab fstab.200911111 [root@minerva-evd etc]# cat /etc/fstab # This file is edited by fstab-sync - see 'man fstab-sync' for details LABEL=/ / ext3 defaults 1 1 LABEL=/data /data ext3 defaults 1 2 none /dev/pts devpts gid=5,mode=620 0 0 none /dev/shm tmpfs defaults 0 0 LABEL=/home /home ext3 defaults 1 2 none /proc proc defaults 0 0 none /sys sysfs defaults 0 0 LABEL=/var /var ext3 defaults 1 2 /dev/sda2 swap swap defaults 0 0 blue3.fnal.gov:/minerva/data /minerva/data nfs rw,proto=tcp,vers=3,wsize=32768,rsize=32768,hard,intr,timeo=600,noexec blue3.fnal.gov:/minerva/app /minerva/app nfs rw,proto=tcp,vers=3,wsize=32768,rsize=32768,hard,intr,timeo=600 blue2:/fermigrid-fermiapp /grid/fermiapp nfs 
rsize=32768,rw,timeo=600,proto=tcp,vers=3,hard,intr,timeo=600 0 0 Created the mount points [root@minerva-evd etc]# mkdir /grid [root@minerva-evd etc]# mkdir /grid/data [root@minerva-evd etc]# mkdir /grid/fermiapp [root@minerva-evd etc]# mount /grid/data mount: blue2:/fermigrid-data failed, reason given by server: Permission denied [root@minerva-evd etc]# mount /grid/fermiapp mount: blue2:/fermigrid-fermiapp failed, reason given by server: Permission denied These are not exported to the CRL systems. http://computing.fnal.gov/nasan/internal/vol2nfs-map.html Proceeded with fstab changes on minerva-om [root@minerva-om etc]# cp -a fstab fstab.20091113 [root@minerva-om etc]# mkdir /grid [root@minerva-om etc]# mkdir /grid/fermiapp [root@minerva-rc etc]# nedit fstab [root@minerva-rc etc]# mount /grid/fermiapp All systems are mounted now. ############ # PREDATOR # ############ predator.20091116 Corrected clearing of pid file when STOPping ln -sf predator.20091116 predator # was predator.20091114 Mon Nov 16 08:34:05 CST 2009 ============================================================================= 2009 11 14 Saturday ============================================================================= ############ # PREDATOR # ############ predator.20091114 Added support for ${PREDLOG}/STOP flag Removed duplicate code at entry date ln -sf predator.20091114 predator # was predator.20090902 Sat Nov 14 15:45:24 CST 2009 ############ # SHUTDOWN # ############ minos-users,minos_admin,minos-shifters minos_batch,minos_sim,minos_software_discussion,minosdb-support Fermilab Minos Server shutdowns Thu 2009 Nov 19 _________________________________________________________ Date: Sat, 14 Nov 2009 21:01:13 +0000 (GMT) From: Arthur Kreymer All Minos Offline systems will be shut down Thursday Nov 19 for kernel security upgrades. This includes the Minos Cluster ( minos01 through minos27 ) and Minos servers like minos-mysql2 and the SAM servers. The CRL will be down while minos-mysql2 reboots. 
We usually schedule this for 09:00 through noon. There may be an extended down time to copy Bluearc files to new disks. This is not yet scheduled. ############ # SHUTDOWN # ############ kreymer@minos26 echo 'touch /local/scratch26/kreymer/log/predator/STOP' | at 06:00 Nov 19 echo 'rm -f /local/scratch26/kreymer/log/predator/STOP' | at 12:00 Nov 19 job 34 at 2009-11-19 06:00 job 35 at 2009-11-19 12:00 mindata@minos27 echo 'touch /minos/data/mcimport/STOP' | at 04:00 Nov 19 echo 'rm -f /minos/data/mcimport/STOP' | at 12:00 Nov 19 job 6 at 2009-11-19 04:00 job 7 at 2009-11-19 12:00 ########### # BLUEARC # ########### D0 disk monitoring kreymer@minos27 cdadmin cd bluearc DATA=/grid/data/monitor WEBDIR=/afs/fnal.gov/files/expwww/numi/html/computing/dh/bluearc/rates HOST=d0mino05 set nohup ./bratenow -n ${HOST} -d ${DATA} -o ${WEBDIR} -T /prj_root/5012 & set nohup ./bratenow -n ${HOST} -d ${DATA} -o ${WEBDIR} -T /prj_root/5012 -w & HOST=d0mino06 set nohup ./bratenow -n ${HOST} -d ${DATA} -o ${WEBDIR} -T /prj_root/3024 & set nohup ./bratenow -n ${HOST} -d ${DATA} -o ${WEBDIR} -T /prj_root/3024 -w & MINOS27 > date Sat Nov 14 09:05:17 CST 2009 ############ # BLUWATCH # ############ Populated 100M areas for D0 monitoring kreymer@d0mino05 scp minos-sam04:/var/tmp/100M /var/tmp/100M 105MB 35.0MB/s GDM=/prj_root/5012/bluwatch/100M mkdir -p $GDM date NF=0 while [ ${NF} -lt 200 ] ; do NFST=`printf "%3.3d" ${NF}` mkdir -p ${GDM}/${NFST:0:2} cp /var/tmp/100M ${GDM}/${NFST:0:2}/file${NFST} echo ${GDM}/${NFST:0:2}/file${NFST} (( NF ++ )) done date Sat Nov 14 07:29:08 CST 2009 /prj_root/5012/bluwatch/100M/00/file000 /prj_root/5012/bluwatch/100M/00/file001 ... /prj_root/5012/bluwatch/100M/19/file198 /prj_root/5012/bluwatch/100M/19/file199 date Sat Nov 14 07:32:56 CST 2009 Rate 22 GB / 228 sec = 96 MB/second.
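The elapsed time and rate of each copy can be recomputed from the date stamps printed before and after the loop. A small sketch, assuming bash and awk, with 22 GB taken as 22000 MB; the function names are made up for this note. The second pair of stamps comes from the d0mino06 copy recorded in this same entry.

```shell
#!/bin/bash
# Seconds between two HH:MM:SS stamps on the same day.
elapsed_sec() {
    local IFS=:
    local -a a=($1) b=($2)
    # 10# forces base 10 so that "08" and "09" are not read as octal
    echo $(( (10#${b[0]} * 3600 + 10#${b[1]} * 60 + 10#${b[2]})
           - (10#${a[0]} * 3600 + 10#${a[1]} * 60 + 10#${a[2]}) ))
}

# MB/second, rounded to an integer, for a size in MB and a time in seconds.
rate_mb_s() {
    awk -v mb="$1" -v s="$2" 'BEGIN { printf "%.0f\n", mb / s }'
}

T=$(elapsed_sec 07:29:08 07:32:56)       # copy to /prj_root/5012
echo "5012: ${T} sec, $(rate_mb_s 22000 ${T}) MB/second"

T=$(elapsed_sec 07:35:12 08:06:01)       # copy to /prj_root/3024
echo "3024: ${T} sec, $(rate_mb_s 22000 ${T}) MB/second"
```

This prints 228 sec and 96 MB/second for 5012, and 1849 sec and 12 MB/second for 3024.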
kreymer@d0mino06 scp minos-sam04:/var/tmp/100M /var/tmp/100M 105MB 52.5MB/s GDM=/prj_root/3024/bluwatch/100M mkdir -p $GDM date NF=0 while [ ${NF} -lt 200 ] ; do NFST=`printf "%3.3d" ${NF}` mkdir -p ${GDM}/${NFST:0:2} cp /var/tmp/100M ${GDM}/${NFST:0:2}/file${NFST} echo ${GDM}/${NFST:0:2}/file${NFST} (( NF ++ )) done date Sat Nov 14 07:35:12 CST 2009 /prj_root/3024/bluwatch/100M/00/file000 /prj_root/3024/bluwatch/100M/00/file001 ... /prj_root/3024/bluwatch/100M/19/file198 /prj_root/3024/bluwatch/100M/19/file199 date Sat Nov 14 08:06:01 CST 2009 Rate 22 GB / 1850 sec = 12 MB/second. The write to 3024 from d0mino06 ran much more slowly. It seems to have slowed down read access during the write. Was the array on the edge of overload ? Rates for both arrays during the copies 5012 Sat Nov 14 07:29:08 CST 2009 Sat Nov 14 07:32:56 CST 2009 from /grid/data/monitor/rate/2009/11/14/d0mino05.txt Sat Nov 14 07:28:12 CST 2009 9/file1671 69 Sat Nov 14 07:29:12 CST 2009 9/file1672 50 Sat Nov 14 07:30:12 CST 2009 9/file1673 45 Sat Nov 14 07:31:14 CST 2009 9/file1674 7 Sat Nov 14 07:32:16 CST 2009 9/file1675 6 Sat Nov 14 07:33:16 CST 2009 9/file1676 74 Sat Nov 14 07:34:16 CST 2009 9/file1677 70 3024 Sat Nov 14 07:35:12 CST 2009 Sat Nov 14 08:06:01 CST 2009 from /grid/data/monitor/rate/2009/11/14/d0mino06.txt Sat Nov 14 07:32:06 CST 2009 9/file1740 20 Sat Nov 14 07:33:07 CST 2009 9/file1741 36 Sat Nov 14 07:34:07 CST 2009 9/file1742 24 Sat Nov 14 07:35:08 CST 2009 9/file1743 37 Sat Nov 14 07:36:09 CST 2009 9/file1744 9 Sat Nov 14 07:37:10 CST 2009 9/file1745 32 Sat Nov 14 07:38:11 CST 2009 9/file1746 9 Sat Nov 14 07:39:13 CST 2009 9/file1747 12 Sat Nov 14 07:40:14 CST 2009 9/file1748 10 Sat Nov 14 07:41:15 CST 2009 9/file1749 9 Sat Nov 14 07:42:17 CST 2009 9/file1750 10 Sat Nov 14 07:43:18 CST 2009 9/file1751 10 Sat Nov 14 07:44:20 CST 2009 9/file1752 10 Sat Nov 14 07:45:21 CST 2009 9/file1753 21 Sat Nov 14 07:46:22 CST 2009 9/file1754 9 Sat Nov 14 07:47:24 CST 2009 9/file1755 9 Sat 
Nov 14 07:48:25 CST 2009 9/file1756 10 Sat Nov 14 07:49:26 CST 2009 9/file1757 10 Sat Nov 14 07:50:28 CST 2009 9/file1758 9 Sat Nov 14 07:51:29 CST 2009 9/file1759 10 Sat Nov 14 07:52:30 CST 2009 9/file1760 16 Sat Nov 14 07:53:32 CST 2009 9/file1761 10 Sat Nov 14 07:54:33 CST 2009 9/file1762 10 Sat Nov 14 07:55:34 CST 2009 9/file1763 10 Sat Nov 14 07:56:36 CST 2009 9/file1764 9 Sat Nov 14 07:57:37 CST 2009 9/file1765 10 Sat Nov 14 07:58:39 CST 2009 9/file1766 10 Sat Nov 14 07:59:40 CST 2009 9/file1767 10 Sat Nov 14 08:00:41 CST 2009 9/file1768 24 Sat Nov 14 08:01:42 CST 2009 9/file1769 10 Sat Nov 14 08:02:44 CST 2009 9/file1770 10 Sat Nov 14 08:03:45 CST 2009 9/file1771 10 Sat Nov 14 08:04:47 CST 2009 9/file1772 9 Sat Nov 14 08:05:48 CST 2009 9/file1773 10 Sat Nov 14 08:06:48 CST 2009 9/file1774 35 Sat Nov 14 08:07:49 CST 2009 9/file1775 39 Sat Nov 14 08:08:49 CST 2009 9/file1776 43 WOW ! It looks like a single full speed write slows down the arrays to around 10 MB/second reading, for both types of array. Cleaned up local test log on d0mino ls bluwatch/rate/2009/09 22 23 24 25 rm -r bluwatch ============================================================================= 2009 11 13 ============================================================================= ######## # JIRA # ######## Reviewing documents before adding more users to offline JIRA. Preparing admin/jira/HOWTO.jira ########## # BUDGET # ########## ============ http://computing.fnal.gov/xms/Internal/Budget_%26_Finance\ ============ FY10 Budget Input https://cd-entreport.fnal.gov/budgetinput-prod0/budget/Login.asp This had a couple of sub topics which may or may not have the latest adjustments Requires an account and password ============ Budget Line Items ( needed KCA proxy loaded ) https://appora.fnal.gov/miser_ora/www/BLI_REPORT.html This apparently does not have the latest adjustments =========== Crystal Reports http://cd-entreport.fnal.gov/cdfinancial Requires an account, password, known to me.
=========== _______________________________________________ log into the budget database, select the M&S budget line items there are 2 sets to look at do searches as follows 1. "Requesting Department" = "REX" and "Activity Name"="INTENSITY FRONTIER" (this has laptops, travel, etc. ) and 2. "Requesting Department" = "NVS" and "Task Own Dept"="REX" (this shows BlueArc arrays, BLI in question 14492, 14495) ______________________________________________________________ Date: Fri, 13 Nov 2009 14:05:27 -0600 (CST) Request INC000000016225 requested by you has been submitted. Status: New Summary: Account for CD FY10 Budget input Notes: To track the Minos budget figures, I need an account under computing.fnal.gov Internal Budget & Finance FY10 Budget Input The URL is https://cd-entreport.fnal.gov/budgetinput-prod0/budget/Login.asp Please create the necessary account. For authorization, contact Lee Lueking and/or Margaret Votava. ______________________________________________________________ Date: Mon, 16 Nov 2009 10:42:25 -0600 From: Jeffrey E Mack Your user name is the same as your Kerberos principal and your password is initially set to that value. ______________________________________________________________ Date: Mon, 16 Nov 2009 10:45:03 -0600 (CST) Subject: Request INC000000016225: Status has been updated to In Progress. ______________________________________________________________ Date: Mon, 16 Nov 2009 10:46:05 -0600 (CST) Subject: Request INC000000016225: Status has been updated to Completed. User provided with user name and password ______________________________________________________________ Date: Mon, 16 Nov 2009 18:02:30 +0000 (GMT) From: Arthur Kreymer To: Jeffrey E Mack Thanks, I have access, and have changed the password.
============================================================================= 2009 11 12 ============================================================================= ######## # GRID # ######## Date: Thu, 12 Nov 2009 22:57:40 +0000 (GMT) From: Arthur Kreymer To: Ryan B. Patterson Cc: minos-admin@fnal.gov Subject: Re: Fermilab HSM based KCA Transition - Monday 16-Nov-2009 (fwd) On Thu, 12 Nov 2009, Ryan B. Patterson wrote: > I haven't followed this topic. Is this transparent to us? Do MINOS user > grid certs continue working or must we change the kproxy script? The new versions of kx509 and kxlist are already deployed on all Minos systems. For an active test, one could modify a user's cron job on minos25 to use kx509 -s fermi-kcatest01.fnal.gov I've just now hacked a new script to do just that, kproxyvt Seems to work fine kproxyi lists a good-looking proxy. My kreymer glidein probe jobs are continuing to run using this proxy. ######## # GRID # ######## Testing corelimits again : cd /minos/scratch/kreymer/condor/probe OUTS=`ls -tr logs/glide/probe*out` for OUT in ${OUTS} ; do echo grep '^HOSTNAME \|^RUN S\|^CORELIMIT \|^unlimited' ${OUT} done > /tmp/corelims CORELIMIT varies, sometimes unlimited, sometimes 0, even on the same node, such as fnpc361. ______________________________________________________________________ Date: Thu, 12 Nov 2009 23:10:57 +0000 (GMT) From: Arthur Kreymer To: Ryan B. Patterson Cc: Parag Mhashilkar , minos-admin@fnal.gov Subject: Re: Getting coredump for Minos Jobs (fwd) On Fri, 6 Nov 2009, Ryan B. Patterson wrote: > I'm not sure that all entry points are working. The "cdf" entry point > does not report a GLIDEIN_GlobusRSL value with 'condor_status -l' like the > other entry points. The gpminos entry point does report a > GLIDEIN_GlobusRSL value, but my test job is showing a zero limit. > > Trying some things... Now that all the old pilot processes surely must be gone, I have reexamined recent tests of ulimit -c for Minos glideins.
We are getting a very mixed result. I run these tests every 10 minutes. The Condor control file is /minos/scratch/kreymer/condor/probe/glide.run Look for the CORELIMIT report in the *.out files, under /minos/scratch/kreymer/condor/probe/logs/glide The same node sometimes returns 0 and sometimes unlimited. For a summary, you could do cd /minos/scratch/kreymer/condor/probe OUTS=`ls -tr logs/glide/probe*out` for OUT in ${OUTS} ; do echo grep '^HOSTNAME \|^RUN S\|^CORELIMIT \|^unlimited' ${OUT} done This includes, among other things : RUN STARTED Thu Nov 12 07:01:50 CST 2009 HOSTNAME fnpc350.fnal.gov CORELIMIT 0 ... RUN STARTED Thu Nov 12 16:31:12 CST 2009 HOSTNAME fnpc350.fnal.gov CORELIMIT unlimited ============================================================================= 2009 11 11 ============================================================================= ######## # GRID # ######## Date: Wed, 11 Nov 2009 13:41:17 +0000 (GMT) From: Arthur Kreymer To: run2-sys@fnal.gov Cc: minos-admin@fnal.gov Subject: Request INC000000015919 requested by you has been submitted. /etc/grid-security/hostcert.pem certificate expired on minos25 (fwd) All of the Minos Cluster systems running Condor have expired host certificates. This shuts down all Minos batch processing. Please install unexpired certificates everywhere, then set up a mechanism to prevent future expirations. ---------- Forwarded message ---------- Date: Tue, 10 Nov 2009 19:50:36 -0600 (CST) From: Fermilab Service Desk To: rbpatter@caltech.edu Subject: Request INC000000015919 requested by you has been submitted. /etc/grid-security/hostcert.pem certificate expired on minos25 Confirmation Notification Request INC000000015919 requested by you has been submitted. 
Status: New Summary: /etc/grid-security/hostcert.pem certificate expired on minos25 Notes: I don't know the details here, but I suspect this expired certificate: minos25:/etc/grid-security/hostcert.pem with subject: /DC=org/DC=doegrids/OU=Services/CN=minos25.fnal.gov (cert expired ~200 minutes ago) is the reason Condor services at MINOS are not working. Could someone renew this cert? (I'm not sure who does this.) Thanks, Ryan ___________________________________________________________________ ARK > for NODE in ${NODES} ; do printf "${NODE} " ; ssh -akx ${NODE} 'ls -l /etc/grid-security/hostcert.pem' ; done 2> /dev/null minos01 -rw-r--r-- 1 root root 1283 Nov 11 2008 /etc/grid-security/hostcert.pem minos03 -rw-r--r-- 1 root root 1282 Nov 11 2008 /etc/grid-security/hostcert.pem minos04 -rw-r--r-- 1 root root 1282 Nov 11 2008 /etc/grid-security/hostcert.pem minos05 -rw-r--r-- 1 root root 1282 Nov 11 2008 /etc/grid-security/hostcert.pem minos06 -rw-r--r-- 1 root root 1282 Nov 11 2008 /etc/grid-security/hostcert.pem minos07 -rw-r--r-- 1 root root 1282 Nov 11 2008 /etc/grid-security/hostcert.pem minos08 -rw-r--r-- 1 root root 1282 Nov 11 2008 /etc/grid-security/hostcert.pem minos09 -rw-r--r-- 1 root root 1282 Nov 11 2008 /etc/grid-security/hostcert.pem minos10 -rw-r--r-- 1 root root 1282 Nov 11 2008 /etc/grid-security/hostcert.pem minos11 -rw-r--r-- 1 root root 1282 Apr 23 2009 /etc/grid-security/hostcert.pem minos12 -rw-r--r-- 1 root root 1282 Nov 11 2008 /etc/grid-security/hostcert.pem minos13 -rw-r--r-- 1 root root 1282 Nov 11 2008 /etc/grid-security/hostcert.pem minos14 -rw-r--r-- 1 root root 1282 Nov 11 2008 /etc/grid-security/hostcert.pem minos15 -rw-r--r-- 1 root root 1282 Nov 11 2008 /etc/grid-security/hostcert.pem minos16 -rw-r--r-- 1 root root 1282 Nov 11 2008 /etc/grid-security/hostcert.pem minos17 -rw-r--r-- 1 root root 1282 Nov 11 2008 /etc/grid-security/hostcert.pem minos18 -rw-r--r-- 1 root root 1282 Nov 11 2008 /etc/grid-security/hostcert.pem minos19 
-rw-r--r-- 1 root root 1282 Nov 11 2008 /etc/grid-security/hostcert.pem minos20 -rw-r--r-- 1 root root 1282 Nov 11 2008 /etc/grid-security/hostcert.pem minos21 -rw-r--r-- 1 root root 1282 Nov 11 2008 /etc/grid-security/hostcert.pem minos22 -rw-r--r-- 1 root root 1282 Nov 11 2008 /etc/grid-security/hostcert.pem minos23 -rw-r--r-- 1 root root 1282 Nov 11 2008 /etc/grid-security/hostcert.pem minos24 -rw-r--r-- 1 root root 1282 Nov 11 2008 /etc/grid-security/hostcert.pem minos25 -rw-r--r-- 1 root root 1282 Nov 11 2008 /etc/grid-security/hostcert.pem minos26 -rw-r--r-- 1 root root 1283 Nov 11 2008 /etc/grid-security/hostcert.pem minos27 -rw-r--r-- 1 root root 1282 Nov 14 2008 /etc/grid-security/hostcert.pem ___________________________________________________________________ Date: Wed, 11 Nov 2009 13:47:07 +0000 (GMT) From: Arthur Kreymer To: minos-users@fnal.gov, minos_software_discussion@fnal.gov, fermigrid-help@fnal.gov, minos-admin@fnal.gov Subject: Minos Condor system is down. Waiting for new certificates. The host certificates which allow Minos Condor systems to communicate have expired. The Minos Condor system will be down until they have been refreshed. Please stand by ! 
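The ticket above asks for a mechanism to prevent future expirations. One possible shape for it, as a sketch only: a check run from cron on each node that warns when the host certificate is close to expiry. It relies on `openssl x509 -checkend`; the function name and the 30 day threshold are assumptions made for this note, not an existing FEF or Minos script.

```shell
#!/bin/bash
# Hypothetical helper: succeed (exit 0) when the certificate in $1 expires
# within $2 days. A missing or unreadable file is also reported as a problem.
cert_expires_within() {
    local cert=$1 days=$2
    # openssl x509 -checkend N exits 0 only if the certificate
    # is still valid N seconds from now
    ! openssl x509 -checkend $(( days * 86400 )) -noout -in "${cert}" > /dev/null 2>&1
}

CERT=/etc/grid-security/hostcert.pem
if cert_expires_within "${CERT}" 30 ; then
    echo "WARNING $(hostname) ${CERT} expires within 30 days, or cannot be read"
fi
```

Run daily from cron, the warning line would arrive by mail well before Condor stops working.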
___________________________________________________________________ ___________________________________________________________________ ============================================================================= 2009 11 10 ============================================================================= ######### # MYSQL # ######### minos-mysql1 - for farm Mysql> grep max_conn /data/database/my.cnf #set-variable = max_connections=250 max_connections=500 show variables like "max_connections" ; | max_connections | 450 | set global max_connections = 500 ; minos-mysql2 - for analysis show variables like "max_connections" ; | max_connections | 250 | set global max_connections = 500 ; ######## # DATA # ######## Planning to attend Gabrelle's Grid storage meeting next week http://www.doodle.com/ibitt2dxg4p5kspb ############ # BLUWATCH # ############ Restarted watch of /minos/scratch from minos-sam04, discontinued minos-mysql2 shadow monitor. Killed /bluwatch -r -b /minos/scratch/bluwatch/minos-sam04 mindata@minos27 cd /minos/scratch/bluwatch mkdir stash mv minos-sam04 stash/0 mv 100M minos-sam04 kreymer@minos-sam04 start up in directory after last sample from minos-mysql2 Mon Nov 9 07:48:09 CST 2009 12/file23 30 cdadmin cd bluearc set nohup ./bluwatch -r -S 100000000 -s 120 -b /minos/scratch/bluwatch/minos-sam04 -d 13 & http://www-numi.fnal.gov/computing/dh/bluwatch/rate/2009/11/10/minos-sam04.txt Tue Nov 10 07:32:50 CST 2009 1/file0159 5 Tue Nov 10 07:33:51 CST 2009 1/file0160 17 Tue Nov 10 07:41:26 CST 2009 13/file30 37 Tue Nov 10 07:43:30 CST 2009 13/file31 29 ============================================================================= 2009 11 09 ============================================================================= ############ # BLUWATCH # ############ Populated more 100M areas mindata@minos-sam04 GDM=/grid/data/minos/bluwatch/stash100/0 mkdir -p $GDM date NF=0 while [ ${NF} -lt 200 ] ; do NFST=`printf "%3.3d" ${NF}` mkdir -p ${GDM}/${NFST:0:2} cp /var/tmp/100M 
${GDM}/${NFST:0:2}/file${NFST} echo ${GDM}/${NFST:0:2}/file${NFST} (( NF ++ )) done date Mon Nov 9 18:51:16 CST 2009 /grid/data/minos/bluwatch/stash100/0/00/file000 /grid/data/minos/bluwatch/stash100/0/00/file001 ... /grid/data/minos/bluwatch/stash100/0/19/file198 /grid/data/minos/bluwatch/stash100/0/19/file199 Mon Nov 9 18:57:34 CST 2009 GDM=/grid/data/minos/bluwatch/stash100/1 mkdir -p $GDM Tue Nov 10 07:23:43 CST 2009 /grid/data/minos/bluwatch/stash100/1/00/file000 /grid/data/minos/bluwatch/stash100/1/00/file001 ... ############ # BLUWATCH # ############ Restarted minos27 monitor script with 100 MB files This stopped at 07:48 when my desktop crashed. Must have failed to nohup ! Mon Nov 9 07:48:17 CST 2009 7/file1213 27 Same thing happened to minos25. Mon Nov 9 07:47:40 CST 2009 09/file091 28 The bratewk processes are also gone from minos25. Shift the 100MB minos25 files to minos27, pickup where minos25 left off. mindata@minos27 cd /grid/data/minos/bluearc mv minos27 stash/1 mv minos25 minos27 kreymer@minos27 cdadmin cd bluearc set nohup ./bluwatch -r -S 100000000 -b /grid/data/minos/bluwatch/minos27 -d 10 & Mon Nov 9 18:37:07 CST 2009 10/file100 10 Mon Nov 9 18:38:15 CST 2009 10/file101 13 Mon Nov 9 18:39:24 CST 2009 10/file102 12 Mon Nov 9 18:40:35 CST 2009 10/file103 8 ######### # MYSQL # ######### minos-mysql2 max_connections=250 increased to 500 by lebedava ============================================================================= 2009 11 06 ============================================================================= ############ # SHUTDOWN # ############ ############ # BLUWATCH # ############ cvs committed bluwatch.new to bluwatch, -S SIZE support ######## # LOCK # ######## cvs committed current lock ( lock status e875 ) ####### # WEB # ####### Updated old proton plot links for FY09 using links under http://www-bdnew.fnal.gov/pbar/FixedTargetPlots/FY09/01Oct09/ProtonPlots.html cddh rm protons..html # cleaned up stray link protons..html # updated links 
for FY08 plots, per Changed names of current pages to show just the FY. cp protons.20081014.html protons.2009.html # edited for FY09 cp -a protons.20081014.html protons.2010.html # edited to correct FY09 link ln -sf protons.2010.html protons.html # was protons.20081014.html ########### # MINOSDB # ########### Date: Fri, 06 Nov 2009 13:58:02 -0600 From: Jared M Platson Monday 1pm I plan on patching minosora3.  This server is development and contains the following databases on it:    minosdev, minosint, and minerdev.  Please confirm that this is OK ______________________________________________________________________ Date: Fri, 06 Nov 2009 14:02:18 -0600 From: Maurine Mihalek we received email from redhat today that there is a new kernel out. i would like to patch the o/s on monday and put the new kernel in place and reboot for it to be effective. can this be done during the same downtime? ______________________________________________________________________ Date: Fri, 06 Nov 2009 20:27:44 +0000 (GMT) From: Arthur Kreymer Thanks for the warning ! Please proceed with the Oracle quarterly patches and RedHat kernel updates on minosora3 at your convenience. I will test the Minos database applications Monday after the updates. ______________________________________________________________________ ______________________________________________________________________ ______________________________________________________________________ ########### # MINOSDB # ########### Configured the list to suppress the subscripts footers : [ Note: This message contains email list management information ] Misc-Options= NO_RFC2369 Date: Fri, 06 Nov 2009 19:33:50 +0000 (GMT) From: Arthur Kreymer To: minosdb-support@fnal.gov Subject: minosdb-support NO_RFC2369 footers removed I have taken the liberty of disabling the RFC2369 option for the minosdb-support mailing list. 
This produced a link at the bottom of each email to the list, allowing you to use your mail reader to subscribe, unsubscribe, or send mail to the list owner. Since this is a static list of only 13 people, many of whom are owners of the list, this feature seems of little utility. As evidenced by a recent unsubscribe request, it was not achieving its intended purpose. If anyone really loved the feature, I could trivially turn it back on. Meanwhile, enjoy your slightly less cluttered minosdb-support email ! ######### # MYSQL # ######### Date: Fri, 06 Nov 2009 13:17:08 -0600 (CST) Confirmation Notification Request INC000000015596 requested by you has been submitted. Status: New Summary: minos-mysql2 disk expansion Notes: FEF primary - run2-sys@fnal.gov Please purchase and plan installation of expanded disk on minos-mysql2. minos-mysql2 is the Minos Calibration Database server. Presently, it is configured like the other servers purchased in FY08, two mirrored system disks, and two user data disks ( mirrored in this case ). All disks are 250 GB. The /data disk is at about half capacity, which is fine. The minsoft account and associated /var/minsoft working files are filling the system disk, requiring manual intervention. And nobody is happy with the use of the /var/minsoft path. The DBA group has requested that we make more space available. As I recall, there is room in these systems for more disk, typically 1 TB. I suggest purchasing a pair of 1TB or similar disks of FEF's choice, to be mirrored and deployed as /home, to hold files presently in /home and /var/minsoft. Details are extremely negotiable, to meet FEF's preferences. A natural target date would be the Nov 19 scheduled maintenance day.
__________________________________________________________________________ Date: Fri, 06 Nov 2009 16:03:01 -0600 (CST) Status: In Progress __________________________________________________________________________ Date: Fri, 06 Nov 2009 16:10:50 -0600 (CST) Although there are more drive bays in the chassis, there are no more ports on the 3Ware RAID controller. We could replace the existing pair of 250GB disks with a pair of 1TB or even 2TB disks, and use that for /data, /home, and /var/minsoft. This would of course require some downtime if you want to migrate existing data in the /data filesystem. What do you think? __________________________________________________________________________ Date: Fri, 06 Nov 2009 23:33:42 +0000 (GMT) From: Arthur Kreymer Ideally, the system, home and data disks should be physically separate. Is it reasonable to replace or supplement the 3ware controller ? What was done for similar FY08 systems which were deployed full of disks ? __________________________________________________________________________ Date: Fri, 13 Nov 2009 15:38:37 -0600 (CST) Status: Pending __________________________________________________________________________ ######## # GRID # ######## Date: Fri, 06 Nov 2009 13:01:55 -0600 (CST) Request INC000000015594 requested by you has been submitted. Status: New Summary: /minos/app mount on Fermigrid Notes: fermigrid-help@fnal.gov : At the next Fermigrid maintenance opportunity, please change the mount point for minos-nas-0.fnal.gov:/minos/scratch from /minos/scratch to /minos/app and provide a compatible symbolic link, on all Fermigrid resources where /minos/scratch now exists. ( GPfarm, CDF pools, fnpcsrv*, etc ) The next likely target is November 19. This is not urgent. 
_______________________________________________________ Date: Thu, 19 Nov 2009 10:54:11 -0600 (CST) From: Glenn Cooper I think Etta and Seth have the changes made on minosxxx nodes, and Steve has the new mount point in the automount map for GP Grid nodes. Do you need/want the compatibility link on the grid nodes too, or is it only needed on interactive nodes? Since /minos/app is automounted on grid nodes, we can't make a symlink pointing to it, but we could make another automount entry so that minos-nas-0.fnal.gov:/minos/scratch can be mounted either as /minos/app or as /minos/scratch. (Or both at once, that would still work fine.) Shall we do that, or leave it out and have only the /minos/app mount for the worker nodes? _______________________________________________________ Date: Thu, 19 Nov 2009 18:11:55 +0000 (GMT) From: Arthur Kreymer Yes, we need /minos/scratch to continue to appear on grid nodes. This is the location of user code. _______________________________________________________ Date: Thu, 19 Nov 2009 12:52:25 -0600 (CST) From: Steven Timm To: Glenn Cooper Cc: Mark Schmitz , kreymer@fnal.gov Subject: Re: INC000000015594, /minos/app (fwd) OK--I have made this change on the GP Grid cluster master. No action is required on FEF's part. Do you have any objection if I make the same change on CDF? _______________________________________________________ Date: Thu, 19 Nov 2009 13:00:22 -0600 (CST) From: Glenn Cooper To: Steven Timm Cc: Mark Schmitz , kreymer@fnal.gov Subject: Re: INC000000015594, /minos/app (fwd) Just to let everybody know, Steve called me and we agreed that this should go to CDF nodes too. Steve is doing that now. _______________________________________________________ Date: Thu, 19 Nov 2009 13:02:28 -0600 (CST) From: Steven Timm Done. (and tested, and it works). 
_______________________________________________________ Date: Thu, 19 Nov 2009 13:04:26 -0600 (CST) Status: Completed The /minos/app and /minos/scratch are now available on all GPGrid, CDF and fnpcsrv* nodes _______________________________________________________ _______________________________________________________ ######### # ADMIN # ######### The 5 new FY09 servers were received and delivered Nov 4, according to the Purchase database. Requisition 211651 ============================================================================= 2009 11 05 ============================================================================= ########### # MONTHLY # ########### DATASETS 11/5 PREDATOR 11/5 VAULT 11/5 MYSQL 11/5 ls -alF /var/minsoft/archive rm -r /var/minsoft/archive/20091004 scripts/dbarchive STARTED DBARCHIVES Thu Nov 5 14:57:59 CST 2009 not enough space, clear some time gzip -1 /var/minsoft/SLAVE/offline/PULSERGAIN.MYD real 10m26.571s user 6m34.732s sys 0m24.900s 17064 /var/minsoft/SLAVE/offline/PULSERGAIN.MYD 6167 /var/minsoft/SLAVE/offline/PULSERGAIN.MYD.gz time gzip -1 /var/minsoft/SLAVE/offline/DCS_HV.MYD real 9m35.627s user 6m32.356s sys 0m17.150s 14742 /var/minsoft/SLAVE/offline/DCS_HV.MYD 5207 /var/minsoft/SLAVE/offline/DCS_HV.MYD.gz scripts/dbarchive STARTED DBARCHIVES Thu Nov 5 15:26:35 CST 2009 Archiving OFFLINE Thu Nov 5 15:26:42 CST 2009 Archiving BINLOGS Thu Nov 5 15:52:10 CST 2009 MD5SUMMING archives Thu Nov 5 15:52:16 CST 2009 ########## # CONDOR # ########## Date: Mon, 02 Nov 2009 15:08:37 -0600 From: Parag Mhashilkar I am not sure if you have been communicated about the fix to this issue. The instructions to configure factory to get the coredump back is available at - http://www.uscms.org/SoftwareComputing/Grid/WMS/glideinWMS/doc.v2/install/faq.html You need to reconfigure and restart the factory. Please let me know if you have any questions. 
___________________________________________________________________ How to configure glideinWMS to retrieve core files produced by user jobs In the Factory configuration in the tag, the rsl attribute should contain condor_submit=('+Coresize' 'Unlimited'). Example (all on one line): rsl="(condor_submit=('+Coresize' 'Unlimited'))(queue=default)(jobtype=single)" In the condor_submit file of the user job add: +coresize = unlimited NOTE: The coresize for the actual user job will be set to the available disk space, not unlimited. Also, you can make this change to global parameter or local parameter. _____________________________________________________________________________ Date: Thu, 05 Nov 2009 10:41:53 -0800 (PST) From: Ryan B. Patterson Does this also require a version upgrade? We are running (I think) v1_5_1 of glideinWMS. What's the first version that has the fix? _____________________________________________________________________________ Date: Thu, 05 Nov 2009 13:26:12 -0600 From: Parag Mhashilkar Actually there is no fix required. That changes are to the factory configuration. In you glideinWMS.xml file for every entry in the rsl attribute, you need to make changes to enable creation of core files for the glidein job itself. This gets propagated to the actual job. For example for one of the entry I have on my test node ... needs rsl value changed to rsl="(condor_submit=('+Coresize' 'Unlimited'))(queue=d0prod)(jobtype=single)" _____________________________________________________________________________ Date: Thu, 05 Nov 2009 11:44:16 -0800 (PST) From: Ryan B. Patterson Thanks. Is "Unlimited" a special case? We have tried large numbers there, but never "Unlimited". Also, the rsl strings we have at the moment have "condorsubmit=foo" rather than "condor_submit=foo" (i.e., no underscore.) Are these equivalent? Is one incorrect? (I wonder if our constraints have been doing nothing all this time. 
I don't think we would have noticed, given the minimal constraints included in the strings.) _____________________________________________________________________________ Date: Thu, 05 Nov 2009 13:51:54 -0600 From: Parag Mhashilkar Where are you trying this constraints? In the rsl of user job's condor submit files or in the factory's config file. My impression from previous conversations was the job submit file and not the factory configuration. Can you please confirm it? _____________________________________________________________________________ Date: Thu, 05 Nov 2009 13:53:14 -0600 From: Parag Mhashilkar Forgot to add the url. note about unlimited is available in http://www.uscms.org/SoftwareComputing/Grid/WMS/glideinWMS/doc.v2/install/faq.html _____________________________________________________________________________ Date: Thu, 05 Nov 2009 12:17:56 -0800 (PST) From: Ryan B. Patterson I meant the factory configuration in my previous email. (In the user job submit file, we've also had various permutations of coresize options. However, I don't believe we've ever had a "+" in front of coresize either, in the factory config or in the user job.) _____________________________________________________________________________ Date: Thu, 05 Nov 2009 14:41:57 -0600 From: Parag Mhashilkar I am almost certain that they condor_submit and condorsubmit are not equivalent. Also, the + sign signifies that the parameter is also visible on the classad of the job. So best way to check if your constraints are passed on correctly is to put + before the parameter and see if they appear in the classad. _____________________________________________________________________________ Date: Thu, 05 Nov 2009 14:14:15 -0800 (PST) From: Ryan B. Patterson I can now get grid/glidein user jobs to report a non-zero 'ulimit -c', but cores still do not get dumped. 
Is there a way to see if the glidein's condor_startd is operating with "CREATE_CORE_FILES=True" (which I have been told is a requirement). The MINOS Condor cluster has this flag set to True, and I thought the factory got its configuration from the same place, but maybe I am mistaken. _____________________________________________________________________________ Date: Thu, 05 Nov 2009 14:30:51 -0800 (PST) From: Ryan B. Patterson Nevermind. I am now able to get a core file. (I had a permissions issue in my script.) _____________________________________________________________________________ Date: Thu, 05 Nov 2009 16:32:16 -0600 From: Parag Mhashilkar You do have +coresize = unlimited in the job description file right? Also, something I wasn't explicit about in my previous mail, you need to reconfig and restart factory for the rsl changes to take effect. To see if the changes have been propagated to the daemon, you can try condor_config_val by first finding the address of the startd, condor_status -l | grep -i address and then replace the ip:port from above in command below condor_config_val -address "<131.225.204.208:59363>" coresize _____________________________________________________________________________ OUTS=`ls -tr logs/glide/probe*out` for OUT in ${OUTS} ; do echo ; grep '^HOSTNAME \|^RUN S\|^CORELIMIT \|^unlimited' ${OUT} done RUN STARTED Thu Nov 5 16:00:13 CST 2009 HOSTNAME fnpc351.fnal.gov unlimited ... RUN STARTED Fri Nov 6 08:10:26 CST 2009 HOSTNAME fnpc367.fnal.gov There are still some older pilots running. _____________________________________________________________________________ Date: Fri, 06 Nov 2009 15:12:33 +0000 (GMT) From: Arthur Kreymer To: Ryan B. Patterson Cc: Parag Mhashilkar , minos-admin@fnal.gov Subject: Re: Getting coredump for Minos Jobs (fwd) After yesterdays reconfigurations, starting around 16:00 Nov 05, my standard 'probe' jobs started reporting unlimited coredumpsize limits. 
A few of them still report 0, probably because of older pilots still running. Congratulations ! _____________________________________________________________________________ Date: Fri, 06 Nov 2009 09:31:40 -0600 From: Parag Mhashilkar Just a note, the max core dump you can get back is limited by the system configuration. Condor will cap the max size based on it. _____________________________________________________________________________ N.B. adjusted 'probe' to print CORELIMIT value on one line, for easier parsing. Jobs after 09:40 will be in this form. _____________________________________________________________________________ Date: Fri, 06 Nov 2009 08:21:15 -0800 (PST) From: Ryan B. Patterson I'm not sure that all entry points are working. The "cdf" entry point does not report a GLIDEIN_GlobusRSL value with 'condor_status -l' like the other entry points. The gpminos entry point does report a GLIDEIN_GlobusRSL value, but my test job is showing a zero limit. Trying some things... _____________________________________________________________________________ _____________________________________________________________________________ _____________________________________________________________________________ _____________________________________________________________________________ _____________________________________________________________________________ ############# # FERMIGRID # ############# MINOS25 > ls -ltr logs/glide/*.out -rw-r--r-- 1 kreymer g020 0 Nov 4 13:30 logs/glide/probe.708432.0.out -rw-r--r-- 1 kreymer g020 0 Nov 4 16:30 logs/glide/probe.709312.0.out -rw-r--r-- 1 kreymer g020 0 Nov 4 16:40 logs/glide/probe.709352.0.out -rw-r--r-- 1 kreymer g020 0 Nov 5 01:00 logs/glide/probe.711842.0.out -rw-r--r-- 1 kreymer g020 0 Nov 5 05:00 logs/glide/probe.712276.0.out -rw------- 1 kreymer g020 6131 Nov 5 08:01 logs/glide/probe.712774.0.out -rw-r--r-- 1 kreymer g020 0 Nov 5 08:10 logs/glide/probe.712801.0.out -rw------- 1 kreymer g020 5161 Nov 5 
08:21 logs/glide/probe.712837.0.out ... -rw------- 1 kreymer g020 5745 Nov 5 10:00 logs/glide/probe.713259.0.out -rw------- 1 kreymer g020 5711 Nov 5 10:11 logs/glide/probe.713303.0.out -rw-r--r-- 1 kreymer g020 0 Nov 5 10:20 logs/glide/probe.713357.0.out -rw-r--r-- 1 kreymer g020 0 Nov 5 10:30 logs/glide/probe.713387.0.out -rw-r--r-- 1 kreymer g020 0 Nov 5 10:40 logs/glide/probe.713414.0.out -rw-r--r-- 1 kreymer g020 0 Nov 5 10:50 logs/glide/probe.713447.0.out -rw------- 1 kreymer g020 6043 Nov 5 11:05 logs/glide/probe.715298.0.out -rw-r--r-- 1 kreymer g020 0 Nov 5 11:10 logs/glide/probe.716115.0.out -rw-r--r-- 1 kreymer g020 0 Nov 5 11:20 logs/glide/probe.716145.0.out -rw------- 1 kreymer g020 6231 Nov 5 11:31 logs/glide/probe.716174.0.out -rw------- 1 kreymer g020 5907 Nov 5 11:41 logs/glide/probe.716206.0.out -rw------- 1 kreymer g020 6650 Nov 5 11:51 logs/glide/probe.716240.0.out -rw------- 1 kreymer g020 6466 Nov 5 12:00 logs/glide/probe.716281.0.out -rw------- 1 kreymer g020 6083 Nov 5 12:11 logs/glide/probe.716315.0.out -rw------- 1 kreymer g020 6947 Nov 5 12:22 logs/glide/probe.716351.0.out -rw------- 1 kreymer g020 6196 Nov 5 12:30 logs/glide/probe.716387.0.out -rw------- 1 kreymer g020 6077 Nov 5 12:41 logs/glide/probe.716432.0.out -rw-r--r-- 1 kreymer g020 0 Nov 5 12:50 logs/glide/probe.716470.0.out -rw------- 1 kreymer g020 5917 Nov 5 13:01 logs/glide/probe.716527.0.out -rw------- 1 kreymer g020 5999 Nov 5 13:11 logs/glide/probe.716593.0.out -rw------- 1 kreymer g020 6076 Nov 5 13:21 logs/glide/probe.716643.0.out -rw------- 1 kreymer g020 5965 Nov 5 13:31 logs/glide/probe.716710.0.out -rw------- 1 kreymer g020 5310 Nov 5 13:40 logs/glide/probe.716751.0.out - 000 (708432.000.000) 11/04 13:30:02 Job submitted from host: <131.225.193.25:65252> ... 007 (708432.000.000) 11/04 13:30:41 Shadow exception! 
Error from starter on glidein_4913@fnpc204.fnal.gov: error changing sandbox ownership to the user 009 (708432.000.000) 11/04 13:32:34 Job was aborted by the user. less logs/glide/probe.709312.0.log Error from starter on glidein_5092@fnpc204.fnal.gov: error changing sandbox ownership to the user less logs/glide/probe.709352.0.log Error from starter on glidein_5092@fnpc204.fnal.gov: error changing sandbox ownership to the user less logs/glide/probe.711842.0.log Error from starter on glidein_20639@fnpc204.fnal.gov: error changing sandbox ownership to the user less logs/glide/probe.712276.0.log Error from starter on glidein_21150@fnpc204.fnal.gov: error changing sandbox ownership to the user less logs/glide/probe.712774.0.log # before interruption 000 (712774.000.000) 11/05 08:00:02 Job submitted from host: <131.225.193.25:65252> ... 007 (712774.000.000) 11/05 08:00:22 Shadow exception! Error from starter on glidein_1240@fnpc204.fnal.gov: error changing sandbox ownership to the user 0 - Run Bytes Sent By Job 0 - Run Bytes Received By Job ... 001 (712774.000.000) 11/05 08:00:51 Job executing on host: <131.225.167.98:38939> ... 005 (712774.000.000) 11/05 08:01:27 Job terminated. (1) Normal termination (return value 0) logs/glide/probe.712801.0.log 000 (712801.000.000) 11/05 08:10:02 Job submitted from host: <131.225.193.25:65252> ... 007 (712801.000.000) 11/05 08:10:44 Shadow exception! Error from starter on glidein_1240@fnpc204.fnal.gov: error changing sandbox ownership to the user 0 - Run Bytes Sent By Job 0 - Run Bytes Received By Job ... identical Starter amd Bytes details omitted below ... 007 (712801.000.000) 11/05 08:10:49 Shadow exception! 007 (712801.000.000) 11/05 08:10:56 Shadow exception! 007 (712801.000.000) 11/05 08:10:59 Shadow exception! 007 (712801.000.000) 11/05 08:11:02 Shadow exception! 007 (712801.000.000) 11/05 08:11:14 Shadow exception! 007 (712801.000.000) 11/05 08:11:20 Shadow exception! 007 (712801.000.000) 11/05 08:11:22 Shadow exception! 
007 (712801.000.000) 11/05 08:11:26 Shadow exception! 007 (712801.000.000) 11/05 08:11:29 Shadow exception! 001 (712801.000.000) 11/05 08:11:50 Job executing on host: <131.225.167.93:36963> ... 009 (712801.000.000) 11/05 08:12:00 Job was aborted by the user. The system macro SYSTEM_PERIODIC_REMOVE expression '(JobRunCount > 10) || (JobRunCount>=1 && ImageSize>1000000 && JobStatus==1)' evaluat ed to TRUE ... logs/glide/probe.716470.0.log 000 (716470.000.000) 11/05 12:50:01 Job submitted from host: <131.225.193.25:65252> Error from starter on glidein_5031@fnpc204.fnal.gov: error changing sandbox ownership to the user 009 (716470.000.000) 11/05 12:52:41 Job was aborted by the user. The system macro SYSTEM_PERIODIC_REMOVE expression '(JobRunCount > 10) || (JobRunCount>=1 && ImageSize>1000000 && JobStatus==1)' evaluated to TRUE Every one of these failures was on fnpc204. Probably nothing to do with the kx509 problem. Some jobs are now reconnecting after the fnpc204 failures : logs/glide/probe.716710.0.log 000 (716710.000.000) 11/05 13:30:02 Job submitted from host: <131.225.193.25:65252> ... 007 (716710.000.000) 11/05 13:30:22 Shadow exception! Error from starter on glidein_5031@fnpc204.fnal.gov: error changing sandbox ownership to the user 0 - Run Bytes Sent By Job 0 - Run Bytes Received By Job ... 001 (716710.000.000) 11/05 13:30:55 Job executing on host: <131.225.166.103:23230> fnpc204 > less /var/log/yum.log ... Oct 13 14:38:41 Updated: check-mk-agent.noarch 1.0.37-11fef Oct 29 14:35:01 Updated: check-mk-agent.noarch 1.0.37-12fef Nov 05 13:10:00 Updated: coreutils.x86_64 5.2.1-31.8.el4 Nov 05 13:10:08 Updated: freetype.x86_64 2.1.9-10.el4.7 Nov 05 13:10:09 Updated: krb5-libs.x86_64 1.3.4-60.el4_7.2 ... 
Nov 05 13:14:23 Updated: xorg-x11-tools.x86_64 6.8.2-1.EL.33.0.4 Nov 05 13:14:23 Updated: xorg-x11-twm.x86_64 6.8.2-1.EL.33.0.4 _________________________________________________________________________ Date: Thu, 05 Nov 2009 14:48:13 -0600 (CST) Request INC000000015501 requested by you has been submitted. Status: New Summary: Fermigrid fnpc204 problems Notes: fermigrid-help : Minos glideinWMS jobs assigned to fnpc204 have been failing as follows. This has been going on for at least a day. The jobs seem to bounce off of fnpc204 for a while, then get assigned to another node, where they are killed if they have bounced off of fnpc204 too often. From /minos/scratch/kreymer/condor/probe/logs/glide/probe.712801.0.log 000 (712801.000.000) 11/05 08:10:02 Job submitted from host: <131.225.193.25:65252> ... 007 (712801.000.000) 11/05 08:10:44 Shadow exception! Error from starter on glidein_1240@fnpc204.fnal.gov: error changing sandbox ownership to the user 0 - Run Bytes Sent By Job 0 - Run Bytes Received By Job ... identical Starter amd Bytes details omitted below ... 007 (712801.000.000) 11/05 08:10:49 Shadow exception! 007 (712801.000.000) 11/05 08:10:56 Shadow exception! 007 (712801.000.000) 11/05 08:10:59 Shadow exception! 007 (712801.000.000) 11/05 08:11:02 Shadow exception! 007 (712801.000.000) 11/05 08:11:14 Shadow exception! 007 (712801.000.000) 11/05 08:11:20 Shadow exception! 007 (712801.000.000) 11/05 08:11:22 Shadow exception! 007 (712801.000.000) 11/05 08:11:26 Shadow exception! 007 (712801.000.000) 11/05 08:11:29 Shadow exception! 001 (712801.000.000) 11/05 08:11:50 Job executing on host: <131.225.167.93:36963> ... 009 (712801.000.000) 11/05 08:12:00 Job was aborted by the user. The system macro SYSTEM_PERIODIC_REMOVE expression '(JobRunCount > 10) || (JobRunCount>=1 && ImageSize>1000000 && JobStatus==1)' evaluat ed to TRUE ... 
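The observation that every failure came from one worker node can be checked by tallying the "Error from starter" lines per host. This is a minimal sketch, assuming Condor user logs in the format quoted above; the function name starter_errors_by_host is illustrative and not part of any existing script.

```shell
#!/bin/bash
# Count "Error from starter" lines per execute host in Condor user logs.
# Hypothetical helper; the line format is taken from the log excerpts above:
#   Error from starter on glidein_1240@fnpc204.fnal.gov: error changing ...
starter_errors_by_host() {
    # "$@": one or more Condor user log files
    grep -h 'Error from starter on ' "$@" \
      | sed -e 's/.*Error from starter on [^@]*@\([^: ]*\).*/\1/' \
      | sort | uniq -c | sort -rn
}

# Example (not run here): starter_errors_by_host logs/glide/probe.*.log
```

Each output line is a count followed by a host name, most frequent first, so a single misbehaving node stands out at the top.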
_________________________________________________________________________ Date: Thu, 05 Nov 2009 15:02:14 -0600 (CST) From: Steven Timm fnpc204 has an ancient version of glexec rpms on there. I don't know how that would have happened. I have called for a drain of condor jobs. FEF please yum update the glexec-etc-fermigrid rpm to the current version. _________________________________________________________________________ Date: Thu, 05 Nov 2009 15:19:58 -0600 (CST) Status: Completed _________________________________________________________________________ ######### # ADMIN # ######### /usr/krb5/bin/kx509 has disappeared from minos25 Date: Thu, 05 Nov 2009 07:07:06 -0600 From: Cron Daemon To: kreymer@fnal.gov Subject: Cron /usr/krb5/bin/kcron /local/scratch25/grid/kproxy /local/scratch25/grid/kproxy: line 36: /usr/krb5/bin/kx509: No such file or directory /local/scratch25/grid/kproxy: line 38: /usr/krb5/bin/kxlist: No such file or directory MINOS25 > tail /var/log/yum.log Oct 14 17:21:27 Installed: kernel-devel.x86_64 2.6.9-89.0.11.EL Oct 28 09:35:09 Installed: vdt-ca-certs.noarch 52-1 Oct 29 14:57:18 Updated: check-mk-agent.noarch 1.0.37-12fef Nov 05 07:03:52 Updated: nspr.x86_64 4.7.6-1.el4_8 Nov 05 07:04:00 Updated: seamonkey.x86_64 1.0.9-50.el4_8 Nov 05 07:04:00 Updated: krb5-libs-fermi.i386 22:1.8d-17.LTS4 Nov 05 07:04:03 Updated: tzdata.noarch 2009o-2.el4 Nov 05 07:04:04 Updated: samba-common.x86_64 3.0.33-0.18.el4_8 Nov 05 07:04:04 Updated: krb5-workstation-fermi.i386 22:1.8d-17.LTS4 Nov 05 07:04:06 Updated: firefox.x86_64 3.0.15-3.el4 ARK > rpm -qf /usr/krb5/bin/kx509 krb5-fermi-getcert-1.0-5.i386 -bash-3.00$ hostname fnpc372.fnal.gov -bash-3.00$ rpm -qf /usr/krb5/bin/kx509 krb5-workstation-fermi-1.8d-3.LTS4.i386 -------------------- ___________________________________________________________________________ Date: Thu, 05 Nov 2009 09:40:39 -0600 (CST) Request INC000000015430 requested by you has been submitted. 
Status: New Summary: minos25 /usr/krb5/bin/kx509 missing Notes: FEF primary , run2-sys@fnal.gov On minos25 and other SLF4 systems, since today's autoyum updates, /usr/krb5/bin/kx509 is missing. For example FLXI04 > tail /var/log/yum.log | grep krb5 Nov 05 05:02:37 Updated: krb5-libs-fermi.i386 22:1.8d-17.LTS4 Nov 05 05:03:24 Updated: krb5-workstation-fermi.i386 22:1.8d-17.LTS4 If this is not corrected by about 12:00 today Minos grid jobs will start to abort due to expired proxies. ___________________________________________________________________________ Date: Thu, 05 Nov 2009 09:54:18 -0600 (CST) From: Steven Timm They will have to install the new krb5-getcert rpm to get the kx509 binary back. it was split out of krb5-workstation-fermi rpm in this iteration and made into a separate rpm for LTS4. yum install krb5-getcert should work ___________________________________________________________________________ Date: Thu, 05 Nov 2009 09:58:37 -0600 From: Keith Chadwick This appears to have updated correctly on fnpcsrv1: [root@fnpcsrv1 etc]# cd /var/log [root@fnpcsrv1 log]# ls -acl yum.log -rw-r--r-- 1 root root 6422 Nov 5 06:25 yum.log [root@fnpcsrv1 log]# tail yum.log Oct 20 14:37:20 Installed: krb5-getcert.i386 22:1.8d-15.LTS4 Oct 22 05:59:28 Updated: kdegraphics.i386 7:3.3.1-15.el4_8.2 Nov 05 06:25:24 Updated: nspr.i386 4.7.6-1.el4_8 Nov 05 06:25:27 Updated: samba-common.i386 3.0.33-0.18.el4_8 Nov 05 06:25:28 Updated: krb5-libs-fermi.i386 22:1.8d-17.LTS4 Nov 05 06:25:28 Updated: krb5-getcert.i386 22:1.8d-17.LTS4 Nov 05 06:25:32 Updated: tzdata.noarch 2009o-2.el4 Nov 05 06:25:38 Updated: firefox.i386 3.0.15-3.el4 Nov 05 06:25:39 Updated: krb5-workstation-fermi.i386 22:1.8d-17.LTS4 Nov 05 06:25:40 Updated: samba-client.i386 3.0.33-0.18.el4_8 ___________________________________________________________________________ Date: Thu, 05 Nov 2009 10:02:50 -0600 From: Jason Allen Keith, How did krb5-getcert get installed on Oct 20? Did someone install it manually? 
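The check for the missing utilities can be sketched as a small shell function. The path /usr/krb5/bin and the suggested `yum install krb5-getcert` command come from the messages above; the function name and the optional directory argument are illustrative assumptions. The function only reports; it installs nothing.

```shell
#!/bin/bash
# Report whether kx509 and kxlist are present and executable in a directory
# (default /usr/krb5/bin, the path used by the kproxy cron job).
# Hypothetical helper: prints the suggested fix, returns 1 if a tool is missing.
check_getcert_tools() {
    local dir=${1:-/usr/krb5/bin}
    local missing=0 tool
    for tool in kx509 kxlist ; do
        if [ ! -x "${dir}/${tool}" ] ; then
            echo "MISSING ${dir}/${tool}"
            missing=1
        fi
    done
    if [ "${missing}" -eq 1 ] ; then
        echo "suggested fix: yum install krb5-getcert"
    fi
    return "${missing}"
}
```

Run from cron ahead of kproxy, a non-zero return would flag the problem before the grid proxies expire.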
___________________________________________________________________________ Date: Thu, 05 Nov 2009 10:03:43 -0600 (CST) Status: In Progress ___________________________________________________________________________ Date: Thu, 05 Nov 2009 10:05:12 -0600 From: Lynn Garren Our 32bit SLF4 machines seem to be OK. We've got a new kx509 from krb5-getcert-1.8d-17.LTS4. However, I note that it is missing on the 64bit machines. I recovered kx509 with an explicit "yum install krb5-getcert", which gets krb5-getcert-1.8d-17.LTS4.i386 ___________________________________________________________________________ Date: Thu, 05 Nov 2009 10:19:06 -0600 (CST) From: Steven Timm Given the time of day it's likely that I installed krb5-getcert manually on fnpcsrv1. The update was configured to only give you krb5-getcert if you had it already. ___________________________________________________________________________ Date: Thu, 05 Nov 2009 10:19:26 -0600 From: Keith Chadwick Given the timestamp (~2:37 PM), I expect that Steve installed it manually. ___________________________________________________________________________ Date: Thu, 05 Nov 2009 10:27:48 -0600 (CST) Status: Completed krb5-getcert RPM installed on all minos servers. ___________________________________________________________________________ Date: Thu, 05 Nov 2009 14:08:46 -0600 From: Frank J. Nagy To: linux-users There have been several questions/issues with the recently released Fermi Kerberos updates For SLF5, the new RPMs krb5-fermi-config/base/getcert conflict with the krb5-*-fermi-*LTS4*.rpm upadtes that some have installed onto SLF5 (but were never distrubted with SLF5). This was unavoidable and the solution is to uninstalled the LTS4 updates: yum remove "krb5*-fermi-*LTS4" should suffice. You may then have to manually intstall the new RPMs yum install "krb5-fermi-*" For SLF4, there have been some reports of problems, notably the disappearing kx509/kxlist utilities. 
These were removed from the krb5-workstation-fermi package so that the krb5-getcert package would not conflict. This problem has been inconsistent as some have seen it and some have not. If you have this problem, manually install the getcert package yum install "krb5-getcert-*-LTS4" or just ym install krb5-getcert should suffice. ___________________________________________________________________________ ######### # MYSQL # ######### Date: Thu, 05 Nov 2009 15:09:17 +0000 (GMT) From: Arthur Kreymer To: xbhuang@fnal.gov Cc: minosdb-support@fnal.gov Subject: minos-mysql2 overload from recent xbhuang jobs The minos-mysql2 production database has been overloaded since about 05:00. The load average is over 30. There are a large number of temp database connections being kept open. There are many queries like select * from CALDRIFTVLD where INSERTDATE < '2007-11-14 00:00:00' and (TIMEEND<='2009-11-05 05:59:53') and (TASK=0) and (DETECTORMASK & 1) and (SIMMASK & 4) order by TIMEEND desc limit 1 These seem to be coming from xbhuang jobs, running processes like /minos/scratch/xbhuang/Nue/MRE/codes/condor_make_summary_tree.sh -> loon -bq \ /minos/scratch/xbhuang/Nue/MRE/MuonRemoval/GenerateElec/makeSummary/run_elec_summary_maker_reroot_dogwood.C ("N00010622_0006") \ reroot_N00010622_0006.root ___________________________________________________________________________ Date: Thu, 05 Nov 2009 21:31:19 +0000 (GMT) From: Arthur Kreymer The load on minos-mysql2 reduced to normal levels around 13:00. 
___________________________________________________________________________

=============================================================================
2009 11 04
=============================================================================

#########
# PROBE #
#########

Checking for Cluster/Process in a condor job environment

created probenv, probenv.run

MINOS25 > condor_q 709462

-- Submitter: minos25.fnal.gov : <131.225.193.25:65252> : minos25.fnal.gov
ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
709462.0 kreymer 11/4 17:17 0+00:00:18 R 0 0.0 kcron

less logs/probe/probe.709462.0.out

Nope, do not see it.

ENVIRONMENT
_CONDOR_ANCESTOR_4789=4797:1255621157:3323123968
_CONDOR_SCRATCH_DIR=/local/stage1/condor/execute/dir_20919
_CONDOR_ANCESTOR_20919=20921:1257376687:1653818624
_CONDOR_HIGHPORT=65535
PWD=/local/scratch14/stage1/condor/execute/dir_20919
_CONDOR_SLOT=2
_CONDOR_LOWPORT=61440
KRB5CCNAME=FILE:/tmp/krb5cc_1060_cron20921
SHLVL=2
_CONDOR_ANCESTOR_4797=20919:1257376686:3957722590
_=/bin/env
ENVIRONMENT

############
# BLUWATCH #
############

Started monitor of /grid/data 100 MB files from minos25

cdadmin
cd bluearc

./bluwatch.new -t -r -S 100000000 -b /minos/scratch/bluwatch/100M -d 19
...
Wed Nov 4 16:31:02 CST 2009 19/file90 36 1257373862311858000 1257373859602199000 2709659 2706240 3418900
Wed Nov 4 16:31:14 CST 2009 19/file91 15 1257373874836411000 1257373868329550000 6506861 6503442 3418900

./bluwatch.new -t -r -S 100000000 -b /grid/data/minos/bluwatch/minos25 -d 19
...
Wed Nov 4 16:33:51 CST 2009 19/file19 25 1257374031478841000 1257374027588807000 3890034 3887949 2084600
Wed Nov 4 16:34:01 CST 2009 19/file190 27
Wed Nov 4 16:34:09 CST 2009 19/file191 54
Wed Nov 4 16:34:18 CST 2009 19/file192 33
Wed Nov 4 16:34:26 CST 2009 19/file193 47
Wed Nov 4 16:34:35 CST 2009 19/file194 37
Wed Nov 4 16:34:43 CST 2009 19/file195 46
Wed Nov 4 16:34:52 CST 2009 19/file196 34
Wed Nov 4 16:35:01 CST 2009 19/file197 36
Wed Nov 4 16:35:09 CST 2009 19/file198 44

set nohup
./bluwatch.new -r -S 100000000 -b /grid/data/minos/bluwatch/minos25 &

mkdir -p /afs/fnal.gov/files/expwww/numi/html/computing/dh/bluearc/rates/minos25

set nohup ; ./bratenow -n minos25 -T /grid/data/100MB &

restart of bluwatch, skipping the misplaced first decade.

set nohup
./bluwatch.new -r -S 100000000 -b /grid/data/minos/bluwatch/minos25 -d 10 &

nope, skip to 01

set nohup
./bluwatch.new -r -S 100000000 -b /grid/data/minos/bluwatch/minos25 -d 01 &

That's better.

############
# BLUWATCH #
############

Populated 100M area for tests on minos25

mindata@minos-sam04

-bash-3.00$ scp minos27:.bashrc .bashrc
-bash-3.00$ ln -s .bashrc .profile

GDM=/grid/data/minos/bluwatch/minos25
mkdir $GDM

date
NF=0
while [ ${NF} -lt 200 ] ; do
  NFST=`printf "%2.2d" ${NF}`
  mkdir -p ${GDM}/${NFST:0:2}
  echo cp /var/tmp/100M ${GDM}/${NFST:0:2}/file${NFST}
  echo ${GDM}/${NFST:0:2}/file${NFST}
  (( NF ++ ))
done
date

Wed Nov 4 15:28:12 CST 2009
Wed Nov 4 15:32:13 CST 2009

$ du -sb /grid/data/minos/bluwatch/minos25/
22000212992 /grid/data/minos/bluwatch/minos25/

92 MBytes/second !
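The quoted rate follows from the du byte count and the two date stamps around the copy loop. A minimal sketch of that arithmetic, assuming both stamps fall on the same day; the function names hms_to_s and rate_mb_per_s are illustrative. With truncating integer division it gives a figure close to the roughly 92 MBytes/second noted above.

```shell
#!/bin/bash
# Convert HH:MM:SS to seconds since midnight.
# The 10# prefix forces base 10, so fields like 08 and 09 are not read as octal.
hms_to_s() {
    local h m s
    IFS=: read -r h m s <<< "$1"
    echo $(( 10#$h * 3600 + 10#$m * 60 + 10#$s ))
}

# Average rate in whole MBytes/second (1 MByte = 10^6 bytes), truncated.
# Assumes start and end are on the same day (no midnight crossing).
rate_mb_per_s() {
    local bytes=$1 start=$2 end=$3
    local elapsed=$(( $(hms_to_s "$end") - $(hms_to_s "$start") ))
    echo $(( bytes / elapsed / 1000000 ))
}

# Byte count from du -sb, times from the two date stamps (241 seconds apart).
rate_mb_per_s 22000212992 15:28:12 15:32:13
```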
Oops, see above monitoring, should have used %3.3d to produce NFST

NF=0
while [ ${NF} -lt 100 ] ; do
  NFST2=`printf "%2.2d" ${NF}`
  NFST3=`printf "%3.3d" ${NF}`
  mv ${GDM}/${NFST2:0:2}/file${NFST2} ${GDM}/${NFST3:0:2}/file${NFST3}
  echo ${GDM}/${NFST2:0:2}/file${NFST2}
  (( NF ++ ))
done

for N in 2 3 4 5 6 7 8 9 ; do rmdir /grid/data/minos/bluwatch/minos25/${N}* ; done

Restarted bluwatch, skipping the first decade

##########
# CONDOR #
##########

To : rbpatter@fnal.gov
Cc : minos-admin@fnal.gov
Attchmnt:
Subject : Suggested slowdown in condor_*_nice wrappers
----- Message Text -----
At Monday's Grid Users meeting, I asked Steve Timm what net rate
in our condor_*_nice wrappers would eliminate the SAZ overload alerts
that Fermigrid people see on occasion.

He stated that another factor of 2 slowdown should eliminate the alerts.
He mentioned that they are also working on upgrading the SAZ capacity.

I suggest that we make this adjustment, pending installation of the
newer Condor 7.4 and improvement of SAZ capacity.

Glancing at the scripts, it looks like we would just change
CYCLE_DELAY=5 to CYCLE_DELAY=10
for a net job control rate of 1/second instead of 2/second.

This is not a good long term situation, but we already know that,
hence the SAZ improvements underway.

Sound reasonable ?

##########
# CONDOR #
##########

Date: Tue, 03 Nov 2009 23:14:52 -0800 (PST)
From: Ryan B. Patterson

Request INC000000015212 requested by you has been submitted.
Status: New
Summary: MINOS user jobs not starting on GPFarm
Notes:
Since the FermiGrid configuration changes, MINOS glideins *are* matching
successfully to nodes at CDF and on the GPFarm, but user jobs are *not*
starting on the GPFarm nodes after the glideins are present.
User jobs are starting okay on CDF nodes.
condor_q reports no matchmaking problem (example below), always indicating
hundreds of nodes available to run jobs.
condor_status also shows the nodes present, willing, and able.
Yet the nodes remain unclaimed.
The glideins time-out with no use after 20 minutes and exit, and another glidein matches to the node to replace it. I've poked around at this for a little while, but I haven't made any headway. I'm assuming something in the new configuration is at the source, but I can't see what that is. (The contact string change has been made on our end, as evidenced by the successful matching and starting of glideins. Incidentally: should or should not the gateway names still have ":2119" in them? I tried both. (It actually didn't seem to matter.)) "condor_q -bett" example output, showing VMs available at minos25: ----------------------------------------------------- 671531.000: Run analysis summary. Of 1740 machines, 126 are rejected by your job's requirements 1148 reject your job because of their own requirements 0 match but are serving users with a better priority in the pool 0 match but reject the job for unknown reasons 102 match but will not currently preempt their existing job 364 are available to run your job Last successful match: Tue Nov 3 19:26:22 2009 The Requirements expression for your job is: ( ( ( target.Arch == "X86_64" ) || ( target.Arch == "INTEL" ) ) && ( target.GLIDEIN_Site isnt undefined ) && ( target.GLIDEIN_Entry_Name != "gpminos" ) ) && ( target.OpSys == "LINUX" ) && ( target.Disk >= DiskUsage ) && ( ( target.Memory * 1024 ) >= ImageSize ) && ( target.HasFileTransfer ) Condition Machines Matched Suggestion --------- ---------------- ---------- 1 ( target.GLIDEIN_Entry_Name != "gpminos" )1614 2 ( target.GLIDEIN_Site isnt undefined )1706 3 ( ( target.Arch == "X86_64" ) || ( target.Arch == "INTEL" ) ) 1740 4 ( target.OpSys == "LINUX" ) 1740 5 ( target.Disk >= 1 ) 1740 6 ( ( 1024 * target.Memory ) >= 1 ) 1740 7 ( target.HasFileTransfer ) 1740 ___________________________________________________________________ Submitted local probe at 10:00. Ryan is investigating. The job ran at 10:20, after a probably schedd restart. 
MINOS25 > minos_q -- Summary of minos25.fnal.gov : <131.225.193.25:65252> : minos25.fnal.gov OWNER RUN IDLE HELD OLDEST_JOB blake 5 1 0 11/4 10:16 0+00:00:31 loon_20091104_1016 grafnj 1 680 0 11/3 11:55 0+18:34:51 paloon_SriptPlus5A jdejong 8 0 0 11/4 08:39 0+00:01:45 irun-lisieve.sh_20 jelena 1 0 0 11/3 04:31 0+09:14:48 L-A-wrap-n13035004 jjling 2 0 0 11/3 20:38 0+00:01:43 condor_job_glidein jsm62 26 0 0 11/4 03:46 0+00:01:43 loon_20091104_0346 jyuko 1 19 0 11/3 16:48 0+01:04:35 run_condor.sh_2009 kafka 6 194 0 11/3 20:37 0+00:00:00 Make_ReReco_files_ kreymer 4 0 0 11/4 07:20 0+00:00:53 probe med 2 0 0 11/2 13:03 1+06:20:29 loon_20091102_1303 mho 1 4002 0 11/2 00:49 2+08:23:12 condor_dagman nigrant 4 0 0 11/4 04:49 0+00:01:22 MakeFDRun1LEDataSu pittam 1 0 0 11/4 05:13 0+00:01:41 loon_20091104_0513 rbpatter 0 889 0 11/2 11:01 1+22:11:35 condor_dagman rtoner 0 2132 0 11/4 07:13 0+01:59:01 condor_dagman rubin 112 338 0 11/3 15:58 0+03:14:35 ana_mc_driver.glid tinti 1 0 0 11/4 03:01 0+00:01:39 NDDogwood-2009-10- xbhuang 1 9806 0 11/3 15:14 0+01:00:21 calib_pro_gainfar_ TOTALS 176 18061 0 Farm glideins: R=412 I=621 H=0 MINOS25 > date Wed Nov 4 10:24:02 CST 2009 condor_q -run rubin - status seems to be catching up quickly, initially many jobs contained [????????????????] Mostly running, then ... 693509.0 rubin 11/3 18:22 0+00:00:16 glidein_9720@fnpc355.fnal.gov 693511.0 rubin 11/3 18:22 0+00:00:15 glidein_9465@fnpc352.fnal.gov 693512.0 rubin 11/3 18:22 0+00:00:00 [????????????????] 693513.0 rubin 11/3 18:22 0+00:00:00 [????????????????] 693514.0 rubin 11/3 18:22 0+00:00:14 glidein_21526@fnpc359.fnal.gov 693515.0 rubin 11/3 18:22 0+00:00:00 [????????????????] 693516.0 rubin 11/3 18:22 0+00:00:00 [????????????????] 693517.0 rubin 11/3 18:22 0+00:00:08 glidein_17178@fnpc331.fnal.gov 693519.0 rubin 11/3 18:22 0+00:00:09 glidein_29570@fnpc324.fnal.gov 693520.0 rubin 11/3 18:22 0+00:00:13 glidein_27581@fnpc333.fnal.gov 693521.0 rubin 11/3 18:22 0+00:00:00 [????????????????] 
693522.0 rubin 11/3 18:22 0+00:00:12 glidein_20489@fnpc334.fnal.gov 693523.0 rubin 11/3 18:22 0+00:00:11 glidein_29135@fnpc327.fnal.gov 693524.0 rubin 11/3 18:22 0+00:00:00 [????????????????] 693526.0 rubin 11/3 18:22 0+00:00:00 [????????????????] 693527.0 rubin 11/3 18:22 0+00:00:10 glidein_2926@fnpc259.fnal.gov 693528.0 rubin 11/3 18:22 0+00:00:00 [????????????????] ... then the rest have ???? status rubin jobs are starting rapidly MINOS25 > condor_q rubin | tail -1 450 jobs; 84 idle, 366 running, 0 held ___________________________________________________________________ ########## # CONDOR # ########## Date: Tue, 03 Nov 2009 20:43:54 -0600 From: jiajie ling The condor seems have some problems. I killed about 302 jobs, however if I type condor_q jjling, I can see that all the jobs has been marked as stopped and stay there forever (which should be disappeared after I condor_rm). However my new job is not running at all. My priority is high enough. Could you please have a look? MINOS25 > minos_q -- Summary of minos25.fnal.gov : <131.225.193.25:63042> : minos25.fnal.gov OWNER RUN IDLE HELD OLDEST_JOB blake 0 0 0 11/3 18:24 0+00:03:23 loon_20091103_1824 grafnj 393 659 0 11/3 11:55 0+05:30:21 paloon_SriptPlus5A jdejong 0 2 0 11/4 08:39 0+00:00:00 irun-lisieve.sh_20 jelena 0 1 0 11/3 04:31 0+09:13:39 L-A-wrap-n13035004 jjling 0 1 0 11/3 20:00 0+00:00:00 condor_job_glidein jsm62 0 24 0 11/4 03:19 0+00:00:00 loon_20091104_0319 jyuko 0 20 0 11/3 16:48 0+01:04:35 run_condor.sh_2009 kafka 0 200 0 11/1 18:46 2+00:19:43 Make_ReReco_files_ kreymer 0 86 0 10/30 19:20 0+00:00:00 probe lefeuvre 0 0 0 11/3 18:17 0+00:11:41 loon_20091103_1817 med 0 2 0 11/2 13:03 1+06:18:18 loon_20091102_1303 mho 2 4003 0 11/2 00:49 2+07:52:09 condor_dagman nigrant 0 4 0 11/4 04:49 0+00:00:00 MakeFDRun1LEDataSu pittam 0 1 0 11/4 05:13 0+00:00:00 loon_20091104_0513 rbpatter 0 889 0 11/2 11:01 1+21:40:31 condor_dagman rtoner 0 2132 0 11/4 07:13 0+01:28:25 condor_dagman rubin 12 438 0 11/3 11:20 
0+07:45:39 ana_mc_driver.glid tinti 0 1 0 11/4 03:01 0+00:00:00 NDDogwood-2009-10- xbhuang 153 9091 0 11/3 15:09 0+05:17:24 calib_pro_gainfar_ TOTALS 560 17554 0 Farm glideins: R=460 I=689 H=0 MINOS25 > date Wed Nov 4 08:42:23 CST 2009 ============================================================================= 2009 11 03 ============================================================================= ####### # SAM # ####### FYI, here is the strawman server I referred to during today's meeting. This is for the purpose of a sanity check. We might well buy something different, for uniformity. Strawman configuration, $7,491 Dell R710 5500 series 2U Rack Server. 2 x Intel® Xeon® E5520, 2.26Ghz, 8M Cache, Turbo, HT, 1066MHz Max Mem High Output Power Supply, Redundant, 870W 48GB Memory (12x4GB), 1066MHz Dual Ranked RDIMMs for 2 Processors,Optimized Red Hat Enterprise Linux 5.3, 2S, FI x64, 1yr, Auto-Entitle, Lic & Media Chassis for Up to Six 3.5-Inch Hard Drives 3 x 500GB 7.2K RPM Near Line SAS 3.5" Hot Plug Hard Drive 3 x 1TB 7.2K RPM Near Line SAS 3.5" Hot Plug Hard Drive Dual Two-Port Embedded Broadcom® NetXtreme II 5709 Gigabit Ethernet NIC 3Yr Basic Hardware Warranty Repair: 5x10 HW-Only, 5x10 NBD Onsite Notes Selected mid range CPU, good price/performance Selected 48 GB memory: 8 GB per initial customer, times 2 for growth RDIMM has Address Parity in addition to ECC. Selected 870W power supply, dual, needed for this much memory OS - should this be 3yr for $700 ? 3 x .5 TB system disks, 3 x 1 TB data disks (mirror set plus backup) Getting adequate data disks lets us use this as a Dev/Int system for Calibration databases. 
########
# GRID #
########

See notes of 2009 10 26, copies from fcdfcaf1103

Repeated tests of 10 MB file

time dd if=/grid/data/minos/bluwatch/stash/A/2/file0300 of=/dev/null
time dd if=/grid/data/minos/bluwatch/stash/A/2/file0301 of=/dev/null
time dd if=/grid/data/minos/bluwatch/stash/A/2/file0302 of=/dev/null

real 0m0.542s
real 0m0.918s
real 0m0.634s

Performance seems to be normal again.

############
# BLUWATCH #
############

Started test monitor of /minos/scratch from minos-mysql2,

kreymer@minos-mysql2

set nohup
./bluwatch.new -t -r -S 100000000 -b /minos/scratch/bluwatch/100M

set nohup
./bluwatch.new -r -S 100000000 -b /minos/scratch/bluwatch/100M &

set nohup ; ${HOME}/minos/scripts/bratenow -n minos-mysql2 -T /minos/scratch &

gnuplot> set output '/afs/fnal.gov/files/expwww/numi/html/computing/dh/bluearc/rates/minos-mysql2/minos-mysql2_20091103.png'
         ^
         cannot open file; output not changed
line 0: util.c: No such file or directory

OOPS, rates are up over 800 MB/sec on the second pass
through these 100 files ( 10 GB )
We have 16 GB of physical memory on this system.

Increased the files in /minos/scratch/bluwatch/100M to 20 GB.
Removed the high entries from
/afs/fnal.gov/files/expwww/numi/html/computing/dh/bluwatch/rate/2009/11/03/minos-mysql2.txt

mindata@minos-sam04

MSBM=/minos/scratch/bluwatch/100M/
for DIR in 0 1 2 3 4 5 6 7 8 9 ; do
    mv ${MSBM}/${DIR} ${MSBM}/0${DIR}
done

date
NF=0
while [ ${NF} -lt 100 ] ; do
    NFST=`printf "%2.2d" ${NF}`
    mkdir -p /minos/scratch/bluwatch/100M/1${NFST:0:1}
    cp /var/tmp/100M /minos/scratch/bluwatch/100M/1${NFST:0:1}/file${NFST}
    echo /minos/scratch/bluwatch/100M/1${NFST:0:1}/file${NFST}
    (( NF ++ ))
done
date

Tue Nov 3 13:31:45 CST 2009
/minos/scratch/bluwatch/100M/10/file00
/minos/scratch/bluwatch/100M/10/file01
/minos/scratch/bluwatch/100M/10/file02
/minos/scratch/bluwatch/100M/10/file03
...
/minos/scratch/bluwatch/100M/19/file98
/minos/scratch/bluwatch/100M/19/file99
Tue Nov 3 13:35:20 CST 2009

Restarted the monitor.
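Sketch of the two checks behind the notes above, as bash helpers.
The names rate_mb_per_s and exceeds_memory are hypothetical and are not
part of bluwatch. The 16 GB of physical memory is taken as a nominal
16000000000 bytes. Rates assume 1 MB = 10^6 bytes, so bytes divided by
elapsed microseconds gives MB/sec directly.

```shell
#!/bin/bash
# Hypothetical helpers for sanity-checking read-rate tests.
# These are illustrations only, not code from bluwatch.

# Print the integer read rate in MB/sec (1 MB = 10^6 bytes).
# bytes / microseconds is numerically equal to MB / second.
rate_mb_per_s() {
    local bytes=$1 elapsed_us=$2
    # Refuse a zero or negative interval rather than divide by zero.
    [ "${elapsed_us}" -gt 0 ] || return 1
    echo $(( bytes / elapsed_us ))
}

# Succeed (status 0) when the test set is larger than physical memory,
# so a second pass cannot be served entirely from the client page cache.
exceeds_memory() {
    local total_bytes=$1 mem_bytes=$2
    [ "${total_bytes}" -gt "${mem_bytes}" ]
}
```

With 10 GB of test files on a 16 GB node the second check fails, which
is consistent with inflated rates on a second pass. Passing the check is
necessary but not sufficient for realistic rates. On a live node the
memory size could be read from /proc/meminfo instead of being passed in.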
Rates still too high, writes did not flush read cache. set nohup ./bluwatch.new -r -S 100000000 -b /minos/scratch/bluwatch/100M -d 10 & Tue Nov 3 13:43:54 CST 2009 10/file00 38 OK, these rates look realistic ############ # BLUWATCH # ############ Started keepup of minos-sam04 : set nohup ; ${HOME}/minos/scripts/bratenow -n minos-sam04 -T /minos/scratch & set nohup ; ${HOME}/minos/scripts/bratenow -n minos-sam04 -T /minos/scratch -w & ============================================================================= 2009 11 02 ============================================================================= ######### # ADMIN # ######### Send Overview and computing/dh web links to wang/ritchie per request ######### # ADMIN # ######### Tried the corrected account form again, from http://www-numi.fnal.gov/minwork/computing/minos_cluster.20091022.html This time it worked, Your MINOS Account Request has been submitted. Your request will be verified by the collaboration. (ARNOTE 10000) ####### # CVS # ####### cdcvs has been moved to the second box. I believe all services are now running. * the main cd repository was working almost immediately * the cdf repository was out longest (I overlooked a symlink, and we didn't have rcs installed in /usr/bin) ############ # BLUWATCH # ############ Added -S SIZE, default is still 10M (10000000) Test with 100MB files created last Friday LOG100M=/afs/fnal.gov/files/expwww/numi/html/computing/dh/bluwatch/rate LOG100M=/grid/data/monitor/test100 kreymer@minos27 ./bluwatch.new -t -S 100000000 -r \ -b /minos/scratch/bluwatch/100M -l /grid/data/monitor/test100 ALTLOG=/grid/data/monitor/test100 TEST READ DEBUGGING, SLEEP=6 ... 
OFFSET 9 OFFSET 8347700 DIR /minos/scratch/bluwatch/100M/0 Mon Nov 2 11:59:29 CST 2009 0/file00 24 1257184769629855000 1257184765458602000 4171253 4162905 8347700 Mon Nov 2 11:59:39 CST 2009 0/file01 26 1257184779475181000 1257184775712592000 3762589 3754241 8347700 Mon Nov 2 11:59:50 CST 2009 0/file02 18 1257184790889772000 1257184785548862000 5340910 5332562 8347700 Mon Nov 2 12:00:04 CST 2009 0/file03 12 1257184804981453000 1257184796969144000 8012309 8003961 8347700 Mon Nov 2 12:00:17 CST 2009 0/file04 14 1257184817777992000 1257184811053536000 6724456 6716108 8347700 Mon Nov 2 12:00:37 CST 2009 0/file05 8 1257184837436510000 1257184826232381000 11204129 11195781 8347700 Mon Nov 2 12:00:47 CST 2009 0/file06 22 1257184847917914000 1257184843500167000 4417747 4409399 8347700 Mon Nov 2 12:00:59 CST 2009 0/file07 17 1257184859779986000 1257184854003149000 5776837 5768489 8347700 Mon Nov 2 12:01:10 CST 2009 0/file08 23 1257184870104861000 1257184865855242000 4249619 4241271 8347700 Mon Nov 2 12:01:22 CST 2009 0/file09 14 1257184882873734000 1257184876195281000 6678453 6670105 8347700 DIR /minos/scratch/bluwatch/100M/1 Mon Nov 2 12:01:49 CST 2009 1/file10 7 1257184909296265000 1257184895441038000 13855227 13846879 8347700 Mon Nov 2 12:02:04 CST 2009 1/file11 10 1257184924743710000 1257184915393654000 9350056 9341708 8347700 Mon Nov 2 12:02:17 CST 2009 1/file12 15 1257184937298250000 1257184930809270000 6488980 6480632 8347700 Mon Nov 2 12:02:35 CST 2009 1/file13 8 1257184955292435000 1257184943549885000 11742550 11734202 8347700 Mon Nov 2 12:02:45 CST 2009 1/file14 23 1257184965712759000 1257184961388111000 4324648 4316300 8347700 Mon Nov 2 12:03:10 CST 2009 1/file15 5 1257184990034011000 1257184971887260000 18146751 18138403 8347700 Mon Nov 2 12:03:19 CST 2009 1/file16 28 1257184999708896000 1257184996154010000 3554886 3546538 8347700 Mon Nov 2 12:03:28 CST 2009 1/file17 35 1257185008626531000 1257185005798280000 2828251 2819903 8347700 Mon Nov 2 12:03:36 CST 2009 
1/file18 49 1257185016694386000 1257185014678330000 2016056 2007708 8347700
Mon Nov 2 12:03:45 CST 2009 1/file19 36 1257185025556202000 1257185022795695000 2760507 2752159 8347700
...

###############
# CONDORPROXY #
###############

Corrected gfactory .k5login, corrupted last Friday.
minos26 had been changed to minos25 in kreymer kcron entry
condorproxy works again, when run manually.

=============================================================================
2009 10 30
=============================================================================

#########
# ADMIN #
#########

Tried the Minos Account request form again.
http://www-numi.fnal.gov/minwork/computing/minos_cluster.20091022.html

"Popup blockers must be disabled to use this application"
( Note - This is a really bad idea.
  People should not disable popup blockers.)

I filled out the form :
    Your Fermilab Id: 06135N
most of the rest was filled in automatically
(This will probably surprise and confuse the users)

I added
    Kerberos principal    Minos institution
    kreymer               Fermilab

I hit SUBMIT, and got

    The product categorization information is not valid
    for the specified company, """Fermilab""".
    Use the menus provided for these fields to select this information.
    (ARERR 1291047)

############
# BLUWATCH #
############

Make some 110 MB test files for bluwatch, to reduce sampling errors
commission this on /minos/scratch ( stay away from /minos/data copies )
Write 11 GB of 110 MB files, should be plenty to flush caches.

cd /minos/scratch/flxi09/1
cat file001* > /var/tmp/100M

MINOS-SAM04 > ls -l /var/tmp/100M
-rw-r--r-- 1 kreymer g020 110000000 Oct 30 18:16 /var/tmp/100M

date
NF=0
while [ ${NF} -lt 100 ] ; do
    NFST=`printf "%2.2d" ${NF}`
    mkdir -p /minos/scratch/bluwatch/100M/${NFST:0:1}
    cp /var/tmp/100M /minos/scratch/bluwatch/100M/${NFST:0:1}/file${NFST}
    echo /minos/scratch/bluwatch/100M/${NFST:0:1}/file${NFST}
    (( NF ++ ))
done
date

Fri Oct 30 18:29:36 CDT 2009
...
/minos/scratch/bluwatch/100M/9/file99
Fri Oct 30 18:32:13 CDT 2009

Net rate was 11 GB / 157 seconds , 70 MB/sec

###########
# BRATEWK #
###########
# BRATE #
###########

Corrected purge of old /var/tmp/ME/*.png files
( had accumulated nearly 30K of these on minos27 )

Added an 'exit 0' to the end of the script,
for more legible postlude comments

#######
# NET #
#######

Very slow network and DNS lookups.
Editor is freezing up frequently after 13:30

13:50 or so
    TCP/Web100 Network Diagnostic Tool v5.4.12
    click START to begin
    Connected to: shasta.fnal.gov -- Using IPv4 address
    Protocol error!
    click START to re-test

Back to normal after 16:00
_______________________________________________________________________
INC000000014867 10/30/2009 7:17:43 PM
    Wh12W is having horrible network connections via wired.
    the rate is 100kbps instead of 10mbps.
_______________________________________________________________________
INC000000014822 10/30/2009 2:25:29 PM
    All SciBooNE Linux nujpXX machines are currently down.
    I guess there was a power outage and the machines need rebooting.
    All the machines are located at WH10X.

INC000000014854 10/30/2009 5:36:03 PM
    nujp08 cannot ping out

INC000000014864 10/30/2009 7:12:54 PM
    Network access is slow/broken on WH11
    10/30/2009 7:50:00 AM
    There was a spanning tree loop that caused this issue. It is fixed now.

INC000000014867 10/30/2009 7:17:43 PM
    Wh12W is having horrible network connections via wired.
    the rate is 100kbps instead of 10mbps.

INC000000014872 10/30/2009 7:34:19 PM
    problems with the network on the 11th floor

INC000000014877 10/30/2009 7:58:59 PM
    can't access bluearc disk and pnfs disk
    minerva01.fnal.gov can't access bluearc and pnfs disk.
_______________________________________________________________________
Date: Fri, 30 Oct 2009 14:23:47 -0500
From: orlando

Art, what's the Network ID on the SM6 box?
_______________________________________________________________________ To : orlando Cc : FermilabServiceDesk@fnal.gov, Site Networking Group Attchmnt: Subject : Re: Incident INC000000014867 has been assigned to your group 'Network Services'. Priority: High. Description: Wh12W is having horrible network connections via wired. the rate is 100kbps instead of 10mbps. ----- Message Text ----- On Fri, 30 Oct 2009, orlando wrote: > Art, what's the Network ID on the SM6 box? I'm not sure what an SM6 box is. The fiber adapter on the wall ? This problem has been resolved, see : INC000000014864 10/30/2009 7:12:54 PM Network access is slow/broken on WH11 10/30/2009 7:50:00 AM There was a spanning tree loop that caused this issue. It is fixed now. This probably also was the cause of tickets INC000000014854 INC000000014872 INC000000014877 _______________________________________________________________________ Date: Fri, 30 Oct 2009 17:04:37 -0500 (CDT) Request INC000000014867: Status has been updated. Status: Completed Network loop on WH10W caused outage on all PPD subnet 55. ============================================================================= 2009 10 29 ============================================================================= ########### # NETWORK # ########### ########### # BLUEARC # ########### Date: Thu, 29 Oct 2009 21:18:35 +0000 (GMT) From: Arthur Kreymer On Mon, 5 Oct 2009, Arthur Kreymer wrote: > Recent daiy and weekly data read rate plots > > D0 NexSan / Satabeast - 50 MB/sec when healthy > http://www-numi.fnal.gov/computing/dh/bluearc/rates/d0mino06/ > > D0 Hitachi - 70 MB/sec when healthy > http://www-numi.fnal.gov/computing/dh/bluearc/rates/d0mino05/ In preparing to monitor the new Minos Bluearc disks, I have upgraded some of the Minos monitoring scripts, and started continuous updates to the D0 performance plots. The plot titles now include the data path monitored, instead of the former hardcoded '/grid/data' . 
There have been severe D0 performance problems recently on both Nexsan and Hitachi systems. http://www-numi.fnal.gov/computing/dh/bluearc/rates/d0mino06/20091026.week.png http://www-numi.fnal.gov/computing/dh/bluearc/rates/d0mino05/20091026.week.png I suspect a classic 'CAB Attack', too many grid jobs reading files directly. We have had nothing like this on /grid/data since we separated D0 files and started using the 'cpn' utility to regulate file copies. Presently we set a limit of 20 simultaneous copies. This can probably be increased. ####### # CVS # ####### Date: Thu, 29 Oct 2009 11:31:59 -0500 From: Marc W. Mengel There will be a brief cdcvs.fnal.gov outage Monday 2009-11-2 at noon CST to move the service from cdcvs3 to cdcvs4. ####### # AFS # ####### Scanning for an empty data volume to give to Nova for web pages. for DIR in `ls` ; do fs listquota ${DIR} ; done | grep 50000000 | grep ' 0% ' nb.minos.d114 50000000 6 0% 49% nb.minos.d115 50000000 6 0% 56% nb.minos.d116 50000000 6 0% 49% nb.minos.d117 50000000 6 0% 52% nb.minos.d124 50000000 6 0% 48% fs: You don't have the required access rights on 'd192' nb.minos.d171 50000000 12 0% 49% nb.minos.d172 50000000 12 0% 52% nb.minos.d173 50000000 240556 0% 52% nb.minos.d174 50000000 14 0% 49% nb.minos.d198 50000000 59211 0% 48% fs: You don't have the required access rights on 'd273' nb.minos.d229 50000000 8 0% 56% nb.minos.d245 50000000 50089 0% 70% nb.minos.d86 50000000 6 0% 54% MINOS26 > fs listacl d229 Access list for d229 is Normal rights: minos:admin rlidwka buckley:ana_ntuples rlidwka minos rl system:administrators rlidwka system:anyuser rl brebel rlidwka They need access to this, and to make a symlink ARK > fs listacl /afs/fnal.gov/files/expwww/nova Access list for /afs/fnal.gov/files/expwww/nova is Normal rights: lauram:expwwwread rl nicholls:wadmnova rlidwka lauram:expwwwadm rlidwka system:administrators rlidwka system:anyuser rl fs setacl -dir d229 -acl buckley:ana_ntuples none fs setacl -dir d229 -acl 
brebel none fs setacl -dir d229 -acl minos none fs setacl -dir d229 -acl sbudd rlidwka ######## # GRID # ######## Date: Thu, 29 Oct 2009 14:12:42 +0000 (GMT) From: Arthur Kreymer To: rbpatter@fnal.gov, minos-admin@fnal.gov Cc: timm@fnal.gov Subject: Opportunistic glideins seem to have stopped. Since the restart of the Minos gfrontend yesterday, we seem to be getting no opportunistic slots. The GPFarm is fairly empty, but we are at the 400 job ceiling. The Minos pilots have drained from CDF. ______________________________________________________________________ Date: Thu, 29 Oct 2009 09:45:33 -0500 (CDT) From: Steven Timm There are 128 minos glideins on the fermigridosg1 gateway. At the moment all of them are matched to cdf clusters and waiting to run. within 10 minutes the auto-rescheduling feature will kick in and some of them will be rematched to fnpcfg2. any new ones ought to go there as well. You can tell this from any machine that is running condor: condor_q -name fg1x1.fnal.gov -pool fermigridcm1.fnal.gov -constraint 'Owner=="minosgli"' -format '%s\t' GlobusStatus -format '%s\t' gridJobId -format '%s\n' Cluster ID Or art and ryan can log directly into fg1x1 and do the command without the -name and -pool options. ______________________________________________________________________ Date: Thu, 29 Oct 2009 15:08:13 +0000 (GMT) From: Arthur Kreymer Apparenly the Pilots are matching to CDF, but not running : condor_q -name fg1x1.fnal.gov -pool fermigridcm1.fnal.gov \ -constraint 'Owner=="minosgli"' -l | grep ^LastMatchName ______________________________________________________________________ Date: Thu, 29 Oct 2009 10:53:25 -0500 (CDT) From: Steven Timm Yes that's right. Eventually they will give up on CDF and match to the GP Grid cluster opportunistic one where they can run. Every time they get rematched there is another rank penalty for clusters where they have tried and failed to run before. 
The problem is leak-through, that the CDF clusters report 2 or 3 slots free, the information is on a 15 minute delay, and by the time the minos jobs get there, more cdf jobs have come and started. You can see how many slots are free on each cluster by running the command on fg1x1 /usr/local/bin/freeslots.py minos We have some formatting to do to clean up the numbers still but it should give you the basic details. ______________________________________________________________________ Date: Thu, 29 Oct 2009 12:46:45 -0500 (CDT) From: Steven Timm Just a note--the current release of the Generic Information Provider was supposed to be reporting zero free slots for MINOS (or any other VO) on any FermiGrid site where they have jobs waiting to run. I just contacted the Generic Information Provider team and they confirm that there is in fact a bug where this feature is not working right for condor job managers, and they will get us a fix shortly. In the meantime I suggest that MINOS turn up the rate of glidein submission in the gfactory if you are able to do so and sooner or later some will have to match to fnpcfg2, the opportunistic GP Grid gatekeeper. ______________________________________________________________________ Date: Fri, 30 Oct 2009 14:10:36 +0000 From: Ruth Toner Hello Art! I've been running Ryan's condor_dagman LEM event matching code on the grid since last night, but my job running abilities seem to have slowed to nearly nothing this morning. At the moment, I have 778 jobs on the grid; of these, 2 are running and 776 are idle. This number has remained steady for the past two or so hours. In fact, when I checked about a half hour ago, I had 5 running and 773 idle, so the active number even seems to be going down! There seem to be far less jobs than usual in the "running" state in total for all users as well (about 300, when I know I've seen it more around 1000 in the past) - is there something wrong with the Grid at the moment, or is this just normal? 
Or is this something wrong with my submission/account? With the upcoming nue analysis approaching, this matching is becoming rather urgent and critical, so if there's something I can do to make it start working again or go faster, I'd love to hear it! Thank you! --Ruth ______________________________________________________________________ Date: Fri, 30 Oct 2009 08:47:06 -0700 (PDT) From: Ryan B. Patterson It looks like the Condor priority system is doing its job and letting other analyzers get some slots. However, the person with the most slots at the moment is Mhair, who is doing LEM matching anyway. Demand has been very high these last few days. (There are current over 3000 jobs waiting to run.) Hopefully it clears up through the weekend. But for now, LEM matching is accounting for 68% of running jobs (via Mhair). ______________________________________________________________________ Date: Fri, 30 Oct 2009 08:48:47 -0700 (PDT) From: Ryan B. Patterson Related to Ruth's question... I didn't understand Steve's reply yesterday about opportunistic running. *Are* we getting access to free nodes when they are available, or are there nodes sitting idle that we can't access due to a bug? My impression was that we were getting nodes when we could but that something made it difficult for us to get to the nodes first. ______________________________________________________________________ Date: Fri, 30 Oct 2009 15:58:21 +0000 (GMT) From: Arthur Kreymer We already have about 160 Idle pilots. This is much larger than the 10 to 20 Idle pilots seen during a normal rampup. I doubt that sending more pilots would change anything. We are continuing to get 0 opportunistic slots. Well, almost. condor_q -run shows one process : -- Schedd: fg1x1.fnal.gov : <131.225.107.165:62238> ID OWNER SUBMITTED RUN_TIME HOST(S) 2045412.0 minosgli 10/30 10:21 0+00:35:25 [????????????????] 
______________________________________________________________________ Date: Fri, 30 Oct 2009 09:02:17 -0700 (PDT) From: Ryan B. Patterson The factory is configured not to start more glideins when 40 are idle. The fact that over 100 are idle suggests that glideins startup jobs are returning to the idle state after entering the run state, which is something I've never seen before. I concur -- glidein submission rate does not seem directly related to this issue. ______________________________________________________________________ Date: Fri, 30 Oct 2009 11:06:53 -0500 (CDT) From: Steven Timm There are three things I am trying to say here: 1) Sometimes old and stale glidein jobs on the gateway won't rematch at all after a while. 2) New glideins submitted now would probably go to the right place 3) Rate does matter, in fact it is everything, because if you have a large number of glideins to be considered in one matching cycle then some of them will be forced to go to the right place because you will fill up all the other wrong places. In addition, I received a patch this morning from the generic information provider people which should keep minos glideins from going to cdf if you already have glideins waiting to run there. I'll let you know once this patch is applied. Seeing that this has been ongoing for a couple days now I am going to open an incident ticket with the Service desk to track it. ______________________________________________________________________ Date: Fri, 30 Oct 2009 09:10:23 -0700 (PDT) From: Ryan B. Patterson Can we clear out the existing idle glideins? (I would attempt this, but I've never actually interacted with anything other than the MINOS Condor scheduler, and I wouldn't want to put the grid gateways into a weird state with an errant condor_rm.) ______________________________________________________________________ Date: Fri, 30 Oct 2009 11:13:02 -0500 (CDT) From: Steven Timm Give me a second... 
I think at least the oldest ones have no chance to run
but I want to be sure, and if so I can remove them for you.
______________________________________________________________________
Date: Fri, 30 Oct 2009 09:50:42 -0700 (PDT)
From: Ryan B. Patterson

Thanks. I see these have been removed and replaced by new glideins.
The new ones are sitting idle like the old ones. Interestingly, when
the sole running glidein ended, a new one replaced it to maintain one
running glidein.

Hopefully this patch resolves things. MINOS has a huge queue of
waiting analysis jobs, so restoration of opportunistic running before
the weekend would be extremely valuable.

Thanks again. Let me know if I can do anything to help.
______________________________________________________________________
Date: Fri, 30 Oct 2009 11:57:12 -0500 (CDT)
From: Steven Timm

That one isn't a running glidein, it's just your condor-g
grid monitor job.

Also you guys must have removed the older glideins before I
got a chance to do so because when I went to remove them
they were already gone.
______________________________________________________________________
Date: Fri, 30 Oct 2009 18:07:44 +0000 (GMT)
From: Arthur Kreymer

Thanks for putting in the Service ticket, I was about to do that.

I did not remove anything myself, so the removal is a mystery.
______________________________________________________________________
Date: Fri, 30 Oct 2009 11:09:59 -0700 (PDT)
From: Ryan B. Patterson

I removed the idle glidein startup jobs from minos25, but I didn't
think this in turn removed the jobs from the gateway. It apparently
does, and new idle glideins then moved in to fill the void.
______________________________________________________________________
Date: Fri, 30 Oct 2009 13:16:13 -0500 (CDT)
From: Steven Timm

I have installed the patch from the GIP people on all the
four cdf gatekeepers. fcdfosg4 is now correctly reporting zero
slots free for MINOS, the other three gatekeepers will shortly
as the data moves through the cache, takes about 1/2 hour overall.
______________________________________________________________________
Date: Fri, 30 Oct 2009 13:22:37 -0500 (CDT)

Request INC000000014857: Status has been updated.
Status: Pending
Summary: Opportunistic glideins seem to have stopped.
Notes:
I'm filing this ticket to track a troubleshooting incident for MINOS
on the fermigridosg1 gateway. I will not include the full E-mail
exchange that has happened up until now.
this is Art's original E-mail to me on 10/29/2009 14:12:42

    Since the restart of the Minos gfrontend yesterday,
    we seem to be getting no opportunistic slots.
    The GPFarm is fairly empty, but we are at the 400 job ceiling.
    The Minos pilots have drained from CDF.
______________________________________________________________________
Date: Fri, 30 Oct 2009 13:44:40 -0500 (CDT)
Status: Pending
______________________________________________________________________
Date: Fri, 30 Oct 2009 13:47:37 -0500 (CDT)
From: Steven Timm

Please leave whatever glideins are currently in the queue, in the queue.
You just removed a bunch of glideins that would have rematched
and started, in fact were in the process of doing just that.

And, as before, the more you submit, the better chance there is
that some will have a chance to start.
The matching algorithm right now is the following: For every cluster that shows free slots, 10 jobs get matched to it in each matchmaking round, whether there are actually 10 slots free or not. since "CDF" comes before "GP" in alphabetical order, and there are 4 CDF clusters, and you have only 40 jobs in the queue, there are no jobs left to match to GP. That's why I have repeatedly suggested that you increase your glidein submission frequency while we are tweaking the algorithm. the more you submit, the more chance there is that at least some will start. _______________________________________________________________________ Date: Fri, 30 Oct 2009 12:06:41 -0700 (PDT) From: Ryan B. Patterson condor_q -bett gives the output below for the idle glideins. Is it expected that this reports no valid machines (I know scheduler analysis is tricky when globus is involved), or is our glidein configuration no longer valid? error: bad form error: problem with ExprToProfile --- 2045608.000: Run analysis summary. 
Of 3581 machines, 3581 are rejected by your job's requirements 0 reject your job because of their own requirements 0 match but are serving users with a better priority in the pool 0 match but reject the job for unknown reasons 0 match but will not currently preempt their existing job 0 are available to run your job WARNING: Be advised: No resources matched request's constraints The Requirements expression for your job is: ( ( target.GlueCEInfoContactString is "fnpcfg1.fnal.gov:2119/jobmanager-condor" || target.GlueCEInfoContactString is "fnpcfg2.fnal.gov:2119/jobmanager-condor" || target.GlueCEInfoContactString is "d0cabosg1.fnal.gov:2119/jobmanager-pbs" || target.GlueCEInfoContactString is "d0cabosg2.fnal.gov:2119/jobmanager-pbs" || target.GlueCEInfoContactString is "fcdfosg1.fnal.gov:2119/jobmanager-condor" || target.GlueCEInfoContactString is "fcdfosg2.fnal.gov:2119/jobmanager-condor" || target.GlueCEInfoContactString is "fcdfosg3.fnal.gov:2119/jobmanager-condor" || target.GlueCEInfoContactString is "fcdfosg4.fnal.gov:2119/jobmanager-condor" || target.GlueCEInfoContactString is "cmsosgce3.fnal.gov:2119/jobmanager-condor" || false ) && ( stringlistimember("VO:minos",GlueCEAccessControlBaseRule) == true ) && ( target.GlueCEStateFreeJobSlots >= 1 ) && ( target.GlueCEInfoJobManager == "condor" ) && ( target.GlueCEInfoContactString isnt "cmsosgce3.fnal.gov:2119/jobmanager-condor" ) && ( target.GlueCEInfoContactString isnt "fnpcfg1.fnal.gov:2119/jobmanager-condor" ) && ( target.GlueCEInfoDefaultSE == "fndca1.fnal.gov" ) ) ______________________________________________________________________ Date: Fri, 30 Oct 2009 14:10:40 -0500 (CDT) From: Steven Timm No, it's something internal to FermiGrid but it is an easy fix. (And it also explains why nothing was matching to fnpcfg2). The new information provider which I installed on the GP Grid on 10/15 and today to make the patch on CDF is missing the :2119 in the contact string. 
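The mismatch described above can be illustrated with a small bash
sketch. The Requirements expression compares GlueCEInfoContactString
against literal strings that include :2119, so a provider that publishes
the string without the port matches none of them. The helper names
normalize_contact and same_contact are hypothetical. This is not how the
gfactory or FermiGrid compares strings, only an illustration of
comparing contact strings after stripping the port.

```shell
#!/bin/bash
# Hypothetical helpers, for illustration only.

# Print a contact string with the ":2119" port removed, if present.
# A string that already lacks the port is printed unchanged.
normalize_contact() {
    printf '%s\n' "$1" | sed 's|:2119/|/|'
}

# Succeed (status 0) when two contact strings name the same
# gatekeeper and jobmanager, ignoring the optional ":2119" port.
same_contact() {
    [ "$(normalize_contact "$1")" = "$(normalize_contact "$2")" ]
}
```

A plain string comparison of fnpcfg2.fnal.gov:2119/jobmanager-condor
with fnpcfg2.fnal.gov/jobmanager-condor fails, while same_contact
treats the two as equal.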
_____________________________________________________________________ Date: Fri, 30 Oct 2009 14:23:43 -0500 (CDT) From: Steven Timm Submit some more glideins now, I just removed the old ones. Now they should work. Sorry for the inconvenience. _____________________________________________________________________ Date: Fri, 30 Oct 2009 12:42:31 -0700 (PDT) From: Ryan B. Patterson Thanks, Steve. Glideins have restarted, and MINOS user jobs are again running opportunistically. _____________________________________________________________________ Date: Fri, 30 Oct 2009 16:27:24 -0700 (PDT) From: Ryan B. Patterson A follow up... It appears that glideins are still matching where no free slots exist. The number of idle-and-matched-but-never-going-to-run glideins creeps up over the course of an hour or so until it reaches whatever limit I set for the factory (currently 160). Once that many stale, idle glideins are present, no more are submitted. If I remove the idle glideins, a new set starts appearing, and new glideins start running where they can. After another hour or so, a full backlog of stale glideins is back again. I've been able to reproduce this several times: (1) number of running glideins stops increasing, (2) delete idle glideins, (3) number running increases for a while, (4) repeat. If I understood the patch from this afternoon, this accumulation of stale idle glideins (increasing until no more will be submitted by the factory) should not happen anymore, right? We definitely are witnessing the success of the corrected contact string, but this matching issue still seems present. _____________________________________________________________________ Date: Fri, 30 Oct 2009 19:33:12 -0500 (CDT) From: Steven Timm I would say that you should bump up the limit higher. There are some tweaks also that you should do to your requirements thing--i.e. change fnpcfg1.fnal.gov:2119/jobmanager-condor to fnpcfg1.fnal.gov/jobmanager-condor for one site that you are vetoing. i.e. 
those that are not running now are not going to the cdf clusters but rather from fnpcfg1. I'll send more details on Monday.
_____________________________________________________________________
Date: Mon, 02 Nov 2009 10:03:08 -0600 (CST)
Request INC000000014857: Status has been updated.
Status: Completed
The root cause of all the problems experienced was a shift in the GlueCEInfoContactString format to be a different format than either the FermiGrid Site Gateway itself or the user jobs were expecting. We will be standardizing all of FermiGrid on the new format without the :2119.
_____________________________________________________________________
_____________________________________________________________________

#########
# BRATE #
#########

2009 10 29 - changed -T title handling consistent with bratewk
The node name and time are provided free.

ARK > cvs commit -m 'corrected OUTDIR to OUT, added TIMESL for title format, added host and time to title option' brate
Checking in brate;
/cvs/minoscvs/rep1/minossoft/admin/bluearc/brate,v <-- brate
new revision: 1.8; previous revision: 1.7

############
# BRATENOW #
############

Corrected bug which made bad links to recent weekly files
Added -T TITL

Fire this up for d0mino05, 6

DATA=/grid/data/monitor
HOST=d0mino05
WEBDIR=/afs/fnal.gov/files/expwww/numi/html/computing/dh/bluearc/rates
./bratenow.new -t -n d0mino05 -d ${DATA} -T /prj_root/5012
./bratenow.new -t -n d0mino05 -d ${DATA} -o ${WEBDIR} -T /prj_root/5012

ARK > cvs commit -m 'Added -l LIMIT -T TITLE, corrected weekly links, using options in call to brate, added mkdir -p recent' bratenow
Checking in bratenow;
/cvs/minoscvs/rep1/minossoft/admin/bluearc/bratenow,v <-- bratenow
new revision: 1.2; previous revision: 1.1
done

set nohup ./bratenow -n ${HOST} -d ${DATA} -o ${WEBDIR} -T /prj_root/5012 &
set nohup ./bratenow -n ${HOST} -d ${DATA} -o ${WEBDIR} -T /prj_root/5012 -w &
HOST=d0mino06
set nohup ./bratenow -n ${HOST} -d ${DATA} -o ${WEBDIR} -T /prj_root/3024 &
set
nohup ./bratenow -n ${HOST} -d ${DATA} -o ${WEBDIR} -T /prj_root/3024 -w & ============================================================================= 2009 10 28 ============================================================================= ############# # GFRONTEND # ############# No new glideins since yesterday. MINOS25 > minos_q Farm glideins: R=19 I=0 H=1 [gfrontend@minos25 log]$ tail /home/gfrontend/myvofrontend2/log/frontend_info.20091027.log [2009-10-27T14:37:15-05:00 19479] Iteration at Tue Oct 27 14:37:15 2009 [2009-10-27T14:37:22-05:00 19479] Match [2009-10-27T14:37:22-05:00 19479] Total running 358 limit 2050 [2009-10-27T14:37:22-05:00 19479] For gpgeneral@t22_glexec@minos Idle 2243 Running 358 [2009-10-27T14:37:22-05:00 19479] Advertize gpgeneral@t22_glexec@minos Request idle 40 max_run 2706 [2009-10-27T14:37:22-05:00 19479] For gpminos@t22_glexec@minos Idle 2243 Running 358 [2009-10-27T14:37:22-05:00 19479] Advertize gpminos@t22_glexec@minos Request idle 40 max_run 2706 [2009-10-27T14:37:22-05:00 19479] For cdf@t22_glexec@minos Idle 2243 Running 358 [2009-10-27T14:37:22-05:00 19479] Advertize cdf@t22_glexec@minos Request idle 40 max_run 2706 [2009-10-27T14:37:23-05:00 19479] Sleep Somebody has already restarted this [gfrontend@minos25 log]$ cat /home/gfrontend/myvofrontend2/log/frontend_info.20091028.log [2009-10-28T10:33:06-05:00 8895] Starting up [2009-10-28T10:33:06-05:00 8895] Iteration at Wed Oct 28 10:33:06 2009 [2009-10-28T10:33:09-05:00 8895] Match [2009-10-28T10:33:09-05:00 8895] Total running 11 limit 2050 [2009-10-28T10:33:09-05:00 8895] For gpgeneral@t22_glexec@minos Idle 1141 Running 11 [2009-10-28T10:33:09-05:00 8895] Advertize gpgeneral@t22_glexec@minos Request idle 40 max_run 1199 [2009-10-28T10:33:09-05:00 8895] For gpminos@t22_glexec@minos Idle 1141 Running 11 [2009-10-28T10:33:09-05:00 8895] Advertize gpminos@t22_glexec@minos Request idle 40 max_run 1199 [2009-10-28T10:33:09-05:00 8895] For cdf@t22_glexec@minos Idle 1141 Running 11 
[2009-10-28T10:33:09-05:00 8895] Advertize cdf@t22_glexec@minos Request idle 40 max_run 1199 [2009-10-28T10:33:09-05:00 8895] Sleep Farm glideins: R=65 I=108 H=1 ______________________________________________________________________ Date: Wed, 28 Oct 2009 08:35:47 -0700 (PDT) From: Ryan B. Patterson To: Arthur Kreymer Cc: minos-admin@fnal.gov, scavan@fas.harvard.edu Subject: Re: grid (fwd) Resolved. The glidein factory auto-shutdown last night due to a load spike on minos25 (usually caused by filesystem slowdown) and the auto-restart mechanism was off. ########### # BRATEWK # ########### Same changes as for brate, building on Oct 16 bratewk.new Test on d0mino05 data DATA=/grid/data/monitor HOST=d0mino05 WEBDIR=/afs/fnal.gov/files/expwww/numi/html/computing/dh/bluearc/rates/${HOST} ./bratewk.new d0mino06 20091028 "" "" ${DATA} ./bratewk.new -n d0mino05 -t 20091028 -d ${DATA} -T "/prj_root/5012" -v ARK > date ; cp -a bratewk.new bratewk Wed Oct 28 21:26:50 GMT 2009 Touch up historic d0mino plots ./bratewk -n d0mino05 -t 20090921 -d ${DATA} -T "/prj_root/5012" -o ${WEBDIR} ./bratewk -n d0mino05 -t 20090928 -d ${DATA} -T "/prj_root/5012" -o ${WEBDIR} ./bratewk -n d0mino05 -t 20091005 -d ${DATA} -T "/prj_root/5012" -o ${WEBDIR} ./bratewk -n d0mino05 -t 20091012 -d ${DATA} -T "/prj_root/5012" -o ${WEBDIR} ./bratewk -n d0mino05 -t 20091019 -d ${DATA} -T "/prj_root/5012" -o ${WEBDIR} ./bratewk -n d0mino05 -t 20091026 -d ${DATA} -T "/prj_root/5012" -o ${WEBDIR} HOST=d0mino06 WEBDIR=/afs/fnal.gov/files/expwww/numi/html/computing/dh/bluearc/rates/${HOST} for DATE in 20090921 20090928 20091005 20091012 20091019 20091026 ; do ./bratewk -n ${HOST} -t ${DATE} -d ${DATA} -T "/prj_root/3024" -o ${WEBDIR} done ARK > cvs commit -m "forced STEP to 10, added qualifiers including -T title and -v, default scale 120" bratewk Checking in bratewk; /cvs/minoscvs/rep1/minossoft/admin/bluearc/bratewk,v <-- bratewk new revision: 1.4; previous revision: 1.3 done ARK > cvs commit -m "forced STEP 
to 10, added qualifiers including -T title and -v, default scale 120" brate Checking in brate; /cvs/minoscvs/rep1/minossoft/admin/bluearc/brate,v <-- brate new revision: 1.7; previous revision: 1.6 done ######### # BRATE # ######### brate.new 20091028 Forced step size to 10 ( Oct 16 ) Added qualifiers Test on d0mino06 data DATADIR=/grid/data/monitor WEBDIR=/afs/fnal.gov/files/expwww/numi/html/computing/dh/bluearc/rates/d0mino06 ./brate.new d0mino06 20091028 "" "" ${DATADIR} ./brate.new -n d0mino05 -t 20091028 -d ${DATADIR} -T "/prj_root/5012 from d0mino05 20091028" -v ARK > date ; cp -a brate.new brate Wed Oct 28 16:04:25 GMT 2009 ######### # ADMIN # ######### Updated INC000000014293 regarding vdt-ca-certs rpm ============================================================================= 2009 10 27 kreymer on vacation ============================================================================= ######## # GRID # ######## Clean up of ./-d directory in /home/kreymer on grid nodes. I think that these are stray Parrot working directories. -bash-3.00$ cd ./-d -bash-3.00$ ls -ld . drwxr-xr-x 259 kreymer e875 18432 Aug 14 2008 . -bash-3.00$ ls 00 07 0e 15 1c 23 2a 31 38 3f 46 4d 54 5b 62 69 70 77 7e 85 8c 93 9a a1 a8 af b6 bd c4 cb d2 d9 e0 e7 ee f5 fc 01 08 0f 16 1d 24 2b 32 39 40 47 4e 55 5c 63 6a 71 78 7f 86 8d 94 9b a2 a9 b0 b7 be c5 cc d3 da e1 e8 ef f6 fd 02 09 10 17 1e 25 2c 33 3a 41 48 4f 56 5d 64 6b 72 79 80 87 8e 95 9c a3 aa b1 b8 bf c6 cd d4 db e2 e9 f0 f7 fe 03 0a 11 18 1f 26 2d 34 3b 42 49 50 57 5e 65 6c 73 7a 81 88 8f 96 9d a4 ab b2 b9 c0 c7 ce d5 dc e3 ea f1 f8 ff 04 0b 12 19 20 27 2e 35 3c 43 4a 51 58 5f 66 6d 74 7b 82 89 90 97 9e a5 ac b3 ba c1 c8 cf d6 dd e4 eb f2 f9 txn 05 0c 13 1a 21 28 2f 36 3d 44 4b 52 59 60 67 6e 75 7c 83 8a 91 98 9f a6 ad b4 bb c2 c9 d0 d7 de e5 ec f3 fa 06 0d 14 1b 22 29 30 37 3e 45 4c 53 5a 61 68 6f 76 7d 84 8b 92 99 a0 a7 ae b5 bc c3 ca d1 d8 df e6 ed f4 fb -bash-3.00$ rmdir * -bash-3.00$ cd .. 
-bash-3.00$ rmdir ./-d ######### # ADMIN # ######### Date: Tue, 27 Oct 2009 13:54:25 -0500 (CDT) Request INC000000014523 requested by you has been submitted. Status: New Summary: Minos Cluster - add dbox to e875 Notes: FEF primary - run2-sys@fnal.gov Please add dbox to the e875 group on the Minos Cluster. ________________________________________________________________ Date: Tue, 27 Oct 2009 16:07:02 -0500 (CDT) Status: Completed dbox added to e875 group ________________________________________________________________ ________________________________________________________________ ________________________________________________________________ ============================================================================= 2009 10 26 kreymer on vacation ============================================================================= Date: Mon, 26 Oct 2009 18:52:56 -0500 From: Laura Mengel To: kreymer@fnal.gov Subject: KCA or IP web access restriction Hi Art, I think you were the one that asked me about this. I added documentation and an example on this (See #8): http://www-css.fnal.gov/csi/webdocs/access_apache.html#certip ####### # DAQ # ####### Date: Mon, 26 Oct 2009 14:53:18 -0500 From: Donald M. Gustafson To: "netdown@fnal.gov" Subject: Notice of Schedule Network Maintenance (Soudan Network) Notice of Scheduled Network Maintenance (Soudan Network) Date/Time:   Thursday October 29th, 2009 7:00 AM Duration: 30 minutes Description: Add static routes for Soudan network and then redistribute them into OSPF.   Affected areas of the network: Soudan network Unaffected areas of the network Non Soudan networks Expected Results: The routes will be added to OSPF and there will be no noticeable outage. 
Contact: Don Gustafson x6927 ######## # LOCK # ######## Regarding slow rtoner locks : $ /grid/fermiapp/minos/scripts/lock status LOCK STATUS Mon Oct 26 12:02:45 CDT 2009 LOCKS 20 of 20 ( 13 stale ) rtoner 20 QUEUE 478 ( 16 stale) mho 1 rtoner 426 xbhuang 51 ------------------------------------------------- Checked out a job on fnpcsrv1212 20091026.16:54:37.4620.fcdfcaf1212.30067.minosana.rtoner cp /minos/data2/LEM/sntp_libraries/nueCC_libraries/f21134048_0007_L010185N_D04.ntupleStS.root /minos/data2/LEM/sntp_libraries/nueCC_libraries/f21134048_0008_L010185N_D04.ntupleStS.root /minos/data2/LEM/sntp_libraries/nueCC_libraries/f21134048_0009_L010185N_D04.ntupleStS.root /minos/data2/LEM/sntp_libraries/nueCC_libraries/f21134049_0000_L010185N_D04.ntupleStS.root /minos/data2/LEM/sntp_libraries/nueCC_libraries/f21134049_0001_L010185N_D04.ntupleStS.root /minos/data2/LEM/sntp_libraries/nueCC_libraries/f21134049_0002_L010185N_D04.ntupleStS.root /minos/data2/LEM/sntp_libraries/nueCC_libraries/f21134049_0003_L010185N_D04.ntupleStS.root /minos/data2/LEM/sntp_libraries/nueCC_libraries/f21134049_0004_L010185N_D04.ntupleStS.root /minos/data2/LEM/sntp_libraries/nueCC_libraries/f21134049_0005_L010185N_D04.ntupleStS.root /minos/data2/LEM/sntp_libraries/nueCC_libraries/f21134049_0006_L010185N_D04.ntupleStS.root /local/stage1/condor/execute/dir_28146/glide_x28192/execute/dir_29141/no_xfer These are all about 66 MBytes. Copies are being done under parrot. 
Checking times : -bash-3.00$ ls -l /local/stage1/condor/execute/dir_28146/glide_x28192/execute/dir_29141/no_xfer total 667548 -rw-r--r-- 1 minosana e875 80096 Oct 26 12:14 CCE_f21134048_0007_L010185N_D04.ntupleStS.root -rw-r--r-- 1 minosana e875 69708612 Oct 26 11:56 f21134048_0007_L010185N_D04.ntupleStS.root -rw-r--r-- 1 minosana e875 68635704 Oct 26 11:58 f21134048_0008_L010185N_D04.ntupleStS.root -rw-r--r-- 1 minosana e875 67314451 Oct 26 12:00 f21134048_0009_L010185N_D04.ntupleStS.root -rw-r--r-- 1 minosana e875 68759100 Oct 26 12:02 f21134049_0000_L010185N_D04.ntupleStS.root -rw-r--r-- 1 minosana e875 67957116 Oct 26 12:03 f21134049_0001_L010185N_D04.ntupleStS.root -rw-r--r-- 1 minosana e875 67975510 Oct 26 12:04 f21134049_0002_L010185N_D04.ntupleStS.root -rw-r--r-- 1 minosana e875 68716180 Oct 26 12:06 f21134049_0003_L010185N_D04.ntupleStS.root -rw-r--r-- 1 minosana e875 67779853 Oct 26 12:08 f21134049_0004_L010185N_D04.ntupleStS.root -rw-r--r-- 1 minosana e875 68023472 Oct 26 12:09 f21134049_0005_L010185N_D04.ntupleStS.root -rw-r--r-- 1 minosana e875 67856256 Oct 26 12:11 f21134049_0006_L010185N_D04.ntupleStS.root The copies took 15 minutes, for 652 Mbytes of data. Caught another file copy in action, on fcdfcaf1103.fnal.gov -bash-3.00$ time dd if=/grid/data/minos/bluwatch/stash/A/2/file0300 of=/dev/null 21484+1 records in 21484+1 records out real 0m10.313s -bash-3.00$ time dd if=/grid/data/minos/bluwatch/stash/A/2/file0301 of=/dev/null real 0m11.282s -bash-3.00$ time dd if=/grid/data/minos/bluwatch/stash/A/2/file0302 of=/dev/null real 0m10.067s Checked out some /minos/scratch files, still slow, 2 MBytes/sec. 
-bash-3.00$ time dd if=/minos/scratch//bluwatch/minos-sam04/0/file2000 of=/dev/null
real 0m4.051s
-bash-3.00$ time dd if=/minos/scratch//bluwatch/minos-sam04/0/file1984 of=/dev/null
real 0m4.717s

-bash-3.00$ uname -a
Linux fcdfcaf1103.fnal.gov 2.6.9-89.0.9.ELsmp #1 SMP Mon Aug 24 08:50:41 CDT 2009 x86_64 x86_64 x86_64 GNU/Linux
-bash-3.00$ cat /etc/redhat-release
Scientific Linux SL release 4.7 (Beryllium)
-bash-3.00$ cat /proc/cpuinfo
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 15
model name : Intel(R) Xeon(R) CPU 5148 @ 2.33GHz
stepping : 6
cpu MHz : 2327.503
cache size : 4096 KB
...
processor : 1
processor : 2
processor : 3

Let's look at a newer GPFarm node locked by Ruth : fnpc330

---------------------
cleanup
$ ls /grid/data/e875/LOCK/LOG | wc -l
43081

Found my workstation screen blinking, and messages furiously scrolling
ls: /minos/test9293/stash/2/*: Stale NFS file handle
ls: /minos/test9293/stash/2/*: Stale NFS file handle

12815 pts/1 S 24:44 \_ /bin/bash /afs/fnal.gov/files/home/room1/kreymer/minos/scripts/bluwatch -r -b /minos/test9293/stash/2 -l /grid/data/monitor/test9293
5242 pts/1 S 0:00 | \_ /bin/bash /afs/fnal.gov/files/home/room1/kreymer/minos/scripts/bluwatch -r -b /minos/test9293/stash/2 -l /grid/data/monitor/test9293
5243 pts/1 S 0:00 | \_ ls -d /minos/test9293/stash/2/*
5244 pts/1 S 0:00 | \_ grep -v 2005

kill 12815

AAAAHHHHHHHHHHHHH that feels much better.
The test9293 disk tests are complete, it is likely being put into service as intended, for Windows.

Also kill off the bratenow plots
25919 pts/1 S 0:11 \_ /bin/bash /afs/fnal.gov/files/home/room1/kreymer/minos/scripts/bratenow -n minos27 -d /grid/data/monitor/test9293 -o /afs/fnal.gov/files/expwww/numi/h
kill 25919

#######
# CRL #
#######

Date: Mon, 26 Oct 2009 10:01:42 -0500
From: Suzanne Gysin
There will be a short outage of the Control Room Logbook to replace the server's CPU on Tuesday 10/27 from 1 pm - 2pm.
############ # MCIMPORT # ############ Added hennessy to the .k5loginmin of mindata@minos26/27 Removed buckley ############ # MCIMPORT # ############ Updated minos27 mcimport to match minos26, including symlinks for overload control. The original minos27 .k5login was missing a few recent entries. $ scp minos26:.k5loginfull .k5loginfull $ scp minos26:.k5loginmin .k5loginmin $ sdiff -s .k5loginfull .k5login loiacono@FNAL.GOV | mho@FNALGOV mho@FNAL.GOV < mstrait@FNAL.GOV < sbudd@FNAL.GOV < wingmc@FNAL.GOV < xbhuang@FNAL.GOV < minos-wh-cr/minos/minos-om.fnal.gov@FNAL.GOV < Put in place the symlink. $ ln -sf .k5loginfull .k5login ============================================================================= 2009 10 23 ============================================================================= ######## # FARM # ######## Adjusted group_batch priority to normal value, post mrnt processing. condor_userprio -setfactor group_batch@fnal.gov 20 ######## # FARM # ######## MINOS27 > ls /minos/data/minfarm/farcat | grep mrnt.dogwood1 > /tmp/farcatmd1 MINOS27 > wc -l /tmp/farcatmd1 32794 /tmp/farcatmd1 MINOS27 > FARMD=`cat /tmp/farcatmd1 | cut -f 1 -d .` MINOS27 > RUNMD=`printf "${FARMD}\n" | cut -f 1 -d _ | sort -u` printf "${RUNMD}\n" | wc -l 2037 SAMDIM=" DATA_TIER bntp-far and VERSION dogwood1 and RUN_NUMBER > 30612 and RUN_NUMBER < 43640 " sam list files --dim="${SAMDIM}" --nosummary | cut -f 1 -d _ | sort -u | wc -l 2193 FARBR=`sam list files --dim="${SAMDIM}" --nosummary | cut -f 1 -d _ | sort -u` SAMDIM=" DATA_TIER bntp-far and VERSION cedar.phy.bhcurv and RUN_NUMBER > 30612 and RUN_NUMBER < 43640 " sam list files --dim="${SAMDIM}" --nosummary | cut -f 1 -d _ | sort -u | wc -l 1710 SAMDIM=" DATA_TIER mrnt-far and VERSION cedar.phy.bhcurv and RUN_NUMBER > 30612 and RUN_NUMBER < 43640 " sam list files --dim="${SAMDIM}" --nosummary | cut -f 1 -d _ | sort -u | wc -l 1708 So we do not expect a 5% attrition to mrnt. 
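The run counts above can be turned into attrition percentages, to make the cedar versus dogwood1 comparison explicit. This is a minimal sketch, not part of the Minos scripts: the helper name attrition is hypothetical, and the four numbers are the run counts printed above.

```shell
# attrition PRODUCED EXPECTED : percent of EXPECTED runs missing from PRODUCED
# ( hypothetical helper, for illustration only )
attrition() {
    awk -v got="$1" -v want="$2" 'BEGIN { printf "%.1f\n", 100 * (want - got) / want }'
}

attrition 1708 1710    # cedar.phy.bhcurv : mrnt runs vs bntp runs, both from SAM
attrition 2037 2193    # dogwood1 : mrnt runs in farcat vs bntp runs in SAM
```

The first call gives the historical loss from bntp to mrnt, the second the shortfall seen so far in the dogwood1 pass.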
printf "${RUNMD}\n" > /tmp/RUNMD printf "${FARBR}\n" > /tmp/FARBR MINOS27 > sdiff /tmp/RUNMD /tmp/FARBR | less Noted these features : Volunteer early mrnt run F00030612 No mrnt files from April or May 2005 F00030613 through F00031811 Extra mrnt's not present in bntp's ( as known to SAM ) F00035374 through F00035721 sdiff F00034618 F00034618 F00035374 < ######### # ADMIN # ######### Date: Fri, 23 Oct 2009 13:59:25 -0500 (CDT) Request INC000000014293 requested by you has been submitted. Status: New Summary: Minos Cluster grid-security/certificates Notes: FEF Primary - run2-sys The KCA server upgrade has been deferred. But before this happens, please do the updates mentioned by Steve Timm . This update may be needed on all Minos Cluster systems. Most nodes were updated July 29 2009. Minos25 is out of date, Sep 30 2008. ---------------------------------------------------------------------------------------------------------- ---------------------- Date: Fri, 23 Oct 2009 11:29:57 -0500 (CDT) From: Steven Timm To: kreymer@fnal.gov Subject: minos25 /etc/grid-security/certificates Art-- while working with Lee this morning I saw that the /etc/grid-security/certificates directory on minos25 hasn't been updated since sep 2008, which means you don't have the CA file for the new KCA in there. you need to do something to get that in there otherwise everything breaks at the kca switchover. Steve -- ------------------------------------------------------------------ Steven C. Timm, Ph.D (630) 840-8525 timm@fnal.gov http://home.fnal.gov/~timm/ Fermilab Computing Division, Scientific Computing Facilities, Grid Facilities Department, FermiGrid Services Group, Assistant Group Leader. 
______________________________________________________________________
Date: Fri, 23 Oct 2009 14:17:42 -0500 (CDT)
Status: In Progress
_____________________________________________________________________
Date: Fri, 23 Oct 2009 14:19:44 -0500 (CDT)
Please forgive my ignorance, but what is required to perform these updates? RPM update? Some bit of VDT magic? Copy and paste?
Thank you for any pointers provided.
_____________________________________________________________________
Date: Fri, 23 Oct 2009 14:19:44 -0500 (CDT)
Your request is now waiting on an action to be taken by someone other than our support staff.
Status: Pending
_____________________________________________________________________
2009 10 27 Investigation, on fcdfcaf1212
/etc/grid-security/certificates ->
/usr/local/grid/globus/TRUSTED_CA ->
/usr/local/grid/globus/share/certificates ->
certificates-1.9
The same on fnpc340
We seem to be up to date on minos26,
_____________________________________________________________________
Date: Tue, 27 Oct 2009 19:12:20 +0000 (GMT)
From: Arthur Kreymer
To: Fermilab Service Desk
Cc: jason@fnal.gov, run2-sys@fnal.gov, timm@fnal.gov, minos-admin@fnal.gov
Subject: Re: Procedural information required for INC000000014293
Regarding your question about the source of the certificates, I suggest contacting Steve Timm, who might know.
I observe that the certificates are being currently updated on minos01 through minos26 , as of about 04:00 through 06:00 this morning.
They are still out of date on minos25, and missing on minos11 and minos27.
_____________________________________________________________________
Date: Tue, 27 Oct 2009 15:04:08 -0500
From: Jason Harrington
Steve, Is there some bit of VDT magic required to make sure grid, KCA certificates are up to date on the minos condor nodes? Is it some RPM that needs installing, or updating?
Thank you for any insights you can provide.
_____________________________________________________________________
2009 10 28 investigation

MINOS01 > rpm -qf /etc/grid-security/certificates/3232b9bc.info
vdt-ca-certs-52-1

for NODE in ${NODES} ; do
printf "${NODE} "
ssh -akx ${NODE} 'rpm -qi vdt-ca-certs-52-1 | grep Install | cut -f 1 -d B' 2> /dev/null
done

minos01 Install Date: Tue Oct 27 05:34:02 2009
minos03 Install Date: Tue Oct 27 05:31:18 2009
minos04 Install Date: Tue Oct 27 04:34:15 2009
minos05 Install Date: Tue Oct 27 05:04:48 2009
minos06 Install Date: Tue Oct 27 05:00:23 2009
minos07 Install Date: Tue Oct 27 04:41:03 2009
minos08 Install Date: Tue Oct 27 04:17:38 2009
minos09 Install Date: Tue Oct 27 05:47:09 2009
minos10 Install Date: Tue Oct 27 05:02:39 2009
minos11 minos12 Install Date: Tue Oct 27 06:51:14 2009
minos13 Install Date: Tue Oct 27 05:03:25 2009
minos14 Install Date: Tue Oct 27 05:32:00 2009
minos15 Install Date: Tue Oct 27 04:49:40 2009
minos16 Install Date: Tue Oct 27 06:51:59 2009
minos17 Install Date: Tue Oct 27 06:26:11 2009
minos18 Install Date: Tue Oct 27 05:06:06 2009
minos19 Install Date: Tue Oct 27 05:38:38 2009
minos20 Install Date: Tue Oct 27 06:45:05 2009
minos21 Install Date: Tue Oct 27 06:59:06 2009
minos22 Install Date: Tue Oct 27 06:24:22 2009
minos23 Install Date: Tue Oct 27 05:15:37 2009
minos24 Install Date: Tue Oct 27 05:14:39 2009
minos25 minos26 Install Date: Tue Oct 27 05:58:36 2009
minos27
_____________________________________________________________________
Scanned the servers

minos-mysql1 minos-mysql2 minos-mysql3 Install Date: Tue Oct 27 04:41:43 2009
minos-sam01 minos-sam02 minos-sam03 minos-sam04

These files are small, and maintenance is minimal (autoyum).
I suggest installing vdt-ca-certs on all Minos Cluster and Server hosts.
_____________________________________________________________________
Date: Wed, 28 Oct 2009 09:55:15 -0500
From: Jason Harrington
minos{11,25,27} were simply missing vdt-ca-certs. This has been fixed.
minos-mysql{1,2} and minos-sam0{1,2,3,4} did not have the vdt-ca-certs yum repo defined because cfengine was only pulling this onto condor nodes. I have opened this up to all minos nodes and run an update. I have installed the vdt-ca-certs package on these nodes, as well. _____________________________________________________________________ Date: Wed, 28 Oct 2009 09:55:46 -0500 (CDT) Status: In Progress Status: Completed vdt-ca-certs yum repo has been defined on minos nodes where missing. vdt-ca-certs package has been installed on minos nodes where missing. _____________________________________________________________________ Date: Wed, 28 Oct 2009 12:22:08 -0500 (CDT) From: Steven Timm The VDT client has cron jobs that it runs to automatically update the CA certificates and the CRL files. Those were never installed on minos25. (That's what you were trying to install for minerva last week.) As others have pointed out, you can also install the CA cert files via RPM and there is a yum repository at the VDT and at the GOC that does keep them up to date. Whichever way you want to do it on the minos nodes is fine with me. _____________________________________________________________________ Date: Wed, 28 Oct 2009 13:10:20 -0500 From: Jason Harrington Thank you, Steve. We went with the RPM route as this was done on most of the minos machines running condor. Jason M. 
Harrington _____________________________________________________________________ _____________________________________________________________________ ######## # FARM # ######## Checking file count needed for dogwood1 mrnt reprocessing First .bntp F00030613_0000.spill.bntp.dogwood1.0.root Last .bntp F00043658_0000.spill.bntp.dogwood1.0.root But the last beam in June 09 was Run 43639 Subrun 7 SAMDIM=" DATA_TIER raw-far and RUN_NUMBER > 30612 and RUN_NUMBER < 43640 " sam list files --summaryonly --dim="${SAMDIM}" Average File Size: 31.45MB Total File Size: 1.41TB Total Event Count: 568781383 Checking recent farcat output : MINOS27 > ls -ltr /minos/data/minfarm/farcat | grep dogwood1 | tail -100 ... 7457640 Oct 23 09:18 F00043611_0012.spill.mrnt.dogwood1.1.root 7373429 Oct 23 09:18 F00043636_0002.spill.mrnt.dogwood1.1.root 7462116 Oct 23 09:18 F00043636_0009.spill.mrnt.dogwood1.1.root 7506114 Oct 23 09:18 F00043639_0004.spill.mrnt.dogwood1.1.root 7369387 Oct 23 09:18 F00043636_0019.spill.mrnt.dogwood1.1.root 7361639 Oct 23 09:18 F00043621_0010.spill.mrnt.dogwood1.1.root 7520646 Oct 23 09:18 F00043639_0011.spill.mrnt.dogwood1.1.root 7488479 Oct 23 09:18 F00043624_0014.spill.mrnt.dogwood1.1.root 7509551 Oct 23 09:18 F00043639_0003.spill.mrnt.dogwood1.1.root 7521643 Oct 23 09:18 F00043639_0005.spill.mrnt.dogwood1.1.root 7419839 Oct 23 09:18 F00043624_0019.spill.mrnt.dogwood1.1.root 7446857 Oct 23 09:18 F00043636_0015.spill.mrnt.dogwood1.1.root 7387306 Oct 23 09:19 F00043636_0010.spill.mrnt.dogwood1.1.root 7462032 Oct 23 09:19 F00043636_0008.spill.mrnt.dogwood1.1.root 7445128 Oct 23 09:19 F00043639_0001.spill.mrnt.dogwood1.1.root 7490160 Oct 23 09:20 F00043636_0004.spill.mrnt.dogwood1.1.root 7472877 Oct 23 09:21 F00043636_0023.spill.mrnt.dogwood1.1.root 7511134 Oct 23 09:22 F00043636_0014.spill.mrnt.dogwood1.1.root 7002662 Oct 23 09:49 F00043515_0020.spill.mrnt.dogwood1.1.root 7442153 Oct 23 10:21 F00043639_0012.spill.mrnt.dogwood1.1.root 7431587 Oct 23 10:35 
F00043569_0006.spill.mrnt.dogwood1.1.root
1473421 Oct 23 11:03 F00043651_0016.spill.mrnt.dogwood1.1.root

The slowdown hit at about 09:20, after processing runs
43515 43569 43611 43618 43621 43624 ( 43627/30/33 have small bntp files ) 43636 43639

=============================================================================
2009 10 22
=============================================================================

########
# FARM #
########

Adjusted group_batch priority to allow better FD mrnt processing with fewer slots.
Also set rubin, in case that is needed. I doubt that, no CPU has been used here.

$ condor_userprio -all | grep group_batch
group_batch@fnal.gov 6031.97 301.60 20.00 69 344930.17 6/26/2009 17:30 10/22/2009 17:34

$ condor_userprio -setfactor group_batch@fnal.gov 5

Not good enough, still behind lots of others.
Dropped it to 1, still behind many others.
But good enough, rubin idle jobs have cleared quickly.

New files have been flooding into farcat as of 17:32,

MINOS25 > du -sm /minos/data/minfarm/farcat ; sleep 100
103982 /minos/data/minfarm/farcat
104070 /minos/data/minfarm/farcat
104247 /minos/data/minfarm/farcat

N.B. the mrnt sntp files are about 7 MB in size
This seems consistent, I see roughly 6 seconds per file into farcat.

Monitoring fnpc350, one loon process ran 4 minutes, 99% cpu, the second was stuck.
A typical running process :

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
31875 minospro 35 10 812m 469m 58m R 100 2.9 3:14.54 loon

This ran 4:23 CPU
The next 4:02
The next 4:20

Connections to mysql1 were
22819 temp Sleep 1
22820 dogwood1 Prepare 0

By 18:13, the load on minos-mysql1 built to 30.

At 18:20
MINOS25 > ls /minos/data/minfarm/farcat | grep mrnt.dogwood1 | wc -l
24493

N.B. The database data delivery, per Ganglia, is flat at 15 MBytes/sec.

#########
# ADMIN #
#########

For support, added rayp and romero accounts to our cluster.
MINOS01 > setup systools MINOS01 > cmd add_minos_user rayp MINOS01 > cmd add_minos_user romero ============================================================================= 2009 10 21 ============================================================================= ######### # ADMIN # ######### From: Arthur Kreymer To: Andrew J. Romero Cc: Ramon C. Pasetes , minos-admin@fnal.gov Subject: RE: Test Volume on new Minos disks On Fri, 16 Oct 2009, Arthur Kreymer wrote: > Data > > /grid/data/monitor/test9293 > > Plot > > http://www-numi.fnal.gov/computing/dh/bluearc/test9293/minos27/NOW.png > > Looks very similar to the blue2:/art-test performance so far. > I have not started load tests yet. I did some load tests on Saturday, with 10, 20, 40, and 80 clients reading at once from minos-sam04. The data rates to minos-sam04 saturated at 120 MBytes/sec in all cases. The monitor process on minos27 continued to see entirely normal data rates. In short, this level of load did not seem to cause problems. I am on shift Mon-Thu 08:00-16:00 in the Minos Control Room, WH12NW x3368 . I'll try to push the load higher today, no guarantees. ######### # ADMIN # ######### Added minos-mysql1.kreymer, minos-mysql2.kreymer ... on minos-mysql1 cdadmin cd crontab crontab minos-mysql1.kreymer ... on minos-mysql2 cdadmin cd crontab crontab minos-mysql2.kreymer ######### # MYSQL # ######### Restarted database monitoring manually on minos-mysql1 and 2 ADM=/afs/fnal.gov/files/expwww/numi/html/computing/admin/mysql/scripts set nohup ; ${ADM}/topdb_log minos-mysql1 & set nohup ; ${ADM}/topdb_log minos-mysql2 & ########### # BLUEARC # ########### Final test in /test9293, 400 fold copy ITER=0 ; NITER=40 # 400 copies cat $ML/400/0_0 Wed Oct 21 07:30:53 CDT 2009 /minos/test9293/kreymer/0/0 real 125m35.984s user 0m0.019s sys 0m1.952s Wed Oct 21 09:36:29 CDT 2009 Load average nearly 380 Data rates still net 120 MB/sec No slowdown seen by minos27, in fact a speed up again. 
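One way to read the three load tests ( 30 and 200 readers in the 2009 10 20 entry below, 400 readers here ) is elapsed time per reader. This is a minimal sketch: the helper name per_reader is hypothetical, and the inputs are the 'real' times recorded in the 0_0 logs.

```shell
# per_reader ELAPSED NREADERS : wall-clock seconds per concurrent reader
# ELAPSED is in the format printed by bash 'time', e.g. 125m35.984s
# ( hypothetical helper, for illustration only )
per_reader() {
    awk -v t="$1" -v n="$2" 'BEGIN {
        split(t, a, /[ms]/)                 # a[1] = minutes, a[2] = seconds
        printf "%.2f\n", (a[1] * 60 + a[2]) / n
    }'
}

per_reader 9m28.854s     30    # NITER=3
per_reader 63m10.718s   200    # NITER=20
per_reader 125m35.984s  400    # NITER=40
```

All three come out near 19 seconds per reader, so elapsed time grows linearly with the number of readers. That is the pattern expected when the total rate is pinned at a fixed ceiling ( 120 MB/sec here ) and no extra slowdown sets in under load.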
============================================================================= 2009 10 20 ============================================================================= ########### # BLUEARC # ########### Performance test using the new directories. Let 'er rip entirely from the /minos/test9293/kreymer areas. ML=/minos/scratch/kreymer/log/thrash ITER=0 ; NITER=3 while [ ${ITER} -lt ${NITER} ] ; do printf "${ITER}\n" for SDIR in 0 1 2 3 4 5 6 7 8 9 ; do DATA=/minos/test9293/kreymer/${ITER}/${SDIR} printf "${DATA}\n" { date printf "${DATA}\n" time cat ${DATA}/* > /dev/null date } > ${ML}/${NITER}0/${ITER}_${SDIR} 2>&1 & done (( ITER++ )) done Warmed up with NITER=3 cat $ML/30/0_0 Tue Oct 20 21:02:12 CDT 2009 /minos/test9293/kreymer/0/0 real 9m28.854s user 0m0.032s sys 0m1.233s Tue Oct 20 21:11:41 CDT 2009 This worked, load average about 30, elapsed right between former 20 and 40 times. fired off NITER=20 ITER=0 ; NITER=20 # 200 copies ( monitor with top, use O option a to order by pid ) cat $ML/200/0_0 Tue Oct 20 21:13:43 CDT 2009 /minos/test9293/kreymer/0/0 real 63m10.718s user 0m0.022s sys 0m1.545s Tue Oct 20 22:16:54 CDT 2009 Load average nearly 190 Data rates still net 120 MB/sec No slowdown seen by minos27 In fact, read rates went up substantially. 
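The load loop above can be wrapped as a reusable function, so that the reader count is a single argument. This is a minimal sketch under stated assumptions: the name thrash and its three arguments are hypothetical, the log directory is passed in directly instead of being built as ${ML}/${NITER}0, and elapsed seconds come from 'date +%s' instead of the bash 'time' keyword so that the function also runs under plain sh.

```shell
# thrash SRC LOGDIR NITER : start NITER x 10 concurrent readers, then wait.
# Reads SRC/<iter>/<sdir>/* ; writes one log per reader to LOGDIR/<iter>_<sdir>
# ( hypothetical helper, for illustration only )
thrash() (
    src=$1 ; logdir=$2 ; niter=$3
    mkdir -p "${logdir}" || exit 1
    iter=0
    while [ "${iter}" -lt "${niter}" ] ; do
        for sdir in 0 1 2 3 4 5 6 7 8 9 ; do
            data=${src}/${iter}/${sdir}
            {
                start=$(date +%s)
                date
                printf '%s\n' "${data}"
                cat "${data}"/* > /dev/null
                printf 'elapsed %s sec\n' "$(( $(date +%s) - start ))"
                date
            } > "${logdir}/${iter}_${sdir}" 2>&1 &
        done
        iter=$(( iter + 1 ))
    done
    wait    # return only after every reader has finished
)
```

The NITER=3 warm-up would then be : thrash /minos/test9293/kreymer ${ML}/30 3
The function body is a subshell, so its variables do not leak into the login session, and the final wait keeps the prompt from returning while readers are still active.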
######## # FARM # ######## Status of n13047005_0017_L010185R_D04.mrnt.dogwood1.0.root PNFS already contains n13047005_0000_L010185R_D04.mrnt.dogwood1.0.root n13047005_0018_L010185R_D04.mrnt.dogwood1.0.root FILE=n13047005_0000_L010185R_D04.mrnt.dogwood1.0.root sam get metadata --file=${FILE} | grep parents | tr "'" \\\n | grep root | sort n13047005_0000_L010185R_D04.reroot.root FILE=n13047005_0018_L010185R_D04.mrnt.dogwood1.0.root n13047005_0018_L010185R_D04.reroot.root n13047005_0019_L010185R_D04.reroot.root n13047005_0020_L010185R_D04.reroot.root n13047005_0021_L010185R_D04.reroot.root n13047005_0022_L010185R_D04.reroot.root n13047005_0023_L010185R_D04.reroot.root n13047005_0024_L010185R_D04.reroot.root n13047005_0025_L010185R_D04.reroot.root n13047005_0026_L010185R_D04.reroot.root n13047005_0027_L010185R_D04.reroot.root n13047005_0028_L010185R_D04.reroot.root n13047005_0029_L010185R_D04.reroot.root n13047005_0030_L010185R_D04.reroot.root ls /pnfs/minos/mcout_data/dogwood1/near/daikon_04/L010185R/cand_data/700 \ | grep n13047005_ n13047005_0000_L010185R_D04.cand.dogwood1.0.root n13047005_0001_L010185R_D04.cand.dogwood1.0.root n13047005_0002_L010185R_D04.cand.dogwood1.0.root n13047005_0003_L010185R_D04.cand.dogwood1.0.root n13047005_0004_L010185R_D04.cand.dogwood1.0.root n13047005_0005_L010185R_D04.cand.dogwood1.0.root n13047005_0006_L010185R_D04.cand.dogwood1.0.root n13047005_0007_L010185R_D04.cand.dogwood1.0.root n13047005_0008_L010185R_D04.cand.dogwood1.0.root n13047005_0009_L010185R_D04.cand.dogwood1.0.root n13047005_0010_L010185R_D04.cand.dogwood1.0.root n13047005_0011_L010185R_D04.cand.dogwood1.0.root n13047005_0012_L010185R_D04.cand.dogwood1.0.root n13047005_0013_L010185R_D04.cand.dogwood1.0.root n13047005_0014_L010185R_D04.cand.dogwood1.0.root n13047005_0015_L010185R_D04.cand.dogwood1.0.root n13047005_0016_L010185R_D04.cand.dogwood1.0.root n13047005_0017_L010185R_D04.cand.dogwood1.0.root n13047005_0018_L010185R_D04.cand.dogwood1.0.root 
n13047005_0019_L010185R_D04.cand.dogwood1.0.root n13047005_0020_L010185R_D04.cand.dogwood1.0.root n13047005_0021_L010185R_D04.cand.dogwood1.0.root n13047005_0022_L010185R_D04.cand.dogwood1.0.root n13047005_0023_L010185R_D04.cand.dogwood1.0.root n13047005_0024_L010185R_D04.cand.dogwood1.0.root n13047005_0025_L010185R_D04.cand.dogwood1.0.root n13047005_0026_L010185R_D04.cand.dogwood1.0.root n13047005_0027_L010185R_D04.cand.dogwood1.0.root n13047005_0028_L010185R_D04.cand.dogwood1.0.root n13047005_0029_L010185R_D04.cand.dogwood1.0.root n13047005_0030_L010185R_D04.cand.dogwood1.0.root MINOS26 > ls -l /pnfs/minos/mcout_data/dogwood1/near/daikon_04/L010185R/sntp_data/700 | grep n13047005 -rw-r--r-- 1 minospro e875 811687214 Jul 14 03:30 n13047005_0000_L010185R_D04.sntp.dogwood1.0.root -rw-r--r-- 1 minospro e875 47372204 Sep 20 04:44 n13047005_0017_L010185R_D04.sntp.dogwood1.0.root -rw-r--r-- 1 minospro e875 623810920 Jul 14 03:30 n13047005_0018_L010185R_D04.sntp.dogwood1.0.root MINOS26 > ls -l /pnfs/minos/mcout_data/dogwood1/near/daikon_04/L010185R/mrnt_data/700 | grep n13047005 -rw-r--r-- 1 minospro e875 167702830 Jul 14 04:59 n13047005_0000_L010185R_D04.mrnt.dogwood1.0.root -rw-r--r-- 1 minospro e875 129478822 Jul 14 04:59 n13047005_0018_L010185R_D04.mrnt.dogwood1.0.root The SNTP parents are correct : MINOS26 > FILE=n13047005_0000_L010185R_D04.sntp.dogwood1.0.root MINOS26 > sam get metadata --file=${FILE} | grep parents | tr "'" \\\n | grep root | sort n13047005_0000_L010185R_D04.reroot.root n13047005_0001_L010185R_D04.reroot.root n13047005_0002_L010185R_D04.reroot.root n13047005_0003_L010185R_D04.reroot.root n13047005_0004_L010185R_D04.reroot.root n13047005_0005_L010185R_D04.reroot.root n13047005_0006_L010185R_D04.reroot.root n13047005_0007_L010185R_D04.reroot.root n13047005_0008_L010185R_D04.reroot.root n13047005_0009_L010185R_D04.reroot.root n13047005_0010_L010185R_D04.reroot.root n13047005_0011_L010185R_D04.reroot.root n13047005_0012_L010185R_D04.reroot.root 
n13047005_0013_L010185R_D04.reroot.root n13047005_0014_L010185R_D04.reroot.root n13047005_0015_L010185R_D04.reroot.root n13047005_0016_L010185R_D04.reroot.root Let's look at the files that should have fed the parent list : MINOS26 > cat /minos/data/minfarm/ROUNTMP/READ/SAM/n13047005_0018_L010185R_D04.mrnt.dogwood1.0.root n13047005_0018_L010185R_D04.mrnt.dogwood1.0.root n13047005_0019_L010185R_D04.mrnt.dogwood1.0.root n13047005_0020_L010185R_D04.mrnt.dogwood1.0.root n13047005_0021_L010185R_D04.mrnt.dogwood1.0.root n13047005_0022_L010185R_D04.mrnt.dogwood1.0.root n13047005_0023_L010185R_D04.mrnt.dogwood1.0.root n13047005_0024_L010185R_D04.mrnt.dogwood1.0.root n13047005_0025_L010185R_D04.mrnt.dogwood1.0.root n13047005_0026_L010185R_D04.mrnt.dogwood1.0.root n13047005_0027_L010185R_D04.mrnt.dogwood1.0.root n13047005_0028_L010185R_D04.mrnt.dogwood1.0.root n13047005_0029_L010185R_D04.mrnt.dogwood1.0.root n13047005_0030_L010185R_D04.mrnt.dogwood1.0.root That's consistent. Now the partially orphaned first part of this run MINOS26 > cat /minos/data/minfarm/ROUNTMP/READ/SAM/n13047005_0000_L010185R_D04.mrnt.dogwood1.0.root n13047005_0000_L010185R_D04.mrnt.dogwood1.0.root n13047005_0001_L010185R_D04.mrnt.dogwood1.0.root n13047005_0002_L010185R_D04.mrnt.dogwood1.0.root n13047005_0003_L010185R_D04.mrnt.dogwood1.0.root n13047005_0004_L010185R_D04.mrnt.dogwood1.0.root n13047005_0005_L010185R_D04.mrnt.dogwood1.0.root n13047005_0006_L010185R_D04.mrnt.dogwood1.0.root n13047005_0007_L010185R_D04.mrnt.dogwood1.0.root n13047005_0008_L010185R_D04.mrnt.dogwood1.0.root n13047005_0009_L010185R_D04.mrnt.dogwood1.0.root n13047005_0010_L010185R_D04.mrnt.dogwood1.0.root n13047005_0011_L010185R_D04.mrnt.dogwood1.0.root n13047005_0012_L010185R_D04.mrnt.dogwood1.0.root n13047005_0013_L010185R_D04.mrnt.dogwood1.0.root n13047005_0014_L010185R_D04.mrnt.dogwood1.0.root n13047005_0015_L010185R_D04.mrnt.dogwood1.0.root n13047005_0016_L010185R_D04.mrnt.dogwood1.0.root 
n13047005_0017_L010185R_D04.mrnt.dogwood1.0.root n13047005_0018_L010185R_D04.mrnt.dogwood1.0.root n13047005_0019_L010185R_D04.mrnt.dogwood1.0.root n13047005_0020_L010185R_D04.mrnt.dogwood1.0.root n13047005_0021_L010185R_D04.mrnt.dogwood1.0.root n13047005_0022_L010185R_D04.mrnt.dogwood1.0.root n13047005_0023_L010185R_D04.mrnt.dogwood1.0.root n13047005_0024_L010185R_D04.mrnt.dogwood1.0.root n13047005_0025_L010185R_D04.mrnt.dogwood1.0.root n13047005_0026_L010185R_D04.mrnt.dogwood1.0.root n13047005_0027_L010185R_D04.mrnt.dogwood1.0.root n13047005_0028_L010185R_D04.mrnt.dogwood1.0.root n13047005_0029_L010185R_D04.mrnt.dogwood1.0.root n13047005_0030_L010185R_D04.mrnt.dogwood1.0.root EH ???? That cannot be. MINOS26 > ls -l /minos/data/minfarm/ROUNTMP/READ/SAM/n13047005_*_L010185R_D04.mrnt.dogwood1.0.root -rw-r--r-- 1 minfarm e875 1519 Jul 10 11:49 /minos/data/minfarm/ROUNTMP/READ/SAM/n13047005_0000_L010185R_D04.mrnt.dogwood1.0.root -rw-r--r-- 1 minfarm e875 637 Jul 14 00:44 /minos/data/minfarm/ROUNTMP/READ/SAM/n13047005_0018_L010185R_D04.mrnt.dogwood1.0.root How did the _0000 file get there before the candidates got written ? MINOS26 > less /minos/data/minfarm/ROUNTMP/LOG/2009-07/dogwood1mcnearmrnt.log OK adding n13047005_0000_L010185R_D04.mrnt.dogwood1.0.root 31 NSFIL SSIZ MSIZ DSIZ 31 310363035 307311951 101702 -rw-r--r-- 1 minfarm e875 307311951 Jul 4 12:08 /minos/data/minfarm/WRITE/n13047005_0000_L010185R_D04.mrnt.dogwood1.0.root ... 
PURGED 53/123 Sat Jul 4 15:31:56 CDT 2009 OK adding n13047005_0000_L010185R_D04.mrnt.dogwood1.0.root 31 NSFIL SSIZ MSIZ DSIZ 31 310137493 307102618 101162 -rw-r--r-- 1 minfarm e875 307102618 Jul 10 11:49 /minos/data/minfarm/WRITE/n13047005_0000_L010185R_D04.mrnt.dogwood1.0.root PURGED 103/103 Fri Jul 10 20:03:37 CDT 2009 Tue Jul 14 00:32:02 CDT 2009 HAVE n13047005__L010185R_D04.mrnt.dogwood1.0.root:1: 0000 BADRUNS n13047005_0017_L010185R_D04.mrnt.dogwood1.0.root n13047005_0017_L010185R_D04.0 136 2009-07-13 18:38:01 fnpc200 OOPS - SUBRUN gap 17 to 17 OK adding n13047005_0000_L010185R_D04.mrnt.dogwood1.0.root 17 NSFIL SSIZ MSIZ DSIZ 17 169322511 167702830 101230 -rw-r--r-- 1 minfarm e875 167702830 Jul 14 00:44 /minos/data/minfarm/WRITE/n13047005_0000_L010185R_D04.mrnt.dogwood1.0.root OK adding n13047005_0018_L010185R_D04.mrnt.dogwood1.0.root 13 NSFIL SSIZ MSIZ DSIZ 13 130711720 129478822 102741 -rw-r--r-- 1 minfarm e875 129478822 Jul 14 00:44 /minos/data/minfarm/WRITE/n13047005_0018_L010185R_D04.mrnt.dogwood1.0.root Reviewed email with rmehdi, this data was intentionally processed 3 times, The cleanup included PNFS and SAM, but not the roundup READ/SAM files. 
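A quick way to spot leftover READ/SAM lists like this one, as a sketch. It compares the subrun numbers in a roundup list file against a saved SAM parent list and prints the subruns that appear only in the list file. The function name stale_subruns is hypothetical ; both inputs are plain files with one file name per line, and the subrun is taken to be the second underscore-separated field, as in the names above.

```shell
# Sketch: print subruns present in a roundup list file but absent from
# the SAM parent list. stale_subruns is an illustrative name.
stale_subruns() {
    local listfile=$1 parentfile=$2 tmp
    tmp=$(mktemp -d) || return 1
    # n13047005_0017_L010185R_D04... -> 0017
    cut -d_ -f2 "${listfile}"   | sort -u > "${tmp}/list"
    cut -d_ -f2 "${parentfile}" | sort -u > "${tmp}/parents"
    comm -23 "${tmp}/list" "${tmp}/parents"
    rm -rf "${tmp}"
}
```

The parent list can be saved with the same pipeline used earlier in this entry ( sam get metadata --file=${FILE} | grep parents | tr "'" \\\n | grep root | sort ), redirected to a file.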
Check out the files under /minos/data/minfarm/ROUNTMP/READ ls -l n13047005_0000_L010185R_D04.mrnt.dogwood1.0.root -rw-r--r-- 1 minfarm e875 833 Jul 14 00:44 n13047005_0000_L010185R_D04.mrnt.dogwood1.0.root cat n13047005_0000_L010185R_D04.mrnt.dogwood1.0.root n13047005_0000_L010185R_D04.mrnt.dogwood1.0.root n13047005_0001_L010185R_D04.mrnt.dogwood1.0.root n13047005_0002_L010185R_D04.mrnt.dogwood1.0.root n13047005_0003_L010185R_D04.mrnt.dogwood1.0.root n13047005_0004_L010185R_D04.mrnt.dogwood1.0.root n13047005_0005_L010185R_D04.mrnt.dogwood1.0.root n13047005_0006_L010185R_D04.mrnt.dogwood1.0.root n13047005_0007_L010185R_D04.mrnt.dogwood1.0.root n13047005_0008_L010185R_D04.mrnt.dogwood1.0.root n13047005_0009_L010185R_D04.mrnt.dogwood1.0.root n13047005_0010_L010185R_D04.mrnt.dogwood1.0.root n13047005_0011_L010185R_D04.mrnt.dogwood1.0.root n13047005_0012_L010185R_D04.mrnt.dogwood1.0.root n13047005_0013_L010185R_D04.mrnt.dogwood1.0.root n13047005_0014_L010185R_D04.mrnt.dogwood1.0.root n13047005_0015_L010185R_D04.mrnt.dogwood1.0.root n13047005_0016_L010185R_D04.mrnt.dogwood1.0.root Check out the saddreco log, /home/minfarm/ROUNTMP/LOG/saddreco/daikon_04/dogwood1/near_L010185R.log Declared Jul 4, Jul 10, Clear this out, declare it again minfarm@minos27 . 
~/scripts/setupsam.sh sam undeclare n13047005_0000_L010185R_D04.mrnt.dogwood1.0.root SRLOG=/minos/data/minfarm/ROUNTMP/LOG/saddreco/daikon_04/dogwood1/near_L010185R.log SC=/grid/fermiapp/minos/minfarm/scripts SOCFILE=${HOME}/grid/samdbs_prd export SAM_ORACLE_CONNECT=`cat ${SOCFILE}` ${SC}/saddreco -m daikon_04 -d near -r dogwood1 -p L010185R --verify ${SC}/saddreco -m daikon_04 -d near -r dogwood1 -p L010185R --declare OK - declared n13047005_0000_L010185R_D04.mrnt.dogwood1.0.root /pnfs/minos/mcout_data/dogwood1/near/daikon_04/L010185R/mrnt_data/700(von425.764) STARTED Wed Oct 21 01:10:33 2009 FINISHED Wed Oct 21 01:10:42 2009 parents are now FARM27 > sam get metadata --file=${FILE} | grep parents | tr "'" \\\n | grep root | sort n13047005_0000_L010185R_D04.reroot.root n13047005_0001_L010185R_D04.reroot.root n13047005_0002_L010185R_D04.reroot.root n13047005_0003_L010185R_D04.reroot.root n13047005_0004_L010185R_D04.reroot.root n13047005_0005_L010185R_D04.reroot.root n13047005_0006_L010185R_D04.reroot.root n13047005_0007_L010185R_D04.reroot.root n13047005_0008_L010185R_D04.reroot.root n13047005_0009_L010185R_D04.reroot.root n13047005_0010_L010185R_D04.reroot.root n13047005_0011_L010185R_D04.reroot.root n13047005_0012_L010185R_D04.reroot.root n13047005_0013_L010185R_D04.reroot.root n13047005_0014_L010185R_D04.reroot.root n13047005_0015_L010185R_D04.reroot.root n13047005_0016_L010185R_D04.reroot.root This is resolved ! ============================================================================= 2009 10 19 ============================================================================= Copied more test files to /minos/test9293, see below. ####### # CVS # ####### Removed defective RockMuons files for nwest. 
=============================================================================
2009 10 17
=============================================================================

###########
# BLUEARC #
###########

STRESS TESTS OF test9293

Run the tests from minos-ora4

MINOS-SAM04 > df -h /minos/test9293
Filesystem                Size  Used Avail Use% Mounted on
blue1.fnal.gov:/test9293  1.1T  163G  958G  15% /minos/test9293

cat README
largefiles -- Dir contains a bunch of 8GB files
kreymer -- Where are can do whatever he wants (as long as it is secure ;) )
stash -- copied from bluwatch are on /grid/data/minos

MINOS-SAM04 > du -sm largefiles/*
8192    largefiles/bigfile
4096    largefiles/bigfile2
2048    largefiles/bigfile3
3172    largefiles/bigfile4
10240   largefiles/bigfile5
1024    largefiles/bigfile6
5500    largefiles/bigfile7

MINOS-SAM04 > du -sm stash/*
21001   stash/2
151     stash/3
21001   stash/6
21001   stash/7
21001   stash/8
21001   stash/9
21001   stash/A

GENERAL PLAN -

Test a single stream, from stash to kreymer.
Try to run 10, 20, 40, 80 parallel copies of files.
Read them from minos-sam04, look at monitoring from minos27.
Crosscheck the /minos/scratch reads from minos-sam04

stash files are available, directories 2,6,7,8,9,A
3 is only partially populated.

Monitoring since yesterday shows a broad range 40-90 MB/sec, ave 65
The network connection is 100 MB/sec
A 1000 sec test ( about 17 minutes ), for minimal statistics, needs to move 100 GB.
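The 100 GB figure is just rate times duration. A small worked check, using the 100 MB/sec link rate and the 21001 MB stash directory size quoted above ; the variable names are illustrative.

```shell
# Data volume needed to keep a link busy for a fixed time, and how many
# stash directories that takes. Numbers come from the notes above.
RATE_MB=100        # MB/sec, nominal network rate
SECS=1000          # test duration in seconds
DIR_MB=21001       # size of one full stash directory, from du -sm

NEED_MB=$(( RATE_MB * SECS ))
NDIRS=$(( (NEED_MB + DIR_MB - 1) / DIR_MB ))    # integer division, rounded up

printf '%d MB needed, %d stash directories\n' "${NEED_MB}" "${NDIRS}"
# prints: 100000 MB needed, 5 stash directories
```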
date ; for DIR in 2 6 7 8 9 A ; do echo ${DIR} ; time cp -ax stash/${DIR} kreymer/${DIR} ; done Sat Oct 17 13:15:01 CDT 2009 2 real 7m32.578s user 0m0.190s sys 0m39.703s 6 real 7m59.957s user 0m0.183s sys 0m39.745s 7 real 7m59.322s user 0m0.164s sys 0m39.578s 8 real 7m56.582s user 0m0.187s sys 0m40.000s 9 real 7m58.142s user 0m0.205s sys 0m40.411s A real 7m59.146s user 0m0.174s sys 0m40.603s mkdir /minos/scratch/kreymer/log/thrash mkdir /minos/scratch/kreymer/log/thrash/10 mkdir /minos/scratch/kreymer/log/thrash/20 mkdir /minos/scratch/kreymer/log/thrash/40 mkdir /minos/scratch/kreymer/log/thrash/80 Created MS=~kreymer/minos/scripts ML=/minos/scratch/kreymer/log/thrash ${MS}/paths with process number and path to read, like 0^/minos/test9293/stash/2/0 1^/minos/test9293/stash/6/0 2^/minos/test9293/stash/7/0 3^/minos/test9293/stash/8/0 4^/minos/test9293/stash/9/0 5^/minos/test9293/kreymer/2/0 6^/minos/test9293/kreymer/6/0 7^/minos/test9293/kreymer/7/0 8^/minos/test9293/kreymer/8/0 9^/minos/test9293/kreymer/9/0 ... 
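The thrash script itself is not recorded here. The lookup it needs, from process number to directory in a paths file with lines like 5^/minos/test9293/kreymer/2/0, could look like the sketch below. The function name path_for is hypothetical.

```shell
# Sketch: print the directory assigned to a process number in a paths
# file whose lines have the form  <number>^<directory>.
# path_for is an illustrative name, not part of the real thrash script.
path_for() {
    local num=$1 paths=$2 n dir
    while IFS='^' read -r n dir ; do
        if [ "${n}" = "${num}" ] ; then
            printf '%s\n' "${dir}"
            return 0
        fi
    done < "${paths}"
    return 1    # process number not listed
}
```

A reader process would then do something like DATA=$(path_for 79 ${MS}/paths) before timing its cat of ${DATA}/*.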
Test the thrash script with MINOS-SAM04 > ${MS}/thrash 79 Sat Oct 17 17:21:49 CDT 2009 79 /minos/test9293/kreymer/9/9 real 0m21.671s user 0m0.012s sys 0m0.982s Sat Oct 17 17:22:10 CDT 2009 ITER=0 ; NITER=10 ITERS=` while [ ${ITER} -lt ${NITER} ] ; do (( ITER++ )) ; printf "${ITER} " ; done ` date for ITER in ${ITERS} ; do ${MS}/thrash ${ITER} > ${ML}/${NITER}/${ITER} 2>&1 & done Sat Oct 17 17:26:17 CDT 2009 real 3m7.235s real 3m7.635s real 3m7.821s real 3m8.123s real 3m8.251s real 3m8.310s real 3m8.320s real 3m8.392s real 3m8.531s real 3m8.628s user 0m0.025s sys 0m1.011s Ganglia showed 120 MB/sec, load to 8, CPU 60% (wait) _________________________________ ITER=0 ; NITER=20 Sat Oct 17 17:31:57 CDT 2009 Ganglia 120 MB/sec, load to 18, CPU to 80% wait MINOS-SAM04 > cat $ML/${NITER}/3 Sat Oct 17 17:31:57 CDT 2009 3 /minos/test9293/stash/8/0 real 6m16.979s user 0m0.011s sys 0m1.378s Sat Oct 17 17:38:14 CDT 2009 _________________________________ ITER=0 ; NITER=20 Sat Oct 17 17:40:17 CDT 2009 Ganglia 120 MB/sec, load to 18, CPU to 80% wait Sat Oct 17 17:40:17 CDT 2009 3 /minos/test9293/stash/8/0 real 6m17.179s user 0m0.007s sys 0m1.083s Sat Oct 17 17:46:35 CDT 2009 _________________________________ ITER=0 ; NITER=40 Sat Oct 17 17:47:23 CDT 2009 Ganglia 120 MB/sec, load to 38, CPU to 100% wait, typ. 75 MINOS-SAM04 > cat $ML/${NITER}/3 Sat Oct 17 17:47:23 CDT 2009 3 /minos/test9293/stash/8/0 real 12m29.572s user 0m0.035s sys 0m0.974s Sat Oct 17 17:59:52 CDT 2009 _________________________________ ITER=0 ; NITER=80 Sat Oct 17 18:01:46 CDT 2009 Ganglia 120 MB/sec, load to 75, CPU to 100% wait, typ. 
80 MINOS-SAM04 > cat $ML/${NITER}/3 Sat Oct 17 18:01:46 CDT 2009 3 /minos/test9293/stash/8/0 real 24m48.303s user 0m0.018s sys 0m1.740s Sat Oct 17 18:26:34 CDT 2009 _________________________________ 2009 10 19 copy more files to /minos/test9293/kreymer Need 4 x present data files, to do 320 fold copy, Forget the paths file, calculate directories, use all 10 per top level For a 400x copy, need 40 directories DIRS=' 1 3 4 5 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 ' time cp -vax /minos/test9293/kreymer/2 /var/tmp/stash real 5m18.704s time cp -ax /var/tmp/stash /minos/test9293/kreymer/0 real 6m14.721s date for DIR in ${DIRS} ; do echo ${DIR} time cp -ax /var/tmp/stash /minos/test9293/kreymer/${DIR} df -h /minos/test9293 | grep /minos echo done date Mon Oct 19 18:02:04 CDT 2009 1 real 7m18.530s user 0m0.108s sys 0m41.010s 1.1T 331G 790G 30% /minos/test9293 3 ... 39 real 7m18.137s user 0m0.128s sys 0m42.301s 1.1T 1022G 99G 92% /minos/test9293 Mon Oct 19 22:11:28 CDT 2009 _________________________________ ============================================================================= 2009 10 16 ============================================================================= ############ # MCIMPORT # ############ Restarted regular imports. Reminded self need to set -c flag when looping. 
This should be automatic ,,, work on this set nohup ; ./mcimport -c -l 9999 ALL & ########### # BLUEARC # ########### Started monitoring and plotting of /minos/test9293 mkdir /grid/data/monitor/test9293 ${HOME}/minos/scripts/bluwatch -t -r -b /minos/test9293/stash/A TEST READ DEBUGGING, SLEEP=6 OFFSET 0 OFFSET 1 OFFSET 2 OFFSET 3 OFFSET 4 OFFSET 5 OFFSET 6 OFFSET 7 OFFSET 8 OFFSET 9 OFFSET 1731300 DIR /minos/test9293/stash/A/0 Fri Oct 16 15:04:19 CDT 2009 0/file1801 107 1255723459460253000 1255723459365492000 94761 93029 1731300 Fri Oct 16 15:04:25 CDT 2009 0/file1802 107 1255723465569477000 1255723465474376000 95101 93369 1731300 Fri Oct 16 15:04:31 CDT 2009 0/file1803 107 1255723471677217000 1255723471582295000 94922 93190 1731300 Fri Oct 16 15:04:37 CDT 2009 0/file1804 104 1255723477787964000 1255723477690604000 97360 95628 1731300 Fri Oct 16 15:04:43 CDT 2009 0/file1805 105 1255723483897829000 1255723483801752000 96077 94345 1731300 Fri Oct 16 15:04:50 CDT 2009 0/file1806 105 1255723490006837000 1255723489910716000 96121 94389 1731300 Fri Oct 16 15:04:56 CDT 2009 0/file1807 106 1255723496115776000 1255723496020472000 95304 93572 1731300 Date rates are over 100 MB/sec m from A Possibly cached in Bluearc. Monitor stash/2 set nohup ; ${HOME}/minos/scripts/bluwatch -r \ -b /minos/test9293/stash/2 -l /grid/data/monitor/test9293 & Make plots somewhere ${HOME}/minos/scripts/ADMIN/bluearc/bratenow.new -t -n minos27 -d /grid/data/monitor/test9293 -s 120 ${HOME}/minos/scripts/ADMIN/bluearc/bratenow -t -n minos27 -d /grid/data/monitor/test9293 mkdir -p /afs/fnal.gov/files/expwww/numi/html/computing/dh/bluearc/test9293/minos27 set nohup ${HOME}/minos/scripts/bratenow -n minos27 -d /grid/data/monitor/test9293 \ -o /afs/fnal.gov/files/expwww/numi/html/computing/dh/bluearc/test9293 & ######### # ADMIN # ######### Very slow KDC operations, removing krb-fnal-1 helped. Specious, problems persisted. 
Probable DNS problems, changed desktop /etc/resolv.conf

Was
; generated by /sbin/dhclient-script
search fnal.gov dhcp.fnal.gov
nameserver 131.225.17.150
nameserver 131.225.8.120

Now
; generated by /sbin/dhclient-script
search fnal.gov dhcp.fnal.gov
nameserver 131.225.8.120
nameserver 131.225.17.150

But host commands work instantly !

host www.fnal.gov 131.225.17.150
Using domain server:
Name: 131.225.17.150
Address: 131.225.17.150#53

host www.fnal.gov 131.225.8.120
Using domain server:
Name: 131.225.8.120
_______________________________________________________________________
Date: Fri, 16 Oct 2009 11:24:49 -0500 (CDT)

Request INC000000013671 requested by you has been submitted.
Status: New
Summary: DNS/Kerberos extremely slow
Notes:
Please route this to Networking and/or Kerberos support :

Logins to the Minos Cluster, my desktop, and other nodes
are taking minutes.

It appears that the krb-fnal-1 KDC server is failing over to krb-fnal-2,
but very slowly.

For a short time, putting krb-fnal-2 first in /etc/krb5.conf helped.
But this is no longer effective.

It takes several minutes to unlock my desktop system, ark.fnal.gov.

There may be some global KDC problem, or perhaps a DNS problem.

I get rapid kerberos authentication using DNS server 131.225.8.120
The same operations take nearly a minute using 131.225.17.150.

The slow logins to the Minos Cluster, and moderately slow kerberized
screen unlocks have been with us for at least a week.
The problem is much worse today.
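To put numbers on the difference between the two nameservers, a timing loop like the sketch below could be run by hand or from cron. It assumes GNU date ( for %N nanoseconds ) and the standard host utility ; the function name time_lookup and the LOOKUP_CMD override are made up for illustration.

```shell
# Sketch: time one lookup of NAME against each nameserver given.
# time_lookup and LOOKUP_CMD are illustrative ; needs GNU date for %N.
time_lookup() {
    local name=$1 server start end
    shift
    for server in "$@" ; do
        start=$(date +%s%N)
        "${LOOKUP_CMD:-host}" "${name}" "${server}" > /dev/null 2>&1
        end=$(date +%s%N)
        printf '%s %d ms\n' "${server}" $(( (end - start) / 1000000 ))
    done
}
```

For example : time_lookup www.fnal.gov 131.225.8.120 131.225.17.150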
_______________________________________________________________________ Date: Fri, 16 Oct 2009 11:34:55 -0500 (CDT) Status: In Progress _______________________________________________________________________ _______________________________________________________________________ ============================================================================= 2009 10 15 ============================================================================= ########### # BLUEARC # ########### time cp -vax /grid/data/minos/bluwatch/stash/2 /minos/test/bluwatch real 8m9.145s Started monitoring. Log to alternate area, to not clash with mkdir /grid/data/monitor/test/rate set nohup ; ${HOME}/minos/scripts//bluwatch -r \ -b /minos/test/bluwatch -l /grid/data/monitor/test & Thu Oct 15 18:54:44 CDT 2009 0/file1801 831 Thu Oct 15 18:55:44 CDT 2009 0/file1802 900 Oops, 0 was the last directory copies MINOS27 > ls -ltcr /minos/test/bluwatch total 320 drwxr-xr-x 2 kreymer g020 14336 Oct 15 17:19 8 drwxr-xr-x 2 kreymer g020 14336 Oct 15 17:20 9 drwxr-xr-x 2 kreymer g020 14336 Oct 15 17:21 1 drwxr-xr-x 2 kreymer g020 14336 Oct 15 17:22 2 drwxr-xr-x 2 kreymer g020 14336 Oct 15 17:22 3 drwxr-xr-x 2 kreymer g020 14336 Oct 15 17:23 4 drwxr-xr-x 2 kreymer g020 14336 Oct 15 17:24 5 drwxr-xr-x 2 kreymer g020 14336 Oct 15 17:25 6 drwxr-xr-x 2 kreymer g020 14336 Oct 15 17:26 7 drwxr-xr-x 2 kreymer g020 14336 Oct 15 17:27 0 Start with directory 1 set nohup ; ${HOME}/minos/scripts//bluwatch -r -d 1 \ -b /minos/test/bluwatch -l /grid/data/monitor/test & tail -f /grid/data/monitor/test/rate/2009/10/15/minos27.txt Make plots somewhere ${HOME}/minos/scripts/bratenow -t -n minos27 -d /grid/data/monitor/test mkdir -p /afs/fnal.gov/files/expwww/numi/html/computing/dh/bluearc/test/minos27 set nohup ${HOME}/minos/scripts/bratenow -n minos27 -d /grid/data/monitor/test \ -o /afs/fnal.gov/files/expwww/numi/html/computing/dh/bluearc/test & Compare to current minos-sam04 
OUTP=/afs/fnal.gov/files/expwww/numi/html/computing/dh/bluearc/rates/minos-sam04
./bratewk minos-sam04 20091012 ${OUTP}

Oops, minos-sam04 monitoring did not restart for some reason.
Started it up around 19:12
The earlier period was covered by flxi09

###########
# BLUEARC #
###########

Date: Thu, 15 Oct 2009 17:25:29 -0500 (CDT)

Request INC000000013605 requested by you has been submitted.
Status: New
Summary: Linux Minos
Notes:
FEF primary - run2-sys@fnal.gov

Please mount another Bluearc test area
blue1:/test9293
on both minos27 and minos-sam03, as
/minos/test9293

The suggested NFS mount options are:
-o rsize=32768,wsize=32768,timeo=600,proto=tcp,vers=3,hard,intr
_______________________________________________________
Corrected this to minos-sam04 in Work Info and email to run2-sys
_______________________________________________________
Date: Fri, 16 Oct 2009 06:17:53 -0500
From: Andrew J. Romero

added minos-sam04
_______________________________________________________
Date: Fri, 16 Oct 2009 10:16:26 -0500 (CDT)
Status: Completed

Added the /minos/test9293 mount to both minos27 and minos-sam04.
_______________________________________________________

###########
# BLUEARC #
###########

Asked for mount on minos27, for testing

Date: Thu, 15 Oct 2009 16:58:24 -0500 (CDT)
From: Fermilab Service Desk

Request INC000000013602 requested by you has been submitted.
Status: New
Summary: Linux Minos
Notes:
FEF primary - run2-sys@fnal.gov

Please mount this on minos27, so that we can performance test
the new HDS disks which will serve /minos/data.
It could be mounted as /minos/test .

Date: Thu, 15 Oct 2009 15:41:39 -0500
From: Andrew J. Romero
To: Andrew J. Romero, 'Arthur Kreymer'
Cc: Ramon C.
Pasetes Subject: Test Volume on new Minos disks is ready Hi Art I created a test volume on the new array blue2.fnal.gov:/art-test The suggested NFS mount options are: -o rsize=32768,wsize=32768,timeo=600,proto=tcp,vers=3,hard,intr Host access list: minos27.fnal.gov(rw,no_root_squash) Let me know what numbers you are getting Thanks Andy > -----Original Message----- > From: Andrew J. Romero > Sent: Thursday, October 15, 2009 3:26 PM > To: 'Arthur Kreymer' > Cc: Ramon C. Pasetes > Subject: Test Volume on new Minos disks > > Art, > > I want to create a test volume on the new disks > and have you run your performance monitor. > > Give me a list of hosts which should have access > > Andy ___________________________________________________________________________________________________________ ########### # BLUEARC # ########### We need a debriefing on today's bluearc tests latencies raid controller load reports data rate ############ # BLUWATCH # ############ Stopped flxi06 and flxi09 ongoing monitoring, around 17:00 ############ # BLUWATCH # ############ investigate failing output on flxi09 /afs/fnal.gov/files/home/room1/kreymer/minos/scripts//bluwatch.20090922: line 284: /afs/fnal.gov/files/data/minos/log_data/bluwatch/lastbad/flxi09.txt: No such file or directory why was kcron/aklog not invoked ? Also clean up script/bluwatch.* 20090820 is -r 1.13 20090831 is -r 1.14 allows -t without AFS 20090922 is -r 1.14 plus ALTLOG support, slightly broken -r 1.15 corrects ALTLOG and -t mode printout Removed the above ln -sf ADMIN/bluearc/bluwatch bluwatch ####### # SAM # ####### Date: Thu, 15 Oct 2009 18:28:34 +0000 (GMT) From: Arthur Kreymer To: minosdb-support@fnal.gov Subject: Minos SAM development refresh from production At your next convenience, please refresh the Minos development SAM database from production. There is no urgent need, but these have drifted out of step, and it would be good to do this exercise. 
##########
# CONDOR #
##########

Cut back priority of new users

condor_userprio -setfactor mho@fnal.gov 100
condor_userprio -setfactor evansj@fnal.gov 100

###########
# RESTART #
###########

Restart issues :

minos-mysql2
    /etc/rc.d/init.d/mysql is missing

minos25
    /etc/rc.d/init.d/condor is present and works.
    But it is configured off with /sbin/chkconfig.

minos-sam01/2/3 - none of these started at reboot.

###########
# RESTART #
###########

SAM - verify station operation

[sam minos-sam01 ~] $ . ./setups.sh
[sam minos-sam01 ~] $ ups start sam_bootstrap
Production station passes tests

minos-sam02 - restarted as above
Development station passes tests

minos-sam02 - restarted web server as above

MYSQL2 - verify monitor not running, after reboot
    ups start mysql -O crl
    ups start mysql

MYSQL1 - verify monitor running OK

MINOS26 - crontab restart
MINOS26 > crontab crontab.dat
Thu Oct 15 11:09:53 CDT 2009
MINOS26 > set nohup ; ${HOME}/minos/scripts/monitor.minos26 &

MINOS25 - restart gfactory/gfrontend after GPFarm outage.
MINOS25 > condor_status
CEDAR:6001:Failed to connect to <131.225.193.25:9618>
MINOS25 > sudo /etc/rc.d/init.d/condor start
[gfactory@minos25 ~]$ ./factory_startup start
[gfrontend@minos25 ~]$ ~/start_frontend.sh

MINOS27 - mcimport restart

############
# SHUTDOWN #
############

09:00 to 09:30 for Kernel upgrades and Bluearc tests

ARK > ./brate flxi09 20091015
FLXI06 > ./brate flxi06 20091015 '' '' /grid/data/monitor

The entire Minos Cluster and Server system is shut down, as of 09:00.
Strange, the flxi09 monitor of /minos/scratch last wrote to a log at Thu Oct 15 04:10:29 CDT 2009 3/file0591 15 set nohup ; ${HOME}/minos/scripts//bluwatch.20090922 -r \ -b /minos/scratch/bluwatch/flxi09 & FLXI09 > condor_q -run | grep brebel | wc -l 10 FLXI09 > condor_q -run -- Quill: quilld@flxi09.fnal.gov : <131.225.68.37:5432> : quill2 : 2009-10-15 09:19:01-05 ID OWNER SUBMITTED RUN_TIME HOST(S) 9962.0 brebel 10/14 13:58 0+19:20:52 slot4@flxi10.fnal.gov 9970.0 brebel 10/14 13:58 0+19:20:29 slot3@flxi09.fnal.gov 10009.0 brebel 10/14 14:55 0+18:15:13 slot1@flxi10.fnal.gov 10011.0 brebel 10/14 14:55 0+18:23:38 slot1@flxb34.fnal.gov 10012.0 brebel 10/14 14:55 0+18:23:17 slot2@flxb34.fnal.gov 10013.0 brebel 10/14 14:55 0+18:23:17 slot1@flxb33.fnal.gov 10014.0 brebel 10/14 14:55 0+18:23:17 slot2@flxb33.fnal.gov 10016.0 brebel 10/14 14:55 0+02:02:52 slot1@flxi09.fnal.gov 10017.0 brebel 10/14 14:55 0+02:02:52 slot2@flxi09.fnal.gov 10018.0 brebel 10/14 14:55 0+02:02:52 slot4@flxi09.fnal.gov condor_exec.exe /nas-pool/e929/users/brebel/gen_nova_mc genie nd 3 100 development /bin/tcsh /nas-pool/e929/users/brebel/gen_nova_mc genie nd 3 100 development ana -x genie_nd_3.xml -n 100 -i /dev/null/3/0/0 -o mc_genie_nd_100_3.root At 09:25, condorview claims that there are still about 22 jobs running, 17 group_e875.minosgli 5 minosgli All worker nodes were rebooted around 09:30, ============================================================================= 2009 10 14 ============================================================================= ######### # ADMIN # ######### Date: Wed, 14 Oct 2009 17:56:18 -0500 (CDT) From: Fermilab Service Desk Request INC000000013447 requested by you has been submitted. Status: New Summary: Linux Minos Notes: There are many disk errors logged on minos11, in /var/log/messages. The errors have been present since at least September 13. End users have observed problems with AFS access, probably related. 
You can expect to see delays due to filesystem checks during the Oct 15 kernel upgrade and reboot. _______________________________________________________________________ Date: Thu, 15 Oct 2009 14:11:57 -0500 (CDT) Status: Completed Is there anything you need done with this system? It came up okay after the reboot this morning. The disk errors are not super serious, it's complaining about a few bad sectors on sdb. The machine is out of warranty so there's not much we can do there. _______________________________________________________________________ Date: Thu, 15 Oct 2009 19:34:23 +0000 (GMT) From: Arthur Kreymer The issue at hand is whether we o leave minos11 running, in which a global forced fsck is needed to clear these errors o schedule its retirement, giving users one last chance to get files off I suggest a target of 1 week, Oct 22. I think this is in the hands of FEF, let us know which you prefer ( I can guess, but I'm asking anyway . ) _______________________________________________________________________ ############ # BLUWATCH # ############ mindata@minos27 time cp -vax /grid/data/minos/bluwatch/stash/2 /minos/scratch/bluwatch/flxi09 Oops, interrupted this, the files were already there ! Dated Aug 20. This system has a working AFS with kcron. Use the standard output path kreymer@flxi09 set nohup ; ${HOME}/minos/scripts//bluwatch.20090922 -r \ -b /minos/scratch/bluwatch/flxi09 -d 1 & /afs/fnal.gov/files/home/room1/kreymer/minos/scripts//bluwatch.20090922: line 67: [: argument expected < Need to quote the ${ALTLOG} string > ########### # MINOS11 # ########### Many disk error messages in the minos11 /var/log/messages files. Sep 13 04:02:02 minos11 syslogd 1.4.1: restart. Sep 13 04:15:29 minos11 smartd[2682]: Device: /dev/hda, 9 Currently unreadable (pending) sectors Sep 13 04:15:29 minos11 smartd[2682]: Device: /dev/hda, 9 Offline uncorrectable sectors Sep 13 04:15:30 minos11 smartd[2682]: Device: /dev/hdb, 6 Currently unreadable (pending) sectors ... 
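To condense pages of these smartd messages into one line per device and error type, something like the sketch below would do. It assumes the syslog line format shown above ( "smartd[pid]: Device: /dev/xxx, N message" ) ; the function name smartd_summary is hypothetical.

```shell
# Sketch: last reported sector count per device and message type, from a
# syslog file in the format shown above. smartd_summary is an illustrative name.
smartd_summary() {
    grep 'smartd\[' "$1" | awk '
        $6 == "Device:" {
            dev = $7 ; sub(/,$/, "", dev)          # strip trailing comma
            msg = $9
            for (i = 10; i <= NF; i++) msg = msg " " $i
            last[dev " | " msg] = $8               # later lines overwrite earlier ones
        }
        END { for (k in last) print k " | " last[k] }' | sort
}
```

For example : smartd_summary /var/log/messages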
############
# SHUTDOWN #
############

Date: Wed, 14 Oct 2009 22:38:33 +0000 (GMT)
From: Arthur Kreymer
To: minos-users@fnal.gov
Cc: minos_software_discussion@fnal.gov, minos_batch@fnal.gov,
    minos-admin@fnal.gov
Subject: Minos Server shutdown at Fermilab Thu Oct 15

All Minos Offline systems will be shut down Thursday Oct 15
for kernel security upgrades.

This includes the Minos Cluster ( minos01 through minos27 )
and Minos servers like minos-mysql2 and the SAM servers.

The systems will be shut down from at least 09:00 to 09:30
so that we can perform Bluearc ( /minos/data ) load tests
in the absence of Minos processes.

We will also drain all Fermigrid jobs before the shutdown.
Most Fermigrid resources will be down for a software upgrade at that time.

I recommend being off the systems before 08:00.

Systems will probably be back by 10:00 CDT.

############
# SHUTDOWN #
############

kreymer@minos26

echo 'crontab -r' | at 06:00
job 33 at 2009-10-15 06:00

mindata@minos27

echo 'touch /minos/data/mcimport/STOP' | at 04:00
job 4 at 2009-10-15 04:00

echo 'rm /minos/data/mcimport/STOP' | at 10:00
job 5 at 2009-10-15 10:00

##########
# CONDOR #
##########

Killed gfrontend and gfactory processes,
as GPFARM has already been drained before tomorrow's shutdown.

##########
# NUCOMP #
##########

14:30 WH-1E

##########
# DCACHE #
##########

Service cleanup

Date: Wed, 14 Oct 2009 10:54:12 -0500 (CDT)

Request INC000000006889: Status has been updated.
Status: Completed
Summary: Minos raw data file lost in DCache

Many Incidents were found in the upgrade of Dcache starting on July 27th
on till about August 12th
These were individually addressed and files should be resent

##########
# DCACHE #
##########

Service cleanup

Date: Wed, 14 Oct 2009 10:01:43 -0500 (CDT)

Request INC000000011850: Status has been updated.
Summary: FNDCA pool 26a-1 Pool Listing empty
Status: Completed

The public ssh key of dcache admin node was not in the known_hosts
of the enstore account on the pool
in ~enstore/.ssh/known_hosts on stkendca26a

############
# BLUWATCH #
############

Checking flxi06 data :

./brate flxi06 20091014 '' '' /grid/data/monitor

#########
# FNALU #
#########

Date: Thu, 08 Oct 2009 15:07:30 -0500 (CDT)
From: Margaret_Greaney
To: brebel@fnal.gov
Cc: kreymer@fnal.gov
Subject: please note fnalu reboots next week

Brian, please note that fnalu nodes will be rebooted on 10/14/09 8-9am.
The condor cluster will be down during that time.
On 10/15/09 flxb31 will be down all day.

=============================================================================
2009 10 13
=============================================================================

############
# MCIMPORT #
############

$ cp -a AFSS/mcimport.20091006 .
$ ln -sf mcimport.20091006 mcimport

$ ./mcimport -l 9999 OVERLAY &
[2] 11851

$ rm /minos/data/mcimport/OVERLAY/STOP

Tue Oct 13 17:17:05 CDT 2009 looping 1 / 9999
OK - version mcimport.20091006 on minos27
processing from /minos/data/mcimport/OVERLAY
...

Strange, this went into a 300 minute loop in spite of

Tue Oct 13 17:29:24 CDT 2009 PURGED/WROTE 0/31

Added printout of LPURGE/LWROTE in MAIN and calling loop code.

Moved the mcin/L010000 files to the top.

URK - export of LPURGED and LWRITE to the body of the script
is not happening, so I always get a 5 hour loop.
Will have to debug this later, do not have time for this
sort of script hacking anytime soon.
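A guess at the cause, to check when there is time ; the mcimport source is not reproduced here, so this is only an illustration of a common bash trap. Counters assigned inside a pipeline ( or any other child process ) are set in a subshell and are gone when it exits. The function names below are made up.

```shell
# Sketch of the subshell trap: in bash, the while loop on the right side
# of a pipe runs in a subshell, so its assignments do not reach the caller.
count_piped() {
    local n=0 line
    printf 'a\nb\nc\n' | while read -r line ; do
        n=$(( n + 1 ))
    done
    echo "${n}"        # still 0 in bash
}

# Feeding the loop by redirection keeps it in the current shell.
count_redirected() {
    local n=0 line
    while read -r line ; do
        n=$(( n + 1 ))
    done <<EOF
a
b
c
EOF
    echo "${n}"        # 3
}
```

If mcimport sets its purge and write counters in a function called through a pipe, a command substitution, or with &, the same loss would occur ; writing the counts to a small file and reading them back is one workaround.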
FILES=`ls /minos/data/mcimport/OVERLAY/mcin/L010000`

for FILE in ${FILES} ; do
    echo ${FILE}
    mv /minos/data/mcimport/OVERLAY/mcin/L010000/${FILE} \
       /minos/data/mcimport/OVERLAY/mcin/${FILE}
done

$ touch /minos/data/mcimport/OVERLAY/STOP
$ rm /minos/data/mcimport/OVERLAY/STOP

set nohup ; ./mcimport -l 9999 OVERLAY &

_______________________________________________________________________
Date: Wed, 14 Oct 2009 13:41:17 +0000 (GMT)
From: Arthur Kreymer
To: Minos Sim
Cc: adams@physics.umn.edu, minos_batch@fnal.gov
Subject: Re: Reprocessing of D07 r1 L010000N

The new D07 r1 L010000N files have been mcimported to mcin_data, as of around midnight last night.

Enjoy !

############
# BLUWATCH #
############

Try logging from flxi09

No good, /minos/data is there, but not /minos/data2 or /grid/data

No /minos/data2 or /grid/data on any FNALU nodes, except flxi06, which is at SLF 5, no kcron.

OK, we'll log from flxi06 to /grid/data/monitor as on d0mino0*

mindata

mv /grid/data/minos/bluwatch/stash/B /grid/data/minos/bluwatch/flxi06

time cp -vax /grid/data/minos/bluwatch/stash/2 /minos/scratch/bluwatch/flxi09
`/grid/data/minos/bluwatch/stash/2' -> `/minos/scratch/bluwatch/flxi09'
`/grid/data/minos/bluwatch/stash/2/8' -> `/minos/scratch/bluwatch/flxi09/8'
`/grid/data/minos/bluwatch/stash/2/8/file1402' -> `/minos/scratch/bluwatch/flxi09/8/file1402'
`/grid/data/minos/bluwatch/stash/2/8/file1403' -> `/minos/scratch/bluwatch/flxi09/8/file1403'
...
kreymer@flxi06

${HOME}/minos/scripts//bluwatch.20090922 -r \
    -b /grid/data/minos/bluwatch/flxi06 \
    -l /grid/data/monitor -d 2 &

crontab -l
MAILTO='kreymer@fnal.gov'
@reboot ${HOME}/minos/scripts/bluwatch.20090922 -r -b /grid/data/bluwatch/flxi06 -l /grid/data/monitor

kreymer@flxi09

#########
# ADMIN #
#########

To      : minos-users@fnal.gov
Cc      : minos_software_discussion@fnal.gov, minos_batch@fnal.gov, minos-admin@fnal.gov
Attchmnt:
Subject : Minos Server shutdown at Fermilab Thu Oct 15
----- Message Text -----

All Minos server systems will be shut down Thursday Oct 15 for kernel security upgrades.

This includes the Minos Cluster ( minos01 through minos27 ) and Minos servers like minos-mysql2 and the SAM servers.

The systems will be shut down from at least 09:00 to 09:30 so that we can perform Bluearc ( /minos/data ) load tests in the absence of Minos processes.

We will also drain all Fermigrid jobs before the shutdown. Most Fermigrid resources will be down for a software upgrade at that time.

I recommend being off the systems before 08:00. Systems will probably be back by 10:00 CDT.

#########
# ADMIN #
#########

Created admin/crontab directory in CVS, for crontab files

minos27.kreymer     - added this, for bluwatch, bratenow day and week
minos-sam04.kreymer - added this for bluwatch
minos-sam01.kreymer - added this for bluwatch

#########
# ADMIN #
#########

Service cleanup

Date: Tue, 13 Oct 2009 12:10:32 -0500 (CDT)
Subject: Incident INC000000009425 reported by you has been resolved.

/minos and /grid on flxi09

##########
# DCACHE #
##########

Service cleanup - Request INC000000009464 requested by you has been submitted.
Status: Completed
Summary: FNDCA Recent Ftp Transfers truncated

############
# BRATENOW #
############

renamed from the testing name of brateday per integration of week/day

move into production use

Killed old bratewk_afs on minos26 ( moot, bratewk_ark no longer running )
Killed brateday on minos26, will run this on minos27.
set nohup ; ${HOME}/minos/scripts/bratenow & set nohup ; ${HOME}/minos/scripts/bratenow -w & ######### # BATCH # ######### Removing defective D07 L010000N files from mcin and mcout PNFS, bluearc, SAM # # MCOUT # # For example, /pnfs/minos/mcout_data/dogwood1/near/daikon_07/L010000N_r1i259/sntp_data/501: n13035016_0000_L010000N_D07_r1i259.sntp.dogwood1.0.root SAMDIM=' MC.RELEASE daikon_07 and VERSION dogwood1 and MC.BEAM L010000N_r1i% ' sam list files --dim="${SAMDIM}" --summaryonly File Count: 833 Average File Size: 335.87MB Total File Size: 273.23GB Total Event Count: 1756800 SAMDIM=' MC.RELEASE daikon_07 and VERSION dogwood1 and MC.BEAM L010000N_r1i% and DATA_TIER cand-near ' sam list files --dim="${SAMDIM}" --summaryonly File Count: 771 mrnt File Count: 34 sntp File Count: 28 find /pnfs/minos/mcout_data/dogwood1/near/daikon_07/L010000N_r1i* -type f | wc -l 839 find /pnfs/minos/mcout_data/dogwood1/near/daikon_07/L010000N_r1i* -type f -name \*cand\* | wc -l 771 find /pnfs/minos/mcout_data/dogwood1/near/daikon_07/L010000N_r1i* -type f -name \*sntp\* | wc -l 34 find /pnfs/minos/mcout_data/dogwood1/near/daikon_07/L010000N_r1i* -type f -name \*mrnt\* | wc -l 34 So we have 6 sntp files not declared to SAM SFILES=`find /pnfs/minos/mcout_data/dogwood1/near/daikon_07/L010000N_r1i* \ -type f -name \*sntp\* -exec basename {} \;` for FILE in ${SFILES} ; do sam locate ${FILE} ; done | grep -v '/pnfs/minos' Datafile with name 'n13035003_0000_L010000N_D07_r1i209.sntp.dogwood1.0.root' not found. Datafile with name 'n13035003_0022_L010000N_D07_r1i209.sntp.dogwood1.0.root' not found. Datafile with name 'n13035004_0000_L010000N_D07_r1i209.sntp.dogwood1.0.root' not found. Datafile with name 'n13035005_0000_L010000N_D07_r1i209.sntp.dogwood1.0.root' not found. Datafile with name 'n13035003_0019_L010000N_D07_r1i209.sntp.dogwood1.0.root' not found. Datafile with name 'n13035006_0000_L010000N_D07_r1i209.sntp.dogwood1.0.root' not found. Prepare to enmv these files minospro@minos26 . 
./setups.sh setup sam setup encp DFILES=`find /pnfs/minos/mcout_data/dogwood1/near/daikon_07/L010000N_r1i* -type f` date NMV=0 for FILE in ${DFILES} ; do (( NMV++ )) DIR=`echo ${FILE} | cut -f 9-10 -d /` OUP=/pnfs/minos/BAD/D07HOFF/${DIR} mkdir -p ${OUP} enmv ${FILE} ${OUP}/ printf "\r${NMV} ${FILE}" sleep 1 done date Tue Oct 13 10:37:58 CDT 2009 1 /pnfs/minos/mcout_data/dogwood1/near/daikon_07/L010000N_r1i209/cand_data/500/n13035005_0028_L010000N_D07_r1i209.cand.dogwood1.0.root ... 839 /pnfs/minos/mcout_data/dogwood1/near/daikon_07/L010000N_r1i259/sntp_data/502/n13035021_0000_L010000N_D07_r1i259.sntp.dogwood1.0.root PRO> date Tue Oct 13 12:00:41 CDT 2009 minfarm@minos27 MCDFB='/minos/data/mcout_data/daikon_07/L010000N_r1i*' find ${MCDFB} -type f | wc -l 68 date ; find ${MCDFB} -type f -exec rm {} \; date Tue Oct 13 11:29:43 CDT 2009 Tue Oct 13 11:32:32 CDT 2009 kreymer@minos27 SAMDIM=' MC.RELEASE daikon_07 and VERSION dogwood1 and MC.BEAM L010000N_r1i% ' date ; ./samundeclare "${SAMDIM}" ; date Tue Oct 13 11:39:09 CDT 2009 Found 833 files undeclared n13035003_0013_L010000N_D07_r1i225.cand.dogwood1.0.root undeclared n13035006_0001_L010000N_D07_r1i209.cand.dogwood1.0.root Tue Oct 13 11:40:03 CDT 2009 MCIN SAMDIM=' DATA_TIER mc-near and MC.RELEASE daikon_07 and MC.BEAM L010000N_r1i% ' sam list files --dim="${SAMDIM}" --summaryonly File Count: 771 find /pnfs/minos/mcin_data/near/daikon_07/L010000N_r1* -type f | wc -l 771 minospro@minos26 chmod 775 /pnfs/minos/BAD/D07HOFF kreymer@minos26 MCID=/pnfs/minos/mcin_data/near/daikon_07/L010000N_r1* find ${MCID} -type f | wc -l 771 DFILES=`find ${MCID} -type f` date NMV=0 for FILE in ${DFILES} ; do (( NMV++ )) DIR=`echo ${FILE} | cut -f 7-8 -d /` OUP=/pnfs/minos/BAD/D07HOFF/MCIN/${DIR} mkdir -p ${OUP} enmv ${FILE} ${OUP}/ printf "\r${NMV} ${FILE}" sleep 1 done date Tue Oct 13 12:22:49 CDT 2009 ... 
771 /pnfs/minos/mcin_data/near/daikon_07/L010000N_r1i259/502/n13035020_0031_L010000N_D07_r1i259.reroot.rootMINOS26 > date Tue Oct 13 13:49:43 CDT 2009 date ; ./samundeclare "${SAMDIM}" ; date Tue Oct 13 13:55:00 CDT 2009 Found 771 files undeclared n13035005_0014_L010000N_D07_r1i225.reroot.root undeclared n13035003_0002_L010000N_D07_r1i225.reroot.root ... undeclared n13035003_0006_L010000N_D07_r1i209.reroot.root undeclared n13035003_0004_L010000N_D07_r1i209.reroot.root Tue Oct 13 13:55:54 CDT 2009 Checking replacement file count ls /minos/data/mcimport/OVERLAY/mcin/L010000 | cut -f 1-2 -d _ | sort | wc -l 654 ls /minos/data/mcimport/OVERLAY/mcin/L010000 | cut -f 1-2 -d _ | sort -u | wc -l 654 ls /minos/data/mcimport/OVERLAY/mcin/L010000 | cut -f 1-2 -d _ | sort | uniq -d < no duplicates found > ============================================================================= 2009 10 12 ============================================================================= ######### # BATCH # ######### Restoring L010000N files to OVERLAY/mcin/L010000N from DUP and dcache DUP cd /minos/data/mcimport/OVERLAY/mcin mv DUP/n13035001* L010000/ mv DUP/n13035002* L010000/ dcache ls dcache/*L010000N* | wc -l 116 mv dcache/*L010000N* L010000/ find /pnfs/minos/mcout_data/dogwood1/near/daikon_07/L010000N_r1i* -type f | wc -l 839 find /pnfs/minos/mcout_data/dogwood1/near/daikon_07/L010000N_r1i* -type f -name \*sntp\* | wc -l 34 find /pnfs/minos/mcout_data/dogwood1/near/daikon_07/L010000N_r1i* -type f -name \*sntp\* | wc -l 34 find /minos/data/mcout_data/daikon_07/L010000N_r1* -type f | wc -l 68 find /minos/data/mcout_data/daikon_07/L010000N_r1i* -type f -name \*mrnt\* | wc -l 34 find /minos/data/mcout_data/daikon_07/L010000N_r1i* -type f -name \*sntp\* | wc -l 34 ######### # BATCH # ######### Remove dogwood1 far D04 and Data mrnt files, pass 0. From PNFS bluearc SAM SAMDIM='DATA_TIER mrnt-far and VERSION dogwood1' File Count: 7810 This includes both f and F files, passes null, 0 and 1. 
Check out detector data SAMDIM='FILE_TYPE importedSimulated and DATA_TIER mrnt-far and VERSION dogwood1' SAMDIM='FILE_TYPE importedDetector and DATA_TIER mrnt-far and VERSION dogwood1' sam list files --dim="${SAMDIM}" --summaryonly File Count: 2271 find /pnfs/minos/reco_far/dogwood1/mrnt_data -type f | wc -l 2271 find /minos/data/reco_far/dogwood1/mrnt_data -type f | wc -l 2271 MINOS26 > find /pnfs/minos/reco_far/dogwood1/mrnt_data -type f -name \*1.0.root | wc -l 2271 Prepare to enmv these files minospro@minos26 . ./setups.sh setup sam setup encp MFILES=`find /pnfs/minos/reco_far/dogwood1/mrnt_data -type f -name \*1.0.root` date NMV=0 for FILE in ${MFILES} ; do (( NMV++ )) MON=`echo ${FILE} | cut -f 7 -d /` OUP=/pnfs/minos/BAD/DOG1FARMRNT/DATA/${MON} mkdir -p ${OUP} enmv ${FILE} ${OUP}/ printf "\r${NMV} ${FILE}" sleep 1 done date did a couple by hand PRO> dds /pnfs/minos/BAD/DOG1FARMRNT/DATA/2007-10/ -rw-r--r-- 1 minospro e875 75289568 Aug 16 00:54 F00039724_0000.spill.mrnt.dogwood1.0.root -rw-r--r-- 1 minospro e875 1938897 Aug 16 00:55 F00039814_0000.spill.mrnt.dogwood1.0.root Mon Oct 12 10:57:46 CDT 2009 2271 /pnfs/minos/reco_far/dogwood1/mrnt_data/2005-09/F00032713_0000.spill.mrnt.dogwood1.0.rootPRO> date Mon Oct 12 13:36:56 CDT 2009 Individual enmv commands were taking from 1 to 20 seconds. minfarm@minos27 date ; find /minos/data/reco_far/dogwood1/mrnt_data -type f -exec rm {} \; date Mon Oct 12 13:32:14 CDT 2009 Mon Oct 12 13:44:21 CDT 2009 kreymer@minos27 SAMDIM='FILE_TYPE importedDetector and DATA_TIER mrnt-far and VERSION dogwood1' sam list files --dim="${SAMDIM}" --summaryonly date ; ./samundeclare "${SAMDIM}" ; date Mon Oct 12 13:39:20 CDT 2009 Found 2271 files undeclared F00042992_0000.spill.mrnt.dogwood1.0.root undeclared F00042998_0000.spill.mrnt.dogwood1.0.root ... 
Mon Oct 12 13:43:21 CDT 2009 MONTE CARLO SAMDIM='FILE_TYPE importedSimulated and DATA_TIER mrnt-far and VERSION dogwood1' sam list files --dim="${SAMDIM}" --summaryonly File Count: 5539 This is a mixture of dogwood1.root and dogwood.1.root files. SAMDIM=' FILE_TYPE importedSimulated and DATA_TIER mrnt-far and VERSION dogwood1 and FILE_NAME %.dogwood1.root' sam list files --dim="${SAMDIM}" --summaryonly File Count: 2770 MCDF=/pnfs/minos/mcout_data/dogwood1/far/daikon_04/L010185N/mrnt_data MFILES=`find ${MCDF} -type f -name \*dogwood1.root` printf "${MFILES}\n" | wc -l 2770 date ; ./samundeclare "${SAMDIM}" ; date Mon Oct 12 15:11:36 CDT 2009 Found 2770 files undeclared f21337002_0000_L010185N_D04.mrnt.dogwood1.root ... undeclared f21437373_0000_L010185N_D04.mrnt.dogwood1.root undeclared f21437391_0000_L010185N_D04.mrnt.dogwood1.root Mon Oct 12 15:14:28 CDT 2009 minfarm@minos27 MCDFB=/minos/data/mcout_data/daikon_04/L010185N/far/dogwood1/mrnt_data find ${MCDFB} -type f -name \*dogwood1.root | wc -l 2770 date ; find ${MCDFB} -type f -name \*dogwood1.root -exec rm {} \; date Mon Oct 12 14:54:20 CDT 2009 Mon Oct 12 15:01:40 CDT 2009 minospro@minos26 date NMV=0 for FILE in ${MFILES} ; do (( NMV++ )) MON=`echo ${FILE} | cut -f 10 -d /` OUP=/pnfs/minos/BAD/DOG1FARMRNT/MC/${MON} mkdir -p ${OUP} enmv ${FILE} ${OUP}/ printf "\r${NMV} ${FILE}" sleep 1 done date Mon Oct 12 15:01:55 CDT 2009 1 /pnfs/minos/mcout_data/dogwood1/far/daikon_04/L010185N/mrnt_data/713/f21437138_0000_L010185N_D04.mrnt.dogwood1.root 2770 /pnfs/minos/mcout_data/dogwood1/far/daikon_04/L010185N/mrnt_data/850/f21338507_0000_L010185N_D04.mrnt.dogwood1.rootPRO> date Mon Oct 12 16:27:10 CDT 2009 ####### # CRL # ####### The MINOS CRL, and the ILC CRL are now running on crlweb3.    The new URL is: http://crlweb3.fnal.gov:8080/minos/Index.jsp [crlweb3.fnal.gov:8080] please point your control room terminal at this URL.   
You will not see the images, because ILC and MINOS are running on the same server with only one working directory and that is pointed to ILC. We'll work on this.

_______________________________________________________________________
Date: Mon, 12 Oct 2009 16:18:21 -0500
From: Suzanne Gysin

crlweb2 is back up on a single cpu. Replacement has been ordered.

_______________________________________________________________________
Date: Mon, 12 Oct 2009 16:21:16 -0500
From: Suzanne Gysin

Since crlweb2 came back up, please point your logbooks back to the original URL.

=============================================================================
2009 10 11 Sunday
=============================================================================

#######
# CRL #
#######

Gave spanacek and bens access to minsoft@minos-mysql2, so that they have administrative access to the crl_v1 database.

They will bring up an alternate server on crlweb3, Monday at 08:00.

=============================================================================
2009 10 10 Saturday
=============================================================================

#######
# CRL #
#######

CRL has been down since 04:45 Saturday.
Hardware failure of the host crlweb2.

=============================================================================
2009 10 09
=============================================================================

############
# MCIMPORT #
############

mcimport.20091006
    Added EXIT message at end of sleep, to avoid a bailout when looping

$ scp AFSS/mcimport.20091006 .
$ set nohup ; ./mcimport.20091006 -l 3 OVERLAY &

############
# MCIMPORT #
############

Several configurations are being regenerated.
Date: Fri, 09 Oct 2009 09:23:21 -0500 From: Daniel Cronin-Hennessy To: Minos Sim 2) There may be a regeneration needed on an overlay (a typo in the config file) - Adam __________________________ Date: Fri, 09 Oct 2009 10:08:10 -0500 (CDT) From: Adam Schreckenberger The typo was in the Horn-Off Run 1 sample and has already been re-run. I am validating these files now and will have them ready in OVERLAY/mcin. Date: Fri, 09 Oct 2009 11:30:56 -0500 From: Robert Hatcher __________________________ Ahhhh, are the new file names in conflict with files that have already been generated and put into PNFS?  If so *those* files need to be   - removed from PNFS   - undeclared from SAM or any attempt at mcimport is going to be in a world of hurt.  Can you give us an exact file name pattern. __________________________ Date: Fri, 09 Oct 2009 11:47:27 -0500 From: Robert Hatcher From: Adam Schreckenberger Date: October 9, 2009 11:37:49 AM CDT n1303XXXX_YYYY_L010000N_D07r1iZZZ.reroot.root XXXX ranges from 5001 to 5021 YYYY ranges from 0000 to 0031 and ZZZ is either 209, 225, 232, or 259. Sorry for the double reply Robert. I have moved those files into a subdirectory to avoid the 'world of hurt' until I get further notice. __________________________ There are files from yesterday in OVERLAY/mcin/DUP, There was also a PNFS size mismatch, perhaps due to this stuff, /pnfs/minos/mcin_data/near/daikon_07/L010000N_r1i209/500/n13035006_0026_L010000N_D07_r1i209.reroot.root This is the file that was pulled out from under the SRMCP Check that the dups all were really dups, and are properly sorted out. touch /minos/data/mcimport/OVERLAY/STOP Fri Oct 9 14:55:40 CDT 2009 cp /minos/data/mcimport/OVERLAY/log/mcimport.log /minos/scratch/mindata/log/d7dups Edited out all but today's pass. Oct 8 15:01 n13035001_0000_L010000N_D07_r1i209.reroot.root ... 
missing 5001_0002 5001_0013 # # # small cleanup # # # Let's move the stray file from mcin/L010000 back to mcin/dcache cd /minos/data/mcimport/OVERLAY/mcin mv L010000/n13035006_0026_L010000N_D07_r1i209.reroot.root dcache/ _______________________________________________________________ $ find /pnfs/minos/mcin_data/near/daikon_07/L010000N_r1* | wc -l 783 _______________________________________________________________ To : Robert Hatcher Cc : Dan Cronin-Hennessy , Adam Schreckenberger , minos_sim@fnal.gov, minos_data@fnal.gov Attchmnt: Subject : Re: Fwd: MINOS: Re-running some overlays ----- Message Text ----- Adam - please make OVERLAY/mcin/L010000 group writeable : chmod 775 OVERLAY/mcin/L010000 All - Robert noticed that some run/subrun combinations occur in more than one intensity range : n13035003_0022_L010000N_D07_r1i209.reroot.root the same for runs 5004, 5005, 5006, subrun 22. This makes double entries for some physics events. Not sure how widespread this is. Standing by until this is understood ... 
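To gauge how widespread the overlap is, the file names can be reduced to run_subrun plus intensity tag and checked for keys that appear under more than one tag. This is a sketch only: it assumes names of the form nRUN_SUBRUN_L010000N_D07_r1iNNN.reroot.root as quoted above, and the function name find_cross_range_dups is hypothetical.

```sh
#!/bin/sh
# Hypothetical helper; assumes names like
#   n13035003_0022_L010000N_D07_r1i209.reroot.root
# Reads file names on stdin, prints each run_subrun key that occurs
# under more than one r1iNNN intensity tag.
find_cross_range_dups() {
    # 1. reduce each name to "run_subrun tag"; names not matching are dropped
    # 2. sort -u collapses repeats of the same key within the same tag
    # 3. keep only the key, then uniq -d prints keys still seen more than once
    sed -n 's/^\(n[0-9]*_[0-9]*\)_.*_\(r1i[0-9]*\)\..*$/\1 \2/p' |
        LC_ALL=C sort -u |
        cut -f 1 -d ' ' |
        uniq -d
}
```

Possible use, once the directory is readable :

    ls /minos/data/mcimport/OVERLAY/mcin/L010000 | find_cross_range_dups

A key printed here has the same run and subrun in two or more intensity ranges; an exact duplicate within one range is not reported.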
_______________________________________________________________ Work continued 2009 10 12 ============================================================================= 2009 10 08 kreymer on shift ============================================================================= ########### # BLUEARC # ########### Moderate slowdown today, perhaps tracking startup of xbhuang jobs /minos/scratch/xbhuang/Nue/MRE/codes/condor_mre.sh 0 473 /minos/scratch/xbhuang/Nue/MRE/z4000_5000.list /minos/data2/nue_group_tmp/tmp/ wc -l /minos/scratch/xbhuang/Nue/MRE/z4000_5000.list 1000 /minos/scratch/xbhuang/Nue/MRE/z4000_5000.list ls /minos/data2/nue_group_tmp/tmp | wc -l 2480 date ls -ltr /minos/data2/nue_group_tmp/tmp | tail Thu Oct 8 16:53:47 CDT 2009 -rw-r--r-- 1 43021 e875 180939115 Oct 8 16:53 reroot_n13037318_0030.root -rw-r--r-- 1 43021 e875 173911987 Oct 8 16:53 reroot_n13037319_0004.root -rw-r--r-- 1 43021 e875 185907344 Oct 8 16:53 reroot_n13037222_0024.root -rw-r--r-- 1 43021 e875 185108772 Oct 8 16:53 reroot_n13037222_0023.root -rw-r--r-- 1 43021 e875 189710221 Oct 8 16:53 reroot_n13037222_0025.root -rw-r--r-- 1 43021 e875 172558016 Oct 8 16:53 reroot_n13037319_0012.root -rw-r--r-- 1 43021 e875 187541057 Oct 8 16:53 reroot_n13037318_0016.root -rw-r--r-- 1 43021 e875 29032448 Oct 8 16:53 reroot_n13037221_0018.root -rw-r--r-- 1 43021 e875 43057152 Oct 8 16:53 reroot_n13037318_0028.root -rw-r--r-- 1 43021 e875 167968768 Oct 8 16:53 reroot_n13037319_0010.root The script is using cpn to regulate root file writes to Bluearc. So I don't see the immediate problem. ######### # ADMIN # ######### CD110143 for FL/CD/SCF/FEF FY09 Minos Servers Requisition 211651 8/19/09 Replacement disk server as specified on page 1 of the attached quote, "Storform iServ R503 "Config #1 2U w/2TB Drives" or an equivalent RHEL v4/5 compatible system configuration. 3yr onsite service and support. Project CD Operations , Task MINOS-COMP-OP , Task Number 50.01.06.04.01.01 Exp. 
Org CD - FERMILAB EXPERIMENTS FACILITIES, Exp. Type MATERIAL PURCHASES Service Type OP-EXST PRGM OP-DET PO 588545 06-Oct-2009 APPROVE CARLSON/PO4 06-Oct-2009 FORWARD KONCELIK 06-Oct-2009 SUBMIT KONCELIK 06-Oct-2009 RESERVE KONCELIK 06-Oct-2009 UNRESERVE KONCELIK 24-Sep-2009 APPROVE CARLSON/PO4 24-Sep-2009 FORWARD KONCELIK 24-Sep-2009 SUBMIT KONCELIK 24-Sep-2009 RESERVE KONCELIK All our items were received and delivered Nov 4. ============================================================================= 2009 10 07 kreymer on shift ============================================================================= ####### # CVS # ####### Removed all Oxford and nearly Oxford public keys, due to breakins at that site. Removed all CD staff and inactive collaborator keys, in preparation for removal of all keys. ============================================================================= 2009 10 06 kreymer on shift ============================================================================= ############ # BLUWATCH # ############ Sampled Minos ntuple data from minos-mysql3, a modern well tuned host. default base path is /minos/data2/reco_near/cedar_phy_bhcurv/sntp_data N.B. minos-sam04 runs ./bluwatch -r -b /minos/scratch/bluwatch/minos-sam04 -d 5 minos27 runs ./bluwatch -r -b /grid/data/minos/bluwatch/minos27 -d 9 ( Need to upgrade the local crontab entries accordingly ) Latest sample by minos-sam01 is 2007-01/N00011574_0000.spill.sntp.cedar_phy_bhcurv.0.root Let's drop back to 2006-01 to get stalish files kreymer@minos-mysql2 cdadmin ./bluwatch -t -r -d '2007-01' Corrected errors in bluwatch so that test works again. 
OFFSET 1750500 Tue Oct 6 18:14:42 CDT 2009 2007-01/N00011458_0000.spill.sntp.cedar_phy_bhcurv.0.root 27 Tue Oct 6 18:14:49 CDT 2009 2007-01/N00011471_0000.spill.sntp.cedar_phy_bhcurv.0.root 12 Tue Oct 6 18:14:55 CDT 2009 2007-01/N00011481_0000.spill.sntp.cedar_phy_bhcurv.0.root 31 Tue Oct 6 18:15:02 CDT 2009 2007-01/N00011488_0000.spill.sntp.cedar_phy_bhcurv.0.root 22 Tue Oct 6 18:15:08 CDT 2009 2007-01/N00011491_0000.spill.sntp.cedar_phy_bhcurv.0.root 16 Tue Oct 6 18:15:15 CDT 2009 2007-01/N00011497_0000.spill.sntp.cedar_phy_bhcurv.0.root 24 Tue Oct 6 18:15:22 CDT 2009 2007-01/N00011500_0000.spill.sntp.cedar_phy_bhcurv.0.root 12 Rates are standard, normal from this modern host. Test rates from standard test files to an older host, the same one suffering so badly kreymer@minos-sam01 ./bluwatch -t -r -b /grid/data/minos/bluwatch/stash/2 OFFSET 3756000 DIR /grid/data/minos/bluwatch/stash/2/0 Tue Oct 6 18:22:33 CDT 2009 0/file1801 33 Tue Oct 6 18:22:42 CDT 2009 0/file1802 3 Tue Oct 6 18:22:50 CDT 2009 0/file1803 5 Tue Oct 6 18:22:57 CDT 2009 0/file1804 6 Tue Oct 6 18:23:06 CDT 2009 0/file1805 3 Tue Oct 6 18:23:13 CDT 2009 0/file1806 7 Tue Oct 6 18:23:21 CDT 2009 0/file1807 7 Tue Oct 6 18:23:30 CDT 2009 0/file1808 3 MINOS14 > ./bluwatch -t -r -b /grid/data/minos/bluwatch/stash/3 OFFSET 5555900 DIR /grid/data/minos/bluwatch/stash/3/0 Tue Oct 6 18:28:38 CDT 2009 0/file1801 3 Tue Oct 6 18:28:46 CDT 2009 0/file1802 5 Tue Oct 6 18:28:52 CDT 2009 0/file1803 29 Tue Oct 6 18:29:00 CDT 2009 0/file1804 8 Tue Oct 6 18:29:06 CDT 2009 0/file1805 14 Tue Oct 6 18:29:13 CDT 2009 0/file1806 15 Tue Oct 6 18:29:21 CDT 2009 0/file1807 6 Tue Oct 6 18:29:27 CDT 2009 0/file1808 25 Tue Oct 6 18:29:36 CDT 2009 0/file1809 3 Tue Oct 6 18:29:43 CDT 2009 0/file1810 7 Tue Oct 6 18:29:51 CDT 2009 0/file1811 7 FNPC340 ./bluwatch -t -r -b /grid/data/minos/bluwatch/stash/6 OFFSET 2731300 DIR /grid/data/minos/bluwatch/stash/6/0 Tue Oct 6 18:32:50 CDT 2009 0/file1801 30 Tue Oct 6 18:32:57 CDT 2009 
0/file1802 20 Tue Oct 6 18:33:03 CDT 2009 0/file1803 30 Tue Oct 6 18:33:09 CDT 2009 0/file1804 34 Tue Oct 6 18:33:16 CDT 2009 0/file1805 45 Tue Oct 6 18:33:22 CDT 2009 0/file1806 15 Tue Oct 6 18:33:29 CDT 2009 0/file1807 36 Tue Oct 6 18:33:35 CDT 2009 0/file1808 40 An older farm node, fnpc177 AMD Opteron(tm) Processor 248 ./bluwatch -t -r -b /grid/data/minos/bluwatch/stash/7 OFFSET 4240500 DIR /grid/data/minos/bluwatch/stash/7/0 Tue Oct 6 18:34:47 CDT 2009 0/file1801 22 Tue Oct 6 18:34:53 CDT 2009 0/file1802 22 Tue Oct 6 18:34:59 CDT 2009 0/file1803 47 Tue Oct 6 18:35:06 CDT 2009 0/file1804 14 Tue Oct 6 18:35:12 CDT 2009 0/file1805 20 Tue Oct 6 18:35:19 CDT 2009 0/file1806 17 Tue Oct 6 18:35:25 CDT 2009 0/file1807 30 Tue Oct 6 18:35:32 CDT 2009 0/file1808 16 Scientific Linux Fermi LTS release 4.4 (Wilson) ############ # BRATEDAY # ############ Have a look at cedar_phy_bhcurv ntuples bracketing the Jun 22 bluearc upgrade. Things got a factor of 2 worse, as previously noted. ./bratewk.new minos-sam01 20090615 \ /afs/fnal.gov/files/expwww/numi/html/computing/dh/bluearc/rates/minos-sam01 ./bratewk.new minos-sam01 20090622 \ /afs/fnal.gov/files/expwww/numi/html/computing/dh/bluearc/rates/minos-sam01 ./bratewk.new minos-sam01 20090629 \ /afs/fnal.gov/files/expwww/numi/html/computing/dh/bluearc/rates/minos-sam01 ============================================================================= 2009 10 05 kreymer on shift ============================================================================= ############ # BRATEDAY # ############ Added plots for minos-sam01 ( sntp files ) mkdir /afs/fnal.gov/files/expwww/numi/html/computing/dh/bluearc/rates/minos-sam01 ./bratewk.new minos-sam01 20090928 \ /afs/fnal.gov/files/expwww/numi/html/computing/dh/bluearc/rates/minos-sam01 ./bratewk.new minos-sam01 20090921 \ /afs/fnal.gov/files/expwww/numi/html/computing/dh/bluearc/rates/minos-sam01 ./bratewk.new minos-sam01 20090914 \ 
/afs/fnal.gov/files/expwww/numi/html/computing/dh/bluearc/rates/minos-sam01 ./bratewk.new minos-sam01 20090907 \ /afs/fnal.gov/files/expwww/numi/html/computing/dh/bluearc/rates/minos-sam01 Data rates are sometimes 'normal', mostly 4 or 8 MB/sec. ######## # EVO # ####### To connect to an ESNET ad hoc video conference from EVO, from the EVO FAQ on H323 and SIP, How to bridge ESNET meeting to EVO Run EVO / Koala, select H.323 under the Call dropdown menu. H.323 address should be @198.129.252.168 or @GK1.es.net You can Edit -> create a new profile if you like Examples : 8872634@GK1.es.net ( SAMDH ) 88436.GK1.es.net ( GDM ) ######### # ADMIN # ######### Date: Mon, 05 Oct 2009 22:23:09 +0000 (GMT) From: Arthur Kreymer To: Lee Lueking Cc: minos-admin@fnal.gov Subject: Re: Neutrino/I-Front nodes to be rebooted on Maint day Oct. 15 On Mon, 5 Oct 2009, Lee Lueking wrote: > Nodes in the list below are scheduled to be rebooted on Maintenance day, Thursday Oct. 15. > Please review and confirm that your experiment will be prepared for this downtime. ... > MINOS > Node OS Running Kernel Update Kernel By > MINOS-MYSQL1 UKNOWN UKNOWN This is LTS4 2.6.9-89.0.7.ELsmp 10/25/2009 > MINOS02 LTS4 2.6.9-78.0.17.ELsmp 06/30/2009 This node failed, and has been decommissioned. We plan to be ready for the global reboot Oct 15. We will review the detailed plan as the time approaches. There may be a request to shut all the systems down for about 15 minutes, then reboot, in order to perform a last ditch Bluearc load test, proving that the heavy client load does not come from any Minos resource. ######### # ADMIN # ######### Summary: Linux Minos Notes: FEF primary - run2-sys@fnal.gov Please add the minsoft account the the e875 group in NIS. This is so that we can use this account to replicate some AFS areas to Bluearc. ________________________________________________________________--- Date: Mon, 05 Oct 2009 10:07:04 -0500 (CDT) Request INC000000012371: Status has been updated. 
Status: Completed Minsoft added to e875. ________________________________________________________________--- ######### # SHIFT # ######### kreymer is on 08:00 - 16:00 Minos shift Mon-Thu Oct 5-9 ############ # BRATEDAY # ############ Touched up d0mino plots for last week ./bratewk.new d0mino05 20090928 \ /afs/fnal.gov/files/expwww/numi/html/computing/dh/bluearc/rates/d0mino05 \ "" /grid/data/monitor ./bratewk.new d0mino06 20090928 \ /afs/fnal.gov/files/expwww/numi/html/computing/dh/bluearc/rates/d0mino06 \ "" /grid/data/monitor ============================================================================= 2009 10 04 Sun ============================================================================= ######### # ADMIN # ######### Date: Sun, 04 Oct 2009 18:29:31 -0500 (CDT) Request INC000000012414 requested by you has been submitted. Status: New Summary: FTL timecard unavailable Notes: I am unable to connect to the FTL timecard service. Sun Oct 4 18:25:10 CDT 2009 Connecting to https://time.fnal.gov:8006/oa_servlets/AppsLogin I get Firefox can't establish a connection to the server at time.fnal.gov:8006. Host time.fnal.gov responds to pings. So perhaps the web server is down. _______________________________________________________________________ ######### # ADMIN # ######### Subject: [JIRA] Created: (MINOSDATA-23) INC000000008226 Minos Cluster sluggish Sat Aug 15 ------------------------------------------------- Key: MINOSDATA-23 URL: http://fermilab.go2group.com:8080/browse/MINOSDATA-23 Project: Minos Data Issue Type: Task Reporter: Arthur Kreymer Assignee: Arthur Kreymer Priority: Minor System The following has not been implemented yet. Follow up. I finally got a response from networking. They suggested to put fnsrv0 first in resolv.conf, as that server is at the FCC. 
fnsrv0.fnal.gov has address 131.225.8.120

############
# BRATEDAY #
############

The draft is called brateday.new

Added -w qualifier, to subsume the function of bratewk

testing with d0mino05/6 data

Had to adjust time format in bratewk for compatibility with SLF 4.
    date -d "YYYYMMDD ..." had worked as intended at SLF 5.
    date -d "YYYY/MM/DD ..." is required at SLF 4

This looks good

./brateday.new -n d0mino05 -d /grid/data/monitor -t -w

Try with logging :

./brateday.new -n d0mino05 -d /grid/data/monitor -w
./brateday.new -n d0mino06 -d /grid/data/monitor -w

That worked, and looped.

Picking up previous data as logged previously

./bratewk.new d0mino05 20090923 \
    /afs/fnal.gov/files/expwww/numi/html/computing/dh/bluearc/rates/d0mino05 \
    "" /grid/data/monitor

./bratewk.new d0mino06 20090922 \
    /afs/fnal.gov/files/expwww/numi/html/computing/dh/bluearc/rates/d0mino06 \
    "" /grid/data/monitor

Created empty files to fill out the old weeks

mkdir /grid/data/monitor/rate/2009/09/21
touch /grid/data/monitor/rate/2009/09/21/d0mino05.txt
touch /grid/data/monitor/rate/2009/09/21/d0mino06.txt
touch /grid/data/monitor/rate/2009/09/22/d0mino05.txt

./bratewk.new d0mino05 20090921 \
    /afs/fnal.gov/files/expwww/numi/html/computing/dh/bluearc/rates/d0mino05 \
    "" /grid/data/monitor

./bratewk.new d0mino06 20090921 \
    /afs/fnal.gov/files/expwww/numi/html/computing/dh/bluearc/rates/d0mino06 \
    "" /grid/data/monitor

Removed the overlapping plots

rm /afs/fnal.gov/files/expwww/numi/html/computing/dh/bluearc/rates/d0mino05/20090924.week.png
rm /afs/fnal.gov/files/expwww/numi/html/computing/dh/bluearc/rates/d0mino05/20090923.week.png
rm /afs/fnal.gov/files/expwww/numi/html/computing/dh/bluearc/rates/d0mino06/20090922.week.png

Refresh some Minos plots

OUTP=/afs/fnal.gov/files/expwww/numi/html/computing/dh/bluearc/rates/minos-sam04

./bratewk.new minos-sam04 20090907 ${OUTP} # minos26
./bratewk.new minos-sam04 20090914 ${OUTP} # ark
./bratewk.new minos-sam04 20090921 ${OUTP} # ark
./bratewk.new minos-sam04
20090928 ${OUTP} # ark ########## # DCACHE # ########## INC000000011756 Summary: FNDCA RawDataWritePools writes This was resolved, see Bugzilla 417, and notes of 2009 09 25 http://www-ccf.fnal.gov/Bugzilla/show_bug.cgi?id=417 ########### # MONTHLY # ########### DATASETS 10/4 OK PREDATOR 10/4 OK VAULT 10/4 OK - with larger 7 GB file size limit MYSQL 10/ MINOS-MYSQL2 > rm -r /var/minsoft/archive/20090904 MINOS-MYSQL2 > scripts/dbarchive STARTED DBARCHIVES Sun Oct 4 14:40:43 CDT 2009 FINISHED DBARCHIVES Sun Oct 4 16:08:43 CDT 2009 79075 . 84224 free space on /var/minsoft/archive 63115 /var/minsoft/archive/20091004/offline ============================================================================= 2009 10 03 Sat. ============================================================================= ########### # GNUPLOT # ########### Date: Sat, 03 Oct 2009 17:05:34 -0500 (CDT) Request INC000000012412 requested by you has been submitted. Status: New Summary: Linux Minos Notes: FEF primary - run2-sys@fnal.gov Please yum install gnuplot on the newer Minos servers : minos25 minos27 minos-mysql2 minos-sam04 ____________________________________________________________________________ Date: Mon, 05 Oct 2009 10:11:07 -0500 (CDT) Status: Completed Installed gnuplots on the servers. 
############
# BRATEDAY #
############

Testing on d0mino05

./bluwatch.20090922 -r -b /prj_root/5012/bluwatch/data -d 6 -l /grid/data/monitor &

exec bash
./brateday -n d0minos05 -d /grid/data/monitor -o /grid/data/monitor/plot -s ${HOME}
mkdir /grid/data/monitor/plot/d0mino05

Picking up some old data

WEBDIR=/afs/fnal.gov/files/expwww/numi/html/computing/dh/bluearc/rates/d0mino05
DATADIR=/grid/data/monitor
./brate d0mino05 20090925 "" "" ${DATADIR}
./brate d0mino05 20090925 ${WEBDIR} "" ${DATADIR}

for DAY in 01 02 ; do
./brate d0mino05 200910${DAY}0930 ${WEBDIR} "" ${DATADIR}
done

WEBDIR=/afs/fnal.gov/files/expwww/numi/html/computing/dh/bluearc/rates

for DAY in 01 02 ; do
./brate d0mino06 200910${DAY} ${WEBDIR}/d0mino06 "" ${DATADIR}
done

for DAY in 25 26 27 28 29 30 ; do
./brate d0mino06 200909${DAY} ${WEBDIR}/d0mino06 "" ${DATADIR}
done

############
# BRATEDAY #
############

Added -t test option

Working to make this usable for all rate sources and destinations,
with defaults as before.

Repaired old broken plots found while reviewing brateday

./brate minos27 20090929 /afs/fnal.gov/files/expwww/numi/html/computing/dh/bluearc/rates/minos27
./brate minos27 20090928 /afs/fnal.gov/files/expwww/numi/html/computing/dh/bluearc/rates/minos27

This runs hideously slowly

MINOS26 > time ./brate minos27 20090928 /afs/fnal.gov/files/expwww/numi/html/computing/dh/bluearc/rates/minos27

real    1m28.259s
user    0m5.444s
sys     1m16.614s

This is unique to the SLF 4.2 systems ( 32 bit, older hardware )
The same thing runs in 12 seconds on minos27.

The time is chewed up by the expressions inside the read loop

SAMPLE=/afs/fnal.gov/files/expwww/numi/html/computing/dh/bluwatch/rate/2009/09/28/minos27.txt

After further tests, rates are normal again on minos26.
Primary guess : the slowdown was due to the overhead from running 'stage'.

###############
# GRIDAPPSYNC #
###############

This failed again yesterday; the target directories were owned by kreymer,
without g+w permission.
Trying again, first creating directories owned by mindata.

$ rmdir /grid/fermiapp/minos/products /grid/fermiapp/minos/minossoft/

mindata@minos27

mkdir /grid/fermiapp/minos/products
mkdir /grid/fermiapp/minos/minossoft
chmod g+ws /grid/fermiapp/minos/products
chmod g+ws /grid/fermiapp/minos/minossoft

date ; set nohup ; ./gridappsync &
Sat Oct 3 14:19:30 CDT 2009

total size is 47879124657 speedup is 1.00
310.99user 893.65system 5:05:32elapsed 6%CPU (0avgtext+0avgdata 0maxresident)k
1944inputs+3564776outputs (2major+2208512minor)pagefaults 0swaps
FINISHED Sat Oct 3 19:25:03 CDT 2009

date ; set nohup ; ./gridappsync -i d120 -o minossoft &
Sat Oct 3 14:19:59 CDT 2009

total size is 38140659517 speedup is 1.00
236.90user 546.05system 3:15:15elapsed 6%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+1464496outputs (0major+2113366minor)pagefaults 0swaps
FINISHED Sat Oct 3 17:35:14 CDT 2009

This is working cleanly now.

#########
# STAGE #
#########

Checking file status, odd results from VO6876

./stage: line 115: cd: /pnfs/minos/fardet_data/Oct-01: No such file or directory

=============================================================================
2009 10 02
=============================================================================

############
# MCIMPORT #
############

MINOSDATA-18

The present mcimport script of 20090930 is properly reporting
ACTION files which contain the recently added EXIT messages.
This clutters up the logs.
ACTION files containing EXIT should be silently ignored.

02/Oct/09 05:23 PM

Preparing mcimport.20091002 17:55
    Corrected ACTION test to skip files containing EXIT
    Added EXIT string to the STOP message, accordingly.
    Corrected maxdepth find of ACTION file

cp -a AFSS/mcimport.20091002 mcimport.20091002
ln -sf mcimport.20091002 mcimport

set nohup ; ./mcimport -l 9999 ALL &
set nohup ; ./mcimport -l 9999 OVERLAY &

#########
# ADMIN #
#########

Date: Fri, 02 Oct 2009 15:35:38 -0500 (CDT)

Request INC000000002416: Status has been updated.
Status: In Progress
Summary: Root access to a Minos Server system
_______________________________________________________________________
Date: Fri, 02 Oct 2009 15:36:14 -0500 (CDT)

Request INC000000002416: Status has been updated.
Status: Completed

Not willing to give root access to non-registered sysadmins.
We can alter permissions to make filesystem writable by Art and Robert.
_______________________________________________________________________
Date: Fri, 02 Oct 2009 21:53:43 +0000 (GMT)
From: Arthur Kreymer

Our users write to /minos/* Bluearc file systems from Fermigrid nodes
where they are running as minospro or minosana ( or other accounts. )

We need to correct file ownerships and protections on occasion.
This requires root access to these file systems.

Since FEF cannot help, I will ask CSI and/or Fermigrid for access.

###############
# GRIDAPPSYNC #
###############

Fixed directories that were not executable, in the original AFS source

find /afs/fnal.gov/files/data/minos/d120/packages/MCReweight -type d ! -perm -100
/afs/fnal.gov/files/data/minos/d120/packages/MCReweight/S09-09-18-R2-00/data/CVS

find /afs/fnal.gov/files/data/minos/d120/packages/MCReweight -type d ! -perm -100 -exec chmod u+x {} \; -print
/afs/fnal.gov/files/data/minos/d120/packages/MCReweight/S09-09-18-R2-00/data/CVS

minsoft@minos-sam03

cd ~kreymer/minos/scripts/

date ; set nohup ; ./gridappsync &
Fri Oct 2 13:31:22 CDT 2009

date ; set nohup ; ./gridappsync -i d120 -o minossoft &
Fri Oct 2 13:31:43 CDT 2009

tail /minos/scratch/minsoft/log/gridappsync/d119-2009-10.log

These both failed again, minsoft is not in the e875 group in NIS.
The minos-sam03 files are outdated, misleading me.

Set the setgid bit ( chmod g+s ) on the directories so that the group will be inherited.
MINOS27 > mkdir /grid/fermiapp/minos/products MINOS27 > chgrp e875 /grid/fermiapp/minos/products MINOS27 > chmod g+s /grid/fermiapp/minos/products MINOS27 > mkdir /grid/fermiapp/minos/minossoft MINOS27 > chgrp e875 /grid/fermiapp/minos/minossoft MINOS27 > chmod g+s /grid/fermiapp/minos/minossoft This will not work until minsoft is in the e875 group list To make temporary progress, doing the sync as mindata mindata@minos27 cd AFSS date ; set nohup ; ./gridappsync & date ; set nohup ; ./gridappsync -i d120 -o minossoft & ########### # BLUEARC # ########### Date: Fri, 02 Oct 2009 09:31:53 -0500 (CDT) From: Fermilab Service Desk To: kreymer@fnal.gov Subject: INC000000007362 Art, I have updated this ticket. We are formatting the new storage for you. You will have the same amount of space (maybe a little more). We hope to start migrations by Monday. ____________________________________________________________________________ 10/2/2009 2:30:17 PM ; rayp Working on separating MINOS from Fermigrid Art, We have recovered the disks from D0 and are now working on formatting them and preparing them for Minos. Once the storage is formatted, we will begin migrating MINOS data over to the new area, separating MINOS from GRID and hopefully resolving these performance issues you have been reporting. ____________________________________________________________________________ ############### # GRIDAPPSYNC # ############### Look for and fix minossoft directories not executable by self find /afs/fnal.gov/files/data/minos/d120 -type d ! 
-perm -100 /grid/fermiapp/minos/products/prd/MINOS_ROOT/Linux2.4-GCC_3_4/trunk-opt/.svn/all-wcprops ############### # GRIDAPPSYNC # ############### Looking at the other directories served by parrot : cat /grid/fermiapp/minos/parrot/mountfile.grow /afs/fnal.gov/files/code/e875/general/minossoft /grow/www-numi.fnal.gov/computing/parrot/releases /afs/fnal.gov/files/code/e875/general/ups /grow/www-numi.fnal.gov/computing/parrot/ups /afs/fnal.gov/files/code/e875/general/MINOS_EXTERNAL /grow/www-numi.fnal.gov/computing/parrot/MINOS_EXTERNAL /afs/fnal.gov/files/code/e875/sim /grow/www-numi.fnal.gov/computing/parrot/sim /afs/fnal.gov/files/data/minos /grow/www-numi.fnal.gov/computing/parrot/release_data du -sm /afs/fnal.gov/files/code/e875/general/MINOS_EXTERNAL 898 /afs/fnal.gov/files/code/e875/general/MINOS_EXTERNAL du -sm /afs/fnal.gov/files/code/e875/sim 7686 /afs/fnal.gov/files/code/e875/sim EXTP=/afs/fnal.gov/files/code/e875/general/MINOS_EXTERNAL SIMP=/afs/fnal.gov/files/code/e875/sim RELP=/afs/fnal.gov/files/data/minos/release_data find ${SIMP} -type l -exec ls -ld {} \; \ | grep "> /afs" \ | grep -v "> ${SIMP}" \ | grep -v "> ${RELP}" find ${EXTP} -type l -exec ls -ld {} \; \ | grep "> /afs" \ | grep -v "> ${EXTP}" \ | grep -v "> ${RELP}" ============================================================================= 2009 10 01 ============================================================================= ########## # DCACHE # ########## Topping off the raw data. setup encp ./volumes vols -rw-r--r-- 1 kreymer g020 308655 Oct 1 16:31 /tmp/vols First a preview pass. FVOLS=`./volumes fardet_data` { for VOL in ${FVOLS} ; do ./stage -w -s fardet_data -g q -n ${VOL} done ; } > /minos/scratch/kreymer/log/stage/fdstage0910pre.log 2>&1 & NVOLS=`./volumes neardet_data` { for VOL in ${NVOLS} ; do ./stage -w -s neardet_data -g q -n ${VOL} done ; }> /minos/scratch/kreymer/log/stage/ndstage0910pre.log 2>&1 & grep Needed /minos/scratch/kreymer/log/stage/fdstage0910pre.log | tr -d . 
Needed 0/5325 from VO2432 Needed 0/206 from VO3899 Needed 0/1811 from VO4298 Needed 0/4684 from VO4335 Needed 65/9921 from VO6876 Needed 0/2975 from VO8536 Needed 0/1080 from VO8555 Needed 0/3455 from VO8699 Needed 0/7138 from VO9488 Needed 0/200 from VO9830 Needed 0/223 from VOA187 Needed 0/709 from VOB499 Needed 0/238 from VOB737 Needed 0/10 from VOB935 Needed 0/462 from VOC268 Needed 0/2336 from VOC475 Needed 0/960 from VOC513 Needed 0/2660 from VOC538 Needed 0/501 from VOC560 Needed 1/2302 from VON124 Needed 21/26256 from VOO109 Needed 0/11462 from VOO218 Needed 0/2714 from VOO248 Needed 0/82 from VOO273 Needed 0/14 from VOO275 Needed 0/151 from VOO418 MINOS26 > grep Needed /minos/scratch/kreymer/log/stage/fdstage0910pre.log | tr -d . | grep -v ' 0/' Needed 65/9921 from VO6876 Needed 1/2302 from VON124 Needed 21/26256 from VOO109 grep Needed /minos/scratch/kreymer/log/stage/ndstage0910pre.log | tr -d . Needed 0/1964 from VO2307 Needed 0/1984 from VO3863 Needed 0/1322 from VO4343 Needed 0/823 from VO5081 Needed 0/402 from VO7026 Needed 0/2104 from VO7175 Needed 0/1423 from VO7421 Needed 0/1261 from VO8537 Needed 0/937 from VO8556 Needed 0/2120 from VO9752 Needed 0/149 from VO9834 Needed 0/2113 from VOA138 Needed 0/644 from VOA196 Needed 0/753 from VOB373 Needed 0/2041 from VOB962 Needed 0/892 from VOC065 Needed 0/1343 from VOC359 Needed 0/1885 from VOC443 Needed 0/2118 from VOC519 Needed 0/228 from VOD515 Needed 0/2871 from VON125 Needed 0/111 from VON128 Needed 155/15373 from VOO107 Needed 46/2804 from VOO188 Needed 0/2775 from VOO241 Needed 0/102 from VOO267 MINOS26 > grep Needed /minos/scratch/kreymer/log/stage/ndstage0910pre.log | tr -d . 
| grep -v ' 0/' Needed 155/15373 from VOO107 Needed 46/2804 from VOO188 2009 10 02 Move the missing files in, using the shorter volume list { for VOL in VO6876 VON124 VOO109 ; do ./stage -w -s fardet_data -g q ${VOL} done ; } > /minos/scratch/kreymer/log/stage/fdstage091002.log 2>&1 & Net 87 files { for VOL in VOO107 VOO188 ; do ./stage -w -s neardet_data -g q ${VOL} done ; }> /minos/scratch/kreymer/log/stage/ndstage091002.log 2>&1 & Needed 155/15373 from VOO107 Needed 46/2835 from VOO188 Net 201 files Fri Oct 2 08:41:14 CDT 2009 Needed 64/9921 from VO6876 Needed 1/2302 from VON124 Needed 21/26256 from VOO109 _________________________________________________________________ Checking the status of this the next day, with fresh pool listings. for VOL in VOO107 VOO188 ; do ./stage -w -s neardet_data -g q -n ${VOL} done STARTING Sat Oct 3 14:42:24 CDT 2009 Needed 155/15373 from VOO107 Needed 46/2835 from VOO188 for VOL in VO6876 VON124 VOO109 ; do ./stage -w -s fardet_data -g q -n ${VOL} done STARTING Sat Oct 3 15:36:10 CDT 2009 Needed 64/9921 from VO6876 Needed 0/2302 from VON124 Needed 21/26256 from VOO109 MINOS27 > enstore info --list=VO6876 | grep Oct | wc -l 45 ########## # DCACHE # ########## Request INC000000011850 requested by you has been submitted. Status: New Summary: FNDCA pool 26a-1 Pool Listing empty This was resolved, see below. ############### # GRIDAPPSYNC # ############### minsoft@minos-mysql2 cd ~kreymer/minos/scripts/ date ; set nohup ; ./gridappsync & Thu Oct 1 18:55:37 CDT 2009 date ; set nohup ; ./gridappsync -i d120 -o minossoft & Thu Oct 1 18:56:55 CDT 2009 less /minos/scratch/minsoft/log/gridappsync/d119-2009-10.log STARTED Thu Oct 1 18:55:37 CDT 2009 ... 
prd/MINOS_ROOT/Linux2.4-GCC_3_4/trunk-opt/proof/proofd/src/.svn/tmp/text-base/ rsync: close failed on "/grid/fermiapp/minos/products/prd/MINOS_ROOT/Linux2.4-GCC_3_4/trunk-opt/proof/proof/inc/.svn/prop-base/.TDataSetManagerFile.h.svn-base.3VxyXi": Disk quota exceeded (122) rsync error: error in file IO (code 11) at receiver.c(555) rsync: connection unexpectedly closed (39006251 bytes received so far) [generator] rsync error: error in rsync protocol data stream (code 12) at io.c(359) rsync: connection unexpectedly closed (5132408 bytes received so far) [sender] rsync error: error in rsync protocol data stream (code 12) at io.c(359) Command exited with non-zero status 12 35.60user 113.20system 1:06:38elapsed 3%CPU (0avgtext+0avgdata 0maxresident)k 0inputs+2382312outputs (0major+362150minor)pagefaults 0swaps FINISHED Thu Oct 1 20:02:16 CDT 2009 less /minos/scratch/minsoft/log/gridappsync/d129-2009-10.log STARTED Thu Oct 1 18:56:55 CDT 2009 building file list ... done ... packages/PhotonTransport/R1-24/macros/test/CVS/ rsync: close failed on "/grid/fermiapp/minos/minossoft/packages/PhotonTransport/R1-24-2/doc/first_present/.PhotonTransport_FirstPresentation.pdf.mTqtIX": Disk quota exceeded (122) rsync error: error in file IO (code 11) at receiver.c(555) packages/PhotonTransport/R1-24/tables/ rsync: connection unexpectedly closed (18911158 bytes received so far) [generator] rsync error: error in rsync protocol data stream (code 12) at io.c(359) rsync: connection unexpectedly closed (8400448 bytes received so far) [sender] rsync error: error in rsync protocol data stream (code 12) at io.c(359) Command exited with non-zero status 12 13.20user 85.52system 1:05:20elapsed 2%CPU (0avgtext+0avgdata 0maxresident)k 0inputs+1110976outputs (0major+156408minor)pagefaults 0swaps FINISHED Thu Oct 1 20:02:16 CDT 2009 OOPS - these are in the mysql group, not e875 Need to log into minos-sam03, where we are in e875. 
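A possible guard against starting a long sync from an account in the wrong group, as happened here. This is only a sketch: in_group is a hypothetical helper, not part of gridappsync, and it assumes that `id -Gn` lists the groups of the current account on the node in question.

```shell
#!/bin/sh
# Sketch only : in_group is an illustrative helper, not part of gridappsync.

# Succeed if the current account belongs to the named group.
in_group() {
    want=$1
    for grp in $(id -Gn) ; do
        [ "${grp}" = "${want}" ] && return 0
    done
    return 1
}

# Example guard before starting the sync ( left commented out in this sketch ) :
# in_group e875 || { echo "$(id -un) is not in e875, not starting" ; exit 1 ; }
```

Checking membership up front costs nothing, while a sync attempt that fails on group or quota problems took over an hour in the runs above.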
Mysql> du -sm /grid/fermiapp/minos/products
17243 /grid/fermiapp/minos/products

Mysql> du -sm /grid/fermiapp/minos/minossoft/
du: `/grid/fermiapp/minos/minossoft/packages/MCReweight/S09-09-18-R2-00/data/CVS': Permission denied
15809 /grid/fermiapp/minos/minossoft/

ls -ld /afs/fnal.gov/files/data/minos/d120/packages/MCReweight/S09-09-18-R2-00/data/CVS
drw-r--r-- 2 rhatcher e875 2048 Sep 18 15:50 CVS/

time rm -r /grid/fermiapp/minos/products

Many failures to delete files, like
prd/MINOS_ROOT/Linux2.4-GCC_3_4/trunk-opt/README/.svn/entries

time rm -r /grid/fermiapp/minos/minossoft
rm: cannot chdir from `/grid/fermiapp/minos/minossoft/packages/MCReweight/S09-09-18-R2-00/data' to `CVS': Permission denied

real    21m32.325s

Fixed this in the original AFS source

find /afs/fnal.gov/files/data/minos/d120/packages/MCReweight -type d ! -perm -100
/afs/fnal.gov/files/data/minos/d120/packages/MCReweight/S09-09-18-R2-00/data/CVS

find /afs/fnal.gov/files/data/minos/d120/packages/MCReweight -type d ! -perm -100 -exec chmod u+x {} \; -print
/afs/fnal.gov/files/data/minos/d120/packages/MCReweight/S09-09-18-R2-00/data/CVS

###############
# GRIDAPPSYNC #
###############

Removing the two root symlinks, now that we have a bit of space in d119

cd /afs/fnal.gov/files/code/e875/general/ups/prd/MINOS_ROOT/Linux2.4-GCC_3_4

MINOS27 > fs listquota .
Volume Name                   Quota      Used %Used   Partition
nb.minos.d119              50000000  47153757   94%<<       67%    <

time diff -r v5-18-00a /afs/fnal.gov/files/data/minos/d162/MINOS_ROOT/Linux2.4-GCC_3_4/v5-18-00a

real    8m10.141s
user    0m1.190s
sys     0m6.770s

MINOS27 > time diff -r v5-18-00a-opt /afs/fnal.gov/files/data/minos/d162/MINOS_ROOT/Linux2.4-GCC_3_4/v5-18-00a-opt

real    7m32.222s
user    0m1.166s
sys     0m5.783s

###############
# GRIDAPPSYNC #
###############

Saving space, to allow the ROOT version symlinks to be removed,
making the d119 product area self-contained.

Combined the GROWFS and GROWFSDIR archives.
Moved the GROWFS archive directories into bluearc for d119 with symlinks in AFS, to save space. These have grown to be over 2 GB in size in d119 ( not d120 ) cd /afs/fnal.gov/files/data/minos/d119 mv GROWFSDIR/* GROWFS/ rm GROWFSDIR cp -vax GROWFS /minos/data/mindata/GROWFS/d119 diff -r GROWFS /minos/data/mindata/GROWFS/d119 rm -r GROWFS ln -s /minos/data/mindata/GROWFS/d119 GROWFS cd /afs/fnal.gov/files/data/minos/d120 mv GROWFSDIR GROWFS ######## # JIRA # ######## List of mail From extracted from kreymer minos-data archive cd ~/mail/minosdata grep ^From .mix4* | grep '<' | grep '>' | head | cut -f 2 -d '<' | cut -f 1 -d '>' | sort -u grep ^From .mix4* | grep '<' | grep '>' | cut -f 2 -d '<' | cut -f 1 -d '>' | sort -u ___________________________________________________________________________ subscribed issues@fnal.gov to minos-data, around 14:45. ########## # DCACHE # ########## Date: Thu, 01 Oct 2009 19:27:27 +0000 (GMT) From: Arthur Kreymer To: jdejong@fnal.gov Cc: minos-data@fnal.gov Subject: jdejong jobs tying up the Raw Data pools Jeff - you have batch jobs running on the Minos Cluster which are tying up all the available dcache channels to the raw data pools. They are opening raw data files directly, rather than making local copies, then they seem to be sitting using almost no CPU. loon -bq reco_far_spill_data_base_dogwood0.C \ dcap://fndca1.fnal.gov:24136/pnfs/fnal.gov/usr/minos/fardet_data/2009-04/F00043075_0004.mdaq.root Please cancel these jobs, and arrange to make temporary local copies of the files before running loon. _____________________________________________________________________________ Date: Thu, 01 Oct 2009 21:09:52 +0100 From: Jeffrey de Jong Sorry, didn't think that such a small number of jobs would kill dcache. Jobs have now been killed. 
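A sketch of the "temporary local copy" pattern requested in the mail above. The helper names local_copy_path and run_on_local_copy are hypothetical, and the use of dccp as the copy client is an assumption; the loon invocation follows the form quoted in the mail. With DRYRUN set, the commands are only printed, which is the safe way to try this.

```shell
#!/bin/sh
# Sketch only : local_copy_path and run_on_local_copy are hypothetical helpers.

# Local file name for a dcap URL, inside the given scratch directory.
local_copy_path() {
    printf '%s/%s\n' "$2" "$(basename "$1")"
}

# Copy one file out of dCache, run loon on the copy, then remove the copy.
# Assumes dccp is the copy client. With DRYRUN set, commands are printed, not run.
run_on_local_copy() {
    url=$1 ; macro=$2 ; scratch=${3:-${TMPDIR:-/tmp}}
    copy=$(local_copy_path "${url}" "${scratch}")
    run=${DRYRUN:+echo}
    ${run} dccp "${url}" "${copy}" && ${run} loon -bq "${macro}" "${copy}"
    status=$?
    ${run} rm -f "${copy}"
    return ${status}
}

# Example ( dry run, nothing is copied ) :
# DRYRUN=1 run_on_local_copy dcap://fndca1.fnal.gov:24136/pnfs/fnal.gov/usr/minos/fardet_data/2009-04/F00043075_0004.mdaq.root reco_far_spill_data_base_dogwood0.C
```

The dcache channel is then held only for the duration of the copy, not for the whole life of the job.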
_____________________________________________________________________________

##########
# DCACHE #
##########

Tests after restart :

./dccptest - taking a very long time, more than 20 minutes so far.
This cannot be right. These test files should be on disk.
Due to an overload of the daq pool by connections from Minos Cluster, ID 13234

FTP transfers - nothing for fardet

Tape writes are badly backlogged, mainly due to database backup writes.

#######
# DAQ #
#######

DAQ far archiver was stuck, restarted it.

MINOS26 > dds -tr /pnfs/minos/fardet_data/2009-10
...
-rw-r--r-- 1 buckley e875 19033301 Oct 1 05:33 F00044823_0015.mdaq.root
-rw-r--r-- 1 buckley e875 42878659 Oct 1 06:34 F00044823_0016.mdaq.root

[minos@daqdcp ~]$ /home/minos/bin/init/archiver status
Archiver is running

[minos@daqdcp ~]$ ps xf
  PID TTY      STAT   TIME COMMAND
 2768 pts/0    Ss     0:00 -bash
 3186 pts/0    R+     0:00  \_ ps xf
14475 ?        S      1:08 python /home/minos/bin/archiver_krb.py
13363 ?
Z 0:00 \_ [kinit] [minos@daqdcp ~]$ /home/minos/bin/init/archiver restart Stopping archiver - try graceful exit first Killing archiver with USR1 Starting archiver [minos@daqdcp ~]$ date Thu Oct 1 14:46:43 CDT 2009 -rw-r--r-- 1 buckley e875 35703358 Oct 1 14:46 F00044823_0017.mdaq.root -rw-r--r-- 1 buckley e875 18952332 Oct 1 14:47 F00044823_0018.mdaq.root ########## # DCACHE # ########## Date: Thu, 01 Oct 2009 07:23:46 -0500 From: ssa-group@fnal.gov To: cdweb@fnal.gov, helpdesk@fnal.gov, oleynik@fnal.gov, stan@fnal.gov, wolbers@fnal.gov, crawdad@fnal.gov, white@fnal.gov, moibenko@fnal.gov, timur@fnal.gov, d0en-announce@fnal.gov, stk-users@fnal.gov, fermigrid-announce@fnal.gov, cdf_dh_help@fnal.gov, elog_www@b0www00.fnal.gov, enstore-admin@fnal.gov, dcache-admin@fnal.gov Subject: Announcement: Service scheduled outage for enstore, dCache on d0en, stken, cdfen for a duration of Scheduled for 4 hours The public enstore system has be drained and stopped for maintenance 7Am to 11AM. The D0-LTO4G1 (GCC) and the CDF-LTO4G1 and CDF-LTO3 (both GCC) Have been paused. ___________________________________________________________________________ Date: Thu, 01 Oct 2009 10:20:45 -0500 From: George Szmuksta The CDF-LTO3, CDF-LTO4G1 and D0-LTO4G1 libraries have been released and are running. GCC rack move and SL8500 passthru work is complete. Public enstore system DB is still in progress. ___________________________________________________________________________ Date: Thu, 01 Oct 2009 11:28:49 -0500 From: ssa-group@fnal.gov The public enstore system DB work is taking longer than expected. Estimate is 12PM. ___________________________________________________________________________ Date: Thu, 01 Oct 2009 13:03:55 -0500 From: George Szmuksta The public enstore system is up and and available for use. Sorry for the delay. ___________________________________________________________________________ ########## # DCACHE # ########## Scheduled down 07:00 to 11:00 due to PNFS maintenance. 
############ # STARTUP # ############ kreymer@minos26 15:20 crontab crontab.dat mindata@minos27 15:22 set nohup ; ./mcimport -l 9999 ALL & set nohup ; ./mcimport -l 9999 OVERLAY & ============================================================================= 2009 09 30 ============================================================================= ############ # MCIMPORT # ############ mcimport.20090930 Correction ACTION for non-ALL sleep loop. Added an exit ACTION Restarted OVERLAY and ALL with this version. cp -a AFSS/mcimport.20090930 . ln -sf mcimport.20090930 mcimport set nohup ; ./mcimport -l 9999 ALL & set nohup ; ./mcimport -l 9999 OVERLAY & ############ # SHUTDOWN # ############ kreymer@minos26 echo 'crontab -r' | at 06:00 job 31 at 2009-10-01 06:00 mindata@minos27 echo 'touch /minos/data/mcimport/STOP' | at 04:00 job 1 at 2009-10-01 04:00 echo 'rm /minos/data/mcimport/STOP' | at 08:00 job 2 at 2009-10-01 08:00 printf "`date -u` `hostname -s` ${INDIR} $$ STOP\n" > ${ACT} touch -d yesterday ${ACT} ######## # DATA # ######## Date: Wed, 30 Sep 2009 12:09:14 -0500 From: ssa-group@fnal.gov To: cdweb@fnal.gov, helpdesk@fnal.gov, oleynik@fnal.gov, stan@fnal.gov, wolbers@fnal.gov, crawdad@fnal.gov, white@fnal.gov, moibenko@fnal.gov, timur@fnal.gov, stk-users@fnal.gov, fermigrid-announce@fnal.gov, enstore-admin@fnal.gov, dcache-admin@fnal.gov Subject: Announcement: Service scheduled outage for enstore, dCache on stken for a duration of 4 hours This is a reminder for tomorrows downtime. There is a urgent need for a planned downtime for Public Enstore. We plan to do this on Oct 1, 2009 at 7am for 4 hours. This is the work that needs to be done. - Sun needs to repair an SL8500 in GCC. A pass through is broken and running in degraded mode. - Maintenance of Public Enstore databases, this will address database slowness. - Kernel updates to the Public Enstore servers. - Move 1 rack to make space for new SL8500. 
Public dcache will be also be down since the pnfs server is also involved in the maintenance. These are the library managers that will be down. CD-LTO4F1 CD-LTO4G1 CD-LTO3 CD-9940B 9940 CDF-LTO4G1 CDF-LTO3 D0-LTO4G1 We will bring up components as we are done to enable services as soon as we can. SSA staff ssa-group@fnal.gov ######### # FNALU # ######### INC000000010394 2009 09 13 request for exports INC000000009425 31 Aug 2009 request for mounts mgreaney set up the rest of the Bluearc mount on flxi06. ############### # GRIDAPPSYNC # ############### Remnant symlinks, as noted 2009 09 25 MINOS_EXTERN - two .so files MINOS_ROOT - two product trees PYTHIA6 - three inc areas gcc - two dead links to mengel home area ROOT cd /afs/fnal.gov/files/code/e875/general/ups/prd/MINOS_ROOT/Linux2.4-GCC_3_4 MINOS27 > fs listquota . Volume Name Quota Used %Used Partition nb.minos.d119 50000000 49692712 99%<< 67% < du -sm * 1 catman 3 db 3 dbsave 1 etc 263 GROWFS 2216 GROWFSDIR 12 man 45959 prd MINOS27 > du -sm prd/* | sort -n ... 
13 prd/xanim 16 prd/ghostscript 16 prd/kerberos 17 prd/fileinfo 17 prd/imagelibs 22 prd/blt 22 prd/tk 27 prd/imagemagick 27 prd/stdhep 27 prd/tcl 28 prd/srmcp 31 prd/g4photon 47 prd/clhepsource 51 prd/dcap 53 prd/oracle_instant_client 72 prd/mysql 95 prd/LOG4CPP 116 prd/perl 116 prd/sam 131 prd/java 144 prd/cern 157 prd/encp 158 prd/geant4source 159 prd/geant 218 prd/PYTHIA6 224 prd/python 343 prd/gdb 395 prd/LABYRINTH 453 prd/oracle_client 760 prd/gcc 900 prd/NEUGEN3 1102 prd/LHAPDF 1132 prd/clhep 1231 prd/sam_cpp_api 1419 prd/geant4 1737 prd/GENIE 4148 prd/MINOS_EXTERN 30298 prd/MINOS_ROOT PYTHIA6/Linux2.4-GCC_3_4/ v6_406_nopdf/inc/inc -> /afs/fnal.gov/files/code/e875/general/ups/prd/PYTHIA6/Linux2.4-GCC_3_4/v6_406/inc v6_406_nopdf/src/inc -> /afs/fnal.gov/files/code/e875/general/ups/prd/PYTHIA6/Linux2.4-GCC_3_4/v6_406/inc v6_409_nopdf/src/inc -> /afs/fnal.gov/files/code/e875/general/ups/prd/PYTHIA6/Linux2.4-GCC_3_4/v6_409/inc cd /afs/fnal.gov/files/code/e875/general/ups/prd/PYTHIA6/Linux2.4-GCC_3_4 for INC in v6_406_nopdf/inc/inc v6_406_nopdf/src/inc v6_409_nopdf/src/inc; do date VER=`echo ${INC} | cut -f 1-2 -d _` DIR=` dirname ${INC}` ls -l ${INC} ${DIR}/../../${VER}/inc echo rm ${INC} echo ln -s ../../${VER}/inc ${INC} rm ${INC} ln -s ../../${VER}/inc ${INC} done > /minos/scratch/kreymer/log/fixupslink/pythia.log 2>&1 Wed Sep 30 17:29:51 CDT 2009 GCC links - remove them ! 
cd /afs/fnal.gov/files/code/e875/general/ups/prd ls -l gcc/v3_4_3/Linux-2-4-2-3-2/tar/binutils.tar.gz rm gcc/v3_4_3/Linux-2-4-2-3-2/tar/binutils.tar.gz ls -l gcc/v3_4_3/Linux-2-4-2-3-2/tar/gcc.tar.gz rm gcc/v3_4_3/Linux-2-4-2-3-2/tar/gcc.tar.gz MINOS_EXTERN cd /afs/fnal.gov/files/code/e875/general/ups/prd/MINOS_EXTERN date ls -l Linux2.4-GCC_3_2/v03/lib/libmyodbc.so ln -sf libmyodbc3.so Linux2.4-GCC_3_2/v03/lib/libmyodbc.so ls -l Linux2.4-GCC_3_2/v03/lib/libmyodbc.so ls -l Linux2.4-GCC_4_1/v04/lib/libmyodbc.so ln -sf libmyodbc3.so Linux2.4-GCC_4_1/v04/lib/libmyodbc.so ls -l Linux2.4-GCC_4_1/v04/lib/libmyodbc.so Wed Sep 30 17:39:08 CDT 2009 MINOS27 > ls -l Linux2.4-GCC_3_2/v03/lib/libmyodbc.so lrwxr-xr-x 1 kreymer e875 97 Aug 14 2008 Linux2.4-GCC_3_2/v03/lib/libmyodbc.so -> /afs/fnal.gov/files/code/e875/general/ups/prd/MINOS_EXTERN/Linux2.4-GCC_3_2/v03/lib/libmyodbc3.so MINOS27 > ln -sf libmyodbc3.so Linux2.4-GCC_3_2/v03/lib/libmyodbc.so MINOS27 > ls -l Linux2.4-GCC_3_2/v03/lib/libmyodbc.so lrwxr-xr-x 1 kreymer e875 13 Sep 30 17:39 Linux2.4-GCC_3_2/v03/lib/libmyodbc.so -> libmyodbc3.so MINOS27 > MINOS27 > ls -l Linux2.4-GCC_4_1/v04/lib/libmyodbc.so lrwxr-xr-x 1 kreymer e875 97 Aug 14 2008 Linux2.4-GCC_4_1/v04/lib/libmyodbc.so -> /afs/fnal.gov/files/code/e875/general/ups/prd/MINOS_EXTERN/Linux2.4-GCC_4_1/v04/lib/libmyodbc3.so MINOS27 > ln -sf libmyodbc3.so Linux2.4-GCC_4_1/v04/lib/libmyodbc.so MINOS27 > ls -l Linux2.4-GCC_4_1/v04/lib/libmyodbc.so lrwxr-xr-x 1 kreymer e875 13 Sep 30 17:39 Linux2.4-GCC_4_1/v04/lib/libmyodbc.so -> libmyodbc3.so LABYRINTH LINKS TO RELEASE_DATA ( correct links only in fermiapp ) cd /afs/fnal.gov/files/data/minos/d119/prd/LABYRINTH find . -type l -exec ls -ld {} \; | grep /afs | wc -l 54 find . -type l -exec ls -ld {} \; | grep release_data | wc -l 53 find . 
-type l -exec ls -l {} \; | grep /afs | grep -v bfield ./Linux2.4-GCC_3_4/fava/fluxdata/gnumi_flux/v19 -> /afs/fnal.gov/files/data/minos/d87/gnumi/v19 Most are .dat files One is to d87/gnumi/v19 which links to d146/gnumi/v19 which partially links to /minos/data/flux/gnumi/v19 MINOS26 > du -sm /afs/fnal.gov/files/data/minos/d146/gnumi/v19 48385 /afs/fnal.gov/files/data/minos/d146/gnumi/v19 For the .dat files, LINKS=`find . -type l | grep bfield/bfld ` for LINK in ${LINKS} ; do FLIN=`basename ${LINK}` TARG=/minos/data/release_data/bmaps/${FLIN} ls -l ${TARG} done ############### # GRIDAPPSYNC # ############### minsoft@minos-mysql2 date ; time cp -vax /minos/scratch/products /grid/fermiapp/minos/products Tue Sep 29 17:14:03 CDT 2009 `/minos/scratch/products' -> `/grid/fermiapp/minos/products' ... cp: closing `/grid/fermiapp/minos/products/dbsafe/ximagetools/v4_0.version': Disk quota exceeded `/minos/scratch/products/dbsafe/xxx' -> `/grid/fermiapp/minos/products/dbsafe/xxx' real 157m9.062s user 0m40.874s sys 5m47.373s OOPS, this was done by mistake as kreymer.g020 MINOS-MYSQL2 > du -sm /grid/fermiapp/minos/products 37835 /grid/fermiapp/minos/products MINOS-MYSQL2 > time rm -r /grid/fermiapp/minos/products rm: remove write-protected regular file `/grid/fermiapp/minos/products/prd/LABYRINTH ARRRRRRRRRRRRRRRGH !!!!!!!! SCAN FOR FILES LACKING O:W MINOS-MYSQL2 > find /grid/fermiapp/minos/products ! -perm -200 | wc -l 388624 find /grid/fermiapp/minos/products ! 
-perm -200 | cut -f 1-8 -d / | sort -u /grid/fermiapp/minos/products/prd/cern/2004 /grid/fermiapp/minos/products/prd/encp/v3_7b /grid/fermiapp/minos/products/prd/encp/v3_7d /grid/fermiapp/minos/products/prd/gdb/v5_2_1 /grid/fermiapp/minos/products/prd/LABYRINTH/Linux2.4-GCC_3_4 /grid/fermiapp/minos/products/prd/MINOS_ROOT/Linux2.4-GCC_3_4 /grid/fermiapp/minos/products/prd/MINOS_ROOT/Linux2.4-GCC_4_1 /grid/fermiapp/minos/products/prd/oracle_instant_client/v11_1_0_6k /grid/fermiapp/minos/products/prd/perl/v5_8 Clearing out GFMP GFMP=/grid/fermiapp/minos/products date ; time find ${GFMP} ! -perm -200 -exec chmod u+w {} \; Wed Sep 30 11:15:00 CDT 2009 real 13m46.752s user 0m28.907s sys 3m50.460s MINOS-MYSQL2 > date ; time rm -r /grid/fermiapp/minos/products Wed Sep 30 11:33:59 CDT 2009 real 47m17.037s user 0m2.325s sys 1m34.032s Let's try to fix this at the source. PIN=/afs/fnal.gov/files/data/minos/d119 find ${PIN}/prd/LABYRINTH ! -perm -200 -exec ls -l {} \; -r-xr-xr-x 1 rhatcher e875 ... find ${PIN}/prd/LABYRINTH ! -perm -200 -exec chmod u+w {} \; Now a global scan : find ${PIN} ! -perm -200 | wc -l 389416 MINOS-MYSQL2 > date ; time find ${PIN} ! 
-perm -200 -exec chmod u+w {} \; Wed Sep 30 11:36:11 CDT 2009 real 28m44.590s user 0m26.017s sys 4m1.997s ############### # GRIDAPPSYNC # ############### FIXING SAM LINKS IN PRODUCTION kreymer@minos27 testing admin/sam/fixlinks write MINOS27 > mkdir /minos/scratch/kreymer/log/fixupslink run report ./fixlinks > /minos/scratch/kreymer/log/fixlinks/prelink.log 2>&1 do write with echo in place of write, preview ./fixupslink write > /minos/scratch/kreymer/log/fixupslink/prewrite.log 2>&1 make a safe copy of db MINOS27 > echo $PRODUCTS /afs/fnal.gov/files/code/e875/general/ups/db PSAVE=/afs/fnal.gov/files/code/e875/general/ups/dbsave date ; cp -ax ${PRODUCTS} ${PSAVE} Wed Sep 30 11:58:20 CDT 2009 do it, removing the echo's from the relink lines ./fixupslink write > /minos/scratch/kreymer/log/fixupslink/write.log 2>&1 Fresh login to minos26, ran sam test suite successfully. ============================================================================= 2009 09 29 ============================================================================= ############ # BRATEDAY # ############ Merged brateday_ark and brateday_afs now that I have a usable gnuplot available off of my desktop. Allow options for NODE SCRIPTS OUTPATH ######## # FARM # ######## /home/minfarm/ROUNTMP/LOG/saddreco/daikon_04/dogwood1/near_L250200N_i152.log DETECT near MCREL daikon_04 RELEASE dogwood1 MONTH L250200N_r2 BAIL 999999 Oops, no directories found like /pnfs/minos/mcout_data/dogwood1/near/daikon_04/*_data/L250200N_r2 See /minos/data2/mcimport/jcoelho/log/mcimport.log The saddreco log is badly mangled, probably a missing input again. ######### # ADMIN # ######### Date: Tue, 29 Sep 2009 15:36:32 +0100 From: Gwenaelle Lefeuvre To: Arthur Kreymer Subject: changing the shell at Fermilab Hello, I would like to change the shell I'm currently using at fermilab to bash. How could I do this, please? 
_________________________________________________________________________ Date: Tue, 29 Sep 2009 15:39:44 +0000 (GMT) From: Arthur Kreymer To: Gwenaelle Lefeuvre Cc: minos-admin@fnal.gov Subject: Re: changing the shell at Fermilab On Tue, 29 Sep 2009, Gwenaelle Lefeuvre wrote: > I would like to change the shell I'm currently using at fermilab to bash. > How could I do this, please? I have submitted a Service Desk ticket asking for this on the Minos Cluster. _________________________________________________________________________ Date: Tue, 29 Sep 2009 10:39:16 -0500 (CDT) From: Fermilab Service Desk Request INC000000011979 requested by you has been submitted. Status: New Summary: Linux Minos Notes: FEF primary - run2-sys@fnal.gov Please set the login shell for lefeuvre to /bin/bash on the Minos Cluster, per her email request . _________________________________________________________________________ Date: Tue, 29 Sep 2009 11:00:24 -0500 (CDT) Status: In Progress _________________________________________________________________________ Date: Tue, 29 Sep 2009 11:03:27 -0500 (CDT) Status: Completed The login shell for lefeuvre has been changed to /bin/bash. _________________________________________________________________________ ######## # DATA # ######## Date: Tue, 29 Sep 2009 15:42:42 +0100 From: Nicholas Devenish To: Arthur Kreymer Subject: Files with user minoscvs? Hi Art, Doing a cleanup, I have one directory under minos_data: /minos/data/users/nickd/old_gaincal That it seems I cannot remove either interactively or from the grid. The user is listed as 'minoscvs' for the problem files. How can I go about removing this directory (completely)? I don't remember where exactly I ran them. 
___________________________________________________________________ $ ls -l /minos/data/users/nickd/old_gaincal/July08-Sep08 total 12 drwxrwxrwx 169 nickd e875 12288 Sep 4 2008 runs $ ls -l /minos/data/users/nickd/old_gaincal/July08-Sep08/runs total 668 drwxr-xr-x 2 minoscvs e875 2048 Sep 4 2008 14450 drwxr-xr-x 2 minoscvs e875 2048 Sep 4 2008 14453 ... drwxr-xr-x 2 minoscvs e875 4096 Sep 4 2008 41885 drwxr-xr-x 2 minoscvs e875 2048 Sep 4 2008 41888 ___________________________________________________________________ Date: Tue, 29 Sep 2009 15:13:04 +0000 (GMT) From: Arthur Kreymer To: Nicholas Devenish Cc: minos-data@fnal.gov Subject: Re: Files with user minoscvs? On Tue, 29 Sep 2009, Nicholas Devenish wrote: > Doing a cleanup, I have one directory under minos_data: > /minos/data/users/nickd/old_gaincal These files seem to be dated a year ago, at which time I think that some Fermigrid jobs run under minoscvs. The simplest thing would be for me to remove these, as I have access to minoscvs. Shall I do this ? rm -r /minos/data/users/nickd/old_gaincal/July08-Sep08/runs ___________________________________________________________________ Date: Tue, 29 Sep 2009 16:15:11 +0100 From: Nicholas Devenish Yes, please run that command: ___________________________________________________________________ Date: Tue, 29 Sep 2009 15:26:13 +0000 (GMT) From: Arthur Kreymer Done. ___________________________________________________________________ ########### # BLUEARC # ########### Date: Tue, 29 Sep 2009 15:30:20 +0100 (BST) From: med@hep.ucl.ac.uk To: minos_software_discussion@fnal.gov Cc: young_minos@fnal.gov Subject: /minos/data/ full Hi all, I wasn't quite sure what email list to send this to but... it seems that minos-nas-0.fnal.gov:/minos/data is full up and I'm unable to proceed with various bits of file processing for the next round of beam fits. I wondered if anyone had a load of un-wanted files at Fermilab that could be safely removed? 
I would switch to /minos/scratch/ temporarily but my quota for that disc is very close to full also... _________________________________________________________________________ /minos/data/users/whitehd/GriffinTemp/NearMRCCDataFidL010185N/Run3 /minos/data/users/rodriges/antp_mrcc/near/data/2006-01/ /minos/data/users/rodriges/udst_daikon07 ######## # JIRA # ######## Date: Tue, 29 Sep 2009 13:50:31 +0000 (GMT) From: Arthur Kreymer To: cd-rex@fnal.gov Cc: sam-design@fnal.gov Subject: JIRA down ? As of around 08:40 Tuesday, the JIRA Node fermilab.go2group.com responds to pings, but is not serving the JIRA web pages, such as http://fermilab.go2group.com/browse/MINOSDATA and http://fermilab.go2group.com/secure/Administrators.jspa __________________________________________________________________ Date: Tue, 29 Sep 2009 09:05:52 -0500 From: Margaret Votava i can get to them now. is it still down for you? cdf reported a problem earlier too. __________________________________________________________________ Date: Tue, 29 Sep 2009 09:19:00 -0500 From: Michael Diesburg I think it's up. I sent in a ticket this morning just before 08:00 but didn't get a response initially. The response just came back at 09:04. __________________________________________________________________ Date: Tue, 29 Sep 2009 14:21:51 +0000 (GMT) From: Arthur Kreymer To: cd-rex@fnal.gov Cc: sam-design@fnal.gov Subject: Re: JIRA down ? The go2group support people have corrected problems with our server. Service was restored sometime around 09:00. I called their general contact number 877 442 4669. In future, we can call their office directly at 410 879 8102. 
__________________________________________________________________ ( I spoke with James at go2group ) ============================================================================= 2009 09 28 ============================================================================= ######### # ADMIN # ######### Summarizing OPS meeting minutes, MIS and databases seem to have rolled into ECS - Enterprise and Collaborative Systems SNS - Storage - is reported separately from CSI CIO - Chief Information Officer ########### # BLUEARC # ########### Date: Mon, 28 Sep 2009 20:04:35 +0000 (GMT) From: Arthur Kreymer To: minos-data@fnal.gov Cc: qzli@fnal.gov, lyon@fnal.gov, lammel@fnal.gov, romero@fnal.gov Subject: Re: D0 Bluearc monitoring after separation On Fri, 25 Sep 2009, Arthur Kreymer wrote: And I badly mangled the association of disks with data arrays. The initial association of projects with HDS/Sata was reversed. It was straight in my head, but not in the email. The corrected email follows : I have added logging of data rates ( MBytes/sec ) of two D0 Bluearc pools The data and plots labeled d0mino05 monitor /prj_root/5012 ( HDS ) The data and plots labeled d0mino06 monitor /prj_root/3024 ( Satabeast ) Data are under /grid/data/monitor/rate/2009/09/* Some plots are under http://www-numi.fnal.gov/computing/dh/bluearc/rates The HDS rates are excellent compared to /grid/data, see http://www-numi.fnal.gov/computing/dh/bluearc/rates/d0mino05/d0mino05_20090924.png vs http://www-numi.fnal.gov/computing/dh/bluearc/rates/minos27/minos27_20090924.png The project 3024 (Satabeast) plots are all over the map, sometimes wonderful like http://www-numi.fnal.gov/computing/dh/bluearc/rates/d0mino06/d0mino06_20090922.png sometimes pretty bad like today, http://www-numi.fnal.gov/computing/dh/bluearc/rates/d0mino06/d0mino06_20090925.png sometimes horrible like Wed and Thu http://www-numi.fnal.gov/computing/dh/bluearc/rates/d0mino06/d0mino06_20090923.png 
http://www-numi.fnal.gov/computing/dh/bluearc/rates/d0mino06/d0mino06_20090924.png ########## # ADMIN # ########## Servicedesk closed out Reference No.: INC000000008226 Summary: Minos Cluster sluggish Sat Aug 15 I finally got a response from networking. They suggested to put fnsrv0 first in resolv.conf, as that server is at the FCC. _______________________________________________________________________ fnsrv0.fnal.gov has address 131.225.8.120 ( not implemented yet ) ############ # MCIMPORT # ############ srmcp has been stuck since Saturday : $ stat /minos/data2/mcimport/OVERLAY/log/mcimport.log Modify: 2009-09-26 17:04:45.584000000 -0500 $ cat /minos/data2/mcimport/OVERLAY/ACTION Sat Sep 26 22:04:45 UTC 2009 minos27 OVERLAY 2982 MCINWRITE n1303_L010000N_D07_r3i326.reroot.root n13035035_0019_L010000N_D07_r3i326.reroot.root $ ls -l /pnfs/minos/mcin_data/near/daikon_07/L010000N_r3i326/503/n13035035_0019_L010000N_D07_r3i326.reroot.root ls: /pnfs/minos/mcin_data/near/daikon_07/L010000N_r3i326/503/n13035035_0019_L010000N_D07_r3i326.reroot.root: No such file or directory $ ps xf 6116 ? Sl 0:03 \_ java -cp /minos/scratch/app/OSG1/srm-client-fermi/lib/srm_client.jar:/minos/scratch/app/OSG1/srm-client-fermi/lib/srm.jar:/m http://fndca3a.fnal.gov/dcache/DOORS.html 7603 GFTP? minos27.fnal.gov active ? ? ? Killed the process date ; kill 6116 Mon Sep 28 11:38:18 CDT 2009 Copying resumed. ########## # DCACHE # ########## Date: Mon, 28 Sep 2009 00:03:28 -0500 (CDT) Request INC000000011850 requested by you has been submitted. Status: New Summary: FNDCA pool 26a-1 Pool Listing empty Notes: SSA primary - dcache-admin@fnal.gov The Pool Directory Listings for w-raw-minos-stkendca26a-1.files went empty sometime after Sep 4 and before Sep 10. See http://fndca3a.fnal.gov/dcache/files/w-raw-minos-stkendca26a-1.files http://fndca3a.fnal.gov/dcache/files/old/ Please bring this listing up to date, so that I can stage Minos raw data files that may have fallen off disk. 
______________________________________________________________________ Date: Mon, 28 Sep 2009 14:25:24 +0000 (GMT) From: Arthur Kreymer In addition to stkendca26a, the Pool Listings for stkendca28a also seem to be empty recently, ______________________________________________________________________ Date: Mon, 28 Sep 2009 14:04:49 -0500 From: George Szmuksta I made a bug report from this and set it to the developers. ______________________________________________________________________ Date: Mon, 28 Sep 2009 16:39:31 -0500 From: Timur Perelmutov I amended the bugzilla ticket with the detailed explanation of why this monitoring info is not being updated: kerberos identity enstore/cd/fndca2a.fnal.gov@FNAL.GOV is not authorized to access pool nodes. I can not change the .k5login files direcly as they will be overwritten by CFEngine, so I leave it to SSA Primary to resolve. ______________________________________________________________________ Date: Tue, 29 Sep 2009 09:45:48 -0500 From: George Szmuksta The principal was added to cfengine for distribution to the pool nodes. It will take about an hour or so. Try later this morning. ______________________________________________________________________ Date: Wed, 30 Sep 2009 14:43:54 +0000 (GMT) From: Arthur Kreymer Today's pool directory listings for all 26a pools are still empty. ______________________________________________________________________ Date: Wed, 30 Sep 2009 09:54:19 -0500 From: George Szmuksta I will notify the developers. ______________________________________________________________________ Date: Wed, 30 Sep 2009 10:03:26 -0500 From: Tim Messer This change was made yesterday but the pool listings remain empty. I have confirmed that the .k5logins were updated appropriately: ______________________________________________________________________ Date: Wed, 30 Sep 2009 11:15:40 -0500 From: Timur Perelmutov Thank you, I will have a look of what is going on now. 
______________________________________________________________________ Date: Wed, 30 Sep 2009 15:30:05 -0500 From: Timur Perelmutov The public ssh key of dcache admin node was not in the known_hosts of the enstore account on the pool in ~enstore/.ssh/known_hosts on stkendca26a. ______________________________________________________________________ Date: Thu, 1 Oct 2009 21:27:32 +0000 (GMT) From: Arthur Kreymer The stkendca26a pool listings are present again, as of Oct 1, Thanks !!! This ticket can be closed. Some other pool listings still seem to be empty, but these are not causing me problems at the moment . r-minos-stkendca28a-2.files r-pub-stkendca23a-2.files ______________________________________________________________________ ______________________________________________________________________ ######## # LOCK # ######## Found a stray file from my touching loop, a * file was created when the LOCKS directory went empty : $ ls /grid/data/e875/LOCK/LOCKS -l total 0 -rw-r--r-- 1 mindata e875 0 Sep 27 22:18 * -rw-rw-r-- 1 rodriges e875 0 Sep 27 22:18 20090927.13:19:45.0.minos20.13974.rodriges.rodriges $ ls -l /grid/data/e875/LOCK/LOCKS/\* -rw-r--r-- 1 mindata e875 0 Sep 27 22:18 /grid/data/e875/LOCK/LOCKS/* Removed the stray file $ rm -i /grid/data/e875/LOCK/LOCKS/\* rm: remove regular empty file `/grid/data/e875/LOCK/LOCKS/*'? y Updated the lock script $ echo ${NL} /afs/fnal.gov/files/expwww/numi/html/computing/admin/bluearc ${NL}/lock status ; # cp ${NL}/lock lock $ ${NL}/lock status ; cp ${NL}/lock lock LOCK STATUS Sun Sep 27 22:33:44 CDT 2009 LOCKS 0 of 20 ( 1 stale ) QUEUE 0 ( 2 stale) xbhuang locks started queuing up . Not sure why, set PERF to large number for safety the file was absent, which should have been OK. The queue cleared up right away. 
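Note on the stray * file above : when LOCKS is empty, the shell leaves the
unmatched pattern LOCKS/* as a literal string, and touch then creates a file
with that name. A minimal sketch of a guarded refresh follows. The helper name
touch_locks is made up for illustration, and this is not the code in the lock script.

```shell
# Sketch only: refresh the timestamps of existing lock files,
# without ever creating a file literally named '*'.
# touch_locks is a hypothetical helper name.
touch_locks() (
    dir="$1"
    for f in "${dir}"/* ; do
        # With an empty directory the pattern stays literal;
        # the -e test filters that case out.
        if [ -e "${f}" ] ; then
            touch "${f}"
        fi
    done
)

# Possible use in the keep-alive loop :
# while true ; do touch_locks /grid/data/e875/LOCK/LOCKS ; sleep 500 ; done
```

In bash, 'shopt -s nullglob' would be another way to get the same effect.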
time /grid/fermiapp/minos/scripts/lock clean ; date
real 1m49.287s
user 0m8.000s
sys 0m23.639s

$ date
Sun Sep 27 22:38:58 CDT 2009

$ time /grid/fermiapp/minos/scripts/lock clean ; date
real 0m0.553s
user 0m0.101s
sys 0m0.288s

###########
# BLUEARC #
###########

Added monitoring of two GPFarm worker nodes, and one CDF worker.

############
# BLUWATCH #
############

Verifying content of bluearc/bluwatch

Checked out admin/bluearc

cvs update -r ${REV} bluwatch ; ls -l bluwatch
REL=
diff bluwatch ~kreymer/minos/scripts/bluwatch.${REL}

REV REL
1.1 20080703
< sleep 60 ; done # FILS
---
> sleep 58 ; done # FILS
1.2 ( 20080521 2068 bytes )
1.3 ( 20080521 3070 bytes )
1.4 ( 20080707 3288 bytes )
1.5 ( 20080724 3729 bytes )
1.6 ( 20081114 3951 bytes )
1.7 20090410 4073
1.8 20090520 5506
1.9 ( 20090601 6102 bytes )

rm bluwatch.20080703 bluwatch.20090410 bluwatch.20090520

REL=20090619 1.10 2009 06 19 added test read write
REL=20090621 1.11 2009 06 21 added sleep and log options
REL=20090624 1.12 2009 06 24 added base and initial dir options
REL=20090820 1.13 2009 08 20 cleaned up initial dir code
REL=20090831 1.14 2009 08 31 allow -t running without AFS

diff ~kreymer/minos/scripts/bluwatch.${REL} bluwatch
cp -a ~kreymer/minos/scripts/bluwatch.${REL} bluwatch
cvs commit -m 'cvs ' bluwatch

Looks OK,
rm bluwatch.20090619 bluwatch.20090621 bluwatch.20090624

Proceeding with the new D0 bleeding edge version, which can log to non-AFS areas.

MINOS26 > scp d0mino05:bluwatch.20090922 bluwatch.20090922
REL=20090922 1.15 2009 09 22

=============================================================================
2009 09 25
=============================================================================

############
# MCIMPORT #
############

mindata@minos27

Copied .bashrc from minos26, symlinked to .profile
Disabled the stty of erase to backspace via tput -s back.
This setup had not been effective on 26, why on 27 ?
###########
# BLUEARC #
###########

Date: Fri, 25 Sep 2009 23:18:01 +0000 (GMT)
From: Arthur Kreymer
To: minos-data@fnal.gov
Cc: qzli@fnal.gov, lyon@fnal.gov, lammel@fnal.gov, romero@fnal.gov
Subject: D0 Bluearc monitoring after separation

I have added logging of data rates ( MBytes/sec ) of two D0 Bluearc pools

The data and plots labeled d0mino05 monitor /prj_root/5012 ( Satabeast ? )
The data and plots labeled d0mino06 monitor /prj_root/3024 ( HDS )

Data are under /grid/data/monitor/rate/2009/09/*

Some plots are under
http://www-numi.fnal.gov/computing/dh/bluearc/rates

The HDS rates are excellent compared to /grid/data, see
http://www-numi.fnal.gov/computing/dh/bluearc/rates/d0mino05/d0mino05_20090924.png
vs
http://www-numi.fnal.gov/computing/dh/bluearc/rates/minos27/minos27_20090924.png

The project 5012 (Satabeast?) plots are all over the map,
sometimes wonderful like
http://www-numi.fnal.gov/computing/dh/bluearc/rates/d0mino06/d0mino06_20090922.png
sometimes pretty bad like today,
http://www-numi.fnal.gov/computing/dh/bluearc/rates/d0mino06/d0mino06_20090925.png
sometimes horrible like Wed and Thu
http://www-numi.fnal.gov/computing/dh/bluearc/rates/d0mino06/d0mino06_20090923.png
http://www-numi.fnal.gov/computing/dh/bluearc/rates/d0mino06/d0mino06_20090924.png

( N.B. the HDS / Satabeast association in this email is reversed,
see the corrected email dated 2009 09 28 )

###########
# BLUEARC #
###########

Large queue, up to 356.
PERF not terribly bad, try boosting limit to 40, around 17:38
perf not horrible, but perhaps under 10, dropped limit to 20 around 17:43

????????????????? How did 125 locks get taken ?????

A Tsunami hit at 17:47, from many fnpc nodes, like

-rw-rw-r-- 1 43021 e875 0 Sep 25 17:47 20090925.22:47:54.1002.fnpc226.32184.minosana.wingmc
...
-rw-rw-r-- 1 43021 e875 0 Sep 25 17:47 20090925.22:47:54.968.fnpc247.9339.minosana.wingmc -rw-rw-r-- 1 43021 e875 0 Sep 25 17:47 20090925.22:47:55.3234.fnpc166.28068.minosana.wingmc -rw-rw-r-- 1 43021 e875 0 Sep 25 17:47 20090925.22:47:55.3267.fnpc166.27936.minosana.wingmc -rw-rw-r-- 1 43021 e875 0 Sep 25 17:47 20090925.22:47:55.4764.fnpc157.10110.minosana.wingmc ${NL}/lock status LOCK STATUS Fri Sep 25 17:48:44 CDT 2009 LOCKS 125 of 20 ( 9 stale ) wingmc 125 QUEUE 195 ( 0 stale) pittam 1 rbpatter 114 rodriges 14 rtoner 1 wingmc 65 17:57 - dropped rate from 10 to 2 per second polling, this may help avoid pileups The locks are expiring faster than the files can be copied. We are headed for a possible 400 file overload. Let's keep them all alive touch /grid/data/e875/LOCK/LOCKS/* set nohup { while true ; do touch /grid/data/e875/LOCK/LOCKS/* ; sleep 500 ; done } & LOCK STATUS Fri Sep 25 18:14:36 CDT 2009 LOCKS 153 of 20 ( 0 stale ) Locks have recovered, removed the touch subprocess LOCK STATUS Fri Sep 25 19:02:46 CDT 2009 LOCKS 23 of 20 ( 0 stale ) Nope, they are not keeping up. There are 9 truly stale locks. Need to goose the locks again, till we can move to the new lock script. ( done on 2009 09 28 around 22:33 ) ########## # DCACHE # ########## Date: Fri, 25 Sep 2009 17:26:33 -0500 From: ssa-group@fnal.gov Subject: Announcement: Service scheduled outage for enstore, dCache on d0en, stken, cdfen for a duration of 4 hours Public Enstore and Public Dcache outage There is a urgent need for a planned downtime for Public Enstore. We plan to do this on Oct 1, 2009 at 7am for 4 hours. This is the work that needs to be done. - Sun needs to repair an SL8500 in GCC. A pass through is broken and running in degraded mode. - Maintenance of Public Enstore databases, this will address database slowness. - Kernel updates to the Public Enstore servers. Public dcache will be also be down since the pnfs server is also involved in the maintenance. 
These are the library managers that will be down. CD-LTO4F1 CD-LTO4G1 CD-LTO3 CD-9940B 9940 CDF-LTO4G1 CDF-LTO3 D0-LTO4G1 We will bring up components as we are done to enable services as soon as we can. ########### # BLUEARC # ########### Restarted D0 logging, writing to /grid/data/monitor so that I can plot these and put them on the web. One hopes that the logs can be appended, now that D0 projects are out of /grid/data cp -vax bluwatch /grid/data/monitor d0mino05 set nohup ./bluwatch.20090922 -r -b /prj_root/5012/bluwatch/data -d 6 \ -l /grid/data/monitor & d0mino06 set nohup ./bluwatch.20090922 -r -b /prj_root/3024/bluwatch/data -d 1 \ -l /grid/data/monitor & Plotted these manually, with an updated brate MINOS26 > ./brate d0mino06 20090922 out "" /grid/data/monitor MINOS26 > ./brate d0mino06 20090923 out "" /grid/data/monitor MINOS26 > ./brate d0mino06 20090924 out "" /grid/data/monitor MINOS26 > ./brate d0mino06 20090925 out "" /grid/data/monitor MINOS26 > ./brate d0mino05 20090925 out "" /grid/data/monitor MINOS26 > ./brate d0mino05 20090924 out "" /grid/data/monitor MINOS26 > ./brate d0mino05 20090923 out "" /grid/data/monitor Plots are under http://www-numi.fnal.gov/computing/dh/bluearc/rates/ ########## # CONDOR # ########## ACCESS TO CONDOR ACCOUNT Date: Fri, 25 Sep 2009 11:32:00 -0500 (CDT) From: Fermilab Service Desk INC000000011782 Summary: condor@minos25 shell = /sbin/nologin Requester: Ryan Patterson Notes: To better administer Condor at MINOS, the "condor" .k5login on minos25 now contains rbpatter and kreymer. However, the login shell for that account is /sbin/nologin. Can someone change the condor account's shell to /bin/bash? Thanks! _____________________________________________________________ Date: Sun, 27 Sep 2009 11:43:31 -0700 (PDT) From: Ryan B. Patterson FYI - the condor account can now be logged into. 
_____________________________________________________________ ########## # CONDOR # ########## Bluearc slowed down again at 08:33 Fri Sep 25 08:27:24 CDT 2009 9/file1797 21 Fri Sep 25 08:28:25 CDT 2009 9/file1798 21 Fri Sep 25 08:29:25 CDT 2009 9/file1799 28 Fri Sep 25 08:30:26 CDT 2009 9/file1800 34 Fri Sep 25 08:32:26 CDT 2009 0/file1801 37 Fri Sep 25 08:33:27 CDT 2009 0/file1802 13 Fri Sep 25 08:34:29 CDT 2009 0/file1803 6 Fri Sep 25 08:35:31 CDT 2009 0/file1804 6 Fri Sep 25 08:36:32 CDT 2009 0/file1805 18 Fri Sep 25 08:37:37 CDT 2009 0/file1806 2 Fri Sep 25 08:38:39 CDT 2009 0/file1807 5 Fri Sep 25 08:39:40 CDT 2009 0/file1808 9 Fri Sep 25 08:40:48 CDT 2009 0/file1809 1 -- Submitter: minos25.fnal.gov : <131.225.193.25:63857> : minos25.fnal.gov ID OWNER SUBMITTED RUN_TIME HOST(S) 436768.0 rodriges 9/25 08:22 0+01:04:16 glidein_6620@fnpc342.fnal.gov 436769.0 rodriges 9/25 08:22 0+01:04:14 glidein_30221@fnpc340.fnal.gov 436771.0 rodriges 9/25 08:22 0+01:03:53 glidein_23350@fnpc339.fnal.gov ... These jobs are reading directly from /minos/data, and getting very little CPU ( I/O bound ) -bash-3.00$ ps -p 21024 -f UID PID PPID C STIME TTY TIME CMD minosana 21024 20479 5 08:32 ? 
00:04:13 loon -nbq makeCondensedNtupleNC.C("/minos/data/mcout_data/daikon_07/L010185N_r1i213/near/dogwood1//mrnt_data/504/n13035049_0000_L010185N_D07_r1i21 MINOS25 > date ; minos_q Fri Sep 25 09:54:20 CDT 2009 -- Summary of minos25.fnal.gov : <131.225.193.25:63857> : minos25.fnal.gov OWNER RUN IDLE HELD OLDEST_JOB jcoelho 2 0 0 9/24 10:45 0+19:30:08 condor_cc_correcti med 15 0 0 9/25 09:27 0+00:16:27 loon_20090925_0927 pittam 1 0 0 9/24 04:54 1+04:32:50 gen_antpDev_200909 rbpatter 105 0 0 9/25 08:08 0+01:46:13 condor_dagman rodriges 219 0 0 9/25 08:22 0+01:31:48 make_antp.sh_20090 rtoner 3 0 0 9/24 10:44 0+22:53:28 run_reco.sh_200909 rubin 14 0 0 9/23 16:05 1+16:34:59 analyze_driver.gli xbhuang 4 0 0 9/24 20:52 0+13:01:37 NueJob_Griffin_Pro zisvan 1 0 0 9/24 12:24 0+21:22:21 dst_sam_20090924_1 TOTALS 364 0 0 Farm glideins: R=478 I=41 H=0 _________________________________________________________________________ Date: Fri, 25 Sep 2009 14:53:58 +0000 (GMT) From: Arthur Kreymer To: rodriges@fnal.gov Cc: rbpatter@fnal.gov, minos-admin@fnal.gov Subject: Bluearc overload from rodriges Grid jobs Bluearc performance declined sharply at 08:33 this morning. This is when about 200 rodriges Fermigrid jobs started running. These jobs seem to be accessing /minos/data directly, and getting very little CPU on the worker nodes. 
/bin/bash /minos/scratch/rodriges/ncProcessing_forReal/mrcc_macros/make_antp.sh 13035049 r1i213 /minos/data/users/rodriges/antp_mrcc/near/mc/L010z185i_r1i213 loon -nbq makeCondensedNtupleNC.C("/minos/data/mcout_data/daikon_07/L010185N_r1i213/near/dogwood1//mrnt_data/504/n130 35049_0000_L010185N_D07_r1i213.mrnt.dogwood1.0.root", "/minos/data/mcout_data/daikon_07/L010185N_r1i213/near/dogwood1//sntp_data/504/n13035049_0000_L010185N_D07_ r1i213.sntp.dogwood1.0.root", "/local/stage1/condor/execute/dir_25404/glide_n25441/execute/dir_19972/no_xfer/n13035049_0000_L010185N_D07_ r1i213.antp_mrcc.dogwood1.0.root") I will take the liberty of stopping these jobs, as they are getting no useful work done. Phil - you must use the /grid/fermiapp/minos/cpn script to copy your input and output files to/from local disk on the worker nodes. ________________________________________________________________________ Date: Fri, 25 Sep 2009 16:18:25 +0000 (GMT) From: Arthur Kreymer The jobs were removed from Condor as of 10:10 CDT. Data rates immediately recovered : http://www-numi.fnal.gov/computing/dh/bluwatch/rate/2009/09/25/minos27.txt Fri Sep 25 10:06:13 CDT 2009 0/file1891 3 Fri Sep 25 10:07:15 CDT 2009 0/file1892 6 Fri Sep 25 10:08:16 CDT 2009 0/file1893 9 Fri Sep 25 10:09:18 CDT 2009 0/file1894 12 Fri Sep 25 10:10:18 CDT 2009 0/file1895 24 Fri Sep 25 10:11:19 CDT 2009 0/file1896 23 Your jobs seem to have taken out 'cpn' locks when removed from Condor, leaving a large number of stale locks and queue entries. The usual cleanup script has removed the stale entries. We seem to be in good shape again ! _________________________________________________________________________ Date: Fri, 25 Sep 2009 16:40:28 +0100 From: Philip Rodrigues I see what happened. I'll fix up the script I was running and resubmit. Sorry for the inconvenience. 
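For reference, the copy-in / copy-out pattern asked for in the email above might
look roughly like the sketch below. This is an illustration under assumptions,
not the actual make_antp.sh fix. It assumes that cpn takes 'source destination'
arguments like cp ( not confirmed in this log ), and that the job command reads
the local file named as its last argument and writes to stdout. The names
run_staged and COPY are made up.

```shell
# Sketch only: stage input to local disk, work there, copy the result back.
# COPY defaults to plain cp so the sketch runs anywhere. On the grid it would
# be set to /grid/fermiapp/minos/cpn, ASSUMING cpn accepts cp-style arguments.
COPY=${COPY:-cp}

# run_staged INPUT OUTDIR COMMAND [ARGS...]   ( hypothetical helper )
run_staged() (
    input="$1" ; outdir="$2" ; shift 2
    name=$(basename "${input}")
    work=$(mktemp -d) || exit 1              # local scratch area on the worker
    trap 'rm -rf "${work}"' EXIT             # clean up the scratch area on exit
    ${COPY} "${input}" "${work}/${name}" || exit 1
    # Run the job against the local copy only, never against Bluearc directly
    "$@" "${work}/${name}" > "${work}/${name}.out" || exit 1
    ${COPY} "${work}/${name}.out" "${outdir}/${name}.out"
)
```

The point of the pattern is that Bluearc sees one sequential read and one
sequential write per job, instead of sustained random I/O from hundreds of jobs.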
_________________________________________________________________

REMOVING THE rodriges JOBS

RJOBS=`condor_q rodriges | grep rodriges | cut -f 1 -d ' '`
date ; for JOB in ${RJOBS} ; do condor_rm ${JOB} ; sleep 2 ; done
date

MINOS25 > date ; for JOB in ${RJOBS} ; do condor_rm ${JOB} ; sleep 2 ; done
Fri Sep 25 10:02:36 CDT 2009
Job 436768.0 marked for removal
Job 436769.0 marked for removal
...
Job 437069.0 marked for removal
Job 437071.0 marked for removal
Job 437073.0 marked for removal
MINOS25 > date
Fri Sep 25 10:10:20 CDT 2009

Data rates have picked up.

http://www-numi.fnal.gov/computing/dh/bluwatch/rate/2009/09/25/minos27.txt

Fri Sep 25 10:06:13 CDT 2009 0/file1891 3
Fri Sep 25 10:07:15 CDT 2009 0/file1892 6
Fri Sep 25 10:08:16 CDT 2009 0/file1893 9
Fri Sep 25 10:09:18 CDT 2009 0/file1894 12
Fri Sep 25 10:10:18 CDT 2009 0/file1895 24
Fri Sep 25 10:11:19 CDT 2009 0/file1896 23
_________________________________________________________________

Let's watch the queues :

MINOS26 > ${NL}/lock.new status
LOCK STATUS Fri Sep 25 08:19:45 CDT 2009
LOCKS 1 of 10 ( 44 stale )
QUEUE 1 ( 232 stale)

MINOS26 > ${NL}/lock.new status
LOCK STATUS Fri Sep 25 08:19:53 CDT 2009
LOCKS 1 of 10 ( 44 stale )
rbpatter 1
QUEUE 1 ( 232 stale)

$ ${NL}/lock.new status
LOCK STATUS Fri Sep 25 10:15:35 CDT 2009
LOCKS 19 of 10 ( 44 stale )
rbpatter 9
rodriges 10
QUEUE 226 ( 232 stale)
rbpatter 94
rodriges 132

$ ls -ltr /grid/data/e875/LOCK/QUEUE/*rodriges
-rw-rw-r-- 1 43021 e875 0 Sep 25 10:03 /grid/data/e875/LOCK/QUEUE/20090925.15:03:10.fnpc371.16123.minosana.rodriges
-rw-rw-r-- 1 43021 e875 0 Sep 25 10:03 /grid/data/e875/LOCK/QUEUE/20090925.15:03:19.fnpc366.11148.minosana.rodriges
...
-rw-rw-r-- 1 43021 e875 0 Sep 25 10:10 /grid/data/e875/LOCK/QUEUE/20090925.15:10:20.fnpc203.14221.minosana.rodriges

10:33
Changed stale briefly from 30 to 5, to flush the locks.
Added staleq to lock.new, reduced from 600 to 10 temporarily.
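A minimal sketch of how stale entries could be counted by hand, to cross-check
the '( N stale )' figures above. ASSUMPTIONS : staleness is judged by file
modification time, and the threshold is in minutes. Neither is confirmed in
this log, and lock.new itself is not shown. count_stale is a made-up helper name.

```shell
# Sketch only: count files in a LOCKS or QUEUE directory that were last
# modified more than a given number of minutes ago.
# count_stale DIR MINUTES   ( hypothetical helper, not part of lock.new )
# -maxdepth and -mmin are find extensions, available in GNU find.
count_stale() (
    dir="$1" ; minutes="$2"
    find "${dir}" -maxdepth 1 -type f -mmin +"${minutes}" | wc -l | tr -d ' '
)

# Possible use :
# count_stale /grid/data/e875/LOCK/QUEUE 600
```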
MINOS26 > ls /grid/data/e875/LOCK/LOG | wc -l
63517

10:38
$ ${NL}/lock.new clean

11:14
LOCK STATUS Fri Sep 25 11:14:39 CDT 2009
LOCKS 1 of 10 ( 1 stale )
QUEUE 1 ( 1 stale)

staleq - restored to 600
stale - restored to 30

Jobs are running, including new corrected rodriges jobs

LOCK STATUS Fri Sep 25 11:28:56 CDT 2009
rodriges 10
QUEUE 28 ( 0 stale)
rbpatter 26
rodriges 2

11:29 - popped limit to 30

########
# LOCK #
########

No queue at 8:15 this morning.
Spent a lot of time clearing out the queues.
Now back to testing the LOAD limit.

More philosophy -
bluwatch should just log rates, not write to the LOCK area
lock cannot afford to probe the rate data files.
Need a third process, to calculate a PERF performance metric.
The PERF number should be logged and plotted, as well as having a current value.

It seems natural, though, to do this in bluwatch, which runs every minute,
and can trivially keep running averages or minima, or whatever we need.

So add a -p PERF option to bluwatch.

PERF - performance metric ( 5 minute minimum MB/sec )
perf - performance requirement for taking a lock ( initially 3 )

lock.new seems to be working correctly, with faked PERF content
Tested stale PERF, low PERF, lack of PERF.

###############
# GRIDAPPSYNC #
###############

Proceeding to test the fixlink process, putting this in admin/sam/fixupslink

export PRODUCTS=/minos/scratch/products/db
FLINK=/afs/fnal.gov/files/expwww/numi/html/computing/admin/sam/fixupslink

Test each form of fixed link

ls -l samgrid_batch_adapter/Symlinks/../../../prd/samgrid_batch_adapter/v7_1_0/NULL
ls -l sam_config/v4_2_34/config.env
ls -l sam_config/v4_2_34

Do this in the script,
Also make a safety backup

time cp -vax . ../dbsafe
real 0m8.980s

OK, no more excuses, let's try this.

date ; ${FLINK} write

Oops, this worked only for the .env file
Must remove the directories before re-symlinking.
The result was a bunch of extra symlinks in the targets Cleaned this up with a hacked FLINK, previewed and ran the script again MINOS-MYSQL2 > date ; ${FLINK} write Fri Sep 25 18:59:49 CDT 2009 MINOS-MYSQL2 > setup sam MINOS-MYSQL2 > sam ping dbserver The server 'SAMDbServer.prd:SAMDbServer' is alive. This seems to have worked ! ############### # GRIDAPPSYNC # ############### tracking down symlinks under prd MINOS-MYSQL2 > find . -type l -exec ls -ld {} \; | grep /afs ./LABYRINTH/Linux2.4-GCC_3_4/fava/bfield/bfld_111.dat -> /afs/fnal.gov/files/data/minos/release_data/bmaps/bfld_111.dat and many many more like this lrwxrwxrwx 1 minsoft mysql 97 Sep 4 09:33 ./MINOS_EXTERN/Linux2.4-GCC_3_2/v03/lib/libmyodbc.so -> /afs/fnal.gov/files/code/e875/general/ups/prd/MINOS_EXTERN/Linux2.4-GCC_3_2/v03/lib/libmyodbc3.so lrwxrwxrwx 1 minsoft mysql 97 Sep 4 09:45 ./MINOS_EXTERN/Linux2.4-GCC_4_1/v04/lib/libmyodbc.so -> /afs/fnal.gov/files/code/e875/general/ups/prd/MINOS_EXTERN/Linux2.4-GCC_4_1/v04/lib/libmyodbc3.so lrwxrwxrwx 1 minsoft mysql 73 Sep 4 10:38 ./MINOS_ROOT/Linux2.4-GCC_3_4/v5-18-00a -> /afs/fnal.gov/files/data/minos/d162/MINOS_ROOT/Linux2.4-GCC_3_4/v5-18-00a lrwxrwxrwx 1 minsoft mysql 77 Sep 4 10:38 ./MINOS_ROOT/Linux2.4-GCC_3_4/v5-18-00a-opt -> /afs/fnal.gov/files/data/minos/d162/MINOS_ROOT/Linux2.4-GCC_3_4/v5-18-00a-opt lrwxrwxrwx 1 minsoft mysql 81 Sep 4 12:28 ./PYTHIA6/Linux2.4-GCC_3_4/v6_406_nopdf/inc/inc -> /afs/fnal.gov/files/code/e875/general/ups/prd/PYTHIA6/Linux2.4-GCC_3_4/v6_406/inc lrwxrwxrwx 1 minsoft mysql 81 Sep 4 12:28 ./PYTHIA6/Linux2.4-GCC_3_4/v6_406_nopdf/src/inc -> /afs/fnal.gov/files/code/e875/general/ups/prd/PYTHIA6/Linux2.4-GCC_3_4/v6_406/inc lrwxrwxrwx 1 minsoft mysql 81 Sep 4 12:29 ./PYTHIA6/Linux2.4-GCC_3_4/v6_409_nopdf/src/inc -> /afs/fnal.gov/files/code/e875/general/ups/prd/PYTHIA6/Linux2.4-GCC_3_4/v6_409/inc lrwxrwxrwx 1 minsoft mysql 49 Sep 4 12:47 ./gcc/v3_4_3/Linux-2-4-2-3-2/tar/binutils.tar.gz -> 
/afs/fnal/files/home/room1/mengel/binutils.tar.gz lrwxrwxrwx 1 minsoft mysql 50 Sep 4 12:47 ./gcc/v3_4_3/Linux-2-4-2-3-2/tar/gcc.tar.gz -> /afs/fnal/files/home/room1/mengel/gcc-3.4.3.tar.gz ########## # DCACHE # ########## Date: Fri, 25 Sep 2009 09:08:07 -0500 (CDT) Request INC000000011756 requested by you has been submitted. Status: New Summary: FNDCA RawDataWritePools writes Starting sometime around Thursday Sep 17, Minos raw data files have been getting written from RawDataWritePools to tape immediately, instead of waiting for the desired 24 hours This may be related to the shift of r-minos-stkendca23a-3 to RawDataWritePools on Sep 11. Please adjust the write policies on that pool to match the rest of RawDataWritePools pools. Thanks ! _______________________________________________________________________ Date: Fri, 25 Sep 2009 16:45:43 -0500 From: John Hendry Here is Vijay's comment on the bug 417 which I opened this morning for this incident. Note, this is Vijay's 1st primary shift. --- Comment #1 from Vijay Sekhri 2009-09-25 15:36:53 --- Hello I am not going to change files on the prod system without getting approval from the change management people. I did found out the following On stkendca24a in file /opt/d-cache/pool/stkendca24a.write-pool-1.setup we have this entry. queue define class enstore * -expire=86400 -total=50000000000 -pending=100 Now on stkendca23a in file /opt/d-cache/pool/stkendca23a.read-pool-3.setup the same entry is missing. I am not 100% sure what this line is for , but it looks like it the one that forces it to wait 24 hours. So if we add this entry on /opt/d-cache/pool/stkendca23a.read-pool-3.setup, perhaps it will solve the problem. I will have someone else from dcache team comment on this one because of my limited knowledge on the topic. 
Meanwhile I will keep looking and check for any differences between stkendca23a and stkendca24a (both are on ReadWritePool, so I guess they should be same in configuration) _______________________________________________________________________ Date: Fri, 25 Sep 2009 22:24:14 +0000 (GMT) From: Arthur Kreymer Thanks for the update ! I affirm that this change should wait for proper review and approval. The immediate effect is about 24 extra tape mounts per day, which can be tolerated for another week without harm. _______________________________________________________________________ Date: Fri, 02 Oct 2009 19:44:16 +0000 (GMT) From: Arthur Kreymer The problem still seems to be present. Is there a schedule for correcting this ? _______________________________________________________________________ Date: Fri, 02 Oct 2009 14:53:17 -0500 From: John Hendry I have added your concerns to bugzilla 417. ------- Comment #2 From John Hendry 2009-10-02 14:52:14 ------- Vijay reported on 9/25 this change needs approval which Art Kreymer agreed. However, as of Oct 2, nothing yet has been done and the problem remains as Art has noted in incident 11756: _______________________________________________________________________ http://www-ccf.fnal.gov/Bugzilla/show_bug.cgi?id=417 ------ Comment #3 From Timur Perelmutov 2009-10-02 23:35:56 ------- I modified the configuration file for r-minos-stkendca23a-3 to include the command that enables the 24 hour wait before write to tape. I executed the same command on running pool, so the problem should disappear. LEt us wait for Arthur confirmation before closing the ticket. _______________________________________________________________________ Date: Sun, 04 Oct 2009 19:55:39 +0000 (GMT) From: Arthur Kreymer The problem was resolved by Timur's change Friday night. Data was written to tape after a 1 day delay late Saturday, and subsequent files are queued up as they should be. This ticket can be closed, thanks ! 
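
A quick way to catch this kind of drift is to scan the pool setup files for the deferred-write entry quoted in Vijay's comment. This is a minimal sketch, assuming one setup file per pool under /opt/d-cache/pool as in the paths above. check_expire is a hypothetical helper, not part of dCache, and it only looks for the text of the entry in the file; it cannot tell whether a running pool has actually loaded it.

```shell
#!/bin/bash
# Sketch: report which pool setup files lack a deferred-write
# "queue define class enstore ... -expire=N" entry.
# Assumption: the entry appears on a single line, as quoted in the ticket.
check_expire() {
    local f
    for f in "$@"; do
        if grep -Eq '^[[:space:]]*queue define class enstore .*-expire=[0-9]+' "$f"; then
            echo "OK      $f"
        else
            echo "MISSING $f"
        fi
    done
}
```

Possible use on a pool node: check_expire /opt/d-cache/pool/*.setup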
_______________________________________________________________________ ============================================================================= 2009 09 24 ============================================================================= ####### # WEB # ####### Date: Thu, 24 Sep 2009 15:42:53 -0500 From: John P Inkmann To: "central-web-mgrs@fnal.gov" Cc: "csi-wcs@fnal.gov" Subject: Central Web Services - NOTICE: relocating AFS volumes Sept 29 The Storage Administrators are going to be relocating a number of AFS volumes Sept 29.  They moving a number of volumes to a new machine and retiring the old one.  The moves should not affect web service, but if you notice anything strange with your website, we ask that you contact the Service Desk to let us know.   The migration is scheduled to start after the daily backups are completed (typically 6am) and if everything goes as planned, the migration should be complete by 5:30pm that evening.   The work being done requires that all volumes be released before being moved.  If you are on one of the volumes listed below, we ask that you ensure that your volume is released by 6am on the 29th, and that you not make any updates to your website until the migration has been completed. N.B. I see no obvious Numi/Minos connections _____________________________________________________________________ Date: Wed, 30 Sep 2009 08:38:31 -0500 From: Peter J Rzeminski II To: "central-web-mgrs@fnal.gov" , "csi-wcs@fnal.gov" Subject: RE: Central Web Services - NOTICE: relocating AFS volumes Sept 29 All, We have received confirmation that the AFS Migration successfully completed last night. ############### # GRIDAPPSYNC # ############### minsoft@minos-mysql2 - testing copy of UPS made 2009 09 04 /afs/fnal.gov/files/data/minos/d119 was copied to /minos/scratch/products export PRODUCTS=/minos/scratch/products/db sam sets up OK, seems to work. 
Can locate files Reviewing config files : Mysql> cat /minos/scratch/products/etc/upsdb_list /afs/fnal.gov/files/code/e875/general/ups/db Mysql> ls -l /minos/scratch/products/db/.updfiles total 12 -rw-rw-r-- 1 minsoft mysql 5366 May 27 2004 updconfig -rw-rw-r-- 1 minsoft mysql 29 May 27 2004 updusr.pm Mysql> ls -l /minos/scratch/products/db/.upsfiles total 20 drwxr-xr-x 3 minsoft mysql 2048 Aug 6 2008 configure -rw-rw-r-- 1 minsoft mysql 1008 May 27 2004 dbconfig -rw-rw-r-- 1 minsoft mysql 816 May 27 2004 dbconfig.orig drwxr-xr-x 2 minsoft mysql 2048 Aug 6 2008 shutdown drwxr-xr-x 2 minsoft mysql 2048 Aug 6 2008 startup db/.upsfiles/dbconfig - change path to /m/s/p etc/upsdb_list - change path to /m/s/p grep afs db/*/*.version reveals all sam_config has hard coded afs path db/sam_config/v4_2_34.version hacked this copied sam_test_py to /local/stage1/minsoft on fnpc333. Ran all sam acceptance tests, including project ! Could not 'setup sam' without -q prd -bash-3.00$ setup sam No default SAM configuration exists at this time. TESTED ROOT -bash-3.00$ setup minos_root v5-22-00c -q GCC_3_4 -bash-3.00$ type root root is /minos/scratch/products/prd/MINOS_ROOT/Linux2.4-GCC_3_4/v5-22-00c/bin/root -bash-3.00$ root ******************************************* * * * W E L C O M E to R O O T * * * * Version 5.22/00c 27 June 2009 * * * * You are welcome to visit our Web site * * http://root.cern.ch * * * ******************************************* ROOT 5.22/00c (tags/v5-22-00c@29251, Aug 11 2009, 10:53:34 on linux) CINT/ROOT C/C++ Interpreter version 5.16.29, Jan 08, 2008 Type ? for help. Commands must be C++ statements. Enclose multiple statements between { }. 
root [0] .exit HACKING FIX TO SAM PRODUCT LINKS cd /minos/scratch/products MINOS-MYSQL2 > find db -type l -exec ls -l {} \; | grep afs | wc -l 24 Created fixlink script in /m/s/p * * Danger Will Robinson * * some SAM directories are named SymLinks, others Symlinks ########### # ENSTORE # ########### Date: Thu, 24 Sep 2009 13:49:56 -0500 From: John Hendry To: Art Kreymer , Robert Hatcher , SSA Group Subject: minos has 1 pnfs files with empty layers (N-N-N) Greetings, The enstore pnfs audit finds this one pnfs file with empty layers: Previously known PNFS Database minos files: timestamp | pnfsid | layer1 | layer2 | layer4 | path 2003-03-21 15:53:54.000000000 | 000F00000000000000252728 | n | n | n | /pnfs/fs/usr/minos/fardet_data/2002-11/.bad.F00010802_0000.mdaq.root.orig This files has not been written to tape. There is a copy of this file (w/o the .bad. prefix and .orig suffix) on tape: [enstore@stkensrv3n ~]$ enstore pnfs --xref /pnfs/minos/fardet_data/2002-11/F00010802_0000.mdaq.root volume: VOO109 location_cookie: 0000_000000000_0017036 size: 95756568 file_family: fardet_data original_name: /pnfs/fs/usr/minos/fardet_data/2002-11/F00010802_0000.mdaq.root map_file: pnfsid_file: 000F000000000000003C31A8 pnfsid_map: bfid: CDMS123101222100000 origdrive: stkenmvr214a:/dev/rmt/tps0d0n:1310051193 crc: 1416999045 [enstore@stkensrv3n ~]$ Please remove this .bad.F00010802_0000.mdaq.root.orig file. __________________________________________________________________________ The file seems to have been put into the directory in Mar 2003. Lacking anything in PNFS or DCache, the directory entry is moot. I will remove it. 
MINOS26 > ls -l /pnfs//minos/fardet_data/2002-11/.bad.F00010802_0000.mdaq.root.orig
-rw-rw-r--  1 root root 629760 Mar 21  2003 /pnfs//minos/fardet_data/2002-11/.bad.F00010802_0000.mdaq.root.orig

MINOS26 > rm -f /pnfs//minos/fardet_data/2002-11/.bad.F00010802_0000.mdaq.root.orig
Thu Sep 24 14:07:26 CDT 2009

########
# LOCK #
########

lock.new - picking up from 8/31 draft
    adding stale control file

    Adding RATE control.
        Do not take a lock if the data rate is too low.
        rate control file RATE information file written by bluwatch ?

    Which rate ?
        single samples are too noisy,
            we sometimes get lucky during a busy spell
        running averages respond too slowly,
            there is no great harm in cutting off aggressively
        let's try an n minute minimum rate

    the rate file contains rate and time info.
        first line used by lock to set the limit
        second line used by bluwatch,

    Logging - we should log transitions
        how ? lock, or bluwatch ?

NL=/afs/fnal.gov/files/expwww/numi/html/computing/admin/bluearc

${NL}/lock.new status

LOCK STATUS Thu Sep 24 15:00:46 CDT 2009
LOCKS 5 of 5 ( 44 stale )
rbpatter  1
rtoner    2
wingmc    2
QUEUE 121 ( 230 stale)
rbpatter  74
rodriges  7
rtoner    12
wingmc    27
xbhuang   1

The system is pretty badly backlogged,
cannot test the script if I cannot get a lock.
Should not swap in a new script with things queued.

20:11 UTC - changed /grid/data/e875/LOCK/limit from 5 to 10
20:20 - queue as low as 60
        increased limit to 15
20:30 - queue had been in 50's, now back at 65.
        increased limit to 20
20:47   queue up to 117
21:04   down to 48
21:15   still around 45
        set limit back to 10
21:40   queue up to 88
        set limit to 30, locks overshot to 36
        around 16:43 CDT, locks overshot to 46/30.

LOCK STATUS Thu Sep 24 16:45:36 CDT 2009
LOCKS 40 of 30 ( 44 stale )

21:50   No major gain, queue hanging around 50ish
        set limit to 10

########
# LOCK #
########

About to run clean, but rather busy now,

ls /grid/data/e875/LOCK/LOG > /tmp/locklog
USERS=`cat /tmp/locklog | cut -f 8 -d .
| sort -u`
for US in ${USERS} ; do
printf "%-10.10s" ${US} ; grep "${US}$" /tmp/locklog | wc -l
done

pittam    8
rbpatter  2948
rodriges  4
rtoner    27
whitehd   2500
wingmc    132
xbhuang   44779

ls /grid/data/e875/LOCK/QUEUE > /tmp/lockqueue
QUSERS=`cat /tmp/lockqueue | cut -f 6 -d . | sort -u`
for US in ${QUSERS} ; do
printf "%-10.10s" ${US} ; grep "${US}$" /tmp/lockqueue | wc -l
done

Streamline this,

QLIST=`ls /grid/data/e875/LOCK/QUEUE`
QUSERS=`printf "${QLIST}\n" | cut -f 6 -d . | sort -u`
for US in ${QUSERS} ; do
printf "%-10.10s" ${US} ; printf "${QLIST}\n" | grep "${US}$" | wc -l
done

LLIST=`ls /grid/data/e875/LOCK/LOCKS`
LUSERS=`printf "${LLIST}\n" | cut -f 8 -d . | sort -u`
for US in ${LUSERS} ; do
printf "%-10.10s" ${US} ; printf "${LLIST}\n" | grep "${US}$" | wc -l
done

And work it into the lock status output

###########
# MONITOR #
###########

Date: Thu, 24 Sep 2009 10:25:14 -0500
From: Parag Mhashilkar
To: Arthur Kreymer
Cc: minos-data@fnal.gov
Subject: Re: Minos portal
Parts/Attachments:

I made the required changes. Let me know if there is any other thing that
needs to be changed/added.
_____________________________________________________________________

These changes corrected the minos25 Ganglia URL,
and will keep the recent Bluearc performance plots up to date.
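
The per-user tallies in the LOCK notes above could also be done in one pass over the directory listing. This is a sketch only: count_by_user is a hypothetical helper, and it assumes the user name sits in a fixed dot-separated field of each file name (8 for LOG and LOCKS, 6 for QUEUE, as in the cut commands above). Matching on the exact field avoids a weakness of grep "${US}$", which also counts any longer user name that ends with the same characters.

```shell
#!/bin/bash
# Sketch: tally entries per user in a LOCK directory in one pass.
# Usage: count_by_user DIR FIELD
#   FIELD is the dot-separated position of the user name in each file name.
count_by_user() {
    local dir=$1 field=$2
    ls "$dir" |
        awk -F. -v f="$field" '
            { n[$f]++ }                      # count file names per user
            END { for (u in n) printf "%-10.10s %d\n", u, n[u] }' |
        sort
}
```

Possible use: count_by_user /grid/data/e875/LOCK/QUEUE 6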
##########
# ADMIN #
##########

Added the FY10 Minos tactical plan to the CD Docdb,
patterned on the FY09 plan :

    CD DocDB Document 2933-v1
    FY09 Tactical Plan for MINOS

The new plan is CD-doc-3424, version 1

    http://cd-docdb.fnal.gov/cgi-bin/ShowDocument?docid=3424

This is subsumed under the official plan

    CD DocDB Document 3295-v2
    FY10 Intensity Frontier Computing Support Tactical Plan
    https://cd-docdb.fnal.gov:440/cgi-bin/ShowDocument?docid=3295
_________________________________________________________________________

Date: Thu, 24 Sep 2009 14:59:11 +0000 (GMT)
From: Arthur Kreymer
To: plunk@fnal.gov, wojcicki@fnal.gov, jthomas@fnal.gov, lang@fnal.gov,
    rmehdi@fnal.gov, rubin@fnal.gov, gmieg@fnal.gov, nwest@fnal.gov,
    lueking@fnal.gov, lammel@fnal.gov, votava@fnal.gov,
    minos-admin@fnal.gov, minosdb-support@fnal.gov
Subject: Re: Minos Computing Tactical Plan draft - FYI

On Wed, 12 Aug 2009, Arthur Kreymer wrote:

> The Minos FY10 Tactical Plan, the basis for our budget requests, is due Aug 24.
...
> This will go into DocDB after review.

Hearing no objections,
the Minos Tactical plan is now available at

    http://cd-docdb.fnal.gov/cgi-bin/ShowDocument?docid=3424

Some content has been rolled into the Intensity Frontier plan,
( internal document )

    https://cd-docdb.fnal.gov:440/cgi-bin/ShowDocument?docid=3295
_________________________________________________________________________

Removed the documents from CVS admin/cdplan, bad to have duplicates.
###########
# GNUPLOT #
###########

For testing,

GFAMK=/grid/fermiapp/minos/kreymer
MINOS26 > cp ~/minos/bin/gnuplot ${GFAMK}/gnuplot.SL42
MINOS26 > cp /usr/lib/libreadline.so.4 ${GFAMK}/
MINOS26 > cp /usr/lib/libpng12.so.0 ${GFAMK}/

MINOS27 > export LD_LIBRARY_PATH=${GFAMK}

=============================================================================
2009 09 23
=============================================================================

#######
# AFS #
#######

kreymer ran out of AFS disk space around 23:03 UTC
Mainly due to scripts/keep/
-rw-r--r-- 1 kreymer 1525 33554432 Aug 28 19:00 F00025792_0003.cosmic.cand.dogwood1.0.root

########
# JIRA #
########

Scanned 482 mail items in April 2009 minos-data,
about 65 might have been tracked via JIRA

###########
# GNUPLOT #
###########

Installed this on minos-evd for testing.

Linux minos-evd.fnal.gov 2.6.9-89.0.9.ELsmp #1 SMP Mon Aug 24 13:47:43 CDT 2009 i686 i686 i386 GNU/Linux

[root@minos-evd ~]# cat /etc/redhat-release
Scientific Linux Fermi LTS release 4.2 (Wilson)

# yum list gnuplot
gnuplot.i386 4.0.0-4 sl-base

# yum install gnuplot

MINOS26 > scp minos@minos-evd:/usr/bin/gnuplot ../bin/gnuplot

MINOS26 > ../bin/gnuplot

        G N U P L O T
        Version 4.0 patchlevel 0
        last modified Thu Apr 15 14:44:22 CEST 2004
        System: Linux 2.6.9-89.0.7.ELsmp

MINOS27 > bin/gnuplot
bin/gnuplot: error while loading shared libraries: libreadline.so.4: cannot open shared object file: No such file or directory

MINOS27 > rpm -qa | grep readline
readline-4.3-13.x86_64
readline-devel-4.3-13.x86_64

ARK > rpm -qa | grep readline
readline-5.1-1.1.x86_64
readline-devel-5.1-1.1.x86_64
readline-5.1-1.1.i386
readline-devel-5.1-1.1.i386

###########
# ENSTORE #
###########

Date: Wed, 23 Sep 2009 16:12:33 -0500
From: ssa-group@fnal.gov
To: cdweb@fnal.gov, helpdesk@fnal.gov, oleynik@fnal.gov, stan@fnal.gov,
    wolbers@fnal.gov, crawdad@fnal.gov, white@fnal.gov, moibenko@fnal.gov,
    timur@fnal.gov, stk-users@fnal.gov, cms-t1@fnal.gov,
    fermigrid-announce@fnal.gov,
    enstore-admin@fnal.gov
Subject: Announcement: Service scheduled outage for enstore on stken for a duration of 1 hour

An emergency enstore database maintenance must be performed.
All services should remain running, however there may be a slight
degradation in performance during this activity.
An announcement will be posted upon completion.

Thanks for your patience.

Best Regards,
John Hendry
SSA Primary
________________________________________________________________________

Date: Wed, 23 Sep 2009 17:23:14 -0500
Subject: Announcement: Service restoration for enstore on stken for a duration of completed

The emergency enstore database maintenance work has been completed.
Thanks for your patience.
Please report any problems.

#########
# BRATE #
#########

Updated brateday/wk-afs in admin
    Made the NOW and WEEK.png symlink relative.
    Added recent/* symlinks for the Portal
    Added VERB option, for debugging

mkdir ${WEBDIR}/recent

Updated brateday/wk-afs in admin
Restarted the processes, now running out of admin.

Added symlinks to ADMIN/bluearc/brate* in minos/scripts

ln -sf ADMIN/bluearc/brateday_afs brateday_afs
ln -sf ADMIN/bluearc/brateday_ark brateday_ark
ln -sf ADMIN/bluearc/bratewk_afs bratewk_afs
ln -sf ADMIN/bluearc/bratewk_ark bratewk_ark

Oops, ran this in the ADMIN area, not minos/scripts.
Recovered the original new files from AFS files

-rwxr-xr-x 1 kreymer 1525 1799 Sep 23 20:43 .__afs629*
-rwxr-xr-x 1 kreymer 1525 1796 Sep 23 20:47 .__afsFAFB*

Recreated the links, this time in minos/scripts.

back to ADMIN/bluearc

ARK > cvs commit -m 'VERB option, relative symlinks, results directory' brateday_afs bratewk_afs

#########
# ADMIN #
#########

Added the new minos25 ganglia link ( Minos Servers )
to the computing/dh page.

#########
# ADMIN #
#########

Andy Romero increased the e875 quota from 100 to 300 GB
per verbal request, per INC000000009849

This gives us space to move products and releases to Bluearc.
#######
# FCC #
#######

Kreymer renewed training for computer room access

CD Computer Room Hazard Analysis [CD000003/CR/01]

Support documents

    CDIP-2060-01 08/2009 Revision 19
        http://cd-docdb.fnal.gov/cgi-bin/RetrieveFile?docid=628
    Hazard Analysis for Working in Computer Rooms
        http://cd-docdb.fnal.gov/cgi-bin/ShowDocument?docid=627
    FCC Emergency Plan
        http://cd-docdb.fnal.gov/cgi-bin/ShowDocument?docid=2547
    Computer Room Open Floor Tile Work Rules
        http://cd-docdb.fnal.gov/cgi-bin/ShowDocument?docid=2826

Passed the test.

#########
# ADMIN #
#########

Registered Nick Grant for Minos/Analysis Grid access.

###########
# BLUEARC #
###########

Look into http://robinhood.sourceforge.net/

#########
# TOPDB #
#########

The last topdb samples were at 18:43 Monday Sep 21.
I do not see the processes running now !
I probably started these by hand without nohup,
hence the disconnect when my laptop rebooted then.

Started again by hand, with nohup.

###########
# ENSTORE #
###########

Updated JIRA CDFDH-1211
This reported widespread use of an old Enstore version by CDF grid nodes
Actual cause was A. Moibenko's Enstore load tests last weekend,
as reported in the Grid Users meeting on Monday.

###########
# BLUEARC #
###########

Noted that the HDS D0 project areas are the 5K series

df -h /prj_root/5??? 2> /dev/null
Filesystem            Size  Used Avail Use% Mounted on
d0-nas-0:/projects/5001
                      1.6T  1.5T  132G  92% /prj_root/5001
d0-nas-0:/projects/5002
                      1.9T  1.1T  904G  54% /prj_root/5002
...

df -h /prj_root/5???
2> /dev/null | grep prj_root
    1.6T  1.5T  132G  92% /prj_root/5001
    1.9T  1.1T  904G  54% /prj_root/5002
    1.9T  1.9T   68G  97% /prj_root/5003
    1.6T  1.1T  541G  68% /prj_root/5004
    1.9T  1.8T  128G  94% /prj_root/5005
    1.6T  1.3T  379G  77% /prj_root/5006
    1.6T  719G  920G  44% /prj_root/5007
    1.9T  1.9T  5.3G 100% /prj_root/5008
    1.9T  113G  1.8T   6% /prj_root/5010
    1.9T  1.9T  8.7G 100% /prj_root/5011
    1.6T     0  1.6T   0% /prj_root/5012
    1.0T     0  1.0T   0% /prj_root/5024
    1.5T  1.4M  1.5T   1% /prj_root/5131
    1.5T  865G  672G  57% /prj_root/5132
    1.5T  118G  1.4T   8% /prj_root/5133
    1.5T  1.4T  103G  94% /prj_root/5141
    1.5T  741G  796G  49% /prj_root/5142
    1.5T   23G  1.5T   2% /prj_root/5143
    1.5T  1.3T  221G  86% /prj_root/5151
    1.5T  886G  651G  58% /prj_root/5152
    1.5T  1.1T  500G  68% /prj_root/5153
    1.5T  259G  1.3T  17% /prj_root/5161
    5.5T  553G  5.0T  10% /prj_root/5162
    1.5T  1.5T     0 100% /prj_root/5163
    1.5T  1.5T   85G  95% /prj_root/5171
    1.5T  2.6M  1.5T   1% /prj_root/5172
    1.5T     0  1.5T   0% /prj_root/5173
    1.5T  525G 1012G  35% /prj_root/5181
    1.5T  194G  1.4T  13% /prj_root/5182
    1.5T  109M  1.5T   1% /prj_root/5183
    1.4T     0  1.4T   0% /prj_root/5620
    1.4T  539G  896G  38% /prj_root/5621
    1.4T  1.4T   42G  98% /prj_root/5622
    1.4T  1.4T  4.6G 100% /prj_root/5623
    1.4T   22G  1.4T   2% /prj_root/5624
    4.0T  3.5T  568G  87% /prj_root/5625
    2.4T  1.4T  1.1T  59% /prj_root/5626
    1.4T  1.4T  9.9G 100% /prj_root/5627
    1.4T     0  1.4T   0% /prj_root/5628
    1.4T  577G  858G  41% /prj_root/5629
    3.9T     0  3.9T   0% /prj_root/5632
    3.9T     0  3.9T   0% /prj_root/5633
    2.0T   27G  2.0T   2% /prj_root/5640
    1.5T     0  1.5T   0% /prj_root/5644
    1.7T  887G  855G  51% /prj_root/5670
    7.0T  4.5T  2.6T  64% /prj_root/5700
    2.0T     0  2.0T   0% /prj_root/5701
    2.0T  1.2T  854G  59% /prj_root/5800

SIZES=`df -h /prj_root/5??? 2> /dev/null | grep prj_root | cut -f 1 -d T`
echo `printf "0." ; for SIZE in ${SIZES} ; do printf " + ${SIZE} " ; done` | bc
93.0

Selected an emptyish path

mkdir /prj_root/5012/bluwatch

date
time cp -vax /grid/data/minos/bluwatch/stash/3 \
    /prj_root/5012/bluwatch/data
date

Wed Sep 23 09:33:21 CDT 2009
...
real 24m12.484s user 0m1.717s sys 0m41.382s Wed Sep 23 09:57:33 CDT 2009 set nohup ./bluwatch.20090922 -r -b /prj_root/5012/bluwatch/data \ -l /home/kreymer/bluwatch & bash-2.03$ tail -F bluwatch/rate/2009/09/23/d0mino05.txt Wed Sep 23 11:02:36 CDT 2009 0/file1801 69 Wed Sep 23 11:03:37 CDT 2009 0/file1802 53 Wed Sep 23 11:04:37 CDT 2009 0/file1803 72 Wed Sep 23 11:05:37 CDT 2009 0/file1804 46 Wed Sep 23 11:06:37 CDT 2009 0/file1805 64 Wed Sep 23 11:07:37 CDT 2009 0/file1806 79 Wed Sep 23 11:08:38 CDT 2009 0/file1807 49 Wed Sep 23 11:09:38 CDT 2009 0/file1808 65 Wed Sep 23 11:10:38 CDT 2009 0/file1809 72 Wed Sep 23 11:11:38 CDT 2009 0/file1810 53 Wed Sep 23 11:12:38 CDT 2009 0/file1811 68 Wed Sep 23 11:13:38 CDT 2009 0/file1812 66 Wed Sep 23 11:14:39 CDT 2009 0/file1813 56 Wed Sep 23 11:15:39 CDT 2009 0/file1814 71 Wed Sep 23 11:16:39 CDT 2009 0/file1815 49 Wed Sep 23 11:17:39 CDT 2009 0/file1816 66 Wed Sep 23 11:18:39 CDT 2009 0/file1817 67 Wed Sep 23 11:19:40 CDT 2009 0/file1818 67 Wed Sep 23 11:20:40 CDT 2009 0/file1819 64 Wed Sep 23 11:21:40 CDT 2009 0/file1820 72 Wed Sep 23 11:22:40 CDT 2009 0/file1821 71 .. 
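
For the n minute minimum rate idea in the 2009 09 24 LOCK notes, a summary of the most recent samples in a rate file might look like this. A minimal sketch: rate_summary is a hypothetical helper, and it assumes the rate is the last whitespace-separated field on each line, as in the tail output above. With samples about a minute apart, the last N lines cover roughly N minutes.

```shell
#!/bin/bash
# Sketch: summarize the last N samples of a bluwatch rate file.
# Usage: rate_summary FILE [N]     (N defaults to 10)
# Lines whose last field is not a plain integer are skipped.
rate_summary() {
    local file=$1 n=${2:-10}
    tail -n "$n" "$file" |
        awk '$NF ~ /^[0-9]+$/ {
                 v = $NF + 0
                 if (c == 0 || v < min) min = v
                 if (c == 0 || v > max) max = v
                 sum += v; c++
             }
             END { if (c) printf "n=%d min=%d mean=%.1f max=%d\n", c, min, sum / c, max }'
}
```

Possible use: rate_summary bluwatch/rate/2009/09/23/d0mino05.txt 10
The min value is what a minimum-rate limit would be compared against.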
============================================================================= 2009 09 22 ============================================================================= ######### # BRATE # ######### Updated brate daily plot 20090921 which was truncated yesterday due to my desktop crash kreymer@ark based on brateday-ark PLOTPATH=/home/kreymer/brate SCRIPTS=/afs/fnal.gov/files/home/room1/kreymer/minos/scripts export TZ=:/usr/share/zoneinfo/America/Chicago TODAY=20090921 ${SCRIPTS}/brate minos27 ${TODAY} ${PLOTPATH} export -n TZ unset TZ kreymer@minos26 based on brateday-afs TODAY=20090921 WEBDIR=/afs/fnal.gov/files/expwww/numi/html/computing/dh/bluearc/rates/minos27 scp -q -c blowfish \ kreymer@ark.fnal.gov:/home/kreymer/brate/minos27_${TODAY}.png \ ${WEBDIR}/minos27_${TODAY}.png ########### # BLUEARC # ########### D0 monitoring, for after the separation of their disk arrays today. kreymer@d0mino06 ls /prj_root/ 1002 1012 1152 1182 2626 2634 2642 2651 2659 2667 2676 3002 3016 3034 5001 5010 5142 5171 5622 5631 5800 1003 1131 1153 1183 2627 2635 2643 2652 2660 2668 2677 3003 3021 3035 5002 5011 5143 5172 5623 5632 1004 1132 1161 2620 2628 2636 2645 2653 2661 2670 2678 3004 3022 4021 5003 5012 5151 5173 5624 5633 1005 1133 1162 2621 2629 2637 2646 2654 2662 2671 2679 3011 3023 4022 5004 5024 5152 5181 5625 5640 1006 1141 1163 2622 2630 2638 2647 2655 2663 2672 2680 3012 3024 4023 5005 5131 5153 5182 5626 5644 1007 1142 1171 2623 2631 2639 2648 2656 2664 2673 2681 3013 3031 4024 5006 5132 5161 5183 5627 5670 1008 1143 1172 2624 2632 2640 2649 2657 2665 2674 3000 3014 3032 4025 5007 5133 5162 5620 5628 5700 1011 1151 1173 2625 2633 2641 2650 2658 2666 2675 3001 3015 3033 500 5008 5141 5163 5621 5629 5701 Most of these seem not to exist. 
Some are served from nodes like bash-2.03$ df -h /prj_root/* 2> /dev/null | grep ^d0srv | cut -f 1 -d : | sort -u d0srv062 d0srv066 d0srv068 d0srv069 d0srv081 d0srv082 d0srv083 d0srv084 d0srv085 d0srv068 d0srv069 Directories 3000 through 3035 seem to be served from fermi-nas-1.fnal.gov:/projects/ 3000-3004 5 3011-3016 6 3021-3024 4 3031-3035 5 bash-2.03$ df -h /prj_root/* 2> /dev/null | grep -A 1 ^fermi fermi-nas-1.fnal.gov:/projects/3000 2.0T 2.0T 77G 97% /prj_root/3000 fermi-nas-1.fnal.gov:/projects/3001 2.0T 1.8T 274G 87% /prj_root/3001 fermi-nas-1.fnal.gov:/projects/3002 2.0T 1.9T 163G 93% /prj_root/3002 fermi-nas-1.fnal.gov:/projects/3003 2.0T 2.0T 93G 96% /prj_root/3003 fermi-nas-1.fnal.gov:/projects/3004 2.0T 1.2T 919G 56% /prj_root/3004 fermi-nas-1.fnal.gov:/projects/3011 2.5T 632G 1.9T 25% /prj_root/3011 fermi-nas-1.fnal.gov:/projects/3012 1.0T 998G 27G 98% /prj_root/3012 fermi-nas-1.fnal.gov:/projects/3013 1.0T 281G 744G 28% /prj_root/3013 fermi-nas-1.fnal.gov:/projects/3014 6.5T 6.4T 122G 99% /prj_root/3014 fermi-nas-1.fnal.gov:/projects/3015 1.0T 879G 146G 86% /prj_root/3015 fermi-nas-1.fnal.gov:/projects/3016 1.0T 1022G 2.9G 100% /prj_root/3016 fermi-nas-1.fnal.gov:/projects/3021 3.0T 3.0T 3.6G 100% /prj_root/3021 fermi-nas-1.fnal.gov:/projects/3022 3.0T 3.0T 39G 99% /prj_root/3022 fermi-nas-1.fnal.gov:/projects/3023 5.0T 5.0T 20G 100% /prj_root/3023 fermi-nas-1.fnal.gov:/projects/3024 3.0T 257G 2.8T 9% /prj_root/3024 fermi-nas-1.fnal.gov:/projects/3031 3.0T 2.9T 184G 95% /prj_root/3031 fermi-nas-1.fnal.gov:/projects/3032 3.0T 3.0T 26G 100% /prj_root/3032 fermi-nas-1.fnal.gov:/projects/3033 3.0T 1.4T 1.7T 45% /prj_root/3033 fermi-nas-1.fnal.gov:/projects/3034 4.0T 2.6T 1.5T 64% /prj_root/3034 fermi-nas-1.fnal.gov:/projects/3035 1.5T 437G 1.1T 29% /prj_root/3035 SIZES=`df -h /prj_root/* 2> /dev/null | grep -A 1 ^fermi | grep prj_root | cut -f 1 -d T` echo `printf "0." 
; for SIZE in ${SIZES} ; do printf " + ${SIZE} " ; done` | bc 51.5 mkdir prj_root/3024/bluwatch date time cp -vax /grid/data/minos/bluwatch/stash/3 \ /prj_root/3024/bluwatch/data date Tue Sep 22 13:32:01 CDT 2009 bash-2.03$ time cp -vax /grid/data/minos/bluwatch/stash/2 \ > /prj_root/3024/bluwatch/data `/grid/data/minos/bluwatch/stash/2' -> `/prj_root/3024/bluwatch/data' `/grid/data/minos/bluwatch/stash/2/8' -> `/prj_root/3024/bluwatch/data/8' `/grid/data/minos/bluwatch/stash/2/8/file1402' -> `/prj_root/3024/bluwatch/data/8/file1402' `/grid/data/minos/bluwatch/stash/2/8/file1403' -> `/prj_root/3024/bluwatch/data/8/file1403' ... real 29m38.405s user 0m1.802s sys 0m44.056s bash-2.03$ date Tue Sep 22 14:01:39 CDT 2009 Trying out bluwatch, test mode first. bash-2.03$ scp minos27:minos/scripts/bluwatch.20090831 . ./bluwatch.20090831 -r -t -b /prj_root/3024/bluwatch/data DIR /prj_root/3024/bluwatch/data/0 Tue Sep 22 14:04:05 CDT 2009 0/file1801 1004 1253646245845247000 1253646245832311000 12936 9957 2979000 Tue Sep 22 14:04:11 CDT 2009 0/file1802 937 1253646251871288000 1253646251857644000 13644 10665 2979000 Trying an older directory, copied early bash-2.03$ ./bluwatch.20090831 -r -t -b /prj_root/3024/bluwatch/data -d 8 Tue Sep 22 14:04:35 CDT 2009 8/file1401 24 1253646275790497000 1253646275385235000 405262 403086 2175600 Tue Sep 22 14:04:42 CDT 2009 8/file1402 24 1253646282245981000 1253646281833118000 412863 410687 2175600 Tue Sep 22 14:04:48 CDT 2009 8/file1403 59 1253646288431373000 1253646288260645000 170728 168552 2175600 set nohup ./bluwatch.20090922 -r -b /prj_root/3024/bluwatch/data -d 9 \ -l /home/kreymer/bluwatch & This is running well now. ############# # MINOSSOFT # ############# To : minos_batch@fnal.gov, minos_software_discussion@fnal.gov Cc : Attchmnt: Subject : SLF 5 nodes for Batch testing ----- Message Text ----- There is no schedule yet for migration of Fermigrid workers to SLF 5. 
But the Fermigrid team have made two SLF 5.3 systems available for early testing. People who have fnpcsrv1 accounts can log in. fnpcsrv532 - 32 bit kernel fnpcsrv564 - 64 bit kernel The general Minos collaboration can use flxi06, but it is at SLF 5.1 and lacks the /grid and /minos mounts. I am trying to find a better SLF 5.3 host for general Minos collaboration access. ########### # BLUEARC # ########### Restarted brate scripts on ark.fnal.gov kreymer@ark.fnal.gov ( due to lack of gnuplot elsewhere ) ADMIN=/afs/fnal.gov/files/expwww/numi/html/computing/admin/bluearc cd ln -s ${ADMIN}/brateday_ark brateday_ark ln -s ${ADMIN}/bratewk_ark bratewk_ark cat >> crontab.dat @reboot ${HOME}/brateday_ark @reboot ${HOME}/bratewk_ark crontab crontab.dat initial startup : 28 14 * * * ${HOME}/brateday_ark 28 14 * * * ${HOME}/bratewk_ark ARK > ps xf | grep brate ============================================================================= 2009 09 21 ============================================================================= ######### # ADMIN # ######### INC000000011164 9/19/2009 12:55:21 PM Linux Minos - minos25 Ganaglia data FEF Primary - run2-sys@fnal.gov Ganglia data from minos25 stops at 11:00 Friday Sep 18. Please restart Ganglia monitoring of minos25. ___________________________________________________________________________ 9/21/2009 2:49:13 PM ; esimm Ganglia looks current for this node. ___________________________________________________________________________ 9/21/2009 10:39:59 PM ; Remedy Application Service Are we looking at the same plot ? The data here cuts off on Friday at about noon : http://rexganglia2.fnal.gov/minos/?r=week&c=MINOS+Cluster&h=minos25.fnal.gov ___________________________________________________________________________ Date: Mon, 21 Sep 2009 21:48:52 -0500 From: Edward Simmonds Strange. I was looking at this: http://rexganglia2.fnal.gov/minos/?r=week&c=MINOS+Server&h=minos25.fnal.gov But I'm not as familiar with these plots as you are. 
I'll check into this more tomorrow.
___________________________________________________________________________

Date: Tue, 22 Sep 2009 17:07:17 +0000 (GMT)
From: Arthur Kreymer
To: Edward Simmonds
Cc: minos-admin@fnal.gov
Subject: Re: Incident INC000000011164 reported by you has been resolved.
    Linux Minos - minos25 Ganaglia data

That explains my problem.

It appears that minos25 was moved from the Minos Cluster
to Minos Server list in Ganglia.
This broke the old link, and broke the continuity of data logging.

The change makes sense, as minos25 is the Condor server node,
and is not available for general use like the rest of the Cluster.

Was this change intentional ?
If so, I will adjust several URL's that point to the old location.
___________________________________________________________________________

############
# MCIMPORT #
############

fired up wingmc_gd, now that OVERLAY is all in PNFS

set nohup ; ./mcimport -c -l 10 wingmc_gd &

less /minos/data/mcimport/wingmc_gd/log/mcimport.log
...
OK - purging 432 MCIN files ?
Mon Sep 21 18:12:30 CDT 2009
PURGED n13035006_0000_L250200N_D07_r2.reroot.root 2
PURGED n13035006_0001_L250200N_D07_r1.reroot.root 8
...
PURGED n13035012_0030_L250200N_D07_r1.reroot.root 44
PURGED n13035012_0030_L250200N_D07_r2.reroot.root 40
MCIN processing 1380 files
Tue Sep 22 01:03:44 CDT 2009
MCIN configuration n1303 _L010185N_D07_r2i257.reroot.root
SRMCPed n13035027_0000_L010185N_D07_r2i257.reroot.root 0
SRMCPed n13035027_0001_L010185N_D07_r2i257.reroot.root 9
SRMCPed n13035027_0002_L010185N_D07_r2i257.reroot.root 12

###############
# SERVICEDESK #
###############

Tested reqtest account, for generic tester experience.
Better than kreymer, who has special priv's.

###########
# ENSTORE #
###########

VO4209 - verified that files are OK on other tapes,
allowed recycling of this tape.
See notes 2009 04 30

Date: Tue, 22 Sep 2009 14:06:46 -0500
From: George Szmuksta

The tape was recycled.
=============================================================================
2009 09 18
=============================================================================

###########
# BLUEARC #
###########

Serious slowdown starting around 17:30 CDT / 22:30 UTC
    Data rates close to 0
    minos_q Failed to fetch ads
    condor_q Failed to fetch ads

Latencies spiking over 100 very close to 17:00
see http://computing.fnal.gov/nasan/internal/stats/html/blue2-plots.html

whitehd condor jobs ramped up to over 300 around 16:00,
probably not the cause of this.

Having trouble logging into fnpc* nodes,
getting stuck in login before the security message.
Connection closed by 131.225.166.132
________________________________________________________________________

Date: Fri, 18 Sep 2009 18:08:39 -0500 (CDT)
Status: New
Summary: Grid - severe /grid/data slowdown after 17:00
Notes:
fermigrid-help CSI / BASS

Serious slowdown starting around 17:30 CDT / 22:30 UTC
    Data rates close to 0
    minos_q Failed to fetch ads
    condor_q Failed to fetch ads

Latencies spiking over 100 very close to 17:00, see
http://computing.fnal.gov/nasan/internal/stats/html/blue2-plots.html

whitehd condor jobs ramped up to over 300 around 16:00,
probably not the cause of this.

Having trouble logging into fnpc* nodes,
getting stuck in login before the security message.
Connection closed by 131.225.166.132
________________________________________________________________________

On minos25 , /var/log/messages

Sep 18 17:23:50 minos25 kernel: nfs_statfs: statfs error = 512
Sep 18 17:28:53 minos25 last message repeated 2 times
Sep 18 17:33:56 minos25 last message repeated 4 times

Same on minos26.
Sep 18 17:23:50 minos26 kernel: nfs_statfs: statfs error = 512 Sep 18 17:28:53 minos26 last message repeated 2 times Sep 18 17:26:38 minos01 kernel: nfs_statfs: statfs error = 512 ________________________________________________________________________ ######### # ADMIN # ######### ssh minoscvs@minoscvs adduser grafnj cms add_minos_user grafnj Creating account... /var/yp gmake[1]: Entering directory `/var/yp/minos' gmake[1]: `ypservers' is up to date. gmake[1]: Leaving directory `/var/yp/minos' gmake[1]: Entering directory `/var/yp/minos' Updating passwd.byname... Updating passwd.byuid... Updating netid.byname... gmake[1]: Leaving directory `/var/yp/minos' Adding user to Minos AFS group... libprot: no such entry (getting token) libprot: Could not get afs tokens, running unauthenticated pts: Permission denied ; unable to add user grafnj to group minos Problem adding user to AFS group minos. Please try running "adduser -u grafnj -g minos" again. Sending mail to subscribe to minos-user mailing list ... Sending email to user... Updated token MINOS01 > cmd add_minos_user grafnj Creating account... /var/yp gmake[1]: Entering directory `/var/yp/minos' gmake[1]: `ypservers' is up to date. gmake[1]: Leaving directory `/var/yp/minos' gmake[1]: Entering directory `/var/yp/minos' Updating passwd.byname... Updating passwd.byuid... Updating netid.byname... gmake[1]: Leaving directory `/var/yp/minos' Adding user to Minos AFS group... Sending mail to subscribe to minos-user mailing list ... Sending email to user... 
############ # MCIMPORT # ############ -bash-3.00$ ls /minos/data/mcimport/OVERLAY/mcin | grep .root | wc -l 5313 21:46 UTC 6084 Date: Fri, 18 Sep 2009 21:55:28 +0000 (GMT) To : Rashid Mehdiyev Cc : Adam Schreckenberger , minos-data@fnal.gov, Daniel Cronin-Hennessy On Thu, 17 Sep 2009, Rashid Mehdiyev wrote: > Adam , please check the remaining files on /OVERLAY directory and > move them to mcin (assuming that there is no duplicates with respect to > /minos/data/mcimport/wingmc_gd), so this amount of MC input (~ 3 Tb ?) > could be available for processing. > If you have already validated that data in /OVERLAY, the process of moving > files from this area to mcin should be straightforward and done asap. I presume that the presence of the 6084 files in OVERLAY/mcin indicated that they are ready to be imported. As of about 16:50 CDT ( 21:50 UTC ), I have started a manual mcimport on the OVERLAY directory, writing about 500 files at a time. set nohup ; ./mcimport -l 40 -b 500 OVERLAY & ######## # GRID # ######## Date: Fri, 18 Sep 2009 11:49:01 +0100 (BST) From: med@hep.ucl.ac.uk To: Arthur Kreymer Subject: Re: condor Hi Art, I've noticed that many jobs are still being held idle in the condor queue. I know that you said there was some work going on yesterday but do you have a feeling for how long it will take before the jobs start running? 
MINOS25 > minos_q -- Summary of minos25.fnal.gov : <131.225.193.25:61525> : minos25.fnal.gov OWNER RUN IDLE HELD OLDEST_JOB jdejong 0 1 0 9/17 05:22 0+00:00:00 irundq.sh_20090917 jjling 0 42 0 9/17 11:44 0+00:00:00 runjob.sh_20090917 med 0 10 0 9/17 05:16 0+00:00:00 loon_20090917_0516 nigrant 0 81 0 9/17 16:04 0+00:00:00 MakeNDRun1LEDataSu pittam 38 6 0 9/16 12:40 1+19:17:32 gen_antpDev sanjay 0 6 0 9/16 15:57 0+00:00:00 nrmcon.csh_2009091 scavan 12 0 0 9/18 07:47 0+00:10:25 paloon tinti 2 1 0 9/18 02:30 0+05:28:06 FDDogwoodTest-2009 zisvan 0 3 0 9/17 14:50 0+00:00:00 MakeFDRun1LEDataTa TOTALS 52 150 0 Farm glideins: R=29 I=0 H=5 MINOS25 > condor_q gfactory ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD 177491.19 gfactory 8/5 00:10 0+00:00:00 H 0 0.0 glidein_startup.sh 177492.14 gfactory 8/5 00:10 0+00:00:00 H 0 0.0 glidein_startup.sh 202172.0 gfactory 8/10 13:47 0+00:00:00 H 0 0.0 glidein_startup.sh 251858.0 gfactory 8/17 04:20 0+09:39:58 H 0 0.0 glidein_startup.sh 252257.3 gfactory 8/17 09:59 0+03:27:40 H 0 0.0 glidein_startup.sh 381049.3 gfactory 9/17 20:13 0+11:48:15 R 0 0.0 glidein_startup.sh 381062.1 gfactory 9/17 20:29 0+11:33:15 R 0 0.0 glidein_startup.sh ... 381321.3 gfactory 9/18 07:50 0+00:14:40 R 0 0.0 glidein_startup.sh 381321.4 gfactory 9/18 07:50 0+00:14:40 R 0 0.0 glidein_startup.sh Why are those jobs Held ? They were Idle a few minutes ago, around 08:00 And they are Idle again, 09:33 Remove them, see if that helps. condor_rm 177491.19 condor_rm 177492.14 condor_rm 202172.0 condor_rm 251858.0 condor_rm 252257.3 All clear by 09:43. Why are no new factories queued up yet ? Some have started up now, submitted at 09:51 47 are running at 11:00 CDT As of 16:50 UTC, gpfarm running jobs are mainly : engage 130 fgstore 300 group_cigi 30 minosgli 30 mippro 150 group_engage 50 Steve Timm sees no problems in GPfarm. No pilots because of no requests. 
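A minimal sketch of how the Held glideins above could be picked out of a condor_q style listing rather than copied by hand. It only prints the condor_rm commands as a dry run. The column position of the ST field is an assumption taken from the listing in this entry, and held_ids / print_rm_cmds are hypothetical helper names.

```shell
# Sketch only: list job IDs whose state column (ST) is H, reading
# condor_q-style text on stdin.  Assumed columns, as in the listing above:
#   ID OWNER SUBMITTED(date time) RUN_TIME ST PRI SIZE CMD
held_ids() {
    awk '$6 == "H" { print $1 }'
}

# Dry run: print the condor_rm commands instead of running them.
print_rm_cmds() {
    held_ids | while read -r ID ; do
        printf "condor_rm %s\n" "${ID}"
    done
}

# Three lines copied from the listing above, two Held and one Running.
SAMPLE='177491.19 gfactory 8/5 00:10 0+00:00:00 H 0 0.0 glidein_startup.sh
381049.3 gfactory 9/17 20:13 0+11:48:15 R 0 0.0 glidein_startup.sh
252257.3 gfactory 8/17 09:59 0+03:27:40 H 0 0.0 glidein_startup.sh'

printf "%s\n" "${SAMPLE}" | print_rm_cmds
```

Review the printed commands before running any of them for real.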
############
# MCIMPORT #
############

Killed mcimport -c ALL on minos26,
which had marched into wingmc_gd territory.
Purging files very slowly.

mcimport.20090918
    Added rate report to PURGEFILE

minos27
$ AFSS/mcimport.20090918 -b 2 wingmc_gd
...
Fri Sep 18 07:52:08 CDT 2009
PURGED n13035005_0029_L250200N_D07_r2.reroot.root 28
PURGED n13035005_0030_L250200N_D07_r1.reroot.root 26

minos26
Fri Sep 18 12:57:25 UTC 2009
PURGED n13035005_0030_L250200N_D07_r2.reroot.root 3
PURGED n13035006_0000_L250200N_D07_r1.reroot.root 3

Going ahead with a limited ALL run on minos27
AFSS/mcimport.20090918 -b 1 -l 2 ALL

du -sm /minos/data/mcimport/sjc/ is taking many minutes,
will have the same problem with others.
Modified to remove the top level du, avoiding the big log directory

Started production running on minos27
cp -a AFSS/mcimport.20090918 .
set nohup ; ./mcimport -l 9999 ALL &

=============================================================================
2009 09 17
=============================================================================

##########
# NAGIOS #
##########

See the FEF evaluation of Nagios for Ganglia-style monitoring,
comparison to Zabbix
CD DocDB 3277
https://cd-docdb.fnal.gov:440/cgi-bin/RetrieveFile?docid=3277&version=1&filename=nagios_zabbix_evaluation.pdf

############
# MCIMPORT #
############

mcimport.20090917
    Added ACTION prints to sleeps, via SLEEP minutes message
        printf "`date -u` `hostname -s` ${INDIR} $$ STARTUP\n" > ${ACT}
    Added STOP, removed NOIMPORT
    Inserted STOP support in SLEEP TAPER PAPER PURGE MCINPURGE MCINWRITE
        if [ -r "${INPAT}/STOP" ] ; then
            printf " ${INPAT}/STOP - bailing\n"
            break
        fi

AFSS/mcimport.20090917 -l 3 wingmc_gd

TESTING ALL ACTION FLAG
touch /minos/data/mcimport/ALL/ACTION
AFSS/mcimport.20090917 -n -b 1 ALL

Oops, accidentally ran jcoelho under the new script
while the old cronjob was still running.
This was in the purge phase, so no harm was done,
just lots of noise in the mcimport.log file.
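A small sketch of the STOP check described above, run against a throwaway directory so it can be tried safely. The loop of five passes and the temporary INPAT are assumptions for illustration only, this is not the mcimport code itself.

```shell
# Sketch only: the loop stops at the top of the next pass once STOP exists.
# INPAT is a temporary directory here, not a real mcimport area.
INPAT=$(mktemp -d)
DONE=0

for N in 1 2 3 4 5 ; do
    if [ -r "${INPAT}/STOP" ] ; then
        printf " %s/STOP - bailing\n" "${INPAT}"
        break
    fi
    DONE=${N}
    # simulate an operator requesting a stop during the second pass
    if [ "${N}" -eq 2 ] ; then
        touch "${INPAT}/STOP"
    fi
done

printf "passes completed: %s\n" "${DONE}"
rm -rf "${INPAT}"
```

The pass that is already running finishes, the check only takes effect at the next loop boundary, which is why the same test is repeated in each of the phases listed above.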
=============================================================================
2009 09 16
=============================================================================

########
# MAIL #
########

Got a list of email accounts that have not updated their passwords.
Placed it in ~kreymer/minos/maint/imapbad.lis

for BUS in `cat ../maint/imapbad.lis` ; do echo $BUS ; done | wc -l
2545

Checked for accounts on the Minos Cluster

Full list
ypcat passwd > /tmp/passwd

Just the principal
cat /tmp/passwd | cut -f 1 -d : > /tmp/passuser

Count them
for BUS in `cat ../maint/imapbad.lis` ; do grep "^${BUS}$" /tmp/passuser ; done | wc -l
35

List them
for BUS in `cat ../maint/imapbad.lis` ; do grep "^${BUS}:" /tmp/passwd ; done
alklein:KERBEROS:44040:5111:Andrea Klein:/afs/fnal.gov/files/home/room3/alklein:/usr/local/bin/tcsh
artemis:KERBEROS:11954:5111:Artemios Geromitsos:/afs/fnal/files/home/room2/artemis:/usr/local/bin/tcsh
avva:KERBEROS:8087:5111:Sergey_Avvakumov:/afs/fnal.gov/files/home/room1/avva:/usr/local/bin/bash
betan009:KERBEROS:43725:5111:Minerba Betancourt:/afs/fnal/files/home/room3/betan009:/usr/local/bin/tcsh
boehm:KERBEROS:11634:5111:Joshua Boehm:/afs/fnal.gov/files/home/room3/boehm:/usr/local/bin/tcsh
brb:KERBEROS:12059:5111:Benjamin Brown:/afs/fnal/files/home/room2/brb:/usr/local/bin/tcsh
bseilhan:KERBEROS:3475:5111:Brandon Seilhan:/afs/fnal/files/home/room3/bseilhan:/usr/local/bin/tcsh
cherdack:KERBEROS:12660:5111:Daniel Cherdack:/afs/fnal/files/home/room2/cherdack:/usr/local/bin/tcsh
djauty:!:43108:5111:David Auty:/afs/fnal/files/home/room2/djauty:/usr/local/bin/tcsh
dsdami:KERBEROS:13264:5111:Daniel Damiani:/afs/fnal/files/home/room3/dsdami:/usr/local/bin/tcsh
ebarnes:KERBEROS:11845:5111:Elizabeth Barnes:/afs/fnal/files/home/room2/ebarnes:/usr/local/bin/bash
grashorn:KERBEROS:11630:5111:Eric Grashorn:/afs/fnal/files/home/room3/grashorn:/usr/local/bin/bash
hzhang02:KERBEROS:42840:5111:Hongshan Zhang:/afs/fnal/files/home/room1/hzhang02:/usr/local/bin/tcsh
hzheng:KERBEROS:8319:5111:Hai Zheng:/afs/fnal/files/home/room2/hzheng:/usr/local/bin/tcsh irisliu:KERBEROS:44041:5111:Iris Liu:/afs/fnal.gov/files/home/room1/irisliu:/usr/local/bin/tcsh jjling:KERBEROS:12695:5111:Jiajie Ling:/afs/fnal/files/home/room3/jjling:/usr/local/bin/tcsh jmusser:KERBEROS:9277:5111:James_Musser:/afs/fnal.gov/files/home/room1/jmusser:/usr/local/bin/tcsh jurgenr:KERBEROS:12802:5111:Juergen Reichenbacher:/afs/fnal/files/home/room3/jurgenr:/usr/local/bin/tcsh jyuko:KERBEROS:11781:5468:Jasmine Ma:/afs/fnal/files/home/room2/jyuko:/usr/local/bin/tcsh kimjj:KERBEROS:14217:5111:Jae Kim:/afs/fnal/files/home/room1/kimjj:/usr/local/bin/tcsh koskinen:KERBEROS:11720:5111:David Koskinen:/afs/fnal/files/home/room1/koskinen:/usr/local/bin/bash lefeuvre:KERBEROS:43177:5111:Gwenaelle Lefeuvre:/afs/fnal/files/home/room2/lefeuvre:/usr/local/bin/tcsh loiacono:KERBEROS:3261:5111:Laura Loiacono:/afs/fnal/files/home/room3/loiacono:/usr/local/bin/bash moeller:KERBEROS:12833:1535:Victoria Moeller:/afs/fnal/files/home/room2/moeller:/usr/local/bin/tcsh nevans:KERBEROS:43393:1570:Nicholas Evans:/afs/fnal/files/home/room1/nevans:/usr/local/bin/tcsh nsmayer:KERBEROS:13406:5111:Nathan Mayer:/afs/fnal/files/home/room3/nsmayer:/usr/local/bin/tcsh ochoa:KERBEROS:11632:5111:Juan Ochoa:/afs/fnal/files/home/room2/ochoa:/usr/local/bin/bash pittam:KERBEROS:13893:5111:Robert Pittam:/afs/fnal/files/home/room2/pittam:/usr/local/bin/tcsh rtoner:!:12836:5468:Ruth Toner:/afs/fnal/files/home/room1/rtoner:/usr/local/bin/tcsh scavan:KERBEROS:13836:5111:Steven Cavanaugh:/afs/fnal/files/home/room3/scavan:/usr/local/bin/tcsh semenov:KERBEROS:4874:5111:Vitali_Semenov:/afs/fnal.gov/files/home/room1/semenov:/usr/local/bin/tcsh spanacek:KERBEROS:4515:5066:Suzanne_Panacek:/afs/fnal.gov/files/home/room2/spanacek:/usr/local/bin/tcsh talaga:KERBEROS:7998:5111:Richard_Talaga:/afs/fnal.gov/files/home/room2/talaga:/usr/local/bin/tcsh tinti:KERBEROS:13849:5111:Gemma 
Tinti:/afs/fnal/files/home/room1/tinti:/usr/local/bin/tcsh xbhuang:KERBEROS:43524:5111:Xiaobo Huang:/afs/fnal/files/home/room3/xbhuang:/usr/local/bin/tcsh for BUS in `cat ../maint/imapbad.lis` ; do grep "^${BUS}$" /tmp/passuser ; done > ../maint/imapbad.users The following have not forwarded email, so are immediately affected: for US in `cat ../maint/imapbad.users` ; do finger ${US}@fnal ; done | grep imapserver alklein@imapserver2.fnal.gov avva@imapserver1.fnal.gov bseilhan@imapserver1.fnal.gov irisliu@imapserver3.fnal.gov nevans@imapserver3.fnal.gov semenov@imapserver3.fnal.gov None of these people are active, as far as I know. ######## # FARM # ######## I am curious about why there is no directories for run # > 704 for sntp, while there is a number of then for cand for these particular series ? /pnfs/minos/mcout_data/dogwood1/near/daikon_04/L250200N_i194/cand_data/ 702/ 703/ 704/ 707/ 708/ 709/ 712/ 713/ /pnfs/minos/mcout_data/dogwood1/near/daikon_04/L250200N_i194/sntp_data/ 702/ 703/ 704/ I think that the sought files are in /WRITE dir, but they do not move to /pnfs for some reason... Could you investigate it ? ______________________________________________ Date: Wed, 16 Sep 2009 14:02:38 -0500 From: Rashid Mehdiyev OK, I think that I fixed it. I have removed one offending sntp file, now the files are moving on - there is 173 sntp files to write. ______________________________________________ ######## # FARM # ######## Removing the dogwood1 near cosmic prescaled files from Bluearc. This should be simple, there should be no dogwood1 near cosmic. Removing files through 2009-02. Will leave the 03/04/05/06 files for a little while, as they are unprescaled and potentially useful. That's about 150 GBytes per month, nearly 1/2 TByte in total. 
du -sm ${MDS}  # partway through the purge, around 2005-03
2621003 /minos/data/reco_near/dogwood1/sntp_data

du -sm ${MDS}  # after the purge
2268610 /minos/data/reco_near/dogwood1/sntp_data

352393 MBytes freed up

minfarm@minos27

MDS=/minos/data/reco_near/dogwood1/sntp_data
MONS=`ls /minos/data/reco_near/dogwood1/sntp_data | head -47`

FARM27 > printf "${MONS}\n"
2005-01
2005-03
...
2009-01
2009-02

for MON in ${MONS} ; do
    printf "${MON}\n"
    ls -ld ${MDS}/${MON}/*cosmic*
done > /minos/scratch/minfarm/dog1ndcosps.log

date
for MON in ${MONS} ; do
    printf "${MON}\n"
    rm -f ${MDS}/${MON}/*cosmic*
done
date

Wed Sep 16 11:23:01 CDT 2009
Wed Sep 16 11:38:43 CDT 2009

#########
# PROBE #
#########

Added a few things to the report
printf "VENDOR " ; cat /etc/redhat-release
printf "UNAME " ; uname -a
printf "ULIMIT\n" ; ulimit -a

########
# JIRA #
########

Tried to log out around 14:50 UTC
http://fermilab.go2group.com/logout

    Network Error (tcp_error)
    A communication error occurred: "Operation timed out"
    The Web Server may be down, too busy, or experiencing other problems
    preventing it from responding to requests.
    You may wish to try again at a later time.
    For assistance, please open a Helpdesk ticket, noting the URL accessed,
    the message received, your hostname/IP address
    and the time this error occurred.

Can ping fermilab.go2group.com 74.50.52.51
But get no response from the usual web pages

############
# MCIMPORT #
############

Added options to limit srmcp retries, per
https://twiki.grid.iu.edu/bin/view/Documentation/StorageSrmcpUsing
https://srm.fnal.gov/twiki/bin/view

-retry_timeout 100000   (millisecs per retry), try in 100 secs.
-retry_num 2            retries before giving up

Based on mcimport.log content, the defaults seem to be
-retry_timeout 10000    incremented by 10000 each retry.
-retry_num 20           tries before giving up
That is a net 2100 sec.
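The 2100 sec figure can be checked with a few lines of shell arithmetic: 20 waits starting at 10 sec and growing by 10 sec each time sum to 10 + 20 + ... + 200 sec. The helper name total_wait_secs is hypothetical. Whether srmcp also grows the timeout when -retry_timeout is set explicitly is not established in this log, so the second call simply assumes no growth.

```shell
# Sketch only: total wait, in seconds, for TRIES waits that start at
# FIRST_MS milliseconds and grow by INCR_MS milliseconds after each one.
total_wait_secs() {
    FIRST_MS=$1 ; INCR_MS=$2 ; TRIES=$3
    TOTAL_MS=0 ; WAIT_MS=${FIRST_MS} ; N=0
    while [ "${N}" -lt "${TRIES}" ] ; do
        TOTAL_MS=$(( TOTAL_MS + WAIT_MS ))
        WAIT_MS=$(( WAIT_MS + INCR_MS ))
        N=$(( N + 1 ))
    done
    printf "%s\n" $(( TOTAL_MS / 1000 ))
}

# Apparent defaults seen in mcimport.log: 10 sec, +10 sec per retry, 20 tries
total_wait_secs 10000 10000 20     # prints 2100

# New settings, ASSUMING the timeout does not grow between retries
total_wait_secs 100000 0 2         # prints 200
```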
Developing the ACTION function

MCIP=/minos/data/mcimport
INDIR=kreymer
ACT=${MCIP}/${INDIR}/ACTION
STALEACT=10
VERB=verb

printf "`date -u` `hostname -s` ${INDIR} $$ STARTUP\n" > ${ACT}
cat ${ACT}
Wed Sep 16 19:14:24 UTC 2009 minos27 kreymer 5753 STARTUP

touch -d yesterday ${ACT}

The revised script seems to be working.
Needs more testing for ALL and TAR modes,
but should be OK for concatenating wingmc files.

ln -s AFSS/mcimport.20090915 mci

Tested on single files, like
./mci -b 1 wingmc_gd

Ran a full pass
./mci -l 10 wingmc_gd
OK, logging activity to /minos/data/mcimport/wingmc_gd/log/mcimport.log
OK - purging 314 MCIN files ?

Wed Sep 16 18:15:26 CDT 2009
$ du -sm /grid/data/minos/users/wingmc/mcin
612772 /grid/data/minos/users/wingmc/mcin
$ ls /grid/data/minos/users/wingmc/mcin | wc -l
1130

Disabled cron running of wingmc_gd
Reenabled the cron job on minos26
mv /minos/data/mcimport/wingmc_gd/MCIMPORT /minos/data/mcimport/wingmc_gd/noIMPORT
crontab crontab.dat

Status on Thursday :
Most files are copied, good data rates around 20 MB/sec,
only 74 left to copy

$ grep SRMCP /minos/data/mcimport/wingmc_gd/log/mcimport.log | wc -l
1369
$ grep SRMCP /minos/data/mcimport/wingmc_gd/log/mcimport.log | tail
SRMCPed n13035010_0000_L250200N_D07_r2.reroot.root 21
SRMCPed n13035010_0001_L250200N_D07_r2.reroot.root 19
SRMCPed n13035010_0002_L250200N_D07_r2.reroot.root 20
SRMCPed n13035010_0003_L250200N_D07_r2.reroot.root 22
SRMCPed n13035010_0004_L250200N_D07_r2.reroot.root 22
SRMCPed n13035010_0005_L250200N_D07_r2.reroot.root 21
SRMCPed n13035010_0007_L250200N_D07_r2.reroot.root 20
SRMCPed n13035010_0008_L250200N_D07_r2.reroot.root 23
SRMCPed n13035010_0010_L250200N_D07_r2.reroot.root 24
SRMCPed n13035010_0011_L250200N_D07_r2.reroot.root 21

$ ls /minos/data/mcimport/wingmc_gd/mcin | grep root | wc -l
74

$ cat /minos/data/mcimport/*/ACTION
Wed Sep 16 19:14:24 UTC 2009 minos27 kreymer 5753 STARTUP
Thu Sep 17 13:39:50 UTC 2009 minos27 wingmc_gd 6723 MCINWRITE
n1303_L250200N_D07_r2.reroot.root n13035010_0014_L250200N_D07_r2.reroot.root ___________________________________________________ wingmc_fg is running, via the usual cron job. This is copying about 1 file per minute, dog slow. Will have to kill this, and restart with the new mcimport on minos27. $ less /minos/data/mcimport/wingmc_gd/log/mcimport.log ... Wed Sep 16 01:50:48 CDT 2009 MCIN processing 1444 files Wed Sep 16 01:50:48 CDT 2009 MCIN configuration n1303 _L100200N_D07_r1.reroot.root Files in 500, 501 so far, as of 09:24 Killed the java -cp of mcin_data/near/daikon_07/L100200N_r1/501/n13035011_0003_L100200N_D07_r1.reroot.root Now that wingmc_gd is being used, restored OVERLAY/mcin-original to mcin. rm STAGE/OVERLAY/mcin mv STAGE/OVERLAY/mcin-normal STAGE/OVERLAY/mcin ~/saddmc --verify daikon_07 near/daikon_07/L100200N_r1/500 ~/saddmc --declare daikon_07 near/daikon_07/L100200N_r1/500 ... Needed 270 files, Rate was 3.498 OK - declared n13035009_0014_L100200N_D07_r1.reroot.root /pnfs/minos/mcin_data/near/daikon_07/L100200N_r1/500(von613.423) STARTED Wed Sep 16 14:36:16 2009 FINISHED Wed Sep 16 14:37:34 2009 ~/saddmc --verify daikon_07 near/daikon_07/L100200N_r1/501 ~/saddmc --declare daikon_07 near/daikon_07/L100200N_r1/501 ... OK - declared n13035011_0002_L100200N_D07_r1.reroot.root /pnfs/minos/mcin_data/near/daikon_07/L100200N_r1/501(dcache.16) Needed 33 files, Rate was 3.424 STARTED Wed Sep 16 14:38:01 2009 FINISHED Wed Sep 16 14:38:11 2009 ___________________________________________________ ============================================================================= 2009 09 15 ============================================================================= ######## # GRID # ######## Date: Tue, 15 Sep 2009 13:32:53 -0500 (CDT) From: Steven Timm To: fermigrid-announce@fnal.gov Subject: Reminder Sep. 17 downtime, General Purpose Grid The third thursday of the month, Sep. 17, is our normally scheduled downtime for the General Purpose Grid. 
Highlights of this month's downtime will include: 1) Reboot of all head nodes 2) Physical moving of a couple of head nodes from one rack to the other. 3) Reboot of all worker nodes 4) Temporary network outage while the switches of the GP Grid subnet are reconfigured. We are not going to drain the nodes in advance of this downtime but any jobs that are still running by 08:30 on 9/17 will be killed and then restarted once the cluster comes back up. Do not do any new submissions to the cluster that morning until we send the all clear, because intermittent network outages could cause jobs to fail in the interim. We anticipate that the work should be complete by noon at the latest. ############ # MCIMPORT # ############ mcimport.20090915 Adding support for ACTION file in place of .pid ######## # FARM # ######## minospro@minos26 Proceeding with the plan of 08 Sep 2009 removing the dogwood1 near cosmic files RUN < 15820 Moving ahead with cand files today . ./setups.sh setup sam setup encp SAMDIM=" VERSION dogwood1 \ and DATA_TIER cand-near \ and PHYSICAL_DATASTREAM_NAME cosmic and RUN_NUMBER < 15820 " sam list files --summaryOnly --dim="${SAMDIM}" File Count: 20977 Average File Size: 60.33MB Total File Size: 1.21TB Total Event Count: 2140799714 CFILES=`~kreymer/minos/scripts/samlocate --path "${SAMDIM}" | grep /pnfs | sort` printf "${CFILES}\n" | wc -l 20977 printf "${CFILES}\n" | head /pnfs/minos/reco_near/dogwood1/cand_data/2005-12/N00009334_0002.cosmic.cand.dogwood1.0.root /pnfs/minos/reco_near/dogwood1/cand_data/2005-12/N00009334_0003.cosmic.cand.dogwood1.0.root /pnfs/minos/reco_near/dogwood1/cand_data/2005-12/N00009334_0004.cosmic.cand.dogwood1.0.root /pnfs/minos/reco_near/dogwood1/cand_data/2005-12/N00009334_0005.cosmic.cand.dogwood1.0.root /pnfs/minos/reco_near/dogwood1/cand_data/2005-12/N00009334_0006.cosmic.cand.dogwood1.0.root /pnfs/minos/reco_near/dogwood1/cand_data/2005-12/N00009334_0008.cosmic.cand.dogwood1.0.root 
/pnfs/minos/reco_near/dogwood1/cand_data/2005-12/N00009334_0012.cosmic.cand.dogwood1.0.root /pnfs/minos/reco_near/dogwood1/cand_data/2005-12/N00009334_0019.cosmic.cand.dogwood1.0.root /pnfs/minos/reco_near/dogwood1/cand_data/2005-12/N00009361_0009.cosmic.cand.dogwood1.0.root /pnfs/minos/reco_near/dogwood1/cand_data/2005-12/N00009361_0010.cosmic.cand.dogwood1.0.root printf "${CFILES}\n" | tail /pnfs/minos/reco_near/dogwood1/cand_data/2009-02/N00015817_0000.cosmic.cand.dogwood1.0.root /pnfs/minos/reco_near/dogwood1/cand_data/2009-02/N00015817_0001.cosmic.cand.dogwood1.0.root /pnfs/minos/reco_near/dogwood1/cand_data/2009-02/N00015817_0002.cosmic.cand.dogwood1.0.root /pnfs/minos/reco_near/dogwood1/cand_data/2009-03/N00015817_0003.cosmic.cand.dogwood1.0.root /pnfs/minos/reco_near/dogwood1/cand_data/2009-03/N00015817_0004.cosmic.cand.dogwood1.0.root /pnfs/minos/reco_near/dogwood1/cand_data/2009-03/N00015817_0005.cosmic.cand.dogwood1.0.root /pnfs/minos/reco_near/dogwood1/cand_data/2009-03/N00015817_0006.cosmic.cand.dogwood1.0.root /pnfs/minos/reco_near/dogwood1/cand_data/2009-03/N00015817_0007.cosmic.cand.dogwood1.0.root /pnfs/minos/reco_near/dogwood1/cand_data/2009-03/N00015817_0008.cosmic.cand.dogwood1.0.root /pnfs/minos/reco_near/dogwood1/cand_data/2009-03/N00015817_0009.cosmic.cand.dogwood1.0.root /pnfs/minos/reco_near/dogwood1/cand_data/2009-03/N00015817_0010.cosmic.cand.dogwood1.0.root /pnfs/minos/reco_near/dogwood1/cand_data/2009-03/N00015817_0011.cosmic.cand.dogwood1.0.root /pnfs/minos/reco_near/dogwood1/cand_data/2009-03/N00015817_0012.cosmic.cand.dogwood1.0.root /pnfs/minos/reco_near/dogwood1/cand_data/2009-03/N00015817_0013.cosmic.cand.dogwood1.0.root /pnfs/minos/reco_near/dogwood1/cand_data/2009-03/N00015817_0014.cosmic.cand.dogwood1.0.root /pnfs/minos/reco_near/dogwood1/cand_data/2009-03/N00015817_0015.cosmic.cand.dogwood1.0.root /pnfs/minos/reco_near/dogwood1/cand_data/2009-03/N00015817_0016.cosmic.cand.dogwood1.0.root 
/pnfs/minos/reco_near/dogwood1/cand_data/2009-03/N00015817_0017.cosmic.cand.dogwood1.0.root /pnfs/minos/reco_near/dogwood1/cand_data/2009-03/N00015817_0018.cosmic.cand.dogwood1.0.root /pnfs/minos/reco_near/dogwood1/cand_data/2009-03/N00015817_0019.cosmic.cand.dogwood1.0.root /pnfs/minos/reco_near/dogwood1/cand_data/2009-03/N00015817_0020.cosmic.cand.dogwood1.0.root /pnfs/minos/reco_near/dogwood1/cand_data/2009-03/N00015817_0021.cosmic.cand.dogwood1.0.root /pnfs/minos/reco_near/dogwood1/cand_data/2009-03/N00015817_0022.cosmic.cand.dogwood1.0.root /pnfs/minos/reco_near/dogwood1/cand_data/2009-03/N00015817_0023.cosmic.cand.dogwood1.0.root for FILE in ${CFILES} ; do ls -l ${FILE} ; usleep 100000 ; done date for FILE in ${CFILES} ; do ls ${FILE} FOLE=`echo ${FILE} | sed 's^reco_near/dogwood1^BAD/DOG1PRE^g'` FOLP=`dirname ${FOLE}` mkdir -p ${FOLP} enmv ${FILE} ${FOLE} printf " " ; ls ${FOLE} usleep 100000 done date Tue Sep 15 09:55:27 CDT 2009 Tue Sep 15 17:13:25 CDT 2009 date ~kreymer/minos/scripts/samundeclare "${SAMDIM}" date Wed Sep 16 08:37:33 CDT 2009 Found 20977 files undeclared N00009600_0002.cosmic.cand.dogwood1.0.root undeclared N00009600_0007.cosmic.cand.dogwood1.0.root ... undeclared N00009334_0006.cosmic.cand.dogwood1.0.root undeclared N00009361_0012.cosmic.cand.dogwood1.0.root Wed Sep 16 08:56:51 CDT 2009 ============================================================================= 2009 09 14 ============================================================================= ######## # GRID # ######## rubin asked for help with DOE grid cert renewal. Referred him to 'DOEGrid certificate' in ~kreymer/LOG, around 2009 03 02 ############# # SAMLOCATE # ############# Moved this to admin/sam 2009 09 14 - added -p --path option to print full useable path including file the default is to print FILE space PATH ######## # FARM # ######## minospro@minos26 Proceeding with the plan of 08 Sep 2009 removing the dogwood1 near cosmic files RUN < 15820 . 
./setups.sh setup sam SAMDIM=" VERSION dogwood1 \ and DATA_TIER sntp-near \ and PHYSICAL_DATASTREAM_NAME cosmic and RUN_NUMBER < 15820 " sam list files --summaryOnly --dim="${SAMDIM}" File Count: 1352 Average File Size: 250.40MB Total File Size: 330.61GB Total Event Count: 2160440493 ~kreymer/minos/scripts/samlocate "${SAMDIM}" SFILES=`~kreymer/minos/scripts/samlocate --path "${SAMDIM}" | grep /pnfs` PRO> printf "${SFILES}\n" | wc -l 1353 PRO> printf "${SFILES}\n" | sort | tail /pnfs/minos/reco_near/dogwood1/sntp_data/2009-01/N00015493_0000.cosmic.sntp.dogwood1.0.root /pnfs/minos/reco_near/dogwood1/sntp_data/2009-01/N00015497_0000.cosmic.sntp.dogwood1.0.root /pnfs/minos/reco_near/dogwood1/sntp_data/2009-01/N00015510_0000.cosmic.sntp.dogwood1.0.root /pnfs/minos/reco_near/dogwood1/sntp_data/2009-01/N00015522_0000.cosmic.sntp.dogwood1.0.root /pnfs/minos/reco_near/dogwood1/sntp_data/2009-01/N00015542_0000.cosmic.sntp.dogwood1.0.root /pnfs/minos/reco_near/dogwood1/sntp_data/2009-01/N00015547_0000.cosmic.sntp.dogwood1.0.root /pnfs/minos/reco_near/dogwood1/sntp_data/2009-01/N00015550_0000.cosmic.sntp.dogwood1.0.root /pnfs/minos/reco_near/dogwood1/sntp_data/2009-01/N00015553_0000.cosmic.sntp.dogwood1.0.root /pnfs/minos/reco_near/dogwood1/sntp_data/2009-02/N00015814_0000.cosmic.sntp.dogwood1.0.root /pnfs/minos/reco_near/dogwood1/sntp_data/2009-02/N00015817_0000.cosmic.sntp.dogwood1.0.root Do a test of 10 files SFILES=`~kreymer/minos/scripts/samlocate -b 10 --path "${SAMDIM}" | grep /pnfs` printf "${SFILES}\n" PRO> ~kreymer/minos/scripts/samundeclare -b 10 -n "${SAMDIM}" BAIL after 10 NOOP Found 1352 files sam.undeclareFile N00009623_0000.cosmic.sntp.dogwood1.0.root sam.undeclareFile N00011231_0000.cosmic.sntp.dogwood1.0.root sam.undeclareFile N00011259_0000.cosmic.sntp.dogwood1.0.root sam.undeclareFile N00011265_0000.cosmic.sntp.dogwood1.0.root sam.undeclareFile N00011268_0000.cosmic.sntp.dogwood1.0.root sam.undeclareFile N00011373_0000.cosmic.sntp.dogwood1.0.root 
sam.undeclareFile N00011422_0000.cosmic.sntp.dogwood1.0.root sam.undeclareFile N00011440_0000.cosmic.sntp.dogwood1.0.root sam.undeclareFile N00011325_0000.cosmic.sntp.dogwood1.0.root sam.undeclareFile N00011344_0000.cosmic.sntp.dogwood1.0.root The list looks good, let's remove them. for FILE in ${SFILES} ; do ls -l ${FILE} ; usleep 100000 ; done for FILE in ${SFILES} ; do echo ${FILE} ; rm -f ${FILE} ; usleep 100000 ; done ~kreymer/minos/scripts/samundeclare -b 10 "${SAMDIM}" BAIL after 10 Found 1352 files undeclared N00009623_0000.cosmic.sntp.dogwood1.0.root undeclared N00011231_0000.cosmic.sntp.dogwood1.0.root undeclared N00011259_0000.cosmic.sntp.dogwood1.0.root undeclared N00011265_0000.cosmic.sntp.dogwood1.0.root undeclared N00011268_0000.cosmic.sntp.dogwood1.0.root undeclared N00011373_0000.cosmic.sntp.dogwood1.0.root undeclared N00011422_0000.cosmic.sntp.dogwood1.0.root undeclared N00011440_0000.cosmic.sntp.dogwood1.0.root undeclared N00011325_0000.cosmic.sntp.dogwood1.0.root undeclared N00011344_0000.cosmic.sntp.dogwood1.0.root PRO> date Mon Sep 14 17:27:32 CDT 2009 Let's do the whole thing ( ntuples ) Oops, let's move these to an alternate tree. 
mkdir /pnfs/minos/BAD/DOG1PRE SFILES=`~kreymer/minos/scripts/samlocate -b 10 --path "${SAMDIM}" | grep /pnfs` PRO> printf "${SFILES}\n" /pnfs/minos/reco_near/dogwood1/sntp_data/2006-12/N00011347_0000.cosmic.sntp.dogwood1.0.root /pnfs/minos/reco_near/dogwood1/sntp_data/2006-12/N00011399_0000.cosmic.sntp.dogwood1.0.root /pnfs/minos/reco_near/dogwood1/sntp_data/2007-01/N00011458_0000.cosmic.sntp.dogwood1.0.root /pnfs/minos/reco_near/dogwood1/sntp_data/2007-01/N00011633_0000.cosmic.sntp.dogwood1.0.root /pnfs/minos/reco_near/dogwood1/sntp_data/2007-01/N00011552_0000.cosmic.sntp.dogwood1.0.root /pnfs/minos/reco_near/dogwood1/sntp_data/2007-01/N00011568_0000.cosmic.sntp.dogwood1.0.root /pnfs/minos/reco_near/dogwood1/sntp_data/2007-01/N00011592_0000.cosmic.sntp.dogwood1.0.root /pnfs/minos/reco_near/dogwood1/sntp_data/2007-01/N00011592_0014.cosmic.sntp.dogwood1.0.root /pnfs/minos/reco_near/dogwood1/sntp_data/2007-02/N00011778_0000.cosmic.sntp.dogwood1.0.root /pnfs/minos/reco_near/dogwood1/sntp_data/2007-02/N00011807_0000.cosmic.sntp.dogwood1.0.root for FILE in ${SFILES} ; do ls -l ${FILE} ; usleep 100000 ; done setup encp date for FILE in ${SFILES} ; do echo ${FILE} FOLE=`echo ${FILE} | sed 's^reco_near/dogwood1^BAD/DOG1PRE^g'` FOLP=`dirname ${FOLE}` mkdir -p ${FOLP} enmv ${FILE} ${FOLE} ls -l ${FOLE} usleep 100000 done date ~kreymer/minos/scripts/samundeclare -b 10 "${SAMDIM}" This worked, do this to all the files : SFILES=`~kreymer/minos/scripts/samlocate --path "${SAMDIM}" | grep /pnfs` printf "${SFILES}\n" | wc -l 1332 Removed the files, as above. Mon Sep 14 17:45:27 CDT 2009 ... Mon Sep 14 18:12:39 CDT 2009 date ~kreymer/minos/scripts/samundeclare "${SAMDIM}" date Mon Sep 14 18:13:03 CDT 2009 ... Mon Sep 14 18:15:32 CDT 2009 I will not remove the Cand files quite yet, as there are 20x more of them, and these are not being copied up from the TACC. 
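A dry-run sketch of the path mapping used for the move to the alternate tree above. It applies the same sed substitution to one of the listed files and only prints the mkdir and enmv commands, so nothing in /pnfs is touched. The helper name bad_path is hypothetical.

```shell
# Sketch only: map a reco_near/dogwood1 path to the BAD/DOG1PRE tree,
# using the same sed expression as the loop above (with ^ as delimiter).
bad_path() {
    echo "$1" | sed 's^reco_near/dogwood1^BAD/DOG1PRE^g'
}

FILE=/pnfs/minos/reco_near/dogwood1/sntp_data/2006-12/N00011347_0000.cosmic.sntp.dogwood1.0.root
FOLE=$(bad_path "${FILE}")
FOLP=$(dirname "${FOLE}")

# Dry run: show what would be done, without running mkdir or enmv.
printf "mkdir -p %s\n" "${FOLP}"
printf "enmv %s %s\n" "${FILE}" "${FOLE}"
```

Only the directory part matches the pattern, so the dogwood1 in the file name itself is left alone.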
#######
# AFS #
#######

For cwhite access to /afs/fnal.gov/files/expwww/numi/html/MinosAEM

pts adduser -user cwhite -group wadmnumi:numiweb
pts membership wadmnumi:numiweb | grep cwhite
cwhite

############
# MCIMPORT #
############

Date: Mon, 14 Sep 2009 18:28:48 +0000 (GMT)
From: Arthur Kreymer
To: wingmc@fnal.gov
Cc: hennessy@fnal.gov, rmehdi@fnal.gov, minos-data@fnal.gov
Subject: Import of the next batch of daikon_07

I am told that we should import the next batch of daikon_07
I think this is

/grid/data/minos/users/wingmc/
    L100200_r1/
    L150200_r2/
    L250200_r1/
    L250200_r2/

To make this happen, Adam needs to do the following :

1) Change permissions on these files to group writeable,
   so that the mindata account can manipulate and remove them.

   chmod -R g+w /grid/data/minos/users/wingmc/L100200_r1
   chmod -R g+w /grid/data/minos/users/wingmc/L150200_r2
   chmod -R g+w /grid/data/minos/users/wingmc/L250200_r1
   chmod -R g+w /grid/data/minos/users/wingmc/L250200_r2

2) Move them out of the subdirectories
   up to /grid/data/minos/users/wingmc/mcin.
   Something like

   CONF=/grid/data/minos/users/wingmc/L100200_r1
   MCIN=/grid/data/minos/users/wingmc/mcin
   FILES=`ls ${CONF}`
   for FILE in ${FILES} ; do
       echo mv ${CONF}/${FILE} ${MCIN}/${FILE}
   done

   Run the script with 'echo' to be sure this is what you want,
   then remove the echo prefix and run it for real.

###########
# BLUEARC #
###########

Looking at D0 project disks :
Slowdowns reported on 3014 and 3022

d0-nas-0:/projects/5131               1.5T  1.4M  1.5T   1% /prj_root/5131
fermi-nas-1.fnal.gov:/projects/3034   4.0T  2.6T  1.5T  64% /prj_root/3034
fermi-nas-1.fnal.gov:/projects/3014   6.5T  6.5T  102G  99% /prj_root/3014
fermi-nas-1.fnal.gov:/projects/3022   3.0T  3.0T   55G  99% /prj_root/3022

prj_roots seem to be automounted.
ls /prj_root 1002 1012 1152 1182 2626 2634 2642 2651 2659 2667 2676 3002 3016 3034 5001 5010 5142 5171 5622 5631 5800 1003 1131 1153 1183 2627 2635 2643 2652 2660 2668 2677 3003 3021 3035 5002 5011 5143 5172 5623 5632 1004 1132 1161 2620 2628 2636 2645 2653 2661 2670 2678 3004 3022 4021 5003 5012 5151 5173 5624 5633 1005 1133 1162 2621 2629 2637 2646 2654 2662 2671 2679 3011 3023 4022 5004 5024 5152 5181 5625 5640 1006 1141 1163 2622 2630 2638 2647 2655 2663 2672 2680 3012 3024 4023 5005 5131 5153 5182 5626 5644 1007 1142 1171 2623 2631 2639 2648 2656 2664 2673 2681 3013 3031 4024 5006 5132 5161 5183 5627 5670 1008 1143 1172 2624 2632 2640 2649 2657 2665 2674 3000 3014 3032 4025 5007 5133 5162 5620 5628 5700 1011 1151 1173 2625 2633 2641 2650 2658 2666 2675 3001 3015 3033 500 5008 5141 5163 5621 5629 5701 ########### # SERVICE # ########### Date: Mon, 14 Sep 2009 09:14:03 -0500 (CDT) INC 4220 Completed - Requester Console omits requests recent upgrade resolved issue ############# # TOPDB_LOG # ############# MINOSDATA-4 cd /afs/fnal.gov/files/expwww/numi/html/computing/database mv oracle/topdb/minosdev topdb/minosdev mv oracle/topdb/minosprd topdb/minosprd rmdir oracle/topdb ln -s ../topdb oracle/topdb topdb_log.20090914 changed BASEDIR Made NOW.txt and TODAY relative symlinks, not absolute Added copy of TOPDB_CONN from /local/scratch26/kreymer/.grid/topdb_conn cat > /local/scratch26/kreymer/.grid/topdb_conn monitor/... chmod 600 /local/scratch26/kreymer/.grid/topdb_conn Moved these files to CVS - admin/sam ADMIN=/afs/fnal.gov/files/expwww/numi/html/computing/admin ln -sf ${ADMIN}/sam/oracle/topdb \ /afs/fnal.gov/files/home/room1/kreymer/minos/oracle/topdb set nohup ; ${ADMIN}/sam/oracle/topdb_log minosdev & set nohup ; ${ADMIN}/sam/oracle/topdb_log minosprd & Adjusted monitor.minos26, removed export TDBCONN=monitor/`cat /local/scratch26/kreymer/grid/oraclemonitor` Corrected path in topdb_log, restarted the scripts, around 11:02 CDT. 
MINOSDATA-4 resolved with first run of the script MINOSDATA-4 closed - scripts continue to iterate. ####### # PMS # ####### cd /minos/scratch/kreymer/trel cvs co PackageMaintenanceSupport/config nedit PackageMaintenanceSupport/config/maintainers.pms cvs diff PackageMaintenanceSupport/config/maintainers.pms Index: PackageMaintenanceSupport/config/maintainers.pms =================================================================== RCS file: /cvs/minoscvs/rep1/minossoft/PackageMaintenanceSupport/config/maintainers.pms,v retrieving revision 1.200 diff -r1.200 maintainers.pms 103c103 < last_confirmation=2009-07-23 --- > last_confirmation=2010-02-11 cvs commit -m "Update my entry" PackageMaintenanceSupport/config/maintainers.pms rm -r PackageMaintenanceSupport ######## # JIRA # ######## Will put some recent issues, for testing , into http://fermilab.go2group.com/browse/MINOSDATA ________________________________________________________________________ MINOSDATA-1 Make JIRA entries for some of the tasks presently tracked in the Work Log http://www-numi.fnal.gov/minwork/computing/dh/worklog.txt Select those tasks taking more than a day, and not tracked by the ServiceDesk. Target: Timely creation, update, and closure of the first few relevant items. ________________________________________________________________________ Filing mail in mjira folder for now. ________________________________________________________________________ Receiving email notice of Creation Start Progress Comments ________________________________________________________________________ I have created several issues, will try resolving one. ________________________________________________________________________ There is a DUE column on the issue summary. I would like to use this. How is this set ? 
________________________________________________________________________ ============================================================================= 2009 09 13 Sunday ============================================================================= ########### # BLUEARC # ########### Date: Sun, 13 Sep 2009 16:29:46 -0500 (CDT) Request INC000000010394 requested by you has been submitted. Status: New Summary: Bluearc export requested to flxi06 Please export the following to flxi06, rw, for tests of SLF 5 FERMI-BLUE blue2 /minos/data RHEA minos-nas-0 /minos/data RHEA minos-nas-0 /minos/scratch FERMI-BLUE blue2 /fermigrid-fermiapp _______________________________________________________________________ Date: Tue, 15 Sep 2009 16:41:41 -0500 (CDT) Status: Completed flxi06, now has rw access to FERMI-BLUE blue2 /minos/data RHEA minos-nas-0 /minos/data RHEA minos-nas-0 /minos/scratch FERMI-BLUE blue2 /fermigrid-fermiapp _______________________________________________________________________ ############### # SERVICEDESK # ############### Of the 14 Open tickets for kreymer dating 3 working days old 3 - in actual progress 5 - completed, have requested closeout, but not closed. 6 - no response in ServiceDesk ########### # BLUEARC # ########### Date: Sun, 13 Sep 2009 21:03:34 +0000 (GMT) From: Arthur Kreymer To: FermilabServiceDesk@fnal.gov Cc: minos-data@fnal.gov Subject: Request INC000000007362: /grid/data overloaded Please close out this ticket. The d0ora2 node that caused most of the overloads was retired Aug 25. Further investigations can be pursued via new tickets, if necessary. 
############
# MINOSCVS #
############

Removed .k5login entries whose purpose I do not understand :

    host/minos1.fnal.gov@FNAL.GOV
    host/minoscvs.fnal.gov@FNAL.GOV

bash-2.03$ ssh minoscvs "echo OK SSH USING NEW .k5login SUCCESSFUL"
OK SSH USING NEW .k5login SUCCESSFUL

############
# MCIMPORT #
############

Getting ahead with OVERLAY files
Running on minos27, to avoid overloaded minos26

-bash-3.00$ AFSS/mcimport.20090903 OVERLAY
OK, logging activity to /minos/data/mcimport/OVERLAY/log/mcimport.log

Set aside MCIMPORT, so that cron will not run a second copy on minos26

$ mv MCIMPORT NOIMPORT

MCIN processing 168 files Sun Sep 13 13:49:15 CDT 2009
MCIN configuration n1303 _L010170N_D07_r1.reroot.root
SRMCPed n13035001_0001_L010170N_D07_r1.reroot.root
...
~/saddmc --declare daikon_07 near/daikon_07/L010170N_r1/500

2916765 /minos/data/mcimport/OVERLAY/
1 /minos/data/mcimport/OVERLAY/tar
1 /minos/data/mcimport/OVERLAY/dcache
2916759 /minos/data/mcimport/OVERLAY/mcin
51041 /minos/data/mcimport/OVERLAY/mcin/dcache
Sun Sep 13 14:49:42 CDT 2009

Trying again with a loop option

MCIO=/minos/data/mcimport/OVERLAY
mv ${MCIO}/NOIMPORT ${MCIO}/MCIMPORT
AFSS/mcimport.20090913 -l 2 OVERLAY
mv ${MCIO}/MCIMPORT ${MCIO}/NOIMPORT

Need to change PID locking to a timeout.
Need to add average srmcp data rate report.
Need to add STOP file support.
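A rough sketch of the three items above, for later use in mcimport. The function names, the lock file location and the age limit are assumptions made for illustration, not the current mcimport code. It assumes GNU stat and a POSIX shell.

```shell
# Hypothetical helpers; names and arguments are illustrative only.

# STOP file support: succeed if a STOP file exists in the import directory.
stop_requested() {
    [ -e "$1/STOP" ]
}

# Lock with a timeout instead of a PID check.
# Usage: take_lock LOCKFILE MAXAGE_SECONDS
# Fails if a lock younger than MAXAGE exists; reclaims an older (stale) one.
take_lock() {
    local lock="$1" maxage="$2" now mtime
    if [ -e "${lock}" ]; then
        now=$(date +%s)
        mtime=$(stat -c %Y "${lock}") || return 1   # GNU stat assumed
        [ $(( now - mtime )) -lt "${maxage}" ] && return 1
        rm -f "${lock}"
    fi
    # noclobber makes the create fail if another copy got there first.
    ( set -o noclobber ; echo $$ > "${lock}" ) 2>/dev/null
}

# Average srmcp rate in MB/sec, from total MBytes and elapsed seconds.
avg_rate() {
    awk -v mb="$1" -v sec="$2" \
        'BEGIN { if (sec > 0) printf "%.2f\n", mb / sec; else print "0.00" }'
}
```

In the copy loop, the idea would be to call stop_requested before each file, and to call take_lock once at startup with a limit somewhat longer than a normal pass. For example, avg_rate 700 150 prints 4.67.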
=============================================================================
2009 09 12
=============================================================================

#########
# ADMIN #
#########

Added two useful links to dhmain.20090912.html

REX downtime calendar
    http://sites.google.com/a/fnal.gov/cd-scheduled-downtimes/

FEF host status including kernel status
    http://fefweb.fnal.gov/faultlog/nodeview.php?exp=MINOS&cluster=MINOS-CLUSTER

=============================================================================
2009 09 11
=============================================================================

FRIDAY - after DCache pool upgrades
done > transfer a pool to RawDataWritePools
done > restart mcimport
done > do saddmc/saddreco for missing daikon_04
done > restart write of mcnear

#############
# MDSUM_LOG #
#############

The last global summary of files in /minos/data was run on May 18.
    http://www-numi.fnal.gov/computing/dh/mdsum/2009/05/18.txt
This was suspended due to the ongoing severe Bluearc problems.
The scan is done by the kreymer/minos/scripts/mdsum_log script.

We need a current summary, so I started this script by hand
around 22:26 CDT today, Friday Sep 11 2009.

After 10 minutes, I see no slowdown in /grid/data access.
    http://www-numi.fnal.gov/computing/dh/bluearc/rates/minos27/minos27_20090911.png

This completed at Mon Sep 14 13:31:04
About 63 hours elapsed time.

###########
# SERVICE #
###########

Date: Fri, 11 Sep 2009 17:11:17 -0500 (CDT)
Request INC000000010382 requested by you has been submitted.
Status: New
Summary: Service Desk Application failure - broadcast
Notes:
On 8/31/2009 05:00:00 AM, a Broadcast was put out
PBB000000000009
Updated Interface for submission of requests to the Service Desk
Attachment file Requester Console Updates.docx

This was invisible, because the scrolled window shows only
the two older broadcasts, and this was the third broadcast.
You have to look closely to see that there are three broadcasts.
The content of the broadcast was a Microsoft Word XML format attachment, which is not readable by many of the Servicedesk users.. PDF would have been much more useful. Better yet, a short text summary of the important new features, with a URL to an HTML document readable by everyone. _______________________________________________________________ Date: Mon, 14 Sep 2009 15:48:23 -0500 (CDT) pdf version has been attached instead of word doc. Will include link to computer web site for future broadcasts _______________________________________________________________ _______________________________________________________________ ############ # MCIMPORT # ############ Assessing D07 input data, needs import. find /minos/data/mcimport/OVERLAY/mcin -type f -name \*reroot.root | wc -l 8164 du -sm /minos/data/mcimport/OVERLAY/mcin 2774406 /minos/data/mcimport/OVERLAY/mcin that's 2.8 TBytes of files. MINOS26 > du -sm /minos/data/mcimport/OVERLAY/mcin/* 839 /minos/data/mcimport/OVERLAY/mcin/DUP 23648 /minos/data/mcimport/OVERLAY/mcin/L010000_overlay_D07_r1i209 185175 /minos/data/mcimport/OVERLAY/mcin/L010000_overlay_D07_r1i225 15404 /minos/data/mcimport/OVERLAY/mcin/L010000_overlay_D07_r1i232 99609 /minos/data/mcimport/OVERLAY/mcin/L010000_overlay_D07_r1i259 12618 /minos/data/mcimport/OVERLAY/mcin/L010000_overlay_D07_r2i209 13446 /minos/data/mcimport/OVERLAY/mcin/L010000_overlay_D07_r2i225 247086 /minos/data/mcimport/OVERLAY/mcin/L010000_overlay_D07_r2i232 4825 /minos/data/mcimport/OVERLAY/mcin/L010000_overlay_D07_r2i259 48577 /minos/data/mcimport/OVERLAY/mcin/L010000_overlay_D07_r2i300 51041 /minos/data/mcimport/OVERLAY/mcin/L010170_r1 88568 /minos/data/mcimport/OVERLAY/mcin/L010185_overlay_D07_r1i124 122887 /minos/data/mcimport/OVERLAY/mcin/L010185_overlay_D07_r1i191 180314 /minos/data/mcimport/OVERLAY/mcin/L010185_overlay_D07_r1i213 140560 /minos/data/mcimport/OVERLAY/mcin/L010185_overlay_D07_r1i224 51612 /minos/data/mcimport/OVERLAY/mcin/L010185_overlay_D07_r1i232 30194 
/minos/data/mcimport/OVERLAY/mcin/L010185_overlay_D07_r1i243 105799 /minos/data/mcimport/OVERLAY/mcin/L010185_overlay_D07_r1i257 60268 /minos/data/mcimport/OVERLAY/mcin/L010185_overlay_D07_r1i282 44112 /minos/data/mcimport/OVERLAY/mcin/L010185_overlay_D07_r2i124 50850 /minos/data/mcimport/OVERLAY/mcin/L010185_overlay_D07_r2i191 59638 /minos/data/mcimport/OVERLAY/mcin/L010185_overlay_D07_r2i213 91709 /minos/data/mcimport/OVERLAY/mcin/L010185_overlay_D07_r2i224 181187 /minos/data/mcimport/OVERLAY/mcin/L010185_overlay_D07_r2i232 181602 /minos/data/mcimport/OVERLAY/mcin/L010185_overlay_D07_r2i243 43113 /minos/data/mcimport/OVERLAY/mcin/L010185_overlay_D07_r2i257 27687 /minos/data/mcimport/OVERLAY/mcin/L010185_overlay_D07_r2i282 21938 /minos/data/mcimport/OVERLAY/mcin/L010185_overlay_D07_r2i303 42924 /minos/data/mcimport/OVERLAY/mcin/L010185_overlay_D07_r2i324 18463 /minos/data/mcimport/OVERLAY/mcin/L010185_overlay_D07_r3i124 43273 /minos/data/mcimport/OVERLAY/mcin/L010185_overlay_D07_r3i191 10960 /minos/data/mcimport/OVERLAY/mcin/L010185_overlay_D07_r3i213 4941 /minos/data/mcimport/OVERLAY/mcin/L010185_overlay_D07_r3i224 4929 /minos/data/mcimport/OVERLAY/mcin/L010185_overlay_D07_r3i232 10000 /minos/data/mcimport/OVERLAY/mcin/L010185_overlay_D07_r3i243 68425 /minos/data/mcimport/OVERLAY/mcin/L010185_overlay_D07_r3i257 102297 /minos/data/mcimport/OVERLAY/mcin/L010185_overlay_D07_r3i282 207972 /minos/data/mcimport/OVERLAY/mcin/L010185_overlay_D07_r3i303 173601 /minos/data/mcimport/OVERLAY/mcin/L010185_overlay_D07_r3i324 40054 /minos/data/mcimport/OVERLAY/mcin/L010200_r1 1 /minos/data/mcimport/OVERLAY/mcin/dcache $ du -sm /grid/data/minos/users/wingmc/* 1 /grid/data/minos/users/wingmc/L010185_overlay_D07_r1i124 1 /grid/data/minos/users/wingmc/L010185_overlay_D07_r1i191 1 /grid/data/minos/users/wingmc/L010185_overlay_D07_r1i213 1 /grid/data/minos/users/wingmc/L010185_overlay_D07_r1i224 1 /grid/data/minos/users/wingmc/L010185_overlay_D07_r1i232 1 
/grid/data/minos/users/wingmc/L010185_overlay_D07_r1i243 1 /grid/data/minos/users/wingmc/L010185_overlay_D07_r1i257 1 /grid/data/minos/users/wingmc/L010185_overlay_D07_r1i282 17117 /grid/data/minos/users/wingmc/L010185_overlay_D07_r21224 1 /grid/data/minos/users/wingmc/L010185_overlay_D07_r21232 1 /grid/data/minos/users/wingmc/L010185_overlay_D07_r21243 8179 /grid/data/minos/users/wingmc/L010185_overlay_D07_r21257 13412 /grid/data/minos/users/wingmc/L010185_overlay_D07_r2i282 10619 /grid/data/minos/users/wingmc/L010185_overlay_D07_r2i303 15843 /grid/data/minos/users/wingmc/L010185_overlay_D07_r2i324 10915 /grid/data/minos/users/wingmc/L010185_overlay_D07_r3i213 4927 /grid/data/minos/users/wingmc/L010185_overlay_D07_r3i224 4933 /grid/data/minos/users/wingmc/L010185_overlay_D07_r3i232 9979 /grid/data/minos/users/wingmc/L010185_overlay_D07_r3i243 48286 /grid/data/minos/users/wingmc/L010185_overlay_D07_r3i257 70825 /grid/data/minos/users/wingmc/L010185_overlay_D07_r3i282 108068 /grid/data/minos/users/wingmc/L010185_overlay_D07_r3i303 94297 /grid/data/minos/users/wingmc/L010185_overlay_D07_r3i324 107786 /grid/data/minos/users/wingmc/L100200_r1 140782 /grid/data/minos/users/wingmc/L150200_r2 189345 /grid/data/minos/users/wingmc/L250200_r1 188845 /grid/data/minos/users/wingmc/L250200_r2 1 /grid/data/minos/users/wingmc/mcin $ du -sm /grid/data/minos/users/wingmc 1044152 /grid/data/minos/users/wingmc Directory problems some are named _r21* instead of _r2i* $ ls -d /grid/data/minos/users/wingmc/*r21* /grid/data/minos/users/wingmc/L010185_overlay_D07_r21224 /grid/data/minos/users/wingmc/L010185_overlay_D07_r21232 /grid/data/minos/users/wingmc/L010185_overlay_D07_r21243 /grid/data/minos/users/wingmc/L010185_overlay_D07_r21257 There are many files of 0 length : $ find /grid/data/minos/users/wingmc -size 0 | cut -f 7 -d / | sort -u L010185_overlay_D07_r21224 L010185_overlay_D07_r21232 L010185_overlay_D07_r21243 L010185_overlay_D07_r21257 Checking empty duplicate directories for 
IN in 124 191 213 224 232 243 257 282 ; do
printf " INTENSITY ${IN} \n "
find /grid/data/minos/users/wingmc/L010185_overlay_D07_r1i${IN} -type f
ls -l /minos/data/mcimport/OVERLAY/mcin/L010185_overlay_D07_r1i${IN}
done

for IN in 232 243 ; do
printf " INTENSITY ${IN} \n "
find /grid/data/minos/users/wingmc/L010185_overlay_D07_r21${IN} -type f
ls -l /minos/data/mcimport/OVERLAY/mcin/L010185_overlay_D07_r2i${IN}
done

Go ahead and move the L010170_r1 files up to mcin.

cd /minos/data/mcimport/OVERLAY/mcin
CONFIG=L010170_r1
for FILE in `find ${CONFIG} -type f` ; do
mv ${FILE} `basename ${FILE}`
done

##########
# DCACHE #
##########

Need to add another pool to RawDataWritePools
Review likely candidates, with

MINOS26 > dcache/datasets m
dcache/datasets: line 144: [: too many arguments
Run Fri Sep 11 12:41:34 CDT 2009
Data from
107K Fri Sep 11 06:41:38 CDT 2009 r-minos-stkendca21a-3.files 1737/1933
Fri Sep 11 06:49:02 CDT 2009 r-minos-stkendca22a-3.files 949/1933
Fri Sep 11 06:57:23 CDT 2009 r-minos-stkendca23a-3.files 1737/1933
Fri Sep 11 07:05:17 CDT 2009 r-minos-stkendca25a-1.files 231/1933
Thu Sep 10 06:44:23 CDT 2009 old/r-minos-stkendca26a-3.files.1 0/0
Fri Sep 11 07:08:58 CDT 2009 r-minos-stkendca27a-2.files 2103/2900

These list free/total size of each pool

Daq pools are
Fri Sep 11 06:42:46 CDT 2009 w-raw-minos-stkendca21a-1.files 0/1933
Fri Sep 11 06:50:09 CDT 2009 w-raw-minos-stkendca22a-1.files 0/1933
Fri Sep 11 07:00:40 CDT 2009 w-raw-minos-stkendca24a-1.files 0/1933
Thu Sep 10 06:44:23 CDT 2009 old/w-raw-minos-stkendca26a-1.files.1 0/0

Rats, looks like 26a is offline ?
Nope, just not producing file listings, since before 6 Sep.

The obvious choices would be one of 21a-3 or 23a-3,
which have the most free space among the 1933-size read pools.
Picking 23a-3 would keep the file split over more servers.
Will request that one :
_______________________________________________________________________
Date: Fri, 11 Sep 2009 12:58:42 -0500 (CDT)
Request INC000000010333 requested by you has been submitted.
Status: New
Summary: Move 23a-3 pool to RawDataWritePools
Notes:
SSA primary - dcache-admin@fnal.gov

We are keeping all Minos raw data files on disk by having plenty of space
in the RawDataWritePools pool group.
But this pool group is full.

Please move r-minos-stkendca23a-3
from MinosPrdReadPools to RawDataWritePools.

This will add another 2 TBytes to the group,
enough to get us through another year or two of running.

There are relatively few files in 23a-3,
so this should have little operational impact.

Please contact me at minos-data@fnal.gov if further discussion is needed.
This move can wait till Monday, if this is prudent.
_______________________________________________________________________
Date: Tue, 15 Sep 2009 10:04:03 -0500 (CDT)
Status: Pending
_______________________________________________________________________
Date: Fri, 18 Sep 2009 09:43:31 -0500
From: Marty Buchaus

We're looking for permission to make the disruptive change.

--- Comment #2 from Alex Kulyavtsev 2009-09-15 18:11:20 ---
Waiting for change approval.

The Pool manager configuration shall be reloaded with dcache admin command
to PoolManager in dcache admin interface :
> psu reload -yes

This command can be rather disruptive, do not do it right away.

To implement proposed change the following commands can be issued
to Poolmanager instead :
> psu removefrom pgroup MinosPrdReadPools r-minos-stkendca23a-3
> psu addto pgroup RawDataWritePools r-minos-stkendca23a-3
_______________________________________________________________________
Date: Fri, 18 Sep 2009 14:59:30 +0000 (GMT)
From: Arthur Kreymer

Presently, pool r-minos-stkendca23a-3 shows no activity.
No Movers, Restores, or P2P activity.
Therefore, the disruption is only theoretical.

If this assessment is correct, go ahead and make this change
whenever you like, then send mail to minos-data.
_______________________________________________________________________ Date: Fri, 18 Sep 2009 13:59:24 -0500 (CDT) Status: In Progress _______________________________________________________________________ At about 16:44 / 21:44 UTC, I see 109 MB precious space in r-minos-stkendca23a-3 _______________________________________________________________________ Date: Fri, 20 Nov 2009 12:02:38 -0600 (CST) Status: Completed ------- Comment #3 From Alex Kulyavtsev 2009-09-18 11:50:34 [reply] ------- - updated PoolManager.conf on head node with patch in attachment applied (r1.52). ############ # SADDRECO # ############ Need to pick up the L250200N D04 files mentioned below. minfarm@minos27 SRLOG=/minos/data/minfarm/ROUNTMP/LOG/saddreco/daikon_04/dogwood1 SC=/grid/fermiapp/minos/minfarm/scripts ls ${SRLOG}/near_L250* FARM04 > ls ${SRLOG} | grep L250 near_L250200N_i100.log near_L250200N_i114.log near_L250200N_i130.log near_L250200N_i152.log near_L250200N_i194.log ${SC}/saddreco -m daikon_04 -d near -r dogwood1 -p L250200N_${INT} --verify INTS=`ls ${SRLOG} | grep L250 | cut -f 3 -d _ | cut -f 1 -d .` printf "${INTS}\n" i100 i114 i130 i152 i194 for INT in ${INTS} ; do SLOG=${SRLOG}/near_L250200N_${INT}.log ${SC}/saddreco -m daikon_04 -d near -r dogwood1 -p L250200N_${INT} --declare \ 2>&1 | tee -a ${SLOG} done FINISHED Fri Sep 11 21:48:50 2009 ______________________________________________________________________ Date: Fri, 11 Sep 2009 22:20:57 +0000 (GMT) From: Arthur Kreymer To: Howard Rubin Cc: Rashid Mehdiyev , minos-data@fnal.gov Subject: Re: Some utilities > > Rashid Mehdiyev wrote: > > > > in another example, I was trying to interrogate a mc file, but: > > > > > > cd /pnfs/minos/mcout_data/dogwood1/near/daikon_04/L250200N_i130/sntp_data/701 > > > sam_find.mc n13037010_0010_L250200N_D04_i130.sntp.dogwood1.root > > > n13037010_0010_L250200N_D04_i130.sntp.dogwood1.root not found I have brought the mcin and mcout declarations up to date for 
/pnfs/minos/mcout_data/dogwood1/near/daikon_04/L250200N_*

Concatenation should be resumed for these files.

##########
# SADDMC #
##########

removed saddmc symlink in kreymer/minos/scripts,
not used by mcimport, and is out of date.
10:01

#########
# ADMIN #
#########

To : minos_software_discussion@fnal.gov, minos_batch@fnal.gov,
     minos_sim@fnal.gov, lueking@fnal.gov
Cc :
Attchmnt:
Subject : Minos Cluster and Servers - no kernel updated this month.
----- Message Text -----
All Minos Cluster and Server systems had new kernels installed on Aug 20.

Because there are no urgent security problems this month,
we are skipping the time reserved for upgrades Thursday Sep 17.

We will be shutting down for kernel upgrades next month, Thursday Oct 13.

############
# MCIMPORT #
############

After yesterday's DCache upgrades,

$ less /minos/data/mcimport/jcoelho/log/mcimport.log

OK - purging 306 MCIN files ?
Thu Sep 10 16:54:15 CDT 2009
PURGED n13037110_0000_L250200N_D04_i130.reroot.root
...
MCIN processing 301 files Fri Sep 11 05:45:08 CDT 2009
MCIN configuration n1303 _L250200N_D04_i165.reroot.root
SRMCPed n13037120_0000_L250200N_D04_i165.reroot.root
...

No errors as of 08:22
Files are taking about 150 seconds to write, 700 MB each,
that's about 5 MB/sec, that's poor !

08:35
$ time sum STAGE/jcoelho/mcin/n13037129_0030_L250200N_D04_i194.reroot.root

real 4m15.660s
user 0m4.567s
sys 0m0.730s

real 5m4.139s
user 0m4.591s
sys 0m0.850s

real 5m9.132s
user 0m4.715s
sys 0m0.929s

#######
# WEB #
#######

See email , filed in cdweb

Date: Thu, 10 Sep 2009 17:31:21 -0500
From: Laura Mengel
Reply-To: webteam@fnal.gov
To: central-web-mgrs@fnal.gov
Cc: webteam@fnal.gov, gaines@fnal.gov
Subject: Web changes needed if you use KCA authentication on your web site

Verify that we do not have .htaccess files needing to change
    CN=Kerberized CA
to
    CN=Kerberized CA HSM

cd /afs/fnal.gov/files/expwww/numi/html
find . -name .htaccess -exec grep 'Secure ByPassword' {} \;
find .
-name .htaccess -exec grep 'Kerberized' {} \; find: ./minwork/daqlogs: Permission denied find: WARNING: Hard link count is wrong for .: this may be a bug in your filesystem driver. Automatically turning on find's -noleaf option. Earlier results may have failed to include directories that should have been searched. (message only on ark.fnal.gov ) ============================================================================= 2009 09 10 ============================================================================= ######### # ADMIN # ######### Date: Thu, 10 Sep 2009 14:11:36 -0500 From: Lee Lueking To: Jason Allen Cc: Arthur Kreymer , run2-sys@fnal.gov, minos-admin@fnal.gov Subject: Re: Sep 17 Minos shutdown details ? Hi Art, I talked with Jason this morning, the official CD policy is that kernels should not be more than 60 days old. We agreed that we can wait until next month's Maint. day, Oct. 15, for a reboot. From now on, we will adhere to the 60 day rule, however there may be exceptions. Jason's group provides a convenient way to check the update dates for machines at the following url: http://fefweb.fnal.gov/faultlog/clusters.php Select your experiment and then "Kernel Report for"... and you can see when their records indicate the machine needs to be updated. Lee __________________________________________________________________ Followed the link, most kernels are unknown. Fermilinux kernels : from fermilinux.fnal.gov 5.x Latest Kernel: 2.6.18-128.7.1.el5 Release Date:August 26, 2009 Older Kernel: 2.6.18-128.4.1.el5 Release Date:August 12, 2009 and from my desktop, 2.6.18-128.1.14.el5 Jun 16 2.6.18-128.1.6.el5 Apr 1 4.x Latest Kernel: 2.6.9-89.0.9.EL Release Date: August 26, 2009 Older Kernel: 2.6.9-89.0.7.EL Release Date: August 19, 2009 Older Kernel: 2.6.9-89.0.3.EL Release Date: July 8, 2009 3.x Latest Kernel: 2.4.21-60.EL Release Date: Sep. 2, 2009 Older Kernel: 2.4.21-58.EL Release Date: Jan. 
7, 2009 Getting actual kernels from ganglia, CDF ILP, for example, node kernel chart inst newer fcdflnx1 2.6.9-89.0.7.ELsmp Unknown Aug 19 Aug 26 fcdflnx2 2.6.9-89.0.7.ELsmp Unknown fcdflnx3 2.6.18-128.1.14.el5 Expired Jun 16 Aug 12 fcdflnx4 2.4.21-58.ELsmp Expired Jan 7 Sep 02 fcdflnx9 2.4.21-58.ELsmp Expired CDF SAM - all Expired CDFSAMWEB 2.4.21-58.ELsmp Jan 7 FCDFSAM1 2.6.18-128.1.14.el5 Jun 16 FCDFSAM2 2.6.18-128.1.14.el5 FCDFSAM3 2.6.18-128.1.14.el5 FCDFSAM4 2.6.18-128.1.14.el5 FCDFSAM5 2.6.18-128.1.14.el5 FCDFSAM7TST 2.6.18-128.1.14.el5 FCDFSAM8TST 2.6.18-128.1.14.el5 FCDFSAM9TST 2.6.18-128.1.14.el5 /4/9 Expired ( ############ # MCIMPORT # ############ Picked up remaining mcin declares, mindata@minos27. $ scp -r -c blowfish minos26:.grid .grid $ ln -s /afs/fnal.gov/files/home/room1/kreymer/minos/scripts/ AFSS $ ln -s AFSS/saddmc.20071120 saddmc PD=/pnfs/minos/mcin_data/near/daikon_04 $ SOCFILE=${HOME}/.grid/samdbs_prd $ export SAM_ORACLE_CONNECT=`cat ${SOCFILE}` $ . ./setups.sh $ setup sam IDS=`ls $PD | grep _i` for IDIR in ${IDS}; do DS=`ls ${PD}/${IDIR}` for DIR in ${DS} ; do ~/saddmc --verify daikon_04 near/daikon_04/${IDIR}/${DIR} done ; done Treating 154 files in /pnfs/minos/mcin_data/near/daikon_04/L250200N_i114/700 Treating 149 files in /pnfs/minos/mcin_data/near/daikon_04/L250200N_i114/705 Treating 61 files in /pnfs/minos/mcin_data/near/daikon_04/L250200N_i130/700 Treating 153 files in /pnfs/minos/mcin_data/near/daikon_04/L250200N_i130/701 Treating 60 files in /pnfs/minos/mcin_data/near/daikon_04/L250200N_i130/705 Treating 148 files in /pnfs/minos/mcin_data/near/daikon_04/L250200N_i130/706 Treating 154 files in /pnfs/minos/mcin_data/near/daikon_04/L250200N_i152/701 Treating 153 files in /pnfs/minos/mcin_data/near/daikon_04/L250200N_i152/706 Treating 93 files in /pnfs/minos/mcin_data/near/daikon_04/L250200N_i165/702 Treating 92 files in /pnfs/minos/mcin_data/near/daikon_04/L250200N_i165/707 Treating 216 files in 
/pnfs/minos/mcin_data/near/daikon_04/L250200N_i194/702 Treating 298 files in /pnfs/minos/mcin_data/near/daikon_04/L250200N_i194/703 Treating 211 files in /pnfs/minos/mcin_data/near/daikon_04/L250200N_i194/707 NIDS=`ls $PD | grep L250200N_` for IDIR in ${NIDS}; do DS=`ls ${PD}/${IDIR}` for DIR in ${DS} ; do ~/saddmc --declare daikon_04 near/daikon_04/${IDIR}/${DIR} done ; done FINISHED Thu Sep 10 22:46:23 2009 ########## # DCACHE # ########## Date: Thu, 10 Sep 2009 16:33:17 -0500 From: ssa-group@fnal.gov Subject: Announcement: Service restoration for dCache on stken for a duration of none The public Dcache Pool nodes have all been updated with the reinstallation of the Raid controller for OS Drives. and are all to the same build and HW level. Stability has been greatly increased after changing the heavy offender nodes last week. _________________________________________________________________________ Restarted mcimport at 15:54. Missed the 16:37 normal cron slot, so started this with a temporary cron $ crontab MAILTO=minos-data@fnal.gov 54 16 * * * ${HOME}/mcimport -c ALL Then loaded the normal crontab crontab crontab.dat _________________________________________________________________________ To : minos_batch@fnal.gov, minos_software_development@fnal.gov, minos_sim@fnal.gov Cc : Attchmnt: Subject : Announcement: Service restoration for dCache on stken for a duration of none (fwd) ----- Message Text ----- DCache pools have been upgraded. I have restarted the mcimport cron job . Let's look for stability overnight, then resume full scale processing. ... _________________________________________________________________________ ######### # ADMIN # ######### Date: Thu, 10 Sep 2009 18:40:33 +0000 (GMT) From: Arthur Kreymer To: run2-sys@fnal.gov Cc: lueking@fnal.gov, minos-admin@fnal.gov Subject: Sep 17 Minos shutdown details ? The cd-scheduled-downtimes calendar page has come back online. I have heard a rumor via Lee that FEF plans to reboot something Sep 17. 
Lacking any specifics, I must treat this as a rumor.

If FEF intends to reboot Minos systems on Sep 17,
we need to hear a specific plan ( times, specific work to be done . )
Please communicate this to minos-admin.

As far as I know, none of the Minos systems are vulnerable to known
security problems, because we are not running any of the problematic
kernel modules.

Rebooting all servers is very disruptive to the experiment,
involving intervention by several people within Minos.
We need at least a week of lead time.

#########
# EMAIL #
#########

Dear employee,

As part of the Tune IT Up program, we are improving the security of the
passwords you use for work. Starting today, if you reset your password
for your e-mail account on the IMAP server, it will need to be at least
10 characters long. Passwords must be at least this long to fulfill DOE
password complexity requirements.

Information on changing IMAP password
    http://computing.fnal.gov/xms/Services/Getting_Services/Imap_Password_Change

A strong password should include a combination of letters, symbols and
numbers. This password should be different from your Fermi Windows domain
or Kerberos password.

On Sept. 15, all IMAP users who have not chosen a new, 10-character
password will be required to reset their passwords.

For more information, see the Tune IT Up site at www.fnal.gov/tuneitup
or contact the Service Desk.

Thank you for your help making IT at Fermilab safer, stronger and better.
The Tune IT Up committee, Fermilab Service Desk 630-840-2345 ######### # ADMIN # ######### The downtimes calendar is back, at http://sites.google.com/a/fnal.gov/cd-scheduled-downtimes/ So far, it shows only the generic entry FEF: EAG, Neutrino When Thursday, Sep 17, 2009 Description Standard EAG and Neutrino Downtimes ============================================================================= 2009 09 09 ============================================================================= ############ # MCIMPORT # ############ Rashid identified many near daikon04 mcin files not in SAM. This was due to untimely SRMCP failures. Added a couple of directories : . ./setups.sh # get setup and sam setup sam SOCFILE=${HOME}/.grid/samdbs_prd $ ~/saddmc --verify daikon_04 near/daikon_04/L250200N_i100/700 MODE verify saddmc 20071120 processing mcin_data STARTED Thu Sep 10 03:20:44 2009 Declaring to SAM v8_2_0 prd daikon_04 verify 999999 Scanning /pnfs/minos/mcin_data/near/daikon_04/L250200N_i100 ['700'] Needed /pnfs/minos/mcin_data/near/daikon_04/L250200N_i100/700 Added sam tape location /pnfs/minos/mcin_data/near/daikon_04/L250200N_i100/700 Treating 59 files in /pnfs/minos/mcin_data/near/daikon_04/L250200N_i100/700 OK - verified n13037001_0016_L250200N_D04_i100.reroot.root /pnfs/minos/mcin_data/near/daikon_04/L250200N_i100/700(von543.275) ... Needed 59 files, Rate was 4.318 STARTED Thu Sep 10 03:20:44 2009 FINISHED Thu Sep 10 03:20:59 2009 $ ~/saddmc --declare daikon_04 near/daikon_04/L250200N_i100/700 ... Needed 59 files, Rate was 3.235 STARTED Thu Sep 10 03:21:31 2009 FINISHED Thu Sep 10 03:21:50 2009 I have fixed 705 also, where files were needed. $ ~/saddmc --declare daikon_04 near/daikon_04/L250200N_i100/710 Needed 31 files, Rate was 3.063 STARTED Thu Sep 10 03:24:03 2009 FINISHED Thu Sep 10 03:24:14 2009 ######## # FARM # ######## Investigate pending write, possible duplicate ? 
MINOS26 > less ROUNTMP/LOG/2009-08/dogwood1mcnear.log OOPS - Size mismatch , BAILING -rw-r--r-- 1 42411 e875 1205437598 Aug 26 20:20 /pnfs/minos/mcout_data/dogwood1/near/daikon_04/L250200N_i100/cand_data/700/ n13037002_0003_L250200N_D04_i100.cand.dogwood1.root MINOS26 > dds /minos/data2/minfarm/WRITE/n13037002_0003_L250200N_D04_i100.cand.dogwood1.root -rw-rw-r-- 1 minospro e875 1205435866 Sep 3 19:42 /minos/data2/minfarm/WRITE/n13037002_0003_L250200N_D04_i100.cand.dogwood1.root WFS=`ls /minos/data2/minfarm/WRITE | grep cand.dogwood1.root` 25 files PD=/pnfs/minos/mcout_data/dogwood1/near/daikon_04/L250200N_i100/cand_data for FILE in ${WFS} ; do SD=`echo ${FILE} | cut -c 6-8` [ -r "${PD}/${SD}/${FILE}" ] && \ mv /minos/data2/minfarm/WRITE/${FILE} /minos/data2/minfarm/DUP/${FILE} #ls -l /minos/data2/minfarm/DUP/${FILE} #echo ${FILE} #ls -l /minos/data2/minfarm/WRITE/${FILE} #ls -l ${PD}/${SD}/${FILE} done These were written to PNFS Aug 26 and 27 These were written to WRITE Sep 3 and 4 n13037002_0003_L250200N_D04_i100.cand.dogwood1.root n13037002_0004_L250200N_D04_i100.cand.dogwood1.root n13037002_0005_L250200N_D04_i100.cand.dogwood1.root n13037002_0006_L250200N_D04_i100.cand.dogwood1.root n13037002_0007_L250200N_D04_i100.cand.dogwood1.root n13037002_0008_L250200N_D04_i100.cand.dogwood1.root n13037002_0009_L250200N_D04_i100.cand.dogwood1.root n13037002_0010_L250200N_D04_i100.cand.dogwood1.root n13037002_0011_L250200N_D04_i100.cand.dogwood1.root n13037002_0012_L250200N_D04_i100.cand.dogwood1.root n13037002_0013_L250200N_D04_i100.cand.dogwood1.root n13037002_0014_L250200N_D04_i100.cand.dogwood1.root n13037002_0015_L250200N_D04_i100.cand.dogwood1.root n13037002_0016_L250200N_D04_i100.cand.dogwood1.root n13037002_0017_L250200N_D04_i100.cand.dogwood1.root n13037002_0018_L250200N_D04_i100.cand.dogwood1.root n13037002_0019_L250200N_D04_i100.cand.dogwood1.root n13037002_0020_L250200N_D04_i100.cand.dogwood1.root n13037002_0021_L250200N_D04_i100.cand.dogwood1.root 
n13037002_0022_L250200N_D04_i100.cand.dogwood1.root
n13037002_0023_L250200N_D04_i100.cand.dogwood1.root
n13037002_0024_L250200N_D04_i100.cand.dogwood1.root
n13037002_0025_L250200N_D04_i100.cand.dogwood1.root
n13037002_0026_L250200N_D04_i100.cand.dogwood1.root
n13037002_0027_L250200N_D04_i100.cand.dogwood1.root

Moved all these to DUP.

###########
# ROUNDUP #
###########

Adjusted to allow for HAVE subruns which are subsequently in the
BADRUNS, NOSPILL or SUPPRESSED list.

cd /grid/fermiapp/minos/minfarm/scripts
ln -s roundup.20090909 roundup # was roundup.20090806
date
Wed Sep 9 12:19:41 CDT 2009

Allow NET to be greater than or equal to RAWN, to handle HAVE subruns
which are BAD, NOSPILL or SUPPRESSED

Corrected Size mismatch message to use ls -lL,
which will correctly report candidate file sizes.

Changed rm to rm -f when purging files from WRITE,
again due to file protections, to keep the script from stalling out
waiting for user input.

Added -z option for SRM logging,
creates an .srmlog log file with details and times of SRM copies.

Corrected srmcp failure message from 'bailing' to 'breaking',
and retained the PIFL file.

########
# FARM #
########

Date: Wed, 09 Sep 2009 09:57:04 -0500
From: Rashid Mehdiyev
To: Howard Rubin
Cc: Arthur Kreymer
Subject: FD runs omitted ?

Hi Howie, Art

There is a handful of FD spill files still sitting in /farcat:

2005-04
-rw-rw-r-- 1 42411 e875 952436 Aug 24 19:58 F00030612_0005.spill.sntp.dogwood1.0.root
-rw-rw-r-- 1 42411 e875 925437 Aug 24 20:22 F00030612_0006.spill.sntp.dogwood1.0.root
-rw-rw-r-- 1 42411 e875 909272 Aug 24 17:58 F00030612_0007.spill.sntp.dogwood1.0.root

These sets have other subruns in /reco_far,
but why they were not concatenated into one file ?
2007-06 -rw-rw-r-- 1 42411 e875 3096305 Jul 30 15:45 F00038215_0007.spill.sntp.dogwood1.0.root -rw-rw-r-- 1 42411 e875 2594556 Jul 30 16:09 F00038218_0017.spill.sntp.dogwood1.0.root -rw-rw-r-- 1 42411 e875 3135180 Jul 30 16:34 F00038258_0013.spill.sntp.dogwood1.0.root 2007-07 -rw-rw-r-- 1 42411 e875 1351613 Jul 30 18:41 F00038512_0012.spill.sntp.dogwood1.0.root -rw-rw-r-- 1 42411 e875 1264615 Jul 30 18:38 F00038522_0010.spill.sntp.dogwood1.0.root -rw-rw-r-- 1 42411 e875 1433282 Jul 30 18:40 F00038525_0008.spill.sntp.dogwood1.0.root -rw-rw-r-- 1 42411 e875 1258379 Jul 30 18:41 F00038525_0012.spill.sntp.dogwood1.0.root 2007-08 -rw-rw-r-- 1 42411 e875 2298557 Aug 4 12:21 F00039337_0008.spill.sntp.dogwood1.0.root Has it happened because there were gaps in the list of subruns for corresponding run numbers ? ___________________________________________________________________________ less ROUNTMP/LOG/2009-08/dogwood1farsntp.log HAVE F00038215_.cosmic.sntp.dogwood1.0.root:28: 0000 0001 0002 0003 0004 0005 0006 0008 0009 0010 0011 0012 0013 0014 0015 0016 0017 0018 0019 0 020 0021 0022 0023 0024 0025 0026 0027 0028 BADRUNS F00038215_0002.cosmic.sntp.dogwood1.0.root F00038215_0002.0 2007-06 139 2009-07-25 01:22:58 S fnpc388 BADRUNS F00038215_0014.cosmic.sntp.dogwood1.0.root F00038215_0014.0 2007-06 139 2009-07-25 07:26:39 S fnpc373 BADRUNS F00038215_0021.cosmic.sntp.dogwood1.0.root F00038215_0021.0 2007-06 139 2009-07-25 07:25:56 S fnpc377 PEND - have 29/26 subruns for F00038215_*.cosmic.sntp.dogwood1.0.root 10 07/30 15:44 28 1 HAVE F00038218_.spill.sntp.dogwood1.0.root:23: 0000 0001 0002 0003 0004 0005 0006 0007 0008 0009 0010 0011 0012 0013 0014 0015 0016 0018 0019 00 20 0021 0022 0023 NOSPILL F00038218_0000.spill.sntp.dogwood1.0.root NOSPILL F00038218_0001.spill.sntp.dogwood1.0.root NOSPILL F00038218_0002.spill.sntp.dogwood1.0.root NOSPILL F00038218_0003.spill.sntp.dogwood1.0.root NOSPILL F00038218_0005.spill.sntp.dogwood1.0.root NOSPILL 
F00038218_0006.spill.sntp.dogwood1.0.root NOSPILL F00038218_0007.spill.sntp.dogwood1.0.root PEND - have 24/17 subruns for F00038218_*.spill.sntp.dogwood1.0.root 10 07/30 16:09 23 1 HAVE F00038258_.spill.sntp.dogwood1.0.root:23: 0000 0001 0002 0003 0004 0005 0006 0007 0008 0009 0010 0011 0012 0014 0015 0016 0017 0018 0019 00 20 0021 0022 0023 SUPPRESS F00038258_0007.cosmic.sntp.dogwood1.0.root SUPPRESS F00038258_0008.cosmic.sntp.dogwood1.0.root SUPPRESS F00038258_0009.cosmic.sntp.dogwood1.0.root SUPPRESS F00038258_0010.cosmic.sntp.dogwood1.0.root SUPPRESS F00038258_0011.cosmic.sntp.dogwood1.0.root SUPPRESS F00038258_0012.cosmic.sntp.dogwood1.0.root +SUPPRESS+ F00038258_0013.cosmic.sntp.dogwood1.0.root SUPPRESS F00038258_0014.cosmic.sntp.dogwood1.0.root SUPPRESS F00038258_0015.cosmic.sntp.dogwood1.0.root SUPPRESS F00038258_0016.cosmic.sntp.dogwood1.0.root SUPPRESS F00038258_0017.cosmic.sntp.dogwood1.0.root SUPPRESS F00038258_0018.cosmic.sntp.dogwood1.0.root SUPPRESS F00038258_0019.cosmic.sntp.dogwood1.0.root PEND - have 24/12 subruns for F00038258_*.cosmic.sntp.dogwood1.0.root 10 07/30 16:34 23 1 HAVE F00038512_.spill.sntp.dogwood1.0.root:18: 0000 0001 0002 0003 0004 0005 0006 0007 0008 0009 0010 0011 0013 0014 0015 0016 0017 0018 BADRUNS F00038512_0008.cosmic.sntp.dogwood1.0.root F00038512_0008.0 2007-07 139 2009-07-26 03:23:37 S fnpc358 PEND - have 19/18 subruns for F00038512_*.cosmic.sntp.dogwood1.0.root 10 07/30 18:40 18 1 ########### # SERVICE # ########### Date: Wed, 09 Sep 2009 10:18:24 -0500 (CDT) Request INC000000010112 requested by you has been submitted. Status: New Summary: HTML in FermilabServiceDesk email Notes: Recently, FermilabServiceDesk email has started to include both a plain text and HTML copy of each item. Please drop the HTML copies. These are cluttering up my mail box, and are not good practice. There is nothing which needs to be formatted with HTML. Removing this part of each email before archiving is a lot of work for me. 
And please drop the 99 byte Application attachment that comes with each email, such as ARNotification_601_NTS000000214873.ARTask The users have no idea what to do with these. ########### # BLUEARC # ########### FILE FRAGMENTATION Finding very few references to this in Google. Apparently it can happen, and is even common. DiskKeeper sells commercial defragmenting software. http://productionsysadmins.com/board/thread/68/ delete and re-create our FS's to combat file fragmentation with the new Stone FS we are getting closer to a solution... The next release (mid-summer?) should have de-fragmentation implemented... http://www.studiosysadmins.com/wiki/display/testing_storagetests/ performance benchmarks I think this is a typo, should be SiliconFS Whitepaper promises this soon : http://www.bluearc.com/unleash/collateral/BlueArc_WP_SiliconFS_R4.pdf Volume shrinking and defragmentation are not currently supported, but will be available soon in a future release. This is a great whitepaper, in depth design discussion ============================================================================= 2009 09 08 ============================================================================= ######### # TOPDB # ######### Created admin/mysql/scripts/topdb and topdb_log for mysql minos-mysql1 - placed the password file under ~minsoft/maint/kreymer minos-mysql2 - Had to change setups.sh to contain MINHOME=~minsoft export PRODUCTS=${MINHOME}/ups/db:`printf "${PRODUCTS}\n" | tr : \\\n | grep -v ^/afs | head -1` _____________________________________________________________________________ Date: Tue, 08 Sep 2009 23:48:16 -0500 From: Arthur Kreymer To: minos_batch@fnal.gov Cc: minosdb-support@fnal.gov Subject: minos-mysql1 database overload At today's batch meeting, we noticed records of nightly overloads of the Farm database on minos-mysql1, from about 23:00 to 01:00. This matches the time of nightly keepup running. 
We are now logging 'mysqladmin processlist' every 10 minutes, to help diagnose such issues. See http://www-numi.fnal.gov/computing/database/topdb/minos-mysql1/ The following listing shows tonight's load during keepup http://www-numi.fnal.gov/computing/database/topdb/minos-mysql1/2009/08/23.txt The load does seem to be coming from queries against BEAMMONSPILLVLD _____________________________________________________________________________ ######## # FARM # ######## rmehdi concatenated some of the TACC full dogwood1 near cosmic sntp files before we removed the prescaled files. MINOS26 > ls -ltr /pnfs/minos/reco_near/dogwood1/sntp_data ... drwxrwxr-x 1 minospro e875 512 Aug 21 15:56 2009-05 drwxrwxr-x 1 minospro e875 512 Aug 24 16:20 2009-06 drwxrwxr-x 1 minospro e875 512 Sep 7 19:01 2006-02 drwxrwxr-x 1 minospro e875 512 Sep 7 19:02 2006-06 drwxrwxr-x 1 minospro e875 512 Sep 7 19:31 2006-07 2006-02/N00009749_0009.cosmic.sntp.dogwood1.0.root 2006-06/N00010195_0019.cosmic.sntp.dogwood1.0.root 2006-07/N00010586_0015.cosmic.sntp.dogwood1.0.root 2006-07/N00010586_0023.cosmic.sntp.dogwood1.0.root These are all declared to SAM. These all came from a single input subrun. I have undeclared them, to allow a clean purge of prescaled data. sam undeclare N00009749_0009.cosmic.sntp.dogwood1.0.root sam undeclare N00010195_0019.cosmic.sntp.dogwood1.0.root sam undeclare N00010586_0015.cosmic.sntp.dogwood1.0.root sam undeclare N00010586_0023.cosmic.sntp.dogwood1.0.root Checking the log, MINOS26 > less /minos/data/minfarm/ROUNTMP/LOG/2009-09/dogwood1nearcosmic.log OK adding N00009334_0000.cosmic.sntp.dogwood1.0.root 23 This file was not written, roundup was killed with a STOP file MINOS26 > cat /minos/data/minfarm/ROUNTMP/READ/N00009334_0000.cosmic.sntp.dogwood1.0.root N00009334_0000.cosmic.sntp.dogwood1.0.root ... 
N00009334_0022.cosmic.sntp.dogwood1.0.root

CLEANUP

Remove the prescaled files per email from rmehdi,

SAMDIM=" VERSION dogwood1 \
 and DATA_TIER cand-near \
 and PHYSICAL_DATASTREAM_NAME cosmic and RUN_NUMBER < 15820 "

MINOS26 > sam list files --summaryOnly --dim="${SAMDIM}"
File Count: 20977
Average File Size: 60.33MB
Total File Size: 1.21TB
Total Event Count: 2140799714

> 15820
File Count: 2367
Average File Size: 827.12MB
Total File Size: 1.87TB
Total Event Count: 241245757

SAMDIM=" VERSION dogwood1 \
 and DATA_TIER sntp-near \
 and PHYSICAL_DATASTREAM_NAME cosmic and RUN_NUMBER < 15820 "

MINOS26 > sam list files --summaryOnly --dim="${SAMDIM}"
File Count: 1356
Average File Size: 250.45MB
Total File Size: 331.65GB
Total Event Count: 2160861354

> 15820
File Count: 144
Average File Size: 1.26GB
Total File Size: 180.99GB
Total Event Count: 219759611

___________________________________________________________________________

Date: Tue, 08 Sep 2009 22:00:27 +0000 (GMT)
From: Arthur Kreymer
To: minos_batch@fnal.gov
Cc: minos-data@fnal.gov
Subject: Plan for removal of prescaled dogwood1 near cosmic files

Here is the specific plan to remove the prescaled dogwood1 near cosmic files,
to be replaced by the unprescaled files.

1) Remove all candidates.
   There are over 0.8 TBytes of candidates per month of data,
   which would be well over 30 TBytes for the full run.
   Part of the original plan for running unprescaled near cosmics
   was to drop the candidates.

2) Remove the prescaled sntp files produced at Fermilab, RUN < 15820.
   Here is the SAM query and summary :

   SAMDIM=" VERSION dogwood1 \
    and DATA_TIER sntp-near \
    and PHYSICAL_DATASTREAM_NAME cosmic and RUN_NUMBER < 15820 "

   sam list files --summaryOnly --dim="${SAMDIM}"
   File Count: 1356
   Average File Size: 250.45MB
   Total File Size: 331.65GB
   Total Event Count: 2160861354

3) On Friday, after the DCache maintenance is completed,
   start concatenating the full cosmic sample from TACC.
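
As a sanity check on the SAM summaries quoted above, file count times average size should reproduce the reported total. The sketch below is an illustration only: total_from_avg is a hypothetical helper, and it assumes the summary uses binary units (1 GB = 1024 MB, 1 TB = 1024 GB), which the quoted numbers appear to be consistent with.

```shell
#!/bin/sh
# Cross-check a SAM summary: count x average size (MB) -> total.
# Assumption: binary units, 1 GB = 1024 MB and 1 TB = 1024 GB.
total_from_avg() {
    # $1 = file count, $2 = average file size in MB, $3 = output unit (GB or TB)
    awk -v n="$1" -v avg="$2" -v unit="$3" 'BEGIN {
        mb  = n * avg
        div = (unit == "TB") ? 1024 * 1024 : 1024
        printf "%.2f%s\n", mb / div, unit
    }'
}

total_from_avg 20977  60.33 TB   # cand-near, RUN_NUMBER < 15820
total_from_avg  2367 827.12 TB   # cand-near, the "> 15820" summary
total_from_avg  1356 250.45 GB   # sntp-near, RUN_NUMBER < 15820
```

Under that assumption the three calls give 1.21TB, 1.87TB and 331.65GB, matching the totals in the listings.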
########### # BLUEARC # ########### Increased /grid/data/e875/LOCK/stale from 10 to 30 to help with xiaobo jobs running under parrot, copying 4 GB files. That won't help much with xiaobo's jobs, using cp1, but will help other like wingmc and toner. 17:05 - increased cp1 idle timeout from 1000 to 2000 seconds ( 45 minutes ) Performance recovering, as of around 15:00 ########### # ENSTORE # ########### Date: Tue, 08 Sep 2009 13:38:58 -0500 From: Jon Bakken To: Stanley J. Naymola , Enstore Admin Cc: Jon Bakken , T1 Tier1 , Arthur Kreymer Subject: CMS tape drive on loan I have not seen the drive we loaned out to our neutrino colleagues in use for some time. Could you please reallocate that drive back to CMS's fair share. It is not urgent, but sometime this week please. Thanks. Jon __________________________________________________________________________ Date: Tue, 8 Sep 2009 19:11:59 +0000 (GMT) From: Arthur Kreymer I agree, drive utilization has been reasonable for the last few weeks, so the drive can be returned to CMS at your convenience. ########## # DCACHE # ########## Date: Tue, 08 Sep 2009 11:40:10 -0500 From: ssa-group@fnal.gov To: cdweb@fnal.gov, helpdesk@fnal.gov, oleynik@fnal.gov, stan@fnal.gov, wolbers@fnal.gov, crawdad@fnal.gov, white@fnal.gov, moibenko@fnal.gov, timur@fnal.gov, stk-users@fnal.gov, cms-t1@fnal.gov, fermigrid-announce@fnal.gov, dcache-admin@fnal.gov Subject: Announcement: Service disruption for dCache on stken for a duration of 2 hours each Several Public dcache nodes have been unstable in operation. An urgent re-install of the most used nodes was completed last week and have not had a repeat of the noticed errors. The remaining nodes however are still in need of this re-install and are scheduled: Wed Sept 9th stkendca25a 10 AM stkendca27a 1 PM stkendca18a 3 PM Thurs Sept 10th stkendca19a 10 AM stkendca20a 1 PM These nodes holds a combination of read and write pools, all data will be preserved over the re-install. 
The doors will be not available on these nodes during the re-install. Please contact SSA group if you have any questions. ######### # ADMIN # ######### Date: Tue, 08 Sep 2009 11:37:18 -0500 (CDT) Request INC000000010011 requested by you has been submitted. Status: New Summary: Minos cluster add_minos_user null name Notes: For several accounts added recently to the Minos Cluster via cmd add_minos_user, the personal name (GECOS) field is null : MINOS01 > date Tue Sep 8 11:34:39 CDT 2009 MINOS01 > ypcat passwd | grep :: cvson:KERBEROS:44279:5111::/afs/fnal.gov/files/home/room2/cvson:/usr/local/bin/tcsh hennessy:KERBEROS:5456:5111::/afs/fnal.gov/files/home/room2/hennessy:/usr/local/bin/tcsh ragomes:KERBEROS:10792:5111::/afs/fnal.gov/files/home/room2/ragomes:/usr/local/bin/tcsh This has no great operational impact, but it would be nice to correct this for these accounts, and to correct the add_minos_user script. ________________________________________________________________________ Date: Wed, 09 Sep 2009 09:04:14 -0500 (CDT) Status: Completed Full names added for accounts where they were missing. The add_minos_user script has been corrected, so this should not happen any more. Thanks for noticing this. ######### # ADMIN # ######### Ticket 9927 9928, MINOS01 > setup systools MINOS01 > cms add_minos_user cvson ARK > ssh -l minoscvs minoscvs 'adduser cvson' ########### # KSPLICE # ########### Date: Tue, 8 Sep 2009 15:11:06 +0000 (GMT) From: Arthur Kreymer To: csieh@fnal.gov, dawson@fnal.gov, jallen@fnal.gov, leininger@fnal.gov, minos-admin@fnal.gov, sam-design@fnal.gov Subject: Update Without a Reboot | Ksplice (fwd) Have you heard of this tool ? This company provides modules for critical kernel updates, so that you can deploy these without a reboot. Obviously the Upstream Vendor should provide such modules. But meanwhile, ksplice might be very valuable for use with SL / SLF. 
---------- Forwarded message ---------- Date: Sat, 05 Sep 2009 15:59:46 -0500 From: Arthur Kreymer To: "Arthur Kreymer (FNAL)" Subject: Update Without a Reboot | Ksplice Hot kernel security upgrades http://www.ksplice.com/ ============================================================================= 2009 09 07 Labor Day Holiday ============================================================================= ######## # LOCK # ######## At about 08:47 CDT 2009 09 07, I started automatic cleanup of lock logs and stale locks : mindat@minos27 echo '* * * * * /grid/fermiapp/minos/scripts/lock clean' | crontab ============================================================================= 2009 09 05 ============================================================================= ############ # DATASETS # ############ Revised to handle new pool list web format, which broke the old script. Added summary of free/total space for each pool Ran monthly summaries. RawDataWritePools is too full Fri Sep 4 06:19:32 CDT 2009 w-raw-minos-stkendca21a-1.files 0/1933 Fri Sep 4 06:36:42 CDT 2009 w-raw-minos-stkendca22a-1.files 0/1933 Sat Aug 29 06:44:31 CDT 2009 w-raw-minos-stkendca24a-1.files 0/1933 Fri Sep 4 06:50:32 CDT 2009 w-raw-minos-stkendca26a-1.files 0/1933 Files = 128598 Size = 7730 Capacity = 7734 We need to shift a pool from MinosPrdReadPool, one of Fri Sep 4 06:16:54 CDT 2009 r-minos-stkendca21a-3.files 1739/1933 Fri Sep 4 06:35:58 CDT 2009 r-minos-stkendca22a-3.files 949/1933 Fri Sep 4 06:41:18 CDT 2009 r-minos-stkendca23a-3.files 1739/1933 Fri Sep 4 06:44:02 CDT 2009 r-minos-stkendca25a-1.files 231/1933 Fri Sep 4 06:48:52 CDT 2009 r-minos-stkendca26a-3.files 1737/1933 Fri Sep 4 06:55:32 CDT 2009 r-minos-stkendca27a-2.files 2103/2900 Size = 14802 Capacity = 23314 ########### # BLUEARC # ########### Summary of scavan lock test in August. MINOS26 > grep scavan /grid/data/e875/LOCK/LOGS/200908.log | cut -f 3 -d . 
| sort -u | sort -n 0 1 6 7 12 SECS=`grep scavan /grid/data/e875/LOCK/LOGS/200908.log | cut -f 3 -d . | sort -u | sort -n` for SEC in ${SECS} ; do SCO=`grep scavan /grid/data/e875/LOCK/LOGS/200908.log \ | cut -f 3 -d . | grep ${SEC} | wc -l` printf "%4d %4d\n" ${SEC} ${SCO} done Same thing for time queued, in field 4 : SECS=`grep scavan /grid/data/e875/LOCK/LOGS/200908.log | cut -f 4 -d . | sort -u | sort -n` for SEC in ${SECS} ; do SCO=`grep scavan /grid/data/e875/LOCK/LOGS/200908.log \ | cut -f 4 -d . | grep ${SEC} | wc -l` printf "%4d %4d\n" ${SEC} ${SCO} done _________________________________________________________________________ Date: Sat, 05 Sep 2009 15:50:19 +0000 (GMT) From: Arthur Kreymer To: Steven Cavanaugh Cc: Ryan B. Patterson , minos-data@fnal.gov, gmieg@fnal.gov Subject: Re: scavan jobs loading /grid/data I am sorry not to have replied to your email of Aug 31. This is not from lack of interest, indeed you have done a lot of very interesting work, and many of these techniques can be used generally. I look forward to seeing you at the Collaboration meeting. I'll ask to add these issues to the agenda of the Core/Cal/Batch meeting. Specific issues with your August jobs : I think that your slowdowns were not caused by the cpn script, but the fact that you are running under Parrot, and in particular sometimes running 6 copies at once from the same node. The node where we saw slow transfers for your job works fine when used for single non-parrot copies. I reviewed the logs from your tests of cpn. Almost all of your locks were held for 1 or 2 seconds. The longest was 12 seconds. Here are the statistics : Secs times 0 592 1 248 6 19 7 1 12 1 Similar statistics for the amount of time queued : 0 17 1 335 2 375 3 102 4 23 5 4 6 4 So the locking function should not have doubled your file times from 10 seconds to 22 seconds. Perhaps there was a lot of overhead in invoking cpn from within the context of your process, under Parrot. Now some development thoughts. 
I admit to a general prejudice against central servers.
They have to meet strict security controls, and are another point of failure.
( And they always get stuck at an inconvenient time ;-} )

In this case, your locks are specific to each of your jobs,
which does not scale to use by the whole experiment or a broader community.

The existing cpn and lock mechanism uses only the shared file system.
The algorithm is designed to scale efficiently to the full Fermigrid,
tens of thousands of clients.

To eliminate the delays you saw, one could implement cpn as a loadable root module.

I am also intrigued by the idea of using UDP sockets to provide a shortcut,
reducing the time spent polling.
This would not require a central lock manager.
When releasing a lock, each client could consult the QUEUE
and send a UDP packet to the next client.
The clients could be polling and waiting for a packet in parallel.
At worst we would fall back to the present rather efficient polling.
At best, we would run with almost no delays.

I also like very much the things you have done to track progress of each job.
I think this can be done using the shared file system, rather than a server.
I'll try to find time to look into this.
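
The held-time and queued-time tallies quoted in the email came from the grep loops earlier in this entry. A one-pass alternative is sketched below. It relies on the same field layout the cut commands imply ( '.'-separated, seconds held in field 3, seconds queued in field 4 ), which is an assumption about the lock log format, and lock_hist is a hypothetical name. Counting with uniq -c avoids the substring matching of 'grep ${SEC}', where a pattern of 1 would also count a value of 12.

```shell
#!/bin/sh
# Tally how often each value of one '.'-separated field occurs for a user.
# Assumption: field 3 = seconds held, field 4 = seconds queued.
lock_hist() {
    # $1 = lock log file, $2 = user pattern, $3 = field number
    grep "$2" "$1" | cut -f "$3" -d . | sort -n | uniq -c |
        awk '{ printf "%4d %4d\n", $2, $1 }'
}

# Possible usage, with the log path taken from this entry :
#   lock_hist /grid/data/e875/LOCK/LOGS/200908.log scavan 3
#   lock_hist /grid/data/e875/LOCK/LOGS/200908.log scavan 4
```

The output keeps the 'value count' column order of the original printf.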
============================================================================= 2009 09 04 ============================================================================= ######## # FARM # ######## new passless files are showing up in daikon04 : cd /pnfs/minos/mcout_data/dogwood1/far/daikon_04/L010185N/sntp_data/740 MINOS26 > ls -l *.0.root | cut -f 7- -d ' ' Aug 13 00:19 f21037400_0000_L010185N_D04.sntp.dogwood1.0.root Aug 12 22:10 f21337400_0000_L010185N_D04.sntp.dogwood1.0.root Aug 12 22:10 f21337401_0000_L010185N_D04.sntp.dogwood1.0.root Aug 12 22:10 f21337402_0000_L010185N_D04.sntp.dogwood1.0.root Aug 12 22:10 f21337403_0000_L010185N_D04.sntp.dogwood1.0.root Aug 12 22:10 f21337404_0000_L010185N_D04.sntp.dogwood1.0.root Aug 12 22:10 f21337405_0000_L010185N_D04.sntp.dogwood1.0.root Aug 12 22:10 f21337406_0000_L010185N_D04.sntp.dogwood1.0.root Aug 12 22:10 f21337407_0000_L010185N_D04.sntp.dogwood1.0.root Aug 12 22:11 f21337408_0000_L010185N_D04.sntp.dogwood1.0.root Aug 12 22:11 f21337409_0000_L010185N_D04.sntp.dogwood1.0.root Aug 13 06:21 f21437400_0000_L010185N_D04.sntp.dogwood1.0.root MINOS26 > ls -l *wood1.root | cut -f 7- -d ' ' Aug 19 15:47 f21037400_0000_L010185N_D04.sntp.dogwood1.root Aug 19 22:04 f21337400_0000_L010185N_D04.sntp.dogwood1.root Aug 19 22:04 f21337401_0000_L010185N_D04.sntp.dogwood1.root Aug 19 22:04 f21337402_0000_L010185N_D04.sntp.dogwood1.root Aug 19 22:04 f21337403_0000_L010185N_D04.sntp.dogwood1.root Aug 19 22:04 f21337404_0000_L010185N_D04.sntp.dogwood1.root Aug 19 22:04 f21337405_0000_L010185N_D04.sntp.dogwood1.root Aug 19 22:04 f21337406_0000_L010185N_D04.sntp.dogwood1.root Aug 19 22:04 f21337407_0000_L010185N_D04.sntp.dogwood1.root Aug 19 22:04 f21337408_0000_L010185N_D04.sntp.dogwood1.root Aug 19 22:04 f21337409_0000_L010185N_D04.sntp.dogwood1.root Aug 27 11:29 f21437400_0000_L010185N_D04.sntp.dogwood1.root See later discussion in email. Seems that Howie writes the pass field, Matt does not There is some, but not too much, overlap. 
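
To put a number on the overlap between the two naming styles, one could strip the pass field from the names that have it and intersect the result with the passless names. This is a sketch only: pass_overlap is a hypothetical helper, and it assumes the pass field is the all-digit component just before .root, as in the listings above.

```shell
#!/bin/sh
# Print passless names that also exist with a pass field.
# Assumption: the pass field is the numeric ".N" just before ".root".
pass_overlap() {
    # $1 = directory to scan
    tmpa=$(mktemp) ; tmpb=$(mktemp)
    # names with a pass field, reduced to their passless form
    ls "$1" | grep '\.[0-9][0-9]*\.root$' \
            | sed 's/\.[0-9][0-9]*\.root$/.root/' | sort -u > "$tmpa"
    # names without a pass field
    ls "$1" | grep '\.root$' | grep -v '\.[0-9][0-9]*\.root$' | sort -u > "$tmpb"
    comm -12 "$tmpa" "$tmpb"
    rm -f "$tmpa" "$tmpb"
}

# Possible usage, from the directory listed above :
#   pass_overlap . | wc -l
```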
Scanning directories from CFL MINOS27 > grep /pnfs/minos/mcout_data/dogwood1/near/daikon_04 CFL | \ cut -f 8- -d / > /minos/scratch/kreymer/dog1d04.lis MINOS27 > grep 'dogwood1.root' /minos/scratch/kreymer/dog1d04.lis | cut -f 1 -d / | sort -u CosmicLE L010000N L010000N_i209 L010000N_i317 L010000N_i380 L250200N_i100 L250200N_i114 L250200N_i130 L250200N_i152 L250200N_i165 L250200N_i194 L250200N_i232 MINOS27 > grep 'dogwood1.0.root' /minos/scratch/kreymer/dog1d04.lis | cut -f 1 -d / | sort -u L010000N_i209 L010000N_i225 L010000N_i232 L010000N_i259 L010000N_i300 L010000N_i317 L010000N_i326 L010000N_i380 L010185N_i124 L010185N_i191 L010185N_i213 L010185N_i224 L010185N_i232 L010185N_i243 L010185N_i257 L010185N_i282 L010185N_i303 L010185N_i324 L010185R ######### # MYSQL # ######### /var/minsoft filled, due to archives, round 13:.. Mysql> du -sm /var/minsoft/archive/* 25989 /var/minsoft/archive/20090612 26217 /var/minsoft/archive/20090710 26340 /var/minsoft/archive/20090811 8105 /var/minsoft/archive/20090904 2619 /var/minsoft/archive/crl 1 /var/minsoft/archive/database Mysql> rm -r /var/minsoft/archive/20090612 Mysql> rm -r /var/minsoft/archive/20090710 Mysql> rm -r /var/minsoft/archive/20090904 Updated dbarchive to check and bail on low free space. dbarchive revision 1.8, version 20090904 ########### # MONTHLY # ########### DATASETS 9/5 after upgrading script dcache/datasets PREDATOR 9/4 after dcache maintenance VAULT 9/3 OK MYSQL 9/4 after upgrading script dbarchive Mysql> scripts/dbarchive STARTED DBARCHIVES Fri Sep 4 12:29:12 CDT 2009 filled disk cleared space optimized dbarchive to gzip small files first, restarted STARTED DBARCHIVES Fri Sep 4 15:58:36 CDT 2009 FINISHED DBARCHIVES Fri Sep 4 17:29:35 CDT 2009 ########### # RAWCOPY # ########### rawcopy.20090904 - increased SLIM from 1.8 GB to 7.8 GB. 
for MD in 0329 0330 0401 0414 ; do mv rawcopy.${MD} rawcopy.2006${MD}; done cp rawcopy.20060414 rawcopy.20090904 ln -sf rawcopy.20090904 rawcopy # was rawcopy.0414 ########## # DCACHE # ########## dcache upgrades - DONE 9 AM stkendca23a ( down 09:10 - 10:30 ) 11 am stkendca24a ( down 11:00 - 12:45 ) 1 pm stkendca26a. ( down 15:00 - 16:15 ) restarted mcimport crontab crontab.dat around 17:14 - next cycle 16:37, will bounce off of running jcoelho restart DAQ archivers as necessary all archivers are running and moving files. review vault copy was OK last night review predator 17:06 cycle was clean ############### # GRIDAPPSYNC # ############### Modified gridappsync to run under minsoft( minos-mysql2 ) MINOS27 > ls /minos/scratch/products/ -ld drwxr-sr-x 2 products 4525 2048 Oct 18 2007 /minos/scratch/products/ MINOS27 > rmdir /minos/scratch/products MINOS-MYSQL2 > mkdir /minos/scratch/products MINOS-MYSQL2 > ls -alF /minos/scratch/products total 24 drwxr-xr-x 2 minsoft mysql 2048 Sep 4 09:07 ./ drwxrwxrwx 239 root root 18432 Sep 4 09:07 ../ MINOS-MYSQL2 > date ; set nohup ; ./gridappsync & Fri Sep 4 09:10:19 CDT 2009 [1] 25943 Mysql> du -sm /afs/fnal.gov/files/data/minos/d119/prd/MINOS_ROOT/Linux2.4-GCC_3_4 28532 /afs/fnal.gov/files/data/minos/d119/prd/MINOS_ROOT/Linux2.4-GCC_3_4 Mysql> du -sm /afs/fnal.gov/files/data/minos/d119/prd/MINOS_ROOT 30273 /afs/fnal.gov/files/data/minos/d119/prd/MINOS_ROOT less /minos/scratch/minsoft/log/log/gridappsync/2009-09.log prd/clhep/v2_0_3_1/Linux-2-6/doc/Units/Units.ps prd/clhep/v2_0_3_1/Linux-2-6/doc/Vector/ rsync: send_files failed to open "/afs/fnal.gov/files/data/minos/d119/prd/clhep/v2_0_3_1/Linux-2-6/doc/Vector/VectorDefs.ps": Input/output error ( 5) rsync: send_files failed to open "/afs/fnal.gov/files/data/minos/d119/prd/clhep/v2_0_3_1/Linux-2-6/doc/Vector/eulerAngleComputation.ps": Input/out ... 
rsync: send_files failed to open "/afs/fnal.gov/files/data/minos/d119/prd/ximagetools/v4_0/NULL/ReleaseNotes.v4_0": Input/output error (5) prd/ximagetools/v4_0/NULL/ups/ rsync: send_files failed to open "/afs/fnal.gov/files/data/minos/d119/prd/ximagetools/v4_0/NULL/ups/ximagetools.table": Input/output error (5) sent 43231085555 bytes received 30462780 bytes 2899664.76 bytes/sec total size is 49199201572 speedup is 1.14 rsync error: some files could not be transferred (code 23) at main.c(702) Command exited with non-zero status 23 278.58user 1703.04system 4:08:39elapsed 13%CPU (0avgtext+0avgdata 0maxresident)k 776inputs+3301488outputs (4major+1971793minor)pagefaults 0swaps FINISHED Fri Sep 4 13:20:10 CDT 2009 Possibly due to the filled /var disk ! Try another pass Mysql> cd ~kreymer/minos/scripts/ Mysql> date ; set nohup ; ./gridappsync & STARTED Fri Sep 4 17:31:11 CDT 2009 sent 6115017215 bytes received 3404420 bytes 2212410.64 bytes/sec total size is 49199201572 speedup is 8.04 59.05user 185.57system 46:05.01elapsed 8%CPU (0avgtext+0avgdata 0maxresident)k 1088inputs+2194256outputs (3major+334367minor)pagefaults 0swaps FINISHED Fri Sep 4 18:17:16 CDT 2009 ============================================================================= 2009 09 03 ============================================================================= ############### # GRIDAPPSYNC # ############### Resuming rsync of products to /grid/fermiapp, preparing to retire parrot for normal use. 
Preserved former gridappsync as gridappsync 20070911 Reviewing what is needed ( defined by parrot's /grid/fermiapp/minos/parrot/mountfile.grow /afs/fnal.gov/files/code/e875/general/minossoft /grow/www-numi.fnal.gov/computing/parrot/releases /afs/fnal.gov/files/code/e875/general/ups /grow/www-numi.fnal.gov/computing/parrot/ups /afs/fnal.gov/files/code/e875/general/MINOS_EXTERNAL /grow/www-numi.fnal.gov/computing/parrot/MINOS_EXTERNAL /afs/fnal.gov/files/code/e875/sim /grow/www-numi.fnal.gov/computing/parrot/sim /afs/fnal.gov/files/data/minos /grow/www-numi.fnal.gov/computing/parrot/release_data We need ups, MINOS_EXTERNAL, ups -> /afs/fnal.gov/files/data/minos/d119/ MINOS_EXTERNAL is native QUOTA MINOS27 > du -sm /afs/fnal.gov/files/data/minos/d119 48354 /afs/fnal.gov/files/data/minos/d119 MINOS26 > du -sm /afs/fnal.gov/files/code/e875/general/MINOS_EXTERNAL 898 /afs/fnal.gov/files/code/e875/general/MINOS_EXTERNAL MINOS27 > quota -s -g e875 blue2:/fermigrid-fermiapp 51518M 0 100G 1265k 0 0 We need to clear some space or get more quota before cloning. 
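
The margin behind that statement can be worked out from the numbers above. The sketch assumes the 100G limit means 102400 MB and that the du -sm sizes measured in AFS carry over unchanged to /grid/fermiapp, neither of which is guaranteed. quota_headroom is a hypothetical helper.

```shell
#!/bin/sh
# Remaining quota (MB) after adding trees of the given sizes.
# Assumptions: limit of 100G = 102400 MB, AFS du sizes apply as-is.
quota_headroom() {
    # $1 = MB in use, $2 = quota limit in MB, remaining args = MB to add
    used=$1 ; limit=$2 ; shift 2
    need=0
    for mb in "$@" ; do need=$((need + mb)) ; done
    echo $((limit - used - need))
}

# 51518 MB in use, plus d119 ( 48354 MB ) and MINOS_EXTERNAL ( 898 MB )
quota_headroom 51518 102400 48354 898
```

This prints 1630, so under these assumptions the clone would leave under 2 GB of the group quota free.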
MINOS27 > du -sm /grid/fermiapp/minos/* 1 /grid/fermiapp/minos/enstore 1 /grid/fermiapp/minos/griddb 1 /grid/fermiapp/minos/kreymer du: cannot read directory `/grid/fermiapp/minos/minfarm/Minossoft/EXTERNAL/mysql-5.0.22/sql/share/japanese-sjis': Permission denied 45531 /grid/fermiapp/minos/minfarm 2309 /grid/fermiapp/minos/offline 1343 /grid/fermiapp/minos/parrot 1 /grid/fermiapp/minos/scripts 6151 /grid/fermiapp/minos/sim MINOS27 > du -sm /grid/fermiapp/minos/minfarm/* 1 /grid/fermiapp/minos/minfarm/bin 2 /grid/fermiapp/minos/minfarm/condor_submit 4 /grid/fermiapp/minos/minfarm/glide_submit 0 /grid/fermiapp/minos/minfarm/Include_R2.0.0.txt du: cannot read directory `/grid/fermiapp/minos/minfarm/Minossoft/EXTERNAL/mysql-5.0.22/sql/share/japanese-sjis': Permission denied _______________________________________________________________________ MINOS26 > find /afs/fnal.gov/files/data/minos/d119/prd -maxdepth 3 -type l -exec ls -ld {} \; misperl/v2_2_4/current -> NULL MINOS_ROOT/Linux2.4-GCC_3_2/v5-10-00 -> v5-10-00d MINOS_ROOT/Linux2.4-GCC_3_3/v4-00-08e -> v4-00-08f MINOS_ROOT/Linux2.4-GCC_3_3/v4-00-08d -> v4-00-08e MINOS_ROOT/Linux2.4-GCC_3_4/v4-04-02-opt -> v4-04-02b-opt MINOS_ROOT/Linux2.4-GCC_3_4/v4-04-02 -> v4-04-02b MINOS_ROOT/Linux2.4-GCC_3_4/v5-18-00a -> /afs/fnal.gov/files/data/minos/d162/MINOS_ROOT/Linux2.4-GCC_3_4/v5-18-00a MINOS_ROOT/Linux2.4-GCC_3_4/v5-18-00a-opt -> /afs/fnal.gov/files/data/minos/d162/MINOS_ROOT/Linux2.4-GCC_3_4/v5-18-00a-opt GENIE/Linux2.4-GCC_3_4/genie_build.sh -> build_genie.sh gridappsync - Revised to write from d119 to fermiapp, log to /minos/scratch Changed log to mindata path, run this as mindata. 18:41 set nohup ; ./gridappsync & Killed this before it got really started. _______________________________________________________________________ ADMIN Date: Thu, 03 Sep 2009 19:02:56 -0500 (CDT) Request INC000000009849 requested by you has been submitted. 
Status: New Summary: Group E875 quota in fermiapp Notes: please assign to FermiGrid Services group - fermigrid-help Please increase the e875 group quota in /grid/fermiapp from 100 to 150 GB. This should give us room to clone our AFS products, working to remove AFS from Fermigrid. _______________________________________________________________________ Date: Wed, 23 Sep 2009 11:23:31 -0500 (CDT) Status: Completed Increased Quota ( to 300 GB ) ########### # BLUEARC # ########### Put the minos-sam04 plots ( /minos/scratch tests ) at http://www-numi.fnal.gov/computing/dh/bluearc/rates/minos-sam04 Did ./bratewk minos-sam04 20090831 out etc ########### # ROUNDUP # ########### Date: Tue, 01 Sep 2009 17:11:46 -0500 From: Rashid Mehdiyev this is an excerpt from the log file: dogwood1farcosmic.log Looks like there is a problem in line 421 or what ? FILE=F00026803_0001.cosmic.cand.dogwood1.0.root OOPS - mismatched Enstore and local size/crc SIZE 286642030/286642030 CRC 488030443/ PINFO VON096 0000_000000000_0004360 286642030 reco_far_dogwood1_cand /pnfs/fnal.gov/usr/minos/reco_far/dogwood1/cand_data/2004-08/ F00026803_0001.cosmic.cand.dogwood1.0.root 000F000000 0000000AB76148 CDMS125142581300000 stkenmvr209a:/dev/rmt/tps0d0n:9310025229 488030443 MINOS26 > ecrc /minos/data/minfarm/WRITE/F00026803_0001.cosmic.cand.dogwood1.0.root CRC 488030443 ls -l ROUNTMP/ECRC/F00026803_0001.cosmic.cand.dogwood1.0.root no such file FILE=F00026824_0003.cosmic.cand.dogwood1.0.root ls -l /pnfs/minos/reco_far/dogwood1/cand_data/2004-08/${FILE} -rw-r--r-- 1 minospro e875 238677318 Aug 27 21:17 ls -l /minos/data/minfarm/WRITE/${FILE} -rw-rw-r-- 1 minospro e875 238677318 Aug 27 16:14 cat /minos/data/minfarm/ROUNTMP/ECRC/${FILE} In both cases, the ECRC file is missing. That's because this file was not removed, due to file protections, on Aug 27. Removing it manually. 
rm -f /minos/data/minfarm/WRITE/${FILE} ########## # DCACHE # ########## Date: Thu, 03 Sep 2009 15:01:38 -0500 From: ssa-group@fnal.gov Several Public dcache nodes have been unstable in operation. We will schedule an urgent re-install of stkendca21a Will be started immediately. stkendca21a holds a combination of read and write pools, all data will be preserved over the re-install. The doors will be not available on these nodes during the re-install. Please contact SSA group if you have any questions. ______________________________________________________________________ N.B. - MRTG plots show outage 15:20 to 16:30 ______________________________________________________________________ Date: Thu, 03 Sep 2009 17:08:17 -0500 From: ssa-group@fnal.gov Subject: Announcement: Service restoration for dCache on stken for a duration of 2 Hours each individually Several Public dcache nodes have been unstable in operation. We will schedule an urgent re-install of stkendca23a, stkendca24a and stkendca26a Will be started at 9AM Friday 9/4 with stkendca23a Then 11 am for stkendca24a and Finally 1pm for stkendca26a. All Three hold a combination of read and write pools, all data will be preserved over the re-install. The doors will be not available on these nodes during their respective re-install. Please contact SSA group if you have any questions. ______________________________________________________________________ ########## # DCACHE # ########## There was a delayed srmcp in mcimport/jcoelhl, but it eventually succeeded , writing to -rw-r--r-- 1 kreymer e875 727934134 Sep 3 12:45 /pnfs/minos/mcin_data/near/daikon_04/L250200N_i194/707/n13037076_0000_L250200N_D04_i194.reroot.root This file is in pool 23a, which rebooted at about that time. ############ # MCIMPORT # ############ mcimport.20090903 updated this to allow parallel running again, as we used to have. Corrected usage examples Skip CRON pid file when not running under cron This allows parallel execution as needed. 
Skip removal of CRON pid when not under cron
Moved printf of pid diagnostic into pid routine, where it will get run.

Tested sjc files

$ ./mcimport.20090903 -b 2 -v -n sjc

The initial pass took from
Thu Sep 3 08:56:18 CDT 2009
Thu Sep 3 08:59:21 CDT 2009
448574 /minos/data/mcimport/sjc/
1 /minos/data/mcimport/sjc/tar
1 /minos/data/mcimport/sjc/dcache
2149 /minos/data/mcimport/sjc/mcin
1 /minos/data/mcimport/sjc/mcin/dcache
Thu Sep 3 09:27:09 CDT 2009
about 25 minutes to run 'du'

This looks OK, run for real to get sjc files imported.

Thu Sep 3 12:05:56 CDT 2009
Sorting 3502 logs in /minos/data/mcimport/sjc/log
Sorting 3613 logs in /minos/data/mcimport/sjc/mcin/log
STAGE, MCINPURGE, MCINWRITE
OK - staging 1748 files
Thu Sep 3 12:16:47 CDT 2009

#################
# NEAR_DCS_DATA #
#################

Correcting location of near_dcs_data/2009-09 files from June/July

SAMDIM='DATA_TIER dcs-near and FULL_PATH=/pnfs/minos/near_dcs_data/2009-09'

MINOS26 > ./samlocate "${SAMDIM}"
N090724_000004.mdcs.root /pnfs/minos/near_dcs_data/2009-09
...

Move the June files

setup encp
FILE6=`ls /pnfs/minos/near_dcs_data/2009-09 | grep N0906`
date
for FILE in ${FILE6} ; do
enmv /pnfs/minos/near_dcs_data/2009-09/${FILE} \
     /pnfs/minos/near_dcs_data/2009-06/${FILE}
done

This fails :
ERROR: USERERROR Insufficent permissions to move file

The files are owned by buckley, no group permissions.
We should probably change this to minfarm, relatively safe, tightly controlled usage.
I can change the ownership of the directory to minfarm : MINOS26 > chown minfarm /pnfs/minos/near_dcs_data/2009-09 MINOS26 > ls -l /pnfs/minos/near_dcs_data/2009-09 -d drwxrwxr-x 1 minfarm e875 512 Sep 3 01:55 /pnfs/minos/near_dcs_data/2009-09 ############ # MCIMPORT # ############ Closed excessively open permissions on /minos/data/mcimport/wingmc/mcin At about 10:15 $ ls -ld /minos/data/mcimport/wingmc/mcin drwxrwxrwx 2 mindata e875 149504 Sep 3 10:17 /minos/data/mcimport/wingmc/mcin $ chmod 775 /minos/data/mcimport/wingmc/mcin ########### # ENSTORE # ########### Date: Thu, 03 Sep 2009 11:28:09 -0500 (CDT) Request INC000000009765 requested by you has been submitted. Status: New Summary: /pnfs/minos/looking* files Notes: SSA primary enstore-admin Please remove the following files from /pnfs/minos. These seem to have been put there by enstore administrators. $ ls -l /pnfs/minos/looking* -rw-r--r-- 1 root root 2149120 Apr 15 11:36 /pnfs/minos/looking-results-grepped -rw-r--r-- 1 root root 992531 Apr 14 13:46 /pnfs/minos/lookingforlto3-minos -rw-r--r-- 1 root root 42376 Apr 15 11:31 /pnfs/minos/lookingforlto3-minos-results -rw-r--r-- 1 root root 11655141 Apr 14 17:34 /pnfs/minos/lookingforlto3-minos-resultslookingforlto3-minos-results ___________________________________________________________________________ Date: Thu, 03 Sep 2009 11:44:12 -0500 (CDT) Status: In Progress ___________________________________________________________________________ Date: Wed, 23 Sep 2009 16:04:12 -0500 (CDT) Status: Completed SSA has removed the 4 files which were accidentally written to minos pnfs metadata. ___________________________________________________________________________ ___________________________________________________________________________ ########### # ENSTORE # ########### Cleaning up stray files in /pnfs/minos/analysis/nue/Ent, as reported at http://www-stken.fnal.gov/enstore/www-stken_pnfs_monitor mindata@minos26 find . 
-size 0 -exec ls -l {} \; -rw-r--r-- 1 mindata e875 0 Oct 19 2008 ./Ent_CP_Daikon_CalStudy_N_RelErrMinus1S-015.md5sums -rw-r--r-- 1 mindata e875 0 Oct 19 2008 ./Ent_CP_Daikon_CalStudy_N_RelErrMinus1S-006.md5sums -rw-r--r-- 1 mindata e875 0 Oct 19 2008 ./Ent_CP_Daikon_CalStudy_N_RelErrMinus1S-010.md5sums -rw-r--r-- 1 mindata e875 0 Oct 19 2008 ./Ent_CP_Daikon_CalStudy_N_RelErrMinus1S-001.md5sums -rw-r--r-- 1 mindata e875 0 Oct 19 2008 ./Ent_CP_Daikon_CalStudy_N_RelErrMinus1S-025.md5sums -rw-r--r-- 1 mindata e875 0 Oct 19 2008 ./Ent_CP_Daikon_CalStudy_N_RelErrMinus1S-016.md5sums -rw-r--r-- 1 mindata e875 0 Oct 19 2008 ./Ent_CP_Daikon_CalStudy_N_RelErrMinus1S-007.md5sums -rw-r--r-- 1 mindata e875 0 Oct 19 2008 ./_match -rw-r--r-- 1 mindata e875 0 Oct 19 2008 ./Ent_CP_Daikon_CalStudy_N_RelErrMinus1S-020.md5sums -rw-r--r-- 1 mindata e875 0 Oct 19 2008 ./Ent_CP_Daikon_CalStudy_N_RelErrMinus1S-011.md5sums -rw-r--r-- 1 mindata e875 0 Oct 19 2008 ./Ent_CP_Daikon_CalStudy_N_RelErrMinus1S-002.md5sums -rw-r--r-- 1 mindata e875 0 Oct 19 2008 ./Ent_CP_Daikon_CalStudy_N_RelErrMinus1S-026.md5sums -rw-r--r-- 1 mindata e875 0 Oct 19 2008 ./Ent_CP_Daikon_CalStudy_N_RelErrMinus1S-017.md5sums -rw-r--r-- 1 mindata e875 0 Oct 19 2008 ./Ent_CP_Daikon_CalStudy_N_RelErrMinus1S-008.md5sums -rw-r--r-- 1 mindata e875 0 Oct 19 2008 ./Ent_CP_Daikon_CalStudy_N_RelErrMinus1S-030.md5sums -rw-r--r-- 1 mindata e875 0 Oct 19 2008 ./Ent_CP_Daikon_CalStudy_N_RelErrMinus1S-021.md5sums -rw-r--r-- 1 mindata e875 0 Oct 19 2008 ./Ent_CP_Daikon_CalStudy_N_RelErrMinus1S-012.md5sums -rw-r--r-- 1 mindata e875 0 Oct 19 2008 ./Ent_CP_Daikon_CalStudy_N_RelErrMinus1S-003.md5sums -rw-r--r-- 1 mindata e875 0 Oct 19 2008 ./Ent_CP_Daikon_CalStudy_N_RelErrMinus1S-027.md5sums -rw-r--r-- 1 mindata e875 0 Oct 19 2008 ./Ent_CP_Daikon_CalStudy_N_RelErrMinus1S-018.md5sums -rw-r--r-- 1 mindata e875 0 Oct 19 2008 ./Ent_CP_Daikon_CalStudy_N_RelErrMinus1S-009.md5sums -rw-r--r-- 1 mindata e875 0 Oct 19 2008 
./Ent_CP_Daikon_CalStudy_N_RelErrMinus1S-031.md5sums -rw-r--r-- 1 mindata e875 0 Oct 19 2008 ./Ent_CP_Daikon_CalStudy_N_RelErrMinus1S-022.md5sums -rw-r--r-- 1 mindata e875 0 Oct 19 2008 ./Ent_CP_Daikon_CalStudy_N_RelErrMinus1S-013.md5sums -rw-r--r-- 1 mindata e875 0 Oct 19 2008 ./Ent_CP_Daikon_CalStudy_N_RelErrMinus1S-004.md5sums -rw-r--r-- 1 mindata e875 0 Oct 19 2008 ./Ent_CP_Daikon_CalStudy_N_RelErrMinus1S-028.md5sums -rw-r--r-- 1 mindata e875 0 Oct 19 2008 ./Ent_CP_Daikon_CalStudy_N_RelErrMinus1S-019.md5sums -rw-r--r-- 1 mindata e875 0 Oct 19 2008 ./Ent_CP_Daikon_CalStudy_N_RelErrMinus1S-023.md5sums -rw-r--r-- 1 mindata e875 0 Oct 19 2008 ./Ent_CP_Daikon_CalStudy_N_RelErrMinus1S-014.md5sums -rw-r--r-- 1 mindata e875 0 Oct 19 2008 ./Ent_CP_Daikon_CalStudy_N_RelErrMinus1S-005.md5sums -rw-r--r-- 1 mindata e875 0 Oct 19 2008 ./Ent_CP_Daikon_CalStudy_N_RelErrMinus1S-029.md5sums -rw-r--r-- 1 mindata e875 0 Oct 19 2008 ./Ent_CP_Daikon_CalStudy_N_RelErrMinus1S-024.md5sums $ find . -size 0 -exec rm {} \; ############ # PREDATOR # ############ But first, we are stuck since about 09:06 doing dccp of /pnfs/minos/neardet_data/2009-09/N00016752_0019.mdaq.root Login information : DCap01-fndca4a-unknow-56142 DCap01-fndca4a-unknow-56142 minos26.fnal.gov active Sep 03 09:06:17 Sep 03 09:06:17 1060/5425 DCap01-fndca4a-unknow-56142 1060 ? ? ? ? open minos/neardet_data/2009-09/N00016752_0019.mdaq.root Trying dccptest, also failing MINOS26 > ./dccptest neardet_data/2009-09/N00016752_0019.mdaq.root PORT 24136 Datafile with name 'neardet_data/2009-09/N00016752_0019.mdaq.root' not found. dccp -d 4 dcap://fndca1.fnal.gov:24136/pnfs/fnal.gov/usr/minos//neardet_data/2009-09/N00016752_0019.mdaq.root . Connected in 0.00s. [Thu Sep 3 09:10:43 2009] Going to open file dcap://fndca1.fnal.gov:24136/pnfs/fnal.gov/usr/minos//neardet_data/2009-09/N00016752_0019.mdaq.root in cache. But dccptest works again, around 09:39.. 
Killed off the stuck dccp in predator

______________________________________________________________________

Date: Thu, 03 Sep 2009 19:23:03 +0000 (GMT)
From: Arthur Kreymer
To: Marty Buchaus
Cc: Stanley Naymola, Dcache Admin, minos-data@fnal.gov
Subject: Re: Stkendca22a

On Wed, 2 Sep 2009, Marty Buchaus wrote:

> Stkendca22a has been rebuilt with the 3ware raid card for the os drives
> instead of directly connected to the onboard sata controller.
> I was wondering if we could get feedback that the pools on this node
> are for sure being utilized. A converse in this would be appreciated.

The system seems to be stable since the upgrades.

We suggest rebuilding the other nodes supporting RawDataWritePools ASAP
( 21a 24a 26a )

#########
# KCRON #
#########

The corrected kcron has been propagated via autoyum.

Yesterday, we had
minos01 krb5-workstation-fermi-1.8d-14.LTS4

ARK > for NODE in ${NODES} ; do
printf "${NODE} "
ssh -ax ${NODE} 'rpm -q krb5-workstation-fermi' ; done

minos01 krb5-workstation-fermi-1.8d-15.LTS4
minos03 krb5-workstation-fermi-1.8d-15.LTS4
....
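
A possible follow-up to the rpm check above, as a minimal sketch only. The helper name stale_nodes and the EXPECTED variable are hypothetical, not part of any existing script. It reads the "node package" lines produced by the loop above and prints only the nodes that are not yet at the expected version.

```shell
#!/bin/sh
# Hypothetical helper, not an existing Minos script.
# Reads "node package-version" pairs on stdin and prints the nodes
# whose package string differs from EXPECTED.
EXPECTED=${EXPECTED:-krb5-workstation-fermi-1.8d-15.LTS4}

stale_nodes() {
    while read -r NODE PKG ; do
        # skip blank lines
        [ -z "${NODE}" ] && continue
        if [ "${PKG}" != "${EXPECTED}" ] ; then
            printf '%s %s\n' "${NODE}" "${PKG}"
        fi
    done
}

# Usage, assuming NODES holds the node list as in the loop above :
# for NODE in ${NODES} ; do
#     printf '%s ' "${NODE}"
#     ssh -ax "${NODE}" 'rpm -q krb5-workstation-fermi'
# done | stale_nodes
```

A node where rpm reports the package as not installed is also listed, since its output does not match EXPECTED.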
============================================================================= 2009 09 02 ============================================================================= ############ # PREDATOR # ############ Enabled the -c copy flag for mdcs and beam files date ; ln -sf predator.20090902 predator # was predator.20090816 Wed Sep 2 22:27:01 GMT 2009 Started manual run at 17:29 ####### # DAQ # ####### Date: Wed, 02 Sep 2009 05:50:01 -0500 From: MINOS DAQ To: carl.metelko@stfc.ac.uk, geoff.pearce@stfc.ac.uk, kreymer@fnal.gov Subject: NEARDAQ: 1 file(s) waiting more than 1h for archival This seems to have cleared up 2009-09-2 00:21:34 fardet_data/2009-09/F00044639_0015.mdaq.root 451 Aborting transfer due to session termination 2009-09-2 00:49:35 OK 2009-09-2 03:26:56 Fardet_data/2009-09/F00044639_0018.mdaq.root 451 Aborting transfer due to session termination 2009-09-2 09:49:17 near_dcs_data/2009-09/N090720_000003.mdcs.root NOT_FINISHED Removing 0 length files as necessary : -rw-r--r-- 1 buckley e875 0 Sep 2 09:49 N090720_000003.mdcs.root -rw-r--r-- 1 buckley e875 0 Sep 2 04:14 N00016731_0018.mdaq.root -rw-r--r-- 1 buckley e875 0 Sep 2 03:26 F00044639_0018.mdaq.root Predator failed this morning around 5:06 N00016731_0018.mdaq.root F00044639_0018.mdaq.root N090629_180804.mdcs.root N090630_000004.mdcs.root N090701_000002.mdcs.root Local copies work fine, but dcap gets nowhere. This started happening on Sep 1. There are many files left to archive, we're stuck on N090720_000003.mdcs.root We need to do a good bit of cleanup : Stopped the archiver : 10:48 CDT [minos@dcsdcp-nd ~]$ bin/init/archiver stop Predator is disabled. 
Copied archiver_krb.py from minos/scripts
    to dcsdcp-nd:bin/archiver_krb.20080703.py
mv    archiver_krb.py          archiver_krb.20051103.py
ln -s archiver_krb.20080703.py archiver_krb.py

Removed the stuck files

MINOS26 >
rm -f /pnfs/minos/near_dcs_data/2009-09/N090720_000003.mdcs.root
rm -f /pnfs/minos/neardet_data/2009-09/N00016731_0018.mdaq.root
rm -f /pnfs/minos/fardet_data/2009-09/F00044639_0018.mdaq.root

Restarted the archiver

[minos@dcsdcp-nd ~]$ date ; bin/init/archiver start
Wed Sep  2 10:59:02 CDT 2009
Starting archiver

Same file stuck for 10 minutes.

Stopped all archivers, and removing this file again.
    near/far daq/dcs

[minos@daqdcp-nd ~]$ bin/init/archiver status
Archiver is running

[minos@daqdcp-nd ~]$ bin/init/archiver stop
Stopping archiver - try graceful exit first. Please wait ......
Killing archiver with USR1

fardaq was not running

[minos@dcsdcp ~]$ bin/init/archiver stop
Stopping archiver - try graceful exit first

13:55 - somebody restarted the far archiver, we have another 0 length file

-rw-r--r--  1 buckley e875        0 Sep  2 11:50 F00044639_0019.mdaq.root

rm -f /pnfs/minos/fardet_data/2009-09/F00044639_0019.mdaq.root

14:43 Restarted near dcs archiver
      Files are moving, with the new 6 second interval
      All 21 files were written cleanly.

Restarted near daq archiver, about 30 to archive
      All archived cleanly.

far dcs was already running

Started far daq archiver, 14 files to write, at 15:12
      All archived cleanly

###########
# NEARDCS #
###########

Moving the existing DCS files to the proper directories

setup encp
FILES6=`ls /pnfs/minos/near_dcs_data/2009-09 | grep N0906`
printf "${FILES6}\n"
N090629_180804.mdcs.root
N090630_000004.mdcs.root

#########
# ADMIN #
#########

Ticket 8799 closed, the new kcron is deployed ?

Date: Tue, 01 Sep 2009 14:48:58 -0500
From: Troy Dawson

These rpm's are going out in todays autoyum.
Synopsis: Medium: Fermi kerberos update Issue date: 2009-09-01 Possible security issue with kcron SL 4.x i386: krb5-getcert-1.8d-15.LTS4.i386.rpm krb5-libs-fermi-1.8d-15.LTS4.i386.rpm krb5-workstation-fermi-1.8d-15.LTS4.i386.rpm x86_64: krb5-getcert-1.8d-15.LTS4.i386.rpm krb5-libs-fermi-1.8d-15.LTS4.i386.rpm krb5-workstation-fermi-1.8d-15.LTS4.i386.rpm ________________________________________________________________ But these do not show up in yum list, nor are they deployed 09:50 Wed. ########## # DCACHE # ########## Date: Tue, 01 Sep 2009 18:06:46 -0500 From: ssa-group@fnal.gov Several Public dcache nodes have been unstable in operation. We will schedule an urgent re-install of 2 nodes, stkendca17a and stkendca22a. stkendca17a will be started at 10AM and stkendca22a will be started at 1PM. These both hold a combination of read and write pools, all data will be preserved over the re-install. The doors will be not available on these nodes during the re-install. Please contact SSA group if you have any questions. ____________________________________________________________________ This could affect raw data logging. 22a serves RawDataWritePools ( 21a 22a 24a 26a ) ____________________________________________________________________ 22a seems to be up. Restarted mcimport cron manually : $ crontab 35 14 * * * ${HOME}/mcimport -c ALL ######## # DCAP # ######## Sent email asking for closeout of INC000000008454 dcap 1.9.4 needed in UPS/UPD This was resolved, see entry of Aug 21 ============================================================================= 2009 09 01 ============================================================================= ######### # ADMIN # ######### kcron rpm's are on the way. 
Date: Tue, 01 Sep 2009 14:48:58 -0500
From: Troy Dawson

i386:
krb5-getcert-1.8d-15.LTS4.i386.rpm
krb5-libs-fermi-1.8d-15.LTS4.i386.rpm
krb5-workstation-fermi-1.8d-15.LTS4.i386.rpm
x86_64:
krb5-getcert-1.8d-15.LTS4.i386.rpm
krb5-libs-fermi-1.8d-15.LTS4.i386.rpm
krb5-workstation-fermi-1.8d-15.LTS4.i386.rpm

########
# FARM #
########

Preparing to purge all dogwood1 near cosmic prescaled data.
Do not purge files from March 2009, these were not prescaled.

SAMDIM='
VERSION dogwood1
and PHYSICAL_DATASTREAM_NAME cosmic
and DATA_TIER sntp-near
'

All of dogwood1/cosmic :
File Count:        73274
Average File Size: 149.06MB
Total File Size:   10.42TB

Near sntp
File Count:        1497
Average File Size: 349.95MB
Total File Size:   511.60GB
Total Event Count: 2382706716

Near all tiers
and DATA_TIER %-near
File Count:        24865
Average File Size: 151.51MB
Total File Size:   3.59TB
Total Event Count: 4767258799

Consistency check, cand-near
File Count:        23368 ( 24865 - 1497 ) - correct

Looking at file sizes of ntuples,
2009-01 - up to 1.3 GB concatenated.
2009-02 - two runs, over 2 GB,
    N00015814_0000.cosmic.sntp.dogwood1.0.root
    N00015817_0000.cosmic.sntp.dogwood1.0.root
2009-03 - full cosmic reco,
    N00015820_0000.cosmic.sntp.dogwood1.0.root
    ...

Looking at cand files, for more precision
2009-01 - many runs,
    53 MB through run 15454
    840 MB from run 15457
2009-02 - two runs 840 MB.
Bottom line, I suspect that we should remove

SAMDIM='
VERSION dogwood1
and PHYSICAL_DATASTREAM_NAME cosmic
and DATA_TIER %-near
and RUN_NUMBER < 15457
'

XFILES=`sam list files --dim="${SAMDIM}" --nosummary`

MINOS26 > sam list files --dim="${SAMDIM}" --summaryOnly
File Count:        22117
Average File Size: 65.13MB
Total File Size:   1.37TB
Total Event Count: 4262899874

MINOS26 > ./samlocate "${SAMDIM}" | wc -l
22117

##########
# DCACHE #
##########

Another empty fardet file
/pnfs/minos/fardet_data/2009-09/F00044639_0001.mdaq.root

-rw-r--r--  1 buckley e875        0 Sep  1 10:22 F00044639_0001.mdaq.root
-rw-r--r--  1 buckley e875 72778000 Sep  1 10:38 F00044636_0005.mdaq.root
-rw-r--r--  1 buckley e875 71714880 Sep  1 10:38 F00044636_0007.mdaq.root
-rw-r--r--  1 buckley e875 73117897 Sep  1 10:38 F00044636_0010.mdaq.root

From http://fndca3a.fnal.gov/cgi-bin/dcache_files.py

2009-09-1 10:22:26 buckley(1019.5111) krbftp write /pnfs/fnal.gov/usr/minos/fardet_data/2009-09/F00044639_0001.mdaq.root daqdcp.minos-soudan.org 300 0 0 ERROR 451 Internal timeout

This was the last file copied ( later time stamps came from Enstore )

Restarted the archiver around 11:55 .
Subrun 2 copied cleanly, subrun 1 failed again.
-rw-r--r-- 1 buckley e875 0 Sep 1 11:55 F00044639_0001.mdaq.root -rw-r--r-- 1 buckley e875 56770619 Sep 1 11:56 F00044639_0002.mdaq.root From the DAQ logs QOL W Tue 1-09-2009 11:55:24 archiver 6324 198.124.213.171 1 106719 run 44639 File F00044639_0001.mdaq.root failed to transfer, try again: STOR F00044639_0001.mdaq.root: Opera QOL I Tue 1-09-2009 11:56:07 archiver 6324 198.124.213.171 1 106729 run 44639 File F00044639_0002.mdaq.root transferred to /pnfs/minos/fardet_data/2009-09 From the Recent FTP Transfers page 451 Operation failed: FTP Door: got response from '[>w-raw-minos-stkendca22a-1@w-raw-minos-stkendca22a-1Domain:*@w-raw-minos-stkendca22a-1Domain:*@dCacheDomain]' with error Failed to enqueue mover: java/util/zip/DataFormatException removed the file again at 12:02 Copied successfully at 12:07 ######### # BATCH # ######### Date: Tue, 01 Sep 2009 10:59:51 -0500 From: Rashid Mehdiyev according to the last information from Adam, now we have following directories ready at this moment: /grid/data/minos/users/wingmc/L010170_r1 /grid/data/minos/users/wingmc/L100200_r1 /grid/data/minos/users/wingmc/L150200_r2 /grid/data/minos/users/wingmc/L250200_r1 /grid/data/minos/users/wingmc/L250200_r2 Art, could you create more pnfs directories for (including ones we anticipate to be filled): ./pnfsdirs near dogwood1 daikon_07 L010000_r1 write ./pnfsdirs near dogwood1 daikon_07 L010000_r2 write ./pnfsdirs near dogwood1 daikon_07 L010000_r3 write ./pnfsdirs near dogwood1 daikon_07 L010200_r1 write ./pnfsdirs near dogwood1 daikon_07 L100200_r1 write ./pnfsdirs near dogwood1 daikon_07 L250200_r2 write ./pnfsdirs near dogwood1 daikon_07 L010185_r1 write ./pnfsdirs near dogwood1 daikon_07 L010185_r2 write ./pnfsdirs near dogwood1 daikon_07 L010185_r3 write ########### # SERVICE # ########### Cancelled an item that was stuck in limbo, already handled as ticket 501. 
Visible only when selecting View Requests --> Open from the left frame Incident 501 can be seen only from the Incident Management sscreen, not from the Requester Console. Request ID: In Process Summary: /pnfs/minos/reco_far/dogwood0/cand_data/2007-05/F00038002_0014.spill.cand.dogwood0.0.root seems to be lost Notes: dcache-admin : /pnfs/minos/reco_far/dogwood0/cand_data/2007-05/F00038002_0014.spill.cand.dogwood0.0.root was written to DCache write pools on April 13. It now seems to be missing from DCache, and is not on tape. What happened to this file ? Please reply to and cc: minos-data@fnal.gov __________________________________________________________________________ Date: Tue, 01 Sep 2009 10:41:01 -0500 (CDT) Status: Cancelled ########## # DCACHE # ########## Date: Tue, 01 Sep 2009 09:50:30 -0500 (CDT) Request INC000000009464 requested by you has been submitted. Status: New Summary: FNDCA Recent Ftp Transfers truncated Notes: SSA primary - dcache-admin The FNDCA Recent FTP Transfers web page seems to be truncated. It shows transfers only for the first few users. ( buckley, des, des939, fnalgrid ) and omits the rest ( ildg, jdem, jpalen, kreymer, mindata, minospro, mippro, oracle, podstvkv, stoughto, timur ) _____________________________________________________________________________ Date: Tue, 01 Sep 2009 15:00:36 -0500 (CDT) Status: In Progress _____________________________________________________________________________ Date: Tue, 08 Sep 2009 13:22:19 -0500 From: Terry Jones The issue is assigned to the dcache experts. 
_____________________________________________________________________________

Date: Tue, 13 Oct 2009 15:13:33 -0500 (CDT)

Status: Completed
_____________________________________________________________________________

=============================================================================
2009 08 31
=============================================================================

#########
# SRMCP #
#########

while true ; do ./srmtest3 2>&1 | grep -A 1 TURL ; sleep 2 ; done \
 | tee -a /minos/scratch/mindata/log/srmtest3-0901

Tue Sep 01 08:41:46 CDT 2009: received TURL=gsiftp://stkendca20a.fnal.gov:2811///neardet_data/2004-11/N00004502_0000.mdaq.root
-rw-r--r--  1 mindata e875 0 Sep  1 08:41 TEST.dat
Tue Sep 01 08:43:34 CDT 2009: received TURL=gsiftp://stkendca22a.fnal.gov:2811///neardet_data/2004-11/N00004502_0000.mdaq.root
-rw-r--r--  1 mindata e875 0 Sep  1 08:43 TEST.dat
Tue Sep 01 08:44:23 CDT 2009: received TURL=gsiftp://fndca4a.fnal.gov:2811///neardet_data/2004-11/N00004502_0000.mdaq.root
-rw-r--r--  1 mindata e875 0 Sep  1 08:44 TEST.dat
Tue Sep 01 08:45:24 CDT 2009: received TURL=gsiftp://stkendca27a.fnal.gov:2811///neardet_data/2004-11/N00004502_0000.mdaq.root
-rw-r--r--  1 mindata e875 0 Sep  1 08:45 TEST.dat

Every door is failing.

The jcoelho mcimport has been stuck since 08:37
Jobs have been failing overnight.

Updated ticket 9188

Now the copy is stuck writing a file which does not exist.
/pnfs/minos//mcin_data/near/daikon_04/L250200N_i152/706/n13037065_0001_L250200N_D04_i152.reroot.root

PNFS
-rw-r--r--  1 kreymer e875 750615485 Sep  1 15:31 /minos/data2/mcimport/jcoelho/mcin/n13037065_0001_L250200N_D04_i152.reroot.root

#########
# FNALU #
#########

Date: Mon, 31 Aug 2009 18:25:07 -0500 (CDT)

Request INC000000009425 requested by you has been submitted.
Status:  New
Summary: /minos and /grid on flxi09
Notes:
fnalu-admin :

Please mount the /minos and /grid data areas on flxi09,
to help us test Minos software compatibility at SLF 5.
It would be better if flxi09 could go to SLF 5.3 The natural time for this would be at the reboots for kernel upgrades. Here are typical /etc/fstab elements for these file systems : blue2:/minos/data /minos/data2 nfs rsize=32768,rw,timeo=600,proto=tcp,vers=3,hard,intr 0 0 minos-nas-0.fnal.gov:/minos/scratch /minos/scratch nfs rsize=32768,rw,timeo=600,proto=tcp,vers=3,hard,intr 0 0 minos-nas-0.fnal.gov:/minos/data /minos/data nfs rsize=32768,rw,timeo=600,proto=tcp,vers=3,hard,intr 0 0 blue2:/fermigrid-data /grid/data nfs rsize=32768,rw,timeo=600,proto=tcp,vers=3,hard,intr 0 0 blue2:/fermigrid-app /grid/app nfs rsize=32768,rw,timeo=600,proto=tcp,vers=3,hard,intr 0 0 blue2:/fermigrid-fermiapp /grid/fermiapp nfs rsize=32768,rw,timeo=600,proto=tcp,vers=3,hard,intr 0 0 ___________________________________________________________________________ Date: Tue, 01 Sep 2009 09:05:07 -0500 (CDT) your request is being reviewed by management. ___________________________________________________________________________ Date: Wed, 02 Sep 2009 10:12:51 -0500 (CDT) Status: In Progress ___________________________________________________________________________ Date: Thu, 10 Sep 2009 13:57:48 -0500 (CDT) The feedback so far from management is to go ahead and mount the filesystems you requested, but I am getting permission denied errors. Here are examples: [root@flxi09 ~]# mount /grid/app mount: blue2:/fermigrid-app failed, reason given by server: Permission denied [root@flxi09 ~]# mount /grid/app mount: blue2:/fermigrid-app failed, reason given by server: Permission denied You will need to request permission from the storage group to allow fnalu mounts, or from whoever managesthose filesystems to ask them to export them to fnalu. Then the mounts will probably work. Additionally and more importantly, management says that FNALU will be retired/deprecated so you should not be planning dependency on the cluster. 
In fact you should provide feedback on your needs that FNALU currently meets to your management so that they are prepared to provide input to the FNALU transition process. ___________________________________________________________________________ Date: Fri, 25 Sep 2009 11:22:32 -0500 (CDT) From: Margaret_Greaney This is to reflect a discussion that Art and I had this morning about grid mounts on flxi06 for the purpose of testing on slf5.3. I posed the 3 month duration for these mounts to Art and he said he was puzzled about what would happen after the 3 months. I told him that there was an effort to transition fnalu to a new cluster and organization and Art said he was aware of this and was in fact helping to plan this. Jason told me after my discussion that the issue would be revisited after a 3 month period and that it was the intent to try to move to get the new cluster with Stu Fuess set up soon but that they did not know if it would be done in that time period and so the issue would be revisited then. Art said he would be glad if the mounts could be done for a 3 month period and thanks for the help. I am trying to mount the grid mounts but only one of them works, so I have also sent mail to Keith Chadwick asking if his group can export the other 2 to flxi06. ___________________________________________________________________________ Date: Fri, 25 Sep 2009 11:34:17 -0500 (CDT) From: Margaret_Greaney the other mounts should be available later today. just one grid mount works on flxi06. ___________________________________________________________________________ Date: Fri, 25 Sep 2009 14:39:10 -0500 (CDT) it might take until next week for the other grid mounts to work. I've tried a couple times to get them to mount and they still don't. Keith gave the ok, but the schedule for this may have a lower priority in the storage group. Just to update you. 
From: Margaret_Greaney
___________________________________________________________________________

Date: Wed, 30 Sep 2009 08:42:18 -0500 (CDT)

Status: In Progress
___________________________________________________________________________

Date: Wed, 30 Sep 2009 08:48:29 -0500 (CDT)

on flxi06, the mounts you requested are available.
___________________________________________________________________________

Date: Wed, 30 Sep 2009 08:48:30 -0500 (CDT)

Status: Pending
___________________________________________________________________________

Date: Wed, 30 Sep 2009 13:57:18 +0000 (GMT)
From: Arthur Kreymer
To: minos_batch@fnal.gov, minos_software_discussion@fnal.gov,
    mgreaney@fnal.gov, minos-admin@fnal.gov
Subject: Re: SLF 5 nodes for Batch testing

On Tue, 22 Sep 2009, Arthur Kreymer wrote:

> The general Minos collaboration can use flxi06,
> but it is at SLF 5.1 and lacks the /grid and /minos mounts.

flxi06 now has all the usual Bluearc mounts, for SLF 5 testing,
thanks to Margaret Greaney :

    /minos/data
    /grid/data
    /grid/fermiapp

Enjoy !
___________________________________________________________________________

Date: Tue, 13 Oct 2009 12:10:31 -0500 (CDT)

Status: In Progress
___________________________________________________________________________

Date: Tue, 13 Oct 2009 12:10:32 -0500 (CDT)

Status: Completed
task created, grid group researching
___________________________________________________________________________

########
# DATA #
########

Forgot to undeclare this 0 length file from SAM,
when it was removed from PNFS on Aug 24

sam undeclare n10033001_0003_CosmicLE_D04.cand.dogwood1.root

This messed up the Predator listings.

########
# LOCK #
########

lock.new - Added LOCK/stale - sets the time for stale locks
    Queues should time out on a much longer time scale,
    fix this QOLD at 10 hours.
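
For illustration only, a minimal sketch of an age-based stale test of the kind described in the LOCK note above. It is not the actual lock.new code. The function name is_stale, the QOLD_MINUTES variable and the example path are hypothetical, and GNU find is assumed.

```shell
#!/bin/sh
# Hypothetical sketch, not the actual lock script.
# is_stale FILE MINUTES : succeeds if FILE exists and was last
# modified more than MINUTES minutes ago.
is_stale() {
    FILE=$1
    MINUTES=$2
    [ -e "${FILE}" ] || return 1
    # find prints the path only when it is older than the limit
    [ -n "$(find "${FILE}" -maxdepth 0 -mmin +"${MINUTES}")" ]
}

# Assumed policy : queue entries use the 10 hour QOLD limit noted above.
QOLD_MINUTES=600

# Usage ( the path is illustrative only ) :
# is_stale /some/lock/area/QUEUE/entry "${QOLD_MINUTES}" && echo stale
```

Keeping the limit in one variable lets locks and queue entries use different time scales with the same test.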
####### # DAQ # ####### _____________________________________________________________________________ Date: Mon, 31 Aug 2009 11:22:45 -0500 From: Elizabeth Buckley-Geer To: Arthur E Kreymer Subject: [Fwd: Archiver on dcsdcp] Hi Art, seems I am still getting mail from the archiver on dcsdcp at Soudan. I think this comes from the monitoring script that checks to see if the archiver is running and if not restarts it. I forget what I called it but check the crontab. I would login and check but I can't remember the account it is running under! I thought it was minos but that doesn't seem to work anymore. Liz -------- Original Message -------- Subject: Archiver on dcsdcp Date: Sun, 30 Aug 2009 20:01:10 -0500 From: MINOS Account for data transfer To: buckley@fnal.gov Archiver not running, restarted it _____________________________________________________________________________ Fardet transfers stopped again on Sunday . from http://fndca3a.fnal.gov/cgi-bin/dcache_files.py 2009-08-31 07:53:47 buckley(1019.5111) krbftp write /pnfs/fnal.gov/usr/minos/fardet_data/2009-08/F00044422_0000.mdaq.root daqdcp.minos-soudan.org 7890 0 0 ERROR 451 Aborting transfer due to session termination 2009-08-30 23:01:15 buckley(1019.5111) krbftp write /pnfs/fnal.gov/usr/minos/far_dcs_data/2009-08/F090830_000008.mdcs.root dcsdcp.minos-soudan.org 7890 0 0 ERROR 451 Aborting transfer due to session termination 2009-08-30 20:01:17 buckley(1019.5111) krbftp write /pnfs/fnal.gov/usr/minos/far_dcs_data/2009-08/F090830_120011.mdcs.root dcsdcp.minos-soudan.org 7890 0 0 ERROR 451 Aborting transfer due to session termination 2009-08-30 17:46:58 buckley(1019.5111) krbftp write /pnfs/fnal.gov/usr/minos/far_dcs_data/2009-08/F090830_000008.mdcs.root dcsdcp.minos-soudan.org 7891 0 0 ERROR 451 Aborting transfer due to session termination 11:36 - [minos@daqdcp ~]$ bin/init/archiver start [minos@daqdcp ~]$ ls -l /daqdata/archiver/data-archived/ ... 
-rw-r--r-- 1 minos e875 0 Aug 29 13:46 F00044421_0002.mdaq.root -rw-r--r-- 1 minos e875 0 Aug 29 14:47 F00044421_0003.mdaq.root -rw-r--r-- 1 minos e875 0 Aug 29 15:30 F00044421_0004.mdaq.root -rw-r--r-- 1 minos e875 0 Aug 31 11:36 F00044423_0000.mdaq.root -rw-r--r-- 1 minos e875 0 Aug 31 11:36 F00044424_0000.mdaq.root -rw-r--r-- 1 minos e875 0 Aug 31 11:37 F00044425_0000.mdaq.root ... Similar on DCS, but the archiver claims to be running there. [minos@dcsdcp bin]$ cat /var/lock/dcs/archiver.pid 2195 [minos@dcsdcp bin]$ ps xf PID TTY STAT TIME COMMAND 2195 ? S 0:00 python /home/minos/bin/archiver_krb.py 2210 ? Z 0:00 \_ [kinit] [minos@dcsdcp ~]$ bin/init/archiver restart Stopping archiver - try graceful exit first At around 12:05, lost contact with the far detector, windows froze. But the copies continued on their normal schedule. Windows unfroze eventually. Copies are up to date, as of about 12:38 ######### # SRMCP # ######### Rubin repeats continued failures from stkendca23a while true ; do ./srmtest3 2>&1 | grep -A 1 TURL ; sleep 2 ; done \ | tee -a /minos/scratch/mindata/log/srmtest3-0831 Stuck on 23a again, and 24a Mon Aug 31 09:26:24 CDT 2009: received TURL=gsiftp://stkendca23a.fnal.gov:2811 Mon Aug 31 09:29:15 CDT 2009: received TURL=gsiftp://stkendca24a.fnal.gov:2811/ grep TURL /minos/scratch/mindata/log/srmtest3-0831 | cut -f 3 -d / | sort -u _________________________________________________________________________ To : Fermilab Service Desk Cc : dcache-admin@fnal.gov, minos-data@fnal.gov, rubin@fnal.gov Attchmnt: Subject : Re: Request INC000000009188 requested by you has been submitted. FNDCA srm is failing ----- Message Text ----- The problem with stkendca23a has returned, as of 09:00 Monday Aug 31. Door stkendca24a.fnal.gov:2811 is also stuck. Please take these doors out of the configuration. stkendca23a.fnal.gov:2811 stkendca24a.fnal.gov:2811 The working doors seem to include. 
    fndca4a.fnal.gov:2811
    fndca4a.fnal.gov:2812
    stkendca19a.fnal.gov:2811
    stkendca20a.fnal.gov:2811
    stkendca25a.fnal.gov:2811
    stkendca27a.fnal.gov:2811

Door 23a seems to have started working again, as of 09:44 CDT,
then seems to have disappeared from the list of doors used after 09:45.

Door 24a continues to fail.
_________________________________________________________________________

8/31/2009 8:02:48 PM ; jonest

I had to reboot four dcache nodes this morning.
stkendca21a, 23a, 24a & 26a

I had to reboot four of the stken dcache nodes this morning.
they seem to have recovered.
I hope this cleared up your problem of accessing files.
Please let me know if you are still having difficulty
_________________________________________________________________________

Date: Tue, 01 Sep 2009 13:55:05 +0000 (GMT)
From: Arthur Kreymer

SRMCP has continued to fail all night.
At present, all the doors are timing out.

Please raise the level of this request to URGENT.
Minos data processing has been down since last Thursday
due to this problem.
_________________________________________________________________________

Date: Tue, 01 Sep 2009 09:06:57 -0500
From: Stan

Have you in the last 2 weeks increased or decreased your writing to dcache.
We don't understand why this will run for days and then fall over
days in a row.
_________________________________________________________________________

Date: Tue, 01 Sep 2009 14:24:26 +0000 (GMT)
From: Arthur Kreymer

Not that I am aware of.

Our writes to DCache come from a few sources, all well controlled :

1) Raw data writes, via the ancient Python kerberized ftp client,
   see the buckley transfers under
   http://fndca3a.fnal.gov/cgi-bin/dcache_files.py
   Except when catching up from a backlog, like the last two weekends,
   this runs at about 2 files per hour.
   Rates were low since about 13:00 yesterday.

2) Monte carlo imports, under the mindata account on minos26.
   This is a single stream of srmcp writes,
   which have been stuck most of the time.
3) Farm worker nodes reading raw data, and writing 'cand' files. I do not see these in the Recent FTP Transfers page, this web page seems to be truncated. _________________________________________________________________________ Date: Tue, 01 Sep 2009 10:24:42 -0500 From: David Saranen Yesterday there was a special group of runs at the Far Detector - lots of little short runs (somewhere around 200). That was well after the ftp failures had occurred. _________________________________________________________________________ Date: Wed, 02 Sep 2009 04:12:58 -0500 From: Howard Rubin To: Art Kreymer , Timur Perelmutov , ssa-admin@fnal.gov Subject: srm authentication errors continue Copyright (c) 2002-2008 Fermi National Accelerator Laboratory Specification Version 2.0 by SRM Working Group (http://sdm.lbl.gov/srm-wg) SRM Configuration: ... Wed Sep 02 04:09:45 CDT 2009: starting SRMPutClient ... SRMClientV2 : connecting to srm at httpg://fndca4a.fnal.gov:8443/srm/managerv2 ... org.globus.ftp.exception.ServerException: Server refused performing the request. Custom message: Server refused GSSAPI authentication. (error code 1) [Nested exception message: Custom message: Unexpected _________________________________________________________________________ _________________________________________________________________________ At 09:26, I removed the mcimport crontab entry. _________________________________________________________________________ Date: Fri, 04 Sep 2009 16:16:19 -0500 From: ssa-group@fnal.gov To: cdweb@fnal.gov, helpdesk@fnal.gov, oleynik@fnal.gov, stan@fnal.gov, wolbers@fnal.gov, crawdad@fnal.gov, white@fnal.gov, moibenko@fnal.gov, timur@fnal.gov, stk-users@fnal.gov, cms-t1@fnal.gov, fermigrid-announce@fnal.gov, dcache-admin@fnal.gov Subject: Announcement: Service restoration for dCache on stken for a duration of Unknown Stkendca 23a 24a and 26a have been rebuilt and are in production all service should be restored. and Stable. 
_________________________________________________________________________
Date: Fri, 04 Sep 2009 16:53:26 -0500 (CDT)
Status: Completed

We have received a resolution from our support staff.

###########
# BLUEARC #
###########

Testing data rates from fnpc370, which was slow yesterday

scp kreymer@minos27:minos/scripts/bluwatch .
./bluwatch -r -t -b /grid/data/minos/bluwatch/stash/2

Data rates look normal there.
Had to hack bluwatch.20090831 to work without AFS.

=============================================================================
2009 08 29 SATURDAY
=============================================================================

###########
# BLUEARC #
###########

Date: Sat, 29 Aug 2009 23:35:09 +0000 (GMT)
From: Arthur Kreymer
To: scavan@fnal.gov, minos-data@fnal.gov, rbpatter@fnal.gov, fermigrid-help@fnal.gov
Subject: scavan jobs loading /grid/data

Steve :

I have removed your batch jobs which started running this morning.
They are getting nowhere, tying up 293 slots, and overloading /grid/data.

This is because they are copying their input files
from /minos/scratch ( part of the grid file system )
without using the /grid/fermiapp/minos/scripts/cpn or cp1 locking.

Your script /minos/scratch/scavan/NueAnaPID/nueana_minijob_mylock_mrcc.sh
seems to be using some other lock via loon,

loon -b -q "$LockScripts/getLock.C($LockPort)"
LockScripts=/minos/scratch/scavan/ServerScripts/FileLock/

This seems to be connecting to some sort of network lock server.
I have never heard of this server, and have no idea what it is doing.
It is obviously not working, as every job section seems to be copying.

You must use the public, documented lock mechanisms.
Please try out the new cpn, I think you'll like it.

What is worse, the locks and file copies are running under parrot,
which is extremely inefficient.
Input/Output staging needs to be done outside of parrot.

Details follow :

/grid/data slowed down, but not disastrously,
after about 08:00 CDT today, Aug 29 2009.
This is around the time that the scavan grid jobs started.
They seem to be copying files without locking.

Here is part of one of the process trees from ps axf

28527 ? DN 2:20 | \_ parrot -m /grid/fermiapp/minos/parrot/mountfile.grow -H -t /local/stage1/minosgli/parrot /minos/scratch/scavan/NueAnaPID/nueana_minijob_mylock_mrcc.sh 42 /minos//data/users/scavan/MiniPIDDogwood1/mrcc^50786^minos09.fnal.gov
28528 ? TN 0:00 | \_ /bin/sh /minos/scratch/scavan/NueAnaPID/nueana_minijob_mylock_mrcc.sh 42 /minos//data/users/scavan/MiniPIDDogwood1/mrcc^50786^minos09.fnal.gov
22823 ? TN 0:00 | \_ cp /minos/data2/nue_group_files/2ndAnalysis/AnaNue_Files/BeforeLEM/Full/Untrimmed/Near/Data/L010185N/Standard/
AllRuns/AnaNue-N00011995_0010.spill.sntp.dogwood1.0-PECutXTalk-2.000000.root input/

There were 7 scavan jobs running on fnpc370,
all of which started around 07:55 this morning.
The sections running on this host include :
1 10 42 45 67 86 94

You seem to have two clusters of jobs running
286254.0 scavan 8/29 07:33
286255.0 scavan 8/29 07:33

And one set of jobs queued ( idle )
288773.0 scavan 8/29 16:28

These all seem to be using variants of the defective locks
nueana_minijob_mylock_mrcc.sh or nueana_minijob_mylock.sh

To be sure to remove these slowly, starting with the queued sections,
I did the following :

SCIDS=`condor_q scavan | grep scavan | cut -f 1 -d ' ' | sort -nr`
date
for SCID in ${SCIDS} ; do sleep 1 ; condor_rm ${SCID} ; done
date

Sat Aug 29 18:11:39 CDT 2009
Job 288773.99 marked for removal
Job 288773.98 marked for removal
Job 288773.97 marked for removal
...
Job 286254.102 marked for removal
Job 286254.101 marked for removal
Job 286254.100 marked for removal
Job 286254.10 marked for removal
Job 286254.1 marked for removal
Job 286254.0 marked for removal
Sat Aug 29 18:19:41 CDT 2009

Data transfer rates immediately recovered.
No samples under 6 MByte/second since 19:20, see
http://www-numi.fnal.gov/computing/dh/bluwatch/rate/2009/08/29/minos27.txt

To be fair, this overload is not nearly as bad as during d0ora2 backups.
I am amazed that the system can handle over 290 file copies
with only a moderate overload.

See the scavan overload from 08:00 to 18:00 on this plot
http://www-numi.fnal.gov/computing/dh/bluearc/rates/minos27/minos27_20090829.png
compared to a d0ora2 overload starting at 18:00 at
http://www-numi.fnal.gov/computing/dh/bluearc/rates/minos27/minos27_20090818.png

=============================================================================
2009 08 28
=============================================================================

########
# JIRA #
########

Will put some recent issues, for testing, into
http://fermilab.go2group.com/browse/MINOSDATA

###########
# BLUEARC #
###########

Restarted brate scripts on ark.fnal.gov
Updated brate script not to echo OUTPUT

cds
set nohup ; ./brateday_ark &
set nohup ; ./bratewk_ark &

###########
# DESKTOP #
###########

Delayed work this morning, due to non-responsive desktop.
Power cycled, upgraded, recovering.
########
# FARM #
########

More nasty problems handling candidates from recent processing
( due to heavy srmcp failures )

One is a duplicate ( slightly different file sizes )
/pnfs/minos/reco_far/dogwood1/cand_data/2004-06/F00025792_0003.cosmic.cand.dogwood1.0.root
CRC('666077390L' 242540164 WRITE CRC 1617668122 242540153

Cleared the dup out with
FILE=F00025792_0003.cosmic.cand.dogwood1.0.root
mv ${FILE} ../DUP/${FILE}

Some files cannot be removed when successfully written, not group writeable :
-rw-r--r-- 1 minospro e875 238677318 Aug 27 16:14 F00026824_0003.cosmic.cand.dogwood1.0.root
-rw-r--r-- 1 minospro e875 286642030 Aug 27 16:13 F00026803_0001.cosmic.cand.dogwood1.0.root

=============================================================================
2009 08 27
=============================================================================

#########
# SRMCP #
#########

roundup and mcimport are stuck in srmcp again.

Created srmtest3 at mindata@minos26, omitting the srmls step.

while true ; do ./srmtest3 2>&1 | grep -A 1 TURL ; sleep 60 ; done \
| tee -a /minos/scratch/mindata/log/srmtest3

Got stuck in the first srmcp, killed :
See /minos/scratch/mindata/log/srmtest3

Failed for
Thu Aug 27 17:08:18 CDT 2009: TURL=gsiftp://stkendca23a.fnal.gov:2811
Thu Aug 27 17:12:18 CDT 2009: TURL=gsiftp://stkendca23a.fnal.gov:2811
Thu Aug 27 17:18:19 CDT 2009: TURL=gsiftp://stkendca23a.fnal.gov:2811
Thu Aug 27 17:20:29 CDT 2009
Thu Aug 27 17:25:09 CDT 2009

grep TURL /minos/scratch/mindata/log/srmtest2 | cut -f 3 -d / | sort -u
fndca4a.fnal.gov:2811
fndca4a.fnal.gov:2812
stkendca17a.fnal.gov:2811
stkendca18a.fnal.gov:2811
stkendca19a.fnal.gov:2811
stkendca20a.fnal.gov:2811
stkendca21a.fnal.gov:2811
stkendca22a.fnal.gov:2811
stkendca23a.fnal.gov:2811
stkendca24a.fnal.gov:2811
stkendca25a.fnal.gov:2811
stkendca27a.fnal.gov:2811
stkendca28a.fnal.gov:2811

grep TURL /minos/scratch/mindata/log/srmtest3 | cut -f 3 -d / | sort -u
fndca4a.fnal.gov:2812
stkendca17a.fnal.gov:2811
stkendca18a.fnal.gov:2811 stkendca19a.fnal.gov:2811 stkendca20a.fnal.gov:2811 stkendca21a.fnal.gov:2811 stkendca22a.fnal.gov:2811 stkendca23a.fnal.gov:2811 stkendca24a.fnal.gov:2811 stkendca25a.fnal.gov:2811 stkendca27a.fnal.gov:2811 ________________________________________________________________________ Date: Thu, 27 Aug 2009 17:40:19 -0500 (CDT) Request INC000000009188 requested by you has been submitted. Status: New Summary: FNDCA srm is failing Notes: SSA primary - dcache-admin srmcp is getting stuck when it uses TURL=gsiftp://stkendca23a.fnal.gov:2811 This happens reading or writing. This is halting all Minos production processing. Please take this door out of the configuration. Please add active monitoring of the srm doors. These other doors seem to be OK : fndca4a.fnal.gov:2812 stkendca17a.fnal.gov:2811 stkendca18a.fnal.gov:2811 stkendca19a.fnal.gov:2811 stkendca20a.fnal.gov:2811 stkendca21a.fnal.gov:2811 stkendca22a.fnal.gov:2811 stkendca24a.fnal.gov:2811 stkendca25a.fnal.gov:2811 stkendca27a.fnal.gov:2811 Thanks ! When the processes get stuck, they issue messages like copy failed with the error org.globus.ftp.exception.ServerException: Server refused performing the request. Custom message: Server refused GSSAPI authentication. (error code 1) [Nested exception message: Custom message: Unexpected reply: 500 Operation failed due to internal error: java/text/MessageFormat]. 
Nested exception is org.globus.ftp.exception.UnexpectedReplyCodeException: Custom message: Unexpected reply: 500 Operation failed due to internal error: java/text/MessageFormat
at org.globus.ftp.extended.GridFTPControlChannel.authenticate(GridFTPControlChannel.java:171)
at org.globus.ftp.GridFTPClient.authenticate(GridFTPClient.java:106)
at org.globus.ftp.GridFTPClient.authenticate(GridFTPClient.java:91)
at org.dcache.srm.util.GridftpClient.(GridftpClient.java:158)
at gov.fnal.srm.util.Copier.javaGridFtpCopy(Copier.java:615)
at gov.fnal.srm.util.Copier.copy(Copier.java:493)
at gov.fnal.srm.util.Copier.run(Copier.java:321)
at java.lang.Thread.run(Thread.java:595)
try again
sleeping for 10000 before retrying

This is halting all Minos production processing.
Please take this door out of the configuration.
Thanks !
________________________________________________________________________
Date: Thu, 27 Aug 2009 20:18:43 -0500
From: Stan

Door restarted at 2020.
________________________________________________________________________

while true ; do ./srmtest3 2>&1 | grep -A 1 TURL ; sleep 2 ; done \
| tee -a /minos/scratch/mindata/log/srmtest3a

grep TURL /minos/scratch/mindata/log/srmtest3a | cut -f 3 -d / | sort -u

$ grep TURL /minos/scratch/mindata/log/srmtest3a | cut -f 3 -d / | sort -u
fndca4a.fnal.gov:2811
fndca4a.fnal.gov:2812
stkendca22a.fnal.gov:2811
stkendca23a.fnal.gov:2811
stkendca24a.fnal.gov:2811
stkendca27a.fnal.gov:2811

No hangups in an hour

#########
# PLOTS #
#########

Examined minos portal plots, at
http://fgt3x6.fnal.gov:8080/portal/portal/minos/Home
also portal/portal/cms, dzero-mc, dzero-reco

########
# FARM #
########

./pnfsdirs near dogwood1 daikon_07 L010170_r1 write
./pnfsdirs near dogwood1 daikon_07 L150200_r2 write
./pnfsdirs near dogwood1 daikon_07 L250200_r1 write

#########
# ADMIN #
#########

tokencron investigation

08:55 - shut down cron jobs, no longer needed on minos25/flxi02/flxi04

Cleared out the stale ticket caches

find /tmp -maxdepth 1
-name krb5cc_cron* -mmin +600 | wc -l 1202 find /tmp -maxdepth 1 -name krb5cc_cron* -mmin +600 -exec ls -l {} \; find /tmp -maxdepth 1 -name krb5cc_cron* -mmin +600 -exec rm -f {} \; Cannot remove most of these, just the 80 that belonged to me. Scan the full Minos Cluster for these ARK > for NODE in ${SNODES} ; do printf "${NODE} " ; ssh -ax ${NODE} 'find /tmp -maxdepth 1 -name krb5cc_cron* -mmin +600 | wc -l' ; done minos-mysql1 1 nwest minos-mysql2 43 nwest minos-mysql3 1 kreymer minos-sam01 102 buckley kreymer minos-sam02 0 minos-sam03 17 kreymer minos-sam04 1 kreymer Showing the non-0 nodes of the Cluster : ARK > for NODE in ${NODES} minos27 ; do minos01 41 blake kreymer minos03 98 rbpatter minos07 4457 gmieg minos08 17 jdejong minos12 3 rhatcher minos13 2 jdejong minos25 1123 jdejong rbpatter rubin et.al. minos26 13 kreymer rubin Aug 19/20 minos27 1 kreymer Aug 20 I have removed the kreymer caches from minos-mysql3 minos-sam03 minos-sam04 minos27 Scanning old files in /tmp, mostly Aug 17,except minos05 total 696 -rw------- 1 ochoa e875 729 Aug 8 19:51 krb5cc_11632_LnXvT8 -rw------- 1 rmehdi e875 113 Aug 17 06:57 tkt42916_PL3cqj minos08 total 3236 -rw-r--r-- 1 ahimmel us_cms 0 Aug 13 07:34 condor_q_20090813_073425_9417 -rw-r--r-- 1 ahimmel us_cms 0 Aug 13 07:36 condor_q_20090813_073625_9424 -rw-r--r-- 1 ahimmel us_cms 0 Aug 13 07:38 condor_q_20090813_073825_9431 -rw-r--r-- 1 ahimmel us_cms 0 Aug 13 07:40 condor_q_20090813_074025_9441 -rw-r--r-- 1 ahimmel us_cms 0 Aug 13 07:42 condor_q_20090813_074226_9448 -rw-r--r-- 1 ahimmel us_cms 0 Aug 13 09:34 condor_q_20090813_093426_10000 -rw-r--r-- 1 ahimmel us_cms 0 Aug 13 09:36 condor_q_20090813_093626_10007 minos15 total 524 -rw------- 1 xbhuang e875 1129 Aug 16 18:26 krb5cc_43524 -rw------- 1 cherdack e875 115 Aug 17 09:54 tkt12660_U3YKH9 Repeated with -ltur ( access/use time ) The oldest file access is Aug 17. So in a few days, all these Aug 19/20 caches will be cleared out. 
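A minimal sketch of the stale-cache scan above, as a reusable shell function
that counts matching files per owner. The function name stale_caches and its
two arguments are illustrative, not part of the original scripts. The pattern
is quoted so that the shell hands it to find unexpanded; the unquoted
krb5cc_cron* form only works as long as no matching file sits in the current
directory. Assumes GNU find ( -maxdepth, -mmin ).

```shell
#!/bin/sh
# stale_caches DIR [MINUTES]
# List "count owner" for kcron ticket caches in DIR older than MINUTES.
# Hypothetical helper, sketched from the find commands in this log.
stale_caches() {
    dir=${1:-/tmp}
    age=${2:-600}
    # Quote the glob so find, not the shell, does the matching.
    find "${dir}" -maxdepth 1 -name 'krb5cc_cron*' -mmin "+${age}" \
        -exec ls -ld {} \; |
        awk '{ n[$3]++ } END { for (u in n) print n[u], u }' |   # field 3 of ls -l is the owner
        sort -nr
}
```

For example, stale_caches /tmp 600 prints one "count owner" line per user,
largest count first. Only the owner ( or root ) can actually remove the files.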
_________________________________________________________________________ Date: Thu, 27 Aug 2009 12:38:15 -0500 (CDT) Request INC000000009146 requested by you has been submitted. Status: New ============================================ minos25 stale files in /tmp ============================================ FEF primary - run2-sys@fnal.gov Please remove about 1100 stale kcron ticket cache files from /tmp on minos25. These almost all originated during the kernel upgrade reboots on Aug 20. They have names like /tmp/krb5cc_cron11914 They are causing error messages from some of our production scripts. There are similar files on other Minos Cluster nodes, especially minos07, but they do not seem to be doing any immediate harm, and will be cleared automatically in a few days. On minos25, please do something like : find /tmp -maxdepth 1 -name krb5cc_cron* -mmin +600 -exec ls -l {} \; find /tmp -maxdepth 1 -name krb5cc_cron* -mmin +600 -exec rm -f {} \; __________________________________________________________________________ Date: Thu, 27 Aug 2009 17:33:02 -0500 (CDT) Status: In Progress _________________________________________________________________________ Date: Thu, 27 Aug 2009 17:36:41 -0500 (CDT) Status: Completed old kcron caches purged as specified ________________________________________________________________________ ________________________________________________________________________ ________________________________________________________________________ Date: Tue, 01 Sep 2009 14:25:20 -0500 From: Frank J. Nagy I have tested the fixed kcron on both 32- and 64-bit SLF4.7. It works fine but I have not been able to reproduce the unusual file names (mine all were krb5cc_1111_cron) that Art describes above. I tested on fgt0x5/7 - OSG development machines where I installed the RPMs I uploaded (also in nagy/public in AFS). Troy, Can you put these RPMs into distribution today? The kcron error is a possible major security hole that this update fixes. 
________________________________________________________________________ Date: Tue, 01 Sep 2009 14:35:11 -0500 From: Troy Dawson Will do ________________________________________________________________________ ============================================================================= 2009 08 26 ============================================================================= ######## # FARM # ######## Cleaning up the many duplicate cand files detected on Aug 18 ====================================================== PLAN find and remove duplicates from WRITE verify that roundup detects future cand duplicates ======================================================= Notes from email : Date: Tue, 18 Aug 2009 10:49:57 -0500 From: Rashid Mehdiyev ... WRITING to DCache 52 OOPS - Size mismatch , BAILING ls: /minos/data/mcout_data/daikon_04/CosmicLE/near/dogwood1/cand_data/314/n10033148_0002_CosmicLE_D04.cand. dogwood1.root: No such file or directory ... I have lost my records of the commands used to diagnose this, will reproduce them here. The problem: many duplicate cand files are in the WRITE area. samdup trips on the stale Merged. Created scripts/setupsam.sh for convenience. ============================================================ Duplicates in WRITE. First, remove stale Merged files -rw-r--r-- 1 minfarm e875 1142331333 Aug 15 20:00 Merged.11134.root -rw-r--r-- 1 minfarm e875 471831565 Jul 1 18:30 Merged.12615.root -rw-r--r-- 1 minfarm e875 721909056 Aug 9 20:52 Merged.31560.root -rw-r--r-- 1 minfarm e875 120523472 Aug 7 07:34 Merged.5596.root -rw-r--r-- 1 minfarm e875 9162752 Aug 12 23:12 Merged.9270.root FARM04 > find . 
-name Merged\* -mtime +5 -exec ls -l {} \; -rw-r--r-- 1 minfarm e875 471831565 Jul 1 18:30 ./Merged.12615.root -rw-r--r-- 1 minfarm e875 1142331333 Aug 15 20:00 ./Merged.11134.root -rw-r--r-- 1 minfarm e875 9162752 Aug 12 23:12 ./Merged.9270.root -rw-r--r-- 1 minfarm e875 721909056 Aug 9 20:52 ./Merged.31560.root -rw-r--r-- 1 minfarm e875 120523472 Aug 7 07:34 ./Merged.5596.root FARM04 > find . -name Merged\* -mtime +5 -exec rm {} \; ============================================================ Now find and remove the duplicates in WRITE FARM04 > ls | grep cand | wc -l 493 . ~/scripts/setupsam.sh ~/scripts/samdup -s cand /minos/data2/minfarm/WRITE > /minos/data/minfarm/maint/candups.200908.txt FARM04 > wc -l /minos/data/minfarm/maint/candups.200908.txt 171 /minos/data/minfarm/maint/candups.200908.txt These are all dogwood1 files. DUPF=/minos/data/minfarm/maint/candups.200908.txt WHAT ???????? samdup lists a log of N* files, which are not in WRITE. Oops, roundup was running on dogwood1 near and was purging Near cand's. Remake the list, now that roundup is past that phase grep -v ^N ${DUPF} | wc -l 34 ~/scripts/samdup -s cand /minos/data2/minfarm/WRITE > ${DUPF} wc -l ${DUPF} for FILE in `cat ${DUPF}` ; do ls -l ${FILE} ; done Local files are all Aug 17,19.21 for FILE in `cat ${DUPF}` ; do SLOC=`sam locate ${FILE} | cut -f 2 -d "'" | cut -f 1 -d ,` ls -l ${SLOC}/${FILE} ls -l ${FILE} echo done >> /minos/data/minfarm/maint/candups.200908.diff -rw-r--r-- 1 42411 e875 584493926 Aug 19 19:33 /pnfs/minos/mcout_data/dogwood1/far/daikon_03/CosmicLE/cand_data/125/f20031250_0000_CosmicLE_D03.cand.dogwood1.root -rw-rw-r-- 1 42411 e875 584488698 Aug 19 12:32 f20031250_0000_CosmicLE_D03.cand.dogwood1.root ... 
-rw-r--r-- 1 42411 e875 358927610 Aug 21 20:54 /pnfs/minos/mcout_data/dogwood1/near/daikon_04/L010000N_i317/cand_data/704/n13037046_0018_L010000N_D04_i317.cand.dogwood1.root
-rw-rw-r-- 1 42411 e875 358927225 Aug 21 16:53 n13037046_0018_L010000N_D04_i317.cand.dogwood1.root

Most of the files are like this.
The PNFS copies are a bit later than those in WRITE, and differ in size.

Exception :
-rw-r--r-- 1 42411 e875 109625749 Jul 30 16:15 /pnfs/minos/mcout_data/dogwood1/near/daikon_04/CosmicLE/cand_data/314/n10033148_0002_CosmicLE_D04.cand.dogwood1.root
-rw-r--r-- 1 minfarm e875 109625527 Aug 17 21:59 n10033148_0002_CosmicLE_D04.cand.dogwood1.root

The exception looks like a classic duplicate.
I don't understand the others.
I also don't understand the duplicate, it should have been detected.

ecrc checksums for the classic duplicate :
SAM 1340755968
PNFS 1340755968
WRITE 977452518

What about the first short term dup,
f20031250_0000_CosmicLE_D03.cand.dogwood1.root
SAM 2470755152
PNFS 2470755152
WRITE 3494086095

Looking at the size differences

for FILE in `cat ${DUPF}` ; do
SLOC=`sam locate ${FILE} | cut -f 2 -d "'" | cut -f 1 -d ,`
SPN=`ls -l ${SLOC}/${FILE} | tr -s ' ' | cut -f 5 -d ' '`
SWR=`ls -l ${FILE} | tr -s ' ' | cut -f 5 -d ' '`
(( SDIFF = SPN - SWR ))
echo ${SDIFF} ${FILE}
done

Differences are + and - equally, under 20K bytes, random.

Let's dig through the logs some more :

grep f20031250_0000_CosmicLE_D03.cand.dogwood1.root ROUNTMP/LOG/2009-08/*.log

No obvious problem.
I have no clue yet where the PNFS copies got written from,
so close in time to these duplicates.

For present, will move these to DUP.
for FILE in `cat ${DUPF}` ; do
ls -l ../DUP/${FILE}
done

for FILE in `cat ${DUPF}` ; do
mv ${FILE} ../DUP/${FILE}
done
________________________________________________________________________
________________________________________________________________________
Date: Wed, 26 Aug 2009 17:31:59 +0000 (GMT)
From: Arthur Kreymer
To: minos_batch@fnal.gov
Subject: mcnear/mcfar concatenation resuming

The writing of dogwood1 mcnear and mcfar data to PNFS
has been stuck for a couple of weeks
due to duplicate cand files in the write queue.

I have moved the 34 offending files out of the way,
and writes to PNFS have resumed.

I still do not know just how these came about.
The 'roundup' script does check for duplicate files,
and has generally handled this smoothly.
I will do more testing, once the backlog is cleared out.

###########
# SERVICE #
###########

Date: Wed, 26 Aug 2009 09:15:50 -0500 (CDT)
Request INC000000002263: Status has been updated.
Status: Completed
Summary: servicedesk very slow on Linux

Date: Wed, 26 Aug 2009 09:15:51 -0500 (CDT)
Unable to recreate
________________________________________________________________________
Searched for this under Incident Management,
Assigned to Carolina Sinclair
Owner Allen M Forni
________________________________________________________________________
I have updated the Work Info as follows :

8/26/2009 2:54:19 PM

I have replicated this problem on the email system
at the Wilson Hall Service Desk. ( Property tag 90824 )
It takes about 5 seconds to open windows for viewing a ticket or Work Info.
It takes a few seconds to close these windows.

On the Vista system ( Tag 109284 ) opening windows takes a fraction
of a second, and closing is so fast that the operation is not visible.
________________________________________________________________________
8/27/2009 3:54:28 PM UTC - 10:54 AM CDT

I have updated the Work Info as follows :

I have tested this again on the Service Desk Linux email system,
this morning at about 10:30.
Viewing a ticket takes about 10 seconds. Closing the window takes about 5 seconds. I am timing this with a local clock display xclock -digital -update 1 The xclock display stops updating for most of the 10 and 5 seconds that it takes to open and close the viewing window, indicating an extremely heavy system load of some sort. ________________________________________________________________________ Date: Tue, 01 Sep 2009 15:21:08 +0000 (GMT) From: Arthur Kreymer I have solved this problem by upgrading to Firefox 3.5.2. Service desk windows open and close very quickly, in well under a second. Please pass the word. ________________________________________________________________________ Date: Tue, 01 Sep 2009 13:30:49 -0500 (CDT) Status: Completed User upgraded to upgrading to Firefox 3.5.2 ________________________________________________________________________ ######### # ADMIN # ######### tokencron investigation Verified that the initial klist works on all nodes, 08:29 token email folder for all these cron emails, out of minosadmin Failure summary last night Aug 25/26 : 16:45 through 08:45 ( 16 hours ) flxi02 - 12 times SLF 4.6 2.6.9-89.0.3.ELsmp x86_64 flxi04 - 0 times SLF 4.5 2.6.9-89.0.3.ELsmp i386 minos25 - 34 times SLF 4.7 2.6.9-89.0.7.ELsmp x86_64 On flxi02, got the UID 13228 tokens 6 times, at (CDT) Aug 25 14:43:04 18:55:05 22:46:04 23:10:14 Aug 26 02:40:04 08:07:03 According to http://www-giduid.fnal.gov/cd/FUE/uidgid/uid.lis this is Brian Bockelman bbockelm@fnal.gov bbockelm@math.unl.edu /afs/fnal/files/home/room2/bbockelm I have added additional diagnostics to verify that this token is Brian's. Waiting for a recurrence. ___________________________________________________________________ Created ticketcron, which does not bother with aklog, but which tests the validity of the kerberos ticket. 
Running on minos27, since 16:15

Had to add MAILTO to the crontab, in order to get email
MAILTO='kreymer@fnal.gov'
___________________________________________________________________

Stale ticket cache files from other users are being used by kcron.
I added diagnostics of the ticket cache file permissions to tokencron.

CCNAME=`echo ${KRB5CCNAME} | cut -f 2 -d :`
printf "\nTICKET CACHE\n\n"
ls -l ${CCNAME}

Fresh diagnostics from tokencron on node minos25 :
-rw------- 1 jdejong e875 889 Aug 19 18:44 /tmp/krb5cc_cron29932

The ticket cache collision removes the subject files from /tmp.
Grabbed a snapshot of tokens on minos25 for future reference
( mostly files from Aug 19 and 20 )

MINOS25 > ls -l /tmp/krb5cc_cron* > /tmp/oldcache

Looking at the nodes of interest

ARK > ssh flxi04 'ls /tmp/krb5cc_cron* | wc -l'
51
ARK > ssh flxi02 'ls /tmp/krb5cc_cron* | wc -l'
432
ARK > ssh minos25 'ls /tmp/krb5cc_cron* | wc -l'
1327
ARK > ssh minos27 'ls /tmp/krb5cc_cron* | wc -l'
1

All the stale flxi02 ticket cache files belong to bbockelm,
and date from Aug 20 and 21.
This explains my getting bbockelm AFS tokens on flxi02.
___________________________________________________________________
I added this to the workinfo, about 11:00 Thursday 27 Aug 2009

AFS security problem

I have verified that when I get one of the stale kcron ticket cache files,
I may get a valid AFS token for the user who owns that file.

For example, here is output from a run of the tokencron test script,
showing the successful creation of a TOKENTEST file
in Brian Bockelman's home area.
Date: Thu, 27 Aug 2009 04:33:06 -0500 From: Cron Daemon To: kreymer@flxi02.fnal.gov Subject: Cron /usr/krb5/bin/kcron /afs/fnal.gov/files/home/room1/kreymer/minos/scripts/tokencron kinit: Internal file credentials cache error when initializing cache klist: Credentials cache file permissions incorrect while setting cache flags (ticket cache /tmp/krb5cc_cron10676) aklog: Couldn't get fnal.gov AFS tickets: aklog: Invalid argument while getting AFS tickets ERROR 4 ON flxi02 at Thu Aug 27 04:33:06 CDT 2009 Tokens held by the Cache Manager: User's (AFS ID 13228) tokens for afs@fnal.gov [Expires Aug 27 14:28] --End of list-- ORIGINAL KLIST CURRENT KLIST klist: Credentials cache file permissions incorrect while setting cache flags (ticket cache /tmp/krb5cc_cron10676) TICKET CACHE -rw------- 1 bbockelm us_cms 889 Aug 20 11:33 /tmp/krb5cc_cron10676 OOPS, got UID 13228 -rw-r--r-- 1 bbockelm cdwww 0 Aug 27 04:33 /afs/fnal/files/home/room2/bbockelm/TOKENTEST ___________________________________________________________________ ___________________________________________________________________ ___________________________________________________________________ ############ # PREDATOR # ############ Several very slow dccp's last night. These are all raw data files less than 2 hours old, so there should be no tape restores involved. Minutes File 10 N00016708_0001.mdaq.root Wed Aug 26 02:06:27 UTC 2009 55 F00044399_0003.mdaq.root Tue Aug 25 22:14:38 UTC 2009 25 F00044399_0004.mdaq.root Tue Aug 25 23:09:49 UTC 2009 50 F00044399_0007.mdaq.root Wed Aug 26 02:31:23 UTC 2009 128 F00044399_0009.mdaq.root Wed Aug 26 05:28:31 UTC 2009 There were about 30 queued restores around this time, in w-raw-minos-stkendca21a-1 w-raw-minos-stkendca24a-1 w-raw-minos-stkendca26a-1 Queues went to about 20 over the 10 active transfer limit. 
Probably due to 40 jdejong jobs running on the Minos Cluster
19:00 through 07:00 UTC

Bottom line - no action, talk to Jeff re making local copies

=============================================================================
2009 08 25
=============================================================================

#########
# ADMIN #
#########

tokencron
__________________________________________________________________________
#!/bin/sh
# Run this once a minute, to detect the problems found by condorweb
# * * * * * /usr/krb5/bin/kcron /afs/fnal.gov/files/home/room1/kreymer/minos/scripts/tokencron

HOST=`hostname -s`
PATH="/usr/krb5/bin:${PATH}"

# aklog exits non-zero when it cannot get an AFS token
aklog ; ERR=$?

if [ ${ERR} -ne 0 ]
then
    printf "\nERROR ${ERR} ON ${HOST} at `date`\n\n"
    tokens
fi
__________________________________________________________________________

Started this up around 14:25 CDT
Got one error fairly early on

Date: Tue, 25 Aug 2009 14:27:03 -0500
From: Cron Daemon
To: kreymer@fnal.gov
Subject: Cron /usr/krb5/bin/kcron /afs/fnal.gov/files/home/room1/kreymer/minos/scripts/tokencron

kinit: Internal file credentials cache error when initializing cache
aklog: Couldn't get fnal.gov AFS tickets:
aklog: Invalid argument while getting AFS tickets
aklog ERROR 4 ON minos25 at Tue Aug 25 14:27:03 CDT 2009

Added 'tokens' to the script, in error report.
Added 'klist' to the script, in error report.

Started on flxi02 (ia64) and flxi04 (i386) around 14:35
kcron did not work on flxi02 till after 14:40

Cleaned up a log, including use of /usr/krb5/bin/klist forcing the c