Last daily update:
2019-09-18 16:34:41 UTC
Last incremental update:
2019-09-19 08:31:52 UTC
2019-09-19 09:04:59 UTC
The recent problem has been cleaned up. A nanoHUB machine that handled results returned from the BOINC server froze, stranding 100k successfully completed results, along with some live nanoHUB sessions. We saved all the results and set up a new dedicated machine to handle that part of the workflow. We will ramp up WU production again.
We're having trouble with a nanoHUB server that sends jobs to the BOINC server for WU production. We had to temporarily halt WU production while we fix the other machine.
We are back up again. The BOINC server for this project is connected to several other key parts of the nanoHUB infrastructure. We had to upgrade the BOINC server in January to support the volume of WUs our crunchers consume. The WUs are created from nanoHUB input files by another server, and we had to do significant upgrades to that server. The database of in-progress jobs (via BOINC and other venues) was consuming all the memory on that machine, leading to swapping. That server is critical for all nanoHUB jobs, including courses at several universities, so the DB upgrade had to be thoroughly tested before rolling it out. We are tiptoeing back into the creation of WUs. TLDR: our crunchers consume jobs at a rate at least one order of magnitude higher than our other computing venues, which stresses all the infrastructure. We're watching everything closely.
While examining the BOINC job records we noticed that some jobs are failing with the exit code EXIT_ABORTED_VIA_GUI. If you have aborted a nanoHUB@Home job in your client, what was the reason?
In the same analysis we are working to improve the time and disk estimates for nanoHUB tools that produce WUs that fail often.
We are currently cleaning up a nanoHUB server (not the BOINC server) that was completely overwhelmed by the rate of WU consumption last week.
Thanks for your patience.
The machine that feeds jobs to the BOINC server is still hitting memory and process limits, so we haven't been adding new WUs. We are working on upgrades to that machine.
We installed a new job creation system on nanoHUB that should maintain a constant level of tasks ready to send. We are still tuning this system, so the level will vary over the next few days, but hopefully this will provide a steady stream of workunits for everyone.
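For anyone curious, the idea of holding a constant level of ready tasks can be sketched roughly as below. This is an illustration only, not the code running on nanoHUB: the function name, the target level, and the per-cycle cap are all hypothetical, and the real system may work quite differently.

```python
def tasks_to_create(ready: int, target: int, max_batch: int) -> int:
    """Return how many new tasks to create in one refill cycle.

    ready     -- tasks currently waiting to be sent (hypothetical input)
    target    -- desired number of ready tasks (hypothetical tuning value)
    max_batch -- cap on tasks created per cycle, so one refill cannot
                 overload the machine that builds the workunits
    """
    if ready < 0 or target < 0 or max_batch < 0:
        raise ValueError("counts must be non-negative")
    # Only top up when the queue is below the target level.
    shortfall = max(0, target - ready)
    # Never create more than the per-cycle cap.
    return min(shortfall, max_batch)
```

With a cap like this, a nearly empty queue refills over several cycles instead of all at once, which is one reason the ready level can vary while such a system is being tuned.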
The "temporary" slowdown ran longer than we thought it would. The systems that send nanoHUB jobs to BOINC had significant updates. The BOINC server will be down briefly sometime in the next few days as we increase the server size to support more running WUs. We will post an update on that shutdown as soon as we know.
We had several major software updates on January 7 on the nanoHUB servers that produce work for this project. Due to these updates, we are not producing new workunits right now. We will resume production in the next day or so.
The short job return time was intended to address some limits in our internal tools, used to produce BOINC workunits from nanoHUB simulation files. We are working to remove these limits.
We posted a bit more information about the scope of the project on the front page. The 200+ nanoHUB simulation tools that are producing workunits for this project can create simulations with a wide range of time and memory requirements. We are still working to produce batches of workunits that are NOT all long simulations or all high memory simulations. We have been accumulating statistics on previous workunits to address this problem.
We began exporting stats on Nov 30, and we updated the version of vboxwrapper distributed for Windows.
The project was down for roughly 24 hours due to a bad software update. Everything seems to be running smoothly again.
Disk problem is solved.
We filled up the server disk. We will shut down the project temporarily until we can increase the disk size.
If you encounter a scheduler request error with the message "Error in request message: xp.get_tag() failed", please disconnect your client and reconnect to the project. The project homepage previously included an incorrect scheduler configuration. Forcing your client to reconnect will update it to the correct info.
Thanks for helping us find issues with our BOINC setup. This project is supporting 185 nanoHUB simulation tools. We've been testing on our systems for a while now, mostly using Linux. Please use the message boards to report any problems you find. Thanks for your help and patience.