Place to store all the technical information, with added goodness of the big grizzly bear 🐻
Unable to access availability database x because the database replica is not in the PRIMARY or SECONDARY role
Connection to database has been lost.
Magic Bear Finder has lost connection to Configuration database. Magic Bears is making attempts to connect to database.
Details: Unspecified error
SQL Server: MagicBearDB,5855
Next attempt to connect: In 15 minutes or later
Critical error occurred upon starting Magic Bear Finder Service.
Details: Database 'MagicBear_Poof' on SQL Server 'MagicBearDB,5855' is unavailable.
Unable to access availability database 'MagicBear_Poof' because the database replica is not in the PRIMARY or SECONDARY role. Connections to an availability database is permitted only when the database replica is in the PRIMARY or SECONDARY role. Try the operation again later.
Those errors will tell you the always on database name, which here is MagicBearDB and it is using port 5855 then you get an error about "unable to access" and "replicate is not the primary or secondary role" which to be honest is not helpful in the slightest, if you try an ODBC connection to the SQL server always-on and non-always on string it looks like this:
MagicBearDB,5855 fails to connect with Unable to access availability database 'MagicBear_Poof' because the database replica is not in the PRIMARY or SECONDARY role
10.345.99.44\MagicBearSQL fails to connect with Unable to access availability database 'MagicBear_Poof' because the database replica is not in the PRIMARY or SECONDARY role
10.365.954.44\MagicBearSQL accepting connections
So here we have 10.365.954.44\MagicBearSQL accepting connections and 10.345.99.44\MagicBearSQL not accepting connections, so SQL clearly thinks that the active database is on the connection path of 10.365.954.44\MagicBearSQL which should mean that SQL always-on should also be working as the active always-on path.
So the question is, why is the MagicBearDB,5855 path not working when it should be, well to answer that we need to look at the failover clustering on the SQL server from the failover clustering section you will see this:
This confirms that cluster name is indeed on correct server, so why is always on not working, as everything is online, but it's not because nothing is working on that SQL always on address, so for this we have to pop ourselves in SQL management studio to see what the SQL is doing.
So here we have the applications thinking that one server is primary as that is being accepted as the active connection, but SQL is adamant that the other server is primary, how very weird indeed or is it?
This means that this connect address : MagicBearDB,5855 points at the correct server which is where the primary always on is located, correct, but the ODBC connection says it can only connect on the other node, so has SQL would appear to be in a split brain scenario, but more investigation is required.
Google, does not always help
So, if we turn to your friend Google you get entries like this :
Noooooooooo - these are wrong, the first one says "that's a split brain" and you need to bypass always-on, and the other says you are not allowing TCP ports on the local firewall, so indeed if you would have followed these articles you would have indeed made the issue worse.
Google in this instance, would not fix the issue, but get you into more a mess than we were before because you would recycle services and allowing ports on a firewall - completely pointless
Confusing hey, this is where you must read the error we got back in the beginning:
Unable to access availability database 'MagicBear_Poof' because the database replica is not in the PRIMARY or SECONDARY role
What is this actually saying is that SQL cannot determine where the primary and secondary roles are, which for me was insightful, so let's dig into that, if you are in an always on cluster the databases on the primary side should say "Synchronised" and the databases on the passive side should say "Synchronising"
That is not right at all, all the key databases are in the error state, this means the system somewhere has a bottleneck that is causing SQL to misbehave and misreport, so I popped open task manager and noticed that the server has using 97% of the memory it had, I have no screenshot of that but Azure to the rescue, you can see for a server with 128GB of RAM it has 5GB left, however very bizarre and when I looked at the processed for "sqlsvr.exe" one of them was 12GB in size and the other was 500MB in size - the 12GB related to the MagicBear_Poof_DB1 instance and the 500MB equated to the MagicBear_Poof_DB2 instance, the one we have a problem with........
So this is now becoming a lot clearer, due to a memory leak probably down to some bad SQL causing a memory drain, SQL was not able to function as the server enquires more than 5GB of RAM to run the operating system, so SQL was fighting for RAM with Windows, and Windows was winning, well how could this happen as I check the SQL max memory size and it was set like this:
This means each SQL instance can be 52GB in size and there are two of them, so that is 104GB with 128GB available that would leave 24GB of memory to run the OS, that should be fine, and how can we clear the RAM down without rebooting the server, well that is a tool called RAMMAP - that is this amazing utility that looks like this, that tells you where all your RAM has wondered off to.....
However on this server it looked this:
This means the server has so little RAM it cannot compute any data at all, so this was simple, I stopped the service on the server or attempted to stop it, but I knew it would timeout as the RAM was exhausted, and in those conditions when stuff does not work you need to clear the working set memory, as that is most the time where the leak is, to an accuracy of 91% in SQL - that is done from Empty then Empty Working set.........
This was done with the service stopping, the reason I did that at the same time, is when you do that SQL shutdown will be "clean" once you get more RAM, which once I cleared working set it had an immediate effect.....it is even annotated.......
Then once the working set was released and SQL service was stopped, a simple restart and automagically it all looks good:
SQL now knows the primary is primary and everything returns to normal....no connection string updated, no other configuration changed and more importantly no server reboots.
Free up the non-working set memory with RAMMAP and then restart the SQL instance in peril, before playing with your always-on cluster configuration and causing yourself more pain in the long run when this is a simple memory leak that caused O/S exhaustion.
This was fixed in the end by understanding the end-to-end process to figure out what was "actually" going on, if you follow this guide and you notice certain things are not matching up - do not just blindly follow this guide, as I outlined earlier Google told me the wrong solution to fix my particular issue, so please do not follow this guide and except it to fix your issue.
Being technical is not following another persons instructions until the end and hoping it fixes your issues, if you have that error and the conditions do not match the same as this guides, stop and investigate, please - trust me I have seen many people in these scenarios cause more issues than they fix because a "guide" on the internet is just "followed" - if it helps, great.
This is not an issue that should occur anymore, and even if it did you should be monitoring for events that are very basic like low memory conditions, Azure metrics in this instance saw the issue but issue is, when there are lots of a alerts or notification - sometimes you cannot see the wood for the trees in that forest.