
Thread: Reseller6 Problems

  1. #1
    TonyD is offline New Bee
    Join Date
    Apr 2007
    Posts
    5
    WHB Points this Month
    0.00
    WHB Points
    0.00
    Thanks
    0
    Thanked 0 Times in 0 Posts

    Reseller6 Problems

    If I sound frustrated in this post, please forgive me. I just joined WHB less than a month ago as a reseller and so far this server has had two "emergency reboots". I saw those high up-time figures before joining... if those are even close to accurate, Reseller6 should show 96% up-time for this month (that's being extremely forgiving)... and it's much, much worse if you consider the network-wide problems of the last 24 hours.

    And I thought my previous hosts had up-time issues.

    No, I'm not just ranting. My point is that some of us would like to know exactly what's going on. I even asked via trouble ticket what caused the last reboot, and I didn't get an answer (just asked again, though). Does anyone know? Is the problem going to be fixed? Is the server just overloaded?

    As I type this... it's come back up? But for how long? Another week? Dang... it's like I'm on an IIS server or something.

    About five minutes away from running back to HostGator as fast as I can...

    TonyD

  2. #2
    Matt R. is offline WeeHBie
    Join Date
    Jul 2006
    Posts
    1,378
    WHB Points this Month
    0.00
    WHB Points
    55.00
    Thanks
    1
    Thanked 2 Times in 2 Posts


    Tony,

    You need to subscribe to the announcements forum. I was posting updates every 15 minutes at http://www.whbstatus.com/showthread.php?t=491.
    Matt Russell
    WebHostingBuzz CEO

    Follow me on Twitter: http://www.twitter.com/mattdrussell

  3. #3

    By the way, I checked the uptime so far for Reseller6 - it's at 99.621% this month. This is lower than we'd like and our technical team are investigating the server, but it still isn't "bad", nor is it 96%.

  4. #4
    I'm confused

    Quote Originally Posted by Matt R
    By the way, I checked the uptime so far for Reseller6 - it's at 99.621% this month. This is lower than we'd like and our technical team are investigating the server, but it still isn't "bad", nor is it 96%.
    [Edit] I'm glad to hear that someone is looking into the problem. I really hope we can get this server's issues resolved ASAP. I have new customers on this server, and since hosting is not my company's primary business, losing customers over it is a huge issue for me.

    I'm really not trying to be confrontational. I hate it when people do that on these forums (as if it helps anything). However, I love math, and 99.621% isn't possible any way I run the figures. So please tell me where I've gone wrong in my calculations.

    (Let me apologize in advance. When I said 96% up-time, I was figuring in all the problems of the last two days. It should have been 99.1% due to reboots.)

    First, let's be conservative and assume that it's really only been down the two times that are actually recorded in the Reseller6 status forum. Both times, the first announcement was posted well after the problem started (about an hour after for the first one; I didn't time the second), and each reboot took close to two hours. So let's say it's had a mere 5 hours of downtime this month.

    Today is the 25th of the month and the day isn't over, but we'll assume it's midnight for the sake of argument.

    25*24 = 600 hours total for the month
    5 hours downtime means 595 hours up-time
    595/600 = 99.167% (not 99.621%)

    Again, that's if we pretend that all the other downtime never happened.

    Factor in the 15 hours of downtime that I've recorded for this server in the last two days and we get:

    580/600 = 96.667%

    And yes, I count that because the server was inaccessible. Whatever the cause (and I blame nobody for it), the result is the same. But I'm starting to see that real up-time figures aren't the same as what's being posted, and I'm trying to understand why.

    Let's be extremely, even overly, conservative and pretend that the server didn't go down before the reboot was required either of those two times, that the server was restarted at the exact moment of the first announcement in the forum, and that it came up when the last announcement was made each time. Let's further assume that it's the end of the month (30 full days). Each reboot took two hours. So:

    716/720 = 99.444%

    Obviously, this last scenario is far from the truth. If the server had actually been running, there would have been no need for a reboot. But even with this patently false calculation, I still don't get 99.6%.
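    Each scenario above uses the same formula: uptime = (total hours minus downtime hours) / total hours. A minimal Python sketch that reproduces the three calculations, using the hour figures from this post as inputs (they are Tony's estimates, not monitoring data; the function and variable names are illustrative):

```python
def uptime_percent(total_hours: float, downtime_hours: float) -> float:
    """Percentage of the period during which the service was up."""
    return 100.0 * (total_hours - downtime_hours) / total_hours


# Tony's three scenarios: (hours in the period, estimated hours of downtime).
scenarios = {
    "25 days, 5 h of reboot downtime": (25 * 24, 5),
    "25 days, 5 h of reboots + 15 h of other outages": (25 * 24, 5 + 15),
    "30 days, two 2-hour reboots only": (30 * 24, 2 * 2),
}

for label, (total, down) in scenarios.items():
    print(f"{label}: {uptime_percent(total, down):.3f}%")
```

    Printed to three decimals, the scenarios come out to 99.167%, 96.667% and 99.444%.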

    I'm officially confused by this number, I guess.

    Sorry to nit-pick. As I said, I love math, and I couldn't help trying to over-analyze this number. Again, I don't mean anything by it except that I don't understand it.

    Sincerely,

    TonyD
    Last edited by TonyD; 04-25-2007 at 09:09 PM. Reason: Added title, and additional text

  5. #5
    Not "bad"?

    Quote Originally Posted by Matt R
    By the way, I checked the uptime so far for Reseller6 - it's at 99.621% this month. This is lower than we'd like and our technical team are investigating the server, but it still isn't "bad", nor is it 96%.
    I just re-read this post. As forgiving as I've been, I'm hoping you just misspoke here or that I'm being unfair in my interpretation.

    Let's assume that in some reality, the 99.621% is actually correct. "It isn't 'bad'"? My understanding is that you are one of the owners. I own a company myself. I know what my own standards are for quality and reliability (we primarily deal in IT consulting) and 99.621% up-time for a web server (even a non-mission-critical one) is certainly "bad" by my standards. I would think you would consider your servers "mission-critical" since your company's focus is on selling hosting packages (translation: if I were the consultant, I'd shoot for 99.9% up-time).

    Perhaps I'm getting hung up on semantics here. I consider "bad" to be synonymous with "unacceptable" or "unreasonable". And even if we really had 99.621% up-time, that translates to about 3 hours of downtime in a month. If this was a Windows IIS server, I would consider 1-2 hours of downtime per month to be expected or "reasonable" (reboots due to patches would account for 1/2 an hour). However, not one Linux/Apache server that I deal with is expected to have more than 20-30 minutes of downtime per month (goal of 99.9% up-time in most cases... and it's achieved).
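    Converting an uptime percentage into minutes of downtime makes comparisons like this easier. A small sketch, assuming a 30-day month (43,200 minutes); the helper name is hypothetical:

```python
MINUTES_PER_30_DAY_MONTH = 30 * 24 * 60  # 43,200 minutes


def downtime_minutes(uptime_percent: float,
                     period_minutes: int = MINUTES_PER_30_DAY_MONTH) -> float:
    """Minutes of downtime implied by an uptime percentage over the period."""
    return period_minutes * (100.0 - uptime_percent) / 100.0


for pct in (99.621, 99.9):
    mins = downtime_minutes(pct)
    print(f"{pct}% uptime -> {mins:.0f} min ({mins / 60:.2f} h) per 30-day month")
```

    By this arithmetic, 99.621% works out to roughly 164 minutes (about 2.7 hours) per 30-day month, and a 99.9% target still permits about 43 minutes.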

    Before you think I'm being unreasonable, I can perfectly understand having a "bad" month, but I certainly hope you don't mean that this amount of downtime is even close to your company's acceptable up-time standards. I can understand saying something to the effect of:

    "We realize this server has had an inordinate number of issues this month. This is outside our acceptable norm and we are rectifying the situation to bring this within our company's policy of achieving 99.9% up-time."

    Or whatever your policy is.

    Perhaps I've been a consultant too long and am used to dealing with suits that deal in hard numbers. I'm not trying to tell you how to run your business, nor am I trying to insinuate that you should be required to tell us all your company policies. But, with all the downtime issues you've had this month, you could at least give us some idea of what your accepted standard is for reliability so that we can feel that there is a specific goal that you are working to achieve.

    Setting a goal or standard for reliability doesn't mean you'll always meet it, but communicating this standard to your customers and being able to achieve it on a regular basis is a solid foundation for customer loyalty and satisfaction.

    I know this to be true from my relations with my company's customers: they are extremely loyal, mostly because we set goals that match or exceed those of all our competitors, and we almost always reach those goals. This way, on the rare occasion that we fail at something, customers know that we have a proven track record and that this is the exception rather than the norm.

    So far, in the month that I've been hosting with WHB, I've seen a lot of problems. This may be the exception rather than the rule. But I've seen next-to-nothing in the way of "this is the goal we are working towards; here's what we're doing to get there; and here's what we're doing to make things right by our customers in the meantime." I'm still waiting for this statement, but my patience (like anyone's) does have its limits.

    Sorry for the long post. I'm sure I just misunderstood what you said. I'm sure you (like me) do consider hours of downtime in a month to be "bad."

    Sincerely,

    TonyD

  6. #6
    TonyD is offline New Bee
    Join Date
    Apr 2007
    Posts
    5
    WHB Points this Month
    0.00
    WHB Points
    0.00
    Thanks
    0
    Thanked 0 Times in 0 Posts

    Got An Answer!!!

    I got an answer to the "why" question! It was an account that was causing high server loads (not mine thankfully... not that I thought it could be). The account has now been suspended.

    That's half of what I was looking for and is certainly a major improvement.

    I'm still curious about what WHB's policy is in regards to up-time expectations... but it's my own short-sightedness (and desperation at the time) for not finding this out before signing up.

    Sincerely,

    TonyD

  7. #7

    Tony,

    Your figures are correct, but the recorded downtime is not 15 hours; your method of recording it is inaccurate. We have a Nagios cluster (two, in fact) that monitors uptime, and the 99.6% figure is pulled directly from it.

  8. #8

    And in terms of expectations, we consider 99.75%+ uptime each month acceptable.

  9. #9

    Quote Originally Posted by TonyD


    I got an answer to the "why" question! It was an account that was causing high server loads (not mine thankfully... not that I thought it could be). The account has now been suspended.

    That's half of what I was looking for and is certainly a major improvement.

    I'm still curious about what WHB's policy is in regards to up-time expectations... but it's my own short-sightedness (and desperation at the time) for not finding this out before signing up.

    Sincerely,

    TonyD
    And lastly... it's not always apparent what causes the problem. We have an intricate monitoring system, but sometimes the load can spiral up so quickly that it renders the server inaccessible before we can find the culprit. We had to do a cold power cycle on rs6 last night, which resulted in the lengthy fsck; a reboot usually takes under 2 minutes.

  10. #10

    Quote Originally Posted by Matt R
    Tony,

    Your figures are correct but recorded downtime is not 15 hours, your method of recording this is inaccurate. We have a nagios cluster (2, infact) which monitor uptime and the 99.6% figure is pulled directly from this.
    Hmmm... wonder what exactly is recorded as "up-time" then. httpd had been going down several times a day over the last couple weeks on that server according to WHM. I suppose if you're counting the amount of time the system itself was actually turned on and running, then 99.6% makes sense. If we're talking about the amount of time that websites were accessible on that server, there is simply no way (as I said, even if we just take the time listed in the forum for those two reboots, we still don't come up with that high a number).

    But this is all a side issue anyway. As I said, I understand problems, and I understand that these problems never come one at a time. I'm glad to know what your up-time policy is, even if it is lower than what many other hosts have. I believe you hit the nail on the head in the thread regarding the network issues: communication. Expectations of good communication were my own primary reason for joining WHB.

    I like what others have suggested regarding some other type of alert system. These forums work great... but if the forums are down, like they were this week... then they do no good.

    Sincerely,

    TonyD

  11. #11

    Quote Originally Posted by TonyD
    Hmmm... wonder what exactly is recorded as "up-time" then. httpd had been going down several times a day over the last couple weeks on that server according to WHM. I suppose if you're counting the amount of time the system itself was actually turned on and running, then 99.6% makes sense. If we're talking about the amount of time that websites were accessible on that server, there is simply no way (as I said, even if we just take the time listed in the forum for those two reboots, we still don't come up with that high a number).

    But this is all a side issue anyway. As I said, I understand problems, and I understand that these problems never come one at a time. I'm glad to know what your up-time policy is, even if it is lower than what many other hosts have. I believe you hit the problem on the head in the thread regarding the network issues... communication. Expectations of good communication were my own primary reason for joining WHB.

    I like what others have suggested regarding some other type of alert system. These forums work great... but if the forums are down, like they were this week... then they do no good.

    Sincerely,

    TonyD
    You will not find better uptime at a better price, of that I am sure. The figures I posted are httpd uptime, and come from checks performed every minute, 24 hours a day.
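    For readers wondering what an "httpd uptime" check of this kind looks like, here is a minimal Python sketch of a once-a-minute HTTP probe. It assumes a site counts as up when a GET returns a 2xx/3xx status within a timeout. It is not Nagios's check_http, the function names are hypothetical, and it only measures reachability from wherever the script runs, not whether the machine is powered on.

```python
import http.client
import time
import urllib.error
import urllib.request


def http_is_up(url: str, timeout: float = 10.0) -> bool:
    """One probe: True if url answers with a 2xx/3xx status before the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 400
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        # Covers refused connections, DNS failures, timeouts and 4xx/5xx
        # responses (urlopen raises HTTPError, a URLError subclass, for those).
        return False


def poll_uptime(url: str, checks: int, interval_s: float = 60.0) -> float:
    """Probe url `checks` times, `interval_s` seconds apart; return % of probes that succeeded."""
    successes = 0
    for i in range(checks):
        if http_is_up(url):
            successes += 1
        if i < checks - 1:  # no need to sleep after the last probe
            time.sleep(interval_s)
    return 100.0 * successes / checks
```

    Calling poll_uptime(url, checks=1440) would cover one day at the default 60-second interval.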

  12. #12
    equazcion is offline Member
    Join Date
    Apr 2007
    Posts
    42
    WHB Points this Month
    0.00
    WHB Points
    0.00
    Thanks
    0
    Thanked 0 Times in 0 Posts


    Tony, in Matt's defense, your posts are wicked long and involve too much math so I don't blame him for just going by the figures he got from the monitoring software.

    I also agree that 99.6% is not such a horrible uptime for the price. If you expect 99.9% uptime then you should also expect to pay more. You can't go with one of the cheapest hosts out there and then complain that it's not perfect.

    However, Matt, if 99.621% is the figure you got from Nagios, then I think you might want to look into hosting the Nagios cluster at a different datacenter than your main locations -- or at least get one outside server to use for Nagios -- so that it can test the actual remote availability of HTTP.

    I say this because Alertra has you at 99.243% uptime for April. That's about 5 hours of downtime.

    If your Nagios cluster lives in the same datacenter as the servers it's checking, then it will only show downtime caused by internal server issues. You'll never see downtime caused by things like routing problems or service provider failures, because Nagios never has to leave the datacenter in order to contact the servers. It can just use the internal network. That could be the source of the discrepancy.

    This might also explain why in my "Downtime" thread, when I first mentioned the downtime, you said you hadn't even noticed it -- and that was after all 3 WHB sites had been going down over and over again, along with ALL the servers at FortressITX.

    Now this Alertra figure is from the Alertra monitoring that's included in the FindMyHosting.com report, and I'm pretty sure what they do is just test your main company URL. However, since it seems Tony's server suffered the same downtime as the main WHB site did in that incident a few days ago, then they must both be hosted at FortressITX, and this Alertra reading should be accurate for BOTH the WHB sites AND reseller6 (as well as every other server at Fortress).

    At least, it should be accurate in that the actual downtime for a server at Fortress is AT LEAST what Alertra says it is. Alertra wouldn't pick up on downtime caused by maintenance on one specific server, unless it happened to be the server where the main company site resides. That means that this Alertra figure only represents the downtime caused by that routing disaster. The downtime caused by reboots and maintenance on reseller6 should be ADDED to that figure.

    So I apologize for getting into math after criticizing Tony for it, but I'm going to anyway. Matt's figure from Nagios doesn't seem to include downtime caused by the routing failure, and Alertra's doesn't include the maintenance on reseller6. Therefore we should be able to add them together to get the total downtime reseller6 has suffered so far this month. Nagios's 99.621% means 0.379% downtime, and Alertra's 99.243% means 0.757% downtime. The grand total is 1.136% downtime, or 98.864% uptime.

    That's about 8 hours of downtime by my rough estimate. Still a far cry from the 15 hours Tony claims, but 8 hours is still a lot.
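    The combination above can be written out as a short calculation. This Python sketch adds the two monitors' downtime percentages, which is only valid under the assumption argued above that Nagios and Alertra recorded different, non-overlapping outages. The period lengths (28 days, since the post was written on April 28, and a full 30-day month) are assumptions for illustration, and the function names are hypothetical:

```python
def combined_uptime(*uptime_percents: float) -> float:
    """Add up the downtime each monitor recorded and convert back to uptime.

    Assumes the monitors saw *different*, non-overlapping outages. Overlapping
    outages would be double-counted by this simple sum.
    """
    total_downtime = sum(100.0 - u for u in uptime_percents)
    return 100.0 - total_downtime


def downtime_hours(uptime_percent: float, period_hours: float) -> float:
    """Hours of downtime implied by an uptime percentage over period_hours."""
    return period_hours * (100.0 - uptime_percent) / 100.0


total = combined_uptime(99.621, 99.243)  # Nagios figure, Alertra figure
print(f"combined uptime: {total:.3f}%")
print(f"downtime over 28 days: {downtime_hours(total, 28 * 24):.1f} h")
print(f"downtime over 30 days: {downtime_hours(total, 30 * 24):.1f} h")
```

    That gives about 98.864% combined uptime, or roughly 7.6 to 8.2 hours of downtime depending on the period, in line with the "about 8 hours" estimate.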

    Hey now my post is wicked long and has a lot of math too. Oh well.

    http://fmh.alertra.com/fmhuptime/?id1=125006
    Last edited by equazcion; 04-28-2007 at 11:04 AM.

  13. #13

    Hi All,

    Our Nagios setup is clustered, and I believe its results are accurate. One thing that may cause the discrepancy is that we monitor at 1-minute intervals. So if a reboot takes 2 minutes, Nagios will record 2 minutes.

    I suspect your Alertra monitoring checks every 5 minutes. So if that same reboot coincides with a scheduled Alertra check, it's going to show much more than the 2 minutes it actually took. I know Alertra can monitor at 1-minute intervals, but it's very expensive to do so.
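    The sampling effect described here can be illustrated with a toy simulation. It assumes probes at fixed intervals starting at t = 0 and that each failed probe is charged one full interval of downtime; real Nagios and Alertra accounting may work differently, and the function name is hypothetical:

```python
def recorded_downtime(outage_start: int, outage_len: int, interval: int,
                      horizon: int = 3600) -> int:
    """Seconds of downtime a fixed-interval poller would log for one outage.

    Probes run at t = 0, interval, 2*interval, ... (seconds). Each probe that
    lands inside [outage_start, outage_start + outage_len) counts as a failure
    and is charged one full interval of downtime.
    """
    outage_end = outage_start + outage_len
    failed = sum(1 for t in range(0, horizon, interval)
                 if outage_start <= t < outage_end)
    return failed * interval


# A 2-minute (120 s) reboot, sampled every 60 s and every 300 s.
for start in (290, 310):
    for interval in (60, 300):
        logged = recorded_downtime(start, 120, interval)
        print(f"reboot at t={start}s, probe every {interval}s -> logged {logged} s")
```

    With 300-second probes, the same 120-second reboot is logged as 300 seconds when it starts at t=290 and missed entirely when it starts at t=310, while 60-second probes log 120 seconds in both cases. Under this model the logged figure is always within one polling interval of the true outage length, so coarser polling can err in either direction.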

    Either way, the downtime this month wasn't welcomed or expected and it was ironic that it had to happen when the two most senior members of staff were travelling, and travelling back from the data center in question. They are usually more stable, and we maintain an excellent relationship with them but we have called in our SLA this month to ensure we keep them on their toes.

    We are also moving to a more independent network solution within the coming weeks, and we'll announce it when we do. Wayne (our senior admin) will be doing this in conjunction with datacenter personnel.

    It is our intention to hit 100% uptime, or as close to it as possible, each and every month whilst still maintaining current pricing. We have always pioneered affordable hosting and we will continue to do so, at a groundbreaking price and with performance we are all happy about. At the end of the month, we'll have 9-5 phone support and 24x7 live chat support too. I do not know of another host with our pricing that offers all of this, with the reliable service you are accustomed to.

    And lastly (this is possibly the longest forum post I've ever made!), we are going to be launching "business class" hosting, probably next quarter. We are looking at a number of ways to bring high availability to the mass market through software and hardware load balancers. Our goal is to have entry-level pricing starting at under $10/month. There was significant interest in this product in the recent poll we did, which reinforced our belief that a product such as this would be extremely popular.

  14. #14

    Clusters are great, but as I mentioned, if they are located within the same datacenter as the servers they're monitoring, then they are not picking up downtime caused by routing issues.

    And it doesn't matter what interval Alertra is using. Alertra is not monitoring your individual servers; it is only monitoring webhostingbuzz.com. You're talking about an inaccuracy due to the rebooting of reseller6, and I'm saying there's no way it would have even known reseller6 was down. It's only showing downtime from the routing issue. The discrepancy is caused by the fact that Alertra and Nagios were each logging downtime from 2 separate issues occurring at 2 separate times. Neither one shows the actual total downtime from both.

  15. #15

    I believe the Nagios cluster is intelligent (as we have 2 Nagios servers). I'll check, and if not, we'll definitely set it up this way.

  16. #16

    I'm sure it's intelligent...

    I admit I don't know a lot about what Nagios is capable of, but I'm pretty sure that it's not possible for any software to do what you're hoping it does.

    One computer trying to connect to another will always choose the shortest possible route. If both computers are within the same local network, they will end up connecting via the local router. There's just no way around that, at least for software. It would take a good many custom router configurations to somehow force the request to go out onto the internet before coming back to a local server.

    This is why people use services like Alertra -- because the only way to monitor remote connectability is by actually connecting from a remote location. Nagios monitors services and tells you if they go down, as that's what it's meant for, but it can't tell you anything about server reachability from outside the datacenter.

    There is another way which I've used to do this from my home network, to test my own local web server using its remote address, and that is by using a remote proxy server. If I connect through a remote proxy, then I'm making a request to the proxy, rather than to my local server, so the request actually leaves the local network. This still involves using a remote machine though. I don't think there's any way around that.
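    In Python, routing a check through a remote proxy can be sketched with urllib's ProxyHandler. The proxy address (for example a hypothetical http://proxy.example.net:3128) and the function names are placeholders, and the proxy has to be a machine outside the datacenter for the check to mean anything:

```python
import http.client
import urllib.error
import urllib.request


def build_proxied_opener(proxy_url: str) -> urllib.request.OpenerDirector:
    """Opener that sends HTTP and HTTPS requests through proxy_url (an outside machine)."""
    proxies = {"http": proxy_url, "https": proxy_url}
    return urllib.request.build_opener(urllib.request.ProxyHandler(proxies))


def reachable_from_outside(opener: urllib.request.OpenerDirector, url: str,
                           timeout: float = 10.0) -> bool:
    """True if url answers 2xx/3xx when fetched through the proxy-backed opener."""
    try:
        with opener.open(url, timeout=timeout) as resp:
            return 200 <= resp.status < 400
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        # A failure here can mean the target OR the proxy is unreachable.
        return False
```

    One caveat: a failure through this opener means the target or the proxy was unreachable, so the proxy's own downtime shows up as site downtime unless the proxy is monitored separately.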

    If you care, I have a suggestion for you. Since you now have servers at two different locations, Texas and NJ, I would take advantage of that. Set up one Nagios server in Texas, and one in NJ. Have the Texas one monitor the servers in NJ, and have the NJ one monitor the Texas servers. That way you'll always know if anything is inaccessible for any reason whatsoever.
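    The cross-monitoring arrangement amounts to a simple assignment rule: each site's monitor watches every server except those in its own datacenter. A minimal sketch in Python; the site keys and hostnames are hypothetical placeholders, not WHB's actual servers:

```python
def cross_site_assignments(servers_by_site: dict[str, list[str]]) -> dict[str, list[str]]:
    """For each site's monitor, list every server hosted at the *other* sites.

    A monitor never checks servers in its own datacenter, so each probe has to
    cross the public network and routing or upstream failures become visible.
    """
    return {
        monitor_site: [
            server
            for site, servers in servers_by_site.items()
            if site != monitor_site
            for server in servers
        ]
        for monitor_site in servers_by_site
    }


# Hypothetical inventory; site keys and hostnames are placeholders.
inventory = {
    "texas": ["tx-web-01", "tx-web-02"],
    "nj": ["nj-web-01"],
}
for monitor, targets in cross_site_assignments(inventory).items():
    print(f"monitor in {monitor} checks: {', '.join(targets)}")
```

    With more than two sites the same rule still applies; each monitor simply gets the servers at every other location.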

    Anyway. I'll shut up now.
    Last edited by equazcion; 04-28-2007 at 01:54 PM.

  17. #17

    Yeah, that's what we have. Alertra uses Nagios too, just fyi. It is incredibly powerful (and complex).

  18. #18
    Freak is offline New Bee
    Join Date
    Jul 2006
    Posts
    10
    WHB Points this Month
    0.00
    WHB Points
    0.00
    Thanks
    0
    Thanked 0 Times in 0 Posts


    And whenever your proxy is down, downtime is reported as well.

    Now we know what uptime you think is reasonable, what server load do you think is acceptable?
