Web Service goes down - But Music Continues to Play on Auto DJ - Advice please

Hey All,

Ran into a strange issue here - I have had a ticket in for days now, but it's just back and forth with no solutions yet.

Let me start by saying I have Pingdom and UptimeRobot both set up to alert me if my server goes down or if my control panel website goes down.

For the last 5 or 6 weeks, on either a Monday or a Tuesday, I have been getting a notification from both monitoring sites that the HTTP login site has gone down.

Weird part - the Auto DJs that are running are still playing - DJs can log in - go live, etc.  But no one can reach the Centova panel.

The only way I have been able to remedy the situation is to reboot the entire server.

Tech support gave me some commands to try. They told me:

Please run this command in your SSH terminal:

/usr/local/centovacast/sbin/fixperms

Now, restart the Web service again:

/etc/init.d/centovacast stop-web
/etc/init.d/centovacast start-web


I tried this - no fix.


Has anyone had this happen? How do I fix it? The weirdest part is that it only occurs on a Monday or Tuesday. The same auto DJs are always running, with no other common DJs live or anything like that. The server stays up; the web service goes down.

Any help is appreciated!

Todd
I had this happen to me a while back, and the remedy you posted above fixed my issue:

/etc/init.d/centovacast stop-web
/etc/init.d/centovacast start-web

Check your logs; they will tell you why it shut down.

I am not seeing anything about the shutdown in the logs. They were empty.

When you run that fixperms command, shouldn't it take some time to run? For me it was immediate, a split second.

I have since rebooted the server because I needed the panel, without having the problem rectified, so the web service is back up. But it will most likely crash again on Monday or Tuesday. It's extremely strange.
For me it was instant too.

This helped.  Check your cron job logs.

http://forums.centova.com/index.php?topic=1498.0

The last part tells you how to restart it and add a debug log feature.
I had the same issue, did what you said, and that fixed it. Thanks for posting.
My problem was that every 2 days it would show a 502 Bad Gateway error, and I had to reinstall the CPanel each time it happened.
OK, my frustration is continuing to grow, as are my expenses.

I have been in contact with support many times now, and am struggling to get this problem rectified. My messages were even escalated to the developers.

My server was an OVH - cheap stuff - and this was the error log:

Dec 29 22:20:22 ks20321 kernel: PAX: size overflow detected in function atomic_add_return /var/home/fx/src/ovh-kernel/ovhkernel-xxxx-grs-ipv6-64/linux-3.10.9/arch/x86/include/asm/atomic.h:337 cicus
Dec 29 22:20:22 ks20321 kernel: CPU: 0 PID: 4433 Comm: sc_serv Not tainted 3.10.9-xxxx-grs-ipv6-64 #1
Dec 29 22:20:22 ks20321 kernel: Hardware name: /DH67BL, BIOS BLH6710H.86A.0156.2012.0615.1908 06/15/2012
Dec 29 22:20:22 ks20321 kernel: 0000000000000000 ffff8801fafc7dd8 ffffffff81da3b50 ffff8801fafc7de8
Dec 29 22:20:22 ks20321 kernel: ffffffff8119bb34 ffff8801fafc7df8 ffffffff811af257 ffff8801fafc7e18
Dec 29 22:20:22 ks20321 kernel: ffffffff81b7fc56 0000000000000000 ffff8801f7cc1e00 ffff8801fafc7f08
Dec 29 22:20:22 ks20321 kernel: Call Trace:
Dec 29 22:20:22 ks20321 kernel: [<ffffffff81da3b50>] dump_stack+0x19/0x21
Dec 29 22:20:22 ks20321 kernel: [<ffffffff8119bb34>] report_size_overflow+0x24/0x30
Dec 29 22:20:22 ks20321 kernel: [<ffffffff811af257>] get_next_ino+0x77/0x80
Dec 29 22:20:22 ks20321 kernel: [<ffffffff81b7fc56>] sock_alloc+0x26/0x80
Dec 29 22:20:22 ks20321 kernel: [<ffffffff81b82c00>] SYSC_accept4+0x70/0x290
Dec 29 22:20:22 ks20321 kernel: [<ffffffff811b3bd0>] ? mntput_no_expire+0x40/0x140
Dec 29 22:20:22 ks20321 kernel: [<ffffffff811b456f>] ? mntput+0x1f/0x40
Dec 29 22:20:22 ks20321 kernel: [<ffffffff81108c94>] ? ktime_get_ts+0x54/0xf0
Dec 29 22:20:22 ks20321 kernel: [<ffffffff811e2531>] ? poll_select_copy_remaining+0x91/0x230
Dec 29 22:20:22 ks20321 kernel: [<ffffffff81b842b9>] SyS_accept4+0x9/0x20
Dec 29 22:20:22 ks20321 kernel: [<ffffffff81bbddea>] compat_sys_socketcall+0x1da/0x310
Dec 29 22:20:22 ks20321 kernel: [<ffffffff81db26bc>] sysenter_dispatch+0x7/0x24
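For anyone trying to read a trace like the one above, here is a minimal Python sketch that pulls out the process that was running, the function the kernel flagged, and the reliable part of the call chain. The name `summarize_trace` and the regular expressions are assumptions based on the layout of these particular log lines, not part of any official tool, and the sketch only describes the trace, it does not diagnose it.

```python
import re

PAX_RE = re.compile(r"kernel: PAX: size overflow detected in function (\w+)")
COMM_RE = re.compile(r"kernel: CPU: \d+ PID: (\d+) Comm: (\S+)")
# A "?" before the symbol marks a frame the kernel itself considers unreliable.
FRAME_RE = re.compile(r"kernel: \[<[0-9a-f]+>\] (\? )?(\w+)\+0x")


def summarize_trace(lines):
    """Collect the flagged function, the process, and the reliable frames."""
    summary = {"function": None, "pid": None, "comm": None, "frames": []}
    for line in lines:
        match = PAX_RE.search(line)
        if match:
            summary["function"] = match.group(1)
            continue
        match = COMM_RE.search(line)
        if match:
            summary["pid"] = int(match.group(1))
            summary["comm"] = match.group(2)
            continue
        match = FRAME_RE.search(line)
        if match and match.group(1) is None:  # skip the "?" frames
            summary["frames"].append(match.group(2))
    return summary


# Four lines copied from the log above.
sample = [
    "Dec 29 22:20:22 ks20321 kernel: PAX: size overflow detected in function "
    "atomic_add_return /var/home/fx/src/ovh-kernel/ovhkernel-xxxx-grs-ipv6-64/"
    "linux-3.10.9/arch/x86/include/asm/atomic.h:337 cicus",
    "Dec 29 22:20:22 ks20321 kernel: CPU: 0 PID: 4433 Comm: sc_serv "
    "Not tainted 3.10.9-xxxx-grs-ipv6-64 #1",
    "Dec 29 22:20:22 ks20321 kernel: [<ffffffff811af257>] get_next_ino+0x77/0x80",
    "Dec 29 22:20:22 ks20321 kernel: [<ffffffff811b3bd0>] ? mntput_no_expire+0x40/0x140",
]
print(summarize_trace(sample))
```

Run against the full trace, this would show that the process was `sc_serv` and that the check fired in `atomic_add_return` while the kernel was accepting a new connection (`SyS_accept4`, `sock_alloc`, `get_next_ino`).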



I was told by Centova that there was a serious hardware problem. We proceeded to upgrade the kernel and then, when that didn't work, downgraded to an older, stable version according to kernel.org. But no fix. The web service portion of the server continued to crash every 80 to 120 hours - pretty much like clockwork.

I contacted the data center - they confirmed the server was NOT down - and that the problem was software related.

So where does that leave me? Centova is telling me it's hardware; the DC is telling me it's software.


This past Wednesday, I purchased a new server, not an OVH, but from Online.Net.  Triple the price, better specs.  Had CentOS installed, and paid Centova $120 to perform the migration to this new server.

Problem solved? Nope.

At 1 pm today, 88 hours after the migration was completed - guess what? The web service went down.

I have two alerts set up, one pinging the IP address of the server and one on IP:2199. The only one that went down again is the one on IP:2199.
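The difference between those two alerts can be sketched in Python roughly as follows. This is only an illustration: the function names are made up for the example, and a plain TCP connect is assumed as a stand-in for what the monitoring services do. Port 2199 is the panel port mentioned above.

```python
import http.client
import socket
import urllib.error
import urllib.request


def tcp_port_open(host, port, timeout=5.0):
    """True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def http_status(url, timeout=5.0):
    """Return the HTTP status code, or None if no HTTP response arrived."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        return exc.code  # 4xx/5xx still means the web server answered
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        return None


def classify(port_open, status):
    """Turn the two probe results into a short description."""
    if not port_open:
        return "port closed or host unreachable"
    if status is None:
        return "port open but no HTTP response"
    if status >= 500:
        return "web front end answering, but failing (HTTP %d)" % status
    return "panel reachable (HTTP %d)" % status
```

With a hypothetical host name, a check would look like `classify(tcp_port_open("panel.example.com", 2199), http_status("http://panel.example.com:2199/"))`. A result that separates "port closed" from "front end answering, but failing" narrows down which piece of the web service is actually dying.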

Hopefully I can post the new log here this evening when I am not on mobile, and maybe someone can help with a solution. It can't be another hardware issue on a totally different server, with a different kernel version, in a different DC.

I have another support request in to Centova. But any help here is greatly appreciated.


Edit: One of the major problems here is that any time I bring the web service back up by rebooting, it takes me 80 to 120 hours or so to figure out if it worked, because that is the time frame in which it goes down.

Last Edit: January 26, 2014, 11:43:23 am by Todd73NJ
So here are the basics of the entire situation, I do have a new ticket in with Centova - but maybe one of you guys will see something:



I had been running my Centova on a Kimsufi/OVH server. The server itself never actually went down in the entire time that I have had it in my possession and been monitoring it. However, every 80 to 120 hours the web service for Centova went down, and in order to bring it back up, I was forced to reboot the server. This solved the problem for another 80 to 120 hours or so.

Centova support escalated the issue to a dev.

He was kind enough to look at my logs, and found all sorts of kernel errors, which he diagnosed as a serious hardware issue with the server. The listing of the errors is in the above response.

I contacted my DC, and they said there were no hardware issues with the server. We also upgraded the kernel to the current stable version, 3.12.9 (according to kernel.org), rebooted the server, and the web service crashed some 88 hours later. We then rolled back the kernel to 2.6.34, rebooted, and again the web service crashed near the 100 hour mark.

At this point, I decided to chalk it up to being a bad server, contrary to what the DC was telling me.

On Wed, 1/22, I contracted Centova to migrate over from my OVH/Kimsufi Server to my Online.Net server. This job was completed approx 10pm EST on 1/22.

Everything appeared to be running great for a few days.

But today, 1/26, at 12:58 EST, the Webservice monitor reported that it was down. I checked Streams.CentralSocial3d.com:2199 and received:

502 Bad Gateway

cc-web/1.2.9

However, the monitor on just the server IP address showed the server was up and running fine. I was able to ping the server, but unable to reach the Centova panel. The problem again appears to be the web service only.

Ironically, this downtime occurred 84 hours and 55 minutes after the migration to the new server, which happens to be in the same window of time in which the other server's web service would go down on a consistent basis.

I was forced to reboot several times to get the web service to come back up. On the 3rd reboot it did. And it is now running flawlessly again. However, judging by what I saw today, I would fully expect the web service to crash again between Wednesday night and Friday night.

I desperately need to find a solution to this problem.

Here are all the logs. (They are based on Paris time, so the time that the server went down would be between 18:53 and 18:58. I cannot get a specific minute because my monitor checks the web service and the server once every 5 minutes.)
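For anyone digging through those logs, the window above can be worked out as in this short Python sketch. It assumes fixed January offsets (EST at UTC-5, Paris at UTC+1, no daylight saving), and `log_window` is a made-up helper name.

```python
from datetime import datetime, timedelta, timezone

# Assumption: fixed winter offsets, which hold for late January.
EST = timezone(timedelta(hours=-5), "EST")
CET = timezone(timedelta(hours=1), "CET")


def log_window(alert_time, check_interval_min=5, log_tz=CET):
    """Return (start, end) of the period to search, in the log's time zone.

    The monitor only polls every few minutes, so the failure happened
    somewhere between the previous poll and the alert.
    """
    end = alert_time.astimezone(log_tz)
    start = end - timedelta(minutes=check_interval_min)
    return start, end


alert = datetime(2014, 1, 26, 12, 58, tzinfo=EST)
start, end = log_window(alert)
print(start.strftime("%H:%M"), "to", end.strftime("%H:%M"), end.tzname())
# prints: 18:53 to 18:58 CET
```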

CC-AppServer Logs:
http://brooketv.net/cc-appserver.log
http://brooketv.net/cc-appserver.log.1.log
http://brooketv.net/cc-appserver.log.2.log
http://brooketv.net/cc-appserver.log.3.log

CC-Web Logs:
http://brooketv.net/cc-web.log
http://brooketv.net/cc-web.log.1.log
http://brooketv.net/cc-web.log.2.log
http://brooketv.net/cc-web.log.3.log

/var/log messages:
Jan 20 - 26 http://brooketv.net/messages-20140126.txt

Jan 26 - current http://brooketv.net/messagesnew.txt



If any of you see anything jump out at you, I'd appreciate your input.  I really need to get to the bottom of this problem.

Thanks!
OK, so it's been 3 days and 17 hours, and everything was fine until this morning.

That puts me right back in that 80 to 120 hour window.

All along, server load has been .05 to .10. Auto DJs running, live events, no issues. Memory used 1/8. 0 swap.

This morning, load was at 5, then 10, a few hours later 20, and now 30+! Memory still shows 1/8 used. 0 swap.

So I'd guess the server is about to crash.


I attempted to get an SSH prompt. It took about 15 attempts, but here is the top report.

Can anyone make anything of this?

http://i1215.photobucket.com/albums/cc506/Todd73NJ/TOPreport_zps3c189bd1.jpg
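Since the screenshot only captures one moment, a small Python sketch like the one below could record the same picture in text form. On Linux the load average counts processes in uninterruptible sleep (state D, usually waiting on I/O) as well as runnable ones, which is one way load can climb while memory and CPU look idle. The function names are illustrative assumptions, and the sketch only reads `/proc`.

```python
import os
from collections import Counter


def state_from_stat(stat_text):
    """Extract the one-letter process state from /proc/<pid>/stat contents."""
    # The command name sits in parentheses and may itself contain spaces or
    # parentheses, so split on the last ")" rather than on whitespace.
    return stat_text.rsplit(")", 1)[1].split()[0]


def count_states(proc_root="/proc"):
    """Count processes per state letter (R running, S sleeping, D uninterruptible)."""
    counts = Counter()
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "stat")) as fh:
                counts[state_from_stat(fh.read())] += 1
        except (OSError, IndexError):
            continue  # the process exited while we were looking
    return counts


def report():
    """One line with the load averages and the process state totals."""
    load1, load5, load15 = os.getloadavg()
    states = count_states()
    return "load %.2f %.2f %.2f | running=%d uninterruptible=%d total=%d" % (
        load1, load5, load15, states["R"], states["D"], sum(states.values()))
```

Calling `print(report())` every minute from cron and appending the output to a file would leave a history to look at after the next reboot. A steadily growing `uninterruptible` or `total` count would point at processes piling up rather than at real CPU work.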
I'm starting to feel like I bought a software package that's in beta testing or something. 5 support replies from me, 7 days gone by, and nothing from Centova.

I just don't get it. Centova has been the ONLY software running on 2 totally different servers, and yet the blame continues to fall on the hardware, with me needing to figure out the problem.


I had a crash about 80 hours ago, as I detailed in prior posts, where something caused the server to climb from basically no load to a load of 90 over the course of 12 hours.


Well, last night around 3am I stopped monitoring my server from my computer. Load was .01 to .02, even with a few hundred listeners on live events and all the auto DJs running as they normally do. No issues whatsoever.

So I decided to check in this morning, and for the first time I noticed swap memory had been used: 412168k to be exact. What would make this happen? No DJs were live, just the auto DJs running as they had been for the past 80 hours.
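One way to see which processes are holding that swap is sketched below in Python. It reads the `VmSwap` field from `/proc/<pid>/status`, which Linux kernels have exposed since 2.6.34; kernel threads have no such field and are counted as zero. The function names are illustrative assumptions.

```python
import os
import re

VMSWAP_RE = re.compile(r"^VmSwap:\s+(\d+)\s+kB", re.MULTILINE)
NAME_RE = re.compile(r"^Name:\s+(.+)$", re.MULTILINE)


def swap_from_status(status_text):
    """Return (process name, swapped-out kB) from /proc/<pid>/status text."""
    name = NAME_RE.search(status_text)
    swap = VMSWAP_RE.search(status_text)
    return (name.group(1) if name else "?", int(swap.group(1)) if swap else 0)


def top_swap_users(proc_root="/proc", limit=10):
    """List (kB, pid, name) for the processes with the most memory in swap."""
    usage = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "status")) as fh:
                name, kb = swap_from_status(fh.read())
        except OSError:
            continue  # the process exited while we were looking
        if kb:
            usage.append((kb, int(entry), name))
    return sorted(usage, reverse=True)[:limit]
```

Running `for kb, pid, name in top_swap_users(): print(kb, pid, name)` as root shortly before the 80 hour mark would show whether one process accounts for most of the swapped memory or whether it is spread across many.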

The server load is now showing slightly higher. I'm seeing a lot of readings in the .20 to .40 range, something that I had not been seeing at all since the last reboot.

Ironically, we are now in that 80+ hour window where two different servers running only Centova have crashed, and the assessment from support is hardware issues.

The commands that were suggested are being run as cron jobs, but they are not helping the issue.

Someone has to have some better insight for me. My frustration is building. Over $600 in license, install, migration and service fees, and now my second server is experiencing the same problems as the one I migrated away from.

I'm no server guru, which is why I bought a professional package with support available. You sort of expect it to work.

Any help would be appreciated!
 
Who is your server company?
I was with OVH. (Kimsufi Brand)

I got the server there and put a monitor on it. It ran straight for a month, so I decided to use it for Centova. I have no clue how to do installs, so I hired them to do it.

It ran great for 80 hours, then the web service crashed. Music continued to play for another day, then everything crashed. But the server was still up, pingable, and showing a very light load.

I tried some of their support methods, but no matter what I tried, the same events occurred every 80 to 120 hours.

The devs told me, from looking at my logs, that I had a very serious hardware problem (based on the kernel errors). The DC said the hardware was fine and that it's the software. I tried upgrading and reverting to other kernels. Same results: it runs great for 80 or so hours, then stops.


Kimsufi is the cheapest server on the market. (That being said, we have one with Wowza running on it for 6 months straight without a reboot or a hitch, so they can't be all that bad.) But I decided to go with the devs' assessment.

I now have a server with Online.Net. It runs great, with double the specs of the other server. No issues at all the first week I had it. So I had Centova migrated over.

Guess what? 80 to 120 hours.. same problem.


I am no techie, but just using some logic it seems like something is overflowing. Maybe it is based on use, which would explain why it happens in the same time window. MySQL? Log files? I have no idea.
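A quick way to test the "something is filling up" guess for disks and log files is sketched below in Python. The directory `/usr/local/centovacast` is taken from the fixperms path earlier in the thread and `/var/log` is the usual system log location; where Centova actually keeps its logs and data is an assumption here, and the function names are made up for the example.

```python
import os
import shutil


def percent_used(path):
    """Percentage of the filesystem holding `path` that is in use."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def largest_files(root, limit=10):
    """Return the biggest files under `root` as (size_in_bytes, path) pairs."""
    sizes = []
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            try:
                sizes.append((os.path.getsize(path), path))
            except OSError:
                continue  # file vanished or is unreadable
    return sorted(sizes, reverse=True)[:limit]


def report(roots=("/usr/local/centovacast", "/var/log")):
    """Print disk usage and the largest files for each directory that exists."""
    for root in roots:
        if not os.path.isdir(root):
            continue
        print("%s: filesystem %.1f%% full" % (root, percent_used(root)))
        for size, path in largest_files(root, limit=5):
            print("  %12d  %s" % (size, path))
```

Calling `report()` right after a reboot and again around hour 70 would show whether any file or filesystem grows noticeably in between. It would not catch other kinds of exhaustion, such as open connections or database limits.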


But this same issue can't be happening in two different DCs, with different CPUs, specs, kernels, etc.

Do you encode all your music with the same program, specs, etc.?

In other words, could one of your MP3s be encoded differently than the others?
I thought about that. I have tried the auto DJ with the re-encode setting on and also with it off.

And honestly, the maximum amount of auto DJ content on any single stream is only 3GB. The problem doesn't occur daily; it occurs every 80 to 120 hours. If a bad file were the cause, I think it would happen much more frequently.
Well, it could be a bad file or bad metadata. Every time Centova goes to play that MP3, it could be causing the trouble.

That is why I asked if you are encoding all the files before you add them to your server. That's something I do, and everyone should do it for consistency.
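A rough way to spot files that were encoded differently from the rest is sketched below in Python. It reads only the first MPEG-1 Layer III frame header of each file, so it is an approximation: VBR files, other MPEG variants and damaged files are not handled properly, and a matching header does not prove a file is sound. The function names and the idea of grouping by (bitrate, sample rate) are assumptions for illustration, not a Centova feature.

```python
import os
from collections import defaultdict

# Lookup tables for MPEG-1 Layer III only; reserved values map to None.
BITRATES_KBPS = [None, 32, 40, 48, 56, 64, 80, 96, 112,
                 128, 160, 192, 224, 256, 320, None]
SAMPLE_RATES_HZ = [44100, 48000, 32000, None]


def id3v2_length(header):
    """Total length of a leading ID3v2 tag, or 0 if the file has none."""
    if len(header) >= 10 and header[:3] == b"ID3":
        # The tag size is a "syncsafe" integer: 7 useful bits per byte.
        size = (header[6] << 21) | (header[7] << 14) | (header[8] << 7) | header[9]
        return 10 + size
    return 0


def first_frame_format(data):
    """Return (bitrate_kbps, sample_rate_hz) of the first frame header found."""
    for pos in range(len(data) - 2):
        b0, b1, b2 = data[pos], data[pos + 1], data[pos + 2]
        # 0xFF then 1111101x means: frame sync, MPEG-1, Layer III.
        if b0 == 0xFF and (b1 & 0xFE) == 0xFA:
            bitrate = BITRATES_KBPS[b2 >> 4]
            rate = SAMPLE_RATES_HZ[(b2 >> 2) & 0x03]
            if bitrate and rate:
                return bitrate, rate
    return None


def file_format(path, scan_bytes=65536):
    """Skip any ID3v2 tag, then decode the first frame header in the file."""
    with open(path, "rb") as fh:
        fh.seek(id3v2_length(fh.read(10)))
        return first_frame_format(fh.read(scan_bytes))


def group_by_format(root):
    """Map each (bitrate, sample rate) found under `root` to its files."""
    groups = defaultdict(list)
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            if name.lower().endswith(".mp3"):
                path = os.path.join(dirpath, name)
                try:
                    groups[file_format(path)].append(path)
                except OSError:
                    groups["unreadable"].append(path)
    return dict(groups)
```

Pointing `group_by_format` at a copy of the media folder should give one large group if everything was encoded the same way. Files that land in a small group, under `None`, or under `"unreadable"` would be the first ones to re-encode or pull from the playlist as a test.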