HylaFAX The world's most advanced open source fax server


Re: [hylafax-users] faxq crashing: QLink::remove



Gavin White wrote:

I'm using hylafax-4.3.0-2rhel4 on a stock RHEL4 install. I'm sending
five or ten faxes through a minute, and things are generally ok.

However, every few hours, the faxq scheduler dies unexpectedly. I have
turned up all the tracing, and so far the only clue I have is the
following syslog line, which appears at about the time the scheduler
crashes:

FaxQueuer[14682]: Assertion failed "QLink::remove: item not on a
list", file "QLink.c++" line 53.

Does anyone know what this could be? What information would be
required to investigate further?


Please allow me to respond to this post again, this time with more detail and explanation, now that I've researched the matter further.

faxq works in a largely asynchronous manner. When "idle", faxq just sits and waits for events such as FIFO activity or signals. On receiving one, it runs through some procedure to address the event: perhaps starting fax job preparation, perhaps launching faxsend, perhaps just running through the outbound queue to "check" on things that may have changed, perhaps removing jobs from the queues.

This asynchronous processing makes faxq quite capable. However, it also makes it somewhat "riskier" to get things right, because the programmer/developer must constantly be aware that at any moment, at any point in any processing path, an event (FIFO activity, signals, etc.) can be received. The event quite literally interrupts faxq wherever it happens to be: faxq goes off to handle the event and, when done, returns to where it was interrupted. The programmer does have tools available to ignore events temporarily, but unfortunately faxq largely does not make use of them. So let me illustrate the risk as it pertains to the error message you show us. Say we have some code that looks like this...

   if ( job is in a QLink list ) {
       do some logging or whatever
       remove job from QLink list
       perform operations on job
   }

The risk is that if faxq has event handlers that remove jobs from lists (and faxq does), and if the triggers for event handling are not ignored (as is largely the case with faxq), then there is no guarantee that the "if" condition (job is on a QLink list) will still be true at every execution point within the "if" statement. Indeed, by the time faxq goes to remove the job from the QLink list, an event handler may already have removed it - in which case you would get the assertion error that you see. So we have a whole slew of race conditions in faxq that are there due to the nature of the asynchronous design and a lack of conscientious control of the risk involved in it.
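To make the race concrete, here is a minimal sketch in C++. It is not faxq's code: `Link`, `racyRemoveWouldAssert`, and `guardedRemove` are hypothetical names, and the interrupting event handler is modelled as a plain callback invoked between the check and the removal. It only shows how a check made at the top of the "if" can be stale by the time the removal runs, and how re-checking at the moment of removal avoids that.

```cpp
#include <cassert>
#include <functional>

// Hypothetical stand-in for an intrusive doubly linked list node.
// A node that is not on any list points at itself.
struct Link {
    Link* next;
    Link* prev;
    Link() : next(this), prev(this) {}

    bool isOnList() const { return next != this; }

    // Link this node in directly after `head`.
    void insertAfter(Link& head) {
        next = head.next;
        prev = &head;
        head.next->prev = this;
        head.next = this;
    }

    // Unlink if linked; report whether anything was removed.
    // Unlike an asserting remove(), this tolerates "not on a list".
    bool removeIfOnList() {
        if (!isOnList()) return false;
        next->prev = prev;
        prev->next = next;
        next = prev = this;
        return true;
    }
};

// Check-then-act, as in the pseudocode above. `interrupt` stands in for
// an event handler that happens to run between the check and the removal.
// Returns true if an asserting remove() would have fired.
bool racyRemoveWouldAssert(Link& job, const std::function<void()>& interrupt) {
    if (job.isOnList()) {
        interrupt();              // "do some logging": an event is handled here
        if (!job.isOnList())
            return true;          // the earlier check is stale at this point
        job.removeIfOnList();
    }
    return false;
}

// Same path, but the removal itself re-checks the list membership.
// Returns true only if this call actually removed the job.
bool guardedRemove(Link& job, const std::function<void()>& interrupt) {
    if (job.isOnList()) {
        interrupt();
        return job.removeIfOnList();
    }
    return false;
}
```

Calling `racyRemoveWouldAssert(job, handler)` with a handler that itself removes `job` reproduces the "item not on a list" situation. `guardedRemove` simply reports that there was nothing left to remove, which the caller then has to treat as "someone else already dealt with this job".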

On lower-volume systems faxq spends the vast majority of its time "idle" - just sitting and waiting for something to do. As long as the events come in at a pace slow enough that faxq can handle each event completely before another event arrives - especially another event dealing with the same job - all will be fine. However, as you increase the event activity to a pace where - even only sometimes - faxq is likely to receive an event *while handling* previous events, and even worse, where faxq is likely to receive an event for a job for which it is already handling an event... well, that's when things can break down.
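For signals specifically, one of the standard tools for temporarily ignoring events is to block them around the sensitive section with POSIX `sigprocmask()`. The sketch below is an illustration only, not what faxq does: `SignalBlocker`, `onEvent`, and `eventsSeenInsideCriticalSection` are hypothetical names, SIGUSR1 is an arbitrary choice of signal, and the example assumes a single-threaded process. It does not cover FIFO activity, which would need a different mechanism.

```cpp
#include <csignal>
#include <signal.h>

// Counts how many times the (hypothetical) event handler has run.
static volatile sig_atomic_t g_events = 0;

extern "C" void onEvent(int) { g_events = g_events + 1; }

// RAII guard: blocks one signal on construction and restores the
// previous signal mask on destruction.
class SignalBlocker {
public:
    explicit SignalBlocker(int signo) {
        sigset_t block;
        sigemptyset(&block);
        sigaddset(&block, signo);
        sigprocmask(SIG_BLOCK, &block, &saved_);
    }
    ~SignalBlocker() { sigprocmask(SIG_SETMASK, &saved_, nullptr); }

    SignalBlocker(const SignalBlocker&) = delete;
    SignalBlocker& operator=(const SignalBlocker&) = delete;

private:
    sigset_t saved_;
};

// Raises the signal in the middle of a guarded section and returns how
// many events the handler had processed while the guard was active.
int eventsSeenInsideCriticalSection() {
    std::signal(SIGUSR1, onEvent);
    g_events = 0;
    int seenInside;
    {
        SignalBlocker guard(SIGUSR1);
        raise(SIGUSR1);          // event arrives mid-operation and stays pending
        seenInside = g_events;   // handler has not run yet
    }                            // mask restored; the pending signal is delivered
    return seenInside;
}
```

The point of the guard is that the check and the removal from the earlier pseudocode could both sit inside its scope, so that a signal-driven handler cannot run between them. The event is not lost, only deferred until the guard goes out of scope.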

Now, your post didn't tell us which event led to the problem. Indeed, your ServerTracing was probably not high enough to say, anyway. But the events can be as innocent as a job's time-to-send arriving, a job's killtime expiring, or a user suspending a job.

In the last several HylaFAX+ releases (since about 4.3.0.3) I've endeavored to address some of these things. As I'm not the original author of the software, it has taken me some time to come to visualize the code design behind faxq. As I come to see that design better, and as I see problems occur, I try to improve things. I doubt that I've taken care of all of the problems - doing that will require a fair amount of time in code review. I made some changes in 5.0.1 that *should* prevent the QLink error you quote from ever happening... but that code has been running on my production servers for a relatively short amount of time, and sometimes it takes months for me to see a faxq error like this crop up. If you're really seeing this happen every few hours then I would truly like to work closely with you to see if we can't make a very thorough "fix" to the more general problem.

Let me know.

Thanks,

Lee.

