Re: [hylafax-users] faxq crashing: QLink::remove
Gavin White wrote:
> I'm using hylafax-4.3.0-2rhel4 on a stock RHEL4 install. I'm sending
> five or ten faxes through a minute, and things are generally ok.
> However, every few hours, the faxq scheduler dies unexpectedly. I have
> turned up all the tracing, and so far the only clue I have is the
> following syslog line, which appears at about the time the scheduler
> crashes:
> FaxQueuer[14682]: Assertion failed "QLink::remove: item not on a list", file "QLink.c++" line 53.
> Does anyone know what this could be? What information would be
> required to investigate further?
Please allow me to respond to this post again, this time with more
detail and explanation now that I've researched the matter further.
faxq works in a largely asynchronous manner. When "idle", faxq just sits
and waits for events such as FIFO activity or signals. On receiving one,
it runs through some procedure to address the event: perhaps starting
fax job preparation, perhaps launching faxsend, perhaps running through
the outbound queue to "check" on things that may have changed, perhaps
removing jobs from the queues.
This asynchronous processing makes faxq quite capable. However, it is
also somewhat more "risky" to get right, because the
programmer/developer must constantly be aware that at any point in any
processing path an event can be received (FIFO activity, signals, etc.).
That event quite literally interrupts faxq wherever it happens to be;
faxq goes to handle the event and, when done, returns to where it was
interrupted. Now the programmer does have tools available to ignore
events temporarily, but unfortunately faxq largely does not make use of
them. So let me illustrate the risk as it pertains to the error message
you show us.
Say we have some code that looks like this...
    if ( job is in a QLink list ) {
        do some logging or whatever
        remove job from QLink list
        perform operations on job
    }
The risk is that if faxq has event handlers that remove jobs from lists
(and faxq does), and if event-handling-triggers are not ignored (as is
largely the case with faxq), then there is no guarantee that the if
condition (job is on a QLink list) is going to always be true at every
execution point within the "if" statement. Indeed, when faxq goes to
remove the job from the QLink list it may have already been removed by
an event handler - in which case you would get the assertion error that
you see. So we have a whole slew of race conditions in faxq that are
there due to the nature of the asynchronous design and a lack of
conscientious control of the risk involved in it.
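The check-then-remove hazard described above can be sketched in C++. This is only an illustration under stated assumptions: `Job`, `JobList`, and `processJob` are hypothetical stand-ins, not HylaFAX's actual QLink or Job classes, and the `pendingEvent` callback simply simulates an event handler that happens to run between the membership check and the removal. The sketch re-validates at the moment of removal and backs out instead of asserting.

```cpp
#include <algorithm>
#include <cassert>
#include <functional>
#include <list>

// Hypothetical stand-ins for illustration; not HylaFAX's QLink or Job classes.
struct Job { int id; };

class JobList {
public:
    void insert(Job* j) {
        assert(j != nullptr);
        items.push_back(j);
    }
    bool contains(const Job* j) const {
        return std::find(items.begin(), items.end(), j) != items.end();
    }
    // Reports failure instead of asserting when the job is no longer listed.
    bool remove(Job* j) {
        auto it = std::find(items.begin(), items.end(), j);
        if (it == items.end()) return false;
        items.erase(it);
        return true;
    }
private:
    std::list<Job*> items;
};

// pendingEvent simulates an event handler (for example a killtime expiry)
// that runs in the middle of this procedure.
bool processJob(JobList& q, Job* job, const std::function<void()>& pendingEvent) {
    if (!q.contains(job)) return false;   // the original "if" condition
    pendingEvent();                       // logging or similar: an event may be handled here
    if (!q.remove(job)) return false;     // the condition may no longer hold; back out
    // ... perform operations on job ...
    return true;
}
```

Calling `processJob` with a callback that itself removes the job makes the second check fail, which is the situation that would otherwise trip the "item not on a list" assertion.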
On lower-volume systems faxq spends the vast majority of its time "idle"
- just sitting and waiting for something to do. And as long as the
events come in at a slow-enough pace where faxq can handle each event
completely before another event arrives - especially another event
dealing with the same job - then all will be fine. However, as you
increase the event activity to a pace where - even sometimes - faxq is
likely to receive an event *while handling* previous events, and even
worse, when faxq is likely to receive an event for a job for which it is
already handling an event... well, that's when things can break down.
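One general way to keep an event from being handled while a previous one is still in progress is to queue it and run it afterwards. The sketch below shows that idea only; `EventDispatcher` and `post` are hypothetical names, this is not how faxq is structured, and it covers only events delivered through the dispatcher within the process (it does nothing about real asynchronous signals, which would need to be blocked or deferred by other means).

```cpp
#include <cassert>
#include <deque>
#include <functional>
#include <utility>

// Hypothetical dispatcher for illustration; not faxq's actual design.
class EventDispatcher {
public:
    using Handler = std::function<void()>;

    // Queue the event. If a handler is already running, return at once and
    // let the outer loop run this event after the current handler finishes.
    void post(Handler h) {
        assert(h != nullptr && "empty handler");
        pending.push_back(std::move(h));
        if (running) return;
        running = true;
        while (!pending.empty()) {
            Handler next = std::move(pending.front());
            pending.pop_front();
            next();               // may call post() again; that call only queues
        }
        running = false;
    }

private:
    std::deque<Handler> pending;  // events waiting for their turn
    bool running = false;         // true while a handler is executing
};
```

With this arrangement each handler runs to completion before the next one starts, so a handler never sees a job list change underneath it because of another queued event. The sketch is not exception-safe: a handler that throws would leave `running` set.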
Now your post didn't provide us with the specifics of what event
happened leading to the problem occurring. Indeed, your ServerTracing
was probably not high enough to say, anyway. But the events can be as
innocent as a job's time-to-send arriving, a job's killtime expiring, or
a user suspending a job.
In the last number of HylaFAX+ releases (since about 4.3.0.3) I've
endeavored to address some of these things. As I'm not the original
author of the software it has taken me some time to come to visualize
the code design behind faxq. And as I do come to see that design
better, and as I see problems occur, I try to improve things. I doubt
that I've taken care of all of the problems - doing that will require a
fair amount of time in code review. I made some changes in 5.0.1 that
*should* prevent the QLink error you quote from ever happening... but
that code has been running on my production servers for a relatively
short amount of time - and sometimes it takes months for me to see a
faxq error like this crop up. If you're really seeing this happen every
few hours then I would truly like to work closely with you to see if we
can't make a very thorough "fix" to the more general problem.
Let me know.
Thanks,
Lee.