HylaFAX: The world's most advanced open source fax server
Re: [hylafax-users] Document conversion failed
Darren Nickerson wrote:
"Lee Howard" <faxguy@xxxxxxxxxxxxxxxx> wrote:
Darren Nickerson wrote:
Whereas HylaFAX used to bog down with faxq consuming 99% CPU and
jobs submitting _VERY_ slowly, it should now be pretty zippy.
I can't say that I recall the last time I saw faxq consume 99% CPU
(or anything close to even half of it)... and I routinely stuff
between 500 and 1000 jobs into the outbound queue... and this is without
the major refactoring.
I guess maybe 500 or 1000 jobs isn't sufficiently large to uncover the
problem these days. 10,000 jobs is not a contrived scenario - we work
with a lot of people who submit that many jobs to their servers in one go.
Of course, 10,000 jobs should be no sweat.
I first reported the problem in 2002 because it was even worse back
then:
http://bugs.hylafax.org/show_bug.cgi?id=344
Yes, it was very bad back then.
Although there was some improvement from the work in bug 667, we
really just moved the goalposts a bit.
In my estimation, the collateral work done in 667 was a major
improvement that vastly improved performance. Yes, it was by no means the
end-all solution to the problem, but it was significant enough at the
time for you to close 344 because of it. Certainly I would not characterize
post-667 performance as "bogged down" or "submitting _VERY_ slowly".
Indeed, there have been performance improvements yet later in
HylaFAX+... and undoubtedly there are more yet to come.
There was still a fundamental problem queueing jobs onto long queues,
and a more comprehensive solution was needed. It really was an
algorithmic problem... we were limited by the way faxq was written,
and unless things were restructured we'd only be able to achieve
incremental improvements.
What, exactly, was that fundamental algorithmic problem?
(I have left the long quote below untrimmed because it gives context
for my comments that follow it...)
Here are some rough numbers (hh:mm:ss) for the time taken to queue
10,000 jobs in one go:
HylaFAX+-5.1.9: 07:03:58
HylaFAX-4.3.5: 03:06:18
HylaFAX-4.4 CVS: 00:02:52
That's 7 hours for HylaFAX+, 3 hours for HylaFAX-4.3.x, and about 3
minutes for HylaFAX-4.4 CVS (commit
2d0b444c473ce4d7b0e5bd8da14725e2923f16e0).
The command we ran was:
sendfax -h test@localhost -n -t 1 -T 1 -z diallist-10000.txt somefile.pdf
and the diallist file contained 10,000 unique numbers.
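For comparison, the reported times can be turned into throughput. This short Python sketch only does arithmetic on the figures quoted above (10,000 jobs, times in hh:mm:ss); the helper names `to_seconds` and `jobs_per_second` are made up for illustration, and nothing is re-measured.

```python
# Arithmetic on the queueing times reported above; nothing here is re-measured.
JOBS = 10_000

reported = {
    "HylaFAX+-5.1.9": "07:03:58",
    "HylaFAX-4.3.5": "03:06:18",
    "HylaFAX-4.4 CVS": "00:02:52",
}


def to_seconds(hms):
    """Parse an 'hh:mm:ss' string into a total number of seconds."""
    h, m, s = (int(part) for part in hms.split(":"))
    return h * 3600 + m * 60 + s


def jobs_per_second(hms, jobs=JOBS):
    """Average submission rate for `jobs` jobs queued in `hms` wall-clock time."""
    return jobs / to_seconds(hms)


if __name__ == "__main__":
    for version, hms in reported.items():
        print(f"{version:16} {to_seconds(hms):6d} s  {jobs_per_second(hms):6.2f} jobs/s")
```

On those figures, 4.4 CVS queues roughly 58 jobs per second, against well under one job per second for the other two builds.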
Some settings from the faxq config:
LogFacility: local1
CountryCode: 1
AreaCode: 415
LongDistancePrefix: 1
InternationalPrefix: 011
DialStringRules: etc/dialrules
SendFaxCmd: /usr/sbin/faxsend
Use2D: yes
DestControls: etc/destcontrols
PollLockWait: 1
MaxConcurrentCalls: 100
MaxBatchJobs: 2
MaxConcurrentCalls: 1
NotifyCmd: /bin/true
ServerTracing: 0
I know that with benchmarks the devil is in the details, and so I'm
the first to consider them suspect. These may be wrong. This was on a
dual AMD Athlon box with a RAID5 array (which was running in degraded
mode due to a failed disk, so I/O performance was reduced). We can repeat
the test and gather more details if anyone cares. I find the above
numbers a little suspect myself, but we repeated the tests twice. I'm
particularly concerned by the slowness of HylaFAX+ versus 4.3.5 - that
would seem unlikely to me, and should probably be reconfirmed. But we
definitely did expect 4.4.x to sing compared to 4.3.x.
How about you run similar tests, show us the results you get, and if
they're dramatically different we can dig into what we may have done
wrong here on our side?
Well, I'm testing with a slightly less-beefy machine here. :-) It's a
PII-400 with 224MB RAM and a rather slow QUANTUM BIGFOOT TS8.4A, ATA
DISK drive. I'm running a pretty much stock, updated Fedora Core 2 on it,
and I've not messed with any hardware tuning at all.
The config file that I started with looked like this:
LogFacility: daemon
CountryCode: 1
AreaCode: 360
LongDistancePrefix: 1
InternationalPrefix: 011
DialStringRules: etc/dialrules
ServerTracing: 0
MaxBatchJobs: 2
NotifyCmd: /bin/true
Also, in hfaxd.conf I likewise have ServerTracing: 0. And although the
system has modems, I have disabled all of them (by stopping all
faxgettys) because, as I'll discuss later, they would be a significant
factor. You also have an obsolete DestControls and two conflicting
MaxConcurrentCalls settings in your config; I have ignored those.
In my test I am using the command "echo test | sendfax -n -t 1 -T 1 -z
/tmp/numbers > /dev/null". The /tmp/numbers file contains the numbers 1
through 10000 on separate lines. I'm running current HylaFAX+ CVS code
(marked as 5.1.9), although nothing has changed in HylaFAX+ for a long
time that would affect this.
With MaxBatchJobs: 2 the command completes in 25:47. That's a little
less than 26 minutes... on my more-than-antiquated, poorly endowed dev
system.
With MaxBatchJobs: 1 (which is probably where it would be set for a
10,000 job broadcast anyway) the command completes in 17:07. That's
roughly 10 jobs per second. Watching CPU usage of the faxq process
during that time shows it usually around 10-20% CPU... although there
are very brief peaks above that.
Furthermore, an astute user would probably not submit 10,000 jobs for
immediate transmission. Why? Well, notice that when I use the command:
echo test | sendfax -a "00:00 Jan 1" -n -t 1 -T 1 -z /tmp/numbers > /dev/null
... the command completes in 11:20, taking about 33% less time, or nearly
15 jobs per second on my lousy 9-year-old dev box. I'm just guessing, but
if I were to do the same on a modern server-grade system (64-bit, SMP,
3GHz CPU; 2GB RAM; SATA hard drives), I'd expect that time
to be noticeably less than 2 minutes.
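The timings above can be compared the same way. This Python sketch, with hypothetical helper names `mmss_to_seconds` and `compare`, converts each mm:ss time for the 10,000-job runs into seconds and jobs per second. It also shows that going from 17:07 to 11:20 saves roughly a third of the time, which works out to about 1.5 times the throughput.

```python
# Arithmetic on the 10,000-job timings reported above for the PII-400 box.
JOBS = 10_000

runs = {
    "MaxBatchJobs: 2": "25:47",
    "MaxBatchJobs: 1": "17:07",
    "MaxBatchJobs: 1, -a future time": "11:20",
}


def mmss_to_seconds(mmss):
    """Parse an 'mm:ss' string into seconds."""
    minutes, seconds = (int(p) for p in mmss.split(":"))
    return minutes * 60 + seconds


def compare(before, after):
    """Return (fraction of wall-clock time saved, throughput ratio) from before to after."""
    t0, t1 = mmss_to_seconds(before), mmss_to_seconds(after)
    return 1 - t1 / t0, t0 / t1


if __name__ == "__main__":
    for label, mmss in runs.items():
        secs = mmss_to_seconds(mmss)
        print(f"{label:32} {secs:5d} s  {JOBS / secs:5.2f} jobs/s")
    saved, ratio = compare("17:07", "11:20")
    print(f"time saved: {saved:.1%}, throughput x{ratio:.2f}")
```

"Percent faster" can mean either of the two numbers `compare` returns, so it helps to say which one is meant.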
faxq is event-driven. So as you reduce the overall effect of any
faxq-related event, you will speed up functions that repeatedly trigger
events... such as submitting jobs. This is why the future time-to-send
has an effect on performance... because those jobs then aren't actively
triggering queue events while the job group is being submitted.
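The event-driven point can be made concrete with a deliberately tiny model. This is a hypothetical Python sketch, not faxq's actual code or event names. Each submission becomes a `submit` event. A job that is due now also posts a `schedule` event, standing in for whatever scheduling work the real server would do. A job with a future send time only arms a `timer` event, which stays queued until later.

```python
import heapq
import itertools


class ToyQueue:
    """Hypothetical toy model of an event-driven queue manager (not faxq's code)."""

    def __init__(self, now=0):
        self.now = now
        self._events = []               # heap of (due_time, seq, kind, job_id)
        self._seq = itertools.count()   # unique tie-breaker, so kinds are never compared
        self._send_at = {}              # job_id -> requested send time
        self.handled = {}               # event kind -> number of events handled

    def _post(self, due, kind, job_id):
        heapq.heappush(self._events, (due, next(self._seq), kind, job_id))

    def submit(self, job_id, send_at):
        """A client submission becomes a 'submit' event that is due immediately."""
        self._send_at[job_id] = send_at
        self._post(self.now, "submit", job_id)

    def run_pending(self):
        """Handle every event due at or before self.now; later events stay queued."""
        while self._events and self._events[0][0] <= self.now:
            _, _, kind, job_id = heapq.heappop(self._events)
            self.handled[kind] = self.handled.get(kind, 0) + 1
            if kind == "submit":
                send_at = self._send_at[job_id]
                if send_at <= self.now:
                    # Due now: ask the scheduler to try to start the job right away.
                    self._post(self.now, "schedule", job_id)
                else:
                    # Future send time: only arm a timer; nothing else happens now.
                    self._post(send_at, "timer", job_id)


def events_during_submission(n_jobs, send_at):
    """Submit n_jobs with the same send time and count events handled meanwhile."""
    q = ToyQueue(now=0)
    for job_id in range(n_jobs):
        q.submit(job_id, send_at)
    q.run_pending()
    return q.handled
```

In this model, 1,000 immediate jobs cause 2,000 events during submission, while the same jobs with a future send time cause only the 1,000 `submit` events. In a real server the deferred work is presumably where the expensive parts would sit.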
Similarly, if faxgettys were running then a whole host of other things
become involved including document conversions, modem initializations,
faxsend calls and return handling, etc. And, depending on the
prioritization that we want to give to those events they will have an
effect on overall performance of other events.
A lot of this comes down to how we interact with the client's
instructions. If a client submits a job... do we address it
immediately? And if we do, then there will be a cost that gets paid by
that user or by any other user on the system at the time. If we don't
address it immediately then how long do we wait before we taint the
perception of responsiveness in that respect? Certainly, I suppose one
could code the system to inhale and swallow jobs just as fast as
possible... at the cost of everything else... but I think that
there is a certain balance that needs to be achieved between the various
users and functions of the server.
... anyway... I'm not sure what it is that happened in your tests to
cause the job submission to run so slowly, but I hope that I've given
you some useful pointers.
And, in the end, I'm not really a big fan of how broadcast jobs are
handled anyway. I mean, really, do we truly want 10,000+ identical
files (or links) in the docq? Do we really, truly want to fire-off
10,000+ job notifications to the user (or inhibit job notifications
altogether)? I'm more and more of the opinion that for broadcast jobs
there should be but one copy of the source document in the docq, but one
sendq file, and but one notification message when the entire job group
is finished. I think that entire job group submission time would then
maybe take a second. :-)
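The single-record broadcast idea could look something like the following. This is a hypothetical Python sketch, not an existing HylaFAX or HylaFAX+ file format. It keeps one reference to the source document and one record holding the status of every destination. It produces exactly one summary, when the last destination finishes.

```python
from dataclasses import dataclass, field
from enum import Enum


class Status(Enum):
    PENDING = "pending"
    DONE = "done"
    FAILED = "failed"


@dataclass
class BroadcastJob:
    """Hypothetical single record for a whole broadcast (illustration only)."""
    document: str                          # one copy of the source document
    destinations: list                     # every number in the broadcast
    status: dict = field(init=False)       # number -> Status
    pending: int = field(init=False)       # destinations not yet finished
    notified: bool = field(init=False, default=False)

    def __post_init__(self):
        self.status = {number: Status.PENDING for number in self.destinations}
        self.pending = len(self.status)

    def record_result(self, number, ok):
        """Record one destination's outcome; return a summary once, when all are done."""
        if self.status[number] is not Status.PENDING:
            return None                    # ignore duplicate reports
        self.status[number] = Status.DONE if ok else Status.FAILED
        self.pending -= 1
        if self.pending or self.notified:
            return None
        self.notified = True
        done = sum(s is Status.DONE for s in self.status.values())
        return f"broadcast of {self.document}: {done} sent, {len(self.status) - done} failed"
```

The `pending` counter makes each result update constant-time, so a 10,000-destination broadcast never rescans every destination on each completion.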
Thanks,
Lee.
____________________ HylaFAX(tm) Users Mailing List _______________________