19 December 2007

Canon ImageRunner and Ubuntu

We have moved away from individual departmental printers to a centralised printer/photocopier/scanner device. We chose Canon's ImageRunner 3300 as our solution. Installing the printer drivers for Windows was easy. Installing it on Ubuntu was even easier - no drivers necessary.

1) Select System / Administration / Printing.
2) Click on "New Printer". It takes about 10 seconds for the next screen to appear, as it scans the network for available printers. The printer I wanted was the Canon iR3300.
3) Clicking "Forward" displays the available drivers.

The Manufacturer was automatically selected, but the model was not. There was an entry for imageRunner 330s, which I selected.
4) Clicking "Forward" again brings up a screen for descriptive information about the printer.

It's not entirely necessary to fill it in...
5) Finally, when the "Apply" button is clicked, the new printer is displayed.

And the printer is immediately available! The test print worked, and a spreadsheet from OpenOffice.org Calc printed fine. Duplex (double sided) printing also works.

So that was relatively straightforward!

However, now I have to figure out how to get the network scanning feature to work ...

yk.

How to make nice looking diffs

I was wondering yesterday how to make nice looking diff patch files, as plain "diff" gave really cryptic output that was not very user friendly. I IM'ed Aizat, who happened to be online in Chile. He just said to use "svn diff". I told him that I was working on files local to my machine, so svn was not appropriate.

Googling didn't help much. So I just submitted the standard diff output as my patch.

Then this morning, Ow had a blog post about his patch, and he included his command line. The answer is "diff -Nau"!
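
For anyone who wants the same unified format from inside a script, Python's standard difflib module produces comparable output. This is only a sketch: the helper name and the file contents below are made up for illustration, and the header lines it emits do not carry the timestamps that "diff -Nau" prints.

```python
import difflib

def unified_patch(old_text, new_text, old_name, new_name, context=3):
    """Return a unified diff (in the style of `diff -u`) between two strings."""
    old_lines = old_text.splitlines(keepends=True)
    new_lines = new_text.splitlines(keepends=True)
    return "".join(
        difflib.unified_diff(old_lines, new_lines,
                             fromfile=old_name, tofile=new_name, n=context)
    )

# Hypothetical file contents, for illustration only.
old = "verbose = 0\nwarn_duplicates = 0\n"
new = "verbose = 0\nwarn_duplicates = 0\nspam_confidence = 0.00\n"
patch = unified_patch(old, new, "archivemail", "archivemail-dspam")
print(patch)
```

The "-u" part of the command is what gives the "---", "+++" and "@@" lines; difflib.unified_diff is the closest standard-library equivalent.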

So here is the patch for the archivemail-dspam script:

yky@x1407:~/dspam$ diff -Nau archivemail archivemail-dspam
--- archivemail 2007-12-18 19:13:34.000000000 +0800
+++ archivemail-dspam 2007-12-18 19:02:47.000000000 +0800
@@ -187,6 +187,8 @@
min_size = None
verbose = 0
warn_duplicates = 0
+ """ 071218 yky DSPAM-Confidence setting """
+ spam_confidence = 0.00

def parse_args(self, args, usage):
"""Set our runtime options from the command-line arguments.
@@ -206,7 +208,7 @@
"filter-append=", "pwfile=", "dont-mangle",
"archive-name=",
"preserve-unread", "quiet", "size=", "suffix=",
- "verbose", "version", "warn-duplicate"])
+ "verbose", "version", "warn-duplicate", "spam=" ])
except getopt.error, msg:
user_error(msg)

@@ -256,6 +258,8 @@
self.verbose = 1
if o == '--archive-name':
self.archive_name = a;
+ if o == '--spam':
+ self.spam_confidence = float(a)
if o in ('-V', '--version'):
print __version__ + "\n\n" + __copyright__
sys.exit(0)
@@ -265,7 +269,7 @@
"""Complain bitterly about our options now rather than later"""
if self.output_dir:
check_sane_destdir(self.output_dir)
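- if self.days_old_max < 1:
+ if self.days_old_max < 0:
user_error("--days argument must be greater than zero")
if self.days_old_max >= 10000:
user_error("--days argument must be less than 10000")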
@@ -661,6 +665,7 @@
--include-flagged messages flagged important can also be archived
--no-compress do not compress archives with gzip
--warn-duplicate warn about duplicate Message-IDs in the same mailbox
+ --spam=FLOAT SPAM Confidence levels ( e.g. 0.80 )
-v, --verbose report lots of extra debugging information
-q, --quiet quiet mode - print no statistics (suitable for crontab)
-V, --version display version information
@@ -737,6 +742,22 @@
mbox_from = "From %s %s\n" % (address, date_string)
return mbox_from

+
+def get_spam_confidence(message):
+ """Returns the DSPAM_Confidence from the message headers. Zero by default"""
+ """ 071218 yky Created """
+
+ assert(message != None)
+
+ for header in ('X-DSPAM-Confidence', 'SPAM-Confidence'):
+ confidence = message.get(header)
+ if confidence:
+ confidence_val = float( confidence )
+ if confidence_val:
+ vprint("Spam Confidence: %f " % confidence_val)
+ return confidence_val
+
+ return 0.0

def guess_return_path(message):
"""Return a guess at the Return Path address of an rfc822 message"""
@@ -987,6 +1008,11 @@
return 0
if options.preserve_unread and is_unread(message):
return 0
+
+ # 071218 yky Filtering by SPAM Confidence
+ if (options.spam_confidence > 0) and (options.spam_confidence > get_spam_confidence(message)):
+ return 0
+
return 1


@@ -1019,7 +1045,7 @@
max_days -- maximum number of days before message is considered old

"""
- assert(max_days >= 1)
+ assert(max_days >= 0)

time_now = time.time()
if time_message > time_now:
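
The option-handling hunks of the patch can be tried in isolation. The following is a minimal sketch in current Python, not archivemail's actual parser: the parse_spam_option helper, its reduced option list and the quarantine.mbox file name are hypothetical, and only the getopt calls themselves come from the standard library.

```python
import getopt

def parse_spam_option(argv):
    """Parse a reduced, hypothetical option list; return (spam_confidence, args)."""
    spam_confidence = 0.0  # same default as the patch: 0 means "no spam filtering"
    # "spam=" declares a long option that requires a value, as in the patch.
    opts, args = getopt.getopt(argv, "v", ["verbose", "spam="])
    for o, a in opts:
        if o == "--spam":
            spam_confidence = float(a)
    return spam_confidence, args

confidence, rest = parse_spam_option(["--spam=0.80", "quarantine.mbox"])
print(confidence, rest)
```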


Thanks Ow!

yk.

18 December 2007

Making Archivemail work with DSpam

I've got a dspam "appliance" which the enterprise emails filter through. I've set it up so that only one dspam user is used to filter all the emails. This has worked well over the past few years, but managing it has been quite a chore. Every morning, I'd have to wade through the emails in the quarantine (about 15K), and free up any False Positives which were caught.

Anything at or above 58% spam confidence, as reported by DSpam, is pretty much always spam. Below that, between 47% and 57%, there may be one or two False Positives.

After freeing them up, deleting the remaining emails is a huge chore, because the DSpam UI will not allow deleting the quarantine file when new spam pops in.

So I needed a little program which would scan the quarantine mbox file and delete off any messages which are 58% or higher spam confidence.
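
As a rough illustration of what such a program has to do, here is a sketch using Python's standard mailbox module. It is not the approach taken in this post (that was a patched archivemail, described below). The function name, the default threshold and the assumption that the confidence sits in an X-DSPAM-Confidence header as a plain decimal number are all illustrative.

```python
import mailbox

def purge_spam(mbox_path, threshold=0.58, header="X-DSPAM-Confidence"):
    """Delete messages whose confidence header is >= threshold; return the count."""
    box = mailbox.mbox(mbox_path)
    removed = 0
    box.lock()  # the quarantine file may be written to while we work
    try:
        for key in list(box.keys()):
            raw = box[key].get(header)
            try:
                confidence = float(raw) if raw is not None else 0.0
            except ValueError:
                confidence = 0.0  # unparseable header: leave the message alone
            if confidence >= threshold:
                box.remove(key)
                removed += 1
        box.flush()  # rewrite the mbox file without the removed messages
    finally:
        box.unlock()
        box.close()
    return removed
```

Messages without the header are treated as 0.0 confidence and are therefore kept, which errs on the side of not deleting real mail.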

I tried the most obvious program, 'archivemail', which was readily available in all distros, but was disappointed that it only allowed filtering on the message's age. There was a mysterious "Filter" switch, but it only applied to IMAP mailboxes.

The great thing is that archivemail, like the entire emailing stack on my servers, is completely Free Software. I just had to invest some time to look at the code. archivemail lived in /usr/bin/. I had a look at the file, and it's a very small 1500-line Python script!

I haven't programmed in python before, but looking at the code, it didn't look too scary. It had classes, but no colons. Indentation seemed to be important here. I scanned the code, and I found the little function called "should_archive(message)". And sure enough, the crux of the logic which defines whether a message is to be archived away or not, was there.

So I added these lines:
    if (options.spam_confidence > 0) and (options.spam_confidence > get_spam_confidence(message)):
        return 0
I then modified the options class to include the spam_confidence field, made some changes to the code that reads the command line options, and created the function which extracts the spam confidence from the message headers. Doing this was relatively easy, because the rest of the code basically does the same things: reading values off the headers and using them. My new function looked like this:

def get_spam_confidence(message):
    """Returns the DSPAM_Confidence from the message headers. Zero by default"""
    """ 071218 yky Created """

    assert(message != None)

    for header in ('X-DSPAM-Confidence', 'SPAM-Confidence'):
        confidence = message.get(header)
        if confidence:
            confidence_val = float( confidence )
            if confidence_val:
                vprint("Spam Confidence: %f " % confidence_val)
                return confidence_val

    return 0.0
That's it!
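
The function above is written for the Python 2 of the day and relies on archivemail's own vprint helper. A standalone sketch for current Python, assuming the header holds a plain decimal number, might look like the following. passes_spam_filter is a hypothetical name for the check added to should_archive, and the ValueError handling is an addition that the original patch does not have.

```python
from email.message import Message

def get_spam_confidence(message):
    """Return the DSPAM confidence from the message headers, or 0.0 if absent."""
    for header in ("X-DSPAM-Confidence", "SPAM-Confidence"):
        raw = message.get(header)
        if raw:
            try:
                value = float(raw)
            except ValueError:
                continue  # malformed header: try the next one
            if value:
                return value
    return 0.0

def passes_spam_filter(message, spam_confidence):
    """Mirror the patched check: a threshold of 0 disables spam filtering."""
    if spam_confidence > 0 and spam_confidence > get_spam_confidence(message):
        return False  # message is below the requested confidence: keep it
    return True

msg = Message()
msg["X-DSPAM-Confidence"] = "0.9312"
print(get_spam_confidence(msg), passes_spam_filter(msg, 0.58))
```

As in the patch, a message only remains a candidate for archiving or deletion when its confidence is at least the value given with --spam.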

I also set up some cron jobs to run against the quarantine file: kill spam at 88% and above every hour, kill spam at 58% and above after 3 days, and kill the rest once it is more than 14 days old.
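
Taken together, the cron jobs amount to a tiered retention policy. The sketch below restates it as a single Python function purely to make the rules explicit. The names are illustrative, and the exact boundaries (treating the 58% rule as "58% and above", and using "at least N days old" for the age checks) are assumptions rather than a description of the real cron entries.

```python
# (minimum confidence, minimum age in days) pairs, highest confidence first.
RETENTION_TIERS = [
    (0.88, 0),   # 88% and above: removed on the hourly run
    (0.58, 3),   # 58% and above: removed after 3 days
    (0.00, 14),  # everything else: removed after 14 days
]

def should_delete(confidence, age_days):
    """Return True if any tier's confidence and age thresholds are both met."""
    return any(confidence >= min_conf and age_days >= min_age
               for min_conf, min_age in RETENTION_TIERS)

print(should_delete(0.93, 0), should_delete(0.60, 1), should_delete(0.10, 20))
```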

I then followed up with my corporate responsibility duties, and submitted the patch back to the archivemail project on SourceForge. This didn't take me long, and it is worthwhile whether they accept it or not. At least the source is available online.

I hope this helps other dspam admins out there too!

yk.