KRIMP
Krimp is an algorithm that mines patterns using the MDL (Minimum Description Length) principle: the best set of patterns is the set that compresses the database best.
The implementation by Jilles Vreeken, Matthijs van Leeuwen and Koen Smets, accessible here, is used.
Below is an example of dataset conversion, pattern mining and the resulting output files, based on the chess dataset (which is included in Krimp).
When you try Krimp on the homework data, you can modify convertdb.conf and compress.conf, which are already configured to work well on the given data - but you still have to change some parameters (database names, folders, etc.) before you can use them.
1. Configure Krimp on your system
- Download Krimp from here.
- Modify datadir.conf so that the [dataDir] and [expDir] directives point to your data and experiment output directories, respectively. Please note that these should always be full paths, not relative ones!
- On a 32-bit x86 Windows system, try running krimp.exe (or krimp_x86.exe). If it raises an error or terminates without any message, you may have to install the Microsoft Visual C++ 2010 SP1 Redistributable Package. The x86 version is available for download here.
- On a 64-bit x64 Windows system, krimp_x64.exe should work. If it gives an error or terminates without any message, you may have to install the Microsoft Visual C++ 2010 SP1 Redistributable Package. The x64 version is available for download here.
- On a Linux system, try running krimp. If it does not work, compile your own executable from the included source (see compilation).
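For example, datadir.conf might contain something like the following (assuming the same name = value directive syntax used by the other configuration files shown below; the paths are placeholders for your own directories):

```
dataDir = /home/you/krimp/data/
expDir = /home/you/krimp/xps/
```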
2. Prepare your database
- Databases can be in either [dataDir] or [dataDir]\datasets.
- Krimp currently accepts only item data. Items should be non-negative integers (note the additional formatting requirements under "Important to note" below).
- The original data should be formatted so that each row is a transaction and each number on that row is an item present in that particular transaction (see, for example, chess.dat in the data/datasets folder, or the toy example further below).
- Modify analysedb.conf and run krimp analysedb. The output text file gives some general information about the database and the translation from the original item numbers to the numbers used in the converted database. The converted alphabet is ordered by support count, i.e. the number of transactions in which an item occurs; the value in brackets is the support, i.e. the proportion of transactions in which the item occurs. Each entry has the form [converted item]=>[original item]. For the chess dataset, chess.db.analysis.txt shows that:
- In the original data, item 58 has the largest support count (3195) and is converted to 0.
- In the original data, item 52 has the second-largest support count (3185) and is converted to 1.
- ...
- In the original data, item 1 has support count 1669 and is converted to 36.
- In the original data, item 2 has support count 1527 and is converted to 37.
- Krimp has its own database file format; files in this format generally end with .db. Modify convertdb.conf to suit your needs and run krimp convertdb from the command line.
- See the datasets directory for the converted chess.db. It holds the following information:
- mi:
- nR - number of rows
- nT - number of transactions
- nI - number of items = sum of row lengths
- aS - number of unique items i.e. the length of alphabet
- sS -
- mL - average row length
- b? - has bin sizes
- c? -
- tid? - has transaction IDs
- ab - the converted alphabet
- ac - support count of each converted alphabet item in the converted data (0 occurs in the converted data 3195 times, 1 occurs 3185 times, etc.)
- ar - support count of each original alphabet item in the original data (1 occurs in the original data 1669 times, 2 occurs 1527 times, etc.)
- it - the original alphabet in the converted alphabet order (58 is converted to 0, 51 is converted to 1, etc.)
- cd - the number of itemsets equals the average row length; items are from the converted alphabet (0 to 74). It is not yet clear what exactly this field shows...
- Then the transactions in the converted alphabet follow.
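For illustration, a tiny database in the required input format might look like this (a made-up example, not actual chess.dat content); each row is one transaction of space-separated item numbers:

```
0 1 2 3
0 2 4
1 3 4
0 1 2 3 4
```

Running krimp convertdb on such a file yields the corresponding .db file with the header fields listed above.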
Important to note
This section holds important information, gleaned from working with KRIMP, about formatting your data before mining itemsets. These notes are vital to the success of your experiment, as the application is clearly written for research purposes and can be buggy and unhelpful. Disregarding them can and likely will cause you frustration and waste your time.
- The items in your dataset must be labeled with consecutive integers that start from zero, with no gaps (e.g. BAD - 1,2,3,4,5; BAD - 0,1,2,4,5; GOOD - 0,1,2,3,4). Failing this gets you a segfault.
- You can set the dbTranslateForward parameter in convertdb.conf to true, which relabels your data to follow the convention mentioned in the previous point. However, this may or may not break your data: if item names, formatted as integers, are considered too large by KRIMP, it splits them at random places. If it does, you will not be notified, so you must check the output yourself.
- Remove all duplicate values from itemsets: every item may occur in an itemset only once. This holds for all FIM applications I have encountered, but the corresponding option in convertdb.conf is turned off by default and may not work when turned on. If your itemsets contain duplicates, you will get a segfault. A preprocessing sketch that enforces both conventions follows this list.
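Since Krimp gives no warning when these conventions are violated, it pays to enforce them yourself before conversion. Below is a minimal Python sketch (not part of Krimp; the file names are placeholders) that removes duplicate items within each transaction and relabels all items to consecutive integers starting from zero, writing the mapping so results can be translated back:

```python
# Enforce KRIMP's input conventions: no duplicate items within a
# transaction, and item labels 0, 1, 2, ... without gaps.
# File names are placeholders - adapt them to your data.

def preprocess(in_path, out_path, map_path):
    transactions = []
    with open(in_path) as f:
        for line in f:
            items = line.split()
            if items:
                # keep the first occurrence of each item, preserving order
                transactions.append(list(dict.fromkeys(items)))

    # assign new labels 0, 1, 2, ... in order of first appearance
    relabel = {}
    for t in transactions:
        for item in t:
            if item not in relabel:
                relabel[item] = len(relabel)

    with open(out_path, "w") as f:
        for t in transactions:
            f.write(" ".join(str(relabel[i]) for i in t) + "\n")

    # store the new=>original mapping for later reference
    with open(map_path, "w") as f:
        for orig, new in sorted(relabel.items(), key=lambda kv: kv[1]):
            f.write(f"{new}=>{orig}\n")

preprocess("mydata.dat", "mydata_clean.dat", "mydata_mapping.txt")
```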
3. Find those itemsets that matter
- Open compress.conf.
- The most important configuration directives are:
- iscName = chess-all-2500d - Determines which candidates are used. Format: [dbName]-[itemsetType]-[minSup][candidateOrder]
- dbName - chess, mushroom, ...
- itemsetType - all, closed
- minSup - absolute minimum support level
- candidateOrder - the standard order as described in the papers is d
The program first looks for existing item set collection files. If the correct file does not exist, candidates are automatically mined. Whether this file is kept after compression is determined by the following directive:
- iscIfMined = zap - Determines whether to keep the mined candidate file.
- zap - Do not keep file if created for this compression run.
- store - Store file in [dataDir]\candidates if created for this compression run.
- pruneStrategy = nop - Determines whether on-the-fly code table pruning is applied.
- nop - no pruning.
- pop - post-pruning, the regular pruning method described in the paper.
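Put together, the relevant part of compress.conf for the chess example could look as follows (showing only the directives discussed above; leave the other directives in the provided file untouched):

```
iscName = chess-all-2500d
iscIfMined = zap
pruneStrategy = pop
```

Then start the compression from the command line, analogous to the earlier steps (presumably krimp compress; check the included documentation if your version uses a different task name).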
Important to note
- KRIMP is able to use many cores, but it will also most likely eat a lot of memory, as it first mines the itemsets and only then proceeds to compression.
- SLIM should also be able to use many cores, but in practice I have so far not observed it doing so, whatever the settings. It eats a lot less memory, as compression is carried out in conjunction with itemset mining. On large datasets using SLIM is recommended, but gather your patience, as you will need it.
4. Inspect your results
- Compression results are stored in directory xps/compress. Each run gets its own directory, based on configuration and a timestamp.
- In such a directory, you'll find (amongst some unimportant files):
- compress-[timestamp].conf The configuration you used to run this particular experiment.
- ct-[candidateSet]-[pruneStrategy]-[timestamp]-[support]-[candNum].ct The code table at support level [support], written after processing [candNum] candidate item sets. In such a file, you'll find the complete code table. Except for the first two rows (which make up a header, with info on the number of code table elements and the number of singleton item sets), each row is a code table element. The pattern is represented by its space-separated items. Between brackets is some meta info: first the current count of this particular pattern in the cover, second the total number of occurrences of this pattern in the original database. (See the parsing sketch after this list.)
- report-[candidateSet]-[pruneStrategy]-[timestamp].csv A full report, summarizing what happened during the run. You'll find the following columns:
- minSup - Support level reported on.
- numCands - Number of candidates processed so far.
- numAlphUsed - Number of singleton item sets required in the cover.
- numSetsUsed - Number of non-singleton code table elements with non-zero count.
- numSets - Number of non-singleton code table elements. (Always equal to numSetsUsed when pruning is enabled.)
- countSum - Total number of times a code table element is used in the cover. (Used to determine code lengths.)
- dbSize - Encoded database size, in bits. (Without code table!)
- ctSize - Code table size, in bits.
- totalSize - dbSize + ctSize, in bits.
- seconds - Number of seconds it took to reach this point.
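If you want to post-process a code table programmatically, the .ct format described above is straightforward to parse. The following Python sketch assumes each non-header row looks like 12 5 40 (130,245), i.e. space-separated items followed by the cover count and total count in brackets; verify this against your own output files, as the exact layout may differ between versions:

```python
import re

def read_code_table(path):
    # Parse a KRIMP .ct file into (items, cover_count, total_count) tuples.
    # Assumes rows like '12 5 40 (130,245)' after a two-line header;
    # check against your own files, as the layout may differ per version.
    patterns = []
    with open(path) as f:
        lines = f.read().splitlines()
    for line in lines[2:]:  # skip the two header rows
        m = re.match(r"(.*)\((\d+),(\d+)\)", line)
        if not m:
            continue
        items = [int(i) for i in m.group(1).split()]
        patterns.append((items, int(m.group(2)), int(m.group(3))))
    return patterns

# Example usage; the file name is a placeholder.
for items, used, total in read_code_table("example.ct"):
    print(f"{items}: used {used} times in cover, {total} occurrences in db")
```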
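The report lends itself to the same treatment. A small sketch using Python's standard csv module (the file name is a placeholder, and the delimiter is assumed to be a comma; adjust it if your version of Krimp writes something else, e.g. semicolons):

```python
import csv

with open("report-chess-all-2500d-pop-20200101120000.csv", newline="") as f:
    rows = list(csv.DictReader(f))

# Print how the encoded sizes develop as the support level drops.
for row in rows:
    print(f"minSup={row['minSup']}: dbSize={row['dbSize']} bits, "
          f"ctSize={row['ctSize']} bits, totalSize={row['totalSize']} bits")

# The last row corresponds to the final code table at the lowest support.
print("final total size:", rows[-1]["totalSize"], "bits")
```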