java.lang.Object
  org.apache.jackrabbit.oak.commons.sort.ExternalSort
public class ExternalSort
Source copied from a publicly available library.
Goal: offer a generic external-memory sorting program in Java. It must be:
- hackable (easy to adapt)
- scalable to large files
- sensibly efficient

This software is in the public domain.

Usage:
    java org/apache/jackrabbit/oak/commons/sort/ExternalSort somefile.txt out.txt

You can change the default maximal number of temporary files with the -t flag:
    java org/apache/jackrabbit/oak/commons/sort/ExternalSort somefile.txt out.txt -t 3

You can change the default maximum memory available with the -m flag:
    java org/apache/jackrabbit/oak/commons/sort/ExternalSort somefile.txt out.txt -m 8192

For very large files, you might want to use an appropriate flag to allocate more memory to the Java VM:
    java -Xms2G org/apache/jackrabbit/oak/commons/sort/ExternalSort somefile.txt out.txt

By (in alphabetical order) Philippe Beaudoin, Eleftherios Chetzakis, Jon Elsas, Christan Grant, Daniel Haran, Daniel Lemire, Sugumaran Harikrishnan, Jerry Yang.

First published: April 2010. Originally posted at http://lemire.me/blog/archives/2010/04/01/external-memory-sorting-in-java/
| Field Summary | |
|---|---|
| static Comparator<String> | defaultcomparator |
| Constructor Summary | |
|---|---|
| ExternalSort() | |
| Method Summary | | |
|---|---|---|
| static void | displayUsage() | |
| static long | estimateBestSizeOfBlocks(File filetobesorted, int maxtmpfiles, long maxMemory) | |
| static void | main(String[] args) | |
| static int | merge(BufferedWriter fbw, Comparator<String> cmp, boolean distinct, List<org.apache.jackrabbit.oak.commons.sort.BinaryFileBuffer> buffers) | This merges several BinaryFileBuffer to an output writer. |
| static int | mergeSortedFiles(List<File> files, File outputfile) | This merges a bunch of temporary flat files. |
| static int | mergeSortedFiles(List<File> files, File outputfile, Comparator<String> cmp) | This merges a bunch of temporary flat files. |
| static int | mergeSortedFiles(List<File> files, File outputfile, Comparator<String> cmp, boolean distinct) | This merges a bunch of temporary flat files. |
| static int | mergeSortedFiles(List<File> files, File outputfile, Comparator<String> cmp, Charset cs) | This merges a bunch of temporary flat files. |
| static int | mergeSortedFiles(List<File> files, File outputfile, Comparator<String> cmp, Charset cs, boolean distinct) | This merges a bunch of temporary flat files. |
| static int | mergeSortedFiles(List<File> files, File outputfile, Comparator<String> cmp, Charset cs, boolean distinct, boolean append, boolean usegzip) | This merges a bunch of temporary flat files. |
| static void | sort(File input, File output) | |
| static File | sortAndSave(List<String> tmplist, Comparator<String> cmp, Charset cs, File tmpdirectory) | Sort a list and save it to a temporary file. |
| static File | sortAndSave(List<String> tmplist, Comparator<String> cmp, Charset cs, File tmpdirectory, boolean distinct, boolean usegzip) | Sort a list and save it to a temporary file. |
| static List<File> | sortInBatch(File file) | This will simply load the file by blocks of lines, then sort them in-memory, and write the result to temporary files that have to be merged later. |
| static List<File> | sortInBatch(File file, Comparator<String> cmp) | This will simply load the file by blocks of lines, then sort them in-memory, and write the result to temporary files that have to be merged later. |
| static List<File> | sortInBatch(File file, Comparator<String> cmp, boolean distinct) | This will simply load the file by blocks of lines, then sort them in-memory, and write the result to temporary files that have to be merged later. |
| static List<File> | sortInBatch(File file, Comparator<String> cmp, int maxtmpfiles, long maxMemory, Charset cs, File tmpdirectory, boolean distinct) | This will simply load the file by blocks of lines, then sort them in-memory, and write the result to temporary files that have to be merged later. |
| static List<File> | sortInBatch(File file, Comparator<String> cmp, int maxtmpfiles, long maxMemory, Charset cs, File tmpdirectory, boolean distinct, int numHeader, boolean usegzip) | This will simply load the file by blocks of lines, then sort them in-memory, and write the result to temporary files that have to be merged later. |
| Methods inherited from class java.lang.Object |
|---|
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
| Field Detail |
|---|
public static Comparator<String> defaultcomparator
| Constructor Detail |
|---|
public ExternalSort()
| Method Detail |
|---|
public static void sort(File input,
File output)
throws IOException
Throws:
IOException
public static long estimateBestSizeOfBlocks(File filetobesorted,
int maxtmpfiles,
long maxMemory)
public static List<File> sortInBatch(File file)
throws IOException
Parameters:
file - some flat file
Throws:
IOException
public static List<File> sortInBatch(File file,
Comparator<String> cmp)
throws IOException
Parameters:
file - some flat file
cmp - string comparator
Throws:
IOException
public static List<File> sortInBatch(File file,
Comparator<String> cmp,
boolean distinct)
throws IOException
Parameters:
file - some flat file
cmp - string comparator
distinct - Pass true if duplicate lines should be discarded.
Throws:
IOException
public static List<File> sortInBatch(File file,
Comparator<String> cmp,
int maxtmpfiles,
long maxMemory,
Charset cs,
File tmpdirectory,
boolean distinct,
int numHeader,
boolean usegzip)
throws IOException
Parameters:
file - some flat file
cmp - string comparator
maxtmpfiles - maximal number of temporary files
cs - character set to use (can use Charset.defaultCharset())
tmpdirectory - location of the temporary files (set to null for default location)
distinct - Pass true if duplicate lines should be discarded.
numHeader - number of header lines to skip before sorting starts
usegzip - use gzip compression for the temporary files
Throws:
IOException
public static List<File> sortInBatch(File file,
Comparator<String> cmp,
int maxtmpfiles,
long maxMemory,
Charset cs,
File tmpdirectory,
boolean distinct)
throws IOException
Parameters:
file - some flat file
cmp - string comparator
maxtmpfiles - maximal number of temporary files
cs - character set to use (can use Charset.defaultCharset())
tmpdirectory - location of the temporary files (set to null for default location)
distinct - Pass true if duplicate lines should be discarded.
Throws:
IOException
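The method summary describes sortInBatch as loading the file by blocks of lines, sorting each block in memory, and writing each result to a temporary file for a later merge. A minimal sketch of that idea follows, using only the standard library. It is not the class's actual implementation: the names BatchSortSketch, sortInBlocks and writeSortedRun are hypothetical, and the block size is given as a line count (maxLinesPerBlock) rather than derived from maxtmpfiles and maxMemory.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.File;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class BatchSortSketch {

    /**
     * Reads the input in blocks of at most maxLinesPerBlock lines, sorts each
     * block in memory and writes it to its own temporary file (a sorted "run").
     * maxLinesPerBlock is a hypothetical stand-in for a memory-based block size.
     */
    static List<File> sortInBlocks(File input, Comparator<String> cmp,
                                   int maxLinesPerBlock, Charset cs) throws IOException {
        List<File> runs = new ArrayList<>();
        List<String> block = new ArrayList<>();
        try (BufferedReader reader = Files.newBufferedReader(input.toPath(), cs)) {
            String line;
            while ((line = reader.readLine()) != null) {
                block.add(line);
                if (block.size() >= maxLinesPerBlock) {
                    runs.add(writeSortedRun(block, cmp, cs));
                    block.clear();
                }
            }
        }
        // Flush the last, possibly shorter, block.
        if (!block.isEmpty()) {
            runs.add(writeSortedRun(block, cmp, cs));
        }
        return runs;
    }

    /** Sorts one block in place and saves it to a new temporary file. */
    private static File writeSortedRun(List<String> block, Comparator<String> cmp,
                                       Charset cs) throws IOException {
        block.sort(cmp);
        File run = File.createTempFile("sortInBlocks", ".run");
        run.deleteOnExit();
        try (BufferedWriter writer = Files.newBufferedWriter(run.toPath(), cs)) {
            for (String s : block) {
                writer.write(s);
                writer.newLine();
            }
        }
        return run;
    }
}
```

Each returned file is sorted on its own, but the files are not ordered relative to each other, which is why a merge step is still needed afterwards.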
public static File sortAndSave(List<String> tmplist,
Comparator<String> cmp,
Charset cs,
File tmpdirectory,
boolean distinct,
boolean usegzip)
throws IOException
Parameters:
tmplist - data to be sorted
cmp - string comparator
cs - charset to use for output (can use Charset.defaultCharset())
tmpdirectory - location of the temporary files (set to null for default location)
distinct - Pass true if duplicate lines should be discarded.
Throws:
IOException
public static File sortAndSave(List<String> tmplist,
Comparator<String> cmp,
Charset cs,
File tmpdirectory)
throws IOException
Parameters:
tmplist - data to be sorted
cmp - string comparator
cs - charset to use for output (can use Charset.defaultCharset())
tmpdirectory - location of the temporary files (set to null for default location)
Throws:
IOException
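The sortAndSave parameters suggest a simple pattern: sort the list, optionally drop duplicates, and write the result to a temporary file that may be gzip-compressed. The sketch below illustrates that pattern under stated assumptions and is not the class's own code. The class name SortAndSaveSketch is hypothetical, and treating two lines as duplicates when the comparator returns 0 is an assumption, since this page does not say how duplicates are detected.

```java
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.nio.charset.Charset;
import java.util.Comparator;
import java.util.List;
import java.util.zip.GZIPOutputStream;

final class SortAndSaveSketch {

    /**
     * Sorts the given list in place (so it must be mutable) and writes it,
     * one entry per line, to a new temporary file.
     */
    static File sortAndSave(List<String> lines, Comparator<String> cmp, Charset cs,
                            File tmpDirectory, boolean distinct, boolean useGzip)
            throws IOException {
        lines.sort(cmp);
        // A null directory makes createTempFile use the default temp location.
        File out = File.createTempFile("sortAndSave", ".tmp", tmpDirectory);
        out.deleteOnExit();
        try (OutputStream raw = new FileOutputStream(out);
             OutputStream stream = useGzip ? new GZIPOutputStream(raw) : raw;
             BufferedWriter writer = new BufferedWriter(new OutputStreamWriter(stream, cs))) {
            String previous = null;
            for (String line : lines) {
                // After sorting, duplicates are adjacent, so comparing with the
                // previously written line is enough to drop them.
                // Assumption: "duplicate" means the comparator returns 0.
                if (!distinct || previous == null || cmp.compare(line, previous) != 0) {
                    writer.write(line);
                    writer.newLine();
                    previous = line;
                }
            }
        }
        return out;
    }
}
```

A file written with useGzip set to true has to be read back through a GZIPInputStream, so whatever merges the temporary files needs to know which setting was used.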
public static int mergeSortedFiles(List<File> files,
File outputfile)
throws IOException
Parameters:
files - The List of sorted Files to be merged.
outputfile - The output File to merge the results to.
Throws:
IOException
public static int mergeSortedFiles(List<File> files,
File outputfile,
Comparator<String> cmp)
throws IOException
Parameters:
files - The List of sorted Files to be merged.
outputfile - The output File to merge the results to.
Throws:
IOException
public static int mergeSortedFiles(List<File> files,
File outputfile,
Comparator<String> cmp,
boolean distinct)
throws IOException
Parameters:
files - The List of sorted Files to be merged.
outputfile - The output File to merge the results to.
Throws:
IOException
public static int mergeSortedFiles(List<File> files,
File outputfile,
Comparator<String> cmp,
Charset cs,
boolean distinct,
boolean append,
boolean usegzip)
throws IOException
Parameters:
files - The List of sorted Files to be merged.
distinct - Pass true if duplicate lines should be discarded. (elchetz@gmail.com)
outputfile - The output File to merge the results to.
cmp - The Comparator to use to compare Strings.
cs - The Charset to be used for the byte to character conversion.
append - Pass true if the result should be appended to the File instead of overwriting it. Defaults to false in the overloads that omit it.
usegzip - assumes we used gzip compression for temporary files
Throws:
IOException
public static int merge(BufferedWriter fbw,
Comparator<String> cmp,
boolean distinct,
List<org.apache.jackrabbit.oak.commons.sort.BinaryFileBuffer> buffers)
throws IOException
Parameters:
fbw - A buffer where we write the data.
cmp - A comparator object that tells us how to sort the lines.
distinct - Pass true if duplicate lines should be discarded. (elchetz@gmail.com)
buffers - Where the data should be read.
Throws:
IOException
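Merging several sorted inputs into one writer is commonly done as a k-way merge driven by a priority queue: the queue always yields the input whose next line is smallest. The sketch below shows that technique with standard-library types only. It is an illustration, not the class's implementation. BinaryFileBuffer is not documented on this page, so a hypothetical RunCursor wrapper over a BufferedReader stands in for it, and returning the number of lines written is an assumption about what the int result means.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

final class MergeSketch {

    /** Hypothetical cursor over one sorted input; caches the next unread line. */
    private static final class RunCursor {
        final BufferedReader reader;
        String head;

        RunCursor(BufferedReader reader) throws IOException {
            this.reader = reader;
            this.head = reader.readLine();
        }

        void advance() throws IOException {
            head = reader.readLine();
        }
    }

    /**
     * Merges already-sorted inputs into out.
     * Returns the number of lines written (an assumption for this sketch).
     */
    static int merge(BufferedWriter out, Comparator<String> cmp, boolean distinct,
                     List<BufferedReader> sortedInputs) throws IOException {
        // Order cursors by their current head line, using the caller's comparator.
        PriorityQueue<RunCursor> queue =
                new PriorityQueue<>(Comparator.comparing((RunCursor c) -> c.head, cmp));
        for (BufferedReader reader : sortedInputs) {
            RunCursor cursor = new RunCursor(reader);
            if (cursor.head != null) {
                queue.add(cursor);
            }
        }
        int written = 0;
        String last = null;
        while (!queue.isEmpty()) {
            RunCursor cursor = queue.poll();
            String line = cursor.head;
            // Output is produced in sorted order, so a duplicate can only
            // follow directly after the line it duplicates.
            if (!distinct || last == null || cmp.compare(line, last) != 0) {
                out.write(line);
                out.newLine();
                last = line;
                written++;
            }
            // Re-insert the cursor only if its input still has lines left.
            cursor.advance();
            if (cursor.head != null) {
                queue.add(cursor);
            }
        }
        out.flush();
        return written;
    }
}
```

To merge temporary files with this sketch, each file would be opened as a BufferedReader (through a GZIPInputStream if it was compressed) before being passed in. The sketch leaves closing the readers and the writer to the caller.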
public static int mergeSortedFiles(List<File> files,
File outputfile,
Comparator<String> cmp,
Charset cs,
boolean distinct)
throws IOException
Parameters:
files - The List of sorted Files to be merged.
distinct - Pass true if duplicate lines should be discarded. (elchetz@gmail.com)
outputfile - The output File to merge the results to.
cmp - The Comparator to use to compare Strings.
cs - The Charset to be used for the byte to character conversion.
Throws:
IOException
public static int mergeSortedFiles(List<File> files,
File outputfile,
Comparator<String> cmp,
Charset cs)
throws IOException
Parameters:
files - The List of sorted Files to be merged.
outputfile - The output File to merge the results to.
cs - character set to use to load the strings
Throws:
IOException
public static void displayUsage()
public static void main(String[] args)
throws IOException
Throws:
IOException