org.apache.jackrabbit.oak.commons.sort
Class ExternalSort

java.lang.Object
  extended by org.apache.jackrabbit.oak.commons.sort.ExternalSort

public class ExternalSort
extends Object

Source copied from a publicly available library.

See Also:
https://code.google.com/p/externalsortinginjava
 Goal: offer a generic external-memory sorting program in Java.
 
 It must be:
 - hackable (easy to adapt)
 - scalable to large files
 - reasonably efficient
 
 This software is in the public domain.
 
 Usage: java org/apache/oak/commons/sort/ExternalSort somefile.txt out.txt
 
 You can change the default maximal number of temporary files with the -t flag: java
 org/apache/oak/commons/sort/ExternalSort somefile.txt out.txt -t 3
 
 You can change the default maximum memory available with the -m flag: java
 org/apache/oak/commons/sort/ExternalSort somefile.txt out.txt -m 8192
 
 For very large files, you might want to use an appropriate flag to allocate more memory to
 the Java VM: java -Xms2G org/apache/oak/commons/sort/ExternalSort somefile.txt out.txt
 
 By (in alphabetical order) Philippe Beaudoin, Eleftherios Chetzakis, Jon Elsas, Christan
 Grant, Daniel Haran, Daniel Lemire, Sugumaran Harikrishnan, and Jerry Yang.
 First published: April 2010. Originally posted at
 http://lemire.me/blog/archives/2010/04/01/external-memory-sorting-in-java/
 

Field Summary
static Comparator<String> defaultcomparator
           
 
Constructor Summary
ExternalSort()
           
 
Method Summary
static void displayUsage()
           
static long estimateBestSizeOfBlocks(File filetobesorted, int maxtmpfiles, long maxMemory)
           
static void main(String[] args)
           
static int merge(BufferedWriter fbw, Comparator<String> cmp, boolean distinct, List<org.apache.jackrabbit.oak.commons.sort.BinaryFileBuffer> buffers)
          This merges several BinaryFileBuffer to an output writer.
static int mergeSortedFiles(List<File> files, File outputfile)
          This merges a bunch of temporary flat files
static int mergeSortedFiles(List<File> files, File outputfile, Comparator<String> cmp)
          This merges a bunch of temporary flat files
static int mergeSortedFiles(List<File> files, File outputfile, Comparator<String> cmp, boolean distinct)
          This merges a bunch of temporary flat files
static int mergeSortedFiles(List<File> files, File outputfile, Comparator<String> cmp, Charset cs)
          This merges a bunch of temporary flat files
static int mergeSortedFiles(List<File> files, File outputfile, Comparator<String> cmp, Charset cs, boolean distinct)
          This merges a bunch of temporary flat files
static int mergeSortedFiles(List<File> files, File outputfile, Comparator<String> cmp, Charset cs, boolean distinct, boolean append, boolean usegzip)
          This merges a bunch of temporary flat files
static void sort(File input, File output)
           
static File sortAndSave(List<String> tmplist, Comparator<String> cmp, Charset cs, File tmpdirectory)
          Sort a list and save it to a temporary file
static File sortAndSave(List<String> tmplist, Comparator<String> cmp, Charset cs, File tmpdirectory, boolean distinct, boolean usegzip)
          Sort a list and save it to a temporary file
static List<File> sortInBatch(File file)
          This will simply load the file by blocks of lines, then sort them in-memory, and write the result to temporary files that have to be merged later.
static List<File> sortInBatch(File file, Comparator<String> cmp)
          This will simply load the file by blocks of lines, then sort them in-memory, and write the result to temporary files that have to be merged later.
static List<File> sortInBatch(File file, Comparator<String> cmp, boolean distinct)
          This will simply load the file by blocks of lines, then sort them in-memory, and write the result to temporary files that have to be merged later.
static List<File> sortInBatch(File file, Comparator<String> cmp, int maxtmpfiles, long maxMemory, Charset cs, File tmpdirectory, boolean distinct)
          This will simply load the file by blocks of lines, then sort them in-memory, and write the result to temporary files that have to be merged later.
static List<File> sortInBatch(File file, Comparator<String> cmp, int maxtmpfiles, long maxMemory, Charset cs, File tmpdirectory, boolean distinct, int numHeader, boolean usegzip)
          This will simply load the file by blocks of lines, then sort them in-memory, and write the result to temporary files that have to be merged later.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

defaultcomparator

public static Comparator<String> defaultcomparator
Constructor Detail

ExternalSort

public ExternalSort()
Method Detail

sort

public static void sort(File input,
                        File output)
                 throws IOException
Throws:
IOException

estimateBestSizeOfBlocks

public static long estimateBestSizeOfBlocks(File filetobesorted,
                                            int maxtmpfiles,
                                            long maxMemory)

sortInBatch

public static List<File> sortInBatch(File file)
                              throws IOException
This will simply load the file by blocks of lines, then sort them in-memory, and write the result to temporary files that have to be merged later.

Parameters:
file - some flat file
Returns:
a list of temporary flat files
Throws:
IOException
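
The batching phase described above can be illustrated with a minimal, JDK-only sketch. It is not Oak's implementation: the class name BatchSortSketch, the method sortInBatches, and the maxLinesPerBlock parameter are hypothetical, and the block size is bounded here by a line count rather than by an estimate of available memory.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical illustration of "load by blocks, sort in memory, write temporary files".
final class BatchSortSketch {

    /** Splits the input into sorted temporary files of at most maxLinesPerBlock lines each. */
    static List<File> sortInBatches(File input, int maxLinesPerBlock,
                                    Comparator<String> cmp) throws IOException {
        List<File> runs = new ArrayList<>();
        List<String> block = new ArrayList<>();
        try (BufferedReader in = Files.newBufferedReader(input.toPath(), StandardCharsets.UTF_8)) {
            String line;
            while ((line = in.readLine()) != null) {
                block.add(line);
                // Assumption: the block is "full" after a fixed number of lines.
                if (block.size() >= maxLinesPerBlock) {
                    runs.add(writeSortedRun(block, cmp));
                    block.clear();
                }
            }
        }
        // Flush the last, possibly shorter, block.
        if (!block.isEmpty()) {
            runs.add(writeSortedRun(block, cmp));
        }
        return runs;
    }

    /** Sorts one block in memory and writes it to a new temporary file. */
    private static File writeSortedRun(List<String> block, Comparator<String> cmp)
            throws IOException {
        block.sort(cmp);
        File run = File.createTempFile("sortInBatch", ".tmp");
        run.deleteOnExit();
        try (BufferedWriter out = Files.newBufferedWriter(run.toPath(), StandardCharsets.UTF_8)) {
            for (String s : block) {
                out.write(s);
                out.newLine();
            }
        }
        return run;
    }
}
```

Each returned file is sorted on its own; the files still have to be merged to obtain a fully sorted result.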

sortInBatch

public static List<File> sortInBatch(File file,
                                     Comparator<String> cmp)
                              throws IOException
This will simply load the file by blocks of lines, then sort them in-memory, and write the result to temporary files that have to be merged later.

Parameters:
file - some flat file
cmp - string comparator
Returns:
a list of temporary flat files
Throws:
IOException

sortInBatch

public static List<File> sortInBatch(File file,
                                     Comparator<String> cmp,
                                     boolean distinct)
                              throws IOException
This will simply load the file by blocks of lines, then sort them in-memory, and write the result to temporary files that have to be merged later.

Parameters:
file - some flat file
cmp - string comparator
distinct - Pass true if duplicate lines should be discarded.
Returns:
a list of temporary flat files
Throws:
IOException

sortInBatch

public static List<File> sortInBatch(File file,
                                     Comparator<String> cmp,
                                     int maxtmpfiles,
                                     long maxMemory,
                                     Charset cs,
                                     File tmpdirectory,
                                     boolean distinct,
                                     int numHeader,
                                     boolean usegzip)
                              throws IOException
This will simply load the file by blocks of lines, then sort them in-memory, and write the result to temporary files that have to be merged later. You can specify a bound on the number of temporary files that will be created.

Parameters:
file - some flat file
cmp - string comparator
maxtmpfiles - maximal number of temporary files
maxMemory - maximum memory to use
cs - character set to use (can use Charset.defaultCharset())
tmpdirectory - location of the temporary files (set to null for default location)
distinct - Pass true if duplicate lines should be discarded.
numHeader - number of leading lines to exclude from sorting
usegzip - use gzip compression for the temporary files
Returns:
a list of temporary flat files
Throws:
IOException

sortInBatch

public static List<File> sortInBatch(File file,
                                     Comparator<String> cmp,
                                     int maxtmpfiles,
                                     long maxMemory,
                                     Charset cs,
                                     File tmpdirectory,
                                     boolean distinct)
                              throws IOException
This will simply load the file by blocks of lines, then sort them in-memory, and write the result to temporary files that have to be merged later. You can specify a bound on the number of temporary files that will be created.

Parameters:
file - some flat file
cmp - string comparator
maxtmpfiles - maximal number of temporary files
maxMemory - maximum memory to use
cs - character set to use (can use Charset.defaultCharset())
tmpdirectory - location of the temporary files (set to null for default location)
distinct - Pass true if duplicate lines should be discarded.
Returns:
a list of temporary flat files
Throws:
IOException

sortAndSave

public static File sortAndSave(List<String> tmplist,
                               Comparator<String> cmp,
                               Charset cs,
                               File tmpdirectory,
                               boolean distinct,
                               boolean usegzip)
                        throws IOException
Sort a list and save it to a temporary file

Parameters:
tmplist - data to be sorted
cmp - string comparator
cs - charset to use for output (can use Charset.defaultCharset())
tmpdirectory - location of the temporary files (set to null for default location)
distinct - Pass true if duplicate lines should be discarded.
usegzip - use gzip compression for the temporary files
Returns:
the file containing the sorted data
Throws:
IOException
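
As a rough illustration of how the distinct and usegzip options can interact when one sorted block is written out, here is a JDK-only sketch. SortAndSaveSketch is a hypothetical class, not the Oak source. It assumes that two lines are duplicates when the comparator returns 0 for them, and that a null tmpDirectory means the JVM's default temporary directory.

```java
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.zip.GZIPOutputStream;

// Hypothetical illustration of "sort a list and save it to a temporary file".
final class SortAndSaveSketch {

    static File sortAndSave(List<String> lines, Comparator<String> cmp, Charset cs,
                            File tmpDirectory, boolean distinct, boolean useGzip)
            throws IOException {
        // Sort a copy so the caller's list is left untouched.
        List<String> sorted = new ArrayList<>(lines);
        sorted.sort(cmp);

        // A null directory makes createTempFile use the default temporary location.
        File tmp = File.createTempFile("sortAndSave", ".tmp", tmpDirectory);
        tmp.deleteOnExit();

        try (OutputStream file = new FileOutputStream(tmp);
             OutputStream raw = useGzip ? new GZIPOutputStream(file) : file;
             BufferedWriter out = new BufferedWriter(new OutputStreamWriter(raw, cs))) {
            String previous = null;
            for (String line : sorted) {
                // After sorting, duplicates are adjacent, so comparing each line
                // with the previously written one is enough to drop them.
                if (distinct && previous != null && cmp.compare(line, previous) == 0) {
                    continue;
                }
                out.write(line);
                out.newLine();
                previous = line;
            }
        }
        return tmp;
    }
}
```

A file written with useGzip set to true has to be read back through a GZIPInputStream, which is why the merge step needs to know whether compression was used.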

sortAndSave

public static File sortAndSave(List<String> tmplist,
                               Comparator<String> cmp,
                               Charset cs,
                               File tmpdirectory)
                        throws IOException
Sort a list and save it to a temporary file

Parameters:
tmplist - data to be sorted
cmp - string comparator
cs - charset to use for output (can use Charset.defaultCharset())
tmpdirectory - location of the temporary files (set to null for default location)
Returns:
the file containing the sorted data
Throws:
IOException

mergeSortedFiles

public static int mergeSortedFiles(List<File> files,
                                   File outputfile)
                            throws IOException
This merges a bunch of temporary flat files

Parameters:
files - The List of sorted Files to be merged.
outputfile - The output File to merge the results to.
Returns:
The number of lines sorted. (P. Beaudoin)
Throws:
IOException

mergeSortedFiles

public static int mergeSortedFiles(List<File> files,
                                   File outputfile,
                                   Comparator<String> cmp)
                            throws IOException
This merges a bunch of temporary flat files

Parameters:
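files - The List of sorted Files to be merged.
outputfile - The output File to merge the results to.
cmp - The Comparator to use to compare Strings.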
Returns:
The number of lines sorted. (P. Beaudoin)
Throws:
IOException

mergeSortedFiles

public static int mergeSortedFiles(List<File> files,
                                   File outputfile,
                                   Comparator<String> cmp,
                                   boolean distinct)
                            throws IOException
This merges a bunch of temporary flat files

Parameters:
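files - The List of sorted Files to be merged.
outputfile - The output File to merge the results to.
cmp - The Comparator to use to compare Strings.
distinct - Pass true if duplicate lines should be discarded.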
Returns:
The number of lines sorted. (P. Beaudoin)
Throws:
IOException

mergeSortedFiles

public static int mergeSortedFiles(List<File> files,
                                   File outputfile,
                                   Comparator<String> cmp,
                                   Charset cs,
                                   boolean distinct,
                                   boolean append,
                                   boolean usegzip)
                            throws IOException
This merges a bunch of temporary flat files

Parameters:
files - The List of sorted Files to be merged.
distinct - Pass true if duplicate lines should be discarded. (elchetz@gmail.com)
outputfile - The output File to merge the results to.
cmp - The Comparator to use to compare Strings.
cs - The Charset to be used for the byte to character conversion.
append - Pass true if the result should be appended to the output file instead of overwriting it. The overloads without this parameter use false.
usegzip - Pass true if the temporary files were written with gzip compression.
Returns:
The number of lines sorted. (P. Beaudoin)
Throws:
IOException
Since:
v0.1.4

merge

public static int merge(BufferedWriter fbw,
                        Comparator<String> cmp,
                        boolean distinct,
                        List<org.apache.jackrabbit.oak.commons.sort.BinaryFileBuffer> buffers)
                 throws IOException
This merges several BinaryFileBuffer to an output writer.

Parameters:
fbw - A buffer where we write the data.
cmp - A comparator object that tells us how to sort the lines.
distinct - Pass true if duplicate lines should be discarded. (elchetz@gmail.com)
buffers - Where the data should be read.
Returns:
The number of lines sorted. (P. Beaudoin)
Throws:
IOException
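
Merging several sorted inputs into one writer is commonly done as a k-way merge over a priority queue. The sketch below shows that technique with plain BufferedReader inputs instead of BinaryFileBuffer. KWayMergeSketch and its Source helper are hypothetical names, and the sketch returns the number of lines written, which is an assumption and may differ from how Oak counts lines when distinct is true.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical illustration of a k-way merge of already sorted inputs.
final class KWayMergeSketch {

    /** One sorted input together with the line currently at its head. */
    private static final class Source {
        final BufferedReader reader;
        String head;

        Source(BufferedReader reader) throws IOException {
            this.reader = reader;
            this.head = reader.readLine();
        }

        void advance() throws IOException {
            head = reader.readLine();
        }
    }

    static int merge(BufferedWriter out, Comparator<String> cmp, boolean distinct,
                     List<BufferedReader> inputs) throws IOException {
        // The queue is ordered by each source's current head line.
        PriorityQueue<Source> queue = new PriorityQueue<Source>(
                Math.max(1, inputs.size()), (a, b) -> cmp.compare(a.head, b.head));
        for (BufferedReader reader : inputs) {
            Source s = new Source(reader);
            if (s.head != null) {
                queue.add(s); // skip inputs that are empty from the start
            }
        }

        int written = 0;
        String last = null;
        while (!queue.isEmpty()) {
            Source s = queue.poll(); // source holding the smallest head line
            if (!distinct || last == null || cmp.compare(s.head, last) != 0) {
                out.write(s.head);
                out.newLine();
                last = s.head;
                written++;
            }
            s.advance();
            if (s.head != null) {
                queue.add(s); // re-insert with its new head line
            }
        }
        out.flush();
        return written;
    }
}
```

Only one line per input is held in memory at a time, which is what lets the merge handle files far larger than the heap.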

mergeSortedFiles

public static int mergeSortedFiles(List<File> files,
                                   File outputfile,
                                   Comparator<String> cmp,
                                   Charset cs,
                                   boolean distinct)
                            throws IOException
This merges a bunch of temporary flat files

Parameters:
files - The List of sorted Files to be merged.
distinct - Pass true if duplicate lines should be discarded. (elchetz@gmail.com)
outputfile - The output File to merge the results to.
cmp - The Comparator to use to compare Strings.
cs - The Charset to be used for the byte to character conversion.
Returns:
The number of lines sorted. (P. Beaudoin)
Throws:
IOException
Since:
v0.1.2

mergeSortedFiles

public static int mergeSortedFiles(List<File> files,
                                   File outputfile,
                                   Comparator<String> cmp,
                                   Charset cs)
                            throws IOException
This merges a bunch of temporary flat files

Parameters:
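files - The List of sorted Files to be merged.
outputfile - The output File to merge the results to.
cmp - The Comparator to use to compare Strings.
cs - character set to use to load the strings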
Returns:
The number of lines sorted. (P. Beaudoin)
Throws:
IOException

displayUsage

public static void displayUsage()

main

public static void main(String[] args)
                 throws IOException
Throws:
IOException


Copyright © 2012-2014 The Apache Software Foundation. All Rights Reserved.