Friday, October 11, 2013

External Sorting and Use.

External Sorting 

From the Wiki, External sorting is a term for a class of sorting algorithms that can handle massive amounts of data. External sorting is required when the data being sorted do not fit into the main memory of a computing device (usually RAM) and instead they must reside in the slower external memory (usually a hard drive). External sorting typically uses a hybrid sort-merge strategy. In the sorting phase, chunks of data small enough to fit in main memory are read, sorted, and written out to a temporary file. In the merge phase, the sorted subfiles are combined into a single larger file. External sorting is a form of distribution sort,


Problem


We are given a very large text file of size say 8 GB and we need to sort the file and store it in an output location. But the memory (RAM) is limited, say 1 GB.



Solution

We will only read the input file line by line and store it into the TreeSet. This step is similar to above, but we would not store the entire file in the TreeSet. If the file is 10 GB and our RAM capacity is 1 GB, we will store 512MB data in the TreeSet. Once it becomes 512 MB we flush it into disk, on a temporary file (call it temp1.txt). Repeat this procedure of writing into temporary files, till you read 10 GB completely. So we will have 20 sorted temporary files. Next step is to merge these 20 files into a single sorted file and delete the temporary files. We call it a K-Way merge algorithm.

Next utilize the Memory-Mapped-Files (MMAP). For this we need to consider how Java does file IO. Data in the file system (secondary storage or hard-disk) has to be brought to memory (RAM). To do this, we can either read character by character, or read a chunk of data at a time. The time taken by file IO is huge, compared to the CPU processing time. So its better to reduce the frequency of file reads, hence we use BufferedReader.
But if we use MMAP, then a file or portion of a file can be mapped directly to memory. To achieve this, it uses SWAP space (or virtual memory). Every system has this virtual memory and is different from the RAM. In Java, objects are created on the Heap or RAM or memory as we call it. But if we bring data to swap space then the data can be read directly without an operating system call (which is magnitude of times slower). You cannot Memory Map a full 10 GB file, so we can MMAP a portion of the file. The other advantage of MMAP is, multiple processes can share the same file (thread-safe). In Java, we have the classes for this in NIO package. FileChannel is the class to hold the data.

With this approach, We can memory map the 400 MB file for 8KB size. Read this 8KB data and store it in a TreeSet. Repeat 5000 times, and store the 40 MB data in a temporary file (we will have 10 temporary sorted files). Then apply k-way merge.




The Java Program


package com.sachi.datastruct;

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.FileReader;
import java.io.FileWriter;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.util.Date;
import java.util.Map;
import java.util.TreeMap;

/**
 * Sort a 400MB file using FileChannel
 * 
 *
 */
public class ExternalFileSort {

  private static final int DEFAULT_BUFFER_SIZE = 1024 * 8;

/**
* @param args
*/
public static void main(String[] args) throws Exception {

long t1 = new Date().getTime();
int i = 0;
FileChannel source = new FileInputStream(new File("path to file")).getChannel();
ByteBuffer buf = ByteBuffer.allocateDirect(DEFAULT_BUFFER_SIZE);

int j = 0;
FileChannel destination = new FileOutputStream(new File("path to temp file" +i+".txt")).getChannel();
while((source.read(buf)) != -1) {
buf.flip();
destination.write(buf);
if( j == 5000 ) {
i++;
destination.close();
destination = new FileOutputStream(new File("path to temp file" +i+".txt")).getChannel();
j=0;
} else {
j++;
}
buf.clear();
}

source.close();
Map<String, Integer> map = new TreeMap<String, Integer>();

BufferedReader[] brArr = new BufferedReader[i+1];
for(j=0; j<=i; j++) {
brArr[j] = new BufferedReader(new FileReader(new File("path to temp file" +j+".txt")));
map.put(brArr[j].readLine(), j);
}


BufferedWriter bw = new BufferedWriter(new FileWriter(new File("path to output file")));

String s = null;
String endofline = "\n";
while(!map.isEmpty()) {
s = map.keySet().iterator().next();
i = map.get(s);
map.remove(s);
bw.write(s);
bw.write(endofline);
s = brArr[i].readLine();
if(s != null ) {
map.put(s, i);
}
}
bw.close();

for(j=0; j<brArr.length; j++) {
brArr[j].close();
new File("path to temp file" +j+".txt").delete();
}

long t2 = new Date().getTime();
System.out.println("Time taken = " +(t2-t1)/1000 + " sec");

}

}

No comments:

Post a Comment