Initial push
luminoso committed Feb 9, 2017
0 parents commit 448d95e
Showing 36 changed files with 14,477 additions and 0 deletions.
21 changes: 21 additions & 0 deletions LICENSE
@@ -0,0 +1,21 @@
MIT License

Copyright (c) 2017 Guilherme Cardoso

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
100 changes: 100 additions & 0 deletions README.md
@@ -0,0 +1,100 @@
# Information Retrieval (IR) Engine

Given the increasing volume of unstructured data in the world, Information Retrieval (IR) (a sub-area of text mining) and Information Extraction (IE) are extremely important for dealing efficiently with all that data. Industry, companies, marketing, economics and many other sectors depend heavily on the efficiency and robustness of these techniques and tools.

Developed at [Aveiro University](https://www.ua.pt), this IR/IE engine handles the whole process of gathering, indexing and searching for relevant documents in huge collections of textual data, so that knowledge can be extracted from existing unstructured data.

Features:

* Components are developed as modules
* Memory usage adapts to the host
* Fully threaded

The engine is currently adapted to process a CSV corpus collected from StackOverflow questions and answers, and a small stack is included in the repository for demonstration purposes. Given the modularity of the engine, it can easily be adapted to any other type of corpus. The full stack for further testing can be downloaded [here](https://meocloud.pt/link/8b405a8f-c5af-4898-b1a2-4b9af7e259e3/stacksample.zip/).


## How to run

This project is built with [Apache Maven](https://maven.apache.org/), and the minimum required Java major version is 8. The engine is compatible with both Oracle Java 8 and OpenJDK 8, and it is not backward compatible with Java 7.

There are two ways to run the engine: import the Maven project into your favorite IDE and run it from there (preferred), or use the provided compiled jar. For the sake of simplicity, the examples below run from the jar.

### Display help
1. Run with **-h** switch for help:
```
$ java -jar IR-2016_17-0.0.1-SNAPSHOT.jar -h
```

Option | Description | Default
------------ | -------------| -------------
-d *\<arg>* | Directory containing text corpus to process | ./stacksample
-f *\<arg>* | Stop words to use | ./stop_processed.txt
-o *\<arg>* | Output directory to store processed index | ./disk
-h | Print the help message |

### Processing the given sample

2. Processing the sample requires no arguments, because the default stack directory is *./stacksample*:
```
$ java -jar IR-2016_17-0.0.1-SNAPSHOT.jar
```

Progress output is displayed while the engine runs.

### Query the database

3. Query the processed stack for the words *buffer* and *color*.

Running with the **-q** switch opens the interface for querying the database. For example:
```
$ java -jar IR-2016_17-0.0.1-SNAPSHOT.jar -q
Insert query (Control+c to exit): buffer color
Number of results to query (10):
┌──────────────────────────────────────────────────────────────────────────┐
│ Information Retrieval │
├──────────────────────────────────────────────────────────────────────────┤
├───────────────────────────────┬──────────────────────────────────────────┤
│ Terms: │ │
│ • Query │ [buffer, color] │
│ • Tokenized │ [buffer, color] │
├───────────────────────────────┴──────────────────────────────────────────┤
├───────────────────────────────┬──────────────────────────────────────────┤
│ Results found │ 14 │
├───────────────────────────────┼──────────────────────────────────────────┤
│ Database size │ 958 │
├───────────────────────────────┼──────────────────────────────────────────┤
│ Token count │ 4804 │
├───────────────────────────────┼──────────────────────────────────────────┤
│ Results to retrieve │ 10 │
├───────────────────────────────┴──────────────────────────────────────────┤
├──────┬────────────────────────┬──────────┬───────────────────────────────┤
│ Rank │ Score │ Document │ Path │
├──────┼────────────────────────┼──────────┼───────────────────────────────┤
│ 1 │ 0.2792691357898761 │ 896 │ ./stacksample/Questions.csv │
│ 2 │ 0.2544781393660751 │ 781 │ ./stacksample/Questions.csv │
│ 3 │ 0.20702654309708213 │ 354 │ ./stacksample/Questions.csv │
│ 4 │ 0.18761207533221913 │ 340 │ ./stacksample/Questions.csv │
│ 5 │ 0.16297804872950658 │ 37 │ ./stacksample/Answers.csv │
│ 6 │ 0.16273914301263454 │ 12 │ ./stacksample/Answers.csv │
│ 7 │ 0.1576064181842389 │ 394 │ ./stacksample/Questions.csv │
│ 8 │ 0.15217728526150623 │ 108 │ ./stacksample/Answers.csv │
│ 9 │ 0.1287816215117765 │ 322 │ ./stacksample/Questions.csv │
│ 10 │ 0.12816593333587364 │ 6 │ ./stacksample/Answers.csv │
└──────┴────────────────────────┴──────────┴───────────────────────────────┘
```


## Project architecture

The engine is designed as a set of macro modules that interact with each other. The overall view is the following:

![Engine overview](https://raw.githubusercontent.com/luminoso/information-retrieval/master/doc/pipeline.png)


| Module | Description |
| ------ | ----------- |
| Corpus Reader | Parses the input. In the given example, files in *./stacksample* |
| Tokenizer | Tokenizes documents (stop-word removal, stemming, etc.) |
| Indexer | Processes the tokens, computes LNC and serializes the results |
| Searcher | Controls the query interface and the mechanisms to perform a query |
| Ranker | Ranks the results using the LNC/TLC approach |
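
The Indexer and Ranker rows mention LNC and LNC/TLC weighting. Assuming this refers to the SMART `lnc.ltc` scheme (documents: logarithmic term frequency, no idf, cosine normalization; queries: logarithmic term frequency, idf, cosine normalization), the idea can be sketched roughly as below. The class and method names are hypothetical and are not taken from the repository; the real engine may compute or store these weights differently.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of lnc.ltc scoring; not the repository's actual code.
class LncLtcSketch {

    // Document side (lnc): weight = 1 + log10(tf), no idf, then cosine normalization.
    static Map<String, Double> lncWeights(Map<String, Integer> termFrequencies) {
        Map<String, Double> weights = new HashMap<>();
        for (Map.Entry<String, Integer> e : termFrequencies.entrySet()) {
            if (e.getValue() > 0) {
                weights.put(e.getKey(), 1.0 + Math.log10(e.getValue()));
            }
        }
        cosineNormalize(weights);
        return weights;
    }

    // Query side (ltc): weight = (1 + log10(tf)) * log10(N / df), then cosine normalization.
    // Terms that never occur in the collection are skipped.
    static Map<String, Double> ltcWeights(Map<String, Integer> queryFrequencies,
                                          Map<String, Integer> documentFrequencies,
                                          int collectionSize) {
        Map<String, Double> weights = new HashMap<>();
        for (Map.Entry<String, Integer> e : queryFrequencies.entrySet()) {
            Integer df = documentFrequencies.get(e.getKey());
            if (e.getValue() <= 0 || df == null || df <= 0) {
                continue;
            }
            double tfWeight = 1.0 + Math.log10(e.getValue());
            double idf = Math.log10((double) collectionSize / df);
            weights.put(e.getKey(), tfWeight * idf);
        }
        cosineNormalize(weights);
        return weights;
    }

    // Score = dot product of the two normalized vectors over the shared terms.
    static double score(Map<String, Double> queryWeights, Map<String, Double> docWeights) {
        double sum = 0.0;
        for (Map.Entry<String, Double> e : queryWeights.entrySet()) {
            Double docWeight = docWeights.get(e.getKey());
            if (docWeight != null) {
                sum += e.getValue() * docWeight;
            }
        }
        return sum;
    }

    // Divides every weight by the Euclidean length of the vector (no-op for a zero vector).
    private static void cosineNormalize(Map<String, Double> weights) {
        double sumOfSquares = 0.0;
        for (double w : weights.values()) {
            sumOfSquares += w * w;
        }
        final double norm = Math.sqrt(sumOfSquares);
        if (norm > 0.0) {
            weights.replaceAll((term, w) -> w / norm);
        }
    }

    public static void main(String[] args) {
        // Made-up numbers, for illustration only.
        Map<String, Integer> docTf = new HashMap<>();
        docTf.put("buffer", 3);
        docTf.put("color", 1);
        docTf.put("java", 5);

        Map<String, Integer> queryTf = new HashMap<>();
        queryTf.put("buffer", 1);
        queryTf.put("color", 1);

        Map<String, Integer> df = new HashMap<>();
        df.put("buffer", 40);
        df.put("color", 25);

        double s = score(ltcWeights(queryTf, df, 958), lncWeights(docTf));
        System.out.println("score = " + s);
    }
}
```

Because both vectors are length-normalized, a score computed this way stays between 0 and 1, which is in line with the range of scores shown in the query output above.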
53 changes: 53 additions & 0 deletions dependency-reduced-pom.xml
@@ -0,0 +1,53 @@
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>IR-2016_17</groupId>
<artifactId>IR-2016_17</artifactId>
<version>0.0.1-SNAPSHOT</version>
<build>
<plugins>
<plugin>
<artifactId>maven-shade-plugin</artifactId>
<version>2.4.3</version>
<executions>
<execution>
<phase>package</phase>
<goals>
<goal>shade</goal>
</goals>
</execution>
</executions>
</plugin>
<plugin>
<artifactId>maven-jar-plugin</artifactId>
<version>3.0.2</version>
<configuration>
<archive>
<manifest>
<addClasspath>true</addClasspath>
<mainClass>pt.ua.deti.ir.Main</mainClass>
</manifest>
</archive>
</configuration>
</plugin>
<plugin>
<artifactId>maven-dependency-plugin</artifactId>
<version>2.10</version>
</plugin>
<plugin>
<artifactId>maven-surefire-plugin</artifactId>
<version>2.19.1</version>
<configuration>
<systemPropertyVariables>
<java.util.logging.SimpleFormatter.format>%1$tY-%1$tm-%1$td %1$tH:%1$tM:%1$tS %4$-6s %2$s %5$s%6$s%n</java.util.logging.SimpleFormatter.format>
</systemPropertyVariables>
</configuration>
</plugin>
</plugins>
</build>
<properties>
<maven.compiler.source>1.8</maven.compiler.source>
<maven.compiler.target>1.8</maven.compiler.target>
</properties>
</project>

Binary file added doc/pipeline.png
46 changes: 46 additions & 0 deletions nbactions.xml
@@ -0,0 +1,46 @@
<?xml version="1.0" encoding="UTF-8"?>
<actions>
<action>
<actionName>run</actionName>
<packagings>
<packaging>jar</packaging>
</packagings>
<goals>
<goal>process-classes</goal>
<goal>org.codehaus.mojo:exec-maven-plugin:1.2.1:exec</goal>
</goals>
<properties>
<exec.args>-Xmx2g -Djava.util.logging.SimpleFormatter.format="%1$tY-%1$tm-%1$td %1$tH:%1$tM:%1$tS %4$-6s %2$s %5$s%6$s%n" -classpath %classpath pt.ua.deti.ir.Main help</exec.args>
<exec.executable>java</exec.executable>
</properties>
</action>
<action>
<actionName>debug</actionName>
<packagings>
<packaging>jar</packaging>
</packagings>
<goals>
<goal>process-classes</goal>
<goal>org.codehaus.mojo:exec-maven-plugin:1.2.1:exec</goal>
</goals>
<properties>
<exec.args>-Xdebug -Xrunjdwp:transport=dt_socket,server=n,address=${jpda.address} -Xmx2g -Djava.util.logging.SimpleFormatter.format="%1$tY-%1$tm-%1$td %1$tH:%1$tM:%1$tS %4$-6s %2$s %5$s%6$s%n" -classpath %classpath pt.ua.deti.ir.Main help</exec.args>
<exec.executable>java</exec.executable>
<jpda.listen>true</jpda.listen>
</properties>
</action>
<action>
<actionName>profile</actionName>
<packagings>
<packaging>jar</packaging>
</packagings>
<goals>
<goal>process-classes</goal>
<goal>org.codehaus.mojo:exec-maven-plugin:1.2.1:exec</goal>
</goals>
<properties>
<exec.args>-Xmx2g -Djava.util.logging.SimpleFormatter.format="%1$tY-%1$tm-%1$td %1$tH:%1$tM:%1$tS %4$-6s %2$s %5$s%6$s%n" -classpath %classpath pt.ua.deti.ir.Main help</exec.args>
<exec.executable>java</exec.executable>
</properties>
</action>
</actions>
105 changes: 105 additions & 0 deletions pom.xml
@@ -0,0 +1,105 @@
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>IR-2016_17</groupId>
<artifactId>IR-2016_17</artifactId>
<version>0.0.1-SNAPSHOT</version>
<dependencies>
<dependency>
<groupId>com.google.code.gson</groupId>
<artifactId>gson</artifactId>
<version>2.8.0</version>
</dependency>
<dependency>
<groupId>org.apache.lucene</groupId>
<artifactId>lucene-analyzers-common</artifactId>
<version>6.3.0</version>
</dependency>
<dependency>
<groupId>commons-cli</groupId>
<artifactId>commons-cli</artifactId>
<version>1.3.1</version>
<type>jar</type>
</dependency>
<dependency>
<groupId>org.apache.commons</groupId>
<artifactId>commons-lang3</artifactId>
<version>3.5</version>
<type>jar</type>
</dependency>
<dependency>
<groupId>org.apache.commons</groupId>
<artifactId>commons-csv</artifactId>
<version>1.4</version>
</dependency>
<dependency>
<groupId>org.mapdb</groupId>
<artifactId>mapdb</artifactId>
<version>3.0.2</version>
</dependency>
<dependency>
<groupId>de.ruedigermoeller</groupId>
<artifactId>fst</artifactId>
<version>2.48</version>
</dependency>
<dependency>
<groupId>de.vandermeer</groupId>
<artifactId>asciitable</artifactId>
<version>0.2.5</version>
</dependency>
</dependencies>
<properties>
<maven.compiler.source>1.8</maven.compiler.source>
<maven.compiler.target>1.8</maven.compiler.target>
</properties>
<build>
<plugins>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-shade-plugin</artifactId>
<version>2.4.3</version>
<executions>
<execution>
<phase>package</phase>
<goals>
<goal>shade</goal>
</goals>
</execution>
</executions>
</plugin>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-jar-plugin</artifactId>
<version>3.0.2</version>
<configuration>
<archive>
<manifest>
<addClasspath>true</addClasspath>
<!-- Jar file entry point -->
<mainClass>pt.ua.deti.ir.Main</mainClass>
</manifest>
</archive>
</configuration>
</plugin>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-dependency-plugin</artifactId>
<version>2.10</version>
</plugin>
<!-- Surefire -->
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-surefire-plugin</artifactId>
<version>2.19.1</version>
<configuration>
<systemPropertyVariables>
<!-- Set JUL Formatting -->
<java.util.logging.SimpleFormatter.format>%1$tY-%1$tm-%1$td %1$tH:%1$tM:%1$tS %4$-6s %2$s %5$s%6$s%n</java.util.logging.SimpleFormatter.format>
</systemPropertyVariables>
</configuration>
</plugin>
</plugins>
</build>
</project>
26 changes: 26 additions & 0 deletions src/main/java/pt/ua/deti/ir/Constants.java
@@ -0,0 +1,26 @@
package pt.ua.deti.ir;

/**
* Configuration constants
* @author Guilherme Cardoso
* @author Rui Pedro
*/
public class Constants {

public static final int MINIMUM_WORD_LENGTH = 2;

// group 1: Id
// group 2: CreationDate
// group 3: Score
// group 4: FilePath
// group 5: Body
public static final String CORPUS_REGEX_DOCUMENT = "Id:([\\d]*?),CreationDate:(.*?),Score:([\\d]*?),FilePath:(.*?),Body:(.*)";

public static final String ASCII_WORD_REGEX_MATCH = "([a-zA-Z0-9]+)";

public static final String CORPUS_FILE_EXTENSION = ".csv";

public static final int CORPUS_COUNT_HINT = 3165237; // not zero

public static final String STATS_FILE = "processing.stats";
}
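
As a rough illustration of how these constants might be used, the sketch below matches one serialized document line against `CORPUS_REGEX_DOCUMENT` and splits its body with `ASCII_WORD_REGEX_MATCH`. The class name, the method names and the sample line are hypothetical. Dropping tokens shorter than `MINIMUM_WORD_LENGTH` and lowercasing them are assumptions about how the tokenizer uses these values, not confirmed behavior.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical usage sketch; not part of the repository.
class ConstantsUsageSketch {

    // Values copied from pt.ua.deti.ir.Constants so the sketch compiles on its own.
    static final int MINIMUM_WORD_LENGTH = 2;
    static final String CORPUS_REGEX_DOCUMENT =
            "Id:([\\d]*?),CreationDate:(.*?),Score:([\\d]*?),FilePath:(.*?),Body:(.*)";
    static final String ASCII_WORD_REGEX_MATCH = "([a-zA-Z0-9]+)";

    private static final Pattern DOCUMENT = Pattern.compile(CORPUS_REGEX_DOCUMENT);
    private static final Pattern WORD = Pattern.compile(ASCII_WORD_REGEX_MATCH);

    // Returns {Id, CreationDate, Score, FilePath, Body}, or null if the line does not match.
    static String[] parseDocument(String line) {
        Matcher m = DOCUMENT.matcher(line);
        if (!m.matches()) {
            return null;
        }
        String[] fields = new String[5];
        for (int i = 0; i < fields.length; i++) {
            fields[i] = m.group(i + 1); // regex groups are 1-based
        }
        return fields;
    }

    // Extracts ASCII alphanumeric words; the length filter and lowercasing are assumptions.
    static List<String> tokenize(String body) {
        List<String> tokens = new ArrayList<>();
        Matcher m = WORD.matcher(body);
        while (m.find()) {
            String word = m.group(1);
            if (word.length() >= MINIMUM_WORD_LENGTH) {
                tokens.add(word.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Made-up line in the format the regex expects.
        String line = "Id:896,CreationDate:2017-02-09,Score:3,"
                + "FilePath:./stacksample/Questions.csv,Body:How do I set a buffer color?";
        String[] fields = parseDocument(line);
        if (fields != null) {
            System.out.println("Id=" + fields[0] + " Path=" + fields[3]);
            System.out.println(tokenize(fields[4]));
        }
    }
}
```

Note that `.` does not match line terminators by default, so under `matches()` a body that spans several lines would only be accepted if the pattern were compiled with `Pattern.DOTALL`.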