
Releases: gederajeg/collogetr

Bug fixes to comply with tidyr's `nest` and `unnest` new behaviour

20 Mar 03:26

This is a backward-compatible bug-fix release that follows the new behaviour of tidyr's nest() and unnest() functions, which require the data argument to be specified.

Bug fix and update

17 Mar 01:44

Bug fixes

  • The bug was an error when pulling out the column nn produced by tally() under the previous version of dplyr (i.e. v0.7.8). It was identified in the AppVeyor and Travis builds (cf. here and here respectively), where column nn was not found in .data. One line of code in colloc_leipzig() used column nn. That has now been changed to n, and the builds for this release succeed with the updated dplyr version (v0.8.0.1) (cf. here and here for AppVeyor and Travis builds respectively).

Development

  • Add a new function, collex_llr(), to compute the association measure using the log-likelihood ratio.
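
For readers unfamiliar with the measure, the log-likelihood ratio (G²) for a node-collocate pair can be sketched from a 2x2 contingency table. The snippet below is an illustration in Python, not the code of collex_llr(); the function name log_likelihood_ratio and the cell names a, b, c, d are assumptions made for this example.

```python
import math


def log_likelihood_ratio(a, b, c, d):
    """G2 for a 2x2 table (hypothetical helper, not part of collogetr).

    a: collocate co-occurs with the node word
    b: node word occurs without the collocate
    c: collocate occurs without the node word
    d: neither the node word nor the collocate
    """
    n = a + b + c + d
    observed = [a, b, c, d]
    # Expected frequencies under independence: row total * column total / n
    expected = [
        (a + b) * (a + c) / n,
        (a + b) * (b + d) / n,
        (c + d) * (a + c) / n,
        (c + d) * (b + d) / n,
    ]
    # Cells with an observed frequency of 0 contribute nothing to the sum
    return 2 * sum(o * math.log(o / e) for o, e in zip(observed, expected) if o > 0)


g2_independent = log_likelihood_ratio(10, 10, 10, 10)
g2_attracted = log_likelihood_ratio(20, 5, 5, 20)
```

A table whose cells match the expected frequencies gives a score of 0; the further the observed co-occurrence departs from independence, the larger the score.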

Bug fixes and updates

20 Aug 08:15

Bug fixes

  • Fix a bug in the search procedure. In this version, the corpus is first tokenised and the node word is then searched for by its exact word form.
  • Fix a bug in the output column names and the number of output columns when the save_interim argument is TRUE.

Development

  • Increase the test coverage of the code
  • Add lifecycle and repo status badges, including the AppVeyor build badge

Next release

  • Add the log-likelihood as an alternative association measure
  • Add the Multiple Distinctive Collexeme Analysis (MDCA) as an association measure for contrasting more than two near-synonymous node words. MDCA uses a one-tailed exact binomial test to determine the distinctive collocates of a node word in comparison to its near-synonyms.
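
The one-tailed exact binomial test mentioned for MDCA can be sketched as follows. This is a Python illustration under stated assumptions, not the planned implementation: it assumes that, for one collocate, the number of trials is the collocate's total frequency across all node words and the success probability is the node word's share of all node-word tokens. The names binom_upper_tail and mdca_p_values and the frequencies are hypothetical.

```python
from math import comb


def binom_upper_tail(k, n, p):
    """Exact one-tailed probability P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def mdca_p_values(collocate_counts, node_totals):
    """Test, per node word, whether a collocate is attracted to it.

    collocate_counts: frequency of ONE collocate with each node word
    node_totals: overall frequency of each node word (assumed inputs)
    """
    n = sum(collocate_counts.values())
    grand_total = sum(node_totals.values())
    return {
        node: binom_upper_tail(k, n, node_totals[node] / grand_total)
        for node, k in collocate_counts.items()
    }


# Made-up frequencies for three node words, for illustration only
p_values = mdca_p_values(
    {"membeli": 8, "menjual": 1, "membayar": 1},
    {"membeli": 100, "menjual": 100, "membayar": 100},
)
```

A small p-value for a node word suggests the collocate occurs with it more often than its share of the data would predict. Testing repulsion would use the lower tail instead.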

Minor update on LICENSE and Website

30 Jul 01:21

This is a minor update that changes the license from GPL-2 to MIT. The update also sets up a GitHub webpage for the package. There are no additional functions, but the test coverage for the existing functions has increased.

collogetr 1.0.0

24 Jul 06:48

Breaking changes

Existing functions

  • colloc_leipzig()
    • A feature to search for the collocates of multiple node words in one go. These words have to be combined in the form of a character vector (e.g., c("membeli", "menjual")).
    • Additional output of (i) the sentence match in which the collocates and the node word(s) are found, and (ii) window-span information of the collocates in relation to the node word(s) (e.g., r1 for collocates occurring one word to the right of the node).
  • assoc_prepare()
    • Allows processing the input frequency data per corpus or combined across all corpus files.
    • Allows selecting a given collocate span to focus on for the association measure.
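
The idea of window-span collocates for several node words can be sketched as below. This is a Python illustration, not the code of colloc_leipzig(); the function name window_collocates, the simple regex tokeniser, and the example sentences are assumptions. Each sentence is tokenised once, node words are matched by exact word form, and every collocate is labelled with its span position (l1, r1, and so on).

```python
import re


def window_collocates(sentences, nodes, span=1):
    """Return (node, collocate, span_label, sentence_id) tuples.

    Hypothetical helper: tokenises each sentence once and looks up all
    node words in the same pass, matching exact word forms only.
    """
    nodes = set(nodes)
    rows = []
    for sent_id, sentence in enumerate(sentences):
        tokens = re.findall(r"\w+", sentence.lower())
        for i, token in enumerate(tokens):
            if token not in nodes:
                continue
            for offset in range(-span, span + 1):
                j = i + offset
                # Skip the node itself and positions outside the sentence
                if offset == 0 or j < 0 or j >= len(tokens):
                    continue
                label = f"l{-offset}" if offset < 0 else f"r{offset}"
                rows.append((token, tokens[j], label, sent_id))
    return rows


rows = window_collocates(
    ["Dia membeli buku baru", "Mereka menjual rumah"],
    nodes=["membeli", "menjual"],
    span=1,
)
```

Filtering the resulting rows by label (for instance keeping only r1) corresponds to focusing on one collocate span before computing an association measure.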

New functions

  • assoc_prepare_dca()
    • The function to generate the required input data for performing Distinctive Collexeme/Collocates Analysis (DCA). It takes the output of assoc_prepare(), which in turn is fed with the output of colloc_leipzig().
  • collex_fye_dca()
    • The function to perform DCA using the one-tailed Fisher-Yates Exact (FYE) test. It requires the output of assoc_prepare_dca().
  • dca_top_collex()
    • The function to extract the top-n distinctive collocates/collexemes for a given word/construction.
  • collex_chisq()
    • The function to compute the association measure using the Chi-square statistic.
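
The two tests named above can be sketched on a single 2x2 table. This is a Python illustration of the general techniques, not the code of collex_fye_dca() or collex_chisq(); the function names, the cell layout (a, b in the first row, c, d in the second), and the absence of a continuity correction are assumptions.

```python
from math import comb


def fisher_one_tailed(a, b, c, d):
    """One-tailed Fisher exact p-value: P(X >= a) with fixed margins."""
    row1, col1, n = a + b, a + c, a + b + c + d
    # Sum hypergeometric probabilities for tables at least as extreme as a
    favourable = sum(
        comb(row1, x) * comb(n - row1, col1 - x)
        for x in range(a, min(row1, col1) + 1)
    )
    return favourable / comb(n, col1)


def chi_square(a, b, c, d):
    """Pearson chi-square statistic for a 2x2 table, without correction."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))


p_attraction = fisher_one_tailed(3, 1, 1, 3)
chisq = chi_square(3, 1, 1, 3)
```

In a DCA setting, the rows would typically stand for the two contrasted words or constructions and the columns for the collocate versus all other collocates. The chi-square helper assumes no margin is zero.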

Future developments

  • The next iteration of the package will include:
    • Other kinds of association measures commonly used in collocational studies, such as Mutual Information and Log-likelihood, and the inclusion of the odds ratio from the FYE test.
    • Another function to retrieve collocates from different corpus types (e.g. from a corpus that is not parsed/split according to sentences as in the Indonesian Leipzig Corpora).

First release

04 Jul 05:09
Pre-release

The package contains one function called colloc_leipzig() to retrieve window-span collocates from the Indonesian Leipzig Corpora. The function can currently search for only one word at a time. It is therefore slow, because the function tokenises the corpus as part of each search. So, if we want to search for words X and Y in corpus C, two search calls are required, and corpus C needs to be tokenised in each of these calls.

The package also contains a function to prepare an input table (assoc_prepare()) for computing the association measure for collocational analysis using the Fisher-Yates Exact Test (collex_fye()).

The next release will fix the colloc_leipzig() function to support multiple-pattern search and a more efficient procedure.