jamesdunham/multiline

Introduction

multiline is an R package for reading data from multiline fixed-width-formatted (FWF) files. This format is like that of typical FWF files, except that data for a given observation wraps after some number of columns to span a fixed number of rows.

Digitized punch card data are often found in multiline FWF format. If the data for each observation exceeded the horizontal space on a card (conventionally 80 columns), additional decks of cards were used. When digitized, their rows were often interleaved so that the data for each observation would appear in consecutive rows, one for each card.

Installation

Install from GitHub with devtools:

if (!require(devtools, quietly = TRUE)) install.packages("devtools")
devtools::install_github("jamesdunham/multiline")

Background

Consider the following multiline FWF (MFWF) data. As with FWF data, parsing requires the column positions of each field (i.e., variable). In addition, we need the line position of each field.

123456789
789      
987654321
987      

Parsing requires:

  • The column positions of each field, as with FWF data;
  • The number of lines per observation; and
  • The line position of each field.

Suppose there are 2 lines per observation in the data; field1 occupies columns 1-4 of line 1; field2 columns 5-9 of line 1; and field3 columns 1-3 of line 2.

123456789  [line 1, obs. 1]
789        [line 2, obs. 1]
987654321  [line 1, obs. 2]
987        [line 2, obs. 2]

The purpose of multiline is to read these data into a tidy table:

obs  field1  field2  field3
  1    1234   56789     789
  2    9876   54321     987
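To make the parsing steps concrete, here is a minimal base-R sketch of the same idea. It is for intuition only and is not how multiline is implemented; the `spec` table and the names `raw`, `lines_per_obs`, and `parsed` are assumptions made for this illustration.

```r
# Base-R sketch of multiline FWF parsing (illustrative; not multiline's internals).
raw <- "123456789\n789\n987654321\n987"
lines_per_obs <- 2

# Hypothetical field specification: line within the observation,
# plus 1-based start and end columns.
spec <- data.frame(
  field = c("field1", "field2", "field3"),
  line  = c(1, 1, 2),
  start = c(1, 5, 1),
  end   = c(4, 9, 3),
  stringsAsFactors = FALSE
)

rows <- strsplit(raw, "\n", fixed = TRUE)[[1]]
# Every observation must contribute the same number of lines.
stopifnot(length(rows) %% lines_per_obs == 0)
n_obs <- length(rows) / lines_per_obs

# For each field, take its line from every observation and cut out its columns.
parsed <- lapply(seq_len(nrow(spec)), function(i) {
  idx <- seq(spec$line[i], by = lines_per_obs, length.out = n_obs)
  as.integer(substr(rows[idx], spec$start[i], spec$end[i]))
})
names(parsed) <- spec$field
parsed <- as.data.frame(parsed)
parsed
```

The sketch assumes every field is an integer; real files usually need per-field type handling.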

Usage

Specify the column and line positions of each field in a table or list of tables. multiline imports the fwf_ functions from readr to help with this task.

As a list:

library(multiline)

# One table of column positions per line of an observation
positions <- list(
  fwf_positions(start = c(1, 5), end = c(4, 9), col_names = c('field1', 'field2')),
  fwf_positions(start = 1, end = 3, col_names = 'field3'))
positions
#> [[1]]
#> # A tibble: 2 x 3
#>   begin   end col_names
#>   <dbl> <dbl>     <chr>
#> 1     0     4    field1
#> 2     4     9    field2
#> 
#> [[2]]
#> # A tibble: 1 x 3
#>   begin   end col_names
#>   <dbl> <dbl>     <chr>
#> 1     0     3    field3

The line position of each field is implicit in the list order. Here, field1 and field2 are in line 1 and field3 is in line 2.
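The mapping from list order to line number can be made explicit with a small lookup table. The sketch below uses plain data frames as stand-ins for the `fwf_positions()` tables so that it runs without any packages; `positions_sketch` and `field_lines` are names invented for this illustration.

```r
# Stand-ins for the position tables: one list element per line of an observation.
positions_sketch <- list(
  data.frame(col_names = c("field1", "field2"), stringsAsFactors = FALSE),
  data.frame(col_names = "field3", stringsAsFactors = FALSE)
)

# The list index is the line number; repeat it once per field on that line.
field_lines <- data.frame(
  field = unlist(lapply(positions_sketch, function(p) p$col_names)),
  line  = rep(seq_along(positions_sketch),
              times = vapply(positions_sketch, nrow, integer(1))),
  stringsAsFactors = FALSE
)
field_lines
```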

Given the data:

d <- "123456789\n789\n987654321\n9871"
d
#> [1] "123456789\n789\n987654321\n9871"

read_multiline() returns a tidy table with observations in rows and fields in columns. Note that read_multiline() requires that the number of items in the list of positions exactly match the number of lines per observation in the MFWF (here, lines = 2).

tidy <- read_multiline(d, lines = 2, positions)
tidy
#> # A tibble: 2 x 3
#>   field1 field2 field3
#>    <int>  <int>  <int>
#> 1   1234  56789    789
#> 2   9876  54321    987
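Before reading a large file, it can help to check the two structural assumptions described above. The helper below is hypothetical and not part of multiline, which may perform its own validation; it only checks that there is one positions table per line and that the input splits evenly into observations.

```r
# Hypothetical pre-flight check; not part of the multiline package.
check_mfwf <- function(text, lines, positions) {
  rows <- strsplit(text, "\n", fixed = TRUE)[[1]]
  c(
    # One element of `positions` per line of an observation
    positions_match = length(positions) == lines,
    # Total number of rows is a whole multiple of the lines per observation
    complete_observations = length(rows) %% lines == 0
  )
}

d <- "123456789\n789\n987654321\n9871"
# Placeholder list elements stand in for the two position tables.
checks <- check_mfwf(d, lines = 2, positions = list("line 1", "line 2"))
checks
```

A `FALSE` in either element suggests that the `lines` argument or the list of positions does not match the file layout.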
