Dan Villarreal (University of Pittsburgh)
This repository contains data for the paper "Gender separation and the Speech Community: Rhoticity in early 20th century Southland New Zealand English," published in 2021 in the journal Language Variation and Change. The data consists of 30,777 non-prevocalic /r/ tokens from Southland New Zealand English, with one row per token. Almost all tokens are coded for /r/, most by a sociolinguistic auto-coding algorithm. Because of various exclusions (e.g., only content words were analyzed), only 10,337 tokens were analyzed in at least one model. To ensure anonymity, both the Speaker and Word columns have been replaced with anonymous codes.
The data is in two formats: .Rds (for use in the R statistical computing environment) and .csv (with blanks for what are called NAs in R parlance).
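For readers working outside R, here is a minimal sketch of loading the .csv so that blank cells become missing values, mirroring R's NA. It uses only the Python standard library. The function name `read_tokens` and the file name in the comment are hypothetical, and the two-row sample is made up for illustration; it is not real data from this repository.

```python
import csv
import io


def read_tokens(handle):
    """Read token rows from an open CSV handle, mapping blank cells to None."""
    reader = csv.DictReader(handle)
    return [
        {column: (value if value != "" else None) for column, value in row.items()}
        for row in reader
    ]


# Made-up sample standing in for the real file (values are illustrative only).
sample = io.StringIO(
    "Speaker,Rpresent,ProbPresent\n"
    "S01,Present,0.91\n"
    "S02,,\n"
)
rows = read_tokens(sample)
print(rows[1])  # {'Speaker': 'S02', 'Rpresent': None, 'ProbPresent': None}

# With the real data, the call might look like this (file name is hypothetical):
# with open("data.csv", newline="", encoding="utf-8") as f:
#     rows = read_tokens(f)
```

All values are read as strings here; numeric columns such as ProbPresent would still need converting with `float()` before analysis.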
If you have any questions, please do not hesitate to email me (d.vill atsign pitt.edu) or create a GitHub issue.
If you use this data in any published work, please cite it. Citing open data is a small thing you can do to ensure that researchers have the incentive to keep making data open.
MatchId: Internal LaBB-CAT code for each token
In_Mod_AllF, In_Mod_AllM, In_Mod_NurseF, In_Mod_NurseM: (Boolean) Was this token included in the respective model?
In_AnyMod: (Boolean) Was this token included in any of the models?
TokenNum: Token counter downloaded from LaBB-CAT
Speaker: Anonymized speaker code
Gender, BirthYear: Speaker attributes
Generation: Binned generation groups for the purpose of analysis
GrewUpRegion: Binned subregions within and/or beyond Southland
UrbanRural: Invercargill vs. rural Southland
VersionDate: Inherited from LaBB-CAT
Lemma: Lemma
CelexFreqLemma: Lemma frequency in CELEX
CorpusFreqLemma: Lemma frequency in the Southland corpus
PerMilSldLemma: Lemma frequency in the Southland corpus, normalized per million
PerMilBaselineLemma: Lemma frequency in a corpus of General New Zealand English, normalized per million
SldnessLemma: PerMilSldLemma / PerMilBaselineLemma
Word: Anonymized word code
WordStart, WordEnd: Word boundary timepoints within transcript
ContFuncWord: Word category (content or function)
CelexFreqWord: Word frequency in CELEX
CorpusFreqWord: Word frequency in the Southland corpus
PerMilSldWord: Word frequency in the Southland corpus, normalized per million
PerMilBaselineWord: Word frequency in a corpus of General New Zealand English, normalized per million
SldnessWord: PerMilSldWord / PerMilBaselineWord
Syllable: Syllable in DISC notation
SyllStart, SyllEnd: Syllable boundary timepoints within transcript
Stress: Syllable stress: ' for primary, " for secondary, 0 for unstressed
TokenStart, TokenEnd: Token boundary timepoints within transcript (where token = vowel + possible /r/)
Vowel: Preceding vowel in Wells lexical set notation
VowelCat: Vowel, with an Other category for contexts that in nonrhotic accents are centering diphthongs or triphthongs (CURE, MOUTH-R, NEAR, PRICE-R, SQUARE)
FollSegRawNoPause: The next segment after the token, ignoring pauses, in Wells notation for vowels and two-letter ARPABET notation for consonants
FollSegRaw: The next segment after the token, unless a pause of at least 100 ms came first
FollSeg: FollSegRaw, binned for the purposes of analysis
SyllFinal: (Boolean) Does TokenEnd equal SyllEnd (relevant for determining rhoticity status)?
WordFinal: (Boolean) Does TokenEnd equal WordEnd (relevant for determining rhoticity status)?
FollPause: (Boolean) Is the token followed by a pause of at least 100 ms?
FollPauseDur: Duration of following pause
PrevRInWord: (Boolean) Does the token follow another /r/ token in the same word?
HowCoded: Whether the Rpresent code came from a human hand-coder or auto-coding algorithm
Rpresent: Rhoticity code: Present (aka r-ful, rhotic) vs. Absent (aka r-less, nonrhotic)
ProbPresent: Classifier probability: Probability that each token was Present, as estimated by the auto-coding algorithm
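Several columns above are defined in terms of others: SldnessLemma and SldnessWord are ratios of per-million frequencies, and SyllFinal and WordFinal compare TokenEnd with SyllEnd and WordEnd. The following sketch shows how those definitions could be recomputed as a sanity check. It is not the code used to build the dataset. The function names, the small float tolerance, and the choice to return None for a missing or zero baseline are all assumptions, and the example values are made up.

```python
import math


def sldness(per_mil_sld, per_mil_baseline):
    """Ratio of Southland to baseline per-million frequency.

    Returns None when either value is missing or the baseline is zero
    (an assumption; the dataset description does not say how this is handled).
    """
    if per_mil_sld is None or per_mil_baseline is None or per_mil_baseline == 0:
        return None
    return per_mil_sld / per_mil_baseline


def is_final(token_end, unit_end, tol=1e-9):
    """True when the token ends at the same timepoint as the syllable or word.

    The tolerance is an assumption to guard against float rounding.
    """
    return math.isclose(token_end, unit_end, abs_tol=tol)


# Made-up row for illustration only (not real data).
row = {
    "PerMilSldLemma": 120.0,
    "PerMilBaselineLemma": 60.0,
    "TokenEnd": 12.34,
    "SyllEnd": 12.34,
    "WordEnd": 12.50,
}

sldness_lemma = sldness(row["PerMilSldLemma"], row["PerMilBaselineLemma"])
syll_final = is_final(row["TokenEnd"], row["SyllEnd"])
word_final = is_final(row["TokenEnd"], row["WordEnd"])
print(sldness_lemma, syll_final, word_final)  # 2.0 True False
```

Comparing recomputed values against the stored columns is one way to confirm that a file was read correctly.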