Sld-R-Data

Dan Villarreal (University of Pittsburgh)

This repository contains data for the paper "Gender separation and the Speech Community: Rhoticity in early 20th century Southland New Zealand English," published in 2021 in the journal Language Variation and Change. The data consists of 30,777 non-prevocalic /r/ tokens from Southland New Zealand English, with one row per token. Almost all tokens are coded for /r/, most of them by a sociolinguistic auto-coding algorithm. Because of various exclusions (e.g., only content words were analyzed), only 10,337 tokens were analyzed in at least one model. To ensure anonymity, the values in both the Speaker and Word columns have been replaced with anonymous codes.

The data is provided in two formats: Southland-R-Data_8Dec2020.Rds (for use in the R statistical computing environment) and Southland-R-Data_8Dec2020.csv (with blank cells where R would have NA, its missing-value marker).
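
For readers working outside R, a minimal sketch of reading the .csv while turning blank cells into missing values might look like the following in Python. The three-row sample is invented for illustration (only the column names come from this README), the helper name read_rows is hypothetical, and the file name in the final comment is an assumption you would replace with your local path.

```python
import csv
import io


def read_rows(handle):
    """Read CSV rows as dicts, converting blank cells to None (R's NA)."""
    reader = csv.DictReader(handle)
    return [
        {key: (value if value != "" else None) for key, value in row.items()}
        for row in reader
    ]


# Hypothetical sample in the same shape as the data; all values are invented.
SAMPLE = """Speaker,Gender,Rpresent,ProbPresent
SP001,F,Present,0.91
SP002,M,,
SP003,F,Absent,0.12
"""

rows = read_rows(io.StringIO(SAMPLE))

# Keep only the tokens that actually carry a rhoticity code.
coded = [row for row in rows if row["Rpresent"] is not None]
print(len(rows), len(coded))

# With the real file (file name assumed):
# with open("Southland-R-Data_8Dec2020.csv", newline="", encoding="utf-8") as f:
#     rows = read_rows(f)
```

All values come back as strings or None, so numeric columns such as ProbPresent still need an explicit conversion (for example with float) before analysis.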

If you have any questions, please do not hesitate to email me (d.vill atsign pitt.edu) or create a GitHub issue.

If you use this data in any published work, please cite it. Citing open data is a small thing you can do to ensure that researchers have the incentive to keep making data open.

Columns

MatchId: Internal LaBB-CAT code for each token
In_Mod_AllF, In_Mod_AllM, In_Mod_NurseF, In_Mod_NurseM: (Boolean) Was this token included in the respective model?
In_AnyMod: (Boolean) Was this token included in any of the models?
TokenNum: Token counter downloaded from LaBB-CAT
Speaker: Anonymized speaker code
Gender, BirthYear: Speaker attributes
Generation: Binned generation groups for the purpose of analysis
GrewUpRegion: Binned subregions within and/or beyond Southland
UrbanRural: Invercargill vs. rural Southland
VersionDate: Inherited from LaBB-CAT
Lemma: Lemma
CelexFreqLemma: Lemma frequency in CELEX
CorpusFreqLemma: Lemma frequency in the Southland corpus
PerMilSldLemma: Lemma frequency in the Southland corpus, normalized per million
PerMilBaselineLemma: Lemma frequency in a corpus of General New Zealand English, normalized per million
SldnessLemma: PerMilSldLemma/PerMilBaselineLemma
Word: Anonymized word code
WordStart, WordEnd: Word boundary timepoints within transcript
ContFuncWord: Word category (content or function)
CelexFreqWord: Word frequency in CELEX
CorpusFreqWord: Word frequency in the Southland corpus
PerMilSldWord: Word frequency in the Southland corpus, normalized per million
PerMilBaselineWord: Word frequency in a corpus of General New Zealand English, normalized per million
SldnessWord: PerMilSldWord/PerMilBaselineWord
Syllable: Syllable in DISC notation
SyllStart, SyllEnd: Syllable boundary timepoints within transcript
Stress: Syllable stress: ' for primary, " for secondary, 0 for unstressed
TokenStart, TokenEnd: Token boundary timepoints within transcript (where token = vowel + possible /r/)
Vowel: Preceding vowel in Wells lexical set notation
VowelCat: Vowel, with an Other category for contexts that in nonrhotic accents are centering diphthongs or triphthongs (CURE, MOUTH-R, NEAR, PRICE-R, SQUARE)
FollSegRawNoPause: The next segment after the token, ignoring pauses, in Wells notation for vowels and two-letter ARPABET notation for consonants
FollSegRaw: The next segment after the token, unless a pause of at least 100 ms came first
FollSeg: FollSegRaw, binned for the purposes of analysis
SyllFinal: (Boolean) Does TokenEnd equal SyllEnd (relevant for determining rhoticity status)?
WordFinal: (Boolean) Does TokenEnd equal WordEnd (relevant for determining rhoticity status)?
FollPause: (Boolean) Is the token followed by a pause of at least 100 ms?
FollPauseDur: Duration of following pause
PrevRInWord: (Boolean) Does the token follow another /r/ token in the same word?
HowCoded: Whether the Rpresent code came from a human hand-coder or auto-coding algorithm
Rpresent: Rhoticity code: Present (aka r-ful, rhotic) vs. Absent (aka r-less, nonrhotic)
ProbPresent: Classifier probability: Probability that each token was Present, as estimated by the auto-coding algorithm
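
Several of the columns above are defined in terms of others, so their definitions can be spot-checked. The Python sketch below recomputes In_AnyMod, SldnessLemma, and SyllFinal for a single row. The example values are invented, the helper name derived_values is hypothetical, and two things are assumptions rather than facts stated here: that In_AnyMod is the logical OR of the four In_Mod_ columns, and that Booleans and numbers have already been parsed into Python types.

```python
import math

MODEL_COLUMNS = ["In_Mod_AllF", "In_Mod_AllM", "In_Mod_NurseF", "In_Mod_NurseM"]


def derived_values(row):
    """Recompute three derived columns from their stated definitions."""
    # Assumption: In_AnyMod is True when any of the four model flags is True.
    in_any_mod = any(row[col] for col in MODEL_COLUMNS)

    # SldnessLemma = PerMilSldLemma / PerMilBaselineLemma.
    # Return None when either frequency is missing or the baseline is zero.
    sld = row["PerMilSldLemma"]
    baseline = row["PerMilBaselineLemma"]
    sldness = sld / baseline if sld is not None and baseline else None

    # SyllFinal: does TokenEnd equal SyllEnd? isclose absorbs float rounding.
    syll_final = math.isclose(row["TokenEnd"], row["SyllEnd"])

    return {
        "In_AnyMod": in_any_mod,
        "SldnessLemma": sldness,
        "SyllFinal": syll_final,
    }


# Hypothetical row; every value here is invented for illustration.
example_row = {
    "In_Mod_AllF": True,
    "In_Mod_AllM": False,
    "In_Mod_NurseF": False,
    "In_Mod_NurseM": False,
    "PerMilSldLemma": 150.0,
    "PerMilBaselineLemma": 100.0,
    "TokenEnd": 12.34,
    "SyllEnd": 12.34,
}

result = derived_values(example_row)
print(result)
```

Comparing these recomputed values against the stored columns is a quick way to confirm that the file was read with the right types.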

