---
cap-location: margin
css: [css/layout.css, css/edit-pagedown.css, css/fonts.css]
output:
  pagedown::html_paged:
    toc: true
    self_contained: false
    front_cover: images/cover-r.png
toc-title: Contents
paged-footnotes: true
lot: false
lof: false
toc: true
---
```{r, setup, echo=FALSE}
knitr::opts_chunk$set(
  echo = TRUE,
  comment = NA,
  message = FALSE,
  R.options = list(width = 88),
  fig.align = 'center'
)
source("utils/utilities.r")
```
# Preface {.front-matter .unnumbered}
This guide was commissioned and funded by the Family Planning Team at the Bill & Melinda Gates Foundation. The examples here are directly based on the companion [IPUMS PMA data analysis blog](https://tech.popdata.org/pma-data-hub/), with R examples developed by Matt Gunther and IPUMS PMA documentation by Devon Kristiansen under the direction of Kathryn Grace, PhD and Elizabeth Heger Boyle, PhD at IPUMS PMA, University of Minnesota. The Stata version and statistical consulting were provided by Mia Yu and Dale Rhoda at [Biostat Global Consulting](http://www.biostatglobal.com/). These authors are grateful for helpful reviews & comments from Philip Anglewicz, PhD; Linnea Zimmerman, PhD, and Aisha Siewe at Johns Hopkins University. Thanks also to Caitlin Clary, PhD, Mary Kay Trimner, Nina Brooks, PhD, and Finn Roberts for code contributions and review.
#### Suggested Citation {.unnumbered}
Matt Gunther, Mia Yu, Dale Rhoda, and Devon Kristiansen. *IPUMS PMA Longitudinal Analysis Guide for R Users* (November 2022). Minneapolis, MN: IPUMS. [pma.ipums.org](https://pma.ipums.org/pma/)
#### Source Code {.unnumbered}
The code provided in this manual is open source ([© MPL 2.0](https://www.mozilla.org/en-US/MPL/2.0/)). This manual was constructed from [R Markdown](https://rmarkdown.rstudio.com/) files with the `r funlink(pagedown)` package for R.^[`r funlink(pagedown)` © Xie, Yihui et al. (MIT)] These files are available on our [GitHub repository](https://github.com/IPUMS-Global-Health/IPUMS-PMA-Longitudinal-Guide), where you will also find `.r` and `.do` files containing the code shown in this manual.
The IPUMS PMA data files referenced in this manual are also available at no cost, but you must register and adhere to terms of use at [pma.ipums.org/register](https://pma.ipums.org/pma/register.shtml). Dataset access is granted only for non-commercial purposes. Users must register an account with IPUMS, request access to data from particular countries, and describe their intended use for the data. Users who have been approved for access to certain countries may submit justification to expand their access to other countries.
[La version française du formulaire d'inscription](https://pma.ipums.org/pma/formulaire_d_inscription.shtml)
#### Revision History {.unnumbered}
Revisions to this manual are listed by date and accompanied by comments [here](https://github.com/IPUMS-Global-Health/IPUMS-PMA-Longitudinal-Guide/commits/main). **Questions and suggested changes are welcome!** Please submit requests to our [Issues](https://github.com/IPUMS-Global-Health/IPUMS-PMA-Longitudinal-Guide/issues) forum on GitHub.
\newpage
#### Hyperlinks {.unnumbered}
Hyperlinks to IPUMS PMA variable documentation, relevant R and Stata documentation, and various other resources are highlighted [in pink](https://pma.ipums.org) throughout this manual. Readers who prefer a printed version should compile the manual from the source files on our GitHub repository, changing the `r funlink(pagedown)` option described [here](https://pagedown.rbind.io/#links). **Warning:** this will add footnotes to the document, and may impact pagination.
#### Acronyms {.unnumbered}
- BMGF - [Bill & Melinda Gates Foundation](https://www.gatesfoundation.org/)
- CI - confidence interval
- CMC - century month code
- CONSORT - [Consolidated Standards of Reporting Trials](www.consort-statement.org)
- CRAN - [The Comprehensive R Archive Network](https://cran.r-project.org/) (statistical software)
- CSV - comma-separated values file format
- DEFF - design effect
- DEFT - root design effect (square root of DEFF)
- DRC - Democratic Republic of Congo
- EA - enumeration area
- FP - family planning
- FP2020 - Family Planning 2020
- FP2030 - [Family Planning 2030](https://fp2030.org/)
- GPS - [global positioning system](https://www.gps.gov/)
- IPUMS - [Integrated Public Use Microdata Series](https://www.ipums.org/)
- ISO - International Organization for Standardization
- IUD - intrauterine device
- LAM - lactational amenorrhea method of contraception
- NA - not available (R notation for a missing data element)
- NIU - not in universe
- PMA - [Performance Monitoring for Action](https://www.pmadata.org/)
- PPS - probability proportional to size
- SAS - [statistical analysis system](https://www.sas.com/) (statistical software)
- SPSS - [statistical package for social sciences](https://www.ibm.com/spss) (statistical software)
# Introduction
```{r, echo=FALSE, results='hide'}
knitr::opts_chunk$set(echo = FALSE)
widef <- read_ipums_micro(
ddi = "data/pma_00008.xml",
data = "data/pma_00008.dat.gz"
)
```
[Performance Monitoring for Action (PMA)](https://www.pmadata.org/) uses innovative mobile technology to support low-cost, rapid-turnaround surveys that monitor key health and development indicators.
PMA surveys collect longitudinal data throughout a country at the household and health facility levels. Data are collected by female data collectors, known as resident enumerators, using mobile phones. The survey collects information from the same women and households over time for regular tracking of progress and for understanding the drivers of contraceptive use dynamics. The data are rapidly validated, aggregated, and prepared into tables and graphs, making results quickly available to stakeholders. PMA surveys can be integrated into national monitoring and evaluation systems using a low-cost, rapid-turnaround survey platform that can be adapted and used for various health data needs.
The PMA project is implemented by local partner universities and research organizations who train and deploy the cadres of female resident enumerators.
<aside>
PMA has also published a guide to **cross-sectional** analysis in both [English](https://www.pmadata.org/media/1243/download?attachment) and [French](https://www.pmadata.org/media/1244/download?attachment).
</aside>
The purpose of this manual is to provide guidance on the analysis of **harmonized longitudinal data** for a panel of women age 15-49 surveyed by PMA and published in partnership with [IPUMS PMA](https://pma.ipums.org/pma/). IPUMS provides census and survey products from around the world in an integrated format, making it easy to compare data from multiple countries. IPUMS PMA data are available free of charge, subject to terms and conditions: please [register here](https://pma.ipums.org/pma/register.shtml) to request access to the data featured in this guide.^[PMA data for individual countries is also available at no cost from [pmadata.org](https://www.pmadata.org/). Please note that the variable names, value labels, numeric codes, and other metadata featured in this guide have been altered by IPUMS PMA to facilitate comparison across countries.]
This manual provides reproducible coding examples in the statistical programming language [R](https://www.r-project.org/). Each chapter also appears as a post on the IPUMS PMA [data analysis blog](https://tech.popdata.org/pma-data-hub/index.html), where you'll find new content posted every two weeks.
**Stata users:** a companion manual for IPUMS PMA longitudinal analysis is also available with coding examples written in Stata.
## IPUMS PMA data in R
The first two chapters of this manual introduce new users to [PMA longitudinal data](https://www.pmadata.org/data/survey-methodology) and the [IPUMS PMA website](https://pma.ipums.org/pma/), respectively. After demonstrating how to obtain an IPUMS PMA data extract, the remaining chapters feature extensive data analysis examples written in R.
<aside class="hex">
```{r}
hex("Rlogo")
```
</aside>
To follow along, you'll need to download the appropriate version of R for your computer's operating system at [r-project.org](https://www.r-project.org/). **R is available at no cost** and it runs on Windows, MacOS, and a wide variety of UNIX platforms. We also recommend downloading a free copy of [RStudio](https://www.rstudio.com/), an integrated development environment (IDE) designed to make your experience with R much easier.
Individual chapters may introduce one or two **R packages** that provide helpful functions for longitudinal survey analysis. Two packages we feature in *every* chapter are `r funlink(ipumsr)` and `r funlink(tidyverse)`. You can install these and other packages featured in this guide like so:
<aside class="hex">
```{r}
hex("ipumsr")
```
</aside>
```{r, echo=TRUE, eval= FALSE}
install.packages("ipumsr")
install.packages("tidyverse")
```
The `r funlink(ipumsr)` package is designed to help R users import and explore data extracts downloaded from IPUMS. As we'll see, categorical variables from IPUMS require additional tools because they appear as **labelled integers** represented in R by a number and a label like this:
```{r, echo = FALSE}
widef %>% count(COUNTRY)
```
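As a rough illustration of what a labelled integer is, the sketch below builds a small toy vector with `haven::labelled()` (the package that `ipumsr` relies on for value labels) and converts it to a factor. The codes and labels shown here are made up for illustration and are not taken from an IPUMS PMA extract.

```r
library(dplyr)
library(haven)

# Toy labelled integer column; codes and labels are illustrative only
toy <- tibble(
  COUNTRY = labelled(
    c(1L, 1L, 7L),
    labels = c("Burkina Faso" = 1L, "Kenya" = 7L)
  )
)

# The underlying values are plain integers once the labels are removed
toy_codes <- zap_labels(toy$COUNTRY)

# as_factor() replaces each numeric code with its label
toy_fct <- toy %>% mutate(COUNTRY = as_factor(COUNTRY))
toy_fct %>% count(COUNTRY)
```

Converting to a factor is convenient for tables and plots, while keeping the labelled integer is useful when you need to filter on numeric codes.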
The `r funlink(tidyverse)` is actually a collection of packages developed in part by contributors at RStudio. These include:
<aside class="hex">
```{r}
hex("tidyverse")
```
</aside>
- `r funlink(ggplot2)` for data visualization
- `r funlink(dplyr)` for data manipulation
- `r funlink(tidyr)` for data tidying
- `r funlink(readr)` for data import
- `r funlink(purrr)` for functional programming
- `r funlink(tibble)` for tibbles, a modern re-imagining of dataframes
- `r funlink(stringr)` for strings
- `r funlink(forcats)` for factors
#### Featured Data Extracts {.unnumbered}
In subsequent chapters, we will include instructions for requesting data extracts from IPUMS PMA that are identical to those used in our analysis. These data are available at no cost, but you must register and adhere to terms of use at [pma.ipums.org/register](https://pma.ipums.org/pma/register.shtml).
Each data extract that you request from IPUMS PMA is named with a unique number. For example, your very first extract will include a pair of files named `pma_00001.dat.gz` and `pma_00001.xml`. In this guide we reference seven data extracts, but your own file names may vary depending on the number of IPUMS PMA extracts you have requested previously.
- `pma_00001.dat.gz` and `pma_00001.xml`
- `pma_00002.dat.gz` and `pma_00002.xml`
- `pma_00003.dat.gz` and `pma_00003.xml`
- `pma_00004.dat.gz` and `pma_00004.xml`
- `pma_00005.dat.gz` and `pma_00005.xml`
- `pma_00006.dat.gz` and `pma_00006.xml`
- `pma_00007.dat.gz` and `pma_00007.xml`
As you follow along with each chapter, save each data extract in a folder called "data" within your `r funlink(base::setwd, "R working directory")`.
#### Working Directory {.unnumbered}
R users can identify their current working directory with the function `r funlink(base::getwd)` and change it with `r funlink(base::setwd)`. Files within the working directory can be found by R using the **relative path** from this location. For example, we'll load our first data extract into R *assuming* that you have placed it in a folder called "data" within your `r funlink(base::setwd, "R working directory")`.
```{r, eval=FALSE, echo = TRUE}
dat <- read_ipums_micro(
  ddi = "data/pma_00001.xml",
  data = "data/pma_00001.dat.gz"
)
```
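If R reports that it cannot find these files, a quick check like the following sketch may help. It only uses base R, and it assumes the same "data" folder and extract number shown above; your own extract number may differ.

```r
# Print the current working directory
getwd()

# Build relative paths to the two extract files (file names are assumed)
ddi_path <- file.path("data", "pma_00001.xml")
data_path <- file.path("data", "pma_00001.dat.gz")

# Check whether each file exists relative to the working directory
found <- file.exists(c(ddi_path, data_path))

if (!all(found)) {
  message("Extract files not found - check your working directory and file names")
}
```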
RStudio users can find all of the code demonstrated in this guide in [this RStudio Project](https://github.com/IPUMS-Global-Health/IPUMS-PMA-Longitudinal-Guide).^[Learn more about RStudio Projects [here](https://support.posit.co/hc/en-us/articles/200526207-Using-RStudio-Projects).] Simply open the file `pma-longitudinal.Rproj` and navigate to the [RMarkdown](https://rmarkdown.rstudio.com/) file `r_users.Rmd` in RStudio - no need to set your own working directory!
\newpage
#### Learning More {.unnumbered}
This manual focuses exclusively on longitudinal family planning data from IPUMS PMA, but the companion [data analysis blog](https://tech.popdata.org/pma-data-hub/) covers a wide range of topics like:
- A free [online course](https://tech.popdata.org/pma-data-hub/introduction.html) for beginners
- New data announcements
- Data cleaning and reformatting
- Data analysis and visualization
- Spatial analysis
- Guides to PMA Service Delivery Point & Client Exit Interview data
Beyond the blog, it's important to know where to find **instructions and examples** for the R packages featured in this guide. Nearly all of these packages have a dedicated website with a homepage, reference page (documentation for individual functions), collection of articles (for general instructions), and change-log (for news about updates). The `r funlink(ipumsr)` page is a great place to start:
```{r, out.width='75%'}
knitr::include_graphics(here("images/ipumsr_home.png"))
```
Finally, if you're looking for a more general introduction to R, we strongly recommend the following **free resources**:
- [R for Data Science](https://r4ds.had.co.nz/index.html) for beginners
- [Advanced R](https://adv-r.hadley.nz/) for a deeper dive
- [RSpatial](https://rspatial.org/) for analysis with spatial data
- [ggplot2](https://ggplot2-book.org/) for data visualization
- [R Markdown: The Definitive Guide](https://bookdown.org/yihui/rmarkdown/) for producing annotated code, word documents, presentations, web pages, and more
- [R-bloggers](https://www.r-bloggers.com/) for regular news and tutorials
## PMA Background
Dating back to 2013, the original PMA survey design included high-frequency, **cross-sectional** samples of women and service delivery points collected from eleven countries participating in [Family Planning 2020](http://progress.familyplanning2020.org/) (FP2020) - a global partnership that supports the rights of women and girls to decide for themselves whether, when, and how many children they want to have. These surveys were designed to monitor annual progress towards [FP2020 goals](http://progress.familyplanning2020.org/measurement) via population-level estimates for several [core indicators](http://www.track20.org/pages/data_analysis/core_indicators/overview.php).
Beginning in 2019, PMA surveys were redesigned under a renewed partnership called [Family Planning 2030](https://fp2030.org/) (FP2030). These new surveys have been refocused on reproductive and sexual health indicators, and they feature a **longitudinal panel** of women of childbearing age. This design will allow researchers to measure contraceptive dynamics and changes in women’s fertility intentions over a **three-year period** via annual in-person interviews.^[In addition to these three in-person surveys, PMA also conducted telephone interviews with panel members focused on emerging issues related to the COVID-19 pandemic in 2020. These telephone surveys are already available for several countries - the IPUMS PMA blog series on [PMA COVID-19 surveys](https://tech.popdata.org/pma-data-hub/#category:COVID-19) covers this topic in detail.]
Questions on the redesigned survey cover topics like:
* awareness, perception, knowledge, and use of contraceptive methods
* perceived quality and side effects of contraceptive methods among current users
* birth history and fertility intentions
* aspects of health service provision
* domains of empowerment
## Sampling
PMA panel data includes a mixture of **nationally representative** and **sub-nationally representative** samples. The panel study consists of three data collection phases, each spaced one year apart.
As of this writing, IPUMS PMA has released data from the first *two* phases for four countries where Phase 1 data collection began in 2019; IPUMS PMA has released data from only the *first* phase for three countries where Phase 1 data collection began in August or September 2020. Phase 3 data collection and processing is currently underway.
```{r, results='hide', message=FALSE}
library(kableExtra)
options(knitr.kable.NA = '')
avail <- read_csv("utils/sample_avail.csv", show_col_types = FALSE)
names(avail)[2] <- paste0(
  names(avail)[2],
  footnote_marker_symbol(1)
)
```
```{r}
avail %>%
  arrange(Sample) %>%
  kable(escape = FALSE, format = "html", table.attr = "style='width:100%;'") %>%
  kable_styling() %>%
  add_header_above(c(" " = 2, "Now Available from IPUMS PMA" = 3)) %>%
  scroll_box(
    width = "100%",
    box_css = paste(
      sep = "; ",
      "margin-bottom: 1em",
      "margin-top: 0em",
      "border: 0px solid #ddd",
      "padding: 5px"
    )
  ) %>%
  footnote(
    symbol = "<em>Each data collection phase is spaced one year apart</em>",
    escape = FALSE
  )
```
<aside>
**Resident enumerators** are women over age 21 living in (or near) each EA who hold at least a high school diploma.
</aside>
PMA uses a multi-stage clustered sample design, with stratification at the urban-rural level or by sub-region. Sample clusters - called [enumeration areas](https://pma.ipums.org/pma-action/variables/EAID#description_section) (EAs) - are provided by the national statistics agency in each country.^[[Displaced GPS coordinates](https://tech.popdata.org/pma-data-hub/posts/2021-10-15-nutrition-climate/PMA_displacement.pdf) for the centroid of each EA are available for most samples [by request](https://www.pmadata.org/data/request-access-datasets) from PMA. IPUMS PMA provides shapefiles for PMA countries [here](https://pma.ipums.org/pma/gis_boundary_files.shtml).] These EAs are sampled using a *probability proportional to size* (PPS) method relative to the population distribution in each stratum.
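To give a sense of how PPS selection works in general, here is a minimal sketch of *systematic* PPS sampling within a single stratum. It is a generic textbook illustration, not PMA's actual selection procedure, and the sampling frame (EA identifiers and household counts) is entirely hypothetical.

```r
set.seed(2022)

# Hypothetical sampling frame for one stratum: 10 EAs with household counts
frame <- data.frame(
  ea = 1:10,
  households = c(120, 80, 200, 150, 60, 90, 300, 110, 70, 140)
)
n_select <- 3

# Sampling interval, and a random start within the first interval
interval <- sum(frame$households) / n_select
start <- runif(1, min = 0, max = interval)
points <- start + interval * (0:(n_select - 1))

# Each selection point falls within one EA's range of cumulative household counts,
# so larger EAs cover more of the range and are more likely to be selected
breaks <- c(0, cumsum(frame$households))
selected <- frame$ea[findInterval(points, breaks)]
selected
```

In this toy frame every EA is smaller than the sampling interval, so no EA can be selected twice.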
\newpage
At Phase 1, 35 household dwellings were selected at random within each EA. Resident enumerators visited each dwelling and invited one household member to complete a [Household Questionnaire](https://pma.ipums.org/pma/resources/questionnaires/hhf/PMA-Household-Questionnaire-English-2019.10.09.pdf)^[Questionnaires administered in each country may vary from this **Core Household Questionnaire** - [click here](https://pma.ipums.org/pma/enum_materials.shtml) for details.] that includes a census of all household members and visitors who stayed there during the night before the interview. Female household members and visitors aged 15-49 were then invited to complete a subsequent Phase 1 [Female Questionnaire](https://pma.ipums.org/pma/resources/questionnaires/hhf/PMA-Female-Questionnaire-English-2019.10.09.pdf).^[Questionnaires administered in each country may vary from this **Core Female Questionnaire** - [click here](https://pma.ipums.org/pma/enum_materials.shtml) for details.]
<aside>
`r r_link(SAMEDWELLING)` indicates whether a Phase 2 female respondent resided in her Phase 1 dwelling or a new one.
`r r_link(PANELWOMAN)` indicates whether a Phase 2 household member completed the Phase 1 Female Questionnaire.
</aside>
One year later, resident enumerators visited the same dwellings and administered a Phase 2 Household Questionnaire. A panel member in Phase 2 is any woman still age 15-49 who could be reached for a second Female Questionnaire, either because:
* she still lived there, or
* she had moved elsewhere within the study area,^[The "study area" is the area within which resident enumerators should attempt to find panel women who have moved out of their Phase 1 dwelling. This may extend beyond the woman's original EA as determined by in-country administrators - see [PMA Phase 2 and Phase 3 Survey Protocol](https://www.pmadata.org/data/survey-methodology) for details.] but at least one member of the Phase 1 household remained and could help resident enumerators locate her new dwelling.^[In cases where no Phase 1 household members remained in the dwelling at Phase 2, women from the household are considered **lost to follow-up**. Chapter 3 covers this topic in detail.]
Additionally, resident enumerators administered the Phase 2 Female Questionnaire to *new* women in sampled households who:
* reached age 15 after Phase 1
* joined the household after Phase 1
* declined the Female Questionnaire at Phase 1, but agreed to complete it at Phase 2
\newpage
When you select the new **Longitudinal** sample option from IPUMS PMA, you'll be able to include responses from every available phase of the study. These samples are available in either **Long** format (responses from each phase will be organized in separate rows) or **Wide** format (responses from each phase will be organized in columns).
```{r}
knitr::include_graphics("images/long_radio.png")
```
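The difference between the **Long** and **Wide** formats can be sketched with a toy dataset and `tidyr::pivot_longer()`. The identifier column `ID` and all values below are hypothetical; the numeric suffixes mimic the way phases are marked in a Wide extract.

```r
library(dplyr)
library(tidyr)

# Toy Wide data: one row per woman, one column per variable and phase
wide <- tibble(
  ID = c("a", "b"),
  AGE_1 = c(20, 31),
  AGE_2 = c(21, 32),
  RESIDENT_1 = c(11, 22),
  RESIDENT_2 = c(11, NA)
)

# Reshape to Long: one row per woman and phase
# The pattern splits each name at its final underscore-plus-digits suffix
long <- wide %>%
  pivot_longer(
    cols = -ID,
    names_to = c(".value", "PHASE"),
    names_pattern = "(.+)_(\\d+)$"
  )

long
```

Note that `PHASE` is returned as a character column in this sketch; convert it with `as.integer()` if you need a number.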
\newpage
<aside>
`r r_link(CROSS_SECTION)` indicates whether a household member in a longitudinal sample is also included in the cross-sectional sample for a given year (every person in a cross-sectional sample is included in the longitudinal sample).
</aside>
In addition to following up with women in the panel over time, PMA also adjusted sampling so that a cross-sectional sample could be produced concurrently with each data collection phase. These samples mainly overlap with the data you'll obtain for a particular phase in the longitudinal sample, except that replacement households were drawn from each EA where more than 10% of households from the previous phase were no longer there. Conversely, panel members who were located in a new dwelling at Phase 2 will not be represented in the cross-sectional sample drawn from that EA. These adjustments ensure that population-level indicators may be derived from cross-sectional samples in a given year, even if panel members move or are lost to follow-up.
You'll find PMA cross-sectional samples dating back to 2013 if you select the **Cross-sectional** sample option from IPUMS PMA.
```{r}
knitr::include_graphics("images/cross_radio.png")
```
## Inclusion Criteria for Analysis
```{r}
knitr::opts_chunk$set(echo = TRUE)
```
Several chapters in this manual feature code you can use to reproduce key indicators included in the **PMA Longitudinal Brief** for each sample. In many cases, you'll find separate reports available in English and French, and for both national and sub-national summaries. For reference, here are the highest-level population summaries available in English for each sample where Phase 2 IPUMS PMA data is currently available:
* [Burkina Faso](https://www.pmadata.org/sites/default/files/data_product_results/Burkina%20National_Phase%202_Panel_Results%20Brief_English_Final.pdf)
* [DRC - Kinshasa](https://www.pmadata.org/sites/default/files/data_product_results/DRC%20Kinshasa_%20Phase%202%20Panel%20Results%20Brief_English_Final.pdf)
* [DRC - Kongo Central](https://www.pmadata.org/sites/default/files/data_product_results/DRC%20Kongo%20Central_%20Phase%202%20Panel%20Results%20Brief_English_Final.pdf)
* [Kenya](https://www.pmadata.org/sites/default/files/data_product_results/Kenya%20National_Phase%202_Panel%20Results%20Brief_Final.pdf)
* [Nigeria - Kano](https://www.pmadata.org/sites/default/files/data_product_results/Nigeria%20KANO_Phase%202_Panel_Results%20Brief_Final.pdf)
* [Nigeria - Lagos](https://www.pmadata.org/sites/default/files/data_product_results/Nigeria%20LAGOS_Phase%202_Panel_Results%20Brief_Final.pdf)
Panel data in these reports is limited to the *de facto* population of women who completed the Female Questionnaire in both Phase 1 and Phase 2. This includes women who slept in the household during the night before the interview for the Household Questionnaire. The *de jure* population includes women who are usual household members, but who slept elsewhere that night. In order to reproduce the findings from PMA reports, we'll remove *de jure* cases recorded in the variable `r r_link(RESIDENT)`.
<aside>
We will demonstrate how to request and download an IPUMS PMA data extract in Chapter 2.
</aside>
For example, let's consider a **Wide** format data extract containing Phase 1 and Phase 2 respondents to the Female Questionnaire from Burkina Faso. We've downloaded such an extract and placed it in the "data" sub-folder of our R working directory. We'll load `r funlink(ipumsr)` and `r funlink(tidyverse)` together with our extract.
```{r, eval = FALSE}
library(ipumsr)
library(tidyverse)
dat <- read_ipums_micro(
  ddi = "data/pma_00001.xml",
  data = "data/pma_00001.dat.gz"
)
```
```{r, echo = FALSE}
# n = 5491 (checked with flowchart)
dat <- widef %>% filter(SAMPLE_1 == 85409)
```
In a **Wide** format data extract, a numeric suffix indicates the data collection phase associated with each variable. So, you'll find whether each woman slept in the household during the night before the Household Questionnaire for each phase reported in `r r_link(RESIDENT_1)` and `r r_link(RESIDENT_2)`.
\newpage
This extract includes 174 women who are not members of the *de facto* population because they did not sleep in the sampled household during the night before the Phase 1 interview:
```{r}
dat %>% count(RESIDENT_1)
```
The extract also includes 230 women who are not members of the *de facto* population because they did not sleep in the sampled household during the night before the Phase 2 interview:
```{r}
dat %>% count(RESIDENT_2)
```
Moreover, there are 492 `NA` values in `r r_link(RESIDENT_2)` representing women who were **lost to follow-up** after Phase 1. We will explain **loss to follow-up** in detail in Chapter 3.
The *de facto* population is represented by codes 11 and 22 in both of these variables. We'll use `r funlink(dplyr::filter)` to include only those cases.
```{r}
defacto <- dat %>% filter(RESIDENT_1 %in% c(11, 22) & RESIDENT_2 %in% c(11, 22))
defacto %>% count(RESIDENT_1, RESIDENT_2)
```
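It may help to see why `%in%` is a convenient choice here: it returns `FALSE` for `NA`, so women with a missing Phase 2 value are dropped along with every code other than 11 or 22. The toy data below are hypothetical and only illustrate that behavior.

```r
library(dplyr)

# Toy data: five women with made-up RESIDENT codes for each phase
toy <- tibble(
  RESIDENT_1 = c(11, 22, 11, 21, 11),
  RESIDENT_2 = c(11, 22, NA, 11, 21)
)

# %in% never returns NA, so missing values are treated as non-matches
na_matches <- NA %in% c(11, 22)

# Keep only rows where both phases have code 11 or 22
defacto_toy <- toy %>%
  filter(RESIDENT_1 %in% c(11, 22), RESIDENT_2 %in% c(11, 22))

nrow(defacto_toy)
```

Only the first two rows survive: the third has a missing Phase 2 value, and the last two each contain a code outside of 11 and 22.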
\newpage
Additionally, PMA reports only include women who completed (or partially completed) both Female Questionnaires. This information is reported in `r r_link(RESULTFQ)`. In our **Wide** extract, this information appears in `r r_link(RESULTFQ_1)` and `r r_link(RESULTFQ_2)`. If you select the **Female Respondents** option at checkout, only women who completed (or partially completed) the Phase 1 Female Questionnaire will be included in your extract.
```{r, echo = FALSE}
knitr::include_graphics("images/cases1.png")
```
\newpage
We'll further restrict our sample by selecting only cases where `r r_link(RESULTFQ_2)` shows that the woman also completed the Phase 2 questionnaire. Notice that, in addition to each of the values 1 through 10, there are several **non-response codes** numbered 90 through 99. You'll see similar values repeated across all IPUMS PMA variables, except that they will be left-padded to match the maximum width of a particular variable (e.g. `9999` is used for `r r_link(INTFQYEAR)`, which represents a 4-digit year for the Female Interview).
```{r}
dat %>% count(RESULTFQ_2)
```
Possible **non-response codes** include:
* `95` Not interviewed (female questionnaire)
* `96` Not interviewed (household questionnaire)
* `97` Don't know
* `98` No response or missing
* `99` NIU (not in universe)
The value `NA` in an IPUMS PMA extract indicates that a particular variable is not provided for a selected sample. In a **Wide** extract, it may also signify that a particular person was not included in the data from a particular phase. Here, an `NA` appearing in `r r_link(RESULTFQ_2)` indicates that a Female Respondent from Phase 1 was not found in Phase 2.
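If you prefer to treat non-response codes as missing values in your own analysis, one possible approach is sketched below with toy data. The values are hypothetical, and the range `90:99` applies only to two-digit variables like this one; wider variables use left-padded codes such as `9999`.

```r
library(dplyr)

# Toy RESULTFQ_2 values, including non-response codes and one true NA
toy <- tibble(RESULTFQ_2 = c(1, 1, 5, 95, 96, 99, NA))

# Recode the non-response codes 90 through 99 to NA
toy <- toy %>%
  mutate(
    RESULTFQ_2_CLEAN = if_else(RESULTFQ_2 %in% 90:99, NA_real_, RESULTFQ_2)
  )

toy %>% count(RESULTFQ_2_CLEAN)
```

For labelled variables, `ipumsr` also provides `lbl_na_if()`, which performs a similar recode while keeping the value labels consistent.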
\newpage
You can drop incomplete Phase 2 female responses as follows:
```{r}
completed <- dat %>% filter(RESULTFQ_2 == 1)
completed %>% count(RESULTFQ_1, RESULTFQ_2)
```
Generally, we will combine both filtering steps in a single call to `r funlink(dplyr::filter)` like so:
```{r}
dat <- dat %>%
  filter(
    RESIDENT_1 %in% c(11, 22) & RESIDENT_2 %in% c(11, 22),
    RESULTFQ_2 == 1
  )
```
In subsequent analyses, we'll use the remaining cases to show how PMA generates key indicators for **contraceptive use status** and **family planning intentions and outcomes**. The summary report for each country includes measures disaggregated by demographic variables like:
* `r r_link(MARSTAT)` - marital status
* `r r_link(EDUCATT)` and `r r_link(EDUCATTGEN)` - highest attended level of education^[Levels in `r r_link(EDUCATT)` may vary by country; `r r_link(EDUCATTGEN)` recodes country-specific levels into four general categories.]
* `r r_link(AGE)` - age
* `r r_link(WEALTHQ)` and `r r_link(WEALTHT)` - household wealth quintile or tertile^[Households are divided into quintiles/tertiles relative to the distribution of an asset `r r_link(SCORE, description)` weighted for all sampled households. For sub-nationally-representative samples (DRC and Nigeria), separate wealth distributions are calculated for each sampled region.]
* `r r_link(URBAN)` and `r r_link(SUBNATIONAL)` - geographic location^[`r r_link(SUBNATIONAL)` includes sub-national regions for all sampled countries; country-specific variables are also available on the [household - geography](https://pma.ipums.org/pma-action/variables/group?id=hh_geo) page.]
## Survey Design Elements
Throughout this guide, we'll demonstrate how to incorporate PMA sampling weights and information about its stratified cluster sampling procedure into your analysis. This section describes how to use survey weights, cluster IDs, and sample strata in R.
Whether you intend to work with a new **Longitudinal** or **Cross-sectional** data extract, you'll find the same set of sampling weights available for all PMA Family Planning surveys dating back to 2013:
<aside>
A fourth Family Planning survey weight, `r r_link(POPWT, description)`, is currently available only for **Cross-sectional** data extracts.^[`r r_link(POPWT)` can be used to estimate population-level *counts* - [click here](https://pma.ipums.org/pma/population_weights.shtml) or view [this video](https://www.youtube.com/watch?v=GnCq26t4zgM) for details.]
</aside>
* `r r_link(HQWEIGHT, description)` can be used to generate cross-sectional population estimates from questions on the Household Questionnaire.^[`r r_link(HQWEIGHT)` reflects the [calculated selection probability](https://pma.ipums.org/pma/resources/documentation/weighting_memo.pdf) for a household in an EA, normalized at the population-level. Users intending to estimate population-level indicators for *households* should restrict their sample to one person per household via `r r_link(LINENO, description)` - see [household weighting guide](https://pma.ipums.org/pma/weightguide.shtml#hh) for details.]
* `r r_link(FQWEIGHT, description)` can be used to generate cross-sectional population estimates from questions on the Female Questionnaire.^[`r r_link(FQWEIGHT)` adjusts `r r_link(HQWEIGHT)` for female non-response within the EA, normalized at the population-level - see [female weighting guide](https://pma.ipums.org/pma/weightguide.shtml#female) for details.]
* `r r_link(EAWEIGHT, description)` can be used to compare the selection probability of a particular household with that of its EA.
Additionally, PMA created a new weight, `r r_link(PANELWEIGHT, description)`, which should be used in longitudinal analyses spanning multiple phases, as it adjusts for loss to follow-up. `r r_link(PANELWEIGHT)` is available only for **Longitudinal** data extracts.
PMA sample clusters are identified by the variable `r r_link(EAID)`, while sample strata are identified by `r r_link(STRATA)`. We'll demonstrate how to use each of these survey design elements in R below.
### Set survey design
<aside class="hex">
```{r, echo = FALSE}
hex("srvyr")
```
</aside>
Throughout this guide, we'll use tools from the `r funlink(srvyr)` package to incorporate survey design elements into our analyses.^[The `r funlink(srvyr)` package is a `r funlink(tidyverse)` implementation of the popular [survey](http://r-survey.r-forge.r-project.org/survey/) package for R, authored by Dr. Thomas Lumley. For thorough discussion of the types of weights available in both R and Stata, we recommend [this blog post](https://notstatschat.rbind.io/2020/08/04/weights-in-statistics/) by Dr. Lumley.] You can install or update `r funlink(srvyr)` from CRAN like so:
```{r, eval = FALSE, echo = TRUE}
install.packages("srvyr")
```
Load `r funlink(srvyr)` for use in an R session with:
```{r, eval=FALSE, echo=TRUE}
library(srvyr)
```
Let's return to the **Wide** data extract described in the previous section, which includes Phase 1 and Phase 2 **Female Respondents** from Burkina Faso. In the following example, we'll show how to use IPUMS PMA survey design elements to estimate the proportion of reproductive age women in Burkina Faso who were using contraception at the time of data collection for both Phase 1 and Phase 2. In a **Cross-sectional** or **Long** format longitudinal extract, you'd find this information in the variable `r r_link(CP)`. In the **Wide** extract featured here, you'll find it in `r r_link(CP_1)` for Phase 1, and in `r r_link(CP_2)` for Phase 2.
Here is how to count the *unweighted* number of sampled women using and not using contraception in both phases. (We drop 5 cases coded 99 for "NIU (not in universe)" in Phase 1.)
```{r}
dat <- dat %>% filter(CP_1 < 90 & CP_2 < 90)
dat %>% count(CP_1, CP_2)
```
\newpage
To estimate a population percentage, we'll need to tell `r funlink(srvyr)` that we are working with a sample survey dataset and specify the IPUMS PMA survey design elements. This is accomplished with `r funlink(srvyr::as_survey_design)`: we use `r r_link(PANELWEIGHT)` as the sampling `weight`. We also use `r r_link(EAID_1)` to `id` the sample clusters,^[As we'll see in Chapter 3, women are considered **lost to follow-up** if they moved outside the study area after Phase 1. Therefore, `r r_link(EAID_1)` and `r r_link(EAID_2)` are identical for all panel members: you can use either one to identify sample clusters.] and `r r_link(STRATA_1)` to represent sample `strata`.^[As with `r r_link(EAID)`, you may use either `r r_link(STRATA_1)` or `r r_link(STRATA_2)` if your analysis is restricted to panel members.]
Summary functions like `r funlink(srvyr::survey_mean)` use information from `r funlink(srvyr::as_survey_design)` to derive weighted population estimates with cluster-adjusted standard errors. The argument `vartype = "ci"` reports a cluster-robust 95% confidence interval,^[The confidence level in `r funlink(srvyr::survey_mean)` can be adjusted with `level` (e.g. `level = 0.99`).] while `proportion = TRUE` and `prop_method = "logit"` ensure that the confidence interval for an estimated proportion never extends below 0% or above 100%.^[See `r funlink(survey::svyciprop)` for a complete list of available adjustment methods.]
<aside>
`coef` shows the estimated population proportion
`_low` and `_upp` show the lower and upper bounds of a 95% confidence interval
</aside>
```{r}
dat %>%
  as_survey_design(
    weight = PANELWEIGHT,
    id = EAID_1,
    strata = STRATA_1
  ) %>%
  summarise(
    survey_mean(
      CP_1 * CP_2,
      vartype = "ci",
      proportion = TRUE,
      prop_method = "logit"
    )
  )
```
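If you'd like to experiment with these arguments before working with a real extract, the following sketch may help. It builds a small simulated dataset (the column names `strata`, `cluster`, `wt`, and `using` are hypothetical, and the values are random) and applies the same pattern. It spells out the full argument names `ids` and `weights`, and it names the result `est` so that the interval columns become `est_low` and `est_upp`.

```r
library(dplyr)
library(srvyr)

set.seed(1)

# Simulated data: 2 strata, 3 clusters per stratum, 10 women per cluster
toy <- tibble(
  strata  = rep(1:2, each = 30),
  cluster = rep(1:6, each = 10),
  wt      = runif(60, min = 0.5, max = 1.5), # made-up sampling weights
  using   = rbinom(60, size = 1, prob = 0.3) # 1 = using, 0 = not using
)

toy_est <- toy %>%
  as_survey_design(ids = cluster, strata = strata, weights = wt) %>%
  summarise(
    est = survey_mean(
      using,
      vartype = "ci",
      level = 0.95,          # confidence level (0.95 is the default)
      proportion = TRUE,     # treat the 0/1 variable as a proportion
      prop_method = "logit"  # interval computed on the logit scale
    )
  )

toy_est
```

Because the interval is computed on the logit scale, `est_low` and `est_upp` always stay between 0 and 1.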
Using the survey design information for this sample, we estimate that about 18.8% of all reproductive age women in Burkina Faso were using contraception at the time both Phase 1 and Phase 2 data were collected. The 95% confidence interval for this estimate ranges from 16.4% to 21.4%.
### Sample strata for DRC
Importantly, the variable `r r_link(STRATA)` is *not available* for samples collected from DRC - Kinshasa or DRC - Kongo Central. If your extract includes any DRC sample, you'll need to amend this variable to include one unique numeric code for each of those regions.
For example, let's look at a different **Wide** extract, `dat2`, containing all of the samples included in this data release.
```{r, echo=FALSE}
options(tibble.print_min = 30)
```
```{r, echo = TRUE, results='hide'}
dat2 <- read_ipums_micro(
  ddi = "data/pma_00002.xml",
  data = "data/pma_00002.dat.gz"
)
dat2 <- dat2 %>%
  filter(
    RESIDENT_1 %in% c(11, 22) & RESIDENT_2 %in% c(11, 22),
    RESULTFQ_2 == 1,
    CP_1 < 90 & CP_2 < 90
  )
```
Notice that `r r_link(STRATA_1)` lists the sample strata for every `r r_link(COUNTRY)` *except* for DRC, where you see the value `NA`.
```{r}
dat2 %>% filter(is.na(STRATA_1)) %>% count(COUNTRY, STRATA_1)
```
Now let's see what happens when we try to produce population-level estimates with `r r_link(STRATA_1)`:
```{r, error=TRUE}
dat2 %>%
  as_survey_design(weight = PANELWEIGHT, id = EAID_1, strata = STRATA_1) %>%
  group_by(COUNTRY, GEOCD, GEONG) %>%
  summarise(
    survey_mean(
      CP_1 * CP_2,
      vartype = "ci",
      proportion = TRUE,
      prop_method = "logit"
    )
  )
```
This fails because `r funlink(srvyr::as_survey_design)` encounters `NA` values in `r r_link(STRATA_1)`. Fortunately, we can replace those values with numeric codes from the variable `r r_link(GEOCD)`:
```{r}
dat2 %>% count(GEOCD)
```
If `r r_link(GEOCD)` is not `NA`, we'll use its numeric code in place of `r r_link(STRATA_1)`. Otherwise, we'd like to leave `r r_link(STRATA_1)` unchanged. However, because both variables include *value labels*, we'll first need to remove them with `r funlink(ipumsr::zap_labels)`. To avoid confusion with the original variable `r r_link(STRATA_1)`, we'll call our new variable `STRATARC` (for "strata recoded").
* `STRATARC` - Numeric codes for PMA sample strata (recoded for DRC samples)
<aside>
Use `r funlink(ipumsr::zap_labels)` to remove all labels from an IPUMS variable.
</aside>
```{r}
dat2 <- dat2 %>%
  mutate(
    STRATARC = if_else(
      is.na(GEOCD),
      zap_labels(STRATA_1),
      zap_labels(GEOCD)
    )
  )
```
\newpage
Notice that `STRATARC` replaces the `NA` values in `r r_link(STRATA_1)`, leaving its numeric values unchanged.
```{r}
dat2 %>% count(GEOCD, STRATA_1, STRATARC)
```
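The recoding logic can also be tried on a few made-up rows. In this sketch the strata and region codes are invented for illustration, and the columns are plain numeric vectors, so no labels need to be removed. It also shows that `coalesce()` from `dplyr` gives the same result as the `if_else()` pattern when exactly one of the two variables is `NA` in each row, which is the situation described above.

```r
library(dplyr)

# Hypothetical codes: rows 1-2 have a stratum but no DRC region,
# rows 3-4 have a DRC region but no stratum
toy <- tibble(
  GEOCD    = c(NA, NA, 1, 2),
  STRATA_1 = c(85401, 85402, NA, NA)
)

toy <- toy %>%
  mutate(
    # Same pattern as in the text: use GEOCD where present, else STRATA_1
    STRATARC = if_else(is.na(GEOCD), STRATA_1, GEOCD),
    # Alternative: coalesce() returns the first non-missing value
    STRATARC_alt = coalesce(GEOCD, STRATA_1)
  )

toy
```

Either form works here. The `if_else()` version makes the rule explicit, which may be easier to audit if a region code and a stratum code could ever appear in the same row.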
\newpage
Finally, we can use the updated survey design information to estimate the proportion of women who were using contraception at both Phase 1 and Phase 2 in every sample (including those from Kinshasa and Kongo Central).
```{r, error=TRUE}
dat2 %>%
  as_survey_design(weight = PANELWEIGHT, id = EAID_1, strata = STRATARC) %>%
  group_by(COUNTRY, GEOCD, GEONG) %>%
  summarise(
    survey_mean(
      CP_1 * CP_2,
      vartype = "ci",
      proportion = TRUE,
      prop_method = "logit"
    )
  )
```
Now that we've identified variables that describe an IPUMS PMA analytic sample, let's proceed by downloading these and other variables of interest in a data extract from IPUMS PMA. In Chapter 2, we'll see that longitudinal data extracts can be requested in either **Long** or **Wide** format, depending on your needs.
# Longitudinal Data Extracts
```{r, echo=FALSE, results='hide'}
knitr::opts_chunk$set(
  echo = FALSE,
  eval = TRUE,
  out.width = "85%"
)
```
Chapter 2 provides a guided tour of the [IPUMS PMA data extract system](https://pma.ipums.org/pma/), which you may use to combine survey data collected from multiple countries and multiple phases of the longitudinal study.
IPUMS PMA also makes it easy to switch between multiple [units of analysis](https://pma.ipums.org/pma-action/variables/group) covered in PMA surveys. In addition to the longitudinal data featured in this guide, you'll find surveys representing:
<aside>
A video tour of the longitudinal extract system is available [here](https://www.youtube.com/embed/VwjYHDvpHk0) on the IPUMS PMA YouTube channel.
</aside>
- [Service Delivery Points (SDPs)](https://tech.popdata.org/pma-data-hub/#category:Service_Delivery_Points)
- [Client Exit Interviews conducted at SDPs](https://tech.popdata.org/pma-data-hub/#category:Client_Exit_Interviews)
- Participants in special surveys covering topics like [COVID-19](https://tech.popdata.org/pma-data-hub/#category:COVID-19), [nutrition](https://tech.popdata.org/pma-data-hub/#category:Nutrition), and maternal & newborn health
To get started with a longitudinal data extract, you'll need to select the **Family Planning** topic under the **Person** unit of analysis.
```{r, out.width="85%"}
knitr::include_graphics("images/unit.png")
```
## Sample Selection
Once you've selected the **Family Planning** option, you'll next need to choose between cross-sectional or longitudinal samples. Cross-sectional samples are selected by default; these are nationally or sub-nationally representative samples collected each year dating backward as far as 2013.
```{r}
knitr::include_graphics("images/cross-sectional.png")
```
Longitudinal samples are only available from 2019 onward, and they include all of the available phases for each sampled country (sub-nationally representative samples for DRC and Nigeria are listed separately). You'll only find longitudinal samples for countries where Phase 2 data has been made available; as of this writing, Phase 1 data for Cote d'Ivoire, India, and Uganda can only be found under the Cross-sectional sample menu.
\newpage
Clicking the Longitudinal button reveals options for either **Long** or **Wide** format. You'll find the same samples available in either case.
**Important:** if you decide to change formats after selecting variables, your Data Cart will be emptied and you'll need to begin again from scratch.
```{r}
knitr::include_graphics("images/wide.png")
```
\newpage
After you've selected one of the available longitudinal formats, choose one or more samples listed below. There are also several Sample Members options listed.
```{r}
knitr::include_graphics("images/cases.png")
```
<aside>
`r r_link(PANELWOMAN)` indicates whether an individual is a member of the panel study.
`r r_link(ELIGIBLE)` indicates whether an individual was eligible for the female questionnaire.
</aside>
**Female Respondents** only includes women who completed *all or part* of a Female Questionnaire. **This option selects all members of the panel study.** In addition, it includes women who participated in only one phase - we will demonstrate how to identify and drop these cases below.^[Women who completed all or part of the Female Questionnaire in *more than one phase* of the study are considered **panel members**. Women who completed it only at Phase 1 are included in a longitudinal extract, but they are not **panel members**. Likewise, women who completed it for the first time at Phase 2 are included, but are not **panel members** if they 1) will reach age 50 before Phase 3, or 2) declined the invitation to participate again in Phase 3.]
**Female Respondents and Female Non-respondents** includes all women who were eligible to participate in a Female Questionnaire. Eligible women are those age 15-49 who were listed on the roster collected in a Household Questionnaire. If an eligible woman declined the Female Questionnaire or was not available, variables associated with that questionnaire will be coded "Not interviewed (female questionnaire)".
\newpage
<aside>
`r r_link(RESULTFQ)` indicates whether an individual completed the Female Questionnaire.
`r r_link(RESULTHQ)` indicates whether a member of the individual's household completed the Household Questionnaire.
</aside>
**Female Respondents and Household Members** adds records for all other members of a Female Respondent's household. These household members did not complete the Female Questionnaire, but were listed on the household roster provided by the respondent to a Household Questionnaire. Basic [demographic](https://internal.pma.ipums.org/pma-action/variables/group?id=hh_roster) variables are available for each household member, as are common [wealth](https://internal.pma.ipums.org/pma-action/variables/group?id=hh_wealth), [water](https://internal.pma.ipums.org/pma-action/variables/group?id=water_watersource), [sanitation](https://internal.pma.ipums.org/pma-action/variables/group?id=water_wash), and other variables shared for all members of the same household.
**All Cases** includes all members listed on the household roster from a Household Questionnaire. If the Household Questionnaire was declined or if no respondent was available, any panel member appearing in other phases of the study will be coded "Not interviewed (household questionnaire)" for variables associated with the missing Household Questionnaire.
After you've selected samples and sample members for your extract, click the "Submit Sample Selections" button to return to the main data browsing menu.
## Variable Selection
You can browse IPUMS PMA variables by topic or alphabetically by name, or you can [search](https://pma.ipums.org/pma-action/variables/search) for a particular term in a variable name, label, value labels, or description.
```{r}
knitr::include_graphics("images/topics.png")
```
\newpage
In this example, we'll select the [Discontinuation of Family Planning](https://pma.ipums.org/pma-action/variables/group?id=fem_fpst) topic. The availability of each associated variable is shown in a table containing all of the samples we've selected.
* `X` indicates that the variable is available for *all phases*
* `/` indicates that the variable is available for *one phase*
* `-` indicates that the variable is not available for *any phase*
You can click the `+` button to add a variable to your cart, or click a variable name to learn more.
```{r}
knitr::include_graphics("images/table.png")
```
### Codes
<aside>
"Case-count view" is not available for longitudinal samples. For cross-sectional samples, this option shows the frequency of each response.
</aside>
Let's take a look at the variable `r r_link(PREGNANT)`. You'll find the variable name and label shown at the top of the page. Below, you'll see several tabs beginning with the [CODES](https://pma.ipums.org/pma-action/variables/PREGNANT#codes_section) tab. For discrete variables, this tab shows all of the available codes and value labels associated with each response. You'll also see the same `X`, `/`, and `-` symbols in a table indicating the availability of each response in each sample.
```{r}
knitr::include_graphics("images/codes-fr.png")
```
\newpage
Above, there are no responses for "Not interviewed (female questionnaire)" and "Not interviewed (household questionnaire)"; this is because only sample members included in a "Female Respondents" extract are displayed by default. If we instead choose "All Cases", this variable will include those response options because we'll include every person listed on the household roster (even if the Household or Female Questionnaire was not completed).
```{r}
knitr::include_graphics("images/codes-all.png")
```
\newpage
The symbol `/` again indicates that a particular response is available for some - but not all - phases of the study. For `r r_link(PREGNANT)` it indicates that one of the options was either unavailable or was not selected by any sample respondents in a particular phase. If a variable was not included in all phases of the study, all response options will be marked with this symbol. For example, consider the variable `r r_link(COVIDCONCERN)`, indicating the respondent's level of concern about becoming infected with COVID-19.
```{r}
knitr::include_graphics("images/covidconcern.png")
```
Because Phase 1 questionnaires were administered prior to the emergence of COVID-19, this variable only appeared on Phase 2 questionnaires. The symbol `/` indicates limited availability across phases.
### Variable Description
You'll find a detailed description for each variable on the [DESCRIPTION](https://pma.ipums.org/pma-action/variables/PREGNANT#description_section) tab. This tab also indicates whether a particular question appeared on the Household or Female Questionnaire.
```{r}
knitr::include_graphics("images/desc.png")
```
### Comparability Notes
The [COMPARABILITY](https://pma.ipums.org/pma-action/variables/PREGNANT#comparability_section) tab describes important differences between samples. Additionally, it may contain information about similar variables appearing in [DHS](https://dhsprogram.com/) samples provided by [IPUMS DHS](https://www.idhsdata.org/idhs/).
```{r}
knitr::include_graphics("images/comp.png")
```
### Sample Universe
The [UNIVERSE](https://pma.ipums.org/pma-action/variables/PREGNANT#universe_section) tab describes selection criteria for this question. In this case, there are some differences between samples:
* In DRC samples, all women aged 15-49 received this question.
* For all other samples, the question was skipped if any such woman previously indicated that she was menopausal or had a hysterectomy.
```{r}
knitr::include_graphics("images/universe.png")
```
### Availability Across Samples
The [AVAILABILITY](https://pma.ipums.org/pma-action/variables/PREGNANT#availability_section) tab shows all other samples (including cross-sectional samples) where this variable is available.
```{r}
knitr::include_graphics("images/avail.png")
```
### Questionnaire Text
Finally, you'll find the full text of each question on the [QUESTIONNAIRE TEXT](https://pma.ipums.org/pma-action/variables/PREGNANT#questionnaire_text_section) tab. Each phase of the survey is shown separately, and you may click the "view entire document: text" link to view the complete questionnaire for a particular sample in any given phase.
```{r}
knitr::include_graphics("images/question.png")
```
### Checkout
Use the buttons at the top of this page to add the variable to your Data Cart, or to "VIEW CART" and begin checkout.
```{r, fig.align='center'}
# knitr::include_graphics("images/buttons.png")
htmltools::img(
  src = "images/buttons.png",
  style =
    "margin-top: 20px; margin-bottom: 25px; max-width: 100%; width: 1033px;"
)
```
## Data for R Users
Your Data Cart shows all of the variables you've selected, plus several "preselected" variables that will be automatically included in your extract. Click the "CREATE DATA EXTRACT" button to prepare your download.
```{r}
knitr::include_graphics("images/cart.png")
```
### Select a Fixed-width File
Before you submit an extract request, you'll have the opportunity to choose a "Data Format". **R users should select a Fixed-width text file (.dat)** - you'll notice that data formatted for Stata, SPSS, and SAS are also available. CSV files are provided, but not recommended. (If you wish to change Sample Members, you may do so again here.)
```{r}
knitr::include_graphics("images/checkout1.png")
```
Once the Fixed-width option is selected, you may add a description and then proceed to the download page.
### Download
After a few moments, you'll receive an email indicating that your extract has been created. You'll need to obtain two files from the download page:
* Click the green "Download DAT" button to download the data file. You'll receive a file with a number like `pma_00003.dat.gz`.
* Right click on "DDI" and click "Save link as". You'll receive a corresponding XML file like `pma_00003.xml`.
```{r, fig.align='center'}
# knitr::include_graphics("images/download.png")
htmltools::img(
  src = "images/download.png",
  style = "margin-top: 20px; margin-bottom: 25px;"
)
```
Place both files in a folder that R can use as its [working directory](https://r4ds.had.co.nz/workflow-projects.html?q=working%20directory#where-does-your-analysis-live). We **strongly recommend** using [RStudio projects](https://r4ds.had.co.nz/workflow-projects.html?q=working%20directory#rstudio-projects) to manage all of the files and analysis scripts used for a particular research project. We'll place our files in a sub-folder called "data" within our own RStudio project folder.
Open RStudio (or R) and load the packages `r funlink(ipumsr)` and `r funlink(tidyverse)`. If you are not using an RStudio project, you will need to change your working directory to match the location of your downloaded files.
```{r, eval=FALSE, echo=TRUE}
library(ipumsr)
library(tidyverse)
setwd("~/Downloads") # ONLY if not using an RStudio project (change as needed)
```
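Before reading an extract, it can help to confirm that both files are where R expects them. This is a minimal sketch that assumes the example file names shown above and a "data" sub-folder; substitute the names of your own downloaded files.

```r
# Assumed paths, relative to the working directory (change as needed)
ddi_file  <- "data/pma_00003.xml"
data_file <- "data/pma_00003.dat.gz"

# file.exists() returns one TRUE/FALSE value per path
files_found <- file.exists(c(ddi_file, data_file))
names(files_found) <- c(ddi_file, data_file)
files_found

if (!all(files_found)) {
  message("Missing file(s); current working directory is: ", getwd())
}
```

If either value is `FALSE`, check the working directory reported in the message against the folder where you saved the download.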
We'll now demonstrate loading both a long and a wide extract, and we'll take a brief look at the structure of each.
```{r}
knitr::opts_chunk$set(echo = TRUE, eval = TRUE)
options(tibble.print_min = 12)
```
## Long Data Structure
We've downloaded a **Long** data extract (**Female Respondents** only) and saved it in a folder called "data" in our working directory. We'll now load it into R as an object called `long`.
To load an IPUMS PMA extract into R, you'll need to reference *both* the DDI file *and* the fixed-width data file in the function `r funlink(ipumsr::read_ipums_micro)` from `r funlink(ipumsr)`.
```{r, results='hide'}
long <- read_ipums_micro(
  ddi = "data/pma_00003.xml",
  data = "data/pma_00003.dat.gz"
)
```
In a **Long** extract, data from each phase will be organized in *separate rows*. Here, responses from three panel members are shown:
```{r}
long %>%
  filter(FQINSTID %>% str_starts("011") | FQINSTID %>% str_starts("015")) %>%
  arrange(FQINSTID) %>%
  select(FQINSTID, PHASE, AGE, PANELWOMAN)
```
Each panel member receives a unique ID shown in `r r_link(FQINSTID)`. The variable `r r_link(PHASE)` shows that each woman's responses to the Phase 1 Female Questionnaire appear in the first row, while her Phase 2 responses appear in the second. `r r_link(AGE)` shows each woman's age when she completed the Female Questionnaire for each phase.
`r r_link(PANELWOMAN)` indicates whether the woman completed all or part of the Female Questionnaire in a *prior* phase, and whether she agreed to continue participating in the panel study at that time. The value `NA` appears in the rows for Phase 1, as `r r_link(PANELWOMAN)` was not included in Phase 1 surveys.
\newpage
We mentioned above that you'll also include responses from some non-panel members when you request an extract with **Female Respondents**. These include women who did not complete all or part of the Female Questionnaire in a prior phase, as indicated by `r r_link(PANELWOMAN)`. These women are not assigned a value for `r r_link(FQINSTID)` - instead, you'll find an empty string:
```{r}
long %>% count(PHASE, PANELWOMAN, FQINSTID == "")
```
Chapter 1 describes **Inclusion Criteria for Analysis** and shows how to identify women in a **Wide** extract who did not complete the Female Questionnaire in both phases. In **Long** format, we use `r funlink(dplyr::group_by)` to ensure that there is one row for every `r r_link(FQINSTID)` where `PHASE == 1` and another row where `PHASE == 2 & RESULTFQ == 1`.
```{r}
long <- long %>%
  group_by(FQINSTID) %>%
  filter(any(PHASE == 1) & any(PHASE == 2 & RESULTFQ == 1)) %>%
  ungroup()
```
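To see how the grouped filter behaves, consider a toy **Long** dataset with three hypothetical women (the IDs `"A"`, `"B"`, `"C"` and the result code `5` are made up for illustration). Woman A completed both phases, woman B has a Phase 2 row with a result other than `1`, and woman C appears only in Phase 1.

```r
library(dplyr)

toy_long <- tibble(
  FQINSTID = c("A", "A", "B", "B", "C"),
  PHASE    = c(1, 2, 1, 2, 1),
  RESULTFQ = c(1, 1, 1, 5, 1)
)

# Within each woman's rows, any() asks whether at least one row meets the
# condition; the woman's rows are kept only if both checks succeed
toy_kept <- toy_long %>%
  group_by(FQINSTID) %>%
  filter(any(PHASE == 1) & any(PHASE == 2 & RESULTFQ == 1)) %>%
  ungroup()

toy_kept
```

Only the two rows for woman A remain: B fails the Phase 2 check, and C has no Phase 2 row at all.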
The *de facto* population is identified where `r r_link(RESIDENT)` takes the value `11` or `22` in both rows.
```{r}
long <- long %>%
  group_by(FQINSTID) %>%
  filter(all(RESIDENT %in% c(11, 22))) %>%
  ungroup()
```
\newpage
Following these steps, you can check the size of each analytic sample like so. (Reminder: samples for DRC and Nigeria are sub-nationally representative, so we'll show separate frequencies for each `r r_link(GEOCD)` and `r r_link(GEONG)`.)
```{r}
long %>% count(COUNTRY, GEOCD, GEONG, PHASE)
```
## Wide Data Structure
We've also downloaded a **Wide** data extract (**Female Respondents** only) and saved it in the "data" folder in our working directory. We'll also load this extract into R as an object named `wide`.
```{r, results='hide', eval = TRUE}
wide <- read_ipums_micro(
  ddi = "data/pma_00004.xml",
  data = "data/pma_00004.dat.gz"
)
```
In a **Wide** extract, all of the responses from one woman appear in the *same row*. The IPUMS PMA extract system appends a numeric suffix to each variable name corresponding with the phase from which it was drawn. Consider our three example panel members again:
```{r}
wide %>%
  filter(FQINSTID %>% str_starts("011") | FQINSTID %>% str_starts("015")) %>%
  select(FQINSTID, AGE_1, AGE_2, PANELWOMAN_1, PANELWOMAN_2)
```
Each panel member has one unique ID shown in `r r_link(FQINSTID)`. However, `r r_link(AGE)` is parsed into two columns: `r r_link(AGE_1)` shows each woman's age at Phase 1, and `r r_link(AGE_2)` shows her age at Phase 2.
As we've discussed, `r r_link(PANELWOMAN)` is not available for Phase 1, as it indicates whether the woman completed all or part of the Female Questionnaire in a *prior* phase. For this reason, all values in `r r_link(PANELWOMAN_1)` are `NA`. Most variables are copied once for each phase, even if they - like `r r_link(PANELWOMAN_1)` - are not available for all phases.
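The relationship between the two layouts can be sketched with `tidyr::pivot_wider()` on a toy **Long** dataset. This is only an illustration of the naming pattern, not a description of how the IPUMS PMA extract system builds its files; the IDs, ages, and `PANELWOMAN` values below are made up.

```r
library(dplyr)
library(tidyr)

# Two hypothetical panel members, one row per phase
toy_long <- tibble(
  FQINSTID   = c("A", "A", "B", "B"),
  PHASE      = c(1, 2, 1, 2),
  AGE        = c(24, 25, 31, 32),
  PANELWOMAN = c(NA, 1, NA, 1)
)

# Each value column is copied once per phase, with the phase number
# appended as a suffix (AGE_1, AGE_2, PANELWOMAN_1, PANELWOMAN_2)
toy_wide <- toy_long %>%
  pivot_wider(
    id_cols = FQINSTID,
    names_from = PHASE,
    values_from = c(AGE, PANELWOMAN)
  )

toy_wide
```

As in the real extract, `PANELWOMAN_1` contains only `NA` because the Phase 1 rows had no value to carry over.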
\newpage
You might expect the total length of a **Wide** extract to be half the length of a corresponding **Long** extract. This is not the case! A **Wide** extract includes one row for each woman who completed all or part of the Female Questionnaire *for any phase* - you'll find placeholder columns for phases where the interview was not conducted.
```{r}
wide %>%
  filter(FQINSTID == "0C8VQU6B03BXLAVVZ8SB90EKQ") %>%
  select(RESULTFQ_1, AGE_1, RESULTFQ_2, AGE_2)
```
In a **Long** extract, rows for the missing phase are dropped. In this example, the woman was "not at home" for the Phase 2 Female Questionnaire. When we select a **Long** extract containing only Female Respondents, her Phase 2 row is excluded automatically (it will be included if you request an extract containing **Female Respondents and Female Non-respondents**).
```{r, results='hide', echo = FALSE}
long <- read_ipums_micro(
  ddi = "data/pma_00003.xml",
  data = "data/pma_00003.dat.gz"
)
```
```{r}
long %>%
  filter(FQINSTID == "0C8VQU6B03BXLAVVZ8SB90EKQ") %>%
  select(PHASE, RESULTFQ, AGE)
```
The **Inclusion Criteria for Analysis** section in Chapter 1 shows how to identify members of the *de facto* population who completed the Female Questionnaire in both phases for a **Wide** extract. Those steps are repeated here:
```{r}
wide <- wide %>%
  filter(
    RESIDENT_1 %in% c(11, 22) & RESIDENT_2 %in% c(11, 22),
    RESULTFQ_2 == 1
  )
```
\newpage
Following these steps, each analytic sample contains the same number of cases shown in the final **Long** format extract above.
```{r}
wide %>% count(COUNTRY, GEOCD, GEONG)
```
## Which format is best for me?
The choice between **Long** and **Wide** formats ultimately depends on your research objectives.
Many data manipulation tasks, for example, are faster and easier to perform in the **Wide** format. In the example above, we needed to identify women who completed a Female Questionnaire and were members of the *de facto* population in both phases. In the **Long** format, we first had to group the data by `r r_link(FQINSTID)` with `r funlink(dplyr::group_by)`, thereby ensuring that a Phase 1 and Phase 2 check could be performed for each woman. When we prepared this guide, this approach took about 36.5 seconds. By comparison, the same task was achieved without `r funlink(dplyr::group_by)` in **Wide** format in just 0.16 seconds. If your workflow requires multiple comparisons between phases, the **Wide** format may be the best choice!
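You can run a similar comparison yourself with `system.time()`. The sketch below uses simulated data (2,000 hypothetical women with randomly assigned `RESIDENT` codes, where `31` stands in for any code outside the *de facto* population), so the timings it produces will not match the figures quoted above. It is mainly meant to show that the grouped **Long** filter and the ungrouped **Wide** filter select the same women.

```r
library(dplyr)
library(tidyr)

set.seed(42)
n <- 2000

# Simulated Long data: two rows (phases) per woman
toy_long <- tibble(
  FQINSTID = rep(sprintf("W%04d", seq_len(n)), each = 2),
  PHASE    = rep(1:2, times = n),
  RESIDENT = sample(c(11, 22, 31), size = 2 * n, replace = TRUE)
)

# The same data in Wide form: RESIDENT_1 and RESIDENT_2
toy_wide <- toy_long %>%
  pivot_wider(
    names_from = PHASE,
    values_from = RESIDENT,
    names_prefix = "RESIDENT_"
  )

# Long format: the check must be made within each woman's rows
time_long <- system.time(
  keep_long <- toy_long %>%
    group_by(FQINSTID) %>%
    filter(all(RESIDENT %in% c(11, 22))) %>%
    ungroup()
)

# Wide format: a single row-wise comparison, with no grouping
time_wide <- system.time(
  keep_wide <- toy_wide %>%
    filter(RESIDENT_1 %in% c(11, 22) & RESIDENT_2 %in% c(11, 22))
)

time_long["elapsed"]
time_wide["elapsed"]
```

Both results contain the same set of `FQINSTID` values; the **Long** result simply has two rows per woman instead of one.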
On the other hand, many of the longitudinal modeling packages available for R require data to be in a **Long** format - this includes both the `r funlink(survival)` package used in Chapter 6 and the `r funlink(lme4)` package for multilevel models. Users who prefer the **Wide** format for data cleaning and exploration can manually switch to **Long** format with help from `r funlink(tidyr::pivot_longer)`, for example:
```{r}
wide %>% select(FQINSTID, AGE_1, PREGNANT_1, AGE_2, PREGNANT_2)
```
\newpage
With `r funlink(tidyr::pivot_longer)`, you can strip the suffix `1` or `2` from each variable, placing the result in a new column called `PHASE`. Then, we'll pivot each woman's age and pregnancy status from 2 **Wide** columns into a single **Long** one.
<aside>
We will revisit `r funlink(tidyr::pivot_longer)` when analyzing PMA Contraceptive Calendar data in Chapter 6.
</aside>
```{r}
wide %>%
select(FQINSTID, AGE_1, PREGNANT_1, AGE_2, PREGNANT_2) %>%