[sewtha,#5][l]: redo markdown extraction so as to have the separate chapters e.g. now have /without-hot-air/chap01.

* extract via unzipping epub and then processing each html (not only needed for sections but cleaner markdown)
* redo processing script in python (vs bash)
  * not only cleaner but can have tests
* move to symlinking content from the without-hot-air folder into content and site folders
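
The first bullet above (unzip the epub, then process each html file) can be sketched in pure Python. This is a minimal illustration, not the committed code: `extract.py` shells out to `unzip`, whereas this sketch swaps in the standard-library `zipfile` module. The helper name `list_chapter_files` is hypothetical, and the `OEBPS/Text` layout is assumed from the docstring in `extract.py`.

```python
import os
import zipfile


def list_chapter_files(epub_path, dest_dir):
    """Unpack an EPUB (a zip archive) and return its XHTML text files, sorted."""
    with zipfile.ZipFile(epub_path) as archive:
        archive.extractall(dest_dir)
    # Assumption: the text lives under OEBPS/Text, as it does for this book.
    text_dir = os.path.join(dest_dir, 'OEBPS', 'Text')
    return sorted(f for f in os.listdir(text_dir) if f.endswith('.xhtml'))
```

Each returned file name could then be handed to pandoc one at a time, which is what gives one markdown file per chapter.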
rufuspollock committed Aug 24, 2021
1 parent 37fbae9 commit 685e45b
Showing 58 changed files with 11,928 additions and 21,826 deletions.
1 change: 1 addition & 0 deletions content/without-hot-air
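
The commit message says the content is now symlinked rather than copied, and the TODO comments in `extract.py` note that this was done by hand. A possible way to automate it is sketched below. The helper name `link_dir` is hypothetical, and the example paths in the usage note are assumptions drawn from those TODO comments, not confirmed by the diff.

```python
import os


def link_dir(source, link_name):
    """Create (or refresh) a relative symlink at link_name pointing to source."""
    # Replace a stale link; a real directory at link_name is left alone
    # and os.symlink below will raise FileExistsError.
    if os.path.islink(link_name):
        os.remove(link_name)
    parent = os.path.dirname(os.path.abspath(link_name))
    os.makedirs(parent, exist_ok=True)
    # A relative target keeps the link valid if the repository is moved.
    target = os.path.relpath(os.path.abspath(source), parent)
    os.symlink(target, link_name, target_is_directory=True)
    return target
```

Run from the repository root, something like `link_dir('without-hot-air/src', 'content/without-hot-air')` would recreate the manual step, assuming that is the intended link target.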
240 changes: 240 additions & 0 deletions without-hot-air/extract.py
@@ -0,0 +1,240 @@
import os
import shutil
from distutils.dir_util import copy_tree

TMPDIR = './tmp'
EPUB = 'without-hot-air.epub'
SRC = './src'
MARKDOWN = os.path.join(TMPDIR, 'markdown')
IMAGES = './Images'

def retrieve():
    # Main epub version linked on https://www.withouthotair.com/epubVersions.html
    cmd = 'curl http://www.inference.eng.cam.ac.uk/sustainable/book/translate/SustainableEnergy-withoutthehotair-DavidJCMacKay.epub > %s' % EPUB
    os.system(cmd)

def prepare():
    # start from a clean tmp dir on every run
    if os.path.exists(TMPDIR):
        shutil.rmtree(TMPDIR)
    os.makedirs(TMPDIR)
    os.makedirs(MARKDOWN)

    if not os.path.exists(SRC):
        os.makedirs(SRC)

    if not os.path.exists(IMAGES):
        os.makedirs(IMAGES)

def extract():
    # unzip then pandoc xhtml to markdown

    cmd = 'unzip -q %s -d %s' % (EPUB, TMPDIR)
    os.system(cmd)

    htmldir = os.path.join(TMPDIR, 'OEBPS', 'Text')
    files = [ f for f in os.listdir(htmldir) if f.endswith('xhtml') ]
    for f in files:
        # chap01.xhtml => chap01.md
        mdfn = f.split('.')[0] + '.md'
        infp = os.path.join(htmldir, f)
        md = os.path.join(MARKDOWN, mdfn)
        cmd = 'pandoc %s -t gfm -o %s --wrap=none' % (infp, md)
        os.system(cmd)

def etl():
    '''
    1. make tmp
    2. E: unzip epub
    3. T: 1. pandoc to markdown from xhtml 2. process each chapter
    4. L: copy output to right location
    5. cleanup

    Layout:

    /tmp/
      /markdown/

    unzip results in text files in ...

    OEBPS/Text

    e.g.

    ./tmp/OEBPS/Text/chap01.xhtml
    ./tmp/OEBPS/Text/chap02.xhtml
    ./tmp/OEBPS/Text/chap03.xhtml
    '''
    prepare()
    extract()

    # now clean up each file
    files = os.listdir(MARKDOWN)
    for fn in files:
        md = os.path.join(MARKDOWN, fn)
        dest = os.path.join(SRC, fn)
        try:
            with open(md, encoding='utf8') as infile:
                content = infile.read()
            out = transform(content)
        except Exception:
            # report which file failed before re-raising
            print(md)
            raise
        with open(dest, 'w', encoding='utf8') as outfile:
            outfile.write(out)

    # TODO: have manually symlinked - could symlink automatically here
    # copy_tree(SRC, '../content/without-hot-air/')

    # sort images: copy into the IMAGES dir created by prepare()
    copy_tree(os.path.join(TMPDIR, 'OEBPS', 'Images'), IMAGES)
    # TODO: (did manually) - symlink this from site folder
    # site/public/img/without-hot-air => ./Images



import re
def transform(file_string):
    # clean up the markdown
    out = file_string

    # replace non-breaking spaces ...
    out = out.replace(u'\xa0', u' ')

    # find and replace patterns, applied in order as [pattern, replacement]
    regexes = [
        # TODO: does this even exist in xhtml conversion? (this was done for epub)
        # remove stuff like <span id="titlepage.xhtml"></span>
        # <span class="figurenumber">Figure 1.2.</span> Are "our" fossil fuels running out? Total crude oil production from the North Sea, and oil price in 2006 dollars per barrel. [<span class="darkred">\[10\]</span>](#chap01.xhtml#ch01n10)
        # ['^<span.*><\/span>$', ''],

        # Fix # 1   Motivations
        [' ', ' '],

        # convert quotes
        # <div class="quote"> ...
        [r'<div class="quote">\n\n(.*)\n\n<\/div>', r'> \g<1>'],
        # without a proper parser a bit hacky to handle when multiple lines
        [r'<div class="quote"[^>]*>\n\n(.*)\n\n((.*))?\n\n<\/div>', r'> \g<1>\n>\n> \g<2>'],

        # fix image links
        [r'\.\./Images/', '/img/without-hot-air/'],

        # correct quotes to normal quotes
        ['”', '"'],
        ['“', '"'],
        ['”', '"'],

        # strip trailing white space
        [' *$', ''],

        # remove all divs (including the image wrappers)
        # e.g. <div class="imgcap" style="float: right; width: 26%">
        # e.g. <div class="smallfont" style="width: 50%; padding-left: 10%">
        [r'<div[^>]*>', ''],
        [r'</div>', ''],
        # remove multiple blank lines
        [r'\n\n(\n)+', r'\n\n'],

        # footnotes
        # footnote ref
        # e.g. [<span class="darkred">\[5\]</span>](#ch01n05)
        [r'\[<span class="[^>]*>\\\[(\d+)\\\]<\/span>\]\(#ch0?(\d+)n0?(\d+)\)',
            r'[^\g<1>]'],

        # footnote itself
        # [<span class="mark">\[22\]</span>](#ret22)
        [r'\[<span class="mark">\\\[(\d+)\\\]</span>\]\(#ret\d+\)', r'[^\g<1>]: '],

        # we generally don't need the pandoc escaping of [
        # e.g. \[energy\]
        [r'\\\[', '['],
        [r'\\\]', ']'],
    ]

    for regex in regexes:
        out = re.sub(regex[0], regex[1], out, flags=re.MULTILINE)

    return out


def test_transform():
    instring = '''# 1   Motivations
<div class="quote">
*We live at a time when emotions and feelings count more than truth, and there is a vast ignorance of science.*
James Lovelock
</div>
<div class="quote">
*if everyone does a little, we’ll achieve only a little.*
</div>
<div class="imgcap" style="float: right; width: 26%">
![OutOfGas](../Images/OutOfGasS.jpg)
<div class="caption2">
David Goodstein’s *Out of Gas* (2004).
</div>
![SkepticalEnvironmentalist](../Images/lomborgSES.jpg)
<div class="caption2">
Bjørn Lomborg’s *The Skeptical Environmentalist* (2001).
</div>
![RevengeOfGaia](../Images/revengeOfGaiaS.jpg)
<div class="caption2">
*The Revenge of Gaia: Why the earth is fighting back – and how we can still save humanity.* James Lovelock (2006). © Allen Lane.
</div>
</div>
“Wind or nuclear?”, for example. ... to fill the \[energy\] gap is living in an utter dream world and is, in my view, an enemy of the people.” [<span class="darkred">\[1\]</span>](#ch01n01)<span class="red"> \*</span>
<div class="caption2">
[<span class="mark">\[3\]</span>](#ret03)*quote text here ...*
'''

    exp = '''# 1 Motivations
> *We live at a time when emotions and feelings count more than truth, and there is a vast ignorance of science.*
>
> James Lovelock
> *if everyone does a little, we’ll achieve only a little.*
![OutOfGas](/img/without-hot-air/OutOfGasS.jpg)
David Goodstein’s *Out of Gas* (2004).
![SkepticalEnvironmentalist](/img/without-hot-air/lomborgSES.jpg)
Bjørn Lomborg’s *The Skeptical Environmentalist* (2001).
![RevengeOfGaia](/img/without-hot-air/revengeOfGaiaS.jpg)
*The Revenge of Gaia: Why the earth is fighting back – and how we can still save humanity.* James Lovelock (2006). © Allen Lane.
"Wind or nuclear?", for example. ... to fill the [energy] gap is living in an utter dream world and is, in my view, an enemy of the people." [^1]<span class="red"> \*</span>
[^3]: *quote text here ...*
'''

    out = transform(instring)
    print(out)
    assert out == exp


if __name__ == '__main__':
    etl()
64 changes: 0 additions & 64 deletions without-hot-air/extract.sh

This file was deleted.
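
One stated benefit of replacing the bash script with Python is that the processing can have tests. As an illustration, the two footnote rules can be checked in isolation. This is a sketch only: the patterns are copied from `transform()` in `extract.py`, while the names `FOOTNOTE_REF`, `FOOTNOTE_DEF`, `convert_footnotes`, and `test_convert_footnotes` are hypothetical and not part of the commit.

```python
import re

# Same two patterns as the footnote rules in extract.py's transform().
FOOTNOTE_REF = r'\[<span class="[^>]*>\\\[(\d+)\\\]<\/span>\]\(#ch0?(\d+)n0?(\d+)\)'
FOOTNOTE_DEF = r'\[<span class="mark">\\\[(\d+)\\\]</span>\]\(#ret\d+\)'


def convert_footnotes(text):
    """Rewrite pandoc's footnote links as markdown footnote syntax."""
    text = re.sub(FOOTNOTE_REF, r'[^\g<1>]', text)
    return re.sub(FOOTNOTE_DEF, r'[^\g<1>]: ', text)


def test_convert_footnotes():
    # a reference in running text becomes [^N]
    ref = r'people. [<span class="darkred">\[1\]</span>](#ch01n01)'
    assert convert_footnotes(ref) == 'people. [^1]'
    # the note itself becomes a "[^N]: " definition
    note = r'[<span class="mark">\[3\]</span>](#ret03)*quote text here ...*'
    assert convert_footnotes(note) == '[^3]: *quote text here ...*'
```

Small single-rule tests like this are easier to keep passing than one large fixture, since a failure points at the rule that broke.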

29 changes: 29 additions & 0 deletions without-hot-air/src/acknowledgments.md
@@ -0,0 +1,29 @@
# Acknowledgments

For leading me into environmentalism, I thank Robert MacKay, Gale Ryba, and Mary Archer.

For decades of intense conversation on every detail, thank you to Matthew Bramley, Mike Cates, and Tim Jervis.

For good ideas, for inspiration, for suggesting good turns of phrase, for helpful criticism, for encouragement, I thank the following people, all of whom have shaped this book. John Hopfield, Sanjoy Mahajan, Iain Murray, Ian Fells, Tony Benn, Chris Bishop, Peter Dayan, Zoubin Ghahramani, Kimber Gross, Peter Hodgson, Jeremy Lefroy, Robert MacKay, William Nuttall, Mike Sheppard, Ed Snelson, Quentin Stafford-Fraser, Prashant Vaze, Mark Warner, Seb Wills, Phil Cowans, Bart Ullstein, Helen de Mattos, Daniel Corbett, Greg McMullen, Alan Blackwell, Richard Hills, Philip Sargent, Denis Mollison, Volker Heine, Olivia Morris, Marcus Frean, Erik Winfree, Caryl Walter, Martin Hellman, Per Sillrén, Trevor Whittaker, Daniel Nocera, Jon Gibbins, Nick Butler, Sally Daultrey, Richard Friend, Guido Bombi, Alessandro Pastore, John Peacock, Carl Rasmussen, Phil C. Stuart, Adrian Wrigley, Jonathan Kimmitt, Henry Jabbour, Ian Bryden, Andrew Green, Montu Saxena, Chris Pickard, Kele Baker, Davin Yap, Martijn van Veen, Sylvia Frean, Janet Lefroy, John Hinch, James Jackson, Stephen Salter, Derek Bendall, Deep Throat, Thomas Hsu, Geoffrey Hinton, Radford Neal, Sam Roweis, John Winn, Simon Cran-McGreehin, Jackie Ford, Lord Wilson of Tillyorn, Dan Kammen, Harry Bhadeshia, Colin Humphreys, Adam Kalinowski, Anahita New, Jonathan Zwart, John Edwards, Danny Harvey, David Howarth, Andrew Read, Jenny Smithers, William Connolley, Ariane Kossack, Sylvie Marchand, Phil Hobbs, David Stern, Ryan Woodard, Noel Thompson, Matthew Turner, Frank Stajano, Stephen Stretton, Terry Barker, Jonathan Köhler, Peter Pope, Aleks Jakulin, Charles Lee, Dave Andrews, Dick Glick, Paul Robertson, Jürg Matter, Alan and Ruth Foster, David Archer, Philip Sterne, Oliver Stegle, Markus Kuhn, Keith Vertanen, Anthony Rood, Pilgrim Beart, Ellen Nisbet, Bob Flint, David Ward, Pietro Perona, Andrew Urquhart, Michael McIntyre, Andrew Blake, Anson Cheung, Daniel Wolpert, Rachel Warren, Peter Tallack, Philipp Hennig, Christian
Steinrücken, Tamara Broderick, Demosthenis Pafitis, David Newbery, Annee Blott, Henry Leveson-Gower, John Colbert, Philip Dawid, Mary Waltham, Philip Slater, Christopher Hobbs, Margaret Hobbs, Paul Chambers, Michael Schlup, Fiona Harvey, Jeremy Nicholson, Ian Gardner, Sir John Sulston, Michael Fairbank, Menna Clatworthy, Gabor Csanyi, Stephen Bull, Jonathan Yates, Michael Sutherland, Michael Payne, Simon Learmount, John Riley, Lord John Browne, Cameron Freer, Parker Jones, Andrew Stobart, Peter Ravine, Anna Jones, Peter Brindle, Eoin Pierce, Willy Brown, Graham Treloar, Robin Smale, Dieter Helm, Gordon Taylor, Saul Griffith, David Cebonne, Simon Mercer, Alan Storkey, Giles Hodgson, Amos Storkey, Chris Williams, Tristan Collins, Darran Messem, Simon Singh, Gos Micklem, Peter Guthrie, Shin-Ichi Maeda, Candida Whitmill, Beatrix Schlarb-Ridley, Fabien Petitcolas, Sandy Polak, Dino Seppi, Tadashi Tokieda, Lisa Willis, Paul Weall, Hugh Hunt, Jon Fairbairn, Miloš T. Kojašević, Andrew Howe, Ian Leslie, Andrew Rice, Miles Hember, Hugo Willson, Win Rampen, Nigel Goddard, Richard Dietrich, Gareth Gretton, David Sterratt, Jamie Turner, Alistair Morfey, Rob Jones, Paul McKeigue, Rick Jefferys, Robin S Berlingo, Frank Kelly, Michael Kelly, Scott Kelly, Anne Miller, Malcolm Mackley, Tony Juniper, Peter Milloy, Cathy Kunkel, Tony Dye, Rob Jones, Garry Whatford, Francis Meyer, Wha-Jin Han, Brendan McNamara, Michael Laughton, Dermot McDonnell, John McCone, Andreas Kay, John McIntyre, Denis Bonnelle, Ned Ekins-Daukes, John Daglish, Jawed Karim, Tom Yates, Lucas Kruijswijk, Sheldon Greenwell, Charles Copeland, Georg Heidenreich, Colin Dunn, Steve Foale, Leo Smith, Mark McAndrew, Bengt Gustafsson, Roger Pharo, David Calderwood, Graham Pendlebury, Brian Collins, Paul Hasley, Martin Dowling, Martin Whiteland, Andrew Janca, Keith Henson, Graeme Mitchison, Valerie MacKay, Dewi Williams, Nick Barnes, Niall Mansfield, Graham Smith, Wade Amos, Sven Weier, Richard McMahon, Andrew Wallace,
Corinne Meakins, Eoin O’Carroll, Iain McClatchie, Alexander Ac, Mark Suthers, Gustav Grob, Ibrahim Dincer, Ian Jones, Adnan Midilli, Chul Park, David Gelder, Damon Hart-Davis, George Wallis, Philipp Spöth, James Wimberley, Richard Madeley, Jeremy Leggett, Michael Meacher, Dan Kelley, Tony Ward-Holmes, Charles Barton, James Wimberley, Jay Mucha, Johan Simu, Stuart Lawrence, Nathaniel Taylor, Dickon Pinner, Michael Davey, Michael Riedel, William Stoett, Jon Hilton, Mike Armstrong, Tony Hamilton, Joe Burlington, David Howey, Jim Brough, Mark Lynas, Hezlin Ashraf-Ball, Jim Oswald, John Lightfoot, Carol Atkinson, Nicola Terry, George Stowell, Damian Smith, Peter Campbell, Ian Percival, David Dunand, Nick Cook, Leon di Marco, Dave Fisher, John Cox, Jonathan Lee, Richard Procter, Matt Taylor, Carl Scheffler, Chris Burgoyne, Francisco Monteiro, Ian McChesney, and Liz Moyer. Thank you all.

For help with finding climate data, I thank Emily Shuckburgh. I’m very grateful to Kele Baker for gathering the electric car data in figure 20.21. I also thank David Sterratt for research contributions, and Niall Mansfield, Jonathan Zwart, and Anna Jones for excellent editorial advice.

The errors that remain are of course my own.

I am especially indebted to Seb Wills, Phil Cowans, Oliver Stegle, Patrick Welche, and Carl Scheffler for keeping my computers working.

I thank the African Institute for Mathematical Sciences, Cape Town, and the Isaac Newton Institute for Mathematical Sciences, Cambridge, for hospitality.

Many thanks to the Digital Technology Group, Computer Laboratory, Cambridge and Heriot–Watt University Physics Department for providing weather data online. I am grateful to Jersey Water and Guernsey Electricity for tours of their facilities.

Thank you to Gilby Productions for providing the TinyURL service. TinyURL is a trademark of Gilby Productions. Thank you to Eric Johnston and Satellite Signals Limited for providing a nice interface for maps [[www.satsig.net](http://www.satsig.net/)].

Thank you to David Stern for the portrait, to Becky Smith for iconic artwork, and to Claire Jervis for the photos on pages ix, 31, 90, 95, 153, 245, 289, and 325. For other photos, thanks to Robert MacKay, Eric LeVin, Marcus Frean, Rosie Ward, Harry Bhadeshia, Catherine Huang, Yaan de Carlan, Pippa Swannell, Corinne Le Quéré, David Faiman, Kele Baker, Tim Jervis, and anonymous contributors to Wikipedia. I am grateful to the office of the Mayor of London for providing copies of advertisements.

The artwork in chapter 31, "Maid in London," and in chapter D, "Sunflowers," is by Banksy [www.banksy.co.uk](http://www.banksy.co.uk/). Thank you, Banksy!

Offsetting services were provided by cheatneutral.com.

This book is written in LaTeX on the Ubuntu GNU/Linux operating system using free software. The figures were drawn with gnuplot and metapost. Many of the maps were created with Paul Wessel and Walter Smith’s gmt software. Thank you also to Martin Weinelt and OMC. Thank you to Donald Knuth, Leslie Lamport, Richard Stallman, Linus Torvalds, and all those who contribute to free software.

Finally I owe the biggest debt of gratitude to the Gatsby Charitable Foundation, who supported me and my research group before, during, and after the writing of this book.
14 changes: 14 additions & 0 deletions without-hot-air/src/author.md
@@ -0,0 +1,14 @@
# The author

*Sustainable Energy – without the hot air*

David JC MacKay

![](/img/without-hot-air/author.jpg)

The author, July 2008.
Photo by David Stern.

## About the author

David MacKay is a Professor in the Department of Physics at the University of Cambridge. He studied Natural Sciences at Cambridge and then obtained his PhD in Computation and Neural Systems at the California Institute of Technology. He returned to Cambridge as a Royal Society research fellow at Darwin College. He is internationally known for his research in machine learning, information theory, and communication systems, including the invention of Dasher, a software interface that enables efficient communication in any language with any muscle. He has taught Physics in Cambridge since 1995. Since 2005, he has devoted much of his time to public teaching about energy. He is a member of the World Economic Forum Global Agenda Council on Climate Change.