History and Philosophy of the Language Sciences

Podcast episode 35: Interview with Nick Thieberger on historical documentation and archiving

22 min • 31 July 2023

In this interview, we talk to Nick Thieberger about the value of historical documentation for linguistic research, and how this documentation can be preserved and made accessible today and in the future in digital form.

References for Episode 35

Crane, Gregory, ed. 1987–. Project Perseus. Web resource: http://www.perseus.tufts.edu/hopper/

Gardner, Helen, Rachel Hendery, Stephen Morey, Patrick McConvell et al. 2020. Howitt and Fison’s Archive. Web resource: https://howittandfison.org/

Lillehaugen, Brook Danielle, George Aaron Broadwell, Michel R. Oudijk, Laurie Allen, May Plumb, and Mike Zarafonetis. 2016. Ticha: a digital text explorer for Colonial Zapotec, first edition. Web resource: http://ticha.haverford.edu/

Takau, Toukolau. 2011. “Koaiseno”, in Natrauswen nig Efat, Stories from South Efate, ed. Nick Thieberger, pp. 88–90. Melbourne: University of Melbourne. Open access: http://hdl.handle.net/11343/28967

Takau, Toukolau. 2017. “Koaiseno”, in recording NT1-20170718. https://catalog.paradisec.org.au/collections/NT1/items/20170718

Thieberger, Nick. 2017. Digital Daisy Bates. Web resource: http://bates.org.au

Thieberger, Nick, Linda Barwick, Nick Enfield, Jakelin Troy, Myfany Turpin and Roman Marchant Matus. 2022–. Nyingarn: a platform for primary sources in Australian Indigenous languages. Web resource: https://nyingarn.net/

Pacific and Regional Archive for Digital Sources in Endangered Cultures (PARADISEC). https://www.paradisec.org.au/

Transcript by Luca Dinu

TT [singing]: Koaiseno koaiseno seno, nato wawa nato wawa meremo… [00:13]

JMc: That was the late Toukolau Takau from Erakor village, Vanuatu, singing Koaiseno, a song that’s part of the folktale of the same name. [00:24] The recording of the song is stored in the PARADISEC digital archive, which we’ll talk about later in this episode. [00:31] Links to the recording and the complete story are included in the bibliography for this episode. [00:38] I’m James McElvenny, and you’re listening to the History and Philosophy of the Language Sciences Podcast, online at hiphilangsci.net. [00:47] Today we’re joined by Nick Thieberger, who’s Associate Professor of Linguistics at the University of Melbourne. [00:55] Among his many interests, Nick works extensively with archival data, both contemporary and historical. [01:03] We’re going to talk to him about how historical data can inform present-day linguistic research, [01:09] and what we can do in our present to ensure that it becomes the most productive past of the future, if I can put it that way. [01:17] So Nick, you’ve been involved in a number of projects that make historical sources in Australian languages accessible to present-day communities and researchers. [01:27] The most significant of these are perhaps the Howitt and Fison Archive and the Digital Daisy Bates. [01:34] So can you tell us about these projects? What historical materials did you work with, [01:40] how did you make them accessible to people today, and what is the use of these materials today? [01:48]

NT: Yeah, so these are a couple of major projects, and in some ways they were testing out a method for how to work with historical manuscripts. [02:01] I was only slightly involved with Howitt and Fison, but I ran the Digital Daisy Bates project, so maybe I’ll talk about that one. [02:09] Daisy Bates recorded on paper lots of information about Australian Indigenous languages in the very early 1900s. [02:18] So in 1904, she sent out a questionnaire, and that was filled out by a number of respondents. [02:23] And so there were in the order of 23,000 pages of questionnaire materials sitting in the National Library of Australia and two other libraries, [02:35] the State Library of Western Australia and South Australia. And so they were fairly inaccessible. [02:39] I’d worked with them, and I realised that they were very valuable, but they were really difficult to work with because they’re just all on paper. [02:48] So I thought it’d be interesting to try all of this methodology that we have with the Text Encoding Initiative and all these ways of dealing with texts and manuscripts. [02:58] So I worked with the National Library of Australia, and that took a bit of time because they’re a big institution and these things take time. [03:05] But it took about eight years, really, of getting the approvals from the National Library and also getting them to digitise these papers. [03:14] And they did that from microfilms, so not going back to the original papers, but… Because it was just much cheaper and easier to run the microfilms through and digitise them. [03:23] So then we had the images, and this was going back a while now, and OCR, optical character recognition, wasn’t very good for these typescripts. [03:33] So I sent them off to an agency to get them typed and then put them online. 
[03:39] And the idea, the principle behind this too, was that we should have an image of the original manuscript together with the text, [03:46] because, if you like, the warrant for the text is the original manuscript, and separating them, which is something that we’ve done a lot in the past, [03:55] we’ve gone in, found manuscripts, extracted what we think is the important information, reproduced it in some way, but then there’s no link back. [04:04] And so people can’t retrace your steps, [04:07] and if you’ve made some errors or just you’ve made some interpretations that they don’t agree with, there’s no real way for them to correct that. [04:15] So Digital Daisy Bates put the page images online and it put up the text, and you could then search the text, [04:24] and for every text page that you found, you retrieved the page image as well. [04:30] It’s been up online now for quite a while, and it’s had many, many users. [04:35] I think one of the exciting things about doing this sort of work is that once you prepare material in this way, you don’t know what uses people will make of it. [04:45] And one of the big target groups for this was Aboriginal people who wanted access to materials in their own languages, and that was satisfied. [04:55] But I was finding biologists who were finally able to search through 23,000 pages of Bates’ materials for plant and animal names. [05:06] Before, they were having to look through paper, and basically it defeated them, I think. [05:11] They were really not able to do it. [05:13]
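
[Editorial sketch, not part of the interview: the principle NT describes above, keeping each transcribed page tied to the image of its manuscript page so that a text search always returns the scan as well, could look roughly like this in Python. The `Page` record, its field names, the file names and the sample text are all hypothetical placeholders. Nothing here is taken from how the Digital Daisy Bates site is actually built.]

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Page:
    """One manuscript page: the transcription never travels without its scan."""
    page_id: str     # hypothetical page identifier
    image_file: str  # file name of the scanned page image
    text: str        # typed transcription of that page


def search(pages: list[Page], term: str) -> list[Page]:
    """Return every page whose transcription contains term (case-insensitive)."""
    needle = term.casefold()
    return [page for page in pages if needle in page.text.casefold()]


# Placeholder data for illustration only, not real archive content.
pages = [
    Page("p001", "p001.jpg", "Placeholder transcription listing an emu"),
    Page("p002", "p002.jpg", "Placeholder transcription listing a wattle tree"),
]

for hit in search(pages, "EMU"):
    # Each hit carries the image reference, so a reader can check the source.
    print(hit.page_id, hit.image_file)
```

[Because a hit is the whole record rather than a bare string, anyone who disagrees with a transcription can go straight back to the page image, which is the "link back" NT says was missing in earlier practice.]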

JMc: And all this material is still up online and available for anyone to use. [05:18]

NT: Yeah, it is still up online and available for anyone to use. [05:21] And, you know, one of the issues with a lot of this is, what right do I have to put this online, and what changes digitisation makes, what changes it can make to the nature of the material. [05:36] So while it’s on paper, it’s got its own inherent restrictions. [05:40] You know, you can’t easily get access to it. [05:42] Once it’s online, it’s much more easily accessible. [05:45] So I was a bit worried with Daisy Bates. [05:47] This is mainly Western Australian material, and it represents dozens of languages and a huge geographic area. [05:55] There would be people who would feel perhaps aggrieved that they may feel some ownership of the language and not want it to be put online, [06:05] but I also recognised the value of putting it online. [06:10] So there was a risk. [06:11] And I think we have to take these risks. [06:13] I don’t think it’s very fruitful to say, “Oh, there’s a risk that somebody will be offended, so I won’t do this,” [06:20] because really, my experience with Daisy Bates is that everybody, all the Aboriginal people who’ve used it, have really valued being able to use it and finding materials. [06:29] And they can download this stuff and use it themselves as text. [06:32] So we have to be a bit less cautious. [06:35] I mean, obviously, we have to be cautious and we have to be respectful of the people represented, [06:40] but if I were to try and get permission from every Aboriginal person who’s got an ancestor in those papers, it would be impossible. [06:49] It would just, you know, it would stymie the whole project. [06:52] And on top of that, how can you go to people and say you want permission to do something when they don’t really know what you’re talking about because the papers are in the National Library in Canberra? 
[07:00] So putting the papers up and using a takedown principle, so saying, “If you’re aggrieved by this, please get in touch with me and we can take it down if necessary,” I think is a much more productive way of dealing with these papers. [07:14]

JMc: Yeah, so it’s a very fraught situation in Australia at the moment, isn’t it? Because, I mean, these documents were produced by a member of the colonial settler population, Daisy Bates, who had very strong colonialist views, [07:31] but what she was documenting were the culture and language of Indigenous inhabitants of the country. [07:38] So the question is, yeah, who does it belong to? And what is even contained in these documents? [07:44] Is it Daisy Bates’ image of what she thought was the culture and language of these people, [07:49] or is it something, you know, some actual essential property of their culture and language that has in fact been recorded and belongs to them? [07:58]

NT: Yeah, exactly. [07:59] And, I mean, as you say, Daisy Bates is quite a problematic character in Australian history. [08:04] She’s very well known. [08:06] And she did have very strange views, idiosyncratic views, and quite conservative from our perspective today. [08:13] In some way, you know, she would be a candidate for cancelling in the way that other historical figures have been. [08:21] But I think in all of these cases, you really have to weigh up the total person and the total legacy and not just say, “Well, you know, they did one thing that I don’t like, and therefore I won’t use any of the materials.” [08:35] And, as you say, there is a lot of material here which is neutral to some extent, it’s not her interpretation. [08:43] These were questionnaires that she sent out that had in the order of 1,000 prompt words and sentences. [08:48] So this is primary material. [08:50] Of course, it’s handwritten. So we have to interpret the handwriting. [08:54] But it’s not as potentially florid as some of her other recording, which is really it is her interpretation, and she did have some rather peculiar views. [09:05] But even there, knowing her views, you can strain out the essential or potentially the more ethnographic or historical detail from this material. [09:18] So, you know, I do think it’s important to do this and I do think it’s important to take risks in putting this material online. [09:26] Doing it, you know, talking to Indigenous people about it and knowing that they value it. [09:33] So, I mean, obviously, if there’s something that’s really offensive or that encodes some ceremonial event that is clearly not for general consumption, then you wouldn’t put that online, but that’s not the case for most of these materials. [09:49]

JMc: You’re also a pioneer of ensuring that more recent materials are properly archived. [09:55] So probably from the mid-20th century up to the present. Your greatest contribution here would be your work at PARADISEC. [10:03] So can you tell us what PARADISEC is all about, and what value do you think the materials that you’ve archived there will have in the future, and can you also tell us what the particular challenges are that you’ve faced with the material that is archived at PARADISEC? [10:19]

NT: So PARADISEC is the Pacific and Regional Archive for Digital Sources in Endangered Cultures. [10:24] It’s a project that’s been going for 20 years that I’m currently leading, but, you know, had worked on for 20 years and it was established by Linda Barwick and me all that time ago. [10:38] The aim of PARADISEC was to digitize analog recordings. [10:44] So recordings made by field workers in the 1950s, ’60s, ’70s, mainly in Papua New Guinea, Melanesia and Southeast Asia, that were not being looked after by any other agency in Australia. [10:59] So we have National Film and Sound Archive and National Library and so on, but because these materials were not made in Australia, it wasn’t part of the role of these agencies to look after these recordings. [11:11] So we started digitizing the recordings, and we just kept going and getting bits and pieces of funding from various places, Australian Research Council in particular. [11:21] And now we have in the order of 16,000 hours of digital audio, a few thousand hours of video. [11:30] It’s a huge collection and it represents in the order of 1,355 languages. [11:38] It’s an enormous range of material that’s in there. And this is song, it’s oral tradition, it’s elicitation, it’s all kinds of things. [11:48] So the problem we set ourselves to solve was: how can this get back to the source communities that it came from? [11:56] Because we take it as part of our responsibility when we make these recordings that we will look after them and that they will go back to the communities, and in a lot of ways, the people we work with understand that when we’re working with them. [12:09] They understand that they are talking to the future, they are talking to us as custodians of this material for future generations. 
[12:17] And I think we’ve fallen down a little bit in our practice as linguists, musicologists, ethnographers, [12:25] in not really making proper provision for looking after this material and ensuring that it does get back, if not to the source communities, [12:34] because these are small villages in remote locations, but nevertheless to perhaps the national cultural agencies in the Solomon Islands, Vanuatu, Papua New Guinea, and so on. [12:44] And that’s what we’ve been doing. [12:45] So one of the big challenges then is, well, finding the tapes in the first place. [12:49] Often they’re deceased estates that we’re working with or retired academics who feel the weight of this often. [12:57] They feel the weight of all of these recordings. [12:59] They understand that they should have done something with them, but there was no, to be fair, there usually was nowhere for them to actually deposit these materials. [13:07] So we’re providing that for them. [13:09] In general, the tapes are in pretty good condition, so it doesn’t take a lot of effort to digitize them. [13:14] But in having done this, we’ve established lots of relationships with these cultural agencies in the Pacific, and a lot of them have tapes as well. [13:21] And that’s where our effort is going now as well. [13:25] And that is working with the Solomon Islands National Museum, the Vanuatu Cultural Centre, and digitizing tapes for them. [13:31] But in this case, often the tapes have been stored in the tropics. [13:34] They’re mouldy, they’re dirty, and they require quite a lot of work to make them playable, and no one is funding this work, so we have to do that on whatever funding we can put together. [13:45] But it is really valuable because the Vanuatu Cultural Centre, for example, has 5,000 tapes sitting in Port Vila, in a country that’s prone to earthquakes, cyclones, tsunamis. [13:59] It’s got the lot. [14:01] And the potential for this stuff to be lost is really, is very real. 
[14:05] So working with these agencies is important and finding more and more of these tapes. [14:09] We run a project called Lost and Found, where we invite people to tell us about collections of tapes, [14:15] and they put that into our spreadsheet, and then we try and tee up some funding wherever we can, [14:20] through the Endangered Archives Program from the British Library, the Endangered Language Documentation Program, and so on. [14:28]

JMc: So I guess there’s also a technical problem that once you’ve digitized these analogue tapes, [14:34] how do you ensure that the data formats don’t become obsolete, and that there isn’t data rot on the archival copies? [14:44] And then when you’re returning things to communities, how do you ensure that people in the communities can actually play back what they’ve recorded? [14:51] I guess there are many greater challenges with audiovisual material than with old archival material that’s on paper, [14:59] because all you have to do with paper is ensure that it is kept dry and out of sunlight. [15:05]

NT: Yeah, indeed. So for storing this stuff and making sure that it’s going to last into the future, we adhere to all the necessary standards. [15:13] So this has all been done by others, obviously, so we follow the same standards. [15:18] And one of those standards is that you always digitize to a standard format, the European Broadcast Wave format, which is a WAV file. [15:27] We make MP3 versions, so they’re compressed versions, and MP3 we know is a proprietary format, [15:33] but for the time being it seems to be a format that works, and that’s the format that people can play relatively easily. [15:41] We have backup copies of the whole collection in different locations. [15:45] We have one in Melbourne, and the collection itself is in Sydney, and it’s in two locations in Sydney as well. [15:51] So we make provision for all of that technical backup. [15:57] We do checksum… So, checksums are checking the integrity of each file, and we do a checksum run through random parts of the collection every day, [16:06] and that points to anything that may have bit rot. [16:09] We haven’t actually encountered bit rot yet, but we know that it could be a thing. [16:14] And finally, getting it back to the right place, that’s really a big challenge. [16:18] So we do send hard disks back to these locations when we’ve digitized the tapes, and we have a catalogue, [16:26] and we keep a piece of whatever catalogue entry there is for an item, for a tape or whatever, [16:33] we put that together with the files, and we send that back to the cultural centre so that there’s contextual information with the files. [16:41] Files on their own are very difficult to interpret, so at least having that with them. 
[16:46] We’ve also experimented using Raspberry Pis, which are small computers that have a Wi-Fi transmitter in them, [16:54] and they cost a couple hundred dollars, and you can put all the material relevant to a particular place on a Raspberry Pi, [17:00] take it there, and then people can access that on their mobile phones, [17:04] and that is probably a better way of them accessing this material, because often they don’t have computers, [17:11] USB sticks and hard disks aren’t that relevant to them. [17:15] So we’ve been experimenting with that, as I say. [17:17] We’ve done it in a few villages. [17:19] We’ve done it in Tahiti, we’ve done it in the Western Desert in Australia, where people can then just access material on their phones, [17:25] and it does look like a good model, and probably the way to do this in future, [17:30] but it requires the local cultural centre to have this running there as well, [17:36] so yes, it sounds great and it does work, but it’s not necessarily going to work for a long time into the future. [17:43] We’ll see. [17:45]
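
[Editorial sketch, not part of the interview: the daily checksum run over random parts of the collection that NT mentions above could be sketched as follows. This is a minimal illustration under stated assumptions: SHA-256 as the checksum, an in-memory manifest of known-good digests, and the function names `build_manifest` and `spot_check` are all choices made for this example. The interview does not say which algorithm or tooling PARADISEC actually uses.]

```python
import hashlib
import random
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict[str, str]:
    """Record a digest for every file under root, keyed by relative path."""
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def spot_check(root, manifest, sample_size, rng=None):
    """Re-hash a random sample of files; return the paths that no longer match."""
    rng = rng or random.Random()
    names = rng.sample(sorted(manifest), min(sample_size, len(manifest)))
    failures = []
    for name in names:
        path = root / name
        # A missing file or a changed digest both count as a failure.
        if not path.is_file() or sha256_of(path) != manifest[name]:
            failures.append(name)
    return sorted(failures)
```

[In this sketch the manifest would be built once, when files enter the archive, and `spot_check` would be scheduled daily with a sample size chosen so that the whole collection is covered over time. Any path it returns would then be restored from one of the backup copies NT describes.]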

JMc: So your latest project is the Nyingarn repository. [17:49] So can you tell us what the purpose of the Nyingarn repository is, and how it builds on your previous work? [17:56]

NT: Yes, so when we talked about Daisy Bates and the Howitt and Fison project, [18:00] these were particular projects designed around a set of material, [18:06] and as I said, experimenting with how to put that online and make it accessible, [18:10] and I think what that taught me and the team that I’m working with was that it works very well, [18:17] and it would be great to have a way of just adding more and more manuscripts to that platform, [18:24] and that’s what Nyingarn is. [18:25] So Nyingarn is… It’s a three-year project, we’re currently just at the end of the second year, [18:31] and the idea is that you should be able to take any digitized manuscripts, [18:35] put them into the platform, and it will try to OCR them, [18:41] or you can also put an existing transcript into the system as well, [18:45] and we’ve got a few different pathways in for different kinds of transcripts, [18:49] and the idea is that this will just grow as a platform with more and more manuscripts, [18:55] and it’s working very well. [18:57] We have at the moment about 350 manuscripts in our workspace, [19:03] so we distinguish a workspace, which is where all of the transcription [19:08] and sort of enrichment of the manuscript is, and then the next step is a repository, [19:13] which is where it goes once we have a fairly stable version of it, [19:19] and that’s where we allow people to search and do other things with it. [19:23] We did set ourselves the task also of getting permissions [19:27] from current language authorities for these documents, [19:31] and as we said earlier, it’s quite a sensitive issue in Australia, [19:36] and we recognize these sensitivities, so we don’t want to just be putting manuscripts online, [19:42] even if some of them have been in the public domain for some time. 
[19:46] We recognize that Aboriginal people have been disempowered for so long [19:50] that we don’t want to compound that, but the exciting thing is that there are a lot [19:55] of young Indigenous people in Australia now who are desperately looking for things to do, [20:00] and especially on the east coast of Australia where languages, [20:04] really the speakers of those languages suffered the initial onslaught [20:09] of the European invasion, and so that’s where the languages have not been spoken [20:14] for the longest, and people are trying to go back to these original sources now [20:17] to recover their languages, and so they recognize the value of Nyingarn [20:22] as a way of doing this transcription and then being able to use the manuscripts, [20:29] the text of the manuscripts. [20:31] So it’s a fairly simple idea. [20:34] You take a manuscript, you get a textual version of it, [20:38] and then you do something else with it, but actually making transcripts [20:43] of manuscripts isn’t that easy if you don’t have a good system for it [20:46] because you very rapidly start losing track of which page is related to which piece of transcript and so on. [20:52] So the simple technology does allow – it facilitates this transcription [20:58] and then further use of the materials. [21:01] So it’s exciting to see it working. [21:04] At the end of the project, the Australian Institute of Aboriginal and Torres Strait Islander Studies has undertaken to take the repository and host it there, so we hope that it will keep going into the future. [21:16]

JMc: Do you see any international application for Nyingarn? [21:20]

NT: Well, it’s all in GitHub. [21:22] It’s there if anybody wants to use it. [21:24] We actually – when I was doing this, I was looking at international models, [21:28] so there’s Project Perseus in Europe, which is all the sort of classic [21:32] Greek-Roman texts, and in the United States there’s Ticha, [21:37] which is, it’s working with a particular Zapotec canon of classical materials, [21:43] and it uses a similar sort of approach to what we’ve built up with Nyingarn. [21:49] So, yes, I think it’s very – it’s logical that it should happen. [21:53] I’m sort of – I was a bit astounded that there wasn’t a way [21:58] of looking at these texts up until now, but nevertheless, [22:02] I hope that this will continue into the future. [22:05]

JMc: Excellent. Well, thank you very much for answering those questions. [22:09]

TT [singing]: …koaiseno seno, nato wawa nato wawa meremo, koaiseno seno.
