# Generating wordclouds from massive PDF libraries

My friend Dani posed an interesting question on Twitter yesterday:

I used Zotero to manage my bibliography during and after my PhD, and like any recovering academic, I have a giant folder full of PDFs that I am weirdly attached to and can’t bring myself to delete.

So I thought to myself: This must be possible, and I certainly don’t have anything better to do with my time.

We are going to make use of two Python packages: pdftotext and wordcloud. The following was done on a Mac but it should work anywhere Python works.

Note: There’s nothing Zotero-specific here; the following works on any large directory of PDFs.

## A note on Zotero’s filesystem

Zotero’s local storage consists of its SQLite bibliography database and linked files. Linked files are stored in a directory tree: lots of directories with a few files in each. So we will need to traverse these folders to find all the PDFs.
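If you would rather do the traversal in Python than in the shell, here is a minimal sketch using only the standard library. The function name `find_pdfs` is my own and not part of Zotero or any of the tools used here; it simply walks a directory tree and collects anything ending in `.pdf`, ignoring case.

```python
from pathlib import Path


def find_pdfs(root):
    """Return every PDF file below root, sorted by path.

    The extension check is case-insensitive, so report.PDF is found too.
    """
    return sorted(
        path
        for path in Path(root).rglob("*")
        if path.is_file() and path.suffix.lower() == ".pdf"
    )
```

Calling `find_pdfs` on your Zotero storage folder should give you a list of `Path` objects, one per PDF, however deeply they are nested.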

## Installing Packages

Assuming you have Python installed, we need to install the command line tools we will use. These instructions are for Mac.

brew install pkg-config poppler python
pip install wordcloud pdftotext

Hopefully that all goes smoothly. I can’t help you install Python, sorry.

## Converting a library of PDFs to a giant text file

The Wordcloud tool builds wordcloud images from a text file. We have a big set of PDFs. So the first step is to merge the content of the PDFs into one giant text file.

find "$THE_DIRECTORY_WHERE_MY_PDFS_ARE" -name '*.pdf' -exec pdftotext "{}" - \; >> combined_text.txt

We now have a single text file containing the text content of every PDF. Note: This won’t work for scanned PDFs, which contain page images rather than text unless they have been through OCR.
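The same merge can be sketched in Python with the pdftotext package we installed earlier. This is a rough equivalent of the find command, not a drop-in replacement: the function names are my own, the default extractor assumes the pdftotext package imports cleanly on your machine, and any PDF that fails to parse is simply skipped and counted.

```python
from pathlib import Path


def extract_with_pdftotext(pdf_path):
    """Extract text with the pdftotext Python package (imported lazily)."""
    import pdftotext

    with open(pdf_path, "rb") as handle:
        # A pdftotext.PDF object is a sequence of page strings.
        return "\n\n".join(pdftotext.PDF(handle))


def combine_pdfs(root, out_path, extract=extract_with_pdftotext):
    """Append the text of every PDF under root to out_path.

    Returns a (converted, skipped) tuple. The extract argument is a hook so
    a different extractor can be swapped in.
    """
    converted = skipped = 0
    # Mode "a" appends, like >> in the shell version.
    with open(out_path, "a", encoding="utf-8") as out:
        for pdf_path in sorted(Path(root).rglob("*.pdf")):
            try:
                text = extract(pdf_path)
            except Exception:
                # Broken or unreadable PDF: carry on with the rest.
                skipped += 1
                continue
            out.write(text + "\n")
            converted += 1
    return converted, skipped
```

Something like `combine_pdfs("path/to/pdfs", "combined_text.txt")` should leave you with the same kind of file as the shell one-liner, plus a count of how many PDFs were skipped.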

## Converting the text to a wordcloud

Now that we have a single text file containing all our text, we can feed that into wordcloud.

wordcloud_cli --text combined_text.txt --imagefile wordcloud.png

Depending on how big your library is, this could take a little while. But you should end up with a nice Wordcloud:

The Wordcloud package has lots of options for customising the output.

### Excluding Words

You can provide a list of words you want excluded. In the example above, ‘use’, ‘using’, and ‘used’ aren’t particularly useful. To exclude words, create a new text file with each excluded word on its own line, and provide it to wordcloud. This is great for excluding common words that are not particularly interesting for your topic.
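If you are not sure which words to exclude, counting the most frequent words in the combined text is one way to find candidates. This is a minimal standard-library sketch; the function name and the simple regex tokeniser are my own assumptions, and wordcloud's own tokenising may split words a little differently.

```python
import re
from collections import Counter


def most_common_words(text, n=20, min_length=1):
    """Return the n most frequent lower-cased words as (word, count) pairs.

    Words shorter than min_length are ignored.
    """
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(word for word in words if len(word) >= min_length)
    return counts.most_common(n)
```

Run it over the contents of combined_text.txt and anything near the top of the list that says nothing about your topic is a good candidate for the excluded words file.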

Wordcloud has a built-in list of stop words, and providing your own overrides the built-in ones. You can start with the default list, which you can find here, and add your own words to it.

wordcloud_cli --text combined_text.txt --imagefile wordcloud.png --stopwords excluded.txt

And we end up with this:

### Excluding short words

In addition to the stopwords list, you can also tell Wordcloud to just not include words shorter than a specific length:

wordcloud_cli --text combined_text.txt --imagefile wordcloud.png --stopwords excluded.txt --min_word_length 8

Things start to get a bit more interesting:

### Customising the output

There are many ways to customise the resulting image; run wordcloud_cli --help for a list of all options. For example:

wordcloud_cli --text combined_text.txt \
--imagefile wordcloud.png \
--background white \
--color purple \
--min_word_length 8 \
--width 1280 \
--height 720 \
--fontfile intro.otf

Which gives you this:

You get the idea.

## Bonus: Work on a subset of my Zotero Library

If you have a big giant sprawling Zotero library, or different libraries for different projects, you might want to generate wordclouds from just some of your documents. You can export files from Zotero to accomplish this.

1. Select the publications you want to use
2. Right-click and select Export Items
3. On the export options, make sure Export Files is checked
4. Choose where to save the export

You’ll end up with a new folder containing just the PDFs you selected.

## Going Further

The wordcloud_cli is just a frontend to the wordcloud Python package. If the command line options don’t give you the customisation you are looking for then you can write code to accomplish what you want.
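As a starting point, here is a minimal sketch of driving the package from Python. The helper names are my own, and the settings just mirror the command line examples above. Unlike the `--stopwords` option, this sketch adds your excluded words to the built-in list rather than replacing it, which is a choice I made for the example and not how the CLI behaves.

```python
from pathlib import Path


def load_stopwords(path):
    """Read one excluded word per line, skipping blanks and lower-casing."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return {line.strip().lower() for line in lines if line.strip()}


def render_wordcloud(text_path, image_path, stopwords_path=None):
    """Render a wordcloud image from a text file."""
    # Imported here so load_stopwords works without wordcloud installed.
    from wordcloud import STOPWORDS, WordCloud

    stopwords = set(STOPWORDS)
    if stopwords_path is not None:
        # Extend the built-in list instead of overriding it.
        stopwords |= load_stopwords(stopwords_path)

    cloud = WordCloud(
        width=1280,
        height=720,
        background_color="white",
        min_word_length=8,
        stopwords=stopwords,
    )
    text = Path(text_path).read_text(encoding="utf-8", errors="ignore")
    cloud.generate(text)
    cloud.to_file(str(image_path))
```

Calling `render_wordcloud("combined_text.txt", "wordcloud.png", "excluded.txt")` should produce an image much like the command line version, and from there you can change whatever the CLI does not expose.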

#### Join the conversation!

Hit me up on Twitter or send me an email.
Posted on June 25, 2022 in Research. Tags: Python, wordcloud, Zotero

# Latex Thesis Template

Tweaking your own Latex template for a PhD dissertation is a rite of passage/time waster for most PhD candidates. There are lots of templates around on the Internet too, of varying quality.

I spent a fair bit of time procrastinating perfecting the template used for my thesis. I’ve pulled out all the unnecessary bits and put it up on GitHub. Hopefully some other poor PhD student will find it useful. This template is itself based on styles developed by Peter Hutterer, an earlier PhD student from the Wearable Computer Lab.

This template is particularly suited to students at the University of South Australia. It meets the guidelines specified by the Graduate Research Office. That said, with some adjustments this template should be useful for anybody.

## Features:

• Nice cover page
• Author’s publications
• Acknowledgements
• TOC, List of Figures, Abbreviations, etc.

## How to use:

• Fork and clone the GitHub repository
• Edit the information in thesis.tex
• Update images/00/author_sig.png with your own signature
• Update images/00/uni.png with your university’s logo
• Tweak the styles as necessary

I’m happy to accept Pull Requests for improvements on this template.

Check it out on GitHub!

Happy Writing!

Posted on November 10, 2015 in Research. Tags: latex, phd, template, thesis

# BuildMyKitchen Demonstration

This video demonstrates BuildMyKitchen, an application constructed to demonstrate the use of spatial augmented reality for interior architecture tasks.

This work was presented at the IEEE VR conference as a poster, and at the Australasian User Interface Conference.


# Dr. Michael Marner, PhD

My PhD is finally finished! I graduated at the end of August, 2013. Please have a look at my thesis if spatial augmented reality is of interest to you.

Thanks to my supervisor, Professor Bruce Thomas, associate supervisor, Dr. Christian Sandor, my reviewers, Professors Greg Welch and Henry Gardner, my family, etc.

Posted on September 01, 2013 in Research. Tags: graduation, phd, thesis

# Tiled Projector Calibration

Here’s a quick post to show off the tiled projector display calibration I have implemented.

Please note: I did not invent this! What you see is more or less an implementation of the algorithm described here:

O. Bimber and R. Raskar, Spatial Augmented Reality: Merging Real and Virtual Worlds. Wellesley: A K Peters, 2005.


# AUIC 2012 Roundup

So the Australasian User Interface Conference for 2012 has been and gone. The Wearable Computer Lab presented two full papers and two posters, and I was an author on one of them 🙂

The papers we presented are listed below, and the publication page has been updated so you can get the PDFs. Cheers!

E. T. A. Maas, M. R. Marner, R. T. Smith, and B. H. Thomas, “Supporting Freeform Modelling in Spatial Augmented Reality Environments with a New Deformable Material,” in Proceedings of the 13th Australasian User Interface Conference, Melbourne, Victoria, Australia, 2012. (pdf) (video)

T. M. Simon, R. T. Smith, B. H. Thomas, G. S. Von Itzstein, M. Smith, J. Park, and J. Park, “Merging Tangible Buttons and Spatial Augmented Reality to Support Ubiquitous Prototype Designs,” in Proceedings of the 13th Australasian User Interface Conference, Melbourne, Victoria, Australia, 2012.

S. J. O’Malley, R. T. Smith, and B. H. Thomas, “Poster: Data Mining Office Behavioural Information from Simple Sensors,” in Proceedings of the 13th Australasian User Interface Conference, Melbourne, Victoria, Australia, 2012.

T. M. Simon and R. T. Smith, “Poster: Magnetic Substrate for use with Tangible Spatial Augmented Reality in Rapid Prototyping Workflows,” in Proceedings of the 13th Australasian User Interface Conference, Melbourne, Victoria, Australia, 2012.

Posted on February 15, 2012 in Research. Tags: 2012, auic, conference, publication

# Latex, Texlipse, and EPS Figures

I’m currently in the early stages of writing my PhD thesis. I’m writing it using LaTeX, and I’m trying to get the perfect build system and editing environment going. Yesterday I had a look at Texlipse, a plugin for Eclipse. There was one problem: EPS figures didn’t work.

In newish versions of Latex, if you use the epstopdf package, your images are converted on the fly, but this wasn’t working in Texlipse. Luckily the fix is easy, and the rest of this post explains what to do.

\documentclass{minimal}
\usepackage{epsfig}
\usepackage{epstopdf}
\usepackage{graphicx}

\begin{document}

Here's an EPS Figure:

\includegraphics[height=5cm]{unisa}

\end{document}

Download unisa.eps, and try this yourself. On Ubuntu, I get output that looks like this:

If you look at the console output generated by Texlipse, you will see one of two problems, described below.

### Problem 1: Shell escape feature is not enabled

I encountered this problem on Ubuntu. If you see the following output:

pdflatex> Package epstopdf Warning: Shell escape feature is not enabled.

then you have hit this problem. The fix is quite easy.

1. Open up Eclipse Preferences
2. Click on Texlipse Builder Settings
3. Click on PdfLatex program, and press the edit button
4. Add --shell-escape to the argument list as the first argument.
5. You’re done! Rebuild your project and it should work fine.

### Problem 2: Cannot Open Ghostscript

I encountered this problem on OSX. Weird how the two systems have the same symptoms with different causes, but whatever. If you see the output:

pdflatex> !!! Error: Cannot open Ghostscript for piped input

then you are suffering from problem 2. This problem is caused by the PATH environment variable not being set correctly when Texlipse runs pdflatex. Essentially, the Ghostscript program, gs, cannot be found by pdflatex. The fix is to add an environment variable to Texlipse’s builder settings so the path is corrected.

#### Step 1: Locate Ghostscript, Repstopdf, and Perl

Open up a terminal, and type:

which gs

This should show you the directory where Ghostscript lives on your system. On my laptop it is:

/usr/local/bin

Repeat the process with repstopdf:

which repstopdf

Which on my system gives:

/usr/texbin

And with perl:

which perl

gives me:

/opt/local/bin

The exact paths will depend on how you have installed these things. For example, Perl lives in /opt on my system because I installed it using MacPorts. It doesn’t really matter. However, if any of these programs is missing, you will need to install it first.

#### Step 2: Create the Environment Variable

Now that we know where the programs are installed, we need to create a PATH environment variable for Texlipse to use.

1. Open up Eclipse Preferences
2. Go down to Environment, which is under Texlipse Builder Settings
3. Click new to create a new environment variable
4. The key should be set to PATH. The value should be the three directories, separated by colons (:). For example, on my system:
5. You’re done! Save the settings and everything should work.
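If you want to build the PATH value without typing it by hand, something like the following sketch would do it. The function names are my own, and it assumes the three tools are already on the PATH of the shell you run it from. It looks up each tool, takes the directory it lives in, and joins the directories with colons.

```python
import shutil
from pathlib import Path


def build_path_value(directories):
    """Join directories with colons, dropping duplicates but keeping order."""
    unique = list(dict.fromkeys(directories))
    return ":".join(unique)


def locate_tool_dirs(tools=("gs", "repstopdf", "perl")):
    """Return the directory of each tool that can be found on the PATH."""
    found = (shutil.which(tool) for tool in tools)
    return [str(Path(path).parent) for path in found if path]


# Using the directories from Step 1 of this post:
# build_path_value(["/usr/local/bin", "/usr/texbin", "/opt/local/bin"])
# gives "/usr/local/bin:/usr/texbin:/opt/local/bin"
```

On your own machine, `build_path_value(locate_tool_dirs())` should give you a string you can paste straight into the value field.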

### Conclusions

Once you have completed the steps for whichever problem you had (you may even have had both), you should see the correct output, which looks like this:

Well, I hope that helps someone. It’s surprising that this error came up on both of my computers. Searching the internet finds others with the same problem, but as yet no solutions. This post should fix that.

Posted on June 15, 2011 in Research. Tags: eclipse, eps, epstopdf, figure, latex, mac, osx, texlipse, ubuntu

# Quimo. A deformable material to support freeform modelling in spatial augmented reality environments

Hello Everyone

3DUI has wrapped up for the year, so here is our second publication. We introduce a new material for freeform sculpting in spatial augmented reality environments. Please read the paper, and have a look at the video below.

Posted on March 22, 2011 in Research. Tags: Augmented Reality, industrial design, Programming, publication, sar, sculpting

# Adaptive Color Marker for SAR Environments

Hey Everyone

So right now I am at the IEEE Symposium on 3D User Interfaces in Singapore. We have a couple of publications which I’ll be posting over the next few days. First up is Adaptive Color Marker for SAR Environments. In a previous study we created interactive virtual control panels by projecting onto otherwise blank designs. We used a simple orange marker to track the position of the user’s finger. However, in a SAR environment, this approach suffers from several problems:

• The tracking system can’t track the marker if we project the same colour as the marker.
• Projecting onto the marker changes its appearance, causing tracking to fail.
• Users could not tell when they were pressing virtual controls, because their finger occluded the projection.

We address these problems with an active colour marker. We use a colour sensor to detect what is being projected onto the marker, and change the colour of the marker to an opposite colour, so that tracking continues to work. In addition, we can use the active marker as a form of visual feedback. For example, we can change the colour to indicate a virtual button press.
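To give a feel for what picking an opposite colour could look like, here is a small sketch. It is not the code from the paper: I am assuming "opposite" means the hue on the other side of the colour wheel, and the function name is my own. Note that greys have no hue, so they come back unchanged with this approach.

```python
import colorsys


def opposite_colour(rgb):
    """Return the colour with the hue rotated by 180 degrees.

    rgb is a tuple of three integers in the range 0 to 255.
    """
    r, g, b = (channel / 255.0 for channel in rgb)
    hue, saturation, value = colorsys.rgb_to_hsv(r, g, b)
    # Adding 0.5 moves the hue half way around the colour wheel.
    rotated = colorsys.hsv_to_rgb((hue + 0.5) % 1.0, saturation, value)
    return tuple(round(channel * 255) for channel in rotated)
```

With this definition, a marker sitting under pure red projection, `(255, 0, 0)`, would switch to cyan, `(0, 255, 255)`.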

I’ve added the publication to my publications page, and here’s the video of the marker in action.

Posted on March 20, 2011 in Research. Tags: Augmented Reality, c++, opengl, Programming, publication, sar

# Augmented Foam Sculpting for Capturing 3D Models

This weekend I presented my paper, Augmented Foam Sculpting for Capturing 3D Models, at the International Symposium on 3D User Interfaces. Since the conference has passed, I have added the video to YouTube and the paper to my publications page. First, the video, then some discussion after the jump.

## Foam Sculpting

The inspiration for this work came out of a project we did with some industrial design students. Their job was to create some input devices for my SAR Airbrushing system. First up, we had a meeting where I showed them a very early development version of the system, to give them an idea of what we were doing. They went away and came up with ideas for input devices, and in the next meeting had a bunch of sketches ready. We discussed the sketches: what we liked and what we didn’t like. Next, they brought us foam mockups of some of the designs. We discussed these, and then eventually they came back with full CAD models ready for 3D printing.


They did a great job by the way. But it got us thinking:

How can we make this process better?

Augmented Foam Sculpting is the result of this work. It allows a designer/artist to simultaneously create a physical design mockup and matching virtual model. This is a Good Thing™, because it utilises the skills and tools that designers are already using.

The system works by tracking the position and orientation of both the hot wire foam cutter, and the piece of foam the user is sculpting. We can track the motion of the hot wire as it passes through the foam.

From there, we can create geometry that matches the cut path, and perform a Boolean difference operation on the foam geometry.

This replicates the cut in the physical object in the 3D model.
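To illustrate the idea of building geometry from the cut path, here is a simplified sketch. It is not the implementation from the paper. I am assuming each tracker sample gives the two endpoints of the hot wire, already transformed into the foam's coordinate frame, and the function name is my own. Consecutive wire positions are joined into quads, each split into two triangles, giving the surface swept by the wire. The Boolean difference itself would need a CSG library and is not shown.

```python
def sweep_cut_surface(wire_samples):
    """Build triangles for the surface swept by the wire.

    wire_samples is a list of (start, end) endpoint pairs, one per tracker
    sample, where each endpoint is an (x, y, z) tuple in foam coordinates.
    Returns a list of triangles, each a tuple of three points.
    """
    triangles = []
    for (a0, b0), (a1, b1) in zip(wire_samples, wire_samples[1:]):
        # The quad a0-b0-b1-a1 spans the gap between two wire positions.
        triangles.append((a0, b0, b1))
        triangles.append((a0, b1, a1))
    return triangles
```

Three samples of the wire give two quads, so four triangles. A real system would also thicken this surface into a closed volume before subtracting it from the foam geometry.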

Using projectors, we can add extra information to the foam as the user sculpts. We implemented two visualisations to aid designers when creating specific models.

Cut Animation displays cuts to be made as animated lines on the foam surface. Once a cut has been made, the system moves to the next one. This visualisation could be used to recreate a previous object, or to instruct novices. An algorithm could be developed to calculate the actual cuts that need to be made, reducing the amount of planning needed when making an object.

The second visualisation, Target, projects a target model so that it appears to be inside the foam. The foam is coloured based on how much needs to be removed to match a target model. This could be used to create variations on a previous model.

Finally, we can use 3D procedural textures to change the appearance of the foam. For example, we implemented a wood grain 3D texture. This works pretty well, because as you cut away the foam, the texture updates to appear as though the wood was actually cut. 3D textures are also ideal because we don’t need to generate texture coordinates after each cut.
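The reason no texture coordinates are needed is that a 3D procedural texture is just a function of position: any point on a freshly cut surface can be coloured directly from where it sits in space. Here is a toy sketch of a wood grain function to show the idea. It is not the shader from the paper, and the function name and ring spacing are my own assumptions.

```python
import math


def wood_grain(x, y, z, rings_per_unit=8.0):
    """Return a brightness between 0 and 1 for a point in the foam.

    The rings are concentric around the z axis, so z is ignored and the
    grain runs along the length of the block.
    """
    radius = math.hypot(x, y)
    phase = (radius * rings_per_unit) % 1.0
    return 0.5 + 0.5 * math.cos(2.0 * math.pi * phase)
```

Because the value depends only on the point itself, cutting the foam exposes new surface points that already have a well-defined colour.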

For all the details, please have a read of the paper. If you have any questions/comments/feedback/abuse, please comment on this post, or send me an email.

Posted on March 24, 2010 in Research. Tags: Augmented Reality, foam cutter, industrial design, sar, sculpting