Saturday, 5 October 2013

Almost a 2 to 1 ratio of total code words to words in my LaTeX files...

In my previous post I shared a small Python script that recursively walks every subdirectory of a given directory and returns the word count distribution using texcount (a utility that strips away LaTeX code to count the words in a document). In this post I'm going to try to estimate how many words are in my LaTeX files without counting them (kind of).

On G+ +Dima Pasechnik suggested using wc as a proxy, but wc gives the count of all words (including code words). I thought I'd see how far off wc would be, so I modified the Python script from my last post so that it runs not only texcount but also wc, and then carries out a simple linear regression (using the stats package from scipy). The script is at the bottom of this blog post.
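
To make "code words" concrete, here is a rough sketch of what wc -w sees in a LaTeX snippet. It assumes that splitting on whitespace is a close enough stand-in for wc (locale details may differ), and the function name `count_code_words` and the sample text are made up for illustration:

```python
def count_code_words(text):
    """Count whitespace-separated tokens, roughly what `wc -w` reports."""
    return len(text.split())


# Hypothetical snippet: every whitespace-separated token counts as a "word",
# including the LaTeX commands themselves.
sample = r"\begin{itemize} \item Hello world \end{itemize}"

# The tokens are \begin{itemize}, \item, Hello, world and \end{itemize}:
# 5 in total, of which only "Hello" and "world" are prose.
print(count_code_words(sample))
```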

Here's the scatter plot and linear fit for all the LaTeX files on my system:

We see that the line $y=0.68x-27.01$ can be accepted as a predictor for the number of words in a LaTeX document ($y$) as a function of the total number of code words ($x$).

As in my previous post, I obviously have an outlier there, so here's the scatter plot and linear fit when I remove that one larger file:

The coefficient is again very similar: $y=0.64x+26$.

So based on this I'd say that multiplying the number of code words in a .tex file (the count that wc gives) by 0.6 is going to give me a good indication of how many actual words I have in total.
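
As a quick illustration of that rule of thumb, here is a small sketch that applies the 0.6 multiplier and the two fitted lines quoted above to a single count. The helper name `estimate_words` and the count of 5000 code words are hypothetical, chosen only for the example:

```python
def estimate_words(code_words, slope=0.6, intercept=0.0):
    """Estimate actual words from a `wc -w` count using a linear fit."""
    return slope * code_words + intercept


wc_count = 5000  # hypothetical output of `wc -w` for one .tex file

# Rule of thumb: multiply by 0.6.
print("rule of thumb:   %.0f" % estimate_words(wc_count))
# Fit over all files: y = 0.68x - 27.01.
print("all files:       %.0f" % estimate_words(wc_count, 0.68, -27.01))
# Fit with the outlier removed: y = 0.64x + 26.
print("outlier removed: %.0f" % estimate_words(wc_count, 0.64, 26))
```

The three estimates differ by a few hundred words at this size, so the multiplier is best read as an indication rather than a count.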

Here's a CSV file with my data. I'd love to know if other people have similar fits.
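
If you want to compare your own numbers, a fit can be recomputed from a two-column CSV (code words, then words, with no header row, which is the layout the script writes). The sketch below uses a hand-written ordinary least squares instead of scipy so that it runs with the standard library alone; the function name `fit_line` and the three sample rows are made up and are not my data:

```python
import csv
import io


def fit_line(rows):
    """Ordinary least squares fit of y = slope * x + intercept."""
    xs = [float(row[0]) for row in rows]
    ys = [float(row[1]) for row in rows]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations in x, and of x/y co-deviations.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up rows standing in for a real file; to use real data, replace
# `sample` with an open file handle such as open('wordsvcodewords.csv').
sample = io.StringIO("100,60\n200,120\n400,240\n")
slope, intercept = fit_line(list(csv.reader(sample)))
print("y = %.2fx + %.2f" % (slope, intercept))
```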

Here's the code (a +Dropbox link is here):

#!/usr/bin/env python
"""Compare texcount and wc word counts for every .tex file under a directory."""
import argparse
import csv
import fnmatch
import os
import pickle
import subprocess

from matplotlib import pyplot as plt
from scipy import stats

parser = argparse.ArgumentParser(description="A simple script to find word counts of all tex files in all subdirectories of a target directory.")
parser.add_argument("directory", help="the directory you would like to search")
parser.add_argument("-t", "--trim", help="trim data percentage", default=0)
args = parser.parse_args()

directory = args.directory
p = float(args.trim)  # parsed for compatibility with the previous script; unused below


def run_count(command):
    """Run a counting command and return (count or None, stderr output)."""
    process = subprocess.Popen(command, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    out, err = process.communicate()
    try:
        # The first field may be a sum such as 10+2+0, so add up the parts
        # (this replaces the eval call used originally).
        return sum(int(part) for part in out.decode().split()[0].split('+')), err
    except (IndexError, ValueError):
        return None, err


# Recursively collect every .tex file under the target directory.
matches = []
for root, dirnames, filenames in os.walk(directory):
    for filename in fnmatch.filter(filenames, '*.tex'):
        matches.append(os.path.join(root, filename))

wordcounts = {}
codewordcounts = {}
fails = {}
for f in matches:
    print("-" * 30)
    print(f)

    # Words as counted by texcount (LaTeX code stripped away).
    count, err = run_count(['texcount', '-1', f])
    if count is not None:
        wordcounts[f] = count
        print("\t has %s words." % wordcounts[f])
    else:
        print("\t Couldn't count...")
        fails[f] = err

    # All whitespace-separated words, code included, as counted by wc.
    count, err = run_count(['wc', '-w', f])
    if count is not None:
        codewordcounts[f] = count
        print("\t has %s code words." % codewordcounts[f])
    else:
        print("\t Couldn't count...")
        fails[f] = err

suffix = directory.replace("/", "-")
with open('latexwordcountin%s.pickle' % suffix, "wb") as pickle_file:
    pickle.dump(wordcounts, pickle_file)
with open('latexcodewordcountin%s.pickle' % suffix, "wb") as pickle_file:
    pickle.dump(codewordcounts, pickle_file)

# Keep only the files that both tools managed to count.
counted = [e for e in wordcounts if e in codewordcounts]
x = [codewordcounts[e] for e in counted]
y = [wordcounts[e] for e in counted]

slope, intercept, r_value, p_value, std_err = stats.linregress(x, y)

plt.figure()
plt.scatter(x, y, color='black')
plt.plot(x, [slope * i + intercept for i in x], lw=2, label='$y = %.2fx + %.2f$ ($p=%.2f$)' % (slope, intercept, p_value))
plt.xlabel("Code words")
plt.ylabel("Words")
plt.xlim([0, plt.xlim()[1]])
plt.ylim([0, plt.ylim()[1]])
plt.legend()
plt.savefig('wordsvcodewords.png')

# Write one (code words, words) row per file.
with open('wordsvcodewords.csv', 'w', newline='') as csv_file:
    wrtr = csv.writer(csv_file)
    for row in zip(x, y):
        wrtr.writerow(row)