25 December 2008

Statistical Abstraction

Getting exiftool to cough up lens type and focal length data is straightforward; it just takes ages to do it for 11,944 files.
for x in */*/*dng; do exiftool -FocalLength "$x" >> LengthList; done
for x in */*/*dng; do exiftool -LensType "$x" >> LensList; done
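
A fair part of the "ages" is probably exiftool starting up once per file, twice over. As a rough sketch, assuming the same directory tree of .dng files, a single run can pull both tags for every file. The function name collect_lens_data is made up for illustration; -T (tab-separated rows), -r (recurse into subdirectories) and -ext (restrict by extension) are standard exiftool options.

```shell
# Hypothetical helper: one exiftool run over a whole directory tree.
# $1 is the top-level directory to scan.
#   -T        one tab-separated row per file (focal length, then lens type)
#   -ext dng  only process files with a .dng extension
#   -r        recurse into subdirectories
collect_lens_data() {
    exiftool -T -FocalLength -LensType -ext dng -r "$1"
}
```

Something like `collect_lens_data . > LensData` would then leave one row per photo, with the focal length and the lens name separated by a tab rather than by spaces.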

Focal length is a simple single number; that means
sort -u -g -k4,4 LengthList > UniqueLengthList
for x in $(gawk '{print $4}' UniqueLengthList); do echo "$x $(grep -wcF -- "$x" LengthList)" >> ShotsPerLength; done
will get me a useful table:

Length(mm) Shots
0 27
14 119
31 1565
35 916
50 820
55 332
62.5 27
70 398
75 19
77 849
77.5 24
80 12
85 4
87.5 35
97.5 58
100 2405
107.5 34
108 21
120 72
133 52
135 91
150 150
170 199
190 351
210 250
230 153
240 14
260 135
300 2812
Which can be graphically represented as:

[Graph: shots per focal length]
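
The counting itself can also be done in one pass over the list, rather than one grep per distinct focal length. What follows is only a sketch: it assumes each input line looks like exiftool's usual "Focal Length : 100.0 mm" output, and count_lengths is a name invented here.

```shell
# Hypothetical helper: reads exiftool -FocalLength output lines on stdin
# and prints "length shots" pairs, sorted numerically by length.
count_lengths() {
    awk -F': *' '
        {
            sub(/ *mm$/, "", $2)   # drop the trailing unit
            count[$2 + 0]++        # "+ 0" folds 100.0 and 100 into one key
        }
        END { for (len in count) print len, count[len] }
    ' | sort -n
}
```

Run as `count_lengths < LengthList > ShotsPerLength`, this reads the file once however many distinct lengths turn up.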

Lens names are more interesting to extract; I had to use perl, rather than a combination of awk and grep, since having parens and square brackets in lens names that span several fields (from awk's point of view) made things Interesting. One can get the whole lens name out just fine; the trouble is handing it to grep. Quoting the names so the shell doesn't split them on spaces either lumps all the names together into one string (zero matches), or the quoting fails and every name gets split into its space-separated elements, leaving many matches, few of them useful. But a little perl gives me the useful:

Lens Shots
M-42 or No Lens 27
Sigma Lens (3 255) 2120
smc PENTAX-DA 14mm F2.8 ED[IF] 119
smc PENTAX-DA 35mm F2.8 Macro Limited 916
smc PENTAX-DA 55-300mm F4-5.8 ED 3044
smc PENTAX-DA 70mm F2.4 Limited 83
smc PENTAX-FA 31mm F1.8AL Limited 1565
smc PENTAX-FA 50mm F1.4 820
smc PENTAX-FA 77mm F1.8 Limited 849
smc PENTAX-FA MACRO 100mm F2.8 2401

Which can be graphically represented as:

[Graph: shots per lens]

I'm not sure what this tells me. I use the long zoom a great deal, I use the 100mm macro a great deal, and I really like the 31mm Ltd. out of the primes. I would probably use a fast 200mm lens if I had it (that cluster around 200mm in the focal length graph is me trading off focal length for f-stop number while taking pictures of feeder birds). I also spend a lot of time at 300mm and would probably use a longer lens if I had it.

None of this is precisely news, but I suppose it's a good thing to be able to back it up numerically.


Jeremy Leader said...

For that sort of counting, I typically use something like "cut -d ' ' -f 4 < UniqueLengths | sort -ng | uniq -c". (Not sure if the cut is equivalent to "gawk '{print $4}'" or not)

That way, it only makes one pass over the list. I suspect it might even avoid the need for Perl in processing the lens list.

Graydon said...

Using cut to get field number four would work much the same way as telling gawk to print field 4. The problem I had was with the lens names, which are space-separated strings. Those get split up by the shell unless they are in quotes, and it's beyond me to put each name in quotes (fields 5 through 8) without putting all the names in quotes as one string, which isn't ever going to match anything.

There is probably a way to do that (and I'll take a poke at cut to see if it knows how) but perl was easier for the lens names just because it got me away from the shell interpolation breaking everything at each space.
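
For what it's worth, one way to stay in the shell is to read the names a line at a time, so that each whole name lands in a single quoted variable, and to have grep treat it as a fixed string rather than a pattern. This is a sketch only: shots_per_lens is an invented name, and it assumes LensList holds exiftool's usual "Lens Type : name" lines.

```shell
# Hypothetical helper: prints "lens name<TAB>shots" for each distinct lens.
# $1 is a file of exiftool -LensType output lines (LensList in the post).
shots_per_lens() {
    names=$(mktemp)
    # Strip everything up to the first colon, leaving just the lens name.
    sed 's/^[^:]*: *//' "$1" > "$names"
    # IFS= and read -r keep each line intact, spaces and all.
    sort -u "$names" | while IFS= read -r lens; do
        # -F: fixed string, so [IF] and (3 255) are not regex syntax
        # -x: match whole lines only; -c: count the matching lines
        printf '%s\t%s\n' "$lens" "$(grep -cxF -- "$lens" "$names")"
    done
    rm -f "$names"
}
```

The quoting that matters is "$lens": inside double quotes the shell passes the name to grep as one argument, however many spaces it contains.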