Big data = big problems...
Sorry for using such a ridiculous buzzword. I'm not working with exabytes or anything here, but my data set is huge and more awkward to handle than the three large suitcases I had to single-handedly drag from the baggage claim to the taxi line outside the San Jose airport. Maybe it's because the seismic signals I'm working with are in the .sac file format and thus require learning how to work with SAC in order to do anything useful with them, or maybe it's because I'm working with recordings from three channels at each of seven stations that sample at 40 samples per second for a little over one month. That means somewhere around 2,177,280,000 samples. Wait, that actually doesn't look that bad written out! It's not as much as I thought! And it's not like the raw data is really 2 gigs anyway; it's just clunky. I'm actually feeling much better about all this now. These blog posts really are quite helpful.
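For anyone who wants to check that sample count, here is a minimal sketch in Python. It assumes the "little over one month" is rounded to exactly 30 days, which is the value that reproduces the figure above; the constant and function names are made up for illustration.

```python
# Back-of-the-envelope sample count for the network described above.
# ASSUMPTION: the recording span is rounded to 30 days.

CHANNELS_PER_STATION = 3
STATIONS = 7
SAMPLE_RATE_HZ = 40              # samples per second, per channel
SECONDS_PER_DAY = 24 * 60 * 60
DAYS = 30                        # assumed length of the recording span


def total_samples(channels, stations, rate_hz, days):
    """Return the number of samples recorded across every channel."""
    return channels * stations * rate_hz * SECONDS_PER_DAY * days


samples = total_samples(CHANNELS_PER_STATION, STATIONS, SAMPLE_RATE_HZ, DAYS)
print(f"{samples:,} samples")
```

Changing DAYS to the true length of the deployment window would give the exact count rather than the rounded one.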
My data set was collected using a "completely horrible mish-mash of things," as described by one of my officemates. A network of temporary seismic recording stations was deployed across southern California by some students in my research group about two years ago. The network consists of five types of broadband seismometers from three different companies, which, now that I think about it, helps explain why the appearance of the signals varies so greatly from station to station. Since this network was deployed by my group here at Stanford, the data collected from the stations in the network has neither been seen nor used by any other researchers.
I'm working with the raw, continuous signals picked up by these stations to find events that occurred in a seismic swarm about one year ago. Working with raw data gives me the confidence that I'm not missing any important information, but it also gives me a little bit of anxiety that anything that happens to this data set is definitely my fault. It's like babysitting a six-month-old child. It's not going to hurt itself, right? If I turn my back to pour a cup of tea and the dog sits on it, it's totally my fault! The raw data thing is kind of a major responsibility, but I'm excited to see what I can find.
Just for fun, let's try another number dump. Using pattern-matching code, one template on one channel of one day of data gave me 600 potential matches, and another gave me 360. To make this fair, let's underestimate a bit and say the average template will find 300 matches per run. That means running my 290 templates over all the channels over the course of the month could find 54,810,000 events, or more! That was really just for fun, though, because there's no way I'm letting that happen. As if there's even enough heap memory to allow for all those plots, am I right? Back to work on the code I go! Is it normal for all the wiggles to start moving after a while?
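The number dump above can be reproduced with a short sketch, again in Python. It assumes one run means one template on one channel for one day, that there are 21 channels (three at each of seven stations), and that the month is rounded to 30 days; all names here are hypothetical.

```python
# Rough upper bound on potential template matches.
# ASSUMPTIONS: one "run" = one template on one channel for one day,
# 21 channels in total, and a 30-day month.

MATCHES_PER_RUN = 300    # deliberately low average from the two trial runs
TEMPLATES = 290
CHANNELS = 3 * 7         # three channels at each of seven stations
DAYS = 30                # assumed length of the month


def potential_matches(per_run, templates, channels, days):
    """Return matches per run multiplied by the total number of runs."""
    return per_run * templates * channels * days


matches = potential_matches(MATCHES_PER_RUN, TEMPLATES, CHANNELS, DAYS)
print(f"{matches:,} potential matches")
```

Swapping in the observed 600 or 360 matches per run would push the estimate higher still, which is why the figure is only a floor on the trouble ahead.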