Things have been a bit crazy lately in the bug counting lab. A few people are moving on to new things, and I’m in a rush to wrap up projects with them before we stop interacting on a daily basis. Last week, it was all about bees, but not the bees I promised you in the previous post. These were urban bees. These bees had a faster-paced lifestyle and were more demanding than the simple, salt-of-the-earth bees I’ve come to love in agricultural systems. Maybe I’m just reading too much into it.
To make up for my lack of gentle cajoling to get you and your data through, I’ve found a few like-minded people to fill in for me. This first guest post is by the lovely and talented Ignasi Bartomeus, who wants you to be consistent with your style. HEED HIS WORDS FOR THEY ARE WISE.
“What was venerated as style was nothing more than an imperfection or flaw that revealed the guilty hand.” (Orhan Pamuk, My Name Is Red)
The highest compliment you could pay an Islamic miniaturist in the year 1500 was to say that his work was indistinguishable from that of the old masters. To have a style of one's own was a sign of imperfection. What does this have to do with data management? Well, I could argue that when creating your dataset you should follow the old masters and make your dataset's formatting (not your data!) indistinguishable from that of any other dataset. That would make analyzing data (and especially #otherpeoplesdata) much easier, because everyone would use the same conventions. Unfortunately, we don’t really have old masters to follow; the community of researchers creating data is too large and heterogeneous to agree on one style to rule them all*. So I am not going to recommend being styleless, but the opposite. Know your style, and make it easy to identify.
When I refer to style I am not talking about the substance (e.g. how you structure data in variables, observations and values**), but about the form (e.g. naming conventions). While an inconsistent style doesn’t affect the quality of the data, a consistent one really helps you reuse code and speeds up the cleaning and analysis process. For example, be consistent about which file formats you use, how you name the variables and what symbol represents no data, and be consistent both among and within data tables. I am biased towards the R world, and I personally like to use comma-separated (csv) tables, use variable_name style, never use CAPS and use NA for values with no data. If your data is tab separated and you use NULL for missing values, I may complain about it once, but I can a) easily tweak your data to look exactly the way I like data to look, e.g.
colnames(data) <- gsub(".", "_", colnames(data), fixed = TRUE)
which I might do before combining it with other data, or, most likely, b) just get used to your notation very quickly. However, if you have no style and mix naming conventions at random (name.var3, say), I will complain every single time I call one of your variables, for example:
> head(data$Variable.name)
Error: object 'Variable.name' not found
Oh wait, maybe it was:
> head(data$Variable_name)
Error: object 'Variable_name' not found either
> head(data$variable_name)
Error: object 'variable_name' not found, again.
At this point you will be forced to call colnames(data) once again:
> colnames(data)
> head(data$variable.name)
So we can argue about which style is better, whether if_using_underscore_increase_readability or whether a dot is.faster.to.type. You can choose your own set of rules, but be consistent and you will save a lot of time in the long run when working with command-line-style software.
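As a minimal sketch of what "pick a style and enforce it" can look like in R, the snippet below converts whatever column names arrive into lower-case, underscore-separated ones in a single step. The helper normalize_names() is hypothetical (it is not from any package), and the toy data frame is made up purely for illustration.

```r
# Hypothetical helper: convert names to lower-case, underscore-separated style.
normalize_names <- function(x) {
  x <- gsub("[^A-Za-z0-9]+", "_", x)  # runs of dots, spaces, etc. become one underscore
  x <- gsub("^_+|_+$", "", x)         # drop leading and trailing underscores
  tolower(x)                          # never use CAPS
}

# Toy data frame with deliberately mixed naming styles (illustrative only)
data <- data.frame(
  Variable.name = 1:3,
  Site_ID       = c("a", "b", "c"),
  name.var3     = c(0.1, NA, 0.3)
)

colnames(data) <- normalize_names(colnames(data))
colnames(data)
# [1] "variable_name" "site_id"       "name_var3"
```

One caveat with this approach: two names that differ only in style (say Var.name and var_name) would collapse into the same name, so it is worth checking anyDuplicated(colnames(data)) after renaming.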
*I don’t have a note for that, but as a guest, I am trying to follow the blog’s style of having lots of footnotes.
*** And no, using colors in Excel is not stylish; it is a sign of having too much free time (unless you are synesthetic).