Tuesday, October 30, 2012


Rogue Data

We have an issue with free-range databases.

Like most colleges, mine was born before IT became a fact of life.  IT had to be grafted onto a pre-existing culture, or, more accurately, set of micro-cultures.  Different departments and support programs have their own ways of doing things; some have welcomed technology, some have grudgingly adapted, and some have shoved it over in a corner, hoping it would eventually go away.

In concrete terms, that means that we have one master system for data, based on our “live” ERP system -- that’s the system that handles student registrations and scheduling, among other things -- and a whole set of other mini-systems housed in various departments, usually running rogue Excel spreadsheets.

That’s not necessarily a bad thing, of course.  Individual departments or programs have specific needs, and if they’re simply using a custom recipe to mix the same ingredients as everybody else, I have no objection.  Yes, some are savvier about data analysis than others, but almost any high-level skill is unevenly distributed.  I don’t see that as a crisis.

The problem is that they draw data at different times, and define terms differently.  So individual programs are receiving different facts.  This leads to plenty of low-level conflict.

(My favorite was a few years ago, when an advocate for a particular intervention came to me with her own customized data on the success rates of students who tried it.  She had a small sample, but the percentages were impressive.  When I looked more closely at the numbers, I saw a gap, and asked about it.  She replied that she didn’t count the student who dropped out to follow his girlfriend, on the grounds that it had nothing to do with her program.  I nearly fell off my chair.)
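The anecdote is worth making concrete.  Here's a back-of-the-envelope sketch, with entirely made-up numbers, of how quietly trimming a single unfavorable case from a small sample inflates the headline rate:

```python
# Hypothetical illustration: in a small sample, dropping one
# unfavorable case can swing the percentage noticeably.
def success_rate(successes, total):
    """Return the success rate as a percentage."""
    return 100.0 * successes / total

# Suppose 9 of 11 students in the pilot succeeded.
full_sample = success_rate(9, 11)

# Quietly exclude the one who "dropped out for unrelated reasons,"
# so it's now 9 of 10.
trimmed_sample = success_rate(9, 10)

print(f"full sample:    {full_sample:.1f}%")   # ~81.8%
print(f"trimmed sample: {trimmed_sample:.1f}%")  # 90.0%
```

With eleven students, one judgment call moves the number by eight points; with a hundred, it would barely register.  That's why small-sample, self-reported percentages deserve a second look.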

Now we’re looking at corralling the various databases into a single, unified set of queries drawing from the same data at the same time.  Which is simple enough conceptually, but it involves getting the folks who’ve only grudgingly made peace with Excel to start wrestling with the campuswide ERP system in a serious way.  This is no small thing.  And it involves having each separate program give up some control over its own data.  
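The "same data at the same time" idea can be sketched in miniature.  This is a hypothetical illustration, not our actual ERP schema; the table, columns, and census date are all invented.  The point is that every office runs the same parameterized query against the same dated snapshot:

```python
# A minimal sketch of "one query, one snapshot": every department's
# report runs against the same extract, taken on the same census
# date, instead of each office pulling its own numbers on its own
# schedule.  Schema and data are hypothetical.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE enrollments (
        student_id INTEGER,
        program    TEXT,
        status     TEXT,   -- 'enrolled', 'completed', 'withdrawn'
        as_of      TEXT    -- snapshot date the row was extracted
    );
    INSERT INTO enrollments VALUES
        (1, 'Nursing',  'completed', '2012-10-15'),
        (2, 'Nursing',  'withdrawn', '2012-10-15'),
        (3, 'Business', 'completed', '2012-10-15');
""")

CENSUS_DATE = "2012-10-15"  # everyone reports from this snapshot

def completion_rate(program):
    """Shared definition: percent of the program's snapshot rows
    marked 'completed'.  Every office uses this same query."""
    row = conn.execute("""
        SELECT 100.0 * SUM(status = 'completed') / COUNT(*)
        FROM enrollments
        WHERE program = ? AND as_of = ?
    """, (program, CENSUS_DATE)).fetchone()
    return row[0]

print(completion_rate("Nursing"))   # 50.0
```

Because the snapshot date and the definition of "completed" live in one place, two programs can no longer disagree simply because they pulled data in different weeks.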

In a perfect world, that wouldn’t matter; if anything, it could be seen as offloading some work.  But experience tells me that some folks like to use data to tell the story they want told.  That’s often based on good intentions, and sometimes on local knowledge that is easily lost in an aggregation.  But experience tells me it’s also based on a sense of control, of filtering which numbers get out, and of not wanting to change how things are done.

That’s not necessarily an entirely bad thing, of course.  I cringe when I hear people who should know better use idiotic, if technically “correct,” statistics to indict community colleges.  (For example, graduation rates that count early transfers as dropouts drive me around the bend.)  But I don’t think the answer to that is to hide from statistics or to cherry-pick idiosyncratic measures.  The battle to have is over definitions and relevance, and that battle should be had openly.  If we’re using the wrong measures to evaluate a program, that’s probably a sign of not really understanding the program; an open discussion of the measures could lead, if indirectly, to a better understanding of the program.
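To make the graduation-rate complaint concrete, here's a quick sketch with hypothetical numbers showing how the single definitional choice about early transfers moves the headline figure:

```python
# Hypothetical illustration: the same cohort yields very different
# "graduation rates" depending on whether early transfers count as
# successes or as non-completers.  All numbers are invented.
def grad_rate(graduated, transferred, cohort,
              count_transfers_as_success=False):
    """Percent of the cohort counted as successful under the
    chosen definition."""
    numerator = graduated
    if count_transfers_as_success:
        numerator += transferred
    return 100.0 * numerator / cohort

cohort, graduated, transferred = 100, 25, 30

print(grad_rate(graduated, transferred, cohort))        # 25.0
print(grad_rate(graduated, transferred, cohort, True))  # 55.0
```

Same students, same outcomes; the definition alone more than doubles the rate.  That's the battle over definitions and relevance, in two lines of arithmetic.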

Wise and worldly readers, have you been through a process of corralling rogue databases?  Is there anything in particular we should know?

We certainly have this issue on our (four-year) campus, to the extent that Microsoft Access isn't installed on IT-managed machines, in order to prevent people from doing some really silly things with it. Yes, people have Excel spreadsheets all over the place, but at least they can't get into Access and start building relational databases with cherry-picked data. There are also only a handful of people who have the authority to run queries against our ERP. But it's still an issue, and I wish you luck.
Ah, lies, damned lies, and statistics.
I think the hardest part is actually reconciling mathematical statistics with the stuff produced by some departments. Education departments in particular love to average apples and oranges that they sampled from the strawberry fields.

The previous comment about Access is a hoot. Though, in my neck of the woods, many don't know how to use Access or Excel.

If I were asked these same questions in my setting, I would suggest a committee that would establish a handbook for each school and department, outlining parameters for data collection and presentation, to be approved by a general board.

So, next time someone needs to average apples and oranges, he ought to submit a request in writing to the "Average Coordinator."

I remember once pointing out to someone that the average did not make sense, and was told that was no problem: you just announce that there were flaws in your method of data collection. Ah, apples, oranges, standard deviations.
As a data analyst at a university, I wish you good luck and godspeed. I am a huge advocate for central data sources, for all the good reasons you describe and more, but as you indicate, particular stakeholders tend to get all up in arms when the central database doesn't provide numbers as positive as those they get from Bill's back-of-the-napkin Excel file.
Also, MAKE A DATA DICTIONARY. I can't stress enough that a central database needs documentation, including a regularly updated list of the fields you are including and the specific meaning of each. Our institution has little to no data documentation and it creates tremendous confusion and misunderstanding. I went so far as to hold some meetings to discuss the metrics our department uses and nail down definitions for each, but no one at the central data office cares a whit about this, unfortunately.
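One minimal way to bootstrap the kind of data dictionary described here is to record, for every field in the central extract, a written definition, its source, and its update cadence.  A sketch, with a hypothetical field name and entirely invented details:

```python
# A minimal data-dictionary sketch: each field in the central
# extract carries a recorded definition, source, update cadence,
# and owner.  The field and its details are hypothetical.
DATA_DICTIONARY = {
    "retention_flag": {
        "definition": "1 if the student enrolled in the following fall term",
        "source": "ERP term enrollment table, fall census snapshot",
        "updated": "each fall, after census date",
        "owner": "Institutional Research",
    },
}

def describe(field):
    """Return a one-line, human-readable gloss for a field."""
    entry = DATA_DICTIONARY[field]
    return f"{field}: {entry['definition']} (source: {entry['source']})"

print(describe("retention_flag"))
```

Even a structure this simple forces the definitional arguments to happen once, in writing, instead of once per meeting.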
I would also encourage you and your institution to think of data as supporting practice/pedagogy, and not as an end unto itself. I'm a data person, but I'm also an educational theory and sociology person, so I think it's important to not lose sight of the big picture. Don't let data become the gorilla that bends others to its will. For example, sometimes a data point would be nice to have, but collection would hamstring the program in question so much that it would cause more problems than it solves. This happens all the time, and is worth watching out for.
Good luck! If you want to talk more about the wonderful world of institutional research or higher ed data, I love to talk about it.
My school is dealing with similar issues. Lots of rogue spreadsheets. Our previous VP of Tech and Research did a lot of reports and projections (e.g. headcount) personally. We had to scrap everything when he left; nobody can reproduce his numbers.

This is also a security issue. If people are pulling personal information and storing it on their personal computers, that creates additional opportunities for that information to be improperly disclosed or compromised by an outsider.
But I don’t think the answer to that is to hide from statistics or to cherry-pick idiosyncratic measures. The battle to have is over definitions and relevance.

One of the more intelligent things I've seen written on a higher ed blog.

The discourse about 'lying with statistics' rarely has anything to do with statistics and everything to do with representation and forthrightness.
I think there is a trust issue underlying some of this behavior. I know people who don't believe the official line, so they keep their own little data mines in order to be able to prove the official interpretations wrong (and protect themselves from what they perceive as malicious agendas of the official data people). I've also known people who collected their own data because the official data is so frequently and wildly inaccurate that they don't believe it is possible to learn anything from the official sources. These are clear signs of a dysfunctional institutional culture, but there you have it.
The battle to have is over definitions and relevance, and that battle should be had openly. If we’re using the wrong measures to evaluate a program, that’s probably a sign of not really understanding the program; an open discussion of the measures could lead, if indirectly, to a better understanding of the program.

That's all mom and apple pie administratively and no-one would disagree in a vacuum. But I think that Anonymous 4:32 is correct to say that it doesn't look at all like that from a faculty standpoint - processes of rationalization and data consolidation inevitably compromise departmental autonomy. This may be "for good reason" at an institutional level, but that's from your point of view, and even with a sane administrator like yourself directing the process, it's unlikely to result in anything but short-term misery for the departments involved.

You could offer various IT-related carrots to take up the new systems - our university combined an unpopular and non-consultative HR software rollout with sharply increased data allowances and speed upgrades for Internet use. The two things had nothing to do with each other, but the timing meant that people grumbled less as they were distracted by watching better quality animals on YouTube.

But the "for the greater good" argument won't wash no matter how true you think it is, so go for specific benefits somewhere else perhaps, and steer a tough line on whatever centralization you do.
I think there is a trust issue underlying some of this behavior.


When you collect your own data, you know that you have access to it. Relying on a central data source that you have only limited access to, with no idea how the data was obtained (and thus what it means), amounts to giving up a lot of control over information.

Having been in situations where access to central data was reserved for central administrators, I'm a big fan of keeping my own data. I don't have as much as management, but it serves as a valuable check on the accuracy of stats we are given about my department.
To highlight a few points that I think should be reinforced:

You should first find out why there are rogue databases in use. Some of the problems mentioned above, such as having to go through a single-person bottleneck to get at the ERP system or the failure to include some important item in the ERP data, are likely reasons.

One example concerns your own question of success of students measured by stupid federal statistics. Do you record where people go, even if it is only where they have their transcript sent? Does your state mandate telling a previous in-state school that person X with a transcript from there is now enrolled? Or even graduated?

There is also an issue of manpower if you only allow a handful of people to do a query and none of them work for the faculty.


I have seen several reasons for rogue databases (whether Excel, SQL, Access, whatever):

1) the centralized one doesn't have the data they want

2) the centralized one has overzealous access control, so the staff in the units cannot access the data/fields they need for the queries they want to write (this is a problem on my campus with Crystal Reports). Staff often already have access to the same data through canned reports that just don't tell them what they want to know, which is exactly why they want to write their own.

3) the documentation and/or interface sucks (I have to be blunt here - yes you, Banner)

4) lest I just blame "the man", the last reason is that broad range from ignorance to incompetence at the staff level of people creating the rogue databases.

Lots of reasons - solutions depend on what people want and why they are not getting it.