Using Business Intelligence to Improve the Quality of Your Data
Cleaning data was not my intended topic for this post, but it was the focus of my week so I thought I would share what I learned from the experience. I probably just lost half of you, but if the rest of you will push through this post, you will find that cleaning data can be a valuable exercise that can teach you as much about your data as the actionable KPIs you are searching for.
I started working with the CMS cost report data a couple years ago. The structure is fairly simple, for each cost report there is a report record with a unique report number. Linked to these report records are two tables, one with the alpha data items and another with numeric. Each of these is referenced by the report number, worksheet, line and column. These tables include every individual data item contained in the cost reports disassembled and electronically stored.
This is a description of this data model from the CMS HCRIS (Healthcare Cost Report Information System):
By recombining this data through your favorite BI tool, you can reproduce the cost reports that were submitted by all healthcare provider organizations to CMS.
In the previous posts, we have connected this data to develop a better understanding of the current home health revenue cycle and regulatory issues. Many of you might imagine a set of organized and complete cost report data. Every report is error checked by the agencies, the MACs (Medicare Administrative Contractors), and CMS for accuracy as it is submitted, verified and added to the existing CMS cost report databases. The actual data falls short of this expectation.
On the CMS cost report site, they begin with a disclaimer. They state that they are not responsible for conclusions generated from this data, however, they bet the bank on the cost report data themselves and it is integral to each CMS provider payment model under Medicare Part A.
Cost reports are processed similar to tax returns. After the end of a provider fiscal year, HHAs have a deadline to submit the reports to their MAC. These are the same contractors that process claims submitted to Medicare. The cost reports are completed by the agencies or by vendors that do it for them. The report data is then posted and submitted to the MAC. The MAC accepts the report or returns it to the agency to be corrected. Agencies can later submit amended cost reports.
Many of the issues I will describe to you, CMS is aware of. To deal with them, they go through a cleaning process when they combine the cost report data with claim data to develop the CMW (Case Mix Weights) for the proposed rules each year.
The objectives of cleaning data can vary, but the general idea is to omit data that is known to be inaccurate so that the accurate data is not contaminated. Like a physician, a data cleaner must first do no harm. By that I mean you should make sure that valid data is not excluded, if possible. If part of the data is correct, but not all of it, is there a way to keep what is valid? On the other hand, if part of the data is invalid, does it mean that we should not trust the rest of it? These are the questions we will process as we go through these steps with the cost report data.
In my last blog post, I used cost report data recently updated to include reports submitted for the 3rd quarter of 2023. A few reports were added to the 2021 data, but it is mostly finished.
For 2021, there were 10361 cost report records in the downloaded data. Assuming these were all valid cost reports, each would have had associated numeric and alpha data elements, some do not. This leads us to the first cleaning criteria:
Exclude any reports that are header records only without any numeric or alpha data elements.
The related numeric data elements include all financial data, census, visits, etc. It is difficult to understand why there are cost reports without them, but there are quite a few. Using your BI tool, you can create a calculated field in the report record counting the number of records in the related tables. I did this and excluded all reports with no related data. This took the total number of “valid” cost reports from 10361 to 7330.
An important lesson at this point is to avoid the temptation to attempt to solve all of the mysteries you will encounter in the cleaning process. Why do these header records exist with no related data? What is their purpose? They can be time killers and distract you.
The important thing is that any report without any financial data is useless. More importantly, they should be excluded from the data so that when you use the number of reports as the denominator in a fraction, like a percentage, you know that only reports with data in the numerator are included. If you put on your software developer hat, you might want to find out why this is happening and develop better edits to prevent it, but at this point, that ship has sailed for us as data researchers depending on this data for insights.
The next issue I discovered was that although some reports had numeric data, they did not include the most important data for CMS, and for us, cost data. In particular, they did not report any salaries for the six clinician disciplines who visit home health patients. I created a calculated field that added up all these salaries and excluded any cost report that did not have a value greater than zero. That took us down to 6716 valid cost reports for 2021.
In addition to the financial revenue and cost data, we will depend on valid census and visit data for subsequent research. I added another criteria to exclude all cost reports where they did not have a value of more than 0 for total visits or total census. This gave us 6568 valid records using this collective criteria:
Exclude any reports that are header records only without any numeric or alpha data elements.
Exclude any reports without clinician salaries.
Exclude any reports with a total census of 0 or 0 total visits.
This criteria was applied to the cost report data in Sisense for all the numbers and charts I have provided in my blog posts so far.
There are three data elements that we depend on for the revenue cycle KPIs we will develop with this data, revenue, census and visits. There are four data elements reported for each of these metrics, one for each financial class (Medicare, Medicaid, Other) and a total. I began to notice some report records where the reported totals did not match the three individual numbers by financial class added together. In other words, the field Total Census did not equal Medicare Census + Medicaid Census + Other Census.
As I explored this problem, I also noticed that some reports had Other Census, for example, but the Other Revenue value was zero. I created a field to flag all records where the submitted totals did not match the actual total of the individual data elements. It turned out that about 30% of the remaining “valid” cost reports had at least one of these problems.
At this point in the cleaning process, you have two bad choices, eliminate a substantial portion of your data and possibly alter the results or include invalid or questionable data. Fortunately, with BI tools, there can be other choices in between where you can revise this data based on rules to make it more accurate without throwing it away.
I examined the data metrics, starting with revenue. I found that without exception, in the set of validated cost reports, the total revenue always matched the sum of the three revenue figures that made up the total. For the other two metrics, census and visits, this was not true in a minority, but substantial, portion of the data.
What this tells me is that total revenue is actually calculated based on the sum of the other values entered, but the other two totals were entered by the agencies when the other metrics were entered.
It is unlikely that if the revenue was keyed and not validated, like the other totals, that no one would make a keying error since there were keying errors on the other values in over 20% of the validated reports. The fact that the other totals were wrong 20% of the time points to keying errors that were not validated or checked manually. The same is true of the reports where revenue was entered for a specific financial class, but no census, etc.
To fix this, I created a calculated total for census and visits and used it instead of the total entered in the cost reports for these metrics.
On the issue of the mismatch of revenue with census and visit stats, I made the assumption that the presence and absence of revenue data by financial class was more reliable then the other unit statistics. If a cost report included census or visits by a financial class, but no revenue for that class, I replaced the individual census and visit stats present in the data with zero.
I ran into another issue that was specific to one large group of agencies and how they entered their cost report data. They would report revenue for all three financial classes, but census for Medicare and Medicaid only. However, the entered total census was always greater than Medicare and Medicaid added together and the difference seemed to generally match what you would expect for the missing “Other Census”. To fix this, I replaced the 0 other census values with the difference in the total census and the sum of Medicare census and Medicaid census.
I refer to these types of corrections as “data fixes”. In general, it is usually better to simply exclude the records as untrustworthy instead of cleaning data at this level, but in this particular case, it would have prevented me from including data from the largest HHA group in the industry. This data is very valuable for a variety of reasons so in my judgment, it was worth applying this fix to keep it.
For whatever reason, there were many reports where the total census entered was greater than the actual census total for the three financial classes. Here is a partial list of the reports with the column Census Diff being the difference between the entered census total and the calculated actual. This problem is present in 1755 (26%) of the 6716 otherwise valid cost reports from 2021. Of these, 72% have an entered census greater than the actual total census. This leaves 74% of the reports without any of these issues. They have a Census Diff of 0.
The columns starting with a “C” at the beginning of their name were my calculated results for the individual census stats. Most of the time, they match the original value. The highlighted items were set to zero because no revenue existed for that financial class. These adjusted census stats are included in the calculated total, “Clean TotalCensus”.
This problem is present for cost reports for all three years. Due to the initially incorrect census totals in the 2021 data, the reported total census of 7,345,706 was reduced to 6,825,316 using a calculated total. This a 7% decrease. In 2022, the decrease is 6.7%.
Since the revenue data did not change and census went down, the impact on our KPIs is that the metrics went up, per census. This includes expenses per census since this is calculated using total operating expenses for the year divided by total census.
After this cleaning was done, I had to alter all the charts that previously used the excluded records or invalid subtotals so that they used cleaner, calculated data. None of the charts changed in a way that would alter any previous conclusions. The ratios of the bars and pie charts were basically the same, but some of the values were different. Here is a before and after view of the Medicare Part A Revenue per Census and Profit Margins for the last three years.
Original Sisense charts for Medicare Part A
Updated Sisense charts for Medicare Part A
The largest impact of the update is larger operating expenses per census with the decreased industry census from the cleaning. This pushed all the actual margins down by about 5% each year. For the net profit, with Part A and MA combined, the margins decreased by 7%. Here is the updated net revenue and margins after the cleaning:
I believe that this version of the data model provides more accurate KPIs than the version using the original data. In the future, if I find other data related problems, I will document any actions I take to either invalidate the records or clean the data.
Sometimes even seemingly useless data has value. The cost reports include balance sheet items as well. These include total assets, AR, liabilities, etc. While examining the data, I found that most HHA organizations that owned multiple agencies used their collective assets on the cost reports for each individual agency because they do not book the assets separately in their own financials.
Using this data, we can connect these agencies together by comparing those with similar names and equal total assets from the cost reports. With this technique, I have been able to group these agencies into the organizations many of us know like Amedisys, Centerwell, and Bayada.
As I promised previously, in the next post, we will begin looking at parts of the overall data to see if groups like these perform differently than the industry as a whole for the KPIs we have measured so far.
コメント