Diving into Data Lakes

Credit unions have one unique and incredibly powerful advantage: collaboration.

Introduction
What is a Data Lake? - Part 1: Testing the Waters
What is a Data Lake? - Part 2: Sink or Swim
Credit Unions and Data Lakes – The Next Wave
Data Pooling: Leveraging Your Neighbor’s Data
5 Reasons to Pool Your Data
CuOS – A Platform for Credit Unions Similar to Apple iOS
Data “De-Identification”: The Stairway to Big Data Heaven
Credit Union Cooperation: Google Maps Style
Why Analytics is a Credit Union Industry Opportunity

Introduction

A collaborative industry data lake is the next step to empower credit unions with capabilities that even large banks and fintech startups can only dream of at this point.

The days of the internally developed data warehouse are over. We only win as an industry by collaborating on a common data standard that supports both: a forum for building and sharing applications and a data lake that ignites the research and innovation needed to remain relevant to our members.

Key Subjects

Defining and differentiating a data lake
How a data lake can complement a data warehouse
The benefits of sharing data
The opportunities for the credit union industry

We hope you find this ebook valuable, and that your understanding of business intelligence tools is strengthened.

Sincerely,

- The Team at OnApproach

"Data opportunities multiply as the data is transformed"

- Sun Tzu

6th century BCE Chinese general, military strategist, and philosopher

Part 1

Understanding Data Lakes

Defining and differentiating data lakes, data warehouses, and other analytic tools.

What is a Data Lake? Part 1: Testing the Waters

By Mark Portz

Financial institutions all over are working to build effective data strategies and improve decision-making. With so many new technologies and innovations out there, it can get very difficult to keep up with the industry and even keep straight the buzzwords we hear throughout the day. In this piece, let’s dive in to better understand what makes a data lake.

What is a Data Lake?

Simply, a data lake is a data repository for raw data in its native format. As the name implies, these repositories are capable of holding massive amounts of data. Ideally, data lakes are available at an enterprise level and can be easily queried to find relevant data for managers to analyze.

How does it compare to a Data Warehouse?

Data Lakes and Data Warehouses have a number of similarities. Both are designed to:

House disparate data sources in a single repository
Allow improved data analytics
Provide an enterprise source for querying data

However, there are distinct differences between a Data Lake and a Data Warehouse. As the name implies, a Data Lake’s architecture is completely flat. Unlike a warehouse, in which data is integrated and organized hierarchically in files and folders, a data lake relies on a consistent set of unique identifiers and metadata tags for organization.

A key point to note about the definition of the data lake is that the data is contained in its “native form”. A primary reason to utilize a data lake is to deposit and analyze data from any number of disparate data sources. As the lake accepts any data format, data can easily be submitted to the repository, and extracted in the original format of its disparate source system. This differs from a Data Warehouse, which aggregates the various disparate sources and standardizes the data into a single source, superseding the native formats of the data.
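
The flat, tag-based organization described above can be sketched in a few lines of Python. This is only an illustrative in-memory model, not how any particular data lake product works: the `FlatDataLake` class, its method names, and the sample tags (`source`, `fmt`) are all hypothetical.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class LakeObject:
    """One raw item in the lake: untouched bytes plus descriptive tags."""
    object_id: str
    raw: bytes                      # stored exactly as the source system produced it
    tags: dict = field(default_factory=dict)


class FlatDataLake:
    """Hypothetical in-memory lake: no folders, just IDs and metadata tags."""

    def __init__(self):
        self._objects = {}

    def deposit(self, raw: bytes, **tags) -> str:
        object_id = str(uuid.uuid4())          # unique identifier for retrieval
        self._objects[object_id] = LakeObject(object_id, raw, dict(tags))
        return object_id

    def find(self, **wanted):
        """Return every object whose tags match all requested key/value pairs."""
        return [
            obj for obj in self._objects.values()
            if all(obj.tags.get(k) == v for k, v in wanted.items())
        ]

    def extract(self, object_id: str) -> bytes:
        return self._objects[object_id].raw    # comes back in its original format


lake = FlatDataLake()
csv_id = lake.deposit(b"member_id,balance\n1001,2500.00\n", source="core", fmt="csv")
lake.deposit(b'{"member_id": 1001, "channel": "mobile"}', source="online_banking", fmt="json")

core_objects = lake.find(source="core")
```

Note that the CSV and JSON items sit side by side without being converted to a common format; only the tags make them findable, which is why consistent tagging matters so much.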

Like anything worth doing, managing a data lake requires some effort. It is not a set-it-and-forget-it solution. When properly managed, a data lake and a data warehouse should not be viewed as competing products, but should create a fantastic partnership, which allows any data consumer within an organization to easily uncover answers to his or her questions from years of past transactional data.

What is a Data Lake? Part 2: Sink or Swim

By Mark Portz

The previous blog, “What is a Data Lake? Part 1”, discussed how to define a data lake, and how it differs from a data warehouse. To better understand the idea, let’s dive a bit deeper and get to know the advantages and disadvantages surrounding data lakes.

To start, there are a number of advantages data lakes serve for financial institutions:

1. Storage/Scalability – As mentioned in the definition, data lakes are capable of holding vast amounts of data, and can easily be scaled up if needed.

2. Input Form – Due to the fact that data lakes accept data in its native forms, it is easy for credit unions to dump data in from as many sources as desired. Data from any disparate sources, internal or external, structured or unstructured, can be stored easily in the data lake, regardless of the data format.

3. Price – Data lakes are a relatively affordable form of data repository. Due to the flat architecture, there is no need for special hardware systems, and the reduction in servers and licenses actually helps to cut costs.

4. Enterprise Solution – Data lakes are designed so that data can be retrieved by query enterprise-wide. Anyone in the organization with a question should be able to access the data lake and extract the necessary data for analysis.

These advantages work together to create a very compelling product for certain purposes. Prior to declaring a data lake to be your primary tool for analytics though, there are several other characteristics of this technology to consider:

1. Connectivity – As previously mentioned, data is stored in its native form. This means the various data sources are collected in the same place, but not integrated. The data silos still exist, which is a major pain-point for the purposes of enterprise analytics, as there is not a single source of truth.

2. Organization – Rather than a filing system, data lakes work by relying on metadata and a consistent tagging system. While this can be a very successful method, it requires extreme governance. Especially if it is used as an enterprise system, everyone within an organization must understand how to properly tag the data, and know the system well in order to retrieve the desired data. If not, you run the risk of maintaining a “data swamp”.

3. Data Quality – Without proper governance and organization, it can become very difficult to determine data quality. Some users may be successfully finding value from the lake, but there is not necessarily a way to track that lineage or take advantage of the reports being built by other users.

4. Training – While marketed as an enterprise solution, data lakes do not generally act as such in the real world. For it to be a true enterprise solution, it assumes every user in the organization has the necessary training to analyze and manipulate the various data sources contained within the lake. Unfortunately, this does not tend to be the case, and most “users” will rely on others within the organization to find and analyze the data for them.

Credit Unions and Data Lakes - The Next Wave

By Peter Keers

In the two previous blogs, the concept of a data lake was defined and differentiated from a traditional data warehouse. Yet, a key point was that a data lake and a data warehouse are not mutually exclusive. In fact, a structured data warehouse could be a subset of an overall data lake architecture.

Simply stated, a data lake is an effective way to store and access very large quantities of data.

What does this mean for credit union decision makers?

Delving into these questions, OnApproach Senior Engagement Manager, Pete Keers recently discussed the world of data lakes and other cutting-edge ideas with Bill Preachuk, Solutions Engineer with Hortonworks. With a mission, “to manage the world’s data”, Hortonworks describes itself as an innovator in creating, distributing and supporting enterprise-ready open data platforms and modern data applications. Among “Big Data” players, Hortonworks ranks in the top tier.

PK: Bill, what do you consider to be the major themes regarding data lakes?

BP: I think of them as a synthesis of structured data and unstructured data. Yet, they are not just a dumping ground for all kinds of data. It’s all about the business value and answering the questions that credit unions have.

PK: Why data lakes? Credit unions are just now starting to embrace the idea of the traditional structured data warehouse.

BP: Credit unions are used to fixed queries and fixed data. There’s only “rear view mirror reporting” now. Data lakes can help set a larger vision for the credit union industry. In a data lake, large volumes of structured and unstructured data can be brought into one place. When you bring transactional data from multiple sources into one place you can expand that reporting. It’s like having multiple data warehouses where you will be able to drill across and drill down.

Where it really gets exciting is where you have disparate data sources – weather, time, geography, external financial feeds, or credit rating information. It can be structured and unstructured. For example, imagine having geographic information about homes and neighborhoods sorted by geo-location or zip code. What if all that was in one place? It could be integrated and used to expand and augment existing reports. This is a step along the road but the real value comes when you can go from rearview reporting to predictive analysis.

All this data allows for forward thinking and integrated forecasting. It allows extrapolation of past information with external data resulting in faster analysis and faster forecasting. It means being able to make decisions forward and see patterns.

It’s a situation where data scientists can prepare and build out predictive analyses. It also allows the use of machine learning and augmentation via artificial intelligence. Augmentation and machine learning can not only add value but also set business direction.

PK: How can this translate into actual business benefit for a credit union?

BP: The benefits can be such things as cost savings, cost avoidance, or finding new opportunities that were not previously known. You end up with more products you can sell. A specific example for credit unions would be to look at mortgage information for particular customers and augment that with real estate information for forecasting home sales in a particular area. The analysis would involve applying patterns to see how they look forward. You may be able to see potential cash crunches in certain zip codes.

Another possible area of value might be risk avoidance: being able to apply patterns, mine information, and see potential risk threats at a credit union level, by zip code, or even with individual members. It enables detecting fraud and avoiding fraud. There’s a huge potential for security. By taking disparate log information from servers and from external rating agencies, bringing it all together, and applying machine learning, credit unions can look for those patterns which could identify service attacks and brute force attacks. It could stop threats as they happen.

PK: If a data lake can have both unstructured and structured data, what does this mean for the traditional data warehouse?

BP: A data lake does not necessarily do away with the data warehouse. The data warehouse can be replicated in the data lake and in doing so extends the utility of the data warehouse. You can run your data warehouse on the cluster.

PK: Can you define a cluster?

BP: A Hortonworks Data Platform (HDP) Hadoop cluster refers to a group of commodity computers that are connected and centrally coordinated – each with their own processors and disk. Processing and storage are all redundant and fault-tolerant by default. (A really nice intro to clusters can be found at https://hortonworks.com/apache/hadoop/).

Jobs are divided into tasks and tasks are sent to the nodes in the cluster. Each node completes tasks in parallel and the results are brought together. Massive amounts of data can be processed at a low cost using this method.

Suppose a credit union were to attempt to use a single server to build a very large data lake and was experiencing 6- to 8-hour processing cycles. In that case a single server is tasked with all the computational requirements, and cannot easily scale out. When they bring those same data feeds onto a 20- to 30-node HDP cluster, all of a sudden you have similar ETL processing to load the dimensional data warehouse but you only require perhaps a half hour of processing since you have so much more parallel horsepower you can throw at it.

The cost of storage and the cost of processing in a cluster is also much lower since it uses commodity disk, allowing you to keep full fidelity of your historical source data and all of your transaction feeds in the cluster. You can restate or add information to your data warehouse because it is there and available – you didn’t have to purge source data since it is no longer cost-prohibitive to store it.

Within the cluster you can bring all of your structured data warehouse data and other unstructured data together in one place. Multiple tools are available that allow you to explore these different types of data very quickly and derive value with little overhead. Later you can choose to store it in a highly optimized format.
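
Editor's illustration: the divide-and-combine pattern described in this answer (a job split into tasks, tasks run in parallel on nodes, results brought together) can be sketched with Python's standard library. Threads stand in for cluster nodes here, and the function names and sample transactions are hypothetical; a real Hadoop cluster distributes both the data and the work across separate machines.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_tasks(records, task_count):
    """Divide one job's records into roughly equal slices, one per task."""
    return [records[i::task_count] for i in range(task_count)]


def run_task(chunk):
    """Work done by a single 'node': total the amounts in its slice."""
    return sum(amount for _, amount in chunk)


def run_job(records, node_count=4):
    tasks = split_into_tasks(records, node_count)
    with ThreadPoolExecutor(max_workers=node_count) as pool:
        partial_totals = list(pool.map(run_task, tasks))   # tasks run side by side
    return sum(partial_totals)                             # results brought together


transactions = [("tx%d" % i, 10.0) for i in range(1000)]
total = run_job(transactions)
```

Adding nodes shortens the job only to the extent the work can be split into independent tasks, which is why ETL loads over many separate feeds are a natural fit for this approach.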

PK: When you talk about clusters handling large amounts of data, how large are we talking about?

BP: Hadoop clusters scale to petabyte levels.

But a big difference is how Hadoop handles data ingestion. Conventional relational database data has to be loaded into a single specific schema/format. This is where a huge amount of time is spent during ETL processing. But Hadoop gives you “schema-on-read” capabilities. You need only ingest your file as-is into the cluster, and if your data has any kind of structure (CSV, tab-separated, JSON, etc.) – you can instantly define a schema on the file and use the data immediately.

As soon as you have that, which only takes a second after the file shows up, you can issue SQL queries against it, and that can validate your data. At that point you have your data available. Data scientists love it. They can analyze data quickly without load processing overhead.
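
Editor's illustration: a minimal sketch of the "schema-on-read" idea, assuming a small CSV file with made-up columns. The raw text is kept exactly as it arrived, and the schema (`LOAN_SCHEMA`, a hypothetical name) is applied only at the moment the data is read. A plain Python filter stands in for the SQL query mentioned above; this is not Hive or Spark code.

```python
import csv
import io

# Raw file exactly as it landed in the lake; nothing was converted on ingest.
RAW_FILE = (
    "member_id,loan_type,balance\n"
    "1001,auto,15200.50\n"
    "1002,mortgage,210000.00\n"
    "1003,auto,8300.00\n"
)

# Schema declared at read time (hypothetical column names and types).
LOAN_SCHEMA = {"member_id": int, "loan_type": str, "balance": float}


def read_with_schema(raw_text, schema):
    """Yield typed rows by applying the schema while reading the raw text."""
    for row in csv.DictReader(io.StringIO(raw_text)):
        yield {column: cast(row[column]) for column, cast in schema.items()}


rows = list(read_with_schema(RAW_FILE, LOAN_SCHEMA))
auto_balance = sum(r["balance"] for r in rows if r["loan_type"] == "auto")
```

Because the schema lives outside the file, a different team could read the same raw file with a different schema without reloading anything.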

PK: It sounds like these traditional data stores could evolve to the point where everything migrates to Hadoop clusters, where they would be in a more efficient environment.

BP: I came originally from the traditional SQL relational database world, and what I see is a kind of convergence. Relational databases are adding more Big Data capabilities, and at the same time Big Data SQL offerings (like Hive LLAP, Spark SQL, and Phoenix) are adding more and more of the capabilities available in relational databases.

But it’s not just performance improvements. It’s that open source software overall improves at a pace far quicker than closed source software. For example, four years ago security, metadata handling, and stream processing were rather immature in Hadoop. Now you’ve got Ranger and Kerberos built/integrated for security and Atlas for metadata. You also have massive improvements to the ability to stream in your data in real time, process it, and cleanse it in-stream with Storm, Kafka, and NiFi. All built in the open, with contributors and committers from dozens of organizations and companies.

I recall there were 12 eco-system tools in a Hortonworks Cluster 3 years ago and now there’s almost 30.

PK: As I said before, credit unions are just now embracing the concept of the traditional data warehouse. Isn’t this asking the industry to move a bit too fast?

BP: And perhaps they aren’t able to move too quickly.

PK: So, it seems to me that there’s an opportunity for a vendor to create a product that will help with a standardized solution. It would take some of the complexity out of the situation. OnApproach is interested in partnering with companies like Hortonworks to bring in the concept of data lakes and unstructured data to reduce the complexity so credit unions can more quickly get the value out of it. A credit union wouldn’t have to get its own data scientists or hire a whole crew of employees who know how to run this. It would be an effort to bring value without big cost or having to hire their own experts.

BP: Yes, this would be a situation where credit unions that already have a data warehouse can augment what they have and make it better. It would provide them the opportunity to develop and sell entirely new products that are built upon this data that they would never have been able to do themselves.

PK: A huge component of this that we haven’t touched on yet is the prospect for multiple credit unions to pool their data in a data lake. OnApproach has been looking for a way to pool data across the industry. As we’ve been discussing, the technology to do this is readily available. I think the challenge is to convince credit unions to contribute their data and articulate why it is valuable to pool data across credit unions. My sense is if the immense value of the opportunity can be clearly communicated, credit unions will gladly join in to get that value. Again, one of the value opportunities is not only sharing data in a pool but sharing a centralized group of data scientists.

BP: There’s great value in being able to take information from multiple credit unions as an anonymized aggregate and find patterns in the customers and in the states – then being able to package that up for them across the country. Individual credit unions cannot do this alone. There’s such a value in bringing all that data together. It’s a terrific opportunity.

Data Pooling: Leveraging Your Neighbor's Data

By Mitch Nelson

The trend of data-driven decision making is exploding within the credit union space. Pressures to increase revenue, reduce risky assets, and efficiently identify qualified sales leads have all contributed to the growing trend. But as the push for data-driven decision making has gained popularity, the need for a wider breadth of data has become apparent.

For most decision making, credit unions need only leverage the data within their own walls. However, some types of decisions require a larger volume of data. Analyses such as credit risk forecasts require immense amounts of underlying data to be accurate. Most credit unions by themselves do not have the critical mass of necessary data for such forecasting. However, in the collaborative spirit of the industry, credit unions can join forces to amass an adequate amount of data through the process of data pooling.

What is Data Pooling?

The process of data pooling involves multiple credit unions securely transmitting their data to a data pool provider. The data is compiled into a single data set where algorithms created by the data pool provider generate analytic results. The results are then sent back to the originating credit unions for their personal use. A credit union will only receive data back related to the data they originally sent.

The beauty of data pools, however, is that the results sent back from the data pool are based on ALL the data within the data pool. In effect, a credit union can leverage another credit union’s data. That being said, pulling together multiple credit union data sets into one common data set is not an easy feat.

The challenge in data pooling arises from the fact that credit unions tend to store their data in varying systems. Different core and ancillary data systems sometimes do not mix well when creating a data pool. In order for multiple credit unions’ data to be compatible, the data needs to be stored under the same standard. Once the data is housed under a uniform standard, creating a data pool becomes more manageable.

The key to successful data pooling is establishing the connection. Connection to a data pool consists of five main phases: data extraction, data transmission, data pool presence, data retrieval, and data storage.

Data extraction first involves identification of the relevant data that will be sent to the data pool. In a data pool, not all data at the credit union needs to be sent to the pool. Only the data needed for analysis and identification will be sent, and it will be arranged to fit the data pool format.

Data transmission to the data pool requires a few security preparations. First, the data is masked, meaning that the data which identifies either an individual member or credit union is scrambled to random characters. The credit union has the unique key to unmask its own data. The data transmission will also be packaged in an encrypted file. The credit union and the receiving pool have the unique key to decrypt the transmission package. With the data security measures in place, the data is sent via a secure file transfer protocol to the data pool.
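
The masking step can be sketched as follows. This is one assumed way to "scramble to random characters" while keeping a key for unmasking: a token map held only by the credit union. The `MaskingKey` class, the `prepare_for_pool` function, and the sample record are hypothetical, and the file encryption and secure transfer steps are deliberately left out.

```python
import secrets


class MaskingKey:
    """Hypothetical token map kept inside the credit union; it never leaves the firewall."""

    def __init__(self):
        self._to_token = {}
        self._to_value = {}

    def mask(self, value: str) -> str:
        if value not in self._to_token:
            token = secrets.token_hex(8)       # 16 random hex characters
            self._to_token[value] = token
            self._to_value[token] = value
        return self._to_token[value]

    def unmask(self, token: str) -> str:
        return self._to_value[token]


def prepare_for_pool(records, key, identifying_fields):
    """Replace identifying fields with tokens; leave analytic fields untouched."""
    return [
        {f: key.mask(v) if f in identifying_fields else v for f, v in rec.items()}
        for rec in records
    ]


key = MaskingKey()
members = [{"member_id": "1001", "credit_union": "Example CU", "balance": 2500.0}]
outbound = prepare_for_pool(members, key, {"member_id", "credit_union"})
```

The same value always maps to the same token, so the pool can still group a member's records together without ever learning who the member is.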

In the data pool (data pool presence), the data pool provider will run its data analytics model against the pool’s data and assign the result to the corresponding individual credit union data. The data itself will not be identifiable, just the origin of the data so it can be packaged for retrieval by the appropriate credit union. The new package of data is then prepared for retrieval.

For data retrieval, the results are posted to an origin-specific directory. A credit union-based scheduler pings credit union-specific mailboxes and retrieves data when present. Once returned, the data is decrypted using the credit union’s unique password, and unmasked using the unique masking key.

Data storage is the last phase in a data pool connection. After the data is retrieved, the analytic results are linked to the corresponding member data and are integrated back into the credit union’s data set. The newly generated data is stored into the credit union’s data warehouse as another data element and can now be used for analysis and reporting.
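
The final two phases, unmasking the returned results and storing them alongside existing member data, might look like the sketch below. Everything here is hypothetical: the token values, the `pool_risk_score` field, and the dictionary standing in for the data warehouse are placeholders for illustration only.

```python
def unmask_and_store(pool_results, token_to_member, warehouse):
    """Attach each returned score to the member record it belongs to."""
    for result in pool_results:
        member_id = token_to_member[result["masked_id"]]    # unmask with the local key
        warehouse[member_id]["pool_risk_score"] = result["risk_score"]
    return warehouse


# Hypothetical data: the token map and warehouse live inside the credit union.
token_to_member = {"a3f9": "1001", "7c2e": "1002"}
warehouse = {
    "1001": {"name": "Member A", "balance": 2500.0},
    "1002": {"name": "Member B", "balance": 900.0},
}
pool_results = [
    {"masked_id": "a3f9", "risk_score": 0.12},
    {"masked_id": "7c2e", "risk_score": 0.31},
]

warehouse = unmask_and_store(pool_results, token_to_member, warehouse)
```

After this step the pooled result is just another column in the warehouse, available to the same reports and queries as any locally generated data element.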

Data pooling is an innovative process that can expand a credit union’s view of useful (and profitable) information. Utilization of analytics from pooled data can give your credit union the extra validation needed for important decision making. Imagine knowing that each decision made has measurable and tangible proof behind it. Integrating data from your organization into a data pool may provide the answers your credit union is missing out on.

5 Reasons to Pool Your Data

By Mark Portz

Data continues to prove itself as a necessity for decision-making in financial institutions. For years, major banks and innovative companies such as Google and Amazon have taken advantage of “Big Data” to gain better insights into their customer base and make business decisions to position themselves for the future. The credit union industry is finally beginning to take advantage of their data and utilize new technologies. However, credit unions are much smaller than major banks and simply don’t have the same quantity of data that banks are able to collect from their customers. Fortunately, data pooling serves as a great solution to this problem. Here are 5 reasons your credit union should participate in data pooling:

1. Access to Diverse Data

“Why do I care about the data collected from a credit union on the other side of the country?” This is a frequently asked question when discussing data pools. Of course, it is a valid question. The economy may be different in December in Alaska compared to Florida. However, it is important to recognize that this diversity can actually be a major advantage that should not be overlooked.

As Joe Breeden of Deep Future Analytics explains in a podcast with Best Innovation Group, titled The CECL Effect – How the New Credit Loss Rule will alter Financial Analytics, data diversity is healthy for pooling and advanced analytics. In the podcast he states, “If we get folks spread around the country, in a shared blind repository, then it gives us a better overall view of the scaling of the risk versus economics and other things.” He continues to explain that “We leverage that pool to learn aspects that are in common, like economic sensitivities, but then also to calibrate to the individual… so you get the benefit of the whole, but specific to the individual institution.”

2. Affordable Access to Data Scientists

Data scientists are highly skilled, in high demand, and expensive. They play a major role in analyzing and creating predictive insights (such as ALLL forecasting for CECL) from raw data, which is why data scientists often earn $175k+ per year.

Credit unions simply don’t have the same assets and hiring power as Google, Microsoft, or the large banks, which makes hiring even a single data scientist a non-option. This is where the power of the data pool comes into play. If a data scientist works on a pool of data, consisting of the data from, say, 50 credit unions, those 50 credit unions get to split the cost of the data scientist, making advanced analytics much more affordable.
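
The arithmetic behind that claim is simple enough to sketch. The example assumes an even split and counts salary only (no benefits, tooling, or provider fees), using the $175k figure and the 50-member pool mentioned above; the function name is hypothetical.

```python
def cost_per_credit_union(annual_salary, participants):
    """Evenly split one data scientist's annual salary across pool participants."""
    return annual_salary / participants


solo_cost = cost_per_credit_union(175_000, 1)      # hiring alone
shared_cost = cost_per_credit_union(175_000, 50)   # sharing across a 50-member pool
```

Under these assumptions each participant's share comes to $3,500 per year instead of the full $175,000.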

3. Encrypted and Secure

Another common concern around the topic of data pooling is the access to private information. In a proper data pool, all personally identifiable information (PII) is encrypted prior to leaving the firewall at the credit union. In the pool, the data remains anonymized. Only after the data reenters the firewall is it decrypted, using a decryption key that only the credit union holds.

Data scientists don’t need to know your individual members’ contact information, SSNs, etc., but all contributing organizations will benefit from sharing data that provides insights into loan risk, for example. Post analysis, you will never even be able to tell your data was pooled, except for the increased accuracy in your results.

4. Quantity of Data for Predictive Analytics

Predictive analytics is no longer a luxury, but a requirement for upcoming regulations such as CECL. It is well-known that more data means more accurate results. Credit unions hold potentially very insightful data about their members, but the insight emerges only when that data is analyzed collectively with the rest of the industry. An individual credit union simply does not have a large enough data set to perform accurate predictive analytics. 95% of the credit unions in the United States are below $3.0 billion in assets and do not have enough data to build accurate predictive models.

Fortunately, data pooling is coming to the rescue. Pooling data provides an opportunity to analyze a much larger data set. With a good model, each additional credit union participating in the pool will help to continue to decrease your margin for error and allow you to have more confidence in your data-driven decision making for the future.
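
One way to see why a larger pool narrows the margin of error is the textbook formula for the uncertainty of an observed rate, which shrinks with the square root of the sample size. The sketch below assumes simple random sampling and a normal approximation, and the 2% default rate and 5,000-loan portfolio are hypothetical numbers chosen only for illustration; real credit risk models are far more involved.

```python
import math


def margin_of_error(default_rate, loan_count, z=1.96):
    """Approximate 95% margin of error for an observed default rate."""
    return z * math.sqrt(default_rate * (1 - default_rate) / loan_count)


single_cu = margin_of_error(0.02, 5_000)          # one credit union's loans
pool_of_50 = margin_of_error(0.02, 5_000 * 50)    # fifty similar credit unions pooled
```

Under these assumptions, pooling fifty similar portfolios shrinks the margin of error by a factor of about seven (the square root of 50).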

5. Near Real Time Industry Data for Peer to Peer Analysis

Although it is highly valuable, it is currently very difficult for credit unions to perform peer to peer analysis in a manner that is near real time. Typically, the best option for credit unions to perform any sort of peer to peer analysis is to compare data captured in 5300 Call Reports. However, this data is collected only once a quarter and likely published at least a month after collection. Valuable insights can be gained from this type of analysis, and it would be beneficial for credit unions to have access to this data before it is 4-5 months old. For example, if you realize your credit union is behind on loan origination, what changes can be made today versus 5 months from now?

A proper data pool makes it possible for credit unions to access industry data and perform analysis on data that is updated daily. This makes it possible to stay on top of industry trends before they have passed.

To learn more, listen to the Joe Breeden BIGcast about data pooling and CECL at https://www.big-fintech.com/Media/BIGcast/ArticleID/269/The-CECL-Effect-How-the-new-credit-loss-rule-will-alter-financial-analytics

Part 2

Getting Credit Unions to the Next Level

Re-imagining the applications of collaborative analytics.

CuOS - A Platform for Credit Unions Similar to Apple iOS

By Austin Wentzlaff

Apple has built a tremendously successful company on one thing… Is it the iPhone, iPad, iPod, or Mac series? No. What makes Apple so powerful and successful is not its products, but rather the ecosystem it’s created through its standard operating system, iOS: an operating system or “platform” that connects its users with one another as a community and with its developers. With this common platform, all users are on a level playing field with similar access to all “apps” and services that have been created on the platform – rather than each user building everything themselves.

While this wildly successful model exists in several industries, it has failed to gain traction with credit unions, especially as it pertains to digital transformation (including data and analytics). Instead of having one common platform, credit unions are all on their own unique “platforms” comprised of different data sources with different data definitions. In other words, each credit union is on its own to create all new “apps” and services required to compete in the modern world. At one time, this wasn’t an issue but with the new regulatory pressures and competitive threats of today, this is no longer a viable option for credit unions.

In the new, digitally-transformed world of today, credit unions must adopt a collaborative and common “platform” to compete, which for lack of a better term, I will call the CuOS (or Credit Union Operating System) to stay aligned with the Apple iOS analogy. Doing so will require all credit unions to find a means to conform to one common standard. While this is no easy task with all of the different data sources that exist in the credit union industry (Core, LOS, Debit, Credit, etc.), it is possible, and several efforts are striving for this common standard, such as the CUFX (Credit Union Financial Exchange) and OnApproach’s CU Analytics Ecosystem (which leverages CUFX).

While CuOS might not seem like a strategic priority for a single credit union today, the future needs to be the consideration. By establishing this common platform today, all credit unions will be able to leverage the work of the industry. The largest and even the smallest of credit unions will be able to work together to compete with the rest of the financial institutions out there, including the largest banks and the newly established FinTechs. Rather than one credit union taking on these institutions on its own, the industry would be able to work as one, leveraging its inherent collaboration, to compete with all other financial institutions. The work being done on an industry platform, versus a single credit union’s site, will begin to grow exponentially just as the apps and services on the Apple iOS have.

Credit unions need to fully embrace the inherent collaboration of the industry they’ve always celebrated and come together on one common platform with one common cause – digitally transforming the industry to compete with the rest of the financial institutions in the digital age. This one common platform leads to success for ALL credit unions and builds an industry that will compete and survive for years to come.

"Data De-Identification": The Stairway to Big Data Heaven

By Peter Keers

Credit union interest in Big Data is at an all-time high. The promise of predictive analytics and other Big Data opportunities will be a key part of helping the industry compete more effectively with traditional banks and fintech upstarts.

However, where does the data for Big Data come from? The answer is simple: from the credit unions themselves. For example, the loan loss forecasts required by CECL models will require data from many credit unions to increase their predictive accuracy.

While credit unions are eager to cash in on the Big Data boom, one of the costs is “contribution” of their own data to the Big Data “lake”. A data lake is a virtual “storehouse in the cloud” that holds a vast amount of data that can be used for Big Data analytics.

At this point, credit union decision makers often turn sour on Big Data. Why? The cost of “contribution” is too high. The credit union is obligated to protect the sensitive member data in its care. This data cannot simply leave the credit union’s firewall perimeter and be uploaded to the Data Lake.

The healthcare industry faced a similar conundrum regarding electronic medical records. As medical records evolved from paper to an electronic format, the opportunity to perform analytics on this data was gigantic. Yet the Health Insurance Portability and Accountability Act (HIPAA), a law that protects the privacy of patients’ medical records, stood in the way.

To take advantage of this opportunity but still adhere to HIPAA, healthcare analytics companies devised processes to “de-identify” the sensitive data in medical records. In this way, no specific patient could be uniquely identified while analysts gleaned insights from millions of medical records uploaded by thousands of healthcare providers.

Credit union member data can be handled in a similar way. In fact, the same method for protecting patient privacy can be adapted to the data of credit union members.

In a 2015 publication from the National Institute of Standards and Technology (NIST), the concept of “de-identification” of data is explained. It is defined as, “…a tool that organizations can use to remove personal information from data that they collect, use, archive, and share with other organizations.”

The document describes the HIPAA Safe Harbor method, which specifies 18 types of data to be de-identified. The list below has been altered to replace healthcare data types with credit union data types. The 18 types are:

Names
All geographic subdivisions smaller than a state, including street address, city, county, precinct, ZIP code, and their equivalent geocodes, except for the initial three digits of the ZIP code if, according to the current publicly available data from the Bureau of the Census:

a. The geographic unit formed by combining all ZIP codes with the same three initial digits contains more than 20,000 people; and

b. The initial three digits of a ZIP code for all such geographic units containing 20,000 or fewer people is changed to 000

All elements of dates (except year) for dates that are directly related to an individual, including birth date, admission date, discharge date, death date, and all ages over 89 and all elements of dates (including year) indicative of such age, except that such ages and elements may be aggregated into a single category of age 90 or older

Telephone numbers

Fax numbers

Email addresses

Social security or other tax identification numbers

Member or customer numbers (in place of medical record numbers)

Member or customer numbers (in place of health plan beneficiary numbers)

Account or application numbers

Operator, employee, or officer numbers (in place of certificate/license numbers)

Vehicle identifiers and serial numbers, including license plate numbers

Device identifiers and serial numbers

Web Universal Resource Locators (URLs)

Internet Protocol (IP) addresses

Biometric identifiers, including finger and voiceprints

Full-face photographs and any comparable images

Any other unique identifying number, characteristic, or code
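
The ZIP code and date rules in the list above are the most mechanical of the 18, so they lend themselves to a short illustration. The Python sketch below is only one possible reading of those two rules, not an official implementation. The function names are hypothetical, and the set of low-population ZIP prefixes is a placeholder that would need to be built from current Census data.

```python
from datetime import date

# Placeholder set of 3-digit ZIP prefixes covering 20,000 or fewer people.
# These values are illustrative only; a real list must come from Census data.
LOW_POPULATION_ZIP3 = {"036", "059", "102"}


def generalize_zip(zip_code):
    """Keep the first three ZIP digits, or return 000 for sparse areas."""
    prefix = zip_code.strip()[:3]
    return "000" if prefix in LOW_POPULATION_ZIP3 else prefix


def age_on(birth, as_of):
    """Whole years between a birth date and a reference date."""
    years = as_of.year - birth.year
    if (as_of.month, as_of.day) < (birth.month, birth.day):
        years -= 1
    return years


def generalize_birth_date(birth, as_of):
    """Drop month and day; fold ages over 89 into one '90+' category."""
    age = age_on(birth, as_of)
    if age >= 90:
        # The year itself would reveal the age, so it is suppressed too.
        return {"birth_year": None, "age_group": "90+"}
    return {"birth_year": birth.year, "age_group": str(age)}
```

For example, `generalize_zip("55401")` keeps only `"554"`, while a member born in 1920 is reported simply as `"90+"` with no birth year.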

An important consideration is how this data is de-identified. Removal of Direct Identifiers is at the heart of de-identification. The NIST document defines Direct Identifiers as “data that directly identifies a single individual. Examples of direct identifiers include names, social security numbers, and email addresses.”

The document notes numerous ways Direct Identifiers can be de-identified:

The direct identifiers can be removed.

The direct identifiers can be replaced with either category names or data that are obviously generic. For example, names can be replaced with the phrase “PERSON NAME”, addresses with the phrase “123 ANY ROAD, ANY TOWN, USA”, and so on.

The direct identifiers can be replaced with symbols such as “*****” or “XXXXX”.

The direct identifiers can be replaced with random values. If the same identity appears twice, it receives two different values. This preserves the form of the original data, allowing for some kinds of testing, but makes it harder to re-associate the data with individuals.

The direct identifiers can be systematically replaced with pseudonyms, allowing records referencing the same individual to be matched.
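
To make the five options above concrete, here is a minimal Python sketch that applies each one to a single field of a member record. It is an illustration under stated assumptions, not code from the NIST document: the function names and the record layout are hypothetical, and the keyed hash (HMAC-SHA-256) is just one common way to produce consistent pseudonyms.

```python
import hashlib
import hmac
import secrets


def remove_field(record, field):
    """Option 1: drop the direct identifier entirely."""
    return {k: v for k, v in record.items() if k != field}


def replace_with_generic(record, field, placeholder):
    """Option 2: substitute an obviously generic value."""
    return {**record, field: placeholder}


def mask_field(record, field, symbol="X"):
    """Option 3: overwrite every character with a symbol."""
    return {**record, field: symbol * len(str(record[field]))}


def randomize_field(record, field):
    """Option 4: fresh random digits of the same length on every call."""
    digits = "".join(secrets.choice("0123456789") for _ in str(record[field]))
    return {**record, field: digits}


def pseudonymize_field(record, field, secret_key):
    """Option 5: the same input always yields the same pseudonym."""
    message = str(record[field]).encode("utf-8")
    digest = hmac.new(secret_key, message, hashlib.sha256).hexdigest()
    return {**record, field: digest[:16]}


# Hypothetical member record used only for illustration.
member = {"name": "Jane Doe", "ssn": "123456789", "balance": 2500.0}
masked = mask_field(member, "ssn")
```

Each helper returns a new dictionary and leaves the original record untouched, so the raw data never has to be modified in place. Only the last option lets two records for the same member be matched after de-identification.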

“Pseudonymization” is an extremely important topic in de-identification. Unlike the techniques above, it allows “linking information belonging to an individual across multiple data records or information systems, provided that all direct identifiers are systematically pseudonymized.”

In layman’s terms, this means that an authorized party that holds the pseudonym mapping can restore de-identified data to its original form. For example, suppose Member Number is de-identified via pseudonymization. The data is not comprehensible to any unauthorized party. However, when the data returns to the credit union, the pseudonyms can be reversed and the results integrated back into the database.
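
The Member Number round trip can be sketched as a lookup table that never leaves the credit union. The Python example below is a minimal illustration, assuming the mapping is kept in memory. The `PseudonymVault` class and its method names are hypothetical, and a production system would need secure, persistent storage for the table.

```python
import secrets


class PseudonymVault:
    """Mapping table that stays inside the credit union (hypothetical helper)."""

    def __init__(self):
        self._forward = {}  # member number -> pseudonym
        self._reverse = {}  # pseudonym -> member number

    def pseudonymize(self, member_number):
        """Return a stable random token for the given member number."""
        if member_number not in self._forward:
            token = secrets.token_hex(8)
            while token in self._reverse:  # regenerate on an unlikely collision
                token = secrets.token_hex(8)
            self._forward[member_number] = token
            self._reverse[token] = member_number
        return self._forward[member_number]

    def restore(self, pseudonym):
        """Reverse a token; only the holder of the vault can do this."""
        return self._reverse[pseudonym]


vault = PseudonymVault()
token = vault.pseudonymize("0004521")    # value sent to the data lake
original = vault.restore(token)          # recovered when results come back
```

Because the token is random rather than derived from the member number, anyone without the vault has nothing to work backward from, while the credit union can still link every record for the same member.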

Credit unions that can understand the what and how of data de-identification will be better prepared to take advantage of Big Data opportunities.

Credit Union Cooperation: Google Maps Style

By Nate Wentzlaff

As credit unions begin their journey into the future, they must rely on an industry standard analytics platform to guide them to their destinations.

Google Maps has revolutionized how we navigate our lives. It saves us from headaches caused by unnecessary traffic and other challenges in traveling. My journey from work to home has many different routes depending on traffic patterns. During days with slower traffic (e.g., winter snowstorms), the Google Maps recommended route will change every 5 to 10 minutes. Using an analytics engine that informs me of the best route allows me to spend extra time on more important things in life. Credit unions have a similar opportunity when navigating their institutions into the uncertain future of financial services. Establishing an industry standard analytics platform will enable credit unions to cooperate on analytics and guide them to their desired destinations.

Analytics Platform

Google Maps uses complex algorithms and thousands of integrated data sources to provide users a clear path to their destination. Utilizing a common data platform to conform the data for its users provides a unique opportunity to share data throughout the globe. This platform empowers users to navigate an ever-changing world. The credit union industry needs to embrace a similar type of analytics platform for all users to enable collaboration throughout the industry. Speaking the same data language is crucial for the success of credit unions throughout the world. With a common data platform, applications can be designed for credit unions regardless of the software systems feeding the data. When someone develops an application, it can be used throughout the industry without a need for costly (and time-intensive) implementation. A majority of the time spent on application development goes to mapping data. As credit unions adopt an industry standard analytics platform, developing APIs (application programming interfaces) on top of it will become much easier, which will enable credit unions to navigate their journeys cooperatively.
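
The idea of mapping data once so that applications can be written once can be shown in a few lines of Python. This is only a sketch: the two core systems, their field names, and the common schema are all hypothetical and are not taken from any real vendor or standard.

```python
# Hypothetical field mappings from two core systems to one common schema.
FIELD_MAPS = {
    "core_a": {"MBR_NO": "member_id", "LN_BAL": "loan_balance"},
    "core_b": {"memberNumber": "member_id", "loanBalanceAmt": "loan_balance"},
}


def to_common(record, source):
    """Rename a source record's fields to the common schema."""
    mapping = FIELD_MAPS[source]
    return {common: record[native] for native, common in mapping.items()}


def total_loan_balance(records):
    """An 'application' written once against the common schema."""
    return sum(r["loan_balance"] for r in records)


records = [
    to_common({"MBR_NO": "1001", "LN_BAL": 100.0}, "core_a"),
    to_common({"memberNumber": "2002", "loanBalanceAmt": 250.0}, "core_b"),
]
combined = total_loan_balance(records)
```

The mapping work is done once per core system, after which `total_loan_balance` and any other application can run unchanged on data from either source.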

Data Integration

“We recognize that in order to provide our users with the best, most up-to-date map possible, we must partner with the most comprehensive and authoritative data sources.” -Google

Understanding that it doesn’t have all the data (logistically and legally) within the company, Google reaches out to trustworthy third parties to integrate data within its analytics platform. Through Google’s Base Map Partner Program, it is able to integrate thousands of sources to improve Google Maps every day. Credit unions have the same opportunity when contributing to an analytics platform. With all the credit union data integrated, the route will become much clearer.

Data Pooling

“Yes, that’s right: if Google has access to the location data collected by your smartphone, then you’re part of Google’s crowdsourced operation to improve and expand Maps.” -Rob Nightingale

Google Maps uses anonymized user data to build dynamic analytics for all users. For example, using the speed of users driving on the road, Google can infer traffic conditions. In order to bolster their analytics platform, credit unions should pool anonymized member data including all transactions on a daily basis. Understanding how members behave throughout the industry will give credit unions the power to drive to their destinations efficiently.

Setting your destination

Google Maps is only useful when users know their desired destination. Beginning an adventure without a destination will inhibit any type of analytics from assisting the user. Analytics without a purpose will result in another fancy dashboard hidden in an analyst’s folder (while usually incurring a monthly payment to a vendor). Setting goals and defining KPIs (Key Performance Indicators) is essential before a credit union begins its journey.

Data Visualization

Visualizing data along the path is another powerful feature of Google Maps. Seeing their environment allows users to spot outliers and identify opportunities along the route (e.g., gas and food). As credit unions embark into the future, they will be able to identify valuable opportunities along the way. This will allow credit unions to alter their route to obtain value that was not directly identified by the analytics. This ad hoc ability allows credit union employees to steer their credit union with the guidance of analytics. The analytics point to the most efficient route. However, sometimes value can be found outside of the most efficient route. Data visualization gives analysts the power to make data-driven recommendations on any alteration of the analytics-guided path.

Applied Analytics

Sometimes Google Maps takes users on routes that are unfamiliar. Most people assume that the analytics are correct and will follow wherever they lead. However, there may be times when the analytics need to be overridden by a user. For example, the destination address may have been entered incorrectly. An override should not, however, be based on a “gut” feeling about a particular area the user is driving through. Sometimes, a thriving city is on the other side of a barren desert. Similarly for credit unions, when the analytics don’t seem to make sense, further experimenting should be performed before adjusting the route. Many different hypotheses may be formulated along the journey. By utilizing the scientific method, credit unions will be able to “trust but verify” their analytics along the way.

Credit Union Efficiency

Just as our transportation systems are becoming more efficient with commuters collaborating on an analytics platform, the credit union industry can become more efficient as it embraces an industry standard analytics platform. With scarce resources (i.e., time and money), credit unions should take advantage of an analytics platform that will bring them greater efficiencies. For a movement built on collaboration and the thrifty use of resources, cooperating on analytics will allow the credit union industry to thrive in 2017 and beyond!

Why Analytics is a Credit Union Industry Opportunity

By Austin Wentzlaff

Analytics has been a prevalent topic for many years but never more prevalent in the credit union industry than it is today. Just a few years ago, the topic hardly came up, but in 2017, it’s hard to find a credit union not talking about, or planning and budgeting for, a proper analytics solution. This excitement about analytics has gathered widespread attention, involving industries, companies, and individuals new to the field of analytics.

Now that there is a lot of buzz around the topic, it is important to understand whose challenge, but more importantly, whose opportunity analytics is. Analytics is the credit unions’ opportunity. Not just one individual credit union, but all credit unions – the industry or movement.

Credit unions need to understand the value of their data, not just as one credit union’s data, but the value of all credit union data. Alone, as a single credit union, how do we compete with U.S. Bank, Wells Fargo, and Citigroup, not to mention the FinTechs like SoFi? We can’t. The answer to our biggest challenges is something that is inherent to the credit union industry – collaboration. While we might not be able to fully execute on analytics alone, we can do it together – as ONE.

One of the best things that the credit union industry has ever formed is the Credit Union Service Organization (CUSO). Doug Petersen, President/CEO of Workers Credit Union says it best:

“[CUSOs] provide a means to an end – allowing credit unions the capability to fulfill the financial needs of their members in a cost effective environment through efficient delivery channels. Plus, they attract the brightest and most innovative minds to the board table, bringing best practices of credit unions across the country, which is a priceless experience.”

The CUSO model is exactly what the credit union industry needs in order to tackle the analytics challenge and take advantage of the opportunity. CUSOs leverage the power of collaboration that already exists within the industry to offer several benefits, such as:

Economies of Scale

Economies of scale are achieved when a company produces goods and services on a larger scale while simultaneously lowering average input costs. CUSOs achieve economies of scale by producing goods or services for several credit unions rather than having a single credit union attempt to replicate the same benefit. By utilizing the power of collaboration, CUSOs can specialize in a given product or service, which enables them to provide higher-value products and services at a much lower cost.

Competitive Advantage

CUSOs offer credit unions the ability to remain competitive by improving efficiencies and producing a wider array of products and services that would be unobtainable without CUSO collaboration. They enable credit unions to acquire scale and market power, along with other resources such as capital and staff, that far exceed their individual sizes. For example, an analytics solution can take about three years to build and has an initial cost of about $2,000,000 with an additional cost of $200,000 per year to support. With a CUSO, however, credit unions can get the same solution for less than $50,000 upfront and only $40,000 per year to support.
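
The gap between those two sets of figures widens once support costs are added up over time. The short Python calculation below is a rough sketch that uses only the numbers quoted above. The five-year horizon is an assumption chosen for illustration, the CUSO upfront cost is taken at its $50,000 upper bound, and support is assumed to be paid in every year.

```python
def cumulative_cost(upfront, annual_support, years):
    """Total spend after the given number of years of support."""
    return upfront + annual_support * years


YEARS = 5  # assumed planning horizon, not a figure from the text

build_alone = cumulative_cost(2_000_000, 200_000, YEARS)
through_cuso = cumulative_cost(50_000, 40_000, YEARS)

print(f"Build alone:  ${build_alone:,}")    # $3,000,000
print(f"Through CUSO: ${through_cuso:,}")   # $250,000
```

Under these assumptions the in-house route costs twelve times as much over five years, before counting the roughly three years spent building.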

Multiple Credit Union Perspectives

The collaboration of several owners spurs more innovative products and services because there are several different viewpoints, many of which are from the most innovative minds in the industry. Unlike most other vendors in the credit union space, CUSOs have board members and owners that represent the industry as a whole. “There can be some overlap with the credit union, but the management team can’t be 100% the same,” says Guy Messick, Attorney/Partner at Messick & Lauer and General Counsel to NACUSO. This is incredibly beneficial because there are more minds contributing to what’s best for the industry.

For Credit Unions, by Credit Unions

Owners of CUSOs are, themselves, credit unions. Therefore, it is in their best interest to do what is best for credit unions. Rather than focusing solely on profit, CUSOs also focus on the overall well-being of credit unions and their members. Credit unions, not shareholders, are in control of the CUSO’s product development roadmap and the CUSOs deliver on the roadmaps by leveraging expertise of credit union executives. Using ideas from the best and the brightest in the industry, both large and small credit unions, ensures best practices are shared and practiced. This results in the best products and services available for the industry.

Ownership

This is easily the most important aspect of a CUSO, especially as it pertains to data and analytics. WHO OWNS THE DATA and WHO OWNS THE TECHNOLOGY – it needs to be the credit union industry. When, or if, credit unions start to give away their data and/or the technology that effectively stores, mines, and utilizes their data, they not only give away the value of their data, but they also give away what makes them who they are – their members. When we start to give away our member data, we start to give away our members. This industry must hold onto its most important asset: its members. In 2017 and beyond, that means holding onto their data and the technology required to leverage that data.

For all of the reasons above, the CUSO is the only way credit unions can successfully execute on analytics. Credit unions must realize that this is much bigger than one credit union. This is an industry opportunity, and in many ways analytics and digital transformation will decide between the demise of credit unions and a prosperous future for them. We hope for the latter.

About OnApproach

Company History

OnApproach began in 2005 as a data consulting company. In six years, OnApproach completed over 50 major projects for Fortune 500 companies, such as Toro and Land O’Lakes. In 2009, OnApproach completed an extensive reporting and analytics project for a credit union and realized the significant need for a standard enterprise data integration solution.

OnApproach became a Credit Union Service Organization (CUSO) in 2014 and received a patent for the M360 Enterprise analytic data model in 2015. To further its commitment to credit union analytics, OnApproach annually co-hosts the Analytics and Financial Innovation (AXFI) Conference with Best Innovation Group to provide a forum for industry collaboration.

OnApproach Today

OnApproach is the only CUSO dedicated to credit union success through a collaborative analytics ecosystem. By providing a secure and frictionless data experience, OnApproach empowers credit unions to take full control of their own data and their own futures. We exist to serve the credit union movement with technology and expertise required for the digital transformation of the industry business model. OnApproach’s collaborative ecosystem enables communities of users, data scientists, and application developers focused on analytics innovation.

OnApproach is the creator of the CU Analytics Ecosystem, a network of credit unions interconnected through a common data integration platform (leveraging the CUFX standards) that is powered by OnApproach M360 data integration middleware. The CU Analytics Ecosystem is a collaborative ecosystem that enables communities of users, data scientists, and application developers focused on innovation, driven by analytics.

Make the Most of Your Data

Learn more at OnApproach.com or schedule a demo.
