Fixing Payment Systems with Competition

This Target hack is a BFD. I’m at the mall this weekend because I’m a very last-minute shopper and it was the only time I could find to shop. My wife calls me because she gets this email from Chase, which I’ll paraphrase here:

You got hacked.  Lolz!  It ain’t our fault, really.  So sorry. So so sorry. Oh, BTW we’re putting new limits on how you can use your card in the middle of Christmas week because of Target. Hey hope this doesn’t screw you up, but I hope you weren’t planning on spending more than $100 a day with us.   Happy holidays.

Think about this for longer than a few minutes, think about how this affects millions of customers, and then you’ll realize that this Target hack could potentially ding a percent or two off of this holiday season for a few retailers.

When we look back at this time, we’re going to laugh at how silly our approach to payment systems was from about 1980 – 2013.  I think that the Target hack is likely just the beginning, but it is clear that (even with strict PCI compliance) we need radical change in payment.

Problems with Payment

  1. Our credit cards (at least in the US) are the technological equivalent of a cassette tape. While I’m running around town with a smartphone that can read my fingerprint, whenever I shop I’m still using the equivalent of an 8-track tape to pay for everything. Instead of moving toward a system that uses my location and my fingerprint, we’re just walking around with wallets that are no more secure than an unprotected envelope labeled “My Credit Card Numbers.” Steal my wallet and you’ve got my credit card numbers… there’s a better way.
  2. We still have this irrational belief in the signature (and checkout clerks still eyeball them). This is our idea of identity verification – here’s a quill pen, why don’t you just sign this.  Now wait… there’s enough reliable location data flowing from my phone to enable every checkout clerk to say, “Welcome to the store, Mr. O’Brien” without me saying anything.  The store should already know I’m there, and the technology also exists to have the store take care of payment authorization every time I pick something up. My phone could generate a signed piece of data encoding not just who I am, but where I’ve been today and what the time is down to the microsecond, authenticated by several GPS satellites (see the toy sketch after this list).
  3. Online payment systems that offer more security are tiny in comparison to the 50,000-lb gorillas that dominate the system.  Almost no one uses these systems. Add up the value of all the innovative payment companies in the Bay Area (Square, PayPal, + a thousand others) and you still don’t touch the $6.9 trillion total volume of Visa.  That’s $6.9 trillion flowing through millions of point-of-sale terminals (or “all the money”). Someone needs to figure out how to upgrade that instead of creating yet another payment system to trial in San Francisco and New York.
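To make the idea in #2 concrete, here’s a toy Java sketch of that kind of attestation. Everything in it is illustrative – the payload format, the coordinates, the key handling – and on a real phone the private key would live in a secure element unlocked by the fingerprint reader. This is a sketch of the concept, not any real payment API:

import java.security.KeyPair;
import java.security.KeyPairGenerator;
import java.security.Signature;

public class PaymentAttestation {
    public static void main(String[] args) throws Exception {
        // On a real phone this key would sit in a secure element, unlocked
        // by the fingerprint reader; here we just generate a throwaway pair.
        KeyPairGenerator gen = KeyPairGenerator.getInstance("EC");
        gen.initialize(256);
        KeyPair keys = gen.generateKeyPair();

        // Who I am, where I am, and when. The "microsecond" timestamp here
        // is just the millisecond clock scaled up; real hardware could do
        // better, and GPS would supply the location fix.
        long micros = System.currentTimeMillis() * 1000L;
        String claim = "user=tobrien;lat=42.3601;lon=-71.0589;ts=" + micros;

        // Sign the claim with the private key...
        Signature signer = Signature.getInstance("SHA256withECDSA");
        signer.initSign(keys.getPrivate());
        signer.update(claim.getBytes("UTF-8"));
        byte[] attestation = signer.sign();

        // ...and the terminal verifies it against the public key on file
        // at the bank, instead of a clerk eyeballing ink on a receipt.
        Signature verifier = Signature.getInstance("SHA256withECDSA");
        verifier.initVerify(keys.getPublic());
        verifier.update(claim.getBytes("UTF-8"));
        System.out.println("valid attestation: " + verifier.verify(attestation));
    }
}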

When I wrote about payment systems in 2010, the universal warning everyone was throwing at me was, “Don’t expect anything to change in the short term.  The retail industry moves slowly and no one wants to make the capital investment necessary to upgrade point-of-sale.”  At the time I was talking to a senior manager at a well-known payment company based in the Bay Area about NFC payment systems.  According to him, the future was now and a revolution was upon us.  It wasn’t.

The Solution

1. Ensure real competition in the payment processing space. Huge payment providers like the ones with logos in your wallet have a history of using confidentiality agreements with vendors and transaction fees as tools to lock out the competition. For example, merchants have not been allowed to offer discounts for different kinds of payment methods.  Whether or not this continues after the interchange fee settlement is up for debate, but we need to make sure that new technologies are not locked out of the physical point-of-sale space.

2. Put all the risk on payment providers.  If you provide a card or a technology that people can use for payment, you bear all of the responsibility for a compromise. This will motivate payment providers to move away from the current, insecure methods of payment that we use today. Your credit card won’t just be a series of easy-to-copy numbers; it will make use of the technology we have available. This would also force dramatic changes to PCI.  “Storing a credit card #” at a merchant would go away, and instead your transactions would look more like PayPal’s authorization process for recurring payments.

With real competition, the payment processors that can control risk will be able to offer a significantly lower cost to the retailer, and retailers will provide the necessary motivation to consumers to adopt more secure technology.  If Square has the best risk management and fraud prevention technology available, a retailer should be able to offer consumers that use that technology a 1-2% discount if they pay with Square. Competition (not regulation) is the way out of this mess.

Whirr

Whirr + Spot Prices + Thanksgiving Weekend means that I can run large m1.xlarge instances on the cheap.
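For the curious, here’s roughly what that recipe looks like – a minimal sketch assuming the property names from the Whirr documentation of that era, so check your release for the exact keys (especially the spot-price one):

# hadoop-spot.properties -- a sketch; verify property names against
# your Whirr release, especially the spot-price key.
whirr.cluster-name=holiday-hadoop
whirr.instance-templates=1 hadoop-namenode+hadoop-jobtracker,10 hadoop-datanode+hadoop-tasktracker
whirr.provider=aws-ec2
whirr.identity=${env:AWS_ACCESS_KEY_ID}
whirr.credential=${env:AWS_SECRET_ACCESS_KEY}
whirr.hardware-id=m1.xlarge
# Bid at the (holiday-depressed) spot price instead of paying on-demand.
whirr.aws-ec2-spot-price=0.15

Launch it with “whirr launch-cluster --config hadoop-spot.properties” and the whole cluster comes up at whatever the spot market is charging that morning.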

<griping>Also, Whirr is essential, but the project has a sort of “forgotten Maven site” feel about it. It’s annoying when an open source project has several releases, but no one bothers to republish the site.  It’s even more annoying when the “Whirr in 5 minutes” tutorial takes 60 minutes because it doesn’t work.</griping>

The Fall Guy (or Representing Open Source in the Business)

The problem with being the developer who can write at an open source company is that you end up being enlisted into the whole “Please explain how open source works” discussion when the company hires non-technical managers.  You end up as the representative of this strange thing called “open source.” A VP (not yours) calls you up and says, “Hey, could you explain what open source is to our sales team?”

You seize upon this as an opportunity to spread the Gospel of FOSS. You prepare elaborate slides that speak of Cathedrals and Bazaars. You turn some Lessig into an inspirational dramatic monologue that will inspire these non-developers to start thinking of OSS as the heroic effort we are mounting to take back control from proprietary vendors and create an even larger sharing economy. You think that maybe it is appropriate to introduce some of the developers who work on the project the company is currently making money from…

…and then you show up at the “Sales Kick-off” meeting and you realize that this is more of a Glengarry Glen Ross joke festival than it is an audience receptive to the idea of profiting from a sharing economy.  You quickly try to revise slides about “Free as in Beer”, because you realize that any mention of beer is going to get this crowd derailed pretty quickly. They scheduled you at the end of the day, after the VP of Sales gave a speech that involved football metaphors and after the regional sales director had a loud fight about territory with the sales team.  You realize that no one really wants to hear about OSS because they are all about to go out on some sales team-building exercise that involves a lot of drinking and more discussion of sports.

You are summoned to present with “…Ok, some hippy developer is going to tell us what this freeware @#$# is all about anyway. Go ahead show ‘em how to ‘make it rain.'”

If this is your job, you’ll find yourself in a room full of people asking you questions like “Alright, so do you geeks have anything better to do with your weekend?” and “Why are my customers getting all worked up over open source? I don’t get no commission on this crap.”

Some things that you’ll notice in the reaction:

  • People with a background in business and sales have no idea why you’ve been participating in open source for years.  Not only do they not understand it, some of them discount the entire idea (even if the company was built atop an OSS foundation).
  • Even if you think you’ve explained open source, there’s a large portion of the audience that either wasn’t listening or refuses to admit that it could ever work. (Someone will make a joke about how you are a communist.  It will be unclear whether that person was really joking or not.)
  • Jokes will be made about open source being about “free love,” “hippies,” and “unicorns.”
  • Invariably, someone from the 1980s will show up and talk about how they once made a lot of money selling business software.  This will be used as an attempt to show others that your generation just has it all wrong.

If just the right kind of manager is there, everything you say about the “possibilities of open source” will be dismissed as over-idealistic nonsense.  Even though you might have just delivered a presentation on how Hadoop has created billions of dollars in value and how organizations like the Apache Software Foundation act as foundries for innovations that drive computing, someone will invariably stand up right after you and say, “Ok, enough about this open source crap, how are we going to make money?”

You realize that your “open-source” stuff is just going to be used as a scapegoat by a sales team that has no idea what OSS is.  This is the reason why you see headlines about large companies canceling support for OSS projects and products.  It isn’t because they couldn’t find a way to “monetize” – no, it was often because they refused to understand the gold mine they were sitting on.

The Shift to Local Data Centers

In my post on Friday I wrote a fictional piece from 2020 predicting that the world’s IT infrastructure would shift to in-country data centers after the recent surveillance revelations.   It looks like this is going to happen faster than I expected.

What shall we name this trend?  How about “Jurisdictional Data Compliance” or “Jurisdictional Data Security”?  Walk up to your CIO today and ask what your JDC implementation plan is, given your clients’ new concerns about privacy.

View from 2020: American Complacency on Surveillance Ruined the Internet

Assume we’re in the year 2020. We can all remember a time when Google was the largest internet company in the world – the #1 ranked search engine everywhere (well, everywhere except China). In 2020, this is no longer the case: because of a continued stream of revelations about government surveillance, just about every country in the world decided to enact regulations that encouraged (if not required) services like email, advertising, social networks, and IM to be served from an in-country data center.

In 2020, if you are in Russia you use the Russian social network (already happening), if you are in Germany you use the German email provider, and if you are in China you use the Chinese version of Twitter (already established).  In seven years we went from ubiquitous internet companies with global reach to a reality that encourages providers to confine themselves to a “state.”  The transition was difficult. A number of large internet companies’ stocks tanked in Q3 and Q4 of 2014 for a number of reasons, but one of the driving factors was that earnings suffered greatly when large portions of the EU and Asia lost trust in anything related to US-based internet providers.  Many of these companies were banking on international expansion as a source of growth. The free lunches and massive campuses in the Bay Area were built on a vision of linking the world’s populations together. Those went away when the promise of a global user base evaporated.

The period between 2013 and 2020 was about more than just businesses being affected by the surveillance fallout. After the surveillance scandals of 2013, people started putting up more walls to international cooperation.  This wasn’t an overnight decision; over years and years, as new projects were implemented in both the private and public sectors, the people who had to decide where to host servers and which cloud providers to use tended to opt for hosting something “in-country.” It wasn’t about which cloud provider had the easiest API any more; it was about a German company hosting a German web site in Berlin because of pressure from German customers.  All across the world, companies started to say things like, “Your data doesn’t cross any national boundaries” in marketing materials.  Jurisdictional Data Security became a selling point.

Companies made a mint on compliance with a series of laws passed in the EU, but this new “local-only” approach to services resulted in the creation of isolated islands of activity. In 2020, there’s no more “Internet,” really.  The “Baidu-ification” of the internet influenced culture broadly, as there is far less cross-cultural exchange.  In 2020, K-pop is confined to Korea, Russian dash-cams of insurance fraud are confined to Russia, and Australian Reddit users no longer salute the North American users during the wee hours of the night.  Nations and regions keep activity to themselves.  Advertising networks (those great vacuums of data) had a much more difficult time operating across networks, and companies started aiming at markets an order of magnitude smaller than multiple billions of users.

Without thinking about the ramifications for US-based businesses, the government just decided to start using its leverage over US-based internet companies to compel compliance with a collection of secret laws. In an effort to protect us, they ended up sapping energy from one of the only sectors of growth in the economy.  They ended up ruining the global surveillance network they had so successfully established.

Back in 2013, even after the stories broke, most of the American public was still complacent.  Only a tiny percentage of people were paying attention in this country, and of those who were, a sizable portion just thought, “Oh, well, we have to keep track of the terrorists.”   It isn’t like the public was in the habit of demanding swift action for anything, really. As a nation we had decided to stop electing effective representatives years ago, and both of the branches of government they had any control over were locked in an endless battle over Bread and Circuses.  Poll random people on the street about FISA in 2013, and they’d probably tell you it was a European soccer league.

The public was complacent, even accepting of this new reality.  The Patriot Act was passed in a time of great fear and an anticipation of constant threat.  Our collective complacency with surveillance and our inability to stand for core values like privacy were the competitive disadvantage that ruined Silicon Valley.   In an effort to protect ourselves, we ended up doing more harm than good.  Some blamed the NSA and CIA, but these people were just implementing policy enacted into law by the public’s representatives.  The real source of the problem was the American public. We were complacent with surveillance.

In 2020 people say things like, “Remember when Google was a global company?  They had an entire campus in Mountain View.  Those were the days.” Google started having problems after 2013; its executives had to spend so much time on high-level negotiations with governments that they took their eye off the local competition. Yes, America has the best capital markets in the world, the largest economy, and a strong national defense, but the loss of trust trumped all of that.

Lift Now Has Plans

Two weeks ago I blogged about Lift as a good site to help people meet personal goals.   Now, Lift has announced a new feature: “Plans.”

[Screenshot: Lift’s new Plans feature]

What I like about Lift is its simplicity.  It isn’t asking me to tweet every other second, and the mobile application hasn’t decided to ask me to write a review. (Have I mentioned I hate that?)

10 Steps to Get Your Crazy Logs Under Control

Two days ago I wrote a post about how “developers tailing the logs” is a common pattern.  A couple of people responded to me directly, asking if I had some sort of telepathic ability because they were stuck in a war room tailing logs at that very moment.  It’s a common pattern.  As developers we understand that tailing log files is much like tasseomancy (reading tea leaves) – sometimes the logs roll by so quickly we have to use a sixth sense to recognize the errors.  We are “log whisperers.”

The problem here is that tailing logs is ridiculous for most of the systems we work with.  Ok, if you have 1-2 servers, go knock yourself out – tail away.  If you have 2,000 servers (or more) tailing a log to get any sort of idea about errors or behavior isn’t just inappropriate, it’s dangerously misleading.  It’s the kind of practice that just gives you and everyone around you the false reassurance that because one of your servers is working well, they are all fine.

So, how do we get a handle on this mess?

#1. Stop Tailing Logs @ Scale – If you have more than, say, 10 servers you need to get out of the business of tailing logs.   If you have a reasonable log volume – up to a handful of GBs a day – throw it into a system based on Apache Solr and make that system as immediate as possible.  That’s the key: figure out a way to get logs indexed quickly (in a couple of seconds), because if you don’t?  You’ll end up going back to tailing logs.
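Here’s a rough SolrJ sketch of what “throw it into Solr” means in practice. The URL, core name, and field names are placeholders I’ve made up; the one important real detail is the commitWithin parameter, which asks Solr to make a document searchable within a couple of seconds without paying for an explicit commit on every write:

import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

import java.util.Date;

public class LogIndexer {

    // Solr 4.x-era SolrJ; the URL and "logs" core are placeholders.
    private final HttpSolrServer solr =
        new HttpSolrServer("http://solr.internal:8983/solr/logs");

    public void index(String host, String level, String message) throws Exception {
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("host", host);
        doc.addField("level", level);
        doc.addField("message", message);
        doc.addField("timestamp", new Date());
        // commitWithin: ask Solr to make this document searchable within
        // ~2 seconds without issuing an expensive commit on every write.
        solr.add(doc, 2000);
    }
}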

You can also use Splunk.  Splunk works, but it’s also expensive, and they charge by how much data you index each day. If you don’t have the patience to figure out Solr, use Splunk, but you’re going to end up paying crazy money for something that you could get for free.

If you have more than a few GBs a day – on the order of tens of GBs, hundreds of GBs, or even a few TBs of data a day – you are in another league, and your definition of “logging” likely encompasses system activity.  There are companies that do this and they have campuses, and this isn’t the kind of “logging” I’m talking about here.

#2. If possible, keep logs “readable” – If you operate a very large web site this may not be possible (see the first recommendation), but you should be aiming for production log messages that are reasonable.   If you are running something on a smaller scale, or something that needs to be interactive, don’t write out a MB every 10 seconds.  Don’t print out meaningless gibberish.  When you are trying to come up with a log message, think about the audience, which is partially yourself, but mostly the sysadmin who will need to maintain your software far into the future.

#3. Control log volume – This is related to the previous idea. Especially in production, keep volume under control.  Remember that log activity is I/O activity, and if you don’t log correctly you are making your system wait around for disk (disk is crazy slow).   Also, if you are operating @ scale, all that extra logging is just going to slow down whatever consolidated indexing is going on, making it more difficult to get immediate answers from something like Solr.
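One cheap, standard trick – sketched here with Log4j, with a made-up class and method for illustration – is to guard chatty statements so they cost nothing when the level is off:

import org.apache.log4j.Logger;

public class OrderService {

    private static final Logger log = Logger.getLogger(OrderService.class);

    public void process(Object order) {
        // The guard means the string concatenation (and order.toString())
        // never runs in production when DEBUG is switched off -- no I/O,
        // and no CPU burned building a message that gets thrown away.
        if (log.isDebugEnabled()) {
            log.debug("Processing order: " + order);
        }
        // ... the actual work ...
    }
}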

#4. Log messages should be terse – Aim for a single line when possible and try to stay away from messages that need to wrap.  You shouldn’t print a paragraph of text to convey an idea that can be validated with a single integer identifier; it should fit neatly on a single line if at all possible.   For example, your log messages don’t need to say:

"In an attempt to validate the presence or absence of a record for BLAH BLAH BLAH INC, it was noted that there was an empty value for corporation type.  Throwing an exception now."

Instead:

"Empty corp type: BLAH BLAH... (id: 5). Fix database."

#5. Don’t Log, Silence is Golden – I can’t remember who it was, but someone once commented on the difference between logging in Java and logging in Ruby (I think it was Charles Nutter talking about the difference between Rake and Maven).  When you run a command-line tool in Ruby, it often doesn’t print anything unless something goes horribly wrong; run a tool like Rake and it doesn’t print much if things go as planned.   When you run Maven?  It prints a lot of output, and this is output that no one ever reads. This is a key point.  Normal operation of a system shouldn’t warrant much logging at all.  If it “just works,” then don’t bother me with all that logging.

If you are operating a web site @ scale, this is an important concept to think about.  Your Apaches (or nginx) are already going to be logging something in an access log, so do you really need an application log that looks like this?

INFO: Customer 13 Logged In
INFO: Attempting to Access Resource 34
INFO: Resource Access for resource 34 from customer 13 approved
INFO: Sending email confirming purchase to customer 13

I don’t think you need this. First, you should have some record of these interactions elsewhere (in the httpd logs), and second, it’s just unnecessary.  In fact, I think those are all DEBUG messages.  Unless something fails – unless something needs attention – you should strive for having zero effect on your log files. If you depend on your log files to convey activity, look elsewhere, for two reasons: 1. It doesn’t scale, and 2. It is inefficient.  Instead of relying on a log file to convey a sense of activity, tell operations to look at database activity over time.
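To enforce this in production, set the threshold in configuration rather than trusting every developer to hold back. A minimal log4j.properties sketch along these lines – the file path and sizes are placeholder values – where only WARN and above ever reach the disk, so the INFO chatter above is simply never written:

# Production log4j.properties: only WARN and above hit the disk.
log4j.rootLogger=WARN, file
log4j.appender.file=org.apache.log4j.RollingFileAppender
log4j.appender.file.File=/var/log/myapp/app.log
log4j.appender.file.MaxFileSize=100MB
log4j.appender.file.MaxBackupIndex=10
log4j.appender.file.layout=org.apache.log4j.PatternLayout
log4j.appender.file.layout.ConversionPattern=%d %p %c{1} %x - %m%n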

#6. Give Someone (Else) a Command – This is something no one does, but everyone should.  Your logs should tell an administrator what to do next (and it rarely involves you). The new criterion for writing a log message in production is that either something has gone wrong or something needs serious attention. If you are printing a message about something that has gone wrong, don’t assume that the person reading it has any understanding of the internals of the system. Give them a direct command.

Instead of this message:

ERROR: Failed to retrieve bean property for the customer object null.

Write this:

ERROR: Customer object was null after login. Call DBA, ask about customer #342R.

You see the difference? The second log gives the admin a fighting chance (it also shifts blame to the database).  In this case, someone sent you a corrupt customer record, so point someone at the DBA.  You’d likely redirect them there anyway.  This way the sysadmin can skip the call to engineering and go directly to the source of the problem.

If you do this right, you’ll minimize the production support burden. Trust me, you want to minimize your production support burden – if you don’t minimize this you won’t have much time for development because you will be fielding calls from production all the time.

#7. Provide Context – Unless you are logging a truly global condition like “Starting system…” or “Shutting down system…”, every log message should have some contextual data.  This is very often an identifier of a record, but what you should try to avoid is the log message that provides zero data or context.   The worst kind of message is something like this:

ERROR: End of File encountered in a stream.  Here's a stack trace and a line number...

This raises two questions: what is that stream from, and what exactly were you trying to do? A better log message might be:

ERROR: (API: Customer, Op: Read, Server: api-www23, ID: 32) EOF Problem. Call API team.

In this second example, we’re using something like Log4J’s Nested Diagnostic Context (NDC) to put details into the log that will help diagnose the problem in production.
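Here’s a minimal sketch of how that context gets attached with Log4j 1.x. The class and the failure scenario are invented, but NDC.push/NDC.pop and the %x conversion pattern are the real mechanism:

import org.apache.log4j.Logger;
import org.apache.log4j.NDC;

public class CustomerApi {

    private static final Logger log = Logger.getLogger(CustomerApi.class);

    public void read(String customerId, String server) {
        // Push the context once; every message logged on this thread now
        // carries it, rendered wherever the appender layout includes %x.
        NDC.push("API: Customer, Op: Read, Server: " + server + ", ID: " + customerId);
        try {
            // ... read the customer record from the stream ...
            log.error("EOF Problem. Call API team.");
        } finally {
            NDC.pop();  // NDC is thread-local; always clean up
        }
    }
}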

#8. Don’t Write Multiple Log Statements at a Time – Some developers see logs as an opportunity to provide a running commentary on the system, and they log everything that happens in a class. I dislike seeing this in both code and in a log.  Here’s an example: you have a single class somewhere and you see code like this:

log.info( "Retrieving customer record" + id );
Customer cust = retrieveCustomer( id );
if( cust != null ) {
     log.info( "Customer isn't null. Yay!" );
     log.info( "Doing fancy thin with customer. );
     doFancyThing( cust );
     log.info( "Fancy thing has been done." );
} else {
     log.error( "Customer " + id + " is null, what am I going to do?" );
}

And, in the log you have:

INFO: Retrieving customer record 3
INFO: Customer isn't null. Yay!
INFO: Doing fancy thing with customer.
INFO: Fancy thing has been done.

Consolidate all of these log messages into a single message and log the result (or don’t log at all unless something goes wrong).  Remember, logging is often a write to disk, and disks are insanely slow compared to everything else in production.   Every time you write a log entry to disk, think of the millions of CPU cycles you are throwing into the wind.

If a developer is writing a class that prints 10 log statements one after another, these log statements should be combined into a single statement.  Admins don’t really care to see every step of your algorithm described – that isn’t why they maintain the system.
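A sketch of the consolidated version, reusing the made-up names from the example above:

Customer cust = retrieveCustomer( id );
if( cust == null ) {
     // one actionable message instead of a running commentary
     log.error( "Customer " + id + " null after retrieve. Call DBA." );
     return;
}
doFancyThing( cust );
// a single terse record of the whole operation -- or, per #5, log nothing
log.debug( "Fancy thing done for customer " + id );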

#9. Don’t Use Logs for Range Checking – There’s a certain kind of logging that creeps into a system that has more to do with bad input than anything else.  If you find yourself constantly hitting NullPointerExceptions in something like Java, you may end up printing out variables to help you evaluate how things failed in production.  After a few years of this, you’ll end up with a production system that logs the value of every variable on every request.

You’ll end up with this:

Customer logged in value: { customer: { id: 3, name: "Charles", ......}
Purchasing products: { product: { id: 32, manufacturer: { id: 53, name: "Toyota"....}
Running through a list of orders: [ { order: { id: 325... }, { order: { id:2003...} ]

…and so on.  In fact, you may end up serializing the entire contents of your database to your log files using this method.

Programmers are usually doing this because they are trying to diagnose problems caused by bad input.  For example, if you read a customer record from the database, maybe you’ll just log its value somewhere so you have it available when you are debugging some production failure. Have a process that takes a customer and a product?  Well, why not print out both in the log, just in case we need them.   There are issues with customer records having null values, so… don’t do this; just create better database constraints.

This is the fastest road to unintelligible log files, and it also hints at another problem: you have awful data.  If you are dealing with user data, check it on the way in.   If you are dealing with a database, take some time to add constraints to the table so that you don’t have to test whether everything is null.  It’s an unachievable ideal, I know, but you should strive for it.

#10. Evaluate Logs Often – The things described in this post are really logging policy, and almost no one has one.  This is why we have these production logging disasters, and this is why we create systems that are tough to understand in production. To prevent this, put the evaluation of logging on some sort of periodic schedule.  Once every month, or once every release, check some metric that tells you whether log volume has outpaced customer growth.

You should conduct some investigations into how useful or how wasteful your current approach to logging is.   You should have some policy document that defines what each level means.  What is a DEBUG message in your system?  What should be an ERROR?  What does it mean to throw a FATAL? Prior to every release you should do a quick “sanity check” to make sure that you haven’t added some ridiculous log statement that is going to make maintaining the system awful.

But… most people don’t do these things, which is why production logs end up being a disaster.