Data Mining the Enron E-mail Corpus

About

This simple webpage describes and provides samples of various software projects I have worked on.

If you would like details of the algorithms behind a project, or some sample project code, please check my GitHub or contact me by email.

If you wish to present my graph images or results, please provide a reference back to this webpage. Specific graphs or dynamic Gephi visualisation videos for each Enron employee and for the complete dataset may be available on request.

Abstract

A large portion of corporate communication in the modern world is exchanged via email. It is important that an organisation can understand and visualise its employees' digital communication data. This study applies information processing, data mining and analysis techniques to the Enron email dataset in order to detect, visualise and classify heavily exchanged emails and to discover employee social network groups, while providing the option to monitor communication content. A complete visualisation of the Enron corporation's communication network was also produced.

Each Enron employee was included in the investigation, and specific mined data was extracted to build an individual employee profile, exposing information and statistics that would be impractical for a human to find manually.

The Enron Dataset contains data from about 150 users, mostly senior management of Enron, organized into folders. The corpus contains a total of about 517,000 messages (log files).

Tasks included, but were not limited to:

  • Data cleansing (Corrupt, Duplicate etc.)
  • Data extraction, transformation, parsing & database design
  • Employee personal & work email address identification

  • Identification of the smallest & largest (top 100) exchanged emails (replies and forwards) throughout the organisation.
  • Identification of the average number of replies and forwards an email accumulates over its lifespan.
  • Email circulation graphs visualising heavily exchanged emails between employees and external email addresses.
  • Generic context of popular emails
  • Complete Enron Organisation Communication Network (Visualising all employee communication)
  • Individual Social Network Analysis (Visualisation of each specific employee communication)
  • Identifying Email Group Members

  • Individual employee statistics (Original, Replied, Received and Forwarded)
  • Mining the number of communications (with to, cc and bcc counts) each employee or job role makes over certain periods of time.
  • Identifying each user’s most frequent contacts including the earliest, latest and number of communications.
  • Limited word analysis applied to certain employees' direct communication (1 - n emails)

Sample of results & graphs

  • Email log file cleansing, extraction, parsing & database design.

    Data                          Count
    email_messages                  255,205
    email_message_x_headers         255,205
    email_message_recipients      1,629,160
    duplicate_email_messages        261,944
    corrupt_email_messages              275
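
    The large duplicate count above is consistent with the same message being stored in several mailbox folders. The following is a minimal sketch of one way such duplicates could be flagged. It assumes that two messages are duplicates when their From, To, Date and Subject headers and their body match after whitespace and case are normalised; these criteria and the function names are illustrative assumptions, not the rules used in this project.

```python
import hashlib
from email import message_from_string


def message_fingerprint(raw_message: str) -> str:
    """Hash the fields assumed (hypothetically) to identify a message."""
    msg = message_from_string(raw_message)
    body = msg.get_payload() if not msg.is_multipart() else ""
    parts = [
        msg.get("From", ""),
        msg.get("To", ""),
        msg.get("Date", ""),
        msg.get("Subject", ""),
        body,
    ]
    # Collapse runs of whitespace and ignore case so trivial differences
    # between stored copies do not hide a duplicate.
    normalised = "\n".join(" ".join(str(p).split()).lower() for p in parts)
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()


def split_duplicates(raw_messages):
    """Return (unique, duplicates), keeping the first copy of each message."""
    seen = set()
    unique, duplicates = [], []
    for raw in raw_messages:
        key = message_fingerprint(raw)
        if key in seen:
            duplicates.append(raw)
        else:
            seen.add(key)
            unique.append(raw)
    return unique, duplicates
```

    Hashing a normalised subset of fields keeps the comparison cheap on a corpus of this size, at the cost of treating messages that differ only in ignored headers as the same message.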

  • Email Address Identification and Extraction

    After identifying each mailbox's owner (first name & surname), this data could be used to scan all email files identifying possible email addresses owned by each employee.

    After scanning all email files, the strict algorithm identified 672 email addresses mapped to the 150 Enron employees. Some employees had one email address mapped to their profile and others had many; for example Vince Kaminski had 15 distinct email addresses identified.

    Enron employee: dasovich-j

    • Strict Emails identified: [dasovich@wco.com, dasovich@inhale.com, dasovich.jeff@enron.com, jdasovich@enron.com, jeff_dasovich@ees.enron.com, dasovich@haas.berkeley.edu, jeffdasovich@enron.com, jeff.dasovich@enron.com, dasovich@enron.com, jeff_dasovich@enron.com]

    • Fuzzy Emails identified: [dasovich63@hotmail.com, dasovich.nancy@gene.com, dasovichd@aol.com, dasovichd@home.com, dasovich.nancy@enron.com, 'dasovich@enron.com, sheiladasovich@yahoo.com]
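
    The difference between the strict and fuzzy lists can be illustrated with a short sketch. The rules below are assumptions inferred from the sample output above, not the actual algorithm: an address is treated as strict when its local part, with separators removed, is exactly the surname or a first-name/initial and surname combination, and as fuzzy when it merely contains the surname. The function name `classify_address` is hypothetical.

```python
import re


def classify_address(address: str, first_name: str, surname: str) -> str:
    """Return 'strict', 'fuzzy' or 'none' for one address (assumed rules)."""
    local = address.split("@", 1)[0].lower()
    first, last = first_name.lower(), surname.lower()
    # Drop common separators so jeff.dasovich and jeff_dasovich compare equal.
    compact = re.sub(r"[._-]", "", local)
    strict_forms = {
        last,              # dasovich
        first + last,      # jeffdasovich
        last + first,      # dasovichjeff
        first[0] + last,   # jdasovich
    }
    if compact in strict_forms:
        return "strict"
    if last in local:
        # Contains the surname but does not match an expected pattern,
        # e.g. a relative's address or an address with stray characters.
        return "fuzzy"
    return "none"
```

    Under these assumed rules, jeff_dasovich@ees.enron.com is classified as strict, while dasovich63@hotmail.com and sheiladasovich@yahoo.com are classified as fuzzy, which matches the grouping shown above.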

  • Identifying each user’s most frequent contacts (top 20) including the earliest, latest and number of communications.

    This task was achieved by running a set of SQL statements for each individual against the database I designed and populated at the beginning. A left outer join made it possible to identify email addresses that had no employee_ID in the database.

    Example results:

    After the algorithm to discover the most frequent contacts had finished processing, the resulting statistics could be analysed. They showed that the Manager of Risk Management Head, Vince Kaminski, had sent 1619 emails to his personal email address vkaminski@aol.com. Such a large volume of email sent to a personal external account would raise serious suspicions. The president of Enron Online, Louise Kitchen, had sent the most emails (223) to the CEO of Enron America, John Lavorato (john.lavorato@enron.com). Similarly, John Lavorato had sent the most emails (173) to Louise Kitchen (louise.kitchen@enron.com), indicating a very close relationship. Further research into the organisation revealed that John Lavorato and Louise Kitchen had received bonuses in 2002 of $5 million and $2 million respectively for work on Enron's energy trading component [http://edition.cnn.com/2002/LAW/02/09/enron.bonuses/index.html]. The bonuses were awarded shortly before thousands of Enron employees were laid off and the company declared bankruptcy, sparking anger amongst employees, who suggested that the money should have been used for their severance packages.
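
    A query of the kind described above can be sketched as follows, using Python's built-in sqlite3 module. The table names email_messages and email_message_recipients come from the counts listed earlier, but every column name, the employee_addresses table and the sample rows are hypothetical, and the actual database engine and schema may differ.

```python
import sqlite3

# Hypothetical, simplified schema; the real column names are not known.
SCHEMA = """
CREATE TABLE email_messages (
    message_id INTEGER PRIMARY KEY, sender TEXT, sent_at TEXT);
CREATE TABLE email_message_recipients (
    message_id INTEGER, recipient TEXT, recipient_type TEXT);
CREATE TABLE employee_addresses (
    address TEXT PRIMARY KEY, employee_id INTEGER);
"""

TOP_CONTACTS_SQL = """
SELECT r.recipient,
       a.employee_id,              -- NULL when the address is not an employee's
       COUNT(*)       AS n,
       MIN(m.sent_at) AS earliest,
       MAX(m.sent_at) AS latest
FROM email_messages AS m
JOIN email_message_recipients AS r ON r.message_id = m.message_id
LEFT OUTER JOIN employee_addresses AS a ON a.address = r.recipient
WHERE m.sender = ?
GROUP BY r.recipient, a.employee_id
ORDER BY n DESC, r.recipient
LIMIT ?
"""


def top_contacts(conn, sender, limit=20):
    """Return the sender's most frequent recipients with first/last dates."""
    return conn.execute(TOP_CONTACTS_SQL, (sender, limit)).fetchall()


def demo_connection():
    """Build an in-memory database with made-up rows for illustration."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany("INSERT INTO email_messages VALUES (?, ?, ?)", [
        (1, "alice@example.com", "2001-01-05"),
        (2, "alice@example.com", "2001-02-10"),
        (3, "alice@example.com", "2001-03-15"),
        (4, "bob@example.com", "2001-03-16"),
    ])
    conn.executemany("INSERT INTO email_message_recipients VALUES (?, ?, ?)", [
        (1, "alice.home@example.org", "To"),
        (2, "alice.home@example.org", "To"),
        (3, "alice.home@example.org", "Cc"),
        (1, "bob@example.com", "To"),
        (3, "bob@example.com", "To"),
        (4, "alice@example.com", "To"),
    ])
    conn.executemany("INSERT INTO employee_addresses VALUES (?, ?)", [
        ("alice@example.com", 1),
        ("bob@example.com", 2),
    ])
    return conn
```

    Because of the left outer join, a recipient such as the external address in the sample data still appears in the result, with a NULL employee_id, rather than being dropped as it would be with an inner join.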


  • Email circulation graphs

    The graph visualisations present the sender, the original recipients and any additional recipients added via a reply or forward. All nodes were linked with edges based on who they inherited the email from. Edge colours were inherited from the source node, and a thicker edge indicated more communication between the source and target nodes. Edge labels state the initial contact method (Original, Reply or Forward) and the recipient type (To or Cc). Bcc recipients were ignored as they duplicated the Cc recipients. The following table describes the node colour scheme used in the graphs:

    Node Colour Meaning
    Red The email address that sent the original email message.
    Green A To recipient.
    Blue A Cc recipient.
    Pink A From header email address that was not present in the From, To or Cc fields of any previously exchanged email. This is most likely caused by an Enron employee deleting the reply or forward that included the address as a recipient. It is therefore unknown who the address inherited the email from, so the initial edge could not be established. Original emails whose recipients include distribution email addresses are more likely to produce pink nodes, because a distribution list hides its members' email addresses from the server logs. When a member then replies, it is unknown who they inherited the email from, as they were not in the recipient list.
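
    The colouring and edge rules described above can be sketched in a few lines of Python. This is an illustrative reconstruction under stated assumptions, not the project's code: the input format (a chronologically ordered list of dictionaries with `kind`, `from`, `to` and `cc` keys) and the function name are hypothetical, and the addresses in any test data are placeholders.

```python
from collections import Counter


def build_circulation_graph(messages):
    """Assign node colours and weighted, labelled edges for one email thread.

    messages: chronologically ordered dicts with keys 'kind'
    ('Original', 'Reply' or 'Forward'), 'from', 'to' and 'cc'.
    """
    colours = {}              # address -> node colour
    edge_weights = Counter()  # (source, target) -> number of messages
    edge_labels = {}          # (source, target) -> initial contact label
    for index, msg in enumerate(messages):
        sender = msg["from"]
        if index == 0:
            colours[sender] = "red"    # sender of the original email
        elif sender not in colours:
            colours[sender] = "pink"   # sender never seen as a recipient
        for rtype, colour in (("To", "green"), ("Cc", "blue")):
            for recipient in msg.get(rtype.lower(), []):
                # A node keeps the colour it was given on first appearance.
                colours.setdefault(recipient, colour)
                edge = (sender, recipient)
                edge_weights[edge] += 1
                # The label records how the two were first connected.
                edge_labels.setdefault(edge, f"{msg['kind']} ({rtype})")
    return colours, edge_weights, edge_labels
```

    The edge weights can then be mapped to edge thickness in the visualisation tool, and each edge can take the colour of its source node.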

    The visualisation shows that marie.heard@enron.com was the original sender and that she sent the email to billy.dixon@bp.com, tana.jones@enron.com and carol.st.@enron.com as To, Cc and Cc recipients respectively. Billy communicates frequently with Marie, Tana and Carol and introduces several new Cc recipients: l.voinorosky@enron.com, e..mcdermed@enron.com and j.ruitenschild@enron.com. Marie forwards the email to debra.perlinglere@enron.com. darren.van@enron.com replies to Marie and Carol, but it is unknown who Darren inherited the email from. Note that most of the communication is between the sender, Marie, and the three main recipients, Billy, Carol and Tana, as indicated by the thickness of the communication edges.

    Another example:


  • Identifying email group members from email logs


  • Enron Organisation Communication Network

    All emails were scanned and all communication between each employee was recorded and mapped.
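
    One way to record and map this communication is to count messages per sender and recipient pair and export the counts as an edge list. The sketch below assumes that only employee-to-employee communication is kept and that the result is imported into Gephi as a CSV edge list with Source, Target and Weight columns; the function names and input format are hypothetical, not taken from the project.

```python
import csv
from collections import Counter


def build_edge_list(messages, employee_addresses):
    """Count sender -> recipient messages where both ends are employees.

    messages: iterable of (sender, [recipients]) pairs.
    employee_addresses: set of addresses known to belong to employees.
    """
    weights = Counter()
    for sender, recipients in messages:
        if sender not in employee_addresses:
            continue
        for recipient in recipients:
            if recipient in employee_addresses and recipient != sender:
                weights[(sender, recipient)] += 1
    return weights


def write_edge_csv(weights, handle):
    """Write a Source,Target,Weight edge list to an open text file."""
    writer = csv.writer(handle)
    writer.writerow(["Source", "Target", "Weight"])
    for (source, target), weight in sorted(weights.items()):
        writer.writerow([source, target, weight])
```

    When writing to a real file, open it with newline="" as recommended by the csv module so that row endings are not doubled on some platforms.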

    The following is the node colour scheme: