Email pipeline to train GPT-2

Daniel Rojas Ugalde
Mar 31, 2019

By now, probably everybody has heard about OpenAI's language model GPT-2. They released a small version of the model and a great blog post (https://openai.com/blog/better-language-models/). The samples include great examples about unicorns, Miley Cyrus and even Buddhist-ish writing (@Miles_Brundage).

I have been thinking about applying GPT-2 to emails. We all have lots of emails, and everybody has their own style. I built a small pipeline that takes a csv (exported from Outlook) and returns a txt file. You can feed that txt file to a Google Colab GPT-2 notebook and check the results. I got some interesting results:

“Daniel,Following up with my voice message, have you all reviewed the ***(redacted) demo? We looked at the quality and was pretty positive we are on to the next level of tools and processing power Let me know if you have any concerns here”

“Team, here is the email of the people who are working on the project: Basically we have 2 approaches for the document:

1 We need to have a set of team members determine the level of effort based on the attributes these will be using, and then we can go over the level of effort on the License Manager

2 We also need to have someone to help communicate the same information to the client on the Confirmative Action”

The emails look very human in my opinion and have a scary similarity to some of the stuff we work on.

We will use a couple of libraries for the pipeline: pandas and langdetect. Pandas will be used to read the csv files and handle dataframes. Langdetect will help us detect the language of each email (in my case I had English and Spanish emails mixed).

The pipeline has different phases or modules:

  1. Read the csvs and concatenate them (Outlook creates one csv per folder in the mailbox)
  2. Filter by the address in the From field
  3. Detect the language and keep only the emails that are in English
  4. Try to detect the email signature, so that the text after it (other emails in the thread) is not fed to the model
  5. Generate the txt file
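The five phases can be sketched roughly as follows. This is not the code from the notebook, just a minimal illustration under a few assumptions: the column names `From: (Address)` and `Body` are what I would expect from an Outlook export and should be checked against your own csv header, the function names are made up, and the signature detection is a naive keyword heuristic that cuts the body at the first sign-off or quoted-reply marker.

```python
import re
from pathlib import Path

import pandas as pd

# Assumed column names; check the header of your own Outlook export.
FROM_COL = "From: (Address)"
BODY_COL = "Body"

# Naive markers for a sign-off or a quoted reply. Everything from the first
# matching line onward is dropped. A heuristic, not a robust parser.
CUT_MARKERS = re.compile(
    r"^\s*(?:"
    r"--\s*$"  # conventional signature delimiter
    r"|(?:best regards|kind regards|regards|thanks|thank you)\s*[,.!]?\s*$"
    r"|sent from my\b"
    r"|from:\s"
    r"|-+\s*original message"
    r")",
    re.IGNORECASE,
)


def is_english(text):
    """Return True when langdetect labels the text as English."""
    # Imported lazily so the other steps can run without langdetect installed.
    from langdetect import detect
    from langdetect.lang_detect_exception import LangDetectException

    try:
        return detect(text) == "en"
    except LangDetectException:
        return False


def strip_signature(body):
    """Keep only the lines before the first sign-off or reply marker."""
    kept = []
    for line in body.splitlines():
        if CUT_MARKERS.match(line):
            break
        kept.append(line)
    return "\n".join(kept).strip()


def build_corpus(csv_dir, sender, out_path,
                 language_filter=is_english, encoding="utf-8"):
    """Run the five phases and return the number of emails written."""
    # 1. Read every csv in the folder and concatenate them.
    paths = sorted(Path(csv_dir).glob("*.csv"))
    frames = [pd.read_csv(p, encoding=encoding) for p in paths]
    emails = pd.concat(frames, ignore_index=True)

    # 2. Keep only the emails sent from the given address.
    emails = emails[emails[FROM_COL].astype(str).str.lower() == sender.lower()]
    bodies = emails[BODY_COL].dropna().astype(str)

    # 3. Keep only the emails that pass the language check.
    bodies = bodies[bodies.map(language_filter).astype(bool)]

    # 4. Drop the signature and anything quoted after it.
    bodies = bodies.map(strip_signature)
    bodies = bodies[bodies.str.len() > 0]

    # 5. Write all bodies to one txt file, separated by blank lines.
    Path(out_path).write_text("\n\n".join(bodies) + "\n", encoding="utf-8")
    return len(bodies)
```

A call would look something like `build_corpus("exports", "me@example.com", "emails.txt")`, where the folder, address and output name are placeholders. The language check is passed in as a function so it can be swapped out or tested separately.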

You can find the Jupyter notebook here: https://github.com/drojasug/emailPipelineLanguageModels.

Please let me know if you have questions or comments, and feel free to improve the pipeline (I think the first place to start is the email signature detection).
