Summarizing Text in Python

by Matt Gallivan


def summarize(text, num_summaries=3):
    '''Create small summaries of a larger text.'''
    summaries = []
    # ...
    return summaries[:num_summaries]
	  

How?

TextRank

1. Break text into pieces.
2. Connect the pieces in a graph.
3. Find the most important piece.

1. Break Text Into Pieces

Sentences!


import re

def sentences(text):
	'''Break text blob into sentences.'''
	ends = re.compile('[.?!]')
	return ends.split(text)
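
Why "better sentences" are needed: a quick sketch of what the regular-expression splitter does with abbreviations and decimals. The sample text and the name `naive_sentences` are made up for illustration.

```python
import re

def naive_sentences(text):
    '''Split on ., ? and ! with no knowledge of abbreviations or numbers.'''
    return re.compile('[.?!]').split(text)

# Illustrative input: one abbreviation, one decimal, two real sentences.
pieces = naive_sentences('Dr. Smith paid $3.50 for coffee. Was it good?')
for piece in pieces:
    print(repr(piece))
```

The two sentences come back as five pieces, including an empty string at the end, and the punctuation is lost.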
	    

Better Sentences!


from nltk.tokenize import sent_tokenize

def sentences(text):
	'''Break text blob into sentences.'''
	return sent_tokenize(text)
	    

2. Connect the pieces in a graph

Graph?

Edges


def connect(nodes):
    '''Return a list of edges connecting the nodes,
    where the edges are given a weight based on their
    similarity.'''
    return [(start, end, similarity(start, end))
            for start in nodes
            for end in nodes
            if start is not end]
	    

from math import log

def similarity(c1, c2):
    '''Return the amount of similarity between two chunks.'''
    # Shared word count, normalized by the log of each chunk's length.
    return len(common_words(c1, c2)) / (
        log(len(words(c1))) + log(len(words(c2))))
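
The slides do not show `words` or `common_words`. Here is one possible sketch of them, assuming a word is a lowercase run of letters, digits, or apostrophes. This version of `similarity` also adds guards that are not in the slide code: an empty chunk has no logarithm, and two one-word chunks give a zero denominator.

```python
import re
from math import log

def words(chunk):
    '''Return the lowercase words in a chunk (assumed helper, not from the slides).'''
    return re.findall(r"[a-z0-9']+", chunk.lower())

def common_words(c1, c2):
    '''Return the set of words that appear in both chunks (assumed helper).'''
    return set(words(c1)) & set(words(c2))

def similarity(c1, c2):
    '''Shared words divided by the summed log lengths of the two chunks.'''
    w1, w2 = words(c1), words(c2)
    # log(0) is undefined, so an empty chunk is treated as "no similarity".
    if not w1 or not w2:
        return 0.0
    denominator = log(len(w1)) + log(len(w2))
    # Two one-word chunks give log(1) + log(1) == 0.
    if denominator == 0:
        return 0.0
    return len(common_words(c1, c2)) / denominator

print(similarity('The cat sat', 'The cat ran'))
```

Dividing by the log lengths keeps long sentences from winning just because they contain more words.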
	    

3. Find the most important piece

PageRank

What websites are the most important?


# INPUT:
nodes = get_all_websites()
edges = connect_all_websites_that_link_to_each_other(nodes)

# OUTPUT
pagerank(nodes, edges) # =>

{
    'www.google.com' : 40000000000000000,
    'www.yahoo.com' : 25,
    'www.reddit.com' : 23,
    'news.ycombinator.com' : 14,
    # ...
}
	    

PageRank

Use sentences instead of websites!


# INPUT:
nodes = sentences(text)
edges = connect(nodes)

# OUTPUT
pagerank(nodes, edges) # =>

{
    'A really good summary sentence!' : 243,
    'Pretty good at summarizing - maybe too specific though' : 165,
    'Not that great.' : 142,
    # ...
}
	    

PageRank


import networkx as nx

def rank(nodes, edges):
    '''Return a dictionary containing the scores for each vertex.'''
    graph = nx.DiGraph()
    graph.add_nodes_from(nodes)
    graph.add_weighted_edges_from(edges)
    return nx.pagerank(graph)
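
What is PageRank doing under the hood? Roughly this: every node repeatedly hands its score to its neighbors in proportion to edge weight, plus a small uniform share for everyone. The sketch below is a simplified power iteration, not networkx's actual implementation. The name `pagerank_sketch`, the 0.85 damping factor, and the stopping rule are assumptions made for illustration.

```python
def pagerank_sketch(nodes, edges, damping=0.85, iterations=100, tol=1e-9):
    '''Return a dict of scores that sum to 1, one per distinct node.'''
    nodes = list(dict.fromkeys(nodes))  # drop duplicates, keep order
    n = len(nodes)
    if n == 0:
        return {}
    # Total outgoing weight per node, used to split its score among neighbors.
    out_weight = {node: 0.0 for node in nodes}
    for start, end, weight in edges:
        out_weight[start] += weight
    scores = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Score held by nodes with no outgoing weight is shared by everyone.
        dangling = sum(scores[node] for node in nodes if out_weight[node] == 0)
        base = (1.0 - damping) / n + damping * dangling / n
        new_scores = {node: base for node in nodes}
        for start, end, weight in edges:
            if out_weight[start] > 0:
                share = weight / out_weight[start]
                new_scores[end] += damping * scores[start] * share
        change = sum(abs(new_scores[node] - scores[node]) for node in nodes)
        scores = new_scores
        if change < tol:
            break
    return scores

# Toy graph: 'a' and 'c' both point at 'b', and 'b' points back at 'a'.
toy_scores = pagerank_sketch(['a', 'b', 'c'],
                             [('a', 'b', 1.0), ('c', 'b', 1.0), ('b', 'a', 1.0)])
print(max(toy_scores, key=toy_scores.get))
```

In the toy graph, 'b' comes out on top because two nodes point at it, and 'c' comes out last because nothing points at it.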
	    

Putting It Together


def summarize(text, num_summaries=3):
    '''Create small summaries of a larger text.'''
    nodes = sentences(text)
    edges = connect(nodes)
    scores = rank(nodes, edges)
    # Highest-scoring sentences first.
    return sorted(scores, key=scores.get, reverse=True)[:num_summaries]
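
Picking the winners from a score dictionary, sketched on its own. The scores reuse the example numbers from the earlier slide, and the helper name `top_sentences` is an assumption for illustration.

```python
def top_sentences(scores, num_summaries=3):
    '''Return the highest-scoring keys, best first.'''
    # sorted() goes from lowest to highest by default, so reverse it.
    return sorted(scores, key=scores.get, reverse=True)[:num_summaries]

example_scores = {
    'Not that great.': 142,
    'A really good summary sentence!': 243,
    'Pretty good at summarizing - maybe too specific though': 165,
}
best = top_sentences(example_scores, 2)
print(best)
```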
	  

Adobe Customer Data

"Very recently, Adobe's security team discovered sophisticated attacks on our network, involving illegal access of customer information as well as source code for numerous Adobe products", Arkin said in a blog post.

Twitter IPO

The company unsealed the documents on Thursday, disclosing that it generated $317 million in revenue in 2012 and that it had more than 218 million active users as of the end of June, up 44 percent from a year earlier.

Twitter has revealed its highly anticipated stock offering, with the hugely popular messaging platform stating that it seeks to raise $1bn.

Green Eggs and Ham

I do not like them in a house.

I do not like them in a box.

I will not eat them in a house.

I do not like them with a fox.

I do not like them with a mouse.

And I would eat them with a goat... And I will eat them, in the rain.

And I will eat them, in the rain.

In Conclusion...

1. Machine learning doesn't have to be hard

2. There are a ton of libraries to help you

3. Machine learning is good