GraphFrames in Jupyter: a practical guide


Graph analysis, originally a method used in computational biology, has become an increasingly prominent data analysis technique for both social network analysis (community mining and modeling author types) and recommender systems. A simple and intuitive example is the once so famous Facebook friend cluster maps, which visualize the hidden structure of your Facebook network.

For a while, power graph analysis has been a discipline reserved for academics and Facebook or Twitter insiders. But since packages such as GraphX and GraphFrames were released, everyone with a little bit of data and a Jupyter Notebook can easily play around in the fascinating world of graphs.


[Image: Facebook friend cluster map, from lostcircles.com]

In this article I provide an extensive teaser of the GraphFrames API in Jupyter Notebooks. First we walk through setting up the Jupyter Notebook environment for the GraphFrames package (on a Windows machine). Then we cover the basic functionality and how to import data into the frames, and finish with an explanation of some of the commonly used high-level functions. Please note that this article assumes basic knowledge of Spark and Spark DataFrames.

GraphX is to RDDs as GraphFrames are to DataFrames.

The functionality of GraphFrames and GraphX is essentially the same; the only difference is that GraphFrames are built on Spark DataFrames rather than RDDs. If you are used to working with GraphX, GraphFrames should be easy to learn just by browsing through the API documentation.

Setting up the environment

The first thing to check before proceeding is that Spark and pyspark are properly installed on your machine. For simplicity we will run Spark in local mode throughout this tutorial. The easiest way to check the installation is to enter pyspark in the shell of your Python distribution. You should see something like this if both are installed properly:

Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.xx.xx
      /_/

Using Python version 3.xx.xx

For more information about setting up Apache Spark and pyspark I recommend this tutorial and the official documentation.

Next, you want to set your environment variables to run pyspark in the Jupyter Notebook (as opposed to the shell). This can be achieved easily by adding two environment variables:

set PYSPARK_DRIVER_PYTHON=jupyter
set PYSPARK_DRIVER_PYTHON_OPTS=notebook

Then navigate to the location where you want to store the new notebook and run pyspark again in your shell, but add a --packages flag to indicate that you want to use the GraphFrames package. Here the newest version (at the time of writing) is used; any older version can be used by changing the last part of the argument:

pyspark --packages graphframes:graphframes:0.5.0-spark2.1-s_2.11

Create a new Notebook and make sure you can successfully run:

from graphframes import *

Getting started

The core class of the package is, surprisingly, the GraphFrame. A GraphFrame is always created from a vertex DataFrame (e.g. users) and an edges DataFrame (e.g. relationships between users). The schema of both DataFrames has some mandatory columns. The vertex DataFrame must contain a column named id that stores unique vertex IDs. The edges DataFrame must contain a column named src that stores the source of the edge and a column named dst that stores the destination of the edge. All other columns are optional and can be added depending on one's needs.

An easy example can be:

from pyspark import *
from pyspark.sql import *
spark = SparkSession.builder.appName('fun').getOrCreate()
vertices = spark.createDataFrame([('1', 'Carter', 'Derrick', 50),
                                  ('2', 'May', 'Derrick', 26),
                                  ('3', 'Mills', 'Jeff', 80),
                                  ('4', 'Hood', 'Robert', 65),
                                  ('5', 'Banks', 'Mike', 93),
                                  ('98', 'Berg', 'Tim', 28),
                                  ('99', 'Page', 'Allan', 16)],
                                 ['id', 'name', 'firstname', 'age'])
edges = spark.createDataFrame([('1', '2', 'friend'),
                               ('2', '1', 'friend'),
                               ('3', '1', 'friend'),
                               ('1', '3', 'friend'),
                               ('2', '3', 'follows'),
                               ('3', '4', 'friend'),
                               ('4', '3', 'friend'),
                               ('5', '3', 'friend'),
                               ('3', '5', 'friend'),
                               ('4', '5', 'follows'),
                               ('98', '99', 'friend'),
                               ('99', '98', 'friend')],
                              ['src', 'dst', 'type'])
g = GraphFrame(vertices, edges)
## Take a look at the DataFrames
g.vertices.show()
g.edges.show()
## Check the number of edges of each vertex
g.degrees.show()

Of course you want to use real-life, actual data. You can import anything from a simple CSV file to a Parquet file into a DataFrame. Then name your columns appropriately, filter (if needed) and move on from there. More information about importing data into a Spark DataFrame can be found in the documentation.

The GraphFrame we just created is a directed one, and can be visualized as follows:

[Image: visualization of the example graph]


Directed vs undirected edges

Undirected graphs have edges that do not have a direction. The edges indicate a two-way relationship, in that each edge can be traversed in both directions. If your DataFrame only consists of two-way directed edges, you may be interested in analyzing undirected edges. You can convert your graph by mapping a function over the edges DataFrame that deletes the row if src ≥ dst (or the other way around). In GraphX you could use to_undirected() to create a deep, undirected copy of the graph; unfortunately GraphFrames does not support this functionality.

An easy example to work around this missing functionality can be found in the following code snippet. Please note that the 'follows' edge doesn't really make sense in an undirected graph, since it doesn't represent a two-way relationship.

copy = edges
from pyspark.sql.functions import udf
@udf("string")
def to_undir(src, dst):
    if src >= dst:
        return 'Delete'
    else:
        return 'Keep'

copy.withColumn('undir', to_undir(copy.src, copy.dst))\
    .filter('undir == "Keep"').drop('undir').show()
## for efficiency, it's better to avoid udf functions where possible
## and use built-in pyspark.sql.functions instead.

Filtering and connected components

A GraphFrame itself can't be filtered, but DataFrames derived from a graph can. Consequently, the filter function (or any other DataFrame function) can be used just as you would use it with DataFrames. The only pitfall might be the correct use of quotation marks: the whole condition should be quoted. The examples below should clarify this.

g.vertices.filter("age > 30").show()
g.inDegrees.filter("inDegree >= 2").sort("inDegree", ascending=False).show()
g.edges.filter('type == "friend"').show()

A connected component of a graph is a subgraph in which any two vertices are connected to each other by one or more edges, and which is connected to no additional vertices in the supergraph. In the (undirected) example below there are three connected components. Connected components detection can be interesting for clustering, but also to make your computations more efficient.


[Image: a graph with three connected components. Source: Wikipedia]

Practically, GraphFrames requires you to set a directory where it can save checkpoints. Create such a folder in your working directory and run the following line (where graphframes_cps is your new folder) in Jupyter to set the checkpoint directory:

sc.setCheckpointDir('graphframes_cps')

Then, the connected components can easily be computed with the connectedComponents-function.

g.connectedComponents().show()

Our mini-graph has two connected components, which are described for each vertex in the component column.

+---+------+---------+---+------------+
| id| name|firstname|age| component|
+---+------+---------+---+------------+
| 1|Carter| Derrick| 50|154618822656|
| 2| May| Derrick| 26|154618822656|
| 3| Mills| Jeff| 80|154618822656|
| 4| Hood| Robert| 65|154618822656|
| 5| Banks| Mike| 93|154618822656|
| 98| Berg| Tim| 28|317827579904|
| 99| Page| Allan| 16|317827579904|
+---+------+---------+---+------------+

Motif finding

Motif finding lets you execute queries that discover structural patterns in graphs. Network motifs are patterns that occur repeatedly in the graph and represent the relationships between the vertices. GraphFrames motif finding uses a declarative Domain Specific Language (DSL) for expressing structural queries.

The query can be invoked by using the find-function, where the motif (in quotation marks) is expressed as the first parameter of the function.

The following example will search for pairs of vertices a, b connected by edge e in one direction and by edge e2 in the other direction. It will return a DataFrame of all such structures in the graph, with columns for each of the named elements (vertices or edges) in the motif.

g.find("(a)-[e]->(b); (b)-[e2]->(a)").show()

If edges and/or vertices are anonymous, they won't be displayed in the resulting DataFrame. Motifs can be joined by a semicolon and can be negated with an exclamation mark. More details about the Domain Specific Language can be found in the documentation.

As an example we can try to find the mutual friends for any pair of users a and c. In order to be a mutual friend b, b must be friends with both a and c (and not just followed by c, for example).

mutualFriends = g.find("(a)-[]->(b); (b)-[]->(c); (c)-[]->(b); (b)-[]->(a)")\
                 .dropDuplicates()

To query all the mutual friends between users 2 and 3, we can filter the DataFrame.

mutualFriends.filter('a.id == 2 and c.id == 3').show()

The b-column will show all the mutual friends (just one in this case).

+--------------------+--------------------+--------------------+
| a| b| c|
+--------------------+--------------------+--------------------+
|[2, May, Derrick,...|[1, Carter, Derri...|[3, Mills, Jeff, 80]|
+--------------------+--------------------+--------------------+

TriangleCount and PageRank

To finish up, we'll discover two additional built-in algorithms. TriangleCount counts the number of triangles passing through each vertex in the graph. A triangle can be defined as a group of three vertices that are mutually connected, i.e. a has an edge to b, b has an edge to c, and c has an edge to a. The example below shows a graph with two triangles.


[Image: a graph with two triangles, from geeksforgeeks.org]

In the GraphFrames package you can count the number of triangles passing through each vertex by invoking the triangleCount-function. Note that our simple example has only two triangles in total. Triangles are used for various tasks on real-life networks, including community discovery, link prediction, and spam filtering.

g.triangleCount().show()

The last function we discuss is PageRank. PageRank works by counting the number and quality of links to a page to determine a rough estimate of how important the website is. The underlying assumption is that more important websites are likely to receive more links from other websites.

The PageRank algorithm holds that an imaginary surfer who is randomly clicking on links will eventually stop clicking. The probability, at any step, that the surfer continues clicking is the damping factor. In GraphFrames this is controlled through the resetProbability parameter, which is 1 minus the damping factor (the probability of jumping to a random vertex instead). Other important parameters are the tolerance (tol) and the maximum number of iterations (maxIter).

pr = g.pageRank(resetProbability=0.15, tol=0.01)
## look at the pagerank score for every vertex
pr.vertices.show()
## look at the weight of every edge
pr.edges.show()

