An Introduction to pySpark and TigerGraph

Interacting with TigerGraph’s AMLSim Fraud Detection Graph with pySpark

Shreya Chaudhary
DataDrivenInvestor

--

Image from Pixabay

Introduction

Preface

Note: You can find a similar blog post on integrating Spark and TigerGraph here. This blog will walk through similar content but will use pySpark instead of Spark. The setup—most of Part I and all of Part II—will be the same, so you can either skip or briefly skim over those sections if you read the other blog.

Opening

Apache Spark is a data processing engine known for its ability to handle large datasets by distributing processing tasks across multiple computers. PySpark is the Python API for Spark, and it allows Python users to manipulate data with Spark entirely in Python. TigerGraph is an enterprise-scale graph database. Together, PySpark's processing capabilities and TigerGraph's graph database can be used to manipulate and analyze big data to make big discoveries.

This tutorial will walk through the basics of getting started with both pySpark and TigerGraph and using TigerGraph’s JDBC driver to connect the two.

Tools

  • Scala (v2.12.15) → Spark is written in Scala, so Scala will need to be downloaded.
  • Java (v16.0.1) → This should already be installed on the computer. Java will be used to process the driver.
  • TigerGraph On-Premise (v3.6.0) → TigerGraph on-premise will be used to run and host the database.
  • TigerGraph JDBC Driver (v1.3.0) → This will connect Spark with TigerGraph.
  • Apache Spark (v3.2.1) → The primary data processing software used.
  • Python3 (v3.8.9) → The primary language used for this project is Python.
  • Pip (v22.1.2) → Pip is a Python package installer and an alternate method to install pySpark.
  • pySpark (v3.2.1) → This is the Python API for Spark that will be used to interact with TigerGraph.
  • Docker Desktop (v4.2.0.5815) → Docker will be used to run TigerGraph on-premise.

Code

The full code for this project can be found here.

Outline

  1. Installation
  2. Create the AML Sim Graph
  3. Create the Project
  4. Read TigerGraph Data with PySpark
  5. Resources and Next Steps

Part I: Installation

Spark

To start off, we will download Spark and Scala using Homebrew, a package manager for macOS and Linux. Spark is the data manipulation software we will use, and Spark itself is built on Scala.

brew install scala && brew install apache-spark

To verify Spark and Scala were downloaded properly, open the Spark shell with:

spark-shell
The welcome page of Spark’s shell

If the shell runs without any errors, Spark is successfully downloaded!

PySpark

PySpark should get automatically downloaded when installing Spark. To verify it is installed, run pyspark to open the Python Spark shell.

pyspark
pySpark Shell

Note: Notice that, unlike the previous shell, this Python shell is preceded with >>> instead of scala>.

If it is not downloaded, pySpark can also be downloaded via pip. To do so, create a virtual environment and pip install pySpark.

python3 -m venv venv
source venv/bin/activate
pip install pyspark

JDBC Driver

Download the latest JDBC driver from Maven.

At the time of publication, the most recent driver is 1.3.0, so that will be used for this blog. Hold on to the tigergraph-jdbc-driver-1.3.0.jar file downloaded; we will use it when creating the repository.

TigerGraph On-Premise

TigerGraph can be downloaded via Docker. The full instructions to download it can be found here.

In summary, start Docker on your computer (by downloading the Desktop app and running the application). Next, create a folder on your machine to hold the Docker data.

mkdir data 
chmod 777 data

Finally, run the TigerGraph Docker image.

docker run -d -p 14022:22 -p 9000:9000 -p 14240:14240 --name tigergraph --ulimit nofile=1000000:1000000 -v ~/data:/home/tigergraph/mydata -t docker.tigergraph.com/tigergraph:latest

To enter the TigerGraph shell, you can ssh into the solution.

ssh -p 14022 tigergraph@localhost

At the password prompt, enter tigergraph.

TigerGraph Welcome Message

Perfect! Your TigerGraph box is ready to load a graph!

Part II: Create the AML Sim Graph

Set Up

To begin, run gadmin start all in the TigerGraph shell. This will start all TigerGraph services.

gadmin start all
gadmin start all

The graph this tutorial will load is the AML Sim Graph. To start off, clone the repository.

git clone https://github.com/TigerGraph-DevLabs/AMLSim_Python_Lab.git
cd AMLSim_Python_Lab

This repository contains the GSQL files needed to create the graph.

Schema

To build the schema, run schema.gsql.

gsql db_scripts/schema/schema.gsql

This will create the AMLSim graph.

Loading Jobs

Next, to load the data, run the three loading files.

cd db_scripts/load
gsql load_job_accounts.gsql
gsql load_job_alerts.gsql
gsql load_job_transactions.gsql
cd ../..

This will load the CSV data in the data folder to the graph.

Install Queries

Finally, install queries to the graph database. This tutorial will only walk through an example of running one query, but you may install as many as you would like.

gsql db_scripts/query/selectAccountTx.gsql

Exploring in GraphStudio

To view the AMLSim graph visually, you can open http://localhost:14240/ to view GraphStudio. There, you can access the schema, view the stats of the loaded data, and run queries.

Data loading page in GraphStudio

Part III: Create the Project

Project Structure

Unlike Scala and Java projects, Python Spark projects do not need a specific structure. All of the files for this project will be located in the same folder.

index.py
tigergraph-jdbc-driver-1.3.0.jar

Note: The version and naming syntax of the .jar driver can change; just ensure it is the file downloaded from Maven.

index.py

Start index.py by importing SparkSession from pySpark.

from pyspark.sql import SparkSession

Next, set the spark variable using SparkSession.builder.

spark = SparkSession.builder \
    .appName("TigerGraphAnalysis") \
    .config("spark.driver.extraClassPath", "/usr/local/Cellar/apache-spark/3.2.1/libexec/jars/*:tigergraph-jdbc-driver-1.3.0.jar") \
    .getOrCreate()

The appName is simply the name of the project (in my case, "TigerGraphAnalysis"). The .config sets the driver's classpath, which combines the directory of Spark's jars with the path to the TigerGraph driver. If you installed Spark via Homebrew, the jars directory will look like the above, /usr/local/Cellar/apache-spark/3.2.1/libexec/jars/*. Next, if the driver is in the same folder as the file, you can simply use the filename, tigergraph-jdbc-driver-1.3.0.jar. Finally, join the two paths with a colon and pass the result to the config: /usr/local/Cellar/apache-spark/3.2.1/libexec/jars/*:tigergraph-jdbc-driver-1.3.0.jar.
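If you prefer not to hard-code the classpath, it can be assembled programmatically. Below is a minimal sketch; the build_classpath helper is my own for illustration, and the Homebrew path is an assumption that you should adjust to your Spark version and install location.

```python
import os

def build_classpath(spark_jars_dir, driver_jar):
    # Join the Spark jars wildcard and the driver path with a colon,
    # the format spark.driver.extraClassPath expects on macOS/Linux.
    return f"{os.path.join(spark_jars_dir, '*')}:{driver_jar}"

classpath = build_classpath(
    "/usr/local/Cellar/apache-spark/3.2.1/libexec/jars",  # assumed Homebrew location
    "tigergraph-jdbc-driver-1.3.0.jar",                   # driver in the project folder
)
print(classpath)
# /usr/local/Cellar/apache-spark/3.2.1/libexec/jars/*:tigergraph-jdbc-driver-1.3.0.jar
```

The resulting string can then be passed to .config("spark.driver.extraClassPath", classpath) when building the session.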

Run the Project

To run the project, spark-submit must be used to run the file and the driver must be specified with the --jars flag.

spark-submit --jars tigergraph-jdbc-driver-1.3.0.jar index.py

With that, we are set to begin interacting with the AMLSim graph!

Part IV: Read TigerGraph Data with PySpark

In general, pySpark's syntax closely mirrors Spark's Scala syntax.

Read Vertices

jdbcDF1 = spark.read \
    .format("jdbc") \
    .option("driver", "com.tigergraph.jdbc.Driver") \
    .option("url", "jdbc:tg:http://127.0.0.1:14240") \
    .option("user", "tigergraph") \
    .option("password", "tigergraph") \
    .option("graph", "AMLSim") \
    .option("dbtable", "vertex Transaction") \
    .option("limit", "1000") \
    .option("debug", "0") \
    .load()

jdbcDF1.show()

The vertex read combines options that configure the JDBC driver with options that specify graph information and credentials. To select vertices, the dbtable must be vertex followed by the vertex type (in this case, Transaction). Finally, limit specifies the maximum number of vertices to return.

Vertex results
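Since every read in this tutorial repeats the same connection options, they can be collected in one place. A small sketch, where tg_read_options is a hypothetical helper of my own (not part of the TigerGraph driver) and the URL and credentials simply mirror the examples in this post:

```python
def tg_read_options(dbtable, limit=None, source=None):
    # Options shared by every TigerGraph JDBC read in this tutorial;
    # only the dbtable (and optional limit/source) vary per read.
    opts = {
        "driver": "com.tigergraph.jdbc.Driver",
        "url": "jdbc:tg:http://127.0.0.1:14240",
        "user": "tigergraph",
        "password": "tigergraph",
        "graph": "AMLSim",
        "dbtable": dbtable,
        "debug": "0",
    }
    if limit is not None:
        opts["limit"] = str(limit)
    if source is not None:
        opts["source"] = str(source)
    return opts

# "vertex" + vertex type selects vertices; "edge" + edge type selects edges.
vertex_opts = tg_read_options("vertex Transaction", limit=1000)
edge_opts = tg_read_options("edge RECEIVE_TRANSACTION", limit=1000, source=9934)
```

The dict can then be applied in one call, e.g. spark.read.format("jdbc").options(**vertex_opts).load(), instead of chaining .option repeatedly.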

Read Edges

jdbcDF2 = spark.read \
    .format("jdbc") \
    .option("driver", "com.tigergraph.jdbc.Driver") \
    .option("url", "jdbc:tg:http://127.0.0.1:14240") \
    .option("user", "tigergraph") \
    .option("password", "tigergraph") \
    .option("graph", "AMLSim") \
    .option("dbtable", "edge RECEIVE_TRANSACTION") \
    .option("limit", "1000") \
    .option("source", "9934") \
    .option("debug", "0") \
    .load()

jdbcDF2.show()

The function to read edges is similar. The dbtable, in this case, is edge followed by the edge type, RECEIVE_TRANSACTION. Once again, limit is the maximum number of edges to return. Finally, source is the id of the edge's source vertex; in this example, the source is a Transaction vertex with the id 9934.

There is only one edge connected to Transaction 9934.

Run Queries

jdbcDF3 = spark.read \
    .format("jdbc") \
    .option("driver", "com.tigergraph.jdbc.Driver") \
    .option("url", "jdbc:tg:http://127.0.0.1:14240") \
    .option("user", "tigergraph") \
    .option("password", "tigergraph") \
    .option("graph", "AMLSim") \
    .option("dbtable", "query selectAccountTx(acct=9934)") \
    .option("debug", "0") \
    .load()

jdbcDF3.show()

Finally, to run queries, the dbtable is query followed by the name of the installed query and the query’s parameters. In the example, the query selectAccountTx is run to find all transactions to and from the account with id 9934.

Query results
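If you run several installed queries, formatting the dbtable string by hand gets error-prone. A minimal sketch of a formatter; query_dbtable is a hypothetical helper of my own, not part of the TigerGraph driver:

```python
def query_dbtable(name, **params):
    # Format an installed query and its parameters into the dbtable
    # string the driver expects, e.g. "query selectAccountTx(acct=9934)".
    args = ",".join(f"{key}={value}" for key, value in params.items())
    return f"query {name}({args})"

print(query_dbtable("selectAccountTx", acct=9934))
# query selectAccountTx(acct=9934)
```

The returned string can then be passed directly to .option("dbtable", ...) in the read above.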

Part V: Resources and Next Steps

Congrats on completing this blog! If you made it this far, you can now interact with TigerGraph via Spark. To find the full code, check out the GitHub repository here.

In addition, check out the TigerGraph-Spark connector documentation here.

Finally, check out the introduction to Spark tutorials here.

From here, you can continue to explore pySpark and TigerGraph to create more complex projects. For example, integrating Spark's MLlib with TigerGraph's Graph Data Science algorithms to detect fraudulent transactions could be an awesome way to couple the strengths of both tools.

Good luck exploring and building projects! If you have any questions or simply want to chat with other developers or show off some of your projects, feel free to join the TigerGraph Discourse and Discord!

Subscribe to DDIntel Here.

Join our network here: https://datadriveninvestor.com/collaborate
