An Introduction to pySpark and TigerGraph
Interacting with TigerGraph’s AMLSim Fraud Detection Graph with pySpark
Introduction
Preface
Note: You can find a similar blog post on integrating Spark and TigerGraph here. This blog will walk through similar content but will use pySpark instead of Spark. The setup—most of Part I and all of Part II—will be the same, so you can either skip or briefly skim those sections if you read the other blog.
Opening
Apache Spark is a data processing engine known for its ability to process large datasets and distribute processing tasks across multiple computers. PySpark is the Python API for Spark, allowing Python users to manipulate Spark data entirely in Python. TigerGraph is an enterprise-scale graph database. Together, PySpark's processing power and TigerGraph's graph database can be used to manipulate and analyze big data and make big discoveries.
This tutorial will walk through the basics of getting started with both pySpark and TigerGraph and using TigerGraph’s JDBC driver to connect the two.
Tools
- Scala (v2.12.15) → Spark is written in Scala, so Scala will need to be downloaded.
- Java (v16.0.1) → Java will be used to run the JDBC driver and may already be installed on your computer.
- TigerGraph On-Premise (v3.6.0) → TigerGraph on-premise will be used to run and host the database.
- TigerGraph JDBC Driver (v1.3.0) → This will connect Spark with TigerGraph.
- Apache Spark (v3.2.1) → The primary data processing software used.
- Python3 (v3.8.9) → The primary language used for this project is Python.
- Pip (v22.1.2) → Pip is a Python package installer and an alternate method to install pySpark.
- pySpark (v3.2.1) → This is the Python API for Spark that will be used to interact with TigerGraph.
- Docker Desktop (v4.2.0.5815) → Docker will be used to run TigerGraph on-premise.
Code
The full code for this project can be found here.
Outline
- Installation
- Create the AML Sim Graph
- Create the Project
- Read TigerGraph Data with PySpark
- Resources and Next Steps
Part I: Installation
Spark
To start off, we will download Spark and Scala using Homebrew, a package manager for macOS and Linux. Spark is the data manipulation software we will use, and Spark itself is built on Scala.
brew install scala && brew install apache-spark
To verify Spark and Scala were downloaded properly, open the Spark shell with:
spark-shell
If the shell runs without any errors, Spark is successfully downloaded!
PySpark
PySpark should be installed automatically alongside Spark. To verify it is installed, run pyspark to open the Python Spark shell.
pyspark
Note: Notice that, unlike the previous shell, this Python shell's prompt is >>> instead of scala>.
If it is not downloaded, pySpark can also be downloaded via pip. To do so, create a virtual environment and pip install pySpark.
python3 -m venv venv
source venv/bin/activate
pip install pyspark
JDBC Driver
Download the latest JDBC driver from Maven.
At the time of publication, the most recent driver is 1.3.0, so that is the version used in this blog. Hold on to the downloaded tigergraph-jdbc-driver-1.3.0.jar file; we will use it when creating the project.
TigerGraph On-Premise
TigerGraph can be downloaded via Docker. The full instructions to download it can be found here.
In summary, start Docker on your computer (by downloading the Desktop app and running the application). Next, create a folder on your machine to hold the Docker data.
mkdir data
chmod 777 data
Finally, run the TigerGraph Docker image.
docker run -d -p 14022:22 -p 9000:9000 -p 14240:14240 --name tigergraph --ulimit nofile=1000000:1000000 -v ~/data:/home/tigergraph/mydata -t docker.tigergraph.com/tigergraph:latest
To enter the TigerGraph shell, you can ssh into the solution.
ssh -p 14022 tigergraph@localhost
At the password prompt, enter tigergraph.
Perfect! Your TigerGraph box is ready to load a graph!
Part II: Create the AML Sim Graph
Set Up
To begin, run gadmin start all
in the TigerGraph shell. This will start all TigerGraph services.
gadmin start all
The graph this tutorial will load is the AML Sim Graph. To start off, clone the repository.
git clone https://github.com/TigerGraph-DevLabs/AMLSim_Python_Lab.git
cd AMLSim_Python_Lab
This repository contains the GSQL files needed to create the graph.
Schema
To build the schema, run schema.gsql.
gsql db_scripts/schema/schema.gsql
This will create the AMLSim graph.
Loading Jobs
Next, to load the data, run the three loading files.
cd db_scripts/load
gsql load_job_accounts.gsql
gsql load_job_alerts.gsql
gsql load_job_transactions.gsql
cd ../..
This will load the CSV data in the data
folder to the graph.
Install Queries
Finally, install queries to the graph database. This tutorial will only walk through an example of running one query, but you may install as many as you would like.
gsql db_scripts/query/selectAccountTx.gsql
Exploring in GraphStudio
To view the AMLSim graph visually, you can open http://localhost:14240/ to view GraphStudio. There, you can access the schema, view the stats of the loaded data, and run queries.
Part III: Create the Project
Project Structure
Unlike Scala and Java projects, Python Spark projects do not need a specific directory structure. All of the files for this project will be located in the same folder.
index.py
tigergraph-jdbc-driver-1.3.0.jar
Note: The version and naming of the .jar driver can change; just make sure it is the file downloaded from Maven.
index.py
Start index.py
by importing SparkSession from pySpark.
from pyspark.sql import SparkSession
Next, set the spark
variable using SparkSession.builder
.
spark = SparkSession.builder \
    .appName("TigerGraphAnalysis") \
    .config("spark.driver.extraClassPath", "/usr/local/Cellar/apache-spark/3.2.1/libexec/jars/*:tigergraph-jdbc-driver-1.3.0.jar") \
    .getOrCreate()
The appName is simply the name of the project (in my case, "TigerGraphAnalysis"). The .config option sets the driver classpath, which combines two paths: the directory of Spark's own jars and the path to the TigerGraph JDBC driver. If you installed Spark via Homebrew, the jars directory will look like the one above, /usr/local/Cellar/apache-spark/3.2.1/libexec/jars/*. If the driver is in the same folder as the file, you can simply use the filename, tigergraph-jdbc-driver-1.3.0.jar. Finally, join the two paths with a colon and pass the result to the config: /usr/local/Cellar/apache-spark/3.2.1/libexec/jars/*:tigergraph-jdbc-driver-1.3.0.jar.
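As a sanity check, the classpath join can be reproduced in plain Python before passing it to the builder. This is only an illustration; the paths should match your own Spark install and driver location:

```python
# Illustrative paths; adjust to your Spark install and driver location.
spark_jars = "/usr/local/Cellar/apache-spark/3.2.1/libexec/jars/*"
driver_jar = "tigergraph-jdbc-driver-1.3.0.jar"

# On macOS and Linux the JVM classpath separator is a colon.
extra_classpath = spark_jars + ":" + driver_jar
print(extra_classpath)
```

The resulting string is exactly the value handed to spark.driver.extraClassPath above.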
Run the Project
To run the project, spark-submit
must be used to run the file and the driver must be specified with the --jars
flag.
spark-submit --jars tigergraph-jdbc-driver-1.3.0.jar index.py
With that, we are set to begin interacting with the AMLSim graph!
Part IV: Read TigerGraph Data with PySpark
In general, pySpark’s syntax is similar to Spark’s syntax.
Read Vertices
jdbcDF1 = spark.read \
    .format("jdbc") \
    .option("driver", "com.tigergraph.jdbc.Driver") \
    .option("url", "jdbc:tg:http://127.0.0.1:14240") \
    .option("user", "tigergraph") \
    .option("password", "tigergraph") \
    .option("graph", "AMLSim") \
    .option("dbtable", "vertex Transaction") \
    .option("limit", "1000") \
    .option("debug", "0") \
    .load()

jdbcDF1.show()
A vertex read consists of options specifying the JDBC driver and options specifying the graph's connection details and credentials. To select a vertex type, dbtable must be the word vertex followed by the vertex type, in this case Transaction. Finally, limit caps the number of vertices returned.
Read Edges
jdbcDF2 = spark.read \
    .format("jdbc") \
    .option("driver", "com.tigergraph.jdbc.Driver") \
    .option("url", "jdbc:tg:http://127.0.0.1:14240") \
    .option("user", "tigergraph") \
    .option("password", "tigergraph") \
    .option("graph", "AMLSim") \
    .option("dbtable", "edge RECEIVE_TRANSACTION") \
    .option("limit", "1000") \
    .option("source", "9934") \
    .option("debug", "0") \
    .load()

jdbcDF2.show()
The function to read edges is similar. Here, dbtable is the word edge followed by the edge type, in this example RECEIVE_TRANSACTION. Once again, limit caps the number of edges returned. Finally, source is the id of the edge's source vertex; in this example, the source is the Transaction vertex with id 9934.
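The vertex and edge reads share most of their options, so it can be convenient to collect them in a small helper. This is a sketch with a hypothetical function name, reusing the same connection details and credentials as the examples above:

```python
def tg_read_options(graph, dbtable, limit=None, source=None):
    """Assemble the option dict for a TigerGraph JDBC read.

    `dbtable` is e.g. "vertex Transaction" or "edge RECEIVE_TRANSACTION";
    `source` only applies to edge reads.
    """
    opts = {
        "driver": "com.tigergraph.jdbc.Driver",
        "url": "jdbc:tg:http://127.0.0.1:14240",
        "user": "tigergraph",
        "password": "tigergraph",
        "graph": graph,
        "dbtable": dbtable,
        "debug": "0",
    }
    # The driver expects all option values as strings.
    if limit is not None:
        opts["limit"] = str(limit)
    if source is not None:
        opts["source"] = str(source)
    return opts
```

With this helper, the edge read above becomes spark.read.format("jdbc").options(**tg_read_options("AMLSim", "edge RECEIVE_TRANSACTION", limit=1000, source=9934)).load().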
There is only one edge connected to Transaction 9934.
Run Queries
jdbcDF3 = spark.read \
    .format("jdbc") \
    .option("driver", "com.tigergraph.jdbc.Driver") \
    .option("url", "jdbc:tg:http://127.0.0.1:14240") \
    .option("user", "tigergraph") \
    .option("password", "tigergraph") \
    .option("graph", "AMLSim") \
    .option("dbtable", "query selectAccountTx(acct=9934)") \
    .option("debug", "0") \
    .load()

jdbcDF3.show()
Finally, to run an installed query, dbtable is the word query followed by the query's name and parameters. In this example, the query selectAccountTx finds all transactions to and from the account with id 9934.
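Since the dbtable value for a query is just the word query plus a call expression, parameterized calls can be assembled with a small helper (a hypothetical convenience, not part of the driver):

```python
def query_dbtable(name, **params):
    """Build the 'dbtable' value for running an installed GSQL query,
    e.g. query_dbtable("selectAccountTx", acct=9934)."""
    args = ",".join(f"{key}={value}" for key, value in params.items())
    return f"query {name}({args})"
```

This produces the exact string passed to .option("dbtable", ...) in the example above.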
Part V: Resources and Next Steps
Congrats on completing this blog! If you made it this far, you can now interact with TigerGraph via Spark. To find the full code, check out the GitHub repository here.
In addition, check out the TigerGraph-Spark connector documentation here.
Finally, check out the introduction to Spark tutorials here.
From here, you can continue to explore pySpark and TigerGraph to create more complex projects. For example, integrating Spark's MLlib with TigerGraph's Graph Data Science algorithms to detect fraudulent transactions could be an awesome way to couple the strengths of both tools.
Good luck exploring and building projects! If you have any questions or simply want to chat with other developers or show off some of your projects, feel free to join the TigerGraph Discourse and Discord!