An Introduction to pySpark and TigerGraph
Interacting with TigerGraph’s AMLSim Fraud Detection Graph with pySpark
Introduction
Preface
Note: You can find a similar blog post on integrating Spark and TigerGraph here. This blog will walk through similar content but will use pySpark instead of Spark. The setup—most of Part I and all of Part II—will be the same, so you can either skip or briefly skim those sections if you read the other blog.
Opening
Apache Spark is a data processing engine known for its ability to process large datasets and distribute processing tasks across multiple computers. PySpark is the Python API for Spark, allowing Python users to manipulate Spark data entirely in Python. TigerGraph is an enterprise-scale graph database. Together, PySpark's processing power and TigerGraph's graph database can be used to manipulate and analyze big data and make big discoveries.
This tutorial will walk through the basics of getting started with both pySpark and TigerGraph and using TigerGraph’s JDBC driver to connect the two.
Tools
- Scala (v2.12.15) → Spark is written in Scala, so Scala will need to be downloaded.
- Java (v16.0.1) → Java will be used to run the JDBC driver and may already be installed on your computer.
- TigerGraph On-Premise (v3.6.0) → TigerGraph on-premise will be used to run and host the database.
- TigerGraph JDBC Driver (v1.3.0) → This will connect Spark with TigerGraph.
- Apache Spark (v3.2.1) → The primary data processing software used.
- Python3 (v3.8.9) → The primary language used for this project is Python.
- Pip (v22.1.2) → Pip is a Python package installer and an alternate method to install pySpark.
- pySpark (v3.2.1) → This is the Python API for Spark that will be used to interact with TigerGraph.
- Docker Desktop (v4.2.0.5815) → Docker will be used to run TigerGraph on-premise.
Code
The full code for this project can be found here.
Outline
- Installation
- Create the AML Sim Graph
- Create the Project
- Read TigerGraph Data with PySpark
- Resources and Next Steps
Part I: Installation
Spark
To start off, we will download Spark and Scala using Homebrew, a package manager for macOS and Linux. Spark is the data manipulation software we will use, and Spark itself is built on Scala.
brew install scala && brew install apache-spark
To verify Spark and Scala were downloaded properly, open the Spark shell with:
spark-shell
If the shell runs without any errors, Spark is successfully downloaded!
PySpark
PySpark should be installed automatically alongside Spark. To verify it is installed, run pyspark to open the Python Spark shell.
pyspark
Note: Notice that, unlike the previous shell, this Python shell's prompt is >>> instead of scala>.
If it is not downloaded, pySpark can also be downloaded via pip. To do so, create a virtual environment and pip install pySpark.
python3 -m venv venv
source venv/bin/activate
pip install pyspark
JDBC Driver
Download the latest JDBC driver from Maven.
At the time of publication, the most recent driver is 1.3.0, so that is the version used in this blog. Hold on to the downloaded tigergraph-jdbc-driver-1.3.0.jar file; we will use it when creating the project.
TigerGraph On-Premise
TigerGraph can be downloaded via Docker. The full instructions to download it can be found here.
In summary, start Docker on your computer (by downloading the Desktop app and running the application). Next, create a folder on your machine to hold the Docker data.
mkdir data
chmod 777 data
Finally, run the TigerGraph Docker image.
docker run -d -p 14022:22 -p 9000:9000 -p 14240:14240 --name tigergraph --ulimit nofile=1000000:1000000 -v ~/data:/home/tigergraph/mydata -t docker.tigergraph.com/tigergraph:latest
To enter the TigerGraph shell, you can ssh into the solution.
ssh -p 14022 tigergraph@localhost
At the password prompt, enter tigergraph.
Perfect! Your TigerGraph box is ready to load a graph!
Part II: Create the AML Sim Graph
Set Up
To begin, run gadmin start all
in the TigerGraph shell. This will start all TigerGraph services.
gadmin start all
The graph this tutorial will load is the AML Sim Graph. To start off, clone the repository.
git clone https://github.com/TigerGraph-DevLabs/AMLSim_Python_Lab.git
cd AMLSim_Python_Lab
This repository contains the GSQL files needed to create the graph.
Schema
To build the schema, run schema.gsql.
gsql db_scripts/schema/schema.gsql
This will create the AMLSim graph.
Loading Jobs
Next, to load the data, run the three loading files.
cd db_scripts/load
gsql load_job_accounts.gsql
gsql load_job_alerts.gsql
gsql load_job_transactions.gsql
cd ../..
This will load the CSV data in the data
folder to the graph.
Install Queries
Finally, install queries to the graph database. This tutorial will only walk through an example of running one query, but you may install as many as you would like.
gsql db_scripts/query/selectAccountTx.gsql
Exploring in GraphStudio
To view the AMLSim graph visually, you can open http://localhost:14240/ to view GraphStudio. There, you can access the schema, view the stats of the loaded data, and run queries.
Part III: Create the Project
Project Structure
Unlike Scala and Java projects, Python Spark projects do not need a specific directory structure. All of the files for this project will be located in the same folder.
index.py
tigergraph-jdbc-driver-1.3.0.jar
Note: The version and naming of the .jar driver can change; just make sure it is the file downloaded from Maven.
index.py
Start index.py
by importing SparkSession from pySpark.
from pyspark.sql import SparkSession
Next, set the spark
variable using SparkSession.builder
.
spark = SparkSession.builder \
    .appName("TigerGraphAnalysis") \
    .config("spark.driver.extraClassPath", "/usr/local/Cellar/apache-spark/3.2.1/libexec/jars/*:tigergraph-jdbc-driver-1.3.0.jar") \
    .getOrCreate()
The appName is simply the name of the project (in my case, "TigerGraphAnalysis"). The .config option sets the driver classpath, which combines two paths: the directory of Spark's own jars and the path to the TigerGraph JDBC driver. If you installed Spark via Homebrew, the jars directory will look like the one above, /usr/local/Cellar/apache-spark/3.2.1/libexec/jars/*. If the driver is in the same folder as the file, you can simply use the filename, tigergraph-jdbc-driver-1.3.0.jar. Finally, join the two paths with a colon and pass the result to the config: /usr/local/Cellar/apache-spark/3.2.1/libexec/jars/*:tigergraph-jdbc-driver-1.3.0.jar.
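As a sanity check, the classpath join can be reproduced in plain Python before passing it to the builder. This is only an illustration; the paths should match your own Spark install and driver location:

```python
# Illustrative paths; adjust to your Spark install and driver location.
spark_jars = "/usr/local/Cellar/apache-spark/3.2.1/libexec/jars/*"
driver_jar = "tigergraph-jdbc-driver-1.3.0.jar"

# On macOS and Linux the JVM classpath separator is a colon.
extra_classpath = spark_jars + ":" + driver_jar
print(extra_classpath)
```

The resulting string is exactly the value handed to spark.driver.extraClassPath above.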
Run the Project
To run the project, spark-submit
must be used to run the file and the driver must be specified with the --jars
flag.
spark-submit --jars tigergraph-jdbc-driver-1.3.0.jar index.py
With that, we are set to begin interacting with the AMLSim graph!
Part IV: Read TigerGraph Data with PySpark
In general, pySpark’s syntax is similar to Spark’s syntax.
Read Vertices
jdbcDF1 = spark.read \
    .format("jdbc") \
    .option("driver", "com.tigergraph.jdbc.Driver") \
    .option("url", "jdbc:tg:http://127.0.0.1:14240") \
    .option("user", "tigergraph") \
    .option("password", "tigergraph") \
    .option("graph", "AMLSim") \
    .option("dbtable", "vertex Transaction") \
    .option("limit", "1000") \
    .option("debug", "0") \
    .load()

jdbcDF1.show()
A vertex read consists of options specifying the JDBC driver and options specifying the graph's connection details and credentials. To select a vertex type, dbtable must be the word vertex followed by the vertex type, in this case Transaction. Finally, limit caps the number of vertices returned.
Read Edges
jdbcDF2 = spark.read \
    .format("jdbc") \
    .option("driver", "com.tigergraph.jdbc.Driver") \
    .option("url", "jdbc:tg:http://127.0.0.1:14240") \
    .option("user", "tigergraph") \
    .option("password", "tigergraph") \
    .option("graph", "AMLSim") \
    .option("dbtable", "edge RECEIVE_TRANSACTION") \
    .option("limit", "1000") \
    .option("source", "9934") \
    .option("debug", "0") \
    .load()

jdbcDF2.show()
The function to read edges is similar. Here, dbtable is the word edge followed by the edge type, in this example RECEIVE_TRANSACTION. Once again, limit caps the number of edges returned. Finally, source is the id of the edge's source vertex; in this example, the source is the Transaction vertex with id 9934.
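The vertex and edge reads share most of their options, so it can be convenient to collect them in a small helper. This is a sketch with a hypothetical function name, reusing the same connection details and credentials as the examples above:

```python
def tg_read_options(graph, dbtable, limit=None, source=None):
    """Assemble the option dict for a TigerGraph JDBC read.

    `dbtable` is e.g. "vertex Transaction" or "edge RECEIVE_TRANSACTION";
    `source` only applies to edge reads.
    """
    opts = {
        "driver": "com.tigergraph.jdbc.Driver",
        "url": "jdbc:tg:http://127.0.0.1:14240",
        "user": "tigergraph",
        "password": "tigergraph",
        "graph": graph,
        "dbtable": dbtable,
        "debug": "0",
    }
    # The driver expects all option values as strings.
    if limit is not None:
        opts["limit"] = str(limit)
    if source is not None:
        opts["source"] = str(source)
    return opts
```

With this helper, the edge read above becomes spark.read.format("jdbc").options(**tg_read_options("AMLSim", "edge RECEIVE_TRANSACTION", limit=1000, source=9934)).load().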
There is only one edge connected to Transaction 9934.
Run Queries
jdbcDF3 = spark.read \
    .format("jdbc") \
    .option("driver", "com.tigergraph.jdbc.Driver") \
    .option("url", "jdbc:tg:http://127.0.0.1:14240") \
    .option("user", "tigergraph") \
    .option("password", "tigergraph") \
    .option("graph", "AMLSim") \
    .option("dbtable", "query selectAccountTx(acct=9934)") \
    .option("debug", "0") \
    .load()

jdbcDF3.show()
Finally, to run an installed query, dbtable is the word query followed by the query's name and parameters. In this example, the query selectAccountTx finds all transactions to and from the account with id 9934.
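Since the dbtable value for a query is just the word query plus a call expression, parameterized calls can be assembled with a small helper (a hypothetical convenience, not part of the driver):

```python
def query_dbtable(name, **params):
    """Build the 'dbtable' value for running an installed GSQL query,
    e.g. query_dbtable("selectAccountTx", acct=9934)."""
    args = ",".join(f"{key}={value}" for key, value in params.items())
    return f"query {name}({args})"
```

This produces the exact string passed to .option("dbtable", ...) in the example above.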
Part V: Resources and Next Steps
Congrats on completing this blog! If you made it this far, you can now interact with TigerGraph via Spark. To find the full code, check out the GitHub repository here.
In addition, check out the TigerGraph-Spark connector documentation here.
Finally, check out the introduction to Spark tutorials here.
From here, you can continue to explore pySpark and TigerGraph to create more complex projects. For example, integrating Spark's MLlib with TigerGraph's Graph Data Science algorithms to detect fraudulent transactions could be an awesome way to couple the strengths of both tools.
Good luck exploring and building projects! If you have any questions or simply want to chat with other developers or show off some of your projects, feel free to join the TigerGraph Discourse and Discord!