ydot
ydot is a Python API that produces PySpark dataframe models from R-like formula expressions. This project is based on patsy [pat]. As a quickstart, let’s say you have a Spark dataframe with data as follows.
| a | b | x1 | x2 | y |
|---|---|---|---|---|
| left | low | 19.945536387662504 | 3.85214120038979 | 0.0 |
| left | low | 20.674308066353493 | 4.098585619118175 | 1.0 |
| right | high | 20.346647025958433 | 2.7107604387194626 | 1.0 |
| right | mid | 18.699653829045985 | 5.2111542692543065 | 1.0 |
| left | low | 21.51851187887476 | 2.432390426907621 | 1.0 |
| right | mid | 20.989823705535017 | 3.6774523253171734 | 1.0 |
| right | high | 20.277680897136328 | 2.4873300559969604 | 0.0 |
| right | mid | 19.551410645704927 | 2.3549674965407372 | 0.0 |
| right | low | 20.96196624352397 | 3.1665930443154995 | 0.0 |
| right | mid | 19.172421360793678 | 3.562224297579924 | 1.0 |
Now, let’s say you want to model this dataset as follows.
y ~ x1 + x2 + a + b
Then all you have to do is use the smatrices() function.
```python
from ydot.spark import smatrices

# sdf is the Spark dataframe shown above
formula = 'y ~ x1 + x2 + a + b'
y, X = smatrices(formula, sdf)
```
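To make concrete what such a formula asks for, the sketch below encodes a single record by hand in plain Python. It assumes patsy's default treatment (dummy) coding, in which the first level in sorted order (`left` for `a`, `high` for `b`) is the reference level and gets no column of its own. `LEVELS` and `encode_row` are hypothetical names used only for illustration; they are not part of ydot, and this is not how ydot is implemented.

```python
# Hypothetical illustration of treatment (dummy) coding for
# 'y ~ x1 + x2 + a + b'; this is NOT ydot's implementation.
# Levels are listed in sorted order; the first one is the reference level.
LEVELS = {'a': ['left', 'right'], 'b': ['high', 'low', 'mid']}


def encode_row(row):
    """Return the design-matrix columns for one record as a dict."""
    encoded = {'Intercept': 1.0}
    for name, levels in LEVELS.items():
        # One indicator column per non-reference level.
        for level in levels[1:]:
            encoded[f'{name}[T.{level}]'] = 1.0 if row[name] == level else 0.0
    # Numeric columns pass through unchanged.
    encoded['x1'] = float(row['x1'])
    encoded['x2'] = float(row['x2'])
    return encoded


row = {'a': 'right', 'b': 'mid', 'x1': 18.699653829045985, 'x2': 5.2111542692543065}
print(encode_row(row))
```

A record with `a == 'left'` and `b == 'high'` would have every indicator column set to 0.0, since both values are reference levels.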
Observe that y and X will be Spark dataframes as specified by the formula. Here’s a more interesting example where you want a model specified up to all two-way interactions.
y ~ (x1 + x2 + a + b)**2
Then you could issue the code as below.
```python
from ydot.spark import smatrices

# sdf is the same Spark dataframe as before
formula = 'y ~ (x1 + x2 + a + b)**2'
y, X = smatrices(formula, sdf)
```
Your resulting X Spark dataframe will look like the following.
| Intercept | a[T.right] | b[T.low] | b[T.mid] | a[T.right]:b[T.low] | a[T.right]:b[T.mid] | x1 | x1:a[T.right] | x1:b[T.low] | x1:b[T.mid] | x2 | x2:a[T.right] | x2:b[T.low] | x2:b[T.mid] | x1:x2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 19.945536387662504 | 0.0 | 19.945536387662504 | 0.0 | 3.85214120038979 | 0.0 | 3.85214120038979 | 0.0 | 76.83302248278848 |
| 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 20.674308066353493 | 0.0 | 20.674308066353493 | 0.0 | 4.098585619118175 | 0.0 | 4.098585619118175 | 0.0 | 84.73542172597531 |
| 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 20.346647025958433 | 20.346647025958433 | 0.0 | 0.0 | 2.7107604387194626 | 2.7107604387194626 | 0.0 | 0.0 | 55.154885818557126 |
| 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 18.699653829045985 | 18.699653829045985 | 0.0 | 18.699653829045985 | 5.2111542692543065 | 5.2111542692543065 | 0.0 | 5.2111542692543065 | 97.44678088481062 |
| 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 21.51851187887476 | 0.0 | 21.51851187887476 | 0.0 | 2.432390426907621 | 0.0 | 2.432390426907621 | 0.0 | 52.341422295472896 |
| 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 20.989823705535017 | 20.989823705535017 | 0.0 | 20.989823705535017 | 3.6774523253171734 | 3.6774523253171734 | 0.0 | 3.6774523253171734 | 77.18907599391727 |
| 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 20.277680897136328 | 20.277680897136328 | 0.0 | 0.0 | 2.4873300559969604 | 2.4873300559969604 | 0.0 | 0.0 | 50.437285161362595 |
| 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 19.551410645704927 | 19.551410645704927 | 0.0 | 19.551410645704927 | 2.3549674965407372 | 2.3549674965407372 | 0.0 | 2.3549674965407372 | 46.04293658215565 |
| 1.0 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 20.96196624352397 | 20.96196624352397 | 20.96196624352397 | 0.0 | 3.1665930443154995 | 3.1665930443154995 | 3.1665930443154995 | 0.0 | 66.3780165019193 |
| 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 19.172421360793678 | 19.172421360793678 | 0.0 | 19.172421360793678 | 3.562224297579924 | 3.562224297579924 | 0.0 | 3.562224297579924 | 68.29646521485958 |
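The fifteen columns in this table follow from how `**2` expands a sum of factors: one intercept, every main effect, and every pairwise interaction. The plain-Python sketch below reproduces the column names under the assumption of treatment coding (one indicator column per non-reference level). `COLUMNS` and `two_way_columns` are hypothetical names for illustration only, and the column order produced here differs from the order patsy and ydot emit.

```python
from itertools import combinations

# Design columns contributed by each factor under treatment coding
# (assumption: one column per non-reference level for categorical factors).
COLUMNS = {
    'x1': ['x1'],
    'x2': ['x2'],
    'a': ['a[T.right]'],
    'b': ['b[T.low]', 'b[T.mid]'],
}


def two_way_columns(columns):
    """Expand (f1 + f2 + ...)**2 into intercept, main-effect and pairwise columns."""
    names = ['Intercept']
    # Main effects: every factor's own columns.
    for cols in columns.values():
        names.extend(cols)
    # Two-way interactions: cross the columns of every pair of distinct factors.
    for left, right in combinations(columns, 2):
        names.extend(f'{lc}:{rc}' for lc in columns[left] for rc in columns[right])
    return names


names = two_way_columns(COLUMNS)
print(len(names))  # 15
```

Each interaction column is the element-wise product of its two parent columns, which is why `x1:a[T.right]` equals `x1` on rows where `a` is `right` and 0.0 elsewhere.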
In general, what you get with patsy is what you get with ydot; however, there are exceptions. For example, the built-in functions such as standardize() and center() available with patsy will not work against Spark dataframes. Additionally, patsy allows for custom transforms, but such transforms (or user-defined functions) must be visible. For now, only numpy-based transforms are allowed against continuous variables (or numeric columns).
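Since `center()` and `standardize()` are unavailable, one possible workaround is to compute the statistics yourself and add the transformed column to the dataframe before building the formula. This is an assumption, not something ydot prescribes. The sketch below shows only the arithmetic, in plain Python, on the `x1` values from the quickstart table. `center_values` and `standardize_values` are hypothetical helpers, and the population standard deviation (ddof=0) is assumed to match patsy's default. On a real Spark dataframe the same mean and standard deviation would have to come from Spark aggregations instead.

```python
from statistics import fmean, pstdev

# x1 values copied from the quickstart table.
x1 = [
    19.945536387662504, 20.674308066353493, 20.346647025958433,
    18.699653829045985, 21.51851187887476, 20.989823705535017,
    20.277680897136328, 19.551410645704927, 20.96196624352397,
    19.172421360793678,
]


def center_values(values):
    """Subtract the mean, mirroring what a centering transform computes."""
    mu = fmean(values)
    return [v - mu for v in values]


def standardize_values(values):
    """Center, then divide by the population standard deviation (assumed ddof=0)."""
    mu = fmean(values)
    sigma = pstdev(values)
    return [(v - mu) / sigma for v in values]


centered = center_values(x1)
standardized = standardize_values(x1)
print(centered[:3])
print(standardized[:3])
```

The transformed values could then be stored under a new column name and referenced in the formula like any other numeric column.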
About
One-Off Coder is an educational, service and product company. Please visit us online to discover how we may help you achieve life-long success in your personal coding career or with your company’s business goals and objectives.
Copyright
Documentation
Software
Copyright 2020 One-Off Coder
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
Art
Copyright 2020 Daytchia Vang
Citation
@misc{oneoffcoder_ydot_2020,
title={ydot, R-like formulas for Spark Dataframes},
url={https://github.com/oneoffcoder/pyspark-formula},
author={Jee Vang},
year={2020},
month={Dec}}
Author
Jee Vang, Ph.D.