Description
Submission Instructions
It is recommended that you complete this exercises in Python 3 and submit your solutions as a Jupyter notebook.
You may use any other language, as long as you include a README with simple, clear instructions on how to run (and if necessary compile) your code.
Please upload all les (code, README, written answers, etc.) to blackboard in a single zip le named ff irstnameg flastnameg DS5230=DS4220 HW 1:zip.
Before you start coding
Please follow the instruction from the attached le called
Instructions on pySpark.pdf
in order to either setup your local machine or login onto class server so that you can use jupyter with pyspark for exercise 4.
Exercise 1 : Probability Basics

Let X and Y be two independent random variables with densities p(x) and p(y), respectively. Show the following two properties:

E_{p(x;y)}[X + aY ] = E_{p(x)}[X] + aE_{p(y)}[Y ]
(1)
Var_{p(x;y)}[X + aY ] = Var_{p(x)}[X] + a^{2}Var_{p(y)}[Y ]
(2)
for any scalar constant a 2 R. Hint: use the de nition of expectation and variance,

E_{p(x)}[X] = ^{Z}_{x} p(x)xdx
(3)
var_{p(x)}[X] = E_{p(x)}[X^{2}] E_{p}^{2}
(x)^{[X]}
(4)
2. Let X be a random variable with Beta distribution,

p(x; ; ) =
_{x} 1
(1 x)
1
(5)
B(;)
1
where B( ; ) is beta function. Prove that

E[X] =
(6)
+
var[X] =
(7)
(+)^{2}(++1)

Suppose we observe N i.i.d data points D = fx_{1}; x_{2}; :::; x_{N} g, where each x_{n} 2 f1; 2; :::; Kg is a random variable with categorical (discrete) distribution parameterized by = ( _{1}; _{2}; :::; _{K} ), i.e.,

x_{n} Cat( _{1}; _{2}; :::; _{K} ); n = 1; 2; :::; N
(8)
In detail, this distribution means that for a speci c n, the random variable x_{n} follows P (x_{n} = k) = _{k}, k = 1; 2; :::; K.
Equivalently, we can also write the density function of a categorical distribution as

K
p(x_{n}) =
k^{Y}
_{k}I[xn=k]
(9)
=1
where I[x_{n} = k] is called identity function, and de ned as
I[x_{n} = k] =
1;
if x_{n} = k
(10)
0;
otherwise

Now we want to prove that the joint distribution of multiple i.i.d categorical variables is a multinomial distribution. Show that the density function of D = fx_{1}; x_{2}; :::; x_{N} g is

K
k^{Y}
p(Dj ) =_{k}^{N}^{k}
(11)
=1
PN
where N_{k} = _{n}_{=1} I[x_{n} = k] is the number of random variables belonging to category k. In other word, D = fx_{1}; x_{2}; :::; x_{N} g follows a multinomial distribution.

We often call p(Dj ) likelihood function, since it indicates the possibility we observe this dataset given the model parameters . By Bayes rule, we can rewrite the posterior as

p( D) =
p(Dj )p( )
(12)
j
p(D)
where p( ) is piror distribution which indicates our preknowledge about the model parameters. And p(D) is the distribution of the observations (data), which is constant w.r.t. posterior. Thus we can write

p( jD) / p(Dj )p( )
(13)
2
If we assume the Dirichlet prior on , i.e.,

1
K
p( ; _{1}; _{2}; :::; _{K} ) = Dir( ; _{1}; _{2}; :::; _{K} ) =
k^{Y}
(14)
B( )
_{k} k ^{1}
=1
where B( ) is Beta function and = ( _{1}; _{2}; :::; _{K} ).
Now try to derive the joint distribution p(D; ) and ignore the constant term w.r.t. . Show that the posterior is actually also Dirichlet and parameterized as follows:

p( jD) = Dir( ; _{1} + N_{1}; _{2} + N_{2}; :::; _{K} + N_{K} )
(15)
[In fact, this nice property is called conjugacy in machine learning. A general statement is : If the prior distribution is conjuagate to the likelihood, then the posterior will be the same distribution as the prior distribution. Search conjugate prior and exponential family for more detail if you are interested.]
Exercise 2 : Exploratory Analysis and Data Visualization
In this exercise, we will be looking at a public citation dataset from Aminer (https: //aminer.org/), a free online service used to index and search academic social networks. You will perform some exploratory analysis and data visualization for this dataset. The dataset is up to year 2012 and can be downloaded in the attached le called q2 dataset.txt. The data format is documented in the readme.txt le. On the server, both dataset and readme le are put in the directory
/home/yourusername/assignment dataset/homework1/.
P.S.: ArnetMiner public citation dataset is a real world dataset containing lots of noise. For example, you may see venue name like \The Truth About Managing People…And Nothing But the Truth”. However, you are not required to do data cleaning in this phase.

Count the number of distinct authors, publication venues (conferences and journals), and publications in the dataset.


List the each of the counts.



Are these numbers likely to be accurate? As an example look up all the publications venue names associated with the conference \Principles and Practice of Knowledge Discovery in Databases”^{1}.



In the example above a single venue is listed under multiple names. What additional problem arises when you try to determine the number of distinct authors in a dataset?

^{1}https://en.wikipedia.org/wiki/ECML_PKDD
3


For each author, construct the list of publications. Plot a histogram of the number of publications per author (use a logarithmic scale on the y axis).



Calculate the mean and standard deviation of the number of publications per author. Also calculate the Q_{1} (1st quartile^{2}), Q_{2} (2nd quartile, or median) and Q_{3} (3rd quartile) values. Compare the median to the mean and explain the di erence between the two values based on the standard deviation and the 1st and 3rd quartiles.



Now construct a list of publications for each venue. Plot a histogram of the number of publications per venue. Also calculate the mean, standard deviation, median, Q_{1} and Q_{3} values. What is the venue with the largest number of publications in the dataset?


Now construct the list of references (that is, the cited publications) for each publication. Then in turn use this set to calculate the number of citations for each publication (that is, the number of times a publication is cited).


Plot a histogram of the number of references and citations per publication. What is the publication with the largest number of references? What is the publication with the largest number of citations?



Calculate the so called impact factor for each venue. To do so, calculate the total number of citations for the publications in the venue, and then divide this number by the number of publications for the venue. Plot a histogram of the results.



What is the venue with the highest apparent impact factor? Do you believe this number?



Now repeat the calculation from item b., but restrict the calculation to venues with at least 10 publications. How does your histogram change? List the citation counts for all publications from the venue with the highest impact factor. How does the impact factor (mean number of citations) compare to the median number of citations?



Finally, construct a list of publications for each publication year. Use this list to plot the average number of references and average number of citations per publication as a function of time. Explain the di erences you see in the trends.

Exercise 3 : Understanding Apriori and FP growth

Consider a dataset for frequent set mining as in the following table where we have 6 binary features and each row represents a transaction.
^{2}https://en.wikipedia.org/wiki/Quartile
4

Items
1
fc,eg
2
fb,c,d,fg
3
fa,eg
4
fa,b,cg
5
fdg
6
fa,d,fg
7
fc,d,e,fg
8
fa,c,eg
9
fa,dg
10
fb,c,fg


Illustrate the rst three passes of the Apriori algorithm (set sizes 1, 2 and 3) for support threshold of 3 transactions. For each stage, list the candidate sets C_{k} and the frequent sets L_{k}. What are the maximal frequent sets discovered in the rst 3 levels?



Pick one of the maximal sets and check if any of its subsets are association rules with frequency at least 0:3 and con dence at least 0:6. Please explain your answer and show your work.


Given the following transaction database, let the min support = 2, answer the following questions.

TID
Items
1
fa,b,eg
2
fa,b,c,dg
3
fa,c,dg
4
fa,c,eg
5
fb,c,fg
6
fag
7
fa,b,cg
8
fb,d,eg
9
fa,cg
10
fa,b,d,eg

Construct FPtree from the transaction datasbase and draw it here.

Show d’s conditional pattern base (projected database), d’s conditional FPtree, and nd frequent patterns based on d’s conditional FPtree.
Exercise 4 : Market Basket Analysis of Academic Communities
In this problem, you will try to apply frequent pattern mining techniques to the real world bibliographic dataset from Aminer (https://aminer.org/). One thing worth noting is that you are required consider the whole dataset, instead of running with
5
part of the dataset. You may use any Apriori or FPgrowth implementation that is made available in existing libraries. We recommend that you start with Spark (http://spark.apache.org/).
If you use the class server to do exercise 4:
=====================================================
Due to some mismatch of python version between the master node and worker node in pyspark, you would encounter an error when initialing a pyspark object,i.e., using the method pyspark:sparkContext(). The current solution to this issue I found so far, is to add the following two lines of code to your script before you initialize pyspark object :
import os
os.environ[’PYSPARK_PYTHON’] = ’/opt/anaconda3/bin/python3’
I have tried all kinds of ways to adding env variable to the PATH, but unfortunately it still cannot be solved on system level. I am very sorry for the inconvenience.
=====================================================

The dataset included with this problem is q4 dataset.txt. Parse this data, and comment on how it di ers from the previous le (q2 dataset.txt), in terms of number of publications, authors, venues, references, and years of publication.On the server, both dataset and readme le are put in the directory /home/yourusername/assignment dataset/homework1/.

Coauthor discovery: Please use FPGrowth to analyze coauthor relationships, treating each paper as a basket of authors.



What happens when you successively decrease the support threshold using the values f1e4, 1e5, 0.5e5, 1e6g?





Keep threshold = 0.5e5 and report the top 5 coauthors for these researchers: Rakesh Agrawal, Jiawei Han, Zoubin Ghahramani and Christos Faloutsos according to frequency.



Academic community discovery: In computer science communities tend to organize around conferences. Here are 5 key conferences for areas of data science
Machine learning: NIPS (Neural Information Processing Systems)
Data mining: KDD (Conference on Knowledge Discovery and Data Mining)
Databases: VLDB (Very Large Data Bases)
Computer networks: INFOCOM (International Conference on Computer Communications)
6
Natural language processing: ACL (Association for Computational Linguistics)

We will now use FP growth to analyze academic communities. To do so, represent each author as a basket in which the items are the venues in which the author has at least one publication. What happens as you decrease the support threshold using values f1e3, 0.4e3, 1e4g?

Keep the threshold=0.4e3 and report results. For each area, based on seed conferences please rank the top 10 venues that authors also publish in.
7