Data munging cheat sheet

Basic data munging operations: structured data

This page is under development

| Operation | Python pandas | PySpark RDD | PySpark DataFrame | R dplyr | Revo R dplyrXdf |
|---|---|---|---|---|---|
| subset columns | df.colname, df['colname'] | rdd.map() | df.select('col1', 'col2', ...) | select(df, col1, col2, ...) | |
| new columns | df['newcolumn'] = ... | rdd.map(function) | df.withColumn('newcol', content) | mutate(df, col1=col2+col3, col4=col5^2, ...) | |
| subset rows | df[1:10], df.loc['rowname':] | rdd.filter(function or boolean vector), rdd.subtract() | | filter() | |
| sample rows | | rdd.sample() | | | |
| order rows | df.sort('col1') | | | arrange() | |
| group & aggregate | df.sum(axis=0), df.groupby(['A', 'B']).agg([np.mean, np.std]) | rdd.count(), rdd.countByValue(), rdd.reduce(), rdd.reduceByKey(), rdd.aggregate() | df.groupBy('col1', 'col2').count().show() | group_by(df, var1, var2, ...) %>% summarise(col=func(var3), col2=func(var4), ...) | rxSummary(formula, df) or group_by() %>% summarise() |
| peek at data | df.head() | rdd.take(5) | df.show(5) | first(), last() | |
| quick statistics | df.describe() | | df.describe() | summary() | rxGetVarInfo() |
| schema or structure | | | df.printSchema() | | |

...and there's always SQL
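For example, a Spark DataFrame can be queried with SQL directly; a minimal sketch, assuming a SparkSession named spark and a DataFrame df with an age column (both made up for illustration):

# expose the DataFrame to the SQL engine, then query it
df.createOrReplaceTempView('df_table')
spark.sql('SELECT age, COUNT(*) AS n FROM df_table WHERE age > 30 GROUP BY age').show()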


Syntax examples

Python pandas

# TBC
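A minimal sketch of the pandas equivalents from the table above; the DataFrame df and its columns are made up for illustration:

import pandas as pd
import numpy as np

df = pd.DataFrame({'A': ['x', 'x', 'y'], 'B': [1, 2, 3], 'C': [4.0, 5.0, 6.0]})

df[['A', 'B']]                           # subset columns
df['D'] = df['B'] + df['C']              # new column
df[df['B'] > 1]                          # subset rows (boolean mask)
df.sample(n=2)                           # sample rows
df.sort_values('B')                      # order rows
df.groupby('A').agg([np.mean, np.std])   # group & aggregate
df.head()                                # peek at data
df.describe()                            # quick statistics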

PySpark RDDs & DataFrames

RDDs

Transformations return pointers to new RDDs; they are lazily evaluated.

Actions return values to the driver and trigger the actual computation.
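A quick sketch of the difference, assuming an active SparkContext sc:

rdd = sc.parallelize(range(10))
squared = rdd.map(lambda x: x * x)   # transformation: returns a new RDD, nothing is computed yet
squared.collect()                    # action: runs the job and returns [0, 1, 4, ...]
squared.count()                      # action: returns 10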

A reminder: how lambda functions, map, reduce and filter work

### lambda functions are shorthand anonymous functions
list(map(lambda x: x * x, range(10)))   # wrap in list() in Python 3, where map returns an iterator
# is the equivalent of:
def map_squares(nums):
    res = []
    for x in nums:
        res.append( x * x )
    return res
map_squares(range(10))

### map - apply function to each element
## in python
numbers = [1,2,3,4]

def square(x):
    return x*x

results = list(map(square, numbers))   # list() needed in Python 3

## in PySpark
nums = sc.parallelize([1, 2, 3, 4])
nums.map(lambda x: x * x).collect()


### reduce = accumulate elements via function
## in python
from functools import reduce   # in Python 3, reduce lives in functools
reduce(lambda x, y: x + y, numbers)

## in PySpark
nums.reduce(lambda x,y: x+y)
# there's also reduceByKey & countByKey, which work on an RDD of (key, value) pairs
rdd.reduceByKey(lambda x, y: x + y).collect()

### filter = keep only the items for which the function returns True
## in Python
list(filter(lambda x: x > 2, numbers))   # list() needed in Python 3

## in PySpark
nums.filter(lambda x: x > 2).collect()

# another example: drop lines starting with 'x' (e.g. a header row)
lines = lines.filter(lambda x: x.find('x') != 0)

Partitions: check the current number with rdd.getNumPartitions(); set it when creating an RDD with sc.parallelize(data, 500) or sc.textFile('file.csv', 500); change it later with rdd.repartition(500).
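A short sketch of how these fit together, assuming an active SparkContext sc and a hypothetical file.csv:

rdd = sc.textFile('file.csv', 500)   # hint at a minimum of roughly 500 input partitions
rdd.getNumPartitions()               # check how many partitions were actually created
rdd = rdd.repartition(50)            # shuffle the data into 50 partitions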

Additional functions for DataFrames

If you want to use an RDD method on a DataFrame, you can often call it via df.rdd.function().
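For instance, a minimal sketch (df and its age column are made up for illustration):

df.rdd.getNumPartitions()                        # RDD method reached through df.rdd
df.rdd.map(lambda row: row['age'] + 1).take(5)   # elements of df.rdd are Row objects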


Miscellaneous examples of chained data munging:

## examples of chaining
# filter rows
(adultDataFrame
    .select("workclass", "age", "education", "occupation", "income")
    .filter(adultDataFrame["age"] > 30)
    .show())

# group by & sort
(adultDataFrame
    .groupBy("income", "occupation")
    .count()
    .sort("occupation")
    .show())

# Word count
import re

words = sc.textFile('hamlet.txt')\
        .flatMap(lambda line: re.split(r'\W+', line.lower().strip()))\
        .filter(lambda x: len(x) > 2)\
        .map(lambda word: (word, 1))\
        .reduceByKey(lambda a, b: a + b)\
        .map(lambda x: (x[1], x[0]))\
        .sortByKey(False)

words.take(10)   # action: (count, word) pairs for the ten most frequent words

Further resources


R dplyr

Copy in examples

The 5 verbs: select, filter, mutate, arrange and summarise.

# select
select(hflights, UniqueCarrier, ArrDelay)   # keep just these columns (illustrative)


# summarise
summarise(hflights, mean_delay = mean(ArrDelay, na.rm = TRUE))   # collapse to a single summary row (illustrative)


# using pipes to chain transformations
hflights %>%
  group_by(UniqueCarrier) %>%
  filter(!is.na(ArrDelay)) %>%
  summarise(p_delay = mean(ArrDelay > 0)) %>%
  mutate(rank = rank(p_delay)) %>%
  arrange(rank)

Additional functions in dplyr


Revo R dplyrXdf

Notes:

Manipulation with dplyrXdf uses the same verbs as dplyr (select, filter, mutate, arrange, group_by, summarise), applied to xdf files instead of in-memory data frames.


# summary aggregations
rxSummary(~default + balance, data = BankDS)

# pipe notation works fine
bankSmry <- group_by(bankXdf, marital, housing) %>%
  summarise(mean_age=mean(age), mean_balance=mean(balance))

Further resources
