Data Management

Introduction

In this section we cover common data management tasks like concatenation, updating, appending, sorting, finding duplicates, drawing samples, and transposing. Every analysis task requires data to be organized into a specific form before it renders meaningful results.

Update

SAS UPDATE operations are used in cases where a master table containing original data values are updated with smaller transaction datasets. Transaction datasets are typically shaped the same as the master containing new values for updating the master dataset. In the case of SAS, observations from the transaction dataset not corresponding to observations in the master dataset become new observations.

Create the SAS master and trans datasets.

data master;
   input ID 
         salary;
list;
datalines;
023 45650
088 55350
099 55100
111 61625
;;;;
run;

data trans;
   infile datalines dlm=',';
   input ID
         salary
         bonus;
list;
datalines;
023, 45650, 2000
088, 61000,
099, 59100,
61625, 3000
121, 50000,
;;;;
run;

SAS UDAPTE operation creates the new_pay dataset applying transaction values from the transact dataset to the master dataset. Non-missing values for variables in the transact dataset replace corresponding values in the master dataset. A RENAME statement is used to rename the salary variable in order to display the effects of the UPDATE operation.

Updated master Dataset left

The pandas update method modifies DataFrame values using non-NA values from another DataFrame. In contrast to SAS, the pandas update method performs an in-place update to the calling DataFrame, in this case the master DataFrame. The SAS UPDATE statement creates a new dataset and does not modify the master dataset.

Construct and print the master dataframe.

import pandas as pd
master = pd.DataFrame({'ID': ['023', '088', '099', '111'],
                   'Salary': [45650, 55350, 55100, 61625]})
print(master)

    ID  Salary
0  023   45650
1  088   55350
2  099   55100
3  111   61625

Contruct and print the trans dataframe.

import nump as np
trans = pd.DataFrame({'ID': ['023', '088', '099', '111', '121'],
           'Salary': [45650, 61000, 59100, 61625, 50000],
           'Bonus': [2000, np.NaN , np.NaN, 3000, np.NaN]})
print(trans)

    ID  Salary   Bonus
0  023   45650  2000.0
1  088   61000     NaN
2  099   59100     NaN
3  111   61625  3000.0
4  121   50000     NaN

Call the update method to update the master DataFrame values with those from the trans DataFrame and display the results.

master.update(trans)
print(master)

    ID  Salary
0  023   45650
1  088   61000
2  099   59100
3  111   61625

The default join operation for the update method is a left join explaining why row ID 121 is not in the updated master DataFrame. This row exists only in the trans DataFrame. This is also the explanation for why the Bonus column in the trans DataFrame is not updated for the master DataFrame.

Seems like we should be able to call the pandas update method with the join='outer' argument.

master.update(trans, join='outer')

NotImplementedError: Only left join is supported

The correct alternative is to call the merge method shown below.

df6 = pd.merge(master, trans,
               on='ID', how='outer',
               suffixes=('_Old','_Updated' ))
print(df6)

    ID  Salary_Old  Salary_Updated   Bonus
0  023     45650.0           45650  2000.0
1  088     55350.0           61000     NaN
2  099     55100.0           59100     NaN
3  111     61625.0           61625  3000.0
4  121         NaN           50000     NaN

The on='ID' argument uses the ID column common to the master and trans DataFrame as join keys. The how='outer' argument performs an outer join and suffixes=('_Old','_Updated' ) adds a suffix to liked-name columns in the DataFrames to disambiguate column contributions.

Conditional Update

There are cases when updates need to be applied conditionally. SAS users are accustomed to thinking in terms of IF/THEN/ELSE logic for conditional updates. To understand how the logic works with DataFrames we offer two examples.

The first example defines a Python function to calculate tax rates based on the Salary_Updated column in the df6 DataFrame. This example uses row iteration with Python if/else logic to calculate the new Taxes column.

Copy and the df6 DataFrame to df7, to be used in the second example and display the df6 DataFrame.

df7 = df6.copy()
print(df6)

    ID  Salary_Old  Salary_Updated   Bonus
0  023     45650.0           45650  2000.0
1  088     55350.0           61000     NaN
2  099     55100.0           59100     NaN
3  111     61625.0           61625  3000.0
4  121         NaN           50000     NaN

Define the calc_taxes function using if/else logic. To cascade a series of if statements use the elif statement following the first if statement.

def calc_taxes(row):
    if row['Salary_Updated'] <= 50000:
        val = row['Salary_Updated'] * .125
    else:
        val = row['Salary_Updated'] * .175
    return val

Call the apply method applying the calc_taxes function along the df6 columns using the the axis=1 argument.

df6['Taxes'] = df6.apply(calc_taxes, axis=1)
print(df6)

    ID  Salary_Old  Salary_Updated   Bonus      Taxes
0  023     45650.0           45650  2000.0   5706.250
1  088     55350.0           61000     NaN  10675.000
2  099     55100.0           59100     NaN  10342.500
3  111     61625.0           61625  3000.0  10784.375
4  121         NaN           50000     NaN   6250.000

The second approach is a better performing method using the loc indexer to apply calculated values to the DataFrame.

df7.loc[df7['Salary_Updated'] <= 50000, 'Taxes'] = df7.Salary_Updated * .125

df7.loc[df7['Salary_Updated'] >  50000, 'Taxes'] = df7.Salary_Updated * .175

print(df7)

    ID  Salary_Old  Salary_Updated   Bonus      Taxes
0  023     45650.0           45650  2000.0   5706.250
1  088     61000.0           61000     NaN  10675.000
2  099     59100.0           59100     NaN  10342.500
3  111     61625.0           61625  3000.0  10784.375
4  121         NaN           50000     NaN   6250.000

The two conditions are:

1. 12.5% tax rate on the Salary_Updated values less than or equal to 50,000 is:

df7.loc[df7['Salary_Updated'] <= 50000, 'Taxes'] = df7.Salary_Updated * .125

A way to read this statement is consider the syntax to the left of the comma (,) like a WHERE clause. The df7 DataFrame calls the loc indexer to slice rows where column df7['Salary_Updated'] values are less than or equal to 50,000.

To the right of the comma is the assignment when the condition is True; the values for the Taxes column (created on the DataFrame) are calculated at 12.5% of the value for the Salary_Updated column.

2. 17.5% tax rate on the Salary_Updated values greater than 50,000

df7.loc[df7['Salary_Updated'] >  50000, 'Taxes'] = df7.Salary_Updated * .175

works in a similar fashion.

The loc indexer creates a Boolean mask for the df7DataFrame and slices rows meeting the condition. For the condition, df7['Salary_Updated'] <= 50000, when the Boolean mask returns True values are multipled by .125. Likewise for the condition, df7['Salary_Updated'] > 50000 values are multiplied by 17.5.

Display the Boolean mask used to slice the df7 DataFrame conditionally.

nl = '\n'

print(nl,
      "Boolean Mask for 'Salary_Updated' <= 50000",
      nl,
      df7['Salary_Updated'] <= 50000,
      nl,
      "Boolean Mask for 'Salary_Updated' > 50000",
       nl,
      df7['Salary_Updated'] > 50000)

Boolean Mask for 'Salary_Updated' <= 50000
0    True
1    False
2    False
3    False
4    True
Name: Salary_Updated, dtype: bool
Boolean Mask for 'Salary_Updated' > 50000
0     False
1     True
2     True
3     True
4     False
Name: Salary_Updated, dtype: bool

SAS conditional update.

data calc_taxes;
   set new_pay;
if new_salary <= 50000 then
   taxes = new_salary * .125;

else taxes = new_salary * .175;
run;

proc print data=calc_taxes;
   id ID;
run;

PROC PRINT

Concatenation

The pandas library implements a concat method similar in behavior to the SAS SET statement. The concat method ‘glues’ DataFrames on row and column basis.

The concat method signature is:

pd.concat(objs, axis=0, join='outer', join_axes=None,
            ignore_index=False, keys=None, levels=None, 
            names=None, verify_integrity=False, 
            sort=None, copy=True)

Construct loc1, loc2, and loc3 DataFrames. Call the concat method to concatenate them constructing the all DataFrame.

loc1 = pd.DataFrame({'Onhand': [21, 79, 33, 81],
                      'Price': [17.99, 9.99, 21.00, .99]},
                       index = ['A0', 'A1', 'A2', 'A3'])

loc2 = pd.DataFrame({'Onhand': [12, 33, 233, 45],
                      'Price': [21.99, 18.00, .19, 23.99]},
                       index = ['A4', 'A5', 'A6', 'A7'])

loc3 = pd.DataFrame({'Onhand': [37, 50, 13, 88],
                      'Price': [9.99, 5.00, 22.19, 3.99]},
                       index = ['A8', 'A9', 'A10', 'A11'])

frames = [loc1, loc2, loc3]
all = pd.concat(frames)

print(all)

     Onhand  Price
A0       21  17.99
A1       79   9.99
A2       33  21.00
A3       81   0.99
A4       12  21.99
A5       33  18.00
A6      233   0.19
A7       45  23.99
A8       37   9.99
A9       50   5.00
A10      13  22.19
A11      88   3.99

The analog SAS program.

data loc1;
   length id $ 3;
   input id $
         onhand
         price;
list;
datalines;
A0 21 17.19
A1 79 9.99
A2 33 21
A3 81 .99
;;;;
run;

data loc2;
   length id $ 3;
    input id $
          onhand
          price;
list;
datalines;
A4 12 21.99
A5 33 18
A6 233 .19
A7 45 23.99
;;;;
run;

data loc3;
   length id $ 3;
   input id $
          onhand
          price;
list;
datalines;
A8 37 9.99
A9 50 5
A10 13 22.19
A11 88 3.99
;;;;
run;

data all;

   set loc1
       loc2
       loc3;
run;

PROC SQL UNION ALL set operator as an alternative to the SET statement creating the all table.

proc sql;
   create table all as
   select * from loc1
      union all
   select * from loc2
      union all
   select * from loc3;

select * from all;
quit;

PROC PRINT

The concat method is able to construct a MultiIndex by providing the keys= argument to form the outermost level.

Construct the all DataFrame by concatenating loc1, loc2, and loc3 DataFrames. Construct the MultiIndex with the keys= argument.

all = pd.concat(frames, keys=['Loc1', 'Loc2', 'Loc3'])
print(all)

          Onhand  Price
Loc1 A0       21  17.99
     A1       79   9.99
     A2       33  21.00
     A3       81   0.99
Loc2 A4       12  21.99
     A5       33  18.00
     A6      233   0.19
     A7       45  23.99
Loc3 A8       37   9.99
     A9       50   5.00
     A10      13  22.19
     A11      88   3.99

Call the loc indexer to slice rows where the outer-level index is 'Loc3'

all.loc['Loc3']

     Onhand  Price
A8       37   9.99
A9       50   5.00
A10      13  22.19
A11      88   3.99

An approximate SAS analog with the IN= dataset option.

data all;
length id $ 3;
   set loc1 (in=l1)
       loc2 (in=l2)
       loc3 (in=l3);
if l1 then location = 'Loc1';
if l2 then location = 'Loc2';
if l3 then location = 'Loc3';
run;

proc print data = all(where=(location='Loc3'));
   id id;
   var onhand 
       price;
run;

PROC PRINT

Append

Similar to PROC APPEND the pandas library provisions an append method as a convenience to the concat method. The append method syntax is more natural for SAS users.

all_parts = loc1.append([loc2, loc3])
print(all_parts)

     Onhand  Price
A0       21  17.99
A1       79   9.99
A2       33  21.00
A3       81   0.99
A4       12  21.99
A5       33  18.00
A6      233   0.19
A7       45  23.99
A8       37   9.99
A9       50   5.00
A10      13  22.19

Call PROC APPEND to append observations from a dataset to a base dataset. In cases where more than one dataset is appended, multiple calls are needed. In some cases appending is a better performer over the SET statement when appending smaller datasets to a larger ones.

proc append base = loc1
            data = loc2;
run;

proc append base = loc1
            data = loc3;
run;

proc print data=loc1;
run;

PROC PRINT

Find Column Min and Max

pandas min and max methods return the minimum and maximum column values respectively.

all_parts['Price'].max()

23.99

all_parts['Price'].min()

0.19

Slice the row containing the highest price.

all_parts[all_parts['Price']==all_parts['Price'].max()]

    Onhand  Price
A7      45  23.99

PROC SQL returning column min and max.

proc sql;
   select min(price) as Price_min
        , max(price) as Price_max
from loc1;
quit;

PROC PRINT

Sort

pandas sort_values method and PROC SORT use ascending as their default sort order.

Construct the display the df DataFrame.

df = pd.DataFrame({'ID': ['A0', 'A1', 'A2', 'A3', 'A4', '5A', '5B'],
                  'Age': [21, 79, 33, 81, np.NaN, 33, 33],
                  'Rank': [1, 2, 3, 3, 4, 5, 6]})
print(df)

   ID   Age  Rank
0  A0  21.0     1
1  A1  79.0     2
2  A2  33.0     3
3  A3  81.0     3
4  A4   NaN     4
5  5A  33.0     5
6  5B  33.0     6

Call the pandas sort_values method and display results. Pass a list of column labels to the by= argument to request a multi-key sort.

df.sort_values(by=['Age', 'Rank'])

   ID   Age  Rank
0  A0  21.0     1
2  A2  33.0     3
5  5A  33.0     5
6  5B  33.0     6
1  A1  79.0     2
3  A3  81.0     3
4  A4   NaN     4

Construct the df dataset, request a multi-key sort with by age rank; and display the output.

data df;
   length id $ 2;
   input  id $
          age
          rank;
list;
datalines;
A0 21 1
A1 79 2
A2 33 3
A3 81 3
A4 .  4
5A 33 5
5B 33 6
;;;;

proc sort data = df;
   by age rank;
run;

proc print data = df;
   id id;
run;

Dataset sorted by Age, Rank PROC PRINT

PROC SORT sorts missing values as the smallest numeric value. The default behavior for pandas sort_values method is to sort NaN’s as the largest numeric values. The sort_values method has the na_postion= argument with values 'first' or 'last' for placing NaN’s at the beginning or end of a DataFrame.

Sort NaN's to beginning of a DataFrame.

df.sort_values(by=['Age', 'Rank'], na_position='first')

   ID   Age  Rank
4  A4   NaN     4
0  A0  21.0     1
2  A2  33.0     3
5  5A  33.0     5
6  5B  33.0     6
1  A1  79.0     2
3  A3  81.0     3

For a multi-key sort using sort_values method, ascending or descending sort order is specified with the ascending= argument using True or False for the same number of values listed with the by= argument.

DataFrame multi-key sort ascending for the Age column and descending for the Rank column.

df.sort_values(by=['Age', 'Rank'], na_position='first', ascending = (True, False))

   ID   Age  Rank
4  A4   NaN     4
0  A0  21.0     1
6  5B  33.0     6
5  5A  33.0     5
2  A2  33.0     3
1  A1  79.0     2
3  A3  81.0     3

PROC SQL uses the DESCENDING keyword following a column name to indicate a descending sort order used with an ORDER BY clause.

proc sql;
   select * from df
   order by age
           ,rank descending;
quit;

Find Duplicates

In some cases data entry errors lead to duplicate data values or non-unique key values or duplicate rows. The duplicated method return Booleans indicating duplicate rows and, in this case, limited to duplicate values in the Age column. By default the duplicated method applies to all DataFrame columns.

print(df)

   ID   Age  Rank
0  A0  21.0     1
1  A1  79.0     2
2  A2  33.0     3
3  A3  81.0     3
4  A4   NaN     4
5  5A  33.0     5
6  5B  33.0     6

Create a Boolean mask to find all occurrences of duplicate values for Age excluding the first occurrence.

dup_mask = df.duplicated('Age', keep='first')
df_dups = df.loc[dup_mask]
print(df_dups)

   ID   Age  Rank
5  5A  33.0     5
6  5B  33.0     6

The statement:

dup_mask = df.duplicated('Age', keep='first')

defines a Boolean mask using the keep='first' argument for the duplicated attribute. The keep= argument can have three values:

first: Mark duplicates as True except for the first occurrence
last: Mark duplicates as False except of the last occurrence
false: Mark all duplicates as True

Drop Duplicates

The drop_duplicates method returns a de-duplicated DataFrame based on argument values.

Display the df DataFrame.

print(df)

0  A0  21.0     1
1  A1  79.0     2
2  A2  33.0     3
3  A3  81.0     3
4  A4   NaN     4
5  5A  33.0     5
6  5B  33.0     6

Call the drop_duplicates method and drop the rows where there are duplicate values for the Age column excluding the first occurence and display the new de_duped DataFrame.

df_deduped = df.drop_duplicates('Age', keep = 'first')
print(df_deduped)

   ID   Age  Rank
0  A0  21.0     1
1  A1  79.0     2
2  A2  33.0     3
3  A3  81.0     3
4  A4   NaN     4

The NODUPKEY and NODUPRECS options for PROC SORT detect and remove duplicate values. The NODUPKEY option removes observations with duplicate BY values. This is analogous to the pandas drop_duplicates method with a column argument.

The PROC SORT NODUPRECS is analogous to the drop_duplicates method with the keep=False argument. In each case, duplicate observations or rows, if found, are eliminated.

SAS Analog.

proc sort data = df nodupkey
   out = df_deduped
   dupout = df_dups;
   by age;

run;

proc print data = df;
   id id;
run;

proc print data = df_dups;
   id id;
run;

proc print data = df_deduped;
   id id;
run;

df Dataset left df_dups Dataset right

df_deduped Dataset right

Draw Samples

Construct and display the df DataFrame using a seed value to make the results repeatable.

np.random.seed(987654)
df = pd.DataFrame({'value': np  .random.randn(360)},
     index=pd.date_range('1970-01-01', freq='M', periods=360))
print(df.head(5))

               value
1970-01-31 -0.280936
1970-02-28 -1.292098
1970-03-31 -0.881673
1970-04-30  0.518407
1970-05-31  1.829087

Create the samp1 DataFrame calling the sample method drawing a random sample of 100 rows from the df DataFrame without replacement.

samp1 = df.sample(n= 100, replace=False)
samp1.head(5)

               value
1993-10-31 -0.471982
1998-04-30 -0.906632
1980-09-30 -0.313441
1986-07-31 -0.872584
1975-01-31  0.237037

Return the number of rows and columns for the samp1 DataFrame.

print(samp1.shape)

(100, 1)

SAS Analog.

data df;
   do date = '01Jan1970'd to '31Dec2000'd by 31;
       value = rand('NORMAL');
       output;
   end;
format date yymmdd10.;
run;

data samp1 (drop = k n);
   retain k 100 n;
   if _n_ = 1 then n = total;
   set df nobs=total;

   if ranuni(654321) <= k/n then do;
      output;
      k = k -1;
   end;
n = n -1;

if k = 0 then stop;
run;

pandas sample method has the frac= argument to include a portion, for example, 30% of the rows to be included in the sample.

samp2 = df.sample(frac=.3, replace=True)
print(samp2.head(5))

               value
1971-05-31  0.639097
1997-10-31  1.779798
1971-07-31  1.578456
1981-12-31  2.114340
1980-11-30  0.293887

Convert Types

pandas provisions the astype method to convert column types. The DataFrame dtypes attribute returns the column’s type.

Construct the df8 DataFrame and return the column types.

df8 = pd.DataFrame({'String': ['2', '4', '6', '8'],
                    'Ints'  : [1, 3, 5, 7]})
df8.dtypes

String    object
Ints       int64
dtype: object

Convert the String column type from string to float and the Ints column type from integers to strings.

df8['String'] = df8['String'].astype(float)
df8['Ints']  = df8['Ints'].astype(object)

Get the altered column types for the df8 DataFrame.

df8.dtypes

String    float64
Ints       object
dtype: object

Rename Columns

Call the rename method to relabel columns using a dictionary with a key as the existing column label and the value as the new column label.

df8.rename(columns={"String": "String_to_Float",
                    "Ints": "Ints_to_Object"},
inplace=True)
print(df8)

String_to_Float Ints_to_Object
0              2.0              1
1              4.0              3
2              6.0              5
3              8.0              7

Map Values

Mapping column values is similar to the features provided by PROC FORMAT. Calling the map function allows translation of column values to corresponding values.

dow = pd.DataFrame({"Day_Num":[1,2,3,4,5,6,7]})
dow_map={1:'Sunday',
         2:'Monday',
         3:'Tuesday',
         4:'Wednesday',
         5:'Thursday',
         6:'Friday',
         7:'Saturday'
         }

dow["Day_Name"] = dow["Day_Num"].map(dow_map)
print(dow)

   Day_Num   Day_Name
0        1     Sunday
1        2     Monday
2        3    Tuesday
3        4  Wednesday
4        5   Thursday
5        6     Friday
6        7   Saturday

Transpose

Transposing creates a DataFrame by restructuring the values in a DataFrame transposing columns into row and rows into columns. Calling the T property is a convenience for the transpose method. Calling without arguments performs an in-place transpose.

uni = {'School'   :  ['NCSU',   'UNC',         'Duke'],
       'Mascot'   :  ['Wolf',   'Ram',         'Devil'],
       'Students' :  [22751,     31981,         19610],
       'City'     :  ['Raleigh', 'Chapel Hill', 'Durham']}
df_uni = pd.DataFrame(data=uni)
print (df_uni)

  School Mascot  Students         City
0   NCSU   Wolf     22751      Raleigh
1    UNC    Ram     31981  Chapel Hill
2   Duke   Devil    19610       Durham

Return the column types for the df_uni DataFrame.

df_uni.dtypes

School      object
Mascot      object
Students     int64
City        object
dtype: object

Create the new t_df_uni Dataframe by copying the transformed values.

t_df_uni = df_uni.T
print(t_df_uni)

                0            1       2
School       NCSU          UNC    Duke
Mascot       Wolf          Ram   Devil
Students    22751        31981   19610
City      Raleigh  Chapel Hill  Durham

Observe how all column types in the t_df_uni are returned as strings.

t_df_uni.dtypes

0    object
1    object
2    object
dtype: object

Slice column 0 (zero) with the [ ] operator.

t_df_uni[0]

School         NCSU
Mascot         Wolf
Students      22751
City        Raleigh
Name: 0, dtype: object

The SAS analog calling PROC TRANSPOSE.

data df_uni;
   infile datalines dlm=',';
          length school  $ 4
                 mascot  $ 5
                city     $ 11;
         input school    $
                mascot   $
                students
                city     $;
list;
datalines;
NCSU, Wolf, 22751, Raleigh
UNC, Ram, 31981, Chapel Hill
Duke, Devil, 19610, Durham
;;;;
run;

proc print data = df_uni;
   id school;
run;

proc transpose data = df_uni
               out = t_df_uni;
var school 
     mascot 
     students 
     city;
run;

proc print data = t_df_uni;
run;

Input Dataset left

Transposed Dataset right

Display the transposed dataset variable types fetching the result set from the SAS DICTIONARY.COLUMNS table.

proc sql;
   select name
        , type
   from dictionary.columns
where libname='WORK' and memname='T_DF_UNI';
quit;

Transposed Dataset column types right