How To Create A Dataframe From Another Dataframe In Pandas
25. Accessing and Changing values of DataFrames
By Bernd Klein. Last modified: 23 Nov 2021.
On this page ➤
We have seen in the previous chapters of our tutorial many ways to create Series and DataFrames. We also learned how to access and replace complete columns. This chapter of our Pandas and Python tutorial will show various ways to access and change selectively values in Pandas DataFrames and Series. We will show ways how to change single value or values matching strings or regular expressions.
For this purpose we will learn to know the methods loc
, at
and replace
We will work with the following DataFrame structure in our examples:
import pandas as pd first = ( 'Mike' , 'Dorothee' , 'Tom' , 'Bill' , 'Pete' , 'Kate' ) last = ( 'Meyer' , 'Maier' , 'Meyer' , 'Mayer' , 'Meyr' , 'Mair' ) job = ( 'data analyst' , 'programmer' , 'computer scientist' , 'data scientist' , 'accountant' , 'psychiatrist' ) language = ( 'Python' , 'Perl' , 'Java' , 'Java' , 'Cobol' , 'Brainfuck' ) df = pd . DataFrame ( list ( zip ( last , job , language )), columns = [ 'last' , 'job' , 'language' ], index = first ) df
last | job | language | |
---|---|---|---|
Mike | Meyer | data analyst | Python |
Dorothee | Maier | programmer | Perl |
Tom | Meyer | computer scientist | Java |
Bill | Mayer | data scientist | Java |
Pete | Meyr | accountant | Cobol |
Kate | Mair | psychiatrist | Brainfuck |
Changing one value in DataFrame
Pandas provides two ways, i.e. loc
and at
, to access or change a single value of a DataFrame. We will experiment with the height of Bill in the following Python code:
# accessing the job of Bill: print ( df . loc [ 'Bill' , 'job' ]) # alternative way to access it with at: print ( df . at [ 'Bill' , 'job' ]) # setting the job of Bill to 'data analyst' with 'loc' df . loc [ 'Bill' , 'job' ] = 'data analyst' # let us check it: print ( df . loc [ 'Bill' , 'job' ]) # setting the job of Bill to 'computer scientist' with 'at' df . at [ 'Pete' , 'language' ] = 'Python'
OUTPUT:
data scientist data scientist data analyst
The following image shows what we have done:
You will ask yourself now which one you should use? The help on the at
method says the following: "Access a single value for a row/column label pair. Similar to loc
, in that both provide label-based lookups. Use at
if you only need to get or set a single value in a DataFrame or Series." loc
on the other hand can be used to access a single value but also to access a group of rows and columns by a label or labels.
Another intestering question is about the speed of both methods in comparison. We will measure the time behaviour in the following code examples:
% timeit df.loc['Bill', 'language'] = 'Python'
OUTPUT:
129 µs ± 2.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
% timeit df.at['Bill', 'language'] = 'Python'
OUTPUT:
4.69 µs ± 209 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
When it comes to speed the answer is clear: we should definitely use at
.
replace
The syntax of replace
:
replace(self, to_replace=None, value=None, inplace=False, limit=None, regex=False, method='pad')
This method replaces values given in to_replace
with value
. This differs from updating with .loc
or .iloc
, which requires you to specify a location to update with some value. With replace it is possible to replace values in a Series or DataFrame without knowing where they occur.
replace
works both with Series and DataFrames.
We will explain the way of working of the method replace
by discussing the different data types of the parameter to_replace
individually:
to_replace can be numeric value, string or regular expression:
Data Type | processing |
---|---|
numeric | numeric values in the DataFrame (or Series) which are equal to to_replace will be replaced with parameter value |
str | If the parameter to_replace consists of a string, all the strings of the DataFrame (or Series) which exactly match this string will be replaced by the value of the parameter value |
regex | All strings inside of the DataFrame (or Series) which match the regular expression of to_replace will be replaced with value |
Let's start with a Series example. We will change one value into another one. Be aware of the fact that replace
by default creates a copy of the object in which all the values are replaced. This means that the parameter inplace
is set to False
by default.
s = pd . Series ([ 27 , 33 , 13 , 19 ]) s . replace ( 13 , 42 )
OUTPUT:
0 27 1 33 2 42 3 19 dtype: int64
If we really want to change the object s
is referencing, we should set the inplace
parameter to True
:
s = pd . Series ([ 27 , 33 , 13 , 19 ]) s . replace ( 13 , 42 , inplace = True ) s
OUTPUT:
0 27 1 33 2 42 3 19 dtype: int64
We can also change multiple values into one single value, as you can see in the following example.
s = pd . Series ([ 0 , 1 , 2 , 3 , 4 ]) s . replace ([ 0 , 1 , 2 ], 42 , inplace = True ) s
OUTPUT:
0 42 1 42 2 42 3 3 4 4 dtype: int64
We will show now, how replace can be used on DataFrames. For this purpose we will recreate the example from the beginning of this chapter of our tutorial. It will not be exactly the same though. More people have learned Python now, but unfortunately somebody has put in some spelling errors in the word Python. Pete has become a programmer now:
import pandas as pd first = ( 'Mike' , 'Dorothee' , 'Tom' , 'Bill' , 'Pete' , 'Kate' ) last = ( 'Meyer' , 'Maier' , 'Meyer' , 'Mayer' , 'Meyr' , 'Mair' ) job = ( 'data analyst' , 'programmer' , 'computer scientist' , 'data scientist' , 'programmer' , 'psychiatrist' ) language = ( 'Python' , 'Perl' , 'Java' , 'Pithon' , 'Pythen' , 'Brainfuck' ) df = pd . DataFrame ( list ( zip ( last , job , language )), columns = [ 'last' , 'job' , 'language' ], index = first ) df
last | job | language | |
---|---|---|---|
Mike | Meyer | data analyst | Python |
Dorothee | Maier | programmer | Perl |
Tom | Meyer | computer scientist | Java |
Bill | Mayer | data scientist | Pithon |
Pete | Meyr | programmer | Pythen |
Kate | Mair | psychiatrist | Brainfuck |
Both ambitious Dorothee and enthusiastic Pete have finished a degree in Computer Science in the meantime. So we have to change our DataFrame accodingly:
df . replace ( "programmer" , "computer scientist" , inplace = True ) df
last | job | language | |
---|---|---|---|
Mike | Meyer | data analyst | Python |
Dorothee | Maier | computer scientist | Perl |
Tom | Meyer | computer scientist | Java |
Bill | Mayer | data scientist | Pithon |
Pete | Meyr | computer scientist | Pythen |
Kate | Mair | psychiatrist | Brainfuck |
to_replace can be a list consisting of str, regex or numeric objects:
to_replace
can be a list consisting of str, regex or numeric objects. If both the values of to_replace
and value
are both lists, they must be the same length.
First, if to_replace
and value
are both lists, they must be the same length. Second, if regex=True
then all of the strings in both lists will be interpreted as regexs otherwise they will match directly. We slightly changed our DataFrame for the next example. The first names build a column now!
df = pd . DataFrame ( list ( zip ( first , last , job , language )), columns = [ 'first' , 'last' , 'job' , 'language' ]) df
first | last | job | language | |
---|---|---|---|---|
0 | Mike | Meyer | data analyst | Python |
1 | Dorothee | Maier | programmer | Perl |
2 | Tom | Meyer | computer scientist | Java |
3 | Bill | Mayer | data scientist | Pithon |
4 | Pete | Meyr | programmer | Pythen |
5 | Kate | Mair | psychiatrist | Brainfuck |
'Mike' and 'Tom' are abbreviations of the real first names, so we will change them now. We are also glad to announce that Dorothee finally migrated from Perl to Python:
df . replace ( to_replace = [ 'Mike' , 'Tom' , 'Perl' ], value = [ 'Michael' , 'Thomas' , 'Python' ], inplace = True ) df
first | last | job | language | |
---|---|---|---|---|
0 | Michael | Meyer | data analyst | Python |
1 | Dorothee | Maier | programmer | Python |
2 | Thomas | Meyer | computer scientist | Java |
3 | Bill | Mayer | data scientist | Pithon |
4 | Pete | Meyr | programmer | Pythen |
5 | Kate | Mair | psychiatrist | Brainfuck |
Using Regular Expressions
Now it is time to do something about the surnames in our DataFrame. All these names sound the same - at least for German speakers. Let us assume that we just found out that they should all be spelled the same, i.e. Mayer. Now the time has come for regular expression. The parameter to_replace
can also be used with regular expressions. We will use a regular expression to unify the surnames. We will also fix the misspellings of Python:
df . replace ( to_replace = [ r 'M[ea][iy]e?r' , r 'P[iy]th[eo]n' ], value = [ 'Mayer' , 'Python' ], regex = True , inplace = True ) df
first | last | job | language | |
---|---|---|---|---|
0 | Michael | Mayer | data analyst | Python |
1 | Dorothee | Mayer | programmer | Python |
2 | Thomas | Mayer | computer scientist | Java |
3 | Bill | Mayer | data scientist | Python |
4 | Pete | Mayer | programmer | Python |
5 | Kate | Mayer | psychiatrist | Brainfuck |
to_replace can be a dict
Dictionaries can be used to specify different replacement values for different existing values. For example, {'a': 'b', 'y': 'z'}
replaces the value 'a' with 'b' and 'y' with 'z'. To use a dict in this way the value
parameter should be None
.
Ìf replace
is applied on a DataFrame, a dict can specify that different values should be replaced in different columns. For example, {'a': 1, 'b': 'z'}
looks for the value 1 in column 'a' and the value 'z' in column 'b' and replaces these values with whatever is specified in value
. The value
parameter should not be None
in this case. You can treat this as a special case of passing two lists except that you are specifying the column to search in.
If we use nested dictionaries like {'A': {'foo': 'bar'}}
, they are interpreted like this: look in column 'A' for the value 'foo' and replace it with 'bar'. The value
parameter has to be None
in this case.
df = pd . DataFrame ({ 'A' : [ 0 , 1 , 2 , 3 , 4 ], 'B' : [ 'foo' , 'bar' , 'bloo' , 'blee' , 'bloo' ], 'C' : [ 'green' , 'red' , 'blue' , 'yellow' , 'green' ]}) df . replace ( to_replace = { "A" : { 0 : 42 , 3 : 33 }, 'B' : { 'bloo' : 'vloo' }}, inplace = True ) df
A | B | C | |
---|---|---|---|
0 | 42 | foo | green |
1 | 1 | bar | red |
2 | 2 | vloo | blue |
3 | 33 | blee | yellow |
4 | 4 | vloo | green |
The method parameter of replace:
When the parameter value
is None
and the parameter to_replace
is a scalar, list or tuple, the method replace
will use the parameter method
to decide which replacement to perform. The possible values for method
are pad
, ffill
, bfill
, None
. The default value is pad
.
First of all, you have to know that pad
and ffill
are equivalent.
We will explain the way of working with the following Python example code. We will replace the values NN
and LN
of the following DataFrame in different ways by using the parameter method
.
import pandas as pd df = pd . DataFrame ({ 'name' :[ 'Ben' , 'Kate' , 'Agnes' , 'Ashleigh' , 'Tom' ], 'job' :[ 'programmer' , 'NN' , 'NN' , 'engineer' , 'teacher' ], 'language' :[ 'Java' , 'Python' , 'LN' , 'LN' , 'C' ]})
We will start with changing the 'NN' values by using ffill
in the parameter method
:
The picture above shows how the following call to replace
. In words: Every occurrence of NN
in the job
column will have to be replaced. The value is defined by the parameter method
. ffill
means 'forward fill', i.e. we will take the preceding value to the first NN
as the fill value. This is why we will replace the values by 'programmer':
# method is backfill df . replace ( to_replace = 'NN' , value = None , method = 'ffill' )
name | job | language | |
---|---|---|---|
0 | Ben | programmer | Java |
1 | Kate | programmer | Python |
2 | Agnes | programmer | LN |
3 | Ashleigh | engineer | LN |
4 | Tom | teacher | C |
Instead of using a single value for to_replace
we can also use a list or tuple. We will replace in the following all occurrences of NN
and LN
accordingly, as we can see in the following picture:
df . replace ( to_replace = [ 'NN' , 'LN' ], value = None , method = 'ffill' )
name | job | language | |
---|---|---|---|
0 | Ben | programmer | Java |
1 | Kate | programmer | Python |
2 | Agnes | programmer | Python |
3 | Ashleigh | engineer | Python |
4 | Tom | teacher | C |
We will show now what happens, if we use bfill
(backward fill). Now the occurrences of 'LN' become 'C' instead of Python. We also turn the 'NN's into engineers instead of programmers.
df . replace ([ 'NN' , 'LN' ], value = None , method = 'bfill' )
name | job | language | |
---|---|---|---|
0 | Ben | programmer | Java |
1 | Kate | engineer | Python |
2 | Agnes | engineer | C |
3 | Ashleigh | engineer | C |
4 | Tom | teacher | C |
df . replace ( 'NN' , value = None , inplace = True , method = 'bfill' ) df . replace ( 'LN' , value = None , inplace = True , method = 'ffill' ) df
name | job | language | |
---|---|---|---|
0 | Ben | programmer | Java |
1 | Kate | engineer | Python |
2 | Agnes | engineer | Python |
3 | Ashleigh | engineer | Python |
4 | Tom | teacher | C |
Other Examples
We will present some more examples, which are taken from the help file of loc
:
df = pd . DataFrame ([[ 1 , 2 ], [ 4 , 5 ], [ 7 , 8 ]], index = [ 'cobra' , 'viper' , 'sidewinder' ], columns = [ 'max_speed' , 'shield' ]) df
max_speed | shield | |
---|---|---|
cobra | 1 | 2 |
viper | 4 | 5 |
sidewinder | 7 | 8 |
Single label. Note this returns the row as a Series:
OUTPUT:
max_speed 4 shield 5 Name: viper, dtype: int64
List of labels. Note using [[]]
returns a DataFrame:
df . loc [[ 'viper' , 'sidewinder' ]]
max_speed | shield | |
---|---|---|
viper | 4 | 5 |
sidewinder | 7 | 8 |
Single label for row and column
df . loc [ 'cobra' , 'shield' ]
OUTPUT:
Slice with labels for row and single label for column. As mentioned above, note that both the start and stop of the slice are included.
df . loc [ 'cobra' : 'viper' , 'max_speed' ]
OUTPUT:
cobra 1 viper 4 Name: max_speed, dtype: int64
Boolean list with the same length as the row axis
df . loc [[ False , False , True ]]
max_speed | shield | |
---|---|---|
sidewinder | 7 | 8 |
Conditional that returns a boolean Series
max_speed | shield | |
---|---|---|
sidewinder | 7 | 8 |
Conditional that returns a boolean Series with column labels specified
df . loc [ df [ 'shield' ] > 6 , [ 'max_speed' ]]
Callable that returns a boolean Series
df . loc [ lambda df : df [ 'shield' ] == 8 ]
max_speed | shield | |
---|---|---|
sidewinder | 7 | 8 |
Set value for all items matching the list of labels
df . loc [[ 'viper' , 'sidewinder' ], [ 'shield' ]] = 50 df
max_speed | shield | |
---|---|---|
cobra | 1 | 2 |
viper | 4 | 50 |
sidewinder | 7 | 50 |
Set value for an entire row
max_speed | shield | |
---|---|---|
cobra | 10 | 10 |
viper | 4 | 50 |
sidewinder | 7 | 50 |
Set value for an entire column
df . loc [:, 'max_speed' ] = 30 df
max_speed | shield | |
---|---|---|
cobra | 30 | 10 |
viper | 30 | 50 |
sidewinder | 30 | 50 |
Set value for rows matching callable condition
df . loc [ df [ 'shield' ] > 35 ] = 0 df
max_speed | shield | |
---|---|---|
cobra | 30 | 10 |
viper | 0 | 0 |
sidewinder | 0 | 0 |
How To Create A Dataframe From Another Dataframe In Pandas
Source: https://python-course.eu/numerical-programming/accessing-and-changing-values-dataframes.php
Posted by: ozunaweland.blogspot.com
0 Response to "How To Create A Dataframe From Another Dataframe In Pandas"
Post a Comment